Many initiatives for information access came together recently with the announcement of the UK Web Archive (UKWA; www.webarchive.org.uk). The British Library (BL) will store and make accessible every site in the .uk top-level domain (www.bl.uk/aboutus/stratpolprog/digi/webarch/index.html). To do this, The British Library is working with IBM's BigSheets (www.ibm.com/software/ebusiness/jstart/bigsheets/index.html), which consists of an array of text-mining and analysis software, including Hadoop, Nutch, Pig Latin, Open Calais, and IBM's own InfoSphere classification tool, all to be presented with a user-friendly interface. The combination of hundreds of millions of pages and a suite of powerful tools can benefit research in topics from linguistics to epidemiology to market research.
For The British Library, archiving all the websites in the .uk top-level domain is a natural extension of its role as the main U.K. Legal Deposit Library. In an arrangement like that of the U.S. Library of Congress, publishers of books in the U.K. and Ireland must send a copy of every physical item published in the U.K. to The British Library. However, issues of copyright regulation in the U.K. are not settled, and in the past The British Library has archived only the 6,000 sites that gave permission.
With this new project, the UK Web Archive is taking a wider approach, similar to that of search engines from the mid-1990s, Lycos to Google, and the U.S.-based Internet Archive (www.archive.org). The UK Web Archive will be crawling and indexing the U.K. portion of the web (with domain names ending in ".uk"). It has announced that it will follow a site's preferences in the robots.txt and robots meta tags (www.robotstxt.org), which means that any site or page that indicates it should not be indexed will not be archived by this system.
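As a rough illustration of what honoring robots.txt involves, the following minimal sketch uses Python's standard urllib.robotparser module. The robots.txt content, the "example-archiver" user agent, and the example.co.uk URLs are all hypothetical, and this is not a description of the UKWA's actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; not taken from the UKWA.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for page in ("http://www.example.co.uk/news/index.html",
                 "http://www.example.co.uk/private/report.html"):
        print(page, may_archive(ROBOTS_TXT, "example-archiver", page))
```

A crawler built this way would skip any page for which the check returns False; robots meta tags inside individual pages would need a separate check that is not shown here.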
The result will be a sophisticated archive of British online activity over time, tracking changes to webpages and even sites that have been removed or replaced by other uses. We're "helping to avoid the creation of a 'digital black hole' in the nation's memory," said Lynne Brindley, CEO at The British Library. The UK Web Archive has launched with archived websites organized into several key events, such as the Credit Crunch (including pages from sites which have since been closed, such as Woolworths and Zavvi). The current example is the U.K. 2005 Elections area (www.webarchive.org.uk/analytics/analytics.htm) with its tag cloud (see screen example).
To prepare more special areas and perform dynamic research, the UKWA and IBM plan to implement "text mining," which is like data mining in databases, except that instead of crunching numbers, the software attempts to crunch text and extract the best nuggets. This project will include processes for classifying pages into categories, extracting entities (people, places, or things) as metadata, and offering several approaches to querying and visualizing data. For a good overview, see "What Is Text Mining" (http://people.ischool.berkeley.edu/~hearst/text-mining.html) by Marti Hearst of the University of California-Berkeley School of Information.
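To make classification and entity extraction concrete, here is a deliberately naive sketch. The keyword lists and the capitalized-words rule are assumptions invented for illustration; real tools such as InfoSphere or Open Calais use far more sophisticated methods that are not shown here.

```python
import re

# Hypothetical category keywords, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "finance": {"bank", "credit", "mortgage", "shares"},
    "politics": {"election", "parliament", "candidate", "vote"},
}


def classify(text: str) -> str:
    """Pick the category whose keywords occur most often, or 'unknown'."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        category: sum(word in keywords for word in words)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract_entities(text: str) -> list:
    """Naive entity guess: runs of two or more capitalized words."""
    return re.findall(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b", text)


if __name__ == "__main__":
    sample = "Lynne Brindley spoke about the credit crunch and every bank."
    print(classify(sample), extract_entities(sample))
```

The category and the extracted names would then be stored as metadata alongside each archived page, which is what makes later querying and visualization possible.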
Susan Feldman of IDC, who has been analyzing the issues of unstructured information management for many years, says of BigSheets and the UKWA, "Visualization of it can be extremely valuable to users who are snowed under with so much information that they have trouble picking out trends, major entities, etc."
IBM has decades of experience in text mining, and it is applying that experience here. The BigSheets interface uses the spreadsheet approach, showing categories and keywords assigned to each site. These can be treated much like any text spreadsheet, so researchers could drill down and view sites in the same category, sites that link to each other, or sites that use the same vocabulary. Then, they would be able to pivot the data set, looking at aspects such as word frequencies and relationships. BigSheets will also offer additional ways to view the information using pie charts, tag clouds, bubble charts, and other visualization techniques.
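The drill-down and pivot operations described above might look something like this in miniature. The rows, site names, and function names are hypothetical stand-ins for a sheet of categorized sites; this is not the BigSheets interface or its API.

```python
from collections import Counter, defaultdict

# Hypothetical rows standing in for a sheet of sites with categories and keywords.
ROWS = [
    {"site": "a.example.co.uk", "category": "retail", "keywords": ["sale", "closing", "store"]},
    {"site": "b.example.co.uk", "category": "retail", "keywords": ["store", "sale"]},
    {"site": "c.example.org.uk", "category": "finance", "keywords": ["credit", "bank"]},
]


def sites_by_category(rows):
    """Drill down: group site names under their assigned category."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["category"]].append(row["site"])
    return dict(grouped)


def keyword_frequencies(rows, category):
    """Pivot: count how often each keyword appears within one category."""
    counts = Counter()
    for row in rows:
        if row["category"] == category:
            counts.update(row["keywords"])
    return counts


if __name__ == "__main__":
    print(sites_by_category(ROWS))
    print(keyword_frequencies(ROWS, "retail").most_common())
```

Counts like these are the raw material for a tag cloud or bubble chart, where the size of each term reflects its frequency.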
"IBM BigSheets does for big data what spreadsheets did for personal computing," said Rod Smith, vice president, emerging internet technologies, IBM. "Within a matter of minutes, researchers, academics, and students around the world will be able to use a standard Web browser to search five terabytes of Web pages from the UK domain, analyze the results and effortlessly visualize the results of the search."
The main technology used in BigSheets is a set of complementary open source software projects collected and licensed by the Apache Software Foundation (http://apache.org). The core is Hadoop (http://hadoop.apache.org), which is a data storage system that can scale to billions of items with less required structure and space than a relational database system. Hadoop simplifies the IT management of the data, making it much easier to handle large amounts of traffic using parallel processing, easy addition of new servers, replication, fail-over, and load balancing. Hadoop powers much of the Amazon Cloud, ImageShack, Last.fm charts, Quantcast, and large parts of Facebook and LinkedIn.
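Hadoop's parallel processing follows the MapReduce model, in which work is split into a map step, a grouping step, and a reduce step. The sketch below imitates that model in a single Python process, counting pages per host; the function names and URLs are illustrative assumptions, and it does not use Hadoop's own APIs.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_phase(url):
    """Map: emit one (host, 1) pair for each crawled URL."""
    yield urlparse(url).netloc, 1


def shuffle(pairs):
    """Shuffle: gather all values emitted under the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a total."""
    return key, sum(values)


def count_pages_per_host(urls):
    pairs = (pair for url in urls for pair in map_phase(url))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    crawled = [
        "http://www.example.co.uk/a",
        "http://www.example.co.uk/b",
        "http://shop.example.org.uk/",
    ]
    print(count_pages_per_host(crawled))
```

Because each map call and each reduce call depends only on its own input, a real cluster can run them on many servers at once, which is why adding machines adds capacity.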
The BigSheets UK Web Archive project is using the query language Pig Latin, which is designed to work with Hadoop, and so it has some features beyond the SQL (Structured Query Language) that's commonly used with databases. These queries are for classification, entity extraction, and other text-mining tasks, and the Pig/Hadoop combination will automatically deal with scaling and directing the query and compiling the results. The project was named Pig (http://hadoop.apache.org/pig) so that the name could be combined with Latin; open source people tend to be as amused by cute names as librarians are.
Another big part of the UKWA project is crawling the websites to index pages and check for changes. The open source crawler Nutch (http://lucene.apache.org/nutch) was designed for web-scale crawling and can handle millions of URLs efficiently. Other technologies include the ManyEyes visualization tools, Open Calais (a freely available entity extraction system from Thomson Reuters), and IBM's own InfoSphere Classification Module, which helps organizations organize unstructured data using pattern matching and extraction.
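One simple way a crawler can check for changes is to keep a fingerprint of each page and compare it on the next visit. The sketch below assumes that approach, with hypothetical URLs and function names; it is not a description of how Nutch itself detects changes.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest that changes whenever the page bytes change."""
    return hashlib.sha256(content).hexdigest()


def detect_changes(previous: dict, fetched: dict):
    """Compare stored fingerprints (url -> digest) with fresh content (url -> bytes).

    Returns three sorted lists: new URLs, changed URLs, and removed URLs.
    """
    new = sorted(url for url in fetched if url not in previous)
    changed = sorted(
        url for url in fetched
        if url in previous and fingerprint(fetched[url]) != previous[url]
    )
    removed = sorted(url for url in previous if url not in fetched)
    return new, changed, removed


if __name__ == "__main__":
    stored = {
        "http://www.example.co.uk/": fingerprint(b"old home page"),
        "http://www.example.co.uk/shop": fingerprint(b"shop"),
    }
    latest = {
        "http://www.example.co.uk/": b"new home page",
        "http://www.example.co.uk/news": b"news",
    }
    print(detect_changes(stored, latest))
```

An archive would store a new snapshot for each URL in the changed list and keep the last snapshot of each removed URL, which is how pages from closed sites remain available.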
Uses for this technology range from simple customer service, such as finding out why students did not use streaming video (doi:10.1111/j.1467-8535.2009.00980.x), to complex legal research, pharmaceutical clinical trials analysis, and mergers and acquisitions due diligence. These tools can even reveal indirect connections, the most famous being the discovery that fish oil may be a good treatment for Raynaud's disease; this has become the field of literature-based discovery (LBD, overview at doi:10.1.1.77.6842).
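The idea behind such indirect connections can be sketched simply: if term A appears with term B in some articles, and B appears with C in others, but A and C never appear together, then A and C may be worth investigating. The toy data below is invented for illustration and the function names are hypothetical; real literature-based discovery systems work on large bibliographic databases with much more careful filtering.

```python
# Toy co-occurrence data, invented for illustration: each set lists terms
# that appear together in one hypothetical article.
ARTICLES = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynaud's disease"},
    {"platelet aggregation", "raynaud's disease"},
    {"fish oil", "omega-3"},
]


def neighbours(term, articles):
    """Return every term that shares at least one article with the given term."""
    linked = set()
    for terms in articles:
        if term in terms:
            linked |= terms - {term}
    return linked


def indirect_links(start, articles):
    """Map each term not directly linked to start onto the bridging terms between them."""
    direct = neighbours(start, articles)
    found = {}
    for bridge in direct:
        for target in neighbours(bridge, articles):
            if target != start and target not in direct:
                found.setdefault(target, set()).add(bridge)
    return found


if __name__ == "__main__":
    print(indirect_links("fish oil", ARTICLES))
```

In this toy data no article mentions fish oil and Raynaud's disease together, yet two bridging terms connect them, which is the kind of lead a researcher would then examine by hand.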
Hearst has this to say about the UK Web Archive and BigSheets approach: "The critical thing for this kind of tool is how accurate is the text analysis and how easy is it to do the exploratory analysis." She added, "IBM has been developing advanced tools for text analytics for years (both in terms of algorithms and scalability) and probably [has] one of the best around."