British Library and IBM Team Up on Web Archiving Project
by Avi Rappoport
Posted On March 8, 2010

Many initiatives for information access came together recently with the announcement of the UK Web Archive (UKWA; www.webarchive.org.uk). The British Library (BL) will store and make accessible every site in the .uk top-level domain (www.bl.uk/aboutus/stratpolprog/digi/webarch/index.html). To do this, The British Library is working with IBM's BigSheets (www.ibm.com/software/ebusiness/jstart/bigsheets/index.html), which wraps an array of text-mining and analysis software, including Hadoop, Nutch, Pig Latin, Open Calais, and IBM's own InfoSphere classification tool, in a user-friendly interface. The combination of hundreds of millions of pages and a suite of powerful tools can benefit research in topics ranging from linguistics to epidemiology to market research.

For The British Library, archiving all the websites in the .uk top-level domain is a natural extension of its role as the main U.K. legal deposit library. Much as U.S. publishers deposit copies with the Library of Congress, publishers in the U.K. and Ireland must send The British Library a copy of every physical item they publish. However, copyright regulation for websites in the U.K. is not yet settled, and until now The British Library has archived only about 6,000 sites that gave explicit permission.

With this new project, the UK Web Archive is taking a wider approach, similar to that of the search engines that emerged from the mid-1990s onward, from Lycos to Google, and of the U.S.-based Internet Archive (www.archive.org). The UK Web Archive will crawl and index the U.K. portion of the web (domain names ending in ".uk"). It has announced that it will honor a site's preferences as expressed in robots.txt files and robots meta tags (www.robotstxt.org), so any site or page that indicates it should not be indexed will not be archived by this system.
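Python's standard library includes a robots.txt parser, so a sketch of the exclusion check an archiving crawler must perform before each fetch might look like this (the user-agent name, URLs, and robots.txt rules are all hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt forbidding one directory for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The crawler checks each URL against the site's stated preferences.
print(parser.can_fetch("UKWA-Crawler", "http://example.uk/index.html"))  # True
print(parser.can_fetch("UKWA-Crawler", "http://example.uk/private/x"))   # False
```

Pages excluded this way would simply never enter the archive, which is how the UKWA says it will respect site owners' wishes.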

The result will be a sophisticated archive of British online activity over time, tracking changes to webpages and even sites that have been removed or replaced by other content. We're "helping to avoid the creation of a ‘digital black hole’ in the nation's memory," said Lynne Brindley, CEO of The British Library. The UK Web Archive has launched with archived websites organized around several key events, such as the Credit Crunch (including pages from sites that have since closed, such as Woolworths and Zavvi). A current example is the U.K. 2005 Elections area (www.webarchive.org.uk/analytics/analytics.htm) with its tag cloud.

To prepare more special areas and to support dynamic research, the UKWA and IBM plan to implement text mining, which is like data mining in databases except that, instead of crunching numbers, the software attempts to crunch text and extract the best nuggets. The project will include processes for classifying pages into categories, extracting entities (people, places, or things) as metadata, and offering several approaches to querying and visualizing data. For a good overview, see "What Is Text Mining" (http://people.ischool.berkeley.edu/~hearst/text-mining.html) by Marti Hearst of the University of California-Berkeley School of Information.
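As an illustration of entity extraction in its very simplest form, the toy sketch below treats runs of capitalized words as candidate entities. Production systems such as Open Calais rely on trained statistical models rather than a single regular expression, so this is only a sketch of the idea:

```python
import re

# Toy rule: a run of capitalized words is a candidate named entity.
ENTITY_RE = re.compile(r"(?:[A-Z][A-Za-z]*\s)*[A-Z][A-Za-z]*")

def extract_entities(text):
    # Collect every maximal run of capitalized words in the text.
    return [m.group() for m in ENTITY_RE.finditer(text)]

text = ("The British Library is working with IBM to archive "
        "websites from the United Kingdom.")
print(extract_entities(text))
# ['The British Library', 'IBM', 'United Kingdom']
```

Real extractors add dictionaries, part-of-speech evidence, and disambiguation on top of patterns like this, but the output shape, entities pulled out of running text as metadata, is the same.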

Susan Feldman of IDC, who has been analyzing the issues of unstructured information management for many years, says of BigSheets and the UKWA, "Visualization of it can be extremely valuable to users who are snowed under with so much information that they have trouble picking out trends, major entities, etc."

IBM has decades of experience in text mining, and it is applying that experience here. The BigSheets interface takes a spreadsheet approach, showing the categories and keywords assigned to each site. These can be manipulated much like any spreadsheet, so researchers can drill down to view sites in the same category, sites that link to each other, or sites that use the same vocabulary. They can then pivot the data set to examine aspects such as word frequencies and relationships. BigSheets will also offer additional ways to view the information, using pie charts, tag clouds, bubble charts, and other visualization techniques.
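The pivot idea can be sketched in a few lines: treat each archived page as a row in a table and aggregate word frequencies per category. The rows below are invented for illustration; BigSheets does this at web scale, but the spreadsheet-style operation is the same:

```python
from collections import Counter, defaultdict

# Hypothetical rows, one per archived page, as BigSheets might tabulate them.
rows = [
    {"site": "bank.example.uk", "category": "finance", "text": "credit crunch loans credit"},
    {"site": "shop.example.uk", "category": "retail",  "text": "sale closing down sale"},
    {"site": "news.example.uk", "category": "finance", "text": "banks credit crisis"},
]

# "Pivot" the table: aggregate word frequencies per category.
freq_by_category = defaultdict(Counter)
for row in rows:
    freq_by_category[row["category"]].update(row["text"].split())

for category, counts in sorted(freq_by_category.items()):
    print(category, counts.most_common(2))
```

The resulting per-category counts are exactly what feeds a tag cloud or bubble chart view.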

"IBM BigSheets does for big data what spreadsheets did for personal computing," said Rod Smith, vice president, emerging internet technologies, IBM. "Within a matter of minutes, researchers, academics, and students around the world will be able to use a standard Web browser to search five terabytes of Web pages from the UK domain, analyze the results and effortlessly visualize the results of the search."

The main technology used in BigSheets is a set of complementary open source software projects developed under the Apache Software Foundation (http://apache.org). The core is Hadoop (http://hadoop.apache.org), a distributed storage and processing framework that can scale to billions of items with less required structure and overhead than a relational database system. Hadoop simplifies IT management of the data, making it much easier to handle large workloads through parallel processing, easy addition of new servers, replication, failover, and load balancing. Hadoop powers much of the Amazon cloud, ImageShack, the Last.fm charts, Quantcast, and large parts of Facebook and LinkedIn.
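Hadoop's processing side is built around the MapReduce model. A single-process sketch of the canonical word-count example shows the map, shuffle, and reduce phases that a Hadoop cluster would distribute across many servers:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each document independently emits (word, 1) pairs;
# on a cluster, these map tasks run in parallel on many nodes.
def map_doc(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group the emitted pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the grouped values for each key.
def reduce_counts(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["the web archive", "the uk web"]
counts = reduce_counts(shuffle(chain.from_iterable(map_doc(d) for d in docs)))
print(counts)  # {'the': 2, 'web': 2, 'archive': 1, 'uk': 1}
```

Because the map tasks never talk to each other, adding servers adds throughput almost linearly, which is what lets the same pattern count words across five terabytes of archived pages.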

The BigSheets UK Web Archive project uses the query language Pig Latin, which was designed to work with Hadoop and so has features beyond the SQL (Structured Query Language) commonly used with relational databases. These queries handle classification, entity extraction, and other text-mining tasks, and the Pig/Hadoop combination automatically takes care of scaling, distributing the query, and compiling the results. The project was named Pig (http://hadoop.apache.org/pig) so that its language could be called Pig Latin; open source developers are as amused by cute names as librarians are.

Another big part of the UKWA project is crawling the websites to index pages and check for changes. The open source crawler Nutch (http://lucene.apache.org/nutch) was designed for web scale and can handle millions of URLs efficiently. Other technologies include the ManyEyes visualization tools; Open Calais, an entity extraction service from Thomson Reuters with a free usage tier; and IBM's own InfoSphere Classification Module, which helps organizations categorize unstructured data using pattern matching and extraction.
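The heart of any crawler, Nutch included, is a frontier queue of discovered-but-unfetched URLs plus a set of already-seen URLs so each page is fetched only once. The sketch below simulates that loop over a tiny in-memory "web" (all URLs are hypothetical) rather than making real network requests:

```python
from collections import deque

# A tiny in-memory "web": page -> list of outgoing links (hypothetical).
WEB = {
    "http://a.example.uk/": ["http://b.example.uk/", "http://a.example.uk/about"],
    "http://b.example.uk/": ["http://a.example.uk/"],
    "http://a.example.uk/about": [],
}

def crawl(seed):
    # Frontier queue plus a seen set: the core of a breadth-first crawl.
    frontier, seen, fetched = deque([seed]), {seed}, []
    while frontier:
        url = frontier.popleft()
        fetched.append(url)          # a real crawler downloads and indexes here
        for link in WEB.get(url, []):
            if link not in seen:     # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return fetched

print(crawl("http://a.example.uk/"))
```

Nutch layers politeness delays, robots.txt checks, re-fetch scheduling for change detection, and distributed queues on top of this basic loop.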

Uses for this technology range from simple customer service questions, such as finding out why students did not use streaming video (doi:10.1111/j.1467-8535.2009.00980.x), to complex legal research, pharmaceutical clinical trial analysis, and mergers-and-acquisitions due diligence. These tools can even surface indirect connections, the most famous being the discovery that fish oil may be a good treatment for Raynaud's disease; such work has become the field of literature-based discovery (LBD, overview at doi:10.1.1.77.6842).
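The fish-oil finding rested on shared intermediate terms: one body of literature linked fish oil to factors such as blood viscosity and platelet aggregation, and a separate body linked those same factors to Raynaud's disease, even though the two endpoints never appeared together. A minimal co-occurrence sketch of that A-B-C pattern, with invented document term sets, might look like this:

```python
# Swanson-style literature-based discovery: terms A and C never co-occur
# directly but share intermediate "B" terms across separate documents.
# The term sets below are invented for illustration.
docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynaud's disease"},
    {"platelet aggregation", "raynaud's disease"},
    {"raynaud's disease", "cold exposure"},
]

def bridging_terms(a, c):
    # Terms that co-occur with A in some documents and with C in others.
    with_a = set().union(*(d for d in docs if a in d)) - {a}
    with_c = set().union(*(d for d in docs if c in d)) - {c}
    return with_a & with_c

print(sorted(bridging_terms("fish oil", "raynaud's disease")))
# ['blood viscosity', 'platelet aggregation']
```

Real LBD systems rank the bridging terms statistically and filter them with domain knowledge, but the indirect-connection idea is the same.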

Hearst has this to say about the UK Web Archive and BigSheets approach: "The critical thing for this kind of tool is how accurate is the text analysis and how easy is it to do the exploratory analysis." She added, "IBM has been developing advanced tools for text analytics for years (both in terms of algorithms and scalability) and probably [has] one of the best around."


Avi Rappoport is available for search engine consulting on both small and large projects.  She is also the editor of www.searchtools.com.


Related Articles

5/7/2012: IBM, Vivísimo, and the ‘Big Data’ Buzz


Comments
Posted By the dude, 3/9/2010 6:17:45 AM

Contrary to what the article mentions, OpenCalais is NOT free. One can get a free licence for 50,000 transactions a day, which is definitely not enough to process the numbers targeted by the UKWA project. There must be some kind of license agreement here... Who can tell us more?

*************************************
From the FAQ:

During the beta period we'll be limiting usage to a total of 50,000 transactions per license per day and four transactions per second. If you have a great idea that requires more processing capability than this, please contact us and we can talk. After the beta period we'll be significantly increasing these usage limits; our goal is to allow users to submit as many documents as they need to every day.


Posted By Natasha Gabriel, 3/8/2010 10:58:21 PM

Great way to describe it! - "crunch text and extract the best nuggets" - Many organizations mistakenly think that some of their old information is useless. We have, however, shown many of our customers how they can leverage information even in legacy systems and transform it into contextual intelligence, thus adding value to their enterprise. It's been truly fascinating to watch our customers calculate the ROI that we deliver. Good read! http://vivisimo.com/
