Internet Archive Adds Search Engine
Posted On September 8, 2003
Since 1996, the Internet Archive (http://www.archive.org) has collected "snapshots" of the whole Web every 4 to 6 weeks using the Alexa Internet crawlers. In 2001, the Archive introduced the Wayback Machine, a Web interface to its 100-plus-terabyte collection of Web pages, which provides access to historical Web coverage for user-supplied URLs. Until now, the service has not offered keyword searching of archived Web page content, though a set of research tools offered some assistance. Now, volunteer Anna Patterson has built a search engine designed specifically for handling the massive digital archive. Currently in beta test mode (http://web.archive.org or http://recall.archive.org) and covering 11 billion pages of the 30 billion page Archive, the Recall search engine is scheduled to go into full service on the full Archive in mid-October. [Details on how it works are in a PowerPoint presentation at http://ia00406.archive.org/cobwebsearch.ppt.]
The Recall search engine offers some features specifically designed for archive searching. For example, it has time-based modifiers next to the search box that allow users to specify the time frame from which they want pages retrieved. The time frame only allows months and years, since, according to Patterson, the 4- to 6-week schedule for crawling the Web does not allow more specificity. A panel in the results display shows how the main topics covering the search have changed in relevancy over the years of the Archive's coverage, while a graph shows how the number of pages using the search terms has risen and fallen over time.
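The month-and-year granularity described above can be illustrated with a minimal Python sketch. It is not Recall's actual implementation, which the article does not describe; the function names and the sample snapshot list are hypothetical. It only shows why day-level precision adds nothing when crawls arrive 4 to 6 weeks apart: each snapshot date is reduced to a (year, month) pair before it is compared with the requested range.

```python
from datetime import date

def month_key(d):
    """Reduce a date to (year, month), discarding the day."""
    return (d.year, d.month)

def in_time_frame(snapshot_date, start, end):
    """Return True if the snapshot falls within the inclusive
    (year, month) range from start to end."""
    return start <= month_key(snapshot_date) <= end

# Hypothetical snapshots: (URL, crawl date). Not real Archive data.
snapshots = [
    ("http://example.com/", date(2001, 3, 14)),
    ("http://example.com/", date(2001, 4, 27)),
    ("http://example.com/", date(2002, 1, 5)),
]

# Keep only snapshots crawled between March 2001 and December 2001.
selected = [s for s in snapshots if in_time_frame(s[1], (2001, 3), (2001, 12))]
print(len(selected))
```

Because tuples compare element by element, the year is checked first and the month second, so a range that spans a year boundary needs no special handling.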
The Recall engine also uses categorization and topics. A panel to the right of the results section lists topics covering the results of the search. Searchers can click on related terms to follow them. Recall ranks search results on the basis of content rather than popularity.
In developing the Recall search engine, Patterson found the Archive very stable. She also asserted, "The scripts are very stable for the corpus. They can handle over 200 billion pages, which the Archive could reach in six years." She believes her design is much more scalable than other Web search engines, but states that the Internet Archive Recall could not handle Google-type request loads. "Currently, we're getting 5,000 uses an hour; they get 500 a second." However, she points out that her engine, with its connections to up to 15 related terms, builds in a greater understanding of content. After the launch is completed in October, Patterson said she may consider other, more profitable licensing options.
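To put the quoted figures on a common scale, the short Python sketch below converts Recall's hourly usage to a per-second rate and compares it with the Google figure. It also works out the yearly growth implied by going from 30 billion to 200 billion pages in six years. The inputs are the numbers quoted in this article; the steady compound growth is an assumption made here for illustration, not something Patterson stated.

```python
SECONDS_PER_HOUR = 3600

# Request loads as quoted: 5,000 uses an hour versus 500 a second.
recall_per_second = 5000 / SECONDS_PER_HOUR
google_per_second = 500
load_ratio = google_per_second / recall_per_second

# Archive size as quoted: 30 billion pages now, 200 billion in six years.
current_pages = 30e9
projected_pages = 200e9
years = 6

# Assumption: growth compounds at a constant yearly rate.
annual_growth = (projected_pages / current_pages) ** (1 / years) - 1

print(f"Recall: {recall_per_second:.2f} requests per second")
print(f"Google load is about {load_ratio:.0f} times larger")
print(f"Implied growth: about {annual_growth:.0%} per year")
```

On these figures, Recall handles roughly 1.4 requests per second, the Google load is about 360 times larger, and the Archive would need to grow by roughly 37 percent a year to reach 200 billion pages in six years.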
Brewster Kahle, director of the Internet Archive, hailed the search engine and Patterson as "awesome." Kahle, one of the world's leading digital librarians, envisions a future that may contain a "flowering" of search engines designed to suit the needs of different communities. "I'm hoping we will get good open source search engines, different ones supporting different types of research—one that works better for statisticians and economists, another for medical. We should have a diversity of features and developments in search engine services. We need to start seeing things that are not oriented toward 100 million users in a short period of time—the mass advertising model. We need researcher-oriented tools....Recall is one."
The Internet Archive relies on corporate donations, government and foundation grants, and donations from generous and talented individuals. It represents one of the great success stories on the Love side of the "For Love or Money" saga of the Internet. With the addition of an effective search engine, it also represents a site that serious Internet searchers should carry high on their lists of Favorites or Bookmarks.