Alexa Internet (http://www.alexa.com), the leading archivist of the World Wide Web, has given the Library of Congress (http://www.loc.gov), one of the world's leading archives from the print age, a gift copy of its recording of the World Wide Web. The donation, comprising 2 terabytes of Web content, is in the form of an interactive digital sculpture. In making the gift, Brewster Kahle, president and CEO, hopes to encourage LC and other research libraries to accept responsibility for preserving the knowledge on the Web as they preserve knowledge in print.
Presented on October 13, the donation represents the first large-scale contribution of digital materials received by the Library of Congress. Digital artist Alan Rath designed the sculpture entitled "World Wide Web 1997: 2 Terabytes in 63 Inches." The gift contains 44 digital tapes alongside four red computer monitors that intermittently flash 10,000 Web pages (two every second) from the 500,000 sites gathered and stored by Alexa Internet. The archive includes text, images, and audio files representing a full "snapshot" of the Web from early 1997.
The donation fits with the Library of Congress' National Digital Library Program, which makes over 1 million rare American manuscripts, films, sound recordings, and photographs from its collections available free on the Net. Winston Tabb, associate librarian for library services in charge of the Library's collections, welcomed the gift: "Alexa Internet's donation of the Web enhances the Library's holdings and ensures that one of the most significant collections of human thought and expression born of a new medium is preserved in the national collections. Alan Rath's sculpture serves as a tangible icon representing the Web and will help our visitors envision the scope of what has become one of the largest sources of information ever built by humankind."
Alexa's Brewster Kahle has made preserving the Web and its data for posterity his mission: "The fabric of the Web is a temporary one at best unless we commit to its long-term care and feeding. With our donation of the Web archive to the Library of Congress, we're trying to build an infrastructure that transforms the Web into a resource to benefit future generations of scholars and historians."
Alexa Internet estimates that the Web grows at the rate of 1.5 million pages or sites daily. If the present rate of growth continues, the Web will contain more than 1 billion pages by the year 2000. A current snapshot of the Web takes up 3 terabytes (3 million megabytes). Alexa also estimates that the Web doubles in size every 8 months, that approximately 20 million Web content areas exist, and that 100,000 different host machines handle 90 percent of all Web traffic, with 50 percent of all traffic going to 900 top Web sites. However, Alexa's data also reveals that 1 percent of all Web pages are gone after 1 week. Since 1996, Alexa Internet has sent out its robots to "crawl" the public Web every 6 to 8 weeks. It then gathers, stores, and preserves the content in the Internet Archive, a nonprofit organization.
Alexa Internet uses the data in its free Alexa service for generating Site Statistics and Related Links. Other leading Web services also tap into its data and technologies, including Netscape Communications (the "What's Related" feature in Netscape Communicator 4.5) and Encyclopaedia Britannica (Eblast), among others. When downloaded, Alexa appears as a toolbar at the bottom of users' screens and continually communicates with the browser to supply background information on each site searched, including Site Statistics (the owner registrant, popularity, number of sites linking to that site, third-party affiliations with privacy advocates, etc.) and Related Links (a list of 10 related links for each site visited based on usage patterns by all Alexa users). Kahle regards analysis of user behavior in moving from link to link as a tremendous data resource, "a huge, invisible, peer-reviewed process." As Kahle puts it, "You are what you link." If you have an old URL, Alexa Internet might be able to bring it back from an "out-of-print" Web server, but the service cannot search old data specifically. Besides the myriad scholarly questions such data might answer, valuable business insights or even legal evidence could come from such an archive.
What to Do with It
Kahle is a true missionary. He named Alexa Internet after the first library of antiquity in Alexandria, Egypt. He takes a virtual view of reality. When we interviewed him, his digital archivist's eye "guess-timated" the Library of Congress' existing print holdings as "about 20 terabytes or $200,000 in storage space. It would take up the space of a couple of Coke machines." Of course, unlike Alexa Internet, which takes everything on pages including video clips, sound, and graphics, Kahle's estimate for digital storage of LC's print collection reflects "only the text, all ASCII. The graphics would get very complicated to estimate."
A Library of Congress representative told us that the Library had accepted the digital donation with two purposes in mind: to preserve the Web content in an archive and to use the collection to experiment with needs and methods for future Web archiving. LC hopes to define what should and what should not go into a digital archive (what to keep and what to discard) and to work on how to make archived Web content available to users. Kahle encourages librarians everywhere to start to "grapple with all the political, social, economic, and other issues of documents born digital, not a digitized version of a print library. We're moving toward all digital material. It's never been effectively dealt with before. Problems are everywhere."
When asked about the problems of copyright for the donated Web archive, the LC representative referred to Kahle's theory at Alexa Internet. Alexa limits its archiving to publicly available sections of the Web, and it will remove any material if the copyright holder asks Alexa to do so. However, as a legal librarian pointed out to us, this approach to copyright seems to put an affirmative duty upon copyright holders to maintain their rights, and to do so without notice: Alexa Internet does not notify or alert each and every Web-site administrator of its activities. Kahle admitted that the issue is "murky," but he takes a proactive posture "like the search engines did. AltaVista and the others just went ahead and did it. They didn't ask everyone for permission. The essence of copyright is trying to protect knowledge. There's a role for a library that makes sense in this world. We're proactively going out and building one. We didn't see anyone else trying to do it."
Involving the Library of Congress in the grand mission represents a "really big deal" to Kahle: "This is a watershed. The Web is coming of age. We've gotten the Web as a publishing infrastructure. Now we need to bring more fire power to bear. This is a significant event. It will open doors for other libraries to play meaningful roles. By the Library of Congress doing this, it sets a precedent that the World Wide Web is worth collecting and is a usable tool. We didn't give it to the Smithsonian. We want it in a library oriented for access by researchers, historians, and scholars."