Regardless of who wins this November’s presidential election, the business of government still chugs along. Or does it? With so much of the daily activity of the federal government now conducted on the web, the effect of a change of administration becomes a matter of curiosity or even anxiety. In 2001 and 2004, the National Archives and Records Administration (NARA; www.archives.gov) created "snapshot" crawls of federal agency websites (the 109th Congress and 2004 presidential term crawls are available at www.webharvest.gov). However, in March, NARA announced it would not conduct the same kind of snapshot for 2008/2009. Responding to the possible loss of a historically important record, five agencies and organizations -- the Library of Congress (LC; www.loc.gov), Internet Archive (www.archive.org), California Digital Library (www.cdlib.org), University of North Texas Libraries (www.library.unt.edu), and the U.S. Government Printing Office (GPO; www.gpo.gov) -- have partnered to take on the task.
Each of the participants will concentrate on a specific approach to gathering and curating content from federal agency websites under LC’s leadership. The bulk of the collecting will come from a sequence of crawls by the Internet Archive. LC will contribute congressional content based on its regular monthly crawls, which it has conducted since 2003. The California Digital Library (CDL) and the University of North Texas (UNT), already participants in the LC-funded Web-at-Risk project, will tap their considerable experience to identify key agency content. The UNT Libraries, already a model organization for digital archiving of government sites with the CyberCemetery (http://govinfo.library.unt.edu), begun in 1997, have developed a program for gathering input from expert government document librarians around the country to identify and suggest key "not to be missed" sites. Assisting on the curation side of the effort, GPO will promote the program within the Federal Depository Library Program.
According to Martha Anderson, director of program management for the Library of Congress’ National Digital Information Infrastructure and Preservation Program (NDIIPP), the sequencing of crawls is due, in part, to the "practice of politeness. We will not hit a particular server too often. It could take a week or two to collect all the content." Nor will the crawlers reach for content that sits behind registration or passwords, or that is blocked by robot exclusions, according to Anderson. However, Mark Phillips, head of the digital projects unit at the UNT Libraries, indicated that they might take advantage of any Sitemap Protocol files they find. The Sitemap Protocol is an open standard that enables search engines to reach into proprietary or legacy systems with the permission and assistance of the site owner.
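To make the Sitemap Protocol concrete: a sitemap is just an XML file, published by the site owner, listing URLs a crawler might otherwise never discover by following links. The sketch below parses a minimal, invented example (the agency domain and file names are hypothetical; the XML namespace and element names come from the open protocol specification).

```python
# Minimal sketch of reading a Sitemap Protocol file.
# The sitemap content here is invented for illustration; the namespace
# "http://www.sitemaps.org/schemas/sitemap/0.9" is the real protocol namespace.
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.gov/reports/2008-annual.html</loc>
    <lastmod>2008-09-01</lastmod>
  </url>
  <url>
    <loc>http://www.example.gov/data/budget-2008.pdf</loc>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return the <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [url.findtext("sm:loc", namespaces=NS)
            for url in root.findall("sm:url", NS)]

print(sitemap_urls(SITEMAP_XML))
```

A crawler that honors sitemaps can add these URLs to its frontier directly, reaching database-backed pages that have no inbound links on the public site.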
In its sequence of crawls, the Internet Archive will use the Heritrix software (http://sourceforge.net/projects/archive-crawler), developed by the Internet Archive with the Nordic national libraries under commission from the International Internet Preservation Consortium (www.netpreserve.org). This open source crawler collects a much wider range of content than many text-oriented search engine crawlers do, encompassing multimedia formats.
As yet, no one knows how large the final collection will be. Phillips estimates 14–20 terabytes; Kris Carpenter, director of the Web Group at the Internet Archive, estimates 10–12. The depth of the site drilling will differ among participants. Tracy Seneca, web archiving service manager at the CDL, says that they plan to set their crawler for 23 "hops away." She explained, "People call it drilling down, but crawlers don’t work on a directory/subdirectory structure. You give them a page and they follow the links as many ‘hops away’ as you tell them." The CDL contribution to the effort will focus on sites of known value, based both on its own experience and on contributions from experts. According to Carpenter, the Internet Archive will set a "hops away" limit only when its crawlers reach outside dot-gov sites, where limits may be as little as three or four hops. Otherwise, its initial baseline crawl, already underway and expected to last until mid-September, will slow down only to avoid overloading servers. The plan is to crawl "everything we can in the dot-gov domain and sub-domains." Subsequent crawls, when sites recommended by experts are added, may be even more extensive, according to Carpenter.
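Seneca’s "hops away" description amounts to a breadth-first traversal with a hop budget rather than a walk down a directory tree. The sketch below illustrates the idea on a tiny invented link graph (the page names and links are hypothetical, not from any real crawl).

```python
# Breadth-first "hops away" traversal: collect every page reachable
# within max_hops link-follows of the seed pages.
from collections import deque

def crawl_with_hop_limit(links, seeds, max_hops):
    """links maps each page to the pages it links to; returns the set
    of pages within max_hops hops of any seed (seeds are hop 0)."""
    seen = set(seeds)
    queue = deque((page, 0) for page in seeds)
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop budget spent; do not follow this page's links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen

# Hypothetical mini link graph: home links to a and b; a to c; c to d.
LINKS = {
    "gov/home": ["gov/a", "gov/b"],
    "gov/a": ["gov/c"],
    "gov/c": ["gov/d"],
}
print(sorted(crawl_with_hop_limit(LINKS, ["gov/home"], 2)))
```

With a budget of 2 hops, the traversal reaches gov/a, gov/b (one hop) and gov/c (two hops), but never gov/d, which sits three hops out; this is exactly how a low hop limit at the dot-gov boundary keeps a crawl from wandering off across the rest of the web.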
Once all the crawls and the in-depth harvesting of nominated sites are complete, the final results will first appear on the Internet Archive as a separate subset of the Wayback Machine archive (www.archive.org/web/web.php), probably in mid- to late February 2009, according to Carpenter. Copies of the final results will be sent to each of the participants for presentation on their own websites. Different interests, approaches, and even funding will determine what each of the other participants does with its version of the End-of-Term Snapshot, but the Library of Congress is expected to load its copy into the www.loc.gov/webcapture site.
So Where’s NARA?
Clearly, the participants in this end-of-term archiving effort have taken on a challenging and vital task in the public interest. But a citizen taxpayer could wonder why a bucket brigade has to do the job while the fire department sits it out. NARA followed up the March 27 announcement of its intention not to conduct a 2008 snapshot (NWM 13.2008; www.archives.gov/records-mgmt/memos/nwm13-2008.html) with a more detailed defense of its decision on April 15 (Web Harvest Background Information, www.archives.gov/records-mgmt/memos/nwm13-2008-brief.html). Nevertheless, not everyone agrees with its logic. Though none of the spokespersons for the agencies participating in this project would say anything negative about NARA’s decision, the American Association of Law Libraries (www.aallnet.org) formally expressed its "disappointment" in an April letter to Allen Weinstein, the Archivist of the United States. (For more information, go to AALL’s Washington Blawg; http://aallwash.wordpress.com/2008/08/19/partners-join-together-to-preserve-government-web-sites.)
NARA’s harvesting policies were probably not as effective as those of the current effort. Paul Wester, director of NARA’s Modern Records Program, pointed out that NARA used a "pull" policy, asking agencies to conduct their own crawls and send the results to NARA. In fact, the original 2001 NARA snapshot gave agencies just 8 days to comply: a Jan. 12, 2001, memo set a Jan. 20, 2001, completion date, with a 60-day delivery period (www.archives.gov/records-mgmt/basics/snapshot-public-web-sites.html).
All may not be lost, however; NARA is still crawling congressional and White House sites. Wester confirmed that, after assembly, data from these efforts would become available to the public, as is usual with NARA collections. Whether it would become directly searchable through NARA’s webharvest.gov site (actually hosted by the Internet Archive) was unclear, but requesters could get copies of the data files. So it should be possible for the bucket brigaders to borrow the fire department’s hose sometime in the future. Wester also stated that agencies have been improving their digital archiving performance as NARA has provided more detailed guidelines.
One thing became clear in the course of interviewing the different parties in this matter. Archiving policies, procedures, and assignments have not kept pace with the rapid switch of the federal government to web technology. Ironically, the digitization of government activity has the potential to create a much more thorough and complete archive than print documentation, but only if someone is doing the digital archiving. This project may actually prove beneficial beyond the creation of one very useful database. People may discover that archiving the "G" is a lot more doable—and a lot more affordable—than they might imagine. The importance of doing it is inescapable.
As the Internet Archive’s Carpenter says, "The thing to do is just do it."
[If you would like to contribute URLs for the project to crawl, keep an eye out for an announcement of how to do that on the GOVDOC-L list (http://govdoc-l.org) or the ALA GODORT list (http://lists.ala.org/sympa/info/godort).]