With all the controversy still swirling around Google Books and its post-settlement offerings, an alternative route to the millions of digitized books and journals supplied by leading Google Book Search library partners has arrived. The HathiTrust (www.hathitrust.org) is a collaboration of 25 research libraries already participating in Google Book Search to produce a shared digital repository for preservation and access to a curated collection. By mid-November, the HathiTrust Digital Library will have a full-featured, full-text search service for 4.3 million to 5 million items. The searches will retrieve bibliographic citations and page references, including those for in-copyright books. Content will extend beyond the digitized copies of books returned to early library partners by Google. HathiTrust is pushing to acquire other digitized special collections from its members and is making arrangements to open access to university press books.
Begun in October 2008, HathiTrust currently counts among its members the 10 University of California system libraries, plus the California Digital Library, Indiana University, Michigan State University, Northwestern University, The Ohio State University, Penn State University, Purdue University, The University of Chicago, University of Illinois, University of Illinois at Chicago, The University of Iowa, University of Michigan, University of Minnesota, University of Wisconsin-Madison, and the University of Virginia. The depository currently includes digitized volumes from the University of Michigan, University of California, Indiana University, and the University of Wisconsin.
According to John Wilkin, associate university librarian at the University of Michigan and executive director of the HathiTrust, "The partnership is still expanding. We're on the verge of announcing maybe three new partners. We're exploring a new partnership model in working with OCLC and RLG Research with New York University. They won't be depositing but will recognize items and use us to weed a print collection. If they back up in print and HathiTrust will commit to digital storage, they will help pay for the curation. We have no commercial partners right now, but we're always looking for sustainable models."
An Oct. 23, 2008, NewsBreak by Beth Ashmore described the launch of the collaboration (http://newsbreaks.infotoday.com/NewsBreaks/HathiTrust-A-Digital-Repository-for-Libraries-by-Libraries-51225.asp). With all the hoopla these days about Google and its rights, it is interesting to note that Ashmore interviewed Wilkin about rights management issues: "When asked about the rights management system behind HathiTrust, Wilkin described a complex database-driven system that automatically assigns a rights status (in-copyright, public domain, etc.) based on the metadata for the item (place of publication, date of publication, etc.). Wilkin also noted that the rights management system allows for manual overrides of the automated assertions to allow for a wide variety of instances, including occasions where rightsholders allow open access to their work."
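For readers who want a concrete picture of that kind of system, here is a minimal sketch in Python of a metadata-driven rights assignment with manual overrides. It is an illustration only, not HathiTrust's actual code: the `Item` record, the function names, the override table, and the single rule (U.S. imprints dated before 1923 treated as public domain, everything else as in-copyright) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Item:
    """Hypothetical bibliographic record; field names are assumptions."""
    item_id: str
    place_of_publication: str           # e.g. "US", "UK"
    year_of_publication: Optional[int]  # None when the metadata lacks a date


# Illustrative cutoff only; a real system would apply many more rules.
PUBLIC_DOMAIN_CUTOFF_YEAR = 1923


def automatic_status(item: Item) -> str:
    """Assign a rights status from metadata alone."""
    if item.year_of_publication is None:
        # Conservative default when the date is unknown.
        return "in-copyright"
    if (item.place_of_publication == "US"
            and item.year_of_publication < PUBLIC_DOMAIN_CUTOFF_YEAR):
        return "public domain"
    return "in-copyright"


def rights_status(item: Item, overrides: Dict[str, str]) -> str:
    """Return the manual override if one exists, else the automatic status."""
    return overrides.get(item.item_id, automatic_status(item))


# A rightsholder has opened access to one in-copyright title (hypothetical ID).
manual_overrides = {"vol-0003": "open access"}

old_us = Item("vol-0001", "US", 1890)
recent = Item("vol-0002", "US", 1975)
opened = Item("vol-0003", "UK", 1990)
undated = Item("vol-0004", "US", None)
```

In this sketch the override table always wins over the automated assertion, which mirrors the behavior Wilkin described; how the real database weighs the two is not stated in the interview.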
The new launch will open indexing to nearly 1.5 billion pages from well over 4.3 million volumes with full-text searching by keyword or phrase. (Just between us, if you simply cannot wait until mid-November, go to http://babel.hathitrust.org/cgi/ls. Wilkin tipped me off that, although this "experimental search" site claims to search only 500,000 documents, it actually includes the full 4.3-5 million volumes. Feedback options appear at the top and bottom of each search results page.) The system already had the equivalent of library catalog searching, though HathiTrust expects to upgrade even that kind of searching under a cooperative program with OCLC.
You can download the full text of public domain works from HathiTrust, but only page by page. That's what Google sent back to its library partners. If you want to download a whole book in one fell swoop, it's back to Google Book Search (http://books.google.com). By the way, the indexed in-copyright content includes in-print and out-of-print offerings. Future full-text searching options will include faceted browsing, advanced search, "more like this" options, and tools for computational analysis.
The system uses Solr/Lucene, an open source search engine technology. According to Wilkin, this "search engine is used by even large commercial developers, like the Internet Movie Database. It is very fast and handles large amounts of data effectively."
Of course, the first question that comes to mind in looking at the HathiTrust Digital Library is "How does HathiTrust compare to Google Book Search?" In fact, HathiTrust's FAQ answers that very question, saying it "complements Google's massive undertaking to digitize the world's library collections. While both systems offer digitized books via the Internet, it is likely that HathiTrust will provide some content Google will not, such as digital collections unique to each institution, works from institutional repositories, and native born-digital materials. HathiTrust also provides a new platform for the expert curation and consistent access long associated with research libraries."
Wilkin expanded the answer: "We work on tuning the search aspect to the needs of scholars. It's fair to say that the Google Book Search tool will find you the most likely results, but scholars want to see every result, every occurrence. We're going to identify every page where search terms occur and how many times, even in the in-copyright material. Google does a lot of working around the books, for example, integrating with Google Maps and other sales functions, but we're building a scholar's finding tool."
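The exhaustive, page-level reporting Wilkin describes can be pictured with a small Python sketch. It is a toy, in-memory illustration under stated assumptions, not the Solr/Lucene index HathiTrust actually runs: the function name `term_hits_by_page` and the idea of holding a volume as a list of page texts are invented for the example.

```python
import re
from typing import Dict, List


def term_hits_by_page(pages: List[str], term: str) -> Dict[int, int]:
    """Map each 1-based page number to the count of whole-word matches.

    Pages with no match are omitted, so the result lists every page
    where the term occurs and how many times it occurs there.
    """
    # Case-insensitive, whole-word (or whole-phrase) matching.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits: Dict[int, int] = {}
    for page_number, text in enumerate(pages, start=1):
        count = len(pattern.findall(text))
        if count:
            hits[page_number] = count
    return hits


# A made-up three-page "volume" for demonstration.
sample_pages = [
    "The whale surfaced. A whale!",
    "No cetaceans here.",
    "Whales and the whale.",
]
result = term_hits_by_page(sample_pages, "whale")
print(result)  # {1: 2, 3: 1}
```

Because the sketch returns only page numbers and counts, never the page text itself, it also suggests how such results could be shown for in-copyright volumes without displaying their content.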
As for unique content, Wilkin admitted that, if the proposed Google Book Search post-settlement files emerged from the current controversy intact, there would be "significant overlap, but there are many things they have that we don't have," e.g., content coming in from Google's publisher partners. However, Wilkin pointed to content they have that Google would not. "For example, we have some Open Content Alliance content from some of our partners, such as Illinois, University of California, and Penn State. We also get many special collections and other digitized efforts."
As part of the University of Michigan Press' (UMP) Digital Transition project, UMP will make its collection available for sale in print or electronic format. HathiTrust Digital Library will refer interested users to the UMP website for making purchases. According to Phil Pochoda, director of UMP, "If you want to either search for or happen to come across Michigan Press books, you can view them onscreen anywhere, anytime, and decide if you are interested in making the purchase. ... This is just one of the many avenues we're pursuing to allow today's readers to find and 'flip through' a book to see if it meets their needs and interests without ever turning an actual page."
Other future plans, according to Wilkin, include expanding print-on-demand (POD) options. "We have over 100,000 volumes in POD now and probably [will have] 500,000 by the end of the year. They are only public domain at this point, but the University of Michigan Press may lead to printing at the point of sale." HP has also announced a print-on-demand service called BookPrep that will print any of 500,000 out-of-print or hard-to-find books from the University of Michigan's libraries for sale through Amazon.
For the long run, Wilkin wants everyone to know, "We're trying to be a depository. We're in the business of long-term access. We strive for comprehensiveness. Though we love Google Book Search's functionality and users use it, we're here to safeguard the future."