Google and Research Libraries Launch Massive Digitization Project
Posted On December 20, 2004
For years librarians have bemoaned the failure of patrons to realize that not everything is on the Web and that there was life before the Internet. Both problems may now be on the way to a solution. Holding true to its mission "to organize the world's information and make it universally accessible and useful," Google has launched a program with a number of research libraries in the U.S. and the U.K. aimed at ultimately scanning all the books in their collections. The result of the multiple-year project would be an online digital library of what could number as many as 30 million volumes. The program will encompass books in and out of print, in copyright, and in the public domain—all available for full-text searching and, for the public domain items, full-image viewing. Participants in the program are the libraries of Harvard, Stanford, the University of Michigan, and Oxford University, as well as the New York Public Library (NYPL).
Although some library participants apparently were worried that publishers might object to the program on the grounds of copyright violation, Patricia Schroeder, executive director of the Association of American Publishers (AAP), assured me that the AAP has no immediate plans to try to deter the program, for example through legal action. The program expands on the existing Google Print program, which is built on similar digitization done in direct arrangements with publishers.
Larry Page, Google co-founder and president of products, stated: "Even before we started Google, we dreamed of making the incredible breadth of information that librarians so lovingly organize searchable online. Today we're pleased to announce this program to digitize the collections of these amazing libraries so that every Google user can search them instantly. Our work with libraries further enhances the existing Google Print program, which enables users to find matches within the full text of books, while publishers and authors monetize that information."
Librarians and their employers are making this access possible. Mary Sue Coleman, president of the University of Michigan, said, "We believe passionately that such universal access to the world's printed treasures is mission-critical for today's great public university."
Initially, the program's scope will vary from participant to participant. The University of Michigan has committed to complete digitization of all 7 million volumes in its collection, excluding its rare books and other fragile material. Three of the participants—Harvard, NYPL, and Stanford—refer to initial efforts as pilot projects.
The Stanford "pilot," however, will cover as many as 2 million books in its first phase, according to Michael Keller, Stanford's library director and director of academic information resources. Full digitization would extend to Stanford's entire 8-million-book collection.
Harvard will begin a pilot project with 40,000 books randomly selected from the Harvard Depository collection. Estimated to run for about 6 months, the pilot will help determine whether Harvard proceeds to a large-scale digitization that could cover its entire 15-million-volume collection.
The New York Public Library will initially contribute only its public domain material. Because of that restriction, NYPL will also make the digital collection it receives back from Google available for searching and delivery on its own Web site (http://www.nypl.org) as well as Google's.
Oxford will contribute the 19th-century collections (again public domain) from its Bodleian Library. Even restricted to this subset, the number of titles could reach a million. One of the world's largest and oldest libraries, the Bodleian has served as a legal deposit library for nearly 4 centuries.
Where Do the Publishers Stand?
According to Adam Smith, product manager for the Google Print library operation, Google's primary target is out-of-print material, whether public domain or in copyright. Google maintains that it is meeting library copyright standards. Participants will receive no financial compensation from Google, but the massive digitization project will also cost them nothing, according to librarians involved in the program with whom I have spoken. Each library in the program will receive digital copies of the books it has contributed, which it can then use to enhance service to its own patrons.
Some library participants in the program (e.g., NYPL) have clearly sought to avoid any copyright problems by limiting access to public domain material. Others are moving carefully, as if through a legal minefield. However, my conversation with Schroeder of the AAP may serve to reassure Google and its library partners. Schroeder indicated that publishers were relatively comfortable with the prospect of Google's entry into their world. She admitted that publishers reissuing public domain works in print might take a hit, but she then pointed to the advantage to publishers, particularly small ones, of having their backlists digitized and promoted for free. (Google has apparently been talking with AAP publisher members.)
When I asked Schroeder whether lawsuits to stop the project were under consideration, she assured me they were not. According to Schroeder: "At the moment, there are no alarm bells ringing from members. Many are consulting with Google. Of course, if the bells do start ringing, we will be out [of] there like a 12-alarm fire, but for now Google is working with publishers to create a whole new way to deliver content. We are ever vigilant, but unless the system crashes or we see large-scale piracy or leakage or changes in Google's business models, our people are being cooperative."
In launching the program, Google promised publishers and authors that this expansion of Google Print "will increase the visibility of in- and out-of-print books and generate book sales via 'Buy this Book' links and advertising." Searchers on Google will see links to Google Print results (including the massive library collections) appear in a box on the first page of Web search results. Clicking in the Google Print box will retrieve the full image of public domain works and, for copyrighted material, up to three snippets of text along with bibliographic citations. Where Google Print has a direct working relationship with publishers, the publishers will allow fuller descriptions and make a percentage of the full text available to users each month. Public domain works are not downloadable. Readers will have the option to browse and read the image texts online while connected to Google.
When I asked Smith whether Google was prepared to have thousands of readers connected to its system reading full-length books, he replied, "Absolutely." And, according to another Google representative, the company has no plans at present to "commercialize" the experience by inserting ads. The Google Print interfaces will, however, connect to purchasing alternatives—links to online booksellers, such as Amazon and Barnes & Noble; to out-of-print booksellers, such as Alibris; and to libraries through the OCLC Open WorldCat library locator service. To see how it works, go to http://print.google.com/googleprint/library.html.
How Will It Work?
The scanning process involves Google installing proprietary, high-speed scanning stations on-site at the different libraries (with the exception of Stanford, which will send its books to nearby Google headquarters). The proprietary scanning equipment, using tent scanners, has been in development for over a year, according to Smith. Google staff will scan the books and forward image copies to a central facility for quality control checking and OCR (optical character recognition) conversion to text. Harvard's FAQ for the project stated that Google's scanning process was "much gentler with books than other high-speed processes in use today." The nondestructive digitization process does not involve removing the binding, for example. (For more details, see the FAQ at http://hul.harvard.edu.)
Some of the details remain unclear, such as how Google plans to avoid duplication between scanning operations at different library participants. With regard to new books coming into library collections, Smith said Google had no plans in place at this time and would, if possible, prefer to work directly with publishers in Google Print. The library side of Google Print concentrates on out-of-print material. Smith pointed to the significant benefits to publishers from direct participation: bookseller links, publisher logo, links back to the publisher's Web site, additional reporting, expanded material exposure, etc. (Around the time of Google's announcement, Random House announced it would be selling books directly from its site.)
Beginning, Ending, or Both?
Like the other participating universities and libraries, Harvard viewed the program as creating an important public good and serving the world. Harvard president Lawrence H. Summers stated: "Harvard has the greatest university library in the world. If this experiment is successful, we have the potential to provide the world's greatest system for dissemination as well." In time, the program at Harvard would benefit students and faculty by linking directly to HOLLIS (Harvard Online Library Information System; http://holliscatalog.harvard.edu) for location of books on campus. Harvard also expected it to expand usage of the 5 million books in the Harvard Depository, many of which are out of copyright.
Michael A. Keller, Stanford University librarian and publisher of both the Stanford University Press and HighWire Press (Stanford's online co-publishing service for scholarly journals), said: "We have been digitizing texts for years now to make them more accessible and searchable, but with books, as opposed to journals, such efforts have been severely limited in scope for both technical and financial reasons. The Google arrangement catapults our effective digital output from the boutique scale to the truly industrial. Through this program and others like it, Stanford intends to promote learning and to stimulate innovation."
Several of the people with whom I spoke and much of the press coverage for the new program see the long-term potential of the Google library program as creating a universal virtual library, one that would—in time—challenge the role of and need for physical libraries. Sidney Verba, Carl H. Pforzheimer University Professor and director of the Harvard University Library, countered: "The possibility of a large-scale digitization of Harvard's library books does not in any way diminish the University's commitment to the collection and preservation of books as physical objects. The digital copy will not be a substitute for the books themselves. We will continue actively to acquire materials in all formats, and we will continue to conserve them. In fact, as part of the pilot, we are developing criteria for identifying books that are too fragile for digitizing and for selecting them out of the project."
Nonetheless, both Harvard and the University of Michigan used the term "revolutionary" in referring to the program's possible impact. John Wilkin, associate director for digital library services at the University of Michigan, said: "This is the day the world changed. It will be disruptive because some people will worry that this is the beginning of the end of libraries. But this is something we have to do to revitalize the profession and make it more meaningful."
Reactions to the Google library program continue to pour in. As for speculation about whether Google will stay the course on such an expensive and lengthy commitment of its resources, even with its recent influx of IPO lucre, one point to consider comes from an NPR interview with Keller. When asked whether Stanford's arrangement with Google was exclusive or whether it would deal with Yahoo! and/or Microsoft, he assured the interviewer that the university would be happy to work with other leading search engines. However, due to the agreement with Google and the huge investment Google was making, the university could not take Google's product and give it away. On the other hand, the NYPL's press release announcing the program stated that it planned to make the electronic copies of public domain books supplied to it by Google available on its own Web site. Apparently, at least one early payoff to Google from its library program could be keeping Web users' eyeballs away from well-funded, well-known rivals.
Look to next week's NewsBreak for a follow-up on this story with afterthoughts from leading librarians, information industry leaders, and other observant gurus of the changing scene. If you have any questions or comments, send them to me by Wednesday and I'll try to include answers or incorporate comments in the NewsBreak. (As editor of Searcher and Up Front columnist for Information Today, I also recommend you read the February issues of both those publications for—I hope—insightful comments.)
Just to give you a preview of next Monday's NewsBreak, I interviewed Jay Jordan, president and CEO of OCLC, on the leading library vendor's view of this development. Jordan said they were "drinking champagne." He felt that it validated the strategic planning decision they made 5 years ago that it was "imperative to weave libraries into the Web and the Web into libraries." On the other hand, he did admit that turning libraries into "fulfillment houses" would take more work on nationwide interlibrary loan services. He looked forward to working with Google and Yahoo! and other Web services to build for librarians and knowledge seekers everywhere.
Others predict that this development, however desirable, could mark the beginning of the end of brick-and-mortar libraries. We shall see …
Tune in next week.