1923–1963: Google Book Search Targeting More Books for Public Domain?

1923. Publishers, authors, librarians, and readers can still hear the roar from that one year of the Roaring ’20s. That is the year that Google Book Search (http://books.google.com) has set as the cut-off point for public domain status in its U.S. offerings. For material published before that year, all library partners in the program let Google’s mass digitization program grind away, but only a handful of library partners will risk letting post-1923, probably in-copyright material from their collections into the program. On June 24, Google Book Search introduced a downloadable XML file (http://booksearch.blogspot.com/2008/06/us-copyright-renewal-records-available.html) containing U.S. copyright renewal records for books published from 1923 to 1963. Under the copyright law in force during that period, copyright holders had to renew their registrations in the 28th year after publication, i.e., between 1951 and 1991. If they failed to renew, then the copyright lapsed and the material could be considered public domain. Or not. When it comes to copyright law and practice, nothing is simple, at least not yet.
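For readers who want to check the renewal arithmetic, here is a minimal Python sketch of the 28th-year rule as this article describes it. The function name and the range check are assumptions made for the illustration; this is not legal guidance and it ignores every exception the law may contain.

```python
# Renewal fell due in the 28th year after publication, per the article.
RENEWAL_TERM_YEARS = 28


def renewal_year(publication_year: int) -> int:
    """Return the year a renewal would have been due (hypothetical helper)."""
    # Assumption: only the 1923-1963 window covered by the renewal file is handled.
    if not 1923 <= publication_year <= 1963:
        raise ValueError("this sketch only covers books published 1923-1963")
    return publication_year + RENEWAL_TERM_YEARS


print(renewal_year(1923))  # 1951
print(renewal_year(1963))  # 1991
```

The two printed values match the 1951 and 1991 endpoints of the renewal window.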

A lot of work went into creating the copyright renewal database. Records had to be culled from the U.S. Copyright Office’s Catalog of Copyright Entries. For the period prior to 1977, this involved gathering information from OCRed text taken from hardbound volumes. For this content, Google relied on a combination of page images supplied by the Universal Library Project at Carnegie Mellon University and "tireless" proofing of the OCR text by volunteers at Distributed Proofreaders (http://www.pgdp.net/c) and Project Gutenberg (www.gutenberg.org). Google used the free public domain text already available at Project Gutenberg for the pre-1977 content.

Post-1978 records are available online at the Copyright Office’s website (www.copyright.gov/records). To gather this data, Google staff composed and submitted masses of individual requests built around the "R" and "RE" tags identifying renewal registrations and then "scraped" the information out of the results for all post-1978 records. According to Jon Orwant, Google engineering manager, the effort "simulated someone typing in every author and title." He assured me that they followed the guidelines and requests on the Copyright Office site, trying to conduct such mass downloads outside business hours so as not to tie up its servers. "We’re a good neighbor," said Orwant.

So what has Google gathered? According to the Readme file accompanying the database, "We believe we have compiled the only complete set of monograph renewal records outside of the U.S. Copyright Office. This is not a perfect set of renewal records and may contain inaccuracies." The file focuses only on books, not other copyrightable formats, e.g., movies and radio. The file is 390MB and compresses to a 56MB Zip file for download. When I downloaded the Zip file to my computer and clicked on the extracted file, my Internet Explorer browser stepped up to the task of loading the XML file. More than an hour later, it was still stepping. A second try just led to more hours with the cursor doing that whirligig thing. When asked about the problem, Orwant was not surprised. He expected the file to go to "programmers or someone at a library who searches renewal records. We provide the raw structured data, the tagged XML file, but expect people to import it into another application program like Filemaker or Oracle." However, according to Orwant, I could have used a Unix command like "grep." (In other words, no amateurs need apply.)
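For anyone without grep at hand, the same idea can be approximated in a few lines of Python. Instead of loading the whole 390MB file the way a browser tries to, the sketch below reads one line at a time and keeps only the lines that contain a search term. The function name, the sample text, and the tag names in it are invented for the illustration; the actual layout of Google's XML file is not described here, so treat this as a rough stand-in rather than a parser for that file.

```python
import io
from typing import Iterator, TextIO


def grep_lines(stream: TextIO, needle: str) -> Iterator[str]:
    """Yield each line that contains `needle`, ignoring case.

    Reads one line at a time, so memory use stays small for very large files.
    """
    needle = needle.casefold()
    for line in stream:
        if needle in line.casefold():
            yield line.rstrip("\n")


# Made-up sample data; the tag names are hypothetical, not taken from the real file.
sample = io.StringIO(
    "<title>An Example Book</title>\n"
    "<title>Another Work</title>\n"
)
for hit in grep_lines(sample, "example"):
    print(hit)
```

To try it on a real download, pass an opened file instead of the sample, e.g. `open("renewals.xml", encoding="utf-8", errors="replace")`, where `renewals.xml` is a placeholder name. A record that spans several lines would show only the matching line, which is one reason a database import may serve better for serious work.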

Why would Google build a database like this and not make it searchable on its site? It can’t be the size. The database only holds an estimated 427,000 records. Although no absolute counts exist, 40 years of publishing, according to Orwant, could have produced as many as 2 million titles with estimates running as low as 10% for the number of books renewed. And a goodly share of those books probably reside on the shelves of the library partners willing to let Google digitize their in-copyright material and therefore now inside Google Book Search itself. If Google could prove the public domain status of individual titles, it could add another 40 years of access to the service. One could also speculate that library partners that have denied Google access to digitizing their in-copyright collections, defined as post-1923, might loosen those restrictions if they could receive reliable assurances of public domain status.

However, as any professional searcher can tell you, verifying the public domain status of a work involves the most difficult type of search there is: a negative search. The absence of a book result from a search of a copyright renewal file doesn’t necessarily prove that the item is in the public domain. The zero result could stem from poor searching or poor data entry or any number of factors. Orwant’s explanation of why Google did not make the file searchable on its site, e.g., as a section in the Advanced Google Book Search options, seems to confirm Google’s awareness of this problem: "In part because it’s just a copy of what the Copyright Office believes to be authoritative. If we assert a match for a particular renewal to a particular book, that’s too close to our making a claim that if you see no renewal, then the book is in the public domain. We want to be careful not to make that kind of claim. Nevertheless, it would make sense to take information on the rights of a particular book and make it available. This is a baby step toward that position."
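The negative-search problem can be made concrete with a small Python sketch. It assumes a toy list of renewal titles (both invented, one carrying a simulated OCR error) and a simple normalized-title comparison; real matching would need authors, dates, and registration numbers. The point of the design is that a miss is reported as inconclusive, never as "public domain."

```python
import re
from enum import Enum
from typing import Sequence


class RenewalCheck(Enum):
    RENEWAL_FOUND = "renewal found"
    NO_MATCH = "no match (inconclusive)"  # deliberately NOT "public domain"


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation and spacing to soften data-entry noise."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def check_renewal(title: str, renewal_titles: Sequence[str]) -> RenewalCheck:
    """Compare a title against renewal titles (hypothetical helper)."""
    wanted = normalize(title)
    if any(normalize(candidate) == wanted for candidate in renewal_titles):
        return RenewalCheck.RENEWAL_FOUND
    return RenewalCheck.NO_MATCH


# Invented records; the second simulates an OCR error ("Anothcr" for "Another").
records = ["The Example Novel.", "Anothcr Story"]
print(check_renewal("the example novel", records).value)  # renewal found
print(check_renewal("Another Story", records).value)      # no match (inconclusive)
```

In the second lookup a renewal record does exist, but the damaged spelling hides it. That is exactly the kind of zero result that proves nothing.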

And there’s no doubt that expanding the public domain collection in Google Book Search beyond 1923 is a company goal. According to a Google spokesperson’s message, "These records will enable us to put more books into full view on Google Book Search, furthering us toward our goal to make books accessible to users while still respecting copyright. We’re committed to clarifying the public domain status of books and making as many books available online to users as possible." The complexity of copyright issues, according to Orwant, can lead to some "nightmare scenarios." "We don’t want to portray our database as canonical. We encourage people to go to the Copyright Office as the source. We will periodically go through the process, doing it again with the Copyright Office records, to produce a new version."

Google is not alone in this effort or even in the creation of a copyright renewal database. Orwant indicated that he has distributed copies of the Google copyright renewal database to OCLC, Project Gutenberg, some library partners, and other interested parties. The Stanford University Libraries have loaded their own copyright renewals database, covering the same content and using much the same methods. And, unlike Google’s, the Stanford Copyright Renewal Database (http://collections.stanford.edu/copyrightrenewals/bin/page?forward=home) is fully searchable. You can browse by year, title, and author; do a simple search; or use the advanced search options to search by title, author, registration date, and renewal date. Mimi Calter, special projects librarian and intellectual property manager for the Stanford University Libraries, described their file as covering 1950 to 1992, an extra year at each end of the 28-year span required by copyright law for 1923 to 1963 books. The file also covers only U.S. Class A book renewals. Stanford also provides a downloadable version of its file, which "uses the Lucene format to supply a search tool for the text fielded data." (Stanford also has a sophisticated and detailed page explaining fair use issues (http://fairuse.stanford.edu), including an introduction to the permissions project.)

OCLC has even more elaborate efforts in development. Bill Carney, OCLC product manager, revealed that it would launch a pilot project in a few weeks called the WorldCat Copyright Evidence Registry. It would link various databases and sources needed to verify copyright status. "Let librarians share their knowledge on the copyright status of books," said Carney. He hoped that the work would supply the "due diligence" and "qualified searches" required by the recent Congressional legislative action for "orphan works." By linking input from many librarians and others, e.g., publishers and authors, Carney hoped that the new effort would replicate the collegial network approach that Frederick Kilgour espoused when he started OCLC.

So what can you expect to do with Google’s copyright renewal database, or Stanford’s, or Gutenberg’s, or OCLC’s new service? Whatever you do, do it very, very carefully. That is the advice of Carol Ebbinghouse, law librarian at the California Second District Court of Appeal and Searcher magazine’s legal columnist. Do more research and then still more. You can start with the excellent chart, "Copyright Term and the Public Domain in the United States" (January 2008; www.copyright.cornell.edu/public_domain), and read the chart’s extensive footnotes carefully. Extensive information and wise advice can also be found at the Library of Congress’ Copyright Office site (www.copyright.gov) and at the Online Books Page (http://onlinebooks.library.upenn.edu or, more specifically, http://onlinebooks.library.upenn.edu/renewals.html). For a fee, the Copyright Office will even conduct a search for you, but it warns that it cannot make guarantees about the legal issues surrounding search results. If you get lucky, you might find the book already in Project Gutenberg and piggyback your permissions work on theirs. However, Ebbinghouse cautions, "Project Gutenberg’s Rule Six on how to determine the renewal issue is posted as under revision now." You can also check out Ebbinghouse’s January 2008 The Sidebar column titled "‘Copyfraud’ and Public Domain Works," or, at the very least, download the collection of URLs in the article using Searcher’s LiveLinks online service at www.infotoday.com/searcher/jan08/LiveLinks_Ebbinghouse_0108.htm.

The future of all orphan works, digitized or undigitized, may depend in part on the success of key legislation, specifically S. 2913: Shawn Bentley Orphan Works Act of 2008 (House bill, H.R. 5889), which limits judicial remedies for copyright infringement cases involving orphan works. So far that bill has been reported out of the Senate Judiciary Committee with an amendment and no written report and placed on the Senate Legislative Calendar.

One way or another, however, Google and librarians across the country will be pushing the issue.