Google Digitization Initiative to Expand Google News Archive
Posted On September 15, 2008
As usual, Google (www.google.com) appears to view any cost concerns as mere quibbling when it comes to bringing ever more content into its giant maw. If all it takes is resources, technology, effort, and money to bring content onto the web, well—no problem. Now the company has offered free digitization to any newspaper publisher willing to put all or any part of its archives onto the web for access through Google News Archive (http://news.google.com/archivesearch). Most newspapers already have microfilm copies of their archives created and stored in the vaults of leading microfilm houses. But most microfilmed titles have not been digitized and probably never would be without such an initiative. Two leading microfilm vaults—those of ProQuest (www.proquest.com) and Heritage Microfilm (www.heritagemicrofilm.com)—have begun opening to the new initiative. Once the newspaper publishers give permission, Google will start the digitizing. In the case of "orphan" sources, e.g., old enough for public domain status and with a defunct publisher, Google may start digitizing "any time."
Begun in 2006, the Google News Archive already taps into a wealth of newspaper content extending back hundreds of years. It reaches both free and fee content. [For a background, read the two NewsBreaks from September 2006 on the service’s launch, "Traditional Information Industry Opens Premium Content to Google News Archive," http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=18226, and "Who? What? How Much?: Google News Archive Premium Content Suppliers," http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=18227. —Ed.]
ProQuest has also supplied content to Google News Archive from the beginning, but indirectly, acting as the host for Washington Post content. ProQuest has around 30 newspapers available through its Historical Newspapers archive service, including most of the major national newspapers. However, according to Chris Cowan, vice president of publishing at ProQuest, its microfilm vaults contain 10,774 newspaper titles. "For a lot of newspaper publishers, the market opportunity is very limited," says Cowan. "Digitization of their backfile is not something they would undertake and library budgets won’t support it. So a lot of content that’s not digitized now, this program will bring into the free web." Cowan also stated that ProQuest would be willing to work with the leading publishers now in its Historical Newspapers program to allow Google to index content it hosts, as it does with The Washington Post. Once users find a citation they want through Google News Archive, they move to the "pay wall" at the publisher.
Heritage Microfilm’s NewspaperARCHIVE.com, a subscription service, carries some 2,900 titles digitized from Heritage’s microfilm holdings. It has fed fee-based content access to Google News Archive since the beginning of the service. But its microfilm collection numbers more than 5,000 titles. Rather than counting by title, Derek Fiscus, director of research and development at Heritage Microfilm, preferred image counts. "We have 86 million images in NewspaperARCHIVE.com," says Fiscus, "and 26 million of them are in Google News Archive now." Fiscus says that Google is making a two-pronged effort in reaching for Heritage content. "They are focused on both putting in more automated files we have done for NewspaperARCHIVE and in engaging with our publishers to expand digitizing." Fiscus even discussed situations in which searchers might find citations linking to articles in NewspaperARCHIVE that overlap with digitized content Google has negotiated from the publisher. The former would require a membership subscription for full-text viewing, while the latter would come free.
The first step will involve getting permission from the publishers, and both ProQuest and Heritage plan to assist by making introductions. Already two newspapers, the Pittsburgh Post-Gazette and the St. Petersburg Times, have Google-digitized content in Google News Archive. In return for joining the program, newspaper publishers will share in ad revenue from postings on the site. According to Jim Gerber, director of content partnerships at Google, the sharing of ad revenue applies only to publishers willing to make their content free on the web under this initiative. Publishers in Google News Archive who charge for articles will have to rely on their own ad revenue schemes when users reach their websites. Aggregators stand to benefit as well, however. According to Cowan, ProQuest will receive the "digitized files back as part of our contributions to the project. For orphan material, we have the right to the digitized copy. We can re-purpose and editorially enhance it to create new products for libraries and the research community."
Speaking of permission, Fiscus warns that even for apparently "orphaned" content, newspaper lineages and ownership rights could be very complicated. This brings up another question: What about the 2001 decision in the Tasini case that gave digital rights to freelance authors working for newspapers? [For a background, see "Tasini Case Final Decision: Authors Win," http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=17563. —Ed.] No one would want to wrestle with that problem for decades of contributions to thousands of newspapers! However, according to Cowan, there should be no problem. "With the full-page image, there should be no problem with Tasini. It’s not a separate work under the Tasini ruling."
Nevertheless, wouldn’t so much expanded access to a free service threaten sales of commercial products, such as ProQuest’s Historical Newspapers archive, to libraries? Apparently, ProQuest isn’t worried. The company expects that its strong search tools, including detailed metadata, and improved presentation of results will keep licensees loyal. For example, when users of Google News Archive scan the page of a retrieved newspaper, they will have to look for yellow highlighting to see which article(s) on the page contain their search terms. As they pass their mouse over the page, the title of the key article will light up in blue. However, if the article continues on another page, they will have to enter the page number in a box at the top of the screen and start scanning to find the rest of the piece. The screen presentation of a Google News Archive result has a row of navigation icons at the top that allow users to expand to a full page, zoom in and out, get rid of thumbnails, go back to the original article, etc. In ProQuest’s Historical Newspapers archive, by contrast, searchers can find complete articles, tied together like clippings, as well as complete pages.
Some gurus in the field consider that this development could prove devastating to traditional newspaper archiving organizations. On his blog, Stephen Arnold, longtime follower of the industry and Google in particular, stated, "[Y]ou can kiss most commercial database publishers as great investments good bye. Customers are tired of paying through the nose for ‘real’ databases. The idea is that Google makes ‘toy’ databases. Wrong. Google is collecting information and making it available with a business model that allows searching for free. Google’s business model is a big earth mover grinding down traditional media. Most traditional media mavens hear crunching but have not connected the noise with the footfalls of the GOOG … If GooNews wipes out companies in the archived news business, to whom does one complain. In short, GooNews is the start of a new era at Google. I dubbed the company Googzilla in 2005. No one paid much attention. Bet those folks at ProQuest and Newsbank are perking up now" (http://arnoldit.com/wordpress/2008/09/09/goonews-google-dooms-some-commercial-database-publishers). Arnold adds remarks indicating that he’s willing to be proved wrong, but only with facts.
Some experience seems to back up ProQuest’s confidence. A year ago, The New York Times, probably the bestselling title in ProQuest’s Historical Newspapers, opened up its complete archives to the open web. The exception was 1923 to 1986, for which users could get hosted search access through ProQuest but not full article delivery. [For a background, read "Demise of TimesSelect Deals Blow to Pay-for-News and Alters Access to Archives," http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=39678. —Ed.] Information industry and library listservs were all abuzz about the destructive impact this would have on ProQuest’s sales. However, Cowan reports very firmly that there was "no impact whatsoever. The title has continued to grow nicely. And there was no fallout from making [The] Washington Post searchable on the web and the same for the Los Angeles Times. It’s a little sliver of product via the web. Our product has maintained its value."
ProQuest also expects to continue its commitment to microfilming. Rod Gauvin, senior vice president of publishing at ProQuest, says, "The open web program is about access to content and has no impact on preservation, where microfilm is the ‘gold standard.’ Microfilm is a technology-neutral format, so no matter the state of future technology, anything preserved on it can be read and stored effectively. It’s an essential for preserving local history and culture, as well as the world’s scholarship."
ProQuest boasts of its "pristine master film copies" and says that its "high level of microfilm quality allows for the creation of better scanned images, which will ultimately deliver more accurate OCR results for users." However, Cowan assures us, ProQuest has no intention of opening up the newspaper files on its newly acquired Dialog service, where the indexing would be all digital rather than derived through OCR (optical character recognition).
Again, experts in the field have their doubts. Iris Hanney, founder and president of Unlimited Priorities Corp., a management consultant to information industry firms, including newspaper archiving clients, doubts that any OCR’ing of "newspapers, especially older ones, would yield a quality level higher than 70% accuracy." And, unlike the situation with OCR indexing of digitized books in Google Book Search, newspaper articles are not only often poorly printed on poor-quality paper, they are also short. Missing a search term on one page of a book still leaves all the other pages in the book for retrieval. Missing a search term in a newspaper article could mean missing the newspaper’s coverage of a story entirely.
Gerber countered, "We have invested quite heavily in a bunch of technologies to improve OCR for quite some time. Our image clean-up is highly visible in Google Book Search today with books from the 1800s showing very clean page images—no yellowing, dots removed, etc. We are leveraging some of that technology in the newspaper initiative. It is a little more challenging. If we use clean-up too much, we could strip important data as well." Still, Gerber expects quality to improve over time. "It will take lots of tuning and experimentation to optimize. What you see today in terms of image and OCR quality will improve over time even for the material that’s already there on Google News Archive." Gerber also indicated that Google would offer users the opportunity to identify unreadable content and/or help fix it, as it does with Google Book Search. "I’m not sure on the timing, but I’m sure we will."
Clearly this is only the beginning. The Google announcement of the initiative indicated that, in time, the newspaper content would integrate into the main Google service. Despite the dire predictions of some, Hanney judged, "This doesn’t rattle the foundations of the world as we know it. It supplements it. People need to stop being scared of Google. It’s just forcing all of us to be better. We need to partner with them."