Open Content Alliance Expands Rapidly; Reveals Operational Details
Posted On October 31, 2005
Just a few weeks after its launch, the Open Content Alliance (http://www.opencontentalliance.org) has already added dozens of new members to its Open Library project (http://www.openlibrary.org). (For background on OCA, see the NewsBreak "Open Content Alliance Rises to the Challenge of Google Print" at http://newsbreaks.infotoday.com/nbreader.asp?ArticleID=16110.) Twenty-four new participants have joined the initial 10 founding members. All contributors have committed to donating services, facilities, tools, and/or funding. Microsoft Corp. has joined the effort with the announcement of MSN Book Search, a new mass book digitization project. (For coverage, see the companion NewsBreak, "Microsoft Launches Book Digitization Project—MSN Book Search" at http://newsbreaks.infotoday.com/nbreader.asp?ArticleID=16090.) The Research Libraries Group (RLG; http://www.rlg.org), a major library bibliographic utility, has also joined OCA, contributing its bibliographic metadata. In contrast with Google Print's close-mouthed policy toward its proprietary digitization equipment, the Open Content Alliance has released extensive details on its Scribe system, as well as other options for participants and users.
New and Future Participants
Almost all the new OCA contributors are university libraries, including Columbia University, The Johns Hopkins University, The University of Virginia, University of Pittsburgh, and several Canadian universities, as well as a cooperative project called the Biodiversity Heritage Library. The Smithsonian Institution Libraries and several botanical gardens and museums also contribute. (For a complete list of OCA contributors, see http://www.opencontentalliance.org/contributors.html.)
RLG plans to supply bibliographic descriptions to Open Content Alliance digitizing operations from the more than 48 million titles in its RLG Union Catalog (available for direct searching at http://www.redlightgreen.com). Though RLG is much smaller in membership than OCLC, the other major library bibliographic utility, its more than 150 member research libraries, archives, and museums hold collections with a breadth of subjects, languages, and content types that should assist OCA in handling archives of older, public domain material. James Michalko, president of RLG, confirmed that it plans to work with the digitization of the entire American literature collection of the University of California via the California Digital Library. Daniel Greenstein, university librarian at the California Digital Library, stated: "We are delighted to have RLG be a part of this effort. Efficient identification of digital texts, high-quality descriptions that allow them to be discovered, and a broad understanding of where to target further effort are important things that RLG can contribute."
Microsoft has joined the Open Content Alliance with an estimated $5 million promise to digitize approximately 150,000 books next year to launch its MSN Book Search service. It promises to help the OCA not only scan and digitize publicly available print materials, but also to work with copyright owners to legally scan protected materials. In making the announcement, Christopher Payne, corporate vice president of MSN Search at Microsoft, said: "With MSN Book Search, we are excited to be working with libraries worldwide to digitize and index information from the world's printed materials."
Brewster Kahle, digital librarian and founder of the Internet Archive, commented: "We are proud that MSN is working with the OCA in the shared vision of creating a better, more relevant search experience for people around the world." Initial efforts of Microsoft and the Internet Archive, host for the OCA effort, will focus on public domain material.
In joining the OCA, Microsoft has entered an alliance strongly connected with search/portal rival Yahoo!. Rumors circulate that the next potential OCA participant may even be Google. Google's Dan Cleary attended the Oct. 25 evening inaugural event and, according to our reporter, Lisa Picarille, "clapped throughout much of the presentation, but not when Yahoo! (a founding member) spoke or when the Microsoft executive was on stage." Cleary even slipped off his name tag when he examined the scanner demonstration, reportedly to avoid being bothered by the press. An OCA representative commented that "Google wants to help us." Time will tell.
The task of mass digitization is an imposing one. Susan Feldman, research vice president for content technologies at IDC, pointed out that, in her experience: "To do it right, these books are VERY carefully imaged, page by page. NO slitting the binding off and stacking them in a stack loader. Instead, you need special copying machines with good resolution and large imaging areas, and tremendous care [must be] taken to make sure the images are clear and carry the entire page, registered exactly in the center so that it looks like the book, even with penciled in comments. Since some of the older works are fragile, they are often difficult to handle without destroying them. Then the pages need to be linked and indexed so that you have an entire work stored together. Questions about the best medium for digital storage are also not entirely settled. So, it's great to have this kind of initiative."
The for-profit companies contributing hardware, software, and service support to the OCA would seem to have strong motivation to use the effort to illustrate and market their abilities to the world. This could explain why the OCA is so much more open about its equipment and its performance than Google Print. Hardware and software for digitizing and producing PDF versions of the book come from Hewlett-Packard Labs, Luratech, and Adobe; DJVU formats are done in cooperation with LizardTech Technology; and scanners are designed and manufactured by Internet Archive and Kirtas Technology.
The Scribe system used by the Internet Archive, digitizing on behalf of the OCA and its members, requires a human operator to turn pages and monitor the images. The operator places a book of almost any size in a cradle that holds it at a 90-degree angle, and a glass platen is raised and lowered to hold each page flat. The full-color images for most books are captured at around 500 pixels per inch. A single uncompressed page image runs around 20 MB, meaning that a 300-page book could run to 6 gigabytes, with 1 million books running to 6 petabytes. For this kind of massive storage, Internet Archive uses the Petabox system developed with Capricorn Technologies. Internet Archive claims its digitization process costs around 10 cents a page and takes from 30 to 60 minutes for each book, depending on length.
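The storage and cost figures quoted for Scribe can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function and constant names are hypothetical, decimal units (1 GB = 1,000 MB; 1 PB = 1,000,000 GB) are an assumption, and the 300-page book is the article's own example rather than a measured average.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# All names here are hypothetical; decimal (SI) units are an assumption.

PAGE_MB = 20                # uncompressed page image, per the article
PAGES_PER_BOOK = 300        # the article's example book length
COST_PER_PAGE_CENTS = 10    # Internet Archive's claimed cost per page

MB_PER_GB = 1_000
GB_PER_PB = 1_000_000


def book_size_gb(pages=PAGES_PER_BOOK, page_mb=PAGE_MB):
    """Uncompressed size of one scanned book, in gigabytes."""
    return pages * page_mb / MB_PER_GB


def collection_size_pb(books, pages=PAGES_PER_BOOK, page_mb=PAGE_MB):
    """Uncompressed size of a whole collection, in petabytes."""
    return books * book_size_gb(pages, page_mb) / GB_PER_PB


def book_cost_usd(pages=PAGES_PER_BOOK, cents_per_page=COST_PER_PAGE_CENTS):
    """Claimed digitization cost of one book, in U.S. dollars."""
    return pages * cents_per_page / 100


if __name__ == "__main__":
    print(f"One book:          {book_size_gb():.1f} GB")
    print(f"One million books: {collection_size_pb(1_000_000):.1f} PB")
    print(f"Cost per book:     ${book_cost_usd():.2f}")
```

Under those assumptions, the quoted 6 gigabytes per book and 6 petabytes per million books follow directly, and the claimed 10 cents a page works out to about $30 for a 300-page book.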
The interface for books at the Open Library site models a book with page turning, highlight searching, virtual highlighting, and magnifying. Some of the books will even offer an audio version, in which case one can click on "listen" on the book's page. LibriVox supplies the audio technology and a network of volunteers does the reading. A connection with Lulu.com supplies bound, print-on-demand versions of books at a user's request with an estimated average price of approximately $8 a book or about $1 for a short (100-page) black-and-white book.
Recognizing the need for disaster-proof preservation, Internet Archive replicates its digitized book collections in Amsterdam and Egypt. Recognizing that technology changes can render stored content obsolescent, it plans to move all its content to new systems every 3 years.
For more details on the entire process, its goals and results, as well as the basic principles of the OCA itself, go to http://www.openlibrary.org/details/openlibrary. You can also see the initial interface, its features, navigation, and value-added, at this site.
Despite the wonders of technological developments, as Feldman points out: "The technology has existed to create digital library collections for more than a decade. The money, the labor, and the legal problems are the touchy part." With two copyright lawsuits hanging over its head (one from The Authors Guild, the other from the Association of American Publishers), Google now faces the "touchiness" of the problem. However, the solution may still not lie in turning back the hands of time, restricting collections to works published no later than the dawn of the last century or to later works supplied only by government agencies.
At OCA's inaugural event, Brewster Kahle stated that OCA would try to target the 80 percent of books published between 1923 and 1964 that are out of copyright, then expand to include orphaned books, where the publisher and author cannot be found, then out-of-print works, and finally in-print material. He called the effort "tricky but doable." This could put a lot of pressure on participating libraries to develop ways of verifying copyright ownership.
Nonetheless, participants seem very enthusiastic. Commenting on the OCA workshops conducted for participants before the evening inaugural event, Greenstein was amazed at how quickly the corporate and not-for-profit, public institutions aligned their thinking. "It was really exciting for me to see corporate partners with a convergence of purpose around aggregation and making content available to everybody," said Greenstein. Greenstein also predicted that the OCA will develop robust governance and an independent identity fairly quickly.
The principles behind OCA affirm open access, but they also guarantee that contributing members—and other third parties—can develop their own value-added versions of material. The first tenet of the OCA principles commits the organization to "encourage the greatest possible degree of access to and reuse of collections in its archive, while respecting the rights of content owners and contributors." However, a further principle states: "Contributors will determine the terms and conditions under which their collections are distributed and how attribution should be made." OCA's ability to discourage stinginess in members, as well as the provision of inferior content, would seem to stem from still another principle stating that the OCA is "not obligated to accept all content and may give preference to more widely accessible [content]."
Greenstein stated that the hope of the OCA effort is to promote the recognition of the principle that content distributors "must compete on value added to the content, not on ownership." Opening content up to third parties will "drive innovations in service provisions, such as annotated and educational services." In the future, Greenstein hopes that publishers will recognize that "proprietary control over content is an impediment to commerce."