Digitization has exploded over the last few years. The number of new projects has multiplied rapidly, and the scope of existing programs has expanded greatly. OCLC (www.oclc.org), the major library vendor, has taken a leadership role in digital preservation for years, e.g., with its CONTENTdm service (www.contentdm.com). Now it has introduced a "dark archive" storage service for protecting the high-resolution master files from which accessible digitized content files are spun. However, the $750 a year minimum price tag for OCLC’s Digital Archive service may make it vulnerable to challenges from the plummeting cost of large storage devices and "cloud computing" operations, such as Amazon Web Services.
Building on preservation efforts and digital collection management services acquired in its 2006 acquisition of DiMeMa (Digital Media Management), OCLC has constructed its new Digital Archive service to provide secure storage for large, high resolution master files and digital originals. Files can include both digitized copies of print and multimedia content as well as "born digital" content. Taylor Surface, OCLC’s global product manager for digital collection services, describes preserving files in the Digital Archive service as "like creating a microfilm master, a high resolution image of the original physical item. This is the master file from which you make all derivative, low resolution copies for public access." In the case of "born digital" content, Surface says, "the preservation copy may be a PDF of some Word document."
According to Surface, file sizes "depend entirely on format. With scanned images of newspaper pages, each page can run 50 to 100 megabytes. One hour of stereo sound for a high resolution sound file can take one gigabyte, for example, a 60-minute cassette tape from an oral history collection. A photograph of a page in a document can run around 10KB." As to what goes into a file and in what format, Surface says that "all decisions are made by collection curators before they send their content to the Digital Archive." Digital preservation advocates are constantly expressing concerns with the longevity and durability of specific "standard" formats. (For example, the Digital Preservation Coalition [DPC; www.dpconline.org], a U.K.-based group including The British Library and National Archives, just issued a report ["Technology Watch Report 08-02: Preserving the Data Explosion: Using PDF," www.dpconline.org/docs/reports/dpctw08-02.pdf], approving the use of Adobe’s PDF/A format for archiving but exhorting readers to eternal vigilance in seeking out new and better formats.) However, when we asked Surface whether the new service would handle format migration automatically, he said it would not. "We will move data to new storage systems, but we will not change data formats. So if you send us a TIFF file today, that’s what we’ll send back, but people can get a copy of their data, transform it into a new format, and put that new version back into their archive."
The secure, managed storage components of OCLC’s Digital Archive include automated maintenance, monitoring and replacing of disk storage devices, migrating content to new devices, physical security in a limited access operations facility, an information security team reviewing processes, six backup copies (five at off-site facilities and one on-site), disaster recovery procedures specific to the Digital Archive system, uninterruptible power supplies, fire suppression, redundant servers and network feeds, and ISO 9001 certification for quality assurance.
Automated quality checks and reports include manifest verification matching files received with shipping manifests for files submitted on physical media, virus checks, the creation of a "fingerprint" or "fixity key" used to verify that files have not been altered, and format verification by file extension. Monthly "health reports" will report back on regular virus, fixity, and format verification tests. The "health reports" are posted on a personal archive report portal.
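For readers who want to see what a fixity check involves, here is a minimal sketch. It assumes SHA-256 as the fingerprint algorithm and a simple filename-to-key manifest; the article does not say which algorithm or manifest layout OCLC uses, and the function names `fixity_key` and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def fixity_key(path, chunk_size=1024 * 1024):
    """Return a hex fingerprint of a file (SHA-256 is an assumption, not OCLC's stated method)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in blocks so large master files never have to fit in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_manifest(manifest, base_dir):
    """Return the names of files that are missing or whose fingerprint has changed.

    `manifest` is a hypothetical dict mapping file name -> expected fixity key.
    """
    problems = []
    for name, expected in manifest.items():
        candidate = Path(base_dir) / name
        if not candidate.is_file() or fixity_key(candidate) != expected:
            problems.append(name)
    return problems
```

A library could record the keys when it ships its files and rerun `verify_manifest` on any copy it gets back; an empty list would mean that every file is present and unaltered.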
Surface described the marketing strategy for the new service as aiming to "support cultural heritage institutions, like libraries or museum archives where the content is unique to the institutions. We’re not as much interested in ejournals or ebooks or publisher materials. If someone wanted to store their content with us, we would take it." Libraries already participating in OCLC’s CONTENTdm, a digital collection management system serving libraries, museums, and other cultural heritage institutions, will find the Digital Archive service offered as an optional feature integrated into various workflows, e.g., the CONTENTdm Acquisition Station, Connexion digital import capability, and the Web Harvesting service.
Online loading of master files, according to Surface, would be most appropriate where a CONTENTdm server is in place with workstations for scanning. Even with CONTENTdm installations, however, he thought that downloading a copy to a removable USB disk drive and shipping it to OCLC would be appropriate; in the case of non-CONTENTdm transmissions, that is the only route. According to Surface, this is the first year that the Connexion workstation software has been available as client software for tasks other than creating or augmenting WorldCat bibliographic records. "We have now added the capability to create a record in WorldCat for a digital file at a workstation and then attach the digital file to the record in WorldCat from that workstation and upload the digital file online. When it gets to OCLC, the record is updated in WorldCat and the digital file sent to CONTENTdm with the master file put into the archive."
The Web Harvesting service has also been integrated into the Digital Archive. According to Surface, "This is another workflow for gathering digital materials. It harvests websites and webpages to put them into a repository. The service is mainly used by state libraries for state agency websites. Now it will work with the Digital Archive service. Harvested HTML pages will go into CONTENTdm and then copy into the Digital Archive service in the Arcfile format."
Getting content back from the Digital Archive, according to Surface, may involve a collection administrator setting up a personal request requirement or permission to enable online access. He did indicate that there was a limit of one gigabyte per request for files downloaded online. Each file has its own URL. He thought that only "superscholars" would want access to high-resolution files.
As of press time, not all the documentation for the new service was in place on the OCLC website (www.oclc.org/us/en/digitalarchive/support/default.htm).
The OCLC press release announcing the new service advertised it as able "to keep the costs of safely storing these important files within the budget of a library’s digital program." However, that particular claim may be debatable. Charges for the new service fall into 100-gigabyte chunks, with each chunk priced at $750 a year; at 101 gigabytes, the price jumps to $1,500. In a recent issue of Macworld, a top-reviewed one-terabyte (i.e., 1,000-gigabyte) external hard disk was quoted at $380. So for the price OCLC charges for one-tenth of a terabyte, a library could purchase nearly two terabytes of disk storage. Though Surface indicated that OCLC would consider a volume discount if a client needed more than one terabyte of storage ($7,500), he also confirmed that these prices are annual subscriptions. There is no discount for next year or the year after that.
Meanwhile, back at the web, Amazon (www.amazon.com) has launched a new program called Amazon Web Services (AWS; http://aws.amazon.com). One component of this developer’s platform is the S3 Simple Storage Service (Amazon S3). The price for S3 storage at Amazon Web Services is 15 cents a gigabyte a month, or $1.80 a gigabyte a year, in comparison to OCLC’s $7.50 a gigabyte a year. Put another way, OCLC’s 100-gigabyte chunk at $750 a year would cost $180 a year at Amazon S3, and most probably less, since Amazon imposes no minimum. If you don’t use it, you don’t buy it. (For more information on Amazon S3, check out the overview and/or the FAQs page.)
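The arithmetic behind this comparison can be checked with a short sketch. It uses only the storage prices quoted above ($750 a year per 100-gigabyte chunk at OCLC, 15 cents a gigabyte a month at Amazon S3) and leaves out any other charges either vendor may apply; the function names are hypothetical.

```python
import math

OCLC_CHUNK_GB = 100              # storage is sold in 100-gigabyte chunks
OCLC_DOLLARS_PER_CHUNK = 750     # annual price per chunk, as quoted in the article
S3_CENTS_PER_GB_MONTH = 15       # storage price only, as quoted in the article


def oclc_annual_cost(gigabytes):
    """Annual OCLC charge in dollars: every started chunk is billed, with a one-chunk minimum."""
    chunks = max(1, math.ceil(gigabytes / OCLC_CHUNK_GB))
    return chunks * OCLC_DOLLARS_PER_CHUNK


def s3_annual_storage_cost(gigabytes):
    """Annual S3 storage charge in dollars, assuming the same volume is held all 12 months."""
    return gigabytes * S3_CENTS_PER_GB_MONTH * 12 / 100


if __name__ == "__main__":
    for size_gb in (50, 100, 101, 1000):
        print(size_gb, oclc_annual_cost(size_gb), s3_annual_storage_cost(size_gb))
```

Because OCLC bills by the chunk while Amazon bills by the gigabyte, the gap is widest just past a chunk boundary, as at 101 gigabytes.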
Too good to be true? Well, when it comes to anything as important as preserving master files of expensive and labor-intensive digitization projects, one should always kick the tires. I interviewed four experts with divergent backgrounds: a librarian at an institution that measures its archives in petabytes (i.e., 1,000 terabytes), the CEO of a digitization consulting agency for libraries, a publisher techie, and a longtime expert observer of newbies and traditionals in the information industry. All of them considered the OCLC pricing steep, and all expressed an interest in Amazon’s new offering.
Defending OCLC’s Digital Archive pricing, some with whom I spoke questioned the ease of use at Amazon S3. They thought that it would require too much technical sophistication to get the data in and out for the average librarian (assuming there is such a thing as an average librarian). True and not true. As a developers’ platform in which storage is just one component, Amazon Web Services attracts a lot of techies who busy themselves in designing widgets, fixes, interfaces, programming tools, etc. In the middle of an interview, one techie sprang onto Google and found a handful of Amazon S3 interfaces. For example, Jungle Disk software (www.jungledisk.com) costs $20, including free lifetime upgrades; can be installed on any number of machines; and—for an extra $1 per month—will add browser-based access to S3 files plus other features. Steve Arnold of Arnold Information Technologies knew of at least two startup companies building their services on the Amazon Web Services platform. He opined that, in return for offering Amazon a share in their traffic or part ownership, the companies did not even pay the low prices Amazon sets.
However, in two areas, OCLC’s Digital Archive service does surpass Amazon S3. The Amazon service will only accept online uploading of content—no physical media, no mailing in disks for large data sets, which OCLC offers its CONTENTdm users and mandates for non-CONTENTdm users. And, as to large data sets, though it will accept any number of "digital objects," Amazon S3 will accept no digital object larger than five gigabytes. For example, a commercial movie DVD (not even Blu-ray) may run over nine gigabytes, according to my techie interviewee. However, he also pointed out that one could break a large file down into smaller chunks and use a "Playlist" feature to reassemble it. OCLC, on the other hand, will take files of any size, according to Surface.
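The workaround the interviewee describes, breaking an oversized file into pieces and reassembling them from an ordered list, might look like the sketch below. It works only on local files and calls no Amazon API; the function names and the part-naming scheme are hypothetical, and the ordered list of part paths merely stands in for the "Playlist" he mentions.

```python
from pathlib import Path


def split_file(source, out_dir, part_size):
    """Split `source` into numbered parts of at most `part_size` bytes.

    Returns the part paths in order. Each part is read into memory whole,
    so a real tool handling multi-gigabyte parts would copy in smaller buffers.
    """
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(source, "rb") as src:
        index = 0
        while True:
            data = src.read(part_size)
            if not data:
                break
            part = out_dir / f"{source.name}.part{index:05d}"
            part.write_bytes(data)
            parts.append(part)
            index += 1
    return parts


def join_parts(parts, destination):
    """Rebuild the original file by concatenating the parts in list order."""
    with open(destination, "wb") as dst:
        for part in parts:
            dst.write(Path(part).read_bytes())
```

With `part_size` set just under the five-gigabyte ceiling, each part could be stored as a separate object, and the ordered list of part names would be kept alongside them for reassembly.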
In discussing OCLC’s price plan, Surface pointed out that it represented about half of what was previously charged, stemming from simplified applications and a switch to new storage technology. Clearly, librarians feel comfortable dealing with OCLC. As Surface expressed it, "For libraries, the big picture is what OCLC is doing for libraries. We are providing solutions for creating digital repositories at libraries with end-user discovery and access, as well as digital preservation and content management. Integrated workflow is the key value."
Nonetheless, the high price may draw challengers to OCLC’s Digital Archive service. Commercial challengers could include other library vendors, e.g., a digitization consulting service looking to expand its service offerings, a library outsourcer interested in reaching new libraries with something less than takeover services, or a content management system that sees itself as needing to add archiving to stay competitive with OCLC’s CONTENTdm. Any commercial challengers could use Amazon Web Services S3, along with other AWS offerings. They might even combine the online storage with licensed, dedicated terabyte disk drives as well. And not all challengers would necessarily be commercial. Librarians/techies might develop open source software tools to share with colleagues at no charge. Or they could work with library consortia to supply more supported services.
In any case, as Steve Arnold reminds us, "When it comes to archiving, you can never have too many copies." Or as the LOCKSS advocates proclaim, Lots Of Copies Keep Stuff Safe.