With an average of nearly a terabyte of data preserved every month, the Wellcome Library, a medical-related archive (part of the Wellcome Trust charitable foundation), pays particular attention to the issues of data growth and obsolescence. Located just a stone’s throw from the British Library, the Wellcome Library contains a huge collection of material, much of it in digital form. This material provides insight into health and social issues in the U.K. and beyond from the medieval period to the present. Ironically, it’s located on one of the most polluted roads in the region.
The library’s collection includes records of the health effects of airship raids on London 100 years ago and the effect of World War II bombings on household cleanliness in the city. Recently, it began digitizing the health reports of medical officers from every borough in London from 1850 to 1974. These reports include details on tuberculosis testing from 1914 to 1916, the number of houses destroyed in bombing raids in World War II, the cost of building public toilets, and the number of rabid dogs by borough.
Preserving the History of DNA Research
The Wellcome Library also recently digitized documents by genetics pioneers James Watson, Francis Crick, Maurice Wilkins, and Rosalind Franklin, which are held at the University of Cambridge’s Churchill College. Their collective work on the structure of DNA won the Nobel Prize in Physiology or Medicine in 1962, although Franklin’s death in 1958 meant she could not be honored (Nobel Prizes are not awarded posthumously).
The collection comprises more than 1 million pages of original notes, letters, sketches, essays, and photographs. There is also a digitized version of Photograph 51, Franklin’s X-ray of a strand of DNA that contributed to Crick and Watson’s discovery of its double helix shape.
Franklin was an expert X-ray crystallographer. Incidentally, her discovery is the subject of a play, Photograph 51, which recently ran in London’s West End with Nicole Kidman in the lead role.
Help From Preservica
The DNA-related material adds to the large volume of data being loaded into the Wellcome Library’s digital archive. The archive runs largely on systems from specialist provider Preservica, with which the library has a commercial relationship. According to the Wellcome Library, production and development software provided by Preservica is integrated into systems operated by the Wellcome Trust.
The Wellcome Library says its collection is held on a Preservica Enterprise Edition digital preservation platform. It includes 85,000 items, such as books, posters, paintings, and videos. On average, 11,000 users per month view an item. Preservica’s software is used to manage and store the library’s digitized and born-digital collections. Dave Thompson, the Wellcome Library’s digital curator, says the digital content is stored locally. As of November 2015, it is 21TB in size and “contains approximately 14 million Jpeg2000 images and about 1,000 born digital collections.”
Thompson explains, “The choice of Jpeg2000 (part1) as a master format for digitization was partly made on the basis that this format is perceived to have a long and stable life. When that format becomes obsolete, as it inevitably will, Preservica will assist the library in migrating that content into another format. The same applies to the diverse range of formats that form the born digital collection.”
He adds that the platform housing the archive has three core functions. “It provides a secure managed environment within which we can store our digital assets. It also provides a set of decision support tools that allow the library to fully understand what is held and how to manage that content. And it also provides a platform out of which content can be disseminated.” Library users have no direct access to the platform or its content, “which is good from a data security perspective,” says Thompson.
‘Future-Proofing’ the Wellcome Library
The Wellcome Library also uses the Goobi open source software to track and manage digitized content, while metadata and page layout, formatting, and tagging software are used to provide access to digital content and to make it searchable. Nevertheless, backup or replication, as every old IT hand knows, is the key to preservation.
According to Thompson, the Wellcome Library does not back up digital content held in the Preservica system. “In reality the [21 terabytes] of data is too large a body to back up on a nightly basis.” Instead, he says, the library works with the Wellcome Trust’s IT department “to ensure that data is included in the replication strategy that is part of the Trust’s overall IT strategy. This means that, in real time, live data is replicated to two offsite storage nodes. Thus we have data security and the comfort of being able to restore content in the event of bad things happening.”
As for dealing with potential obsolescence, the Wellcome Library uses lifecycle management to help address digital preservation issues. For example, it allows for the identification of all individual file formats, says Thompson. “This supports decision making around what formats are current and viable, and decision-making around which formats may be obsolete. Definitions of obsolescence vary but preservation interventions are designed to ensure that data remains in a form that is accessible.” Support tools are available that can migrate obsolete formats to more current ones, a process that may be automated by using workflow software.
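To make the idea of a format inventory concrete, here is a minimal Python sketch. It is not Preservica's software or the library's workflow. The file names, the `AT_RISK` list, and both function names are hypothetical, and formats are guessed from file extensions only, whereas preservation tools typically identify formats from the file contents.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical list of extensions a curator has judged to be at risk.
AT_RISK = {".wpd", ".wks", ".pcx"}


def inventory_formats(filenames):
    """Count files per lowercase extension; files without one go under "(none)"."""
    return Counter(
        PurePath(name).suffix.lower() or "(none)" for name in filenames
    )


def migration_candidates(inventory, at_risk=AT_RISK):
    """Return the at-risk formats present in the inventory, with file counts."""
    return {ext: count for ext, count in inventory.items() if ext in at_risk}


# Hypothetical holdings, for illustration only.
holdings = ["plate_001.jp2", "plate_002.JP2", "minutes.wpd", "ledger.wks", "README"]

inventory = inventory_formats(holdings)
candidates = migration_candidates(inventory)

print(inventory)
print(candidates)
```

The inventory answers the question of what is held, and the candidate list is the input a migration workflow would act on. Deciding which formats belong in the at-risk list remains a curatorial judgment.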
These and other software tools and techniques are being used to, as the Wellcome Library puts it, “future-proof” the digital archive. It’s certainly clear that the library is churning out large amounts of digitized data. “Figures for the production of digitized content can vary over time,” says Thompson. “Over the summer of 2015 we were ingesting over a terabyte per month [and] peaking in July at 1.43 terabytes.” The average volume of data ingested per month over the past six months is 0.88TB. According to Thompson, this equates to 1,671,200 individual files. However, he says, “the key measure of success for the library’s overall digitization strategy is not the volume of content ingested but the number of items that can be made available online to library users.”
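For a rough sense of scale, the figures quoted above can be combined in a few lines of Python. This is a back-of-the-envelope sketch under stated assumptions: a terabyte is taken as 10^12 bytes, the 1,671,200 files are assumed to correspond to one average month of ingest (the article does not say so explicitly), and the ingest rate is assumed to stay constant. The variable names are illustrative.

```python
TB = 10**12  # assumption: decimal terabyte

monthly_ingest_bytes = 0.88 * TB   # average monthly ingest quoted in the article
files_per_month = 1_671_200        # assumed to match one average month
archive_bytes = 21 * TB            # archive size as of November 2015

# Average size of an ingested file, in bytes.
avg_file_bytes = monthly_ingest_bytes / files_per_month

# Months needed to add another 21TB at the same ingest rate.
months_to_double = archive_bytes / monthly_ingest_bytes

print(f"Average file size: {avg_file_bytes / 1000:.0f} kB")
print(f"Months to add another 21TB: {months_to_double:.1f}")
```

Under these assumptions the average ingested file is roughly half a megabyte, and the archive would take about two years to double in size.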