Last week, nearly 300 people gathered in Alexandria, VA, for the 9th annual meeting of the National Digital Information Infrastructure and Preservation Program (NDIIPP) and the National Digital Stewardship Alliance (NDSA). Organized by the Library of Congress (LC), #digpres2013 brought scores of archivists, librarians, and information specialists together to explore “solutions to the challenges of stewarding digital content over the long-term.” The presentations displayed an astounding array of ingenuity, offering tools useful to anyone concerned with preserving the digital record, from high literature to news of the Kardashians.
The main meeting began on July 23, with opening remarks by Bill LeFurgy (@blefurgy), digital preservation manager at the LC. He focused on two trending topics that were addressed during conference sessions: 1) personal digital preservation, and 2) democratizing digital stewardship. The two presentations that followed certainly delivered. First came Hilary Mason (@hmason), chief scientist at bitly, who began her talk on Big Data with a session title that barely fit on the screen: How engineers think about preservation (when they think about it at all).
Tapping Big Data
Working at bitly, Mason is intrigued by a world where it’s cheaper to store data than to throw it away. Her personal definition of Big Data is data that requires specialized machines to analyze; the machines are needed to make the data useful to the human mind. This is only possible today because of the reduced time it takes a machine to return an answer to a query: where Gen 1 machines were used to count, Gen 2 machines count things cleverly.
To demonstrate, Mason introduced several apps, such as forecast.io, which combines National Oceanic and Atmospheric Administration (NOAA) data with geolocation to forecast the weather at the exact spot where you are standing. Other projects highlighted during Mason’s presentation demonstrated how data can be used for economic and social benefit. For this to occur, we need to make preservation not a by-product of research but a primary goal at the outset: How do we best respect data? Are there best practices startup companies should consult?
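Mason didn’t walk through code, but the idea behind such apps is easy to sketch: feed your coordinates to a forecast service and read back hyperlocal conditions. A minimal sketch in Python, assuming forecast.io’s documented endpoint pattern and a hypothetical API key (the coordinates here are Alexandria, VA):

```python
import json
import urllib.request

API_KEY = "YOUR_FORECAST_IO_KEY"  # hypothetical; forecast.io issues keys per developer

def local_forecast(lat, lng):
    """Fetch the forecast for one exact coordinate pair."""
    url = "https://api.forecast.io/forecast/%s/%s,%s" % (API_KEY, lat, lng)
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # 'currently' describes conditions at this point, right now.
    return data["currently"]["summary"], data["currently"]["temperature"]

print(local_forecast(38.8048, -77.0469))  # the spot where you are standing
```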
Mason noted that most data archived for a project may never be needed. Are 365 days of data required to interpret what has taken place and to forecast trends, or will 364 do? The value is in the aggregate. What we need are good strategies for deciding what to keep, so that we can build a cost-effective infrastructure to access what we store today. For example, the LC cannot afford to provide access to the archived Twitter feed. Preservation is useless if there is no access, so our efforts have to encompass both aspects of digital archiving. (In her spare time, Mason developed bookbookgoose to protest the algorithm that Amazon uses to present titles.)
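Her “value is in the aggregate” point translates directly into a retention strategy: keep the summary, not every raw record. A minimal sketch, using hypothetical click-log events, of aggregating before archiving:

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events as a link-shortening service might log them.
raw_events = [
    ("2013-07-23T14:02:11", "http://example.org/a"),
    ("2013-07-23T14:05:40", "http://example.org/b"),
    ("2013-07-24T09:12:03", "http://example.org/a"),
]

# Archive clicks per URL per day; the raw log can then be discarded.
daily_counts = Counter(
    (datetime.fromisoformat(ts).date(), url) for ts, url in raw_events
)
for (day, url), n in sorted(daily_counts.items()):
    print(day, url, n)
```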
The Value of Metadata
Sarah Werner of the Folger Shakespeare Library began her talk by noting that we need to make the past accessible, and sometimes that means disembodying the past to preserve it. While this is certainly not a new concept, she showed how the disposability of church indulgences, saved by the individuals who bought them and discarded by their heirs, led to their reuse as endpapers and spine liners, for example.
A more recent example of how we can connect to the past using today’s technology is the Library of Aleph (@libraryofaleph), a Twitter account that posts the names of prints and photographs in the LC collection related to the Civil Rights struggle. These are not the images themselves but the captions, disconnected from those images. It is the LC’s metadata that makes the collection discoverable, so that any individual can create something powerful from it.
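The bot’s actual pipeline wasn’t described, but pulling caption text out of LC metadata is straightforward; here is a minimal sketch using the loc.gov JSON interface (the query and field names assume the current public API, not whatever the Library of Aleph actually uses):

```python
import json
import urllib.request

# Ask the Prints & Photographs pages for JSON instead of HTML.
url = "https://www.loc.gov/photos/?q=civil+rights&fo=json"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# Each result's title is the caption, detached from its image.
for item in data["results"][:10]:
    print(item.get("title", ""))
```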
Dr. Micah Altman, director of research and head scientist of the Program on Information Science for the Massachusetts Institute of Technology (MIT) Libraries, announced the 2014 National Digital Stewardship Agenda, declaring that effective digital stewardship is vital. Issued annually, the agenda provides “funders and executive decision-makers insight into emerging technological trends, gaps in digital stewardship capacity, and key areas for funding, research and development to ensure that today's valuable digital content remains accessible and comprehensible in the future.” The agenda identifies high-impact opportunities for advancement within three categories: state of the art, state of the practice, and state of the community. The 2014 agenda is now available.
Innovative Workflow Tools
The final formal session of Day 1 was a panel of speakers who addressed the topic of Creative Approaches to Content Preservation. Anne Wooten recounted how she and her partner, Bailey Smith, turned a master’s thesis project (University of California, Berkeley School of Information) into a business helping journalists and oral history collections by developing lightweight, web-based speech-to-text software that makes robust searching of audio possible. A tool that is fun to use and slots into existing workflows let those in the room see the immediate value for personal digital preservation (e.g., automatic transcription).
Individuals and small organizations responsible for digitizing will find this tool, layered atop existing systems, quite effective. Additional information about the software and the projects that use it is available on the software’s website.
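The software itself wasn’t shown in code, but the underlying idea is simple: once speech is transcribed with timestamps, audio becomes keyword-searchable. A minimal sketch with hypothetical transcript segments:

```python
# Each segment pairs recognized text with its start time (in seconds),
# so a keyword hit points back to the exact moment in the recording.
segments = [
    (0.0, "my family moved to Detroit in 1952"),
    (12.4, "the plant closed and the neighborhood changed"),
    (31.9, "we saved every recording we could find"),
]

def search(transcript, keyword):
    """Return (start_time, text) for segments mentioning the keyword."""
    kw = keyword.lower()
    return [(t, text) for t, text in transcript if kw in text.lower()]

for t, text in search(segments, "recording"):
    print("jump to %.1fs: %s" % (t, text))
```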
Travis May of the Federal Reserve Bank of St. Louis announced the addition of tens of thousands of data series from the Organisation for Economic Co-operation and Development (OECD) to the Federal Reserve Economic Database (FRED), bringing the total in this free online database to more than 140,000. FRED is an example of how an archival database must accommodate change: GDP figures are revised over time, and the metadata must reflect this.
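FRED’s public API makes that revision history explicit: observations carry “real-time” periods, so you can ask for a series as it looked on a past date. A minimal sketch, assuming a (free) registered API key:

```python
import json
import urllib.request

API_KEY = "YOUR_FRED_API_KEY"  # free, but registration is required

def observations(series_id, realtime_start, realtime_end):
    """Fetch a series as it appeared during a real-time window, so
    earlier vintages of revised figures (e.g., GDP) are recoverable."""
    url = (
        "https://api.stlouisfed.org/fred/series/observations"
        "?series_id=%s&api_key=%s&file_type=json"
        "&realtime_start=%s&realtime_end=%s"
        % (series_id, API_KEY, realtime_start, realtime_end)
    )
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["observations"]

# GDP as it was published in July 2013, before subsequent revisions.
for obs in observations("GDP", "2013-07-01", "2013-07-31")[-4:]:
    print(obs["date"], obs["value"])
```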
For Cal Lee, a professor at the University of North Carolina at Chapel Hill, the focus was on contrasting two main acquisition paths used by archives to build collections: One is the systematic transfer between the producer and the archive where there is some coordination between the two; the other is “dealing with whatever you get.”
In the latter, the archive has little say in how the materials are packaged and transferred; substantial guesswork is needed to describe materials post-transfer. “The BitCurator Project is an effort to build, test, and analyze systems and software for incorporating digital forensics methods into the workflows of a variety of collecting institutions,” according to Lee. The “project is a joint effort led by the School of Information and Library Science at the University of North Carolina, Chapel Hill (SILS) and the Maryland Institute for Technology in the Humanities (MITH) to develop a system for collecting professionals that incorporates the functionality of many digital forensics tools.”
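BitCurator builds on forensic tools that report their findings as DFXML, which downstream workflows can parse with ordinary XML libraries. A minimal sketch, assuming a report produced by a tool such as fiwalk and its usual fileobject layout (the namespace and file name here are assumptions):

```python
import xml.etree.ElementTree as ET

# Namespace commonly used by fiwalk-style DFXML output (assumed here).
NS = {"d": "http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML"}

def inventory(dfxml_path):
    """Yield (filename, md5) for each file recorded in a DFXML report."""
    tree = ET.parse(dfxml_path)
    for fo in tree.getroot().iter("{%s}fileobject" % NS["d"]):
        name = fo.findtext("d:filename", default="?", namespaces=NS)
        md5 = next(
            (h.text for h in fo.findall("d:hashdigest", NS)
             if h.get("type", "").lower() == "md5"),
            "?",
        )
        yield name, md5

for name, md5 in inventory("floppy.dfxml"):  # hypothetical report file
    print(md5, name)
```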
Keep Your Eye on the Data
Jason Scott of ArchiveTeam.org, “a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage,” is concerned about corporations running our online world. This preservation activist group has saved at-risk web content by downloading files from sites slated to disappear, often with little warning, due to shutdowns or mergers: the Xanga community, Myspace, SnapJoy, FormSpring.me, and Posterous (purchased by Twitter in 2012), for example. His advice: Keep an eye on where you’re putting stuff.
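For individuals, that advice can be as simple as scripting periodic snapshots of your own hosted content rather than trusting a service to persist. A minimal sketch with hypothetical URLs:

```python
import urllib.request
from datetime import date
from pathlib import Path

# Hypothetical pages you control on third-party services.
MY_PAGES = [
    "https://blog-host.example.com/users/me/posts",
    "https://photo-host.example.com/users/me/albums",
]

# One dated folder per snapshot run.
backup_dir = Path("snapshots") / date.today().isoformat()
backup_dir.mkdir(parents=True, exist_ok=True)

for url in MY_PAGES:
    # Save each page under a filesystem-safe name.
    safe_name = url.replace("https://", "").replace("/", "_")
    data = urllib.request.urlopen(url).read()
    (backup_dir / safe_name).write_bytes(data)
    print("saved", url, "->", backup_dir / safe_name)
```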
Day 1 of the conference ended with a series of lightning talks, described in detail in a handout from the LC, followed by poster and demo sessions.
Making Resources Accessible
Lisa Green of Common Crawl, “a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone,” started Day 2 with a review of the LC’s mission statement: to make its resources available and useful to Congress and the American people and to sustain and preserve a universal collection of knowledge and creativity for future generations. While many archivists place the emphasis of their work on preservation, Green argued that the more important part of the LC’s mission is to make resources available and useful.
According to Green, the increased capacity of storage has changed how we preserve, but not why. Paralleling Mason’s presentation the day before, Green’s talk concentrated on machine-scale analysis of data; the usefulness of the data lies in its analysis. Her organization’s data resides in public data sets in the Amazon cloud (AWS), which anyone with a web connection can access for free and analyze using Amazon’s compute capabilities. Amazon has lowered the barrier for anyone to run machine-scale analysis.
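In practice, machine-scale analysis starts with iterating over the crawl’s WARC files. A minimal sketch, assuming one segment already downloaded from the public dataset and the third-party warcio library (one common choice, not anything Common Crawl mandates):

```python
from collections import Counter
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Count HTTP responses per host in one downloaded crawl segment.
hosts = Counter()
with open("segment.warc.gz", "rb") as stream:  # hypothetical local copy
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            if url:
                hosts[urlparse(url).netloc] += 1

for host, n in hosts.most_common(10):
    print(n, host)
```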
Designed for Interoperability
Emily Gore described the work of the Digital Public Library of America (DPLA). Most people think of DPLA as a portal for discovering exhibits (of images) by topic, location, and time. But it is more than that: it is a platform to build upon, with free data available for download and standards that allow for interoperability across collections. The Open Library License, a tool to grant public noncommercial online access to copyrighted material, is one DPLA hopes gains momentum, carrying the concept of checking out a book from the library into the digital age.
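The platform side is concrete: DPLA exposes its aggregated metadata through a documented REST API. A minimal sketch, assuming a (free) API key:

```python
import json
import urllib.request

API_KEY = "YOUR_DPLA_API_KEY"  # issued free of charge on request

# The same metadata behind the portal, queryable by keyword.
url = "https://api.dp.la/v2/items?q=civil+rights&api_key=" + API_KEY
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print(data["count"], "matching items")
for doc in data["docs"][:5]:
    print(doc["sourceResource"].get("title"))
```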
Gore foresees a fleet of “Scannebagos” zipping across the nation, fully equipped to pull into a community, scan materials to create local historical collections, and make those collections instantly searchable in the DPLA. The official launch event for the DPLA, postponed out of respect for victims of the Boston Marathon bombing, has been rescheduled for Oct. 24–25, 2013.
In the panel session “Green Bytes: Sustainable Approaches to Digital Stewardship,” moderator Joshua Sternfeld of the National Endowment for the Humanities set the stage. He introduced the topic by reminding those in the audience that “a comprehensive examination of digital environment sustainability requires an interdisciplinary perspective that merges material and access needs, and brings together groups that too frequently have been content to remain in isolation.” The panelists each brought insights into the topic:
- David Rosenthal of the LOCKSS Program at Stanford University framed “the problem of green digital preservation as a tension between expectation of constant online access and the need to reduce energy consumption.”
- Kris Carpenter highlighted the work of the Internet Archive “in applying creative workflows and practices.”
- Krishna Kant of George Mason University shared “several scientific and engineering breakthroughs that have the potential to minimize the carbon footprint of data centers” in his presentation on Sustainability Issues.
Among the most interesting comments during this session was the notion that we could power IT with renewable energy rather than energy drawn from the grid. Even individuals can make a difference by de-duplicating their files and compressing objects within and across administrative domains. Tradeoffs are required, such as fidelity to the original vs. cost (reduced representation), but content creators can send links instead of content (avoiding local copies) and purge obsolete, unneeded data. Perhaps 25% of the data can answer 90% of the questions. As a community, we need to invest in data science or scholarly research.
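The de-duplication point scales down to a few lines of code: hash file contents and reclaim anything stored twice. A minimal sketch over a hypothetical local archive directory:

```python
import hashlib
from pathlib import Path

def find_duplicates(root):
    """Group files under root by SHA-256 of their contents; any group
    with more than one path is redundant storage."""
    by_hash = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash.setdefault(digest, []).append(path)
    return {d: ps for d, ps in by_hash.items() if len(ps) > 1}

for digest, paths in find_duplicates("my_archive").items():  # hypothetical directory
    print(digest[:12], "->", [str(p) for p in paths])
```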
The remainder of the day was devoted to concurrent sessions and the presentation of NDSA Innovation Awards to the following recipients:
Future Steward: Martin Gengenbach, Gates Archive
Martin Gengenbach is recognized for his work documenting digital forensics tools and workflows, especially his paper, “The Way We Do it Here: Mapping Digital Forensics Workflows in Collecting Institutions” and his work cataloging the DFXML schema.
Individual: Kim Schroeder, Wayne State University
Kim Schroeder is recognized for her work as a mentor to future digital stewards in her role as a lecturer in Digital Preservation at Wayne State University, where she helped establish the first NDSA Student Group, supported the student-led colloquium on digital preservation, and worked to facilitate collaboration between students in digital stewardship and local cultural heritage organizations.
Project: DataUp, California Digital Library
DataUp is recognized for creating an open source tool uniquely built to assist individuals aiming to preserve research datasets by guiding them through the digital stewardship workflow process from dataset creation and description to the deposit of their datasets into public repositories.
Organization: Archive Team
The Archive Team, a self-described “loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage,” is recognized both for its aggressive, vital work in preserving websites and digital content slated for deletion and for its advocacy for the preservation of digital culture within the technology and computing sectors.