Francis Maude, former U.K. minister for the Cabinet Office, describes open data as “the new raw material of the 21st century,” adding, “In the past, governments tended to leave large tracts of public sector information unanalysed and under-used due to resource constraints and a cultural unwillingness to make it available.” Today, data is still often left locked up in data repositories, the offices of researchers, or the agencies that created it.
The Open Definition characterizes open data as data that “can be freely used, modified, and shared by anyone for any purpose,” creating a commons in which anyone can participate. Such major advances as Data.gov “serve millions of people worldwide, from researchers and civic hackers, to businesses and citizens. These users have created apps, launched new products and services, and have improved transparency and openness, making the U.S. Government more accountable and responsive to the American people.” Similar efforts from the U.K., the United Nations, and others are ushering in a new age of transparency and access.
Harvard University’s Berkman Klein Center recently released the Net Data Directory, a prototype index of internet-related data, offered as a model for systems that make open data more discoverable and useful.
The OA Data Mandate
Most academic institutions, research organizations, major academic publishers, and others now require that the data underlying studies be deposited or made available. Although organizational and governmental data is a bit easier to find, most data needs are defined by subject, not by the sponsors or institutions conducting the research.
The issue, of course, is how to select the best repository to find the specific datasets that you need. They vary greatly in terms of their content, goals, methods, and access policies. Depending on the subject or research discipline, data can generally be deposited in more than one data center or repository. Each repository has its own requirements or specifications for the data that it offers, based on subject or research domain, metadata, file format and/or data structure, and the types and nature of data reuse and access policies.
The following are just a sampling of some of the larger sources of open data:
- The data repositories list from the Open Access Directory
- re3data.org, the global registry of research data repositories
- OpenDOAR, a directory of academic OA repositories
- Zenodo, a catch-all repository for researchers who lack access to other data repositories
- The Dryad Digital Repository, a resource for making the data from scientific publications discoverable, reusable, and citable (Most of the data files in Dryad are from peer-reviewed articles, theses, or dissertations.)
There are so many OA repositories that Repository 66 has taken to mapping them across the globe as one way to represent their breadth (see Figure 1). So, how can anyone reasonably keep up with available data in any subject area? Even information professionals find this a challenging task, with very little sense of comprehensiveness in the results.
Working to Corral Data on the Internet
The internet gives us an abundance of riches—and the problem of locating the best data for our needs. Finding a book in a library or mapping a physical location is simple compared to the refined, detailed requests researchers make for open datasets. For example, an open data search on “cancer and obesity” lacks needed details: What types of cancer? What types of patients (e.g., age, sex, ethnicity, or country of origin)? What parameters encompass obesity? Are there underlying problems (e.g., diabetes or genetic factors)?
People looking for data have very specific information needs. And that necessitates a well-developed database to locate particular data. Researchers have little time—and even less patience—to deal with gnarly search engines. For open data to be truly useful, it has to be easily findable at a high level of detail.
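The difference between a free-text search and a well-developed database can be sketched in a few lines of code. The snippet below is a toy illustration, not the Net Data Directory’s actual schema or API: all record names and facet tags are hypothetical. It shows how structured, faceted metadata lets a searcher express the kind of detail the “cancer and obesity” example demands, where a bag-of-words query cannot.

```python
# Toy sketch of faceted dataset search. All records and tag vocabularies
# here are hypothetical, invented for illustration only.
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    name: str
    tags: frozenset  # structured facets, e.g. "cancer:colorectal"


CATALOG = [
    DatasetRecord("Colorectal cancer and BMI, adults 50+",
                  frozenset({"cancer:colorectal", "obesity:bmi>=30", "age:50+"})),
    DatasetRecord("Pediatric asthma registry",
                  frozenset({"condition:asthma", "age:0-17"})),
    DatasetRecord("Breast cancer outcomes in diabetic patients",
                  frozenset({"cancer:breast", "comorbidity:diabetes"})),
]


def search(catalog, required_tags):
    """Return records whose facet tags include every required value."""
    return [r for r in catalog if required_tags <= r.tags]


# A detailed, structured query: specific cancer type AND obesity parameter.
hits = search(CATALOG, {"cancer:colorectal", "obesity:bmi>=30"})
```

Here the subset test (`required_tags <= r.tags`) returns only records matching every facet, so the query above retrieves one precise dataset rather than everything mentioning “cancer.” Building and maintaining such controlled facet vocabularies is exactly the metadata labor information professionals know well.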
The Berkman Klein Center’s mission is to “explore and understand cyberspace; to study its development, dynamics, norms, and standards. … We are more than an academic institution. We design. We code. We construct. Our in-house team of developers helps us translate research into action, converting raw ideas into practical tools, platforms, and organizations.” Being able to find appropriate data remains a critical challenge for medical researchers, policymakers, legal experts, citizen groups, and others.
Taking the Net Data Directory for a Spin
According to the Berkman Klein Center, the Net Data Directory “is intended to make finding useful quantitative data about a broad range of internet-related topics—broadband, cybersecurity, freedom of expression, and more—easier for researchers, policymakers, journalists, and the public.”
This free database allows users to search, sort, and filter records, and “the vast majority” of datasets are open and publicly available. However, there is currently no way to mark, save, or download records. The records include the name of the data source, a short description of the available data, and a link (see Figure 2 for a sample entry). Data sources are tagged both by geographic coverage (including global, regional, and country-level tags) and by topic. The database is still small (157 records as of this writing), and the Berkman Klein Center is actively encouraging suggestions for content.
The search results do not come up in any apparent order—certainly not by date or closeness to the topic requested. Complex searches such as “Canada AND security” or “Russia AND censorship” retrieve few results. A date is attached to each record, but it appears to be the date the entry was added to the database rather than a date associated with the dataset itself. Clicking on the title in the results list takes you to the website for the data, but the actual dataset is often buried layers below the target page.
Once you begin a search, you have few options to refine it except to further narrow the results by tags, geographic area, or subtopic (see Figure 3 for an example). The terminology also will take some tweaking. Phrase searching is apparently not well-handled by the search engine. A search for “broadband in the US” brought up many hits; however, among the top results are Azerbaijan data on internet backbones from 2010 to 2012 and a report from ESCAP (Economic and Social Commission for Asia and the Pacific) on its internet usage. I assume the database doesn’t take unstructured queries. It is clearly still a beta version, which is understandable given the project’s early stage.
A Challenge and an Opportunity for Internet Indexing
The content value of this specific database is perhaps less important than the opportunity it provides for information professionals to work with the Berkman Klein Center as it develops. This just might lead to ideas for better search systems, not only for internet-related data, but also for other subject areas. It could even help the development of internet indexing itself. Fernando Bermejo, the Net Data Directory’s founder, says his hope is that it “will help anyone interested in knowing about the current state of the Internet to find the data they need in order to make informed decisions, produce insightful research, or simply learn something new about the online world.”
A New Era for the Berkman Klein Center
The Berkman Klein Center was previously known as the Berkman Center for Internet & Society. On July 5, 2016, it got a new benefactor and a new name. Michael R. Klein, chairman of the Sunlight Foundation, donated $15 million to continue and expand the center’s work. “At a time when the opportunities and challenges of an increasingly networked world abound and digital transformations are profoundly shaping the future of society,” the press release notes, “this gift will not only provide vital core support, but will also allow the Center to start new explorations, launch innovative programs, and incubate novel collaborations both nationally and internationally.”
Open Knowledge’s Global Open Data Index estimates that only about 9% of the key government datasets it tracks are open. We have much work to do to make open data truly available and useful. With the new funding and the Net Data Directory project from the Berkman Klein Center, we can hope that this work will not only continue, but also increase. Collaborating with information professionals could be a key to this project’s success.
Open Datasets: Today’s Oxymoron in Research
Today we are lucky to be able to locate reasonable (if not comprehensive) lists of available data repositories. Gaining access to the public datasets they contain, however, is still an overly burdensome process. The effort required to evaluate each potential dataset often exceeds the patience of even the most experienced searchers.
Information professionals already know the complexities and labor involved in developing good search engines and metadata. The Net Data Directory just may be an opportunity to engage in much-needed practical research with a nonprofit partner that has a rich history and a strong commitment to innovation and quality. Information professionals, with our years of experience and focus on users, have much to offer in this process. We all need better internet search tools—especially when it comes to open data. This may be our chance.