Thomson Reuters Tackles Open Access Datasets With Data Citation Index
Nancy K. Herther
Posted On October 29, 2012
This month, Thomson Reuters began a soft launch of its new Data Citation Index, which is intended as “a comprehensive view of scholarly research bringing research data into the same arena as the published literature it supports. In combination with other resources available on the Web of Knowledge platform, researchers can access critical information from the leading scholarly journals, books, and conference proceedings and now data.” In an interview with NewsBreaks, Chris Burghardt, the company’s vice president for product and market strategy, noted that “Data Citation Index offers a significant value particularly within the broader context of the Web of Knowledge.”
The company has announced webinars for early November 2012 to allow current customers as well as potential users to see the product first-hand as it has been integrated into the larger suite of citation products. Trial access is being made available “so that potential users can spend time reviewing the product and evaluating internally before making a purchasing decision.”
An Unmet Need
“We found that there was an unmet need for the discovery of research data,” says Burghardt, “and a lack of methods and standards for citing and awarding credit for research data.” Believing it a good fit for the Web of Knowledge platform, Thomson Reuters focused on three key aspects of today’s data problem:
- Data Discovery. “Currently there is no single, comprehensive resource for researchers to easily access data that others have produced, presented as a valid, critical component, supporting research and publication,” says Burghardt. By pulling together information on data from multiple global repositories and integrating it with the search features of Web of Knowledge, the product provides “a single destination to discover data.”
- Data Attribution. Researchers want their work attributed to them, but no widely accepted standard for citing data sets exists. Burghardt notes: “When a researcher uses another’s data, they are unclear how to attribute that work to its creator.”
- Data Structure. The wide variance in how data has been structured creates another hurdle to overcome. “This can make locating and understanding data challenging as different ways of designating data records, sets, studies, etc., are not consistently applied.”
Burghardt says Thomson Reuters' more than 50 years of experience with citation indexes and other products has given the company the “core competencies” to manage complicated metadata. “With Data Citation Index, Web of Knowledge now covers the top 5 forms of research: journals, books, conference proceedings, patents, and datasets.”
Providing Both a 'Predictive and Retrospective View'
Especially with the burgeoning Open Access movement, the availability of research datasets is growing quickly. But to date, there has been no associated growth in indexing and discovery tools to make these resources easily findable. “We also see potential impact for the Data Citation Index with organizations that may not rely as heavily on traditional scholarly literature as research institutions and universities,” says Burghardt. “For example, fast-breaking corporate R&D departments understand the importance of using journals and conference proceedings to help guide their development activities and see the value of using Web of Knowledge to help inform product roadmaps and collaborations. Now the addition of the Data Citation Index provides the ability to pinpoint the underlying data behind research well before the findings make their way into a publication. The combination of these resources provides both a predictive and retrospective view into the most important discoveries being made in their field, which is very powerful.”
Structured as a subscription product within the Web of Knowledge platform, the product is leased with access to the entire dataset. “Factors such as the size of the subscribing institution and their current Web of Knowledge holdings are considered in their subscription fee to ensure the cost of the Data Citation Index is scaled appropriately to their institution,” says Burghardt.
Partner Repositories—A Virtual Who’s Who
The list of participating organizations and data centers is highly impressive and crosses disciplinary boundaries from arts and humanities to social sciences to STEM (science, technology, engineering, and mathematics). “Before we began development on the Data Citation Index, we had collaborative discussions with key contacts at data repositories, librarians, and researchers,” says Burghardt. “We discussed the problems we were trying to solve, provided background information to set the stage, then walked through a prototype with this audience of mixed stakeholders to ensure not only our concept but our execution was on the right path.” New repositories will be added weekly as they are identified and qualified for inclusion.
Leave it to Thomson Reuters to have a well-developed structure for both repository identification and selection. As with its journal policies, it has established a process to continually review coverage and “repositories now covered are monitored to ensure that they remain available and are maintaining high standards and a clear relevance to the Data Citation product,” says Burghardt. “Many factors are taken into account when evaluating repositories for coverage, ranging from both qualitative and quantitative. The repository's basic publishing standards, its editorial content, the international diversity of its authorship and the citation data associated with it are all considered. No one factor is considered in isolation, but by combining and interrelating the data, our editors are able to determine the repository’s overall strengths and weaknesses.”
The initial line-up of repositories includes the Inter-university Consortium for Political and Social Research (ICPSR), the University of York's Archaeology Data Service, the University of Wisconsin’s BioMagResBank, Rutgers' Nucleic Acid Database, the Department of Energy's Oak Ridge National Laboratory data, NOAA’s Paleoclimatology data, the U.K.’s Office for National Statistics, and the National Center for Biotechnology Information’s Gene Expression Omnibus, along with far too many others to list here. At launch, Burghardt estimates that the index includes “approximately 2 million records from 70 repositories with plans for 500,000 new records and approximately 40 new repositories added per year.”
The index not only fills a nagging gap in access for researchers but also prods research agencies to focus on discovery. ICPSR Director George Alter notes that “the Data Citation Index promises to solve one of the key problems as we move toward more open access to data: How do data producers get credit for their important contributions? With the Data Citation Index, researchers who produce data will have a tool showing the impact of their work in terms of the number of publications using their data as well as the citations to those publications.”
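The two measures Alter mentions, the number of publications using a data set and the citations those publications receive, can be illustrated with a small Python sketch. It is only an illustration of the idea: the `Publication` record, the `dataset_impact` function, and the sample identifiers and counts are all hypothetical and do not reflect how Thomson Reuters actually stores or computes these figures.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Publication:
    """Hypothetical record for a paper that draws on one or more data sets."""
    title: str
    datasets_used: Tuple[str, ...]
    times_cited: int


def dataset_impact(dataset_id: str, publications: List[Publication]) -> Dict[str, int]:
    """Count the publications using a data set and sum the citations to them."""
    using = [p for p in publications if dataset_id in p.datasets_used]
    return {
        "publications": len(using),
        "citations": sum(p.times_cited for p in using),
    }


# Invented sample data, for illustration only.
PUBLICATIONS = [
    Publication("Paper A", ("survey-001",), 12),
    Publication("Paper B", ("survey-001", "climate-007"), 3),
    Publication("Paper C", ("climate-007",), 0),
]

print(dataset_impact("survey-001", PUBLICATIONS))
```

In this invented example, the data set labeled `survey-001` is used by two papers that together have 15 citations, which is the kind of summary a data producer could point to when seeking credit.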
Data Citation Index captures “all available metadata for the data repositories we index. In many cases, this available metadata is very granular and the repository is broken into a variety of child data types (studies, sets). In other instances the content will only appear as a single repository record.” This is due to the lack of clear standards for metadata. In these cases, “because the content in the repository is so critical to Web of Knowledge users and because the repository is working with us to implement a more consistent data structure, the data would be made available within the Data Citation Index as the repository formatting work is underway.”
Given the lack of standards in this area, the index builds its bibliographic records and cited references for digital research from the repositories' existing descriptive metadata. The company is also committed to working with the scholarly community to promote and develop standard citation formats for digital research records.
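To picture the difference between a granular repository and one that appears as a single record, here is a minimal Python sketch. It assumes a simple three-level hierarchy (repository, study, data set) based on the record types named above; the `IndexRecord` class, the `build_records` function, the metadata field names, and the sample repositories are hypothetical and are not the index's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IndexRecord:
    """Hypothetical index entry built from a repository's descriptive metadata."""
    record_type: str                    # "repository", "study", or "data set"
    title: str
    parent_title: Optional[str] = None  # None for the top-level repository


def build_records(repo: Dict) -> List[IndexRecord]:
    """Flatten one repository's metadata into a list of index records."""
    records = [IndexRecord("repository", repo["title"])]
    for study in repo.get("studies", []):
        records.append(IndexRecord("study", study["title"], repo["title"]))
        for data_set in study.get("data_sets", []):
            records.append(IndexRecord("data set", data_set["title"], study["title"]))
    return records


# Invented examples: one repository with granular metadata, one without.
granular = {
    "title": "Example Social Science Archive",
    "studies": [
        {"title": "Household Survey",
         "data_sets": [{"title": "Wave 1"}, {"title": "Wave 2"}]},
    ],
}
coarse = {"title": "Example Lab Archive"}

for record in build_records(granular) + build_records(coarse):
    print(record.record_type, "|", record.title, "|", record.parent_title)
```

The granular repository yields one repository record, one study record, and two data set records, while the coarse one yields only a single repository record, mirroring the two cases Burghardt describes.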
Solving a 'Pervasive Pain Point'
“We did a pre-launch demo of the Data Citation Index at ALA in June and a launch demo at the Frankfurt Book Fair earlier this month,” Burghardt recounts. “We’ve had a very positive reaction from the market, which has been exciting. The Data Citation Index is really the first step towards solving a pervasive pain point in the scholarly community when it comes to discovering, using, and citing data.”
Today, Thomson Reuters estimates that more than 20 million researchers rely on the Web of Knowledge platform. This product couldn’t be better timed: the number of Open Access resources is growing steadily, and existing discovery tools are clearly inadequate. Using the web as an access point today is akin to navigating without a compass. However, the value of this product goes even deeper. Scientific data is essential not only to discovery but also to progress.
As the movement toward Science 2.0 grows, efforts led by Stephen Friend and Sage Bionetworks are being made to tease out competitive, sensitive, proprietary corporate/institutional data underlying products or services from the broader category of data that is “fundamentally non-commercial” yet essential to scientific progress. By making these types of data easier to find and use, we will hopefully be able to speed the delivery of solutions, cures, and information that bring about a better world. “Data Citation Index is designed to support access to data sets that are maintained on data repositories, which are primarily Open Access,” says Burghardt. “We have not had discussions with Sage Bionetworks, but would welcome the opportunity to do so.”
“The more we understand about science and its complexities,” notes John Wilbanks of Creative Commons, “the more important it is for scientific data to be shared openly. It’s not useful to have ten different labs doing the same research and not sharing their results; likewise, we’re much more likely to be able to pinpoint diseases if we have genomic data from a large pool of individuals.” Perhaps we are starting to see this dream become a reality.