Elsevier continues its quest for seamless scientific research with the launch of the latest application for the SciVerse Application Marketplace, U.S. Government Dataset Search. Developed by the Tetherless World Research Constellation (TWC) at Rensselaer Polytechnic Institute (RPI), the application is free to subscribers and is available for searches across the SciVerse Hub. It provides researchers with keyword-matched results from more than 300,000 government datasets on data.gov, as well as the semantic data and related demos at the Linking Open Government Data (LOGD) portal. The datasets are diverse and include scientific topics like climate change, clean air, heart disease, and cancer.
Users will find the Datasets application in the SciVerse Application Marketplace, where more than 17 applications are now available. The application must be added to one’s profile before searching, but the profile then serves as a default setting, allowing the applications to run automatically with new searches of The Hub. Individual users, libraries, and other organizations can establish the profiles.
The use of semantic technology in the Datasets application creates a powerful search experience. The semantic technology provides relational linking of content behind the scenes, even if the structure or format of the content is different. For example, the intersection of data for toxic factory emissions and the causes and rates of mortality or birth defects for a geographic location could reveal possible cause/effect relationships to be studied by scientists. John S. Erickson, Ph.D., project lead for the Datasets application and Rensselaer research engineer, described the process of converting raw CSV file data from data.gov to Resource Description Framework (RDF), the format that powers the semantic web, stating, “the conversion process involves picking apart the tables, rows, and columns of data and creating instead, immense triple stores. The triples embody massive graphs of relationships, some inferred and some direct.” When the SciVerse Datasets application queries the data, it looks for particular patterns that match a user’s keyword search. Erickson believes the addition of semantic technologies enables a tighter relationship between researchers and data.
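To make the idea of picking a table apart into triples more concrete, here is a minimal sketch in Python. It is not the converter TWC uses; the namespace, the sample CSV, and the function names csv_to_triples and match are all invented for illustration. Each cell becomes a (subject, predicate, object) triple, and a query is a pattern in which any position may be left open.

```python
import csv
import io

# Hypothetical namespace and sample data, invented for illustration only.
BASE = "http://example.org/dataset/emissions/"

SAMPLE_CSV = """facility,county,pollutant,tons
Plant A,Albany,benzene,12
Plant B,Albany,toluene,7
Plant C,Troy,benzene,3
"""


def csv_to_triples(text, base=BASE):
    """Turn every CSV cell into a (subject, predicate, object) triple."""
    triples = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        subject = f"{base}row/{row_number}"          # one subject per row
        for column, value in row.items():
            predicate = f"{base}column/{column}"     # one predicate per column
            triples.append((subject, predicate, value))
    return triples


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


triples = csv_to_triples(SAMPLE_CSV)
benzene_rows = match(triples, predicate=BASE + "column/pollutant", obj="benzene")
```

Here benzene_rows holds the triples for the rows whose pollutant column is benzene. A production triple store would add typed values, shared vocabularies, and inference, none of which this sketch attempts.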
Rafael Sidi, vice president of product management for Elsevier’s Application Marketplace and Developer Network, said, “Using Semantic Web technologies, Tetherless World Research Constellation at Rensselaer has built innovative solutions leveraging open government datasets from Data.gov. The Dataset Search application built by Rensselaer illustrates how collaboration with the research community can lead to innovative applications that enhance scientists’ productivity.”
SciVerse’s Datasets application searches the extensive collection of metadata for data.gov records, including detailed subjects, keywords, and descriptions of the dataset records. A match gives researchers seamless links to relevant content and raw data. In the current version of the application, only the keywords from The Hub search are used to match terms, so post-search faceting will not change the results in the Datasets application.
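Keyword matching against dataset metadata can be pictured with a short Python sketch. This is an assumption about the general technique, not the application's actual code: the records, their field names, and the search function are hypothetical, and simple substring matching stands in for whatever matching the real service performs.

```python
# Hypothetical metadata records, invented for illustration only.
RECORDS = [
    {
        "title": "Sample Dataset A",
        "agency": "Agency One",
        "keywords": ["cancer", "screening"],
        "description": "Breast cancer screening rates by state.",
    },
    {
        "title": "Sample Dataset B",
        "agency": "Agency Two",
        "keywords": ["air quality"],
        "description": "Daily clean air measurements.",
    },
    {
        "title": "Sample Dataset C",
        "agency": "Agency One",
        "keywords": ["cancer"],
        "description": "Lung cancer mortality by county.",
    },
]


def search(records, query):
    """Return titles of records whose metadata contains every query term."""
    terms = query.lower().split()
    hits = []
    for record in records:
        # Combine the searchable metadata fields into one lowercase string.
        text = " ".join(
            [record["title"], record["description"], *record["keywords"]]
        ).lower()
        if all(term in text for term in terms):
            hits.append(record["title"])
    return hits


breast_cancer_hits = search(RECORDS, "breast cancer")
cancer_hits = search(RECORDS, "cancer")
```

Because only the query terms are passed in, any facet a user applies afterwards never reaches the function, which mirrors the limitation described above.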
The results from the Datasets application appear in the “My Applications” area of The Hub (see image 1). A scroll bar allows for quick scanning of relevant datasets, which are listed with a title and agency name. Clicking on a dataset takes the user to a tabular view of the datasets, still within The Hub, called “canvas” view (see image 2). This page offers additional information such as the name of the dataset, agency, description, category, keywords, and a link to the raw data. For example, a search for breast cancer led to datasets from the Departments of Health and Human Services and Agriculture, the EPA, and the Drugs and Lactation Database (LactMed), a peer-reviewed and fully referenced database of drugs to which breastfeeding mothers may be exposed.
As the application is new, there are a few quirks with the search results and interface. First, results are numbered within the application, but once on the canvas view, are not. Second, if a user clicks on dataset number “18” within the application, the canvas page still places them at the top of the list, causing unnecessary scrolling through unnumbered results. Finally, some of the Boolean logic is not fully implemented. Erickson says the application, still young, requires some ironing out. Eventually, he hopes the application can offer data visualization for a greater user experience.
Once a user clicks on the link for raw data, they leave The Hub and are redirected (through a new browser window) to data.gov for additional information (see image 3). Here users can download datasets depending on available formats such as PDF, XML, CSV, TXT, Maps, or RDF, or on some occasions, like my example, be taken to the website for the database to conduct more thorough searching. Clicking on the dataset name from the canvas, however, takes one to the TWC LOGD dataset details page, where the RDF and modeling exist (see image 4). This area (not recommended for the casual searcher) is designed to create awareness of and direct access to the data, where scientists can manipulate it for their own applications.
According to Erickson, “Using the LOGD datasets details page, we make it very easy to do integration with other data.” If this data were left in conventional structures, such as the raw CSV files, it would be very difficult and time-consuming for a citizen developer to manipulate. But semantic technology makes it very easy for someone to take the data, use the tutorials provided by TWC, and create a mash-up. It’s the mission of TWC to enable this exploration. It’s the mission of SciVerse to make this a collaborative effort that seamlessly integrates scientific research. The Datasets application is a great mash-up of both.