At the January ALA Midwinter conference in Philadelphia, Elsevier announced that it was extending and codifying arrangements for academic researchers to mine text from Elsevier’s archives of journals and books. Now, through their libraries’ access, academic researchers will be able to use Elsevier’s API to batch-download documents in computer-readable XML format. Although the announcement was greeted with as much caution and concern as applause, this represents an important first step for commercial scholarly publishers to open their massive stores of text and data for researchers. “Our new policy enshrines text- and data-mining rights in our standard ScienceDirect subscription agreement for academic customers,” Elsevier notes.
Text and data mining (TDM) has become a hot-button issue in our era of Big Data. “A typical example in data mining is using consumer purchasing patterns to predict which products to place close together on shelves, or to offer coupons for, and so on,” notes University of California–Berkeley’s Marti Hearst. “The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read.”
TDM: The Future of Research
“Every day, we create 2.5 quintillion bytes of data—so much that 90% of the data in the world today has been created in the last two years alone.” Charles Henry, president of the Council on Library and Information Resources, notes, “The massive scale of data creation and accumulation, together with the increasing dependence on data in research and scholarship, are profoundly changing the nature of knowledge discovery, organization, and reuse. As our intellectual heritage moves more deeply into online research and teaching environments, new modes of inquiry emerge; digital data afford investigations across disciplinary boundaries in the sciences, social sciences, and humanities, further muddling traditional boundaries of inquiry. How then are we responding to what may be the most complex and urgent contemporary challenge for research and scholarship?”
Today, one clear way to make this increasing flood of information useful, even valuable, for research and business applications is text and data mining. Although mining has long been applied to structured data, companies and researchers are now looking at ways to use the unstructured bulk of what is being collected to improve efficiencies, study markets, and advance scholarship. InterSystems’ Michael Brands explains that “90% of what people do in a business day is unstructured and the results of most of these activities can only be captured in unstructured data. … It is generally acknowledged in modern economy that knowledge is the biggest asset of companies and most of this knowledge, since it’s developed by people, is recorded in unstructured formats.”
Gleaning useful information from natural-language text, however, has been a daunting task because, unlike numerical data, text is amorphous and unstructured. It doesn’t easily fit into the algorithms developed for mining information and meaning from data. Text is far more complex, involving cultural nuances in the communication of information, opinions, or dramatic narrative. Although work on mining meaning from data goes back decades, text mining is a creation of the past 15 years. With digitized texts readily available, and with advances in computing that make it possible to store and manipulate large volumes of information, text mining holds great promise across the disciplines and for the private sector.
A 2009 report from the International Association of Scientific, Technical and Medical Publishers estimated that global STEM (science, technology, engineering, and mathematics) research produces more than 1.5 million new scholarly articles each year. Today, text mining is becoming more common in management, the biomedical sciences, and chemistry. Efforts are being made to establish footholds in social sciences and humanities research as well.
Carefully Crafted Terms From Elsevier
For academic customers, text- and data-mining rights for non-commercial purposes will be included in all new ScienceDirect subscription agreements and upon renewal for existing customers. Librarians interested in adding the TDM clause to their existing agreement prior to renewal can request a simple contract amendment via their Elsevier Account Manager. Once the institutional agreement is updated, researchers at subscribing institutions can register through Elsevier’s developers’ portal. They will then receive a key to the ScienceDirect Application Programming Interface (API), which provides full-text content in XML and plain-text formats suited to TDM.
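To give a sense of why XML full text is convenient for mining, here is a minimal Python sketch that pulls paragraph text out of a small XML document. The sample document and its element names (`article`, `para`, `italic`) are hypothetical and chosen only for illustration; the real ScienceDirect XML follows its own schema, which is not described in the announcement.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Real publisher XML uses its own schema
# and namespaces; these element names are assumptions for illustration.
SAMPLE_XML = """<article>
  <title>Example title</title>
  <body>
    <para>Text mining extracts <italic>patterns</italic> from text.</para>
    <para>Databases are built for programs.</para>
  </body>
</article>"""


def paragraphs(xml_text, tag="para"):
    """Return the plain text of every <tag> element, inline markup removed."""
    root = ET.fromstring(xml_text)
    result = []
    for element in root.iter(tag):
        # itertext() also yields the text inside nested inline elements.
        text = "".join(element.itertext())
        # Collapse runs of whitespace left over from the XML layout.
        result.append(" ".join(text.split()))
    return result


if __name__ == "__main__":
    for line in paragraphs(SAMPLE_XML):
        print(line)
```

Because the structure is explicit, a program can select just the body paragraphs and skip titles, references, or other parts it does not need, which is much harder to do reliably with a PDF.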
When researchers have completed their text-mining project through the API, the output can be used for non-commercial purposes under a CC BY-NC license. The output can contain “snippets” of up to 200 characters of the original text, which enables both the researchers who are answering a specific question and those looking to build resources to define the context of the new information they’ve extracted from the literature. Elsevier also requests that text-mining researchers include a DOI link back to the original content to ensure that authors receive credit and that future researchers have a reliable reference to the authoritative source of the underlying articles.
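As a rough illustration of what these output terms might mean in practice, the following Python sketch builds an output record that keeps a snippet within the 200-character limit and attaches a DOI link to the source. The function names, the record layout, and the sample DOI are hypothetical; the announcement does not prescribe a format, and only the 200-character limit, the DOI link, and the CC BY-NC license come from Elsevier’s terms.

```python
MAX_SNIPPET_CHARS = 200  # snippet limit stated in the announcement


def make_snippet(text, start, length=MAX_SNIPPET_CHARS):
    """Return up to `length` characters of `text` beginning at `start`."""
    if length > MAX_SNIPPET_CHARS:
        raise ValueError("snippets are limited to 200 characters")
    return text[start:start + length]


def doi_link(doi):
    """Build a resolvable link to the original article from its DOI."""
    return "https://doi.org/" + doi


def build_record(text, start, doi):
    """Bundle a snippet with its source link (hypothetical record layout)."""
    return {
        "snippet": make_snippet(text, start),
        "source": doi_link(doi),
        "license": "CC BY-NC",
    }


if __name__ == "__main__":
    article_text = "word " * 100  # 500 characters of stand-in text
    # "10.1000/example" is a placeholder DOI, not a real article.
    record = build_record(article_text, 0, "10.1000/example")
    print(len(record["snippet"]), record["source"])
```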
At least provisionally, Elsevier is limiting researchers to 10,000 articles per week for free mining—with the caveat that the researchers or their institutions sign the binding legal agreement.
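A researcher batch-downloading articles would likely want to track this limit on the client side. The Python sketch below is one simple way to do so, under stated assumptions: the announcement does not say how Elsevier counts a week, so the class uses a seven-day window that starts with the first download, and the `WeeklyQuota` name and its methods are hypothetical.

```python
from datetime import datetime, timedelta

WEEKLY_LIMIT = 10_000  # articles per week, per the announcement


class WeeklyQuota:
    """Client-side counter for a weekly download limit.

    Assumption: a week is a seven-day window starting at the first
    download. How the publisher actually counts weeks is not stated.
    """

    def __init__(self, limit=WEEKLY_LIMIT):
        self.limit = limit
        self.window_start = None
        self.used = 0

    def try_consume(self, now, count=1):
        """Reserve `count` downloads; return False if that exceeds the limit."""
        week = timedelta(weeks=1)
        if self.window_start is None or now - self.window_start >= week:
            # Start a fresh window and reset the counter.
            self.window_start = now
            self.used = 0
        if self.used + count > self.limit:
            return False
        self.used += count
        return True


if __name__ == "__main__":
    quota = WeeklyQuota()
    start = datetime(2014, 2, 3)
    print(quota.try_consume(start, 9_999))
    print(quota.try_consume(start, 2))
```

A download script would call `try_consume` before each request and pause until the next window when it returns `False`.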
“While everyone recognizes the opportunity that text mining brings,” Elsevier’s announcement continues, “it is a specialized process. Many researchers are looking for services that make this process easier so they can concentrate on the part they do best—research. Here at Elsevier, we are continually working on ways to make it easier to text mine by both improving our technology support and optimizing the publication process to make content mineable.” Although most see this as a major milestone, many have expressed frustration with the control that Elsevier retains in this process.