Professional searchers have struggled for decades with the challenge of providing clients with "complete" author bibliographies. The arrival of online databases should have made the task simple, but the endless variations in author listings, combined with the merciless demands of computers for precise queries, left searchers flailing. In an era of end-user-only searching, something had to be done, and finally something is being done. Two of the "high-priced spread" institutional services, Thomson Scientific's Web of Science/Knowledge (http://scientific.thomson.com) and Elsevier's Scopus (http://www.scopus.com), have introduced sophisticated algorithmic tools designed to improve author searching. Thomson Scientific's improvements began this month and will continue through the rest of the year. Elsevier's Scopus put its Author Identifier features in place in mid-May. Elsevier's improvements involve an arrangement with Parity Computing (http://www.paritycomputing.com), an independent computer service offering a range of advanced data mining and text-processing features to a number of clients.
Thomson Scientific's Web of Science
Thomson Scientific, home of the Institute for Scientific Information (ISI), is launching its authorship search tools in phases. Jim Pringle, vice president of product development, stated: "A simple, quick-fix solution would not meet our users' needs. Instead, we have worked closely with researchers over several years to develop a full suite of offerings to address those challenges." The first tool issued is Author Finder, a guided search aid, already on the system. Users enter names, receive a list of name variants, and match results with subject categories and/or institutions to narrow the search to specific authors.
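To make the guided-search idea concrete, here is a minimal Python sketch of the general pattern described above: expand a name into likely index variants, then narrow the hits by institution or subject. It is not Thomson Scientific's implementation, which is proprietary. The function names, the variant rules, and the record fields (`author`, `institution`, `subject`) are all assumptions made for illustration.

```python
def name_variants(given_names, surname):
    """Return plausible index forms of one author name (hypothetical rules)."""
    parts = [p.strip(".") for p in given_names.split() if p.strip(".")]
    if not parts:
        return [surname]
    initials = "".join(p[0].upper() for p in parts)
    variants = [
        f"{surname} {initials[0]}",       # first initial only, e.g. "Smith J"
        f"{surname} {initials}",          # all initials, e.g. "Smith JQ"
        f"{surname} {parts[0]}",          # first given name, e.g. "Smith Jane"
        f"{surname} {' '.join(parts)}",   # all given names, e.g. "Smith Jane Q"
    ]
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(variants))


def narrow(records, variants, institution=None, subject=None):
    """Keep records whose author matches a variant and the optional filters."""
    wanted = {v.lower() for v in variants}
    hits = []
    for rec in records:
        if rec["author"].lower() not in wanted:
            continue
        if institution and rec.get("institution") != institution:
            continue
        if subject and rec.get("subject") != subject:
            continue
        hits.append(rec)
    return hits


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    sample = [
        {"author": "Smith J", "institution": "MIT", "subject": "Physics"},
        {"author": "Smith JQ", "institution": "Oslo", "subject": "Medicine"},
        {"author": "Smyth J", "institution": "MIT", "subject": "Physics"},
    ]
    variants = name_variants("Jane Q.", "Smith")
    print(narrow(sample, variants, institution="MIT"))
```

The sketch also shows why initials-only indexing makes the problem hard: "Smith J" may be Jane, John, or Jun, so the institution and subject filters do most of the real work.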
In the third quarter of 2006, around October, according to Pringle, Thomson Scientific will initiate the first phase of its author disambiguation tools. Using powerful, proprietary disambiguation tools developed in-house, the Web of Science will incorporate layers of content elements, including citation relationships, to group papers by the same author into "authorships." This mammoth effort will require re-indexing over a century of bibliographic citations and will extend over the rest of 2006, according to Pringle. In time, users can expect to see direct links from "authorship groups" to biographical and bibliographical information from the ISI Web of Knowledge record and http://www.ISIHighlyCited.com.
In the course of upgrading its author access, Thomson Scientific has abandoned its longstanding tradition of retaining only author initials. Since the start of 2006, it has been adding full author names. However, Pringle informed me that the company does not intend to re-index earlier records; this change will only cover records from 2006 forward. The release of records with full author names should also take place in the third quarter.
Not all records in Web of Science will participate in the authorship groupings—only those that come from source records and that are indexed by Thomson Scientific/ISI in their metabases—Science Citation Index, Social Sciences Citation Index, and Arts and Humanities Citation Index. Footnote or citation records will fall outside the new disambiguation processes. However, Pringle assured me that users will still be able to conduct "Cited Author" searches to reach those citations.
As Pringle described it, the extensive algorithmic processing relies on the standardized metadata available to Thomson Scientific from its own processed records. He doubted the same approaches would work for others, "but they work extremely well for us. The quality relies on our consistent cited data." Pringle indicated that the company had not yet begun looking into expanding the value to enterprises, but federated searching and Web of Knowledge's array of files could be future targets for improvement. Besides its own proprietary linking service, Web of Science also links to CrossRef's DOIs and OpenURL. Once author records are disambiguated and clustered, perhaps linking services could expand the applications.
Elsevier's Scopus
Announced in mid-June at the SLA conference but released a month earlier, the Scopus Author Identifier automatically distinguishes between authors with the same name and matches variations of author names. The disambiguation process at Scopus stems from software developed at Parity Computing and evolved through interaction with Scopus content and its developers. According to Jaco Zijlstra, Scopus director, "In this first release we have achieved an extraordinary level of precision, with over 99 percent certainty that records are matched to the correct author. We have already grouped over 95 percent of our records to authors which is quite an achievement over a base of 20 million author profiles. We are now focused on fine tuning the recall. Now the system is live, the more data we add, the better the recall will be."
The algorithmic approach relies on additional data elements besides author names, such as affiliation, publication history, source title, subject area, and co-authors. Scopus excludes from the process any records that lack sufficient data to determine a match. Once clearly identified, each author receives a unique identifier number. An "Author Details" page gives users an overview of data associated with a specific author. Users can scan the author's papers and then view article listings for co-authors and co-authors' co-authors. The Scopus Citation Tracker allows an instant overview of records citing an author, with the option to exclude self-citations by clicking to eliminate records tagged with the Author Identifier. The same "Author Details" page also allows authors to make corrections to their own listings. Jim Pringle said that Thomson Scientific's Web of Science has an author feedback feature as well.
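The general technique can be sketched in a few lines of Python. This is a toy illustration and not the Parity Computing or Scopus algorithm, whose details are not public. The record fields, the simple evidence count, the threshold of two shared signals, and the function names are all assumptions. The sketch groups same-name records that share enough metadata, sets aside records with too little data, assigns each group a numeric identifier, and uses those identifiers to filter self-citations.

```python
from itertools import count


def evidence(a, b):
    """Count metadata signals shared by two records (hypothetical scoring)."""
    score = 0
    if a["affiliation"] and a["affiliation"] == b["affiliation"]:
        score += 1
    if a["subject"] and a["subject"] == b["subject"]:
        score += 1
    if set(a["coauthors"]) & set(b["coauthors"]):
        score += 1
    return score


def has_enough_data(rec):
    """Assumed rule: a record needs an affiliation or at least one co-author."""
    return bool(rec["affiliation"] or rec["coauthors"])


def assign_author_ids(records, threshold=2):
    """Greedily group same-name records; return (clusters, ungrouped)."""
    ids = count(1)
    clusters = {}    # author identifier -> list of records
    ungrouped = []   # records excluded for lack of data
    for rec in records:
        if not has_enough_data(rec):
            ungrouped.append(rec)
            continue
        for members in clusters.values():
            same_name = members[0]["name"] == rec["name"]
            if same_name and any(evidence(rec, m) >= threshold for m in members):
                members.append(rec)
                break
        else:
            # No existing cluster fits, so this record starts a new author.
            clusters[next(ids)] = [rec]
    return clusters, ungrouped


def exclude_self_citations(author_id, citing_records):
    """Drop citing records that carry the cited author's own identifier."""
    return [c for c in citing_records if author_id not in c["author_ids"]]
```

A greedy pass like this depends on the order of the records and would not scale to millions of profiles. It only shows why records with sparse metadata end up un-grouped, and why a stable identifier makes self-citation filtering a simple lookup.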
According to Amanda Spiteri, director of marketing for Scopus, the company intends to apply the process to the whole database and has already completed processing half the database back to 1996. Spiteri could not tell me how many entries remain "un-grouped." She said that the more common the author name, the more likely it would have un-grouped records. It also depended on the type of authorship. "We use a large number of data elements, so it's hard to match outliers and common names," said Spiteri. However, she informed me that 99 out of 100 records are matched correctly to an author, with 95 percent recall; that is, only five records out of every 100 would not be matched due to insufficient data.
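To see what those two percentages mean in absolute terms, the arithmetic can be worked through in a short Python sketch. The total of 10,000 records is an invented example, not a Scopus figure. The sketch also assumes one reading of the quoted numbers: 95 percent of records are grouped at all, and 99 percent of the grouped records are assigned to the right author.

```python
def grouping_breakdown(total, grouped_pct=95, correct_pct=99):
    """Split a record count into grouped/ungrouped and correct/wrong matches.

    Integer arithmetic is used so the counts stay exact.
    """
    grouped = total * grouped_pct // 100
    ungrouped = total - grouped
    correct = grouped * correct_pct // 100
    wrong = grouped - correct
    return {
        "grouped": grouped,
        "ungrouped": ungrouped,
        "correct": correct,
        "wrong": wrong,
    }


if __name__ == "__main__":
    # Hypothetical batch of 10,000 records.
    print(grouping_breakdown(10_000))
```

Under these assumptions, a batch of 10,000 records would leave 500 un-grouped and 95 attached to the wrong author, which is why common names with thin metadata remain the hard cases.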
Spiteri said that reception for the improvements had been "great. We show it to librarians and they appreciate the strides we have made to solve problems on a big scale."
Services at the other end of the price spectrum, namely Google Scholar and Windows Live Academic Search, are also wrestling with clustering author entries. The former links open Web content for many different versions of scholarly work, measured more by project than by document, e.g., clustering technical reports, conference presentations, and other author activities associated with a research project. The latter handles more traditional formats but includes many of the features common to expensive services such as Web of Science and Scopus. Those interested in joining the wrestling match might start by contacting firms like Parity Computing that have already done much of the technical spadework.