CiteSeer (http://citeseerx.ist.psu.edu) could be called a vertical research portal, a niche search engine, or a specialized digital library. It uses a specialized crawler (robot) to find scholarly papers; it then extracts the text from PDF and PostScript files and creates a searchable full-text index. CiteSeer enriches access to these materials by extracting metadata such as author names and publication information. The pioneering Autonomous Citation Indexing tool follows citations and acknowledgments from one paper to another, science mapping and data mining as it progresses. Digital libraries of scholarly works need structure and context. The newly announced SeerSuite open source code base offers excellent tools for this process, from automated citation indexing to web crawling to Boolean queries.
CiteSeer, in one version or another, has been running for more than 10 years, with hundreds of thousands of documents and millions of citations from papers in computer and information sciences posted to the web. It was first created as Research Index in 1998 by Steve Lawrence (now at Google) and professor C. Lee Giles of The Pennsylvania State University. Over the years, it has provided a rigorous testbed for both software development and content research. Its emphasis on automation of web crawling to find academic documents and automatically extracting citations has proven very successful.
CiteSeer has been heavily used, handling hundreds of thousands of queries a day, and it has improved over time due to long-term development commitments and user demand. The citation linking created a network effect, enriching the information on the portal as the system processes more citations and makes more connections. "This is adding to the Semantic Web by semanticizing research documents," Giles says, as it extracts valuable searchable metadata.
Building on that experience, CiteSeerX is a completely new system, re-architected for scaling and modularity, to handle increasing demands from both researchers and digital library programmatic interfaces. The system uses artificial intelligence, machine learning, support vector machines, and other techniques to recognize and extract metadata for the articles found. It now uses the Lucene search engine and supports standards such as the Open Archives Initiative (OAI), including metadata browsing, and Z39.50. CiteSeerX has a simple but powerful internal structure for documents and citations. If it cannot access a cited document, it creates a virtual document as a placeholder, which can then be filled in when the document becomes available.
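The placeholder idea can be illustrated with a small sketch. This is not CiteSeerX code: the class names, field names, and document keys below are hypothetical, and the real system's data model is not described in this article. The sketch only shows how a citation to an unseen paper can create a virtual node that is later filled in without losing its incoming links.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Document:
    """A node in the citation graph (hypothetical structure)."""
    key: str
    title: str
    full_text: Optional[str] = None                  # None means "not yet crawled"
    cites: Set[str] = field(default_factory=set)     # keys this document cites
    cited_by: Set[str] = field(default_factory=set)  # keys that cite this document

    @property
    def is_virtual(self) -> bool:
        return self.full_text is None


class CitationIndex:
    def __init__(self) -> None:
        self.docs: Dict[str, Document] = {}

    def _get_or_create(self, key: str, title: str) -> Document:
        # An unknown key becomes a virtual (placeholder) document.
        if key not in self.docs:
            self.docs[key] = Document(key=key, title=title)
        return self.docs[key]

    def add_document(self, key: str, title: str, full_text: str,
                     citations: List[Tuple[str, str]]) -> Document:
        """Add a crawled document; `citations` holds (key, title) pairs."""
        doc = self._get_or_create(key, title)
        doc.title = title
        doc.full_text = full_text  # fills the placeholder if one existed
        for cited_key, cited_title in citations:
            cited = self._get_or_create(cited_key, cited_title)
            doc.cites.add(cited_key)
            cited.cited_by.add(key)
        return doc
```

Adding paper "a" with a citation to an uncrawled paper "b" leaves "b" in the index as a virtual document; when "b" is crawled later, `add_document` fills in its text and the existing `cited_by` link from "a" is preserved.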
According to Giles, "We’ve learned from web search that getting access to information improves understanding and decision making. Our goal is to build tools for researchers to gain insight into what work has been done, avoid duplicate effort, and create new theories."
SeerSuite beta 0.1 (http://sourceforge.net/projects/citeseerx) is the Java open source code version of CiteSeerX, distributed under the Apache license. This includes the citation indexing and search features, as well as a scalable modular framework that can handle thousands of simultaneous queries, distribute indexes, and balance their demands across many servers. Documentation, currently sparse, will be added within the next 6 months. While at an early stage now, Giles says that SeerSuite will bring this form of digital library structure to many different researchers and fields, requiring IT support mainly to install and configure the system.
Using this source code, any field can start to build its digital library portal, customized for the articles, papers, and data sets of specialized disciplines, rather than trying to fit all types into one specific system. The new portal would use the web crawling processes to keep the information current, index the full text, extract metadata, and automatically create a citation index. It provides simple full-text search; fielded search; and full Boolean, phrase, and proximity search.
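To make the query types concrete, here is a toy positional inverted index. It is a teaching sketch only: SeerSuite relies on Lucene for search, and nothing below reflects Lucene's API or internals. The class and method names are assumptions made for this illustration.

```python
import re
from collections import defaultdict


class PositionalIndex:
    """Toy inverted index: term -> {doc_id: [token positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        for pos, term in enumerate(self.tokenize(text)):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def docs(self, term):
        return set(self.postings.get(term.lower(), {}))

    def boolean_and(self, *terms):
        sets = [self.docs(t) for t in terms]
        return set.intersection(*sets) if sets else set()

    def boolean_or(self, *terms):
        return set().union(*(self.docs(t) for t in terms))

    def boolean_not(self, include, exclude):
        return self.docs(include) - self.docs(exclude)

    def phrase(self, text):
        """Docs containing the tokens of `text` at consecutive positions."""
        terms = self.tokenize(text)
        if not terms:
            return set()
        hits = set()
        for doc_id, starts in self.postings.get(terms[0], {}).items():
            for start in starts:
                if all(start + i in self.postings.get(t, {}).get(doc_id, [])
                       for i, t in enumerate(terms)):
                    hits.add(doc_id)
                    break
        return hits

    def near(self, first, second, max_gap=1):
        """Docs where the two terms occur within `max_gap` positions, either order."""
        a = self.postings.get(first.lower(), {})
        b = self.postings.get(second.lower(), {})
        hits = set()
        for doc_id in a.keys() & b.keys():
            if any(0 < abs(pb - pa) <= max_gap
                   for pa in a[doc_id] for pb in b[doc_id]):
                hits.add(doc_id)
        return hits
```

Storing token positions, not just document IDs, is what makes phrase and proximity queries possible; a plain Boolean index would only need the document sets.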
In search results, CiteSeer can show the context of search terms and citations for each paper, providing a preview or scanning overview of each paper, and it creates a continuously updated bibliography. All these features can be adjusted for specific requirements of a field or discipline. Giles says, "The goal is to have automated extraction tools in many disciplines, to the benefit of even small or medium digital libraries."
For example, the ChemXSeer portal (http://chemxseer.ist.psu.edu) recognizes chemical entities, which are the main information components central to much chemistry research. ChemXSeer can convert data displayed in tables to tabular data, with rows and column names, and numeric data from the table cells. Similarly, it can take apart other kinds of published data sets, in lists, for example, and treat them as individual strings or numbers. Giles has headed the development of these tools because chemists need all the information from earlier work to design new experiments. He points out that there is more than a century of chemical research where the only underlying data is expressed in the printed images, and that data is extremely valuable. "The past needs to be fixed," he says, even if everyone posts data sets with future research papers.
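As a rough illustration of turning a displayed table into structured rows, the sketch below parses a plain-text table into dictionaries keyed by column name and converts numeric cells to numbers. It assumes columns are separated by two or more spaces, which is a simplification chosen for this example; ChemXSeer's actual extraction from PDF is far more involved and is not described here. The function names are hypothetical.

```python
import re

# Matches plain integers, decimals, and scientific notation.
_NUMBER = re.compile(r"[-+]?\d+(\.\d+)?([eE][-+]?\d+)?")


def to_number(cell):
    """Return a float for numeric cells, otherwise the original string."""
    return float(cell) if _NUMBER.fullmatch(cell) else cell


def parse_text_table(block):
    """Parse a whitespace-aligned text table into a list of row dicts."""
    lines = [ln for ln in block.strip().splitlines() if ln.strip()]
    if not lines:
        return []

    def split(line):
        # Assumption: two or more spaces separate columns.
        return re.split(r"\s{2,}", line.strip())

    header = split(lines[0])
    rows = []
    for line in lines[1:]:
        cells = split(line)
        if len(cells) != len(header):
            continue  # skip rows that do not line up with the header
        rows.append({name: to_number(c) for name, c in zip(header, cells)})
    return rows
```

For example, a three-column table listing a compound, its formula, and a boiling point yields one dictionary per row, with the boiling point available as a number that can be sorted or compared.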
CiteSeer tests have also been deployed for academic research in ebusiness and archeology and have led to a more modular architecture of the system to support the requirements of these fields. The CiteSeer group is experimenting with sharing content as well as code, perhaps using federated or distributed principles.
The SeerSearch code has lightweight web services for metadata extraction, citation graphs, general indexing, metadata, repository, file type conversion, and duplicate detection. This means that any other web application can call the system to perform these tasks and receive the results. It can integrate with Fedora (repository/digital asset management software) to store not just documents as data objects but also microformats such as citations.
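The article does not say how the duplicate detection service works. One common technique, shown here purely as an assumption, is to compare documents by the overlap of their word shingles (Jaccard similarity). The function names and the 0.8 threshold are illustrative choices, not SeerSuite settings.

```python
import re


def shingles(text, k=3):
    """Set of k-word shingles from lowercased alphanumeric tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, k=3, threshold=0.8):
    """True when the shingle overlap meets the (illustrative) threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Because tokenization ignores case and punctuation, two copies of the same title that differ only in formatting compare as identical, while unrelated titles share no shingles at all.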
Ongoing projects for both CiteSeerX and the shared SeerSuite modules include recognizing acknowledgments as a form of citation and recognizing and clustering named entities and institutional affiliations. More ambitiously, there is an initiative to recreate the mathematical expressions of equations displayed as images in PDF. Likewise, Giles is working on a "graph indexer," which can read statistics published as line graphs, histograms, and other 2D plots; identify the units on the X and Y axes; and use OCR to extract the data points and text blocks. The information will then be stored in XML form and integrated into the digital library system.
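The XML format used by the graph indexer is not given in the article. As a hedged sketch of the storage step only, the code below serializes extracted axis labels and data points to a made-up XML layout and reads them back. The element and attribute names (`plot`, `axis`, `series`, `point`) are hypothetical, and the image analysis and OCR stages are out of scope.

```python
import xml.etree.ElementTree as ET


def plot_to_xml(title, x_label, y_label, points):
    """Serialize extracted plot data to an XML string (hypothetical schema)."""
    root = ET.Element("plot", {"title": title})
    ET.SubElement(root, "axis", {"name": "x", "label": x_label})
    ET.SubElement(root, "axis", {"name": "y", "label": y_label})
    series = ET.SubElement(root, "series")
    for x, y in points:
        # repr(float) round-trips exactly through float() on reading.
        ET.SubElement(series, "point", {"x": repr(float(x)), "y": repr(float(y))})
    return ET.tostring(root, encoding="unicode")


def xml_to_points(xml_text):
    """Read the data points back as a list of (x, y) floats."""
    root = ET.fromstring(xml_text)
    return [(float(p.get("x")), float(p.get("y"))) for p in root.iter("point")]
```

Keeping the axis labels alongside the points means the units recovered from the plot travel with the numbers, so the data remains interpretable once it is indexed.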
CiteSeerX also offers personalization and Web 2.0 features such as personal collections, tagging for articles, error correction, and document submission (user-created content). Users can monitor specific papers for metadata updates via email and create bibliographies by marking and downloading specific records.
Finally, Giles and his students plan a complex event logging system, which can independently store the actions of users on the entire site, not just the search engine. This will provide much more useful insights into user vocabulary, results evaluation, and general behavior.
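Since this logging system is only planned, the following is a speculative sketch of the general idea rather than a description of it: each user action is stored as one JSON record, and the search events can later be mined for the vocabulary users actually type. The class name, field names, and action labels are all assumptions.

```python
import json
import time
from collections import Counter


class EventLog:
    """In-memory event log; each entry is one JSON-encoded line."""

    def __init__(self):
        self.lines = []

    def record(self, user, action, **details):
        event = {"ts": time.time(), "user": user,
                 "action": action, "details": details}
        self.lines.append(json.dumps(event, sort_keys=True))

    def events(self, action=None):
        """Yield decoded events, optionally filtered by action type."""
        for line in self.lines:
            event = json.loads(line)
            if action is None or event["action"] == action:
                yield event


def query_vocabulary(log):
    """Count the lowercased words used in logged search queries."""
    counts = Counter()
    for event in log.events("search"):
        counts.update(event["details"]["query"].lower().split())
    return counts
```

Logging every action type in one stream, rather than only search queries, is what would let an analysis connect what users searched for with what they then viewed or downloaded.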