Lee Giles, the David Reese Professor at Penn State University’s College of Information Sciences and Technology, received the National Federation of Advanced Information Services’ (NFAIS) 2018 Miles Conrad Award. On March 1, he delivered the annual Miles Conrad Memorial Lecture at NFAIS’ 60th anniversary conference. Giles may be best known for his contributions to the creation, development, and maintenance of CiteSeer (now CiteSeerX), the first search engine focused on open scholarly literature. He has published more than 400 journal articles, conference papers, and other works, and his research has been highlighted in The New York Times, WIRED, The Wall Street Journal, and The Washington Post. In announcing Giles as the Miles Conrad Award winner, NFAIS said, “CiteSeer radically changed the way scholars and scientists search the literature. It was the inspiration for numerous subsequent scholarly search engines such as Google Scholar.” I spoke with Giles during the NFAIS conference. What follows is an edited and abridged transcript of our conversation.
Photo at right of Lee Giles (left) accepting the Miles Conrad Award from Peter Simon, NFAIS president (right), is courtesy of NFAIS.
Dave Shumaker: Lee, congratulations on receiving the Miles Conrad Award.
Lee Giles: Thank you very much.
Shumaker: The Miles Conrad Award recognizes your contributions over a sustained period. It’s kind of a lifetime achievement award. So, I’m curious—what led you into information science? Did you start out with the intention to be an information scientist?
Giles: No, I started out wanting to become an optical physicist. But my doctoral advisor was an expert in medical imaging, and medical imaging has a lot of computational issues. So I started doing more of the computational work, and I published some of the first papers on optical neural networks. Then I got more interested in neural networks, and the focus on optics went away. I also got interested in optical computing, but then the interest in optics went away from that. It was a logical progression from neural networks and computer architecture and analog computing to computer science. From neural networks, I went readily into machine learning, and machine learning is a stepping stone into artificial intelligence (AI).
Shumaker: And the scholarly communication component?
Giles: That came about because we were looking for data, and data weren’t easy to find until the web came along. When we saw the resources on the web, we thought, “Wow, we can crawl the web for data.” So we started building web crawlers. And it so happened that there were a lot of scholarly papers we could crawl. Then we thought, “Wouldn’t it be nice if we could extract the citations and build a citation indexing system that would track how many times papers got cited, who was getting cited, what was getting cited?” We were just curious. And that’s what led to CiteSeer.
Shumaker: So one thing led into the next?
Giles: Yes, and the CiteSeer project took off and just grabbed us—Kurt Bollacker, Steve Lawrence, and myself, who were working together on this at NEC. Then when we went our separate ways, Steve, who was really the original code developer, asked me if I wanted to run it at Penn State, so I did. I’ve continued to develop and improve it ever since.
Shumaker: Let’s fast-forward to today. What’s the status of CiteSeerX now?
Giles: We had to stop crawling for a while, but we are about to start up again. Crawling, by the way, is one of the most difficult and resource-intensive processes you can have. It involves going out, finding material, bringing it back, and looking at it to see if it’s worthwhile, then going back out. That all requires a lot of bandwidth as well as storage capacity and processing power. There are issues with the instability of the web and of individual sites. We’ve devoted some of our papers to focused crawling techniques and how to crawl primarily for scholarly papers. For example, we don’t crawl publishers unless they want us to. We maintain lists of sites never to go to, and lists of sites to prioritize, and visit them more often than others.
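The focused-crawling policy Giles describes—a blocklist of sites never to visit and a priority list of sites to visit more often—can be sketched as a frontier queue. This is a hypothetical illustration, not CiteSeerX’s actual code; the site names, the `fetch` callback, and the two-level priority scheme are all assumptions made for the example.

```python
import heapq

# Hypothetical illustration of focused crawling with a blocklist,
# a priority list, and a frontier queue (not CiteSeerX's real code).
BLOCKLIST = {"publisher.example.com"}   # sites never to crawl
PRIORITY_SITES = {"arxiv.example.org"}  # sites to visit ahead of others

def site_priority(url):
    """Lower number = crawled sooner; prioritized hosts jump the queue."""
    host = url.split("/")[2]
    return 0 if host in PRIORITY_SITES else 1

def crawl(seed_urls, fetch, max_pages=10):
    """Visit pages in priority order, skipping blocked hosts.

    `fetch(url)` stands in for the real download-and-parse step and
    returns the outgoing links found on the page.
    """
    frontier = [(site_priority(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen, visited = set(seed_urls), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url.split("/")[2] in BLOCKLIST:
            continue  # never fetch blocked sites
        visited.append(url)
        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (site_priority(link), link))
    return visited

# Tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://arxiv.example.org/abs/1": ["http://other.example.net/p2"],
    "http://other.example.net/p1": ["http://publisher.example.com/x"],
    "http://other.example.net/p2": [],
    "http://publisher.example.com/x": [],
}
order = crawl(
    ["http://other.example.net/p1", "http://arxiv.example.org/abs/1"],
    fetch=lambda u: PAGES.get(u, []),
)
```

In this toy run the prioritized arXiv-like page is visited first, and the publisher page is discovered but never fetched—the same shape as the prioritize/avoid lists Giles mentions.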
Shumaker: What new developments are you working on?
Giles: One new project is an initiative called MathSeer. It’s funded by the Alfred P. Sloan Foundation. Richard Zanibbi of the Rochester Institute of Technology is the principal investigator, and I’m the co-principal investigator. The goal is to add math formula search capability to the web, building on CiteSeerX.
Another high priority is re-architecting the system infrastructure. We’d like to refactor the system to make it even more scalable. The current system model requires a database, which we populate automatically. The database is used to render results pages in CiteSeerX. We believe we don’t need to do that anymore. We’ll still keep a database for research purposes, but we don’t have to use it to render pages. We think this change will make digital library search engines a lot easier to implement. It’s been disappointing that there are so few digital library search engines in academia. One of the reasons we believe there are so few is that they require a lot of resources. Many of the processes aren’t automated, so we think this change will automate some of those processes and reduce the resource requirements. This can be a big help to publishers as well, because they’re putting a lot of effort into maintaining the search engines that use their content.
We’re also exploring integrating other types of information into CiteSeerX. We want to be able to take information about papers and integrate it with related presentations, biographical information about the author, related patents, and similar kinds of information. Often, when users are looking for papers, they’re also interested in these other kinds of information at the same time, so it would be valuable for CiteSeerX to integrate them. And we also want to be able to do knowledge extraction from the papers.
Shumaker: It occurs to me that much of what CiteSeer does is dependent on having material available openly. Open access is clearly a hot issue here at the conference. One of the numbers that’s been mentioned is that only 15% of the scholarly output is open access. Does that number seem right to you? What’s your assessment of the status and prospects for open access?
Giles: That number does seem low to me. I’ve seen estimates that as many as half of all published papers are open access. In our own work, which was published in PLOS ONE, we’ve estimated that it’s about a quarter. We did have an independent confirmation of our number in conversations with Microsoft Academic, but of course these are all samples, so the number depends on what the sample is. It also depends on definitions. We’re using a broad definition, including various forms of publication, such as tech reports, not just journal articles.
As for the outlook, I feel open access is a snowball rolling downhill. It’s just going to keep getting bigger. The reason is that governments and other research funders are demanding it. Most scientists like it, because it makes their job easier. It makes discovery easier. I personally believe that the future for publishers is to offer services to facilitate information management and discovery.
Shumaker: Another one of this year’s hot topics is natural-language processing. You mentioned in your lecture that it still has a long way to go. It seems like the general public is being pushed in that direction, with the various retrieval systems and voice-activated digital assistants. What are your thoughts on the present and future of natural-language processing?
Giles: I’d say the marketing is getting ahead of the performance. You can certainly ask the digital assistants questions that they don’t understand—questions that you or I do understand. There are patterns that the developers have put into these tools that are useful: “What time is it?” “What’s on my calendar?” AI tools can do a lot of basic functions, but they don’t really understand anything, and they don’t have any deep knowledge. Think of it as a database. A database doesn’t understand. Take self-driving cars. If today’s self-driving car encounters a car flipped upside down on the side of the highway, it identifies that there’s an obstacle on the side of the highway. But does it recognize that as a car flipped upside down? Only if it’s been trained on upside-down cars.
There are computer scientists who promote an exaggerated view of what the systems do. They talk about systems “understanding.” It’s really elaborate pattern recognition, and it can be really, really good—if you have enough patterns to train on.
On the other hand, pretty soon, the capabilities of machine translation will surpass what any human can do. It seems they already have. Consider the scope and number of languages that AI tools are able to translate—and they will continue to get better as they have more patterns to train on. Of course, they still make mistakes, and there are serious examples and hilarious examples out there—but the systems will keep getting better.
Shumaker: I was speaking with an academic librarian recently about the trend toward academic librarians taking on data management. What are your thoughts about the roles of academic librarians? Can the CiteSeer tools help with data management, or institutional repositories, or other roles?
Giles: I think the data management initiatives are very good. And there are academic libraries that are doing wonderful work with repositories—take the work Cornell has done with the arXiv.org repository, for example. It would certainly be possible for more academic libraries to take on similar projects. As for institutions implementing CiteSeer tools, the key difference is that repositories such as arXiv are submission-based, rather than crawl-based. The authors deposit their papers, rather than the system going out and finding them. There hasn’t been much interest in turning CiteSeer into a submission-based system, and we’ve stayed focused on developing it as a crawl-based system. As for submission-based systems, I think authors are more likely to submit papers to large, aggregated repositories, including arXiv, or perhaps to other repositories dedicated to specific disciplines, rather than institutional repositories. That’s because the large repositories are where others—researchers, students, even journalists—go to discover what’s new.
So, I think the priority for libraries is building repositories for data. For example, I write a lot of code, and I get tired of backing up my code and my data. A library-run service that could help me with that would be really useful. I wish libraries had developed something like GitHub.
Shumaker: Lee, before we end the conversation, is there anything else you’re working on that you’d like to mention?
Giles: There are three other areas I’m working on these days. One is “text in the wild,” another is recurrent neural networks, and the third is AI in education.
Shumaker: What’s your focus in AI for education?
Giles: We’ve gotten started on a really interesting project related to open educational resources (OERs). It’s called BBookX, and its goal is to help a professor or lecturer build a textbook out of OERs. It uses information retrieval tools and OERs such as Wikipedia. The instructor chooses a topic, the system goes out and brings back open resources on the topic, and then the instructor chooses among the resources, customizes the book, and publishes it.
We’ve been trying to develop a formal model of prerequisites to help the instructor manage the relationships among topics, so that prerequisites come first and subsequent chapters build on them. That’s also taken us into automatic question generation for multiple-choice testing. The goal is to build good “distractors.” Distractors are the wrong answers in a multiple-choice test. A good distractor is one that’s plausible. So, if the question is “The earth rotates around [blank],” good distractors might be “the moon,” “Saturn,” and “the Milky Way” and not “pizza” or “mama.” It turns out that coming up with good questions and good distractors is a hard part of developing good test questions, so we’re trying to help with that process. We’re looking into how we can automatically create and “tune” the distractors so that instructors can give us material and we come up with different levels of easier or harder distractors. The whole field of applying AI to education is one that we think is very promising and has many opportunities.
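The idea of “tuning” distractor difficulty can be sketched very simply: plausible distractors share a semantic category with the correct answer, implausible ones don’t. This is a hypothetical sketch, not the BBookX system—the `CATEGORIES` table and the hard/easy ranking rule are assumptions made for illustration, using the earth-rotation example from above.

```python
# Hypothetical sketch of tunable distractor selection (not BBookX).
# Plausible (hard) distractors come from the answer's own category;
# implausible (easy) ones come from unrelated categories.
CATEGORIES = {
    "the sun": "astronomy", "the moon": "astronomy",
    "Saturn": "astronomy", "the Milky Way": "astronomy",
    "pizza": "food", "mama": "person",
}

def pick_distractors(answer, candidates, difficulty, k=3):
    """Return k wrong answers, ranked by the requested difficulty."""
    pool = [c for c in candidates if c != answer]
    same = [c for c in pool if CATEGORIES.get(c) == CATEGORIES.get(answer)]
    other = [c for c in pool if CATEGORIES.get(c) != CATEGORIES.get(answer)]
    # "hard" prefers plausible, same-category terms; "easy" the opposite.
    ranked = same + other if difficulty == "hard" else other + same
    return ranked[:k]

hard = pick_distractors("the sun", list(CATEGORIES), "hard")
easy = pick_distractors("the sun", list(CATEGORIES), "easy")
```

A real system would replace the hand-built category table with similarity scores learned from text, but the tuning knob works the same way: move the cutoff along the plausibility ranking.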
Shumaker: We’ll look forward to hearing more about your progress with BBookX as well as CiteSeerX developments in the future. Once again, congratulations on receiving the Miles Conrad Award and thanks for your time today.