People are not just talking about digital libraries any more—they are building them. That much was evident at the fourth Digital Libraries conference, held August 11-14 in Berkeley, California, and sponsored by the Association for Computing Machinery. Unlike the first Digital Libraries conference in 1995, this one was a new mixture of the original dreamers with applications developers, policy makers, and librarians eager for practical technologies to solve real problems. Also unlike earlier conferences, we now have enough experience to pinpoint issues and challenges. We can tell others a few things about what to do when building a digital library—and even more things about what not to do.

As with any emerging field, a certain amount of time at DL '99 was taken up with self-examination. Attendees wondered how to define the subject and where to draw the boundaries. Whom should it embrace, and what are its goals? In fact, we are seeing the same head scratching in practically any group related to information retrieval this year. Confused librarians, computer scientists, social scientists, psychologists, and graphic designers find that other professions are discovering the same things but are describing them in different languages. Sometimes they are insulted at perceived encroachments: After all, what is this new thing called metadata except cataloging and classification? Why is a thesaurus suddenly an "ontology"? The cross-fertilization is uncomfortable but fruitful. With so much to understand, the more perspectives the better.
Issues for Discussion
So what is a digital library? The term is used for everything from online catalog access to a physical collection, to inventive combinations of online materials, collaborative work applications, and interfaces created to fit into an organization's workflow. In one sense, a digital library should be very familiar to readers of Information Today: it still requires acquiring, organizing, and storing information, and making it accessible. However, my own very strong bias is toward the inventive end of the spectrum.
It seems to me that a digital library is not just an electronic surrogate for a physical library. It must take into account the differences that working in a digital medium implies. For instance, in cyberspace we cannot flip through materials, nor do we have a strong sense of location. On the other hand, access to remote materials is as easy as if they were in the same room. All the text is searchable, and we can combine and use the contents in new ways. Collections can be enormous, and the materials are never checked out. The materials can be shared with dozens or hundreds or thousands of people. They invite joint creative work. They can be woven into the workday effortlessly. Whatever structure a digital library has, it should exploit these features rather than try to be a traditional library.
Nevertheless, fulfilling this dream of instant access and joint use is easier said than done. Above all, we are faced with the human issues: How do people find and use information? How do they interact with computers? Can we design search systems and interfaces that make sense? Do people who are not information- or computer-literate have adequate access to the information they need, or are they being left behind? What are the hurdles they face?
Ann Bishop of the University of Illinois compared two projects that supplied computer access to information—one for college students and the other for low-income families. In both, she found what she called "insurmountable molehills," such as not having appropriate bus schedules to deliver participants to classes on time, or not having the right power cord for a modem. Both the students and the neighborhood participants she studied didn't know how to ask for help or whom to ask. Having the right content, and making it apparent to users, are also of paramount importance. The content must be personally relevant in order to motivate users to look for it. If a digital library is to be used, it must fit easily into the rest of the user's life, in terms of both ease of access and relevance of content.
In his keynote address, David Levy of the Xerox Palo Alto Research Center noted that we have achieved funding, visibility, cross-discipline communities that work together, notable online collections that are growing, and international involvement in developing digital libraries. However, he listed these new problems:
• Metadata
• Privacy and security
• Intellectual property rights
• Preservation of digital materials
• Collection management: How do you make usable collections from digital resources?
In particular, Levy pointed out that, just like public libraries during the past century, digital libraries are in the process of defining themselves. Questions of funding, purpose, audience, and function have never been completely resolved for traditional libraries, and they remain open questions for digital libraries as well. Tensions also remain between what is technically possible and what is socially advisable.
Henry Gladney of the IBM Almaden Research Center brought up an additional issue: technology transfer. He remarked that while groundbreaking new approaches are being developed, commercial systems are still using software that was developed 25 years ago. Conversely, universities fail to notice perfectly good systems developed commercially, and would rather build their own. He urged everyone to look outside his or her immediate circle of knowledge.
In his excellent introductory seminar on digital libraries, Ed Fox of Virginia Polytechnic Institute and State University added a number of difficult technical issues, as well as the problem of human-computer interaction. One surprising difficulty is that too many groups are addressing the problems separately. We have the library community and the Web community trying to establish metadata standards in common, with TEI, GILS, MARC, RDF, and several flavors of the Dublin Core all vying for acceptance. Protocols for communicating among distributed collections include Z39.50, Dienst, and Stanford's START, which has died. An IBM study Fox cited reports that the key problem is intellectual property—not a technical or a user problem at all. Yet, we have heard representatives of the Copyright Office and the Patent and Trademark Office proclaim that our current laws are adequate for cyberspace.
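For readers who haven't looked closely at any of these standards, a Dublin Core description is essentially a small set of named elements attached to a resource—cataloging by another name. The sketch below shows roughly what one record contains, expressed here as a Python dictionary purely for illustration; the values and the identifier URL are invented, and real records are usually exchanged as HTML meta tags, XML, or RDF rather than as code.

```python
# Illustrative Dublin Core record using a subset of the fifteen elements.
# All values below are invented; the identifier is a placeholder, not a real URL.
record = {
    "Title": "Building Digital Libraries",
    "Creator": "A. Librarian",                        # hypothetical author
    "Subject": "digital libraries; metadata; cataloging",
    "Description": "Conference report on DL '99.",
    "Publisher": "Information Today",
    "Date": "1999-08",
    "Type": "Text",
    "Format": "text/html",
    "Identifier": "http://example.org/dl99-report",   # placeholder
    "Language": "en",
    "Rights": "Copyright the author",
}

# Print the record in the "DC.element" style used when embedding
# Dublin Core in HTML meta tags.
for element, value in record.items():
    print(f"DC.{element}: {value}")
```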
Interface design and other access questions seem to be of mounting interest to this group of attendees, as they should be. Wei Ding and Dagobert Soergel of the University of Maryland-College Park and Gary Marchionini of the University of North Carolina-Chapel Hill investigated combinations of text and images to determine which improve users' comprehension. They found that a combination of text plus pictures worked decidedly better than either one alone. I particularly liked the fact that their presentation and paper offered guidelines for this purpose. They also gave design guidelines for selecting "keyframes" from video, with an emphasis on frames with people, vivid colors, novel scenes, emotion, or symbols.
Tammara Combs and Ben Bederson (Human Computer Interaction Lab, University of Maryland) compared several screen layouts for thumbnail images as finding aids for slide or picture collections. They found that a 2-D grid works better than slicker-looking, revolving tiers of pictures, even though the 2-D grid is not as exciting visually. The DL '99 proceedings, which include screen shots, should be up shortly at http://www.acm.org/sigir.
New Projects and Papers
There were plenty of real applications to marvel at. New Zealand's University of Waikato stole the show with its Digital Library of Popular Music, which won the award for best paper. Hum a tune into a microphone, even off-key, and the system returns its best guesses at a match. Anyone who has been hummed at on the reference desk by a desperate tone-deaf patron would covet this system. At present, the collection comprises jazz, folk, and 100,000 MIDI tunes found on the Web. The inventors—David Bainbridge, Craig Nevill-Manning, Ian Witten, Lloyd Smith, and Rodger McNab—envision this eventually becoming an easy-to-access, comprehensive collection. They expect that within the next 3 years digital music libraries will become a popular end-user technology, used by CD-ordering sites, traditional recorded-music stores, and libraries alike. Their system is rich in metadata, so it can be searched even if you can't hum. The software is available for download at http://www.nzdl.org.
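The paper itself covers the matching details. As a rough illustration of how an off-key query can still find its tune, the sketch below reduces melodies to up/down/same pitch contours and ranks stored tunes by edit distance, a common query-by-humming technique. The data is invented, and this is a generic stand-in, not the Waikato system's actual code.

```python
# Illustrative sketch: rank stored tunes against a hummed query by converting
# each note sequence to an up/down/same (U/D/S) pitch contour and comparing
# contours with edit distance, so wrong keys and missed notes still match.

def contour(pitches):
    """Reduce a sequence of MIDI pitch numbers to a U/D/S contour string."""
    out = []
    for prev, curr in zip(pitches, pitches[1:]):
        out.append("U" if curr > prev else "D" if curr < prev else "S")
    return "".join(out)

def edit_distance(a, b):
    """Standard Levenshtein distance between two contour strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def rank_tunes(query_pitches, collection):
    """Return (title, distance) pairs, best matches first."""
    q = contour(query_pitches)
    scored = [(title, edit_distance(q, contour(p))) for title, p in collection.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical toy collection keyed by title; pitches are MIDI note numbers.
tunes = {
    "Greensleeves": [64, 67, 69, 71, 72, 71, 69, 66],
    "Frere Jacques": [60, 62, 64, 60, 60, 62, 64, 60],
}
# A transposed, slightly mangled hum still ranks the right tune first.
print(rank_tunes([63, 66, 68, 70, 71, 70, 68], tunes))
```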
To help users find related documents in large collections of unstructured, unindexed text, Steve Jones and Gordon Paynter, also of the University of Waikato, unveiled Kniles and Phrasier. Both are automatic linking tools that help users browse large, rapidly changing collections that lack a manually constructed set of cues such as a directory or subject-index terms. Kniles inserts hyperlinks into a document on the fly. To do this, it extracts and indexes keyphrases automatically using software called Kea, then applies to those keyphrases the same similarity calculations that most Web search engines use to determine whether an entire document matches a query. Phrasier, which has one of the nicest interfaces I've seen, extends Kniles with a three-paned interactive display. The first pane shows the document, with keyphrases in bold type; the second lists the keyphrases; the third lists related documents. What I liked was that the related documents are linked to the keyphrases you select, so choosing different pieces of a document yields a different list of related works. See http://www.cs.waikato.ac.nz/~stevej/Research/Phrasier for an explanation with illustrations.
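As an illustration of the kind of similarity calculation involved, the sketch below treats one document's extracted keyphrases as a query and ranks the other documents by TF-IDF cosine similarity. The documents and keyphrases are invented, and this is a generic stand-in for the idea, not the Kea or Kniles implementation.

```python
# Minimal sketch: rank documents against a keyphrase "query" using
# TF-IDF weights and cosine similarity, the staple of Web search engines.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {doc_id: list of terms}. Returns {doc_id: {term: weight}}."""
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))          # document frequency of each term
    n = len(docs)
    vectors = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        vectors[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical documents already reduced to their extracted keyphrases.
docs = {
    "d1": ["digital library", "metadata", "dublin core"],
    "d2": ["metadata", "cataloging", "dublin core"],
    "d3": ["query by humming", "midi", "music retrieval"],
}
vecs = tfidf_vectors(docs)
query = vecs["d1"]  # keyphrases of the document being read
related = sorted(((cosine(query, vecs[d]), d) for d in vecs if d != "d1"), reverse=True)
print(related)      # d2 ranks above d3, as expected
```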
NEC's Kurt Bollacker, Steve Lawrence, and C. Lee Giles, known for their articles on Web coverage, have developed a filtering application that is free to use. ResearchIndex, formerly called CiteSeer, finds new scientific articles that match a user's profile, even when keyword matches are not present. It also updates itself through user feedback, as well as by noting which documents the user looks at. ResearchIndex matches profiles using several similarity/relevance measures, including citations, links (as Clever and Google do), and context. For people who are trying to keep current in the sciences, this may be an excellent approach for finding new literature while eliminating irrelevant documents. I plan to try it once it expands beyond its current base of computer science articles.
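As a rough sketch of how several relevance measures might be blended for this kind of profile matching, the example below mixes citation overlap with a crude text-similarity score. The measures, weights, threshold, and records here are assumptions chosen for illustration, not ResearchIndex's actual formula.

```python
# Illustrative profile matching: combine citation overlap and a crude
# text-similarity measure into a single weighted relevance score.

def jaccard(a, b):
    """Overlap between two sets, e.g. the references cited by two papers."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def word_overlap(text_a, text_b):
    """Very rough textual similarity over word sets; a stand-in for richer measures."""
    return jaccard(text_a.lower().split(), text_b.lower().split())

def profile_score(candidate, profile, w_cite=0.6, w_text=0.4):
    """Weighted combination of citation overlap and abstract similarity."""
    return (w_cite * jaccard(candidate["citations"], profile["citations"])
            + w_text * word_overlap(candidate["abstract"], profile["abstract"]))

# Hypothetical records: a user profile built from papers already read,
# and a newly crawled candidate article.
profile = {"citations": {"salton1989", "kleinberg1998"},
           "abstract": "link analysis for web search and citation indexing"}
candidate = {"citations": {"kleinberg1998", "brin1998"},
             "abstract": "autonomous citation indexing of scientific literature"}

if profile_score(candidate, profile) > 0.2:   # illustrative threshold
    print("recommend to user")
```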
The Networked Digital Library of Theses and Dissertations (http://www.ndltd.org), presented by Ed Fox of Virginia Polytechnic Institute and State University, now has 2,000 graduate and undergraduate theses. It hopes to attract contributors worldwide, and has more than 59 member institutions from 13 countries. All students at Virginia Tech are required to submit their work, and they may choose to limit distribution to the campus or to make their work available worldwide. Virginia Tech has developed software, which is available to other institutions, to make the submission process easy. One notable advantage over paper submissions, aside from availability, is that these theses can contain many kinds of media, not just text.
Not surprisingly, computer science research has been one of the most electronically accessible subjects. The first project I knew of was NCSTRL (Networked Computer Science Technical Reference Library, http://www.ncstrl.org), a collaboration among several research universities to make their collections of computer science research reports searchable centrally. CoRR (Computing Research Repository, http://www.usc.edu/isd/elecresources/gateways/corr.html), a partnership of ACM, the Los Alamos e-Print archive, and NCSTRL, builds on this effort by using the distributed Dienst protocol developed for NCSTRL. However, any researcher may submit an article to the CoRR collection, and each subject section can be moderated. For this, they use software developed for the Los Alamos National Laboratory physics e-print archive.
Copyright is, of course, an issue for both NDLTD and CoRR. Apparently, preprints of articles that are to appear in publications of the ACM or the IEEE are permitted to be distributed through CoRR. Elsevier has so far given permission only for its journal Artificial Intelligence. All articles will be permanently archived, which is good news for the online community. Authors may submit new or corrected versions of their papers, but the older versions will be kept as well. CoRR now has approximately 2,000 papers, as well as all issues of the Journal of Artificial Intelligence Research.
SMETE (http://www.smete.org) is an ambitious plan for a distributed national digital library of high-quality science, math, engineering, and technical courseware and other resources. It will combine collaborative tools, a forum for peer review, and a central registry. The collection will be designed to support undergraduate education. It appears to be still very much in the planning stage, and is being developed under the sponsorship of the National Science Foundation.
Digital Libraries Coming of Age
There were 2 packed days of presentations at DL '99, as well as 2 days of additional workshops. I've only touched on a few highlights. What is apparent is that the use of information is becoming a prime topic, and that researchers are beginning to understand the information-finding process so that they can break up its complexity into smaller chunks and then address each separately. Questions of what is relevant when, and who uses information and how, have been added to plain technical wizardry. We have also moved beyond straight text to music, to image retrieval, and to combinations of resources. However, as Gary Marchionini remarked, we still don't know how to measure success.
This annual conference on Digital Libraries is one of two, and participants were uniformly eager to see it merge with the IEEE's Advances in Digital Libraries conference. Splitting a developing field in two weakens the attendance as well as the quality of the papers at either gathering. Participants find it difficult to attend both, and each group needs access to the new developments unveiled at the other. A merger would be a healthy move, preventing further fragmentation. ACM's DL 2000 will be held in San Antonio, Texas, June 2-7, in conjunction with Hypertext 2000. For more information, see http://www.dl00.org.