Once upon a time (in 1997), a guy who worked at the Excite web search company wanted to learn to program in Java. The guy was Doug Cutting, and the project he started there became the core code for a search engine designed from the ground up to work at web scale. After Excite became a portal and Doug decided that entrepreneurship was not for him, he named the engine Lucene and posted the source code with a license allowing anyone to work with it, change it, and sell it, royalty-free.
Fast-forward to 2009, and Lucene is an Apache project (http://lucene.apache.org). Version 2.9 is the most recent release. It now has "near-real-time" search (so content is indexed and searchable almost as soon as it is posted); faster wildcard prefix searches (card*) and a reverse filter for wildcard suffixes (*card); better filters for geographic location; and improved Arabic, Persian, and Chinese support.
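For readers curious why a "reverse filter" helps with suffix wildcards, here is a minimal sketch of the general idea in Python. It is not Lucene's code or API: the class `TinyTermIndex` and the helper `prefix_matches` are invented for illustration, and the word list is made up. The point is only that a sorted term list answers prefix queries quickly, so storing each term backward as well lets *card be answered as a prefix query for "drac".

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return every term in a sorted list that starts with `prefix`."""
    start = bisect_left(sorted_terms, prefix)  # first candidate position
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches


class TinyTermIndex:
    """Hypothetical toy term dictionary (an assumption, not Lucene's design)."""

    def __init__(self, terms):
        unique = set(terms)
        self.forward = sorted(unique)
        # Each term is also stored reversed, so a suffix becomes a prefix.
        self.backward = sorted(term[::-1] for term in unique)

    def search(self, pattern):
        # Trailing wildcard (card*): plain prefix lookup.
        if pattern.endswith("*") and "*" not in pattern[:-1]:
            return prefix_matches(self.forward, pattern[:-1])
        # Leading wildcard (*card): prefix lookup on the reversed terms.
        if pattern.startswith("*") and "*" not in pattern[1:]:
            hits = prefix_matches(self.backward, pattern[1:][::-1])
            return sorted(hit[::-1] for hit in hits)
        # No wildcard: exact lookup.
        i = bisect_left(self.forward, pattern)
        found = i < len(self.forward) and self.forward[i] == pattern
        return [pattern] if found else []


index = TinyTermIndex(["card", "cardboard", "cards", "care", "discard", "postcard"])
print(index.search("card*"))  # ['card', 'cardboard', 'cards']
print(index.search("*card"))  # ['card', 'discard', 'postcard']
```

The trade-off in this sketch is space for speed: the reversed list doubles the size of the term dictionary, but a suffix search no longer has to scan every term.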
The new version is more efficient and faster at both indexing and searching. But the most impressive part is simply that Lucene is improving and providing new features in reliable search, year after year. Being open source means that thousands of people from all over the world are working together to improve and extend the code, and as a group, they are sharing experiences and proposing new approaches. Sites that use Lucene include Netflix.com, NASA's Nebula cloud, and LinkedIn's search.
Naomi Dushay of Stanford University uses Blacklight (http://projectblacklight.org), an open source integrated library system (ILS) based on Lucene. She says, "Our librarians love that we can technically do anything when we have raw text/data to work with. ... Our users love that we can prioritize their collective needs. ... We've tweaked it a number of times to improve results based on user feedback. We can also make significant changes to the features and look and feel of our UI [user interface], as warranted." This kind of flexibility is not even available for many proprietary search engines for enterprise use or for libraries.
The Lucene family of open source search tools competes on many levels with proprietary search engines.
- Solr (http://lucene.apache.org/solr) is a sophisticated enterprise search engine, with distributed search on multiple servers for hundreds of millions of documents, very fast results in either the full text or specific fields (or both), excellent relevance ranking, and flexible XML results.
- Nutch (http://lucene.apache.org/nutch) is a web-scale crawler (aka robot) that can traverse links on the web or an intranet.
- Tika (with Apache POI; http://lucene.apache.org/tika) harnesses powerful indexing tools for HTML, PDF, office documents, and XML.
Open source has one huge advantage over commercial search software: access to the code. If there's a critical bug, there's no absolute dependence on a vendor to fix it. Any programmer can look at the code and may be able to solve the problem; it may take hiring an expert consultant or professional services company to fix, but even then, there are many more choices.
Another reason Lucene can measure well against proprietary search software is that it supports many languages, both human and computer. The new version improves Unicode support. The Snowball stemmer provides language analysis for Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish, and external contributors have supplied analyzers for Indic languages, Arabic, Thai, Polish, Chinese, Japanese, Korean, and many other languages.
In the same vein, Lucene has been implemented in various computer languages. The main Lucene implementation is in Java. There are ports to Python, C, Ruby, Perl, and C# .NET. These are not just translations of the code: they are designed to be index-compatible, so a program written in one language can create or edit an index, while the server could be running the optimized Java version.
Open source does not have to mean "hackers only." There are helpful guides, documentation, and user mailing lists on the Lucene site and on many other sites across the web. Each of the subprojects has a long list of companies and people offering paid support, from individual programmers with specialized expertise to hosting and configuration services to companies providing a full suite of enterprise-level support, much as Red Hat does for Linux. The support list for the main Lucene code is available at http://wiki.apache.org/lucene-java/Support.
Bess Sadler of the University of Virginia (original sponsors of the Blacklight ILS) writes, "North Carolina State's Endeca implementation was seen as inspirational by many, but also frustrating because it seemed financially unattainable." Blacklight, which is based on the free and open source Lucene/Solr code, is a flexible alternative that provides much of the same functionality. It can index many forms of data, including MARC records, TEI, EAD finding aids, digitized image metadata, and content from data silos that previously required tedious individual queries, such as a unique digital archive of antique coin images. She says that it has solved a significant problem: "[A]lthough the musical instruments used in a piece of music were catalogued in the MARC record, these fields were not indexed or searchable by the library's commercial OPAC" (Sadler, Elizabeth [Bess]. "Project Blacklight: A Next Generation Library Catalog at a First Generation University." Library Hi Tech, Vol. 27, No. 1, 2009, pp. 57-67; www.emeraldinsight.com, DOI: 10.1108/07378830910942919). Blacklight provides an interface and query parser for searches by instrument, as well as date ranges and relevance-ranked results.
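To make the "catalogued but not searchable" problem concrete, here is a small Python sketch of a fielded index. It is an illustration only, not Blacklight, Solr, or MARC code: the records, the field names, and the functions `build_index` and `search` are all invented. It shows that data can sit in every record and still be invisible to search until someone chooses to index that field.

```python
from collections import defaultdict

# Invented sample catalog records (assumptions, not real MARC data).
RECORDS = [
    {"id": 1, "title": "String Quartet in G", "year": 1790,
     "instruments": ["violin", "viola", "cello"]},
    {"id": 2, "title": "Cello Suite", "year": 1720,
     "instruments": ["cello"]},
    {"id": 3, "title": "Flute Sonata", "year": 1936,
     "instruments": ["flute", "piano"]},
]


def build_index(records, fields):
    """Map (field, term) -> set of record ids, for the chosen fields only."""
    index = defaultdict(set)
    for record in records:
        for field in fields:
            value = record.get(field, [])
            # Lists are indexed item by item; other values word by word.
            terms = value if isinstance(value, list) else str(value).split()
            for term in terms:
                index[(field, term.lower())].add(record["id"])
    return index


def search(index, records, field, term, year_from=None, year_to=None):
    """Return ids matching field:term, optionally limited to a date range."""
    by_id = {record["id"]: record for record in records}
    hits = []
    for record_id in sorted(index.get((field, term.lower()), set())):
        year = by_id[record_id]["year"]
        if year_from is not None and year < year_from:
            continue
        if year_to is not None and year > year_to:
            continue
        hits.append(record_id)
    return hits


# A catalog that indexes only titles cannot answer an instrument query,
# even though every record lists its instruments.
title_only = build_index(RECORDS, ["title"])
print(search(title_only, RECORDS, "instruments", "cello"))  # []

# Indexing the extra field makes the same data searchable.
full = build_index(RECORDS, ["title", "instruments"])
print(search(full, RECORDS, "instruments", "cello"))                  # [1, 2]
print(search(full, RECORDS, "instruments", "cello", year_from=1750))  # [1]
```

With open source code, adding a field to the index is a local configuration or programming decision rather than a feature request to a vendor.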
Dushay adds, "And, as has been demonstrated on so many discussion lists and in so many discussions, the group wisdom of the open source community improves our local efforts, let alone the contributions made by other institutions and individuals." Lucene and its family show that this can work for 10 years and more.
Lucene Example Sites
University of North Texas Digital Library
NASA Nebula Cloud Computing Platform
Bigsearch.ca, Canadian web search engine, http://results.bigsearch.ca/search.jsp?query=mosaics+supplies
Digg.com (search for banned books), http://digg.com/search?s=banned+books&sort=newest
Integrated library systems built on open source code