A new study of search engine coverage of the indexable Web states that, as of February 1999, only 42 percent of the Web was indexed by all the search engines combined. The study, by Steve Lawrence and C. Lee Giles of NEC Research Institute, appeared in the July 8, 1999, issue of Nature (pp. 107-109). Lawrence and Giles made headlines a year ago with their study of overlap among search engines, which showed that each Web search engine indexed a fairly discrete corner of the Web, with little overlap among them. In that study, Lawrence and Giles reported that the combined coverage by all Web search engines was about 60 percent of the Web. Their conclusion from comparing the two studies is that the Web search engines are not keeping pace with the growth of the Web.
Using randomly generated Web addresses, the authors estimate a total of 16 million Web servers in existence. Of these, they estimate that roughly 2.8 million are publicly accessible and present indexable information for Web search engines to collect. There are, they say, 800 million publicly indexable Web pages, accounting for 6 terabytes of text (not image) data. Much of the Web is not indexable, residing behind query boxes or in non-indexable databases, or specifying that Web crawlers and spiders may not index the server's contents (robots exclusion policy). (A study we did at Datasearch in 1997 indicated that approximately 50 percent of the Web was not indexable.)
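The extrapolation behind figures like these can be sketched in a few lines. This is only an illustration, not the authors' procedure: it assumes the "randomly generated Web addresses" are random IPv4 addresses, and the sample counts are hypothetical values chosen for the example. The 6 terabytes and 800 million pages are the figures reported above, with a terabyte taken as 10^12 bytes.

```python
# Sketch: extrapolating a server count from a random sample of addresses.
# Assumption: addresses are drawn uniformly from the IPv4 address space.
# The sample figures below are hypothetical, chosen only for illustration.

IPV4_ADDRESS_SPACE = 2 ** 32  # total number of possible IPv4 addresses


def estimate_total_servers(sampled: int, responding: int,
                           address_space: int = IPV4_ADDRESS_SPACE) -> float:
    """Scale the fraction of sampled addresses that answered as Web servers
    up to the whole address space."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return address_space * responding / sampled


def average_text_per_page(total_bytes: float, pages: float) -> float:
    """Mean number of text bytes per page."""
    return total_bytes / pages


# Hypothetical sample: 1,000,000 addresses probed, 3,725 answered.
estimated_servers = estimate_total_servers(1_000_000, 3_725)

# Figures reported in the article: 6 terabytes of text, 800 million pages
# (assuming decimal terabytes, 10**12 bytes).
bytes_per_page = average_text_per_page(6 * 10**12, 800 * 10**6)

print(f"Estimated servers: {estimated_servers / 1e6:.1f} million")
print(f"Average text per page: {bytes_per_page / 1000:.1f} KB")
```

With these inputs the sketch gives roughly 16 million servers and an average of 7.5 KB of text per indexable page.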
The profile of the Web indexed by all search engines together was as follows:
- 83 percent commercial sites
- 6 percent scientific or educational sites
- 2.8 percent health
- 2.3 percent personal
- 1.9 percent societies
- 1.5 percent pornographic
- 1.4 percent community
- 1.2 percent government
- 0.8 percent religion
Only about a third of Web servers contained metadata on their home pages, and only 0.3 percent used the Dublin Core. The lack of standardized tags was quite evident: The authors found 123 distinct tags.
One disturbing, but not surprising, finding is that "popular sites," sites that have many links to them, are much more likely to be indexed than sites that have few links to them. Since Web spiders follow links in order to discover new sites, it is harder for a site with no links to it to be found in a Web crawl. The study also found that search engines are lagging behind, taking months to index a new page. The median age of "new" pages was 57 days. Despite its smaller size, Infoseek was found to have a higher probability of indexing random new sites. This bears out another recent study at the Wharton School that calls Infoseek an "overachiever."
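Why link-following favors well-linked sites can be seen in a toy crawl. The sketch below is a plain breadth-first traversal over a hypothetical link graph (the site names and links are invented for illustration). It is not any search engine's actual crawler.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "popular.example": ["a.example", "b.example"],
    "a.example": ["popular.example", "b.example"],
    "b.example": ["popular.example"],
    "orphan.example": ["popular.example"],  # links out, but nothing links in
}


def crawl(seeds, links):
    """Breadth-first crawl: return the set of pages reachable from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


discovered = crawl(["a.example"], LINKS)
missed = sorted(set(LINKS) - discovered)

print("Discovered:", sorted(discovered))
print("Never reached:", missed)
```

Because no page links to `orphan.example`, the crawl never reaches it, no matter which of the other pages serves as the seed.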
Ranking the Search Engines
Lawrence and Giles used 1,050 real queries from NEC researchers to test Web search engine coverage. Of the 42 percent of the Web covered by the search engines combined, here is each engine's share of that coverage:
- Northern Light 38.3 percent
- Snap 37.1 percent
- AltaVista 37.1 percent
- HotBot 27.1 percent
- Microsoft 20.3 percent
- Infoseek 19.2 percent
- Google 18.6 percent
- Yahoo! 17.6 percent
- Excite 13.5 percent
- Lycos 5.9 percent
- Euroseek 5.2 percent
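If the percentages above are read as shares of the combined 42 percent (an assumption based on the way the list is introduced), each engine's coverage of the whole indexable Web follows from a single multiplication. A minimal sketch:

```python
COMBINED_COVERAGE = 0.42  # combined share of the Web indexed, per the study

# Each engine's share of the combined coverage, in percent, as listed above.
RELATIVE_SHARE = {
    "Northern Light": 38.3,
    "Snap": 37.1,
    "AltaVista": 37.1,
    "HotBot": 27.1,
    "Microsoft": 20.3,
    "Infoseek": 19.2,
    "Google": 18.6,
    "Yahoo!": 17.6,
    "Excite": 13.5,
    "Lycos": 5.9,
    "Euroseek": 5.2,
}


def absolute_coverage(relative_percent: float,
                      combined: float = COMBINED_COVERAGE) -> float:
    """Convert a share of the combined coverage (in percent) into a share of
    the whole indexable Web (in percent)."""
    return relative_percent * combined


for engine, share in RELATIVE_SHARE.items():
    print(f"{engine:15s} {absolute_coverage(share):5.1f} percent of the Web")
```

On this reading, even the largest index, Northern Light, reaches only about 16 percent of the indexable Web.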
While these results are startling, they may not give a complete picture of Web contents or research. Lawrence and Giles used basic Boolean queries that required exact matches. In other words, they asked for a Boolean AND. They turned off truncation, and "transformed queries to the advanced syntax for AltaVista." As any researcher knows, insisting on exact matches greatly diminishes the set of retrieved documents. It increases precision, but decreases recall. Turning off truncation and concept searching diminishes recall further. Thus, we might expect that coverage of these topics is considerably larger than this study would indicate. In addition, the study appeared to classify as "science/education" only sites that were university, college, or research laboratory sites. This eliminates large, valuable archives from publishers, scholarly societies, or commercial entities such as the Special Collection at Northern Light.
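The precision and recall trade-off described here can be illustrated with a toy example. The documents, the query, and the relevance judgments below are all hypothetical, and the matching is a simple whole-word test with no truncation. It is not the query setup used in the study.

```python
# Hypothetical mini-corpus and relevance judgments, for illustration only.
DOCS = {
    1: "web search engine coverage study",
    2: "search engines index the web",
    3: "coverage of web search tools",
    4: "gardening tips for spring",
    5: "engine repair manual",
}
RELEVANT = {1, 2, 3}  # documents a searcher would consider on-topic


def boolean_search(terms, docs, mode="and"):
    """Return IDs of documents matching all terms ("and") or any term ("or").
    Matching is on whole words, so "engine" does not match "engines"."""
    combine = all if mode == "and" else any
    return {doc_id for doc_id, text in docs.items()
            if combine(term in text.split() for term in terms)}


def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


query = ["search", "engine"]
strict = boolean_search(query, DOCS, "and")
loose = boolean_search(query, DOCS, "or")

print("AND:", sorted(strict), precision_recall(strict, RELEVANT))
print("OR: ", sorted(loose), precision_recall(loose, RELEVANT))
```

In this toy case the strict AND query retrieves one document (perfect precision, one-third recall), while the looser OR query finds all three relevant documents at the cost of one irrelevant hit. Document 2 is missed by the AND query only because "engines" is not an exact match for "engine," which is the effect of turning off truncation.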
"One of the great promises of the web is that of equalizing the accessibility of information," conclude the authors. But, they state, the search engines "typically index a biased sample of the web." They point to the overemphasis on popular pages, or pages with many links, and suggest that valuable new research is not found by the researcher who needs it because of this propensity. Tools such as Direct Hit or Google use popularity or number-of-links measures to improve the precision and quality of their searches.
Lawrence and Giles call for better and more equal coverage for research and educational information. The question of what to include in order to serve the public is an old one. Librarians deal with it constantly, under the rubric of "selection." Today's search engines appear to be working toward providing less information of higher quality (some good answers) instead of complete coverage regardless of quality. This is a direct response to screams from the public of information overload. Perhaps there is a place for broad coverage in narrow fields for those who want "all" the answers instead of just some good ones.