The number of search engines that concentrate on surfacing information from the deep web is growing. One company, which last week changed its name from Infovell to DeepDyve (www.deepdyve.com), has decided to expand into the consumer internet search market. It will also offer a fee-based version ($45 per month), called DeepDyve Pro, with advanced functionality such as dynamic foldering, visual clustering, and expanded filtering.
Setting itself apart from general web search engines such as Google, Yahoo!, and Live, DeepDyve searches deep web content in the life sciences, patents, and Wikipedia. It has indexed more than 500 million pages of content as of this month. Sources of its deep web information include MEDLINE, CRISP (a database of federally funded biomedical research projects), Clinical Trials, VAERS (FDA Vaccine Adverse Events Reporting System), World Health Organization Model List of Essential Medicines, and scientific journals published by Annual Reviews; BioOne; Mary Ann Liebert, Inc.; and SAGE Publications. Journals from Oxford University Press, MIT Press, and Hindawi were added in October. Its patents are sourced from both the U.S. and European patent offices. Some of the deep web information indexed by DeepDyve is freely available on the web, while other content requires a separate subscription to view the full text.
To expand beyond biosciences, DeepDyve added the open access site arXiv, which covers physics and computer science. It anticipates greater coverage of the physical sciences, particularly information technology, clean technology, and energy, by the end of 2008. Next on its agenda is business information.
Under its original name, Infovell, DeepDyve revealed its new search technology for finding biopharmaceutical information in the hidden or deep web at DEMOfall08. The video of the presentation by CEO William Park is on the DeepDyve website. Rather than give the standard product introduction, he used the real-life example of a friend who, diagnosed with possible vasculitis, failed to find information using Google but succeeded with DeepDyve. The company officially launched its public search engine in beta on Sept. 22, 2008.
The company’s technology is based on work its founders, Qianjin Hu and Tom Tang, did on mapping the human genome. It uses a KeyPhrase algorithm that indexes all words and phrases, rather than relying on a few keywords. DeepDyve uses a federated-search approach, and its large search box can accommodate queries of up to 25,000 characters in any language. That is vastly more than the 32 words allowed by Google and the very few words most searchers enter as a query at any web search engine. DeepDyve recommends cutting and pasting a paragraph of text into the search box. "Content is the query," says Park.
Park explained to me that the KeyPhrase technology is agnostic to language and to vertical search because it focuses on pattern matching rather than keyword matching. "It’s purely statistical, there are no semantics involved, no synonyms, no metadata. We get content and relevancy without taxonomy." Even with MEDLINE, he said, DeepDyve "tokenizes" the controlled vocabulary.
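DeepDyve has not published how KeyPhrase works, so the following is only a rough sketch of the general idea Park describes: purely statistical matching on all words and phrases, with no synonyms, metadata, or taxonomy. It assumes word n-grams of up to three words and a simple rarity weighting. The function names (`phrases`, `rank`), the n-gram length, and the scoring formula are all illustrative assumptions, not DeepDyve's implementation.

```python
import math
import re
from collections import Counter


def phrases(text, max_len=3):
    """Return every word n-gram (1 to max_len words) in the text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def rank(query, docs, max_len=3):
    """Order document ids by how many rarity-weighted phrases they share with the query.

    Hypothetical scoring: no synonyms or controlled vocabulary are consulted;
    only literal phrase overlap and phrase rarity across the collection count.
    """
    doc_phrases = {
        doc_id: Counter(phrases(text, max_len)) for doc_id, text in docs.items()
    }
    # Document frequency: in how many documents each phrase occurs.
    doc_freq = Counter()
    for counts in doc_phrases.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    query_counts = Counter(phrases(query, max_len))
    scores = {}
    for doc_id, counts in doc_phrases.items():
        score = 0.0
        for phrase, q_count in query_counts.items():
            if phrase in counts:
                # Rare phrases get a larger weight than common ones.
                rarity = math.log((1 + n_docs) / (1 + doc_freq[phrase])) + 1.0
                score += q_count * counts[phrase] * rarity
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "Hemochromatosis patients as voluntary blood donors",
    "b": "A survey of patients in primary care",
    "c": "Solar energy storage",
}
query = "how patients with familial hemochromatosis might be able to donate blood"
ranking = rank(query, docs)
```

In this toy collection, document "a" ranks first because it shares the rarer terms "hemochromatosis" and "blood" with the pasted sentence, while "b" shares only the more common "patients". The whole query is treated as content to match, which is the spirit of "content is the query."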
Once initial results appear, limited to the 250 most relevant, clicking "More like this" on a retrieved article uses KeyPhrase to narrow the results. Clicking on the article title takes you either to the webpage containing the full text of the article or to the webpage of the publisher where you can purchase the full text of the article. Results can be sorted by relevance, source, or date.
To the left of the search results are options for refining your search by source. DeepDyve Pro has additional refinement options—by content types and by an automated topic extraction list. The Pro version also allows you to view results not as a list but as a Venn diagram. Park stressed that the diagram view graphically shows the intersection of articles, thus surfacing multiple concepts, something not easily done with a static list.
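The point about intersections can be made concrete with plain set arithmetic. The sketch below is not DeepDyve's code. It assumes each concept has already been mapped to a set of matching article ids (the concept names and ids are invented for illustration) and computes which articles fall in each region of a Venn diagram.

```python
from itertools import combinations


def venn_regions(results_by_concept):
    """Map each combination of concepts to the articles matching exactly that combination."""
    concepts = sorted(results_by_concept)
    regions = {}
    for size in range(1, len(concepts) + 1):
        for combo in combinations(concepts, size):
            # Articles that match every concept in this combination...
            inside = set.intersection(*(results_by_concept[c] for c in combo))
            # ...minus articles that also match any concept outside it.
            others = [results_by_concept[c] for c in concepts if c not in combo]
            exclusive = inside - set().union(*others)
            if exclusive:
                regions[combo] = exclusive
    return regions


# Hypothetical concepts and article ids.
results = {
    "blood donation": {1, 2, 3},
    "hemochromatosis": {2, 3, 4},
    "survey": {3, 5},
}
regions = venn_regions(results)
```

Here article 3 lands in the region shared by all three concepts and article 2 in the overlap of "blood donation" and "hemochromatosis" only. A flat ranked list would not show that grouping directly.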
With the name change to DeepDyve came a new, cleaner user interface. The revamped site has a large search box, which gives a visual cue to input large blocks of text, with three columns beneath it. The first column shows the subscriber’s search history and the middle column contains collections the subscriber may have saved. The third column lists "Interesting Searches"; this is DeepDyve’s foray into community building.
According to Park, DeepDyve is more than a search engine—it’s "the world’s research engine." He continued, "We’re not competing with Google, which excels at single concept searches. Real researchers, however, need to do multiple concept searches."
Park also believes that the popularity algorithms employed by web search engines are inadequate for real research. That’s where the deep web comes in. It has sources that are hidden behind pay walls and are found on obscure, academic, and scientific websites. He claims the quality of the source data, ease of use, and the ability to sort, filter, share, and save queries and results are the three major benefits of DeepDyve. These attributes, he says, "translate into efficiency and cost savings."
If the free version of DeepDyve is designed for "information-savvy consumers," as its press release claims, then I fit that description. I’m not a medical researcher, however. When it comes to MEDLINE and the other sources indexed by DeepDyve, I’m a novice. I’m likely to search for "sore back" rather than "herniated disc," hoping that responses to sore back will lead me to better, more scientific terminology. In this respect, DeepDyve performs very well.
Other searchers have not been so fortunate. Kevin Ryan, at Search Engine Watch (http://searchenginewatch.com/3631665), commented that his searches uniformly received "no results." I’ve never had this happen. His first failed search was for information on "how patients with familial hemochromatosis might be able to donate blood." I cut and pasted that sentence into DeepDyve and retrieved 521,956 results, of which DeepDyve shows the first 250. The first hit, from the Canadian Journal of Gastroenterology, was titled "Hemochromatosis Patients as Voluntary Blood Donors." I’m no expert, but that sounds awfully close to the desired answer to me.
I then cut and pasted most of the abstract for that article into DeepDyve’s search box. This resulted in an almost totally irrelevant set of results. When the searcher cannot weight the terms and lets the algorithm take over, strange things can happen. In this case, DeepDyve picked up the concepts of "survey" and "patients" as more important than "hemochromatosis" and "donating blood." Although DeepDyve encourages lengthy entries in the search box, in my experience, you can overdo it. The other thing I’ve learned to do is uncheck Wikipedia for medical topic searches. It, too, can add wildly irrelevant results.
Chris Sherman (www.tinyurl.com/6knvu5) is more positive, saying, "it’s a great tool for serious searchers wanting to do comprehensive research. …" What both Ryan and Sherman were evaluating, however, was the free version. The Pro version, although offering the same search algorithms, is a powerhouse when it comes to refining search strategies and viewing results in graphical form.
Should serious researchers consider paying $45 every month for DeepDyve Pro? That probably depends upon whether they already have access to subscription-based products such as ProQuest, Dialog, STN, and Questel. For those in academic institutions, research laboratories, and corporate settings, DeepDyve Pro may well fall into the "nice to have" rather than the "need to have" category. As an adjunct way to search, the free version makes an excellent complement to the premium content sources professional researchers have been using for decades.