For more than 20 years, researchers have worked to conceptualize methods for making web searching more comprehensive—going beyond the surface sites that are easily accessed by today’s search engines to truly create a system of universal access to the world’s knowledge. The task is proving to be far more complicated than computer scientists had thought. “The existing approaches,” notes one recent analysis, “lack [the ability] to efficiently locate the deep web which is hidden behind the surface web.”
Today, it is estimated that more than 65% of all internet searches in the U.S. are done using Google. Both Bing and Yahoo continue to be major players as well.
Avoiding the Dark Side
We all want searching to be more comprehensive, targeting the exact information that we need with the least amount of effort and frustration. However, nestled near the abyss of the information ocean is the dark web, a space where hackers and criminals create fake sites and conduct their commerce. The dark web continues to frustrate efforts to control illegal activity, including credit scams, drug sales, and the exploitation of international relations. Clearly, this isn’t what we are looking for in information retrieval.
Based on its analysis of available data, Smart Insights estimates that more than 6.5 billion web searches are made each day around the globe. Current hacking scandals are making it clear that safe searching is about more than just protecting children from predators. A variety of search options have been designed with privacy in mind:
- DuckDuckGo, which bills itself as “the search engine that doesn’t track you”
- Gibiru, which offers “Uncensored Anonymous Search”
- Swisscows, a Switzerland-based option that calls itself “the efficient alternative for anyone who attaches great importance to data integrity and the protection of privacy”
- Lukol, which works as a proxy server from Google and removes traceable entities
- MetaGer, a German search engine that removes any traces of your electronic footprints and also allows for anonymous linking
- Oscobo, a British product that does not track you and provides a robust option of search types, including images, videos, and maps
And there are others as well, demonstrating that concern for privacy over profits is creating reliable solutions for searchers across the globe.
Google and other standard web search engines can be infuriating when you’re trying to do intensive background research, due to their lack of deep searching into the content of the databases and websites they retrieve. Given the amount of information on the web, this isn’t surprising, but we need better performance if we are truly to rely on web searching as a legitimate option for research. Information professionals are used to the structured searching of verifiable information. What is missed is the deep web content: the “meat” of information that searchers need and expect.
As researchers Andrea Cali and Umberto Straccia noted in a 2017 article, “the Deep Web (a.k.a. the Hidden Web) is the set of data that are accessible on the Internet, usually through HTML forms, but are not indexable by search engines, as they are returned only in dynamically-generated pages.” This distinction has made reaching the content in these databases very difficult. The most successful sites to date have targeted specific types of hidden data.
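To make that definition concrete, here is a minimal sketch of what a link-following crawler sees on a page. It uses only Python's standard `html.parser` module; the sample page, its field names, and the `LinkAndFormFinder` class are invented for illustration. The point is that ordinary links can be followed and indexed, while anything behind the form exists only after someone submits a query.

```python
from html.parser import HTMLParser


class LinkAndFormFinder(HTMLParser):
    """Collect plain links and HTML forms found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []        # URLs a link-following crawler can reach
        self.forms = []        # forms whose results exist only after a query
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "form":
            self._in_form = True
            self.forms.append({
                "action": attrs.get("action", ""),
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            })
        elif tag in ("input", "select", "textarea") and self._in_form:
            # Only named fields are sent to the server with a query.
            if attrs.get("name"):
                self.forms[-1]["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


# Hypothetical page: one ordinary link and one database search form.
SAMPLE_PAGE = """
<a href="/about">About</a>
<form action="/records/search" method="post">
  <input type="text" name="last_name">
  <select name="state"><option>NY</option></select>
  <input type="submit" value="Search">
</form>
"""

finder = LinkAndFormFinder()
finder.feed(SAMPLE_PAGE)
print("Crawlable links:", finder.links)
print("Forms needing a query:", finder.forms)
```

A crawler that only follows the links never learns what sits behind the form. A deep web tool would have to guess sensible values for each field and submit them, which is where much of the difficulty lies.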
Instant Checkmate is a fee-based service for anyone “researching arrest records, phone numbers, addresses, demographic data, census data, or a wide variety of other information.” Working largely from public data, it retrieves results from public databases containing arrest reports, court records, government license information, social media profiles, and more. By doing so, it claims to help “thousands of Americans find what they’re looking for each and every day.” Searches seem to take forever, which, given the size of the databases it is searching, isn’t unreasonable. The data is encrypted to protect the searcher’s identity. Reports are far more detailed than anything we could find more quickly elsewhere. Similar services include MyLife, Pipl, and Yippy.
Information professionals are perhaps most familiar with the Internet Archive’s Wayback Machine, the front-end search engine to more than 308 billion archived webpages and link addresses to even more. The Internet Archive itself takes up 30-plus petabytes of server space. For comparison, a single petabyte of data would fill 745 million floppy disks or 1.5 million CD-ROMs.
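The floppy-disk and CD-ROM comparison can be checked with a few lines of arithmetic. The sketch below rests on assumptions the article does not state: a binary petabyte (2^50 bytes), a 1.44 MB floppy counted as 1.44 × 2^20 bytes, and a 700 MB CD-ROM. Under those assumptions the figures come out close to the ones quoted.

```python
# Assumed capacities (not given in the article).
PETABYTE = 2**50                 # one binary petabyte, in bytes
FLOPPY = 1.44 * 2**20            # a "1.44 MB" floppy disk, in bytes
CD_ROM = 700 * 2**20             # a 700 MB CD-ROM, in bytes

floppies_per_pb = PETABYTE / FLOPPY
cds_per_pb = PETABYTE / CD_ROM

print(f"Floppy disks per petabyte: {floppies_per_pb / 1e6:.0f} million")
print(f"CD-ROMs per petabyte: {cds_per_pb / 1e6:.1f} million")

# Scale up to the 30 petabytes cited for the Internet Archive.
print(f"Floppy disks for 30 PB: {30 * floppies_per_pb / 1e9:.1f} billion")
```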
And that’s just the size of the information that can be searched. Google Scholar and Google Books are two search engines that are working to dig deeper into the content of websites for scholarly information. Searchers can do their own searching by using the “site:” command; however, this is a tedious and hit-or-miss process, since these search engines are only able to scan the indexed pages linked to some domain homepages.
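For searchers who repeat “site:” queries across many domains, the query strings can be generated rather than typed. This is only a sketch: `site_query_url` is a hypothetical helper, the domains are placeholders, and it merely builds a search URL with the standard library's `urlencode`. It does not fetch or parse results, and it does nothing about the indexing limits described above.

```python
from urllib.parse import urlencode


def site_query_url(domain, terms):
    """Build a Google search URL restricted to one domain via 'site:'."""
    query = f"site:{domain} {terms}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Placeholder domains; substitute the sites you actually need to check.
for domain in ["example.org", "example.edu"]:
    print(site_query_url(domain, "annual report"))
```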
Deep Web Search Engines
A variety of search engines are working to provide improved access to key information otherwise hidden inside websites or behind paywalls. Methods to get to this deep web are still under development and are not regulated to protect users from unethical practices. Deep web search engines can uncover more information and links, returning an estimated 500% more information than traditional search engines.
Examples of today’s search engines that are designed to reach these deep web sites include:
None of these is an exceptional resource that solves information professionals’ problems with deep searching. These websites are taken down very frequently, and others pop up in their place, so none of these systems necessarily has staying power.
To thoroughly access deep web information, you’ll need to install and use a Tor browser, which also provides the basis for access to the dark web. The real issue facing researchers is how to control the search process in these huge, individually structured databases.
Creating a Stable Deep Web Search Tool Is Harder Than You Might Think
In August 2017, a deep web search engine was being touted as bringing better quality deep searching while promising to protect the privacy of users. DeepSearch from TSignal was to be the focus of this NewsBreak; however, it recently disappeared from the web—perhaps it was acquired by another company or taken down for more development and testing. This has happened before and probably will happen again. As researchers noted in a 2013 article, “While crawling the deep-web can be immensely useful for a variety of tasks including web indexing and data integration, crawling the deep-web content is known to be hard.”
Earlier this year, two Indian researchers reported on their goal of creating a dynamic, focused web crawler that would work in two stages: first, to collect relevant sites, and second, for in-site exploring. They noted that the deep web itself remains a major stumbling block because its databases “change continuously and so cannot be easily indexed by a search engine.”
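The two-stage idea can be illustrated with a toy sketch. This is not the researchers' crawler: the in-memory `TOY_WEB` dictionary, the keyword test for relevance, and the function names are all assumptions made for illustration. Stage one follows links from seed pages and keeps sites whose pages mention the topic; stage two stays inside each kept site and looks for pages that carry a search form.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical miniature web: URL -> page text, outgoing links, form flag.
TOY_WEB = {
    "http://a.example/": {"text": "public court records",
                          "links": ["http://a.example/search", "http://b.example/"],
                          "has_form": False},
    "http://a.example/search": {"text": "search court records",
                                "links": [], "has_form": True},
    "http://b.example/": {"text": "cooking recipes",
                          "links": ["http://b.example/find"], "has_form": False},
    "http://b.example/find": {"text": "find recipes",
                              "links": [], "has_form": True},
}


def site_of(url):
    return urlparse(url).netloc


def locate_sites(seeds, topic_words):
    """Stage 1: follow links from the seeds and keep on-topic sites."""
    entry_points, seen, queue = {}, set(), deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen or url not in TOY_WEB:
            continue
        seen.add(url)
        page = TOY_WEB[url]
        if set(page["text"].split()) & topic_words:
            # Remember the first relevant URL seen for each site.
            entry_points.setdefault(site_of(url), url)
        queue.extend(page["links"])
    return entry_points


def explore_site(entry_url):
    """Stage 2: stay within one site and collect pages that have forms."""
    site = site_of(entry_url)
    forms, seen, queue = [], set(), deque([entry_url])
    while queue:
        url = queue.popleft()
        if url in seen or url not in TOY_WEB or site_of(url) != site:
            continue
        seen.add(url)
        page = TOY_WEB[url]
        if page["has_form"]:
            forms.append(url)
        queue.extend(page["links"])
    return forms


sites = locate_sites(["http://a.example/"], {"court", "records"})
for site, entry in sites.items():
    print(site, "->", explore_site(entry))
```

Even in this toy version, the hard part is missing: a real crawler would still have to fill in and submit each form it finds, and repeat the work as the underlying databases change.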
The deep web’s complications are many: query design, requirements for some level of user registration, variant extraction protocols and procedures, and more, to say nothing of the linguistic complications that arise as global searching confronts meanings and connections of terminology across disciplines and languages. Today’s open web search is so ubiquitous that we rarely think about the potential complications; however, the deep web is another animal, and some researchers question whether it would be possible to bridge this divide without doing much work to modify the “present architecture of [the] web.”
Information professionals can easily see the need for better search techniques to handle the complex, evolving nature of the web—and increasingly, so can other professionals. Psychiatrists studying addiction have initiated their own efforts to better access and study the deep web and dark web due to their role in the “marketing or sale or distribution” of drugs and developing an “easily renewable and anarchic online drug-market [which] is gradually transforming indeed the drug market itself, from a ‘street’ to a ‘virtual’ one.”
What can we do as we wait for a better solution to web search? Reliable scholarly databases can easily be supplemented with existing search sites and mega-search engines. Information professionals have always been aware of the complex nature of search, and today, computer scientists and web designers are confronting these realities as well. There is no ultimate solution—which, if nothing else, guarantees the future of our field.