"Robots" in web design are not like R2-D2 or BattleBots: They are programs that follow links and gather data from webpages, mostly to populate the indexes for web search engines such as Yahoo!, Google, and Microsoft Live. These robots act like obsessive web surfers who click on every single link that they see, in their quest to find every possible page. A whole industry of search engine optimization (SEO) has grown to help web publishers make sure their pages are properly indexed. The Robots Exclusion Protocol (REP) is a way for site publishers to interact with search engine robots. It was developed first by an Excite Search engineer, Martijn Koster, back in the early days of the web. This protocol, described in detail at www.robots.org, never got as far as a standard or even an RFC (Request for Comments), but it was a de-facto success: all the large search engines work with it, as well as most of the smaller ones, including enterprise and site search tools.
The 1990s version of the REP allowed websites to include a page, "robots.txt," in the host root directory to indicate which pages and paths of the site should be ignored by the search engine robots. The one and only directive was "Disallow," which tells the robot to avoid those URLs, even though the pages might have incoming links from the same site or external sites. This can be specified for specific robot crawlers (such as Googlebot, MSNBot, or Yahoo! Slurp) or applied to all robots, which assume that everything is allowed unless specifically disallowed in the robots.txt file. This system does allow sites to specify that a subdirectory such as "listings/byauthor" should not be crawled. But it is limited in scope and can’t handle more complex situations, such as ignoring all .pdf files.
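As a rough illustration of the original Disallow-only scheme, the following Python sketch feeds a small robots.txt to the standard library's urllib.robotparser and asks whether a given robot may fetch a path. The file contents, the www.example.com host, the SomeBot name, and the allowed() helper are all hypothetical and chosen only for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /listings/byauthor/

User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    # A robot with no group of its own falls back to the "*" rules.
    print(allowed("SomeBot", "/listings/byauthor/smith.html"))
    print(allowed("SomeBot", "/index.html"))
    # A robot named in the file follows only its own group.
    print(allowed("Googlebot", "/private/report.html"))
```

Anything not matched by a Disallow line is treated as allowed, which is the default the protocol assumes.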
Ten years later, the largest three search companies have agreed on a base set of new features for robots.txt, with more flexibility for site publishers to set "Allow" as well as disallowed URL paths, to use wildcards in these paths (such as *pdf), and to link to a sitemap XML file. All of the top search engines were supporting these features before, but with slightly different syntax and options: This agreement allows site publishers to use one set of directives instead of three.
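To make the wildcard and Allow features concrete, here is a minimal Python sketch of how a crawler might evaluate such rules. It assumes that "*" matches any run of characters, that a trailing "$" anchors the end of the URL, and that the longest matching pattern wins with Allow winning ties; that precedence rule is an assumption for this example and not something the announcement spells out. The function names and sample rules are hypothetical, and the Sitemap line is not handled.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Decide whether path may be crawled under (directive, pattern) rules.

    Assumed precedence: longest matching pattern wins, Allow wins ties,
    and a path that matches nothing is allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern restricts nothing
        if rule_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# Hypothetical rules: block all PDFs and an archive, but reopen one subfolder.
RULES = [
    ("Disallow", "/*.pdf$"),
    ("Disallow", "/archive/"),
    ("Allow", "/archive/public/"),
]
```

With these rules, a URL ending in .pdf is blocked, while a page under /archive/public/ is crawlable even though the rest of /archive/ is not.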
There are real problems with rogue robot spiders overloading a site or inserting spam into guestbook forms, blog comments, or search fields. Therefore, each service has pledged to allow "remote authentication" so a site publisher can check the IP address to see if it’s in the search engine’s address block. This makes it easier to identify and block spiders that masquerade as one of the search indexing robots but are really rogues.
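The address-block check described here could look something like the Python sketch below, which uses the standard ipaddress module. The robot name ExampleBot and its address block are hypothetical (192.0.2.0/24 is a range reserved for documentation); a real site would substitute whatever ranges each search engine actually publishes.

```python
import ipaddress

# Hypothetical published address blocks, keyed by the robot's claimed name.
# 192.0.2.0/24 is reserved for documentation and stands in for a real range.
PUBLISHED_BLOCKS = {
    "ExampleBot": [ipaddress.ip_network("192.0.2.0/24")],
}


def is_genuine(claimed_agent: str, remote_ip: str) -> bool:
    """Return True if remote_ip lies inside a block published for the robot."""
    addr = ipaddress.ip_address(remote_ip)
    blocks = PUBLISHED_BLOCKS.get(claimed_agent, [])
    return any(addr in block for block in blocks)
```

A request that claims to be ExampleBot but arrives from outside the listed block would be treated as a masquerading spider.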
The other way web publishers can interact with search robots is to insert directives into webpages, but until this new version of the REP was introduced, there was no standard way to do this in non-HTML documents. The new X-Robots-Tag directive, added to the HTTP header, is now an official option for sending directives for text, PDF, and office document files.
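As one possible illustration, a server-side script could attach the X-Robots-Tag header to non-HTML responses along the lines of this Python sketch. The list of file extensions, the default directives, and the robots_headers() helper are assumptions made for the example, not part of the announcement.

```python
import os

# Hypothetical set of non-HTML file types that should carry the header.
NON_HTML_EXTENSIONS = {".pdf", ".txt", ".doc", ".xls", ".ppt"}


def robots_headers(path, directives=("noindex", "nofollow")):
    """Return extra HTTP response headers for the file at the given path.

    Non-HTML files get an X-Robots-Tag header; other files get none,
    on the assumption that HTML pages carry a Meta ROBOTS tag instead.
    """
    extension = os.path.splitext(path)[1].lower()
    if extension in NON_HTML_EXTENSIONS:
        return {"X-Robots-Tag": ", ".join(directives)}
    return {}
```

For a PDF, the function returns a header such as "X-Robots-Tag: noindex, nofollow", while an HTML page gets no extra header.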
The older version of the REP provided the tag META, with the name ROBOTS and the attributes NOINDEX and NOFOLLOW, separately or together. These control whether the robot should use the page text for the search index or follow the links on the page, respectively. In the new REP, the Meta ROBOTS tag attributes can now include NOSNIPPET, which tells search engines not to display the match words in context on the search results page, and NOARCHIVE, which tells the search engines not to offer a cached copy of the page. This is particularly useful for pages that constantly change, such as displays of headlines. NOODP tells the search engines not to show any information from the Open Directory Project for this page on search results. All of these changes give website publishers much more control over their search results.
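To show how a robot might read these page-level directives, the following Python sketch uses the standard html.parser module to collect the values of a Meta ROBOTS tag. The class and function names are hypothetical, and the sketch simply lowercases and splits the content attribute on commas.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of lowercase robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

A robot could then skip indexing when "noindex" is in the returned set, or suppress the cached copy when "noarchive" is present.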
Not yet included in the REP are some additional robots.txt directives that are supported by some but not all of the search engines. The most important of these is "Crawl-Delay," which allows the site publisher to suggest how frequently the crawlers should send page requests to the server, with suggested options ranging from 0.5 seconds to 20 seconds (supported by Yahoo! and MSN). In addition, Google has implemented a "Noindex" directive for robots.txt, which means the robot should follow links in pages with this path but not index the content; this is not supported by the other robots.
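For a sense of how a crawler might honor Crawl-Delay, this Python sketch reads the value with the standard library's urllib.robotparser. The robots.txt contents and the OtherBot name are hypothetical, and a whole-number delay is used for simplicity.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks one crawler to slow down.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds the named robot is asked to wait between requests,
# or None when no delay applies to it.
slurp_delay = parser.crawl_delay("Slurp")
other_delay = parser.crawl_delay("OtherBot")
```

A polite crawler would sleep for the returned number of seconds between page requests and fall back to its own default when the value is None.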
Yahoo! invented a "robots-nocontent" attribute (misleadingly called a tag), which can be used as part of a CSS class name, to mark navigation and boilerplate content on a page so the search engine doesn’t index it. Many site search engines use a pair of pseudo-tags to delimit unwanted text, and there was an attempt to create HTML class-level controls as Microformats (http://microformats.org/wiki/robots-exclusion), but neither of those is supported by the big web search engines.
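A simplified Python sketch of how an indexer might respect the robots-nocontent class follows. It assumes reasonably well-formed HTML, skips all text nested inside any element whose class list contains robots-nocontent, and uses a hypothetical, incomplete list of void elements; it is an illustration and not a description of how Yahoo! actually implements the feature.

```python
from html.parser import HTMLParser


class IndexableText(HTMLParser):
    """Collect page text, skipping elements marked robots-nocontent."""

    # Elements with no closing tag; an illustrative, incomplete list.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # nesting depth inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def indexable_text(html: str) -> list:
    """Return the text chunks that remain after excluded blocks are dropped."""
    parser = IndexableText()
    parser.feed(html)
    return parser.chunks
```

Given a page with a navigation block marked class="nav robots-nocontent" followed by an article paragraph, only the paragraph text would be returned for indexing.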
Danny Sullivan, a leading commentator on web search, says, "I don’t think there were many changes other than the three major search engines coming together to say this is where we agree and disagree. But that’s an important first step to getting to the big jump we need—for them to agree in the places where they conflict. We’re long overdue for that type of standardization, and I’m looking forward to seeing it come from them."
The revised REP doesn’t address online publisher policies on appropriate use of their materials, so institutions and content creators must make a binary decision for each URL: Should it be indexed or not? One proposal to address this is the Automated Content Access Protocol (ACAP; www.the-acap.org), which focuses more on the rights of publishers than on those of readers or search engines and incorporates digital rights management. (For background on ACAP, see the Feb. 14, 2008, NewsBreak http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=40927.)
The search engines banding together and the publishers proposing their own solutions make it likely that there will be more conflict in the future between these two camps—with end users and libraries stuck in the middle.
New REP Announcement URLs