AOL Is Caught in Its Own Long Tail

In terms of press relations, Aug. 6, 2006, began an exceedingly challenging week for America Online and its parent company, Time Warner. On that Sunday, the world discovered that apparently well-meaning AOL employees took detailed log files covering 36,389,567 searches performed by AOL members and published them on the Internet. The searches took place between March 1 and May 31, 2006. AOL researchers published the log files, they said, to benefit the academic community.

Reuters reported that the data set was online for about 10 days before bloggers discovered it. Writing in his TechCrunch blog, Michael Arrington quickly published a post labeled "AOL Proudly Releases Massive Amounts of Private Data" (http://www.techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data). Arrington presciently noted the following:

The utter stupidity of this is staggering. AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

AOL took down the data set at approximately midnight EST on Aug. 6, but not before numerous mirror sites sprang up worldwide. (See, for instance, http://www.gregsadetsky.com/aol-data.) There is no evidence that AOL has tried to shut down these mirrors.

A firestorm of criticism erupted, first in the blogosphere and then in the mainstream media. AOL spokesperson Andrew Weinstein did not mince words. "This was a screw up, and we're angry and upset about it. It was an innocent-enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant."

The incident could not have come at a worse time for AOL, which is trying to recast itself as an advertising-supported content provider catering to broadband users. In the same week, as Weinstein and colleagues strove to respond to the incident, AOL announced a free antivirus service, free 5-gigabyte storage for all users, and a free personalized e-mail service.

The data set contains these elements: an anonymous user ID number in lieu of the AOL screen name, the query the user typed in (minus punctuation), the time stamp when the query took place, the rank of the query on the hit list if the user "clicked through," and the host name of the destination if the user clicked through. The data set represents searches by users of AOL's proprietary client software, not general visitors to aol.com.
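To make those five elements concrete, here is a minimal Python sketch of how one log record might be represented and parsed. The tab-separated layout and the field names (`user_id`, `query`, `timestamp`, `click_rank`, `click_host`) are assumptions chosen for illustration; the description above lists the elements but not the file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchRecord:
    user_id: int                 # anonymous ID substituted for the screen name
    query: str                   # query text, punctuation already stripped
    timestamp: str               # when the query took place
    click_rank: Optional[int]    # rank of the clicked result, if any
    click_host: Optional[str]    # host name of the destination, if any


def parse_record(line: str) -> SearchRecord:
    """Parse one tab-separated line (an assumed layout) into a SearchRecord."""
    fields = line.rstrip("\n").split("\t")
    # Pad so that records without a click-through still yield five fields.
    fields += [""] * (5 - len(fields))
    user_id, query, timestamp, rank, host = fields[:5]
    return SearchRecord(
        user_id=int(user_id),
        query=query,
        timestamp=timestamp,
        click_rank=int(rank) if rank else None,
        click_host=host or None,
    )
```

Note that the last two fields are optional: they are filled in only when the user clicked through to a result.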

Most press reports misstated the number of searches in the data set, saying it comprised 20 million queries. The AOL announcement said that there were 21,011,340 unique queries out of 36,389,567 records in the data set.
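The gap between the two figures is the difference between counting every record and counting each distinct query string once. The short Python sketch below illustrates that distinction on a handful of invented queries, then applies the same arithmetic to the two numbers from the AOL announcement, which works out to roughly 58 percent of records being unique queries. The function name `query_stats` and the sample list are hypothetical.

```python
def query_stats(queries):
    """Return (total records, number of distinct query strings)."""
    return len(queries), len(set(queries))


# Invented sample for illustration; not taken from the data set.
sample = ["key west", "key west", "florida hotels", "key west", "weather"]
total, unique = query_stats(sample)

# Figures quoted in the AOL announcement.
aol_unique, aol_total = 21_011_340, 36_389_567
share = aol_unique / aol_total * 100  # percentage of records that are unique queries

print(f"sample: {unique} unique out of {total} records")
print(f"AOL: {share:.1f}% of records are unique queries")
```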

Enterprising techies went well beyond mirroring the data set. At http://www.aolsearchdatabase.com, you can search the search logs, filtering by user number, search keywords, or click-through domain. Here's how to home in on an individual's identity: First search for a given topic (e.g., Key West); then filter by user number. You'll see the complete range of searches that person performed. Because many people search for themselves, you may quickly uncover that person's identity.
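The two-step technique can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind any actual site: the `Record` type, the helper names, and every record in `LOG` are invented for the example.

```python
from collections import namedtuple

Record = namedtuple("Record", "user_id query")

# Invented records for illustration; none come from the actual data set.
LOG = [
    Record(101, "key west hotels"),
    Record(101, "jane doe miami"),
    Record(202, "cheap flights"),
    Record(101, "knee pain remedies"),
    Record(303, "key west fishing charters"),
]


def users_matching(log, keyword):
    """Step 1: find the IDs of users whose queries mention the keyword."""
    keyword = keyword.lower()
    return sorted({r.user_id for r in log if keyword in r.query.lower()})


def history_for(log, user_id):
    """Step 2: list every query issued under one user ID, in log order."""
    return [r.query for r in log if r.user_id == user_id]


for uid in users_matching(LOG, "Key West"):
    print(uid, history_for(LOG, uid))
```

Once a single ID's history is laid out this way, a query containing a name and a place sits next to everything else that user searched for, which is why the random ID number offers so little protection.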

New York Times reporters Michael Barbaro and Tom Zeller used this technique to identify Thelma Arnold of Lilburn, Ga., who searches for health information for her friends. Far more unsavory and disturbing searches are in the database, including one from someone who wanted to know "how to kill your wife."

The researchers—Greg Pass and Abdur Chowdhury of AOL and Cayley Torgeson of Raybeam—should have understood the implications of releasing this data set. This sort of data mining examines the "Long Tail" of the search logs. There's a lot to be learned by studying unusual or unique searches or the searches of individual users. This is well understood in the search analytics community.

Ironically, this is the type of search log information the Justice Department had sought from AOL, Google, and others. Google fought and won a court battle limiting what it disclosed to the government. Now the government, private marketers, former friends, ex-spouses, and others are mining the database.

Google CEO Eric Schmidt said the following during a keynote at the Search Engine Strategies conference: "We are reasonably satisfied … that this kind of thing could not happen at Google." He then said, "Never say never." Schmidt should temper his moral superiority. Google itself is benefiting from the ongoing breach of privacy: The aolsearchdatabase.com site is Google's advertising partner.

Some cynical bloggers accused AOL of maliciously exposing private information. One academic, Carnegie Mellon researcher Serge M. Egelman, disagreed: "My current research focuses on conveying privacy information to end users. As part of this I've been working on P3P (a W3C standard for creating machine-readable privacy policies). AOL has been incredibly helpful to our lab by assisting us and providing search data. … This research is immensely valuable, and we are very thankful to AOL for assisting us. We were only given a list of search terms and assumed that that was all they would be putting on their Web site. Obviously they made a huge oversight, but it was in good faith. After all, they were merely trying to assist other researchers. It would be extremely unfortunate if publicity from this incident force[d] them to discontinue joint work with the research community."

The researchers asked anyone using the collection for research to cite their paper, "A Picture of Search," given at The First International Conference on Scalable Information Systems in Hong Kong this past June. (Note the timing: The data set covers the period just before the conference.) The full text of the paper is available at http://portal.acm.org.

It's an interesting academic paper.