Amazonfail: How Metadata and Sex Broke the Amazon Book Search
by Avi Rappoport
Posted On April 20, 2009
Amazon failed in a big way on Easter weekend. Because Amazon is the largest bookstore in the world, a book that does not appear in its lists or its search results practically disappears. The event now known as #AmazonFail involves a great cast of characters: books, metadata, sex, search results, traditionally disenfranchised groups, a possible hacker, the Kindle, the absence of institutional response, and the emergence of Twitter for sharing information very quickly on a massive scale.

The ways Amazon failed are many. It did not have a clear policy on "adult" content, although there is evidence that it deals with those materials in special ways. It placed too great a reliance on metadata and automation. Its database and publication architecture lacked checks and sign-offs. Its communications to its customers, authors, and the media were deeply insufficient.

And Amazon had the bad luck to make a significant mistake regarding people who are highly articulate and communicative, at a moment when there are real-time technology tools to support them. In particular, Twitter users' near-real-time broadcasts speed everything up: Links from Slashdot can overwhelm unprepared servers over the course of hours or days, but a note from a popular Twitterer can do the same, in just a few minutes.

#AmazonFail

It started when authors noticed their books losing "Sales Rank," which is a number on an Amazon book record that is generally related to recent sales. Sales rank is generated by a proprietary algorithm, and it controls which book links are displayed on the homepage, on best-seller lists, and (it turns out) in the main search engine. People have tried to manipulate it, and it has changed over the years, but it seems to have been stable before this event. Last week, however, some of the numbers didn't just go up or down; the rank itself disappeared (www.fonerbooks.com/surfing.htm; www.salesrankexpress.com).

Around April 11, several authors were blogging about the problem. One author, Mark Probst, posted a response from Amazon saying that his book was deranked because it was marked "adult"; but he knew that was not true, as it's a YA (young adult) story that simply includes gay characters. Some other authors and publishers of GLBT (gay, lesbian, bisexual, and transgender) work discovered that their sales ranks were gone too, and they blogged their discontent. (See the screen with an example of a book page without a sales rank.)

Then, Storm Grant described this on Twitter with the tag #amazonfail; zanzando sent it to author and blogger Neil Gaiman, who investigated it and then "re-tweeted" to his 200,000 "followers." This created a tipping point, a critical mass, as many smart and articulate people found evidence of what was going on, identifying the delisted books. Delisted titles included Brokeback Mountain, Ellen: The Biography, and Heather Has Two Mommies, all of which deal with GLBT issues without being explicit or primarily sexual. At the same time, "Playboy Centerfolds" and raunchy autobiographies retained their sales rank (http://markprobst.livejournal.com/15293.html; http://twitter.com/StormGrant/status/1502600844; http://twitter.com/neilhimself/status/1503615450; www.salon.com/mwt/broadsheet/feature/2009/04/13/amazon_fail/index.html).

While Twitter is somewhat similar to blogging systems, it displays posts (known as "Tweets") in near-real-time. Users can follow hundreds of other Twitter accounts, viewing the aggregated listing in their default homepage, in online feed readers, or on desktop applications. There are no threads or folded comments, just a flat list, so Twitterers have invented the convention of a hash tag (#) to indicate a continuing topic, in this case #amazonfail. Searching for the tag displays Tweets posted minutes or seconds before, and the search page has a function that adds a note on the page when more matches for the search are indexed, creating a fast-paced environment. As on Facebook and other social networking systems, it has been misused, with "twitter-mobs" created to tease or mock (http://bhc3.wordpress.com/2009/03/20/breathe-reflections-the-cisco-fatty-story; http://search.twitter.com/search?q=amazonfail).

On April 11 and 12, there were many theories blogged and Twittered about Amazon's actions and the company's attitude toward GLBT books, from the mundane to the apocalyptic, based on the very real attacks these communities have suffered in the past. The most common suggestion was that Amazon had decided to delist books on topics that the executives or a pressure group found distasteful. Or perhaps a reporting system for flagging inappropriate books was attacked in an organized fashion. A self-proclaimed "hacker" posted, claiming responsibility, but this was soon debunked. The speculation continued, and indeed, it still does (http://community.livejournal.com/meta_writer/tag/amazonfail; www.salon.com/mwt/broadsheet/feature/2009/04/13/amazon_fail_2/index.html; http://letters.salon.com/mwt/broadsheet/feature/2009/04/13/amazon_fail_2/view/index8.html).

Metadata Categories to Blame?

Later on April 12, Jane of the Dear Author blog discovered that the delisted books were not just GLBT titles but also those tagged with "erotica" or "sex," such as Full Frontal Feminism and the sociology textbook The Sexual Politics of Disability. This explanation swept through the community, with thousands of links from #amazonfail: Good (and reproducible) information drove out the bad, in this case (http://dearauthor.com/wordpress/2009/04/12/amazon-possibly-using-category-metadata-to-filter-rankings).

From this and other evidence, I believe there is a flag on each category, defining whether it is adult or nonadult. When a category is flagged adult, the system automatically suppresses the sales rank and the main search results for all items in that category. This is supported by the observation that many Kindle editions have a separate listing (under the Kindle category) and so some books delisted in book format were still available in Kindle format, and vice versa. The category labels themselves seem to come from publisher catalogs, CIP (Cataloging in Publication) data, aggregators, and reviews such as Publishers' Weekly and Library Journal, with other tags, possibly from users, inconsistently applied.
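The category-flag hypothesis above can be sketched in a few lines of Python. This is only an illustration of the inferred behavior, not Amazon's actual code: the `Category` and `Item` classes, the `adult` attribute, the rank numbers, and the category assignments of the sample titles are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Category:
    name: str
    adult: bool = False  # hypothetical per-category flag


@dataclass
class Item:
    title: str
    category: Category
    sales_rank: int


def visible_rank(item):
    """Return the sales rank, or None when the item's category is flagged adult."""
    return None if item.category.adult else item.sales_rank


def main_search(items, query):
    """Return matching titles, skipping items whose rank is suppressed."""
    q = query.lower()
    return [i.title for i in items
            if q in i.title.lower() and visible_rank(i) is not None]


# Hypothetical catalog: print and Kindle editions sit in different categories.
gay_lesbian = Category("Gay & Lesbian")
kindle = Category("Kindle")
catalog = [
    Item("Brokeback Mountain", gay_lesbian, 1200),
    Item("Brokeback Mountain (Kindle Edition)", kindle, 3400),
]

# Flipping one flag affects every item in that category at once.
gay_lesbian.adult = True
results = main_search(catalog, "brokeback")
```

In this sketch the print edition drops out of the main search while the Kindle edition remains, which is consistent with the observation that some books were delisted in one format but not the other.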

The Amazon database is distributed and the index is updated incrementally, so when readers do searches, they may come up with different results from one session to the next. This means that the delistings seemed to roll across the catalog, including GLBT, feminist, disability-rights, and other sex-positive items, followed by Tweets reporting the changes.

Mysterious Adult Content Policy

As far as anyone can tell (authors and publishers included), there is no stated policy on Amazon's site regarding adult content. The message sent to Probst is the clearest explanation: "In consideration of our entire customer base, we exclude 'adult' material from appearing in some searches and best seller lists. Since these lists are generated using sales ranks, adult materials must also be excluded from that feature" (http://markprobst.livejournal.com/15293.html).

He was very surprised to get that message because the book is actually a YA story that includes gay characters, but it is certainly not explicit.

I could find only three reports of "adult" delisting before this, although people have noticed that explicit materials are not findable from the main page, only from subsections such as "Sex" and "Social Sciences-Pornography."

The first just remains as a link to an article that has since been lost:

"'Amazon Hides Sales Rank on Certain Books'

August 27, 2008 - 1:58 p.m.-Bibliofuture

Amazon.com is hiding the sales rank on certain risqué books if they become too popular ..."

(http://lisnews.org/amazon_hides_sales_rank_certain_books).

The second is Amazon's response to Craig Seymour, an associate professor of journalism at Northern Illinois University. In early February 2009, his memoir lost its sales rank, was delisted, and was removed from the Amazon search engine, presumably because of adult content. While the title may have been provocative (All I Could Bare: My Life in the Strip Clubs of Gay Washington, D.C.), the book itself merited a positive Publishers' Weekly review, and other books with "stripper" and related sexual terms in the title were not delisted. He has posted several unhelpful responses from Amazon's author service department, including one saying that it was "classified as an Adult product" and implying that was the reason he had no sales rank. He and his publisher continued to insist that it be fixed, and 2 days later, his rank and findability returned, with no explanation (http://craigspoplife.blogspot.com/2009/04/my-amazonfail-timeline.html; www.publishersweekly.com/article/CA6571980.html?industryid=47263).

The third is novelist Francine Saint Marie's writing on AfterEllen.com, "Amazon's 'Glitch' Myth Debunked." This post describes Saint Marie's difficulties in getting sales rank for Kindle editions of her lesbian romances. She knew that the Kindle versions were selling because her publishing account was tallying royalty percentages. But the books were only reachable by browsing the Gay & Lesbian section or following an offsite link to her Kindle book pages. She was told over and over that nothing could be done about it:

Amazon restated, "We do not have the ability to manually add content into the Amazon sales ranking system." And then again, "Amazon has no means of manually adding sales ranks/categories due to the automated nature of the system." And again, "We remain unable to manually add sales ranking information to any product detail pages." And again, "This is a fully automated process and we have no way to change this manually."

However, she did tests with "new" authors, and they got ranked after only one sale. Finally, in March 2009, she deleted every title of hers from the Kindle store and then republished them on the store as untagged romance novels. Within a day they had sales ranks; some were even on the best-seller lists: the same books that had so many problems before. She had clearly been on some kind of Adult content blacklist, and Amazon was unwilling to do anything about it (www.afterellen.com/node/48877).

Amazon's Responses

On the evening of Sunday, April 12, mainstream press book reviewers also started following the story in their blogs. The Los Angeles Times managed to get one quote from Amazon's spokesperson: "Responding to our initial post, Amazon Director of Corporate Communications Patty Smith e-mailed Jacket Copy. 'There was a glitch with our sales rank feature that is in the process of being fixed,' she wrote. 'We're working to correct the problem as quickly as possible'" (http://latimesblogs.latimes.com/jacketcopy/2009/04/amazon-responds-to-adult-queries-blames-a-glitch.html).

The word "glitch" was not well-received by the concerned community, as it seemed to trivialize a serious problem, and it engendered another Twitter tag: #glitchmyass. Then, on the morning of Monday, April 13, Amazon sent a message to some publications and to those who had emailed the company:

This is an embarrassing and ham-fisted cataloging error for a company that prides itself on offering complete selection.

It has been misreported that the issue was limited to Gay & Lesbian themed titles-in fact, it impacted 57,310 books in a number of broad categories such as Health, Mind & Body, Reproductive & Sexual Medicine, and Erotica. This problem impacted books not just in the United States but globally. It affected not just sales rank but also had the effect of removing the books from Amazon's main product search.

Many books have now been fixed and we're in the process of fixing the remainder as quickly as possible, and we intend to implement new measures to make this kind of accident less likely to occur in the future.

(http://community.livejournal.com/meta_writer/13059.html)

However, there is still no statement of this kind on the Amazon front page, media relations area, blog, or any other area of the site, and the public relations department has not responded to our request for a comment. There's no question that Amazon's near-silence has distressed and infuriated many people. Without additional communication, there is no way to know what really happened and whether it might happen again.

Continued Distrust of Amazon

There are still rumors and worries because the marginalized groups affected aren't just paranoid; it seems that some people really are out to get them. Varying anonymous statements ascribed to "Amazon insiders" speculating about management preferences and plans fed these fears. Many people in the GLBT, disability sexuality, feminist, and other sex-positive communities feel attacked: that Amazon did something terribly wrong and may easily do so again (www.salon.com/mwt/broadsheet/feature/2009/04/13/amazon_fail_2/index.html).

In a guest editorial on TechCrunch, veteran Silicon Valley technologist Mary Hodder wrote,

"#AmazonFail is about the subconscious assumptions of people built into algorithms and classification that contain discriminatory ideas ... And we all know search result order can lead to big sales, or invisibility" (www.techcrunch.com/2009/04/14/guest-post-why-amazon-didnt-just-have-a-glitch).

Amazon, like Starbucks, was once a small struggling startup, with high regard for its customers. Having grown so large and unresponsive, it's also now seen as a behemoth, crushing independent local stores, able to control what is available for purchase.

Clay Shirky, author of Here Comes Everybody, worries about the easy assumptions of malice. "We're no longer willing to cut Amazon any slack, because we don't trust them, and we don't trust them because we feel like they did something bad, even though we now know, intellectually, that they didn't actually do the bad thing we've come to hate them for. They didn't intend to silence gay-themed work, and they didn't provide the means for groups of anti-gay bigots to do so either. Even if the employee currently blamed for the change in the database turned out to be a virulent homophobe, the problem is in not having checks and balances for making changes to the database, not widespread bias" (www.shirky.com/weblog/2009/04/the-failure-of-amazonfail).

Implications for All Information Systems

We can certainly learn a lot from this disaster: improving our processes for change, for both code and metadata; re-examining our metadata vocabularies; protecting our systems from mistakes and maliciousness; posting clear and detailed content policies; and responding to user concerns as soon as humanly possible.

Information Quality Trainwrecks bloggers provided three possible reasons for this foul-up:

  • There are deficiencies in how Amazon manages and controls its matter-data
  • An algorithm ran that tagged children's books or medical textbooks as "Adult" content; then the algorithm was producing duff quality data, regardless of the intent (in fact, if an algorithm to censor/restrict certain content does exist but was 'secret' then the outcome of this boo boo has been to raise awareness of it).
  • The "Information Asset" was not properly secured and protected, which is vital, "as a cloudy business is the main asset that they have" (www.iqtrainwrecks.com/tag/amazonfail).

The SmoothSpan security blog points to change control in source code as a model for metadata:

"In this day and age of Cloud Computing, SaaS, and web applications, data is becoming increasingly just as critical as code. Metadata, for example, is the stuff of which customizations to multi-tenant architectures are made of. In that sense, it is code of a sort" (http://smoothspan.wordpress.com/2009/04/14/amazonfail-shows-data-matters-too).
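As a rough illustration of what change control for metadata might look like, the sketch below requires independent sign-off before a category flag change is applied, with a stricter rule for changes that touch many items. It is a minimal sketch under stated assumptions, not a description of any real system: the `FlagChange` class, the `LARGE_CHANGE_THRESHOLD` value, the two-reviewer rule, and the reviewer names are all hypothetical. The item count is the figure Amazon gave for the number of affected books.

```python
from dataclasses import dataclass, field

LARGE_CHANGE_THRESHOLD = 100  # hypothetical cutoff for a "large" change


@dataclass
class FlagChange:
    category: str
    new_value: bool
    author: str
    affected_items: int
    approvals: set = field(default_factory=set)


def approve(change, reviewer):
    """Record a sign-off; the author of a change may not approve it."""
    if reviewer == change.author:
        raise ValueError("authors cannot approve their own change")
    change.approvals.add(reviewer)


def can_apply(change):
    """Small changes need one approval; large ones need two."""
    needed = 1 if change.affected_items <= LARGE_CHANGE_THRESHOLD else 2
    return len(change.approvals) >= needed


# Hypothetical change touching as many items as Amazon reported affected.
change = FlagChange("example-category", True, "alice", affected_items=57310)
approve(change, "bob")
first_check = can_apply(change)   # one approval is not enough for a large change
approve(change, "carol")
second_check = can_apply(change)  # two independent approvals allow it
```

The design choice here mirrors code review: the size of the affected set, not the size of the edit, decides how much scrutiny a metadata change gets.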

Amazon has had other problems in the past with automated systems, including miscataloging rabbit-shaped sex toys and sending their U.K. customers a flyer suggesting them as Easter gifts, even to customers who had not previously ordered from their Sex & Sexuality store (www.theregister.co.uk/2007/04/12/amazon_rabbit_mail).

And another comment from Shirky: "The problems they have with labeling and handling contested categories is a problem with all categorization systems since the world began. Metadata is worldview; sorting is a political act. ... No one gets cataloging 'right' in any perfect sense, and no algorithm returns the 'correct' results. We know that, because we see it every day, in every large-scale system we use. No set of labels or algorithms solves anything once and for all; any working system for showing data to the user is a bag of optimizations and tradeoffs that are a lot worse than some Platonic ideal, but a lot better than nothing" (www.shirky.com/weblog/2009/04/the-failure-of-amazonfail).

Keith Kisser (the Invisible Library blog), in "There's a Little Amazonfail in All of Us," says "Some categories in the Library of Congress system still use 'Muhammadan' as a subject term and 'homosexuality' is still under 'Mental Disorders' in some areas. This isn't a coordinated effort against Muslims and the LGBT community by the Library of Congress, it's just a legacy of our outdated cultural terms, biased categories that reveal old fangled bigotry and all around bad judgment on the part of our forecatalogers, who didn't know any better. It's also something that can be fixed" (http://sanchezkisser.com/blog/2009/04/16/theres-a-little-amazonfail-in-all-of-us). See correction to this paragraph in the Comments below -- with thanks to LC for sending it. --Ed.

The question of monoculture and a single company's dominance comes up again and again. Some people have called for exposure of policies for adult material, general categorization, and search algorithms. While Amazon is highly unlikely to do that, public institutions can and should do so. We've seen the benefits of integration, from shared taxonomies to federated searching. But there may be cases where a diversity of approaches can provide alternatives in situations like this.

Kassia Krozser of Booksquare wrote, "For those whose business relies upon Amazon's ability to run its own business, Amazon needs to respond to questions of how this could happen and the steps being taken to prevent it from happening again. If it is true that a single employee was able to miscategorize some content and then flip a digital switch for an entire node (or branch) of products based on attributes of one, maybe two, maybe more products assigned to that node, then vendors need to understand the steps Amazon is taking to prevent their products from being 'delisted.'

"So now we all know this flag exists in the Amazon system, and we know it's been in existence for a long time. We don't know how it is used, who makes the decisions about whether to switch it on or off, or how outside pressures can be used to change the status of items in the database (one also suspects that another customer service comment about responding to customer complaints has an element of truth as well). While the book community in general will likely be more vigilant, it would be nice for Amazon to clarify its policy in this regard, and provide a notification/resolution process for those products that are 'flipped'" (http://booksquare.com/amazonfail-post-mortem).

This is a telling example of the importance of diversity in information sources. Amazon could, at any time, remove or subtly hide information on any topic, from open source software to gay romance. It certainly will not be stocking any books about how to disable the digital rights management (DRM) on the Kindle! Any single source is likely to be flawed, limited by the assumptions of the people who created the system and the content.

Process and metadata management are no longer optional for any institution relying on information retrieval, whether for business or as an information resource. Words are not meaning: They have nuances and implications that only make sense in their context. While metadata can supplement the source words with standardized vocabulary and taxonomic classification, that metadata has to be correct. The more automation implemented, the higher the likelihood that there will be errors. It's time to know who makes the metadata and what methods they use. The cost of dirty data makes the cost of cataloging, or at least using a human to do reality-checking of imported categories, seem suddenly much smaller. But we should also be aware of the limits of our understanding and design systems to handle change gracefully because it will come, in ways we can't possibly anticipate.
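One way to picture human reality-checking of imported categories is a simple triage step: records whose imported labels match a watch list go to a person instead of being acted on automatically. The sketch below is only an illustration. The `WATCH_LABELS` set reuses the two labels reported in the delistings, but treating them as a watch list, the `triage` function, the labels attached to each title, and the "Example Cookbook" record are all assumptions.

```python
# Labels reported in the delistings; using them as a watch list is an assumption.
WATCH_LABELS = {"erotica", "sex"}


def triage(records):
    """Split imported (title, labels) records into accepted and needs-review lists."""
    accepted, review = [], []
    for title, labels in records:
        # Normalize case and stray whitespace before comparing labels.
        normalized = {label.strip().lower() for label in labels}
        if normalized & WATCH_LABELS:
            review.append(title)   # a human decides what the label really means
        else:
            accepted.append(title)
    return accepted, review


# Hypothetical import batch; the labels shown are illustrative only.
imported = [
    ("Full Frontal Feminism", ["Sex", "Feminism"]),
    ("The Sexual Politics of Disability", ["erotica ", "Sociology"]),
    ("Example Cookbook", ["Cooking"]),
]
accepted, review = triage(imported)
```

Nothing in the review list is suppressed or restricted by the code itself; the point of the sketch is that a sensitive label triggers a human look rather than an automatic action.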

Editor's Note: The diagram above is reproduced with the kind permission of the National Coalition Against Censorship (ncac.org) and the creator Sarah Falcon.


Avi Rappoport is available for search engine consulting on both small and large projects. She is also the editor of www.searchtools.com.

Comments
Posted By Matt Raymond 4/30/2009 11:56:23 AM

Regarding the comment above about the Library of Congress and subject headings, one of our reference librarians pointed out that the statement about homosexuality and "Muhammadans" is untrue.

This was the result of a search of the current Library of Congress Subject Headings and the entire hierarchy of all the mentions of homosexuality in the Library of Congress Classification system.

We also searched in LCSH and LCC and the LC catalog and could not find any mention of "Mohammadan" as a subject heading used on any record. The only mention in LCSH is a reference to it NOT being used -- in other words, an "x" reference, which means "DO NOT use this heading."

It is possible that the statement was based on a search of a library catalog that uses old headings and which has never changed them. However, we could find no instances where his claim was true within Library of Congress systems.

Matt Raymond
Communications Director
Library of Congress

***************

Thanks for the correction - noted above. --Ed.

Posted By Marcus H 4/21/2009 8:19:39 PM

And nowhere in all of Amazon's response was any sort of reference to regret that anyone might have been concerned, inconvenienced, or offended. Indeed, to my eye, their "explanation" was as "ham-fisted" as the original problem. Instead, there was an air of corporate condescension and aloofness that smacks of arrogance. I regret that I cannot any longer simply accept the information provided through their marketing system, and now have to question any ability to make a purchase responsibly through Amazon.com.
Posted By Glenn I 4/21/2009 8:05:21 PM

That was a helpful roundup of the story so far. Thanks.
Posted By Paprika Pink 4/20/2009 3:29:53 PM

Great article looking without bias at the problems, risks, and necessity of treating products/art/commerce as data.
