Searching the entire Twitter stream is surprisingly new. Twitter search was released on the website only a year ago, though some third-party partial search engines existed before that, and it would not search anything older than about 2 weeks. Now, Twitter has given the entire database of tweets, from 2006 to the present, to the Library of Congress as a primary source archive. And Twitter has (presumably) sold the same database to Google, which is making it searchable. So, is this a waste of money, an invasion of privacy, or a priceless primary source?
The complete feed of Twitter postings, called the "firehose," is now available to some third-party partners: that is how Google (http://www.google.com/webhp?esrch=RTReplay) and the Microsoft Bing search engine have added current tweets to their searchable indexes and interfaces. Starting from nothing in 2006, Twitter now carries about 35 million messages a day. Many of these are automated, like the special of the day from woot.com and the Twisst service (www.twisst.nl), which sends an alert when the International Space Station passes a subscriber's latitude and longitude. Even some of those may be useful, for example to see how many Twitter users are intrigued by the ISS over the course of time.
The full database of tweets from 2006 to the present, five terabytes, will be stored both at the U.S. Library of Congress and in Google's server farm. The absence of a long-term search engine may well have lulled users into thinking of their tweets as ephemeral and fleeting, but they were wrong: anything once public may be public again. Twitter's terms of service (http://twitter.com/tos) have always granted Twitter a non-exclusive right to do anything it wants with the data. In plain language, the terms state: "This license is you authorizing us to make your Tweets available to the rest of the world and to let others do the same."
The Library of Congress has a long, thoughtful blog post considering how best to provide research access to the Twitter database. It is weighing privacy issues and noncommercial use, and it plans to withhold access to each tweet until 6 months after it was originally posted. Google, however, is indexing tweets as tiny webpages and providing unlimited public interactive search (www.google.com/search?q=%2Binfotoday&tbs=mbl:1), currently finding tweets from about Feb. 10, 2010. So, whatever controls are placed on the Library version are mostly irrelevant.
For Twitter users, "security by obscurity" never existed: anyone will be able to see everything a Twitter account has posted, whether for purposes of biography, targeted sales, or stalking. Twitter may well sell the database to marketers, advertising agencies, or foreign and domestic government agencies. The database may be subpoenaed for legal discovery or even criminal cases. The genie is out of the bottle.
Social media blogger and researcher Fred Stutzman writes:
Up until Twitter sent its archives over to the Library of Congress, Twitter users could realistically expect they could make things go away. They could delete Tweets. They could change their account name. They could remove their account. Without consulting their users, privacy advocates, rights organizations, or any other voices of reason, Twitter has summarily taken these very real privacy remedies away from their users.
There is no visible Twitter policy on removal. Perhaps the Library and the experiences of other online archivists can help Twitter find a happy medium in archival processes. One approach would be to make user actions apply to the archives. In 2011, if someone deletes a message, the next update from Twitter would include an instruction to delete that message from all archives. Search engines do this all the time with deleted websites and pages: it's not a huge technical burden.
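The idea of making user deletions apply to the archives can be sketched in a few lines. This is only an illustration under stated assumptions: the `Archive` class, the `apply_update` method, and the update layout with `"new"` and `"deleted"` keys are all hypothetical, and nothing here reflects an actual Twitter, Library of Congress, or Google interface.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """A hypothetical downstream archive that stores tweet text by tweet ID."""
    tweets: dict = field(default_factory=dict)

    def apply_update(self, update):
        """Store newly delivered tweets, then honor any deletion notices.

        `update` is an assumed format: {"new": [...], "deleted": [...]}.
        Returns the number of archived tweets that were actually removed.
        """
        for tweet in update.get("new", []):
            self.tweets[tweet["id"]] = tweet["text"]
        removed = 0
        for tweet_id in update.get("deleted", []):
            # pop() with a default ignores IDs the archive never held
            if self.tweets.pop(tweet_id, None) is not None:
                removed += 1
        return removed


archive = Archive()
archive.apply_update({"new": [{"id": 1, "text": "hello"},
                              {"id": 2, "text": "regret this"}]})
# The next update carries one new tweet and two deletion notices.
removed = archive.apply_update({"new": [{"id": 3, "text": "later"}],
                                "deleted": [2, 99]})
print(removed, sorted(archive.tweets))
```

In this sketch a deletion notice for a tweet the archive never received (ID 99) is simply ignored, which is one plausible way to keep archives in step without requiring them to hold identical copies.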
Many people don't realize that libraries are more than a place for books, and some assume that handling a dataset like this would be outrageously expensive; they don't understand how libraries and research have changed. They also think the sheer mundane nature of the vast majority of messages makes the entire data set worthless. Even a few librarians miss the value of the data set as a whole. This lack of clarity is a "teachable moment," a chance for archivists and historians to set the record straight.
"The Twitter digital archive has extraordinary potential for research into our contemporary way of life," said Librarian of Congress James H. Billington. "This information provides detailed evidence about how technology based social networks form and evolve over time. The collection also documents a remarkable range of social trends. Anyone who wants to understand how an ever-broadening public is using social media to engage in an ongoing debate regarding social and cultural issues will have need of this material." The National Archives is storing specifically governmental digital content, which can benefit from the techniques developed at the Library of Congress with the Twitter archive and similar data sources, and add to it as well.
Primary sources like Twitter are exactly what should be available for researchers. It's amazing what good historians can do with tattered bits of seemingly unimportant information, such as medieval laundry lists. Records of who attended what parties may explain political alliances, which lead to important decisions; wills show the evolution of legal theories; deadly dull sermons may include the first use of a certain word. Historians are grateful for any and all of these sources because they are contemporary and unmediated: there is no opportunity for intermediate bias or misunderstanding. That doesn't mean that there is no bias in what has been saved, or that historians are completely unbiased, but new eyes always see new things.
The official Twitter accounts are not the important part; the rest of us have tweeted the kind of data that is valuable for tracing almost anything. Simple examples include the spread of infectious diseases, the rise of the Tea Party movement, attitudes towards our invasion of Iraq, opinions about smartphone usability, failed vs. successful American Idol contestants, and the changing use of language. In the U.S. and all over the world, as people use Twitter and other social networks, they talk about what is important to them, from the price of oranges to the course of a competitive Presidential campaign. All of this can help us understand what really happened: less about theory and expert commentary, more about the reality on the ground.
The Library of Congress is working with the Stanford University group on Computational Approaches to Digital Stewardship. Techniques of text mining are designed to work on huge amounts of data, looking for common trends and sudden changes. They can make connections that are invisible to anyone without the context; for example, they can trace the progress of a scientific discovery through the posts of not only the primary laboratory but also its friends and colleagues. Semantic analysis, and in particular entity extraction, can lead to the source of new terminology or reveal the rise of an unheralded startup company. These are the same challenges faced by digital data curators in institutions ranging from the U.K. Web Archive to the customer-service department of your local electric utility. Techniques for finding information can be shared between commercial and institutional repositories.
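As a rough illustration of what "looking for sudden changes" can mean in practice, the sketch below counts daily mentions of a term and flags days that far exceed the recent average. It is a toy under stated assumptions, not the method used by the Library or Stanford: the function names `daily_counts` and `spikes`, the `window` and `factor` parameters, and the sample tweets are all invented for illustration.

```python
from collections import Counter


def daily_counts(tweets, term):
    """Count tweets per day whose whitespace-split words include `term`."""
    counts = Counter()
    for day, text in tweets:
        if term.lower() in text.lower().split():
            counts[day] += 1
    return counts


def spikes(counts, days, window=3, factor=2.0):
    """Flag days whose count exceeds `factor` times the trailing average.

    The baseline is floored at 1.0 so that a single mention after
    several silent days is not reported as a spike.
    """
    flagged = []
    for i in range(window, len(days)):
        baseline = sum(counts[d] for d in days[i - window:i]) / window
        if counts[days[i]] > factor * max(baseline, 1.0):
            flagged.append(days[i])
    return flagged


# Invented sample data: (day, text) pairs
tweets = [
    ("2010-04-01", "flu season again"),
    ("2010-04-02", "got my flu shot"),
    ("2010-04-03", "flu is no fun"),
    ("2010-04-04", "is it flu or a cold"),
] + [("2010-04-05", "everyone has the flu")] * 5 + [
    ("2010-04-05", "nice weather"),
]

days = sorted({day for day, _ in tweets})
flagged = spikes(daily_counts(tweets, "flu"), days)
print(flagged)  # ['2010-04-05']
```

Real text mining adds much more (stemming, entity extraction, statistical significance), but the basic pattern of comparing a day against its recent baseline is the same.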
"It's kind of like saying, ‘Are newspapers useful for historians?'" says Elaine Tyler May, a history professor at the University of Minnesota and president of the Organization of American Historians, as quoted on slate.com. "We know that they are, but you have to know what you're looking for."
How To Delete a Tweet: http://help.twitter.com/forums/10711/entries/18906
Online Archive Ethics: www.archive.org/about/ethics_BK.php
The Oakland Archive Policy: http://school.berkeley.edu/research/conferences/aps/removal-policy.html
Update: April 29, 2010: For additional details from LC, see the blog post and FAQ at http://blogs.loc.gov/loc/2010/04/the-library-and-twitter-an-faq/.