Internet Archive Turns Up the Speed With BitTorrent
Nancy K. Herther
Posted On August 23, 2012
On Aug. 7, 2012, the Internet Archive gave peer-to-peer file sharing a major boost by making more than 1 million books, movies, and other media immediately available as “torrents” from BitTorrent instead of solely relying on HTTP (Hypertext Transfer Protocol) for downloading content. Each download draws on two of the Internet Archive’s servers and on other users who are requesting the same material, which promises faster delivery regardless of the user’s type of internet connection.
In the press release, Eric Klinker, BitTorrent CEO, noted that “BitTorrent is now the fastest way to download complete items from the Archive, because the BitTorrent client downloads simultaneously from two different Archive servers located in two different datacenters, and from other Archive users who have downloaded these torrents.”
Kahle’s Passion—A Wealth of Resources
Internet Archive founder Brewster Kahle’s self-described “aha moment” came while visiting the offices of search engine AltaVista years ago: “I was standing there, looking at this machine that was the size of five or six Coke machines, and there was an ‘aha moment’ that said, ‘You can do everything.’”
Kahle is a computer scientist responsible for a number of internet search systems, including the Thinking Machines nationwide computerized library; the Wide Area Information Server, an information retrieval system for the internet; and Alexa Internet, a search engine built into a browser, which he co-founded with Bruce Gilliat in 1996 and which was later sold to Amazon. “Alexa Internet grew out of a vision of intelligent Web navigation constantly improving through its users,” according to the website. “Since then, our Alexa users have downloaded millions of Toolbars, and Alexa has created one of the largest Web crawls, and developed the infrastructure to process and serve massive amounts of data.”
The Internet Archive reflects the deep commitment that Kahle has developed, working in true collaboration with libraries, volunteers, and foundations, and drawing on his own investments. The Archive was founded in 1996, when it began archiving webpages. Today, the Archive includes a wide range of component archives, each focused on specific media or goals yet sharing the same overall mission.
With the release of the Wayback Machine in 2001, these archival resources were made freely available to a global audience. Today, this includes “over 150 billion web pages archived from 1996 to a few months ago” and has grown to involve the Smithsonian, Library of Congress, and New Library of Alexandria, Egypt in support of this huge repository.
The heart of the Archive is the Ebook and Texts Archive, which includes more than 3.5 million titles from Google Book Search, collaborations with libraries, and work at its 23 scanning centers across the globe. Collections can be browsed or searched and the main access point is the Open Library—“one web page for every book”—that provides basic book metadata gleaned from the Library of Congress, Amazon, or other sources, along with links to digital/ebook versions or other information. Funded both by Kahle’s foundation and the California State Library, the goal is to create “an open, editable library catalog, building towards a web page for every book ever published” and currently includes more than 20 million records and 1 million free ebook titles.
The Internet Archive's Software Archive is “designed to preserve and provide access to all kinds of rare or difficult to find, legally downloadable software titles and background information on those titles,” including nearly 40,000 shareware, freeware programs, “video news releases about software titles, speed runs of actual software game play, previews and promos for software games, high-score and skill replays of various game genres, and the art of filmmaking with real-time computer game engines.”
The Moving Images Archive includes nearly 700,000 free movies, films, and videos ranging from “classic full-length films, to daily alternative news broadcasts, to cartoons and concerts.” Many can be downloaded as well. In the past, access was frequently hampered by heavy traffic, by browser hang-ups caused by insufficient hard-disk or TMP disk space on users’ machines, and by incompatibilities among users’ computers, browsers, and the many standards that exist for various players. With BitTorrent, I was able to download four titles quickly without any problem.
The Audio Archive includes more than 1.3 million items in its collection of “free digital recordings ranging from alternative news programming, to Grateful Dead concerts, to Old Time Radio shows, to book and poetry readings, to original music uploaded by our users. Many of these audios and MP3s are available for free download.” Kahle’s strong commitment to accessibility is especially felt here where he has led the industry in incorporating functionality that goes beyond the bare bones of ADA compliance.
Netlabels is a collection of “complete, freely downloadable/streamable, often Creative-Commons-licensed catalog of virtual record labels.” The nearly 30,000 titles available are “non-profit, community-built entities dedicated to providing high quality, non-commercial, freely distributable MP3/OGG-format music for online download in a multitude of genres.”
Archive-It is “a web archiving service to harvest and preserve digital collections.” With more than 190 partner institutions in 44 U.S. states and 16 countries, this subscription web archiving service “helps organizations to harvest, build, and preserve collections of digital content. Through our user friendly web application Archive-It partners can collect, catalog, and manage their collections of archived content with 24/7 access and full text search available for their use as well as their patrons. Content is hosted and stored at the Internet Archive data centers.” It has collected more than 4.4 billion URLs for nearly 2,000 public collections so far.
Many digitization projects use destructive methods, and Kahle sees a problem in the physical deconstruction of these artifacts. “There is always going to be a role for books,” Kahle explained to the Guardian in 2011. “We want to see books live forever.” His effort to preserve the physical books is intended less as a physical library than as a protected archive, not unlike the Svalbard Global Seed Vault, which seeks to ensure the survival of the genetic diversity of “the world’s food crops” for future generations by saving seeds in underground Arctic caverns protected by the permafrost.
BitTorrent—Moving Mountains in Less Time
BitTorrent is based on the BitTorrent Protocol, invented by company co-founder Bram Cohen in 2001, providing an efficient, distributed way of delivering files. The company doesn’t host content but, instead, acts as a form of switching station to move content quickly but carefully using new nonlinear models of the process. (A torrent holds information about the location of different pieces of the target file.)
If you’ve ever been frustrated by slow downloads that can freeze your computer, BitTorrent offers a different experience: its downloads actually get faster as more computers join in requesting the same file. Instead of sending a program in a single one-to-one stream from one source to each requester, BitTorrent sends pieces of the program to each of the requesting computers (or peers), then has the peers pass those pieces along to one another until every computer has the entire completed program.
Anyone can quickly and easily download the BitTorrent program, which appears as an empty screen (much like a newly opened reference manager) until you begin to populate it with download requests. BitTorrent is fast, efficient, free, and comes without irritating ads or pop-ups. With more than 150 million users, it has become a global standard for delivering large files over the internet, being used by organizations such as Wikipedia, Twitter, and Facebook, among others.
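The piece-by-piece idea can be illustrated with a minimal Python sketch. It is not the real BitTorrent client or wire protocol: the 4-byte piece size, the peer names, and the `download` helper are assumptions made up for this illustration, and pieces are fetched one after another rather than simultaneously. What it does borrow from the original protocol is the use of a SHA-1 hash per piece, which lets the downloader verify each piece regardless of which peer supplied it.

```python
import hashlib

PIECE_SIZE = 4  # bytes; unrealistically small, chosen only for illustration


def split_into_pieces(data: bytes, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """Cut the file into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(pieces: list[bytes]) -> list[bytes]:
    """One SHA-1 digest per piece, standing in for the torrent's piece list."""
    return [hashlib.sha1(p).digest() for p in pieces]


def download(hashes: list[bytes], sources: dict[str, dict[int, bytes]]) -> bytes:
    """Rebuild the file, taking each piece from any source with a valid copy."""
    rebuilt = []
    for index, expected in enumerate(hashes):
        for held in sources.values():
            piece = held.get(index)
            # Accept the piece only if its hash matches the published one.
            if piece is not None and hashlib.sha1(piece).digest() == expected:
                rebuilt.append(piece)
                break
        else:
            raise LookupError(f"no source holds a valid copy of piece {index}")
    return b"".join(rebuilt)


original = b"Internet Archive torrent demo"
pieces = split_into_pieces(original)
hashes = piece_hashes(pieces)

# Hypothetical swarm: no single peer holds the whole file,
# and one peer offers a corrupted copy of piece 0.
sources = {
    "bad_peer": {0: b"XXXX"},
    "peer_a": {i: p for i, p in enumerate(pieces) if i % 2 == 0},
    "peer_b": {i: p for i, p in enumerate(pieces) if i % 2 == 1},
}

rebuilt = download(hashes, sources)
print(rebuilt == original)
```

In this toy swarm, neither honest peer has the complete file, yet the downloader still ends up with a verified copy, and the corrupted piece is rejected because its hash does not match.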
When it announced the BitTorrent relationship on Aug. 7, the Archive began with 1.5 million torrents (nearly a petabyte of data) including live music concerts, the Prelinger movie collection, the LibriVox audio book collection, feature films, old time radio shows, more than 1.2 million books, and “all new uploads from patrons who are into community collections.”
As with any technology that makes moving data faster and more efficient, BitTorrent has been linked to Napster-like charges of copyright infringement. The blog TorrentFreak reported that more than 200,000 people were sued between 2010 and August 2011 for transferring copyrighted materials using the BitTorrent network.
Some sites using torrent technology, such as the Pirate Bay and IsoHunt, have been charged with copyright infringement either in court or in lists of suspect sites issued by various organizations. In reaction, Google has factored “removal requests” of such sites into Google search results. Google’s website notes that, “sites with high numbers of removal notices may appear lower in our results. This ranking change should help users find legitimate, quality sources of content more easily.”
“Google has begun to ‘punish’ sites,” complains IsoHunt’s Gary Fung. “While Google already started down this path of censorship with autocorrect before, search ranking based on mere DMCA notices is a line that should not be crossed.” Additionally, Fung charges that, “what's missing on Google’s DMCA notices report—YouTube. The largest by far video content website in the world ought to have very high volume of DMCA notices, if not the most, and it’s inconspicuously [sic] missing from the list. To downrank and censor any website that’s not Google’s that receives a high number of DMCA notices? Sounds exactly like antitrust to me.”
The Archive was careful to note that, “unlike many BitTorrent sites, such as the Pirate Bay, which only host torrent files but not the actual digital content they point to, the Internet Archive is also hosting all of the original content for which it makes torrents available.” Electronic Frontier Foundation's John Gilmore commented in the Archive’s press release that, “I supported the original creation of BitTorrent because I believe in building technology to make it easy for communities to share what they have. The Archive is helping people to understand that BitTorrent isn’t just for ephemeral or dodgy items that disappear from view in a short time. BitTorrent is a great way to get and share large files that are permanently available from libraries like the Internet Archive.”
Keeping Tabs on the Progress
In keeping with its open culture, the Archive is posting data on downloads and torrents at its site. Data released so far show strong interest in, and early success with, this new approach to sharing and accessing these files. Users are finding much faster transfers of even large multimedia files, such as movies. As a 501(c)(3) nonprofit, the Archive is using torrents to apply state-of-the-art technology to its mission of building an internet library to offer “permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format.”
In the past few years, the library digitization story has resembled the Tortoise and the Hare story. Google is the agile and well-financed Hare leaping ahead with its grand plans to digitize the world’s literature in collaboration with library partners, who would gain digital copies and be spared the expense. The Archive—the slow but sure Tortoise—is moving more slowly in a collaboration that requires that all participants share some of the costs of the process. With Google apparently now pulling back on its project—and many reports surfacing of how academic libraries have found themselves paying far more than they anticipated for their share of the cost of these services—the lowly Tortoise would seem to be moving into the lead.
“The Internet has put universal access to knowledge within our grasp. Now we need to put all of the world’s literature online. This is easier to do than it might seem, if we resist the impulse to centralize and build only a few monolithic libraries,” Kahle has written. “We need lots of publishers, booksellers, authors, and readers—and lots of libraries. If many actors work together, we can have a robust, distributed publishing and library system, possibly resembling the World Wide Web.”
Lee Rainie, director of the Pew Research Center’s Internet & American Life Project, believes that BitTorrent is a strong positive for the Archive and web searchers alike.
“The technology, of course, is neutral and can be used for good or bad purposes. We’ve seen in our work, for instance, that people are very worried about cyberbullying via social networking sites and yet they like their own interactions over such sites and say that most people aren’t bullies. The same basic dynamic exists in the peer-to-peer space. Peer-to-peer networks facilitate easy sharing of massive amounts of material. Some people distribute copyright protected material via those sites for free. But others share massive files that are perfectly legal to share. That’s the goal of the Internet Archives here. Everything in our data suggests that users will take advantage of and be grateful for such sharing.”
Perhaps slow but steady will eventually win this race.