DATA OVERLOAD: Web Archives and the Challenges of Scale

Imagine future historians studying the public discourse on autism in the early 21st century. They sift through an archive as vast as anything we know today, but they must contend with born-digital sources—blogs written by people with autism, for example, or the websites of advocacy organizations and government agencies, not to mention video, audio, and social media content, all gathered from across the web. The extent of this archive means that the traditional methods of doing historical research will no longer be relevant, at least for this project. Historians will have to use new techniques and digital tools to interrogate the archive. The scale of pertinent sources, the technical skills required to analyze them, and the need to assess what was and wasn’t collected by the archivists who processed these materials in the first place will raise a host of challenges.

As the web becomes ever more integrated into our lives, numerous entities, such as the Library of Congress and the Internet Archive, have begun archiving it. But these new web archives contain so much data that historians have begun reconsidering research methods, skills, and epistemology. In fact, few historians now possess the requisite qualifications to perform professional research in web sources.

In March 2019, participants in a “datathon” held at George Washington University in Washington, DC, got a taste of what research with born-digital web archives could look like. The event was organized by the Andrew W. Mellon Foundation-funded Archives Unleashed Project, which, according to its website, “aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.” The project’s goal is to lower barriers to working with large-scale web archives by creating accessible tools and a web-based interface with which to use them. The datathon brought together librarians, archivists, computer scientists, and researchers from a variety of disciplines, including the humanities and social sciences, to explore web archives on a wide range of topics.

At datathons, people can broadly experiment with specific datasets, asking and answering questions about them. The Archives Unleashed Project has hosted datathons since 2016 to explore the possibilities that web archives present for research. A “big challenge for this project,” explains Ian Milligan, principal investigator of Archives Unleashed, is determining “where should the project end and the researcher take over.” In other words, how can the project ensure that it sufficiently prepares archival custodians and researchers to continue doing this work in the future? Through these datathons, Archives Unleashed strives to create communities of users for the tools it’s creating and build expertise in using web archives and sources.

Studying the recent past all but compels historians to use web archives.

At the datathon in Washington, the project team provided pre-selected collections of web sources, and participants chose which materials they wanted to work with. Topics included the Washington, DC-area punk music scene, web content from former Soviet Bloc countries, and the #MeToo movement. Participants identified the questions they wanted to ask of the sources, used analytical tools from the Archives Unleashed Toolkit to explore the data, and presented their findings.

One group explored the non-textual elements—images, audio, video—in the 48 gigabytes of the DC Punk Archive that they had been given to work with. With a tool in the Archives Unleashed Toolkit, they extracted over 10,000 digital objects from the collection, then determined file types in order to identify the types of materials they were working with. Expecting to find mostly audio and videos of concerts, the group also discovered tickets, posters, flyers, album covers, photographs of artists, and more—objects that would be vital to telling the history of the scene. Another team, working with the former Soviet bloc websites, analyzed those sites’ outgoing links to understand which other sites around the world were important within the content of their collection.
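The two analyses described above, tallying file types and counting outgoing links, can be illustrated with a minimal Python sketch. This is not the Archives Unleashed Toolkit's own code; the function names and the example.org URLs are hypothetical, and the media kind is merely guessed from each file extension using Python's standard library.

```python
from collections import Counter
import mimetypes
from urllib.parse import urlparse


def count_media_kinds(urls):
    """Tally URLs by broad media kind (image, audio, video, ...) guessed from the extension."""
    kinds = Counter()
    for url in urls:
        path = urlparse(url).path
        mime, _ = mimetypes.guess_type(path)
        kinds[mime.split("/")[0] if mime else "unknown"] += 1
    return kinds


def count_outgoing_domains(source_domain, links):
    """Tally the domains a site links out to, ignoring links back to itself."""
    domains = Counter()
    for link in links:
        host = urlparse(link).netloc.lower()
        if host and host != source_domain:
            domains[host] += 1
    return domains


# Hypothetical sample data standing in for objects and links found in a collection.
objects = [
    "http://example.org/flyers/show.jpg",
    "http://example.org/covers/album.png",
    "http://example.org/live/set.mp3",
    "http://example.org/about",
]
links = [
    "http://example.org/page2",
    "http://example.com/a",
    "http://example.com/b",
    "https://example.net/",
    "/relative",
]

print(count_media_kinds(objects))
print(count_outgoing_domains("example.org", links))
```

A real collection would supply millions of URLs rather than a handful, but the counting logic stays the same; only the infrastructure needed to run it at scale changes.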

Another team explored sites from the #MeToo Digital Media Collection, which is being gathered by Harvard University’s Schlesinger Library as part of a project to comprehensively document the movement. Several participants approached the harvested material from an archival perspective, asking questions like: Is the collection capturing what’s necessary in order to be useful to researchers now and in the future? What are the assessment criteria that should be used to ensure that the collection has archival value? How do you document those decisions and ensure that the resulting archives are usable?

These are questions that archivists have always asked in making decisions about records and other archival materials. But web archiving includes its own problems of scale, preservation, privacy, and copyright. The Internet Archive began preserving sites from the World Wide Web in 1996. Since then, it has archived nearly 350 billion web pages. The storage required to hold all of this content is well in excess of 15 petabytes. (Your computer at home probably has about a thousand gigabytes of hard drive storage; one petabyte is a little over one million gigabytes.) Users of the Internet Archive’s Wayback Machine can explore a treasure trove of websites, including the entire GeoCities network; over 1,600 versions of the website of the former vice president, dating back to 1998; and the earliest US federal government web pages. The majority of this content remains untapped by historians.
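To make the comparison between a home computer and the archive concrete, here is a small worked calculation in Python. It assumes the binary convention implied by "a little over one million gigabytes" (1 petabyte = 1,048,576 gigabytes) and a 1,000-gigabyte home drive; the function names are illustrative only.

```python
import math

GB_PER_PB = 1024 ** 2  # binary convention: 1 petabyte = 1,048,576 gigabytes


def petabytes_to_gigabytes(petabytes):
    """Convert petabytes to gigabytes using the binary convention."""
    return petabytes * GB_PER_PB


def home_drives_needed(petabytes, drive_gb=1000):
    """Number of home-sized drives (drive_gb each) needed to hold the given volume."""
    return math.ceil(petabytes_to_gigabytes(petabytes) / drive_gb)


print(petabytes_to_gigabytes(15))  # 15728640
print(home_drives_needed(15))      # 15729
```

Under these assumptions, 15 petabytes would fill roughly 15,700 home hard drives.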

And the Internet Archive is not the only institution capturing knowledge existing on the web; traditional institutions are also involved in this effort. The DC Punk Archive is the work of special-collections librarians at the District of Columbia Public Library. National libraries and legal-deposit libraries also do this archival labor, with a growing number of countries passing non-print legal-deposit laws, which mandate the collection of sites within national domains, such as .fr or .no in Europe. The British Library has worked with the United Kingdom’s network of deposit libraries to routinely archive the entire .uk web domain, required by a 2013 law that complemented the long tradition of legal deposit of print materials in national libraries.

Working with the Internet Archive, the Library of Congress has also been creating archives of websites in the public interest since 2000. The library currently collects, per month, between 20,000 and 25,000 gigabytes of content on a wide range of topics, the sites of all legislative branches of the US federal government and a selection of those maintained by executive agencies, and some international websites, such as those covering general elections around the world and major political and social upheavals. In a phone interview, Abigail Grotke, web archiving team lead at the Library of Congress, explained how reference librarians and overseas operations officers with subject-matter expertise provide guidance on which web archive collections the library should create and maintain, and on “the urgent events” that should be documented and preserved as they happen.


According to Grotke, the Library of Congress always obtains permission from a site’s owners before “crawling”—a term derived from the use of a piece of software called a web crawler that systematically browses and collects data from websites. While this adds complexity to the task and requires that the library be much more selective about what it collects, Grotke says that it also allows its collecting to be more “focused and deeper.” Since no collecting work can capture everything on the web, decisions always need to be made about where a crawl stops. Attention to details like these will ensure that historians can explore what’s preserved in these vast collections of data. Still, gleaning meaningful information from these sources will require historians to use new tools.
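The question of where a crawl stops can be sketched as a breadth-first traversal with two explicit limits: which hosts are in scope and how many links deep to follow. The sketch below is a simplified illustration, not the software the Library of Congress uses. It walks a hypothetical in-memory link graph instead of fetching real pages, and the name plan_crawl and its parameters are assumptions made for the example.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(seed, link_graph, allowed_hosts, max_depth):
    """Return the pages a crawl would visit, breadth-first, within scope and depth limits."""
    seen = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: collect this page but follow no further links
        for link in link_graph.get(url, []):
            if link in seen:
                continue
            if urlparse(link).netloc not in allowed_hosts:
                continue  # out of scope: the site's owner has not been asked
            seen.add(link)
            queue.append((link, depth + 1))
    return order


# Hypothetical link graph: each page maps to the pages it links to.
graph = {
    "http://example.org/": ["http://example.org/a", "http://example.com/x"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/"],
    "http://example.org/b": ["http://example.org/c"],
}

print(plan_crawl("http://example.org/", graph, {"example.org"}, max_depth=2))
```

Raising max_depth or widening allowed_hosts captures more of the web at the cost of a larger and less focused collection, which is the trade-off Grotke describes.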

Many of the barriers to using these archives are simply a result of scale: the archives are just too big to provide good results from keyword searches or even to browse through. As a result, analytical tools are necessary. Web archiving crawls create files in the WARC format, an international standard that has been adopted by libraries and other web archiving organizations. WARC files preserve the content of a website in addition to other archival information, such as when the content was collected. The Archives Unleashed Toolkit used in the datathon (available for free on the project’s website) includes scripts (little programs that do discrete tasks) to sort and manage the data and metadata in WARC files.
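To show what "content plus archival information" looks like in practice, the following sketch reads the header of a single, simplified WARC record using only Python's standard library. The sample record is invented for illustration, and its payload is shortened to plain text, whereas a real response record would hold a full HTTP response. Real WARC files are also usually compressed and hold many records, so a dedicated library such as warcio is the more practical choice for actual research.

```python
def parse_warc_record(record_text):
    """Split one WARC record into (version line, header fields, payload)."""
    header_block, _, payload = record_text.partition("\r\n\r\n")
    lines = header_block.split("\r\n")
    version = lines[0]
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as dates contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields, payload


# Invented sample record; the field names follow the WARC standard.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Date: 2019-03-21T14:05:00Z\r\n"
    "WARC-Target-URI: http://example.org/\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "Hello, world!"
)

version, fields, payload = parse_warc_record(SAMPLE_RECORD)
print(fields["WARC-Target-URI"], "collected on", fields["WARC-Date"])
```

The WARC-Date field is what lets a researcher say not just what a page contained, but when that version of it was captured.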

The toolkit allows users to, for example, strip out everything but the main content, eliminating secondary information such as website navigation and ads. Other scripts in the toolkit allow users to see what is included in the archive they are working with. Users can also filter by language, group sites in a collection by the date on which they were crawled, or find all names of individuals, organizations, or places in a group of sites. These techniques do require some basic knowledge of how websites work, but they don’t necessitate years of training.
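Two of the operations just described, stripping a page down to its main content and grouping captures by crawl date, can be approximated in a few lines of Python. This sketch is not the Archives Unleashed Toolkit's implementation or API. It assumes that navigation and similar material sits inside tags such as nav and footer, and that crawl timestamps use the 14-digit YYYYMMDDhhmmss form familiar from the Wayback Machine; all names and sample data are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: text inside these tags is secondary (menus, scripts, page furniture).
SKIPPED_TAGS = {"nav", "script", "style", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect page text while ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the visible text of a page with navigation and similar parts removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def group_by_crawl_month(captures):
    """Group (url, timestamp) pairs by the YYYYMM prefix of the crawl timestamp."""
    groups = defaultdict(list)
    for url, timestamp in captures:
        groups[timestamp[:6]].append(url)
    return dict(groups)


page = (
    "<html><body><nav><a href='/'>Home</a> <a href='/shows'>Shows</a></nav>"
    "<p>The band played on Friday.</p><footer>Contact us</footer></body></html>"
)
captures = [
    ("http://example.org/", "20190301120000"),
    ("http://example.org/shows", "20190315093000"),
    ("http://example.org/", "20190402120000"),
]

print(extract_main_text(page))
print(group_by_crawl_month(captures))
```

Heuristics like the tag list above are crude, and older sites that build their layout from tables will defeat them, which is one reason some knowledge of how websites work remains necessary.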

While the web itself is of recent enough inception that only a small subset of historians who study contemporary history are currently using it as a source, more will need to be prepared to do so in the coming years. As one datathon participant put it, software programs such as the Archives Unleashed Toolkit provide means for “trying to understand your dataset before you dive into it.” As we get further and further away from the early days of the web, and with so much of our history recorded there, historians, now more than ever, need to know how to work with these materials.

Will there be a virtual afterlife? The race is on to save our digital selves from oblivion


Publishing house interns wading through “slush piles” may quip about that Wiccan memoir they just cringe-read or the Guide to DIY Bunkers they’ve turned down. But sometimes important books are also missed. The Diary of Anne Frank was saved from obscurity only because an editor spotted it in a stack of rejected manuscripts. Thus the promise of a new prize for book proposals, sponsored by Penguin Random House Canada and Westwood Creative Artists. It offers one graduate from the University of King’s College-Dalhousie University MFA in Creative Nonfiction the chance to win $2,500, a meeting with an editor and an offer of representation by an agent. Read excerpts from this year’s nominees, including this piece on the problem of memory in the internet age by Stacey McLeod.

Alfred Nobel may be best known as the founder of the Nobel Prize, but he also invented dynamite. When his brother died in 1888, newspapers accidentally wrote obituaries for Alfred, blasting him for making his fortune from death and destruction. One headline even read “The Merchant of Death is Dead.” It’s believed he was so shaken by reading these opinions and facing his not-so-glamorous legacy that he donated his fortune to the creation of the Nobel Prize. His philanthropy didn’t change the fact he’d invented dynamite, but it changed what he was remembered for.

How will you be remembered? How will Facebook? For better and for worse, some internet pioneers are now facing the less-than-edifying realities of their creations and are trying to fix them. The social media generation is only starting to see its own long-term impact. But is there time for this-generation app developers to consider legacy when racing to create the next big thing?

It’s not just a race against competition. Companies must keep up with software, adapt products to work on new devices, and navigate our constantly-changing needs and behaviours. At the same time, the public is becoming more aware of privacy and copyright, and demanding accountability.

Digital mistakes are recorded history, even when deleted or corrected. If personal data is stored, spread, leaked or misused, it impacts how someone is remembered. At the same time, companies can shutter or forever-delete media with important historical context. In 2017, a Twitter employee deactivated U.S. President Donald Trump’s account for 11 minutes. It was restored, but it made people ask: Who is handling our digital history and how can it disappear so easily? We know social media companies won’t be around forever, but we’re only starting to understand the lifespan of the memories we’ve entrusted to them.

For the past 20 years, the biggest keeper of our digital memory hasn’t been one organization but a much larger, more mysterious force: the internet. Just who takes ownership of the internet — and decides what is worth keeping or destroying — is a complicated web.


On an unassuming side street in San Francisco’s Richmond neighbourhood sits a former Christian Science church where a pole flies a flag with an image of Earth, as seen from space. Tall white columns tower over steps leading to black iron doors. The building is no longer a place of worship per se, but it is a place of salvation. Inside the building is the largest known backup of the internet.

The Internet Archive has been preserving web history since 1996 and makes it accessible by search through the Wayback Machine — a time capsule of sorts that lets you insert a URL and slide year-over-year to view updates and changes. It currently contains more than 330 billion web pages, 4.5 million audio recordings, four million videos, three million images and 200,000 software programs, as well as 30 million books and texts.

The Archive champions universal access to content other companies want to restrict and sell. It’s positioned as an information Robin Hood of sorts, salvaging media and old software, then making it available at no cost to the public and historical record. But there’s still a cost — the Archive spends millions each year on servers to store the more than 50 petabytes of data it’s accumulated so far (it saves more than 100 million web pages every day).

Founder Brewster Kahle calls himself a digital librarian but he’s also one of the internet’s earliest trailblazers. In the 1980s, he was a rising star at the Massachusetts Institute of Technology (MIT) before co-founding early search engine Alexa Internet (later acquired by Amazon), named after the ancient Library of Alexandria in Egypt. He then founded the Internet Archive, also inspired by the library that burnt to the ground and took with it more than 500 years of books, artifacts and global knowledge. But Brewster believes it wasn’t just fire that destroyed the ancient library — it was more about the concept of universal knowledge becoming a threat.

On a November afternoon in 2017, he’s standing in the church lobby, wearing one of his signature button-up shirts and small round glasses. He leads a small group of people past book scanners, an old gramophone and an arcade machine, then up a flight of stairs to the cathedral. Rows of pews are topped with cushions fashioned from vintage swag, like Napster shirts. Amber light streams through the stained-glass windows and domed ceiling, above a stage with a cardboard cut-out of a Wayback time machine (a cheeky, hand-painted prop left over from a conference).

Brewster admits there are no Teslas out front or stock options at a place like this. You won’t find stereotypical Silicon Valley poster children milling about, though more than 100 people work in the church basement or remotely, funded by donations and subscriptions to public archiving service Archive-It. Theirs is a labour of love for freedom of information, which can attract different crowds.

Free building tours reel in the curious public each week. On this day, a group of computer science students from a nearby college wander around the cathedral. One asks about an art installation hanging from the ceiling, a bob on a thread pointing at hard drives, titled “Earthquake Detector.” Brewster then points out a less obvious work of art, tall black vertical towers with blinking blue lights. These servers hold the primary copy of the Internet Archive. (There’s another copy in a warehouse outside San Francisco, and partial copies in global locations like Amsterdam, Canada and Egypt.)

A student asks about the server in Egypt. Do they worry about a server in a politically unstable country? What if it’s destroyed?

Brewster explains the servers work on different cycles and can’t all be taken out at once, but pauses when considering the issue of political instability.

“What would you think would be more of a threat right now? The United States or Egypt?” he asks them. “I’m not sure we’d all agree.”

Politics aren’t the only threat to web archives. Digital content is hard to catch, up one moment and down the next. Brewster estimates the average web page lives for 100 days before it’s changed or deleted. Social media statuses can live for seconds. The Internet Archive tries to capture everything (the good and bad), with bots regularly crawling websites and the public submitting links and capturing fleeting posts.

Brewster moves the group toward an exit but pauses at pews with colourful clay statues of people, a “riff” on the Terracotta Army sculptures in China. They represent people who have worked for the Archive for three years or more. He points out his own statue near Ted Nelson (who coined the term hypertext) and late internet activist Aaron Swartz.

He asks the students if they find the statues cool or creepy, then explains it’s their way of honouring the people who choose the unglamorous work of archiving over more lucrative gigs.

“The republic may not survive if we continue in the direction Silicon Valley is leading us down,” he tells them. “If we’re not building better tools that are more conscious of the effects, we’re going to end up with election cycles based on fake news, alt-facts and problems. People are now disagreeing about what’s even true or not, or if you can trust anybody. These sorts of things are real. They’re real. And a big part of the cause of these problems is within 100 miles of here.”

“The big corporations, including the logos you’re going and emblazoning yourself with—” He pauses and points to one of the students’ T-shirts with a Facebook logo on it. “They’re not on your side.”

The group chuckles but it’s a nervous laughter. Brewster is serious, his tone not unlike a dad trying to drive home a message to his teenage kids.

“I’m only going to be around for a short amount of time. It will be up to you guys,” he tells them. “They will do anything they can to make you feel indebted, or small, or inspired to go and work on some crappy project. They do that all day long. That’s their job, to try to get you to be a small cog in a big wheel and that wheel is running over other people. But maybe there’s a different way.

“Use services that are worth using,” he tells them. “And if they’re not worth using, build an alternative that is better.”

It’s an inspirational message and some of the students look like they’ve heard it. But Brewster’s plea needs to compete with what new-generation tech will offer when they step out the door: the prospect of fame, exclusive parties, multimillion-dollar buyouts and big salaries to pay for Palo Alto apartments.

Perhaps someday they can use those fortunes to right wrongs inside themselves or apologize, like Alfred Nobel and big social media companies. But maybe the damage will have already been done.

When you are 21 years old, which path do you choose?

Stacey McLeod is a Toronto-based journalist and editor who writes about the intersection of death, tech and data. She is currently working on her first creative nonfiction manuscript, For the Record: The Race to Live Forever in Virtual Afterlives.
