Have you ever seen a Gutenberg Bible? They’re beautiful: leather-bound testaments to the power of preservation. Only 49 copies are known to exist, and they sit in vaults and rare-book rooms at places like Harvard and the British Library, where they are painstakingly protected for future generations.
Contrast that with how many of America’s most important public datasets — on climate, incarceration, public health and so much more — exist in just one copy. And that copy might live on a single government server.
Until it doesn’t.
A dataset on pesticide exposure in children? Gone. No redirect. No warning. No backup.
Data — that endless stream of information we assume will always be there — is extremely fragile, vulnerable to things as banal as a power failure and as oppressive as an authoritarian government. And most of it lives in exactly one place.
We all assume the internet is forever. But in reality, we’ve built systems that work more like ephemeral ink than carved stone. And unlike the Gutenberg Bible, no one brags about owning a copy of a .gov spreadsheet.
And yet, if you look closely, you’ll find signs of quiet erasure everywhere. Entire datasets vanish with no notice. URLs break. Pages redirect — or don’t. Government spreadsheets blink out of existence.
Thankfully, a growing group of archivists, technologists and activists is sounding the alarm — and doing something about it. Their work forms a kind of ecosystem: some are documenting what’s already disappearing, others are building long-term infrastructure, and all of them are teaching us what resilience really looks like.
Some are acting like first responders, racing to capture what’s already disappearing. My friend Denise Ross and a small but mighty team at America’s Essential Data launched America’s Data Index as a way to track when public datasets quietly disappear or are altered without explanation. They’ve already flagged over 500 cases.
At the Library Innovation Lab at Harvard Law School, Jack Cushman, along with Jonathan Zittrain, is helping mirror all of data.gov, making it searchable, usable and resilient. Considering that key data has been disappearing from the site since January, this is an urgent task. They’re treating the public record like a living thing — one that needs constant watching.
And others are building for the long haul — not just catching what’s vanishing, but trying to remember everything, forever. Mark Graham and the team at the Internet Archive are doing heroic work: By building a nonprofit “record button” for the web, they’re preserving millions of pages that might otherwise vanish. So far they’ve captured more than 835 billion web pages and 44 million books and texts, available free to all. It’s one of the largest-scale memory projects humanity has ever attempted — a Library of Alexandria for the digital age, if we’re lucky enough to keep it standing.
We faced the problem of resilience at TUMO Center for Creative Technologies, a free digital education program based in Armenia. (I’m on the board.) Fearful that the Azerbaijani government might destroy endangered cultural heritage sites in Artsakh, we had students go out and digitize them. They came back with 3-D scans, drone imagery and photographs of sites that may no longer exist. These digital “copies” may be the only thing left of some of these historic monuments, so we stored them in locations oceans apart. We then partnered with the Starling Lab for Data Integrity and Jonathan Dotan to store that sensitive, irreplaceable data in ways that guaranteed its integrity and survivability, and to distribute it across the web so it wouldn’t disappear.
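Starling Lab’s actual pipeline is far more sophisticated than anything I could fit here, but the core idea is simple: check every copy against a cryptographic fingerprint, so corruption or tampering is detectable no matter where the bytes live. A minimal sketch in Python, with hypothetical file paths:

```python
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# A scan stored locally, and its replica an ocean away (paths are made up).
original = fingerprint(Path("scans/monument_scan.ply"))
replica = fingerprint(Path("/mnt/offsite_mirror/monument_scan.ply"))

# Matching digests mean the replica is bit-for-bit identical to the original.
assert original == replica, "replica is corrupted or has been altered"
```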
When you’re building a service like Twitter, you’re always thinking about redundancy. A hard disk typically fails within three to five years, so instead of putting data on one drive, you build what’s called a RAID array, which stores the data redundantly across multiple drives. If one drive fails, that’s fine: the data survives on the others. Unplug the failed drive and replace it. The same goes for servers: a server may fail within three to five years, too, so never have only one server managing something. Either distribute the responsibility or have a redundant one ready to kick in.
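Real RAID lives in hardware or the operating system, but the principle fits in a toy sketch. Here the “drives” are just directories on disk, the scheme is simple mirroring (RAID 1), and every name is my own invention:

```python
from pathlib import Path

# Stand-ins for physical disks; a real array would span separate devices.
DRIVES = [Path("drive_a"), Path("drive_b"), Path("drive_c")]

def write_mirrored(name: str, data: bytes) -> None:
    """Write the same bytes to every drive, RAID 1-style."""
    for drive in DRIVES:
        drive.mkdir(exist_ok=True)
        (drive / name).write_bytes(data)

def read_surviving(name: str) -> bytes:
    """Read from the first drive that still has the file, tolerating failures."""
    for drive in DRIVES:
        try:
            return (drive / name).read_bytes()
        except OSError:  # a failed, full or unplugged drive
            continue
    raise FileNotFoundError(f"all copies of {name!r} are lost")

write_mirrored("public_dataset.csv", b"the only copy, now in triplicate\n")
print(read_surviving("public_dataset.csv"))  # survives any single drive failure
```

Production systems use parity rather than full copies to save space, but the guarantee is the same: losing one drive loses nothing.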
What we need is a “RAID array for civilization.” Let’s create a system where no single hard drive — or server, agency, administration or country — can silently lose the only copy. But in this case it’s not just technical redundancy. Real resilience means redundancy across every dimension (a rough sketch of the idea follows this list):
→ Not just storage drives, but geographies: Keep copies in multiple parts of the world in case of natural disasters.
→ Not just geographies, but institutions: Keep copies across different organizations in case an organization ceases to exist.
→ Not just institutions, but funding models: Make sure those organizations draw on diverse funding streams, so that no single budget cut or donor’s exit can take the data down with it.
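Here is the sketch I promised: a hypothetical picture of what a “RAID array for civilization” might verify before declaring a dataset safe. Every field and value below is invented for illustration; the test is what matters. A dataset isn’t resilient until its copies differ along every axis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Copy:
    location: str     # where the bytes physically live
    geography: str    # continent or region
    institution: str  # who is responsible for the copy
    funding: str      # what pays to keep it alive

def is_resilient(copies: list[Copy]) -> bool:
    """True only if the copies span at least two distinct values on every axis."""
    axes = ("geography", "institution", "funding")
    return all(len({getattr(c, axis) for c in copies}) >= 2 for axis in axes)

copies = [
    Copy("s3://mirror-us/dataset", "North America", "university library", "endowment"),
    Copy("https://mirror.eu/dataset", "Europe", "nonprofit archive", "donations"),
]
# True: no single disaster, closure or budget cut can take out both copies.
print(is_resilient(copies))
```

Two copies in the same country, or paid for by the same grant, fail the test, and that is exactly the failure mode we keep seeing.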
We need this so that our collective memory — our cultural record, our public truths — can withstand the test of time. If we only have one copy, we may not really have it at all. Just like those invaluable Gutenberg Bibles, we need to preserve, protect and distribute our data. Those books have survived for nearly 600 years. What will we have to show for our time?
In addition to data preservation, we also need a program of data migration. Many datasets depend on software that no longer runs on modern systems. This is a problem for older SQL databases as well as for Internet Archive content that no longer displays correctly (think of early animation tools, or anything built on Adobe Flash). Preservation without migration is only half the job.
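Migration can be as unglamorous as walking an aging database and re-expressing it in a format with better survival odds. A minimal sketch, assuming the source is a legacy SQLite file (the filename is hypothetical): dump every table to plain CSV, a dumb, durable format any future system will be able to read.

```python
import csv
import sqlite3
from pathlib import Path

# A legacy database we can still open today but perhaps not in twenty years.
conn = sqlite3.connect("legacy_records.db")
out_dir = Path("migrated_csv")
out_dir.mkdir(exist_ok=True)

# Copy every table out as plain CSV, headers included.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for table in tables:
    cursor = conn.execute(f"SELECT * FROM {table}")  # fine for a sketch; quote names in real code
    with (out_dir / f"{table}.csv").open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor)
conn.close()
```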