
The need to reliably preserve our community history

By Jonathan Corbet
August 27, 2025
The Internet is a wonderful thing; it allows anybody to look up information of interest. Included in all of that is the history of the free-software development community; how we got to where we are says a lot about why things are the way they are and what might come next. So the takeover of Groklaw rings a loud alarm; we have been reminded that history stored on the Internet is an ephemeral thing and cannot be expected to remain available forever.

Some of our younger readers may not remember what an important resource Groklaw was during a crucial time in our history. Pamela Jones and those who helped her chronicled the ups and downs of the SCO lawsuit, keeping us all informed of what was going on, digging up information that was useful for the defense, and teaching us all a great deal about how the US legal process works. Groklaw did not stop there, though; it was an irreplaceable source of information on software patents, standards, licensing, and more. It was truly one of the key community resources for many years.

The site was important enough that LWN's history contains over 800 articles and comments with links to almost 700 pages on Groklaw.

Unfortunately, Groklaw has now been taken over by somebody whose priorities differ strongly from those of the site's founder. As of this writing, Groklaw has become a platform for the promotion of crypto products. The text of some original articles remains (though decorated with crypto ads) while other articles have been replaced entirely. An important part of the free-software community's history (and that of the commercial realm around it) has been compromised.

The usual policy at LWN is to not attempt to keep up with link rot. Our history runs back to 1998; if we tried to keep up with all of the things we have linked to over those years, we would never have the time to do anything else. But finding ourselves to be hosting hundreds of once-important links that now lead to a crypto scammer's site presents a bit of a special case. We have thus, over the last week, replaced every Groklaw link we could with an equivalent link into the Internet Archive. In the end, there were only a half-dozen links that were not archived.
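
A rewrite of this kind can be sketched in a few lines of Python. This is not a description of LWN's actual tooling; it assumes the Wayback Machine's public URL form, https://web.archive.org/web/<timestamp>/<original-url>, which redirects to the capture closest to the given timestamp. The cutoff date and the function name are hypothetical choices made for illustration.

```python
import re

# Hypothetical cutoff (YYYYMMDD): the Wayback Machine is asked for the
# capture closest to this date. The value is an assumption for illustration.
SNAPSHOT_DATE = "20130901"

# Matches http(s) links to groklaw.net, with or without "www.".
# The lookbehind skips links that already sit inside an archive URL,
# where the original address is preceded by a "/".
GROKLAW_LINK = re.compile(r'(?<!/)https?://(?:www\.)?groklaw\.net/[^\s"<>]*')

def archive_groklaw_links(html: str, date: str = SNAPSHOT_DATE) -> str:
    """Point every direct Groklaw link in `html` at the Internet Archive."""
    return GROKLAW_LINK.sub(
        lambda m: f"https://web.archive.org/web/{date}/{m.group(0)}", html
    )
```

Running the function a second time on its own output leaves the text unchanged, so a sketch like this could be applied repeatedly over a set of stored articles. It does not check that a capture actually exists; the half-dozen unarchived links mentioned above would still have to be found by hand.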

This episode is an important reminder of what can happen to any Internet resource that we depend on. Web sites will only persist for as long as somebody is willing to care (and pay) for them. Certain governments are currently doing their best to disappear inconvenient information from the net — information that others depend on heavily. An era where everything is digital is an era where important history can disappear at any time and, increasingly, it does.

In this case, the Internet Archive has come to the rescue, proving its value once again. But the Internet Archive is a single point of failure, and one that is under attack at that; it, too, could disappear someday. It is not, by itself, a convincing solution to this problem.

What is a convincing solution is unclear at best. With tools like Git, our community has gotten part of the way there; it is hard to imagine a scenario where we would lose the kernel source or its (Git-era) development history. At least, in any such scenario, building a new kernel would likely appear relatively low on the to-do lists of any survivors. Widespread copying of important data is clearly needed, but we cannot easily do that with the Internet as a whole.

Until we figure this out, we will run an increasing risk of ending up in a situation where the only way to learn something about our history is to ask somebody's generative-AI system and hope that it fabricates something that is remotely plausible. That is not quite the world that we were hoping to build. Our species is currently faced with a number of urgent problems; preserving our history is surely one of them.



An odd coincidence

Posted Aug 27, 2025 14:14 UTC (Wed) by philipstorry (subscriber, #45926) [Link]

Quite coincidentally, I recently decided to start populating my vanity website with old comments and posts from communities around the web, to create an archive of sorts.

I suspect if I'd spent the time writing posts for my website instead of comments, I'd have a much larger site. I may even have been regarded as a blogger. For whatever that may be worth...

I'm sure I left comments on Groklaw, but given my low levels of legal expertise I doubt they're a great loss to our community. But it would have been nice to have made that decision for myself!

I also think it would be good to capture oral history of the industry. People's opinions and recollections matter, even when they conflict with the facts.

public-inbox underlying storage

Posted Aug 27, 2025 14:33 UTC (Wed) by mricon (subscriber, #59252) [Link] (3 responses)

I like the storage format offered by public-inbox. It's a collection of git repositories that are easy to parse, back up, and replicate efficiently, and it has lots of tools to check for bitrot or tampering.

I recommend we adopt it for long-term archival of all textual data.
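
The tamper-evidence here comes from git itself: every object is named by a hash of its contents, so a silently altered message no longer matches its object ID. Below is a minimal sketch of that check for a single blob, assuming a repository that uses SHA-1 object names (git's long-standing default). The function names are illustrative only; in practice `git fsck` performs this verification for a whole repository.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def is_intact(content: bytes, expected_id: str) -> bool:
    """True if the stored bytes still hash to the recorded object ID."""
    return git_blob_id(content) == expected_id
```

Because commits and trees are hashed the same way and refer to each other by ID, a mirror only needs to compare the tip commit ID with a trusted copy to know that the entire history beneath it matches.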

public-inbox underlying storage

Posted Aug 28, 2025 23:56 UTC (Thu) by stefanha (subscriber, #55072) [Link] (2 responses)

One thing I didn't figure out when looking around a public-inbox repository: is there an index? For example, if I want to find an email by Message-ID, do I need to scan through each commit to find it?

public-inbox underlying storage

Posted Aug 29, 2025 18:36 UTC (Fri) by mricon (subscriber, #59252) [Link] (1 responses)

The underlying repository is just a series of git commits. To make it searchable, you will need to build an index for it using public-inbox-index. The simplest index maps Message-IDs to commits and is a plain sqlite3 database that you can query with simple SQL.
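
The idea can be illustrated with a toy index. The table and column names below are invented for the example and are not public-inbox's real schema; the files actually written by public-inbox-index should be inspected (for instance with the sqlite3 shell's .schema command) before querying them.

```python
import sqlite3

def build_toy_index(pairs):
    """Create an in-memory index from (message_id, commit_id) pairs.

    The schema is hypothetical and exists only for this illustration.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE msg_index (mid TEXT PRIMARY KEY, commit_id TEXT NOT NULL)"
    )
    db.executemany("INSERT INTO msg_index VALUES (?, ?)", pairs)
    return db

def find_commit(db, message_id):
    """Look up the commit for a Message-ID without scanning any history."""
    row = db.execute(
        "SELECT commit_id FROM msg_index WHERE mid = ?", (message_id,)
    ).fetchone()
    return row[0] if row else None
```

The primary key on the Message-ID column is what turns the lookup into an indexed search rather than a walk through every commit.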

public-inbox underlying storage

Posted Aug 30, 2025 10:02 UTC (Sat) by stefanha (subscriber, #55072) [Link]

Thank you!

Manually archiving outlinks when publishing an article

Posted Aug 27, 2025 15:33 UTC (Wed) by auerswal (subscriber, #119876) [Link] (3 responses)

As a suggestion: to increase the chances of finding an archived copy later, any outlink of any newly published article could be archived manually. The Wayback Machine provides a "Save Page Now" box on <https://web.archive.org/>. Then there is "archive.today" at <https://archive.today/> and <https://archive.ph/> for manually archiving web pages (no automatic crawler-based archiving there). This is only a little extra work per new article.

Manually archiving outlinks when publishing an article

Posted Aug 27, 2025 17:02 UTC (Wed) by jzb (editor, #7867) [Link] (2 responses)

"This is only a little extra work per new article."

That is an interesting idea, but it would need to be something we can automate: our articles tend to be quite link-heavy. I think most of mine have 20 or more, minimum. So, more than a little extra work. Also, aren't archive.today and archive.ph the same service?

That said, this may be automatable to save an archive page for each link we use in an article, but it wouldn't solve the larger problem, even for us. If I reference a page today, that would solve keeping a capture for that page, but it's not uncommon for things to have already disappeared when I am writing new articles...
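
As a rough sketch of what such automation might look like: the code below collects the external links from an article's HTML and builds one Wayback Machine "Save Page Now" address for each, using the https://web.archive.org/save/<url> form. The class and function names, and the choice to skip links to the publishing site itself, are assumptions made for illustration. The sketch deliberately stops short of sending any requests, since rate limits and authentication would need to be checked against the Internet Archive's current documentation.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

SAVE_PREFIX = "https://web.archive.org/save/"

class LinkCollector(HTMLParser):
    """Gather the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def save_requests(article_html, own_host="lwn.net"):
    """Return one Save Page Now URL per distinct external http(s) link."""
    collector = LinkCollector()
    collector.feed(article_html)
    seen, urls = set(), []
    for link in collector.links:
        parts = urlparse(link)
        if parts.scheme not in ("http", "https"):
            continue  # relative links, mailto:, and so on
        if parts.hostname == own_host or link in seen:
            continue  # skip our own pages and duplicates
        seen.add(link)
        urls.append(SAVE_PREFIX + link)
    return urls
```

An article with 20 or more links would produce a list of the same order, which a publishing script could then work through at whatever pace the archive permits.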

Manually archiving outlinks when publishing an article

Posted Aug 28, 2025 15:39 UTC (Thu) by elw (subscriber, #86388) [Link]

archive.is/ph/etc all have APIs that you can use to automate the creation of these archive links as well as browser plugins for Chrome and Firefox

Manually archiving outlinks when publishing an article

Posted Sep 5, 2025 0:57 UTC (Fri) by cypherpunks2 (guest, #152408) [Link]

Be aware that a manual robots.txt exclusion of IA will cause all archived pages on that domain to be retroactively hidden! All it would take is for Groklaw to add "ia_archiver" to its robots.txt exclusion policy and everything would be removed. Although IA did relax their policy by no longer respecting a denial of "*" and no longer respecting a denial of "ia_archiver" on government and military sites, explicitly denying that user agent would remove web history.
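
Whether a given site currently carries such a rule can be checked with Python's standard urllib.robotparser module. The sketch below only detects an exclusion of the "ia_archiver" user agent in a robots.txt file; how the Internet Archive acts on that rule is as described in the comment above and is not something this code can confirm. The function name and the example.org URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

def blocks_agent(robots_txt, agent="ia_archiver", url="https://example.org/"):
    """True if the given robots.txt text forbids `agent` from fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access happens here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)
```

Fetching a site's robots.txt periodically and passing its text to this function would give early warning that archived copies of linked pages may be about to become unavailable.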

ArchiveTeam, Software Heritage & archive.today

Posted Aug 27, 2025 15:59 UTC (Wed) by pabs (subscriber, #43278) [Link]

ArchiveTeam is a group aimed at saving web resources to web.archive.org and to archive.org items, mainly resources that are in immediate or potential future danger of being deleted, or are important in some way; it also proactively saves lots of websites that aren't in danger yet. If any LWN readers know of FOSS folks who have died, or old projects that are dying slowly, or technologies that are out of fashion or otherwise in danger, they can help preserve them. The #archiveteam-bs channel on the hackint IRC network is the main place to mention such resources. They have ArchiveBot, for recursive crawls of individual websites. They also have a distributed archiving mechanism, Distributed Preservation of Service (DPoS), based on volunteers running a virtual machine (called a Warrior) and spreading archiving work across all the VMs. One of the DPoS projects continuously archives new content on many different sites (including LWN, and many FOSS blog aggregator/Planet sites). They also have other projects for archiving code repositories, and for archiving MediaWiki/DokuWiki wikis as re-importable data. Some members also work on archiving FOSS-adjacent things, like MoinMoin wikis, IRC logs, Trac, Bugzilla, mailing lists in general (and Mailman2 lists in particular), pastebins and dying hosting sites like TuxFamily. They also welcome developers to work on the software that these activities are based on; most of it is under FOSS licenses. Unfortunately it looks like ArchiveBot was never set loose on Groklaw, or more of it could have been preserved.

https://wiki.archiveteam.org/
https://archive.fart.website/archivebot/viewer/?q=groklaw

Software Heritage is a group aimed at proactively archiving all source code; it has been featured on LWN before. It also needs developers to help support archiving more VCS/forge types, and folks to submit forge instances and individual repos.

https://www.softwareheritage.org/
https://lwn.net/Articles/693471/

I'd also like to mention archive.today; it is good for archiving single pages that are difficult to save otherwise because they rely on JavaScript, are blocked by anti-scraper tech, or are behind paywalls. It is also separate from archive.org.

https://archive.today/

Back to the article topic, I think it would be great if there were an EU mirror of IA, and also a separate-from-IA EU web archiving project too.

Dangers to history

Posted Aug 27, 2025 16:13 UTC (Wed) by pabs (subscriber, #43278) [Link]

Significant dangers to digital history are deletionism, upgrades, migrations, shrinking communities, short memories and expiring browser histories. All of these mean resources can't be easily found on archive sites using their previous URLs. If folks notice signs of these, consider preventing those things at least until the resources can be archived, and getting the word out about the changes through LWN and other news sources.

Gmane....

Posted Aug 27, 2025 19:03 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I wish somebody resurrected GMane. It had so much old lore that is now inaccessible. I would gladly chip in with some funding.

Gmane....

Posted Aug 27, 2025 20:25 UTC (Wed) by mricon (subscriber, #59252) [Link]

The difficulty is not in just copying gmane's data, but dealing with constant requests for data deletion, including through courts. "Chipping in" would need to be on an ongoing basis and be a full-time job.

More bit rot: TypePad blog platform going away

Posted Aug 27, 2025 20:42 UTC (Wed) by sjfriedl (✭ supporter ✭, #10111) [Link] (1 responses)

Interesting timing: I was notified today that my blog host, TypePad, is shutting down at the end of September, meaning there's going to be a mad scramble to re-home stuff. Unfortunate timing for me that I start a three-week vacation in Europe next week, so it's going to suck for me.

I wonder how many TypePad blogs will go offline forever...

More bit rot: TypePad blog platform going away

Posted Aug 28, 2025 0:51 UTC (Thu) by pabs (subscriber, #43278) [Link]

TypePad is on ArchiveTeam's radar already, so I expect they will start a Distributed Preservation of Service (DPoS) project soonish. Check out my comment above, and this wiki page if you want to help with that when it starts.

https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior

Meeting the world where it is at

Posted Aug 28, 2025 9:52 UTC (Thu) by jkridner (subscriber, #179011) [Link] (1 responses)

For those of us sensitive to authoritative history and provenance, I'd suggest we find ways to make it simpler to do-the-right-thing and know-what-the-right-thing-is.

For example, the death of OpenID means that I have to sign into this site and leave a comment with no more authoritative stance that I am "that" Jason Kridner than anyone else who might lay claim to that name.

It still makes no sense to me that common tools, like e-mail, still don't use signatures.

Naturally, none of this replace the need for Archive.org, but does potentially yield more value to what it preserves and provides. While we cannot easily save the Internet as a whole from becoming a wasteland of advertising, we can at least build an infrastructure of reasonably verifiable sources.

And those people building those AI bots that are ripping the Internet to shreds, well, we can slowly educate them to at least cite sources and pull information in ways that will reduce their cost and increase their usefulness.

Oh, and I haven't gpg signed anything in years, so don't see myself as part of the problem.
-----BEGIN PGP SIGNATURE-----

iHUEARYKAB0WIQQjUJ3RTapq7acy8di8BGTymNaGdgUCaLAmoQAKCRC8BGTymNaG
dlWFAP9mo8m/seaZN1x0en5Hphu1QSuKgO4nZDpOtkNVu1Q1sgD+L5TIk4E8Llye
JSfSAo8EF2yJ8jwfk8CVjL/DRWtlVgU=
=+qAE
-----END PGP SIGNATURE-----

Meeting the world where it is at

Posted Aug 28, 2025 12:23 UTC (Thu) by jkridner (subscriber, #179011) [Link]

er, I meant to say that is not that I don't see myself as part of the problem. That is, I think we all can do more to help make doing the right thing easier and do more to show what the right thing is--or at least to try to figure it out.
-----BEGIN PGP SIGNATURE-----

iHUEARYKAB0WIQQjUJ3RTapq7acy8di8BGTymNaGdgUCaLBKNwAKCRC8BGTymNaG
diQMAQDN59zGYnR9VVsrEgLEXpBVVrUoRMFfv+3exBBHcgQBTAEAxFASxLJaWlhV
QInfpmHiuwZxUUyfOQIpJqA3CstDugc=
=DxVa
-----END PGP SIGNATURE-----

LLM "archivisation"

Posted Aug 29, 2025 12:54 UTC (Fri) by jezuch (subscriber, #52988) [Link] (3 responses)

> Until we figure this out, we will run an increasing risk of ending up in a situation where the only way to learn something about our history is to ask somebody's generative-AI system and hope that it fabricates something that is remotely plausible.

Well, if the language models were trained on the source documents, it may be possible to retrieve them verbatim, bypassing the digital regurgitation. Kinda an unexpected "benefit" of the Great Vacuuming of the Internet.

LLM "archivisation"

Posted Aug 31, 2025 19:55 UTC (Sun) by gf2p8affineqb (subscriber, #124723) [Link] (2 responses)

How would you know if they are actually verbatim though?

LLM "archivisation"

Posted Aug 31, 2025 21:10 UTC (Sun) by Wol (subscriber, #4433) [Link]

Or whether the source documents were human-generated, or themselves AI slop from the previous iteration ... ?

Cheers,
Wol

LLM "archivisation"

Posted Sep 2, 2025 18:55 UTC (Tue) by jkridner (subscriber, #179011) [Link]

Digital signatures would go a long way. If we quote an article, we can at least attest to "the article said X when I read it". A digital signature doesn't necessarily have to mean authorship. Not sure where to find any good resources on best practices for attestation signatures and usage on archive.org, but I think it'd go a long way to differentiating between random crap and not-so-random crap.

Anandtech and NovaBBS

Posted Sep 10, 2025 6:17 UTC (Wed) by anton (subscriber, #25547) [Link]

Another site that has been demolished recently is Anandtech. They ceased publishing new content a year or two ago, but their old content used to be available, until recently. Now, every link to one of their articles is redirected to a forum (which seems to be about the same topics).

I think that is bad marketing from the owners. Eventually the search engines will stop producing links to those former Anandtech articles, and any ad money and potential traffic for their forum (which they could advertise for when showing the article) is going to dry up.

Another site that's recently gone is the NovaBBS interface to Usenet. The guy who used to run it has passed away. In this case, the content is still available (Usenet is a federated system, as they now call it), and even the Rocksolid Light software that NovaBBS used is still available, but is there a publicly available website that provides the interface? It seems to me that the (text) Usenet providers become fewer over time, and eventually Usenet will go away.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds