The need to reliably preserve our community history
Some of our younger readers may not remember what an important resource Groklaw was during a crucial time in our history. Pamela Jones and those who helped her chronicled the ups and downs of the SCO lawsuit, keeping us all informed of what was going on, digging up information that was useful for the defense, and teaching us all a great deal about how the US legal process works. Groklaw did not stop there, though; it was an irreplaceable source of information on software patents, standards, licensing, and more. It was truly one of the key community resources for many years.
The site was important enough that LWN's history contains over 800 articles and comments with links to almost 700 pages on Groklaw.
Unfortunately, Groklaw has now been taken over by somebody whose priorities differ strongly from those of the site's founder. As of this writing, Groklaw has become a platform for the promotion of crypto products. The text of some original articles remains (though decorated with crypto ads) while other articles have been replaced entirely. An important part of the free-software community's history (and that of the commercial realm around it) has been compromised.
The usual policy at LWN is to not attempt to keep up with link rot. Our history runs back to 1998; if we tried to keep up with all of the things we have linked to over those years, we would never have the time to do anything else. But finding ourselves to be hosting hundreds of once-important links that now lead to a crypto scammer's site presents a bit of a special case. We have thus, over the last week, replaced every Groklaw link we could with an equivalent link into the Internet Archive. In the end, there were only a half-dozen links that were not archived.
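As a rough illustration of what such a replacement could look like, here is a minimal Python sketch; it is not the script LWN actually used. It assumes the Wayback Machine's URL form `https://web.archive.org/web/<timestamp>/<original-url>`, where the archive serves the capture nearest to the requested timestamp. The function name, the `dead_hosts` set, the timestamp, and the example paths are all hypothetical.

```python
from urllib.parse import urlsplit

# Prefix for Wayback Machine capture URLs.
WAYBACK_PREFIX = "https://web.archive.org/web"


def to_wayback(url: str, timestamp: str, dead_hosts: set[str]) -> str:
    """Rewrite a link to a Wayback Machine URL if its host is in dead_hosts.

    timestamp is a YYYYMMDDhhmmss string; the value to use is an
    assumption here and would normally be chosen to predate the takeover.
    Links to any other host are returned unchanged.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in dead_hosts:
        return url
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"


if __name__ == "__main__":
    # Hypothetical link and timestamp, for illustration only.
    print(to_wayback("http://www.groklaw.net/some-article",
                     "20130820000000", {"groklaw.net"}))
```

A rewrite like this only produces a candidate link; each result would still need to be checked, since (as noted above) a handful of pages were never archived at all.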
This episode is an important reminder of what can happen to any Internet resource that we depend on. Web sites will only persist for as long as somebody is willing to care (and pay) for them. Certain governments are currently doing their best to disappear inconvenient information from the net — information that others depend on heavily. An era where everything is digital is an era where important history can disappear at any time and, increasingly, it does.
In this case, the Internet Archive has come to the rescue, proving its value once again. But the Internet Archive is a single point of failure, and one that is under attack at that; it, too, could disappear someday. It is not, by itself, a convincing solution to this problem.
What a convincing solution would look like is unclear at best. With tools like Git, our community has gotten part of the way there; it is hard to imagine a scenario where we would lose the kernel source or its (Git-era) development history. At least, in any such scenario, building a new kernel would likely appear relatively low on the to-do lists of any survivors. Widespread copying of important data is clearly needed, but we cannot easily do that with the Internet as a whole.
Until we figure this out, we will run an increasing risk of ending up in a
situation where the only way to learn something about our history is to ask
somebody's generative-AI system and hope that it fabricates something that
is remotely plausible. That is not quite the world that we were hoping to
build. Our species is currently faced with a number of urgent problems;
preserving our history is surely one of them.
Posted Aug 27, 2025 14:14 UTC (Wed)
by philipstorry (subscriber, #45926)
[Link]
I suspect if I'd spent the time writing posts for my website instead of comments, I'd have a much larger site. I may even have been regarded as a blogger. For whatever that may be worth...
I'm sure I left comments on Groklaw, but given my low levels of legal expertise I doubt they're a great loss to our community. But it would have been nice to have made that decision for myself!
I also think it would be good to capture oral history of the industry. People's opinions and recollections matter, even when they conflict with the facts.
Posted Aug 27, 2025 14:33 UTC (Wed)
by mricon (subscriber, #59252)
[Link] (3 responses)
I recommend we adopt it for long-term archival of all textual data.
Posted Aug 28, 2025 23:56 UTC (Thu)
by stefanha (subscriber, #55072)
[Link] (2 responses)
Posted Aug 29, 2025 18:36 UTC (Fri)
by mricon (subscriber, #59252)
[Link] (1 responses)
Posted Aug 30, 2025 10:02 UTC (Sat)
by stefanha (subscriber, #55072)
[Link]
Posted Aug 27, 2025 15:33 UTC (Wed)
by auerswal (subscriber, #119876)
[Link] (3 responses)
Posted Aug 27, 2025 17:02 UTC (Wed)
by jzb (editor, #7867)
[Link] (2 responses)
"This is only a little extra work per new article." That is an interesting idea, but it would need to be something we can automate: our articles tend to be quite link-heavy. I think most of mine have 20 or more, minimum. So, more than a little extra work. Also, aren't archive.today and archive.ph the same service? That said, it may be possible to automate saving an archive page for each link we use in an article, but that wouldn't solve the larger problem, even for us. If I reference a page today, that would ensure a capture of that page exists, but it's not uncommon for things to have already disappeared when I am writing new articles...
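To make the automation idea concrete, the first step would be collecting the outbound links from an article. The following is a minimal sketch using only the Python standard library; the class and function names are hypothetical, and it says nothing about how LWN's publishing tools actually work. Submitting each collected URL to an archive service is deliberately left out, since that depends on the service chosen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class OutlinkCollector(HTMLParser):
    """Collect unique http(s) links that point away from the article's host."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlsplit(base_url).hostname
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the article URL.
        absolute = urljoin(self.base_url, href)
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            return  # skip mailto:, javascript:, and similar
        if parts.hostname == self.base_host:
            return  # skip links back into the same site
        if absolute not in self.outlinks:
            self.outlinks.append(absolute)


def collect_outlinks(html, base_url):
    """Return the external links in html, in order of first appearance."""
    parser = OutlinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.outlinks
```

With 20 or more links per article, a list like this could be handed to whatever archiving step is chosen, though it would not help with pages that had already vanished before the article was written.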
Posted Aug 28, 2025 15:39 UTC (Thu)
by elw (subscriber, #86388)
[Link]
Posted Sep 5, 2025 0:57 UTC (Fri)
by cypherpunks2 (guest, #152408)
[Link]
Posted Aug 27, 2025 15:59 UTC (Wed)
by pabs (subscriber, #43278)
[Link]
https://wiki.archiveteam.org/
Software Heritage is a group aimed at proactively archiving all source code; it has been featured on LWN before. It also needs developers to help support archiving more VCS/forge types, and folks to submit forge instances and individual repos.
https://www.softwareheritage.org/
I'd also like to mention archive.today; it is good for archiving single pages that are difficult to save otherwise because they use JavaScript, are blocked by anti-scraper tech, or are behind paywalls. It is also separate from archive.org.
Back to the article topic, I think it would be great if there were an EU mirror of IA, and also a separate-to-IA EU web archiving project too.
Posted Aug 27, 2025 16:13 UTC (Wed)
by pabs (subscriber, #43278)
[Link]
Posted Aug 27, 2025 19:03 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Aug 27, 2025 20:25 UTC (Wed)
by mricon (subscriber, #59252)
[Link]
Posted Aug 27, 2025 20:42 UTC (Wed)
by sjfriedl (✭ supporter ✭, #10111)
[Link] (1 responses)
I wonder how many TypePad blogs will go offline forever...
Posted Aug 28, 2025 0:51 UTC (Thu)
by pabs (subscriber, #43278)
[Link]
Posted Aug 28, 2025 9:52 UTC (Thu)
by jkridner (subscriber, #179011)
[Link] (1 responses)
For example, the death of OpenID means that I have to sign into this site and leave a comment with no more authoritative stance that I am "that" Jason Kridner than anyone else who might lay claim to that name.
It still makes no sense to me that common tools, like e-mail, still don't use signatures.
Naturally, none of this replaces the need for Archive.org, but it does potentially yield more value to what it preserves and provides. While we cannot easily save the Internet as a whole from becoming a wasteland of advertising, we can at least build an infrastructure of reasonably verifiable sources.
And those people building those AI bots that are ripping the Internet to shreds, well, we can slowly educate them to at least cite sources and pull information in ways that will reduce their cost and increase their usefulness.
Oh, and I haven't gpg signed anything in years, so don't see myself as part of the problem.
iHUEARYKAB0WIQQjUJ3RTapq7acy8di8BGTymNaGdgUCaLAmoQAKCRC8BGTymNaG
Posted Aug 28, 2025 12:23 UTC (Thu)
by jkridner (subscriber, #179011)
[Link]
iHUEARYKAB0WIQQjUJ3RTapq7acy8di8BGTymNaGdgUCaLBKNwAKCRC8BGTymNaG
Posted Aug 29, 2025 12:54 UTC (Fri)
by jezuch (subscriber, #52988)
[Link] (3 responses)
Well, if the language models were trained on the source documents, it may be possible to retrieve them verbatim, bypassing the digital regurgitation. Kinda an unexpected "benefit" of the Great Vacuuming of the Internet.
Posted Aug 31, 2025 19:55 UTC (Sun)
by gf2p8affineqb (subscriber, #124723)
[Link] (2 responses)
Posted Aug 31, 2025 21:10 UTC (Sun)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Sep 2, 2025 18:55 UTC (Tue)
by jkridner (subscriber, #179011)
[Link]
Posted Sep 10, 2025 6:17 UTC (Wed)
by anton (subscriber, #25547)
[Link]
I think that is bad marketing from the owners. Eventually the search engines will stop producing links to those former Anandtech articles, and any ad money and potential traffic for their forum (which they could advertise for when showing the article) is going to dry up.
Another site that's recently gone is the NovaBBS interface to Usenet. The guy who used to run it has passed away. In this case, the content is still available (Usenet is a federated system, as they now call it), and even the Rocksolid Light software that NovaBBS used is still available, but is there a publicly available website that provides the interface? It seems to me that the (text) Usenet providers become fewer over time, and eventually Usenet will go away.
An odd coincidence
public-inbox underlying storage
Manually archiving outlinks when publishing an article
archive.is/ph/etc all have APIs that you can use to automate the creation of these archive links as well as browser plugins for Chrome and Firefox
Manually archiving outlinks when publishing an article
Be aware that a manual robots.txt exclusion of IA will cause all archived pages on that domain to be retroactively hidden!
All it would take is for Groklaw to add "ia_archiver" to its robots.txt exclusion policy and everything would be removed. Although IA did relax their policy by no longer respecting a denial of "*" and no longer respecting a denial of "ia_archiver" on government and military sites, explicitly denying that user agent would remove web history.
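For readers unfamiliar with the mechanism, the exclusion described here is an ordinary robots.txt rule naming the `ia_archiver` user agent. The sketch below uses Python's standard `urllib.robotparser` to show how such a rule reads to a conforming parser; the robots.txt content and the `example.org` URL are hypothetical, and the code shows only the parsing side, not how the Internet Archive itself decides what to hide.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that denies only the ia_archiver user agent.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def blocks_agent(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt forbids agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://example.org/some-page"
    print(blocks_agent(ROBOTS_TXT, "ia_archiver", url))
    print(blocks_agent(ROBOTS_TXT, "SomeOtherBot", url))
```

The rule affects only the named agent; other crawlers are untouched, which is why a site owner could hide the archived history without any visible change for ordinary visitors.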
Manually archiving outlinks when publishing an article
ArchiveTeam, Software Heritage & archive.today
https://archive.fart.website/archivebot/viewer/?q=groklaw
https://lwn.net/Articles/693471/
Dangers to history
Gmane....
More bit rot: TypePad blog platform going away
Meeting the world where it is at
-----BEGIN PGP SIGNATURE-----
dlWFAP9mo8m/seaZN1x0en5Hphu1QSuKgO4nZDpOtkNVu1Q1sgD+L5TIk4E8Llye
JSfSAo8EF2yJ8jwfk8CVjL/DRWtlVgU=
=+qAE
-----END PGP SIGNATURE-----
Meeting the world where it is at
-----BEGIN PGP SIGNATURE-----
diQMAQDN59zGYnR9VVsrEgLEXpBVVrUoRMFfv+3exBBHcgQBTAEAxFASxLJaWlhV
QInfpmHiuwZxUUyfOQIpJqA3CstDugc=
=DxVa
-----END PGP SIGNATURE-----
LLM "archivisation"
Wol
LLM "archivisation"
Another site that has been demolished recently is Anandtech. They ceased publishing new content a year or two ago, but their old content used to be available, until recently. Now, every link to one of their articles is redirected to a forum (which seems to be about the same topics).
Anandtech and NovaBBS
