Archiving web sites
I recently took a deep dive into web site archival for friends who were worried about losing control over the hosting of their work online in the face of poor system administration or hostile removal. This makes web site archival an essential instrument in the toolbox of any system administrator. As it turns out, some sites are much harder to archive than others. This article goes through the process of archiving traditional web sites and shows how it falls short when confronted with the latest fashions in the single-page applications that are bloating the modern web.
Converting simple sites
The days of handcrafted HTML web sites are long gone. Now web sites are dynamic and built on the fly using the latest JavaScript, PHP, or Python framework. As a result, the sites are more fragile: a database crash, spurious upgrade, or unpatched vulnerability might lose data. In my previous life as a web developer, I had to come to terms with the idea that customers expect web sites to basically work forever. This expectation matches poorly with the "move fast and break things" attitude of web development. Working with the Drupal content-management system (CMS) was particularly challenging in that regard as major upgrades deliberately break compatibility with third-party modules, which implies a costly upgrade process that clients could seldom afford. The solution was to archive those sites: take a living, dynamic web site and turn it into plain HTML files that any web server can serve forever. This process is useful for your own dynamic sites but also for third-party sites that are outside of your control and that you might want to safeguard.
For simple or static sites, the venerable Wget program works well. The incantation to mirror a full web site, however, is byzantine:
$ nice wget --mirror --execute robots=off --no-verbose --convert-links \
--backup-converted --page-requisites --adjust-extension \
--base=./ --directory-prefix=./ --span-hosts \
--domains=www.example.com,example.com http://www.example.com/
The above downloads the content of the web page, but also crawls everything within the specified domains. Before you run this against your favorite site, consider the impact such a crawl might have on the site. The above command line deliberately ignores robots.txt rules, as is now common practice for archivists, and hammers the website as fast as it can. Most crawlers have options to pause between hits and limit bandwidth usage to avoid overwhelming the target site.
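As a rough illustration of the pause and bandwidth options mentioned above, here is a minimal Python sketch that assembles a gentler variant of the Wget command. The function name, the one-second wait, and the 200k rate cap are assumptions chosen for the example, not recommendations from the article; --wait, --random-wait, and --limit-rate are standard Wget options. Unlike the command above, this variant does not pass robots=off.

```python
import shlex


def polite_mirror_command(url, domains, wait_seconds=1, rate_limit="200k"):
    """Build a wget argument list that mirrors `url` while throttling requests.

    The defaults are illustrative assumptions, not tuned values.
    """
    return [
        "wget",
        "--mirror",
        "--convert-links",
        "--page-requisites",
        "--adjust-extension",
        "--span-hosts",
        "--domains=" + ",".join(domains),
        f"--wait={wait_seconds}",      # pause between requests, in seconds
        "--random-wait",               # vary the pause around the --wait value
        f"--limit-rate={rate_limit}",  # cap the download bandwidth
        url,
    ]


command = polite_mirror_command(
    "http://www.example.com/", ["www.example.com", "example.com"]
)
# Print a copy-pasteable shell command instead of running the crawl here.
print(shlex.join(command))
```

The resulting list could be handed to subprocess.run() once the target site's operators are comfortable with the crawl.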
The above command will also fetch "page requisites" like style sheets (CSS), images, and scripts. The downloaded page contents are modified so that links point to the local copy as well. Any web server can host the resulting file set, which results in a static copy of the original web site.
That is, when things go well. Anyone who has ever worked with a computer
knows that things seldom go according to plan; all sorts of
things can make the procedure derail in interesting ways. For example,
it was trendy for a while to have calendar blocks in web sites. A CMS
would generate those on the fly and make crawlers go into an infinite
loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions
(e.g. Wget has a --reject-regex option) to ignore problematic
resources. Another option, if the administration interface for the
web site is accessible, is to disable calendars, login forms, comment
forms, and other dynamic areas. Once the site becomes static, those
will stop working anyway, so it makes sense to remove such clutter
from the original site as well.
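To give an idea of what such a filter looks like, the following Python sketch applies a reject pattern to a list of candidate URLs. The pattern and the URLs are hypothetical; a real site needs a pattern matched to its own calendar links. A pattern this simple should also be usable with Wget's --reject-regex option, though that is worth checking against a small test crawl first.

```python
import re

# Hypothetical pattern: calendar pages such as /calendar/2018-09
# or listings paged with a ?month=2018-09 query parameter.
REJECT = re.compile(r"/calendar/|[?&]month=")


def should_fetch(url):
    """Return True when the URL does not match the reject pattern."""
    return REJECT.search(url) is None


candidates = [
    "http://www.example.com/about.html",
    "http://www.example.com/calendar/2018-09",
    "http://www.example.com/events?month=2018-10",
]
to_fetch = [url for url in candidates if should_fetch(url)]
print(to_fetch)
```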
JavaScript doom
Unfortunately, some web sites are built with much more than pure HTML. In single-page sites, for example, the web browser builds the content itself by executing a small JavaScript program. A simple user agent like Wget will struggle to reconstruct a meaningful static copy of those sites as it does not support JavaScript at all. In theory, web sites should be using progressive enhancement to have content and functionality available without JavaScript but those directives are rarely followed, as anyone using plugins like NoScript or uMatrix will confirm.
Traditional archival methods sometimes fail in the dumbest way. When
trying to build an offsite backup of a local newspaper
(pamplemousse.ca), I found that
WordPress adds query strings
(e.g. ?ver=1.12.4) at the end of JavaScript includes. This confuses
content-type detection in the web servers that serve the archive, which
rely on the file extension
to send the right Content-Type header. When such an archive is
loaded in a
web browser, it fails to load scripts, which breaks dynamic websites.
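The failure can be reproduced in a few lines. The sketch below uses Python's mimetypes module as a stand-in for a static web server that derives the Content-Type from the file extension; the file name is a hypothetical example of what a crawler might save, and real servers use their own lookup tables.

```python
import mimetypes


def guess_served_type(filename):
    """Mimic a server that picks Content-Type from the file extension."""
    content_type, _encoding = mimetypes.guess_type(filename)
    # Unknown extensions typically fall back to a generic binary type.
    return content_type or "application/octet-stream"


def strip_query_suffix(filename):
    """Return the file name without a trailing ?query part."""
    return filename.split("?", 1)[0]


saved_name = "jquery.js?ver=1.12.4"  # hypothetical name as saved on disk

# The "extension" seen here is ".4", which maps to no known type.
print(guess_served_type(saved_name))

# With the query string removed, the ".js" extension is recognized.
print(guess_served_type(strip_query_suffix(saved_name)))
```

Renaming files this way is only a workaround: links inside the archived pages would need to be rewritten to match.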
As the web moves toward using the browser as a virtual machine to run arbitrary code, archival methods relying on pure HTML parsing need to adapt. The solution for such problems is to record (and replay) the HTTP headers delivered by the server during the crawl and indeed professional archivists use just such an approach.
Creating and displaying WARC files
At the Internet Archive, Brewster
Kahle and Mike Burner designed
the ARC (for "ARChive") file format in 1996 to provide a way to
aggregate the millions of small files produced by their archival
efforts. The format was eventually standardized as the WARC ("Web
ARChive") specification that
was released as an ISO standard in 2009 and
revised in 2017. The standardization effort was led by the International Internet
Preservation Consortium (IIPC), which is an "international
organization of libraries and other organizations established to
coordinate efforts to preserve internet content for the future",
according to Wikipedia; it includes members such as the US Library of
Congress and the Internet Archive. The latter uses the WARC format
internally in its Java-based Heritrix
crawler.
A WARC file aggregates multiple resources like HTTP headers, file
contents, and other metadata in a single compressed
archive. Conveniently, Wget actually supports the file format with
the --warc-file option and related --warc-* parameters. Unfortunately, web browsers cannot render WARC
files directly, so a viewer or some conversion is necessary to access
the archive. The simplest such viewer I have found is pywb, a
Python package that runs a simple webserver to offer a
Wayback-Machine-like interface to browse the contents of WARC
files. The following set of commands will render a WARC file on
http://localhost:8080/:
$ pip install pywb
$ wb-manager init example
$ wb-manager add example crawl.warc.gz
$ wayback
This tool was, incidentally, built by the folks behind the Webrecorder service, which can use a web browser to save dynamic page contents.
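To make the record structure described above more concrete, here is a minimal standard-library sketch that serializes a single WARC "resource" record and compresses it. It is an illustration under simplifying assumptions, not a replacement for a real WARC library: the function name and payload are made up, only a handful of header fields are written, and a crawler would normally store "request" and "response" records that also carry the HTTP headers.

```python
import gzip
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri, payload, content_type):
    """Serialize one WARC 'resource' record: version line, headers, payload."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    head += "\r\n"  # blank line separates the headers from the payload
    # Each record ends with two CRLF pairs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


record = warc_resource_record(
    "http://www.example.com/", b"<html>hello</html>", "text/html"
)
# Records are usually compressed one by one, so that a reader can seek
# to a single record without decompressing the whole archive.
compressed = gzip.compress(record)
print(record.decode("utf-8"))
```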
Unfortunately, pywb has trouble loading WARC files generated by Wget because Wget followed an inconsistency in the 1.0 specification, which was fixed in the 1.1 specification. Until Wget or pywb fix those problems, WARC files produced by Wget are not reliable enough for my uses, so I have looked at other alternatives. A crawler that got my attention is simply called crawl. Here is how it is invoked:
$ crawl https://example.com/
(It does say "very simple" in the README.) The program does support
some command-line options, but most of its defaults are sane: it will fetch
page requirements from other domains (unless the -exclude-related
flag is used), but does not recurse out of the domain. By default, it
fires up ten parallel connections to the remote site, a setting that
can be changed with the -c flag. But, best of all, the resulting WARC
files load perfectly in pywb.
Future work and alternatives
There are plenty more resources
for using WARC files. In
particular, there's a Wget drop-in replacement called Wpull that is
specifically designed for archiving web sites. It has experimental
support for PhantomJS and youtube-dl integration that
should allow downloading more complex JavaScript sites and streaming
multimedia, respectively. The software is the basis for an elaborate
archival tool called ArchiveBot,
which is used by the "loose collective of
rogue archivists, programmers, writers and loudmouths" at
ArchiveTeam in its struggle to
"save the history before it's lost forever". It seems that PhantomJS integration does not work as well as
the team wants, so ArchiveTeam also uses a rag-tag bunch of other
tools to mirror more complex sites. For example, snscrape will
crawl a social media profile to generate a list of pages to send into
ArchiveBot. Another tool the team employs is crocoite, which uses
the Chrome browser in headless mode to archive JavaScript-heavy sites.
This article would also not be complete without a nod to the HTTrack project, the "website copier". Working similarly to Wget, HTTrack creates local copies of remote web sites but unfortunately does not support WARC output. Its interactive aspects might be of more interest to novice users unfamiliar with the command line.
In the same vein, during my research I found a full rewrite of Wget called Wget2 that has support for multi-threaded operation, which might make it faster than its predecessor. It is missing some features from Wget, however, most notably reject patterns, WARC output, and FTP support but adds RSS, DNS caching, and improved TLS support.
Finally, my personal dream for these kinds of tools would be to have them integrated with my existing bookmark system. I currently keep interesting links in Wallabag, a self-hosted "read it later" service designed as a free-software alternative to Pocket (now owned by Mozilla). But Wallabag, by design, creates only a "readable" version of the article instead of a full copy. In some cases, the "readable version" is actually unreadable and Wallabag sometimes fails to parse the article. Instead, other tools like bookmark-archiver or reminiscence save a screenshot of the page along with full HTML but, unfortunately, no WARC file that would allow an even more faithful replay.
The sad truth of my experiences with mirrors and archival is that data dies. Fortunately, amateur archivists have tools at their disposal to keep interesting content alive online. For those who do not want to go through that trouble, the Internet Archive seems to be here to stay and Archive Team is obviously working on a backup of the Internet Archive itself.
| Index entries for this article | |
|---|---|
| GuestArticles | Beaupré, Antoine |
Posted Sep 25, 2018 14:48 UTC (Tue)
by anarcat (subscriber, #66354)
[Link] (1 responses)
Posted Oct 4, 2018 17:58 UTC (Thu)
by anarcat (subscriber, #66354)
[Link]
The Pamplemousse crawl is now available on the Internet
Archive; it might end up in the Wayback Machine at some point if
the Archive curators think it is worth it.
Posted Sep 25, 2018 15:52 UTC (Tue)
by bnewbold (subscriber, #72587)
[Link] (9 responses)
Thanks for another well researched article! In particular, great to see Archive Team get more attention and love, they are a really impressive community (IMHO).
Two additional resources the Internet Archive has, which don't fit the personal and self-sufficient angle this article focuses on, are the "Save Page Now" feature/API on web.archive.org (which allows anybody to request that the archive crawlers immediately snapshot a single page plus embedded resources), and Brozzler, our new "headless browser" crawler (https://github.com/internetarchive/brozzler), which combines with warcprox (https://github.com/internetarchive/warcprox), a proxy that saves all HTTP(S) traffic to WARC format. The tide seems to be in the direction of using headless browsers over custom crawling tools (like Heritrix), though the cost is *significantly* more, so it hasn't been used in non-profit crawling at the same scales yet (as far as I know). As some context there, it's my understanding that Google and other search crawlers have been using headless browsers for years (there is a narrative that this is why Chrome/Chromium is "so fast" and had so much sandboxing/security focus in the early days compared to other browsers).
The Archive also has an "archive as a service" offering with hundreds of institutional users, with full control over crawl prioritization and WARC export, but the cost is high for individual users (feels too self-promotional to link, but you can find it easily).
Posted Sep 25, 2018 21:21 UTC (Tue)
by Kamilion (guest, #42576)
[Link]
Posted Sep 26, 2018 17:09 UTC (Wed)
by jond (subscriber, #37669)
[Link] (2 responses)
Posted Sep 27, 2018 18:12 UTC (Thu)
by bnewbold (subscriber, #72587)
[Link] (1 responses)
Many of the major components (Heritrix, Brozzler, various Wayback replay tools, trough, warcprox, etc) are free software, and we have been amenable to, eg, licensing front-end javascript code. Many of us strongly believe in FLOSS principles, but we have limited resources and social capital to spend, and have tried to focus those on the highest impact changes.
I encourage you to keep asking though! In the meanwhile a lot of smaller groups hit our "save page now" API with a cron script to backup smaller websites (reasonable solution for up to a couple thousand URLs), which costs nothing and just takes (volunteer) time.
Posted Sep 27, 2018 18:22 UTC (Thu)
by anarcat (subscriber, #66354)
[Link]
That said, "hitting save page now" is basically what I'm doing on my blog. I wrote a feed2exec plugin to ping the wayback machine when new content is posted on my site, and it has served me well.
But for larger operations, I hope my article shows that there are plenty of tools out there to build your own little internet archive. It might not have all the bells and whistles (multimedia support and library collections, for example), but you can get pretty far with ArchiveBot/crocoite/wpull and a viewer like pywb.
It really depends, after all, what you want to actually do: archive your own website? other websites? old software?
For the latter, by the way, a significant resource might also be the software heritage folks although they are primarily focused on source code...
Posted Oct 8, 2018 0:02 UTC (Mon)
by pabs (subscriber, #43278)
[Link] (3 responses)
$ HEAD https://web.archive.org/save/https://lwn.net/
Posted Oct 11, 2018 15:14 UTC (Thu)
by anarcat (subscriber, #66354)
[Link] (2 responses)
Posted Oct 12, 2018 3:31 UTC (Fri)
by pabs (subscriber, #43278)
[Link] (1 responses)
Posted Oct 24, 2018 21:08 UTC (Wed)
by anarcat (subscriber, #66354)
[Link]
Posted Nov 23, 2018 4:03 UTC (Fri)
by nikisweeting (guest, #128789)
[Link]
I'm the maintainer of Bookmark Archiver, mentioned near the end, and we'd love to add a Brozzler/WarcProx system that can crawl pages using Chrome headless to replay JS and other user actions recorded with puppeteer during the browsing session.
The end goal is to have exactly the actions that I took when visiting a site replayable at a later date. I may even save the VM containing the binaries for a browser capable of replaying the archive once a year, so that the sites can be visited far, far in the future on any x86 compatible machine.
Posted Sep 25, 2018 16:48 UTC (Tue)
by xxiao (guest, #9631)
[Link] (1 responses)
Web scraping with Python can work reliably with static sites, but it has to use a headless browser or similar to crawl JavaScript sites, which is very challenging too.
Posted Sep 25, 2018 16:59 UTC (Tue)
by anarcat (subscriber, #66354)
[Link]
Posted Sep 26, 2018 8:51 UTC (Wed)
by zoobab (guest, #9945)
[Link] (8 responses)
Not to complain about the Internet Archive, but I run the simplest static website exposing some directories with lighttpd:
https://web.archive.org/web/20170615082215/http://filez.z...
Just click on "allwinner"; it basically did not manage to maintain the links.
Posted Sep 26, 2018 10:10 UTC (Wed)
by tlamp (subscriber, #108540)
[Link] (6 responses)
That's just one part of the equation... You still need tooling to crawl and save the whole page (and its internal links) in a future accessible offline format, like WARC is.
I could imagine that Selenium could be used as driver/backend for one of the projects mentioned, though.
Posted Sep 26, 2018 11:33 UTC (Wed)
by zoobab (guest, #9945)
[Link] (5 responses)
I found a way to pilot it with the REPL+telnet, still have to document it. And this plugin rewrites the links as well.
Posted Oct 3, 2018 22:44 UTC (Wed)
by debacle (subscriber, #7114)
[Link] (4 responses)
(*) This is: $ echo firefox-esr hold | sudo dpkg --set-selections
Posted Oct 4, 2018 1:44 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (3 responses)
Posted Oct 4, 2018 9:15 UTC (Thu)
by debacle (subscriber, #7114)
[Link] (2 responses)
Posted Oct 4, 2018 13:50 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (1 responses)
https://bonedaddy.net/pabs3/log/2018/09/08/webextocalypse/
Posted Dec 30, 2018 10:45 UTC (Sun)
by debacle (subscriber, #7114)
[Link]
Posted Sep 27, 2018 2:11 UTC (Thu)
by anarcat (subscriber, #66354)
[Link]
I counter that there's a better way to archive those files on archive.org: you could upload those to the Internet archive software collection. That way there would be meaningful, semantic data associated with those things instead of just an opaque directory listing.
That is more work, of course... Which is why I packaged that commandline tool (because their web UI is hellish). ;)
Posted Oct 1, 2018 17:09 UTC (Mon)
by ceplm (subscriber, #41334)
[Link] (3 responses)
Posted Oct 1, 2018 17:24 UTC (Mon)
by anarcat (subscriber, #66354)
[Link] (2 responses)
(There's something to be said about the sustainability of browser add-ons here, but I'll stay on topic.. ;)
Posted Oct 2, 2018 9:19 UTC (Tue)
by ceplm (subscriber, #41334)
[Link] (1 responses)
2. hideous UI prepared for Internet Archive, but not for humans. Search? really?
I actually don't hate the fact that KDE WAR and MAFF were just a file storing the web page in question, and I would like to store that one page somewhere in my regular files like any other document.
So, if pywb had just a command display with one parameter [directory] which would open as http://localhost:8080 with a list of all archives in the directory and its subdirectories, and just by simple clicking on that name of archive one would open saved page. Nothing more.
Posted Oct 2, 2018 15:03 UTC (Tue)
by anarcat (subscriber, #66354)
[Link]
Posted Oct 11, 2018 0:38 UTC (Thu)
by ikreymer (guest, #127798)
[Link] (1 responses)
Thanks for mentioning pywb and Webrecorder. I wanted to mention that we've released an updated version of our warcio library, which is used by pywb, and it should now be able to handle WARCs created by wget. Thanks for bringing attention to this issue! If you've already installed pywb, you can update the warcio library by running pip install -U warcio to get the latest.
I also wanted to mention a few other options for creating WARCs using warcio and pywb.
pywb has a built-in 'record' mode that allows you to record directly into a pywb collection by, for example, browsing http://localhost:8080/example/record/http://example.com/. You can enable this by running wayback --live --record.
pywb also supports proxy mode recording and allows you to set pywb as the http and https proxy for your browser. You can record directly into a pywb collection and content recorded in this way is more likely to work when replayed in pywb.
Posted Feb 29, 2020 12:34 UTC (Sat)
by dddddcccccc (guest, #137523)
[Link]
I have a project that is trying to address this. It's a work in progress but produces completely faithful and functional archives.
https://github.com/dosyago/22120
Posted Jul 30, 2020 0:06 UTC (Thu)
by nikisweeting (guest, #128789)
[Link]
(Disclaimer: I'm the author of that package)
https://github.com/pirate/ArchiveBox
$ pip install archivebox
As usual, here's the list of issues and patches generated while researching this article:
Archiving web sites
I also want to personally thank the folks in the #archivebot channel for their assistance and letting me play with their toys.
ia commandline tool
As it turns out, I couldn't stop working on this topic and opened two more PRs upstream after submitting WARC files to the internet archive:
Archiving web sites
ia documentation
ia
Archiving web sites
Kinda sick of looking through my browser history trying to find a keyword, and getting ridiculously stupid results.
It's great to see mention of your "as a service" offering, which is something I was looking at for something I'm involved in. I'm convinced that your software is a great fit for our needs, but as a small, volunteer-driven, non-profit group, we almost certainly can't afford the SaaS.
Is there any chance you would consider open sourcing your service software?
Archiving web sites
I think you're being unfair to yourselves. :) Most, if not all of the archive.org magic sauce is basically public. The hard work is connecting all the pieces together and making them work reliably, on the long term. That's your achievement, and it's amazing. I do encourage people to free their software for others to use, but I know that, in practice, it's not always meaningful or useful, especially for old codebases with who knows what inside... ;)
Archiving web sites
404 Not Found
Connection: close
Date: Mon, 08 Oct 2018 00:00:44 GMT
Server: nginx/1.13.11
Content-Length: 170
Content-Type: text/html
Client-Date: Mon, 08 Oct 2018 00:00:44 GMT
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /C=US/ST=Arizona/L=Scottsdale/O=GoDaddy.com, Inc./OU=http://certs.godaddy.com/repository//CN=Go Daddy Secure Certificate Authority - G2
Client-SSL-Cert-Subject: /OU=Domain Control Validated/CN=*.archive.org
Client-SSL-Cipher: ECDHE-RSA-AES128-GCM-SHA256
Client-SSL-Socket-Class: IO::Socket::SSL
Archiving web sites
I'm sorry to hear that! :) The reason why this is not covered more deeply is because the "WARC" approach worked for the site I tested against. It's true it might fail against single-page applications (SPA) - for this other tools are necessary. This is what I covered in the Future work and alternatives section. In particular, I would use the crocoite project for JavaScript-heavy sites, and it's what ArchiveTeam use for their "chromebot". It uses chrome in headless mode to browse the page, but I haven't tested it. It does seem to work for them however...
Just use Selenium
scrapbook
https://github.com/tahama/scrapbookq/blob/master/src/mani...
/usr/share/mozilla/extensions/{ec8030f7-c20a-464f-9b0e-13a3a9e97384}/tahama@163.com -> git repo
scrapbook
I briefly tried out the scrapbookq a while ago and it does not seem to be as useful and good as the original XUL based scrapbook.
I'm not even sure, whether it is good enough to have it in Debian at all.
For now, I will stay with Firefox 52.9.0.
Maybe the best solution would be a web server, that can store pages on request.
Completely independent of Firefox, and even compatible with w3m...
It must run on ones own notebook computer for my use case.
Any ideas and code welcome!
Just use Selenium
Not to complain about the Internet Archive, I run the simplest static website exposing some directories with lighttpd.[...] Just click on "allwinner", it basically did not manage to maintain the links.
That's a bug, I guess. Note that if you append a slash, the links work fine. But when you crawl down, it's true that some files are missing.
Archiving web sites
Hi,
pywb should now work with wget warcs + other options
pip install -U warcio to get the latest.
-Ilya (Webrecorder Lead Dev)
from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http

with capture_http('example.warc.gz'):
    # request all urls to be loaded
    requests.get('https://example.com/')
    requests.get('https://google.com/')
http://localhost:8080/example/record/http://example.com/. You can enable this by running wayback --live --record
More info on this in the pywb docs.
Archiving web sites
$ archivebox init
$ archivebox add 'https://example.com'
