
Migrating the Internet Archive to Kubernetes

By Jake Edge
January 2, 2019

KubeCon NA

The Internet Archive (IA) has been around for over 20 years now; many will know it for its Wayback Machine, which is an archive of old versions of web pages, but IA is much more than just that. Tracey Jaquith said that she and her IA colleague David Van Duzer would relate a "love/hate, long adventure story—mostly love" about the migration of parts of IA to Kubernetes. It is an ongoing process, but they learned a lot along the way, so they wanted to share some of that with attendees of KubeCon + CloudNativeCon North America 2018.

Jaquith has been with IA for 18 years; she started when IA did, but left for four years and then came back. Van Duzer is a more recent addition, joining IA about a year and a half ago; he works on the web crawling process that feeds the Wayback Machine. Van Duzer said that IA has been around since the beginning of the web and, over that time, has created a daunting pile of code that he has now started to become comfortable with. At this point, IA is "dipping its toes" into the Kubernetes world; any big change like that is going to need to be sold to colleagues, pain points will need to be worked out, and so on. In order to do that, they needed to answer the question: "what's in it for us?"

Where does Kubernetes fit?

IA has been using Docker for a while; there are ways to package up the "PHP monolith" into a Docker image. Docker has many advantages that are well known, but for Van Duzer the most interesting thing about it was that it "enforced a constrained model of how to deploy things". It forced him to learn a new way to deploy services that would ultimately make them more scalable.

[Tracey Jaquith & David Van Duzer]

Rather than "get into the weeds" of technical objections and other problems that people might have with Kubernetes, they stepped back and took a high-level look at what a library (or archive) is and does. Basically, a library gathers a bunch of stuff that it wants to preserve forever, but it also wants to get it in the hands of people right away, Van Duzer said. They wanted to figure out what part of that mission was the most ready for a transition to Kubernetes.

The Wayback Machine contains more than 300 billion web pages, but it runs as part of a larger platform that has scanned books, Creative Commons videos, emulators for old software and video games, audio, and so on. Curating all of those different kinds of content has created a number of different "one-off processes" that might be a good fit for migrating to Kubernetes. But if those processes are truly one-off projects to ingest a particular type of content, Kubernetes would just be adding another layer to maintain, so they looked further.

The storage system is what preserves the archived content; it has gathered more than 50PB over its 22 years. IA is on its fourth-generation storage system at this point and it is leery of handing off the responsibility for that data to "the cloud"; it is and will remain self-hosted. There is a bias toward simplicity at IA, and there is a lot of skepticism about moving all that data into some new system. The data is kept as simple files in directories with the metadata stored next to the files on "boring block devices"; replication can be done with rsync. IA has been burned in the past with things like RAID and distributed filesystems, so it likes to keep things simple, Van Duzer said.

So that led to the third part of what IA does, circulating the content, as the "Goldilocks option"; the front-end application that serves four million visitors per day might fit well with the horizontal scaling provided by Kubernetes. But it turned out that even that was a little too daunting to bite off all at once.

Working on the front-end

Jaquith reiterated the diversity and volume of content that IA stores and provides to visitors. The idea was to explore how Docker and Kubernetes could make IA easier to run and maintain on their own infrastructure. IA is housed in a former church in San Francisco, she said, which has an eye-opening amount of power coming into it.

Docker was first used at IA in late 2014 for an audio-fingerprinting project. It was a bit frustrating to her that the use of Docker was imposed from above. In late 2015, she sat in on a Docker talk at MozFest, fully expecting to hate it, but: "I loved it and it just blew my mind". That led her to want to start using Docker more.

By 2016, the processing that handles conversion between various formats (e.g. PDF for scanned books, MP3 for audio, MP4 for video) had been changed to use Docker. IA ends up with content in all kinds of formats, so there is a need to convert them in various ways. Putting that processing into containers reduced security concerns from handling those formats a bit; overall, "it was a big win for us".

The GitLab announcement of Auto DevOps (which uses Kubernetes) in mid-2018 really accelerated the move toward Kubernetes. IA already used GitLab in its infrastructure, so this provided an easy path to adding Kubernetes to its existing GitLab workflows. She and Van Duzer gave an internal talk on Auto DevOps in July; there was a "lively discussion" at the end of that, but the outcome was positive.

Things started picking up quickly after that. In August, the IA test phase of the pipeline was migrated and by September the full pipeline was working in Auto DevOps. This allows developers, contractors, and volunteers to use the Review Apps feature, which creates a full, functioning web site for testing whenever a branch is pushed to the repository.

In October, two web applications that are associated with the book-scanning process were added to Kubernetes. Jaquith said that a volunteer was working on them and wanted to learn Kubernetes; over a weekend, he got them both working in Kubernetes. Van Duzer said that he had a bit of a different take; the volunteer wasn't so much excited about Kubernetes as he was "fed up with deploying yet another VM, deploying yet another application". Jaquith agreed that was another way to look at it.

The previous week (early December 2018) saw the transition of the dweb.archive.org code to containers and into Kubernetes. That is a decentralized version of IA that uses things like WebTorrent and IPFS to access archive content stored all over the internet.

Logjams and breakthroughs

There is a lot of inertia, resistance, and skepticism within organizations that can be hard to overcome, especially with bigger changes, Jaquith said. One of the things that helped break that logjam at IA was when she and Van Duzer teamed up to start pushing it at the end of 2017. She is from a development-heavy background, while Van Duzer has an operations-heavy history; when their colleagues saw them both pushing Kubernetes from their particular angles, it made a big difference.

It would be "easy to not emphasize enough" how the Auto DevOps feature helped smooth the path of Kubernetes at IA, Jaquith said. Since GitLab was already being used in-house, it was easy to argue that "all we're going to do is extend this homegrown pipeline to a full pipeline", which would bring benefits in auditing and other areas. Prior to Auto DevOps, IA was using the regular Docker registry, Van Duzer said, which made him uncomfortable because there is no real audit trail or authentication of the Docker images. "GitLab just took care of all of that"; he can now trust that an image he pulled down came from a specific developer so he knows what is inside the image.

Since IA is self-hosting, it is probably not a surprise that it is running its own Kubernetes cluster in-house, Jaquith said. IA uses kubeadm to create its cluster. The kre8 tool is what IA has developed to automate the process of creating the cluster and setting up the Auto DevOps continuous integration/continuous deployment (CI/CD) pipeline. It requires a VM with SSH access and root privileges; it targets Ubuntu, but should work for most distributions, she said. It will allow coworkers and others, such as the book-scanning web application volunteer, to try out the full Kubernetes experience without making any real commitment.

She demonstrated the kallr command that comes with kre8. It is a top-like monitor for the Kubernetes cluster that is particularly useful for the GUI-averse. "Old school" Unix people tend to like terminal-based tools, she said. kallr plucks interesting information using kubectl; it updates every second and highlights changes that are happening, which is useful to spot problems in the cluster.

It is important to include both development and operations in the discussions about what pieces will be migrated, she said. Coming up with common goals will help smooth things over and "avoid bad blood". For example, in the process of coming up with common goals at IA, the piece that she was most interested in working on turned out not to really help anyone, so it was shelved.

Prior to the addition of Review Apps, there was a hand-rolled way to test changes to the IA system. By using Nginx rewrite rules, traffic to www-NAME.archive.org would end up at the user's IA home directory and serve the version of the code living there. But there are some problems with that technique; for one thing, it only allows for a single destination, so working on multiple branches is tricky. In addition, bringing in outsiders, such as contractors or volunteers, meant that they needed an IA home directory. "It was feeling a little outgrown", Jaquith said.
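A minimal sketch of that kind of per-user rewrite might look like the fragment below; the hostname pattern, directory layout, and `petabox` path are illustrative guesses, not IA's actual configuration:

```nginx
# Map www-NAME.archive.org to a checkout in NAME's home directory.
# Hostname pattern and paths are hypothetical, for illustration only.
server {
    listen 80;
    server_name ~^www-(?<devuser>\w+)\.archive\.org$;

    # Serve that developer's working copy of the site.
    root /home/$devuser/petabox/www;
    index index.php;

    location / {
        try_files $uri $uri/ /index.php?$args;
    }
}
```

One server block handles every developer hostname, but each name can point at only one directory per user, which is the single-destination limitation mentioned above.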

The Review Apps feature has changed all that. Outsiders can work and test the full site without needing IA credentials or access. The head of web sites for IA was asked in a meeting what he thought about Kubernetes and he said that it was a "game changer", Jaquith said. No staff needs to approve changes that someone is trying out and if they are in a wildly different time zone, there is no longer a lengthy feedback cycle because IA staff does not need to do anything to allow them to test their changes.

She put up a slide showing the IA pipeline; it is fairly standard, though the test phase is before the build phase, which is unusual. That part would be addressed later in the talk, but the pipeline continued with a review phase, which is where the reviewable versions of the site are created. If they are rejected at that stage, it is all just cleaned up.

Crawling

All of those changes have made a huge difference to the front-end developers, Van Duzer said, but he still wanted to test the theory that this could make other kinds of application development easier.

The web crawling that IA does is pretty straightforward: you start with a list of web sites, fetch those, extract URLs from them, rinse, and repeat. Back in 2003, IA developed Heritrix, which is an open-source web crawler written in Java, but it is meant to be run on a single system. It is quite good at pipelining the URL-fetching process, but it doesn't coordinate with other instances. That led to something called "CrawlHQ", which has multiple Heritrix instances sharing a single queue of URLs to ensure that none are crawled multiple times.
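The fetch/extract/deduplicate cycle described above can be sketched in a few lines of Python. This is only an illustration of the idea, not IA's code: the `fetch` callable is a stand-in for an HTTP client, and a real distributed crawler would use a shared work queue and a persistent store of seen-URL hashes (roles that Kafka and FoundationDB play in the CrawlHQ rewrite) rather than in-process data structures.

```python
import hashlib
import re
from collections import deque

def crawl(seeds, fetch, limit=100):
    """Breadth-first crawl: fetch pages, extract links, skip URLs already seen.

    `fetch` is a stand-in callable mapping a URL to its HTML text; a real
    crawler would do an HTTP GET and handle errors, timeouts, and robots.txt.
    """
    seen = set()     # hashes of URLs already queued (stand-in for a shared store)
    queue = deque()  # work queue (stand-in for a shared/partitioned queue)

    def enqueue(url):
        h = hashlib.sha1(url.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            queue.append(url)

    for url in seeds:
        enqueue(url)

    fetched = []
    while queue and len(fetched) < limit:
        url = queue.popleft()
        html = fetch(url)
        fetched.append(url)
        # Crude link extraction; a real crawler would parse the HTML properly.
        for link in re.findall(r'href="([^"]+)"', html):
            enqueue(link)
    return fetched
```

Hashing the URLs, rather than storing them verbatim, keeps the "have we seen this?" check a fixed-size lookup no matter how long the URLs get.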

He decided to try to figure out how to deploy that in Kubernetes. He effectively rewrote CrawlHQ but, instead of 38,000 lines of code, it became 73 lines of Python. That was possible because it used lots of other people's code, he said: Apache Kafka for partitioning the work queue and FoundationDB for storing the hashes of the URLs seen, for example. He could have simply run the code in some processes in a few different VMs, as he has done in the past. But once the Kubernetes infrastructure was up and running, he could simply deploy it to the cluster. He can easily scale the throughput of the crawler by using the GitLab REPLICAS variable; he can adjust that according to his needs "without even thinking about it".

Some tips

The IA Git tree is 4.8GB in size, Jaquith said, and the pipeline is creating multiple 6-9GB Docker images. This is why the test phase is done before the build phase—it took too long for developers to be able to see and test their changes if they waited for the build of all the Docker images and such. The test phase is built using a shallow clone of the Git repository with the changes layered on it. The Auto DevOps feature does a full Git clone for each stage in the pipeline, which does not work well for IA. It is something that may change down the road, she said.

In order to avoid the penalty of full-repository checkouts, base Docker images are made of the environment every 1-4 weeks and then used as a basis for all development and testing. New code is added to the shallow clone by fetching, then fast-forwarding Git to a specific commit ID or branch name. It was "surprisingly difficult" to work out how to do that fast forward and she suggested that others who might also want to do it take a picture of the slide with the commands. The slides are available, as is a YouTube video of the presentation (skip to 25:15 for the slide and explanation).

Jaquith then went through some tips on using Kubernetes and GitLab. She suggested using local storage when running Kubernetes for simple tests like on a laptop; it is much easier to configure. GitLab comes with an empty PostgreSQL database installed as part of the Auto DevOps feature, but if you aren't using it, it can be disabled to eliminate an extra persistent volume at each step of the pipeline. In addition, the GitLab API provides access to nearly everything you can see in the web interface, which makes it easy to write scripts to check and automate various things.
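As a sketch of that kind of scripting against the standard GitLab v4 REST API, the snippet below lists a project's failed pipelines. The instance URL and project ID are placeholders, and the token is assumed to be in the environment; only the documented `/projects/:id/pipelines` endpoint and `PRIVATE-TOKEN` header are used.

```python
import json
import os
import urllib.request

GITLAB = "https://gitlab.example.org"  # placeholder GitLab instance
PROJECT_ID = 42                        # placeholder project ID

def pipelines_url(base, project_id, status=None):
    """Build the GitLab v4 API URL for listing a project's pipelines."""
    url = f"{base}/api/v4/projects/{project_id}/pipelines"
    if status:
        url += f"?status={status}"
    return url

def list_failed_pipelines():
    """Fetch failed pipelines; needs a token with API read access."""
    req = urllib.request.Request(
        pipelines_url(GITLAB, PROJECT_ID, status="failed"),
        headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A cron job wrapping something like this can flag broken pipelines without anyone having to click through the web interface.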

The plan is to move the IA production web site to Kubernetes in the near future ("knock on wood"), she said. Right now, there is a static list of VMs that is used by HAProxy to do load balancing. What they would like to have is the demand-based elastic scaling possibilities that come with Kubernetes.

[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Seattle for KubeCon NA.]

Index entries for this article
Conference: KubeCon NA/2018



Migrating the Internet Archive to Kubernetes

Posted Jan 2, 2019 22:52 UTC (Wed) by jccleaver (guest, #127418) [Link]

It seems like there might be a few wins there -- abstracting complexity into platforms is usually a good thing -- but the overall vibe that comes across here is an evangelization without any particularly good justification. Simplicity is more and more key the lower and lower into the infrastructure one gets, or the more fundamental the data is that you're dealing with (IMHO whoever suggested the INTERNET ARCHIVE storage should be handed off to some public cloud system really ought to be booted off the project).

The point of an archive is to persist when the originals fail. "Hack once, break everywhere" (or, more likely, "typo once, obliterate everywhere") possibilities traded for a buzzword-compliant CI process should be treated with a heavy dose of skepticism.

Migrating the Internet Archive to Kubernetes

Posted Jan 3, 2019 14:02 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link] (7 responses)

> IA is housed in a former church in San Francisco

Tell me about your DR plan.

Migrating the Internet Archive to Kubernetes

Posted Jan 3, 2019 18:34 UTC (Thu) by jccleaver (guest, #127418) [Link]

> > IA is housed in a former church in San Francisco
> Tell me about your DR plan.

A Black Sabbath memoriam basement?

Migrating the Internet Archive to Kubernetes

Posted Jan 3, 2019 18:59 UTC (Thu) by jgknight (guest, #108323) [Link]

This blog post is a few years old, but it does go over how they replicate the data onto another drive "usually" in another datacenter and that both copies are active and available. I also found an older post from 2014 that claimed they have 4 datacenters.

https://blog.archive.org/2016/10/25/20000-hard-drives-on-...
https://archive.org/web/petabox.php

Migrating the Internet Archive to Kubernetes

Posted Jan 4, 2019 10:17 UTC (Fri) by ale2018 (guest, #128727) [Link] (1 responses)

An e-mail message I received on Tue, 29 Nov 2016, claiming to be from "Brewster Kahle, Internet Archive" (actually signed and sent by relevantblue.com) was saying:

[...] So this year, we have set a new goal: to create a copy of Internet Archive's digital collections in another country. We are building the Internet Archive of Canada because lots of copies keeps stuff safe. This will cost millions. You are one of the special people who keep the Internet Archive going in a big way. Will you help sustain this non-profit library by making a tax-deductible donation today?
On November 9 in America, we woke up to a new administration promising radical change. [...]
Followups on that campaign were blogged in an FAQ about the Internet Archive of Canada in December that year, including archived videos of Donald Trump on freedom of the press. After these two years, I'm wondering if Canada is far enough... Let's see how they'll do with the 5G Huawei stuff.

Migrating the Internet Archive to Kubernetes

Posted Jan 4, 2019 10:20 UTC (Fri) by ale2018 (guest, #128727) [Link]

(that was meant to be a reply to smitty_one_each)

Migrating the Internet Archive to Kubernetes

Posted Jan 4, 2019 14:57 UTC (Fri) by jhhaller (guest, #56103) [Link] (1 responses)

Given the cost of real estate in San Francisco, and the cost of electricity in California, this does not seem to be the most cost-effective place to house a data center. A commercial data center would be paying real estate taxes, and sales tax on electricity and equipment, so I understand how a charitable foundation owning its own data center would be considered cost effective. An old church does not seem likely to be well hardened for earthquakes, and that's bound to be factored into casualty insurance rates for the cost of all the equipment. Insurance should be a big factor in a DR plan, to replace equipment and lease temporary replacements/space until a permanent replacement is available.

Migrating the Internet Archive to Kubernetes

Posted Jan 4, 2019 17:00 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

> An old church does not seem likely to be well hardened for earthquakes

Wouldn't it just being an old (enough) church prove that it is hardened against earthquakes? I'd expect most old earthquake-unsafe buildings have been shaken out of the market by now.

Migrating the Internet Archive to Kubernetes

Posted Jan 4, 2019 15:46 UTC (Fri) by karkhaz (subscriber, #99844) [Link]

Joke's on you. DR means "Disaster Recovery" to the naïve, but the Internet Archive is instead going for Divine Resilience against earthquakes.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds