LWN: Comments on "Merkle trees and build systems"

apt2ostree vs rpm-ostree

fencekicker — Mon, 20 Jul 2020 13:36:25 +0000

apt2ostree sounds similar to rpm-ostree - creating the image out of packages, though it's DEBs, not RPMs. I assume the same limitations as with rpm-ostree apply? %pre/%post scripts sometimes not working as they should, you can't rely on them running on the real target etc.

We build a custom distro based on CentOS atomic host, and discovered that some things won't work in %pre/%post scripts - e.g. if you want to copy some files around, that won't work; I think 'systemctl enable <service>' works, and I'm not sure about adding users or groups. Obviously, if you want to support multiple platforms, the package scripts won't help, because they will not run on the real target.

The way I see it, rpm-ostree is a very interesting project, but it breaks quite a few expectations that came with creating RPM packages, so it feels rather kludgy in this respect. I don't think people are eager to rewrite their RPMs to handle the rpm-ostree workflow.

Merkle trees and build systems

pabs — Wed, 17 Jun 2020 11:36:42 +0000

This seems kind of similar to ostree, have you considered just using that?

Merkle trees and build systems

cyphar — Wed, 17 Jun 2020 05:28:21 +0000

Well, Docker has *finally* implemented OCI image support (which just boiled down to supporting the metadata blobs, since the layer blobs were designed to be identical between the Docker v2.2 image format and the OCIv1 image specification). So it's entirely possible they'll support the extensions we're working on, but I wouldn't hold out hope that it would be a quick transition. containerd and podman/cri-o will probably pick them up faster (though I think podman/cri-o will require more extensive changes to their storage internals since they're based on Docker's graphdriver code).

Merkle trees and build systems

Cyberax — Wed, 17 Jun 2020 01:54:21 +0000

All the ideas there are great!

But will Docker (or Moby or whatever they'll be called in a week) implement them?

Merkle trees and build systems

cyphar — Wed, 17 Jun 2020 01:02:58 +0000

We are currently going through a more formalised specification process to hopefully get a properly specified version of the scheme I outlined in my talk. While the final scheme might not be the same as the one I outlined (which should be unsurprising given I hacked it together pretty last-minute), the general design should be similar. Unfortunately it will certainly be some time before we can point to production users of such a system.

Merkle trees and build systems

Cyberax — Tue, 09 Jun 2020 03:27:34 +0000

Mostly because they _pretend_ to be declarative descriptions of the resulting image, while introducing subtle non-reproducible bugs.

Just take a typical Dockerfile from Github: https://github.com/wurstmeister/kafka-docker/blob/master/... - this is a random example from using their code search function.

You can see that it does: "apk add --no-cache bash curl jq docker" - basically installs the most recent available version of packages, without any notion of "lockfiles".

Merkle trees and build systems

bergwolf — Tue, 09 Jun 2020 03:16:49 +0000

> I'm really disgusted by Dockerfiles

Could you elaborate a bit why you dislike Dockerfiles?

Merkle trees and build systems

pabs — Mon, 08 Jun 2020 12:30:37 +0000

Some details are in this bug:

https://github.com/restic/restic/issues/2446

Merkle trees and build systems

mathstuf — Mon, 08 Jun 2020 12:19:07 +0000

Are they always stored as a single object then I assume? I wonder if statistics on how large directory blobs are in a repository could be made. I doubt they tend to approach normal chunk sizes often which means that, statistically, you're unlikely to find a chunk boundary in a directory blob in the first place.

Merkle trees and build systems

pabs — Mon, 08 Jun 2020 04:07:04 +0000

Ah, I see why it seemed familiar, the speaker mentions (31:40) that he stole most of the design for OCIv2 from restic.

Merkle trees and build systems

pabs — Mon, 08 Jun 2020 03:44:22 +0000

Website and git for the project:

https://umo.ci/ https://github.com/openSUSE/umoci

Merkle trees and build systems

pabs — Mon, 08 Jun 2020 03:38:51 +0000

This talk reminds me of how modern backup systems like restic and borg store filesystems; similar to git but without the commit hash chain (just independent snapshots) and with an additional layer of splitting files into chunks using rolling hashes.
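
To make the "splitting files into chunks using rolling hashes" part concrete, here is a toy sketch in Python. It is not the algorithm restic or borg actually use; the window size, mask and minimum chunk size are arbitrary values picked for illustration. The point is only that chunk boundaries depend on nearby content, so inserting bytes near the start of a file leaves most later chunks unchanged.

```python
import hashlib
import random

def chunks(data, window=16, mask=0x3F, min_size=32):
    """Split data at positions chosen by a rolling hash of the last `window` bytes."""
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte that leaves the window
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= window:
            h = (h - data[i - window] * top) % mod  # drop the oldest byte
        h = (h * base + byte) % mod                 # add the newest byte
        if i + 1 - start >= min_size and (h & mask) == 0:
            out.append(data[start:i + 1])           # boundary found: emit a chunk
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])                    # trailing partial chunk
    return out

def digests(data):
    """Content addresses of all chunks, as a set for easy comparison."""
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

# Made-up test data: random bytes, then the same bytes with a short prefix added.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = b"a few new bytes" + original
shared = digests(original) & digests(edited)
print(len(shared), "of", len(digests(original)), "chunks are reused")
```

With fixed-size blocks the same edit would shift every block and nothing would be reused, which is why backup tools prefer content-defined boundaries.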

Sadly the restic storage design misses out splitting directories into chunks of filenames, which means that there is some inefficiency around directories with many files in them.

I wonder when git is going to adopt the file chunking stuff.

Merkle trees and build systems

mathstuf — Sun, 07 Jun 2020 12:10:52 +0000

I wonder if anything ever came of this: <https://www.youtube.com/watch?v=bbTxdzbjv7I>. Sorry, I don't have a repo link for it.

Merkle trees and build systems

Cyberax — Sat, 06 Jun 2020 06:08:19 +0000

I'm aware of Packer, but it's a bad solution. It doesn't produce "layered" images, so you're stuck with giant tar files.

For Docker caching to properly work, you basically need to do content-based addressing for its layers. I'm actually looking at OSTree and it seems eminently doable, I might actually take a stab at it.

Merkle trees and build systems

rgh — Sat, 06 Jun 2020 05:58:50 +0000

I don't know if there's anything based on OSTree but as a practical solution have a look at Packer from Hashicorp (https://packer.io). Its sole purpose is to build images and it does it really, really well.

Merkle trees and build systems

rgh — Sat, 06 Jun 2020 05:03:07 +0000

a giant pile of poo

To use the technical term!

Merkle trees and build systems

nim-nim — Fri, 05 Jun 2020 10:09:17 +0000

apt and rpm do not deal with just a set of resulting files, they deal with composition rules, build rules, and post-install file munging, because real-life software uses things like indexes that can not be computed before the things the index indexes are actually deployed

Merkle trees and build systems

nim-nim — Fri, 05 Jun 2020 10:03:10 +0000

Well, rpm and apt systems implement all this from archives (cpio or ar) so it’s perfectly practical and has been done for a long time now. Maybe ostree makes it more practical, maybe not. It’s hard to compare a mature ecosystem that had to solve the 20% of make-it-fully-work problems that take 80% of implementation energy with an ecosystem that implemented the 80% easy part and has not had to deal with the 20% hard part yet (and does not have the warts and scars resulting from this part).

And I strongly suspect that a lot of the parts where you would find existing systems inefficient, are inefficient because rpm and apt systems have to deal with the real world, where code maintenance and ownership is distributed, and you do not have a single dev entity owning the whole codebase that can do whatever it wants at all stages of the build in its own custom (ostree) sets.

From this POV the article (IMHO) mistakes the convenience of a single unified BSD-style build tree with the convenience of ostree itself. Unified build trees *are* definitely more convenient, they just do not scale to the messiness of real life dev organization structures.

Anyway, I did write that the result looked convenient, so no criticism of the ostree implementation on my part, just reacting to people that implied ostree invented hot water.

Merkle trees and build systems

pabs — Mon, 01 Jun 2020 05:01:49 +0000

Debian also provides "product" images in the form of Debian Live.

Some folks are also thinking about applying things like ostree or squashfs overlays in order to provide upgradable Debian "appliances".

Some folks are also looking at converting Debian binary packages to Debian "apps" using AppImage/Flatpak.

Merkle trees and build systems

pabs — Mon, 01 Jun 2020 04:57:57 +0000

For apt, setting the APT_CONFIG environment variable can entirely separate an apt config from the system config. Take a look at how chdist (from devscripts) or apt-venv do it. Unfortunately there isn't yet a way to tell a dpkg binary to not look at the system config with an equivalent DPKG_CONFIG variable, so things like mmdebstrap's sub-Essential mode break if you have packages that install dpkg hooks.

Merkle trees and build systems

Cyberax — Sun, 31 May 2020 22:41:09 +0000

I'm very intrigued by OSTree. Is there any work to make it possible to build OCI container images using it?

I'm using Docker (just like pretty much everybody these days) and I'm really disgusted by Dockerfiles. It would be nice to replace them with something better. It's already fairly easy to do by simply tar-ing the target image and importing it, but this loses the "layer" structure of Dockerfiles and negates all the caching advantages. It looks like OSTree can be a perfect fit there.

Merkle trees and build systems

MrWim — Sun, 31 May 2020 20:44:15 +0000

> Hmm, I was curious whether the "unpack the .deb package" in the apt2ostree context really meant unpacking (dpkg --unpack) or extracting (dpkg-deb --extract or --raw-extract), so went to check the code at https://github.com/stb-tester/apt2ostree/blob/master/apt2.... What's in there is not really pretty TBH. :)

To get the apt2ostree build rules to work right, it needs to know exactly what the inputs and outputs are. I found this difficult to work out with dpkg and apt so I ended up copying the multistrap approach and doing the unpacking and creating the dpkg database myself. In particular with apt I was worried about how the local system configuration of apt would affect the creation of these chroots.

Not running preinst has worked out ok for us so far. I guess because we're never doing `apt upgrade`, we always build a new tree from scratch.

> While many of the hardcoded assumptions might appear to work, perhaps better now that we have been improving the pseudo-essential package set when it comes to installation bootstrapping (see https://wiki.debian.org/Teams/Dpkg/Spec/InstallBootstrap)

This seems like really good news. Thanks for sharing.

> these cannot be generally applied to any package in distributions like Debian (or Ubuntu). Things like not running the preinst, or the bare passwd databases, or only handling a hardcoded list of control files, etc. In any case I guess apt2ostree could already further benefit from some of the things that we have been working on with the installation bootstrap improvements.

I guess it's worked for us because we're not trying to build a box of parts OS like Debian, we're building an embedded product, so we know that the thing that we test is the same as what our customers will be using, whereas with Debian you don't know what combination of packages your users will be using.

On the subject of apt2ostree - I think the key innovation in it is apt lockfiles. I talk a bit about it in the README: https://github.com/stb-tester/apt2ostree#lockfiles . This allows us to do all the apt bits, and store the result in source control, then all the dpkg bits that happen during the build should be relatively deterministic. It also means that an apt upgrade is explicit and visible in the source repo for the rootfs image.
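
For readers who haven't met the lockfile idea, here is a rough Python sketch of the principle. It is not apt2ostree's actual lockfile format (see the README for that); the JSON layout, the function names `make_lock` and `verify`, and the package versions and contents are all invented for illustration. Resolution happens once and is committed to source control; the build then only accepts packages that match the recorded version and checksum.

```python
import hashlib
import json

def make_lock(resolved):
    """resolved maps package name -> (version, raw .deb bytes).

    Returns lockfile text that is stable for identical inputs, so an
    upgrade shows up as a small, reviewable diff in source control.
    """
    entries = {
        name: {"version": version, "sha256": hashlib.sha256(blob).hexdigest()}
        for name, (version, blob) in resolved.items()
    }
    return json.dumps(entries, indent=2, sort_keys=True)

def verify(lock_text, name, version, blob):
    """True only if the package matches what the lockfile recorded."""
    entry = json.loads(lock_text).get(name)
    return (
        entry is not None
        and entry["version"] == version
        and entry["sha256"] == hashlib.sha256(blob).hexdigest()
    )

# Hypothetical packages; the bytes stand in for real .deb files.
lock = make_lock({
    "bash": ("1.0-1", b"fake bash deb"),
    "curl": ("2.0-1", b"fake curl deb"),
})
print(verify(lock, "bash", "1.0-1", b"fake bash deb"))   # matches the lock
print(verify(lock, "bash", "1.0-2", b"newer bash deb"))  # rejected until the lock is updated
```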

Merkle trees and build systems

MrWim — Sun, 31 May 2020 20:24:56 +0000

> That said, there are a few tricks one can use, such as having multiple repositories, and then one could implement GC by pulling recently-used refs from one into a new repo (which is really just hardlinking, so pretty cheap), then delete the old repo and move it into place. We could probably add this as a primitive into OSTree itself - it'd make GC cost closer to O(data preserved) and not O(data). Could also amortize by having multiple repos that have different subsets of refs and prune them at different points, re-importing whatever canonical data as needed.

I've been thinking about this and I'm having difficulty seeing how it can be cheaper (at least theoretically).

GC with making new repo

Walk the trees finding objects to keep - O(metadata preserved) syscalls, O(objects preserved) operations
New hard-link for each one - O(data preserved) syscalls
List all objects in old repo deleting each one - O(total objects) syscalls, O(total objects) operations

Theoretical performance of GC with current system

Walk the trees finding objects to keep - O(metadata preserved) syscalls, O(objects preserved) memory, O(objects preserved) operations
List all objects in old repo deleting each one not preserved - O(total objects) syscalls, O(total objects) operations

It seems to me that the two are the same, except that the new-repo approach uses hardlinks in the filesystem to mark an object, while the current system could use a hashmap in memory. Unless there's some way of deleting a directory recursively that is cheaper than iterating over all the files, I can't see how the new-repo approach could be cheaper.
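
A toy model of the two strategies in Python, to show where the work goes. This is not OSTree code: the repo is just a dict mapping an object id to the ids it references, and the counters stand in for syscalls. Both variants walk the preserved objects and both end up touching every object in the old repo.

```python
def reachable(objects, roots):
    """Mark phase: walk from the refs and collect every object to keep."""
    seen, stack = set(), list(roots)
    while stack:
        oid = stack.pop()
        if oid not in seen:
            seen.add(oid)
            stack.extend(objects[oid])
    return seen

def gc_in_place(objects, roots):
    """Current approach: the keep-set lives in memory."""
    keep = reachable(objects, roots)
    stats = {"listed": 0, "unlinks": 0}
    for oid in list(objects):          # list every object in the repo
        stats["listed"] += 1
        if oid not in keep:
            del objects[oid]           # stands in for unlink()
            stats["unlinks"] += 1
    return objects, stats

def gc_new_repo(objects, roots):
    """New-repo approach: the keep-set is the set of hardlinks in the new repo."""
    new_repo = {oid: objects[oid] for oid in reachable(objects, roots)}
    stats = {"links": len(new_repo), "listed": 0, "unlinks": 0}
    for oid in list(objects):          # deleting the old repo still visits everything
        stats["listed"] += 1
        del objects[oid]
        stats["unlinks"] += 1
    return new_repo, stats

def sample_repo():
    # Two commits sharing one file; only c1 is still referenced.
    return {"c1": ["t1"], "t1": ["f1", "f2"], "f1": [], "f2": [],
            "c2": ["t2"], "t2": ["f2", "f3"], "f3": []}

kept_a, stats_a = gc_in_place(sample_repo(), ["c1"])
kept_b, stats_b = gc_new_repo(sample_repo(), ["c1"])
print(stats_a, stats_b)
```

In this model the new-repo variant actually does more unlinks, since the preserved objects are linked into the new repo and then unlinked from the old one.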

I've not personally had a problem with ostree gc, but I've found in the past that one of the causes of ostree performance problems is that the GFile interface makes it difficult to reason about what os-level operations will occur when you make a method call. For example when I was implementing #1643 I found that it was much faster to work with the GVariants directly than the OstreeRepoFile* interface.

It's not quite done yet, but I've been working on something that might make working with the GVariants directly somewhat more convenient: https://github.com/wmanley/gvariant-rs/blob/master/examples/ostree-ls/src/main.rs

Merkle trees and build systems

MrWim — Sun, 31 May 2020 20:01:03 +0000

You're referring to maintainer scripts and the like? With apt2ostree we store both the data and the metadata in separate trees. When we come to combine them together we try to reconstruct a dpkg database that would result from all these packages being unpacked into a chroot.

I think of it being a bit like map-reduce, where you design the reduce step to be as cheap as possible, and the map can be expensive if you like because it is deterministic, cacheable and parallelizable.

Merkle trees and build systems

MrWim — Sun, 31 May 2020 19:53:27 +0000

This seems like a strange comment to make on an article specifically describing the advantages of making such a transposition.

The only difference between storing source-code in git vs storing it as source tarballs is storage and performance characteristics. But: it's exactly those differences that make git so much more useful than source tarballs. You interact with it and think about it differently once certain operations are cheap.

At the risk of restating things already stated in the article:

Deployment is now incremental and cheap - you only need to transfer the blobs that have changed, not the whole tarball. The effect is you don't need a separate deployment mechanism during dev vs prod.
You can store many many different versions of your built images in a reasonable amount of space - so you just stop worrying about what to keep and what the deletion policy should be
Your intermediate build steps now share storage with your final rootfses, so you don't need to worry about reducing the number of build steps or intermediate artefacts.
The cost of comparing two trees is now super cheap, so becomes a natural operation to do. Interested in how a particular change has affected your installed image? It's fine: we run a diff on the whole tree for every PR. This is particularly useful for changes to the build system itself.
Maybe you're performing some transformation on your tree that only depends on one subdirectory? That's fine: extracting a subdir is cheap and because it's a merkle tree you know if you need to rerun your build steps, because you can find the SHA of the subtree cheaply.
Composing a new tree out of subtrees or partial trees is super cheap, so you do so wherever convenient.

You could implement all of the above with tarballs, but it would be so impractical that you wouldn't. With Merkle trees it's natural. Whether it's innovative or not is irrelevant.
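
A small Python sketch of why comparing trees and finding the hash of a subtree are cheap. It is not OSTree's object format: a tree here is a nested dict, a file is a bytes value, and the hashing scheme is made up for the example. A real store keeps each node's hash on disk, so the comparison at each level is a lookup rather than the recomputation done here.

```python
import hashlib

def hash_node(node):
    """Hash a file (bytes) or a directory (dict of name -> node)."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"blob\0" + node).hexdigest()
    h = hashlib.sha256(b"tree\0")
    for name in sorted(node):
        # A directory's hash covers its children's names and hashes only.
        h.update(name.encode() + b"\0" + hash_node(node[name]).encode() + b"\n")
    return h.hexdigest()

def diff(a, b, prefix=""):
    """List changed paths, skipping any subtree whose hash is identical."""
    if hash_node(a) == hash_node(b):
        return []                      # whole subtree unchanged: stop here
    if isinstance(a, bytes) or isinstance(b, bytes):
        return [prefix or "/"]
    changed = []
    for name in sorted(set(a) | set(b)):
        path = f"{prefix}/{name}"
        if name not in a or name not in b:
            changed.append(path)       # added or removed entry
        else:
            changed.extend(diff(a[name], b[name], path))
    return changed

# Made-up rootfs contents.
old = {"usr": {"bin": {"sh": b"v1"}, "lib": {"libc.so": b"c"}},
       "etc": {"passwd": b"root"}}
new = {"usr": {"bin": {"sh": b"v2"}, "lib": {"libc.so": b"c"}},
       "etc": {"passwd": b"root"}}

print(diff(old, new))
# A build step that depends only on /etc can key its cache on this value:
print(hash_node(old["etc"]) == hash_node(new["etc"]))
```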

Merkle trees and build systems

Ericson2314 — Sat, 30 May 2020 17:14:41 +0000

The differences you talked about: hashing the content of intermediate build steps and using Merkle hashing for deduplication/incrementality have long been things we've wanted to try with Nix, as you've noticed.

What's new is that we're now doing it! The company I work for is working on IPFS and Nix, as described in https://discourse.nixos.org/t/obsidian-systems-is-excited... . The underlying changes should make it easy to support other hashing schemes like OSTree's.

I'm really glad to see you all are working on similar things---the shift in perspective from rules on files to rules on subtrees is huge and I hope to see it emerge and spread in as many ways as possible. Hopefully everyone can modularize and we'll have more interop / drop-in replacements and independent composition of networking and storage methods.

Happy to answer any questions about Nix / would love to compare notes on these sorts of things.

Merkle trees and build systems

nim-nim — Sat, 30 May 2020 13:26:04 +0000

A .deb/rpm is far more than an archive of built files that could be replaced with an ostree of built files

Merkle trees and build systems

nim-nim — Sat, 30 May 2020 13:20:11 +0000

It’s not especially innovative, BTW, it’s just using an uncompressed ostree instead of the compressed archive you find in traditional component systems (and then since people have been working on archiving ostrees to transfer them between systems you will ultimately get right back to the starting point).

The result certainly looks convenient but if it was more than an ostree transposition of how things were already done you would not have an apt2ostree in the middle of the article.

Merkle trees and build systems

guillemj — Sat, 30 May 2020 00:31:08 +0000

Hmm, I was curious whether the "unpack the .deb package" in the apt2ostree context really meant unpacking (dpkg --unpack) or extracting (dpkg-deb --extract or --raw-extract), so went to check the code at https://github.com/stb-tester/apt2ostree/blob/master/apt2.... What's in there is not really pretty TBH. :)

While many of the hardcoded assumptions might appear to work, perhaps better now that we have been improving the pseudo-essential package set when it comes to installation bootstrapping (see https://wiki.debian.org/Teams/Dpkg/Spec/InstallBootstrap), these cannot be generally applied to any package in distributions like Debian (or Ubuntu). Things like not running the preinst, or the bare passwd databases, or only handling a hardcoded list of control files, etc. In any case I guess apt2ostree could already further benefit from some of the things that we have been working on with the installation bootstrap improvements.

The general concept looks nice though. :)

Merkle trees and build systems

walters — Fri, 29 May 2020 19:08:16 +0000

GC with OSTree is indeed a problem with large numbers of refs (or really a large amount of data). The design was mostly focused on the base operating system case where this isn't a problem.

That said, there are a few tricks one can use, such as having multiple repositories, and then one could implement GC by pulling recently-used refs from one into a new repo (which is really just hardlinking, so pretty cheap), then delete the old repo and move it into place. We could probably add this as a primitive into OSTree itself - it'd make GC cost closer to O(data preserved) and not O(data). Could also amortize by having multiple repos that have different subsets of refs and prune them at different points, re-importing whatever canonical data as needed.

(There's a hugely interesting sub-topic here around whether OSTree is a *cache* of something like a .deb/rpm or whether it's canonical, i.e. your build system outputs it)

Merkle trees and build systems

estansvik — Fri, 29 May 2020 14:36:37 +0000

Ah, right.

Merkle trees and build systems

drothlis — Fri, 29 May 2020 10:34:48 +0000

I can't speak about Nix with any authority because I haven't used it, but here are some thoughts:

Nix's Merkle tree structure, as described by dezgeg's comment, is capturing the *dependencies* of a package, rather than the *contents* of a single package.

In Nix, as far as I can tell, there's no sharing/deduplication of individual files across different packages or different versions of the same package. Section 7.5 of the Nix PhD thesis (which was provided by civodul in another comment) talks about this problem and provides some workarounds for making deployments more network-efficient —such as calculating binary diffs— but this problem simply doesn't exist if you store the contents as Merkle trees.

It also means that you can't share/reuse any work done by intermediate build steps *within* a package -- with Nix the granularity is at the level of each package.

According to the Build Systems à la Carte paper, Nix resolves transitive dependencies, so it only stores the hashes of the terminal inputs, ignoring intermediate dependencies (search for "Deep Constructive Traces" in the paper). For "cloud" build systems this means that you don't have to wait until intermediate targets are built before deciding if you need to build the target (because it may already be in the build farm's cache). The downside is that you can't have early cutoff: Say you've added a comment to a source file, if the compiled ".o" is unchanged then early cutoff means you can stop there because the final build artifact is going to be the same.
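
Here's a toy Python illustration of early cutoff. It is not how Nix or any real build system is implemented: the `Builder` class, the fake "compiler" that just strips comment lines, and the step names are all invented. Each step is cached on the hash of its direct input, so a comment-only change reruns the compile step but not the link step.

```python
import hashlib

class Builder:
    def __init__(self):
        self.cache = {}   # (step name, hash of direct input) -> output
        self.runs = []    # names of steps that actually executed

    def step(self, name, func, data):
        key = (name, hashlib.sha256(data.encode()).hexdigest())
        if key not in self.cache:
            self.runs.append(name)
            self.cache[key] = func(data)
        return self.cache[key]

def compile_(src):
    # Stand-in compiler: comments do not affect the "object file".
    return "\n".join(l for l in src.splitlines() if not l.lstrip().startswith("#"))

def link(obj):
    return "BIN:" + obj

def build(builder, src):
    """Keyed on intermediate results: allows early cutoff."""
    obj = builder.step("compile", compile_, src)
    return builder.step("link", link, obj)

def build_deep(builder, src):
    """Keyed only on the terminal input: no early cutoff, but no need to
    build the intermediate target before consulting the cache."""
    return builder.step("compile+link", lambda s: link(compile_(s)), src)

b = Builder()
out1 = build(b, "x = 1")
out2 = build(b, "# a new comment\nx = 1")
print(b.runs)   # link ran once; the second build stopped after compile

d = Builder()
build_deep(d, "x = 1")
build_deep(d, "# a new comment\nx = 1")
print(d.runs)   # the whole pipeline ran twice
```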

Chapter 6 of the Nix thesis talks about making the build outputs "content addressable" (where the checksum describes the build target's contents instead of the build command + dependencies) but that's described as experimental. I don't know if it was ever implemented in Nix. Even if it was, it doesn't use a Merkle tree (as described in the thesis); it serialises the entire directory including all its files' contents (like tar) and then calculates a checksum of this.

P.S. I hope it doesn't sound like I'm dissing Nix. It has many advantages, not least of which: It's an open-source tool that actually exists and anyone can use. My article is about sharing a new technique/idea.

---

Another build system I wanted to mention is BuildStream -- but again, I don't know enough about it, and I ran out of word count & time. I'm unsure of the relationship between BuildStream, BuildGrid, and BuildBox; and they have gone through several architectural changes. These projects work together to provide "cloud" capabilities: farming work out to remote build servers, and providing a cloud-based cache of build artifacts. As far as we can tell they have something like this article's "ostree_mod" that works remotely. They used to use OSTree but now they have developed their own buildbox-casd component. Their reasons for migrating from OSTree, based on a conversation at the London Build Meetup in October 2019:

They want it to interoperate with HTTP-based CAS ("content addressable store") implementations like Bazel's one (a simple HTTP PUT/GET protocol that can be backed by S3, whatever Google's equivalent is, or any plain old HTTP server).
There's no ostree push. You have to use ostree pull via SSH reverse proxying, etc.
They've dropped the GC, instead they expire objects in an LRU fashion. OSTree has the guarantee that if you have a ref, you'll have the contents of that tree. This means you need GC to be able to expire objects without breaking these guarantees. Instead whenever they pull a tree they touch all the files in it, and then expire them in an LRU fashion. They were seeing GC pauses of 24hr with OSTree.
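
The touch-then-expire scheme in the last point can be sketched in a few lines of Python. This is not buildbox-casd's code; the function names, the flat cache directory and the byte budget are assumptions made for the example. Pulling a tree refreshes the timestamps of its objects, and expiry removes the least recently touched files until the cache fits the budget.

```python
import os
import tempfile

def touch_objects(paths, when=None):
    """Mark objects as recently used (called whenever a tree is pulled)."""
    for path in paths:
        os.utime(path, None if when is None else (when, when))

def expire_lru(cache_dir, max_bytes):
    """Delete least recently touched files until the cache fits max_bytes."""
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        st = os.stat(path)
        entries.append((st.st_mtime, name, path, st.st_size))
    entries.sort()                       # oldest first; names break ties
    total = sum(size for _, _, _, size in entries)
    removed = []
    for _, name, path, size in entries:
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(name)
    return removed

def demo():
    """Three 10-byte objects; 'a' is the oldest but gets touched by a pull."""
    with tempfile.TemporaryDirectory() as cache:
        for name, mtime in (("a", 100), ("b", 200), ("c", 300)):
            path = os.path.join(cache, name)
            with open(path, "wb") as f:
                f.write(b"x" * 10)
            os.utime(path, (mtime, mtime))
        touch_objects([os.path.join(cache, "a")], when=400)
        return expire_lru(cache, max_bytes=20)

print(demo())
```

Note the trade-off mentioned above: nothing here checks whether a ref still points at a removed object, so the "if you have a ref, you have the tree" guarantee is gone.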

Merkle trees and build systems

drothlis — Fri, 29 May 2020 08:31:16 +0000

All build systems & packaging systems use dependency graphs, I would imagine. The novel thing in this article is representing each individual build artifact (be it a single executable, a package, or an entire tree rootfs) as its own Merkle tree. Each of these trees is representing the files & directories in a single build artifact, not dependencies between artifacts. Dependencies are still handled by the underlying build system (Ninja, in this case).

Merkle trees and build systems

jeeger — Fri, 29 May 2020 06:29:01 +0000

Yeah, the build system seems interesting, but not discussing Nix seems like a big omission.

Merkle trees and build systems

civodul — Thu, 28 May 2020 21:55:25 +0000

Yes, it does. This is called the "extensional model" in the original Nix work (see https://edolstra.github.io/pubs/phd-thesis.pdf). The model is that build processes are viewed as pure functions, which can thus be "memoized"; IOW, build results can be cached in the store.

Merkle trees and build systems

drothlis — Thu, 28 May 2020 21:29:33 +0000

Ninja will rebuild the target if the command line to build it changes (because the ostree_combine command-line now has an additional argument). This functionality is provided by Ninja itself: https://ninja-build.org/manual.html#ref_log

Or you could change the target name, as you suggest, though you'd need to GC the ostree repo in your build folder periodically.

Merkle trees and build systems

MrWim — Thu, 28 May 2020 21:20:57 +0000

> Neat solution. Minor correction: I think the target of the ldconfig rule in the mandb/ldconfig example figure should be /etc/ld.so.cache, not the /lib and /usr/lib subtrees (?)

It's both really. ldconfig creates the symlinks in /lib and /usr/lib, so it needs to be that too.

Merkle trees and build systems

MrWim — Thu, 28 May 2020 21:16:44 +0000

> is the name of the target somehow dependant on the input parameters of the ostree_combine() call (i.e. same as Nix)?

Yes, that's right. The names of (almost) all the targets are auto-generated based on the inputs and the command to execute, although even if they weren't ninja would take care of that for us because it treats the command to execute as one of the inputs for the purposes of dirty detection.
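
Roughly what "auto-generated based on the inputs and the command" could look like, as a Python sketch. This is an illustration of the idea rather than the code from our build system; the `target_name` helper, the argv layout and the ref names are made up. Any change to the command or to the list of inputs gives a different target name.

```python
import hashlib
import json

def target_name(rule, command, inputs):
    """Derive a target name from the rule, its command line and its inputs."""
    # JSON keeps the serialisation unambiguous; input order is preserved
    # because the order in which trees are combined can matter.
    payload = json.dumps({"command": command, "inputs": inputs}, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return f"{rule}-{digest[:16]}"

# Hypothetical refs being combined.
two = target_name("ostree_combine", ["ostree_combine", "ref/a", "ref/b"],
                  ["ref/a", "ref/b"])
three = target_name("ostree_combine", ["ostree_combine", "ref/a", "ref/b", "ref/c"],
                    ["ref/a", "ref/b", "ref/c"])
print(two)
print(three)
```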

Merkle trees and build systems

atai — Thu, 28 May 2020 17:29:01 +0000

Does guix also use a similar construct, as it is modeled after Nix?

Merkle trees and build systems

estansvik — Thu, 28 May 2020 17:18:59 +0000

Neat solution. Minor correction: I think the target of the ldconfig rule in the mandb/ldconfig example figure should be /etc/ld.so.cache, not the /lib and /usr/lib subtrees (?)