
Merkle trees and build systems

Posted May 29, 2020 10:34 UTC (Fri) by drothlis (guest, #89727)
In reply to: Merkle trees and build systems by dezgeg
Parent article: Merkle trees and build systems

I can't speak about Nix with any authority because I haven't used it, but here are some thoughts:

Nix's Merkle tree structure, as described by dezgeg's comment, is capturing the *dependencies* of a package, rather than the *contents* of a single package.

In Nix, as far as I can tell, there's no sharing/deduplication of individual files across different packages or different versions of the same package. Section 7.5 of the Nix PhD thesis (which was provided by civodul in another comment) talks about this problem and provides some workarounds for making deployments more network-efficient, such as calculating binary diffs, but this problem simply doesn't exist if you store the contents as Merkle trees.
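To make the file-level deduplication point concrete, here is a minimal sketch (in Python, my own illustration rather than code from Nix or OSTree) of content-addressing a directory as a Merkle tree: each file is stored once under the hash of its contents, and a directory is stored under the hash of its (name, child-hash) listing, so a file shared by two packages, or by two versions of the same package, is stored only once.

    import hashlib, os

    def store(objects, data):
        """Store a blob under the SHA-256 of its contents; return the hash."""
        digest = hashlib.sha256(data).hexdigest()
        objects[digest] = data          # identical content is stored only once
        return digest

    def add_tree(objects, path):
        """Recursively add a directory, returning the hash of its listing."""
        entries = []
        for name in sorted(os.listdir(path)):
            child = os.path.join(path, name)
            if os.path.isdir(child):
                entries.append(("dir", name, add_tree(objects, child)))
            else:
                with open(child, "rb") as f:
                    entries.append(("file", name, store(objects, f.read())))
        # A directory object only lists names and child hashes, so two
        # packages that share a file also share the underlying blob.
        listing = "\n".join("%s %s %s" % e for e in entries).encode()
        return store(objects, listing)

    objects = {}
    root = add_tree(objects, "some-package/")   # hypothetical package directory
    print("tree hash:", root, "objects stored:", len(objects))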

It also means that you can't share/reuse any work done by intermediate build steps *within* a package -- with Nix the granularity is at the level of each package.

According to the Build Systems à la Carte paper, Nix resolves transitive dependencies, so it only stores the hashes of the terminal inputs, ignoring intermediate dependencies (search for "Deep Constructive Traces" in the paper). For "cloud" build systems this means that you don't have to wait until intermediate targets are built before deciding whether you need to build the target (because it may already be in the build farm's cache). The downside is that you can't have early cutoff: say you've added a comment to a source file; if the compiled ".o" is unchanged, early cutoff means you can stop there, because the final build artifact is going to be the same.
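Roughly (my own sketch, not code from Nix or from the paper), a "deep" cache key is computed from the build rules plus the terminal inputs only, so it can be looked up in a remote cache before anything is built; early cutoff instead requires hashing each intermediate output and comparing it with the previous value:

    import hashlib

    def h(*parts):
        return hashlib.sha256("\0".join(parts).encode()).hexdigest()

    sources = {"main.c": "int main(){return 0;} /* new comment */"}
    rules = {"main.o": ("cc -c", ["main.c"]), "app": ("cc -o", ["main.o"])}

    # Deep constructive trace: the key for "app" depends only on the terminal
    # input (main.c) and the rules, so it can be computed up front and looked
    # up in the build farm's cache without building main.o first...
    deep_key = h(str(rules), sources["main.c"])

    # ...but adding a comment to main.c changes deep_key, even though the
    # compiled main.o is byte-identical.  Early cutoff needs the hash of the
    # *output* of each intermediate step:
    def build_object(src):
        return "OBJ"            # pretend the comment doesn't change the .o

    old_obj_hash = h("OBJ")     # from the previous build
    new_obj_hash = h(build_object(sources["main.c"]))
    if new_obj_hash == old_obj_hash:
        print("early cutoff: no need to re-link app")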

Chapter 6 of the Nix thesis talks about making the build outputs "content addressable" (where the checksum describes the build target's contents instead of the build command + dependencies) but that's described as experimental. I don't know if it was ever implemented in Nix. Even if it was, it doesn't use a Merkle tree (as described in the thesis); it serialises the entire directory including all its files' contents (like tar) and then calculates a checksum of this.
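For contrast, "serialise the whole directory and hash that" looks roughly like this (a sketch in the spirit of what the thesis describes, not Nix's actual serialisation format): you get a single flat checksum for the package, so any change to any file changes the package's hash, and there are no per-file objects to deduplicate.

    import hashlib, os

    def serialise(path):
        """Flatten a directory tree into one byte stream, tar-style."""
        out = []
        for root, dirs, files in os.walk(path):
            dirs.sort()                     # deterministic traversal order
            for name in sorted(files):
                full = os.path.join(root, name)
                with open(full, "rb") as f:
                    data = f.read()
                out.append(("%s\0%d\0" % (full, len(data))).encode() + data)
        return b"".join(out)

    # One checksum for the whole output directory: content-addressed, but
    # with nothing that can be shared file-by-file between packages.
    print(hashlib.sha256(serialise("result/")).hexdigest())   # hypothetical path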

P.S. I hope it doesn't sound like I'm dissing Nix. It has many advantages, not least of which: It's an open-source tool that actually exists and anyone can use. My article is about sharing a new technique/idea.

---

Another build system I wanted to mention is BuildStream -- but again, I don't know enough about it, and I ran out of word count & time. I'm unsure of the relationship between BuildStream, BuildGrid, and BuildBox, and they have gone through several architectural changes. These projects work together to provide "cloud" capabilities: farming work out to remote build servers, and providing a cloud-based cache of build artifacts. As far as we can tell they have something like this article's "ostree_mod" that works remotely. They used to use OSTree, but they have since developed their own buildbox-casd component. Their reasons for migrating away from OSTree, based on a conversation at the London Build Meetup in October 2019:

  • They want it to interoperate with HTTP-based CAS ("content addressable store") implementations like Bazel's (a simple HTTP PUT/GET protocol that can be backed by S3, whatever Google's equivalent is, or any plain old HTTP server).
  • There's no ostree push. You have to use ostree pull via SSH reverse proxying, etc.
  • They've dropped GC; instead they expire objects in an LRU fashion. OSTree guarantees that if you have a ref, you'll have the contents of that tree, which means you need GC in order to expire objects without breaking that guarantee. Instead, whenever they pull a tree they touch all the files in it, and then expire them in an LRU fashion (see the sketch after this list). They were seeing 24-hour GC pauses with OSTree.
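As a rough sketch of that LRU scheme (my own illustration, not buildbox-casd or BuildGrid code, and the on-disk layout is made up): pulling a tree bumps the mtime of every object it references, and a separate pass deletes whatever hasn't been touched recently, with no ref-based mark-and-sweep at all.

    import os, time

    CAS_DIR = "cas/objects"        # hypothetical layout: one file per object

    def touch_tree(object_hashes):
        """On every pull, mark each object in the tree as recently used."""
        now = time.time()
        for digest in object_hashes:
            os.utime(os.path.join(CAS_DIR, digest), (now, now))

    def expire(max_age_days=30):
        """Delete objects that haven't been used recently; no GC over refs."""
        cutoff = time.time() - max_age_days * 86400
        for name in os.listdir(CAS_DIR):
            path = os.path.join(CAS_DIR, name)
            if os.stat(path).st_mtime < cutoff:
                os.unlink(path)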



Merkle trees and build systems

Posted May 29, 2020 19:08 UTC (Fri) by walters (subscriber, #7396)

GC with OSTree is indeed a problem with large numbers of refs (or really a large amount of data). The design was mostly focused on the base operating system case where this isn't a problem.

That said, there are a few tricks one can use, such as having multiple repositories; one could then implement GC by pulling recently-used refs from the old repo into a new one (which is really just hardlinking, so pretty cheap), then deleting the old repo and moving the new one into place. We could probably add this as a primitive into OSTree itself - it'd make GC cost closer to O(data preserved) and not O(data). Could also amortize by having multiple repos that have different subsets of refs and prune them at different points, re-importing whatever canonical data as needed.
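Something like this sketch, say (Python, with a deliberately simplified object-store layout rather than OSTree's real on-disk format): hard-link the objects reachable from the refs you want to keep into a fresh repo, then swap the repos, so the linking work is proportional to what you keep rather than to everything stored.

    import os, shutil

    def gc_by_new_repo(old_repo, keep_refs, reachable_objects):
        """reachable_objects(repo, ref) -> object paths relative to the repo;
        both it and the flat layout here are assumptions for illustration."""
        new_repo = old_repo + ".new"
        os.makedirs(os.path.join(new_repo, "objects"))
        for ref in keep_refs:
            for obj in reachable_objects(old_repo, ref):
                dst = os.path.join(new_repo, obj)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                if not os.path.exists(dst):
                    os.link(os.path.join(old_repo, obj), dst)  # no data copied
        # Only O(data preserved) work so far; the old repo goes away wholesale.
        shutil.rmtree(old_repo)
        os.rename(new_repo, old_repo)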

(There's a hugely interesting sub-topic here around whether OSTree is a *cache* of something like a .deb/rpm or whether it's canonical, i.e. your build system outputs it)

Merkle trees and build systems

Posted May 30, 2020 13:26 UTC (Sat) by nim-nim (subscriber, #34454)

A .deb/rpm is far more than an archive of built files that could be replaced with an ostree of built files.

Merkle trees and build systems

Posted May 31, 2020 20:01 UTC (Sun) by MrWim (subscriber, #47432)

You're referring to maintainer scripts and the like? With apt2ostree we store the data and the metadata in separate trees. When we come to combine them, we try to reconstruct the dpkg database that would result from all these packages being unpacked into a chroot.

I think of it as being a bit like map-reduce, where you design the reduce step to be as cheap as possible; the map can be expensive if you like, because it's deterministic, cacheable, and parallelizable.
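If each package's metadata tree carries its control stanza (a layout I'm assuming here for illustration; apt2ostree's actual trees may differ), the reduce step can be little more than concatenating those stanzas into /var/lib/dpkg/status:

    import os

    def build_dpkg_status(metadata_trees, rootfs):
        """Concatenate each package's control stanza into the dpkg database,
        as if every package had been unpacked into this chroot."""
        status_path = os.path.join(rootfs, "var/lib/dpkg/status")
        os.makedirs(os.path.dirname(status_path), exist_ok=True)
        with open(status_path, "w") as status:
            for tree in metadata_trees:            # one tree per .deb
                with open(os.path.join(tree, "control")) as f:
                    status.write(f.read().rstrip("\n") + "\n\n")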

Merkle trees and build systems

Posted Jun 5, 2020 10:09 UTC (Fri) by nim-nim (subscriber, #34454)

apt and rpm do not deal with just a set of resulting files; they deal with composition rules, build rules, and post-install file munging, because real-life software uses things like indexes that cannot be computed before the things the index indexes are actually deployed.

Merkle trees and build systems

Posted May 31, 2020 20:24 UTC (Sun) by MrWim (subscriber, #47432)

> That said, there are a few tricks one can use, such as having multiple repositories; one could then implement GC by pulling recently-used refs from the old repo into a new one (which is really just hardlinking, so pretty cheap), then deleting the old repo and moving the new one into place. We could probably add this as a primitive into OSTree itself - it'd make GC cost closer to O(data preserved) and not O(data). Could also amortize by having multiple repos that have different subsets of refs and prune them at different points, re-importing whatever canonical data as needed.

I've been thinking about this and I'm having difficulty seeing how it can be cheaper (at least theoretically).

GC by making a new repo

  1. Walk the trees finding objects to keep - O(metadata preserved) syscalls, O(objects preserved) operations
  2. New hard-link for each one - O(data preserved) syscalls
  3. List all objects in old repo deleting each one - O(total objects) syscalls, O(total objects) operations

Theoretical performance of GC with current system

  1. Walk the trees finding objects to keep - O(metadata preserved) syscalls, O(objects preserved) memory, O(objects preserved) operations
  2. List all objects in old repo deleting each one not preserved - O(total objects) syscalls, O(total objects) operations

It seems to me that the two are equivalent: with the new-repo approach you're using hard links in the filesystem to mark the objects to keep, while with the current system you could be using a hashmap in memory. Unless there's some way of deleting a directory recursively that is cheaper than iterating over all the files, I can't see how the new-repo option could be cheaper.
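For comparison, the current-system variant looks something like this sketch (a deliberately simplified flat object layout, not OSTree's real one): build a set of reachable object names, then delete everything not in the set - the same number of unlink() calls, with the "keep" mark held in memory instead of in a second repo's hard links.

    import os

    def mark_and_sweep(repo, keep_refs, reachable_objects):
        keep = set()                               # O(objects preserved) memory
        for ref in keep_refs:
            keep.update(reachable_objects(repo, ref))
        obj_dir = os.path.join(repo, "objects")
        for name in os.listdir(obj_dir):           # O(total objects) syscalls
            if name not in keep:
                os.unlink(os.path.join(obj_dir, name))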

I've not personally had a problem with ostree gc, but I've found in the past that one cause of ostree performance problems is that the GFile interface makes it difficult to reason about what OS-level operations will occur when you make a method call. For example, when I was implementing #1643 I found that it was much faster to work with the GVariants directly than through the OstreeRepoFile* interface.

It's not quite done yet, but I've been working on something that might make working with the GVariants directly somewhat more convenient: https://github.com/wmanley/gvariant-rs/blob/master/examples/ostree-ls/src/main.rs
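In Python the equivalent of working with the variants directly looks roughly like this (a sketch using PyGObject; I'm assuming OSTree's documented dirtree format "(a(say)a(sayay))" and a made-up file name, so treat the details as illustrative):

    from gi.repository import GLib

    # An OSTree "dirtree" object: an array of (filename, checksum) entries
    # plus an array of (dirname, tree checksum, meta checksum) entries.
    DIRTREE_FORMAT = GLib.VariantType.new("(a(say)a(sayay))")

    with open("some-object.dirtree", "rb") as f:      # hypothetical path
        data = GLib.Bytes.new(f.read())

    tree = GLib.Variant.new_from_bytes(DIRTREE_FORMAT, data, False)
    files, dirs = tree.unpack()
    for name, csum in files:
        print("file", name, bytes(csum).hex())
    for name, tree_csum, meta_csum in dirs:
        print("dir ", name, bytes(tree_csum).hex())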

Merkle trees and build systems

Posted May 30, 2020 17:14 UTC (Sat) by Ericson2314 (guest, #139248)

The differences you talked about -- hashing the content of intermediate build steps, and using Merkle hashing for deduplication and incrementality -- have long been things we've wanted to try with Nix, as you've noticed.

What's new is that we're now doing it! The company I work for is working on IPFS and Nix, as described in https://discourse.nixos.org/t/obsidian-systems-is-excited... . The underlying changes should make it easy to support other hashing schemes like OSTree's.

I'm really glad to see you all are working on similar things -- the shift in perspective from rules on files to rules on subtrees is huge, and I hope to see it emerge and spread in as many ways as possible. Hopefully everyone can modularize and we'll have more interop / drop-in replacements and independent composition of networking and storage methods.

Happy to answer any questions about Nix / would love to compare notes on these sorts of things.

