
Merkle trees and build systems

Posted May 29, 2020 19:08 UTC (Fri) by walters (subscriber, #7396)
In reply to: Merkle trees and build systems by drothlis
Parent article: Merkle trees and build systems

GC with OSTree is indeed a problem with large numbers of refs (or really a large amount of data). The design was mostly focused on the base operating system case where this isn't a problem.

That said, there are a few tricks one can use, such as having multiple repositories, and then one could implement GC by pulling recently-used refs from one into a new repo (which is really just hardlinking, so pretty cheap), then delete the old repo and move it into place. We could probably add this as a primitive into OSTree itself - it'd make GC cost closer to O(data preserved) and not O(data). Could also amortize by having multiple repos that have different subsets of refs and prune them at different points, re-importing whatever canonical data as needed.
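The trick can be illustrated with a toy model (a flat directory standing in for an object store; the function name and layout are my invention for illustration, not ostree's actual API or on-disk format):

```python
import os
import shutil
import tempfile

def gc_via_new_repo(old_repo, keep_objects):
    """Toy model of GC-by-new-repo: hardlink the objects reachable from
    recently-used refs into a fresh repo, then delete the old repo wholesale."""
    new_repo = tempfile.mkdtemp()
    for obj in keep_objects:                  # one link() per preserved object
        os.link(os.path.join(old_repo, obj),
                os.path.join(new_repo, obj))  # hardlink: no file data is copied
    shutil.rmtree(old_repo)                   # unreferenced objects vanish with the old repo
    return new_repo
```

Because hardlinks share the underlying inode, the "copy" into the new repo is cheap regardless of object size; only the final removal of the old repo touches every object.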

(There's a hugely interesting sub-topic here around whether OSTree is a *cache* of something like a .deb/rpm or whether it's canonical, i.e. your build system outputs it)


Merkle trees and build systems

Posted May 30, 2020 13:26 UTC (Sat) by nim-nim (subscriber, #34454) [Link] (2 responses)

A .deb/rpm is far more than an archive of built files that could be replaced with an ostree of built files.

Merkle trees and build systems

Posted May 31, 2020 20:01 UTC (Sun) by MrWim (subscriber, #47432) [Link] (1 response)

You're referring to maintainer scripts and the like? With apt2ostree we store the data and the metadata in separate trees. When we come to combine them, we try to reconstruct the dpkg database that would result from unpacking all these packages into a chroot.

I think of it as being a bit like map-reduce: you design the reduce step to be as cheap as possible, and the map step can be expensive if you like, because it is deterministic, cacheable, and parallelizable.
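A minimal sketch of that split (all names and the toy "work" are mine, not apt2ostree's code): the map step is keyed purely by the package contents, so its results can be cached and computed in parallel, while the reduce step just merges them.

```python
import hashlib

map_cache = {}  # content hash -> map result; persists across "builds"

def map_step(pkg_bytes):
    # Expensive per-package work, but deterministic: the output depends
    # only on the input bytes, so it is cacheable and parallelizable.
    key = hashlib.sha256(pkg_bytes).hexdigest()
    if key not in map_cache:
        map_cache[key] = sorted(pkg_bytes.split())  # stand-in for real unpack/analysis
    return map_cache[key]

def reduce_step(mapped_results):
    # Cheap combine step: just concatenate the per-package outputs.
    merged = []
    for result in mapped_results:
        merged.extend(result)
    return merged
```

On a rebuild, every unchanged package hits the cache and only the cheap reduce step runs again.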

Merkle trees and build systems

Posted Jun 5, 2020 10:09 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

apt and rpm do not deal with just a set of resulting files; they also deal with composition rules, build rules, and post-install file munging, because real-life software uses things like indexes that cannot be computed before the things the index indexes are actually deployed.

Merkle trees and build systems

Posted May 31, 2020 20:24 UTC (Sun) by MrWim (subscriber, #47432) [Link]

> That said, there are a few tricks one can use, such as having multiple repositories, and then one could implement GC by pulling recently-used refs from one into a new repo (which is really just hardlinking, so pretty cheap), then delete the old repo and move it into place. We could probably add this as a primitive into OSTree itself - it'd make GC cost closer to O(data preserved) and not O(data). Could also amortize by having multiple repos that have different subsets of refs and prune them at different points, re-importing whatever canonical data as needed.

I've been thinking about this and I'm having difficulty seeing how it can be cheaper (at least theoretically).

GC with making new repo

  1. Walk the trees finding objects to keep - O(metadata preserved) syscalls, O(objects preserved) operations
  2. Create a new hardlink for each one - O(data preserved) syscalls
  3. List all objects in the old repo, deleting each one - O(total objects) syscalls, O(total objects) operations

Theoretical performance of GC with current system

  1. Walk the trees finding objects to keep - O(metadata preserved) syscalls, O(objects preserved) memory, O(objects preserved) operations
  2. List all objects in the old repo, deleting each one not preserved - O(total objects) syscalls, O(total objects) operations

It seems to me that the two are equivalent: the new-repo approach uses hardlinks in the filesystem to mark an object as preserved, while the current approach can use a hashmap in memory. Unless there's some way of deleting a directory recursively that is cheaper than iterating over all the files in it, I can't see how the new-repo approach could be cheaper.
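To make the comparison concrete, here is a small model (dicts standing in for the filesystem, and all names mine) that counts per-object operations for both strategies:

```python
def gc_current(store, keep):
    """Current scheme: mark in an in-memory set, then sweep by listing
    every object and unlinking the unmarked ones."""
    ops = 0
    marked = set(keep)               # memory, not syscalls
    for obj in list(store):          # one list/unlink-equivalent per object
        ops += 1
        if obj not in marked:
            del store[obj]
    return ops                       # O(total objects)

def gc_new_repo(store, keep):
    """New-repo scheme: the 'mark' is materialized as hardlinks in a
    fresh repo, but deleting the old repo still visits every object."""
    ops = 0
    new = {}
    for obj in keep:                 # one link() per preserved object
        ops += 1
        new[obj] = store[obj]
    for obj in list(store):          # rmtree: one unlink per object
        ops += 1
        del store[obj]
    return ops, new                  # O(preserved) + O(total objects)
```

Running both over the same store shows the new-repo variant doing strictly more per-object work, matching the argument above that it cannot be asymptotically cheaper.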

I've not personally had a problem with ostree GC, but I've found in the past that one cause of ostree performance problems is that the GFile interface makes it difficult to reason about which OS-level operations will occur when you make a method call. For example, when I was implementing #1643 I found that it was much faster to work with the GVariants directly than through the OstreeRepoFile* interface.

It's not quite done yet, but I've been working on something that might make working with the GVariants directly somewhat more convenient: https://github.com/wmanley/gvariant-rs/blob/master/examples/ostree-ls/src/main.rs

