Merkle trees and build systems
Merkle trees and build systems
Posted May 31, 2020 20:24 UTC (Sun) by MrWim (subscriber, #47432)In reply to: Merkle trees and build systems by walters
Parent article: Merkle trees and build systems
That said, there are a few tricks one can use, such as having multiple repositories, and then one could implement GC by pulling recently-used refs from one into a new repo (which is really just hardlinking, so pretty cheap), then delete the old repo and move it into place. We could probably add this as a primitive into OSTree itself - it'd make GC cost closer to O(data preserved) and not O(data). Could also amortize by having multiple repos that have different subsets of refs and prune them at different points, re-importing whatever canonical data as needed.
I've been thinking about this and I'm having difficulty seeing how it can be cheaper (at least theoretically).
GC with making new repo
- Walk the trees finding objects to keep - O(metadata preserved) syscalls, O(objects preserved) operations
- New hard-link for each one - O(data preserved) syscalls
- List all objects in old repo deleting each one - O(total objects) syscalls, O(total objects) operations
Theoretical performance of GC with current system
- Walk the trees finding objects to keep - O(metadata preserved) syscalls, O(objects preserved) memory, O(objects preserved) operations
- List all objects in old repo deleting each one not preserved - O(total objects) syscalls, O(total objects) operations
It seems to me that the latter is the same as the former, but you're using the hardlinks in the filesystem to mark an object, while in the former you could be using a hashmap in memory. Unless theres some way of deleting a directory recursively that is cheaper than iterating over all the files I can't see how the second option could be cheaper?
I've not personally had a problem with ostree gc, but I've found in the past that one of the causes of ostree performance problems is that the GFile interface makes it difficult to reason about what os-level operations will occur when you make a method call. For example when I was implementing #1643 I found that it was much faster to work with the GVariants directly than the OstreeRepoFile* interface.
It's not quite done yet, but I've been working on something that might make working with the GVariants directly somewhat more convenient: https://github.com/wmanley/gvariant-rs/blob/master/examples/ostree-ls/src/main.rs
