Updating the Git protocol for SHA-256

Posted Jun 20, 2020 14:07 UTC (Sat) by Ericson2314 (guest, #139248)
In reply to: Updating the Git protocol for SHA-256 by ms-tg
Parent article: Updating the Git protocol for SHA-256

Speaking of IPFS things, https://discuss.ipfs.io/t/git-on-ipfs-links-and-reference...

Git really should start merkleizing blob hashes / chunking blobs. Not only does it help with data exchange, but it also means faster rehashing when a blob changes: O(log n) instead of O(n). This transition is the best time to fix things like this; it's a pity that they do not seem to be under discussion.
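
To make the O(n) versus O(log n) point concrete, here is a minimal Python sketch of a binary Merkle tree over a blob's chunks. The function names, the leaf/node prefix bytes, and the use of SHA-256 are assumptions for illustration only; nothing here reflects an actual Git format or proposal.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(chunks):
    """Hash every chunk, then pair hashes upward until one root remains.

    Returns a list of levels: levels[0] holds the leaf hashes and
    levels[-1] holds the single root. Assumes at least one chunk.
    """
    level = [h(b"\x00" + c) for c in chunks]  # prefix is a hypothetical leaf tag
    levels = [level]
    while len(level) > 1:
        # An odd trailing hash is rehashed on its own.
        level = [h(b"\x01" + b"".join(level[i:i + 2]))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def update_chunk(levels, index, new_chunk):
    """Replace one chunk and rehash only the path from its leaf to the root.

    Returns the number of hash computations performed.
    """
    levels[0][index] = h(b"\x00" + new_chunk)
    count = 1
    for depth in range(1, len(levels)):
        index //= 2
        below = levels[depth - 1]
        levels[depth][index] = h(b"\x01" + b"".join(below[2 * index:2 * index + 2]))
        count += 1
    return count
```

With 1024 chunks, changing one chunk costs 11 hash computations (one leaf plus ten interior nodes) when the tree is cached, rather than rehashing the whole blob. The changed chunk itself still has to be hashed in full, so the saving depends on the chunk size.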

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 23:14 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (9 responses)

How big a chunk do you think would work? Too small and the chunk hashes start to dwarf the content size. Too large and anything but trivial changes ends up changing every chunk. Sure, there are massive files in repositories, but these probably fall into one of a few buckets:

- best left to git-lfs, git-annex, or some other off-loading tool
- machine generated data (of some kind) that changes rarely
- non-text artifacts that change rarely

I think experiments to test the actual benefits of this in organic Git repositories would be interesting, but I'd rather see the hash transition happen correctly and smoothly, and it sounds complicated enough as it is. It should also be laying down version numbers in formats where it needs them, which another such transition could leverage to ease its upgrade path too.

Updating the Git protocol for SHA-256

Posted Jun 21, 2020 13:46 UTC (Sun) by pabs (subscriber, #43278) [Link] (8 responses)

restic and friends use variable-sized chunks; that seems to me to be the way to go.

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 2:07 UTC (Mon) by cyphar (subscriber, #110703) [Link]

They do use variable-sized chunks (more specifically, content-defined chunking), but those chunking algorithms still require you to specify how large you want your chunks to be on average (or in restic's case, the chunking algorithm also asks what the maximum and minimum chunk sizes are). So you still have to decide on the trade-off between chunks that are too large and chunks that are too small.
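
A rough Python sketch of what such a chunker looks like, with the three knobs mentioned above (minimum, average, maximum). It uses a simple gear-style rolling hash chosen for brevity; that is a swapped-in technique, not a claim about what restic or any other tool here actually implements, and every name and default value below is hypothetical.

```python
import random

MASK64 = (1 << 64) - 1
_rng = random.Random(0)  # fixed seed so the table is reproducible
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # one random word per byte value


def chunk(data: bytes, min_size=2048, avg_bits=13, max_size=65536):
    """Yield chunks of data, cut at content-defined boundaries.

    A boundary is declared when the top avg_bits bits of the rolling
    hash are all zero, which happens about once every 2**avg_bits bytes
    on random input. min_size and max_size clamp the chunk length.
    """
    shift = 64 - avg_bits
    start = 0
    hval = 0
    for i, byte in enumerate(data):
        # Shifting left ages out old bytes: only the last 64 bytes matter.
        hval = ((hval << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if size < min_size:
            continue
        if (hval >> shift) == 0 or size >= max_size:
            yield data[start:i + 1]
            start = i + 1
            hval = 0
    if start < len(data):
        yield data[start:]  # trailing chunk may be shorter than min_size
```

Because the cut points depend on the content rather than on byte offsets, inserting a byte near the start of a file typically changes only the first chunk or so; later boundaries fall in the same places. The trade-off described above is still there: avg_bits has to be picked up front.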

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 5:14 UTC (Sat) by ras (subscriber, #33059) [Link] (6 responses)

Your comment led me to look up restic, and I was thinking "finally, this is it", then I discovered https://github.com/restic/restic/issues/187. With ransomware a thing it's a major omission, and sadly there have been two years with no movement. Shame.

But you say it has friends?

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 5:46 UTC (Sat) by pabs (subscriber, #43278) [Link] (5 responses)

borg is the other modern chunking backup system:

https://borgbackup.github.io/borgbackup/

There is also bup, much more closely related to git:

https://github.com/bup/bup

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:19 UTC (Sat) by johill (subscriber, #25196) [Link] (4 responses)

Thinking about ransomware, I think you should be able to configure permissions on restic's use of AWS/S3 to prevent deletion? There's a use case that's more interesting for asymmetric encryption - not letting the machine that's doing the backup access its own old data, in case it's compromised, as mentioned on that ticket. Maybe you can even configure the S3 bucket to be read-only of sorts (e.g. by storing in Deep Archive or Glacier, and not letting the IAM account do any restores from there), but I don't know how much or what restic needs to read out of the repo to make a new backup.

Borg's encryption design seems to have one issue - as far as I can tell, the "content-based chunker" has a very small key (they claim 32 bits, but say it goes linearly through the algorithm, so not all of those bits eventually matter), which would seem to allow fingerprinting attacks ("you have this chain of chunk sizes, so you must have this file"). Borg also has been debating S3 storage for years without any movement.

Ultimately I landed with bup (that I had used previously), and have been working on adding to bup both (asymmetric) encryption support and AWS/S3 storage; in the latter case you can effectively make your repo append-only (to the machine that's making the backup), i.e. AWS permissions ensure that it cannot actually delete the data. It could delete some metadata tables etc. but that's mostly recoverable (though I haven't written the recovery tools yet), apart from the ref names (which are only stored in DynamoDB for consistency reasons, since S3 has almost no consistency guarantees).

It's probably not ready for mainline yet (and we're busy finishing the Python 3 port in mainline), but I've actually used it recently to begin storing some of my backups (currently ~850GiB) in S3 Deep Archive.

Configuration references:
https://github.com/jmberg/bup/blob/master/Documentation/b...
https://github.com/jmberg/bup/blob/master/Documentation/b...

Some design documentation is in the code:
https://github.com/jmberg/bup/blob/master/lib/bup/repo/en...

If you use it, there are two other things in my tree that you'd probably want:

1) with a lot of data, the content-based splitting on 13 bits results in far too much metadata (and storage isn't that expensive anymore), so you'd want to increase that. Currently in master that's not configurable, but I changed that: https://github.com/jmberg/bup/blob/master/Documentation/b...

2) if you have lots of large directories (e.g. maildir) then minor changes to those currently consume a significant amount of storage space since the entire folder is saved again (the list of files). I have "treesplit" in my code that allows splitting up those trees (again, content-based) to avoid that issue, which for my largest maildir of ~400k files brings down the amount of new data saved from close to 10 MB (after compression) to <<50kB when a new email is written there. Looks like I didn't document that yet, but I should add it here: https://github.com/jmberg/bup/blob/master/Documentation/b.... The commit describes it a bit now: https://github.com/jmberg/bup/commit/44006daca4786abe31e3...
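
For a rough sense of scale on point 1), the following Python sketch estimates how many chunks and how many bytes of object IDs a given number of split bits produces. It assumes an idealized average chunk size of exactly 2**bits bytes and 20-byte SHA-1 IDs; the function name is hypothetical, and the figures ignore index structures, tree objects, and compression, so they are only an order-of-magnitude guide.

```python
def chunk_metadata(total_bytes: int, split_bits: int, id_len: int = 20):
    """Return (approximate chunk count, bytes needed just for chunk IDs)."""
    avg_chunk = 1 << split_bits       # idealized average chunk size
    n_chunks = total_bytes // avg_chunk
    return n_chunks, n_chunks * id_len


# Illustrative input: the ~850 GiB figure mentioned above.
total = 850 * 2**30
for bits in (13, 16):
    n, id_bytes = chunk_metadata(total, bits)
    print(f"{bits} bits: {n:,} chunks, {id_bytes:,} bytes of IDs")
```

Under those assumptions, 13 bits gives about 111 million chunks and roughly 2.2 GB of raw IDs, while 16 bits gives about 14 million chunks and roughly 0.28 GB.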

And yes, I'm working with upstream on this.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:31 UTC (Sat) by pabs (subscriber, #43278) [Link] (1 responses)

Is anyone working on adding treesplitting to git itself? Your docs mention that the tree duplication issue occurs with git too.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:42 UTC (Sat) by johill (subscriber, #25196) [Link]

I'm not aware of that. It would probably mean a new object type in git, or some such; I haven't really thought about it.

However, it's not nearly as bad in git? You're not storing hundreds of thousands of files in a folder in git, presumably? :-) Not sure how much interest there would be in git on that.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:48 UTC (Sat) by johill (subscriber, #25196) [Link]

I should, however, mention that due to git's object format ("<type> <length>\x00<data>") other tools can architecturally have an advantage on throughput. Due to the header, bup has to run the content-based splitting first, and then start hashing the object only once it knows how long it is. If you don't have the limitation of the git storage format, you can do without such a header and do both hashes in parallel, stopping once you find the split point. I've been thinking about mitigating that with threading, but it's a bit difficult right now in bup's software architecture. (Incidentally, Python is not the issue here, since the hash splitting is entirely in C in my tree, so can run outside the GIL.)
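
A small Python sketch of why the header forces that ordering: the object ID is the SHA-1 of the header followed by the data, and the header contains the length, so the first bytes fed to the hash cannot be produced until the chunk's end is known. The function name is made up for illustration; the format itself is the one quoted above.

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """Compute a git object ID: SHA-1 over "<type> <length>\\x00<data>"."""
    # len(data) must be known here, before any data is hashed.
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    hasher = hashlib.sha1()
    hasher.update(header)
    hasher.update(data)
    return hasher.hexdigest()


print(git_object_id("blob", b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad, same as `git hash-object`
```

A headerless format could instead call `hasher.update()` on each byte range as the rolling hash scans it, and simply finalize the digest when the split point turns up.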

Updating the Git protocol for SHA-256

Posted Jul 8, 2020 19:22 UTC (Wed) by nix (subscriber, #2304) [Link]

> And yes, I'm working with upstream on this.

By this point, as a mere observer, I would say you *are* one of upstream. You're one of the two people doing most of the bup commits and have been for over a year now. :)