
Updating the Git protocol for SHA-256

Updating the Git protocol for SHA-256

Posted Jun 19, 2020 16:38 UTC (Fri) by sytoka (guest, #38525)
Parent article: Updating the Git protocol for SHA-256

It would be possible to put $5$ at the beginning of the hash, using the same numbering as /etc/shadow! Then $6$ would be SHA-512, and so on. The challenge with SHA-1 would stay the same, but the future would be easy.
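
For illustration, the scheme might look like this (a hypothetical sketch; `HASH_IDS` and `identify` are invented names for this example, not anything git implements):

```python
# Hypothetical sketch of the /etc/shadow-style prefix idea; the names
# HASH_IDS and identify() are invented for illustration, not git code.
HASH_IDS = {"5": "sha256", "6": "sha512"}  # same numbering as crypt(3)

def identify(h):
    """Return (algorithm, bare digest) for a prefixed or legacy hash."""
    if h.startswith("$"):
        _, num, digest = h.split("$", 2)
        return HASH_IDS[num], digest
    return "sha1", h  # unprefixed hashes keep their old SHA-1 meaning

assert identify("$5$" + "ab" * 32) == ("sha256", "ab" * 32)
assert identify("da39a3ee5e6b4b0d3255bfef95601890afd80709")[0] == "sha1"
```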



Updating the Git protocol for SHA-256

Posted Jun 19, 2020 17:09 UTC (Fri) by pj (subscriber, #4506) [Link] (19 responses)

I'd love for them to be forward-looking enough to adopt something like multihash (https://richardschneider.github.io/net-ipfs-core/articles...)

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 0:34 UTC (Sat) by ms-tg (subscriber, #89231) [Link] (18 responses)

I second multihash, and would like to hear an explanation for not using it.

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 14:07 UTC (Sat) by Ericson2314 (guest, #139248) [Link] (10 responses)

Speaking of IPFS things, https://discuss.ipfs.io/t/git-on-ipfs-links-and-reference...

Git really should start merkelizing blob hashes / chunking blobs. Not only does it help with data exchange, but it also means faster hashing when a blob changes: O(log n) instead of O(n). This transition is the best time to fix things like this; it's a pity they don't seem to be under discussion.
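
A minimal sketch of the idea, assuming fixed chunks and a binary tree (this is not any format git has proposed; it just shows why a one-chunk edit only touches a logarithmic number of hashes):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks):
    """Root of a binary Merkle tree over a blob's chunks (illustration only)."""
    level = [h(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

chunks = [b"chunk%d" % i for i in range(8)]
before = merkle_root(chunks)
chunks[3] = b"edited"            # a single-chunk edit changes the root,
after = merkle_root(chunks)      # but only the log2(8) = 3 interior hashes
assert before != after           # on the path to the root need recomputing
```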

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 23:14 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (9 responses)

How big a chunk do you think would work? Too small and the chunk hashes start to dwarf the content size. Too large and anything but trivial changes ends up changing every chunk. Sure, there are massive files in repositories, but these probably fall into one of a few buckets:

- best left to git-lfs, git-annex, or some other off-loading tool
- machine generated data (of some kind) that changes rarely
- non-text artifacts that change rarely

I think experiments to test the actual benefits of this in organic Git repositories would be interesting, but I'd rather see the hash transition happen correctly and smoothly, and it sounds complicated enough as it is. And it should be laying down version numbers in the formats as it goes, so that another such transition could leverage them to ease its upgrade path too.

Updating the Git protocol for SHA-256

Posted Jun 21, 2020 13:46 UTC (Sun) by pabs (subscriber, #43278) [Link] (8 responses)

restic and friends use variable-sized chunks; that seems to me to be the way to go.

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 2:07 UTC (Mon) by cyphar (subscriber, #110703) [Link]

They do use variable-sized chunks (more specifically, content-defined chunking), but those chunking algorithms still require you to specify how large you want your chunks to be on average (or in restic's case, the chunking algorithm also asks what the maximum and minimum chunk sizes are). So you still have to decide on the trade-off between chunks that are too large and chunks that are too small.
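
A toy illustration of that trade-off (this is not restic's actual chunker; the rolling hash and the size bounds are invented for the example):

```python
# Toy content-defined chunker -- not restic's actual algorithm; the rolling
# hash and the MIN/AVG/MAX values below are invented for the example.
MIN, AVG, MAX = 512, 2048, 8192
MASK = AVG - 1   # splitting when (hash & MASK) == 0 yields ~AVG-byte chunks

def chunk(data: bytes):
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        # cut when the hash hits the mask, but respect the MIN/MAX bounds
        if size >= MAX or (size >= MIN and rolling & MASK == 0):
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

parts = chunk(bytes(range(256)) * 64)          # 16 KiB of sample data
assert b"".join(parts) == bytes(range(256)) * 64
assert all(len(p) <= MAX for p in parts)
```

Raising AVG (and the bounds around it) means fewer, larger chunks and less metadata; lowering it means better deduplication of small edits at the cost of more hashes to store.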

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 5:14 UTC (Sat) by ras (subscriber, #33059) [Link] (6 responses)

Your comment led me to look up restic, and I was thinking "finally, this is it", then I discovered https://github.com/restic/restic/issues/187. With ransomware being a thing, that's a major omission, and sadly there have been two years with no movement. Shame.

But you say it has friends?

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 5:46 UTC (Sat) by pabs (subscriber, #43278) [Link] (5 responses)

borg is the other modern chunking backup system:

https://borgbackup.github.io/borgbackup/

There is also bup, much more closely related to git:

https://github.com/bup/bup

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:19 UTC (Sat) by johill (subscriber, #25196) [Link] (4 responses)

Thinking about ransomware: I think you should be able to configure permissions on restic's use of AWS/S3 to prevent deletion? There's a use case that's more interesting for asymmetric encryption: not letting the machine that's doing the backup access its own old data, in case it's compromised, as mentioned on that ticket. Maybe you can even configure the S3 bucket to be read-only of sorts (e.g. by storing in Deep Archive or Glacier, and not letting the IAM account do any restores from there), but I don't know how much or what restic needs to read out of the repo to make a new backup.

Borg's encryption design seems to have one issue: as far as I can tell, the "content-based chunker" has a very small key (they claim 32 bits, but say it goes linearly through the algorithm, so not all of those bits eventually matter), which would seem to allow fingerprinting attacks ("you have this chain of chunk sizes, so you must have this file"). Borg has also been debating S3 storage for years without any movement.

Ultimately I landed on bup (which I had used previously), and have been working on adding both (asymmetric) encryption support and AWS/S3 storage to bup; in the latter case you can effectively make your repo append-only (to the machine that's making the backup), i.e. AWS permissions ensure that it cannot actually delete the data. It could delete some metadata tables etc., but that's mostly recoverable (though I haven't written the recovery tools yet), apart from the ref names (which are only stored in DynamoDB for consistency reasons; S3 has almost no consistency guarantees).

It's probably not ready for mainline yet (and we're busy finishing the python 3 port in mainline), but I've actually used it recently to begin storing some of my backups (currently ~850GiB) in S3 Deep Archive.

Configuration references:
https://github.com/jmberg/bup/blob/master/Documentation/b...
https://github.com/jmberg/bup/blob/master/Documentation/b...

Some design documentation is in the code:
https://github.com/jmberg/bup/blob/master/lib/bup/repo/en...

If you use it, there are two other things in my tree that you'd probably want:

1) with a lot of data, the content-based splitting on 13 bits results in far too much metadata (and storage isn't that expensive anymore), so you'd want to increase that. Currently in master that's not configurable, but I changed that: https://github.com/jmberg/bup/blob/master/Documentation/b...

2) if you have lots of large directories (e.g. maildir) then minor changes to those currently consume a significant amount of storage space, since the entire folder (the list of files) is saved again. I have "treesplit" in my code, which allows splitting up those trees (again, content-based) to avoid that issue; for my largest maildir of ~400k files it brings the amount of new data saved when a new email is written there down from close to 10 MB (after compression) to <<50 kB. Looks like I didn't document that yet, but I should add it here: https://github.com/jmberg/bup/blob/master/Documentation/b.... The commit describes it a bit for now: https://github.com/jmberg/bup/commit/44006daca4786abe31e3...

And yes, I'm working with upstream on this.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:31 UTC (Sat) by pabs (subscriber, #43278) [Link] (1 responses)

Is anyone working on adding treesplitting to git itself? Your docs mention that the tree duplication issue occurs with git too.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:42 UTC (Sat) by johill (subscriber, #25196) [Link]

I'm not aware of that. It would probably mean a new object type in git, or such, I haven't really thought about it.

However, it's not nearly as bad in git? You're not storing hundreds of thousands of files in a folder in git, presumably? :-) Not sure how much interest there would be in git on that.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:48 UTC (Sat) by johill (subscriber, #25196) [Link]

I should, however, mention that due to git's object format ("<type> <length>\x00<data>"), other tools can architecturally have an advantage in throughput. Because of the header, bup has to run the content-based splitting first, and can only start hashing the object once it knows how long it is. If you don't have the limitation of the git storage format, you can do without such a header and compute both hashes in parallel, stopping once you find the split point. I've been thinking about mitigating that with threading, but it's a bit difficult right now in bup's software architecture. (Incidentally, python is not the issue here, since the hash splitting is entirely in C in my tree, so it can run outside the GIL.)
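
The header dependency is easy to see in a few lines; this reproduces git's actual blob hashing, where the length prefix means nothing can be hashed until the object's size is known:

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Hash a blob the way git does: the "<type> <length>\\x00" header comes
    first, so the full length must be known before hashing can start."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# should match `git hash-object` on the same content:
assert git_blob_id(b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```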

Updating the Git protocol for SHA-256

Posted Jul 8, 2020 19:22 UTC (Wed) by nix (subscriber, #2304) [Link]

> And yes, I'm working with upstream on this.

By this point, as a mere observer, I would say you *are* one of upstream. You're one of the two people doing most of the bup commits and have been for over a year now. :)

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 15:36 UTC (Sat) by hmh (subscriber, #3838) [Link] (6 responses)

I am not well versed in multihash, but a first look failed to find a canonical, immutable registry of the algorithms already in use and their mappings to IDs (the numerical ones, which are ABI since they end up embedded in the base# representations), along with procedures for interacting with such a registry. I mean something like what IANA does.

At that point it becomes app-specific, and other than the obvious protocol best practice that you should explicitly encode the protocol version (in this case, which hash and which hash parameters, if not implied), there is little to be gained.

Prefixing (hidden by base# or explicitly) the hash type in git has already been covered by other replies and posts, and yes, imho it really should be done if at all possible.

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 17:11 UTC (Sat) by cyphar (subscriber, #110703) [Link] (5 responses)

Multihash defines exactly two things: an extensible format and a table of hash functions. So it definitely does what you say it doesn't (in fairness, the link @ms-tg gave you isn't as useful as the project's page[1]).

There isn't an IANA-like procedure; everything is done via PRs on GitHub, but that's just a difference in administrative structure.

[1]: https://multiformats.io/multihash/
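
The format itself is simple enough to sketch. This assumes the sha2-256 function code 0x12 from the multihash table; real multihash encodes the code and length as varints, which collapse to a single byte each for this case:

```python
import hashlib

def multihash_sha256(data: bytes) -> bytes:
    """Minimal multihash encoding for sha2-256: function code 0x12, then
    the digest length 0x20, then the digest itself. (Real multihash uses
    varints for the first two fields; they fit in one byte each here.)"""
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest

mh = multihash_sha256(b"hello")
assert mh.hex().startswith("1220")   # every sha2-256 multihash does
assert len(mh) == 2 + 32
```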

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 18:42 UTC (Sat) by hmh (subscriber, #3838) [Link] (1 responses)

A procedure to add new hashes is a procedure; PRs on github are fine.

The link you sent is much better; the other one lacks essential information...

I am quite sure git would severely restrict the allowed hashes, but at least the design of multihash seems sane and safely extensible, including when one makes the short-sighted error of enshrining short prefixes of the hash anywhere other than a throwaway command-line call... a bad practice that is very common among git users.

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 23:17 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

> including when one makes the short-sighted error of enshrining short prefixes of the hash anywhere other than a throwaway command-line call... a bad practice that is very common among git users

"Best practice" for short usage in more permanent places includes the date (or tag description) and the summary of the commit in question (which both greatly ease conflict resolution when it occurs and give some idea of what's going on without having to copy/paste the hash yourself).

IANA

Posted Jun 22, 2020 15:37 UTC (Mon) by tialaramex (subscriber, #21167) [Link]

IANA offers a _lot_ of different procedures. They vary from Private Use and Experimental (chunks of namespace carved off entirely for users to do with as they please, without talking to IANA at all) through to Standards Action (you must publish an IETF Standards Track document, e.g. a Best Common Practice or an RFC explicitly designated Internet Standard). Where the namespace is hierarchically infinite or near-infinite (e.g. OIDs, DNS), IANA just delegates one layer of the namespace and more or less lets the hierarchy sort it out. Technically these OIDs don't even belong to IANA (it hijacked the ones used for the Internet many years ago), but it delegates them this way anyway, and it's too late for the standards organisations that minted them to say "no".

RFC 8126 lists 10 such procedures for general use in new namespaces.

So what Multihash are doing here sounds like a typical new IANA namespace which has an Experimental/ Private Use region (self-assigned) and then Specification Required for the rest of the namespace. You must document what you're doing, maybe with a Standards Organisation, maybe you write a white paper, maybe even you just spin up a web site with a technical rant, but you need to document it and then you get reviewed and maybe get in.

Apparently Multihash is writing up some sort of formal document, maybe to go to the IETF, but given that they started in 2016 and it's not that hard, they may not ever get it polished up and standardised anywhere. That's not a problem, though.

Updating the Git protocol for SHA-256

Posted Jun 24, 2020 4:03 UTC (Wed) by nevyn (guest, #33129) [Link] (1 responses)

Hmm, as someone who has done a bunch of work with hashes over the last couple of years, I'd not heard of multihash before, and looking at https://multiformats.io/#projects-using-multiformats it seems the main user is still just IPFS. This wouldn't necessarily be bad if it were new and gaining usage, but it's more worrying given that it's been around for over half a decade and is supposed to be established.

Another similar point is the table itself: hashes are added ad hoc, when someone uses them and wants to use multihash... again, fine if the project is very new and gaining traction, but much less good if the project is established and you go and see that none of https://github.com/dgryski/dgohash are there. I understand it's volunteer-based contributions, but if you want people to actually use your standard, it's going to be much easier if they can use it without having to self-register well-known, decade-old hash types.

Then there's the format itself. I understand that the hashes are variable-length, but showing abbreviated hashes is very well established at this point. A new git repo shows 7 characters for the --abbrev hash, ansible with over 50k commits only shows 10 (and even then github only shows 7), and they want to add "1220" to the front of that? And they really want you to show it to the user all the time? Even if abbreviated hashes weren't a thing, most users are going to think it's a bit weird if literally all the hashes they see start with the same 4 hex characters (at a minimum; using blake2b will eat 6, I think). I also doubt many developers would want to store the hashes natively, because it doesn't take many instances before storing the exact same byte sequence with each piece of actual data becomes more than trivial waste.

Updating the Git protocol for SHA-256

Posted Jun 25, 2020 17:02 UTC (Thu) by pj (subscriber, #4506) [Link]

...all valid criticisms, but I've yet to see an alternative with equivalent functionality and more widespread support. If you know of one, I'd love to hear about it! Though as you say, multihash is still fairly young so would likely welcome feedback that would help adoption/functionality/usability.

Updating the Git protocol for SHA-256

Posted Jun 19, 2020 17:27 UTC (Fri) by david.a.wheeler (subscriber, #72896) [Link] (2 responses)

The problem with $5$ is that it will be interpreted by shells and mangled. It's very common to have commands with hash values, e.g.,

git reset HASH_VALUE

But having a standard prefix is reasonable. I had proposed rotating the first character of the hash value by 16, so that 0 becomes g, 1 becomes h, and so on. Then you can determine from the first character what encoding is used. You can extend that further with additional rotations or by encoding more characters.
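
A hypothetical sketch of that rotation (the function names are invented; only the 0 -> g, 1 -> h mapping comes from the proposal above):

```python
# Hypothetical sketch of the rotation scheme: shift only the first hex
# character by 16 positions, so 0 -> g, 1 -> h, ..., f -> v.
HEX = "0123456789abcdef"
ROT = "ghijklmnopqrstuv"

def tag_sha256(hexdigest: str) -> str:
    """Mark a hex digest as SHA-256 by rotating its first character."""
    return ROT[HEX.index(hexdigest[0])] + hexdigest[1:]

def hash_kind(h: str) -> str:
    return "sha256" if h[0] in ROT else "sha1"

assert tag_sha256("f00d") == "v00d"
assert hash_kind("v00d") == "sha256" and hash_kind("f00d") == "sha1"
```

Because the rotated alphabet is disjoint from [0-9a-f], the tagged value still survives shells unmangled, unlike a $5$ prefix.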

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 10:51 UTC (Sat) by cpitrat (subscriber, #116459) [Link] (1 responses)

Or add the prefix for all but SHA-1: a first char of [0-9a-f] would mean SHA-1, and for the others the prefix must not be removed. A prefix of g would be SHA-256, and so on. That's not very different from multihash, though. The pain is the complexity of the SHA-1 exception, which is not that awful for full hashes (you can look at the length, and as all the others will have a prefix there's no risk of collision). Shortened hashes add a layer of mess to the problem...

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 14:22 UTC (Sat) by gavinbeatty (guest, #139659) [Link]

g is used as a prefix for SHA-1 when using git describe, but point taken: make it non-[a-g0-9].

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 20:36 UTC (Sat) by josh (subscriber, #17465) [Link]

A single-character prefix would suffice to disambiguate:

[0-9a-f]+ would be SHA-1.
T[0-9a-f]+ would be SHA-256
Pick a new capital letter [G-Z] for each new hash.

Updating the Git protocol for SHA-256

Posted Jun 25, 2020 4:41 UTC (Thu) by draco (subscriber, #1792) [Link]

I was surprised they didn't switch to using Base64 instead of hex. You'd need only 43 characters (vs today's 40) instead of 64. Prefix one more character for the hash method and you only add 4. [Yes, standard Base64 would always append an '=', but since the input is fixed-length, there's no need to include it.]
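
The arithmetic is easy to check: 32 bytes encode to 44 Base64 characters, of which the final one is always the droppable '=' padding:

```python
import base64
import hashlib

digest = hashlib.sha256(b"example").digest()         # 32 bytes = 256 bits
b64 = base64.urlsafe_b64encode(digest).rstrip(b"=")  # drop the fixed padding
assert len(digest.hex()) == 64                       # hex: 64 characters
assert len(b64) == 43                                # base64: 43 characters
```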

But instead it looks like they'll stick with hex and disambiguate via ^{sha1} and ^{sha256} suffixes.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds