LWN: Comments on "Updating the Git protocol for SHA-256" https://lwn.net/Articles/823352/ This is a special feed containing comments posted to the individual LWN article titled "Updating the Git protocol for SHA-256". en-us Thu, 09 Oct 2025 17:10:25 +0000 Thu, 09 Oct 2025 17:10:25 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Updating the Git protocol for SHA-256 https://lwn.net/Articles/825620/ https://lwn.net/Articles/825620/ nix <div class="FormattedComment"> OK, you live and learn: git:// allows pack reception! I clearly never read that part of the git-daemon manpage :)<br> </div> Wed, 08 Jul 2020 19:28:27 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/825618/ https://lwn.net/Articles/825618/ nix <div class="FormattedComment"> <font class="QuotedText">&gt; And yes, I&#x27;m working with upstream on this.</font><br> <p> By this point, as a mere observer, I would say you *are* one of upstream. You&#x27;re one of the two people doing most of the bup commits and have been for over a year now. :)<br> <p> </div> Wed, 08 Jul 2020 19:22:46 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824637/ https://lwn.net/Articles/824637/ johill <div class="FormattedComment"> I should, however, mention that due to git&#x27;s object format (&quot;&lt;type&gt; &lt;length&gt;\x00&lt;data&gt;&quot;) other tools can architecturally have an advantage on throughput. Due to the header, bup has to run the content-based splitting first, and then start hashing the object only once it knows how long it is. If you don&#x27;t have the limitation of the git storage format, you can do without such a header and do both hashes in parallel, stopping once you find the split point. I&#x27;ve been thinking about mitigating that with threading, but it&#x27;s a bit difficult right now in bup&#x27;s software architecture. 
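(To make the header constraint concrete: a minimal sketch, assuming nothing about bup's actual code, of how a git object id is computed. The length prefix in the header is why hashing can only start once the object's full length, i.e. the split point, is known.)

```python
import hashlib

def git_object_id(obj_type: bytes, data: bytes) -> str:
    # Git hashes "<type> <length>\x00<data>", so the full length of the
    # object must be known before the first byte can be fed to the hash.
    header = obj_type + b" " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()

# The well-known id of the empty blob:
print(git_object_id(b"blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```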
(Incidentally, python is not the issue here, since the hash splitting is entirely in C in my tree, so can run outside the GIL.)<br> </div> Sat, 27 Jun 2020 07:48:20 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824636/ https://lwn.net/Articles/824636/ johill <div class="FormattedComment"> I&#x27;m not aware of that. It would probably mean a new object type in git, or such, I haven&#x27;t really thought about it.<br> <p> However, it&#x27;s not nearly as bad in git? You&#x27;re not storing hundreds of thousands of files in a folder in git, presumably? :-) Not sure how much interest there would be in git on that.<br> </div> Sat, 27 Jun 2020 07:42:59 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824633/ https://lwn.net/Articles/824633/ pabs <div class="FormattedComment"> Is anyone working on adding treesplitting to git itself? Your docs mention that the tree duplication issue occurs with git too.<br> </div> Sat, 27 Jun 2020 07:31:49 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824632/ https://lwn.net/Articles/824632/ johill <div class="FormattedComment"> Thinking about ransomware, I think you should be able to configure permissions on restic&#x27;s use of AWS/S3 to prevent deletion? There&#x27;s a use case that&#x27;s more interesting for asymmetric encryption - not letting the machine that&#x27;s doing the backup access its own old data, in case it&#x27;s compromised, as mentioned on that ticket. Maybe you can even configure the S3 bucket to be read-only of sorts (e.g.
by storing in deep archive or glacier, and not letting the IAM account do any restores from there), but I don&#x27;t know how much or what restic needs to read out of the repo to make a new backup.<br> <p> Borg&#x27;s encryption design seems to have one issue - as far as I can tell, the &quot;content-based chunker&quot; has a very small key (they claim 32 bits, but say it goes linearly through the algorithm, so not all of those bits eventually matter), which would seem to allow fingerprinting attacks (&quot;you have this chain of chunk sizes, so you must have this file&quot;). Borg also has been debating S3 storage for years without any movement.<br> <p> <p> Ultimately I landed with bup (which I had used previously), and have been working on adding to bup both (asymmetric) encryption support and AWS/S3 storage; in the latter case you can effectively make your repo append-only (to the machine that&#x27;s making the backup), i.e. AWS permissions ensure that it cannot actually delete the data. It could delete some metadata tables etc.
but that&#x27;s mostly recoverable (though I haven&#x27;t written the recovery tools yet), apart from the ref names (which are only stored in DynamoDB for consistency reasons; S3 has almost no consistency guarantees).<br> <p> It&#x27;s probably not ready for mainline yet (and we&#x27;re busy finishing the python 3 port in mainline), but I&#x27;ve actually used it recently to begin storing some of my backups (currently ~850GiB) in S3 Deep Archive.<br> <p> Configuration references:<br> <a href="https://github.com/jmberg/bup/blob/master/Documentation/bup-encrypted.7.md">https://github.com/jmberg/bup/blob/master/Documentation/b...</a><br> <a href="https://github.com/jmberg/bup/blob/master/Documentation/bup-aws.7.md">https://github.com/jmberg/bup/blob/master/Documentation/b...</a><br> <p> Some design documentation is in the code:<br> <a href="https://github.com/jmberg/bup/blob/master/lib/bup/repo/encrypted.py">https://github.com/jmberg/bup/blob/master/lib/bup/repo/en...</a><br> <p> If you use it, there are two other things in my tree that you&#x27;d probably want:<br> <p> 1) with a lot of data, the content-based splitting on 13 bits results in far too much metadata (and storage isn&#x27;t that expensive anymore), so you&#x27;d want to increase that. Currently in master that&#x27;s not configurable, but I changed that: <a href="https://github.com/jmberg/bup/blob/master/Documentation/bup-settings.7.md">https://github.com/jmberg/bup/blob/master/Documentation/b...</a><br> <p> 2) if you have lots of large directories (e.g. maildir) then minor changes to those currently consume a significant amount of storage space since the entire folder is saved again (the list of files). I have &quot;treesplit&quot; in my code that allows splitting up those trees (again, content-based) to avoid that issue, which for my largest maildir of ~400k files brings down the amount of new data saved from close to 10 MB (after compression) to &lt;&lt;50kB when a new email is written there.
Looks like I didn&#x27;t document that yet, but I should add it here: <a href="https://github.com/jmberg/bup/blob/master/Documentation/bup-settings.7.md">https://github.com/jmberg/bup/blob/master/Documentation/b...</a>. The commit describes it a bit now: <a href="https://github.com/jmberg/bup/commit/44006daca4786abe31e32a969a08778133496663">https://github.com/jmberg/bup/commit/44006daca4786abe31e3...</a><br> <p> <p> And yes, I&#x27;m working with upstream on this.<br> </div> Sat, 27 Jun 2020 07:19:37 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824629/ https://lwn.net/Articles/824629/ pabs <div class="FormattedComment"> borg is the other modern chunking backup system:<br> <p> <a href="https://borgbackup.github.io/borgbackup/">https://borgbackup.github.io/borgbackup/</a><br> <p> There is also bup, much more closely related to git:<br> <p> <a href="https://github.com/bup/bup">https://github.com/bup/bup</a><br> <p> </div> Sat, 27 Jun 2020 05:46:09 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824627/ https://lwn.net/Articles/824627/ ras <div class="FormattedComment"> Your comment led me to look up restic, and I was thinking &quot;finally, this is it&quot;, then I discovered <a href="https://github.com/restic/restic/issues/187">https://github.com/restic/restic/issues/187</a>. With ransomware a thing it&#x27;s a major omission, and sadly there have been two years with no movement. Shame.<br> <p> But you say it has friends?<br> </div> Sat, 27 Jun 2020 05:14:41 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824599/ https://lwn.net/Articles/824599/ plugwash <div class="FormattedComment"> Most vulnerabilities in hashes seem to chip away incrementally at the strength, rather than being immediate and complete breaks.
So having more bits of headroom gives you more time from when the cryptographers start chipping away at the strength to when you have a practical break.<br> <p> I would also expect hash functions with a larger internal state to be more secure even if their output size is the same. Even if the difficulty of finding a collision is similar, the collision is less useful if you can&#x27;t just tack on an arbitrary suffix.<br> </div> Fri, 26 Jun 2020 15:45:49 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824547/ https://lwn.net/Articles/824547/ smitty_one_each <div class="FormattedComment"> One might have been tempted to consider leaving git as-is, and calling the SHA-256 version &quot;got&quot;, and then using a script to walk the tree with checkouts and commits to convert the repo from git to got.<br> <p> Which is probably harder than I realize to accomplish.<br> </div> Fri, 26 Jun 2020 11:31:42 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824493/ https://lwn.net/Articles/824493/ pj <div class="FormattedComment"> ...all valid criticisms, but I&#x27;ve yet to see an alternative with equivalent functionality and more widespread support. If you know of one, I&#x27;d love to hear about it! Though as you say, multihash is still fairly young so would likely welcome feedback that would help adoption/functionality/usability.<br> </div> Thu, 25 Jun 2020 17:02:47 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824492/ https://lwn.net/Articles/824492/ kmweber <div class="FormattedComment"> This was interesting to me because, despite being obvious in retrospect, it never occurred to me that git&#x27;s use of hashes guarantees integrity of the revision history.
I always saw it solely as a means of implementing a content-addressable store.<br> </div> Thu, 25 Jun 2020 16:36:15 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824444/ https://lwn.net/Articles/824444/ grawity <p>Well, if the password is actually empty, at least OpenSSH will outright let you <em>skip</em> password-based authentication &ndash; no password prompts to be shown. I have seen actual Git and Hg servers which use this (if I remember correctly, the OpenSolaris Hg repository used to be served exactly this way).</p> <p>Sure you could argue that you still need a known username, but that can be simply included in the git+ssh:// URL (like people <strong>already do</strong> with <code>git@github.com</code>).</p> <p>(Still, even if you <em>had</em> to press Enter at a blank password prompt, that's how CVS <a href="https://durak.org/sean/pubs/software/cvsbook/Anonymous-Access.html">pserver</a> used to work and everyone accepted it as "anonymous access" all the same.)</p> Thu, 25 Jun 2020 09:09:00 +0000 No capabilities only for "dumb" HTTP protocol https://lwn.net/Articles/824438/ https://lwn.net/Articles/824438/ jnareb <div class="FormattedComment"> <font class="QuotedText">&gt; This provides a clear path forward for the most commonly used Git protocol (git://). It does not, however, address less desirable methods such as communicating over HTTP (http://), since that method does not provide capabilities.</font><br> <p> Actually the Git protocol (with capabilities) is used with three transport methods: bare TCP (git://) - unauthenticated and nowadays rarely used, SSH, and &quot;smart&quot; HTTP(s) (http:// and https://). If you use GitHub, Bitbucket, GitLab or any other hosting site, you are using the encapsulated git protocol whether you use SSH or https:// URLs.<br> <p> It is only *&quot;dumb&quot; HTTP* that has problems; it relies on a WebDAV-capable web server and on `git update-server-info` being run on each repository update (usually from hooks).
&quot;Smart&quot; HTTP protocol relies on the `git http-backend` CGI program or equivalent.<br> <p> So in my opinion it should be s/such as communicating over HTTP/such as communicating over &quot;dumb&quot; HTTP/<br> </div> Thu, 25 Jun 2020 08:05:21 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824429/ https://lwn.net/Articles/824429/ newren <div class="FormattedComment"> <font class="QuotedText">&gt; Arguably you still have time -- most of the work to date has been fixing the git code to where they can abstract away the hash algorithm at all (previously the code encoded object IDs as a raw char[40] all over). IIRC, at this stage, there&#x27;s still no non-experimental support for SHA-256 per se.</font><br> <p> I don&#x27;t think that&#x27;s quite a fair characterization; as far as I understand, there&#x27;s been quite a bit of sha-256 specific work -- the choice of sha-256 was made two years ago (and not earlier) because that was the point at which brian needed a decision to be made to proceed further on the transition plan. When someone tried to propose a different hash six months ago, this is part of what brian had to say:<br> <p> &quot;Because we decided some time ago, I&#x27;ve sent in a bunch of patches to our<br> testsuite to make it work with SHA-256. Some of these patches are<br> general, in that they make the tests generate values which are used, or<br> they are specific to the length of the hash algorithm. Others use<br> specific hash values, and changing the hash algorithm will require<br> recomputing all of these values.<br> <p> Absent a compelling security reason to abandon SHA-256, such as a<br> significant previously unknown cryptographic weakness, I don&#x27;t plan to<br> reimplement all of this work. Updating our testsuite to work<br> successfully with SHA-256 has taken a huge amount of time, and this work<br> has been entirely done on my own free time because I want the Git<br> project to be successful.
That doesn&#x27;t even include the contributions<br> of others who have reviewed, discussed, and contributed to the current<br> work and transition plan.&quot;<br> (Source: <a href="https://lore.kernel.org/git/20191223011306.GF163225@camp.crustytoothpaste.net/">https://lore.kernel.org/git/20191223011306.GF163225@camp....</a>)<br> <p> <font class="QuotedText">&gt; Perhaps this goes without saying, but since this is the kind of thing that can get very bikesheddy, performance numbers and strong arguments specifically refuting their reasons will probably do better than opinions.</font><br> <p> Yes, absolutely. And it would take someone as capable as brian volunteering to do all the work brian has been doing for the last few years, or some magic to convince brian to throw away part of his work and happily redo it for a new hash. I personally know almost nothing about all these hashes and have not been involved in the hash transition plan, but if blake3 is still impressive enough to you that you still want to try to change brian&#x27;s and possibly others&#x27; minds, I can at least point you to the thread where the sha256 decision was initially made. It may help you craft your arguments relative to performance and other characteristics. See it over here: <a href="https://lore.kernel.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/">https://lore.kernel.org/git/20180609224913.GC38834@genre....</a><br> </div> Thu, 25 Jun 2020 06:52:44 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824427/ https://lwn.net/Articles/824427/ draco <div class="FormattedComment"> I was surprised they didn&#x27;t switch to using Base64 instead of hex. You&#x27;d need only 43 characters (vs today&#x27;s 40) instead of 64. Prefix one more character for the hash method and you only add 4.
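(Checking the arithmetic with a quick illustrative sketch; the input is generic, nothing git-specific:)

```python
import base64
import hashlib

digest = hashlib.sha256(b"example").digest()
assert len(digest) == 32             # 256 bits

hex_form = digest.hex()
b64_form = base64.b64encode(digest).decode("ascii")

print(len(hex_form))                 # 64 hex characters
print(len(b64_form))                 # 44, including one '=' pad
print(len(b64_form.rstrip("=")))     # 43 without the padding
```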
[Yes, standard Base64 would always append an &#x27;=&#x27;, but since the input is fixed-length, there&#x27;s no need to include it.]<br> <p> But instead it looks like they&#x27;ll stick with hex and disambiguate via ^{sha1} and ^{sha256} suffixes.<br> </div> Thu, 25 Jun 2020 04:41:32 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824425/ https://lwn.net/Articles/824425/ draco <div class="FormattedComment"> Arguably you still have time -- most of the work to date has been fixing the git code to where they can abstract away the hash algorithm at all (previously the code encoded object IDs as a raw char[40] all over). IIRC, at this stage, there&#x27;s still no non-experimental support for SHA-256 per se.<br> <p> Here&#x27;s the criteria they used to choose SHA-256, from git.git/Documentation/technical/hash-function-transition.txt:<br> <p> 1. A 256-bit hash (long enough to match common security practice; not<br> excessively long to hurt performance and disk usage).<br> <p> 2. High quality implementations should be widely available (e.g., in<br> OpenSSL and Apple CommonCrypto).<br> <p> 3. The hash function&#x27;s properties should match Git&#x27;s needs (e.g. Git<br> requires collision and 2nd preimage resistance and does not require<br> length extension resistance).<br> <p> 4. As a tiebreaker, the hash should be fast to compute (fortunately<br> many contenders are faster than SHA-1).<br> <p> Looking at the git history of the file, their candidates included: SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256.<br> <p> From the commit message in which they down-selected to SHA-256:<br> <p> &quot;From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256, K12, and so on are all believed to have similar security properties. 
All are good options from a security point of view.<br> <p> SHA-256 has a number of advantages:<br> <p> * It has been around for a while, is widely used, and is supported by<br> just about every single crypto library (OpenSSL, mbedTLS, CryptoNG,<br> SecureTransport, etc).<br> <p> * When you compare against SHA1DC, most vectorized SHA-256<br> implementations are indeed faster, even without acceleration.<br> <p> * If we&#x27;re doing signatures with OpenPGP (or even, I suppose, CMS),<br> we&#x27;re going to be using SHA-2, so it doesn&#x27;t make sense to have our<br> security depend on two separate algorithms when either one of them<br> alone could break the security when we could just depend on one.<br> <p> So SHA-256 it is.&quot;<br> <p> Perhaps this goes without saying, but since this is the kind of thing that can get very bikesheddy, performance numbers and strong arguments specifically refuting their reasons will probably do better than opinions.<br> </div> Thu, 25 Jun 2020 04:23:33 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824313/ https://lwn.net/Articles/824313/ nevyn <div class="FormattedComment"> Hmm, as someone who has done a bunch of work with hashes over the last couple of years I&#x27;d not heard of multihash before, and looking at <a href="https://multiformats.io/#projects-using-multiformats">https://multiformats.io/#projects-using-multiformats</a> it seems the main user is still just ipfs. This wouldn&#x27;t necessarily be bad if it was new and gaining usage, but it&#x27;s more worrying given it&#x27;s been around over half a decade and supposed to be established.<br> <p> Another similar point is the table itself, the hashes added are done ad hoc when someone uses them and wants to use multihash ... 
again, fine if the project is very new and gaining traction but much less good if the project is established and you go see that none of <a href="https://github.com/dgryski/dgohash">https://github.com/dgryski/dgohash</a> are there. I understand it&#x27;s volunteer-based contributions but if you want people to actually use your std. it&#x27;s going to be much easier if they can use it without having to self-register well known/used decade old types.<br> <p> Then there&#x27;s the format itself. I understand that hashes are variable length but showing abbreviated hashes is very well known at this point. A new git repo shows 7 characters for the --abbrev hash, ansible with over 50k commits only shows 10 (and even then github only shows 7), and they want to add &quot;1220&quot; to the front of that? And they really want you to show it to the user all the time? Even if abbreviated hashes weren&#x27;t a thing, most users are going to think it&#x27;s a bit weird if literally all the hashes they see start with the same 4 hex characters (at a minimum -- using blake2b will eat 6, I think). I also doubt many developers would want to store the hashes natively, because it doesn&#x27;t take many instances before storing the exact same byte sequence with each piece of actual data becomes more than trivial waste.<br> <p> </div> Wed, 24 Jun 2020 04:03:03 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824224/ https://lwn.net/Articles/824224/ Hattifnattar <p>No, it is not <i>known</i> to be more secure. Unfortunately with the current state of the art this is impossible to know.</p> <p>Sure, it has a bigger key space, but 256 bit already makes a random collision astronomically unlikely. The real problem is vulnerabilities.
And any vulnerability found in SHA-256 is pretty much guaranteed to be present in SHA-512, and vice versa.</p> Tue, 23 Jun 2020 14:53:09 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824209/ https://lwn.net/Articles/824209/ dezgeg <div class="FormattedComment"> There is the "none" authentication method that can be used. E.g. "ssh nethack@alt.org" seems to use that. I suppose then the only thing needed is configuring the SSH server to ignore the username. <br> </div> Tue, 23 Jun 2020 12:19:29 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824202/ https://lwn.net/Articles/824202/ niner <div class="FormattedComment"> That's not really anonymous ssh, it's just ssh with a publicly known user name and password (in this case "anonymous" and "").<br> </div> Tue, 23 Jun 2020 07:20:28 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824197/ https://lwn.net/Articles/824197/ pabs <div class="FormattedComment"> "there is no such thing as unpassworded anonymous guest ssh access" doesn't appear to be true: <br> <p> <a href="https://askubuntu.com/questions/583141/passwordless-and-keyless-ssh-guest-access">https://askubuntu.com/questions/583141/passwordless-and-k...</a><br> <a href="https://singpolyma.net/2009/11/anonymous-sftp-on-ubuntu/">https://singpolyma.net/2009/11/anonymous-sftp-on-ubuntu/</a><br> <p> PS: branchable.com allows anonymous git:// pushes to wikis. 
<br> <p> <a href="http://ikiwiki.info/tips/untrusted_git_push/">http://ikiwiki.info/tips/untrusted_git_push/</a><br> <a href="https://ikiwiki-hosting.branchable.com/todo/anonymous_git_push/">https://ikiwiki-hosting.branchable.com/todo/anonymous_git...</a><br> </div> Tue, 23 Jun 2020 02:10:31 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824167/ https://lwn.net/Articles/824167/ xnox <div class="FormattedComment"> It feels dated to use SHA256 instead of BLAKE3.<br> <p> BLAKE3 is faster on both 32bit and 64bit arches, over big and small inputs. And for big stuff, it supports streaming validation and incremental hash updates, such that one can verify large pack files as one is receiving them.<br> <p> I wonder if it is too late to consider BLAKE3.<br> </div> Mon, 22 Jun 2020 18:43:48 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824163/ https://lwn.net/Articles/824163/ nix <div class="FormattedComment"> You can tunnel anything over ssh, but the git protocol is meant for *anonymous* fetching -- and there is no such thing as unpassworded anonymous guest ssh access :)<br> <p> Dumb HTTP doesn't require a git server -- it only needs an HTTP server that can serve files. It's much slower and transfers a lot more than the smart protocol, but if you need it you really need it. Like git bundles, it's useful for getting stuff to/from networkologically constrained environments.<br> </div> Mon, 22 Jun 2020 16:02:20 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824145/ https://lwn.net/Articles/824145/ mirabilos <div class="FormattedComment"> You need dumb http because there is no git server (initially), you just have reading (and possibly writing) access to a remote repository, over http, ssh, or something.
Or even local files.<br> <p> The git protocol is only used when there’s an actual server process involved, which isn’t always possible.<br> </div> Mon, 22 Jun 2020 16:02:20 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824138/ https://lwn.net/Articles/824138/ NYKevin <div class="FormattedComment"> Ugh, I have so many questions here:<br> <p> - Why do we need both dumb and smart HTTP(S)? Should the client even care what the server looks like internally?<br> - Why isn't local just a special case of rsync?<br> - The inclusion of both git and ssh in the list is questionable (you can tunnel anything over ssh, right?) but it's probably too late to fix now.<br> <p> IIRC Mercurial has a grand total of three: HTTP(S), SSH, and local.<br> </div> Mon, 22 Jun 2020 15:44:14 +0000 IANA https://lwn.net/Articles/824135/ https://lwn.net/Articles/824135/ tialaramex <div class="FormattedComment"> IANA offers a _lot_ of different procedures. Varying from Private Use and Experimental (chunks of namespace carved off entirely for users to do with as they please without talking to IANA at all) through to Standards Action (you must publish an IETF Standards Track document e.g. a Best Common Practice or an RFC explicitly designated Internet Standard) and where the namespace is hierarchically infinite or near infinite (e.g. OIDs, DNS) IANA just delegates one layer of the namespace and more or less lets the hierarchy sort it out. Technically these OIDs don't even belong to IANA (it hijacked the ones used for the Internet many years ago) but it delegates them this way anyway and it's too late for the standards organisations that minted them to say "No".<br> <p> RFC 8126 lists 10 such procedures for general use in new namespaces.<br> <p> So what Multihash are doing here sounds like a typical new IANA namespace which has an Experimental/ Private Use region (self-assigned) and then Specification Required for the rest of the namespace. 
You must document what you're doing, maybe with a Standards Organisation, maybe you write a white paper, maybe even you just spin up a web site with a technical rant, but you need to document it and then you get reviewed and maybe get in.<br> <p> Apparently Multihash is writing up some sort of formal document to maybe go to the IETF, but given that they started in 2016 and it's not that hard, they may not ever get it polished up and standardised anywhere; it's not a problem.<br> </div> Mon, 22 Jun 2020 15:37:09 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824080/ https://lwn.net/Articles/824080/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; So, HTTP(S) is not merely a transport in git but a completely different protocol?</font><br> <p> There are actually two different http/https transports in git, the older "dumb" transport (put the files somewhere visible to the http daemon, make it export that directory through http, done), and the newer "smart" transport (which is more similar to a CGI script). So if I'm not miscounting, we have a total of six different transports in git: the "git" transport, the "dumb" http transport, the "smart" http transport, the ssh transport, the rsync transport, and the "local" transport (pointing directly to a local filesystem).<br> </div> Mon, 22 Jun 2020 14:03:23 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824063/ https://lwn.net/Articles/824063/ jezuch <div class="FormattedComment"> So, HTTP(S) is not merely a transport in git but a completely different protocol?<br> <p> Ouch.<br> <p> Also, is it really "less desirable"? AFAICT all the hosting providers are only allowing cloning via HTTPS...
At least that I know of.<br> </div> Mon, 22 Jun 2020 12:16:48 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824043/ https://lwn.net/Articles/824043/ cyphar <div class="FormattedComment"> They do use variable-sized chunks (more specifically, content-defined chunking), but those chunking algorithms still require you to specify how large you want your chunks to be on average (or in restic's case, the chunking algorithm also asks what the maximum and minimum chunk sizes are). So you still have to decide on the trade-off between chunks that are too large and chunks that are too small.<br> </div> Mon, 22 Jun 2020 02:07:11 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824042/ https://lwn.net/Articles/824042/ NYKevin <div class="FormattedComment"> <font class="QuotedText">&gt; Maybe SHA-3, but that's still sort of new and less tested.</font><br> <p> This is the essential problem. There will always be shiny new hash functions that may or may not actually be secure. There will always be new threats against old functions. It is impossible to know, right now, what hash function you will need to be using in ten years' time. If you are not designing your system to regularly switch hash functions, you are not designing for security.<br> <p> That's why they are making this extensible. 
They have the humility to realize that we don't know what we're going to need tomorrow.<br> </div> Mon, 22 Jun 2020 01:57:16 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/824009/ https://lwn.net/Articles/824009/ pabs <div class="FormattedComment"> restic and friends use variable-sized chunks; that seems to be the way to go to me.<br> </div> Sun, 21 Jun 2020 13:46:47 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/823999/ https://lwn.net/Articles/823999/ mb <div class="FormattedComment"> Pushing all problems to the user does not solve them.<br> It makes life easier for the developers, but hard for all users.<br> One should always try to avoid the need for help from users when upgrading/extending things. There are so many examples where this just made the process take forever, because so many users don't want to put the effort in. (e.g. Python 2-&gt;3).<br> </div> Sun, 21 Jun 2020 11:34:39 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/823989/ https://lwn.net/Articles/823989/ Otus <div class="FormattedComment"> SHA-512 is only faster on 64-bit systems when hashing long enough messages. With short inputs it is slower. And of course it doubles stored hash lengths, increasing overhead.<br> <p> IMO choosing something more secure makes little sense when SHA-256 remains unbroken. Maybe SHA-3, but that's still sort of new and less tested.<br> </div> Sun, 21 Jun 2020 07:08:45 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/823983/ https://lwn.net/Articles/823983/ flussence <div class="FormattedComment"> SHA-512 isn't really state of the art (it's part of SHA-2 alongside 256, and SHA-3 exists), but it is known to be both faster and more secure than SHA-256.<br> </div> Sun, 21 Jun 2020 05:32:17 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/823977/ https://lwn.net/Articles/823977/ Kamilion <div class="FormattedComment"> ...
I don't understand why they aren't just jumping straight to the state of the art SHA512 and calling it done for the next two decades in one fell swoop instead of "let's support lots of arbitrary extensions". How about just two.<br> Shouldn't we be taking lessons learned from wireguard into account? If we move to SHA256 we're simply kicking the can down the road a bit further, but not solving the problem. Software selection for adoption in high profile projects like this tends to drive hardware acceleration, and I'd much rather see HW vendors arm-twisted into shooting for SHA512/ED25519/ChaCha20 accelerators than the current breed of ZLIB and AES-256 accelerators.<br> </div> Sun, 21 Jun 2020 02:58:11 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/823968/ https://lwn.net/Articles/823968/ mathstuf <div class="FormattedComment"> <font class="QuotedText">&gt; including when ones does the short-sighted error of enshrining short prefixes of the hash anywhere that is not a throw away command line call... A bad practice that is very common among git users.</font><br> <p> "Best practice" for short usage in more permanent places includes the date (or tag description) and summary of the commit in question (which both greatly ease conflict resolution when it occurs and give some idea of what's going on without having to copy/paste the hash yourself).<br> </div> Sat, 20 Jun 2020 23:17:26 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/823967/ https://lwn.net/Articles/823967/ mathstuf <div class="FormattedComment"> How big of a chunk do you think would work? Too little and the chunk hashes start to dwarf the content size. Too large and anything but trivial changes end up changing every chunk.
Sure, there are massive files in repositories, but these probably fall into one of a few buckets:<br> <p> - best left to git-lfs, git-annex, or some other off-loading tool<br> - machine generated data (of some kind) that changes rarely<br> - non-text artifacts that change rarely<br> <p> I think experiments to test the actual benefits in organic Git repositories this would be interesting, but I'd rather see the hash transition happen correctly and smoothly and it sounds complicated enough as it is. And it should be laying down version numbers into formats as it needs that such another transition could leverage to ease its upgrade path too.<br> </div> Sat, 20 Jun 2020 23:14:27 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/823966/ https://lwn.net/Articles/823966/ mathstuf <div class="FormattedComment"> What is to be done with commit references in commit messages and the like. I'd really like those to stay relevant…<br> </div> Sat, 20 Jun 2020 23:09:17 +0000 Updating the Git protocol for SHA-256 https://lwn.net/Articles/823951/ https://lwn.net/Articles/823951/ josh <div class="FormattedComment"> A single-character prefix would suffice to disambiguate:<br> <p> [0-9a-f]+ would be SHA-1.<br> T[0-9a-f]+ would be SHA-256<br> Pick a new capital letter [G-Z] for each new hash.<br> </div> Sat, 20 Jun 2020 20:36:15 +0000
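(A rough sketch of how such a prefix scheme could be parsed; the function name and details are hypothetical, purely to illustrate the idea, not anything git implements:)

```python
import re

# Hypothetical classifier for the scheme above: bare lowercase hex
# means SHA-1, a leading 'T' means SHA-256, and other capital letters
# would be reserved for future hash algorithms.
def hash_kind(oid: str) -> str:
    if re.fullmatch(r"[0-9a-f]+", oid):
        return "sha1"
    if re.fullmatch(r"T[0-9a-f]+", oid):
        return "sha256"
    return "unknown"

print(hash_kind("e69de29"))     # sha1
print(hash_kind("Tdeadbeef"))   # sha256
```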