
Updating the Git protocol for SHA-256

By John Coggeshall
June 19, 2020

The Git source-code management system has for years been moving toward abandoning the Secure Hash Algorithm 1 (SHA-1) in favor of the more secure SHA-256 algorithm. Recently, the project moved a step closer to that goal when contributors implemented new Git protocol capabilities to enable the transition.

Why move from SHA-1

Fundamentally, Git repositories are built on hash values, presently generated with the SHA-1 algorithm. A simplified explanation of the importance of hash values to Git follows; readers interested in more detail may want to look at our previous coverage.

SHA-1 hash values are strings that uniquely represent the contents of an object (for example, a source file); no two distinct objects should ever produce the same string. In Git, every object has a hash-value representation of its contents. The directory structure of these objects is stored in another kind of object, called a tree object, which is an organized hierarchy of hashes, each one pointing to a specific version of a specific object within the repository. Tree objects, as mentioned, are themselves hashed when stored in the repository. When a commit to the repository occurs, the basic steps are:

  • Files that changed are assigned new hash values.
  • A tree object containing the hashes for all of the files in their current state is created, then hashed.
  • A commit object referencing the tree object's hash is created and hashed.

In short, Git uses SHA-1 hashes everywhere to ensure the integrity of the repository's contents by effectively creating a chain of hash values of the objects representing that repository over time, similar to blockchain technology.
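
To make that concrete, here is a minimal sketch (not Git's actual implementation) of how an object ID is derived from an object's type, size, and contents; for SHA-256, only the algorithm name changes:

    import hashlib

    def git_object_id(data, obj_type="blob", algo="sha1"):
        # Git hashes a short header ("<type> <length>\0") followed by the raw
        # contents; the hex digest becomes the object ID used throughout the
        # repository, whether the object is a blob, a tree, or a commit.
        header = "{} {}\0".format(obj_type, len(data)).encode()
        return hashlib.new(algo, header + data).hexdigest()

    print(git_object_id(b"hello world\n"))                 # 40-character SHA-1 ID
    print(git_object_id(b"hello world\n", algo="sha256"))  # 64-character SHA-256 ID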

The problem with SHA-1, or any hashing algorithm, is that its usefulness erodes once collisions become practical to produce. A collision, in this case, means two pieces of data that produce the same hash value. If an attacker can replace the contents of an object in a way that still produces the same hash value, the idea of trusting the hash to uniquely identify the contents of a Git object breaks down. Worse, if an attacker could produce such collisions deliberately, say to inject malicious code, the security implications would be devastating: a file in the chain could be replaced unnoticed. Since practical compromises of SHA-1 have already happened, it is important to move away from it; that transition is one step closer with recent developments.

State of the SHA-256 transition

The primary force behind the move from SHA-1 to SHA-256 is contributor brian m. carlson, who has been working over the years to make the transition happen. It has not been an easy task; the original Git implementation hard-coded SHA-1 as the only supported algorithm, and countless repositories need to be transitioned from SHA-1 to SHA-256. Moreover, while the transition is taking place, Git needs to maintain interoperability between the two hash algorithms within a single repository, since users may still be using older Git clients.

The problems surrounding that transition are complicated. Different versions of Git clients and servers may or may not have SHA-256 support, and all repositories need to be able to work under both algorithms for some time to come. This means Git will need to keep track of objects in two different ways and work correctly regardless of the hashing algorithm. For example, hash values are often abbreviated by users when referencing commits: 412e40d041 instead of 412e40d041e861506bb3ac11a3a91e3, so even the fact that SHA-256 and SHA-1 hash values are different lengths is only marginally helpful.

In the latest round of patches, carlson proposes changes to the communication-protocol logic for dealing with the transition. The work apparently was not part of the original transition plan, but it became necessary in order to move forward, as carlson notes:

It was originally planned that we would not upgrade the protocol and would use SHA-1 for all protocol functionality until some point in the future. However, doing that requires a huge amount of additional work (probably incorporating several hundred more patches which are not yet written) and it's not possible to get the test suite to even come close to passing without a way to fetch and push repositories. I therefore decided that implementing an object-format extension was the best way forward.

The patch set enhances the pack protocol used by Git clients so that it keeps track of the hashing algorithm in use. This is implemented via the new object-format capability. In the patch to the protocol documentation, carlson describes the object-format capability as a way for Git to indicate support for particular hashing algorithms:

This capability, which takes a hash algorithm as an argument, indicates that the server supports the given hash algorithms [...] When provided by the client, this indicates that it intends to use the given hash algorithm to communicate.

If the client supports SHA-256 hashes, this change to the protocol allows that to be stated directly. When the capability is omitted, Git assumes that hash values are presented as SHA-1.
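
As a rough, hypothetical illustration of that negotiation (this is not Git's code; only the object-format capability name comes from the documentation quoted above), a client could pick the algorithm from the server's advertised capabilities like so:

    def negotiate_object_format(server_capabilities):
        # Gather every hash algorithm the server advertised via object-format;
        # if the capability is absent, SHA-1 is assumed, as the documentation says.
        offered = set()
        for cap in server_capabilities:
            if cap.startswith("object-format="):
                offered.add(cap.split("=", 1)[1])
        if not offered:
            return "sha1"
        return "sha256" if "sha256" in offered else sorted(offered)[0]

    print(negotiate_object_format(["agent=git/example"]))                          # sha1
    print(negotiate_object_format(["agent=git/example", "object-format=sha256"]))  # sha256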

This provides a clear path forward for the transports that use the full Git protocol: git://, SSH, and "smart" HTTP. It does not, however, address less desirable methods such as the "dumb" HTTP transport, since that method does not provide capabilities. To handle those situations, the implementation attempts to guess the type of hash algorithm being used by looking at the hash length. Carlson notes that this works, but it could be a problem if, at some point in the future, SHA-256 is replaced with a different algorithm that also produces 256-bit output. To this, however, carlson says that he believes any hashing algorithm that someday supersedes SHA-256 will produce output longer than 256 bits:

The other two cases are the dumb HTTP protocol and bundles, both of which have no object-format extension (because they provide no capabilities) and are therefore distinguished solely by their hash length. We will have problems if in the future we need to use another 256-bit algorithm, but I plan to be improvident and hope that we'll move to longer algorithms in the future to cover ourselves for post-quantum security.
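
For those capability-less cases, the digest length alone is enough to tell the two algorithms apart today; a hypothetical helper (again, not Git's code) shows the idea:

    def guess_hash_algorithm(object_id):
        # A full SHA-1 digest is 160 bits (40 hex characters), while SHA-256 is
        # 256 bits (64 hex characters); a future 256-bit algorithm would break this.
        lengths = {40: "sha1", 64: "sha256"}
        try:
            return lengths[len(object_id)]
        except KeyError:
            raise ValueError("not a full object ID: %r" % object_id)

    print(guess_hash_algorithm("2fd4e1c67a2d28fced849ee1bb76e7391b93eb12"))  # sha1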

Carlson acknowledges that his solution to the technical challenges of moving the project to SHA-256 isn't ideal. When cloning a repository, for example, the hashing algorithm used by the remote repository isn't known up front. Carlson's work gets around this with a two-step process:

Clone support is necessarily a little tricky because we are initializing a repository and then fetching refs, at which point we learn what hash algorithm the remote side supports. We work around this by calling the code that updates the hash algorithm and repository version a second time to rewrite that data once we know what version we're using. This is the most robust way I could approach this problem, but it is still a little ugly.

What comes next

With this milestone reached, the end is in sight for a fully working implementation of SHA-256-powered repositories. That will be a major step in the evolution of Git, and it arguably places the project on solid footing for the future. In fact, carlson laid out what he expects the remaining patches to consist of:

Additional future series include one last series of test fixes (28 patches) plus six final patches in the series that enables SHA-256 support.

In closing, it is worth noting that one of the reasons this transition has been so hard is that the original Git implementation was not designed to swap out hashing algorithms. Much of the work put into the SHA-256 implementation has been walking back that initial design flaw. With these changes almost complete, Git not only gains an alternative to SHA-1, but it also becomes fundamentally indifferent to the hashing algorithm used. This should make Git more adaptable in the future, should the need to replace SHA-256 with something stronger arise.




Updating the Git protocol for SHA-256

Posted Jun 19, 2020 16:38 UTC (Fri) by sytoka (guest, #38525) [Link] (25 responses)

It could be possible to put $5$ at the beginning of the hash, with the same numbers as used by /etc/shadow! So $6$ would be for SHA-512, and so on. The challenge with SHA-1 stays the same, but the future would be easy.

Updating the Git protocol for SHA-256

Posted Jun 19, 2020 17:09 UTC (Fri) by pj (subscriber, #4506) [Link] (19 responses)

I'd love for them to be forward-looking enough to adopt something like multihash (https://richardschneider.github.io/net-ipfs-core/articles...)

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 0:34 UTC (Sat) by ms-tg (subscriber, #89231) [Link] (18 responses)

I second multihash, and I would like to hear an explanation for not using it.

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 14:07 UTC (Sat) by Ericson2314 (guest, #139248) [Link] (10 responses)

Speaking of IPFS things, https://discuss.ipfs.io/t/git-on-ipfs-links-and-reference...

Git really should start Merkleizing blob hashes / chunking blobs. Not only does it help with data exchange, but it also means faster hashing when a blob changes (O(log n) instead of O(n)). This transition is the best time to fix things like this; it's a pity that they don't seem to be under discussion.

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 23:14 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (9 responses)

How big of a chunk do you think would work? Too small and the chunk hashes start to dwarf the content size. Too large and anything but trivial changes ends up changing every chunk. Sure, there are massive files in repositories, but these probably fall into one of a few buckets:

- best left to git-lfs, git-annex, or some other off-loading tool
- machine generated data (of some kind) that changes rarely
- non-text artifacts that change rarely

I think experiments to test the actual benefits of this in organic Git repositories would be interesting, but I'd rather see the hash transition happen correctly and smoothly; it sounds complicated enough as it is. And it should be laying down version numbers in the formats it touches, so that another such transition could leverage them to ease its upgrade path too.

Updating the Git protocol for SHA-256

Posted Jun 21, 2020 13:46 UTC (Sun) by pabs (subscriber, #43278) [Link] (8 responses)

restic and friends use variable-sized chunks; that seems to me to be the way to go.

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 2:07 UTC (Mon) by cyphar (subscriber, #110703) [Link]

They do use variable-sized chunks (more specifically, content-defined chunking), but those chunking algorithms still require you to specify how large you want your chunks to be on average (or in restic's case, the chunking algorithm also asks what the maximum and minimum chunk sizes are). So you still have to decide on the trade-off between chunks that are too large and chunks that are too small.
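
To make that trade-off concrete, here is a toy content-defined chunker (not restic's or bup's actual algorithm) in which the minimum, average, and maximum chunk sizes are explicit parameters:

    import random

    def chunk_spans(data, min_size=2048, avg_size=8192, max_size=65536):
        # Declare a boundary when the low bits of a rolling value are all zero,
        # which happens roughly once every avg_size bytes for random-looking
        # data; min_size and max_size clamp the extremes.  avg_size must be a
        # power of two for the mask trick.  A real chunker uses a sliding-window
        # hash (e.g. buzhash) so that boundaries resynchronize after edits.
        mask = avg_size - 1
        start = 0
        rolling = 0
        for i, byte in enumerate(data):
            rolling = (rolling * 31 + byte) & 0xFFFFFFFF  # toy hash, not buzhash
            length = i - start + 1
            if length < min_size:
                continue
            if (rolling & mask) == 0 or length >= max_size:
                yield start, i + 1
                start = i + 1
                rolling = 0
        if start < len(data):
            yield start, len(data)

    # Chunk 1 MiB of pseudo-random data and look at the chunk sizes produced.
    random.seed(0)
    blob = bytes(random.randrange(256) for _ in range(1 << 20))
    sizes = [end - begin for begin, end in chunk_spans(blob)]
    print(len(sizes), min(sizes), max(sizes))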

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 5:14 UTC (Sat) by ras (subscriber, #33059) [Link] (6 responses)

Your comment led me to look up restic, and I was thinking "finally, this is it", then I discovered https://github.com/restic/restic/issues/187. With ransomware a thing, it's a major omission, and sadly there have been two years with no movement. Shame.

But you say it has friends?

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 5:46 UTC (Sat) by pabs (subscriber, #43278) [Link] (5 responses)

borg is the other modern chunking backup system:

https://borgbackup.github.io/borgbackup/

There is also bup, much more closely related to git:

https://github.com/bup/bup

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:19 UTC (Sat) by johill (subscriber, #25196) [Link] (4 responses)

Thinking about ransomware, I think you should be able to configure permissions on restic's use of AWS/S3 to prevent deletion? There's a use case that's more interesting for asymmetric encryption: not letting the machine that's doing the backup access its own old data, in case it's compromised, as mentioned on that ticket. Maybe you can even configure the S3 bucket to be read-only of sorts (e.g. by storing in deep archive or glacier, and not letting the IAM account do any restores from there), but I don't know how much or what restic needs to read out of the repo to make a new backup.

Borg's encryption design seems to have one issue: as far as I can tell, the "content-based chunker" has a very small key (they claim 32 bits, but say it goes linearly through the algorithm, so not all of those bits eventually matter), which would seem to allow fingerprinting attacks ("you have this chain of chunk sizes, so you must have this file"). Borg also has been debating S3 storage for years without any movement.

Ultimately I landed with bup (that I had used previously), and have been working on adding to bup both (asymmetric) encryption support and AWS/S3 storage; in the latter case you can effectively make your repo append-only (to the machine that's making the backup), i.e. AWS permissions ensure that it cannot actually delete the data. It could delete some metadata tables etc. but that's mostly recoverable (though I haven't written the recovery tools yet), apart from the ref names (which are only stored in dynamoDB for consistency reasons, S3 has almost no consistency guarantees.)

It's probably not ready for mainline yet (and we're busy finishing the python 3 port in mainline), but I've actually used it recently to begin storing some of my backups (currently ~850GiB) in S3 Deep Archive.

Configuration references:
https://github.com/jmberg/bup/blob/master/Documentation/b...
https://github.com/jmberg/bup/blob/master/Documentation/b...

Some design documentation is in the code:
https://github.com/jmberg/bup/blob/master/lib/bup/repo/en...

If you use it, there are two other things in my tree that you'd probably want:

1) with a lot of data, the content-based splitting on 13 bits results in far too much metadata (and storage isn't that expensive anymore), so you'd want to increase that. Currently in master that's not configurable, but I changed that: https://github.com/jmberg/bup/blob/master/Documentation/b...

2) if you have lots of large directories (e.g. maildir) then minor changes to those currently consume a significant amount of storage space since the entire folder (the list of files) is saved again. I have "treesplit" in my code that allows splitting up those trees (again, content-based) to avoid that issue, which for my largest maildir of ~400k files brings down the amount of new data saved from close to 10 MB (after compression) to <<50kB when a new email is written there. Looks like I didn't document that yet, but I should add it here: https://github.com/jmberg/bup/blob/master/Documentation/b.... The commit describes it a bit now: https://github.com/jmberg/bup/commit/44006daca4786abe31e3...

And yes, I'm working with upstream on this.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:31 UTC (Sat) by pabs (subscriber, #43278) [Link] (1 responses)

Is anyone working on adding treesplitting to git itself? Your docs mention that the tree duplication issue occurs with git too.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:42 UTC (Sat) by johill (subscriber, #25196) [Link]

I'm not aware of that. It would probably mean a new object type in git, or such, I haven't really thought about it.

However, it's not nearly as bad in git? You're not storing hundreds of thousands of files in a folder in git, presumably? :-) Not sure how much interest there would be in git on that.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:48 UTC (Sat) by johill (subscriber, #25196) [Link]

I should, however, mention that due to git's object format ("<type> <length>\x00<data>") other tools can architecturally have an advantage on throughput. Due to the header, bup has to run the content-based splitting first, and then start hashing the object only once it knows how long it is. If you don't have the limitation of the git storage format, you can do without such a header and do both hashes in parallel, stopping once you find the split point. I've been thinking about mitigating that with threading, but it's a bit difficult right now in bup's software architecture. (Incidentally, python is not the issue here, since the hash splitting is entirely in C in my tree, so can run outside the GIL.)

Updating the Git protocol for SHA-256

Posted Jul 8, 2020 19:22 UTC (Wed) by nix (subscriber, #2304) [Link]

> And yes, I'm working with upstream on this.

By this point, as a mere observer, I would say you *are* one of upstream. You're one of the two people doing most of the bup commits and have been for over a year now. :)

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 15:36 UTC (Sat) by hmh (subscriber, #3838) [Link] (6 responses)

I am not well versed on multihash, but a first look failed to find a canonical, immutable registry of already in use algorithms and their mapping to IDs (the numerical ones that are ABI since they end up embedded in the base# representations) along with the procedures to interact with such a registry. I mean something like IANA does.

At that point it becomes app specific, and other than the obvious protocol best practice that you should explicitly encode the protocol version (in this case what hash and hash parameters if not implied), there is little to be gained.

Prefixing (hidden by base# or explicitly) the hash type in git has already been covered by other replies and posts, and yes, imho it really should be done if at all possible.

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 17:11 UTC (Sat) by cyphar (subscriber, #110703) [Link] (5 responses)

Multihash defines exactly two things, an extensible format and a table of hash functions. So it definitely does what you say it doesn't (in fairness, the link @ms-tg gave you isn't as useful as the project's page[1]).

Now, there isn't an IANA-like procedure; everything is done via PRs on GitHub, but that's just a difference in administrative structure.

[1]: https://multiformats.io/multihash/

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 18:42 UTC (Sat) by hmh (subscriber, #3838) [Link] (1 responses)

A procedure to add new hashes is a procedure, PRs in github are fine.

This link you sent is much better, the other one lacks essential information...

I am quite sure git would severely restrict the allowed hashes, but at least the design of multihash seems sane and safely extensible, including when ones does the short-sighted error of enshrining short prefixes of the hash anywhere that is not a throw away command line call... A bad practice that is very common among git users.

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 23:17 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

> including when ones does the short-sighted error of enshrining short prefixes of the hash anywhere that is not a throw away command line call... A bad practice that is very common among git users.

"Best practice" for short usage in more permanent places includes the date (or tag description) and summary of the commit in question (which both greatly ease conflict resolution when it occurs and gives some idea of what's going on without having to copy/paste the has yourself).

IANA

Posted Jun 22, 2020 15:37 UTC (Mon) by tialaramex (subscriber, #21167) [Link]

IANA offers a _lot_ of different procedures. Varying from Private Use and Experimental (chunks of namespace carved off entirely for users to do with as they please without talking to IANA at all) through to Standards Action (you must publish an IETF Standards Track document e.g. a Best Common Practice or an RFC explicitly designated Internet Standard) and where the namespace is hierarchically infinite or near infinite (e.g. OIDs, DNS) IANA just delegates one layer of the namespace and more or less lets the hierarchy sort it out. Technically these OIDs don't even belong to IANA (it hijacked the ones used for the Internet many years ago) but it delegates them this way anyway and it's too late for the standards organisations that minted them to say "No".

RFC 8126 lists 10 such procedures for general use in new namespaces.

So what Multihash are doing here sounds like a typical new IANA namespace which has an Experimental/ Private Use region (self-assigned) and then Specification Required for the rest of the namespace. You must document what you're doing, maybe with a Standards Organisation, maybe you write a white paper, maybe even you just spin up a web site with a technical rant, but you need to document it and then you get reviewed and maybe get in.

Apparently Multihash is writing up some sort of formal document to maybe go to the IETF, but given that they started in 2016 and it's not that hard, they may not ever get it polished up and standardised anywhere; it's not a problem.

Updating the Git protocol for SHA-256

Posted Jun 24, 2020 4:03 UTC (Wed) by nevyn (guest, #33129) [Link] (1 responses)

Hmm, as someone who has done a bunch of work with hashes over the last couple of years I'd not heard of multihash before, and looking at https://multiformats.io/#projects-using-multiformats it seems the main user is still just ipfs. This wouldn't necessarily be bad if it were new and gaining usage, but it's more worrying given that it's been around for over half a decade and is supposed to be established.

Another similar point is the table itself: hashes are added ad hoc when someone uses them and wants to use multihash ... again, fine if the project is very new and gaining traction, but much less good if the project is established and you go see that none of https://github.com/dgryski/dgohash are there. I understand it's volunteer-based contributions, but if you want people to actually use your standard, it's going to be much easier if they can use it without having to self-register well-known, decade-old types.

Then there's the format itself. I understand that hashes are variable length, but showing abbreviated hashes is very well known at this point. A new git repo shows 7 characters for the --abbrev hash, ansible with over 50k commits only shows 10 (and even then GitHub only shows 7), and they want to add "1220" to the front of that? And they really want you to show it to the user all the time? Even if abbreviated hashes weren't a thing, most users are going to think it's a bit weird if literally all the hashes they see start with the same 4 hex characters (at a minimum -- using blake2b will eat 6, I think). I also doubt many developers would want to store the hashes natively, because it doesn't take many instances before storing the exact same byte sequence with each piece of actual data becomes more than trivial waste.
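
For reference, the "1220" prefix mentioned above is just multihash's self-describing header in hex: 0x12 is the registered code for sha2-256 and 0x20 is the 32-byte digest length. A minimal sketch:

    import hashlib

    def multihash_sha256_hex(data):
        # A multihash is <hash-function code><digest length><digest>, with the
        # first two fields varint-encoded; both 0x12 (sha2-256) and 0x20 (32)
        # fit in a single byte, so every such value starts with "1220".
        digest = hashlib.sha256(data).digest()
        return bytes([0x12, len(digest)]).hex() + digest.hex()

    print(multihash_sha256_hex(b"example")[:8])  # starts with "1220"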

Updating the Git protocol for SHA-256

Posted Jun 25, 2020 17:02 UTC (Thu) by pj (subscriber, #4506) [Link]

...all valid criticisms, but I've yet to see an alternative with equivalent functionality and more widespread support. If you know of one, I'd love to hear about it! Though as you say, multihash is still fairly young so would likely welcome feedback that would help adoption/functionality/usability.

Updating the Git protocol for SHA-256

Posted Jun 19, 2020 17:27 UTC (Fri) by david.a.wheeler (subscriber, #72896) [Link] (2 responses)

The problem with $5$ is that it will be interpreted by shells and mangled. It's very common to have commands with hash values, e.g.,

git reset HASH_VALUE

But having a standard prefix is reasonable. I had proposed rotating the first character of the hash value by 16, so that 0 becomes g, 1 becomes h, and so on. Then you can determine from the first character what encoding is used. You can extend that further with additional rotations or by encoding more characters.
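
A minimal sketch of that rotation idea (purely illustrative, assuming a 0-9a-z alphabet): rotating the first hex digit by 16 maps SHA-1's leading [0-9a-f] into [g-v], so the first character identifies the encoding.

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

    def tag_first_char(hex_digest, rotation=16):
        # Rotate only the first character: with a rotation of 16, '0' becomes
        # 'g', '1' becomes 'h', ..., 'f' becomes 'v'; further rotations (or
        # rotating more characters) could distinguish additional algorithms.
        index = (ALPHABET.index(hex_digest[0]) + rotation) % len(ALPHABET)
        return ALPHABET[index] + hex_digest[1:]

    print(tag_first_char("0a1b2c"))  # "ga1b2c"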

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 10:51 UTC (Sat) by cpitrat (subscriber, #116459) [Link] (1 responses)

Or add the prefix for all but SHA-1. A first char of [0-9a-f] would mean SHA-1, with no prefix to remove. A prefix of g would be SHA-256, and so on. That's not very different from MultiHash though. The pain is the complexity of the SHA-1 exception, which is not that awful for full hashes (you can look at the length; as all the others will have a prefix, there's no risk of collision). Shortened hashes add a layer of mess to the problem ...

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 14:22 UTC (Sat) by gavinbeatty (guest, #139659) [Link]

g is used as a prefix for SHA-1 when using git describe, but point taken, non [a-g0-9].

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 20:36 UTC (Sat) by josh (subscriber, #17465) [Link]

A single-character prefix would suffice to disambiguate:

[0-9a-f]+ would be SHA-1.
T[0-9a-f]+ would be SHA-256
Pick a new capital letter [G-Z] for each new hash.

Updating the Git protocol for SHA-256

Posted Jun 25, 2020 4:41 UTC (Thu) by draco (subscriber, #1792) [Link]

I was surprised they didn't switch to using Base64 instead of hex. You'd need only 43 characters (vs today's 40) instead of 64. Prefix one more character for the hash method and you only add 4. [Yes, standard Base64 would always append an '=', but since the input is fixed-length, there's no need to include it.]

But instead it looks like they'll stick with hex and disambiguate via ^{sha1} and ^{sha256} suffixes.
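
The arithmetic checks out: 32 bytes come to 64 hex characters but only 43 Base64 characters once the padding '=' is dropped, versus 40 hex characters for a full SHA-1 value today. A quick check:

    import base64
    import hashlib

    digest = hashlib.sha256(b"example").digest()               # 32 bytes
    print(len(digest.hex()))                                   # 64
    print(len(base64.urlsafe_b64encode(digest).rstrip(b"=")))  # 43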

Updating the Git protocol for SHA-256

Posted Jun 19, 2020 16:49 UTC (Fri) by michaelkjohnson (subscriber, #41438) [Link] (1 responses)

"485865fd0 instead of 412e40d041e861506bb3ac11a3a91e3"

But 485865fd0 is not a prefix of 412e40d041e861506bb3ac11a3a91e3; that example would be clearer if you used the prefix.

Updating the Git protocol for SHA-256

Posted Jun 19, 2020 17:11 UTC (Fri) by coogle (guest, #138507) [Link]

> But 485865fd0 is not a prefix of 412e40d041e861506bb3ac11a3a91e3; that example would be clearer if you used the prefix.

Thank you - not sure how that happened. Updated the article to use the proper shorthand value in the example.

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 13:03 UTC (Sat) by cesarb (subscriber, #6266) [Link]

> Carlson notes this works, but could be a problem if at some point in the future SHA-256 is replaced with a different algorithm that also produces 256-bit outputs. To this however, carlson says that he believes any hashing algorithm that someday might supersede SHA-256 will be longer than 256-bit

The tendency seems to be towards newer hashing algorithms being 256-bit. From the SHA-2 family, we have SHA-512/256 which is basically SHA-512 truncated to 256 bits; from the BLAKE family, the latest member BLAKE3 has 256-bit output (though it has an extensible mode with unlimited output length).

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 16:20 UTC (Sat) by Tomasu (guest, #39889) [Link] (2 responses)

This all seems unnecessarily complex. Add a version and hash format field to the protocol and on disk format. Introduce a "git upgrade-hash-algo" command and when repos want to update they let users know they need to update their clients to a version that supports the new format.

Yes, I realize that might cause issues for larger projects that have a bunch of external automated scripts/bots running that may not be maintained. It's not the project's responsibility to support unmaintained processes.

Updating the Git protocol for SHA-256

Posted Jun 20, 2020 23:09 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

What is to be done with commit references in commit messages and the like? I'd really like those to stay relevant…

Updating the Git protocol for SHA-256

Posted Jun 21, 2020 11:34 UTC (Sun) by mb (subscriber, #50428) [Link]

Pushing all problems to the user does not solve them.
It makes life easier for the developers, but hard for all users.
One should always try to avoid the need for help from users when upgrading/extending things. There are so many examples where this just made the process take forever, because so many users don't want to put the effort in. (e.g. Python 2->3).

Updating the Git protocol for SHA-256

Posted Jun 21, 2020 2:58 UTC (Sun) by Kamilion (subscriber, #42576) [Link] (5 responses)

... I don't understand why they aren't just jumping straight to the state of the art SHA512 and calling it done for the next two decades in one fell swoop instead of "let's support lots of arbitrary extensions". How about just two.
Shouldn't we be taking lessons learned from wireguard into account? If we move to SHA256 we're simply kicking the can down the road a bit further, but not solving the problem. Software selection for adoption in high profile projects like this tends to drive hardware acceleration, and I'd much rather see HW vendors armtwisted into shooting for SHA512/ED25519/ChaCha20 accelerators than the current breed of ZLIB and AES-256 accelerators.

Updating the Git protocol for SHA-256

Posted Jun 21, 2020 5:32 UTC (Sun) by flussence (guest, #85566) [Link] (4 responses)

SHA-512 isn't really state of the art (it's part of SHA-2 alongside 256, and SHA-3 exists), but it is known to be both faster and more secure than SHA-256.

Updating the Git protocol for SHA-256

Posted Jun 21, 2020 7:08 UTC (Sun) by Otus (subscriber, #67685) [Link] (1 responses)

SHA-512 is only faster on 64-bit systems when hashing long enough messages. With short inputs it is slower. And of course it doubles stored hash lengths, increasing overhead.

IMO choosing something more secure makes little sense when SHA-256 remains unbroken. Maybe SHA-3, but that's still sort of new and less tested.

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 1:57 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

> Maybe SHA-3, but that's still sort of new and less tested.

This is the essential problem. There will always be shiny new hash functions that may or may not actually be secure. There will always be new threats against old functions. It is impossible to know, right now, what hash function you will need to be using in ten years' time. If you are not designing your system to regularly switch hash functions, you are not designing for security.

That's why they are making this extensible. They have the humility to realize that we don't know what we're going to need tomorrow.

Updating the Git protocol for SHA-256

Posted Jun 23, 2020 14:53 UTC (Tue) by Hattifnattar (subscriber, #93737) [Link] (1 responses)

No, it is not known to be more secure. Unfortunately, with the current state of the art, this is impossible to know.

Sure, it has a bigger output space, but 256 bits already makes a random collision astronomically unlikely. The real problem is vulnerabilities. And any vulnerability found in SHA-256 is pretty much guaranteed to be present in SHA-512, and vice versa.

Updating the Git protocol for SHA-256

Posted Jun 26, 2020 15:45 UTC (Fri) by plugwash (subscriber, #29694) [Link]

Most vulnerabilities in hashes seem to incrementally chip away at the strength, rather than being immediate and complete breaks. So having more bits of headroom gives you more time between when the cryptographers start chipping away at the strength and when you have a practical break.

I would also expect hash functions with a larger internal state to be more secure even if their output size is the same. Even if the difficulty of finding a collision is similar, the collision is less useful if you can't just tack on an arbitrary suffix.

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 12:16 UTC (Mon) by jezuch (subscriber, #52988) [Link] (9 responses)

So, HTTP(S) is not merely a transport in git but a completely different protocol?

Ouch.

Also, is it really "less desirable"? AFAICT all the hosting providers are only allowing cloning via HTTPS... At least that I know of.

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 14:03 UTC (Mon) by cesarb (subscriber, #6266) [Link] (8 responses)

> So, HTTP(S) is not merely a transport in git but a completely different protocol?

There are actually two different http/https transports in git, the older "dumb" transport (put the files somewhere visible to the http daemon, make it export that directory through http, done), and the newer "smart" transport (which is more similar to a CGI script). So if I'm not miscounting, we have a total of six different transports in git: the "git" transport, the "dumb" http transport, the "smart" http transport, the ssh transport, the rsync transport, and the "local" transport (pointing directly to a local filesystem).

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 15:44 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (7 responses)

Ugh, I have so many questions here:

- Why do we need both dumb and smart HTTP(S)? Should the client even care what the server looks like internally?
- Why isn't local just a special case of rsync?
- The inclusion of both git and ssh in the list is questionable (you can tunnel anything over ssh, right?) but it's probably too late to fix now.

IIRC Mercurial has a grand total of three: HTTP(S), SSH, and local.

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 16:02 UTC (Mon) by mirabilos (subscriber, #84359) [Link]

You need dumb http because there is no git server (initially), you just have reading (and possibly writing) access to a remote repository, over http, ssh, or something. Or even local files.

The git protocol is only used when there’s an actual server process involved, which isn’t always possible.

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 18:08 UTC (Mon) by nix (subscriber, #2304) [Link] (5 responses)

You can tunnel anything over ssh, but the git protocol is meant for *anonymous* fetching -- and there is no such thing as unpassworded anonymous guest ssh access :)

Dumb HTTP doesn't require a Git server -- it only needs an HTTP server that can serve files. It's much slower and transfers a lot more than the smart protocol, but if you need it you really need it. Like git bundles, it's useful for getting stuff to/from networkologically constrained environments.

Updating the Git protocol for SHA-256

Posted Jun 23, 2020 2:10 UTC (Tue) by pabs (subscriber, #43278) [Link] (4 responses)

"there is no such thing as unpassworded anonymous guest ssh access" doesn't appear to be true:

https://askubuntu.com/questions/583141/passwordless-and-k...
https://singpolyma.net/2009/11/anonymous-sftp-on-ubuntu/

PS: branchable.com allows anonymous git:// pushes to wikis.

http://ikiwiki.info/tips/untrusted_git_push/
https://ikiwiki-hosting.branchable.com/todo/anonymous_git...

Updating the Git protocol for SHA-256

Posted Jun 23, 2020 7:20 UTC (Tue) by niner (subscriber, #26151) [Link] (2 responses)

That's not really anonymous ssh, it's just ssh with a publicly known user name and password (in this case "anonymous" and "").

Updating the Git protocol for SHA-256

Posted Jun 23, 2020 12:19 UTC (Tue) by dezgeg (subscriber, #92243) [Link]

There is the "none" authentication method that can be used. E.g. "ssh nethack@alt.org" seems to use that. I suppose then the only thing needed is configuring the SSH server to ignore the username.

Updating the Git protocol for SHA-256

Posted Jun 25, 2020 9:09 UTC (Thu) by grawity (subscriber, #80596) [Link]

Well, if the password is actually empty, at least OpenSSH will outright let you skip password-based authentication – no password prompts to be shown. I have seen actual Git and Hg servers which use this (if I remember correctly, the OpenSolaris Hg repository used to be served exactly this way).

Sure you could argue that you still need a known username, but that can be simply included in the git+ssh:// URL (like people already do with git@github.com).

(Still, even if you had to press Enter at a blank password prompt, that's how CVS pserver used to work and everyone accepted it as "anonymous access" all the same.)

Updating the Git protocol for SHA-256

Posted Jul 8, 2020 19:28 UTC (Wed) by nix (subscriber, #2304) [Link]

OK, you live and learn: git:// allows pack reception! I clearly never read that part of the git-daemon manpage :)

Updating the Git protocol for SHA-256

Posted Jun 22, 2020 18:43 UTC (Mon) by xnox (guest, #63320) [Link] (2 responses)

It feels dated to use SHA256 instead of BLAKE3.

BLAKE3 is faster on both 32bit and 64bit arches, over big and small inputs. And for big stuff, it supports streaming validation and incremental hash updates. Such that one can verify large pack files as one is receiving them.

I wonder if it is too late to consider BLAKE3.

Updating the Git protocol for SHA-256

Posted Jun 25, 2020 4:23 UTC (Thu) by draco (subscriber, #1792) [Link] (1 responses)

Arguably you still have time -- most of the work to date has been fixing the git code to where they can abstract away the hash algorithm at all (previously the code encoded object IDs as a raw char[40] all over). IIRC, at this stage, there's still no non-experimental support for SHA-256 per se.

Here's the criteria they used to choose SHA-256, from git.git/Documentation/technical/hash-function-transition.txt:

1. A 256-bit hash (long enough to match common security practice; not
excessively long to hurt performance and disk usage).

2. High quality implementations should be widely available (e.g., in
OpenSSL and Apple CommonCrypto).

3. The hash function's properties should match Git's needs (e.g. Git
requires collision and 2nd preimage resistance and does not require
length extension resistance).

4. As a tiebreaker, the hash should be fast to compute (fortunately
many contenders are faster than SHA-1).

Looking at the git history of the file, their candidates included: SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256.

From the commit message in which they down-selected to SHA-256:

"From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256, K12, and so on are all believed to have similar security properties. All are good options from a security point of view.

SHA-256 has a number of advantages:

* It has been around for a while, is widely used, and is supported by
just about every single crypto library (OpenSSL, mbedTLS, CryptoNG,
SecureTransport, etc).

* When you compare against SHA1DC, most vectorized SHA-256
implementations are indeed faster, even without acceleration.

* If we're doing signatures with OpenPGP (or even, I suppose, CMS),
we're going to be using SHA-2, so it doesn't make sense to have our
security depend on two separate algorithms when either one of them
alone could break the security when we could just depend on one.

So SHA-256 it is."

Perhaps this goes without saying, but since this is the kind of thing that can get very bikesheddy, performance numbers and strong arguments specifically refuting their reasons will probably do better than opinions.

Updating the Git protocol for SHA-256

Posted Jun 25, 2020 6:52 UTC (Thu) by newren (subscriber, #5160) [Link]

> Arguably you still have time -- most of the work to date has been fixing the git code to where they can abstract away the hash algorithm at all (previously the code encoded object IDs as a raw char[40] all over). IIRC, at this stage, there's still no non-experimental support for SHA-256 per se.

I don't think that's quite a fair characterization; as far as I understand, there's been quite a bit of sha-256 specific work -- the choice of sha-256 was made two years ago (and not earlier) because that was the point at which brian needed a decision to be made to proceed further on the transition plan. When someone tried to propose a different hash six months ago, this is part of what brian had to say:

"Because we decided some time ago, I've sent in a bunch of patches to our
testsuite to make it work with SHA-256. Some of these patches are
general, in that they make the tests generate values which are used, or
they are specific to the length of the hash algorithm. Others use
specific hash values, and changing the hash algorithm will require
recomputing all of these values.

Absent a compelling security reason to abandon SHA-256, such as a
significant previously unknown cryptographic weakness, I don't plan to
reimplement all of this work. Updating our testsuite to work
successfully with SHA-256 has taken a huge amount of time, and this work
has been entirely done on my own free time because I want the Git
project to be successful. That doesn't even include the contributions
of others who have reviewed, discussed, and contributed to the current
work and transition plan."
(Source: https://lore.kernel.org/git/20191223011306.GF163225@camp....)

> Perhaps this goes without saying, but since this is the kind of thing that can get very bikesheddy, performance numbers and strong arguments specifically refuting their reasons will probably do better than opinions.

Yes, absolutely. And someone as capable as brian to volunteer to do all the work brian has been doing for the last few years, or some magic to convince brian to throw away part of his work and happily redo it for a new hash. I personally know almost nothing about all these hashes and have not been involved in the hash transition plan, but if blake3 is still impressive enough to you that you still want to try to change brian's and possibly others' minds, I can at least point you to the thread where the sha256 decision was initially made. It may help you craft your arguments relative to performance and other characteristics. See it over here: https://lore.kernel.org/git/20180609224913.GC38834@genre....

No capabilities only for "dumb" HTTP protocol

Posted Jun 25, 2020 8:05 UTC (Thu) by jnareb (subscriber, #46500) [Link]

> This provides a clear path forward for the most commonly used Git protocol (git://). It does not, however, address less desirable methods such as communicating over HTTP (http://), since that method does not provide capabilities.

Actually Git protocol (with capabilities) is used with three transport methods: bare TCP (git://) - unauthenticated and nowadays rarely used, SSH, and "smart" HTTP(s) (http:// and https://). If you use GitHub, Bitbucket, GitLab or any other hosting site, you are using encapsulated git protocol whether you use SSH or https:// URLs.

It is only *"dumb" HTTP* that has problems, that relies on WebDAV-capable web server and `git update-server-info` to be run on each repository update (usually from hooks). "Smart" HTTP protocol relies on `git http-server` CGI script or equivalent.

So in my opinion it should be s/such as communicating over HTTP/such as communicating over "dumb" HTTP/

Updating the Git protocol for SHA-256

Posted Jun 25, 2020 16:36 UTC (Thu) by kmweber (guest, #114635) [Link]

This was interesting to me because, despite being obvious in retrospect, it never occurred to me that git's use of hashes guarantees the integrity of the revision history. I always saw it solely as a means of implementing a content-addressable store.

Updating the Git protocol for SHA-256

Posted Jun 26, 2020 11:31 UTC (Fri) by smitty_one_each (subscriber, #28989) [Link]

One might have been tempted to consider leaving git as-is, and calling the SHA-256 version "got", and then using a script to walk the tree with checkouts and commits to convert the repo from git to got.

Which is probably harder than I realize to accomplish.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds