LWN: Comments on "Moving Git past SHA-1"

Moving Git past SHA-1

pabs — Thu, 09 Jan 2020 01:16:31 +0000

https://public-inbox.org/git/20200107173111.GB923852@alph...

Moving Git past SHA-1

keitsi — Wed, 08 Jan 2020 14:18:46 +0000

https://arstechnica.com/information-technology/2020/01/pg...

Moving Git past SHA-1

anarcat — Wed, 17 Jan 2018 16:38:22 +0000

What's the current state of affairs here? It looks like a hash object construct was merged, but I'm not sure. Git certainly still uses SHA-1 now - is this plan still on?

Moving Git past SHA-1

kleptog — Fri, 03 Mar 2017 21:41:39 +0000

Moving to another hash doesn't necessarily need to mean change in disk size. You could take a SHA-512 hash and truncate it to 160 bits and you'd be much safer than SHA-1.

If we thought 160 bits collision resistance was fine then it doesn't seem smart to start using 512 or even 256 bits for Git identifiers. Just pick a safer hash and truncate it to the length you want.
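A minimal sketch of the truncation idea, assuming Python's standard `hashlib`; the helper name `truncated_sha512` is made up for illustration and is not anything Git provides. Note that a 160-bit output still has a generic collision bound of about 2^80 work, whichever hash it is cut from.

```python
import hashlib

def truncated_sha512(data: bytes, bits: int = 160) -> str:
    """Return the first `bits` bits of SHA-512 as hex (hypothetical helper)."""
    if bits % 8 or not 0 < bits <= 512:
        raise ValueError("bits must be a multiple of 8 between 8 and 512")
    # Keep only the leading bytes of the 64-byte SHA-512 digest.
    return hashlib.sha512(data).digest()[: bits // 8].hex()

example_id = truncated_sha512(b"hello world\n")
print(len(example_id), example_id)  # 40 hex characters, same width as SHA-1
```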

Moving Git past SHA-1

emorrp1 — Thu, 02 Mar 2017 12:25:31 +0000

Re-writing history would invalidate existing signed commits and require force-pushing, which can be disabled for some repos/heads. Some people enable signing for every commit to prove authorship and regularly (automatically) validate existing signatures in the history. Above is a suggestion to have parallel double-hashed history, which would be better, though I think you'd definitely want the first new-style-only commit (enabling the second hash) to be signed from a trusted maintainer: https://lwn.net/Articles/715844/.

Ultimately I wouldn't worry about it as there isn't yet a simple solution and the developers have now raised the priority of this work, so they'll think through the various options to come up with a compromise that is both implementable and reasonable.

Moving Git past SHA-1

jwilk — Thu, 02 Mar 2017 11:09:38 +0000

Please don't roll your own crypto.

Detecting attempted collisions with sha1dc

jnareb — Thu, 02 Mar 2017 10:06:07 +0000

With the version currently in 'pu' (proposed updates) branch, Git can be configured with the USE_SHA1DC build time configuration variable to use SHA-1 implementation from shattered.io that detects attempted collisions[1]. Note that for the time being it is quite a bit slower than other implementations; optimization is ongoing.

This way with `transfer.fsckObjects` any attempt can be easily detected.

[1]: https://github.com/cr-marcstevens/sha1collisiondetection

Moving Git past SHA-1

Cyberax — Thu, 02 Mar 2017 02:41:35 +0000

You actually don't even need to concatenate them, just replace the first bits of sha-3.

Moving Git past SHA-1

epa — Wed, 01 Mar 2017 12:25:19 +0000

Sorry, I meant that the full length of the new hashes would be 40 + 64 = 104 characters.

Moving Git past SHA-1

epa — Wed, 01 Mar 2017 12:23:25 +0000

You could define the new hash code as being the SHA1 and the SHA3-256 hashes concatenated together. Yes, I am aware of the result that this won't necessarily result in a stronger hash than using just SHA3-256 by itself. However it won't be weaker than SHA3-256.

The nice property is then that all existing hash prefixes, and 40 character hashes, continue to be valid. If you want to have the greater collision resistance of SHA3-256 then you can use the new jumbo-sized 106 character hashes.
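A rough sketch of the concatenated identifier, assuming Git's usual blob framing (`blob <size>\0` followed by the content) is kept for both hashes. `combined_id` is a hypothetical name, not an existing Git function. The result is 40 + 64 = 104 hex characters, and its first 40 characters are the existing SHA-1 ID.

```python
import hashlib

def git_blob_payload(content: bytes) -> bytes:
    # Git hashes a blob as the header "blob <size>\0" followed by the raw content.
    return b"blob %d\x00" % len(content) + content

def combined_id(content: bytes) -> str:
    """SHA-1 hex followed by SHA3-256 hex over the same payload (hypothetical)."""
    payload = git_blob_payload(content)
    return hashlib.sha1(payload).hexdigest() + hashlib.sha3_256(payload).hexdigest()

new_id = combined_id(b"hello\n")
print(len(new_id))   # 104
print(new_id[:40])   # the existing SHA-1 object ID, so old prefixes still match
```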

Moving Git past SHA-1

zenaan — Wed, 01 Mar 2017 10:39:43 +0000

>I guess it'll use the db mapping new to old hash values that everyone will need to carry around forever. 250MiB+ for the kernel. Sigh.

Is 250MiB just the mapping size (i.e. no object data)? Sounds surprising, but perhaps not for the number of years (rather, objects) involved?
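As a back-of-the-envelope check, here is the arithmetic under one assumption that is not from the thread: each mapping entry is just a raw 20-byte SHA-1 next to a raw 32-byte new hash, with no index overhead.

```python
MIB = 1024 * 1024

SHA1_BYTES = 20       # raw SHA-1 digest
NEW_HASH_BYTES = 32   # assumption: a 256-bit replacement hash
entry_bytes = SHA1_BYTES + NEW_HASH_BYTES

# How many old-to-new pairs fit in 250 MiB with no other overhead?
entries_in_250_mib = (250 * MIB) // entry_bytes
print(entry_bytes, entries_in_250_mib)  # 52 bytes per entry, roughly five million entries
```

So a bare mapping of that size corresponds to a repository with a few million objects.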

Flag day doesn't sound so bad - a repo can simply store some meta-data for its UTC flag day or "flag moment", with a certain quiescence period (say 15 minutes) during which pulls and pushes are not accepted... thereby ensuring world-sync for the flag moment, yes?

Moving Git past SHA-1

mathstuf — Wed, 01 Mar 2017 02:18:37 +0000

There are many references to the old hashes floating out there. You need to rewrite submodule pointers, which means the submodule needs to be rewritten first. You still need the old hashes for links on the Web or risk breaking boatloads of links. Just rewriting as a policy for everyone would be bad. You can do it for your projects if you wish, but I don't want that.

Moving Git past SHA-1

kjp — Wed, 01 Mar 2017 01:39:27 +0000

I don't understand why you wouldn't just rewrite history using the new hash. It seems far simpler to code and understand.

Moving Git past SHA-1

tialaramex — Wed, 01 Mar 2017 01:36:25 +0000

"at least as far as is publicly known at the moment"

Construction-wise this problem was literally inevitable in the M-D family. The entire internal state is emitted, so if you can find A, B such that hash(A) = hash(B) then you can't help but have hash(A | X) = hash(B | X), albeit this applies to an internal representation so for practical real world inputs there's some fiddling about with padding to consider.

SHA-3 is a sponge design, so there's a separate operation to "squeeze out" the internal state into a hash result once you've "soaked up" all the bytes to be hashed. You could in principle still smash the entire internal state, and then make more collisions with suffixes but it seems far more likely that an attack would be content to collide one particular output hash that was squeezed out, despite the internal state not being fully collided, in this case suffixes would not collide.
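The Merkle-Damgård suffix property can be shown with a toy hash. Everything below is an illustrative assumption: a deliberately tiny 16-bit internal state (so a collision is found in well under a second), a compression function borrowed from truncated SHA-256, and a made-up padding rule. It is not SHA-1, only the same style of construction.

```python
import hashlib
from itertools import count

BLOCK = 4        # bytes per message block in this toy construction
STATE_BYTES = 2  # 16-bit internal state, tiny on purpose
IV = b"\x00" * STATE_BYTES

def compress(state: bytes, block: bytes) -> bytes:
    # Toy compression function: truncated SHA-256 of state || block.
    return hashlib.sha256(state + block).digest()[:STATE_BYTES]

def toy_md(message: bytes) -> bytes:
    # Pad with 0x80, zeros, then the message length, as M-D hashes do.
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 4) % BLOCK)
    padded += len(message).to_bytes(4, "big")
    state = IV
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state  # the whole internal state is emitted as the hash

def find_state_collision():
    # Brute-force two distinct one-block messages that reach the same state.
    seen = {}
    for n in count():
        msg = n.to_bytes(BLOCK, "big")
        state = compress(IV, msg)
        if state in seen:
            return seen[state], msg
        seen[state] = msg

a, b = find_state_collision()
suffix = b"any common suffix"
print(a != b, toy_md(a) == toy_md(b), toy_md(a + suffix) == toy_md(b + suffix))
```

Because `a` and `b` have the same length and leave the same internal state, every block processed afterwards (suffix and padding alike) is identical, so the collision carries over to any common suffix.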

Moving Git past SHA-1

nix — Wed, 01 Mar 2017 00:08:15 +0000

Oh you can easily tell which hash they're from -- just check the lengths. What you can't tell (without rehashing) is that specific parent and child blobs denote the same object. (However, as noted elsethread, this is not a problem: the deltifier will spot them anyway, and squash them away to next-to-nothing.)

Moving Git past SHA-1

magfr — Tue, 28 Feb 2017 22:49:11 +0000

But would this solve the problem at hand?

What happens under this scheme when I add a second blob with the same sha1 hash as one previously in the repository but a different content?

Moving Git past SHA-1

jengelh — Tue, 28 Feb 2017 21:30:39 +0000

>The problem is that when you create a new-hash commit that's a child of an old-hash commit, you have to somehow know that the blobs with new-hash X are equivalent to the parent's blobs with old-hash Y *even though the hashes are different*

Not sure how hashes are stored internally, but pretending they are dealt with in their hex form, why not add a character not in the [0-9a-f] alphabet?

    if (strlen(hash) == 40 && strchr(hash, ':') == NULL) {
        method = sha1;
    } else {
        (method, hash) = split(hash, ":");   /* e.g. "sha3:f00f....f00f" */
    }

kind of thing.

Something like that worked for shadow(5) too when $2a$ et al support was added.
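A sketch of how such prefixed identifiers might be parsed. The `sha3:` prefix, the table of hex lengths, and the function name `parse_object_id` are all assumptions for illustration; none of this is an existing Git format.

```python
import re
from typing import Tuple

# Hypothetical table: hex digits expected per method ("sha3" taken to mean SHA3-256).
HEX_LENGTHS = {"sha1": 40, "sha3": 64}

def parse_object_id(text: str, default: str = "sha1") -> Tuple[str, str]:
    """Split 'method:hex' into (method, hex); a bare hex string uses `default`."""
    method, sep, digest = text.partition(":")
    if not sep:
        # No ':' present, so this is an old-style bare identifier.
        method, digest = default, text
    if method not in HEX_LENGTHS:
        raise ValueError("unknown hash method: %r" % method)
    if not re.fullmatch("[0-9a-f]{%d}" % HEX_LENGTHS[method], digest):
        raise ValueError("malformed %s identifier: %r" % (method, digest))
    return method, digest

print(parse_object_id("0" * 40))
print(parse_object_id("sha3:" + "f" * 64))
```

Since ':' is outside the [0-9a-f] alphabet, a bare 40-character ID can never be mistaken for a prefixed one.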

Moving Git past SHA-1

Otus — Tue, 28 Feb 2017 19:29:54 +0000

> The issue is that weaknesses have been found in SHA-1 that reduce the complexity dramatically - from "not before heat death of the universe" to "can be done for under $1,000,000 in computer time".

SHA-1 was never at the "heat death of the universe" level. 160 bits means a maximum collision resistance against brute force of 2^80, which is more like "current world computing power for a year". Give or take an order of magnitude.

> So far, no similar design flaw has been found in SHA-2 or SHA-3. However, no such design flaw was known when SHA-1 was standardised, either.

It took ten years to have theoretical attacks on full SHA-1. SHA-2 has stood for over fifteen, still has a clear margin between rounds broken and total rounds, and would need a much bigger speedup than the 2^16x that was found for SHA-1 before it could be attacked in practice.

Moving Git past SHA-1

drh — Tue, 28 Feb 2017 18:41:52 +0000

On the SQLite project there are just over 70K artifacts ("files" if you will) with an average size of 550 bytes after storing as deltas and compressing using zlib. Uncompressed and undeltaed, the average file size is 69K. Of the 70K fullsize uncompressed and undeltaed artifacts, only 294 (0.4%) are smaller than a 64-character hexadecimal SHA256 hash. But if you compare the deltaed and compressed file sizes against the hash size, the figures are more depressing. About 20% of the files are in fact smaller than the 64-character SHA256 hash, 18% for a 56-character SHA3-224 hash and 6.5% are smaller than the 40-character SHA1.

SHA1 (also MD5 and SHA2) has the weakness that once you find one collision, it becomes easy to find lots of other collisions that have the same prefix. SHA3 does not have this weakness (at least as far as is publicly known at the moment).

Moving Git past SHA-1

mlankhorst — Tue, 28 Feb 2017 18:41:46 +0000

This would effectively be the same as moving to a new hash, so might as well move to a newer, more secure hash.

Moving Git past SHA-1

farnz — Tue, 28 Feb 2017 18:28:46 +0000

There are two reasons to not simply extend the SHA-1 construction:

1. The "heat death of the universe" depends on an attacker having to perform all the operations we expect them to. A break (such as the current break for SHA-1) allows an attacker to get the result they need without doing all the operations we expect them to. Thus, we get a shorter hash by using a new, unbroken construction like SHA-2 or SHA-3, rather than by extending SHA-1 out far enough that even with the attacker's ability to skip operations, they can't feasibly compute colliding hashes under any circumstances.

2. Once one break is found, it's likely that someone will build on this work to reduce the difficulty of the break. Thus, if you extend the broken construction, someone is quite likely to find a way to simply not bother with the extended bit.

That second point is a doozy - you expect 2⁸⁰ operations (as it's 160 bits long - a 256 bit hash would need 2¹²⁸ operations) to break SHA-1, but the attack reduces that to around 2⁶³ operations. You may be able to extend it to 2⁸⁰ operations by extending the hash to (say) 200 bits, but an attacker building on the recent work could then reduce it back down to 2⁶³ operations, and you've got to pay the cost of a new hash again.
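The numbers in this comment can be worked through in a few lines. The only inputs are the figures quoted above (a birthday bound of half the output width, and an attack at around 2⁶³ operations); the function name is made up.

```python
def generic_collision_bits(hash_bits: int) -> int:
    # Birthday bound: a collision on an n-bit hash takes about 2^(n/2) operations.
    return hash_bits // 2

ATTACK_BITS = 63  # the "around 2^63 operations" figure quoted above

sha1_bits = generic_collision_bits(160)    # 80
sha256_bits = generic_collision_bits(256)  # 128
speedup = 2 ** (sha1_bits - ATTACK_BITS)

print(sha1_bits, sha256_bits, speedup)  # 80 128 131072
```

On those figures the attack is about 2¹⁷ times cheaper than brute force, which is the gap an extended construction would have to win back without any guarantee of keeping it.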

Moving Git past SHA-1

Tara_Li — Tue, 28 Feb 2017 17:21:14 +0000

But can that "before the heat death of the Universe" be recovered by simply making the SHA-1 hash longer? Or is the crack so bad that the length of the hash effectively doesn't matter any more?

Moving Git past SHA-1

farnz — Tue, 28 Feb 2017 17:00:31 +0000

It's not really about relative strengths of the various hashes - all of them are theoretically unbreakable in reasonable time, even SHA-1, absent a weakness unknown at the time the hash was standardised. The issue is that weaknesses have been found in SHA-1 that reduce the complexity dramatically - from "not before heat death of the universe" to "can be done for under $1,000,000 in computer time".

So far, no similar design flaw has been found in SHA-2 or SHA-3. However, no such design flaw was known when SHA-1 was standardised, either.

Moving Git past SHA-1

Tara_Li — Tue, 28 Feb 2017 16:38:09 +0000

I'm not clear on the relative strength of the various SHAs - Apparently, SHA1 is bad, SHA224, SHA256, and SHA512 are really versions of SHA2, and there will be various levels of SHA3, as well.

There is, of course, the trade-off of the hashes ending up longer than a significant fraction of your files, but ...

How much more effective is SHA3 than SHA1? Is it expected to take 10e+3, 10e+30, or 10e+300 times as long for the same length hash to create a collision? Or do the hashes keep getting longer, and if so - can't we just make SHA1 hashes longer?

Moving Git past SHA-1

nix — Tue, 28 Feb 2017 15:11:38 +0000

There *are* object IDs stored in packs, of course -- commit, tag, and tree objects all contain embedded SHA-1 IDs (and, in the new world, new-hash IDs).

But the mapping from SHA-1 IDs to object offsets in each pack is computed at receive time and stored in the index. (This means that every object lookup must check each index in turn, which is why occasionally repacking them into a single big pack without actually recompressing any content at all is still a significant win.)

Moving Git past SHA-1

garyvdm — Tue, 28 Feb 2017 14:59:50 +0000

I think (I'm not 100% sure) that the object ids are not stored in the pack file, only in the pack index. The pack file can be enumerated, and the object ids can be calculated. Hence it would be possible to have one pack file, and multiple index files for each object id type. Such a repo might look like this:

.git/
    objects/
        pack/
            pack-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.idx
            pack-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.idx-sha3
            pack-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.pack
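A toy model of "one pack, several indexes", under heavy simplifying assumptions: the pack is a plain Python list of blobs, the index maps an ID to a list position rather than a byte offset, and there is no delta compression. It also only covers blobs; as noted elsewhere in the thread, commit, tag and tree objects embed IDs in their content, so they could not be shared this simply.

```python
import hashlib
from typing import Dict, List, Tuple

def object_id(algo: str, obj_type: str, content: bytes) -> str:
    # Git-style framing: "<type> <size>\0" followed by the content.
    header = ("%s %d" % (obj_type, len(content))).encode() + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()

def build_index(pack: List[Tuple[str, bytes]], algo: str) -> Dict[str, int]:
    # Enumerate the pack once and record where each object ID lives.
    return {object_id(algo, t, c): pos for pos, (t, c) in enumerate(pack)}

# One shared "pack" (hypothetical contents), two indexes over it.
pack = [("blob", b"first revision\n"), ("blob", b"second revision\n")]
idx_sha1 = build_index(pack, "sha1")
idx_sha3 = build_index(pack, "sha3_256")

for name, idx in (("idx", idx_sha1), ("idx-sha3", idx_sha3)):
    for oid, pos in idx.items():
        print(name, oid[:12], "->", pos)
```

Both indexes point at the same stored data, so the cost of the second ID type is the index alone.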

Moving Git past SHA-1

Karellen — Tue, 28 Feb 2017 13:06:55 +0000

No idea :-)

I have basic familiarity with *some* git internals, but "how packs are formatted" has never been something I've needed to delve into. It's entirely possible that could scuttle the whole idea.

Moving Git past SHA-1

nix — Tue, 28 Feb 2017 12:06:25 +0000

I guess it'll use the db mapping new to old hash values that everyone will need to carry around forever. 250MiB+ for the kernel. Sigh.

Ah well, disk space really is cheap: I've just never internalized it properly. 250MiB still seems like a lot rather than about 1p. (And that price is taken from *enterprise* storage costs. Disks that normal humans not obsessed with RAIDing everything use will be even cheaper.)

Moving Git past SHA-1

nix — Tue, 28 Feb 2017 12:03:01 +0000

Hm. Given that the transitions will always happen at a parent/child commit boundary, you're probably right: the delta compression heuristics are partially commit- and partially file-based, so even the smallest window will find and scrunch them into nothing.

OK, all my objections to this scheme are gone then: except for a slightly increased diff cost across the boundary (to, uh, what non-git users pay all the time) it is costless. :)

Moving Git past SHA-1

andrewsh — Tue, 28 Feb 2017 11:19:44 +0000

How would that work with packs?

Moving Git past SHA-1

Karellen — Tue, 28 Feb 2017 08:39:06 +0000

Sorry, I wasn't explaining myself clearly.

Rather than have a new-hash commit that's a child of an old-hash commit, maintain two complete sets of commits/history, one for each hash. You can build one history from the other.

e.g. Say you have a repository containing a single file with two revisions. At the moment, you have a repo like (excuse shortened and comically simplified hash values):

.git/
    objects/
        00/000001 -- first revision of file
        00/000002 -- second revision of file
        00/100001 -- tree containing first revision of file
        00/100002 -- tree containing second revision of file
        00/200001 -- commit containing first tree
        00/200002 -- commit containing second tree
    refs/
        heads/
            master -- points to 00200002

Now, replace that with:

.git/
    objects/
        sha1/00/000001 -- first revision of file
        sha1/00/000002 -- second revision of file
        sha1/00/100001 -- sha1-based tree containing first revision of file
        sha1/00/100002 -- sha1-based tree containing second revision of file
        sha1/00/200001 -- sha1-based commit containing first tree
        sha1/00/200002 -- sha1-based commit containing second tree
        sha3/ff/000001 -- first revision of file
        sha3/ff/000002 -- second revision of file
        sha3/ff/100001 -- sha3-based tree containing first revision of file
        sha3/ff/100002 -- sha3-based tree containing second revision of file
        sha3/ff/200001 -- sha3-based commit containing first tree
        sha3/ff/200002 -- sha3-based commit containing second tree
    refs/
        heads/
            master -- points to sha1:00200002

In the above case, sha1/00/000001 and sha3/ff/000001 are hard links to the same file. The blob is identical in each case, it's just linked to multiple times with different names. The same goes for the second revision of the file. Only the tree and commit objects need to be duplicated for each hash, but they're going to be orders of magnitude smaller than the sizes of the blobs. You can build one tree from the other, and even build an entirely new commit history based on a different hash at any time in the future. When you create a new commit, a new commit object will be created for each hash you've configured your repo to use.
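The hard-link layout for blobs might be sketched as follows. This is a toy: real loose objects are zlib-compressed, the `sha1/` and `sha3/` directory names come from the example above rather than from Git, and `store_blob` is a hypothetical helper. It assumes a filesystem that supports hard links.

```python
import hashlib
import os
import tempfile

def store_blob(objects_dir: str, content: bytes) -> dict:
    """Write the blob once, then hard-link it under the other hash's name."""
    payload = b"blob %d\x00" % len(content) + content
    paths = {}
    first = None
    for name, algo in (("sha1", "sha1"), ("sha3", "sha3_256")):
        digest = hashlib.new(algo, payload).hexdigest()
        path = os.path.join(objects_dir, name, digest[:2], digest[2:])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        if first is None:
            with open(path, "wb") as f:
                f.write(content)      # the only copy of the data
            first = path
        elif not os.path.exists(path):
            os.link(first, path)      # second name, same inode
        paths[name] = path
    return paths

with tempfile.TemporaryDirectory() as tmp:
    paths = store_blob(tmp, b"first revision of file\n")
    same = os.path.samefile(paths["sha1"], paths["sha3"])
    links = os.stat(paths["sha1"]).st_nlink
    print(same, links)
```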

Anyone can refer to a commit by either id (sha1:00200002 /or/ sha3:ff200002), and, provided you have the relevant commit history built in your repo, you can see which commit they're talking about. References to old commits are still valid, because you still have (and can continue to maintain) the old hash history.

A project could say "from date X we will talk to each other about commits using sha3 refs by default", but that won't need to change how your repo actually works under the hood.

Does that explain what I'm trying to get at a bit better?

Moving Git past SHA-1

johill — Tue, 28 Feb 2017 06:54:08 +0000

That only really works - if at all - when your objects are all loose, but they will be in packs so it won't.

Moving Git past SHA-1

smurf — Tue, 28 Feb 2017 03:27:34 +0000

That shouldn't be a problem if you do the hard-linking at the point of transition between old and new hash. You need to mandate that the transitioning commit may not have any other changes. Presto, instant hash correspondence table.

Moving Git past SHA-1

BenHutchings — Tue, 28 Feb 2017 01:07:01 +0000

git doesn't really store each object independently; it eventually packs them up together and applies delta-compression to reduce storage for multiple versions of a file. I would expect that the new-hashed blobs can be delta-compressed down to no more than a few bytes of metadata.

Moving Git past SHA-1

nix — Mon, 27 Feb 2017 23:04:45 +0000

The problem is that when you create a new-hash commit that's a child of an old-hash commit, you have to somehow know that the blobs with new-hash X are equivalent to the parent's blobs with old-hash Y *even though the hashes are different*: i.e. rather than just doing a string compare you have to actually *rehash* under the other hash and then compare that.

Nobody's written that code, so with current code it will assume all the blobs are different and write them out all over again, even though they're all identical. Not even repacking all objects would fix that, because that doesn't rehash anything either...

Linus seems to be happy with this, but having seen the amount of space it would waste forevermore in e.g. the Chromium repo, I'm not so happy :(

Moving Git past SHA-1

Karellen — Mon, 27 Feb 2017 22:47:25 +0000

> One result of this approach would be some inevitable duplication of objects around the transition, as the same files are stored under both the old and new IDs. The alternative is to perform some sort of mapping or otherwise allow objects to be known under both the old and new IDs, but that would add some significant complexity and would also increase the amount of data stored in the repository.

Would it increase the amount of data stored in the repository that much? All the blobs themselves would be exactly the same. A blob with an sha1 hash of "0000....ffff" but an sha3 hash of "ffff....0000" would only need to be stored once, but with two hardlinks - one for each hash type. Even a pack file, containing a bunch of blobs, could contain all the blob/diff data only once, but with each blob/diff referred to with multiple names.

The only duplicated information would be the tree, commit and tag objects. OK, for a large project with a lot of history, they'd take up a not-entirely-negligible amount of space, but it would be a pretty small fraction of the blob space, wouldn't it?

And you could build a complete set of sha-3-identified trees and commit history from the sha-1 equivalent, and manage them side-by-side indefinitely. Heck, you could (re)build an sha-1 commit history from an sha-3 one.

You could then identify commits with an extended identifier, something like "sha1:0000....ffff" or "sha3:ffff....0000". Assume "sha1" as the default if left unspecified, or make the default hash type a per-repository config option. That would allow different projects to decide which hash they want to use by default, based on the paranoia level of each project's maintainer.

Moving Git past SHA-1

joey — Mon, 27 Feb 2017 21:32:39 +0000

Since git-annex builds on top of git, it inherits its foundational SHA1 weaknesses. Or does it? Interestingly, when I dug into the details, I found a way to make git-annex repositories secure from SHA1 collision attacks, as long as signed commits are used (and verified).

When git commits are signed (and verified), SHA1 collisions in commits are not a problem. (The current SHA1 collision attack cannot usefully collide git commits either, although future attacks may.) And there seems to be no way to generate usefully colliding git tree objects (unless they contain really ugly binary filenames). That leaves blob objects, and when using git-annex, those are git-annex key names, which can be secured from being a vector for SHA1 collision attacks.

This needed some work on git-annex, which is now done, so look for a release in the next day or two that hardens it against SHA1 collision attacks. For details about how to use it, and more about why it avoids git's SHA1 weaknesses, see https://git-annex.branchable.com/tips/using_signed_git_co...