Moving Git past SHA-1
Git uses SHA-1 extensively. When a "blob" (a revision of a file, essentially) is placed in a repository, the blob's contents (with some Git metadata) are hashed, and the result is used to identify the blob thereafter. The other types of objects in a Git repository — "trees", identifying a directory hierarchy full of blobs, and "commits", describing revisions — are also identified by their SHA-1 hashes. The hash of each commit object is calculated from, among other things, the hash of the previous commit in the chain. The result is that the same commit ID in two repositories is, in the absence of hash collisions, guaranteed to refer not only to the same set of files, but to an identical history leading to that state.
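As an illustration of how those object IDs are derived (a sketch of the scheme, not Git's actual C code): Git prepends a header of the form "blob <size>\0" to a file's contents before hashing, so the function below reproduces what `git hash-object` prints for the same data.

```python
import hashlib

def git_blob_id(contents: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a blob."""
    # Git hashes a "blob <size>\0" header followed by the raw contents.
    header = b"blob %d\x00" % len(contents)
    return hashlib.sha1(header + contents).hexdigest()

# The well-known ID of a blob containing just "hello\n":
print(git_blob_id(b"hello\n"))  # -> ce013625030ba8dba906f756967f9e9ca394464a
```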
The use of SHA-1 in this way makes it difficult to tamper with the files in a repository; a change to a file will change the resulting hash, so the change will be noticed. Git thus functions as a sort of source-code blockchain that encodes the full repository history in its current state. If an attacker can generate two files with the same SHA-1 hash, though, it may be possible to substitute one for the other in a repository without being detected. That could be a way to get hostile code into any project stored in Git — an outcome that is generally viewed as a bad thing.
There are two separate issues that need to be considered when looking at the implications of SHA-1's demise: how urgent is the need for Git to switch to something else, and how can that switch be carried out?
The sky isn't quite falling yet
The discussion on the use of SHA-1 in Git is far from new. Linus Torvalds posted the first version of Git on April 7, 2005, saying: "It's not an SCM, it's a distribution and archival mechanism. I bet you could make a reasonable SCM on top of it, though." Three weeks later, a conversation on the wisdom of the SHA-1 choice was already underway. At that time, Torvalds responded that SHA-1 is not the real security mechanism used in Git and that, as a result, even a full compromise of the hash function would not necessarily be a problem.
His argument, essentially, was that the distributed nature of Git repositories makes an attack difficult even given the ability to generate collisions cheaply. Just replacing an object in a repository is not enough; the attacker would have to find a way to distribute that object to other repositories around the world, which is not an easy task. The colliding object would have to function as C source (if the kernel were the target of attack here), and would have to stand up to a casual inspection — it would have to look like proper kernel source. That increases the difficulty of generating the collision significantly. There are, he noted, easier ways to get bad code into the kernel: "So if you actually wanted to corrupt the kernel tree, you'd do it by just fooling me into accepting a crap patch. Hey, it happens all the time."
After the Google announcement, Torvalds posted a lengthy message about SHA-1 and Git. He pointed out that generating collisions in source code is harder than with PDF files (as Google used) because the latter can contain a great deal of invisible data that does not change the formatted result. The kernel's chain of trust is where the project's security really lies, he said. He also noted that the fingerprints of the technique used to generate this SHA-1 collision are easy to detect, so any eventual attack based on this method can be easily defended against. That said, he also noted (without getting into details) that there is a plan to move Git away from SHA-1 in the near future.
Torvalds's view is fairly sanguine, in other words; others are a bit more worried, for a number of reasons. Not everybody uses Git just for C source code, for example; less "transparent" file types might be more easily subject to attack. The scariest possibility might be firmware blobs, which are just binary code; modifications to such a blob will not be easy to notice via any sort of inspection.
The distribution argument has a significant flaw as well: Git repositories are often not as widely distributed as one might think. There are central hosting sites, such as GitHub and kernel.org, that contain large numbers of repositories; these sites routinely keep objects in a single, central store to reduce storage and backup needs. Kernel.org probably has hundreds of kernel repositories, but commit c470abd4fde40ea6a0846a2beab642a578c0b8cd (tagging the 4.10 release) is the same object in all of them. A single bad object in a central site like this could thus contaminate many repositories.
Joey Hess described how such an attack might be carried out. An attacker gets a subsystem maintainer to accept the "good" version of an object; meanwhile, the "bad" version is placed in a repository on the hosting site. When the maintainer pushes their repository to that site, the bad object may displace the good one, since it already exists in the repository with the SHA-1 ID of the good object. That bad object would then be propagated in any subsequent pushes or pulls.
It is also worth noting that there is a certain amount of invisible data even in "transparent" files like C source. The Git headers themselves have some dark corners where the bits needed to force a collision can be hidden. The good news there is that such an attack is relatively easy to detect. In many cases, the existing "git fsck" functionality will find it, and central sites tend to run fsck regularly already. It turns out that the Git transfer.fsckobjects configuration variable can be used to force a check whenever objects move between repositories. Even Torvalds was surprised to learn that this option exists; there is now talk of enabling it by default.
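Enabling that check is a one-line configuration change (shown here applied to all of a user's repositories; it can also be set per-repository):

```shell
# Verify objects whenever they move between repositories.
# fetch.fsckObjects and receive.fsckObjects are the per-direction
# equivalents that transfer.fsckObjects sets at once.
git config --global transfer.fsckObjects true
```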
Moving on
The Git developers may feel that the weakening of SHA-1 is not an emergency, but there also appears to be a strong consensus that, after all these years, the project needs to move on to a more secure hash algorithm. While it may be true that, as Torvalds said, there is a plan for this transition, it must be said that the plan is in a rather early stage, and that there are some problems that must be solved first.
The first of those is fairly prosaic. A quick look through the Git source turns up a great many variable declarations like:
unsigned char sha1[20];
In other words, the format of the hash used to identify every object in a Git repository is declared as a basic type with a hard-coded constant size. One might think that the developers involved could have avoided this situation, but it is an issue that must be dealt with now. Brian Carlson has been working on switching to an opaque struct object_id type for some time (he first mentioned this work in April 2014), but it is slow going. As of February 25, he still had over 1100 sites in need of conversion.
That work, in any case, is just code refactoring. A trickier task is figuring out how to introduce a new hash algorithm without breaking existing repositories, without requiring the rewriting of the history in those repositories, and maximizing interoperability. The plan, as worked out primarily by Torvalds and Jeff King, is to introduce a new blob type that is identified by a new hash type (possibly parameterizing the hash type so that the next transition is easier). These blobs could only exist in a repository managed by a version of Git that is new enough to understand their format.
Once the "use the new hash type" bit has been flipped on a given repository, all new objects must use that type. New objects would not be allowed to contain pointers to old-hash objects, with one exception: a new-hash commit could have an old-hash parent. The intent behind this rule is to make the transition to the new object IDs happen as quickly as possible; once the bit is flipped, all new work uses those IDs.
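The rule can be stated compactly; here is a sketch in Python (the function, its names, and the two-algorithm model are illustrative, not taken from Git's code):

```python
def link_allowed(referrer_algo: str, target_algo: str, link_type: str) -> bool:
    """Sketch of the proposed cross-hash reference rule.

    Objects may freely reference objects that use the same hash; the
    only permitted mixed link is a new-hash commit naming an old-hash
    commit as its parent, so history continues across the transition.
    """
    if referrer_algo == target_algo:
        return True
    return (link_type == "parent"
            and referrer_algo == "new" and target_algo == "old")

# A new-hash commit may build on old history...
assert link_allowed("new", "old", "parent")
# ...but may not point at an old-hash tree or blob.
assert not link_allowed("new", "old", "tree")
```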
One result of this approach would be some inevitable duplication of objects around the transition, as the same files are stored under both the old and new IDs. The alternative is to perform some sort of mapping or otherwise allow objects to be known under both the old and new IDs, but that would add some significant complexity and would also increase the amount of data stored in the repository. In fact, that kind of mapping could grow in a hurry as the number of hash algorithms used by Git grows. So it seems more likely that the one-time duplication cost is the path that will be chosen.
Once a repository moves to the new format, any other repositories that push to or pull from that repository will also have to change. An attempt to pull from a new-format repository into one that hasn't made the transition will simply fail. So there will be a flag day of sorts for most projects. In the kernel's case, there will presumably come a day when the kernel.org repository starts using new IDs, and the rest of the community will have to follow suit. Such a change should probably happen a fair while after Git itself is capable of using the new IDs so that the updated software is widely distributed by the time it is needed.
A tiny little detail that hasn't yet been worked out is which hash algorithm will be chosen to succeed SHA-1. Most developers appear to think that SHA-3 is the logical next step, but that discussion has not yet begun in earnest.
So, while the sky may not be falling, it is showing increasing signs of structural instability. As has been seen, moving Git to a new hash type is not a trivial task; it will not be accomplished overnight, or even this year if one looks realistically at what needs to be done. The time has certainly come for the project to finally start making real progress on this perennial wishlist item. The good news is that the developers involved would appear to have heard this message and are bringing a new focus to the task.
Index entries for this article:
    Security: SHA-1
Posted Feb 27, 2017 21:32 UTC (Mon)
by joey (guest, #328)
[Link]
When git commits are signed (and verified), SHA1 collisions in commits are not a problem. (The current SHA1 collision attack cannot usefully collide git commits either, although future attacks may.) And there seems to be no way to generate usefully colliding git tree objects (unless they contain really ugly binary filenames). That leaves blob objects, and when using git-annex, those are git-annex key names, which can be secured from being a vector for SHA1 collision attacks.
This needed some work on git-annex, which is now done, so look for a release in the next day or two that hardens it against SHA1 collision attacks. For details about how to use it, and more about why it avoids git's SHA1 weaknesses, see https://git-annex.branchable.com/tips/using_signed_git_co...
Posted Feb 27, 2017 22:47 UTC (Mon)
by Karellen (subscriber, #67644)
[Link] (15 responses)
Would it increase the amount of data stored in the repository that much? All the blobs themselves would be exactly the same. A blob with an sha1 hash of "0000....ffff" but an sha3 hash of "ffff....0000" would only need to be stored once, but with two hardlinks - one for each hash type. Even a pack file, containing a bunch of blobs, could contain all the blob/diff data only once, but with each blob/diff referred to with multiple names.
The only duplicated information would be the tree, commit and tag objects. OK, for a large project with a lot of history, they'd take up a not-entirely-negligible amount of space, but it would be a pretty small fraction of the blob space, wouldn't it?
And you could build a complete set of sha-3-identified trees and commit history from the sha-1 equivalent, and manage them side-by-side indefinitely. Heck, you could (re)build an sha-1 commit history from an sha-3 one.
You could then identify commits with an extended identifier, something like "sha1:0000....ffff" or "sha3:ffff....0000". Assume "sha1" as the default if left unspecified, or make the default hash type a per-repository config option. That would allow different projects to decide which hash they want to use by default, based on the paranoia level of each project's maintainer.
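Parsing such prefixed identifiers is straightforward; a minimal sketch (the "sha1:"/"sha3:" syntax is the proposal above, not an existing Git feature):

```python
def parse_ref(ref: str, default_algo: str = "sha1"):
    """Split an "algo:hexdigits" identifier into (algo, hexdigits).

    A bare hash with no prefix is attributed to the default algorithm,
    which could itself be a per-repository configuration option.
    """
    algo, sep, digits = ref.partition(":")
    if not sep:  # no ":" present: a plain, unprefixed hash
        return default_algo, ref
    return algo, digits

print(parse_ref("sha3:ff200002"))  # -> ('sha3', 'ff200002')
print(parse_ref("00200002"))       # -> ('sha1', '00200002')
```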
Posted Feb 27, 2017 23:04 UTC (Mon)
by nix (subscriber, #2304)
[Link] (14 responses)
Nobody's written that code, so with current code it will assume all the blobs are different and write them out all over again, even though they're all identical. Not even repacking all objects would fix that, because that doesn't rehash anything either...
Linus seems to be happy with this, but having seen the amount of space it would waste forevermore in e.g. the Chromium repo, I'm not so happy :(
Posted Feb 28, 2017 1:07 UTC (Tue)
by BenHutchings (subscriber, #37955)
[Link] (1 responses)
Posted Feb 28, 2017 12:03 UTC (Tue)
by nix (subscriber, #2304)
[Link]
OK, all my objections to this scheme are gone then: except for a slightly increased diff cost across the boundary (equal to, uh, what non-git users pay all the time) it is costless. :)
Posted Feb 28, 2017 3:27 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (1 responses)
Posted Feb 28, 2017 6:54 UTC (Tue)
by johill (subscriber, #25196)
[Link]
Posted Feb 28, 2017 8:39 UTC (Tue)
by Karellen (subscriber, #67644)
[Link] (7 responses)
Sorry, I wasn't explaining myself clearly.
Rather than have a new-hash commit that's a child of an old-hash commit, maintain two complete sets of commits/history, one for each hash. You can build one history from the other.
e.g. Say you have a repository containing a single file with two revisions. At the moment, you have a repo like (excuse shortened and comically simplified hash values):

    .git/
      objects/
        00/000001 -- first revision of file
        00/000002 -- second revision of file
        00/100001 -- tree containing first revision of file
        00/100002 -- tree containing second revision of file
        00/200001 -- commit containing first tree
        00/200002 -- commit containing second tree
      refs/
        heads/
          master -- points to 00200002

Now, replace that with:

    .git/
      objects/
        sha1/00/000001 -- first revision of file
        sha1/00/000002 -- second revision of file
        sha1/00/100001 -- sha1-based tree containing first revision of file
        sha1/00/100002 -- sha1-based tree containing second revision of file
        sha1/00/200001 -- sha1-based commit containing first tree
        sha1/00/200002 -- sha1-based commit containing second tree
        sha3/ff/000001 -- first revision of file
        sha3/ff/000002 -- second revision of file
        sha3/ff/100001 -- sha3-based tree containing first revision of file
        sha3/ff/100002 -- sha3-based tree containing second revision of file
        sha3/ff/200001 -- sha3-based commit containing first tree
        sha3/ff/200002 -- sha3-based commit containing second tree
      refs/
        heads/
          master -- points to sha1:00200002
In the above case, sha1/00/000001 and sha3/ff/000001 are hard links to the same file. The blob is identical in each case, it's just linked to multiple times with different names. The same goes for the second revision of the file. Only the tree and commit objects need to be duplicated for each hash, but they're going to be orders of magnitude smaller than the sizes of the blobs. You can build one tree from the other, and even build an entirely new commit history based on a different hash at any time in the future. When you create a new commit, a new commit object will be created for each hash you've configured your repo to use.
Anyone can refer to a commit by either id (sha1:00200002 /or/ sha3:ff200002), and, provided you have the relevant commit history built in your repo, you can see which commit they're talking about. References to old commits are still valid, because you still have (and can continue to maintain) the old hash history.
A project could say "from date X we will talk to each other about commits using sha3 refs by default", but that won't need to change how your repo actually works under the hood.
Does that explain what I'm trying to get at a bit better?
Posted Feb 28, 2017 11:19 UTC (Tue)
by andrewsh (subscriber, #71043)
[Link] (5 responses)
Posted Feb 28, 2017 12:06 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Ah well, disk space really is cheap: I've just never internalized it properly. 250MiB still seems like a lot rather than about 1p. (And that price is taken from *enterprise* storage costs. Disks that normal humans not obsessed with RAIDing everything use will be even cheaper.)
Posted Mar 1, 2017 10:39 UTC (Wed)
by zenaan (guest, #3778)
[Link]
Is 250MiB just the mapping size (i.e. no object data)? Sounds surprising, but perhaps not for the number of years (rather, objects) involved?
Flag day doesn't sound so bad - a repo can simply store some meta-data for its UTC flag day or "flag moment", with a certain quiescence period (say 15 minutes) during which pulls and pushes are not accepted... thereby ensuring world-sync for the flag moment, yes?
Posted Feb 28, 2017 13:06 UTC (Tue)
by Karellen (subscriber, #67644)
[Link]
I have basic familiarity with *some* git internals, but "how packs are formatted" has never been something I've needed to delve into. It's entirely possible that could scuttle the whole idea.
Posted Feb 28, 2017 14:59 UTC (Tue)
by garyvdm (subscriber, #82325)
[Link] (1 responses)
I think (I'm not 100%) that the object ids are not stored in the pack file, only in the pack index. The pack file can be enumerated, and the object ids can be calculated. Hence it would be possible to have one pack file, and multiple index files for each object id type.
Such a repo might look like this:

    .git/
      objects/
        pack/
          pack-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.idx
          pack-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.idx-sha3
          pack-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.pack
Posted Feb 28, 2017 15:11 UTC (Tue)
by nix (subscriber, #2304)
[Link]
But the mapping from SHA-1 IDs to object offsets in each pack is computed at receive time and stored in the index. (This means that every object lookup must check each index in turn, which is why occasionally repacking them into a single big pack without actually recompressing any content at all is still a significant win.)
Posted Feb 28, 2017 22:49 UTC (Tue)
by magfr (subscriber, #16052)
[Link]
What happens under this scheme when I add a second blob with the same sha1 hash as one previously in the repository but a different content?
Posted Feb 28, 2017 21:30 UTC (Tue)
by jengelh (guest, #33263)
[Link] (1 responses)
Not sure how hashes are stored internally, but pretending they are dealt with in their hex form, why not add a character not in the [0-9a-f] alphabet?
if (strlen(hash) == 40 && strchr(hash, ':') == NULL) { method = sha1; } else { (method, hash) = split(hash, ":"); } kind of thing.
Something like that worked for shadow(5) too when $2a$ et al support was added.
Posted Mar 1, 2017 0:08 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Feb 28, 2017 16:38 UTC (Tue)
by Tara_Li (guest, #26706)
[Link] (8 responses)
There is, of course, the trade-off of the hashes ending up longer than a significant fraction of your files, but ...
How much more effective is SHA3 than SHA1? Is it expected to take 10e+3, 10e+30, or 10e+300 as long for the same length hash to create a collision? Or do the hashes keep getting longer, and if so - can't we just make SHA1 hashes longer?
Posted Feb 28, 2017 17:00 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (4 responses)
It's not really about relative strengths of the various hashes - all of them are theoretically unbreakable in reasonable time, even SHA-1, absent a weakness unknown at the time the hash was standardised. The issue is that weaknesses have been found in SHA-1 that reduce the complexity dramatically - from "not before heat death of the universe" to "can be done for under $1,000,000 in computer time".
So far, no similar design flaw has been found in SHA-2 or SHA-3. However, no such design flaw was known when SHA-1 was standardised, either.
Posted Feb 28, 2017 17:21 UTC (Tue)
by Tara_Li (guest, #26706)
[Link] (2 responses)
Posted Feb 28, 2017 18:28 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
There are two reasons to not simply extend the SHA-1 construction:
That second point is a doozy - you expect 2^80 operations (as it's 160 bits long - a 256-bit hash would need 2^128 operations) to break SHA-1, but the attack reduces that to around 2^63 operations. You may be able to extend it to 2^80 operations by extending the hash to (say) 200 bits, but an attacker building on the recent work could then reduce it back down to 2^63 operations, and you've got to pay the cost of a new hash again.
Posted Feb 28, 2017 18:41 UTC (Tue)
by mlankhorst (subscriber, #52260)
[Link]
Posted Feb 28, 2017 19:29 UTC (Tue)
by Otus (subscriber, #67685)
[Link]
SHA-1 was never at the "heat death of the universe" level. 160 bits means a maximum collision resistance against brute force of 2^80, which is more like "current world computing power for a year". Give or take an order of magnitude.
> So far, no similar design flaw has been found in SHA-2 or SHA-3. However, no such design flaw was known when SHA-1 was standardised, either.
It took ten years to have theoretical attacks on full SHA-1. SHA-2 has stood for over fifteen, still has a clear margin between rounds broken and total rounds, not to mention would need much more than the 2^16x speedup that was found for SHA-1 to be attacked in practice.
Posted Feb 28, 2017 18:41 UTC (Tue)
by drh (guest, #65025)
[Link] (2 responses)
SHA1 (also MD5 and SHA2) has the weakness that once you find one collision, it becomes easy to find lots of other collisions that have the same prefix. SHA3 does not have this weakness (at least as far as is publicly known at the moment).
Posted Mar 1, 2017 1:36 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link]
Construction-wise this problem was literally inevitable in the M-D family. The entire internal state is emitted, so if you can find A, B such that hash(A) = hash(B) then you can't help but have hash(A | X) = hash(B | X), albeit this applies to an internal representation so for practical real world inputs there's some fiddling about with padding to consider.
SHA-3 is a sponge design, so there's a separate operation to "squeeze out" the internal state into a hash result once you've "soaked up" all the bytes to be hashed. You could in principle still smash the entire internal state, and then make more collisions with suffixes but it seems far more likely that an attack would be content to collide one particular output hash that was squeezed out, despite the internal state not being fully collided, in this case suffixes would not collide.
Posted Mar 3, 2017 21:41 UTC (Fri)
by kleptog (subscriber, #1183)
[Link]
If we thought 160 bits of collision resistance was fine, then it doesn't seem smart to start using 512 or even 256 bits for Git identifiers. Just pick a safer hash and truncate it to the length you want.
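That approach is easy to express with a standard library (SHA3-256 here is just one candidate; the choice of hash and truncation length is the suggestion above, not a Git decision):

```python
import hashlib

def truncated_id(data: bytes, bits: int = 160) -> str:
    """Truncate a stronger hash down to a SHA-1-sized identifier.

    Truncation preserves the hash's strength up to the shorter length:
    a 160-bit truncation of SHA3-256 offers roughly 2^80 collision
    resistance, the same as an unbroken 160-bit hash.
    """
    digest = hashlib.sha3_256(data).hexdigest()
    return digest[: bits // 4]  # one hex character per 4 bits

print(len(truncated_id(b"example")))  # -> 40, the width of a SHA-1 hex ID
```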
Posted Mar 1, 2017 1:39 UTC (Wed)
by kjp (guest, #39639)
[Link] (6 responses)
Posted Mar 1, 2017 2:18 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link] (4 responses)
Posted Mar 1, 2017 12:23 UTC (Wed)
by epa (subscriber, #39769)
[Link] (3 responses)
The nice property is then that all existing hash prefixes, and 40 character hashes, continue to be valid. If you want to have the greater collision resistance of SHA3-256 then you can use the new jumbo-sized 106 character hashes.
Posted Mar 1, 2017 12:25 UTC (Wed)
by epa (subscriber, #39769)
[Link]
Posted Mar 2, 2017 2:41 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Mar 2, 2017 11:09 UTC (Thu)
by jwilk (subscriber, #63328)
[Link]
Posted Mar 2, 2017 12:25 UTC (Thu)
by emorrp1 (guest, #99512)
[Link]
Ultimately I wouldn't worry about it as there isn't yet a simple solution and the developers have now raised the priority of this work, so they'll think through the various options to come up with a compromise that is both implementable and reasonable.
Detecting attempted collisions with sha1dc
Posted Mar 2, 2017 10:06 UTC (Thu)
by jnareb (subscriber, #46500)
[Link]
This way with `transfer.fsckObjects` any attempt can be easily detected.
[1]: https://github.com/cr-marcstevens/sha1collisiondetection
Posted Jan 17, 2018 16:38 UTC (Wed)
by anarcat (subscriber, #66354)
[Link]