
Linus on Git and SHA-1

Linus Torvalds has posted a lengthy explanation of why the recently created SHA-1 collision is not an emergency for Git users. "In the pdf examples, the pdf format acted as the 'black box', and what you see is the printout which has only a very indirect relationship to the pdf encoding. But if you use git for source control like in the kernel, the stuff you really care about is source code, which is very much a transparent medium. If somebody inserts random odd generated crud in the middle of your source code, you will absolutely notice." That said, he notes that there is work in progress to move away from SHA-1.

[It seems that subversion users have an additional set of concerns; see this bug report conversation for the scary story.]



Proof

Posted Feb 25, 2017 23:56 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (36 responses)

This is why proofs of concept like Google's PDF stunt are necessary. This didn't come out of nowhere. We have known for about a decade that SHA-1 would be vulnerable to a collision attack, it was just a matter of when it would be developed and how much it might cost. But merely knowing what will happen is ignored in favour of happy dreams, maybe it won't happen, let's work on other things which are "more important". Even now, with the proof of concept, Linus is saying it's not really a big deal. OK, the building is actually on fire, I guess we'll start looking for our coat and maybe saunter towards the door ? There's no hurry.

Git developers get to be a bit smug about Subversion. Ha! Their design was a bit more vulnerable and they didn't move fast enough, pinned under a collapsed beam in the burning building. Git won't be that dumb, why as soon as we've picked up our wallet we'll be... oh, that is a _funny_ cat video, what were we doing? Right yes, house on fire, we should find our shoes! Hmm, did we turn off the TV?

In places (like SSL certificates or web browser crypto) where it was even more obvious (not that Subversion or git have any excuse) that SHA-1 was a problem, we still had an up-hill battle to get people to accept that they'd have to put up with the discomfort of actually migrating to another hash before the whole building went up in flames. In December 2016 we were still seeing commercial payment gateways arguing that their incompetence plus the holiday season ought to be enough reason to justify yet more months of trusting SHA-1, even though some of the people in those meetings will have known their colleagues almost had a working SHA-1 collision already and thus nation states were undoubtedly far ahead of them.

I really want to believe that Git will get this done quickly, will find a way to make it relatively seamless, and a year from now we'll be wondering what the fuss was all about. But I worry that a year from now Linus will still be hand-waving about how it's not /that/ urgent to migrate, and _five_ years from now it'll finally be underway in earnest when we find out some obscure but important git repo was busted this way for the past 18 months...

By the way, lots of Subversion people seem to be assuming that the hack they're using to detect this one specific prefix is the best way to defend against this. That's _really_ really not so. Because the type of attack was known so far in advance of an actual attack being implemented we have had tools for some years now that can detect blobs which are likely to mess with the internals of the hash. There is a _minuscule_ risk of false positives, so this should be deployed wherever you can't stop using SHA-1 but want some defence against collisions, here's the latest example:

https://github.com/cr-marcstevens/sha1collisiondetection

Proof

Posted Feb 26, 2017 0:25 UTC (Sun) by Wol (subscriber, #4433) [Link] (31 responses)

> I really want to believe that Git will get this done quickly, will find a way to make it relatively seamless, and a year from now we'll be wondering what the fuss was all about. But I worry that a year from now Linus will still be hand-waving about how it's not /that/ urgent to migrate, and _five_ years from now it'll finally be underway in earnest when we find out some obscure but important git repo was busted this way for the past 18 months...

From what I can understand though, the difference between Git and SVN is in the way they receive the data.

SVN receives the data and writes it into the repository. Pushing corrupted data corrupts the receiving repository.

Git receives the data, hashes it, and appends it to the repository. If there's a collision, the append fails with "can't overwrite existing data". So pushing corrupted data causes the corrupted data to be dropped. Okay, you've now got two repositories out-of-sync with each other (and if you can get someone to update-pull from the corrupt repository your attack will succeed), but Linus says this out-of-sync-repositories situation is actually very easy to spot.
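
A toy sketch (Python) of that "hash it, then never overwrite" idea; this is an illustration only, not git's actual object-database code, and the ObjectStore class is made up for the example:

    import hashlib

    class ObjectStore:
        """Toy content-addressed store: an existing object is never replaced."""
        def __init__(self):
            self.objects = {}

        def put(self, data: bytes) -> str:
            key = hashlib.sha1(data).hexdigest()
            if key in self.objects:
                if self.objects[key] != data:
                    # Colliding-but-different object: keep the original,
                    # reject the newcomer.
                    raise RuntimeError("collision: refusing to overwrite " + key)
                return key          # identical content already present, nothing to do
            self.objects[key] = data
            return key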

Cheers,
Wol

Proof

Posted Feb 26, 2017 1:51 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (21 responses)

Subversion got unlucky, with a hilarious bug that was tripped by the collision, but it's not how bad guys would attack a CMS. Linus & co. have actually discussed how to simulate the collided import and check it won't break Git, by basically monkey-patching the hash checks in the code to produce a fake collision and then checking that it errors out or does something sensible. You seem very confident it won't break, but they haven't actually tested yet, so you're basically saying "I am confident Git has no bugs" which is... not so smart.

Anyway, real collision _attacks_ aren't about blowing up the collision handling code (or lack of it) in your CMS. They're about tricking you into thinking one thing is another thing, usually by showing one person A and another person B telling both they're the same thing X and knowing that they'll only check hash(X) and so if hash(A) = hash(B) they'll believe they're seeing their same thing when they aren't.

And the only long term solution to that is to use a good hash so that hash(X) = hash(A) is adequate to conclude that A = X. Git doesn't do that today, and it doesn't even yet have a proposal for how it'll do it tomorrow, let alone how it'll migrate everybody. Remember, nobody who pays the least attention was blind-sided by this, a more careful designer would have rejected SHA-1 when Git was started over a decade ago.

(2004, Schneier recommends SHA-1 not be used in new applications: Rumours of fire)
"Don't worry, there's no fire. An alarm would sound"
(2012, researchers publish estimated prices to collide SHA-1: A fire alarm is ringing)
"That's just the fire alarm, it's probably nothing. I can't smell any smoke, can you?"
(2015, "The SHAppening" free-start collisions are published and prices are revised down significantly: There is a strong scent of smoke)
"That smell might not be smoke though, could be something else. Wait until we see smoke"
(2017, "SHAttered" full SHA-1 PDF collision using a fixed prefix: Clouds of smoke are filling the room)
"Well, don't you worry, if we see an actual fire I know I have an extinguisher around here somewhere"

Proof

Posted Feb 26, 2017 16:06 UTC (Sun) by Wol (subscriber, #4433) [Link]

> And the only long term solution to that is to use a good hash so that hash(X) = hash(A) is adequate to conclude that A = X. Git doesn't do that today, and it doesn't even yet have a proposal for how it'll do it tomorrow, let alone how it'll migrate everybody. Remember, nobody who pays the least attention was blind-sided by this, a more careful designer would have rejected SHA-1 when Git was started over a decade ago.

Except that I think Linus looked at it as an engineer (as he has a tendency to do). "Collisions are rare, but inevitable. Code to handle them will be buggy. If we hit one, PANIC".

That does make exploits difficult, in that any exploit needs to make sure that no-one ever compares the exploited and the exploit-free trees - rather difficult to ensure in practice. (Okay, it also needs the maintainers to be aware of the possibility, and to look for it, in order for the system to be reasonably fail-safe, and I know that's asking a lot...)

Cheers,
Wol

Proof

Posted Feb 26, 2017 22:36 UTC (Sun) by moltonel (guest, #45207) [Link] (3 responses)

> And the only long term solution to that is to use a good hash so that hash(X) = hash(A) is adequate to conclude that A = X

Meh, that still relies on being lucky, and ignores the advances in both computing power and knowledge of each algorithm's weaknesses. If you want to be safe you can use a hash to speed things up, but if the hash of new data matches old data, you need to check the actual bytes for a collision. That's what many hashmap containers do: the hash gets you to the proper location in an array, and at that location is a linked list of all values with that hash.

Failing that (if you don't have the bytes available to compare, so not in the git/svn case), it's probably safest to use multiple hashes. Gentoo for example uses SHA256+SHA512+WHIRLPOOL+filesize to check that the downloaded file is correct.
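
A minimal sketch of that hash-then-compare-the-bytes approach (plain Python; DedupStore is a made-up name for the example, not any real container):

    class DedupStore:
        def __init__(self):
            self.buckets = {}                 # cheap hash -> list of blobs

        def add(self, blob: bytes) -> bool:
            """Return True if the blob was new, False if an identical one existed."""
            key = hash(blob) & 0xffff         # deliberately cheap, collision-prone hash
            chain = self.buckets.setdefault(key, [])
            for existing in chain:
                if existing == blob:          # equality decided by the actual bytes
                    return False
            chain.append(blob)                # same hash, different bytes: both kept
            return True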

Proof

Posted Feb 26, 2017 23:17 UTC (Sun) by excors (subscriber, #95769) [Link] (2 responses)

> Meh, that still relies on being lucky

DRAM error rates are apparently somewhere vaguely around 2^-35 errors per bit per hour. If you have two files that differ by a single bit, and you leave them in memory for a millisecond before memcmping them, that's about a 1 in 2^56 chance of one of the differing bits flipping and giving a false match. With a non-broken 256-bit hash function, there's only a 1 in 2^256 chance of two different files giving a false match. So by checking the actual bytes you're massively *increasing* the risk of an incorrect result caused by bad luck.
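
Back-of-the-envelope arithmetic with the assumed rates above (order-of-magnitude guesses, not measurements):

    from math import log2

    per_bit_per_hour = 2.0 ** -35          # assumed DRAM flip rate
    exposure_hours = 1e-3 / 3600           # one millisecond, in hours
    # Two copies in memory, one differing bit each; either may flip to create a false match:
    p_false_match = 2 * per_bit_per_hour * exposure_hours
    print(log2(p_false_match))             # about -56, i.e. roughly 1 in 2^56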

Proof

Posted Feb 27, 2017 15:56 UTC (Mon) by hmh (subscriber, #3838) [Link] (1 responses)

Not with ECC. If you don't have it, well... Hopefully you're using something that detects corruption on data at rest in memory.

Also, as far as the crypto goes, it has been proved that a combination of several hashes is approximately as strong as the strongest one, at least against brute force.

It would make sense to use very different hashes with the same strength as a hedge against a weakness being found in one of them. But that would mean, e.g., using both SHA2-256 and SHA3-256 (SHA-3 is a very different construction from SHA-2 exactly for that reason; otherwise BLAKE might have been selected for SHA-3). And the total strength would be 256 bits, not more.

Proof

Posted Feb 27, 2017 17:45 UTC (Mon) by excors (subscriber, #95769) [Link]

I used the numbers for ECC-detected correctable errors (which should be harmless) from https://research.google.com/pubs/pub35162.html , but that seems to indicate detectable uncorrectable errors are only ~40 times rarer. Presumably some undetectable multi-bit errors are possible in ECC too, though very much rarer again. But still astronomically more likely than an accidental hash collision.

The point is that you can't practically divide algorithms into "definitely correct" vs "probably correct but relies on being lucky" - everything running on hardware has some non-zero chance of giving the wrong result. Instead you need to look at how close to zero those chances are, and divide them into "absurdly unlikely to ever fail even if every computer in the world runs this algorithm for a million years" vs "might actually fail in practice", and a non-broken hash function fits into the first category. (memcmp on a system with ECC RAM probably goes in the first category too, but memcmp without ECC probably doesn't.)

(This is separate from the issue that non-broken hash functions have a tendency to become broken hash functions after a few decades. Any system that's designed to use a secure hash ought to be designed to migrate cleanly to a new hash function in the future.)

My God are you alarmist

Posted Feb 27, 2017 9:42 UTC (Mon) by linuxrocks123 (subscriber, #34648) [Link] (15 responses)

Protip: If you think the building you're in is on fire, but you see people calmly milling around, perhaps you should consider another possibility -- say, "Maybe I'm not perceiving this right" -- before immediately concluding "EVERYBODY EXCEPT ME IS INSANE!"

Guess what? rsync uses Adler-32 and MD5, and it works fine, and the world is still here. No fire.

My personal hope? SCMs will use this opportunity to change their code not to rely on cryptographically secure hashes. They have no fundamental need for them, and assuming

(hash(A) = hash(B)) -> (A = B)

is obviously logically unsound, sometimes unsafe, and always the worst of kludges. Subversion deserves the mess it's in now, and I say that even as I now have to make sure never to commit those two files to my giant Subversion repository that controls my home directory thanks to their f-up. I hope they will fix their design rather than kick the can down the road by replacing SHA-1 with something new. You don't need a cryptographically secure hash, guys: don't use one, and don't assume the unsound.

My God are you alarmist

Posted Feb 27, 2017 15:02 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (14 responses)

By default rsync doesn't bother hashing the contents of files at all; it trusts that if they have the same name and timestamps, they're "probably" the same file, no need to look any further. This is great for the backup of my rotating collection of TV recordings yet-to-be-watched; the shows are huge, maybe 2GB each, so just reading them all off spinning disks to compute a checksum would take _ages_, and I don't want rsync wasting my resources on that every time it runs. But if I was worried about bad guys maybe replacing a document with a different document, then such a slap-dash approach won't cut it, and rsync is completely the wrong tool.

SCMs have chosen crypto hashes because they want content addressability and a crypto hash lets you achieve that efficiently. Git in particular doesn't bother doing very much else except content-addressability; this is part of how it's able to be distributed so easily: unlike other schemes, content addressing means two parties are sure to agree on what to name the same thing.
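
For a sense of how little machinery content addressing needs, here is the well-known git blob naming scheme in a few lines of Python (a sketch only; real git of course does far more than this):

    import hashlib

    def blob_id(data: bytes) -> str:
        header = b"blob %d\0" % len(data)        # git's blob header: type, size, NUL
        return hashlib.sha1(header + data).hexdigest()

    # Prints the same digest on every machine, with no coordination needed.
    print(blob_id(b"hello world\n"))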

Insisting that hash(A) = hash(B) ⇏ A = B is all very well logically, but as a practical matter all it does for you is that you're now obliged to send a LOT of bits everywhere to do anything, because the only useful content addressable reference for A other than hash(A) is A itself. And what do you get for this enormous increase in bandwidth, CPU power and disk space? Er, well, in theory there's a minuscule chance it'll catch an error case some day, for someone, somewhere. How minuscule ? Let's find out:

For example compared to a 256 bit crypto hash - if you use this new SCM to create a billion-billion-billion (10^27) new documents per second (good luck with that), and compare them all against all the other documents ever created this way (remember you need to store them all somewhere), by the year 9999 CE you'd probably run into one collision which you'd detect but the hash would not. Although maybe not, chance can be a funny thing.
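
The rough arithmetic behind that estimate (assuming the stated rate, a start in 2017, and a 256-bit hash):

    from math import log2

    docs_per_second = 1e27
    seconds = (9999 - 2017) * 365.25 * 24 * 3600
    print(log2(docs_per_second * seconds))   # about 128
    # The birthday bound says a first collision in a 256-bit hash only
    # becomes likely once you approach ~2^128 documents.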

All worth it right? Unfortunately all the extra bandwidth, CPU power and disk space incurs its own risk of errors, which are far larger. I mean, tiny, but because the risk from using a good crypto hash was so minute these tiny risks dwarf it. You've wasted all those resources AND made things worse.

So that's why nobody is doing that.

There Is (Almost) Never One "Only Other Way"

Posted Feb 27, 2017 22:28 UTC (Mon) by linuxrocks123 (subscriber, #34648) [Link] (13 responses)

There is a _LOT_ wrong with your argument here. Content addressability is not even necessary for an SCM. It may be desirable, but it isn't necessary. Moreover, you can use hashes for content addressability without assuming hash(A) = hash(B) --> A = B. The various methods for doing this are usually discussed in a second or third semester computer science course.

There Is (Almost) Never One "Only Other Way"

Posted Feb 27, 2017 23:29 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (12 responses)

For an argument with a "lot" wrong with it you seem to have conceded almost all of it?

Of _course_ content addressability isn't obligatory for an SCM; SCCS doesn't have content addressability. But it turns out people actually want their SCM to be much better than SCCS, and lots of the features which you may not care about rely on this choice. Git is so reliant upon this choice that Linus says it's better to think of the version control features as just a sort of thin layer of icing; the main thing he built is a content addressable filesystem.

Likewise, doubtless any number of amazing schemes could be dreamed up for content addressability, and I have no doubt that somewhere a Computer Science course has decided to spend lots of time walking students through one or more of the fancier examples, just as one of my CS undergrad courses spent a bunch of time trying to brutalize the C pre-processor into a way of managing HTML files so each student could make their own web page (this probably dates me); everything is being taught somewhere. But that didn't make the C pre-processor suitable for this task, and fancier schemes don't fix the problem you're complaining of.

The Pigeonhole Principle, the same thing that forbids infinite compression algorithms and other such schemes, requires that you accept some non-zero risk of collisions if you're going to throw some of the bits away. Doesn't matter how you do it, the risk is the same - which is why we pick crypto hashes, because they have the best properties we could hope for under the circumstances.

There Is (Almost) Never One "Only Other Way"

Posted Feb 28, 2017 10:03 UTC (Tue) by linuxrocks123 (subscriber, #34648) [Link] (11 responses)

> For an argument with a "lot" wrong with it you seem to have conceded almost all of it?

I'm not sure how you misread what I wrote to come to that conclusion.

> Git is so reliant upon this choice that Linus says its better to think of the version control features as just a sort of thin layer of icing, the main thing he built is a content addressable filesystem.

So Git will continue to rely on that unsound kludge, then. That's a shame, but I don't personally use it for anything serious, so I guess I don't care.

> trying to brutalize the C pre-processor into a way of managing HTML files so each student could make their own web page

That sounds beyond horrible.

> The Pigeonhole Principle, the same thing that forbids infinite compression algorithms and other such schemes, requires that you accept some non-zero risk of collisions if you're going to throw some of the bits away. Doesn't matter how you do it, the risk is the same

So don't do that. Uniquely identify things some other way, like with a hash plus its location on a linked list of things that hash to the same thing. Or don't use hashes at all and use that tree thing where each letter in the string moves you down a level of the tree. I'm not going to write out a full design for some hypothetical SCM that doesn't rely on hashes not having collisions.

> we pick crypto hashes, because they have the best properties we could hope for under the circumstances.

Yeah ... if you're going to try to use hash(A) = hash(B) --> A = B, crypto hashes are about the only way to make that sort of work. If you're _NOT_ going to make that assumption, and you shouldn't, you can use much cheaper hashes and save some CPU time.

The only way people get away with abusing crypto hashes for this at all is because SCMs and the other places this hack is used don't reside in memory, and the hashes don't need to be taken frequently.

But, hey, if Git people thinks it's easier to break the universe every five years to switch to a new hashing algorithm, be my guest. It's not my universe. I just hope the SVN people have more sense and fix the problem correctly. The design of Subversion doesn't rely on the unsound assumption, just this one optional deduplication feature (representation sharing). And fixing that would be as easy as just comparing the actual files before doing the deduplication. Easy.

There Is (Almost) Never One "Only Other Way"

Posted Feb 28, 2017 13:48 UTC (Tue) by madscientist (subscriber, #16861) [Link]

Subversion is not a DVCS. Git is. If you are willing to postulate a central service that can arbitrate collisions, like a Subversion server, then sure, they're not hard to handle... but that is a fundamentally different system than Git. If you have no central service then you cannot absolutely guarantee complete uniqueness... and at that point we're just talking about levels of unlikelihood. If you're not happy with the current essentially non-existent chance of a collision, then why would you be happy with half of that chance? It's still not zero.

Which is not to say that I think Git shouldn't ensure that a collision won't corrupt the repository... but anything beyond that is useless work.

> if Git people thinks it's easier to break the universe every five years to switch to a new hashing algorithm

I know you're being hyperbolic but just for posterity: Git is currently 12 years old and I wouldn't be surprised if it's another 2 years or more before a new hashing algorithm is in common use. If we get to a point where cryptographic hashes are being broken within 5 years we'll have much bigger problems than Git collisions. Also, the introduction of the new hash won't "break the universe"--plans for introducing it all include backward-compatibility. It's clearly a non-starter to require everyone to rewrite their entire history, and it's not necessary.

There Is (Almost) Never One "Only Other Way"

Posted Feb 28, 2017 14:24 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (9 responses)

"So don't do that. Uniquely identify things some other way, like with a hash plus its location on a linked list of things that hash to the same thing."

There are three things you could choose here; let's look at each in turn, because I think you are gradually learning a little bit.

1. Everybody makes their own linked lists. Whenever we look at any object we need to examine it in order to compare it with things in our list for the same hash and figure out if it was already on the list. We can't share our "unique" identifiers for things, because we each have our own linked list with different things in it and in a different order, so the "unique" identifiers are only locally unique. When we discuss things with other people the only way to refer to them is by sending the entire thing to be discussed, even if we're pretty sure they have it, we can't identify it to them any other way because they have different linked lists.

You can build a pretty good little local revision control system this way. It's a bit resource hungry, slow, disk bandwidth intensive, but it works well. Attempts to build a distributed revision control system are painful, although centralised revision control is practical for people with plenty of network bandwidth so they can upload and download all the things that the remote system needs to find / add in its linked lists.

2. A Central Authority makes the linked lists. The unique identifiers are global, but we can't find out the identifier for anything unless we either send it to the Central Authority for them to identify, or we have a (potentially partial) copy of the Central Authority's linked list to compare against. We can discuss things with other people because we share these globally unique references, but we're restricted to only talking about things the Central Authority has seen. Somebody has to pay for the Central Authority, which also functions as a global library of all human knowledge, albeit one that doesn't have a useful index.

This time using the system at all (even locally) is clumsy and slow unless you have a high bandwidth connection to the Central Authority or (for read only use) a complete and up-to-date copy of its linked lists. Good news is that distributed usage is no harder than local use, bad news is that's not saying much.

3. We use an _imaginary_ totally ordered list instead of a real linked list, so that we can work out where anything would be in this totally ordered list and write that location down without keeping all the items from the list. For a crypto hash we can't even attempt this, but with a weak or trivial hash this isn't too hard to arrange. The hash plus the location number uniquely identify anything. This seems like an amazing breakthrough, until we remember the Pigeonhole Principle. Where do all the bits go? Ah, they're in the location number. Instead of N=9703384921593020482 I now have hash(N) = 446 plus location(N) = 27829639719764153, I write those down, and er... wait, why did we invent this complicated scheme that has no benefit whatsoever?

This system is pointless and nobody would build it.

There Is (Almost) Never One "Only Other Way"

Posted Feb 28, 2017 15:56 UTC (Tue) by linuxrocks123 (subscriber, #34648) [Link] (8 responses)

I'm happy you were able to think of three ways to do something. I'm less happy you think your inability to think of any other ways to do it is a nonexistence proof. Again, I'm not going to write out a full design for some hypothetical SCM that doesn't rely on hashes not having collisions. I have better things to do.

There Is (Almost) Never One "Only Other Way"

Posted Feb 28, 2017 18:06 UTC (Tue) by nix (subscriber, #2304) [Link] (6 responses)

This smacks of "my arguments were demolished so I'm going to flounce out while pretending not to do so".

Fundamentally, any DVCS requires the ability to determine whether two objects are the same, and an indication of their histories. Either you do it by transmitting the whole object and history (-> very expensive) or you do it by assigning an identifier (-> impractical on any DVCS without central coordination) or you do it by some scheme that collapses large objects into much smaller identifiers algorithmically (i.e. a hash).

I see no alternatives to these three, and only the last yields anything remotely usable as a DVCS. Your assertion that this is not true is not convincing, not to me, at least.

There Is (Almost) Never One "Only Other Way"

Posted Mar 1, 2017 3:41 UTC (Wed) by linuxrocks123 (subscriber, #34648) [Link] (5 responses)

Sigh.

Fine, here's a non-exhaustive list of possible ways to uniquely ID a commit excluding the three from earlier:
- Email address plus ID number generated by the client with that email address that is always 1 + previous ID number generated by the coder's machine.
- Hostname of arbitrary server running unique ID generation network service, plus the ID number said server generated for your changeset.
- Public key of committing programmer plus ID number (always 1 + previous local ID generated) generated by the coder's machine.

There's three. I'm not interested in debating stupid details like "what if the coder has two machines" (then use a different public key on each machine or something. Whatever.). Minor problems have minor solutions. You get the idea here. HAND.

There Is (Almost) Never One "Only Other Way"

Posted Mar 1, 2017 13:30 UTC (Wed) by itvirta (guest, #49997) [Link]

Aren't the first and third ones basically the "everybody makes their own lists and IDs" idea, and the second one the "a central authority makes the lists and IDs" idea (or perhaps a combination of the two)?

There Is (Almost) Never One "Only Other Way"

Posted Mar 1, 2017 14:07 UTC (Wed) by gmatht (guest, #58961) [Link] (1 responses)

If you aren't worried about stupid details like the possibility that someone might accidentally or deliberately use the same public key on two different machines, I am not sure there is much point in worrying about collisions in cryptographic keys. Even just restoring a machine from backup could lead to collisions in enumerated IDs. Things like email addresses are typically much more likely to clash than cryptographic hashes. There could well have been another foo@baz.com; maybe baz didn't pay their DNS bill, or maybe the original foo stopped signing in and their account was deleted; these things happen all the time, whereas sudden collisions in cryptographic hashes believed to be secure at the time are unheard of.

In any case (3) has the same theoretical problem as hashes. How do you *know* your public key is unique? My understanding is that keys are more vulnerable to collisions because key generation also relies on being able to generate truly random numbers: http://jblevins.org/log/ssh-vulnkey (obviously they have the advantage that you wouldn't have to check the central server on each commit).

There Is (Almost) Never One "Only Other Way"

Posted Mar 1, 2017 14:56 UTC (Wed) by excors (subscriber, #95769) [Link]

Generating keys also involves generating large prime numbers, by generating random numbers and testing them for primality until you find one that passes. Testing for primality is a hard problem, but testing for probable-primality is relatively easy - the Miller-Rabin test can tell you either "definitely not prime" or "probably prime (but might not be)", and you can repeat it a hundred times and if it still thinks it's probably prime then there's an incredibly high chance that it really is.

But there's still a non-zero chance that it gives the wrong answer. If you insist on absolute mathematical certainty, to the extent that you won't trust any hash function as a unique identifier, then you shouldn't trust public key cryptography either.
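
For illustration, a compact Python sketch of repeated Miller-Rabin rounds as described: each round either proves "composite" outright or answers "probably prime", and a composite survives k independent rounds with probability at most 4^-k.

    import random

    def probably_prime(n: int, rounds: int = 100) -> bool:
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):       # quick trial division for tiny factors
            if n % p == 0:
                return n == p
        # Write n - 1 as d * 2^s with d odd.
        d, s = n - 1, 0
        while d % 2 == 0:
            d //= 2
            s += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(s - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False                 # witness found: definitely composite
        return True                          # survived every round: probably prime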

There Is (Almost) Never One "Only Other Way"

Posted Mar 1, 2017 15:32 UTC (Wed) by ianmcc (subscriber, #88379) [Link] (1 responses)

If I understand correctly, this basically amounts to tagging commits with some kind of 'unique' identifier, that is chosen by the committer. That might prevent some problems, but accidents or malicious commits could still cause problems?

There Is (Almost) Never One "Only Other Way"

Posted Mar 1, 2017 16:51 UTC (Wed) by nix (subscriber, #2304) [Link]

Quite. You're relying on fallible humans to come up with something unique. Humans are *terrible* at this. Hash functions are far more reliable. Replacing something that has fabulously rare collisions with something that has collisions all the time whenever humans make a mistake (restore from backup, whatever) is not any sort of improvement. The resulting system would have incessant collisions and would be deeply unreliable.

There Is (Almost) Never One "Only Other Way"

Posted Mar 1, 2017 1:19 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

It's possible to prove that there's no algorithm that assigns numbers to graph nodes in a DAG that would produce the same numbers for any subgraph of this graph, provided that nodes can only be compared with each other.

In other words, it's impossible to create a DVCS without referring to content of the nodes.

Proof

Posted Feb 27, 2017 9:03 UTC (Mon) by eternaleye (guest, #67051) [Link] (8 responses)

> Git receives the data, hashes it, and appends it to the repository.

This isn't really any safer in practice, and the reason is GitHub.

GitHub, under the hood, stores all forks of a repository as _branches_ of a _single_ repository, for efficiency reasons.

So, guess what happens if I fork, commit a malicious colliding blob, and send you a patch (email is safe, right?) that adds a benign colliding blob?

What happens is you read the patch, figure it's safe, commit, push, and now you've just made everyone who uses your project vulnerable.

Linux' firmware repository (in particular, its GitHub mirror) is vulnerable to this _today_. Even nastier if I send you the benign blob as a pull request from a kernel.org repo - you may never look at the GitHub repo, but people will still get fucked by it.

Proof

Posted Feb 27, 2017 14:37 UTC (Mon) by bronson (subscriber, #4806) [Link] (4 responses)

GitHub uses alternates, not branches. Your scenario wouldn't work.

A scenario that might work: what happens if you create a repo, convince me to fork it, commit a malicious blob to your repo, then send me a pull request with a colliding blob? (and somehow explain why I should merge a commit that isn't already in your original repo)

*maybe* that would work. Assuming you can create plausible malicious blobs, of course. SHA1 preimage attacks are coming but they're not here yet.

Proof

Posted Feb 27, 2017 15:22 UTC (Mon) by joey (guest, #328) [Link] (2 responses)

A preimage attack is not needed to create a colliding blob. The current attack is sufficient; it just needs that ~1yr of computation to explicitly target a git blob.

Proof

Posted Feb 27, 2017 18:08 UTC (Mon) by bronson (subscriber, #4806) [Link] (1 responses)

Agreed, replacing a blob is pretty easy, as long as you can hide a short burst of line noise somewhere. But, replacing it with a malicious and nonobvious payload that will make all your expense worth it? That still seems fairly out of reach for now.

Just making sure I understand correctly. You've thought about this a lot more than I have.

Proof

Posted Feb 27, 2017 18:14 UTC (Mon) by bronson (subscriber, #4806) [Link]

Ah, that's what I get for not reading to the end before commenting: https://lwn.net/Articles/715777/

That sounds plausible but still pretty easy to spot. It might be easier and less risky to just subvert the binary blob at its source.

Ultimately though, agreed: it's much easier than it should be, and getting easier each day.

Proof

Posted Feb 28, 2017 2:25 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

I'm pretty sure GitHub uses namespace support where all repos in a fork network share a single .git directory, but store their refs under unique prefixes. I think they even contributed that code because alternates were just not fast enough.

Proof

Posted Mar 1, 2017 15:16 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

> > Git receives the data, hashes it, and appends it to the repository.

> This isn't really any safer in practice, and the reason is GitHub.

Even if that's true (and I don't know - I've just seen the other posts disputing it) then that's not git's problem, it's github's.

The critical point is that if there is a problem, as far as I can tell it's the push that fails. So if a developer is pushing a change, it blows up there and then.

Of course, if developer A pushes to repository A, while developer B pushes to repository B, then developer C tries to pull from both A and B, we're going to have a problem ...

(Although doesn't git use the time-stamp as part of the hash, so fixing the collision is simple in theory, just edit and save the unchanged file? Of course, in practice, it could mean a rebase of a big project - rather painful ...)

Cheers,
Wol

Proof

Posted Mar 1, 2017 16:53 UTC (Wed) by nix (subscriber, #2304) [Link] (1 responses)

No, timestamps of files are not part of the saved state in git. The only data git saves for blobs is the name, the executable bit, and an is-a-symlink bit, in the not-really-"file mode".

(Obviously the timestamp of a *commit* is part of the saved state: it's in the commit object, thus part of its hash.)

Proof

Posted Mar 2, 2017 14:58 UTC (Thu) by Wol (subscriber, #4433) [Link]

That's obviously what I was thinking of - recommit the file to change the hash, and bob's your mother's sister's husband.

Still doesn't get round the problem of a full-scale rebase where C has a collision pulling repositories from A and B :-)

Cheers,
Wol

Proof

Posted Feb 26, 2017 2:38 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link] (3 responses)

> But I worry that a year from now Linus will still be hand-waving about how it's not /that/ urgent to migrate

Why would that matter? He hasn't been the maintainer for a really long time and isn't even an active contributor at this point.

Proof

Posted Feb 26, 2017 10:22 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (2 responses)

In _theory_ it doesn't matter at all. In _theory_ the Git maintainers would have shipped a version that could handle a newer hash in, say, 2012, ignored anything Linus Torvalds had to say about Git because he's not an "active contributor at this point" and we'd be reading about how people need to upgrade to a version of Git that's less than five years old to keep using Github because they're switching off their legacy SHA-1 support.

But we live here in the real world, where the first thing the press and general public did was ask Linus Torvalds, and where the actual maintainers had a long back-and-forth with Linus, despite him not being an "active contributor at this point" about what their plans could or should be in the future. The response to Linus' initial "I haven't read the paper but I have an opinion" rambling wasn't "This isn't even your software any more, go away old man" but deference to the master's opinions. Which is fine. But, it means we can't say "Why would that matter?" with a straight face, it clearly _does_ matter still.

Proof

Posted Feb 27, 2017 14:25 UTC (Mon) by bronson (subscriber, #4806) [Link] (1 responses)

It seems like you're conflating marketing and engineering here. Sure, Linus created the project, he's famous and colorful, so he's still the guy everybody goes to for quotes. So?

Don't worry, the transition is happening, and it's advancing methodically. Until shown otherwise, this really a big a problem as you seem to think.

Proof

Posted Feb 27, 2017 18:01 UTC (Mon) by bronson (subscriber, #4806) [Link]

ugh, typo: "Until shown otherwise, this *isn't* that big of a problem."

Yet.

Collisions inevitable.

Posted Feb 26, 2017 7:21 UTC (Sun) by pwfxq (subscriber, #84695) [Link] (3 responses)

Unless some new compression/information theory has come along, aren't collisions inevitable with *any* hash function? Software that uses hash functions to detect duplicates could use a hash function as a quick first test, but must surely resort to a byte-by-byte comparison of the two files to fully verify whether two files are genuinely identical? Isn't any software that's written assuming that two files will never have the same hash fundamentally flawed?

Collisions inevitable.

Posted Feb 26, 2017 8:00 UTC (Sun) by ewx (subscriber, #103004) [Link]

Yes and no. A hash function with 2^64-bit input and 256-bit output necessarily has a very large number of collisions. However (1) the probability of running across one by accident is negligible and (2) for a secure hash function, the cost of constructing one deliberately is unrealistically large. Hence the disregard for the possibility of a collision until a hash function starts to make the transition from 'secure' to 'broken'.

(Restricting discussion to cryptographic hash functions rather than e.g. CRC32.)

Duplicates aren't the only use case, anyway. You don't have anything to compare with when e.g. verifying a signature.

Collisions inevitable.

Posted Feb 26, 2017 10:54 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

The Pigeonhole principle means invariably there exist collisions. However, for a _working_ crypto hash the collisions are essentially impossible to find on purpose and you'll never run into one by accident. So we can prove they must exist, but we don't have any examples, in sort-of the same way we can prove Almost All real numbers are Normal, but er, mathematicians have only found a handful of obscure examples of actual Normal numbers and I don't think we know how to systematically write any of them down...

So, it's fine for almost any program to treat hash(A) = hash(X) as meaning A = X so long as it uses a good crypto hash. So long as it's a good crypto hash you're more likely to run into trouble with a CPU microcode bug, or damage from cosmic rays in your CPU caches than have any problems with collisions. If your code doesn't feel the need to try to detect cosmic ray damage, don't worry about collisions from a good hash. The problem is that when we tell people "This isn't a good crypto hash any more. Stop using it" they are reluctant to actually do all the hard work implied, and the can gets kicked down the road. But on the other hand here I am in 2017 still seeing people use 32-bit signed variables to count seconds since the epoch, so it's not as though the problem is specific to cryptography.

Collisions inevitable.

Posted Feb 27, 2017 0:42 UTC (Mon) by khim (subscriber, #9252) [Link]

As was correctly noted above, an if (memcmp(a, b, s) == 0) comparison could fail you, too (see soft errors). And if your hash is large enough and good enough you could safely ignore these collisions because they are so rare.

Transparent versus opaque formats

Posted Feb 26, 2017 9:11 UTC (Sun) by epa (subscriber, #39769) [Link] (9 responses)

Linus makes a useful distinction between transparent formats (essentially, anything that is read and edited as plain text) and opaque ones. An opaque file format such as PDF can contain lots of hidden code which the end user doesn't see: many different PDF files display identical output on screen or on paper. HTML is also opaque, if you never hit the 'view source' button and you allow the possibility of comments or do-nothing Javascript in the middle of the file.

But what about if for some reason you stored a gzip compressed file in your git repository? Then you have an opaque format because there are many gzip encodings of the same document. I think it's a finite number, while the number of possible PDFs producing a given page is probably infinite, but it's probably still a big enough number of possibilities for an attacker to work with.

This kind of attack could be prevented by defining a 'canonical' gzip encoding. For example, say that the canonical form is the encoding produced by a certain fixed version of GNU gzip at its -9 compression level. Then there is a one-to-one mapping between compressed and uncompressed documents. The version control system, or document signing system, will only allow canonically-encoded gzip files.
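
A sketch of what such a canonicality check could look like, where "canonical" is arbitrarily defined as whatever Python's gzip module produces at level 9 with a zeroed timestamp (a hypothetical policy, not an existing VCS feature):

    import gzip

    def is_canonical_gzip(data: bytes) -> bool:
        plain = gzip.decompress(data)
        # mtime=0 keeps the output deterministic (Python 3.8+).
        canonical = gzip.compress(plain, compresslevel=9, mtime=0)
        return data == canonical      # reject any other encoding of the same bytes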

PDF is more complex and it would be more difficult (perhaps Turing-complete, I'm not sure) to work out the canonical PDF code producing any given document, but you could still take some steps towards that goal by narrowing down some 'pointless choices' offered by the PDF spec. You would need to have a converter program which converts any PDF into an "almost canonical" version which produces the same output (and make some decision about whether to allow PDF forms, Javascript etc). It wouldn't be a one-to-one mapping and the format would still be opaque, but the attack would be a lot more difficult.

For source code, even though it's transparent, one can't rule out that attacks become possible if the hash function is weak enough. For example you might replace some tabs with spaces, generating lots of variants of a source file which look identical on screen and compile to the same object code, in order to generate a collision. This probably isn't practical with SHA-1, but it's an interesting case to consider. For source code too you can define an almost-canonical form by insisting on consistent indentation and whitespace conventions.
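
An "almost canonical" form for source text could be as simple as this sketch (an illustration of the idea, not a proposal for any particular project):

    def normalise_source(text: str, tab_width: int = 8) -> str:
        # Expand tabs, strip trailing whitespace, force exactly one final newline.
        lines = [line.expandtabs(tab_width).rstrip() for line in text.split("\n")]
        return "\n".join(lines).rstrip("\n") + "\n"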

It gets more interesting with bitmap images, where two JPEG encodings of the same photograph may be visually indistinguishable to a human even if the pixels on screen are a bit different. Heck, you could even change the photograph a bit and nobody would notice. This suggests that bitmap images (except perhaps for pure black and white scans of text documents) are not a good way to exchange signed documents.

Transparent versus opaque formats

Posted Feb 26, 2017 11:35 UTC (Sun) by excors (subscriber, #95769) [Link] (5 responses)

> Linus makes a useful distinction between transparent formats (essentially, anything that is read and edited as plain text) and opaque ones. An opaque file format such as PDF can contain lots of hidden code which the end user doesn't see

A plain text file can contain lots of zero-width Unicode spaces, soft hyphens, RTL marks, etc, which the end user won't see because they're invisible.

If you constrain it to ASCII text files in a viewer that makes all control characters visible and that highlights trailing whitespace, then maybe that could count as transparent. But that sounds too constrained to be a useful distinction in practice.

Transparent versus opaque formats

Posted Feb 27, 2017 12:07 UTC (Mon) by epa (subscriber, #39769) [Link] (1 responses)

Yes, I did touch on the idea of requiring normalized whitespace and indentation for source code.

For non-code text documents, such as a legal agreement, tightening up the allowed whitespace might be a good idea. You'll never eliminate all possible avenues for fiddling the content, but if you can get the number of variations of a document down from trillions to under a million or so, you've greatly reduced the scope for finding a collision.

Transparent versus opaque formats

Posted Feb 27, 2017 12:18 UTC (Mon) by johill (subscriber, #25196) [Link]

Depending on the input format, you could round-trip through something like pandoc, I guess.

Transparent versus opaque formats

Posted Mar 4, 2017 1:57 UTC (Sat) by zslade (subscriber, #72097) [Link] (2 responses)

There is no such thing as plain text... What encoding? UTF-8, other Unicode, 7-bit clean or maybe it's in ISO 1251 or some Latin encoding... There is no lowest common denominator other than 7-bit clean ASCII. Good luck running into only that particular encoding in the wild.

Transparent versus opaque formats

Posted Mar 6, 2017 14:28 UTC (Mon) by epa (subscriber, #39769) [Link] (1 responses)

The Linux kernel sources (with a few exceptions for people's names in comments) are indeed 7-bit ASCII, as is the majority of source code.

Transparent versus opaque formats

Posted Mar 6, 2017 15:29 UTC (Mon) by excors (subscriber, #95769) [Link]

That's a fine policy until you get a patch submitted by Mr John FÜ�¦¶~;�ª²VEÊgÖ�ÇøK�Lyà+=öøm±iÅkEÁSþß·`8érr/ç­r�IàF who insists it would be culturally insensitive to transliterate his name.

(Also it's not true that names are the only exceptions - in some older versions of the kernel I see staging drivers with Big5-encoded Chinese comments.)

Transparent versus opaque formats

Posted Feb 26, 2017 12:03 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (2 responses)

"Heck, you could even change the photograph a bit and nobody would notice"

The attack demonstrated needs just 128 bytes (plus a particular starting prefix, but that prefix is unextraordinary, albeit the researchers can't change it without doing all the other work again) to collide the hash. This is normal: the hash operates on 64-byte blocks, they've used one block to trap it in a corner with a partial collision, and the second block finishes it off.

Unlike a lot of the other aspects of the attack, I think modern programmers have an appropriate conception of how little 128 bytes is. It's a burst of noise on the top line of the photograph, a momentary distortion in music, easily ignored and going unrecognised in most places.

I am not even convinced it would be impossible to sneak such a thing into source code. Put a comment marker in front, tell people it's Mojibake from a Greek copyright statement that was mistakenly handled as KOI-8R by a Perl script or something like that. People would kick off if you tried to add 64k of gibberish to a source file, but two or three lines with an explanation might fly.

Transparent versus opaque formats

Posted Feb 26, 2017 15:39 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

Why would I accept mojibake these days? Convert it to UTF-8 if it's so important. Not that this means the block can't be provided in valid UTF-8 (you just then have the language-gibberish problem, where an explanation would usually be asked for in English); just that mojibake is not a valid excuse these days.

Transparent versus opaque formats

Posted Feb 26, 2017 17:31 UTC (Sun) by joey (guest, #328) [Link]

<https://joeyh.name/blog/entry/SHA1_collision_via_ASCII_art/>

A slightly tongue in cheek post, but it may well be possible to generate a collision that can be used that way.

Linus on Git and SHA-1

Posted Feb 26, 2017 9:40 UTC (Sun) by balbir_singh (guest, #34142) [Link] (9 responses)

That is OK for source, but we also have binary blobs like firmware.

Linus on Git and SHA-1

Posted Feb 26, 2017 11:58 UTC (Sun) by pizza (subscriber, #46) [Link] (8 responses)

Honestly, if you're going to try and trojan a firmware blob, you're far better off just bribing or blackmailing someone within the organization that created it to begin with, labelling it a "security update" and releasing it via official channels, rather than trying to sneak something into an existing repo.

Mangling an opaque firmware blob into doing something usefully malicious, while still functioning sufficiently well that it's not immediately obvious it's been altered (i.e. the hardware still works), is several orders of magnitude more difficult than doing that for a C source file -- which is (still) essentially impossible with current-to-medium-term computing resources.

Linus on Git and SHA-1

Posted Feb 27, 2017 7:33 UTC (Mon) by Garak (guest, #99377) [Link] (1 responses)

If it's just twiddling 128 bytes of esoteric debugging error strings or losing a rarely used function for a feature that <0.1% of users are going to notice? (presumably NSA TAO would have no trouble picking specific features it knows the specific target isn't using). Or twiddling 128 bytes of some bitmap logo? Oh the number of times I've seen a boot up flash of corrupted pixels... Or if a stippling/dithering pattern was added to some logo with 1kbit of stegged info, would I care that much about a white logo that turned slightly gray? I just write it off in my mind as one of a million bugs that I presume exist in the overall system. What am I going to do- trace what really caused that bit of snow crash? Or just acknowledge that the product manufacturers (and the NSA through complicit inaction at least) have limited my options to realistically and effectively care that much about security.

Linus on Git and SHA-1

Posted Feb 27, 2017 15:36 UTC (Mon) by joey (guest, #328) [Link]

It should be much easier than that:

1. Identify the first instruction run by the firmware at boot. Replace it with a jump to 129 bytes after the current end of the firmware. Use the resulting file as the input to the collision generation attack. (If you want to target this being stored in a git repository, include a git blob header in the data used to generate the collision.)
2. Now you have two 128-byte chunks which, when appended to the file in #1, result in two colliding but different files.
3. At the end of each of the two different files, append the same payload. Since the SHA1 hash function is in the same state at the end of each file in #2, appending identical data to each file yields new files that also collide.

The payload examines the memory in the 128 byte colliding area. If it sees the "good" version, it runs the instruction that was originally replaced with the jump, and jumps back to the second original instruction. If it sees the "bad" version, it runs the remainder of the payload, the exploit.

Obviously there is some trickiness to do with relative addresses and returning registers to the original state when jumping back in the "good" version etc. But this should be very doable for any assembly programmer; it should even be possible to completely automate it.
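
Step 3 works because SHA-1 is a Merkle-Damgard construction: once two equal-length inputs collide, the hash's internal state is identical, so any common suffix preserves the collision. A quick way to convince yourself, assuming you have two equal-length colliding files (say the two shattered.io PDFs) saved locally as good.bin and bad.bin:

    import hashlib

    good = open('good.bin', 'rb').read()
    bad = open('bad.bin', 'rb').read()
    payload = b'identical data appended to both files'

    # The two files already collide...
    assert hashlib.sha1(good).digest() == hashlib.sha1(bad).digest()
    # ...and appending the same suffix to each keeps the digests equal.
    assert hashlib.sha1(good + payload).digest() == hashlib.sha1(bad + payload).digest()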

Linus on Git and SHA-1

Posted Feb 28, 2017 16:06 UTC (Tue) by droundy (subscriber, #4559) [Link] (5 responses)

The real reason this sha1 attack won't be used in firmware is that it requires you to first get one malicious blob into the repository which has a collision with a second blob. If the blob is opaque firmware, why not just put the second one in first?

This attack relies on a format like PDF which a human can verify, but which can hide garbage bytes.

Linus on Git and SHA-1

Posted Feb 28, 2017 20:12 UTC (Tue) by balbir_singh (guest, #34142) [Link] (4 responses)

One could imagine a git clone of a repository that has malicious firmware; the idea is to hide the real firmware with a bad one without anyone noticing anything different. A comparison of repos will tell you that across all the clones nothing has changed. The reason you can't put the second one in first is that no-one will take a pull request for malicious firmware, but if someone hosted a mirror with a replaced/modified blob, no-one would notice.

Linus on Git and SHA-1

Posted Feb 28, 2017 20:43 UTC (Tue) by excors (subscriber, #95769) [Link] (3 responses)

This attack won't work with "real" firmware (i.e. one produced totally innocently by the vendor) - you can't collide with an arbitrary file, you need to carefully construct both of the files to collide with each other. You could make one of them malicious and one non-malicious, but by what process would someone be willing to take a pull request for the non-malicious firmware yet reject the malicious firmware? (Nobody is going to reverse-engineer and comprehensively security-audit the firmware blob before accepting it.)

Linus on Git and SHA-1

Posted Mar 1, 2017 1:16 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (2 responses)

Aha, I actually know the answer to this.

Suppose the firmware we should desire to install in the victim's system is spectacularly bad news. Like, maybe it would cause actual catastrophic mechanical failure shortly after the firmware is applied. An important project they were working on may be er, sabotaged, as a result.

Of course if we give _this_ firmware to the system maintainer, who has the hardware in question but doesn't have the knowledge or interest to reverse engineer firmware, it is very likely to get tested; a conscientious maintainer would certainly do this, at least before a release. With our catastrophic "bug" there would be cursing, the release won't happen, and there will probably be focus on the firmware we supplied. This is most unfortunate, as perhaps we are an entity for which plausible deniability is key to our operational success.

But if we provide a "good" firmware, which is quite similar but works fine, and keep to ourselves a "bad" firmware which collides but causes the catastrophic failure, the maintainer can test the "good" firmware, give it the all clear and then we simply ensure that the victim obtains their copy of the firmware from somewhere with our "bad" version. SHA-1 guarantees that it's the genuine article they're getting, but alas that is not so.

Linus on Git and SHA-1

Posted Mar 1, 2017 1:40 UTC (Wed) by droundy (subscriber, #4559) [Link] (1 responses)

That is a good scenario, and would be a use case. However, I would expect that most often an attacker who is this sophisticated could simply write the firmware such that it doesn't trigger on the maintainer's computer, either due to checking IP addresses, checking the time, or some other detail.

It doesn't sound like there is much in the way of plausible deniability with this kind of attack. The 128 byte sequence is apparently pretty distinctive, so it would be pretty obvious once the attack is made that the two firmware blobs were created with this trick. Also, do maintainers really accept firmware blobs from anyone other than the companies that create the hardware? I'd hope (probably overly optimistically) that said company would have motivation to keep bad firmware off of their devices. Of course, if the attacker compromises that company that is a different issue, but using a stronger hash wouldn't protect us from that level of adversary.

I do agree that git should switch away from sha1, but am not convinced that this attack is particularly dangerous.

Linus on Git and SHA-1

Posted Mar 1, 2017 11:02 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

You are doubtless correct that this is only one of many possible vectors for such sabotage. I will point out that if there are fifteen ways to do something bad and we dismiss all fifteen as "well, they could do it 14 other ways" we did just let the bad guys choose which they prefer and that seems unwise.

However, my reason for replying is you talk about "the 128 byte sequence" as being "pretty distinctive". We actually have an example from the real world which is instructive. Flame (related to Stuxnet) has a collided MD5 certificate from Microsoft. A long chain of grave errors resulted in this being possible, but the collision step is interesting because the academic publication of MD5 collisions uses a _different collision_. The same general mathematical approach was used, but evidently done independently by people with lots of resources specifically in order to produce this one certificate, then keep it secret. Naturally the NSA is a good candidate for an organisation capable of devoting tremendous computer resources to espionage, and it has the right smart mathematicians too.

Anyone looking for the byte sequence from the academic breaks would have seen nothing in Flame and its packaging.

But people who understood the mathematics involved, and searched intelligently based on that, would have realised it was using a collision (as indeed happened after the forged certificate was discovered).

We have the same knowledge for SHA-1. Nobody should be scanning for the 128 bytes from the PoC because, although that's easy, it's also unlikely to catch serious bad guys messing with your data. Instead, use Marc Stevens' documented disturbance-vector code to flag _any_ inputs likely to perturb SHA-1 in this way. The false-positive rate is imperceptibly higher, but it ensures that the only false negatives would require an entirely novel attack, so when you get pwned at least somebody gets a PhD thesis out of the resulting analysis.
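
A minimal sketch of how that looks in practice, using the sha1collisiondetection library Stevens publishes (the SHA1DC* names come from that library's sha1.h; the wrapper function itself is just illustrative):

    /* Sketch: hash a buffer with the collision-detecting SHA-1 and report
     * whether the input exhibits the disturbance patterns used by known
     * collision attacks -- any of the known attack classes trips it, not
     * just the specific 128-byte blocks from the published PoC. */
    #include <stddef.h>
    #include "sha1.h"   /* from the sha1collisiondetection source tree */

    int looks_like_collision_attack(const char *data, size_t len)
    {
        SHA1_CTX ctx;
        unsigned char digest[20];

        SHA1DCInit(&ctx);
        SHA1DCUpdate(&ctx, data, len);
        /* returns non-zero if a collision attack was detected */
        return SHA1DCFinal(digest, &ctx);
    }

For inputs that don't trip the detector the digest it computes is ordinary SHA-1, which is what makes it usable as a drop-in wherever you can't stop using SHA-1 yet.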

Linus on Git and SHA-1

Posted Feb 26, 2017 16:19 UTC (Sun) by welinder (guest, #4699) [Link] (6 responses)

One bit of information to take away from this is that neither git nor svn bothered to test what the consequences of a collision would be. That really ought to have been tested years ago.

(And the way you test it in the absence of an actual collision is to patch the hash function to return the hash of file B when it encounters file A.)

Linus on Git and SHA-1

Posted Feb 26, 2017 16:52 UTC (Sun) by ttelford (guest, #44176) [Link] (3 responses)

First off: How are you supposed to test against a SHA-1 collision when you don’t have one to test with? There’s no way anybody could have tested the condition until a few weeks ago.

There’s also no real reason to test what the consequences of a collision in git would be, as the result is already known: Since the SHA1 is a pointer to the blob (data) that git is storing, a collision means you have two valid objects which the pointer would point to. No matter what happens, one of the pointers involved in a collision would point to the wrong place.
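
(For the curious: a blob's object id is just the SHA-1 of a short header plus the file contents. A rough sketch, using OpenSSL's SHA1 routines rather than git's own hashing code:)

    /* Illustrative sketch of how a git blob id is computed: SHA-1 over
     * the header "blob <size>\0" followed by the file contents. */
    #include <stdio.h>
    #include <openssl/sha.h>

    void git_blob_id(const unsigned char *contents, size_t len,
                     unsigned char out[SHA_DIGEST_LENGTH])
    {
        SHA_CTX ctx;
        char header[64];
        /* snprintf's return value excludes the NUL; +1 keeps it in the hash */
        int hlen = snprintf(header, sizeof(header), "blob %zu", len) + 1;

        SHA1_Init(&ctx);
        SHA1_Update(&ctx, header, hlen);
        SHA1_Update(&ctx, contents, len);
        SHA1_Final(out, &ctx);
    }

Running "git hash-object <file>" prints the same id, if you want to check.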

I disagree with Linus’s reasoning for one simple reason: git stores more than source code, and when there is a collision, it’ll be tricky to figure out how to recover both versions of a blob with the same sha1, especially with the (ahem) utter cluelessness and total lack of interest a lot of developers have in version control.

A lot of us who support corporate development teams, with a lot of ’new’ developers who would rather pass files around on USB sticks than use version control (like they did in college)… well, we’re in a different boat.

We’re guaranteed that a hash collision will be difficult to detect and painful to fix. Fortunately, it’s ridiculously unlikely that a collision will ever happen by accident; the odds are worse than getting struck by lightning while getting married to Rihanna after winning the lottery.

Linus on Git and SHA-1

Posted Feb 26, 2017 17:12 UTC (Sun) by welinder (guest, #4699) [Link] (1 responses)

>First off: How are you supposed to test against a SHA-1 collision when you don’t
>have one to test with?

It's very easy, as I wrote: "the way you test it in the absence of an actual collision is to patch the hash function to return the hash of file B when it encounters file A."

I.e., you are testing with a modified hash function SHA1' that is close enough to SHA1 that everything works with your existing repository, yet one for which you can easily find a collision, because files A and B will have the same hash.

It's like testing upcoming leap seconds. You don't do it by time travel, but by lying selectively to the system. In the case of leap seconds, you lie about the current time. In the case of hash functions you lie about what the hash of a specific file is.
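
A concrete sketch of that kind of lie (FILE_A_DIGEST and FILE_B_MARKER below are made-up stand-ins; an actual test would patch git's own hashing code rather than wrap OpenSSL):

    /* Sketch of a "lying" SHA1' for testing: file B is reported to have
     * file A's digest, so A and B collide under SHA1' even though no real
     * SHA-1 collision is available.  Everything else hashes normally, so
     * an existing repository keeps working. */
    #include <string.h>
    #include <openssl/sha.h>

    /* Made-up digest of "file A"; a real test would compute it once. */
    static const unsigned char FILE_A_DIGEST[SHA_DIGEST_LENGTH] = { 0 };

    /* Made-up marker identifying "file B"; a real patch might compare
     * the full contents or the path instead. */
    static const char FILE_B_MARKER[] = "THIS-IS-FILE-B";

    void test_sha1(const unsigned char *data, size_t len,
                   unsigned char out[SHA_DIGEST_LENGTH])
    {
        if (len >= sizeof(FILE_B_MARKER) - 1 &&
            memcmp(data, FILE_B_MARKER, sizeof(FILE_B_MARKER) - 1) == 0) {
            memcpy(out, FILE_A_DIGEST, SHA_DIGEST_LENGTH);  /* the lie */
            return;
        }
        SHA1(data, len, out);  /* the truth */
    }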

Linus on Git and SHA-1

Posted Feb 27, 2017 14:50 UTC (Mon) by bronson (subscriber, #4806) [Link]

If that's all you want then that's easy, and people have been doing it for years.

Lots of links here: http://stackoverflow.com/questions/9392365/how-would-git-...

There are many experiments and discussions on the mailing lists too.

Linus on Git and SHA-1

Posted Feb 27, 2017 14:06 UTC (Mon) by Sesse (subscriber, #53779) [Link]

> First off: How are you supposed to test against a SHA-1 collision when you don’t have one to test with?

SHA-1 collisions, at least the current ones, tend to induce very specific internal states in the hash function. There are methods to detect such states with very high probability (counter-cryptanalysis).

Of course, eventually we'll see counter-counter-cryptography in the form of other methods of generating collisions… so it's a band-aid.

Linus on Git and SHA-1

Posted Feb 26, 2017 17:45 UTC (Sun) by joey (guest, #328) [Link] (1 responses)

There have been git patches that allow testing this for years. There's been a fairly good analysis of the ways different sorts of collisions could be used to attack git repositories.

What the git developers have failed at is proactively developing a plan to deal with the eventual SHA1 collision and migrate to another hash. That work could have been started in 2011 or earlier, but the consensus at, e.g., the 2011 GitTogether when I raised the issue was to wait and see. I'm seeing basically the same design discussion from then recapitulated on the git mailing list now. Perhaps it will get somewhere this time.

Linus on Git and SHA-1

Posted Feb 26, 2017 19:40 UTC (Sun) by jnareb (subscriber, #46500) [Link]

Actually, the work on making the Git code independent of the cryptographic hash is ongoing, but by no means finished (there are many places with a hidden assumption that Git uses SHA-1). And there is the problem of backward compatibility to be solved.

There were a few patches sent to the git mailing list to a) test what happens on a collision (I don't think that was available earlier), and b) allow switching to a slower SHA-1 library that detects attempted collisions.

Linus on Git and SHA-1

Posted Feb 27, 2017 19:27 UTC (Mon) by jgg (subscriber, #55211) [Link]

Has there been any talk about improving signed commits? It seems to me that is a very reasonable place to start, and it doesn't really fall under Linus's arguments. Many people are using signed commits now, thinking they provide security similar to detached signatures on .tar.gz files.

Ideally a signed commit would create a parallel Merkle tree using the same hash function as the PGP signature. This seems like it would not be too bad to patch into git: the secondary Merkle tree could be tables that map blob => STRONG_HASH, with the top of those tables linked into the PGP blob.

Verification would then be able to ensure that every git blob with ID XX hashes to STRONG_HASH, proving that at least the checkout at the signed commit is unmangled.
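
To make that concrete, a rough sketch of the mapping step (not a real git patch; the tool and manifest format here are invented, and for brevity it keys on paths in the checkout rather than blob ids):

    /* Sketch: read paths (e.g. from "git ls-files") on stdin and emit
     * "<sha256-of-contents>  <path>" lines.  Signing this manifest along
     * with the commit binds the checkout to hashes that don't depend on
     * SHA-1. */
    #include <stdio.h>
    #include <string.h>
    #include <openssl/sha.h>

    static int sha256_file(const char *path,
                           unsigned char out[SHA256_DIGEST_LENGTH])
    {
        unsigned char buf[8192];
        SHA256_CTX ctx;
        size_t n;
        FILE *f = fopen(path, "rb");

        if (!f)
            return -1;
        SHA256_Init(&ctx);
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            SHA256_Update(&ctx, buf, n);
        SHA256_Final(out, &ctx);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        char path[4096];
        unsigned char digest[SHA256_DIGEST_LENGTH];
        int i;

        /* Typical use: git ls-files | ./strong-manifest > MANIFEST */
        while (fgets(path, sizeof(path), stdin)) {
            path[strcspn(path, "\n")] = '\0';
            if (sha256_file(path, digest) != 0)
                continue;
            for (i = 0; i < SHA256_DIGEST_LENGTH; i++)
                printf("%02x", digest[i]);
            printf("  %s\n", path);
        }
        return 0;
    }

The manifest itself would then be what the PGP signature covers, so verifying the signature plus rehashing the checkout gives the "unmangled" guarantee without trusting SHA-1 alone.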


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds