Proof
Posted Feb 26, 2017 0:25 UTC (Sun) by Wol (subscriber, #4433)
In reply to: Proof by tialaramex
Parent article: Linus on Git and SHA-1
From what I can understand though, the difference between Git and SVN is in the way they receive the data.
SVN receives the data and writes it into the repository. Pushing corrupted data corrupts the receiving repository.
Git receives the data, hashes it, and appends it to the repository. If there's a collision, the append fails with "can't overwrite existing data". So pushing corrupted data causes the corrupted data to be dropped. Okay, you've now got two repositories out-of-sync with each other (and if you can get someone to update-pull from the corrupt repository your attack will succeed), but Linus says this out-of-sync-repositories situation is actually very easy to spot.
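As an illustration of that write-once behaviour, here is a minimal sketch of a content-addressed store that refuses to replace an existing object (this is not git's actual object store; the class and names are invented for the example):

import hashlib

class ContentStore:
    """Toy content-addressed store: objects are write-once, keyed by their hash."""
    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        if key in self.objects:
            if self.objects[key] != data:
                # A different object arrived with the same hash: keep the original,
                # refuse the newcomer rather than silently replacing anything.
                raise ValueError("can't overwrite existing data: " + key)
            return key  # identical object already stored, nothing to do
        self.objects[key] = data
        return key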
Cheers,
Wol
Posted Feb 26, 2017 1:51 UTC (Sun)
by tialaramex (subscriber, #21167)
[Link] (21 responses)
Anyway, real collision _attacks_ aren't about blowing up the collision handling code (or lack of it) in your CMS. They're about tricking you into thinking one thing is another thing, usually by showing one person A and another person B, telling both that it's the same thing X, and knowing that they'll only check hash(X), so if hash(A) = hash(B) they'll each believe they're seeing the same thing when they aren't.
And the only long-term solution to that is to use a good hash, so that hash(X) = hash(A) is adequate to conclude that A = X. Git doesn't do that today, and it doesn't yet even have a proposal for how it'll do it tomorrow, let alone how it'll migrate everybody. Remember, nobody who pays the least attention was blind-sided by this; a more careful designer would have rejected SHA-1 when Git was started over a decade ago.
(2004, Schneier recommends SHA-1 not be used in new applications: Rumours of fire)
"Don't worry, there's no fire. An alarm would sound"
(2012, researchers publish estimated prices to collide SHA-1: A fire alarm is ringing)
"That's just the fire alarm, it's probably nothing. I can't smell any smoke, can you?"
(2015, "The SHAppening" free-start collisions published, prices are revised down significantly: There is a strong scent of smoke)
"That smell might not be smoke though, could be something else. Wait until we see smoke"
(2017, "SHAttered" full SHA-1 PDF collision using a fixed prefix: Clouds of smoke are filling the room)
"Well, don't you worry, if we see an actual fire I know I have an extinguisher around here somewhere"
Posted Feb 26, 2017 16:06 UTC (Sun)
by Wol (subscriber, #4433)
[Link]
Except that I think Linus looked at it as an engineer (as he has a tendency to do). "Collisions are rare, but inevitable. Code to handle them will be buggy. If we hit one, PANIC".
That does make exploits difficult, in that any exploit needs to make sure that no-one ever compares the exploited and the exploit-free trees - rather difficult to ensure in practice. (Okay, it also needs the maintainers to be aware of the possibility, and to look for it, in order for the system to be reasonably fail-safe, and I know that's asking a lot...)
Cheers,
Wol
Posted Feb 26, 2017 22:36 UTC (Sun)
by moltonel (guest, #45207)
[Link] (3 responses)
Meh, that still relies on being lucky, and it ignores the advances in both computing power and knowledge of each algorithm's weaknesses. If you want to be safe you can use a hash to speed things up, but if the hash of new data matches old data, you need to check the actual bytes for a collision. That's what many hashmap containers do: the hash gets you to the proper location in an array, and at that location is a linked list of all values with that hash.
Failing that (if you don't have the bytes available to compare, so not in the git/svn case), it's probably safest to use multiple hashes. Gentoo, for example, uses SHA256+SHA512+WHIRLPOOL+filesize to check that the downloaded file is correct.
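A minimal sketch of the separate-chaining idea described above (a hypothetical container, not any particular library): the deliberately weak hash only selects a bucket, and membership is decided by comparing the actual keys.

class ChainedMap:
    """Toy hash map with separate chaining: equal hashes still get a full key comparison."""
    def __init__(self, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key: bytes):
        return self.buckets[sum(key) % len(self.buckets)]  # weak hash, on purpose

    def put(self, key: bytes, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # byte-for-byte comparison, not just hash equality
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key: bytes):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)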
Posted Feb 26, 2017 23:17 UTC (Sun)
by excors (subscriber, #95769)
[Link] (2 responses)
DRAM error rates are apparently somewhere vaguely around 2^-35 errors per bit per hour. If you have two files that differ by a single bit, and you leave them in memory for a millisecond before memcmping them, that's about a 1 in 2^56 chance of one of the differing bits flipping and giving a false match. With a non-broken 256-bit hash function, there's only a 1 in 2^256 chance of two different files giving a false match. So by checking the actual bytes you're massively *increasing* the risk of an incorrect result caused by bad luck.
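Roughly checking that arithmetic (the 2^-35 figure is the ballpark assumed above, not a measurement):

import math

per_bit_per_hour = 2.0 ** -35                     # assumed DRAM error rate
per_bit_per_ms = per_bit_per_hour / (3600 * 1000)
false_match = 2 * per_bit_per_ms                  # the differing bit exists in both copies
print(math.log2(false_match))                     # about -55.8, i.e. roughly 1 in 2^56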
Posted Feb 27, 2017 15:56 UTC (Mon)
by hmh (subscriber, #3838)
[Link] (1 responses)
Also, as far as the crypto goes, it has been proved that a combination of several hashes is approximately as strong as the strongest one, at least against brute force.
It would make sense to use very different hashes of the same strength as a hedge against a weakness being found in one of them. But that would mean, e.g., using both SHA2-256 and SHA3-256 (SHA-3 is a very different construction from SHA-2 for exactly that reason; otherwise BLAKE might have been selected as SHA-3). And the total strength would be 256 bits, not more.
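For what it's worth, the hedge described here is mechanically trivial; a sketch using the SHA-2/SHA-3 pairing from the comment (Python's hashlib provides both):

import hashlib

def combined_digest(data: bytes) -> str:
    # Two structurally unrelated hashes side by side: a break in one construction
    # leaves the other intact, but as noted above the combination is only about as
    # strong as its strongest component (~256 bits), not 512.
    return hashlib.sha256(data).hexdigest() + hashlib.sha3_256(data).hexdigest()

print(combined_digest(b"example"))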
Posted Feb 27, 2017 17:45 UTC (Mon)
by excors (subscriber, #95769)
[Link]
The point is that you can't practically divide algorithms into "definitely correct" vs "probably correct but relies on being lucky" - everything running on hardware has some non-zero chance of giving the wrong result. Instead you need to look at how close to zero those chances are, and divide them into "absurdly unlikely to ever fail even if every computer in the world runs this algorithm for a million years" vs "might actually fail in practice", and a non-broken hash function fits into the first category. (memcmp on a system with ECC RAM probably goes in the first category too, but memcmp without ECC probably doesn't.)
(This is separate from the issue that non-broken hash functions have a tendency to become broken hash functions after a few decades. Any system that's designed to use a secure hash ought to be designed to migrate cleanly to a new hash function in the future.)
My God are you alarmist
Posted Feb 27, 2017 9:42 UTC (Mon)
by linuxrocks123 (subscriber, #34648)
[Link] (15 responses)
Guess what? rsync uses Adler-32 and MD5, and it works fine, and the world is still here. No fire.
My personal hope? SCMs will use this opportunity to change their code not to rely on cryptographically secure hashes. They have no fundamental need for them, and assuming
(hash(A) = hash(B)) -> (A = B)
is obviously logically unsound, sometimes unsafe, and always the worst of kludges. Subversion deserves the mess it's in now, and I say that even as I now have to make sure never to commit those two files to my giant Subversion repository that controls my home directory thanks to their f-up. I hope they will fix their design rather than kick the can down the road by replacing SHA-1 with something new. You don't need a cryptographically secure hash, guys: don't use one, and don't assume the unsound.
Posted Feb 27, 2017 15:02 UTC (Mon)
by tialaramex (subscriber, #21167)
[Link] (14 responses)
SCMs have chosen crypto hashes because they want content addressability, and a crypto hash lets you achieve that efficiently. Git in particular doesn't bother doing much else besides content addressability; that's part of how it's able to be distributed so easily, because unlike other schemes, content addressing means two parties are sure to agree on what to name the same thing.
Insisting that hash(A) = hash(B) ⇏ A = B is all very well logically, but as a practical matter all it does for you is that you're now obliged to send a LOT of bits everywhere to do anything, because the only useful content-addressable reference for A other than hash(A) is A itself. And what do you get for this enormous increase in bandwidth, CPU power and disk space? Er, well, in theory there's a minuscule chance it'll catch an error case some day, for someone, somewhere. How minuscule? Let's find out:
For example, compared to a 256-bit crypto hash: if you use this new SCM to create a billion-billion-billion (10^27) new documents per second (good luck with that), and compare them all against all the other documents ever created this way (remember, you need to store them all somewhere), by the year 9999 CE you'd probably run into one collision which you'd detect but the hash would not. Although maybe not, chance can be a funny thing.
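Sanity-checking that estimate with the standard birthday approximation, assuming an ideal 256-bit hash and 10^27 new documents per second from 2017 until 9999 CE:

import math

docs_per_sec = 1e27
seconds = (9999 - 2017) * 365.25 * 24 * 3600     # from now until 9999 CE
n = docs_per_sec * seconds                       # about 2.5e38 documents
# Birthday approximation: P(collision) ~ 1 - exp(-n^2 / 2^(b+1)) for a b-bit hash.
p = 1 - math.exp(-(n * n) / 2.0 ** 257)
print(n, p)                                      # p comes out around 0.24

So under these assumptions, eight millennia of that absurd output gives somewhere around a one-in-four chance of a single collision, which is the "although maybe not" above.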
All worth it right? Unfortunately all the extra bandwidth, CPU power and disk space incurs its own risk of errors, which are far larger. I mean, tiny, but because the risk from using a good crypto hash was so minute these tiny risks dwarf it. You've wasted all those resources AND made things worse.
So that's why nobody is doing that.
Posted Feb 27, 2017 22:28 UTC (Mon)
by linuxrocks123 (subscriber, #34648)
[Link] (13 responses)
Posted Feb 27, 2017 23:29 UTC (Mon)
by tialaramex (subscriber, #21167)
[Link] (12 responses)
Of _course_ content addressability isn't obligatory for an SCM; SCCS doesn't have content addressability. But it turns out people actually want their SCM to be much better than SCCS, and lots of the features which you may not care about rely on this choice. Git is so reliant upon this choice that Linus says it's better to think of the version control features as just a sort of thin layer of icing; the main thing he built is a content-addressable filesystem.
Likewise, doubtless any number of amazing schemes could be dreamed up for content addressability, and I have no doubt that somewhere a Computer Science course has decided to spend lots of time walking students through one or more of the fancier examples - just as one of my CS undergrad courses spent a bunch of time trying to brutalize the C pre-processor into a way of managing HTML files so each student could make their own web page (this probably dates me) - so everything is being taught somewhere. But that didn't make the C pre-processor suitable for that task, and fancier schemes don't fix the problem you're complaining of.
The Pigeonhole Principle, the same thing that forbids infinite compression algorithms and other such schemes, requires that you accept some non-zero risk of collisions if you're going to throw some of the bits away. Doesn't matter how you do it, the risk is the same - which is why we pick crypto hashes, because they have the best properties we could hope for under the circumstances.
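The pigeonhole point is easy to see in practice once bits are thrown away: a birthday search over a 32-bit truncation of SHA-256 (an arbitrary choice for the demo) finds a collision in a few seconds, while nobody has any hope of doing the same against the full 256-bit output.

import hashlib
from itertools import count

def truncated(data: bytes, nbytes=4) -> bytes:
    return hashlib.sha256(data).digest()[:nbytes]   # keep only 32 of the 256 bits

seen = {}
for i in count():
    msg = b"doc-%d" % i
    h = truncated(msg)
    if h in seen and seen[h] != msg:
        print("collision:", seen[h], msg, h.hex())  # typically after a few tens of thousands of tries
        break
    seen[h] = msg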
Posted Feb 28, 2017 10:03 UTC (Tue)
by linuxrocks123 (subscriber, #34648)
[Link] (11 responses)
I'm not sure how you misread what I wrote to come to that conclusion.
> Git is so reliant upon this choice that Linus says its better to think of the version control features as just a sort of thin layer of icing, the main thing he built is a content addressable filesystem.
So Git will continue to rely on that unsound kludge, then. That's a shame, but I don't personally use it for anything serious, so I guess I don't care.
> trying to brutalize the C pre-processor into a way of managing HTML files so each student could make their own web page
That sounds beyond horrible.
> The Pigeonhole Principle, the same thing that forbids infinite compression algorithms and other such schemes, requires that you accept some non-zero risk of collisions if you're going to throw some of the bits away. Doesn't matter how you do it, the risk is the same
So don't do that. Uniquely identify things some other way, like with a hash plus its location on a linked list of things that hash to the same thing. Or don't use hashes at all and use that tree thing where each letter in the string moves you down a level of the tree (a trie). I'm not going to write out a full design for some hypothetical SCM that doesn't rely on hashes not having collisions.
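To make the first suggestion concrete, here is a rough sketch (purely hypothetical, not any existing SCM) of an identifier built from a weak hash plus the object's position in the chain of objects sharing that hash. Note that assigning the position already requires comparing against every object in the chain, which is the cost the follow-up comments pick apart.

import zlib

class ChainStore:
    """Toy store where an object's ID is (weak hash, index within that hash's chain)."""
    def __init__(self):
        self.chains = {}

    def add(self, data: bytes):
        h = zlib.adler32(data)           # deliberately cheap, non-cryptographic hash
        chain = self.chains.setdefault(h, [])
        for i, existing in enumerate(chain):
            if existing == data:         # full byte comparison on every lookup
                return (h, i)
        chain.append(data)
        return (h, len(chain) - 1)

    def get(self, ident):
        h, i = ident
        return self.chains[h][i]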
> we pick crypto hashes, because they have the best properties we could hope for under the circumstances.
Yeah ... if you're going to try to use hash(A) = hash(B) --> A = B, crypto hashes are about the only way to make that sort of work. If you're _NOT_ going to make that assumption, and you shouldn't, you can use much cheaper hashes and save some CPU time.
The only way people get away with abusing crypto hashes for this at all is because SCMs and the other places this hack is used don't reside in memory, and the hashes don't need to be taken frequently.
But, hey, if the Git people think it's easier to break the universe every five years to switch to a new hashing algorithm, be my guest. It's not my universe. I just hope the SVN people have more sense and fix the problem correctly. The design of Subversion doesn't rely on the unsound assumption, just this one optional deduplication feature (representation sharing). And fixing that would be as easy as just comparing the actual files before doing the deduplication. Easy.
Posted Feb 28, 2017 13:48 UTC (Tue)
by madscientist (subscriber, #16861)
[Link]
Which is not to say that I think Git shouldn't ensure that a collision won't corrupt the repository... but anything beyond that is useless work.
> if Git people thinks it's easier to break the universe every five years to switch to a new hashing algorithm
I know you're being hyperbolic but just for posterity: Git is currently 12 years old and I wouldn't be surprised if it's another 2 years or more before a new hashing algorithm is in common use. If we get to a point where cryptographic hashes are being broken within 5 years we'll have much bigger problems than Git collisions. Also, the introduction of the new hash won't "break the universe"--plans for introducing it all include backward-compatibility. It's clearly a non-starter to require everyone to rewrite their entire history, and it's not necessary.
Posted Feb 28, 2017 14:24 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (9 responses)
There are three things you could choose here; let's look at each in turn, because I think you are gradually learning a little bit.
1. Everybody makes their own linked lists. Whenever we look at any object we need to examine it in order to compare it with things in our list for the same hash and figure out if it was already on the list. We can't share our "unique" identifiers for things, because we each have our own linked list with different things in it and in a different order, so the "unique" identifiers are only locally unique. When we discuss things with other people the only way to refer to them is by sending the entire thing to be discussed, even if we're pretty sure they have it, we can't identify it to them any other way because they have different linked lists.
You can build a pretty good little local revision control system this way. It's a bit resource hungry, slow, disk bandwidth intensive, but it works well. Attempts to build a distributed revision control system are painful, although centralised revision control is practical for people with plenty of network bandwidth so they can upload and download all the things that the remote system needs to find / add in its linked lists.
2. A Central Authority makes the linked lists. The unique identifiers are global, but we can't find out the identifier for anything unless we either send it to the Central Authority for them to identify, or we have a (potentially partial) copy of the Central Authority's linked list to compare against. We can discuss things with other people because we share these globally unique references, but we're restricted to only talking about things the Central Authority has seen. Somebody has to pay for the Central Authority, which also functions as a global library of all human knowledge, albeit one that doesn't have a useful index.
This time using the system at all (even locally) is clumsy and slow unless you have a high bandwidth connection to the Central Authority or (for read only use) a complete and up-to-date copy of its linked lists. Good news is that distributed usage is no harder than local use, bad news is that's not saying much.
3. We use an _imaginary_ totally ordered list instead of a real linked list, so that we can work out where anything would be in this totally ordered list and write that location down without keeping all the items from the list. For a crypto hash we can't even attempt this, but with a weak or trivial hash this isn't too hard to arrange. The hash plus the location number uniquely identify anything. This seems like an amazing breakthrough, until we remember the Pigeonhole Principle. Where do all the bits go? Ah, they're in the location number. Instead of N=9703384921593020482 I now have hash(N) = 446 plus location(N) = 27829639719764153, I write those down, and er... wait, why did we invent this complicated scheme that has no benefit whatsoever?
This system is pointless and nobody would build it.
Posted Feb 28, 2017 15:56 UTC (Tue)
by linuxrocks123 (subscriber, #34648)
[Link] (8 responses)
Posted Feb 28, 2017 18:06 UTC (Tue)
by nix (subscriber, #2304)
[Link] (6 responses)
Fundamentally, any DVCS requires the ability to determine whether two objects are the same, and an indication of their histories. Either you do it by transmitting the whole object and history (-> very expensive) or you do it by assigning an identifier (-> impractical on any DVCS without central coordination) or you do it by some scheme that collapses large objects into much smaller identifiers algorithmically (i.e. a hash).
I see no alternatives to these three, and only the last yields anything remotely usable as a DVCS. Your assertion that this is not true is not convincing, not to me, at least.
There Is (Almost) Never One "Only Other Way"
Posted Mar 1, 2017 3:41 UTC (Wed)
by linuxrocks123 (subscriber, #34648)
[Link] (5 responses)
Fine, here's a non-exhaustive list of possible ways to uniquely ID a commit excluding the three from earlier:
- Email address plus ID number generated by the client with that email address that is always 1 + previous ID number generated by the coder's machine.
- Hostname of arbitrary server running unique ID generation network service, plus the ID number said server generated for your changeset.
- Public key of committing programmer plus ID number (always 1 + previous local ID generated) generated by the coder's machine.
There's three. I'm not interested in debating stupid details like "what if the coder has two machines" (then use a different public key on each machine or something. Whatever.). Minor problems have minor solutions. You get the idea here. HAND.
Posted Mar 1, 2017 13:30 UTC (Wed)
by itvirta (guest, #49997)
[Link]
Posted Mar 1, 2017 14:07 UTC (Wed)
by gmatht (guest, #58961)
[Link] (1 responses)
In any case (3) has the same theoretical problem as hashes. How do you *know* your public key is unique? My understanding is that keys are more vulnerable to collisions because key generation also relies on being able to generate truly random numbers: http://jblevins.org/log/ssh-vulnkey (obviously they have the advantage that you wouldn't have to check the central server on each commit).
Posted Mar 1, 2017 14:56 UTC (Wed)
by excors (subscriber, #95769)
[Link]
But there's still a non-zero chance that it gives the wrong answer. If you insist on absolute mathematical certainty, to the extent that you won't trust any hash function as a unique identifier, then you shouldn't trust public key cryptography either.
Posted Mar 1, 2017 15:32 UTC (Wed)
by ianmcc (subscriber, #88379)
[Link] (1 responses)
Posted Mar 1, 2017 16:51 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Mar 1, 2017 1:19 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
In other words, it's impossible to create a DVCS without referring to the content of the nodes.
Posted Feb 27, 2017 9:03 UTC (Mon)
by eternaleye (guest, #67051)
[Link] (8 responses)
This isn't really any safer in practice, and the reason is GitHub.
GitHub, under the hood, stores all forks of a repository as _branches_ of a _single_ repository, for efficiency reasons.
So, guess what happens if I fork, commit a malicious colliding blob, and send you a patch (email is safe, right?) that adds a benign colliding blob?
What happens is you read the patch, figure it's safe, commit, push, and now you've just made everyone who uses your project vulnerable.
Linux' firmware repository (in particular, its GitHub mirror) is vulnerable to this _today_. Even nastier if I send you the benign blob as a pull request from a kernel.org repo - you may never look at the GitHub repo, but people will still get fucked by it.
Posted Feb 27, 2017 14:37 UTC (Mon)
by bronson (subscriber, #4806)
[Link] (4 responses)
A scenario that might work: what happens if you create a repo, convince me to fork it, commit a malicious blob to your repo, then send me a pull request with a colliding blob? (and somehow explain why I should merge a commit that isn't already in your original repo)
*maybe* that would work. Assuming you can create plausible malicious blobs, of course. SHA1 preimage attacks are coming but they're not here yet.
Posted Feb 27, 2017 15:22 UTC (Mon)
by joey (guest, #328)
[Link] (2 responses)
Posted Feb 27, 2017 18:08 UTC (Mon)
by bronson (subscriber, #4806)
[Link] (1 responses)
Just making sure I understand correctly. You've thought about this a lot more than I have.
Posted Feb 27, 2017 18:14 UTC (Mon)
by bronson (subscriber, #4806)
[Link]
That sounds plausible but still pretty easy to spot. It might be easier and less risky to just subvert the binary blob at its source.
Ultimately though, agreed: it's much easier than it should be, and getting easier each day.
Posted Feb 28, 2017 2:25 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]
Posted Mar 1, 2017 15:16 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
> This isn't really any safer in practice, and the reason is GitHub.
Even if that's true (and I don't know - I've just seen the other posts disputing it) then that's not git's problem, it's github's.
The critical point is that if there is a problem, as far as I can tell it's the push that fails. So if a developer is pushing a change, it blows up there and then.
Of course, if developer A pushes to repository A, while developer B pushes to repository B, then developer C tries to pull from both A and B, we're going to have a problem ...
(Although doesn't git use the time-stamp as part of the hash, so fixing the collision is simple in theory, just edit and save the unchanged file? Of course, in practice, it could mean a rebase of a big project - rather painful ...)
Cheers,
Wol
Posted Mar 1, 2017 16:53 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
(Obviously the timestamp of a *commit* is part of the saved state: it's in the commit object, thus part of its hash.)
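For reference, a blob's ID is a hash over a short header plus the file content only, so re-saving an unchanged file cannot produce a new ID; timestamps only enter the picture through commit (and tag) objects. A quick sketch of the blob case:

import hashlib

def git_blob_id(content: bytes) -> str:
    # Same construction as `git hash-object`: header "blob <size>\0", then the raw bytes.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Compare with: printf 'hello\n' | git hash-object --stdin
print(git_blob_id(b"hello\n"))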
Posted Mar 2, 2017 14:58 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Still doesn't get round the problem of a full-scale rebase where C has a collision pulling repositories from A and B :-)
Cheers,
Wol
"Don't worry, there's no fire. An alarm would sound"
(2012 Researchers publish estimated prices to collide SHA-1: A fire alarm is ringing)
"That's just the fire alarm, it's probably nothing. I can't smell any smoke, can you?"
(2015 "The Shappening" free-start collisions published, prices are revised down significantly: There is a strong scent of smoke)
"That smell might not be smoke though, could be something else. Wait until we see smoke"
(2017 "Shattered" full SHA-1 PDF collision using fixed prefix: Clouds of smoke are filling the room)
"Well, don't you worry, if we see an actual fire I know I have an extinguisher around here somewhere"
Proof
Wol