Call to attention about using hash functions as content indexers (SCM saga)
From: piotr-AT-larroy.com (Pedro Larroy)
To: linux-kernel-AT-vger.kernel.org
Subject: Call to attention about using hash functions as content indexers (SCM saga)
Date: Tue, 12 Apr 2005 00:40:21 +0200
Hi,

I had a quick look at the source of GIT tonight, and I'd like to warn you about the use of hash functions as content indexers. As you are probably aware, hash functions such as SHA-1 are surjective, not bijective (not a 1-to-1 map), so they have collisions. One can argue about the low probability of such a collision; I won't get into subjective judgements of what probability of collision is tolerable for me and what is not. In my humble opinion, deliberately choosing, as a design decision, a method such as this one, which in some cases could corrupt data in a silent and very hard to detect way, is not very good. One can also argue that the probability of data getting corrupted on disk, or whatever, could be higher than that of a collision; again, this is not a valid comparison, since the fact that indexing by hash functions without additional checking could make data corruption legal within the normal working parameters of the program is very dangerous.

And by the way, I've read http://www.usenix.org/events/hotos03/tech/full_papers/hen... and I don't agree with it. Aside from the "avalanche effect", one of the properties of a good hash function is that it distributes the entropy well among the output bits, so although the inputs are not random, the change in the output distribution would probably be small; otherwise it would be easier to find collisions by differential or statistical methods. Last, hash functions should only be used to check data integrity, and that's the reason for the "avalanche effect": even single-bit errors are propagated and their effect is very noticeable at the output.

Regards.

--
Pedro Larroy Tovar | Linux & Network consultant | pedro%larroy.com
Make debian mirrors with debian-multimirror: http://pedro.larroy.com/deb_mm/
* Software patents are harmful to innovation *
http://proinnova.hispalinux.es/
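(As a side note, not part of Pedro's mail: the avalanche effect he describes is easy to observe with a few lines of Python; the input bytes below are invented purely for the illustration.)

```python
import hashlib

# Invented inputs, differing by a single bit, purely for illustration.
a = b"commit 1234: fix scheduler race"
b = bytes([a[0] ^ 0x01]) + a[1:]

diff = int.from_bytes(hashlib.sha1(a).digest(), "big") ^ \
       int.from_bytes(hashlib.sha1(b).digest(), "big")

# For a hash with a good avalanche effect, roughly half of the 160
# output bits differ (around 80), even for nearly identical inputs.
print(bin(diff).count("1"))
```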
Where human intuition fails...
Posted Apr 12, 2005 18:12 UTC (Tue) by pkolloch (subscriber, #21709) [Link] (5 responses)
Well, humans seem to be bad at understanding low probabilities. While Pedro Larroy does not seem uninformed or ignorant, he simply cuts off the discussion about the probabilities.

A SHA-1 hash has 160 bits. Taking the birthday effect into account, the expected number of hashes of random documents needed to produce a collision is 2^80. That is a large number. The avalanche effect mentioned ensures that non-random, even similar, documents behave like a random document collection. As far as I am informed, there are not even known documents which produce collisions using the much weaker MD5 (128 bits).

That said, there has been recent research which suggests that SHA-1 is (cryptographically) insecure. That does not affect the statistics for the accidental clash, where two well-meaning developers check in documents which produce the same hash values. Yet it might be possible to launch some attack based on this.
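(For a concrete sense of scale, this calculation is mine rather than part of the comment: the birthday approximation p ≈ 1 - exp(-n^2 / (2 * 2^160)) can be evaluated directly, and only around 2^80 objects does the probability approach one half.)

```python
import math

def collision_probability(n_objects: float, bits: int = 160) -> float:
    """Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2**bits))."""
    return -math.expm1(-(n_objects ** 2) / (2.0 * 2.0 ** bits))

# Even implausibly large object counts stay nowhere near a collision:
for n in (1e6, 1e9, 1e12):
    print(f"{n:.0e} objects -> p ~ {collision_probability(n):.3e}")
print(f"2^80 objects -> p ~ {collision_probability(2.0 ** 80):.2f}")
```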
designed to fail
Posted Apr 14, 2005 15:50 UTC (Thu) by vgough (guest, #2781) [Link] (4 responses)
I think you've misunderstood Pedro's comments. If you read his comment carefully, you will see why he doesn't care about the discussion of probabilities.
The arguments for and against these sorts of approaches are similar to the jokes about mathematicians vs. engineers. This is not a problem with intuition or a misunderstanding of probabilities; it is about how people view the job of algorithms, and how they should be designed.
I agree with Pedro that, ideally, algorithms should not be designed with known possibilities of failure (even very small possibilities) - which puts me somewhat toward the mathematician side of the spectrum. This is particularly unsettling because you can't predict what inputs would lead to failure. Because the probability of failure is low, people may feel safe in skipping checks to ensure this assumption hasn't been violated.
Adding some checks to make sure nothing bad has happened would probably solve most people's concerns. Sure, someday Linus might have to say "sorry, please add some whitespace to your patch, because it causes a hash collision with one of the old kernel patches", but at least it would be fixable if the problem can be detected.
So that is why Pedro shows no interest in getting into a talk about acceptable probabilities. The entire point is that algorithms should not be designed to fail unless there is a very good reason!
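(One way to implement the checks being asked for here, sketched under my own assumptions rather than anything the thread specifies: keep the full content keyed by its hash, and refuse to silently merge two different blobs that happen to share a digest. The error path corresponds to Linus having to ask for that extra whitespace.)

```python
import hashlib

class CheckedStore:
    """Content-addressed store that detects collisions instead of ignoring them."""

    def __init__(self):
        self._objects = {}  # sha1 hex digest -> original bytes

    def put(self, blob: bytes) -> str:
        key = hashlib.sha1(blob).hexdigest()
        existing = self._objects.get(key)
        if existing is not None and existing != blob:
            # Astronomically unlikely, but loud instead of silent:
            raise ValueError(f"SHA-1 collision on {key}; please tweak the input")
        self._objects[key] = blob
        return key

store = CheckedStore()
key = store.put(b"example patch contents")  # hypothetical blob
```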
Living in a world of errors
Posted Apr 14, 2005 17:33 UTC (Thu) by shane (subscriber, #3335) [Link] (3 responses)
Hard disks (well, IDE) use CRC to check for bad data. I have no evidence to support me, but I'm guessing it uses a 16-bit or 32-bit CRC:
http://www.geocities.com/SiliconValley/Pines/8659/crc.htm
While at IBM, Mandelbrot proved that it is impossible to get a clean signal by improving the signal strength, and engineers started introducing redundancy into transmissions. In some ways, this led to the necessity for packet-based transmission, and eventually to packet-switched networks, like the Internet:
http://www.fractalwisdom.com/FractalWisdom/fractal.html

In fact, TCP and UDP packets have checksums. Oh, and memory chips use parity to detect errors. Heck, CPUs have internal error detection these days.

Basically, all of the technology that exists assumes and deals with failure and risk. When you start looking at failure rates with tens of zeros, engineers shrug their shoulders and move on with life.

Mathematicians pretend their subject is somehow "pure", but consider that it took thousands of years before Euclid's 5th postulate was fully explored - in arguably the most read book on mathematics ever!

Basically, you may not like the possibility of collisions from an aesthetic point of view, but it is not a real-world problem. And like my daddy always said, there's no accounting for taste.
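(To make the "check for bad data" examples concrete, a sketch of mine rather than part of the comment: a 32-bit CRC of the kind used for disk transfers and network frames catches a single flipped bit immediately, yet it only has 2^32 possible values, so it detects corruption rather than naming content.)

```python
import zlib

# Hypothetical payload; any byte string works for the illustration.
data = b"patch: fix off-by-one in ide_dma_intr()"
corrupted = bytes([data[0] ^ 0x01]) + data[1:]   # flip one bit

print(hex(zlib.crc32(data)))        # checksum recorded with the data
print(hex(zlib.crc32(corrupted)))   # differs, so the corruption is detected
```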
ideas are perfect, implementations are not
Posted Apr 14, 2005 18:21 UTC (Thu) by vgough (guest, #2781) [Link]
It is a very interesting topic. I'd say the biggest problem between mathematicians and engineers is when one group thinks they can get along without the other! Because in general, mathematicians don't build bridges, and engineers don't lose much sleep over the limits of our understanding of the universe :-)

I agree, the problem is mostly one of aesthetics. But we should hope there will always be people around to ask if our current implementations are ugly for a reason, or just for the sake of ugliness.
Living in a world of errors
Posted Apr 14, 2005 21:48 UTC (Thu) by swiftone (guest, #17420) [Link] (1 responses)
As the article mentions, _checking_ identity (e.g. checking for bad data) is different from _assigning_ identity. All of the examples you cite are checking validity. None are assigning identity. If you look at identity algorithms in the industry (UUID, etc.) you'll find they are a different breed.
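(A small sketch of the distinction, my own illustration with invented names: checking identity recomputes a digest and compares it to a recorded value, while assigning identity mints a name that does not depend on the content at all, e.g. a random UUID.)

```python
import hashlib
import uuid

blob = b"some stored object"  # hypothetical content

# Checking identity/integrity: recompute and compare; a mismatch
# means the data changed or was corrupted.
recorded_digest = hashlib.sha1(blob).hexdigest()
assert hashlib.sha1(blob).hexdigest() == recorded_digest

# Assigning identity: a 122-bit random name, independent of the content;
# distinct objects get distinct names with overwhelming probability,
# no matter how similar their contents are.
object_id = uuid.uuid4()
print(object_id)
```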
Living in a world of errors
Posted Aug 29, 2005 15:40 UTC (Mon) by lunakid (guest, #32146) [Link]
About checking and assigning (creating) identity being different: quite correct. The difference between the two is context. Identity checking always assumes/implies the more specific context of, e.g., change characteristics. Identity assignment, OTOH, can basically only assume some upper bound on the number of identities in the supported pool, and little more (some life-cycle information for aging may also be available, etc.).
Call to attention about using hash functions as content indexers (SCM saga)
Posted Apr 19, 2005 1:55 UTC (Tue) by bfitz (guest, #4844) [Link]
Straw man argument. Yes, sure, hash collision. All the air molecules in this room might zip over to the other side and I could die of suffocation. I'm not worrying about that either, because even though it theoretically could happen, the probability of it happening is pretty damn low. I don't need to go around wearing a breathing mask for fear of Brownian motion taking my oxygen away, and I don't need to worry about SHA-1 hash collisions in entries in the repository.
Distributed asynchronous integrity checking!
Posted Apr 21, 2005 6:22 UTC (Thu) by xoddam (subscriber, #2322) [Link]
Just as a hash is a good way to check the integrity of the content, looking at the content is a pretty good way to ensure that your 'assumption' that the hash doesn't collide is good.

Using the hash for day-to-day work is fine, as long as spot checks are done on anything which is supposed to be 'authoritative', such as Linus' official tree. It's the responsibility of the paranoid to do the checks :-)

Of course any good tool will ring warning bells as soon as anything looks out of place. Hash collisions are astronomically unlikely; it is even more astronomically unlikely that two documents with the same hash value will both have meaning in the same context.
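(A rough sketch of the kind of spot check being described, under my own assumption, purely for illustration, that objects live in a flat directory named by their SHA-1 hex digests, loosely inspired by, but not identical to, git's object store.)

```python
import hashlib
from pathlib import Path

def spot_check(object_dir: str) -> list[str]:
    """Recompute the SHA-1 of every stored object and report mismatches.

    Assumes a flat directory whose file names are the SHA-1 hex digests
    of the file contents; anything that no longer matches is suspect.
    """
    bad = []
    for path in Path(object_dir).iterdir():
        if hashlib.sha1(path.read_bytes()).hexdigest() != path.name:
            bad.append(path.name)  # ring the warning bells
    return bad

if __name__ == "__main__":
    print(spot_check("objects"))  # e.g. run periodically over a mirror
```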