Call to attention about using hash functions as content indexers (SCM saga)
From: piotr-AT-larroy.com (Pedro Larroy)
To: linux-kernel-AT-vger.kernel.org
Subject: Call to attention about using hash functions as content indexers (SCM saga)
Date: Tue, 12 Apr 2005 00:40:21 +0200
Hi,

I had a quick look at the source of GIT tonight, and I'd like to warn you about the use of hash functions as content indexers. As you are probably aware, hash functions such as SHA-1 are surjective, not bijective (not a 1-to-1 map), so they have collisions. One can argue about the low probability of such a collision; I won't get into subjective judgements of what probability of collision is tolerable for me and what is not. In my humble opinion, deliberately choosing, as a design decision, a method such as this one, which in some cases could corrupt data in a silent and very hard to detect way, is not very good. One can also argue that the probability of data getting corrupted on disk, or whatever, could be higher than that of a collision; again, this is not a valid comparison, since the fact that indexing by hash functions without additional checking could make data corruption legal within the normal working parameters of the program is very dangerous.

And by the way, I've read http://www.usenix.org/events/hotos03/tech/full_papers/hen... and I don't agree with it. Aside from the "avalanche effect", one of the properties of a good hash function is that it distributes the entropy well among the output bits, so although the inputs are not random, the change in the output distribution would probably be small; otherwise it would be easier to find collisions by differential or statistical methods. Last, hash functions should only be used to check data integrity, and that's the reason for the "avalanche effect": even single-bit errors are propagated and their effect is very noticeable at the output.

Regards.

--
Pedro Larroy Tovar | Linux & Network consultant | pedro%larroy.com
Make debian mirrors with debian-multimirror: http://pedro.larroy.com/deb_mm/
* Software patents are harmful to innovation *
http://proinnova.hispalinux.es/
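(As a side note, not part of Pedro's mail: the avalanche effect he describes is easy to observe with a few lines of Python; the input bytes below are invented purely for the illustration.)

```python
import hashlib

# Invented inputs, differing by a single bit, purely for illustration.
a = b"commit 1234: fix scheduler race"
b = bytes([a[0] ^ 0x01]) + a[1:]

diff = int.from_bytes(hashlib.sha1(a).digest(), "big") ^ \
       int.from_bytes(hashlib.sha1(b).digest(), "big")

# For a hash with a good avalanche effect, roughly half of the 160
# output bits differ (around 80), even for nearly identical inputs.
print(bin(diff).count("1"))
```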
Where human intuition fails...
Posted Apr 12, 2005 18:12 UTC (Tue) by pkolloch (subscriber, #21709) [Link] (5 responses)
Well, humans seem to be bad at understanding low probabilities. While Pedro Larroy does not seem uninformed or ignorant, he simply cuts off the discussion about the probabilities.

A SHA-1 hash has 160 bits. Taking the birthday effect into account, the expected number of hashes of random documents needed to produce a collision is 2^80. That is a large number. The avalanche effect mentioned ensures that non-random, even similar, documents behave like a random document collection. As far as I am informed, there are not even known documents which produce collisions using the much weaker MD5 (128 bits).

That said, there has been recent research which suggests that SHA-1 is (cryptographically) insecure. That does not affect the statistics for the accidental clash, where two well-meaning developers check in documents which produce the same hash values. Yet it might be possible to launch some attack based on this.
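(For a concrete sense of scale, this calculation is mine rather than part of the comment: the birthday approximation p ≈ 1 - exp(-n^2 / (2 * 2^160)) can be evaluated directly, and only around 2^80 objects does the probability approach one half.)

```python
import math

def collision_probability(n_objects: float, bits: int = 160) -> float:
    """Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2**bits))."""
    return -math.expm1(-(n_objects ** 2) / (2.0 * 2.0 ** bits))

# Even implausibly large object counts stay nowhere near a collision:
for n in (1e6, 1e9, 1e12):
    print(f"{n:.0e} objects -> p ~ {collision_probability(n):.3e}")
print(f"2^80 objects -> p ~ {collision_probability(2.0 ** 80):.2f}")
```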
designed to fail
Posted Apr 14, 2005 15:50 UTC (Thu) by vgough (guest, #2781) [Link] (4 responses)
I think you've misunderstood Pedro's comments. If you read his comment carefully, you will see why he doesn't care about the discussion of probabilities.
The arguments for and against these sorts of approaches are similar to the jokes about mathematicians vs. engineers. This is not a problem with intuition or a misunderstanding of probabilities; it is about how people view the job of algorithms, and how they should be designed.
I agree with Pedro that, ideally, algorithms should not be designed with known possibilities of failure (even very small possibilities) - which puts me somewhat toward the mathematician side of the spectrum. This is particularly unsettling because you can't predict what inputs would lead to failure. Because the probability of failure is low, people may feel safe in skipping checks to ensure this assumption hasn't been violated.
Adding some checks to make sure nothing bad has happened would probably solve most people's concerns. Sure, someday Linus might have to say "sorry, please add some whitespace to your patch, because it causes a hash collision with one of the old kernel patches", but at least it would be fixable if the problem can be detected.
So that is why Pedro shows no interest in getting into a talk about acceptable probabilities. The entire point is that algorithms should not be designed to fail unless there is a very good reason!
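(One way to implement the checks being asked for here, sketched under my own assumptions rather than anything the thread specifies: keep the full content keyed by its hash, and refuse to silently merge two different blobs that happen to share a digest. The error path corresponds to Linus having to ask for that extra whitespace.)

```python
import hashlib

class CheckedStore:
    """Content-addressed store that detects collisions instead of ignoring them."""

    def __init__(self):
        self._objects = {}  # sha1 hex digest -> original bytes

    def put(self, blob: bytes) -> str:
        key = hashlib.sha1(blob).hexdigest()
        existing = self._objects.get(key)
        if existing is not None and existing != blob:
            # Astronomically unlikely, but loud instead of silent:
            raise ValueError(f"SHA-1 collision on {key}; please tweak the input")
        self._objects[key] = blob
        return key

store = CheckedStore()
key = store.put(b"example patch contents")  # hypothetical blob
```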
Living in a world of errors
Posted Apr 14, 2005 17:33 UTC (Thu) by shane (subscriber, #3335) [Link] (3 responses)
Hard disks (well, IDE) use CRC to check for bad data. I have no evidence to support me, but I'm guessing it uses a 16-bit or 32-bit CRC:
http://www.geocities.com/SiliconValley/Pines/8659/crc.htm
While at IBM, Mandelbrot proved that it is impossible to get a clean signal by improving the signal strength, and engineers started introducing redundancy into transmissions. In some ways, this led to the necessity for packet-based transmission, and eventually to packet-switched networks, like the Internet:
http://www.fractalwisdom.com/FractalWisdom/fractal.html

In fact, TCP and UDP packets have checksums. Oh, and memory chips use parity to detect errors. Heck, CPUs have internal error detection these days.

Basically, all of the technology that exists assumes and deals with failure and risk. When you start looking at failure rates with tens of zeros, engineers shrug their shoulders and move on with life.

Mathematicians pretend their subject is somehow "pure", but consider that it took thousands of years before Euclid's 5th postulate was fully explored - in arguably the most read book on mathematics ever!

Basically, you may not like the possibility of collisions from an aesthetic point of view, but it is not a real-world problem. And like my daddy always said, there's no accounting for taste.
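(To make the "check for bad data" examples concrete, a sketch of mine rather than part of the comment: a 32-bit CRC of the kind used for disk transfers and network frames catches a single flipped bit immediately, yet it only has 2^32 possible values, so it detects corruption rather than naming content.)

```python
import zlib

# Hypothetical payload; any byte string works for the illustration.
data = b"patch: fix off-by-one in ide_dma_intr()"
corrupted = bytes([data[0] ^ 0x01]) + data[1:]   # flip one bit

print(hex(zlib.crc32(data)))        # checksum recorded with the data
print(hex(zlib.crc32(corrupted)))   # differs, so the corruption is detected
```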
ideas are perfect, implementations are not
Posted Apr 14, 2005 18:21 UTC (Thu) by vgough (guest, #2781) [Link]
It is a very interesting topic. I'd say the biggest problem between mathematicians and engineers is when one group thinks they can get along without the other! Because in general, mathematicians don't build bridges, and engineers don't lose much sleep over the limits of our understanding of the universe :-)

I agree, the problem is mostly one of aesthetics. But we should hope there will always be people around to ask if our current implementations are ugly for a reason, or just for the sake of ugliness.
Living in a world of errors
Posted Apr 14, 2005 21:48 UTC (Thu) by swiftone (guest, #17420) [Link] (1 responses)
As the article mentions, _checking_ identity (e.g. checking for bad data) is different from _assigning_ identity. All of the examples you cite are checking validity. None are assigning identity. If you look at identity algorithms in the industry (UUID, etc.) you'll find they are a different breed.
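(A small sketch of the distinction, my own illustration with invented names: checking identity recomputes a digest and compares it to a recorded value, while assigning identity mints a name that does not depend on the content at all, e.g. a random UUID.)

```python
import hashlib
import uuid

blob = b"some stored object"  # hypothetical content

# Checking identity/integrity: recompute and compare; a mismatch
# means the data changed or was corrupted.
recorded_digest = hashlib.sha1(blob).hexdigest()
assert hashlib.sha1(blob).hexdigest() == recorded_digest

# Assigning identity: a 122-bit random name, independent of the content;
# distinct objects get distinct names with overwhelming probability,
# no matter how similar their contents are.
object_id = uuid.uuid4()
print(object_id)
```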
Living in a world of errors
Posted Aug 29, 2005 15:40 UTC (Mon) by lunakid (guest, #32146) [Link]
About checking and assigning (creating) identity being different: quite correct. The difference between the two is context. Identity checking always assumes/implies the more specific context of, e.g., change characteristics. Identity assignment, OTOH, can basically only assume some upper bound on the number of identities in the supported pool, and little more (some life-cycle information for aging may also be available, etc.).
Call to attention about using hash functions as content indexers (SCM saga)
Posted Apr 19, 2005 1:55 UTC (Tue) by bfitz (guest, #4844) [Link]
Straw man argument. Yes, sure, hash collision. All the air molecules in this room might zip over to the other side and I could die of suffocation. I'm not worrying about that either, because even though it theoretically could happen, the probability of it happening is pretty damn low. I don't need to go around wearing a breathing mask for fear of Brownian motion taking my oxygen away, and I don't need to worry about SHA-1 hash collisions in entries in the repository.
Distributed asynchronous integrity checking!
Posted Apr 21, 2005 6:22 UTC (Thu) by xoddam (subscriber, #2322) [Link]
Just as a hash is a good way to check the integrity of the content, looking at the content is a pretty good way to ensure that your 'assumption' that the hash doesn't collide is good.

Using the hash for day-to-day work is fine, as long as spot checks are done on anything which is supposed to be 'authoritative', such as Linus' official tree. It's the responsibility of the paranoid to do the checks :-)

Of course any good tool will ring warning bells as soon as anything looks out of place. Hash collisions are astronomically unlikely; it is even more astronomically unlikely that two documents with the same hash value will both have meaning in the same context.
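(A rough sketch of the kind of spot check being described, under my own assumption, purely for illustration, that objects live in a flat directory named by their SHA-1 hex digests, loosely inspired by, but not identical to, git's object store.)

```python
import hashlib
from pathlib import Path

def spot_check(object_dir: str) -> list[str]:
    """Recompute the SHA-1 of every stored object and report mismatches.

    Assumes a flat directory whose file names are the SHA-1 hex digests
    of the file contents; anything that no longer matches is suspect.
    """
    bad = []
    for path in Path(object_dir).iterdir():
        if hashlib.sha1(path.read_bytes()).hexdigest() != path.name:
            bad.append(path.name)  # ring the warning bells
    return bad

if __name__ == "__main__":
    print(spot_check("objects"))  # e.g. run periodically over a mirror
```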