Why not just use the SHA1 only?

Posted Nov 15, 2008 0:06 UTC (Sat) by nevets (subscriber, #11875)
Parent article: /dev/ksm: dynamic memory sharing

KSM will generate an SHA1 hash of the page's contents. That hash will then be used to look up other pages with the same hash value. If a subsequent memcmp() call shows that the contents of the pages are truly identical,

I've been informed by the git Gods that the chances of a SHA1 collision are the same as those of the Earth being sucked up into a black hole created by the CERN accelerator. If this is indeed true, then there's no need for that extra memcmp.

I brought this up with the git God himself regarding git's use of SHA1. He did agree with me that if there were a SHA1 collision in git, the database would be corrupted. But he blew it off as having a snowball's chance in hell. He did show some concern that there might be a way to crack the algorithm.

But who am I to question a git God?
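
The mechanism under discussion is hash-as-filter, memcmp()-as-proof. A rough user-space sketch of that lookup-then-verify flow might look like the following (invented names and a stand-in digest function, not the actual KSM code):

/*
 * Rough user-space sketch of hash-as-filter, memcmp()-as-proof -- the
 * names and the digest function are invented; this is not KSM code.
 * The hash only narrows the candidates; memcmp() gives the final answer.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  4096
#define NR_BUCKETS (1u << 16)

struct tracked_page {
	struct tracked_page *next;	/* hash-bucket chain */
	unsigned char data[PAGE_SIZE];
};

static struct tracked_page *buckets[NR_BUCKETS];

/* stand-in digest: anything that maps a page to a bucket index will do */
static uint32_t page_digest(const unsigned char *page)
{
	uint32_t h = 2166136261u;	/* FNV-1a, purely as an example */
	size_t i;

	for (i = 0; i < PAGE_SIZE; i++)
		h = (h ^ page[i]) * 16777619u;
	return h % NR_BUCKETS;
}

/* return an already-tracked page with identical contents, or NULL */
static struct tracked_page *find_identical(const unsigned char *page)
{
	struct tracked_page *p;

	for (p = buckets[page_digest(page)]; p; p = p->next) {
		/* a matching hash is only a hint; memcmp() is the real test */
		if (memcmp(p->data, page, PAGE_SIZE) == 0)
			return p;
	}
	return NULL;
}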



Why not just use the SHA1 only?

Posted Nov 15, 2008 8:25 UTC (Sat) by dlang (subscriber, #313) [Link]

actually the statement is that you can't deliberately come up with a conflicting sha1.

there are databases that hold the sha1s of various files, and there are a lot of known conflicts in them

Why not just use the SHA1 only?

Posted Nov 15, 2008 14:46 UTC (Sat) by jbh (subscriber, #494) [Link]

Are you sure? According to Wikipedia, no collisions have been found (although it is known that one can be found with complexity 2^63, less than the expected 2^80).

Why not just use the SHA1 only?

Posted Nov 15, 2008 14:56 UTC (Sat) by jbh (subscriber, #494) [Link]

Just to be clear: If you restrict yourself to "collision-prone" SHA1s, there's a 1/2^63 chance of conflict. With normal (random) SHA1s, the chance is 1/2^80. Deliberately creating a conflict with a given SHA1 (a second preimage attack) is still 1/2^160, and the chance of that second preimage being non-gibberish is substantially lower.
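
In the more usual framing (treating SHA-1 as an ideal 160-bit hash and glossing over constant factors), those figures correspond roughly to:

\begin{align*}
\Pr[\text{two given random messages collide}] &= 2^{-160} \\
\text{messages before some pair is likely to collide (birthday bound)} &\approx 2^{80} \\
\text{best publicly known SHA-1 collision attack (as of 2008)} &\approx 2^{63} \text{ hash operations} \\
\text{work to find a second preimage of a given digest} &\approx 2^{160} \text{ hash operations}
\end{align*}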

Why not just use the SHA1 only?

Posted Nov 15, 2008 19:45 UTC (Sat) by scarabaeus (guest, #7142) [Link]

IMHO nevets's suggestion is the wrong way round. For better performance, the initial checksum should be a very fast 64-bit checksum - possibly even simpler than CRC. (The style of weak+fast checksum used by rsync springs to mind...)

Even on systems with huge numbers of pages, the likelihood of hash collisions will be far too low to affect performance, not least because memcmp() will abort early as soon as it finds a difference between the compared pages.

This way, you save memory for the checksums (which also improves hash lookup performance), and the checking will be faster.
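
For illustration, the kind of weak-but-fast checksum being suggested, modelled loosely on rsync's rolling checksum (a sketch of the idea, not a concrete KSM patch; a 64-bit variant would be analogous):

/*
 * Weak-but-fast per-page checksum in the spirit of rsync's rolling
 * checksum -- a sketch of the idea, not a concrete KSM patch.
 */
#include <stddef.h>
#include <stdint.h>

static uint32_t weak_checksum(const unsigned char *buf, size_t len)
{
	uint32_t a = 0, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a += buf[i];				/* plain byte sum        */
		b += (uint32_t)(len - i) * buf[i];	/* position-weighted sum */
	}
	/* pack the two 16-bit halves into a single 32-bit value */
	return (a & 0xffffu) | (b << 16);
}

A single pass like that touches each byte once with no cryptographic mixing, so it is far cheaper than SHA-1, and any false positives are caught by the memcmp() that has to run anyway.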

Why not just use the SHA1 only?

Posted Nov 18, 2008 14:41 UTC (Tue) by nevets (subscriber, #11875) [Link]

Actually, my comment was a little facetious. My point was not to suggest a way to fix this algorithm, but to comment on what git is doing. The git repo really relies on there being absolutely no conflicts. If one happens, the database is corrupted. I keep hearing that the chances of this happening are astronomically low, but the fact that it can happen at all bothers me.

Why not just use the SHA1 only?

Posted Nov 18, 2008 16:07 UTC (Tue) by nix (subscriber, #2304) [Link]

Jean-Luc Herren did the maths recently on the git list, in
<48E4ABC0.80100@gmx.ch>:

In case it's interesting to someone, I once calculated (and wrote
down) the math for the following scenario:

- 10 billion humans are programming
- They *each* produce 5000 git objects every day
- They all push to the same huge repository
- They keep this up for 50 years

With those highly exaggerated assumptions, the probability of
getting a hash collision in that huge git object database is
6e-13. Provided I got the math right.

So, mathematically speaking you have to say "yes, it *is*
possible". But math aside it's perfectly correct to say "no, it
won't happen, ever". (Speaking about the *accidental* case.)
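
For what it's worth, the standard birthday approximation reproduces that order of magnitude (a rough check, not a re-derivation of his exact figure):

\begin{align*}
n &= 10^{10} \times 5000 \times 365 \times 50 \approx 9.1 \times 10^{17} \text{ objects} \\
p &\approx \frac{n^2}{2 \cdot 2^{160}}
   \approx \frac{8.3 \times 10^{35}}{2.9 \times 10^{48}}
   \approx 3 \times 10^{-13}
\end{align*}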

Why not just use the SHA1 only?

Posted Nov 18, 2008 18:37 UTC (Tue) by dlang (subscriber, #313) [Link]

git will never overwrite an object that it thinks it already has.

so git could get corrupted, but it would not be corrupted by overwriting old data and losing it; it would be corrupted by not saving new data (which is much easier to detect, since that is the data people would be trying to use)

there is an option in git (I don't remember if it's compile-time or not) to do additional checking when saving data: if an object with the same hash already exists, it verifies that the contents really are the same and gives an error if they are not.
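
The policy described above, sketched with made-up names (an illustration of the behaviour, not git's actual code):

/*
 * Illustration of the "never overwrite an existing object" policy
 * described above -- made-up names, not git's actual implementation.
 */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN    20		/* SHA-1 digest size */
#define MAX_OBJECTS 1024	/* toy fixed-size store */

struct object {
	unsigned char hash[HASH_LEN];
	const void *data;
	size_t len;
};

static struct object store[MAX_OBJECTS];
static size_t nr_objects;

static struct object *store_lookup(const unsigned char *hash)
{
	size_t i;

	for (i = 0; i < nr_objects; i++)
		if (memcmp(store[i].hash, hash, HASH_LEN) == 0)
			return &store[i];
	return NULL;
}

/* returns false only if optional "paranoid" checking spots a collision */
static bool store_object(const unsigned char *hash,
			 const void *data, size_t len, bool paranoid)
{
	struct object *old = store_lookup(hash);

	if (!old) {				/* new object: just add it */
		if (nr_objects < MAX_OBJECTS) {
			memcpy(store[nr_objects].hash, hash, HASH_LEN);
			store[nr_objects].data = data;
			store[nr_objects].len = len;
			nr_objects++;
		}
		return true;
	}

	/* an object with this hash already exists: never overwrite it */
	if (paranoid && (old->len != len || memcmp(old->data, data, len) != 0))
		return false;			/* same hash, different content */

	return true;				/* treat it as the same object */
}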

Why not just use the SHA1 only?

Posted Nov 16, 2008 13:39 UTC (Sun) by nix (subscriber, #2304) [Link]

It only generates a hash of the first, what, 128 bytes of the page, so any
pages with the same leading 128 bytes will 'collide' (in the sense that
the hashes match even though the remainders of the pages may differ).

Why not just use the SHA1 only?

Posted Nov 18, 2008 4:30 UTC (Tue) by nevets (subscriber, #11875) [Link]

From your friends at Wikipedia, there is an article on SHA-1

SHA-1 (as well as SHA-0) produces a 160-bit digest from a message with a maximum length of (2^64 − 1) bits

This looks like it can be much bigger than 128 bytes.

Why not just use the SHA1 only?

Posted Nov 18, 2008 16:01 UTC (Tue) by nix (subscriber, #2304) [Link]

Yeah, but for speed reasons they're only hashing the first N bytes (I
think it's 128), rather than the whole page. It's a sensible tradeoff, I
think.

Why not just use the SHA1 only?

Posted May 23, 2010 9:47 UTC (Sun) by rafal.maj (guest, #66508) [Link]

This doesn't make sense; surely they are hashing the entire page.

Why not just use the SHA1 only?

Posted May 23, 2010 10:57 UTC (Sun) by johill (subscriber, #25196) [Link]

Well, but you clearly didn't read the code:
+#define PAGECMP_OFFSET 128
+#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
+/* hash the page */
+static void page_hash(struct page *page, unsigned char *digest)
+{
+	struct scatterlist sg;
+	struct hash_desc desc;
+
+	sg_init_table(&sg, 1);
+	sg_set_page(&sg, page, PAGEHASH_SIZE, 0);
+	desc.tfm = tfm;
+	desc.flags = 0;
+	crypto_hash_digest(&desc, &sg, PAGEHASH_SIZE, digest);
+}
and it does "for sure" make sense, since it's just a way to speed up matching, and using only 128 bytes means far less data loaded from memory. Tuning the 128 might make sense, though.

