Why not just use the SHA1 only?

Posted Nov 15, 2008 19:45 UTC (Sat) by scarabaeus (guest, #7142)
In reply to: Why not just use the SHA1 only? by nevets
Parent article: /dev/ksm: dynamic memory sharing

IMHO nevets's suggestion is the wrong way round. For better performance, the initial checksum should be a very fast 64-bit checksum - possibly even simpler than CRC. (The style of weak+fast checksum used by rsync springs to mind...)

Even on systems with a huge number of pages, hash collisions will be far too rare to affect performance, especially since memcmp() stops at the first difference it finds between the compared pages.

This way, you save memory for the checksums (which also improves hash lookup performance), and the checking will be faster.
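
A minimal Python sketch of the scheme described above, under stated assumptions: weak_checksum() is a hypothetical stand-in (a plain sum of 64-bit words, not rsync's rolling checksum and not what KSM actually does), and find_shareable() is an illustrative name. Pages are bucketed by the cheap checksum, and a full byte comparison, playing the role of memcmp(), runs only inside a bucket.

```python
import struct
from collections import defaultdict

PAGE_SIZE = 4096          # assumed page size; must be a multiple of 8 here
MASK64 = (1 << 64) - 1


def weak_checksum(page: bytes) -> int:
    """Hypothetical cheap checksum: sum of the page's 64-bit words, mod 2**64."""
    words = struct.unpack(f"<{len(page) // 8}Q", page)
    return sum(words) & MASK64


def find_shareable(pages):
    """Return (duplicate_index, original_index) pairs for byte-identical pages."""
    buckets = defaultdict(list)   # weak checksum -> indices of distinct pages
    shared = []
    for i, page in enumerate(pages):
        candidates = buckets[weak_checksum(page)]
        for j in candidates:
            # Full comparison, like memcmp(): a checksum match alone is
            # never trusted, so collisions only cost one extra compare.
            if pages[j] == page:
                shared.append((i, j))
                break
        else:
            # No identical page in this bucket: remember this one.
            candidates.append(i)
    return shared
```

With this layout a checksum collision cannot cause two different pages to be merged; it only adds one more comparison inside the bucket.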

Why not just use the SHA1 only?

Posted Nov 18, 2008 14:41 UTC (Tue) by nevets (subscriber, #11875)

Actually, my comment was a little facetious. My point was not to suggest a fix for this algorithm, but to comment on what git is doing. The git repo really relies on there being absolutely no hash collisions. If one happens, the database is corrupted. I keep hearing that the chances of this happening are astronomically low, but the fact that it can happen at all bothers me.

Why not just use the SHA1 only?

Posted Nov 18, 2008 16:07 UTC (Tue) by nix (subscriber, #2304)

Jean-Luc Herren did the maths recently on the git list, in

In case it's interesting to someone, I once calculated (and wrote
down) the math for the following scenario:

- 10 billion humans are programming
- They *each* produce 5000 git objects every day
- They all push to the same huge repository
- They keep this up for 50 years

With those highly exaggerated assumptions, the probability of
getting a hash collision in that huge git object database is
6e-13. Provided I got the math right.

So, mathematically speaking you have to say "yes, it *is*
possible". But math aside it's perfectly correct to say "no, it
won't happen, ever". (Speaking about the *accidental* case.)
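
For anyone who wants to check the order of magnitude, here is a rough Python sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2·2^160)). The assumptions are mine, not from the quoted mail: 365-day years, SHA-1 treated as a uniformly random 160-bit function, and the function name collision_probability() is made up for illustration.

```python
import math


def collision_probability(n: int, bits: int = 160) -> float:
    """Birthday approximation for at least one collision among n random hashes."""
    space = 1 << bits
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-(n * (n - 1)) / (2 * space))


# Scenario from the quoted mail; 365-day years are an assumption.
humans = 10_000_000_000
objects_per_day = 5000
days = 365 * 50

n_objects = humans * objects_per_day * days
p = collision_probability(n_objects)
print(f"{n_objects:.3e} objects, collision probability about {p:.1e}")
```

Under these assumptions the result comes out at roughly 2.8e-13, the same order of magnitude as the quoted 6e-13. The quote does not show the method used, so the factor-of-two difference may just reflect a slightly different approximation.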

Why not just use the SHA1 only?

Posted Nov 18, 2008 18:37 UTC (Tue) by dlang (subscriber, #313)

git will never overwrite an object that it thinks that it has.

so git could get corrupted, but not by overwriting old data and losing it; it would be corrupted by failing to save new data (much easier to detect, as that is the data that people would be trying to use)

there is an option in git (I don't remember if it's compile time or not) to do additional checking when saving data: it verifies that the data really is the same even if it has the same hash, and gives an error if it's not.
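
The behaviour described above can be illustrated with a toy content-addressed store in Python. This is only a sketch, not git's code or object format: ObjectStore, its verify flag, and the pluggable hash_fn parameter are all hypothetical names, and the store hashes raw bytes with SHA-1 by default.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class ObjectStore:
    """Toy content-addressed store (hypothetical; not git's implementation)."""

    def __init__(self, hash_fn=sha1_hex, verify=False):
        self._objects = {}
        self._hash_fn = hash_fn   # pluggable so a collision can be simulated
        self._verify = verify     # optional extra check on a hash match

    def put(self, data: bytes) -> str:
        key = self._hash_fn(data)
        existing = self._objects.get(key)
        if existing is None:
            self._objects[key] = data
        elif self._verify and existing != data:
            # Same hash, different content: report it instead of
            # silently dropping the new data.
            raise ValueError(f"hash collision on {key}")
        # An object the store already has is never overwritten.
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


# Simulate a collision with a deliberately broken hash function.
store = ObjectStore(hash_fn=lambda data: "same")
store.put(b"old data")
store.put(b"new data")            # silently ignored: old data is kept
```

Without the check, the colliding new object is the one that goes missing; with verify=True the second put() raises instead.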

Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.