Moving Git past SHA-1
Posted Feb 27, 2017 22:47 UTC (Mon) by Karellen (subscriber, #67644)
Parent article: Moving Git past SHA-1
Would it increase the amount of data stored in the repository that much? All the blobs themselves would be exactly the same. A blob with an sha1 hash of "0000....ffff" but an sha3 hash of "ffff....0000" would only need to be stored once, but with two hardlinks - one for each hash type. Even a pack file, containing a bunch of blobs, could contain all the blob/diff data only once, but with each blob/diff referred to with multiple names.
The only duplicated information would be the tree, commit and tag objects. OK, for a large project with a lot of history, they'd take up a not-entirely-negligible amount of space, but it would be a pretty small fraction of the blob space, wouldn't it?
And you could build a complete set of sha-3-identified trees and commit history from the sha-1 equivalent, and manage them side-by-side indefinitely. Heck, you could (re)build an sha-1 commit history from an sha-3 one.
You could then identify commits with an extended identifier, something like "sha1:0000....ffff" or "sha3:ffff....0000". Assume "sha1" as the default if left unspecified, or make the default hash type a per-repository config option. That would allow different projects to decide which hash they want to use by default, based on the paranoia level of each project's maintainer.
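To make the "build one history from the other" idea concrete, here is a minimal sketch in Python. It uses a toy object model, not Git's real on-disk format: the tuple layout, `serialize`, `put` and `rehash` are all hypothetical names invented for illustration, and "sha3" is taken to mean SHA3-256. The only point is that trees and commits must be rewritten because they embed the ids of their children, while blob content is untouched.

```python
import hashlib

# Toy object model (NOT Git's real format). An object is one of:
#   ("blob", data_bytes)
#   ("tree", [(name, child_id), ...])
#   ("commit", tree_id, [parent_ids], message)
# A store is a dict mapping hex id -> object.

def serialize(obj):
    kind = obj[0]
    if kind == "blob":
        body = obj[1]
    elif kind == "tree":
        body = "".join(f"{name}\0{cid}\n" for name, cid in obj[1]).encode()
    else:  # commit
        _, tree, parents, msg = obj
        lines = [f"tree {tree}"] + [f"parent {p}" for p in parents] + ["", msg]
        body = "\n".join(lines).encode()
    return f"{kind} {len(body)}\0".encode() + body

def put(store, obj, algo):
    """Name the object by hashing its serialized form with `algo`."""
    oid = hashlib.new(algo, serialize(obj)).hexdigest()
    store[oid] = obj
    return oid

def rehash(src, oid, dst, algo, seen):
    """Copy the graph reachable from `oid` into `dst`, renaming every object with `algo`."""
    if oid in seen:
        return seen[oid]
    obj = src[oid]
    if obj[0] == "tree":
        obj = ("tree", [(n, rehash(src, c, dst, algo, seen)) for n, c in obj[1]])
    elif obj[0] == "commit":
        _, tree, parents, msg = obj
        obj = ("commit", rehash(src, tree, dst, algo, seen),
               [rehash(src, p, dst, algo, seen) for p in parents], msg)
    seen[oid] = put(dst, obj, algo)
    return seen[oid]

# One file, two revisions, two commits, all named by SHA-1.
sha1_store = {}
b1 = put(sha1_store, ("blob", b"v1\n"), "sha1")
t1 = put(sha1_store, ("tree", [("file", b1)]), "sha1")
c1 = put(sha1_store, ("commit", t1, [], "first"), "sha1")
b2 = put(sha1_store, ("blob", b"v2\n"), "sha1")
t2 = put(sha1_store, ("tree", [("file", b2)]), "sha1")
c2 = put(sha1_store, ("commit", t2, [c1], "second"), "sha1")

# Build the SHA-3 history from the SHA-1 one, then go back again.
sha3_store = {}
c2_sha3 = rehash(sha1_store, c2, sha3_store, "sha3_256", {})
sha1_again = {}
c2_back = rehash(sha3_store, c2_sha3, sha1_again, "sha1", {})
```

In this toy model the round trip gives back the original SHA-1 commit id, which is the property that would let both histories be maintained side by side. The recursion is only suitable for tiny examples; a real history would need an iterative walk.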
Posted Feb 27, 2017 23:04 UTC (Mon)
by nix (subscriber, #2304)
[Link] (14 responses)
Nobody's written that code, so with current code it will assume all the blobs are different and write them out all over again, even though they're all identical. Not even repacking all objects would fix that, because that doesn't rehash anything either...
Linus seems to be happy with this, but having seen the amount of space it would waste forevermore in e.g. the Chromium repo, I'm not so happy :(
Posted Feb 28, 2017 1:07 UTC (Tue)
by BenHutchings (subscriber, #37955)
[Link] (1 responses)
Posted Feb 28, 2017 12:03 UTC (Tue)
by nix (subscriber, #2304)
[Link]
OK, all my objections to this scheme are gone then: except for a slightly increased diff cost across the boundary (to, uh, what non-git users pay all the time) it is costless. :)
Posted Feb 28, 2017 3:27 UTC (Tue)
by smurf (subscriber, #17840)
[Link] (1 responses)
Posted Feb 28, 2017 6:54 UTC (Tue)
by johill (subscriber, #25196)
[Link]
Posted Feb 28, 2017 8:39 UTC (Tue)
by Karellen (subscriber, #67644)
[Link] (7 responses)
Sorry, I wasn't explaining myself clearly.
Rather than have a new-hash commit that's a child of an old-hash commit, maintain two complete sets of commits/history, one for each hash. You can build one history from the other.
e.g. Say you have a repository containing a single file with two revisions. At the moment, you have a repo like (excuse shortened and comically simplified hash values):
Now, replace that with:
In the above case, sha1/00/000001 and sha3/ff/000001 are hard links to the same file. The blob is identical in each case, it's just linked to multiple times with different names. The same goes for the second revision of the file. Only the tree and commit objects need to be duplicated for each hash, but they're going to be orders of magnitude smaller than the sizes of the blobs. You can build one tree from the other, and even build an entirely new commit history based on a different hash at any time in the future. When you create a new commit, a new commit object will be created for each hash you've configured your repo to use.
Anyone can refer to a commit by either id (sha1:00200002 /or/ sha3:ff200002), and, provided you have the relevant commit history built in your repo, you can see which commit they're talking about. References to old commits are still valid, because you still have (and can continue to maintain) the old hash history.
A project could say "from date X we will talk to each other about commits using sha3 refs by default", but that won't need to change how your repo actually works under the hood.
Does that explain what I'm trying to get at a bit better?
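The hard-link part of this can be sketched in a few lines of Python. Git does name a blob by hashing the header "blob <size>\0" followed by the content; reusing that header under SHA3-256, the `sha1/` and `sha3_256/` directory names, and the helper names `blob_id` and `store_blob` are assumptions made for illustration. Unlike real loose objects, the file here is written uncompressed.

```python
import hashlib
import os
import tempfile

def blob_id(data, algo):
    # Git's blob naming: hash of b"blob <size>\0" + content.
    # Using the same header for SHA3-256 is an assumption of this sketch.
    return hashlib.new(algo, b"blob %d\0" % len(data) + data).hexdigest()

def store_blob(objects_dir, data, algos=("sha1", "sha3_256")):
    """Write the blob once, then hard-link it under one name per hash type."""
    paths = []
    for algo in algos:
        oid = blob_id(data, algo)
        path = os.path.join(objects_dir, algo, oid[:2], oid[2:])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        if not paths:
            # First name: actually write the content.
            with open(path, "wb") as f:
                f.write(data)
        elif not os.path.exists(path):
            # Later names: another directory entry for the same file.
            os.link(paths[0], path)
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as tmp:
    sha1_path, sha3_path = store_blob(tmp, b"hello\n")
    same_file = os.path.samefile(sha1_path, sha3_path)
    link_count = os.stat(sha1_path).st_nlink
```

Both paths refer to one file on disk, so the blob's space is paid once. This relies on a filesystem that supports hard links and says nothing about packs, which is where most of a large repository's data actually lives.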
Posted Feb 28, 2017 11:19 UTC (Tue)
by andrewsh (subscriber, #71043)
[Link] (5 responses)
Posted Feb 28, 2017 12:06 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Ah well, disk space really is cheap: I've just never internalized it properly. 250MiB still seems like a lot rather than about 1p. (And that price is taken from *enterprise* storage costs. Disks that normal humans not obsessed with RAIDing everything use will be even cheaper.)
Posted Mar 1, 2017 10:39 UTC (Wed)
by zenaan (guest, #3778)
[Link]
Is 250MiB just the mapping size (i.e. no object data)? Sounds surprising, but perhaps not for the number of years (rather, objects) involved?
Flag day doesn't sound so bad - a repo can simply store some meta-data for its UTC flag day or "flag moment", with a certain quiescence period (say 15 minutes) during which pulls and pushes are not accepted... thereby ensuring world-sync for the flag moment, yes?
Posted Feb 28, 2017 13:06 UTC (Tue)
by Karellen (subscriber, #67644)
[Link]
I have basic familiarity with *some* git internals, but "how packs are formatted" has never been something I've needed to delve into. It's entirely possible that could scuttle the whole idea.
Posted Feb 28, 2017 14:59 UTC (Tue)
by garyvdm (subscriber, #82325)
[Link] (1 responses)
Posted Feb 28, 2017 15:11 UTC (Tue)
by nix (subscriber, #2304)
[Link]
But the mapping from SHA-1 IDs to object offsets in each pack is computed at receive time and stored in the index. (This means that every object lookup must check each index in turn, which is why occasionally repacking them into a single big pack without actually recompressing any content at all is still a significant win.)
Posted Feb 28, 2017 22:49 UTC (Tue)
by magfr (subscriber, #16052)
[Link]
What happens under this scheme when I add a second blob with the same sha1 hash as one already in the repository but different content?
Posted Feb 28, 2017 21:30 UTC (Tue)
by jengelh (guest, #33263)
[Link] (1 responses)
Not sure how hashes are stored internally, but pretending they are dealt with in their hex form, why not add a character not in the [0-9a-f] alphabet?
if (strlen(hash) == 40 && strchr(hash, ':') == NULL) { method = sha1; } else { (method, hash) = split("f00f....f00f", ":"); } kind of thing.
Something like that worked for shadow(5) too when $2a$ et al support was added.
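For what it's worth, the prefix check above could look roughly like this in Python. The function name `parse_object_id`, the table of digest lengths, and reading "sha3" as SHA3-256 are all assumptions for illustration; the sketch also insists on full-length ids, whereas Git accepts abbreviations.

```python
# Hex digest lengths per method. "sha3" is assumed to mean SHA3-256 here.
HEX_LENGTHS = {"sha1": 40, "sha3": 64}
HEX_DIGITS = set("0123456789abcdef")

def parse_object_id(text, default="sha1"):
    """Split 'method:hexdigest' into (method, hexdigest).

    An id without a ':' prefix is treated as the default method, since ':'
    can never appear in a hex digest.
    """
    method, sep, digest = text.partition(":")
    if not sep:
        method, digest = default, text
    if method not in HEX_LENGTHS:
        raise ValueError(f"unknown hash method: {method!r}")
    digest = digest.lower()
    if len(digest) != HEX_LENGTHS[method] or not set(digest) <= HEX_DIGITS:
        raise ValueError(f"malformed {method} id: {text!r}")
    return method, digest

bare = parse_object_id("0" * 36 + "ffff")
tagged = parse_object_id("sha3:" + "ffff" + "0" * 60)
```

The `default` parameter is where a per-repository configuration option for the default hash type would plug in.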
Posted Mar 1, 2017 0:08 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Karellen's example layout before the change (from the comment posted Feb 28, 2017 8:39 UTC):
.git/
    objects/
        00/000001 -- first revision of file
        00/000002 -- second revision of file
        00/100001 -- tree containing first revision of file
        00/100002 -- tree containing second revision of file
        00/200001 -- commit containing first tree
        00/200002 -- commit containing second tree
    refs/
        heads/
            master -- points to 00200002
Karellen's proposed layout with one object directory per hash type (from the comment posted Feb 28, 2017 8:39 UTC):
.git/
    objects/
        sha1/00/000001 -- first revision of file
        sha1/00/000002 -- second revision of file
        sha1/00/100001 -- sha1-based tree containing first revision of file
        sha1/00/100002 -- sha1-based tree containing second revision of file
        sha1/00/200001 -- sha1-based commit containing first tree
        sha1/00/200002 -- sha1-based commit containing second tree
        sha3/ff/000001 -- first revision of file
        sha3/ff/000002 -- second revision of file
        sha3/ff/100001 -- sha3-based tree containing first revision of file
        sha3/ff/100002 -- sha3-based tree containing second revision of file
        sha3/ff/200001 -- sha3-based commit containing first tree
        sha3/ff/200002 -- sha3-based commit containing second tree
    refs/
        heads/
            master -- points to sha1:00200002
I think (I'm not 100% sure) that the object ids are not stored in the pack file, only in the pack index. The pack file can be enumerated, and the object ids can be calculated. Hence it would be possible to have one pack file with several index files, one per object id type.
Such a repo might look like this:
.git/
    objects/
        pack/
            pack-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.idx
            pack-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.idx-sha3
            pack-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.pack
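A rough Python sketch of "one pack, several indexes" follows. The pack format here is a toy (length-prefixed raw objects, no deltas, no compression, no object headers) and has nothing to do with Git's real pack or idx formats; `write_pack`, `build_index` and `read_object` are hypothetical helpers. It only illustrates that if ids live solely in the index, a second index for another hash can be computed by enumerating the same pack.

```python
import hashlib
import struct

def write_pack(objects):
    """Toy pack: each object is a 4-byte big-endian length followed by its bytes."""
    return b"".join(struct.pack(">I", len(o)) + o for o in objects)

def build_index(pack, algo):
    """Enumerate the pack once and map each object's id (under `algo`) to its offset."""
    index, offset = {}, 0
    while offset < len(pack):
        (size,) = struct.unpack_from(">I", pack, offset)
        data = pack[offset + 4 : offset + 4 + size]
        index[hashlib.new(algo, data).hexdigest()] = offset
        offset += 4 + size
    return index

def read_object(pack, index, oid):
    """Look an object up by id through whichever index knows that id."""
    offset = index[oid]
    (size,) = struct.unpack_from(">I", pack, offset)
    return pack[offset + 4 : offset + 4 + size]

revisions = [b"first revision\n", b"second revision\n"]
pack = write_pack(revisions)
idx_sha1 = build_index(pack, "sha1")       # plays the role of the .idx file
idx_sha3 = build_index(pack, "sha3_256")   # plays the role of the .idx-sha3 file

sha1_name = hashlib.sha1(revisions[1]).hexdigest()
sha3_name = hashlib.sha3_256(revisions[1]).hexdigest()
via_sha1 = read_object(pack, idx_sha1, sha1_name)
via_sha3 = read_object(pack, idx_sha3, sha3_name)
```

Both indexes point at the same offsets in the same pack, so the object data is stored once. Whether real packs can be treated this way depends on details of the pack format that this sketch deliberately ignores.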
