Updating the Git protocol for SHA-256
Posted Jun 19, 2020 16:38 UTC (Fri) by sytoka (guest, #38525)
Parent article: Updating the Git protocol for SHA-256
Posted Jun 19, 2020 17:09 UTC (Fri)
by pj (subscriber, #4506)
[Link] (19 responses)
Posted Jun 20, 2020 0:34 UTC (Sat)
by ms-tg (subscriber, #89231)
[Link] (18 responses)
Posted Jun 20, 2020 14:07 UTC (Sat)
by Ericson2314 (guest, #139248)
[Link] (10 responses)
Git really should start merkleizing blob hashes / chunking blobs. Not only does it help with data exchange, but it also means faster rehashing when a blob changes: O(log n) instead of O(n). This transition is the best time to fix things like this; it's a pity they don't seem to be under discussion.
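To make the O(n) vs O(log n) point concrete, here is a minimal sketch of a Merkle tree over fixed-size chunks. It is not anything Git implements or has proposed: the chunk size, the one-byte leaf/node domain separators, and all function names are assumptions made up for illustration.

```python
import hashlib

CHUNK = 4096  # hypothetical fixed chunk size, chosen only for illustration


def leaf_hashes(data: bytes) -> list:
    """Hash each fixed-size chunk of the blob independently."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    return [hashlib.sha256(b"\x00" + c).digest() for c in chunks]


def build_levels(leaves: list) -> list:
    """Return every tree level, from the leaves up to the single root."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = []
        for i in range(0, len(prev), 2):
            right = prev[i + 1] if i + 1 < len(prev) else b""
            nxt.append(hashlib.sha256(b"\x01" + prev[i] + right).digest())
        levels.append(nxt)
    return levels


def update_chunk(levels: list, index: int, new_chunk: bytes) -> int:
    """Replace one chunk, rehash only its ancestors, return hashes computed."""
    levels[0][index] = hashlib.sha256(b"\x00" + new_chunk).digest()
    work = 1
    for depth in range(1, len(levels)):
        index //= 2
        prev = levels[depth - 1]
        left = prev[2 * index]
        right = prev[2 * index + 1] if 2 * index + 1 < len(prev) else b""
        levels[depth][index] = hashlib.sha256(b"\x01" + left + right).digest()
        work += 1
    return work
```

With 8 chunks, changing one chunk costs 4 hash computations (one leaf plus three ancestors) rather than rehashing the whole blob. The changed chunk itself still has to be hashed in full, so the saving only matters when the blob is much larger than one chunk.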
Posted Jun 20, 2020 23:14 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (9 responses)
- best left to git-lfs, git-annex, or some other off-loading tool
I think experiments to test the actual benefits in organic Git repositories would be interesting, but I'd rather see the hash transition happen correctly and smoothly, and it sounds complicated enough as it is. And it should be laying down version numbers in formats where it needs them, which another such transition could leverage to ease its upgrade path too.
Posted Jun 21, 2020 13:46 UTC (Sun)
by pabs (subscriber, #43278)
[Link] (8 responses)
Posted Jun 22, 2020 2:07 UTC (Mon)
by cyphar (subscriber, #110703)
[Link]
Posted Jun 27, 2020 5:14 UTC (Sat)
by ras (subscriber, #33059)
[Link] (6 responses)
But you say it has friends?
Posted Jun 27, 2020 5:46 UTC (Sat)
by pabs (subscriber, #43278)
[Link] (5 responses)
https://borgbackup.github.io/borgbackup/
There is also bup, much more closely related to git:
Posted Jun 27, 2020 7:19 UTC (Sat)
by johill (subscriber, #25196)
[Link] (4 responses)
Borg's encryption design seems to have one issue: as far as I can tell, the "content-based chunker" has a very small key (they claim 32 bits, but say it goes linearly through the algorithm, so not all of those bits eventually matter), which would seem to allow fingerprinting attacks ("you have this chain of chunk sizes, so you must have this file"). Borg has also been debating S3 storage for years without any movement.
Ultimately I landed with bup (which I had used previously), and have been working on adding both (asymmetric) encryption support and AWS/S3 storage to it; in the latter case you can effectively make your repo append-only (to the machine that's making the backup), i.e. AWS permissions ensure that it cannot actually delete the data. It could delete some metadata tables etc. but that's mostly recoverable (though I haven't written the recovery tools yet), apart from the ref names (which are only stored in DynamoDB for consistency reasons, since S3 has almost no consistency guarantees).
It's probably not ready for mainline yet (and we're busy finishing the python 3 port in mainline), but I've actually used it recently to begin storing some of my backups (currently ~850GiB) in S3 Deep Archive.
Configuration references:
Some design documentation is in the code:
If you use it, there are two other things in my tree that you'd probably want:
1) with a lot of data, the content-based splitting on 13 bits results in far too much metadata (and storage isn't that expensive anymore), so you'd want to increase that. Currently in master that's not configurable, but I changed that: https://github.com/jmberg/bup/blob/master/Documentation/b...
2) if you have lots of large directories (e.g. maildir) then minor changes to those currently consume a significant amount of storage space, since the entire folder (the list of files) is saved again. I have "treesplit" in my code that allows splitting up those trees (again, content-based) to avoid that issue, which for my largest maildir of ~400k files brings down the amount of new data saved from close to 10 MB (after compression) to <<50kB when a new email is written there. Looks like I didn't document that yet, but I should add it here: https://github.com/jmberg/bup/blob/master/Documentation/b.... The commit describes it a bit now: https://github.com/jmberg/bup/commit/44006daca4786abe31e3...
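To illustrate why the number of split bits in point 1 matters, here is a minimal sketch of content-based splitting. It is not bup's actual rollsum code: the window size, the rsync-style rolling sum, the "low bits all ones" cut rule, and the name split_points are assumptions chosen for illustration.

```python
WINDOW = 64  # hypothetical rolling-window size in bytes


def split_points(data: bytes, bits: int) -> list:
    """Return cut offsets where the low `bits` bits of a rolling sum are all 1.

    The sums depend only on the last WINDOW bytes, so cut positions follow
    the content rather than absolute offsets in the stream.
    """
    mask = (1 << bits) - 1
    s1 = s2 = 0
    cuts = []
    for i, byte in enumerate(data):
        # Byte leaving the window (treated as zero before the stream starts).
        old = data[i - WINDOW] if i >= WINDOW else 0
        s1 = (s1 + byte - old) & 0xFFFF           # plain sum of the window
        s2 = (s2 + s1 - WINDOW * old) & 0xFFFF    # position-weighted sum
        if (s2 & mask) == mask:
            cuts.append(i + 1)
    return cuts
```

On random-looking data a cut fires with probability about 2^-bits per byte, so 13 bits gives chunks of roughly 8 KiB on average, and each extra bit roughly halves the number of chunks (and the per-chunk metadata). Every cut found with more bits is also a cut with fewer bits, which is why raising the value only merges chunks.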
And yes, I'm working with upstream on this.
Posted Jun 27, 2020 7:31 UTC (Sat)
by pabs (subscriber, #43278)
[Link] (1 responses)
Posted Jun 27, 2020 7:42 UTC (Sat)
by johill (subscriber, #25196)
[Link]
However, it's not nearly as bad in git? You're not storing hundreds of thousands of files in a folder in git, presumably? :-) Not sure how much interest there would be in git on that.
Posted Jun 27, 2020 7:48 UTC (Sat)
by johill (subscriber, #25196)
[Link]
Posted Jul 8, 2020 19:22 UTC (Wed)
by nix (subscriber, #2304)
[Link]
By this point, as a mere observer, I would say you *are* one of upstream. You're one of the two people doing most of the bup commits and have been for over a year now. :)
Posted Jun 20, 2020 15:36 UTC (Sat)
by hmh (subscriber, #3838)
[Link] (6 responses)
At that point it becomes app specific, and other than the obvious protocol best practice that you should explicitly encode the protocol version (in this case what hash and hash parameters if not implied), there is little to be gained.
Prefixing (hidden by base# or explicitly) the hash type in git has already been covered by other replies and posts, and yes, imho it really should be done if at all possible.
Posted Jun 20, 2020 17:11 UTC (Sat)
by cyphar (subscriber, #110703)
[Link] (5 responses)
Now, there isn't an IANA-like procedure; everything is done via PRs on GitHub, but that's just a difference in administrative structure.
Posted Jun 20, 2020 18:42 UTC (Sat)
by hmh (subscriber, #3838)
[Link] (1 responses)
This link you sent is much better, the other one lacks essential information...
I am quite sure git would severely restrict the allowed hashes, but at least the design of multihash seems sane and safely extensible, including when one makes the short-sighted error of enshrining short prefixes of the hash anywhere that is not a throwaway command-line call... A bad practice that is very common among git users.
Posted Jun 20, 2020 23:17 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link]
"Best practice" for short usage in more permanent places includes the date (or tag description) and summary of the commit in question (which both greatly ease conflict resolution when it occurs and give some idea of what's going on without having to copy/paste the hash yourself).
Posted Jun 22, 2020 15:37 UTC (Mon)
by tialaramex (subscriber, #21167)
[Link]
RFC 8126 lists 10 such procedures for general use in new namespaces.
So what Multihash are doing here sounds like a typical new IANA namespace which has an Experimental/Private Use region (self-assigned) and then Specification Required for the rest of the namespace. You must document what you're doing, maybe with a Standards Organisation, maybe you write a white paper, maybe even you just spin up a web site with a technical rant, but you need to document it and then you get reviewed and maybe get in.
Apparently Multihash is writing up some sort of formal document to maybe go to the IETF, but given that they started in 2016 and it's not that hard, they may never get it polished up and standardised anywhere. It's not a problem.
Posted Jun 24, 2020 4:03 UTC (Wed)
by nevyn (guest, #33129)
[Link] (1 responses)
Another similar point is the table itself, the hashes added are done ad hoc when someone uses them and wants to use multihash ... again, fine if the project is very new and gaining traction but much less good if the project is established and you go see that none of https://github.com/dgryski/dgohash are there. I understand it's volunteer based contributions but if you want people to actually use your std. it's going to be much easier if they can use it without having to self register well known/used decade old types.
Then there's the format itself. I understand that hashes are variable length but showing abbreviated hashes is very well known at this point. A new git repo shows 7 characters for the --abbrev hash, ansible with over 50k commits only shows 10 (and even then github only shows 7), and they want to add "1220" to the front of that? And they really want you to show it to the user all the time? Even if abbreviated hashes weren't a thing, most users are going to think it's a bit weird if literally all the hashes they see start with the same 4 hex characters (at a minimum -- using blake2b will eat 6, I think). I also doubt many developers would want to store the hashes natively, because it doesn't take many instances before storing the exact same byte sequence with each piece of actual data becomes more than trivial waste.
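For readers wondering where the "1220" comes from, here is a small sketch of the multihash layout for the two hashes discussed in this thread: one byte of function code (0x11 for SHA-1, 0x12 for SHA2-256), one byte of digest length, then the digest. The function name multihash_hex is made up for illustration, and the sketch only handles codes and lengths below 0x80, which fit in a single varint byte; it is not a general multihash encoder.

```python
import hashlib

# Multihash function codes for the two hashes discussed in this thread.
CODES = {"sha1": 0x11, "sha2-256": 0x12}
# Mapping from multihash names to hashlib algorithm names.
HASHLIB_NAMES = {"sha1": "sha1", "sha2-256": "sha256"}


def multihash_hex(name: str, data: bytes) -> str:
    """Return <code><length><digest> as a hex string."""
    digest = hashlib.new(HASHLIB_NAMES[name], data).digest()
    # Code and length are both below 0x80 here, so each is one varint byte.
    return bytes([CODES[name], len(digest)]).hex() + digest.hex()
```

Every SHA2-256 value produced this way starts with "1220" (code 0x12, length 0x20 = 32 bytes) and every SHA-1 value with "1114" (code 0x11, length 0x14 = 20 bytes), which is the constant prefix the comment above objects to when hashes are abbreviated.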
Posted Jun 25, 2020 17:02 UTC (Thu)
by pj (subscriber, #4506)
[Link]
Posted Jun 19, 2020 17:27 UTC (Fri)
by david.a.wheeler (subscriber, #72896)
[Link] (2 responses)
git reset HASH_VALUE
But having a standard prefix is reasonable. I had proposed rotating the first character of the hash value by 16 positions, so that 0 becomes g, 1 becomes h, and so on. Then you can determine from the first character what encoding is used. You can extend that further by additional rotations or encoding more characters.
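A minimal sketch of that rotation, under the assumption (not stated above) that an unrotated first character means SHA-1 and a rotated one means SHA-256. The alphabet and the function names are hypothetical.

```python
# Hex digits followed by the 16 letters they rotate into: 0 -> g, ..., f -> v.
ALPHABET = "0123456789abcdefghijklmnopqrstuv"


def tag_first_char(hex_hash: str) -> str:
    """Rotate the first hex digit by 16 positions to mark the new hash."""
    first = ALPHABET.index(hex_hash[0])
    if first >= 16:
        raise ValueError("first character is already rotated")
    return ALPHABET[first + 16] + hex_hash[1:]


def detect(name: str) -> tuple:
    """Return (assumed algorithm, plain hex) based on the first character."""
    first = ALPHABET.index(name[0])
    if first < 16:
        return "sha1", name
    return "sha256", ALPHABET[first - 16] + name[1:]
```

The encoded name is no longer than the original, and abbreviated hashes keep working, because the marker lives in the first character rather than in an extra prefix.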
Posted Jun 20, 2020 10:51 UTC (Sat)
by cpitrat (subscriber, #116459)
[Link] (1 responses)
Posted Jun 20, 2020 14:22 UTC (Sat)
by gavinbeatty (guest, #139659)
[Link]
Posted Jun 20, 2020 20:36 UTC (Sat)
by josh (subscriber, #17465)
[Link]
[0-9a-f]+ would be SHA-1.
Posted Jun 25, 2020 4:41 UTC (Thu)
by draco (subscriber, #1792)
[Link]
But instead it looks like they'll stick with hex and disambiguate via ^{sha1} and ^{sha256} suffixes.
- machine generated data (of some kind) that changes rarely
- non-text artifacts that change rarely
https://github.com/jmberg/bup/blob/master/Documentation/b...
https://github.com/jmberg/bup/blob/master/Documentation/b...
https://github.com/jmberg/bup/blob/master/lib/bup/repo/en...
IANA
T[0-9a-f]+ would be SHA-256
Pick a new capital letter [G-Z] for each new hash.
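The letter-prefix scheme suggested in this thread (bare [0-9a-f]+ is SHA-1, T followed by hex is SHA-256, other capital letters in G-Z reserved for future hashes) could be recognized along these lines. This is only a sketch; the table and the function name classify are made up for illustration.

```python
import re

# Prefix letter -> algorithm; only the two assignments mentioned in the thread.
PREFIXES = {"": "sha1", "T": "sha256"}
# Optional capital letter in G-Z, then lowercase hex digits.
PATTERN = re.compile(r"([G-Z]?)([0-9a-f]+)")


def classify(name: str) -> tuple:
    """Return (algorithm, hex digits) for a possibly abbreviated object name."""
    match = PATTERN.fullmatch(name)
    if match is None:
        raise ValueError("not a valid object name: " + name)
    prefix, digits = match.groups()
    if prefix not in PREFIXES:
        raise ValueError("prefix letter not assigned to any hash: " + prefix)
    return PREFIXES[prefix], digits
```

Because the letters G-Z never appear in lowercase hex, a single leading character is enough to tell the hashes apart, even for abbreviated names.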