
Updating the Git protocol for SHA-256

Posted Jun 21, 2020 13:46 UTC (Sun) by pabs (subscriber, #43278)
In reply to: Updating the Git protocol for SHA-256 by mathstuf
Parent article: Updating the Git protocol for SHA-256

restic and friends use variable-sized chunks, that seems to be the way to go to me.



Updating the Git protocol for SHA-256

Posted Jun 22, 2020 2:07 UTC (Mon) by cyphar (subscriber, #110703) [Link]

They do use variable-sized chunks (more specifically, content-defined chunking), but those chunking algorithms still require you to specify how large you want your chunks to be on average (or in restic's case, the chunking algorithm also asks what the maximum and minimum chunk sizes are). So you still have to decide on the trade-off between chunks that are too large and chunks that are too small.
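For illustration, here is a toy content-defined chunker in Python. The checksum and the parameter defaults are simplified stand-ins, not restic's actual Rabin-fingerprint scheme, but the shape is the same: a cut point is wherever some low bits of a rolling value are zero, clamped between a minimum and maximum chunk size.

```python
def chunk(data: bytes, mask_bits: int = 13,
          min_size: int = 2048, max_size: int = 65536) -> list:
    """Cut `data` wherever the low `mask_bits` bits of a running
    checksum are zero, subject to minimum and maximum chunk sizes."""
    mask = (1 << mask_bits) - 1
    chunks, start, checksum = [], 0, 0
    for i, byte in enumerate(data):
        # Toy checksum standing in for a real rolling hash (Rabin, Buzhash).
        checksum = (checksum * 31 + byte) & 0xFFFFFFFF
        size = i - start + 1
        if size >= max_size or (size >= min_size and checksum & mask == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
            checksum = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Raising `mask_bits` gives fewer, larger chunks (less metadata, coarser deduplication); lowering it gives the reverse, which is exactly the trade-off described above.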

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 5:14 UTC (Sat) by ras (subscriber, #33059) [Link] (6 responses)

Your comment led me to look up restic, and I was thinking "finally, this is it", then I discovered https://github.com/restic/restic/issues/187. With ransomware a thing, that's a major omission, and sadly there have been two years with no movement. Shame.

But you say it has friends?

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 5:46 UTC (Sat) by pabs (subscriber, #43278) [Link] (5 responses)

borg is the other modern chunking backup system:

https://borgbackup.github.io/borgbackup/

There is also bup, much more closely related to git:

https://github.com/bup/bup

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:19 UTC (Sat) by johill (subscriber, #25196) [Link] (4 responses)

Thinking about ransomware, I think you should be able to configure permissions on restic's use of AWS/S3 to prevent deletion? There's a use case where asymmetric encryption is more interesting: not letting the machine that's doing the backup access its own old data, in case it's compromised, as mentioned on that ticket. Maybe you can even configure the S3 bucket to be read-only of sorts (e.g. by storing in Deep Archive or Glacier and not letting the IAM account do any restores from there), but I don't know how much, or what, restic needs to read out of the repo to make a new backup.
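As a sketch, an IAM policy along these lines would allow writing and reading but simply not grant `s3:DeleteObject`; the bucket name is hypothetical, and the exact set of actions restic needs is an assumption to be checked against its documentation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BackupWriteReadNoDelete",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-backup-bucket",
        "arn:aws:s3:::my-backup-bucket/*"
      ]
    }
  ]
}
```

Note that `s3:PutObject` alone still allows overwriting existing objects, so for true append-only behaviour you would additionally want bucket versioning or S3 Object Lock.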

Borg's encryption design seems to have one issue: as far as I can tell, the content-based chunker has a very small key (they claim 32 bits, but say it goes linearly through the algorithm, so not all of those bits eventually matter), which would seem to allow fingerprinting attacks ("you have this chain of chunk sizes, so you must have this file"). Borg has also been debating S3 storage for years without any movement.

Ultimately I landed with bup (which I had used previously), and have been working on adding to bup both (asymmetric) encryption support and AWS/S3 storage; in the latter case you can effectively make your repo append-only to the machine that's making the backup, i.e. AWS permissions ensure that it cannot actually delete the data. It could delete some metadata tables etc., but that's mostly recoverable (though I haven't written the recovery tools yet), apart from the ref names (which are stored only in DynamoDB for consistency reasons; S3 has almost no consistency guarantees).

It's probably not ready for mainline yet (and we're busy finishing the python 3 port in mainline), but I've actually used it recently to begin storing some of my backups (currently ~850GiB) in S3 Deep Archive.

Configuration references:
https://github.com/jmberg/bup/blob/master/Documentation/b...
https://github.com/jmberg/bup/blob/master/Documentation/b...

Some design documentation is in the code:
https://github.com/jmberg/bup/blob/master/lib/bup/repo/en...

If you use it, there are two other things in my tree that you'd probably want:

1) with a lot of data, the content-based splitting on 13 bits results in far too much metadata (and storage isn't that expensive anymore), so you'd want to increase that. Currently in master that's not configurable, but I changed that: https://github.com/jmberg/bup/blob/master/Documentation/b...

2) if you have lots of large directories (e.g. maildir) then minor changes to those currently consume a significant amount of storage space, since the entire folder (the list of files) is saved again. I have "treesplit" in my code that allows splitting up those trees (again, content-based) to avoid that issue, which for my largest maildir of ~400k files brings down the amount of new data saved from close to 10 MB (after compression) to <<50 kB when a new email is written there. Looks like I didn't document that yet, but I should add it here: https://github.com/jmberg/bup/blob/master/Documentation/b.... The commit describes it a bit now: https://github.com/jmberg/bup/commit/44006daca4786abe31e3...
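The arithmetic behind point (1) can be sketched as follows: with content-defined splitting on N bits, a split fires with probability 2^-N per byte, so the expected chunk size is 2^N bytes (the 13-bit default corresponds to an 8 KiB average chunk, an assumption based on the description above):

```python
def avg_chunk_size(bits: int) -> int:
    # A split occurs when `bits` low bits of the rolling hash are all
    # zero, i.e. with probability 2**-bits per byte, so the expected
    # chunk size is 2**bits bytes.
    return 2 ** bits

def expected_chunks(total_bytes: int, bits: int) -> int:
    return total_bytes // avg_chunk_size(bits)

print(avg_chunk_size(13))                # 8192 (8 KiB)
print(expected_chunks(850 * 2**30, 13))  # 111411200 chunks for ~850 GiB
```

At roughly a hundred million chunks, per-chunk metadata dominates, which is why raising the split-bit count pays off at this scale.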

And yes, I'm working with upstream on this.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:31 UTC (Sat) by pabs (subscriber, #43278) [Link] (1 response)

Is anyone working on adding treesplitting to git itself? Your docs mention that the tree duplication issue occurs with git too.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:42 UTC (Sat) by johill (subscriber, #25196) [Link]

I'm not aware of that. It would probably mean a new object type in git, or such, I haven't really thought about it.

However, it's not nearly as bad in git? You're not storing hundreds of thousands of files in a folder in git, presumably? :-) Not sure how much interest there would be in git on that.

Updating the Git protocol for SHA-256

Posted Jun 27, 2020 7:48 UTC (Sat) by johill (subscriber, #25196) [Link]

I should, however, mention that due to git's object format ("<type> <length>\x00<data>"), other tools can have an architectural advantage in throughput. Because of the header, bup has to run the content-based splitting first, and can only start hashing the object once it knows how long it is. If you don't have the limitation of the git storage format, you can do without such a header and do both hashes in parallel, stopping once you find the split point. I've been thinking about mitigating that with threading, but it's a bit difficult right now in bup's software architecture. (Incidentally, python is not the issue here, since the hash splitting is entirely in C in my tree, so it can run outside the GIL.)
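A minimal sketch of git's object-ID computation shows why the length has to be known before hashing can begin:

```python
import hashlib

def git_object_id(obj_type: str, data: bytes) -> str:
    # The hash covers a header that embeds the payload length, so the
    # full length must be known before the first byte can be hashed.
    header = b"%s %d\x00" % (obj_type.encode(), len(data))
    return hashlib.sha1(header + data).hexdigest()

print(git_object_id("blob", b"hello\n"))
# -> ce013625030ba8dba906f756967f9e9ca394464a (matches `git hash-object`)
```

Since a content-defined split point determines the chunk length, the header (and hence the hash) cannot be started until splitting has finished, which is the serialization described above.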

Updating the Git protocol for SHA-256

Posted Jul 8, 2020 19:22 UTC (Wed) by nix (subscriber, #2304) [Link]

> And yes, I'm working with upstream on this.

By this point, as a mere observer, I would say you *are* one of upstream. You're one of the two people doing most of the bup commits and have been for over a year now. :)


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds