LWN: Comments on "Large files with Git: LFS and git-annex" https://lwn.net/Articles/774125/ This is a special feed containing comments posted to the individual LWN article titled "Large files with Git: LFS and git-annex". en-us Wed, 22 Oct 2025 14:02:43 +0000 Wed, 22 Oct 2025 14:02:43 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Large files with Git: LFS and git-annex https://lwn.net/Articles/902257/ https://lwn.net/Articles/902257/ mnr <div class="FormattedComment"> <a rel="nofollow" href="https://github.com/jedbrown/git-fat">https://github.com/jedbrown/git-fat</a> git-fat is not maintained anymore<br> </div> Sat, 23 Jul 2022 13:51:52 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/794767/ https://lwn.net/Articles/794767/ rweaver6 <div class="FormattedComment"> I came upon this discussion very late, while investigating GVFS/VFSforGit.<br> <p> VFSforGit was not designed to solve a large-file problem. See <a href="https://docs.microsoft.com/en-us/azure/devops/learn/git/gvfs-design-history">https://docs.microsoft.com/en-us/azure/devops/learn/git/g...</a>.<br> <p> It was designed to help adapt Git to Microsoft's internal Windows development repository, which was 3.5M files in a zillion directories and branches, 300GB total. Gigantic repository; file size was not really the issue. Obviously it will contain some large files as well, but that's not what was limiting their ability to move Windows development to Git.<br> <p> Whether the file system virtualization provided by VFSforGit *could* be made to help Git also with large files is an interesting question.<br> <p> </div> Sat, 27 Jul 2019 19:25:27 +0000 splitting the large CVE list in the security tracker https://lwn.net/Articles/774974/ https://lwn.net/Articles/774974/ nix <div class="FormattedComment"> The LZMA compression system already does some of this, with a customizable filter system, though at the moment the only non-conventional-compression filters are filters for a lot of ISAs that can absolutize relative jumps to increase the redundancy of executables. :)<br> </div> Sat, 15 Dec 2018 00:59:39 +0000
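To make nix's point concrete: xz exposes those branch/call/jump (BCJ) filters as ordinary command-line options, so the effect on an executable can be measured directly. A minimal sketch using documented xz options (the file name is made up): <pre>
# Run the x86 BCJ filter before LZMA2 compression; converting relative
# jump targets to absolute ones makes repeated code sequences more
# redundant, which typically improves the compression ratio.
xz --x86 --lzma2=preset=9 program.bin   # produces program.bin.xz
</pre>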
chunked files https://lwn.net/Articles/774885/ https://lwn.net/Articles/774885/ pixelpapst <div class="FormattedComment"> I think the "new pack format" idea is spot on, and something I have been contemplating for a few months now, inspired by casync.<br> <p> The chunking approach and on-disk data structure seem solid; git would probably use a standard casync chunk store, but a git-specific index file.<br> <p> (Just for giggles, I've been meaning to even evaluate how much space would be shared when backing up a casync-ified .git directory (including its chunk store) and the checked-out objects to a different, common casync chunk store.)<br> <p> I cannot wait to see to what new heights git-annex would grow in a world where every ordinary git user already had basic large-file interoperability with it.<br> <p> (Anarcat, thank you for educating people about git-annex and all your documentation work.)<br> </div> Fri, 14 Dec 2018 01:13:12 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774870/ https://lwn.net/Articles/774870/ AndreiG <div class="FormattedComment"> caca labs ?<br> libcaca ...?<br> libpipi ...?<br> wtf did you find these people ?😂<br> </div> Thu, 13 Dec 2018 18:31:39 +0000 Append-only large files https://lwn.net/Articles/774860/ https://lwn.net/Articles/774860/ anarcat I'm not exactly sure, as I haven't reviewed the source code behind git-pack-objects, only the manual page, which says: <blockquote> In a packed archive, an object is either stored as a compressed whole or as a difference from some other object. The latter is often called a delta. [...] <br> <br> --window=&lt;n&gt;, --depth=&lt;n&gt;<br> These two options affect how the objects contained in the pack are stored using delta compression. The objects are first internally sorted by type, size and optionally names and compared against the other objects within --window to see if using delta compression saves space. --depth limits the maximum delta depth; making it too deep affects the performance on the unpacker side, because delta data needs to be applied that many times to get to the necessary object. The default value for --window is 10 and --depth is 50. The maximum depth is 4095. </blockquote> So yes, it can also "optionally" "sort by name", but it's unclear to me how that works or how effective that is. Besides, the window size is quite small as well, although it can be bumped up to make pack take all available memory with that parameter. :) Thu, 13 Dec 2018 16:51:27 +0000
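For anyone wanting to experiment with the knobs anarcat quotes above, the window and depth can be raised for a one-off repack without touching the repository configuration; these are standard git options, and the values below are only examples: <pre>
# Recompute all deltas with a much wider search window.
# -f forces git to discard existing deltas and search again;
# pack.windowMemory caps the memory used per thread.
git -c pack.window=250 -c pack.depth=50 -c pack.windowMemory=1g \
    repack -a -d -f
</pre>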
Append-only large files https://lwn.net/Articles/774858/ https://lwn.net/Articles/774858/ epa <div class="FormattedComment"> Huh, so the delta is entirely blind to whatever filename the content was added under? That's a clean design, but it seems like adding some amount of hinting (so that similar filenames are grouped together for finding deltas) would greatly improve performance, and not just in this case.<br> </div> Thu, 13 Dec 2018 16:41:25 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774853/ https://lwn.net/Articles/774853/ MatyasSelmeci <div class="FormattedComment"> This sounds cool. What does the .gitsvn file look like -- a simple path -&gt; revision mapping? Is there a script that checks out specific files (e.g. via svn export/svn cat or something)? Does that happen automatically via some sort of git hooks?<br> </div> Thu, 13 Dec 2018 16:24:31 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774800/ https://lwn.net/Articles/774800/ Lennie <div class="FormattedComment"> I also noticed that new users who are used to CVS/SVN, etc. need to first unlearn some stuff before 'getting git'.<br> </div> Thu, 13 Dec 2018 11:45:40 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774756/ https://lwn.net/Articles/774756/ derobert <p>Pretty sure <tt>git-annex fsck</tt> does that; at least my runs of it sometimes report a lower than desired number of copies. It also checks that the data is correct (matches the checksum), detecting any bitrot, though <tt>--fast</tt> should disable that part.</p> <p>Note that it only checks one repository (which doesn't have to be the local one, useful especially for special remotes). So you need to have it run for all the repositories you trust to keep copies, in order to detect bitrot, accidental deletion, etc. And it stores the data locally, so you may need <tt>git-annex sync</tt> to make the results known across the git-annex network.</p> Wed, 12 Dec 2018 19:11:32 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774755/ https://lwn.net/Articles/774755/ gebi <div class="FormattedComment"> yes, exactly, but from my reading of the docs it was the only method to check if the replication count of each object was still what was defined, thus it needed to be run regularly without errors (e.g. I wanted to run it once per week, just like zfs scrub).<br> </div> Wed, 12 Dec 2018 19:04:56 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774740/ https://lwn.net/Articles/774740/ derobert <p>That sounds like you were running <tt>git-annex repair</tt>, which starts by unpacking the repository. But you really only ever run that if there is an error, which should be extremely rare since git is pretty stable now. You want <tt>git fsck</tt> (to check the git repository) and <tt>git-annex fsck</tt> (to confirm files match their checksums). Neither should appreciably grow the repository (git-annex fsck may store some metadata about last check time).</p> Wed, 12 Dec 2018 17:30:25 +0000
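Putting derobert's advice into commands, a short sketch using standard git-annex subcommands (the remote name "backup" is hypothetical): <pre>
git annex numcopies 2          # declare how many copies each file needs
git annex fsck --fast          # verify presence/location, skip checksums
git annex fsck --from backup   # check the copies held by the "backup" remote
git annex sync                 # spread the updated location log to peers
</pre>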
Append-only large files https://lwn.net/Articles/774701/ https://lwn.net/Articles/774701/ anarcat <div class="FormattedComment"> the other problem is that the delta algorithm in git works very badly for growing files, because it deduplicates within a certain "window" of "N" blobs (default 10), *sorted by size*. The degenerate case of this is *multiple* growing files of similar size which get grouped together and are absolutely unrelated. alternatively, you might be lucky and have your growing file aligned correctly, but only some of the recent entries will get sorted together; earlier entries will get lost in the mists of time.<br> <p> of course, widening that window would help the security tracker, but it would require a costly repack, and new clones everywhere... and considering how long that tail of commits is, it would probably imply other performance costs...<br> </div> Wed, 12 Dec 2018 13:32:54 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774700/ https://lwn.net/Articles/774700/ pj <div class="FormattedComment"> I wonder if it would be possible to shove large files into a 'remote repository' container and then deal with them kind of as if they're submodules. A unified interface might simplify things.<br> <p> Also, with respect to chunking, there are several other merkle-tree-based projects that might have useful ideas: Perkeep (previously Camlistore) and IPFS among others.<br> </div> Wed, 12 Dec 2018 13:29:38 +0000 splitting the large CVE list in the security tracker https://lwn.net/Articles/774687/ https://lwn.net/Articles/774687/ mjthayer <div class="FormattedComment"> <font class="QuotedText">&gt; Perhaps it would be possible to use some kind of wrapper so that the file could be maintained as a large file, but git would store it as many pieces. If the file has structure, the idea would be to split it before checkin and reassemble it on checkout.</font><br> <p> Taking this further, what about losslessly decompiling certain well-known binary formats? Not sure if it would work for e.g. PDF. Structured documents could be saved as folders containing files. Would the smudge/clean filters Antoine mentioned work for that?<br> <p> On the other hand, I wonder how many binary files could really be versioned sensibly that do not have some accessible source format which could be checked into git instead. I would imagine that e.g. most JPEGs would be successive versions which did not have much in common with each other from a compression point of view. It would just be the question: does one need all versions in the repository or not? And if one does, well, not much to be done.<br> </div> Wed, 12 Dec 2018 09:02:28 +0000
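On mjthayer's question about clean/smudge filters: a clean filter maps exactly one blob to one blob on stdin/stdout, so by itself it cannot split a file into several git objects, but it can do what LFS does -- store the real content elsewhere and commit a small pointer or manifest. A hypothetical wiring (the <tt>cve-split</tt> scripts do not exist; the filter mechanism, config syntax, and the <tt>%f</tt> placeholder are standard git): <pre>
# .gitattributes: route one big file through a (hypothetical) filter
data/CVE/list filter=cvelist

# configuration: "clean" runs on checkin, "smudge" on checkout
git config filter.cvelist.clean  'cve-split clean %f'
git config filter.cvelist.smudge 'cve-split smudge %f'
# "cve-split clean" could write per-year pieces to a side store and
# emit a short manifest; "cve-split smudge" would reassemble them.
</pre>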
Large files with Git: LFS and git-annex https://lwn.net/Articles/774688/ https://lwn.net/Articles/774688/ gebi <div class="FormattedComment"> last time I tried git-annex with encrypted remote storage, every time I checked for consistency the local git repo grew by 700MB and it took _ages_. It went back to a usable size after packing, but it seemed not ideal back in the day.<br> </div> Wed, 12 Dec 2018 08:56:21 +0000 git-annex special remote to store into another git repository possible? https://lwn.net/Articles/774672/ https://lwn.net/Articles/774672/ domo <div class="FormattedComment"> Thanks anarcat for a good article (again!) -- I had forgotten git-annex altogether since the early days when I looked into it.<br> <p> Now I have to look again -- I've written 3 programs to store large files in separate git repositories (the latest just got a working prototype using clean/smudge filters)...<br> <p> ... it looks like git-annex with the bup special remote would be the solution I've been trying to achieve in my projects... and adopting that instead of completing my last one would possibly be the most time- and resource-effective alternative!<br> <p> So, I'll put NIH and the sunk cost fallacy aside and try that next :D<br> <p> </div> Wed, 12 Dec 2018 08:25:47 +0000 Append-only large files https://lwn.net/Articles/774675/ https://lwn.net/Articles/774675/ pabs <div class="FormattedComment"> The Debian CVE list mostly grows from the top as that is where newer issues are placed, although sometimes older issues get updated too.<br> </div> Wed, 12 Dec 2018 08:18:49 +0000 Append-only large files https://lwn.net/Articles/774673/ https://lwn.net/Articles/774673/ epa <div class="FormattedComment"> I was surprised to hear how much git struggles with Debian’s security issues file. It takes forever to resolve deltas. But this file must surely be append-only for most changes. A naive version control system whose only kind of delta was ‘append these bytes’ (storing a whole new copy of the file otherwise) would handle it without problems, though not packed quite as tightly.<br> <p> So maybe git needs a hint that a particular file should be treated as append-only, where it takes a simpler approach to computing deltas to save time, at the expense of some disk space.<br> </div> Wed, 12 Dec 2018 07:45:28 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774666/ https://lwn.net/Articles/774666/ nybble41 <div class="FormattedComment"> <font class="QuotedText">&gt; However it is possible to reduce disk space usage by using "thin mode" which uses hard links between the internal git-annex disk storage and the work tree. The downside is, of course, that changes are immediately performed on files, which means previous file versions are automatically discarded. This can lead to data loss if users are not careful.</font><br> <p> Perhaps this would be a good application for reflinks? Given a suitable filesystem, of course. All the space-saving of hard links (until you start making changes) without the downside of corrupting the original file.<br> </div> Wed, 12 Dec 2018 05:08:49 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774661/ https://lwn.net/Articles/774661/ unixbhaskar <div class="FormattedComment"> Well, my feelings are in line with this statement: "... feels like learning Git: you always feel you are not quite there and you can always learn more. It's a double-edged sword and can feel empowering for some users and terrifyingly hard for others."<br> <p> In spite of using and knowing it over the years, I still fumble; it still intimidates me (lack of bent of mind)... but it is wonderful software that makes life much easier.<br> </div> Wed, 12 Dec 2018 03:13:21 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774656/ https://lwn.net/Articles/774656/ kenshoen <div class="FormattedComment"> It's a shame that jc/split-blob didn't take off...<br> </div> Wed, 12 Dec 2018 00:25:48 +0000 chunked files https://lwn.net/Articles/774655/ https://lwn.net/Articles/774655/ anarcat This didn't make it to the final text, but that's something that could be an interesting lead in fixing the problem in git itself: chunking. Many backup programs (like restic, borg and bup) use a "rolling checksum" system (think rsync, but for storage) to extract the "chunks" that should be stored, instead of limiting the data to be stored on file boundaries. This makes it possible to deduplicate across multiple versions of the same files more efficiently and transparently. <p> Incidentally, git-annex supports bup as a backend. And so when I asked joeyh about implementing chunking support in the git-annex backend (it already supports chunked transfers), that's what he <a href="https://git-annex.branchable.com/design/assistant/deltas/#comment-96aa8732a66bbc9513ab1127d7d33f68">answered</a>, of course. :) <p> That would be the ultimate git killer feature, in my opinion, as it would permanently solve the large file problem. But having worked on the actual implementation of such rolling-checksum backup software, I can tell you it is *much* harder to wrap your head around that data structure than git's more elegant design. <p> Maybe it could be a new pack format? Wed, 12 Dec 2018 00:22:24 +0000
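To give a flavor of what such a rolling-checksum chunker does, here is a minimal content-defined chunking sketch. It is illustrative only: real tools like casync, bup or restic use more careful rolling hashes (buzhash and friends) plus minimum/maximum chunk sizes, and the input file name below is made up. <pre>
import hashlib

def chunks(data: bytes, window: int = 64, mask: int = (1 << 13) - 1):
    """Yield content-defined chunks of data (toy rolling-sum hash).

    Cut points depend on content, not offsets, so an insertion early in
    a file only disturbs the chunk boundaries near the insertion.
    """
    h, start = 0, 0
    for i, byte in enumerate(data):
        h += byte                      # new byte enters the window...
        if i >= window:
            h -= data[i - window]      # ...and the oldest byte leaves it
        # Cut when the windowed sum hits a fixed pattern; a 13-bit mask
        # gives chunks of roughly 8KiB on random data.
        if (h & mask) == mask and i + 1 - start >= window:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]             # final partial chunk

# A chunk store is then just a content-addressed dictionary: identical
# chunks in different versions of a file are stored only once.
with open("big.file", "rb") as f:
    store = {hashlib.sha256(c).hexdigest(): c for c in chunks(f.read())}
</pre>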
GFS (no, not that one) AKA VFS for git https://lwn.net/Articles/774653/ https://lwn.net/Articles/774653/ anarcat <div class="FormattedComment"> You know what, that's true, I totally forgot about GVFS (which we should apparently call "VFS for Git" now). That's probably because, first, it just doesn't seem to run on Linux, from what I can tell. To be more precise, it's still at the "prototype" stage, so certainly not something that seems "enterprise-scale" to me.<br> <p> It could be a promising lead to fix the Debian security team repository size issues, mind you, but then we'd have to figure out how to host the server side of things and I don't know how *that* works either.<br> <p> Frankly, it looks like a Microsoft thing that's not ready for us mortals, unfortunately. At least the LFS folks had the decency to provide us with usable releases and a test server people could build on top of... But maybe it will become a usable alternative.<br> </div> Wed, 12 Dec 2018 00:16:31 +0000 splitting the large CVE list in the security tracker https://lwn.net/Articles/774652/ https://lwn.net/Articles/774652/ JoeBuck <div class="FormattedComment"> Perhaps it would be possible to use some kind of wrapper so that the file could be maintained as a large file, but git would store it as many pieces. If the file has structure, the idea would be to split it before checkin and reassemble it on checkout. Perhaps the technique could be generalized to handle cases where files grow roughly by appending (I say "roughly" because multiple development branches would do appends and then merges would be required), so that older sections of the file remain unchanged.<br> <p> </div> Wed, 12 Dec 2018 00:10:34 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774649/ https://lwn.net/Articles/774649/ ejr <div class="FormattedComment"> The problem **FOR ME** with git-annex is platform support. I deal with platforms that have a C compiler, a kinda-sorta-C++ compiler, and that's it. I use git-annex, but coupled with plenty of out-of-tree copying, which is a pain. I've yet to try git-lfs. It doesn't feel like it fits into my uses, which naturally are multi-upstream.<br> <p> LLVM may eventually make this moot, until the next great back-end. Not because of licensing but rather timing. Stupid patent issues, being honest, and horrible things like those.<br> <p> [BTW, is that coffee shop in Bristol still around? Haven't been "downtown" since I moved. At that point in our trip we don't want to stop.]<br> </div> Tue, 11 Dec 2018 23:44:11 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774647/ https://lwn.net/Articles/774647/ ralt <div class="FormattedComment"> There are only two hard problems...<br> </div> Tue, 11 Dec 2018 23:23:45 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774644/ https://lwn.net/Articles/774644/ mathstuf <div class="FormattedComment"> What's a GNOME library got to do with this? ;)<br> </div> Tue, 11 Dec 2018 23:14:48 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774642/ https://lwn.net/Articles/774642/ ralt <div class="FormattedComment"> Hmm... no mention of GVFS? :-)<br> </div> Tue, 11 Dec 2018 23:03:36 +0000 splitting the large CVE list in the security tracker https://lwn.net/Articles/774635/ https://lwn.net/Articles/774635/ anarcat <blockquote> CVE data in the security tracker is not that large (it's 18M / 300k lines). But there's a lot of history on the same file (52k commits) and that's the issue with git. I think we would have the same issue if the file was small but with the same history. </blockquote> <p> I have actually done the work to split that file, including history, first with a <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908678#52">shallow clone of 1000 commits</a> and then <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908678#72">with the full history</a>. Even when keeping the full history of all those 52k commits, the "split by year" repository takes up a lot less space than the original repository (145MB vs 1.6GB, an order of magnitude smaller). <p> Performance is also significantly improved by an order of magnitude: cloning the repository (locally) takes 2 minutes instead of 21 minutes. And of course, running "git annotate" or "git log" on the individual files is much faster than on the larger file, although that's a bit of an unfair comparison. <p> So splitting the file gets rid of most of the performance issues the repository suffers from, at least according to the results I have been able to produce. The problem is that it involves some <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908678#87">changes in the workflow</a>, from what I understand, particularly at times like this when we are likely to get CVEs from two different years (2018 and 2019, possibly three with 2017), which means working over multiple files. But it seems to me this is something that's easier to deal with than fixing fundamental design issues with git's internal storage. :) Tue, 11 Dec 2018 22:40:07 +0000
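The mechanical part of such a split is simple. Assuming, as in the security tracker's data/CVE/list format, that each entry starts at column zero with its CVE identifier and continuation lines are indented, something like this partitions the current file by year; it sketches only the one-time conversion, since rewriting the 52k commits of history (as done in the bug report above) is the hard part: <pre>
# Route each entry to a per-year file based on the CVE id that opens it;
# indented continuation lines follow their entry into the same file.
awk 'BEGIN { out = "list.preamble" }
     /^CVE-/ { out = "list." substr($1, 5, 4) }
     { print > out }' data/CVE/list
</pre>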
problems symlinks and p2p: might be worth looking into git-annex again https://lwn.net/Articles/774636/ https://lwn.net/Articles/774636/ warrax <div class="FormattedComment"> I think I might try it again. Thanks for the "update", so to speak.<br> </div> Tue, 11 Dec 2018 22:37:08 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774629/ https://lwn.net/Articles/774629/ corsac <div class="FormattedComment"> CVE data in the security tracker is not that large (it's 18M / 300k lines). But there's a lot of history on the same file (52k commits) and that's the issue with git. I think we would have the same issue if the file was small but with the same history.<br> </div> Tue, 11 Dec 2018 21:52:56 +0000 problems symlinks and p2p: might be worth looking into git-annex again https://lwn.net/Articles/774627/ https://lwn.net/Articles/774627/ anarcat I've been thoroughly impressed by the new v6/v7 "unlocked files" mode. I only brushed over it in the article, but it's a radical change in the way git-annex manages files. It makes things *much* easier with regards to interoperability with other software: they can just modify files and then the operator commits the files normally with git. While there are still a few rough edges in the implementation, the idea is there and makes the entire thing actually workable on USB keys and so on. So you may want to reconsider from that aspect. <p> I find the <a href="https://git-annex.branchable.com/tips/peer_to_peer_network_with_tor/">p2p implementation</a> to be a little too complex for my taste, but it's there: it uses magic-wormhole and Tor to connect peers across NAT. And from there you can create whatever topology you want. I would rather have seen a wormhole-only implementation, honestly, but maybe that would have been less of a match for g-a... <p> Anyways, long story short: if you ever looked at git-annex in the past and found it weird, well, it might soon be time to take a look again. It's still weird in some places (it's Haskell after all :p) and it's a complex piece of software, but I generally find that I can do everything I need with it. I am hoping to write a followup article about more in-depth git-annex use cases, specifically about archival and file synchronisation, soon (but probably after the new year)... I just had to get this specific article out first so that I don't get a "but what about LFS" blanket response to that other article. Tue, 11 Dec 2018 21:15:55 +0000
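For the curious, trying the unlocked-files mode anarcat describes looks roughly like this; these are standard git-annex subcommands (v7 was current when this was written, exact behavior depends on the git-annex version, and the file name is made up): <pre>
git annex upgrade            # move an existing repository to v7
git annex adjust --unlocked  # check out a branch where annexed files
                             # are regular files, not symlinks
echo tweak >> big-file.iso   # other software can now modify files...
git annex sync               # ...and this commits the new version
</pre>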
Large files with Git: LFS and git-annex https://lwn.net/Articles/774621/ https://lwn.net/Articles/774621/ warrax <div class="FormattedComment"> Sorry for the absolute mess I made of the spelling in that.<br> <p> *and use it<br> <p> *how to would =&gt; how to work<br> <p> I can only apologize.<br> </div> Tue, 11 Dec 2018 20:49:44 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774619/ https://lwn.net/Articles/774619/ warrax <div class="FormattedComment"> I really *wanted* to like git-annex and use, but the lack of tutorial material (at the time, possibly different now) about how to would around NATs and things of that ilk really hampered me.<br> <p> That and... some software just doesn't want to work sensibly with symlinks, unfortunately :(.<br> <p> In the end I just chose unison for a star-topology sync (which it looks like git-annex effectively requires if you're behind a NAT). Works equally well with large and small files, but obviously not really *versioned* per se.<br> <p> </div> Tue, 11 Dec 2018 20:45:21 +0000
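The NAT situation warrax describes is what the Tor-based p2p mode mentioned earlier in the thread is meant to address. Following the git-annex tip anarcat links to, the setup is roughly as below, run on both peers; the details may vary by git-annex version: <pre>
sudo git annex enable-tor   # publish this repo as a Tor hidden service
git annex p2p --pair        # exchange addresses via a magic-wormhole code
git annex sync --content    # afterwards, syncing works through NATs
</pre>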
Large files with Git: LFS and git-annex https://lwn.net/Articles/774618/ https://lwn.net/Articles/774618/ anarcat <div class="FormattedComment"> I don't think anyone could have imagined that file would grow that big in 2004, so don't be too hard on yourself. (And yes, the irony didn't escape me, I just thought it would be unfair to pin that peculiar one on you...)<br> </div> Tue, 11 Dec 2018 20:36:53 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774617/ https://lwn.net/Articles/774617/ joey <div class="FormattedComment"> Thanks for this unbiased and accurate comparison.<br> <p> (BTW, the full irony is that I'm responsible for the Debian security tracker containing that single large file in the first place.)<br> </div> Tue, 11 Dec 2018 20:32:30 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774613/ https://lwn.net/Articles/774613/ anarcat As usual, the bug reports and feature requests I opened while writing this article: <ul> <li><a href="https://git-annex.branchable.com/todo/LFS_API_support/">git-annex: LFS API support</a> <li><a href="http://git-annex.branchable.com/bugs/why_are_all_those_files_modified/">git-annex: "why are all those files modified"</a>, found while testing v7 mode <li><a href="http://git-annex.branchable.com/todo/clarify_that_v7_applies_to_all_clones/">git-annex: clarify that v7 applies to all clones</a> <li><a href="http://git-annex.branchable.com/bugs/v7_fails_to_fetch_files_on_FAT_filesystem/">git-annex: v7 fails to fetch files on FAT filesystem</a> </ul> For the sake of transparency, I should also mention that I am a long-time git-annex user and even contributor, as my name sits in the <a href="http://git-annex.branchable.com/thanks/">thanks</a> page under the "code and other bits" heading, which means I probably contributed some code to the project. I can't remember now exactly what code I contributed, but I certainly contributed to the documentation. That, in turn, may bias my point of view in favor of git-annex, even though I tried to be as neutral as possible in my review of both projects, both of which I use on a regular basis, as I hinted in the article. Tue, 11 Dec 2018 20:28:04 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774616/ https://lwn.net/Articles/774616/ Cyberax <div class="FormattedComment"> One of my friends uses a franken-repository by putting large files in an SVN repository and storing their versions in a special .gitsvn file. Works surprisingly well.<br> </div> Tue, 11 Dec 2018 20:27:53 +0000
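Cyberax doesn't describe the format, but to answer MatyasSelmeci's question above with a guess: such a .gitsvn file could be as simple as a path-to-revision mapping, materialized by a script or a post-checkout hook. A purely hypothetical sketch -- only <tt>svn export</tt> is a real command here, everything else is invented for illustration: <pre>
# .gitsvn -- hypothetical "path &lt;TAB&gt; revision" mapping, versioned in git:
#   assets/bigmodel.bin	1542
#   video/intro.mov	1538

# fetch-large-files.sh: materialize the pinned revisions from SVN
while IFS=$'\t' read -r path rev; do
    svn export --force -r "$rev" "$SVN_BASE_URL/$path" "$path"
done < .gitsvn
</pre>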