LWN: Comments on "Large files with Git: LFS and git-annex" https://lwn.net/Articles/774125/ This is a special feed containing comments posted to the individual LWN article titled "Large files with Git: LFS and git-annex". en-us Wed, 22 Oct 2025 14:02:43 +0000 Wed, 22 Oct 2025 14:02:43 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Large files with Git: LFS and git-annex https://lwn.net/Articles/902257/ https://lwn.net/Articles/902257/ mnr <div class="FormattedComment"> <a rel="nofollow" href="https://github.com/jedbrown/git-fat">https://github.com/jedbrown/git-fat</a> git-fat is not maintained anymore<br> </div> Sat, 23 Jul 2022 13:51:52 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/794767/ https://lwn.net/Articles/794767/ rweaver6 <div class="FormattedComment"> I came upon this discussion very late, while investigating GVFS/VFSforGit.<br> <p> VFSforGit was not designed to solve a large-file problem. See <a href="https://docs.microsoft.com/en-us/azure/devops/learn/git/gvfs-design-history">https://docs.microsoft.com/en-us/azure/devops/learn/git/g...</a>.<br> <p> It was designed to help adapt Git to Microsoft's internal Windows development repository, which was 3.5M files in a zillion directories and branches, 300GB total. Gigantic repository; file size was not really the issue. Obviously it will contain some large files as well, but that's not what was limiting their ability to move Windows development to Git.<br> <p> Whether the file system virtualization provided by VFSforGit *could* be made to help Git also with large files is an interesting question.<br> <p> </div> Sat, 27 Jul 2019 19:25:27 +0000 splitting the large CVE list in the security tracker https://lwn.net/Articles/774974/ https://lwn.net/Articles/774974/ nix <div class="FormattedComment"> The LZMA compression system already does some of this, with a customizable filter system, though at the moment the only non-conventional-compression filters are filters for a lot of ISAs that can absolutize relative jumps to increase the redundancy of executables. :)<br> </div> Sat, 15 Dec 2018 00:59:39 +0000
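To make nix's point concrete: xz exposes those branch/call/jump (BCJ) filters as ordinary command-line options, so the effect on an executable can be measured directly. A minimal sketch using documented xz options (the file name is made up): <pre>
# Run the x86 BCJ filter before LZMA2 compression; converting relative
# jump targets to absolute ones makes repeated code sequences more
# redundant, which typically improves the compression ratio.
xz --x86 --lzma2=preset=9 program.bin   # produces program.bin.xz
</pre>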
chunked files https://lwn.net/Articles/774885/ https://lwn.net/Articles/774885/ pixelpapst <div class="FormattedComment"> I think the "new pack format" idea is spot on, and something I have been contemplating for a few months now, inspired by casync.<br> <p> The chunking approach and on-disk data structure seem solid; git would probably use a standard casync chunk store, but a git-specific index file.<br> <p> (Just for giggles, I've been meaning to even evaluate how much space would be shared when backing up a casync-ified .git directory (including its chunk store) and the checked-out objects to a different, common casync chunk store.)<br> <p> I cannot wait to see to what new heights git-annex would grow in a world where every ordinary git user already had basic large-file interoperability with it.<br> <p> (Anarcat, thank you for educating people about git-annex and all your documentation work.)<br> </div> Fri, 14 Dec 2018 01:13:12 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774870/ https://lwn.net/Articles/774870/ AndreiG <div class="FormattedComment"> caca labs ?<br> libcaca ...?<br> libpipi ...?<br> wtf did you find these people ?😂<br> </div> Thu, 13 Dec 2018 18:31:39 +0000 Append-only large files https://lwn.net/Articles/774860/ https://lwn.net/Articles/774860/ anarcat I'm not exactly sure, as I haven't reviewed the source code behind git-pack-objects, only the manual page, which says: <blockquote> In a packed archive, an object is either stored as a compressed whole or as a difference from some other object. The latter is often called a delta. [...] <br> <br> --window=&lt;n&gt;, --depth=&lt;n&gt;<br> These two options affect how the objects contained in the pack are stored using delta compression. The objects are first internally sorted by type, size and optionally names and compared against the other objects within --window to see if using delta compression saves space. --depth limits the maximum delta depth; making it too deep affects the performance on the unpacker side, because delta data needs to be applied that many times to get to the necessary object. The default value for --window is 10 and --depth is 50. The maximum depth is 4095. </blockquote> So yes, it can also "optionally" "sort by name", but it's unclear to me how that works or how effective that is. Besides, the window size is quite small as well, although it can be bumped up to make pack take all available memory with that parameter. :) Thu, 13 Dec 2018 16:51:27 +0000
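For anyone wanting to experiment with the knobs anarcat quotes above, the window and depth can be raised for a one-off repack without touching the repository configuration; these are standard git options, and the values below are only examples: <pre>
# Recompute all deltas with a much wider search window.
# -f forces git to discard existing deltas and search again;
# pack.windowMemory caps the memory used per thread.
git -c pack.window=250 -c pack.depth=50 -c pack.windowMemory=1g \
    repack -a -d -f
</pre>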
Append-only large files https://lwn.net/Articles/774858/ https://lwn.net/Articles/774858/ epa <div class="FormattedComment"> Huh, so the delta is entirely blind to whatever filename the content was added under? That's a clean design, but it seems like adding some amount of hinting (so that similar filenames are grouped together for finding deltas) would greatly improve performance, and not just in this case.<br> </div> Thu, 13 Dec 2018 16:41:25 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774853/ https://lwn.net/Articles/774853/ MatyasSelmeci <div class="FormattedComment"> This sounds cool. What does the .gitsvn file look like -- a simple path -&gt; revision mapping? Is there a script that checks out specific files (e.g. via svn export/svn cat or something)? Does that happen automatically via some sort of git hooks?<br> </div> Thu, 13 Dec 2018 16:24:31 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774800/ https://lwn.net/Articles/774800/ Lennie <div class="FormattedComment"> I also noticed that new users who are used to CVS/SVN, etc. need to first unlearn some stuff before 'getting git'.<br> </div> Thu, 13 Dec 2018 11:45:40 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774756/ https://lwn.net/Articles/774756/ derobert <p>Pretty sure <tt>git-annex fsck</tt> does that; at least my runs of it sometimes report a lower than desired number of copies. It also checks that the data is correct (matches the checksum), detecting any bitrot, though <tt>--fast</tt> should disable that part.</p> <p>Note that it only checks one repository (which doesn't have to be the local one, useful especially for special remotes). So you need to have it run for all the repositories you trust to keep copies, in order to detect bitrot, accidental deletion, etc. And it stores the data locally, so you may need <tt>git-annex sync</tt> to make the results known across the git-annex network.</p> Wed, 12 Dec 2018 19:11:32 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774755/ https://lwn.net/Articles/774755/ gebi <div class="FormattedComment"> yes, exactly, but from my reading of the docs it was the only method to check if the replication count of each object was still what was defined, thus it needed to be run regularly without errors (e.g. I wanted to run it once per week, just like zfs scrub).<br> </div> Wed, 12 Dec 2018 19:04:56 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774740/ https://lwn.net/Articles/774740/ derobert <p>That sounds like you were running <tt>git-annex repair</tt>, which starts by unpacking the repository. But you really only ever run that if there is an error, which should be extremely rare since git is pretty stable now. You want <tt>git fsck</tt> (to check the git repository) and <tt>git-annex fsck</tt> (to confirm files match their checksums). Neither should appreciably grow the repository (git-annex fsck may store some metadata about last check time).</p> Wed, 12 Dec 2018 17:30:25 +0000
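Putting derobert's advice into commands, a short sketch using standard git-annex subcommands (the remote name "backup" is hypothetical): <pre>
git annex numcopies 2          # declare how many copies each file needs
git annex fsck --fast          # verify presence/location, skip checksums
git annex fsck --from backup   # check the copies held by the "backup" remote
git annex sync                 # spread the updated location log to peers
</pre>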
Append-only large files https://lwn.net/Articles/774701/ https://lwn.net/Articles/774701/ anarcat <div class="FormattedComment"> the other problem is that the delta algorithm in git works very badly for growing files, because it deduplicates within a certain "window" of "N" blobs (default 10), *sorted by size*. The degenerate case of this is *multiple* growing files of similar size which get grouped together and are absolutely unrelated. alternatively, you might be lucky and have your growing file aligned correctly, but only some of the recent entries will get sorted together; earlier entries will get lost in the mists of time.<br> <p> of course, widening that window would help the security tracker, but it would require a costly repack, and new clones everywhere... and considering how long that tail of commits is, it would probably imply other performance costs...<br> </div> Wed, 12 Dec 2018 13:32:54 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774700/ https://lwn.net/Articles/774700/ pj <div class="FormattedComment"> I wonder if it would be possible to shove large files into a 'remote repository' container and then deal with them kind of as if they're submodules. A unified interface might simplify things.<br> <p> Also, with respect to chunking, there are several other merkle-tree-based projects that might have useful ideas: Perkeep (previously Camlistore) and IPFS among others.<br> </div> Wed, 12 Dec 2018 13:29:38 +0000 splitting the large CVE list in the security tracker https://lwn.net/Articles/774687/ https://lwn.net/Articles/774687/ mjthayer <div class="FormattedComment"> <font class="QuotedText">&gt; Perhaps it would be possible to use some kind of wrapper so that the file could be maintained as a large file, but git would store it as many pieces. If the file has structure, the idea would be to split it before checkin and reassemble it on checkout.</font><br> <p> Taking this further, what about losslessly decompiling certain well-known binary formats? Not sure if it would work for e.g. PDF. Structured documents could be saved as folders containing files. Would the smudge/clean filters Antoine mentioned work for that?<br> <p> On the other hand, I wonder how many binary files could really be versioned sensibly that do not have some accessible source format which could be checked into git instead. I would imagine that e.g. most JPEGs would be successive versions which did not have much in common with each other from a compression point of view. It would just be the question: does one need all versions in the repository or not? And if one does, well, not much to be done.<br> </div> Wed, 12 Dec 2018 09:02:28 +0000
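On mjthayer's question about clean/smudge filters: a clean filter maps exactly one blob to one blob on stdin/stdout, so by itself it cannot split a file into several git objects, but it can do what LFS does -- store the real content elsewhere and commit a small pointer or manifest. A hypothetical wiring (the <tt>cve-split</tt> scripts do not exist; the filter mechanism, config syntax, and the <tt>%f</tt> placeholder are standard git): <pre>
# .gitattributes: route one big file through a (hypothetical) filter
data/CVE/list filter=cvelist

# configuration: "clean" runs on checkin, "smudge" on checkout
git config filter.cvelist.clean  'cve-split clean %f'
git config filter.cvelist.smudge 'cve-split smudge %f'
# "cve-split clean" could write per-year pieces to a side store and
# emit a short manifest; "cve-split smudge" would reassemble them.
</pre>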
Large files with Git: LFS and git-annex https://lwn.net/Articles/774688/ https://lwn.net/Articles/774688/ gebi <div class="FormattedComment"> last time I tried git-annex with encrypted remote storage, every time I checked for consistency the local git repo grew by 700MB and it took _ages_. It went back to a usable size after packing, but it seemed not ideal back in the day.<br> </div> Wed, 12 Dec 2018 08:56:21 +0000 git-annex special remote to store into another git repository possible? https://lwn.net/Articles/774672/ https://lwn.net/Articles/774672/ domo <div class="FormattedComment"> Thanks anarcat for a good article (again!) -- I had forgotten git-annex altogether since the early days when I looked into it.<br> <p> Now I have to look again -- I've written 3 programs to store large files in separate git repositories (the latest just got a working prototype using clean/smudge filters)...<br> <p> ... it looks like git-annex with the bup special remote would be the solution I've been trying to achieve in my projects... and adopting that instead of completing my last one would possibly be the most time- and resource-effective alternative!<br> <p> So, I'll put NIH and the sunk cost fallacy aside and try that next :D<br> <p> </div> Wed, 12 Dec 2018 08:25:47 +0000 Append-only large files https://lwn.net/Articles/774675/ https://lwn.net/Articles/774675/ pabs <div class="FormattedComment"> The Debian CVE list mostly grows from the top as that is where newer issues are placed, although sometimes older issues get updated too.<br> </div> Wed, 12 Dec 2018 08:18:49 +0000 Append-only large files https://lwn.net/Articles/774673/ https://lwn.net/Articles/774673/ epa <div class="FormattedComment"> I was surprised to hear how much git struggles with Debian’s security issues file. It takes forever to resolve deltas. But this file must surely be append-only for most changes. A naive version control system whose only kind of delta was ‘append these bytes’ (storing a whole new copy of the file otherwise) would handle it without problems, though not packed quite as tightly.<br> <p> So maybe git needs a hint that a particular file should be treated as append-only, where it takes a simpler approach to computing deltas to save time, at the expense of some disk space.<br> </div> Wed, 12 Dec 2018 07:45:28 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774666/ https://lwn.net/Articles/774666/ nybble41 <div class="FormattedComment"> <font class="QuotedText">&gt; However it is possible to reduce disk space usage by using "thin mode" which uses hard links between the internal git-annex disk storage and the work tree. The downside is, of course, that changes are immediately performed on files, which means previous file versions are automatically discarded. This can lead to data loss if users are not careful.</font><br> <p> Perhaps this would be a good application for reflinks? Given a suitable filesystem, of course. All the space-saving of hard links (until you start making changes) without the downside of corrupting the original file.<br> </div> Wed, 12 Dec 2018 05:08:49 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774661/ https://lwn.net/Articles/774661/ unixbhaskar <div class="FormattedComment"> Well, my feelings are in line with this statement: "... feels like learning Git: you always feel you are not quite there and you can always learn more. It's a double-edged sword and can feel empowering for some users and terrifyingly hard for others."<br> <p> In spite of using and knowing it over the years, I still fumble; it still intimidates me (lack of bent of mind)... but it is wonderful software that makes life much easier.<br> </div> Wed, 12 Dec 2018 03:13:21 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774656/ https://lwn.net/Articles/774656/ kenshoen <div class="FormattedComment"> It's a shame that jc/split-blob didn't take off...<br> </div> Wed, 12 Dec 2018 00:25:48 +0000 chunked files https://lwn.net/Articles/774655/ https://lwn.net/Articles/774655/ anarcat This didn't make it to the final text, but that's something that could be an interesting lead in fixing the problem in git itself: chunking. Many backup programs (like restic, borg and bup) use a "rolling checksum" system (think rsync, but for storage) to extract the "chunks" that should be stored, instead of limiting the data to be stored on file boundaries. This makes it possible to deduplicate across multiple versions of the same files more efficiently and transparently. <p> Incidentally, git-annex supports bup as a backend. And so when I asked joeyh about implementing chunking support in the git-annex backend (it already supports chunked transfers), that's what he <a href="https://git-annex.branchable.com/design/assistant/deltas/#comment-96aa8732a66bbc9513ab1127d7d33f68">answered</a>, of course. :) <p> That would be the ultimate git killer feature, in my opinion, as it would permanently solve the large file problem. But having worked on the actual implementation of such rolling-checksum backup software, I can tell you it is *much* harder to wrap your head around that data structure than git's more elegant design. <p> Maybe it could be a new pack format? Wed, 12 Dec 2018 00:22:24 +0000
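To give a flavor of what such a rolling-checksum chunker does, here is a minimal content-defined chunking sketch. It is illustrative only: real tools like casync, bup or restic use more careful rolling hashes (buzhash and friends) plus minimum/maximum chunk sizes, and the input file name below is made up. <pre>
import hashlib

def chunks(data: bytes, window: int = 64, mask: int = (1 << 13) - 1):
    """Yield content-defined chunks of data (toy rolling-sum hash).

    Cut points depend on content, not offsets, so an insertion early in
    a file only disturbs the chunk boundaries near the insertion.
    """
    h, start = 0, 0
    for i, byte in enumerate(data):
        h += byte                      # new byte enters the window...
        if i >= window:
            h -= data[i - window]      # ...and the oldest byte leaves it
        # Cut when the windowed sum hits a fixed pattern; a 13-bit mask
        # gives chunks of roughly 8KiB on random data.
        if (h & mask) == mask and i + 1 - start >= window:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]             # final partial chunk

# A chunk store is then just a content-addressed dictionary: identical
# chunks in different versions of a file are stored only once.
with open("big.file", "rb") as f:
    store = {hashlib.sha256(c).hexdigest(): c for c in chunks(f.read())}
</pre>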
GFS (no, not that one) AKA VFS for git https://lwn.net/Articles/774653/ https://lwn.net/Articles/774653/ anarcat <div class="FormattedComment"> You know what, that's true, I totally forgot about GVFS (which we should apparently call "VFS for Git" now). That's probably because, first, it just doesn't seem to run on Linux, from what I can tell. To be more precise, it's still at the "prototype" stage, so certainly not something that seems "enterprise-scale" to me.<br> <p> It could be a promising lead to fix the Debian security team repository size issues, mind you, but then we'd have to figure out how to host the server side of things and I don't know how *that* works either.<br> <p> Frankly, it looks like a Microsoft thing that's not ready for us mortals, unfortunately. At least the LFS folks had the decency to provide us with usable releases and a test server people could build on top of... But maybe it will become a usable alternative.<br> </div> Wed, 12 Dec 2018 00:16:31 +0000 splitting the large CVE list in the security tracker https://lwn.net/Articles/774652/ https://lwn.net/Articles/774652/ JoeBuck <div class="FormattedComment"> Perhaps it would be possible to use some kind of wrapper so that the file could be maintained as a large file, but git would store it as many pieces. If the file has structure, the idea would be to split it before checkin and reassemble it on checkout. Perhaps the technique could be generalized to handle cases where files grow roughly by appending (I say "roughly" because multiple development branches would do appends and then merges would be required), so that older sections of the file remain unchanged.<br> <p> </div> Wed, 12 Dec 2018 00:10:34 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774649/ https://lwn.net/Articles/774649/ ejr <div class="FormattedComment"> The problem **FOR ME** with git-annex is platform support. I deal with platforms that have a C compiler, a kinda-sorta-C++ compiler, and that's it. I use git-annex, but coupled with plenty of out-of-tree copying, which is a pain. I've yet to try git-lfs. It doesn't feel like it fits into my uses, which naturally are multi-upstream.<br> <p> LLVM may eventually make this moot, until the next great back-end. Not because of licensing but rather timing. Stupid patent issues, being honest, and horrible things like those.<br> <p> [BTW, is that coffee shop in Bristol still around? Haven't been "downtown" since I moved. At that point in our trip we don't want to stop.]<br> </div> Tue, 11 Dec 2018 23:44:11 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774647/ https://lwn.net/Articles/774647/ ralt <div class="FormattedComment"> There are only two hard problems...<br> </div> Tue, 11 Dec 2018 23:23:45 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774644/ https://lwn.net/Articles/774644/ mathstuf <div class="FormattedComment"> What's a GNOME library got to do with this? ;)<br> </div> Tue, 11 Dec 2018 23:14:48 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774642/ https://lwn.net/Articles/774642/ ralt <div class="FormattedComment"> Hmm... no mention of GVFS? :-)<br> </div> Tue, 11 Dec 2018 23:03:36 +0000 splitting the large CVE list in the security tracker https://lwn.net/Articles/774635/ https://lwn.net/Articles/774635/ anarcat <blockquote> CVE data in the security tracker is not that large (it's 18M / 300k lines). But there's a lot of history on the same file (52k commits) and that's the issue with git. I think we would have the same issue if the file was small but with the same history. </blockquote> <p> I have actually done the work to split that file, including history, first with a <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908678#52">shallow clone of 1000 commits</a> and then <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908678#72">with the full history</a>. Even when keeping the full history of all those 52k commits, the "split by year" repository takes up a lot less space than the original repository (145MB vs 1.6GB, an order of magnitude smaller). <p> Performance is also significantly improved by an order of magnitude: cloning the repository (locally) takes 2 minutes instead of 21 minutes. And of course, running "git annotate" or "git log" on the individual files is much faster than on the larger file, although that's a bit of an unfair comparison. <p> So splitting the file gets rid of most of the performance issues the repository suffers from, at least according to the results I have been able to produce. The problem is that it involves some <a href="https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908678#87">changes in the workflow</a>, from what I understand, particularly at times like this when we are likely to get CVEs from two different years (2018 and 2019, possibly three with 2017), which means working over multiple files. But it seems to me this is something that's easier to deal with than fixing fundamental design issues with git's internal storage. :) Tue, 11 Dec 2018 22:40:07 +0000
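The mechanical part of such a split is simple. Assuming, as in the security tracker's data/CVE/list format, that each entry starts at column zero with its CVE identifier and continuation lines are indented, something like this partitions the current file by year; it sketches only the one-time conversion, since rewriting the 52k commits of history (as done in the bug report above) is the hard part: <pre>
# Route each entry to a per-year file based on the CVE id that opens it;
# indented continuation lines follow their entry into the same file.
awk 'BEGIN { out = "list.preamble" }
     /^CVE-/ { out = "list." substr($1, 5, 4) }
     { print > out }' data/CVE/list
</pre>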
problems symlinks and p2p: might be worth looking into git-annex again https://lwn.net/Articles/774636/ https://lwn.net/Articles/774636/ warrax <div class="FormattedComment"> I think I might try it again. Thanks for the "update", so to speak.<br> </div> Tue, 11 Dec 2018 22:37:08 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774629/ https://lwn.net/Articles/774629/ corsac <div class="FormattedComment"> CVE data in the security tracker is not that large (it's 18M / 300k lines). But there's a lot of history on the same file (52k commits) and that's the issue with git. I think we would have the same issue if the file was small but with the same history.<br> </div> Tue, 11 Dec 2018 21:52:56 +0000 problems symlinks and p2p: might be worth looking into git-annex again https://lwn.net/Articles/774627/ https://lwn.net/Articles/774627/ anarcat I've been thoroughly impressed by the new v6/v7 "unlocked files" mode. I only brushed over it in the article, but it's a radical change in the way git-annex manages files. It makes things *much* easier with regards to interoperability with other software: they can just modify files and then the operator commits the files normally with git. While there are still a few rough edges in the implementation, the idea is there and makes the entire thing actually workable on USB keys and so on. So you may want to reconsider from that aspect. <p> I find the <a href="https://git-annex.branchable.com/tips/peer_to_peer_network_with_tor/">p2p implementation</a> to be a little too complex for my taste, but it's there: it uses magic-wormhole and Tor to connect peers across NAT. And from there you can create whatever topology you want. I would rather have seen a wormhole-only implementation, honestly, but maybe that would have been less of a match for g-a... <p> Anyways, long story short: if you ever looked at git-annex in the past and found it weird, well, it might soon be time to take a look again. It's still weird in some places (it's Haskell after all :p) and it's a complex piece of software, but I generally find that I can do everything I need with it. I am hoping to write a followup article about more in-depth git-annex use cases, specifically about archival and file synchronisation, soon (but probably after the new year)... I just had to get this specific article out first so that I don't get a "but what about LFS" blanket response to that other article. Tue, 11 Dec 2018 21:15:55 +0000
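For the curious, trying the unlocked-files mode anarcat describes looks roughly like this; these are standard git-annex subcommands (v7 was current when this was written, exact behavior depends on the git-annex version, and the file name is made up): <pre>
git annex upgrade            # move an existing repository to v7
git annex adjust --unlocked  # check out a branch where annexed files
                             # are regular files, not symlinks
echo tweak >> big-file.iso   # other software can now modify files...
git annex sync               # ...and this commits the new version
</pre>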
Large files with Git: LFS and git-annex https://lwn.net/Articles/774621/ https://lwn.net/Articles/774621/ warrax <div class="FormattedComment"> Sorry for the absolute mess I made of the spelling in that.<br> <p> *and use it<br> <p> *how to would =&gt; how to work<br> <p> I can only apologize.<br> </div> Tue, 11 Dec 2018 20:49:44 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774619/ https://lwn.net/Articles/774619/ warrax <div class="FormattedComment"> I really *wanted* to like git-annex and use, but the lack of tutorial material (at the time, possibly different now) about how to would around NATs and things of that ilk really hampered me.<br> <p> That and... some software just doesn't want to work sensibly with symlinks, unfortunately :(.<br> <p> In the end I just chose unison for a star-topology sync (which it looks like git-annex effectively requires if you're behind a NAT). Works equally well with large and small files, but obviously not really *versioned* per se.<br> <p> </div> Tue, 11 Dec 2018 20:45:21 +0000
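The NAT situation warrax describes is what the Tor-based p2p mode mentioned earlier in the thread is meant to address. Following the git-annex tip anarcat links to, the setup is roughly as below, run on both peers; the details may vary by git-annex version: <pre>
sudo git annex enable-tor   # publish this repo as a Tor hidden service
git annex p2p --pair        # exchange addresses via a magic-wormhole code
git annex sync --content    # afterwards, syncing works through NATs
</pre>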
Large files with Git: LFS and git-annex https://lwn.net/Articles/774618/ https://lwn.net/Articles/774618/ anarcat <div class="FormattedComment"> I don't think anyone could have imagined that file would grow that big in 2004, so don't be too hard on yourself. (And yes, the irony didn't escape me, I just thought it would be unfair to pin that peculiar one on you...)<br> </div> Tue, 11 Dec 2018 20:36:53 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774617/ https://lwn.net/Articles/774617/ joey <div class="FormattedComment"> Thanks for this unbiased and accurate comparison.<br> <p> (BTW, the full irony is that I'm responsible for the Debian security tracker containing that single large file in the first place.)<br> </div> Tue, 11 Dec 2018 20:32:30 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774613/ https://lwn.net/Articles/774613/ anarcat As usual, the bug reports and feature requests I opened while writing this article: <ul> <li><a href="https://git-annex.branchable.com/todo/LFS_API_support/">git-annex: LFS API support</a> <li><a href="http://git-annex.branchable.com/bugs/why_are_all_those_files_modified/">git-annex: "why are all those files modified"</a>, found while testing v7 mode <li><a href="http://git-annex.branchable.com/todo/clarify_that_v7_applies_to_all_clones/">git-annex: clarify that v7 applies to all clones</a> <li><a href="http://git-annex.branchable.com/bugs/v7_fails_to_fetch_files_on_FAT_filesystem/">git-annex: v7 fails to fetch files on FAT filesystem</a> </ul> For the sake of transparency, I should also mention that I am a long-time git-annex user and even contributor, as my name sits in the <a href="http://git-annex.branchable.com/thanks/">thanks</a> page under the "code and other bits" heading, which means I probably contributed some code to the project. I can't remember now exactly what code I contributed, but I certainly contributed to the documentation. That, in turn, may bias my point of view in favor of git-annex, even though I tried to be as neutral as possible in my review of both projects, both of which I use on a regular basis, as I hinted in the article. Tue, 11 Dec 2018 20:28:04 +0000 Large files with Git: LFS and git-annex https://lwn.net/Articles/774616/ https://lwn.net/Articles/774616/ Cyberax <div class="FormattedComment"> One of my friends uses a franken-repository by putting large files in an SVN repository and storing their versions in a special .gitsvn file. Works surprisingly well.<br> </div> Tue, 11 Dec 2018 20:27:53 +0000
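Cyberax doesn't describe the format, but to answer MatyasSelmeci's question above with a guess: such a .gitsvn file could be as simple as a path-to-revision mapping, materialized by a script or a post-checkout hook. A purely hypothetical sketch -- only <tt>svn export</tt> is a real command here, everything else is invented for illustration: <pre>
# .gitsvn -- hypothetical "path &lt;TAB&gt; revision" mapping, versioned in git:
#   assets/bigmodel.bin	1542
#   video/intro.mov	1538

# fetch-large-files.sh: materialize the pinned revisions from SVN
while IFS=$'\t' read -r path rev; do
    svn export --force -r "$rev" "$SVN_BASE_URL/$path" "$path"
done < .gitsvn
</pre>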