Append-only large files

Posted Dec 12, 2018 13:32 UTC (Wed) by anarcat (subscriber, #66354)
In reply to: Append-only large files by epa
Parent article: Large files with Git: LFS and git-annex

The other problem is that git's delta algorithm works very badly for growing files, because it only looks for deltas within a "window" of N objects (default 10), *sorted by size*. The degenerate case is *multiple* growing files of similar size, which get grouped together even though they are completely unrelated. Alternatively, you might be lucky and have your growing file line up correctly, but then only some of the recent versions will be sorted together; earlier versions get lost in the mists of time.
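
To make the degenerate case concrete, here is a toy model in Python. It is only a sketch of the idea described above, not git's actual code: it sorts blobs purely by size and treats the previous `window` blobs in that order as delta candidates. The blob dictionaries, the file names and the sizes are all made up for illustration.

```python
def window_candidates(blobs, window=10):
    """Toy model: sort blobs by size (largest first) and pair each blob
    with the `window` blobs that precede it in that order."""
    ordered = sorted(blobs, key=lambda b: b["size"], reverse=True)
    pairs = []
    for i, blob in enumerate(ordered):
        for other in ordered[max(0, i - window):i]:
            pairs.append((blob, other))
    return pairs


def same_file_ratio(blobs, window=10):
    """Fraction of candidate pairs that are two versions of the same file."""
    pairs = window_candidates(blobs, window)
    same = sum(1 for a, b in pairs if a["name"] == b["name"])
    return same / len(pairs)


def growing_files(n_files, n_versions, step=100):
    """Hypothetical data: n_files files that all grow by `step` bytes
    per version, so their sizes interleave."""
    return [
        {"name": f"file{f}", "size": 1000 + v * step + f}
        for f in range(n_files)
        for v in range(n_versions)
    ]


single = same_file_ratio(growing_files(1, 50))
many = same_file_ratio(growing_files(20, 50))
print(f"one growing file:    {single:.0%} of candidates are the same file")
print(f"twenty growing files: {many:.0%} of candidates are the same file")
```

In this contrived setup a single growing file finds all of its candidates in its own history, while with twenty files of interleaved sizes the window of 10 is filled entirely with other files. Real repositories are messier, and the real sort also considers type and optionally names, so take the numbers as an illustration only.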

Of course, widening that window would help the security tracker, but it would require a costly repack, and new clones everywhere... and considering how long that tail of commits is, it would probably carry other performance costs as well.
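
For reference, a repack with a wider window could be driven from a small Python wrapper like the one below. This is a minimal sketch: `git repack` does accept `-a`, `-d`, `-f`, `--window` and `--depth`, but the helper names and the value 250 are only illustrative, not a recommendation for the security tracker.

```python
import subprocess


def repack_command(window=250, depth=50):
    """Build a full-repack command line with a wider delta window.

    -a packs everything into a single pack, -d drops the now-redundant
    packs, and -f recomputes deltas instead of reusing existing ones.
    The default window of 250 is an arbitrary example value.
    """
    return [
        "git", "repack", "-a", "-d", "-f",
        f"--window={window}",
        f"--depth={depth}",
    ]


def repack(repo_path, window=250, depth=50):
    """Run the repack inside repo_path; this can be slow and memory hungry."""
    subprocess.run(repack_command(window, depth), cwd=repo_path, check=True)
```

Calling `repack("/path/to/clone")` would rewrite the pack in that one clone. The same limits can also be set persistently through the `pack.window` and `pack.depth` configuration variables.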


Append-only large files

Posted Dec 13, 2018 16:41 UTC (Thu) by epa (subscriber, #39769)

Huh, so the delta is entirely blind to whatever filename the content was added under? That's a clean design, but it seems like adding some amount of hinting (so that similar filenames are grouped together for finding deltas) would greatly improve performance, and not just in this case.

Append-only large files

Posted Dec 13, 2018 16:51 UTC (Thu) by anarcat (subscriber, #66354)

I'm not exactly sure, as I haven't reviewed the source code behind git-pack-objects, only the manual page, which says:

In a packed archive, an object is either stored as a compressed whole or as a difference from some other object. The latter is often called a delta. [...]

--window=<n>, --depth=<n>
These two options affect how the objects contained in the pack are stored using delta compression. The objects are first internally sorted by type, size and optionally names and compared against the other objects within --window to see if using delta compression saves space. --depth limits the maximum delta depth; making it too deep affects the performance on the unpacker side, because delta data needs to be applied that many times to get to the necessary object. The default value for --window is 10 and --depth is 50. The maximum depth is 4095.

So yes, it can also "optionally" sort "by name", but it's unclear to me how that works or how effective it is. Besides, the default window is quite small, although it can be raised with --window, at the cost of letting the pack operation eat all available memory. :)
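
As a rough illustration of why a name hint could matter, the toy Python comparison below sorts the same made-up blobs two ways: by size only, and by name first and size second. This is not how git-pack-objects implements its optional name sort (I have not checked), and the file names, sizes and helper function are hypothetical.

```python
def same_name_ratio(blobs, key, window=10):
    """Sort blobs with `key`, then report which fraction of the
    (blob, earlier-blob-in-window) pairs share a file name."""
    ordered = sorted(blobs, key=key)
    hits = total = 0
    for i, blob in enumerate(ordered):
        for other in ordered[max(0, i - window):i]:
            total += 1
            hits += blob["name"] == other["name"]
    return hits / total


# Twenty hypothetical files, fifty versions each, all growing in lockstep.
blobs = [
    {"name": f"list{f}.txt", "size": 1000 + 100 * v + f}
    for f in range(20)
    for v in range(50)
]

by_size = same_name_ratio(blobs, key=lambda b: -b["size"])
by_name_then_size = same_name_ratio(blobs, key=lambda b: (b["name"], -b["size"]))
print(f"size only:      {by_size:.0%} same-name candidates")
print(f"name, then size: {by_name_then_size:.0%} same-name candidates")
```

With this data, sorting on size alone puts no two versions of the same file in a shared window, while sorting on name first keeps most candidates within one file's history. How close the real heuristic gets to either extreme is exactly the part that is unclear from the manual page.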

