Versioning really-big files

Posted Apr 4, 2010 20:33 UTC (Sun) by smurf (subscriber, #17840)
In reply to: Subversion considered obsolete by RCL
Parent article: A proposed Subversion vision and roadmap

Hmm. I can think of a simple way to fix the SHA1 problem (hash all (blockno-contents-of-block) tuples separately and XOR the result, or whatever; needs editor support to be effective).
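A minimal sketch of that per-block idea, in Python. The block size, the function names, and the 8-byte encoding of the block number are all assumptions made for illustration; nothing here is how git or any other VCS actually names its objects. XOR-combining per-block hashes is also known to be weaker against deliberate collisions than hashing the whole stream, so treat it as a way to see the mechanics rather than a drop-in replacement.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, not taken from the comment above


def block_digest(block_no: int, block: bytes) -> int:
    """SHA-1 of the (block number, block contents) tuple, as an integer."""
    h = hashlib.sha1()
    h.update(block_no.to_bytes(8, "big"))  # assumed encoding of the block number
    h.update(block)
    return int.from_bytes(h.digest(), "big")


def file_digest(data: bytes) -> int:
    """XOR of the per-block digests over the whole buffer."""
    result = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        result ^= block_digest(offset // BLOCK_SIZE, block)
    return result


def update_digest(old_digest: int, block_no: int,
                  old_block: bytes, new_block: bytes) -> int:
    """New file digest after one block changed, without rereading the file."""
    # XOR-ing the old block's digest a second time cancels it out.
    return old_digest ^ block_digest(block_no, old_block) ^ block_digest(block_no, new_block)
```

This is where the "editor support" comes in: update_digest() only helps if whatever wrote the file can say which block changed and what it contained before. Otherwise the whole file has to be reread anyway.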

The larger problem, however, is that you want a way to carry multiple versions of slowly-changing multi-GB files in your repo -- without paying the storage price of (a compressed version of) the whole blob, each time you check in a single-byte change. Same for network traffic when sending that change to a remote repository.

This is essentially a solved problem (rsync does it all the time) and just needs integration into the VCS-of-the-day. This problem is quite orthogonal to the question of whether said VCS-of-the-day is distributed or central, or whether it is named git or hg or bzr or whatever.
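For readers who have not looked at how the rsync approach works, here is a rough Python sketch of the idea: checksum fixed-size blocks of the old version, then slide a rolling checksum over the new version and emit "copy block N" wherever a block is recognised. The block size, the operation format, and the function names are assumptions for illustration only; this is not rsync's actual code or wire format, nor any VCS's storage format.

```python
import hashlib

BLOCK = 1024    # assumed block size
MOD = 1 << 16


def weak_sum(chunk: bytes):
    """Cheap checksum (a, b) that can be rolled forward one byte at a time."""
    a = sum(chunk) % MOD
    b = sum((len(chunk) - i) * byte for i, byte in enumerate(chunk)) % MOD
    return a, b


def signatures(old: bytes) -> dict:
    """Map weak checksum -> [(strong hash, block index)] for each full block of old."""
    table = {}
    for index in range(len(old) // BLOCK):
        chunk = old[index * BLOCK:(index + 1) * BLOCK]
        table.setdefault(weak_sum(chunk), []).append((hashlib.sha1(chunk).digest(), index))
    return table


def make_delta(old: bytes, new: bytes) -> list:
    """Return a list of ("copy", block_index) and ("data", literal_bytes) ops."""
    table = signatures(old)
    ops, literal = [], bytearray()
    pos, weak = 0, None
    while pos + BLOCK <= len(new):
        if weak is None:
            weak = weak_sum(new[pos:pos + BLOCK])
        match = None
        if weak in table:
            # Weak checksums collide, so confirm with the strong hash.
            strong = hashlib.sha1(new[pos:pos + BLOCK]).digest()
            for cand_strong, index in table[weak]:
                if cand_strong == strong:
                    match = index
                    break
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += BLOCK
            weak = None
        else:
            out_byte = new[pos]
            literal.append(out_byte)
            if pos + BLOCK < len(new):
                # Roll the window one byte to the right.
                in_byte = new[pos + BLOCK]
                a = (weak[0] - out_byte + in_byte) % MOD
                b = (weak[1] - BLOCK * out_byte + a) % MOD
                weak = (a, b)
            else:
                weak = None
            pos += 1
    literal.extend(new[pos:])  # tail shorter than one block
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def apply_delta(old: bytes, ops: list) -> bytes:
    """Rebuild the new version from the old version plus the ops."""
    out = bytearray()
    for kind, value in ops:
        if kind == "copy":
            out += old[value * BLOCK:(value + 1) * BLOCK]
        else:
            out += value
    return bytes(out)
```

With something like this, a one-byte insertion in a multi-GB file costs roughly one block of literal data plus a list of block references, both on disk and on the wire. The same logic applies whether the VCS is central or distributed.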

Yes, I know that the SVN people seem to have gotten this one mostly-right ("mostly" because their copy of the original file is not compressed). Hopefully, somebody will do the work for git or hg or whatever. It's not exactly rocket science.


Versioning really-big (binary) files

Posted Apr 6, 2010 18:47 UTC (Tue) by vonbrand (subscriber, #4458)

git uses delta compression by default (and has done so for a long time now), so the "huge binary files that change a bit" case shouldn't be a problem. Please check with the latest version.

Versioning really-big (binary) files

Posted Apr 6, 2010 23:12 UTC (Tue) by dlang (✭ supporter ✭, #313)

The real problem is that in many cases when people say 'huge binary file that changes a bit' they really mean 'huge binary file where the meaning changes a little, but the actual file contents change a lot', usually because a compression algorithm is being used.

Even for images and audio, if you checked them in uncompressed, git's delta functionality would work well and diff the files against each other. But if you compress the file (jpeg, mp3, or even png) before checking it in, a small change to the uncompressed data results in a huge change to the compressed data. If the compression is lossless (e.g. png), it would be possible to have git uncompress the file before checking for differences, but with lossy compression you can't do this.
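The effect is easy to see with a few lines of Python. This sketch uses zlib as a stand-in for whatever compression the file format applies, and the sample data is made up; the helper names are hypothetical too. It flips a single bit in the uncompressed data and counts how many bytes differ before and after compression.

```python
import zlib


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any difference in length."""
    common = min(len(a), len(b))
    return sum(x != y for x, y in zip(a[:common], b[:common])) + abs(len(a) - len(b))


def compare(raw_old: bytes, raw_new: bytes):
    """Return (bytes changed uncompressed, bytes changed after zlib compression)."""
    return (
        differing_bytes(raw_old, raw_new),
        differing_bytes(zlib.compress(raw_old), zlib.compress(raw_new)),
    )


# Made-up sample data: repetitive text standing in for an uncompressed asset.
old = b"".join(b"sample %d of the waveform\n" % i for i in range(5000))
new = bytearray(old)
new[100] ^= 0x01  # flip one bit near the start of the file

raw_diff, packed_diff = compare(old, bytes(new))
print("uncompressed bytes changed:", raw_diff)
print("compressed bytes changed:  ", packed_diff)
```

The uncompressed count is exactly one. The compressed count depends on the data and the compressor, but it is typically far larger, because everything downstream of the change tends to get encoded differently. That is the part a byte-level delta cannot see through.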

Versioning really-big (binary) files

Posted Apr 7, 2010 7:53 UTC (Wed) by paulj (subscriber, #341)

The real problem is people thinking such files are suitable for checking into an SCM. Just archive them somewhere.

Versioning really-big (binary) files

Posted Apr 12, 2010 1:14 UTC (Mon) by vonbrand (subscriber, #4458)

Not really. If the contents need version control, they should be handled by a VCS. The size or format of the files could be a technical hurdle, sure; but it shouldn't be an excuse for not solving the problem.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds