Posted May 14, 2011 6:07 UTC (Sat) by drag (subscriber, #31333)
Parent article: DVCS-autosync
I don't understand why people find it attractive to use source code management systems to build syncing file systems...
The vast majority of data, in terms of volume, is going to be binary data only. And a significant number of popular formats use compression for various reasons.
Are not things like Git really inefficient at handling binary data?
It seems to me that it would be far more efficient to handle things using a block-like storage mechanism.
Posted May 14, 2011 9:44 UTC (Sat) by rmayr (subscriber, #16880)
My personal use case is a tree of most of my work so far, including source code and documents. For these, I explicitly want the history (thinking of papers, theses, project proposals, reports, etc.) and have relied upon tracking the changes in quite a few instances. However, I try to keep large binary blobs outside this tree (that might change with decent git-annex integration, although I don't know (yet) how best to do that).
DVCS-autosync
Posted May 14, 2011 12:50 UTC (Sat) by loevborg (guest, #51779)
I agree completely. DVCS are useful for plain text, which for almost anyone except programmers is only a tiny portion of your data. Using git doesn't gain you anything, because we can't use diff for ODF etc. What we need is pretty much a drop-in replacement for dropbox.
With all due respect, this seems the expression of a sort of narrow-mindedness typical of programmers. People who work with source code all day assume that code is the model of computer work. However, code is actually very particular. What we really need is a place to store our letters, PDF forms, spreadsheets and so on, which are quite different from code in many respects.
The situation is similar in the field of text editors. Here every week or so someone starts a new editor which targets ... programmers. But there still is hardly a single text editor which is any good for writing simple English language prose! Here there is little use for syntax coloring, line numbers, nonproportional fonts. Instead you need configurable line spacing, easy-to-read proportional fonts, paragraph-based (not line-based) navigation, word-wrapping, etc. Still instead of solving this very real problem, people come up with new solutions to already solved problems.
DVCS-autosync
Posted May 14, 2011 13:08 UTC (Sat) by nix (subscriber, #2304)
Actually, you can plug in domain-specific diff/merge tools into git (OpenOffice documents were a named use case for this). Obviously the merge tools for binary stuff need their own way to do conflict resolution as well.
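A minimal sketch of the diff side of this, assuming the textconv mechanism described in 'man gitattributes'. The script name odf2txt.py and the driver name "odf" are made up for illustration; the wiring would be a `*.odt diff=odf` line in .gitattributes plus `git config diff.odf.textconv "python3 odf2txt.py"`. It only pulls paragraph and heading text out of content.xml, so it is a readable diff aid, not a merge tool.

```python
import io
import sys
import zipfile
import xml.etree.ElementTree as ET

# Namespace of text elements in an OpenDocument content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odf_to_text(data):
    """Return the headings and paragraphs of an ODF text document, one per line.

    `data` is the raw bytes of the .odt archive.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    lines = []
    for element in root.iter():
        # Headings (text:h) and paragraphs (text:p) carry the visible text.
        if element.tag in (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"):
            lines.append("".join(element.itertext()))
    return "\n".join(lines) + "\n"


if __name__ == "__main__" and len(sys.argv) == 2:
    # A textconv program receives the path of the file to convert as its argument.
    with open(sys.argv[1], "rb") as handle:
        sys.stdout.write(odf_to_text(handle.read()))
```

With this in place, `git diff` on an .odt file would show changed paragraphs as ordinary text lines, while the stored object is still the original archive.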
Git is exactly as efficient at handling binary files as handling text files: you have to go back to CVS to find something that isn't good at binary files.
DVCS-autosync
Posted May 19, 2011 5:07 UTC (Thu) by smurf (subscriber, #17840)
There are a couple of document formats which are singularly not suited for VCSes, though.
OOo documents, for instance, are compressed XML files. There's no sane way to store multiple versions of these in a git archive. Store them uncompressed (dunno how to teach OOo that) and you get a significant decrease in storage requirements, long-term.
Still needs a domain-specific conflict resolver, of course. You could probably script your way into LibreOffice to do it, though it's nontrivial.
DVCS-autosync
Posted May 19, 2011 5:20 UTC (Thu) by dlang (✭ supporter ✭, #313)
actually, since git does the compression itself, the answer is to have git uncompress the documents and version the uncompressed data. then when you check it out git assembles the version you want, then compresses it as you check it out.
you can even insert XML aware diff engines if you want.
DVCS-autosync
Posted May 26, 2011 19:43 UTC (Thu) by nix (subscriber, #2304)
Yes indeed. 'man gitattributes' and search for 'filter' for something that may help here.
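As a rough sketch of what such a filter could look like: a "clean" filter reads the working-tree file on stdin and writes what git should store on stdout. The one below repacks a zip-based document with every member stored uncompressed, so git's own delta and zlib compression see the raw XML. The script name rezip.py, the `--clean` flag and the filter name "rezip" are assumptions for illustration; the wiring would be `*.odt filter=rezip` in .gitattributes and `git config filter.rezip.clean "python3 rezip.py --clean"`. It also assumes the office suite will open archives whose members are stored rather than deflated; a matching smudge filter could recompress on checkout if not.

```python
import io
import sys
import zipfile


def store_uncompressed(data):
    """Repack a zip archive so that every member is stored without compression."""
    source = zipfile.ZipFile(io.BytesIO(data))
    output = io.BytesIO()
    with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_STORED) as target:
        for info in source.infolist():
            # Keep names, order and timestamps so identical input
            # always produces identical output.
            member = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            member.compress_type = zipfile.ZIP_STORED
            member.external_attr = info.external_attr
            target.writestr(member, source.read(info))
    return output.getvalue()


if __name__ == "__main__" and sys.argv[1:] == ["--clean"]:
    # Clean filter: working-tree content on stdin, content to store on stdout.
    sys.stdout.buffer.write(store_uncompressed(sys.stdin.buffer.read()))
```

Keeping the output deterministic matters here: if the filter produced different bytes for the same document, git would see the file as modified on every status check.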
DVCS-autosync
Posted May 14, 2011 19:40 UTC (Sat) by elanthis (guest, #6227)
You mean... people who _write code_ think like people who _write code_?
How surprising.
DVCS-autosync
Posted May 14, 2011 20:45 UTC (Sat) by dlang (✭ supporter ✭, #313)
it's not that git is especially inefficient at handling binary files, it's that the compression and other efficiencies that git can get with text files don't work on binary files.
but git is not any worse in dealing with binary files than any other solution where you want to be able to retrieve any version that ever existed.
that said, git does have some limitations in terms of max sizes of things
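The compression point is easy to check with zlib, the library git uses for its objects. This is only an illustration: the repeated source line and the random bytes are stand-ins (random data behaves much like already-compressed media such as MP3 or JPEG), and the compression level is an arbitrary choice.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (lower means better compression)."""
    return len(zlib.compress(data, 9)) / len(data)


# Highly repetitive text, standing in for source code or plain-text documents.
text_like = b"int main(void) { return 0; }\n" * 4000
# Random bytes, standing in for data that has already been compressed.
media_like = os.urandom(len(text_like))

print(f"text-like:  {compression_ratio(text_like):.3f}")
print(f"media-like: {compression_ratio(media_like):.3f}")
```

The text-like input shrinks to a small fraction of its size, while the random input does not shrink at all, which is the whole difference between the two cases: nothing is worse for binary files, there is just nothing left to gain.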
DVCS-autosync
Posted May 14, 2011 21:45 UTC (Sat) by drag (subscriber, #31333)
So when I want to share my 800MB movie with a friend or have it available on another machine it's just not going to happen.
From dropbox's website: https://www.dropbox.com/help/5
> Files uploaded to Dropbox via the desktop application have no file size limit.
> Files uploaded through the website (by pressing the upload button) have a 300 MB cap. In other words, each file you upload through the website must be 300 MB or less.
> All files uploaded to your Dropbox must be smaller than the size of your Dropbox account's storage quota. For example, if you have a free 2 GB account, you can upload one 2 GB file or many files that all add up to 2 GB. If you are over your storage quota, Dropbox will stop syncing until you are below your limit.
Dropbox's revision control system is optional and will only save revisions for 30 days. Many people want sync software to sync a significant amount of data. I am guessing that for most people's purposes revision control is much less important than just automatic syncing.
The ability to carelessly drop a file into your drop box and have it automatically available on any machine you happen to want to use is the 'killer feature' for Dropbox. The idea of trying to do something like manage a 4GB mp3 collection using something like Git commit sounds like a nightmare to me.
DVCS-autosync
Posted May 14, 2011 20:47 UTC (Sat) by dlang (✭ supporter ✭, #313)
people are attracted to DVCS systems because they already address a lot of the problems that can come up when you have more than two locations that you are trying to sync.
why reinvent the wheel when you can reuse the work that someone else has done?
DVCS-autosync
Posted May 15, 2011 21:38 UTC (Sun) by dan_a (subscriber, #5325)
Git in particular has one killer feature here - its built in checksumming which ensures that any copies of files are uncorrupted.
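For the curious, the checksum in question is the object ID itself. A small sketch of how git names a file's content under its SHA-1 object format: the hash covers a short header ("blob", the length, a NUL byte) followed by the bytes. The helper names are mine, not git's.

```python
import hashlib


def git_blob_id(content):
    """Object ID git assigns to a file's content (SHA-1 object format)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def is_intact(content, expected_id):
    """True if the content still hashes to the ID it was stored under."""
    return git_blob_id(content) == expected_id


if __name__ == "__main__":
    original = b"hello world\n"
    stored_id = git_blob_id(original)
    print(stored_id)
    print(is_intact(original, stored_id))         # unchanged copy
    print(is_intact(b"hellO world\n", stored_id)) # one flipped character
```

Because every object is looked up by this ID, a copy that has been corrupted in transit or on disk no longer matches its own name, which is what `git fsck` checks for.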
DVCS-autosync
Posted May 15, 2011 23:47 UTC (Sun) by drag (subscriber, #31333)
Yeah I love git for a variety of reasons. I just worry that there is an impedance mismatch between it and the goal of 'Dropbox replacement'.
Git is very carefully optimized to provide a high performance revision control system for text files.
However typical Dropbox usage only deals with a tiny amount of text data and revision control is borderline irrelevant for most people's uses.
I just remember using Git for a variety of purposes and realizing 'Hey, putting that ISO image for the cdrom I made into my repository was a very very stupid thing to do'... yet people are going to want to store ISOs, mp3s, zip files, huge raw-formatted camera images and other things of that nature in anything calling itself 'an open source Dropbox replacement'.
DVCS-autosync
Posted May 16, 2011 10:53 UTC (Mon) by dgm (subscriber, #49227)
"Git is very carefully optimized to provide a high performance revision control system for text files."
You were almost right up until "revision control system for text files". Linus described Git more in the line of an information tracker, or content addressable filesystem, used to implement a DVCS that accidentally bears the same name.
DVCS-autosync
Posted May 16, 2011 11:00 UTC (Mon) by drag (subscriber, #31333)
Well that's confusing then.
So Git is the backend for Git DVCS...
DVCS-autosync
Posted May 16, 2011 20:58 UTC (Mon) by njs (guest, #40338)
So what's the difference? It's an "information tracker or content-addressable filesystem" that is heavily optimized on the assumptions that the information/content it's holding is highly compressible (by deltas and by gzip), which makes it okay to have two copies of everything (one in the repo and one in the filesystem), and also (last I checked) that it can hold any individual file in RAM. Those things are totally reasonable assumptions for source code, and terrible assumptions for a directory full of Blu-Ray rips. (And those are just the two issues that occur off the top of my head.)
Linus' point is that git is designed with good decoupled interfaces between its internal components, not that it's always going to be good at solving problems that it wasn't designed for.
DVCS-autosync
Posted May 16, 2011 21:12 UTC (Mon) by dlang (✭ supporter ✭, #313)
you have a valid point in terms of being able to mmap things (which is not quite the same as being able to fit them in memory). this is a limitation on 32 bit systems
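To illustrate the distinction (this is a generic sketch, not git's code): a mapped file needs address space for the whole mapping, but the kernel pages data in and out as it is touched, so the file does not have to fit in RAM. On a 32-bit process the address space is the limit. The chunk size and function name here are arbitrary.

```python
import hashlib
import mmap
import os
import tempfile

CHUNK = 1 << 20  # feed the hash 1 MiB at a time


def sha1_via_mmap(path):
    """Hash a file through a memory mapping instead of reading it all at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        if os.fstat(handle.fileno()).st_size == 0:
            return digest.hexdigest()  # empty files cannot be mapped
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            for offset in range(0, len(mapping), CHUNK):
                digest.update(mapping[offset:offset + CHUNK])
    return digest.hexdigest()


if __name__ == "__main__":
    # Small demonstration file; a real test of the point would use a huge one.
    with tempfile.NamedTemporaryFile(delete=False) as sample:
        sample.write(b"x" * (3 * CHUNK + 17))
    try:
        print(sha1_via_mmap(sample.name))
    finally:
        os.remove(sample.name)
```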
however, how can you do version control if you don't keep a copy of the file somewhere else? if someone changes it, how can you get back what was there before without another copy?
DVCS-autosync
Posted May 16, 2011 21:34 UTC (Mon) by njs (guest, #40338)
Oh, does it use mmap exclusively for all file access now? Including repacking and everything? Neat.
You need two local copies if you want to do local version control, and also to let people edit files normally on disk (as opposed to, say, interposing a FUSE filesystem to observe edits as they happen). But the systems we're talking about are not trying to do local version control. They're trying to do remote backup and syncing!
*For this use-case*, you might almost be better off with CVS than with git. Its handling of binary files is dumb, but at least it wouldn't double your local storage requirements. Even better, of course, would be a system that stored the second copy on the remote server only, and then used something clever like rsync to upload the deltas.
Or maybe one could do something clever with libgit2 and librsync to let you directly and efficiently commit a local set of files to a remote bare repository...
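A toy version of the "upload only the deltas" idea, to make it concrete. This is deliberately simpler than rsync: it compares blocks at fixed, aligned offsets only, with no rolling checksum, so a single byte inserted at the start of a file would make every block look changed. The block size and all function names are assumptions of the sketch.

```python
import hashlib

BLOCK_SIZE = 4096


def block_signatures(data, block_size=BLOCK_SIZE):
    """Hashes of each aligned block; this is all the remote side needs to send."""
    return [
        hashlib.sha1(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(old_signatures, new_data, block_size=BLOCK_SIZE):
    """Map block index to new content for every block that differs or is new."""
    changed = {}
    for index, offset in enumerate(range(0, len(new_data), block_size)):
        block = new_data[offset:offset + block_size]
        if (index >= len(old_signatures)
                or hashlib.sha1(block).hexdigest() != old_signatures[index]):
            changed[index] = block
    return changed


def apply_delta(old_data, changed, new_length, block_size=BLOCK_SIZE):
    """Rebuild the new file from the old copy plus the uploaded blocks."""
    out = bytearray()
    for index, offset in enumerate(range(0, new_length, block_size)):
        if index in changed:
            out += changed[index]
        else:
            out += old_data[offset:offset + block_size]
    return bytes(out)
```

The local side only needs the signatures of the remote copy, not the copy itself, which is what would avoid doubling the local storage. Handling shifted content efficiently is exactly what rsync's rolling checksum adds on top of this.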
DVCS-autosync
Posted May 16, 2011 21:53 UTC (Mon) by dlang (✭ supporter ✭, #313)
I don't know that it always uses mmap, but it's the use of mmap that imposes the file size limit on individual files.
pack files are limited in that they use a 32 bit offset into them, but that's a matter of optimisation for files that can be diffed and compressed.
yes, for this use case you may be better off with CVS, but only until you have to reconcile differences between different locations. DVCS tools give you the framework (and many of the mechanisms) for doing this as part of their heritage
DVCS-autosync
Posted May 16, 2011 22:11 UTC (Mon) by njs (guest, #40338)
> yes, for this use case you may be better off with CVS, but only until you have to reconcile differences between different locations. DVCS tools give you the framework (and many of the mechanisms) for doing this as part of their heritage
Yes, of course. (Though in practice I'm not sure git's current merge mechanisms are well-optimized for the collection-of-large-binary-files case either.)
But that doesn't change the point, which is that git is not a perfect match for this problem, and a better tool that was similar to git in some ways but not in others could potentially do substantially better.
DVCS-autosync
Posted May 16, 2011 22:16 UTC (Mon) by dlang (✭ supporter ✭, #313)
I agree that git does not do anything that useful for large binary files, and if that is all you are wanting to sync, then you are better off looking at modifying git rather than using it directly.
but this depends in large part on what these binary files are. git supports configurable diff/merge engines, so if there is any sane way to merge your 'binary' files, git will allow you to use it.
please don't get me wrong, I'm not saying that git is perfect, just that it does a better job than anything else for the general case and brings a lot to the party. This makes basing a tool on git (or one of the other DVCS systems if you dislike git for some reason) a very reasonable thing to do rather than just developing your app from scratch.
DVCS-autosync
Posted May 17, 2011 14:45 UTC (Tue) by nye (guest, #51576)