The vast majority of data, in terms of volume, is going to be binary data only. And a significant number of popular formats use compression for various reasons.
Are not things like Git really inefficient at handling binary data?
It seems to me that it would be far more efficient to handle things using a block-like storage mechanism.
Posted May 14, 2011 9:44 UTC (Sat) by rmayr (subscriber, #16880)
Posted May 14, 2011 12:50 UTC (Sat) by loevborg (guest, #51779)
With all due respect, this seems the expression of a sort of narrow-mindedness typical of programmers. People who work with source code all day assume that code is the model of computer work. However, code is actually very particular. What we really need is a place to store our letters, PDF forms, spreadsheets and so on, which are quite different from code in many respects.
The situation is similar in the field of text editors. Here every week or so someone starts a new editor which targets ... programmers. But there still is hardly a single text editor which is any good for writing simple English language prose! Here there is little use for syntax coloring, line numbers, nonproportional fonts. Instead you need configurable line spacing, easy-to-read proportional fonts, paragraph-based (not line-based) navigation, word-wrapping, etc. Still instead of solving this very real problem, people come up with new solutions to already solved problems.
Posted May 14, 2011 13:08 UTC (Sat) by nix (subscriber, #2304)
Git is exactly as efficient at handling binary files as handling text files: you have to go back to CVS to find something that isn't good at binary files.
Posted May 19, 2011 5:07 UTC (Thu) by smurf (subscriber, #17840)
OOo documents, for instance, are compressed XML files. There's no sane way to store multiple versions of these in a git archive. Store them uncompressed (dunno how to teach OOo that) and you get a significant decrease in storage requirements, long-term.
Still needs a domain-specific conflict resolver, of course. You could probably script your way into LibreOffice to do it, though it's nontrivial.
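To make the "store them uncompressed" idea concrete, here is a minimal sketch, assuming the document is an ordinary ZIP container (as ODF files are). The helper name `repack_stored` is made up for illustration; it rewrites an archive so every member is stored rather than deflated, which gives a delta-compressing store something it can actually diff. It does not teach OOo to save that way, and it says nothing about conflict resolution.

```python
import io
import zipfile


def repack_stored(src_bytes: bytes) -> bytes:
    """Return a copy of a ZIP archive with every member stored uncompressed.

    Hypothetical helper: member names, order and timestamps are preserved,
    only the compression method changes.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_bytes)) as src, \
            zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        # Keep the original member order: ODF expects "mimetype" to come first.
        for info in src.infolist():
            copy = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            dst.writestr(copy, src.read(info), compress_type=zipfile.ZIP_STORED)
    return out.getvalue()
```

Something like this could be run before each commit, for example from a hook or a clean filter, so that the repository only ever sees the uncompressed form.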
Posted May 19, 2011 5:20 UTC (Thu) by dlang (✭ supporter ✭, #313)
you can even insert XML aware diff engines if you want.
Posted May 26, 2011 19:43 UTC (Thu) by nix (subscriber, #2304)
Posted May 14, 2011 19:40 UTC (Sat) by elanthis (guest, #6227)
Posted May 14, 2011 20:45 UTC (Sat) by dlang (✭ supporter ✭, #313)
but git is not any worse in dealing with binary files than any other solution where you want to be able to retrieve any version that ever existed.
that said, git does have some limitations in terms of max sizes of things
Posted May 14, 2011 21:45 UTC (Sat) by drag (subscriber, #31333)
From dropbox's website:
> Files uploaded to Dropbox via the desktop application have no file size limit.
> Files uploaded through the website (by pressing the upload button) have a 300 MB cap. In other words, each file you upload through the website must be 300 MB or less.
> All files uploaded to your Dropbox must be smaller than the size of your Dropbox account's storage quota. For example, if you have a free 2 GB account, you can upload one 2 GB file or many files that all add up to 2 GB. If you are over your storage quota, Dropbox will stop syncing until you are below your limit.
Dropbox's revision control system is optional and will only save revisions for 30 days. Many people want sync software to sync a significant amount of data... I am guessing that for most people's purposes revision control is much less important than just automatic syncing.
The ability to carelessly drop a file into your drop box and have it automatically available on any machine you happen to want to use is the 'killer feature' for Dropbox. The idea of trying to do something like manage a 4GB mp3 collection using something like Git commit sounds like a nightmare to me.
Posted May 14, 2011 20:47 UTC (Sat) by dlang (✭ supporter ✭, #313)
why reinvent the wheel when you can reuse the work that someone else has done?
Posted May 15, 2011 21:38 UTC (Sun) by dan_a (subscriber, #5325)
Posted May 15, 2011 23:47 UTC (Sun) by drag (subscriber, #31333)
Git is very carefully optimized to provide a high performance revision control system for text files.
However typical Dropbox usage only deals with a tiny amount of text data and revision control is borderline irrelevant for most people's uses.
I just remember using Git for a variety of purposes and realizing 'Hey, putting that ISO image for the cdrom I made into my repository was a very very stupid thing to do'... yet people are going to want to store ISOs, mp3s, zip files, huge raw-formatted camera images and other things of that nature in anything calling itself 'An open source Dropbox replacement'.
Posted May 16, 2011 10:53 UTC (Mon) by dgm (subscriber, #49227)
You were almost right up until "revision control system for text files". Linus described Git more in the line of an information tracker, or content addressable filesystem, used to implement a DVCS that accidentally bears the same name.
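For anyone unsure what "content addressable" means here, a rough sketch follows. The `git_blob_id` function uses the blob naming scheme git has traditionally used (SHA-1 over a short header plus the content); the in-memory `ObjectStore` class is purely illustrative and is not how git lays objects out on disk.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Object id git assigns to a blob: SHA-1 of 'blob <size>\\0' + content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class ObjectStore:
    """Toy content-addressed store (hypothetical, in-memory only)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content, so identical content is
        # stored once no matter how many files or versions refer to it.
        oid = git_blob_id(data)
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

    def __len__(self) -> int:
        return len(self._objects)
```

Nothing in this scheme cares whether the bytes are text or binary; that distinction only shows up later, when deltas, diffs and merges are computed.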
Posted May 16, 2011 11:00 UTC (Mon) by drag (subscriber, #31333)
So Git is the backend for Git DVCS...
Posted May 16, 2011 20:58 UTC (Mon) by njs (guest, #40338)
Linus' point is that git is designed with good decoupled interfaces between its internal components, not that it's always going to be good at solving problems that it wasn't designed for.
Posted May 16, 2011 21:12 UTC (Mon) by dlang (✭ supporter ✭, #313)
however, how can you do version control if you don't keep a copy of the file somewhere else? if someone changes it, how can you get back what was there before without another copy?
Posted May 16, 2011 21:34 UTC (Mon) by njs (guest, #40338)
You need two local copies if you want to do local version control, and also to let people edit files normally on disk (as opposed to, say, interposing a FUSE filesystem to observe edits as they happen). But the systems we're talking about are not trying to do local version control. They're trying to do remote backup and syncing!
*For this use-case*, you might almost be better off with CVS than with git. Its handling of binary files is dumb, but at least it wouldn't double your local storage requirements. Even better, of course, would be a system that stored the second copy on the remote server only, and then used something clever like rsync to upload the deltas.
Or maybe one could do something clever with libgit2 and librsync to let you directly and efficiently commit a local set of files to a remote bare repository...
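As a rough illustration of "keep the second copy on the remote and upload only deltas", here is a simplified sketch. It is not the rsync algorithm: real rsync uses rolling checksums so it can match blocks at arbitrary offsets, while this version only compares fixed-offset blocks, so an insertion near the start of a file defeats it. The function names and `BLOCK_SIZE` are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen arbitrarily


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Digests the remote side would publish for its copy of the file."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def make_delta(remote_digests: list, new_data: bytes,
               block_size: int = BLOCK_SIZE) -> tuple:
    """Return (new_length, [(block_index, block_bytes), ...]) to upload."""
    changed = []
    for n, start in enumerate(range(0, len(new_data), block_size)):
        block = new_data[start:start + block_size]
        if (n >= len(remote_digests)
                or hashlib.sha256(block).digest() != remote_digests[n]):
            changed.append((n, block))
    return len(new_data), changed


def apply_delta(old_data: bytes, delta: tuple,
                block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new file on the remote from its old copy plus the delta."""
    new_len, changed = delta
    # Truncate or zero-pad the old copy to the new length, then patch blocks.
    buf = bytearray(old_data[:new_len].ljust(new_len, b"\0"))
    for n, block in changed:
        buf[n * block_size:n * block_size + len(block)] = block
    return bytes(buf)
```

The client only needs the remote's digest list, not a second full local copy, which is the storage saving being argued for above.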
Posted May 16, 2011 21:53 UTC (Mon) by dlang (✭ supporter ✭, #313)
pack files are limited in that they use a 32 bit offset into them, but that's a matter of optimisation for files that can be diffed and compressed.
yes, for this use case you may be better off with CVS, but only until you have to reconcile differences between different locations. DVCS tools give you the framework (and many of the mechanisms) for doing this as part of their heritage
Posted May 16, 2011 22:11 UTC (Mon) by njs (guest, #40338)
Yes, of course. (Though in practice I'm not sure git's current merge mechanisms are well-optimized for the collection-of-large-binary-files case either.)
But that doesn't change the point, which is that git is not a perfect match for this problem, and a better tool that was similar to git in some ways but not in others could potentially do substantially better.
Posted May 16, 2011 22:16 UTC (Mon) by dlang (✭ supporter ✭, #313)
but this depends in large part on what these binary files are. git supports configurable diff/merge engines, so if there is any sane way to merge your 'binary' files, git will allow you to use it.
please don't get me wrong, I'm not saying that git is perfect, just that it does a better job than anything else for the general case and brings a lot to the party. This makes basing a tool on git (or one of the other DVCS systems if you dislike git for some reason) a very reasonable thing to do rather than just developing your app from scratch.
Posted May 17, 2011 14:45 UTC (Tue) by nye (guest, #51576)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds