Posted Jun 7, 2012 8:05 UTC (Thu) by oever (subscriber, #987)
Yes it does. LWN has written about bup. The bup README is a great read. It explains how bup is similar to git in that it uses Merkle trees, but also how it differs: like rsync, it uses a rolling checksum to split up big files. As a result, deduplication works better than with fixed blocks.
Imagine a large random file to which you prepend one byte. To a deduplication algorithm based on fixed blocks, the entire file has changed and no deduplication happens. With a rolling checksum, the first block is different but all subsequent blocks are the same. This method of deduplication is mainly useful for backing up filesystems and databases efficiently, but it also helps when backing up compressed archives such as zip files (though less so, or not at all, for compressed tar files).
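The prepend-one-byte case for fixed blocks can be tried out with a few lines of Python. This is only a sketch under assumptions that are not from the comment above: the 4096-byte block size, the use of SHA-256 to identify blocks, and the helper name fixed_blocks are all made up for illustration.

```python
import hashlib
import random

def fixed_blocks(data, size=4096):
    """Cut data at multiples of `size` and return the set of block hashes."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

# A reproducible stand-in for "a large random file" (256 KiB).
n = 256 * 1024
original = random.Random(0).getrandbits(8 * n).to_bytes(n, "big")
shifted = b"\x00" + original          # the same file with one byte prepended

before = fixed_blocks(original)
after = fixed_blocks(shifted)
shared = before & after               # blocks that would not need re-uploading

print(f"{len(shared)} of {len(after)} blocks shared")
```

Every block boundary now falls one byte earlier in the content, so each block of the shifted file hashes differently from every block of the original.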
Using a rolling checksum for doing backups, like bup does, is genius. As far as I can tell, none of ocnam, tarsnap or ddar uses a rolling checksum for deduplication.
Backing up in trees with Obnam 1.0
Posted Jun 7, 2012 13:31 UTC (Thu) by rbrito (subscriber, #66188)
It seems to me that obnam does use something like rolling checksums (or, at least, something close), as is stated in the manpage:
"When you run a backup, obnam uploads data into the backup repository.
The data is divided into chunks, and if a chunk already exists in the
backup repository, it is not uploaded again."
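What the manpage describes can be pictured roughly as follows. This is a hypothetical in-memory sketch, not obnam's code: the ChunkStore class, the choice of SHA-256 as the chunk identifier, and the tiny fixed chunk size are assumptions made purely for illustration. Note that the quoted text does not say how chunk boundaries are chosen.

```python
import hashlib

class ChunkStore:
    """Hypothetical stand-in for a backup repository, kept in memory."""

    def __init__(self):
        self.chunks = {}    # chunk hash -> chunk bytes
        self.uploads = 0    # how many chunks were actually stored

    def put(self, chunk):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in self.chunks:      # only unseen chunks are "uploaded"
            self.chunks[key] = chunk
            self.uploads += 1
        return key

def backup(store, data, chunk_size=4):
    """Store data chunk by chunk and return the list of chunk keys."""
    return [store.put(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def restore(store, keys):
    """Rebuild the original data from its list of chunk keys."""
    return b"".join(store.chunks[k] for k in keys)

store = ChunkStore()
first = backup(store, b"aaaabbbbcccc")     # three new chunks
second = backup(store, b"aaaabbbbdddd")    # only "dddd" is new
print(store.uploads)
```

Each backup is then just a list of keys, and two backups that share content share the stored chunks.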
Regarding obnam and bup, I have tried both this past week; some quick observations:
* obnam can delete previous backups that you don't want anymore, while bup can't, and this is even mentioned in the documentation. This is useful for those who (like me) back up directories containing large files (e.g., videos downloaded from YouTube or ISOs of distributions) that were never meant to be there in the first place.
* obnam doesn't have a way to easily browse the contents of the backup repository, but bup does have (at least) three ways: a FUSE implementation (bup fuse), a web implementation (bup web) and an FTP-like implementation (bup ftp).
* bup stores its backup repository under ~/.bup unless told otherwise. If you skim its manpage quickly, you can easily miss the fact that you should specify the -d option to get it to back up somewhere else. The -f option of "bup index" *only* works for the index file, not for the whole backup.
I decided, for the first reason, to stick with obnam, as I am badly in need of a backup strategy and I hope that a FUSE implementation will soon appear (so that one can, e.g., drag and drop the needed files from, say, nautilus or via samba).
The only thing that I found bad about obnam (besides the lack of navigation cited above) is that it is slow. On a 2nd generation Core i5 notebook, backing up to an external USB HD attained speeds of up to 10MB/s, which I think could be better. Only one core seemed to be used.
By the way, regarding bup, is it safe to run the command "git gc" in the backup repository?
Posted Jun 7, 2012 22:29 UTC (Thu) by oever (subscriber, #987)
First off: I have not tested bup myself on a significant amount of data; so far I'm content with reading parts of the code and the documentation and thinking about scenarios for using it.
Ocman seems to do de-duplication on fixed blocks, not variable blocks as one would get with a rolling checksum. You can configure the block size, but I think the boundary positions are simply multiples of the block size.
When using a rolling checksum, one moves a window over the data and when the checksum value falls in a particular range, the block ends. This means that the blocks have different sizes. The size depends on the content. By choosing the range for the checksum values that trigger a split, one can influence the average block size in the backup.
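The splitting rule described above can be sketched in Python. This is a simplified illustration, not bup's actual rollsum: the 48-byte window, the rsync-style pair of sums, the "low bits all ones" trigger, and the names split and hashes are assumptions chosen for the example.

```python
import hashlib
import random

WINDOW = 48  # assumed window length in bytes; real tools pick their own

def split(data, mask_bits=10, window=WINDOW):
    """Split where the low `mask_bits` bits of a rolling checksum are all ones."""
    mask = (1 << mask_bits) - 1
    a = b = 0          # rsync-style pair: plain sum and position-weighted sum
    start = 0
    chunks = []
    for i, new in enumerate(data):
        old = data[i - window] if i >= window else 0   # byte leaving the window
        a = (a + new - old) & 0xFFFF
        b = (b + a - window * old) & 0xFFFF
        # Only split once the window is full, so a boundary depends purely
        # on the preceding `window` bytes of content.
        if i + 1 >= window and (b & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def hashes(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}

n = 256 * 1024
original = random.Random(0).getrandbits(8 * n).to_bytes(n, "big")
shifted = b"\x01" + original            # one byte prepended

old_chunks = split(original)
new_chunks = split(shifted)
shared = hashes(old_chunks) & hashes(new_chunks)
print(f"{len(shared)} of {len(new_chunks)} chunks already stored")
print("average chunk size:", len(original) // len(old_chunks))
```

Because a boundary depends only on the bytes in the window, the boundaries move along with the content, and at most the chunk containing the prepended byte has to be stored again. The mask_bits parameter plays the role of the "range" mentioned above: with uniformly distributed checksum values, each extra bit roughly doubles the average chunk size.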
Posted Jun 8, 2012 10:27 UTC (Fri) by rbrito (subscriber, #66188)
Please, excuse my ignorance here, but you have consistently used the name ocman.
Posted Jun 8, 2012 12:36 UTC (Fri) by oever (subscriber, #987)
It was an error, I meant obnam, not ocman.
Posted Jun 8, 2012 9:00 UTC (Fri) by juliank (subscriber, #45896)
> Only one core seemed to be used.
It's written in Python, so I would not assume it to use more than one core due to the GIL anyway.
Posted Jun 8, 2012 10:19 UTC (Fri) by rbrito (subscriber, #66188)
I was under the impression that even programs written in Python can use multiple cores/cpus/whatever when calling C extensions (appropriately marked with Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS), but I am a real beginner with respect to Python and I would appreciate any correction.
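One place where this can be observed from pure Python is the standard hashlib module, whose documentation states that the GIL is released while hashing data larger than 2047 bytes. The sketch below is an assumption-laden illustration (the buffer sizes, the thread count and the helper name digest are made up) and says nothing about what obnam itself does. Whether the threaded version is actually faster depends on the interpreter build and the machine.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest(buf):
    # hashlib's C code may release the GIL while it processes a large buffer,
    # which lets other Python threads run on other cores in the meantime.
    return hashlib.sha256(buf).hexdigest()

# Four 8 MiB buffers standing in for chunks of backup data.
buffers = [bytes([n]) * (8 * 1024 * 1024) for n in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, buffers))

sequential = [digest(buf) for buf in buffers]
print(threaded == sequential)
```

The results are identical either way; only the wall-clock time can differ.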
Posted Jun 8, 2012 10:32 UTC (Fri) by juliank (subscriber, #45896)
Yes, but then, this code does not really have many C parts, from what I remember.
Posted Jun 14, 2012 15:23 UTC (Thu) by JanC_ (guest, #34940)
Well, all file I/O and a bunch of other things are in C, but all (or most of) that code probably isn't very CPU-intensive...
It should be possible to move the CPU-intensive parts (all the hashing & encryption parts) to C or Cython code. Alternatively, PyPy is working on removal of the GIL, but that might take years to finish.
But I'm not sure to what extent Obnam currently uses non-sequential code anyway.
Posted Jun 14, 2012 12:59 UTC (Thu) by njs (guest, #40338)
> but also helps backing up compressed archives such as zip files (but not or less for compressed tar files).
If your de-duplicator uses an rsync-compatible rolling checksum, and your tar files are compressed with gzip --rsyncable, then de-duplication should work. (I thought --rsyncable had become the default at some point, but now can't find evidence of this. And sadly bzip2 doesn't seem to have sprouted an --rsyncable option -- maybe the file format requires fixed-size blocks or something.)