Backing up in trees with Obnam 1.0
Posted Jun 7, 2012 8:05 UTC (Thu) by oever (subscriber, #987)
Imagine a large random file to which you prepend one byte. To a deduplication algorithm based on fixed-size blocks, the entire file has changed, so nothing is deduplicated. With a rolling checksum method, the first block is different but all subsequent blocks are the same. This method of deduplication is mainly useful for backing up filesystems and databases efficiently, but it also helps when backing up compressed archives such as zip files (though not, or less so, for compressed tar files).
Using a rolling checksum for doing backups, like bup does, is genius. As far as I can tell, none of obnam, tarsnap, or ddar uses a rolling checksum for deduplication.
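The prepend-one-byte case described above can be illustrated with a minimal sketch. It assumes a hypothetical 4 KiB block size and SHA-256 block hashes; the names `BLOCK_SIZE` and `fixed_block_hashes` are made up for illustration and do not come from any of the tools mentioned.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def fixed_block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block; boundaries are multiples of block_size."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


# Build a 64 KiB pseudo-random "file", then prepend a single byte to it.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"\x00" + original

before = set(fixed_block_hashes(original))
after = set(fixed_block_hashes(shifted))

# Every block boundary has moved by one byte, so no block hash is reused.
shared = before & after
print(f"{len(shared)} of {len(before)} blocks could be deduplicated")
```

Because every block in the shifted file starts one byte earlier in the content, each block hashes differently and a fixed-block store would have to keep the whole file again.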
Posted Jun 7, 2012 13:31 UTC (Thu) by rbrito (subscriber, #66188)
"When you run a backup, obnam uploads data into the backup repository.
The data is divided into chunks, and if a chunk already exists in the
backup repository, it is not uploaded again."
Regarding obnam and bup, I have tried both in this past week and some quick observations about them were:
* obnam can delete previous backups that you don't want anymore, while bup can't, and this is even mentioned in the documentation. This is useful for those who (like me) back up some directories that contain large files (e.g., videos downloaded from youtube or ISOs of distributions) that weren't meant to be there in the first place.
* obnam doesn't have a way to easily browse the contents of the backup repository, but bup does have (at least) three ways: a FUSE implementation (bup fuse), a web implementation (bup web) and an FTP-like implementation (bup ftp).
* bup stores its backup repository under ~/.bup unless told otherwise. If you skim its manpage quickly, you can easily miss the fact that you should specify the -d option to get it to back up somewhere else. The -f option of "bup index" *only* applies to the index file, not to the whole backup.
I decided, for the first reason, to stick with obnam, as I am badly in need of a backup strategy and I hope that a FUSE implementation will soon appear (so that one can, e.g., drag and drop the needed files from, say, nautilus or via samba).
The only thing that I found bad about obnam (besides the lack of navigation cited above) is that it is slow. On a 2nd generation Core i5 notebook, backing up to an external USB HD attained speeds of up to 10MB/s, which I think could be better. Only one core seemed to be used.
By the way, regarding bup, is it safe to run the command "git gc" in the backup repository?
Posted Jun 7, 2012 22:29 UTC (Thu) by oever (subscriber, #987)
Ocman seems to do de-duplication on fixed blocks, not variable blocks as one would get with a rolling checksum. You can configure the block size, but I think the boundary positions are simply multiples of the block size.
When using a rolling checksum, one moves a window over the data, and whenever the checksum value falls in a particular range, the block ends. This means that the blocks have different sizes; the size depends on the content. By choosing the range of checksum values that triggers a split, one can influence the average block size in the backup.
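As a rough sketch of the splitting scheme described above (not bup's actual implementation), the following uses a simple polynomial rolling hash over a sliding window. The constants `WINDOW`, `BASE`, `MOD` and `TARGET`, and the function name `split`, are assumptions made for illustration.

```python
import hashlib
import random

WINDOW = 48            # bytes covered by the rolling window (assumed value)
BASE = 257             # multiplier of the polynomial hash (assumed value)
MOD = (1 << 31) - 1    # prime modulus for the hash (assumed value)
TARGET = 2048          # roughly one split per TARGET bytes on random data
OUT_FACTOR = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window


def split(data):
    """Cut data into chunks whose boundaries depend only on nearby content."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD                  # byte enters the window
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * OUT_FACTOR) % MOD  # oldest byte leaves
        # End the block when the checksum lands in the chosen range.
        if i + 1 >= WINDOW and h % TARGET == TARGET - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"\x00" + original

orig_chunks = split(original)
new_chunks = split(shifted)
known = {hashlib.sha256(c).digest() for c in orig_chunks}
reused = [c for c in new_chunks if hashlib.sha256(c).digest() in known]
print(f"{len(reused)} of {len(new_chunks)} chunks already stored")
```

Since the split decision looks only at the last `WINDOW` bytes, prepending a byte disturbs at most the chunks at the very start of the file, and everything after the first original boundary is cut identically. Raising or lowering `TARGET` changes the average chunk size. A real tool would likely also enforce minimum and maximum chunk sizes, which this sketch omits.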
Posted Jun 8, 2012 10:27 UTC (Fri) by rbrito (subscriber, #66188)
Is ocman a typo for obnam?
Searching with ocman as a keyword doesn't turn up any hits related to backups (e.g. https://duckduckgo.com/?q=ocman+backup).
Posted Jun 8, 2012 12:36 UTC (Fri) by oever (subscriber, #987)
Posted Jun 8, 2012 9:00 UTC (Fri) by juliank (subscriber, #45896)
It's written in Python, so I would not assume it to use more than one core due to the GIL anyway.
Posted Jun 8, 2012 10:19 UTC (Fri) by rbrito (subscriber, #66188)
Posted Jun 8, 2012 10:32 UTC (Fri) by juliank (subscriber, #45896)
Posted Jun 14, 2012 15:23 UTC (Thu) by JanC_ (guest, #34940)
It should be possible to move the CPU-intensive parts (all the hashing & encryption parts) to C or Cython code. Alternatively, PyPy is working on removal of the GIL, but that might take years to finish.
But I'm not sure to what extent Obnam currently uses non-sequential code anyway.
Posted Jun 14, 2012 12:59 UTC (Thu) by njs (guest, #40338)
If your de-duplicator uses an rsync-compatible rolling checksum, and your tar files are compressed with gzip --rsyncable, then de-duplication should work. (I thought --rsyncable had become the default at some point, but now can't find evidence of this. And sadly bzip2 doesn't seem to have sprouted an --rsyncable option -- maybe the file format requires fixed-size blocks or something.)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds