Backing up in trees with Obnam 1.0
Lars Wirzenius's new backup tool Obnam was just declared 1.0. There is no shortage of backup options these days, and in some ways Wirzenius's decision to scratch his own itch with the project is par for the course. But the program does offer a different feature set than many of its competitors.
For starters, Obnam makes only "snapshot" backups — that is, every backup looks like a complete snapshot of the system: there are not separate "full" and "incremental" backup options. That obviates the need to separately configure full and incremental backups on different schedules, and it similarly simplifies the restoration process. Any snapshot can be restored, without "walking" a chain of deltas from a full backup starting position. In his 1.0 release announcement, Wirzenius argues that full-plus-incremental backups make sense for tape drives, where sequential access favors adding deltas with incremental changes after an initial full backup, but that hard-disk backups make the incremental delta approach pointless.
But the sneaky part is that under the hood, Obnam's snapshots are all incremental, at least in the sense that each snapshot only records changes since the last. The difference is that they are stored in copy-on-write (COW) b-trees like those Btrfs uses for filesystems. Any snapshot can be reconstructed from the b-tree, and individual snapshots can be removed by deleting their node and re-attaching the sub-trees. To make the COW b-tree approach space-efficient, Obnam uses pervasive automatic data de-duplication: the same chunk of data on disk is re-used, both across multiple files and over multiple snapshot generations. Besides saving space on files that have not changed between snapshots, this means that moving or renaming large files does not result in duplicate copies of the bits. By default, Obnam uses one-megabyte chunks, although this setting is adjustable in Obnam's configuration file.
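To see how chunk-based de-duplication works in miniature, consider a sketch like the following (in Python, Obnam's implementation language). This is only an illustration of the idea; the hash choice, in-memory store, and function names are invented and do not reflect Obnam's actual repository format.

    import hashlib

    CHUNK_SIZE = 1024 * 1024  # mirrors Obnam's one-megabyte default

    store = {}  # chunk digest -> chunk bytes; stands in for the repository

    def backup_file(path):
        """Record a file as a list of chunk digests, storing only new chunks."""
        chunk_ids = []
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                digest = hashlib.sha256(chunk).hexdigest()
                store.setdefault(digest, chunk)  # already-known chunks are skipped
                chunk_ids.append(digest)
        return chunk_ids  # a snapshot stores these references, not the data

Because a renamed or moved file hashes to the same chunks, a later snapshot only records a new list of references to data that is already in the store.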
Obnam sports other features of practical value, such as built-in GnuPG encryption, which Wirzenius cited as a weakness in most rsync-based backup tools. It also works with local disks or over the network, including NFS, SMB, and SFTP. Wirzenius admits that the latter protocol is slow, but notes that SCP (which would be faster) lacks support for tracking information like file removals, which Obnam depends on. In network backup setups, Obnam supports both push (client-initiated) and pull (server-initiated) backup sessions.
Storing and retrieving
Installation requires several of Wirzenius's other code projects, including his B-tree library larch and terminal status-update library ttystatus, plus paramiko, a third-party SSH2 library. Most are packaged for Debian (Wirzenius packages his own projects for Debian), but not all of them are available in downstream derivatives like Ubuntu. He provides an Apt repository for the necessary packages; instructions and a link to the repository's signing key are provided on his Obnam tutorial page.
The tutorial goes into further detail about Obnam's data de-duplication with practical examples. You can create a new backup with
obnam backup ~/projectfoo
and subsequently back up a parent directory with
obnam backup ~
Rather than re-save the files from projectfoo, the new backup will point to the copy already on disk. Each backup created with Obnam is specific to a directory; you can exclude specific subdirectories with the --exclude= flag, but you cannot back up several directories in a single command.
The tutorial also explains that Obnam automatically saves checkpoints every 100MB while creating a new backup. This is valuable because the initial snapshot is effectively the same as a full backup in other tools, and it can be large enough for failures to interrupt it. Checkpoints are not guaranteed to preserve the entire data set as regular snapshots are; they only allow an interrupted backup to resume without starting over from scratch.
Obnam's basic usage is straightforward; the same
obnam backup ~
command that is used to start a new backup in the above example is used verbatim to perform the subsequent snapshots. You store snapshots on a remote repository by appending --repository=URL, specify a filesystem storage location with --output=PATH, and specify a GnuPG encryption key with --encrypt-with=KEYID.
You can restore a directory from a snapshot with
obnam restore --to=/mnt/recovery-volume ~
(which will restore the most recent snapshot of your home directory to /mnt/recovery-volume). You can optionally restore just a file or a subdirectory from the snapshot with
obnam restore ~/importantfiles --to=/mnt/recovery-volume
You can also specify a specific intermediate snapshot by appending a --generation=N flag to the restore command; you can get a list of the available snapshots by running
obnam generations
The obnam verify command checks snapshot data against the files on disk, and obnam fsck checks the internal consistency of the b-tree.
Forgetfulness
The only really confusing part of working with Obnam is the snapshot retention process. You can tell the program to immediately delete older snapshots by running
obnam forget --keep=7d
(which will keep the most recent seven days' worth of snapshots), or some variation. The wrinkle is that the 7d attribute will keep only one backup per day for those seven days, even if you run Obnam hourly. To keep seven days' worth of hourly snapshots, you would need to specify --keep=168h.
You can set a snapshot retention policy in your configuration file that uses these rules in combination. You can retain hourly, daily, weekly, monthly, and yearly snapshots by providing a comma-separated list. For example, 12h,7d,3m will keep the last 12 hourly snapshots, the last seven daily snapshots, and the last three monthly snapshots. The potential for miscounting sets in when the numbers start to overlap (such as the last 48 hourly snapshots and the last two daily snapshots); Wirzenius recommends that you try your retention policy on the command line with the --pretend option to simulate the results before deploying them in the real world.
In an email, Wirzenius elaborated a bit on those tricky multi-factor retention policies. Each retention rule (e.g., hour, day, or month) is examined separately by Obnam, he said, and a snapshot is kept if it matches any of the rules. So a 48h,2d policy would match 48 hourly snapshots, then match two additional daily snapshots, for 50 total.
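To make the arithmetic concrete, here is a minimal Python sketch of that kind of rule matching. It assumes each rule keeps the newest snapshot in each of its most recent time periods, and that later rules only examine generations older than those already kept; that reading reproduces the 50-generation total described above, but it is an illustration, not Obnam's actual code.

    from datetime import datetime

    PERIOD = {"h": "%Y%m%d%H", "d": "%Y%m%d", "w": "%Y%W", "m": "%Y%m", "y": "%Y"}

    def kept(snapshots, policy="48h,2d"):
        """Return the snapshot datetimes a comma-separated policy retains."""
        keep, remaining = [], sorted(snapshots, reverse=True)  # newest first
        for rule in policy.split(","):
            count, unit = int(rule[:-1]), rule[-1]
            matched, seen = [], set()
            for ts in remaining:
                label = ts.strftime(PERIOD[unit])
                if label not in seen:  # newest generation in each period
                    seen.add(label)
                    matched.append(ts)
                if len(matched) == count:
                    break
            keep.extend(matched)
            # later rules only see generations older than those already kept
            remaining = [ts for ts in remaining if not matched or ts < matched[-1]]
        return sorted(keep)

    # A week of hourly snapshots under 48h,2d: 48 hourly generations are
    # kept, plus the last generation of each of the two preceding days.
    snaps = [datetime(2012, 6, d, h) for d in range(1, 8) for h in range(24)]
    print(len(kept(snaps, "48h,2d")))  # 50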
As of the 1.0 release, there are a few areas that need improvement, such as managing multiple clients storing snapshots in one repository; Wirzenius says that further thought is required before implementing a real "server mode." For example, two or more machines can run Obnam and push their backups to the same remote repository, and the backups will be tagged with the hostname of origin. Obnam can also be run from the repository machine to "pull" backups from the two remote sources, but in that case each source needs a client name specified with the --client-name= flag in order for Obnam to keep the metadata separate.
In practice, my interest in backup utilities stems largely from how rarely I make good backups on a regular basis (i.e., paranoia). I may be atypical in that way, but the primary reasons I have abandoned most of the backup utilities I have test driven in the past are the overhead in keeping track of full and incremental backup schedules and the lack of good tools for rotating old backups out without manual intervention. Obnam scores on both of those metrics. If you have a complicated setup with multiple machines, you may find quirks (such as the client name issue or the speed of SFTP) working against you, but Wirzenius is still at work on the code — and he seems quite happy to take bug reports and questions.
Posted Jun 7, 2012 3:30 UTC (Thu)
by grahame (guest, #5823)
[Link] (2 responses)
I note it has an "obnam verify" command; otherwise I'd be scared off by the potential for bugs given the quite high complexity of the system.
So, anyone here given it a go yet?
Posted Jun 7, 2012 16:53 UTC (Thu)
by joey (guest, #328)
[Link] (1 responses)
Quite a lot of care has gone into Obnam's use of gpg too. It doesn't just encrypt data to a single gpg key, which would prevent changing keys later without re-encrypting all the data. Instead, it encrypts data using a secret key that is itself encrypted by your gpg key(s), so new keys can be given access. The scheme is explained here: http://liw.fi/obnam/encryption/
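The same key-wrapping pattern in miniature, as a hypothetical Python sketch: the third-party cryptography library's Fernet stands in for gpg keys, and all the names are invented, but the indirection is the one described above.

    from cryptography.fernet import Fernet

    data_key = Fernet.generate_key()       # one key encrypts all repository data
    ciphertext = Fernet(data_key).encrypt(b"chunk contents")

    alice_key = Fernet.generate_key()      # stand-in for Alice's gpg key
    wrapped_for_alice = Fernet(alice_key).encrypt(data_key)

    # Granting Bob access later means re-wrapping only the small data key,
    # never re-encrypting the bulk data:
    bob_key = Fernet.generate_key()
    wrapped_for_bob = Fernet(bob_key).encrypt(data_key)

    # Any authorized key can unwrap the data key and then read the data:
    recovered = Fernet(bob_key).decrypt(wrapped_for_bob)
    assert Fernet(recovered).decrypt(ciphertext) == b"chunk contents"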
I liked that so much I implemented the same scheme in git-annex for its gpg encryption.
Posted Jun 15, 2012 9:02 UTC (Fri)
by Darkstar (guest, #28767)
[Link]
Posted Jun 7, 2012 4:03 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (10 responses)
Posted Jun 7, 2012 8:05 UTC (Thu)
by oever (guest, #987)
[Link] (9 responses)
Yes it does. LWN has written about bup. The bup README document is a great read. It explains that bup is similar to git in that it uses Merkle trees, but also how it is different: like rsync, it splits up big files using a rolling checksum. By doing so, deduplication works better than when using fixed blocks.
Imagine a large random file to which you prepend one byte. To a block-based deduplication algorithm, the entire file has changed and no deduplication happens. With a rolling checksum method, the first block is different but all subsequent blocks are the same. This method of deduplication is mainly useful for backing up filesystems and databases efficiently, but it also helps when backing up compressed archives such as zip files (though less so, or not at all, for compressed tar files).
Using a rolling checksum for doing backups, like bup does, is genius. As far as I can tell, neither obnam, tarsnap, nor ddar uses a rolling checksum for deduplication.
Posted Jun 7, 2012 13:31 UTC (Thu)
by rbrito (guest, #66188)
[Link] (7 responses)
"When you run a backup, obnam uploads data into the backup repository.
Regarding obnam and bup, I have tried both in this past week and some quick observations about them were:
* obnam can delete previous backups that you don't want anymore, while bup can't---and this is even mentioned in the documentation. This is useful for those who (like me) back up some directories that contain large files (e.g., videos downloaded from youtube or ISOs of distributions etc.) that I didn't mean to be there in the first place.
* obnam doesn't have a way to easily browse the contents of the backup repository, but bup does have (at least) three ways: a FUSE implementation (bup fuse), a web implementation (bup web) and an FTP-like implementation (bup ftp).
* bup stores its backup repository under ~/.bup if not told otherwise. If you skim its manpage quickly, you can easily miss the fact that you should specify the -d option to get it to back up somewhere else. The -f option of "bup index" *only* works for the index file, not for the whole backup.
I decided, for the first reason, to stick with obnam, as I am badly in need of a backup strategy and I hope that a FUSE implementation will soon appear (so that one can, e.g., drag and drop the needed files from, say, nautilus or via samba).
The only thing that I found bad about obnam (besides the lack of navigation cited above) is that it is slow. On a 2nd generation Core i5 notebook, backing up to an external USB HD attained speeds of up to 10MB/s, which I think could be better. Only one core seemed to be used.
By the way, regarding bup, is it safe to run the command "git gc" in the backup repository?
Posted Jun 7, 2012 22:29 UTC (Thu)
by oever (guest, #987)
[Link] (2 responses)
Ocman seems to do de-duplication on fixed blocks, not variable blocks as one would get with a rolling checksum. You can configure the block size, but I think the boundary positions are simply multiples of the block size.
When using a rolling checksum, one moves a window over the data, and when the checksum value falls in a particular range, the block ends. This means that the blocks have different sizes; the size depends on the content. By choosing the range of checksum values that triggers a split, one can influence the average block size in the backup.
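A toy content-defined chunker makes the mechanism concrete. This sketch uses an invented additive windowed checksum; real tools such as rsync and bup use stronger rolling hashes, but the boundary-selection idea is the same.

    WINDOW = 64    # bytes in the sliding window
    MASK = 0x1FFF  # a boundary fires roughly once per 8 KiB on average

    def split(data):
        """Cut `data` wherever the window checksum hits the trigger value."""
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            rolling += byte
            if i >= WINDOW:
                rolling -= data[i - WINDOW]  # slide the window forward
            if rolling & MASK == MASK:       # checksum in the trigger range
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

Prepending one byte shifts every offset, but because each boundary depends only on the bytes inside the window, the same cut points reappear once the window has moved past the inserted byte, and all later chunks de-duplicate as before.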
Posted Jun 8, 2012 10:27 UTC (Fri)
by rbrito (guest, #66188)
[Link] (1 responses)
Is ocman a typo for obnam?
I don't find any hits related to backups doing some searches with ocman as a keyword (e.g. https://duckduckgo.com/?q=ocman+backup).
Posted Jun 8, 2012 12:36 UTC (Fri)
by oever (guest, #987)
[Link]
Posted Jun 8, 2012 9:00 UTC (Fri)
by juliank (guest, #45896)
[Link] (3 responses)
It's written in Python, so I would not assume it to use more than one core due to the GIL anyway.
Posted Jun 8, 2012 10:19 UTC (Fri)
by rbrito (guest, #66188)
[Link] (2 responses)
Posted Jun 8, 2012 10:32 UTC (Fri)
by juliank (guest, #45896)
[Link] (1 responses)
Posted Jun 14, 2012 15:23 UTC (Thu)
by JanC_ (guest, #34940)
[Link]
It should be possible to move the CPU-intensive parts (all the hashing & encryption parts) to C or Cython code. Alternatively, PyPy is working on removing the GIL, but that might take years to finish.
But I'm not sure to what extent Obnam currently uses non-sequential code anyway?
Posted Jun 14, 2012 12:59 UTC (Thu)
by njs (subscriber, #40338)
[Link]
If your de-duplicator uses an rsync-compatible rolling checksum, and your tar files are compressed with gzip --rsyncable, then de-duplication should work. (I thought --rsyncable had become the default at some point, but now can't find evidence of this. And sadly bzip2 doesn't seem to have sprouted an --rsyncable option -- maybe the file format requires fixed-size blocks or something.)
Posted Jun 7, 2012 4:44 UTC (Thu)
by keeperofdakeys (guest, #82635)
[Link]
Posted Jun 7, 2012 7:21 UTC (Thu)
by hickinbottoms (subscriber, #14798)
[Link]
I've been using rdiff-backup for a few years but have been stung by its rough edges when running out of space on the backup volume on a couple of occasions (I've had to bin the backup history and start again). This looks to tick more boxes for large backups, especially with the compression, de-duplication and periodic snapshot features. I'm now running it alongside rdiff-backup as a test and to gain some trust in it before considering a complete switchover. I've not come across tarsnap or ddar before, though, so I'm also going to take a look at those.
One minor correction to the article - it says "specify a filesystem storage location with --output=PATH" - the --output switch is for redirecting standard output, I think the correct switch is --repository.
Posted Jun 7, 2012 7:36 UTC (Thu)
by rvfh (guest, #31018)
[Link]
Obnam could use FUSE for restoration:
hbackupfs -C <client> [-D <date>] <mountpoint>
Would be nice to implement that in obnam too I think, as it's way faster and much easier for users to find a file.