
Git-based backup with bup

March 31, 2010

This article was contributed by Joe 'Zonker' Brockmeier.

While Git is aimed at distributed version control for developers, it has also inspired more than a few people to apply Git to backing up all sorts of data. It has been the basis for several backup projects, such as the outdated eigenclass for general backups, as well as more specialized hacks that track the /etc directory (etckeeper) or a user's home directory (git-home-history). Another noteworthy backup application, called bup, has popped up recently.

Short for "backup," bup is a fledgling Git-based, or at least Git-inspired, backup solution written in Python and C. The first 0.01 release of bup was announced by Avery Pennarun on January 4, and development has been moving at a pretty good clip since. It is newly licensed under the LGPLv2, and is gathering an active community of developers.

Getting bup and its dependencies

Bup is available via a GitHub repository, and isn't currently packaged for any of the major distributions. The build instructions on the GitHub project page address building on Debian/Ubuntu, though users on Ubuntu 9.10 should substitute python2.6-dev for the development libraries, and make sure to install the python-fuse package to mount bup backups via FUSE.
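
For those who want to try it, a minimal build session might look something like the following. This is only a sketch based on the Ubuntu package names mentioned above and bup's usual GitHub location; the exact repository URL, package names, and Makefile targets may differ for other distributions or newer bup versions:

    $ sudo apt-get install git python2.6-dev python-fuse par2 pandoc
    $ git clone git://github.com/apenwarr/bup.git
    $ cd bup
    $ make        # builds bup and, with pandoc installed, its man pages
    $ make test   # optional: run the test suite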

Users will also want to install the par2 package, which is used by bup's fsck tool to create and read the Par2 format. Par2 allows bup to verify files and to recover damaged files, so if par2 isn't installed, bup's recovery features are not available. When using the bup fsck command, bup creates Par2 files to allow recovery of damaged blocks in the bup index and pack files. Using Par2, bup can recover up to 5% of damaged files. Users who want to test this can use the bup damage command to randomly destroy blocks and then attempt to recover the file using bup fsck.
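
A rough sketch of how those recovery features fit together is shown below; the -g (generate recovery data) and -r (repair) options, and the ~/.bup/objects/pack location, are taken from bup's man pages of the time and should be treated as assumptions rather than gospel:

    $ bup fsck -g                              # generate Par2 recovery blocks for the pack files
    $ bup damage ~/.bup/objects/pack/*.pack    # deliberately corrupt some blocks, for testing only
    $ bup fsck -r                              # attempt to repair the damage using the Par2 data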

Pandoc is required to generate bup's documentation, so users who would appreciate man pages and HTML documentation should install the pandoc package as well. Note that bup has no "make install" target at the moment, so the bup documentation and commands need to be moved into the appropriate locations manually.

Making backups with bup

It's important to understand what bup does and doesn't do. Bup is a back-end tool meant to handle large files (like VM images) and incremental backups quickly while using as little space as possible. The focus of development is on speed, reducing the space that backups consume, and error recovery, not on being a front-end for performing backups.

This means that bup is not yet well-suited as a standalone solution for creating and managing backups. It also has no GUI, so bup is best suited to users who are comfortable writing their own backup scripts and who have at least a passing familiarity with Git.

Bup is actually a suite of scripts/commands that manage creating backups, indexing files, listing files in a backup, and so on. The data is stored in a Git-formatted repository, but bup writes its own packfiles and indexes; it doesn't use the git command directly, only a few of Git's helper programs. The documentation that comes with bup is actually pretty good for a relatively new project, with a man page for each of the commands. It's a bit short on examples, and a user guide would be nice, but given that the project has only been around since the beginning of the year, it's hard to find fault with the amount of documentation already available.

To create a new backup, a user can either feed a file to bup's split command, or use bup index to create an index of files and then bup save to make the backup. When using split, bup takes its input, breaks it into chunks of about 8K, and saves the resulting chunks in a bup repository.
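
As a sketch of the split approach (the default ~/.bup repository location and the -n option for naming the backup follow bup's documentation; the exact invocations are illustrative, not taken from the article):

    $ bup init                            # create the default repository in ~/.bup
    $ bup split -n vm-image disk.img      # chunk disk.img and store it under the name "vm-image"
    $ tar -cf - /etc | bup split -n etc   # or split a tar stream piped in on stdin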

That's useful, but doesn't actually automate much. Bup index creates or updates a cache of the files and directories in the filesystem, along with their hashes, which bup save uses to track files that have changed since the last backup. Bup save can then work from the index to create a repository, or to update it with just the files that have changed. Bup supports local and remote backups in both directions: it can back up to a local repository, back up the local computer to a remote server, or pull backups from a remote server to the local machine.
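
A sketch of the index/save workflow, including a push to a remote server, might look like this; the -u (update), -n (name), and -r (remote) options are as described in bup's man pages, and server.example.com is just a placeholder:

    $ bup index -u /home/joe                                    # create or update the index for /home/joe
    $ bup save -n joe-home /home/joe                            # save the indexed files under the name "joe-home"
    $ bup save -r server.example.com: -n joe-home /home/joe     # or push the backup to a remote bup repository over ssh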

Bup is relatively speedy and does a pretty good job of compressing files using Git's packfile format. Bup particularly shines on incremental backups, because it uses a "rolling checksum" to compare file chunks and only save the parts that have changed. Files are split and their chunks checked into Git separately, and bup creates an index file that lists the names of those chunks (their SHA-1 hashes) in the order they're created. Chunks that match ones already stored don't need to be re-saved. For more detail on the way bup works, see Pennarun's post about version control of large files, which preceded bup's creation.

Restoring from bup backups

It's easier, at the moment, to create backups with bup than to restore from them. That's not to say it's too challenging to get files back, just that the process for restoring files is not as smooth as creating them in the first place. Bup has a save command that can be used to create a backup set, but it lacks a restore command. So, for the time being, it's best to make backups with bup's split command and retrieve files with its join utility.
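
Retrieving data saved with split is then a matter of joining the chunks back into a stream; continuing the hypothetical names from the earlier sketch:

    $ bup join vm-image > restored.img    # reassemble the backup named "vm-image" into a file
    $ bup join etc | tar -xf -            # or pipe a joined tar stream straight back into tar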

The other problem with trying to use bup save is that it doesn't preserve file metadata like ownership, links, and creation/modification times. The upshot is that files backed up with bup won't have some of the metadata that most users expect to get back when restoring from backups.

While bup's incremental backups take up less space than full backups, they still take up space. At the moment, bup has no way to delete older backups or otherwise manage them, which means that, as backups accumulate, bup becomes less effective at saving space.

Users can browse the backups in a number of ways. Bup provides a fuse command for mounting the backups as a directory, and an ftp command for browsing them much as one would a remote directory via FTP. However, the views do not entirely match up with the actual files: larger files that have been split show up as a top-level directory bearing the name of the original file, with sub-directories beneath it that contain the actual data. Unfortunately, even though it uses Git's format, bup doesn't create a standard Git repository from the backups, so it's not possible to use one of the many GUI tools for Git to browse them.
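
A browsing session with the fuse and ftp commands might look something like this (the mount point is arbitrary, and bup fuse needs the python-fuse package mentioned earlier):

    $ mkdir /tmp/bup
    $ bup fuse /tmp/bup        # mount the backups as a browsable filesystem
    $ ls /tmp/bup
    $ fusermount -u /tmp/bup   # unmount when finished
    $ bup ftp                  # or browse interactively with FTP-style commands (ls, cd, get)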

At the moment, bup is relatively primitive but looks to be maturing and gaining interest fairly quickly. The project already has a handful of contributors in addition to Pennarun, and the mailing list seems fairly active for such a new program.

The project doesn't have a roadmap, per se, but discussions on the mailing list indicate that a bup restore command should be a reality soon, as well as handling file metadata so restored files retain their dates, ownership, and so on. While bup isn't yet a full-featured backup system, if the project maintains its current momentum, it should be quite useful by the end of 2010.





Git-based backup with bup

Posted Apr 1, 2010 6:59 UTC (Thu) by sitaram (guest, #5959) [Link] (2 responses)

rolling checksums are like rsync, so I'm wondering how much space saving this has over rdiff-backup? rdiff-backup does have a "delete older than" option, which I use regularly, and it has become my backup scheme of choice.

Bup's par2-based checking is nice, and I'm a git freak but I think a howto type article that shows all of bup's strengths (with actual commands) would be nice.

Git-based backup with bup

Posted Apr 1, 2010 12:29 UTC (Thu) by Darkmere (subscriber, #53695) [Link] (1 responses)

Rdiff-backup has a _severe_ flaw in that if the latest part of the backup is corrupted, you lose the whole history.

Noticed that the painful way myself.

Git-based backup with bup

Posted Apr 1, 2010 15:57 UTC (Thu) by sitaram (guest, #5959) [Link]

Without par2, bup has similar problems. And if you were tempted by the dedup promise and backed up lots of *machines*, they're potentially all gone.

I saw on their mailing list archive that that is why they started par2 support. Nice. But my question was specific to space only.

Git-based backup with bup

Posted Apr 1, 2010 7:06 UTC (Thu) by sitaram (guest, #5959) [Link] (2 responses)

and you forgot to mention the magic buzz-word in this space today: de-duplication. If I interpret this (from the README) right:
Data is "automagically" shared between incremental backups without having to know which backup is based on which other one - even if the backups are made from two different computers that don't even know about each other.
then that beats rdiff-backup hands down... I think a "show off features with commands" doc is definitely needed!

Git-based backup with bup

Posted Apr 1, 2010 9:28 UTC (Thu) by jond (subscriber, #37669) [Link] (1 responses)

Yes, I currently use rdiff-backup as "best of breed" and I am watching bup with interest.

I have an in-progress FUSE-powered rdiff-backup increments browser (there's another called 'archfs', too, which is further along). I might branch off bup support at some point.

Git-based backup with bup

Posted Apr 8, 2010 4:26 UTC (Thu) by mbiker (guest, #65090) [Link]

There's also brackup, which has many of the same useful properties and also does encryption.
http://code.google.com/p/brackup/

5% of damaged files?

Posted Apr 1, 2010 9:57 UTC (Thu) by johill (subscriber, #25196) [Link] (4 responses)

"Using Par2, bup can recover up to 5% of damaged files."?

Why would I care about recovering 5%? I think you mean something else, but what exactly?

5% of damaged files?

Posted Apr 1, 2010 11:31 UTC (Thu) by jond (subscriber, #37669) [Link] (3 responses)

I interpret it to mean that the par stuff lets you suffer up to 5% corruption by volume of data: so, if you had 100GB, up to 5GB of that could be corrupted and you could recover it.

5% of damaged files?

Posted Apr 1, 2010 12:20 UTC (Thu) by nye (subscriber, #51576) [Link] (1 responses)

In practice this probably means that you can recover from the majority of disk problems - unless you get particularly unlucky you're likely to be able to recover from some damaged sectors, and whole disk failures are generally unlikely (I've *never* seen one that wasn't caused by massive physical trauma).

5% of damaged files?

Posted Apr 1, 2010 18:33 UTC (Thu) by Thalience (subscriber, #4217) [Link]

Well, there is also the popular "disk controller board just stops working" failure mode.

5% of damaged files?

Posted Apr 7, 2010 9:11 UTC (Wed) by buchanmilne (guest, #42315) [Link]

Let's rather rephrase that as:

"You should be able to suffer up to 5% corruption of the backup files, and still be able to restore all your data."

Git-based backup with bup

Posted Apr 1, 2010 18:56 UTC (Thu) by zooko (guest, #2589) [Link]

They should use zfec instead, because it is about a bazillion times faster than par2.

Where are my notes...

Oh:

http://allmydata.org/trac/zfec/browser/zfec/README.txt?re...

Oh, it is only about a hundred times faster (depending).

I'll see if I can subscribe to their mailing list and pitch my tool.

Disclaimer: I think of bup as being a competitor to my own Tahoe-LAFS project, which is useful for backups.

Unison, like dropbox

Posted Apr 10, 2010 12:49 UTC (Sat) by Velmont (guest, #46433) [Link]

OK, this is a bit off-topic, but when I thought about backup and looked at all these tools, what I really wanted was a way to have all my files synced, so that I could change things on PC 1, PC 2, and PC 3 and have it all merged onto all the computers.

So, after many years of testing different backup solutions, I tried Unison in a star topology (all machines sync to a central machine), and it works fantastically. You get versioned backups of the files as well, which was rather nice when my Lucid Lynx computer crashed because of the Intel GPU bug and destroyed my full HomeBank file.

Of course, you could also run something like rdiff-backup on top of your server's tree of files in order to back up to other places. But in these «I've got many computers, many of which are mobile, and need to work on all of them» days, Unison is a real treat.


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds