
Btrfs: Subvolumes and snapshots

By Jonathan Corbet
January 6, 2014

LWN's guide to Btrfs
The previous installment in LWN's ongoing series on the Btrfs filesystem covered multiple device handling: various ways of setting up a single filesystem on a set of physical devices. Another interesting aspect of Btrfs can be thought of as working in the opposite manner: subvolumes allow the creation of multiple filesystems on a single device (or array of devices). Subvolumes create a number of interesting possibilities not supported by other Linux filesystems. This article will discuss how to use the subvolume feature and the associated snapshot mechanism.

Subvolume basics

A typical Unix-style filesystem contains a single directory tree with a single root. By default, a Btrfs filesystem is organized in the same way. Subvolumes change that picture by creating alternative roots that function as independent filesystems in their own right. This can be illustrated with a simple example:

    # mkfs.btrfs /dev/sdb5
    # mount /dev/sdb5 /mnt/1
    # cd /mnt/1
    # touch a

Thus far, we have a mundane Btrfs filesystem with a single empty file (called "a") on it. To create a subvolume and create a file within it, one can type:

    # btrfs subvolume create subv
    # touch subv/b
    # tree
    .
    ├── a
    └── subv
        └── b

    1 directory, 2 files

The subvolume has been created with the name subv; thus far, the operation looks nearly indistinguishable from having simply created a directory by that name. But there are some differences that pop up if one looks for them. For example:

    # ln a subv/
    ln: failed to create hard link ‘subv/a’ => ‘a’: Invalid cross-device link

So, even though subv looks like an ordinary subdirectory, the filesystem treats it as if it were on a separate physical device; moving into subv is like crossing an ordinary Unix mount point, even though it's still housed within the original Btrfs filesystem. The subvolume can also be mounted independently:

    # btrfs subvolume list /mnt/1
    ID 257 gen 8 top level 5 path subv
    # mount -o subvolid=257 /dev/sdb5 /mnt/2
    # tree /mnt/2
    /mnt/2
    └── b

    0 directories, 1 file

The end result is that each subvolume can be treated as its own filesystem. It is entirely possible to create a whole series of subvolumes and mount each separately, ending up with a set of independent filesystems all sharing the underlying storage device. Once the subvolumes have been created, there is no need to mount the "root" device at all if only the subvolumes are of interest.
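
For example, a subvolume can be given a permanent mount point via /etc/fstab; a sketch, using the subvolume ID shown above and an assumed mount point:

    /dev/sdb5   /srv/subv   btrfs   subvolid=257   0   0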

Btrfs will normally mount the root volume unless explicitly told to do otherwise with the subvolid= mount option. But that is simply a default; if one wanted the new subvolume to be mounted by default instead, one could run:

    btrfs subvolume set-default 257 /mnt/1

Thereafter, mounting /dev/sdb5 with no subvolid= option will mount the subvolume subv. The root volume has a subvolume ID of 5 (subvolid=0 is accepted as a synonym for it), so mounting with subvolid=5 or subvolid=0 will return to the root volume.
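
To undo that change later, the default can be pointed back at the root volume; a minimal sketch, assuming the setup above:

    # btrfs subvolume set-default 5 /mnt/1

After that, mounting /dev/sdb5 without a subvolid= option once again yields the root volume.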

Subvolumes can be made to go away with:

    btrfs subvolume delete path

For ordinary subvolumes (as opposed to snapshots, described below), the subvolume indicated by path must be empty before it can be deleted.

Snapshots

A snapshot in Btrfs is a special type of subvolume — one which contains a copy of the current state of some other subvolume. If we return to our simple filesystem created above:

    # btrfs subvolume snapshot /mnt/1 /mnt/1/snapshot
    # tree /mnt/1
    /mnt/1
    ├── a
    ├── snapshot
    │   ├── a
    │   └── subv
    └── subv
        └── b

    3 directories, 3 files

The snapshot subcommand creates a snapshot of the given subvolume (the /mnt/1 root volume in this case), placing that snapshot under the requested name (/mnt/1/snapshot) in that subvolume. As a result, we now have a new subvolume called snapshot which appears to contain a full copy of everything that was in the filesystem previously. But, of course, Btrfs is a copy-on-write filesystem, so there is no need to actually copy all of that data; the snapshot simply has a reference to the current root of the filesystem. If anything is changed — in either the main volume or the snapshot — a copy of the relevant data will be made, so the other copy will remain unchanged.

Note also that the contents of the existing subvolume (subv) do not appear in the snapshot. If a snapshot of a subvolume is desired, that must be created separately.
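
A snapshot of subv itself can be created alongside in the same way; a quick sketch using the names above:

    # btrfs subvolume snapshot /mnt/1/subv /mnt/1/subv-snap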

Snapshots clearly have a useful backup function. If, for example, one has a Linux system using Btrfs, one can create a snapshot prior to installing a set of distribution updates. If the updates go well, the snapshot can simply be deleted. (Deletion is done with "btrfs subvolume delete" as above, but snapshots are not expected to be empty before being deleted). Should the update go badly, instead, the snapshot can be made the default subvolume and, after a reboot, everything is as it was before.
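
As a sketch of that rollback sequence (the snapshot name and ID here are hypothetical):

    # btrfs subvolume snapshot / /pre-update
        (apply the distribution updates)
    # btrfs subvolume list /
        (note the ID of pre-update; say it is 260)
    # btrfs subvolume set-default 260 /
    # reboot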

Snapshots can also be used to implement a simple "time machine" functionality. While working on this article series, your editor set aside a Btrfs partition to contain a copy of /home. On occasion, a simple script runs:

    rsync -aix --delete /home /home-backup
    btrfs subvolume snapshot /home-backup /home-backup/ss/`date +%y-%m-%d_%H-%M`

The rsync command makes /home-backup look identical to /home; a snapshot is then made of that state of affairs. Over time, the result is the creation of a directory full of timestamped snapshots; returning to the state of /home at any given time is a simple matter of going into the proper snapshot. Of course, if /home is also on a Btrfs filesystem, one could make regular snapshots without the rsync step, but the redundancy that comes with a backup drive would be lost.
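
Old snapshots eventually need to be pruned as well. A minimal bash sketch, assuming the timestamped names above (which sort chronologically) and a hypothetical 30-day retention policy:

    #!/bin/bash
    # Remove time-machine snapshots older than 30 days.
    cutoff=$(date -d '30 days ago' +%y-%m-%d_%H-%M)
    for snap in /home-backup/ss/*; do
        # Names like 14-01-06_07-00 compare chronologically as strings.
        if [[ $(basename "$snap") < "$cutoff" ]]; then
            btrfs subvolume delete "$snap"
        fi
    done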

One can quickly get used to having this kind of resource available. This also seems like an area that is just waiting for the development of some higher-level tools. Some projects are already underway; see Snapper or btrfs-time-machine, for example. There is also an "autosnap" feature that has been posted in the past, though it does not seem to have seen any development recently. For now, most snapshot users are most likely achieving the desired functionality through their own sets of ad hoc scripts.

Subvolume quotas

It typically will not take long before one starts to wonder how much disk space is used by each subvolume. A naive use of a tool like du may or may not produce a useful answer; it is slow and unable to take into account the sharing of data between subvolumes (snapshots in particular). Beyond that, in many situations, it would be nice to be able to divide a volume into subvolumes but not to allow any given subvolume to soak up all of the available storage space. These needs can be met through the Btrfs subvolume quota group mechanism.

Before getting into quotas, though, a couple of caveats are worth mentioning. One is that "quotas" in this sense are not normal, per-user disk quotas; those can be managed on Btrfs just like with any other filesystem. Btrfs subvolume quotas, instead, track and regulate usage by subvolumes, with no regard for the ownership of the files that actually take up the space. The other thing worth bearing in mind is that the quota mechanism is relatively new. The management tools are on the rudimentary side, there seem to be some performance issues associated with quotas, and there's still a sharp edge or two in there waiting for unlucky users.

By default, Btrfs filesystems do not have quotas enabled. To turn this feature on, run:

    # btrfs quota enable path
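
On a filesystem that already contains data, a rescan may also be needed so that the accounting covers existing extents; a sketch, assuming a filesystem mounted at /mnt/1 and a btrfs-progs recent enough to have the rescan subcommand (the -s form reports whether a rescan is still running):

    # btrfs quota rescan /mnt/1
    # btrfs quota rescan -s /mnt/1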

A bit more work is required to retrofit quotas into an older Btrfs filesystem; see this wiki page for details. Once quotas are established, one can look at actual usage with:

    # btrfs qgroup show /home-backup
    qgroupid rfer        excl       
    -------- ----        ----       
    0/5      21184458752 49152      
    0/277    21146079232 2872635392 
    0/281    20667858944 598929408  
    0/282    20731035648 499802112  
    0/284    20733419520 416395264  
    0/286    20765806592 661327872  
    0/288    20492754944 807755776  
    0/290    20672286720 427991040  
    0/292    20718280704 466567168  
    0/294    21184458752 49152      

This command was run in the time-machine partition described above, where all of the subvolumes are snapshots. The qgroupid is the ID number (actually a pair of numbers — see below) associated with the quota group governing each subvolume, rfer is the total amount of data referred to in the subvolume, and excl is the amount of data that is not shared with any other subvolume. In short, "rfer" approximates what "du" would indicate for the amount of space used in a subvolume, while "excl" tells how much space would be freed by deleting the subvolume.

...or, something approximately like that. In this case, the subvolume marked 0/5 is the root volume, which cannot be deleted. "0/294" is the most recently created snapshot; it differs little from the current state of the filesystem, so there is not much data that is unique to the snapshot itself. If one were to delete a number of files from the main filesystem, the amount of "excl" data in that last snapshot would increase (since those files still exist in the snapshot) while the amount of free space in the filesystem as a whole would not increase.

Limits can be applied to subvolumes with a command like:

    # btrfs qgroup limit 30M /mnt/1/subv

One can then test the limit with:

    # dd if=/dev/zero of=/mnt/1/subv/junk bs=10k
    dd: error writing ‘junk’: Disk quota exceeded
    2271+0 records in
    2270+0 records out
    23244800 bytes (23 MB) copied, 0.0334957 s, 694 MB/s

One immediate conclusion that can be drawn is that the limits are somewhat approximate at best; in this case, a limit of 30MB was requested, but the enforcement kicked in rather sooner than that. This happens even though the system appears to have a clear understanding of both the limit and current usage:

    # btrfs qgroup show -r /mnt/1
    qgroupid rfer     excl     max_rfer 
    -------- ----     ----     -------- 
    0/5      16384    16384    0        
    0/257    23261184 23261184 31457280 

The 0/257 line corresponds to the subvolume of interest; the current usage is shown as being rather less than the limit, but writes were limited anyway.

There is another interesting complication with subvolume quotas, as demonstrated by:

    # rm /mnt/1/subv/junk
    rm: cannot remove ‘/mnt/1/subv/junk’: Disk quota exceeded

In a copy-on-write world, even deleting data requires allocating space, for a while at least. A user in this situation would appear to be stuck; little can be done until somebody raises the limit for at least as long as it takes to remove some files. This particular problem has been known to the Btrfs developers since 2012, but there does not yet appear to be a fix in the works.

The quota group mechanism is somewhat more flexible than has been shown so far; it can, for example, organize quotas in hierarchies that apply limits at multiple levels. Imagine one had a Btrfs filesystem to be used for home directories, among other things. Each user's home could be set up as a separate subvolume with something like this:

    # cd /mnt/1
    # btrfs subvolume create home 
    # btrfs subvolume create home/user1
    # btrfs subvolume create home/user2
    # btrfs subvolume create home/user3

By default, each subvolume is in its own quota group, so each user's usage can be limited easily enough. But if there are other hierarchies in the same Btrfs filesystem, it might be nice to limit the usage of home as a whole. One would start by creating a new quota group:

    # btrfs qgroup create 1/1 home

Quota group IDs are, as we have seen, a pair of numbers; the first of those numbers corresponds to the group's level in the hierarchy. At the leaf level, that number is zero; IDs at that level have the subvolume ID as the second number of the pair. All higher levels are created by the administrator, with the second number being arbitrary.

The assembly of the hierarchy is done by assigning the bottom-level groups to the new higher-level groups. In this case, the subvolumes created for the user-level directories have IDs 258, 259, and 260 (as seen with btrfs subvolume list), so the assignment is done with:

    # btrfs qgroup assign 0/258 1/1 .
    # btrfs qgroup assign 0/259 1/1 .
    # btrfs qgroup assign 0/260 1/1 .
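
The same assign step can be used to stack further levels on top; a sketch with hypothetical IDs:

    # btrfs qgroup create 2/1 .
    # btrfs qgroup assign 1/1 2/1 .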

Limits can then be applied with:

    # btrfs qgroup limit 5M 0/258 .
    # btrfs qgroup limit 5M 0/259 .
    # btrfs qgroup limit 5M 0/260 .
    # btrfs qgroup limit 10M 1/1 .

With this setup, any individual user can use up to 5MB of space within their own subvolume. But users as a whole will be limited to 10MB of space within the home subvolume, so if user1 and user2 use their full quotas, user3 will be entirely out of luck. After creating exactly such a situation, querying the quota status on the filesystem shows:

    # btrfs qgroup show -r .
    qgroupid rfer     excl     max_rfer 
    -------- ----     ----     -------- 
    0/5      16384    16384    0        
    0/257    16384    16384    0        
    0/258    5189632  5189632  5242880  
    0/259    5189632  5189632  5242880  
    0/260    16384    16384    5242880  
    1/1      10346496 10346496 10485760 

We see that the first two user subvolumes have exhausted their quotas; that is also true of the upper-level quota group (1/1) that we created for home as a whole. As far as your editor can tell, there is no way to query the shape of the hierarchy; one simply needs to know how that hierarchy was built to work with it effectively.

As can be seen, subvolume quota support still shows signs of being relatively new code; there is still a fair amount of work to be done before it is truly ready for production use. Subvolume and snapshot support in general, though, has been around for years and is in relatively good shape. All told, subvolumes offer a highly useful feature set; in the future, we may well wonder how we ran our systems without them.

At this point, our survey of the major features of the Btrfs filesystem is complete. The next (and final) installment in this series will cover a number of loose ends, the send/receive feature, and more.

Index entries for this article
Kernel: Btrfs/LWN's guide to
Kernel: Filesystems/Btrfs



Deleting subvolumes

Posted Jan 6, 2014 19:48 UTC (Mon) by fishface60 (subscriber, #88700) [Link] (7 responses)

> For ordinary subvolumes (as opposed to snapshots, described below),
> the subvolume indicated by path must be empty before it can be deleted.

This is not the case; you are perfectly able to delete all kinds of
subvolumes, not just snapshots:

~$ btrfs sub cre foo
Create subvolume './foo'
~$ cd foo
~/foo$ echo asdf >bar
~/foo$ cd ..
~$ btrfs sub del foo
Delete subvolume '/home/<name>/foo'
~$ ls foo
ls: cannot access foo: No such file or directory

To be able to delete subvolumes without being root you need to add
`user_subvol_rm_allowed` to your fstab.

You can also mount a subvolume by path, rather than just by ID, with
`mount -o subvol=$path $device $mountpoint`.
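
For instance, an /etc/fstab line combining that with the user_subvol_rm_allowed option might look like this (a sketch; the device and mount point are assumptions):

/dev/sdb5  /home  btrfs  subvol=home,user_subvol_rm_allowed  0  0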

Another handy feature is read-only snapshots, so you can keep your backups from being tampered with.

~$ btrfs sub cre foo
Create subvolume './foo'
~$ cd foo
~/foo$ echo hello >file
~/foo$ cd ..
~$ btrfs sub sna -r foo bar
Create a readonly snapshot of 'foo' in './bar'
~$ cd bar
~/bar$ echo goodbye >file
bash: file: Read-only file system

You also need more privileges to delete read-only snapshots:

~$ btrfs sub del bar
Delete subvolume '/home/<user>/bar'
ERROR: cannot delete '/home/<user>/bar' - Read-only file system
~$ sudo btrfs sub del bar
[sudo] password for <user>:
Delete subvolume '/home/<user>/bar'
~$ ls bar
ls: cannot access bar: No such file or directory

Deleting subvolumes

Posted Jan 6, 2014 20:18 UTC (Mon) by dlang (guest, #313) [Link] (6 responses)

what's the advantage of using subvolumes rather than just independent filesystems?

Deleting subvolumes

Posted Jan 6, 2014 20:20 UTC (Mon) by dlang (guest, #313) [Link]

oops, I saw the comment in "unread comments" and didn't notice it was to a new article. I'll go read the article now. sorry for the noise.

Deleting subvolumes

Posted Jan 6, 2014 20:26 UTC (Mon) by zlynx (guest, #2285) [Link] (4 responses)

To do it with ordinary file systems you need free raw blocks. It almost requires the use of LVM, in fact.

Also, snapshots aren't as easy to create or manage.

Deleting subvolumes

Posted Jan 7, 2014 8:38 UTC (Tue) by smurf (subscriber, #17840) [Link] (3 responses)

Plus, if you do it with LVM and copy-on-write, you never know when you're going to run out of disk space. A random I/O error during some metadata operation will leave your file system in an inconsistent state. Btrfs might have problems deleting files when your FS is full, but it will not corrupt its structure.

Plus, with btrfs subvolumes you *can* delete files from snapshots and actually free some space (like the large download which you no longer need). No such luck with LVM.

Plus, creating a snapshot with LVM requires a consistent FS state. How do you ensure that (and keep it that way until the snapshot command returns) on a live system, if you can't unmount?

Deleting subvolumes

Posted Jan 7, 2014 10:50 UTC (Tue) by Yenya (subscriber, #52846) [Link]

It is filesystem-dependent, but it can be done. See

http://linux.die.net/man/8/xfs_freeze
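
A rough sketch of using it around an LVM snapshot (the volume and mount point names are assumptions):

xfs_freeze -f /data
lvcreate -s -L 1G -n data-snap /dev/vg0/data
xfs_freeze -u /data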

Deleting subvolumes

Posted Jan 7, 2014 17:02 UTC (Tue) by zlynx (guest, #2285) [Link]

I think that all of the supported file systems in RedHat Server do a live freeze during LVM snapshot, so that isn't really a problem. It might be a problem for your application performance, if you can't stand an unresponsive filesystem for a few seconds.

I know that ext3, ext4 and xfs all work fine after a LVM snap.

Deleting subvolumes

Posted Jan 16, 2014 16:41 UTC (Thu) by arielb1@mail.tau.ac.il (guest, #94131) [Link]

Isn't the filesystem-on-disk always in a consistent state for crash safety anyway? You just need to pick a point in the block scheduler - the child FS will be in the same state as after "all calls to schedule_new_block (or whatever) hang, the scheduler writes all blocks to disk, the system crashes" - which is valid (by crash safety).

Btrfs: Subvolumes and snapshots

Posted Jan 6, 2014 20:25 UTC (Mon) by dlang (guest, #313) [Link] (9 responses)

Am I correct in thinking that when you remove one subvolume it can cause another subvolume to grow drastically if they were sharing data?

for example, if you have a series of snapshots and delete a snapshot in the middle, (almost) all that data will be needed by the older snapshots, which will then show a lot more usage.

What happens if this puts the older snapshot over its quota? Will the removal of the snapshot fail?

Btrfs: Subvolumes and snapshots

Posted Jan 8, 2014 22:35 UTC (Wed) by and (guest, #2883) [Link] (7 responses)

> Am I correct in thinking that when you remove one subvolume it can cause another subvolume to grow drastically if they were sharing data.

If I understood this correctly, deleting a subvolume will always free the exclusive data of this subvolume and in no case grow other subvolumes or change their quota. What _can_ happen is that some data is no longer shared and thus deleting one subvolume can increase the exclusive data of another subvolume.

(This just led me to wondering how to find the minimal set of subvolumes that must be deleted to free a predefined amount of space. Smells like an NP-complete problem.)

Btrfs: Subvolumes and snapshots

Posted Jan 8, 2014 22:40 UTC (Wed) by dlang (guest, #313) [Link] (6 responses)

hmm, so if the quota includes shared data, then the quota for a snapshot must be as large as the filesystem that it's backing up as all the data will initially be shared.

At this point, putting a quota on a snapshot seems worthless.

what am I missing?

Btrfs: Subvolumes and snapshots

Posted Jan 8, 2014 23:09 UTC (Wed) by and (guest, #2883) [Link] (5 responses)

> hmm, so if the quota includes shared data, then the quota for a snapshot must be as large as the filesystem that it's backing up as all the data will initially be shared.

Well, the quota needs to be at least as large as the space used at the time when the snapshot is created. I can't see how shared data could be accounted for other than that: in the model which I think you have in mind, shared data would be attributed either to a single subvolume or proportionally to all which contain it. Both possibilities are undesirable for the reason you mentioned, and also because the amount of exclusive space used by one subvolume can increase by writing to or deleting a file in another subvolume. Causing the first subvolume to go over its quota in this case would constitute _really_ unexpected behavior, IMO.

Btrfs: Subvolumes and snapshots

Posted Jan 8, 2014 23:25 UTC (Wed) by dlang (guest, #313) [Link] (4 responses)

People who use snapshots are actually quite used to having the usage of one snapshot go up when they delete a newer snapshot.

Requiring that the quota allow for the full space means that the quota is going to be useless for preventing the disk from filling up.

Instead of being able to say something like "I want snapshots, but I don't want them to consume more than 10% of my space", with it either refusing to create new snapshots or deleting older snapshots when it fills up, you are going to have to set the quota for a snapshot very large, and the sum of the quotas is going to be much larger than your total space.

Btrfs: Subvolumes and snapshots

Posted Jan 9, 2014 19:13 UTC (Thu) by iabervon (subscriber, #722) [Link] (3 responses)

The policy "I want snapshots, but I don't want them to take up more than 10% of my total space" is a complete disaster if applied as a limit. Your OS upgrades would fail for lack of space, even if you have plenty of free space, if you replace files accounting for 10% of your total space. (That is, you don't want to be prevented from unsharing things in your root subvolume, but that's the only operation that increases the amount of space used by snapshots-- creating a new snapshot takes pretty much no new space.)

What you actually want is a garbage collection policy, rather than a limit policy; in order to generate free space, the filesystem is allowed to delete some unmounted subvolumes in a particular order. So this goal is pretty far in implementation from what "quota" means in other filesystems.

Btrfs: Subvolumes and snapshots

Posted Jan 9, 2014 20:45 UTC (Thu) by dlang (guest, #313) [Link] (2 responses)

> The policy "I want snapshots, but I don't want them to take up more than 10% of my total space" is a complete disaster if applied as a limit. Your OS upgrades would fail for lack of space, even if you have plenty of free space,

for almost all the systems I run, the OS is far less than 10% of the total disk space. The majority of the space is used by data files, or software outside of the distro.

Btrfs: Subvolumes and snapshots

Posted Jan 9, 2014 20:55 UTC (Thu) by iabervon (subscriber, #722) [Link] (1 responses)

Even so, that's still: "You can't delete that brain image, because then it would count toward snapshots you weren't prevented from making."

Btrfs: Subvolumes and snapshots

Posted Jan 9, 2014 21:14 UTC (Thu) by dlang (guest, #313) [Link]

> Even so, that's still: "You can't delete that brain image, because then it would count toward snapshots you weren't prevented from making."

In theory you may be correct, but having used snapshots for over a decade, in practice, on non-toy sized filesystems, it just isn't a problem.

just like the OS isn't larger than 10% of the filesystem, neither is any individual item on it.

On existing systems, the quota is not per snapshot, but for all snapshots combined. They also have a policy for what to do when you try to create a snapshot and there isn't space, which is usually to delete the oldest snapshot until there is space.

Btrfs: Subvolumes and snapshots

Posted Jan 27, 2014 0:07 UTC (Mon) by rodgerd (guest, #58896) [Link]

Consider a real-world example with LVM+XFS from a month ago: I needed to do a P2V which required an extra half terabyte of space in one of my partitions. I did the lvextend -L +500G / xfs_growfs dance and did the migration.

However, the growth was only temporary: once I'd finished the P2V I actually only needed about another 100G above the original filesystem. I was a bit stuck here, because you can't shrink XFS. Even if it had been ext*, though, I'd have had to do an offline resize, then an LVM resize (taking care not to over-shrink and destroy the filesystem!). Either way it would (and did) require an outage.

With btrfs I could have grown the quota, done the work, thrown the excess files away, and then shrunk the quota again.

Btrfs: Subvolumes and snapshots

Posted Jan 6, 2014 23:24 UTC (Mon) by ppisa (subscriber, #67307) [Link] (2 responses)

A user is able to resolve the quota-exceeded problem (tested/required to resolve our students' problems - homes as subvolumes on Btrfs, served over NFS to 40 stations). The solution is to truncate some of the huge files, i.e.

echo 1 >/user/xy/too_fat_file

or

truncate -s 0 /user/xy/too_fat_file

which worked for a regular user even over NFS.

The problem we found was that the user is not able to log into a graphical/X session (Xfce) because the required pipes cannot be created in his/her home.

Text console login over PAM, with the subvolume mounted over NFS, works without problems, and the user can trim/truncate the offending file.

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 15:08 UTC (Tue) by zzxtty (guest, #45175) [Link] (1 responses)

We've had to deal with this for some time with ZFS, I recommend the following command to our users:

cp /dev/null ~/largefile

It's a nice simple syntax they understand.

Btrfs: Subvolumes and snapshots

Posted Jan 8, 2014 18:46 UTC (Wed) by mebrown (subscriber, #7960) [Link]

Even shorter:

> largefile

snapshot performance

Posted Jan 7, 2014 3:23 UTC (Tue) by geuder (subscriber, #62854) [Link] (1 responses)

I have used a daily and a weekly snapshot with LVM2 and ext4 for a long time. Stupid user errors tend to be more common than disk crashes... Cron just creates it/them every morning at 7 using a 5-line shell script. Obviously this leads to 2 extra big write-head moves for every disk write; write throughput is only 10% (IIRC) compared to not using any snapshot. But for normal software development work it does not disturb at all. Machines are fast enough, and when they are not, the bottleneck is typically elsewhere than in pure write throughput. When working with filesystem images in the gigabyte class I do it on another filesystem without the safety net.

Note to self: find a system with a reasonably new kernel and a free corner on the disk and compare whether the impact of btrfs snapshots on write performance is less (I guess it should be possible to avoid or at least reduce the extra write-head moves).

snapshot performance

Posted Jan 7, 2014 5:51 UTC (Tue) by dlang (guest, #313) [Link]

BTRFS probably won't show significant drops in throughput when using snapshots, not because it avoids head movement, but because even normal writes are out of place and so reads end up seeking all over the place (writes don't suffer because they just get written wherever it's convenient, one spot is as good as the next)

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 5:43 UTC (Tue) by iabervon (subscriber, #722) [Link] (7 responses)

It would be interesting if you could combine multiple devices and subvolumes such that you have a single filesystem spread across your regular hard drive and a backup disk, using RAID0, and the root subvolume were only allocated from the regular hard drive, while snapshots were only allocated from the backup disk. The actual data transfer to the backup disk would then happen as a consequence of both devices needing to contain the same blocks, while the filesystem as a whole understands that the backup disk doesn't need to contain multiple copies of the same block just because multiple snapshots were made. And, of course, at least one subvolume would always be completely recoverable in the event of either disk failing.

I think that kind of policy would be a more interesting use of multiple devices than RAID policies which just have a numeric block allocation policy, and would be more powerful than what you could do with either device-level RAID or rsync or both.

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 5:49 UTC (Tue) by dlang (guest, #313) [Link] (1 responses)

that's not how BTRFS snapshots work.

Every write on a Btrfs filesystem creates a new block; eventually the old blocks are garbage collected. All that a snapshot does is prevent the blocks that are current as of the time of the snapshot from being garbage collected. They don't get copied anywhere, they just don't get deleted.

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 7:34 UTC (Tue) by iabervon (subscriber, #722) [Link]

Right, the copy wouldn't be due to the snapshot, it would be due to the resulting policy as to where the block has to be stored not being fulfilled, just like if you converted a RAID0 filesystem to a RAID1 filesystem, but on a per-block level, based on what subvolumes include the block. That is, the snapshot doesn't create new blocks, but it does cause the blocks to need to be mirrored, when new files on the root subvolume would not be required to be mirrored (but would have to be on the first device). When the file is deleted from the main subvolume, it could be garbage collected off the first device, but would continue to be required on the backup device until it was no longer in a snapshot.

Of course, I don't think multiple device support interacts like this with subvolumes, either, but it seems like all of the hard work is done: subvolumes and multiple devices work, and online conversion between RAID configurations. Subvolume membership can affect blocks (e.g., via quotas), including when the blocks are shared between subvolumes. It seems to me like all that's missing is the ability to choose raid policy on a per-block basis, some raid policies that wouldn't make sense otherwise (e.g., "raid0, but only use a particular device"), and the ability to trigger balancing based on the creation of a subvolume with a different raid policy.

Btrfs: Subvolumes and snapshots

Posted Feb 6, 2015 2:45 UTC (Fri) by JimAvera (guest, #100933) [Link] (4 responses)

To somehow get the benefits of RAID0 and RAID1 simultaneously would be awesome.

A variation of iabervon's idea would be an explicit "copying unbalance" operation, which would copy the necessary blocks of a specified subvolume to reside only on specified device(s). To be useful for the purpose I'll explain momentarily, blocks which were not already on the destination device(s) would be *duplicated* not moved, leaving the originals where they were; blocks already on the destination(s) would stay as-is, possibly shared.

This could be used to get "delayed redundancy", where everything normally runs RAID0 (striped) for speed, including most snapshots. But periodically a snapshot would be converted to be stored only on a single drive (or subset of striped drives). These snaps would provide insurance against disk crashes, but not in real-time; after a crash, you could recover only to the time of the latest snap isolated to other drive(s).

For example, you could take snapshots every hour or more frequently, keeping 24 hours worth; and once a day "copy & unbalance" the latest snapshot, rotating the excluded drive.

You would get the full speed benefit of RAID0 striping for real-time operations during the day, and still get periodic backups against disk failures. But TANSTAAFL, the time-cost of that redundancy would be more than with RAID1 (where redundant blocks are written concurrently in the first place); however that cost could be paid at controlled times, e.g., in the middle of the night.

A big fly in this ointment is that a succession of such "backup" snapshots would end up with multiple copies of the same data on each drive, because blocks which were "de-balanced" became disconnected copies (necessary so that every block would in fact have copies on multiple drives). Out-of-band deduplication could be run over the latest N "backup" snapshots (N=number of drives) to eliminate unnecessary copies, but adding a lot of disk i/o to the nightly "backup" operations.

A more difficult solution to the problem of duplicate blocks among the backup snapshots would be to make the "copy & unbalance" operation examine other specified subvolumes to find existing copies of blocks on the destination drive(s), perhaps comparing exactly-parallel files for the same ObjectID in the trees, and then comparing corresponding extents which already resided entirely on the destination drive(s).

Btrfs: Subvolumes and snapshots

Posted Feb 6, 2015 5:42 UTC (Fri) by nybble41 (subscriber, #55106) [Link]

It sounds like what you're describing could be handled fairly well by per-subvolume RAID levels, which are already on the agenda according to the btrfs FAQ: "However, we have plans to allow per-subvolume and per-file RAID levels." Just set your backup snapshot as RAID-1 and everything else as RAID-0, then perform a "soft" rebalance to redistribute the data which just changed levels. (But what happens when an extent is part of multiple subvolumes with different allocation schemes? Highest level wins? Last assigned level? The worst possibility would be breaking the COW link....)

Btrfs: Subvolumes and snapshots

Posted Feb 6, 2015 6:41 UTC (Fri) by dlang (guest, #313) [Link] (2 responses)

I think you are slightly misunderstanding the way snapshots work.

When you do a snapshot, you aren't copying all the data into the snapshot. What you are doing is copying a current set of metadata and flagging all the disk blocks as Copy on Write, so that as you continue to use the filesystem, the blocks that make up the snapshot never get changed. If the OS wants to write to that file the filesystem allocates a new block, copies the existing data over to it and then does the modification that the OS asked for.

So if you have a filesystem in a RAID0 stripe set of drives, when you make a snapshot, the snapshot will continue to require both drives.

You would then have to make a complete copy of the files on the filesystem to have it all reside on one drive.

Btrfs: Subvolumes and snapshots

Posted Feb 6, 2015 9:40 UTC (Fri) by JimAvera (guest, #100933) [Link] (1 responses)

Well, yes. Creating a snapshot doesn't replicate file data, but that's beside the point. When blocks are striped across all drives, then failure of a single drive causes a total loss.

The idea is to somehow convert selected "backup" snapshots to no longer store anything on a particular drive. It would be like removing one disk from a RAID-0 set, but only for a single specified subvolume. The difference is that blocks on the "removed" drive would be replicated, so other snapshots referencing the same data would still point to the original copy, achieving the desired replication effect.
-----
But such a scheme would only be interesting with two drives; eventually it would create as many copies as drives, which for more than two is excessive.

Btrfs: Subvolumes and snapshots

Posted Feb 6, 2015 9:47 UTC (Fri) by dlang (guest, #313) [Link]

actually, since the blocks include references to other blocks, you would end up with three (or more) copies of the blocks

1. the original set that's split across the two drives

2. the set that has all pointers changed to live on drive1

3. the set that has all pointers changed to live on drive2

and since the pointer changes need to take place each time the data is copied from one drive to another (since the existing blocks will already be in use), each single-drive snapshot would be a full copy of everything.

Also, while snapshots are pretty reliable (since what they do is stop making changes to the blocks currently in use), this code would be doing some major re-writing of blocks, so it would also be far more fragile.

You would be better off just using a conventional backup program to back up the snapshot; there is far less risk involved.

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 9:43 UTC (Tue) by rvfh (guest, #31018) [Link] (6 responses)

Am I to understand that a Btrfs [sub]volume can be mounted any number of times? What if I mount a subv on /mnt/2 and access it from /mnt/1/subv?

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 10:21 UTC (Tue) by jengelh (guest, #33263) [Link]

The view should remain consistent.

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 11:35 UTC (Tue) by Yenya (subscriber, #52846) [Link] (4 responses)

Any Linux FS (e.g. ext4, xfs, ...) can be mounted a number of times. You can do

mount /dev/sdc1 /mnt1
mount /dev/sdc1 /mnt2

and then access (and write) files under /mnt1 _and_ /mnt2 simultaneously. I suppose it is handled at the VFS level by having only one struct superblock (and the referenced FS structures) in this case.

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 13:15 UTC (Tue) by cortana (subscriber, #24596) [Link] (3 responses)

I've always wondered how you can guarantee that a filesystem is unmounted, in the presence of this multiple mounting feature (and, these days, mount namespaces). If I want to eject /dev/sdc, unmounting just /mnt2 in your example is not enough to guarantee that the filesystem is safe for removal. You'd also have to check the list of mounted filesystems and unmount /mnt1. I don't even know what you should do if some process with its own mount namespace has the filesystem mounted...

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 16:03 UTC (Tue) by cruff (subscriber, #7201) [Link]

> I don't even know what you should do if some process with its own mount namespace has the filesystem mounted.

You are out of luck unless you can access the other namespace. I have several systems that have an LXC container with bind mounts of some file systems inside the container. Because of the version of the kernel used in RHEL 6, I have not determined a way to access the container's mount namespace from the container host system after the container has been created. Thus it can be impossible to umount all instances of each file system. Of course, containers under RHEL 6 have other usability issues; we sometimes have to just reboot the container host to restore everything to normal. I'm thinking of going back to using chroot jails.

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 16:04 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

You can also unmount based on the device path. There's also the '--all-targets' option: "umount --all-targets /dev/sdc".

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 17:25 UTC (Tue) by cortana (subscriber, #24596) [Link]

umount --all-targets only appears to work in the current mount namespace, going by <https://www.kernel.org/pub/linux/utils/util-linux/v2.23/v...>.

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 10:16 UTC (Tue) by sitaram (guest, #5959) [Link] (5 responses)

> rsync -aix --delete /home /home-backup
> btrfs subvolume snapshot /home-backup /home-backup/ss/`date +%y-%m-%d_%H-%M`

> The rsync command makes /home-backup look identical to /home; a snapshot is then made of that state of affairs. Over time, the result is the creation of a directory full of timestamped snapshots; returning to the state of /home at any given time is a simple matter of going into the proper snapshot. Of course, if /home is also on a Btrfs filesystem, one could make regular snapshots without the rsync step, but the redundancy that comes with a backup drive would be lost.

Wouldn't it be better to go the other way: have /home on btrfs, make a snapshot, and then rsync that snapshot to the second disk? By using rsync, you're giving up a fair bit of "snapshot" functionality -- some files might be inconsistent (or at least more so than with a btrfs snapshot).

Btrfs: Subvolumes and snapshots

Posted Jan 7, 2014 16:51 UTC (Tue) by oohlaf (guest, #42838) [Link] (4 responses)

Exactly.
Snapshot home and use btrfs send to transfer the snapshot to a second disk.
Both disks should have btrfs as filesystem.

See https://btrfs.wiki.kernel.org/index.php/Incremental_Backup

Btrfs: Subvolumes and snapshots

Posted Jan 8, 2014 0:39 UTC (Wed) by corbet (editor, #1) [Link] (3 responses)

Syncing from a snapshot is a good idea, yes. In the case of the very simplistic example given in the article, the system in question has an ext4 home partition that has been there for years, so that wasn't an option.

Current plan is to look at send/receive in the next installment.

Btrfs: Subvolumes and snapshots

Posted Jan 8, 2014 19:31 UTC (Wed) by MattJD (subscriber, #91390) [Link] (2 responses)

The Btrfs wiki describes a method to convert an ext4 filesystem to btrfs without data loss (at: https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3 ). It is supposed to preserve the original filesystem, so you can try it out and revert if it doesn't work well enough.

Btrfs: Subvolumes and snapshots

Posted Jan 10, 2014 11:30 UTC (Fri) by jezuch (subscriber, #52988) [Link] (1 responses)

> The Btrfs wiki describes a method to convert an ext4 filesystem to btrfs without data loss

That's a neat trick enabled by the COW nature of btrfs, and I wonder if there is a possibility to add support for more filesystems besides ext4. I'm especially interested in XFS, as that's what I'm running on my systems :)

Btrfs: Subvolumes and snapshots

Posted Jan 10, 2014 15:10 UTC (Fri) by MattJD (subscriber, #91390) [Link]

>> The Btrfs wiki describes a method to convert an ext4 filesystem to btrfs without data loss
> That's a neat trick enabled by the COW nature of btrfs, and I wonder if there is a possibility to add support for more filesystems besides ext4. I'm especially interested in XFS, as that's what I'm running on my systems :)

The tool uses a userspace library to read the filesystem and set up Btrfs. Adding support for XFS (or any other fs) should mostly involve getting a userspace library to read XFS and calling the appropriate functions from the conversion tool to set up Btrfs.

Btrfs: Subvolumes and snapshots

Posted Jan 8, 2014 16:52 UTC (Wed) by janecek (subscriber, #74774) [Link] (1 responses)

For rsync + btrfs snapshot backups, the --inplace rsync option might be useful. By default, when a file is changed, rsync creates a new file and moves it over the original. With --inplace, updated data is written directly to the file, so the unchanged parts can be shared among snapshots.
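
For example, the article's backup command would become something like (a sketch):

rsync -aix --delete --inplace /home /home-backup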

Btrfs: Subvolumes and snapshots

Posted Jan 28, 2014 16:00 UTC (Tue) by nye (subscriber, #51576) [Link]

You would also want --no-whole-file, because the default for local syncs is to rewrite the whole file rather than using the delta transfer algorithm.

Anecdote time:

I have a system which stores backups that come in the form of some large-ish database dumps created by some MS software. Each time the backup is made, the entire file is rewritten, thus defeating the use of snapshots for saving multiple versions with minimal disk space.
My solution to this is to save them to a staging area which does not get snapshotted, then 'rsync --inplace --no-W' to the final destination, which contains yesterday's version of the dump. This neatly turns a full rewrite into a much smaller set of block changes, without having to use some full-blown deduplication tool.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds