There are two fundamental reasons to want to spread a single filesystem across multiple physical devices: increased capacity and greater reliability. In some configurations, RAID can also offer improved throughput for certain types of workloads, though throughput tends to be a secondary consideration. RAID arrays can be arranged into a number of configurations ("levels") that offer varying trade-offs between these parameters. Btrfs does not support all of the available RAID levels, but it does have support for the levels that most people actually want to use.
RAID 0 ("striping") can be thought of as a way of concatenating multiple physical disks together into a single, larger virtual drive. A strict striping implementation distributes data across the drives in a well-defined set of "stripes"; as a result, all of the drives must be the same size, and the total capacity is simply the product of the number of drives and the capacity of any individual drive in the array. Btrfs can be a bit more flexible than this, though, supporting a concatenation mode (called "single") which can work with unequally sized drives. In theory, any number of drives can be combined into a RAID 0 or "single" array.
RAID 1 ("mirroring") trades off capacity for reliability; in a RAID 1 array, two drives (of the same size) store identical copies of all data. The failure of a single drive can kill an entire RAID 0 array, but a RAID 1 array will lose no data in that situation. RAID 1 arrays will be slower for write-heavy use, since all data must be written twice, but they can be faster for read-heavy workloads, since any given read can be satisfied by either drive in the array.
RAID 10 is a simple combination of RAID 0 and RAID 1; at least two pairs of drives are organized into independent RAID 1 mirrored arrays, then data is striped across those pairs.
RAID 2, RAID 3, and RAID 4 are not heavily used, and they are not supported by Btrfs. RAID 5 can be thought of as a collection of striped drives with a parity drive added on (in reality, the parity data is usually distributed across all drives). A RAID 5 array with N drives has the storage capacity of a striped array with N-1 drives, but it can also survive the failure of any single drive in the array. RAID 6 uses a second parity drive, increasing the amount of space lost to parity blocks but adding the ability to lose two drives simultaneously without losing any data. A RAID 5 array must have at least three drives to make sense, while RAID 6 needs four drives. Both RAID 5 and RAID 6 are supported by Btrfs.
One other noteworthy point is that Btrfs goes out of its way to treat metadata differently than file data. A loss of metadata can threaten the entire filesystem, while the loss of file data affects only that one file — a lower-cost, if still highly undesirable, failure. Metadata is usually stored in duplicate form in Btrfs filesystems, even when a single drive is in use. But the administrator can explicitly configure how data and metadata are stored on any given array, and the two can be configured differently: data might be simply striped in a RAID 0 configuration, for example, while metadata is stored in RAID 5 mode in the same filesystem. And, for added fun, these parameters can be changed on the fly.
Earlier in this series, we used mkfs.btrfs to create a simple Btrfs filesystem. A more complete version of this command for the creation of multiple-device arrays looks like this:
mkfs.btrfs -d mode -m mode dev1 dev2 ...
This command will group the given devices together into a single array and build a filesystem on that array. The -d option describes how data will be stored on that array; it can be single, raid0, raid1, raid10, raid5, or raid6. The placement of metadata, instead, is controlled with -m; in addition to the modes available for -d, it supports dup (metadata is stored twice somewhere in the filesystem). The storage modes for data and metadata are not required to be the same.
So, for example, a simple striped array with two drives could be created with:
mkfs.btrfs -d raid0 /dev/sdb1 /dev/sdc1
Here, we have specified striping for the data; the default for metadata will be dup. This filesystem is mounted with the mount command as usual. Either /dev/sdb1 or /dev/sdc1 can be specified as the drive containing the filesystem; Btrfs will find all other drives in the array automatically.
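A minimal sketch of that sequence, using the device names from the example above; the btrfs device scan step is only needed in settings (an initramfs, say) where the kernel may not yet have seen all of the array's members:
btrfs device scan            # register all Btrfs devices with the kernel
mount /dev/sdb1 /mnt         # naming /dev/sdc1 here would work equally well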
The df command will only list the first drive in the array. So, for example, a two-drive RAID 0 filesystem with a bit of data on it looks like this:
# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 274G 30G 241G 11% /mnt
More information can be had with the btrfs command:
root@dt:~# btrfs filesystem show /mnt
Label: none uuid: 4714fca3-bfcb-4130-ad2f-f560f2e12f8e
Total devices 2 FS bytes used 27.75GiB
devid 1 size 136.72GiB used 17.03GiB path /dev/sdb1
devid 2 size 136.72GiB used 17.01GiB path /dev/sdc1
(Subcommands to btrfs can be abbreviated, so one could type "fi" instead of "filesystem", but full commands will be used here). This output shows the data split evenly across the two physical devices; the total space consumed (17GiB on each device) somewhat exceeds the size of the stored data. That shows a commonly encountered characteristic of Btrfs: the amount of free space shown by a command like df is almost certainly not the amount of data that can actually be stored on the drive. Here we are seeing the added cost of duplicated metadata, among other things; as we will see below, the discrepancy between the available space shown by df and reality is even greater for some of the other storage modes.
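The split between data and metadata that underlies this discrepancy can be examined with the btrfs filesystem df subcommand; the output below is illustrative rather than copied from the array above:
# btrfs filesystem df /mnt
Data, RAID0: total=32.00GiB, used=27.72GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=40.00MiB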
Naturally, no matter how large a particular filesystem is when the administrator sets it up, it will prove too small in the long run. That is simply one of the universal truths of system administration. Happily, Btrfs makes it easy to respond to a situation like that; adding another drive (call it "/dev/sdd1") to the array described above is a simple matter of:
# btrfs device add /dev/sdd1 /mnt
Note that this addition can be done while the filesystem is live — no downtime required. Querying the state of the updated filesystem reveals:
# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 411G 30G 361G 8% /mnt
# btrfs filesystem show /mnt
Label: none uuid: 4714fca3-bfcb-4130-ad2f-f560f2e12f8e
Total devices 3 FS bytes used 27.75GiB
devid 1 size 136.72GiB used 17.03GiB path /dev/sdb1
devid 2 size 136.72GiB used 17.01GiB path /dev/sdc1
devid 3 size 136.72GiB used 0.00 path /dev/sdd1
The filesystem has been expanded with the addition of the new space, but there is no space consumed on the new drive. It is, thus, not a truly striped filesystem at this point, though the difference can be hard to tell. New data copied into the filesystem will be striped across all three drives, but the existing data stays where it is, so the distribution of used space will remain unbalanced unless explicit action is taken. To balance out the filesystem, run:
# btrfs balance start -d -m /mnt
Done, had to relocate 23 out of 23 chunks
The flags say to balance both data and metadata across the array. A balance operation involves moving a lot of data between drives, so it can take some time to complete; it will also slow access to the filesystem. There are subcommands to pause, resume, and cancel the operation if need be. Once it is complete, the picture of the filesystem looks a little different:
# btrfs filesystem show /mnt
Label: none uuid: 4714fca3-bfcb-4130-ad2f-f560f2e12f8e
Total devices 3 FS bytes used 27.78GiB
devid 1 size 136.72GiB used 10.03GiB path /dev/sdb1
devid 2 size 136.72GiB used 10.03GiB path /dev/sdc1
devid 3 size 136.72GiB used 11.00GiB path /dev/sdd1
The data has now been balanced (approximately) equally across the three drives in the array.
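As noted above, a running balance need not simply be waited out; it can be monitored and controlled with a few subcommands, all of which operate on the mounted filesystem:
btrfs balance status /mnt      # report how far the operation has gotten
btrfs balance pause /mnt       # suspend the work; the filesystem remains usable
btrfs balance resume /mnt      # continue a paused balance
btrfs balance cancel /mnt      # abandon the operation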
Devices can also be removed from an array with a command like:
# btrfs device delete /dev/sdb1 /mnt
Before the device can actually be removed, it is, of course, necessary to relocate any data stored on that device. So this command, too, can take a long time to run; unlike the balance command, device delete offers no way to pause and restart the operation. Needless to say, the command will not succeed if there is not sufficient space on the remaining drives to hold the data from the outgoing drive. It will also fail if removing the device would cause the array to fall below the minimum number of drives for the RAID level of the filesystem; a RAID 0 filesystem cannot be left with a single drive, for example.
Note that any drive can be removed from an array; there is no "primary" drive that must remain. So, for example, a series of add and delete operations could be used to move a Btrfs filesystem to an entirely new set of physical drives with no downtime.
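A sketch of that kind of migration, with /dev/sde1 as a hypothetical replacement drive; the add/delete pair is repeated for each drive to be retired, and each delete returns only once the outgoing drive's data has been relocated:
btrfs device add /dev/sde1 /mnt        # bring the new drive into the array
btrfs device delete /dev/sdb1 /mnt     # migrate data off the old drive, then drop it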
The management of the other RAID levels is similar to RAID 0. To create a mirrored array, for example, one could run:
mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdc1
With this setup, both data and metadata will be mirrored across both drives. At least two drives are required for a RAID 1 array; Btrfs stores exactly two copies of each block regardless of how many drives are present. These arrays, once again, can look a little confusing to tools like df:
# du -sh /mnt
28G /mnt
# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 280G 56G 215G 21% /mnt
Here, df shows 56GB of space taken, while du swears that only half that much data is actually stored there. The listed size of the filesystem is also wrong, in that it shows the total space, not taking into account that every block will be stored twice; a user who attempts to store that much data in the array will be sorely disappointed. Once again, more detailed and correct information can be had with:
# btrfs filesystem show /mnt
Label: none uuid: e7e9d7bd-5151-45ab-96c9-e748e2c3ee3b
Total devices 2 FS bytes used 27.76GiB
devid 1 size 136.72GiB used 30.03GiB path /dev/sdb1
devid 2 size 142.31GiB used 30.01GiB path /dev/sdc1
Here we see the full data (plus some overhead) stored on each drive.
A RAID 10 array can be created with the raid10 profile; this type of array requires an even number of drives, with four drives at a minimum. Drives can be added to — or removed from — an active RAID 10 array, but, again, only in pairs. RAID 5 arrays can be created from any number of drives with a minimum of three; RAID 6 needs a minimum of four drives. These arrays, too, can handle the addition and removal of drives while they are mounted.
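Creating these arrays looks just like the RAID 0 and RAID 1 cases shown above; the device names below are, of course, placeholders:
mkfs.btrfs -d raid10 -m raid10 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mkfs.btrfs -d raid6 -m raid6 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1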
Imagine for a moment that a three-device RAID 0 array has been created and populated with a bit of data:
# mkfs.btrfs -d raid0 -m raid0 /dev/sdb1 /dev/sdc1 /dev/sdd1
# mount /dev/sdb1 /mnt
# cp -r /random-data /mnt
At this point, the state of the array looks somewhat like this:
# btrfs filesystem show /mnt
Label: none uuid: 6ca4e92a-566b-486c-a3ce-943700684bea
Total devices 3 FS bytes used 6.57GiB
devid 1 size 136.72GiB used 4.02GiB path /dev/sdb1
devid 2 size 136.72GiB used 4.00GiB path /dev/sdc1
devid 3 size 136.72GiB used 4.00GiB path /dev/sdd1
After suffering a routine disk disaster, the system administrator then comes to the conclusion that there is value in redundancy and that, thus, it would be much nicer if the above array used RAID 5 instead. It would be entirely possible to change the setup of this array by backing it up, creating a new filesystem in RAID 5 mode, and restoring the old contents into the new array. But the same task can be accomplished without downtime by converting the array on the fly:
# btrfs balance start -dconvert=raid5 -mconvert=raid5 /mnt
(The balance filters page on the Btrfs wiki and this patch changelog have better information on the balance command than the btrfs man page). Once again, this operation can take a long time; it involves moving a lot of data between drives and generating checksums for everything. At the end, though, the administrator will have a nicely balanced RAID 5 array without ever having had to take the filesystem offline:
# btrfs filesystem show /mnt
Label: none uuid: 6ca4e92a-566b-486c-a3ce-943700684bea
Total devices 3 FS bytes used 9.32GiB
devid 1 size 136.72GiB used 7.06GiB path /dev/sdb1
devid 2 size 136.72GiB used 7.06GiB path /dev/sdc1
devid 3 size 136.72GiB used 7.06GiB path /dev/sdd1
Total space consumption has increased, due to the addition of the parity blocks, but otherwise users should not notice the conversion to the RAID 5 organization.
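While a conversion like this is underway, its progress can be checked with the balance status subcommand; the output shown here is illustrative:
# btrfs balance status /mnt
Balance on '/mnt' is running
10 out of about 21 chunks balanced (12 considered), 52% left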
A redundant configuration does not prevent disk disasters, of course, but it does enable those disasters to be handled with a minimum of pain. Let us imagine that /dev/sdc1 in the above array starts to show signs of failure. If the administrator has a spare drive (we'll call it /dev/sde1) available, it can be swapped into the array with a command like:
btrfs replace start /dev/sdc1 /dev/sde1 /mnt
The -r flag, if given, will keep the system from reading from the outgoing drive unless no other copy of the data is available. Replacement operations can be canceled, but they cannot be paused. Once the operation is complete, /dev/sdc1 will no longer be a part of the array and can be disposed of.
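The progress of a replacement can be queried, and the operation abandoned, with the related subcommands:
btrfs replace status /mnt      # show how much of the data has been copied so far
btrfs replace cancel /mnt      # give up and leave the original drive in place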
Should a drive fail outright, it may be necessary to mount the filesystem in the degraded mode (with the "-o degraded" flag). The dead drive can then be removed with:
btrfs device delete missing /mnt
The word "missing" is recognized as meaning a drive that is expected to be part of the array, but which is not actually present. The replacement drive can then be added with btrfs device add, probably followed by a balance operation.
The multiple-device features have been part of the Btrfs design from the early days, and, for the most part, this code has been in the mainline and relatively stable for some time. The biggest exception is the RAID 5 and RAID 6 support, which was merged for 3.9. Your editor has not seen huge numbers of problem reports for this functionality, but the fact remains that it is relatively new and there may well be a surprise or two there that users have not yet encountered.
Built-in support for RAID arrays is one of the key Btrfs features, but the list of advanced capabilities does not stop there. Another fundamental aspect of Btrfs is its support for subvolumes and snapshots; those will be discussed in the next installment in this series.
Btrfs: Working with multiple devices
Posted Dec 30, 2013 23:24 UTC (Mon) by fandingo (subscriber, #67019) [Link]
For RAID, there's a massive benefit, and Jon did a great job of explaining much of the benefit. A RAID layer that knows about the file system is far more powerful and intelligent.
On the other hand, it's easy to look at other Device Mapper targets and wonder why they won't become a part of Btrfs; dm-crypt is a particularly interesting example. Btrfs has no builtin support for encryption, and there are no plans to add any. Instead, the authors advise users to set up each drive individually with LUKS. That's a poor solution. Now you have N hard drives, each configured with different keys and salts (although the same passphrase), that may need to be unlocked separately before mounting the multi-device Btrfs file system.
Btrfs is supposed to be a fancy multi-device, super intelligent file system, but the way that it treats encryption is lousy.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 0:53 UTC (Tue) by jengelh (guest, #33263) [Link]
Well, the cryptanalysts will probably find some scare stories. If you do not encrypt at the block layer, it is conceivable that your tree structure(s) become somewhat visible (as is the case with encfs).
Btrfs: Working with multiple devices
Posted Dec 31, 2013 2:17 UTC (Tue) by fandingo (subscriber, #67019) [Link]
Btrfs writes directly to the devices. It's not stacked on top of a regular file system like Encfs (where the regular file system's metadata is what leaks and is the concern that you noted). It would be easy for Btrfs to have an encryption "plugin" that data passes through on the way to being written to or read from the various devices, with a simple LUKS-style header on one or more of the disks.
The problem with Encfs entirely lies with the lower file system (Ext4, XFS, etc.) because that metadata (file names, inode/extents, permissions/attributes, etc.) is not encrypted. Btrfs is completely capable of securing it.
It always seemed like a silly omission. It would not have been that complicated largely because the dm-crypt code could have been used, and the kernel has secure key storage and encryption/hash algorithms already implemented. Unfortunately, the clearest way to implement it would be a LUKS-like header on the file system, and I don't think such a thing is possible since the on-disk format was declared stable recently.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 17:41 UTC (Tue) by deepfire (guest, #26138) [Link]
With LUKS you cannot do any transformation on the content of the block device -- any attempt will result in detectable corruption.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 17:44 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]
I don't think that manipulation of the tree structure is such a big issue.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 19:36 UTC (Tue) by smurf (subscriber, #17840) [Link]
Let's say I have my /home encrypted with encfs. Let's assume that Joe Cracker, or a Nameless Subversive Adversary (nice acronym, I might keep it), can email me an attachment which in most typical mail programs ends up somewhere on disk. It's easy to find that file.
It's also easy to find my holiday pictures, or my tax returns. Just check the file's modification date and the file size (+/- a few bytes).
You can now plant incriminating evidence. Send appropriate emails which look like spam, confiscate my hard disk, move files around.
I basically have no chance in hell to convince a jury that yes, this is an encrypted system to which only I have the key (unless another Nameless Subversive Adversary has implanted a keylogger in my keyboard's controller), BUT no, my notes originally stated that this money got sent to a banana shipping company, NOT to one that typically ships cocaine; and no, I did NOT take this expertly-manufactured image of me banging an underage prostitute.
Just as one not-entirely-farfetched example.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 19:45 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]
Knowing plaintext of any e-mail (or any file, for that matter) gets you exactly nothing, because ecryptfs also authenticates file content with MACs so the content is not malleable.
Btrfs: Working with multiple devices
Posted Jan 2, 2014 16:35 UTC (Thu) by drag (subscriber, #31333) [Link]
If your machine is running and the filesystem is mounted when it is captured, then the encryption is largely worthless at protecting your data. If your machine is compromised while the FS is mounted, or at any point after that when the file system gets mounted, then it's not going to help you.
Btrfs: Working with multiple devices
Posted Jan 2, 2014 16:49 UTC (Thu) by drag (subscriber, #31333) [Link]
command line option:
rd.luks.key=<keypath>:<keydev>:<luksdev>
http://www.freedesktop.org/software/systemd/man/systemd-c...
Note that I have not tried this out myself.
Btrfs: booting with degrated root-partitions?
Posted Dec 30, 2013 21:58 UTC (Mon) by jaa (subscriber, #14170) [Link]
Secondly, what will happen nowadays with btrfs if you have btrfs (raid1/raid5) as the root partition and one drive is missing? Last time I tried, the result was a busted boot and a drop to the emergency shell (I think it was with Fedora F18 or F19). Or maybe this has more to do with the boot scripts of the distribution in question?
In any case, it would be interesting to hear whether it is possible to support degraded boot with a btrfs root partition.
Btrfs: booting with degrated root-partitions?
Posted Dec 30, 2013 22:07 UTC (Mon) by corbet (editor, #1) [Link]
In my experience, btrfs will not mount a filesystem in the degraded mode unless explicitly told to do so. So yes, I can see how that could break things at boot time. It may take a while for various distributions' initramfs and boot-time scripts to catch up.
Btrfs: booting with degrated root-partitions?
Posted Dec 31, 2013 1:07 UTC (Tue) by neilbrown (subscriber, #359) [Link]
In the "good old days" when boot was sequential and predictable you would just wait until all physical devices had appeared, then assemble arrays out of whatever happens to be available, starting arrays as degraded if not enough devices are present because it is obvious that the missing devices are absent.
In the "brave new world" where everything is dynamic and hot-plugged there *is*no*moment* when you can say "everything has been discovered" so there is no moment when you can clearly say "No more devices will appear, I may as well start whatever arrays I can, even if they are degraded".
My current approach is to start a timer when it is first possible to assemble an array degraded. If the missing device(s) appear(s) before the timer goes off, the array gets started and the timer is stopped. If the timer fires, the array is started degraded.
Choosing a timeout that is appropriate for a system with 2 drives, and also appropriate for a system with 1024 drives is not straight forward....
Maybe I want to restart the timer any time any device at all appears .... But I don't think that is straight forward either :-(
Btrfs: booting with degrated root-partitions?
Posted Dec 31, 2013 10:39 UTC (Tue) by pbonzini (subscriber, #60935) [Link]
That's pretty much a watchdog, isn't it?
Btrfs: booting with degrated root-partitions?
Posted Dec 31, 2013 19:39 UTC (Tue) by smurf (subscriber, #17840) [Link]
No? Any new device triggers a bunch of udev events. Even if that last RAID disk is attached behind three USB hubs (don't laugh, it can happen), enough events fire regularly that a timeout of five seconds or so should not be a problem.
Btrfs: booting with degrated root-partitions?
Posted Jan 5, 2014 13:32 UTC (Sun) by lbt (subscriber, #29672) [Link]
For non-bitmapped arrays it may be sensible to have a reserved bitmap area which is enabled when starting degraded and no-bitmap means no-bitmap-when-not-degraded.
Btrfs: booting with degrated root-partitions?
Posted Jan 5, 2014 19:10 UTC (Sun) by neilbrown (subscriber, #359) [Link]
Not sure I'd call it "the correct" solution though.
Having this approach as a fall-back could be used to justify having a fairly short timeout for my current solution.
Btrfs: booting with degrated root-partitions?
Posted Jan 5, 2014 20:08 UTC (Sun) by dlang (✭ supporter ✭, #313) [Link]
in addition, with the btrfs method of RAID, you will have multiple points where the array is 'complete enough to start' for different types of data, and if you have some data that's not replicated, you will need to wait for every disk.
Btrfs: booting with degrated root-partitions?
Posted Jan 6, 2014 18:12 UTC (Mon) by lbt (subscriber, #29672) [Link]
Yes - you have to rebuild those parts which would have been written; but that's a little like a deferred write and is an aspect of what write-intent bitmaps in md are for.
Once the bitmap is full that could become a complete re-sync; but given I've used my system so much that I've filled up the bitmaps I'd say I'd have been rather grumpy if the alternative was that I was still sitting watching a boot prompt and waiting for a device to arrive (by courier rather than usb one assumes!)
I don't follow - if an array is "complete enough to start [degraded]" for say data but not metadata then it's not yet complete enough to start. If a given non-resilient configuration can't use this early boot method then that's OK.
In general this is about handling on-going device arrivals in a RAID and avoiding blocking until "all devices are ready"; I'm suggesting that an optimistic approach combined with dynamic write-intent bitmaps may be a solution.
Also note that degraded assembly in RAID terms is not the same as forcing a potentially corrupt assembly where event counts mismatch.
Btrfs: booting with degrated root-partitions?
Posted Jan 7, 2014 22:46 UTC (Tue) by neilbrown (subscriber, #359) [Link]
http://git.neil.brown.name/git?p=mdadm.git;a=tree;f=systemd;...
Btrfs: booting with degrated root-partitions?
Posted Jan 9, 2014 0:56 UTC (Thu) by jschrod (subscriber, #1646) [Link]
Cool -- thank you. If we ever meet on some conference, a bottle of wine's on you.
Cheers,
Joachim (FWIW, core-TeXie)
Btrfs: booting with degrated root-partitions?
Posted Jan 9, 2014 23:34 UTC (Thu) by Lennie (subscriber, #49641) [Link]
Even if I pass the right degraded argument to the kernel commandline and completed a rebalance before I did a shutdown and removed the drive, it did not work.
I think sometimes grub was even able to read the ramdisk and kernel from the disk, but the kernel failed to mount the filesystem properly.
Just try it.
It makes me kind of sad.
RAID 5/6 not ready for production
Posted Dec 30, 2013 22:03 UTC (Mon) by jhhaller (subscriber, #56103) [Link]
I would also recommend that adding a drive or drives is best done when the filesystem is nearly empty. Scrubbing a 13TB (raw space) RAID-5 filesystem took almost a week, and balancing the new drive into RAID-5 is still in progress.
RAID 5/6 not ready for production
Posted Jan 9, 2014 12:12 UTC (Thu) by Darkstar (guest, #28767) [Link]
This would probably mean that data and metadata RAID levels must match but that would be a small price to pay for getting rid of RAID5 rebalancings
RAID 5/6 not ready for production
Posted Jan 9, 2014 12:46 UTC (Thu) by obrakmann (subscriber, #38108) [Link]
Also, the effort required to avoid I/O bottlenecks for a similar filesystem in Linux would be huge (and would probably trigger quite a patent scare). The memory requirements would probably also be quite high.
That said, I love working with their systems so much, and man would it be nice to have a mainline Linux filesystem with a similar feature set! Btrfs just falls far too short for that.
Btrfs: Working with multiple devices
Posted Dec 30, 2013 22:10 UTC (Mon) by smurf (subscriber, #17840) [Link]
One question: suppose I use RAID6 for metadata and RAID5 for data. Two drives fail. Random pieces of my data are now unreadable … so what exactly is the use case for having more redundancy in my metadata? The only reason that I can think of would be a slow off-site backup copy (btrfs' incremental backup feature is rather nice for creating that, admittedly) which I might use to restore the unreadable parts of my data – but is there a btrfs tool which emits the missing filename+offset+length triplets which I could use to fetch and re-write the holes? Speaking of backup, does the output (and thus the applicability) of btrfs' incremental backup depend on the disk geometry, or can I apply any backup as long as the source and destination snapshots' contents match?
Btrfs: Working with multiple devices
Posted Dec 30, 2013 22:27 UTC (Mon) by dlang (✭ supporter ✭, #313) [Link]
if you lose too many data drives, you have lost some random chunk of your data.
if you lose too many metadata drives, you have lost _all_ your data.
that can make it worthwhile to protect your metadata more.
Btrfs: Working with multiple devices
Posted Jan 4, 2014 0:07 UTC (Sat) by giraffedata (subscriber, #1954) [Link]
> if you lose too many data drives, you have lost some random chunk of your data.
> if you lose too many metadata drives, you have lost _all_ your data.
Actually, if you lose too many data drives, you have lost myriad random chunks spread throughout your data. The question is whether there is a use case where that is substantially better than losing everything.
I think such use cases are too rare for us to talk about using higher redundancy in the metadata than in the data. I think the reason for using higher redundancy in the metadata than in the data is spot defects (you lose one whole drive plus one sector of another, not two whole drives).
Btrfs: Working with multiple devices
Posted Jan 4, 2014 14:54 UTC (Sat) by khim (subscriber, #9252) [Link]
I sincerely hope they'll add the ability to use different RAID modes for different files soon, because I can easily see a case for that: a 20-30GB game downloaded via Steam does not need redundancy at all (I can easily download it again), while the 1000-10,000 times smaller save files for those same games deserve as much redundancy as I can get. Different RAID modes for data and metadata? Meh. Not interested.
Btrfs: Working with multiple devices
Posted Jan 4, 2014 22:30 UTC (Sat) by khim (subscriber, #9252) [Link]
Of course not - and metadata most definitely belongs in the bucket where important files reside. But if said bucket can only contain metadata and nothing else then… I have no use case for that.
I guess I could try to convert my important data into metadata (you can convert a file into a directory if you use filenames as small pieces of content), but all these tricks quickly become an exercise in frustration.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 14:05 UTC (Tue) by masoncl (subscriber, #47138) [Link]
Also raid1 for metadata and raid5/6 for data will tend to perform much better than raid5 for metadata (fewer sub stripe writes).
-chris
Btrfs: Working with multiple devices
Posted Jan 1, 2014 0:31 UTC (Wed) by ttonino (subscriber, #4073) [Link]
A single sector metadata error might make a part of a tree unavailable, leading to a huge loss of data.
A single sector data error may hit a small file, an unimportant file, a video file where it will hardly be noticed when replaced with 00's...
Btrfs: Working with multiple devices
Posted Jan 1, 2014 8:57 UTC (Wed) by smurf (subscriber, #17840) [Link]
It's a question of what the file system does when there's a bad block. Recover from RAID, sure. Beyond that? Log the error and forget about it? Try to rewrite the offender? Mark the offending block as bad and write the recovered data to someplace else? Drop the whole RAID disk?
Is there an integrity test which reads redundancy blocks and verifies that they match the actual data?
MD has such a test, but it reads the whole disk regardless of whether there are actual data or just empty space, and when it does find an error it's next-to-impossible to figure out which file the problem belongs to.
BACKUPS??!!!
Posted Jan 1, 2014 12:26 UTC (Wed) by liw (subscriber, #6379) [Link]
If it's data you care about, you make backups. Any data you haven't backed up, you WILL lose.
If you really care about the data, you back it up multiple times. I used to have 7 backup copies of all my precious live data, not counting multiple copies on the same storage system (e.g., those made by RAID). Some of these were backups of backups, because that was necessary to thwart certain attack scenarios.
RAID is excellent for reducing the need to restore: if RAID can fix a disk failure problem for you, you don't have to restore. But RAID is not a backup, and relying on it means you will lose data.
It is the new year. Apart from 1920x1080 and 445 PPI, a really good resolution for this year would be to make sure you a) review all your data to know what's precious b) review and test all your procedures for backing up, restoring, and verifying backups c) help your loved ones convert from "I know I should" to "I do and I know they work".
(This was my first backup rant of 2014. I will make more. Be prepared.)
BACKUPS??!!!
Posted Jan 1, 2014 16:34 UTC (Wed) by smurf (subscriber, #17840) [Link]
btrfs is way cool for storing backups. Store the backup on a btrfs volume, create a snapshot before "rsync -a --inplace --delete"ing to the master copy, periodically throw away some (but not all) old snapshots, you're done.
(Except for these pesky database files which need to be dumped instead of block-copied … so, instead of many FS snapshots, you store a few DB snapshots plus the transaction logs.)
These days, of course, you can do the snapshot on the master and use btrfs' diff-and-apply to extract the changes, which is a whole lot faster than using rsync.
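A rough sketch of both variants, with made-up host and path names; the first two commands implement the rsync scheme, the last the send/receive ("diff-and-apply") scheme, which works from read-only snapshots on the master:
btrfs subvolume snapshot -r /backup/current /backup/2014-01-01     # keep yesterday's state
rsync -a --inplace --delete master:/data/ /backup/current/         # refresh the master copy
btrfs send -p /data/snap-old /data/snap-new | ssh backuphost btrfs receive /backup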
BACKUPS??!!!
Posted Jan 16, 2014 17:57 UTC (Thu) by oldtomas (guest, #72579) [Link]
Sometimes backups in the classical sense are just impractical: the image archive with hundreds of terabytes of images; you know something went sour in there, but a restore will take you a week (your customers are elsewhere by then)...
File systems supporting those classes of problem will become more and more important as long as practical capacity grows faster than practical bandwidth (as is the case at the moment). Either something like btrfs or something totally distributed (à la Ceph -- I don't know which direction Hammer has taken of late).
Btrfs: Working with multiple devices
Posted Jan 1, 2014 13:24 UTC (Wed) by masoncl (subscriber, #47138) [Link]
If you're raided, it'll fix the bad block (raid5/6 don't support this yet), and there are inspection commands to follow backrefs to find the file that owns the block.
-chris
Btrfs: Working with multiple devices
Posted Jan 1, 2014 18:17 UTC (Wed) by kreijack (guest, #43513) [Link]
root@octopus:~# btrfs device stat /mnt/storage/
[/dev/sdc].write_io_errs 0
[/dev/sdc].read_io_errs 0
[/dev/sdc].flush_io_errs 0
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0
[/dev/sdd].write_io_errs 0
[/dev/sdd].read_io_errs 0
[/dev/sdd].flush_io_errs 0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
Btrfs: Working with multiple devices
Posted Dec 30, 2013 22:24 UTC (Mon) by ssam (guest, #46587) [Link]
If you start with a single drive system (defaults to data=single, metadata=dup), then add a drive and do
btrfs fi balance start -dconvert=raid1 /mnt
you will end up with a filesystem where your data is nicely protected by raid1, but your metadata is still dup, so not guaranteed to be spread across drives.
It's up to the user to make sure that the metadata is at least as well protected as the data.
(details of my experience http://thread.gmane.org/gmane.comp.file-systems.btrfs/206...)
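A sketch of converting both profiles at once, so that the metadata ends up at least as well protected as the data (the mount point is a placeholder):
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt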
Btrfs: Working with multiple devices
Posted Dec 31, 2013 2:13 UTC (Tue) by ceswiedler (guest, #24638) [Link]
This is going to get fixed at some point, right?
Btrfs: Working with multiple devices
Posted Dec 31, 2013 10:06 UTC (Tue) by cwillu (subscriber, #67268) [Link]
Add into this the plans to have different allocation policies per subvolume/folder/file/whatever, and mix thoroughly with the syscall du uses not actually returning the values du prints to the screen (du needs to do some simple arithmetic), and you can see how this gets complicated. Calculating a simple free space value ends up requiring you to know the future.
Btrfs: Working with multiple devices
Posted Jan 2, 2014 2:43 UTC (Thu) by zuki (subscriber, #41808) [Link]
It should be enough for btrfs to "predict the future" as a simple extension of present time. If right now I have a certain RAID level, it'd be quite useful for df to return how much free raw disk space I have divided by current average duplication level. Every other answer seems less useful.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 16:16 UTC (Tue) by fandingo (subscriber, #67019) [Link]
The problem is that people conflate two questions when they use df:
1) How much space is not currently used?
2) How much space is available for writing?
Traditionally, the answers to both questions were almost identical. Btrfs changes that dramatically. Df answers #1, but the user is usually asking #2.
#2 cannot really be answered on Btrfs due to its features and design. Currently, you can set dup and RAID mode for data and metadata independently. Df could cope with this *as the features are currently implemented* because it's only possible to set those values for the entire file system. However, eventually it will be possible to set these at the subvolume (or even per-file!) level, which causes a problem. Additionally, there are features like compression or dedup that complicate things in the other direction.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 16:30 UTC (Tue) by ceswiedler (guest, #24638) [Link]
I get that in some cases it becomes impossible to answer the question in a simple fashion. And I don't expect that I, the armchair filesystem designer, am going to come up with better solutions than the btrfs people.
But in the case listed, to answer the question of "how much space on a raid1 set" with "drivecapacity*2" seems really poor.
Wouldn't it be better to use some conservative estimates for amount of space writeable rather than just listing the total free space? Just because df isn't going to be able to give a precisely correct number, I don't see why it needs to give a wildly incorrect one (in terms of how much space is writeable).
Dedup, compression, subvolumes, file storage options, small files versus big ones--all of those I expect to be variables I have to cope with as a user. I get that. But I expect the basic df to tell me "if I wrote a stream of bytes to a generic location on this drive, ignoring duplication or compression, how many bytes could I write?"
I don't mind that the number wouldn't be precise, or that in some cases it might not be correct at all. To me that's like running out of inodes when you still have space--okay, I can deal with it. But most of the time, space will run out first (by design) and so that's what I care about at a high level.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 17:03 UTC (Tue) by fandingo (subscriber, #67019) [Link]
There's your problem. That's not what df is meant to provide.
Btrfs is an extraordinarily generic file system. Even though not all the features are complete, the idea is that you can enable a feature *anywhere* in the file system. Want to make a single directory RAID6 but the rest of the file system data single? Sure, that fits with the architecture, although there is no present implementation.
> Wouldn't it be better to use some conservative estimates for amount of space writeable rather than just listing the total free space? Just because df isn't going to be able to give a precisely correct number, I don't see why it needs to give a wildly incorrect one (in terms of how much space is writeable).
What is conservative? Why is it df's job to calculate what an individual thinks is reasonable? There are two problems with this approach:
1) The result will always be incorrect. At least the current implementation gives a correct answer, even if the user believes the df is answering the wrong question.
2) There's simply no way to provide a consistent answer without knowing the write() operation. There's no "generic location," and that question directly undermines the flexibility of Btrfs.
To me, whoever or whatever is reading the output of df or `btrfs fi df` should know the file system configuration and the implications of the reported free space to their desired operation. I'd much rather have accurate information, even if it means that I have to recognize, "hey, this file system is in dup, raid6, or what ever."
There's simply too many different things that can affect on-disk use in Btrfs for df to take into account. The answer is very likely to be plain wrong and no less clear.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 18:52 UTC (Tue) by raven667 (subscriber, #5198) [Link]
Having df provide an estimate of free disk space by default for btrfs, with an option to provide a more accurate but less useful raw number, seems very reasonable to me.
Btrfs: Working with multiple devices
Posted Dec 31, 2013 19:51 UTC (Tue) by smurf (subscriber, #17840) [Link]
All _I_ care about is that this number from df is somewhat smaller than the actual number of bytes I am still guaranteed to be able to write. In the terminal case, I do NOT want to see ENOSPC when df tells me there's still a GB free on that disk.
Anything else is nice-to-know; if df tells me that there's 1GB free but I can copy 10GB onto the disk before my write actually errors out, so much the better.
> a more accurate but less useful raw number
What would you use that number for if it's less useful?
Btrfs: Working with multiple devices
Posted Jan 1, 2014 18:10 UTC (Wed) by kreijack (guest, #43513) [Link]
> All _I_ care about is that this number from df is somewhat smaller than
> the actual number of bytes I am still guaranteed to be able to write. In
> the terminal case, I do NOT want to see ENOSPC when df tells me there's
> still a GB free on that disk.
I wrote [1] a patch which estimated the available space. The assumption behind this estimate was that the ratio of bytes-allocated-on-the-disk to bytes-consumed-by-files is constant in a filesystem.
With all its limits, this estimate reached two goals:
- it would be compatible with all the code which is unaware of BTRFS and its behaviour
- it gave a good estimate as the space became exhausted.
The main complaints about this patch concerned the wording: there was no consensus about how to describe
- the space occupied on the disk
- the free space available for the file
- the sum of the space consumed by the files
From time to time a new person came along and suggested a new set of wording....
Maybe it is time for me to push this patch set again..
Btrfs: Working with multiple devices
Posted Jan 6, 2014 13:35 UTC (Mon) by cesarb (subscriber, #6266) [Link]
Why not something simpler? The free space is the amount of bytes which could be written, on a single new file on the root of the subvolume, using the default data and metadata mode for the subvolume, and with worst-case expansion in case of transparent compression or deduplication.
This should match the intuitive expectation of the user: if df says I have a hundred megabytes of free space, I should be able to create at least a single file containing a hundred megabytes, give or take a few hundred kilobytes for a metadata fudge factor.
If after creating the hundred megabyte file it says I still have thirty megabytes free, that is great (extra disk space for free!). But if it says I have a hundred megabytes of free space, and writing a new file fails after fifty megabytes written, that is bad and should not happen unless there was a hardware failure.
(I say a single file because creating a hundred megabyte-sized files is obviously more expensive than a single hundred-megabyte file, and there is no way of knowing how many files the user will create.)
Btrfs: Working with multiple devices
Posted Jan 6, 2014 13:45 UTC (Mon) by cesarb (subscriber, #6266) [Link]
The total size is the amount of bytes which could be written, using the definition from my above comment, if the filesystem were *empty*. The used size is (total - free).
This would have the following desirable properties:
- A freshly formatted filesystem would have "used" 0 and "free" = "total".
- A completely full filesystem would have "free" 0 and "used" = "total".
- Adding new files is monotonic; "used" increases (or stays the same), "free" decreases (or stays the same).
- The "total" value does not change, no matter how many files you add or remove.
- If "free" says you have x bytes, you can create at least a file of up to x bytes, even if its contents came from /dev/urandom.
Btrfs: Working with multiple devices
Posted Jan 1, 2014 12:06 UTC (Wed) by nix (subscriber, #2304) [Link]
df provides crucial facilities btrfs-specific commands cannot: it reports on all filesystems at once (for a summary overview), and it reports on filesystems that may not be btrfs. Retaining those features is, IMHO, crucial, unless we're going to a world in which everyone has one huge btrfs filesystem and nothing else (all eggs in one data structural basket, not really my idea of a good time).
Btrfs: Working with multiple devices
Posted Dec 31, 2013 8:18 UTC (Tue) by jezuch (subscriber, #52988) [Link]
I've read that there was an effort to unify the various RAID implementations in md, dm and btrfs. How's that going?
Btrfs: Working with multiple devices
Posted Dec 31, 2013 14:08 UTC (Tue) by masoncl (subscriber, #47138) [Link]
As I add N-way mirroring, the raid10 code will also gain more features.
-chris
Btrfs: Working with multiple devices
Posted Jan 1, 2014 0:44 UTC (Wed) by ttonino (subscriber, #4073) [Link]
Files much shorter than 1 MB are better off being on a single disk, even for an otherwise idle disk (the slowest of 2 disks is slower than the fastest one of them, and as slow as the slowest).
Striping is effective only when a single file is dominating the disk busy times - and that file is larger than a few megs. Otherwise, one file per disk is better.
Btrfs: Working with multiple devices
Posted Jan 2, 2014 21:44 UTC (Thu) by drag (subscriber, #31333) [Link]
Yes. And these block sizes can be critically important to avoid massive penalties when writing small files to deeply stacked storage arrays.
Btrfs: Working with multiple devices
Posted Jan 1, 2014 5:30 UTC (Wed) by dirtyepic (subscriber, #30178) [Link]
When there's more than one drive the metadata defaults to "raid1", but I think they're pretty much the same thing.
Btrfs: multiple device support: remove it!
Posted Jan 1, 2014 16:54 UTC (Wed) by dougg (subscriber, #1894) [Link]
I'm looking at SCSI xcopy version 2 lite, what Microsoft calls Offloaded Data Transfer (ODX). It facilitates "point in time" copies via ROD (representation of data) tokens. This is almost a perfect mechanism for backups on a live system (i.e. consistent, almost no down time). With the help of a file system log it seems to solve the storage side of what VMWare calls vMotion: the ability to move a VM from one server to another while it is operating. Nicolas Bellinger's Linux target subsystem is part of the way toward implementing ODX; it already implements the building block behind VMWare's VAAI in this area (i.e. SCSI xcopy version 1).
http://blogs.technet.com/b/matthewms/archive/2013/09/13/v...
The redundant array of independent disks (RAID) is a remnant of the age of rotating disks. File systems should forget about it when looking to the future.
Btrfs: multiple device support: remove it!
Posted Jan 1, 2014 21:07 UTC (Wed) by rvfh (subscriber, #31018) [Link]
But then you lose the ability to treat data and metadata differently...
Btrfs: multiple device support: remove it!
Posted Jan 2, 2014 1:19 UTC (Thu) by ncm (subscriber, #165) [Link]
There's no requirement for btrfs to take over all your RAIDity, is there? I.e. you could still run it on top of MD (or what-have-you), so long as btrfs sees it as a block device? So the one benefit of letting btrfs do the work is finer-grained control over what is stored how.
Btrfs: multiple device support: remove it!^w^w
Posted Jan 4, 2014 14:54 UTC (Sat) by kreijack (guest, #43513) [Link]
There are a lot of optimizations that could be implemented with an integration between the filesystem and the storage layer:
- the "RAID write hole"[1] disappears: the filesystem checksums allow it to determine which data is correct
- moreover, you can also detect errors due to random bit flips on the platter (disk sizes are becoming comparable with the error rate [2]) and correct them.
- you don't need to "prepare" the raid before using it: with btrfs it takes only a few seconds to create a filesystem with RAID
- you can have different raid profiles in the same filesystem: today BTRFS allows different raid profiles only for data and metadata. Allowing different raid profiles on a per-subvolume basis has been discussed. [Even though I am not so sure that it is a good idea, from an administrator's point of view]. For example:
- you can put metadata on RAID1
- "valuable data" on raid 5/6
- "cache data" (i.e. downloaded films) on raid0
The only point on which I have to agree is that integrating RAID into a filesystem requires a longer development time.
[1] http://www.raid-recovery-guide.com/raid5-write-hole.aspx
[2] WD reports in its data-sheet an error rate of 1/10^14; a 1 TB disk is about 10^13 bits.... so if you read the whole disk ten times, WD says that it is possible that you get a bit wrong
Btrfs: multiple device support: remove it!
Posted Jan 2, 2014 1:04 UTC (Thu) by smurf (subscriber, #17840) [Link]
For one, backups at the block level are a very stupid idea, especially if you have a log-structured file system, or one that can do snapshots.
Also, you will need to have the whole file system in a consistent state in order to make a block-level snapshot. Surprise: this is an expensive operation. Not all work loads tolerate that.
Also, speaking of taking care of size: Resizing a RAID (by adding a disk) is an expensive operation which touches the whole disk, even if most space is not allocated yet. On the other hand, as I understand it (correct me if I'm wrong) btrfs will only spread new files onto a new disk, which allows one to re-balance the file system at one's leisure. In my case, if I added a new disk, rebalancing the whole RAID would slow down my site for a week; btrfs would require around 80 hours, and I can limit that to dead time in the early morning.
Also, a link to something that reads like a Windows commercial is hardly an endorsement for THIS crowd. :-P
In any case, the text you linked to talks about treating a file like a file system, except that this file gets loopback-mounted on the SAN instead of on the host machine and you appear to be able to shrink it. The overlap between that blatant Windows commercial ^w^w^w blog post, and this topic, is essentially zero.
Btrfs: multiple device support: remove it!
Posted Jan 2, 2014 23:34 UTC (Thu) by dougg (subscriber, #1894) [Link]
Yes, "you will need to have the whole file system in a consistent state in order to make a block-level snapshot. Surprise: this is an expensive operation." And a point_in_time ROD based read can be made a very quick operation; continue almost immediately, backup ready to write whenever, take as many copies as you like; migrate that consistent FS state to another machine, etc.
How many months has btrfs been delayed while its authors tried to re-invent and re-implement RAID?
So you don't like marketing BS? Well welcome to planet earth. Just maybe there might be some useful information lurking in there. There doesn't seem to be much pedagogic material available. If you prefer something dry and obtuse, I can provide references to SPC-4 and SBC-3 drafts found at www.t10.org [No marketing BS there.]
Btrfs: multiple device support: remove it!
Posted Jan 3, 2014 0:38 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
For example, it's already possible to use different RAID levels for metadata and data. It's also going to be possible to set individual RAID levels for individual files.
ZFS can do both, btw.
Btrfs: multiple device support: remove it!
Posted Jan 3, 2014 7:40 UTC (Fri) by smurf (subscriber, #17840) [Link]
You cannot do snapshots and all the other cool ideas marketing talks about without file system and OS support. Which block-level technology is below that (file system's RAID, kernel block-level RAID, hardware RAID, no RAID, block storage file of the VM's host; directly connected, NAS, iSCSI, Acronym-of-the-Day) is completely irrelevant.
Btrfs: Working with multiple devices
Posted Jan 2, 2014 7:26 UTC (Thu) by mrjoel (subscriber, #60922) [Link]
In short, to force-fail a drive from a RAID1 in order to replace it, the filesystem must be unmounted, forcibly mounted as degraded, the devices swapped, re-added, and synced, then un/remounted to remove the degraded flag. I find it a requirement (with hotswap drives) to keep the filesystem mounted, force-remove a disk, dynamically go into the degraded state, pull the drive, add a replacement, have the sync occur, and then dynamically leave the degraded state. That isn't (as of four months ago) possible with btrfs!
Btrfs: Working with multiple devices
Posted Jan 6, 2014 14:29 UTC (Mon) by kreijack (guest, #43513) [Link]
Anyway BTRFS is a young filesystem with a lot of capabilities, so some problems exist.
For example, the last time I tried, it was impossible to remove a failed device (even though it is possible to replace it!)
GB
Btrfs: Working with multiple devices
Posted Jan 6, 2014 14:43 UTC (Mon) by mrjoel (subscriber, #60922) [Link]
Precisely for that reason. The particular board I was doing this testing on only had 3 SATA connectors, and I used one for a boot/system SSD and the other two for a RAID1 data volume. In order to have a new device to replace the old one with, I needed to force failure/removal of an existing one first, which I was not able to do.
The actual thread is below, with several inconsistencies littered throughout.
http://thread.gmane.org/gmane.comp.file-systems.btrfs/279...
Btrfs: Working with multiple devices
Posted Jan 9, 2014 12:59 UTC (Thu) by kreijack (guest, #43513) [Link]
in your case you should
- shutdown the system
- replace (physically) the disk
- reboot the system
- mount the btrfs filesystem in degraded mode ( mount -o degraded /dev/xxxx /mnt/xxxx )
- add the new device (btrfs device add /dev/<new device> /mnt/xxxx)
- remove the "missing" device (btrfs device del missing /mnt/xxxx), where missing is the word "missing" (not a device)
What BTRFS doesn't support is removing a device that is not readable without doing an umount.
Btrfs: Working with multiple devices
Posted Jan 9, 2014 17:33 UTC (Thu) by mrjoel (subscriber, #60922) [Link]
You also forgot to mention that a remount of the filesystem is also required at the end of the resync (when you have to remember after a few hours to follow-up on the system) to exit degraded mode. I found that even after adding a new device I wasn't able to exit degraded mode without remounting (even though the userspace tools reported it wasn't degraded!).
Btrfs: Working with multiple devices
Posted Jan 9, 2014 18:59 UTC (Thu) by kreijack (guest, #43513) [Link]
The two operations can be done without un-mounting the filesystem, with the exception that it is not possible to *remove* a faulty disk without unmounting.
In your case (i.e. a system with only 3 SATA ports), the replace is not a viable solution (there is no free port for the new disk). So you need to shut down the system *anyway*. This is the reason why I used the word "shutdown"
You wrote:
"...mark a disk as failed, hot remove and replace it..."
I read this as replace. This is fully working.
Maybe by inverting your sentences I can be more clear:
- it is not true that I need to unmount the filesystem to remove a device (with the exception of above)
- it is not true that I need to unmount the filesystem to replace a device (even if the device is faulty)
However I agree with you that removing a faulty device should be implemented properly.
Regarding exiting from the "degraded mode", I don't see what the problem is. The degraded mode means that a filesystem can be mounted without some devices. Nothing more. I don't see the need to remount the filesystem to remove this flag.
BTW, I am not suggesting that btrfs can be a replacement for MD NOW. I am quite sure that there are a lot of corner cases not handled properly when a device is faulty and needs to be replaced. However the main pieces are there. It needs only a lot of testing (which is not secondary).
Btrfs: Working with multiple devices
Posted Jan 9, 2014 19:57 UTC (Thu) by mrjoel (subscriber, #60922) [Link]
That is why much money is spent on enterprise servers with redundancy and why the concept of hot-swappable drive trays is so powerful. The variables of filesystem, RAID level, drive size, drive capacity, etc. don't matter (or are very secondary) in that scenario.
You have a fair distinction on remove vs. replace, but only if considered for the long-term or permanent desired end state. However, I need to remove/degrade/fail a drive prior to bringing the replacement drive online, but only long enough to replace it with another one.
I happen to be doing this testing with a limited system with 3 SATA ports, but that seems irrelevant since typically a system owner will want to use all drives that are available, and don't reserve an extra SATA/SAS channel for the times when a drive fails. The notion that I have to shutdown the system because of the number of sata ports or to make the new disk available is ludicrous to me - I have hot-swappable drive bays intentionally to not require that.
I suppose I could forcibly physically remove the failing drive out from under btrfs, but that's not a clean option either. I'd rather give the system fair notice to stop using the device, but btrfs doesn't allow me to do that.
>> "...mark a disk as failed, hot remove and replace it..."
> I read this as replace. This is fully working.
No, it is not fully working; it fails at the first step! The other two steps do work in other scenarios, but claiming full (or two-thirds) functionality when it breaks at the first step is quite optimistic. To quote myself on the btrfs mailing list:
"I was surprised to find that I'm not allowed to remove one of the two drives in a RAID1. Kernel message is 'btrfs: unable to go below two devices on raid1' Not allowing it by default makes some sense, however a --force flag or something would be beneficial."
In the end, I'd like btrfs to improve, but it has been slow going, and with this latest round of failed testing of what is a simple enterprise requirement, I've lost most expectation of btrfs filling this need. I'm not against btrfs; I use it often on single-drive systems. I just don't have any confidence in it for anything but the gentlest handling of multi-drive systems.
This isn't just supposition or a paper study; I've tried it, and the following are quotes from Chris Murphy on the referenced thread:
"I'd expect to have to add a device and then remove missing"
"However, after adding a device and deleting missing, ... the btrfs volume is still mounted degraded. Further, if I [remount the filesytem with -o remount] it's still degraded. I had to unmount and mount again to clear degraded. That seems to be a problem, to have to unmount the volume in order to remove the degraded flag, which is needed to begin the rebalance. And what if btrfs is the root file system? It needs to be rebooted to clear the degraded option."
Btrfs: Working with multiple devices
Posted Jan 9, 2014 22:34 UTC (Thu) by kreijack (guest, #43513) [Link]
> "I was surprised to find that I'm not allowed to remove one of the two
> drives in a RAID1. Kernel message is 'btrfs: unable to go below two
> devices on raid1' Not allowing it by default makes some sense, however a
> --force flag or something would be beneficial."
I agree with you that being able to force the removal of a disk would be useful; if a disk reports problems, it can slow down the system due to the read-error-reset-retry cycle, so in that case it would be useful to be able to stop using the disk.
> typically a system owner will want to use all drives that are available, and
> don't reserve an extra SATA/SAS channel for the times when a drive fails.
On that I cannot agree; think about the spare disk... However, this doesn't change the point:
btrfs supports an "add-then-remove" replace, but does not support a "remove-then-add" replace.
> That seems to be a problem, to have to unmount the volume in order to
> remove the degraded flag, which is needed to begin the rebalance. And
> what if btrfs is the root file system? It needs to be rebooted to
> clear the degraded option."
Are you sure about that? I tried to rebalance a degraded filesystem (after having added a new disk), and it seems to work:
ghigo@venice:/tmp$ sudo /sbin/mkfs.btrfs -K -f -d raid1 -m raid1 /dev/loop[01]
WARNING! - Btrfs v3.12 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
nodesize 16384 leafsize 16384 sectorsize 4096 size 19.53GiB
Btrfs v3.12
ghigo@venice:/tmp$ sudo mount /dev/loop0 t/
ghigo@venice:/tmp$ sudo umount t/
ghigo@venice:/tmp$ sudo losetup -d /dev/loop0
ghigo@venice:/tmp$ sudo mount /dev/loop1 t
mount: wrong fs type, bad option, bad superblock on /dev/loop1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
ghigo@venice:/tmp$ sudo mount -o degraded /dev/loop1 t
ghigo@venice:/tmp$ sudo btrfs fi balance t/
ERROR: error during balancing 't/' - No space left on device
There may be more info in syslog - try dmesg | tail
ghigo@venice:/tmp$ sudo btrfs dev add -f /dev/loop2 t
Performing full device TRIM (9.77GiB) ...
ghigo@venice:/tmp$ sudo btrfs fi balance t/
Done, had to relocate 2 out of 2 chunks
ghigo@venice:/tmp$ cat /proc/self/mountinfo | grep loop1
79 20 0:39 / /tmp/t rw,relatime shared:63 - btrfs /dev/loop1 rw,degraded,space_cache
Btrfs: Working with multiple devices
Posted Jan 9, 2014 22:39 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]
A lot of people don't run with a spare disk. They run with all the disks active and when one fails they replace it.
Yes, this extends the time spent running in degraded mode, but since the disk will probably be significantly cheaper later, and a hot spare gets no testing to show whether it has developed a problem of its own (and remember, it is running all the time), it's not a completely unreasonable thing to do.
People routinely run their home systems in much riskier modes.
Btrfs: Working with multiple devices
Posted Mar 19, 2014 22:33 UTC (Wed) by dany (guest, #18902) [Link]
A company has servers with many disks, which are arranged with raidz2 redundancy (roughly a RAID 6 analogy) in zpools/zvols. There is no hot spare and there is no extra space for a new disk. One disk fails (in this example, the pool name is "pool" and the failed disk is c7t6d0). The online disk replacement procedure with a live ZFS is:
1. zpool detach pool c7t6d0 #mark failed disk as detached for zfs
2. zpool offline pool c7t6d0 #take disk out of zfs control
3. cfgadm -l | grep c7t6d0 #figure out controller and target of disk
sata5/6::dsk/c7t6d0 disk connected configured ok
4. cfgadm -c unconfigure sata5/6 #power off disk
5. physically replace disk
6. cfgadm -c configure sata5/6 #power on disk
7. fdisk/format c7t6d0 #create partition table on disk
8. zpool online pool c7t6d0 #bring disk under zfs control
9. zpool replace pool c7t6d0 #resilver disk
A similar capability was available with the older volume manager (SDS) in Solaris. For Linux and Solaris there is also the proprietary VxVM, which can do this as well.
So yes, there is a need for exactly this use case on Linux too (no reboot, no remount, no extra space for the new disk). I would definitely expect this functionality from a next-generation filesystem; in fact, for some companies it could be a risk to run a filesystem without this capability. Yes, hot spares are great, but in the real world they are not always available.
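For comparison with the MD baseline discussed earlier in the thread, a remove-then-add replacement with Linux software RAID would look roughly like this; the array name /dev/md0 and member /dev/sdg are hypothetical, and this is a sketch rather than a complete procedure:

mdadm --manage /dev/md0 --fail /dev/sdg      # mark the dying member as failed
mdadm --manage /dev/md0 --remove /dev/sdg    # drop it from the array
# hot-swap the physical drive, repartition it if the array uses partitions
mdadm --manage /dev/md0 --add /dev/sdg       # add the replacement; resync starts automatically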
Btrfs: Working with multiple devices
Posted Mar 19, 2014 22:53 UTC (Wed) by anselm (subscriber, #2796) [Link]
I can't say anything about Btrfs, but the scenario you've outlined should be quite tractable on Linux with LVM and, e.g., Ext4.
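A hedged sketch of what that might look like using LVM's own RAID support, assuming the logical volume was created with lvcreate --type raid6, the volume group is vg0, the LV is named data, and the replacement disk shows up as /dev/sdg (all names hypothetical):

pvcreate /dev/sdg              # initialize the replacement disk
vgextend vg0 /dev/sdg          # make it available to the volume group
lvconvert --repair vg0/data    # rebuild the degraded RAID LV onto the new physical volume
vgreduce --removemissing vg0   # forget the failed physical volume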
Btrfs: Working with multiple devices
Posted Jan 9, 2014 23:22 UTC (Thu) by mrjoel (subscriber, #60922) [Link]
>> typically a system owner will want to use all drives that are
>> available, and don't reserve an extra SATA/SAS channel for
>> the times when a drive fails.
> On that I cannot agree, think about the spare disk... However
> this doesn't change the point:
> btrfs support an "add-remove" replace; but doesn't support
> a "remove-add" replace
I see these as two use cases of the same fundamental capability: removing a device when that action inherently results in running in a degraded state. That is indeed what I couldn't reconcile myself with when I last evaluated btrfs, and it ended up being our showstopper.
On the spare-device issue, sure - having at least one hot spare is very common. However, I expect the hot spare is typically configured in the hardware RAID controller, in which case btrfs just sees a single device and the multi-device support doesn't come into play. That loses btrfs's ability to do direct device integrity checking, but it seems to be the only option, since the btrfs wiki lists "Hot spare support" as not claimed, with nothing done [1]. Setting aside the cases where hardware RAID is used: if an additional drive is spinning in a chassis, the ideal is to have all drives added as RAID 6 and/or kept as a reserved hot spare (in fact, a hot spare in btrfs may end up being just a bump of the redundancy level by one, which would offer additional failure tolerance as well as an additional active spindle for I/O). However, since neither hot spares nor RAID 6 are implemented in btrfs, I assume most people would be inclined to add the extra drive to the mix, especially since btrfs can use an odd number of drives in RAID 1, as the mirroring is done per block rather than per device.
On the degraded-flag changes, it looks like things may have been updated since I was testing in August; I see some mailing-list patches to allow changing feature bits while online, so that sounds like good news. However, even in your example, the mount still shows the degraded flag, which is misleading, though it's understandable why it is required. At the time I was doing my testing, 'btrfs show' didn't reflect the actual runtime status of the filesystem from the kernel, and a quick perusal of the btrfs-progs git tree doesn't show anything updated related to that. So, at least in my mind, the question remains: if a degraded mount option shows up in the mount arguments, how can one determine whether the mounted filesystem is actually still degraded or has been restored to its nominal redundancy state?
[1] https://btrfs.wiki.kernel.org/index.php/Project_ideas#Hot...
Btrfs: Working with multiple devices
Posted Jan 13, 2014 23:53 UTC (Mon) by rodgerd (guest, #58896) [Link]
A controlled removal resulted in an 18-hour wait for less than 1 TB of data to migrate from the old drive to the new one, and then required a reboot to get btrfs to let go of the nominally removed drive.
I was pretty disappointed. Have to try again next year...
Two meanings of RAID-1
Posted Jan 7, 2014 17:50 UTC (Tue) by hpa (guest, #48575) [Link]
If you have N devices, MD will store N copies of your data; any one drive in isolation contains all of the data.
If you have N devices, btrfs will store *2* copies of your data, on any 2 devices out of the N available. You need N-1 drives to remain good to be guaranteed to recover all data.
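A quick way to see the btrfs side of this distinction is a three-device RAID 1 array on loop devices (a sketch; the loop numbers, file names, and exact free-space figures are assumptions):

truncate -s 10G /tmp/a.img /tmp/b.img /tmp/c.img
losetup /dev/loop0 /tmp/a.img
losetup /dev/loop1 /tmp/b.img
losetup /dev/loop2 /tmp/c.img
mkfs.btrfs -f -d raid1 -m raid1 /dev/loop0 /dev/loop1 /dev/loop2
btrfs device scan                         # make sure all members are registered
mkdir -p /mnt/test && mount /dev/loop0 /mnt/test
btrfs filesystem df /mnt/test             # shows Data, RAID1: two copies of every block
# Usable data capacity approaches 15G (30G raw, two copies of each block),
# versus the 10G a three-way MD RAID 1 mirror of the same drives would give;
# the trade-off is that losing any two of the three devices can lose data.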
Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds