
A btrfs update at LinuxCon Europe

By Jonathan Corbet
November 2, 2011
In October, the btrfs user community expressed concerns about the still missing-in-action filesystem checker and repair tool. At that time, btrfs creator Chris Mason said that he hoped to demonstrate a working checker during his LinuxCon Europe session. Your editor was there as part of a standing-room-only crowd ready to see the show; we did indeed get a demonstration, but it may not have been quite what some attendees expected.

Chris started by talking about btrfs and its goals in general; those have been well covered here and need not be repeated now. He reiterated Oracle's plan to use btrfs as the core filesystem for its RHEL-derivative Linux distribution; needless to say, supporting that role requires a rock-solid implementation. So a lot of work has been going into extensive testing of the filesystem and fixing bugs.

The 3.2 kernel release will see the results of that work; it will contain lots of fixes. There will also be significant improvements to the logging code. It turns out that a lot of data was being logged more than once, greatly increasing the amount of I/O required; that has now been fixed. I/O traffic for the log, it seems, has been cut to about 25% of its previous level.

For 3.3, the main improvement seems to be the use of larger blocks for nodes in the filesystem B-tree. Larger blocks can hold more data, of course, and, in particular, more metadata. That means that metadata that was previously scattered in the filesystem can be kept together with the relevant inode. That, in turn, leads to significant performance improvements for many filesystem operations.

Another near-term feature, due to arrive "right after fsck", is the merging of Dave Woodhouse's RAID5 and RAID6 implementations. That work was initially posted in 2009; Chris apologized for taking so long to get it merged. How this feature will actually be used still needs some thought; RAID5 or 6 is quite good for data, but it can be problematic for metadata, which tends not to fill anything close to a full RAID stripe and, thus, can lead to low I/O performance. Happily, btrfs has been designed from the beginning to keep data and metadata separate; that means that things can be set up so that data is protected with full RAID while metadata is managed using simple mirroring.
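That data/metadata split can be expressed directly at filesystem-creation time. A sketch using the mkfs.btrfs profile options (-d for data, -m for metadata) as they would look once the RAID5/6 code is merged; the device names are placeholders:

```shell
# Sketch only: RAID5 for data, simple mirroring (raid1) for metadata.
# The raid5 data profile requires the merged RAID5/6 support;
# /dev/sdb, /dev/sdc, and /dev/sdd are placeholder devices.
mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
```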

Talk of protecting metadata leads naturally to the problem of recovering a filesystem when its metadata has been corrupted. That is what a filesystem checker program is for; btrfs, thus far, has been increasingly famous for its lack of a proper checker (and, more importantly, a proper filesystem repair tool). As of the LinuxCon talk, btrfs still does not have a real repair tool, but some progress has been made in that direction and a couple of other mechanisms have been provided.

The copy-on-write nature of btrfs implies that there will be numerous old copies of the filesystem metadata on the storage device at any given time. Any change, after all, will create a new copy, leaving the previous version in place until the block is reused. Chris observed that filesystem corruptions rarely affect that older metadata, so it makes sense to use it as a primary resource in the recovery of a corrupted disk. But, first, one needs to be able to find that older metadata.

To that end, btrfs maintains an array containing the block locations of many older versions of the filesystem root. The root block, he said, is more important than the superblock when it comes to recovering data. The root is replaced often as metadata changes percolate up to the top of the directory hierarchy, so the "old root blocks" array contains pointers to what is, in effect, a set of snapshots of the very recent state of the filesystem. Clearly, this will be a valuable resource should something go badly wrong.
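That array can be inspected from userspace with the btrfs-find-root utility that later shipped with btrfs-progs; a sketch, with a placeholder device:

```shell
# Sketch: scan an unmounted device for the locations (and generation
# numbers) of older tree roots. /dev/sdb1 is a placeholder.
btrfs-find-root /dev/sdb1
```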

One way of using that array is simply to mount the filesystem using an older version of the root. Chris demonstrated this feature by poking holes in a test filesystem, then mounting an older root to get back to where things had been before. For simple, quickly-detected problems, older root blocks should be a path toward a quick solution.
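For readers wanting to try this themselves, here is a sketch of the mount-time form of the feature, assuming the "recovery" mount option from this era (later renamed "usebackuproot") and a placeholder device:

```shell
# Sketch: ask btrfs to fall back to older root blocks if the current
# one is unusable; mounting read-only avoids further damage.
# /dev/sdb1 and /mnt/test are placeholders.
mount -o ro,recovery /dev/sdb1 /mnt/test
```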

It is not too hard to imagine situations where this approach will not work, though. If a metadata block in a rarely-changed subtree is, say, zeroed by a hardware malfunction, it could go undetected for some time. By the time the user realizes that something is wrong, there may be no older hierarchy containing the information needed to put things back together. So other solutions will be necessary.

Obviously, one of those solutions will be the full filesystem checker and repair tool. That tool is still not ready, though. Getting a repair tool right is a hard problem; without a lot of care, a well-intentioned attempt to repair a filesystem can easily make it worse. Data that may have been recoverable before the repair attempt may no longer be so afterward. Even if a proper btrfsck were available today, it would probably be some years before it reflected enough experience to inspire confidence in users who are concerned about their data.

So it seems that something else is required. That "something else" turns out to be a data recovery tool written by Josef Bacik. This tool has a simple (to explain) job: dig through a corrupted filesystem in read-only mode and extract as much of the data as possible. Since it makes no changes, it cannot make things worse; it would be a worthwhile tool to have around even if a full repair tool existed.
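The read-only extraction job described above corresponds to what eventually shipped as the "btrfs restore" subcommand of btrfs-progs; a sketch, with placeholder paths:

```shell
# Sketch: copy whatever file data can still be reached out of an
# unmountable filesystem, writing only to the destination directory.
# /dev/sdb1 and /srv/rescue are placeholders.
btrfs restore -v /dev/sdb1 /srv/rescue
```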

That tool, along with all the requisite filesystem support, is expected to be available in the 3.2 kernel time frame. Meanwhile, there is a new btrfs-progs repository that will include the recovery tool in the near future. All told, it may not be quite the btrfsck that some users were hoping for, but it should be enough to make those users feel a bit more confident about entrusting their data to a new filesystem. Judging from the size of the crowd at Chris's talk, there are a lot of people interested in doing exactly that.

[Your editor would like to thank the Linux Foundation for funding his travel to LinuxCon Europe.]

Index entries for this article
Kernel: Btrfs
Kernel: Filesystems/Btrfs
Conference: LinuxCon Europe/2011


Oracle

Posted Nov 3, 2011 1:03 UTC (Thu) by kragilkragil2 (guest, #76172)

So they promised a checker and that turned out to be a version of testdisk that supports BtrFS. Awesome! Not.
Not releasing the code for the checker a long time ago was a mistake, and waiting only makes it worse. So why isn't there code? Sure, people will frag their FS; so what? Tell people it eats babies in flashing red letters for a minute before they use it. BtrFS is not production ready. If some of those users provide good bug reports, we will get a working BtrFS a lot sooner.

Oracle

Posted Nov 3, 2011 4:51 UTC (Thu) by drag (guest, #31333)

It'll be out in time for the Oracle release, I can almost guarantee it.

That is good and bad.

Good for us because now we will get to see what happens when people start to use it in large scale production environments.

Bad for Oracle customers, because they will be the ones beta testing it.

Oracle

Posted Nov 10, 2011 2:21 UTC (Thu) by clump (subscriber, #27801)

Enterprise users have been able to test drive Btrfs since RHEL 6.0 was released. It's tech preview, but it's available.

A btrfs update at LinuxCon Europe

Posted Nov 3, 2011 4:17 UTC (Thu) by ncm (guest, #165)

A checking-only tool (or tools) that can be run in the background on a mounted volume would be more useful than an all-singing, all-dancing automatic repair tool. One that could suggest running specific repair tools in the event of trouble would be more useful yet. After an ecosystem of checkers and repairers has matured, they would naturally be stitched together to be run automatically. Complaining doesn't speed up that work. Until then, repair tools are best run on an image copy of the filesystem.

A btrfs update at LinuxCon Europe

Posted Nov 3, 2011 6:35 UTC (Thu) by njs (subscriber, #40338)

In the last thread, people claimed that checking-only tools have existed for some time (both online and offline, IIRC).

A btrfs update at LinuxCon Europe

Posted Nov 3, 2011 16:20 UTC (Thu) by iabervon (subscriber, #722)

I think there needs to be one of them called "fsck.btrfs" so that it can be run by scripts between a detected bad event (e.g., a kernel panic) and writing to the filesystem again.

A btrfs update at LinuxCon Europe

Posted Nov 10, 2011 12:59 UTC (Thu) by callegar (guest, #16148)

Btrfs is not the only filesystem without a checker, unfortunately. UDF is in the same condition. Which is equally bad, since it leaves Linux without an unencumbered, vendor-neutral, cross-platform filesystem (and most likely this is the reason why every Linux user still sticks with FAT). And which is also sort of funny, since many people do backups on it. I wonder if this btrfs case may result in more attention from distributions to the need to invest in tools so that /all/ filesystems that are supported R/W can be checked and, in case something goes wrong, some data recovery can be attempted.

UDF

Posted Nov 11, 2011 11:25 UTC (Fri) by eru (subscriber, #2753)

Can UDF really be used as a normal R/W FS on a) Linux, b) Windows? I have only ever seen it on DVDs, and I suspect OSes might cheat and not implement UDF features not needed for that task.

UDF

Posted Nov 11, 2011 23:19 UTC (Fri) by cladisch (✭ supporter ✭, #50193)

> Can UDF really be used as a normal R/W FS on a) Linux, b) Windows?

Yes; it's essentially a 'normal' file system like, e.g., ext2.

> I have only ever seen it on DVDs, and I suspect OSes might cheat and not implement UDF features not needed for that task.

The Linux UDF driver defaulted to a 2048-byte sector size, which is wrong for other disk types; this was fixed two years ago. The userspace tool (mkudffs) still has the same bug; you need to remember to specify the sector size explicitly when formatting a hard disk or a USB stick.
Windows doesn't have this problem.
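As a sketch of that workaround (placeholder device; --blocksize is the mkudffs option for the sector size):

```shell
# Sketch: format a hard disk or USB stick with the medium's real
# 512-byte sector size rather than mkudffs's 2048-byte default.
# /dev/sdb is a placeholder.
mkudffs --blocksize=512 /dev/sdb
```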

At that time, there were problems with interchanging data between OSes (IIRC new files created in Linux didn't always show up in Windows); I don't know if this is still the case.

JFFS2 also has no fsck

Posted Nov 13, 2011 22:35 UTC (Sun) by skierpage (guest, #70911)

("Journalling Flash File System version 2 or JFFS2 is a log-structured file system for use with flash memory devices.")

It worked fine on my One Laptop Per Child laptop for years until it didn't, and there's no utility to repair it; neither Wikipedia nor its FAQ mentions this absence. Fortunately (?) userspace has no idea of the carnage going on below it, so I could tar off my files despite all the "jffs2_get_inode_nodes: Eep. No valid nodes for ino #340448" syslog messages.


Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds