In October, the btrfs user community expressed concerns about the still missing-in-action filesystem checker and repair tool. At that time, btrfs creator Chris Mason said that he hoped to demonstrate a working checker during his LinuxCon Europe session. Your editor was there as part of a standing-room-only crowd ready to see the show; we did indeed get a demonstration, but it may not have been quite what some attendees expected.
Chris started by talking about btrfs and its goals in general; those have been well covered here and need not be repeated now. He reiterated Oracle's plan to use btrfs as the core filesystem for its RHEL-derivative Linux distribution; needless to say, supporting that role requires a rock-solid implementation. So a lot of work has been going into extensive testing of the filesystem and fixing bugs.
The 3.2 kernel release will see the results of that work; it will contain lots of fixes. There will also be significant improvements to the logging code. It turns out that a lot of data was being logged more than once, greatly increasing the amount of I/O required; that has now been fixed. I/O traffic for the log, it seems, has been cut to about 25% of its previous level.
For 3.3, the main improvement seems to be the use of larger blocks for nodes in the filesystem B-tree. Larger blocks can hold more data, of course, and, in particular, more metadata. That means that metadata that was previously scattered in the filesystem can be kept together with the relevant inode. That, in turn, leads to significant performance improvements for many filesystem operations.
Another near-term feature, due to arrive "right after fsck," is the merging of Dave Woodhouse's RAID5 and RAID6 implementations. That work was initially posted in 2009; Chris apologized for taking so long to get it merged. How this feature will actually be used still needs some thought; RAID5 or 6 is quite good for data, but it can be problematic for metadata, which tends not to fill anything close to a full RAID stripe and, thus, can lead to low I/O performance. Happily, btrfs has been designed from the beginning to keep data and metadata separate; that means a filesystem can be set up with data protected by full RAID while metadata is managed using simple mirroring.
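Once the RAID5/6 code lands, that data/metadata split could plausibly be expressed through the profile options mkfs.btrfs already provides for its raid0/1/10 modes. A sketch, with hypothetical device names, assuming a raid6 data profile becomes accepted once the patches are merged:

```shell
# Sketch: data striped with double parity, metadata simply mirrored.
# Device names are hypothetical; "-d raid6" assumes the then-unmerged
# RAID5/6 patches are in place.
mkfs.btrfs -d raid6 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
```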
Talk of protecting metadata leads naturally to the problem of recovering a filesystem when its metadata has been corrupted. That is what a filesystem checker program is for; btrfs, thus far, has been increasingly famous for its lack of a proper checker (and, more importantly, a proper filesystem repair tool). As of the LinuxCon talk, btrfs still does not have a real repair tool, but some progress has been made in that direction and a couple of other mechanisms have been provided.
The copy-on-write nature of btrfs implies that there will be numerous old copies of the filesystem metadata on the storage device at any given time. Any change, after all, will create a new copy, leaving the previous version in place until the block is reused. Chris observed that filesystem corruptions rarely affect that older metadata, so it makes sense to use it as a primary resource in the recovery of a corrupted disk. But, first, one needs to be able to find that older metadata.
To that end, btrfs maintains an array containing the block locations of many older versions of the filesystem root. The root block, he said, is more important than the superblock when it comes to recovering data. The root is replaced often as metadata changes percolate up to the top of the directory hierarchy, so the "old root blocks" array contains pointers to what is, in effect, a set of snapshots of the very recent state of the filesystem. Clearly, this will be a valuable resource should something go badly wrong.
One way of using that array is simply to mount the filesystem using an older version of the root. Chris demonstrated this feature by poking holes in a test filesystem, then mounting an older root to get back to where things had been before. For simple, quickly-detected problems, older root blocks should be a path toward a quick solution.
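The kernel side of this fallback was headed for the 3.2 time frame as a mount option. A sketch of what such a rescue might look like, with a hypothetical device and mount point:

```shell
# Sketch: ask btrfs to fall back to a usable older tree root rather
# than the corrupted current one. The option was merged as "recovery"
# in this era; the device and mount point are hypothetical.
mount -o recovery /dev/sdb1 /mnt/rescue
```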
It is not too hard to imagine situations where this approach will not work, though. If a metadata block in a rarely-changed subtree is, say, zeroed by a hardware malfunction, it could go undetected for some time. By the time the user realizes that something is wrong, there may be no older hierarchy containing the information needed to put things back together. So other solutions will be necessary.
Obviously, one of those solutions will be the full filesystem checker and repair tool. That tool is still not ready, though. Getting a repair tool right is a hard problem; without a lot of care, a well-intentioned attempt to repair a filesystem can easily make it worse. Data that may have been recoverable before the repair attempt may no longer be so afterward. Even if a proper btrfsck were available today, it would probably be some years before it reflected enough experience to inspire confidence in users who are concerned about their data.
So it seems that something else is required. That "something else" turns out to be a data recovery tool written by Josef Bacik. This tool has a simple (to explain) job: dig through a corrupted filesystem in read-only mode and extract as much of the data as possible. Since it makes no changes, it cannot make things worse; it seems like a worthwhile tool to have around even if a full repair tool existed.
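The recovery tool eventually shipped in btrfs-progs as the "restore" subcommand. A hedged sketch of its use, with hypothetical paths:

```shell
# Sketch: pull whatever files can still be read out of an unmountable
# filesystem into a directory on a healthy disk. The damaged device is
# only ever read, never written; both paths are hypothetical.
btrfs restore /dev/sdb1 /srv/recovered
```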
That tool, along with all the requisite filesystem support, is expected to be available in the 3.2 kernel time frame. Meanwhile, there is a new btrfs-progs repository that will include the recovery tool in the near future. All told, it may not be quite the btrfsck that some users were hoping for, but it should be enough to make those users feel a bit more confident about entrusting their data to a new filesystem. Judging from the size of the crowd at Chris's talk, there are a lot of people interested in doing exactly that.
[Your editor would like to thank the Linux Foundation for funding his travel to LinuxCon Europe.]
| Index entries for this article | |
|---|---|
| Kernel | Btrfs |
| Kernel | Filesystems/Btrfs |
| Conference | LinuxCon Europe/2011 |
Oracle
Posted Nov 3, 2011 4:51 UTC (Thu) by drag (guest, #31333) [Link]
That is good and bad.
Good for us because now we will get to see what happens when people start to use it in large scale production environments.
Bad for Oracle customers, because they will be the ones beta testing it.
A btrfs update at LinuxCon Europe
Posted Nov 10, 2011 12:59 UTC (Thu) by callegar (guest, #16148) [Link]
Btrfs is not the only filesystem without a checker, unfortunately. UDF is in the same condition, which is equally bad, since it leaves Linux without an unencumbered, vendor-neutral, cross-platform filesystem (and most likely this is the reason why every Linux user still sticks with FAT). It is also sort of funny, since many people do backups on it. I wonder if this btrfs case may result in more attention from distributions to the need to invest in tools so that /all/ filesystems supported R/W can be checked and, in case something goes wrong, some data recovery can be attempted.
UDF
Posted Nov 11, 2011 11:25 UTC (Fri) by eru (subscriber, #2753) [Link]
Can UDF really be used as a normal R/W FS on a) Linux, b) Windows? I have only ever seen it on DVDs, and I suspect OSes might cheat and not implement UDF features not needed for that task.
UDF
Posted Nov 11, 2011 23:19 UTC (Fri) by cladisch (✭ supporter ✭, #50193) [Link]
Yes; it's essentially a 'normal' file system like, e.g., ext2.
> I have only ever seen it on DVDs, and I suspect OSes might cheat and not implement UDF features not needed for that task.
The Linux UDF driver defaulted to a 2048-byte sector size, which would be wrong for other disk types; this was fixed two years ago. The userspace tool (mkudffs) still has the same bug; you need to remember to specify the sector size explicitly when formatting a hard disk or a USB stick.
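A sketch of the workaround described above, with a hypothetical device name, assuming a disk with 512-byte sectors:

```shell
# Sketch: override mkudffs's 2048-byte optical-media default with the
# device's real sector size. The device name is hypothetical.
mkudffs --media-type=hd --blocksize=512 /dev/sdb1
```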
Windows doesn't have this problem.
At that time, there were problems with interchanging data between OSes (IIRC new files created in Linux didn't always show up in Windows); I don't know if this is still the case.
JFFS2 also has no fsck
Posted Nov 13, 2011 22:35 UTC (Sun) by skierpage (guest, #70911) [Link]
It worked fine on my One Laptop Per Child laptop for years until it didn't, and there's no utility to repair it; neither Wikipedia nor its FAQ mention this absence. Fortunately (?) userspace has no idea of the carnage going on below it, so I could tar off my files despite all the "jffs2_get_inode_nodes: Eep. No valid nodes for ino #340448" syslog messages.
Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds