Runtime filesystem consistency checking
Filesystems have bugs, Goel said, producing a list of bugs that caused filesystem corruption over the last few years. Existing solutions can't deal with these problems because they start with the assumption that the filesystem is correct. Journals, RAID, and checksums on data are nice features but they depend on offline filesystem checking to fix up any filesystem damage that may occur. Those solutions protect against problems below the filesystem layer and not against bugs in the filesystem implementation itself.
But, he said, offline checking is slow and getting slower as disks get larger. In addition, the data is not available while the fsck is being done. Because of that, checking is usually only done after things have obviously gone wrong, which makes the repair that much more difficult. The example given was a file and directory inode that both point to the same data block; how can the checker know which is correct at that point?
James Bottomley asked if there were particular tools that were used to cause various kinds of filesystem corruption, and if those tools were available for kernel hackers and others to use. Goel said that they have tools for both ext3 and btrfs, while audience members chimed in with other tools to cause filesystem corruptions. Those included fsfuzz, mentioned by Ted Ts'o, which will do random corruptions of a filesystem. It is often used to test whether malformed filesystems on USB sticks can be used to crash or subvert the kernel. There were others, like fswreck for the OCFS2 filesystem, as well as similar tools for XFS noted by Christoph Hellwig and another that Chris Mason said he had written for btrfs. Bottomley's suggestion that the block I/O scheduler could be used to pick blocks to corrupt was met with a response from another in the audience joking that the block layer didn't really need any help corrupting data—widespread laughter ensued.
Returning to the topic at hand, Goel stated that doing consistency checking at runtime is faced with the problem that consistency properties are global in nature and are therefore expensive to check. To find two pointers to the same data block, one must scan the entire filesystem, for example. In an effort to get around this difficulty, the researchers hypothesized that global consistency properties could be transformed into local consistency invariants. If only local invariants need to be checked, runtime consistency checking becomes a more tractable problem.
They started with the assumption that the initial filesystem is consistent, and that something below the filesystem layer, like checksums, ensures that correct data reaches the disk. At runtime, then, it is only necessary to check that the local invariants are maintained by whatever data is being changed in any metadata writes. This checking happens before those changes become "durable", so they reason by induction that the filesystem resulting from those changes is also consistent. By keeping any inconsistent state changes from reaching the disk, the "Recon" system makes filesystem repair unnecessary.
As an example, ext3 maintains a bitmap of the allocated blocks, so to ensure consistency when a block is allocated, Recon needs to test that the proper bit in the bitmap flips from zero to one and that the pointer used is the correct one (i.e. it corresponds to the bit flipped). That is the "consistency invariant" for determining that the block has been allocated correctly. A bit in the bitmap can't be set without a corresponding block pointer being set and vice versa. Additional checks are done to make sure that the block had not already been allocated, for example. That requires that Recon maintain its own block bitmap.
These invariants (they came up with 33 of them for ext3) are checked at the transaction commit point. The design of Recon is based on a fundamental mistrust of the filesystem code and data structures, so it sits between the filesystem and the block layer. When the filesystem does a metadata write, Recon records that operation. Similarly, it caches the data from metadata reads, so that the invariants can be validated without excessive disk reads. When the commit of a metadata update is done, the read cache is updated if the invariants are upheld in the update.
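That overall flow can be modeled outside the kernel in a few lines. The following is purely an illustrative sketch with invented names (a toy "disk", a single bitmap block and a single block of pointers, and only one direction of the allocation check); it is not Recon's actual code:

    /*
     * Minimal user-space model of commit-time checking, for illustration
     * only. Block 0 is an allocation bitmap, block 1 a table of block
     * pointers ("the inode"), and a staged transaction only reaches the
     * "disk" if the allocation invariant holds against the cached copies.
     */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS  64
    #define BLKSIZE  64
    #define BITMAP   0
    #define INODE    1

    static unsigned char disk[NBLOCKS][BLKSIZE];    /* committed state */
    static unsigned char cache[NBLOCKS][BLKSIZE];   /* cached metadata */

    struct staged { int blk; unsigned char data[BLKSIZE]; };

    static bool bit(const unsigned char *bm, int n)
    {
        return (bm[n / 8] >> (n % 8)) & 1;
    }

    /* Every bit that flips 0->1 in the bitmap must be matched by a block
     * pointer newly set to that block; the reverse check is analogous.  */
    static bool alloc_invariant(const struct staged *w, int n)
    {
        const unsigned char *obm = cache[BITMAP], *nbm = obm;
        const unsigned char *oin = cache[INODE],  *nin = oin;

        for (int i = 0; i < n; i++) {
            if (w[i].blk == BITMAP) nbm = w[i].data;
            if (w[i].blk == INODE)  nin = w[i].data;
        }
        for (int b = 0; b < NBLOCKS; b++) {
            if (!bit(nbm, b) || bit(obm, b))
                continue;                        /* not newly allocated */
            bool referenced = false;
            for (int p = 0; p < BLKSIZE; p++)
                if (nin[p] == b && oin[p] != b)
                    referenced = true;
            if (!referenced)
                return false;                    /* bit set, no pointer */
        }
        return true;
    }

    static bool commit(const struct staged *w, int n)
    {
        if (!alloc_invariant(w, n)) {
            fprintf(stderr, "invariant violated -- stopping filesystem\n");
            return false;                        /* nothing reaches disk */
        }
        for (int i = 0; i < n; i++) {            /* apply, refresh cache */
            memcpy(disk[w[i].blk], w[i].data, BLKSIZE);
            memcpy(cache[w[i].blk], w[i].data, BLKSIZE);
        }
        return true;
    }

    int main(void)
    {
        /* Allocate block 5: set pointer slot 0 and the matching bitmap bit. */
        struct staged tx[2] = { { .blk = INODE }, { .blk = BITMAP } };
        tx[0].data[0] = 5;
        tx[1].data[5 / 8] = 1 << (5 % 8);
        printf("commit %s\n", commit(tx, 2) ? "accepted" : "rejected");
        return 0;
    }

The important property is the one described above: an update that fails the check never reaches the disk, so the on-disk filesystem stays consistent by induction.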
When filesystem metadata is updated, Recon needs to determine what logical change is being performed. It does that by examining the metadata block to determine what type of block it is, and then does a "logical diff" of the changes. The result is a "logical change record" that records five separate fields for each change: block type, ID, the field that changed, the old value, and the new value. As an example, Goel listed the change records that might result from appending a block to inode 12:
    Type    ID    Field         Old    New
    inode   12    blockptr[1]     0    501
    inode   12    i_size       4096   8192
    inode   12    i_blocks        8     16
    bitmap  501   --              0      1
    bgd     0     free_blocks  1500   1499

Using those records, the invariants can be checked to ensure that the block pointer referenced in the inode is the same as the one that has its bit set in the bitmap, for example.
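In code, such records and the checks over them might look roughly like the following sketch (field names invented, not Recon's actual structures; the real system derives the records by comparing the newly written metadata block against its cached copy, and checks 33 invariants of this kind for ext3):

    /* Sketch of logical change records and one cross-record check. */
    #include <stdbool.h>
    #include <stdio.h>

    enum blk_type { INODE, BITMAP, BGD };

    struct change_record {
        enum blk_type type;   /* which kind of metadata block changed */
        long id;              /* inode number, block number, group... */
        const char *field;    /* name of the field that changed       */
        long old_val, new_val;
    };

    /* The example transaction from the talk: appending a block to inode 12. */
    static const struct change_record tx[] = {
        { INODE,  12,  "blockptr[1]",   0,  501 },
        { INODE,  12,  "i_size",     4096, 8192 },
        { INODE,  12,  "i_blocks",      8,   16 },
        { BITMAP, 501, "bit",           0,    1 },
        { BGD,    0,   "free_blocks", 1500, 1499 },
    };

    /* Invariant: every block pointer newly set in an inode must be matched
     * by a bitmap record flipping that block's bit from 0 to 1.           */
    static bool new_pointers_are_allocated(const struct change_record *r, int n)
    {
        for (int i = 0; i < n; i++) {
            if (r[i].type != INODE || r[i].old_val != 0 || r[i].new_val == 0)
                continue;               /* not a newly set block pointer */
            if (r[i].field[0] != 'b')   /* crude: only blockptr[] fields */
                continue;
            bool matched = false;
            for (int j = 0; j < n; j++)
                if (r[j].type == BITMAP && r[j].id == r[i].new_val &&
                    r[j].old_val == 0 && r[j].new_val == 1)
                    matched = true;
            if (!matched)
                return false;
        }
        return true;
    }

    int main(void)
    {
        int n = sizeof(tx) / sizeof(tx[0]);
        printf("invariant %s\n",
               new_pointers_are_allocated(tx, n) ? "holds" : "violated");
        return 0;
    }

If the bitmap record were missing, the check would fail and, as described below, the filesystem would be stopped before the commit reaches the disk.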
Currently, when any invariant is violated, the filesystem is stopped. Eventually there may be ways to try to fix the problems before writing to disk, but for now, the safe option is to stop any further writes.
Recon was evaluated by measuring how many consistency errors were detected by it vs. those caught by fsck. Recon caught quite a few errors that were not detected by fsck, while it only missed two that fsck caught. In both cases, the filesystem checker was looking at fields that are not currently used by ext3. Many of the inconsistencies that Recon found and fsck didn't were changes to unallocated data, which are not important from a consistency standpoint, but still should not be changed in a correctly operating filesystem.
There are some things that neither fsck nor Recon can detect, like changes to filenames in directories or time field changes in inodes. In both cases, there isn't any redundant information to do a consistency check against.
The performance impact of Recon is fairly modest, at least in terms of I/O operations. With a cache size of 128MB, Recon could handle a web server workload with a reduction of only about 2% in I/O operations per second, based on a graph that was shown. The cache size was chosen based on the working set of the workload, so that the cache would not be flushed prematurely, which would otherwise cause expensive re-reads of the metadata. The tests were run on a filesystem on a 1TB partition containing 15-20GB of random files, according to Fryer, and used small files to try to stress the metadata cache.
No data was presented on the CPU impact of Recon, other than to say that there was "significant" CPU overhead. Their focus was on the I/O cost, so more investigation of the CPU cost is warranted. Based on comments from the audience, though, some would be more than willing to spend some CPU in the name of filesystem consistency so that the far more expensive offline checking could be avoided in most cases.
The most important thing to take away from the talk, Goel said, is that as long as the integrity of written block data is assured, all of the ext3 properties that can be checked by fsck can instead be checked at runtime. As Ric Wheeler and others in the audience pointed out, that doesn't eliminate the need for an offline checker, but it may help reduce how often it's needed. Goel agreed, and noted that in 4% of their tests with corrupted filesystems, fsck would complete successfully, but a second run would find more things to fix. Ts'o was very interested to hear that and asked that they file bugs for those cases.
There is ongoing work on additional consistency invariants as well as things like reducing the memory overhead and increasing the number of filesystems that are covered. Dave Chinner noted that invariants for some filesystems may be hard to come up with, especially for filesystems like XFS that don't necessarily do metadata updates through the page cache.
The reaction to Recon was generally positive. It is an interesting project, and it surprised some attendees that runtime consistency checking was possible at all. As always, there is more to do, and the team has limited resources, but most attendees seemed impressed with the work.
[Many thanks are due to Mel Gorman for sharing his notes from this session.]
Index entries for this article
Kernel: Filesystems
Conference: Storage, Filesystem, and Memory-Management Summit/2012
Posted Apr 3, 2012 16:41 UTC (Tue) by Tara_Li (guest, #26706) (14 responses)
Posted Apr 3, 2012 18:58 UTC (Tue) by martinfick (subscriber, #4455) (13 responses)
Posted Apr 3, 2012 20:00 UTC (Tue) by drag (guest, #31333)
Also, while reliability and capacity have both increased, capacity has far outstripped reliability. So while today's drives are generally more reliable than older ones (as in bad/corrupt blocks lost per GB), the chances of you losing part of your data are much higher simply because there is so much more of it.
This sort of stuff is why online fsck and scrubs (reading in data and comparing it to checksums to detect and correct corruption) are so important on modern filesystems. Previously the only people that needed to care were ones that could justify the expense of purchasing big SAN devices and whatnot.
Posted Apr 3, 2012 20:01 UTC (Tue) by cmccabe (guest, #60281) (11 responses)
SSDs don't have these limitations, however.
Posted Apr 4, 2012 8:00 UTC (Wed) by dgm (subscriber, #49227) (10 responses)
Disk capacity may have increased, but disk platters are exactly the same size as before: 3.5 inches. So, moving the read head around should cost mostly the same as before. The only factor I can think of is that the head has to be more precisely positioned, and that may (or may not) be more costly because of physical limitations (rebounds).
On the other hand there are two factors that should make seek time decrease: improved machinery and more density. More density means that more data goes faster under the read head, so more often seeks can be satisfied without moving the read head, just waiting for the data to pass below.
Posted Apr 4, 2012 9:22 UTC (Wed) by epa (subscriber, #39769) (8 responses)
Or maybe the point is that larger filesystems necessarily require more random accesses and hence more disk seeks when you fsck them. Larger RAM would mitigate this but I don't know whether increased RAM for caching has kept pace with filesystem sizes enough. An fsck expert would be able to give some numbers.
Posted Apr 4, 2012 10:27 UTC (Wed) by khim (subscriber, #9252) (7 responses)
Actually the original poster was wrong: seeks are not more expensive. They have the same cost, but you need more of them. Even if you grow filesystem data structures to reduce fragmentation, the undeniable fact is that the number of tracks is growing while the time to read a single track stays constant. This means that the time needed to read the whole disk from beginning to end is growing.
Posted Apr 4, 2012 12:17 UTC (Wed) by epa (subscriber, #39769) (6 responses)
Posted Apr 4, 2012 12:59 UTC (Wed) by khim (subscriber, #9252) (5 responses)
More or less. This means that when you go from Linux 0.1 (with a typical HDD size of 200-300MB) to Linux 3.0 (with a typical HDD size of 2-4TB), a whole-filesystem check slows down by a factor of 100, not by a factor of 10,000. But a 100x slowdown is still a lot.
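As a back-of-the-envelope formula (an illustration of the argument above, with round numbers assumed rather than figures from any datasheet):

    % capacity scales with track count and per-track density; a full
    % sequential scan only scales with the track count, since the time
    % to read one track (one revolution) is roughly constant
    \[
      C \propto N_{tracks} \cdot b_{track}, \qquad
      T_{scan} \propto N_{tracks} \cdot t_{rev}
    \]
    \[
      \frac{C_{new}}{C_{old}} \approx 10\,000
      \;\text{with}\;
      \frac{N_{new}}{N_{old}} \approx \frac{b_{new}}{b_{old}} \approx 100
      \;\Longrightarrow\;
      \frac{T_{scan,new}}{T_{scan,old}} \approx 100
    \]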
Posted Apr 4, 2012 16:01 UTC (Wed) by wazoox (subscriber, #69624) (4 responses)
The individual 4 TB drive needs more than 9 hours to simply fill it up sequentially.
We'll need to index blocks on our spinning rust on SSD cache before long :)
Posted Apr 4, 2012 19:41 UTC (Wed) by khim (subscriber, #9252) (3 responses)
Contemporary 4TB HDDs are especially slow because they use 5 plates (where your 1TB disks probably used 2 or 3). This means that not only do you see the slowdown from the growing number of tracks, you see an additional slowdown from the growing number of plates! Thankfully, in this direction 5 is the limit: I doubt we'll see the return of 30-plate monsters like the infamous Winchester… all 3.5" HDDs to date have had 5 plates or less.
Posted Apr 5, 2012 9:18 UTC (Thu) by misiu_mp (guest, #41936) (2 responses)
More plates means more heads, with possibility for concurrency - that should increase sequential transfer speed.
If data is written cylinder-wise, the latency should be similar to one-plate disk.
If it is written plate-wise, the latency should vary up and down in relation to block numbers. It's possible the average latency would still be comparable.
The only clear negative about multi-platter systems is the increased inertia of the head assembly. It's not so clear whether it has a practical implication.
Apart from this unclear performance implication, there is of course the decreased reliability and increased cost of multi-platter solutions. That is the main reason we don't see that many of them.
Posted Apr 5, 2012 10:00 UTC (Thu) by khim (subscriber, #9252) (1 response)
> More plates means more heads, with possibility for concurrency - that should increase sequential transfer speed.
Good idea. Sadly it's about ten years too late. Today's tracks are too small: when the head is on a track on one plate, all the other heads are not on that same track. In fact they are not on any track at all; they just drift randomly across two or three adjacent tracks. That's why only one head can be used actively. (How can we use even one if it is all so unstable? Easy: there is an active scheme that dynamically moves the head to keep it on track.)
> If data is written cylinder-wise, the latency should be similar to one-plate disk.
Latency of seeks - yes; number of tracks - no. If you use the same plates, then a filesystem on a single-plate HDD will be roughly five times faster than a filesystem on a five-plate HDD.
> That is the main reason we don't see that many of them.
The main reason we don't see many of them is cost. They are more expensive to produce and, since they are less reliable, they incur more warranty overhead. They are also slower, but that is a secondary problem.
Posted Apr 13, 2012 8:47 UTC (Fri) by ekj (guest, #1524)
Posted Apr 5, 2012 19:01 UTC (Thu) by cmccabe (guest, #60281)
From a programmer's perspective, the growth in hard disk capacity has not been matched by a corresponding increase in either throughput or worst-case latency.
Because hard disk throughput has not kept pace, in a high performance setup, your only hope for reasonable throughput is to use RAID with striping. But RAID increases the minimum size that you can read-- before, that minimum was a sector-- with RAID, it's a stripe. This makes hard disks even less of a random-access medium, since you never want to be reading just a few bytes-- you want to read a whole RAID stripe at a time in order to be efficient.
Most programmers don't know about these details because the database does all this for you.
Posted Apr 3, 2012 17:13 UTC (Tue) by nybble41 (subscriber, #55106) (5 responses)
That may be true of ext3, but in general there are a few reasons why one might want to write to space which has not been allocated *on disk*. Atomic updates come to mind: reserve space in memory, write the new data, and finally--after the data has been committed--reserve it on disk and update the metadata. Online defragmentation or resizing could be implemented this way as well. Unallocated data is "don't care"; it shouldn't be a problem to change it even if the reason for the change is not yet apparent.
This also seems to introduce a second point of failure; the system can cease to function due to a filesystem bug, as before, or due to a bug in Recon which unnecessarily blocks access to the filesystem. That risk would have to be weighed against the cost of filesystem corruption in the absence of Recon, of course.
Posted Apr 3, 2012 19:58 UTC (Tue) by NAR (subscriber, #1313) (2 responses)
This was my first thought too. They are using code to check code, which is kind of like automated tests. In my experience tests are wrong about as many times as the code itself (but this could be due to our fragile test environment), so it's one more thing to get right. On the other hand if Recon is changed a lot fewer times than the filesystem code itself, then Recon can reach sufficient maturity to be actually useful.
Posted Apr 3, 2012 23:28 UTC (Tue) by neilbrown (subscriber, #359)
lockdep is brilliant for developers as it warns you early of your bugs, just as this 'recon' would warn ext3 developers of their bugs. But lockdep used to report lots of false positives - this has got a lot better over the years though.
I'm not sure I'd enable lockdep or recon in production though. There is a real cost, and it is not at all likely to help more than it hurts.
Posted Apr 4, 2012 13:10 UTC (Wed) by ashvin (guest, #83894)
Posted Apr 4, 2012 13:07 UTC (Wed) by ashvin (guest, #83894) (1 response)
Posted Apr 4, 2012 15:30 UTC (Wed) by nybble41 (subscriber, #55106)
The data would not be useful after a crash; up to the point where the allocation is recorded on disk, there are no references to it, and it can simply revert to unallocated space, canceling the incomplete "transaction".
Posted Apr 3, 2012 17:33 UTC (Tue) by jzbiciak (guest, #5246) (6 responses)
Posted Apr 4, 2012 12:07 UTC (Wed) by nix (subscriber, #2304) (5 responses)
Posted Apr 4, 2012 13:16 UTC (Wed) by ashvin (guest, #83894)
Posted Apr 12, 2012 12:47 UTC (Thu) by nye (subscriber, #51576) (3 responses)
My ZFS experience on a ~5TB pool consisting of six commodity HDs under fairly light load (ie. it's a home file server) is that every couple of months scrub detects checksum errors in a block or a small handful of blocks, without any corresponding read/write errors being given by the device.
Not sure if that's the situation you're talking about.
(Also, the same experience has taught me that Western Digital should be avoided like ebola. I actually wonder if their green series might be drives that have failed QC and been re-badged for the unsuspecting consumer.)
Posted Apr 13, 2012 10:33 UTC (Fri) by etienne (guest, #25256) (2 responses)
I am not sure how to interpret exactly what is happening on my own PC, but I suspect something like:
- one block of sectors develops a bit fault in the magnetic data
- the ECC corrects it each time the PC reads the sector, resulting in a *very long* delay of a few seconds
- the Linux driver does not notice that there was a long ECC correction and does not decide to rewrite the sector identically to get the magnetic data corrected
- in the long term a second error will appear in the magnetic data and the ECC will no longer be sufficient.
I do not know why the sector is not rewritten by the Linux driver; I know that I did solve the same problem on another PC by touching a file in a directory, forcing the sector containing the directory entry to be rewritten.
I never noticed the problem when the "old" ATA/IDE driver was used, but I am not sure I am interpreting correctly what has happened on my PC during the last few days...
Posted Apr 13, 2012 12:34 UTC (Fri) by james (subscriber, #1325)
And it can do all of that without having to worry about which operating system is running, or whether it's a database using raw access, or a light layer using BIOS calls but no filesystem. It can preserve this information across reformats.
In your case, by causing the sector containing the directory entry to be rewritten, the disk probably decided that this was a great time to remap in a spare sector, so it actually went to a different part of the disk. (Unless the filesystem you were using put the new directory entry somewhere else anyway.)
And ECC correction doesn't take seconds; re-reading the same sector repeatedly in the hope that you can get a last good read does.
¹ If you've got command queueing turned on, several requests outstanding, and there's a delay, which sector caused the problem?
Posted Apr 13, 2012 14:51 UTC (Fri) by jzbiciak (guest, #5246)
It will also let you fire up background health checks (these can take quite a long time to complete -- as long as a day, as I recall) that may help turn up other problems.
Posted Apr 3, 2012 17:53 UTC (Tue) by cesarb (subscriber, #6266) (2 responses)
Now that this problem has been recognized, could future filesystems be designed so that all relevant consistency properties are local instead of global?
In the quoted example, for instance, the filesystem could record an "owner" identifier for each data block, instead of a single "used/free" bit. Then the check "two owners point to the same data block" becomes instead "the data block points back to the correct owner".
In an ext3-style implementation of this concept, the owner identifier could be the block which points to this data block. So if you are looking at a data block pointed to by an indirect block you have to check 3 invariants, all local: "the data block owner is the indirect block", "the indirect block points to the data block", and "there are no duplicate data blocks within this indirect block". The same applies if you have a block containing inodes instead of an indirect block.
According to Btrfs Design, btrfs has backrefs from each btree node and file extent to its parents — plural, because there may be multiple.
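A sketch of how such a local check might look, using entirely hypothetical structures that follow the scheme described in this comment rather than the on-disk format of btrfs or any real filesystem:

    /* Hypothetical records with an "owner" back-reference per data block,
     * so each check involves only a block and the block that points to it. */
    #include <stdbool.h>
    #include <stdio.h>

    #define PTRS_PER_BLOCK 4

    struct indirect_block {
        long blocknr;                 /* this block's own number             */
        long ptr[PTRS_PER_BLOCK];     /* data blocks it points to (0 = free) */
    };

    struct data_block {
        long owner;                   /* back-reference to the pointing block */
    };

    /* Local invariants: every pointer targets a block whose back-reference
     * points here, and no pointer appears twice within this one block.      */
    static bool check_indirect(const struct indirect_block *ib,
                               const struct data_block *blocks, long nblocks)
    {
        for (int i = 0; i < PTRS_PER_BLOCK; i++) {
            long b = ib->ptr[i];
            if (b == 0)
                continue;
            if (b < 0 || b >= nblocks || blocks[b].owner != ib->blocknr)
                return false;         /* back-reference does not match */
            for (int j = 0; j < i; j++)
                if (ib->ptr[j] == b)
                    return false;     /* duplicate pointer in this block */
        }
        return true;
    }

    int main(void)
    {
        struct data_block blocks[8] = { [5] = { .owner = 2 }, [6] = { .owner = 2 } };
        struct indirect_block ib = { .blocknr = 2, .ptr = { 5, 6 } };
        printf("local invariants %s\n",
               check_indirect(&ib, blocks, 8) ? "hold" : "violated");
        return 0;
    }

Whether a real filesystem could afford an owner field per data block is a separate question; the point is only that each check touches a block and its owner, nothing global.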
Posted Apr 3, 2012 18:05 UTC (Tue) by dtlin (subscriber, #36537)
Posted Apr 4, 2012 8:05 UTC (Wed) by dgm (subscriber, #49227)
Posted Apr 3, 2012 18:51 UTC (Tue) by dcg (subscriber, #9198) (4 responses)
Posted Apr 4, 2012 6:58 UTC (Wed) by hisdad (subscriber, #5375) (2 responses)
Not for me. I have an Ml110G6 with P212 raid controller.
That should be either all ok or all fail, right? Nope. One disk had a part fail and that somehow got through the controller and corrupted the file systems. btrfsck happily tells me there are errors. When I look for the "-f" it just smiles at me. The ext4 partitions were recoverable.
Oh yes we need better metadata, but also background scrubs.
--Dad
Posted Apr 7, 2012 15:24 UTC (Sat) by dcg (subscriber, #9198)
Posted Apr 13, 2012 6:04 UTC (Fri) by Duncan (guest, #6647)
IOW, as both the wiki and kernel options clearly state or from an even perfunctory scan of recent list posting, btrfs is clearly experimental and only suitable for test data that is either entirely throwaway or is available elsewhere ATM. While a responsible admin will always have backups for data even on production quality filesystems, there, backups are backups, the primary copy can be the copy on the working filesystem. But things are rather different for btrfs or any filesystem at its development state, where the primary copy should be thought of as existing on some non-btrfs volume, backed up as should be any data of value, and anything placed on btrfs is purely testing data -- if it happens to be there the next time you access it, particularly after the next umount/mount cycle, great, otherwise, you have a bug and as a tester of an experimental filesystem, should be following the development list close enough to know whether it has been reported or not yet, and a responsibility to try to trace and report it, including running various debug patches, etc, as may be requested for tracing it down. Otherwise, if you're not actively testing it, what are you doing running such an experimental and in active development filesystem in the first place?
So /of/ /course/ btrfs is not likely to be more reliable at this point. It's still experimental, and while it is /beginning/ to stabilize, development activity and debugging is still very high, features are still being added, it only recently got a writing fsck at all and that's still in dangerdonteveruse for a reason. As a volunteer tester of a filesystem in that state, if it's as stable as a mature filesystem for you, you probably simply aren't pushing it hard enough in your tests! =:^)
But of course every filesystem had a point at which it was in a similar state, and unlike many earlier filesystems, btrfs has been designed from the ground up with data/metadata checksumming and other reliability features as primary design features, so by the time it reaches maturity and that experimental label comes off for mainline (actually, arguably before that as kernel folks are generally quite conservative about removing such labels), it should be well beyond ext* in terms of reliability, at least when run with the defaults (many of the features including checksumming can be turned off, it can be run without the default metadata duplication, etc).
So while btrfs as a still experimental filesystem fit only for testing /of/ /course/ doesn't yet have the reliability of ext3/4, it will get there, and then surpass them... until at a similar maturity to what they are now, it should indeed be better than they are now, or even than they will be then, for general purpose usage reliability, anyway.
Duncan (who has been following the btrfs list for a few weeks now, and all too often sees posts from folks wondering how to recover data... from a now unmountable but very publicly testing-quality-only filesystem that never should have had anything but throwaway quality testing data on it in the first place.)
Posted Apr 4, 2012 13:21 UTC (Wed) by ashvin (guest, #83894)
Does anybody know if they have tested Recon on flash-based filesystems? For embedded devices, the JFFS2/UBIFS filesystems are of interest. Anybody who follows the mtd mailing list knows there are many instances where the filesystem does become corrupted; it would be nice to know at run time, rather than finding out that the filesystem is corrupted when the device boots!
Posted Apr 4, 2012 20:47 UTC (Wed) by travelsn (guest, #48694)
Posted Apr 5, 2012 2:36 UTC (Thu) by asj (subscriber, #74238)
Posted May 30, 2012 11:50 UTC (Wed) by aigarius (subscriber, #7329)