Runtime filesystem consistency checking
Filesystems have bugs, Goel said, producing a list of bugs that caused filesystem corruption over the last few years. Existing solutions can't deal with these problems because they start with the assumption that the filesystem is correct. Journals, RAID, and checksums on data are nice features but they depend on offline filesystem checking to fix up any filesystem damage that may occur. Those solutions protect against problems below the filesystem layer and not against bugs in the filesystem implementation itself.
But, he said, offline checking is slow and getting slower as disks get larger. In addition, the data is not available while the fsck is being done. Because of that, checking is usually only done after things have obviously gone wrong, which makes the repair that much more difficult. The example given was a file and directory inode that both point to the same data block; how can the checker know which is correct at that point?
James Bottomley asked if there were particular tools that were used to cause various kinds of filesystem corruption, and if those tools were available for kernel hackers and others to use. Goel said that they have tools for both ext3 and btrfs, while audience members chimed in with other tools to cause filesystem corruptions. Those included fsfuzz, mentioned by Ted Ts'o, which will do random corruptions of a filesystem. It is often used to test whether malformed filesystems on USB sticks can be used to crash or subvert the kernel. There were others, like fswreck for the OCFS2 filesystem, as well as similar tools for XFS noted by Christoph Hellwig and another that Chris Mason said he had written for btrfs. Bottomley's suggestion that the block I/O scheduler could be used to pick blocks to corrupt was met with a response from another in the audience joking that the block layer didn't really need any help corrupting data—widespread laughter ensued.
Returning to the topic at hand, Goel stated that doing consistency checking at runtime is faced with the problem that consistency properties are global in nature and are therefore expensive to check. To find two pointers to the same data block, one must scan the entire filesystem, for example. In an effort to get around this difficulty, the researchers hypothesized that global consistency properties could be transformed into local consistency invariants. If only local invariants need to be checked, runtime consistency checking becomes a more tractable problem.
They started with the assumption that the initial filesystem is consistent, and that something below the filesystem layer, like checksums, ensures that correct data reaches the disk. At runtime, then, it is only necessary to check that the local invariants are maintained by whatever data is being changed in any metadata writes. This checking happens before those changes become "durable", so they reason by induction that the filesystem resulting from those changes is also consistent. By keeping any inconsistent state changes from reaching the disk, the "Recon" system makes filesystem repair unnecessary.
As an example, ext3 maintains a bitmap of the allocated blocks, so to ensure consistency when a block is allocated, Recon needs to test that the proper bit in the bitmap flips from zero to one and that the pointer used is the correct one (i.e. it corresponds to the bit flipped). That is the "consistency invariant" for determining that the block has been allocated correctly. A bit in the bitmap can't be set without a corresponding block pointer being set and vice versa. Additional checks are done to make sure that the block had not already been allocated, for example. That requires that Recon maintain its own block bitmap.
These invariants (they came up with 33 of them for ext3) are checked at the transaction commit point. The design of Recon is based on a fundamental mistrust of the filesystem code and data structures, so it sits between the filesystem and the block layer. When the filesystem does a metadata write, Recon records that operation. Similarly, it caches the data from metadata reads, so that the invariants can be validated without excessive disk reads. When the commit of a metadata update is done, the read cache is updated if the invariants are upheld in the update.
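That overall flow can be modeled outside the kernel in a few lines. The following is purely an illustrative sketch with invented names (a toy "disk", a single bitmap block and a single block of pointers, and only one direction of the allocation check); it is not Recon's actual code:

    /*
     * Minimal user-space model of commit-time checking, for illustration
     * only. Block 0 is an allocation bitmap, block 1 a table of block
     * pointers ("the inode"), and a staged transaction only reaches the
     * "disk" if the allocation invariant holds against the cached copies.
     */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS  64
    #define BLKSIZE  64
    #define BITMAP   0
    #define INODE    1

    static unsigned char disk[NBLOCKS][BLKSIZE];    /* committed state */
    static unsigned char cache[NBLOCKS][BLKSIZE];   /* cached metadata */

    struct staged { int blk; unsigned char data[BLKSIZE]; };

    static bool bit(const unsigned char *bm, int n)
    {
        return (bm[n / 8] >> (n % 8)) & 1;
    }

    /* Every bit that flips 0->1 in the bitmap must be matched by a block
     * pointer newly set to that block; the reverse check is analogous.  */
    static bool alloc_invariant(const struct staged *w, int n)
    {
        const unsigned char *obm = cache[BITMAP], *nbm = obm;
        const unsigned char *oin = cache[INODE],  *nin = oin;

        for (int i = 0; i < n; i++) {
            if (w[i].blk == BITMAP) nbm = w[i].data;
            if (w[i].blk == INODE)  nin = w[i].data;
        }
        for (int b = 0; b < NBLOCKS; b++) {
            if (!bit(nbm, b) || bit(obm, b))
                continue;                        /* not newly allocated */
            bool referenced = false;
            for (int p = 0; p < BLKSIZE; p++)
                if (nin[p] == b && oin[p] != b)
                    referenced = true;
            if (!referenced)
                return false;                    /* bit set, no pointer */
        }
        return true;
    }

    static bool commit(const struct staged *w, int n)
    {
        if (!alloc_invariant(w, n)) {
            fprintf(stderr, "invariant violated -- stopping filesystem\n");
            return false;                        /* nothing reaches disk */
        }
        for (int i = 0; i < n; i++) {            /* apply, refresh cache */
            memcpy(disk[w[i].blk], w[i].data, BLKSIZE);
            memcpy(cache[w[i].blk], w[i].data, BLKSIZE);
        }
        return true;
    }

    int main(void)
    {
        /* Allocate block 5: set pointer slot 0 and the matching bitmap bit. */
        struct staged tx[2] = { { .blk = INODE }, { .blk = BITMAP } };
        tx[0].data[0] = 5;
        tx[1].data[5 / 8] = 1 << (5 % 8);
        printf("commit %s\n", commit(tx, 2) ? "accepted" : "rejected");
        return 0;
    }

The important property is the one described above: an update that fails the check never reaches the disk, so the on-disk filesystem stays consistent by induction.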
When filesystem metadata is updated, Recon needs to determine what logical change is being performed. It does that by examining the metadata block to determine what type of block it is, and then does a "logical diff" of the changes. The result is a "logical change record" that records five separate fields for each change: block type, ID, the field that changed, the old value, and the new value. As an example, Goel listed the change records that might result from appending a block to inode 12:
    Type    ID    Field         Old    New
    inode   12    blockptr[1]     0    501
    inode   12    i_size       4096   8192
    inode   12    i_blocks        8     16
    bitmap  501   --              0      1
    bgd     0     free_blocks  1500   1499

Using those records, the invariants can be checked to ensure that the block pointer referenced in the inode is the same as the one that has its bit set in the bitmap, for example.
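In code, such records and the checks over them might look roughly like the following sketch (field names invented, not Recon's actual structures; the real system derives the records by comparing the newly written metadata block against its cached copy, and checks 33 invariants of this kind for ext3):

    /* Sketch of logical change records and one cross-record check. */
    #include <stdbool.h>
    #include <stdio.h>

    enum blk_type { INODE, BITMAP, BGD };

    struct change_record {
        enum blk_type type;   /* which kind of metadata block changed */
        long id;              /* inode number, block number, group... */
        const char *field;    /* name of the field that changed       */
        long old_val, new_val;
    };

    /* The example transaction from the talk: appending a block to inode 12. */
    static const struct change_record tx[] = {
        { INODE,  12,  "blockptr[1]",   0,  501 },
        { INODE,  12,  "i_size",     4096, 8192 },
        { INODE,  12,  "i_blocks",      8,   16 },
        { BITMAP, 501, "bit",           0,    1 },
        { BGD,    0,   "free_blocks", 1500, 1499 },
    };

    /* Invariant: every block pointer newly set in an inode must be matched
     * by a bitmap record flipping that block's bit from 0 to 1.           */
    static bool new_pointers_are_allocated(const struct change_record *r, int n)
    {
        for (int i = 0; i < n; i++) {
            if (r[i].type != INODE || r[i].old_val != 0 || r[i].new_val == 0)
                continue;               /* not a newly set block pointer */
            if (r[i].field[0] != 'b')   /* crude: only blockptr[] fields */
                continue;
            bool matched = false;
            for (int j = 0; j < n; j++)
                if (r[j].type == BITMAP && r[j].id == r[i].new_val &&
                    r[j].old_val == 0 && r[j].new_val == 1)
                    matched = true;
            if (!matched)
                return false;
        }
        return true;
    }

    int main(void)
    {
        int n = sizeof(tx) / sizeof(tx[0]);
        printf("invariant %s\n",
               new_pointers_are_allocated(tx, n) ? "holds" : "violated");
        return 0;
    }

If the bitmap record were missing, the check would fail and, as described below, the filesystem would be stopped before the commit reaches the disk.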
Currently, when any invariant is violated, the filesystem is stopped. Eventually there may be ways to try to fix the problems before writing to disk, but for now, the safe option is to stop any further writes.
Recon was evaluated by measuring how many consistency errors were detected by it vs. those caught by fsck. Recon caught quite a few errors that were not detected by fsck, while it only missed two that fsck caught. In both cases, the filesystem checker was looking at fields that are not currently used by ext3. Many of the inconsistencies that Recon found and fsck didn't were changes to unallocated data, which are not important from a consistency standpoint, but still should not be changed in a correctly operating filesystem.
There are some things that neither fsck nor Recon can detect, like changes to filenames in directories or time field changes in inodes. In both cases, there isn't any redundant information to do a consistency check against.
The performance impact of Recon is fairly modest, at least in terms of I/O operations. With a cache size of 128MB, Recon could handle a web server workload with a reduction of only about 2% in I/O operations per second, based on a graph that was shown. The cache size was chosen based on the working set of the workload, so that the cache would not be flushed prematurely, which would otherwise cause expensive re-reads of the metadata. The tests were run on a filesystem on a 1TB partition containing 15-20GB of random files, according to Fryer, and used small files to try to stress the metadata cache.
No data was presented on the CPU impact of Recon, other than to say that there was "significant" CPU overhead. Their focus was on the I/O cost, so more investigation of the CPU cost is warranted. Based on comments from the audience, though, some would be more than willing to spend some CPU in the name of filesystem consistency so that the far more expensive offline checking could be avoided in most cases.
The most important thing to take away from the talk, Goel said, is that as long as the integrity of written block data is assured, all of the ext3 properties that can be checked by fsck can instead be checked at runtime. As Ric Wheeler and others in the audience pointed out, that doesn't eliminate the need for an offline checker, but it may help reduce how often it's needed. Goel agreed, and noted that in 4% of their tests with corrupted filesystems, fsck would complete successfully, but a second run would find more things to fix. Ts'o was very interested to hear that and asked that they file bugs for those cases.
There is ongoing work on additional consistency invariants as well as things like reducing the memory overhead and increasing the number of filesystems that are covered. Dave Chinner noted that invariants for some filesystems may be hard to come up with, especially for filesystems like XFS that don't necessarily do metadata updates through the page cache.
The reaction to Recon was generally positive. It is an interesting project, and it surprised some attendees that runtime consistency checking was possible at all. As always, there is more to do, and the team has limited resources, but most attendees seemed impressed with the work.
[Many thanks are due to Mel Gorman for sharing his notes from this session.]
Index entries for this article
Kernel: Filesystems
Conference: Storage, Filesystem, and Memory-Management Summit/2012
Posted Apr 3, 2012 16:41 UTC (Tue) by Tara_Li (guest, #26706) (14 responses)
Posted Apr 3, 2012 18:58 UTC (Tue) by martinfick (subscriber, #4455) (13 responses)
Posted Apr 3, 2012 20:00 UTC (Tue) by drag (guest, #31333)
Also, while reliability and capacity have both increased, capacity has far outstripped reliability. So while today's drives are generally more reliable than older ones (as in bad/corrupt blocks lost per GB), the chances of you losing part of your data are much higher simply because there is so much more of it.
This sort of stuff is why online fsck and scrubs (reading in data and comparing it to checksums to detect and correct corruption) are so important on modern filesystems. Previously the only people that needed to care were ones that could justify the expense of purchasing big SAN devices and whatnot.
Posted Apr 3, 2012 20:01 UTC (Tue) by cmccabe (guest, #60281) (11 responses)
SSDs don't have these limitations, however.
Posted Apr 4, 2012 8:00 UTC (Wed) by dgm (subscriber, #49227) (10 responses)
Disk capacity may have increased, but disk platters are exactly the same size as before: 3.5 inches. So, moving the read head around should cost mostly the same as before. The only factor I can think of is that the head has to be more precisely positioned, and that may (or may not) be more costly because of physical limitations (rebounds).
On the other hand there are two factors that should make seek time decrease: improved machinery and more density. More density means that more data goes faster under the read head, so more often seeks can be satisfied without moving the read head, just waiting for the data to pass below.
Posted Apr 4, 2012 9:22 UTC (Wed) by epa (subscriber, #39769) (8 responses)
Or maybe the point is that larger filesystems necessarily require more random accesses and hence more disk seeks when you fsck them. Larger RAM would mitigate this but I don't know whether increased RAM for caching has kept pace with filesystem sizes enough. An fsck expert would be able to give some numbers.
Posted Apr 4, 2012 10:27 UTC (Wed) by khim (subscriber, #9252) (7 responses)
Actually the original poster was wrong: seeks are not more expensive. They have the same cost, but you need more of them. Even if you grow filesystem data structures to reduce fragmentation, the undeniable fact is that the number of tracks is growing while the time to read a single track stays constant. This means that the time needed to read the whole disk from beginning to end is growing.
Posted Apr 4, 2012 12:17 UTC (Wed) by epa (subscriber, #39769) (6 responses)
Posted Apr 4, 2012 12:59 UTC (Wed) by khim (subscriber, #9252) (5 responses)
More or less. This means that when you go from Linux 0.1 (with a typical HDD size of 200-300MB) to Linux 3.0 (with a typical HDD size of 2-4TB), a whole-filesystem check slows down by a factor of 100, not by a factor of 10,000. But a 100x slowdown is still a lot.
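As a back-of-the-envelope formula (an illustration of the argument above, with round numbers assumed rather than figures from any datasheet):

    % capacity scales with track count and per-track density; a full
    % sequential scan only scales with the track count, since the time
    % to read one track (one revolution) is roughly constant
    \[
      C \propto N_{tracks} \cdot b_{track}, \qquad
      T_{scan} \propto N_{tracks} \cdot t_{rev}
    \]
    \[
      \frac{C_{new}}{C_{old}} \approx 10\,000
      \;\text{with}\;
      \frac{N_{new}}{N_{old}} \approx \frac{b_{new}}{b_{old}} \approx 100
      \;\Longrightarrow\;
      \frac{T_{scan,new}}{T_{scan,old}} \approx 100
    \]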
Posted Apr 4, 2012 16:01 UTC (Wed) by wazoox (subscriber, #69624) (4 responses)
The individual 4 TB drive needs more than 9 hours to simply fill it up sequentially.
We'll need to index blocks on our spinning rust on SSD cache before long :)
Posted Apr 4, 2012 19:41 UTC (Wed) by khim (subscriber, #9252) (3 responses)
Contemporary 4TB HDDs are especially slow because they use 5 plates (where your 1TB disks probably used 2 or 3). This means that not only do you see the slowdown from the growing number of tracks, you see an additional slowdown from the growing number of plates! Thankfully, in this direction 5 is the limit: I doubt we'll see the return of 30-plate monsters like the infamous Winchester… all 3.5" HDDs to date have had 5 plates or less.
Posted Apr 5, 2012 9:18 UTC (Thu) by misiu_mp (guest, #41936) (2 responses)
More plates means more heads, with possibility for concurrency - that should increase sequential transfer speed.
If data is written cylinder-wise, the latency should be similar to one-plate disk.
If it is written plate-wise, the latency should vary up and down in relation to block numbers. It's possible the average latency would still be comparable.
The only clear negative about multi-platter systems is the increased inertia of the head assembly. It's not so clear whether it has a practical implication.
Apart from this unclear performance implication, there is of course the decreased reliability and increased cost of multi-platter solutions. That is the main reason we don't see that many of them.
Posted Apr 5, 2012 10:00 UTC (Thu) by khim (subscriber, #9252) (1 response)
> More plates means more heads, with possibility for concurrency - that should increase sequential transfer speed.
Good idea. Sadly it's about ten years too late. Today's tracks are too small: when the head is on a track on one plate, all the other heads are not on that same track. In fact they are not on any track at all; they just drift randomly across two or three adjacent tracks. That's why only one head can be used actively. (How can we use even one if it is all so unstable? Easy: there is an active scheme that dynamically moves the head to keep it on track.)
> If data is written cylinder-wise, the latency should be similar to one-plate disk.
Latency of seeks - yes; number of tracks - no. If you use the same plates, then a filesystem on a single-plate HDD will be roughly five times faster than a filesystem on a five-plate HDD.
> That is the main reason we don't see that many of them.
The main reason we don't see many of them is cost. They are more expensive to produce and, since they are less reliable, they incur more warranty overhead. They are also slower, but that is a secondary problem.
Posted Apr 13, 2012 8:47 UTC (Fri) by ekj (guest, #1524)
Posted Apr 5, 2012 19:01 UTC (Thu) by cmccabe (guest, #60281)
From a programmer's perspective, the growth in hard disk capacity has not been matched by a corresponding increase in either throughput or worst-case latency.
Because hard disk throughput has not kept pace, in a high performance setup, your only hope for reasonable throughput is to use RAID with striping. But RAID increases the minimum size that you can read-- before, that minimum was a sector-- with RAID, it's a stripe. This makes hard disks even less of a random-access medium, since you never want to be reading just a few bytes-- you want to read a whole RAID stripe at a time in order to be efficient.
Most programmers don't know about these details because the database does all this for you.
Posted Apr 3, 2012 17:13 UTC (Tue) by nybble41 (subscriber, #55106) (5 responses)
That may be true of ext3, but in general there are a few reasons why one might want to write to space which has not been allocated *on disk*. Atomic updates come to mind: reserve space in memory, write the new data, and finally--after the data has been committed--reserve it on disk and update the metadata. Online defragmentation or resizing could be implemented this way as well. Unallocated data is "don't care"; it shouldn't be a problem to change it even if the reason for the change is not yet apparent.
This also seems to introduce a second point of failure; the system can cease to function due to a filesystem bug, as before, or due to a bug in Recon which unnecessarily blocks access to the filesystem. That risk would have to be weighed against the cost of filesystem corruption in the absence of Recon, of course.
Posted Apr 3, 2012 19:58 UTC (Tue) by NAR (subscriber, #1313) (2 responses)
This was my first thought too. They are using code to check code, which is kind of like automated tests. In my experience tests are wrong about as many times as the code itself (but this could be due to our fragile test environment), so it's one more thing to get right. On the other hand if Recon is changed a lot fewer times than the filesystem code itself, then Recon can reach sufficient maturity to be actually useful.
Posted Apr 3, 2012 23:28 UTC (Tue) by neilbrown (subscriber, #359)
lockdep is brilliant for developers as it warns you early of your bugs, just as this 'recon' would warn ext3 developers of their bugs. But lockdep used to report lots of false positives - this has got a lot better over the years though.
I'm not sure I'd enable lockdep or recon in production though. There is a real cost, and it is not at all likely to help more than it hurts.
Posted Apr 4, 2012 13:10 UTC (Wed) by ashvin (guest, #83894)
Posted Apr 4, 2012 13:07 UTC (Wed) by ashvin (guest, #83894) (1 response)
Posted Apr 4, 2012 15:30 UTC (Wed) by nybble41 (subscriber, #55106)
The data would not be useful after a crash; up to the point where the allocation is recorded on disk, there are no references to it, and it can simply revert to unallocated space, canceling the incomplete "transaction".
Posted Apr 3, 2012 17:33 UTC (Tue) by jzbiciak (guest, #5246) (6 responses)
Posted Apr 4, 2012 12:07 UTC (Wed) by nix (subscriber, #2304) (5 responses)
Posted Apr 4, 2012 13:16 UTC (Wed) by ashvin (guest, #83894)
Posted Apr 12, 2012 12:47 UTC (Thu) by nye (subscriber, #51576) (3 responses)
My ZFS experience on a ~5TB pool consisting of six commodity HDs under fairly light load (ie. it's a home file server) is that every couple of months scrub detects checksum errors in a block or a small handful of blocks, without any corresponding read/write errors being given by the device.
Not sure if that's the situation you're talking about.
(Also, the same experience has taught me that Western Digital should be avoided like ebola. I actually wonder if their green series might be drives that have failed QC and been re-badged for the unsuspecting consumer.)
Posted Apr 13, 2012 10:33 UTC (Fri) by etienne (guest, #25256) (2 responses)
I am not sure how to interpret exactly what is happening on my own PC, but I suspect something like:
- one block of sectors develops a bit fault in the magnetic data
- the ECC corrects it each time the PC reads the sector, resulting in a *very long* delay of a few seconds
- the Linux driver does not notice that there was a long ECC correction and does not decide to rewrite the sector identically to get the magnetic data corrected
- in the long term a second error will appear in the magnetic data and the ECC will no longer be sufficient.
I do not know why the sector is not rewritten by the Linux driver; I know that I did solve the same problem on another PC by touching a file in a directory, forcing the sector containing the directory entry to be rewritten.
I never noticed the problem when the "old" ATA/IDE driver was used, but I am not sure I am interpreting correctly what has happened on my PC during the last few days...
Posted Apr 13, 2012 12:34 UTC (Fri) by james (subscriber, #1325)
And it can do all of that without having to worry about which operating system is running, or whether it's a database using raw access, or a light layer using BIOS calls but no filesystem. It can preserve this information across reformats.
In your case, by causing the sector containing the directory entry to be rewritten, the disk probably decided that this was a great time to remap in a spare sector, so it actually went to a different part of the disk. (Unless the filesystem you were using put the new directory entry somewhere else anyway.)
And ECC correction doesn't take seconds; re-reading the same sector repeatedly in the hope that you can get a last good read does.
¹ If you've got command queueing turned on, several requests outstanding, and there's a delay, which sector caused the problem?
Posted Apr 13, 2012 14:51 UTC (Fri) by jzbiciak (guest, #5246)
It will also let you fire up background health checks (these can take quite a long time to complete -- as long as a day, as I recall) that may help turn up other problems.
Posted Apr 3, 2012 17:53 UTC (Tue) by cesarb (subscriber, #6266) (2 responses)
Now that this problem has been recognized, could future filesystems be designed so that all relevant consistency properties are local instead of global?
In the quoted example, for instance, the filesystem could record an "owner" identifier for each data block, instead of a single "used/free" bit. Then the check "two owners point to the same data block" becomes instead "the data block points back to the correct owner".
In an ext3-style implementation of this concept, the owner identifier could be the block which points to this data block. So if you are looking at a data block pointed to by an indirect block you have to check 3 invariants, all local: "the data block owner is the indirect block", "the indirect block points to the data block", and "there are no duplicate data blocks within this indirect block". The same applies if you have a block containing inodes instead of an indirect block.
According to Btrfs Design, btrfs has backrefs from each btree node and file extent to its parents — plural, because there may be multiple.
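A sketch of how such a local check might look, using entirely hypothetical structures that follow the scheme described in this comment rather than the on-disk format of btrfs or any real filesystem:

    /* Hypothetical records with an "owner" back-reference per data block,
     * so each check involves only a block and the block that points to it. */
    #include <stdbool.h>
    #include <stdio.h>

    #define PTRS_PER_BLOCK 4

    struct indirect_block {
        long blocknr;                 /* this block's own number             */
        long ptr[PTRS_PER_BLOCK];     /* data blocks it points to (0 = free) */
    };

    struct data_block {
        long owner;                   /* back-reference to the pointing block */
    };

    /* Local invariants: every pointer targets a block whose back-reference
     * points here, and no pointer appears twice within this one block.      */
    static bool check_indirect(const struct indirect_block *ib,
                               const struct data_block *blocks, long nblocks)
    {
        for (int i = 0; i < PTRS_PER_BLOCK; i++) {
            long b = ib->ptr[i];
            if (b == 0)
                continue;
            if (b < 0 || b >= nblocks || blocks[b].owner != ib->blocknr)
                return false;         /* back-reference does not match */
            for (int j = 0; j < i; j++)
                if (ib->ptr[j] == b)
                    return false;     /* duplicate pointer in this block */
        }
        return true;
    }

    int main(void)
    {
        struct data_block blocks[8] = { [5] = { .owner = 2 }, [6] = { .owner = 2 } };
        struct indirect_block ib = { .blocknr = 2, .ptr = { 5, 6 } };
        printf("local invariants %s\n",
               check_indirect(&ib, blocks, 8) ? "hold" : "violated");
        return 0;
    }

Whether a real filesystem could afford an owner field per data block is a separate question; the point is only that each check touches a block and its owner, nothing global.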
Posted Apr 3, 2012 18:05 UTC (Tue) by dtlin (subscriber, #36537)
Posted Apr 4, 2012 8:05 UTC (Wed) by dgm (subscriber, #49227)
Posted Apr 3, 2012 18:51 UTC (Tue) by dcg (subscriber, #9198) (4 responses)
Posted Apr 4, 2012 6:58 UTC (Wed) by hisdad (subscriber, #5375) (2 responses)
Not for me. I have an Ml110G6 with P212 raid controller.
That should be either all ok or all fail, right? Nope. One disk had a part fail and that somehow got through the controller and corrupted the file systems. btrfsck happily tells me there are errors. When I look for the "-f" it just smiles at me. The ext4 partitions were recoverable.
Oh yes we need better metadata, but also background scrubs.
--Dad
Posted Apr 7, 2012 15:24 UTC (Sat) by dcg (subscriber, #9198)
Posted Apr 13, 2012 6:04 UTC (Fri) by Duncan (guest, #6647)
IOW, as both the wiki and kernel options clearly state or from an even perfunctory scan of recent list posting, btrfs is clearly experimental and only suitable for test data that is either entirely throwaway or is available elsewhere ATM. While a responsible admin will always have backups for data even on production quality filesystems, there, backups are backups, the primary copy can be the copy on the working filesystem. But things are rather different for btrfs or any filesystem at its development state, where the primary copy should be thought of as existing on some non-btrfs volume, backed up as should be any data of value, and anything placed on btrfs is purely testing data -- if it happens to be there the next time you access it, particularly after the next umount/mount cycle, great, otherwise, you have a bug and as a tester of an experimental filesystem, should be following the development list close enough to know whether it has been reported or not yet, and a responsibility to try to trace and report it, including running various debug patches, etc, as may be requested for tracing it down. Otherwise, if you're not actively testing it, what are you doing running such an experimental and in active development filesystem in the first place?
So /of/ /course/ btrfs is not likely to be more reliable at this point. It's still experimental, and while it is /beginning/ to stabilize, development activity and debugging is still very high, features are still being added, it only recently got a writing fsck at all and that's still in dangerdonteveruse for a reason. As a volunteer tester of a filesystem in that state, if it's as stable as a mature filesystem for you, you probably simply aren't pushing it hard enough in your tests! =:^)
But of course every filesystem had a point at which it was in a similar state, and unlike many earlier filesystems, btrfs has been designed from the ground up with data/metadata checksumming and other reliability features as primary design features, so by the time it reaches maturity and that experimental label comes off for mainline (actually, arguably before that as kernel folks are generally quite conservative about removing such labels), it should be well beyond ext* in terms of reliability, at least when run with the defaults (many of the features including checksumming can be turned off, it can be run without the default metadata duplication, etc).
So while btrfs as a still experimental filesystem fit only for testing /of/ /course/ doesn't yet have the reliability of ext3/4, it will get there, and then surpass them... until at a similar maturity to what they are now, it should indeed be better than they are now, or even than they will be then, for general purpose usage reliability, anyway.
Duncan (who has been following the btrfs list for a few weeks now, and all too often sees posts from folks wondering how to recover data... from a now unmountable but very publicly testing-quality-only filesystem that never should have had anything but throwaway quality testing data on it in the first place.)
Posted Apr 4, 2012 13:21 UTC (Wed) by ashvin (guest, #83894)
Does anybody know if they have tested Recon on flash-based filesystems? For embedded devices, the JFFS2/UBIFS filesystems are of interest. Anybody who follows the mtd mailing list knows there are many instances where the filesystem does become corrupted; it would be nice to know at run time, rather than finding out that the filesystem is corrupted when the device boots!
Posted Apr 4, 2012 20:47 UTC (Wed) by travelsn (guest, #48694)
Posted Apr 5, 2012 2:36 UTC (Thu) by asj (subscriber, #74238)
Posted May 30, 2012 11:50 UTC (Wed) by aigarius (subscriber, #7329)