
Responding to ext4 journal corruption

By Jonathan Corbet
May 27, 2008
Last week's article on barriers described one way in which things could go wrong with journaling filesystems. Therein, it was noted that the journal checksum feature added to the ext4 filesystem would mitigate some of those problems by preventing the replay of the journal if it had not been completely written before a crash. As a discussion this week shows, though, the situation is not quite that simple.

Ted Ts'o was doing some ext4 testing when he noticed a problem with how the journal checksum is handled. The journal will normally contain several transactions which have not yet been fully played into the filesystem. Each one of those transactions includes a commit record which contains, among other things, a checksum for the transaction. If the checksum matches the actual transaction data in the journal, the system knows that the transaction was written completely and without errors; it should thus be safe to replay the transaction into the filesystem.

The problem that Ted noticed was this: if a transaction in the middle of the series failed to match its checksum, the playback of the journal would stop - but only after writing the corrupted transaction into the filesystem. This is a sort of worst-of-all-worlds scenario: the kernel will dump data which is known to be corrupt into the filesystem, then silently throw away the (presumably good) transactions after the bad one. The ext4 developers quickly arrived at a consensus that this behavior is a bug which should be fixed.
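
To make the mechanics concrete, here is a small, self-contained sketch (in C, with invented structures and a toy CRC rather than the actual jbd2 on-disk format or API) of a replay loop that verifies each transaction's commit-record checksum before any of its blocks touch the filesystem, then stops at the first mismatch; the bug Ted found amounted to performing those two steps in the wrong order.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct txn {
    unsigned int  nblocks;              /* blocks in this transaction      */
    unsigned long target[8];            /* where each block belongs        */
    unsigned char data[8][BLOCK_SIZE];  /* journalled copies of the blocks */
    uint32_t      commit_checksum;      /* checksum from the commit record */
};

/* Simple CRC32 (reflected, polynomial 0xEDB88320) over a buffer. */
static uint32_t crc32_buf(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return ~crc;
}

static uint32_t txn_checksum(const struct txn *t)
{
    uint32_t crc = 0;

    for (unsigned int i = 0; i < t->nblocks; i++)
        crc = crc32_buf(crc, t->data[i], BLOCK_SIZE);
    return crc;
}

/* Stand-in for writing a block back into the filesystem proper. */
static void write_block(unsigned long blocknr, const unsigned char *data)
{
    printf("replaying block %lu\n", blocknr);
    (void)data;
}

/*
 * Replay the journal in order.  The checksum is verified *before* any
 * of a transaction's blocks touch the filesystem; the bug described
 * above amounted to doing those two steps the other way around.
 */
static int replay_journal(struct txn *txns, int ntxns)
{
    for (int i = 0; i < ntxns; i++) {
        if (txn_checksum(&txns[i]) != txns[i].commit_checksum) {
            fprintf(stderr, "transaction %d: bad checksum, stopping\n", i);
            return -1;           /* later transactions are discarded too */
        }
        for (unsigned int j = 0; j < txns[i].nblocks; j++)
            write_block(txns[i].target[j], txns[i].data[j]);
    }
    return 0;
}

int main(void)
{
    static struct txn t = { .nblocks = 1, .target = { 1234 } };

    memset(t.data[0], 0xab, BLOCK_SIZE);
    t.commit_checksum = txn_checksum(&t);    /* a "good" transaction */
    return replay_journal(&t, 1);
}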

But what should really be done is not as clear as one might think. Ted's suggestion was this:

So I think the right thing to do is to replay the *entire* journal, including the commits with the failed checksums (except in the case where journal_async_commit is enabled and the last commit has a bad checksum, in which case we skip the last transaction). By replaying the entire journal, we don't lose any of the revoke blocks, which is critical in making sure we don't overwrite any data blocks, and replaying subsequent metadata blocks will probably leave us in a much better position for e2fsck to be able to recover the filesystem.

A bit of background might help in understanding the problem that Ted is trying to solve here. In the default data=ordered mode, ext3 and ext4 do not write all data to the journal before it goes to the filesystem itself. Instead, only filesystem metadata goes to the journal; data blocks are written directly to the filesystem. The "ordered" part means that all of the data blocks will be written before the filesystem code will start writing the metadata; in this way, the metadata will always describe a complete and correct filesystem.
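
The ordering can be summarized in a minimal sketch; the function names below are made-up stand-ins (simple printf() stubs), not the real jbd/jbd2 interfaces.

#include <stdio.h>

struct transaction { int id; };

static void write_data_blocks(struct transaction *t)
{ printf("T%d: data blocks -> final location on disk\n", t->id); }

static void wait_for_data(struct transaction *t)
{ printf("T%d: wait until that data is actually on the platter\n", t->id); }

static void journal_metadata(struct transaction *t)
{ printf("T%d: metadata blocks -> journal\n", t->id); }

static void write_commit_record(struct transaction *t)
{ printf("T%d: commit record -> journal\n", t->id); }

/*
 * The point of "ordered": by the time the journalled metadata can be
 * replayed, every data block it refers to is already in place, so the
 * metadata always describes a complete filesystem.
 */
static void commit_ordered(struct transaction *t)
{
    write_data_blocks(t);
    wait_for_data(t);
    journal_metadata(t);
    write_commit_record(t);
}

int main(void)
{
    struct transaction t = { .id = 1 };

    commit_ordered(&t);
    return 0;
}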

Now imagine a journal which contains a set of transactions similar to these (in this order):

  1. A file is created, with its associated metadata.

  2. That file is then deleted, and its metadata blocks are released.

  3. Some other file is extended, with the newly-freed metadata blocks being reused as data blocks.

Imagine further that the system crashes with those transactions in the journal, but transaction 2 is corrupt. Simply skipping the bad transaction and replaying transaction 3 would lead to the filesystem being most confused about the status of the reused blocks. But just stopping at the corrupt transaction also has a problem: the data blocks created in transaction 3 may have already been written, but, as of transaction 1, the filesystem thinks those are metadata blocks. That, too, leads to a corrupt filesystem. By replaying the entire journal, Ted hopes to catch situations like that and leave the filesystem in an overall better shape.
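
The following toy simulation (block numbers, journal encoding, and policies are all invented for illustration) compares three replay policies on that scenario. The interesting detail is the revoke record carried by the corrupt transaction 2 - exactly what Ted's quote above is worried about losing.

#include <stdio.h>
#include <string.h>

#define BLK 42   /* metadata in T1, freed in T2, reused as data in T3 */

struct txn {
    const char *desc;
    int corrupt;           /* does the commit checksum fail?           */
    int journals_blk;      /* does it carry a journalled copy of BLK?  */
    const char *blk_copy;  /* ...and if so, a copy of what contents    */
    int revokes_blk;       /* does it carry a revoke record for BLK?   */
};

static const struct txn journal[] = {
    { "T1: create file, BLK allocated as metadata", 0, 1, "old metadata", 0 },
    { "T2: delete file, BLK freed and revoked",     1, 0, NULL,           1 },
    { "T3: reuse BLK as data for another file",     0, 0, NULL,           0 },
};

/*
 * Replay under a given policy.  'scan_corrupt' says whether corrupt
 * transactions are still scanned for revoke records (Ted's suggestion);
 * 'stop_at_corrupt' models "stop at the first bad checksum".
 */
static void replay(int scan_corrupt, int stop_at_corrupt)
{
    /* In data=ordered mode, T3's file data already reached the disk. */
    char on_disk[32] = "new file data";
    int revoked = 0;
    int last = 3;

    /* Pass 1: collect revoke records from the transactions we trust. */
    for (int i = 0; i < 3; i++) {
        if (journal[i].corrupt) {
            if (stop_at_corrupt) {
                last = i;
                break;
            }
            if (!scan_corrupt)
                continue;
        }
        if (journal[i].revokes_blk)
            revoked = 1;
    }

    /*
     * Pass 2: replay journalled block copies, honouring the revokes.
     * (Replaying a corrupt transaction's other blocks is the risk
     * raised just below.)
     */
    for (int i = 0; i < last; i++) {
        if (journal[i].corrupt && !scan_corrupt)
            continue;
        if (journal[i].journals_blk && !revoked)
            strcpy(on_disk, journal[i].blk_copy);
    }
    printf("  block %d ends up containing: \"%s\"\n", BLK, on_disk);
}

int main(void)
{
    printf("stop at the corrupt transaction:\n");
    replay(0, 1);
    printf("skip the corrupt transaction, replay the rest:\n");
    replay(0, 0);
    printf("scan the whole journal, honouring its revoke records:\n");
    replay(1, 0);
    return 0;
}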

It is, perhaps, not surprising that there was some disagreement with this approach. Andreas Dilger argued:

The whole point of this patch was to avoid the case where random garbage had been written into the journal and then splattering it all over the filesystem. Considering that the journal has the highest density of important metadata in the filesystem, it is virtually impossible to get more serious corruption than in the journal.

The next proposal was to make a change to the on-disk journal format ("one more time") turning the per-transaction checksum into a per-block checksum. Then it would be possible to get a handle on just how bad any corruption is, and even corrupt transactions could be mostly replayed. As of this writing, that looks like the approach which will be taken.
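
A sketch of what per-block checksums buy: if each journalled block carries its own checksum, replay can skip just the damaged blocks, report them, and leave the rest for e2fsck. The tag layout and the toy checksum below are invented for illustration; they are not the format that was eventually adopted.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct journal_tag {
    unsigned long target;               /* final location of the block */
    uint32_t      checksum;             /* checksum of this block only */
};

static uint32_t checksum_block(const unsigned char *data)
{
    /* Toy checksum; a real implementation would use crc32c or similar. */
    uint32_t sum = 0;

    for (size_t i = 0; i < BLOCK_SIZE; i++)
        sum = sum * 31 + data[i];
    return sum;
}

static void write_block(unsigned long blocknr, const unsigned char *data)
{
    printf("replaying block %lu\n", blocknr);
    (void)data;
}

/*
 * With a checksum per block, a single damaged block no longer forces a
 * choice between discarding or blindly replaying the whole transaction:
 * the good blocks are replayed, the bad ones are skipped and reported.
 */
static int replay_transaction(const struct journal_tag *tags,
                              unsigned char (*blocks)[BLOCK_SIZE],
                              int nblocks)
{
    int bad = 0;

    for (int i = 0; i < nblocks; i++) {
        if (checksum_block(blocks[i]) != tags[i].checksum) {
            fprintf(stderr, "block %lu: bad checksum, skipped\n",
                    tags[i].target);
            bad++;
            continue;
        }
        write_block(tags[i].target, blocks[i]);
    }
    return bad;      /* caller can force an fsck if this is non-zero */
}

int main(void)
{
    static unsigned char blocks[2][BLOCK_SIZE];
    struct journal_tag tags[2];

    memset(blocks[0], 0x11, BLOCK_SIZE);
    memset(blocks[1], 0x22, BLOCK_SIZE);
    tags[0] = (struct journal_tag){ 100, checksum_block(blocks[0]) };
    tags[1] = (struct journal_tag){ 200, checksum_block(blocks[1]) + 1 };

    return replay_transaction(tags, blocks, 2) ? 1 : 0;
}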

Arguably, the real conclusion to take from this discussion was best expressed by Arjan van de Ven in an entirely different context: "having a journal is soooo 1999". The Btrfs filesystem, which has a good chance of replacing ext3 and ext4 a few years from now, does not have a journal; instead, it uses its fast snapshot mechanism to keep transactions consistent. Btrfs may, thus, avoid some of the problems that come with journaling - though, perhaps, at the cost of introducing a set of interesting new problems.



Responding to ext4 journal corruption

Posted May 29, 2008 13:23 UTC (Thu) by nix (subscriber, #2304) [Link]

Writing garbage into the journal is quite easy, too. All it takes is for 
the disk to forget a single seek after a legitimate journal write, and 
it'll write something into the journal which was supposed to go elsewhere. 
(I've seen this on disks running live systems on ext3 for huge banks. The 
banks were not very happy, because the sysadmins simply unplugged the disk 
array after the disk errors: so the filesystem was unclean, the journal 
was replayed... and oops, that's sprayed quite a lot of garbage into the 
fs, because a multimegabyte logfile write had landed in the journal, and 
all of that was misinterpreted as metadata. That specific case, in which 
the blocks look nothing like journal blocks at all, was plugged in 
e2fsprogs 1.40, but the bank was using a version of RHEL that was still on 
1.35...)
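
The check described here boils down to refusing to interpret blocks that do not look like journal blocks at all. A sketch of that kind of sanity check follows; the header is re-declared here for illustration (it mirrors jbd's magic/blocktype/sequence layout, but treat the details as approximate).

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>          /* ntohl(), for the big-endian fields */

#define JBD_MAGIC_NUMBER 0xc03b3998u

struct journal_header {         /* magic / block type / sequence      */
    uint32_t h_magic;
    uint32_t h_blocktype;
    uint32_t h_sequence;
};

/*
 * A descriptor, commit or revoke block carrying the wrong magic, or a
 * sequence number that does not match the transaction being recovered,
 * is almost certainly stray data (like the misdirected logfile write
 * described above) and must not be interpreted as journal metadata.
 */
static int looks_like_journal_block(const void *block, uint32_t expected_seq)
{
    struct journal_header h;

    memcpy(&h, block, sizeof(h));
    if (ntohl(h.h_magic) != JBD_MAGIC_NUMBER)
        return 0;
    if (ntohl(h.h_sequence) != expected_seq)
        return 0;
    return 1;
}

int main(void)
{
    /* A block full of ASCII log text: no magic, so it is rejected. */
    unsigned char stray[4096] = "May 27 00:00:01 host kernel: ...";

    printf("stray logfile block accepted? %s\n",
           looks_like_journal_block(stray, 7) ? "yes" : "no");
    return 0;
}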

Responding to ext4 journal corruption

Posted May 30, 2008 6:03 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246) [Link]

  1. A file is created, with its associated metadata.
  2. That file is then deleted, and its metadata blocks are released.
  3. Some other file is extended, with the newly-freed metadata blocks being reused as data blocks.

It seems that if you defer releasing metadata blocks in the in-memory notion of "available space" until the transaction releasing them is well and truly committed (rather than "sent to the journal"), you prevent '3' from ever happening.

In fact, the general issue seems to be related to storage repurposing. For example, consider blocks freed from file A get allocated to file B. If data for B gets written to those blocks but the transactions reassigning those blocks get corrupted across a crash, then file A would hold contents intended for file B.

Thus, it seems prudent in data=ordered mode to prevent the allocator from reallocating recently freed blocks until the metadata indicating that those blocks are actually free is actually committed. I have no idea how difficult to implement that might be, but it is something that only needs to be tracked in the in-memory notion of "available space."

Will this degrade the quality of allocations? It might for nearly full filesystems or filesystems with a lot of churn, but for filesystems that are far from full, I doubt it would have any measurable impact whatsoever. There will be some pool of blocks from files recently getting deleted or truncated that won't be available for reallocation immediately.

Anyone see any holes in this?
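
A minimal sketch of the pinning scheme described in the comment above, with all names and structures invented: blocks freed by a transaction stay on an in-memory pinned list and only return to the allocator once that transaction's commit is known to be on disk.

#include <stdio.h>
#include <stdlib.h>

struct pinned_block {
    unsigned long blocknr;
    unsigned long freed_in_tid;        /* transaction that freed it   */
    struct pinned_block *next;
};

static struct pinned_block *pinned_list;
static unsigned long last_committed_tid;   /* highest commit on disk  */

/* A transaction frees a block: pin it in memory instead of returning
 * it to the free-space map right away. */
static void pin_freed_block(unsigned long blocknr, unsigned long tid)
{
    struct pinned_block *p = malloc(sizeof(*p));

    if (!p)
        abort();
    p->blocknr = blocknr;
    p->freed_in_tid = tid;
    p->next = pinned_list;
    pinned_list = p;
}

/* Stand-in for marking a block allocatable in the in-memory map. */
static void mark_block_allocatable(unsigned long blocknr)
{
    printf("block %lu may be allocated again\n", blocknr);
}

/* Called once transaction 'tid' is well and truly committed on disk:
 * hand every block it pinned back to the allocator. */
static void transaction_committed(unsigned long tid)
{
    struct pinned_block **pp = &pinned_list;

    last_committed_tid = tid;
    while (*pp) {
        if ((*pp)->freed_in_tid <= last_committed_tid) {
            struct pinned_block *done = *pp;

            *pp = done->next;
            mark_block_allocatable(done->blocknr);
            free(done);
        } else {
            pp = &(*pp)->next;
        }
    }
}

int main(void)
{
    pin_freed_block(42, 2);        /* T2 frees block 42                */
    /* ... the allocator cannot hand block 42 to T3 yet ...            */
    transaction_committed(2);      /* T2's commit record is on disk    */
    return 0;
}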

Responding to ext4 journal corruption

Posted May 30, 2008 13:43 UTC (Fri) by masoncl (subscriber, #47138) [Link]

Reiserfsv3 and jbd both use write-ahead logging schemes, and so they solve very similar
problems.  Reiserfs keeps track in ram of which blocks are pinned and not available for
allocation, while jbd uses these revoke records.

Keeping track in ram has performance implications, but it is certainly possible.

Responding to ext4 journal corruption

Posted Jun 7, 2008 17:20 UTC (Sat) by Duncan (guest, #6647) [Link]

That's probably one of the big reasons I've found reiserfs (3) so stable 
here, at least after ordered-by-default hit the tree.  I ran a system for 
some time with an annoying memory bus error issue (generic memory rated a 
speed notch higher than it should have been, a BIOS update eventually let 
me limit memory speed by a notch, after which it was absolutely stable) 
that would crash the system with MCE errors relatively frequently.  100% 
reiserfs, no UPS, no problems after ordered-by-default, tho I certainly 
had some previous to that.

I'm running the same system but with a memory and CPU upgrade now, and 
with reiserfs on mdp/kernel RAID-6, system directly on one RAID-6 
partition (with a second for a backup system image), everything else on 
LVM2 on another one.  Despite the lack of barriers on the stack as 
explained in last week's barrier article, and despite continuing to run 
without a UPS and having occasional power outages that often trigger a 
RAID-6 rebuild, I've been VERY happy with system integrity.

Duncan

Responding to ext4 journal corruption

Posted Jun 1, 2008 22:09 UTC (Sun) by jlokier (guest, #52227) [Link]

Another way, which doesn't pin blocks and prevent their reallocation, is to keep track of
dependencies in the journal: transaction 3 _depends_ on transaction 2, because it uses blocks
which are repurposed in transaction 2.  So there should be a note in transaction 3 saying "I
depend on T2".

During replay, if transaction 2 fails due to bad checksum, transaction 3 will be rejected due
to the dependency.  Transaction 4 may be fine, etc.

(The same dependencies can be converted to finer-grained barriers too - e.g. to optimise ext4
on software RAID.)

Some RAM is needed to keep track of the dependencies, until commits are known to have hit the
platters.  If it's a problem, this can be bounded with some hashed approximation akin to a
Bloom filter.
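
A sketch of the dependency idea, with an invented encoding: each transaction lists the earlier transactions that repurposed blocks it uses, and replay rejects any transaction whose dependency (directly or transitively) failed its checksum. The Bloom-filter-style bound mentioned above would only affect how the dependencies are stored, not the replay decision.

#include <stdio.h>

struct txn {
    int checksum_ok;
    int ndeps;
    int deps[4];      /* indices of earlier transactions we depend on */
};

int main(void)
{
    /* T0 is fine, T1 is corrupt, T2 reuses blocks repurposed in T1
     * (so it depends on T1), T3 is independent and fine. */
    struct txn journal[4] = {
        { .checksum_ok = 1 },
        { .checksum_ok = 0 },
        { .checksum_ok = 1, .ndeps = 1, .deps = { 1 } },
        { .checksum_ok = 1 },
    };
    int usable[4];

    for (int i = 0; i < 4; i++) {
        usable[i] = journal[i].checksum_ok;
        for (int d = 0; d < journal[i].ndeps; d++)
            if (!usable[journal[i].deps[d]])
                usable[i] = 0;   /* a dependency failed: reject this one too */

        printf("T%d: %s\n", i, usable[i] ? "replay" : "reject");
    }
    return 0;
}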

Responding to ext4 journal corruption

Posted Jun 9, 2008 3:26 UTC (Mon) by efexis (guest, #26355) [Link]

This is what first came to my mind, but if the data has been written while the metadata saying
what that data is gets discarded, the new data could be misinterpreted as whatever the previous
metadata said it was (such as believing it to be more metadata pointing to blocks on the disk
when it's actually an image). I guess the solution here would be to zero any pointers to the
metadata first (or to set a 'corrupt' or 'deleted' flag on the metadata sector itself) and
make sure that has reached the disk before writing the data. Of course this can slow things
down, as you have to write to the metadata block an extra time per update.

I think the snapshotting way is the only way forward; if you never get rid of something until
you are certain the new one works (i.e., has completely reached the disc), then it doesn't matter
what you do or when... you'll always have at least one working version. Large writes would start
failing when your disc is nearly full, but with today's drive sizes, we're more concerned with
losing 500G of data than with filling it.


Responding to ext4 journal corruption

Posted Jun 5, 2008 18:35 UTC (Thu) by anton (guest, #25547) [Link]

I believe in the superiority of copy-on-write file systems over journaling file systems, but problems such as the one discussed can happen in copy-on-write file systems like Btrfs, too, unless they are carefully implemented; i.e., they must not reuse freed blocks until one or two checkpoints have made it to the disk (two, if you want to survive if the last checkpoint becomes unreadable).

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds