LWN: Comments on "Responding to ext4 journal corruption" https://lwn.net/Articles/284037/ This is a special feed containing comments posted to the individual LWN article titled "Responding to ext4 journal corruption". en-us Tue, 07 Oct 2025 13:27:14 +0000 Tue, 07 Oct 2025 13:27:14 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Responding to ext4 journal corruption https://lwn.net/Articles/285351/ https://lwn.net/Articles/285351/ efexis <div class="FormattedComment"><pre> This is what first came to my mind, but if data has been written, but the metadata saying what this data is gets discarded, the new data could be misinterpreted as what the previous metadata said it was (such as believing it to be more metadata pointing to blocks on the disk, when it's actually an image). I guess the solution here would be to zero any pointers to metadata first (or to set a 'corrupt' or 'deleted' flag on the metadata sector itself) and to make sure that's reached the disk before writing the data. Of course this can slow things down, as you have to write to the metadata block an extra time per update. I think the snapshotting way is the only way forward; if you never get rid of something until you are certain the new one works (i.e., has completely reached the disk), then it doesn't matter what you do or when... you'll always have at least one working version. Large writes would start failing when your disk is nearing full, but with today's drive sizes, we're more concerned with losing 500G of data than filling it. </pre></div> Mon, 09 Jun 2008 03:26:14 +0000 Responding to ext4 journal corruption https://lwn.net/Articles/285314/ https://lwn.net/Articles/285314/ Duncan <div class="FormattedComment"><pre> That's probably one of the big reasons I've found reiserfs (3) so stable here, at least after ordered-by-default hit the tree.
I ran a system for some time with an annoying memory bus error issue (generic memory rated one speed notch higher than it should have been; a BIOS update eventually let me limit the memory speed by a notch, after which it was absolutely stable) that would crash the system with MCE errors relatively frequently. 100% reiserfs, no UPS, no problems after ordered-by-default, though I certainly had some previously. I'm running the same system but with a memory and CPU upgrade now, and with reiserfs on mdp/kernel RAID-6, system directly on one RAID-6 partition (with a second for a backup system image), everything else on LVM2 on another one. Despite the lack of barriers on the stack as explained in last week's barrier article, and despite continuing to run without a UPS and having occasional power outages that often trigger a RAID-6 rebuild, I've been VERY happy with system integrity. Duncan </pre></div> Sat, 07 Jun 2008 17:20:37 +0000 Responding to ext4 journal corruption https://lwn.net/Articles/285143/ https://lwn.net/Articles/285143/ anton <a rel="nofollow" href="http://www.complang.tuwien.ac.at/anton/lfs/">I believe in the superiority of copy-on-write file systems</a> over journaling file systems, but problems such as the one discussed can happen in copy-on-write file systems like Btrfs, too, unless they are carefully implemented; i.e., they must not reuse freed blocks until one or two checkpoints have made it to the disk (two, if you want to survive the last checkpoint becoming unreadable). Thu, 05 Jun 2008 18:35:56 +0000 Responding to ext4 journal corruption https://lwn.net/Articles/284621/ https://lwn.net/Articles/284621/ jlokier <div class="FormattedComment"><pre> Another way, which doesn't pin blocks and prevent their reallocation, is to keep track of dependencies in the journal: transaction 3 _depends_ on transaction 2, because it uses blocks which are repurposed in transaction 2. So there should be a note in transaction 3 saying "I depend on T2".
During replay, if transaction 2 fails due to a bad checksum, transaction 3 will be rejected due to the dependency. Transaction 4 may be fine, etc. (The same dependencies can be converted to finer-grained barriers too - e.g. to optimise ext4 on software RAID.) Some RAM is needed to keep track of the dependencies, until commits are known to have hit the platters. If it's a problem, this can be bounded with some hashed approximation akin to a Bloom filter. </pre></div> Sun, 01 Jun 2008 22:09:57 +0000 Responding to ext4 journal corruption https://lwn.net/Articles/284522/ https://lwn.net/Articles/284522/ masoncl <div class="FormattedComment"><pre> Reiserfsv3 and jbd both use write-ahead logging schemes, and so they solve very similar problems. Reiserfs keeps track in RAM of which blocks are pinned and not available for allocation, while jbd uses these revoke records. Keeping track in RAM has performance implications, but it is certainly possible. </pre></div> Fri, 30 May 2008 13:43:34 +0000 Responding to ext4 journal corruption https://lwn.net/Articles/284493/ https://lwn.net/Articles/284493/ jzbiciak <Blockquote><I><OL><LI>A file is created, with its associated metadata.</LI> <LI>That file is then deleted, and its metadata blocks are released.</LI> <LI>Some other file is extended, with the newly-freed metadata blocks being reused as data blocks.</LI></OL></I></BLOCKQUOTE> <P> It seems that if you defer releasing metadata blocks in the in-memory notion of "available space" until the transaction releasing them is well and truly committed (rather than merely "sent to the journal"), you prevent '3' from ever happening.</P> <P>In fact, the general issue seems to be related to storage repurposing. For example, consider blocks freed from file A that get allocated to file B.
If data for B gets written to those blocks but the transactions reassigning those blocks get corrupted across a crash, then file A would hold contents intended for file B.</P> <P>Thus, it seems prudent in<TT> data=ordered </TT>mode to prevent the allocator from reallocating recently freed blocks until the metadata indicating that those blocks are actually free is actually committed. I have no idea how difficult that might be to implement, but it <I>is</I> something that only needs to be tracked in the in-memory notion of "available space."</P> <P>Will this degrade the quality of allocations? It might for nearly full filesystems or filesystems with a lot of churn, but for filesystems that are far from full, I doubt it would have any measurable impact whatsoever. There will be some pool of blocks from recently deleted or truncated files that won't be available for reallocation immediately.</P> <P>Anyone see any holes in this?</P> Fri, 30 May 2008 06:03:26 +0000 Responding to ext4 journal corruption https://lwn.net/Articles/284313/ https://lwn.net/Articles/284313/ nix <div class="FormattedComment"><pre> Writing garbage into the journal is quite easy, too. All it takes is for the disk to forget a single seek after a legitimate journal write, and it'll write something into the journal which was supposed to go elsewhere. (I've seen this on disks running live systems on ext3 for huge banks. The banks were not very happy, because the sysadmins simply unplugged the disk array after the disk errors: so the filesystem was unclean, the journal was replayed... and oops, that's sprayed quite a lot of garbage into the fs, because a multimegabyte logfile write had landed in the journal, and all of that was misinterpreted as metadata. That specific case, in which the blocks look nothing like journal blocks at all, was plugged in e2fsprogs 1.40, but the bank was using a version of RHEL that was still on 1.35...) </pre></div> Thu, 29 May 2008 13:23:34 +0000
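[Editor's note] jlokier's journal-dependency scheme above can be sketched in a few lines. This is a toy model only, not jbd or ext4 code; the journal-entry layout (`txn_id`, `checksum_ok`, `depends_on`) is invented for illustration, assuming each transaction records the IDs of earlier transactions whose blocks it repurposes.

```python
def replay(journal):
    """Replay transactions in journal order.

    A transaction is skipped if its checksum fails, or if it depends
    (directly or transitively) on a transaction that was skipped --
    jlokier's "I depend on T2" note.  Each entry is a tuple
    (txn_id, checksum_ok, depends_on), with depends_on a set of
    earlier transaction IDs.
    """
    failed = set()   # transactions rejected so far
    applied = []     # transactions safely replayed
    for txn_id, checksum_ok, depends_on in journal:
        if not checksum_ok or depends_on & failed:
            # Poison this transaction and, via `failed`, everything
            # downstream that depends on it.
            failed.add(txn_id)
            continue
        applied.append(txn_id)
    return applied
```

With a corrupted transaction 2, transaction 3 (which depends on it) is rejected while the independent transaction 4 still replays, matching the behaviour described in the comment.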
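[Editor's note] jzbiciak's deferred-reallocation proposal can likewise be sketched as an in-memory allocator that refuses to hand freed blocks back out until the journal layer reports the freeing transaction as durably committed. All class and method names here are hypothetical; real allocators track extents and bitmaps, not Python sets.

```python
class DeferredAllocator:
    """Toy block allocator: blocks freed in a transaction stay
    unallocatable until that transaction is durably committed."""

    def __init__(self, n_blocks):
        self.free = set(range(n_blocks))  # allocatable right now
        self.pending = {}                 # txn id -> blocks freed in it

    def alloc(self):
        if not self.free:
            raise RuntimeError("no committed-free blocks available")
        return self.free.pop()

    def release(self, txn, block):
        # Freed in `txn`, but not reusable until `txn` hits the platters.
        self.pending.setdefault(txn, set()).add(block)

    def committed(self, txn):
        # Journal layer reports `txn` durably committed:
        # its freed blocks may now be repurposed safely.
        self.free |= self.pending.pop(txn, set())
```

A block released under transaction 1 cannot be handed out again until `committed(1)` is called, which is the property that prevents step '3' of the quoted scenario from repurposing not-yet-committed metadata blocks.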