
Responding to ext4 journal corruption

Posted May 30, 2008 6:03 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246)
Parent article: Responding to ext4 journal corruption

  1. A file is created, with its associated metadata.
  2. That file is then deleted, and its metadata blocks are released.
  3. Some other file is extended, with the newly-freed metadata blocks being reused as data blocks.

It seems that if you defer releasing metadata blocks in the in-memory notion of "available space" until the transaction releasing them is well and truly committed (rather than merely "sent to the journal"), you prevent step 3 from ever happening.

In fact, the general issue seems to be related to storage repurposing. For example, consider blocks freed from file A that get allocated to file B. If data for B gets written to those blocks but the transactions reassigning those blocks get corrupted across a crash, then file A would hold contents intended for file B.

Thus, it seems prudent in data=ordered mode to prevent the allocator from reallocating recently freed blocks until the metadata marking those blocks as free has actually been committed. I have no idea how difficult that might be to implement, but it is something that only needs to be tracked in the in-memory notion of "available space."
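
To make the idea concrete, here is a minimal sketch of such an allocator. It is only a toy model of the in-memory "available space" bookkeeping, not ext4 or jbd code; the class name, the method names, and the on_commit() hook are all hypothetical, and it assumes the journal can tell the allocator when a given transaction is durably committed.

```python
class DeferredFreeAllocator:
    """Toy in-memory allocator that withholds freed blocks until commit."""

    def __init__(self, total_blocks):
        self.free = set(range(total_blocks))  # blocks safe to hand out
        self.pending = {}                     # tid -> blocks freed in that transaction

    def allocate(self):
        # Only blocks whose freeing transaction has committed are candidates.
        if not self.free:
            raise RuntimeError("no allocatable blocks (some may be pending commit)")
        block = min(self.free)
        self.free.remove(block)
        return block

    def release(self, block, tid):
        # Record the free against its transaction; the block is not reusable yet.
        self.pending.setdefault(tid, set()).add(block)

    def on_commit(self, tid):
        # Hypothetical hook: called once transaction `tid` is durably on disk.
        self.free |= self.pending.pop(tid, set())
```

A block passed to release() stays out of the free set until on_commit() is called for its transaction, so a nearly full filesystem could see allocation fail even though blocks are about to become free. That is the allocation-quality cost in its simplest form.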

Will this degrade the quality of allocations? It might for nearly full filesystems or filesystems with a lot of churn, but for filesystems that are far from full, I doubt it would have any measurable impact whatsoever. There will simply be some pool of blocks from recently deleted or truncated files that won't be available for reallocation immediately.

Anyone see any holes in this?



Responding to ext4 journal corruption

Posted May 30, 2008 13:43 UTC (Fri) by masoncl (subscriber, #47138)

Reiserfs v3 and jbd both use write-ahead logging schemes, and so they solve very similar
problems.  Reiserfs keeps track in RAM of which blocks are pinned and not available for
allocation, while jbd uses these revoke records.

Keeping track in RAM has performance implications, but it is certainly possible.
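
For readers unfamiliar with revoke records, here is a rough toy model of how they affect replay. It is a simplified sketch, not the jbd implementation: the dictionary layout and the replay() function are invented for illustration, and it assumes every transaction passed in is intact.

```python
def replay(transactions):
    """Replay intact transactions in journal order.

    Each transaction is a dict with (hypothetical layout):
      "tid"     -> sequence number
      "blocks"  -> {block_number: contents} journaled copies
      "revoked" -> set of block numbers revoked in this transaction
    Returns the {block_number: contents} that replay would write to disk.
    """
    # Pass 1: remember the latest transaction that revoked each block.
    last_revoke = {}
    for txn in transactions:
        for block in txn["revoked"]:
            last_revoke[block] = max(last_revoke.get(block, txn["tid"]), txn["tid"])

    # Pass 2: write journaled copies, skipping any copy that is covered
    # by a revoke record from the same or a later transaction.
    disk = {}
    for txn in transactions:
        for block, contents in txn["blocks"].items():
            if block in last_revoke and txn["tid"] <= last_revoke[block]:
                continue
            disk[block] = contents
    return disk
```

In this model, if transaction 1 journals block 7 and transaction 2 revokes it, replaying both leaves block 7 alone. If transaction 2 is lost, nothing stops the old copy of block 7 from being written back over whatever now lives there.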

Responding to ext4 journal corruption

Posted Jun 7, 2008 17:20 UTC (Sat) by Duncan (guest, #6647)

That's probably one of the big reasons I've found reiserfs (3) so stable 
here, at least after ordered-by-default hit the tree.  I ran a system for 
some time with an annoying memory bus error issue (generic memory rated a 
speed notch higher than it should have been, a BIOS update eventually let 
me limit memory speed by a notch, after which it was absolutely stable) 
that would crash the system with MCE errors relatively frequently.  100% 
reiserfs, no UPS, no problems after ordered-by-default, tho I certainly 
had some previous to that.

I'm running the same system but with a memory and CPU upgrade now, and 
with reiserfs on mdp/kernel RAID-6, system directly on one RAID-6 
partition (with a second for a backup system image), everything else on 
LVM2 on another one.  Despite the lack of barriers on the stack as 
explained in last week's barrier article, and despite continuing to run 
without a UPS and having occasional power outages that often trigger a 
RAID-6 rebuild, I've been VERY happy with system integrity.

Duncan

Responding to ext4 journal corruption

Posted Jun 1, 2008 22:09 UTC (Sun) by jlokier (guest, #52227)

Another way, which doesn't pin blocks and prevent their reallocation, is to keep track of
dependencies in the journal: transaction 3 _depends_ on transaction 2, because it uses blocks
which are repurposed in transaction 2.  So there should be a note in transaction 3 saying "I
depend on T2".

During replay, if transaction 2 fails due to bad checksum, transaction 3 will be rejected due
to the dependency.  Transaction 4 may be fine, etc.
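
A minimal sketch of that replay rule, under some loud assumptions: the "depends_on" note, the "checksum_ok" flag, and the function name are all hypothetical, and every transaction a note refers to is assumed to still be in the journal (nothing has been checkpointed away).

```python
def replay_with_dependencies(transactions):
    """Return the set of tids that replay would apply.

    Each transaction is a dict with (hypothetical layout):
      "tid"         -> sequence number
      "checksum_ok" -> False if the transaction is corrupt
      "depends_on"  -> set of earlier tids whose repurposed blocks it uses
    """
    applied = set()
    for txn in transactions:                  # journal order
        if not txn["checksum_ok"]:
            continue                          # corrupt: reject
        if not txn["depends_on"] <= applied:
            continue                          # a prerequisite was rejected
        applied.add(txn["tid"])
    return applied
```

Because a rejected transaction never enters the applied set, rejection propagates along chains: anything depending on transaction 3 would also be dropped, while an independent transaction 4 still goes through.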

(The same dependencies can be converted to finer-grained barriers too - e.g. to optimise ext4
on software RAID.)

Some RAM is needed to keep track of the dependencies, until commits are known to have hit the
platters.  If it's a problem, this can be bounded with some hashed approximation akin to a
Bloom filter.
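
One way the bounded approximation might look, as a sketch only: each uncommitted transaction keeps a fixed-size Bloom filter of the blocks it repurposed, and a later transaction checks its own blocks against those filters. The class, the sizes, and dependencies_for() are hypothetical. A false positive merely adds an unnecessary dependency; a block that was added is never missed.

```python
import hashlib


class BlockFilter:
    """Fixed-size Bloom filter over block numbers (sizes are arbitrary)."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.field = 0  # a Python int used as a bit array

    def _positions(self, block):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{block}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, block):
        for pos in self._positions(block):
            self.field |= 1 << pos

    def might_contain(self, block):
        return all((self.field >> pos) & 1 for pos in self._positions(block))


def dependencies_for(blocks, filters):
    """filters maps tid -> BlockFilter of blocks that transaction repurposed."""
    return {tid for tid, f in filters.items()
            if any(f.might_contain(b) for b in blocks)}
```

Memory per outstanding transaction is then fixed at `bits` bits no matter how many blocks it touched, and the filter can be dropped once the commit is known to be on the platters.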

Responding to ext4 journal corruption

Posted Jun 9, 2008 3:26 UTC (Mon) by efexis (guest, #26355)

This is what first came to my mind. The catch is that if data has been written but the metadata
describing it gets discarded, the new data could be misinterpreted as whatever the previous
metadata said it was (for example, treated as more metadata pointing to blocks on the disk
when it is actually an image). I guess the solution here would be to zero any pointers to the
metadata first (or set a 'corrupt' or 'deleted' flag on the metadata sector itself) and
make sure that has reached the disk before writing the data. Of course this can slow things
down, as you have to write to the metadata block an extra time per update.

I think the snapshotting way is the only way forward: if you never get rid of something until
you are certain the new version works (i.e., has completely reached the disk), then it doesn't
matter what you do or when; you'll always have at least one working version. Large writes would
start failing when your disk is nearly full, but with today's drive sizes, we're more concerned
with losing 500G of data than with filling the drive.


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds