
Journal-guided RAID resync

By Jonathan Corbet
November 24, 2009
The RAID4, 5, and 6 storage technologies are designed to protect against the failure of a single drive. Blocks of data are spread out across the array and, for each stripe, there is a parity block stored on one of the drives. Should one drive fail, the lost data can be recovered through the use of the remaining drives and the parity information. This mechanism copes less well with system crashes and power failures, though, forcing software RAID administrators to choose between speed and reliability. A new mechanism called journal-guided resynchronization may make life easier, but only if it actually gets into the kernel.

The problem is that data and parity blocks must be updated in an atomic manner; if the two go out of sync, then the RAID array is no longer in a position to recover lost data. Indeed, it could return corrupted data. Expensive hardware RAID solutions use battery backup to ensure that updates are not interrupted partway through, but software RAID solutions often do not have that option. So if the system crashes - or the power fails - in the middle of an update to a RAID volume, that volume could end up being corrupted. Computer users, being a short-sighted kind of people in general, tend to regard this as a Bad Thing.
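
For the single-parity case the arithmetic involved is just bitwise XOR: the parity block is the XOR of the data blocks in the stripe, and a lost block is rebuilt by XORing everything that survives. The tiny user-space sketch below (block size and contents are invented for the demonstration) shows why a data write that lands on disk without its matching parity update is dangerous: a later reconstruction quietly returns the wrong bytes.

    /* Demonstration of RAID-5 style XOR parity and what a missed parity
     * update costs; block size and contents are invented for the demo. */
    #include <stdio.h>
    #include <string.h>

    #define NDATA   3      /* data blocks per stripe */
    #define BLKSIZE 8      /* tiny blocks, just for the demonstration */

    /* parity = XOR of all data blocks in the stripe */
    static void compute_parity(unsigned char data[NDATA][BLKSIZE],
                               unsigned char parity[BLKSIZE])
    {
        memset(parity, 0, BLKSIZE);
        for (int i = 0; i < NDATA; i++)
            for (int j = 0; j < BLKSIZE; j++)
                parity[j] ^= data[i][j];
    }

    /* Rebuild a lost data block from the survivors plus parity. */
    static void reconstruct(unsigned char data[NDATA][BLKSIZE],
                            unsigned char parity[BLKSIZE],
                            int lost, unsigned char out[BLKSIZE])
    {
        memcpy(out, parity, BLKSIZE);
        for (int i = 0; i < NDATA; i++)
            if (i != lost)
                for (int j = 0; j < BLKSIZE; j++)
                    out[j] ^= data[i][j];
    }

    int main(void)
    {
        unsigned char data[NDATA][BLKSIZE] = { "AAAAAAA", "BBBBBBB", "CCCCCCC" };
        unsigned char parity[BLKSIZE], rebuilt[BLKSIZE];

        compute_parity(data, parity);          /* array starts out consistent */

        /* Crash window: the new data block reaches the disk, but the
         * matching parity update is lost. */
        memcpy(data[0], "XXXXXXX", BLKSIZE);

        /* Drive 2 now fails; reconstruction uses the stale parity and
         * returns bytes that were never in block 2. */
        reconstruct(data, parity, 2, rebuilt);
        printf("expected CCCCCCC, got %.7s\n", (char *)rebuilt);
        return 0;
    }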

There are a couple of possible ways of mitigating this risk. One is to perform a full rescan of the RAID volume after a crash, fixing up any partially-updated stripes. The problem here is that (1) the correct fix for an inconsistent stripe may not always be clear, and (2) this process can take a long time. Long enough to cause users to think nostalgically about the days of fast, reliable floppy-disk storage.

An alternative approach is to introduce a type of journaling to the RAID layer. The RAID implementation can set aside some storage where it records stripes (perhaps not the data itself, but just the numbers of the affected stripes) prior to changing the real array. This approach works, and it can recover a crashed RAID array without a full rescan, but there is a cost here too: that journaling can slow down the operation of the array significantly. Writes to the journal must be synchronous or it cannot be counted on to do its job, so write operations become far slower than they were before. Given that, it's not surprising that a lot of RAID administrators turn off RAID-level journaling and spend a lot of time hoping that nothing goes wrong.
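
The ordering requirement is the whole story here: the intent record has to be durable before the stripe itself is touched, which is what makes the journal writes synchronous and expensive. Below is a rough user-space sketch of the idea with an invented record format; it is not md's write-intent bitmap or any real on-disk layout.

    #include <stdint.h>
    #include <unistd.h>

    /* Invented intent record; a real implementation would use a much more
     * compact on-disk representation. */
    struct intent_record {
        uint64_t stripe;    /* stripe number about to be rewritten */
        uint8_t  dirty;     /* 1 = update in flight, 0 = completed */
    };

    /* Record the intent and force it to stable storage *before* any data
     * or parity block in the stripe is modified. */
    static int log_intent(int journal_fd, uint64_t stripe, off_t slot)
    {
        struct intent_record rec = { .stripe = stripe, .dirty = 1 };

        if (pwrite(journal_fd, &rec, sizeof(rec), slot) != (ssize_t)sizeof(rec))
            return -1;
        return fsync(journal_fd);    /* the expensive, synchronous part */
    }

    /* Clearing the record can be lazy: a stale "dirty" entry only costs an
     * unnecessary resync of that stripe after a crash. */
    static int clear_intent(int journal_fd, uint64_t stripe, off_t slot)
    {
        struct intent_record rec = { .stripe = stripe, .dirty = 0 };

        return pwrite(journal_fd, &rec, sizeof(rec), slot) == (ssize_t)sizeof(rec)
            ? 0 : -1;
    }

After a crash, any record still marked dirty names a stripe whose parity must be recomputed; everything else can be skipped.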

A few years ago, Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau published a paper describing a better way, which they called "journal-guided resynchronization." Contemporary filesystems tend to do journaling of their own; why not use the filesystem journal to track changes to the RAID array as well? Running one journal can only be cheaper than running two - especially when one considers that the RAID journal must track, among other things, changes to the filesystem journal. The only problem is that the RAID and filesystem layers communicate through the relatively narrow block-layer API; using filesystem journaling to track RAID-level information has the potential to mix the layers considerably.

Jody McIntyre's journal-guided resync implementation adds a new "declared" mode to the ext3 filesystem. As the journal is being written, a new "declare block" is added describing exactly which blocks are to be written to the storage device. Those blocks are then written with a new BIO flag stating that the filesystem has taken responsibility for resynchronizing the stripe should something go wrong; that lets the storage layer forget about that particular problem. Should the system crash, the filesystem will find those declare blocks in the journal; it can then issue a (new) BIO_SYNCRAID operation asking the storage subsystem to resynchronize the specific stripes containing the listed blocks.
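
Conceptually, the recovery path looks something like the sketch below: walk the journal, collect the block numbers listed in each declare block, and ask the RAID layer to resynchronize the stripes that contain them. The structures and helper names are invented for illustration; they are not the actual jbd or md interfaces from Jody's patches.

    #include <stdint.h>

    /* Invented layout of a declare block's payload. */
    struct declare_block {
        uint32_t count;        /* number of block numbers that follow */
        uint64_t blocks[];     /* filesystem blocks that were about to be
                                  written when this record was journaled */
    };

    /* Hypothetical recovery helper: for one declare block found during
     * journal replay, request a resync of every stripe it touches.  In
     * the real patches this step would amount to issuing BIO_SYNCRAID
     * requests to the storage layer. */
    static void resync_declared(const struct declare_block *db,
                                uint64_t blocks_per_stripe,
                                void (*resync_stripe)(uint64_t stripe))
    {
        for (uint32_t i = 0; i < db->count; i++)
            resync_stripe(db->blocks[i] / blocks_per_stripe);
    }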

The result should be the best of both worlds. The cost of adding one more block to the filesystem journal is far less than doing that journaling at the RAID layer; Jody claims a 3-5% performance hit, as compared to 30% with the MD write-intent bitmap mechanism. But resynchronization after a crash should be quite fast, since it need only look at the parts of the array which were under active modification at the time. The only problem is that it requires the addition of specific support at the filesystem layer, so each filesystem must be modified separately. How this technique could be used in a filesystem which works without journaling (Btrfs comes to mind) would also have to be worked out.

There's one other little problem as well. This work was done at Sun as a way of improving performance with the Lustre filesystem. But Jody notes:

Unfortunately, we have determined that these patches are NOT useful to Lustre. Therefore I will not be doing any more work on them. I am sending them now in case they are useful as a starting point for someone else's work.

So this patch series has been abandoned for now. It seems like this functionality should be useful to software RAID users, so, hopefully, somebody will pick these patches up and carry them forward. In the absence of a new developer, software RAID administrators will continue to face an unhappy choice well into the future.



Journal-guided RAID resync

Posted Nov 25, 2009 21:56 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> (1) the correct fix for an inconsistent stripe may not always be clear

For an inconsistency caused by a crash interrupting a write, there is no correctness problem. If there are two credible options (such as a RAID1 with two devices where a block differs) then either option is equally correct. One version might be "newer" than the other, but that doesn't mean it is more "correct". "Correctness" can be an issue when inconsistency is caused by dodgy hardware - e.g. a stuck bit in a device buffer. But that is a completely different problem.

Journal-guided RAID resync

Posted Nov 26, 2009 12:17 UTC (Thu) by nye (guest, #51576) [Link]

Surely this is not the case in the (fairly common) case of out-of-order writes? This sounds like exactly the sort of situation that caused the ext4 kerfuffle earlier this year.

On a different note, can anyone explain to me why running a check or (god forbid) rebuild on a RAID array takes like a dozen times longer than reading/writing the contents of all the disks?

Journal-guided RAID resync

Posted Nov 26, 2009 21:00 UTC (Thu) by neilbrown (subscriber, #359) [Link]

> Surely this is not the case in the (fairly common) case of out-of-order writes? This sounds like exactly the sort of situation that caused the ext4 kerfuffle earlier this year.

I don't see how out-of-order writes complicate the question ... maybe I'm missing something. A key question that must be understood is "what data is valid", and in the case of multiple parallel writes pending when a crash happens, there are lots of combinations that are all equally valid.

> On a different note, can anyone explain to me why running a check or (god forbid) rebuild on a RAID array takes like a dozen times longer than reading/writing the contents of all the disks?

This is not my experience. If you post specifics to linux-raid@vger.kernel.org you will probably get a useful reply.

The only explanation for what you describe that immediately occurs to me is the fact that the check/repair code deliberately slows down when any other IO is active so as not to inconvenience that IO, but you probably know that already.

Journal-guided RAID resync

Posted Dec 2, 2009 14:03 UTC (Wed) by nye (guest, #51576) [Link]

>I don't see how out-of-order writes complicate the question ... maybe I'm missing something. A key question that must be understood is "what data is valid", and in the case of multiple parallel writes pending when a crash happens, there are lots of combinations that are all equally valid.

I think I may have misunderstood the nature of the problem in question, but IIRC I was thinking about when one version is 'correct', in the sense that it is an old but consistent version of the data, but the other version has been partially updated, and ended up in an invalid state. This might happen, for example, if a metadata change is written before the corresponding data change.
I suppose from the point of view of the RAID, though, that data is valid, in that it's an accurate representation of what the filesystem asked to be on the disk at the given moment, so I see your point.

>This is not my experience. If you post specifics to linux-raid@vger.kernel.org you will probably get a useful reply.

Well, the specific incident that prompted the question involved a hardware RAID controller (my experiences with Linux software RAID have, thus far, been unproblematic). Knowing next to nothing about how RAID is implemented, I wondered if the procedure is expected to involve, say, an O(n^2) number of reads or writes, or something else that would lead to this slowness being generally expected. Obviously not :).

Journal-guided RAID resync

Posted Nov 26, 2009 16:54 UTC (Thu) by nix (subscriber, #2304) [Link]

Hang on. If you crash after a RAID-5 stripe has been written to one of the disks but not the other, you can tell the stripe is inconsistent, but not what the valid contents are (at least, not programmatically.)

You can tell with RAID-6, but ISTR you mentioning a few months ago that md can't actually do this yet.

Journal-guided RAID resync

Posted Nov 26, 2009 20:54 UTC (Thu) by neilbrown (subscriber, #359) [Link]

> Hang on. If you crash after a RAID-5 stripe has been written to one of the disks but not the other, you can tell the stripe is inconsistent, but not what the valid contents are (at least, not programmatically.)

Between the moment when a write is requested, and the moment when the success of that write is reported - and possibly further until a barrier request has been acknowledged - both the 'old' data and the 'new' data are valid. The correct thing to do in this case is to treat all of the data blocks as "valid" (because they are) and update the parity block to ensure it is consistent.

What do you think "Valid" means in this context?

Journal-guided RAID resync

Posted Nov 26, 2009 21:34 UTC (Thu) by nix (subscriber, #2304) [Link]

My worry is that between the time when the new data is written, and the time when the parity block is updated (or vice versa if the parity write gets to the disk surface first), if you have a crash, bingo, you have instant corruption of that stripe. There isn't any way to make those two writes happen in sync, after all.

(I *know* you know this, so am quite mystified that you're apparently claiming that it isn't a problem. I don't see how a workaround is even *possible*: it's why battery-backing of RAID-5 arrays is done at all...)

Journal-guided RAID resync

Posted Nov 27, 2009 1:14 UTC (Fri) by neilbrown (subscriber, #359) [Link]

You only have "instant corruption" if the array becomes degraded before the parity gets corrected. For this reason mdadm will not assemble an array which is both dirty and degraded. I thought I had mentioned this in my original comment, but apparently not. Maybe this is the case implied in the original article, though I don't think the text really matches reality: Either there is a correct fix that is trivial, or no fix is possible.

(The only two ways to avoid corruption when a crash happens on a degraded array are 1/ to journal updates at the raid level, or 2/ use a copy-on-write filesystem that knows about the stripe size and only ever writes into a stripe that does not contain any important information.)

Journal-guided RAID resync

Posted Nov 27, 2009 6:56 UTC (Fri) by nix (subscriber, #2304) [Link]

Aaah, I see (actually whenever you mention this I get it for a few minutes and then it blurs out of memory again). Yes, that makes sense: if you don't lose a disk, you have two intact stripes and one stripe containing not-yet-written garbage: whether that's the parity stripe or not is immaterial.

So the thing to be worried about here is that RAID-5 only really protects you from a single disk failure if your array is not being written to (or is battery-backed).

And I suspect this is the case the article was discussing.

Journal-guided RAID resync

Posted Nov 26, 2009 8:12 UTC (Thu) by stefanor (subscriber, #32895) [Link]

> The RAID4, 5, and 6 storage technologies are designed to protect against the failure of a single drive.

Misleading: RAID-6 protects against double drive failures.

Delayed reaction

Posted Nov 29, 2009 7:03 UTC (Sun) by qu1j0t3 (guest, #25786) [Link]

ZFS did an elegant end run around this issue about five years ago.

At least this discussion gives some publicity to the nasty shortcomings of RAID-*.

Delayed reaction

Posted Dec 1, 2009 3:08 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

I thought that ZFS included software RAID; how does it deal with the problem of figuring out what is supposed to be in place when some drives in a stripe are updated and others are not (after a power failure)?

note that checksums don't help here as all drives have valid checksums.

Delayed reaction

Posted Dec 1, 2009 22:48 UTC (Tue) by paulj (subscriber, #341) [Link]

My basic, potentially a bit wrong, understanding of ZFS is that:

a) It uses COW, and blocks are arranged in a Merkle tree. An updated block isn't properly part of the ZFS until its referencing meta-data block has been updated with its hash. This means that ultimately there is only one block you absolutely must update atomically - the root "uber-block" (and even there, you can keep redundant uber-blocks and arrange that the old uber-block(s) remain valid until you're confident the new one is properly written out).

Even if hardware lies massively and re-arranges the writes and manages to lose/corrupt the child data block write, while still writing the parent, you'll at least be able to detect the inconsistency, generally.

b) Because ZFS integrates RAID and FS, it can do variable-sized stripes. So if you want to write data, there's no requirement to read in other data in order to calculate parity. You only have to write the data.

Combine the 2 and you have avoided the RAID-5 "write hole" AND gained performance (presuming your data hash function is quite cheap relative to I/O - which tends to be true on any half-recent hardware). See Jeff Bonwick's RAID-Z blog entry.
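
To make point (a) above concrete, here is a toy copy-on-write, Merkle-tree update in C. The structures and the checksum are invented and bear no resemblance to ZFS's real on-disk format; the point is only that the old tree stays intact and readable until the single root ("uber-block") pointer is switched.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct block {
        struct block *child;        /* on disk this would be a block address */
        uint64_t      child_csum;   /* checksum of the child block's payload */
        char          payload[64];  /* this block's own data                 */
    };

    /* Stand-in hash (FNV-1a); ZFS uses real checksums such as fletcher4
     * or SHA-256. */
    static uint64_t checksum(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint64_t h = 14695981039346656037ull;

        while (len--)
            h = (h ^ *p++) * 1099511628211ull;
        return h;
    }

    /* Copy-on-write update of the leaf: nothing reachable from old_root is
     * modified; a new leaf and a new parent are written to "free space". */
    static struct block *cow_update(const struct block *old_root,
                                    const char *new_payload)
    {
        struct block *leaf = calloc(1, sizeof(*leaf));
        struct block *root = malloc(sizeof(*root));

        if (!leaf || !root)
            abort();

        strncpy(leaf->payload, new_payload, sizeof(leaf->payload) - 1);

        *root = *old_root;              /* copy the parent...              */
        root->child = leaf;             /* ...point it at the new leaf     */
        root->child_csum = checksum(leaf->payload, sizeof(leaf->payload));

        /* Only now would the on-disk uber-block be rewritten to point at
         * the new root - the one update that must be atomic.  Until then
         * the old root still describes a fully consistent old tree. */
        return root;
    }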

Delayed reaction

Posted Dec 4, 2009 1:26 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

Normal RAID 5 implementations do not require that you read all the blocks to update one; they only require that you know the old and new contents of the block that you are modifying and the parity block. So that isn't a ZFS advantage.
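
That read-modify-write update boils down to a single XOR identity: the new parity is the old parity XORed with the old and new contents of the block being changed. A minimal sketch (buffer names and block size are invented):

    #include <stddef.h>

    #define BLKSIZE 4096    /* illustrative block size */

    /* Read-modify-write parity update for a single-block change:
     * new_parity = old_parity ^ old_data ^ new_data.  Only the modified
     * data block and the parity block need to be read beforehand. */
    static void rmw_parity(const unsigned char old_data[BLKSIZE],
                           const unsigned char new_data[BLKSIZE],
                           unsigned char parity[BLKSIZE])
    {
        for (size_t i = 0; i < BLKSIZE; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }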

I wasn't aware that ZFS could mix RAID types/configurations within a single filesystem. That seems very strange to me; are you sure that it can do this? I thought that when you created the ZFS filesystem you told it the RAID type and configuration to use.

The basic problem is that without doing a sync, the OS has no way of knowing when the data written to different drives actually hits the media and is safe (and this can happen in different orders for different writes, even to the same spot), thus the need to do 'RAID journaling' or similar.

I don't believe that ZFS does COW; doing that would fragment the data very rapidly, and that is death to performance for drives that need to seek.

I don't believe that your two points are valid in themselves, and even if they were I don't see how they combine to solve this problem.

Delayed reaction

Posted Dec 4, 2009 19:10 UTC (Fri) by paulj (subscriber, #341) [Link]

I've only had a quick glance, so I could be completely wrong and just looking for anything consistent with my pre-existing notions, but Linux RAID-5 seems to schedule old blocks to be read in when a block is modified, in order to update the parity. E.g. look at handle_stripe, handle_stripe_dirtying and places where the R5_Wantread bit is set. ICBW though.

There are quite a few reasonably authoritative sources that say ZFS is COW and transactional. So there's no reason to think it isn't, though I haven't read the code.

I didn't say ZFS mixed RAID types in a single filesystem. I said it used variable-sized stripes. ZFS always writes full stripes, so it never has to read back data to finish the write. Combined with the Merkle tree arrangement, it means ordinary writes do not leave the RAID or FS inconsistent, even for short windows of time.

Block-layer RAID-5 and a separate FS have never been able to achieve that, hence the talk now of introducing more interfaces to make them more closely coupled (and doesn't Btrfs have its own device management layer, a la ZFS?).

There's lots of stuff on the net about this.. ;)

Delayed reaction

Posted Dec 4, 2009 22:56 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

so you mean that sometimes it uses 16k stripes across the disks, and in the same filesystem will use 64K stripes across the disks? (picking a couple sizes out of thin air here)

how can it decide when to use what, and what data needs to be stored on what stripe size?

Given all the people who are claiming that ZFS is the best-performing filesystem ever, I don't believe that it is COW and transactional across multiple drives. On rotating media where seeks are expensive, these features and performance just don't coexist.

The problem is that when ZFS wants to update a RAID 5 stripe it has two writes to do, to two different drives. Unless it does a full sync and flush of the drive write buffer for each write (which will kill performance), it has no way of knowing which block on disk gets updated first.

if the system crashes after one block is updated and before the second block is updated, how can it know which one is correct?

Delayed reaction

Posted Dec 4, 2009 23:20 UTC (Fri) by foom (subscriber, #14868) [Link]

> I don't believe that it is COW and transactional across multiple drives.

You could, perhaps, go read about it instead of saying what it can't possibly be.

Here's one place to start:

http://en.wikipedia.org/wiki/ZFS#Copy-on-write_transactio...

Delayed reaction

Posted Dec 5, 2009 1:29 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

Note that this says it is transactional when synchronous writes are desired; that is not the normal operation.

Second, it says that it uses an intent bitmap to implement this protection. This is one of the methods described in the article to address this problem.

So ZFS didn't sidestep this issue; it implemented the same method that MD offers (which is usually disabled for performance reasons): write-intent bitmaps.

I stand corrected on the COW issue, but by doing COW you have two major problems

1. every write to a file ends up updating many more places on the filesystem (the file block gets moved and re-written, then the metadata that points to that file gets moved and re-written, then the metadata that points to that metadata... until the root directory gets re-written). This is one of the failures of btree filesystems that made them unusable on rotating media.

Wikipedia states that ZFS works around this by buffering the writes to consolidate them, but that is the same strategy that ext3 uses, and the need to then write all that buffered data in the face of a sync call is why fsync is so horrible on ext3.

Delayed reaction

Posted Dec 5, 2009 12:22 UTC (Sat) by paulj (subscriber, #341) [Link]

Note that the ZFS Intent Log doesn't affect the consistency of the filesystem. The fs is consistent without the intent log, thanks to the COW, Merkle tree arrangement. So it's not solving any consistency problems, i.e.:

  • ZFS is transactional, with or without the ZIL
  • ZFS is always consistent on disk, with or without the ZIL

The ZIL is there to solve the problem of supporting performant synchronous writes. ZFS would still support consistent, synchronous writes without the ZIL; they'd just be very slow.

I think it'd be useful if you looked at this ZFS presentation, which is where I'm getting most of my info from. The OSOL ZFS community also has a source overview.

Delayed reaction

Posted Dec 5, 2009 12:31 UTC (Sat) by paulj (subscriber, #341) [Link]

A good blog entry on the ZIL.

Delayed reaction

Posted Dec 5, 2009 10:20 UTC (Sat) by paulj (subscriber, #341) [Link]

My understanding is that COW is fundamental to how ZFS works (refer back to my previous comment on Merkle trees and how data and meta-data reference each other). I haven't read the code to verify this, but it's what all the papers and presentations on ZFS say.

I think that also already answers your question as to how ZFS can know which one is correct. I'm not sure there's any value in, again, explaining how ZFS consistency (in theory) comes down to the update of a single "uber-block".
