Journal-guided RAID resync

Posted Nov 25, 2009 21:56 UTC (Wed) by neilbrown (subscriber, #359)
Parent article: Journal-guided RAID resync

>(1) the correct fix for an inconsistent stripe may not always be clear
For an inconsistency caused by a crash interrupting a write, there is no correctness problem. If there are two credible options (such as a RAID1 with two devices where a block differs) then either option is equally correct. One version might be "newer" than the other, but that doesn't mean it is more "correct". "Correctness" can be an issue when inconsistency is caused by dodgy hardware, e.g. a stuck bit in a device buffer. But that is a completely different problem.



Journal-guided RAID resync

Posted Nov 26, 2009 12:17 UTC (Thu) by nye (guest, #51576) [Link]

Surely this is not the case in the (fairly common) case of out-of-order writes? This sounds like exactly the sort of situation that caused the ext4 kerfuffle earlier this year.

On a different note, can anyone explain to me why running a check or (god forbid) rebuild on a RAID array takes like a dozen times longer than reading/writing the contents of all the disks?

Journal-guided RAID resync

Posted Nov 26, 2009 21:00 UTC (Thu) by neilbrown (subscriber, #359) [Link]

>Surely this is not the case in the (fairly common) case of out-of-order writes? This sounds like exactly the sort of situation that caused the ext4 kerfuffle earlier this year.

I don't see how out-of-order writes complicate the question ... maybe I'm missing something. A key question that must be understood is "what data is valid", and in the case of multiple parallel writes pending when a crash happens, there are lots of combinations that are all equally valid.

>On a different note, can anyone explain to me why running a check or (god forbid) rebuild on a RAID array takes like a dozen times longer than reading/writing the contents of all the disks?

This is not my experience. If you post specifics to linux-raid@vger.kernel.org you will probably get a useful reply.

The only explanation for what you describe that immediately occurs to me is the fact that the check/repair code deliberately slows down when any other IO is active so as not to inconvenience that IO, but you probably know that already.

Journal-guided RAID resync

Posted Dec 2, 2009 14:03 UTC (Wed) by nye (guest, #51576) [Link]

>I don't see how out-of-order writes complicate the question ... maybe I'm missing something. A key question that must be understood is "what data is valid", and in the case of multiple parallel writes pending when a crash happens, there are lots of combinations that are all equally valid.

I think I may have misunderstood the nature of the problem in question, but IIRC I was thinking about when one version is 'correct', in the sense that it is an old but consistent version of the data, but the other version has been partially updated, and ended up in an invalid state. This might happen, for example, if a metadata change is written before the corresponding data change.
I suppose from the point of view of the RAID, though, that data is valid, in that it's an accurate representation of what the filesystem asked to be on the disk at the given moment, so I see your point.

>This is not my experience. If you post specifics to linux-raid@vger.kernel.org you will probably get a useful reply.

Well, the specific incident that prompted the question involved a hardware RAID controller (my experiences with Linux software RAID have, thus far, been unproblematic). Knowing next to nothing about how RAID is implemented, I wondered if the procedure is expected to involve, say, an O(n^2) number of reads or writes, or something else that would lead to this slowness being generally expected. Obviously not :).

Journal-guided RAID resync

Posted Nov 26, 2009 16:54 UTC (Thu) by nix (subscriber, #2304) [Link]

Hang on. If you crash after a RAID-5 stripe has been written to one of the
disks but not the other, you can tell the stripe is inconsistent, but not
what the valid contents are (at least, not programmatically.)

You can tell with RAID-6, but ISTR you mentioning a few months ago that md
can't actually do this yet.

Journal-guided RAID resync

Posted Nov 26, 2009 20:54 UTC (Thu) by neilbrown (subscriber, #359) [Link]

>Hang on. If you crash after a RAID-5 stripe has been written to one of the disks but not the other, you can tell the stripe is inconsistent, but not what the valid contents are (at least, not programmatically.)
Between the moment when a write is requested, and the moment when the success of that write is reported - and possibly further until a barrier request has been acknowledged - both the 'old' data and the 'new' data are valid. The correct thing to do in this case is to treat all of the data blocks as "valid" (because they are) and update the parity block to ensure it is consistent.

What do you think "Valid" means in this context?
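
A minimal sketch of the repair described above, assuming a toy RAID-5 stripe of byte strings with plain XOR parity. The names xor_blocks and resync_stripe are hypothetical and are not md's code; the point is only that every data block is taken as valid and the parity is recomputed from whatever is on disk.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def resync_stripe(data_blocks, parity):
    """Treat every data block as valid and recompute the parity from them.

    Returns (new_parity, repaired), where repaired is True if the stored
    parity did not match the data blocks.
    """
    expected = xor_blocks(data_blocks)
    return expected, expected != parity


# Stripe before the write: two data blocks plus their parity.
old_data = [b"\x01\x02", b"\x10\x20"]
stored_parity = xor_blocks(old_data)

# Crash: the new first data block reached its disk, the parity update did not.
on_disk = [b"\xff\x02", old_data[1]]

new_parity, repaired = resync_stripe(on_disk, stored_parity)
print(repaired, new_parity.hex())
```

Whether the first block holds the old or the new contents makes no difference to the repair: both are acceptable while the write is still unacknowledged, and the parity simply follows the data.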

Journal-guided RAID resync

Posted Nov 26, 2009 21:34 UTC (Thu) by nix (subscriber, #2304) [Link]

My worry is that between the time when the new data is written, and the
time when the parity block is updated (or vice versa if the parity write
gets to the disk surface first), if you have a crash, bingo, you have
instant corruption of that stripe. There isn't any way to make those two
writes happen in sync, after all.

(I *know* you know this, so am quite mystified that you're apparently
claiming that it isn't a problem. I don't see how a workaround is even
*possible*: it's why battery-backing of RAID-5 arrays is done at all...)

Journal-guided RAID resync

Posted Nov 27, 2009 1:14 UTC (Fri) by neilbrown (subscriber, #359) [Link]

You only have "instant corruption" if the array becomes degraded before the parity gets corrected. For this reason mdadm will not assemble an array which is both dirty and degraded. I thought I had mentioned this in my original comment, but apparently not. Maybe this is the case implied in the original article, though I don't think the text really matches reality: Either there is a correct fix that is trivial, or no fix is possible.

(The only two ways to avoid corruption when a crash happens on a degraded array are (1) to journal updates at the RAID level, or (2) to use a copy-on-write filesystem that knows about the stripe size and only ever writes into a stripe that does not contain any important information.)
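
To make the dirty-and-degraded case concrete, here is a small sketch under the same toy assumptions (one-byte blocks, XOR parity, hypothetical names). It shows a block that was never being written coming back wrong when it is reconstructed from stale parity.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


# A consistent stripe: two data blocks and their parity.
d0_old = b"\xaa"
d1 = b"\x0f"
parity_old = xor_bytes(d0_old, d1)

# A write replaces d0, but the crash happens before the parity is updated.
d0_new = b"\x55"

# Dirty but not degraded: d1 is still readable, so recomputing parity is enough.
parity_fixed = xor_bytes(d0_new, d1)

# Dirty and degraded: the disk holding d1 is lost before any resync, so d1
# has to be rebuilt from the new d0 and the stale parity.
d1_rebuilt = xor_bytes(d0_new, parity_old)

print(d1.hex(), d1_rebuilt.hex())
```

The rebuilt block differs from the original d1 even though d1 was not the target of the interrupted write, which is the corruption at issue when an array is both dirty and degraded.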

Journal-guided RAID resync

Posted Nov 27, 2009 6:56 UTC (Fri) by nix (subscriber, #2304) [Link]

Aaah, I see (actually whenever you mention this I get it for a few minutes
and then it blurs out of memory again). Yes, that makes sense: if you
don't lose a disk, you have two intact stripes and one stripe containing
not-yet-written garbage: whether that's the parity stripe or not is
immaterial.

So the thing to be worried about here is that RAID-5 only really protects
you from a single disk failure if your array is not being written to (or
is battery-backed).

And I suspect this is the case the article was discussing.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds