
An md/raid6 data corruption bug


Posted Aug 19, 2014 23:02 UTC (Tue) by Richard_J_Neill (subscriber, #23093)
Parent article: An md/raid6 data corruption bug

I wonder whether there is any workaround yet for the problem of a RAID5 array where the degraded array suffers a single bit read-error on rebuild, thereby kicking one of the non-redundant drives out of the array, and compromising the entire thing.

I suffered from exactly this - and indeed, with modern huge hard drives, there's about a 10% chance of a (detected, unrecoverable) read error occurring for a single bit when doing a copy of the entire drive. Chances are that this bit error is unimportant (statistically it's most likely to create a pixel error somewhere in a video file) - but it's catastrophic when the entire RAID array refuses to be rebuilt.

As far as I can see, Linux RAID5 has no support for "I am mildly annoyed about a 1-bit error, but I do still want to keep the other 3.999999999999 TB of my data".

RAID5 rests on the (now completely wrong) assumption that a complete start-to-end read of a healthy drive should never ever experience uncorrected errors. This is the drive manufacturers' fault because the error-correction algorithm hasn't been strengthened in line with increasing disk sizes - but it means RAID5 is now rather dangerous.



An md/raid6 data corruption bug

Posted Aug 20, 2014 0:16 UTC (Wed) by dlang (guest, #313) [Link] (6 responses)

In practice, what you can do is manually start the array with the most recently 'failed' drive forced to be part of it, and try again.

If the bit is really unrecoverable, you can still copy your data off of the array (except for the file that bit is part of).

Not fun, but I've done it in the past.

An md/raid6 data corruption bug

Posted Aug 20, 2014 5:15 UTC (Wed) by rodgerd (guest, #58896) [Link] (5 responses)

That can be risky. I've done just that in the past and the result was an apparently fine array with what turned out to be significant data corruption in many of the files on it, which I didn't notice for several months.

Fortunately I had backups, but searching across several months of backups for the last good copy of photos was not my idea of a good time.

An md/raid6 data corruption bug

Posted Aug 20, 2014 5:57 UTC (Wed) by dlang (guest, #313) [Link] (3 responses)

Whenever you have a drive problem, if you are able to get access to anything, your first priority should be to get your data off and throw away the bad drive.

By the time a drive starts showing problems through its error correction and automatic bad-block replacement, it's likely to go downhill fast; and even if you are lucky and it doesn't, drives are cheap compared to their contents.

An md/raid6 data corruption bug

Posted Aug 20, 2014 20:53 UTC (Wed) by rodgerd (guest, #58896) [Link] (2 responses)

I didn't run it like that for several months - it's that re-adding the drive until I could migrate corrupted the files, which I then didn't access for several months afterwards. If I had a shorter backup cycle, I'd have lost them.

An md/raid6 data corruption bug

Posted Aug 24, 2014 19:13 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

> If I had a shorter backup cycle, I'd have lost them.

If this was for work, that would be negligent ... :-) But I suspect it might be for home. I try and dump anything important to DVD, so that's hopefully safe and dandy ... (yes I know).

But when I set up a backup regime at work, iirc it said the tapes were good for about 60 uses (DDS/2/3 - that dates it ...) So my regime had - iirc - fifteen tapes. Each week, a new set got swapped in, but one of the set was swapped out. And roughly every four weeks, the swapped out tape had its write tab flipped, never to be written again. I can't exactly remember the jiggery-pokery, but I described our backup regime to management as "we have a daily backup for the last week, a weekly backup for the last month, and a monthly backup forever".

And, done properly, the swapping out meant all our tapes were replaced, dumped into long-term storage, comfortably inside the manufacturers recommended usage limit.

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 24, 2014 22:00 UTC (Sun) by rodgerd (guest, #58896) [Link]

>> If I had a shorter backup cycle, I'd have lost them.

> If this was for work, that would be negligent ... :-)

The chances of losing anything at work are remote. When you need to be able to refer to your data for ranges of "7 years" to "a human lifetime plus 7 years" depending on the item, thinking hard about backup and archiving becomes the norm.

> But I suspect it might be for home. I try and dump anything important to DVD, so that's hopefully safe and dandy ... (yes I know).

Yep. Personally, for home I prefer to dump to multiple external drives and carry them off-site periodically. DVDs or tapes are nominally a more stable medium, but the pain of swapping means that backups are just less likely to happen. Easy backups are backups that happen...

> And, done properly, the swapping out meant all our tapes were replaced, dumped into long-term storage, comfortably inside the manufacturers recommended usage limit.

Oh, the place I worked was saving money on 9-track and DDS tapes by keeping them in rotation for literally years. And doing incremental backups across a month at a time. Those decisions never caused any problems, no sir.

An md/raid6 data corruption bug

Posted Aug 30, 2014 15:37 UTC (Sat) by Wol (subscriber, #4433) [Link]

> That can be risky. I've done just that in the past and the result was an apparently fine array with what turned out to be significant data corruption in many of the files on it, which I didn't notice for several months.

Maybe that's one for the raid designers.

Have a mode that force-rebuilds the array, but where a rebuild failure sets a "corrupt" flag somewhere. Then when the layer above WRITES to the "corrupt" sector, it clears the flag - we have known-good new data.

But if the next access to that sector is a *read*, the array returns a read error. After all, the data is corrupt because of a real read error in the underlying disks ...

So, if you do have a failure like that, once your array is rebuilt you can run fsck or the like, and find all the damaged files. Okay, it'd take quite a time :-) but it seems to make sense - after all if you're reading direct from disk these errors propagate up through the layers, so just add some mechanism to raid to make it do the same.
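A toy Python sketch of these proposed semantics (purely illustrative - the ToyArray class and its methods are invented for this comment, and this is not how md tracks anything today):

```python
# Toy model of the proposed behaviour: a forced rebuild flags unrecoverable
# sectors, reads of a flagged sector return an error (as a raw disk would),
# and a fresh write of known-good data clears the flag.
class ToyArray:
    SECTOR = 512

    def __init__(self, nsectors):
        self.data = [b"\x00" * self.SECTOR for _ in range(nsectors)]
        self.corrupt = set()   # sectors flagged during a forced rebuild

    def mark_corrupt(self, sector):
        self.corrupt.add(sector)

    def read(self, sector):
        if sector in self.corrupt:
            # propagate the error up through the layers, like a real disk
            raise IOError("read error: sector %d is corrupt" % sector)
        return self.data[sector]

    def write(self, sector, buf):
        self.data[sector] = buf
        self.corrupt.discard(sector)   # new data is known good again
```

An fsck-style scan over such a device would then trip over exactly the flagged sectors, identifying the damaged files.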

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 20, 2014 1:12 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (30 responses)

"with modern huge hard drives, there's about a 10% chance of a (detected, unrecoverable) read error occurring for a single bit when doing a copy of the entire drive"

Do you have a citation for this extraordinary claim?

Many systems routinely scrub the array once per week these days. To do a scrub obviously all the drives (thus, at least two) are read in full. If you were right that just reading the data results in 10% chance of an error, you'd expect everybody to see trouble reported during the scrub at least every month and a half on average. But that doesn't happen.

An md/raid6 data corruption bug

Posted Aug 20, 2014 14:03 UTC (Wed) by dany (guest, #18902) [Link] (2 responses)

I have no citation and also don't think that the probability is that high, but does it really matter whether there are uncorrectable sectors with 10% chance or 5% or 0.1%? More disks plus bigger disks means the probability rises. Even if most people won't experience problems, some will.

I work for a service provider to many companies and support dozens of disk arrays. I can tell you that in practice this situation happens (I have experienced it more than 5 times). It is basically a double fault: one disk is faulted, and another disk in the same RAID5 group experiences uncorrectable sectors during the process of recalculating data onto the spare disk.

And that's despite these disk arrays having scrubbing enabled and running with available spare disks.

It's not a common situation, but with disks getting bigger, the probability of this double fault only rises.

So my suggestion is: if you value your data and have only one disk array, use RAID6 on disks with capacities of 600GB and up. Other options for preventing a double disk fault are a triple mirror and raidz2 (similar to RAID6, in ZFS). Or, if you have a second disk array, you can mirror your LUNs to the second array; that way you are OK with RAID5 on each array.

An md/raid6 data corruption bug

Posted Aug 20, 2014 20:04 UTC (Wed) by carenas (guest, #46541) [Link] (1 responses)

this is why if you really care about the data you should do RAID6 instead

An md/raid6 data corruption bug

Posted Aug 20, 2014 21:28 UTC (Wed) by dlang (guest, #313) [Link]

Yep, you really want raid 6

a few years ago I was working with a 45 disk RAID array, and calculated that if there was a 1% chance of a disk failing in a year, and it took a week to populate a replacement disk in the background (10% bandwidth, other handwaving), with RAID 5 there was a 2.5% chance of a second disk failing before the first replacement disk was ready.

with RAID 6 this dropped to 0.025% or something like that

the increase in protection really is significant, no matter what the cause of the failure.
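For what it's worth, this kind of estimate can be redone with explicit (made-up) assumptions - 45 disks, a 1% per-disk annual failure probability, a one-week rebuild window, independent failures. The exact figures won't match the hand-waved 2.5% above, but the RAID5/RAID6 gap comes out just as striking:

```python
# Back-of-envelope second-failure risk during a RAID rebuild.
# All inputs are assumptions for illustration, not measured data.
n_disks = 45          # disks in the array
p_year = 0.01         # assumed chance that one disk fails within a year
weeks_per_year = 52
p_week = p_year / weeks_per_year   # crude per-disk, per-week failure chance

survivors = n_disks - 1

# RAID5: the array is lost if ANY surviving disk fails during the rebuild week.
p_raid5_loss = 1 - (1 - p_week) ** survivors

# RAID6: the array is lost only if TWO more disks fail in the same window.
p_none = (1 - p_week) ** survivors
p_exactly_one = survivors * p_week * (1 - p_week) ** (survivors - 1)
p_raid6_loss = 1 - p_none - p_exactly_one

print("RAID5 loss risk during rebuild: %.3f%%" % (100 * p_raid5_loss))
print("RAID6 loss risk during rebuild: %.5f%%" % (100 * p_raid6_loss))
```

Under these assumptions the RAID5 risk comes out under 1% per rebuild, and the RAID6 risk is smaller by a couple of orders of magnitude - the same qualitative conclusion as above.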

An md/raid6 data corruption bug

Posted Aug 20, 2014 14:11 UTC (Wed) by Richard_J_Neill (subscriber, #23093) [Link] (11 responses)

You asked for evidence. For a start:

http://www.zdnet.com/blog/storage/why-raid-5-stops-workin...
http://superuser.com/questions/700177/why-ure-fails-raid-...

Also, this actually happened to me (and even though this is a single sample, the inference is that this problem isn't that rare).

The maths tells you that for a 10E-14 bit-error-rate on a typical drive, with a group of 5x 1 TB drives, there is about a 40% chance of a single unfixable error!

Now, in reality, it's slightly less bad than this, as pointed out here:
http://www.high-rely.com/hr_66/blog/why-raid-5-stops-work...
but nevertheless, using RAID5 for availability is decidedly high risk.

Other problems with the all-eggs-in-one-basket approach are that drives in the same batch are likely to fail at around the same time, and that a failing PSU can apply mains voltage across the +12V line.

Since then, I've been entirely converted to RAID1 (just mirroring), DRBD (two mirrored servers), and full nightly backups to a different system.

(An aside: running PostgreSQL works remarkably well when the underlying device is a DRBD pair of {RAID-1 mirrored SSDs} in 2 servers with a dedicated gigabit link.)

An md/raid6 data corruption bug

Posted Aug 20, 2014 14:49 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (2 responses)

OK, seems like the problem is that you've seen a black swan, and so now the obviously bogus calculations as to the rate of black swan sightings don't feel as bogus to you as they should.

One commenter in your final link figures out what's going on here. BER turns all the data the drive makers have into a single comparable figure, like an APR for your mortgage, or the THD+N for a DAC. But that figure shouldn't be read to mean that such errors will occur evenly, with exactly one bit lost for every so-many bits read, it's just the result of a sum they performed.

The drives you're talking about will lose entire physical disk sectors, 32768 bits, or nothing, due to their design. So the events you've modelled as happening about 10% of the time will be somewhat more damaging than your hypothetical "single pixel" but actually happen less than once in every million times on average.

Of course one-in-a-million events will happen, and they did to you. But telling everybody that the sky is falling won't work - they can see it isn't and so you'll just be dismissed as a scaremonger.

An md/raid6 data corruption bug

Posted Aug 21, 2014 9:02 UTC (Thu) by mbunkus (subscriber, #87248) [Link]

Well, it hasn't only happened to him. To me it happened twice with RAID5; once in a pretty four-drive array with 4x2TB drives, once in a slightly larger one (7x1.5TB drives).

The math may not be as bad as the ZDNet article makes it out to be, but it really is happening to people. After having gone through the agonizing process of rebuilding a whole array and restoring everything from backups - resulting in long hours, lost data, outages in important systems, and therefore angry customers - I now use RAID6 only. The additional cost in drives is nothing compared to the cost and headache of a lost array.

An md/raid6 data corruption bug

Posted Aug 22, 2014 21:44 UTC (Fri) by Wol (subscriber, #4433) [Link]

> OK, seems like the problem is that you've seen a black swan, and so now the obviously bogus calculations as to the rate of black swan sightings don't feel as bogus to you as they should.

I used to get Vogon's customer newsletter regularly. We're talking about 20 years ago here! And even back then, there were regular stories in it where a customer's RAID array had failed. Sometimes it was because they weren't checking the disks, and the second failure obviously trashed the array. But there were a fair few stories where it was the stress of rebuilding the array after the first failure that caused the second failure.

That was twenty years ago! And this company was making a nice business out of recovering crashed arrays! (Usually banking systems, where recovering from backup wasn't really a viable option.)

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 20, 2014 18:36 UTC (Wed) by welinder (guest, #4699) [Link]

I wouldn't trust the math there.

The formula used is based on the assumption that the probabilities of getting a read error for the sectors on a disk are IID - independent and identically distributed. I am having serious trouble believing both parts of that.

First, sectors are not identical. For example, the G forces that the spinning platters are subjected to are wildly different near the center and near the edge. There is also some variation in the number of iron atoms per sector, even though more sectors per ring are allocated near the edge.

Second, whatever caused a read error in one sector - wobbly power, cosmic rays, heat, mechanical slack in head positioning, whatever - generally can affect more than one sector.

PSU Failure Mode

Posted Aug 21, 2014 13:48 UTC (Thu) by dskoll (subscriber, #1630) [Link] (6 responses)

> a failing PSU can apply mains voltage across the +12V line.

Really? You've seen that happen? I find that to be an extraordinary failure mode.

PSU Failure Mode

Posted Aug 21, 2014 14:19 UTC (Thu) by Wol (subscriber, #4433) [Link] (5 responses)

A PSU will contain a transformer. A transformer will have high and low voltage wires in close proximity. A short between them will send an HV surge down the LV line.

Yes, a fuse should catch it. Still doesn't necessarily prevent the damage.

Cheers,
Wol

PSU Failure Mode

Posted Aug 21, 2014 18:39 UTC (Thu) by dskoll (subscriber, #1630) [Link] (4 responses)

Yep, I know how PSUs work. I've never seen a PSU fail where the primary and secondary windings of the transformer short out. Seems like a very rare way to fail. I would think more likely failure modes are the transformer simply open-circuiting or the regulating/switching electronics failing.

PSU Failure Mode

Posted Aug 22, 2014 9:23 UTC (Fri) by BlueLightning (subscriber, #38978) [Link] (1 responses)

I'm pretty sure this happened to a friend of mine some years ago - at least in his case everything connected to the PSU except the CPU got fried; and that could only really have happened if a high voltage was shorted to the 5V or 12V rail.

As you might be aware, modern PSUs don't have huge transformers; instead they have a switch-mode power supply with a high side and a low side of the circuit. It's not so much true now perhaps, but a few years ago people weren't paying all that much attention to the quality of the power supplies they put in their systems - you could tell, if you opened them, that they were built to a very low price. If you pay bottom dollar for this kind of component then it's not too surprising that some of the separation and safety that ought to be there to prevent this kind of failure has been missed out.

PSU Failure Mode

Posted Aug 22, 2014 21:40 UTC (Fri) by Wol (subscriber, #4433) [Link]

> If you pay bottom dollar for this kind of component

:-)

I've bought cheap cases including PSU on far too many occasions :-) and found myself replacing the PSU for whatever reason. The most recent was an alleged 500W job - only for me to discover it was 500W *in*, whereas I wanted 500W *out*. I think the PSU was man enough for the job, but only just, and I didn't want to find out the hard way whether it would fail ...

(the replacement is a lot quieter and cooler :-)

Cheers,
Wol

PSU Failure Mode

Posted Aug 22, 2014 18:34 UTC (Fri) by jwarnica (subscriber, #27492) [Link] (1 responses)

I think, fuse or not, 120V being pushed across a 12V diode will make it act as a pretty nice fuse itself.

PSU Failure Mode

Posted Aug 22, 2014 20:27 UTC (Fri) by etienne (guest, #25256) [Link]

Bad example: the diode will not fail. I am old enough to have seen germanium diodes unable to handle 40 volts, but any diode made this century will handle 320 volts without much problem - until the nearby capacitor fails and smells. Such a hardware bug is detected by smell, and sometimes by the local colour theme being a bit darker.

An md/raid6 data corruption bug

Posted Aug 22, 2014 20:33 UTC (Fri) by ttonino (guest, #4073) [Link] (14 responses)

> "with modern huge hard drives, there's about a 10% chance of a (detected, unrecoverable) read error occurring for a single bit when doing a copy of the entire drive"

> Do you have a citation for this extraordinary claim?

The hard drive manufacturers tell you, in the spec for uncorrectable read error rate.

Just a random data sheet: http://www.seagate.com/staticfiles/docs/pdf/datasheet/dis... which says:

Nonrecoverable Read Errors per Bits Read, Max 1 per 10E14

1TB is 10E12 bytes. A 4 TB drive will thus hold 3.2 * 10E13 bits.

Result: maximum 31% chance of an uncorrectable read error over reading the whole drive.

Note that 'enterprise' drives may be better in this regard by a factor of 10. Still, a long way off from 'totally reliable'.

An md/raid6 data corruption bug

Posted Aug 23, 2014 5:49 UTC (Sat) by neilbrown (subscriber, #359) [Link] (13 responses)

> Nonrecoverable Read Errors per Bits Read, Max 1 per 10E14
>
> 1TB is 10E12 bytes. A 4 TB drive will thus hold 3.2 * 10E13 bits.
>
> Result: maximum 31% chance of an uncorrectable read error over reading the whole drive.

You are making two very common errors in probability theory here. Firstly you are trying to add the probabilities of two events to get the probability that either of them will happen. This is not valid. If it were then the probability of getting a head on one or both of 2 coin tosses would be 100%, which it isn't.

Probabilities don't add - they multiply(*). If the probability of 1 head is 50%, then the probability of 2 heads is 50% * 50% or 25%. To get the probability of at least one head, you need to find the probability of zero heads (25%) and subtract that from 1. So 75%.

To apply that to the disk drive example, the probability that one bit won't fail is (1-10^-14). The probability that 4*10^12 bits won't fail is the first number to the power of the second.

Using 'bc' with commands
scale=100
e(l(1-10^-14)*(4*10^12))

Gives me a 96% chance that no bits will fail, so a 4% chance of failure of any bit at all.
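The same sum as a Python sketch, for anyone without bc to hand (math.log1p is used because 1 - 10^-14 is too close to 1 to subtract accurately as a plain float; the result lands around 96%, matching the bc session):

```python
from math import exp, log1p

p = 1e-14        # assumed per-bit nonrecoverable read error probability
n = 4 * 10**12   # number of bits read, as in the bc session above

# P(no bit fails) = (1 - p)^n, computed as exp(n * ln(1 - p));
# log1p(-p) gives ln(1 - p) without rounding 1 - p to a float first.
p_no_failure = exp(n * log1p(-p))
print(p_no_failure)   # about 0.96, i.e. roughly a 4% chance of any failure
```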

Now we come to the second common error: you can only multiply probabilities of independent events. Two coin tosses are (except for quantum effects) independent, so you can multiply probabilities. Two bit failures on a disk drive are not. At the very least you get 4096 bits (one 512-byte sector) all failing at once. And typically if you get any errors, then you start to see lots more. To be able to combine probabilities in the presence of inter-dependence you need to be able to quantify that interdependence, and I don't have the data for that.

At a guess, that 4% chance of any failure is probably 3.99% chance that nothing will fail and 0.01% chance that a few megabytes will fail.

Certainly drives do fail. Multiple drives in the same array can fail. Regular "scrubbing" to find failing blocks is important. RAID6 is more reliable than RAID5. More expensive drives are sometimes worth it. Different drives from different batches or different manufacturers spread your risk. Backups are important.

But the reality is that most RAID arrays will not suffer any errors most of the time, and many people will use RAID arrays without ever seeing a double failure. You should still prepare for it though.

(*)Full disclosure: probabilities *can* add, but only for mutually exclusive events. For independent events they multiply.

An md/raid6 data corruption bug

Posted Aug 23, 2014 11:12 UTC (Sat) by Wol (subscriber, #4433) [Link] (9 responses)

>> Nonrecoverable Read Errors per Bits Read, Max 1 per 10E14
>
>> 1TB is 10E12 bytes. A 4 TB drive will thus hold 3.2 * 10E13 bits.
>
>> Result: maximum 31% chance of an uncorrectable read error over reading the whole drive.

> You are making two very common errors in probability theory here. Firstly
> you are trying to add the probabilities of two events to get the
> probability that either of them will happen. This is not valid. If it were
> then the probability of getting a head on one or both of 2 coin tosses
> would be 100%, which it isn't.

I think we have a massive comprehension error - I am now rewriting this for the THIRD time. And I am in total agreement with the GP's assessment of an excessively high chance of failure. Let's do the maths.

1TB is 10e12 bytes. We have a raid-5 array of 4TB disks. That gives us 10e12 x 12 x 8 bits, ie 10e14 bits - which the manufacturers tell us is the level we should EXPECT an unrecoverable read error! I'm not sure to what extent a raid rebuild scans the entire array, but if the statistics are telling me to *expect* an error, and my system can't handle it, then something's badly wrong.

Or let's do the maths the way you say we should. Given that three scans of the disk will read the manufacturer's 10e14 bits, that gives us a 70% (2 in 3) likelihood of any one scan being successful. Calculating the probability of three successful scans across 3 x 4TB disks gives us 0.7^3 or 17%!!! Adding further disks to the array merely REDUCES the chance of success even further!!!

So, based on the manufacturers' figures, that makes it that if I have a RAID-5 array composed of 4TB drives, then AT BEST I have a 1 in 6 chance of a successful rebuild after a drive failure!!!

Okay, I've been a bit sloppy with my maths, but those figures are pretty close. More to the point, that 10e14 probably contains a fair bit of safety margin, but that's countered by the fact that drive failures are not independent. Typically an array will contain drives of the same age and make, probably the same batch, and statistically they are more likely to fail together than at random.

So, I'm left to agree with article - given modern huge drives, RAID-5 gives you a totally false sense of security because the risk of a double failure - based on manufacturers figures!!! - is somewhat high!

(RAID-6 should - provided it doesn't aggressively knock faulty drives off-line - be much more reliable, because it can still rebuild a new disk even given multiple errors across multiple disks, provided it doesn't get three errors in the same stripe.)

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 15:34 UTC (Sat) by JGR (subscriber, #93631) [Link] (8 responses)

> 1TB is 10e12 bytes

Surely this should be 1e12 bytes?

An md/raid6 data corruption bug

Posted Aug 23, 2014 18:10 UTC (Sat) by Wol (subscriber, #4433) [Link] (7 responses)

No. 10e0 is a 1 with no noughts after it. 10e12 is 1 with 12 noughts. Very confusing :-)

As is the fact that half of the numbers are quoted in bits, and half in bytes :-)

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 18:51 UTC (Sat) by JGR (subscriber, #93631) [Link] (5 responses)

That doesn't really tally with any scientific/engineering notation that I've used.
10e0 is short for 10 × 10^0, which is 10, not 1, unless you mean 1.0e0?

An md/raid6 data corruption bug

Posted Aug 23, 2014 20:04 UTC (Sat) by Wol (subscriber, #4433) [Link] (4 responses)

I think you've got yourself totally muddled.

10E3 is engineering for 10^3.

That means three tens multiplied together.

10 x 10 x 10 = 100

The problem is you get scientific notation, where 5000 is 0.5E4 (the mantissa is always between nought and one), and engineering notation where 5000 is 5E3 (the exponent is always a multiple of three).

ANYTHING to the power NOUGHT is ALWAYS ONE.

Oops - I see your mistake. We are writing "10E14", and you're reading "10 x 10E14". You're sticking a 10 in front that we haven't put there, so your maths is correct, but so is ours! :-)

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 20:07 UTC (Sat) by Wol (subscriber, #4433) [Link]

Dare I embarass you further? :-)

1e12 = 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 x 1 = 1

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 20:19 UTC (Sat) by eehakkin (guest, #92008) [Link]

> 10E3 is engineering for 10^3.

No, it is not. 10E3 is short for 10*10^3.

$ python -c 'print 10e3;print 10*10**3'
10000.0
10000

An md/raid6 data corruption bug

Posted Aug 23, 2014 20:34 UTC (Sat) by JGR (subscriber, #93631) [Link] (1 responses)

Ah I see, well if that is the notation that you are using, it is non-standard.
http://en.wikipedia.org/wiki/Scientific_notation#Engineer...
http://en.wikipedia.org/wiki/Scientific_notation#E_notation

5000 is 5E3, in both scientific and engineering notation, the mantissa should be between >=1 and <10 or >=1 and <1000 respectively.

> 10E3 is engineering for 10^3.
> That means three tens multiplied together.
> 10 x 10 x 10 = 100
Three tens multiplied together is 1000.
In the standard sense, 10E3 is 10000.
If you consider 1E12 to be 1, how can you consider 5E3 equal to 5000? Your scheme seems oddly inconsistent.

An md/raid6 data corruption bug

Posted Aug 23, 2014 22:30 UTC (Sat) by Wol (subscriber, #4433) [Link]

:-)

Now I'm getting myself thoroughly muddled. You're right. It looks like I somehow got it into my head that E and ^ were the same. Of course they're not - E is 10^.

So my calculations were an order of magnitude out?

But that still leaves pretty bad odds of recovery of a raid-5 array. My original 1 in 6 chance of success becomes a 5 in 60 odds of failure?

That's what bothers me about this story. Because the odds are low, we have various people saying "not a problem", despite a bunch of other people saying they've been hit - some several times! The majority don't see a problem, but a *significant* minority are getting hit. And despite my faulty maths, I've done statistics, and it looks real to me.

Whether you need to read a 4TB disk end-to-end 3 times, or 30 times, in order to exceed the manufacturer's MTBerrors, that figure is easily achieved these days.

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 20:31 UTC (Sat) by nybble41 (subscriber, #55106) [Link]

Do you happen to have a reference for your rather bizarre variation of E notation, which I have never seen used anywhere else? Normally aEb is, as JGR says, a * 10^b, not a^b. So 10e12 is 10 * 10^12, or 1e13, and 10e0 is just 10. Engineering notation is the same except that the exponent is always a multiple of three, to match the SI prefixes.

http://en.wikipedia.org/wiki/Scientific_notation#E_notation
http://en.wikipedia.org/wiki/Engineering_notation

This also matches the floating-point syntax in every major programming language:

$ python -c 'print 1e3'
1000.0

An md/raid6 data corruption bug

Posted Aug 23, 2014 11:18 UTC (Sat) by Wol (subscriber, #4433) [Link] (2 responses)

> To apply that to the disk drive example, the probability that one bit
> won't fail is (1-10^-14). The probability that 4*10^12 bits won't fail is
> the first number to the power of the second.

> Using 'bc' with commands
> scale=100
> e(l(1-10^-14)*(4*10^12))

> Gives me a 96% chance that no bits will fail, so a 4% chance of failure of
> any bit at all.

Sorry Neil, I think I've seen your mistake, and it's simple English comprehension. The manufacturers' figures do NOT quote "chance of failure per bit read". They quote a completely different statistic, namely "bits read per failure", and they are not interchangeable.

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 23:09 UTC (Sat) by neilbrown (subscriber, #359) [Link] (1 responses)

The other, even more common I suspect, mistake people make with probability is in translating natural-language statements into mathematical statements. You are right that I didn't help at all there.

What do the manufacturers say?
I found "WD Re Series Spec Sheet" which says:

Non-recoverable read errors per bits read: < 10 in 10^16

Which I think is "fewer than 10 read errors per 1.25 petabytes read"

"10 in 10^16" is at least 10 times better than the "1 in 10^14" that was suggested earlier. Maybe drives are getting better? Maybe more expensive drives are worth it? Maybe the number don't mean much.

But what *do* they mean? Note that it says "< 10". So "10" is an upper bound, not an expected value. I wonder how firm that bound is ... no way to tell really.

And what exactly is "1.25 petabytes read". Are all reads the same? Is a sequential read through the whole drive comparable to random reads? Does it make a difference if the data read was recently written? Does that increase or decrease the chance of error?

I would *guess* that it means "1.25 petabytes read in a normal workload" - where "normal" includes a mix of writes and reading lots of data that was written recently and some that has not been touched for a long time.

But my point is that this number is almost meaningless. You can possibly use it compare two drives. I don't think there is any way to map this number into a probability of device failure during an array resync or recovery.

RAID5 with bad-block-list support enabled should survive a recovery. You might lose a block of data, but you probably won't. RAID6 is definitely safer. Significantly safer. Whether that safety is worth the cost of an extra device, or maybe a new enclosure, is not something you can calculate by plugging numbers from a data sheet into a calculator.

You'd be much better off getting anecdotal evidence from people who use RAID arrays a lot. Unfortunately people who don't suffer any problems rarely provide anecdotes.

An md/raid6 data corruption bug

Posted Aug 24, 2014 11:54 UTC (Sun) by Wol (subscriber, #4433) [Link]

:-)

But as even this thread shows, people are suffering this failure.

So it may be rare, but it's still common enough to cause significant grief.

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 20, 2014 1:44 UTC (Wed) by neilbrown (subscriber, #359) [Link] (11 responses)

> I wonder whether there is any workaround yet for the problem of a RAID5 array where the degraded array suffers a single bit read-error on rebuild

Yes there is. It is called "bad block lists" or "bbl".

Rather than marking the whole device as faulty, md will just mark the bad block as faulty.

Requires mdadm 3.3 and Linux 3.1 or later (and "later" generally means "hopefully fewer bugs", but the more people I can get testing this, the sooner such bugs will be discovered).

To add a bbl to a pre-existing array you need to assemble with "--update=bbl". You cannot hot-add a BBL at present.

An md/raid6 data corruption bug

Posted Aug 21, 2014 17:18 UTC (Thu) by Tomasu (guest, #39889) [Link] (10 responses)

Hmm, I'll look into that next time I build an array (happens somewhat infrequently, "unfortunately" or fortunately depending on how you see it).

I mainly run RAID5 because I'm dumb and dislike the added overhead of RAID6, but I also run a complete mirror of my main RAID5 onto another RAID5, so I feel I'm somewhat safe from the unlikely event of a double fault in the main array, and pretty safe from a double fault in both arrays happening at the same time. Is this dumb? Probably! It certainly would have been cheaper to go RAID6 in my main array, but that system (a small NAS in a Lian Li PC-Q25 case) has 7 drive bays, and the goal was to reach 10TB+ within that limit with 2TB drives. Cutting out a drive for a second parity disk would have meant I'd not hit 10TB :( So in the end, I started picking up 3TB drives when they went on sale over the past couple of years, and eventually built a backup array to host an entire copy of the main NAS, as well as a separate location to store my more important backups, which are housed on a RAID1 of two (old) disks on my home server (and in two remote locations as well).

An md/raid6 data corruption bug

Posted Aug 23, 2014 11:40 UTC (Sat) by Wol (subscriber, #4433) [Link] (9 responses)

That sounds sensible. Basically, the greater the proportion of disk given over to error detection and recovery, the better. And you've got maybe 60%? But notice I used two words there - "detection" and "recovery".

A mirror dedicates 50% of your capacity to recovery. If you get a read failure (ie the disk doesn't respond) you can recover. If you get a read error (ie the wrong data is returned), the raid will detect it but your app won't know which version is correct.

Change that to raid 5, and now you have dedicated 33% of your capacity to detection and recovery. Any single read error will be detected, and any single read failure will be recovered from. And that's why you should run raid 5 over raid 1, and not the other way round: a read failure in the mirror will be handled by the raid 5, but it doesn't work the other way round. But it sounds like you're effectively running raid 1 over raid 5 :-(
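To make the detection/recovery distinction concrete, here is a minimal sketch (my illustration, not md's actual code) of how RAID-5-style XOR parity lets you rebuild any one lost chunk, while a silently corrupted chunk is detectable but not attributable:

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"hello world!", b"raid5 parity"]   # two data chunks of a stripe
parity = xor_blocks(*data)                  # the parity chunk

# Recovery: one data chunk "fails" (the disk doesn't respond);
# rebuild it from the surviving chunk plus parity.
rebuilt = xor_blocks(data[1], parity)
assert rebuilt == data[0]

# Detection: if a chunk silently returns wrong data, the stripe no
# longer XORs to zero - but you cannot tell *which* chunk is wrong.
corrupt = b"hello vorld!"
assert xor_blocks(corrupt, data[1], parity) != bytes(len(parity))
```

This is why a read *failure* (known-missing chunk) is recoverable while a read *error* (wrong data returned) is only detectable with a single parity chunk.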

Going to raid 6 now strengthens both the detection and recovery of raid 5. But given this article, it sounds like adding further parity disks to raid 6 might not be a bad idea :-)

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 12:55 UTC (Sat) by Tomasu (guest, #39889) [Link] (2 responses)

I agree it's safer to go with RAID6 in the long run, especially if you then mirror that ;)

My setup has two separate RAID5 arrays, in different machines. The NAS contains the main array, and my "home server" contains the backup array, which has fewer but larger disks, also in a RAID5, and gets an rsync of the NAS array every night. It also stores other backups, so it's not a direct clone; it just has a complete copy of the contents of the NAS.

It was all built ad hoc as I had money, and there were sales on 2TB (nas) and 3TB (backup) disks.

I may test out raid6 again eventually when I need to upgrade the nas, but it'll be a while. I built my nas to handle 5 additional years of media and other downloads, WITHOUT deleting anything. But I sometimes delete stuff I really don't need to keep, which extends the lifetime by quite a bit. So it may be a while.

I do however have a VM storage box to finish building, and I may go raid6 for the vm storage. Not sure yet. The actual host of the array(s) will not likely be directly connected to the actual VM host, so it won't need to have more than like 90MBps throughput, and a decent sized raid6 outstrips that several times over (though a raid5 of the same number of spindles is about twice as fast, if not more).

An md/raid6 data corruption bug

Posted Aug 23, 2014 14:51 UTC (Sat) by Wol (subscriber, #4433) [Link] (1 responses)

Quick maths ... 7 spindles, 2TB disks, raid 5. I make that 12TB storage.

6 spindles, 3TB disks, raid 6. I make that 12TB storage :-)

If disks are cheap you could upgrade the disks in the nas, then convert to raid 6 :-)

Dunno if you want to do that, though; I've just bought two 3TB disks, so I make the cost of your upgrade about £300 (Seagate Barracuda disks).

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 15:06 UTC (Sat) by Tomasu (guest, #39889) [Link]

Indeed. But the actual storage is about 11TB, and doing a full convert would be a pain in the butt. :D It helps that I have the backup array, though, so I could. I don't currently have enough 3TB disks to do that, however: the backup array has five 3TB disks, so it wouldn't actually be too expensive to do, just one more (or two, cause why not?) disk(s). If I do, I'll be waiting for a big sale on the 3TB disks. It's just not a huge priority for me at the moment; I have more than enough storage and enough protection to last quite a long time.

An md/raid6 data corruption bug

Posted Aug 24, 2014 3:08 UTC (Sun) by dlang (guest, #313) [Link] (5 responses)

> Change that to raid 5, and now you have dedicated 33% ...

It depends on how many disks are in your RAID array; the more disks, the less you are dedicating.

You're correct for a 3-disk array, but for a 10-disk array:

RAID1 (mirroring) eats 50%
RAID5 eats 10%
RAID6 eats 20%
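The figures above fall straight out of parity_disks / total_disks; a quick sketch (the function name is mine, chosen for illustration):

```python
def redundancy_overhead(total_disks: int, parity_disks: int) -> float:
    """Fraction of raw capacity spent on redundancy."""
    return parity_disks / total_disks

# 10-disk arrays, as in the comment above:
assert redundancy_overhead(10, 1) == 0.10   # RAID5: one parity disk
assert redundancy_overhead(10, 2) == 0.20   # RAID6: two parity disks
# RAID1 mirroring always eats half, regardless of size:
assert redundancy_overhead(2, 1) == 0.50
# ...versus the 3-disk RAID5 minimum discussed earlier (33%):
assert abs(redundancy_overhead(3, 1) - 1/3) < 1e-9
```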

An md/raid6 data corruption bug

Posted Aug 24, 2014 19:42 UTC (Sun) by Wol (subscriber, #4433) [Link] (4 responses)

:-)

I'm an underemployed ex-programmer struggling to make a living while caring for my wife :-)

I've just upgraded my PC to two mirrored 3TB drives. Next step add another drive and make it raid-5. If prices have fallen enough I'll then make it raid-6. So yes, I was talking about the smallest number of drives for each raid setup.

Adding further drives to raid-5 or -6 reduces the space allocated to safety, presumably increasing the risk, but I guess that's moderately negligible. Might even reduce it, by reducing the load on each disk. Like many things, I find my understanding increases with discussion - I might *know* the facts, but I don't always *understand* them - this discussion I know has deepened my understanding of raid.

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 24, 2014 20:55 UTC (Sun) by dlang (guest, #313) [Link] (3 responses)

growing to larger arrays does mean the chance of any single drive failing in a given time period is higher, but when you need multiple drive failures, the chances are still _really_ low, and it scales linearly with the number of drives in the array.

showing my math (in case I have it wrong :-)

If drives have a 12% chance of dying each year (rough figure from the big studies several years back), that's 1% a month, or ~0.2% per week per drive

if you have two drives, the chance of one of them failing each week is .4% (probabilities added), but the chance of them both failing in the same week is 0.0004% (0.2%*0.2%)

if you have three drives in RAID5, the chance of one of them failing each week is 0.6% (3*0.2%), while the chance of two of them failing the same week is 0.0024% ((3*0.2%)*(2*0.2%)), higher chances of loss, but twice the storage

if you have four drives in RAID6, the chance of one of them failing each week is 0.8% (4*0.2%), the chance of two of them failing in the same week is 0.0048% ((4*0.2%)*(3*0.2%)), while the chance of three failing in the same week is 0.0000002% ((4*0.2%)*(3*0.2%)*(2*0.2%))

if you have 10 drives in RAID6, the chance of one of them failing each week is 2% (10*0.2%), the chance of two of them failing in the same week is 0.036% ((10*0.2%)*(9*0.2%)), while the chance of three failing in the same week is 0.0006% ((10*0.2%)*(9*0.2%)*(8*0.2%))
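The figures above can be reproduced mechanically with the "multiply the single-drive chances" model used in the comment (a simplification that treats the failures as independent and ignores ordering; the function name is mine):

```python
def multi_failure_chance(n_drives: int, n_failures: int,
                         p_week: float = 0.002) -> float:
    """Estimated chance (as a fraction, not a percentage) that
    n_failures drives out of n_drives all fail within the same week,
    each with per-week failure probability p_week."""
    chance = 1.0
    for k in range(n_failures):
        chance *= (n_drives - k) * p_week
    return chance

# Reproducing the per-week figures above, as fractions:
assert abs(multi_failure_chance(3, 2) - 2.4e-5) < 1e-12    # 0.0024%
assert abs(multi_failure_chance(4, 2) - 4.8e-5) < 1e-12    # 0.0048%
assert abs(multi_failure_chance(10, 2) - 3.6e-4) < 1e-12   # 0.036%
assert abs(multi_failure_chance(10, 3) - 5.76e-6) < 1e-12  # ~0.0006%
```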

now, if it takes you less than a week to recover (or the drives are more reliable), these numbers get better fast.

say that instead of a 12% annual failure and one week rebuild time, you have a 3.5% annual failure and 1 day rebuild time. this translates to a ~0.01% chance of failure per drive within the rebuild time.

with this

RAID1 single disk 0.02%, two disk 0.000001% (1e-6)
3 disk RAID5 single disk 0.03%, two disk 0.000006% (6e-6)
4 disk RAID6 single disk 0.04%, two disk 0.000012% (1.2e-5), three disk 0.0000000024 (2.4e-9)
10 disk RAID6 single disk 0.1%, two disk 0.00009% (9e-5), three disk 0.00000007% (7.2e-8)

meanwhile the cost of the redundancy is fixed, so it becomes much cheaper (as a percentage) as the array grows.

Then there is the performance question. If you are doing largely sequential I/O (backups, large media files), the performance hit is fairly small (and theoretically can be reduced to basically nothing, but that would require that the OS know how to do raid stripe aligned writes, and the raid subsystem notice them, neither of which is available in the kernel today), but if you are doing a lot of small, random I/O (databases), the performance hit can be very large due to the read-modify-write cycle needed to keep the parity up to date.
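The read-modify-write cost comes from keeping parity consistent on every small write, via the identity P_new = P_old xor D_old xor D_new. A sketch of that identity (my illustration, not md's implementation):

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe of two data chunks plus parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)

# Small write to d0: READ old data and old parity, MODIFY the parity,
# WRITE both back - extra I/O that a full-stripe write avoids.
new_d0 = b"AzAA"
parity = xor(xor(parity, d0), new_d0)
d0 = new_d0

# Parity is still consistent with the whole stripe:
assert parity == xor(d0, d1)
```

A full-stripe write can compute the parity directly from the data being written, which is why aligned sequential writes largely escape this penalty.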

I have hopes that as flash drives and shingled rotating drives become more popular that the kernel will learn that it can save a lot of time by writing an entire eraseblock/RAID stripe/shingle stripe at once and start to prefer doing so.

An md/raid6 data corruption bug

Posted Aug 25, 2014 20:39 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

> If drives have a 12% chance of dieing each year (rough figure from the big studies several years back), that's 1% a month, or ~.2% per week per drive

Is that 12% for each drive, or 12% of drives expected to die (on second thought… is there a difference?)? If the latter, did you get 1% per month from 12% / 12, or from (1 - (1 - .12) ^ (1 / 12)) == 1.06%? The latter seems more accurate, but that's mainly my gut feeling here. As an example, a 50% annual failure rate becomes 5.6% per month instead of 4.17%, because you expect ~94.4% to survive each month until you're left with 50% still around. Then again, drives are independent (…ish), so maybe just straight division is better there. Anyway, leaving it here for a second thought on it.

An md/raid6 data corruption bug

Posted Aug 25, 2014 20:45 UTC (Mon) by dlang (guest, #313) [Link]

and this is why I show the math :-)

It was intended to be 12% of drives die in any given year, so 1% of drives die in any given month, or .2% of drives die in any given week (assuming that the failures are really independent, not a latent manufacturing defect that will affect all drives of a class)

An md/raid6 data corruption bug

Posted Feb 13, 2015 16:46 UTC (Fri) by ragaar (guest, #101043) [Link]

Your decimal places seem to be slightly off in the RAID6 scenario [3 drive failure].

fail_week = 0.002      # per-drive chance of failing in a given week

# chance of 4 drives, then the remaining 3, then 2 all
# failing in the same week (multiply failure by 100 to see %)
Y = 1.0
for n in (4, 3, 2):
    Y *= n * fail_week

# Y = (4*0.002) * (3*0.002) * (2*0.002) = 1.92e-07 ≈ 2e-07
# Y * 100 ≈ 2e-05%

We're amongst friends, so rounding to 2 is fine, but [as of Feb 2015] there seem to be a couple of extra decimal places in your post: the RAID6 scenario [3 drive failure] shows 2e-07% where it should be 2e-05%.

Side comment:
Thanks for consolidating this information. This was the only post that I've found combining HDD failure rates on a yearly/monthly/weekly interval, laid out the basic statistical math, and provided description as to the intent applied during each step.

It is well thought out posts like this that help make the internet better. You get a +1 (gold star) vote from me :)


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds