
An md/raid6 data corruption bug

Posted Aug 20, 2014 1:44 UTC (Wed) by neilbrown (subscriber, #359)
In reply to: An md/raid6 data corruption bug by Richard_J_Neill
Parent article: An md/raid6 data corruption bug

> I wonder whether there is any workaround yet for the problem of a RAID5 array where the degraded array suffers a single bit read-error on rebuild

Yes there is. It is called "bad block lists" or "bbl".

Rather than marking the whole device as faulty, md will just mark the bad block as faulty.

Requires mdadm 3.3 and Linux 3.1 or later (and "later" generally means "hopefully fewer bugs", but the more people I can get testing this, the sooner such bugs will be discovered).

To add a bbl to a pre-existing array you need to assemble with "--update=bbl". You cannot hot-add a BBL at present.
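For anyone wanting to try this, the invocation looks roughly like the following (the array and member device names are placeholders, not from the post; adjust for your own setup):

```shell
# Stop the array, then re-assemble it with --update=bbl so each
# member gets a bad-block list added to its metadata.
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 --update=bbl /dev/sd[abc]1

# Later, list any bad blocks recorded on a member device.
mdadm --examine-badblocks /dev/sda1
```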


An md/raid6 data corruption bug

Posted Aug 21, 2014 17:18 UTC (Thu) by Tomasu (guest, #39889) [Link] (10 responses)

Hmm, I'll look into that next time I build an array (which happens somewhat infrequently, "unfortunately" or fortunately depending on how you see it).

I mainly run RAID5 because I'm dumb, and dislike the added overhead of RAID6, but I also run a complete mirror of my main RAID5 onto another RAID5, so I feel I'm somewhat safe from the unlikely event of a double fault in the main array, and pretty safe from a double fault in both arrays happening at the same time. Is this dumb? Probably!

It certainly would be cheaper to go RAID6 in my main array, but that system (small nas in a lian-li PC-Q25 case) has 7 drive bays, and the goal was to reach 10TB+ in that limit with 2TB drives. Cutting out a drive for a second parity disk would have meant I'd not hit 10TB :( So in the end, I started picking up 3TB drives when they went on sale over the past couple years, and eventually built a backup array to host an entire copy of the main NAS, as well as a separate location to store my more important backups that are housed on a RAID1 of two (old) disks on my home server (and in two remote locations as well).

An md/raid6 data corruption bug

Posted Aug 23, 2014 11:40 UTC (Sat) by Wol (subscriber, #4433) [Link] (9 responses)

That sounds sensible. Basically, the greater the proportion of disk given over to error detection and recovery, the better. And you've got maybe 60%? But notice I used two words there - "detection" and "recovery".

A mirror dedicates 50% of your capacity to recovery. If you get a read failure (ie the disk doesn't respond) you can recover. If you get a read error (ie the wrong data is returned), the raid will detect it but your app won't know which version is correct.

Change that to raid 5, and now you have dedicated 33% of your capacity to detection and recovery. Any single read error will be detected, and any single read failure will be recovered from. And that's why you should run raid 5 over raid 1, and not the other way round: a read failure in the mirror will be handled by the raid 5, but it doesn't work the other way round. But it sounds like you're effectively running raid 1 over raid 5 :-(

Going to raid 6 now strengthens both the detection and recovery of raid 5. But given this article, it sounds like adding further parity disks to raid 6 might not be a bad idea :-)

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 12:55 UTC (Sat) by Tomasu (guest, #39889) [Link] (2 responses)

I agree it's safer to go with RAID6 in the long run, especially if you then mirror that ;)

My setup has two separate RAID5 arrays in different machines. The NAS contains the main array, and my "home server" contains the backup array, which has fewer but larger disks, also in a RAID5, and gets an rsync of the NAS array every night. It also stores other backups, so it's not a direct clone; it just has a complete copy of the contents of the NAS.

It was all built ad hoc as I had money and there were sales on 2TB (nas) and 3TB (backup) disks.

I may test out raid6 again eventually when I need to upgrade the nas, but it'll be a while. I built my nas to handle 5 additional years of media and other downloads, WITHOUT deleting anything. But I sometimes delete stuff I really don't need to keep, which extends the lifetime by quite a bit. So it may be a while.

I do however have a VM storage box to finish building, and I may go raid6 for the vm storage. Not sure yet. The actual host of the array(s) will not likely be directly connected to the actual VM host, so it won't need to have more than like 90MBps throughput, and a decent sized raid6 outstrips that several times over (though a raid5 of the same number of spindles is about twice as fast, if not more).

An md/raid6 data corruption bug

Posted Aug 23, 2014 14:51 UTC (Sat) by Wol (subscriber, #4433) [Link] (1 responses)

Quick maths ... 7 spindles, 2TB disks, raid 5. I make that 12TB storage.

6 spindles, 3TB disks, raid 6. I make that 12TB storage :-)
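The arithmetic generalizes: raid 5 gives (n-1) disks' worth of usable space, raid 6 gives (n-2). A one-line check (plain arithmetic, nothing array-specific; the function name is mine):

```python
def usable_tb(spindles, disk_tb, parity_disks):
    """Usable capacity of a striped parity array: total minus parity disks."""
    return (spindles - parity_disks) * disk_tb

print(usable_tb(7, 2, 1))  # 7x 2TB in raid 5 -> 12
print(usable_tb(6, 3, 2))  # 6x 3TB in raid 6 -> 12
```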

If disks are cheap you could upgrade the disks in the nas, then convert to raid 6 :-)

Dunno if you want to do that, though, I've just bought two 3TB disks, so I make the cost of your upgrade about £300 (Seagate Barracuda disks).

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 23, 2014 15:06 UTC (Sat) by Tomasu (guest, #39889) [Link]

Indeed. But the actual storage is about 11TB, and doing a full convert would be a pain in the butt. :D It helps that I have the backup array, though, so I could. I don't currently have enough 3TB disks to do that, however: the backup array has five 3TB disks, so it wouldn't actually be too expensive, just one more (or two, cause why not?) disk(s). If I do, I'll be waiting for a big sale on the 3TB disks. It's just not a huge priority for me at the moment; I have more than enough storage and enough protection to last quite a long time.

An md/raid6 data corruption bug

Posted Aug 24, 2014 3:08 UTC (Sun) by dlang (guest, #313) [Link] (5 responses)

> Change that to raid 5, and now you have dedicated 33% ...

It depends on how many disks are in your RAID array; the more disks, the smaller the fraction you dedicate.

You're correct for a 3-disk array, but for a 10-disk array:

RAID1 (mirroring) eats 50%
RAID5 eats 10%
RAID6 eats 20%
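Those percentages fall straight out of parity-disks divided by total disks; a quick sketch (function name is mine):

```python
def redundancy_overhead(n_disks, level):
    """Fraction of raw capacity spent on redundancy for a given RAID level."""
    parity = {"raid1": n_disks / 2, "raid5": 1, "raid6": 2}[level]
    return parity / n_disks

for level in ("raid1", "raid5", "raid6"):
    print(level, redundancy_overhead(10, level))  # 0.5, 0.1, 0.2
```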

An md/raid6 data corruption bug

Posted Aug 24, 2014 19:42 UTC (Sun) by Wol (subscriber, #4433) [Link] (4 responses)

:-)

I'm an underemployed ex-programmer struggling to make a living while caring for my wife :-)

I've just upgraded my PC to two mirrored 3TB drives. Next step add another drive and make it raid-5. If prices have fallen enough I'll then make it raid-6. So yes, I was talking about the smallest number of drives for each raid setup.

Adding further drives to raid-5 or -6 reduces the space allocated to safety, presumably increasing the risk, but I guess that's moderately negligible. Might even reduce it, by reducing the load on each disk. Like many things, I find my understanding increases with discussion - I might *know* the facts, but I don't always *understand* them - this discussion I know has deepened my understanding of raid.

Cheers,
Wol

An md/raid6 data corruption bug

Posted Aug 24, 2014 20:55 UTC (Sun) by dlang (guest, #313) [Link] (3 responses)

growing to larger arrays does mean the chance of any single drive failing in a given time period is higher, but when you need multiple drive failures, the chances are still _really_ low, and it scales linearly with the number of drives in the array.

showing my math (in case I have it wrong :-)

If drives have a 12% chance of dying each year (rough figure from the big studies several years back), that's 1% a month, or ~.2% per week per drive

if you have two drives, the chance of one of them failing each week is .4% (probabilities added), but the chance of them both failing in the same week is 0.0004% (0.2%*0.2%)

if you have three drives in RAID5, the chance of one of them failing each week is 0.6% (3*0.2%), while the chance of two of them failing the same week is 0.0024% ((3*0.2%)*(2*0.2%)); higher chances of loss, but twice the storage

if you have four drives in RAID6, the chance of one of them failing each week is 0.8% (4*0.2%), the chance of two of them failing in the same week is 0.0048% ((4*0.2%)*(3*0.2%)), while the chance of three failing in the same week is 0.0000002% ((4*0.2%)*(3*0.2%)*(2*0.2%))

if you have 10 drives in RAID6, the chance of one of them failing each week is 2% (10*0.2%), the chance of two of them failing in the same week is 0.036% ((10*0.2%)*(9*0.2%)), while the chance of three failing in the same week is 0.0006% ((10*0.2%)*(9*0.2%)*(8*0.2%))
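The figures above come from multiplying the per-drive window probabilities, scaled by how many drives are still left to fail. A sketch of that back-of-envelope calculation (function name is mine; this ignores rebuild-window subtleties, as the original does):

```python
def concurrent_failures(n_drives, n_failures, p_window):
    """Rough chance of n_failures drives dying in the same window,
    in the (n*p) * ((n-1)*p) * ... style used above."""
    chance = 1.0
    for i in range(n_failures):
        chance *= (n_drives - i) * p_window
    return chance

print(concurrent_failures(10, 2, 0.002))  # ~0.00036, i.e. 0.036%
print(concurrent_failures(10, 3, 0.002))  # ~5.8e-06, i.e. ~0.0006%
```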

now, if it takes you less than a week to recover (or the drives are more reliable), these numbers get better fast.

say that instead of a 12% annual failure and one week rebuild time, you have a 3.5% annual failure and 1 day rebuild time. this translates to a ~0.01% chance of failure per drive within the rebuild time.

with this

RAID1 single disk 0.02%, two disk 0.000001% (1e-6)
3 disk RAID5 single disk 0.03%, two disk 0.000006% (6e-6)
4 disk RAID6 single disk 0.04%, two disk 0.000012% (1.2e-5), three disk 0.0000000024% (2.4e-9)
10 disk RAID6 single disk 0.1%, two disk 0.00009% (9e-5), three disk 0.00000007% (7.2e-8)

meanwhile the cost of the redundancy is fixed, so it becomes much cheaper (as a percentage) as the array grows.

Then there is the performance question. If you are doing largely sequential I/O (backups, large media files), the performance hit is fairly small (and theoretically can be reduced to basically nothing, but that would require that the OS know how to do raid stripe aligned writes, and the raid subsystem notice them, neither of which is available in the kernel today), but if you are doing a lot of small, random I/O (databases), the performance hit can be very large due to the read-modify-write cycle needed to keep the parity up to date.

I have hopes that as flash drives and shingled rotating drives become more popular that the kernel will learn that it can save a lot of time by writing an entire eraseblock/RAID stripe/shingle stripe at once and start to prefer doing so.

An md/raid6 data corruption bug

Posted Aug 25, 2014 20:39 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

> If drives have a 12% chance of dying each year (rough figure from the big studies several years back), that's 1% a month, or ~.2% per week per drive

Is that 12% for each drive, or 12% of drives are expected to die (on second thought… is there a difference?)? If the latter, did you get 1% per month from 12% / 12 or from (1 - (1 - .12) ^ (1 / 12)) == 1.06%? The latter seems more accurate, but that's mainly my gut feeling. As an example, a 50% annual failure rate becomes 5.6% per month instead of 4.17%, because you expect ~94.4% to survive each month until you're left with 50% still around. Then again, drives are independent (…ish), so maybe straight division is better there. Anyway, leaving it here for a second thought on it.
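The two conversions in question, side by side (a quick check; variable names are mine):

```python
annual = 0.12  # 12% chance a drive dies in a given year

naive = annual / 12                          # straight division
compounded = 1 - (1 - annual) ** (1 / 12)    # survival-rate compounding

print(naive)       # 0.01, i.e. 1% per month
print(compounded)  # ~0.0106, i.e. ~1.06% per month
```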

An md/raid6 data corruption bug

Posted Aug 25, 2014 20:45 UTC (Mon) by dlang (guest, #313) [Link]

and this is why I show the math :-)

It was intended to be 12% of drives die in any given year, so 1% of drives die in any given month, or .2% of drives die in any given week (assuming that the failures are really independent, not a latent manufacturing defect that will affect all drives of a class)

An md/raid6 data corruption bug

Posted Feb 13, 2015 16:46 UTC (Fri) by ragaar (guest, #101043) [Link]

Your decimal places seem to be slightly off in the RAID6 scenario [3 drive failure]

fail_week = 0.002  # per-drive chance of failing in a given week

# n = number of drives still able to fail
# NOTE: you can multiply failure by 100 to see % failure
def fail_each_week(n):
    return n * fail_week

Y = 1.0
for x in (4, 3, 2):
    Y *= fail_each_week(x)

Y = ((4*0.002)*(3*0.002)*(2*0.002)) = 1.92e-07 ≈ 2e-07
Y * 100 ≈ 2e-05%

We're amongst friends, so rounding to 2 is fine, but [as of Feb 2015] there seem to be a couple of extra decimal places in your post: the RAID6 scenario [3 drive failure] shows 2e-07% as opposed to 2e-05%.

Side comment:
Thanks for consolidating this information. This was the only post that I've found combining HDD failure rates on a yearly/monthly/weekly interval, laid out the basic statistical math, and provided description as to the intent applied during each step.

It is well thought out posts like this that help make the internet better. You get a +1 (gold star) vote from me :)


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds