LWN: Comments on "A journal for MD/RAID5" https://lwn.net/Articles/665299/ This is a special feed containing comments posted to the individual LWN article titled "A journal for MD/RAID5". en-us Sun, 07 Sep 2025 06:05:23 +0000 Sun, 07 Sep 2025 06:05:23 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net A journal for MD/RAID5 https://lwn.net/Articles/879809/ https://lwn.net/Articles/879809/ snnn <div class="FormattedComment"> While it looks beautiful, we must be aware of its limitations. <br> First, the write-through mode is for increasing data safety, not performance. The problem it tries to fix, the write hole, isn&#x27;t common, so you don&#x27;t really need this feature. RAID isn&#x27;t a backup; it doesn&#x27;t need to provide 100% data safety. It is meant to reduce downtime in the most common scenarios, so the extra gain from adding a RAID journal is small. <br> <p> While the write-back mode can increase performance, it reduces reliability, because the journal device can&#x27;t itself be an MD RAID array. <br> <p> And I think the code isn&#x27;t stable yet. We saw kernel hangs when a RAID array with a write journal was resyncing under heavy read/write load. Similar problems were also reported to the linux-raid mailing list by other users. <br> </div> Fri, 24 Dec 2021 18:16:10 +0000
Is there any redundancy while the data resides in the journal? https://lwn.net/Articles/879808/ https://lwn.net/Articles/879808/ snnn <div class="FormattedComment"> According to a discussion on linux-raid, 6 years later, the answer is still: no. You can&#x27;t use another MD RAID array for this. You may use a hardware RAID, but it doesn&#x27;t make much sense. <br> </div> Fri, 24 Dec 2021 18:15:57 +0000
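For concreteness, attaching a journal with mdadm and switching between the two modes discussed above looks roughly like this (a sketch only: the device names are placeholders, --write-journal needs mdadm 3.4 or later, and the runtime journal_mode switch needs a 4.10+ kernel):
<pre>
# Create a RAID5 array whose write hole is closed by a journal device
# (which, as noted above, cannot itself be an MD array).
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 \
      --write-journal /dev/nvme0n1p1

# The default journal mode is the safer write-through; write-back trades
# some reliability for performance and can be selected at runtime:
cat /sys/block/md0/md/journal_mode
echo write-back > /sys/block/md0/md/journal_mode
</pre>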
A journal for MD/RAID5 https://lwn.net/Articles/716020/ https://lwn.net/Articles/716020/ nix <div class="FormattedComment"> Aside: With the writeback journal work in v4.10+, nobarrier won't provide much speedup, since all barriers need to do is ensure a flush to the journal, not the RAID array itself -- and generally one requires nobarrier due to seek-induced sloth. So, AIUI, you can ignore nobarrier entirely if you are using a writeback journal.<br> <p> </div> Wed, 01 Mar 2017 15:52:53 +0000
A journal for MD/RAID5 https://lwn.net/Articles/716014/ https://lwn.net/Articles/716014/ nix <div class="FormattedComment"> Aside, years later: I'm actually getting an SSD-based box now (and using md RAID5 + journal + bcache, with journal + bcache on the same SSD). The SSD I'm putting in it is a 480GiB Intel one guaranteed for the to-me-astonishing figure of one complete device write per day for five years. That comes out as 900TiB.<br> <p> I... don't think it's worth worrying about this much. Not even if you're, say, compiling Chromium over and over again on the RAID: at ~90GiB of writes each time, that *still* comes to less than one complete device write per day, because compiling Chromium is not a fast thing.<br> <p> (However, I'm still splitting off a non-bcached, non-journalled md0 array for transient stuff I don't care about and won't ever read, or won't read more than once, simply because it's *inefficient* to burn stuff into SSD that I'll never reference.)<br> </div> Wed, 01 Mar 2017 15:00:18 +0000
A journal for MD/RAID5 https://lwn.net/Articles/668136/ https://lwn.net/Articles/668136/ nix <div class="FormattedComment"> Well, yeah, ionic migration will slowly wear all silicon chips out -- but the CPU will be aged no matter what you do, which seems a more significant problem (and in fact with modern processes that aging has now started to reach the point where it is noticeable to real users, rather than being a century or so away: don't expect to be still using that machine of yours a decade or so from now...)<br> </div> Wed, 16 Dec 2015 18:14:05 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667951/ https://lwn.net/Articles/667951/ hummassa <div class="FormattedComment"> A rust-based control would have been nice...<br> </div> Tue, 15 Dec 2015 20:05:26 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667851/ https://lwn.net/Articles/667851/ paulj <div class="FormattedComment"> So, I think RAM actually can wear out. My understanding is that use does cause physical changes to the electronics. With DRAM I think this can manifest itself as increasing error rates with use. This ageing process may be effectively negligible for nearly all purposes, though. ;) <br> <p> <p> </div> Tue, 15 Dec 2015 09:56:12 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667817/ https://lwn.net/Articles/667817/ smckay <blockquote>write-once-access-never archival storage...</blockquote> Sounds like an excellent application for the <a href="http://repeater-builder.com/molotora/gontor/25120-bw.pdf">Signetics 25000 Series 9C46XN</a>. An underrated chip that never got near enough use. Mon, 14 Dec 2015 23:02:23 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667783/ https://lwn.net/Articles/667783/ nix <div class="FormattedComment"> Would that workload age a normal rotating rust drive? Does I/O age a spinning rust drive noticeably? (I mean, it must do so a *bit* because the head is moving more than it otherwise would -- but how much? Presumably reads are as bad as writes...)<br> <p> The upcoming spinning rust drives that have their heads contacting the storage medium -- now *those* would get aged by this, and indeed by any load at all. But as far as I can tell those suck for any purpose other than write-once-access-never archival storage...<br> </div> Mon, 14 Dec 2015 19:53:29 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667779/ https://lwn.net/Articles/667779/ bronson <div class="FormattedComment"> That kind of workload is going to age any kind of drive -- rust-based or flash-based. But it sounds like this data is fairly disposable, so you can treat your drives like the throwaway commodities they are. Live sports studios replace their streaming storage on a schedule, long before it starts getting questionable.<br> <p> An exotic DRAM-based drive might be more reliable than just swapping out your devices every n events. Or it might not; I've never used one.<br> </div> Mon, 14 Dec 2015 19:34:47 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667671/ https://lwn.net/Articles/667671/ nix <div class="FormattedComment"> One of my use cases -- not one I do all the time, but one I want to be able to do without worrying about wearing things out, and I can easily see people who are actually involved in video processing rather than being total dilettantes like me spending far more time on it -- involves writing on the order of three hundred gigabytes every *three hours*, for weeks on end (and even then, even slow non-RAIDed consumer rotating rust drives can keep up with that load and only sit at about 20% utilization, averaged over time).<br> <p> In light of the 200TiB figure, it's safe to e.g. not care about doing builds and the like on SSDs, even very big builds of monsters like LibreOffice with debugging enabled (so it writes out 20GiB per build; that's nothing, a ten-thousandth of the worst observed failure level and a hundred-thousandth of some of them). But things like huffyuv-compressed video being repeatedly rewritten as things mux and unmux it... that's more substantial. One of my processing flows writes a huffyuv-not-very-compressed data mountain out *eight times* as the mux/unmux/chew/mux/remuxes fly past, and only then gets to deal with tools that can handle something that's been compressed to a useful level. Ideally that'd all sit on a ramdisk, but who the hell has that much RAM? Not me, that's for sure. So I have to let the machine read and write on the order of a terabyte, each time... thankfully, this being Linux, the system is snappy and responsive while all this is going on, so I can more or less ignore the thing as a background job -- but if it ages my drives before their time I wouldn't be able to ignore it!<br> <p> </div> Sun, 13 Dec 2015 21:24:45 +0000
Block device checksumming https://lwn.net/Articles/667665/ https://lwn.net/Articles/667665/ itvirta <div class="FormattedComment"> Well, of course even the basic dm-verity would be able to detect changes. <br> But everything I can find tells me that it's a read-only target, which isn't<br> really what one wants for general use.<br> <p> </div> Sun, 13 Dec 2015 16:09:04 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667619/ https://lwn.net/Articles/667619/ joib <div class="FormattedComment"> AFAIU the block barrier rework which landed 2.6.33-ish ought to make barrier vs nobarrier mostly moot if you have a non-volatile write cache.<br> <p> Disclaimer: This is from reading various comments from people more knowledgeable than me on the matter around the time this was merged, and on a very-high-level understanding of how the code works, rather than on actual benchmarks.<br> </div> Sat, 12 Dec 2015 09:05:40 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667596/ https://lwn.net/Articles/667596/ zlynx <div class="FormattedComment"> SSDs are much less fragile than people seem to think.<br> <p> The Tech Report ran six drives until they died: <a href="http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead">http://techreport.com/review/27909/the-ssd-endurance-expe...</a><br> <p> First failures were at around 200 TB of writes. That is a lot. The next one was at 700 TB. Two of the drives survived more than 2 PB of writes.<br> <p> I don't believe you should worry about a couple hundred gigabytes unless you do it every day for a couple of years.<br> </div> Fri, 11 Dec 2015 22:59:09 +0000
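To put those endurance numbers against the write rates quoted earlier in the thread, a quick back-of-the-envelope check (assuming the ~300GB-every-three-hours workload mentioned above runs continuously and that every byte of it passes through the journal or cache device):
<pre>
# ~300 GB every 3 hours is ~2.4 TB/day of writes hitting the journal/cache:
echo $(( 300 * 24 / 3 ))        # 2400 GB/day
# Days until the Tech Report failure points are reached at that rate:
echo $(( 200000 / 2400 ))       # ~83 days to the first failures (200 TB)
echo $(( 2000000 / 2400 ))      # ~833 days (about 2.3 years) to 2 PB
</pre>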
A journal for MD/RAID5 https://lwn.net/Articles/667590/ https://lwn.net/Articles/667590/ nix <div class="FormattedComment"> Quite. Further, RAM never wears out, no matter how hard it's written and for how long. As someone who fairly routinely writes hundreds of gigs to RAID at a time, I'd like to be able to close the write hole without rapidly destroying an SSD in the process! bcache detects and avoids caching I/O from processes doing a lot of sequential I/O, but that's not going to work in this situation: you have to journal the lot.<br> <p> </div> Fri, 11 Dec 2015 22:24:03 +0000
lvm https://lwn.net/Articles/667528/ https://lwn.net/Articles/667528/ Pc5Y9sbv <div class="FormattedComment"> I've always used MD RAID to construct the PVs underneath my LVM2 volume groups. All the flexibility of LVM to migrate logical volumes and all the flexibility of mdadm to manage disk failures, reshaping, etc.<br> <p> Lately, I am combining these with LV cache. I use MD RAID to create redundant SSD and bulk HDD arrays as different PVs and can choose to place some LVs only on SSD, some only on HDD, and some as an SSD-cached HDD volume.<br> </div> Fri, 11 Dec 2015 02:50:36 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/667508/ https://lwn.net/Articles/667508/ hmh <div class="FormattedComment"> Well, people are trying to add proper FEC to dm-verity.<br> <p> I realise this is not what you asked for, since it actually repairs the data, but hey, that could be even more useful depending on what you want to do ;-)<br> <p> Thread starts at: <br> <a href="https://lkml.org/lkml/2015/11/4/772">https://lkml.org/lkml/2015/11/4/772</a><br> <p> <p> </div> Thu, 10 Dec 2015 21:41:58 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667468/ https://lwn.net/Articles/667468/ andresfreund <div class="FormattedComment"> Entirely depends on the type of load. Few SSDs, for example, are fast enough to saturate SATA for small random writes.<br> </div> Thu, 10 Dec 2015 15:44:31 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/667463/ https://lwn.net/Articles/667463/ itvirta <div class="FormattedComment"> As an aside, and this may be assuming the worst of hard drives, has there been any thought on a block-checksumming RAID?<br> (Or any block device, but being able to read from a mirror or parity if the data is corrupted would be nice.)<br> <p> Can we have one for Christmas?<br> <p> <p> </div> Thu, 10 Dec 2015 15:29:32 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667462/ https://lwn.net/Articles/667462/ itvirta <div class="FormattedComment"> I thought SSDs can already be as fast as the SATA interface, so I wonder what the advantage of SATA-attached RAM is,<br> especially given that an SSD can easily be 4x the size of that 64 GB RAM thingy. Putting all that RAM on the motherboard<br> might be different, though.<br> <p> </div> Thu, 10 Dec 2015 15:20:34 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/667316/ https://lwn.net/Articles/667316/ pizza <div class="FormattedComment"> Oh, I've seen that bufferbloat problem myself. It's much better in recent years (and more modern HW generations), but intelligently setting up the file system with awareness of the underlying block/stripe sizes also made a hell of a difference.<br> <p> But I don't use 3Ware cards for RAID5 write performance; I use them for reliability/robustness for bulk storage that is nearly always under read loads. (If write performance mattered that much, I'd use enough disks for RAID10; RAID5/6 is awful.)<br> </div> Wed, 09 Dec 2015 17:04:46 +0000
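The stripe-aware filesystem setup being referred to is roughly the following (a sketch, assuming a 4-disk RAID5 with md's default 512KiB chunk size and a 4KiB-block ext4 filesystem; the numbers must be recomputed for other geometries):
<pre>
# ext4 takes its RAID hints in filesystem blocks (4 KiB here):
#   stride       = chunk size / block size       = 512 KiB / 4 KiB = 128
#   stripe-width = stride * number of data disks = 128 * 3         = 384
mkfs.ext4 -E stride=128,stripe-width=384 /dev/md0
# XFS usually picks the md geometry up automatically, but it can be given
# explicitly:  mkfs.xfs -d su=512k,sw=3 /dev/md0
</pre>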
Linux Software RAID > Hardware RAID https://lwn.net/Articles/667205/ https://lwn.net/Articles/667205/ Yenya <div class="FormattedComment"> 3ware cards suffer(ed) badly from the bufferbloat problem - the perceived filesystem latency rapidly increased with increasing write load. This is something that MD RAID does not manifest.<br> </div> Tue, 08 Dec 2015 21:34:55 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/667112/ https://lwn.net/Articles/667112/ pizza <div class="FormattedComment"> This is one area where our experiences differ.<br> <p> In fifteen years of using 3Ware RAID cards, for example, I've never had a single controller-induced failure, or data loss that wasn't the result of blatant operator error (or multiple drive failures). My experience with the DAC960/1100 series was similar (though I did once have a controller fail; no data loss once it was swapped). I've even performed your described failure scenario multiple times. Even in the days of PATA/parallel SCSI, hot-swapping (and hot spares) just worked with those things.<br> <p> (3Ware cards, the DAC family, and a couple of the Dell PERC adapters were the only ones I had good experiences with; the rest were varying degrees of WTF-to-outright-horror. Granted, my experience is now about five years out of date.)<br> <p> Meanwhile, the Supermicro-based server next to me actually *locked up* two days ago when I attempted to swap a failing eSATA-attached drive used for backups.<br> <p> But my specific comment about robustness is that you can easily end up with an unbootable system if the wrong drive fails on an MD RAID array that contains /boot. And if you don't put /boot on an array, you end up in the same position. (To work around this, I traditionally put /boot on a PATA CF card or USB stick, which I regularly imaged and backed up so I could immediately swap in a replacement.)<br> </div> Tue, 08 Dec 2015 13:27:47 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/667096/ https://lwn.net/Articles/667096/ Yenya <div class="FormattedComment"> I am not sure what you mean by "reliability and robustness at the system level": sure, battery-backed RAM is an advantage (unless the MD journal reaches end-user kernels). But I have lots of stories where HW RAID failed for bizarre reasons, such as replacing a failed drive with a vendor-provided one which had not been erased beforehand, and which destroyed the configuration of the whole array, because the controllers in the array thought that the configuration stored on the replacement drive was for some reason newer than the configuration stored on the rest of the drives in the array. So no, my experience tells me that the reliability and robustness are on the Linux MD RAID side.<br> </div> Tue, 08 Dec 2015 07:36:13 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/667064/ https://lwn.net/Articles/667064/ pizza <div class="FormattedComment"> As a counterpoint, HW RAID controllers offer an advantage in reliability and robustness at the system level. Or at least the good ones do.<br> </div> Mon, 07 Dec 2015 22:53:44 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/667056/ https://lwn.net/Articles/667056/ Yenya <div class="FormattedComment"> I, for one, also think that the article summary is too modest. Linux MD RAID is far superior to the hardware solutions not only for the reasons written above, but _also_ for its performance. In my experience, even with the most expensive solutions of the time (including the DAC960 SCSI-to-SCSI bridges and, to some extent, 3ware HW RAID cards), the performance of HW RAID sucks. The typical HW RAID controller happily accepts your write requests and then stalls later read requests, manifesting behaviour similar to networking bufferbloat. The kernel, on the other hand, is aware of the requests that are being waited upon, and can prioritise them accordingly. Also, the kernel can use the whole of RAM as a cache, while the HW RAID controller cache is much more expensive per byte, and usually unobtainable in sizes comparable to the RAM of modern computers. For me, "HW RAID" is a bad joke. JBOD with Linux MD RAID is much better.<br> </div> Mon, 07 Dec 2015 21:56:50 +0000
A journal for MD/RAID5 https://lwn.net/Articles/667044/ https://lwn.net/Articles/667044/ nix <div class="FormattedComment"> An update: it turns out that there *are* still devices being sold which look like disks and are actually RAM. They vary from insanely pricey things which give you 8GiB of SAS-attached storage for the low, low price of $3000 (!!!) to this: &lt;<a href="http://www.hyperossystems.co.uk/07042003/hardware.htm">http://www.hyperossystems.co.uk/07042003/hardware.htm</a>&gt; which looks just about perfect: the price is sane enough, at least.<br> <p> Actually you can stick enough RAM in it that it's questionable if you need an SSD at all, even for bcache: just partition this thing and use some of it for RAID write-hole avoidance and some of it for bcache. It can even dump its contents onto CF and restore back from it if the battery runs out.<br> <p> I think my next machine will have one of these.<br> </div> Mon, 07 Dec 2015 19:24:57 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/666976/ https://lwn.net/Articles/666976/ hmh <blockquote> <blockquote><p>you'd actually want to trigger a hard-scrub that rewrites all stripes</p></blockquote> <p>Is that really something that people would want?</p> <p>I guess I imagine that the drive itself would notice if there was any weakness in the current recording (e.g. correctable errors) and would re-write the block proactively. So all that should be necessary is to read every block. But maybe I give too much credit to the drive firmware.</p> </blockquote> <p>I used to think the HDD firmware would handle that sanely, as well. Well, let's just say you cannot assume consumer HDDs of 1TB and above will do that properly (or will be successful at it while the sector is still weak but ECC-correctable).</p> <p>Forcing a scrub has saved my data several times already. Once it starts happening, I get a new group of unreadable sectors detected by SMART or by an array read attempt every couple of weeks (each requiring an md repair cycle to ensure none are left behind), until I get pissed off enough to find a way to force a hard scrub of the entire component device (typically by using mdadm --replace with the help of a hot-spare device).</p> Sun, 06 Dec 2015 14:00:22 +0000
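The mdadm --replace approach mentioned there would look something like the following (a sketch only: the device names are placeholders and mdadm 3.3 or later is assumed):
<pre>
# Add a spare, then rebuild the suspect component's contents onto it while
# the array stays fully redundant:
mdadm /dev/md0 --add /dev/sdf1
mdadm /dev/md0 --replace /dev/sdc1 --with /dev/sdf1
# When the copy finishes, the old device is marked faulty and can be removed:
mdadm /dev/md0 --remove /dev/sdc1
</pre>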
A journal for MD/RAID5 https://lwn.net/Articles/666975/ https://lwn.net/Articles/666975/ hmh <div class="FormattedComment"> This would be really simple to fix, as adding missing variants for "disable barrier" is not going to be an ABI break.<br> <p> However, it is also of very limited value (in fact, it could make things worse) because it will not be supported on older kernels, unless this kind of change is accepted in the -stable trees and also backported by the distros.<br> </div> Sun, 06 Dec 2015 13:06:12 +0000
A journal for MD/RAID5 https://lwn.net/Articles/666867/ https://lwn.net/Articles/666867/ plugwash <div class="FormattedComment"> In networking, packets can vary in size fairly freely (yes, there is a minimum and a maximum, but any size between those is typically allowed), whereas storage blocks are expected to be a power of two in size. If the layer below you works in power-of-two-sized blocks and the layer above you expects power-of-two-sized blocks, then you can't easily store metadata in the blocks themselves without either wasting half the space or creating a block boundary mismatch which is likely to kill performance. If you store it separately then you risk it getting out of sync (though checksums can fix that) and also, again, waste a fair bit of performance reading and writing it.<br> <p> The only real fix for this is to change the model: rather than providing redundancy as a shim layer between the storage system and the filesystem, provide it as part of the filesystem. <br> </div> Fri, 04 Dec 2015 15:17:05 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/666836/ https://lwn.net/Articles/666836/ neilbrown <div class="FormattedComment"> <font class="QuotedText">&gt; you'd actually want to trigger a hard-scrub that rewrites all stripes</font><br> <p> Is that really something that people would want?<br> <p> I guess I imagine that the drive itself would notice if there was any weakness in the current recording (e.g. correctable errors) and would re-write the block proactively. So all that should be necessary is to read every block. But maybe I give too much credit to the drive firmware.<br> <p> FWIW this would be quite straightforward to implement if anyone thought it would actually be used and wanted a journeyman project to work on.<br> <p> </div> Fri, 04 Dec 2015 09:13:14 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/666762/ https://lwn.net/Articles/666762/ hmh <blockquote><blockquote><p>Array components remain directly accessible. Thus, it is easy to run periodic badblocks scans on all your disks, regardless of whether they’re part of an array or not.</p></blockquote> <p>I guess this gives you half of a poor-man's scrubbing, but any real hardware RAID will provide automatic scrubbing in any case.</p></blockquote> <p>Well, md's "repair" sync_action will give you poor-man's scrubbing (which only rewrites when the underlying storage reports an erasure/read error, or when the parity data sets are not consistent with the data -- ideal for SSDs, but not really what you want for modern "slightly forgetful" HDDs, where you'd actually want to trigger a hard-scrub that rewrites all stripes).</p> Thu, 03 Dec 2015 16:48:37 +0000
lvm https://lwn.net/Articles/666717/ https://lwn.net/Articles/666717/ shane <div class="FormattedComment"> That's a pity. I use lvm all the time, and have found the ability to grow volumes and ship them between physical drives to be quite handy.<br> </div> Thu, 03 Dec 2015 14:52:40 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/666363/ https://lwn.net/Articles/666363/ ldo <P>Yes, I can do something with the output of <TT>badblocks</TT>; if a disk has bad sectors on it, I replace it. I’ve found more than one bad disk this way. Also, <TT>badblocks</TT> scans work whether the disk is RAIDed or not. <P><A HREF="https://github.com/ldo/scan_disks">Here</A> is a pair of scripts I wrote to ease the job of running <TT>badblocks</TT> scans. Tue, 01 Dec 2015 19:51:55 +0000
A journal for MD/RAID5 https://lwn.net/Articles/666277/ https://lwn.net/Articles/666277/ nix <div class="FormattedComment"> <font class="QuotedText">&gt; All current controllers use RAM, supercapacitors and flash. The supercapacitor provides just enough power to allow writing the cache to flash. </font><br> <p> Or you could do that. Again this requires specialist hardware support and is almost surely unavailable to md :(<br> </div> Tue, 01 Dec 2015 12:06:34 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/666276/ https://lwn.net/Articles/666276/ nix <blockquote> Array components remain directly accessible. Thus, it is easy to run periodic badblocks scans on all your disks, regardless of whether they’re part of an array or not. </blockquote> I guess this gives you half of a poor-man's scrubbing, but any real hardware RAID will provide automatic scrubbing in any case. You can't <i>do</i> anything with the output of <tt>badblocks</tt>: even if it finds some, because md doesn't know there's a bad block there, it's not going to do any of the things it would routinely do when a bad block is found (starting by rewriting it from the others to force the drive to spare it out, IIRC). All you can do is start a real scrub -- in which case why didn't you just run one in the first place? Set the min and max speeds right and it'll be a lot less disruptive to you, disk-load-wise, too. <p> Your first and last points are, of course, compelling (and they're why I'm probably going mdadm on the next machine -- well, that and the incomparable Neil Brown advantage: you will <i>never</i> find a hardware RAID vendor as clued-up or helpful), but this one in particular seems like saying 'md is better than hardware RAID because you can badly implement, by hand, half of something hardware RAID does as a matter of course'. <p> The right way to scrub with mdadm is <tt>echo check &gt; /sys/block/md*/md/sync_action</tt> (or '<tt>repair</tt>' if you want automatic rewriting of bad blocks). If you're using <tt>badblocks</tt> by hand I'd say you're doing something wrong. Tue, 01 Dec 2015 12:05:13 +0000
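Spelled out, the scrub-and-throttle approach looks roughly like this (md0 and the speed values are examples; the same limits also exist globally as /proc/sys/dev/raid/speed_limit_min and speed_limit_max):
<pre>
# Read-only consistency check, or a scrub that also rewrites bad/mismatched data:
echo check > /sys/block/md0/md/sync_action
# echo repair > /sys/block/md0/md/sync_action

# Throttle it so it stays out of the way (KiB/s per device):
echo 10000  > /sys/block/md0/md/sync_speed_min
echo 100000 > /sys/block/md0/md/sync_speed_max

# Watch progress and the result:
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt   # non-zero after a check indicates inconsistent stripes
</pre>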
Linux Software RAID > Hardware RAID https://lwn.net/Articles/666259/ https://lwn.net/Articles/666259/ ldo <P>&gt; I just monitor the S.M.A.R.T. stats on the physical disks, and as soon as<br> &gt; I see reallocated sectors, they get replaced. <P>In that case, you are probably replacing disks a lot more often than you need to, adding to your costs without any significant improvement in data reliability. Tue, 01 Dec 2015 07:48:08 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/666256/ https://lwn.net/Articles/666256/ sbakker <div class="FormattedComment"> I heartily agree. We've been burned by HW RAID failures in the past and I can tell you it's no fun. Monitoring them is also a pain, as some vendors won't allow you to see the S.M.A.R.T. stats of the underlying devices, while others will, but they all do it in a different way (and with truly horrible CLI utilities). Oh, and you have to manage and update the firmware for these things separately.<br> <p> The ability MD gives you to take the disks out of one machine and just plug them into another greatly helps with recovery from server failures.<br> <p> With MD RAID, I just monitor the S.M.A.R.T. stats on the physical disks, and as soon as I see reallocated sectors, they get replaced.<br> <p> I also tend to favour RAID10 over RAID5 or RAID6 (faster rebuilds, better write performance), but then, my storage needs are not that large, so I can afford it.<br> </div> Tue, 01 Dec 2015 07:08:59 +0000
Linux Software RAID > Hardware RAID https://lwn.net/Articles/666226/ https://lwn.net/Articles/666226/ ldo <P>I think Linux software RAID is wonderful. I have had several clients running it for many years, and I am impressed with how well it copes with disk failures. Why it’s better than hardware RAID: <UL> <LI> <B>Hardware-independent disk formats</B>. You can swap disk controllers, even move the disks to a different machine, and they will still work. You can use disks of different brands, even different sizes. This is handy when a drive fails—you can use any replacement disk that is big enough. <LI><B>Array components remain directly accessible</B>. Thus, it is easy to run periodic <TT>badblocks</TT> scans on all your disks, regardless of whether they’re part of an array or not. <LI><B>Common administration tools</B>. Once you have figured out the basics of <TT>mdadm</TT>, you can use it on any Linux distro, any hardware, any RAID configuration. And, as I mentioned, other common non-RAID-specific tools like <TT>badblocks</TT> also work fine. </UL> <P>Performance? I’ve never noticed an issue. Modern systems have CPU to burn, and RAID processing essentially disappears in the idle rounding error. Mon, 30 Nov 2015 23:07:35 +0000
A journal for MD/RAID5 https://lwn.net/Articles/666128/ https://lwn.net/Articles/666128/ wazoox <div class="FormattedComment"> <font class="QuotedText">&gt; Hardware RAID controllers in my experience generally use battery-backed DRAM, not NVRAM </font><br> <p> All current controllers use RAM, supercapacitors and flash. The supercapacitor provides just enough power to allow writing the cache to flash. <br> <p> <font class="QuotedText">&gt; If you only have one SSD, can you split it between this and bcache somehow?</font><br> <p> I suppose that by finely tuning bcache's write-back mode to send only full-stripe writes to the disks, you could render this feature mostly redundant.
Mostly.<br> </div> Mon, 30 Nov 2015 17:05:28 +0000
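The bcache tuning hinted at would be along these lines (a sketch under several assumptions: /dev/md0 is already registered as the backing device behind bcache0 with an SSD cache set attached, and bcache's stripe-aware writeback is what actually turns the gathered dirty data into full-stripe writes):
<pre>
# Cache writes on the SSD and flush them to the backing RAID later:
echo writeback > /sys/block/bcache0/bcache/cache_mode
# Cache even large sequential writes instead of bypassing the SSD (0 = no cutoff):
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
# Let a reasonable amount of dirty data accumulate before writeback kicks in:
echo 10 > /sys/block/bcache0/bcache/writeback_percent
</pre>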