MD / DM

The Linux software RAID code (often called "MD" for "multi-device") is a longstanding feature of the kernel. RAID users appreciate its robustness, configurability, and the fact that it performs well; better performance than that achieved with hardware RAID controllers is not unheard of. In recent years, little has been heard about the MD code, however. Its feature set has changed slowly, and developments with the device mapper code have taken a higher profile. That, perhaps, is as it should be; a storage subsystem which attracts attention is rarely a good thing.

That said, MD hacker Neil Brown has been busy. His latest patch set implements RAID5 reshaping: the ability to add devices to a RAID5 array without going through a backup and restore cycle - or even shutting the array down. This is a nontrivial task; adding a drive to a RAID5 array requires redistributing data and parity blocks across the entire array. With this version of the patch, Linux MD can not only perform this task, but it can do it while still handling normal I/O to the array. The new patch also checkpoints the process, so that it can be restarted if interrupted in the middle; this corrects a minor defect in the previous version, wherein interrupting the reshaping task would cause all data in the array to be lost.
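To see why a reshape must touch nearly every block, consider the left-symmetric layout that MD uses by default: the parity block rotates one disk to the left per stripe, and the data blocks of each stripe follow it. A toy Python model (a simplified sketch with hypothetical helper names; the real MD code works on chunks and handles ongoing I/O, not single blocks) shows that growing a 3-disk array to 4 disks relocates most of the data:

```python
def raid5_layout(n_disks, n_stripes):
    """Map logical block -> (stripe, disk) for a left-symmetric RAID5 layout.

    Parity rotates one disk to the left per stripe; the data blocks of
    each stripe start just after the parity disk and wrap around.
    """
    mapping = {}
    block = 0
    for stripe in range(n_stripes):
        parity = (n_disks - 1 - stripe) % n_disks
        for i in range(n_disks - 1):          # n_disks - 1 data blocks per stripe
            mapping[block] = (stripe, (parity + 1 + i) % n_disks)
            block += 1
    return mapping

# The same 8 data blocks: 4 stripes on 3 disks become 3 stripes on 4 disks
old = raid5_layout(3, 4)
new = raid5_layout(4, 3)
moved = sum(old[b] != new[b] for b in old)
print(f"{moved} of {len(old)} blocks must move")  # 6 of 8
```

Even in this tiny example, three quarters of the blocks land on a different (stripe, disk) position, which is why the reshape has to sweep the whole array while carefully never overwriting data it has not yet relocated.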

Neil notes that things could still go wrong:

There is still a small window ( < 1 second) at the start of the reshape during which a crash will cause unrecoverable corruption. My plan is to resolve this in mdadm rather than md. The critical data will be copied into the new drive(s) prior to commencing the reshape. If there is a crash the kernel will refuse to reassemble the array. mdadm will be able to re-assemble it by first restoring the critical data and then letting the remainder of the reshape run its course.

Neil has various other enhancements in mind, including the ability to upgrade a RAID5 array to RAID6 (which increases fault tolerance by adding another set of parity blocks). Quite a bit, clearly, is happening in the MD world.
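That second set of parity blocks is what gives RAID6 its extra fault tolerance: alongside the XOR parity P, a second syndrome Q is computed over the Galois field GF(2^8), so any two missing blocks per stripe can be reconstructed. A rough sketch of the arithmetic follows (the field polynomial and generator match the RAID6 scheme used by Linux, but this is an illustrative toy, not MD's heavily optimized code):

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) with the RAID6 polynomial x^8 + x^4 + x^3 + x^2 + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
        b >>= 1
    return p

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

data = [0xDE, 0xAD, 0xBE]             # one byte per data disk
p = 0
q = 0
for i, d in enumerate(data):
    p ^= d                            # P: plain XOR, exactly as in RAID5
    q ^= gf_mul(gf_pow(2, i), d)      # Q: each disk weighted by a generator power

# Recover data disk 1 using Q alone (P would cover a different single failure)
partial = q
for i, d in enumerate(data):
    if i != 1:
        partial ^= gf_mul(gf_pow(2, i), d)
recovered = gf_mul(partial, gf_pow(gf_pow(2, 1), 254))  # x^254 == x^-1 in GF(2^8)
print(hex(recovered))  # 0xad
```

Because P and Q are independent equations in the data bytes, losing any two of the blocks in a stripe still leaves a solvable system, which is the property an in-place RAID5-to-RAID6 upgrade would buy.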

All this activity drew queries from a couple of observers who had, it seems, assumed that the addition of the device mapper to the kernel meant that the MD code would eventually wither away. The device mapper can handle some of the lower RAID levels (mirroring and striping) now, and there is work in progress to add RAID5 support. Since the device mapper is a general framework for mixing and matching drives, it makes sense to some that the RAID functionality should move there too.

Unsurprisingly, Neil disagrees. His suggestion is that "anything with redundancy," including RAID5 and RAID6, is best handled in the MD code. The device mapper, instead, is good for fancier arrangements like multipath, encryption, volume management, snapshots, etc. Certainly, those who are placing trust in RAID for redundancy should be comforted by the rather longer track record built up by the MD code. MD is also said to be faster than the device mapper at this time.

As others have pointed out, however, there is a cost to carrying multiple RAID implementations in the kernel. Each must be maintained, and each will have its own unique bugs to contribute to the whole. So, as the device mapper develops higher-level RAID capabilities, it would be nice if some of the core code could be shared between MD and DM. Making that happen, however, will require developer effort - and it's not clear that any hackers are interested in doing that work at this time.



Neil Brown

Posted Jan 26, 2006 11:05 UTC (Thu) by samj (guest, #7135) [Link]

Neil is an absolute Unix wizard - I worked with him about 10 years ago and was regularly impressed by his work on internal projects (which unfortunately for the most part appear to have remained internal). It's great to see his work on MD enjoying a wider audience and I for one would be a lot more likely to trust code he's written; for example notice that he's identified and is proactively fixing a potential issue *before* it eats users' data rather than afterwards as is usually the case. Let's not forget about his work on NFS too... last I checked (which was years ago - I don't use NFS much these days) he was doing some work on scalability and performance issues as well as an authentication layer. Interesting stuff...

MD / DM

Posted Jan 26, 2006 11:13 UTC (Thu) by gypsumfantastic (guest, #31134) [Link]

"this corrects a minor defect in the previous version, wherein interrupting the reshaping task would cause all data in the array to be lost."

This must be some new meaning of the word 'minor' I've never encountered before.

MD "minor" defects.

Posted Jan 26, 2006 11:39 UTC (Thu) by Duncan (guest, #6647) [Link]

LOL. That was my reaction as well. If the original comment didn't have
quotation marks around "minor", it should have.

Duncan

MD "minor" defects.

Posted Jan 26, 2006 13:41 UTC (Thu) by ewan (subscriber, #5533) [Link]

I read it as humorous, but it is possible for the original defect to have had big consequences while still actually being, in itself, a small defect (like an off-by-one error, say).

MD "minor" defects.

Posted Jan 26, 2006 20:39 UTC (Thu) by guinan (subscriber, #4644) [Link]

Classic Corbet understatement. Cracks me up every time...

Actually, I can see how minor it is...

Posted Jan 26, 2006 13:43 UTC (Thu) by hummassa (subscriber, #307) [Link]

Reshaping is a risky operation. I expect people to make a full backup before doing it (just as they would have for the backup/reshape/restore cycle of the old days). Then reshape, and if something goes wrong, restore. The fact that 99.9% of the time they won't need to restore is a bonus IMHO.

Actually, I can see how minor it is...

Posted Jan 26, 2006 23:40 UTC (Thu) by neilbrown (subscriber, #359) [Link]

I would expect people to have adequate backups, or equivalent disaster recovery, whether they reshape an array or not. RAID can improve reliability, but never make the data indestructible.

Further, once the code is finished, fully reviewed and fully tested (which is still a little way off), I don't see that your data would be any less safe during a reshape than it is during a resync.

"minor"

Posted Jan 28, 2006 1:48 UTC (Sat) by pimlott (guest, #1535) [Link]

After recent mine accidents, we in the US use the word "miner" to mean "disregarding safety, leaving prone to catastrophe". Jon was making a pun on that.

"minor"

Posted Jan 28, 2006 14:07 UTC (Sat) by bronson (subscriber, #4806) [Link]

Possibly the most insensitive thing I've seen posted to LWN.

"minor"

Posted Jan 28, 2006 17:10 UTC (Sat) by pimlott (guest, #1535) [Link]

Woah, sorry, I didn't think anyone would take it that way.

"minor"

Posted Jan 28, 2006 16:26 UTC (Sat) by corbet (editor, #1) [Link]

Suffice to say that's not what I was doing at all.

Recent MD enhancements

Posted Jan 26, 2006 14:18 UTC (Thu) by brugolsky (subscriber, #28) [Link]

Jon, there are lots of other important MD patches that have gone in recently that probably deserve an article of their own. In particular, the capability to rewrite a stripe when a read error occurs means that MD can often recover from an error, rather than kick the drive out of the array. One can also proactively do a background scan, like many hardware RAID controllers. Additionally, bitmap-based intent logging allows for faster resyncs when required. Given the nature of today's huge drives, these changes greatly increase the utility of MD, as resyncing a 500GB drive on a busy server can take days if it is only resyncing at, say, 5MB/s, and runs the risk of exposing a latent error on another drive.

These and other recent changes have brought MD robustness and usability much closer to that offered by expensive hardware RAID implementations, while maintaining all of the flexibility, transparency, and performance that has long been the hallmark of Linux MD.
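The intent-bitmap idea mentioned above is simple: before a write reaches the member disks, a bit covering that region of the array is set in a small bitmap; bits are cleared lazily once the write is known to be consistent on all members. After a crash, only the flagged regions need resyncing. A toy model of the bookkeeping (hypothetical class and method names, not mdadm's actual interface):

```python
class IntentBitmap:
    """Toy write-intent bitmap: track dirty chunks so a post-crash
    resync can skip every region that had no write in flight."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.dirty = set()

    def before_write(self, offset, length):
        # Mark every chunk the write touches, *before* the write starts.
        first = offset // self.chunk_size
        last = (offset + length - 1) // self.chunk_size
        self.dirty.update(range(first, last + 1))

    def writes_settled(self):
        # In MD this clearing happens lazily, once all members agree.
        self.dirty.clear()

    def chunks_to_resync(self):
        return sorted(self.dirty)

bm = IntentBitmap(chunk_size=4 * 1024 * 1024)          # 4 MiB chunks
bm.before_write(offset=10 * 1024 * 1024, length=8192)  # one in-flight write
print(bm.chunks_to_resync())  # [2]
```

After a crash, a resync walks only the listed chunks instead of the whole device, which is the difference between seconds and days on a 500GB drive.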

Recent MD enhancements

Posted Jan 26, 2006 15:39 UTC (Thu) by hmh (subscriber, #3838) [Link]

Agreed. A full article on Linux MD which describes _all_ current and soon-upcoming capabilities would be quite welcome IMHO.

Recent MD enhancements

Posted Jan 27, 2006 15:36 UTC (Fri) by knobunc (subscriber, #4678) [Link]

I would love to see that too. I am a grateful (and heavy) user of the Linux MD stuff and would be very interested in learning about the new features, and tips and pitfalls associated with them.

-ben

another bonus

Posted Feb 2, 2006 21:27 UTC (Thu) by niner (subscriber, #26151) [Link]

Another bonus would be to have some comparison benchmarks in such an article against hardware RAID controllers. While I love the advantages of a Linux software RAID, I'd really like to know how much they cost, performance-wise, compared to hardware solutions.

snapshots?

Posted Jan 26, 2006 17:31 UTC (Thu) by dann (guest, #11621) [Link]

It would be nice if snapshot functionality like WAFL's were available. It's extremely useful to be able to cd ~/.snapshot and access the filesystem state from a few hours or days ago, and to do this without having to play with different partitions...

snapshots?

Posted Jan 26, 2006 19:14 UTC (Thu) by xav (guest, #18536) [Link]

That has nothing to do with DM. That's a filesystem issue.

snapshots?

Posted Jan 26, 2006 20:14 UTC (Thu) by nix (subscriber, #2304) [Link]

If you don't mind storing your filesystem in PostgreSQL and accessing it via FUSE, I'm on it. File-by-file and directory-by-directory rollback-and-forward, with branching (roll back and write). A `commit' is done on every open for writing; versioning is done both by filename and by inode number, so editors that write files by unlink()-and-rename() are covered.

(The overhead is necessarily considerable, although access to data at branch tips, which should be most of the accesses, is still O(1).)

I'll admit it's mostly for the fun of it... I should have a trac up with design docs and a public svn repo in a week or so (hardware replacement at this end first so I've got the disk space to play with things like this!) and be working on the actual code.

snapshots?

Posted Jan 27, 2006 0:16 UTC (Fri) by drag (subscriber, #31333) [Link]

There are those "log-structured" filesystems. A couple are currently in the works for Linux: one from a telecom company in Japan and another that made it into the Google Summer of Code.

http://logfs.sourceforge.net/
http://www.nilfs.org/

They write like a log, where you start at the beginning of the disk and just walk down the drive, never overwriting old data or zeroing anything out.

You get undelete features, the ability to mount a snapshot of the filesystem at any point in its history while the real volume is still online, and access to a file at any time during its history. That sort of thing. It also has other advantages like very fast write speeds and robustness against losing data, even in the face of filesystem corruption (if stuff gets added to the end of a file, just roll back the changes until you get to good data).

Of course it's got problems: intense filesystem fragmentation and difficulty figuring out the best way to reclaim and reuse disk space. It wouldn't be good for general-purpose stuff.
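The never-overwrite property is what makes snapshots nearly free here: a snapshot is just a position in the log, and a read "as of" that position simply ignores everything appended later. A minimal sketch of the idea (a hypothetical in-memory toy; real log-structured filesystems like the ones above manage segments on disk):

```python
class ToyLogFS:
    """Append-only store: every write adds a record, nothing is ever
    overwritten, and any log position doubles as a snapshot handle."""

    def __init__(self):
        self.log = []                        # (name, data) records, in order

    def write(self, name, data):
        self.log.append((name, data))
        return len(self.log)                 # snapshot handle: current length

    def read(self, name, snapshot=None):
        # Scan backwards from the snapshot point for the newest record.
        end = len(self.log) if snapshot is None else snapshot
        for rec_name, data in reversed(self.log[:end]):
            if rec_name == name:
                return data
        return None                          # never existed (or not yet)

fs = ToyLogFS()
fs.write("notes.txt", "first draft")
snap = fs.write("notes.txt", "second draft")
fs.write("notes.txt", "third draft")
print(fs.read("notes.txt"))        # third draft
print(fs.read("notes.txt", snap))  # second draft
```

The fragmentation and space-reclamation headaches mentioned above fall out of the same design: old records pile up until something garbage-collects them, and deciding which ones are safe to drop is the hard part.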

snapshots?

Posted Jan 28, 2006 14:17 UTC (Sat) by bronson (subscriber, #4806) [Link]

This actually sounds darned useful for /home. Right now I have tons of files scattered all over the place that I'm afraid to delete because, who knows, I might need one or two again in the future. With this filesystem I could just go back in time and fetch a file in the rare event that I actually need it again. I'd keep my home dir a lot cleaner.

When the disk is full, it could be put in a maintenance mode where everything is copied as low as possible. This blows away your history, of course, but it clears up fragmentation and recovers unused space.

So... does it work in practice?

snapshots?

Posted Jan 30, 2006 2:55 UTC (Mon) by nix (subscriber, #2304) [Link]

Well, for what it's worth, time-based expiry was designed into Recant from the start. Yes, the algorithm is rather fiddly and expensive; definitely a job to be done by a background thread in times of disk idleness only.

Log-structured filesystems are one of those things that seem terribly neat at the start --- Recant was originally going to be a log-structured FS --- but I spent some time trying to figure out a way to expire them without doing a massive pass over the entire disk and vast memory consumption and never thought of a way. Hence I'm trying something implemented completely differently.

You also can't go back in time on any scale smaller than the entire filesystem with a log-structured FS, which makes it all of marginal use. Recant lets you go backwards on a file-by-file and tree-by-tree basis (with obvious oddities if you have some files in that tree hardlinked to places outside that tree).

However log-structured filesystems are very *efficient* at both reading and writing, fragmentation excepted, and require essentially no maintenance --- until they fill up. But when they fill up, you're in real trouble.

(Now if my hardware would just stop failing I might be able to get some more work done on it. One dead motherboard, one dead network card and one dead disk this weekend alone. *sigh*)

--- oh, and doing complete backups of any filesystem with historical state is a bit of a sod, too. I have some ideas on that point, and oddly after this weekend's disk failures the backup stuff has suddenly started mattering a lot more to me...

snapshots?

Posted Feb 2, 2006 16:06 UTC (Thu) by anton (guest, #25547) [Link]

Log-structured filesystems are one of those things that seem terribly neat at the start --- Recant was originally going to be a log-structured FS --- but I spent some time trying to figure out a way to expire them without doing a massive pass over the entire disk and vast memory consumption and never thought of a way.

Well, I also failed to see a good way for combining the segments and garbage collection ideas of the original LFS proposals with snapshots and clones; moreover, the speed disadvantages of using a free-blocks approach seem to have been mostly eliminated by clustering and delayed writing.

But other ideas and properties of log-structured file systems still seem to be worthwhile to me, in particular the possibility of having decent data consistency guarantees and snapshots. So my thoughts have turned to implementing LFSs with mostly conventional free-blocks management, and I have written up these ideas.

snapshots?

Posted Apr 12, 2006 19:23 UTC (Wed) by treed (subscriber, #11432) [Link]

Sounds like ZODB. ZODB is not a general purpose Unix filesystem but it has many of the qualities you mention.

http://www.zope.org/Products/StandaloneZODB

snapshots?

Posted Jan 26, 2006 22:26 UTC (Thu) by dann (guest, #11621) [Link]

Only if you consider that filesystems and dm/md have to be forever separated.
ZFS has shown that it can be done otherwise. Food for thought...

MD / DM mirroring

Posted Feb 3, 2006 8:09 UTC (Fri) by feyd (guest, #26860) [Link]

How can I create a DM mirrored array? When I create one in evmsgui it is in fact MD RAID, not DM.

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds