LWN.net

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

The Lone Wolves weblog has a report on benchmarking Linux filesystems using software RAID. "As it turns out, my (then) limited understanding of RAID and some trouble with my 3ware RAID cards meant that I had to scale back my benchmark quite a bit. I only have two disks so I was going to test RAID 1. Chunk size is not a factor when using RAID 1 so that axis was dropped from my benchmark. Then I found out that LVM (and the size of the extents it uses) is also not a factor, so I dropped another axis. And to top it off I discovered some nasty problems with 3ware 9550 RAID cards under Linux that quickly made me give up on hardware RAID."

fallbacks for testing and development

Posted Apr 21, 2008 15:24 UTC (Mon) by Jel (subscriber, #22988) [Link]

I seem to recall trying LVM on loopback files, and being shocked to find 
out that it's not supported.

I never got why LVM doesn't allow this.  Isn't that the whole unix 
philosophy -- to treat everything as a file?  To allow for unusual 
scripting and unexpected setups?

Wouldn't it be perfect for both benchmarkers and developers to be able to 
flexibly test these things on top of a single hard drive (or even a ram 
drive) which has known performance characteristics?

I think every linux feature that uses hardware should have software or 
filesystem fallbacks.  That way IM client developers, media player 
developers could write for all sorts of webcam and TV card and genlocker 
and surround sound setups without actually having that hardware.

I need only mention the security researchers setting up honeypots, the 
admins doing deployments, etc. for people to see that there's a real 
need for this sort of flexibility.  Just think of how many times 
you've mounted a filesystem as loopback, simulating a physical drive.  
This should be possible across the whole platform, with every device.  I 
know that's a tall order, but we should recognise the principle and move 
towards it, at least.

No LVM on loopback

Posted Apr 22, 2008 1:40 UTC (Tue) by pr1268 (subscriber, #24648) [Link]

> I never got why LVM doesn't allow this. Isn't that the whole unix philosophy -- to treat everything as a file? To allow for unusual scripting and unexpected setups?

I'm curious about that, too. I thought that the whole VFS kernel subsystem was to separate all the layers of the file system to enable all sorts of different HW and FS combinations [1]. But, I do take comfort in knowing that, if it doesn't exist currently, then there's probably a good reason why it doesn't exist.

[1] I'm still amazed at how easy it was to set up a Linux kernel software RAID-0 on two IDE disks running off of separate ribbon cables attached to a Promise ATA controller card that came free with my purchase of a WDC 200GB hard drive four years ago. As unusual as I consider this setup to be, Linux gives me absolutely no grief about it (and this virtual drive hauls ass!). ;-)

No LVM on loopback

Posted Apr 23, 2008 12:55 UTC (Wed) by daniel (subscriber, #3181) [Link]

I have set out to rewrite Linux LVM (lvm3 project, a merge of device mapper capabilities with
the generic block layer) and I will certainly ensure that the result works properly on
loopback files.  Anybody have more wishes?

Daniel

No LVM on loopback

Posted Apr 23, 2008 19:55 UTC (Wed) by nix (subscriber, #2304) [Link]

Useful snapshots: i.e. >32 of them, snapshots where snapshotting the root 
filesystem works... ideally a frontend command like chroot but which 
snapshots everything below some tree, kicks up a shell in it, and zaps all 
the snapshots when the shell exits.

(Why? So I can back my whole system up via snapshots, reliably. I might 
well have >32 filesystems: I obviously want to back up / without running a 
deadlock risk: and the chroot-in-snapshot command is just obviously 
useful.)

So far LVM snapshots have enough limitations that I haven't found a single 
use for them in practice... it would be nice to have snapshots that work. 
(Of course we might all go to btrfs and get them for free!)

No LVM on loopback

Posted Apr 23, 2008 21:01 UTC (Wed) by zlynx (subscriber, #2285) [Link]

I use LVM on a Fedora Core system at work.  I made a little script to snap each of the
filesystems, including the root fs, and then mount each snap in a snapshot directory.  That
script runs just before the nightly backup job which reads off the nightly snapshot.

The script also keeps a weekly snapshot.  The two snapshots are very convenient for grabbing
quick copies of old files.

I also took advantage of the ability to assign new LVs (which a snapshot is) to particular PVs
so as to distribute the write load across several disks.  Because, when a snapshot source is
written to, copies of the old blocks have to be written into the snapshots before the old
block can be overwritten on the source.  That sped things up a lot.
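
To make the copy-before-overwrite cost concrete, here is a minimal sketch of a copy-on-write snapshot in Python. It is a toy model only, not how LVM is implemented: the `Origin` and `Snapshot` classes and their methods are hypothetical names invented for this illustration.

```python
class Snapshot:
    """Toy snapshot: stores only blocks that changed on the origin afterwards."""

    def __init__(self, origin):
        self.origin = origin
        self.saved = {}  # block index -> old contents preserved for this snapshot

    def read(self, index):
        # Unchanged blocks are still read straight from the origin.
        if index in self.saved:
            return self.saved[index]
        return self.origin.blocks[index]


class Origin:
    """Toy origin volume (hypothetical model, not an LVM API)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        """Overwrite a block; return how many extra copy writes were needed."""
        copies = 0
        for snap in self.snapshots:
            if index not in snap.saved:
                # The old block must reach the snapshot before it is overwritten.
                snap.saved[index] = self.blocks[index]
                copies += 1
        self.blocks[index] = data
        return copies


origin = Origin(["a", "b"])
nightly = origin.snapshot()
weekly = origin.snapshot()
first_write_copies = origin.write(0, "A")    # both snapshots need the old block
second_write_copies = origin.write(0, "AA")  # already preserved, no extra copies
```

In this model the first overwrite of a block costs one extra write per snapshot, which is the load that placing the snapshots on other disks would spread out.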

In a couple years of doing this I've never had that system lock up during a backup job, so the
risk of deadlock must be very low.

No LVM on loopback

Posted Apr 24, 2008 20:32 UTC (Thu) by nix (subscriber, #2304) [Link]

It is, but still... the problem is apparently that the lvm tools don't 
lock themselves in memory, and there's a window where I/O to the snapshot 
origin is disabled. If the OS needs to page in the LVM tools at that 
point, deadlock.

(And that's only the known deadlock.)

No LVM on loopback

Posted Apr 24, 2008 5:07 UTC (Thu) by muwlgr (guest, #35359) [Link]

I would wish for:
 - raid1/raid5 facilities for separate LVs, unlike the current situation,
      where you first have to assemble the whole RAID PV
      in the underlying layer with mdadm.
 - online filesystem resizing, like jfs/jfs2 in AIX LVM.
      There was an fsadm script in LVM1, but in LVM2 it is gone.

In short, the final wish really looks like ZFS/Btrfs and of course Zumastor :>

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 21, 2008 18:20 UTC (Mon) by gvy (guest, #11981) [Link]

Rather lame article as he'd do just fine with writeback cache even without BBU -- at least not
much worse than with software raid.

There's no point in buying 3ware (moreso areca) cards to use them as stupid channels...
especially when these are <8ch (there might be some sense with 16ch areca, since the CPU's
throughput then falls short by roughly a factor of two).

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 21, 2008 19:24 UTC (Mon) by SanderMarechal (guest, #51684) [Link]

> Rather lame article as he'd do just fine with writeback cache even
> without BBU -- at least not much worse than with software raid.

True. Performance was roughly equal using mdadm or using 3ware hardware RAID with write cache,
but you run the risk of losing some data on a power loss. I thought I wrote this in the
article, but apparently I didn't; I have edited it now.

My server isn't hooked up to a UPS. That, combined with the advantage of the "partition size
trick" I mentioned, made me go for software RAID, regardless of the small performance
differences there would be.

> There's no point in buying 3ware (moreso areca) cards to use them as stupid channels...

My server has one regular PCI slot, and it runs at 33 MHz. That makes for a total maximum
throughput of 133 MB/s, easily filled by my two SATA 3.0 disks. I can hook up four drives
to my card. My PCI-X slot is 64-bit/100 MHz and caps out at 800 MB/s, more than enough for
four SATA disks.
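
The two bus figures follow from simple arithmetic: theoretical peak throughput is bus width times clock rate. A quick sketch in Python, assuming a 32-bit bus for the regular PCI slot (the comment does not state the width) and one transfer per clock; the function name is made up for this example, and the results are in decimal megabytes per second.

```python
def bus_peak_mb_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak in MB/s: one bus-width transfer per clock cycle."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz


# Assumption: the plain PCI slot is 32 bits wide and clocked at 33.33 MHz.
pci_peak = bus_peak_mb_per_s(32, 33.33)

# PCI-X slot as described in the comment: 64-bit at 100 MHz.
pcix_peak = bus_peak_mb_per_s(64, 100)

print(f"PCI   32-bit/33 MHz : {pci_peak:.0f} MB/s")
print(f"PCI-X 64-bit/100 MHz: {pcix_peak:.0f} MB/s")
```

These are bus ceilings shared by every device on the bus; real transfers will come in lower because of protocol overhead.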

Yeah, it sucks spending good money on an expensive RAID card and only using it as a channel. But
it was the only thing I could find that could run in PCI-X at 100 MHz. I prefer it to not
having the 3ware card at all, so I still consider it money well spent :-)

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 21, 2008 21:08 UTC (Mon) by tialaramex (subscriber, #21167) [Link]

I had a not dissimilar experience with 4-channel 3ware hardware years ago and wasn't able to
get any performance out of it at all in a RAID configuration. Perhaps I just didn't read
enough manuals (or make enough paid support calls) to find out what I needed to do to enable
this "writeback cache". (I don't remember what settings we tried; as I said, this was years
ago, and I only had detailed notes at the time.) I also found the 3ware software to be abysmal,
and it seemed to want to wind its tentacles deep into my Linux system, which I wasn't very
comfortable with.

Like the article's writer, I opted for software RAID, using the 3ware card as a fast ATA to
SCSI controller, and found it to perform very adequately for my needs. Since then I've skipped
hardware RAID controllers altogether in my own machines and wouldn't bother even suggesting
them until they become necessary due to e.g. sheer quantity of drives connected. But we do
have Dell branded SAS RAID on some servers because someone overrode my intended specification,
and it seems to behave itself, at least it's kept out of my way so far. Of course for all I
know it's just that the monitoring doesn't work, and in fact one of the drives died this
weekend and the other will cop out tonight...

I have suffered a series of nasty failures on a PC that I eventually tracked down to a failing
IDE controller. All the contents were under Linux software RAID, and it soldiered on through a
series of scary-looking IDE errors. I still have both drives, which seem to work fine with the
dodgy controller out of the picture (it eventually refused to recognise hard disks on boot, at
which point I realised what was going on and replaced the whole board). The array was rebuilt
each time with no significant data loss; it just needed an ext3 journal replay and a few
hours of reduced disk throughput. That array is now retired because I switched to larger SATA
disks.

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 21, 2008 20:33 UTC (Mon) by maney (subscriber, #12630) [Link]

Apparently the only problem with running RAID on the 3ware was that they warned him of
potential data loss on power loss without the battery.  This is not much, if at all, mitigated
by running the card as a dumb channel, or with RAID enabled and disk caches off, since there's
still plenty of deferring of writes going on (hint: that's why Linux runs better when there's
RAM available for caching disk blocks).  Running any sort of RAID, hardware or soft, in hopes
of getting better reliability without putting the whole setup on an UPS with signalling to
allow an intentional shutdown rather than an unplanned shutoff is, IMO, pretty pointless,
power glitches being so much more common than hard drive failures.

Just sayin'

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 21, 2008 22:17 UTC (Mon) by SanderMarechal (guest, #51684) [Link]

That's why I want a UPS in my next house, when I have the space to put it all in (and a 19"
rack - but that's partly for geek points :-)

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 22, 2008 6:55 UTC (Tue) by topher (guest, #2223) [Link]

I've gotta disagree with you slightly here, with regards to RAID without UPS being pointless.

While it's true that power glitches are significantly more common than hardware failures,
they're also significantly easier to recover from.  Minor data loss due to a power outage is a
lot better than complete data loss due to the failure of a non-RAID disk.

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 22, 2008 14:59 UTC (Tue) by Thalience (subscriber, #4217) [Link]

The kernel's use of deferred writes (held in system ram) is greatly mitigated by the use of a
journaling filesystem. Such filesystems have a large amount of code dedicated to ordering the
writes so that recovery from a power failure is easy.

A volatile write cache on a raid card or individual disk breaks the filesystem's assumptions
about when its write commands are actually complete. 

The situation with individual drives is getting better, as more of them are supporting various
types of cache flush commands, or tagged writes. The cache on a typical hardware raid card is
too large to be flushing on every journal commit, but I don't see why tagged write commands
wouldn't be very useful. However, I am not aware of a card that implements them (or any other
way for the kernel to manage its cache).

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 25, 2008 10:40 UTC (Fri) by jschrod (subscriber, #1646) [Link]

While I agree in principle with your comment that one should use UPS together with RAID, there
is...

> power glitches being so much more common than hard drive failures.

Ug. Where are you living? Here in Germany, near Frankfurt, the next hard drive failure is more
probable than the next power glitch -- I had several drive failures the last two years, but no
power glitch at all.

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted May 3, 2008 19:18 UTC (Sat) by anton (guest, #25547) [Link]

The write caching by the file system is no problem for the file system and data consistency if the file system writes the data out in the right order (and file systems do that (some better and some worse), through such techniques as journaling and/or ordered writes).

The write caching by the drives or a hardware RAID controller can be a problem if it results in the writes being reordered, and drives at least do perform such reordering. There are several remedies: 1) Turn off write caching in the drive (costs a lot of performance if you don't have tagged commands). 2) Issue flush commands to the drive at appropriate moments. I don't know if the file systems do these things and do them properly. Last I looked, they did not turn off write caching, but left that to the sysadmin.
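
A toy model may help show why reordering in a volatile cache matters and what a flush at the right moment buys. The sketch below is an illustration only: `CachingDrive`, `commit`, and `is_consistent` are hypothetical names, and the two-block "journal" is a drastic simplification of what a real journaling file system writes.

```python
class CachingDrive:
    """Toy drive with a volatile write cache (hypothetical model, not a real API)."""

    def __init__(self):
        self.platter = {}  # blocks that survive a power loss
        self.cache = {}    # acknowledged writes that are not yet durable

    def write(self, block, data):
        self.cache[block] = data

    def flush(self):
        # Force everything in the cache onto the platter.
        self.platter.update(self.cache)
        self.cache.clear()

    def power_loss(self, reached_platter):
        """Drop the cache, except blocks the drive happened to write out first."""
        for block in reached_platter:
            if block in self.cache:
                self.platter[block] = self.cache[block]
        self.cache.clear()


def commit(drive, payload, use_flush):
    """Write a journal payload followed by its commit record."""
    drive.write("journal", payload)
    if use_flush:
        drive.flush()  # payload is durable before the commit record is issued
    drive.write("commit", "valid")


def is_consistent(drive):
    """A durable commit record without its payload means a corrupt journal."""
    return "commit" not in drive.platter or "journal" in drive.platter


# Without a flush the drive is free to write the commit record first.
unsafe = CachingDrive()
commit(unsafe, "new metadata", use_flush=False)
unsafe.power_loss(reached_platter=["commit"])

# With a flush between the two writes, the same reordering is harmless.
safe = CachingDrive()
commit(safe, "new metadata", use_flush=True)
safe.power_loss(reached_platter=["commit"])

print("no flush  :", "consistent" if is_consistent(unsafe) else "corrupt journal")
print("with flush:", "consistent" if is_consistent(safe) else "corrupt journal")
```

Turning the write cache off entirely corresponds to flushing after every write in this model, which is why it is the slower of the two remedies.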

How does that mix with RAID1? In case of a power failure, the drives may not contain the same data (one may have written one or more blocks that the other has not written); what the md driver does in this case is to run the RAID from one drive, and sync the other disk from it. I guess that opens a window of vulnerability while the syncing is going on, but it's probably still the safest thing to do. Having a reliable UPS with automatic shutdown would eliminate this window, but the window is typically small compared to the time when the RAID1 protects against a disk failure, so the reliable UPS may not be worth the cost. If the file system were better integrated with the RAID (ZFS?), this situation could be handled in a better way.
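
The "run from one drive and sync the other from it" step can be pictured with a very small sketch. This is not the md driver's algorithm; comparing blocks and copying only the differing ones is a simplification chosen for the illustration, and `resync` is a made-up name.

```python
def resync(source, stale):
    """Make `stale` identical to `source`; return how many blocks were rewritten."""
    rewritten = 0
    for index, block in enumerate(source):
        if stale[index] != block:
            stale[index] = block  # the source copy always wins
            rewritten += 1
    return rewritten


# After the power failure, disk_a got block 1 written but disk_b did not.
disk_a = ["b0", "b1-new", "b2"]
disk_b = ["b0", "b1-old", "b2"]

changed = resync(disk_a, disk_b)
print(f"rewrote {changed} block(s); mirrors identical: {disk_a == disk_b}")
```

Until the loop finishes, only the source disk holds a trustworthy copy, which is the window of vulnerability described above.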

Finally, in our experience UPSes are not more reliable than our power grid. We had a UPS die (and take down the machine it was supposed to serve) while the grid was ok and all other machines (including those without UPS) continued working; we eventually decided to go without UPSes. The alternative would have been to have redundant power supplies with redundant UPSes, which was not worth the effort for us.

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 21, 2008 23:19 UTC (Mon) by kev009 (guest, #43906) [Link]

He's worried about data loss from power failure but then chose to use XFS...

FWIW I get decent performance out of a 3ware 9000S with write cache enabled and no BBU.  The
system is on a UPS, backups are present, and it is just a light duty home server.

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 22, 2008 12:18 UTC (Tue) by dgc (subscriber, #6611) [Link]

> He's worried about data loss from power failure but then chose
> to use XFS...

FWIW, I think your information is out of date.

We put fixes into 2.6.17 that worked around the problem
that caused "data loss on power fail" pretty effectively,
and we fixed it properly in 2.6.22. I haven't seen a bug report
for "NULL files" on power loss or crash since 2.6.17....

If it's still happening to you on a recent kernel then please
report it. Otherwise, you need to find some other reason not
to use XFS. ;)

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 22, 2008 16:15 UTC (Tue) by bronson (subscriber, #4806) [Link]

Wow, this is big news!  I'm surprised I never heard this.

I suggest you put it at the very top of your readme or something: XFS NO LONGER CAUSES DATA
LOSS ON POWER FAIL.  REPEAT.  XFS NO LONGER CAUSES DATA LOSS ON POWER FAIL.  Because pretty
much everyone I know is still laboring under this misconception!

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted Apr 23, 2008 16:46 UTC (Wed) by alankila (subscriber, #47141) [Link]

I'm thoroughly ignorant on this issue. I thought that basically all journalling filesystems
guarantee only that there is consistency of filesystem after service interruptions.

Was the nature of XFS's empty files problem somehow more severe than you'd expect to get in
other filesystems? I mean, data only in system RAM is lost no matter what. But once it's
partially or fully in journal, it should have been preserved. Did XFS not recover file data
for situations where it could have done so?

Benchmarking Linux filesystems on software RAID 1 (Lone Wolves)

Posted May 3, 2008 18:42 UTC (Sat) by anton (guest, #25547) [Link]

The way I have heard it, XFS used to zero files that were partially written. I have been bitten by such behaviour, although with a different file system. In my case I lost an hour of work.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds