LWN: Comments on "Ext3 and RAID: silent data killers?" https://lwn.net/Articles/349970/ This is a special feed containing comments posted to the individual LWN article titled "Ext3 and RAID: silent data killers?". Ext3 and RAID: silent data killers? https://lwn.net/Articles/352197/ https://lwn.net/Articles/352197/ jengelh <div class="FormattedComment"> You can only remount ro when there are no files open in write mode. And there usually are on /.<br> </div> Fri, 11 Sep 2009 17:59:29 +0000 Ext3 and write caching by drives are the data killers... https://lwn.net/Articles/352177/ https://lwn.net/Articles/352177/ nix <div class="FormattedComment"> Note that you won't get a whole block full of garbage: ext3 will generally <br> notice that 'hey, this doesn't look like a journal' once the record that <br> spanned the block boundary is complete. But that's a bit late...<br> <p> (this is all supposition from postmortems of shagged systems. Thankfully <br> we no longer use hardware prone to this!)<br> <p> </div> Fri, 11 Sep 2009 16:33:55 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/352093/ https://lwn.net/Articles/352093/ Pc5Y9sbv <div class="FormattedComment"> I agree you cannot blindly use RAID5 without considering the sizing, but what do you consider an acceptable recovery time?<br> <p> My cheap MD RAID5 with three 500 GB SATA drives gives me 1 TB and approximately 100 MB/s per-drive throughput, which implies that a full scan to re-add a replacement drive might take 2 hours or so (reading all 500 GB from two drives and writing 500 GB to the third at 75% of full speed). I have never been in a position where this I/O window was a worrisome double-fault hazard. Having a commodity box running degraded for several days until replacement parts are delivered is a more common consumer-level concern, and that has not changed with drive sizes.<br> <p> </div> Fri, 11 Sep 2009 01:18:46 +0000 Journaling no protection against power drop https://lwn.net/Articles/352064/ https://lwn.net/Articles/352064/ anton It depends on where you live. Here power outages are quite infrequent, but mostly last so long that the UPS will run out of power. So the UPS only gives the opportunity for a clean shutdown (and that opportunity was never realized by our sysadmin when we had UPSs), and even that is unnecessary if you have all of the following: decent drives that complete the last sector on power failure; a good file system; and a setup that gives the file system what it needs to stay consistent (e.g., barriers or hdparm -W0). And of course we have backups around if the worst comes to the worst. And while we don't have ultimate trust in ext3 and the hard drives we use, we have not yet needed the backups for that. Thu, 10 Sep 2009 21:34:42 +0000
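As a concrete illustration of the two options the comment above mentions ("barriers or hdparm -W0"), here is a minimal sketch; the device name and the fstab line are examples rather than a recommendation for any particular system:
<pre>
# Option 1: turn off the drive's volatile write-back cache entirely.
# On many drives this setting does not survive a power cycle, so some
# distros provide /etc/hdparm.conf or an init script to reapply it at boot.
hdparm -W0 /dev/sda

# Option 2: keep the write cache but ask ext3 to issue write barriers,
# so journal commits are ordered against the drive's cached data.
mount -o remount,barrier=1 /
# or persistently, via /etc/fstab:
#   /dev/sda1   /   ext3   defaults,barrier=1   0   1
</pre>
As anton notes, the barrier approach should cost less in performance than disabling the cache outright, provided the whole stack (filesystem, device mapper, drive) actually honours the barrier requests.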
Ext3 and write caching by drives are the data killers... https://lwn.net/Articles/352054/ https://lwn.net/Articles/352054/ Cato <div class="FormattedComment"> Actually the comment about a single incorrect block in a journal 'spraying garbage' over the disk is here: <a href="http://lwn.net/Articles/284313/">http://lwn.net/Articles/284313/</a><br> </div> Thu, 10 Sep 2009 21:05:26 +0000 Journaling no protection against power drop https://lwn.net/Articles/352047/ https://lwn.net/Articles/352047/ Cato <div class="FormattedComment"> Great to see your testing tool; I will try it out on a few spare hard drives to see what happens.<br> <p> UPSs are useful at least to regulate the voltage and cover against momentary power cuts, which are very frequent where I live, and far more frequent than UPS failures in my experience.<br> </div> Thu, 10 Sep 2009 20:58:31 +0000 Journaling no protection against power drop https://lwn.net/Articles/352045/ https://lwn.net/Articles/352045/ Cato <div class="FormattedComment"> And of course RAID has its own issues, with RAID 6 really being required with today's large disks to protect against the separate-failure-during-rebuild case.<br> <p> Without RAID, the operating system will have no idea the sector is corrupt - this is why I like ZFS's block checksumming, as you can get a list of files with corrupt blocks in order to restore from backup.<br> </div> Thu, 10 Sep 2009 20:52:23 +0000 Ext3 and write caching by drives are the data killers... https://lwn.net/Articles/352042/ https://lwn.net/Articles/352042/ Cato <div class="FormattedComment"> One interesting scenario, mentioned I think elsewhere in the comments to this article: a single 'misplaced write' (i.e. the disk doesn't seek to the new position and writes to the old position instead) means that a data block goes into the ext3 journal. <br> <p> In the absence of ext3 journal checksumming, and if there is a crash requiring replay of this journal block, horrible things will happen - presumably garbage is written to various places on disk from the 'journal' entry. One symptom may be log entries saying 'write beyond end of partition', which I've seen a few times with ext3 corruption and which I think is a clear indicator of corrupt filesystem metadata.<br> <p> This is one reason why JBD2 added journal checksumming for use with ext4 - I hope this also gets used by ext3. In my view, it would be a lot better to make that change to ext3 than to make data=writeback the default, which will speed up some workloads and most likely corrupt some additional data (though I guess not metadata).<br> <p> </div> Thu, 10 Sep 2009 20:50:22 +0000 Journaling no protection against power drop https://lwn.net/Articles/351871/ https://lwn.net/Articles/351871/ hensema <div class="FormattedComment"> <font class="QuotedText">&gt; They say they happily stop writing halfway in the middle of a sector, and respond to power drop only by parking the head.</font><br> <p> Which is no problem. The CRC for the sector will be incorrect, which will be reported to the host adapter. The host adapter will then reconstruct the data and write back the correct sector.<br> <p> Of course you do need RAID for this.<br> </div> Thu, 10 Sep 2009 09:00:12 +0000 Ext3 and write caching by drives are the data killers...
https://lwn.net/Articles/351744/ https://lwn.net/Articles/351744/ BackSeat <i>It's acknowledged that ext3's lack of journal checksumming can cause corruption</i><p>It may only be semantics, but it's unlikely that the lack of journal checksumming <i>causes</i> corruption, although it may make corruption difficult to detect.<p>As for LVM, I've never seen the point. Just another layer of ones and zeros between the data and the processor. I never use it, and I'm very surprised some distros seem to use it by default. Wed, 09 Sep 2009 20:35:23 +0000 Journaling no protection against power drop https://lwn.net/Articles/351521/ https://lwn.net/Articles/351521/ anton <blockquote> [Engineers at drive manufacturers] say they happily stop writing halfway in the middle of a sector, and respond to power drop only by parking the head. </blockquote> The results from <a href="http://www.complang.tuwien.ac.at/anton/hdtest/">my experiments on cutting power on disk drives</a> are consistent with the theory that the drives I tested complete the sector they are writing when the power goes away. However, I have seen drives that corrupt sectors under unusual power conditions; the manufacturers of these drives (IBM, Maxtor) and their successors (Hitachi) went onto my don't-buy list and are still there. <blockquote>Some drives only report blocks written to the platter after they really have been, but that's bad for benchmarks, so most drives fake it, particularly when they detect benchmark-like behavior.</blockquote> Write-back caching (reporting completion before the data hits the platter) is normally enabled in PATA and also SATA drives (whether running benchmarks or not), because without tagged commands (mostly absent in PATA, and not universally supported for SATA) performance is otherwise very bad. You can disable it with <code>hdparm -W0</code>. Or you can ask for barriers (e.g., as an ext3 mount option), which should give the same consistency guarantees at lower cost if the file system is implemented properly; however, my trust in the proper implementation in Linux is severely undermined by the statements that some prominent kernel developers have made in recent months on file systems. <blockquote> Everyone serious about reliability uses battery backup</blockquote> Do you mean a UPS? So how does that help when the UPS fails? Yes, we have had that happen (while mains power was alive), and we concluded that our power grid is just as reliable as a UPS. One could protect against a failing UPS with dual (redundant) power supplies and dual UPSs, but that would probably double the cost of our servers. A better option would be to have an OS that sets up the hardware for good reliability (i.e., disables write caching if necessary) and works hard to ensure data and metadata consistency. Unfortunately, it seems that that OS is not Linux. Tue, 08 Sep 2009 20:54:52 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/351360/ https://lwn.net/Articles/351360/ giraffedata <blockquote> I don't know what 'high end storage servers' you are talking about; even the multi-million dollar arrays from EMC and IBM do not have the characteristics that you are claiming. </blockquote> <p> Now that you mention it, I do remember that earlier IBM Sharks had nonvolatile storage based on a battery. Current ones don't, though. The battery's only job is to allow the machine to dump critical memory contents to disk drives after a loss of external power. I think that's the trend, but I haven't kept up on what EMC, Hitachi, etc. are doing.
IBM's other high end storage server, the former XIV Nextra, is the same. Tue, 08 Sep 2009 06:25:07 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/351357/ https://lwn.net/Articles/351357/ dlang <div class="FormattedComment"> I don't know what 'high end storage servers' you are talking about; even the multi-million dollar arrays from EMC and IBM do not have the characteristics that you are claiming.<br> </div> Tue, 08 Sep 2009 04:56:42 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/351318/ https://lwn.net/Articles/351318/ giraffedata <blockquote> Ah, I see, the point is that even if you turn off the power *and pull the disk* halfway through a write, the disk state is still consistent? Yeah, battery-backed cache alone obviously can't ensure that. </blockquote> <p> No one said anything about pulling a disk. I did mention pulling a power cord. I meant the power cord that supplies the RAID enclosure (storage server). <p> A RAID enclosure with a battery inside that powers only the memory can keep the data consistent in the face of a power cord pull, but it fails the persistence test, because the battery eventually dies. I think when people think persistent, they think indefinite. High end storage servers do in fact let you pull the power cord and not plug it in again for years and still be able to read back all the data that was completely written to the server before the pull. Some do it by powering disk drives (not necessarily the ones that normally hold the data) for a few seconds on battery. <p> Also, I think some people expect of persistence that you can take the machine, once powered down, apart and put it back together and the data will still be there. Battery-backed memory probably fails that test. Mon, 07 Sep 2009 23:18:40 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/351316/ https://lwn.net/Articles/351316/ nix <div class="FormattedComment"> Ah, I see, the point is that even if you turn off the power *and pull the <br> disk* halfway through a write, the disk state is still consistent? Yeah, <br> battery-backed cache alone obviously can't ensure that.<br> <p> </div> Mon, 07 Sep 2009 22:47:06 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350938/ https://lwn.net/Articles/350938/ dlang <div class="FormattedComment"> actually, there are a LOT of enclosures that don't provide battery backup for the drives at all, just for the cache.<br> <p> it's possible that they have heavy duty power supplies that keep power up for a fraction of a second after the power-fail signal goes out to the drives, but they definitely do not keep the drives spinning long enough to flush their caches<br> </div> Sat, 05 Sep 2009 00:31:46 +0000 Journaling no protection against power drop https://lwn.net/Articles/350934/ https://lwn.net/Articles/350934/ giraffedata <blockquote> If you provide a few seconds' battery backup for the drive but not the host, then the blocks in the buffer that the drive said were on the disk get a chance to actually get there. </blockquote> <p> But then you also get the garbage that the host writes in its death throes (e.g. an update of a random sector) while the drive is still up. <p> To really solve the problem, you need much more sophisticated shutdown sequencing. Sat, 05 Sep 2009 00:10:59 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350933/ https://lwn.net/Articles/350933/ giraffedata <p>The battery in a RAID adapter card only barely addresses the issue, I don't care how long it lives.
<p>But the comment also addressed "RAID enclosures," which I take to mean storage servers that use RAID technology. Those, if they are at all serious, have batteries that power the disk drives as well, and only for a few seconds -- long enough to finish the write. It's not about backup power, it's about a system in which data is always consistent and persistent, even if someone pulled a power cord at some point. Sat, 05 Sep 2009 00:01:27 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350932/ https://lwn.net/Articles/350932/ giraffedata <blockquote> (and how will applications respond to their disks becoming read-only suddenly) </blockquote> <p> I think this would be only marginally better than shutting the whole system down. In many ways it would be worse, since users have ways to deal with a missing system but not with a system acting strangely. Fri, 04 Sep 2009 23:39:29 +0000 partially degraded raid 6 _IS_ vulnerable to partial writes on power failure https://lwn.net/Articles/350931/ https://lwn.net/Articles/350931/ giraffedata <p> But you don't have to build a RAID6 array that way. The ones I've looked at use a journaling scheme to provide atomic parity updates. No matter where you get interrupted in the middle of updating a stripe, you can always get back the pre-update parity-consistent stripe (minus whatever 1 or 2 components might have died at the same time). <p> I suspect Linux 'md' doesn't have the resources to do this feasibly, but a SCSI RAID6 unit probably would. I don't expect there's much market for the additional component-loss protection of RAID6 without getting the interrupted-write protection too. Fri, 04 Sep 2009 23:32:37 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350816/ https://lwn.net/Articles/350816/ nix <div class="FormattedComment"> 'Terrible performance' is in the eye of the beholder. So is reliability. Software RAID is constrained by bus bandwidth, so RAID 10 writes may well be slower than RAID 5 if you're bus-limited: and even RAID 5 writes are no slower than writes to a single drive. TBH, 89MB/s writes and 250MB/s reads (which my Areca card can manage with a four-drive RAID 5 array) don't seem too 'terrible' to me.<br> <p> Furthermore, reliability is fine *if* you can be sure that once RAID parity computations have happened the stripe will always hit the disk, even if there is a power failure. With battery-backed RAID, this is going to be true (modulo RAID controller card failure or a failure of the drive you're writing to). Obviously if the array is sufficiently degraded reliability isn't going to be good anymore, but doesn't everyone know that?<br> <p> </div> Fri, 04 Sep 2009 10:38:28 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350813/ https://lwn.net/Articles/350813/ nix <div class="FormattedComment"> Yes indeed. My Areca RAID card claims a month's battery life. Thankfully I've never had cause to check this, but I guess residents of Auckland in the 1998 power-generation fiasco would have liked it. :)<br> <p> </div> Fri, 04 Sep 2009 10:32:25 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350797/ https://lwn.net/Articles/350797/ njs <div class="FormattedComment"> mdadm has a hook to let you run an arbitrary script when a RAID device changes state (degrades, etc.); I don't have a remount -o ro script handy myself, though.<br> </div> Fri, 04 Sep 2009 08:22:18 +0000
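The mdadm hook mentioned above is the PROGRAM line in mdadm.conf (or mdadm --monitor --program), which is invoked with the event name, the md device, and sometimes a component device as arguments. Since the commenter says they have no remount script handy, here is a rough, untested sketch of what one might look like; the script path is made up, and it assumes the filesystems sit directly on the md device rather than on LVM on top of it:
<pre>
# /etc/mdadm/mdadm.conf:
#   PROGRAM /usr/local/sbin/md-degraded-ro

#!/bin/sh
# Called by "mdadm --monitor" as: md-degraded-ro EVENT MD_DEVICE [COMPONENT]
EVENT="$1"
MD_DEV="$2"

case "$EVENT" in
  Fail|DegradedArray)
    # Remount every filesystem on this md device read-only. As noted in an
    # earlier comment, remounting / read-only will fail while files are
    # open for writing, so this is best-effort only.
    awk -v dev="$MD_DEV" '$1 == dev { print $2 }' /proc/mounts |
    while read -r mnt; do
      mount -o remount,ro "$mnt" || logger "md-degraded-ro: could not remount $mnt read-only"
    done
    logger "md-degraded-ro: $EVENT on $MD_DEV"
    ;;
esac
</pre>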
Ext3 and write caching by drives are the data killers... https://lwn.net/Articles/350738/ https://lwn.net/Articles/350738/ landley <div class="FormattedComment"> The paragraph starting "But what if more than one drive fails?" is misleading. You don't need another drive to fail to experience this problem; all you need is an unclean shutdown of an array that's both dirty and degraded. (Two words: "atime updates".) The second failure can be entirely a _software_ problem (which can invalidate stripes on other drives without changing them, because the parity information needed to use them is gone). Software problem != hardware problem; it's not the same _kind_ of failure.<br> <p> People expect RAID to protect against permanent hardware failures, and people expect journaling to protect against data loss from transient failures which may be entirely due to software (a la kernel panic, hang, watchdog, heartbeat...). In the first kind of failure you need to replace a component; in the second kind of failure the hardware is still good as new afterwards. (Heck, you can experience unclean shutdowns if your load balancer kills xen shares impolitely. There are _lots_ of ways to do this. I've had shutdown scripts hang failing to umount a network share, leaving The Button as the only option.)<br> <p> Another problematic paragraph starts with "RAID arrays can increase data reliability, but an array which is not running with its full complement of working, populated drives has lost the redundancy which provides that reliability."<br> <p> That's misleading because redundancy isn't what provides this reliability, at least in other contexts. When you lose the redundancy, you open yourself to an _unrelated_ issue of update granularity/atomicity. A single disk doesn't have this "incomplete writes can cause collateral damage to unrelated data" problem. (It might start to if the physical block size grows bigger than the filesystem sector size, but even 4k shouldn't do that on a properly aligned modern ext3 filesystem.) Nor does RAID 0 have an update granularity issue, and that has no redundancy in the first place.<br> <p> I.e. a degraded RAID 5/6 that has to reconstruct data using parity information from multiple stripes that can't be updated atomically is _more_ vulnerable to data loss from interrupted writes than RAID 0 is, and the data loss is of a "collateral damage" form that journaling silently fails to detect. This issue is subtle, and fiddly, and although people keep complaining that it's not worth documenting because everybody should already know it, the people trying to explain it keep either getting it _wrong_ or glossing over important points.<br> <p> Another point that was sort of glossed over is that journaling isn't exactly a culprit here; it's an accessory at best. This is a block device problem which would still cause data loss on a non-journaled filesystem, and it's a kind of data loss that a fsck won't necessarily detect. (Properly allocated file data elsewhere on the disk, which the filesystem may not have attempted to write to in years, may be collateral damage. And since you have no warning, you could rsync the damaged stuff over your backups if you don't notice.)
If it's seldom-used data it may be long gone before you notice.<br> <p> The problem is that journaling gives you a false sense of security, since it doesn't protect against these issues (which exist at the block device level, not the filesystem level), and can hide even the subset of problems fsck would detect, by reporting a successful journal replay when the metadata still contains collateral damage in areas the journal hadn't logged any recent changes to. <br> <p> I look forward to btrfs checksumming all extents. That should at least make this stuff easier to detect, so you can know when you _haven't_ experienced this problem.<br> <p> Rob<br> </div> Thu, 03 Sep 2009 23:18:48 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350722/ https://lwn.net/Articles/350722/ Cato <div class="FormattedComment"> When disks are remounted read-only, there is no message to the end user (at least on Ubuntu) - as a result, one PC where this happened had a read-only root filesystem for weeks until I noticed it from the logs.<br> <p> Sending notifications of serious system events like this would be very helpful, with a bundle of standard event filters that can easily generate an email or other alert.<br> </div> Thu, 03 Sep 2009 21:42:01 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350645/ https://lwn.net/Articles/350645/ yohahn <div class="FormattedComment"> Right, but what standard method is there to trigger this when a raid goes degraded? (and how will applications respond to their disks becoming read-only suddenly)<br> </div> Thu, 03 Sep 2009 15:37:55 +0000 LVM barriers https://lwn.net/Articles/350554/ https://lwn.net/Articles/350554/ Cato <div class="FormattedComment"> There are some limitations on this - LVM barriers will apparently only work with a linear target: <a href="http://lwn.net/Articles/326597/">http://lwn.net/Articles/326597/</a><br> </div> Thu, 03 Sep 2009 10:23:43 +0000 Ext3 and write caching by drives are the data killers... https://lwn.net/Articles/350533/ https://lwn.net/Articles/350533/ Cato <div class="FormattedComment"> True, but it would be good if there were something simple like "apt-get install data-integrity" in major distros, well publicised, which would help the user configure the system for high integrity by default. This could include things like disabling the write cache, periodic fscks, ext3 data=journal, etc.<br> <p> It would still be better if distros made this the default, but I don't see much prospect of that.<br> <p> One other example of disregard for data integrity that I've noticed is that Ubuntu (and probably Debian) won't fsck a filesystem (including root!) if the system is on battery. This is very dubious - the fsck might exhaust the battery, but the user would probably rather lose the use of their laptop for a while to a drained battery than lose access to their valuable data when the filesystem gets corrupted later on... <br> <p> Fortunately on my desktop with a UPS, on_ac_power returns 255, which counts as 'not on battery' for the /etc/init.d/check*.sh scripts.<br> </div> Thu, 03 Sep 2009 07:58:14 +0000
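For reference, the on_ac_power utility the comment above refers to comes from the powermgmt-base package, and its conventional exit codes are what the distro check scripts key off. A small sketch of that convention (the echo lines are only illustration):
<pre>
# Conventional exit codes of on_ac_power:
#   0   on mains power
#   1   on battery
#   255 power source unknown (e.g. a desktop behind a UPS, as described above)
on_ac_power
case $? in
  1)     echo "on battery - the check*.sh scripts skip the fsck" ;;
  0|255) echo "on mains (or unknown) - the periodic fsck goes ahead" ;;
esac
</pre>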
Ext3 and RAID: silent data killers? https://lwn.net/Articles/350532/ https://lwn.net/Articles/350532/ Cato <div class="FormattedComment"> On full backups: one of the nice things about rsnapshot and similar rsync-based tools is that every backup is both a full backup and an incremental backup. Full in that previous backups can be deleted without any effect on this backup (thanks to hard links), and incremental in that the data transfer required is proportional to the specific data blocks that have changed (thanks to rsync).<br> <p> </div> Thu, 03 Sep 2009 07:51:01 +0000 Ext3 and write caching by drives are the data killers... https://lwn.net/Articles/350525/ https://lwn.net/Articles/350525/ job <div class="FormattedComment"> According to previous comments at LWN, barriers should be working through LVM as of 2.6.30.<br> </div> Thu, 03 Sep 2009 07:18:59 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350516/ https://lwn.net/Articles/350516/ dlang <div class="FormattedComment"> actually, if you have a read-mostly workload, raid 5/6 can end up being as fast as raid 10. I couldn't believe this myself when I first ran into it, but I have a large (multiple TB) database used for archiving log data and discovered that read/search performance was the same with raid 6 as with raid 10. <br> <p> in digging further I discovered that the key to performance was to have enough queries in flight to keep all disk heads fully occupied (one outstanding query per drive spindle), and you can do this with both raid 6 and raid 10.<br> </div> Thu, 03 Sep 2009 05:26:17 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350515/ https://lwn.net/Articles/350515/ k8to <div class="FormattedComment"> Yes, agreed, RAID is for availability and performance. RAID 5 doesn't offer performance, and the availability story isn't great either. So don't use it.<br> </div> Thu, 03 Sep 2009 05:06:54 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350514/ https://lwn.net/Articles/350514/ k8to <div class="FormattedComment"> A double fault can kill raid 10 also, but you're much less likely to have the fault propagate as discussed in the article, and the downtime for bringing in a standby is much smaller, so standby drives are more effective.<br> <p> Meanwhile, you also get vastly better performance, and higher reliability of implementation.<br> <p> It's really a no-brainer unless you're poor.<br> </div> Thu, 03 Sep 2009 05:05:46 +0000 Ext3 and write caching by drives are the data killers... https://lwn.net/Articles/350284/ https://lwn.net/Articles/350284/ Cato <div class="FormattedComment"> You are right on both points - when ext3 tries to do barriers on top of LVM it complains at that point, not at mount time.<br> </div> Wed, 02 Sep 2009 07:18:09 +0000 Are these issues really unique to Ext3? https://lwn.net/Articles/350282/ https://lwn.net/Articles/350282/ Cato <div class="FormattedComment"> See <a href="http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf">http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf</a> for a good paper on how hard disks fail, including partial failures, and on the failure/recovery behaviour of ext3, reiserfs, JFS and NTFS. It also talks about ixt3, an ext3 variant by the authors that's intended to have a stronger and more consistent failure/recovery model. This includes journal and data block checksums. <br> <p> There was a 'journal checksum patch' for ext3 by the authors, I think, but I can't track it down.<br> <p> Not sure about NTFS, but HFS+ does seem to have journal checksums - see <a href="http://developer.apple.com/mac/library/technotes/tn/tn1150.html#Checksum">http://developer.apple.com/mac/library/technotes/tn/tn115...</a><br> </div> Wed, 02 Sep 2009 07:01:49 +0000
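The rsnapshot comment a few entries above (every backup being simultaneously full and incremental) comes down to the classic hard-link rotation trick, which a hand-rolled sketch makes concrete; the directory names and source path here are invented for illustration, and rsnapshot automates essentially this plus retention policies:
<pre>
# Rotate: daily.1 becomes a hard-linked copy of daily.0, so deleting any
# one snapshot never affects the others.
rm -rf /backup/daily.2
[ -d /backup/daily.1 ] && mv /backup/daily.1 /backup/daily.2
[ -d /backup/daily.0 ] && cp -al /backup/daily.0 /backup/daily.1

# Refresh daily.0: rsync transfers only changed files (and replaces them,
# breaking the hard link), yet the resulting tree is a complete standalone copy.
rsync -a --delete /data/ /backup/daily.0/
</pre>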
Ext3 and write caching by drives are the data killers... https://lwn.net/Articles/350278/ https://lwn.net/Articles/350278/ Cato <div class="FormattedComment"> I grepped the 2.6.24 sources, fs/ext3/*.c and fs/jbd/*.c, for any mention of checksums, and couldn't find any. However, the email lists do have some references to a journal checksum patch for ext3 that didn't make it into 2.6.25.<br> <p> One other thought: perhaps LVM is bad for data integrity with ext3 because, as well as stopping barriers from working, LVM generates more fragmentation in the ext3 journal - that's one of the conditions mentioned by Ted Ts'o as potentially causing write reordering and hence FS corruption here: <a href="http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/msg07849.html">http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/m...</a><br> </div> Wed, 02 Sep 2009 06:36:16 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350252/ https://lwn.net/Articles/350252/ drag <div class="FormattedComment"> <font class="QuotedText">&gt; BACKUPS are poor, version control is the only sane backup.</font><br> <p> If you're using version control for backups then that is your backup. Your <br> sentence does not really make sense - there is no difference.<br> <p> My favorite form of backup is to use Git to sync data on geographically <br> disparate machines. But this is only suitable for text data. If I have to <br> back up photographs then source code management systems are utter shit.<br> <p> <font class="QuotedText">&gt; Backups are horrible to recover from. </font><br> <p> They are only horrible to recover with if the backup was done poorly. If <br> you (or anybody else) does a shitty job of setting them up then it's your <br> (or their) fault they are difficult. <br> <p> Backing up is a concept. <br> <p> Anyway, it's much more horrible to recover data that has ceased to <br> exist. <br> <p> <p> <font class="QuotedText">&gt; Backups provide no sane automatable mechanism for pruning older data </font><br> <font class="QuotedText">&gt; (backups) that doesn't suffer from the same corruption/accidental deletion </font><br> <font class="QuotedText">&gt; problem that originals have, but worse, amplified since they don't even </font><br> <font class="QuotedText">&gt; have a good mechanism for sanity checking (usage)! Backups tend to backup </font><br> <font class="QuotedText">&gt; corrupted data without complaining.</font><br> <p> You're doing it wrong. <br> <p> The best form of backup is to do full backups to multiple places. Ideally they <br> should be in a different region. You don't go back and prune data or clean <br> them up. That's WRONG. Incremental backups are only useful to reduce the <br> amount of data loss between full backups. A full copy of _EVERYTHING_ is a <br> requirement. And you save it for as long as that data is valuable. Usually <br> 5 years.<br> <p> It depends on what you're doing, but an ideal setup would be like this:<br> * On-site backups every weekend. Full backups. Stored for a few months. <br> * Incremental backups twice a day, resetting at the weekend with the full <br> backup.<br> * Every month 2 full backups are stored for 2-3 years.<br> * Off-site backups once a month, stored for 5 years.<br> etc. etc.<br> <p> That would probably be a good idea for most small/medium businesses.<br> <p> If you're relying on a single server or a single datacenter to store your data <br> reliably then you're a fool. I don't give a shit how high quality your <br> server hardware or file system or anything else is.
A single fire, vandalism, <br> hardware failure, disaster, sabotage, or any number of things can utterly <br> destroy _everything_. <br> <p> <p> </div> Wed, 02 Sep 2009 00:39:25 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350250/ https://lwn.net/Articles/350250/ dlang <div class="FormattedComment"> actually, in many cases the batteries for the raid cards can last for several weeks.<br> </div> Wed, 02 Sep 2009 00:13:04 +0000 Ext3 and RAID: silent data killers? https://lwn.net/Articles/350249/ https://lwn.net/Articles/350249/ Richard_J_Neill <div class="FormattedComment"> <font class="QuotedText">&gt; I'm a little surprised at lack of any discussion of RAID battery backups.</font><br> <font class="QuotedText">&gt; All RAID enclosures and RAID host-adapters worth their salt have a BBU</font><br> <font class="QuotedText">&gt;(battery backup unit) option for exactly this purpose.</font><br> <p> Yes...but it's only good for a few hours. So if your power outage lasts more than that, then the BBWC (battery backed write cache) is still toast.<br> <p> On a related note, I've just bought a pair of HP servers (DL380s) and an IBM X3550. It's very annoying that there is no way to buy either of these without hardware raid, nor can the raid card be turned off in the BIOS. For proper reliability, I only really trust software (md) raid in Raid 1 mode (with write caching off). [Aside: this kills the performance for database workloads (fdatasync), though the Intel X25-E SSDs outperform 10k SAS drives by a factor of about 12.]<br> </div> Wed, 02 Sep 2009 00:09:12 +0000 Ext3 and write caching by drives are the data killers... https://lwn.net/Articles/350241/ https://lwn.net/Articles/350241/ dododge <div class="FormattedComment"> Well for starters (unless things have changed recently) LVM doesn't support write barriers, so if you put LVM in the loop it probably doesn't matter if the drive reports write completion correctly or not. If you use XFS on top of LVM you get a big warning about this at mount time.<br> <p> I don't use EXT3 much, but from a quick googling it sounds like you have to explicitly turn on barrier support in fstab and it still won't warn you about the LVM issue until it actually tries to use one.<br> </div> Tue, 01 Sep 2009 23:49:41 +0000
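Since ext3 only notices an unsupported barrier when it first tries to issue one, a quick way to check whether barriers are actually in effect on a given stack is to ask for them and then watch the kernel log; the messages below are typical examples from kernels of this era, but the exact wording varies by version:
<pre>
# Request barriers explicitly (no complaint at mount time even if the
# underlying device-mapper/LVM device cannot honour them):
mount -o remount,barrier=1 /

# Then look for the filesystem giving up on them:
dmesg | grep -i barrier
#   JBD: barrier-based sync failed on dm-0 - disabling barriers
#   Filesystem "dm-1": Disabling barriers, not supported by the underlying device
</pre>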