Ext3 and write caching by drives are the data killers...
Posted Sep 1, 2009 8:15 UTC (Tue) by Cato (guest, #7643)
Parent article: Ext3 and RAID: silent data killers?
My standard setup now is to:
1. Avoid LVM completely
2. Disable write caching on all hard drives using hdparm -W0 /dev/sdX.
3. Enable data=journal on ext3 (tune2fs -o journal_data /dev/sdX is the best way to ensure partitions are mounted with this option - it covers the root partition, and survives the disk being moved to another system, a reinstall, and so on).
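A minimal sketch of steps 2 and 3 (device names are examples; note that on some drives the hdparm setting does not survive a power cycle, so it is worth reapplying it from a boot script):

    # 2. turn off the drive's volatile write cache
    hdparm -W0 /dev/sda
    # 3. store data=journal as a default mount option in the superblock
    tune2fs -o journal_data /dev/sda1
    # verify both settings
    hdparm -W /dev/sda
    tune2fs -l /dev/sda1 | grep 'Default mount options'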
The performance hit from these changes is trivial compared to the two days I spent rebuilding a PC where the root filesystem lost thousands of files and the backup filesystem was completely lost.
I suspect that the reason LVM is seen as reliable, despite being the default for Fedora and RHEL/CentOS, is that enterprise Linux deployments use hardware RAID cards with battery-backed cache, and perhaps higher-quality drives that don't lie about write completion.
Linux is far worse at losing data with a default ext3 setup than I once thought it was, unfortunately. If correctly configured it's fine, but the average new Linux user has no way to know how to configure this. I can't recall losing data like this on Windows in the absence of a hardware problem.
Posted Sep 1, 2009 8:51 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (10 responses)
It is always nice to imagine that if you find the right combination of DIP switches, set the right values in the configuration file, choose the correct compiler flags, your problems will magically vanish.
But sometimes the real problem is just that your hardware is broken, and all the voodoo and contortions do nothing. You've changed _three_ random things about your setup based on, AFAICT, no evidence at all, and you think it's a magic formula. Until something bad happens again and you return in another LWN thread to tell us how awful ext3 is again...
Posted Sep 1, 2009 9:03 UTC (Tue)
by Cato (guest, #7643)
[Link] (9 responses)
I did base these changes mostly on the well-known lack of journal checksumming in ext3 (hence data=journal and disabling write caching) - see http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal. Dropping LVM is harder to justify: it's really just a hunch, based on a number of reports of LVM being involved in data corruption, and on my own data point that the LVM volumes on one disk were completely inaccessible (i.e. corrupted LVM metadata) - so it was not just ext3 involved here, though write caching may have been a factor as well.
I'm interested to hear responses that show these steps are unnecessary, of course.
I really doubt the hardware is broken: there are no disk I/O errors in any of the logs, there were 2 disks corrupted (1 SATA, 1 PATA), and there are no symptoms of memory errors (random application/system crashes).
Posted Sep 1, 2009 17:48 UTC (Tue)
by dlang (guest, #313)
[Link] (7 responses)
disabling write caches
this is mandatory unless you have battery-backed cache to recover from failed writes. period, end of statement. if you don't do this you _will_ lose data when you lose power.
avoiding LVM
I have also run into 'interesting' things with LVM, and so I also avoid it. I see it as a solution in search of a problem for just about all users (just about all users would be just as happy, and have things work faster with less code involved, if they just used a single partition covering their entire drive.)
I suspect that some of the problem here is that ordering of things gets lost in the LVM layer, but that's just a guess.
data=journal
this is not needed if the application is making proper use of fsync. if the application is not making proper use of fsync it's still not enough to make the data safe.
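To illustrate the write-fsync-rename pattern that "proper use of fsync" usually means, here is a shell sketch (filenames are examples; GNU dd's conv=fsync makes it fsync() the output file before exiting):

    # write the new contents to a temporary file and force them to disk,
    # then atomically replace the old file via rename (mv within one fs)
    dd if=report.new of=report.txt.tmp conv=fsync 2>/dev/null
    mv report.txt.tmp report.txt
    # a fully careful application also fsync()s the containing directory
    # so that the rename itself is durable; that last step needs C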
by the way, ext3 does do checksums on journal entries. the details of this were posted as part of the thread on linux-kernel.
Posted Sep 1, 2009 18:05 UTC (Tue)
by Cato (guest, #7643)
[Link] (3 responses)
Do you know roughly when ext3 checksums were added, or by whom, as this contradicts the Wikipedia page? Must be since 2007, based on http://archives.free.net.ph/message/20070519.014256.ac3a2.... I thought journal checksumming was only added to ext4 (see first para of http://lwn.net/Articles/284037/) not ext3.
This sort of corruption issue is one reason to have multiple partitions; parallel fscks are another. In fact, it would be good if Linux distros automatically scheduled a monthly fsck for every filesystem, even if journal-based.
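ext3 already has per-filesystem knobs pointing in this direction; a sketch with example values:

    # force a check after every 30 mounts or 30 days, whichever comes first
    tune2fs -c 30 -i 30d /dev/sda1
    # inspect the current schedule
    tune2fs -l /dev/sda1 | grep -E 'Maximum mount count|Check interval'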
Posted Sep 1, 2009 18:15 UTC (Tue)
by dlang (guest, #313)
[Link] (2 responses)
I'm not sure I believe that parallel fscks on partitions on the same drive do you much good. the limiting factor for speed is the throughput of the drive. do you really gain much from having it bounce around interleaving the different fsck processes?
as for protecting against this sort of corruption, I don't think it really matters.
for flash, the device doesn't know about your partitions, so it will happily map blocks from different partitions to the same eraseblock, which will then get trashed on a power failure, so partitions don't do you any good.
for a raid array it may limit corruption, but that depends on how your partition boundaries end up matching the stripe boundaries.
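For what it's worth, the alignment question is at least checkable (device names are examples):

    # chunk (stripe unit) size of an md array
    mdadm --detail /dev/md0 | grep 'Chunk Size'
    # partition start offsets in sectors, to compare against the chunk size
    fdisk -lu /dev/sda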
Posted Sep 1, 2009 18:44 UTC (Tue)
by Cato (guest, #7643)
[Link]
This Usenix paper mentions that JBD2 will ultimately be usable by other filesystems, so perhaps that's how ext3 does (or will) support this: http://www.usenix.org/publications/login/2007-06/openpdfs... - however, I don't think ext3 has journal checksums in (say) 2.6.24 kernels.
Posted Sep 2, 2009 6:36 UTC (Wed)
by Cato (guest, #7643)
[Link]
One other thought: perhaps LVM is bad for data integrity with ext3 because, as well as stopping barriers from working, LVM generates more fragmentation in the ext3 journal - that's one of the conditions mentioned by Ted Tso as potentially causing write reordering and hence FS corruption here: http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/m...
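Whether barriers actually made it past device mapper was at least visible in the kernel log; a quick check (the exact wording varies by kernel version):

    # ext3/JBD logs a line like
    #   JBD: barrier-based sync failed on dm-0 - disabling barriers
    # when a lower layer rejects barrier requests
    dmesg | grep -i barrier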
Posted Sep 1, 2009 22:37 UTC (Tue)
by cortana (subscriber, #24596)
[Link] (2 responses)
If this is true (and I don't doubt that it is), why on earth is it not the default? Shipping software with such an unsafe default setting is stupid. Most users have no idea about these settings... surely they shouldn't be handed a delicious pizza smeared with nitroglycerin topping, and then be blamed when they bite into it and it explodes...
Posted Sep 1, 2009 22:41 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
the market has shown that people are willing to take this risk by driving all vendors that didn't make the change out of the marketplace
Posted Sep 3, 2009 7:58 UTC (Thu)
by Cato (guest, #7643)
[Link]
It would still be better if distros made this the default but I don't see much prospect of this.
One other example of disregard for data integrity that I've noticed is that Ubuntu (and probably Debian) won't fsck a filesystem (including root!) if the system is on battery. This is very dubious: the fsck might exhaust the battery, but the user might well prefer losing the use of their laptop for a while (flat battery) to losing access to their valuable data for much longer when the filesystem gets corrupted later on...
Fortunately on my desktop with a UPS, on_ac_power returns 255 which counts as 'not on battery' for the /etc/init.d/check*.sh scripts.
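A sketch of the check involved (on_ac_power's documented exit codes: 0 = on mains, 1 = on battery, 255 = undetermined):

    on_ac_power
    case $? in
        1) echo "on battery - the check scripts skip the fsck" ;;
        *) echo "on mains, or undetermined (255) - the fsck goes ahead" ;;
    esac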
Posted Sep 3, 2009 23:18 UTC (Thu)
by landley (guest, #6789)
[Link]
People expect RAID to protect against permanent hardware failures, and people expect journaling to protect against data loss from transient failures which may be entirely due to software (ala kernel panic, hang, watchdog, heartbeat...). In the first kind of failure you need to replace a component, in the second kind of failure the hardware is still good as new afterwards. (Heck, you can experience unclean shutdowns if your load balancer kills xen shares impolitely. There's _lots_ of ways to do this. I've had shutdown scripts hang failing to umount a network share, leaving The Button as the only option.)
Another problematic paragraph starts with "RAID arrays can increase data reliability, but an array which is not running with its full complement of working, populated drives has lost the redundancy which provides that reliability."
That's misleading because redundancy isn't what provides this reliability, at least in other contexts. When you lose the redundancy, you open yourself to an _unrelated_ issue of update granularity/atomicity. A single disk doesn't have this "incomplete writes can cause collateral damage to unrelated data" problem. (It might start to if physical block size grows bigger than filesystem sector size, but even 4k shouldn't do that on a properly aligned modern ext3 filesystem.) Nor does RAID 0 have an update granularity issue, and that has no redundancy in the first place.
I.E. a degraded RAID 5/6 that has to reconstruct data using parity information from multiple stripes that can't be updated atomically is _more_ vulnerable to data loss from interrupted writes than RAID 0 is, and the data loss is of a "collateral damage" form that journaling silently fails to detect. This issue is subtle, and fiddly, and although people keep complaining that it's not worth documenting because everybody should already know it, the people trying to explain it keep either getting it _wrong_ or glossing over important points.
Another point that was sort of glossed over is that journaling isn't exactly a culprit here, it's an accessory at best. This is a block device problem which would still cause data loss on a non-journaled filesystem, and it's a kind of data loss that a fsck won't necessarily detect. (Properly allocated file data elsewhere on the disk, which the filesystem may not have attempted to write to in years, may be collateral damage. And since you have no warning you could rsync the damaged stuff over your backups if you don't notice.) If it's seldom-used data it may be long gone before you notice.
The problem is that journaling gives you a false sense of security, since it doesn't protect against these issues (which exist at the block device level, not the filesystem level), and can hide even the subset of problems fsck would detect, by reporting a successful journal replay when the metadata still contains collateral damage in areas the journal hadn't logged any recent changes to.
I look forward to btrfs checksumming all extents. That should at least make this stuff easier to detect, so you can know when you _haven't_ experienced this problem.
Rob
Posted Sep 1, 2009 16:43 UTC (Tue)
by ncm (guest, #165)
[Link] (6 responses)
I don't buy that your experience suggests there's anything wrong with ext3, if you didn't protect against power drops. The same could happen with any file system. The more efficient the file system is, the more likely is corruption in that case -- although some inefficient file systems seem especially corruptible.
Posted Sep 1, 2009 17:31 UTC (Tue)
by Cato (guest, #7643)
[Link] (5 responses)
Since the rebuild, I have realised that the user of the PC has been accidentally turning it off via the power switch, which perhaps caused the contents of the disks' write caches to be lost, and is a fairly severe test. Despite several sudden poweroffs due to this, there has been no corruption yet with the new setup. It seems unlikely that writes would sit in the disk's write cache for so long that they couldn't be written out while power was still up, but the fact is that both ext3 and LVM data structures got corrupted.
It's acknowledged that ext3's lack of journal checksumming can cause corruption when combined with disk write caching (whereas XFS does have such checksums, I think). The only question is whether the time between power starting to drop and it going away completely is enough to flush pending writes (possibly reordered), while also not having any RAM contents get corrupted. Betting the data integrity of a remotely administered system on this time window is not something I want to do.
Posted Sep 1, 2009 18:06 UTC (Tue)
by ncm (guest, #165)
[Link]
> The only question is whether the time between power starting to drop and it going away completely is enough to flush pending writes (possibly reordered), while also not having any RAM contents get corrupted.

That's easy: No. When power starts to drop, everything is gone at that moment. If the disk is writing at that moment, the unfinished sector gets corrupted, and maybe others. A UPS for the computer and disk together helps only a little against corruption, unless power drops are almost always shorter than the battery time or the UPS shuts the computer down automatically before it is exhausted. You may be better off if the computer loses power immediately, and only the disk gets the UPS treatment.

> it really shouldn't be necessary to use a UPS just to avoid filesystem/LVM corruption.

Perhaps, but it is. (What is this "should"?) The file system doesn't matter nearly so much as you would like: file systems can be indefinitely bad, but no better than fairly good. The good news is that the UPS only needs to support the disk, and it only needs to keep power up for a few seconds; then many file systems are excellent, although the bad ones remain bad.
Posted Sep 9, 2009 20:35 UTC (Wed)
by BackSeat (guest, #1886)
[Link] (3 responses)
> It's acknowledged that ext3's lack of journal checksumming can cause corruption

It may only be semantics, but it's unlikely that the lack of journal checksumming causes corruption, although it may make it difficult to detect. As for LVM, I've never seen the point: it is just another layer of ones and zeros between the data and the processor. I never use it, and I'm very surprised some distros seem to use it by default.
Posted Sep 10, 2009 20:50 UTC (Thu)
by Cato (guest, #7643)
[Link] (2 responses)
In the absence of ext3 journal checksumming, if a journal block is corrupted (e.g. by a partial write) and a crash then requires replay of that journal block, horrible things will happen - presumably garbage is written to various places on disk from the 'journal' entry. One symptom may be log entries saying 'write beyond end of partition', which I've seen a few times with ext3 corruption and which I think is a clear indicator of corrupt filesystem metadata.
This is one reason why JBD2 added journal checksumming for use with ext4 - I hope this also gets used by ext3. In my view, it would be a lot better to make that change to ext3 than to make data=writeback the default, which will speed up some workloads and most likely corrupt some additional data (though I guess not metadata).
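On ext4 the checksums are requested at mount time; a sketch (device and mount point are examples):

    # enable checksumming of journal commit blocks on an ext4 filesystem
    mount -o journal_checksum /dev/sda2 /mnt/data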
Posted Sep 10, 2009 21:05 UTC (Thu)
by Cato (guest, #7643)
[Link] (1 responses)
Posted Sep 11, 2009 16:33 UTC (Fri)
by nix (subscriber, #2304)
[Link]
...notice that 'hey, this doesn't look like a journal' once the record that spanned the block boundary is complete. But that's a bit late...

(this is all supposition from postmortems of shagged systems. Thankfully we no longer use hardware prone to this!)
Posted Sep 1, 2009 23:49 UTC (Tue)
by dododge (guest, #2870)
[Link] (3 responses)
I don't use EXT3 much, but from a quick googling it sounds like you have to explicitly turn on barrier support in fstab, and even then it won't warn you about the LVM issue until it actually tries to use a barrier.
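A hedged example of doing that for ext3 (the fstab line is illustrative, and whether barrier=1 is honoured depends on the kernel and the block layers underneath):

    # /etc/fstab entry requesting barriers explicitly:
    #   /dev/sda1  /  ext3  defaults,barrier=1  0  1
    # the same thing at runtime:
    mount -o remount,barrier=1 /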
Posted Sep 2, 2009 7:18 UTC (Wed)
by Cato (guest, #7643)
[Link]
Posted Sep 3, 2009 7:18 UTC (Thu)
by job (guest, #670)
[Link] (1 responses)
Posted Sep 3, 2009 10:23 UTC (Thu)
by Cato (guest, #7643)
[Link]
Ext3 and write caching by drives are the data killers...
> this is mandatory unless you have battery-backed cache to recover from
> failed writes. period, end of statement. if you don't do this you _will_
> lose data when you lose power.
What's wrong with providing battery backup for your drives? If they have power for a little while after the last write request arrives, then write caching, re-ordering writes, and lying about what's already on the disk don't matter. You still need to do backups, of course, but you'll need to use them less often because your file system will be that much less likely to have been corrupted.