Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 9:03 UTC (Tue) by Cato (subscriber, #7643)
In reply to: Ext3 and write caching by drives are the data killers... by tialaramex
Parent article: Ext3 and RAID: silent data killers?

I don't have time to do a scientific experiment across a number of PCs using different setups, so I had to go with the evidence I had and could find through searches.

I did base these changes mostly on the well-known lack of journal checksumming in ext3 (going to data=journal and avoiding write caching) - see http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal. Dropping LVM is harder to justify; it's really just a hunch based on a number of reports of LVM being involved in data corruption, and on my own data point that the LVM volumes on one disk were completely inaccessible (i.e. corrupted LVM metadata) - hence it was not just ext3 involved here, though it might have been write caching as well.
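For concreteness, the data=journal change is just a mount option - a rough sketch of the fstab entry, with the device and mount point as placeholders only:

    # /etc/fstab - route file data through the ext3 journal as well as metadata
    /dev/sda1  /home  ext3  defaults,data=journal  0  2

(If I remember correctly, for the root filesystem this has to be passed as rootflags=data=journal on the kernel command line, since the data= mode can't be switched by an ordinary remount.)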

I'm interested to hear responses that show these steps are unnecessary, of course.

I really doubt the hardware is broken: there are no disk I/O errors in any of the logs, there were 2 disks corrupted (1 SATA, 1 PATA), and there are no symptoms of memory errors (random application/system crashes).



Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 17:48 UTC (Tue) by dlang (subscriber, #313) [Link]

of the three changes you are making

disabling write caches

this is mandatory unless you have battery-backed cache to recover from failed writes. period, end of statement. if you don't do this you _will_ lose data when you lose power.
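as a rough illustration of what that means in practice (the device name here is just an example), checking and turning off the write cache on an ATA/SATA drive is something like:

    hdparm -W /dev/sda     # report whether the drive's write cache is enabled
    hdparm -W 0 /dev/sda   # turn the volatile write cache off

note that this is a drive setting, not a kernel one, so it generally has to be reapplied on every boot (hdparm.conf or an init script).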

avoiding LVM

I have also run into 'interesting' things with LVM, and so I also avoid it. I see it as a solution in search of a problem for just about all users (just about all users would be just as happy, and have things work faster with less code involved, if they just used a single partition covering their entire drive.)

I suspect that some of the problem here is that ordering of things gets lost in the LVM layer, but that's just a guess.

data=journal

this is not needed if the application is making proper use of fsync. if the application is not making proper use of fsync it's still not enough to make the data safe.
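for reference, 'proper use of fsync' means something along the lines of the classic write-then-rename pattern. a minimal sketch in Python (the names are made up; the same sequence of calls applies in C):

    import os

    def safe_replace(path, data):
        tmp = path + ".tmp"                      # write the new version to a scratch file
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)                         # push the new contents down to the device
        finally:
            os.close(fd)
        os.rename(tmp, path)                     # atomically swap in the new version
        dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
        try:
            os.fsync(dirfd)                      # make the rename itself durable
        finally:
            os.close(dirfd)

skip any of those fsync calls and data=journal by itself won't make the data safe, which is the point above.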

by the way, ext3 does do checksums on journal entries. the details of this were posted as part of the thread on linux-kernel.

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 18:05 UTC (Tue) by Cato (subscriber, #7643) [Link]

Possibly data=journal is overkill; I was going by the Wikipedia page on ext3 (link above). However, a conservative setup is attractive at present, as performance is far less important than reliability, for this PC anyway.

Do you know roughly when ext3 checksums were added, or by whom, as this contradicts the Wikipedia page? It must have been since 2007, based on http://archives.free.net.ph/message/20070519.014256.ac3a2.... I thought journal checksumming was only added to ext4 (see the first para of http://lwn.net/Articles/284037/), not ext3.

This sort of corruption issue is one reason to have multiple partitions; parallel fscks are another. In fact, it would be good if Linux distros automatically scheduled a monthly fsck for every filesystem, even if journal-based.
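A per-filesystem approximation of that already exists via tune2fs - a rough sketch, with the device name just as an example:

    tune2fs -i 1m /dev/sda1   # force a fsck at least once a month
    tune2fs -c 30 /dev/sda1   # ...or after 30 mounts, whichever limit is hit first

though of course a distro-wide, well-publicised default would be better than each admin remembering to do this.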

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 18:15 UTC (Tue) by dlang (subscriber, #313) [Link]

Ted Tso detailed the protection of the journal in this thread (I've deleted the particular message or I'd quote it for you).

I'm not sure I believe that parallel fscks on partitions on the same drive do you much good. the limiting factor for speed is the throughput of the drive. do you really gain much from having it bounce around interleaving the different fsck processes?

as for protecting against this sort of corruption, I don't think partitioning really matters.

for flash, the device doesn't know about your partitions, so it will happily map blocks from different partitions into the same eraseblock, which will then get trashed on a power failure - so partitions don't do you any good.

for a raid array it may limit corruption, but that depends on how your partition boundaries end up matching the stripe boundaries.

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 18:44 UTC (Tue) by Cato (subscriber, #7643) [Link]

I still can't find that email, but this outlines that journal checksumming was added to JBD2 to support ext4: http://ext4.wiki.kernel.org/index.php/Frequently_Asked_Qu...

This Usenix paper mentions that JBD2 will ultimately be usable by other filesystems, so perhaps that's how ext3 does (or will) support this: http://www.usenix.org/publications/login/2007-06/openpdfs... - however, I don't think ext3 has journal checksums in (say) 2.6.24 kernels.

Ext3 and write caching by drives are the data killers...

Posted Sep 2, 2009 6:36 UTC (Wed) by Cato (subscriber, #7643) [Link]

I grepped the 2.6.24 sources, fs/ext3/*.c and fs/jbd/*.c, for any mention of checksums, and couldn't find any. However, the mailing lists do have some references to a journal checksum patch for ext3 that didn't make it into 2.6.25.

One other thought: perhaps LVM is bad for data integrity with ext3 because, as well as stopping barriers from working, LVM generates more fragmentation in the ext3 journal - that's one of the conditions mentioned by Ted Tso as potentially causing write reordering and hence FS corruption here: http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/m...

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 22:37 UTC (Tue) by cortana (subscriber, #24596) [Link]

> disabling write caches
>
> this is mandatory unless you have battery-backed cache to recover from
> failed writes. period, end of statement. if you don't do this you _will_
> lose data when you lose power.

If this is true (and I don't doubt that it is), why on earth is it not the default? Shipping software with such an unsafe default setting is stupid. Most users have no idea about these settings... surely they shouldn't be handed a delicious pizza smeared with nitroglycerin topping, and then be blamed when they bite into it and it explodes...

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 22:41 UTC (Tue) by dlang (subscriber, #313) [Link]

simple: enabling the write cache gives you a 10x (or better) performance boost for all the times when your system doesn't lose power.

the market has shown that people are willing to take this risk, by driving all vendors that didn't make the change out of the marketplace.

Ext3 and write caching by drives are the data killers...

Posted Sep 3, 2009 7:58 UTC (Thu) by Cato (subscriber, #7643) [Link]

True, but it would be good if there were something simple like "apt-get install data-integrity" in major distros, which would help the user configure the system for high integrity by default, and if this were well publicised. This could include things like disabling the write cache, periodic fscks, ext3 data=journal, etc.

It would still be better if distros made this the default, but I don't see much prospect of that.

One other example of disregard for data integrity that I've noticed is that Ubuntu (and probably Debian) won't fsck a filesystem (including root!) if the system is on batteries. This is very dubious - the fsck might exhaust the battery, but the user might well prefer a short while without their laptop (because the fsck drained the battery) to a long time without their valuable data when the filesystem gets corrupted later on...

Fortunately, on my desktop with a UPS, on_ac_power returns 255, which counts as 'not on battery' for the /etc/init.d/check*.sh scripts.

Ext3 and write caching by drives are the data killers...

Posted Sep 3, 2009 23:18 UTC (Thu) by landley (subscriber, #6789) [Link]

The paragraph starting "But what if more than one drive fails?" is misleading. You don't need another drive to fail to experience this problem; all you need is an unclean shutdown of an array that's both dirty and degraded. (Two words: "atime updates".) The second failure can be entirely a _software_ problem (which can invalidate stripes on other drives without changing them, because the parity information needed to use them is gone). Software problem != hardware problem; it's not the same _kind_ of failure.

People expect RAID to protect against permanent hardware failures, and people expect journaling to protect against data loss from transient failures which may be entirely due to software (a la kernel panic, hang, watchdog, heartbeat...). In the first kind of failure you need to replace a component; in the second kind the hardware is still good as new afterwards. (Heck, you can experience unclean shutdowns if your load balancer kills xen shares impolitely. There are _lots_ of ways to do this. I've had shutdown scripts hang failing to umount a network share, leaving The Button as the only option.)

Another problematic paragraph starts with "RAID arrays can increase data reliability, but an array which is not running with its full complement of working, populated drives has lost the redundancy which provides that reliability."

That's misleading because redundancy isn't what provides this reliability, at least in other contexts. When you lose the redundancy, you open yourself to an _unrelated_ issue of update granularity/atomicity. A single disk doesn't have this "incomplete writes can cause collateral damage to unrelated data" problem. (It might start to if physical block size grows bigger than filesystem sector size, but even 4k shouldn't do that on a properly aligned modern ext3 filesystem.) Nor does RAID 0 have an update granularity issue, and that has no redundancy in the first place.

I.e. a degraded RAID 5/6 that has to reconstruct data using parity information from multiple stripes that can't be updated atomically is _more_ vulnerable to data loss from interrupted writes than RAID 0 is, and the data loss is of a "collateral damage" form that journaling silently fails to detect. This issue is subtle and fiddly, and although people keep complaining that it's not worth documenting because everybody should already know it, the people trying to explain it keep either getting it _wrong_ or glossing over important points.
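To make the "collateral damage" point concrete, here's a toy single-stripe example (a 3-disk RAID 5 shrunk to one byte per disk, in Python, purely to show the arithmetic - the values are arbitrary):

    # two data "disks" and one parity "disk" for a single stripe
    d0, d1 = 0b10110010, 0b01011100
    p = d0 ^ d1                 # RAID 5 parity is just XOR across the stripe

    # the disk holding d1 dies; its contents can still be reconstructed
    assert d0 ^ p == d1

    # power is lost after a new d0 reaches the platter but before the
    # matching parity update does
    d0_new = 0b11111111

    # reconstructing the *untouched* block d1 now returns garbage
    assert (d0_new ^ p) != d1   # collateral damage to data nobody wrote

Nothing on the surviving disks reports an error; the bad reconstruction is silent, which is exactly why journaling and fsck don't catch it.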

Another point that was sort of glossed over is that journaling isn't exactly a culprit here; it's an accessory at best. This is a block device problem which would still cause data loss on a non-journaled filesystem, and it's a kind of data loss that a fsck won't necessarily detect. (Properly allocated file data elsewhere on the disk, which the filesystem may not have attempted to write to in years, may be collateral damage. And since you have no warning, you could rsync the damaged stuff over your backups if you don't notice.) If it's seldom-used data it may be long gone before you notice.

The problem is that journaling gives you a false sense of security, since it doesn't protect against these issues (which exist at the block device level, not the filesystem level), and can hide even the subset of problems fsck would detect, by reporting a successful journal replay when the metadata still contains collateral damage in areas the journal hadn't logged any recent changes to.

I look forward to btrfs checksumming all extents. That should at least make this stuff easier to detect, so you can know when you _haven't_ experienced this problem.

Rob

