Ext3 and write caching by drives are the data killers...
Posted Sep 1, 2009 8:15 UTC (Tue) by Cato (guest, #7643)
Parent article: Ext3 and RAID: silent data killers?
My standard setup now is to:
1. Avoid LVM completely
2. Disable write caching on all hard drives using hdparm -W0 /dev/sdX.
3. Enable data=journal on ext3 (tune2fs -o journal_data /dev/sdX is the best way to ensure partitions are mounted with this option - it covers the root partition, and survives the disk being moved to another system, a reinstall, and so on).
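A minimal sketch of steps 2 and 3 (device names are examples; note that on some drives the hdparm setting does not survive a power cycle, so it is worth reapplying it from a boot script):

    # 2. turn off the drive's volatile write cache
    hdparm -W0 /dev/sda
    # 3. store data=journal as a default mount option in the superblock
    tune2fs -o journal_data /dev/sda1
    # verify both settings
    hdparm -W /dev/sda
    tune2fs -l /dev/sda1 | grep 'Default mount options'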
The performance hit from these changes is trivial compared to the two days I spent rebuilding a PC where the root filesystem lost thousands of files and the backup filesystem was completely lost.
I suspect that the reason LVM is seen as reliable, despite being the default for Fedora and RHEL/CentOS, is that enterprise Linux deployments use hardware RAID cards with battery-backed cache, and perhaps higher-quality drives that don't lie about write completion.
Linux is far worse at losing data with a default ext3 setup than I once thought it was, unfortunately. If correctly configured it's fine, but the average new Linux user has no way to know how to configure this. I can't recall losing data like this on Windows in the absence of a hardware problem.
Posted Sep 1, 2009 8:51 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (10 responses)
It is always nice to imagine that if you find the right combination of DIP switches, set the right values in the configuration file, choose the correct compiler flags, your problems will magically vanish.
But sometimes the real problem is just that your hardware is broken, and all the voodoo and contortions do nothing. You've changed _three_ random things about your setup based on, AFAICT, no evidence at all, and you think it's a magic formula. Until something bad happens again and you return in another LWN thread to tell us how awful ext3 is again...
Posted Sep 1, 2009 9:03 UTC (Tue)
by Cato (guest, #7643)
[Link] (9 responses)
I did base these changes mostly on the well-known lack of journal checksumming in ext3 (hence data=journal and disabling write caching) - see http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal. Dropping LVM is harder to justify: it's really just a hunch, based on a number of reports of LVM being involved in data corruption, and on my own data point that the LVM volumes on one disk were completely inaccessible (i.e. corrupted LVM metadata) - so it was not just ext3 involved here, though write caching may have been a factor as well.
I'm interested to hear responses that show these steps are unnecessary, of course.
I really doubt the hardware is broken: there are no disk I/O errors in any of the logs, there were 2 disks corrupted (1 SATA, 1 PATA), and there are no symptoms of memory errors (random application/system crashes).
Posted Sep 1, 2009 17:48 UTC (Tue)
by dlang (guest, #313)
[Link] (7 responses)
disabling write caches
this is mandatory unless you have battery-backed cache to recover from failed writes. period, end of statement. if you don't do this you _will_ lose data when you lose power.
avoiding LVM
I have also run into 'interesting' things with LVM, and so I also avoid it. I see it as a solution in search of a problem for just about all users (just about all users would be just as happy, and have things work faster with less code involved, if they just used a single partition covering their entire drive.)
I suspect that some of the problem here is that ordering of things gets lost in the LVM layer, but that's just a guess.
data=journal
this is not needed if the application is making proper use of fsync. if the application is not making proper use of fsync it's still not enough to make the data safe.
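To illustrate the write-fsync-rename pattern that "proper use of fsync" usually means, here is a shell sketch (filenames are examples; GNU dd's conv=fsync makes it fsync() the output file before exiting):

    # write the new contents to a temporary file and force them to disk,
    # then atomically replace the old file via rename (mv within one fs)
    dd if=report.new of=report.txt.tmp conv=fsync 2>/dev/null
    mv report.txt.tmp report.txt
    # a fully careful application also fsync()s the containing directory
    # so that the rename itself is durable; that last step needs C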
by the way, ext3 does do checksums on journal entries. the details of this were posted as part of the thread on linux-kernel.
Posted Sep 1, 2009 18:05 UTC (Tue)
by Cato (guest, #7643)
[Link] (3 responses)
Do you know roughly when ext3 checksums were added, or by whom, as this contradicts the Wikipedia page? Must be since 2007, based on http://archives.free.net.ph/message/20070519.014256.ac3a2.... I thought journal checksumming was only added to ext4 (see first para of http://lwn.net/Articles/284037/) not ext3.
This sort of corruption issue is one reason to have multiple partitions; parallel fscks are another. In fact, it would be good if Linux distros automatically scheduled a monthly fsck for every filesystem, even if journal-based.
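ext3 already has per-filesystem knobs pointing in this direction; a sketch with example values:

    # force a check after every 30 mounts or 30 days, whichever comes first
    tune2fs -c 30 -i 30d /dev/sda1
    # inspect the current schedule
    tune2fs -l /dev/sda1 | grep -E 'Maximum mount count|Check interval'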
Posted Sep 1, 2009 18:15 UTC (Tue)
by dlang (guest, #313)
[Link] (2 responses)
I'm not sure I believe that parallel fscks on partitions on the same drive do you much good. the limiting factor for speed is the throughput of the drive. do you really gain much from having it bounce around interleaving the different fsck processes?
as for protecting against this sort of corruption, I don't think it really matters.
for flash, the device doesn't know about your partitions, so it will happily map blocks from different partitions to the same eraseblock, which will then get trashed on a power failure, so partitions don't do you any good.
for a raid array it may limit corruption, but that depends on how your partition boundaries end up matching the stripe boundaries.
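For what it's worth, the alignment question is at least checkable (device names are examples):

    # chunk (stripe unit) size of an md array
    mdadm --detail /dev/md0 | grep 'Chunk Size'
    # partition start offsets in sectors, to compare against the chunk size
    fdisk -lu /dev/sda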
Posted Sep 1, 2009 18:44 UTC (Tue)
by Cato (guest, #7643)
[Link]
This Usenix paper mentions that JBD2 will ultimately be usable by other filesystems, so perhaps that's how ext3 does (or will) support this: http://www.usenix.org/publications/login/2007-06/openpdfs... - however, I don't think ext3 has journal checksums in (say) 2.6.24 kernels.
Posted Sep 2, 2009 6:36 UTC (Wed)
by Cato (guest, #7643)
[Link]
One other thought: perhaps LVM is bad for data integrity with ext3 because, as well as stopping barriers from working, LVM generates more fragmentation in the ext3 journal - that's one of the conditions mentioned by Ted Tso as potentially causing write reordering and hence FS corruption here: http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/m...
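Whether barriers actually made it past device mapper was at least visible in the kernel log; a quick check (the exact wording varies by kernel version):

    # ext3/JBD logs a line like
    #   JBD: barrier-based sync failed on dm-0 - disabling barriers
    # when a lower layer rejects barrier requests
    dmesg | grep -i barrier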
Posted Sep 1, 2009 22:37 UTC (Tue)
by cortana (subscriber, #24596)
[Link] (2 responses)
If this is true (and I don't doubt that it is), why on earth is it not the default? Shipping software with such an unsafe default setting is stupid. Most users have no idea about these settings... surely they shouldn't be handed a delicious pizza smeared with nitroglycerin topping, and then be blamed when they bite into it and it explodes...
Posted Sep 1, 2009 22:41 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
the market has shown that people are willing to take this risk by driving all vendors that didn't make the change out of the marketplace
Posted Sep 3, 2009 7:58 UTC (Thu)
by Cato (guest, #7643)
[Link]
It would still be better if distros made this the default but I don't see much prospect of this.
One other example of disregard for data integrity that I've noticed is that Ubuntu (and probably Debian) won't fsck a filesystem (including root!) if the system is on battery. This is very dubious: the fsck might exhaust the battery, but the user might well prefer losing the use of their laptop for a while (flat battery) to losing access to their valuable data for much longer when the filesystem gets corrupted later on...
Fortunately on my desktop with a UPS, on_ac_power returns 255 which counts as 'not on battery' for the /etc/init.d/check*.sh scripts.
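A sketch of the check involved (on_ac_power's documented exit codes: 0 = on mains, 1 = on battery, 255 = undetermined):

    on_ac_power
    case $? in
        1) echo "on battery - the check scripts skip the fsck" ;;
        *) echo "on mains, or undetermined (255) - the fsck goes ahead" ;;
    esac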
Posted Sep 3, 2009 23:18 UTC (Thu)
by landley (guest, #6789)
[Link]
People expect RAID to protect against permanent hardware failures, and people expect journaling to protect against data loss from transient failures which may be entirely due to software (ala kernel panic, hang, watchdog, heartbeat...). In the first kind of failure you need to replace a component, in the second kind of failure the hardware is still good as new afterwards. (Heck, you can experience unclean shutdowns if your load balancer kills xen shares impolitely. There's _lots_ of ways to do this. I've had shutdown scripts hang failing to umount a network share, leaving The Button as the only option.)
Another problematic paragraph starts with "RAID arrays can increase data reliability, but an array which is not running with its full complement of working, populated drives has lost the redundancy which provides that reliability."
That's misleading because redundancy isn't what provides this reliability, at least in other contexts. When you lose the redundancy, you open yourself to an _unrelated_ issue of update granularity/atomicity. A single disk doesn't have this "incomplete writes can cause collateral damage to unrelated data" problem. (It might start to if physical block size grows bigger than filesystem sector size, but even 4k shouldn't do that on a properly aligned modern ext3 filesystem.) Nor does RAID 0 have an update granularity issue, and that has no redundancy in the first place.
I.E. a degraded RAID 5/6 that has to reconstruct data using parity information from multiple stripes that can't be updated atomically is _more_ vulnerable to data loss from interrupted writes than RAID 0 is, and the data loss is of a "collateral damage" form that journaling silently fails to detect. This issue is subtle, and fiddly, and although people keep complaining that it's not worth documenting because everybody should already know it, the people trying to explain it keep either getting it _wrong_ or glossing over important points.
Another point that was sort of glossed over is that journaling isn't exactly a culprit here, it's an accessory at best. This is a block device problem which would still cause data loss on a non-journaled filesystem, and it's a kind of data loss that a fsck won't necessarily detect. (Properly allocated file data elsewhere on the disk, which the filesystem may not have attempted to write to in years, may be collateral damage. And since you have no warning you could rsync the damaged stuff over your backups if you don't notice.) If it's seldom-used data it may be long gone before you notice.
The problem is that journaling gives you a false sense of security, since it doesn't protect against these issues (which exist at the block device level, not the filesystem level), and can hide even the subset of problems fsck would detect, by reporting a successful journal replay when the metadata still contains collateral damage in areas the journal hadn't logged any recent changes to.
I look forward to btrfs checksumming all extents. That should at least make this stuff easier to detect, so you can know when you _haven't_ experienced this problem.
Rob
Posted Sep 1, 2009 16:43 UTC (Tue)
by ncm (guest, #165)
[Link] (6 responses)
I don't buy that your experience suggests there's anything wrong with ext3, if you didn't protect against power drops. The same could happen with any file system. The more efficient the file system is, the more likely is corruption in that case -- although some inefficient file systems seem especially corruptible.
Posted Sep 1, 2009 17:31 UTC (Tue)
by Cato (guest, #7643)
[Link] (5 responses)
Since the rebuild, I have realised that the user of the PC has been accidentally turning it off via the power switch, which perhaps caused the contents of the disks' write caches to be lost, and is a fairly severe test. Despite several sudden poweroffs due to this, there has been no corruption yet with the new setup. It seems unlikely that writes would sit in the disk's write cache for so long that they couldn't be written out while power was still up, but the fact is that both ext3 and LVM data structures got corrupted.
It's acknowledged that ext3's lack of journal checksumming can cause corruption when combined with disk write caching (whereas XFS does have such checksums, I think). The only question is whether the time between power starting to drop and it going away completely is enough to flush pending writes (possibly reordered), while also not having any RAM contents get corrupted. Betting the data integrity of a remotely administered system on this time window is not something I want to do.
Posted Sep 1, 2009 18:06 UTC (Tue)
by ncm (guest, #165)
[Link]
> The only question is whether the time between power starting to drop and it going away completely is enough to flush pending writes (possibly reordered), while also not having any RAM contents get corrupted.

That's easy: No. When power starts to drop, everything is gone at that moment. If the disk is writing at that moment, the unfinished sector gets corrupted, and maybe others. A UPS for the computer and disk together helps only a little against corruption, unless power drops are almost always shorter than the battery time or the UPS shuts the computer down automatically before it is exhausted. You may be better off if the computer loses power immediately, and only the disk gets the UPS treatment.

> it really shouldn't be necessary to use a UPS just to avoid filesystem/LVM corruption.

Perhaps, but it is. (What is this "should"?) The file system doesn't matter nearly so much as you would like: file systems can be indefinitely bad, but no better than fairly good. The good news is that the UPS only needs to support the disk, and it only needs to keep power up for a few seconds; then many file systems are excellent, although the bad ones remain bad.
Posted Sep 9, 2009 20:35 UTC (Wed)
by BackSeat (guest, #1886)
[Link] (3 responses)
> It's acknowledged that ext3's lack of journal checksumming can cause corruption

It may only be semantics, but it's unlikely that the lack of journal checksumming causes corruption, although it may make it difficult to detect. As for LVM, I've never seen the point: it is just another layer of ones and zeros between the data and the processor. I never use it, and I'm very surprised some distros seem to use it by default.
Posted Sep 10, 2009 20:50 UTC (Thu)
by Cato (guest, #7643)
[Link] (2 responses)
In the absence of ext3 journal checksumming, if a journal block is corrupted (e.g. by a partial write) and a crash then requires replay of that journal block, horrible things will happen - presumably garbage is written to various places on disk from the 'journal' entry. One symptom may be log entries saying 'write beyond end of partition', which I've seen a few times with ext3 corruption and which I think is a clear indicator of corrupt filesystem metadata.
This is one reason why JBD2 added journal checksumming for use with ext4 - I hope this also gets used by ext3. In my view, it would be a lot better to make that change to ext3 than to make data=writeback the default, which will speed up some workloads and most likely corrupt some additional data (though I guess not metadata).
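On ext4 the checksums are requested at mount time; a sketch (device and mount point are examples):

    # enable checksumming of journal commit blocks on an ext4 filesystem
    mount -o journal_checksum /dev/sda2 /mnt/data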
Posted Sep 10, 2009 21:05 UTC (Thu)
by Cato (guest, #7643)
[Link] (1 responses)
Posted Sep 11, 2009 16:33 UTC (Fri)
by nix (subscriber, #2304)
[Link]
...notice that 'hey, this doesn't look like a journal' once the record that spanned the block boundary is complete. But that's a bit late...

(this is all supposition from postmortems of shagged systems. Thankfully we no longer use hardware prone to this!)
Posted Sep 1, 2009 23:49 UTC (Tue)
by dododge (guest, #2870)
[Link] (3 responses)
I don't use EXT3 much, but from a quick googling it sounds like you have to explicitly turn on barrier support in fstab, and even then it won't warn you about the LVM issue until it actually tries to use a barrier.
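A hedged example of doing that for ext3 (the fstab line is illustrative, and whether barrier=1 is honoured depends on the kernel and the block layers underneath):

    # /etc/fstab entry requesting barriers explicitly:
    #   /dev/sda1  /  ext3  defaults,barrier=1  0  1
    # the same thing at runtime:
    mount -o remount,barrier=1 /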
Posted Sep 2, 2009 7:18 UTC (Wed)
by Cato (guest, #7643)
[Link]
Posted Sep 3, 2009 7:18 UTC (Thu)
by job (guest, #670)
[Link] (1 responses)
Posted Sep 3, 2009 10:23 UTC (Thu)
by Cato (guest, #7643)
[Link]
Ext3 and write caching by drives are the data killers...
> this is mandatory unless you have battery-backed cache to recover from
> failed writes. period, end of statement. if you don't do this you _will_
> lose data when you lose power.
What's wrong with providing battery backup for your drives? If they have power for a little while after the last write request arrives, then write caching, re-ordering writes, and lying about what's already on the disk don't matter. You still need to do backups, of course, but you'll need to use them less often because your file system will be that much less likely to have been corrupted.