
Ext3 and RAID: silent data killers?

By Jonathan Corbet
August 31, 2009
Technologies such as filesystem journaling (as used with ext3) or RAID are generally adopted with the purpose of improving overall reliability. Some system administrators may thus be a little disconcerted by a recent linux-kernel thread suggesting that, in some situations, those technologies can actually increase the risk of data loss. This article attempts to straighten out the arguments and reach a conclusion about how worried system administrators should be.

The conversation actually began last March, when Pavel Machek posted a proposed documentation patch describing the assumptions that he saw as underlying the design of Linux filesystems. Things went quiet for a while, before springing back to life at the end of August. It would appear that Pavel had run into some data-loss problems when using a flash drive with a flaky connection to the computer; subsequent tests done by deliberately removing active drives confirmed that it is easy to lose data that way. He hadn't expected that:

Before I pulled that flash card, I assumed that doing so is safe, because flashcard is presented as block device and ext3 should cope with sudden disk disconnects. And I was wrong wrong wrong. (Noone told me at the university. I guess I should want my money back).

In an attempt to prevent a surge in refund requests at universities worldwide, Pavel tried to get some warnings put into the kernel documentation. He has run into a surprising amount of opposition, which he (and some others) have taken as an attempt to sweep shortcomings in Linux filesystems under the rug. The real story, naturally, is a bit more complex.

Journaling technology like that used in ext3 works by writing some data to the filesystem twice. Whenever the filesystem must make a metadata change, it will first gather together all of the block-level changes required and write them to a special area of the disk (the journal). Once it is known that the full description of the changes has made it to the media, a "commit record" is written, indicating that the filesystem code is committed to the change. Once the commit record is also safely on the media, the filesystem can start writing the metadata changes to the filesystem itself. Should the operation be interrupted (by a power failure, say, or a system crash or abrupt removal of the media), the filesystem can recover the plan for the changes from the journal and start the process over again. The end result is to make metadata changes transactional; they either happen completely or not at all. And that should prevent corruption of the filesystem structure.
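
As a rough sketch of that ordering (and nothing more: ext3 keeps its journal in reserved disk blocks rather than in files, and the paths and the "metadata change" below are invented), the commit protocol looks like this:

    #!/bin/sh
    # Toy illustration of the write-ahead/commit-record ordering described above.
    J=/var/tmp/toy-journal
    mkdir -p "$J"

    apply_metadata_change() {
        :    # stand-in for writing the real metadata blocks in place
    }

    # 1. Describe the intended change in the journal and force it to stable storage.
    echo "append block 1234 to inode 42" > "$J/pending"
    sync

    # 2. Only once the description is safely on media, write the commit record.
    touch "$J/commit"
    sync

    # 3. Apply the change to its real location. A crash after step 2 is handled by
    #    replaying the description; a crash before step 2 simply discards it.
    apply_metadata_change
    sync

    # 4. The transaction is complete; the journal space can be reused.
    rm -f "$J/pending" "$J/commit"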

One thing worth noting here is that actual data is not normally written to the journal, so a certain amount of recently-written data can be lost in an abrupt failure. It is possible to configure ext3 (and ext4) to write data to the journal as well, but, since the performance cost is significant, this option is not heavily used. So one should keep in mind that most filesystem journaling is there to protect metadata, not the data itself. Journaling does provide some data protection anyway - if the metadata is lost, the associated data can no longer be found - but that's not its primary reason for existing.
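
For those who want it anyway, full data journaling is just a mount option; for example (device name illustrative):

    mount -o data=journal /dev/sdb1 /mnt/data

    # or, persistently, in /etc/fstab:
    # /dev/sdb1   /mnt/data   ext3   data=journal   0   2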

It is not the lack of journaling for data which has created grief for Pavel and others, though. The nature of flash-based storage makes another "interesting" failure mode possible. Filesystems work with fixed-size blocks, normally 4096 bytes on Linux. Storage devices also use fixed-size blocks; on traditional rotating media, those blocks are 512 bytes in length, though larger block sizes are on the horizon. The key point is that, on a normal rotating disk, the filesystem can write a block without disturbing any unrelated blocks on the drive.

Flash storage also uses fixed-size blocks, but they tend to be large - typically tens to hundreds of kilobytes. Flash blocks can only be rewritten as a unit, so writing a 4096-byte "block" at the operating system level will require a larger read-modify-write cycle within the flash drive. It is certainly possible for a careful programmer to write flash-drive firmware which does this operation in a safe, transactional manner. It is also possible that the flash drive manufacturer was rather more interested in getting a cheap device to market quickly than in careful programming. In the commodity PC hardware market, that possibility becomes something much closer to a certainty.
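
That read-modify-write cycle can be sketched with ordinary tools; the block size, offsets, and file names below are invented, and the steps merely emulate what a naive controller might do internally:

    IMG=flash.img            # stand-in for the raw flash array
    EB=262144                # assume a 256KB erase block
    N=3                      # the erase block containing the page being written

    # 1. Read the whole erase block into a buffer.
    dd if=$IMG of=block.buf bs=$EB skip=$N count=1

    # 2. Patch the single 4096-byte page the host actually asked to write.
    dd if=new-page.bin of=block.buf bs=4096 seek=12 count=1 conv=notrunc

    # 3. Erase the block on the media (emulated by filling it with 0xff bytes).
    #    A power loss here takes out every page in the block, not just the one
    #    being updated.
    tr '\0' '\377' < /dev/zero | head -c $EB > erased.buf
    dd if=erased.buf of=$IMG bs=$EB seek=$N count=1 conv=notrunc

    # 4. Write the patched buffer back into place.
    dd if=block.buf of=$IMG bs=$EB seek=$N count=1 conv=notrunc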

What this all means is that, on a low-quality flash drive, an interrupted write operation could result in the corruption of blocks unrelated to that operation. If the interrupted write was for metadata, a journaling filesystem will redo the operation on the next mount, ensuring that the metadata ends up in its intended destination. But the filesystem cannot know about any unrelated blocks which might have been trashed at the same time. So journaling will not protect against this kind of failure - even if it causes the sort of metadata corruption that journaling is intended to prevent.

This is the "bug" in ext3 that Pavel wished to document. He further asserted that journaling filesystems can actually make things worse in this situation. Since a full fsck is not normally required on journaling filesystems, even after an improper dismount, any "collateral" metadata damage will go undetected. At best, the user may remain unaware for some time that random data has been lost. At worst, corrupt metadata could cause the code to corrupt other parts of the filesystem over the course of subsequent operation. The skipped fsck may have enabled the system to come back up quickly, but it has done so at the risk of letting corruption persist and, possibly, spread.

One could easily argue that the real problem here is the use of hidden translation layers to make a flash device look like a normal drive. David Woodhouse did exactly that:

This just goes to show why having this "translation layer" done in firmware on the device itself is a _bad_ idea. We're much better off when we have full access to the underlying flash and the OS can actually see what's going on. That way, we can actually debug, fix and recover from such problems.

The manufacturers of flash drives have, thus far, proved impervious to this line of reasoning, though.

There is a similar failure mode with RAID devices which was also discussed. Drives can be grouped into a RAID5 or RAID6 array, with the result that the array as a whole can survive the total failure of any drive within it. As long as only one drive fails at a time, users of RAID arrays can rest assured that the smoke coming out of their array is not taking their data with it.

But what if more than one drive fails? RAID 5 and RAID 6 work by grouping blocks into larger stripes and storing parity blocks for those stripes. Updating a block requires rewriting both that block and the stripe's parity. So, if writing a block can cause the array to lose the entire stripe, we could see data loss much like that which can happen with a flash drive. As a normal rule, this kind of loss will not occur with a RAID array. But it can happen if (1) one drive has already failed, causing the array to run in "degraded" mode, and (2) a second failure occurs (Pavel pulls the power cord, say) while the write is happening.
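
With the Linux md driver, at least, an administrator who suspects trouble can ask for the parity to be verified explicitly; a sketch using an illustrative array name:

    echo check > /sys/block/md0/md/sync_action   # scrub: read everything, compare parity
    cat /sys/block/md0/md/mismatch_cnt           # non-zero after the scan means inconsistent stripes
    # writing "repair" instead of "check" rewrites parity to match the data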

Pavel concluded from this scenario that RAID devices may actually be more dangerous than storing data on a single disk; he started a whole separate subthread (under the subject "raid is dangerous but that's secret") to that effect. This claim caused a fair amount of concern on the list; many felt that it would push users to forgo technologies like RAID in favor of single, non-redundant drive configurations. Users who do that will avoid the possibility of data loss resulting from a specific, unlikely double failure, but at the cost of rendering themselves entirely vulnerable to a much more likely single failure. The end result would be a lot more data lost.

The real lessons from this discussion are fairly straightforward:

  • Treat flash drives with care, do not expect them to be more reliable than they are, and do not remove them from the system until all writes are complete.

  • RAID arrays can increase data reliability, but an array which is not running with its full complement of working, populated drives has lost the redundancy which provides that reliability. If the consequences of a second failure would be too severe, one should avoid writing to arrays running in degraded mode.

  • As Ric Wheeler pointed out, the easiest way to lose data on a Linux system is to run the disks with their write cache enabled. This is especially true on RAID5/6 systems, where write barriers are still not properly supported. There has been some talk of disabling drive write caches and enabling barriers by default, but no patches have been posted yet; see the example commands after this list.

  • There is no substitute for good backups. Your editor would add that any backups which have not been checked recently have a strong chance of not being good backups.
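
As an illustration of the write-cache and barrier advice in the list above (device names are examples; note that, as mentioned, barriers do not currently help on RAID5/6):

    hdparm -W0 /dev/sda              # turn off the drive's volatile write cache
    mount -o remount,barrier=1 /     # ask ext3 to issue write barriers

    # the /etc/fstab equivalent:
    # /dev/sda1   /   ext3   errors=remount-ro,barrier=1   0   1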

How this information will be reflected in the kernel documentation remains to be seen. Some of it seems like the sort of system administration information which is not normally considered appropriate for inclusion in the documentation of the kernel itself. But there is value in knowing what assumptions one's filesystems are built on and what the possible failure modes are. A better understanding of how we can lose data can only help us to keep that from actually happening.


Ext3 and RAID: silent data killers?

Posted Aug 31, 2009 21:32 UTC (Mon) by chrish (guest, #351) [Link] (3 responses)

One small nit. RAID6 protects from a double failure (that's the whole point of it). So people who are worried about the impact of a double failure (which *is* bad) on their RAID5 array should run RAID6 instead.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 6:05 UTC (Tue) by k8to (guest, #15413) [Link] (2 responses)

From what we know about device failure patterns, everyone should be worried about this.

partially degraded raid 6 _IS_ vulnerable to partial writes on power failure

Posted Sep 1, 2009 17:33 UTC (Tue) by dlang (guest, #313) [Link] (1 responses)

Neil Brown posted a message explaining how RAID 6 is still vulnerable to unclean shutdown problems

Date: Wed, 26 Aug 2009 09:28:50 +1000
From: Neil Brown <neilb@suse.de>
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible

On Monday August 24, greg.freemyer@gmail.com wrote:
> > +Don't damage the old data on a failed write (ATOMIC-WRITES)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > + Because RAM tends to fail faster than rest of system during
> > + powerfail, special hw killing DMA transfers may be necessary;
> > + otherwise, disks may write garbage during powerfail.
> > + This may be quite common on generic PC machines.
> > +
> > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > + because it needs to write both changed data, and parity, to
> > + different disks. (But it will only really show up in degraded mode).
> > + UPS for RAID array should help.
>
> Can someone clarify if this is true in raid-6 with just a single disk
> failure? I don't see why it would be.

It does affect raid6 with a single drive missing.

After an unclean shutdown you cannot trust any Parity block as it
is possible that some of the blocks in the stripe have been updated,
but others have not. So you must assume that all parity blocks are
wrong and update them. If you have a missing disk you cannot do that.

To take a more concrete example, imagine a 5 device RAID6 with
3 data blocks D0 D1 D2 as well as P and Q on some stripe.
Suppose that we crashed while updating D0, which would have involved
writing out D0, P and Q.
On restart, suppose D2 is missing. It is possible that 0, 1, 2, or 3
of D0, P and Q have been updated and the others not.
We can try to recompute D2 from D0 D1 and P, from
D0 P and Q or from D1, P and Q.

We could conceivably try each of those and if they all produce the
same result we might be confident of it.
If two produced the same result and the other was different we could
use a voting process to choose the 'best'. And in this particular
case I think that would work. If 0 or 3 had been updated, all would
be the same. If only 1 was updated, then the combinations that
exclude it will match. If 2 were updated, then the combinations that
exclude the non-updated block will match.

But if both D0 and D1 were being updated I think there would be too
many combinations and it would be very possible that all three
computed values for D2 would be different.

So yes: a singly degraded RAID6 cannot promise no data corruption
after an unclean shutdown. That is why "mdadm" will not assemble such
an array unless you use "--force" to acknowledge that there has been a
problem.

NeilBrown

partially degraded raid 6 _IS_ vulnerable to partial writes on power failure

Posted Sep 4, 2009 23:32 UTC (Fri) by giraffedata (guest, #1954) [Link]

But you don't have to build a RAID6 array that way. Ones I've looked at use a journaling scheme to provide atomic parity update. No matter where you get interrupted in the middle of updating a stripe, you can always get back the pre-update parity-consistent stripe (minus whatever 1 or 2 components that might have died at the same time).

I suspect Linux 'md' doesn't have the resources to do this feasibly, but a SCSI RAID6 unit probably would. I don't expect there's much market for the additional component loss protection of RAID6 without getting the interrupted write protection too.

Ext3 and RAID: silent data killers?

Posted Aug 31, 2009 21:33 UTC (Mon) by yohahn (guest, #4107) [Link] (6 responses)

Given the second lesson provided above, is there a standard method for remounting a volume, read-only, if the underlying raid becomes degraded?

I can imagine much script writing, but if it is a general need, shouldn't there be a general method?

(an even more fun question would be, "Will applications fail in a reasonable fashion if they suddenly have their backing store mounted read-only".)

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 11:54 UTC (Tue) by Klavs (guest, #10563) [Link] (5 responses)

mount -o remount,ro /

Will remount the device mounted on / - read-only.

Ext3 and RAID: silent data killers?

Posted Sep 3, 2009 15:37 UTC (Thu) by yohahn (guest, #4107) [Link] (4 responses)

Right, but what is the standard method to trigger this when a RAID goes degraded? (and how will applications respond to their disks becoming read-only suddenly)

Ext3 and RAID: silent data killers?

Posted Sep 3, 2009 21:42 UTC (Thu) by Cato (guest, #7643) [Link] (1 responses)

When disks are remounted read-only, there is no message to the end user (at least on Ubuntu) - as a result, one PC where this happened had a read-only root filesystem for weeks until I noticed this from the logs.

Sending notifications of serious system events like this would be very helpful, with a bundle of standard event filters that can easily generate an email or other alert.

Ext3 and RAID: silent data killers?

Posted Sep 11, 2009 17:59 UTC (Fri) by jengelh (guest, #33263) [Link]

You can only remount ro when there are no files open in write mode. And there usually are on /.

Ext3 and RAID: silent data killers?

Posted Sep 4, 2009 8:22 UTC (Fri) by njs (subscriber, #40338) [Link]

mdadm has a hook to let you run an arbitrary script when a RAID device changes state (degrades, etc.); I don't have a remount -o ro script handy myself, though.
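
For reference, that hook is mdadm's monitor mode: point the PROGRAM line in mdadm.conf at a handler script and make sure mdadm --monitor is running. A minimal sketch (the array, mount point, and handler path are assumptions for the example):

    # in /etc/mdadm/mdadm.conf (the path varies by distribution):
    PROGRAM /usr/local/sbin/md-event-handler

    # /usr/local/sbin/md-event-handler:
    #!/bin/sh
    # mdadm --monitor invokes this with: <event> <md device> [<component device>]
    EVENT=$1
    MD_DEV=$2

    case "$EVENT" in
        Fail|DegradedArray)
            if [ "$MD_DEV" = "/dev/md0" ]; then
                logger "md event $EVENT on $MD_DEV: remounting /srv read-only"
                mount -o remount,ro /srv
            fi
            ;;
    esac

The monitor itself is typically started at boot with something like mdadm --monitor --scan --daemonise.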

Ext3 and RAID: silent data killers?

Posted Sep 4, 2009 23:39 UTC (Fri) by giraffedata (guest, #1954) [Link]

> (and how will applications respond to their disks becoming read-only suddenly)

I think this would be only marginally better than shutting the whole system down. In many ways it would be worse, since users have ways to deal with a missing system but not a system acting strangely.

Ext3 and RAID: silent data killers?

Posted Aug 31, 2009 21:49 UTC (Mon) by me@jasonclinton.com (subscriber, #52701) [Link] (25 responses)

I'm a little surprised at the lack of any discussion of RAID battery backups.
All RAID enclosures and RAID host-adapters worth their salt have a BBU
(battery backup unit) option for exactly this purpose. Why the small block
write buffer on an SSD cannot be backed up by a suitably small zinc-air
battery is a mystery to me.

Ext3 and RAID: silent data killers?

Posted Aug 31, 2009 22:31 UTC (Mon) by proski (subscriber, #104) [Link] (12 responses)

I imagine that at least the more expensive SSDs have something like that, or maybe they can finish the write if the external power is disconnected. But when it comes to flash cards, like those used in digital cameras, the cost difference would be prohibitive.

Ext3 and RAID: silent data killers?

Posted Aug 31, 2009 22:34 UTC (Mon) by me@jasonclinton.com (subscriber, #52701) [Link] (11 responses)

SD, MemoryStick, and CF cards do not have firmware.

Ext3 and RAID: silent data killers?

Posted Aug 31, 2009 22:45 UTC (Mon) by pizza (subscriber, #46) [Link] (10 responses)

of course they have firmware; how else would (for example) a CF card translate the ATA commands into individual read/write ops on the appropriate flash chips and deal with wear levelling?

Granted, that "firmware" may be in the form fo mask ROM, but I know of at least one case where a CF card had a firmware update released for it.

SD and MS are a lot simpler, but even they require something to translate the SD/MS wire protocols into flash read/write ops.

Ext3 and RAID: silent data killers?

Posted Aug 31, 2009 22:52 UTC (Mon) by me@jasonclinton.com (subscriber, #52701) [Link] (9 responses)

Sorry, you're right about CF. I haven't seen one of those in ages.

Ext3 and RAID: silent data killers?

Posted Aug 31, 2009 23:27 UTC (Mon) by drag (guest, #31333) [Link] (8 responses)

AND SD and Memorystick and any other remotely consumer-related device.

They all are 'smart devices'.

If it was not for the firmware MTD-to-Block translation then you could not use them in Windows and they could not be formatted Fat32.

When I have dealt with Flash in the past, the raw flash type, the flash just appears as a memory region. Like I have this old i386 board I am dealing with that has its flash just starting at 0x80000 and it goes on for about eight megs or so.

That's it. That's all the hardware does for you. You have to know then how to communicate with it and its underlying structure and know the proper way to write to it and everything. All that has to be done in software.

I suppose most of that is rather old fashioned.. the flash was soldered directly to the traces on the board.

I can imagine it would be quite difficult and would require new hardware protocols to allow an OS to manage flash directly and properly over something like SATA or USB.

But fundamentally MTD are quite a bit different from Block devices. It's a different class of I/O completely. Just like how a character device like a mouse or a keyboard can't be written to with Fat32. You can fake MTD by running a Block-to-MTD layer on SD flash or a file or anything else and some people think that helps with wear leveling, but I think that is foolish and may actually end up being self-defeating as you have no idea how the algorithms in the firmware work.

Ext3 and RAID: silent data killers?

Posted Aug 31, 2009 23:59 UTC (Mon) by BenHutchings (subscriber, #37955) [Link]

> AND SD and Memorystick and any other remotely consumer-related device. They all are 'smart devices'.

Not all. SmartMedia, xD and Memory Stick variants provide a raw flash interface - that's a major reason why they have had to be revised repeatedly to allow for higher-capacity chips. They rely on an external controller to do write-buffering, and do not support any wear-leveling layer.

> When I have dealt with Flash in the past, the raw flash type, the flash just appears as a memory region

It is possible for a flash controller to map NOR flash into memory since it is random-access for reading. However, large flash chips are all NAND flash which only supports block reads.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 0:00 UTC (Tue) by me@jasonclinton.com (subscriber, #52701) [Link] (6 responses)

Isn't the ATA/MMC<->MTD translation done in the consumer "reader" that you stick these devices in? CF is electrically compatible with ATA. That's not even remotely the case with the electrical interfaces on either SD or MS.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 0:52 UTC (Tue) by drag (guest, #31333) [Link] (3 responses)

Maybe. I don't think so. At least not for SD.

Remember that SD stands for 'Secure Digital' and is DRM'd. So there has to be some smarts in it to do that.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 6:22 UTC (Tue) by Los__D (guest, #15263) [Link] (2 responses)

Almost no SD support the DRM features, according to Wikipedia.

(Still doesn't change the point, though. SDs are probably designed with internal firmware)

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 9:36 UTC (Tue) by alonz (subscriber, #815) [Link] (1 responses)

That's not what Wikipedia says—they say few devices support CPRM. Which is more-or-less true—almost no devices in the western market use CPRM, while in Japan every single device does (it is required as part of i-Mode, which is mandated by DoCoMo).

As for firmware, the SD card interface (available for free at www.sdcard.org) defines accesses in terms of 512-byte “logical” sectors, practically mandating that the card implement a flash translation layer.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 12:49 UTC (Tue) by Los__D (guest, #15263) [Link]

Doh, of course.

I read "devices" as the SD cards themselves.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 16:57 UTC (Tue) by Baylink (guest, #755) [Link]

Generally, I think that's true, yes; the only small-flash technology that actually *looks like an ATA drive at the connector* is CF; the others require a smart reader to do the interfacing -- which may itself *not* look like ATA at the back; there are clearly other better ways to do this stuff.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 17:37 UTC (Tue) by iabervon (subscriber, #722) [Link]

SD is MMC with a few extra features nobody uses. The readers do USB-storage<->SD, but SD is still 512-byte chunks (it's a card-reported value, and the host can actually try changing it, but 512 is the only value that is ever supported).

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 3:16 UTC (Tue) by zlynx (guest, #2285) [Link] (2 responses)

All the RAID discussion on the list was about the Linux MD/DM software RAID. It isn't as reliable as other options.

From what I gather, MD does not use write-intent logging by default, and when it is enabled it is very inefficient. Probably because it doesn't spread the write intent logs around the disks. Also, MD does not detect an unclean shutdown, so it does not start a RAID scrub and go into read+verify mode. And all that is a problem even when the array isn't degraded.

And of course it doesn't have a battery backup. :)

All that said, Linux MD hasn't given me any problems, and I prefer it over most cheap hardware RAID.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 5:14 UTC (Tue) by neilbrown (subscriber, #359) [Link] (1 responses)

I don't think "is very inefficient" is correct. There is a real performance impact, but the size of that impact is very dependent on workload and configuration. It is easy to add or remove the write intent logging while the array is active, so there is minimal cost in experimenting to see what impact it really has in an given usage.

And MD most certainly does detect an unclean shutdown and will validate all parity blocks on restart.

But you are right that it doesn't have battery backup. If fast NVRAM were available on commodity server hardware, I suspect we would get support for it in md/raid5 in fairly short order.
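
(The write-intent logging in question is md's bitmap support; toggling it on a live array looks something like this, with /dev/md0 as an example:)

    mdadm --grow --bitmap=internal /dev/md0    # add a write-intent bitmap
    mdadm --grow --bitmap=none /dev/md0        # remove it again after measuring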

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 15:30 UTC (Tue) by zlynx (guest, #2285) [Link]

As far as I could tell, MD will not verify parity during regular reads while the array is unclean.

It may start a background verify, although it seemed to me that was dependent on what the distro's startup scripts did...

Ext3 and RAID: silent data killers?

Posted Sep 2, 2009 0:09 UTC (Wed) by Richard_J_Neill (subscriber, #23093) [Link] (8 responses)

> I'm a little surprised at lack of any discussion of RAID battery backups.
> All RAID enclosures and RAID host-adapters worth their salt have a BBU
>(battery backup unit) option for exactly this purpose.

Yes...but it's only good for a few hours. So if your power outage lasts more than that, then the BBWC (battery backed write cache) is still toast.

On a related note, I've just bought a pair of HP servers (DL380s) and an IBM X3550. It's very annoying that there is no way to buy either of these without hardware raid, nor can the raid card be turned off in the BIOS. For proper reliability, I only really trust software (md) raid in Raid 1 mode (with write caching off). [Aside: this kills the performance for database workloads (fdatasync), though the Intel X25-E SSDs outperform 10k SAS drives by a factor of about 12.]

Ext3 and RAID: silent data killers?

Posted Sep 2, 2009 0:13 UTC (Wed) by dlang (guest, #313) [Link] (7 responses)

actually, in many cases the batteries for the raid cards can last for several weeks.

Ext3 and RAID: silent data killers?

Posted Sep 4, 2009 10:32 UTC (Fri) by nix (subscriber, #2304) [Link] (6 responses)

Yes indeed. My Areca RAID card claims a month's battery life. Thankfully I've never had cause to check this, but I guess residents of Auckland in the 1998 power-generation fiasco would have liked it. :)

Ext3 and RAID: silent data killers?

Posted Sep 5, 2009 0:01 UTC (Sat) by giraffedata (guest, #1954) [Link] (5 responses)

The battery in a RAID adapter card only barely addresses the issue; I don't care how long it lives.

But the comment also addressed "RAID enclosures," which I take to mean storage servers that use RAID technology. Those, if they are at all serious, have batteries that power the disk drives as well, if only for a few seconds -- long enough to finish the write. It's not about backup power, it's about a system in which data is always consistent and persistent, even if someone pulled a power cord at some point.

Ext3 and RAID: silent data killers?

Posted Sep 5, 2009 0:31 UTC (Sat) by dlang (guest, #313) [Link]

actually, there are a LOT of enclosures that don't provide battery backup for the drives at all, just for the cache.

it's possible that they have heavy duty power supplies that keep power up for a fraction of a second after the power fail signal goes out to the drives, but they definitely do not keep the drives spinning long enough to flush their caches

Ext3 and RAID: silent data killers?

Posted Sep 7, 2009 22:47 UTC (Mon) by nix (subscriber, #2304) [Link] (3 responses)

Ah, I see, the point is that even if you turn off the power *and pull the
disk* halfway through a write, the disk state is still consistent? Yeah,
battery-backed cache alone obviously can't ensure that.

Ext3 and RAID: silent data killers?

Posted Sep 7, 2009 23:18 UTC (Mon) by giraffedata (guest, #1954) [Link] (2 responses)

> Ah, I see, the point is that even if you turn off the power *and pull the disk* halfway through a write, the disk state is still consistent? Yeah, battery-backed cache alone obviously can't ensure that.

No one said anything about pulling a disk. I did mention pulling a power cord. I meant the power cord that supplies the RAID enclosure (storage server).

A RAID enclosure with a battery inside that powers only the memory can keep the data consistent in the face of a power cord pull, but fails the persistence test, because the battery eventually dies. I think when people think persistent, they think indefinite. High end storage servers do in fact let you pull the power cord and not plug it in again for years and still be able to read back all the data that was completely written to the server before the pull. Some do it by powering disk drives (not necessarily the ones that normally hold the data) for a few seconds on battery.

Also, I think some people expect of persistence that you can take the machine, once powered down, apart and put it back together and the data will still be there. Battery backed memory probably fails that test.

Ext3 and RAID: silent data killers?

Posted Sep 8, 2009 4:56 UTC (Tue) by dlang (guest, #313) [Link] (1 responses)

I don't know what 'high end storage servers' you are talking about; even the multi-million dollar arrays from EMC and IBM do not have the characteristics that you are claiming.

Ext3 and RAID: silent data killers?

Posted Sep 8, 2009 6:25 UTC (Tue) by giraffedata (guest, #1954) [Link]

> I don't know what 'high end storage servers' you are talking about; even the multi-million dollar arrays from EMC and IBM do not have the characteristics that you are claiming.

Now that you mention it, I do remember that earlier IBM Sharks had nonvolatile storage based on a battery. Current ones don't, though. The battery's only job is to allow the machine to dump critical memory contents to disk drives after a loss of external power. I think that's the trend, but I haven't kept up on what EMC, Hitachi, etc. are doing. IBM's other high end storage server, the former XIV Nextra, is the same.

Are these issues really unique to Ext3?

Posted Aug 31, 2009 23:22 UTC (Mon) by pr1268 (guest, #24648) [Link] (1 responses)

I can't imagine that these issues (especially the flash-based disk problems Pavel experienced) are unique to Ext3. Microsoft's NTFS and Mac OS X's HFS+ are both journaling file systems that perform transactional commits, aren't they?

Are these issues really unique to Ext3?

Posted Sep 2, 2009 7:01 UTC (Wed) by Cato (guest, #7643) [Link]

See http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf for a good paper on how hard disks fail, including partial failures, and the failure/recovery behaviour of ext3, reiserfs, JFS and NTFS. Also talks about ixt3, an ext3 variant by the authors that's intended to have a stronger and more consistent failure/recovery model. This includes journal and data block checksums.

There was a 'journal checksum patch' for ext3 by the authors, I think, but I can't track it down.

Not sure about NTFS, but HFS+ does seem to have journal checksums - see http://developer.apple.com/mac/library/technotes/tn/tn115...

Journaling no protection against power drop

Posted Sep 1, 2009 0:41 UTC (Tue) by ncm (guest, #165) [Link] (13 responses)

Every time I see discussion of a journaling file system, somebody implies they expect it to protect against corruption on power interruptions. I recall somebody who reported repeatedly turning his system on and off just to revel in watching it recover (without fsck) each time. It's not clear where people got the idea, but it's dead clear that people promoting journaling file systems have, again and again, utterly failed to make it clear that a journaling file system does not protect against corruption on power interruptions. (Crashes, usually. Power drops, often, but no promises.)

Pavel's complaint is a consequence of this failure.

I note in passing that there is no need, in general, for a file system that journals data as well as metadata to write the data twice. That's just a feature of common, naive designs. Journaling data along with metadata, howsoever sophisticated, doesn't protect against power drops either. Nothing does. Use battery backup if you care. A few seconds is enough if you can shut down the host fast enough.

Journaling no protection against power drop

Posted Sep 1, 2009 6:40 UTC (Tue) by IkeTo (subscriber, #2122) [Link] (11 responses)

My understanding is that with a "good" system (e.g., the hard disk writes in whole disk blocks, RAM power does not go off while the hard disk still has enough power to write out the data in it, etc.), a journaling file system does offer protection against corruption of the filesystem (and, depending on the filesystem actually being used, corruption of the data already written). I believe you think I'm wrong, and I'd like to understand your reasoning.

Journaling no protection against power drop

Posted Sep 1, 2009 7:58 UTC (Tue) by ncm (guest, #165) [Link] (10 responses)

It seems to be extremely common to believe in disks that use the spinning-down platter to drive the motor as a generator to power a last few sectors' writes. When I speak to engineers at drive manufacturers, they say it's a myth. (It might have been done decades ago.) They say they happily stop writing halfway in the middle of a sector, and respond to power drop only by parking the head.

Some drives only report blocks written to the platter after they really have been, but that's bad for benchmarks, so most drives fake it, particularly when they detect benchmark-like behavior. Everyone serious about reliability uses battery backup, so whoever's left isn't serious, and (they reason) deserve what they get, because they're not paying. Building in better reliability manifestly doesn't improve sales or margins.

If you pay twice as much for a drive, you might get better behavior. Or you might only pay more.

If you provide a few seconds' battery backup for the drive but not the host, then the blocks in the buffer that the drive said were on the disk get a chance to actually get there.

Journaling no protection against power drop

Posted Sep 1, 2009 17:09 UTC (Tue) by Baylink (guest, #755) [Link]

> most drives fake it, particularly when they detect benchmark-like behavior.

{{citation-needed}}

> If you pay twice as much for a drive, you might get better behavior. Or you might only pay more.

I generally find the difference per GB to be 6:1 going from even enterprise SATA drives to Enterprise SCSI (U-160 or faster, 10K or faster). My experience is that I get what I pay for, YMMV.

Journaling no protection against power drop

Posted Sep 1, 2009 17:20 UTC (Tue) by markusle (guest, #55459) [Link] (1 responses)

> Some drives only report blocks written to the platter after they really
> have been, but that's bad for benchmarks, so most drives fake it,
> particularly when they detect benchmark-like behavior.

I'd be very interested in some additional references or a list of drives
that do or don't do this.

Journaling no protection against power drop

Posted Sep 1, 2009 17:44 UTC (Tue) by ncm (guest, #165) [Link]

Start by looking at very, very expensive, slow drives. Then forget about them. Instead, rely on redundancy and battery backup. There are lots of companies that aggregate cheap disks, batteries, cache, and power in a nice box, and each charges what they can get for it. Some work well, others less so. Disk arrays work like insurance: spread the risk, and cover for failures. Where they inadvertently concentrate risk, you get it all.

The storage industry is as mature as any part of the computer business. It is arranged such as to allow you to spend as much money as you like, and can happily absorb as much as you throw at it. If you know what you're doing, you can get full value for your money. If you don't know what you're doing, you can spend just as much and get little more value than the raw disks in the box. There is no substitute for competence.

Journaling no protection against power drop

Posted Sep 1, 2009 23:28 UTC (Tue) by dododge (guest, #2870) [Link]

> They say they happily stop writing halfway in the middle of a sector, and respond to power drop only by parking the head.

The old DeskStar drive manual (circa 2002) explicitly stated that power loss in the middle of a write could lead to partially-written sectors, which would trigger a hard error if you tried to read them later on. According to an LKML discussion back then, the sectors would stay in this condition indefinitely and would not be remapped; so the drive would continue to throw hard errors until you manually ran a repair tool to find and fix them.

Journaling no protection against power drop

Posted Sep 5, 2009 0:10 UTC (Sat) by giraffedata (guest, #1954) [Link]

> If you provide a few seconds' battery backup for the drive but not the host, then the blocks in the buffer that the drive said were on the disk get a chance to actually get there.

But then you also get the garbage that the host writes in its death throes (e.g. update of a random sector) while the drive is still up.

To really solve the problem, you need much more sophisticated shutdown sequencing.

Journaling no protection against power drop

Posted Sep 8, 2009 20:54 UTC (Tue) by anton (subscriber, #25547) [Link] (2 responses)

> [Engineers at drive manufacturers] say they happily stop writing halfway in the middle of a sector, and respond to power drop only by parking the head.

The results from my experiments on cutting power on disk drives are consistent with the theory that the drives I tested complete the sector they write when the power goes away. However, I have seen drives that corrupt sectors on unusual power conditions; the manufacturers of these drives (IBM, Maxtor) and their successors (Hitachi) went to my don't-buy list and are still there.

> Some drives only report blocks written to the platter after they really have been, but that's bad for benchmarks, so most drives fake it, particularly when they detect benchmark-like behavior.

Write-back caching (reporting completion before the data hits the platter) is normally enabled in PATA and also SATA drives (running benchmarks or not), because without tagged commands (mostly absent in PATA, and not universally supported for SATA) performance is very bad otherwise. You can disable that with hdparm -W0. Or you can ask for barriers (e.g., as an ext3 mount option), which should give the same consistency guarantees at lower cost if the file system is implemented properly; however, my trust in the proper implementation in Linux is severely undermined by the statements that some prominent kernel developers have made in recent months on file systems.

> Everyone serious about reliability uses battery backup

Do you mean a UPS? So how does that help when the UPS fails? Yes, we have had that (while power was alive), and we concluded that our power grid is just as reliable as a UPS. One could protect against a failing UPS with dual (redundant) power supplies and dual UPSs, but that would probably double the cost of our servers. A better option would be to have an OS that sets up the hardware for good reliability (i.e., disable write caching if necessary) and works hard in the OS to ensure data and metadata consistency. Unfortunately, it seems that that OS is not Linux.

Journaling no protection against power drop

Posted Sep 10, 2009 20:58 UTC (Thu) by Cato (guest, #7643) [Link] (1 responses)

Great to see your testing tool, I will try this out on a few spare hard drives to see what happens.

UPSs are useful at least to regulate the voltage and cover against momentary power cuts, which are very frequent where I live, and far more frequent than UPS failures in my experience.

Journaling no protection against power drop

Posted Sep 10, 2009 21:34 UTC (Thu) by anton (subscriber, #25547) [Link]

It depends on where you live. Here power outages are quite infrequent, but mostly take so long that the UPS will run out of power. So then the UPS only gives the opportunity for a clean shutdown (and that opportunity was never realized by our sysadmin when we had UPSs), and that is unnecessary if you have all of the following: decent drives that complete the last sector on power failure; a good file system; and a setup that gives the file system what it needs to stay consistent (e.g., barriers or hdparm -W0). And of course we have backups around if the worst comes to worst. And while we don't have the ultimate trust in ext3 and the hard drives we use, we have not yet needed the backups for that.

Journaling no protection against power drop

Posted Sep 10, 2009 9:00 UTC (Thu) by hensema (guest, #980) [Link] (1 responses)

> They say they happily stop writing halfway in the middle of a sector, and respond to power drop only by parking the head.

Which is no problem. The CRC for the sector will be incorrect, which will be reported to the host adapter. The host adapter will then reconstruct the data and write back the correct sector.

Of course you do need RAID for this.

Journaling no protection against power drop

Posted Sep 10, 2009 20:52 UTC (Thu) by Cato (guest, #7643) [Link]

And of course RAID has its own issues with RAID 6 really being required with today's large disks to protect against the separate-failure-during-rebuild case.

Without RAID, the operating system will have no idea the sector is corrupt - this is why I like ZFS's block checksumming, as you can get a list of files with corrupt blocks in order to restore from backup.

Journaling no protection against power drop

Posted Sep 1, 2009 16:10 UTC (Tue) by davecb (subscriber, #1574) [Link]

If you use flash as a commit cache and write both data and metadata, you can both protect against failure and gain performance. The one thing you can't protect against is losing the data being written to cache at the time the failure or power outage occurs.
This, by the way, is an oversimplified description of ZFS (;-))

--dave

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 1:35 UTC (Tue) by flewellyn (subscriber, #5047) [Link] (2 responses)

I can't believe a kernel developer thought that a degraded RAID5 having data corruption problems if it suffered power loss (or another fault) while still degraded, or rebuilding, is at all noteworthy.

It's right in the definition of RAID5: you cannot expect to maintain data integrity after a double-fault. The filesystem used on top of it is irrelevant.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 3:22 UTC (Tue) by zlynx (guest, #2285) [Link]

I think it's noteworthy.

Most people, myself included, expect a degraded RAID array to fail **only if another drive fails**. I do NOT expect to lose 256KB of data because a single 4KB write failed.

And in fact, the array didn't lose it. It just can't tell which 4KB went bad, which it could if MD did good write-intent logging.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 7:51 UTC (Tue) by job (guest, #670) [Link]

It's the fact that it's silent that is the problem. Writes fail under certain circumstances, and that's acceptable, but when a failed write perhaps affects other data silently and now your system reports healthy but you don't know if it really is... That's just .. ouch.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 4:09 UTC (Tue) by sureshb (guest, #25018) [Link] (1 responses)

If SSDs are smart devices, can they in future use some part of memory for journaling? All writes go through this and in the background the data is moved to its final destination. This will have a performance hit, but it shouldn't be that bad.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 7:45 UTC (Tue) by job (guest, #670) [Link]

There are experimental object storage devices where the devices themselves house the file system logic. I'm not sure I would want one of those, I'd prefer the opposite: stupid flash memory connected to the bus. Things are generally easier to fix in software.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 5:16 UTC (Tue) by k8to (guest, #15413) [Link] (18 responses)

RAID 5 is completely undependable. Anyone who uses RAID5 and expects reliability is ignorant or incompetent. It also has terrible performance.

RAID 6 is slightly less bad. If you want to avoid problems with crashes and outages, you should have multiple hot standbys. If you want performance you should use RAID 10.

Either way you should use a backup as your data loss reduction strategy.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 7:46 UTC (Tue) by job (guest, #670) [Link] (4 responses)

I would expect the same problem to affect RAID10, as a double fault can kill them too if you're very unlucky.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 8:05 UTC (Tue) by drag (guest, #31333) [Link] (1 responses)

Well, a single fault can destroy any data if you want to look at it that way... But generally with one drive gone either RAID 6 or RAID 10 should still be adequate.

With RAID 5 the amount of time it takes to recover is so long nowadays that the chances of having a double fault are pretty good. It was one thing to have 20GB with 30MB/s performance, but it's quite another to have 1000GB with 50MB/s performance...

Ext3 and RAID: silent data killers?

Posted Sep 11, 2009 1:18 UTC (Fri) by Pc5Y9sbv (guest, #41328) [Link]

I agree you cannot blindly use RAID5 without considering the sizing, but what do you consider an acceptable recovery time?

My cheap MD RAID5 with three 500 GB SATA drives allows me to have 1TB and approximately 100 MB/s per drive throughput, which implies a full scan to re-add a replacement drive might take 2 hours or so (reading all 500 GB from 2 drives and writing 500 GB to the third at 75% of full speed). I have never been in a position where this I/O time was worrisome as far as a double fault hazard. Having a commodity box running degraded for several days until replacement parts are delivered is a more common consumer-level concern, which has not changed with drive sizes.

Ext3 and RAID: silent data killers?

Posted Sep 3, 2009 5:05 UTC (Thu) by k8to (guest, #15413) [Link] (1 responses)

Double fault can kill raid 10 also, but you're much less likely to have the fault propagate as discussed in the article, and the downtime for bringing in a standby is much smaller, so standby drives are more effective.

Meanwhile, you also get vastly better performance, and higher reliability of implementation.

It's really a no brainer unless you're poor.

Ext3 and RAID: silent data killers?

Posted Sep 3, 2009 5:26 UTC (Thu) by dlang (guest, #313) [Link]

actually, if you have a read-mostly workload raid 5/6 can end up being as fast as raid 10. I couldn't believe this myself when I first ran into it, but I have a large (multiple TB) database used for archiving log data and discovered that read/search performance was the same with raid 6 as with raid 10.

in digging further I discovered that the key to performance was to have enough queries in flight to keep all disk heads fully occupied (one outstanding query per drive spindle), and you can do this with both raid 6 and raid 10.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 8:11 UTC (Tue) by drag (guest, #31333) [Link] (11 responses)

The way I look at it is like this:

RAID = availability/performance
BACKUPS = data protection.

Any other way of looking at it is pretty much doomed to be flawed.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 15:43 UTC (Tue) by Cato (guest, #7643) [Link] (1 responses)

This is a good way to look at it. Starting with a near-CDP tool such as rsnapshot is a good approach to snapshots: it backs up data as frequently as every hour with low overhead, using rsync, with multi-version support provided by hard links between snapshots. Then if the overhead of a scan every hour is too much, or you need very fast recovery from a disk fault, add RAID as well.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 16:05 UTC (Tue) by jonabbey (guest, #2736) [Link]

Thanks for the reference to rsnapshot!

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 16:47 UTC (Tue) by martinfick (subscriber, #4455) [Link] (7 responses)

BACKUPS are poor, version control is the only sane backup. Backups are horrible to recover from. Backups provide no sane automatable mechanism for pruning older data (backups) that doesn't suffer from the same corruption/accidental deletion problem that originals have, but worse, amplified since they don't even have a good mechanism for sanity checking (usage)! Backups tend to backup corrupted data without complaining.

Backups are good for certain limited chores such as backing up your version control system! :) But ONLY if you have a mechanism to verify the sanity of your previous backup and the original before making the next backup. Else, you are back to backing up corrupted data.

A good version control system protects you from corruption and accidental deletion since you can always go to an older version. And the backup system with checksums (often built into VCS) should protect the version control system.

If you don't have space for VCing your data you don't likely really have space for backing it up either, so do not accept this as an excuse to not vcs your data instead of backing it up.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 17:44 UTC (Tue) by Cato (guest, #7643) [Link]

Since I've researched this a lot recently, here are some rsync/librsync based tools that work somewhat like version control systems but are intended for system backups. They qualify as 'near-CDP' since rsync is efficient at scanning for changes.

rsnapshot is pretty good as a 'sort of' version control system for any type of file including binaries. It doesn't do any compression, just rsync plus hard links, but works very well within its design limits. It can backup filesystems including the hard links (use rsync -avH in the config file), and is focused on 'pull' backups i.e. backup server ssh's into the server to be backed up. It's used by some web hosting providers who back up tens of millions of files every few hours, with scans taking a surprisingly short time due to the efficiency of rsync. Generally rsnapshot is best if you have a lot of disk space available, and not much time to run the backups in.

rdiff-backup may be closer to what you are thinking of - unlike rsnapshot it only stores the deltas between versions of a file, and stores permissions etc as metadata (so you don't have to have root on the box being backed up to rsync arbitrary files). It's a bit slower than rsnapshot but a lot of people like it. It does include checksums which is a very attractive feature.

duplicity is somewhat like rsnapshot, but can also do encryption, so it's more suitable for backup to a system you don't control.

There are a lot of these tools around, based on Mike Rubel's original ideas, but these ones seem the most actively discussed.

For a non-rsync backup, dar is excellent but not widely mentioned - it includes per-block encryption and compression, and per-file checksums, and is generally much faster for recovery than tar, where you must read through the whole archive to recover.

rdiff-backup, like VCS tools, will have difficulty with files of 500 MB or more - it's been reported that such files don't get backed up, or are not delta'ed. Very large files that change frequently (databases, VM images, etc) are a problem for all these tools.
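
The trick all of these rsync-based tools share is hard-link rotation; stripped to its essentials it looks roughly like this (paths and retention counts invented for the example):

    BACKUP=/srv/backups
    SRC=/home

    # Rotate: the oldest snapshot falls off the end, the others shift down.
    rm -rf "$BACKUP/daily.3"
    for i in 2 1 0; do
        [ -d "$BACKUP/daily.$i" ] && mv "$BACKUP/daily.$i" "$BACKUP/daily.$((i+1))"
    done

    # New snapshot: files unchanged since the previous one become hard links into
    # it, so every snapshot looks "full" while only changed files use new space.
    rsync -aH --delete --link-dest="$BACKUP/daily.1" "$SRC/" "$BACKUP/daily.0/"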

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 17:55 UTC (Tue) by dlang (guest, #313) [Link] (1 responses)

unless your version control stores your data somewhere other than on your computer, it's a poor substitute for a backup.

there are lots of things that can happen to your computer (including your house burning down) that will destroy everything on it.

no matter how much protection you put into your storage system, you still need backups.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 18:05 UTC (Tue) by martinfick (subscriber, #4455) [Link]

Local backups suffer from the same problem as local version control.

Thus, locality is unrelated to whether you are using backups or version control. Yes, it is better to put it on another computer, or, at least another physical device. But, this is in no way an argument for using backups instead of version control.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 18:05 UTC (Tue) by joey (guest, #328) [Link] (1 responses)

> If you don't have space for VCing your data you don't likely really have
> space for backing it up either, so do not accept this as an excuse to not
> vcs your data instead of backing it up.

I'd agree, but you may not have memory to VCS your data. Git, in particular, scales memory usage badly with large data files.

Ext3 and RAID: silent data killers?

Posted Sep 1, 2009 18:16 UTC (Tue) by martinfick (subscriber, #4455) [Link]

If you have disk space, you have memory: it's called swap. Use it appropriately. With ~$60 TB disks, there is no excuse for either not having enough memory or enough space to VC your data.

Ext3 and RAID: silent data killers?

Posted Sep 2, 2009 0:39 UTC (Wed) by drag (guest, #31333) [Link] (1 responses)

> BACKUPS are poor, version control is the only sane backup.

If you're using version control for backups then that is your backup. Your
sentence does not really make a whole lot of sense. There is no difference.

My favorite form of backup is to use Git to sync data on geographically
disparate machines. But this is only suitable for text data. If I have to
back up photographs then source code management systems are utter shit.

> Backups are horrible to recover from.

They are only horrible to recover with if the backup was done poorly. If
you (or anybody else) does a shitty job of setting them up then it's your
(or their) fault they are difficult.

Backing up is a concept.

Anyways, it's much more horrible to recover data that has ceased to
exist.

> Backups provide no sane automatable mechanism for pruning older data
> (backups) that doesn't suffer from the same corruption/accidental deletion
> problem that originals have, but worse, amplified since they don't even
> have a good mechanism for sanity checking (usage)! Backups tend to backup
> corrupted data without complaining.

You're doing it wrong.

The best form of backup is to do full backups to multiple places. Ideally
they should be in a different region. You don't go back and prune data or
clean them up. That's WRONG. Incremental backups are only useful to reduce
the amount of data loss between full backups. A full copy of _EVERYTHING_
is a requirement. And you save it for as long as that data is valuable.
Usually 5 years.

It depends on what you're doing, but an ideal setup would be like this:
* On-site backups every weekend. Full backups. Stored for a few months.
* Incremental backups twice a day, resetting at the weekend with the full
backup.
* Every month 2 full backups are stored for 2-3 years.
* Off-site backups once a month, stored for 5 years.
etc. etc.

That would probably be a good idea for most small/medium businesses.

If you're relying on a server or a single datacenter to store your data
reliably then you're a fool. I don't give a shit how high quality your
server hardware or file system is or anything. A single fire, vandalism,
hardware failure, disaster, sabotage, or any number of things can utterly
destroy _everything_.

Ext3 and RAID: silent data killers?

Posted Sep 3, 2009 7:51 UTC (Thu) by Cato (guest, #7643) [Link]

On full backups: one of the nice things about rsnapshot and similar rsync-based tools is that every backup is both a full backup and an incremental backup. Full in that previous backups can be deleted without any effect on this backup (thanks to hard links), and incremental in that the data transfer required is proportional to the specific data blocks that have changed (thanks to rsync).

Ext3 and RAID: silent data killers?

Posted Sep 3, 2009 5:06 UTC (Thu) by k8to (guest, #15413) [Link]

Yes agreed, RAID is for availability and performance. RAID 5 doesn't offer performance, and the availability story isn't great either. So don't use it.

Ext3 and RAID: silent data killers?

Posted Sep 4, 2009 10:38 UTC (Fri) by nix (subscriber, #2304) [Link]

'Terrible performance' is in the eye of the beholder. So is reliability. Software RAID is constrained by bus bandwidth, so RAID 10 writes may well be slower than RAID 5 if you're bus-limited: and even RAID 5 writes are no slower than writes to a single drive. TBH, 89MB/s writes and 250MB/s reads (which my Areca card can manage with a four-drive RAID 5 array) don't seem too 'terrible' to me.

Furthermore, reliability is fine *if* you can be sure that once RAID parity computations have happened the stripe will always hit the disk, even if there is a power failure. With battery-backed RAID, this is going to be true (modulo RAID controller card failure or a failure of the drive you're writing to). Obviously if the array is sufficiently degraded reliability isn't going to be good anymore, but doesn't everyone know that?

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 8:15 UTC (Tue) by Cato (guest, #7643) [Link] (22 responses)

In my experience (massive data loss, thousands of files, using ext3 on a hard disk with the default data=ordered setup, with LVM and no RAID), Linux with ext3 is quite dangerous in its default configuration. This seems to be due to drives that reorder blocks in the write cache, and the lack of journal checksumming in ext3 to cope with this (and possibly also LVM issues). See http://lwn.net/Articles/342978/ for the details.

My standard setup now is to:

1. Avoid LVM completely

2. Disable write caching on all hard drives using hdparm -W0 /dev/sdX.

3. Enable data=journal on ext3 (tune2fs -o journal_data /dev/sdX is the best way to ensure partitions are mounted with this option, including the root partition and when installed in another system, post-reinstall, etc).

The performance hit from these changes is trivial compared to the two days I spent rebuilding a PC where the root filesystem lost thousands of files and the backup filesystem was completely lost.
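
In concrete terms, steps 2 and 3 above come down to something like this (device names are examples only; a rough sketch, not a recommendation):

    # Disable the on-drive write cache (repeat for each drive)
    hdparm -W0 /dev/sda
    # Store data=journal as a default mount option in the superblock,
    # so it applies wherever and however the filesystem is mounted
    tune2fs -o journal_data /dev/sda1
    # Verify the stored default mount options
    tune2fs -l /dev/sda1 | grep -i 'mount options'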

I suspect that the reason LVM is seen as reliable, despite being the default for Fedora and RHEL/CentOS, is that enterprise Linux deployments use hardware RAID cards with battery-backed cache, and perhaps higher quality drives that don't lie about write completion.

Linux with a default ext3 setup is far more prone to losing data than I once thought, unfortunately. If correctly configured it's fine, but the average new Linux user has no way to know how to configure this. I can't recall losing data like this on Windows in the absence of a hardware problem.

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 8:51 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (10 responses)

When I follow your reference, I come face to face with myself suggesting perfectly ordinary explanations, to which you have no answer.

It is always nice to imagine that if you find the right combination of DIP switches, set the right values in the configuration file, and choose the correct compiler flags, your problems will magically vanish.

But sometimes the real problem is just that your hardware is broken, and all the voodoo and contortions do nothing. You've changed _three_ random things about your setup based on, AFAICT, no evidence at all, and you think it's a magic formula. Until something bad happens again and you return in another LWN thread to tell us once more how awful ext3 is...

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 9:03 UTC (Tue) by Cato (guest, #7643) [Link] (9 responses)

I don't have time to do a scientific experiment across a number of PCs using different setups, so I had to go with the evidence I had and could find through searches.

I did base these changes mostly on the well-known lack of journal checksumming in ext3 (hence going to data=journal and avoiding write caching) - see http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal. Dropping LVM is harder to justify; it's really just a hunch based on a number of reports of LVM being involved in data corruption, and on my own data point that the LVM volumes on one disk were completely inaccessible (i.e. corrupted LVM metadata) - hence it was not just ext3 involved here, though it might have been write caching as well.

I'm interested to hear responses that show these steps are unnecessary, of course.

I really doubt the hardware is broken: there are no disk I/O errors in any of the logs, there were 2 disks corrupted (1 SATA, 1 PATA), and there are no symptoms of memory errors (random application/system crashes).

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 17:48 UTC (Tue) by dlang (guest, #313) [Link] (7 responses)

Of the three changes you are making:

disabling write caches

This is mandatory unless you have a battery-backed cache to recover from failed writes. Period, end of statement. If you don't do this, you _will_ lose data when you lose power.

avoiding LVM

I have also run into 'interesting' things with LVM, and so I also avoid it. I see it as a solution in search of a problem for just about all users (just about all users would be just as happy, and have things work faster with less code involved, if they just used a single partition covering their entire drive.)

I suspect that some of the problem here is that ordering of things gets lost in the LVM layer, but that's just a guess.

data=journal

This is not needed if the application is making proper use of fsync. If the application is not making proper use of fsync, data=journal is still not enough to make the data safe.

By the way, ext3 does do checksums on journal entries; the details of this were posted as part of the thread on linux-kernel.

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 18:05 UTC (Tue) by Cato (guest, #7643) [Link] (3 responses)

Possibly data=journal is overkill; I was going by the Wikipedia page on ext3, linked above. However, a conservative setup is attractive at present, as performance is far less important than reliability, for this PC anyway.

Do you know roughly when ext3 checksums were added, or by whom, as this contradicts the Wikipedia page? It must have been since 2007, based on http://archives.free.net.ph/message/20070519.014256.ac3a2.... I thought journal checksumming was only added to ext4 (see the first paragraph of http://lwn.net/Articles/284037/), not ext3.

This sort of corruption issue is one reason to have multiple partitions; parallel fscks are another. In fact, it would be good if Linux distros automatically scheduled a monthly fsck for every filesystem, even if journal-based.
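
Scheduling such checks is already possible per filesystem; a small sketch (the device name is an example only):

    # Force a full fsck after at most 30 mounts or one month,
    # whichever comes first, even though the filesystem is journalled
    tune2fs -c 30 -i 1m /dev/sda1
    # Confirm the new limits
    tune2fs -l /dev/sda1 | grep -Ei 'mount count|check'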

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 18:15 UTC (Tue) by dlang (guest, #313) [Link] (2 responses)

Ted Tso detailed the protection of the journal in this thread (I've deleted the particular message or I'd quote it for you)

I'm not sure I believe that parallel fscks on partitions on the same drive do you much good. The limiting factor for speed is the throughput of the drive; do you really gain much from having it bounce around, interleaving the different fsck processes?

As for protecting against this sort of corruption, I don't think it really matters.

For flash, the device doesn't know about your partitions, so it will happily map blocks from different partitions to the same erase block, which will then get trashed on a power failure; so partitions don't do you any good.

For a RAID array, it may limit corruption, but that depends on how your partition boundaries end up matching the stripe boundaries.

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 18:44 UTC (Tue) by Cato (guest, #7643) [Link]

I still can't find that email, but this outlines that journal checksumming was added to JBD2 to support ext4: http://ext4.wiki.kernel.org/index.php/Frequently_Asked_Qu...

This Usenix paper mentions that JBD2 will ultimately be usable by other filesystems, so perhaps that's how ext3 does (or will) support this: http://www.usenix.org/publications/login/2007-06/openpdfs... - however, I don't think ext3 has journal checksums in (say) 2.6.24 kernels.

Ext3 and write caching by drives are the data killers...

Posted Sep 2, 2009 6:36 UTC (Wed) by Cato (guest, #7643) [Link]

I grepped the 2.6.24 sources, fs/ext3/*.c and fs/jbd/*.c, for any mention of checksums, and couldn't find any. However, the mailing lists do have some references to a journal checksum patch for ext3 that didn't make it into 2.6.25.
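
(For the record, the check is easy to repeat against an unpacked tree; the path below is just an example.)

    # Search the ext3 and jbd sources for any mention of checksums
    grep -ril checksum linux-2.6.24/fs/ext3/ linux-2.6.24/fs/jbd/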

One other thought: perhaps LVM is bad for data integrity with ext3 because, as well as stopping barriers from working, LVM generates more fragmentation in the ext3 journal - that's one of the conditions mentioned by Ted Tso as potentially causing write reordering and hence FS corruption here: http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/m...

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 22:37 UTC (Tue) by cortana (subscriber, #24596) [Link] (2 responses)

> disabling write caches
>
> This is mandatory unless you have a battery-backed cache to recover from
> failed writes. Period, end of statement. If you don't do this, you _will_
> lose data when you lose power.

If this is true (and I don't doubt that it is), why on earth is it not the default? Shipping software with such an unsafe default setting is stupid. Most users have no idea about these settings... surely they shouldn't be handed a delicious pizza smeared with nitroglycerin topping, and then be blamed when they bite into it and it explodes...

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 22:41 UTC (Tue) by dlang (guest, #313) [Link] (1 responses)

Simple: enabling the write cache gives you a 10x (or better) performance boost for all the times when your system doesn't lose power.

The market has shown that people are willing to take this risk, by driving all the vendors that didn't make the change out of the marketplace.

Ext3 and write caching by drives are the data killers...

Posted Sep 3, 2009 7:58 UTC (Thu) by Cato (guest, #7643) [Link]

True, but it would be good if there were something simple like "apt-get install data-integrity" in major distros, which would help the user configure the system for high integrity by default, and if this were well publicised. It could include things like disabling the write cache, periodic fscks, ext3 data=journal, etc.

It would still be better if distros made this the default, but I don't see much prospect of that.

One other example of disregard for data integrity that I've noticed is that Ubuntu (and probably Debian) won't fsck a filesystem (including root!) if the system is on battery. This is very dubious - the fsck might exhaust the battery, but the user might well prefer a short while without the use of their laptop, due to a drained battery, to a long time without the use of their valuable data when the filesystem gets corrupted later on...

Fortunately, on my desktop with a UPS, on_ac_power returns 255, which counts as 'not on battery' for the /etc/init.d/check*.sh scripts.
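
For reference, on_ac_power exits 0 when on mains power, 1 when on battery, and 255 when it cannot tell; a simplified sketch of the kind of test those scripts do (not the actual Debian code) looks like this:

    # Skip the check only when we are definitely on battery
    on_ac_power
    if [ $? -eq 1 ]; then
        echo "on battery, skipping fsck"
    else
        echo "on mains or unknown (255), fsck can proceed"
    fi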

Ext3 and write caching by drives are the data killers...

Posted Sep 3, 2009 23:18 UTC (Thu) by landley (guest, #6789) [Link]

The paragraph starting "But what if more than one drive fails?" is misleading. You don't need another drive to fail to experience this problem; all you need is an unclean shutdown of an array that's both dirty and degraded. (Two words: "atime updates".) The second failure can be entirely a _software_ problem (which can invalidate stripes on other drives without changing them, because the parity information needed to use them is gone). Software problem != hardware problem; it's not the same _kind_ of failure.

People expect RAID to protect against permanent hardware failures, and people expect journaling to protect against data loss from transient failures which may be entirely due to software (ala kernel panic, hang, watchdog, heartbeat...). In the first kind of failure you need to replace a component, in the second kind of failure the hardware is still good as new afterwards. (Heck, you can experience unclean shutdowns if your load balancer kills xen shares impolitely. There's _lots_ of ways to do this. I've had shutdown scripts hang failing to umount a network share, leaving The Button as the only option.)

Another problematic paragraph starts with "RAID arrays can increase data reliability, but an array which is not running with its full complement of working, populated drives has lost the redundancy which provides that reliability."

That's misleading because redundancy isn't what provides this reliability, at least in other contexts. When you lose the redundancy, you open yourself to an _unrelated_ issue of update granularity/atomicity. A single disk doesn't have this "incomplete writes can cause collateral damage to unrelated data" problem. (It might start to if physical block size grows bigger than filesystem sector size, but even 4k shouldn't do that on a properly aligned modern ext3 filesystem.) Nor does RAID 0 have an update granularity issue, and that has no redundancy in the first place.

I.e., a degraded RAID 5/6 that has to reconstruct data using parity information from multiple stripes that can't be updated atomically is _more_ vulnerable to data loss from interrupted writes than RAID 0 is, and the data loss is of a "collateral damage" form that journaling silently fails to detect. This issue is subtle and fiddly, and although people keep complaining that it's not worth documenting because everybody should already know it, the people trying to explain it keep either getting it _wrong_ or glossing over important points.

Another point that was sort of glossed over is that journaling isn't exactly a culprit here, it's an accessory at best. This is a block device problem which would still cause data loss on a non-journaled filesystem, and it's a kind of data loss that a fsck won't necessarily detect. (Properly allocated file data elsewhere on the disk, which the filesystem may not have attempted to write to in years, may be collateral damage. And since you have no warning you could rsync the damaged stuff over your backups if you don't notice.) If it's seldom-used data it may be long gone before you notice.

The problem is that journaling gives you a false sense of security, since it doesn't protect against these issues (which exist at the block device level, not the filesystem level), and can hide even the subset of problems fsck would detect, by reporting a successful journal replay when the metadata still contains collateral damage in areas the journal hadn't logged any recent changes to.

I look forward to btrfs checksumming all extents. That should at least make this stuff easier to detect, so you can know when you _haven't_ experienced this problem.

Rob

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 16:43 UTC (Tue) by ncm (guest, #165) [Link] (6 responses)

What's wrong with providing battery backup for your drives? If they have power for a little while after the last write request arrives, then write caching, re-ordering writes, and lying about what's already on the disk don't matter. You still need to do backups, of course, but you'll need to use them less often because your file system will be that much less likely to have been corrupted.

I don't buy that your experience suggests there's anything wrong with ext3, if you didn't protect against power drops. The same could happen with any file system. The more efficient the file system is, the more likely corruption is in that case -- although some inefficient file systems seem especially corruptible.

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 17:31 UTC (Tue) by Cato (guest, #7643) [Link] (5 responses)

This PC is on battery backup (UPS) already - that didn't stop the corruption though. This is a home PC, and in any case it really shouldn't be necessary to use a UPS just to avoid filesystem/LVM corruption.

Since the rebuild, I have realised that the user of the PC has been accidentally turning it off via the power switch, which perhaps caused the write caches of the disk(s) to be lost - a fairly severe test. Despite several sudden power-offs due to this, with the new setup there has been no corruption yet. It seems unlikely that the writes would be pending in the disk's write cache for so long that they couldn't be written out while power was still present, but the fact is that both ext3 and LVM data structures got corrupted.

It's acknowledged that ext3's lack of journal checksumming can cause corruption when combined with disk write caching (whereas XFS does have such checksums, I think). The only question is whether the time between power starting to drop and the power going away completely is enough to flush pending writes (possibly reordered), while also not having any RAM contents get corrupted. Betting the data integrity of a remotely administered system on this time window is not something I want to do.

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 18:06 UTC (Tue) by ncm (guest, #165) [Link]

> The only question is whether the time between power starting to drop and the power going away completely is enough to flush pending writes (possibly reordered), while also not having any RAM contents get corrupted.

That's easy: no. When power starts to drop, everything is gone at that moment. If the disk is writing at that moment, the unfinished sector gets corrupted, and maybe others. A UPS for the computer and disk together helps only a little against corruption, unless power drops are almost always shorter than the battery time, or the UPS automatically shuts down the computer before its battery is used up. You may be better off if the computer loses power immediately and only the disk gets the UPS treatment.

> it really shouldn't be necessary to use a UPS just to avoid filesystem/LVM corruption.

Perhaps, but it is. (What is this "should"?) The file system doesn't matter nearly as much as you would like. File systems can be arbitrarily bad, but they can be no better than fairly good. The good news is that the UPS only needs to support the disk, and it only needs to keep power up for a few seconds; then many file systems are excellent, although the bad ones remain bad.

Ext3 and write caching by drives are the data killers...

Posted Sep 9, 2009 20:35 UTC (Wed) by BackSeat (guest, #1886) [Link] (3 responses)

> It's acknowledged that ext3's lack of journal checksumming can cause corruption

It may only be semantics, but it's unlikely that the lack of journal checksumming causes corruption, although it may make it difficult to detect.

As for LVM, I've never seen the point. Just another layer of ones and zeros between the data and the processor. I never use it, and I'm very surprised some distros seem to use it by default.

Ext3 and write caching by drives are the data killers...

Posted Sep 10, 2009 20:50 UTC (Thu) by Cato (guest, #7643) [Link] (2 responses)

One interesting scenario, mentioned I think elsewhere in the comments on this article: a single 'misplaced write' (i.e. the disk doesn't do the seek to the new position and writes to the old position instead) means that a data block goes into the ext3 journal.

In the absence of ext3 journal checksumming, if there is a crash requiring replay of this journal block, horrible things will happen - presumably garbage is written to various places on disk from the 'journal' entry. One symptom may be log entries saying 'write beyond end of partition', which I've seen a few times with ext3 corruption and which I think is a clear indicator of corrupt filesystem metadata.

This is one reason why JBD2 added journal checksumming for use with ext4 - I hope this also gets used by ext3. In my view, it would be a lot better to make that change to ext3 than to make data=writeback the default, which will speed up some workloads and most likely corrupt some additional data (though I guess not metadata).

Ext3 and write caching by drives are the data killers...

Posted Sep 10, 2009 21:05 UTC (Thu) by Cato (guest, #7643) [Link] (1 responses)

Actually the comment about a single incorrect block in a journal 'spraying garbage' over the disk is here: http://lwn.net/Articles/284313/

Ext3 and write caching by drives are the data killers...

Posted Sep 11, 2009 16:33 UTC (Fri) by nix (subscriber, #2304) [Link]

Note that you won't get a whole block full of garbage: ext3 will generally
notice that 'hey, this doesn't look like a journal' once the record that
spanned the block boundary is complete. But that's a bit late...

(This is all supposition from postmortems of shagged systems. Thankfully
we no longer use hardware prone to this!)

Ext3 and write caching by drives are the data killers...

Posted Sep 1, 2009 23:49 UTC (Tue) by dododge (guest, #2870) [Link] (3 responses)

Well, for starters (unless things have changed recently), LVM doesn't support write barriers, so if you put LVM in the loop it probably doesn't matter whether the drive reports write completion correctly or not. If you use XFS on top of LVM, you get a big warning about this at mount time.

I don't use ext3 much, but from a quick googling it sounds like you have to explicitly turn on barrier support in fstab, and it still won't warn you about the LVM issue until it actually tries to issue a barrier.
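
As a rough illustration (the device and mount point are examples only), requesting barriers for ext3 and then watching the kernel log is one way to see whether the whole stack honours them:

    # /etc/fstab entry asking ext3 to use write barriers
    /dev/sda1  /data  ext3  defaults,barrier=1  0  2

    # Remount and look for barrier-related complaints from the kernel
    mount -o remount /data
    dmesg | grep -i barrier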

Ext3 and write caching by drives are the data killers...

Posted Sep 2, 2009 7:18 UTC (Wed) by Cato (guest, #7643) [Link]

You are right on both points - when ext3 tries to do barriers on top of LVM it complains at that point, not at time of mounting.

Ext3 and write caching by drives are the data killers...

Posted Sep 3, 2009 7:18 UTC (Thu) by job (guest, #670) [Link] (1 responses)

According to previous comments at LWN, barriers should be working through LVM as of 2.6.30.

LVM barriers

Posted Sep 3, 2009 10:23 UTC (Thu) by Cato (guest, #7643) [Link]

There are some limitations on this - LVM barriers will only work with a linear target, apparently: http://lwn.net/Articles/326597/
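
One rough way to check (the volume name is an example): dmsetup prints the device-mapper table, and the third field of each line is the target type, which should read 'linear' for a simple, unstriped logical volume:

    dmsetup table /dev/mapper/vg0-root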


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds