having caught exactly zero blocks of bad data passed as good
Evidently Daniel hasn't worked much with disks that are powered off unexpectedly. There's a widespread myth (originating where?!) that disks detect a power drop and use the last few milliseconds to do something safe, such as finish up the sector they're writing. It's not true. A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds. Some hard read failures, even if duly reported, count as real damage, and are not unlikely.
Your typical journaled file system doesn't protect against power-off scribbling damage, as fondly as so many people wish and believe with all their little hearts.
Even without unexpected power drops, it's foolish to depend on more reliable reads than the manufacturer promises, because they trade off marginal correctness (which is hard to measure) against density (which is on the box in big bold letters). What does the money say to do? PostgreSQL uses 64-bit block checksums because they care about integrity. It's possibly reasonable to say that theirs is the right level for such checking, but not to say there's no need for it.
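As a rough sketch of what block-level checksumming buys (illustrative C only, not PostgreSQL's or any filesystem's actual code; a simple FNV-1a hash stands in for a real CRC): compute the checksum when the block is written, re-verify it on every read, and a block the drive silently corrupts no longer passes as good.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative 64-bit FNV-1a hash used as a block checksum.  Real
     * systems use stronger/faster codes (CRC-32C, CRC-64, ...); the point
     * is only that the checksum travels with the block and is re-checked
     * on every read. */
    static uint64_t block_checksum(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a offset basis */
        while (len--) {
            h ^= *p++;
            h *= 0x100000001b3ULL;            /* FNV-1a prime */
        }
        return h;
    }

    struct block {                /* hypothetical 4 KB on-disk layout */
        uint64_t csum;            /* checksum of payload[] */
        uint8_t  payload[4096 - sizeof(uint64_t)];
    };

    /* On write: seal the block before handing it to the disk. */
    static void block_seal(struct block *b)
    {
        b->csum = block_checksum(b->payload, sizeof b->payload);
    }

    /* On read: 0 if the block verifies, -1 if the drive returned
     * something other than what was written (silent corruption). */
    static int block_verify(const struct block *b)
    {
        return block_checksum(b->payload, sizeof b->payload) == b->csum ? 0 : -1;
    }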
Correctness
Posted Dec 5, 2008 18:22 UTC (Fri) by man_ls (guest, #15091) [Link]
Everything you say can be prevented by a more robust filesystem with data journaling, even without checksums. Ext3 with data=journal is an example.

Even with checksumming, data integrity is not guaranteed: yes, the filesystem will detect that a sector is corrupt, but it still needs to locate a good previous version and be able to roll back to it. Isn't it easier to just do data journaling?
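(For concreteness, on ext3 full data journaling is selected with the data=journal mount option; the device and mount point in this sketch are only placeholders.)

    # /etc/fstab entry
    /dev/sdXN   /data   ext3   data=journal   0  2

    # or as a one-off mount:
    mount -t ext3 -o data=journal /dev/sdXN /data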
Correctness
Posted Dec 5, 2008 22:18 UTC (Fri) by ncm (subscriber, #165) [Link]
Everything you say can be prevented by a more robust filesystem ...

FALSE. I'm talking about hardware-level sector failures. A filesystem without checksumming can be made robust against reported bad blocks, but a bad block that the drive delivers as good can completely bollix ext3 or any fs without its own checksums. Drive manufacturers specify and (just) meet a rate of such bad blocks: low enough for non-critical applications, and low enough not to kill the performance of critical applications that perform their own checking and recovery.
Denial is not a sound engineering practice.
Correctness
Posted Dec 6, 2008 0:06 UTC (Sat) by man_ls (guest, #15091) [Link]
Interesting point: it seems I misread your post, so let me re-elaborate. Data journaling protects against half-written sectors, since they will not count as written. That leaves a power-off which causes physical damage to the disk, and yet the disk does not realize the sector is bad. Keep in mind that we have data journaling, so this particular sector will not be used until it is completely overwritten. The kind of damage must be permanent yet remain hidden when writing, which is why I deemed it impossible. It seems you have good cause to believe it can happen, so it would be most enlightening to hear any data points you may have.

As to your concerns about high data density and error rates, they are exactly what Mr Phillips happily dismisses: in practice they do not seem to cause any trouble.
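To make the ordering explicit, here is a minimal user-space sketch using ordinary POSIX calls (this is not how ext3 implements its journal; only the order of operations matters): the data reaches the journal and is flushed before the home location is touched, so a torn in-place write can always be repaired from the journal, and a torn journal write is merely an uncommitted transaction.

    #include <sys/types.h>
    #include <unistd.h>

    /* journal_fd and data_fd are ordinary files standing in for the
     * journal and the home location. */
    static int journaled_block_write(int journal_fd, int data_fd,
                                     off_t blockno, const char buf[4096])
    {
        /* 1. Put the new data (plus, in a real journal, a commit record)
         *    in the journal and force it to stable storage. */
        if (pwrite(journal_fd, buf, 4096, blockno * 4096) != 4096)
            return -1;
        if (fdatasync(journal_fd) != 0)
            return -1;

        /* 2. Only then overwrite the home location.  If power dies here,
         *    replay rewrites the block from the journal copy; if it died
         *    during step 1, the commit record never made it and the
         *    update is dropped whole, so a half-written sector is never
         *    presented as good file data. */
        if (pwrite(data_fd, buf, 4096, blockno * 4096) != 4096)
            return -1;
        return fdatasync(data_fd);
    }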
Over-engineering is not a sound engineering practice either.
Correctness
Posted Dec 7, 2008 22:28 UTC (Sun) by ncm (subscriber, #165) [Link]
File checksums needed?
Posted Dec 6, 2008 18:57 UTC (Sat) by giraffedata (subscriber, #1954) [Link]
A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds.
Actually, I think the probability of reading such a sector without error indication is negligible. There are much more likely failure modes for which file checksums are needed. One is where the disk writes the data to the wrong track. Another is where it doesn't write anything but reports that it did. Another is that the power left the client slightly before the disk drive and the client sent garbage to the drive, which then correctly wrote it.
I've seen a handful of studies that showed these failure modes, and I'm pretty sure none of them showed simple sector CRC failure.
If sector CRC failure were the problem, adding a file checksum is probably no better than just using stronger sector CRC.
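One way to catch the misdirected-write and lost-write cases above (a sketch with a made-up block layout; some checksumming filesystems do something broadly similar): include the block's own address and a write generation inside the checksummed region, so a well-formed block sitting in the wrong place, or a stale copy left by a dropped write, fails verification even though its bytes are internally consistent.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical block layout. */
    struct tagged_block {
        uint64_t csum;        /* covers everything after this field */
        uint64_t blockno;     /* where this block believes it lives */
        uint64_t generation;  /* bumped on every rewrite */
        uint8_t  payload[4096 - 3 * sizeof(uint64_t)];
    };

    static int tagged_block_verify(const struct tagged_block *b,
                                   uint64_t expected_blockno,
                                   uint64_t expected_generation,
                                   uint64_t (*checksum)(const void *, size_t))
    {
        const void *covered = &b->blockno;
        size_t covered_len = sizeof *b - sizeof b->csum;

        if (checksum(covered, covered_len) != b->csum)
            return -1;                          /* corrupted contents  */
        if (b->blockno != expected_blockno)
            return -1;                          /* misdirected write   */
        if (b->generation != expected_generation)
            return -1;                          /* lost or stale write */
        return 0;
    }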
File checksums needed?
Posted Dec 16, 2008 1:57 UTC (Tue) by daniel (subscriber, #3181) [Link]
There are much more likely failure modes for which file checksums are needed. One is where the disk writes the data to the wrong track. Another is where it doesn't write anything but reports that it did. Another is that the power left the client slightly before the disk drive and the client sent garbage to the drive, which then correctly wrote it.
Please, stop...
Posted Dec 20, 2008 3:31 UTC (Sat) by sandeen (guest, #42852) [Link]
XFS, like any journaling filesystem, expects that when the storage says data is safe on disk, it is safe on disk and the filesystem can proceed with whatever comes next. That's it; no special capacitors, no power-fail interrupts, no death rays from Mars. There is no special-ness required (unless you consider barriers to prevent re-ordering to be special, and XFS is not unique in that respect either).
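(At the application level, the point at which the storage is asked to make data safe corresponds to an explicit flush. A minimal sketch, not XFS internals; on a correctly configured stack the kernel translates fsync() into whatever cache-flush or barrier traffic the drive needs.)

    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Write a buffer, then ask the storage stack to make it durable.
     * Everything a journaling filesystem orders against is anchored on
     * requests like this one being honestly honored by the drive. */
    static int write_durably(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }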
Please, stop...
Posted Dec 20, 2008 3:55 UTC (Sat) by giraffedata (subscriber, #1954) [Link]
You must have seriously misread the post to which you responded. It doesn't mention special features of hardware. It does mention special flaws in hardware and how XFS works in spite of them.

I too remember reports that in testing, systems running early versions of XFS didn't work because XFS assumed, like pretty much everyone else, that the hardware would not write garbage to the disk and subsequently read it back with no error indication. The testing showed that real-world hardware does in fact do that and, supposedly, XFS developers improved XFS so it could maintain data integrity in spite of it.
Correctness
Posted Dec 11, 2008 16:50 UTC (Thu) by anton (subscriber, #25547) [Link]
A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds.

Given that disk drives do their own checksumming, you get pretty good odds. And if you think they are not good, why would you think that FS checksums are any better?
Concerning getting such damage on power-off: most drives don't do that; we would hear a lot about drive-level read errors after turning off computers if it were a general characteristic. However, I have seen such things a few times, and it typically leads to me avoiding the brand of the drive for a long time (i.e., no Hitachi drives for me, even though they were still IBM when it happened, and no Maxtor either; hmm, could it be that selling such drives leads to having to sell the division/company soon after?). These failures usually did not happen on an ordinary power-off, but in unusual situations that might produce funny power characteristics (which is still no excuse to corrupt the disk).
Correctness
Posted Dec 15, 2008 21:06 UTC (Mon) by grundler (subscriber, #23450) [Link]
It was true for SCSI disks in the '90s. The feature was called "Sector Atomicity". As expected, there is a patent for one implementation:
http://www.freepatentsonline.com/5359728.html
AFAIK, every major server vendor required it. I have no idea if this was ever implemented for IDE/ATA/SATA drives. But UPSes became the norm for avoiding power-failure issues.