Three comments Posted Jul 14, 2006 9:47 UTC (Fri) by ncm (subscriber, #165) Parent article: Crash-only software: More than meets the eye
First, this (otherwise excellent) article repeats a common misconception: in fact, journaling file systems, like journaling databases, are not generally safe against power drops. The underlying reason is that, for evidently unavoidable marketing reasons, most drives *lie* about whether blocks sent to the drive really are physically on the disk; it may take a few seconds for blocks actually to get written after the drive has sworn up-and-down that it's already been done. Drives that never lie lose performance benchmarks.
Furthermore, there is an urban myth that many/most drives will use the motor as a generator to provide power to finish writing the current block and park the head. It is generally false -- all drives you're likely to encounter happily write random stuff if voltage drops while they're writing, even if they do park the head afterward.
The implication is that a high-reliability system really must either have some form of battery-backed up disk drive power (e.g. UPS), or must somehow ensure its drives really don't lie (good luck verifying that!), or their journal blocks must have 64-bit-or-better checksums independent of whatever the drive uses, and be prepared for late journal blocks to be unreadable or just wrong. (Available drives have an appallingly high specified bit-error rate, so using 64-bit checksums everywhere reliability matters is a good idea anyhow.) Finally, when testing failure recovery, don't confuse hardware or software reset with power failure.
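A minimal sketch of the checksummed-journal idea, for illustration only. The record layout, the function names, and the choice of BLAKE2b truncated to 64 bits are all assumptions made here, not anything a particular file system or database does. It shows replay stopping at the first record whose application-level checksum fails, so a torn or garbage tail is treated as never written.

```python
import hashlib
import struct

# Hypothetical record layout: 4-byte payload length, 8-byte checksum, payload.
HEADER = struct.Struct("<I8s")


def checksum64(payload: bytes) -> bytes:
    """64-bit digest computed by the application, independent of the drive's ECC."""
    return hashlib.blake2b(payload, digest_size=8).digest()


def encode_record(payload: bytes) -> bytes:
    """Frame one journal record with its length and checksum."""
    return HEADER.pack(len(payload), checksum64(payload)) + payload


def replay(journal: bytes) -> list:
    """Return the payloads of all leading valid records.

    Replay stops at the first record that is truncated or fails its
    checksum; everything from there on is treated as never written.
    """
    records = []
    offset = 0
    while offset + HEADER.size <= len(journal):
        length, digest = HEADER.unpack_from(journal, offset)
        start = offset + HEADER.size
        payload = journal[start:start + length]
        if len(payload) != length or checksum64(payload) != digest:
            break  # late journal block is unreadable or just wrong
        records.append(payload)
        offset = start + length
    return records
```

A caller would append `encode_record(...)` output to the journal file and run `replay` over the whole file at startup, whether or not the previous shutdown was clean.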
Second, the principles behind crash-only design are embodied, thus far uniquely, in the C++ exception mechanism. A well-designed C++ program will have only a very few places that catch and process exceptions, and almost all the code that is executed during an exception is also run frequently during normal operation, in destructors. This differs fundamentally from languages with superficially similar exception features that depend on the "try-finally" construct. There, exception-handling code is scattered pervasively throughout the system, and much of it cannot be executed in any practical test process.
Third, crash-only design is tied very closely to logging and log-replaying. Often log-replaying code can be recycled for user-level undo/redo, and thus exercised in normal operation. Note that if you're not worried about power failure (because of your UPS), there's no need for the program to flush (i.e. fsync) the log file frequently. The kernel will do that if the program crashes, and mmapping a big data-structure image (e.g. to support undoing a deletion) to the end of the log file is very cheap. "Auto-save" is a very poor substitute for good logging.
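As a rough illustration of recycling log-replay code for undo/redo, here is a small in-memory sketch. The `Entry`, `Store`, and `recover` names are hypothetical, and writing the log to disk is left out entirely. The point is only that crash recovery, undo, and redo all go through the same `apply` function, so the recovery path is exercised in normal operation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Entry:
    key: str
    old: Optional[str]  # value before the change, None if the key was absent
    new: Optional[str]  # value after the change, None if the key is deleted


def apply(state: dict, key: str, value: Optional[str]) -> None:
    """The single code path shared by recovery, undo, and redo."""
    if value is None:
        state.pop(key, None)
    else:
        state[key] = value


@dataclass
class Store:
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    cursor: int = 0  # number of log entries currently applied

    def set(self, key: str, value: Optional[str]) -> None:
        del self.log[self.cursor:]  # a new edit discards any redo tail
        self.log.append(Entry(key, self.state.get(key), value))
        self.redo()

    def undo(self) -> None:
        if self.cursor > 0:
            self.cursor -= 1
            entry = self.log[self.cursor]
            apply(self.state, entry.key, entry.old)

    def redo(self) -> None:
        if self.cursor < len(self.log):
            entry = self.log[self.cursor]
            apply(self.state, entry.key, entry.new)
            self.cursor += 1


def recover(log: list) -> Store:
    """Rebuild a store after a crash by replaying the surviving log."""
    store = Store(log=list(log))
    while store.cursor < len(store.log):
        store.redo()
    return store
```

In this sketch, every user-level redo runs exactly the code that crash recovery runs, which is the testing benefit described above.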
Three comments Posted Jul 14, 2006 11:58 UTC (Fri) by nix (subscriber, #2304) [Link] Isn't the drive-write-late-and-garbage problem exactly what write barriers are meant to solve, and the major reason why the journalled filesystems and md layer make use of write barriers? (Do any drives actually lie about write barriers, too, and say they're passed when the stuff they bar is not yet on the medium?)
Write barriers Posted Jul 14, 2006 19:15 UTC (Fri) by giraffedata (subscriber, #1954) [Link] No, the problem that write barriers solve is where the device says "OK, I've got the data" and Linux considers the data to be permanent based on that. Before write barriers, that's what happens. That's not ridiculous, by the way. "Permanent" is a matter of degree, and being written on the platter is just one degree in the middle of the scale. Once the device has the data, it is safe from a Linux kernel crash, and that's a lot. Write barriers are, BTW, a Linux kernel block layer phenomenon; the device doesn't know the concept. Linux has various ways to know that the device has put the data on the platter and uses them to implement write barriers. But if the device lies, the write barriers won't work. Since the device lies to circumvent a system that explicitly asked for the data to go on the platter, I rather doubt that it would refrain from lying when Linux write barriers are involved. BTW, I can't confirm or deny that devices lie like this, and to the extent claimed. If someone can back up this claim, I'd love to see it.
Three comments Posted Jul 14, 2006 19:30 UTC (Fri) by ncm (subscriber, #165) [Link] I don't know to what degree modern drives really obey write barriers. If history is any guide, they obey write barriers when the data rate is low, but toss them when the buffers fill up, or any time they seem to recognize a benchmark being run. In any case, they won't protect against sectors being half-written.
Three comments Posted Jul 18, 2006 6:21 UTC (Tue) by nix (subscriber, #2304) [Link] That's just horrible enough that it might be true, but the idealist in me hopes that it isn't, because it would render the entire concept of write barriers pointless :(
Three comments Posted Jul 18, 2006 9:04 UTC (Tue) by ncm (subscriber, #165) [Link] Not at all... it just means that backup power is necessary for a reliable system. Disk drive designers have learned not to pretend they can make a whole system reliable all by themselves, and that (furthermore) the market won't pay for them to try. It doesn't take much backup power; if you can get the CPU's or ATA interface's power to drop out of tolerance a few seconds before the drive's, that may be all you need.
Three comments Posted Jul 19, 2006 7:17 UTC (Wed) by drs (guest, #16570) [Link] My understanding of this problem (recalling a Ted Tso talk on the issue) is that as system voltage sags in an unexpected power failure, different system components fail to operate correctly at different voltage levels.
More specifically, the voltage at which main memory maintains coherency is
writing garbage when the voltage drops Posted Jul 14, 2006 19:19 UTC (Fri) by giraffedata (subscriber, #1954) [Link] > there is an urban myth that many/most drives will use the motor as a generator to provide power to finish writing the current block and park the head. It is generally false -- all drives you're likely to encounter happily write random stuff if voltage drops while they're writing, even if they do park the head afterward.
I can easily believe that the motor generating power is fantasy, but I always assumed there was a capacitor in there that could supply enough energy to finish writing the current sector. Why wouldn't there be?
writing garbage when the voltage drops Posted Jul 15, 2006 5:21 UTC (Sat) by roelofs (subscriber, #2599) [Link] > I can easily believe that the motor generating power is fantasy, but I always assumed there was a capacitor in there that could supply enough energy to finish writing the current sector. Why wouldn't there be?
Size, maybe? I'm just shooting the breeze here, but caps associated with power supplies tend to be immensely bigger than typical hard-drive components, and I'd guess that one capable of acting as a power-supply stand-in for even a few milliseconds would still be quite a bit bigger than the little surface-mount discretes used on drives today. But maybe I'm suffering from cranio-rectal impaction again... I hate it when that happens. Greg
writing garbage when the voltage drops Posted Jul 17, 2006 14:29 UTC (Mon) by giraffedata (subscriber, #1954) [Link] OK, I did some calculations. I think the drive needs less than 10 microseconds to finish writing a sector. In that time, it needs up to 1 ampere, and can work with at least 4 V out of the 5 V power supply. So a 10 µF capacitor, which is the size of a pea, should suffice. The stored energy in the disk probably is relevant too, in that it keeps the disk spinning fast enough for an acceptable write 10 µs after the motor loses power.
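The arithmetic behind that figure can be checked in a few lines. This is a back-of-the-envelope sketch that assumes the drive draws a constant current while the capacitor discharges, so that C = I * t / dV; the function name is made up for the example, and the input numbers are the estimates from the comment, not measurements.

```python
def capacitance_needed(current_a: float, hold_time_s: float, allowed_drop_v: float) -> float:
    """Capacitance in farads for a constant-current discharge: C = I * t / dV."""
    return current_a * hold_time_s / allowed_drop_v


# 1 A for 10 microseconds while the rail sags from 5 V to 4 V.
c = capacitance_needed(current_a=1.0, hold_time_s=10e-6, allowed_drop_v=5.0 - 4.0)
print(f"{c * 1e6:.0f} microfarads")
```

Under those assumptions the result agrees with the 10 microfarad estimate; a drive that tolerated less sag or needed longer to finish the sector would need proportionally more.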
writing garbage when the voltage drops Posted Jul 25, 2006 3:59 UTC (Tue) by barrygould (guest, #4774) [Link] I'd expect you want clusters, not sectors, ensured to be written safely.
writing garbage when the voltage drops Posted Jul 15, 2006 23:19 UTC (Sat) by ncm (subscriber, #165) [Link] Suffice to say that disk-drive manufacturing is a very cost-sensitive business. They'd be happy to make drives fail better if it didn't actually cost anything, but nobody is willing to pay if it does cost.
writing garbage when the voltage drops Posted Jul 18, 2006 15:40 UTC (Tue) by giraffedata (subscriber, #1954) [Link] Now that I think about it, the atomic write in the case of power failure isn't all that useful, because if the sector doesn't get completely written, it can't be read back. The CRC in the trailer won't have been written. That means you can achieve the same thing by writing two copies of the critical sector: On readback, if you can't read the first copy, you just use the second copy, which is the complete old version. You'd probably want that redundancy anyway, because it's probably a really important sector and write failures happen even without power failures. For the benefit of those who are wondering why people think atomic sector writes at power failure are important: Some systems deal with the possibility of system failure in the middle of a complex disk update as follows: Keep the original data intact and write a whole second, updated copy. (Use copy-on-write if you have to for practicality.) A single sector points to the current copy. When you have a complete updated copy, update the pointer sector to point to the updated copy. Then delete the original copy. Any kind of failure before you update the pointer sector just means the complex update never happened. But if the update of the pointer sector itself gets interrupted, then you've got neither the original nor the updated copy.
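The two-copy scheme can be sketched in a simulation. Everything here is an assumption for illustration: the 512-byte sector, the CRC-32 trailer, the function names, and the way a torn write is modelled (the first half of the new sector lands on top of the old one). Real drives keep their own CRC or ECC out of the host's sight, so the trailer below stands in for whatever makes a half-written sector unreadable.

```python
import struct
import zlib

SECTOR_SIZE = 512
PAYLOAD_SIZE = SECTOR_SIZE - 4  # last 4 bytes hold a CRC-32 trailer


def make_sector(pointer: int) -> bytes:
    """Build a pointer sector: 64-bit block address, zero padding, CRC trailer."""
    body = struct.pack("<Q", pointer).ljust(PAYLOAD_SIZE, b"\0")
    return body + struct.pack("<I", zlib.crc32(body))


def read_sector(sector: bytes):
    """Return the stored pointer, or None if the sector fails its CRC."""
    body, trailer = sector[:PAYLOAD_SIZE], sector[PAYLOAD_SIZE:]
    if len(sector) != SECTOR_SIZE or struct.unpack("<I", trailer)[0] != zlib.crc32(body):
        return None
    return struct.unpack_from("<Q", body)[0]


def update_pointer(copies: list, pointer: int, torn_copy=None) -> None:
    """Write copy 0, then copy 1. torn_copy simulates power loss during that write."""
    new = make_sector(pointer)
    for i in range(2):
        if i == torn_copy:
            half = SECTOR_SIZE // 2
            copies[i] = new[:half] + copies[i][half:]  # trailer never written
            return
        copies[i] = new


def read_pointer(copies: list) -> int:
    """Use the first copy that reads back cleanly."""
    for sector in copies:
        value = read_sector(sector)
        if value is not None:
            return value
    raise OSError("both copies of the pointer sector are unreadable")
```

If power fails during the first write, `read_pointer` falls back to the second copy and returns the old pointer; if it fails during the second write, the first copy already holds the new pointer. Either way the reader sees a complete old or complete new value, provided only one copy is ever in flight at a time.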
Three comments Posted Jul 20, 2006 16:20 UTC (Thu) by renox (guest, #23785) [Link] > journaling file systems, like journaling databases, are not generally safe against power drops
I think that it depends on the type of journaling: metadata-only journaling doesn't protect your files, which can still be corrupted. Among others, ext3 offers data+metadata journaling, which has quite a high performance impact, but it should 'protect' in the sense that the data is either there or not there, never half there (seen with ReiserFSv3: the passwd file appended with binary data, urgh).
Of course, even 'data+metadata' journaling can work correctly only if the disk obeys some order of writing the data.
Three comments Posted Jul 20, 2006 23:43 UTC (Thu) by ncm (subscriber, #165) [Link] > ...even journaling 'data+metadata' can work correctly only if the disk obeys some order of writing the data.
Precisely the point. However, data+metadata journaling may be kind of pointless if you don't have any way of telling how much of the data you had meant to have written was, in fact, written. For example, if you have an outage while compiling a kernel, no amount of journaling can make it safe to skip "make clean" before running "make" again. That makes the cheapest, fastest journaling regime also best, for such an environment.
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.