Posted Sep 10, 2009 17:34 UTC (Thu) by ncm (subscriber, #165)
There are two concerns. First, if power drops during a physical write operation, that sector is scragged. If it was writing metadata, you have serious problems with whatever files that metadata describes, if anything points to that sector.
Second, more subtle but probably more important, drives lie about what is physically on disk. To look good on benchmarks, they tell the controller that sectors have been physically copied to the platter while they are still only in buffer RAM in the drive -- up to several megabytes' worth. A few seconds after the last controller operation, these writes have drained to the disk. Before that, there's no guessing which have been written and which haven't, and blocks the system meant to write first may be written last. As a consequence, after powerup the file system sees blocks that are supposed to have important metadata in them with, instead, whatever was left there.
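The failure mode described above can be modelled in a few lines. This is only a toy sketch, not how any real drive firmware works: the class `CachingDrive` and its methods are hypothetical, and the set of writes that survive a power cut is passed in by hand to keep the example deterministic.

```python
from typing import Dict, List, Set, Tuple


class CachingDrive:
    """Toy model of a drive that acknowledges writes from its RAM buffer."""

    def __init__(self, platter: Dict[int, bytes]) -> None:
        self.platter = dict(platter)                # what is physically on disk
        self.cache: List[Tuple[int, bytes]] = []    # acknowledged, not yet written

    def write(self, sector: int, data: bytes) -> bool:
        # The drive reports success although the data is only in buffer RAM.
        self.cache.append((sector, data))
        return True

    def drain(self) -> None:
        # Given a few idle seconds, every buffered write reaches the platter.
        for sector, data in self.cache:
            self.platter[sector] = data
        self.cache.clear()

    def power_loss(self, survivors: Set[int]) -> None:
        # Only the buffered writes whose positions are in `survivors` made it;
        # the drive chose the order, not the file system.
        for position, (sector, data) in enumerate(self.cache):
            if position in survivors:
                self.platter[sector] = data
        self.cache.clear()


drive = CachingDrive({10: b"old metadata"})
drive.write(10, b"new metadata")   # the system meant this to land first
drive.write(20, b"file data")
drive.power_loss(survivors={1})    # only the second write reached the platter
```

After the simulated power cut, sector 10 still holds whatever was there before, even though both writes were acknowledged.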
Only with a UPS
Posted Sep 10, 2009 18:05 UTC (Thu) by flewellyn (subscriber, #5047)
Okay, but I guess what I'm asking is, what do you classify as a crash, if power drops are not included?
Only with a UPS
Posted Sep 10, 2009 19:19 UTC (Thu) by ncm (subscriber, #165)
Back about 1998, a Windows user told me that, for him, Windows "hardly ever crashes". Further questioning revealed that he defined "crash" as "I have to re-install". Lockups, a multiple-daily event, didn't count. Generally, though, by "crash" we mean the system stops responding to events, and must be re-started; usually this is a software failure, although all manner of hardware faults can cause it. When these happen, the disk has plenty of time to drain its buffers. Usually the software fault has not caused any disk writes with crazy parameters.
OS developers don't count power drops among crashes because those aren't their fault. That's commendable, because when they say "crash" they mean something they accept responsibility for.
Only with a UPS
Posted Sep 10, 2009 22:20 UTC (Thu) by flewellyn (subscriber, #5047)
Ah, that makes sense.
Handling power drops, to me, seems to be a matter of impossibility, at least as long as disks lie about when writes actually complete.
Only with a UPS
Posted Sep 11, 2009 8:29 UTC (Fri) by jschrod (subscriber, #1646)
I'd say a crash happens any time that my system stands still because of some kernel oops, or any time I have to press the reset button because something hangs beyond redemption. (The latter being more often the case in my environment.)
Only with a UPS
Posted Sep 10, 2009 18:18 UTC (Thu) by aliguori (subscriber, #30636)
<i>Second, more subtle but probably more important, drives lie about what is physically on disk.</i>
Not really. You can make a disk tell you when data is actually on the platter vs in the write cache. Furthermore, most "enterprise" drives have battery-backed write caches that guarantee enough power for the write caches to be flushed.
Only with a UPS
Posted Sep 10, 2009 19:06 UTC (Thu) by ncm (subscriber, #165)
<i>You can make a disk tell you when data is actually on the platter</i>
No, you can ask a disk to tell you. It might even be honest about it if you never get the buffer too full. The commercial incentives to lie for the sake of benchmarks are extremely strong. Drives that don't lie cost a lot more, and are slower. Honesty is an extra-cost option. If you don't pay for honesty (few do) you won't get it.
Honesty usually costs a lot more than a UPS.
Only with a UPS
Posted Sep 10, 2009 19:12 UTC (Thu) by dlang (✭ supporter ✭, #313)
I have yet to see a drive that includes battery backup in it, and I work at a place that spends millions of dollars a year on enterprise-grade storage.
I think this is a myth like the drives that use platter energy to power themselves to write their buffer.
If you can point to a drive that includes a battery backup on the drive, please post a link to it.
Only with a UPS
Posted Sep 10, 2009 19:22 UTC (Thu) by ncm (subscriber, #165)
I suspect aliguori is referring to disk-array boxes, not drives.
Only with a UPS
Posted Sep 11, 2009 16:24 UTC (Fri) by nix (subscriber, #2304)
Even then... it's just occurred to me that while I know my Areca RAID
controller has its own battery-backed cache, I have no idea whether it's
asked the drives it controls to turn off *their* internal write cache...
(Obviously we don't want that cache, as it's not battery-backed in any
way.)
Only with a UPS
Posted Sep 11, 2009 14:41 UTC (Fri) by anton (guest, #25547)
<i>First, if power drops during a physical write operation,
that sector is scragged. If it was writing metadata, you have serious
problems with whatever files that metadata describes, if anything
points to that sector.</i>
In my experiments on cutting power to disk drives while they were
writing, the drives did not corrupt sectors. I have seen IBM and Maxtor
drives corrupt sectors under more unusual power fluctuation
circumstances; maybe that's a reason why you can no longer buy drives
from IBM or Maxtor. Hitachi (IBM successor) and Seagate-Maxtor (not
Seagate proper) are certainly on my don't-buy list.
And a modern file system can protect against the corruption of a
single sector:
E.g., in a journaling file system, that sector is either in the log
or in the permanent storage. If it's in the log, just stop the replay
when you encounter the sector. If it's in permanent storage, then you
will notice that the replay write fails, and the file system can remap
the sector/block to a working one (or the drive might remap it
transparently on the replay write, or might just perform the write on
the original sector; in these cases the file system has nothing to do).
Of course, if the file system performs only meta-data journaling, then
it will likely not notice corrupt data (because it is not accessed
during replay), but apparently neither the file system maintainer nor
the user (or whoever decided to use a meta-data journaling file system) cares about data
anyway, so that's ok.
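A minimal sketch of the "stop the replay when you encounter the sector" case, assuming each log record carries a checksum so that a scragged log sector can be recognized. That is an assumption, as are the names `LogRecord`, `make_record` and `replay`; real journals group records into transactions and are considerably more involved.

```python
import zlib
from typing import Dict, List, NamedTuple


class LogRecord(NamedTuple):
    block_no: int    # destination block in permanent storage
    payload: bytes   # new contents for that block
    checksum: int    # CRC32 of the payload, stored with the record


def make_record(block_no: int, payload: bytes) -> LogRecord:
    return LogRecord(block_no, payload, zlib.crc32(payload))


def replay(log: List[LogRecord], disk: Dict[int, bytes]) -> int:
    """Apply records in log order; stop at the first damaged one.

    Returns the number of records applied.
    """
    applied = 0
    for record in log:
        if zlib.crc32(record.payload) != record.checksum:
            break  # corrupt log sector: nothing after it is trusted
        disk[record.block_no] = record.payload
        applied += 1
    return applied


disk: Dict[int, bytes] = {}
log = [
    make_record(1, b"inode table update"),
    make_record(2, b"bitmap")._replace(payload=b"scrags"),  # damaged in the log
    make_record(3, b"directory entry"),
]
applied = replay(log, disk)  # only the first record is applied
```

Stopping at the damaged record leaves permanent storage in the state of the last fully logged update, which is the property the argument relies on.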
In a copy-on-write file system, the sector either contains the root
of the file system, or it contains something written after the last
root. In the latter case these blocks are unreachable anyway after
recovery (unless there is also an intent log, in which case the
discussion above applies). If the root is affected, then on recovery
the youngest alternative root is read, giving us the latest consistent
state of the file system.
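The root-selection step can be sketched like this, assuming each root block stores a generation number and a checksum over its contents. Both are assumptions for illustration, and `RootBlock`, `make_root` and `pick_root` are hypothetical names rather than any particular file system's on-disk format.

```python
import zlib
from typing import List, NamedTuple, Optional


class RootBlock(NamedTuple):
    generation: int      # increases with every committed root
    tree_pointer: bytes  # stand-in for the location of the tree it describes
    checksum: int        # CRC32 over generation and tree_pointer


def _body(generation: int, tree_pointer: bytes) -> bytes:
    return generation.to_bytes(8, "little") + tree_pointer


def make_root(generation: int, tree_pointer: bytes) -> RootBlock:
    return RootBlock(generation, tree_pointer,
                     zlib.crc32(_body(generation, tree_pointer)))


def is_valid(root: RootBlock) -> bool:
    return zlib.crc32(_body(root.generation, root.tree_pointer)) == root.checksum


def pick_root(candidates: List[RootBlock]) -> Optional[RootBlock]:
    """Return the youngest root whose checksum is intact, or None."""
    valid = [root for root in candidates if is_valid(root)]
    return max(valid, key=lambda root: root.generation, default=None)


roots = [
    make_root(1, b"A"),
    make_root(2, b"B"),
    make_root(3, b"C")._replace(tree_pointer=b"X"),  # scragged during the write
]
chosen = pick_root(roots)  # falls back to generation 2
```

Blocks written after the chosen root are simply unreachable, so no further repair is needed in this model.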
<i>Second, more subtle but probably more important, drives
lie about what is physically on disk.</i>
In the experiments mentioned above, when the drive had write caching
enabled (the default on PATA and SATA drives), the drives not only
reported completion right away but, worse, also reordered the writes (so using barriers or turning off write caching is essential for every kind of consistency).
With write caching disabled, the results of my experiments (both in
performance and in what was on disk after powering off) are consistent
with the theory that the drive reports the completion of a write only
after the sector hits the platter and, consequently (with the program I
used), wrote the sectors only in order.
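For anyone who wants to try something similar, here is a rough sketch of such a test, not the program used in the experiments above. It writes numbered blocks one at a time and waits for `os.fsync` after each, then checks which blocks are present afterwards. The block size, the `SEQ1` marker and all function names are assumptions, and whether `fsync` really waits for the platter depends on the file system's barrier settings and the drive's write cache.

```python
import os
import struct
from typing import List

BLOCK_SIZE = 512   # assumed sector size
MAGIC = b"SEQ1"    # hypothetical marker identifying a test block


def make_block(seq: int) -> bytes:
    # Marker plus sequence number, padded with zeros to a full block.
    return (MAGIC + struct.pack("<Q", seq)).ljust(BLOCK_SIZE, b"\0")


def write_sequence(path: str, count: int) -> None:
    """Write blocks 0..count-1 in order, syncing after each one."""
    with open(path, "wb") as f:
        for seq in range(count):
            f.write(make_block(seq))
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the block to the device


def surviving_blocks(path: str) -> List[int]:
    """Return the indexes of blocks that hold the expected contents."""
    found = []
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if len(block) < BLOCK_SIZE:
                break
            if block == make_block(index):
                found.append(index)
            index += 1
    return found


def written_in_order(found: List[int]) -> bool:
    # In-order writing means the survivors form a gap-free prefix 0, 1, 2, ...
    return found == list(range(len(found)))
```

In a real experiment the target would be a preallocated file or a raw device, power would be cut while `write_sequence` is running, and the check would be run after reboot. A gap in the surviving blocks indicates reordering.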
BTW, it's not just the drive manufacturers that default to fast
rather than safe; the Linux kernel developers do a similar thing (with
a much smaller performance incentive) when they disable barriers by
default, and when they turned ext3 from data=journal to data=ordered
(letting data=journal rot), and recently to data=writeback (although
that may be just to make ext3 as bad as ext4 so people will not switch back).
Hmm, are Solaris or BSD developers less cavalier about their users' data?
On the subject: UPSs and computer PSUs can fail, too. Better to
recommend dual power supplies with dual UPSs; double failures should
be relatively rare.