Only with a UPS
Posted Sep 10, 2009 2:02 UTC (Thu) by ncm (subscriber, #165)
Anywhere that a power drop is overwhelmingly more likely than a system crash, which includes Most of the Known World, the whole discussion is more or less moot. That does not make the discussion moot overall, though, because people who care about their data can move themselves into the Rest of the Known World by providing a few seconds' UPS backing for the drive. We just need to make clear which part of the world we're talking about.
Posted Sep 10, 2009 6:14 UTC (Thu) by flewellyn (subscriber, #5047)
Posted Sep 10, 2009 17:34 UTC (Thu) by ncm (subscriber, #165)
Second, more subtle but probably more important, drives lie about what is physically on disk. To look good on benchmarks, they tell the controller that sectors have been physically copied to the platter while they are still only in buffer RAM in the drive -- up to several megabytes' worth. A few seconds after the last controller operation, these writes have drained to the disk. Before that, there's no guessing which have been written and which haven't, and blocks the system meant to write first may be written last. As a consequence, after powerup the file system sees blocks that are supposed to have important metadata in them with, instead, whatever was left there.
Posted Sep 10, 2009 18:05 UTC (Thu) by flewellyn (subscriber, #5047)
Posted Sep 10, 2009 19:19 UTC (Thu) by ncm (subscriber, #165)
OS developers don't count power drops among crashes because those aren't their fault. That's commendable, because when they say "crash" they mean something they accept responsibility for.
Posted Sep 10, 2009 22:20 UTC (Thu) by flewellyn (subscriber, #5047)
Handling power drops seems, to me, to be impossible, at least as long as disks lie about when writes actually complete.
Posted Sep 11, 2009 8:29 UTC (Fri) by jschrod (subscriber, #1646)
Posted Sep 10, 2009 18:18 UTC (Thu) by aliguori (subscriber, #30636)
Not really. You can make a disk tell you when data is actually on the platter vs in the write cache. Furthermore, most "enterprise" drives have battery-backed write caches that guarantee enough power for the write caches to be flushed.
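From user space, such a request is typically made with fsync()/fdatasync(); whether the data actually reaches the platter still depends on the kernel issuing a cache flush (barriers) and on the drive honouring it. A minimal sketch, using a hypothetical file name:

    /* Minimal sketch: write a record and ask the storage stack to make it
     * durable.  fdatasync() only reaches the platter if the kernel issues a
     * cache flush and the drive honours it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "important record\n";
        int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, sizeof buf - 1) != (ssize_t)(sizeof buf - 1)) {
            perror("write");
            return 1;
        }
        if (fdatasync(fd) < 0) {  /* blocks until the kernel reports the data durable */
            perror("fdatasync");
            return 1;
        }
        close(fd);
        return 0;
    }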
Posted Sep 10, 2009 19:06 UTC (Thu) by ncm (subscriber, #165)
No, you can ask a disk to tell you. It might even be honest about it if you never get the buffer too full. The commercial incentives to lie for the sake of benchmarks are extremely strong. Drives that don't lie cost a lot more, and are slower. Honesty is an extra-cost option. If you don't pay for honesty (few do) you won't get it.
Honesty usually costs a lot more than a UPS.
Posted Sep 10, 2009 19:12 UTC (Thu) by dlang (✭ supporter ✭, #313)
I think this is a myth, like the drives that use platter energy to power themselves while they write out their buffer.

If you can point to a drive that includes a battery backup on the drive, please post a link to it.
Posted Sep 10, 2009 19:22 UTC (Thu) by ncm (subscriber, #165)
Posted Sep 11, 2009 16:24 UTC (Fri) by nix (subscriber, #2304)
Posted Sep 11, 2009 14:41 UTC (Fri) by anton (guest, #25547)
"First, if power drops during a physical write operation, that sector is scragged. If it was writing metadata, you have serious problems with whatever files that metadata describes, if anything points to that sector."
And a modern file system can protect against the corruption of a sector that was being written when the power dropped. E.g., in a journaling file system, that sector is either in the log or in the permanent storage. If it's in the log, just stop the replay when you encounter the sector. If it's in permanent storage, then you will notice that the replay write fails, and the file system can remap the sector/block to a working one (or the drive might remap it transparently on the replay write, or might just perform the write on the original sector; in these cases the file system has nothing to do).

Of course, if the file system performs only metadata journaling, then it will likely not notice corrupt data (because it is not accessed during replay), but apparently neither the file system maintainer nor the user (or whoever decided to use a metadata journaling file system) cares about data anyway, so that's ok.
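As a rough illustration of that replay logic, a sketch with an entirely hypothetical record layout and helpers (not any particular file system's format); replay stops at the first record whose checksum fails, i.e. the torn write:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical on-disk journal record, for illustration only. */
    struct log_record {
        uint32_t len;        /* payload length in bytes */
        uint32_t checksum;   /* covers the payload */
        uint64_t target;     /* block number the payload belongs to */
        unsigned char payload[];
    };

    static uint32_t checksum32(const unsigned char *p, size_t n)
    {
        uint32_t sum = 0;
        while (n--)
            sum = sum * 31 + *p++;
        return sum;
    }

    /* Replay a journal image; stop at the first torn (checksum-failing) record.
     * Remapping the target block when the replay write fails is left out. */
    static void replay(unsigned char *journal, size_t size,
                       int (*write_block)(uint64_t block, const void *buf, size_t len))
    {
        size_t off = 0;

        while (off + sizeof(struct log_record) <= size) {
            struct log_record *rec = (struct log_record *)(journal + off);

            if (off + sizeof *rec + rec->len > size)
                break;
            if (checksum32(rec->payload, rec->len) != rec->checksum)
                break;  /* torn write: everything before this point is consistent */
            write_block(rec->target, rec->payload, rec->len);
            off += sizeof *rec + rec->len;
        }
    }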
In a copy-on-write file system, the sector either contains the root of the file system, or it contains something written after the last root. In the latter case these blocks are unreachable anyway after recovery (unless there is also an intent log, in which case the discussion above applies). If the root is affected, then on recovery the youngest alternative root is read, giving us the latest consistent state of the file system.
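The root selection can be sketched the same way, again with a hypothetical descriptor layout: skip any candidate root whose checksum failed, and mount from the one with the highest generation number.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical root descriptor, for illustration only. */
    struct root_candidate {
        uint64_t generation;  /* monotonically increasing transaction id */
        uint32_t checksum;    /* covers the rest of the descriptor */
        uint32_t valid;       /* nonzero if the on-disk checksum matched */
    };

    /* Pick the newest root that survived the crash intact. */
    static const struct root_candidate *
    pick_root(const struct root_candidate *roots, size_t n)
    {
        const struct root_candidate *best = NULL;

        for (size_t i = 0; i < n; i++) {
            if (!roots[i].valid)
                continue;                      /* torn or corrupt root: skip it */
            if (!best || roots[i].generation > best->generation)
                best = &roots[i];              /* newest consistent state wins */
        }
        return best;
    }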
"Second, more subtle but probably more important, drives lie about what is physically on disk."
With write caching disabled, the results of my experiments (both in performance and in what was on disk after powering off) are consistent with the theory that the drive reports the completion of writes only after the sector hits the platter and (with the program I used) consequently only wrote the sectors in order.
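A small test along those lines might look like the following sketch (hypothetical file name; O_DIRECT plus a per-block fdatasync to keep the kernel from buffering or reordering). After pulling the plug, one checks whether only a clean prefix of numbered blocks survived:

    #define _GNU_SOURCE   /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 4096

    int main(void)
    {
        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK))   /* O_DIRECT wants aligned I/O */
            return 1;

        int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Write numbered blocks until interrupted (or the power is pulled). */
        for (uint64_t i = 0; ; i++) {
            memset(buf, 0, BLOCK);
            memcpy(buf, &i, sizeof i);            /* stamp the block with its sequence number */
            if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); break; }
            if (fdatasync(fd) < 0) { perror("fdatasync"); break; }
        }
        close(fd);
        return 0;
    }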
BTW, it's not just the drive manufacturers that default to fast rather than safe; the Linux kernel developers do a similar thing (with a much smaller performance incentive) when they disable barriers by default, when they turned ext3 from data=journal to data=ordered (letting data=journal rot), and recently to data=writeback (although that may be just to make ext3 as bad as ext4 so people will not switch back). Hmm, are Solaris or BSD developers less cavalier about their users' data?
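For reference, the safer ext3 settings have to be asked for explicitly; a minimal sketch of doing so from C via mount(2), with placeholder device and mount point:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Placeholders: adjust the device and mount point for the real system.
         * barrier=1 turns write barriers back on; data=journal journals file
         * data as well as metadata. */
        if (mount("/dev/sdb1", "/mnt/data", "ext3", 0,
                  "barrier=1,data=journal") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }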
On the subject: UPSs and computer PSUs can fail, too. Better to recommend dual power supplies with dual UPSs; double failures should be relatively rare.
Posted Sep 10, 2009 12:02 UTC (Thu) by xilun (subscriber, #50638)
I know that my statistical sample is too small (and worse, this was 6 years ago and I don't know if hard drives today are of the same quality), but anyway my first guess is that if the software is careful enough and the hardware of decent quality, the risk of massive data corruption due to a power failure is not too high (at least in the absence of bad system design, like using RAID 5/6 in a power-unsafe context).
Posted Sep 10, 2009 14:13 UTC (Thu) by Cato (subscriber, #7643)
This PC was frequently reset accidentally by the user pressing the power button, which caused at least one data loss event within one year. Since disabling write caching (and a couple of other changes) I've not had any data loss on this PC, but it's probably too early to be sure these changes have fixed the problem.
FWIW, I believe that at least on this setup, disabling write caching helps avoid ext3 and LVM corruption.
Posted Sep 11, 2009 5:28 UTC (Fri) by magnus (subscriber, #34778)
In the past I've had to reboot due to X server hangs (probably problems in the display driver), oopses due to unstable hardware (mainly memory), and sometimes soft hangs, like losing the connection to an NIS or NFS server or getting PAM misconfigured and having no prompt to work from.
Posted Sep 19, 2009 20:30 UTC (Sat) by efexis (guest, #26355)
Posted Sep 20, 2009 7:51 UTC (Sun) by Cato (subscriber, #7643)
Posted Sep 21, 2009 7:20 UTC (Mon) by efexis (guest, #26355)
But of all those, the U is the most important: if it succeeds, it will protect your filesystem. You may end up with some leftover temp files, as tasks that didn't receive the terminate request signal didn't clean up after themselves, but this is usually not too great a cost.
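Assuming the "U" here refers to the SysRq emergency remount-read-only key, the same requests can also be sent through /proc/sysrq-trigger (root only, and only if kernel.sysrq is enabled); a minimal sketch that syncs first and then remounts everything read-only:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Send a SysRq request via /proc/sysrq-trigger:
     * 's' = emergency sync, 'u' = emergency remount read-only. */
    static int sysrq(char key)
    {
        int fd = open("/proc/sysrq-trigger", O_WRONLY);
        if (fd < 0) { perror("open /proc/sysrq-trigger"); return -1; }
        int r = (write(fd, &key, 1) == 1) ? 0 : -1;
        close(fd);
        return r;
    }

    int main(void)
    {
        if (sysrq('s'))   /* flush dirty data first */
            return 1;
        sleep(1);         /* give the sync a moment to complete */
        return sysrq('u') ? 1 : 0;
    }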
Posted Sep 22, 2009 6:13 UTC (Tue) by Cato (subscriber, #7643)
Posted Sep 22, 2009 9:28 UTC (Tue) by efexis (guest, #26355)
Posted Sep 12, 2009 0:35 UTC (Sat) by spitzak (guest, #4593)
While the power is still on, and the disk is spinning and working perfectly, EXT4 has *already* stored information on it saying that the file the atomic rename() went to is empty. The disk is in the wrong state! It is irrelevant whether a power failure may further damage the data!
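The pattern being discussed, sketched with hypothetical file names: write the new contents to a temporary file, fsync() it, and only then rename() it over the old name, so the target never points at an empty file no matter when a crash happens.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Replace path atomically: the new data is made durable before the rename,
     * so a crash leaves either the old or the new contents, never an empty file. */
    int replace_file(const char *path, const char *tmp, const char *data)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;

        if (write(fd, data, strlen(data)) != (ssize_t)strlen(data) ||
            fsync(fd) < 0) {              /* force the new contents to disk first */
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);
        return rename(tmp, path);         /* atomic switch-over */
    }

    int main(void)
    {
        return replace_file("config", "config.tmp", "new settings\n") ? 1 : 0;
    }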
Posted Sep 12, 2009 6:22 UTC (Sat) by ncm (subscriber, #165)
Power loss -> no guarantee?
Posted Sep 17, 2009 11:11 UTC (Thu) by forthy (guest, #1525)
This is wrong. Consider a log-structured, checksummed file system like NILFS. It gathers all writes, writes them out in one go, and checksums every chunk it writes. What happens when power is lost during that write? The checksum is wrong. The previous update isn't touched, so the file system will revert to that last update. All is hunky-dory, all ponies still there, no data lost except the last update, which is exactly the guarantee such a file system makes: you can only depend on data being on disk where its transaction was completely written to disk. And note: writing one sector to a hard disk takes a few microseconds nowadays, so the drive can detect a power outage and stop writing before it randomly scrambles a sector; it might not complete everything, but leaving a garbled sector is avoidable.
On the other argument: in the part of the world where I live (Munich), power outages are far less frequent than crashes. Our file server had some CPU problems two years ago and crashed about once a week. Thanks to the stability of ReiserFS, no data loss occurred during the half year until we found the root cause and replaced the CPUs. Even when not counting hardware defects, I definitely have more crashes than power outages. Frequent power outages happen in poor countries with third-world infrastructure.