POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 16:10 UTC (Wed) by jonth (subscriber, #4008)
Parent article: POSIX v. reality: A position on O_PONIES

An article in the best Reithian tradition: It informed, educated and entertained. Thanks, Valerie.

Only with a UPS

Posted Sep 10, 2009 2:02 UTC (Thu) by ncm (subscriber, #165) [Link]

We can only complain of what was omitted, not what was said. What was omitted, as is unfortunately always omitted from presentations by file system experts howsoever brilliant, is mention of the crucial distinction between crashes and power drops. Disks being the way they are, none of the above really applies to power drops. If power to the drive drops, your file system can offer you no guarantee, O_PONIES support notwithstanding.

Anywhere that a power drop is overwhelmingly more likely than a system crash, which includes Most of the Known World, the whole discussion is more or less moot. That does not make the discussion moot overall, though, because people who care about their data can move themselves into the Rest of Known World by providing a few seconds' UPS backing for the drive. We just need to make clear which part of the world we're talking about.

Only with a UPS

Posted Sep 10, 2009 6:14 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

Could you elaborate more on this distinction?

Only with a UPS

Posted Sep 10, 2009 17:34 UTC (Thu) by ncm (subscriber, #165) [Link]

There are two concerns. First, if power drops during a physical write operation, that sector is scragged. If it was writing metadata, you have serious problems with whatever files that metadata describes, if anything points to that sector.

Second, more subtle but probably more important, drives lie about what is physically on disk. To look good on benchmarks, they tell the controller that sectors have been physically copied to the platter while they are still only in buffer RAM in the drive -- up to several megabytes' worth. A few seconds after the last controller operation, these writes have drained to the disk. Before that, there's no guessing which have been written and which haven't, and blocks the system meant to write first may be written last. As a consequence, after powerup the file system sees blocks that are supposed to have important metadata in them with, instead, whatever was left there.

Only with a UPS

Posted Sep 10, 2009 18:05 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

Okay, but I guess what I'm asking is, what do you classify as a crash, if power drops are not included?

Only with a UPS

Posted Sep 10, 2009 19:19 UTC (Thu) by ncm (subscriber, #165) [Link]

Back about 1998, a Windows user told me that, for him, Windows "hardly ever crashes". Further questioning revealed that he defined "crash" as "I have to re-install". Lockups, a multiple-daily event, didn't count. Generally, though, by "crash" we mean the system stops responding to events, and must be re-started; usually this is a software failure, although all manner of hardware faults can cause it. When these happen, the disk has plenty of time to drain its buffers. Usually the software fault has not caused any disk writes with crazy parameters.

OS developers don't count power drops among crashes because those aren't their fault. That's commendable, because when they say "crash" they mean something they accept responsibility for.

Only with a UPS

Posted Sep 10, 2009 22:20 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

Ah, that makes sense.

Handling power drops, to me, seems to be a matter of impossibility, at least as long as disks lie about when writes actually complete.

Only with a UPS

Posted Sep 11, 2009 8:29 UTC (Fri) by jschrod (subscriber, #1646) [Link]

I'd say a crash happens any time that my system stands still because of some kernel oops, or any time I have to press the reset button because something hangs beyond redemption. (The latter being more often the case in my environment.)

Only with a UPS

Posted Sep 10, 2009 18:18 UTC (Thu) by aliguori (subscriber, #30636) [Link]

<i>Second, more subtle but probably more important, drives lie about what is physically on disk.</i>

Not really. You can make a disk tell you when data is actually on the platter vs in the write cache. Furthermore, most "enterprise" drives have battery-backed write caches that guarantee enough power for the write caches to be flushed.

Only with a UPS

Posted Sep 10, 2009 19:06 UTC (Thu) by ncm (subscriber, #165) [Link]

<i>You can make a disk tell you when data is actually on the platter</i>

No, you can ask a disk to tell you. It might even be honest about it if you never get the buffer too full. The commercial incentives to lie for the sake of benchmarks are extremely strong. Drives that don't lie cost a lot more, and are slower. Honesty is an extra-cost option. If you don't pay for honesty (few do) you won't get it.

Honesty usually costs a lot more than a UPS.

Only with a UPS

Posted Sep 10, 2009 19:12 UTC (Thu) by dlang (subscriber, #313) [Link]

I have yet to see a drive that includes battery backup in it, and I work at a place that spends millions of dollars a year on enterprise-grade storage.

I think this is a myth, like the drives that use platter energy to power themselves while writing out their buffer.

If you can point to a drive that includes a battery backup on the drive, please post a link to it.

Only with a UPS

Posted Sep 10, 2009 19:22 UTC (Thu) by ncm (subscriber, #165) [Link]

I suspect aliguori is referring to disk-array boxes, not drives.

Only with a UPS

Posted Sep 11, 2009 16:24 UTC (Fri) by nix (subscriber, #2304) [Link]

Even then... it's just occurred to me that while I know my Areca RAID
controller has its own battery-backed cache, I have no idea whether it's
asked the drives it controls to turn off *their* internal write cache...
(Obviously we don't want that cache, as it's not battery-backed in any

Only with a UPS

Posted Sep 11, 2009 14:41 UTC (Fri) by anton (subscriber, #25547) [Link]

<i>First, if power drops during a physical write operation, that sector is scragged. If it was writing metadata, you have serious problems with whatever files that metadata describes, if anything points to that sector.</i>

In my experiments on cutting power on disk drives while writing, the drives did not corrupt sectors. I have seen IBM and Maxtor drives corrupt sectors under more unusual power fluctuation circumstances; maybe that's a reason why you can no longer buy drives from IBM or Maxtor; Hitachi (IBM successor) and Seagate-Maxtor (not Seagate proper) are certainly on my don't-buy list.

And a modern file system can protect against the corruption of a single sector:

E.g., in a journaling file system, that sector is either in the log or in the permanent storage. If it's in the log, just stop the replay when you encounter the sector. If it's in permanent storage, then you will notice that the replay write fails, and the file system can remap the sector/block to a working one (or the drive might remap it transparently on the replay write, or might just perform the write on the original sector; in these cases the file system has nothing to do). Of course, if the file system performs only meta-data journaling, then it will likely not notice corrupt data (because it is not accessed during replay), but apparently neither the file system maintainer nor the user (or whoever decided to use a meta-data journaling file system) cares about data anyway, so that's ok.

In a copy-on-write file system, the sector either contains the root of the file system, or it contains something written after the last root. In the latter case these blocks are unreachable anyway after recovery (unless there is also an intent log, in which case the discussion above applies). If the root is affected, then on recovery the youngest alternative root is read, giving us the latest consistent state of the file system.
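The root-selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each root record is assumed to carry a generation counter and a CRC32 over its payload, and the `Root` type, `make_root`, and `pick_root` are hypothetical names, not taken from any particular file system.

```python
import zlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Root:
    generation: int   # commit counter, assumed to increase with every new root
    payload: bytes    # serialized pointer to the tree for this commit
    checksum: int     # CRC32 of payload, stored alongside it

    def is_valid(self) -> bool:
        # A root torn by a power cut will not match its stored checksum.
        return zlib.crc32(self.payload) == self.checksum


def make_root(generation: int, payload: bytes) -> Root:
    """Build a root record with a correct checksum (hypothetical helper)."""
    return Root(generation, payload, zlib.crc32(payload))


def pick_root(candidates: Sequence[Root]) -> Optional[Root]:
    """Return the youngest root whose checksum verifies, or None if none do."""
    valid = [r for r in candidates if r.is_valid()]
    return max(valid, key=lambda r: r.generation, default=None)
```

If the newest root was the sector being written when power dropped, `pick_root` falls back to the previous generation, which is the "latest consistent state" the paragraph refers to.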

<i>Second, more subtle but probably more important, drives lie about what is physically on disk.</i>

In the experiments mentioned above, when the drive had write caching enabled (the default on PATA and SATA drives), the drives not only reported completion right away but, worse, also reordered the writes (so using barriers or turning off write caching is essential for every kind of consistency).

With write caching disabled, the results of my experiments (both in performance and in what was on disk after powering off) are consistent with the theory that the drive reports the completion of writes only after the sector hits the platter and (with the program I used) consequently only wrote the sectors in order.

BTW, it's not just the drive manufacturers that default to fast rather than safe; the Linux kernel developers do a similar thing (with a much smaller performance incentive) when they disable barriers by default, and when they moved ext3 from data=journal to data=ordered (letting data=journal rot), and recently to data=writeback (although that may be just to make ext3 as bad as ext4 so people will not switch back). Hmm, are Solaris or BSD developers less cavalier about their users' data?

On the subject: UPSs and computer PSUs can fail, too. Better to recommend dual power supplies with dual UPSs; double failures should be relatively rare.

Only with a UPS

Posted Sep 10, 2009 12:02 UTC (Thu) by xilun (guest, #50638) [Link]

On the other hand, I've written software that is _highly_ intrusive to the file system (specialized data reordering for HFS+, to be able to resize that fs non-destructively) and tested it about 20 times on non-trivial fs content by unplugging the computer's power cord in the middle of a resize operation, without experiencing a single data corruption (the fs was also always at least quickly recoverable, but recovery wasn't even needed for read-only operations to work properly).

I know that my statistical sample is too small (and worse, this was 6 years ago and I don't know if today's hard drives are of the same quality), but anyway my first guess is that if the software is careful enough and the hardware of decent quality, the risk of massive data corruption due to a power failure is not too high (at least in the absence of bad system design, like using RAID 5/6 in a power-unsafe context).

Only with a UPS

Posted Sep 10, 2009 14:13 UTC (Thu) by Cato (subscriber, #7643) [Link]

Since we are trading anecdotes, here's mine: loss of thousands of files and LVM metadata corruption on a PC using ext3 on top of LVM.

This PC was frequently reset accidentally by the user pressing the power button, which caused at least one data loss event within one year. Since disabling write caching (and a couple of other changes) I've not had any data loss on this PC, but it's probably too early to be sure these changes have fixed the problem.

FWIW, I believe that at least on this setup, disabling write caching helps avoid ext3 and LVM corruption.

Only with a UPS

Posted Sep 11, 2009 5:28 UTC (Fri) by magnus (subscriber, #34778) [Link]

In my experience, system hangs are much more common than power outages for desktop systems.

In the past I've had to reboot due to X server hangs (probably problems in the display driver), oopses due to unstable hardware (memory mainly) and sometimes soft hangs like losing connection to an NIS or NFS server or getting PAM misconfigured and not having a prompt to work from.

Only with a UPS

Posted Sep 19, 2009 20:30 UTC (Sat) by efexis (guest, #26355) [Link]

Alt+PrintScreen+U. Always press it before a reboot; if the kernel hasn't oopsed, it can save you data :-)

Only with a UPS

Posted Sep 20, 2009 7:51 UTC (Sun) by Cato (subscriber, #7643) [Link]

There are some other handy Magic SysRq (i.e. Alt-PrintScreen) keystrokes as well:

Only with a UPS

Posted Sep 21, 2009 7:20 UTC (Mon) by efexis (guest, #26355) [Link]

Although I wouldn't recommend using S(ync) as the third option for rebooting the system, after terminating processes etc. If the system's becoming unstable, syncing the drives is the very first thing I'd want to do. I prefer the order S-E-I-U-B. AFAIA, an S before U is redundant as buffers are written out as part of the remount-ro process, so a separate sync() isn't needed (if anyone knows otherwise please correct me).

But of all those, the U is the most important, as if it succeeds it will protect your filesystem. You may end up with some leftover temp files, because tasks that didn't receive the terminate signal didn't clean up after themselves, but this is usually not too great a cost.
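For anyone who wants the sequence written down, here is a small Python sketch of the S-E-I-U-B order. The letter meanings follow the kernel's SysRq documentation; `plan` and `trigger` are hypothetical helper names. `trigger` writes a single letter to `/proc/sysrq-trigger`, which on a real system needs root and has immediate effect, so the example only prints a dry run.

```python
from typing import List, Tuple

# Meanings of the SysRq command letters discussed in this thread.
SYSRQ_ACTIONS = {
    "r": "take the keyboard out of raw mode",
    "s": "sync all mounted filesystems",
    "e": "send SIGTERM to all processes except init",
    "i": "send SIGKILL to all processes except init",
    "u": "remount all filesystems read-only",
    "b": "reboot immediately, without syncing or unmounting",
}


def plan(order: str = "seiub") -> List[Tuple[str, str]]:
    """Expand a sequence of command letters into (letter, action) pairs."""
    return [(key, SYSRQ_ACTIONS[key]) for key in order.lower()]


def trigger(key: str, path: str = "/proc/sysrq-trigger") -> None:
    """Send one SysRq command letter; needs root on a real system."""
    if key not in SYSRQ_ACTIONS:
        raise ValueError(f"unknown SysRq key: {key!r}")
    # One letter per write, mirroring one keypress per command.
    with open(path, "w") as f:
        f.write(key)


if __name__ == "__main__":
    # Dry run: print the sequence instead of executing it.
    for key, action in plan():
        print(f"Alt+SysRq+{key.upper()}: {action}")
```

Passing `"reisub"` to `plan` gives the longer sequence that starts with R.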


Only with a UPS

Posted Sep 22, 2009 6:13 UTC (Tue) by Cato (subscriber, #7643) [Link]

Yes, personally I prefer the mnemonic Raising Skinny Elephants Is Utterly Boring.

Only with a UPS

Posted Sep 22, 2009 9:28 UTC (Tue) by efexis (guest, #26355) [Link]

Does the R do much? If you're rebooting/etc anyway... if the kernel's able to trap the Alt+SysRq+R, then it can trap the S/E/I/U/B keys too? Or is there another reason for it?

Only with a UPS

Posted Sep 12, 2009 0:35 UTC (Sat) by spitzak (guest, #4593) [Link]

You are wrong.

While the power is still running, and the disk is spinning and working perfectly, EXT4 has *already* stored information on it that says the file that the atomic rename() went to is empty. The disk is in the wrong state! It is irrelevant whether a power failure may further damage the data!
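For context, the application-side pattern at issue is "write the new contents to a temporary file, then rename() it over the old one". Below is a minimal Python sketch of the conservative variant, which calls fsync() before the rename so that the data is requested to be on disk before the name points at it. The function name `atomic_replace` and the temp-file prefix are hypothetical, the directory fsync is POSIX-specific, and none of this helps if the drive misreports flush completion.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace the contents of path via write-temp, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push it to the device
        os.replace(tmp_path, path)   # rename over the destination
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Ask for the directory entry itself to be persisted (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Applications that skip the fsync() call are the ones that relied on the file system ordering the data write before the rename, which is the behaviour being argued about here. Note that `mkstemp` creates the file with restrictive permissions, so a real program may also need to copy the old file's mode.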

Only with a UPS

Posted Sep 12, 2009 6:22 UTC (Sat) by ncm (subscriber, #165) [Link]

You rather miss the point. Given reliable storage -- i.e., doesn't lie about what's reached disk, or has enough battery backup to make sure it gets there, eventually -- it's possible to write a reliable file system. Without, it doesn't matter how well done the file system is, a power drop can corrupt it. If you want safety against power drops, you need both.

Power loss -> no guarantee?

Posted Sep 17, 2009 11:11 UTC (Thu) by forthy (guest, #1525) [Link]

This is wrong. Consider a log-structured, checksummed file system like NILFS. It gathers all writes, writes them out in one go, and checksums every chunk it writes. What happens when power is lost during that write? The checksum is wrong. The previous update isn't touched, so the file system will revert to it. All is hunky dory, all ponies still there, no data lost except the last update - which is the guarantee of such a file system: you can only depend on data being on disk if the transaction containing it was completely written to disk. And note: writing one sector to a hard disk takes a few microseconds nowadays, so the drive can detect a power outage and stop writing before it randomly scrambles a sector - it might not complete everything, but it is possible to avoid leaving a garbled sector.
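The recovery rule described above (discard the update whose checksum does not verify, keep everything before it) can be illustrated with a toy append-only log in Python. This is a sketch under assumptions, not NILFS's actual on-disk format: the record layout (length, CRC32, payload) and the names `append_record` and `recover` are made up for the example.

```python
import struct
import zlib
from typing import List

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def append_record(log: bytearray, payload: bytes) -> None:
    """Append one checksummed record to the in-memory log image."""
    log += HEADER.pack(len(payload), zlib.crc32(payload))
    log += payload


def recover(log: bytes) -> List[bytes]:
    """Return every record before the first torn or corrupt one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # incomplete or damaged write: stop replay here
        records.append(bytes(payload))
        offset = start + length
    return records
```

Truncating the log image in the middle of the last record, or flipping a bit in it, makes `recover` return only the earlier records, which mirrors "no data lost except the last update". It still assumes the storage does not silently reorder or drop earlier writes.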

On the other argument: In the part of the world where I live (Munich), power outages are far less frequent than crashes. Our file server had some CPU problems two years ago and crashed about once a week. Thanks to the stability of ReiserFS, no data loss occurred during the half year until we found the root cause and replaced the CPUs. Even when not including hardware defects, I definitely have more crashes than power outages. Frequent power outages happen in poor countries with third-world infrastructure.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds