
POSIX v. reality: A position on O_PONIES


Posted Sep 9, 2009 22:40 UTC (Wed) by njs (guest, #40338)
In reply to: POSIX v. reality: A position on O_PONIES by alexl
Parent article: POSIX v. reality: A position on O_PONIES

I would be shocked if you could find a filesystem developer who isn't aware of exactly what POSIX requires of rename (i.e. atomicity with respect to other processes viewing the same fs concurrently), and why it works the way it does.

Our standards keep rising, though, and these days people actually care about what happens to data over crashes -- and POSIX's primitives *suck* for this, unless you are writing a giant database with dedicated storage.

In this discussion, whenever kernel folks talk about developers coming to depend on rename's atomicity, I'm pretty sure they're talking about its atomicity with respect to crashes. (For instance, I believe Subversion's backend format uses atomic-rename for reliability over crashes, because fsync is just untenable.)
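The pattern njs refers to can be sketched as follows (a minimal illustration with hypothetical filenames, not Subversion's actual code): the application writes a complete new copy of the file and renames it over the old one, so anyone opening the file only ever sees the old contents or the new contents, never a half-written mix.

```python
import os
import tempfile

def atomic_replace(path, data):
    # Write the new contents to a temporary file in the same directory,
    # so that the final rename() stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    # rename() atomically replaces the target: concurrent readers see
    # either the old file or the new one, never a partial write.
    os.rename(tmp, path)
```

Note there is no fsync() here: this gives atomicity with respect to other processes, which is all POSIX promises; whether it also survives a crash is exactly what the rest of this thread argues about.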


crash vs. power drop

Posted Sep 10, 2009 2:15 UTC (Thu) by ncm (subscriber, #165) [Link]

I suspect that many readers, perhaps most, are equating "crash" with "power drop".

How often do any of us see crashes on production systems, nowadays? Things being the way they are, power drops are overwhelmingly more likely on your typical desktop or rack installation, just because a UPS is an extra expense and crippling data loss isn't especially likely even without. "All of the above" is meaningful only if power doesn't drop unexpectedly, and, sadly, that needs to be repeated every time.

crash vs. power drop

Posted Sep 10, 2009 6:12 UTC (Thu) by njs (guest, #40338) [Link]

Good point, though with two caveats: 1) Even if, as you've claimed, power drops can cause corruption to sectors that are in the middle of being written (it seems plausible), it isn't clear to me how frequently this happens. Your average drive probably spends most of its time seeking, for instance. 2) In principle, there's no reason a fs couldn't maintain guarantees even over events like that, though I don't know whether any current ones do.

crash vs. power drop

Posted Sep 10, 2009 17:50 UTC (Thu) by ncm (subscriber, #165) [Link]

Scragged sectors are a risk, but not the main risk, of power drops. What's worse is blocks still in buffer RAM on the drive that the system thought were already physically on disk, because the drive told it so.

When the drives lie to the file system, the file system can't guarantee anything. Most drives (i.e. the ones you have) lie. Drives that don't lie cost way more, and are slower; it's much cheaper to add some battery backup, and even some redundancy, than to buy the expensive drives. There's no point in discussing fine points of data ordering if you haven't got one or the other.

crash vs. power drop

Posted Sep 11, 2009 6:27 UTC (Fri) by zmi (guest, #4829) [Link]

> Second, more subtle but probably more important, drives lie about what is physically on disk.

That's why in a RAID you really *must* turn off the drive's write cache.
I've tried to explain that in the XFS FAQ, and also in the questions below
that one.

In short: I've got new 2TB WD drives with 64MB cache each, and we intend to
use them in a RAID. Take 16 of these drives and that adds up to 1024MB (1GB)
of write cache. So in the worst case,
1) you've got a UPS, but your power supply fails and the PC/server is out
of power
2) the drives have their caches full, so up to 1GB of data that the
filesystem believed was on disk is lost. There's a *very* high chance that
lots of metadata is included in the cached writes.
3) each of the 16 drives could write "half sectors", effectively destroying
both the previous and the new content.

In all this discussion, it would have been worth noting that if you really
*care* about your data, you *must* turn off the drive write cache. Yes,
power failures are rare in countries with a good power grid. Still, I use
a UPS, and in the last half year I had
a) my daughter playing around and turning the power of the server off
b) a dead power supply in my workstation
So even with a UPS, "drive write cache off" is a must. Simply put a
hdparm -W0 /dev/sda
in your boot scripts.

Note that this only helps with 1) and 2); for problem 3) there's nothing
anybody but the disk manufacturers can do. I must say that I have no
evidence of ever having hit that problem. It may be what happened in cases
of "strange filesystem problems" after a crash, but you can't tell for sure.

As for the rename: really, there should only ever be the chance of ending
up with either the old file or the new one, and the filesystem should
ensure this even in crash situations.

Note: In Linux you can tune writeback behaviour in /etc/sysctl.conf:
# start writeback at around 16MB, max. 250MB
vm.dirty_background_bytes = 16123456
vm.dirty_bytes = 250123456
# older kernels had this:
#vm.dirty_background_ratio = 5
#vm.dirty_ratio = 10
# write blocks to disk after 1 second (default: 3000 centisecs = 30 seconds)
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100

Note that dirty_bytes/dirty_ratio is the point at which processes making new
writes are blocked until dirty data is flushed. On systems with 8GB RAM or
more, the ratio-based defaults could leave you with gigabytes of dirty data
in the cache.

Sorry for putting all in one post, but I hope it helps people who care
about their data to have some tunings to start with.

mfg zmi

crash vs. power drop

Posted Sep 11, 2009 11:40 UTC (Fri) by Cato (subscriber, #7643) [Link]

Thanks for explaining all that.

On the topic of writing 'half sectors' due to a power drop: the author of has done quite a lot of tests on various hard drives, and generally found that they usually don't do this, though some do. He has a useful program that can test any drive for this behaviour, though it's mostly intended to detect out-of-order writes due to caching - I believe only some drives lie about this.

crash vs. power drop

Posted Sep 14, 2009 14:13 UTC (Mon) by nye (guest, #51576) [Link]

This is why barriers *should* be a good thing.

If all you care about is benchmark results, there's an obvious incentive for the drive to claim falsely that some data has been physically written to disk.

On the other hand, there's far less incentive to lie about barriers. If all you're saying is that '*if and when* this data has been written, then that other data will have also been written', you can still happily claim that it's all done when it isn't, without breaking the barrier commitment.

When you have that commitment, it's possible to build a transaction system upon it which works even under the assumption that the drive will lie. You're not going to achieve the full benchmark speed, but it's going to be far better than turning off the cache.
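The ordering contract nye describes can be illustrated with a toy model (entirely hypothetical, not a real drive interface): the simulated drive is free to reorder cached writes, and free to lie about completion, but it never moves a write across a barrier.

```python
import random

BARRIER = object()  # sentinel marking a barrier in the write queue

def drive_commit_order(queue, rng):
    """Simulate a lying drive: writes within a barrier-delimited
    segment commit in arbitrary order, but never cross a barrier."""
    committed, segment = [], []
    for op in queue:
        if op is BARRIER:
            rng.shuffle(segment)       # reorder freely within the segment
            committed += segment
            segment = []
        else:
            segment.append(op)
    rng.shuffle(segment)
    return committed + segment

rng = random.Random(0)
order = drive_commit_order(["data1", "data2", BARRIER, "metadata"], rng)
# Whatever the shuffle did, "metadata" lands after both data blocks.
assert order.index("metadata") > max(order.index("data1"),
                                     order.index("data2"))
```

That "lands after" invariant is all a journalling filesystem needs to keep its commit record from overtaking the data it describes, which is why a drive can honour barriers cheaply while still caching aggressively.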

Of course, whether drive manufacturers see it that way is another matter. Is there any data on whether drives actually honour write barriers? It would be interesting to see if there are indeed drives that aren't expensive enough to report accurately on when data has been written, but still honour the barrier request.

crash vs. power drop

Posted Sep 14, 2009 15:06 UTC (Mon) by ncm (subscriber, #165) [Link]

This is where Scott Adams's "Which is more likely" principle is useful. We just frame the question, thus:

Drive manufacturer A can sell almost as many drives made this way as that way, but "that way" costs more development time and might make the drive come out a little slower in benchmarks. Some purchase decisions depend on claiming it's made "that way". The manufacturer can make it "that way", or just say it is when it isn't. Which is more likely?

crash vs. power drop

Posted Sep 14, 2009 16:12 UTC (Mon) by nye (guest, #51576) [Link]

A fair point, but in all things there's a strong economic incentive to claim that a product does something it doesn't, if that will make more people buy it, and yet most things don't claim to do something which is simply factually untrue.

Usually when it's a feature that either works or doesn't - so it's not a subjective measurement - it's likely that a product does at least technically do what it says it does.

Presumably the argument is that the manufacturers aren't specifically claiming a particular feature, but the disk is behaving in a particular way that just happens to be not what the user expected - so they're not technically lying. This does seem to weaken the idea that they're doing it to improve the chances of people buying it though, if they're not stating it as a feature.

Just out of interest, I've just spent a while trying to see if I can find out what the difference is between the Samsung HE502IJ and HD502IJ - two drives which are identical on paper, but one is sold as 'RAID-class'. Neither are even remotely expensive enough not to lie about their actions, so what's the difference? Well, some forum post claims that one has a 'rotational vibration sensor', whatever that means.

In conclusion, people who try to sell you things are all liars and cheats, and I intend to grow a beard and live out the rest of my days as a hermit, never having to worry about these things again. Perhaps I shall raise yaks.

crash vs. power drop

Posted Sep 15, 2009 12:52 UTC (Tue) by Cato (subscriber, #7643) [Link]

I believe that TLER (time limited error recovery) is one characteristic of enterprise drives - it simply means fewer retries, so that the RAID controller or OS sees a drive failure more quickly, and knows to use redundancy to complete the I/O.

crash vs. power drop

Posted Sep 15, 2009 12:49 UTC (Tue) by Cato (subscriber, #7643) [Link]

Maybe the best way round this is to make sure that performance benchmarks always include a 'reliability benchmark' that detects drives which are lying about writes to hard disk.

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 8:24 UTC (Thu) by alexl (guest, #19068) [Link]

I also think it's unlikely that kernel people are not aware of the atomicity of rename. However, almost every post, including this one, avoids mentioning why rename is actually used, and instead uses handwavy notions about application authors depending on rename because we somehow got hooked on it via ext3 behaviour, which is far from the truth.

I'm all for a reasonable API addition to implement O_PONIES, and would implement support for it in the stuff I work on (glib, gio, etc) the second it was available. However, all existing applications already use rename() to do atomic replacement for reasons unrelated to system crashes, so why not just make all these applications work without additional changes, at little cost in performance?

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 8:52 UTC (Thu) by dlang (subscriber, #313) [Link]

Define 'a little cost in performance' that you (and everyone else) would be willing to lose.

Doing an fsync on ext3 (what the ext maintainers believe is necessary to get data safely to disk) can take several seconds. If you want a rename to provide that sort of guarantee, you need to be willing to pay that sort of cost for every rename.

Ext3 never provided the guarantees that people think it did. It just happened to work if you didn't crash too soon after doing a rename.

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 9:04 UTC (Thu) by alexl (guest, #19068) [Link]

Here we go again...

NO NO NO NO. We do not need/want the file to be fsynced.

Why do people keep repeating this fallacy? We all know that fsync is expensive, and don't want to use it, or something with similar semantics.

What we want is something that gives us the natural behavior of rename() replace (atomically get either the old or the new file) and extend it to a system crash. This does not imply a fsync, but rather that the data for the new file is on disk before the metadata is on disk. This is much cheaper than an fsync because it does not require the data to be written immediately, but rather that we have to delay the write of the metadata until the data has been written. Thus "little cost in performance", at least in relation to fsync.

And then you write "ext3 never provided the guarantees that people think it did" when my whole point has been about how everyone gives this reason for why people use rename when it's not actually the reason! I am well aware that rename() does not give me system crash safety; I use it for other reasons. However, I *would* like it if this common operation, which has been in use for decades before ext3 was written, were also recognized by ext3 and made even more useful (even though this is in no way guaranteed by POSIX).

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 16:37 UTC (Thu) by nye (guest, #51576) [Link]

I have noticed a tendency in this discussion (I don't mean the responses to this article, but the overall discussion) that the 'POSIX-fundamentalist' faction is unwilling or unable to accept that saying 'I want A to happen iff B happens' is *not* the same as saying 'I want a guarantee that A and B happen'.

POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 20:38 UTC (Thu) by HelloWorld (guest, #56129) [Link]

If you want both the write and the rename to happen, you'd have to fsync() the file *and* the directory. Which means that the open(), write(), fsync(the_file), close(), rename() sequence provides exactly the semantics you describe.
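The sequence HelloWorld describes can be sketched at the os level in Python (filenames hypothetical); note the extra fsync() on the containing directory at the end, which is what makes the rename itself durable:

```python
import os

def durable_replace(path, data):
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)           # force the file data to disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)       # atomically swap in the new file
    # fsync the directory so the rename (directory entry) is on disk too
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

This is the expensive "both the write and the rename are durable" variant; the whole complaint upthread is that applications wanting only old-file-or-new-file semantics are forced to pay for the fsync(the_file) step anyway.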

POSIX v. reality: A position on O_PONIES

Posted Sep 21, 2009 1:52 UTC (Mon) by efexis (guest, #26355) [Link]

That's the point though... that's /not/ what people in the discussion want, or are asking for.

POSIX v. reality: A position on O_PONIES

Posted Sep 21, 2009 13:45 UTC (Mon) by nye (guest, #51576) [Link]


POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 16:01 UTC (Thu) by forthy (guest, #1525) [Link]

I really wonder why all this "data=ordered" stuff is said to cost performance. If implemented right, it must improve performance. All you want to do is the following: Push data into the write buffer. Push metadata into the metadata write buffer. Push freed blocks into the freed blocks buffer (but don't actually free them). If your buffers are full, there's no free block around any more, or a timer expires, do the following:

  1. Write out data.
  2. Write out metadata (first to journal, then to the actual file system).
  3. Actually free the blocks from the freed block list

You only have to write data once - new files go to newly allocated blocks which don't appear in the metadata when you write them (they are still marked as free in the on-disk data). For files with in-place writes, we usually don't care (there are many race conditions for writing in-place, so the general usage pattern is not to do that if you care about your data). For crash-resilient systems, you want to write your metadata twice (once into a journal, once into the file system), order it (ordered metadata updates), or use a COW/log structured file system, where you write a new file system root (snapshot) on every update round. While you are writing data from your buffers, open up new buffers for the OS to be used as buffers for the next round (double-buffering strategy). This double buffering should be a common part of the FS layer, because it will be used in all major file systems.
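The flush discipline forthy outlines can be modelled as a toy (hypothetical structure, not real filesystem code): everything is buffered, and a flush always commits data first, then metadata, and only then actually frees the old blocks.

```python
class OrderedWriteback:
    """Toy model of the scheme above: buffer data, metadata and block
    frees separately, then flush them in a fixed order."""
    def __init__(self):
        self.data, self.metadata, self.freed = [], [], []
        self.log = []                 # records what hit "disk", in order

    def write(self, block):   self.data.append(block)
    def update_meta(self, m): self.metadata.append(m)
    def free(self, block):    self.freed.append(block)

    def flush(self):
        # 1. Data first: new blocks are still marked free in the on-disk
        #    metadata, so a crash here loses only the new data.
        self.log += [("data", b) for b in self.data]
        # 2. Metadata next (journal first, then the filesystem proper).
        self.log += [("meta", m) for m in self.metadata]
        # 3. Only now return the old blocks to the free list.
        self.log += [("free", b) for b in self.freed]
        self.data, self.metadata, self.freed = [], [], []

fs = OrderedWriteback()
fs.write("blk7")
fs.update_meta("inode42 -> blk7")
fs.free("blk3")
fs.flush()
assert [kind for kind, _ in fs.log] == ["data", "meta", "free"]
```

The invariant the ordering buys you: at no point on "disk" does metadata reference a block whose contents haven't been written, and no block is reused while old metadata might still point at it, which is exactly why a crash leaves you with the old file or the new one.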

POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 16:41 UTC (Thu) by dlang (subscriber, #313) [Link]

The problem with your approach is that various pieces (including the hard drive itself) will re-order anything in their buffers to shorten the total time it takes to get everything in the buffer to disk.

That is why barriers are needed to tell the device not to reorder across the buffer.

Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds