
POSIX v. reality: A position on O_PONIES


Posted Sep 10, 2009 9:04 UTC (Thu) by alexl (subscriber, #19068)
In reply to: POSIX v. reality: A position on O_PONIES by dlang
Parent article: POSIX v. reality: A position on O_PONIES

Here we go again...

NO NO NO NO. We do not need/want the file to be fsynced.

Why do people keep repeating this fallacy? We all know that fsync() is expensive; we don't want to use it, or anything with similar semantics.

What we want is something that takes the natural behavior of a rename() replace (you atomically get either the old file or the new file) and extends it across a system crash. This does not imply an fsync; it only requires that the new file's data reach the disk before the metadata does. That is much cheaper than an fsync, because the data does not have to be written immediately -- rather, the write of the metadata has to be delayed until the data has been written. Hence "little cost in performance", at least relative to fsync.
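The pattern under discussion can be sketched as follows (a minimal illustration using Python's `os` wrappers around the POSIX calls; the filenames are invented for the example). The key point is what is *absent*: there is no fsync() anywhere.

```python
# Sketch of the atomic-replace idiom: write the new contents to a
# temporary file, then rename() it over the target.  Readers always
# see either the complete old file or the complete new file.
import os

def replace_file(path, data):
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    # Atomic with respect to other processes; whether it also survives
    # a system crash intact is exactly what this thread is arguing about.
    os.rename(tmp, path)

replace_file("example.txt", b"new contents\n")
```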

And then you write "ext3 never provided the guarantees that people think it did", when my whole point has been that everyone cites this as the reason people use rename() when it's not actually the reason! I am well aware that rename() does not give me crash safety; I use it for other reasons. However, I *would* like it if this common operation, in use for decades before ext3 was written, were also recognized by ext3 and made even more useful (even though this is in no way guaranteed by POSIX).



POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 16:37 UTC (Thu) by nye (subscriber, #51576) [Link] (3 responses)

I have noticed in this discussion (not the responses to this article specifically, but the overall discussion) a tendency for the 'POSIX-fundamentalist' faction to be unwilling or unable to accept that saying 'I want A to happen iff B happens' is *not* the same as saying 'I want a guarantee that A and B both happen'.

POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 20:38 UTC (Thu) by HelloWorld (guest, #56129) [Link] (2 responses)

If you want both the write and the rename to happen, you'd have to fsync() the file *and* the directory, which means that the open(), write(), fsync(the_file), close(), rename() sequence provides exactly the semantics you describe.
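For contrast with the fsync-free version, the fully durable sequence described here can be sketched like this (again via Python's `os` wrappers; filenames invented). Note that durability of the rename itself requires an fsync() of the containing directory, since the rename is a directory-metadata change -- this is the expensive variant the earlier commenter explicitly does *not* want.

```python
# Durable replace: fsync() the data before rename(), then fsync() the
# directory so the rename is on disk too.
import os

def durable_replace(path, data):
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # force the new file's data to disk now
    finally:
        os.close(fd)
    os.rename(tmp, path)
    # fsync the directory so the rename (a metadata change) is durable.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

durable_replace("example.txt", b"new contents\n")
```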

POSIX v. reality: A position on O_PONIES

Posted Sep 21, 2009 1:52 UTC (Mon) by efexis (guest, #26355) [Link]

That's the point though... that's /not/ what people in the discussion want, or are asking for.

POSIX v. reality: A position on O_PONIES

Posted Sep 21, 2009 13:45 UTC (Mon) by nye (subscriber, #51576) [Link]

*speechless*

POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 16:01 UTC (Thu) by forthy (guest, #1525) [Link] (1 responses)

I really wonder why all this "data=ordered" stuff is said to cost performance. If implemented right, it should improve performance. All you want to do is the following: push data into the write buffer, push metadata into the metadata write buffer, and push freed blocks into the freed-blocks buffer (but don't actually free them). When your buffers are full, no free block remains, or a timer expires, do the following:

  1. Write out data.
  2. Write out metadata (first to journal, then to the actual file system).
  3. Actually free the blocks from the freed block list

You only have to write data once: new files go to newly allocated blocks which don't appear in the metadata when you write them (they are still marked as free in the on-disk data). For files written in place, we usually don't care (in-place writes have many race conditions anyway, so the general usage pattern is to avoid them if you care about your data).

For crash-resilient systems, you want to write your metadata twice (once into a journal, once into the file system), order it (ordered metadata updates), or use a COW/log-structured file system, where you write a new file system root (snapshot) on every update round. While you are writing data from your buffers, open up new buffers for the OS to use for the next round (a double-buffering strategy). This double buffering should be a common part of the FS layer, because it will be used in all major file systems.
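The flush ordering described above can be illustrated with a toy model (this is not kernel code; the class and names are invented purely to show the ordering constraint: data first, then journal, then in-place metadata, and only then the deferred frees).

```python
# Toy model of ordered write-out: three buffers flushed in a fixed
# order, with a log recording the order things hit "disk".
class OrderedBuffers:
    def __init__(self):
        self.data, self.metadata, self.to_free = [], [], []
        self.log = []

    def flush(self):
        # 1. Data first, so any block the new metadata references is
        #    valid on disk before that metadata becomes visible.
        for block in self.data:
            self.log.append(("data", block))
        # 2. Metadata: journal first, then the actual file system.
        for md in self.metadata:
            self.log.append(("journal", md))
        for md in self.metadata:
            self.log.append(("metadata", md))
        # 3. Only now actually free the old blocks; until this point
        #    they still hold valid pre-crash contents.
        for block in self.to_free:
            self.log.append(("free", block))
        self.data, self.metadata, self.to_free = [], [], []

buf = OrderedBuffers()
buf.data.append("block 42")
buf.metadata.append("inode 7 -> block 42")
buf.to_free.append("block 13")
buf.flush()
```

A crash at any point in this sequence leaves either the old state or the new state recoverable, which is the property the data=ordered discussion is after.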

POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 16:41 UTC (Thu) by dlang (guest, #313) [Link]

The problem with your approach is that the various pieces involved (including the hard drive itself) will reorder anything in their buffers to shorten the total time it takes to get everything to disk.

That is why barriers are needed: to tell the device not to reorder writes across the barrier.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds