define 'a little cost in performance' that you (and everyone else) would be willing to lose.
doing a fsync on ext3 (what the ext maintainers believe is necessary to get the data safely to disk) can take several seconds. if you want a rename to provide that sort of guarantee you need to be willing to pay that sort of cost for every rename.
ext3 never provided the guarantees that people think it did. it just happened to work if you didn't crash too soon after doing a rename.
Posted Sep 10, 2009 9:04 UTC (Thu) by alexl (subscriber, #19068)
Here we go again...
NO NO NO NO. We do not need/want the file to be fsynced.
Why do people keep repeating this fallacy? We all know that fsync is expensive, and don't want to use it, or something with similar semantics.
What we want is something that gives us the natural behavior of rename() replace (atomically get either the old or the new file) and extend it to a system crash. This does not imply a fsync, but rather that the data for the new file is on disk before the metadata is on disk. This is much cheaper than an fsync because it does not require the data to be written immediately, but rather that we have to delay the write of the metadata until the data has been written. Thus "little cost in performance", at least in relation to fsync.
And then you write "ext3 never provided the guarantees that people think it did" when my whole point has been about how everyone gives this reason for why people use rename when it's not actually the reason! I am well aware that rename() does not give me system crash safety; I use it for other reasons. However, I *would* like it if this common operation, which had been in use for decades before ext3 was written, was also recognized by ext3 and made even more useful (even though this is in no way guaranteed by POSIX).
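For readers who have not seen it, the rename()-replace pattern described above can be sketched roughly as follows. This is only an illustration, not anyone's actual code: the function name replace_without_fsync and the ".tmp-" prefix are made up for the example. Note that it deliberately contains no fsync, so it only guarantees that concurrent readers see either the old or the new file; what survives a system crash depends on the file system.

```python
import os
import tempfile


def replace_without_fsync(path, data):
    """Write data to a temporary file next to path, then rename it over path.

    Hypothetical helper for illustration. No fsync is issued, so crash
    behavior is whatever the file system happens to provide.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same file system),
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Atomically swap the new file into place; readers see old or new,
        # never a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave the temporary file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The request in the comment is that the file system order its own writeback so that the data of the new file reaches disk before the rename does, without the application adding anything to this sequence.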
POSIX v. reality: A position on O_PONIES
Posted Sep 10, 2009 16:37 UTC (Thu) by nye (guest, #51576)
I have noticed a tendency in this discussion (I don't mean the responses to this article, but the overall discussion) that the 'POSIX-fundamentalist' faction is unwilling or unable to accept that saying 'I want A to happen iff B happens' is *not* the same as saying 'I want a guarantee that A and B happen'.
POSIX v. reality: A position on O_PONIES
Posted Sep 17, 2009 20:38 UTC (Thu) by HelloWorld (guest, #56129)
If you want both the write and the rename to happen, you'd have to fsync() the file *and* the directory. Which means that the open(), write(), fsync(the_file), close(), rename() sequence provides exactly the semantics you describe.
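A minimal sketch of the sequence named above, assuming a POSIX system where a directory can be opened and fsync()ed. The function name durable_replace, the ".tmp" suffix and the sync_dir flag are hypothetical and only there for illustration. With sync_dir=False it is the open(), write(), fsync(the_file), close(), rename() sequence; with sync_dir=True it additionally fsyncs the directory, which is the case where both the write and the rename are forced to disk.

```python
import os


def durable_replace(path, data, sync_dir=False):
    """open/write/fsync/close/rename, optionally followed by a directory fsync.

    Hypothetical helper for illustration; assumes POSIX semantics.
    """
    tmp_path = path + ".tmp"  # hypothetical temporary name, same directory
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Force the file's data to disk before the rename is issued.
        os.fsync(fd)
    finally:
        os.close(fd)
    os.rename(tmp_path, path)
    if sync_dir:
        # Force the directory entry (the rename itself) to disk as well.
        directory = os.path.dirname(os.path.abspath(path))
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The fsync(fd) call is the expensive step that the earlier comments are objecting to.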
POSIX v. reality: A position on O_PONIES
Posted Sep 21, 2009 1:52 UTC (Mon) by efexis (guest, #26355)
That's the point though... that's /not/ what people in the discussion want, or are asking for.
POSIX v. reality: A position on O_PONIES
Posted Sep 21, 2009 13:45 UTC (Mon) by nye (guest, #51576)
*speechless*
POSIX v. reality: A position on O_PONIES
Posted Sep 17, 2009 16:01 UTC (Thu) by forthy (guest, #1525)
I really wonder why all this "data=ordered" stuff is said to cost
performance. If implemented right, it must improve performance. All you
want to do is the following: Push data into the write buffer. Push
metadata into the metadata write buffer. Push freed blocks into the freed
blocks buffer (but don't actually free them). If your buffers are full,
there's no free block around any more, or a timer expires, do the
following:
1. Write out data.
2. Write out metadata (first to journal, then to the actual file system).
3. Actually free the blocks from the freed block list.
You only have to write data once - new files go to newly allocated
blocks which don't appear in the metadata when you write them (they are
still marked as free in the on-disk data). For files with in-place
writes, we usually don't care (there are many race conditions for writing
in-place, so the general usage pattern is not to do that if you care about
your data). For crash-resilient systems, you want to write your metadata
twice (once into a journal, once into the file system), order it (ordered
metadata updates), or use a COW/log structured file system, where you
write a new file system root (snapshot) on every update round. While you
are writing data from your buffers, open up new buffers for the OS to be
used as buffers for the next round (double-buffering strategy). This
double buffering should be a common part of the FS layer, because it will
be used in all major file systems.
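The flush round described above can be modelled in a few lines. This is a toy, in-memory sketch and not how any real file system is written: the class ToyFS and its method names are invented for the example, "disk" is just a list recording the order in which things would be written, and it assumes nothing below it reorders the writes.

```python
class ToyFS:
    """Toy model of the three buffers and the ordered flush round."""

    def __init__(self):
        self.data_buf = []   # (block, payload) pairs waiting to be written
        self.meta_buf = []   # metadata records waiting to be written
        self.freed_buf = []  # blocks released but not yet actually freed
        self.log = []        # order in which things would reach the disk

    def write(self, block, payload):
        self.data_buf.append((block, payload))

    def update_metadata(self, record):
        self.meta_buf.append(record)

    def release(self, block):
        self.freed_buf.append(block)

    def flush(self):
        # Double buffering: take the full buffers and hand fresh, empty ones
        # to callers for the next round.
        data, meta, freed = self.data_buf, self.meta_buf, self.freed_buf
        self.data_buf, self.meta_buf, self.freed_buf = [], [], []
        # 1. Data first.
        for block, _payload in data:
            self.log.append(("data", block))
        # 2. Metadata: journal copy, then the in-place copy.
        for record in meta:
            self.log.append(("journal", record))
        for record in meta:
            self.log.append(("metadata", record))
        # 3. Only now may the released blocks be reused.
        for block in freed:
            self.log.append(("free", block))
```

A round that writes block 7, records an inode update and releases block 3 leaves the log ordered as data, journal, metadata, free, regardless of the order in which the three calls were made.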
POSIX v. reality: A position on O_PONIES
Posted Sep 17, 2009 16:41 UTC (Thu) by dlang (✭ supporter ✭, #313)
the problem with your approach is that various pieces (including the hard drive itself) will re-order anything in its buffer to shorten the total time it takes to get everything in the buffer to disk.
that is why barriers are needed: they tell the device not to reorder writes across the barrier.
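The reordering problem can be shown with a toy model. This is not a kernel or drive interface: BARRIER and device_order are names invented for the example, and the "device" simply sorts queued writes by block number, standing in for whatever seek optimisation a real drive performs.

```python
BARRIER = object()  # hypothetical marker: no reordering across this point


def device_order(queue):
    """Return the order in which a toy device commits the queued writes.

    Writes are (block_number, kind) tuples. The device sorts by block number
    to shorten seeks, but only within a batch delimited by BARRIER markers.
    """
    committed, batch = [], []
    for item in queue:
        if item is BARRIER:
            # Everything queued before the barrier is committed first.
            committed.extend(sorted(batch))
            batch = []
        else:
            batch.append(item)
    committed.extend(sorted(batch))
    return committed


# Data lives at a high block number, metadata at a low one.
without_barrier = device_order([(900, "data"), (10, "metadata")])
with_barrier = device_order([(900, "data"), BARRIER, (10, "metadata")])
```

Without the barrier the toy device commits the metadata at block 10 before the data at block 900, which is exactly the ordering that loses data in a crash; with the barrier the data goes first.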