Let me repost here something I posted to Theodore Ts'o's blog.
With that out of the way, let's talk about rename. It should create a conceptual write barrier for the data blocks of the file involved. It's not inventing a filesystem semantic out of thin air any more than writing a zero-length file is: POSIX doesn't say much at all about what happens after a crash, and so this whole discussion is uncharted territory. It'd be perfectly fine for a POSIX system to overwrite all your files with pictures of donuts on an unclean shutdown. This is not an issue of standards conformance: it is a quality-of-implementation issue. The standard allowing you to do something terrible isn't an excuse for that behavior. It's like saying, "Yes, it's perfectly fine that I live off of Crisco and tequila. The law allows it!"
Now, first of all: there's a lot of historical precedent for rename writing data blocks before metadata: not only does ext3 do it, but many older filesystems too. Certainly, many programs are written under the assumption that my rename semantics hold: and these programs work fine (in fact, better) on a running system.
Second, your rename behavior will lead to bugs now and forever: open-write-close-rename will work just fine on a running system, and there's a good chance it'll appear to work even if the developer takes the unusual step of testing during a system crash. Because this sequence will seem to work just fine most of the time, plenty of programs will have hidden data-loss bugs. That's not a world I want to live in.
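For concreteness, here's a minimal sketch of that pattern, with illustrative file names; this is roughly the code countless applications already contain:

```c
/* The open-write-close-rename pattern: write the new version to a
 * temporary file, then rename it over the old one. On a running
 * system, readers see either the old contents or the new contents,
 * never a half-written file. */
#include <stdio.h>

int replace_file(const char *data, size_t len)
{
    FILE *f = fopen("config.tmp", "w");
    if (!f)
        return -1;
    if (fwrite(data, 1, len, f) != len) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)
        return -1;
    /* No fsync here: the developer assumes rename orders the data
     * blocks before the metadata, as ext3 does. */
    return rename("config.tmp", "config");
}
```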
On a running system, of course, a rename is atomic with respect to both the filename and its contents; otherwise it'd be useless. Under your semantics, however, you've effectively made rename without fsync a useless and dangerous, yet very conceptually tempting, operation. Scolding application programmers to insert fsync calls will lead to confusion and frustration: fsync, as "make the data hit the disk now", doesn't have anything conceptually to do with atomic replacement except as an arcane filesystem implementation detail. Anything that appears to work in the typical case, but that does something dangerous in special corner cases, is broken by design.
When an application performs a rename, the *INTENT* is to insert a write barrier for the data blocks of the file involved. When is this interpretation ever wrong? When is it ever useful to be able to tell the system to replace a file's contents and its name, except when the system crashes, in which case you just want its name and some garbage? Nobody ever wants that.
But for the sake of argument, let's bite our tongue and insert this fsync. The system can't tell the difference between an fsync intended to ensure, say, message receipt, and an fsync that ensures after-crash consistency across a rename. Because we're blocking and waiting for disk I/O, application latency greatly increases (by up to three seconds, apparently). Users begin to complain. Now, the application developer has two choices: either implement the threaded solution you mention, or remove the fsync.
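Concretely, the sequence you're asking for looks something like this sketch (names illustrative, error paths abbreviated):

```c
/* The demanded open-write-fsync-close-rename sequence. The fsync
 * blocks until the disk acknowledges the write: that is where the
 * multi-second latency comes from. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file_durably(const char *data, size_t len)
{
    int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return rename("config.tmp", "config");
}
```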
The threaded solution gives the correct behavior, but is horribly complicated, or requires libraries on which the developer might not want to depend, especially for a small operation. Look what you've done now: not only have you made the correct code non-obvious, but you've made the correct code under-performing as well. It's absolutely ludicrous to expect every program that wants to correctly replace a file's contents to spawn off a worker thread.
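To see how much machinery the threaded workaround drags in, here's a sketch using POSIX threads; all the names are invented for illustration:

```c
/* Push the blocking fsync-and-rename onto a detached worker thread
 * so the main thread never stalls. Error reporting, job ordering,
 * and shutdown handling are all left to the unlucky developer. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct rename_job {
    int fd;            /* temp file, already written */
    char *tmp, *dst;   /* heap-allocated paths, freed by the worker */
};

static void *fsync_rename_worker(void *arg)
{
    struct rename_job *job = arg;
    fsync(job->fd);               /* the slow part, off the main thread */
    close(job->fd);
    rename(job->tmp, job->dst);
    free(job->tmp);
    free(job->dst);
    free(job);
    return NULL;
}

int replace_file_async(struct rename_job *job)
{
    pthread_t t;
    if (pthread_create(&t, NULL, fsync_rename_worker, job) != 0)
        return -1;
    return pthread_detach(t);
}
```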
Thus, most application developers will just remove the fsync. (Or do the moral equivalent, as KDE has done, and provide a knob to turn the fsync back on.) Now we've created a deliberate, rare data-loss bug because the correct code is far too complicated.
Now, this situation in itself would be bad. But add laptop_mode, and now we've made an API the very contemplation of which drives men to unspeakable acts. We've added fsync everywhere, and we find it's causing problems: the disk spins up all the time, as it must in order to maintain fsync's semantics. So the solution is to neuter the very fsync you've implored application developers to add? Because you, the one making fsync a no-op, know that most of these fsyncs are there to maintain data consistency, you can have laptop_mode trade durability for battery life.
But some fsyncs are there to ensure application-level durability. Imagine an SMTP server. So, you create an fsync-really-means-fsync inheritable process flag. If an application developer has an *important* fsync call, he'll just set this process flag and call fsync. Now, since that flag is contrary to 20 years of established use and will be a footnote in a newish version of the fsync manual, most application developers won't actually know about it. Oh, they'll call fsync, and their programs will appear to work just fine, even after dutiful hard-reboot testing.
Except when someone using laptop_mode has an unexpected power failure. Now that user has lost data, and it wasn't his fault. ("How the hell wasn't the message on the disk? fsync returned success. Must be a bad disk. [hours pass] Oh, what's this laptop_mode? Changes fsync? @!#$%!@#%") And before you say caveat modulator: you shouldn't need to be an expert on the data-retention needs of each of your programs to extract good battery life.
Now, when an application developer needs to actually use the *real* fsync, he turns on this process flag. Except he's also dutifully using fsync to ensure rename consistency, so he has to create plumbing to manage the state of the magic fsync flag across different parts of his program so that only the fsyncs that need to be real fsyncs are real. Let's imagine this program also runs arbitrary other programs: it then needs to unset the magic inheritable fsync flag before fork, otherwise programs that don't really need it will be running with the real fsync. That's a non-trivial amount of work.
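To make that bookkeeping concrete, here's roughly what the plumbing would look like. The PR_SET_FSYNC_REALLY constant and its prctl-style interface are entirely hypothetical, invented here only to illustrate the burden:

```c
/* HYPOTHETICAL API: no such flag exists. These names are invented
 * to show the bookkeeping such a per-process flag would impose. */
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

#define PR_SET_FSYNC_REALLY 0x1000   /* hypothetical */

void important_fsync(int fd)
{
    prctl(PR_SET_FSYNC_REALLY, 1);   /* make fsync mean fsync...        */
    fsync(fd);                       /* (and note: this races with any  */
    prctl(PR_SET_FSYNC_REALLY, 0);   /* other thread calling fsync!)    */
}

pid_t run_child(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Must remember to clear the inherited flag, or every fsync
         * in the child runs as a real fsync. */
        prctl(PR_SET_FSYNC_REALLY, 0);
        /* ... exec the child program here ... */
    }
    return pid;
}
```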
Also, application developers everywhere need to add autoconf tests for the magic process flag. Older programs will actually be broken, through no fault of their own. It's either that, or rewrite the initialization scripts for programs that need the real fsync. (And in that case, the program may very well run far more real fsyncs than needed.)
Now you've made *two* traditional, long-standing system calls, rename and fsync, act dangerously in certain hard-to-test boundary cases, with elaborate and arcane workarounds that are so counter-intuitive (fsync almost always means fsync?) that developers will almost certainly get it wrong, at least the first time. Correct behavior might as well be in a disused lavatory behind a "beware of the leopard" sign.
What's the alternative? fsync_and_i_mean_it? You could create an fbarrier system call that applications would use to ensure data consistency while preserving fsync's current role. fbarrier might come in handy in other contexts too. But of course, that system call wouldn't be portable... but wait: we've already established that when an application calls rename, it *means* to insert a write barrier. fbarrier might be useful, but we can also infer it from a rename call, and with perfect accuracy: when does an application *not* want this behavior on rename?
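Under that hypothetical fbarrier, the replace sequence would look like the sketch below; fbarrier does not exist, and its declaration here is purely illustrative:

```c
/* HYPOTHETICAL: fbarrier() does not exist. The idea is an ordering
 * guarantee (data blocks reach disk before subsequent metadata
 * operations) without forcing an immediate, blocking flush. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int fbarrier(int fd);   /* hypothetical declaration */

int replace_file_ordered(const char *data, size_t len)
{
    int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len ||
        fbarrier(fd) != 0) {        /* ordering, not durability */
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return rename("config.tmp", "config");
}
```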
So, just make rename include an implicit call to a conceptual fbarrier. Existing applications work. Today. With no changes, or even a recompile. Applications that call fsync before a rename at least do no harm. rename remains an intuitive, powerful, and simple way for an application developer to express what he wants to do (instead of being a tasty-looking landmine). fsync doesn't have to be treated specially in certain bizarre modes. And you don't really lose any efficiency, because under your scheme, every correct application would have to call fsync anyway, and I bet fbarrier would be far less expensive than an outright fsync. (Or, if fsync really is cheap enough on a given filesystem, make fbarrier *be* fsync.)
How often do you get to improve performance *and* safety at the same time?