Wishful thinking

Posted Mar 16, 2009 14:45 UTC (Mon) by jamesh (guest, #1159)
In reply to: Wishful thinking by bojan
Parent article: Garrett: ext4, application expectations and power management

Could you elaborate? In what situations do you imagine an application renaming a temporary file over an existing file name and not wanting the data flushed to disk before the rename is flushed?

For truly temporary files (as opposed to those used to stage atomic changes to long lived files), the application usually leaves them with their initial name, right?



Wishful thinking

Posted Mar 16, 2009 14:50 UTC (Mon) by k8to (guest, #15413) [Link] (39 responses)

Because rename does not always get used to replace files....

Wishful thinking

Posted Mar 16, 2009 15:02 UTC (Mon) by jamesh (guest, #1159) [Link] (2 responses)

The file system knows when it is being used to replace a file though, right?

So in which cases would an application that writes to a file and then uses rename() to replace another file not want the ordering of the write and the rename to be preserved?

Wishful thinking

Posted Mar 16, 2009 15:31 UTC (Mon) by endecotp (guest, #36428) [Link] (1 responses)

It's the converse question you need to ask: "which cases where an application writes to a file then uses rename() NOT to replace another file would WANT the ordering to be preserved?"

The first time you save a config file would be one example. Say my config file is a hundred bytes or so of XML. Each time I save it I first save it to a .new file and then rename it. The very first time I do this I am not overwriting an existing file, but every subsequent time I am.
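
A minimal Python sketch of that save pattern, for illustration only. The function name `save_config` and the choice of Python are assumptions, not anything from the application described above. Note that it deliberately contains no fsync(), since that omission is what this thread is arguing about.

```python
import os


def save_config(path, text):
    """Write text to path + ".new", then rename it over path (hypothetical helper)."""
    tmp_path = path + ".new"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(text)
    # os.replace() maps to rename() on POSIX: on a running system, readers see
    # either the old file or the new one. On the very first save there is no
    # old file, so the rename does not overwrite anything.
    os.replace(tmp_path, path)
```

The first call creates the file and every later call replaces it, so the same code path covers both the "not overwriting" and the "overwriting" case.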

Now in this case you can argue that the "old or new" guarantee is not really needed; if I open the file and find it doesn't contain valid XML (e.g. it's a block of zeros, or it's empty) then I can ignore the file and use defaults, and get the desired behaviour. But this needs extra work; I have to check the validity of the input in a way that is not needed for any other case, and as we all know, rarely-executed code is a good place to find bugs. [My server crashed on the leap second, demonstrating the textbook example of this.]
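
The extra validity check could look roughly like this in Python. This is a sketch under stated assumptions: `DEFAULTS`, `load_config`, and the idea that settings live as attributes on the root element are all made up for the example.

```python
import xml.etree.ElementTree as ET

DEFAULTS = {"theme": "plain"}  # hypothetical default settings


def load_config(path):
    """Return settings from an XML file, or a copy of DEFAULTS if it is unusable."""
    try:
        root = ET.parse(path).getroot()
    except (OSError, ET.ParseError):
        # The file is missing, unreadable, empty, or not well-formed XML:
        # ignore it and fall back to the defaults.
        return dict(DEFAULTS)
    settings = dict(DEFAULTS)
    settings.update(root.attrib)  # assumed layout: settings are root attributes
    return settings
```

The `except` branch is exactly the rarely-executed code in question: it only runs after a crash or a damaged file, so it gets little testing.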

Here's another example: I download a file into a .partial file. When the download is complete I do some sanity checks on the file (checksum maybe) and then rename it. My naive expectation is that when the application next starts it will either find a .partial file from an abandoned download, which it can delete, or it will find a data file that it can trust to be valid because it was sanity-checked before renaming. I really don't want to have to re-do the checking for all previous downloads when the app starts because users already complain about start-up time.
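
As a rough Python illustration of that download scheme (the names `finish_download` and `clean_abandoned` and the SHA-256 checksum are assumptions for the example, not details of any real application):

```python
import hashlib
import os


def finish_download(partial_path, final_path, expected_sha256):
    """Check a completed .partial file and rename it only if the checksum matches."""
    digest = hashlib.sha256()
    with open(partial_path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        os.remove(partial_path)  # failed the sanity check: discard it
        return False
    os.rename(partial_path, final_path)
    return True


def clean_abandoned(directory):
    """At start-up, delete leftover .partial files and return their names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".partial"):
            os.remove(os.path.join(directory, name))
            removed.append(name)
    return removed
```

Start-up stays cheap because only the name is inspected. Whether a renamed file can still be trusted after a crash depends on the data reaching the disk before the rename does, which is the ordering under discussion here.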

So unfortunately, simply adding extra safety in the case where the target of the rename already exists is not sufficient for these real cases.

Wishful thinking

Posted Mar 16, 2009 15:51 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

Then do it for all renames. The behavior is still sane and consistent, and handles your partial file case just fine.

Wishful thinking

Posted Mar 16, 2009 15:28 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (35 responses)

No, it doesn't, but that doesn't matter. Give me a concrete example use of rename in which the application developer didn't intend to insert a write barrier for the data blocks. Just one example.

I bet you can't -- because that's insane behavior! Without the write barrier, rename alone is useless.

Files that won't matter after a crash!

Posted Mar 16, 2009 16:03 UTC (Mon) by Pc5Y9sbv (guest, #41328) [Link]

The write-barrier ONLY serves a purpose for preserving IO order after a crash. For the processes using the file before the crash, the kernel is already providing the abstraction of ordered IO, and the on-disk ordering does not matter at all.

So, for many temporary files which will be purged/ignored on crash recovery, the write ordering doesn't matter at all. The only reason the files ever need to go to disk is for backing store in case the block cache is flushed.

So while I agree it would be nice if write-barriers could be on by default, this might need to be a configurable policy if it cannot be proven to have negligible performance penalty for all cases. And if it can ever be disabled as the default, then applications need a way to request it explicitly to override a default "fast but unsafe" treatment.

Wishful thinking

Posted Mar 16, 2009 16:46 UTC (Mon) by endecotp (guest, #36428) [Link]

I can't think of an example, and I would also be interested to hear of any that others can come up with.

However, what you can find are cases where the user doesn't really care much about robustness at all if it impacts performance. (Compiles, for example; I can imagine a compiler "safely" overwriting the output object file so that it doesn't leave a broken environment after ctrl-C. But the user might be happy to "make clean" after a system crash.)

So it may be legitimate for a filesystem to behave in this way iff there is a performance benefit.

Wishful thinking

Posted Mar 16, 2009 21:38 UTC (Mon) by bojan (subscriber, #14302) [Link] (32 responses)

Any file whose data you don't need to have on disk, and which can be discarded, is such a file.

Essentially, you are giving the kernel more room to manage writes the way it sees fit, which is always good for performance.

Wishful thinking

Posted Mar 16, 2009 21:50 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (31 responses)

No, not really. I'm talking about rename being a write barrier, and not making rename flush data to disk at the instant it's called. The performance impact is minimal.

In a common case, the file being renamed has no dirty blocks, so no more work will be caused by writing the data blocks before the rename.

When the file being renamed has dirty blocks, these blocks will have to be written anyway. Forcing them out before, instead of after, the metadata will have a negligible impact on performance, especially since the elevator can combine these writes with other ones at the time of its choosing, not the application's.

And actually, forcing application developers to call fsync is worse for performance than making rename a write barrier. If rename isn't a write barrier, rename without fsync is dangerous, and therefore applications will add fsync. These fsync calls are worse for system performance than the rename being a write barrier: fsync forces out-of-order, non-coalesced disk flushes and vast increases in application latency. It also diminishes the effectiveness of laptop_mode --- forcing the disk to spin up.

Now, sometimes you need temporary files. But when do you go around renaming temporary files you don't intend to keep after dumping lots of data in them? If they have lots of data, the write ordering forced by the barrier won't matter much. If they have a little data, they're probably not going to get written out at all before they're unlinked. Either way, rename-as-write-barrier doesn't affect performance of temporary files.

Besides: if temporary files are a bottleneck, metadata journaling will hurt far more than the write barrier anyway. When you really, really want to optimize the performance of operations on temporary files, either use tmpfs or a freshly-created ext2 filesystem.

Making rename a write barrier is a performance win. Avoid fsync.

Wishful thinking

Posted Mar 16, 2009 22:00 UTC (Mon) by bojan (subscriber, #14302) [Link] (26 responses)

If the directory you renamed the file in needs to be synced to disk (because another file in it changed and someone asked to fsync it, for instance; or because it has to be evicted from cache etc.) and rename is ordered, you need to commit the data in the file first, because that is your guarantee now. If that file is big and would not normally get committed (because it's temporary and would get removed a bit later), you just caused a performance hit.

For such a file, the application may not care that this particular temporary file is empty or corrupt if the machine crashes just then. It may just remove it and continue.

So, current POSIX rename semantics are there for a good reason - to allow the kernel to order writes as it sees fit. Sure, it would be good to have a call such as rename2(), for which the order is guaranteed.

Wishful thinking

Posted Mar 16, 2009 22:08 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (25 responses)

That's an obscure corner case, and I've already addressed it in another reply. Shuffling around huge files with large numbers of outstanding blocks and then immediately deleting them is a contrived case that doesn't occur in real life.

Freedom isn't always a good thing. Making rename work consistently is far more important. The non-ordered rename case just doesn't make a whole lot of sense. It's not a useful degree of freedom, unlike synchronous-versus-cached, atomic-versus-not, and so on. The performance gain is minimal, and the danger large. Making rename ordered with respect to the data blocks in the file is a huge win.

Wishful thinking

Posted Mar 16, 2009 22:13 UTC (Mon) by bojan (subscriber, #14302) [Link] (24 responses)

> Making rename work consistently is far more important.

rename() already works consistently with documentation.

What you addressed in the other thread is that you would like rename() to work the way you _think_ it should work. We know that already.

Wishful thinking

Posted Mar 16, 2009 22:17 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (23 responses)

Let me repost here something I posted to Theodore Tso's blog.

With that out of the way, let’s talk about rename. It should create a conceptual write barrier for the data blocks of the file involved. It’s not inventing a filesystem semantic out of thin air any more than writing a zero-length file is: POSIX doesn’t say much at all about what happens after a crash, and so this whole discussion is uncharted territory. It’d be perfectly fine for a POSIX system to overwrite all your files with pictures of donuts on an unclean shutdown. This is not an issue of standards conformance: it is a quality of implementation issue. The standard allowing you to do something terrible isn’t an excuse for that behavior. It’s like saying “yes, it’s perfectly fine that I live off of Crisco and tequila. The law allows it!”

Now, first of all: there’s a lot of historical precedent for rename writing data blocks before metadata: not only does ext3 do it, but many older filesystems too. Certainly, many programs are written under the assumption that my rename semantics hold: and these programs work fine (in fact, better) on a running system.

Second, your rename behavior will lead to bugs now and forever: open-write-close-rename will work just fine on a running system, and there’s a good chance it’ll appear to work even if the developer takes the unusual step of testing during a system crash. Because this sequence will seem to work just fine most of the time, plenty of programs will have hidden data-loss bugs. That’s not a world I want to live in.

Third, there’s the issue of API parsimony. Your semantics change rename from a hard-to-misuse API to one that’s very prone to misuse. See http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html (“How Do I Make This Hard to Misuse?”) and http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html (“What If I Don’t Actually Like My Users?”). On that scale, you’ve moved rename from a very good 7 (“the obvious use is (probably) the correct one”) to an appalling -5 (“do it right and it will sometimes break at runtime”).

On a running system, of course, a rename is atomic with respect to both the filename and its contents — otherwise it’d be useless. Under your semantics, however, you’ve effectively made rename without fsync a useless and dangerous, yet very conceptually tempting operation. Scolding application programmers to insert fsync calls will lead to confusion and frustration: fsync, as “make the data hit the disk now” doesn’t have anything conceptually to do with atomic replacement except as an arcane filesystem implementation detail. Anything that appears to work in the typical case, but that does something dangerous in special corner cases, is broken by design.

When an application performs a rename, the *INTENT* is to insert a write barrier for the data blocks of the file involved. When is this interpretation ever wrong? When is it ever useful to be able to tell the system to replace a file’s contents and its name, except when the system crashes, in which case you just want its name and some garbage? Nobody ever wants that.

But for the sake of argument, let’s bite our tongue and insert this fsync. The system can’t tell the difference between an fsync intended to ensure, say, message receipt, and an fsync that ensures after-crash consistency across a rename. Because we’re blocking and waiting for disk IO, application latency greatly increases (by up to three seconds, apparently). Users begin to complain. Now, the application developer has two choices: either implement the threaded solution you mention, or remove the fsync.

The threaded solution gives the correct behavior, but is horribly complicated, or requires libraries on which the developer might not want to depend, especially for a small operation. Look what you’ve done now: not only have you made the correct code non-obvious, but you’ve made the correct code under-performing as well. It’s absolutely ludicrous to expect every program that wants to correctly replace a file’s contents to spawn off a worker thread.
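
For a sense of what the threaded approach involves, here is a stripped-down Python sketch. The helper name `save_in_background` and the `.new` suffix are assumptions made for the example, and a real version would also need error reporting and shutdown handling, which are left out.

```python
import os
import threading


def save_in_background(path, data):
    """Write, fsync, and rename in a worker thread; return the thread (hypothetical)."""
    def worker():
        tmp_path = path + ".new"
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # may block for a long time; hence the thread
        os.replace(tmp_path, path)  # rename only after the data is on disk

    thread = threading.Thread(target=worker)
    thread.start()
    return thread
```

The caller gets control back immediately, but now has to decide when to `join()` the thread, what to do if the worker fails, and how to avoid two saves of the same file racing each other.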

Thus, most application developers will just remove the fsync. (Or do the moral equivalent, as KDE has done, and provide a knob to turn the fsync back on.) Now we’ve created a deliberate rare data-loss bug because the correct code is far too complicated.

Now, this situation in itself would be bad. But add laptop_mode, and now we’ve made an API the very contemplation of which drives men to unspeakable acts. We’ve added fsync everywhere, and we find it’s causing problems: the disk spins up all the time, as it must in order to maintain fsync’s semantics. So the solution is to neuter the very fsync you’ve implored application developers to add? Because you, the one making fsync a no-op, know that most of these fsyncs are there to maintain data consistency, you can have laptop_mode trade durability for battery life.

But some fsyncs are there to ensure application-level durability. Imagine an SMTP server. So, you create an “fsync-really-means-fsync” inheritable process flag. If an application developer has an *important* fsync call, he’ll just set this process flag and call fsync. Now, since that flag is contrary to 20 years of established use and will be a footnote on a newish version of the fsync manual, most application developers won’t actually know about it. Oh, they’ll call fsync, and their programs will appear to work just fine, even after dutiful hard-reboot testing.

Except when someone using laptop_mode has an unexpected power failure. Now that user has lost, and it wasn’t his fault. (“How the hell wasn’t the message on the disk? fsync returned success. Must be a bad disk. [hours pass] Oh, what’s this laptop_mode? Changes fsync? @!#$%!@#%”) And before you say “caveat modulator:” you shouldn’t need to be an expert on the data retention needs of each of your programs to extract good battery life.

Now when an application developer needs to actually use the *real* fsync, he turns on this process flag. Except he’s also dutifully using fsync to ensure rename consistency, so he has to create plumbing to manage the state of the magic fsync flag across different parts of his program so only the fsyncs that need to be real fsyncs are real. Let’s imagine this program also runs arbitrary other programs: it then needs to unset the magic inheritable fsync flag before fork, otherwise programs that don’t really need it will be running with the real fsync. That’s a non-trivial amount of work.

Also, application developers everywhere need to add autoconf tests for the magic process flag. Older programs will actually be broken, through no fault of their own. It’s either that, or rewrite the initialization scripts for programs that need the real fsync. (And in that case, the program may very well run far more real fsyncs than needed.)

Now you’ve made *two* traditional, long-standing system calls, rename and fsync, act dangerously in certain hard-to-test boundary cases, with elaborate and arcane workarounds that are so counter-intuitive (“fsync almost always means fsync?”) that developers will almost certainly get it wrong, at least the first time. Correct behavior might as well be in a disused lavatory behind a “beware of the leopard” sign.

What’s the alternative? fsync_and_i_mean_it? You could create an
fbarrier system call that applications would use to ensure data
consistency while preserving fsync’s current role. fbarrier might come
in handy in other contexts too. But of course that system call
wouldn’t be portable — but wait… we’ve already established that
when an application calls rename, it *means* to insert a write
barrier. fbarrier might be useful, but we can also infer it from a
rename call, and with perfect accuracy: when does an application *not*
want this behavior on rename?

So, just make rename include an implicit call to a conceptual
fbarrier. Existing applications work. Today. With no changes, or even
a recompile. Applications that call fsync before a rename at least do
no harm. rename remains an intuitive, powerful, and simple way for an
application developer to express what he wants to do (instead of being
a tasty-looking landmine). fsync doesn’t have to be treated specially
in certain bizarre modes. And you don’t really lose any efficiency,
because under your scheme, every correct application would have to
call fsync anyway — and I bet fbarrier would be far less expensive
than an outright fsync. (Or, if fsync really is cheap enough on a
given filesystem, make fbarrier *be* fsync.)

How often do you get to improve performance *and* safety at the same
time?

Wishful thinking

Posted Mar 16, 2009 22:20 UTC (Mon) by bojan (subscriber, #14302) [Link] (19 responses)

> It should

I didn't want to read past that, sorry. I can also imagine that things should be some way or the other. They are, however, not.

Wishful thinking

Posted Mar 16, 2009 22:27 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (18 responses)

Now you're just trolling. The whole core of your argument is that "since POSIX allows this, and some filesystems do it that way, we should keep doing it that way". It doesn't logically follow. Maybe if you were to read my posts, you'd get a sense for what logic looks like.

Wishful thinking

Posted Mar 16, 2009 22:36 UTC (Mon) by bojan (subscriber, #14302) [Link] (16 responses)

> It doesn't logically follow.

Actually it does. If you _don't_ allow for what POSIX specifies in your applications (which is where the problem is), then there will be consequences (i.e. the applications will lose files).

This can be properly fixed in two ways:

1. By calling fsync() from the application when required.
2. By introducing something new that does what you keep talking about.
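
For reference, option 1 amounts to something like the following Python sketch, which uses the thin `os` wrappers around the POSIX calls. The helper name `replace_file_durably` and the `.new` suffix are assumptions made for the example.

```python
import os


def replace_file_durably(path, data):
    """open/write/fsync/close/rename: flush the data before the rename (hypothetical)."""
    tmp_path = path + ".new"
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # write() may accept fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # block until the file's data has been handed to the disk
    finally:
        os.close(fd)
    os.rename(tmp_path, path)  # on POSIX this replaces an existing target
```

The only difference from the version without the barrier is the single `os.fsync(fd)` line, and leaving it out changes nothing observable on a system that keeps running.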

Overloading specified behaviour with unspecified things is dangerous, because it encourages application writers to do the wrong thing. We've seen that before with XFS and wrong people got blamed that time too.

Sure, Ted is a practical person, so he doesn't want to break things unnecessarily. I admire him for keeping his cool.

Wishful thinking

Posted Mar 16, 2009 23:32 UTC (Mon) by jamesh (guest, #1159) [Link] (14 responses)

This whole issue is about what should happen in a case that POSIX does not specify, so I don't know why you keep on bringing this up.

In cases where POSIX does not specify behaviour, it is left up to the implementation. If the choice is between trying to provide the runtime atomic rename guarantee over a crash or slightly higher performance, I'd pick the first option. After all, that's why I am running a crash-resistant file system in the first place.

Are you seriously saying you can't understand the benefits of delaying IO but preserving the order of certain operations over a "do it now" fsync() call?

Wishful thinking

Posted Mar 16, 2009 23:42 UTC (Mon) by bojan (subscriber, #14302) [Link] (13 responses)

The problem is with portability. If you write your applications to _not_ do what POSIX requires, they will be broken when they go to a different system which happens to have an FS that doesn't order renames on disk.

> Are you seriously saying you can't understand the benefits of delaying IO but preserving the order of certain operations over a "do it now" fsync() call?

1. Yes, I can understand it.
2. No, this is not what rename() specifies.

So, when an application writer thinks that it will be like that everywhere, he/she is wrong and the application may lose data. That is bad.

Hence, I'm suggesting that for the cases where ordered rename is warranted, we should have a new API.

PS. As I explained elsewhere, unordered rename has its use as well, so one cannot just assume that everyone should drop that and do ordered. It is also not practical to demand that, because too many systems would have to be audited and changed to achieve it. And before you say "but don't we have to fix more apps already" - well, the applications are buggy right now according to specification - not the other way around.

Wishful thinking

Posted Mar 17, 2009 0:18 UTC (Tue) by nix (subscriber, #2304) [Link] (10 responses)

You're acting as if POSIX is set in stone and can never change to account
for new de-facto standards, when in reality that is the *only* way it ever
changes (and often Linux is the source of such changes).

Ten years ago, would you have been arguing that programs that relied on
symlinks were broken because POSIX did not require them?

Wishful thinking

Posted Mar 17, 2009 0:31 UTC (Tue) by bojan (subscriber, #14302) [Link] (8 responses)

> Ten years ago, would you have been arguing that programs that relied on symlinks were broken because POSIX did not require them?

If the programs correctly tested to see if the support is there and then refused to work if symlinks were not there, there would be nothing wrong with them. So, by all means, if you write an application that tests that the underlying FS has ordered renames and refuses to work otherwise with a sloppy open()/write()/close()/rename() sequence, that's perfectly OK. You just need to write even _more_ code to do this than if you just used fsync(). Up to you.

Wishful thinking

Posted Mar 17, 2009 1:24 UTC (Tue) by nix (subscriber, #2304) [Link] (7 responses)

The vast majority of programs, even when symlinks were optional, assumed
their presence, because the enormous majority of the installed base had
them.

This is actually worse. If you get the open()/write()/fsync()/close()/
rename() sequence wrong, by leaving out the fsync(), the visible effect
during development is *nil*, even on filesystems like pre-patch ext4,
because this is a change which only has an effect when something goes
really wrong and the OS crashes or you lose power at the wrong instant,
and if that happens, any data loss will be written off to the power
failure, like as not.

Expecting any but the most skilled developers to remember that fsync()
when omitting it has *no visible negative consequence* in normal operation
is a complete and total pipe-dream. You can wish all you will, but only a
few percent will ever conform.

It is much better to arrange to do the right thing in the filesystem,
which *does* have especially skilled people hacking at it, than in the
vast mass of wildly-varying-in-quality code out there in the real world.

Wishful thinking

Posted Mar 17, 2009 2:17 UTC (Tue) by bojan (subscriber, #14302) [Link] (6 responses)

> The vast majority of programs, even when symlinks were optional, assumed their presence, because the enormous majority of the installed base had them.

WOW! Programs have bugs. Imagine that ;-)

> Expecting any but the most skilled developers to remember that fsync() when omitting it has *no visible negative consequence* in normal operation is a complete and total pipe-dream.

The no negative visible consequence applies to one file system in one mode _only_ (and according to some, not even on it all the time). The rest - it depends.

If you ever tried to debug a race condition, you'd know that it can be really hard to do, because the system doesn't get into such conditions all the time. Did someone guarantee to you that programming was going to be easy? I must have missed that lesson ;-)

Oh, and for all the forgetful unskilled developers: man 2 close. I sure needed it :-(

> You can wish all you will, but only a few percent will ever conform.

And their applications will still suck and they will still rely on hacks in file systems to work. And of course, people doing this will be the ones loudest complaining that "file system is broken" when they encounter problems on another platform. Not even my six year old is this childish. But, hey - that's life.

> It is much better to arrange to do the right thing in the filesystem, which *does* have especially skilled people hacking at it, than in the vast mass of wildly-varying-in-quality code out there in the real world.

All you need to do is this:

1. Convince all FS writers to only use new semantics.
2. Convince POSIX folks to change the spec.

Good luck doing that.

PS. The vast majority of people do not program using APIs we are talking about here. They are using libraries that wrap all this up, other programming languages that have calls that wrap all this up etc. These will be written by people familiar with lower level POSIX APIs we are talking about here. For a good example, see: http://mail.gnome.org/archives/gtk-devel-list/2009-March/...

Wishful thinking

Posted Mar 17, 2009 2:23 UTC (Tue) by bojan (subscriber, #14302) [Link]

> people doing this

Of course, I mean your supposed vast majority that won't do the fsync here.

Wishful thinking

Posted Mar 17, 2009 2:26 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (3 responses)

The POSIX spec doesn't need to change one bit. Both behaviors entirely conform to POSIX.

And as for getting filesystems to change -- that's going to be the case. Any widely-used filesystem will encounter the same problem that started this mess, and will either implement the same fix or suffer the fate of XFS.

Wishful thinking

Posted Mar 17, 2009 2:35 UTC (Tue) by bojan (subscriber, #14302) [Link] (2 responses)

I see FS implementers shaking in their boots :-)

BTW, people already started fixing the code. Or didn't you read that GTK thread?

PS. Even Ted's workarounds in ext4 do not do full ordered rename in all cases. These are only for the cases of the most widely known application breakage. But, if you keep insisting, he may do the lockup-on-fsync for you, ext3 style, just so that you can get that nice UI feeling in properly written apps ;-)

Wishful thinking

Posted Mar 17, 2009 2:37 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (1 responses)

Care to link to this thread?

Wishful thinking

Posted Mar 17, 2009 2:44 UTC (Tue) by bojan (subscriber, #14302) [Link]

Already have. You have to go a few posts up.

Wishful thinking

Posted Mar 17, 2009 20:37 UTC (Tue) by nix (subscriber, #2304) [Link]

>> Expecting any but the most skilled developers to remember that fsync()
>> when omitting it has *no visible negative consequence* in normal
>> operation is a complete and total pipe-dream.
>
> The no negative visible consequence applies to one file system in one
> mode _only_ (and according to some, not even on it all the time). The
> rest - it depends.

I repeat: omitting fsync() has no negative visible consequence *in normal
operation* on *any* POSIX-compliant system. Turning off the power or
locking up the box is *not* 'normal operation'.

I know of no developers of anything other than full-blown databases who do
anything like that to test their programs. Thus, for nearly all programs,
omitting fsync() is harmless during the development and testing phase.
Thus, it will regularly be omitted, *no matter what* you might wish.

... and, um, changing POSIX really isn't that hard. Make a good case that
some behaviour is common enough and POSIX will bend. The Austin Group is
populated with normal human beings^W^Wraging pedants like you or I, not
gods. (There are some demigods there, though.)

It is quite possible to convince them that a change is needed, and POSIX
regularly changes semantics in new releases.

Wishful thinking

Posted Mar 17, 2009 0:33 UTC (Tue) by bojan (subscriber, #14302) [Link]

Oh, and if you want to change POSIX, please do so. I have no objection. As if my opinion mattered here ;-)

Wishful thinking

Posted Mar 17, 2009 8:35 UTC (Tue) by jamesh (guest, #1159) [Link]

> If you write your applications to _not_ do what POSIX requires, they will
> be broken when they go to a different system which happens to have an FS
> that doesn't order renames on disk.

We are talking about a case that POSIX leaves undefined here. An OS can wipe the disk on system crash and still be POSIX compliant.

We are in the realm of implementation defined behaviour, so talking of "applications doing what POSIX requires" doesn't really make sense. Claiming that the applications are buggy in a case where the specification offers no guidance doesn't help anyone.

Ext4's crash resistance is a desirable feature that exceeds the minimum requirements needed for POSIX conformance. Preserving atomic renames over a crash also exceeds those minimum requirements.

I'd be willing to pay the performance penalty from providing this behaviour in the same way I am willing to pay the performance penalty from metadata journaling.

A filesystem's job is not to punish users for application developers' oversights.

Posted Mar 18, 2009 0:52 UTC (Wed) by xoddam (subscriber, #2322) [Link]

This is *so* not about application developers or POSIX!

The *only* behaviour under discussion is recoverability across system failures. That's what POSIX doesn't (can't) guarantee, and it's what a journaling filesystem is supposed to provide *in addition* to the POSIX guarantees.

System administrators and users choose to run journaling filesystems so they don't waste time cleaning up after a crash. A journaling filesystem that makes it more, not less, likely for users to lose data isn't doing its job.

POSIX guarantees atomicity of rename -- while the system is running. Application developers code to that guarantee, without particular reference to what happens when the power is cut or some video driver scribbles on the kernel heap. If the system crashes, there is no POSIXLY_CORRECT guarantee that anything will be recoverable at all. Whether you use fsync or not.

A journaling filesystem is supposed to provide more reasonable behaviour FOR USERS. Its job is not to punish users for the corner cases that application developers didn't consider.

Wishful thinking

Posted Mar 18, 2009 0:14 UTC (Wed) by dvdeug (subscriber, #10998) [Link]

What POSIX specifies is that a compliant system, upon a system crash, can hunt down all hard copies that have been made and burn them, after overwriting the data on the disk seven times with zeros, ones, and random data. I'm not sure how an application is supposed to allow for that.

Wishful thinking

Posted Mar 17, 2009 20:44 UTC (Tue) by man_ls (guest, #15091) [Link]

> Now you're just trolling.

Now?

Wishful thinking

Posted Mar 16, 2009 22:37 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

quote:
When an application performs a rename, the *INTENT* is to insert a write barrier for the data blocks of the file involved. When is this interpretation ever wrong? When is it ever useful to be able to tell the system to replace a file’s contents and its name, except when the system crashes, in which case you just want its name and some garbage? Nobody ever wants that.

one very good example of when you would want to do a rename, but don't need to do a write barrier is when you are rolling logs.

this is usually done where one program has the file open and is writing data to it, another program renames the file, and then tells the first program to close the file and re-open the original name.

there is no need for a write barrier anywhere in this case.
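
A minimal sketch of the log-rolling case described above, in Python, whose `os.rename` is a thin wrapper over rename(2). The file names (`app.log`, `app.log.1`) and the helpers `roll_log` and `reopen` are made up for illustration, and both the "writer" and the "roller" run in one process here to keep the example short.

```python
import os
import tempfile


def roll_log(log_path, rolled_path):
    """The roller: move the live log aside. No data flush is requested."""
    os.rename(log_path, rolled_path)


def reopen(log_file, log_path):
    """The writer, once told to reopen: close the old file, open the original name."""
    log_file.close()
    return open(log_path, "a")


workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "app.log")
rolled_path = os.path.join(workdir, "app.log.1")

log_file = open(log_path, "a")
log_file.write("before roll\n")
log_file.flush()

roll_log(log_path, rolled_path)

# The open descriptor follows the file, not the name, so this line
# still ends up in the rolled file.
log_file.write("after rename, before reopen\n")
log_file.flush()

log_file = reopen(log_file, log_path)
log_file.write("after reopen\n")
log_file.close()

with open(rolled_path) as f:
    rolled_lines = f.read().splitlines()
with open(log_path) as f:
    current_lines = f.read().splitlines()
```

Nothing in this sequence depends on the order in which data and the rename reach the disk, which is the point being made: the rename only changes which name refers to the file.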

laptop mode explicitly breaks normal expected (failsafe) behavior in the interest of saving power. if the distro turns it on by default and never tells the user they are doing so (along with allowing the user to specify the 'how much data am I willing to lose' parameter) the distro is at fault, but that can be fixed.

adding the ability to selectively mask this, so that you can have some programs (say your word processor) go ahead and wake up the drive to save its data, but keep other non-critical things from doing so (even if those things _think_ that they are critical) would be a very good thing.

Wishful thinking

Posted Mar 16, 2009 22:42 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

> there is no need for a write barrier anywhere in this case.

You don't need a write barrier for a completely unmodified file either. But a write barrier hurts neither case.

> adding the ability to selectively mask this, so that you can have some programs (say your word processor) go ahead and wake up the drive to save its data, but keep other non-critical things from doing so (even if those things _think_ that they are critical) would be a very good thing.

That's a nightmare API that makes it very difficult to determine whether you're actually writing to the disk or not. If an application's data aren't critical, it shouldn't be calling fsync in the first place. And if the property of whether the data are critical can change, the application itself should provide a knob to control that. A process-level flub is both coarse and crude as a means of controlling that.

Wishful thinking

Posted Mar 16, 2009 23:56 UTC (Mon) by jlokier (guest, #52227) [Link]

By the way, Mac OS X really has an "fsync-and-I-really-mean-it" flag!

Look up F_FULLFSYNC, and why Linux fsync isn't a proper fsync anyway on most of its filesystems.
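
For illustration, a hedged sketch of how a portable program might ask for the strongest flush available. The macOS flag is spelled `F_FULLFSYNC` in the system headers, and Python's `fcntl` module only defines it on that platform, so the code probes for it and falls back to plain `fsync` elsewhere. The helper name `full_sync` is an assumption of this example, not an existing API.

```python
import fcntl
import os
import tempfile


def full_sync(fd):
    """Flush fd as hard as the platform allows; return the name of the call used."""
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)  # only defined on macOS
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            pass  # not supported on this filesystem; fall back below
    os.fsync(fd)
    return "fsync"


fd, path = tempfile.mkstemp()
os.write(fd, b"important data\n")
used = full_sync(fd)
os.close(fd)
os.unlink(path)
```

On Linux this always takes the `fsync` branch; how far that flush actually reaches depends on the filesystem and mount options.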

Wishful thinking

Posted Mar 16, 2009 22:26 UTC (Mon) by dlang (guest, #313) [Link] (3 responses)

the problem is that making the write barrier be part of the rename requires changes to all filesystems on all operating systems, and applications are only safe if they are running on a new enough version of an OS.

doing the fsync before rename works on all filesystems on all operating systems, but requires changing the applications.

if the applications push this into the OS/filesystem they will need to document that they are unsafe on any but (...) which is a list that will change over time without any control (and probably without the knowledge) of the application developers. but if they put in the fsync they work with everything that's on the market today.

they don't have to do the fsync of the directory if they don't care which version of the file exists after the crash, just doing it for the data is enough.
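
A minimal sketch of the fsync-before-rename sequence, using Python's `os` wrappers around the POSIX calls. The helper name `replace_file`, the `.new` suffix and the `sync_dir` switch are assumptions made for the example; the directory fsync is optional, as noted above, and is only needed if you care which version survives.

```python
import os
import tempfile


def replace_file(path, data, sync_dir=False):
    """Write data to path + '.new', fsync it, then rename it over path."""
    tmp_path = path + ".new"
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write may write less than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # data is on disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp_path, path)
    if sync_dir:
        # Optional: also push the directory entry out.
        dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


workdir = tempfile.mkdtemp()
config = os.path.join(workdir, "config.xml")
replace_file(config, b"<config version='1'/>\n")
replace_file(config, b"<config version='2'/>\n", sync_dir=True)

with open(config, "rb") as f:
    final = f.read()
leftovers = os.listdir(workdir)
```

The first call creates the file and the second replaces it; in both cases the new contents are flushed before the name changes.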

Wishful thinking

Posted Mar 16, 2009 22:35 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (2 responses)

> the problem is that making the write barrier be part of the rename requires changes to all filesystems on all operating systems, and applications are only safe if they are running on a new enough version of an OS.

True enough, but applications work without fsync already, so upgrades will be required in either case. But there are far fewer filesystems than applications, so from a purely logistic point of view, it makes more sense to change the filesystem.

> if the applications push this into the OS/filesystem they will need to document that they are unsafe on any but (...) which is a list that will change over time without any control (and probably without the knowledge) of the application developers. but if they put in the fsync they work with everything that's on the market today.

There are plenty of applications that don't include fsync. Making this change turns existing incorrect applications into correct ones, and doesn't harm any application that already uses fsync. Besides -- how many of these applications are portable to other systems anyway? A certain degree of unportability is required to drive change. Sometimes you don't need to worry about supporting every POSIX-compliant system and can take advantage of better functionality.

> they don't have to do the fsync of the directory if they don't care which version of the file exists after the crash, just doing it for the data is enough.

...which causes performance problems. In fact, Theodore Tso recommends running fsync in a separate thread, which drastically increases the complexity of simple applications. Applications are left to choose one of incorrect, slow, or complicated. They shouldn't be required to make that choice.

Not to mention the problems rampant fsync would cause for laptop_mode.
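
To make the "separate thread" option concrete, here is a rough sketch of what it might look like in Python. The helper name `save_in_background` and the file names are invented for the example, and this is only one possible shape, not a recommended implementation.

```python
import os
import tempfile
import threading


def save_in_background(path, data):
    """Start a worker that writes, fsyncs and renames; return the thread."""
    def worker():
        tmp_path = path + ".new"
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()               # push Python's buffer to the kernel
            os.fsync(f.fileno())    # the slow part: wait for the disk
        os.rename(tmp_path, path)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "document.txt")

saver = save_in_background(target, b"draft 1\n")
# ... the caller is free to keep handling input here ...
saver.join()

with open(target, "rb") as f:
    saved = f.read()
```

Even this short version leaves open how errors in the worker get reported, what happens when two saves overlap, and when it is safe to exit, which is the added complexity being objected to.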

Wishful thinking

Posted Mar 16, 2009 22:39 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

no, applications _sometimes_ or _mostly_ work without fsync on some OS/filesystem/mount option combinations

that's far from what you are asserting "applications work without fsync already"

Wishful thinking

Posted Mar 16, 2009 22:44 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

You can add those qualifiers to almost any statement. The fact is that these applications are written to assume rename inserts a write barrier, and the vast majority of the time, these applications get what they want. With pre-patch ext4 and XFS, this assumption breaks. The breakage was noticed very quickly after ext4 entered wide use. If these applications' assumptions were so unreliable, we would have seen an uproar well before ext4 was released.

Wishful thinking

Posted Mar 16, 2009 21:46 UTC (Mon) by bojan (subscriber, #14302) [Link] (8 responses)

> Could you elaborate?

Yes. Say you just renamed a very big temporary file (think GBs) and it just so happens that it would be a good time for the kernel to sync the directory your file is in to disk, because another file in that directory changed and somebody asked to fsync the directory (or some other condition that kernel finds appropriate - doesn't matter). If you guarantee the order with rename(), you then first need to commit a few GBs of data in order to do this (which would otherwise never happen, because the file is temporary and would go away a bit later). If you don't guarantee the order, you then just commit the directory and you are done.

In other words, kernel currently has the freedom to do what it finds most appropriate and is allowed to by POSIX. Ordered renames restrict that freedom.

Wishful thinking

Posted Mar 16, 2009 21:57 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (7 responses)

Yes, the kernel has the freedom to do what's appropriate, but uses that freedom to do mind-numbingly stupid things.

You're grasping at straws. First of all, you're most likely never going to see gigabytes of dirty blocks for a single file. They'll have been flushed well before your rename! Second, even if we do end up in your scenario, that file's blocks will be flushed in very short order anyway, so you're going to incur the penalty for that IO whether you do it before or after the rename.

As for the case of immediately unlinking the file --- that's a rather unlikely scenario. Can you give me one non-contrived real-life example where this would actually happen?

Remember:

  • You have a large temporary file with many unflushed data blocks
  • This temporary file has just been renamed
  • This temporary file is about to be unlinked
  • Between the rename of the large temporary file and its unlinking, fsync must be called on the directory
When would this plausibly happen?

Wishful thinking

Posted Mar 16, 2009 22:07 UTC (Mon) by bojan (subscriber, #14302) [Link] (6 responses)

> First of all, you're most likely never going to see gigabytes of dirty blocks for a single file.

You are right. And nobody will ever need more than 640 kB of RAM ;-)

> They'll have been flushed well before your rename!

Not if someone calls fsync on that directory. Or kernel decides (for whatever reason) that this directory must go out to disk.

But look, you obviously don't want to accept that:

1. This happens.
2. POSIX says what it says.
3. Kernel is allowed to do what POSIX says.

That's OK. The documentation is crystal clear on this. It is all in the manual pages for close() and rename(). Unfortunately, people choose to ignore it.

Sure, it would be nice to have a call that guarantees all this, but thinking that rename() is that call is simply false.

Wishful thinking

Posted Mar 16, 2009 22:12 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (5 responses)

And you clearly don't want to accept that we can improve over POSIX with little danger and a large upside. POSIX doesn't make any guarantees about what happens on an unclean shutdown. The kernel could overwrite all your files with pictures of carrots and it'd be POSIX-compliant. Would you be okay with that outcome? After all,
  1. This happens.
  2. POSIX says what it says.
  3. Kernel is allowed to do what POSIX says.

Please stop justifying poor behavior by resorting to POSIX.

Wishful thinking

Posted Mar 16, 2009 22:17 UTC (Mon) by bojan (subscriber, #14302) [Link] (4 responses)

> little danger

Yeah, we've seen that with applications that were losing files on a perfectly good FS.

> Please stop justifying poor behavior by resorting to POSIX.

The behaviour is not poor. It is there for a reason, which you don't want to admit.

Lucky Ted is a nice man, so he put workarounds in place for folks that want to continue using sloppy code on ext4.

Wishful thinking

Posted Mar 16, 2009 22:24 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (3 responses)

> Yeah, we've seen that with applications that were losing files on a perfectly good FS.

What are you talking about? A reordering rename is strictly safer than your broken one.

> The behaviour is not poor. It is there for a reason, which you don't want to admit.

No, the behavior is dangerous and unintuitive, and there is no sound reason for it to work that way other than to make metadata write-out a little simpler. There is no performance or correctness upside to rename working the way you insist. You don't want to admit that POSIX may allow something that is nevertheless nonsensical.

Wishful thinking

Posted Mar 16, 2009 22:42 UTC (Mon) by bojan (subscriber, #14302) [Link] (2 responses)

> There is no performance or correctness upside to rename working the way you insist.

I'll just answer about correctness. If you take a broken application to a perfectly good system that doesn't order renames, because it doesn't have to, you will lose data. So, there is an upside to programming correctly and according to spec.

I think I answered the performance bit elsewhere, but you don't want to accept it. Which is fine by me.

> You don't want to admit that POSIX may allow something that is nevertheless nonsensical.

POSIX is not nonsensical, it is completely asynchronous and unordered, which is what you don't seem to like. Sure, we could have another mechanism for ordered renames - I don't deny that. It's just that current rename() isn't it, which you don't seem to be able to understand.

Wishful thinking

Posted Mar 16, 2009 22:53 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (1 responses)

> If you take a broken application to a perfectly good system that doesn't order renames, because it doesn't have to, you will lose data.

Adhering to POSIX is not the same as being "perfectly good".

> POSIX is not nonsensical, it is completely asynchronous and unordered

No, it's perfectly synchronous, ordered, and atomic with respect to a running system. It's only in the case of an unclean shutdown that we disagree, and POSIX really doesn't say much at all about that scenario. My behavior, your behavior, and overwriting with carrots are all perfectly POSIX-compliant with respect to a system that's been shut down uncleanly. To the greatest extent possible, the state of a system after an unclean shutdown should reflect the state the system was in shortly before that shutdown, and an ordered rename goes a long way toward achieving that.

Wishful thinking

Posted Mar 16, 2009 23:29 UTC (Mon) by bojan (subscriber, #14302) [Link]

> Adhering to POSIX is not the same as being "perfectly good".

I love it how you think that your or my opinion actually matters here. What matters is what's been written in the documentation for years now. That is the _only_ objective thing programmers on both ends of this can rely on.

> No, it's perfectly synchronous, ordered, and atomic with respect to a running system.

Yeah, confuse the issue some more, when you don't have anything new to add.

We are discussing here the "ordering of writes to disk on rename". In respect to this, POSIX is asynchronous and unordered. Heck, you cannot even tell if two consecutive writes will be written in that order to disk.

> My behavior, your behavior, and overwriting with carrots are all perfectly POSIX-compliant with respect to a system that's been shut down uncleanly.

Your behaviour is truly what you would like it to be. What you are describing as my behaviour is what the documentation actually says, so it's not mine at all.

As for carrots, that is wrong, because you never fsynced carrots to disk, so they should not be there. But sure - implement it ;-)

> To the greatest extent possible, the state of a system after an unclean shutdown should reflect the state the system was in shortly before that shutdown, and an ordered rename goes a long way toward achieving that.

And also encourages application writers to keep writing broken code, file system writers to put hacks into the system to work around that broken code etc.

