Wishful thinking

Posted Mar 16, 2009 22:08 UTC (Mon) by quotemstr (subscriber, #45331)
In reply to: Wishful thinking by bojan
Parent article: Garrett: ext4, application expectations and power management

That's an obscure corner case, and I've already addressed it in another reply. Shuffling around huge files with large numbers of outstanding blocks and then immediately deleting them is a contrived case that doesn't occur in real life.

Freedom isn't always a good thing. Making rename work consistently is far more important. The non-ordered rename case just doesn't make a whole lot of sense. It's not a useful degree of freedom, unlike synchronous-versus-cached, atomic-versus-not, and so on. The performance gain is minimal, and the danger large. Making rename ordered with respect to the data blocks in the file is a huge win.



Wishful thinking

Posted Mar 16, 2009 22:13 UTC (Mon) by bojan (subscriber, #14302) [Link] (24 responses)

> Making rename work consistently is far more important.

rename() already works consistently with documentation.

What you addressed in the other thread is that you would like rename() to work the way you _think_ it should work. We know that already.

Wishful thinking

Posted Mar 16, 2009 22:17 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (23 responses)

Let me repost here something I posted to Theodore Tso's blog.

With that out of the way, let’s talk about rename. It should create a conceptual write barrier for the data blocks of the file involved. It’s not inventing a filesystem semantic out of thin air any more than writing a zero-length file is: POSIX doesn’t say much at all about what happens after a crash, and so this whole discussion is uncharted territory. It’d be perfectly fine for a POSIX system to overwrite all your files with pictures of donuts on an unclean shutdown. This is not an issue of standards conformance: it is a quality of implementation issue. The standard allowing you to do something terrible isn’t an excuse for that behavior. It’s like saying “yes, it’s perfectly fine that I live off of Crisco and tequila. The law allows it!”

Now, first of all: there’s a lot of historical precedent for rename writing data blocks before metadata: not only does ext3 do it, but many older filesystems too. Certainly, many programs are written under the assumption that my rename semantics hold: and these programs work fine (in fact, better) on a running system.

Second, your rename behavior will lead to bugs now and forever: open-write-close-rename will work just fine on a running system, and there’s a good chance it’ll appear to work even if the developer takes the unusual step of testing during a system crash. Because this sequence will seem to work just fine most of the time, plenty of programs will have hidden data-loss bugs. That’s not a world I want to live in.
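For readers who have not seen it spelled out, the open-write-close-rename sequence in question looks roughly like the sketch below. It is only an illustration: the helper name `replace_without_fsync` is made up for this example, and it deliberately contains no fsync, which is exactly the version the post says will appear to work on a running system.

```python
import os
import tempfile


def replace_without_fsync(path, data):
    """Replace the contents of `path` via a temporary file and rename().

    Illustrative helper (not from any library). No fsync is issued, so
    what survives a crash depends entirely on the filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that rename()
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:   # open
            tmp.write(data)                # write
        # leaving the with-block closes the file (close)
        os.rename(tmp_path, path)          # rename over the old name
    except BaseException:
        # Best-effort cleanup of the temporary file on any failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

On a running system a reader of `path` sees either the old contents or the new ones, which is why nothing looks wrong during ordinary testing.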

Third, there’s the issue of API parsimony. Your semantics change rename from a hard to misuse API to one that’s very prone to misuse. See http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html (“How Do I Make This Hard to Misuse?”) and http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html (“What If I Don’t Actually Like My Users?”). On that scale, you’ve moved rename from a very good 7 (“the obvious use is (probably) the correct one”) to an appalling -5 (“do it right and it will sometimes break at runtime”).

On a running system, of course, a rename is atomic with respect to both the filename and its contents — otherwise it’d be useless. Under your semantics, however, you’ve effectively made rename without fsync a useless and dangerous, yet very conceptually tempting operation. Scolding application programmers to insert fsync calls will lead to confusion and frustration: fsync, as “make the data hit the disk now” doesn’t have anything conceptually to do with atomic replacement except as an arcane filesystem implementation detail. Anything that appears to work in the typical case, but that does something dangerous in special corner cases, is broken by design.

When an application performs a rename, the *INTENT* is to insert a write barrier for the data blocks of the file involved. When is this interpretation ever wrong? When is it ever useful to be able to tell the system to replace a file’s contents and its name, except when the system crashes, in which case you just want its name and some garbage? Nobody ever wants that.

But for the sake of argument, let’s bite our tongue and insert this fsync. The system can’t tell the difference between an fsync intended to ensure, say, message receipt, and an fsync that ensures after-crash consistency across a rename. Because we’re blocking and waiting for disk IO, application latency greatly increases (by up to three seconds, apparently). Users begin to complain. Now, the application developer has two choices: either implement the threaded solution you mention, or remove the fsync.
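The latency cost is easy to observe for yourself. The following is a rough sketch, not a benchmark: `timed_write` is a name invented here, and the numbers depend heavily on the disk, the filesystem and whatever else is in flight at the time.

```python
import os
import tempfile
import time


def timed_write(data, sync):
    """Return the seconds spent writing `data`, optionally followed by fsync()."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        os.write(fd, data)
        if sync:
            # Blocks until the kernel reports that the data has been flushed.
            os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    payload = b"x" * (1 << 20)
    print("cached write : %.6f s" % timed_write(payload, sync=False))
    print("write + fsync: %.6f s" % timed_write(payload, sync=True))
```

The cached write normally returns as soon as the data is in the page cache, while the fsync variant waits for the device, and that wait is what the application's user ends up feeling.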

The threaded solution gives the correct behavior, but is horribly complicated, or requires libraries on which the developer might not want to depend, especially for a small operation. Look what you’ve done now: not only have you made the correct code non-obvious, but you’ve made the correct code under-performing as well. It’s absolutely ludicrous to expect every program that wants to correctly replace a file’s contents to spawn off a worker thread.
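To give a feel for how much machinery that implies, here is one possible shape of such a worker thread. It is a sketch under assumptions: `BackgroundSaver` and `_durable_replace` are names made up for this example, and a real program would also need shutdown handling and some way to report failures to the user.

```python
import os
import queue
import tempfile
import threading


def _durable_replace(path, data):
    """Write to a temp file, fsync it, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())   # the blocking call we want off the main thread
    os.rename(tmp_path, path)


class BackgroundSaver:
    """Hypothetical helper: runs fsync+rename saves on one worker thread."""

    def __init__(self):
        self.errors = []                      # (path, exception) for failed saves
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def save(self, path, data):
        """Queue a save and return immediately, without waiting for the disk."""
        self._jobs.put((path, data))

    def wait(self):
        """Block until every queued save has been attempted."""
        self._jobs.join()

    def _run(self):
        while True:
            path, data = self._jobs.get()
            try:
                _durable_replace(path, data)
            except OSError as exc:
                self.errors.append((path, exc))
            finally:
                self._jobs.task_done()
```

Even this stripped-down version needs a queue, a thread, an error channel and a way to wait before exit, all to replace one small file without stalling the caller.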

Thus, most application developers will just remove the fsync. (Or do the moral equivalent, as KDE has done, and provide a knob to turn the fsync back on.) Now we’ve created a deliberate rare data-loss bug because the correct code is far too complicated.

Now, this situation in itself would be bad. But add laptop_mode, and now we’ve made an API the very contemplation of which drives men to unspeakable acts. We’ve added fsync everywhere, and we find it’s causing problems: the disk spins up all the time, as it must in order to maintain fsync’s semantics. So the solution is to neuter the very fsync you’ve implored application developers to add? Because you, the one making fsync a no-op, know that most of these fsyncs are there to maintain data consistency, you can have laptop_mode trade durability for battery life.

But some fsyncs are there to ensure application-level durability. Imagine an SMTP server. So, you create an “fsync-really-means-fsync” inheritable process flag. If an application developer has an *important* fsync call, he’ll just set this process flag and call fsync. Now, since that flag is contrary to 20 years of established use and will be a footnote on a newish version of the fsync manual, most application developers won’t actually know about it. Oh, they’ll call fsync, and their programs will appear to work just fine, even after dutiful hard-reboot testing.

Except when someone using laptop_mode has an unexpected power failure. Now that user has lost, and it wasn’t his fault. (“How the hell wasn’t the message on the disk? fsync returned success. Must be a bad disk. [hours pass] Oh, what’s this laptop_mode? Changes fsync? @!#$%!@#%”) And before you say “caveat modulator:” you shouldn’t need to be an expert on the data retention needs of each of your programs to extract good battery life.

Now when an application developer needs to actually use the *real* fsync, he turns on this process flag. Except he’s also dutifully using fsync to ensure rename consistency, so he has to create plumbing to manage the state of the magic fsync flag across different parts of his program so only the fsyncs that need to be real fsyncs are real. Let’s imagine this program also runs arbitrary other programs: it then needs to unset the magic inheritable fsync flag before fork, otherwise programs that don’t really need it will be running with the real fsync. That’s a non-trivial amount of work.

Also, application developers everywhere need to add autoconf tests for the magic process flag. Older programs will actually be broken, through no fault of their own. It’s either that, or rewrite the initialization scripts for programs that need the real fsync. (And in that case, the program may very well run far more real fsyncs than needed.)

Now you’ve made *two* traditional, long-standing system calls, rename and fsync, act dangerously in certain hard-to-test boundary cases, with elaborate and arcane workarounds that are so counter-intuitive (“fsync almost always means fsync?”) that developers will almost certainly get it wrong, at least the first time. Correct behavior might as well be in a disused lavatory behind a “beware of the leopard” sign.

What’s the alternative? fsync_and_i_mean_it? You could create an
fbarrier system call that applications would use to ensure data
consistency while preserving fsync’s current role. fbarrier might come
in handy in other contexts too. But of course that system call
wouldn’t be portable — but wait… we’ve already established that
when an application calls rename, it *means* to insert a write
barrier. fbarrier might be useful, but we can also infer it from a
rename call, and with perfect accuracy: when does an application *not*
want this behavior on rename?

So, just make rename include an implicit call to a conceptual
fbarrier. Existing applications work. Today. With no changes, or even
a recompile. Applications that call fsync before a rename at least do
no harm. rename remains an intuitive, powerful, and simple way for an
application developer to express what he wants to do (instead of being
a tasty-looking landmine). fsync doesn’t have to be treated specially
in certain bizarre modes. And you don’t really lose any efficiency,
because under your scheme, every correct application would have to
call fsync anyway — and I bet fbarrier would be far less expensive
than an outright fsync. (Or, if fsync really is cheap enough on a
given filesystem, make fbarrier *be* fsync.)

How often do you get to improve performance *and* safety at the same
time?
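As a rough picture of what "rename includes an implicit fbarrier" would mean from user space, consider the sketch below. To be clear about the assumptions: `fbarrier` is not an existing system call, so the stand-in here simply falls back to a full fsync (the "make fbarrier *be* fsync" option), and `ordered_rename` is a name invented for this example. It also relies on fsync accepting a read-only descriptor, which Linux does.

```python
import os
import tempfile


def fbarrier(fd):
    """Hypothetical barrier: order fd's data before later metadata updates.

    No such call exists today; this stand-in falls back to fsync(),
    which is stronger (and slower) than a pure ordering guarantee.
    """
    os.fsync(fd)


def ordered_rename(src, dst):
    """rename() preceded by a barrier on the source file's data."""
    fd = os.open(src, os.O_RDONLY)
    try:
        fbarrier(fd)        # make sure src's data is ordered first
    finally:
        os.close(fd)
    os.rename(src, dst)     # only then replace the name
```

The point of the proposal is that the filesystem would do the equivalent of `ordered_rename` inside rename itself, so unmodified applications get the ordering for free.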

Wishful thinking

Posted Mar 16, 2009 22:20 UTC (Mon) by bojan (subscriber, #14302) [Link] (19 responses)

> It should

I didn't want to read past that, sorry. I can also imagine that things should be some way or the other. They are, however, not.

Wishful thinking

Posted Mar 16, 2009 22:27 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (18 responses)

Now you're just trolling. The whole core of your argument is that "since POSIX allows this, and some filesystems do it that way, we should keep doing it that way". It doesn't logically follow. Maybe if you were to read my posts, you'd get a sense for what logic looks like.

Wishful thinking

Posted Mar 16, 2009 22:36 UTC (Mon) by bojan (subscriber, #14302) [Link] (16 responses)

> It doesn't logically follow.

Actually it does. If you _don't_ allow for what POSIX specifies in your applications (which is where the problem is), then there will be consequences (i.e. the applications will lose files).

This can be properly fixed in two ways:

1. By calling fsync() from the application when required.
2. By introducing something new that does what you keep talking about.

Overloading specified behaviour with unspecified things is dangerous, because it encourages application writers to do the wrong thing. We've seen that before with XFS and wrong people got blamed that time too.

Sure, Ted is a practical person, so he doesn't want to break things unnecessarily. I admire him for keeping his cool.

Wishful thinking

Posted Mar 16, 2009 23:32 UTC (Mon) by jamesh (guest, #1159) [Link] (14 responses)

This whole issue is about what should happen in a case that POSIX does not specify, so I don't know why you keep on bringing this up.

In cases where POSIX does not specify behaviour, it is left up to the implementation. If the choice is between trying to provide the runtime atomic rename guarantee over a crash or slightly higher performance, I'd pick the first option. After all, that's why I am running a crash-resistant file system in the first place.

Are you seriously saying you can't understand the benefits of delaying IO but preserving the order of certain operations over a "do it now" fsync() call?

Wishful thinking

Posted Mar 16, 2009 23:42 UTC (Mon) by bojan (subscriber, #14302) [Link] (13 responses)

The problem is with portability. If you write your applications to _not_ do what POSIX requires, they will be broken when they go to a different system which happens to have an FS that doesn't order renames on disk.

> Are you seriously saying you can't understand the benefits of delaying IO but preserving the order of certain operations over a "do it now" fsync() call?

1. Yes, I can understand it.
2. No, this is not what rename() specifies.

So, when an application writer thinks that it will be like that everywhere, he/she is wrong and the application may lose data. That is bad.

Hence, I'm suggesting that for the cases where ordered rename is warranted, we should have a new API.

PS. As I explained elsewhere, unordered rename has its use as well, so one cannot just assume that everyone should drop that and do ordered. It is also not practical to demand that, because too many systems would have to be audited and changed to achieve it. And before you say "but don't we have to fix more apps already" - well, the applications are buggy right now according to specification - not the other way around.

Wishful thinking

Posted Mar 17, 2009 0:18 UTC (Tue) by nix (subscriber, #2304) [Link] (10 responses)

You're acting as if POSIX is set in stone and can never change to account
for new de-facto standards, when in reality that is the *only* way it ever
changes (and often Linux is the source of such changes).

Ten years ago, would you have been arguing that programs that relied on
symlinks were broken because POSIX did not require them?

Wishful thinking

Posted Mar 17, 2009 0:31 UTC (Tue) by bojan (subscriber, #14302) [Link] (8 responses)

> Ten years ago, would you have been arguing that programs that relied on symlinks were broken because POSIX did not require them?

If the programs correctly tested to see if the support is there and then refused to work if symlinks were not there, there would be nothing wrong with them. So, by all means, if you write an application that tests that the underlying FS has ordered renames and refuses to work otherwise with sloppy open()/write()/close()/rename() sequence, that's perfectly OK. You just need to write even _more_ code to do this than if you just used fsync(). Up to you.

Wishful thinking

Posted Mar 17, 2009 1:24 UTC (Tue) by nix (subscriber, #2304) [Link] (7 responses)

The vast majority of programs, even when symlinks were optional, assumed
their presence, because the enormous majority of the installed base had
them.

This is actually worse. If you get the open()/write()/fsync()/close()/
rename() sequence wrong, by leaving out the fsync(), the visible effect
during development is *nil*, even on filesystems like pre-patch ext4,
because this is a change which only has an effect when something goes
really wrong and the OS crashes or you lose power at the wrong instant,
and if that happens, any data loss will be written off to the power
failure, like as not.
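Spelled out call by call, the full sequence might look like the following sketch. The function name `write_fsync_rename` and its arguments are assumptions made for this example; the calls themselves map one to one onto the sequence named above.

```python
import os


def write_fsync_rename(tmp_path, final_path, data):
    """open() / write() / fsync() / close() / rename(), in that order."""
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)  # open()
    try:
        view = memoryview(data)
        while view:                        # write(): loop because short writes are allowed
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                       # fsync(): the step that is easy to leave out
    finally:
        os.close(fd)                       # close()
    os.rename(tmp_path, final_path)        # rename()
```

Deleting the `os.fsync(fd)` line changes nothing that a test on a running system can observe, which is the whole problem being described.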

Expecting any but the most skilled developers to remember that fsync()
when omitting it has *no visible negative consequence* in normal operation
is a complete and total pipe-dream. You can wish all you will, but only a
few percent will ever conform.

It is much better to arrange to do the right thing in the filesystem,
which *does* have especially skilled people hacking at it, than in the
vast mass of wildly-varying-in-quality code out there in the real world.

Wishful thinking

Posted Mar 17, 2009 2:17 UTC (Tue) by bojan (subscriber, #14302) [Link] (6 responses)

> The vast majority of programs, even when symlinks were optional, assumed their presence, because the enormous majority of the installed base had them.

WOW! Programs have bugs. Imagine that ;-)

> Expecting any but the most skilled developers to remember that fsync() when omitting it has *no visible negative consequence* in normal operation is a complete and total pipe-dream.

The no negative visible consequence applies to one file system in one mode _only_ (and according to some, not even on it all the time). The rest - it depends.

If you ever tried to debug a race condition, you'd know that it can be really hard to do, because the system doesn't get into such conditions all the time. Did someone guarantee to you that programming was going to be easy? I must have missed that lesson ;-)

Oh, and for all the forgetful unskilled developers: man 2 close. I sure needed it :-(

> You can wish all you will, but only a few percent will ever conform.

And their applications will still suck and they will still rely on hacks in file systems to work. And of course, people doing this will be the ones loudest complaining that "file system is broken" when they encounter problems on another platform. Not even my six year old is this childish. But, hey - that's life.

> It is much better to arrange to do the right thing in the filesystem, which *does* have especially skilled people hacking at it, than in the vast mass of wildly-varying-in-quality code out there in the real world.

All you need to do is this:

1. Convince all FS writers to only use new semantics.
2. Convince POSIX folks to change the spec.

Good luck doing that.

PS. The vast majority of people do not program using APIs we are talking about here. They are using libraries that wrap all this up, other programming languages that have calls that wrap all this up etc. These will be written by people familiar with lower level POSIX APIs we are talking about here. For a good example, see: http://mail.gnome.org/archives/gtk-devel-list/2009-March/...

Wishful thinking

Posted Mar 17, 2009 2:23 UTC (Tue) by bojan (subscriber, #14302) [Link]

> people doing this

Of course, I mean your supposed vast majority that won't do the fsync here.

Wishful thinking

Posted Mar 17, 2009 2:26 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (3 responses)

The POSIX spec doesn't need to change one bit. Both behaviors entirely conform to POSIX.

And as for getting filesystems to change -- that's going to be the case. Any widely-used filesystem will encounter the same problem that started this mess, and will either implement the same fix or suffer the fate of XFS.

Wishful thinking

Posted Mar 17, 2009 2:35 UTC (Tue) by bojan (subscriber, #14302) [Link] (2 responses)

I see FS implementers shaking in their boots :-)

BTW, people already started fixing the code. Or didn't you read that GTK thread?

PS. Even Ted's workarounds in ext4 do not do full ordered rename in all cases. These are only for the cases of the most widely known application breakage. But, if you keep insisting, he may do the lockup-on-fsync for you, ext3 style, just so that you can get that nice UI feeling in properly written apps ;-)

Wishful thinking

Posted Mar 17, 2009 2:37 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (1 responses)

Care to link to this thread?

Wishful thinking

Posted Mar 17, 2009 2:44 UTC (Tue) by bojan (subscriber, #14302) [Link]

Already have. You have to go a few posts up.

Wishful thinking

Posted Mar 17, 2009 20:37 UTC (Tue) by nix (subscriber, #2304) [Link]

>> Expecting any but the most skilled developers to remember that fsync()
>> when omitting it has *no visible negative consequence* in normal
>> operation is a complete and total pipe-dream.
>
> The no negative visible consequence applies to one file system in one
> mode _only_ (and according to some, not even on it all the time). The
> rest - it depends.

I repeat: omitting fsync() has no negative visible consequence *in normal
operation* on *any* POSIX-compliant system. Turning off the power or
locking up the box is *not* 'normal operation'.

I know of no developers of anything other than full-blown databases who do
anything like that to test their programs. Thus, for nearly all programs,
omitting fsync() is harmless during the development and testing phase.
Thus, it will regularly be omitted, *no matter what* you might wish.

... and, um, changing POSIX really isn't that hard. Make a good case that
some behaviour is common enough and POSIX will bend. The Austin Group is
populated with normal human beings^W^Wraging pedants like you or I, not
gods. (There are some demigods there, though.)

It is quite possible to convince them that a change is needed, and POSIX
regularly changes semantics in new releases.

Wishful thinking

Posted Mar 17, 2009 0:33 UTC (Tue) by bojan (subscriber, #14302) [Link]

Oh, and if you want to change POSIX, please do so. I have no objection. As if my opinion mattered here ;-)

Wishful thinking

Posted Mar 17, 2009 8:35 UTC (Tue) by jamesh (guest, #1159) [Link]

> If you write your applications to _not_ do what POSIX requires, they will
> be broken when they go to a different system which happens to have an FS
> that doesn't order renames on disk.

We are talking about a case that POSIX leaves undefined here. An OS can wipe the disk on system crash and still be POSIX compliant.

We are in the realm of implementation defined behaviour, so talking of "applications doing what POSIX requires" doesn't really make sense. Claiming that the applications are buggy in a case where the specification offers no guidance doesn't help anyone.

Ext4's crash resistance is a desirable feature that exceeds the minimum requirements needed for POSIX conformance. Preserving atomic renames over a crash also exceeds those minimum requirements.

I'd be willing to pay the performance penalty from providing this behaviour in the same way I am willing to pay the performance penalty from metadata journaling.

A filesystem's job is not to punish users for application developers' oversights.

Posted Mar 18, 2009 0:52 UTC (Wed) by xoddam (subscriber, #2322) [Link]

This is *so* not about application developers or POSIX!

The *only* behaviour under discussion is recoverability across system failures. That's what POSIX doesn't (can't) guarantee, and it's what a journaling filesystem is supposed to provide *in addition* to the POSIX guarantees.

System administrators and users choose to run journaling filesystems so they don't waste time cleaning up after a crash. A journaling filesystem that makes it more, not less, likely for users to lose data isn't doing its job.

POSIX guarantees atomicity of rename -- while the system is running. Application developers code to that guarantee, without particular reference to what happens when the power is cut or some video driver scribbles on the kernel heap. If the system crashes, there is no POSIXLY_CORRECT guarantee that anything will be recoverable at all. Whether you use fsync or not.

A journaling filesystem is supposed to provide more reasonable behaviour FOR USERS. Its job is not to punish users for the corner cases that application developers didn't consider.

Wishful thinking

Posted Mar 18, 2009 0:14 UTC (Wed) by dvdeug (subscriber, #10998) [Link]

What POSIX specifies is that a compliant system, upon a system crash, can hunt down all hard copies that have been made and burn them, after overwriting the data on the disk seven times with zeros, ones, and random data. I'm not sure how an application is supposed to allow for that.

Wishful thinking

Posted Mar 17, 2009 20:44 UTC (Tue) by man_ls (guest, #15091) [Link]

> Now you're just trolling.

Now?

Wishful thinking

Posted Mar 16, 2009 22:37 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

quote:
When an application performs a rename, the *INTENT* is to insert a write barrier for the data blocks of the file involved. When is this interpretation ever wrong? When is it ever useful to be able to tell the system to replace a file’s contents and its name, except when the system crashes, in which case you just want its name and some garbage? Nobody ever wants that.

one very good example of when you would want to do a rename, but don't need to do a write barrier is when you are rolling logs.

this is usually done where one program has the file open and is writing data to it, another program renames the file, and then tells the first program to close the file and re-open the original name.

there is no need for a write barrier anywhere in this case.
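The log-rolling dance described above can be sketched in a single process, with one open file standing in for the logging program. This is only an illustration: `rotate_demo` and the file names are made up, and a real setup would signal a separate process to re-open its log.

```python
import os
import tempfile


def rotate_demo(directory):
    """Show that an open file follows the file, not the name, across rename()."""
    log_path = os.path.join(directory, "app.log")
    rotated_path = os.path.join(directory, "app.log.1")

    writer = open(log_path, "a")            # the logger holds the file open
    writer.write("before rotation\n")
    writer.flush()

    os.rename(log_path, rotated_path)       # the rotating program renames it

    writer.write("still the old file\n")    # lands in app.log.1, not app.log
    writer.flush()

    writer.close()                          # when told to, the logger closes...
    writer = open(log_path, "a")            # ...and re-opens the original name
    writer.write("after rotation\n")
    writer.close()

    with open(rotated_path) as f:
        old = f.read()
    with open(log_path) as f:
        new = f.read()
    return old, new
```

Nothing here replaces one file's contents with another's, so the ordering of data blocks relative to the rename does not matter for correctness.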

laptop mode explicitly breaks normal expected (failsafe) behavior in the interest of saving power. if the distro turns it on by default and never tells the user they are doing so (along with allowing the user to specify the 'how much data am I willing to lose' parameter) the distro is at fault, but that can be fixed.

adding the ability to selectively mask this, so that you can have some programs (say your word processor) go ahead and wake up the drive to save its data, but keep other non-critical things from doing so (even if those things _think_ that they are critical) would be a very good thing.

Wishful thinking

Posted Mar 16, 2009 22:42 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

> there is no need for a write barrier anywhere in this case.

You don't need a write barrier for a completely unmodified file either. But a write barrier hurts neither case.

> adding the ability to selectively mask this, so that you can have some programs (say your word processor) go ahead and wake up the drive to save its data, but keep other non-critical things from doing so (even if those things _think_ that they are critical) would be a very good thing.

That's a nightmare API that makes it very difficult to determine whether you're actually writing to the disk or not. If an application's data aren't critical, it shouldn't be calling fsync in the first place. And if the property of whether the data are critical can change, the application itself should provide a knob to control that. A process-level flub is both coarse and crude as a means of controlling that.

Wishful thinking

Posted Mar 16, 2009 23:56 UTC (Mon) by jlokier (guest, #52227) [Link]

By the way, Mac OS X really has an "fsync-and-I-really-mean-it" flag!

Look up F_FULLFSYNC, and why Linux fsync isn't a proper fsync anyway on most of its filesystems.
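A portable wrapper around that flag could look something like the sketch below. `full_sync` is a name made up for this example; it assumes Python's `fcntl` module, which exposes the `F_FULLFSYNC` constant only on platforms that define it, and it falls back to plain fsync everywhere else.

```python
import fcntl
import os
import tempfile


def full_sync(fd):
    """Flush `fd` as thoroughly as the platform allows.

    Uses fcntl(F_FULLFSYNC) where the constant exists (macOS) and
    falls back to an ordinary fsync() otherwise.
    """
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return
        except OSError:
            pass  # not supported for this file or filesystem; fall back below
    os.fsync(fd)
```

An SMTP server of the kind mentioned earlier in the thread would call something like this before acknowledging a message, while ordinary file replacement would not need it.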

