O_PONIES

Posted Jul 2, 2009 20:58 UTC (Thu) by mikov (subscriber, #33179)
In reply to: O_PONIES by spitzak
Parent article: In brief

His comments claiming that fsync() is the correct way to do things shows however that he still does not have the foggiest idea what programmers want and need, which is very disturbing for somebody who should be an expert on this stuff.

What do you mean?

The "correct" sequence of renames, temporary files and fsync() calls is relatively complex and error-prone. However, it does work, and I don't see much advantage in adding a flag that would also require changing the existing sources.


O_PONIES

Posted Jul 2, 2009 21:18 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

It "works" in the sense that it's technically correct. It's just not useful on ext3, where doing so will often cause a huge pile of disk i/o and block you for an irritatingly long period of time.

O_PONIES

Posted Jul 2, 2009 23:36 UTC (Thu) by mikov (subscriber, #33179) [Link]

Yes, but this is a problem in ext3, not in the syscall. It seems like a horrible idea to address problems in one filesystem implementation by adding new Linux-specific behavior to existing syscalls.

Adding transaction semantics to the file system in this one case is the road to madness. Fortunately I think that it has very little chance of being accepted in the kernel.

(Also, you can see my other response to spitzak below).

O_PONIES

Posted Jul 2, 2009 23:40 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

Look at it from the application programmer's perspective. There's no way to tell the difference between ext3 and ext4 - they return the same magic number in the statfs call. So an application can't decide whether to use fsync() or not at runtime. Given that applications have to run well on ext3, fsync() is simply not an option. Unfortunate, but there you go.

O_PONIES

Posted Jul 2, 2009 23:57 UTC (Thu) by mikov (subscriber, #33179) [Link]

Alas, I think there is probably no easy and beautiful solution to this one, especially from the application programmer's perspective. I wouldn't advise relying on Linux-specific behavior, which moreover will presumably be available only in the newest kernels, for such a basic task.

Instead I would suggest one of:

- Put all the error-prone logic (the temporary file, fsync(), permission and attribute copying, etc.) in a portable library routine relying only on POSIX and simply live with the fsync() overhead on ext3.

or

- Again put it in a portable library routine, but without the fsync(). The rename problem has already been addressed in ext4, so it is safe, and in ext3 there is a global sync frequently enough that it isn't a problem in practice (explaining why nobody ever complained about it before ext4).

O_PONIES

Posted Jul 3, 2009 0:19 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

Right. My personal opinion is that Linux filesystems should be expected to behave atomically here - POSIX is a set of baselevel functionality, not an excuse for doing something that breaks applications. If people want portability to platforms that don't make these guarantees then they'll have to make some kind of sacrifice. I don't see adding extra flags as the best way of handling this.

O_PONIES

Posted Jul 3, 2009 16:55 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

I'm firmly in the fsync-is-horrible-renames-should-be-write-barriers camp, but if you want to expose a way for an application to tell whether it should fsync or not, the best way is to use the existing pathconf mechanism. Using that, an application can just ask the kernel, "Do I need to fsync this particular file?". It's much more elegant than trying to have applications puzzle it out from filesystem-type information.
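For reference, this is roughly what a pathconf query looks like today. Note that no pathconf name meaning "fsync is needed before rename" exists; that part is only the proposal. The wrapper query_path_property() is a hypothetical helper, and the usage below passes the existing _PC_NAME_MAX purely as a stand-in for whatever new name would be defined.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical wrapper around pathconf(). Returns 1 and stores the
 * answer in *value if the system gave one, 0 if the property has no
 * limit or is not supported for this path, -1 on error (errno set). */
int query_path_property(const char *path, int name, long *value)
{
    long v;

    assert(path != NULL && value != NULL);

    /* pathconf() returns -1 both for "no answer" and for errors; the
     * two cases are told apart by whether errno was changed. */
    errno = 0;
    v = pathconf(path, name);
    if (v == -1)
        return errno == 0 ? 0 : -1;

    *value = v;
    return 1;
}
```

An application would call query_path_property(path, _PC_NAME_MAX, &v) in the same way it would call it with the proposed new name, and fall back to the conservative fsync() path whenever the result is not 1.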

O_PONIES

Posted Jul 3, 2009 19:51 UTC (Fri) by sbergman27 (subscriber, #10767) [Link]

I believe that in kernel 2.6.30 and later, it is useful for ext3, and does not result in the delays you are thinking about. Ext3 was rather radically changed during the 2.6.30 development cycle. It is no longer the same bullet-proof fs which we had come to know and love.

O_PONIES

Posted Jul 3, 2009 20:27 UTC (Fri) by nix (subscriber, #2304) [Link]

I'd not call 'flipping the data=writeback mount option on by default'
a 'radical change'. Add 'data=ordered' to your mount lines, or set the
EXT3_DEFAULTS_TO_ORDERED config option, and bingo, you are right back
where we were before, huge and unpredictably horrible latencies and all.

Radical change? I don't think so.

O_PONIES

Posted Jul 5, 2009 12:39 UTC (Sun) by sbergman27 (subscriber, #10767) [Link]

No. Much more happened than that. And software developers, who are the audience for O_PONIES, cannot just add "data=ordered" to all their users' mount lines. From a developer's perspective, radical changes have been made to what they can expect out of ext3.

And while data=ordered does add back some latency, they had already resolved the bulk of what issue there was *before* they changed the default mount behavior. That was the last change they made, along with adding in some of Ted's damage mitigation patches from Ext4 to keep the situation from becoming quite the total disaster that was ext4's (lack of) reliability when it was first declared "ready".

Read the related LKML family of threads.

O_PONIES

Posted Jul 5, 2009 22:26 UTC (Sun) by nix (subscriber, #2304) [Link]

No. Much more happened than that.
Cite? That was the only change that was at all likely to affect reliability that I can recall. (And yes, I read those threads.)

You can also turn on data=ordered in the superblock, following which it is 'sticky' and requires no action whatsoever. (Software developers, of course, have no control over which of these options sysadmins choose, nor have they ever. Nothing has changed here.)

Personally I've been cutting the power nightly on a KDE3 installation running atop ext4 and have had, uh, no problems whatsoever. No vanishing config files, no mysterious corruption, nothing. It makes me wonder how serious this whole thing really could have been...

O_PONIES

Posted Jul 5, 2009 22:50 UTC (Sun) by sbergman27 (subscriber, #10767) [Link]

Yes. That's the only change that I recall that would affect reliability. But you were also saying that once you turn on data=ordered you get back all the latency issues. Much of what was done to ext3 and the default i/o scheduler in 2.6.30, aside from changing the default journaling mode, was aimed at cutting latencies on fsync. (That work was important enough that Linus threatened to change the default i/o scheduler if Jens couldn't fix the problems it was causing ext3. Jens quickly complied.) The fact that there was an "obvious" reason (data=ordered) for ext3 latencies resulted in the *real* reasons going undetected for years because "everybody knew" it was just the penalty of the way data=ordered worked.

After the real problems were resolved, setting the default to data=writeback seemed mostly a final flourish, since Ted had patches to help mitigate the worst of the unreliability it would cause. I suspect that the default would not have been changed if the latency thing had not been such a long-standing issue. (One tends to swat hard-to-swat flies harder when one finally does get them.) And, of course, Ted really wanted to see the data=ordered default go away, since it was making ext4 look bad by comparison.

It's the combination of the change of default journaling mode *and* the relative painlessness of fsync on 2.6.30+ ext3 that make ext3 a very different animal from pre-2.6.30 ext3 for developers. They really have to be talked about separately.

And yes, the default journaling mode can also be specified in the superblock. But that is still *totally* irrelevant to the developer. The developer can mount his own filesystems with any options he wants. But it doesn't make a bit of difference to what happens when his *users* try to run his software on *their* machines.

O_PONIES

Posted Jul 5, 2009 23:55 UTC (Sun) by nix (subscriber, #2304) [Link]

Actually my understanding is that nobody realised how bad the latencies
could get. Linus insisted on the data=writeback default after running
tests with all the other patches in place showing enormously variable and
sometimes terrible fsync() latencies from tiny file writes when there was
a large dd going on in the background. Switching to writeback eliminated
those latencies.

So, not a 'flourish', unless you consider taking 10s to save a 500 byte
file a mere irrelevance.

O_PONIES

Posted Jul 6, 2009 0:45 UTC (Mon) by sbergman27 (subscriber, #10767) [Link]

No. Not 10s.

Linus was specifically *trying* to see how long he could make it stall. And by the time the other problems were fixed, it was hard for him to force a 2 second latency:

http://lwn.net/Articles/328381/

Once the data=writeback default was put into place, it became hard for him to force a 600ms latency.

The original latencies were *far* longer than 2s. (30 seconds had been reported by some. Though such reports did not seem common.) The change of journaling mode ended up bringing the worst case down by a mere 1400ms.

And considering that the common case (as opposed to the relatively rarely seen worst case) was livable enough that it took 8 years for anyone to care enough to attack the problem, I'd call the final change of default journaling mode to cut the worst case by a final 1400ms a flourish.

And Linus never "insisted" on the data=writeback default. Ted suggested it. And in view of the ext4 patches that Ted was suggesting be moved over to ext3 to mitigate the worst (but not all) of the resulting reliability issues, Linus said OK, let's do that.

O_PONIES

Posted Jul 2, 2009 22:04 UTC (Thu) by spitzak (guest, #4593) [Link]

fsync() does much *more* than is required by the program.

All the program wants is for the old file to be atomically replaced with the new file. It is ok if after a crash the old file remains and none of the new file is on disk. fsync forces far more i/o than this requires. What is unacceptable is that after a crash a state other than oldfile or newfile can be the result.

Another way of looking at this is we want the effects of fsync, but deferred until just at the moment the actual rename is done on the disk (this is ok as the disk is being written to anyway).

Yet another way is to follow the POSIX spec which says that we should never see any state other than the old or new file, and that fsync is not required for this to happen. Of course POSIX does not say what happens if the machine crashes, but I think any acceptable crash recovery should match the POSIX spec as much as possible, otherwise it is not really a crash recovery.

O_PONIES

Posted Jul 2, 2009 22:08 UTC (Thu) by spitzak (guest, #4593) [Link]

I also want to point out that I suggest that an existing arrangement of flags (the three used by creat()) should do this, so no extra flag is needed.

The "official" solution has a lot of problems: you need to invent a non-colliding temporary name, that temporary file is visible while the file is being written, and may remain permanently on disk if a crash happens. And the fsync has serious amounts of overhead if you are doing a very large number of these files.

O_PONIES

Posted Jul 2, 2009 23:32 UTC (Thu) by mikov (subscriber, #33179) [Link]

Although you make good points, I have to disagree with both of them:

1. With ext3, fsync() behaves like sync(), which is way too expensive. This is a real problem, but it is a problem in ext3, not in creat(). Changing POSIX or adding new flags to address a specific performance problem in ext3 is simply wrong.

2. Nobody ever expected that they can simply rewrite a file and get atomic behavior in the face of crash. The file system is not a transaction DB nor should it be.

The incorrect code which is causing the infamous "zero length after reboot files" consists of creating a temporary file and renaming it, but without the fsync().

The only difference in the correct version of the code is the addition of fflush(), fsync() and more error checking.
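As a rough illustration of the sequence being discussed, here is a minimal sketch using only POSIX calls. The function name replace_file_atomically() is made up, and the sketch makes simplifying assumptions: it uses write() directly, so there is no stdio buffer to fflush(); it does not copy permissions or attributes from the old file (mkstemp() creates the temporary file with mode 0600); and it does not sync the containing directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `data`.
 * Returns 0 on success, -1 on failure with errno set. */
int replace_file_atomically(const char *path, const void *data, size_t len)
{
    const char *p = data;
    size_t left = len;
    size_t n;
    char *tmp;
    int fd, saved;

    assert(path != NULL && (data != NULL || len == 0));

    /* The temporary file must be on the same filesystem as the
     * target, so create it next to it as "<path>.XXXXXX". */
    n = strlen(path) + sizeof ".XXXXXX";
    tmp = malloc(n);
    if (tmp == NULL)
        return -1;
    snprintf(tmp, n, "%s.XXXXXX", path);

    fd = mkstemp(tmp);
    if (fd < 0) {
        saved = errno;
        free(tmp);
        errno = saved;
        return -1;
    }

    /* Write everything, retrying on short writes and EINTR. */
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        left -= (size_t)w;
    }

    /* The step the broken version omits: push the data to disk
     * before the rename can make the new name visible. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    if (rename(tmp, path) < 0)
        goto fail;
    free(tmp);
    return 0;

fail:
    saved = errno;
    if (fd >= 0)
        close(fd);
    unlink(tmp);   /* do not leave the temporary file behind */
    free(tmp);
    errno = saved;
    return -1;
}
```

Dropping the fsync() call gives the variant that can produce zero-length files after a crash on a filesystem that commits the rename before the data.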

So, I think that the proposed new flag, or alternatively your suggestion to treat a combination of flags in a special way, is a bad idea. They lend hidden Linux-specific semantics to the FS in just one case. This is simply awful.

In my opinion the correct way to approach this problem would be one of:
a) Guarantee that meta-data changes occur after data changes. Thus the rename should only be committed to disk after the contents of the file have been committed to disk. This is not required by POSIX, but is what most other filesystems do anyway.

b) Introduce a new version of fsync() acting like a barrier. It does not stall the application, but it guarantees that any subsequent operations on this file have to occur after the previous ones have been committed to disk. So, fsync_improved() will cause no delay for the application or system, but will impose the necessary ordering.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds