In brief

By Jonathan Corbet
July 1, 2009

O_NODE. Miklos Szeredi has proposed a new flag (O_NODE) which could be passed to open() calls. This flag, in essence, says that the calling program wants to open the indicated filesystem node, but doesn't want to actually do anything with it. With such opens, the underlying open() file operation will not be called, reads and writes will not be allowed, etc.

One might well wonder what the use for such an operation is. The main motivation would appear to be to allow an application to create a file descriptor which can be passed to other system calls - fstat(), say, or openat(). File descriptors used in this way do not really need access to the underlying file, so it makes sense to provide a way to create file descriptors without that access.
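As a rough illustration of that use case (not of O_NODE itself, which is only a proposal at this point), the sketch below uses an ordinary read-only directory descriptor purely as a handle: it is passed to fstat() and used as the base of an openat()-style lookup, and the directory's contents are never read through it. The use of Python's `os` module and the helper name `stat_and_open_relative` are assumptions of this example.

```python
import os
import stat
import tempfile


def stat_and_open_relative(dir_path, name):
    """Use a directory descriptor only as a handle for fstat() and openat()."""
    # An ordinary read-only open stands in for the proposed O_NODE open.
    dir_fd = os.open(dir_path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        info = os.fstat(dir_fd)  # fstat() on the handle
        # dir_fd= makes Python resolve `name` relative to the descriptor,
        # which is what openat() does.
        fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
        try:
            data = os.read(fd, 4096)
        finally:
            os.close(fd)
    finally:
        os.close(dir_fd)
    return info, data


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "hello.txt"), "wb") as f:
        f.write(b"hi\n")
    info, data = stat_and_open_relative(tmp, "hello.txt")
    is_dir = stat.S_ISDIR(info.st_mode)
```

Neither call needs to read or write through `dir_fd`, which is the kind of access an O_NODE open would leave out.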

O_PONIES. Rik van Riel has proposed another new open flag (actually called O_REWRITE) which is intended to help applications easily avoid the "zero-length files after a crash" problem. A program could open an existing file with O_REWRITE and get some special semantics. The new file would exist, invisibly, alongside the existing file for as long as it remains open; during that time, any other opens of that file would get the old version. Once the file is closed, the kernel will rename it to the given name in an atomic manner, ensuring that either the old version or the (full) new version will exist should a crash happen in the middle.

This option would make it easy for application developers to rewrite existing files without worrying about robustness. Some might respond that it would be better to just teach those developers to use fsync(), but, as Rik notes, "relying on application developers to get it right seems to not have worked out well". Rik's proposal currently lacks an accompanying patch, so it's not destined for the mainline anytime soon.
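The semantics described above can be approximated in user space today. The sketch below is only that, an approximation: the class name `RewriteFile` is hypothetical, and unlike the proposal the in-progress copy is visible under a temporary name in the same directory. It is meant to show the behavior being asked for: other opens see the old version until close, and the close swaps in the complete new version with a single rename.

```python
import os
import tempfile


class RewriteFile:
    """User-space approximation of the proposed O_REWRITE semantics."""

    def __init__(self, path):
        self.path = path
        directory = os.path.dirname(os.path.abspath(path))
        # The new version lives under a temporary name in the same directory,
        # so the final rename stays within one filesystem.
        self._fd, self._tmp = tempfile.mkstemp(dir=directory)

    def __enter__(self):
        return self

    def write(self, data):
        view = memoryview(data)
        while len(view) > 0:  # os.write() may write less than requested
            written = os.write(self._fd, view)
            view = view[written:]

    def __exit__(self, exc_type, exc, tb):
        try:
            if exc_type is None:
                os.fsync(self._fd)  # new contents reach disk before the rename
        finally:
            os.close(self._fd)
        if exc_type is None:
            os.replace(self._tmp, self.path)  # atomic rename over the old name
        else:
            os.unlink(self._tmp)  # abandon the rewrite, keep the old version
        return False


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "config")
    with open(target, "wb") as f:
        f.write(b"old\n")
    with RewriteFile(target) as new:
        new.write(b"new\n")
        with open(target, "rb") as f:
            seen_during = f.read()  # another opener still gets the old version
    with open(target, "rb") as f:
        seen_after = f.read()
```

The fsync() call is exactly the step the proposal hopes to spare application developers from having to remember.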

VFAT patents. As discussed elsewhere, Andrew Tridgell has posted a new lawyer-approved patch aimed at working around Microsoft's VFAT patents. The discussion on the lists has taken a bit of a different course this time around; there is still some annoyance at making changes like this to deal with the problems of the U.S. patent system, but those voices have been relatively quiet. Not completely quiet, though; Alan Cox said:

Putting the stuff in kernel upsets everyone who isn't under US rule, creates situations where things cannot be discussed and doesn't make one iota of difference to the vendors because they will remove the code from the source tree options and all anyway - because as has already been said it reduces risk.

Beyond that, what developers worry about is interoperability with other VFAT implementations. Alan Cox is asking that, if this patch goes in, the modified filesystem no longer be called "VFAT," since, as he sees it, it's now something else. Ted Ts'o has responded that "VFAT" is a bit of a slippery concept to begin with. It's not clear how this issue will be resolved.

Voyager's voyage. James Bottomley is a proud owner of an archaic Voyager system; Voyager is an x86-based architecture with a number of quaint features - though, contrary to rumor, steam power is not among them. It is not clear whether any Voyager-based systems are still running outside of James's basement. Nonetheless, James has been maintaining the Voyager architecture for years.

More recently, Voyager got kicked out when the code was broken in the process of an x86 subarchitecture-support rewrite. When James tried to get it put back in, x86 maintainer Ingo Molnar objected, saying that the costs of the patch were not justified by the benefits of serving such a small user base in the mainline kernel. In the end, Ingo rejected the patch outright, leading to what appeared to be an unsolvable stalemate between the two developers.

Things changed about the time that Linus jumped into the conversation:

Ingo, "absurd irrelevance" is not a reason. If it was, we'd lose about half our filesystems etc. Neither is "thousands of lines of code", or "it hasn't always worked". Again, if it was, then we'd have to get rid of just about all drivers out there.

Ingo eventually backed down on a number of his complaints about the Voyager patches. What remains, though, is a long list of technical problems with the Voyager tree and how it has been managed. James has accepted those complaints as valid, and will work toward resolving them. Before too long, Voyager owners (both of them) should once again have full support for their beloved architecture in the mainline kernel.

In brief

Posted Jul 2, 2009 8:57 UTC (Thu) by oseemann (guest, #6687) [Link] (3 responses)

After reading http://en.wikipedia.org/wiki/NCR_Voyager I wonder if that is the origin of the name vger.kernel.org?

In brief

Posted Jul 2, 2009 10:15 UTC (Thu) by jamesh (guest, #1159) [Link] (2 responses)

I'd always assumed that the hostname was a reference to the artificial intelligence from the first Star Trek movie.

In brief

Posted Jul 2, 2009 20:14 UTC (Thu) by SimonKagstrom (guest, #49801) [Link] (1 responses)

I believe it's from Arthur C. Clarke's book "2010". The Voyager probe is picked up by aliens, but sadly in a bad state. Some of the characters on the name plate have been wiped away, thus "vger".

Or maybe I remember something completely different :-)

In brief

Posted Jul 3, 2009 19:45 UTC (Fri) by sbergman27 (guest, #10767) [Link]

No. There was nothing about that in 2010. You're thinking of Star Trek: The Motion Picture.

O_PONIES

Posted Jul 2, 2009 20:51 UTC (Thu) by spitzak (guest, #4593) [Link] (18 responses)

Rik's idea is certainly what is wanted by everybody, in fact I think creat() (and thus the flag set O_CREAT|O_WRONLY|O_TRUNC) should work this way and that change is unlikely to break any programs and will fix a lot of programs.

His comments claiming that fsync() is the correct way to do things shows however that he still does not have the foggiest idea what programmers want and need, which is very disturbing for somebody who should be an expert on this stuff.

O_PONIES

Posted Jul 2, 2009 20:58 UTC (Thu) by mikov (guest, #33179) [Link] (16 responses)

His comments claiming that fsync() is the correct way to do things shows however that he still does not have the foggiest idea what programmers want and need, which is very disturbing for somebody who should be an expert on this stuff.

What do you mean?

The "correct" sequence of renames, temporary files and fsync-s is relatively complex and error-prone, however it does work and I don't see such a big advantage in adding a flag which too requires changing the existing sources.

O_PONIES

Posted Jul 2, 2009 21:18 UTC (Thu) by mjg59 (subscriber, #23239) [Link] (12 responses)

It "works" in the sense that it's technically correct. It's just not useful on ext3, where doing so will often cause a huge pile of disk i/o and block you for an irritatingly long period of time.

O_PONIES

Posted Jul 2, 2009 23:36 UTC (Thu) by mikov (guest, #33179) [Link] (4 responses)

Yes, but this is a problem in ext3 not in the syscall. It seems like a horrible idea to address problems in one file-system implementation by adding new Linux specific behavior to existing syscalls.

Adding transaction semantics to the file system in this one case is the road to madness. Fortunately I think that it has very little chance of being accepted in the kernel.

(Also, you can see my other response to spitzak below).

O_PONIES

Posted Jul 2, 2009 23:40 UTC (Thu) by mjg59 (subscriber, #23239) [Link] (3 responses)

Look at it from the application programmer's perspective. There's no way to tell the difference between ext3 and ext4 - they return the same magic number in the statfs call. So an application can't decide whether to use fsync() or not at runtime. Given that applications have to run well on ext3, fsync() is simply not an option. Unfortunate, but there you go.

O_PONIES

Posted Jul 2, 2009 23:57 UTC (Thu) by mikov (guest, #33179) [Link] (1 responses)

Alas, I think there is probably no easy and beautiful solution to this one, especially from the application programmer's perspective. I wouldn't advise relying on Linux-specific behavior, which at that will presumably be available only in the newest kernels, for such a basic task.

Instead I would suggest one of:

- Put all the error prone logic (the temporary file, fsync(), permission and attribute copying, etc) in a portable library routine relying only on POSIX and simply live with the fsync() overhead on ext3.

- Again put it in a portable library routine, but without the fsync(). The rename problem has already been addressed in ext4, so it is safe, and in ext3 there is a global sync frequently enough that it isn't a problem in practice (explaining why nobody ever complained about it before ext4).
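A library routine along these lines might look roughly like the sketch below. The function name `replace_file` and its `use_fsync` switch are hypothetical; the switch selects between the two options above, and only permission bits are copied here (ownership and other attributes are left out to keep the sketch short).

```python
import os
import shutil
import stat
import tempfile


def replace_file(path, data, use_fsync=True):
    """Replace `path` with `data` via a temporary file and rename()."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()  # push Python's buffer to the kernel
            if use_fsync:
                os.fsync(f.fileno())  # push the kernel's copy to disk
        if os.path.exists(path):
            shutil.copymode(path, tmp)  # carry over permission bits
        os.replace(tmp, path)  # rename() over the old name
    except BaseException:
        # Do not leave the temporary file behind on failure.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "settings")
    with open(target, "wb") as f:
        f.write(b"v1\n")
    os.chmod(target, 0o640)
    replace_file(target, b"v2\n", use_fsync=False)
    with open(target, "rb") as f:
        content = f.read()
    mode = stat.S_IMODE(os.stat(target).st_mode)
    leftovers = [n for n in os.listdir(tmp_dir) if n.startswith(".tmp-")]
```

With `use_fsync=False` the routine relies entirely on the filesystem ordering the rename after the data, which is the bet the second option makes.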

O_PONIES

Posted Jul 3, 2009 0:19 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

Right. My personal opinion is that Linux filesystems should be expected to behave atomically here - POSIX is a set of baselevel functionality, not an excuse for doing something that breaks applications. If people want portability to platforms that don't make these guarantees then they'll have to make some kind of sacrifice. I don't see adding extra flags as the best way of handling this.

O_PONIES

Posted Jul 3, 2009 16:55 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

I'm firmly in the fsync-is-horrible-renames-should-be-write-barriers camp, but if you want to expose a way for an application to tell whether it should fsync or not, the best way is to use the existing pathconf mechanism. Using that, an application can just ask the kernel, "Do I need to fsync this particular file?". It's much more elegant than trying to have applications puzzle it out from filesystem-type information.
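To make the idea concrete, here is a sketch of what such a query could look like from an application. The pathconf mechanism is real, but the variable name `PC_FSYNC_BEFORE_RENAME` is invented for this example and is not defined by any system I know of, so the helper always falls back to the conservative answer today. The `PC_NAME_MAX` query is included only to show the mechanism working with a name that does exist.

```python
import os
import tempfile

# Hypothetical name: no such pathconf variable exists today.
HYPOTHETICAL_NAME = "PC_FSYNC_BEFORE_RENAME"


def needs_fsync(fd):
    """Ask the filesystem whether fsync() is needed before rename()."""
    if HYPOTHETICAL_NAME not in os.pathconf_names:
        return True  # unknown to this system: stay on the safe side
    try:
        return os.fpathconf(fd, HYPOTHETICAL_NAME) != 0
    except OSError:
        return True  # the filesystem could not answer: assume fsync is needed


with tempfile.TemporaryFile() as f:
    # A real query, to show the mechanism itself: longest allowed file name.
    name_max = os.fpathconf(f.fileno(), "PC_NAME_MAX")
    must_sync = needs_fsync(f.fileno())
```

The answer is per file descriptor, so the same program could get different answers on different mounts without knowing anything about filesystem types.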

O_PONIES

Posted Jul 3, 2009 19:51 UTC (Fri) by sbergman27 (guest, #10767) [Link] (6 responses)

I believe that in kernel 2.6.30 and later, it is useful for ext3, and does not result in the delays you are thinking about. Ext3 was rather radically changed during the 2.6.30 development cycle. It is no longer the same bullet-proof fs which we had come to know and love.

O_PONIES

Posted Jul 3, 2009 20:27 UTC (Fri) by nix (subscriber, #2304) [Link] (5 responses)

I'd not call 'flipping the data=writeback mount option on by default'
a 'radical change'. Add 'data=ordered' to your mount lines, or set the
EXT3_DEFAULTS_TO_ORDERED config option, and bingo, right back
where we were before, huge and unpredictably horrible latencies and all.

Radical change? I don't think so.

O_PONIES

Posted Jul 5, 2009 12:39 UTC (Sun) by sbergman27 (guest, #10767) [Link] (4 responses)

No. Much more happened than that. And software developers, who are the audience for O_PONIES, cannot just add "data=ordered" to all their users' mount lines. From a developer's perspective, radical changes have been made to what they can expect out of ext3.

And while data=ordered does add back some latency, they had already resolved the bulk of what issue there was *before* they changed the default mount behavior. That was the last change they made, along with adding in some of Ted's damage mitigation patches from Ext4 to keep the situation from becoming quite the total disaster that was ext4's (lack of) reliability when it was first declared "ready".

Read the related LKML family of threads.

O_PONIES

Posted Jul 5, 2009 22:26 UTC (Sun) by nix (subscriber, #2304) [Link] (3 responses)

No. Much more happened than that.

Cite? That was the only change that was at all likely to affect reliability that I can recall. (And yes, I read those threads.)

You can also turn on data=ordered in the superblock, following which it is 'sticky' and requires no action whatsoever. (Software developers, of course, have no control over which of these options sysadmins choose: nor have they ever. Nothing has changed here.)

Personally I've been cutting the power nightly on a KDE3 installation running atop ext4 and have had, uh, no problems whatsoever. No vanishing config files, no mysterious corruption, nothing. It makes me wonder how serious this whole thing really could have been...

O_PONIES

Posted Jul 5, 2009 22:50 UTC (Sun) by sbergman27 (guest, #10767) [Link] (2 responses)

Yes. That's the only change that I recall that would affect reliability. But you were also saying that once you turn on data=ordered that you get back all the latency issues. Much of what was done to ext3 and the default io scheduler in 2.6.30, aside from changing the default journaling mode, was aimed at cutting latencies on fsync. (That work was important enough that Linus threatened to change the default i/o scheduler if Jens couldn't fix the problems it was causing ext3. Jens quickly complied.) The fact that there was an "obvious" reason (data=ordered) for ext3 latencies resulted in the *real* reasons going undetected for years because "everybody knew" it was just the penalty of the way data=ordered worked.

After the real problems were resolved, setting the default to data=writeback seemed mostly a final flourish, since Ted had patches to help mitigate the worst of the unreliability it would cause. I suspect that the default would not have been changed if the latency thing had not been such a long-standing issue. (One tends to swat hard-to-swat flies harder when one finally does get them.) And, of course, Ted really wanted to see the data=ordered default go away, since it was making ext4 look bad by comparison.

It's the combination of the change of default journaling mode *and* the relative painlessness of fsync on 2.6.30+ ext3 that make ext3 a very different animal from pre-2.6.30 ext3 for developers. They really have to be talked about separately.

And yes, the default journaling mode can also be specified in the superblock. But that is still *totally* irrelevant to the developer. The developer can mount his own filesystems with any options he wants. But it doesn't make a bit of difference to what happens when his *users* try to run his software on *their* machines.

O_PONIES

Posted Jul 5, 2009 23:55 UTC (Sun) by nix (subscriber, #2304) [Link] (1 responses)

Actually my understanding is that nobody realised how bad the latencies
could get. Linus insisted on the data=writeback default after running
tests with all the other patches in place showing enormously variable and
sometimes terrible fsync() latencies from tiny file writes when there was
a large dd going on in the background. Switching to writeback eliminated
those latencies.

So, not a 'flourish', unless you consider taking 10s to save a 500 byte
file a mere irrelevance.

O_PONIES

Posted Jul 6, 2009 0:45 UTC (Mon) by sbergman27 (guest, #10767) [Link]

No. Not 10s.

Linus was specifically *trying* to see how long he could make it stall. And by the time the other problems were fixed, it was hard for him to force a 2 second latency:

http://lwn.net/Articles/328381/

Once the data=writeback default was put into place, it became hard for him to force a 600ms latency.

The original latencies were *far* longer than 2s. (30 seconds had been reported by some. Though such reports did not seem common.) The change of journaling mode ended up bringing the worst case down by a mere 1400ms.

And considering that the common case (as opposed to the relatively rarely seen worst case) was livable enough that it took 8 years for anyone to care enough to attack the problem, I'd call the final change of default journaling mode to cut the worst case by a final 1400ms a flourish.

And Linus never "insisted" on the data=writeback default. Ted suggested it. And in view of the ext4 patches that Ted was suggesting be moved over to ext3 to mitigate the worst (but not all) of the resulting reliability issues, Linus said OK, let's do that.

O_PONIES

Posted Jul 2, 2009 22:04 UTC (Thu) by spitzak (guest, #4593) [Link]

fsync() does much *more* than is required by the program.

All the program wants is for the old file to be atomically replaced with the new file. It is ok if after a crash the old file remains and none of the new file is on disk. fsync forces far more i/o than this requires. What is unacceptable is that after a crash a state other than oldfile or newfile can be the result.

Another way of looking at this is we want the effects of fsync, but deferred until just at the moment the actual rename is done on the disk (this is ok as the disk is being written to anyway).

Yet another way is to follow the POSIX spec which says that we should never see any state other than the old or new file, and that fsync is not required for this to happen. Of course POSIX does not say what happens if the machine crashes, but I think any acceptable crash recovery should match the POSIX spec as much as possible, otherwise it is not really a crash recovery.

O_PONIES

Posted Jul 2, 2009 22:08 UTC (Thu) by spitzak (guest, #4593) [Link] (1 responses)

I also want to point out that I suggest that an existing arrangement of flags (the three used by creat()) should do this, so no extra flag is needed.

The "official" solution has a lot of problems: you need to invent a non-colliding temporary name, that temporary file is visible while the file is being written, and may remain permanently on disk if a crash happens. And the fsync has serious amounts of overhead if you are doing a very large number of these files.

O_PONIES

Posted Jul 2, 2009 23:32 UTC (Thu) by mikov (guest, #33179) [Link]

Although you make good points, I have to disagree with both of them:

1. With ext3 fsync() behaves like sync() which is way too expensive. This is a real problem, but it is a problem in ext3 not in creat(). Changing POSIX or adding new flags to address a specific performance problem in ext3 is simply wrong.

2. Nobody ever expected that they can simply rewrite a file and get atomic behavior in the face of crash. The file system is not a transaction DB nor should it be.

The incorrect code which is causing the infamous "zero length after reboot files" consists of creating a temporary file and renaming it, but without the fsync().

The only difference in the correct version of the code is the addition of fflush(), fsync() and more error checking.

So, I think that the proposed new flag, or alternatively your suggestion to treat a combination of flags in a special way, is a bad idea. They lend hidden Linux-specific semantics to the FS in just one case. This is simply awful.

In my opinion the correct way to approach this problem would be one of:
a) Guarantee that meta-data changes occur after data changes. Thus the rename should only be committed to disk after the contents of the file have been committed to disk. This is not required by POSIX, but is what most other filesystems do anyway.

b) Introduce a new version of fsync() acting like a barrier. It does not stall the application, but it guarantees that any subsequent operations on this file have to occur after the previous ones have been committed to disk. So, fsync_improved() will cause no delay for the application or system, but will impose the necessary ordering.

O_PONIES

Posted Jul 3, 2009 3:40 UTC (Fri) by njs (subscriber, #40338) [Link]

I thought all the problems here were with rewriting an existing file. What does O_CREAT have to do with anything?

(There are lots of programs that open a file with O_WRONLY|O_TRUNC and expect their writes to spool out over time. Log files are one obvious example.)
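A small sketch of that expectation, assuming nothing beyond the standard open flags (the file name `app.log` is just for illustration): a reader that opens the log while the writer still holds it open already sees what has been written so far. If this flag set hid the new contents until close, the first read below would come back empty.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")
    # The same flag set that creat() uses.
    fd = os.open(log_path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"started\n")
        # A second reader (think `tail -f`) sees the line while the writer
        # still has the file open.
        with open(log_path, "rb") as reader:
            seen_while_open = reader.read()
        os.write(fd, b"finished\n")
    finally:
        os.close(fd)
    with open(log_path, "rb") as reader:
        seen_after_close = reader.read()
```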

How about O_TSO_ME_HARDER

Posted Jul 6, 2009 0:06 UTC (Mon) by jordanb (guest, #45668) [Link] (2 responses)

In which, depending upon the state of lkml politics when the kernel in use was released, the filesystem on the device, and changes possibly made by the distribution, either you *must* use fsync early and often to avoid losing data, *or* using fsync will be prohibitively expensive.

Or possibly both.

At any rate no matter what choice you make when using the flag, condescending kernel developers will make obnoxious pandering statements about how you're too stupid to make the obvious correct choice.

How about O_TSO_ME_HARDER

Posted Jul 6, 2009 3:49 UTC (Mon) by bronson (subscriber, #4806) [Link]

Brilliant. Of course, the Linux kernel already provides this behavior. Adding the flag would just make it explicit.

How about O_TSO_ME_HARDER

Posted Jul 6, 2009 18:55 UTC (Mon) by nix (subscriber, #2304) [Link]

Of course that is a bad name for the flag because people will think it has
something to do with offloading TCP to network cards! ;}