In brief
One might well wonder what the use for such an operation is. The main motivation would appear to be to allow an application to create a file descriptor which can be passed to other system calls - fstat(), say, or openat(). File descriptors used in this way do not really need access to the underlying file, so it makes sense to provide a way to create file descriptors without that access.
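As a rough illustration of that use case, the following Python sketch opens a directory purely as a handle and passes it to fstat() and an openat()-style lookup. It assumes Linux's O_PATH flag, which is not necessarily the operation discussed here. The function name stat_and_open_relative is hypothetical.

```python
import os
import tempfile


def stat_and_open_relative(directory, name):
    """Open `directory` as a bare handle, then use that descriptor only
    for fstat() and for an openat()-style lookup of `name`."""
    # Assumption: O_PATH is Linux-specific, so fall back to a plain
    # read-only open on platforms where the os module lacks it.
    flags = getattr(os, "O_PATH", os.O_RDONLY) | os.O_DIRECTORY
    dir_fd = os.open(directory, flags)
    try:
        info = os.fstat(dir_fd)  # fstat() on the handle itself
        # dir_fd= makes this an openat() call relative to the handle.
        fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
        try:
            data = os.read(fd, 4096)
        finally:
            os.close(fd)
    finally:
        os.close(dir_fd)
    return info, data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "example.txt"), "wb") as f:
            f.write(b"hello")
        info, data = stat_and_open_relative(d, "example.txt")
        print(info.st_ino, data)
```

The directory descriptor is never read from or written to; it serves only as a reference point for the other two calls.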
O_PONIES. Rik van Riel has proposed another new open flag (actually called O_REWRITE) which is intended to help applications easily avoid the "zero-length files after a crash" problem. A program could open an existing file with O_REWRITE and get special semantics: the new file would exist, invisibly, alongside the existing file for as long as it remained open, and during that time any other opens of that file would get the old version. Once the file was closed, the kernel would rename it to the given name in an atomic manner, ensuring that either the old version or the (full) new version would exist should a crash happen in the middle.
This option would make it easy for application developers to rewrite existing files without worrying about robustness. Some might respond that it would be better to just teach those developers to use fsync(), but, as Rik notes, "relying on application developers to get it right seems to not have worked out well". Rik's proposal currently lacks an accompanying patch, so it's not destined for the mainline anytime soon.
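For comparison, the semantics described for O_REWRITE can be approximated in user space today. The Python sketch below is only an emulation under stated assumptions: the context manager name rewriting and the ".rewrite-" prefix are hypothetical, the temporary file is visible in the directory (unlike the proposed invisible file), and it is created with restrictive default permissions instead of inheriting those of the old file.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def rewriting(path):
    """Yield a writable file whose contents replace `path` only when the
    block finishes without an error."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory, hence same filesystem, so the rename can be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".rewrite-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            yield tmp
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to write it to disk
        os.replace(tmp_path, path)   # atomically take over the old name
    except BaseException:
        # Any failure leaves the old file untouched; drop the temporary.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

Readers that open the path while the block is running still get the old contents, and a process that already holds the old file open keeps seeing it after the rename.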
VFAT patents. As discussed elsewhere, Andrew Tridgell has posted a new lawyer-approved patch aimed at working around Microsoft's VFAT patents. The discussion on the lists has taken a bit of a different course this time around; there is still some annoyance at making changes like this to deal with the problems of the U.S. patent system, but those voices have been relatively quiet. Not completely quiet, though; Alan Cox said:
Beyond that, what developers worry about is interoperability with other VFAT implementations. Alan Cox is asking that, if this patch goes in, the modified filesystem no longer be called "VFAT," since, as he sees it, it's now something else. Ted Ts'o has responded that "VFAT" is a bit of a slippery concept to begin with. It's not clear how this issue will be resolved.
Voyager's voyage. James Bottomley is a proud owner of an archaic Voyager system; Voyager is an x86-based architecture with a number of quaint features - though, contrary to rumor, steam power is not among them. It is not clear whether any Voyager-based systems are still running outside of James's basement. Nonetheless, James has been maintaining the Voyager architecture for years.
More recently, Voyager got kicked out when the code was broken in the process of an x86 subarchitecture-support rewrite. When James tried to get it put back in, x86 maintainer Ingo Molnar objected, saying that the costs of the patch were not justified by the benefits of serving such a small user base in the mainline kernel. In the end, Ingo rejected the patch outright, leading to what appeared to be an unsolvable stalemate between the two developers.
Things changed about the time that Linus jumped into the conversation:
Ingo eventually backed down on a number of
his complaints about the Voyager patches. What remains, though, is a long
list of technical problems with the Voyager tree and how it has been
managed. James has accepted those complaints as valid, and will work
toward resolving them. Before too long, Voyager owners (both of them)
should once again have full support for their beloved architecture in the
mainline kernel.
Posted Jul 2, 2009 8:57 UTC (Thu)
by oseemann (guest, #6687)
[Link] (3 responses)
Posted Jul 2, 2009 10:15 UTC (Thu)
by jamesh (guest, #1159)
[Link] (2 responses)
Posted Jul 2, 2009 20:14 UTC (Thu)
by SimonKagstrom (guest, #49801)
[Link] (1 responses)
Or maybe I remember something completely different :-)
Posted Jul 3, 2009 19:45 UTC (Fri)
by sbergman27 (guest, #10767)
[Link]
Posted Jul 2, 2009 20:51 UTC (Thu)
by spitzak (guest, #4593)
[Link] (18 responses)
His comments claiming that fsync() is the correct way to do things show, however, that he still does not have the foggiest idea what programmers want and need, which is very disturbing for somebody who should be an expert on this stuff.
Posted Jul 2, 2009 20:58 UTC (Thu)
by mikov (guest, #33179)
[Link] (16 responses)
What do you mean?
The "correct" sequence of renames, temporary files and fsync-s is relatively complex and error-prone; however, it does work, and I don't see such a big advantage in adding a flag which also requires changing the existing sources.
Posted Jul 2, 2009 21:18 UTC (Thu)
by mjg59 (subscriber, #23239)
[Link] (12 responses)
Posted Jul 2, 2009 23:36 UTC (Thu)
by mikov (guest, #33179)
[Link] (4 responses)
Adding transaction semantics to the file system in this one case is the road to madness. Fortunately I think that it has very little chance of being accepted in the kernel.
(Also, you can see my other response to spitzak below).
Posted Jul 2, 2009 23:40 UTC (Thu)
by mjg59 (subscriber, #23239)
[Link] (3 responses)
Posted Jul 2, 2009 23:57 UTC (Thu)
by mikov (guest, #33179)
[Link] (1 responses)
Instead I would suggest one of:
- Put all the error prone logic (the temporary file, fsync(), permission and attribute copying, etc) in a portable library routine relying only on POSIX and simply live with the fsync() overhead on ext3.
or
- Again put it in a portable library routine, but without the fsync(). The rename problem has already been addressed in ext4, so it is safe, and in ext3 there is a global sync frequently enough that it isn't a problem in practice (explaining why nobody ever complained about it before ext4).
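A library routine along the lines of these two suggestions might look like the Python sketch below. It is an illustration only: the name replace_file and the durable parameter are hypothetical, and only the permission bits are copied (ownership, ACLs and extended attributes are left out).

```python
import os
import stat
import tempfile


def replace_file(path, data, durable=True):
    """Replace `path` with `data` through a temporary file and rename().

    durable=True corresponds to the first suggestion (fsync() before the
    rename). durable=False corresponds to the second, which relies on the
    filesystem to write the data out ahead of the rename.
    """
    directory = os.path.dirname(os.path.abspath(path))
    try:
        mode = stat.S_IMODE(os.stat(path).st_mode)
    except FileNotFoundError:
        mode = None  # nothing to copy when the target does not exist yet
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            if mode is not None:
                os.fchmod(tmp.fileno(), mode)  # carry over permission bits
            tmp.write(data)
            tmp.flush()
            if durable:
                os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the old file in place and clean up the temporary.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Keeping the fsync() behind a single parameter means a caller can switch between the two behaviours without touching the rest of the sequence.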
Posted Jul 3, 2009 0:19 UTC (Fri)
by mjg59 (subscriber, #23239)
[Link]
Posted Jul 3, 2009 16:55 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link]
Posted Jul 3, 2009 19:51 UTC (Fri)
by sbergman27 (guest, #10767)
[Link] (6 responses)
Posted Jul 3, 2009 20:27 UTC (Fri)
by nix (subscriber, #2304)
[Link] (5 responses)
Radical change? I don't think so.
Posted Jul 5, 2009 12:39 UTC (Sun)
by sbergman27 (guest, #10767)
[Link] (4 responses)
And while data=ordered does add back some latency, they had already resolved the bulk of what issue there was *before* they changed the default mount behavior. That was the last change they made, along with adding in some of Ted's damage mitigation patches from Ext4 to keep the situation from becoming quite the total disaster that was ext4's (lack of) reliability when it was first declared "ready".
Read the related LKML family of threads.
Posted Jul 5, 2009 22:26 UTC (Sun)
by nix (subscriber, #2304)
[Link] (3 responses)
You can also turn on data=ordered in the superblock, following which it
is 'sticky' and requires no action whatsoever. (Software developers, of
course, have no control over which of these options sysadmins choose: nor
have they ever. Nothing has changed here.)
Personally I've been cutting the power nightly on a KDE3 installation
running atop ext4 and have had, uh, no problems whatsoever. No vanishing
config files, no mysterious corruption, nothing. It makes me wonder how
serious this whole thing really could have been...
Posted Jul 5, 2009 22:50 UTC (Sun)
by sbergman27 (guest, #10767)
[Link] (2 responses)
After the real problems were resolved, setting the default to data=writeback seemed mostly a final flourish, since Ted had patches to help mitigate the worst of the unreliability it would cause. I suspect that the default would not have been changed if the latency thing had not been such a long-standing issue. (One tends to swat hard-to-swat flies harder when one finally does get them.) And, of course, Ted really wanted to see the data=ordered default go away, since it was making ext4 look bad by comparison.
It's the combination of the change of default journaling mode *and* the relative painlessness of fsync on 2.6.30+ ext3 that make ext3 a very different animal from pre-2.6.30 ext3 for developers. They really have to be talked about separately.
And yes, the default journaling mode can also be specified in the superblock. But that is still *totally* irrelevant to the developer. The developer can mount his own filesystems with any options he wants. But it doesn't make a bit of difference to what happens when his *users* try to run his software on *their* machines.
Posted Jul 5, 2009 23:55 UTC (Sun)
by nix (subscriber, #2304)
[Link] (1 responses)
So, not a 'flourish', unless you consider taking 10s to save a 500 byte
Posted Jul 6, 2009 0:45 UTC (Mon)
by sbergman27 (guest, #10767)
[Link]
Linus was specifically *trying* to see how long he could make it stall. And by the time the other problems were fixed, it was hard for him to force a 2 second latency:
http://lwn.net/Articles/328381/
Once the data=writeback default was put into place, it became hard for him to force a 600ms latency.
The original latencies were *far* longer than 2s. (30 seconds had been reported by some. Though such reports did not seem common.) The change of journaling mode ended up bringing the worst case down by a mere 1400ms.
And considering that the common case (as opposed to the relatively rarely seen worst case) was livable enough that it took 8 years for anyone to care enough to attack the problem, I'd call the final change of default journaling mode to cut the worst case by a final 1400ms a flourish.
And Linus never "insisted" on the data=writeback default. Ted suggested it. And in view of the ext4 patches that Ted was suggesting be moved over to ext3 to mitigate the worst (but not all) of the resulting reliability issues, Linus said OK, let's do that.
Posted Jul 2, 2009 22:04 UTC (Thu)
by spitzak (guest, #4593)
[Link]
All the program wants is for the old file to be atomically replaced with the new file. It is ok if after a crash the old file remains and none of the new file is on disk. fsync forces far more i/o than this requires. What is unacceptable is that after a crash a state other than oldfile or newfile can be the result.
Another way of looking at this is we want the effects of fsync, but deferred until just at the moment the actual rename is done on the disk (this is ok as the disk is being written to anyway).
Yet another way is to follow the POSIX spec which says that we should never see any state other than the old or new file, and that fsync is not required for this to happen. Of course POSIX does not say what happens if the machine crashes, but I think any acceptable crash recovery should match the POSIX spec as much as possible, otherwise it is not really a crash recovery.
Posted Jul 2, 2009 22:08 UTC (Thu)
by spitzak (guest, #4593)
[Link] (1 responses)
The "official" solution has a lot of problems: you need to invent a non-colliding temporary name, that temporary file is visible while the file is being written, and may remain permanently on disk if a crash happens. And the fsync has serious amounts of overhead if you are doing a very large number of these files.
Posted Jul 2, 2009 23:32 UTC (Thu)
by mikov (guest, #33179)
[Link]
1. With ext3 fsync() behaves like sync() which is way too expensive. This is a real problem, but it is a problem in ext3 not in creat(). Changing POSIX or adding new flags to address a specific performance problem in ext3 is simply wrong.
2. Nobody ever expected that they can simply rewrite a file and get atomic behavior in the face of crash. The file system is not a transaction DB nor should it be.
The incorrect code which is causing the infamous "zero length after reboot files" consists of creating a temporary file and renaming it, but without the fsync().
The only difference in the correct version of the code is the addition of fflush(), fsync() and more error checking.
So, I think that the proposed new flag, or alternatively your suggestion to treat a combination of flags in a special way, is a bad idea. They lend hidden Linux-specific semantics to the FS in just one case. This is simply awful.
In my opinion the correct way to approach this problem would be one of:
b) Introduce a new version of fsync() acting like a barrier. It does not stall the application, but it guarantees that any subsequent operations on this file have to occur after the previous ones have been committed to disk. So, fsync_improved() will cause no delay for the application or system, but will impose the necessary ordering.
Posted Jul 3, 2009 3:40 UTC (Fri)
by njs (subscriber, #40338)
[Link]
(There are lots of programs that open a file with O_WRONLY|O_TRUNC and expect their writes to spool out over time. Log files are one obvious example.)
Posted Jul 6, 2009 0:06 UTC (Mon)
by jordanb (guest, #45668)
[Link] (2 responses)
Or possibly both.
At any rate no matter what choice you make when using the flag, condescending kernel developers will make obnoxious pandering statements about how you're too stupid to make the obvious correct choice.
Posted Jul 6, 2009 3:49 UTC (Mon)
by bronson (subscriber, #4806)
[Link]
Posted Jul 6, 2009 18:55 UTC (Mon)
by nix (subscriber, #2304)
[Link]
O_PONIES
I'm firmly in the O_PONIES fsync-is-horrible-renames-should-be-write-barriers camp, but if you want to expose a way for an application to tell whether it should fsync or not, the best way is to use the existing pathconf mechanism. Using that, an application can just ask the kernel, "Do I need to fsync this particular file?". It's much more elegant than trying to have applications puzzle it out from filesystem-type information.
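To show what the pathconf mechanism looks like from an application, here is a small Python sketch. No pathconf variable answering the fsync question is known to exist; the name PC_NEEDS_FSYNC in the usage note is purely hypothetical, and the helper query_pathconf is likewise an invention for this illustration.

```python
import os


def query_pathconf(path, name):
    """Return the pathconf() value of `name` for `path`, or None when the
    name is unknown on this platform or unsupported for this file."""
    if name not in os.pathconf_names:
        return None  # the platform does not define this variable
    try:
        return os.pathconf(path, name)
    except OSError:
        return None  # defined, but not supported for this particular file
```

A call such as query_pathconf("/tmp", "PC_NAME_MAX") asks about an existing per-file limit, while query_pathconf("/tmp", "PC_NEEDS_FSYNC") returns None because no such variable is defined. An application would treat that None as "unknown" and fall back to calling fsync().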
a 'radical change'. add 'data=ordered' to your mount lines, or set the
EXT3_DEFAULTS_TO_ORDERED config option, and bingo, exactly right back
where we were before, huge and unpredictably horrible latencies and all.
No. Much more happened than that.
Cite? That was the only change that was at all likely to affect
reliability that I can recall. (And yes, I read those threads.)
could get. Linus insisted on the data=writeback default after running
tests with all the other patches in place showing enormously variable and
sometimes terrible fsync() latencies from tiny file writes when there was
a large dd going on in the background. Switching to writeback eliminated
those latencies.
file a mere irrelevance.
a) Guarantee that meta-data changes occur after data changes. Thus the rename should only be committed to disk after the contents of the file have been committed to disk. This is not required by POSIX, but is what most other filesystems do anyway.
How about O_TSO_ME_HARDER
something to do with offloading TCP to network cards! ;}