Maybe you missed this bit, but people are truncating the files _explicitly_ in the code and _not_ committing subsequent changes. That's what's zeroing out the files, not the file system.
Where did the correctness go?
Posted Mar 14, 2009 4:14 UTC (Sat) by foom (subscriber, #14868)
That's not the only scenario. There's the one involving rename... You open a *new* file, write to
it, close it, and rename it over an existing file. Then the filesystem commits the metadata change
(that is: the unlinking of the file with data in it, and the creation of the new empty file with the
same name), but *not* the data written to the new file.
No explicit truncation.
Now, there is also the scenario involving truncation. I expect everybody agrees that if you
truncate a file and then later overwrite it, there's a chance that the empty version of the file
might end up on disk. The thing that's undesirable about what ext4 was doing in *that* scenario
is that it essentially eagerly committed to disk the truncation, but lazily committed the new data.
Thus making it *much* more likely that you end up with the truncated file than you'd expect
given that the application called write() with the new data a few milliseconds after truncating the file.
Posted Mar 14, 2009 6:23 UTC (Sat) by bojan (subscriber, #14302)
Posted Mar 14, 2009 8:22 UTC (Sat) by alexl (subscriber, #19068)
fsync is just the only way to work around this in the POSIX API, but it is much heavier and gives much stronger guarantees than we want.
Posted Mar 14, 2009 8:30 UTC (Sat) by bojan (subscriber, #14302)
If you rename such a file over an existing file that contained data, you may legitimately end up with an empty file on crash.
If you want the data to be in the new file that will be renamed into the old file, you have to call fsync on the new file. Then you atomically rename.
This is what emacs and other safe programs already do. From https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug...
What emacs (and very sophisticated, careful application writers) will do is this:
3.a) open and read file ~/.kde/foo/bar/baz
3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
3.d) fsync(fd) --- and check the error return from the fsync
3.e) close(fd)
3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
Posted Mar 14, 2009 12:03 UTC (Sat) by alexl (subscriber, #19068)
Now, what ext4 does is clearly correct according to what is "allowed" by POSIX (actually, this is kinda vague as POSIX allows fsync() to be empty, and doesn't actually specify anything about system crashes.)
However, even if it's "posixly correct", it is IMHO broken, in the sense that I wouldn't let any of my data near such a filesystem, and I would recommend everyone who asks me not to use it.
Take for example this command:
sed -i s/user1/user2/ foo.conf
This does an in-place update using write-to-temp-and-rename, without fsync. The result of running this command is that if your machine locks up within the next minute or so, you lose both versions of foo.conf.
Now, is foo.conf important? How the heck is sed to know? Is sed broken? Should it fsync? That's more or less arguing that every app should fsync on close, which on ext4 is the same as the filesystem doing it, but on ext3 is unnecessary and a massive system slowdown.
Or should we try to avoid the performance implications of fsync (due to its guarantees being far more than what we need to solve our requirements)? We could do this by punting this to the users of sed, by having a -important-data argument, and then pushing this further out to any script that uses sed, etc, etc.
Or we could just rely on filesystems to guarantee this common behaviour, even if it's not specified by POSIX. (And choose not to use filesystems that don't give us that guarantee, just as so many people switched away from XFS after data losses.)
Ideally of course there would be another syscall, flag or whatever that says "don't write metadata before data is written". That way we could get both efficient and correct apps, but that doesn't exist today.
Posted Mar 14, 2009 21:20 UTC (Sat) by bojan (subscriber, #14302)
Look, this may well be true, but the fact is that all of us who are creating applications have one thing to rely on: documentation. And the documentation says what it says.
Posted Mar 16, 2009 12:00 UTC (Mon) by nye (guest, #51576)
Posted Mar 14, 2009 13:32 UTC (Sat) by nix (subscriber, #2304)
Hardly ideal, but probably unavoidable.
Why doesn't someone add real DB-style transactions to at least one
filesystem, again? They'd be really useful...
Posted Mar 14, 2009 21:25 UTC (Sat) by bojan (subscriber, #14302)
Yep, very true.
But no zero-length file, which was the original problem. Essentially, you will get at least _something_.
> Why doesn't someone add real DB-style transactions to at least one filesystem, again? They'd be really useful...
Who knows, maybe we'll get proper API for that behaviour out of this discussion.
Posted Mar 14, 2009 21:39 UTC (Sat) by foom (subscriber, #14868)
There's no guarantee of that. A filesystem could simply erase itself upon unexpected
poweroff/crash. *Anything* better than that is an implementation detail.
Posted Mar 15, 2009 1:58 UTC (Sun) by bojan (subscriber, #14302)
Posted Mar 15, 2009 2:42 UTC (Sun) by njs (guest, #40338)
Yeah, 3.f is supposed to say "link", not "rename". (Programming against POSIX correctly makes those Raymond Smullyan books seem like light reading. If only everything else weren't worse...)
> Why doesn't someone add real DB-style transactions to at least one
> filesystem, again? They'd be really useful...
The problem is that a filesystem has a bazillion mostly independent "transactions" going on all the time, and no way to tell which ones are actually dependent on each other. (Besides, app developers would just screw up their rollback-and-retry logic anyway...)
Completely off the wall solution: move to a plan9/capability-ish system where apps all live in little micro-worlds and can only see the stuff that is important to them (this is a good idea anyway), and then use these to infer safe but efficient transaction domain boundaries. (Anyone looking for a PhD project...?)
Transactions, ordering, rollback...
Posted Mar 15, 2009 8:09 UTC (Sun) by Pc5Y9sbv (guest, #41328)
However, as we were discussing further up the page, a write-barrier is really all that is needed for the intuitive crash-proof behavior desired by everything doing the "create a temp file; relink to real name". An awful lot of the discussion seems to conflate request ordering with synchronous disk operations, when all we really desire is ordering constraints to flow through the entire filesystem and block layer to the physical medium.
All people want is for the POSIX ordering semantics of "file content writes" preceding "file name linkage" to be preserved across crashes. It is OK if the crash drops cached data and forgets the link, or the data and link, but not the data while preserving the link.
Posted Mar 15, 2009 12:26 UTC (Sun) by nix (subscriber, #2304)
Posted Mar 16, 2009 3:13 UTC (Mon) by k8to (subscriber, #15413)
link the name to name~
rename the name.new to name
Yes, explicit transaction support in the filesystem would be great, though hammering out the api will probably be hairy.
Posted Mar 14, 2009 9:19 UTC (Sat) by bojan (subscriber, #14302)
This semantics (where data in the new file is magically committed) may or may not be a result of particular file system implementation. From the rename() man page:
> If newpath already exists it will be atomically replaced (subject to a few conditions; see ERRORS below), so that there is no point at which another process attempting to access newpath will find it missing.
Nowhere does it specify what _data_ will be in either file, just that the file will be there. ext4 dutifully obeys that.
In short, what you are referring to as "traditional unix way" doesn't really exist. Proof: emacs code.
PS. Sure, it would be nice to have such "one shot" API. But, the current API isn't it.
Posted Mar 14, 2009 14:30 UTC (Sat) by endecotp (guest, #36428)
No; POSIX requires that the effects of one process' actions, as observed by another process, occur in order. So if you do the write() before the rename(), it is guaranteed that the file will be there with the expected data in it.
Of course this is not true of crashes where POSIX doesn't say anything at all about what should happen. Behaviour after a crash is purely a QoI issue.
Posted Mar 14, 2009 21:18 UTC (Sat) by bojan (subscriber, #14302)
We are talking about data _on_ _disk_ here, not what the process may see (which may be buffers just written, as presented by the kernel). What is on disk is _durable_, which is what we are discussing here. For durable, you need fsync.
So, rename does not specify which data will be on disk at any given moment.
Posted Mar 15, 2009 12:44 UTC (Sun) by endecotp (guest, #36428)
So what this all boils down to is how close each filesystem implementation comes to "non-crash" behaviour after a crash, which is a quality-of-implementation choice for the filesystems.
As far as I can see, for portable code the best bet is to stick with the write-close-rename pattern. This is sufficient for atomic changes in the non-crash case. Adding fsync in there makes it safe in the crash case for some filesystems, but not all, and there are others where it was safe without it, and others where it has a performance penalty: it's far from a clear winner at the moment.
Posted Mar 15, 2009 21:24 UTC (Sun) by bojan (subscriber, #14302)
Hence, you need to have various #ifs and ifs() to figure out what works on your platform. See Mac OS X. fsync is just an example here. The point is that you must use _something_ to commit. Without that, POSIX does not guarantee anything beyond currently running processes seeing the same picture.
Posted Mar 16, 2009 4:49 UTC (Mon) by dlang (✭ supporter ✭, #313)
Posted Mar 16, 2009 13:28 UTC (Mon) by jamesh (guest, #1159)
That is likely to restrict reorderings that wouldn't have broken any correctness guarantees, though.
Posted Mar 16, 2009 3:19 UTC (Mon) by k8to (subscriber, #15413)
fsync explicitly says that when it returns success, the data has been handed to the storage system successfully.
It doesn't guarantee that that storage system has committed it in a durable way for all scenarios. That's another issue.
fsync does guarantee that the data has been handed to the storage medium, but makes no guarantees about the implementation of that storage medium.
Posted Mar 16, 2009 1:07 UTC (Mon) by vonbrand (subscriber, #4458)
Sorry, "opening a file for writing it from scratch" is truncating, quite explicitly.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds