> Maybe you missed this bit, but people are truncating the files _explicitly_ in the code and _not_
> committing subsequent changes. That's what's zeroing out the files, not the file system.
That's not the only scenario. There's the one involving rename... You open a *new* file, write to
it, close it, and rename it over an existing file. Then the filesystem commits the metadata change
(that is: the unlinking of the file with data in it, and the creation of the new empty file with the
same name), but *not* the data written to the new file.
No explicit truncation.
Now, there is also the scenario involving truncation. I expect everybody agrees that if you
truncate a file and then later overwrite it, there's a chance that the empty version of the file
might end up on disk. The thing that's undesirable about what ext4 was doing in *that* scenario
is that it essentially eagerly committed to disk the truncation, but lazily committed the new data.
Thus making it *much* more likely that you end up with the truncated file than you'd expect
given that the application called write() with the new data a few milliseconds after truncating the
file.
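For illustration, the truncation scenario can be sketched in Python, whose os module wraps the same POSIX calls. The function name overwrite_in_place is made up for this sketch. It only shows that the old contents are gone as soon as the truncating open() returns; it says nothing about what any particular filesystem commits to disk, or when.

```python
import os


def overwrite_in_place(path: str, data: bytes) -> int:
    """Rewrite `path` in place (hypothetical helper for illustration).

    Returns the file size observed right after the truncating open().
    """
    # O_TRUNC discards the old contents at open() time, before any new
    # data has been written.
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC | os.O_CREAT, 0o644)
    try:
        size_after_open = os.fstat(fd).st_size  # the old data is already gone
        view = memoryview(data)
        while view:
            # write() may be partial, so loop until everything is handed over.
            written = os.write(fd, view)
            view = view[written:]
        # No fsync here: the new data may sit in the page cache for a while.
    finally:
        os.close(fd)
    return size_after_open
```

Between the open() and the point where the written data actually reaches the disk, a crash can leave either version, or the empty file, behind.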
Posted Mar 14, 2009 8:22 UTC (Sat) by alexl (subscriber, #19068)
That comment is totally misguided. We do not want our data guaranteed to be on disk. Nobody said they wanted that. What we want is for the traditional unix way to save a file (write to tempfile, rename over target) to result in *either* the old file or the new file. Not a zero byte file, losing both the old and the new data.
fsync is just the only way to work around this in the POSIX API, but it is much heavier and gives many more guarantees than we want.
Where the the correctness go?
Posted Mar 14, 2009 8:30 UTC (Sat) by bojan (subscriber, #14302)
Please. When you open a new file (which is empty) and then write to it, but do not commit (by calling fsync), that file may not contain anything on disk for a while. Even after close, this can still be the case.
If you rename such a file over an existing file that contained data, you may legitimately end up with an empty file on crash.
If you want the data to be in the new file that will be renamed into the old file, you have to call fsync on the new file. Then you atomically rename.
What emacs (and very sophisticated, careful application writers) will do is this:
3.a) open and read file ~/.kde/foo/bar/baz
3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
3.d) fsync(fd) --- and check the error return from the fsync
3.e) close(fd)
3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
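As a rough sketch, steps 3.b to 3.g above map onto Python's os module almost one to one. The names careful_save and write_all are invented here, the ".new" and "~" suffixes follow the steps above, and the 0o644 mode is an assumption (the original open() call does not give one). Note that open() does not expand "~", so the path is assumed to be already expanded.

```python
import os


def write_all(fd: int, data: bytes) -> None:
    """Loop over os.write(), which may write fewer bytes than requested."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def careful_save(path: str, data: bytes, keep_backup: bool = False) -> None:
    """Hypothetical helper following steps 3.b to 3.g."""
    tmp = path + ".new"
    # 3.b: create (or truncate) the temporary file; the mode is an assumption.
    fd = os.open(tmp, os.O_WRONLY | os.O_TRUNC | os.O_CREAT, 0o644)
    try:
        write_all(fd, data)  # 3.c
        os.fsync(fd)         # 3.d: a failure raises OSError, so it is not ignored
    finally:
        os.close(fd)         # 3.e
    if keep_backup and os.path.exists(path):
        os.rename(path, path + "~")  # 3.f: optional backup
    os.rename(tmp, path)             # 3.g: rename over the target
```

If the write or the fsync fails, the exception propagates before either rename runs, so the existing file is left alone.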
Where the the correctness go?
Posted Mar 14, 2009 12:03 UTC (Sat) by alexl (subscriber, #19068)
The only way with the current POSIX APIs to get this guarantee is to fsync() the fd before renaming. But this imposes an unnecessary overhead on both the app (generally) and the whole system (with ext3 data=ordered).
Now, what ext4 does is clearly correct according to what is "allowed" by POSIX (actually, this is kinda vague as POSIX allows fsync() to be empty, and doesn't actually specify anything about system crashes.)
However, even if it's "posixly correct", it is imho broken. In the sense that I wouldn't let any of my data near such a filesystem, and I would recommend that everyone who asks me not use it.
Take for example this command:
sed -i s/user1/user2/ foo.conf
This does an in-place update using write-to-temp and rename-over, without fsync. The result of running this command is that if your machine locks up within a minute or so afterwards, you lose both versions of foo.conf.
Now, is foo.conf important? How the heck is sed to know? Is sed broken? Should it fsync? That's more or less arguing that every app should fsync on close, which on ext4 is the same as the filesystem doing it, but on ext3 is unnecessary and a massive system slowdown.
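To make the pattern concrete, here is a minimal Python sketch of write-to-temp and rename-over without fsync. It is not sed's actual code: the function name, the ".tmp" suffix and the literal (non-regex) matching are all assumptions made for illustration.

```python
import os


def substitute_in_place(path: str, old: bytes, new: bytes) -> None:
    """Hypothetical stand-in for `sed -i s/old/new/ path` (literal match only)."""
    with open(path, "rb") as src:
        lines = src.read().splitlines(keepends=True)
    # Like s/old/new/ without the g flag: only the first match on each line.
    updated = b"".join(line.replace(old, new, 1) for line in lines)
    tmp = path + ".tmp"  # hypothetical temporary name
    with open(tmp, "wb") as dst:
        dst.write(updated)
    # Deliberately no fsync: the new data may still be only in memory when
    # the rename below is issued, which is the window being discussed.
    os.rename(tmp, path)
```

Calling substitute_in_place("foo.conf", b"user1", b"user2") is always atomic as seen by other running processes; what survives a crash shortly afterwards is up to the filesystem.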
Or should we try to avoid the performance implications of fsync (due to its guarantees being far more than what we need to solve our requirements)? We could do this by punting this to the users of sed, by having a -important-data argument, and then pushing this further out to any script that uses sed, etc, etc.
Or we could just rely on filesystems to guarantee that this common behaviour works. Even if it's not specified by POSIX. (And choose not to use filesystems that don't give us that guarantee, like so many people who have switched from XFS after data losses).
Ideally of course there would be another syscall, flag or whatever that says "don't write metadata before data is written". That way we could get both efficient and correct apps, but that doesn't exist today.
Where the the correctness go?
Posted Mar 14, 2009 21:20 UTC (Sat) by bojan (subscriber, #14302)
> However, even if it's "posixly correct", it is imho broken.
Look, this may well be true, but the fact is that all of us who are creating applications have one thing to rely on - documentation. And the documentation says what it says.
Where the the correctness go?
Posted Mar 16, 2009 12:00 UTC (Mon) by nye (guest, #51576)
POSIX also allows a system crash to cause your computer to explode and hurl shrapnel into your face, because crash-behaviour is *undefined*. Are you seriously arguing that *any* POSIX-compliant behaviour is automatically the right thing? Clearly not, because you are arguing against one POSIX-compliant method in favour of another. There are an infinite number of ways to be POSIX-compliant, some of which are more useful than others.
Where the the correctness go?
Posted Mar 14, 2009 13:32 UTC (Sat) by nix (subscriber, #2304)
There's a window there where it can leave you with baz~ and baz.new, but
no baz, on crash.
Hardly ideal, but probably unavoidable.
Why doesn't someone add real DB-style transactions to at least one
filesystem, again? They'd be really useful...
Where the the correctness go?
Posted Mar 14, 2009 21:25 UTC (Sat) by bojan (subscriber, #14302)
> There's a window there where it can leave you with baz~ and baz.new, but no baz, on crash.
Yep, very true.
But, no zero length file, which was the original problem. Essentially, you will get at least _something_.
> Why doesn't someone add real DB-style transactions to at least one filesystem, again? They'd be really useful...
Who knows, maybe we'll get a proper API for that behaviour out of this discussion.
Where the the correctness go?
Posted Mar 14, 2009 21:39 UTC (Sat) by foom (subscriber, #14868)
> Essentially, you will get at least _something_.
There's no guarantee of that. A filesystem could simply erase itself upon unexpected
poweroff/crash. *Anything* better than that is an implementation detail.
Where the the correctness go?
Posted Mar 15, 2009 1:58 UTC (Sun) by bojan (subscriber, #14302)
I knew someone was going to have a silly comment here. I was expecting, however, that it was going to be more technical, along the lines of "see, you cannot rely on it after all". Just for the record, the first rename emacs does is optional (in order to get the backup file) and would not be done for configuration files, hence full atomicity and durability.
Where the the correctness go?
Posted Mar 15, 2009 2:42 UTC (Sun) by njs (guest, #40338)
> There's a window there where it can leave you with baz~ and baz.new, but
> no baz, on crash.
Yeah, 3.f is supposed to say "link", not "rename". (Programming against POSIX correctly makes those Raymond Smullyan books seem like light reading. If only everything else weren't worse...)
> Why doesn't someone add real DB-style transactions to at least one
> filesystem, again? They'd be really useful...
The problem is that a filesystem has a bazillion mostly independent "transactions" going on all the time, and no way to tell which ones are actually dependent on each other. (Besides, app developers would just screw up their rollback-and-retry logic anyway...)
Completely off the wall solution: move to a plan9/capability-ish system where apps all live in little micro-worlds and can only see the stuff that is important to them (this is a good idea anyway), and then use these to infer safe but efficient transaction domain boundaries. (Anyone looking for a PhD project...?)
Transactions, ordering, rollback...
Posted Mar 15, 2009 8:09 UTC (Sun) by Pc5Y9sbv (guest, #41328)
The entire point of transactions is to say "these operations are related to one another" by opening a transaction and performing multiple read/write actions which populate a data dependency map. Then the commit says that either the dependency map is satisfied and all writes are made, or no writes are made. Thus it would not be difficult to obtain the map from the application, but it is a huge expansion of scope for the filesystem abstraction.
However, as we were discussing further up the page, a write-barrier is really all that is needed for the intuitive crash-proof behavior desired by everything doing the "create a temp file; relink to real name". An awful lot of the discussion seems to conflate request ordering with synchronous disk operations, when all we really desire is ordering constraints to flow through the entire filesystem and block layer to the physical medium.
All people want is for the POSIX ordering semantics of "file content writes" preceding "file name linkage" to be preserved across crashes. It is OK if the crash drops cached data and forgets the link, or the data and link, but not the data while preserving the link.
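A toy model may make the ordering argument easier to see. It is not a filesystem, and every name in it is invented for this sketch. It treats "data" (the file content writes) and "link" (the file name linkage) as two operations issued in that order, and compares the on-disk states a crash can leave when ordering is preserved with the states possible when it is not.

```python
from itertools import combinations

# Issue order assumed by the model: contents first, then the name linkage.
OPS = ("data", "link")


def ordered_crash_states(ops):
    """With ordering preserved, only a prefix of the issued operations survives."""
    return [frozenset(ops[:i]) for i in range(len(ops) + 1)]


def unordered_crash_states(ops):
    """Without ordering constraints, any subset may have reached the disk."""
    return [frozenset(c) for r in range(len(ops) + 1)
            for c in combinations(ops, r)]


def is_acceptable(state):
    """The one bad outcome: the link survived but the data it names did not."""
    return not ("link" in state and "data" not in state)
```

In this model every ordered state passes is_acceptable(), while the unordered set also contains the state holding only "link", which corresponds to the zero-length file. Nothing in the model requires the writes to be synchronous, only ordered.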
Where the the correctness go?
Posted Mar 15, 2009 12:26 UTC (Sun) by nix (subscriber, #2304)
Even the off-the-wall solution won't work, because the reason for
transactions getting entangled with each other is dependencies *within the
fs metadata*. i.e. what you'd actually need to do is put off *all*
operations on fs metadata areas that may be shared with other transactions
until such time as the entire transaction is ready to commit. And that's a
huge change.
Where the the correctness go?
Posted Mar 16, 2009 3:13 UTC (Mon) by k8to (subscriber, #15413)
There's an easy way to avoid that problem.
link the name to name~
rename the name.new to name
Yes, explicit transaction support in the filesystem would be great, though hammering out the api will probably be hairy.
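The link-then-rename sequence above could look roughly like this in Python. The helper name is made up, the ".new" and "~" suffixes follow the earlier emacs steps, and the sketch assumes name.new has already been written (and fsynced, if durability matters to you).

```python
import os


def install_with_backup(path: str) -> None:
    """Hypothetical helper: keep the old contents as path~ without ever unlinking path."""
    backup = path + "~"
    try:
        os.unlink(backup)  # link() refuses to overwrite an existing name
    except FileNotFoundError:
        pass
    # The old contents now have two names; `path` itself never goes away.
    os.link(path, backup)
    # Repoint `path` at the new contents in a single rename.
    os.rename(path + ".new", path)
```

Because the backup is a second hard link rather than a rename, there is no moment at which only name~ and name.new exist. Hard links do require both names to live on the same filesystem.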
Where the the correctness go?
Posted Mar 14, 2009 9:19 UTC (Sat) by bojan (subscriber, #14302)
> What we want is for the traditional unix way to save a file (write to tempfile, rename over target) to result in *either* the old file or the new file.
These semantics (where data in the new file is magically committed) may or may not be a result of a particular file system implementation. From the rename() man page:
> If newpath already exists it will be atomically replaced (subject to a few conditions; see ERRORS below), so that there is no point at which another process attempting to access newpath will find it missing.
Nowhere does it specify what _data_ will be in either file, just that the file will be there. ext4 dutifully obeys that.
In short, what you are referring to as "traditional unix way" doesn't really exist. Proof: emacs code.
PS. Sure, it would be nice to have such a "one shot" API. But, the current API isn't it.
Where the the correctness go?
Posted Mar 14, 2009 14:30 UTC (Sat) by endecotp (guest, #36428)
> Nowhere does it specify what _data_ will be in either file,
> just that the file will be there.
No; POSIX requires that the effects of one process' actions, as observed by another process, occur in order. So if you do the write() before the rename(), it is guaranteed that the file will be there with the expected data in it.
Of course this is not true of crashes, where POSIX doesn't say anything at all about what should happen. Behaviour after a crash is purely a QoI (quality of implementation) issue.
Where the the correctness go?
Posted Mar 14, 2009 21:18 UTC (Sat) by bojan (subscriber, #14302)
> No; POSIX requires that the effects of one process' actions, as observed by another process, occur in order. So if you do the write() before the rename(), it is guaranteed that the file will be there with the expected data in it.
We are talking about data _on_ _disk_ here, not what the process may see (which may be buffers just written, as presented by the kernel). What is on disk is _durable_, which is what we are discussing here. For durable, you need fsync.
So, rename does not specify which data will be on disk, or when.
Where the the correctness go?
Posted Mar 15, 2009 12:44 UTC (Sun) by endecotp (guest, #36428)
But __NOTHING__ specifies what data you'll find left on the disk after a crash (and after a crash is the only time when the difference between "on disk" and "in memory buffers" makes any difference). fsync() does NOT guarantee durability - it can be a no-op.
So what this all boils down to is how close each filesystem implementation comes to "non-crash" behaviour after a crash, which is a quality-of-implementation choice for the filesystems.
As far as I can see, for portable code the best bet is to stick with the write-close-rename pattern. This is sufficient for atomic changes in the non-crash case. Adding fsync in there makes it safe in the crash case for some filesystems, but not all, and there are others where it was safe without it, and others where it has a performance penalty: it's far from a clear winner at the moment.
Where the the correctness go?
Posted Mar 15, 2009 21:24 UTC (Sun) by bojan (subscriber, #14302)
> fsync() does NOT guarantee durability - it can be a no-op.
Hence, you need to have various #ifs and ifs() to figure out what works on your platform. See Mac OS X. fsync is just an example here. The point is that you must use _something_ to commit. Without that, POSIX does not guarantee anything beyond currently running processes seeing the same picture.
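As a hedged example of that kind of platform check: as far as I know, the Mac OS X fsync man page says that fsync() does not force the drive to flush its own cache, and offers the F_FULLFSYNC fcntl for that purpose. A Python sketch of "use the strongest commit available" might look like this. The name commit_to_disk is invented, and falling back to plain fsync when F_FULLFSYNC fails is a design choice of this sketch, not something any standard asks for.

```python
import os

try:
    import fcntl  # not available on every platform
except ImportError:
    fcntl = None


def commit_to_disk(fd: int) -> str:
    """Hypothetical helper; returns the name of the mechanism that was used."""
    if fcntl is not None and hasattr(fcntl, "F_FULLFSYNC"):
        try:
            # Asks the drive itself to flush its cache, where supported.
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
            return "F_FULLFSYNC"
        except OSError:
            pass  # not supported for this file; fall through to fsync
    os.fsync(fd)
    return "fsync"
```

The hasattr() check plays the role of the #if: the constant is only defined where the platform provides it.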
Where the the correctness go?
Posted Mar 16, 2009 4:49 UTC (Mon) by dlang (✭ supporter ✭, #313)
Even doing an fsync doesn't mean that you won't have this corruption. The two writes could go to the disk drive's buffer, and the drive could write the metadata out before it writes the data blocks. If it loses power in between these two steps, you have the same problem.
Where the the correctness go?
Posted Mar 16, 2009 13:28 UTC (Mon) by jamesh (guest, #1159)
Of course, if the drive supports barriers in its command queueing implementation it should be possible to prevent it reordering those writes.
That is likely to restrict reorderings that won't break correctness guarantees though.
Where the the correctness go?
Posted Mar 16, 2009 3:19 UTC (Mon) by k8to (subscriber, #15413)
A no-op fsync is not compliant. You've taken it quite a bit too far.
fsync explicitly says that when it returns success, the data has been handed to the storage system successfully.
It doesn't guarantee that that storage system has committed it in a durable way for all scenarios. That's another issue.
fsync does guarantee that the data has been handed to the storage medium, but makes no guarantees about the implementation of that storage medium.
Where the the correctness go?
Posted Mar 16, 2009 1:07 UTC (Mon) by vonbrand (subscriber, #4458)
Sorry, "opening a file for writing it from scratch" is truncating, quite explicitly.