> What we want is for the traditional unix way to save a file (write to tempfile, rename over target) to *either* result in the old file or the new file.
These semantics (where data in the new file is magically committed) may or may not be a result of a particular file system implementation. From the rename() man page:
> If newpath already exists it will be atomically replaced (subject to a few conditions; see ERRORS below), so that there is no point at which another process attempting to access newpath will find it missing.
Nowhere does it specify what _data_ will be in either file, just that the file will be there. ext4 dutifully obeys that.
In short, what you are referring to as "traditional unix way" doesn't really exist. Proof: emacs code.
PS. Sure, it would be nice to have such "one shot" API. But, the current API isn't it.
Posted Mar 14, 2009 14:30 UTC (Sat) by endecotp (guest, #36428)
[Link]
> Nowhere does it specify what _data_ will be in either file,
> just that the file will be there.
No; POSIX requires that the effects of one process' actions, as observed by another process, occur in order. So if you do the write() before the rename(), it is guaranteed that the file will be there with the expected data in it.
Of course this is not true of crashes, where POSIX doesn't say anything at all about what should happen. Behaviour after a crash is purely a QoI issue.
Where the the correctness go?
Posted Mar 14, 2009 21:18 UTC (Sat) by bojan (subscriber, #14302)
[Link]
> No; POSIX requires that the effects of one process' actions, as observed by another process, occur in order. So if you do the write() before the rename(), it is guaranteed that the file will be there with the expected data in it.
We are talking about data _on_ _disk_ here, not what the process may see (which may be buffers just written, as presented by the kernel). What is on disk is _durable_, which is what we are discussing here. For durable, you need fsync.
So, rename does not specify which data will be on disk, or when.
Where the the correctness go?
Posted Mar 15, 2009 12:44 UTC (Sun) by endecotp (guest, #36428)
[Link]
But __NOTHING__ specifies what data you'll find left on the disk after a crash (and after a crash is the only time when the difference between "on disk" and "in memory buffers" makes any difference). fsync() does NOT guarantee durability - it can be a no-op.
So what this all boils down to is how close each filesystem implementation comes to "non-crash" behaviour after a crash, which is a quality-of-implementation choice for the filesystems.
As far as I can see, for portable code the best bet is to stick with the write-close-rename pattern. This is sufficient for atomic changes in the non-crash case. Adding fsync in there makes it safe in the crash case for some filesystems, but not all, and there are others where it was safe without it, and others where it has a performance penalty: it's far from a clear winner at the moment.
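For concreteness, here is a minimal sketch of the write-close-rename pattern described above, with the fsync step as an opt-in. The function name `save_atomically` and the `sync` parameter are made up for illustration, and it assumes a POSIX system where rename() replaces an existing target.

```python
import os
import tempfile


def save_atomically(path, data, sync=False):
    """Replace `path` with `data` using write-close-rename.

    Hypothetical helper for illustration. With sync=True an fsync is
    issued before the rename; whether that makes the data durable
    depends on the filesystem and the storage underneath it.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so that the rename
    # does not cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            if sync:
                tmp.flush()             # drain Python's own buffer first
                os.fsync(tmp.fileno())  # then ask the kernel to commit
        # The file is closed here; now rename it over the target.
        os.rename(tmp_path, path)
    except BaseException:
        # Do not leave the temp file behind if anything failed.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

Calling `save_atomically("config.txt", b"new contents")` gives the non-crash atomicity discussed here; passing `sync=True` adds the fsync whose crash-time benefit varies by filesystem. Note that mkstemp creates the file with mode 0600, so a real implementation would probably want to copy the old file's permissions.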
Where the the correctness go?
Posted Mar 15, 2009 21:24 UTC (Sun) by bojan (subscriber, #14302)
[Link]
> fsync() does NOT guarantee durability - it can be a no-op.
Hence, you need to have various #ifs and ifs() to figure out what works on your platform. See Mac OS X. fsync is just an example here. The point is that you must use _something_ to commit. Without that, POSIX does not guarantee anything beyond currently running processes seeing the same picture.
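As a rough sketch of that kind of per-platform branching: on Mac OS X, fsync() does not ask the drive to flush its cache, and the F_FULLFSYNC fcntl exists for that purpose. The helper name `commit_to_disk` and its return strings are invented for this example, and falling back to plain fsync is an assumption about what a caller would want.

```python
import fcntl
import os


def commit_to_disk(fd):
    """Commit `fd` with the strongest call this sketch knows about.

    Hypothetical helper: returns the name of the call that was used.
    """
    # F_FULLFSYNC is only defined on platforms that have it (Mac OS X);
    # elsewhere the attribute is simply absent from the fcntl module.
    full_sync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_sync is not None:
        try:
            fcntl.fcntl(fd, full_sync)
            return "F_FULLFSYNC"
        except OSError:
            pass  # this filesystem does not support it: fall back
    os.fsync(fd)
    return "fsync"
```

This is the runtime equivalent of the "ifs()"; in C the same decision would be made with `#ifdef F_FULLFSYNC`.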
Where the the correctness go?
Posted Mar 16, 2009 4:49 UTC (Mon) by dlang (✭ supporter ✭, #313)
[Link]
Even doing an fsync doesn't mean that you won't have this corruption. The two writes could go to the disk drive's buffer, and it could write the metadata out before it writes the data blocks. If it loses power in between these two steps, you have the same problem.
Where the the correctness go?
Posted Mar 16, 2009 13:28 UTC (Mon) by jamesh (guest, #1159)
[Link]
Of course, if the drive supports barriers in its command queueing implementation it should be possible to prevent it reordering those writes.
That is likely to also restrict reorderings that wouldn't break any correctness guarantees, though.
Where the the correctness go?
Posted Mar 16, 2009 3:19 UTC (Mon) by k8to (subscriber, #15413)
[Link]
A no-op fsync is not compliant. You've taken it quite a bit too far.
fsync explicitly says that when it returns success, the data has been handed to the storage system successfully.
It doesn't guarantee that that storage system has committed it in a durable way for all scenarios. That's another issue.
fsync does guarantee that the data has been handed to the storage medium, but makes no guarantees about the implementation of that storage medium.