That's true for ext3 but not for ext4, which is the point of the discussion.
Posted Mar 16, 2009 15:24 UTC (Mon) by nybble41 (subscriber, #55106)
Will ext4 increase a file's size before committing its contents to disk? If so, that would be a separate issue independent of the write()/rename() ordering issue discussed thus far. If not, then the guarantee is the same as with ext3.
Posted Mar 16, 2009 17:19 UTC (Mon) by nye (guest, #51576)
From your statement that the semantics for data=ordered were the same between ext3 and ext4, I assumed that you were using the word 'guarantee' to describe the known outcome of a particular operation, but I now realise that you meant the word more literally, as in 'this is the behaviour of the filesystem, and *this* is the subset of that behaviour which we guarantee will always hold true'. Since the actual behaviour was documented and predictable, and differs from ext4, I disagree that the semantics of the setting are the same, but concede that the actual guarantees made are.
Just so we can be clear what's happening, I'm going to go off on a tangent and try to work this through (assuming the no-fsync rename case) as I understand it. This isn't really directed at the parent poster:
The case in question is that you create a file (a.new). It gets an inode describing a length of zero, and a directory entry linking that name to that inode. Is this true? Does it get that inode and directory entry? I'm not sure what steps are skipped in the case of delayed allocation, but both of these things appear to happen given that a zero-length file is indeed created upon a crash. Call this point A.
You write data to that file, and here the delayed allocation comes into play. The cached copy of the inode is updated. Is that true? I don't know if the in-memory representation corresponds to the on-disk representation. Clearly there can't be any real block pointers as the allocation hasn't happened yet.
Then 'a.new' is renamed to 'a', effectively unlinking the existing 'a' and 'a.new' and creating a new 'a' pointing to an inode previously known as 'a.new'. Call this point B.
At some point the inodes and directory entries are written, but because allocation is delayed, you are saving the version of the inode with size zero and no block pointers. Call this point C.
At this point you pray that there isn't a power cut.
At some point in the future the allocation is committed, and the inode for the new 'a' is updated to reflect this. Call this point D.
The question is, why is data created at points A and B committed to disk at (or by) point C given that it is already known, with certainty, that that data is useless until after point D? It appears that this will no longer happen in this particular case come 2.6.30, but is this not a specific case of some more general behaviour? Why commit the inode and directory entry for a file whose allocation hasn't happened yet?
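For concreteness, a minimal C sketch of the application-side sequence described above (the file names "a" and "a.new" are just the ones from the example); points A and B mark where the corresponding system calls are issued, while points C and D happen later inside the filesystem:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *data = "new contents\n";

    /* Point A: "a.new" gets a zero-length inode and a directory entry. */
    int fd = open("a.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* The write only updates the cached copy of the inode; with delayed
     * allocation no on-disk blocks have been assigned yet. */
    if (write(fd, data, strlen(data)) < 0) { perror("write"); return EXIT_FAILURE; }
    if (close(fd) < 0) { perror("close"); return EXIT_FAILURE; }

    /* Point B: atomically replace "a" with "a.new". No fsync() has been
     * issued anywhere, which is the case being discussed. */
    if (rename("a.new", "a") < 0) { perror("rename"); return EXIT_FAILURE; }

    /* Points C and D (the metadata commit, then the delayed block
     * allocation) happen asynchronously, after this program has exited. */
    return EXIT_SUCCESS;
}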
Posted Mar 16, 2009 18:06 UTC (Mon) by nybble41 (subscriber, #55106)
The thing is, this *isn't* known at the filesystem level. The application knows that the rename() is useless until the write() has been committed, but there is no API to communicate this information to the filesystem. Perhaps there should be, but the lack of appropriately fine-grained userspace APIs is not the fault of the filesystem authors. All existing filesystems, ext3 included, assume that the rename() and write() operations are independent; the cases where the ordering happens to be correct from the application's point-of-view are purely accidental.
The more general issue is that application writers are depending on filesystems to provide full data journaling, which is a major performance killer and was never actually guaranteed. Metadata journaling, as used by ext3 and ext4 by default, is only a replacement for the fsck process; as with all asynchronous, non-journaled filesystems, the state of the recovered filesystem after fsck or journal playback will be internally consistent, but may not match any state which actually existed in RAM before the crash.
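As a rough illustration, the closest an application can currently come to expressing that dependency is to fsync() the new file before the rename(), forcing the data to disk before the directory change. A minimal sketch (the helper name and the "a"/"a.new" file names are just made up for the example):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write new contents under a temporary name, force
 * them to disk, and only then rename over the old file. */
static int replace_file(const char *tmp, const char *final,
                        const char *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* fsync() is the only available ordering hint: the data (and the new
     * file size) must reach disk before we proceed to the rename. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;

    /* Only now is the old file replaced; a crash before this point leaves
     * the old contents intact, a crash after it leaves the new contents. */
    return rename(tmp, final);
}

int main(void)
{
    const char *data = "new contents\n";

    if (replace_file("a.new", "a", data, strlen(data)) < 0) {
        perror("replace_file");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}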
Posted Mar 17, 2009 11:10 UTC (Tue) by nye (guest, #51576)
Guarantees and the belt-and-braces of journaling
Posted Mar 18, 2009 0:16 UTC (Wed) by xoddam (subscriber, #2322)
Specifically, whenever a rename replaces an old file with a recently-written one, the application developer's intention is to achieve atomic replacement of the file's contents. Invariably. No exceptions. Even if you "kill -9 1", this will not cause corruption or truncation of the target file.
POSIX *does* guarantee this atomicity, with precisely *one* exception -- if the system crashes, behaviour is undefined.
A journaling filesystem exists for one reason only: to provide reasonable behaviour in the event of a system crash, i.e. to extend the guarantees POSIX provides and reduce the need to recover data.
In a couple of instances, users have observed that particular up-and-coming journaling filesystems make it more (not less!) likely for them to need to recover files than the status quo. It is only sane that they should report this as a bug. It has *nothing* to do with application developers, who are using the recommended pattern and generally don't have much influence over what happens when their users' computers crash.
It is wonderful news, then, that the developers of both filesystems (to my knowledge) that did exhibit such behaviour have listened to the requests of their users and let the journal extend the POSIX guarantee of atomic replacement on rename across system failures.
Hurrah and thank you.
Discussion of fsync is a complete red herring. On older POSIX-conforming filesystems there is NO GUARANTEE AT ALL that the filesystem will be accessible after a system crash, fsync or no. On some implementations, fsync can indeed make this particular kind of data loss less likely (and application developers in-the-know have used it for this purpose). There is still no POSIXLY_CORRECT guarantee that data will not be lost, so for a filesystem developer to say that his users don't really deserve to benefit from the safety that journaling can afford until application developers have jumped through an extra latency-imposing hoop is a bit rude. Not to mention putting the cart before the horse.
By the way, Ted Ts'o is 90% correct to say application developers shouldn't fear fsync, and 100% wrong to say that fsync is the correct way to achieve atomic replacement with rename. Rename alone is supposed to achieve this; if a filesystem is technically capable of preserving this guarantee even across system failures then it should do so.
Posted Mar 20, 2009 13:28 UTC (Fri) by regala (subscriber, #15745)
Posted Mar 20, 2009 15:41 UTC (Fri) by foom (subscriber, #14868)
If journaling is not for that, then I want something which is! And yes, I do want to play World of Goo!
Why should I give a damn about the filesystem structure except as a prerequisite to being able to get to my files? I want my files, and that means file *content*. So I want a system which does a reasonably reliable job of ensuring that content doesn't disappear. Ext3 is such a system. Maybe it was unintentional at the time it was designed, but now that it's recognized as a good idea, let's *keep* making systems that work as well as it does!