
ordered(tm) brand

Posted Mar 15, 2009 20:10 UTC (Sun) by nybble41 (subscriber, #55106)
In reply to: ordered(tm) brand by szh
Parent article: Garrett: ext4, application expectations and power management

As best I can tell, the semantics *are* the same as ext3. The only difference is in the interval in which an unclean shutdown can result in application-level inconsistency. For ext3 that window was about five seconds; for ext4 it's significantly longer. In both cases data journaling (data=journal) should preserve the same order on disk as exists in RAM, though at a significant cost in performance.

The default setting, data=ordered, only guarantees that any given file's contents are committed to disk before that same file's metadata -- essentially just the file's size -- thus ensuring that the on-disk version of the file never contains uninitialized data. It doesn't make any guarantees, in ext3 or ext4, regarding the ordering of writes to separate files or directories. Similarly, the semantics of rename() are such that atomicity is guaranteed only with respect to the directory entry, not data or metadata associated with the files themselves.
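To make the rename() point concrete, here is a minimal sketch, assuming a POSIX system such as Linux and Python's standard `os` module. The function name `demo_rename_swaps_directory_entry` and the file names are hypothetical. It only illustrates behaviour on a running system (the name is switched to the new inode while an already-open handle still sees the old file); it says nothing about what survives a crash.

```python
import os
import tempfile


def demo_rename_swaps_directory_entry(workdir):
    """Show that rename() repoints a name at another inode (hypothetical helper)."""
    target = os.path.join(workdir, "a")
    replacement = os.path.join(workdir, "a.new")

    with open(target, "wb") as f:
        f.write(b"old contents")
    with open(replacement, "wb") as f:
        f.write(b"new contents")

    new_inode = os.stat(replacement).st_ino

    # Keep a handle on the old file so its inode stays alive after the rename.
    old_reader = open(target, "rb")
    try:
        # Atomically repoint the name 'a' at the inode that was 'a.new'.
        os.rename(replacement, target)

        with open(target, "rb") as fresh:
            fresh_data = fresh.read()

        return {
            "name_points_to_new_inode": os.stat(target).st_ino == new_inode,
            "old_handle_reads": old_reader.read(),
            "fresh_open_reads": fresh_data,
            "tmp_name_gone": not os.path.exists(replacement),
        }
    finally:
        old_reader.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(demo_rename_swaps_directory_entry(d))
```

The rename touches only the directory entry; the data blocks of either file are not involved in the operation itself, which is why the ordering between the data and the rename is a separate question.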


Semantic is NOT the same

Posted Mar 15, 2009 20:56 UTC (Sun) by khim (subscriber, #9252) [Link]

From the user's POV the semantics are wildly different: with ext3 I'm guaranteed my P2P state will not be corrupt - no five-second window. With ext4 I'm almost guaranteed it'll be destroyed in a crash. Big difference.

Granted - it looked like "more-or-less" the same mode from the filesystem developer's POV, but that's no excuse... Maybe it's an accident, maybe not - but the semantics of the ext3 and ext4 ordered cases were quite different...

ordered(tm) brand

Posted Mar 16, 2009 12:35 UTC (Mon) by nye (guest, #51576) [Link] (7 responses)

>The default setting, data=ordered, only guarantees that any given file's contents are committed to disk before that same file's metadata

That's true for ext3 but not for ext4, which is the point of the discussion.

ordered(tm) brand

Posted Mar 16, 2009 15:24 UTC (Mon) by nybble41 (subscriber, #55106) [Link] (6 responses)

Could you give an example of a case where it isn't true for ext4? The point of the discussion seems to me to be that some application developers assumed that data=ordered implied full on-disk ordering of filesystem operations, when in fact it only guarantees limited ordering within individual inodes (ensuring that any given file's contents on disk are never uninitialized).

Will ext4 increase a file's size before committing its contents to disk? If so, that would be a separate issue independent of the write()/rename() ordering issue discussed thus far. If not, then the guarantee is the same as with ext3.

ordered(tm) brand

Posted Mar 16, 2009 17:19 UTC (Mon) by nye (guest, #51576) [Link] (5 responses)

I think we have a semantic misunderstanding.

From your statement that the semantics for data=ordered were the same between ext3 and ext4, I assumed that you were using the word 'guarantee' to describe the known outcome of a particular operation, but I now realise that you meant the word more literally, as in 'this is the behaviour of the filesystem, and *this* is the subset of that behaviour which we guarantee will always hold true'. Since the actual behaviour was documented and predictable, and differs from ext4, I disagree that the semantics of the setting are the same, but concede that the actual guarantees made are.

Just so we can be clear what's happening, I'm going to go off on a tangent and try to work this through (assuming the no-fsync rename case) as I understand it. This isn't really directed at the parent poster:

The case in question is that you create a file (a.new). It gets an inode describing a length of zero, and a directory entry linking that name to that inode. Is this true? Does it get that inode and directory entry? I'm not sure what steps are skipped in the case of delayed allocation, but both of these things appear to happen given that a zero-length file is indeed created upon a crash. Call this point A.

You write data to that file, and here the delayed allocation comes into play. The cached copy of the inode is updated. Is that true? I don't know if the in-memory representation corresponds to the on-disk representation. Clearly there can't be any real block pointers as the allocation hasn't happened yet.

Then 'a.new' is renamed to 'a', effectively unlinking the existing 'a' and 'a.new' and creating a new 'a' pointing to an inode previously known as 'a.new'. Call this point B.

At some point the inodes and directory entries are written, but because allocation is delayed, you are saving the version of the inode with size zero and no block pointers. Call this point C.

At this point you pray that there isn't a power cut.

At some point in the future the allocation is committed, and the inode for the new 'a' is updated to reflect this. Call this point D.

The question is, why is data created at points A and B committed to disk at (or by) point C given that it is already known, with certainty, that that data is useless until after point D? It appears that this will no longer happen in this particular case come 2.6.30, but is this not a specific case of some more general behaviour? Why commit the inode and directory entry for a file whose allocation hasn't happened yet?
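From the application's side, the sequence walked through above can be sketched as follows. This is a minimal illustration, assuming a POSIX system and Python's standard `os` module; the function name `replace_without_fsync` is hypothetical. Points C and D happen later inside the kernel and are not visible from this code.

```python
import os
import tempfile


def replace_without_fsync(path, data):
    """Write data to path + '.new', then rename it over path, with no fsync."""
    tmp = path + ".new"

    # Point A: create 'a.new' (a new inode of length zero plus a directory entry).
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # The data lands in the page cache; with delayed allocation the
        # filesystem may not have allocated any disk blocks for it yet.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)

    # Point B: the name 'a' now refers to the inode previously known as 'a.new'.
    os.rename(tmp, path)
    # Nothing here tells the kernel that the rename is only useful once the
    # data above has reached the disk.


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "a")
        replace_without_fsync(target, b"first version\n")
        replace_without_fsync(target, b"second version\n")
        with open(target, "rb") as f:
            print(f.read())
```

On a running system this always leaves 'a' with complete contents; the window described above only matters if the machine goes down between points C and D.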

ordered(tm) brand

Posted Mar 16, 2009 18:06 UTC (Mon) by nybble41 (subscriber, #55106) [Link] (4 responses)

"The question is, why is data created at points A and B committed to disk at (or by) point C given that it is already known, with certainty, that that data is useless until after point D?"

The thing is, this *isn't* known at the filesystem level. The application knows that the rename() is useless until the write() has been committed, but there is no API to communicate this information to the filesystem. Perhaps there should be, but the lack of appropriately fine-grained userspace APIs is not the fault of the filesystem authors. All existing filesystems, ext3 included, assume that the rename() and write() operations are independent; the cases where the ordering happens to be correct from the application's point-of-view are purely accidental.

The more general issue is that application writers are depending on filesystems to provide full data journaling, which is a major performance killer and was never actually guaranteed. Metadata journaling, as used by ext3 and ext4 by default, is only a replacement for the fsck process; as with all asynchronous, non-journaled filesystems, the state of the recovered filesystem after fsck or journal playback will be internally consistent, but may not match any state which actually existed in RAM before the crash.

ordered(tm) brand

Posted Mar 17, 2009 11:10 UTC (Tue) by nye (guest, #51576) [Link]

I really think it is known by the filesystem, because it knows that the file has unallocated data at the time that it makes that write. However, Ted has answered the question here: http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsyn...

Guarantees and the belt-and-braces of journaling

Posted Mar 18, 2009 0:16 UTC (Wed) by xoddam (subscriber, #2322) [Link] (2 responses)

The use-case of writing a new file and renaming it to replace an existing one is documented and recommended by expert practitioners[citation needed] as the best, nay the only, way to achieve atomic operations on a POSIX system.

Specifically, whenever a rename replaces an old file with a recently-written one, the application developer's intention is to achieve atomic replacement of the file's contents. Invariably. No exceptions. Even if you "kill -9 1", this will not cause corruption or truncation of the target file.

POSIX *does* guarantee this atomicity, with precisely *one* exception -- if the system crashes, behaviour is undefined.

A journaling filesystem exists for one reason only: to provide reasonable behaviour in the event of a system crash, i.e. to extend the guarantees POSIX provides and reduce the need to recover data.

In a couple of instances, users have observed that particular up-and-coming journaling filesystems make it more (not less!) likely for them to need to recover files than the status quo. It is only sane that they should report this as a bug. It has *nothing* to do with application developers, who are using the recommended pattern and generally don't have much influence over what happens when their users' computers crash.

It is wonderful news then, that the developers of both filesystems (to my knowledge) that did exhibit such behaviour have listened to the requests of their users and let the journal extend the POSIX guarantee of atomic replacement on rename across system failures.

Hurrah and thank you.

Discussion of fsync is a complete red herring. On older POSIX-conforming filesystems there is NO GUARANTEE AT ALL that the filesystem will be accessible after a system crash, fsync or no. On some implementations, fsync can indeed make this particular kind of data loss less likely (and application developers in-the-know have used it for this purpose). There is still no POSIXLY_CORRECT guarantee that data will not be lost, so for a filesystem developer to say that his users don't really deserve to benefit from the safety that journaling can afford until application developers have jumped through an extra latency-imposing hoop is a bit rude. Not to say, putting the cart before the horse.

By the way, Ted Ts'o is 90% correct to say application developers shouldn't fear fsync, and 100% wrong to say that fsync is the correct way to achieve atomic replacement with rename. Rename alone is supposed to achieve this; if a filesystem is technically capable of preserving this guarantee even across system failures then it should do so.
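For reference, the fsync-before-rename variant under discussion might look like the sketch below. It assumes a POSIX system (Linux in particular) and Python's standard `os` module; the function name `replace_with_fsync` is hypothetical. The final fsync on the containing directory is an extra step that is not mentioned in this thread, included here on the assumption that the caller also wants the rename itself pushed to disk.

```python
import os
import tempfile


def replace_with_fsync(path, data):
    """Write data to path + '.new', fsync it, then rename it over path."""
    tmp = path + ".new"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data to stable storage before
        # the name is switched over. This is the latency-imposing step.
        os.fsync(fd)
    finally:
        os.close(fd)

    os.rename(tmp, path)

    # Assumption (not from the thread): also fsync the directory so that the
    # updated directory entry is flushed. Works on Linux; other platforms may
    # not allow opening a directory this way.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "a")
        replace_with_fsync(target, b"first version\n")
        replace_with_fsync(target, b"second version\n")
        with open(target, "rb") as f:
            print(f.read())
```

Whether an application should have to do this at all is exactly what is in dispute here; the sketch only shows what the extra hoop costs in code.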

Guarantees and the belt-and-braces of journaling

Posted Mar 20, 2009 13:28 UTC (Fri) by regala (guest, #15745) [Link] (1 responses)

Journaling is not here to preserve data, but to preserve integrity. You would be pleased if the data were on disk but the filesystem got broken beyond repair...
People need to know why journaling was introduced, and clearly it is not here to preserve your little settings you got smashed because you wanted to play World of Goo. Get serious.

Guarantees and the belt-and-braces of journaling

Posted Mar 20, 2009 15:41 UTC (Fri) by foom (subscriber, #14868) [Link]

> People need to know why journaling was introduced, and clearly it is not here to preserve
> your little settings you got smashed because you wanted to play World of Goo. Get serious.

If journaling is not for that, then I want something which is! And yes, I do want to play World of Goo!

Why should I give a damn about the filesystem structure except as a prerequisite to being able to
get to my files. I want my files, and that means file *content*. So I want a system which does a
reasonably reliable job of ensuring that content doesn't disappear. Ext3 is such a system. Maybe it
was unintentional at the time it was designed, but now that it's recognized as a good idea, let's
*keep* making systems that work as well as it!

