ordered(tm) brand
Posted Mar 16, 2009 17:19 UTC (Mon) by nye (guest, #51576)
In reply to: ordered(tm) brand by nybble41
Parent article: Garrett: ext4, application expectations and power management
From your statement that the semantics for data=ordered were the same between ext3 and ext4, I assumed that you were using the word 'guarantee' to describe the known outcome of a particular operation, but I now realise that you meant the word more literally, as in 'this is the behaviour of the filesystem, and *this* is the subset of that behaviour which we guarantee will always hold true'. Since the actual behaviour was documented and predictable, and differs from ext4, I disagree that the semantics of the setting are the same, but concede that the actual guarantees made are.
Just so we can be clear about what's happening, I'm going to go off on a tangent and try to work this through (assuming the no-fsync rename case) as I understand it. This isn't really directed at the parent poster:
The case in question is that you create a file (a.new). It gets an inode describing a length of zero, and a directory entry linking that name to that inode. Is this true? Does it get that inode and directory entry? I'm not sure what steps are skipped in the case of delayed allocation, but both of these things appear to happen, given that a zero-length file is indeed created upon a crash. Call this point A.
You write data to that file, and here the delayed allocation comes into play. The cached copy of the inode is updated. Is that true? I don't know if the in-memory representation corresponds to the on-disk representation. Clearly there can't be any real block pointers as the allocation hasn't happened yet.
Then 'a.new' is renamed to 'a', effectively unlinking the existing 'a' and 'a.new' and creating a new 'a' pointing to an inode previously known as 'a.new'. Call this point B.
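For concreteness, here is a minimal sketch of the application side of points A through B, in Python. The function name and the '.new' suffix handling are my own for illustration; the only real calls are open(), write() and os.rename(). Note that there is deliberately no fsync anywhere, since that is the case being discussed.

```python
import os


def replace_without_fsync(directory, name, data):
    """Write `data` to '<name>.new', then rename it over '<name>'.

    Hypothetical helper for illustration only: it never calls fsync,
    so nothing here forces the new contents to disk before the rename.
    """
    final_path = os.path.join(directory, name)
    temp_path = final_path + ".new"

    # Point A: create 'a.new' (or truncate a stale one); it starts at length zero.
    with open(temp_path, "wb") as f:
        # The data goes into the page cache. With delayed allocation the
        # filesystem does not have to pick disk blocks for it yet.
        f.write(data)
        # No f.flush() + os.fsync(f.fileno()) here, on purpose.

    # Point B: rename 'a.new' over 'a', replacing the old file's name.
    os.rename(temp_path, final_path)
    return final_path
```

The fsync variant of the same pattern would add `f.flush()` followed by `os.fsync(f.fileno())` just before the `with` block ends; everything after point B is up to the filesystem.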
At some point the inodes and directory entries are written, but because allocation is delayed, you are saving the version of the inode with size zero and no block pointers. Call this point C.
At this point you pray that there isn't a power cut.
At some point in the future the allocation is committed, and the inode for the new 'a' is updated to reflect this. Call this point D.
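To make the window between C and D explicit, here is a toy model of what a reader would find under the name 'a' after a crash at each point. This is purely my own simplification and not how ext4 is implemented: ToyDisk, its two dictionaries and the inode numbers are all hypothetical, and I am assuming that nothing from points A and B reaches the disk until point C.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDisk:
    """Hypothetical on-disk state: only what would survive a crash."""
    names: dict = field(default_factory=dict)  # directory entries: name -> inode number
    sizes: dict = field(default_factory=dict)  # inode number -> size recorded on disk


def visible_after_crash(disk, name):
    """Size a reader would see for `name` after a crash, or None if absent."""
    ino = disk.names.get(name)
    return None if ino is None else disk.sizes.get(ino, 0)


def timeline(old_size, new_size):
    """Yield (point, size of 'a' after a crash at that point)."""
    disk = ToyDisk(names={"a": 1}, sizes={1: old_size})

    # Points A and B: assumed to exist only in memory, so the disk is unchanged.
    yield "A", visible_after_crash(disk, "a")
    yield "B", visible_after_crash(disk, "a")

    # Point C: metadata is committed. 'a' now names the new inode, which
    # still records size zero because allocation has not happened yet.
    disk.names["a"] = 2
    disk.sizes[2] = 0
    del disk.sizes[1]
    yield "C", visible_after_crash(disk, "a")

    # Point D: allocation is committed and the inode records the real size.
    disk.sizes[2] = new_size
    yield "D", visible_after_crash(disk, "a")
```

With an old file of 100 bytes and a new one of 120 bytes, `list(timeline(100, 120))` gives `[("A", 100), ("B", 100), ("C", 0), ("D", 120)]`: the only bad outcome is a crash between C and D, where 'a' exists but is empty.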
The question is, why is data created at points A and B committed to disk at (or by) point C given that it is already known, with certainty, that that data is useless until after point D? It appears that this will no longer happen in this particular case come 2.6.30, but is this not a specific case of some more general behaviour? Why commit the inode and directory entry for a file whose allocation hasn't happened yet?
