
Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 15:59 UTC (Sat) by Pc5Y9sbv (guest, #41328)
In reply to: Ts'o: Delayed allocation and the zero-length file problem by dcoutts
Parent article: Ts'o: Delayed allocation and the zero-length file problem

I thought the same thing as you at first, but I thought about it more and am no longer sure. To provide a completely unsurprising behavior, i.e. provide expected inter-process POSIX semantics between pre-crash and post-crash processes, you would need to infer a write barrier between every IO operation by a process. This may lead to far too much serialization of IO operations for the typical desktop use case.

So, is there an appropriate set of heuristics to infer write barriers in some cases but not others? The specific case in this discussion would be something like "insert a write barrier between file content operations and any later metadata operations that affect the linkage of that file's inode". Is this sufficient and defensible?

Ideally, we should have POSIX write-barriers that can be applied to a set of open file and directory handles, and use them to get the proper level of ordering across crashes. The fsync solution is far too blunt an instrument to provide the transactionality that everyone is looking for when they relink newly created files into place.

But then what about all those shell scripts out there which do "foo > file.tmp && mv file.tmp file"? We would need a new write-barrier operation applicable from the shell script (somehow selecting partial ordering of requests issued from separate processes), or a heuristic write-barrier as above...
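For comparison, the fsync-based version of that idiom can be sketched in Python. This is only an illustration of the "blunt instrument" approach, not anything proposed in this thread: the function name replace_file_durably is made up, and the directory fsync assumes a POSIX system where a directory can be opened read-only.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Write data to a temp file, fsync it, then rename it over path.

    Hypothetical helper: the explicit fsync stands in for the write
    barrier that the shell idiom "foo > file.tmp && mv file.tmp file"
    does not have.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Force the file contents to stable storage before relinking.
            os.fsync(tmp.fileno())
        # Atomically relink the new contents under the final name.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also flush the directory entry change (POSIX-only assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that this waits for the data to reach the disk, which is stronger (and slower) than the ordering-only guarantee a write barrier would give.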



Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 17:32 UTC (Sat) by dcoutts (guest, #5387)

Yes, it does raise a more general issue. We're not asking for a write barrier between every operation, but it's not entirely obvious which ones we can safely omit (or "mostly safely" omit). I don't have a complete answer either.

Certainly, since rename is supposed to be atomic and is used in this common idiom, it should act as a write barrier with respect to other operations on the same file. I don't think we should demand barriers between write operations within the same file or between different files. As you say, it would be useful to be able to add explicit barriers sometimes, just as we can for CPU operations on memory.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 7:57 UTC (Sun) by Pc5Y9sbv (guest, #41328)

I think the relink atomicity is a red herring here. That refers to the fact that the file is present under either the old or new name, i.e. it is an atomic directory metadata change, ignoring crash behaviors. Our main concern is that we want to extend the POSIX IO ordering semantics of non-atomic sequences across crash boundaries. We could actually recover from non-atomic relink (e.g. file is linked under old and new names) more easily than reordered content and name commits.

I think I agree now that it would be sensible to infer a write barrier between file content requests and inode linking requests for the same inode. This would cover a large percentage of "making data available" scenarios.
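A rough Python sketch of that heuristic, as a toy model rather than anything resembling real kernel code: requests are (kind, inode) pairs, the operation names are invented for illustration, and a "barrier" entry is assumed to mean only that earlier content requests for that inode must be committed before anything after it.

```python
# Hypothetical request kinds, chosen for this illustration only.
CONTENT_OPS = {"write", "truncate"}
LINK_OPS = {"rename", "link", "unlink"}


def infer_barriers(ops):
    """Return ops with a barrier inserted before each linking request
    that follows uncommitted content requests for the same inode."""
    out = []
    dirty = set()  # inodes with content requests not yet covered by a barrier
    for kind, inode in ops:
        if kind in CONTENT_OPS:
            dirty.add(inode)
        elif kind in LINK_OPS and inode in dirty:
            # Content must be ordered before the change in linkage.
            out.append(("barrier", inode))
            dirty.discard(inode)
        out.append((kind, inode))
    return out


requests = [("write", 7), ("write", 9), ("rename", 7), ("rename", 7), ("rename", 9)]
ordered = infer_barriers(requests)
```

In this model, writes to different inodes, or repeated writes to the same inode, get no barrier between them; only the first linking request after a content request does.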

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 14:52 UTC (Fri) by anton (guest, #25547)

"To provide a completely unsurprising behavior, i.e. provide expected inter-process POSIX semantics between pre-crash and post-crash processes, you would need to infer a write barrier between every IO operation by a process."

You have the right idea about the proper behaviour, but there is more freedom in implementing it. You can commit a batch of operations at the same time (e.g., after the 60s flush interval), and you need only a few barriers for each batch: essentially one barrier between writing everything but the commit block and writing the commit block, and another between writing the commit block and writing the free-blocks information for the blocks that were freed by the commit. If you delay the freeing long enough, you can combine the latter barrier with the former barrier of the next cycle.

This can be done relatively easily on a copy-on-write file system. For an update-in-place file system, you probably need more barriers, or you have to write more to the journal (including data that's written to pre-existing blocks).
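The two-barrier batch commit on a copy-on-write file system can be checked with a small Python model. Everything here is an assumption made for illustration: the block names are invented, writes between two barriers may reach the disk in any order or not at all, and recovery is assumed to use the new tree if and only if the commit block is on disk.

```python
from itertools import combinations


def crash_states(groups):
    """Yield every set of writes that may be on disk after a crash.

    groups is a list of write groups with a barrier between consecutive
    groups: all earlier groups are complete, and any subset of the
    current group may have reached the disk.
    """
    for k, group in enumerate(groups):
        done = frozenset(w for g in groups[:k] for w in g)
        for r in range(len(group) + 1):
            for subset in combinations(group, r):
                yield done | frozenset(subset)


def is_consistent(on_disk, new_blocks):
    """Check whether recovery would find a usable tree."""
    if "commit" in on_disk:
        # The commit block points at the new tree: all of it must exist.
        return all(b in on_disk for b in new_blocks)
    # No commit yet: the old tree is used, so its blocks must not be freed.
    return "free-old" not in on_disk


new_blocks = ["data1", "data2", "inode"]  # hypothetical block names
with_barriers = [new_blocks, ["commit"], ["free-old"]]
no_barriers = [new_blocks + ["commit", "free-old"]]

safe = all(is_consistent(s, new_blocks) for s in crash_states(with_barriers))
unsafe = all(is_consistent(s, new_blocks) for s in crash_states(no_barriers))
```

With the two barriers every crash state recovers to either the old or the new tree; without them, a state with the commit block but without the new data is possible, which corresponds to the zero-length file problem.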

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds