Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 16:11 UTC (Fri) by dcoutts (guest, #5387)
Parent article: Ts'o: Delayed allocation and the zero-length file problem

Surely the solution is just to put an implicit write barrier between the file content data being written and the file meta-data being written. Then the write followed by rename thing would actually be atomic (as we had always assumed it was).

There's a very low performance penalty for using a write barrier. All modern disks support it without having to issue a full flush.

App authors are not demanding that the file data make it to the disk immediately. They're demanding that the file update is atomic. It should preserve the old content or the new but never give us a zero-length file.

Can this be that difficult? We do not need a total ordering on all file system requests. We just need it for certain meta-data and content data writes.
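
For reference, a minimal sketch of the write-followed-by-rename idiom under discussion, written in Python for brevity. The function name `replace_file` and the `.tmp` suffix are illustrative assumptions, not anything from the article. There is deliberately no fsync here: this is the pattern whose behaviour across a crash is in question.

```python
import os


def replace_file(path: str, data: bytes) -> None:
    """Replace `path` using create-write-close-rename, with no fsync."""
    tmp = path + ".tmp"  # hypothetical temporary name, chosen for illustration
    with open(tmp, "wb") as f:  # create
        f.write(data)           # write
    # close happened at the end of the `with` block
    os.rename(tmp, path)        # rename over the old file
```

While the system stays up, other processes see either the old file or the new one. What ends up on disk after a crash depends on the order in which the file system commits the data and the rename, which is the point of the comment above.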


Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 16:25 UTC (Fri) by forthy (guest, #1525)

We don't even need write barriers. The updates of the file system can update lots of data and metadata in one go. But it should keep the consistency POSIX promises: All file system operations are performed in order. This is actually a POSIX promise; it just doesn't hold for crashes (because crashes are not specified). I.e. if delayed updates are used, they should be delayed all together and then done in an atomic way - either complete them or roll them back to the previous state. This is actually not difficult.

Btrfs does this; Ted Ts'o doesn't seem to get it. Many file system designers don't get it; they are anal about their metadata, and don't care at all about user data. Unix file systems have lost data in that way since the invention of synchronous metadata updates (prior to that, they also lost metadata ;-).

IMHO there is absolutely nothing wrong with the create-write-close-rename way to replace a file. As application writers, we have to rely somehow on our OS. If we don't, we better write it ourselves. When the file system designers don't get it, and are anal about some vague spec, fsck them.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 17:04 UTC (Fri) by jwb (guest, #15467)

What we actually need is a user-space API that actually makes sense and wasn't invented in the Sputnik era.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 20:25 UTC (Fri) by alexl (subscriber, #19068)

In terms of POSIX API additions, what would be nice for atomic, safe rewriting of files would be:

fd1=open(dirname(file))
fd2=openat(fd1, NULL, O_CREAT) // Creates an unlinked file
write(fd2)
flinkat(fd1, fd2, basename(file)) // Should guarantee fd2 is written to disk before linking.
close(fd2)
close(fd1)

This seems race free:
doesn't break if the directory is moved during write
doesn't let other apps see or modify the temp file while writing
doesn't leave a broken tempfile around on app crash
doesn't end up with an empty file on system crash

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 1:41 UTC (Sat) by dcoutts (guest, #5387)

Yes, that would be great. It's a natural extension of the POSIX notion that files are separate from their directory entries.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 23:25 UTC (Sun) by halfline (guest, #31920)

Another idea would be to introduce a new open flag, O_REWRITE or some such, that gives the same straightforwardness as O_TRUNC to the application developer but under the hood works on a detached file and atomically renames on close. Since all the I/O operations are grouped (via the fd), the kernel should be able to ensure proper ordering relatively easily (I think?) and apps wouldn't have to introduce a costly "sync it now" operation.
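
A rough user-space approximation of what such a flag might feel like, as a Python context manager. This is only a sketch: O_REWRITE does not exist, the name `rewrite` is made up for illustration, and user space cannot supply the kernel-side ordering guarantee the proposal is really about.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def rewrite(path):
    """Hypothetical stand-in for an O_REWRITE-style open (not a real API)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one file system.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            yield f               # caller writes the new content here
        os.rename(tmp, path)      # "rename on close": swap the new content in
    except BaseException:
        # The writer failed: leave the old file alone and drop the temp file.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)
        raise
```

Usage would look like `with rewrite("settings.conf") as f: f.write(b"...")`. Unlike the proposed flag, the temporary file here is visible in the directory while it is being written, and `mkstemp` creates it with mode 0600 rather than the mode of the file being replaced.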

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 21, 2009 0:34 UTC (Sat) by spitzak (guest, #4593)

This flag certainly is needed, but I would go further and say that Linux should change the behavior of O_CREAT|O_WRONLY|O_TRUNC to do exactly what you specify. This is because probably every program using these flags (or using creat()) is written to implicitly expect this behavior anyway.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 1:35 UTC (Sat) by dcoutts (guest, #5387)

Sorry, I wasn't clear. I meant that the write barrier should be implicit in the create-write-close-rename dance. I didn't mean application authors should explicitly have to add a write barrier. Of course in the kernel the write barrier would be explicit.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 15:59 UTC (Sat) by Pc5Y9sbv (guest, #41328)

I thought the same thing as you at first, but I thought about it more and am no longer sure. To provide a completely unsurprising behavior, i.e. provide expected inter-process POSIX semantics between pre-crash and post-crash processes, you would need to infer a write barrier between every IO operation by a process. This may lead to far too much serialization of IO operations for the typical desktop use case.

So, is there an appropriate set of heuristics to infer write barriers sometimes but not others? The specific case in this discussion would be something like "insert a write barrier after file content operations requested before metadata operations affecting linkage of that file's inode"? Is this sufficient and defensible?

Ideally, we should have POSIX write-barriers that can be applied to a set of open file and directory handles, and use them to get the proper level of ordering across crashes. The fsync solution is far too blunt an instrument to provide the transactionality that everyone is looking for when they relink newly created files into place.
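
For comparison, the fsync approach called too blunt above looks roughly like the following sketch. The function name `replace_file_durably` and the `.tmp` suffix are assumptions for illustration. Each fsync blocks until the data has reached stable storage, which buys durability the caller may not need when all it wants is ordering.

```python
import os


def replace_file_durably(path: str, data: bytes) -> None:
    """Write-then-rename with explicit fsync calls (a sketch, not a reference)."""
    tmp = path + ".tmp"  # hypothetical temporary name
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()               # user-space buffer -> kernel
        os.fsync(f.fileno())    # kernel -> disk; blocks until done
    os.rename(tmp, path)
    # Sync the containing directory as well, so the rename itself is on disk.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```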

But then what about all those shell scripts out there which do "foo > file.tmp && mv file.tmp file"? We would need a new write-barrier operation applicable from the shell script (somehow selecting partial ordering of requests issued from separate processes), or a heuristic write-barrier as above...

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 17:32 UTC (Sat) by dcoutts (guest, #5387)

Yes it does raise a more general issue. We're not asking for a write barrier between every operation but it's not entirely obvious which ones we can safely omit (or "mostly safely" omit). I don't have a complete answer either.

Certainly since rename is supposed to be atomic and because it is used in this common idiom then it should have a write barrier wrt other operations on the same file. I don't think we should demand barriers between write operations within the same file or between different files. As you say it would be useful to be able to add explicit barriers sometimes, just as we can for CPU operations on memory.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 7:57 UTC (Sun) by Pc5Y9sbv (guest, #41328)

I think the relink atomicity is a red herring here. That refers to the fact that the file is present under either the old or new name, i.e. it is an atomic directory metadata change, ignoring crash behaviors. Our main concern is that we want to extend the POSIX IO ordering semantics of non-atomic sequences across crash boundaries. We could actually recover from non-atomic relink (e.g. file is linked under old and new names) more easily than reordered content and name commits.

I think I agree now that it would be sensible to infer a write barrier between file content requests and inode linking requests for the same inode. This would cover a large percentage of "making data available" scenarios.
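
As a toy illustration of that heuristic (not kernel code; the request representation and the name `infer_barriers` are invented for this sketch), one could walk a stream of requests and insert a barrier only where a link operation follows unflushed content writes to the same inode:

```python
def infer_barriers(requests):
    """Insert a barrier before a link of an inode that has pending content writes.

    `requests` is a list of (kind, inode) tuples, with kind "write" or "link".
    """
    ordered = []
    dirty = set()  # inodes with content writes not yet ordered by a barrier
    for kind, inode in requests:
        if kind == "write":
            dirty.add(inode)
        elif kind == "link" and inode in dirty:
            # Content must reach disk before the name that exposes it.
            ordered.append(("barrier", inode))
            dirty.discard(inode)
        ordered.append((kind, inode))
    return ordered
```

Writes to different files, and repeated writes to the same file, get no barrier between them here, which matches the position taken earlier in the thread that those orderings need not be enforced.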

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 14:52 UTC (Fri) by anton (guest, #25547)

"To provide a completely unsurprising behavior, i.e. provide expected inter-process POSIX semantics between pre-crash and post-crash processes, you would need to infer a write barrier between every IO operation by a process."

You have the right idea about the proper behaviour, but there is more freedom in implementing it: You can commit a batch of operations at the same time (e.g., after the 60s flush interval), and you need only a few barriers for each batch: essentially one barrier between writing everything but the commit block and writing the commit block, and another barrier between writing the commit block and writing the free-blocks information for the blocks that were freed by the commit (and if you delay the freeing long enough, you can combine the latter barrier with the former barrier of the next cycle).

This can be done relatively easily on a copy-on-write file system. For an update-in-place file system, you probably need more barriers or write more stuff in the journal (including data that's written to pre-existing blocks).
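
A toy model of the first of those barriers, assuming a copy-on-write layout where a single commit block points at the current data. Everything here (the block names, `crash_states`, `recover`) is invented for illustration. The model lets the disk persist any subset of the writes issued since the last barrier and then enumerates what a crash could leave behind; the second barrier, for the free-blocks information, is not modelled.

```python
from itertools import combinations

BARRIER = object()  # marker: everything before it is on disk before anything after


def crash_states(initial, ops):
    """Yield every on-disk state a crash could leave, given reordering between barriers."""
    done = dict(initial)
    segment = []
    for op in list(ops) + [BARRIER]:
        if op is not BARRIER:
            segment.append(op)
            continue
        # Any subset of the writes in the current segment may have hit the disk.
        for r in range(len(segment) + 1):
            for subset in combinations(segment, r):
                state = dict(done)
                state.update(subset)
                yield state
        done.update(segment)
        segment = []


def recover(disk):
    """Follow the commit block; None means it points at data that never arrived."""
    return disk.get(disk["commit"])


initial = {"commit": "data@1", "data@1": "old"}
with_barrier = [("data@2", "new"), BARRIER, ("commit", "data@2")]
without_barrier = [("data@2", "new"), ("commit", "data@2")]

outcomes_with = {recover(s) for s in crash_states(initial, with_barrier)}
outcomes_without = {recover(s) for s in crash_states(initial, without_barrier)}
```

With the barrier, recovery only ever sees the old or the new content. Without it, one possible crash state has the commit block pointing at a data block that was never written, which is the analogue of the zero-length file.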

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 15, 2009 9:40 UTC (Sun) by k8to (subscriber, #15413)

NOTES
Note that fclose() only flushes the user space buffers provided by the C library. To ensure that the data
is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).

It seems fclose doesn't imply write.
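
The distinction in that man page note can be observed from Python, whose buffered file objects behave like stdio streams in this respect. A small sketch follows (the file name `demo` is arbitrary): the write first sits in a user-space buffer, `flush()` hands it to the kernel, and only `os.fsync()` asks the kernel to push it to the disk.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "demo")
    f = open(path, "wb")                       # buffered, like a stdio FILE*
    f.write(b"hello")                          # lands in the user-space buffer
    size_before_flush = os.path.getsize(path)  # kernel has seen nothing yet
    f.flush()                                  # like fflush(3): buffer -> kernel
    size_after_flush = os.path.getsize(path)   # kernel now knows about 5 bytes
    os.fsync(f.fileno())                       # like fsync(2): kernel -> disk
    f.close()                                  # like fclose(3): flushes, does not sync
```

Here `size_before_flush` is 0 and `size_after_flush` is 5. Nothing in the program can observe the effect of the fsync step; it only matters if the machine goes down.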

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 14:40 UTC (Fri) by anton (guest, #25547)

Write barriers or something equivalent (properly-used tagged commands, write cache flushes, or disabling write-back caches) are needed for any file system that wants to provide any consistency guarantee. Otherwise the disk drive can delay writing one block indefinitely while writing out others that are much later logically.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds