LWN.net Logo

O_*SYNC

By Jonathan Corbet
September 1, 2009
When developers think about forcing data written to files to be flushed to the underlying storage device, they tend to think about the fsync() system call. But it is also possible to request synchronous behavior for all operations on a file descriptor, either at open() time or using fcntl(). Support in Linux for synchronous I/O flags is likely to improve in 2.6.32, but this work has raised a couple of interesting issues with regard to the current implementation and forward compatibility.

There are three standard-defined flags which can be used to specify synchronous I/O behavior:

  • O_SYNC: requires that any write operations block until all data and all metadata have been written to persistent storage.

  • O_DSYNC: like O_SYNC, except that there is no requirement to wait for any metadata changes which are not necessary to read the just-written data. In practice, O_DSYNC means that the application does not need to wait until ancillary information (the file modification time, for example) has been written to disk. Using O_DSYNC instead of O_SYNC can often eliminate the need to flush the file inode on a write.

  • O_RSYNC: this flag, which only affects read operations, must be used in combination with either O_SYNC or O_DSYNC. It will cause a read() call to block until the data (and maybe metadata) being read has been flushed to disk (if necessary). This flag thus gives the kernel the option of delaying the flushing of data to disk; any number of writes can happen, but data need not be flushed until the application reads it back.

O_DSYNC and O_RSYNC are not new; they were added to the relevant standards well over ten years ago. But Linux has never really supported them (they are optional features), so glibc simply defines them both to be the same as O_SYNC.

Christoph Hellwig is working on a proper implementation of these flags, with an eye toward merging the changes in 2.6.32. It should be a relatively straightforward change at this point; the kernel has some nice infrastructure for handling data and metadata flushing now. What is potentially harder is making the change in a way which best meets the expectations of existing applications.

There are two unrelated issues which make this transition harder than one might expect it should be:

  • Linux has never actually implemented O_SYNC; what applications have been getting, instead, is O_DSYNC.

  • The open() implementation in the kernel simply ignores flags that it knows nothing about. This behavior can be changed only at risk of breaking unknown numbers of applications; it's an aspect of the kernel ABI.

Given the first problem listed above, one might be tempted to make a new flag for O_DSYNC and use it to obtain the current behavior, while O_SYNC would get the full metadata synchronization semantics. If this were to be done, though, applications which are built against a new C library but run on an older kernel would be presenting an unknown flag to open(), which would duly ignore it. That application would not get synchronous I/O behavior at all, which is almost certainly not a good thing. So something trickier will have to be done.

There is also the question of which semantics older applications should get. Jamie Lokier argued that applications requesting O_SYNC behavior wanted full metadata synchronization, even if the kernel has been cheating them out of the full experience. So, when running under a future kernel with a proper O_SYNC implementation, an old, binary application should get O_SYNC behavior. Ulrich Drepper, instead, thinks that behavior should not change for older applications:

But these programs apparently can live with the broken semantics. I don't worry too much about this. If people really need the fixed O_SYNC semantics then let them recompile their code.

It looks like Ulrich's view will win out, for the simple reason that the performance cost of the additional metadata synchronization seems worse than giving applications the semantics they have been running with anyway, even if those semantics are not quite what was promised.

Christoph outlined the likely course of action. Internally, O_SYNC will become O_DSYNC, and the open() flag which is currently O_SYNC will come to mean O_DSYNC. The open() system call will then take a new flag (name unknown; O_FULLSYNC and O_ISYNC have been suggested) which will be hidden from applications. At the glibc level, applications will see this:

    #define O_SYNC	(O_FULLSYNC|O_DSYNC)

On older kernels, the O_DSYNC flag (with the same value as O_SYNC now) will yield the same behavior as always, while O_FULLSYNC will be ignored. On newer kernels, the new flag will yield the full O_SYNC semantics. As long as applications do not reach under the hood and try to manipulate the O_FULLSYNC flag directly, all will be well.


(Log in to post comments)

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds