By Jonathan Corbet
September 1, 2009
When developers think about forcing data written to files to be flushed to
the underlying storage device, they tend to think about the
fsync()
system call. But it is also possible to request synchronous behavior for
all operations on
a file descriptor, either at
open() time or using
fcntl(). Support in Linux for synchronous I/O flags is likely to
improve in 2.6.32, but this work has raised a couple of interesting issues
with regard to the current implementation and forward compatibility.
There are three standard-defined flags which can be used to specify
synchronous I/O behavior:
- O_SYNC: requires that any write operations block until all
data and all metadata have been written to persistent storage.
- O_DSYNC: like O_SYNC, except that there is no
requirement to wait for any metadata changes which are not necessary
to read the just-written data. In practice, O_DSYNC means
that the application does not need to wait until ancillary information
(the file modification time, for example) has been written to disk.
Using O_DSYNC instead of O_SYNC can often eliminate
the need to flush the file inode on a write.
- O_RSYNC: this flag, which only affects read operations, must
be used in combination with either O_SYNC or
O_DSYNC. It will cause a read() call to block until
the data (and maybe metadata) being read has been flushed to disk (if
necessary). This flag thus gives the kernel the option of delaying
the flushing of data to disk; any number of writes can happen, but
data need not be flushed until the application reads it back.
O_DSYNC and O_RSYNC are not new; they were added to the
relevant standards well over ten years ago. But Linux has never really
supported them (they are optional features), so glibc simply defines them
both to be the same as O_SYNC.
Christoph Hellwig is working on a proper
implementation of these flags, with an eye toward merging the changes
in 2.6.32.
It should be a relatively straightforward change at this point; the kernel
has some nice infrastructure for handling data and metadata flushing now.
What is potentially harder is making the change in a way which best meets
the expectations of existing applications.
There are two unrelated issues which make this transition harder than one
might expect it should be:
- Linux has never actually implemented O_SYNC; what
applications have been getting, instead, is O_DSYNC.
- The open() implementation in the kernel simply ignores flags
that it knows nothing about. This behavior can be changed only at
risk of breaking unknown numbers of applications; it's an aspect of
the kernel ABI.
Given the first problem listed above, one might be tempted to make a new flag
for O_DSYNC and use it to obtain the current behavior, while
O_SYNC would get the full metadata synchronization semantics. If
this were to be done, though, applications which are built against a new C
library but run on an older kernel would be presenting an unknown flag to
open(), which would duly ignore it. That application would not get
synchronous I/O behavior at all, which is almost certainly not a good
thing. So something trickier will have to be done.
There is also the question of which semantics older applications should
get. Jamie Lokier argued that applications
requesting O_SYNC behavior wanted full metadata synchronization,
even if the kernel has been
cheating them out of the full experience. So, when running under a future
kernel with a proper O_SYNC implementation, an old, binary
application should get O_SYNC behavior. Ulrich Drepper, instead,
thinks that behavior should not change for
older applications:
But these programs apparently can live with the broken semantics.
I don't worry too much about this. If people really need the fixed
O_SYNC semantics then let them recompile their code.
It looks like Ulrich's view will win out, for the simple reason that the
performance cost of the additional metadata synchronization seems worse than
giving applications the semantics they have been running with anyway, even
if those semantics are not quite what was promised.
Christoph outlined the likely course of
action. Internally, O_SYNC will become O_DSYNC, and the
open() flag which is currently O_SYNC will come to mean O_DSYNC. The
open() system call will then take a new flag (name unknown;
O_FULLSYNC and O_ISYNC have been suggested) which will be
hidden from applications. At the glibc level, applications will see this:
#define O_SYNC (O_FULLSYNC|O_DSYNC)
On older kernels, the O_DSYNC flag (with the same value as
O_SYNC now) will yield the same behavior as always, while
O_FULLSYNC will be ignored. On newer
kernels, the new flag will yield the full O_SYNC semantics. As
long as applications do not reach under the hood and try to manipulate the
O_FULLSYNC flag directly, all will be well.
(
Log in to post comments)