The ongoing quest for atomic buffered writes
Pankaj Raghav started the
discussion on February 13, noting that both ext4 and XFS now have
support for atomic writes when direct I/O is in use, but that supporting
atomic buffered I/O "remains a contentious topic". There are a
couple of outstanding proposals to add this feature: this
2024 series from John Garry and a more recent
patch set from Ojaswin Mujoo. These proposals have stalled, partly out
of concern about the amount of complexity added to the I/O paths and
questions about whether there is really a need for atomic buffered writes.
A frequently mentioned potential user for this feature is the PostgreSQL
database, which, unlike many other database managers, uses buffered I/O.
The PostgreSQL code often has to go out of its way to ensure that partial
I/O operations do not corrupt the database, sometimes at a cost to
performance. PostgreSQL is an important user, but not all developers are
convinced that atomic buffered writes are the solution to its problems;
Christoph Hellwig, for example, commented: "I think a
better session would be how we can help postgres to move off buffered I/O
instead of adding more special cases for them."
PostgreSQL developer Andres Freund responded
that the project is indeed working on adding direct-I/O support, but its performance has not yet reached
the level of the buffered-I/O method. But, he
said, direct I/O will only ever be useful for
some larger installations. Smaller systems, or those where the database is
running as part of a larger application with its own memory needs, will
still do better in a buffered-I/O setup where
the kernel can manage the allocation of memory. Even when direct I/O becomes competitive as an option for PostgreSQL,
he said, "well over 50% of users" will not be able to benefit from
it. Most of the developers in the conversation seem to accept that there
is a legitimate use case for atomic buffered I/O, though Hellwig remains a holdout.
An agreement that a solution would be nice to have does not, itself, create a solution, though. Atomic direct I/O was a complex problem to solve, requiring the kernel to keep I/O requests together all the way through to the eventual storage device. Buffered I/O adds complexity, since those operations have to go through the page cache, and the actual write operation is normally carried out at a different time, when the kernel gets around to it. Tracking atomicity requirements through the kernel in this way and preventing multiple operations from interfering with each other are not simple tasks.
Early in the discussion Mujoo suggested that one possible solution might be to use writethrough semantics for atomic buffered writes. In other words, when user space initiates a buffered write requesting atomic behavior (which would be done using pwritev2() with the RWF_ATOMIC flag), the kernel would immediately initiate the process of writing that data to disk. That would allow creating a short-term pin to keep the pages in memory (it is hard to do an atomic write if one of the pages full of data is pushed out to swap in the middle of the operation) and would let the kernel prevent any other changes to those pages while the operation is underway. There would be no need to find a way to track atomic writes for dirty data that is sitting in the page cache.
Jan Kara agreed that writethrough behavior could be interesting. It would allow much of the existing direct-I/O infrastructure to be reused, he said, making the solution much simpler. The real question, he said, was whether writethrough behavior would be useful for PostgreSQL. Freund answered that writethrough would indeed be useful, even in the absence of atomic behavior. He suggested implementing it by requiring that atomic buffered writes include a new RWF_WRITETHROUGH flag along with RWF_ATOMIC; that way, if the kernel ever implemented atomic buffered writes without writethrough, there would not be a behavior change seen by user space.
Raghav asked about the difference between the proposed RWF_WRITETHROUGH flag and the existing RWF_DSYNC, saying that the former might (like most buffered writes) be asynchronous, while the latter is synchronous. Dave Chinner disagreed with that interpretation, though, saying that writethrough behavior is inherently synchronous so that errors can be immediately reported. The way to get asynchronous behavior, he said, is to use the asynchronous-I/O interface or io_uring. But RWF_WRITETHROUGH itself, he said, should behave identically to direct-I/O writes, allowing the existing I/O paths to be used to implement it. RWF_DSYNC, he said, would still be different in that it forces the storage device to commit the data to persistent media, while RWF_WRITETHROUGH would not take that extra step (meaning that data could remain in the device's write cache).
In an attempt to summarize the discussion, Raghav posted this set of proposed conclusions; the first step would be to implement the proposed writethrough behavior with immediate initiation of the requested operation. Writethrough alone, though, does not guarantee atomic behavior, so there will be more to be done. The next step will be to ensure that the data being written is not modified while the operation is underway. Fortunately, the kernel has long had a mechanism, stable pages, that can be brought into play here. By preventing modifications to a buffer that is being written, the kernel can prevent the data from being corrupted.
Later steps will include taking care to copy the full data range into the
page cache before beginning the operation, and to make sure that the buffer
is written in a single, atomic operation. There will inevitably be other
details to deal with, such as specifying and enforcing alignment
requirements for buffers used with atomic writes. But it would appear that
the path toward atomic buffered writes is starting to become clearer.
It shouldn't take more than another half-dozen or so LSFMM+BPF sessions
before the problem is fully solved.
| Index entries for this article | |
|---|---|
| Kernel | Atomic I/O operations |
