
The ongoing quest for atomic buffered writes

By Jonathan Corbet
March 2, 2026
There are many applications that need to be able to write multi-block chunks of data to disk with the assurance that the operation will either complete successfully or fail altogether — that the write will not be partially completed (or "torn"), in other words. For years, kernel developers have worked on providing atomic writes as a way of satisfying that need; see, for example, sessions from the Linux Storage, Filesystem, Memory Management, and BPF (LSFMM+BPF) Summit from 2023, 2024, and 2025 (twice). While atomic direct I/O is now supported by some filesystems, atomic buffered I/O still is not. Filling that gap seems certain to be a 2026 LSFMM+BPF topic but, thanks to an early discussion, the shape of a solution might already be coming into focus.

Pankaj Raghav started that discussion on February 13, noting that both ext4 and XFS now have support for atomic writes when direct I/O is in use, but that supporting atomic buffered I/O "remains a contentious topic". There are a couple of outstanding proposals to add this feature: this 2024 series from John Garry and a more recent patch set from Ojaswin Mujoo. These proposals have stalled, partly out of concern about the amount of complexity added to the I/O paths and questions about whether there is really a need for atomic buffered writes.

A frequently mentioned potential user for this feature is the PostgreSQL database which, unlike many other database managers, uses buffered I/O. The PostgreSQL code often has to go out of its way to ensure that partial I/O operations do not corrupt the database, sometimes at a cost to performance. PostgreSQL is an important user, but not all developers are convinced that atomic buffered writes are the solution to its problems; Christoph Hellwig, for example, commented: "I think a better session would be how we can help postgres to move off buffered I/O instead of adding more special cases for them."

PostgreSQL developer Andres Freund responded that the project is indeed working on adding direct-I/O support, but its performance has not yet reached the level of the buffered-I/O method. But, he said, direct I/O will only ever be useful for some larger installations. Smaller systems, or those where the database is running as part of a larger application with its own memory needs, will still do better in a buffered-I/O setup where the kernel can manage the allocation of memory. Even when direct I/O becomes competitive as an option for PostgreSQL, he said, "well over 50% of users" will not be able to benefit from it. Most of the developers in the conversation seem to accept that there is a legitimate use case for atomic buffered I/O, though Hellwig remains a holdout.

An agreement that a solution would be nice to have does not, itself, create a solution, though. Atomic direct I/O was a complex problem to solve, requiring the kernel to keep I/O requests together all the way through to the eventual storage device. Buffered I/O adds complexity, since those operations have to go through the page cache, and the actual write operation is normally carried out at a different time, when the kernel gets around to it. Tracking atomicity requirements through the kernel in this way and preventing multiple operations from interfering with each other are not simple tasks.

Early in the discussion Mujoo suggested that one possible solution might be to use writethrough semantics for atomic buffered writes. In other words, when user space initiates a buffered write requesting atomic behavior (which would be done using pwritev2() with the RWF_ATOMIC flag), the kernel would immediately initiate the process of writing that data to disk. That would allow creating a short-term pin to keep the pages in memory (it is hard to do an atomic write if one of the pages full of data is pushed out to swap in the middle of the operation) and would let the kernel prevent any other changes to those pages while the operation is underway. There would be no need to find a way to track atomic writes for dirty data that is sitting in the page cache.

Jan Kara agreed that writethrough behavior could be interesting. It would allow much of the existing direct-I/O infrastructure to be reused, he said, making the solution much simpler. The real question, he said, was whether writethrough behavior would be useful for PostgreSQL. Freund answered that writethrough would indeed be useful, even in the absence of atomic behavior. He suggested implementing it by requiring that atomic buffered writes include a new RWF_WRITETHROUGH flag along with RWF_ATOMIC; that way, if the kernel ever implemented atomic buffered writes without writethrough, there would not be a behavior change seen by user space.
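The interface under discussion can be sketched in a few lines of C. RWF_ATOMIC is real (merged in Linux 6.11, currently honored only for direct I/O on supporting filesystems); RWF_WRITETHROUGH is only a flag proposed in this thread and does not exist in any released kernel, so the sketch below uses RWF_ATOMIC alone and falls back to a plain write where the kernel or filesystem rejects it:

```c
/* Sketch: requesting an atomic write with pwritev2().
 * RWF_ATOMIC exists since Linux 6.11 (direct I/O only so far);
 * the RWF_WRITETHROUGH flag discussed in the thread is a proposal
 * and is deliberately not used here. */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <errno.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040   /* value from <linux/fs.h>, Linux 6.11+ */
#endif

/* Try an atomic write of buf at off; if the kernel or filesystem
 * does not support RWF_ATOMIC, retry as an ordinary (non-atomic)
 * write. Returns bytes written, or -1 with errno set. */
ssize_t write_atomic_or_plain(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t n = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);

	if (n < 0 && (errno == EINVAL || errno == EOPNOTSUPP))
		n = pwritev2(fd, &iov, 1, off, 0);  /* not atomic */
	return n;
}
```

Note that a real caller would also have to respect the alignment and size limits advertised via statx() (stx_atomic_write_unit_min/max) rather than falling back silently.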

Raghav asked about the difference between the proposed RWF_WRITETHROUGH flag, and the existing RWF_DSYNC, saying that the former might (like most buffered writes) be asynchronous, while the latter is synchronous. Dave Chinner disagreed with that interpretation, though, saying that writethrough behavior is inherently synchronous so that errors can be immediately reported. The way to get asynchronous behavior, he said, is to use the asynchronous-I/O interface or io_uring. But RWF_WRITETHROUGH itself, he said, should behave identically to direct-I/O writes, allowing the existing I/O paths to be used to implement it. RWF_DSYNC, he said, would still be different in that it forces the storage device to commit the data to persistent media, while RWF_WRITETHROUGH would not take that extra step (meaning that data could remain in the device's write cache).

In an attempt to summarize the discussion, Raghav posted this set of proposed conclusions; the first step would be to implement the proposed writethrough behavior with immediate initiation of the requested operation. Writethrough alone, though, does not guarantee atomic behavior, so there will be more to be done. The next step will be to ensure that the data being written is not modified while the operation is underway. Fortunately, the kernel has long had a mechanism, stable pages, that can be brought into play here. By preventing modifications to a buffer that is being written, the kernel can prevent the data from being corrupted.

Later steps will include taking care to copy the full data range into the page cache before beginning the operation, and to make sure that the buffer is written in a single, atomic operation. There will inevitably be other details to deal with, such as specifying and enforcing alignment requirements for buffers used with atomic writes. But it would appear that the path toward atomic buffered writes is starting to become more clear. It shouldn't take more than another half-dozen or so LSFMM+BPF sessions before the problem is fully solved.

Index entries for this article
Kernel: Atomic I/O operations



heh

Posted Mar 3, 2026 0:41 UTC (Tue) by djwong (subscriber, #23506) [Link]

“...another half-dozen or so LSFMM+BPF sessions before the problem is fully solved.”

<snicker>

Atomic Writes and PLP

Posted Mar 3, 2026 12:01 UTC (Tue) by RazeLighter777 (subscriber, #130021) [Link]

How do atomic write implementations differ between drives with Power Loss Prevention (PLP) and those without? It seems to me that doing atomic operations on drives with and without PLP should be different.

I'm curious about this. I've picked up a couple SAS SSDs with PLP and they've been much more dependable for my database workloads than my consumer grade drives.

Other use cases

Posted Mar 3, 2026 12:39 UTC (Tue) by danielkza (subscriber, #66161) [Link] (12 responses)

Writethrough behaviour sounds quite useful in general; file managers could leverage it when copying data to removable devices, eliminating the surprising occurrence of a very fast file copy followed by a device eject taking minutes.

Other use cases

Posted Mar 3, 2026 13:03 UTC (Tue) by epa (subscriber, #39769) [Link] (9 responses)

If you are using io_uring or another asynchronous API, where you request a write and get notified of its completion later, I would say that writethrough should arguably be the default. But you might want to get two separate notifications: the first for "the data has hit the cache and is now visible to read queries", the second for "the data has been written to disk".

Other use cases

Posted Mar 3, 2026 16:10 UTC (Tue) by farnz (subscriber, #17727) [Link] (8 responses)

There are three notifications the kernel could give you:
  1. The data is visible to other readers, but not necessarily the storage device.
  2. The data is in the storage device's volatile cache, but not necessarily persistent.
  3. The device reports that the data has been persisted to the long-term storage medium, and absent a hardware issue, will be safe after a system restart.

The problem is that the second notification is not normally a lot more valuable than the first (they both mean "your data is not yet safe, but is visible to other readers", just with different values of "not yet safe"), while the third is costly (either you flush caches to convert "in the storage device's volatile cache" into "persisted", or you use a more expensive "forced unit access" write command, and prevent the device from doing its normal reordering optimizations on writes).

It's thus best for performance if you have to choose which notification of the three you get up-front, and accept the cost of asking for the "data is persisted" notification if you've chosen that one. It could, for example, be worth accepting the increased cost because you're writing to a USB 2.0 attached flash drive, and the resulting extra cost is minimal as compared to the cost of the write itself, and well worth it to avoid a long delay when the write is finished as the device becomes safe to remove, or it might be not worth the extra cost because you're writing to NVME storage, and the user won't remove it.

Other use cases

Posted Mar 4, 2026 10:34 UTC (Wed) by taladar (subscriber, #68407) [Link] (3 responses)

Technically there could also be more layers, e.g. if you have a RAID controller with its own cache or some sort of system where NVMe disks cache data from rotating ones.

Other use cases

Posted Mar 4, 2026 12:54 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

Even with more layers, the interesting notifications remain similar - they're telling you how big a failure is needed to lose your data. The interesting states are "any failure", "only a failure in the storage subsystem, but one that's quite likely to happen" and "only unlikely failures that we try to engineer out of the system completely".

Other use cases

Posted Mar 5, 2026 13:11 UTC (Thu) by taladar (subscriber, #68407) [Link] (1 responses)

True, my point was more that you probably don't want to bake assumptions about current storage architectures into the long-lived FS API.

Other reasons for more layers with messy properties might include network filesystems which might have local cached storage in case of network failure.

Other use cases

Posted Mar 5, 2026 13:56 UTC (Thu) by farnz (subscriber, #17727) [Link]

Yep - you want to base the notifications on what's interesting to the application, not on the current layering; one for "other readers will see the write" (obvious), one for "this system can crash without losing data, but if the storage stack crashes, you may lose data" (there are commit protocols where this is the point at which you can tentatively acknowledge the write), and one for "data should not be lost even if everything crashes" (since this is the final state for all writes) are still probably the interesting three, no matter how complicated you make the underlying layers.

I can also see some room for splitting the first into two: "other readers on this system" and "other readers with access to this storage stack" - but again, the point has to be "what is interesting to the application about this state", as opposed to "what states do I know about?", with "what states do I know about?" only being useful for preventing you offering notifications you can't provide at a sensible cost.

How much would it cost for every NVMe drive to have a non-volatile cache?

Posted Mar 8, 2026 21:24 UTC (Sun) by DemiMarie (subscriber, #164188) [Link] (3 responses)

If NVMe drives were required to have non-volatile caches, how much would it add to the BOM cost?

How much would it cost for every NVMe drive to have a non-volatile cache?

Posted Mar 9, 2026 14:55 UTC (Mon) by farnz (subscriber, #17727) [Link] (2 responses)

What difference does it make? The notifications (in kernel caches, sent to device but not yet guaranteed on non-volatile storage, guaranteed as safe as the device can make it) are the same whether the cache is volatile, or non-volatile.

And the BOM cost for being able to save the volatile cache to non-volatile store is not small - maybe $10 for the capacitance to power the DRAM, flash and MCU until the cache is saved, plus however much extra flash you need so that you have a "safe space" to write the volatile cache to. You can't just make the device only have non-volatile cache, because the performance characteristics of non-volatile memory aren't what you need when you're doing things like "read 1 MiB from the main flash, replace 4 KiB with this new change, write 1 MiB to a new location, mark the original 1 MiB as safe to erase".

Note, too, that depending on the NVMe design, a write in non-volatile cache may not be considered "safe", because the chances of the non-volatile cache being corrupted are too high. At least one device I've looked at with non-volatile cache has a clear statement that the non-volatile cache is only safe during a commanded shutdown of the device, and not a power loss event, because the microcontroller may scribble over non-volatile cache during power loss.

How much would it cost for every NVMe drive to have a non-volatile cache?

Posted Mar 9, 2026 16:20 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

> You can't just make the device only have non-volatile cache, because the performance characteristics of non-volatile memory aren't what you need when you're doing things like "read 1 MiB from the main flash, replace 4 KiB with this new change, write 1 MiB to a new location, mark the original 1 MiB as safe to erase".

Did NVRAM use to be 4KiB blocks? Can you still get it?

Okay, I don't know the cost implications, but couldn't you write your 4K updates to a 4K NVRAM cache, and then use normal DRAM as your cache for actually doing your "read 1MiB in, update 4K, write 1 MiB out"? And then you just need to be able to flush your incoming data to the 4K NVRAM on a power fail, which is an easier problem than flushing everything to the permanent backing store?

(And have a "here is a 1MiB write" command, so if the OS is streaming data to disk it can still send data in chunks that match the NVMe block size for efficiency).

Cheers,
Wol

How much would it cost for every NVMe drive to have a non-volatile cache?

Posted Mar 9, 2026 16:32 UTC (Mon) by farnz (subscriber, #17727) [Link]

You can get various forms of flash with different block sizes; the smaller the blocks, the more expensive it becomes per byte, and the faster you want it, the more expensive it gets.

But the problem is that you don't have just 4K of updates at a time to handle - a high performance NVMe drive is handling gigabytes of unwritten data in its volatile cache, and having gigabytes of very fast flash is expensive, as compared to gigabytes of RAM, gigabytes of reasonably fast flash, and a supercapacitor to let you flush the RAM to the fast flash before you lose power completely. However, even that's going to be significant - doing it with pSLC flash will add maybe $30 to your $100/TiB NVMe SSD, more if you want a bigger cache (useful for performance).

Other use cases

Posted Mar 4, 2026 19:06 UTC (Wed) by TheJH (subscriber, #101155) [Link] (1 responses)

though such a file manager could also, after finishing a copy operation, fsync() all the copied files, or if that's too much, syncfs() the entire filesystem, before indicating to the user that the operation has completed...

Other use cases

Posted Mar 4, 2026 20:11 UTC (Wed) by farnz (subscriber, #17727) [Link]

Ideally, though, the file manager wants to report progress to the user - which means that you want notifications for fsync progress, not just write progress. That pushes you in the direction of sync_file_range, too - you'd end up submitting a series of sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) operations to get you progress, then an fsync to sync the file metadata, then either a syncfs or a set of fsyncs on directories to ensure that the file is safely on disk.
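The pattern described above can be sketched with the real sync_file_range() and fsync() interfaces: push the file's data to storage in chunks, reporting progress after each chunk, then fsync() for the metadata. (sync_file_range() gives no durability guarantee by itself; it only initiates and waits for writeback of the data range, which is exactly why the final fsync() is still needed.)

```c
/* Sketch: incremental writeback with progress reporting, as described
 * in the comment above. sync_file_range() flushes only data pages;
 * the trailing fsync() covers metadata and device durability. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Write back the range [0, len) of fd in 'chunk'-sized pieces,
 * invoking progress() after each piece has completed writeback.
 * Returns 0 on success, -1 with errno set on error. */
int sync_with_progress(int fd, off_t len, off_t chunk,
		       void (*progress)(off_t done, off_t total))
{
	for (off_t off = 0; off < len; off += chunk) {
		off_t n = (len - off < chunk) ? len - off : chunk;

		if (sync_file_range(fd, off, n,
				    SYNC_FILE_RANGE_WAIT_BEFORE |
				    SYNC_FILE_RANGE_WRITE |
				    SYNC_FILE_RANGE_WAIT_AFTER) != 0)
			return -1;
		if (progress)
			progress(off + n, len);
	}
	return fsync(fd);	/* data is written back; now the metadata */
}
```

As the comment notes, a careful file manager would follow this with fsync() on the containing directories (or syncfs()) before telling the user the copy is complete.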

Related thought

Posted Mar 3, 2026 20:33 UTC (Tue) by IanKelling (subscriber, #89418) [Link]

Lately, I've done some learning about different filesystems and found LWN articles to be immensely valuable. In this area, it is really filling a hole, please keep it up.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds