The intersection of unstable pages and direct I/O
Writing a page of data to a block storage device takes time. If a process tells the kernel to perform a write, some time will thus pass before the operation is complete. Should that process modify the under-I/O data while it is being written, the result is a sort of data race with the usual unpredictable results. Either the old or the new data could end up being written; in the worst case, a combination of the old and new data could be written, corrupting the file.
In simple cases, this behavior is less of a problem than it seems; if the application is modifying the pages under I/O, it is likely to write them to persistent storage another time, and the end result will be as it should be. But if, for example, the filesystem is writing checksums alongside the data, changing the data partway through can cause the resulting checksum to not match. Other types of processing during I/O, such as transparent compression or a RAID implementation, can run into the same sort of trouble. In these cases, a badly timed change to a page under I/O can lead to I/O errors or data corruption.
The solution to this problem usually takes the form of stable pages — a guarantee made by the kernel that data will not be changed while it is under I/O. Stability can be achieved by, for example, simply delaying any attempts to modify the data until the I/O operation is complete. For buffered I/O, where data is copied through the page cache, implementing this approach is relatively straightforward, and was done years ago. Delaying memory access can reduce performance, though, so it is usually only done when there is a clear need.
Direct I/O, however, is a special case. When a process requests direct I/O, it wants to transfer data directly between its buffers and the underlying storage device without going through the page cache. Since stable pages are tied to the page cache, direct I/O operations cannot use them. Direct I/O is a relatively advanced, low-level operation, where applications take charge of their own buffer management, so problems from concurrent data changes have generally not been seen. Increasingly, though, there is interest in using direct I/O in conjunction with filesystem-integrity features, or with hardware features like the data integrity field (also known as "protection information" or "PI"), which provides end-to-end checksum protection.
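To make the failure mode concrete, here is a minimal, hypothetical C sketch of the pattern being discussed: one thread submits an O_DIRECT write while another modifies the same buffer. The file name, the 4KB buffer size, and the use of the page size as the I/O alignment are assumptions for illustration; real code should query the device's actual alignment requirements, and whether the race corrupts anything depends on the filesystem and device features described above.

```c
/*
 * Minimal sketch (not taken from any of the patches discussed here) of
 * the race: one thread submits an O_DIRECT write while another thread
 * modifies the same buffer.  File name, size, and alignment are
 * arbitrary assumptions.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096          /* one page; also the assumed alignment */

static char *buf;
static int fd;

static void *writer(void *arg)
{
        /* The device may DMA from buf at any point during this call. */
        if (pwrite(fd, buf, BUF_SIZE, 0) != BUF_SIZE)
                perror("pwrite");
        return NULL;
}

int main(void)
{
        pthread_t t;

        fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign((void **)&buf, BUF_SIZE, BUF_SIZE))
                return 1;
        memset(buf, 'A', BUF_SIZE);

        pthread_create(&t, NULL, writer, NULL);
        /*
         * Scribbling on the buffer here races with the in-flight write;
         * on a checksumming filesystem or a PI-capable device, the data
         * and its checksum can end up disagreeing.
         */
        memset(buf, 'B', BUF_SIZE);

        pthread_join(&t, NULL);
        close(fd);
        return 0;
}
```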
In May 2025, Qu Wenruo posted a patch describing problems that come about when user space performs direct I/O to a Btrfs filesystem with checksums enabled, then modifies its buffers while the I/O is underway. As described above, that can create checksum errors. Evidently, QEMU can be configured to use direct I/O and is prone to just this kind of failure. The solution that was adopted (and merged for 6.15), is to simply fall back to buffered I/O on filesystems where checksums are in use. That solves the problem, but at a heavy cost; direct I/O is now simply incompatible with checksums on Btrfs filesystems, and applications that depend on direct I/O for performance will be slowed down accordingly. There is a similar patch under consideration to fall back to buffered I/O for reads as well, since concurrent modifications to the buffer being read into can cause checksum failures.
The Btrfs change was applied without much fanfare; Christoph Hellwig's patch adding a similar fallback to XFS attracted rather more attention. XFS does not support data checksums like Btrfs does, but block devices can perform their own checksumming; they need stable pages to do that work successfully. As already noted, stable pages don't work with direct I/O so, in cases where the underlying block device requires stable pages, Hellwig's patch causes XFS to fall back to buffered I/O even when direct I/O has been requested. He expressed unhappiness about the resulting performance penalty, and suggested a few ways for "applications that know what they are doing" to disable this fallback.
Bart Van Assche had an alternative suggestion: "only fall back to buffered I/O for buggy software that modifies direct I/O buffers before I/O has completed". Dave Chinner also questioned the patch, suggesting that buffered I/O should only be forced for applications that are "known to corrupt stable page based block devices", of which he suspected there are relatively few. Geoff Beck suggested setting the page permissions to prevent changes to memory while it is under direct I/O.
Hellwig's responses to those suggestions highlight the core of the disagreement around this topic. Is changing a buffer that is under I/O an acceptable thing for an application to do? Hellwig repeatedly insisted that it is indeed acceptable: "Given that we never claimed that you can't modify the buffer I would not call it buggy, even if the behavior is unfortunate". Concurrent modification works in many cases, he said; his objective is to make it work in the remaining cases, even at the cost of severely reduced I/O performance.
Chinner, instead, was adamant that the kernel should not try to support applications that perform this sort of modification:
Remember: O_DIRECT means the application takes full responsibility for ensuring IO concurrency semantics are correctly implemented. Modifying IO buffers whilst the IO buffer is being read from or written to by the hardware has always been an IO concurrency bug in the application.

The behaviour being talked about here is, and always has been, an application IO concurrency bug, regardless of PI, stable writes, etc. Such an application bug existing is *not a valid reason for the kernel or filesystem to disable Direct IO*.
Hellwig, though, reiterated that the kernel has never documented any requirement that buffers be untouched while under direct I/O: "We can't just make that up 20+ years later". He also said that, on RAID devices, modifying the data being written could corrupt an entire stripe, with possible effects on data unrelated to the offending application. At that point, he said, it becomes a kernel bug that could let malicious users corrupt others' data. Jan Kara said that he agreed that concurrent modification of I/O buffers is an application bug, but he added that the kernel needs to do something about the problem anyway.
Darrick Wong, addressing specifically the PI case where the hardware verifies checksums provided by the software, observed that, in most cases, applications will be unable to corrupt their own data by modifying buffers under direct I/O without the corruption being detected by the hardware. In those cases, there is no need to fall back to buffered I/O. The RAID case is worse, though, and must be prevented. Wong suggested adding a new block-device attribute (with the name "mutation_blast_radius") describing how serious a concurrent modification would be and only falling back in the most serious cases. As Hellwig pointed out, though, that solution only addresses the PI case, and would cause QEMU to fail.
While no firm conclusions were reached in this conversation, it seems clear
that, regardless of how one views the behavior of applications that modify
data that is under I/O, the kernel probably
needs to do something to prevent, at a minimum, the worst problems. The real
question will be whether it will be possible to provide a way for
well-behaved applications to avoid the extra overhead of buffered I/O without creating opportunities for malicious
actors. It may turn out that, in the end, some types of storage device
will simply not be usable in the direct-I/O mode.
| Index entries for this article | |
|---|---|
| Kernel | Block layer/Direct I/O |
| Kernel | Data integrity |
| Kernel | Stable pages |
Corner case
Posted Nov 12, 2025 16:35 UTC (Wed)
by abatters (✭ supporter ✭, #6932)
[Link] (4 responses)
Example:

4k buffer (one page)
write #1: 2k @ offset 0 written to sector 100
write #2: 2k @ offset 2k written to sector 200
Posted Nov 12, 2025 17:42 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
My understanding is that your corner case as shown would not trigger conflict detection, because the buffer is not changed by the write. The problem would come with two reads to different parts of the same page (at sector alignment, on a system where sectors are smaller than pages), where page granularity would detect two places changing the same page.

This is a fixable problem if it's a common false positive - once page granularity detection says "two direct I/O reads to same page", you'd add in a check for overlap there.
Posted Nov 12, 2025 20:02 UTC (Wed)
by kreijack (guest, #43513)
[Link] (2 responses)
Maybe I am misunderstanding something, but direct I/O has some requirements. One of these is that the read/write size must be a multiple of the page size, and the offset must be a multiple of the page size. So under direct I/O it is not possible to update a portion of a page, because you have to write (at least) a full page.
Posted Nov 12, 2025 20:20 UTC (Wed)
by abatters (✭ supporter ✭, #6932)
[Link]
Posted Nov 12, 2025 20:23 UTC (Wed)
by koverstreet (✭ supporter ✭, #4296)
[Link]
We recently ran into an application (rocksdb) that didn't work on bcachefs in largebs mode because of this - it was only checking dma alignment, not offset.
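Both constraints can be queried rather than guessed. A minimal sketch (not from rocksdb or bcachefs) using statx() with STATX_DIOALIGN, available since Linux 6.1, with a placeholder file name:

```c
/*
 * Sketch of querying direct-I/O alignment with statx(STATX_DIOALIGN)
 * instead of assuming page-size alignment.  Requires Linux 6.1+ and
 * headers new enough to expose the fields; the path is a placeholder.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
        struct statx stx;

        if (statx(AT_FDCWD, "testfile", 0, STATX_DIOALIGN, &stx))
                return 1;

        if (stx.stx_mask & STATX_DIOALIGN)
                printf("memory alignment: %u, offset alignment: %u\n",
                       stx.stx_dio_mem_align, stx.stx_dio_offset_align);
        else
                printf("direct I/O alignment not reported\n");
        return 0;
}
```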
There's a better solution than falling back to buffered io
Posted Nov 12, 2025 17:41 UTC (Wed)
by koverstreet (✭ supporter ✭, #4296)
[Link] (7 responses)
Posted Nov 12, 2025 18:59 UTC (Wed)
by nickodell (subscriber, #125165)
[Link] (4 responses)
Hellwig's initial patch says
>This series tries to address this by falling back to uncached buffered I/O. Given that this requires an extra copy it is usually going to be a slow down, especially for very high bandwith use cases, so I'm not exactly happy about.
I assume the bounce buffer also requires an additional copy, so what makes it faster than the approach here?
Posted Nov 12, 2025 19:26 UTC (Wed)
by koverstreet (✭ supporter ✭, #4296)
[Link] (1 responses)
O_DIRECT tends to be used where the application knows the pagecache is not going to be useful - ignoring what the application communicated and using buffered IO is a massive behavioral change and just not a good idea.
If you just bounce IOs, the only extra overhead you're paying for is allocating/freeing bounce buffers (generally quite fast, thanks to percpu freelists), and the memcpy - which, as I mentioned, is noise when we're already touching the data to checksum.
And you only have to pay for that on writes. Reads can be normal zero copy O_DIRECT reads: if you get a checksum error, and the buffer is mapped to userspace (i.e. might have been scribbled over), you retry it with a bounce buffer before treating it like a "real" checksum error.
(All of this is what bcachefs has done for years.)
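A rough user-space sketch of the flow described here, illustrative only: a toy checksum and an in-memory "device" stand in for the real thing, and this is not bcachefs code.

```c
/* Bounce-on-write, retry-read-on-mismatch flow (illustration only). */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 4096

static uint8_t device_data[BLOCK];      /* stand-in for the block device */
static uint32_t device_csum;

static uint32_t csum(const uint8_t *p, size_t n)
{
        uint32_t c = 0;                 /* toy checksum, not CRC32C */
        while (n--)
                c = c * 31 + *p++;
        return c;
}

/* Write path: copy through a bounce buffer and checksum that copy, so
 * the data written always matches the checksum even if the caller keeps
 * modifying user_buf. */
static int dio_write(const uint8_t *user_buf)
{
        uint8_t *bounce = malloc(BLOCK);

        if (!bounce)
                return -1;
        memcpy(bounce, user_buf, BLOCK);        /* the only extra copy */
        device_csum = csum(bounce, BLOCK);
        memcpy(device_data, bounce, BLOCK);     /* "submit" the I/O */
        free(bounce);
        return 0;
}

/* Read path: the real thing DMAs straight into the user buffer; only if
 * the checksum fails (the user may have scribbled on the buffer during
 * the read) retry through a bounce buffer before reporting an error. */
static int dio_read(uint8_t *user_buf)
{
        memcpy(user_buf, device_data, BLOCK);   /* stands in for the DMA */
        if (csum(user_buf, BLOCK) == device_csum)
                return 0;

        uint8_t *bounce = malloc(BLOCK);
        if (!bounce)
                return -1;
        memcpy(bounce, device_data, BLOCK);     /* retry, stable buffer */
        int ret = (csum(bounce, BLOCK) == device_csum) ? 0 : -1;
        if (!ret)
                memcpy(user_buf, bounce, BLOCK);
        free(bounce);
        return ret;                             /* -1: "real" checksum error */
}

int main(void)
{
        uint8_t buf[BLOCK] = { 1, 2, 3 };

        dio_write(buf);
        return dio_read(buf);
}
```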
Posted Nov 12, 2025 21:05 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Nov 12, 2025 19:28 UTC (Wed)
by Paf (subscriber, #91811)
[Link] (1 responses)
But this becomes proportionally less true with large folios, probably to the point of not really being true at larger sizes, since the overhead is spread over much more data.
Posted Nov 12, 2025 19:36 UTC (Wed)
by koverstreet (✭ supporter ✭, #4296)
[Link]
Posted Nov 12, 2025 19:35 UTC (Wed)
by quwenruo_suse (subscriber, #124148)
[Link] (1 responses)
The huge performance drop in my previous observations is caused by an unoptimized checksum implementation (kvm64 has no hardware-accelerated CRC32C).
Posted Nov 12, 2025 19:57 UTC (Wed)
by koverstreet (✭ supporter ✭, #4296)
[Link]