
The intersection of unstable pages and direct I/O


By Jonathan Corbet
November 12, 2025
Longtime LWN readers will have encountered the concept of "stable pages" before; it was first covered here nearly 15 years ago. For the most part, the problem that stable pages were meant to solve — preventing errors when user space modifies a buffer that is under I/O — has been dealt with. But recent discussions show that there is one area where problems remain: direct I/O. There is some disagreement, though, over whether those problems are the result of user-space bugs and how much of a performance price should be paid to address them.

Writing a page of data to a block storage device takes time. If a process tells the kernel to perform a write, some time will thus pass before the operation is complete. Should that process modify the under-I/O data while it is being written, the result is a sort of data race with the usual unpredictable results. Either the old or the new data could end up being written; in the worst case, a combination of the old and new data could be written, corrupting the file.
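As a rough illustration of the race (this sketch is not from the article or the discussion), the following program starts a thread that keeps scribbling on a buffer while the main thread writes that same buffer out; the file name and buffer size are arbitrary, and what actually lands on disk depends entirely on timing:

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096

static char buf[BUF_SIZE];
static volatile int writing = 1;

/* Keep scribbling over the buffer while the write may be in flight. */
static void *scribbler(void *arg)
{
	while (writing)
		memset(buf, rand() & 0xff, BUF_SIZE);
	return NULL;
}

int main(void)
{
	pthread_t t;
	/* "testfile" is an arbitrary name used for illustration. */
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return 1;
	memset(buf, 'A', BUF_SIZE);
	pthread_create(&t, NULL, scribbler, NULL);
	/* What ends up on disk (old data, new data, or a mix of the two)
	 * depends entirely on timing. */
	pwrite(fd, buf, BUF_SIZE, 0);
	writing = 0;
	pthread_join(t, NULL);
	close(fd);
	return 0;
}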

In simple cases, this behavior is less of a problem than it seems; if the application is modifying the pages under I/O, it is likely to write them to persistent storage another time, and the end result will be as it should be. But if, for example, the filesystem is writing checksums alongside the data, changing the data partway through can cause the resulting checksum to not match. Other types of processing during I/O, such as transparent compression or a RAID implementation, can run into the same sort of trouble. In these cases, a badly timed change to a page under I/O can lead to I/O errors or data corruption.

The solution to this problem usually takes the form of stable pages — a guarantee made by the kernel that data will not be changed while it is under I/O. Stability can be achieved by, for example, simply delaying any attempts to modify the data until the I/O operation is complete. For buffered I/O, where data is copied through the page cache, implementing this approach is relatively straightforward, and was done years ago. Delaying memory access can reduce performance, though, so it is usually only done when there is a clear need.
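The kernel-internal shape of that approach looks roughly like the schematic below: a page_mkwrite() fault handler that makes the faulting process wait until writeback of the folio has finished. This is a simplified sketch of a common pattern, not any particular filesystem's actual handler:

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Schematic ->page_mkwrite() handler: block a write fault until any
 * writeback of the folio completes, so the data stays stable while it
 * is under I/O. Simplified; real handlers do more checking and
 * bookkeeping. */
static vm_fault_t example_page_mkwrite(struct vm_fault *vmf)
{
	struct folio *folio = page_folio(vmf->page);

	folio_lock(folio);
	/* ... verify the folio is still attached to the file ... */
	folio_wait_stable(folio);	/* sleep until writeback is done */
	return VM_FAULT_LOCKED;
}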

Direct I/O, however, is a special case. When a process requests direct I/O, it wants to transfer data directly between its buffers and the underlying storage device without going through the page cache. Since stable pages are tied to the page cache, direct I/O operations cannot use them. Direct I/O is a relatively advanced, low-level operation, where applications take charge of their own buffer management, so problems from concurrent data changes have generally not been seen. Increasingly, though, there is interest in using direct I/O in conjunction with filesystem-integrity features, or with hardware features like the data integrity field (also known as "protection information" or "PI"), which provides end-to-end checksum protection.
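A minimal direct write from user space might look like the sketch below; the 4096-byte buffer alignment and I/O size, along with the file name, are assumptions, and a careful application would query the real alignment requirements (for example with statx(), as discussed in the comments below):

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	/* "testfile" and the 4096-byte alignment/size are assumptions. */
	int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	if (fd < 0)
		return 1;
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 'A', 4096);
	/* Data moves straight between buf and the device, bypassing the
	 * page cache; buf should not be touched until pwrite() returns. */
	if (pwrite(fd, buf, 4096, 0) != 4096)
		return 1;
	free(buf);
	close(fd);
	return 0;
}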


In May 2025, Qu Wenruo posted a patch describing problems that come about when user space performs direct I/O to a Btrfs filesystem with checksums enabled, then modifies its buffers while the I/O is underway. As described above, that can create checksum errors. Evidently, QEMU can be configured to use direct I/O and is prone to just this kind of failure. The solution that was adopted (and merged for 6.15) is to simply fall back to buffered I/O on filesystems where checksums are in use. That solves the problem, but at a heavy cost; direct I/O is now simply incompatible with checksums on Btrfs filesystems, and applications that depend on direct I/O for performance will be slowed down accordingly. There is a similar patch under consideration to fall back to buffered I/O for reads as well, since concurrent modifications to the buffer being read into can cause checksum failures.

The Btrfs change was applied without much fanfare; Christoph Hellwig's patch adding a similar fallback to XFS attracted rather more attention. XFS does not support data checksums like Btrfs does, but block devices can perform their own checksumming; they need stable pages to do that work successfully. As already noted, stable pages don't work with direct I/O so, in cases where the underlying block device requires stable pages, Hellwig's patch causes XFS to fall back to buffered I/O even when direct I/O has been requested. He expressed unhappiness about the resulting performance penalty, and suggested a few ways for "applications that know what they are doing" to disable this fallback.
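Whether a given block device has asked for stable pages is visible in sysfs as the queue's stable_writes attribute; a quick check might look like this sketch, where the device name is only an example:

#include <stdio.h>

int main(void)
{
	char flag;
	/* "sda" is just an example device name. */
	FILE *f = fopen("/sys/block/sda/queue/stable_writes", "r");

	if (!f) {
		perror("stable_writes");
		return 1;
	}
	if (fscanf(f, " %c", &flag) == 1)
		printf("stable writes %s by this device\n",
		       flag == '1' ? "requested" : "not requested");
	fclose(f);
	return 0;
}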

Bart Van Assche had an alternative suggestion: "only fall back to buffered I/O for buggy software that modifies direct I/O buffers before I/O has completed". Dave Chinner questioned the patch as well, suggesting that buffered I/O should only be forced for applications that are "known to corrupt stable page based block devices", of which he suspected there are relatively few. Geoff Beck suggested setting the page permissions to prevent changes to memory while it is under direct I/O.

Hellwig's responses to those suggestions highlight the core of the disagreement around this topic. Is changing a buffer that is under I/O an acceptable thing for an application to do? Hellwig repeatedly insisted that it is indeed acceptable: "Given that we never claimed that you can't modify the buffer I would not call it buggy, even if the behavior is unfortunate". Concurrent modification works in many cases, he said; his objective is to make it work in the remaining cases, even at the cost of severely reduced I/O performance.

Chinner, instead, was adamant that the kernel should not try to support applications that perform this sort of modification:

Remember: O_DIRECT means the application takes full responsibility for ensuring IO concurrency semantics are correctly implemented. Modifying IO buffers whilst the IO buffer is being read from or written to by the hardware has always been an IO concurrency bug in the application.

The behaviour being talked about here is, and always has been, an application IO concurrency bug, regardless of PI, stable writes, etc. Such an application bug existing is *not a valid reason for the kernel or filesystem to disable Direct IO*.

Hellwig, though, reiterated that the kernel has never documented any requirement that buffers be untouched while under direct I/O: "We can't just make that up 20+ years later". He also said that, on RAID devices, modifying the data being written could corrupt an entire stripe, with possible effects on data unrelated to the offending application. At that point, he said, it becomes a kernel bug that could let malicious users corrupt others' data. Jan Kara said that he agreed that concurrent modification of I/O buffers is an application bug, but he added that the kernel needs to do something about the problem anyway.

Darrick Wong, addressing specifically the PI case where the hardware verifies checksums provided by the software, observed that, in most cases, applications will be unable to corrupt their own data by modifying buffers under direct I/O without the corruption being detected by the hardware. In those cases, there is no need to fall back to buffered I/O. The RAID case is worse, though, and must be prevented. Wong suggested adding a new block-device attribute (with the name "mutation_blast_radius") describing how serious a concurrent modification would be and only falling back in the most serious cases. As Hellwig pointed out, though, that solution only addresses the PI case, and would cause QEMU to fail.

While no firm conclusions were reached in this conversation, it seems clear that, regardless of how one views the behavior of applications that modify data that is under I/O, the kernel probably needs to do something to prevent, at a minimum, the worst problems. The real question will be whether it will be possible to provide a way for well-behaved applications to avoid the extra overhead of buffered I/O without creating opportunities for malicious actors. It may turn out that, in the end, some types of storage device will simply not be usable in the direct-I/O mode.





Corner case

Posted Nov 12, 2025 16:35 UTC (Wed) by abatters (✭ supporter ✭, #6932) [Link] (4 responses)

I use a lot of direct I/O code that writes to a raw partition (e.g. /dev/sdb1). One interesting corner case is if you have a single page involved in two non-overlapping direct I/Os simultaneously at different offsets. The I/Os won't conflict with each other logically, but if they add these restrictions, the kernel may incorrectly detect a conflict based on page granularity.

Example:
4k buffer (one page)
write #1: 2k @ offset 0 written to sector 100
write #2: 2k @ offset 2k written to sector 200
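(As a concrete sketch of that corner case: the device name and offsets below are taken from the example above, but the synchronous pwrite() calls stand in for genuinely concurrent submissions via AIO or io_uring, and a 512-byte logical block size is assumed.)

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	void *page;
	int fd = open("/dev/sdb1", O_WRONLY | O_DIRECT);

	if (fd < 0)
		return 1;
	if (posix_memalign(&page, 4096, 4096))
		return 1;
	/* write #1: first 2KiB of the page to sector 100 */
	pwrite(fd, page, 2048, 100 * 512L);
	/* write #2: second 2KiB of the same page to sector 200 */
	pwrite(fd, (char *)page + 2048, 2048, 200 * 512L);
	/* The two writes touch disjoint halves of a single page, yet
	 * page-granularity conflict detection could flag them as a
	 * conflict. */
	free(page);
	close(fd);
	return 0;
}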

Corner case

Posted Nov 12, 2025 17:42 UTC (Wed) by farnz (subscriber, #17727) [Link]

My understanding is that your corner case as shown would not trigger conflict detection, because the buffer is not changed by the write. The problem would come with two reads to different parts of the same page (at sector alignment, on a system where sectors are smaller than pages), where page granularity would detect two places changing the same page.

This is a fixable problem if it's a common false positive - once page granularity detection says "two direct I/O reads to same page", you'd add in a check for overlap there.

Corner case

Posted Nov 12, 2025 20:02 UTC (Wed) by kreijack (guest, #43513) [Link] (2 responses)

> One interesting corner case is if you have a single page involved in two non-overlapping direct I/Os simultaneously at different offsets. The I/Os won't conflict with each other logically, but if they add these restrictions, the kernel may incorrectly detect a conflict based on page granularity.

Maybe I am misunderstanding something, but direct I/O has some requirements. One of these is that the read/write size must be a multiple of the page size, and the offset must be a multiple of the page size. So under direct I/O it is not possible to update a portion of a page, because you have to write (at least) the full page.

Corner case

Posted Nov 12, 2025 20:20 UTC (Wed) by abatters (✭ supporter ✭, #6932) [Link]

The alignment requirements have been relaxed over the years. See the O_DIRECT section under:

https://www.man7.org/linux/man-pages/man2/open.2.html#NOTES

Corner case

Posted Nov 12, 2025 20:23 UTC (Wed) by koverstreet (✭ supporter ✭, #4296) [Link]

It's important to note that there are different alignment requirements for the buffer - "dma alignment" - and the file offset, and they both have to be checked for (via statx).

We recently ran into an application (rocksdb) that didn't work on bcachefs in largebs mode because of this - it was only checking dma alignment, not offset.
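(For reference, a check of both requirements might look like the sketch below; it assumes Linux 6.1 or later and a C library that exposes STATX_DIOALIGN, and the path is only an example.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct statx stx;

	/* "testfile" is just an example path. */
	if (statx(AT_FDCWD, "testfile", 0, STATX_DIOALIGN, &stx) != 0) {
		perror("statx");
		return 1;
	}
	if (!(stx.stx_mask & STATX_DIOALIGN)) {
		fprintf(stderr, "no direct-I/O alignment info for this file\n");
		return 1;
	}
	/* Both values must be honored: memory (dma) alignment of the
	 * buffer, and alignment of the file offset and length. */
	printf("memory (dma) alignment: %u\n", stx.stx_dio_mem_align);
	printf("file offset alignment:  %u\n", stx.stx_dio_offset_align);
	return 0;
}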

There's a better solution than falling back to buffered io

Posted Nov 12, 2025 17:41 UTC (Wed) by koverstreet (✭ supporter ✭, #4296) [Link] (7 responses)

you just _bounce_ - drastically cheaper, especially considering that if you're checksumming you're touching the buffer anyways

There's a better solution than falling back to buffered io

Posted Nov 12, 2025 18:59 UTC (Wed) by nickodell (subscriber, #125165) [Link] (4 responses)

Could you clarify why this is better than going through the page cache?

Hellwig's initial patch says

>This series tries to address this by falling back to uncached buffered I/O. Given that this requires an extra copy it is usually going to be a slow down, especially for very high bandwith use cases, so I'm not exactly happy about.

I assume the bounce buffer also requires an additional copy, so what makes it faster than the approach here?

There's a better solution than falling back to buffered io

Posted Nov 12, 2025 19:26 UTC (Wed) by koverstreet (✭ supporter ✭, #4296) [Link] (1 responses)

Buffered IO is only fast when IO is mostly staying in cache. When it's not, it's _significantly_ slower; you incur all the overhead of walking and managing the page cache radix tree, background eviction, background writeback - doing all that stuff asynchronously is significantly more costly when you're not amortizing it by letting things stay in cache.

O_DIRECT tends to be used where the application knows the pagecache is not going to be useful - ignoring what the application communicated and using buffered IO is a massive behavioral change and just not a good idea.

If you just bounce IOs, the only extra overhead you're paying for is allocating/freeing bounce buffers (generally quite fast, thanks to percpu freelists), and the memcpy - which, as I mentioned, is noise when we're already touching the data to checksum.

And you only have to pay for that on writes. Reads can be normal zero copy O_DIRECT reads: if you get a checksum error, and the buffer is mapped to userspace (i.e. might have been scribbled over), you retry it with a bounce buffer before treating it like a "real" checksum error.

(This is all what bcachefs has done for years).
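(A rough user-space analogy of that bounce-on-write idea, not actual bcachefs code: the hypothetical helper below copies the caller's buffer, checksums the copy, and writes the copy, so later scribbling on the caller's buffer cannot invalidate what was checksummed and written. The checksum itself is a trivial stand-in.)

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Toy checksum standing in for a real one such as CRC32C. */
uint32_t toy_checksum(const void *p, size_t len)
{
	const uint8_t *b = p;
	uint32_t sum = 0;

	while (len--)
		sum = sum * 31 + *b++;
	return sum;
}

/* Hypothetical helper: write a private copy of the caller's buffer so
 * that later modification of that buffer cannot change what was
 * checksummed and written. Returns 0 on success and stores the checksum
 * of exactly the bytes that went out in *csum. */
int bounced_write(int fd, const void *user_buf, size_t len,
		  off_t off, uint32_t *csum)
{
	void *bounce;
	ssize_t ret;

	if (posix_memalign(&bounce, 4096, len))
		return -1;
	memcpy(bounce, user_buf, len);	/* the only extra cost vs. zero copy */
	*csum = toy_checksum(bounce, len);
	ret = pwrite(fd, bounce, len, off);
	free(bounce);
	return ret == (ssize_t)len ? 0 : -1;
}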

There's a better solution than falling back to buffered io

Posted Nov 12, 2025 21:05 UTC (Wed) by Wol (subscriber, #4433) [Link]

Wasn't that benchmarked a few years back? Somebody disabled caching on file copies, and especially with larger files the uncached copy was maybe ten times faster?

Cheers,
Wol

There's a better solution than falling back to buffered io

Posted Nov 12, 2025 19:28 UTC (Wed) by Paf (subscriber, #91811) [Link] (1 responses)

At least for small pages, much of the cost of using the page cache is actually in setup, insertion, and removal from the cache, *not* in memory allocation or data copying. Depending on the use case and hardware, simple bounce buffers can be *much* faster. And they can be done in parallel across many threads without conflicting on the tree locking for the page cache. (Even a read is a tree insert unless the data is already present.)

This becomes proportionally less true with large folios, though, probably to the point of not really being true at larger sizes, since the overhead is spread over much more data.

There's a better solution than falling back to buffered io

Posted Nov 12, 2025 19:36 UTC (Wed) by koverstreet (✭ supporter ✭, #4296) [Link]

IOPs keep going up, though. Forcing everything through the buffered IO paths is just crippling.

There's a better solution than falling back to buffered io

Posted Nov 12, 2025 19:35 UTC (Wed) by quwenruo_suse (subscriber, #124148) [Link] (1 responses)

I doubt it. I tried both approaches, bouncing and falling back to buffered IO, on btrfs, and saw no observable difference.

The huge performance drop in my previous observations was caused by an unoptimized checksum implementation (kvm64 has no hardware-accelerated CRC32C).

There's a better solution than falling back to buffered io

Posted Nov 12, 2025 19:57 UTC (Wed) by koverstreet (✭ supporter ✭, #4296) [Link]

and you thought that test would be representative...?

