
bounce buffer direct I/O when stable pages are required

From:  Christoph Hellwig <hch-AT-lst.de>
To:  Jens Axboe <axboe-AT-kernel.dk>, Christian Brauner <brauner-AT-kernel.org>
Subject:  bounce buffer direct I/O when stable pages are required
Date:  Wed, 14 Jan 2026 08:40:58 +0100
Message-ID:  <20260114074145.3396036-1-hch@lst.de>
Cc:  "Darrick J. Wong" <djwong-AT-kernel.org>, Carlos Maiolino <cem-AT-kernel.org>, Qu Wenruo <wqu-AT-suse.com>, Al Viro <viro-AT-zeniv.linux.org.uk>, linux-block-AT-vger.kernel.org, linux-xfs-AT-vger.kernel.org, linux-fsdevel-AT-vger.kernel.org

Hi all,

this series tries to address the problem that pages under I/O can be
modified during direct I/O, even when the device or file system
requires stable pages during I/O to calculate checksums, parity or
other derived data.  It does so by adding block layer helpers to
bounce buffer an iov_iter into a bio, and then wires that up in iomap
and ultimately XFS.
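The write-side hazard and the bounce fix can be illustrated with a
minimal userspace sketch (plain C with a toy checksum standing in for
T10 PI or parity; `bounce_for_write` is a made-up name, not one of the
actual helpers from this series): the data is snapshotted into a
private buffer before any checksum is computed, so a later
modification of the caller's buffer cannot invalidate it.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy checksum standing in for T10 PI guard tags or RAID parity. */
static uint32_t checksum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return sum;
}

/*
 * Write side of the bounce pattern: copy the caller's data into a
 * private buffer the caller can't touch, then do all checksumming
 * and I/O against that copy.  Returns NULL on allocation failure.
 */
static unsigned char *bounce_for_write(const unsigned char *user, size_t len)
{
	unsigned char *bounce = malloc(len);

	if (bounce)
		memcpy(bounce, user, len);
	return bounce;
}
```

With zero copy direct I/O the checksum is computed over the caller's
pages directly, so a concurrent store to those pages (from, say, a
guest OS or the swap code) races with the device reading them.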

The reason the file system even needs to know about this is that
reads need a user context to copy the data back, and the
infrastructure to defer ioends to a workqueue currently sits in XFS.
I'm going to look into moving that into the ioend code and enabling
it for other file systems.  Additionally, btrfs already has its own
infrastructure for this, and actually an urgent need to bounce
buffer, so this should be useful there and could be wired up easily.
In fact the idea comes from patches by Qu that did this in btrfs.
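The read-side deferral can be sketched the same way (again plain
userspace C; `read_ioend` and `read_ioend_work` are hypothetical
names, and a direct function call stands in for queueing work to a
workqueue): the device's data lands in the bounce buffer, and the
copy back into the caller's buffer is done later, in a context where
touching user memory is allowed.

```c
#include <stdlib.h>
#include <string.h>

/* Completion state for a bounced direct read (hypothetical layout). */
struct read_ioend {
	unsigned char *bounce;	/* buffer the device wrote into */
	unsigned char *user;	/* caller's destination buffer */
	size_t len;
};

/*
 * Deferred completion work: runs with a user context available, so
 * copying into the caller's buffer is safe.  In the kernel this
 * would be queued from the (atomic) bio end_io handler rather than
 * called directly.
 */
static void read_ioend_work(struct read_ioend *ioend)
{
	memcpy(ioend->user, ioend->bounce, ioend->len);
	free(ioend->bounce);
	ioend->bounce = NULL;
}
```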

This series fixes all but one xfstests failure on T10 PI capable
devices (generic/095 still seems to have issues with a mix of mmap
and splice, which I'm looking into separately), and makes qemu VMs
running Windows, or Linux with swap enabled, work fine on an XFS file
on a device using PI.

Performance numbers on my (not exactly state of the art) NVMe PI test
setup:

  Sequential reads using io_uring, QD=16.
  Bandwidth and CPU usage (usr/sys):

  | size |        zero copy         |          bounce          |
  +------+--------------------------+--------------------------+
  |   4k | 1316MiB/s (12.65/55.40%) | 1081MiB/s (11.76/49.78%) |
  |  64K | 3370MiB/s ( 5.46/18.20%) | 3365MiB/s ( 4.47/15.68%) |
  |   1M | 3401MiB/s ( 0.76/23.05%) | 3400MiB/s ( 0.80/09.06%) |
  +------+--------------------------+--------------------------+

  Sequential writes using io_uring, QD=16.
  Bandwidth and CPU usage (usr/sys):

  | size |        zero copy         |          bounce          |
  +------+--------------------------+--------------------------+
  |   4k |  882MiB/s (11.83/33.88%) |  750MiB/s (10.53/34.08%) |
  |  64K | 2009MiB/s ( 7.33/15.80%) | 2007MiB/s ( 7.47/24.71%) |
  |   1M | 1992MiB/s ( 7.26/ 9.13%) | 1992MiB/s ( 9.21/19.11%) |
  +------+--------------------------+--------------------------+

Note that the 64k read numbers look really odd to me for the baseline
zero copy case, but are reproducible over many repeated runs.

The bounce read numbers should further improve when moving the PI
validation into the file system and removing the double context
switch; I have patches for that which will be sent as soon as we are
done with this series.

Diffstat:
 block/bio.c           |  323 ++++++++++++++++++++++++++++++--------------------
 block/blk.h           |   11 -
 fs/iomap/direct-io.c  |  189 +++++++++++++++--------------
 fs/iomap/ioend.c      |    8 +
 fs/xfs/xfs_aops.c     |    8 -
 fs/xfs/xfs_file.c     |   41 +++++-
 include/linux/bio.h   |   26 ++++
 include/linux/iomap.h |    9 +
 include/linux/uio.h   |    3 
 lib/iov_iter.c        |   98 +++++++++++++++
 10 files changed, 490 insertions(+), 226 deletions(-)



Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds