| From: | Pavel Begunkov <asml.silence-AT-gmail.com> |
| To: | io-uring-AT-vger.kernel.org |
| Subject: | [PATCH v4 0/6] BPF controlled io_uring |
| Date: | Tue, 27 Jan 2026 10:14:04 +0000 |
| Message-ID: | <cover.1769470552.git.asml.silence@gmail.com> |
| Cc: | asml.silence-AT-gmail.com, bpf-AT-vger.kernel.org |
Note: I'll be targeting 7.1, as it's already rc7 and the series can
use some time to settle down.
This series introduces a way to override the standard io_uring_enter
syscall execution with an extensible event loop, which can be controlled
by BPF via a new io_uring struct_ops or from within the kernel.
There are multiple use cases I want to cover with this:
- Syscall avoidance. Instead of returning to userspace for
  CQE processing, a part of the logic can be moved into BPF to
  avoid an excessive number of syscalls.
- Access to in-kernel io_uring resources. For example, there are
  registered buffers that can't be directly accessed by userspace,
  however we can give BPF the ability to peek at them. It can be used
  to take a look at in-buffer app level headers to decide what to do
  with the data next and to issue IO using it.
- Smarter request ordering and linking. Request links are pretty
  limited and inflexible as they can't pass information from one
  request to another. With BPF we can peek at CQEs and memory and
  compile a subsequent request.
- Feature semi-deprecation. It can be used to simplify handling
  of deprecated features by moving it into the callback, out of core
  io_uring. For example, it should be trivial to simulate
  IOSQE_IO_DRAIN. Another target could be request linking logic.
- It can serve as a base for custom algorithms and fine tuning.
  Often, it'd be impractical to introduce a generic feature because
  it's either niche or requires a lot of configuration. For example,
  there is support for min-wait, however BPF can help to further fine
  tune it by doing it in multiple steps with different numbers of
  CQEs / timeouts. Another feature people were asking about is allowing
  SQEs to be over-queued while the kernel maintains a given QD.
- Smarter polling. NAPI polling is performed only once per syscall
  and then it switches to waiting. We can do better and intermix
  polling with waiting using the hook.
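To make the queue depth use case from the list above more concrete, here is a minimal userspace sketch of the policy only. It assumes a toy model in which the application has over-queued a backlog of requests and each loop step tops the in-flight count back up to a target QD. Every name in it (toy_qd_state, toy_qd_refill, toy_qd_complete, toy_qd_drain) is hypothetical and is not part of this series or of any io_uring API.

```c
#include <assert.h>

/* Hypothetical toy state; none of these names come from the series. */
struct toy_qd_state {
	unsigned backlog;	/* requests the app over-queued */
	unsigned inflight;	/* requests currently submitted */
	unsigned target_qd;	/* queue depth to maintain */
	unsigned completed;	/* requests finished so far */
	unsigned max_seen;	/* highest in-flight count observed */
};

/* One loop step: top the in-flight count back up to target_qd. */
static unsigned toy_qd_refill(struct toy_qd_state *s)
{
	unsigned room = s->target_qd > s->inflight ?
			s->target_qd - s->inflight : 0;
	unsigned nr = room < s->backlog ? room : s->backlog;

	s->backlog -= nr;
	s->inflight += nr;
	if (s->inflight > s->max_seen)
		s->max_seen = s->inflight;
	return nr;
}

/* Simulate up to nr completions arriving while the loop waits. */
static void toy_qd_complete(struct toy_qd_state *s, unsigned nr)
{
	if (nr > s->inflight)
		nr = s->inflight;
	s->inflight -= nr;
	s->completed += nr;
}

/* Alternate refill and completion until all work is done. */
static unsigned toy_qd_drain(struct toy_qd_state *s, unsigned per_wait)
{
	unsigned steps = 0;

	if (per_wait == 0)
		per_wait = 1;	/* avoid spinning forever in the toy model */

	while (s->backlog || s->inflight) {
		toy_qd_refill(s);
		toy_qd_complete(s, per_wait);
		steps++;
	}
	return steps;
}
```

With a backlog of 10, a target QD of 4 and two completions per wait, the in-flight count in this model never exceeds 4 while all 10 requests complete. A real callback would take these decisions from CQ/SQ state rather than from counters like these.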
It might need more specialised kfuncs in the future, but the core
functionality is implemented with just two simple functions. One
returns region memory, which gives BPF access to CQ/SQ/etc. And
the second is for submitting requests. It's also given a structure
as an argument, which is used to pass waiting parameters.
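The shape described above (a step callback, a submission helper and a structure carrying waiting parameters) can be modelled in plain userspace C. This is only a sketch under stated assumptions: the types and functions below (toy_ring, toy_loop_params, toy_submit, toy_wait, toy_run_loop, toy_sequential_step) are invented for illustration and are not the kfuncs or struct_ops definitions from the patches, and completions are simulated as if every request were a nop.

```c
#include <assert.h>

/* All names are hypothetical; this is a userspace toy, not the kernel API. */
enum toy_step_ret { TOY_LOOP_CONTINUE, TOY_LOOP_STOP };

/* Waiting parameters the callback fills in for the next wait. */
struct toy_loop_params {
	unsigned cq_wait_nr;	/* wake once this many completions are pending */
};

struct toy_ring {
	unsigned sq_queued;	/* queued but not yet submitted */
	unsigned inflight;	/* submitted, not yet completed */
	unsigned cq_ready;	/* completed, not yet consumed */
	unsigned total_done;	/* completions consumed so far */
};

typedef enum toy_step_ret (*toy_step_fn)(struct toy_ring *ring,
					 struct toy_loop_params *lp);

/* Stand-in for the submission helper: submit up to nr queued requests. */
static unsigned toy_submit(struct toy_ring *ring, unsigned nr)
{
	if (nr > ring->sq_queued)
		nr = ring->sq_queued;
	ring->sq_queued -= nr;
	ring->inflight += nr;
	return nr;
}

/* Stand-in for waiting: complete requests until the wait target is met. */
static void toy_wait(struct toy_ring *ring, const struct toy_loop_params *lp)
{
	while (ring->cq_ready < lp->cq_wait_nr && ring->inflight) {
		ring->inflight--;
		ring->cq_ready++;
	}
}

/* The main loop: call the step callback, then wait as it requested. */
static unsigned toy_run_loop(struct toy_ring *ring, toy_step_fn step)
{
	struct toy_loop_params lp = { .cq_wait_nr = 0 };
	unsigned steps = 0;

	for (;;) {
		steps++;
		if (step(ring, &lp) == TOY_LOOP_STOP)
			break;
		toy_wait(ring, &lp);
	}
	return steps;
}

/* Example callback: run the queued requests strictly one after another. */
static enum toy_step_ret toy_sequential_step(struct toy_ring *ring,
					     struct toy_loop_params *lp)
{
	/* Consume whatever completed since the previous step. */
	ring->total_done += ring->cq_ready;
	ring->cq_ready = 0;

	if (!ring->sq_queued && !ring->inflight)
		return TOY_LOOP_STOP;

	toy_submit(ring, 1);
	lp->cq_wait_nr = 1;
	return TOY_LOOP_CONTINUE;
}
```

In this model the whole sequence runs inside a single toy_run_loop() call, which is the point of the syscall avoidance use case: the callback decides what to submit and how long to wait without handing control back to the caller between requests.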
It showed good numbers in a test that sequentially executes N nop
requests, where BPF was more than twice as fast as a 2-nop
request link implementation.
Pavel Begunkov (6):
  io_uring: introduce callback driven main loop
  io_uring/bpf-ops: add basic bpf struct_ops boilerplate
  io_uring/bpf-ops: add loop_step struct_ops callback
  io_uring/bpf-ops: add kfunc helpers
  io_uring/bpf-ops: add bpf struct ops registration
  selftests/io_uring: add a bpf io_uring selftest
 include/linux/io_uring_types.h               |  10 +
 io_uring/Kconfig                             |   5 +
 io_uring/Makefile                            |   3 +-
 io_uring/bpf-ops.c                           | 265 +++++++++++++++++++
 io_uring/bpf-ops.h                           |  28 ++
 io_uring/io_uring.c                          |   8 +
 io_uring/loop.c                              |  88 ++++++
 io_uring/loop.h                              |  27 ++
 tools/testing/selftests/Makefile             |   3 +-
 tools/testing/selftests/io_uring/Makefile    | 143 ++++++++++
 tools/testing/selftests/io_uring/basic.bpf.c | 116 ++++++++
 tools/testing/selftests/io_uring/common.h    |   6 +
 tools/testing/selftests/io_uring/runner.c    | 107 ++++++++
 tools/testing/selftests/io_uring/types.bpf.h | 131 +++++++++
 14 files changed, 938 insertions(+), 2 deletions(-)
 create mode 100644 io_uring/bpf-ops.c
 create mode 100644 io_uring/bpf-ops.h
 create mode 100644 io_uring/loop.c
 create mode 100644 io_uring/loop.h
 create mode 100644 tools/testing/selftests/io_uring/Makefile
 create mode 100644 tools/testing/selftests/io_uring/basic.bpf.c
 create mode 100644 tools/testing/selftests/io_uring/common.h
 create mode 100644 tools/testing/selftests/io_uring/runner.c
 create mode 100644 tools/testing/selftests/io_uring/types.bpf.h
--
2.52.0