BPF comes to io_uring at last
The kernel's asynchronous io_uring interface maintains two shared ring buffers: a submission queue for sending requests to the kernel, and a completion queue containing the results of those requests. Even with shared memory removing much of the overhead of communicating with user space, there is still some overhead whenever the kernel must switch to user space to give it the opportunity to process completion events and queue up any subsequent work items. A patch set from Pavel Begunkov reduces this overhead by letting programmers extend the io_uring event loop with a BPF program that can enqueue additional work in response to completion events. The patch set has been in development for a long time, but has finally been accepted.
To use io_uring, the programmer sets up appropriate shared buffers with io_uring_setup() and mmap() before putting a number of io_uring_sqe (submission queue entry) structures in the submission queue. The kernel can be notified of the presence of new entries to process in two ways: by setting up a dedicated kernel thread to poll the queue, or by having user space call io_uring_enter() periodically.
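The shared-queue idea can be illustrated with a toy, single-threaded ring in plain C. This is only a sketch of the head/tail indexing scheme: the names toy_ring, toy_entry, toy_ring_push() and toy_ring_pop() are hypothetical, the entry type is not struct io_uring_sqe, and the memory-ordering operations that a real ring shared between the kernel and user space requires are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                  /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Hypothetical stand-in for a queue entry; NOT struct io_uring_sqe. */
struct toy_entry {
    uint64_t user_data;
};

/* One producer advances tail, one consumer advances head. */
struct toy_ring {
    uint32_t head;
    uint32_t tail;
    struct toy_entry entries[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
bool toy_ring_push(struct toy_ring *r, struct toy_entry e)
{
    /* Unsigned subtraction stays correct even after the indices wrap. */
    if (r->tail - r->head == RING_ENTRIES)
        return false;
    r->entries[r->tail & RING_MASK] = e;
    r->tail++;
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool toy_ring_pop(struct toy_ring *r, struct toy_entry *out)
{
    if (r->head == r->tail)
        return false;
    *out = r->entries[r->head & RING_MASK];
    r->head++;
    return true;
}
```

In this model, the submission queue would have user space as the producer and the kernel as the consumer; the completion queue reverses those roles.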
When user space calls io_uring_enter(), the kernel first dispatches all of the items in the submission queue. After that, it can wait for a certain number of events to complete, wait for a timeout, or return to user space immediately, depending on which flags the system call was invoked with. Over time, the interface has been extended with ways to chain a series of operations together, such that one operation can depend on the outcome of another without requiring user space to act as an intermediary. For example, io_uring can be used to read a file and then send the contents over a socket asynchronously, without copying the data back to user space or performing a context switch. With the door opened to encoding more complex sequences of operations, people naturally wanted to handle cases just slightly more complex than a simple linear chain of operations.
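The chaining behavior can be modeled in user space without touching the kernel at all. The sketch below is an illustration under stated assumptions, not io_uring's implementation: TOY_LINK is a hypothetical flag modeled on IOSQE_IO_LINK, and toy_req, toy_run() and the toy_op_* functions are invented for the example. It shows the one rule that makes chains useful: a request runs only if its predecessor in the chain succeeded, and the rest of a chain is cancelled with -ECANCELED once a member fails.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_LINK 0x1u   /* hypothetical; modeled on IOSQE_IO_LINK */

typedef int (*toy_op_fn)(void);

struct toy_req {
    toy_op_fn op;       /* the work this request performs */
    unsigned flags;     /* TOY_LINK ties this request to the next one */
    int result;         /* filled in by toy_run() */
};

/* Sample operations: one pretends to move 4096 bytes, one fails. */
int toy_op_ok(void)   { return 4096; }
int toy_op_fail(void) { return -EIO; }

/* Run requests in order, honoring link chains. */
void toy_run(struct toy_req *reqs, size_t n)
{
    int linked_to_prev = 0;   /* did the previous request set TOY_LINK? */
    int chain_failed = 0;     /* has a member of the current chain failed? */

    for (size_t i = 0; i < n; i++) {
        if (!linked_to_prev)
            chain_failed = 0;             /* a new chain starts here */

        if (chain_failed) {
            reqs[i].result = -ECANCELED;  /* skipped, never executed */
        } else {
            reqs[i].result = reqs[i].op();
            if (reqs[i].result < 0)
                chain_failed = 1;
        }
        linked_to_prev = (reqs[i].flags & TOY_LINK) != 0;
    }
}
```

A three-request chain whose second member fails would leave the third with -ECANCELED, while an unlinked request queued after the chain still runs normally. Anything more elaborate than this linear success-or-cancel rule is what the chaining mechanism cannot express.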
This is where BPF comes in. Begunkov's patch set lets users associate a BPF struct ops program with io_uring queues; when user space calls io_uring_enter() on one of those queues, the BPF program will run instead of io_uring's normal event loop. The program can use the new bpf_io_uring_submit_sqes() kfunc to instruct the kernel to process entries from the submission queue, and the bpf_io_uring_get_region() kfunc to obtain access to the submission or completion queue in order to manipulate their contents.
The program then returns IOU_LOOP_CONTINUE to indicate that io_uring should call the program again after a configurable delay (or a set number of completions), or IOU_LOOP_STOP to return to user space. The BPF program may ask the kernel to loop as many times as it wishes, so it is theoretically possible to write a program that sets up io_uring, registers a BPF program, calls io_uring_enter() and then never returns to user space at all. Because the kernel keeps calling the program in a loop, this effectively bypasses BPF's limit on the number of operations that can be performed in a single program execution; if an application is already structured around an asynchronous event loop, it may be tempting to put more and more functionality into the BPF component. The BPF program does have to be tolerant of both spurious wakeups and potential cancellation by the kernel; if the task is killed, for example, the kernel will stop calling the BPF program and clean up the io_uring queues as normal.
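The control flow described above can be sketched as an ordinary C callback loop. This is a user-space model only, not the struct ops interface from the patch set: TOY_LOOP_CONTINUE and TOY_LOOP_STOP stand in for IOU_LOOP_CONTINUE and IOU_LOOP_STOP (whose actual values are not assumed here), toy_enter() stands in for the kernel side of io_uring_enter(), and toy_step() plays the role of the BPF program. The max_calls parameter is a hypothetical device to model the kernel cancelling the loop.

```c
#include <assert.h>

/* Hypothetical stand-ins for IOU_LOOP_CONTINUE / IOU_LOOP_STOP. */
enum toy_loop_ret { TOY_LOOP_CONTINUE, TOY_LOOP_STOP };

struct toy_ctx {
    unsigned to_submit;   /* work items not yet handed to the "kernel" */
    unsigned inflight;    /* submitted but not yet completed */
    unsigned completed;   /* finished items */
    unsigned calls;       /* how many times the callback has run */
};

typedef enum toy_loop_ret (*toy_loop_fn)(struct toy_ctx *ctx);

/* Plays the role of the BPF program: submit in batches of two. */
enum toy_loop_ret toy_step(struct toy_ctx *ctx)
{
    /* Safe under spurious wakeups: it only inspects current state. */
    if (ctx->to_submit == 0 && ctx->inflight == 0)
        return TOY_LOOP_STOP;             /* nothing left; go back to user space */

    unsigned batch = ctx->to_submit < 2 ? ctx->to_submit : 2;
    ctx->to_submit -= batch;
    ctx->inflight += batch;
    return TOY_LOOP_CONTINUE;             /* ask to be called again */
}

/* Plays the role of the kernel loop; max_calls models cancellation. */
unsigned toy_enter(struct toy_ctx *ctx, toy_loop_fn fn, unsigned max_calls)
{
    for (unsigned i = 0; i < max_calls; i++) {
        ctx->calls++;
        if (fn(ctx) == TOY_LOOP_STOP)
            break;
        /* Pretend to wait: every in-flight item completes. */
        ctx->completed += ctx->inflight;
        ctx->inflight = 0;
    }
    return ctx->completed;
}
```

With five items queued, the callback in this model runs four times (three batches and a final call that returns the stop value) before control would return to user space. Capping max_calls at two ends the loop early with one item never submitted, which is the kind of interruption the callback has to tolerate.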
That temptation to put more of the program into BPF is one reason that kernel developers were skeptical of Begunkov's approach when it resurfaced in November 2025. BPF makes it possible to implement complex operations in kernel space — but it can hardly be said to be as easy as writing normal user-space software. Programs will probably need to communicate between their in-kernel and user-space components anyway, but Begunkov's approach would have them doing so via ad-hoc interfaces rather than the existing io_uring interface.
Begunkov sees avoiding extraneous system calls as one of several uses for his patch set. He also suggested that BPF could become a transitional path for deprecating existing io_uring APIs. There are a number of organic extensions to the io_uring API, such as IOSQE_IO_DRAIN, that could be emulated in BPF, taking that logic out of the core kernel. He also thought that, as with the extensible scheduler class, introducing BPF would allow for experimentation with smarter polling algorithms before they're introduced to the kernel.
Jens Axboe, io_uring's creator and maintainer, planned to merge Begunkov's patch set during the 7.1 merge window. Caleb Mateos thought that the changes would not be as useful without kfuncs for interacting with io_uring registered buffers — additional buffers shared between the kernel and user space that can be referenced by io_uring operations. Registered buffers can be more efficient because they only need to be faulted and pinned into memory once, and can then be reused by subsequent operations.
Mateos referenced a patch set from Ming Lei, first seen in November (with an updated version in January). Lei's patch set is an alternate approach to integrating io_uring and BPF, which includes kfuncs for interacting with registered buffers along with an alternate set of attachment points for hooking into the io_uring subsystem. Lei's patches would not let users completely customize the behavior of io_uring_enter(); instead, users would be able to register BPF programs that could be invoked with a new IORING_OP_BPF io_uring operation. The approach is less flexible than Begunkov's (which could be used to emulate something similar, since it allows the BPF program to inspect and modify requests before submitting them to the kernel for processing), but is probably easier to use for targeted changes. Lei's approach is arguably more natural for allowing the deprecation of existing io_uring commands, since it can be used to replace specific operations with a BPF implementation.
Neither patch set has seen as much discussion as might be warranted for a major change to io_uring. Mateos thought that the additional kfuncs were largely orthogonal to Begunkov's work: the kfuncs would be useful for BPF programs running in the context of io_uring, regardless of how those programs are triggered. Axboe agreed, deciding to apply Begunkov's patch set on March 17. Lei's work will have to be rebased, but Axboe seemed generally inclined to accept it as well. Either way, configurable BPF is coming to io_uring.
