Zero-copy I/O for ublk, three different ways
1: Use BPF
There are few problems in the kernel, it seems, that cannot be addressed by throwing some BPF into the mix, and zero-copy ublk I/O would appear to be no exception. This patch set from Xiaoguang Wang adds a new program type (BPF_PROG_TYPE_UBLK) that can be loaded by ublk drivers and subsequently registered with one or more specific ublk devices. Once that happens, I/O requests generated by the kernel will be passed to that program rather than being sent to the user-space driver for execution. There is a new BPF helper function (not a kfunc, for unclear reasons) called bpf_ublk_queue_sqe() that allows BPF programs to add requests to the ring; this helper can be used to queue the I/O operations that fulfill the original block request.
There are a few advantages to handling these requests entirely in the kernel, starting with the ability to eliminate round trips with the user-space daemon. The biggest win, though, is likely to come from the fact that the BPF program has access to the buffers provided by the kernel and can use them directly for whatever I/O is needed to satisfy each request, eliminating a copy of that data. Block drivers can move quite a bit of data, so the advantage of avoiding copies should be clear. That said, this patch (like all the others discussed here) lacks benchmark results showing the performance improvement it enables.
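The difference between the copying path and the zero-copy path can be illustrated with a small user-space model. This is only a sketch, not kernel or ublk code: a `bytearray` stands in for the kernel's request buffer, and the function names `copy_path()` and `zero_copy_path()` are made up for the illustration.

```python
# Toy model only: a bytearray stands in for the kernel's request buffer.

def copy_path(request_buf):
    """Duplicate the payload into a separate, driver-owned buffer."""
    return bytes(request_buf)

def zero_copy_path(request_buf):
    """Hand back a view that references the same memory; nothing is duplicated."""
    return memoryview(request_buf)

request_buf = bytearray(b"block data")
copied = copy_path(request_buf)
shared = zero_copy_path(request_buf)

# Change the original buffer after both handles were created.
request_buf[0] = ord("B")

print(copied[:5])         # the copy still holds the old contents
print(bytes(shared[:5]))  # the view reflects the update, as it shares the memory
```

The copy costs time and memory proportional to the size of the request, while the view costs the same regardless of size; that is the saving all three proposals are after.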
2: Fused operations
Ming Lei, the author of the original ublk patches, has a rather different approach. Like ublk itself, this work is minimally documented and difficult to read, so this description is the result of a reverse-engineering effort and may well be wrong in some respects.
Operations in an io_uring ring are usually entirely separate from each other. There is a way to link them so that one operation must complete before the next can be dispatched, but otherwise each operation is distinct. Lei's patch set provides a rather tighter link between operations by adding the concept of "fused" operations — two operations that are tied together and which can share resources between them.
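For contrast with fused operations, the ordinary link behavior can be sketched as a toy model. In real io_uring, links are requested with the IOSQE_IO_LINK flag, and a failure in the chain causes the remaining linked requests to complete with -ECANCELED; the `run_chain()` function below is a hypothetical stand-in that mimics only that ordering and cancellation behavior, with no sharing of resources between operations.

```python
import errno

def run_chain(ops):
    """Run linked operations strictly in order (toy model, not io_uring itself).

    Each op is a callable returning >= 0 on success or a negative errno.
    After the first failure, the remaining ops are not run and are
    reported as cancelled.
    """
    results = []
    failed = False
    for op in ops:
        if failed:
            results.append(-errno.ECANCELED)
            continue
        res = op()
        results.append(res)
        if res < 0:
            failed = True
    return results

# A read that succeeds, a write that fails, and a step that never runs.
chain = [lambda: 4096, lambda: -errno.EIO, lambda: 4096]
print(run_chain(chain))
```

Note that each operation here is still self-contained: nothing produced by one is handed to the next, which is the gap that fused commands are meant to fill.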
When a user-space ublk driver is running, it will receive commands from the kernel, via the ring, with instructions like "read N blocks from device D at offset O". With Lei's series applied, the driver will have the option to turn that operation into a fused command that is placed back into the ring for execution in the kernel. A fused command is two io_uring commands that are tied together; they must be submitted as a single unit. The "master" command (Lei's terminology) is of type IORING_OP_FUSED_CMD; it contains enough information for the ublk subsystem to connect the command to a request sent to the user-space driver. The "slave" command, instead, performs the actual I/O needed to satisfy that request.
As with the BPF solution, the key here is that the slave command has access to the buffer associated with the master; in this case, the slave command can access the kernel-space buffers associated with the original block I/O request. Once again, that allows the I/O to be performed without copying the data to or from the user-space driver. Once the slave command completes, the user-space driver can signal completion of the original block I/O request to the kernel in the usual way.
The fused-command functionality is a special-purpose beast; it will not work in any sort of general case. The subsystem receiving the fused command must have special support for it and, specifically, it must be able to locate the kernel-space buffer for the slave command and make the connection with a call to the new function io_fused_cmd_provide_kbuf() before the slave can execute. It is a fair amount of change to the io_uring subsystem, and it is not entirely clear that any other subsystem would be able to make use of it.
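The relationship between the two halves of a fused command can be modeled in a few lines. This is a loose sketch under stated assumptions, not a rendering of Lei's patches: `KernelRequest`, `FusedCommand`, and `submit_fused()` are hypothetical names, the tag lookup and the -EINVAL return are invented for the illustration, and the `with memoryview(...)` step only approximates the role that io_fused_cmd_provide_kbuf() plays.

```python
import errno

class KernelRequest:
    """Hypothetical stand-in for a block request and its kernel-side buffer."""
    def __init__(self, tag, data):
        self.tag = tag
        self.buffer = bytearray(data)

class FusedCommand:
    """Hypothetical pairing of a request tag with the operation that does the I/O."""
    def __init__(self, tag, secondary):
        self.tag = tag              # identifies the original block request
        self.secondary = secondary  # callable that receives the shared buffer

def submit_fused(cmd, pending):
    """Locate the request, lend its buffer to the secondary op, return its result."""
    req = pending.get(cmd.tag)
    if req is None:
        return -errno.EINVAL        # assumption: unknown tags are rejected
    # The buffer is lent without copying and is only valid inside this block.
    with memoryview(req.buffer) as kbuf:
        return cmd.secondary(kbuf)

backing_store = bytearray()

def write_to_backing(buf):
    """Secondary operation: consume the lent buffer directly."""
    backing_store.extend(buf)
    return len(buf)

pending = {7: KernelRequest(7, b"payload")}
result = submit_fused(FusedCommand(7, write_to_backing), pending)
print(result, bytes(backing_store))
```

In this model the secondary operation cannot run without a matching request to supply the buffer, and the view is released as soon as `submit_fused()` returns, so a secondary that held on to it could not use it afterward.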
3: Use splice()
In the discussion that followed the posting of version 2 of Lei's patch set, Pavel Begunkov observed that "it all looks a bit complicated and intrusive". He thought that it might be possible to, instead, reuse the mechanisms for the splice() system call. The io_uring "registered buffer" feature would be used to facilitate zero-copy operation. Shortly thereafter, he posted a preliminary, proof-of-concept implementation; it showed how this approach could work but was not complete.
Lei had a number of questions about this approach, mostly focused on how the buffer management works. It is not clear how well the splice() approach would work if I/O needs to be performed on a given buffer more than once — for example, when writing to a mirrored block device. The questions kept coming, and Begunkov has not (as of this writing) posted a complete version of the patch. It seems likely that the splice() approach will not go much further, though surprises can always happen.
Wang, meanwhile, has said that the fused-command approach seems like "the right direction to support ublk zero copy".
As was noted in the original ublk article, one of the key practical
problems that has impeded the microkernel approach to operating-system
design is the cost of communication between the components. Ublk has
managed to reduce that cost considerably, but there is more to be gained if
the cost of copying data between the kernel and user space can be
eliminated. So chances are good that developers will continue to work on
this problem until some sort of workable solution has been found.
Index entries for this article | |
---|---|
Kernel | Block layer/Block drivers |
Posted Mar 17, 2023 5:37 UTC (Fri)
by liam (guest, #84133)
[Link] (2 responses)
Posted Mar 26, 2023 15:15 UTC (Sun)
by ming.lei (guest, #74703)
[Link]
Another way to see the difference is to run ublk-null in zero copy mode[1], which simply bypasses data copy between io request pages and ublk server user buffer. IOPS boost can be ~5X in this way with 64k/512k block size.
Posted Mar 29, 2023 8:28 UTC (Wed)
by old-memories (guest, #160155)
[Link]
We did some tests on our eBPF patches. We manually let IOPS reach the bottleneck of our storage device so that we could fairly compare CPU usage between eBPF and the baseline (current ublk). We noticed that the CPU usage dropped to 1%. In our product, CPU usage is much more important than IOPS because our RPC backend is so complicated that ublk cannot boost IOPS too much.
Posted Mar 26, 2023 15:09 UTC (Sun)
by ming.lei (guest, #74703)
[Link] (1 responses)
Thanks for making this as one lwn document!
> The fused-command functionality is a special-purpose beast;
The patchset addresses zero copy between device io buffer and io uring OPs in one generic way. Any device can implement ->uring_cmd() for supporting this feature, so I do not agree it is one special-purpose beast.
The thing is that we don't have many such requirements, and another user is fuse FS.
Also it might help net recv zero copy a bit, see the following link:
https://lore.kernel.org/linux-block/ZBnTuX+5D8QeLPuQ@ovpn...
BTW, given we only have kernel pages, and there can't be user space VM mapping for these pages, so exporting these pages to userspace doesn't make any sense. However, BPF might be one perfect supplement here, such as:
- add one generic io_uring BPF OP, which can run one specified registered BPF program by passing bpf_prog_id
- link this BPF OP as slave request of fused command, then the bpf prog can do whatever on the kernel pages, and return results into user via any bpf mapping(s)
- then userspace can decide how to handle the result from bpf mapping(s), such as, submit another fused command to handle IO with part of the kernel buffer.
> splice based approach
I don't think splice is one good solution, see details in the following doc:
https://github.com/ming1/linux/blob/my_v6.3-io_uring_fuse...
The above doc also provides basic technical requirements for ublk zero copy feature.
Posted Mar 30, 2023 15:05 UTC (Thu)
by ming.lei (guest, #74703)
[Link]
fused command becomes more generic in V6 [1], and it models relationship between primary request and secondary requests, and sharing resource between them, one core idea is to align resource's lifetime with primary command, so provide one safe way to sharing resource among kernel subsystems.
There are more use cases mentioned in patch 3/17 [2].
Now sharing/providing buffer can be thought as one plugin of fused command, and more plugins could be added in future.
[1] https://lore.kernel.org/linux-block/20230330113630.138886...
[2] https://lore.kernel.org/linux-block/20230330113630.138886...