
Zero-copy I/O for ublk, three different ways

By Jonathan Corbet
March 16, 2023
The ublk subsystem enables the creation of user-space block drivers that communicate with the kernel using io_uring. Drivers implemented this way show some promise with regard to performance, but there is a bottleneck in the way: copying data between the kernel and the user-space driver's address space. It is thus not surprising that there is interest in implementing zero-copy I/O for ublk. The mailing lists have recently seen three different proposals for how this could be done.

1: Use BPF

There are few problems in the kernel, it seems, that cannot be addressed by throwing some BPF into the mix, and zero-copy ublk I/O would appear to be no exception. This patch set from Xiaoguang Wang adds a new program type (BPF_PROG_TYPE_UBLK) that can be loaded by ublk drivers and subsequently registered with one or more specific ublk devices. Once that happens, I/O requests generated by the kernel will be passed to that program rather than being sent to the user-space driver for execution. There is a new BPF helper function (not a kfunc, for unclear reasons) called bpf_ublk_queue_sqe() that allows BPF programs to add requests to the ring; this helper can be used to queue the I/O operations that fulfill the original block request.

There are a few advantages to handling these requests entirely in the kernel, starting with the ability to eliminate round trips with the user-space daemon. The biggest win, though, is likely to come from the fact that the BPF program has access to the buffers provided by the kernel and can use them directly for whatever I/O is needed to satisfy each request, eliminating a copy of that data. Block drivers can move quite a bit of data, so the advantage of avoiding copies should be clear. That said, this patch (like all the others discussed here) lacks benchmark results showing the performance improvement it enables.
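The cost being avoided can be made concrete with a toy model. What follows is only a sketch under stated assumptions: none of the names below (blk_request, handle_with_copy(), handle_zero_copy(), backend_write()) come from ublk or from Wang's patches. They are hypothetical stand-ins showing that the same request can be satisfied with or without an intermediate copy into a driver-owned bounce buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: every name here is hypothetical, not part of ublk. */
struct blk_request {
    const uint8_t *kbuf;  /* buffer owned by the "kernel" side */
    size_t len;           /* request size in bytes */
};

struct copy_stats {
    size_t bytes_copied;  /* bytes moved into the driver's bounce buffer */
};

/* Stand-in for the backing-store I/O: it just checksums what it is given. */
static uint32_t backend_write(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + data[i];
    return sum;
}

/* Copying path: the data lands in the driver's bounce buffer first. */
static uint32_t handle_with_copy(const struct blk_request *rq,
                                 uint8_t *bounce, size_t bounce_len,
                                 struct copy_stats *st)
{
    assert(rq->len <= bounce_len);
    memcpy(bounce, rq->kbuf, rq->len);
    st->bytes_copied += rq->len;
    return backend_write(bounce, rq->len);
}

/* Zero-copy path: the backend operates on the original buffer directly. */
static uint32_t handle_zero_copy(const struct blk_request *rq)
{
    return backend_write(rq->kbuf, rq->len);
}
```

Both paths hand the backend identical bytes; the only difference is that the copying path touches every byte of the request one extra time, which is the overhead that all three proposals try to remove.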

2: Fused operations

Ming Lei, the author of the original ublk patches, has a rather different approach. Like ublk itself, this work is minimally documented and difficult to read, so this description is the result of a reverse-engineering effort and may well be wrong in some respects.

Operations in an io_uring ring are usually entirely separate from each other. There is a way to link them so that one operation must complete before the next can be dispatched, but otherwise each operation is distinct. Lei's patch set provides a rather tighter link between operations by adding the concept of "fused" operations — two operations that are tied together and which can share resources between them.
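The existing linking behavior can be illustrated with a small simulation. This is a simplified sketch, not io_uring itself: TOY_LINK is a hypothetical flag playing the role of io_uring's IOSQE_IO_LINK, the toy_op structure and toy_submit() are invented for illustration, and everything runs sequentially, whereas the real ring is free to run unlinked operations concurrently. The one behavior it tries to mirror is that a failure in a linked operation cancels the rest of its chain.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag standing in for io_uring's IOSQE_IO_LINK. */
#define TOY_LINK 0x1u

struct toy_op {
    unsigned flags;         /* TOY_LINK: the next op waits for this one */
    int (*run)(void *arg);  /* returns >= 0 on success, -errno on failure */
    void *arg;
    int result;             /* filled in by toy_submit() */
};

/* Sample operation: counts how many times it has run. */
static int op_ok(void *arg)
{
    int *count = (int *)arg;

    return ++*count;
}

/* Sample operation that always fails. */
static int op_fail(void *arg)
{
    (void)arg;
    return -EIO;
}

/*
 * Execute ops in submission order.  Once a linked op fails, the remaining
 * members of its chain are not run and complete with -ECANCELED.  An op
 * without TOY_LINK ends the chain, so later ops are unaffected.
 */
static void toy_submit(struct toy_op *ops, size_t n)
{
    int broken = 0;  /* nonzero while the current chain has a failed member */

    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (broken)
            ops[i].result = -ECANCELED;
        else
            ops[i].result = ops[i].run(ops[i].arg);

        if (!(ops[i].flags & TOY_LINK))
            broken = 0;              /* last member of the chain */
        else if (ops[i].result < 0)
            broken = 1;              /* cancel the rest of the chain */
    }
}
```

Submitting a failing operation linked to two successors, followed by one independent operation, leaves the successors canceled while the independent operation still runs. Note that linking only orders operations; it gives them no way to share a buffer, which is the gap that fused commands are meant to fill.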

When a user-space ublk driver is running, it will receive commands from the kernel, via the ring, with instructions like "read N blocks from device D at offset O". With Lei's series applied, the driver will have the option to turn that operation into a fused command that is placed back into the ring for execution in the kernel. A fused command is two io_uring commands that are tied together; they must be submitted as a single unit. The "master" command (Lei's terminology) is of type IORING_OP_FUSED_CMD; it contains enough information for the ublk subsystem to connect the command to a request sent to the user-space driver. The "slave" command, instead, performs the actual I/O needed to satisfy that request.

As with the BPF solution, the key here is that the slave command has access to the buffer associated with the master; in this case, the slave command can access the kernel-space buffers associated with the original block I/O request. Once again, that allows the I/O to be performed without copying the data to or from the user-space driver. Once the slave command completes, the user-space driver can signal completion of the original block I/O request to the kernel in the usual way.
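The relationship between the two halves of a fused command can be sketched as a toy model. Everything here is an assumption made for illustration: toy_device, toy_request, toy_fused_cmd, toy_submit_fused(), and toy_complete() are hypothetical names that do not appear in Lei's patches, and the real series works on io_uring requests rather than plain structures. The sketch shows only the idea described above: the master half names an in-flight block request, and the slave half does its I/O against that request's buffer, which stays usable only until the request is completed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: every name here is hypothetical. */
#define TOY_NR_TAGS    4
#define TOY_STORE_SIZE 64

struct toy_request {
    const uint8_t *kbuf;  /* "kernel" buffer of the original block request */
    size_t len;
    int in_flight;        /* nonzero until the driver completes the request */
};

struct toy_device {
    struct toy_request requests[TOY_NR_TAGS];
    uint8_t store[TOY_STORE_SIZE];  /* stand-in for the backing storage */
};

struct toy_fused_cmd {
    unsigned tag;      /* master half: identifies the block request */
    size_t store_off;  /* slave half: where the actual I/O should land */
};

/*
 * Run the slave I/O against the buffer that belongs to the master's request.
 * The memcpy() stands in for the device transfer itself; no driver-owned
 * bounce buffer is involved.  Returns bytes transferred, or -1 on error.
 */
static long toy_submit_fused(struct toy_device *dev,
                             const struct toy_fused_cmd *cmd)
{
    const struct toy_request *rq;

    if (cmd->tag >= TOY_NR_TAGS)
        return -1;
    rq = &dev->requests[cmd->tag];
    if (!rq->in_flight)
        return -1;  /* buffer is only valid while the request is live */
    if (cmd->store_off > TOY_STORE_SIZE ||
        rq->len > TOY_STORE_SIZE - cmd->store_off)
        return -1;
    memcpy(dev->store + cmd->store_off, rq->kbuf, rq->len);
    return (long)rq->len;
}

/* The driver signals completion; the buffer may no longer be borrowed. */
static void toy_complete(struct toy_device *dev, unsigned tag)
{
    assert(tag < TOY_NR_TAGS);
    dev->requests[tag].in_flight = 0;
    dev->requests[tag].kbuf = NULL;
}
```

In this model a fused command submitted after toy_complete() is refused, a rough analogue of tying the buffer's lifetime to the command that provides it.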

The fused-command functionality is a special-purpose beast; it will not work in any sort of general case. The subsystem receiving the fused command must have special support for it and, specifically, it must be able to locate the kernel-space buffer for the slave command and make the connection with a call to the new function io_fused_cmd_provide_kbuf() before the slave can execute. It is a fair amount of change to the io_uring subsystem, and it is not entirely clear that any other subsystem would be able to make use of it.

3: Use splice()

In the discussion after version 2 of Lei's patch set was posted, Pavel Begunkov observed that "it all looks a bit complicated and intrusive". He thought that it might be possible to, instead, reuse the mechanisms for the splice() system call. The io_uring "registered buffer" feature would be used to facilitate zero-copy operation. Shortly thereafter, he posted a preliminary, proof-of-concept implementation; it showed how this approach could work but was not complete.

Lei had a number of questions about this approach, mostly focused on how the buffer management works. It is not clear how well the splice() approach would work if I/O needs to be performed on a given buffer more than once — for example, when writing to a mirrored block device. The questions kept coming, and Begunkov has not (as of this writing) posted a complete version of the patch. It seems likely that the splice() approach will not go much further, though surprises can always happen.

Wang, meanwhile, has said that the fused-command approach seems like "the right direction to support ublk zero copy".

As was noted in the original ublk article, one of the key practical problems that has impeded the microkernel approach to operating-system design is the cost of communication between the components. Ublk has managed to reduce that cost considerably, but there is more to be gained if the cost of copying data between the kernel and user space can be eliminated. So chances are good that developers will continue to work on this problem until some sort of workable solution has been found.

Index entries for this article
Kernel: Block layer/Block drivers



Zero-copy I/O for ublk, three different ways

Posted Mar 17, 2023 5:37 UTC (Fri) by liam (guest, #84133)

Do the fio numbers given in the bpf proposal not count as benchmarks? It's a simple test, but it does add performance numbers to the conversation (cpu% went from 12.5% -> 1%).

Zero-copy I/O for ublk, three different ways

Posted Mar 26, 2023 15:15 UTC (Sun) by ming.lei (guest, #74703)

> Do the fio numbers given in the bpf proposal not count as benchmarks? It's a simple test, but it does add performance numbers to the conversation (cpu% went from 12.5% -> 1%).

Another way to see the difference is to run ublk-null in zero-copy mode [1], which simply
bypasses the data copy between the I/O request pages and the ublk server's user buffer. The IOPS
boost can be about 5x this way with 64k/512k block sizes.

[1] https://github.com/ming1/ubdsrv/commits/fused-cmd-zc-v2

Zero-copy I/O for ublk, three different ways

Posted Mar 29, 2023 8:28 UTC (Wed) by old-memories (guest, #160155)

Hello,

We did some tests on our eBPF patches. We deliberately let IOPS reach the bottleneck of our storage device so that we could fairly compare CPU usage between eBPF and the baseline (current ublk).
We noticed that the CPU usage dropped to 1%. In our product, CPU usage is much more important than IOPS because our RPC backend is so complicated that ublk cannot boost IOPS very much.

Zero-copy I/O for ublk, three different ways

Posted Mar 26, 2023 15:09 UTC (Sun) by ming.lei (guest, #74703)

Thanks for making this as one lwn document!

> The fused-command functionality is a special-purpose beast;

The patchset addresses zero copy between the device I/O buffer and io_uring operations in a generic way.

Any device can implement ->uring_cmd() to support this feature, so I do not agree that it is a
special-purpose beast.

The thing is that we don't have many such requirements; another user is the FUSE filesystem.

Also it might help net recv zero copy a bit, see the following link:

https://lore.kernel.org/linux-block/ZBnTuX+5D8QeLPuQ@ovpn...

BTW, given that we only have kernel pages, and there can't be a user-space VM mapping
for these pages, exporting them to userspace doesn't make any sense. However,
BPF might be a perfect supplement here, such as:

- add one generic io_uring BPF OP, which can run one specified registered BPF program by
passing bpf_prog_id

- link this BPF OP as slave request of fused command, then the bpf prog can do whatever on
the kernel pages, and return results into user via any bpf mapping(s)

- then userspace can decide how to handle the result from bpf mapping(s), such as,
submit another fused command to handle IO with part of the kernel buffer.

> splice based approach

I don't think splice is one good solution, see details in the following doc:

https://github.com/ming1/linux/blob/my_v6.3-io_uring_fuse...

The above doc also provides the basic technical requirements for the ublk zero-copy feature.

Zero-copy I/O for ublk, three different ways

Posted Mar 30, 2023 15:05 UTC (Thu) by ming.lei (guest, #74703)

> The fused-command functionality is a special-purpose beast;

The fused command becomes more generic in V6 [1]. It models the
relationship between a primary request and its secondary requests,
and the sharing of resources between them. One core idea is to align a resource's
lifetime with the primary command, which provides a safe way to share
resources among kernel subsystems.

There are more use cases mentioned in patch 3/17 [2].

Now sharing/providing a buffer can be thought of as one plugin of the fused command,
and more plugins could be added in the future.

[1] https://lore.kernel.org/linux-block/20230330113630.138886...

[2] https://lore.kernel.org/linux-block/20230330113630.138886...


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds