An io_uring-based user-space block driver
Your editor has spent a fair amount of time beating his head against the source for the ublk driver, as well as the ubdsrv server that comprises the user-space component. The picture that has emerged from this exploration of that uncommented and vowel-deficient realm is doubtless incorrect in some details, though the overall shape should be close enough to reality.
How ublk works
The ublk driver starts by creating a special device called /dev/ublk-control. The user-space server (or servers; there can be more than one) starts by opening that device and setting up an io_uring ring to communicate with it. Operations at this level are essentially ioctl() commands, but /dev/ublk-control has no ioctl() handler; all operations are, instead, sent as commands through io_uring. Since the purpose is to implement a device behind io_uring, the reasoning seems to be, there is no reason not to use it from the beginning.
A server will typically start with a UBLK_CMD_ADD_DEV command; as one might expect, it adds a new ublk device to the system. The server can describe various aspects of this device, including the number of hardware queues it claims to implement, its block size, the maximum transfer size, and the number of blocks the device can hold. Once this command succeeds, the device exists as far as the ublk driver is concerned and is visible as /dev/ublkcN, where N is the device ID returned when the device is created. The device has not yet been added to the block layer, though.
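None of this is documented, of course, but the shape of an ADD_DEV call can be pieced together from ubdsrv and the 6.0-era &lt;linux/ublk_cmd.h&gt; header. What follows is a minimal sketch based on that reading; the structure fields and values are as your editor understands them and could be wrong in the details, and all error handling has been omitted:

#include <fcntl.h>
#include <stdint.h>
#include <liburing.h>
#include <linux/ublk_cmd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;

    /* control commands travel in 128-byte SQEs */
    io_uring_queue_init(4, &ring, IORING_SETUP_SQE128);
    int cfd = open("/dev/ublk-control", O_RDWR);

    /* describe the device to be created */
    struct ublksrv_ctrl_dev_info info = {
        .dev_id = -1,           /* ask the driver to pick an ID */
        .nr_hw_queues = 1,
        .queue_depth = 64,
        .block_size = 512,
        .rq_max_blocks = 128,   /* maximum transfer size, in blocks */
        .dev_blocks = 1 << 21,  /* capacity: 1GB at 512-byte blocks */
    };

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = cfd;
    sqe->cmd_op = UBLK_CMD_ADD_DEV;

    /* the control command lives in the SQE's 80-byte payload area */
    struct ublksrv_ctrl_cmd *cmd = (struct ublksrv_ctrl_cmd *)sqe->cmd;
    cmd->dev_id = -1;
    cmd->queue_id = -1;
    cmd->addr = (__u64)(uintptr_t)&info;
    cmd->len = sizeof(info);

    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);
    /* on success, info.dev_id now holds the N in /dev/ublkcN */
    return 0;
}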
The server should open the new /dev/ublkcN device for the following steps, the first of which is to map a region from the device into the server's address space with an mmap() call. This region is an array of ublksrv_io_desc structures describing I/O requests:
struct ublksrv_io_desc {
    /* op: bit 0-7, flags: bit 8-31 */
    __u32   op_flags;
    __u32   nr_sectors;
    __u64   start_sector;
    __u64   addr;
};
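The offset to pass to mmap() is not documented either; in ubdsrv, each queue maps its own slice of the descriptor array at an offset computed from the queue ID. A sketch, using constant names from the 6.0 header, and the q_id and queue_depth values established at ADD_DEV time:

/* map queue q_id's descriptor array from /dev/ublkcN; one entry per
   in-flight request ("tag"), and read-only from user space */
int cdev_fd = open("/dev/ublkc0", O_RDWR);
off_t off = UBLKSRV_CMD_BUF_OFFSET +
    q_id * (UBLK_MAX_QUEUE_DEPTH * sizeof(struct ublksrv_io_desc));
struct ublksrv_io_desc *descs =
    mmap(NULL, queue_depth * sizeof(struct ublksrv_io_desc),
         PROT_READ, MAP_SHARED, cdev_fd, off);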
Notification of new I/O requests will be received via io_uring. To get to that point, the server must enqueue a set of UBLK_IO_FETCH_REQ requests on the newly created device; normally there will be one for each "hardware queue" declared for the device, each of which may correspond to a thread running within the server. Among other things, each request must provide a memory buffer that can hold the maximum request size declared when the device was created.
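In ubdsrv, there is in fact one such request outstanding for every slot ("tag") in a queue's descriptor array, each with its own buffer. Priming one queue looks roughly like this, continuing the sketch above with the same caveats:

/* arm every tag in queue q_id with a FETCH_REQ and a data buffer */
for (int tag = 0; tag < queue_depth; tag++) {
    bufs[tag] = malloc(max_io_bytes);   /* max transfer size from ADD_DEV */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = cdev_fd;                  /* /dev/ublkcN, not the control node */
    sqe->cmd_op = UBLK_IO_FETCH_REQ;
    sqe->user_data = tag;               /* to match up the completion later */

    struct ublksrv_io_cmd *io = (struct ublksrv_io_cmd *)sqe->cmd;
    io->q_id = q_id;
    io->tag = tag;
    io->addr = (__u64)(uintptr_t)bufs[tag];
}
io_uring_submit(&ring);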
Once this setup is complete, a separate UBLK_CMD_START_DEV operation will cause the ublk driver to actually create a block device visible to the rest of the system. When the block subsystem sends a request to this device, one of the queued UBLK_IO_FETCH_REQ operations will complete. The completion data returned to the user-space server will include the index of the ublksrv_io_desc structure describing the request, which the server should now execute. For a write request, the data to be written will be in the buffer that was provided by the server; for a read, the data should be placed in that same buffer.
When the operation is complete, the server must inform the kernel of that fact; this is done by placing a UBLK_IO_COMMIT_AND_FETCH_REQ operation into the ring. It will give the result of the operation back to the block subsystem, but will also enqueue the buffer to receive the next request, thus avoiding the need to do that separately.
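Put together, the core of a per-queue server loop comes out looking something like the following, loosely modeled on ubdsrv's loop target. Here backing_fd, bufs[], and descs come from the steps above, sectors are assumed to be 512 bytes, and everything a real server must handle (errors, flush and discard requests, shutdown) is left out:

for (;;) {
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    int tag = (int)cqe->user_data;      /* which descriptor slot fired */
    io_uring_cqe_seen(&ring, cqe);

    const struct ublksrv_io_desc *d = &descs[tag];
    unsigned int op = d->op_flags & 0xff;   /* op lives in bits 0-7 */
    ssize_t res;

    switch (op) {
    case UBLK_IO_OP_READ:   /* fill our buffer with the requested data */
        res = pread(backing_fd, bufs[tag],
                    (size_t)d->nr_sectors << 9, d->start_sector << 9);
        break;
    case UBLK_IO_OP_WRITE:  /* the data to be written is already there */
        res = pwrite(backing_fd, bufs[tag],
                     (size_t)d->nr_sectors << 9, d->start_sector << 9);
        break;
    default:
        res = -ENOTSUP;
    }

    /* report the result and re-arm this tag in a single operation */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = cdev_fd;
    sqe->cmd_op = UBLK_IO_COMMIT_AND_FETCH_REQ;
    sqe->user_data = tag;

    struct ublksrv_io_cmd *io = (struct ublksrv_io_cmd *)sqe->cmd;
    io->q_id = q_id;
    io->tag = tag;
    io->result = (__s32)res;
    io->addr = (__u64)(uintptr_t)bufs[tag];
    io_uring_submit(&ring);
}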
There are the expected UBLK_CMD_STOP_DEV and UBLK_CMD_DEL_DEV operations to make existing devices go away, and a couple of other operations to query information about existing devices. There are also a number of details that have not been covered here, mostly aimed at increased performance. Among other things, the ublk protocol is set up to enable zero-copy I/O, but that is not implemented in the current code.
The server code implements two targets: null and loop. The null target is, as one might expect, an overly complicated, block-oriented version of /dev/null; it is useless but makes it possible to see how things work with a minimum of unrelated details. The loop target uses an existing file as the backing store for a virtual block device. According to author Ming Lei, with this loop implementation, "the performance is even better than kernel loop with same setting".
Implications
One might wonder why this work has been done (and evidently supported by Red Hat); if the world has been clamoring for an io_uring-based, user-space, faster loop block device, it has done so quietly. One advantage cited in the patch cover letter is that development of block-driver code is more easily done in user space; another is high-performance qcow2 support. The patch cover letter also cites interest expressed by other developers in having a fast user-space block-device mechanism available.
An interesting question, though, is whether this mechanism might ultimately facilitate the movement of a number of device drivers out of the kernel — perhaps not just block drivers. Putting device drivers into user-space code is a fundamental concept in a number of secure-system designs, including microkernel systems. But one of the problems with those designs has always been the communication overhead between the kernel and the driver components once they are no longer running within the same address space. Io_uring might just be a convincing answer to that problem.
Should that scenario play out, kernels of the future could look significantly different from what we have today; they could be smaller, with much of the complicated logic running in separate, user-space components. Whether this is part of Lei's vision for ublk is unknown, and things may never get anywhere near that point. But ublk is clearly an interesting experiment that could lead to big changes down the line. Something will need to be done about that complete absence of documentation, though, on the way toward world domination.
| Index entries for this article | |
|---|---|
| Kernel | Block layer/Block drivers |
| Kernel | io_uring |
| Kernel | Releases/6.0 |
Posted Aug 8, 2022 16:00 UTC (Mon) by NHO (subscriber, #104320)

You just have the loop and the null; physical devices have drivers in kernel space.
Posted Aug 13, 2022 15:29 UTC (Sat) by ming.lei (guest, #74703)

nbd, iscsi, qcow2, .....

And so far it can't move physical device drivers out of the kernel, but it turns out the io_uring passthrough command is a very efficient communication channel between user and kernel space. In the future, it may be extended for other user-space drivers or components.
Posted Aug 31, 2022 9:59 UTC (Wed) by Darkstar (guest, #28767)

Say you have some fancy EEPROM hooked up to the GPIO pins of your Raspberry Pi. You could write a Python or C program that drives the GPIO pins correctly and uses ublk to make the EEPROM's data accessible to the kernel as a block device.

You could use a KryoFlux or SCP device to build an io_uring-based replacement for the floppy driver, which would probably be a fun exercise :)
Posted Aug 9, 2022 4:27 UTC (Tue) by old-memories (guest, #160155)

I have tested the performance of TCMU and ublk. TCMU results in longer I/O latency, since it uses the SCSI protocol while ublk does not need it. Besides, TCMU does not support multiqueue (only one command ring with a coarse-grained lock), so it behaves worse with multiple fio jobs. ublk does support multiqueue, and there is one io_uring instance per queue, so it benefits from blk-mq.
Posted Aug 9, 2022 7:36 UTC (Tue) by flussence (guest, #85566)

(Ideally any hardware that lives beyond the boundary of an external port would never have a driver stack running as root, but that's a way off.)
Missing documentation

Posted Aug 13, 2022 15:23 UTC (Sat) by ming.lei (guest, #74703)

Another reason is that the idea & implementation is pretty simple & straightforward. I will write a document with more details; I will do it in the 6.0 release if no one is working on it.
Posted Aug 9, 2022 10:01 UTC (Tue) by ddevault (subscriber, #99589)

One thing I suspect will come out of this work is support for TLS with nbd. Right now, as I understand it, the kernel half of the nbd subsystem cannot handle encrypted connections.
Posted Aug 10, 2022 3:36 UTC (Wed) by wahern (subscriber, #37304)

This is what microkernel people have been shouting for years. The model seL4 settled on (IIUC) was synchronous, short, typed messages copied and routed through the kernel for standard IPC, complemented by a generic, untyped page-sharing facility permitting two or more processes to set up direct message passing à la io_uring. There's also an asynchronous signaling mechanism to assist with the latter, but ultimately you're given a simple, synchronous IPC mechanism, along with the basic tools to build your own high-performance IPC mechanism without the microkernel getting in your way. As far as I understand, this model is more practical and much less opinionated than earlier common approaches among microkernels, and also more closely parallels where Linux is headed.

What seL4 and other microkernels lack is the mindshare and incentive to actually build things atop these layers, regardless of their merit. seL4 especially demands huge, upfront time investments because of its build system. (Genode.org might alleviate much of this pain, though.) It's unsurprising these architectures will be replicated haphazardly on Linux, especially given the demands of multicore scaling, which heavily favor message-passing architectures. But that's the age-old tech story, a consequence of path dependency, et al. Maybe io_uring will create an ecosystem of software more readily portable to microkernels, and maybe the increasing complexity of Linux will motivate such migrations. But probably not.
Posted Aug 10, 2022 8:04 UTC (Wed) by ddevault (subscriber, #99589)

https://git.sr.ht/~sircmpwn/helios

I definitely agree with the shortcomings of seL4, both the ones you mentioned and the ones you left out. Most serious new kernel projects these days are microkernels, so it's clear that if there's a future after Linux it will be in microkernels. The main issue is getting to that future. There's a reason we're all still using Unix systems after so long.
Posted Aug 10, 2022 10:08 UTC (Wed) by guus (subscriber, #41608)

I see parallels with OpenGL and Vulkan here, and I know even experienced programmers find thinking about this hard. So it's no wonder io_uring is currently only used for some specific cases where performance is critical.
Posted Aug 20, 2022 20:54 UTC (Sat) by roblucid (guest, #48964)

If the stdin/stdout are unrelated, why would synchronous behaviour be needed? It would not be practical to change buffered I/O, but there was never any guarantee that unflushed output reached a device, so I am not really sure what the point is if read(2)/write(2) were implemented differently in user space using io_uring.
Posted Oct 3, 2022 19:54 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

But yeah, X.509 should be replaced. JWK is not too bad as the base for the replacement, and it's slowly evolving in this direction anyway.
Posted Aug 11, 2022 10:58 UTC (Thu) by rwmj (subscriber, #5474)

Of course we also use the NBD protocol for its intended purpose too, since connecting applications with NBD over either Unix or TCP sockets is very useful for shipping huge disk images around. (Disclaimer: I'm one of the authors of nbdkit.)
Posted Aug 13, 2022 16:26 UTC (Sat) by ming.lei (guest, #74703)

I am working on ublk-qcow2[1]; so far read-only is working. My simple fio test (randread, 4k, libaio, dio, ...) shows ublk-qcow2 reaching ~2X the IOPS of qemu-nbd.

> Of course we also use the NBD protocol for its intended purpose too since connecting
> applications with NBD either over Unix or TCP sockets is very useful for shipping huge
> disk images around.

The NBD driver can be re-implemented in user space via ublksrv/io_uring; then better performance may be reached. Many cases can even be implemented via ublk directly.
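(A fio invocation matching those parameters might look like the following; the device path, queue depth, and runtime here are illustrative guesses rather than Lei's actual job file:)

fio --name=randread --filename=/dev/ublkb0 --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=64 --time_based --runtime=30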
Posted Aug 12, 2022 2:07 UTC (Fri) by motiejus (subscriber, #92837)

John, if you are reading this message: how do you do it? Do you subscribe and read everything in lkml?

Thank you for a great article.
How it's done

Posted Aug 12, 2022 3:44 UTC (Fri) by corbet (editor, #1)

I follow a few dozen kernel mailing lists, not just linux-kernel. I certainly don't read everything, though; after a while you get pretty good at figuring out what's actually worth looking at.

The key tools are Gnus and NNTP; it wouldn't be possible otherwise. When projects move off of email to centralized services, they become much harder to follow; fortunately for me, the kernel seems in no danger of doing that.