
An io_uring-based user-space block driver

By Jonathan Corbet
August 8, 2022
The addition of the ublk driver during the 6.0 merge window would have been easy to miss; it was buried deeply within an io_uring pull request and is entirely devoid of any sort of documentation that might indicate why it merits a closer look. Ublk is intended to facilitate the implementation of high-performance block drivers in user space; to that end, it uses io_uring for its communication with the kernel. This driver is considered experimental for now; if it is successful, it might just be a harbinger of more significant changes to come to the kernel in the future.

Your editor has spent a fair amount of time beating his head against the source for the ublk driver, as well as the ubdsrv server that comprises the user-space component. The picture that has emerged from this exploration of that uncommented and vowel-deficient realm is doubtless incorrect in some details, though the overall shape should be close enough to reality.

How ublk works

The ublk driver starts by creating a special device called /dev/ublk-control. The user-space server (or servers; there can be more than one) starts by opening that device and setting up an io_uring ring to communicate with it. Operations at this level are essentially ioctl() commands, but /dev/ublk-control has no ioctl() handler; all operations are, instead, sent as commands through io_uring. Since the purpose is to implement a device behind io_uring, the reasoning seems to be, there is no reason not to use it from the beginning.

A server will typically start with a UBLK_CMD_ADD_DEV command; as one might expect, it adds a new ublk device to the system. The server can describe various aspects of this device, including the number of hardware queues it claims to implement, its block size, the maximum transfer size, and the number of blocks the device can hold. Once this command succeeds, the device exists as far as the ublk driver is concerned and is visible as /dev/ublkcN, where N is the device ID returned when the device is created. The device has not yet been added to the block layer, though.
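
In rough terms, creating a device looks something like the following. This is a condensed sketch rather than code from ubdsrv: error handling is omitted, the SQE layout for IORING_OP_URING_CMD is the one merged for 6.0, and the device-description fields (queue count, depth, block size, capacity) are left to be filled in from the definitions in &lt;linux/ublk_cmd.h&gt;:

    /*
     * Sketch: create a ublk device with UBLK_CMD_ADD_DEV.  Control commands
     * carry a 32-byte ublksrv_ctrl_cmd payload in the SQE, so the control
     * ring must be created with the IORING_SETUP_SQE128 flag.
     */
    #include <fcntl.h>
    #include <string.h>
    #include <liburing.h>
    #include <linux/ublk_cmd.h>

    int ublk_add_dev(struct io_uring *ring, int ctrl_fd,
                     struct ublksrv_ctrl_dev_info *info)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct ublksrv_ctrl_cmd *cmd = (struct ublksrv_ctrl_cmd *)&sqe->cmd;
        struct io_uring_cqe *cqe;
        int ret;

        memset(sqe, 0, 2 * sizeof(*sqe));        /* one 128-byte SQE */
        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd = ctrl_fd;                        /* the open /dev/ublk-control */
        sqe->cmd_op = UBLK_CMD_ADD_DEV;

        cmd->dev_id = info->dev_id;               /* -1 asks the driver to pick an ID */
        cmd->addr = (unsigned long)info;          /* driver reads the description here... */
        cmd->len = sizeof(*info);                 /* ...and writes the result back */

        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        ret = cqe->res;                           /* negative errno on failure */
        io_uring_cqe_seen(ring, cqe);
        return ret;
    }

The control ring itself would have been set up beforehand with io_uring_queue_init() and the IORING_SETUP_SQE128 flag; UBLK_CMD_START_DEV and the other control commands described below follow the same pattern.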

The server should open the new /dev/ublkcN device for the following steps, the first of which is to map a region from the device into the server's address space with an mmap() call. This region is an array of ublksrv_io_desc structures describing I/O requests:

    struct ublksrv_io_desc {
	/* op: bit 0-7, flags: bit 8-31 */
	__u32		op_flags;
	__u32		nr_sectors;
	__u64		start_sector;
	__u64		addr;
    };
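
For a single-queue device, mapping that array might look like this minimal sketch; multi-queue devices map one such region per queue, at offsets derived from the constants in &lt;linux/ublk_cmd.h&gt;:

    /* Sketch: map queue 0's descriptor array from /dev/ublkcN.  The mapping
     * is read-only; the kernel fills in a descriptor for each request. */
    #include <sys/mman.h>
    #include <unistd.h>

    static struct ublksrv_io_desc *map_io_descs(int cdev_fd, unsigned int queue_depth)
    {
        size_t len = queue_depth * sizeof(struct ublksrv_io_desc);
        size_t page = sysconf(_SC_PAGESIZE);

        len = (len + page - 1) & ~(page - 1);     /* round up to a full page */
        return mmap(NULL, len, PROT_READ, MAP_SHARED, cdev_fd,
                    UBLKSRV_CMD_BUF_OFFSET);
    }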

Notification of new I/O requests will be received via io_uring. To get to that point, the server must enqueue a set of UBLK_IO_FETCH_REQ requests on the newly created device; normally there will be one for each request slot in each "hardware queue" declared for the device, with each queue typically being serviced by its own thread within the server. Among other things, each of these requests must provide a memory buffer that can hold the maximum request size declared when the device was created.
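
Queueing those initial requests for a single queue might look like the following sketch; buffers[] is a hypothetical array holding one buffer per slot, each large enough for the maximum transfer size declared at creation time:

    /* Sketch: arm every request slot ("tag") of queue 0 with UBLK_IO_FETCH_REQ.
     * qring is a per-queue ring created with IORING_SETUP_SQE128, and cdev_fd
     * is the open /dev/ublkcN device. */
    for (unsigned int tag = 0; tag < queue_depth; tag++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&qring);
        struct ublksrv_io_cmd *io = (struct ublksrv_io_cmd *)&sqe->cmd;

        memset(sqe, 0, 2 * sizeof(*sqe));
        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd = cdev_fd;
        sqe->cmd_op = UBLK_IO_FETCH_REQ;
        sqe->user_data = tag;                     /* to match up the completion later */

        io->q_id = 0;
        io->tag = tag;
        io->addr = (unsigned long)buffers[tag];   /* this slot's data buffer */
    }
    io_uring_submit(&qring);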

Once this setup is complete, a separate UBLK_CMD_START_DEV operation will cause the ublk driver to actually create a block device visible to the rest of the system. When the block subsystem sends a request to this device, one of the queued UBLK_IO_FETCH_REQ operations will complete. The completion data returned to the user-space server will include the index of the ublksrv_io_desc structure describing the request, which the server should now execute. For a write request, the data to be written will be in the buffer that was provided by the server; for a read, the data should be placed in that same buffer.
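
Servicing one of those completions might look like this fragment, which builds on the sketches above; handle_read() and handle_write() are hypothetical stand-ins for whatever the target does with its backing store, returning a byte count or a negative error code:

    /* Sketch: wait for the block layer to send a request, then carry it out.
     * Sector values in the descriptor are in 512-byte units. */
    struct io_uring_cqe *cqe;
    unsigned int tag, op;
    size_t bytes;
    int res;

    io_uring_wait_cqe(&qring, &cqe);
    tag = cqe->user_data;
    io_uring_cqe_seen(&qring, cqe);

    const struct ublksrv_io_desc *iod = &descs[tag];
    op = iod->op_flags & 0xff;                    /* the low byte is the operation */
    bytes = (size_t)iod->nr_sectors << 9;

    switch (op) {
    case UBLK_IO_OP_READ:                         /* put the data into this slot's buffer */
        res = handle_read(buffers[tag], iod->start_sector << 9, bytes);
        break;
    case UBLK_IO_OP_WRITE:                        /* the data to write is already there */
        res = handle_write(buffers[tag], iod->start_sector << 9, bytes);
        break;
    default:                                      /* flush, discard, etc. not handled here */
        res = -ENOTSUP;
        break;
    }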

When the operation is complete, the server must inform the kernel of that fact; this is done by placing a UBLK_IO_COMMIT_AND_FETCH_REQ operation into the ring. It will give the result of the operation back to the block subsystem, but will also enqueue the buffer to receive the next request, thus avoiding the need to do that separately.
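
Continuing the sketch, completing the request and re-arming its slot is a single operation:

    /* Sketch: report the result for this tag and, at the same time, queue the
     * slot's buffer to receive the next request. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&qring);
    struct ublksrv_io_cmd *io = (struct ublksrv_io_cmd *)&sqe->cmd;

    memset(sqe, 0, 2 * sizeof(*sqe));
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = cdev_fd;
    sqe->cmd_op = UBLK_IO_COMMIT_AND_FETCH_REQ;
    sqe->user_data = tag;

    io->q_id = 0;
    io->tag = tag;
    io->result = res;                             /* bytes transferred or negative errno */
    io->addr = (unsigned long)buffers[tag];       /* buffer for the next request */

    io_uring_submit(&qring);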

There are the expected UBLK_CMD_STOP_DEV and UBLK_CMD_DEL_DEV operations to make existing devices go away, and a couple of other operations to query information about existing devices. There are also a number of details that have not been covered here, mostly aimed at increased performance. Among other things, the ublk protocol is set up to enable zero-copy I/O, but that is not implemented in the current code.

The server code implements two targets: null and loop. The null target is, as one might expect, an overly complicated, block-oriented version of /dev/null; it is useless but makes it possible to see how things work with a minimum of unrelated details. The loop target uses an existing file as the backing store for a virtual block device. According to author Ming Lei, with this loop implementation, "the performance is is even better than kernel loop with same setting".

Implications

One might wonder why this work has been done (and evidently supported by Red Hat); if the world has been clamoring for an io_uring-based, user-space, faster loop block device, it has done so quietly. One advantage cited in the patch cover letter is that development of block-driver code is more easily done in user space; another is high-performance qcow2 support. The patch cover letter also cites interest expressed by other developers in having a fast user-space block-device mechanism available.

An interesting question, though, is whether this mechanism might ultimately facilitate the movement of a number of device drivers out of the kernel — perhaps not just block drivers. Putting device drivers into user-space code is a fundamental concept in a number of secure-system designs, including microkernel systems. But one of the problems with those designs has always been the communication overhead between the two components once they are no longer running within the same address space. Io_uring might just be a convincing answer to that problem.

Should that scenario play out, kernels of the future could look significantly different from what we have today; they could be smaller, with much of the complicated logic running in separate, user-space components. Whether this is part of Lei's vision for ublk is unknown, and things may never get anywhere near that point. But ublk is clearly an interesting experiment that could lead to big changes down the line. Something will need to be done about that complete absence of documentation, though, on the way toward world domination.

Index entries for this article
Kernel: Block layer/Block drivers
Kernel: io_uring
Kernel: Releases/6.0



An io_uring-based user-space block driver

Posted Aug 8, 2022 15:45 UTC (Mon) by sbates (subscriber, #106518) [Link] (5 responses)

I am both intrigued and confused by this addition to the kernel ;-). One part that is missing for me is how we assign a physical storage device (like an NVMe namespace or a SCSI LUN) to this ublk driver. Thanks for writing this article to add some info to the (rather) disappointing documentation, but more clarification on the link between ublk and physical storage devices would be welcome.

Cheers

Stephen

An io_uring-based user-space block driver

Posted Aug 8, 2022 16:00 UTC (Mon) by NHO (subscriber, #104320) [Link] (1 responses)

Currently, you don't.

You just have the loop and the null targets; physical devices have drivers in kernel space.

An io_uring-based user-space block driver

Posted Aug 8, 2022 20:26 UTC (Mon) by sbates (subscriber, #106518) [Link]

Thanks for the clarification!

An io_uring-based user-space block driver

Posted Aug 12, 2022 11:01 UTC (Fri) by stefanha (subscriber, #55072) [Link]

Userspace can use the Linux VFIO API to implement an NVMe PCI or SCSI HBA PCI driver in userspace.

An io_uring-based user-space block driver

Posted Aug 13, 2022 15:29 UTC (Sat) by ming.lei (guest, #74703) [Link]

ublk can be used to implement 'virtual' block devices in user space, such as loop, nbd, iscsi, qcow2, .....

And so far it can't move physical device drivers out of the kernel, but it turns out that the io_uring passthrough command is a very efficient communication channel between user and kernel space. In the future, it may be extended for other user-space drivers or components.

An io_uring-based user-space block driver

Posted Aug 31, 2022 9:59 UTC (Wed) by Darkstar (guest, #28767) [Link]

You could use that framework to write your own user-space driver to access a physical device.

Say you have some fancy EEPROM hooked up to the GPIO pins of your Raspberry Pi. You could write a Python or C program that drives the GPIO pins correctly and uses ublk to make the EEPROM's data accessible to the kernel as a block device.

You could use a KryoFlux or SCP device to build an io_uring-based replacement for the floppy driver, which would probably be a fun exercise :)

An io_uring-based user-space block driver

Posted Aug 8, 2022 17:58 UTC (Mon) by shemminger (subscriber, #5739) [Link]

Could this be used by SPDK project?
https://spdk.io/

or is it a parallel effort?

An io_uring-based user-space block driver

Posted Aug 8, 2022 22:51 UTC (Mon) by xecycle (subscriber, #140261) [Link] (6 responses)

Do we have a comparison of this vs. VDUSE/TCMU?

An io_uring-based user-space block driver

Posted Aug 8, 2022 23:28 UTC (Mon) by Paf (subscriber, #91811) [Link] (2 responses)

What is VDUSE/TCMU?

An io_uring-based user-space block driver

Posted Aug 9, 2022 1:53 UTC (Tue) by xecycle (subscriber, #140261) [Link]

Well these are not a "combined VDUSE/TCMU", but VDUSE and TCMU. VDUSE is VDPA devices in user-space, and TCMU is target-core module in user-space; both can be used as an interface to a user-space daemon to provide block devices or SCSI targets. But IIRC VDUSE has implemented only virtio-blk so far, although IMO it can be extended to support virtio-scsi protocol.

An io_uring-based user-space block driver

Posted Aug 9, 2022 2:25 UTC (Tue) by felixfix (subscriber, #242) [Link]

Is that a rhetorical question? Seems they have a lot in common then.

An io_uring-based user-space block driver

Posted Aug 9, 2022 3:16 UTC (Tue) by hsiangkao (guest, #123981) [Link]

Honestly, apart from other concerns, I have the same question about this. vDPA is also easy for offloading.

An io_uring-based user-space block driver

Posted Aug 9, 2022 4:27 UTC (Tue) by old-memories (guest, #160155) [Link] (1 responses)

TCMU and UBLK are both user-space block drivers. TCMU provides block devices such as /dev/sdX, and the backend could be file/rbd/qcow/optical_file.
I have tested the performance of TCMU and UBLK. TCMU results in longer I/O latency, since it uses the SCSI protocol while UBLK doesn't need it. Besides, TCMU does not support multiqueue (only one command ring with a coarse-grained lock), so it behaves worse with multiple FIO jobs. UBLK does support multiqueue, and there is one io_uring instance per queue, so it benefits from blk-mq.

An io_uring-based user-space block driver

Posted Aug 9, 2022 4:39 UTC (Tue) by hsiangkao (guest, #123981) [Link]

IMO, an alternative approach to io_uring-based ublk is one over VIRTIO, if we consider device offloading as well, since VIRTIO has a mature ecosystem with an even longer history. I don't think TCMU is worth comparing as of today, but it would be better to compare ublk with vDPA/VDUSE from a performance and ecosystem perspective [such as device offloading, virtual-machine support, etc.].

An io_uring-based user-space block driver

Posted Aug 9, 2022 7:36 UTC (Tue) by flussence (guest, #85566) [Link] (2 responses)

I can see value in this for punting most of the exotic/elderly/removable disk drivers in the kernel to userspace.

(Ideally any hardware that lives beyond the boundary of an external port would never have a driver stack running as root, but that's a way off)

An io_uring-based user-space block driver

Posted Aug 9, 2022 15:04 UTC (Tue) by k3ninho (subscriber, #50375) [Link] (1 responses)

But I'm just about to use CXL to expose all my lanes to external connectors..!

K3n.

An io_uring-based user-space block driver

Posted Aug 9, 2022 21:36 UTC (Tue) by ejr (subscriber, #51652) [Link]

You owe me a keyboard. ;)

Missing documentation

Posted Aug 9, 2022 9:32 UTC (Tue) by imphil (subscriber, #62487) [Link] (1 responses)

Wouldn't it be nice if maintainers would at least require basic documentation before pulling a feature? The kernel requires open-source user-space drivers for GPUs to ensure the driver code can be tested, but something even more basic, such as "just write a couple of words about how the feature works", seems not to be necessary.

Missing documentation

Posted Aug 13, 2022 15:23 UTC (Sat) by ming.lei (guest, #74703) [Link]

The ublksrv README provides one simple doc on ublk, but yes, we can make one with more details; I will do it in the 6.0 release cycle if no one else is working on it.

Another reason is that the idea & implementation are pretty simple & straightforward.

https://github.com/ming1/ubdsrv/blob/master/README

An io_uring-based user-space block driver

Posted Aug 9, 2022 10:01 UTC (Tue) by ddevault (subscriber, #99589) [Link] (8 responses)

Last year I mused to Greg K-H that a microkernel using an io_uring-style interface would probably be able to overcome the performance differences between micro- and monolithic kernels. Nice to see that thought validated :)

One thing I suspect will come out of this work is support for TLS with nbd. Right now, as I understand it, the kernel half of the nbd subsystem cannot handle encrypted connections.

An io_uring-based user-space block driver

Posted Aug 10, 2022 1:34 UTC (Wed) by willmo (subscriber, #82093) [Link]

Apple’s DriverKit framework uses submission/completion queue pairs to communicate with userspace network drivers, although apparently not block drivers. Of course they have a lot more microkernel heritage already.

An io_uring-based user-space block driver

Posted Aug 10, 2022 3:36 UTC (Wed) by wahern (subscriber, #37304) [Link] (1 responses)

> Last year I mused to Greg K-H that a microkernel using an io_uring-style interface would probably be able to overcome the performance differences between micro- and monolithic kernels.

This is what microkernel people have been shouting for years. The model seL4 settled on (IIUC) was synchronous, short, typed messages copied and routed through the kernel for standard IPC, complemented by a generic, untyped page sharing facility permitting two or more processes to set up direct message passing a la io_uring. There's also an asynchronous signaling mechanism to assist with the latter, but ultimately you're given a simple, synchronous IPC mechanism, along with the basic tools to build your own high-performance IPC mechanism without the microkernel getting in your way. As far as I understand, this model is more practical and much less opinionated than earlier common approaches among microkernels, and also more closely parallels where Linux is headed.

What seL4 and other microkernels lack is the mindshare and incentive to actually build things atop these layers, regardless of their merit. seL4 especially demands huge, upfront time investments because of its build system. (Genode.org might alleviate much of this pain, though.) It's unsurprising these architectures will be replicated haphazardly on Linux, especially given the demands of multicore scaling, which heavily favor message passing architectures. But that's the age-old tech story, a consequence of path dependency, et al. Maybe io_uring will create an ecosystem of software more readily portable to microkernels, and maybe the increasing complexity of Linux will motivate such migrations. But probably not.

An io_uring-based user-space block driver

Posted Aug 10, 2022 8:04 UTC (Wed) by ddevault (subscriber, #99589) [Link]

I happen to be working on a microkernel of my own:

https://git.sr.ht/~sircmpwn/helios

I definitely agree with the shortcomings of seL4, both the ones you mentioned and the ones you left out. Most serious new kernel projects these days are microkernels, so it's clear that, if there's a future after Linux, it will be in microkernels. The main issue is getting to that future. There's a reason we're all still using Unix systems after so long.

An io_uring-based user-space block driver

Posted Aug 10, 2022 10:08 UTC (Wed) by guus (subscriber, #41608) [Link] (2 responses)

Perhaps the ideal would be to have all communication between processes and the kernel go via io_urings, including system calls, ioctls, signals, and so on. However, that requires a big mindset change from userspace programmers if they want to make effective use of it. Consider calling printf() twice and then a scanf(): how should these be ordered with respect to each other? The printf()s probably should be done in the order they were enqueued, but the scanf() could start before the printf()s finish. Do we want to make everything completely synchronous by default or completely asynchronous, to have implicit dependencies between operations, or to require the programmer to provide explicit dependencies?

I see parallels with OpenGL and Vulkan here, and I know even experienced programmers find thinking about this hard. So it's no wonder io_uring is currently only used for some specific cases where performance is critical.
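
For what it is worth, io_uring already offers one form of explicit dependency: SQEs can be chained with IOSQE_IO_LINK, so that a linked operation does not start until the previous one has completed. A minimal sketch of the ordering described above, with the two writes chained and the read left free to run concurrently:

    /* Sketch: two writes chained with IOSQE_IO_LINK complete in order;
     * the read is not linked and may run concurrently with them. */
    #include <string.h>
    #include <liburing.h>

    static void ordered_io(struct io_uring *ring, int out_fd, int in_fd,
                           const char *msg1, const char *msg2,
                           char *inbuf, unsigned int inlen)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, out_fd, msg1, strlen(msg1), -1);
        io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);   /* next SQE waits for this one */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, out_fd, msg2, strlen(msg2), -1);

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, in_fd, inbuf, inlen, -1);  /* independent of the writes */

        io_uring_submit(ring);
    }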

An io_uring-based user-space block driver

Posted Aug 11, 2022 0:03 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Most common use-cases can be wrapped into synchronous submit-then-wait helper functions. If you don't care about performance, you just can use them instead of low-level messaging stuff.

An io_uring-based user-space block driver

Posted Aug 20, 2022 20:54 UTC (Sat) by roblucid (guest, #48964) [Link]

That issue was always present in stdio; correct programs would flush printf() output (when isatty(stdout)) before reading input for prompts, so users could see them.
If stdin/stdout are unrelated, why would synchronous behaviour be needed?
It would not be practical to change buffered I/O, but there was never any guarantee that unflushed output reached a device, so I am not really sure what the point is if read(2)/write(2) were implemented differently in user space using io_uring.

An io_uring-based user-space block driver

Posted Oct 3, 2022 13:36 UTC (Mon) by scientes (guest, #83068) [Link] (1 responses)

TLS is a horrible protocol. Have you ever looked at ASN.1? And without RFC 7250, X.509 is mandatory. How about using WireGuard?

An io_uring-based user-space block driver

Posted Oct 3, 2022 19:54 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

ASN.1 is not too bad if you need to limit yourself to X.509 parsing. It's pretty compact and you need to write it only once.

But yeah, X.509 should be replaced. JWK is not too bad as the base for the replacement and it's slowly evolving in this direction anyway.

An io_uring-based user-space block driver

Posted Aug 11, 2022 10:58 UTC (Thu) by rwmj (subscriber, #5474) [Link] (1 responses)

Intrigued by how this compares to the existing approaches which all use the NBD driver in the kernel. (Disclaimer: I'm one of the authors of nbdkit).

Of course we also use the NBD protocol for its intended purpose too since connecting applications with NBD either over Unix or TCP sockets is very useful for shipping huge disk images around.

An io_uring-based user-space block driver

Posted Aug 13, 2022 16:26 UTC (Sat) by ming.lei (guest, #74703) [Link]

> Intrigued by how this compares to the existing approaches which all use the NBD driver in the
> kernel. (Disclaimer: I'm one of the authors of nbdkit).

I am working on ublk-qcow2 [1]; so far read-only is working, and my simple fio test (randread, 4k, libaio, dio, ...) shows ublk-qcow2 delivering ~2X the IOPS of qemu-nbd.

> Of course we also use the NBD protocol for its intended purpose too since connecting
> applications with NBD either over Unix or TCP sockets is very useful for shipping huge
> disk images around.

The NBD driver can be re-implemented in user space via ublksrv/io_uring, and then better performance may be reached. Many cases can even be implemented via ublk directly.

[1] https://github.com/ming1/ubdsrv/tree/qcow2-devel

An io_uring-based user-space block driver

Posted Aug 12, 2022 2:07 UTC (Fri) by motiejus (subscriber, #92837) [Link] (4 responses)

I keep getting impressed by what John finds in the firehose of Linux commits. Sometimes even unmerged ones, and without any documentation.

John, if you are reading this message: how do you do it? Do you subscribe to and read everything on lkml?

Thank you for a great article.

How it's done

Posted Aug 12, 2022 3:44 UTC (Fri) by corbet (editor, #1) [Link] (3 responses)

I follow a few dozen kernel mailing lists, not just linux-kernel. I certainly don't read everything, though. After a while you get pretty good at figuring out what's actually worth looking at.

The key tools are gnus and nntp; it wouldn't be possible otherwise. When projects move off of email to centralized services they become much harder to follow; fortunately for me, the kernel seems in no danger of doing that.

How it's done

Posted Aug 12, 2022 9:53 UTC (Fri) by rcampos (subscriber, #59737) [Link] (2 responses)

What do you mean by gnus?

How it's done

Posted Aug 12, 2022 13:07 UTC (Fri) by madscientist (subscriber, #16861) [Link] (1 responses)

Gnus is an Emacs package that allows you to read email and NNTP (Usenet) from within Emacs. It has a ton of very powerful features for managing especially large amounts of incoming mail/news.

How it's done

Posted Aug 12, 2022 15:20 UTC (Fri) by rcampos (subscriber, #59737) [Link]

Ohh, didn't know. Thanks!


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds