One year ago, the io_uring subsystem did not exist in the mainline kernel; it showed up in the 5.1 release in May 2019. At its core, io_uring is a mechanism for performing asynchronous I/O, but it has been steadily growing beyond that use case and adding new capabilities. Herein we catch up with the current state of io_uring, where it is headed, and an interesting question or two that will come up along the way.
Classic Unix I/O is inherently synchronous. As far as an application is concerned, an operation is complete once a system call like read() or write() returns, even if some processing may continue behind its back. There is no way to launch an operation asynchronously and wait for its completion at some future time — a feature that many other operating systems had for many years before Unix was created.
In the Linux world, this gap was eventually filled with the asynchronous I/O (AIO) subsystem, but that solution has never proved to be entirely satisfactory. AIO requires specific support at the lower levels, so it never worked well outside of a couple of core use cases (direct file I/O and networking). Over the years there have been recurring conversations about better ways to solve the asynchronous-I/O problem. Various proposals with names like fibrils, threadlets, syslets, acall, and work-queue-based AIO have been discussed, but none have made it into the mainline.
The latest attempt in that series is io_uring, which did manage to get merged. Unlike its predecessors, io_uring is built around a ring buffer in memory shared between user space and the kernel; that allows the submission of operations (and collecting the results) without the need to call into the kernel in many cases. The interface is somewhat complex, but for many applications that perform massive amounts of I/O, that complexity is paid back in increased performance. See this document [PDF] for a detailed description of the io_uring API. Use of this API can be somewhat simplified with the liburing library.
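To give a sense of how that interface is used in practice, here is a minimal sketch based on liburing: it sets up a small ring, submits a single readv() request, and waits for the completion to show up. The file name is arbitrary and most error handling has been omitted.

```c
/* Minimal liburing sketch: one asynchronous readv() through an io_uring.
 * The file name is arbitrary; error handling is mostly omitted for brevity. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd;

	if (io_uring_queue_init(8, &ring, 0) < 0)	/* rings with eight entries */
		return 1;
	fd = open("/etc/hostname", O_RDONLY);
	if (fd < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);			/* grab a submission-queue entry */
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);	/* fill it in as IORING_OP_READV */
	io_uring_submit(&ring);				/* tell the kernel about it */

	io_uring_wait_cqe(&ring, &cqe);			/* wait for the completion */
	printf("read returned %d\n", cqe->res);		/* bytes read, or a negative errno */
	io_uring_cqe_seen(&ring, cqe);			/* mark the completion as consumed */

	io_uring_queue_exit(&ring);
	return 0;
}
```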
Every entry placed into the io_uring submission ring carries an opcode telling the kernel what is to be done. When io_uring was added to the 5.1 kernel, the available opcodes were:
- IORING_OP_NOP
- This operation does nothing at all; the benefits of doing nothing asynchronously are minimal, but sometimes a placeholder is useful.
- IORING_OP_READV
- IORING_OP_WRITEV
- Submit a readv() or writev() operation — the core purpose of io_uring in most settings.
- IORING_OP_READ_FIXED
- IORING_OP_WRITE_FIXED
- These opcodes also submit I/O operations, but they use "registered" buffers that are already mapped into the kernel, reducing the per-operation overhead.
- IORING_OP_FSYNC
- Issue an fsync() call — asynchronous synchronization, in other words.
- IORING_OP_POLL_ADD
- IORING_OP_POLL_REMOVE
- IORING_OP_POLL_ADD will perform a poll() operation on a set of file descriptors. It's a one-shot operation that must be resubmitted after it completes; it can be explicitly canceled with IORING_OP_POLL_REMOVE. Polling this way can be used to asynchronously keep an eye on a set of file descriptors. The io_uring subsystem also supports a concept of dependencies between operations; a poll could be used to hold off on issuing another operation until the underlying file descriptor is ready for it.
That functionality was enough to drive some significant interest in io_uring; its creator, Jens Axboe, could have stopped there and taken a break for a while. That, however, is not what happened. Since the 5.1 release, the following operations have been added:
- IORING_OP_SYNC_FILE_RANGE (5.2)
- Perform a sync_file_range() call — essentially an enhancement of the existing fsync() support, though without all of the guarantees of fsync().
- IORING_OP_SENDMSG (5.3)
- IORING_OP_RECVMSG (5.3)
- These operations support the asynchronous sending and receiving of packets over the network with sendmsg() and recvmsg().
- IORING_OP_TIMEOUT (5.4)
- IORING_OP_TIMEOUT_REMOVE (5.5)
- IORING_OP_TIMEOUT completes after a given period of time, as measured either in seconds or in the number of completed io_uring operations. It is a way of forcing a waiting application to wake up even if it would otherwise continue sleeping for more completions. IORING_OP_TIMEOUT_REMOVE, added in 5.5, cancels a pending timeout that is no longer needed.
- IORING_OP_ACCEPT (5.5)
- IORING_OP_CONNECT (5.5)
- Accept a connection on a socket, or initiate a connection to a remote peer.
- IORING_OP_ASYNC_CANCEL (5.5)
- Attempt to cancel an operation that is currently in flight. Whether this attempt will succeed depends on the type of operation and how far along it is.
- IORING_OP_LINK_TIMEOUT (5.5)
- Create a timeout linked to a specific operation in the ring. Should that operation still be outstanding when the timeout happens, the kernel will attempt to cancel the operation. If, instead, the operation completes first, the timeout will be canceled; a sketch of this pattern appears below.
That is where the io_uring interface will stand as of the final 5.5 kernel release.
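As a rough sketch of how the linked-timeout mechanism fits together (reusing the ring, fd, and iov from the example above, and assuming a liburing recent enough to provide io_uring_prep_link_timeout()), a read can be bounded by a one-second timeout like this:

```c
/* Sketch: a readv() linked to a one-second timeout. If the read has not
 * completed when the timeout fires, the kernel will try to cancel it;
 * if the read completes first, the timeout is canceled instead. */
struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };

sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, &iov, 1, 0);
sqe->flags |= IOSQE_IO_LINK;			/* link the next SQE to this one */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_link_timeout(sqe, &ts, 0);	/* becomes IORING_OP_LINK_TIMEOUT */

io_uring_submit(&ring);
/* Two completions will eventually arrive: one for the read (-ECANCELED if the
 * timeout won) and one for the timeout (-ETIME if it fired, -ECANCELED if not). */
```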
The development of io_uring is far from complete. To see that, one need merely look into linux-next to see what is queued for 5.6:
- IORING_OP_FALLOCATE
- Manipulate the blocks allocated for a file using fallocate()
- IORING_OP_OPENAT
- IORING_OP_OPENAT2
- IORING_OP_CLOSE
- Open and close files
- IORING_OP_FILES_UPDATE
- Frequently used files can be registered with io_uring for faster access; this command is a way of (asynchronously) adding files to that list or removing them from it. A sketch of how the registered-file table is used appears after this list.
- IORING_OP_STATX
- Query information about a file using statx().
- IORING_OP_READ
- IORING_OP_WRITE
- These are like IORING_OP_READV and IORING_OP_WRITEV, but they use the simpler interface that can only handle a single buffer.
- IORING_OP_FADVISE
- IORING_OP_MADVISE
- Perform the posix_fadvise() and madvise() system calls asynchronously.
- IORING_OP_SEND
- IORING_OP_RECV
- Send and receive network data.
- IORING_OP_EPOLL_CTL
- Perform operations on epoll file-descriptor sets with epoll_ctl()
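For background on the registered-file table that IORING_OP_FILES_UPDATE manipulates, a rough sketch (again building on the example above; the file names here are made up) looks something like this:

```c
/* Sketch: register a small table of files, then refer to them by index.
 * IORING_OP_FILES_UPDATE (5.6) allows entries in this table to be replaced
 * asynchronously, without a full unregister/re-register cycle. */
int fds[2];

fds[0] = open("data0", O_RDONLY);		/* illustrative file names */
fds[1] = open("data1", O_RDONLY);
io_uring_register_files(&ring, fds, 2);		/* build the fixed-file table */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, 1, &iov, 1, 0);	/* "1" is an index into that table ... */
sqe->flags |= IOSQE_FIXED_FILE;			/* ... because of this flag */
io_uring_submit(&ring);
```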
What will happen after 5.6 remains to be seen. There was an attempt to add ioctl() support, but that was shot down due to reliability and security concerns. Axboe has, however, outlined a way in which support for specific ioctl() operations could be added on a case-by-case basis. One can imagine that, for example, the media subsystem, which supports a number of performance-sensitive ioctl() operations, would benefit from this mechanism.
There is also an early patch set adding support for splice().
All told, it would appear that io_uring is quickly growing the sort of capabilities that were envisioned many years ago when the developers were talking about thread-based asynchronous mechanisms. The desire to avoid blocking in event loops is strong; it seems likely that this API will continue to grow until a wide range of tasks can be performed with almost no risk of blocking at all. Along the way, though, there may be a couple of interesting issues to deal with.
One of those is that the opcode field for io_uring commands is only eight bits wide, meaning that up to 256 opcodes can be defined. As of 5.6, 30 opcodes will exist, so there is still plenty of room for growth. There are more than 256 system calls implemented in Linux, though. If io_uring were to grow to the point where it supported most of them, that space would run out.
A different issue was raised by Stefan Metzmacher. Dependencies between commands are supported by io_uring now, so it is possible to hold the initiation of an operation until some previous operation has completed. What is rather more difficult is moving information between operations. In Metzmacher's case, he would like to call openat() asynchronously, then submit I/O operations on the resulting file descriptor without waiting for the open to complete.
It turns out that there is a plan for this: inevitably it calls for ... wait for it ... using BPF to make the connection from one operation to the next. The ability to run bits of code in the kernel at appropriate places in a chain of asynchronous operations would clearly open up a number of interesting new possibilities. "There's a lot of potential there", Axboe said. Indeed, one can imagine a point where an entire program is placed into a ring by a small C "driver", then mostly allowed to run on its own.
There is one potential hitch here, though, in that io_uring is an unprivileged interface; any necessary privilege checks are applied to the individual operations performed. But the plans to make BPF safe for unprivileged users have been sidelined, with explicit statements that unprivileged use will not be supported in the future. That could make BPF hard to use with io_uring. There may be plans for how to resolve this issue lurking deep within Facebook, but they have not yet found their way onto the public lists. It appears that the BPF topic in general will be discussed at the 2020 Linux Storage, Filesystem, and Memory-Management Summit.
In summary, though, io_uring appears to be on a roll with only a relatively small set of growing pains. It will be interesting to see how much more functionality finds its way into this subsystem in the coming releases. Recent history suggests that the growth of io_uring will not be slowing down anytime soon.
The rapid growth of io_uring
Posted Jan 24, 2020 21:54 UTC (Fri) by farnz (subscriber, #17727) [Link]
If splice() is wired in, then you have all the components you need to implement sendfile() as a user of io_uring :-)
The rapid growth of io_uring
Posted Jan 25, 2020 0:23 UTC (Sat) by josh (subscriber, #17465) [Link]
The X Window System solved this decades ago: applications specify the ID they want when creating an object in X, for precisely this reason.
Let the application specify the file descriptor it wants to open, so that it knows what fd number it'll get, and can submit subsequent operations on that file descriptor in the same queue.
The rapid growth of io_uring
Posted Jan 30, 2020 23:51 UTC (Thu) by Karellen (subscriber, #67644) [Link]
Even per-process, how do you know which fds are free?
Are you suggesting that you should enumerate all the fds currently in use by your program, and then make sure to pick one that you know isn't being used, and hope that another thread doesn't race you to getting it anyway?
Sorry, I think I'm missing an important part of the puzzle somewhere here.
The rapid growth of io_uring
Posted Jan 31, 2020 10:39 UTC (Fri) by klempner (subscriber, #69940) [Link]
That allocator wouldn't allocate the same fd twice without a deallocate and could be made threadsafe so you wouldn't have race issues.
The rapid growth of io_uring
Posted Feb 6, 2020 17:26 UTC (Thu) by Wol (subscriber, #4433) [Link]
Run FORTRAN???
That's the way that always worked :-)
Cheers,
Wol
The rapid growth of io_uring
Posted Jan 25, 2020 1:14 UTC (Sat) by wahern (subscriber, #37304) [Link]
> Classic Unix I/O is inherently synchronous. As far as an application is concerned, an operation is complete once a system call like read() or write() returns, even if some processing may continue behind its back. There is no way to launch an operation asynchronously and wait for its completion at some future time — a feature that many other operating systems had for many years before Unix was created.
Unix historically didn't have asynchronous disk I/O, and this affected the architecture of subsystems and drivers. Linux still doesn't have asynchronous buffered disk I/O. All you're doing is offloading the operation to a thread pool in the kernel, which performs the operation synchronously, rather than a thread pool in user land.
Once upon a time Linux eschewed complex APIs in favor of making the basic primitives--forking, threading, syscalls, etc--as fast and efficient as possible. And that's perhaps why all the earlier proposals never made it in--the original strategy was still paying dividends and reduced the pressures that enterprise systems never could cope with. A kernel-land I/O thread pool couldn't schedule I/O any better than a user land thread pool, and if it could the solution was to make it so the user land thread pool could just as well, not to actually relent and expose a specialized I/O thread pool kernel construct.
So now that Linux is moving in the direction of increasingly complex, leaky APIs to cover the performance gap, treading the old waters of enterprise Unix, what does that imply, if anything, about the evolution of Linux and the state of hardware? Are we really better off shifting so much complexity into the kernel? From a security perspective that seems highly doubtful. Are we really better off from a performance perspective? If Linux is abdicating its dogged pursuit of performance improvements, don't we know where this road leads? Or maybe Linux finally hit the wall of what the Unix process and I/O model could fundamentally provide. io_uring is what you'd build as an IPC primitive for a very high performance microkernel. Why keep all this complexity in one, insecure address space when we can finally disperse it across different processes, using a new process model that boxes and aggregates resources in a fundamentally different way.
The rapid growth of io_uring
Posted Jan 27, 2020 13:26 UTC (Mon) by ermo (subscriber, #86690) [Link]
Are you suggesting that it might be worth it to experiment with making Linux a hybrid (macro/micro) kernel and use io_uring as the transport between separate kernel "daemon" processes, moving forward with the separation by splitting out one subsystem at a time?
If not, could you perhaps outline what you had in mind? Inquiring minds would like to know.
The rapid growth of io_uring
Posted Jan 29, 2020 20:13 UTC (Wed) by notriddle (subscriber, #130608) [Link]
What it says about the state of hardware is easy.
Meltdown mitigations made context switching more expensive, and SSDs made the actual I/O cheaper. At some point, you're spending almost as much time context switching as you are doing actual work, and there's nothing Linux can do about it because it's all tied up in the architecture of the MMU and the processor's cache. Thus, it now makes sense to design the interface around reducing the number of context switches at all cost, instead of just assuming that the cost of doing the actual I/O will dominate.
The rapid growth of io_uring
Posted Jan 29, 2020 20:25 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]
And it's not like the VFS layer is impossible to implement asynchronously, it's just that before now it was not needed at all.
The rapid growth of io_uring
Posted Jan 27, 2020 0:36 UTC (Mon) by cesarb (subscriber, #6266) [Link]
Is there any plan to allow the kernel to choose the buffer, so that you could give the kernel a small set of buffers (or a single large buffer) and the kernel would pick one (or carve a piece of that large buffer) when the data arrives? Or is that actually not an issue in real-world use cases?
The rapid growth of io_uring
Posted Jan 27, 2020 1:44 UTC (Mon) by farnz (subscriber, #17727) [Link]
While you can't have the kernel choose which buffer is in use for each pending read, you can make use of IORING_OP_POLL_ADD to poll all 1,000 network connections via the uring, and then use IORING_OP_READV to read only from the connections that are ready, using a smaller number of buffers than there are connections.
That's basically the same work level as epoll + read, but with both the epoll and the read being done asynchronously.
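A rough sketch of that pattern with liburing (assuming ring, sockfd, and iov already exist, and checking only for POLLIN):

```c
/* Poll one connection via the ring; issue the read only once it is readable. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_poll_add(sqe, sockfd, POLLIN);	/* one-shot IORING_OP_POLL_ADD */
sqe->user_data = (__u64)sockfd;			/* remember which fd this was */
io_uring_submit(&ring);

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
if (cqe->res > 0 && (cqe->res & POLLIN)) {
	int ready_fd = (int)cqe->user_data;	/* only now is a buffer needed */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, ready_fd, &iov, 1, 0);
	io_uring_submit(&ring);
}
io_uring_cqe_seen(&ring, cqe);
```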
The rapid growth of io_uring
Posted Jan 27, 2020 11:34 UTC (Mon) by seanyoung (subscriber, #28711) [Link]
Thanks Jens Axboe!
PS I can't wait to see this integrated with rust async_std.
The rapid growth of io_uring
Posted Feb 9, 2020 22:47 UTC (Sun) by dcoutts (guest, #5387) [Link]
The io_uring API is designed to scale to a moderate number of simultaneous I/O operations. The size of the submit and collect rings has to be enough to cover all the simultaneous operations. The max ring size is 4k entries. This is fine for the use case of disk I/O, connect, accept etc.
It's not fine for the "10k problem" of having 10s of 1000s of idle network connections. That's what epoll is designed for. We don't really want to have 10s of 1000s of pending async IO recv operations, we just want to wait for data to arrive on any connection, and then we can execute the IO op to collect the data.
So what's the idea for handling large numbers of network connections using io_uring, or some combo of io_uring and epoll? We have IORING_OP_POLL_ADD but of course this costs one io_uring entry so we can't go over 4k of them. There's IORING_OP_EPOLL_CTL for adjusting the fds in an epoll set. But there's no io_uring operation for epoll_wait. So do we have to use both io_uring and epoll_wait? Now that needs two threads, so no nice single-threaded event loop.
Perhaps I'm missing something. If not, isn't the obvious thing to add support for IORING_OP_EPOLL_WAIT? Then we can use IORING_OP_EPOLL_CTL to adjust the network fds we're monitoring and then issue a single IORING_OP_EPOLL_WAIT to wait for any network fd to have activity.
Alternatively, io_uring could subsume the epoll API entirely. The single-shot style of IORING_OP_POLL_ADD is actually very nice. But it has to scale to the 10k+ case, so cannot consume a completion queue entry for each fd polled like IORING_OP_POLL_ADD does.
The rapid growth of io_uring
Posted Feb 9, 2020 23:05 UTC (Sun) by andresfreund (subscriber, #69562) [Link]
Early on that could lead to the completion queue overflowing, but now those completions are saved (but no new submissions are allowed, basically).
The rapid growth of io_uring
Posted Feb 10, 2020 1:33 UTC (Mon) by dcoutts (guest, #5387) [Link]
So is this pattern of use with sockets expected to perform well, e.g. compared to epoll? If so, that's fantastic, we really can use just one system for both disk I/O and network I/O. The one-shot behaviour of IORING_OP_POLL_ADD is entirely adequate (covering epoll's EPOLLONESHOT which is its most useful mode).
The rapid growth of io_uring
Posted Feb 10, 2020 3:29 UTC (Mon) by andresfreund (subscriber, #69562) [Link]
Yes, that's correct to my knowledge. That wasn't the case in the first kernels with support for io_uring however (you need to test for IORING_FEAT_NODROP to know). But for poll like things it's also not that hard to handle the overflow case - just treat all of the sockets as ready. But that doesn't work at all for asynchronous network IO (rather than just readiness).
And obviously, you'd need to submit the 10k IORING_OP_POLL_ADDs in smaller batches, because the submission queue wouldn't be that long. But that's fine.
FWIW, I am fairly sure you can submit an IORING_OP_POLL_ADD on an epoll fd. Signal readiness once epoll_pwait would return (without returning the events of course). So if you need something for which oneshot style behaviour isn't best suited, or which you might want to wait on only some of the time, you can also compose io_uring and epoll. Caveat: I have not tested this, but I'm fairly sure I've seen code doing so, and a quick look in the code confirms that this should work: eventpoll_fops has a .poll implementation, which is what io_uring (via io_poll_add) relies on to register to get notified for readiness.
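A rough, untested sketch of that composition (epfd being an existing epoll instance, with the ring set up as usual):

```c
/* Wait for the epoll fd to become readable via io_uring, then collect the
 * actual events with a non-blocking epoll_wait(). */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_poll_add(sqe, epfd, POLLIN);	/* readiness of the epoll instance */
io_uring_submit(&ring);

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);			/* completes when epoll_wait() would not block */
io_uring_cqe_seen(&ring, cqe);

struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, 0);	/* timeout 0: just harvest what is ready */
/* ... process the n ready events ... */
```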
The rapid growth of io_uring
Posted May 25, 2020 23:25 UTC (Mon) by rand0m$tring (subscriber, #125230) [Link]
https://github.com/torvalds/linux/blob/444565650a5fe9c63d...
so that's the point at which it would reach capacity and fall over.
The rapid growth of io_uring
Posted May 25, 2020 22:31 UTC (Mon) by rand0m$tring (subscriber, #125230) [Link]
A buffer per operation is also a nonstarter for the same reason. In 5.7 IOSQE_BUFFER_SELECT attempts to solve the buffer problem with rotations. But I think an input / output ring buffer solution that allows allocation at the tail when needed, and just returns ranges into the input or output buffer makes more sense?