
The rapid growth of io_uring

By Jonathan Corbet
January 24, 2020
One year ago, the io_uring subsystem did not exist in the mainline kernel; it showed up in the 5.1 release in May 2019. At its core, io_uring is a mechanism for performing asynchronous I/O, but it has been steadily growing beyond that use case and adding new capabilities. Herein we catch up with the current state of io_uring, where it is headed, and an interesting question or two that will come up along the way.

Classic Unix I/O is inherently synchronous. As far as an application is concerned, an operation is complete once a system call like read() or write() returns, even if some processing may continue behind its back. There is no way to launch an operation asynchronously and wait for its completion at some future time — a feature that many other operating systems had for many years before Unix was created.

In the Linux world, this gap was eventually filled with the asynchronous I/O (AIO) subsystem, but that solution has never proved to be entirely satisfactory. AIO requires specific support at the lower levels, so it never worked well outside of a couple of core use cases (direct file I/O and networking). Over the years there have been recurring conversations about better ways to solve the asynchronous-I/O problem. Various proposals with names like fibrils, threadlets, syslets, acall, and work-queue-based AIO have been discussed, but none have made it into the mainline.

The latest attempt in that series is io_uring, which did manage to get merged. Unlike its predecessors, io_uring is built around a ring buffer in memory shared between user space and the kernel; that allows the submission of operations (and collecting the results) without the need to call into the kernel in many cases. The interface is somewhat complex, but for many applications that perform massive amounts of I/O, that complexity is paid back in increased performance. See this document [PDF] for a detailed description of the io_uring API. Use of this API can be somewhat simplified with the liburing library.
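
A minimal sketch of that workflow using liburing might look like the following; error handling is mostly omitted, and the file name, queue depth, and buffer size are arbitrary choices for illustration.

    /* Submit one asynchronous readv() and wait for its completion.
     * Build with -luring. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;
        char buf[4096];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0)
            return 1;

        /* Grab a submission-queue entry and describe the operation */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_readv(sqe, fd, &iov, 1, 0);
        io_uring_submit(&ring);             /* hand it to the kernel */

        /* Wait for the corresponding completion-queue entry */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }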

What io_uring can do

Every entry placed into the io_uring submission ring carries an opcode telling the kernel what is to be done. When io_uring was added to the 5.1 kernel, the available opcodes were:

IORING_OP_NOP
This operation does nothing at all; the benefits of doing nothing asynchronously are minimal, but sometimes a placeholder is useful.

IORING_OP_READV
IORING_OP_WRITEV
Submit a readv() or writev() operation — the core purpose of io_uring in most settings.

IORING_OP_READ_FIXED
IORING_OP_WRITE_FIXED
These opcodes also submit I/O operations, but they use "registered" buffers that are already mapped into the kernel, reducing the total per-operation overhead.

IORING_OP_FSYNC
Issue an fsync() call — asynchronous synchronization, in other words.

IORING_OP_POLL_ADD
IORING_OP_POLL_REMOVE
IORING_OP_POLL_ADD will perform a poll() operation on a set of file descriptors. It's a one-shot operation that must be resubmitted after it completes; it can be explicitly canceled with IORING_OP_POLL_REMOVE. Polling this way can be used to asynchronously keep an eye on a set of file descriptors. The io_uring subsystem also supports a concept of dependencies between operations; a poll could be used to hold off on issuing another operation until the underlying file descriptor is ready for it.
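
As a rough sketch of how such a dependency might be expressed through liburing (assuming a ring that has already been set up, and a kernel new enough to support linked requests via IOSQE_IO_LINK, which arrived after 5.1):

    #include <poll.h>
    #include <liburing.h>

    /* Queue a poll on sockfd linked to a readv, so the read is not
     * started until the socket is readable.  The ring, socket, and
     * iovec are assumed to have been set up by the caller. */
    static void queue_poll_then_read(struct io_uring *ring, int sockfd,
                                     struct iovec *iov)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_poll_add(sqe, sockfd, POLLIN);
        sqe->flags |= IOSQE_IO_LINK;   /* the next request waits on this one */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_readv(sqe, sockfd, iov, 1, 0);

        io_uring_submit(ring);
    }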

That functionality was enough to drive some significant interest in io_uring; its creator, Jens Axboe, could have stopped there and taken a break for a while. That, however, is not what happened. Since the 5.1 release, the following operations have been added:

IORING_OP_SYNC_FILE_RANGE (5.2)
Perform a sync_file_range() call — essentially an enhancement of the existing fsync() support, though without all of the guarantees of fsync().

IORING_OP_SENDMSG (5.3)
IORING_OP_RECVMSG (5.3)
These operations support the asynchronous sending and receiving of packets over the network with sendmsg() and recvmsg().

IORING_OP_TIMEOUT (5.4)
IORING_OP_TIMEOUT_REMOVE (5.5)
This operation completes after either a given period of time has passed or a given number of io_uring operations have completed; it is a way of forcing a waiting application to wake up even if it would otherwise continue sleeping for more completions. IORING_OP_TIMEOUT_REMOVE cancels an existing timeout.

IORING_OP_ACCEPT (5.5)
IORING_OP_CONNECT (5.5)
Accept a connection on a socket, or initiate a connection to a remote peer.

IORING_OP_ASYNC_CANCEL (5.5)
Attempt to cancel an operation that is currently in flight. Whether this attempt will succeed depends on the type of operation and how far along it is.

IORING_OP_LINK_TIMEOUT (5.5)
Create a timeout linked to a specific operation in the ring. Should that operation still be outstanding when the timeout happens, the kernel will attempt to cancel the operation. If, instead, the operation completes first, the timeout will be canceled.

That is where the io_uring interface will stand as of the final 5.5 kernel release.
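
As an illustration of the linked-timeout mechanism just described, a sketch using liburing (which wraps these opcodes in sufficiently recent versions) might look like this; the one-second value and the helper name are arbitrary:

    #include <liburing.h>

    /* Attach a one-second timeout to a read; if the read is still
     * outstanding when the timeout fires, the kernel will try to
     * cancel it.  The ring, file descriptor, and iovec come from the
     * caller. */
    static void queue_read_with_timeout(struct io_uring *ring, int fd,
                                        struct iovec *iov)
    {
        struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_readv(sqe, fd, iov, 1, 0);
        sqe->flags |= IOSQE_IO_LINK;   /* the timeout applies to this read */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_link_timeout(sqe, &ts, 0);

        io_uring_submit(ring);
    }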

Coming soon

The development of io_uring is far from complete. To see that, one need merely look into linux-next to see what is queued for 5.6:

IORING_OP_FALLOCATE
Manipulate the blocks allocated for a file using fallocate().

IORING_OP_OPENAT
IORING_OP_OPENAT2
IORING_OP_CLOSE
Open and close files (see the sketch after this list).

IORING_OP_FILES_UPDATE
Frequently used files can be registered with io_uring for faster access; this command is a way of (asynchronously) adding files to the list (or removing them from the list).

IORING_OP_STATX
Query information about a file using statx().

IORING_OP_READ
IORING_OP_WRITE
These are like IORING_OP_READV and IORING_OP_WRITEV, but they use the simpler interface that can only handle a single buffer.

IORING_OP_FADVISE
IORING_OP_MADVISE
Perform the posix_fadvise() and madvise() system calls asynchronously.

IORING_OP_SEND
IORING_OP_RECV
Send and receive network data.

IORING_OP_EPOLL_CTL
Perform operations on epoll file-descriptor sets with epoll_ctl().
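
Taken together, the 5.6 additions make it possible to open, read, and close a file without issuing any of those system calls directly. A sketch using liburing (assuming a version recent enough to provide prep helpers for the new opcodes; error handling is abbreviated) might look like this:

    #include <fcntl.h>
    #include <liburing.h>

    /* Open a file, read into buf, and close it, entirely with io_uring
     * operations.  Note that each step must complete before the next
     * can be submitted, since the file descriptor is not known in
     * advance. */
    static int read_file_async(struct io_uring *ring, const char *path,
                               char *buf, unsigned len)
    {
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int fd, ret;

        sqe = io_uring_get_sqe(ring);              /* IORING_OP_OPENAT */
        io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        fd = cqe->res;                             /* the new descriptor */
        io_uring_cqe_seen(ring, cqe);
        if (fd < 0)
            return fd;

        sqe = io_uring_get_sqe(ring);              /* IORING_OP_READ */
        io_uring_prep_read(sqe, fd, buf, len, 0);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        ret = cqe->res;
        io_uring_cqe_seen(ring, cqe);

        sqe = io_uring_get_sqe(ring);              /* IORING_OP_CLOSE */
        io_uring_prep_close(sqe, fd);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);

        return ret;
    }

The fact that the open must complete before the read can be submitted, because the resulting descriptor is not known in advance, is exactly the limitation discussed below.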

What will happen after 5.6 remains to be seen. There was an attempt to add ioctl() support, but that was shot down due to reliability and security concerns. Axboe has, however, outlined a way in which support for specific ioctl() operations could be added on a case-by-case basis. One can imagine that, for example, the media subsystem, which supports a number of performance-sensitive ioctl() operations, would benefit from this mechanism.

There is also an early patch set adding support for splice().

An asynchronous world

All told, it would appear that io_uring is quickly growing the sort of capabilities that were envisioned many years ago when the developers were talking about thread-based asynchronous mechanisms. The desire to avoid blocking in event loops is strong; it seems likely that this API will continue to grow until a wide range of tasks can be performed with almost no risk of blocking at all. Along the way, though, there may be a couple of interesting issues to deal with.

One of those is that the field for io_uring commands is only eight bits wide, meaning that up to 256 opcodes can be defined. As of 5.6, 30 opcodes will exist, so there is still plenty of room for growth. There are more than 256 system calls implemented in Linux, though. If io_uring were to grow to the point where it supported most of them, that space would run out.

A different issue was raised by Stefan Metzmacher. Dependencies between commands are supported by io_uring now, so it is possible to hold the initiation of an operation until some previous operation has completed. What is rather more difficult is moving information between operations. In Metzmacher's case, he would like to call openat() asynchronously, then submit I/O operations on the resulting file descriptor without waiting for the open to complete.

It turns out that there is a plan for this: inevitably it calls for ... wait for it ... using BPF to make the connection from one operation to the next. The ability to run bits of code in the kernel at appropriate places in a chain of asynchronous operations would clearly open up a number of interesting new possibilities. "There's a lot of potential there", Axboe said. Indeed, one can imagine a point where an entire program is placed into a ring by a small C "driver", then mostly allowed to run on its own.

There is one potential hitch here, though, in that io_uring is an unprivileged interface; any necessary privilege checks are applied to the individual operations being performed. But the plans to make BPF safe for unprivileged users have been sidelined, with explicit statements that unprivileged use will not be supported in the future. That could make BPF hard to use with io_uring. There may be plans for how to resolve this issue lurking deep within Facebook, but they have not yet found their way onto the public lists. It appears that the BPF topic in general will be discussed at the 2020 Linux Storage, Filesystem, and Memory-Management Summit.

In summary, though, io_uring appears to be on a roll with only a relatively small set of growing pains. It will be interesting to see how much more functionality finds its way into this subsystem in the coming releases. Recent history suggests that the growth of io_uring will not be slowing down anytime soon.

Index entries for this article
Kernel: Asynchronous I/O
Kernel: io_uring



The rapid growth of io_uring

Posted Jan 24, 2020 18:26 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

BPF == Biggest Possible Fuckup.

The rapid growth of io_uring

Posted Jan 24, 2020 20:16 UTC (Fri) by Sesse (subscriber, #53779) [Link] (12 responses)

When can we get asynchronous sendfile()? :-)

The rapid growth of io_uring

Posted Jan 24, 2020 21:51 UTC (Fri) by axboe (subscriber, #904) [Link] (9 responses)

As soon as the splice stuff is integrated, you'll have just that. When I initially wrote splice, at the same time I turned sendfile() into a simple wrapper around it. So if you have splice, you have sendfile as well.

The rapid growth of io_uring

Posted Sep 2, 2021 12:36 UTC (Thu) by awkravchuk (guest, #154070) [Link] (8 responses)

How exactly did you do that for the general case? splice(2) requires that one of the fds is a pipe, so it looks like sending e.g. a disk file to a TCP socket wouldn't be possible.

The rapid growth of io_uring

Posted Sep 2, 2021 14:02 UTC (Thu) by farnz (subscriber, #17727) [Link] (7 responses)

There are special cases in the kernel (but not, it appears, in the manpage) to allow file-to-anything splicing by allocating a secret internal pipe. Look at fs/splice.c, function splice_direct_to_actor, for the code.

The rapid growth of io_uring

Posted Sep 2, 2021 16:49 UTC (Thu) by awkravchuk (guest, #154070) [Link] (6 responses)

Great stuff, thanks!
Is there a way to use this from userspace? I'm not an actual kernel hacker, just trying io_uring out :)

The rapid growth of io_uring

Posted Sep 2, 2021 17:14 UTC (Thu) by farnz (subscriber, #17727) [Link] (5 responses)

Looks like no way to use this from userspace.

However, in the io_uring case, you should be able to build a splice-based sendfile yourself using a pipe you create via pipe(2) to act as the buffer. Or do similar via fixed buffers in the ring instead of splice.

The rapid growth of io_uring

Posted Sep 2, 2021 17:40 UTC (Thu) by awkravchuk (guest, #154070) [Link]

That's what I thought. Thank you for clarification!

The rapid growth of io_uring

Posted Nov 3, 2023 4:03 UTC (Fri) by leo60228 (guest, #167812) [Link] (3 responses)

This doesn't work if you want to have multiple calls with different offsets in flight at once, though, does it? If there's a partial read/write, the next splice would use the wrong data, and I can't think of a way to avoid that without only having one splice in flight at a time (which kind of defeats the point, at least for my application).

The rapid growth of io_uring

Posted Nov 3, 2023 12:07 UTC (Fri) by farnz (subscriber, #17727) [Link] (2 responses)

If you're building your own equivalent of this trick in io_uring, you can have multiple splices in flight at once; you'd be using two linked SQEs, one of which splices input into the pipe, and the other of which splices the pipe into the output. Offset tracking is in your hands at this point.

The kernel's trick is simply to create the pipe for you if you don't provide one, and you're doing sync I/O from a file to something not-a-file.

The rapid growth of io_uring

Posted Nov 3, 2023 22:19 UTC (Fri) by leo60228 (guest, #167812) [Link] (1 responses)

I tried implementing that. What I mean is that, for example, if you were trying to copy 0x8000 byte chunks at a time, and the first SQE splicing from the input file to the pipe only copied 0x4000 bytes, the linked SQE would still try to read 0x8000 bytes from the pipe and potentially get the wrong data. Additionally, I'm not sure there's a guarantee that these linked SQEs would be atomic (i.e. if SQE A was linked to SQE B, and SQE C was linked to SQE D, the order A, C, B, D could be allowed and result in data being written to the wrong offsets in the output).

Thinking about it more, I suppose this could be solved by creating a large number of pipes, and making sure that no two SQEs using the same pipe are in flight at the same time. I'm concerned that having many pipes could result in its own performance issues, but it'd probably be fine...?

The rapid growth of io_uring

Posted Nov 5, 2023 10:12 UTC (Sun) by farnz (subscriber, #17727) [Link]

You don't have one pipe for all splices; you have one pipe per splice. If the first SQE copies 0x4000 bytes, then the linked SQE can only copy 0x4000 bytes out of the pipe, because there's only 0x4000 bytes in there to copy out. This is exactly what the kernel trick is - create a temporary pipe for the splice to use, so that you're always splicing in and out of pipes.
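
A rough sketch of this one-pipe-per-splice approach, using liburing's io_uring_prep_splice() helper (the helper function and its parameters here are illustrative; offset bookkeeping and completion handling are left to the caller):

    #include <liburing.h>

    /* A user-space sendfile(): splice from the file into a pipe, then
     * from the pipe into the socket, as two linked requests.  The pipe
     * comes from pipe(2) and is dedicated to this one copy, so a short
     * first splice leaves only that much data in the pipe and the
     * second splice cannot read stale bytes. */
    static void queue_sendfile(struct io_uring *ring, int pipefd[2],
                               int file_fd, long long file_off,
                               int sock_fd, unsigned len)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_splice(sqe, file_fd, file_off, pipefd[1], -1, len, 0);
        sqe->flags |= IOSQE_IO_LINK;   /* run the second splice after this */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_splice(sqe, pipefd[0], -1, sock_fd, -1, len, 0);

        io_uring_submit(ring);
    }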

The rapid growth of io_uring

Posted Jan 24, 2020 21:54 UTC (Fri) by farnz (subscriber, #17727) [Link]

If splice() is wired in, then you have all the components you need to implement sendfile() as a user of io_uring :-)

The rapid growth of io_uring

Posted Jan 28, 2020 10:12 UTC (Tue) by isilence (subscriber, #117961) [Link]

I expect it to be done for-5.7
Nothing really difficult, I just need to address a couple of issues.

The rapid growth of io_uring

Posted Jan 25, 2020 0:23 UTC (Sat) by josh (subscriber, #17465) [Link] (7 responses)

> In Metzmacher's case, he would like to call openat() asynchronously, then submit I/O operations on the resulting file descriptor without waiting for the open to complete.

The X Window System solved this decades ago: applications specify the ID they want when creating an object in X, for precisely this reason.

Let the application specify the file descriptor it wants to open, so that it knows what fd number it'll get, and can submit subsequent operations on that file descriptor in the same queue.

The rapid growth of io_uring

Posted Jan 30, 2020 12:05 UTC (Thu) by Karellen (subscriber, #67644) [Link] (6 responses)

What if that fd is already in use?

The rapid growth of io_uring

Posted Jan 30, 2020 14:39 UTC (Thu) by Baughn (subscriber, #124425) [Link] (3 responses)

The call fails, and you get to fix your code. The IDs should be per-process.

The rapid growth of io_uring

Posted Jan 30, 2020 23:51 UTC (Thu) by Karellen (subscriber, #67644) [Link] (2 responses)

Fix your code? But...

Even per-process, how do you know which fds are free?

Are you suggesting that you should enumerate all the fds currently in use by your program, and then make sure to pick one that you know isn't being used, and hope that another thread doesn't race you to getting it anyway?

Sorry, I think I'm missing an important part of the puzzle somewhere here.

The rapid growth of io_uring

Posted Jan 31, 2020 10:39 UTC (Fri) by klempner (subscriber, #69940) [Link]

You would presumably use an allocator for this, perhaps one that is part of libc.

That allocator wouldn't allocate the same fd twice without a deallocate and could be made threadsafe so you wouldn't have race issues.

The rapid growth of io_uring

Posted Feb 6, 2020 17:26 UTC (Thu) by Wol (subscriber, #4433) [Link]

> Even per-process, how do you know which fds are free?

Run FORTRAN???

That's the way that always worked :-)

Cheers,
Wol

The rapid growth of io_uring

Posted Feb 7, 2020 8:08 UTC (Fri) by renox (guest, #23785) [Link] (1 responses)

Mmm, let's add an "allocate_fd" system call, so now first you allocate/reserve an fd (very fast, you could even preallocate them), then you provide it to the different system calls... which now have to add additional checks, of course.
No way this will happen, but that's an interesting WHAT-IF design change.

The rapid growth of io_uring

Posted Feb 7, 2020 14:41 UTC (Fri) by canoon (guest, #109743) [Link]

All you really need is a range that is reserved for user space allocation. It of course doesn't solve issues around reusing data between requests though.

The rapid growth of io_uring

Posted Jan 25, 2020 1:14 UTC (Sat) by wahern (subscriber, #37304) [Link] (3 responses)

> Classic Unix I/O is inherently synchronous. As far as an application is concerned, an operation is complete once a system call like read() or write() returns, even if some processing may continue behind its back. There is no way to launch an operation asynchronously and wait for its completion at some future time — a feature that many other operating systems had for many years before Unix was created.

Unix historically didn't have asynchronous disk I/O, and this affected the architecture of subsystems and drivers. Linux still doesn't have asynchronous buffered disk I/O. All you're doing is offloading the operation to a thread pool in the kernel, which performs the operation synchronously, rather than a thread pool in user land.

Once upon a time Linux eschewed complex APIs in favor of making the basic primitives--forking, threading, syscalls, etc--as fast and efficient as possible. And that's perhaps why all the earlier proposals never made it in--the original strategy was still paying dividends and reduced the pressures that enterprise systems never could cope with. A kernel-land I/O thread pool couldn't schedule I/O any better than a user land thread pool, and if it could, the solution was to make it so the user land thread pool could do just as well, not to actually relent and expose a specialized I/O thread pool kernel construct.

So now that Linux is moving in the direction of increasingly complex, leaky APIs to cover the performance gap, treading the old waters of enterprise Unix, what does that imply, if anything, about the evolution of Linux and the state of hardware. Are we really better off shifting so much complexity into the kernel? From a security perspective that seems highly doubtful. Are we really better off from a performance perspective? If Linux is abdicating its dogged pursuit of performance improvements, don't we know where this road leads? Or maybe Linux finally hit the wall of what the Unix process and I/O model could fundamentally provide. io_uring is what you'd build as an IPC primitive for a very high performance microkernel. Why keep all this complexity in one, insecure address space when we can finally disperse it across different processes, using a new process model that boxes and aggregates resources in a fundamentally different way.

The rapid growth of io_uring

Posted Jan 27, 2020 13:26 UTC (Mon) by ermo (subscriber, #86690) [Link]

> Why keep all this complexity in one, insecure address space when we can finally disperse it across different processes, using a new process model that boxes and aggregates resources in a fundamentally different way.

Are you suggesting that it might be worth it to experiment with making Linux a hybrid (macro/micro) kernel and use io_uring as the transport between separate kernel "daemon" processes, moving forward with the separation by splitting out one subsystem at a time?

If not, could you perhaps outline what you had in mind? Inquiring minds would like to know.

The rapid growth of io_uring

Posted Jan 29, 2020 20:13 UTC (Wed) by notriddle (subscriber, #130608) [Link]

> So now that Linux is moving in the direction of increasingly complex, leaky APIs to cover the performance gap, treading the old waters of enterprise Unix, what does that imply, if anything, about the evolution of Linux and the state of hardware.

What it says about the state of hardware is easy.

Meltdown mitigations made context switching more expensive, and SSDs made the actual I/O cheaper. At some point, you're spending almost as much time context switching as you are doing actual work, and there's nothing Linux can do about it because it's all tied up in the architecture of the MMU and the processor's cache. Thus, it now makes sense to design the interface around reducing the number of context switches at all cost, instead of just assuming that the cost of doing the actual I/O will dominate.

The rapid growth of io_uring

Posted Jan 29, 2020 20:25 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> Unix historically didn't have asynchronous disk I/O, and this effected the architecture of subsystems and drivers. Linux still doesn't have asynchronous buffered disk I/O. All you're doing is offloading the operation to a thread pool in the kernel, which performs the operation synchronously, rather than a thread pool in user land.
That's not quite true; the BIO layer is already asynchronous (with its own scheduler and everything), and that's where most of the waiting goes.

And it's not like the VFS layer is impossible to implement asynchronously, it's just that before now it was not needed at all.

The rapid growth of io_uring

Posted Jan 26, 2020 11:10 UTC (Sun) by geertj (guest, #4116) [Link] (1 responses)

Very interesting. Does io_uring work with buffered io? Does it work with NFS?

The rapid growth of io_uring

Posted Jan 26, 2020 18:18 UTC (Sun) by axboe (subscriber, #904) [Link]

Yes and yes

The rapid growth of io_uring

Posted Jan 26, 2020 21:49 UTC (Sun) by grober (guest, #136840) [Link] (3 responses)

How does io_uring implement asynchronous operations if everything is synchronous internally? Does it use a thread pool?

The rapid growth of io_uring

Posted Jan 27, 2020 0:23 UTC (Mon) by cesarb (subscriber, #6266) [Link] (1 responses)

Actually, everything is asynchronous internally. The kernel submits a request to the disk hardware, and some time later, the disk hardware interrupts the kernel to tell it's done. In the meantime, the CPU can be doing something else, be it running another thread or (with asynchronous I/O like io_uring) doing something else on the same thread.

The rapid growth of io_uring

Posted Jan 27, 2020 8:45 UTC (Mon) by dezgeg (subscriber, #92243) [Link]

Filesystem code is certainly synchronous for things like reading metadata (inodes) from disk.

The rapid growth of io_uring

Posted Jan 28, 2020 10:50 UTC (Tue) by jezuch (subscriber, #52988) [Link]

I was wondering about this too. Is it just a way to avoid the overhead of system calls, or is it actually, truly asynchronous?

The rapid growth of io_uring

Posted Jan 26, 2020 23:55 UTC (Sun) by willy (subscriber, #9762) [Link] (2 responses)

I've heard complaints there's no AIO readdir(). There may well be a use-case for an IORING_OP_READDIR (although it should come with a user so we're sure we're doing something useful)

The rapid growth of io_uring

Posted Jan 27, 2020 0:19 UTC (Mon) by josh (subscriber, #17465) [Link] (1 responses)

I would hope that such an operation looks more like getdents64 than readdir.

The rapid growth of io_uring

Posted Jan 27, 2020 0:22 UTC (Mon) by willy (subscriber, #9762) [Link]

You're thinking about readdir(2) and getdents64(2). I was thinking about readdir(3).

The rapid growth of io_uring

Posted Jan 27, 2020 0:36 UTC (Mon) by cesarb (subscriber, #6266) [Link] (1 responses)

One drawback of io_uring seems to be that, as far as I know, you need a separate buffer for each pending request. So if you were using io_uring to read data from 1000 network connections at the same time, you'd need 1000 separate buffers, while with synchronous epoll+read, you'd need a single buffer (1000 times less memory).

Is there any plan to allow the kernel to choose the buffer, so that you could give the kernel a small set of buffers (or a single large buffer) and the kernel would pick one (or carve a piece of that large buffer) when the data arrives? Or is that actually not an issue in real-world use cases?

The rapid growth of io_uring

Posted Jan 27, 2020 1:44 UTC (Mon) by farnz (subscriber, #17727) [Link]

While you can't have the kernel choose which buffer is in use for each pending read, you can make use of IORING_OP_POLL_ADD to poll all 1,000 network connections via the uring, and then use IORING_OP_READV to read only from the connections that are ready, using a smaller number of buffers than there are connections.

That's basically the same work level as epoll + read, but with both the epoll and the read being done asynchronously.
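
A rough sketch of that pattern (the helper names and the idea of a small buffer pool are illustrative; the loop that reaps completions and tells poll completions apart from read completions is omitted):

    #include <poll.h>
    #include <liburing.h>

    /* Arm a one-shot poll on a socket; no buffer is committed yet. */
    static void arm_poll(struct io_uring *ring, int sockfd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_poll_add(sqe, sockfd, POLLIN);
        io_uring_sqe_set_data(sqe, (void *)(long)sockfd);  /* remember the fd */
    }

    /* When the poll completes, attach a buffer from a small pool and
     * queue the actual read; the buffer goes back to the pool once the
     * read's completion has been reaped. */
    static void queue_read_when_ready(struct io_uring *ring,
                                      struct io_uring_cqe *cqe,
                                      struct iovec *iov_from_pool)
    {
        int sockfd = (int)(long)io_uring_cqe_get_data(cqe);
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_readv(sqe, sockfd, iov_from_pool, 1, 0);
        io_uring_sqe_set_data(sqe, iov_from_pool);
    }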

The rapid growth of io_uring

Posted Jan 27, 2020 10:03 UTC (Mon) by AndreiG (guest, #90359) [Link] (5 responses)

Is it really that hard to change the name ... ?
Naming is very important in software engineering and we have a long way to go with naming and consistency in kernel code.
That and the title of this article sounds like one of those statistics from Pornhub that nobody asked for.

The rapid growth of io_uring

Posted Jan 27, 2020 18:25 UTC (Mon) by axboe (subscriber, #904) [Link] (4 responses)

Everybody knows that porn is a frequent driver of technology.

The rapid growth of io_uring

Posted Feb 7, 2020 14:45 UTC (Fri) by CúChulainn (guest, #137154) [Link] (3 responses)

Porn is destroying the west at a pace outstripping any technological innovation.

The rapid growth of io_uring

Posted Feb 7, 2020 21:04 UTC (Fri) by flussence (guest, #85566) [Link] (2 responses)

No, that's capitalism doing that. Porn is just a conveniently taboo bogeyman used by puritanical wonks descended from slave owners to shift the spotlight off themselves.

The rapid growth of io_uring

Posted Feb 7, 2020 23:18 UTC (Fri) by CúChulainn (guest, #137154) [Link] (1 responses)

The problem with porn is that it is not taboo. Banks and financialisation are not capitalism. A lot of people want to move their families great distances to live amongst the posterity of slave owners, they must really love puritanical wonks.

The rapid growth of io_uring

Posted Feb 13, 2020 0:22 UTC (Thu) by flussence (guest, #85566) [Link]

Want to try that again, but make sense this time?

The rapid growth of io_uring

Posted Jan 27, 2020 11:34 UTC (Mon) by seanyoung (subscriber, #28711) [Link] (3 responses)

I must say that io_uring looks fantastic. A lot of people have spent a lot of time trying to solve this problem.

Thanks Jens Axboe!

PS I can't wait to see this integrated with rust async_std.

The rapid growth of io_uring

Posted Jan 31, 2020 19:10 UTC (Fri) by sunbains (subscriber, #104233) [Link] (2 responses)

I second that, thanks Jens! I found it quite straightforward to add support for io_uring to MySQL/InnoDB. I used liburing rather than the lower-level interface.

The rapid growth of io_uring

Posted Feb 5, 2020 17:52 UTC (Wed) by axboe (subscriber, #904) [Link]

Nice! I might have missed this, any references to the MySQL/InnoDB io_uring code?

The rapid growth of io_uring

Posted May 15, 2020 1:57 UTC (Fri) by twocode (guest, #132839) [Link]

+1 @Suny, how much of a performance gain have you observed?

The rapid growth of io_uring

Posted Feb 9, 2020 22:47 UTC (Sun) by dcoutts (guest, #5387) [Link] (7 responses)

What I do not yet understand is how this is expected to interact with epoll for handling large numbers of network connections.

The io_uring API is designed to scale to a moderate number of simultaneous I/O operations. The size of the submit and collect rings has to be enough to cover all the simultaneous operations. The max ring size is 4k entries. This is fine for the use case of disk I/O, connect, accept etc.

It's not fine for the "10k problem" of having 10s of 1000s of idle network connections. That's what epoll is designed for. We don't really want to have 10s of 1000s of pending async IO recv operations, we just want to wait for data to arrive on any connection, and then we can execute the IO op to collect the data.

So what's the idea for handling large numbers of network connections using io_uring, or some combo of io_uring and epoll? We have IORING_OP_POLL_ADD but of course this costs one io_uring entry so we can't go over 4k of them. There's IORING_OP_EPOLL_CTL for adjusting the fds in an epoll set. But there's no io_uring operation for epoll_wait. So do we have to use both io_uring and epoll_wait? Now that needs two threads, so no nice single-threaded event loop.

Perhaps I'm missing something. If not, isn't the obvious thing to add support for IORING_OP_EPOLL_WAIT? Then we can use IORING_OP_EPOLL_CTL to adjust the network fds we're monitoring and then issue a single IORING_OP_EPOLL_WAIT to wait for any network fd to have activity.

Alternatively, io_uring could subsume the epoll API entirely. The single-shot style of IORING_OP_POLL_ADD is actually very nice. But it has to scale to the 10k+ case, so cannot consume a completion queue entry for each fd polled like IORING_OP_POLL_ADD does.

The rapid growth of io_uring

Posted Feb 9, 2020 23:05 UTC (Sun) by andresfreund (subscriber, #69562) [Link] (4 responses)

The submission/completion queues do not contain entries for requests currently being processed. I.e. there can be more requests being processed than fit in either of the maximum queue sizes.

Early on that could lead to the completion queue overflowing, but now those completions are saved (but no new submissions are allowed, basically).

The rapid growth of io_uring

Posted Feb 10, 2020 1:33 UTC (Mon) by dcoutts (guest, #5387) [Link] (3 responses)

So if I understand what you mean then in principle we could have a completion ring of e.g. 1k entries, and submit 10k IORING_OP_POLL_ADD operations on sockets, and this can't lead to losing any completions. At most, if the completion ring is full we have to collect completions before submitting new operations (which is completely reasonable). If so, that's excellent.

So is this pattern of use with sockets expected to perform well, e.g. compared to epoll? If so, that's fantastic, we really can use just one system for both disk I/O and network I/O. The one-shot behaviour of IORING_OP_POLL_ADD is entirely adequate (covering epoll's EPOLLONESHOT which is its most useful mode).

The rapid growth of io_uring

Posted Feb 10, 2020 3:29 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (1 responses)

> So if I understand what you mean then in principle we could have a completion ring of e.g. 1k entries, and submit 10k IORING_OP_POLL_ADD operations on sockets, and this can't lead to losing any completions.

Yes, that's correct to my knowledge. That wasn't the case in the first kernels with support for io_uring however (you need to test for IORING_FEAT_NODROP to know). But for poll like things it's also not that hard to handle the overflow case - just treat all of the sockets as ready. But that doesn't work at all for asynchronous network IO (rather than just readiness).

And obviously, you'd need to submit the 10k IORING_OP_POLL_ADDs in smaller batches, because the submission queue wouldn't be that long. But that's fine.

FWIW, I am fairly sure you can submit an IORING_OP_POLL_ADD on an epoll fd. Signal readiness once epoll_pwait would return (without returning the events of course). So if you need something for which oneshot style behaviour isn't best suited, or which you might want to wait on only some of the time, you can also compose io_uring and epoll. Caveat: I have not tested this, but I'm fairly sure I've seen code doing so, and a quick look in the code confirms that this should work: eventpoll_fops has a .poll implementation, which is what io_uring (via io_poll_add) relies on to register to get notified for readiness.

The rapid growth of io_uring

Posted May 25, 2020 23:25 UTC (Mon) by rand0m$tring (guest, #125230) [Link]

the completion queue is allocated at approximately 2x the size of the submission queue.

https://github.com/torvalds/linux/blob/444565650a5fe9c63d...

so that's the point at which it would reach capacity and fall over.

The rapid growth of io_uring

Posted Mar 5, 2022 23:06 UTC (Sat) by smahapatra1 (guest, #157268) [Link]

Sorry for the naive question: why is EPOLL_ONESHOT the most useful mode? Does it not require another EPOLL_CTL call, and hence be less efficient?
Thank you.

The rapid growth of io_uring

Posted May 25, 2020 22:31 UTC (Mon) by rand0m$tring (guest, #125230) [Link]

Right, so this is obviously a huge problem; it makes anything more than a toy server impossible. I don't understand the reasoning for this limitation.

A buffer per operation is also a nonstarter for the same reason. In 5.7 IOSQE_BUFFER_SELECT attempts to solve the buffer problem with rotations. But I think an input / output ring buffer solution that allows allocation at the tail when needed, and just returns ranges into the input or output buffer makes more sense?

The rapid growth of io_uring

Posted Dec 9, 2020 14:17 UTC (Wed) by hearts (guest, #143561) [Link]

In my understanding, you can still use epoll together with io_uring. For example, use IORING_OP_POLL_ADD to register the epoll fd with io_uring; this occupies only one slot.

Once epoll is ready, add all the ready descriptors to io_uring (readv / writev etc.) in batch. These descriptors will be dealt with immediately because they are in ready state.

Even if you have, say, a million connections (and most of them are idle), it is still possible to deal with them without issues.

You at least save some syscalls due to the batch operations, and for the application side, it is still more asynchronous than calling readv/writev yourself.

The rapid growth of io_uring

Posted Jun 24, 2021 13:47 UTC (Thu) by Emjayen (guest, #152930) [Link] (1 responses)

So an almost carbon-copy of the NT I/O model, developed 25 years ago.

The rapid growth of io_uring

Posted Jun 26, 2021 10:18 UTC (Sat) by flussence (guest, #85566) [Link]

And unlike NT, it's got the 25 years of filesystem development to go with it!


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds