
The rapid growth of io_uring

Posted Feb 9, 2020 22:47 UTC (Sun) by dcoutts (guest, #5387)
Parent article: The rapid growth of io_uring

What I do not yet understand is how this is expected to interact with epoll for handling large numbers of network connections.

The io_uring API is designed to scale to a moderate number of simultaneous I/O operations. The size of the submission and completion rings has to be enough to cover all the simultaneous operations. The maximum ring size is 4k entries. This is fine for the use case of disk I/O, connect, accept, etc.

It's not fine for the "C10k problem" of having tens of thousands of idle network connections. That's what epoll is designed for. We don't really want to have tens of thousands of pending async I/O recv operations; we just want to wait for data to arrive on any connection, and then we can execute the I/O op to collect the data.

So what's the idea for handling large numbers of network connections using io_uring, or some combo of io_uring and epoll? We have IORING_OP_POLL_ADD but of course this costs one io_uring entry so we can't go over 4k of them. There's IORING_OP_EPOLL_CTL for adjusting the fds in an epoll set. But there's no io_uring operation for epoll_wait. So do we have to use both io_uring and epoll_wait? Now that needs two threads, so no nice single-threaded event loop.

Perhaps I'm missing something. If not, isn't the obvious thing to add support for IORING_OP_EPOLL_WAIT? Then we can use IORING_OP_EPOLL_CTL to adjust the network fds we're monitoring and then issue a single IORING_OP_EPOLL_WAIT to wait for any network fd to have activity.

Alternatively, io_uring could subsume the epoll API entirely. The single-shot style of IORING_OP_POLL_ADD is actually very nice. But it has to scale to the 10k+ case, so it cannot consume a completion queue entry for each fd polled, as IORING_OP_POLL_ADD does.
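
(To make the gap concrete, here is a rough, untested sketch of the split described above, assuming liburing and a kernel with IORING_OP_EPOLL_CTL (5.6+): the epoll_ctl() half can already go through the ring, but the waiting half still needs a separate blocking call.)

    #include <liburing.h>
    #include <sys/epoll.h>

    /* The epoll_ctl() half can already go through io_uring ... */
    static void watch_socket(struct io_uring *ring, int epfd, int sockfd)
    {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                                  .data.fd = sockfd };
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* IORING_OP_EPOLL_CTL; the kernel copies ev during submit */
        io_uring_prep_epoll_ctl(sqe, epfd, sockfd, EPOLL_CTL_ADD, &ev);
        io_uring_submit(ring);
    }

    /* ... but the waiting half is still a second blocking call, outside
     * the io_uring event loop, because there is no IORING_OP_EPOLL_WAIT. */
    static int wait_for_network(int epfd, struct epoll_event *events, int max)
    {
        return epoll_wait(epfd, events, max, -1);
    }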


The rapid growth of io_uring

Posted Feb 9, 2020 23:05 UTC (Sun) by andresfreund (subscriber, #69562) [Link] (4 responses)

The submission/completion queues do not contain entries for requests currently being processed. That is, there can be more requests in flight than fit in either of the maximum queue sizes.

Early on that could lead to the completion queue overflowing, but now those overflowed completions are saved (though, basically, no new submissions are allowed until they have been reaped).

The rapid growth of io_uring

Posted Feb 10, 2020 1:33 UTC (Mon) by dcoutts (guest, #5387) [Link] (3 responses)

So if I understand what you mean, then in principle we could have a completion ring of e.g. 1k entries, and submit 10k IORING_OP_POLL_ADD operations on sockets, and this can't lead to losing any completions. At most, if the completion ring is full, we have to collect completions before submitting new operations (which is completely reasonable). If so, that's excellent.

So is this pattern of use with sockets expected to perform well, e.g. compared to epoll? If so, that's fantastic: we really can use just one system for both disk I/O and network I/O. The one-shot behaviour of IORING_OP_POLL_ADD is entirely adequate (covering epoll's EPOLLONESHOT, which is its most useful mode).
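
(A rough sketch of this usage pattern with liburing, purely illustrative: sockets are armed with one-shot POLL_ADD in submission-queue-sized batches, completions are reaped from the smaller completion ring, and each socket is re-armed after its data has been handled. Error handling and the actual recv() are elided; arm_poll() is just an ad-hoc helper name.)

    #include <liburing.h>
    #include <poll.h>

    static void arm_poll(struct io_uring *ring, int fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe) {                      /* SQ full: flush a batch, retry */
            io_uring_submit(ring);
            sqe = io_uring_get_sqe(ring);
        }
        io_uring_prep_poll_add(sqe, fd, POLLIN);   /* one-shot by default */
        io_uring_sqe_set_data(sqe, (void *)(long)fd);
    }

    void event_loop(struct io_uring *ring, int *socks, int nsocks)
    {
        for (int i = 0; i < nsocks; i++)           /* e.g. 10k sockets */
            arm_poll(ring, socks[i]);
        io_uring_submit(ring);

        for (;;) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(ring, &cqe) < 0)
                break;
            int fd = (int)(long)io_uring_cqe_get_data(cqe);
            io_uring_cqe_seen(ring, cqe);

            /* ... recv() and process the data available on fd ... */

            arm_poll(ring, fd);                    /* one-shot, so re-arm */
            io_uring_submit(ring);
        }
    }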

The rapid growth of io_uring

Posted Feb 10, 2020 3:29 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (1 responses)

> So if I understand what you mean, then in principle we could have a completion ring of e.g. 1k entries, and submit 10k IORING_OP_POLL_ADD operations on sockets, and this can't lead to losing any completions.

Yes, that's correct to my knowledge. That wasn't the case in the first kernels with io_uring support, however (you need to test for IORING_FEAT_NODROP to know). But for poll-like things it's also not that hard to handle the overflow case: just treat all of the sockets as ready. That doesn't work at all for asynchronous network I/O (rather than just readiness), though.

And obviously, you'd need to submit the 10k IORING_OP_POLL_ADDs in smaller batches, because the submission queue wouldn't be that long. But that's fine.

FWIW, I am fairly sure you can submit an IORING_OP_POLL_ADD on an epoll fd. It would signal readiness once epoll_pwait would return (without returning the events, of course). So if you need something for which one-shot-style behaviour isn't best suited, or which you might want to wait on only some of the time, you can also compose io_uring and epoll. Caveat: I have not tested this, but I'm fairly sure I've seen code doing so, and a quick look in the code confirms that this should work: eventpoll_fops has a .poll implementation, which is what io_uring (via io_poll_add) relies on to register to get notified of readiness.
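
(Something like the following, I believe, though as said it is untested: probe for IORING_FEAT_NODROP, arm a single POLL_ADD on the epoll fd, and fall back to a zero-timeout epoll_wait() to fetch the actual events once it completes. Error handling is elided.)

    #include <liburing.h>
    #include <sys/epoll.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_params params = { 0 };

        io_uring_queue_init_params(256, &ring, &params);
        if (!(params.features & IORING_FEAT_NODROP))
            fprintf(stderr, "old kernel: CQ overflow can drop completions\n");

        int epfd = epoll_create1(0);
        /* ... epoll_ctl(epfd, EPOLL_CTL_ADD, ...) for the many idle sockets ... */

        /* One slot: completes when epoll_pwait() would return, without
         * reporting which descriptors are ready. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_poll_add(sqe, epfd, POLLIN);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);

        /* Now epoll_wait(epfd, events, maxevents, 0) returns the events. */
        io_uring_queue_exit(&ring);
        return 0;
    }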

The rapid growth of io_uring

Posted May 25, 2020 23:25 UTC (Mon) by rand0m$tring (guest, #125230) [Link]

The completion queue is allocated at approximately 2x the size of the submission queue.

https://github.com/torvalds/linux/blob/444565650a5fe9c63d...

So that's the point at which it would reach capacity and fall over.
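
(If the 2x default is too small, IORING_SETUP_CQSIZE, added around 5.5 if I recall correctly, lets the application ask for a larger completion ring at setup time. A minimal sketch with liburing; the sizes are only illustrative.)

    #include <liburing.h>

    int setup_ring(struct io_uring *ring)
    {
        struct io_uring_params params = { 0 };

        params.flags = IORING_SETUP_CQSIZE;
        params.cq_entries = 16384;               /* CQ: 16k entries */
        return io_uring_queue_init_params(1024,  /* SQ: 1k entries  */
                                          ring, &params);
    }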

The rapid growth of io_uring

Posted Mar 5, 2022 23:06 UTC (Sat) by smahapatra1 (guest, #157268) [Link]

Sorry for the naive question: why is EPOLLONESHOT the most useful mode? Does it not require another epoll_ctl() call, and hence end up less efficient?
Thank you.

The rapid growth of io_uring

Posted May 25, 2020 22:31 UTC (Mon) by rand0m$tring (guest, #125230) [Link]

Right, so this is obviously a huge problem; it makes anything more than a toy server impossible. I don't understand the reasoning behind this limitation.

A buffer per operation is also a non-starter for the same reason. In 5.7, IOSQE_BUFFER_SELECT attempts to solve the buffer problem by rotating through a pool of pre-provided buffers. But I think an input/output ring-buffer solution, one that allows allocation at the tail when needed and just returns ranges into the input or output buffer, would make more sense.
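
(For reference, a minimal sketch of how the 5.7 IOSQE_BUFFER_SELECT mechanism is driven from liburing; the group id, buffer count, and sizes are arbitrary, and pool must point at NBUFS * BUF_SIZE bytes.)

    #include <liburing.h>

    #define BGID     1          /* buffer group id */
    #define NBUFS    256
    #define BUF_SIZE 4096

    /* Hand NBUFS buffers of BUF_SIZE bytes each to the kernel, group BGID */
    void provide_buffers(struct io_uring *ring, char *pool)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_provide_buffers(sqe, pool, BUF_SIZE, NBUFS, BGID, 0);
        io_uring_submit(ring);
    }

    /* No buffer passed here: the kernel picks one from group BGID */
    void queue_recv(struct io_uring *ring, int fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, fd, NULL, BUF_SIZE, 0);
        io_uring_sqe_set_flags(sqe, IOSQE_BUFFER_SELECT);
        sqe->buf_group = BGID;
        io_uring_submit(ring);
    }

    /* On completion, if cqe->flags & IORING_CQE_F_BUFFER, the chosen buffer is:
     *   bid  = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
     *   data = pool + bid * BUF_SIZE;
     */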

The rapid growth of io_uring

Posted Dec 9, 2020 14:17 UTC (Wed) by hearts (guest, #143561) [Link]

In my understanding, you can still use epoll together with io_uring. For example, use IORING_OP_POLL_ADD to register the epoll fd with io_uring; this occupies only one slot.

Once the epoll fd becomes ready, add all the ready descriptors to io_uring (readv/writev, etc.) in a batch. These descriptors will be dealt with immediately because they are already in a ready state.

Even if you have, say, a million connections (most of them idle), it is still possible to deal with them without issues.

You at least save some syscalls thanks to the batched operations, and from the application's side it is still more asynchronous than calling readv/writev yourself.
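
(A rough, untested sketch of one iteration of that loop, with buffers, error handling, and SQ-full handling elided; iov_for_fd is just a stand-in for whatever per-connection buffer scheme the application uses.)

    #include <liburing.h>
    #include <sys/epoll.h>
    #include <sys/uio.h>
    #include <poll.h>

    #define MAX_EVENTS 1024

    void one_iteration(struct io_uring *ring, int epfd,
                       struct iovec *iov_for_fd /* per-fd buffers */)
    {
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct epoll_event events[MAX_EVENTS];

        /* 1. One slot: wait until the epoll fd itself becomes readable. */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_poll_add(sqe, epfd, POLLIN);
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);

        /* 2. Fetch the ready descriptors without blocking (assumes the
         *    sockets were registered with data.fd set). */
        int n = epoll_wait(epfd, events, MAX_EVENTS, 0);

        /* 3. Queue a readv for each ready socket, then submit the whole
         *    batch with a single system call. */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_readv(sqe, fd, &iov_for_fd[fd], 1, 0);
            io_uring_sqe_set_data(sqe, (void *)(long)fd);
        }
        io_uring_submit(ring);

        /* 4. Reap the readv completions with io_uring_wait_cqe() as usual. */
    }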

