LWN: Comments on "Descriptorless files for io_uring" https://lwn.net/Articles/863071/ This is a special feed containing comments posted to the individual LWN article titled "Descriptorless files for io_uring". en-us Sun, 16 Nov 2025 17:07:28 +0000 Sun, 16 Nov 2025 17:07:28 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Descriptorless files for io_uring https://lwn.net/Articles/962950/ https://lwn.net/Articles/962950/ Cyberax <div class="FormattedComment"> You can't add new flags to open() :(<br> </div> Wed, 21 Feb 2024 00:35:35 +0000 Descriptorless files for io_uring https://lwn.net/Articles/962928/ https://lwn.net/Articles/962928/ adobriyan <div class="FormattedComment"> <span class="QuotedText">&gt; And some outdated rules such as "allocate the lowest one" cost a lot for no benefit in highly loaded network programs. Picking any from a recycled pool would be way better.</span><br> <p> open(O_WHAT_IS_POSIX) ?<br> </div> Tue, 20 Feb 2024 19:35:43 +0000 Descriptorless files for io_uring https://lwn.net/Articles/874664/ https://lwn.net/Articles/874664/ Deleted user 129183 <div class="FormattedComment"> <font class="QuotedText">&gt; Haha, and here I thought I would have an original thought on the internet! ;)</font><br> <p> Ah yeah, the good old phenomenon of retroactive plagiarism – that’s when you think you finally have an original, sometimes revolutionary thought, then you learn that some obscure German philosopher wrote a book about it already 200 years ago. Or in this case, a post on a mailing list six months ago.<br> </div> Mon, 01 Nov 2021 19:34:04 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865696/ https://lwn.net/Articles/865696/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; Now I imagined an OS with *no* system calls.</font><br> <p> I&#x27;ve been thinking about that for a few months, as a thought experiment: would it be possible to create an operating system based solely on multiple asynchronous ring buffers, not only for IPC but also for all communication with the kernel, to the extreme of having no system call at all?<br> <p> I came to the conclusion that there are two important exceptions. The first one is yielding the time slice, either when there&#x27;s nothing to do, or as an optimization to wake up another task to read something the current task wrote to a ring buffer. Since the point of this exercise is to go to the extreme, that would be the single system call, but it would have no inputs and no outputs (and all registers would be kept unchanged); the kernel would read what to do from one of the ring buffers (&quot;wake up task X&quot;, &quot;wake up task Y&quot;, &quot;I don&#x27;t need the timeslice anymore until another task wakes me up&quot;), and unless told otherwise (&quot;I don&#x27;t need the timeslice anymore&quot;) would return immediately after reading these messages.<br> <p> The second one is reading the clock, which must be synchronous and fast.
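<p> (Before getting to the clock: a minimal sketch of the first exception, the single register-preserving yield call plus its control ring. Every name here is invented, since no such ABI exists, and ring-overflow handling is omitted.)
<pre>
#include &lt;stdatomic.h&gt;
#include &lt;stdint.h&gt;

enum ctl_op { CTL_WAKE_TASK, CTL_PARK };  /* "wake up task X" / "I am done" */

struct ctl_msg  { uint32_t op; uint32_t task; };

struct ctl_ring {
    _Atomic uint32_t tail;                /* produced by user space */
    struct ctl_msg   msg[256];            /* consumed by the kernel */
};

/* The one "system call": no inputs, no outputs, and in this imagined ABI
 * the kernel preserves every register; it only means "look at my control
 * ring now". */
static inline void yield_trap(void)
{
    __asm__ volatile ("syscall" ::: "memory");
}

static void post(struct ctl_ring *r, uint32_t op, uint32_t task)
{
    uint32_t t = atomic_load_explicit(&amp;r-&gt;tail, memory_order_relaxed);
    r-&gt;msg[t &amp; 255] = (struct ctl_msg){ .op = op, .task = task };
    atomic_store_explicit(&amp;r-&gt;tail, t + 1, memory_order_release);
}

/* Wake a peer that waits on data we wrote to some other ring, then give
 * up the CPU until somebody wakes us in turn. */
static void wake_then_park(struct ctl_ring *r, uint32_t peer)
{
    post(r, CTL_WAKE_TASK, peer);
    post(r, CTL_PARK, 0);
    yield_trap();
}
</pre>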
For reading the clock, instead of synchronous system calls or asynchronous messages through a ring buffer, it should directly read hardware registers visible to user space (CNTPCT/CNTFRQ or similar) and apply adjustments read from a shared read-only page (this is AFAIK similar to what the Linux VDSO already does).<br> </div> Mon, 09 Aug 2021 15:53:29 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865500/ https://lwn.net/Articles/865500/ tobin_baker <div class="FormattedComment"> That&#x27;s a great point: all epoll needed to do was pass a unique cookie for each registration back to the caller. It would be interesting to review the LKML discussion on the original epoll patches and see if this issue was ever brought up.<br> <p> Sadly, this design flaw prevents me from passing epoll descriptors over Unix sockets (for an unusual but legit use case). Obviously, it also prevents using an inherited epoll descriptor across fork(). It should go without saying that no fd-based interface that doesn&#x27;t allow freely passing its fds between processes should ever be accepted (because that violates the fundamental semantics of fds), but somehow Linus has failed to enforce that in this and other cases (mostly involving Davide Libenzi, btw).<br> </div> Fri, 06 Aug 2021 17:48:17 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865403/ https://lwn.net/Articles/865403/ foom <div class="FormattedComment"> Yea. That epoll uses the pair {fd-number, file-description-pointer} for modification/deregistration -- where the latter is looked up via the former, and BOTH must match the original registration ... is really quite unfortunate. If you open a file on FD 5, register it with epoll, then dup2(5, 6); close(5), you cannot unregister it anymore. Neither passing 5 nor 6 to epoll_ctl will work. (Of course, you can dup fd 6 back onto 5. THEN you can pass 5 to epoll_ctl to unregister it...) It really is just a broken API.<br> <p> But the fix would&#x27;ve been super-simple; it didn&#x27;t need to invent any new &quot;offset&quot; mechanisms, or descriptorless files, or prohibit inheritance of the epoll configuration across fork(). There&#x27;s a perfectly good piece of data that should&#x27;ve been used as a unique key: the user-provided &quot;epoll_data_t data;&quot; field! That is an arbitrary 64-bit value provided by the caller during registration, which, most importantly, is the (only) identifier passed BACK to the caller on every epoll_wait event report. If epoll had used that as a unique key, then de-registering an unwanted registration would be as simple as passing the same &quot;data&quot; you received from epoll_wait back into epoll_ctl(EPOLL_CTL_DEL). Oh well.<br> <p> At the cost of adding a second rb-tree, indexed by &quot;data&quot;, to the epoll structure, this could possibly even be fixed. Add two new operations, EPOLL_CTL_MOD2 and EPOLL_CTL_DEL2, which ignore the &quot;fd&quot; parameter, and instead look up the epoll entry to delete/modify using &quot;data&quot;. Ideally, there would be a requirement that &quot;data&quot; be unique, but that would be an incompatible change, so we can&#x27;t do that. However, these APIs can just use the first entry with a given &quot;data&quot; key, if there are multiple.
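<p> (To make the proposal concrete, a sketch of the caller side; EPOLL_CTL_DEL2 and no_longer_interesting() are hypothetical, so the opcode value below is a placeholder, not a real kernel interface.)
<pre>
#include &lt;stdint.h&gt;
#include &lt;sys/epoll.h&gt;

#define EPOLL_CTL_DEL2 5                        /* hypothetical opcode */
extern int no_longer_interesting(uint64_t key); /* app-defined policy */

void on_event(int epfd, const struct epoll_event *ev)
{
    if (no_longer_interesting(ev-&gt;data.u64)) {
        struct epoll_event del = { .data.u64 = ev-&gt;data.u64 };

        /* The fd argument is ignored: lookup is by del.data.u64, and the
         * first matching entry wins if duplicate keys were registered. */
        epoll_ctl(epfd, EPOLL_CTL_DEL2, -1, &amp;del);
    }
}
</pre>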
Not ideal, but should suffice.<br> </div> Fri, 06 Aug 2021 14:59:00 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865402/ https://lwn.net/Articles/865402/ nybble41 <div class="FormattedComment"> I think your syscalls 1, 2, and 3 could usefully be combined into one system call with two arguments besides the queue ID: &quot;min_free&quot; and &quot;timeout&quot;. If timeout is zero then you get syscall 1 (no waiting). Otherwise the thread should be woken when any item in the queue with a &quot;notify completion&quot; flag enabled has completed, or (if timeout is greater than zero) when the timeout expires, or (if min_free is greater than zero) when there are at least min_free available slots at the tail of the submission queue. (The &quot;has completed&quot; state should be persistent, to avoid race conditions: If any &quot;notify completion&quot; item completed since the syscall last returned then it should return immediately rather than waiting for the next event.)<br> <p> The case which was missed in splitting up the syscalls is the combination of 2 &amp; 3: the thread may be able to do useful work if *either* a prior request completes *or* space becomes available in the queue, but with separate syscalls there is no way to wait for whichever occurs first.<br> <p> If a process is created with an initial submission queue and there is support for queuing &quot;create queue&quot; and &quot;destroy queue&quot; commands then you don&#x27;t really need syscalls 4 &amp; 5, which brings us back down to a single syscall.<br> </div> Fri, 06 Aug 2021 01:14:40 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865276/ https://lwn.net/Articles/865276/ farnz <p><tt>io_uring</tt> is headed that way - you can run it in a mode where it puts commands in one queue, Linux picks them up and executes them, and puts the results in the other queue. It doesn't yet handle all Linux system calls (ioctl being the next one on the list), but it's a direction that Linux is taking. <p>Done sensibly, it could be a very efficient way to work; you need a few syscalls to make it work well, but SYSENTER/SYSCALL give you register space to handle that. All syscalls tell the kernel to check the submission queue for new work, but the kernel is always allowed to check the queue at its discretion. The queue itself has dependency marks in it, so that the kernel can execute work out of order if there are no flagged dependencies (e.g. so that you can submit GPU work, network reads and disk writes all in parallel). However, if you do mark dependencies, a work item cannot start until all its dependencies are complete. <ol> <li>Syscall 1 simply tells the kernel that there is work in the submission queue that the application would like the kernel to do. This call returns "quickly" - it just ensures that the kernel knows to check your submission queue - it's your "yield your timeslice" option. <li>Syscall 2 asks the kernel to block the application until a timeout is reached, <em>or</em> a given submitted work item has completed. Because the queue can complete out of order at kernel discretion, this says nothing about ordering other than that already supplied in the queue. It's allowed to return early, and it's on the application to check completion properly - e.g. loop until the blocking item has completed. As a quality of implementation issue, though, the kernel should aim to avoid returning until the timeout is reached, or the selected item has completed.
<li>Syscall 3 asks the kernel to block the application until a timeout is reached, or until there is space in the submission queue for N more items. Again, this syscall can return early, and it's up to the application to loop until there's enough space in the submission queue. <li>Syscall 4 sets up your submission and completion queues (where in your VA space they live, how large they are etc). An application should be allowed multiple queues if it wants to avoid locking (e.g. have a thread-local queue for thread-local background work), but there is no way to have the kernel enforce dependencies between work items in different queues (two submission queues run in parallel at all times). <li>Syscall 5 tears down a queue pair when it's no longer needed. </ol> <p>With those 5 syscalls, you have a full system for a queued syscall mechanism that gives the kernel lots of freedom to run work in parallel (although it's not obliged to), and gives applications lots of freedom to be high performance. <p>Syscall 1 is your idea, and is necessary. <p>You need syscalls 2 and 3 to block the application if it cannot do any further useful work without a response from the outside world; the timeout allows the application to still do background maintenance (cache TTL cleanouts, for example) even though the outside world is not affecting it yet. Without these syscalls, the application busy-loops on the kernel completing work, wasting CPU time - and given that you want the kernel to block you, you might as well be able to tell the kernel what it will take for you to be able to do something so that it can avoid giving you CPU cycles if you're not going to use them. <p>And syscalls 4 and 5 <em>could</em> be simply something done at process creation time, but it's powerful to be able to run (say) queue per worker thread and not have to do application-level locking when there truly isn't a dependency between threads. Thu, 05 Aug 2021 10:31:24 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865261/ https://lwn.net/Articles/865261/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; That is *very* interesting. I&#x27;ve occasionally looked in that space because the winsock restrictions are problematic for some things in postgres..</font><br> <p> If you want to dig into this trick, I think this issue is the most complete compendium of knowledge currently available: <a rel="nofollow" href="https://github.com/python-trio/trio/issues/52">https://github.com/python-trio/trio/issues/52</a><br> </div> Thu, 05 Aug 2021 01:43:57 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865243/ https://lwn.net/Articles/865243/ immibis <div class="FormattedComment"> Now I imagined an OS with *no* system calls. Rather, you just put a message in your message queue, and the kernel checks it whenever you get pre-empted, and you can also call SYSENTER to yield your time slice.<br> </div> Wed, 04 Aug 2021 20:40:30 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865101/ https://lwn.net/Articles/865101/ immibis <div class="FormattedComment"> They are not traditional file descriptors. 
The significant difference is they are not allocated as small integers, but rather arbitrary 32/64-bit integers.<br> </div> Wed, 04 Aug 2021 20:06:59 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865220/ https://lwn.net/Articles/865220/ tobin_baker <div class="FormattedComment"> Bottom line: the epoll interface was designed by someone who didn&#x27;t understand the difference between a file descriptor and an open file description. Sure, he was basically just copying Linus&#x27;s offhand back-of-a-napkin design sketch from LKML, but he had plenty of time to think it through (as I think Linus would have done), and see where the naive design fell over. If nothing else, he could have just copied a design that actually works from FreeBSD (see what a difference an actual design review process makes, as opposed to &quot;we only look at working code&quot;?).<br> </div> Wed, 04 Aug 2021 18:07:51 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865151/ https://lwn.net/Articles/865151/ kleptog <div class="FormattedComment"> It would also fix the crazy issue where epoll can return notifications for a closed FD.<br> <p> <font class="QuotedText">&gt; Q6 Will closing a file descriptor cause it to be removed from all epoll sets automatically?</font><br> <p> <font class="QuotedText">&gt; A6 Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a file descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the file descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.</font><br> <p> So if you have a descriptor in epoll that gets passed to a child then you will continue getting notifications from that descriptor and you can&#x27;t do anything about it. If you close it yourself and open another file that gets the same FD, you get maximum confusion.<br> </div> Wed, 04 Aug 2021 11:39:24 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865096/ https://lwn.net/Articles/865096/ tobin_baker <div class="FormattedComment"> Except they&#x27;re called HANDLE ;-)<br> </div> Tue, 03 Aug 2021 20:35:47 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865095/ https://lwn.net/Articles/865095/ tobin_baker <div class="FormattedComment"> There should absolutely be such a facility. It would be a huge perf win in many cases.<br> </div> Tue, 03 Aug 2021 20:35:13 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865094/ https://lwn.net/Articles/865094/ tobin_baker <div class="FormattedComment"> It would be, if a library could ensure that its reservation didn&#x27;t conflict with another library&#x27;s reservation. There should be a syscall like mmap(PROT_NONE) to just reserve a range of fds, and fail if it conflicts with an existing reservation. Then the system would promise to never open(), dup(), etc. 
onto any reserved fd.<br> </div> Tue, 03 Aug 2021 20:33:07 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865093/ https://lwn.net/Articles/865093/ tobin_baker <div class="FormattedComment"> The paper &quot;The Scalable Commutativity Rule&quot; (<a href="https://people.csail.mit.edu/nickolai/papers/clements-sc-tocs.pdf">https://people.csail.mit.edu/nickolai/papers/clements-sc-...</a>) specifically calls out the lowest-available-number-reused guarantee as an example of how POSIX makes some system calls inherently unscalable by not allowing them to commute (for no really good reason in this case).<br> </div> Tue, 03 Aug 2021 20:25:44 +0000 Descriptorless files for io_uring https://lwn.net/Articles/865092/ https://lwn.net/Articles/865092/ tobin_baker <div class="FormattedComment"> This is how epoll should have been designed from the beginning. How anyone thought it was reasonable to allow passing an epoll fd between processes, but then require a registered fd to be deregistered using its *original integer value* (thus making it impossible to deregister an fd in a different process than the one that created it) is utterly beyond me, and says a lot about the lack of rigorous (or even elementary) design review in Linux kernel development. If epoll had allowed referencing registered fds by an offset like io_uring does, I think that would have been a fine solution to this (now unfixable) problem.<br> </div> Tue, 03 Aug 2021 20:20:26 +0000 Descriptorless files for io_uring https://lwn.net/Articles/864725/ https://lwn.net/Articles/864725/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; &gt; Nor do windows completion ports have support for sensible handling of buffers when handling many connections</font><br> <p> <font class="QuotedText">&gt; I haven&#x27;t tried using Windows&#x27;s &quot;registered IO&quot;, their equivalent to io_uring&#x27;s ring buffer and registered buffers... is it particularly broken somehow?</font><br> <p> Well, it&#x27;s not really IOCP network IO, but something that can be bolted sideways to iocp :). But yes, fair enough.<br> <p> <p> <font class="QuotedText">&gt; &gt; readiness style notification (not building data to send out before socket is ready is useful for efficiency, even if one otherwise uses completion notification), ...</font><br> <p> <font class="QuotedText">&gt; FWIW, IOCP actually does support readiness-style notification -- the windows kernel only understands IOCP, not `select`, so Winsock `select` is actually using IOCP syscalls underneath. The API for this was supposed to be completely internal and undocumented, but someone (@piscisaureus) reverse-engineered it, and it&#x27;s now used in a number of modern event-loop libraries like libuv, tokio, trio, etc. (And also apparently &quot;Starcraft: Remastered&quot;, according to the wine issue tracker.) It&#x27;s awkward and janky and has some outright bugs you have to work around, but it&#x27;s so much better than dealing with `select` that it&#x27;s worth it.</font><br> <p> That is *very* interesting. I&#x27;ve occasionally looked in that space because the winsock restrictions are problematic for some things in postgres..<br> <p> </div> Sat, 31 Jul 2021 00:56:38 +0000 Descriptorless files for io_uring https://lwn.net/Articles/864658/ https://lwn.net/Articles/864658/ cesarb <div class="FormattedComment"> It&#x27;s not a forgotten technique, it&#x27;s a common technique in hardware. 
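<p> (The core shape, as a minimal single-producer/single-consumer sketch; real devices add doorbell registers, phase bits and DMA on top, but this is the essence.)
<pre>
#include &lt;stdatomic.h&gt;
#include &lt;stdint.h&gt;

#define SLOTS 256                       /* power of two */

struct ring {
    _Atomic uint32_t head;              /* next slot the consumer reads */
    _Atomic uint32_t tail;              /* next slot the producer fills */
    uint64_t         cmd[SLOTS];
};

static int ring_submit(struct ring *r, uint64_t cmd)    /* producer side */
{
    uint32_t t = atomic_load_explicit(&amp;r-&gt;tail, memory_order_relaxed);
    if (t - atomic_load_explicit(&amp;r-&gt;head, memory_order_acquire) == SLOTS)
        return -1;                      /* full */
    r-&gt;cmd[t &amp; (SLOTS - 1)] = cmd;
    atomic_store_explicit(&amp;r-&gt;tail, t + 1, memory_order_release);
    return 0;
}

static int ring_consume(struct ring *r, uint64_t *cmd)  /* consumer side */
{
    uint32_t h = atomic_load_explicit(&amp;r-&gt;head, memory_order_relaxed);
    if (h == atomic_load_explicit(&amp;r-&gt;tail, memory_order_acquire))
        return -1;                      /* empty */
    *cmd = r-&gt;cmd[h &amp; (SLOTS - 1)];
    atomic_store_explicit(&amp;r-&gt;head, h + 1, memory_order_release);
    return 0;
}
</pre>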
NVMe, for instance, works exactly this way; it&#x27;s very common in network interfaces, and all modern GPUs use command buffers. On the software side, we have things like Wayland, and network protocols like X11. What&#x27;s unusual is using it as a way to talk to the kernel: I don&#x27;t recall any operating system which uses ring buffers instead of synchronous calls to the kernel; even microkernels with message queues AFAIK use a synchronous &quot;add this message to the queue&quot; system call.<br> </div> Fri, 30 Jul 2021 11:58:04 +0000 Descriptorless files for io_uring https://lwn.net/Articles/864652/ https://lwn.net/Articles/864652/ da4089 In fact, Microsoft has just announced that they're copying the io_uring design for Windows: see <a href="https://windows-internals.com/i-o-rings-when-one-i-o-operation-is-not-enough/">https://windows-internals.com/i-o-rings-when-one-i-o-operation-is-not-enough/</a> Fri, 30 Jul 2021 10:40:07 +0000 Descriptorless files for io_uring https://lwn.net/Articles/864557/ https://lwn.net/Articles/864557/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt; I haven&#x27;t tried using Windows&#x27;s &quot;registered IO&quot;, their equivalent to io_uring&#x27;s ring buffer and registered buffers... is it particularly broken somehow?</font><br> <p> Well, RIO is kinda a completely separate API from regular IOCP and overlapped IO. It is indeed similar to io_uring, albeit more limited.<br> <p> IOCP is pretty well designed, actually. It&#x27;s a way better design than the classic POSIX asynchronous APIs, but it&#x27;s around 20 years old by now and its age is showing.<br> </div> Thu, 29 Jul 2021 07:34:48 +0000 Descriptorless files for io_uring https://lwn.net/Articles/864552/ https://lwn.net/Articles/864552/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; Nor do windows completion ports have support for sensible handling of buffers when handling many connections</font><br> <p> I haven&#x27;t tried using Windows&#x27;s &quot;registered IO&quot;, their equivalent to io_uring&#x27;s ring buffer and registered buffers... is it particularly broken somehow?<br> <p> <font class="QuotedText">&gt; readiness style notification (not building data to send out before socket is ready is useful for efficiency, even if one otherwise uses completion notification), ...</font><br> <p> FWIW, IOCP actually does support readiness-style notification -- the windows kernel only understands IOCP, not `select`, so Winsock `select` is actually using IOCP syscalls underneath. The API for this was supposed to be completely internal and undocumented, but someone (@piscisaureus) reverse-engineered it, and it&#x27;s now used in a number of modern event-loop libraries like libuv, tokio, trio, etc. (And also apparently &quot;Starcraft: Remastered&quot;, according to the wine issue tracker.) It&#x27;s awkward and janky and has some outright bugs you have to work around, but it&#x27;s so much better than dealing with `select` that it&#x27;s worth it.<br> <p> </div> Thu, 29 Jul 2021 06:11:58 +0000 Descriptorless files for io_uring - lsof interaction? https://lwn.net/Articles/864236/ https://lwn.net/Articles/864236/ gulsef073 <div class="FormattedComment"> It&#x27;s increasingly possible to observe BPF programs these days.
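<p> (For instance, with libbpf one can already walk every program loaded in the kernel. The calls below are real libbpf APIs as best I can tell, but treat the sketch as illustrative; it assumes sufficient privilege and does minimal error handling.)
<pre>
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;
#include &lt;bpf/bpf.h&gt;
#include &lt;linux/bpf.h&gt;

int main(void)
{
    __u32 id = 0;

    /* Iterate over every loaded BPF program (typically needs root). */
    while (bpf_prog_get_next_id(id, &amp;id) == 0) {
        struct bpf_prog_info info = {0};
        __u32 len = sizeof(info);
        int fd = bpf_prog_get_fd_by_id(id);

        if (fd &lt; 0)
            continue;
        if (bpf_obj_get_info_by_fd(fd, &amp;info, &amp;len) == 0)
            printf("id %u type %u name %.16s\n",
                   info.id, info.type, info.name);
        close(fd);
    }
    return 0;
}
</pre>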
We can surely profile them: <a href="https://linuxplumbersconf.org/event/4/contributions/294/">https://linuxplumbersconf.org/event/4/contributions/294/</a><br> </div> Sun, 25 Jul 2021 06:59:03 +0000 Descriptorless files for io_uring https://lwn.net/Articles/864002/ https://lwn.net/Articles/864002/ wtarreau <div class="FormattedComment"> I&#x27;m interested in seeing how this evolves. FDs in modern programs are a real pain. Seeing close() take 30% of the CPU sometimes is a nightmare. And not being able to prevent some FDs from being reused too early between threads requires a huge amount of trickery, where, quite frankly, sometimes I&#x27;d prefer to call accept() with *my* expected FD and let it use it instead of having the bad surprise that it uses the one that was just enumerated by epoll_wait() the nanosecond before close() kicked in.<br> <p> And some outdated rules such as &quot;allocate the lowest one&quot; cost a lot for no benefit in highly loaded network programs. Picking any from a recycled pool would be way better.<br> <p> </div> Thu, 22 Jul 2021 15:32:47 +0000 Quick question for those in the know: Use for computational offload engines? https://lwn.net/Articles/864001/ https://lwn.net/Articles/864001/ ejr <div class="FormattedComment"> Is anyone currently looking at io_uring to manage data transfers from a host to computational offload engines (OpenCL-ish queues via OpenACC-, OpenMP-, or OneAPI-level-zero interfaces)?<br> <p> Quick searching gives me a sense that it&#x27;s an &quot;oh cool idea!&quot; response. I&#x27;d be interested in finding active projects rather than starting from scratch.<br> <p> </div> Thu, 22 Jul 2021 15:15:24 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863956/ https://lwn.net/Articles/863956/ immibis <div class="FormattedComment"> Yes, it&#x27;s called Win32.<br> </div> Thu, 22 Jul 2021 13:34:18 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863949/ https://lwn.net/Articles/863949/ ejr <div class="FormattedComment"> Approximately right for Fortran (no longer all caps). Plus the unit can be an embedded data statement.<br> <p> I&#x27;ve been watching this and similar things, wondering how much it differs from old methods.<br> </div> Thu, 22 Jul 2021 12:18:24 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863920/ https://lwn.net/Articles/863920/ taladar <div class="FormattedComment"> Wouldn&#x27;t that be trivial to solve by just having parameters in the library for base fd and number of fds after that to use?<br> </div> Thu, 22 Jul 2021 07:44:12 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863897/ https://lwn.net/Articles/863897/ liam <div class="FormattedComment"> We&#x27;ll not tolerate any who harbor such ambitions here! We&#x27;re not slashdot :)<br> BTW, lwn produced an article on this topic a few months ago: <a href="https://lwn.net/Articles/847951/">https://lwn.net/Articles/847951/</a><br> </div> Wed, 21 Jul 2021 23:19:49 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863879/ https://lwn.net/Articles/863879/ jezuch <div class="FormattedComment"> Haha, and here I thought I would have an original thought on the internet!
;)<br> </div> Wed, 21 Jul 2021 18:34:04 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863869/ https://lwn.net/Articles/863869/ andresfreund <div class="FormattedComment"> Nor do windows completion ports have support for sensible handling of buffers when handling many connections, readiness style notification (not building data to send out before socket is ready is useful for efficiency, even if one otherwise uses completion notification), ...<br> <p> However, the big issue with windows IOCP imo is the awful docs.<br> </div> Wed, 21 Jul 2021 16:44:46 +0000 Descriptorless files for io_uring - lsof interaction? https://lwn.net/Articles/863713/ https://lwn.net/Articles/863713/ colo <div class="FormattedComment"> This is my #1 concern for innovations like this. Also, about the emerging trend of implementing application features via eBPF code loaded into the kernel - will I lose the ability to trace all these applications&#x27; activities via bpftrace (and friends) as a consequence?<br> <p> It certainly is a future I am a tad afraid of.<br> </div> Wed, 21 Jul 2021 07:44:16 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863702/ https://lwn.net/Articles/863702/ liam I think you mean <a href="https://lore.kernel.org/io-uring/4a553a51-50ff-e986-acf0-da9e266d97cd@gmail.com/">this</a>? Wed, 21 Jul 2021 03:44:05 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863660/ https://lwn.net/Articles/863660/ Cyberax <div class="FormattedComment"> Nope, it&#x27;s not even close to the NT design.<br> <p> NT implements completion ports for async IO, not ring-based pull communication.<br> </div> Tue, 20 Jul 2021 19:30:06 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863659/ https://lwn.net/Articles/863659/ zdzichu <div class="FormattedComment"> Given that it implements Windows NT designs, it is hardly an innovation.<br> </div> Tue, 20 Jul 2021 19:26:55 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863656/ https://lwn.net/Articles/863656/ jezuch <div class="FormattedComment"> Where does it go from here? Well, the instructions to uring look a lot like a programming language, as a way to construct entire IO pipelines. So obviously the next step is a JIT for that language!<br> </div> Tue, 20 Jul 2021 18:37:30 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863635/ https://lwn.net/Articles/863635/ Wol <div class="FormattedComment"> And is io_uring just a copy of a technique that&#x27;s been around since the dawn of modern computing, just forgotten as a result of *nix trampling everything else underfoot?<br> <p> It wouldn&#x27;t surprise me if the basic idea of it comes from the 60s - much of what we learnt then has been forgotten ...<br> <p> Cheers,<br> Wol<br> </div> Tue, 20 Jul 2021 16:56:29 +0000 Descriptorless files for io_uring https://lwn.net/Articles/863632/ https://lwn.net/Articles/863632/ Wol <div class="FormattedComment"> So does this then look a bit like FORTRAN?<br> <p> Okay, you&#x27;d need a new open command, but first you&#x27;d call a routine to set up your io_uring, and then you have your private file descriptor space - iirc 1 was the tty, 2 might have been the tape drive, up to about 4, and then your program simply allocated file descriptors from 5 upwards.
Can&#x27;t remember the syntax but it was something like<br> <p> OPEN( unit, filename, attributes)<br> <p> I guess in C you&#x27;d end up doing something like &quot;fd = open(unit, filename, attributes)&quot; where &quot;unit&quot; would be your index into the ring, and &quot;fd&quot; would be whatever the Linux file descriptor was that your uring had pre-allocated as a block.<br> <p> Cheers,<br> Wol<br> </div> Tue, 20 Jul 2021 16:54:07 +0000 Descriptorless files for io_uring - lsof interaction? https://lwn.net/Articles/863559/ https://lwn.net/Articles/863559/ roc <div class="FormattedComment"> Yeah ... more specifically, will there be a /proc interface to expose data about these open descriptors?<br> </div> Tue, 20 Jul 2021 10:54:10 +0000
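<p> (That visibility exists today because every ordinary descriptor shows up as a symlink under /proc/PID/fd, which is what lsof walks; a direct io_uring file would need some analogous interface. A sketch of the enumeration tools rely on now:)
<pre>
#include &lt;dirent.h&gt;
#include &lt;limits.h&gt;
#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    DIR *d = opendir("/proc/self/fd");
    struct dirent *e;
    char link[64], target[PATH_MAX];

    /* Print every descriptor this process has open, and what it points at. */
    while (d &amp;&amp; (e = readdir(d)) != NULL) {
        if (e-&gt;d_name[0] == '.')
            continue;
        snprintf(link, sizeof(link), "/proc/self/fd/%s", e-&gt;d_name);
        ssize_t n = readlink(link, target, sizeof(target) - 1);
        if (n &gt; 0) {
            target[n] = '\0';
            printf("fd %s -&gt; %s\n", e-&gt;d_name, target);
        }
    }
    if (d)
        closedir(d);
    return 0;
}
</pre>
<p> A file opened straight into an io_uring fixed-file table never appears in this directory, which is exactly the concern raised above.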