Descriptorless files for io_uring
Io_uring was created to solve the asynchronous I/O problem, a capability that Linux has never supported as well as users would have liked. User space can queue operations in a memory segment shared directly with the kernel, allowing those operations to be initiated, in many cases, without an expensive system call; another shared-memory segment receives the results of those operations once they complete. Initially, io_uring focused on simple operations (reading and writing, for example), but it has quickly gained support for many other system calls. It is evolving into the general asynchronous-operation API that Linux systems have always lacked.
Fixed files
A read or write operation must specify both the file descriptor to be operated on and a buffer to hold the data. There is a fair amount of setup work that must be done in the kernel before such an operation can proceed, though, including taking a reference to the open file (to prevent it from going away while the operation is underway) and locking down the memory for the buffer. That overhead can, in many cases, add up to a significant part of the total cost of the operation; since programs tend to perform multiple operations with the same file descriptors and the same buffers, the same setup cost can be paid many times over for the same resources.
From the beginning, io_uring has included a way to reduce that overhead in the form of the io_uring_register() system call:
    int io_uring_register(unsigned int fd, unsigned int opcode,
                          void *arg, unsigned int nr_args);
If opcode is IORING_REGISTER_BUFFERS, the io_uring subsystem will perform the setup work for the nr_args buffers pointed to by arg and keep the result; those buffers can then be used multiple times without paying that setup cost each time. If, instead, opcode is IORING_REGISTER_FILES, then arg is interpreted as an array of nr_args file descriptors. Each file in that array will be referenced and held open so that, once again, it can be used efficiently in multiple operations. These file descriptors are called "fixed" in io_uring jargon.
There are a couple of interesting aspects to fixed files. One is that the application can call close() on the file descriptor associated with a fixed file, but the reference within io_uring will remain and will still be usable. The other is that a fixed file is not referenced in subsequent io_uring operations by its file-descriptor number. Instead, operations use the offset where that file descriptor appeared in the arg array during the io_uring_register() call. So if file descriptor 42 was placed in arg[13], it will subsequently be known as fixed file 13 within io_uring.
So the io_uring subsystem has, in essence, set up a parallel descriptor space that can refer to open files, but which is independent of the regular file descriptors. In current kernels, though, it is still necessary to obtain a regular file descriptor for a file and register it for the file to appear in the io_uring fixed-file space. If, however, an application will never do anything with a file outside of io_uring, the creation of the regular file descriptor serves no real purpose.
It is, indeed, possible to create, use, and close a file descriptor entirely within io_uring. As noted above, this subsystem is not limited to simple I/O; it is also possible to open files and accept network connections with io_uring operations. At the moment, though, user space must intervene between the creation of the file descriptor and its use to install it as a fixed file. The cost of this work is not huge but it, too, can add up in an application that processes a lot of file descriptors.
No more file descriptors
To address this problem, Pavel Begunkov has posted this patch series adding a direct-to-fixed open operation. The io_uring operations that can create file descriptors — the equivalents of the openat2() and accept() system calls — gain the ability to, instead, store their result directly into the fixed-file table at a user-supplied offset. When this option is selected, there is no regular file descriptor created at all; the io_uring alternative descriptor is the only way to refer to the file.
The most likely use case for this feature is network servers; a busy server can create (with accept()) and use huge numbers of file descriptors in a short period of time. While io_uring operations, being asynchronous, can generally be executed in any order, it is possible to chain operations so that one does not begin before the previous one has successfully completed. Using this capability, a network server could queue a series of operations to accept the next incoming connection (storing it in the fixed-file table), write out the standard greeting, and initiate a read for the first data from the remote peer. User space would only need to become involved once that data has arrived and is ready to be processed.
This is clearly an interesting capability, and it shows how io_uring is quickly evolving into an alternative programming interface for Linux systems. The separation from the traditional file-descriptor space is just one more step in that direction. With the future addition of BPF support (which is still under development), the separation will become even more pronounced; the user-space component of some applications may become small indeed. Use of the io_uring API will probably not be worthwhile for the majority of applications, but for some it can make a large difference. It will be interesting to see where it goes from here.
| Index entries for this article | |
|---|---|
| Kernel | io_uring |
Posted Jul 19, 2021 17:58 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (13 responses)
The last time I was profiling an app that needed a lot of openat calls, the locking and searching for a free file descriptor in the kernel took something like 20% of the total time. We basically need a way to say: "I don't care about descriptors being dense and all of that POSIX nonsense, just give me a fucking number!"
Posted Jul 19, 2021 20:30 UTC (Mon)
by josh (subscriber, #17465)
[Link] (10 responses)
This is similar to how X Window System clients assign resource IDs: when you create an object like a window or pixmap, you assign the ID yourself from a range of valid IDs you were given, so you can batch the creation and use of an object together.
If you then want to use the object outside of io_uring, you can always use an io_uring operation to link the fixed-file into a normal file descriptor, and then use it with normal syscalls. And if you want, you can assign a specific file descriptor yourself rather than relying on the kernel's "allocate the lowest unused number".
Posted Jul 20, 2021 0:46 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
How?!? This would help me a lot. I can't change all the read/write calls to use io_uring, but I can do that with opens.
Posted Jul 20, 2021 6:03 UTC (Tue)
by josh (subscriber, #17465)
[Link] (8 responses)
Posted Jul 20, 2021 6:44 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (7 responses)
I guess it can work, but it seems kinda hacky.
Posted Jul 20, 2021 7:08 UTC (Tue)
by josh (subscriber, #17465)
[Link] (6 responses)
If you're a library or otherwise have to cooperatively manage the file descriptor space, you have to take more care, but you could still let your caller tell you a range you can use. Or we could add operations to reserve a range of descriptors for future use.
But ultimately, the best way to have a private space of file descriptors you can allocate as you see fit is to use io_uring with your own private ring and allocate from its fixed file table.
Posted Jul 20, 2021 7:28 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
> If you're a program, you own the descriptor space; just pick a high base and allocate descriptors above that as you see fit.
I think I might be missing something, but as far as I understand, all syscalls that create file descriptors can't accept number ranges.
Posted Jul 20, 2021 8:43 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Posted Jul 22, 2021 7:44 UTC (Thu)
by taladar (subscriber, #68407)
[Link] (1 responses)
Posted Aug 3, 2021 20:33 UTC (Tue)
by tobin_baker (guest, #139557)
[Link]
Posted Jul 20, 2021 16:54 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
Okay, you'd need a new open command, but first you'd call a routine to set up your io_uring, and then you have your private file descriptor space - iirc 1 was the tty, 2 might have been the tape drive, up to about 4, and then your program simply allocated file descriptors from 5 upwards. Can't remember the syntax but it was something like
OPEN( unit, filename, attributes)
I guess in C you'd end up doing something like "fd = open(unit, filename, attributes)" where "unit" would be your index into the ring, and "fd" would be whatever the linux file descriptor was, that your uring had block pre-allocated.
Cheers,
Wol
Posted Jul 22, 2021 12:18 UTC (Thu)
by ejr (subscriber, #51652)
[Link]
I've been watching this and similar things wondering how much it differs from old methods.
Posted Jul 19, 2021 20:43 UTC (Mon)
by dancol (guest, #142293)
[Link] (1 responses)
Posted Aug 3, 2021 20:35 UTC (Tue)
by tobin_baker (guest, #139557)
[Link]
Posted Jul 20, 2021 0:48 UTC (Tue)
by JohnVonNeumann (guest, #131609)
[Link] (3 responses)
Posted Jul 22, 2021 13:34 UTC (Thu)
by immibis (subscriber, #105511)
[Link] (2 responses)
Posted Aug 3, 2021 20:35 UTC (Tue)
by tobin_baker (guest, #139557)
[Link] (1 responses)
Posted Aug 4, 2021 20:06 UTC (Wed)
by immibis (subscriber, #105511)
[Link]
Posted Jul 20, 2021 8:51 UTC (Tue)
by ddevault (subscriber, #99589)
[Link] (14 responses)
Posted Jul 20, 2021 16:56 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (5 responses)
It wouldn't surprise me if the basic idea of it comes from the 60s - much of what we learnt then has been forgotten ...
Cheers,
Wol
Posted Jul 30, 2021 11:58 UTC (Fri)
by cesarb (subscriber, #6266)
[Link] (4 responses)
Posted Aug 4, 2021 20:40 UTC (Wed)
by immibis (subscriber, #105511)
[Link] (3 responses)
Posted Aug 5, 2021 10:31 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (1 responses)
io_uring is headed that way - you can run it in a mode where it puts commands in one queue, Linux picks them up and executes them, and puts the results in the other queue. Doesn't yet handle all Linux system calls (ioctl being the next one on the list), but it's a direction that Linux is taking.
Done sensibly, it could be a very efficient way to work; you need a few syscalls to make it work well, but SYSENTER/SYSCALL give you register space to handle that. All syscalls tell the kernel to check the submission queue for new work, but the kernel is always allowed to check the queue at its discretion. The queue itself has dependency marks in it, so that the kernel can execute work out of order if there's no flagged dependencies (e.g. so that you can submit GPU work, network reads and disk writes all in parallel). However, if you do mark dependencies, a work item cannot start until all its dependencies are complete.
With those 5 syscalls, you have a full system for a queued syscall mechanism that gives the kernel lots of freedom to run work in parallel (although it's not obliged to), and gives applications lots of freedom to be high performance.
Syscall 1 is your idea, and is necessary.
You need syscalls 2 and 3 to block the application if it cannot do any further useful work without a response from the outside world; the timeout allows the application to still do background maintenance (cache TTL cleanouts, for example) even though the outside world is not affecting it yet. Without these syscalls, the application busy-loops on the kernel completing work, wasting CPU time - and given that you want the kernel to block you, you might as well be able to tell the kernel what it will take for you to be able to do something so that it can avoid giving you CPU cycles if you're not going to use them.
And syscalls 4 and 5 could be simply something done at process creation time, but it's powerful to be able to run (say) queue per worker thread and not have to do application-level locking when there truly isn't a dependency between threads.
Posted Aug 6, 2021 1:14 UTC (Fri)
by nybble41 (subscriber, #55106)
[Link]
The case which was missed in splitting up the syscalls is the combination of 2 & 3: the thread may be able to do useful work if *either* a prior request completes *or* space becomes available in the queue, but with separate syscalls there is no way to wait for whichever occurs first.
If a process is created with an initial submission queue and there is support for queuing "create queue" and "destroy queue" commands then you don't really need syscalls 4 & 5, which brings us back down to a single syscall.
Posted Aug 9, 2021 15:53 UTC (Mon)
by cesarb (subscriber, #6266)
[Link]
I've been thinking about that for a few months, as a thought experiment: would it be possible to create an operating system based solely on multiple asynchronous ring buffers, not only for IPC but also for all communication with the kernel, to the extreme of having no system call at all?
I came to the conclusion that there are two important exceptions. The first one is yielding the time slice, either when there's nothing to do, or as an optimization to wake up another task to read something the current task wrote to a ring buffer. Since the point of this exercise is to go to the extreme, that would be the single system call, but it would have no inputs and no outputs (and all registers would be kept unchanged); the kernel would read what to do from one of the ring buffers ("wake up task X", "wake up task Y", "I don't need the timeslice anymore until another task wakes me up"), and unless told otherwise ("I don't need the timeslice anymore") would return immediately after reading these messages.
The second one is reading the clock, which must be synchronous and fast. For that, instead of synchronous system calls, or asynchronous messages through a ring buffer, it should directly read hardware registers visible to user space (CNTPCT/CNTFRQ or similar) and apply adjustments read from a shared read-only page (this is AFAIK similar to what the Linux VDSO already does).
Posted Jul 20, 2021 19:26 UTC (Tue)
by zdzichu (subscriber, #17118)
[Link] (7 responses)
Posted Jul 20, 2021 19:30 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
NT implements completion ports for async IO, not ring-based pull communication.
Posted Jul 21, 2021 16:44 UTC (Wed)
by andresfreund (subscriber, #69562)
[Link] (4 responses)
However, the big issue with Windows IOCP imo is the awful docs.
Posted Jul 29, 2021 6:11 UTC (Thu)
by njs (subscriber, #40338)
[Link] (3 responses)
I haven't tried using Windows's "registered IO", their equivalent to io_uring's ring buffer and registered buffers... is it particularly broken somehow?
> readiness style notification (not building data to send out before socket is ready is useful for efficiency, even if one otherwise uses completion notification), ...
FWIW, IOCP actually does support readiness-style notification -- the windows kernel only understands IOCP, not `select`, so Winsock `select` is actually using IOCP syscalls underneath. The API for this was supposed to be completely internal and undocumented, but someone (@piscisaureus) reverse-engineered it, and it's now used in a number of modern event-loop libraries like libuv, tokio, trio, etc. (And also apparently "Starcraft: Remastered", according to the wine issue tracker.) It's awkward and janky and has some outright bugs you have to work around, but it's so much better than dealing with `select` that it's worth it.
Posted Jul 29, 2021 7:34 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Well, RIO is kinda a completely separate API from regular IOCP and overlapped IO. It is indeed similar to io_uring, albeit more limited.
IOCP is pretty well designed, actually. It's a way better design than the classic POSIX asynchronous APIs, but it's around 20 years old by now and its age is showing.
Posted Jul 31, 2021 0:56 UTC (Sat)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
> I haven't tried using Windows's "registered IO", their equivalent to io_uring's ring buffer and registered buffers... is it particularly broken somehow?
Well, it's not really IOCP network IO, but something that can be bolted sideways to iocp :). But yes, fair enough.
> > readiness style notification (not building data to send out before socket is ready is useful for efficiency, even if one otherwise uses completion notification), ...
> FWIW, IOCP actually does support readiness-style notification -- the windows kernel only understands IOCP, not `select`, so Winsock `select` is actually using IOCP syscalls underneath. The API for this was supposed to be completely internal and undocumented, but someone (@piscisaureus) reverse-engineered it, and it's now used in a number of modern event-loop libraries like libuv, tokio, trio, etc. (And also apparently "Starcraft: Remastered", according to the wine issue tracker.) It's awkward and janky and has some outright bugs you have to work around, but it's so much better than dealing with `select` that it's worth it.
That is *very* interesting. I've occasionally looked in that space because the winsock restrictions are problematic for some things in postgres.
Posted Aug 5, 2021 1:43 UTC (Thu)
by njs (subscriber, #40338)
[Link]
If you want to dig into this trick, I think this issue is the most complete compendium of knowledge currently available: https://github.com/python-trio/trio/issues/52
Posted Jul 30, 2021 10:40 UTC (Fri)
by da4089 (subscriber, #1195)
[Link]
Posted Jul 20, 2021 10:02 UTC (Tue)
by darmengod (subscriber, #130659)
[Link] (3 responses)
Will this have any major implications with regards to telling which files a program is referencing at any given time using standard tools?
Posted Jul 20, 2021 10:54 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Jul 21, 2021 7:44 UTC (Wed)
by colo (guest, #45564)
[Link] (1 responses)
It certainly is a future I am a tad afraid of.
Posted Jul 25, 2021 6:59 UTC (Sun)
by gulsef073 (guest, #123117)
[Link]
Posted Jul 20, 2021 18:37 UTC (Tue)
by jezuch (subscriber, #52988)
[Link] (4 responses)
Posted Jul 21, 2021 3:44 UTC (Wed)
by liam (guest, #84133)
[Link] (3 responses)
Posted Jul 21, 2021 18:34 UTC (Wed)
by jezuch (subscriber, #52988)
[Link] (2 responses)
Posted Jul 21, 2021 23:19 UTC (Wed)
by liam (guest, #84133)
[Link]
Posted Nov 1, 2021 19:34 UTC (Mon)
by Deleted user 129183 (guest, #129183)
[Link]
Ah yeah, the good old phenomenon of retroactive plagiarism – that’s when you think you finally have an original, sometimes revolutionary thought, then you learn that some obscure German philosopher wrote a book about it already 200 years ago. Or in this case, a post on a mailing list six months ago.
Posted Jul 22, 2021 15:15 UTC (Thu)
by ejr (subscriber, #51652)
[Link]
Quick searching gives me a sense that it's an "oh cool idea!" response. I'd be interested in finding active projects rather than starting from scratch.
Posted Jul 22, 2021 15:32 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link] (3 responses)
And some outdated rules such as "allocate the lowest one" cost a lot for no benefit in highly loaded network programs. Picking any from a recycled pool would be way better.
Posted Aug 3, 2021 20:25 UTC (Tue)
by tobin_baker (guest, #139557)
[Link]
Posted Feb 20, 2024 19:35 UTC (Tue)
by adobriyan (subscriber, #30858)
[Link] (1 responses)
open(O_WHAT_IS_POSIX) ?
Posted Feb 21, 2024 0:35 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Aug 3, 2021 20:20 UTC (Tue)
by tobin_baker (guest, #139557)
[Link] (4 responses)
Posted Aug 4, 2021 11:39 UTC (Wed)
by kleptog (subscriber, #1183)
[Link] (1 responses)
> Q6 Will closing a file descriptor cause it to be removed from all epoll sets automatically?
> A6 Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a file descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the file descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.
So if you have a descriptor in epoll that gets passed to a child then you will continue getting notifications from that descriptor and you can't do anything about it. If you close it yourself and open another file that gets the same FD, you get maximum confusion.
Posted Aug 4, 2021 18:07 UTC (Wed)
by tobin_baker (guest, #139557)
[Link]
Posted Aug 6, 2021 14:59 UTC (Fri)
by foom (subscriber, #14868)
[Link] (1 responses)
But the fix would've been super-simple, it didn't need to invent any new "offset" mechanisms, or descriptorless files, or prohibit inheritance of the epoll configuration across fork(). There's a perfectly good piece of data that should've been used as a unique key: the user-provided "epoll_data_t data;" field! That is an arbitrary 64-bit value provided by the caller during registration, which, most importantly, is the (only) identifier passed BACK to the caller on every epoll_wait event report. If epoll had used that as a unique key, then de-registering an unwanted registration would be as simple as passing the same "data" you received from epoll_wait back in to epoll_ctl(EPOLL_CTL_DEL). Oh well.
At the cost of maintaining a second rb-tree indexed by "data" to the epoll structure, this could possibly even be fixed. Add two new operations, EPOLL_CTL_MOD2 and EPOLL_CTL_DEL2, which ignore the "fd" parameter, and instead look up the epoll entry to delete/modify using "data". Ideally, there would be a requirement that "data" be unique, but that would be an incompatible change, so we can't do that. However, these APIs can just use the first entry with a given "data" key, if there are multiple. Not ideal, but should suffice.
Posted Aug 6, 2021 17:48 UTC (Fri)
by tobin_baker (guest, #139557)
[Link]
Sadly, this design flaw prevents me from passing epoll descriptors over Unix sockets (for an unusual but legit use case). Obviously, it also prevents using an inherited epoll descriptor across fork() as well. It should go without saying that no fd-based interface that doesn't allow freely passing its fds between processes should ever be accepted (because that violates the fundamental semantics of fds), but somehow Linus has failed to enforce that in this and other cases (mostly involving Davide Libenzi, btw).
A program mostly.
But how can I do that? Is it something that's still being planned?
In fact, Microsoft has just announced that they're copying the io_uring design for Windows: see https://windows-internals.com/i-o-rings-when-one-i-o-operation-is-not-enough/
BTW, lwn produced an article on this topic a few months ago: https://lwn.net/Articles/847951/
Quick question for those in the know: Use for computational offload engines?