Descriptorless files for io_uring
Io_uring was created to solve the asynchronous I/O problem, a capability that Linux has never supported as well as users would have liked. User space can queue operations in a memory segment shared directly with the kernel, allowing those operations to be initiated, in many cases, without an expensive system call; another shared-memory segment receives the results of those operations once they complete. Initially, io_uring focused on simple operations (reading and writing, for example), but it has quickly gained support for many other system calls. It is evolving into the general asynchronous-operation API that Linux systems have always lacked.
Fixed files
A read or write operation must specify both the file descriptor to be operated on and a buffer to hold the data. There is a fair amount of setup work that must be done in the kernel before such an operation can proceed, though, including taking a reference to the open file (to prevent it from going away while the operation is underway) and locking down the memory for the buffer. That overhead can, in many cases, add up to a significant part of the total cost of the operation; since programs tend to perform multiple operations with the same file descriptors and the same buffers, the same setup cost can be paid many times over for the same resources.
From the beginning, io_uring has included a way to reduce that overhead in the form of the io_uring_register() system call:
    int io_uring_register(unsigned int fd, unsigned int opcode,
                          void *arg, unsigned int nr_args);
If opcode is IORING_REGISTER_BUFFERS, the io_uring subsystem will perform the setup work for the nr_args buffers pointed to by arg and keep the result; those buffers can then be used multiple times without paying that setup cost each time. If, instead, opcode is IORING_REGISTER_FILES, then arg is interpreted as an array of nr_args file descriptors. Each file in that array will be referenced and held open so that, once again, it can be used efficiently in multiple operations. These file descriptors are called "fixed" in io_uring jargon.
There are a couple of interesting aspects to fixed files. One is that the application can call close() on the file descriptor associated with a fixed file, but the reference within io_uring will remain and will still be usable. The other is that a fixed file is not referenced in subsequent io_uring operations by its file-descriptor number. Instead, operations use the offset where that file descriptor appeared in the arg array during the io_uring_register() call. So if file descriptor 42 was placed in arg[13], it will subsequently be known as fixed file 13 within io_uring.
So the io_uring subsystem has, in essence, set up a parallel descriptor space that can refer to open files, but which is independent of the regular file descriptors. In current kernels, though, it is still necessary to obtain a regular file descriptor for a file and register it for the file to appear in the io_uring fixed-file space. If, however, an application will never do anything with a file outside of io_uring, the creation of the regular file descriptor serves no real purpose.
It is, indeed, possible to create, use, and close a file descriptor entirely within io_uring. As noted above, this subsystem is not limited to simple I/O; it is also possible to open files and accept network connections with io_uring operations. At the moment, though, user space must intervene between the creation of the file descriptor and its use to install it as a fixed file. The cost of this work is not huge but it, too, can add up in an application that processes a lot of file descriptors.
No more file descriptors
To address this problem, Pavel Begunkov has posted this patch series adding a direct-to-fixed open operation. The io_uring operations that can create file descriptors — the equivalents of the openat2() and accept() system calls — gain the ability to, instead, store their result directly into the fixed-file table at a user-supplied offset. When this option is selected, there is no regular file descriptor created at all; the io_uring alternative descriptor is the only way to refer to the file.
The most likely use case for this feature is network servers; a busy server can create (with accept()) and use huge numbers of file descriptors in a short period of time. While io_uring operations, being asynchronous, can generally be executed in any order, it is possible to chain operations so that one does not begin before the previous one has successfully completed. Using this capability, a network server could queue a series of operations to accept the next incoming connection (storing it in the fixed-file table), write out the standard greeting, and initiate a read for the first data from the remote peer. User space would only need to become involved once that data has arrived and is ready to be processed.
This is clearly an interesting capability, and it shows how io_uring is quickly evolving into an alternative programming interface for Linux systems. The separation from the traditional file-descriptor space is just one more step in that direction. With the future addition of BPF support (which is still under development), the separation will become even more pronounced; the user-space component of some applications may become small indeed. Use of the io_uring API will probably not be worthwhile for the majority of applications, but for some it can make a large difference. It will be interesting to see where it goes from here.
| Index entries for this article | |
|---|---|
| Kernel | io_uring |
Posted Jul 19, 2021 17:58 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (13 responses)
The last time I was profiling an app that needed a lot of openat calls, the locking and searching for a free file descriptor in the kernel took something like 20% of the total time. We basically need a way to say: "I don't care about descriptors being dense and all of that POSIX nonsense, just give me a fucking number!"
Posted Jul 19, 2021 20:30 UTC (Mon)
by josh (subscriber, #17465)
[Link] (10 responses)
This is similar to how X Window System clients assign resource IDs: when you create an object like a window or pixmap, you assign the ID yourself from a range of valid IDs you were given, so you can batch the creation and use of an object together.
If you then want to use the object outside of io_uring, you can always use an io_uring operation to link the fixed-file into a normal file descriptor, and then use it with normal syscalls. And if you want, you can assign a specific file descriptor yourself rather than relying on the kernel's "allocate the lowest unused number".
Posted Jul 20, 2021 0:46 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
How?!? This would help me a lot. I can't change all the read/write calls to use io_uring, but I can do that with opens.
Posted Jul 20, 2021 6:03 UTC (Tue)
by josh (subscriber, #17465)
[Link] (8 responses)
Posted Jul 20, 2021 6:44 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (7 responses)
I guess it can work, but it seems kinda hacky.
Posted Jul 20, 2021 7:08 UTC (Tue)
by josh (subscriber, #17465)
[Link] (6 responses)
If you're a library or otherwise have to cooperatively manage the file descriptor space, you have to take more care, but you could still let your caller tell you a range you can use. Or we could add operations to reserve a range of descriptors for future use.
But ultimately, the best way to have a private space of file descriptors you can allocate as you see fit is to use io_uring with your own private ring and allocate from its fixed file table.
Posted Jul 20, 2021 7:28 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
> If you're a program, you own the descriptor space; just pick a high base and allocate descriptors above that as you see fit.
I think I might be missing something, but as far as I understand, all syscalls that create file descriptors can't accept number ranges.
Posted Jul 20, 2021 8:43 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Posted Jul 22, 2021 7:44 UTC (Thu)
by taladar (subscriber, #68407)
[Link] (1 responses)
Posted Aug 3, 2021 20:33 UTC (Tue)
by tobin_baker (guest, #139557)
[Link]
Posted Jul 20, 2021 16:54 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
Okay, you'd need a new open command, but first you'd call a routine to set up your io_uring, and then you have your private file descriptor space - iirc 1 was the tty, 2 might have been the tape drive, up to about 4, and then your program simply allocated file descriptors from 5 upwards. Can't remember the syntax but it was something like
OPEN( unit, filename, attributes)
I guess in C you'd end up doing something like "fd = open(unit, filename, attributes)" where "unit" would be your index into the ring, and "fd" would be whatever the linux file descriptor was, that your uring had block pre-allocated.
Cheers,
Wol
Posted Jul 22, 2021 12:18 UTC (Thu)
by ejr (subscriber, #51652)
[Link]
I've been watching this and similar things wondering how much it differs from old methods.
Posted Jul 19, 2021 20:43 UTC (Mon)
by dancol (guest, #142293)
[Link] (1 responses)
Posted Aug 3, 2021 20:35 UTC (Tue)
by tobin_baker (guest, #139557)
[Link]
Posted Jul 20, 2021 0:48 UTC (Tue)
by JohnVonNeumann (guest, #131609)
[Link] (3 responses)
Posted Jul 22, 2021 13:34 UTC (Thu)
by immibis (subscriber, #105511)
[Link] (2 responses)
Posted Aug 3, 2021 20:35 UTC (Tue)
by tobin_baker (guest, #139557)
[Link] (1 responses)
Posted Aug 4, 2021 20:06 UTC (Wed)
by immibis (subscriber, #105511)
[Link]
Posted Jul 20, 2021 8:51 UTC (Tue)
by ddevault (subscriber, #99589)
[Link] (14 responses)
Posted Jul 20, 2021 16:56 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (5 responses)
It wouldn't surprise me if the basic idea of it comes from the 60s - much of what we learnt then has been forgotten ...
Cheers,
Wol
Posted Jul 30, 2021 11:58 UTC (Fri)
by cesarb (subscriber, #6266)
[Link] (4 responses)
Posted Aug 4, 2021 20:40 UTC (Wed)
by immibis (subscriber, #105511)
[Link] (3 responses)
Posted Aug 5, 2021 10:31 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (1 responses)
io_uring is headed that way - you can run it in a mode where it puts commands in one queue, Linux picks them up and executes them, and puts the results in the other queue. Doesn't yet handle all Linux system calls (ioctl being the next one on the list), but it's a direction that Linux is taking.
Done sensibly, it could be a very efficient way to work; you need a few syscalls to make it work well, but SYSENTER/SYSCALL give you register space to handle that. All syscalls tell the kernel to check the submission queue for new work, but the kernel is always allowed to check the queue at its discretion. The queue itself has dependency marks in it, so that the kernel can execute work out of order if there's no flagged dependencies (e.g. so that you can submit GPU work, network reads and disk writes all in parallel). However, if you do mark dependencies, a work item cannot start until all its dependencies are complete.
With those 5 syscalls, you have a full system for a queued syscall mechanism that gives the kernel lots of freedom to run work in parallel (although it's not obliged to), and gives applications lots of freedom to be high performance.
Syscall 1 is your idea, and is necessary.
You need syscalls 2 and 3 to block the application if it cannot do any further useful work without a response from the outside world; the timeout allows the application to still do background maintenance (cache TTL cleanouts, for example) even though the outside world is not affecting it yet. Without these syscalls, the application busy-loops on the kernel completing work, wasting CPU time - and given that you want the kernel to block you, you might as well be able to tell the kernel what it will take for you to be able to do something so that it can avoid giving you CPU cycles if you're not going to use them.
And syscalls 4 and 5 could be simply something done at process creation time, but it's powerful to be able to run (say) queue per worker thread and not have to do application-level locking when there truly isn't a dependency between threads.
Posted Aug 6, 2021 1:14 UTC (Fri)
by nybble41 (subscriber, #55106)
[Link]
The case which was missed in splitting up the syscalls is the combination of 2 & 3: the thread may be able to do useful work if *either* a prior request completes *or* space becomes available in the queue, but with separate syscalls there is no way to wait for whichever occurs first.
If a process is created with an initial submission queue and there is support for queuing "create queue" and "destroy queue" commands then you don't really need syscalls 4 & 5, which brings us back down to a single syscall.
Posted Aug 9, 2021 15:53 UTC (Mon)
by cesarb (subscriber, #6266)
[Link]
I've been thinking about that for a few months, as a thought experiment: would it be possible to create an operating system based solely on multiple asynchronous ring buffers, not only for IPC but also for all communication with the kernel, to the extreme of having no system call at all?
I came to the conclusion that there are two important exceptions. The first one is yielding the time slice, either when there's nothing to do, or as an optimization to wake up another task to read something the current task wrote to a ring buffer. Since the point of this exercise is to go to the extreme, that would be the single system call, but it would have no inputs and no outputs (and all registers would be kept unchanged); the kernel would read what to do from one of the ring buffers ("wake up task X", "wake up task Y", "I don't need the timeslice anymore until another task wakes me up"), and unless told otherwise ("I don't need the timeslice anymore") would return immediately after reading these messages.
The second one is reading the clock, which must be synchronous and fast. For that, instead of synchronous system calls, or asynchronous messages through a ring buffer, it should directly read hardware registers visible to user space (CNTPCT/CNTFRQ or similar) and apply adjustments read from a shared read-only page (this is AFAIK similar to what the Linux VDSO already does).
Posted Jul 20, 2021 19:26 UTC (Tue)
by zdzichu (subscriber, #17118)
[Link] (7 responses)
Posted Jul 20, 2021 19:30 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
NT implements completion ports for async IO, not ring-based pull communication.
Posted Jul 21, 2021 16:44 UTC (Wed)
by andresfreund (subscriber, #69562)
[Link] (4 responses)
However, the big issue with Windows IOCP imo is the awful docs.
Posted Jul 29, 2021 6:11 UTC (Thu)
by njs (subscriber, #40338)
[Link] (3 responses)
I haven't tried using Windows's "registered IO", their equivalent to io_uring's ring buffer and registered buffers... is it particularly broken somehow?
> readiness style notification (not building data to send out before socket is ready is useful for efficiency, even if one otherwise uses completion notification), ...
FWIW, IOCP actually does support readiness-style notification -- the windows kernel only understands IOCP, not `select`, so Winsock `select` is actually using IOCP syscalls underneath. The API for this was supposed to be completely internal and undocumented, but someone (@piscisaureus) reverse-engineered it, and it's now used in a number of modern event-loop libraries like libuv, tokio, trio, etc. (And also apparently "Starcraft: Remastered", according to the wine issue tracker.) It's awkward and janky and has some outright bugs you have to work around, but it's so much better than dealing with `select` that it's worth it.
Posted Jul 29, 2021 7:34 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Well, RIO is kinda a completely separate API from regular IOCP and overlapped IO. It is indeed similar to io_uring, albeit more limited.
IOCP is pretty well designed, actually. It's a way better design than the classic POSIX asynchronous APIs, but it's around 20 years old by now and its age is showing.
Posted Jul 31, 2021 0:56 UTC (Sat)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
> I haven't tried using Windows's "registered IO", their equivalent to io_uring's ring buffer and registered buffers... is it particularly broken somehow?
Well, it's not really IOCP network IO, but something that can be bolted sideways to iocp :). But yes, fair enough.
> > readiness style notification (not building data to send out before socket is ready is useful for efficiency, even if one otherwise uses completion notification), ...
> FWIW, IOCP actually does support readiness-style notification -- the windows kernel only understands IOCP, not `select`, so Winsock `select` is actually using IOCP syscalls underneath. The API for this was supposed to be completely internal and undocumented, but someone (@piscisaureus) reverse-engineered it, and it's now used in a number of modern event-loop libraries like libuv, tokio, trio, etc. (And also apparently "Starcraft: Remastered", according to the wine issue tracker.) It's awkward and janky and has some outright bugs you have to work around, but it's so much better than dealing with `select` that it's worth it.
That is *very* interesting. I've occasionally looked in that space because the winsock restrictions are problematic for some things in postgres.
Posted Aug 5, 2021 1:43 UTC (Thu)
by njs (subscriber, #40338)
[Link]
If you want to dig into this trick, I think this issue is the most complete compendium of knowledge currently available: https://github.com/python-trio/trio/issues/52
Posted Jul 30, 2021 10:40 UTC (Fri)
by da4089 (subscriber, #1195)
[Link]
Posted Jul 20, 2021 10:02 UTC (Tue)
by darmengod (subscriber, #130659)
[Link] (3 responses)
Will this have any major implications with regards to telling which files a program is referencing at any given time using standard tools?
Posted Jul 20, 2021 10:54 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Jul 21, 2021 7:44 UTC (Wed)
by colo (guest, #45564)
[Link] (1 responses)
It certainly is a future I am a tad afraid of.
Posted Jul 25, 2021 6:59 UTC (Sun)
by gulsef073 (guest, #123117)
[Link]
Posted Jul 20, 2021 18:37 UTC (Tue)
by jezuch (subscriber, #52988)
[Link] (4 responses)
Posted Jul 21, 2021 3:44 UTC (Wed)
by liam (guest, #84133)
[Link] (3 responses)
Posted Jul 21, 2021 18:34 UTC (Wed)
by jezuch (subscriber, #52988)
[Link] (2 responses)
Posted Jul 21, 2021 23:19 UTC (Wed)
by liam (guest, #84133)
[Link]
Posted Nov 1, 2021 19:34 UTC (Mon)
by Deleted user 129183 (guest, #129183)
[Link]
Ah yeah, the good old phenomenon of retroactive plagiarism – that’s when you think you finally have an original, sometimes revolutionary thought, then you learn that some obscure German philosopher wrote a book about it already 200 years ago. Or in this case, a post on a mailing list six months ago.
Posted Jul 22, 2021 15:15 UTC (Thu)
by ejr (subscriber, #51652)
[Link]
Quick searching gives me a sense that it's an "oh cool idea!" response. I'd be interested in finding active projects rather than starting from scratch.
Posted Jul 22, 2021 15:32 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link] (3 responses)
And some outdated rules such as "allocate the lowest one" cost a lot for no benefit in highly loaded network programs. Picking any from a recycled pool would be way better.
Posted Aug 3, 2021 20:25 UTC (Tue)
by tobin_baker (guest, #139557)
[Link]
Posted Feb 20, 2024 19:35 UTC (Tue)
by adobriyan (subscriber, #30858)
[Link] (1 responses)
open(O_WHAT_IS_POSIX) ?
Posted Feb 21, 2024 0:35 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Aug 3, 2021 20:20 UTC (Tue)
by tobin_baker (guest, #139557)
[Link] (4 responses)
Posted Aug 4, 2021 11:39 UTC (Wed)
by kleptog (subscriber, #1183)
[Link] (1 responses)
> Q6 Will closing a file descriptor cause it to be removed from all epoll sets automatically?
> A6 Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a file descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the file descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.
So if you have a descriptor in epoll that gets passed to a child then you will continue getting notifications from that descriptor and you can't do anything about it. If you close it yourself and open another file that gets the same FD, you get maximum confusion.
Posted Aug 4, 2021 18:07 UTC (Wed)
by tobin_baker (guest, #139557)
[Link]
Posted Aug 6, 2021 14:59 UTC (Fri)
by foom (subscriber, #14868)
[Link] (1 responses)
But the fix would've been super-simple, it didn't need to invent any new "offset" mechanisms, or descriptorless files, or prohibit inheritance of the epoll configuration across fork(). There's a perfectly good piece of data that should've been used as a unique key: the user-provided "epoll_data_t data;" field! That is an arbitrary 64-bit value provided by the caller during registration, which, most importantly, is the (only) identifier passed BACK to the caller on every epoll_wait event report. If epoll had used that as a unique key, then de-registering an unwanted registration would be as simple as passing the same "data" you received from epoll_wait back in to epoll_ctl(EPOLL_CTL_DEL). Oh well.
At the cost of maintaining a second rb-tree indexed by "data" to the epoll structure, this could possibly even be fixed. Add two new operations, EPOLL_CTL_MOD2 and EPOLL_CTL_DEL2, which ignore the "fd" parameter, and instead look up the epoll entry to delete/modify using "data". Ideally, there would be a requirement that "data" be unique, but that would be an incompatible change, so we can't do that. However, these APIs can just use the first entry with a given "data" key, if there are multiple. Not ideal, but should suffice.
Posted Aug 6, 2021 17:48 UTC (Fri)
by tobin_baker (guest, #139557)
[Link]
Sadly, this design flaw prevents me from passing epoll descriptors over Unix sockets (for an unusual but legit use case). Obviously, it also prevents using an inherited epoll descriptor across fork() as well. It should go without saying that no fd-based interface that doesn't allow freely passing its fds between processes should ever be accepted (because that violates the fundamental semantics of fds), but somehow Linus has failed to enforce that in this and other cases (mostly involving Davide Libenzi, btw).
A program mostly.
But how can I do that? Is it something that's still being planned?
In fact, Microsoft has just announced that they're copying the io_uring design for Windows: see https://windows-internals.com/i-o-rings-when-one-i-o-operation-is-not-enough/
BTW, lwn produced an article on this topic a few months ago: https://lwn.net/Articles/847951/
Quick question for those in the know: Use for computational offload engines?