epoll_pwait2(), close_range(), and encoded I/O
Higher-resolution epoll_wait() timeouts
The kernel's "epoll" subsystem provides a high-performance mechanism for a process to wait on events from a large number of open file descriptors. Using it involves creating an epoll file descriptor with epoll_create(), adding file descriptors of interest with epoll_ctl(), then finally waiting on events with epoll_wait() or epoll_pwait(). When waiting, the caller can specify a timeout as an integer number of milliseconds.
The epoll mechanism was added during the 2.5 development series, and became available in the 2.6 release at the end of 2003. Nearly 20 years ago, when this work was being done, a millisecond timeout seemed like enough resolution; the kernel couldn't reliably do shorter timeouts in any case. In 2020, though, one millisecond can be an eternity; there are users who would benefit from much shorter timeouts than that. Thus, it seems it is time for another update to the epoll API.
Willem de Bruijn duly showed up with a patch set adding nanosecond timeout support to epoll_wait(), but it took a bit of a roundabout path. Since there is no "flags" argument to epoll_wait(), there is no way to ask for high-resolution timeouts directly. So the patch set instead added a new flag (EPOLL_NSTIMEO) to epoll_create() (actually, to epoll_create1(), which was added in 2.6.27 since epoll_create() also lacks a "flags" argument). If an epoll file descriptor was created with that flag set, then the timeout value for epoll_wait() would be interpreted as being in nanoseconds rather than milliseconds.
Andrew Morton, however, complained about this API. Having one system call set a flag to change how the arguments to a different system call are interpreted was "not very nice" in his view; he suggested adding a new system call instead. After a bit of back and forth, that is what happened; the current version of the patch set adds epoll_pwait2():
    int epoll_pwait2(int fd, struct epoll_event *events, int maxevents,
                     const struct timespec *timeout, const sigset_t *sigset);
In this version, the timeout is passed as a timespec structure, which includes a field for nanoseconds.
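Assuming a wrapper with the prototype shown above (at the time of writing there is no C-library wrapper, so the call here is illustrative), a sub-millisecond timeout would look something like this:

    #include <sys/epoll.h>
    #include <time.h>

    /* Sketch: wait on an epoll file descriptor with a 100µs timeout,
     * something epoll_wait()'s millisecond granularity cannot express. */
    static int wait_100us(int epfd, struct epoll_event *events, int maxevents)
    {
        struct timespec ts = {
            .tv_sec  = 0,
            .tv_nsec = 100 * 1000,    /* 100µs, expressed in nanoseconds */
        };
        /* As with epoll_pwait(), a NULL sigset leaves the signal mask alone. */
        return epoll_pwait2(epfd, events, maxevents, &ts, NULL);
    }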
There has been some discussion of the implementation of this system call, but not a lot of comments on the API, so perhaps this work will go forward in this form. Your editor cannot help but note, however, that this system call, too, lacks a "flags" argument, so the eventual need for an epoll_pwait3() can be readily foreseen.
close_range() — eventually
The close_range() system call was added in the 5.9 release as a way to efficiently close a whole list of file descriptors:
    int close_range(unsigned int first, unsigned int last, unsigned int flags);
This call will close all file descriptors between first and last, inclusive. There is currently one flags value defined: CLOSE_RANGE_UNSHARE, which causes the calling process's file-descriptor table to be unshared from any other processes before the indicated descriptors are closed, so that the operation cannot race with other threads or processes sharing the table.
In this patch set, Giuseppe Scrivano adds another flag, CLOSE_RANGE_CLOEXEC. This flag will set the "close on exec()" flag on each of the indicated file descriptors. Once again, close_range() does not actually close the files in this case; it simply marks them to be closed if and when the calling process does an exec() in the future. This is, presumably, faster than executing a loop and setting the flag with fcntl() on each file descriptor individually.
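For comparison, here is a sketch of the fcntl() loop that this flag replaces, next to the single proposed call. The raw-syscall invocation and the CLOSE_RANGE_CLOEXEC value are taken from the patch set; since no C-library wrapper exists yet, none of this should be read as a stable interface:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef CLOSE_RANGE_CLOEXEC
    #define CLOSE_RANGE_CLOEXEC (1U << 2)    /* value from the patch set */
    #endif

    /* The traditional way: one fcntl() system call per descriptor. */
    static void cloexec_loop(unsigned int first, unsigned int last)
    {
        for (unsigned int fd = first; fd <= last; fd++)
            fcntl(fd, F_SETFD, FD_CLOEXEC);
    }

    /* With the patch applied, one system call marks the whole range. */
    static long cloexec_range(unsigned int first, unsigned int last)
    {
        return syscall(SYS_close_range, first, last, CLOSE_RANGE_CLOEXEC);
    }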
The functionality seems useful, and there have not really been any complaints about the API (there were some issues with the implementation in previous versions of the patch set). Given that close_range() is taking on more functionality that does not involve actually closing files, though, it seems increasingly clear that this system call is misnamed. It has only been available since the 5.9 release on October 11, so there are not yet C-library wrappers for it in circulation. So there is time to come up with a better name for this system call, should the desire to do so arise.
Encoded I/O
Some filesystems have the ability to compress and/or encrypt data written to files. Normally, this data will be restored to its original form when read from those files, so users may be entirely unaware that this transformation is taking place at all. What if, however, somebody wanted the ability to work with this "encoded" data directly, bypassing the processing steps within the filesystem code? Omar Sandoval has a patch set making that possible.
The main motivation for this work appears to be backups and, in particular, the transmission of partial or full filesystem images with the Btrfs send and receive operations. The whole point of using this mechanism is to create an identical copy of a Btrfs subvolume on another device. If the subvolume is using compression, a send will currently decompress the data, which must then be recompressed on the receive side, ending up in its original form. If there is a lot of data involved, this is a somewhat wasteful operation; it would be more efficient to just transmit the compressed data.
With this patch set applied, it becomes possible to read the compressed and/or encrypted data directly and write it directly, with no intervening processing. The first step is to open the subvolume with the new O_ALLOW_ENCODED flag. The CAP_SYS_ADMIN capability is needed to open a subvolume in this mode; imagine what could happen if an attacker were to write corrupt compressed data to a file, for example. Dave Chinner argued early on that corrupt data should just be treated as bad data and this operation could be unprivileged, but that view did not win out.
Then, encoded data can be read or written using the preadv2() and pwritev2() system calls (the plain preadv() and pwritev() variants take no flags). The new RWF_ENCODED flag must be used to indicate that encoded data is being transferred. A normal invocation of these system calls takes an array of iovec structures describing the buffers to be transferred; when encoded I/O is being done, though, the first entry in that array instead points to an instance of the new encoded_iov structure type:
    struct encoded_iov {
	__aligned_u64 len;
	__aligned_u64 unencoded_len;
	__aligned_u64 unencoded_offset;
	__u32 compression;
	__u32 encryption;
    };
The len field must contain the length of this structure; it is there in case new fields are added in the future. The unencoded_len and unencoded_offset fields describe the portion of the file affected by this operation; the compression and encryption fields contain filesystem-dependent values describing the type of compression and encryption applied. The remaining entries in the iovec array describe the buffers holding the data to transfer, as usual.
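As an illustration of how the pieces fit together, an encoded read under the proposed interface might be set up as follows. This is only a sketch: the RWF_ENCODED value is a placeholder, and the structure definition is copied from the patch set rather than from any released kernel header.

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <string.h>
    #include <linux/types.h>

    /* Copied from the patch set; not part of any released kernel ABI. */
    struct encoded_iov {
        __aligned_u64 len;
        __aligned_u64 unencoded_len;
        __aligned_u64 unencoded_offset;
        __u32 compression;
        __u32 encryption;
    };

    #define RWF_ENCODED 0x40    /* placeholder flag value, for illustration */

    static ssize_t read_encoded(int fd, off_t offset, void *buf, size_t buflen)
    {
        struct encoded_iov enc;
        memset(&enc, 0, sizeof(enc));
        enc.len = sizeof(enc);    /* structure size, for future extensibility */

        struct iovec iov[2] = {
            /* The first entry refers to the encoded_iov header... */
            { .iov_base = &enc, .iov_len = sizeof(enc) },
            /* ...while the remaining entries describe data buffers, as usual. */
            { .iov_base = buf,  .iov_len = buflen },
        };

        /* On return, enc describes how the data in buf is encoded. */
        return preadv2(fd, iov, 2, offset, RWF_ENCODED);
    }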
The patch set includes support for reading and writing compressed data from a Btrfs filesystem. There is also a follow-on patch set working this support into the send and receive operations. Benchmarks included there show a significant reduction in bandwidth required to transmit the data, reduced CPU time usage and, in some cases, reduced elapsed time as well.
This patch series has been through six revisions as of this writing; the first version was posted in September 2019. Various implementation issues have been addressed, and the work appears to be converging on something that should be ready to merge soon.
| Index entries for this article | |
|---|---|
| Kernel | Btrfs |
| Kernel | Epoll |
| Kernel | Filesystems/Btrfs |
| Kernel | System calls/close_range() |
Posted Nov 20, 2020 20:12 UTC (Fri) by hmh (subscriber, #3838) [Link] (5 responses)

And I can see the demand on the horizon not just for a single range, but for a set of ranges (or of bitmaps), if any operation it performs becomes attractive to do, e.g., on epoll sets :-)

But yes, that way lies a cliff...

Posted Nov 20, 2020 21:56 UTC (Fri) by dezgeg (subscriber, #92243) [Link] (1 responses)

Posted Nov 21, 2020 21:14 UTC (Sat) by gps (subscriber, #45638) [Link]

Going further: rather than at syscall time, how about registering that eBPF to determine the fate of fds as an at-fork and/or at-exec time fd filter?

Posted Jan 3, 2021 19:20 UTC (Sun) by Shabbyx (guest, #104730) [Link] (1 responses)

Posted Jan 5, 2021 10:19 UTC (Tue) by flussence (guest, #85566) [Link]

Posted Feb 16, 2021 22:46 UTC (Tue) by calumapplepie (guest, #143655) [Link]

It executes any fcntl() command across the range of file descriptors, and also accepts a new F_CLOSE command (for the original purpose of the API).

(not a kernel dev, but wanted to add some cents)

Posted Nov 21, 2020 1:54 UTC (Sat) by glenn (subscriber, #102223) [Link] (12 responses)

Perhaps epoll_pwait3() could be used to implement something like Windows' WaitForMultipleObjects(..., /*bWaitAll*/=1, ...) to wait until all file descriptors are ready (rather than just one)?

Posted Nov 21, 2020 9:35 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (11 responses)

(Nevertheless, I am surprised that epoll_pwait2() lacks a flags argument. I had thought that was a standard feature at this point.)

Posted Nov 21, 2020 16:08 UTC (Sat) by smcv (subscriber, #53363) [Link] (8 responses)

It's a general-purpose API, the modernized equivalent of select() or poll(), used to wait for whatever interesting event happens next on any of several waitable objects in an event loop - which might indeed come from a network socket, but might equally come from an IPC channel (Wayland, X11, D-Bus, any other AF_UNIX socket, or a pipe), or from a communication channel between threads (eventfd or pipe-to-self). Some of these are slow anyway, but some are time-sensitive. 
If you are watching pollable fds A, B, C, you poll one at a time with a nonzero timeout, and pollable fd C is ready, then you won't process the event from C until 2 timeouts later than you could have done, resulting in an unnecessary delay. This results in an incentive to set the poll timeout to be shorter than the intended application-level timeout, leading to the opposite problem: the process is woken up once per poll timeout period, even if nothing interesting has actually happened yet. 
The common state for a CPU- and power-efficient event loop should be that if the process is waiting for something interesting to happen (like the next X11 or Wayland event), then it's mostly sleeping in epoll_wait() or similar, trusting the kernel to wake it up at an appropriate time; and if it's waiting for one or more application-level timeouts (which might be short, like the time an animation needs to wake up to start drawing the next frame so that it will be finished in time, or long, like the time at which the app gives up hope of receiving a reply to a pending network operation or D-Bus call), then the timeout value that it gives to epoll_wait() or equivalent should be whichever of those application-level timeouts will happen soonest. This is how the GLib and Qt event loops work, for instance. 
If you're trying to write an event loop that is capable of running a smooth animation at the refresh rate of a 60Hz screen, then the time between frames is less than 17ms, so I can imagine that the difference between one millisecond and the next does make a difference; if you're trying to keep up with a 144Hz screen, then you get less than 7ms between frames. 
     
    
Posted Nov 22, 2020 14:12 UTC (Sun) by smcv (subscriber, #53363) [Link] (7 responses)

If that, then I'm not sure I see it either.

Posted Nov 22, 2020 15:48 UTC (Sun) by Sesse (subscriber, #53779) [Link] (6 responses)

Posted Nov 22, 2020 21:16 UTC (Sun) by sbaugh (guest, #103291) [Link] (5 responses)

Posted Nov 22, 2020 21:17 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Posted Nov 22, 2020 21:22 UTC (Sun) by sbaugh (guest, #103291) [Link]

Yes.

Posted Nov 22, 2020 21:24 UTC (Sun) by Sesse (subscriber, #53779) [Link] (2 responses)

Posted Nov 23, 2020 6:45 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

- Linux epoll: No WaitAll option. Mutexes are acquired one by one with futex(2) or more commonly pthread_mutex_lock(3) (which uses futex internally). epoll has nothing to do with futex and can't acquire anything that resembles a mutex. It waits on file descriptors, and that's it.
- Windows WaitForMultipleObjects: Has a WaitAll option, can wait for both mutexes and file handles (and various other things) in the same call.

Adding a WaitAll option to epoll would not help with mutexes, because you can't use epoll to acquire mutexes in the first place. You would need to also add the ability to acquire mutexes. But most of the complexity of mutexes lives in userspace (in NPTL) and epoll is a kernel interface. So that's probably infeasible without some kind of userspace abstraction layer. In practice, it would probably make more sense to add this functionality to NPTL than to add it to epoll.

Posted Nov 23, 2020 6:48 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

Posted Nov 25, 2020 4:56 UTC (Wed) by glenn (subscriber, #102223) [Link] (1 responses)

I can point to two patterns that would benefit from epoll_wait() with bWaitAll.

(1) You want to have a thread that services a UDP socket, processing one datagram per unit time. A clean way to do this would be to pair the UDP socket file descriptor with a timerfd file descriptor. Using epoll_wait() and bWaitAll, this thread would only wake when a datagram is available and enough time has elapsed. Is there an equally clean way that avoids trivial wakeups?

(2) Consider a real-time application with a message-passing task-graph software architecture that may or may not span multiple compute nodes (e.g., ROS). Each node in the task graph is backed by a thread. These threads should only execute when data is available on all input file descriptors. epoll_wait() with something like bWaitAll would keep these threads from waking before all inputs are satisfied. Furthermore, my understanding is that this capability is necessary if threads are scheduled by SCHED_DEADLINE. Irregular wake-up patterns can cause thread budget reclamation to kick in before the thread has done any real work (see "2.2 Bandwidth reclaiming" in Documentation/scheduler/sched-deadline.txt).

Posted Nov 25, 2020 9:11 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

> (1) You want to have a thread that services a UDP socket [...]

The network RTT is multiple orders of magnitude slower than a context switch, unless you have crazy bad swapping *and* the other node is just across the room. So my position is that trivial wakeups don't matter in this case. They're effectively noise. 
> (2) Consider a real-time application [...] 
Does epoll respect real-time scheduling in the first place? If a thread goes to sleep, it necessarily relinquishes the CPU. And, once it has done so, what exactly is the kernel expected to do about it, if anything? I tend to think of real-time scheduling as protecting CPU-bound threads from CPU-stealing by other threads, but epoll certainly isn't a CPU-bound interface. 
> may or may not span multiple compute nodes [...] Furthermore, my understanding is that this capability is necessary if threads are scheduled by SCHED_DEADLINE. Irregular wake-up patterns can cause thread budget reclamation to kick in before the thread has done any real work (see “2.2 Bandwidth reclaiming” in Documentation/scheduler/sched-deadline.txt). 
If you want this to actually work across multiple nodes, you need to be tolerant of variances in network latency (unless you're building an entirely real-time network... which seems like it would be really hard to do), and (again) those are way bigger than some minor CPU context switching. If you don't want this to work across multiple nodes, then perhaps you should be using more elaborate userspace synchronization primitives rather than fooling around with loopback addressing and epoll. 
     
And still no flags

Posted Nov 21, 2020 7:30 UTC (Sat) by ras (subscriber, #33059) [Link] (2 responses)

And still no flags

Posted Nov 21, 2020 10:01 UTC (Sat) by awww (guest, #122021) [Link]

And still no flags

Posted Jan 30, 2021 23:53 UTC (Sat) by andyc (subscriber, #1130) [Link]

Posted Nov 24, 2020 23:47 UTC (Tue) by ppisa (subscriber, #67307) [Link] (2 responses)

Posted Dec 10, 2020 10:29 UTC (Thu) by njs (subscriber, #40338) [Link] (1 responses)

Also note that timerfd already does support all these variations, so in a pinch you can stick one of those in your epoll set.

Posted Dec 10, 2020 10:37 UTC (Thu) by ppisa (subscriber, #67307) [Link]

Absolute timeouts would be nice, but it is only one more system call, and often a virtual one, so not a significant problem. And if you have two time queues for one epoll, one MONOTONIC and one REALTIME, then it is better to move at least one of them to timerfd to resolve mutual time shifts correctly.

Posted Dec 4, 2020 11:30 UTC (Fri) by brauner (subscriber, #109349) [Link]

> Given that close_range() is taking on more functionality that does not involve actually closing files, though, it seems increasingly clear that this system call is misnamed. It has only been available since the 5.9 release on October 11, so there are not yet C-library wrappers for it in circulation. So there is time to come up with a better name for this system call, should the desire to do so arise.

Unfortunately I don't think we can change the name anymore. There are already users, including Python, systemd, LXC, LXD, and other large codebases, and the name is identical between FreeBSD and Linux, since Kyle and I coordinated on this syscall after FreeBSD picked it up from us. But CLOSE_RANGE_CLOEXEC is essentially a deferred close, so I'd consider this closing file descriptors.
