epoll_pwait2(), close_range(), and encoded I/O
Higher-resolution epoll_wait() timeouts
The kernel's "epoll" subsystem provides a high-performance mechanism for a process to wait on events from a large number of open file descriptors. Using it involves creating an epoll file descriptor with epoll_create(), adding file descriptors of interest with epoll_ctl(), then finally waiting on events with epoll_wait() or epoll_pwait(). When waiting, the caller can specify a timeout as an integer number of milliseconds.
The epoll mechanism was added during the 2.5 development series, and became available in the 2.6 release at the end of 2003. Nearly 20 years ago, when this work was being done, a millisecond timeout seemed like enough resolution; the kernel couldn't reliably do shorter timeouts in any case. In 2020, though, one millisecond can be an eternity; there are users who would benefit from much shorter timeouts than that. Thus, it seems it is time for another update to the epoll API.
Willem de Bruijn duly showed up with a patch set adding nanosecond timeout support to epoll_wait(), but it took a bit of a roundabout path. Since there is no "flags" argument to epoll_wait(), there is no way to ask for high-resolution timeouts directly. So the patch set instead added a new flag (EPOLL_NSTIMEO) to epoll_create() (actually, to epoll_create1(), which was added in 2.6.27 since epoll_create() also lacks a "flags" argument). If an epoll file descriptor was created with that flag set, then the timeout value for epoll_wait() would be interpreted as being in nanoseconds rather than milliseconds.
Andrew Morton, however, complained about this API. Having one system call set a flag to change how the arguments to a different system call are interpreted was "not very nice" in his view; he suggested adding a new system call instead. After a bit of back and forth, that is what happened; the current version of the patch set adds epoll_pwait2():
    int epoll_pwait2(int fd, struct epoll_event *events, int maxevents,
                     const struct timespec *timeout, const sigset_t *sigset);
In this version, the timeout is passed as a timespec structure, which includes a field for nanoseconds.
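Assuming a wrapper with the prototype shown above (at the time of writing there is no C-library wrapper, so the call here is illustrative), a sub-millisecond timeout would look something like this:

    #include <sys/epoll.h>
    #include <time.h>

    /* Sketch: wait on an epoll file descriptor with a 100µs timeout,
     * something epoll_wait()'s millisecond granularity cannot express. */
    static int wait_100us(int epfd, struct epoll_event *events, int maxevents)
    {
        struct timespec ts = {
            .tv_sec  = 0,
            .tv_nsec = 100 * 1000,    /* 100µs, expressed in nanoseconds */
        };
        /* As with epoll_pwait(), a NULL sigset leaves the signal mask alone. */
        return epoll_pwait2(epfd, events, maxevents, &ts, NULL);
    }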
There has been some discussion of the implementation of this system call, but not a lot of comments on the API, so perhaps this work will go forward in this form. Your editor cannot help but note, however, that this system call, too, lacks a "flags" argument, so the eventual need for an epoll_pwait3() can be readily foreseen.
close_range() — eventually
The close_range() system call was added in the 5.9 release as a way to efficiently close a whole list of file descriptors:
    int close_range(unsigned int first, unsigned int last, unsigned int flags);
This call will close all file descriptors between first and last, inclusive. There is currently one flags value defined: CLOSE_RANGE_UNSHARE, which causes the calling process's file-descriptor table to be unshared from any other processes before the indicated descriptors are closed, so that the operation cannot race with other threads or processes sharing the table.
In this patch set, Giuseppe Scrivano adds another flag, CLOSE_RANGE_CLOEXEC. This flag will set the "close on exec()" flag on each of the indicated file descriptors. Once again, close_range() does not actually close the files in this case; it simply marks them to be closed if and when the calling process does an exec() in the future. This is, presumably, faster than executing a loop and setting the flag with fcntl() on each file descriptor individually.
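For comparison, here is a sketch of the fcntl() loop that this flag replaces, next to the single proposed call. The raw-syscall invocation and the CLOSE_RANGE_CLOEXEC value are taken from the patch set; since no C-library wrapper exists yet, none of this should be read as a stable interface:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef CLOSE_RANGE_CLOEXEC
    #define CLOSE_RANGE_CLOEXEC (1U << 2)    /* value from the patch set */
    #endif

    /* The traditional way: one fcntl() system call per descriptor. */
    static void cloexec_loop(unsigned int first, unsigned int last)
    {
        for (unsigned int fd = first; fd <= last; fd++)
            fcntl(fd, F_SETFD, FD_CLOEXEC);
    }

    /* With the patch applied, one system call marks the whole range. */
    static long cloexec_range(unsigned int first, unsigned int last)
    {
        return syscall(SYS_close_range, first, last, CLOSE_RANGE_CLOEXEC);
    }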
The functionality seems useful, and there have not really been any complaints about the API (there were some issues with the implementation in previous versions of the patch set). Given that close_range() is taking on more functionality that does not involve actually closing files, though, it seems increasingly clear that this system call is misnamed. It has only been available since the 5.9 release on October 11, so there are not yet C-library wrappers for it in circulation. So there is time to come up with a better name for this system call, should the desire to do so arise.
Encoded I/O
Some filesystems have the ability to compress and/or encrypt data written to files. Normally, this data will be restored to its original form when read from those files, so users may be entirely unaware that this transformation is taking place at all. What if, however, somebody wanted the ability to work with this "encoded" data directly, bypassing the processing steps within the filesystem code? Omar Sandoval has a patch set making that possible.
The main motivation for this work appears to be backups and, in particular, the transmission of partial or full filesystem images with the Btrfs send and receive operations. The whole point of using this mechanism is to create an identical copy of a Btrfs subvolume on another device. If the subvolume is using compression, a send will currently decompress the data, which must then be recompressed on the receive side, ending up in its original form. If there is a lot of data involved, this is a somewhat wasteful operation; it would be more efficient to just transmit the compressed data.
With this patch set applied, it becomes possible to read the compressed and/or encrypted data directly and write it directly, with no intervening processing. The first step is to open the subvolume with the new O_ALLOW_ENCODED flag. The CAP_SYS_ADMIN capability is needed to open a subvolume in this mode; imagine what could happen if an attacker were to write corrupt compressed data to a file, for example. Dave Chinner argued early on that corrupt data should just be treated as bad data and this operation could be unprivileged, but that view did not win out.
Then, encoded data can be read or written using the preadv2() and pwritev2() system calls (the plain preadv() and pwritev() variants take no flags). The new RWF_ENCODED flag must be used to indicate that encoded data is being transferred. A normal invocation of these system calls takes an array of iovec structures describing the buffers to be transferred; when encoded I/O is being done, though, the first entry in that array instead points to an instance of the new encoded_iov structure type:
    struct encoded_iov {
	__aligned_u64 len;
	__aligned_u64 unencoded_len;
	__aligned_u64 unencoded_offset;
	__u32 compression;
	__u32 encryption;
    };
The len field must contain the length of this structure; it is there in case new fields are added in the future. The unencoded_len and unencoded_offset fields describe the portion of the file affected by this operation; the compression and encryption fields contain filesystem-dependent values describing the type of compression and encryption applied. The remaining entries in the iovec array describe the buffers holding the data to transfer, as usual.
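As an illustration of how the pieces fit together, an encoded read under the proposed interface might be set up as follows. This is only a sketch: the RWF_ENCODED value is a placeholder, and the structure definition is copied from the patch set rather than from any released kernel header.

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <string.h>
    #include <linux/types.h>

    /* Copied from the patch set; not part of any released kernel ABI. */
    struct encoded_iov {
        __aligned_u64 len;
        __aligned_u64 unencoded_len;
        __aligned_u64 unencoded_offset;
        __u32 compression;
        __u32 encryption;
    };

    #define RWF_ENCODED 0x40    /* placeholder flag value, for illustration */

    static ssize_t read_encoded(int fd, off_t offset, void *buf, size_t buflen)
    {
        struct encoded_iov enc;
        memset(&enc, 0, sizeof(enc));
        enc.len = sizeof(enc);    /* structure size, for future extensibility */

        struct iovec iov[2] = {
            /* The first entry refers to the encoded_iov header... */
            { .iov_base = &enc, .iov_len = sizeof(enc) },
            /* ...while the remaining entries describe data buffers, as usual. */
            { .iov_base = buf,  .iov_len = buflen },
        };

        /* On return, enc describes how the data in buf is encoded. */
        return preadv2(fd, iov, 2, offset, RWF_ENCODED);
    }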
The patch set includes support for reading and writing compressed data from a Btrfs filesystem. There is also a follow-on patch set working this support into the send and receive operations. Benchmarks included there show a significant reduction in bandwidth required to transmit the data, reduced CPU time usage and, in some cases, reduced elapsed time as well.
This patch series has been through six revisions as of this writing; the first version was posted in September 2019. Various implementation issues have been addressed, and the work appears to be converging on something that should be ready to merge soon.
| Index entries for this article | |
|---|---|
| Kernel | Btrfs |
| Kernel | Epoll |
| Kernel | Filesystems/Btrfs |
| Kernel | System calls/close_range() |
Posted Nov 20, 2020 20:12 UTC (Fri) by hmh (subscriber, #3838) [Link] (5 responses)

And I can see the demand on the horizon not just for a single range, but for a set of ranges (or of bitmaps), if any operation it performs becomes attractive to do, e.g., on epoll sets :-)

But yes, that way lies a cliff...

Posted Nov 20, 2020 21:56 UTC (Fri) by dezgeg (subscriber, #92243) [Link] (1 responses)

Posted Nov 21, 2020 21:14 UTC (Sat) by gps (subscriber, #45638) [Link]

Going further: rather than at syscall time, how about registering that eBPF to determine the fate of fds as an at-fork and/or at-exec time fd filter?

Posted Jan 3, 2021 19:20 UTC (Sun) by Shabbyx (guest, #104730) [Link] (1 responses)

Posted Jan 5, 2021 10:19 UTC (Tue) by flussence (guest, #85566) [Link]

Posted Feb 16, 2021 22:46 UTC (Tue) by calumapplepie (guest, #143655) [Link]

It executes any fcntl() command across the range of file descriptors, and also accepts a new F_CLOSE command (for the original purpose of the API).

(not a kernel dev, but wanted to add some cents)

Posted Nov 21, 2020 1:54 UTC (Sat) by glenn (subscriber, #102223) [Link] (12 responses)

Perhaps epoll_pwait3() could be used to implement something like Windows' WaitForMultipleObjects(..., /*bWaitAll*/=1, ...) to wait until all file descriptors are ready (rather than just one)?

Posted Nov 21, 2020 9:35 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (11 responses)

(Nevertheless, I am surprised that epoll_pwait2() lacks a flags argument. I had thought that was a standard feature at this point.)

Posted Nov 21, 2020 16:08 UTC (Sat) by smcv (subscriber, #53363) [Link] (8 responses)

It's a general-purpose API, the modernized equivalent of select() or poll(), used to wait for whatever interesting event happens next on any of several waitable objects in an event loop - which might indeed come from a network socket, but might equally come from an IPC channel (Wayland, X11, D-Bus, any other AF_UNIX socket, or a pipe), or from a communication channel between threads (eventfd or pipe-to-self). Some of these are slow anyway, but some are time-sensitive. 
If you are watching pollable fds A, B, C, you poll one at a time with a nonzero timeout, and pollable fd C is ready, then you won't process the event from C until 2 timeouts later than you could have done, resulting in an unnecessary delay. This results in an incentive to set the poll timeout to be shorter than the intended application-level timeout, leading to the opposite problem: the process is woken up once per poll timeout period, even if nothing interesting has actually happened yet. 
The common state for a CPU- and power-efficient event loop should be that if the process is waiting for something interesting to happen (like the next X11 or Wayland event), then it's mostly sleeping in epoll_wait() or similar, trusting the kernel to wake it up at an appropriate time; and if it's waiting for one or more application-level timeouts (which might be short, like the time an animation needs to wake up to start drawing the next frame so that it will be finished in time, or long, like the time at which the app gives up hope of receiving a reply to a pending network operation or D-Bus call), then the timeout value that it gives to epoll_wait() or equivalent should be whichever of those application-level timeouts will happen soonest. This is how the GLib and Qt event loops work, for instance. 
If you're trying to write an event loop that is capable of running a smooth animation at the refresh rate of a 60Hz screen, then the time between frames is less than 17ms, so I can imagine that the difference between one millisecond and the next does make a difference; if you're trying to keep up with a 144Hz screen, then you get less than 7ms between frames. 
     
    
Posted Nov 22, 2020 14:12 UTC (Sun) by smcv (subscriber, #53363) [Link] (7 responses)

If that, then I'm not sure I see it either.

Posted Nov 22, 2020 15:48 UTC (Sun) by Sesse (subscriber, #53779) [Link] (6 responses)

Posted Nov 22, 2020 21:16 UTC (Sun) by sbaugh (guest, #103291) [Link] (5 responses)

Posted Nov 22, 2020 21:17 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Posted Nov 22, 2020 21:22 UTC (Sun) by sbaugh (guest, #103291) [Link]

Yes.

Posted Nov 22, 2020 21:24 UTC (Sun) by Sesse (subscriber, #53779) [Link] (2 responses)

Posted Nov 23, 2020 6:45 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

- Linux epoll: No WaitAll option. Mutexes are acquired one by one with futex(2) or more commonly pthread_mutex_lock(3) (which uses futex internally). epoll has nothing to do with futex and can't acquire anything that resembles a mutex. It waits on file descriptors, and that's it.
- Windows WaitForMultipleObjects: Has a WaitAll option, can wait for both mutexes and file handles (and various other things) in the same call.

Adding a WaitAll option to epoll would not help with mutexes, because you can't use epoll to acquire mutexes in the first place. You would need to also add the ability to acquire mutexes. But most of the complexity of mutexes lives in userspace (in NPTL) and epoll is a kernel interface. So that's probably infeasible without some kind of userspace abstraction layer. In practice, it would probably make more sense to add this functionality to NPTL than to add it to epoll.

Posted Nov 23, 2020 6:48 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

Posted Nov 25, 2020 4:56 UTC (Wed) by glenn (subscriber, #102223) [Link] (1 responses)

I can point to two patterns that would benefit from epoll_wait() with bWaitAll.

(1) You want to have a thread that services a UDP socket, processing one datagram per unit time. A clean way to do this would be to pair the UDP socket file descriptor with a timerfd file descriptor. Using epoll_wait() and bWaitAll, this thread would only wake when a datagram is available and enough time has elapsed. Is there an equally clean way that avoids trivial wakeups?

(2) Consider a real-time application with a message-passing task-graph software architecture that may or may not span multiple compute nodes (e.g., ROS). Each node in the task graph is backed by a thread. These threads should only execute when data is available on all input file descriptors. epoll_wait() with something like bWaitAll would keep these threads from waking before all inputs are satisfied. Furthermore, my understanding is that this capability is necessary if threads are scheduled by SCHED_DEADLINE. Irregular wake-up patterns can cause thread budget reclamation to kick in before the thread has done any real work (see "2.2 Bandwidth reclaiming" in Documentation/scheduler/sched-deadline.txt).

Posted Nov 25, 2020 9:11 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

> (1) You want to have a thread that services a UDP socket [...]

The network RTT is multiple orders of magnitude slower than a context switch, unless you have crazy bad swapping *and* the other node is just across the room. So my position is that trivial wakeups don't matter in this case. They're effectively noise. 
> (2) Consider a real-time application [...] 
Does epoll respect real-time scheduling in the first place? If a thread goes to sleep, it necessarily relinquishes the CPU. And, once it has done so, what exactly is the kernel expected to do about it, if anything? I tend to think of real-time scheduling as protecting CPU-bound threads from CPU-stealing by other threads, but epoll certainly isn't a CPU-bound interface. 
> may or may not span multiple compute nodes [...] Furthermore, my understanding is that this capability is necessary if threads are scheduled by SCHED_DEADLINE. Irregular wake-up patterns can cause thread budget reclamation to kick in before the thread has done any real work (see “2.2 Bandwidth reclaiming” in Documentation/scheduler/sched-deadline.txt). 
If you want this to actually work across multiple nodes, you need to be tolerant of variances in network latency (unless you're building an entirely real-time network... which seems like it would be really hard to do), and (again) those are way bigger than some minor CPU context switching. If you don't want this to work across multiple nodes, then perhaps you should be using more elaborate userspace synchronization primitives rather than fooling around with loopback addressing and epoll. 
     
And still no flags

Posted Nov 21, 2020 7:30 UTC (Sat) by ras (subscriber, #33059) [Link] (2 responses)

And still no flags

Posted Nov 21, 2020 10:01 UTC (Sat) by awww (guest, #122021) [Link]

And still no flags

Posted Jan 30, 2021 23:53 UTC (Sat) by andyc (subscriber, #1130) [Link]

Posted Nov 24, 2020 23:47 UTC (Tue) by ppisa (subscriber, #67307) [Link] (2 responses)

Posted Dec 10, 2020 10:29 UTC (Thu) by njs (subscriber, #40338) [Link] (1 responses)

Also note that timerfd already does support all these variations, so in a pinch you can stick one of those in your epoll set.

Posted Dec 10, 2020 10:37 UTC (Thu) by ppisa (subscriber, #67307) [Link]

Absolute timeouts would be nice, but it is only one more system call, and often a virtual one, so not a significant problem. And if you have two time queues for one epoll, one MONOTONIC and one REALTIME, then it is better to move at least one of them to timerfd to resolve mutual time shifts correctly.

Posted Dec 4, 2020 11:30 UTC (Fri) by brauner (subscriber, #109349) [Link]

> Given that close_range() is taking on more functionality that does not involve actually closing files, though, it seems increasingly clear that this system call is misnamed. It has only been available since the 5.9 release on October 11, so there are not yet C-library wrappers for it in circulation. So there is time to come up with a better name for this system call, should the desire to do so arise.

Unfortunately I don't think we can change the name anymore. There are already users, including Python, systemd, LXC, LXD, and other large codebases, and the name is identical between FreeBSD and Linux, since Kyle and I coordinated on this syscall after FreeBSD picked it up from us. But CLOSE_RANGE_CLOEXEC is essentially a deferred close, so I'd consider this closing file descriptors.
