LWN: Comments on "A ring buffer for epoll" https://lwn.net/Articles/789603/ This is a special feed containing comments posted to the individual LWN article titled "A ring buffer for epoll". en-us Thu, 16 Oct 2025 09:29:11 +0000 Thu, 16 Oct 2025 09:29:11 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Filling the ring buffer https://lwn.net/Articles/855164/ https://lwn.net/Articles/855164/ brho <div class="FormattedComment"> this sounds a lot like what i did in a similar &quot;ring buffer for something like epoll&quot; in another OS: <br> <p> <a href="https://github.com/brho/akaros/blob/master/kern/include/ros/ceq.h">https://github.com/brho/akaros/blob/master/kern/include/r...</a>. <br> <p> the &#x27;CEQ&#x27; was designed so that i could do epoll in userspace on a non-linux research OS.<br> </div> Mon, 03 May 2021 14:08:02 +0000 A ring buffer for epoll https://lwn.net/Articles/790757/ https://lwn.net/Articles/790757/ excors <div class="FormattedComment"> It looks like it stores index plus one so that 0 can have a special meaning, but I don't understand why 0 needs a special meaning. Why wouldn't the kernel just write the correct index before incrementing tail (with a write barrier in between), so that userspace only needs to wait for tail and not wait again for index? I think the kernel has to be doing a write barrier anyway (after writing to the epoll_uitem, before writing to tail/index) so I don't see how it would be needed for performance.<br> </div> Sun, 09 Jun 2019 15:21:54 +0000 A ring buffer for epoll https://lwn.net/Articles/790756/ https://lwn.net/Articles/790756/ corbet Sigh, it should have been using the <tt>head</tt> index, yes; that has been fixed. The "-1" is correct, though: remember that the index value is one higher than the actual distance into the array. 
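The indexing scheme discussed in these comments can be sketched in C. Everything below is an illustrative reconstruction, not the actual patch: the structure names (`uheader`, `uitem`), the field layout, and the power-of-two masking are assumptions. The two points it tries to capture are the acquire load of `tail` pairing with the producer's release store, and the `- 1` converting a stored index-plus-one entry back into an `items[]` position.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical layout, reconstructed from the discussion; not the
 * patch's actual structures. */
struct uitem {
    uint32_t ready_events;
    uint32_t events;
    uint64_t data;
};

struct uheader {
    _Atomic uint32_t head;   /* consumer position, advanced by user space */
    _Atomic uint32_t tail;   /* producer position, advanced by the kernel */
    uint32_t nr;             /* ring size; assumed a power of two */
    struct uitem items[];    /* one dedicated slot per file descriptor */
};

/* Copy out one ready item; returns 0 if the ring is empty.
 * index[] stores item positions *plus one* so that 0 can mean "empty";
 * hence the "- 1" when converting an entry back to an items[] offset.
 * A real implementation would also need READ_ONCE()-style access to
 * index[], since the kernel writes it concurrently. */
static int consume(struct uheader *h, const uint32_t *index,
                   struct uitem *out)
{
    uint32_t head = atomic_load_explicit(&h->head, memory_order_relaxed);
    /* Acquire pairs with the producer's release store of tail, making
     * the item and index writes visible before we read them. */
    uint32_t tail = atomic_load_explicit(&h->tail, memory_order_acquire);

    if (head == tail)
        return 0;                          /* nothing ready */

    uint32_t slot = index[head & (h->nr - 1)];
    *out = h->items[slot - 1];             /* entries are index + 1 */
    atomic_store_explicit(&h->head, head + 1, memory_order_release);
    return 1;
}
```

The producer side (the kernel, in the real interface) would mirror this: fill the item, write the slot number plus one into `index[tail & (nr - 1)]`, then release-store the incremented tail.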
Sun, 09 Jun 2019 15:00:24 +0000 A ring buffer for epoll https://lwn.net/Articles/790753/ https://lwn.net/Articles/790753/ ianmcc <div class="FormattedComment"> item = header-&gt;items + index[header-&gt;tail] - 1;<br> <p> This looks suspicious. Should it be<br> <p> item = header-&gt;items + index[header-&gt;head];<br> <p> ?<br> </div> Sun, 09 Jun 2019 13:04:12 +0000 A ring buffer for epoll https://lwn.net/Articles/790180/ https://lwn.net/Articles/790180/ daney <div class="FormattedComment"> Because you cannot change (break) userspace after a new facility is added to the kernel, you have to make sure the userspace interfaces are fully specified and correct *before* they are added.<br> <p> Because use of this new epoll interface requires correct, race-free access to multiple memory locations (head, tail, items[i].ready_events, etc.) by both the kernel and userspace, instead of a simple system call, specification of ordering is important. When running on x86, naive implementations may work by accident, where identical code may fail on more weakly ordered architectures. It would be nice to see something that also works on arm64, ppc, et al. from the start.<br> <p> Also, since the kernel is consuming data from multiple memory locations that are under control of userspace, it would seem great care must be taken to ensure that no security vulnerabilities are introduced. The more common paradigm of: copy in user data at system call time, validate, compute result, return to user, no longer holds.
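A minimal sketch of the kind of defensive access this implies (the helper below is hypothetical, and a power-of-two ring size is assumed):

```c
#include <stdint.h>

/* Read an untrusted, user-controlled index exactly once and force it
 * into range.  Re-reading shared memory after a bounds check would
 * reopen a time-of-check-to-time-of-use race: user space could change
 * the value between the check and the use. */
static inline uint32_t read_once_clamped(const volatile uint32_t *p,
                                         uint32_t limit /* power of two */)
{
    uint32_t v = *p;          /* single read of the shared location */
    return v & (limit - 1);   /* mask, not branch: always a valid slot */
}
```

Masking rather than branching means even a hostile value can only ever select a valid slot, and the single volatile read avoids the check-then-reread race.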
<br> </div> Mon, 03 Jun 2019 16:14:08 +0000 A ring buffer for epoll https://lwn.net/Articles/790108/ https://lwn.net/Articles/790108/ smurf <div class="FormattedComment"> Assignee of record is still Red Hat.<br> </div> Mon, 03 Jun 2019 09:09:26 +0000 A ring buffer for epoll https://lwn.net/Articles/790103/ https://lwn.net/Articles/790103/ edeloget <div class="FormattedComment"> I agree that handling memory barriers in user code is not very simple, but avoiding a system call is still a big win for the small number of people who need this interface (and, frankly, if you are in need of this API, you'd better write the code correctly :))<br> </div> Mon, 03 Jun 2019 00:36:11 +0000 A ring buffer for epoll https://lwn.net/Articles/790095/ https://lwn.net/Articles/790095/ pbonzini <div class="FormattedComment"> So far Red Hat has not been acquired; it's still an independent public company, although in the process of being acquired.<br> </div> Sun, 02 Jun 2019 19:30:00 +0000 A ring buffer for epoll https://lwn.net/Articles/790084/ https://lwn.net/Articles/790084/ daney <div class="FormattedComment"> Your code for reading items out of the ring buffer is undoubtedly missing memory barrier operations.<br> <p> Within the kernel you have code review of memory access ordering issues and might have a chance at getting them implemented correctly. How do you ensure the same on the consuming side in userspace?<br> <p> Somebody should probably provide a wrapper library for all this that tries to do the right thing.<br> <p> System calls make nice barriers; if you eliminate the system call, you move the responsibility for correctly implementing the barriers to the authors of any userspace code. <br> </div> Sun, 02 Jun 2019 16:05:34 +0000 A ring buffer for epoll https://lwn.net/Articles/790081/ https://lwn.net/Articles/790081/ alison <div class="FormattedComment"> <font class="QuotedText">&gt;I think you mean "donated to OIN". </font><br> <p> There's some good news, at least.
The contrast between IBM's handling of Red Hat and Oracle's of Sun is striking, at least so far.<br> </div> Sun, 02 Jun 2019 14:52:03 +0000 A ring buffer for epoll https://lwn.net/Articles/790073/ https://lwn.net/Articles/790073/ pbonzini I think you mean "donated to <a href="https://www.openinventionnetwork.com/about-us/">OIN</a>". Sun, 02 Jun 2019 09:48:52 +0000 A ring buffer for epoll https://lwn.net/Articles/790071/ https://lwn.net/Articles/790071/ alison <div class="FormattedComment"> I think you mean "owned by IBM," ahem.<br> </div> Sun, 02 Jun 2019 05:07:39 +0000 A ring buffer for epoll https://lwn.net/Articles/790063/ https://lwn.net/Articles/790063/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Figuring out the above interface required a substantial amount of reverse engineering of the code. This is a rather complex new API, but it is almost entirely undocumented; that will make it hard to use, but the lack of documentation also makes it hard to review the API in the first place. It is doubtful that anybody beyond the author has written any code to use this API at this point. Whether the development community will fully understand this API before committing to it is far from clear. </font><br> <p> I don't understand why this is the case for just about every new Linux API. Coming from another community (postgres), I really don't understand why it's not a hard requirement to provide some minimal set of API docs? It doesn't have to be fully man-page formatted, nicely phrased, native-speaker-level English. But it should provide at least enough information to be able to write that manpage (and a good bit of this article) without having to do a lot of original research. This isn't even hard to enforce?<br> </div> Sat, 01 Jun 2019 20:43:34 +0000 Not need for new syscall https://lwn.net/Articles/790059/ https://lwn.net/Articles/790059/ luto <div class="FormattedComment"> What’s the issue on x86?
As far as I know, the only real issue is running into the silly x32 aliases, but we can easily fix that.<br> </div> Sat, 01 Jun 2019 15:24:15 +0000 A ring buffer for epoll https://lwn.net/Articles/790048/ https://lwn.net/Articles/790048/ smurf <div class="FormattedComment"> That patent is owned by Red Hat, so no problem here.<br> </div> Sat, 01 Jun 2019 08:05:45 +0000 A ring buffer for epoll https://lwn.net/Articles/790047/ https://lwn.net/Articles/790047/ pbonzini <div class="FormattedComment"> The event ring buffer can be accessed from user space without invoking io_getevents.<br> </div> Sat, 01 Jun 2019 07:15:16 +0000 A ring buffer for epoll https://lwn.net/Articles/790034/ https://lwn.net/Articles/790034/ cesarb <div class="FormattedComment"> Very interesting. One of the things mentioned in that paper is that using a ring buffer for system calls allows running the kernel and user space in separate cores; this might be a way to reduce the impact of Spectre/Meltdown/etc mitigations, and even strengthen them by keeping both siblings of each SMT pair either in the kernel or in user space all the time (so there would no longer be a need to either disable SMT, or do a very expensive IPI on every kernel entry/exit to protect against MDS).<br> </div> Fri, 31 May 2019 22:58:25 +0000 A ring buffer for epoll https://lwn.net/Articles/790031/ https://lwn.net/Articles/790031/ mm7323 <div class="FormattedComment"> You could call that syscall batching. 
Apart from the downsides of error handling, I believe there are patent issues:<br> <p> <a href="https://patents.google.com/patent/US9038075B2/en">https://patents.google.com/patent/US9038075B2/en</a><br> </div> Fri, 31 May 2019 21:50:29 +0000 System calls and architectures https://lwn.net/Articles/790004/ https://lwn.net/Articles/790004/ cyphar <div class="FormattedComment"> And new (&gt;403) syscalls now use the same number on all architectures, so in principle there should be no need to rebuild libraries to get a __NR_foobar definition on a given architecture -- libraries should be able to simply do a -ENOSYS check at runtime with a non-arch-specific __NR_foobar value.<br> </div> Fri, 31 May 2019 14:27:42 +0000 Filling the ring buffer https://lwn.net/Articles/789999/ https://lwn.net/Articles/789999/ corbet That cannot happen, as it turns out. I didn't get deeply into this in the article; maybe I should have. The new epoll code gives each file descriptor a dedicated entry in the <tt>items</tt> array; when one becomes ready, an index to it is added to the index array, which is the real ring buffer. Until user space consumes the item, there is nothing more to add to the index array - the file descriptor is already there (though more POLL* bits could be set). So the ring buffer can fill but never overflow. Fri, 31 May 2019 14:21:58 +0000 A ring buffer for epoll https://lwn.net/Articles/789998/ https://lwn.net/Articles/789998/ jhoblitt <div class="FormattedComment"> What happens if userspace doesn't keep up, the ring buffer is full, and new fd events are generated?<br> </div> Fri, 31 May 2019 14:14:23 +0000 System calls and architectures https://lwn.net/Articles/789996/ https://lwn.net/Articles/789996/ corbet That's not really a problem with new system calls; it's about how they are implemented in the kernel. The good news is that this situation has gotten a lot better and continues to improve.
A lot of the system-call boilerplate is being unified across architectures, and it's increasingly expected that new system calls will be enabled for most or all architectures from the outset. Fri, 31 May 2019 13:50:59 +0000 Not need for new syscall https://lwn.net/Articles/789972/ https://lwn.net/Articles/789972/ smurf <div class="FormattedComment"> Multiplex syscalls are generally frowned upon these days. Indirection eats another register for the "real" syscall number; tracing and syscall filtering get more complicated; … Besides, yes, the syscall table would be full after adding the 512th entry, but extending it to 1024 is not exactly rocket science.<br> <p> Adding a generator for these tables, in order to use a central point of syscall registry instead of the current arch hodgepodge, is certainly possible. Just do it …<br> </div> Fri, 31 May 2019 12:47:18 +0000 Not need for new syscall https://lwn.net/Articles/789971/ https://lwn.net/Articles/789971/ epa <div class="FormattedComment"> It seems there are two different issues here. One is the ABI used to call into the kernel on different architectures. That may support a fixed number of 'system call numbers' or have performance reasons to keep it down. The other is the API provided to the C library and by the C library to applications so they can call the familiar named functions like open(2) or kill(2). You could have an operating system running on i386 that used only a syscall number when calling into the kernel, but still provided the usual POSIX system call names. Is there a reason Linux can't add new "system calls" indefinitely in this way?<br> </div> Fri, 31 May 2019 11:28:37 +0000 FlexSC https://lwn.net/Articles/789963/ https://lwn.net/Articles/789963/ smurf <div class="FormattedComment"> That paper seems very interesting.
Too bad it's 9 years old and no follow-up has happened.<br> <p> <p> </div> Fri, 31 May 2019 07:33:12 +0000 Not need for new syscall https://lwn.net/Articles/789961/ https://lwn.net/Articles/789961/ koenkooi <div class="FormattedComment"> My issue with new syscalls is that they usually get added and enabled for a single platform, x86_64, and only added to more platforms months or years after that. This happened with the original epoll and accept4. The issue manifested itself as a 180-second delay during boot due to accept4:<br> <p> * sys_accept4() was added in 2.6.28<br> * sys_accept4() was added for ARM in 2.6.36 <br> * (e)glibc built against 2.6.32 headers on an ARM board running 2.6.32<br> <p> With help from the systemd folks I tracked it down to accept4 missing, so I applied <a href="http://lists.infradead.org/pipermail/linux-arm-kernel/2010-August/022349.html">http://lists.infradead.org/pipermail/linux-arm-kernel/201...</a> to the 2.6.32 kernel. Still a 3 minute delay. That's when I realized I needed to build eglibc against the patched 2.6.32 headers as well as patching the kernel. Running a kernel with the new syscall hooked up is not enough!<br> <p> So every time a new syscall gets proposed that is desired by the base layers in the OS I keep an eye on the ARM syscall list to avoid surprises. Marcin keeps this table up to date: <a href="https://fedora.juszkiewicz.com.pl/syscalls.html">https://fedora.juszkiewicz.com.pl/syscalls.html</a><br> </div> Fri, 31 May 2019 06:56:05 +0000 A ring buffer for epoll https://lwn.net/Articles/789957/ https://lwn.net/Articles/789957/ smurf <div class="FormattedComment"> That, and a single syscall – to signal the kernel when you write the first event to an empty buffer. (That syscall already exists, by the way: futex_wait().
You simply need to also support kernel threads.)<br> <p> </div> Fri, 31 May 2019 05:48:33 +0000 A ring buffer for epoll https://lwn.net/Articles/789956/ https://lwn.net/Articles/789956/ smurf <div class="FormattedComment"> Too much overhead. If there's a continuous stream of ready file descriptors you want no system calls. This does it.<br> <p> Would be even less overhead if the kernel had a single sensible ring buffer implementation. This is not.<br> </div> Fri, 31 May 2019 05:42:46 +0000 A ring buffer for epoll https://lwn.net/Articles/789953/ https://lwn.net/Articles/789953/ sbaugh <div class="FormattedComment"> <font class="QuotedText">&gt;I wonder if future operating system designs will use ring buffers for everything instead of system calls</font><br> <p> Sounds like FlexSC: <a href="https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Soares.pdf">https://www.usenix.org/legacy/events/osdi10/tech/full_pap...</a><br> </div> Fri, 31 May 2019 02:50:04 +0000 A ring buffer for epoll https://lwn.net/Articles/789939/ https://lwn.net/Articles/789939/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; takes a familiar form: add yet another ring-buffer interface to the kernel. </font><br> <p> I wonder if future operating system designs will use ring buffers for everything instead of system calls. Want to open a file? Add an "open file" request to one ring buffer, and wait for the corresponding response in another ring buffer.<br> </div> Fri, 31 May 2019 00:55:33 +0000 A ring buffer for epoll https://lwn.net/Articles/789933/ https://lwn.net/Articles/789933/ mst@redhat.com <div class="FormattedComment"> <font class="QuotedText">&gt; Now, let's pull the virtio ring into io_uring as well... 
;-)</font><br> The old split ring layout is somewhat complex.<br> The new packed ring format might be a good fit for that.<br> <p> </div> Thu, 30 May 2019 23:34:14 +0000 Not need for new syscall https://lwn.net/Articles/789931/ https://lwn.net/Articles/789931/ cyphar <div class="FormattedComment"> We are running out of syscall space. 5.3 will probably have 434 common syscalls on all architectures, and there are apparently cache-related performance impacts once you pass 512 (on x86 at least). This doesn't mean we should always avoid new syscalls, but rather we should be careful when we add them. If the only user-facing purpose of a new syscall is to add a struct argument then we should look at doing it that way.<br> </div> Thu, 30 May 2019 22:53:03 +0000 Not need for new syscall https://lwn.net/Articles/789927/ https://lwn.net/Articles/789927/ scientes <div class="FormattedComment"> I just checked, epoll_create1() checks for unknown flags, so there totally is no need for a new syscall.<br> </div> Thu, 30 May 2019 21:50:59 +0000 Not need for new syscall https://lwn.net/Articles/789923/ https://lwn.net/Articles/789923/ Cyberax <div class="FormattedComment"> I don't quite get it why people are so opposed to new syscalls.<br> </div> Thu, 30 May 2019 21:36:54 +0000 Not need for new syscall https://lwn.net/Articles/789921/ https://lwn.net/Articles/789921/ scientes <div class="FormattedComment"> You just add a flag, and with that flag there is a second syscall argument. Look at futex() and the crazy variable number of arguments. glibc then magically calls it epoll_create2, or whatever. But no need for a new syscall, just a new flag.<br> </div> Thu, 30 May 2019 21:30:56 +0000 A ring buffer for epoll https://lwn.net/Articles/789892/ https://lwn.net/Articles/789892/ bjorntopel <div class="FormattedComment"> Regarding the "Some closing grumbles" section: We (as in the AF_XDP authors) are looking into supporting the io_uring in addition to the AF_XDP rings. 
At least for sockets, the io_uring looks like an excellent fit. Jens Axboe has made some really good design decisions there!<br> <p> Now, let's pull the virtio ring into io_uring as well... ;-)<br> <p> <p> </div> Thu, 30 May 2019 17:14:23 +0000 A ring buffer for epoll https://lwn.net/Articles/789891/ https://lwn.net/Articles/789891/ quotemstr <div class="FormattedComment"> I'm really confused. Why wouldn't we just use the AIO poll interface?<br> </div> Thu, 30 May 2019 17:03:59 +0000