LWN: Comments on "A ring buffer for epoll" https://lwn.net/Articles/789603/ This is a special feed containing comments posted to the individual LWN article titled "A ring buffer for epoll". en-us Thu, 16 Oct 2025 09:29:11 +0000 Thu, 16 Oct 2025 09:29:11 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Filling the ring buffer https://lwn.net/Articles/855164/ https://lwn.net/Articles/855164/ brho <div class="FormattedComment"> this sounds a lot like what i did in a similar &quot;ring buffer for something like epoll&quot; in another OS: <br> <p> <a href="https://github.com/brho/akaros/blob/master/kern/include/ros/ceq.h">https://github.com/brho/akaros/blob/master/kern/include/r...</a>. <br> <p> the &#x27;CEQ&#x27; was designed so that i could do epoll in userspace on a non-linux research OS.<br> </div> Mon, 03 May 2021 14:08:02 +0000 A ring buffer for epoll https://lwn.net/Articles/790757/ https://lwn.net/Articles/790757/ excors <div class="FormattedComment"> It looks like it stores index plus one so that 0 can have a special meaning, but I don't understand why 0 needs a special meaning. Why wouldn't the kernel just write the correct index before incrementing tail (with a write barrier in between), so that userspace only needs to wait for tail and not wait again for index? I think the kernel has to be doing a write barrier anyway (after writing to the epoll_uitem, before writing to tail/index) so I don't see how it would be needed for performance.<br> </div> Sun, 09 Jun 2019 15:21:54 +0000 A ring buffer for epoll https://lwn.net/Articles/790756/ https://lwn.net/Articles/790756/ corbet Sigh, it should have been using the <tt>head</tt> index, yes; that has been fixed. The "-1" is correct, though: remember that the index value is one higher than the actual distance into the array. 
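The indexing scheme discussed in these comments can be sketched in C. Everything below is an illustrative reconstruction, not the actual patch: the structure names (`uheader`, `uitem`), the field layout, and the power-of-two masking are assumptions. The two points it tries to capture are the acquire load of `tail` pairing with the producer's release store, and the `- 1` converting a stored index-plus-one entry back into an `items[]` position.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical layout, reconstructed from the discussion; not the
 * patch's actual structures. */
struct uitem {
    uint32_t ready_events;
    uint32_t events;
    uint64_t data;
};

struct uheader {
    _Atomic uint32_t head;   /* consumer position, advanced by user space */
    _Atomic uint32_t tail;   /* producer position, advanced by the kernel */
    uint32_t nr;             /* ring size; assumed a power of two */
    struct uitem items[];    /* one dedicated slot per file descriptor */
};

/* Copy out one ready item; returns 0 if the ring is empty.
 * index[] stores item positions *plus one* so that 0 can mean "empty";
 * hence the "- 1" when converting an entry back to an items[] offset.
 * A real implementation would also need READ_ONCE()-style access to
 * index[], since the kernel writes it concurrently. */
static int consume(struct uheader *h, const uint32_t *index,
                   struct uitem *out)
{
    uint32_t head = atomic_load_explicit(&h->head, memory_order_relaxed);
    /* Acquire pairs with the producer's release store of tail, making
     * the item and index writes visible before we read them. */
    uint32_t tail = atomic_load_explicit(&h->tail, memory_order_acquire);

    if (head == tail)
        return 0;                          /* nothing ready */

    uint32_t slot = index[head & (h->nr - 1)];
    *out = h->items[slot - 1];             /* entries are index + 1 */
    atomic_store_explicit(&h->head, head + 1, memory_order_release);
    return 1;
}
```

The producer side (the kernel, in the real interface) would mirror this: fill the item, write the slot number plus one into `index[tail & (nr - 1)]`, then release-store the incremented tail.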
Sun, 09 Jun 2019 15:00:24 +0000 A ring buffer for epoll https://lwn.net/Articles/790753/ https://lwn.net/Articles/790753/ ianmcc <div class="FormattedComment"> item = header-&gt;items + index[header-&gt;tail] - 1;<br> <p> This looks suspicious. Should it be<br> <p> item = header-&gt;items + index[header-&gt;head];<br> <p> ?<br> </div> Sun, 09 Jun 2019 13:04:12 +0000 A ring buffer for epoll https://lwn.net/Articles/790180/ https://lwn.net/Articles/790180/ daney <div class="FormattedComment"> Because you cannot change (break) userspace after a new facility is added to the kernel, you have to make sure the userspace interfaces are fully specified and correct *before* they are added.<br> <p> Because use of this new epoll interface requires correct, race-free access to multiple memory locations (head, tail, items[i].ready_events, etc.) by both the kernel and userspace, instead of a simple system call, specification of ordering is important. When running on x86, naive implementations may work by accident, where identical code may fail on more weakly ordered architectures. It would be nice to see something that also works on arm64, ppc, et al. from the start.<br> <p> Also, since the kernel is consuming data from multiple memory locations that are under control of userspace, it would seem great care must be taken to ensure that no security vulnerabilities are introduced. The more common paradigm of: copy in user data at system call time, validate, compute result, return to user, no longer holds.
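A minimal sketch of the kind of defensive access this implies (the helper below is hypothetical, and a power-of-two ring size is assumed):

```c
#include <stdint.h>

/* Read an untrusted, user-controlled index exactly once and force it
 * into range.  Re-reading shared memory after a bounds check would
 * reopen a time-of-check-to-time-of-use race: user space could change
 * the value between the check and the use. */
static inline uint32_t read_once_clamped(const volatile uint32_t *p,
                                         uint32_t limit /* power of two */)
{
    uint32_t v = *p;          /* single read of the shared location */
    return v & (limit - 1);   /* mask, not branch: always a valid slot */
}
```

Masking rather than branching means even a hostile value can only ever select a valid slot, and the single volatile read avoids the check-then-reread race.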
<br> </div> Mon, 03 Jun 2019 16:14:08 +0000 A ring buffer for epoll https://lwn.net/Articles/790108/ https://lwn.net/Articles/790108/ smurf <div class="FormattedComment"> Assignee of record is still Red Hat.<br> </div> Mon, 03 Jun 2019 09:09:26 +0000 A ring buffer for epoll https://lwn.net/Articles/790103/ https://lwn.net/Articles/790103/ edeloget <div class="FormattedComment"> I agree that handling memory barriers in user code is not very simple, but avoiding a system call is still a big win for the small number of people who need this interface (and, frankly, if you are in need of this API, you'd better write the code correctly :))<br> </div> Mon, 03 Jun 2019 00:36:11 +0000 A ring buffer for epoll https://lwn.net/Articles/790095/ https://lwn.net/Articles/790095/ pbonzini <div class="FormattedComment"> So far Red Hat has not been acquired; it's still an independent public company, although in the process of being acquired.<br> </div> Sun, 02 Jun 2019 19:30:00 +0000 A ring buffer for epoll https://lwn.net/Articles/790084/ https://lwn.net/Articles/790084/ daney <div class="FormattedComment"> Your code for reading items out of the ring buffer is undoubtedly missing memory barrier operations.<br> <p> Within the kernel you have code review of memory access ordering issues and might have a chance at getting them implemented correctly. How do you ensure the same on the consuming side in userspace?<br> <p> Somebody should probably provide a wrapper library for all this that tries to do the right thing.<br> <p> System calls make nice barriers; if you eliminate the system call, you move the responsibility for correctly implementing the barriers to the authors of any userspace code. <br> </div> Sun, 02 Jun 2019 16:05:34 +0000 A ring buffer for epoll https://lwn.net/Articles/790081/ https://lwn.net/Articles/790081/ alison <div class="FormattedComment"> <font class="QuotedText">&gt;I think you mean "donated to OIN". </font><br> <p> There's some good news, at least.
The contrast between IBM's handling of Red Hat and Oracle's of Sun is striking, at least so far.<br> </div> Sun, 02 Jun 2019 14:52:03 +0000 A ring buffer for epoll https://lwn.net/Articles/790073/ https://lwn.net/Articles/790073/ pbonzini I think you mean "donated to <a href="https://www.openinventionnetwork.com/about-us/">OIN</a>". Sun, 02 Jun 2019 09:48:52 +0000 A ring buffer for epoll https://lwn.net/Articles/790071/ https://lwn.net/Articles/790071/ alison <div class="FormattedComment"> I think you mean "owned by IBM," ahem.<br> </div> Sun, 02 Jun 2019 05:07:39 +0000 A ring buffer for epoll https://lwn.net/Articles/790063/ https://lwn.net/Articles/790063/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Figuring out the above interface required a substantial amount of reverse engineering of the code. This is a rather complex new API, but it is almost entirely undocumented; that will make it hard to use, but the lack of documentation also makes it hard to review the API in the first place. It is doubtful that anybody beyond the author has written any code to use this API at this point. Whether the development community will fully understand this API before committing to it is far from clear. </font><br> <p> I don't understand why this is the case for just about every new Linux API. Coming from another community (postgres), I really don't understand why it's not a hard requirement to provide some minimal set of API docs? It doesn't have to be fully man-page formatted, nicely phrased, native-speaker-level English. But it should provide at least enough information to be able to write that manpage (and a good bit of this article) without having to do a lot of original research. This isn't even hard to enforce?<br> </div> Sat, 01 Jun 2019 20:43:34 +0000 Not need for new syscall https://lwn.net/Articles/790059/ https://lwn.net/Articles/790059/ luto <div class="FormattedComment"> What’s the issue on x86?
As far as I know, the only real issue is running into the silly x32 aliases, but we can easily fix that.<br> </div> Sat, 01 Jun 2019 15:24:15 +0000 A ring buffer for epoll https://lwn.net/Articles/790048/ https://lwn.net/Articles/790048/ smurf <div class="FormattedComment"> That patent is owned by Red Hat, so no problem here.<br> </div> Sat, 01 Jun 2019 08:05:45 +0000 A ring buffer for epoll https://lwn.net/Articles/790047/ https://lwn.net/Articles/790047/ pbonzini <div class="FormattedComment"> The event ring buffer can be accessed from user space without invoking io_getevents.<br> </div> Sat, 01 Jun 2019 07:15:16 +0000 A ring buffer for epoll https://lwn.net/Articles/790034/ https://lwn.net/Articles/790034/ cesarb <div class="FormattedComment"> Very interesting. One of the things mentioned in that paper is that using a ring buffer for system calls allows running the kernel and user space in separate cores; this might be a way to reduce the impact of Spectre/Meltdown/etc mitigations, and even strengthen them by keeping both siblings of each SMT pair either in the kernel or in user space all the time (so there would no longer be a need to either disable SMT, or do a very expensive IPI on every kernel entry/exit to protect against MDS).<br> </div> Fri, 31 May 2019 22:58:25 +0000 A ring buffer for epoll https://lwn.net/Articles/790031/ https://lwn.net/Articles/790031/ mm7323 <div class="FormattedComment"> You could call that syscall batching. 
Apart from the downsides of error handling, I believe there are patent issues:<br> <p> <a href="https://patents.google.com/patent/US9038075B2/en">https://patents.google.com/patent/US9038075B2/en</a><br> </div> Fri, 31 May 2019 21:50:29 +0000 System calls and architectures https://lwn.net/Articles/790004/ https://lwn.net/Articles/790004/ cyphar <div class="FormattedComment"> And new (&gt;403) syscalls now use the same number on all architectures, so in principle there should be no need to rebuild libraries to get a __NR_foobar definition on a given architecture -- libraries should be able to simply do a -ENOSYS check at runtime with a non-arch-specific __NR_foobar value.<br> </div> Fri, 31 May 2019 14:27:42 +0000 Filling the ring buffer https://lwn.net/Articles/789999/ https://lwn.net/Articles/789999/ corbet That cannot happen, as it turns out. I didn't get deeply into this in the article; maybe I should have. The new epoll code gives each file descriptor a dedicated entry in the <tt>items</tt> array; when one becomes ready, an index to it is added to the index array, which is the real ring buffer. Until user space consumes the item, there is nothing more to add to the index array - the file descriptor is already there (though more POLL* bits could be set). So the ring buffer can fill but never overflow. Fri, 31 May 2019 14:21:58 +0000 A ring buffer for epoll https://lwn.net/Articles/789998/ https://lwn.net/Articles/789998/ jhoblitt <div class="FormattedComment"> What happens if userspace doesn't keep up, the ring buffer is full, and new fd events are generated?<br> </div> Fri, 31 May 2019 14:14:23 +0000 System calls and architectures https://lwn.net/Articles/789996/ https://lwn.net/Articles/789996/ corbet That's not really a problem with new system calls; it's about how they are implemented in the kernel. The good news is that this situation has gotten a lot better and continues to improve.
A lot of the system-call boilerplate is being unified across architectures, and it's increasingly expected that new system calls will be enabled for most or all architectures from the outset. Fri, 31 May 2019 13:50:59 +0000 Not need for new syscall https://lwn.net/Articles/789972/ https://lwn.net/Articles/789972/ smurf <div class="FormattedComment"> Multiplex syscalls are generally frowned upon these days. Indirection eats another register for the "real" syscall number; tracing and syscall filtering get more complicated; … Besides, yes, the syscall table would be full after adding the 512th entry, but extending it to 1024 is not exactly rocket science.<br> <p> Adding a generator for these tables, in order to use a central point of syscall registry instead of the current arch hodgepodge, is certainly possible. Just do it …<br> </div> Fri, 31 May 2019 12:47:18 +0000 Not need for new syscall https://lwn.net/Articles/789971/ https://lwn.net/Articles/789971/ epa <div class="FormattedComment"> It seems there are two different issues here. One is the ABI used to call into the kernel on different architectures. That may support a fixed number of 'system call numbers' or have performance reasons to keep it down. The other is the API provided to the C library and by the C library to applications so they can call the familiar named functions like open(2) or kill(2). You could have an operating system running on i386 that used only a syscall number when calling into the kernel, but still provided the usual POSIX system call names. Is there a reason Linux can't add new "system calls" indefinitely in this way?<br> </div> Fri, 31 May 2019 11:28:37 +0000 FlexSC https://lwn.net/Articles/789963/ https://lwn.net/Articles/789963/ smurf <div class="FormattedComment"> That paper seems very interesting.
Too bad it's 9 years old and no follow-up has happened.<br> <p> <p> </div> Fri, 31 May 2019 07:33:12 +0000 Not need for new syscall https://lwn.net/Articles/789961/ https://lwn.net/Articles/789961/ koenkooi <div class="FormattedComment"> My issue with new syscalls is that they usually get added and enabled for a single platform, x86_64, and only added to more platforms months or years after that. This happened with the original epoll and accept4. The issue manifested itself as a 180-second delay during boot due to accept4:<br> <p> * sys_accept4() was added in 2.6.28<br> * sys_accept4() was added for ARM in 2.6.36 <br> * (e)glibc built against 2.6.32 headers on an ARM board running 2.6.32<br> <p> With help from the systemd folks I tracked it down to accept4 missing, so I applied <a href="http://lists.infradead.org/pipermail/linux-arm-kernel/2010-August/022349.html">http://lists.infradead.org/pipermail/linux-arm-kernel/201...</a> to the 2.6.32 kernel. Still a 3 minute delay. That's when I realized I needed to build eglibc against the patched 2.6.32 headers as well as patching the kernel. Running a kernel with the new syscall hooked up is not enough!<br> <p> So every time a new syscall gets proposed that is desired by the base layers in the OS I keep an eye on the ARM syscall list to avoid surprises. Marcin keeps this table up to date: <a href="https://fedora.juszkiewicz.com.pl/syscalls.html">https://fedora.juszkiewicz.com.pl/syscalls.html</a><br> </div> Fri, 31 May 2019 06:56:05 +0000 A ring buffer for epoll https://lwn.net/Articles/789957/ https://lwn.net/Articles/789957/ smurf <div class="FormattedComment"> That, and a single syscall – to signal the kernel when you write the first event to an empty buffer. (That syscall already exists, by the way: futex_wait().
You simply need to also support kernel threads.)<br> <p> </div> Fri, 31 May 2019 05:48:33 +0000 A ring buffer for epoll https://lwn.net/Articles/789956/ https://lwn.net/Articles/789956/ smurf <div class="FormattedComment"> Too much overhead. If there's a continuous stream of ready file descriptors you want no system calls. This does it.<br> <p> Would be even less overhead if the kernel had a single sensible ring buffer implementation. This is not.<br> </div> Fri, 31 May 2019 05:42:46 +0000 A ring buffer for epoll https://lwn.net/Articles/789953/ https://lwn.net/Articles/789953/ sbaugh <div class="FormattedComment"> <font class="QuotedText">&gt;I wonder if future operating system designs will use ring buffers for everything instead of system calls</font><br> <p> Sounds like FlexSC: <a href="https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Soares.pdf">https://www.usenix.org/legacy/events/osdi10/tech/full_pap...</a><br> </div> Fri, 31 May 2019 02:50:04 +0000 A ring buffer for epoll https://lwn.net/Articles/789939/ https://lwn.net/Articles/789939/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; takes a familiar form: add yet another ring-buffer interface to the kernel. </font><br> <p> I wonder if future operating system designs will use ring buffers for everything instead of system calls. Want to open a file? Add an "open file" request to one ring buffer, and wait for the corresponding response in another ring buffer.<br> </div> Fri, 31 May 2019 00:55:33 +0000 A ring buffer for epoll https://lwn.net/Articles/789933/ https://lwn.net/Articles/789933/ mst@redhat.com <div class="FormattedComment"> <font class="QuotedText">&gt; Now, let's pull the virtio ring into io_uring as well... 
;-)</font><br> The old split ring layout is somewhat complex.<br> The new packed ring format might be a good fit for that.<br> <p> </div> Thu, 30 May 2019 23:34:14 +0000 Not need for new syscall https://lwn.net/Articles/789931/ https://lwn.net/Articles/789931/ cyphar <div class="FormattedComment"> We are running out of syscall space. 5.3 will probably have 434 common syscalls on all architectures, and there are apparently cache-related performance impacts once you pass 512 (on x86 at least). This doesn't mean we should always avoid new syscalls, but rather we should be careful when we add them. If the only user-facing purpose of a new syscall is to add a struct argument then we should look at doing it that way.<br> </div> Thu, 30 May 2019 22:53:03 +0000 Not need for new syscall https://lwn.net/Articles/789927/ https://lwn.net/Articles/789927/ scientes <div class="FormattedComment"> I just checked, epoll_create1() checks for unknown flags, so there totally is no need for a new syscall.<br> </div> Thu, 30 May 2019 21:50:59 +0000 Not need for new syscall https://lwn.net/Articles/789923/ https://lwn.net/Articles/789923/ Cyberax <div class="FormattedComment"> I don't quite get it why people are so opposed to new syscalls.<br> </div> Thu, 30 May 2019 21:36:54 +0000 Not need for new syscall https://lwn.net/Articles/789921/ https://lwn.net/Articles/789921/ scientes <div class="FormattedComment"> You just add a flag, and with that flag there is a second syscall argument. Look at futex() and the crazy variable number of arguments. glibc then magically calls it epoll_create2, or whatever. But no need for a new syscall, just a new flag.<br> </div> Thu, 30 May 2019 21:30:56 +0000 A ring buffer for epoll https://lwn.net/Articles/789892/ https://lwn.net/Articles/789892/ bjorntopel <div class="FormattedComment"> Regarding the "Some closing grumbles" section: We (as in the AF_XDP authors) are looking into supporting the io_uring in addition to the AF_XDP rings. 
At least for sockets, the io_uring looks like an excellent fit. Jens Axboe has made some really good design decisions there!<br> <p> Now, let's pull the virtio ring into io_uring as well... ;-)<br> <p> <p> </div> Thu, 30 May 2019 17:14:23 +0000 A ring buffer for epoll https://lwn.net/Articles/789891/ https://lwn.net/Articles/789891/ quotemstr <div class="FormattedComment"> I'm really confused. Why wouldn't we just use the AIO poll interface?<br> </div> Thu, 30 May 2019 17:03:59 +0000