The return of kevent?
As was mentioned last week, one obstacle came up in the form of pollfs, an implementation of a very similar idea. There were a couple of relatively harsh reviews of the pollfs code, and its profile appears to have lowered considerably. It is possible that a new, improved version of pollfs could show up in the near future, but it would have to be a lot better to grab a significant amount of attention. The pollfs code has probably shown up too late to the game.
There's another late arrival who will have to be listened to, however: glibc maintainer Ulrich Drepper. Having sat out the discussion of eventfd, he is now back and opposing its inclusion into the mainline:
I can only say that I would be trickly [sic] against it. It makes just no sense.
Ulrich has a number of complaints about the eventfd approach:
- The eventfd code, by relying on poll() and variants, does not
provide a way for applications to obtain events without entering the
kernel. For high-bandwidth applications - big network servers, for
example - eliminating system calls is one of the keys to adequate
performance. The kevent code, with its user-space event ring,
provides that sort of mechanism while eventfd does not.
- The use of poll() also makes it hard for the kernel to pass
information back to the application - the communication channel only
includes a few bits. The kevent interface allows for a fair amount of
information to be packaged with each event. Eventfd gets around this
problem by allowing applications to read more event information from
the relevant file descriptors - but that requires another system call (a brief sketch after this list illustrates the pattern).
- Ulrich argues that the poll()
interface poses unsolvable issues with regard to threads and
cancellation processing. This argument is not universally accepted, however.
- The current eventfd code does not let applications wait on futexes, and Davide Libenzi, the eventfd developer, is uninclined to add that support. The pollfs patches do support futex waits, though Ulrich had some issues with the implementation. In general, Ulrich would like to see a single system call where applications can wait for anything, so leaving out primitives like futexes will leave him unsatisfied.
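For context, here is a minimal sketch of the eventfd usage pattern the first two complaints refer to. It assumes a glibc that provides the eventfd() wrapper, and the error handling is abbreviated: poll() only reports that the descriptor is readable; learning the counter's value takes a further read().

    /* Sketch: eventfd signals readiness through poll(), but the actual
     * counter value has to be fetched with a second system call. */
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <poll.h>
    #include <sys/eventfd.h>

    int main(void)
    {
        int efd = eventfd(0, 0);            /* counter starts at zero */
        if (efd < 0) { perror("eventfd"); return 1; }

        uint64_t n = 3;
        write(efd, &n, sizeof n);           /* producer: add 3 to the counter */

        struct pollfd pfd = { .fd = efd, .events = POLLIN };
        poll(&pfd, 1, -1);                  /* syscall #1: "efd is readable" */

        uint64_t count;
        read(efd, &count, sizeof count);    /* syscall #2: fetch (and reset) the counter */
        printf("counter was %llu\n", (unsigned long long)count);

        close(efd);
        return 0;
    }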
The end result of this is that Ulrich opposes the merging of eventfd; he would rather see the effort go into making kevent (or a replacement with similar functionality) ready for the mainline. A kevent-like interface, he says, will eventually become necessary in any case:
How this issue will be resolved is entirely unclear. There's not been a flood of developers lining up to support Ulrich's position - but they are not opposing him either. Nobody has dusted off the kevent patches for another round of discussion - yet. But one thing that does seem likely is that this whole discussion may delay the merging of eventfd past the 2.6.22 merge window. User-space interfaces are important and, once they are added to the kernel, they are almost impossible to remove. Waiting another development cycle seems like a small price to pay if it helps the developers to get this decision right.
Update: the eventfd code was merged into the mainline on May 11.
Index entries for this article: Kernel/Events reporting; Kernel/Kevent
Posted May 10, 2007 10:00 UTC (Thu)
by pphaneuf (guest, #23480)
The first complaint is not that significative, IMHO. First off, Linux is quite efficient at syscalls compared to many other Unixes, and where on some of those other Unixes, syscalls are to be avoided like the plague, on Linux you get to worry about this only in the most extreme cases. But also, select/poll/epoll (and any other mechanism which retrieves a number of events at a time) have this property of having a lower and lower syscall overhead as the load increases: the more events are returned, the more time is spent between calls to select/poll/epoll (in order to process those events), and they are thus called less and less often, with big chunks of events returned each time.
And again, for the Nth time, the ring buffer does have a syscall every so many events (not unlike a select/poll/epoll syscall every so many events received)! There is a difference when the load is low, since the application can still call kevent_commit() once per N events received instead of just getting fewer events per call as with select/poll/epoll, but arguably this lower overhead is only useful at high load, and that is precisely where it disappears.
The ring buffer scheme has a bad smell to me, in that it reminds me of notification via realtime signals, which could overflow the signal queue and required the application to support this "overload" and have another path of code to handle it. What happens when an event arrives and the ring buffer is full (the application is slow to process events, which is likely in a high-load situation)? Do we need to have another path of code in the application? Ironically, this would occur at high loads, which is precisely what we're pushing that ring buffer for! Something like epoll, IMHO, having no ceiling (the readiness information is kept in the fd structures in the kernel, if I remember correctly, so you can never run out of space), has the advantage of simplicity, which is not to be sneered at. A single code path means less to debug, smaller code size and less branching.
In short, I suspect that a large enough "maxevents" parameter to epoll_wait() might yield identical performance results to using the kevent ring buffer, possibly with simpler code too.
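A minimal sketch of that pattern, assuming Linux epoll (event_loop(), handle_event(), and MAXEVENTS are illustrative names): a single epoll_wait() call returns a whole batch of ready descriptors, so the syscall cost per event shrinks as the load rises.

    /* Sketch: batch retrieval with epoll. One epoll_wait() syscall can
     * return many ready descriptors, amortizing its cost over the batch. */
    #include <sys/epoll.h>

    #define MAXEVENTS 1024

    void handle_event(int fd, unsigned int revents);   /* hypothetical handler */

    void event_loop(int epfd)
    {
        struct epoll_event events[MAXEVENTS];

        for (;;) {
            /* Blocks until something is ready, then returns up to
             * MAXEVENTS descriptors in a single system call. */
            int n = epoll_wait(epfd, events, MAXEVENTS, -1);
            for (int i = 0; i < n; i++)
                handle_event(events[i].data.fd, events[i].events);
        }
    }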
The second complaint is kind of fair, although that's never been much of a problem for my applications. Knowing whether a descriptor is readable or writable, plus a pointer to my own data so I can quickly find the corresponding context and process the event, is quite enough in my case. Maybe there's room for improvement; I would need to be pointed at examples. Also, note that most applications will want an abstraction over this platform-specific code, so that they can substitute a more portable version on non-Linux platforms; an interface that's too radically different or too difficult to emulate might just go unused. But that's a bit of a judgement call. Let's hear more on that point, I'd say.
For the complaint about thread cancellation, I'm with Linus. There is actually one case where this could be a problem, when using edge-triggered events with epoll, but even that could probably be made to behave correctly (by checking for cancellation before pulling the events from the file descriptors, say). select and poll are perfectly safe from this, since they do not change the state of the file descriptors at all, so another thread calling them again would get the same events again.
The lack of support for futexes can seem annoying, but in reality it isn't much of a problem, and integrating them would actually be a lot of trouble for application developers (back to having to emulate things on non-Linux platforms, again). The biggest thing it would be handy for is semaphores. A semaphore is basically a counter (protected by a mutex, or updated with processor-specific atomic instructions) plus a condition variable, and it would certainly be doable to make a semaphore that uses a pipe instead of a condition variable. In the simplest case, the pipe can even be used directly as the semaphore, and a more complex implementation could reduce the number of syscalls (although I suspect pthread_cond_broadcast(), which I think is called in sem_post(), also issues a syscall every time it is called).
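A minimal sketch of the "pipe used directly as a semaphore" idea just described; the pipe_sem_* names are illustrative and error handling is omitted. The read end can also be fed to select/poll/epoll, which is exactly what a futex cannot offer.

    /* Sketch: a counting semaphore built on a pipe. Each post writes one
     * byte, each wait consumes one byte (blocking if none is available). */
    #include <unistd.h>

    struct pipe_sem { int rd; int wr; };

    int pipe_sem_init(struct pipe_sem *s)
    {
        int fds[2];
        if (pipe(fds) < 0)
            return -1;
        s->rd = fds[0];
        s->wr = fds[1];
        return 0;
    }

    void pipe_sem_post(struct pipe_sem *s)
    {
        char c = 0;
        write(s->wr, &c, 1);    /* one byte per "count" */
    }

    void pipe_sem_wait(struct pipe_sem *s)
    {
        char c;
        read(s->rd, &c, 1);     /* blocks until a byte is available */
    }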
Otherwise, most other uses of futexes in the kind of server application that would use event multiplexing would be similar to that of spinlocks in the kernel, not blocking for long period of time, so being able to process other events while waiting for them just wouldn't be so useful.
Posted May 10, 2007 21:50 UTC (Thu)
by intgr (subscriber, #39733)
> select/poll/epoll [...] have this property of having a lower and lower syscall overhead as the load increases

This is not true if "load" means "a large number of sockets", especially when the majority of sockets are inactive at any given time. The difference between the APIs is that select and poll have to enumerate all known file descriptors on each cycle, while epoll and kevent are specifically told which file descriptors are hot. TCP congestion control will see to it that no more events are signalled than the server can handle. Formally, select/poll scale linearly with the number of sockets, while epoll/kevent scale linearly with the number of events.
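A sketch of the contrast being drawn, using the standard select() and epoll interfaces (wait_with_select(), register_with_epoll(), and conn_fds are illustrative names): select() is handed the entire descriptor set on every call, while epoll is told about each descriptor once and then only reports the ready ones.

    #include <sys/select.h>
    #include <sys/epoll.h>

    /* select(): the watched set is rebuilt and scanned on every call,
     * so the cost is proportional to the number of descriptors watched. */
    void wait_with_select(const int *conn_fds, int nconns)
    {
        fd_set rfds;
        int maxfd = -1;

        FD_ZERO(&rfds);
        for (int i = 0; i < nconns; i++) {          /* re-enumerate everything */
            FD_SET(conn_fds[i], &rfds);
            if (conn_fds[i] > maxfd)
                maxfd = conn_fds[i];
        }
        select(maxfd + 1, &rfds, NULL, NULL, NULL);
        /* ...and then every descriptor is checked again with FD_ISSET()... */
    }

    /* epoll: interest is registered once; each epoll_wait() afterwards
     * returns only the descriptors that are actually ready. */
    void register_with_epoll(int epfd, int fd)
    {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);    /* done once per descriptor */
    }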
> And again, for the Nth time, the ring buffer does have a syscall every so many events

This is actually the advantage of kevent over epoll: with kevent, the kernel always knows where the event ring is located in user space, so it can just dump events directly into user space when they arrive and forget about them. Since the events are written directly into the process's ring buffer, the process can tell when new events have arrived without a syscall. Thus: no copies, no syscalls.
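To make the mechanism being described concrete, here is a purely hypothetical sketch of a user-space event ring; this is not the actual kevent ABI, and the structures and the commit step are invented for illustration only.

    #include <stdint.h>

    struct ring_event { uint32_t type; uint32_t flags; uint64_t cookie; };

    struct event_ring {
        volatile uint32_t head;        /* advanced by the kernel */
        uint32_t          tail;        /* advanced by the application */
        uint32_t          size;        /* number of slots */
        struct ring_event events[];    /* the ring itself, in shared memory */
    };

    /* Drain whatever is currently queued without entering the kernel. */
    int consume_events(struct event_ring *r, void (*handle)(struct ring_event *))
    {
        int n = 0;
        while (r->tail != r->head) {
            handle(&r->events[r->tail % r->size]);
            r->tail++;
            n++;
        }
        /* Every so often the consumed count still has to be reported back
         * (the kevent patches used a commit-style syscall for this), so the
         * kernel knows which slots it may safely overwrite. */
        return n;
    }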
> The ring buffer scheme has a bad smell to me, in that it reminds me of notification via realtime signals

The problem with signals is that the signal buffers are allocated for every process and live in kernel space, so their size has to be conservative. kevent buffers, however, can afford to be huge; in the case of file descriptor events, the upper bound is set by the maximum number of file descriptors allowed for the process, although the event structure is regrettably big (36 bytes, if I counted correctly).
While imperfect, Ulrich Drepper writes in his blog:

> I would imagine that on 64bit platforms we can use large areas. Several MBs if necessary. This would cover worst case scenarios. The key points are that a) the memory needs not be pinned down (interrupt handlers can try to write and just punt to the helper code in case that fails because the memory is swapped out) and b) we can enforce a policy where the page freed by advancing the tail pointer are simply discarded (madvise(MADV_REMOVE)).

While I would very much prefer a more elegant solution to this problem, I think the kevent API has merit over epoll.
Posted May 11, 2007 9:45 UTC (Fri)
by pphaneuf (guest, #23480)
Of course, the good old select/poll being O(N) on the number of file descriptors watched still applies, indeed. I used "load" in this context to mean "work to do", but I indeed use epoll for all servers on Linux (I use kqueue on *BSD, as well). I often end up having to have a select/poll version as well, for portability to those platforms not so well endowed.
I also know about the ring buffer having fewer copies, but I maintain my point: the kernel needs to know how many events have been consumed by the application so that it does not overwrite unread events, and that is done with a system call. Whether you make a system call to get the events as they arrive or a system call to tell the kernel that you have processed them, at the end of the day it's a system call either way.
Also, in order to do edge-triggered event notification (which I find can be useful for spreading the load over multiple threads), the kernel can't just "forget about it"; it keeps some information on the side in the file descriptor structure. The ring buffer does save a copy, but as event sizes go, struct epoll_event isn't so bad (12 bytes), particularly compared with the work that has to be done to process the events themselves.
I know that the ring buffer can be much bigger than the signal queues were, but the point is that it still has a fixed size, and the application thus has to handle the overflow case properly. epoll keeps the information in the file descriptor structures (where it has to be kept anyway, in addition to the event, as I described earlier), so there is no overflow case: if you could open the file descriptor in the first place, it's all good.
Note that among the other things punted over to the application to manage, there's also the issue of closed file descriptors. If a file descriptor has an event, but is closed before the event is processed, and another connection is accepted (very likely to get the same file descriptor number), what happens?
Not to mention that with the kevent ring buffer, it's tricky to spread the load between multiple threads (as described in Ulrich's post that you linked to), whereas epoll manages multiple threads going into epoll_wait() on the same epoll file descriptor nicely...
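A minimal sketch of the multi-thread arrangement alluded to above, assuming Linux epoll with EPOLLONESHOT (add_conn(), worker(), and handle_ready() are illustrative names): several threads block in epoll_wait() on the same epoll descriptor, and the one-shot flag keeps two of them from picking up the same connection at once.

    #include <stddef.h>
    #include <sys/epoll.h>

    void handle_ready(int fd);    /* hypothetical: e.g. read() until EAGAIN */

    void add_conn(int epfd, int fd)
    {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    void *worker(void *arg)       /* run in several threads */
    {
        int epfd = *(int *)arg;
        struct epoll_event ev;

        for (;;) {
            if (epoll_wait(epfd, &ev, 1, -1) != 1)
                continue;
            handle_ready(ev.data.fd);    /* only this thread sees the fd */

            /* Re-arm the descriptor so it can be reported again. */
            struct epoll_event re = { .events = EPOLLIN | EPOLLONESHOT,
                                      .data.fd = ev.data.fd };
            epoll_ctl(epfd, EPOLL_CTL_MOD, ev.data.fd, &re);
        }
        return NULL;
    }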
Posted May 12, 2007 20:58 UTC (Sat)
by intgr (subscriber, #39733)
I concur with all of your points.

> Also, in order to do edge-triggered event notification [...] the kernel can't just "forget about it"

It can forget about the events; naturally, events have side effects, and the kernel will have to keep track of the state of its objects. (Or am I missing something?)
> If a file descriptor has an event, but is closed before the event is processed, and another connection is accepted

Both APIs have an "opaque pointer" field in their event structures. Applications are supposed to use this for identifying clients, not file descriptor numbers.
Posted May 12, 2007 21:34 UTC (Sat)
by pphaneuf (guest, #23480)
If a file descriptor becomes readable, then stops being readable, then becomes readable again, all without the event queue being looked at, should you get two events? With epoll, you get only one (you are only told once that the file descriptor is readable).
Of course, it could go "on the cheap" and let userspace figure it out. But since it's so handy to just have this one bit in the file descriptor structure (which is really the "how many bytes are in the appropriate buffer" count, which you have to have anyway, interpreted as a bool), why not?
Applications don't really get told that they are supposed to use that. The file descriptor number really is the proper identifier, as far as the kernel is concerned. Note that all the other APIs can support that without a problem (none of select/poll/epoll ever gives you bad information like that). Having a pointer is just there to be helpful (and it is, quite!).
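For reference, a minimal sketch of the "opaque pointer" pattern being debated here (struct conn, watch_conn(), and on_event() are illustrative names; error handling is omitted): the application stores a pointer to its own per-connection state in the event, and its own bookkeeping, not the raw descriptor number, decides whether a stale event should be ignored.

    #include <stdlib.h>
    #include <sys/epoll.h>

    struct conn {
        int   fd;
        int   closing;      /* lets the handler ignore events for a dying connection */
        void *app_state;
    };

    struct conn *watch_conn(int epfd, int fd)
    {
        struct conn *c = calloc(1, sizeof *c);
        c->fd = fd;

        struct epoll_event ev;
        ev.events = EPOLLIN;
        ev.data.ptr = c;                      /* context pointer, not the fd number */
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
        return c;
    }

    void on_event(struct epoll_event *ev)
    {
        struct conn *c = ev->data.ptr;        /* context comes straight back */
        if (c->closing)
            return;                           /* the fd may have been reused; skip it */
        /* ... read from c->fd, update c->app_state ... */
    }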
Posted May 17, 2007 7:40 UTC (Thu)
by slamb (guest, #1070)
> The first complaint is not that significative, IMHO.

You're too kind. The first complaint is a total load of shit, and we're all stupider for having entertained the idea. Under load, the syscall overhead of one epoll_wait() is insignificant compared to the syscall overhead of the many, many reads and writes associated with it, not to mention the actual costs of copying or checksumming buffers if you're not just doing zerocopy. I am unable to imagine how anyone could think otherwise, though I've seen this argument (and the resultant code) come up several times in this discussion.
The third complaint is also wrong, but not obviously/offensively so. It's solvable through something like my own sigsafe library (see the table in the main page of the API documentation). They might have to make changes to the syscall page mechanism for this approach to work as well as old-fashioned int 0x80, but that's doable (preserving compatibility and all).
On the other hand, Ulrich's second and fourth complaints have some merit, IMHO. The second in particular has long made me prefer the BSD-style kevent to epoll.
Posted May 17, 2007 13:52 UTC (Thu)
by pphaneuf (guest, #23480)
Good old select/poll did have ridiculous overheads at low loads, with large numbers of clients, but yeah, epoll is more than good enough.
As you mention on your sigsafe main API documentation, if you're using an event loop and non-blocking I/O already, you can do the pipe trick very easily, so in the context of this event delivery mechanism that's kind of the obvious answer, rather than starting to wrap all the "slow" system calls. So while there are some things that could be done to make it even better, I'm not worrying.
You find that there's that much usefulness to the kqueue extra information? For most of those (at least, the things you can also watch with epoll; kqueue does have some interesting "extra" stuff like watching processes), you get the same information with an extra system call (the read() or accept() that gives you EAGAIN, for example).
I kind of use it in the more basic way, behind an abstraction covering it, epoll, and select, most of the time. The thing is, if I had that extra information in my abstraction, then I'd have to have some special "I don't know" value for select/epoll, test for that, slightly different behaviour between epoll and kqueue, etc... So I just don't really find it worth the trouble, at the moment. But I'm open to the idea that I might be missing something...
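For comparison, a minimal sketch of the "extra information" in question, assuming a BSD-style kqueue(2); watch_and_wait() is an illustrative name and error handling is abbreviated. The returned event's data field already carries the number of readable bytes, which epoll does not report with the event.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    void watch_and_wait(int kq, int fd)
    {
        struct kevent change, result;

        EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &change, 1, NULL, 0, NULL);       /* register interest */

        if (kevent(kq, NULL, 0, &result, 1, NULL) == 1) {
            /* result.data: bytes already buffered for fd; with epoll the
             * application typically learns this from the read() itself. */
            long bytes_ready = (long)result.data;
            (void)bytes_ready;
        }
    }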
Posted May 17, 2007 17:09 UTC (Thu)
by slamb (guest, #1070)
> Good old select/poll did have ridiculous overheads at low loads, with large numbers of clients, but yeah, epoll is more than good enough.

Yeah, I'm just comparing kevent to Linux's best existing mechanism - epoll_wait(). As far as I'm concerned, select()/poll() is a straw man. O(n) with number of watched descriptors is ridiculous.
> As you mention on your sigsafe main API documentation, if you're using an event loop and non-blocking I/O already, you can do the pipe trick very easily, so in the context of this event delivery mechanism, that's kind of the obvious answer, rather than starting to wrap all the "slow" system call.

Right, but Ulrich wants to implement thread cancellation. Even that is possible in a way that doesn't lose edge events. Not that I think it's worthwhile to do, as thread cancellation is hopelessly messed up for other reasons. But ncm tells me that the C++ people are looking at doing things right with "thread interruption", and an approach like sigsafe might be useful there.
> You find that there's that much usefulness to the kqueue extra information?

I confess that I haven't actually taken advantage of any of it, but I think there's potential, especially as more event types are added. And here he is actually talking about returning information right away vs. making another system call per event, so this syscall overhead reduction argument makes more sense than removing the actual polling call.
Posted May 17, 2007 20:30 UTC (Thu)
by slamb (guest, #1070)
One thing is that not everyone cares about writing portable code. I used to always write
everything in this way (don't use fancy features or don't depend on them), but a few things
started to change my mind. One of them was reading this story. Another was starting to work
on a BSD-based proprietary system. We
have the make world style - a single source tree, no collection of RPMs with tarballs
and patches. We can add libraries, daemons, and kernel extensions that the rest of the system
depends on. We don't use autoconf-based fallback code or lowest-common denominator
abstraction layers. We've given up on the idea of sending most of our
changes upstream, so we do what works for us. It can be liberating to say "screw portability/
compatibility". It's much easier to do on our system where you can make a single changeset that
modifies everything you need and be certain one piece will never run on a real system without
the other.
Now I'm ready to take the same attitude to other code that I write. Portability isn't a hard
requirement; it's something to be kept as long as it doesn't hold me back too much. Code
doesn't run on Python 2.2? Who cares?!? I run mostly CentOS 5, which has Python 2.4, so I'll take
advantage of the newer language features. Code
doesn't run without epoll_wait() or kevent()? Who cares?!? I use systems with modern kernel
interfaces. Code doesn't run without a Linux-only kernel interface? Maybe I'll add it or something
equivalent to BSD if I want to run it there.
Posted May 31, 2007 14:21 UTC (Thu)
by pphaneuf (guest, #23480)
Well, turns out Ingo has found out as much.
Posted May 17, 2007 14:42 UTC (Thu)
by renox (guest, #23785)
It's really weird, but each time I read a discussion where Ulrich Drepper is involved, I always disagree with him. His strange point about poll vs. read cancellation is just the latest example.
Note that I do respect the man for his many works on free software (he has done much more for free software than I have), but I just find it funny that I'm always disagreeing with him. Weird.
Posted May 17, 2007 22:34 UTC (Thu)
by slamb (guest, #1070)
I feel some sympathy for Ulrich there. You have to realize that thread cancellation is a rather
poorly-described idea that the standards people said in a hand-wavy way was mandatory long
before anyone had a working implementation. People like Ulrich are stuck working out the details of
an actual working system, and traffic on the mailing lists shows that he's put a lot of effort
into it. I don't think his approach is right - he basically just enables async cancels around every cancellation point, whereas a sigsafe-like way would make it possible to honor the cancellation if and only if the system call has not yet returned - but it's not for lack of trying.