The return of kevent?
As was mentioned last week, one obstacle came up in the form of pollfs, an implementation of a very similar idea. There were a couple of relatively harsh reviews of the pollfs code, and its profile appears to have lowered considerably. It is possible that a new, improved version of pollfs could show up in the near future, but it would have to be a lot better to grab a significant amount of attention. The pollfs code has probably shown up too late to the game.
There's another late arrival who will have to be listened to, however: glibc maintainer Ulrich Drepper. Having sat out the discussion of eventfd, he is now back and opposing its inclusion into the mainline:
I can only say that I would be trickly [sic] against it. It makes just no sense.
Ulrich has a number of complaints about the eventfd approach:
- The eventfd code, by relying on poll() and variants, does not
provide a way for applications to obtain events without entering the
kernel. For high-bandwidth applications - big network servers, for
example - eliminating system calls is one of the keys to adequate
performance. The kevent code, with its user-space event ring,
provides that sort of mechanism while eventfd does not.
- The use of poll() also makes it hard for the kernel to pass
information back to the application - the communication channel only
includes a few bits. The kevent interface allows for a fair amount of
information to be packaged with each event. Eventfd gets around this
problem by allowing applications to read more event information from
the relevant file descriptors - but that requires another system call (a brief sketch after this list illustrates the pattern).
- Ulrich argues that the poll()
interface poses unsolvable issues with regard to threads and
cancellation processing. This argument is not universally accepted, however.
- The current eventfd code does not let applications wait on futexes, and Davide Libenzi, the eventfd developer, is uninclined to add that support. The pollfs patches do support futex waits, though Ulrich had some issues with the implementation. In general, Ulrich would like to see a single system call where applications can wait for anything, so leaving out primitives like futexes will leave him unsatisfied.
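For context, here is a minimal sketch of the eventfd usage pattern the first two complaints refer to. It assumes a glibc that provides the eventfd() wrapper, and the error handling is abbreviated: poll() only reports that the descriptor is readable; learning the counter's value takes a further read().

    /* Sketch: eventfd signals readiness through poll(), but the actual
     * counter value has to be fetched with a second system call. */
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <poll.h>
    #include <sys/eventfd.h>

    int main(void)
    {
        int efd = eventfd(0, 0);            /* counter starts at zero */
        if (efd < 0) { perror("eventfd"); return 1; }

        uint64_t n = 3;
        write(efd, &n, sizeof n);           /* producer: add 3 to the counter */

        struct pollfd pfd = { .fd = efd, .events = POLLIN };
        poll(&pfd, 1, -1);                  /* syscall #1: "efd is readable" */

        uint64_t count;
        read(efd, &count, sizeof count);    /* syscall #2: fetch (and reset) the counter */
        printf("counter was %llu\n", (unsigned long long)count);

        close(efd);
        return 0;
    }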
The end result of this is that Ulrich opposes the merging of eventfd; he would rather see the effort go into making kevent (or a replacement with similar functionality) ready for the mainline. A kevent-like interface, he says, will eventually become necessary in any case:
How this issue will be resolved is entirely unclear. There's not been a flood of developers lining up to support Ulrich's position - but they are not opposing him either. Nobody has dusted off the kevent patches for another round of discussion - yet. But one thing that does seem likely is that this whole discussion may delay the merging of eventfd past the 2.6.22 merge window. User-space interfaces are important and, once they are added to the kernel, they are almost impossible to remove. Waiting another development cycle seems like a small price to pay if it helps the developers to get this decision right.
Update: the eventfd code was merged into the mainline on May 11.
Index entries for this article: Kernel/Events reporting; Kernel/Kevent
Posted May 10, 2007 10:00 UTC (Thu)
by pphaneuf (guest, #23480)
The first complaint is not that significative, IMHO. First off, Linux is quite efficient at syscalls compared to many other Unixes, and where on some of those other Unixes, syscalls are to be avoided like the plague, on Linux you get to worry about this only in the most extreme cases. But also, select/poll/epoll (and any other mechanism which retrieves a number of events at a time) have this property of having a lower and lower syscall overhead as the load increases: the more events are returned, the more time is spent between calls to select/poll/epoll (in order to process those events), and they are thus called less and less often, with big chunks of events returned each time.
And again, for the Nth time, the ring buffer does have a syscall every so many events (not unlike a select/poll/epoll syscall every so many events received)! There is a difference when the load is low, since the application can still call kevent_commit() once per N events received instead of just getting fewer events per call as with select/poll/epoll, but arguably this lower overhead is only useful at high load, and that is precisely where it disappears.
The ring buffer scheme has a bad smell to me, in that it reminds me of notification via realtime signals, which could overflow the signal queue and required the application to support this "overload" and have another path of code to handle it. What happens when an event arrives and the ring buffer is full (the application is slow to process events, which is likely in a high-load situation)? Do we need to have another path of code in the application? Ironically, this would occur at high loads, which is precisely what we're pushing that ring buffer for! Something like epoll, IMHO, having no ceiling (the readiness information is kept in the fd structures in the kernel, if I remember correctly, so you can never run out of space), has the advantage of simplicity, which is not to be sneered at. A single code path means less to debug, smaller code size and less branching.
In short, I suspect that a large enough "maxevents" parameter to epoll_wait() might yield identical performance results to using the kevent ring buffer, possibly with simpler code too.
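A minimal sketch of that pattern, assuming Linux epoll (event_loop(), handle_event(), and MAXEVENTS are illustrative names): a single epoll_wait() call returns a whole batch of ready descriptors, so the syscall cost per event shrinks as the load rises.

    /* Sketch: batch retrieval with epoll. One epoll_wait() syscall can
     * return many ready descriptors, amortizing its cost over the batch. */
    #include <sys/epoll.h>

    #define MAXEVENTS 1024

    void handle_event(int fd, unsigned int revents);   /* hypothetical handler */

    void event_loop(int epfd)
    {
        struct epoll_event events[MAXEVENTS];

        for (;;) {
            /* Blocks until something is ready, then returns up to
             * MAXEVENTS descriptors in a single system call. */
            int n = epoll_wait(epfd, events, MAXEVENTS, -1);
            for (int i = 0; i < n; i++)
                handle_event(events[i].data.fd, events[i].events);
        }
    }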
The second complaint is kind of fair, although that's never been much of a problem for my applications. Knowing whether a descriptor is readable or writable, plus a pointer to my own data so I can quickly find the corresponding context and process the event, is quite enough in my case. Maybe there's room for improvement; I would need to be pointed at examples. Also, note that most applications will want an abstraction over this platform-specific code, so that they can substitute a more portable version on non-Linux platforms; an interface that's too radically different or too difficult to emulate might just go unused. But that's a bit of a judgement call. Let's hear more on that point, I'd say.
For the complaint about thread cancellation, I'm with Linus. There is actually one case where this could be a problem, when using edge-triggered events with epoll, but even that could probably be made to behave correctly (by checking for cancellation before pulling the events from the file descriptors, say). select and poll are perfectly safe from this, since they do not change the state of the file descriptors at all, so another thread calling them again would get the same events again.
The lack of support for futexes can seem annoying, but in reality it isn't much of a problem, and integrating them would actually be a lot of trouble for application developers (back to having to emulate things on non-Linux platforms, again). The biggest thing it would be handy for is semaphores. A semaphore is basically a counter (protected by a mutex, or updated with processor-specific atomic instructions) plus a condition variable, and it would certainly be doable to make a semaphore that uses a pipe instead of a condition variable. In the simplest case, the pipe can even be used directly as the semaphore, and a more complex implementation could reduce the number of syscalls (although I suspect pthread_cond_broadcast(), which I think is called in sem_post(), also issues a syscall every time it is called).
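A minimal sketch of the "pipe used directly as a semaphore" idea just described; the pipe_sem_* names are illustrative and error handling is omitted. The read end can also be fed to select/poll/epoll, which is exactly what a futex cannot offer.

    /* Sketch: a counting semaphore built on a pipe. Each post writes one
     * byte, each wait consumes one byte (blocking if none is available). */
    #include <unistd.h>

    struct pipe_sem { int rd; int wr; };

    int pipe_sem_init(struct pipe_sem *s)
    {
        int fds[2];
        if (pipe(fds) < 0)
            return -1;
        s->rd = fds[0];
        s->wr = fds[1];
        return 0;
    }

    void pipe_sem_post(struct pipe_sem *s)
    {
        char c = 0;
        write(s->wr, &c, 1);    /* one byte per "count" */
    }

    void pipe_sem_wait(struct pipe_sem *s)
    {
        char c;
        read(s->rd, &c, 1);     /* blocks until a byte is available */
    }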
Otherwise, most other uses of futexes in the kind of server application that would use event multiplexing would be similar to that of spinlocks in the kernel, not blocking for long period of time, so being able to process other events while waiting for them just wouldn't be so useful.
Posted May 10, 2007 21:50 UTC (Thu)
by intgr (subscriber, #39733)
> select/poll/epoll [...] have this property of having a lower and lower syscall overhead as the load increases

This is not true if "load" means "a large number of sockets", especially when the majority of sockets are inactive at any given time. The difference between the APIs is that select and poll have to enumerate all known file descriptors on each cycle, while epoll and kevent are specifically told which file descriptors are hot. TCP congestion control will see to it that no more events are signalled than the server can handle. Formally, select/poll scale linearly with the number of sockets, while epoll/kevent scale linearly with the number of events.
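A sketch of the contrast being drawn, using the standard select() and epoll interfaces (wait_with_select(), register_with_epoll(), and conn_fds are illustrative names): select() is handed the entire descriptor set on every call, while epoll is told about each descriptor once and then only reports the ready ones.

    #include <sys/select.h>
    #include <sys/epoll.h>

    /* select(): the watched set is rebuilt and scanned on every call,
     * so the cost is proportional to the number of descriptors watched. */
    void wait_with_select(const int *conn_fds, int nconns)
    {
        fd_set rfds;
        int maxfd = -1;

        FD_ZERO(&rfds);
        for (int i = 0; i < nconns; i++) {          /* re-enumerate everything */
            FD_SET(conn_fds[i], &rfds);
            if (conn_fds[i] > maxfd)
                maxfd = conn_fds[i];
        }
        select(maxfd + 1, &rfds, NULL, NULL, NULL);
        /* ...and then every descriptor is checked again with FD_ISSET()... */
    }

    /* epoll: interest is registered once; each epoll_wait() afterwards
     * returns only the descriptors that are actually ready. */
    void register_with_epoll(int epfd, int fd)
    {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);    /* done once per descriptor */
    }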
> And again, for the Nth time, the ring buffer does have a syscall every so many events

This is actually the advantage of kevent over epoll: with kevent, the kernel always knows where the event ring is located in user space, so it can just dump events directly into user space when they arrive and forget about them. Since the events are written directly into the process's ring buffer, the process can tell when new events have arrived without a syscall. Thus: no copies, no syscalls.
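To make the mechanism being described concrete, here is a purely hypothetical sketch of a user-space event ring; this is not the actual kevent ABI, and the structures and the commit step are invented for illustration only.

    #include <stdint.h>

    struct ring_event { uint32_t type; uint32_t flags; uint64_t cookie; };

    struct event_ring {
        volatile uint32_t head;        /* advanced by the kernel */
        uint32_t          tail;        /* advanced by the application */
        uint32_t          size;        /* number of slots */
        struct ring_event events[];    /* the ring itself, in shared memory */
    };

    /* Drain whatever is currently queued without entering the kernel. */
    int consume_events(struct event_ring *r, void (*handle)(struct ring_event *))
    {
        int n = 0;
        while (r->tail != r->head) {
            handle(&r->events[r->tail % r->size]);
            r->tail++;
            n++;
        }
        /* Every so often the consumed count still has to be reported back
         * (the kevent patches used a commit-style syscall for this), so the
         * kernel knows which slots it may safely overwrite. */
        return n;
    }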
> The ring buffer scheme has a bad smell to me, in that it reminds me of notification via realtime signals

The problem with signals is that the signal buffers are allocated for every process and live in kernel space, so their size has to be conservative. kevent buffers, however, can afford to be huge; in the case of file descriptor events, the upper bound is set by the maximum number of file descriptors allowed for the process, although the event structure is regrettably big (36 bytes, if I counted correctly).
While imperfect, Ulrich Drepper writes in his blog:

> I would imagine that on 64bit platforms we can use large areas. Several MBs if necessary. This would cover worst case scenarios. The key points are that a) the memory needs not be pinned down (interrupt handlers can try to write and just punt to the helper code in case that fails because the memory is swapped out) and b) we can enforce a policy where the page freed by advancing the tail pointer are simply discarded (madvise(MADV_REMOVE)).

While I would very much prefer a more elegant solution to this problem, I think the kevent API has merit over epoll.
Posted May 11, 2007 9:45 UTC (Fri)
by pphaneuf (guest, #23480)
Of course, the good old select/poll being O(N) on the number of file descriptors watched still applies, indeed. I used "load" in this context to mean "work to do", but I indeed use epoll for all servers on Linux (I use kqueue on *BSD, as well). I often end up having to have a select/poll version as well, for portability to those platforms not so well endowed.
I also know about the ring buffer having fewer copies, but I maintain my point: the kernel needs to know how many events have been consumed by the application so that it does not overwrite unread events, and that is done with a system call. Whether you make a system call to get the events as they arrive or a system call to tell the kernel that you have processed them, at the end of the day it's a system call either way.
Also, in order to do edge-triggered event notification (which I find can be useful for spreading the load over multiple threads), the kernel can't just "forget about it"; it keeps some information on the side in the file descriptor structure. The ring buffer does save a copy, but as event sizes go, struct epoll_event isn't so bad (12 bytes), particularly compared with the work that has to be done to process the events themselves.
I know that the ring buffer can be much bigger than the signal queues were, but the point is that it still has a fixed size, and the application thus has to handle the overflow case properly. epoll keeps the information in the file descriptor structures (where it has to be kept anyway, in addition to the event, as I described earlier), so there is no overflow case: if you could open the file descriptor in the first place, it's all good.
Note that among the other things punted over to the application to manage, there's also the issue of closed file descriptors. If a file descriptor has an event, but is closed before the event is processed, and another connection is accepted (very likely to get the same file descriptor number), what happens?
Not to mention that with the kevent ring buffer, it's tricky to spread the load between multiple threads (as described in Ulrich's post that you linked to), whereas epoll manages multiple threads going into epoll_wait() on the same epoll file descriptor nicely...
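A minimal sketch of the multi-thread arrangement alluded to above, assuming Linux epoll with EPOLLONESHOT (add_conn(), worker(), and handle_ready() are illustrative names): several threads block in epoll_wait() on the same epoll descriptor, and the one-shot flag keeps two of them from picking up the same connection at once.

    #include <stddef.h>
    #include <sys/epoll.h>

    void handle_ready(int fd);    /* hypothetical: e.g. read() until EAGAIN */

    void add_conn(int epfd, int fd)
    {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    void *worker(void *arg)       /* run in several threads */
    {
        int epfd = *(int *)arg;
        struct epoll_event ev;

        for (;;) {
            if (epoll_wait(epfd, &ev, 1, -1) != 1)
                continue;
            handle_ready(ev.data.fd);    /* only this thread sees the fd */

            /* Re-arm the descriptor so it can be reported again. */
            struct epoll_event re = { .events = EPOLLIN | EPOLLONESHOT,
                                      .data.fd = ev.data.fd };
            epoll_ctl(epfd, EPOLL_CTL_MOD, ev.data.fd, &re);
        }
        return NULL;
    }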
Posted May 12, 2007 20:58 UTC (Sat)
by intgr (subscriber, #39733)
I concur with all of your points.

> Also, in order to do edge-triggered event notification [...] the kernel can't just "forget about it"

It can forget about the events; naturally, events have side effects, and the kernel will have to keep track of the state of its objects. (Or am I missing something?)
> If a file descriptor has an event, but is closed before the event is processed, and another connection is accepted

Both APIs have an "opaque pointer" field in their event structures. Applications are supposed to use this for identifying clients, not file descriptor numbers.
Posted May 12, 2007 21:34 UTC (Sat)
by pphaneuf (guest, #23480)
If a file descriptor becomes readable, then stops being readable, then becomes readable again, all without the event queue being looked at, should you get two events? With epoll, you get only one (you are only told once that the file descriptor is readable).
Of course, it could go "on the cheap" and let userspace figure it out. But since it's so handy to just have this one bit in the file descriptor structure (which is really the "how many bytes are in the appropriate buffer" count, which you have to have anyway, interpreted as a bool), why not?
Applications don't really get told that they are supposed to use that. The file descriptor number really is the proper identifier, as far as the kernel is concerned. Note that all the other APIs can support that without a problem (none of select/poll/epoll ever gives you bad information like that). Having a pointer is just there to be helpful (and it is, quite!).
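For reference, a minimal sketch of the "opaque pointer" pattern being debated here (struct conn, watch_conn(), and on_event() are illustrative names; error handling is omitted): the application stores a pointer to its own per-connection state in the event, and its own bookkeeping, not the raw descriptor number, decides whether a stale event should be ignored.

    #include <stdlib.h>
    #include <sys/epoll.h>

    struct conn {
        int   fd;
        int   closing;      /* lets the handler ignore events for a dying connection */
        void *app_state;
    };

    struct conn *watch_conn(int epfd, int fd)
    {
        struct conn *c = calloc(1, sizeof *c);
        c->fd = fd;

        struct epoll_event ev;
        ev.events = EPOLLIN;
        ev.data.ptr = c;                      /* context pointer, not the fd number */
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
        return c;
    }

    void on_event(struct epoll_event *ev)
    {
        struct conn *c = ev->data.ptr;        /* context comes straight back */
        if (c->closing)
            return;                           /* the fd may have been reused; skip it */
        /* ... read from c->fd, update c->app_state ... */
    }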
Posted May 17, 2007 7:40 UTC (Thu)
by slamb (guest, #1070)
> The first complaint is not that significative, IMHO.

You're too kind. The first complaint is a total load of shit, and we're all stupider for having entertained the idea. Under load, the syscall overhead of one epoll_wait() is insignificant compared to the syscall overhead of the many, many reads and writes associated with it, not to mention the actual costs of copying or checksumming buffers if you're not just doing zerocopy. I am unable to imagine how anyone could think otherwise, though I've seen this argument (and the resultant code) come up several times in this discussion.
The third complaint is also wrong, but not obviously/offensively so. It's solvable through something like my own sigsafe library (see the table in the main page of the API documentation). They might have to make changes to the syscall page mechanism for this approach to work as well as old-fashioned int 0x80, but that's doable (preserving compatibility and all).
On the other hand, Ulrich's second and fourth complaints have some merit, IMHO. The second in particular has long made me prefer the BSD-style kevent to epoll.
Posted May 17, 2007 13:52 UTC (Thu)
by pphaneuf (guest, #23480)
Good old select/poll did have ridiculous overheads at low loads, with large numbers of clients, but yeah, epoll is more than good enough.
As you mention on your sigsafe main API documentation, if you're using an event loop and non-blocking I/O already, you can do the pipe trick very easily, so in the context of this event delivery mechanism that's kind of the obvious answer, rather than starting to wrap all the "slow" system calls. So while there are some things that could be done to make it even better, I'm not worrying.
You find that there's that much usefulness to the kqueue extra information? For most of those (at least, the things you can also watch with epoll; kqueue does have some interesting "extra" stuff like watching processes), you get the same information with an extra system call (the read() or accept() that gives you EAGAIN, for example).
I kind of use it in the more basic way, behind an abstraction covering it, epoll, and select, most of the time. The thing is, if I had that extra information in my abstraction, then I'd have to have some special "I don't know" value for select/epoll, test for that, slightly different behaviour between epoll and kqueue, etc... So I just don't really find it worth the trouble, at the moment. But I'm open to the idea that I might be missing something...
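For comparison, a minimal sketch of the "extra information" in question, assuming a BSD-style kqueue(2); watch_and_wait() is an illustrative name and error handling is abbreviated. The returned event's data field already carries the number of readable bytes, which epoll does not report with the event.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    void watch_and_wait(int kq, int fd)
    {
        struct kevent change, result;

        EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &change, 1, NULL, 0, NULL);       /* register interest */

        if (kevent(kq, NULL, 0, &result, 1, NULL) == 1) {
            /* result.data: bytes already buffered for fd; with epoll the
             * application typically learns this from the read() itself. */
            long bytes_ready = (long)result.data;
            (void)bytes_ready;
        }
    }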
Posted May 17, 2007 17:09 UTC (Thu)
by slamb (guest, #1070)
> Good old select/poll did have ridiculous overheads at low loads, with large numbers of clients, but yeah, epoll is more than good enough.

Yeah, I'm just comparing kevent to Linux's best existing mechanism - epoll_wait(). As far as I'm concerned, select()/poll() is a straw man. O(n) with number of watched descriptors is ridiculous.
> As you mention on your sigsafe main API documentation, if you're using an event loop and non-blocking I/O already, you can do the pipe trick very easily, so in the context of this event delivery mechanism, that's kind of the obvious answer, rather than starting to wrap all the "slow" system call.

Right, but Ulrich wants to implement thread cancellation. Even that is possible in a way that doesn't lose edge events. Not that I think it's worthwhile to do, as thread cancellation is hopelessly messed up for other reasons. But ncm tells me that the C++ people are looking at doing things right with "thread interruption", and an approach like sigsafe might be useful there.
> You find that there's that much usefulness to the kqueue extra information?

I confess that I haven't actually taken advantage of any of it, but I think there's potential, especially as more event types are added. And here he is actually talking about returning information right away vs. making another system call per event, so this syscall overhead reduction argument makes more sense than removing the actual polling call.
Posted May 17, 2007 20:30 UTC (Thu)
by slamb (guest, #1070)
One thing is that not everyone cares about writing portable code. I used to always write
everything in this way (don't use fancy features or don't depend on them), but a few things
started to change my mind. One of them was reading this story. Another was starting to work
on a BSD-based proprietary system. We
have the make world style - a single source tree, no collection of RPMs with tarballs
and patches. We can add libraries, daemons, and kernel extensions that the rest of the system
depends on. We don't use autoconf-based fallback code or lowest-common denominator
abstraction layers. We've given up on the idea of sending most of our
changes upstream, so we do what works for us. It can be liberating to say "screw portability/
compatibility". It's much easier to do on our system where you can make a single changeset that
modifies everything you need and be certain one piece will never run on a real system without
the other.
Now I'm ready to take the same attitude to other code that I write. Portability isn't a hard
requirement; it's something to be kept as long as it doesn't hold me back too much. Code
doesn't run on Python 2.2? Who cares?!? I run mostly CentOS 5, which has Python 2.4, so I'll take
advantage of the newer language features. Code
doesn't run without epoll_wait() or kevent()? Who cares?!? I use systems with modern kernel
interfaces. Code doesn't run without a Linux-only kernel interface? Maybe I'll add it or something
equivalent to BSD if I want to run it there.
Posted May 31, 2007 14:21 UTC (Thu)
by pphaneuf (guest, #23480)
Well, turns out Ingo has found out as much.
Posted May 17, 2007 14:42 UTC (Thu)
by renox (guest, #23785)
It's really weird, but each time I read a discussion where Ulrich Drepper is involved, I always disagree with him. His strange point about poll vs. read cancellation is just the latest example.
Note that I do respect the man for his many works on free software (he has done much more for free software than I have), but I just find it funny that I'm always disagreeing with him. Weird.
Posted May 17, 2007 22:34 UTC (Thu)
by slamb (guest, #1070)
I feel some sympathy for Ulrich there. You have to realize that thread cancellation is a rather
poorly-described idea that the standards people said in a hand-wavy way was mandatory long
before anyone had a working implementation. People like Ulrich are stuck working out the details of
an actual working system, and traffic on the mailing lists shows that he's put a lot of effort
into it. I don't think his approach is right - he basically just enables async cancels around every cancellation point, whereas a sigsafe-like way would make it possible to honor the cancellation if and only if the system call has not yet returned - but it's not for lack of trying.