
The return of kevent?


Posted May 10, 2007 21:50 UTC (Thu) by intgr (subscriber, #39733)
In reply to: The return of kevent? by pphaneuf
Parent article: The return of kevent?

select/poll/epoll [...] have this property of having a lower and lower syscall overhead as the load increases

This is not true if "load" means "a large number of sockets", especially when the majority of sockets are inactive at any given time. The difference between the APIs is that select and poll have to enumerate all known file descriptors on each cycle, while epoll and kevent are told specifically which file descriptors are hot. TCP congestion control will see to it that no more events are signalled than the server can handle. Formally, select/poll scale linearly with the number of sockets, while epoll/kevent scale linearly with the number of events.
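
To make this concrete, here's a rough, untested epoll loop; the work per wakeup is bounded by the epoll_wait() return value, i.e. the number of ready descriptors, regardless of how many idle sockets are registered. handle_event() is a made-up handler, and the setup (epoll_create()/epoll_ctl()) is assumed to have happened elsewhere:

#include <sys/epoll.h>

#define MAX_EVENTS 64

void handle_event(struct epoll_event *ev);   /* hypothetical handler */

void event_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];

    for (;;) {
        /* only ready descriptors come back; idle sockets add no work here */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++)
            handle_event(&events[i]);
    }
}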

And again, for the Nth time, the ring buffer does have a syscall every so many events

This is actually the advantage of kevent over epoll: with kevent, the kernel always knows where the event ring is located in user space, so it can write events directly into it as they arrive and forget about them. Since the events land directly in the process's ring buffer, the process can tell when new events have arrived without a syscall. Thus: no copies, no syscalls.
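
To illustrate the mechanism (and only the mechanism: I'm not quoting the actual kevent ring layout here, the struct and field names below are made up), the userspace side looks roughly like this; a real implementation would also need memory barriers:

#include <stdint.h>

struct ring_event {               /* hypothetical event record */
    uint32_t id;
    uint32_t type;
    uint64_t user_data;
};

struct event_ring {               /* hypothetical mmap()ed shared area */
    volatile uint32_t head;       /* advanced by the kernel */
    uint32_t tail;                /* advanced by userspace */
    uint32_t size;                /* number of slots, assumed a power of two */
    struct ring_event ev[];
};

/* Drain everything the kernel has already published; no syscall is needed
   just to find out whether new events have arrived. */
static int drain(struct event_ring *r, void (*cb)(struct ring_event *))
{
    int n = 0;

    while (r->tail != r->head) {
        cb(&r->ev[r->tail % r->size]);
        r->tail++;
        n++;
    }
    return n;   /* how far we got is reported back to the kernel later */
}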

The ring buffer scheme has a bad smell to me, in that it reminds me of notification via realtime signals

The problem with signals is that the signal queues are allocated for every process and live in kernel space, so their size has to be conservative. kevent buffers, however, can afford to be huge; in the case of file descriptor events, the upper bound is set by the maximum number of file descriptors allowed for the process, although the event structure is regrettably big (36 bytes if I counted correctly).

The approach is imperfect, but Ulrich Drepper writes in his blog:

I would imagine that on 64bit platforms we can use large areas. Several MBs if necessary. This would cover worst case scenarios. The key points are that a) the memory needs not be pinned down (interrupt handlers can try to write and just punt to the helper code in case that fails because the memory is swapped out) and b) we can enforce a policy where the page freed by advancing the tail pointer are simply discarded (madvise(MADV_REMOVE)).
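
Concretely, the discard policy he describes might look something like this (untested sketch, assuming the event area is backed by something that supports hole punching, e.g. tmpfs/shmem; MADV_DONTNEED would be the conservative fallback):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

/* Give back whole pages that the consumer has already advanced past. */
static void discard_consumed(void *area, size_t consumed_bytes)
{
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t whole = consumed_bytes - (consumed_bytes % page);

    if (whole > 0)
        madvise(area, whole, MADV_REMOVE);
}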

While I would very much prefer a more elegant solution to this problem, I think the kevent API has merit over epoll.



The return of kevent?

Posted May 11, 2007 9:45 UTC (Fri) by pphaneuf (guest, #23480) (2 responses)

Of course, good old select/poll being O(N) in the number of file descriptors watched still applies. I used "load" in this context to mean "work to do", but I do indeed use epoll for all servers on Linux (and kqueue on *BSD). I often end up needing a select/poll version as well, for portability to platforms not so well endowed.

I also know about the ring buffer having fewer copies, but I maintain my point: the kernel needs to know how many events have been consumed by the application in order not to overwrite unread events, and that is done with a system call. Whether you make a system call to fetch the events or a system call to tell the kernel you have processed them, at the end of the day it's a system call either way.

Also, in order to do edge-triggered event notification (which I find can be useful for spreading the load over multiple threads), the kernel can't just "forget about it"; it keeps some information on the side in the file descriptor structure. The ring buffer does save a copy, but as event sizes go, struct epoll_event isn't so bad (12 bytes), particularly compared with the work that will have to be done to process the events themselves.
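
For reference, the edge-triggered pattern I'm talking about looks roughly like this with epoll (untested sketch; the descriptor is assumed to be non-blocking, and error handling is trimmed):

#include <sys/epoll.h>
#include <errno.h>
#include <unistd.h>

static void watch_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET,    /* edge-triggered readability */
        .data.fd = fd,
    };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* After an edge fires, no further notification comes until new data
   arrives, so the socket has to be drained until EAGAIN. */
static void drain_socket(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* process the n bytes in buf here */
            continue;
        }
        if (n < 0 && errno == EAGAIN)
            break;      /* nothing left; wait for the next edge */
        break;          /* EOF or a real error */
    }
}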

I know that the ring buffer can be much bigger than the signal queues were, but the point is that it still has a fixed size, and thus the overflow case has to be handled properly. epoll keeps the information in the file descriptor structures (where it has to be kept anyway, in addition to the event, as I described earlier), so there is no overflow case: if you could open the file descriptor in the first place, it's all good.

Note that among the other things punted over to the application to manage, there's also the issue of closed file descriptors. If a file descriptor has an event pending, but is closed before the event is processed, and another connection is accepted (very likely to get the same file descriptor number), what happens?

Not to mention that with the kevent ring buffer, it's tricky to spread the load between multiple threads (as described in Ulrich's post that you linked to), whereas epoll handles multiple threads waiting in epoll_wait() on the same epoll file descriptor nicely...
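
Roughly what I mean (untested sketch; EPOLLONESHOT re-arming and error handling are left out, and handle_event() is made up):

#include <sys/epoll.h>
#include <pthread.h>
#include <stddef.h>

void handle_event(struct epoll_event *ev);   /* hypothetical handler */

/* Each worker blocks in epoll_wait() on the same epoll fd; the kernel
   hands every event to one of them.  EPOLLONESHOT is the usual way to
   keep a single descriptor from being handled by two threads at once
   (it then has to be re-armed with EPOLL_CTL_MOD). */
static void *worker(void *arg)
{
    int epfd = *(int *)arg;
    struct epoll_event ev;

    for (;;) {
        if (epoll_wait(epfd, &ev, 1, -1) == 1)
            handle_event(&ev);
    }
    return NULL;
}

static void start_workers(int *epfd, int nthreads)
{
    for (int i = 0; i < nthreads; i++) {
        pthread_t t;
        pthread_create(&t, NULL, worker, epfd);
    }
}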

The return of kevent?

Posted May 12, 2007 20:58 UTC (Sat) by intgr (subscriber, #39733) (1 response)

I concur with all of your points.

Also, in order to do edge-triggered event notification [...] the kernel can't just "forget about it"

It can forget about the events; naturally, events have side effects, and the kernel will have to keep track of the state of its objects. (Or am I missing something?)

If a file descriptor has an event, but is closed before the event is processed, and another connection is accepted

Both APIs have an "opaque pointer" field in their event structures. Applications are supposed to use this for identifying clients, not file descriptor numbers.
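
With epoll, for instance, that looks roughly like this (untested sketch; struct connection is made up):

#include <sys/epoll.h>

struct connection {                 /* hypothetical per-client state */
    int fd;
    int closed;                     /* set when we tear the connection down */
    /* ... application state ... */
};

static int watch_connection(int epfd, struct connection *c)
{
    struct epoll_event ev = {
        .events   = EPOLLIN,
        .data.ptr = c,              /* handed back verbatim with each event */
    };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

static void on_event(struct epoll_event *ev)
{
    struct connection *c = ev->data.ptr;

    if (c->closed)
        return;                     /* stale event for a connection we already closed */
    /* ... handle I/O on c->fd ... */
}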

The return of kevent?

Posted May 12, 2007 21:34 UTC (Sat) by pphaneuf (guest, #23480)

If a file descriptor becomes readable, then stops being readable, then becomes readable again, all without the event queue being looked at, should you get two events? With epoll, you get only one (once you've been told that the file descriptor is readable, you aren't told again).

Of course, it could go "on the cheap" and let userspace figure it out. But since it's so handy to just have this one bit in the file descriptor structure (which is really the "how many bytes are in the appropriate buffer" count, which you have to keep anyway, interpreted as a bool), why not?

They don't really get told that they are supposed to use that. The file descriptor number really is the proper identifier, as far as the kernel is concerned. Note that all the other APIs can support that without a problem (none of select/poll/epoll ever give you bad information like that). Having the pointer there is just to be helpful (and it is, quite!).

