The return of kevent?
Posted May 10, 2007 21:50 UTC (Thu) by intgr
In reply to: The return of kevent?
Parent article: The return of kevent?
select/poll/epoll [...] have this property of having a lower and lower syscall overhead as the load increases
This is not true if "load" means "a large number of sockets", especially when
the majority of sockets are inactive at any given time. The difference
between the APIs is that select and poll have to enumerate all known file
descriptors on each cycle, while with epoll and kevent the application is
told exactly which file descriptors are hot. TCP congestion control will
ensure that no more events are signalled than the server can handle. Formally,
select/poll scale linearly with the number of sockets while epoll/kevent
scale linearly with the number of events.
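To make the difference concrete, here is a minimal sketch in Python (Linux only, since it uses select.epoll). The helper names and the choice of 100 idle socket pairs are my own assumptions for illustration. With select() the whole descriptor list is handed over on every call; with epoll the descriptors are registered once and each wait returns only the hot ones.

```python
import select
import socket

# 100 connected pairs; nothing has been written, so every reader is idle.
pairs = [socket.socketpair() for _ in range(100)]
readers = [r for r, _ in pairs]

def ready_with_select(socks):
    # The full list is passed in on every call, so the kernel has to
    # examine every descriptor, hot or not: O(number of sockets).
    ready, _, _ = select.select(socks, [], [], 0)
    return sorted(s.fileno() for s in ready)

def ready_with_epoll(ep):
    # Registration was done once, up front. Each wait only reports
    # descriptors with pending events: O(number of events).
    return sorted(fd for fd, _ in ep.poll(0))

ep = select.epoll()
for r in readers:
    ep.register(r.fileno(), select.EPOLLIN)

# Make exactly one socket hot by writing to its peer.
pairs[42][1].send(b"x")
hot = [readers[42].fileno()]
```

Both calls report the same single descriptor; the difference is how much work is redone per call to find it.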
And again, for the Nth time, the ring buffer does have a syscall every so many events
This is actually the advantage of kevent over epoll: with kevent, the
kernel always knows where the event ring is located in user space, so it
can dump the events directly into user space as they arrive and forget
about them. Since the events are written directly to the process's ring
buffer, the process can tell when new events have arrived without a syscall.
Thus: no copies, no syscalls.
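A toy model of the idea, in Python. This is not the kevent interface: the EventRing class, its fields and its overflow behaviour are all hypothetical, and the "kernel" is just another method call. It only shows why the consumer can discover new events by comparing two indices in shared memory instead of asking the kernel.

```python
class EventRing:
    """Single-producer, single-consumer ring (hypothetical layout)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0  # advanced only by the producer (the "kernel")
        self.tail = 0  # advanced only by the consumer (the process)

    def produce(self, event):
        # Stand-in for the kernel writing an event into the ring.
        if self.head - self.tail == self.capacity:
            return False  # ring full; a real design needs an overflow path
        self.slots[self.head % self.capacity] = event
        self.head += 1
        return True

    def drain(self):
        # Consumer side: a plain memory comparison, no syscall involved.
        events = []
        while self.tail != self.head:
            events.append(self.slots[self.tail % self.capacity])
            self.tail += 1
        return events
```

The process would still need a blocking wait for the case where the ring is empty, which is where the "syscall every so many events" comes from.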
The ring buffer scheme has a bad smell to me, in that it reminds me of notification via realtime signals
The problem with signals is that the signal buffers are allocated for
every process and they exist in kernel space, so their size has to
be conservative. kevent buffers, however, can afford to be huge. In the
case of file descriptor events, the upper bound is set by the maximum number
of file descriptors allowed for the process, although the event structure is
regrettably big (36 bytes if I counted correctly).
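For a rough sense of scale, a back-of-the-envelope sketch in Python. It assumes my 36-byte count is right and that the worst case is one pending event per descriptor; the descriptor limits are just example values.

```python
EVENT_SIZE = 36  # bytes per event; my own count, may be off

def ring_bytes(max_fds, event_size=EVENT_SIZE):
    # Worst case assumed here: one pending event per file descriptor.
    return max_fds * event_size

for limit in (1024, 65536, 1048576):
    print(f"{limit:>8} fds -> {ring_bytes(limit) / 1024:.0f} KiB")
```

That is 36 KiB for the common 1024-descriptor limit and 36 MiB for about a million descriptors.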
The scheme is imperfect, but Ulrich Drepper writes in his blog:
I would imagine that on 64bit platforms we can use large areas. Several MBs if necessary. This would cover worst case scenarios. The key points are that a) the memory needs not be pinned down (interrupt handlers can try to write and just punt to the helper code in case that fails because the memory is swapped out) and b) we can enforce a policy where the page freed by advancing the tail pointer are simply discarded (madvise(MADV_REMOVE)).
While I would very much prefer a more elegant solution to this problem, I
think the kevent API has merit over epoll.