LWN.net Logo

sys_epoll - making poll fast

The classic Unix way to wait for I/O events on multiple file descriptors is with the select() and poll() system calls. When a process invokes one of those calls, the kernel goes through the list of interesting file descriptors, checks to see if non-blocking I/O is available on any of them, and adds the calling process to a wait queue for each file descriptor that would block. This implementation works reasonably well when the number of file descriptors is small. But if a process is managing thousands of file descriptors, the select() and poll() calls must check every single one of them, and add the calling process to thousands of wait queues. For every single call. Needless to say, this approach does not scale very well.

Davide Libenzi and others have been working for some time on a new approach to polling that would work for thousands of files. It was originally implemented as a special device (/dev/epoll), but, on request from Linus, the new scheme was turned into a new set of epoll() system calls. These calls work in a very different way. Every call to select() or poll() is a separate event; the data structures must be set up and torn down every time. epoll, instead, requires the application to build a persistent (across calls) data structure in kernel space first. The application starts by creating a special epoll file descriptor:

    epfd = epoll_create(int maxfds);

The maxfds parameter is the maximum number of file descriptors that the process expects to manage. The return value is a file descriptor to be used with the other epoll calls; it should be shut down with close() when it is no longer needed.

Each file descriptor to be managed must be added to the special epoll descriptor with:

    int epoll_ctl(int epfd, int op, int fd, unsigned int events);

The op parameter specifies the operation to be performed (add, change, or remove the given file descriptor fd), and events is a mask of events of interest to the process.

Once everything has been set up, the process can sit back and wait until there is something for it to do:

    int epoll_wait(int epfd, struct pollfd const **events, int timeout); 

The return value is the number of events (i.e. readable or writeable file descriptors) that epoll_wait() has found.

These system calls have been shown, through heavy benchmarking, to scale in constant time up to unbelievable numbers of file descriptors (some graphs can be found on this page). The persistent data structure built around the epoll file descriptor is one of the reasons for this scalability: there is no need to set it up and tear it down for every epoll_wait() call. The other half of the story is in how epoll_wait() finds the readable or writeable file descriptors. Rather than polling each file descriptor (and adding itself to wait queues), the epoll mechanism adds a callback structure onto the struct file associated with each file descriptor. When a file descriptor becomes readable or writeable, its callback(s) are called, and processes using epoll_wait() can be notified directly. So an epoll_wait() call never needs to make a pass over the list of file descriptors it is watching.

The epoll patch is ready, and Linus has indicated that he wants to merge it. For now, epoll only works for pipes and sockets (its initial use is likely to be network services that manage large numbers of connections). Expanding its scope to other types of I/O should just be a matter of doing the work, however.


(Log in to post comments)

sys_epoll - making poll fast

Posted Oct 31, 2002 2:09 UTC (Thu) by yanfali (subscriber, #2949) [Link]

I've personally used the old version /dev/epoll and it scales beautifully,
especially when combined with NAPIfied network drivers. In an internal
test application I was able to handle 60K sockets opened and over
12K active per second on a 1Ghz pentium III. This API is going to rocket
the performance of linux based TCP/IP servers through the ceiling.

Congrats to Davide for a job well done.

sys_epoll - making poll fast

Posted Oct 31, 2002 4:02 UTC (Thu) by cpeterso (guest, #305) [Link]


This sounds a lot like NT's IO Completion Ports, but without NT's automatic thread pool management. IO Completion Ports are nice because the thread pool balances the work items (IO callbacks) to threads pinned to each CPU. There is usually one one thread per CPU servicing IO callbacks. This avoids extra context switches overhead and avoids trashing the CPU cache with multiple threads. I'm sure thread pooling could be added with userland libraries on top of the sys_epoll system call.

sys_epoll - making poll fast

Posted Oct 31, 2002 8:08 UTC (Thu) by johnchx (guest, #4262) [Link]

I'd like to hear more about how this approach compares with asynchronous reads and writes from/to sockets. Doesn't the ability to do async i/o reduce the need for a scalable select()-style mechanism?

sys_epoll - making poll fast

Posted Oct 31, 2002 7:22 UTC (Thu) by dododge (subscriber, #2870) [Link]

LKML has had a bunch of messages this week about the API documentation, and it's clear there are some subtle issues with how you go about using sys_epoll correctly. If you try to use it as a drop-in replacement for existing select and poll code, you can end up with races. The setup and waiting has to be done with an understanding of what actually causes epoll_wait to report an event. A good/quick explanation of how to use it:

http://www.uwsg.indiana.edu/hypermail/linux/kernel/0210.3/1520.html

I want some of them callbacks!

Posted Oct 31, 2002 11:01 UTC (Thu) by zooko (subscriber, #2589) [Link]

I don't understand why the callbacks aren't propagated to userland. All of my code which uses select() or poll() style interfaces immediately invokes callbacks based on which fds have changed, so if my code ran on top of epoll(), then the sequence of events would look like this:

  • the socket activity would cause callbacks
  • which would be intercepted by the epoll() system
  • which would update the data structure and then return from epoll_wait(),
  • where my userland code would crawl the data structure and invoke callbacks for each changed thing,
  • where higher-level application code would then receive the callbacks.

Doesn't this seem unnecessarily complicated, not to mention less efficient? Is it too tricky to have the kernel change to user context and invoke a function pointer that I have given it?

Regards,

Zooko

I want some of them callbacks!

Posted Oct 31, 2002 13:24 UTC (Thu) by IkeTo (subscriber, #2122) [Link]

Hmm... that seems to be as difficult to use as signals. The function that you want to use is never in the list of async signal-safe functions, and you application will always have race condition using signals unless your application uses the call in a very limited way: your main application must do basically nothing, and your application must block basically every signal (and wait for signals using sigwait() or sigsuspend()). That means... you'll have to create system calls similar to those to deal with signals, and then to setup a huge size set for suspending file descriptor callbacks. Not to mention that pushing some code to the user is not something that a kernel developers would like to do unless absolutely needed. Doesn't seem to be exactly exciting for this type of optimization which is constant-time anyway.

I want some of them callbacks!

Posted Oct 31, 2002 20:56 UTC (Thu) by zooko (subscriber, #2589) [Link]

I think there is an interesting disconnect between the low-level hackers and the higher level network/application level hackers here. A lot of the interesting projects that I know of (not to mention the ones that I lead myself: Mnet), use select()/poll() interfaces exactly as I've described above -- as nothing more than a way to get at the callbacks which they then immediately de-multiplex back into callbacks. The projects I'm thinking of here are things like Twisted, and E (although the current E implementation doesn't use kernel calls, as it is in Java...).

Regards,

Zooko

I want some of them callbacks!

Posted Nov 2, 2002 21:15 UTC (Sat) by bronson (subscriber, #4806) [Link]

Maybe IkeTo gave too comprehensive an answer...

Is it too tricky to have the kernel change to user context and invoke a function pointer that I have given it?

Yes.

Whenever the kernel calls user code directly, non-obvious and non-trivial race conditions result. It's a very difficult problem. Much easier to just have the system call return, and then the user can deal with the callbacks however she/he wants.

Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds