sys_epoll - making poll fast
[Posted October 30, 2002 by corbet]
The classic Unix way to wait for I/O events on multiple file descriptors is
with the
select() and
poll() system calls. When a
process invokes one of those calls, the kernel goes through the list of
interesting file descriptors, checks to see if non-blocking I/O is
available on any of them, and adds the calling process to a wait queue for
each file descriptor that would block. This implementation works
reasonably well when the number of file descriptors is small. But if a
process is managing thousands of file descriptors, the
select()
and
poll() calls must check every single one of them, and add the
calling process to thousands of wait queues. For every single call.
Needless to say, this approach does not scale very well.
Davide Libenzi and others have been working for some time on a new approach
to polling that would work for thousands of files. It was originally
implemented as a special device (/dev/epoll), but, on request from
Linus, the new scheme was turned into a new set of epoll() system
calls.
These calls work in a very different way. Every call to select()
or poll() is a separate event; the data structures must be set up
and torn down every time. epoll, instead, requires the
application to build a persistent (across calls) data structure in kernel
space first. The application starts by creating a special epoll
file descriptor:
epfd = epoll_create(int maxfds);
The maxfds parameter is the maximum number of file descriptors
that the process expects to manage. The return value is a file descriptor
to be used with the other epoll calls; it should be shut down with
close() when it is no longer needed.
Each file descriptor to be managed must be added to the special
epoll descriptor with:
int epoll_ctl(int epfd, int op, int fd, unsigned int events);
The op parameter specifies the operation to be performed (add,
change, or remove the given file descriptor fd), and
events is a mask of events of interest to the process.
Once everything has been set up, the process can sit back and wait until
there is something for it to do:
int epoll_wait(int epfd, struct pollfd const **events, int timeout);
The return value is the number of events (i.e. readable or writeable file
descriptors) that epoll_wait() has found.
These system calls have been shown, through heavy benchmarking, to scale in
constant time up to unbelievable numbers of file descriptors (some graphs
can be found on this
page). The persistent data structure built around the epoll
file descriptor is one of the reasons for this scalability: there is no
need to set it up and tear it down for every epoll_wait() call.
The other half of the story is in how epoll_wait() finds the
readable or writeable file descriptors. Rather than polling each file
descriptor (and adding itself to wait queues), the epoll mechanism
adds a callback structure onto the struct file associated
with each file descriptor. When a file descriptor becomes readable or
writeable, its callback(s) are called, and processes using
epoll_wait() can be notified directly. So an
epoll_wait() call never needs to make a pass over the list of file
descriptors it is watching.
The epoll patch is ready, and Linus has indicated that he wants to
merge it. For now, epoll only works for pipes and sockets (its
initial use is likely to be network services that manage large numbers of
connections). Expanding its scope to other types of I/O should just be a
matter of doing the work, however.
(
Log in to post comments)