
epoll final ( hopefully ) interface ...

From:  Davide Libenzi <davidel@xmailserver.org>
To:  Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject:  [rfc] epoll final ( hopefully ) interface ...
Date:  Tue, 19 Nov 2002 21:46:18 -0800 (PST)
Cc:  Ulrich Drepper <drepper@redhat.com>


This is how my current interface file for epoll looks. Jamie's suggestion
to remove the "fd" member from the "struct epoll_event" structure was a
good one and actually saved 25% of space, which matters at high event
transfer rates.


#ifndef	_SYS_EPOLL_H
#define	_SYS_EPOLL_H	1

#include <features.h>
#include <sys/types.h>
#include <sys/poll.h>


#define EPOLLIN			POLLIN
#define EPOLLPRI		POLLPRI
#define EPOLLOUT		POLLOUT

#ifdef __USE_XOPEN
# define EPOLLRDNORM		POLLRDNORM
# define EPOLLRDBAND		POLLRDBAND
# define EPOLLWRNORM		POLLWRNORM
# define EPOLLWRBAND		POLLWRBAND
#endif

#ifdef __USE_GNU
# define EPOLLMSG		POLLMSG
#endif

#define EPOLLERR		POLLERR
#define EPOLLHUP		POLLHUP


/* Valid opcodes to issue to epoll_ctl() */
#define EPOLL_CTL_ADD 1
#define EPOLL_CTL_DEL 2
#define EPOLL_CTL_MOD 3

struct epoll_event {
	unsigned short events;	/* Requested events */
	unsigned short revents;	/* Returned events */
	__uint64_t obj;		/* Opaque data object */
};


__BEGIN_DECLS

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event) __THROW;
int epoll_wait(int epfd, struct epoll_event *events, int maxevents,
               int timeout) __THROW;

__END_DECLS


#endif /* #ifndef _SYS_EPOLL_H */
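
The 25% figure can be checked with a little arithmetic. The sketch below assumes the earlier layout carried a 4-byte fd ahead of the other members (the old structure is not shown in this message, so that is a guess), and it packs both structures so that ABI padding does not blur the numbers. The names old_event, new_event and saved_percent are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical earlier layout: an assumption, not taken from the post. */
struct old_event {
	int32_t fd;
	unsigned short events;
	unsigned short revents;
	uint64_t obj;
} __attribute__((packed));

/* Layout from the header above, packed for the comparison. */
struct new_event {
	unsigned short events;
	unsigned short revents;
	uint64_t obj;
} __attribute__((packed));

/* Percentage of space saved when going from old_size to new_size. */
unsigned saved_percent(size_t old_size, size_t new_size)
{
	return (unsigned)((old_size - new_size) * 100 / old_size);
}
```

With natural alignment a 64-bit ABI may pad the smaller layout again, so the real saving depends on how the structure is laid out on the target.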



I also include the man pages that I edited to reflect the new interface.
Almost certainly they contain errors and references to the old interface
API. I'd very much appreciate it if someone who sucks less than me at both
English and man page creation would try to get them into better shape. Please,
in the rare event that more than one person would like to take the job, try to
keep in sync to avoid duplicating work. The major work IMHO should be
done on epoll.2




- Davide

.\"
.\"  epoll by Davide Libenzi ( efficient event notification retrieval )
.\"  Copyright (C) 2002  Davide Libenzi
.\"
.\"  This program is free software; you can redistribute it and/or modify
.\"  it under the terms of the GNU General Public License as published by
.\"  the Free Software Foundation; either version 2 of the License, or
.\"  (at your option) any later version.
.\"
.\"  This program is distributed in the hope that it will be useful,
.\"  but WITHOUT ANY WARRANTY; without even the implied warranty of
.\"  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\"  GNU General Public License for more details.
.\"
.\"  You should have received a copy of the GNU General Public License
.\"  along with this program; if not, write to the Free Software
.\"  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
.\"
.\"  Davide Libenzi <davidel@xmailserver.org>
.\"
.\"
.TH EPOLL 2 "23 October 2002" Linux "Linux Programmer's Manual"
.SH NAME
epoll \- edge triggered I/O readiness notification facility
.SH SYNOPSIS
.B #include <sys/epoll.h>
.SH DESCRIPTION
Edge based variant of 
.BR poll (2)
which scales well to large numbers of watched fds. To set up and control an
epoll set, three calls are provided: 
.BR epoll_create (2) ,
.BR epoll_ctl (2) ,
.BR epoll_wait (2) .

An epoll set is connected to a file descriptor, which is created by
.BR epoll_create (2) .
Interest for certain file descriptors is then registered using 
.BR epoll_ctl (2) .
Finally, the actual wait is then started by 
.BR epoll_wait (2) .

A good analogy is to consider epoll as a waitqueue, which is also
like a futex. To use a waitqueue properly you have to do these
things in the order shown:

.RS
.TP
.B 1.
Set the task state to stopped.
.TP
.B 2.
Register yourself on the waitqueue.
.TP
.B 3.
Check the condition.
.TP
.B 4.
If the condition is not met, schedule.
.RE

.PP
epoll is similar. To wait for a condition on a file descriptor, 
such as readability, you must do these things in the order shown:
.PP

.RS
.TP
.B 1.
Register interest by using
.BR epoll_ctl (2).
.TP
.B 2.
Check the condition by calling
.BR read (2).
.TP
.B 3.
If the condition is not met (i.e.
.BR read (2)
returned EAGAIN), call
.BR epoll_wait (2)
(i.e. the equivalent of schedule).
.RE

.PP
epoll allows you to optimize by registering interest just once,
i.e. steps 2 and 3 may be repeated without repeating step 1.
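
The three steps above might look like this in C. This is only a sketch, written against the epoll interface as later shipped in glibc, which differs from this draft: the event structure carries a data union instead of obj, there is no revents member, and edge triggered behaviour has to be requested with EPOLLET. The names drain() and watch_fd() are invented for the illustration, and watch_fd() switches the descriptor to non-blocking mode itself.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Step 2: read until EAGAIN. Returns bytes consumed, or -1 on a real error. */
long drain(int fd)
{
	char buf[256];
	long total = 0;

	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));

		if (n > 0)
			total += n;
		else if (n == 0)
			return total;                     /* end of file */
		else if (errno == EAGAIN || errno == EWOULDBLOCK)
			return total;                     /* condition not met */
		else if (errno != EINTR)
			return -1;
	}
}

/*
 * Register fd once (step 1), then repeat steps 2 and 3 for a fixed
 * number of rounds. Returns the bytes read in total, or -1 on error.
 */
long watch_fd(int fd, int rounds, int timeout_ms)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
	long total = 0;
	int flags = fcntl(fd, F_GETFL, 0);
	int epfd;

	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -1;
	epfd = epoll_create(1);
	if (epfd < 0)
		return -1;
	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {      /* step 1 */
		close(epfd);
		return -1;
	}
	while (rounds-- > 0) {
		long n = drain(fd);                                  /* step 2 */

		if (n < 0) {
			total = -1;
			break;
		}
		total += n;
		if (epoll_wait(epfd, &ev, 1, timeout_ms) < 0 && errno != EINTR) {
			total = -1;                                      /* step 3 failed */
			break;
		}
	}
	close(epfd);
	return total;
}
```

Note that epoll_ctl() is called exactly once, however many rounds of read and wait follow.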

If you are concerned about starvation (i.e. one file descriptor
always has new data, so the others don't get serviced), don't be.
With epoll you do not have to keep reading one file descriptor
until you see EAGAIN. All that matters is that, until you see EAGAIN,
your user space data structure has a flag that says the file descriptor
is still readable, so another epoll event is not expected or required for
that file descriptor.
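
One way to keep such a flag, sketched in C. The struct conn type, service_pass() and the chunk size are all hypothetical; the only point is that each readable descriptor gets one bounded read per pass and keeps its flag until read() reports EAGAIN (or end of file).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical per-descriptor state kept by the application. */
struct conn {
	int fd;
	bool readable;      /* set on an epoll event, cleared on EAGAIN or EOF */
	size_t consumed;    /* bytes read so far, for illustration only */
};

/*
 * Give every readable connection one bounded read per pass, so that a
 * busy descriptor cannot monopolise the loop. Returns how many
 * connections are still flagged readable afterwards.
 */
int service_pass(struct conn *conns, size_t count, size_t chunk)
{
	char buf[512];
	int still_ready = 0;

	if (chunk > sizeof(buf))
		chunk = sizeof(buf);
	for (size_t i = 0; i < count; i++) {
		struct conn *c = &conns[i];
		ssize_t n;

		if (!c->readable)
			continue;
		n = read(c->fd, buf, chunk);
		if (n > 0)
			c->consumed += (size_t)n;
		else if (n == 0 || errno != EINTR)
			c->readable = false;    /* EOF, EAGAIN or a real error */
		if (c->readable)
			still_ready++;
	}
	return still_ready;
}
```

An application would call epoll_wait() again only once service_pass() returns zero, and would set the readable flag for each descriptor an event is reported on.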
.PP

.SH NOTES
epoll should not be regarded as a faster version of 
.BR poll (2)
because its semantics are fundamentally different. Whereas 
poll and 
.BR select (2)
return immediately if any of their fds are available for reading,
writing or have an exceptional condition, epoll only returns if a
condition changes. 

Because of this, epoll should be used with non-blocking file descriptors,
so that a blocking read or write cannot starve a task that is handling
multiple file descriptors. The suggested way to use epoll is:
.RS
.TP
.B i
with non-blocking file descriptors
.TP
.B ii
by going to wait for an event only after
.BR read (2)
or
.BR write (2)
returns EAGAIN
.RE
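
The EXAMPLE section calls a helper named setnonblocking() without defining it. A minimal version, assuming the usual fcntl(2) approach, could be:

```c
#include <fcntl.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int setnonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}
```

Reading the current flags first keeps any other file status flags that were already set on the descriptor.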
.SH EXAMPLE
In this example, listener is a non-blocking socket on which
.BR listen (2)
has been called. The function do_use_fd() uses the newly ready
file descriptor until EAGAIN is returned by either
.BR read (2)
or
.BR write (2) .
An event driven state machine application should, after having received
EAGAIN, record its current state so that the next call to do_use_fd()
continues to
.BR read (2)
or
.BR write (2)
from where it left off.

.nf
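/* ev is a struct epoll_event; events is an array of maxevents of them */
for(;;) {
    nfds = epoll_wait(kdpfd, events, maxevents, -1);

    for(n = 0; n < nfds; ++n) {
        /* obj carries the fd that was stored at registration time */
        if((int) events[n].obj == listener) {
            client = accept(listener, (struct sockaddr*)&local, &addrlen);
            if(client < 0){
                perror("accept");
                continue;
            }
            setnonblocking(client);
            ev.events = EPOLLIN;
            ev.obj = client;
            if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) {
                fprintf(stderr, "epoll set insertion error: fd=%d\n", client);
                return -1;
            }
        }
        else
            do_use_fd((int) events[n].obj);
    }
}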
.fi

For performance reasons it is possible to add the file descriptor to the
epoll interface
.RB ( EPOLL_CTL_ADD )
once by specifying (EPOLLIN | EPOLLOUT), instead of continuously switching
between EPOLLIN and EPOLLOUT by calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
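
A sketch of registering once for both directions, written against the epoll interface as later shipped in glibc (data union instead of obj, EPOLLET to request edge triggered behaviour), not against the draft header in this message. The function name wait_both() is hypothetical.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Register fd once for both directions and report which events are
 * pending. Returns the event mask, 0 on timeout, or -1 on error.
 */
int wait_both(int fd, int timeout_ms)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT | EPOLLET };
	int epfd = epoll_create(1);
	int n;

	if (epfd < 0)
		return -1;
	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(epfd);
		return -1;
	}
	n = epoll_wait(epfd, &ev, 1, timeout_ms);
	close(epfd);
	if (n < 0)
		return -1;
	return n == 0 ? 0 : (int)ev.events;
}
```

The caller tests the returned mask for EPOLLIN and EPOLLOUT separately; no further epoll_ctl() call is needed when the application changes from reading to writing.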
.SH CONFORMING TO
.BR epoll (2)
is a new API introduced in Linux kernel
.BR 2.5.44 .
.SH "SEE ALSO"
.BR epoll_create (2),
.BR epoll_ctl (2),
.BR epoll_wait (2)

.\"
.\"  epoll by Davide Libenzi ( efficient event notification retrieval )
.\"  Copyright (C) 2002  Davide Libenzi
.\"
.\"  This program is free software; you can redistribute it and/or modify
.\"  it under the terms of the GNU General Public License as published by
.\"  the Free Software Foundation; either version 2 of the License, or
.\"  (at your option) any later version.
.\"
.\"  This program is distributed in the hope that it will be useful,
.\"  but WITHOUT ANY WARRANTY; without even the implied warranty of
.\"  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\"  GNU General Public License for more details.
.\"
.\"  You should have received a copy of the GNU General Public License
.\"  along with this program; if not, write to the Free Software
.\"  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
.\"
.\"  Davide Libenzi <davidel@xmailserver.org>
.\"
.\"
.TH EPOLL_CREATE 2 "23 October 2002" Linux "Linux Programmer's Manual"
.SH NAME
epoll_create \- open an
.B epoll
file descriptor
.SH SYNOPSIS
.B #include <sys/epoll.h>
.sp
.BR "int epoll_create(int " size )
.SH DESCRIPTION
Open an
.B epoll
file descriptor by requesting the kernel to allocate
an event backing store dimensioned for
.I size
descriptors. This is not the maximum size of the backing store but
just a hint to the kernel about how to dimension internal structures.
The returned file descriptor will be used for all the subsequent calls to the
.B epoll
interface. The file descriptor returned by
.BR epoll_create (2)
must be closed by using
.BR close (2) .
.SH "RETURN VALUE"
On success,
.BR epoll_create (2)
returns a non-negative integer identifying the descriptor.
On error, it returns \-1 and
.I errno
is set appropriately.
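
A minimal sketch of the create and close pairing with the error check described above. The function name create_and_close() is made up for the illustration.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Create an epoll descriptor and release it again; 0 on success, -1 on error. */
int create_and_close(int size_hint)
{
	int epfd = epoll_create(size_hint);

	if (epfd < 0) {
		perror("epoll_create");
		return -1;
	}
	/* ... epoll_ctl() and epoll_wait() calls would go here ... */
	return close(epfd);
}
```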
.SH ERRORS
.TP
.B ENOMEM
There was insufficient memory to create the kernel object.
.SH CONFORMING TO
.BR epoll_create (2)
is a new API introduced in Linux kernel
.BR 2.5.44 .
.SH "SEE ALSO"
.BR epoll_ctl (2),
.BR epoll_wait (2),
.BR close (2)

.\"
.\"  epoll by Davide Libenzi ( efficient event notification retrieval )
.\"  Copyright (C) 2002  Davide Libenzi
.\"
.\"  This program is free software; you can redistribute it and/or modify
.\"  it under the terms of the GNU General Public License as published by
.\"  the Free Software Foundation; either version 2 of the License, or
.\"  (at your option) any later version.
.\"
.\"  This program is distributed in the hope that it will be useful,
.\"  but WITHOUT ANY WARRANTY; without even the implied warranty of
.\"  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\"  GNU General Public License for more details.
.\"
.\"  You should have received a copy of the GNU General Public License
.\"  along with this program; if not, write to the Free Software
.\"  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
.\"
.\"  Davide Libenzi <davidel@xmailserver.org>
.\"
.\"
.TH EPOLL_CTL 2 "23 October 2002" Linux "Linux Programmer's Manual"
.SH NAME
epoll_ctl \- controller interface for an
.B epoll
descriptor
.SH SYNOPSIS
.B #include <sys/epoll.h>
.sp
.BR "int epoll_ctl(int " epfd ", int " op ", int " fd ", struct epoll_event *" event )
.SH DESCRIPTION
Control an
.B epoll
descriptor,
.IR epfd ,
by requesting the operation
.IR op
be performed on the target file descriptor,
.IR fd .
The
.IR event
describes the object linked to the file descriptor
.IR fd .
The
.B struct epoll_event
is defined as:
.sp
.nf
	struct epoll_event {
		unsigned short events;	/* Requested events */
		unsigned short revents;	/* Returned events */
		__uint64_t obj;		/* Opaque data object */
	};
.fi

The
.I events
and the
.I revents
members are bit sets composed using the following available event
types:
.TP
.B EPOLLIN
The associated file is available for
.BR read (2)
operations.
.TP
.B EPOLLOUT
The associated file is available for
.BR write (2)
operations.
.TP
.B EPOLLPRI
There is urgent data available for
.BR read (2)
operations.
.TP
.B EPOLLERR
Error condition happened on the associated file descriptor.
.TP
.B EPOLLHUP
Hang up happened on the associated file descriptor.
.PP
.sp
The
.B epoll
interface supports all file descriptors that support
.BR poll (2) .
Valid values for the
.IR op
parameter are:
.RS
.TP
.B EPOLL_CTL_ADD
Add the target file descriptor
.I fd
to the
.B epoll
descriptor
.I epfd
and associate the event
.I event
with the internal file linked to
.IR fd .
.TP
.B EPOLL_CTL_MOD
Change the event
.I event
associated with the target file descriptor
.IR fd .
.TP
.B EPOLL_CTL_DEL
Remove the target file descriptor
.I fd
from the
.B epoll
file descriptor,
.IR epfd .
.RE
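
The three operations can be walked through in order as below. This is a sketch against the epoll interface as later shipped in glibc, where the event structure has a data union rather than the obj member of this draft; add_mod_del() is a hypothetical name.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Walk one descriptor through ADD, MOD and DEL; 0 on success, -1 on error. */
int add_mod_del(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN };

	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
		return -1;

	/* Same descriptor, new event mask. */
	ev.events = EPOLLIN | EPOLLOUT;
	if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) < 0)
		return -1;

	/* The event argument is not used for DEL, but passing one is harmless. */
	return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev);
}
```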
.SH "RETURN VALUE"
On success, 
.BR epoll_ctl (2)
returns zero.  On error, it returns \-1 and
.I errno
is set appropriately.
.SH ERRORS
.TP
.B EBADF
The
.I epfd
file descriptor is not a valid file descriptor.
.TP
.B EPERM
The target file
.I fd
is not supported by
.BR epoll .
.TP
.B EINVAL
The supplied file descriptor,
.IR epfd ,
is not an
.B epoll
file descriptor, or the requested operation
.I op
is not supported by this interface.
.TP
.B ENOMEM
There was insufficient memory to handle the requested
.I op
controller operation.
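
Two of the errors above are easy to provoke on purpose. The sketch below uses the epoll interface as later shipped in glibc, and the helper names are invented; it reports the errno that an EPOLL_CTL_ADD attempt produces.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Try to add fd to epfd and return the resulting errno (0 on success). */
int ctl_add_errno(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN };

	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
		return 0;
	return errno;
}

/* EINVAL case: epfd is an open descriptor, but a pipe rather than epoll. */
int errno_for_non_epoll_epfd(void)
{
	int p[2];
	int err;

	if (pipe(p) < 0)
		return errno;
	err = ctl_add_errno(p[0], p[1]);
	close(p[0]);
	close(p[1]);
	return err;
}
```

Passing \-1 as epfd to ctl_add_errno() gives the EBADF case.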
.SH CONFORMING TO
.BR epoll_ctl (2)
is a new API introduced in Linux kernel
.BR 2.5.44 .
.SH "SEE ALSO"
.BR epoll_create (2),
.BR epoll_wait (2)

.\"
.\"  epoll by Davide Libenzi ( efficient event notification retrieval )
.\"  Copyright (C) 2002  Davide Libenzi
.\"
.\"  This program is free software; you can redistribute it and/or modify
.\"  it under the terms of the GNU General Public License as published by
.\"  the Free Software Foundation; either version 2 of the License, or
.\"  (at your option) any later version.
.\"
.\"  This program is distributed in the hope that it will be useful,
.\"  but WITHOUT ANY WARRANTY; without even the implied warranty of
.\"  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
.\"  GNU General Public License for more details.
.\"
.\"  You should have received a copy of the GNU General Public License
.\"  along with this program; if not, write to the Free Software
.\"  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
.\"
.\"  Davide Libenzi <davidel@xmailserver.org>
.\"
.\"
.TH EPOLL_WAIT 2 "23 October 2002" Linux "Linux Programmer's Manual"
.SH NAME
epoll_wait \- wait for an I/O event on an
.B epoll
file descriptor
.SH SYNOPSIS
.B #include <sys/epoll.h>
.sp
.BR "int epoll_wait(int " epfd ", struct epoll_event *" events ", int " maxevents ", int " timeout )
.SH DESCRIPTION
Wait for events on the
.B epoll
file descriptor
.I epfd
for a maximum time of
.I timeout
milliseconds. The memory area pointed to by
.I events
will contain the events (see
.BR epoll_ctl (2))
that will be available for the caller. Up to
.I maxevents
will be returned by
.BR epoll_wait (2) .
The
.I maxevents
parameter must be greater than zero. Specifying a
.I timeout
of \-1 will make
.BR epoll_wait (2)
wait indefinitely.
.SH "RETURN VALUE"
On success, 
.BR epoll_wait (2)
returns the number of file descriptors ready for the requested I/O, or zero
if no file descriptor became ready during the requested
.I timeout
milliseconds.  On error, it returns \-1 and
.I errno
is set appropriately.
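
The return values can be seen with a small helper. This is a sketch against the epoll interface as later shipped in glibc, not the draft header in this message; wait_for_input() and MAX_EVENTS are hypothetical names.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 8

/*
 * Wait up to timeout_ms on a fresh epoll set watching fd for input.
 * Returns the epoll_wait() result: the number of ready descriptors,
 * 0 on timeout, or -1 on error.
 */
int wait_for_input(int fd, int timeout_ms)
{
	struct epoll_event ev = { .events = EPOLLIN };
	struct epoll_event events[MAX_EVENTS];
	int epfd = epoll_create(1);
	int n;

	if (epfd < 0)
		return -1;
	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(epfd);
		return -1;
	}
	n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
	close(epfd);
	return n;
}
```

A real program would keep the epoll descriptor open across calls; it is created and closed here only to keep the sketch self-contained.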
.SH ERRORS
.TP
.B EBADF
.I epfd
is not a valid file descriptor.
.TP
.B EINVAL
The supplied file descriptor,
.IR epfd ,
is not an
.B epoll
file descriptor, or the
.I maxevents
parameter is less than or equal to zero.
.TP
.B EFAULT
The memory area pointed to by
.I events
is not accessible with write permissions.
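
The EBADF and EINVAL cases can be checked without blocking by using a timeout of zero. A sketch against the epoll interface as later shipped in glibc; wait_errno() is a made-up helper.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Call epoll_wait() without blocking; return errno, or 0 if it succeeded. */
int wait_errno(int epfd, int maxevents)
{
	struct epoll_event events[4];

	if (maxevents > 4)
		maxevents = 4;           /* never exceed the local buffer */
	if (epoll_wait(epfd, events, maxevents, 0) >= 0)
		return 0;
	return errno;
}
```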
.SH CONFORMING TO
.BR epoll_wait (2)
is a new API introduced in Linux kernel
.BR 2.5.44 .
.SH "SEE ALSO"
.BR epoll_create (2),
.BR epoll_ctl (2)

