
Kevents and review of new APIs

The proposed kevent interface has been covered here before - see this article and this one too. Kevents appear to have gained significant momentum over the last few weeks, to the point that inclusion in 2.6.19 is not entirely out of the question. Most developers who have reviewed the code seem to like the core idea (a unified interface for applications to get information on all events of interest) and the implementation within the kernel. Only now, however, is significant attention being paid to the user-space API which comes with kevents. But the definition of that API is of crucial importance. This article will look at it from two perspectives - first technical, then political.

The discussion of the proposed API has been hampered somewhat by the lack of associated documentation - and the fact that said API is still changing quickly. In an attempt to pull together some of the available information, Stephen Hemminger has put up a page at OSDL describing the system call API. That page misses one important aspect of kevents, however: the ability to receive events via a shared memory interface. In an attempt to fill that gap, we'll look at the August 23 version of the memory-mapped kevent API.

One of the goals behind kevents is to make the processing of events as fast as possible - the idea being that a high-performance network server (say) can work through vast numbers of events per second without appreciable system overhead. One way to achieve this is to avoid system calls altogether whenever possible. That is why there is interest in mapping kevents directly into user space; this approach will allow the application to consume them without calling into the kernel for each one.

To use the mmap interface, the application obtains a kevent file descriptor, as usual. A simple call to mmap() will then create the shared buffer for kevent communication. The size of this buffer is currently determined by an in-kernel parameter - the maximum number of kevents which will be stored there. Presumably there will eventually be a KEVENT_MMAP_PAGES macro (or some such) to free the application from trying to figure out how many pages it should map, but that is not yet provided.
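
In the meantime, a rough sketch of the setup might look like the following; the kevent_fd_open() helper and the RING_PAGES constant are purely hypothetical stand-ins for whatever the final interface provides, since both the system call API and the buffer size are still dictated by the kernel and subject to change:

    #include <sys/mman.h>
    #include <unistd.h>

    #define RING_PAGES 16    /* assumption: enough pages to hold the kernel's ring */

    extern int kevent_fd_open(void);    /* hypothetical kevent-descriptor helper */

    int setup_ring(void **ring)
    {
        int kfd = kevent_fd_open();
        size_t len = RING_PAGES * getpagesize();

        /* map the kernel's event ring read-only into our address space */
        *ring = mmap(NULL, len, PROT_READ, MAP_SHARED, kfd, 0);
        if (*ring == MAP_FAILED) {
            close(kfd);
            return -1;
        }
        return kfd;
    }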

The resulting memory array is treated as a big circular buffer by the kernel. There is a single index only, however - where the next event will be written by the kernel. In other words, the kernel has no way to know which events have been consumed by the application; if that application falls too far behind, the kernel will begin to overwrite unprocessed events. For this reason, perhaps, the buffer is made relatively large - 4096 events fit there in the current version of the patch.

The events stored in the buffer are not the same ukevent structures used by the system call interface. There is, instead, a shortened version in the form of struct mukevent:

    struct kevent_id
    {
	union {
	    __u32 raw[2];
	    __u64 raw_u64 __attribute__((aligned(8)));
	};
    };

    struct mukevent
    {
	struct kevent_id	id;
	__u32			ret_flags;
    };

The id field contains some information about what happened: the relevant file descriptor, for example. The actual event code itself is not present, however.

The event ring is not quite a pure circular buffer. It is formatted with a four-byte field at the beginning of each page, followed by as many mukevent structures as will fit within the page. The four-byte field in the first page contains the current event index - where the kernel will write the next event. The application will, presumably, keep track of the last event it read from the buffer, moving that counter forward until it catches up with the kernel. The application must take care, however, to notice every time it crosses a page boundary so it can skip the count field.
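
A reader which walks that layout might look something like the sketch below. The per-page arithmetic is only an educated guess from the description above, the structures are those shown earlier, and the application would also need to wrap its position counter once it reaches the size of the ring:

    #include <unistd.h>          /* getpagesize() */
    #include <string.h>          /* memcpy() */
    #include <linux/types.h>     /* __u32 */

    /* Copy out the event at position "pos", skipping the four-byte
     * field at the start of every page. "base" is the mmap()ed ring. */
    static void ring_read_event(void *base, unsigned int pos, struct mukevent *out)
    {
        unsigned int per_page = (getpagesize() - sizeof(__u32)) / sizeof(struct mukevent);
        char *p = (char *)base + (pos / per_page) * getpagesize()
                  + sizeof(__u32) + (pos % per_page) * sizeof(struct mukevent);

        memcpy(out, p, sizeof(*out));
    }

    /* Only the first page's four-byte field holds the kernel's write index. */
    static unsigned int ring_kernel_index(void *base)
    {
        return *(volatile __u32 *)base;
    }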

Since there is no way to inform the kernel that events have been consumed from the memory-mapped ring, and since the full event information is not available via that ring, the application must still call into the kernel for events. Otherwise, if nothing else, they will accumulate there until they reach their maximum allowed number. So the advantage of the memory-mapped approach will be hard to obtain with the current code. As was noted above, however, this API is very young. One assumes that these little problems will be ironed out in the near future.

Meanwhile, kevents have created a separate discussion on how new APIs go into the kernel. One Nicholas Miell requested that some documentation for this interface be written:

Is any of this documented anywhere? I'd think that any new userspace interfaces should have man pages explaining their use and some example code before getting merged into the kernel to shake out any interface problems.

The response he got was "Get real". Others suggested that, if Mr. Miell really wanted documentation, he could sit down and write it himself. It must be said that, through the discussion, Mr. Miell has comported himself in a way which is highly unlikely to inspire cooperation from anybody. He seems to carry a certain contempt for the interface, the process, and the people involved in it.

But it must also be said that he has a point. The creation of user-space APIs differs from how most kernel code is written. Much is made of the evolutionary nature of the kernel itself - things continually evolve as better solutions to problems are found. User-space interfaces, however, cannot evolve - once they are shipped as part of a mainline kernel, they are set in stone and must be maintained forever. They must be right from the outset. So it is not unreasonable to expect that the level of review for new user-space APIs would be higher, and that documentation of proposed APIs, which can be expected to help the review process, should be provided. It is true, however, that the original developer is not always the best person to provide that documentation.

One question which has been raised about this interface has to do with its similarity to the FreeBSD kqueue mechanism. The intent of the interface is the same, but no attempt to emulate the kqueue API has been made. Andrew Morton has said:

I mean, if there's nothing wrong with kqueue then let's minimise app developer pain and copy it exactly. If there _is_ something wrong with kqueue then let us identify those weaknesses and then diverge. Doing something which looks the same and works the same and does the same thing but has a different API doesn't benefit anyone.

There are, evidently, real reasons for not replicating the kqueue interface, but those reasons have not, yet, been made clear.

Kevents will, it is hoped, be a major improvement for people writing applications for Linux. This new API should bring together all information of interest into a single place, provide significant performance benefits, and ease porting of applications from other operating systems. But, if this API is going to meet the high expectations being placed on it, it will require a high level of review from a number of interested parties. That review is now starting to happen, so expect this API to remain in flux for some time yet.



Kevents and review of new APIs

Posted Aug 24, 2006 9:07 UTC (Thu) by pphaneuf (guest, #23480)

Linus had mentioned kqueue before.

I have said before that I don't think system calls are really the issue. First off, if a process is going to sleep, it has to make a system call. Second, for an event-gathering API, as long as you can fetch many events at once (as epoll_wait() and (eurgh) select()/poll() allow one to do), you can manage to make it a single system call per iteration of the main loop, where a lot of work (and many, many system calls!) will be done in between each of those calls.

Now, something much more profitable would be a way to limit the number of events in the first place. Many protocols have a fixed block size; a way to trigger a readability event only when a full block has arrived, or a writability event only when there is space for X bytes instead of 1, would be a higher-level approach to reducing event delivery overhead, simply by delivering fewer events.

This is exactly why epoll was so much more efficient than select()/poll(): not by employing radical new ways of communicating between kernel and user space (all three take a pointer to a user-space buffer and fiddle with it), but by reducing the complexity of the processing itself, keeping some state in the kernel. epoll doesn't take any input at event dispatch time, and returns only the interesting events.
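
As an illustration of that pattern - register interest once, then collect a whole batch of ready events with a single epoll_wait() per pass through the main loop - a minimal sketch (error handling omitted) might look like this:

    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    void serve(int listen_fd)
    {
        struct epoll_event ev, events[MAX_EVENTS];
        int epfd = epoll_create(MAX_EVENTS);
        int i, n;

        /* register interest once, outside the main loop */
        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            /* one system call fetches a whole batch of ready events */
            n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (i = 0; i < n; i++) {
                /* ... accept(), read(), write(), etc. on events[i].data.fd ... */
            }
        }
    }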

The potentially nasty behaviour of overfilling the ring buffer reminds me of what would happen with realtime signals when the queue ran out of space for events: you had to fall back to using select(), so you ended up with a lot of complexity, with a special "overload" code path and all the bugs and testing headaches that came with it. Oh well, at least it didn't explode and overwrite earlier event data, as this proposal looks like it will do!

I really wonder how much of a problem epoll registration and unregistration is (the only issue Ulrich's paper mentioned about epoll), considering how many system calls are usually involved in accepting a new connection. You have to accept the connection, set it to non-blocking, possibly turn off Nagle (depending on the protocol), maybe set TCP_CORK, maybe getsockname() to find out which virtual server we're supposed to act as, maybe set the priority if we need some particular ToS (streaming media servers come to mind as a high-performance ToS-aware kind of application)... Oh, and an epoll_ctl(). Is that going to kill performance, really?

I find epoll rather satisfactory, at the moment, from a performance point of view. My main problem is reducing the number of copies in bulk transfers, where Ulrich's paper was interesting, and can be helped at the moment with APIs such as sendfile() and splice(). Also, reducing the number of events themselves would help performance of some servers.
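
To illustrate the copy-reduction point, an interface like sendfile() hands the kernel both descriptors and lets it move the data without bouncing it through a user-space buffer; a rough sketch, with only minimal error handling:

    #include <sys/sendfile.h>
    #include <sys/stat.h>

    /* Send an entire on-disk file over an already-connected socket
     * without copying it through user space. */
    static int send_whole_file(int sock_fd, int file_fd)
    {
        struct stat st;
        off_t offset = 0;

        if (fstat(file_fd, &st) < 0)
            return -1;

        while (offset < st.st_size) {
            ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
            if (n <= 0)
                return -1;    /* a real server would handle EAGAIN etc. */
        }
        return 0;
    }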

I think there are also aspects of API usability which this would be a good time to improve. When I can implement an asynchronous DNS resolver that doesn't fork or use threads (to wait for replies? how silly) and works without too much help from the application, it'll be a great day.

Kevents and review of new APIs

Posted Aug 24, 2006 19:56 UTC (Thu) by vmole (guest, #111)

If you're starting a new network connection, sure, the event notification call doesn't mean squat. But I think the kevent stuff is a general mechanism, and for things like async I/O and other local events, it might be significant.

Kevents and review of new APIs

Posted Aug 25, 2006 9:18 UTC (Fri) by pphaneuf (guest, #23480)

In what situation does one open files in rapid succession, without a socket being more or less involved at the same time? Consider that open() itself is synchronous and blocking, and you've already got a good bit of overhead. If you start thinking about an "async open", then I'll refer you to this.

I would remind you that a big way to decrease the overhead of event handling is to have fewer events in the first place, thus the usefulness of interest sets and such (think of X11's event_mask: on a remote display, event dispatching can be slow). At some point, you'll have to make a syscall to tell the kernel whether or not you want events for a given file descriptor. At best, it could be specified through flags when opening, but I don't think this is an issue worth addressing.

I think what Ulrich was referring to was more when you want to "fiddle" with the interest set rather frequently. For example, picture a user-space port forwarder, where it starts waiting for readability on both sockets, but upon receiving a packet, will add writability of the other socket to its interest set (maybe remove readability of the socket where the packet arrived, for flow control), etc...

This particular situation is a case where edge-triggered events (well supported by epoll, by the way) are indicated, as you can set the interest set of both sockets to read and write, and leave them as-is, keeping track of the state of the sockets in user-space.
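
A rough sketch of that setup - both sockets registered once for read and write in edge-triggered mode, with readiness tracked in user space afterwards:

    #include <sys/epoll.h>

    /* Register both ends of the forwarder for edge-triggered read and
     * write notification; the interest set is never touched again. */
    static int watch_both(int epfd, int fd_a, int fd_b)
    {
        struct epoll_event ev;
        int fds[2] = { fd_a, fd_b };
        int i;

        for (i = 0; i < 2; i++) {
            ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
            ev.data.fd = fds[i];
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0)
                return -1;
        }
        return 0;
    }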

Lastly, what other "local events" are there?

Kevents and review of new APIs

Posted Aug 31, 2006 20:05 UTC (Thu) by vmole (guest, #111)

Async I/O includes things other than open(), ya know. If it takes longer to deal with the completion of the async_read() than it would to just call read() in the first place, async i/o becomes pointless.

Something that opens lots of files without using sockets: compilers. Consider some auditing system that wants to know every time a file is accessed.

And while sockets may be involved, a web server, news server or IMAP server might open, read, and possibly write a *lot* of files for one socket instance.

As for other local events, consider a piece of shared memory, with a master process that wants to know anytime one of the children writes to it. Yeah, it can be done with an atomic counter and polling, but an event would be much cleaner. And I've had a need for this.

Designing a general kernel events mechanism around the limitations of socket open() seems shortsighted.

Kevents and review of new APIs

Posted Sep 1, 2006 12:20 UTC (Fri) by pphaneuf (guest, #23480)

I think you misunderstood what kevent is for. kevent isn't concerned with all sorts of events, but rather with the specific types of events that can wake a process up. Just about all of those events are file descriptor events, due to the Unix design ("everything is a file", or close enough).

The exact issue raised about epoll in Ulrich's paper was the overhead of registering a file descriptor with the epoll_ctl() call before getting the events, and I was wondering what he would have otherwise. Just getting all the events would be highly inefficient.

To deal point by point with your reply: Linux-AIO uses a single file descriptor to get notifications on all the operations it does (reads and writes, on all other file descriptors, be they files, sockets or otherwise). This file descriptor can most likely be put in an epoll interest set. An auditing system (such as already exists in the form of Dazuko, inotify and such) would most likely deliver its events on a file descriptor (which you can put in epoll and get notified when those events arrive). Web, news and IMAP servers could use Linux-AIO (covered earlier), but normal filesystem-based file descriptors are "always readable", even when they aren't, so you usually don't want to use them in an event mechanism like kevent or epoll (being always "ready", they make your application busy-spin, eating 100% CPU).

Processes communicating bulk data through shared memory often use a Unix domain socket to notify the other process that it should get the data. X11's MIT-SHM extension works this way, for example, but simpler schemes that just write a single byte (enough to make the file descriptor go "readable") are also seen. Unix domain sockets involve more copies for bulk data, but writing a single byte to wake the other process up is very cheap. If the notification is one-way only, a pipe is enough.

You also missed a few other interesting cases. Central processing of signals and timeouts are two others. Signals can also be dealt with using the "single byte written on a pipe" trick, from the signal handler, deferring the work to the other end of the pipe. Timeouts can be dealt with by, well, the timeout parameter of epoll_wait(), of course.
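
That trick is commonly known as the self-pipe trick; a minimal sketch (the choice of SIGTERM here is just an example):

    #include <unistd.h>
    #include <signal.h>
    #include <fcntl.h>
    #include <string.h>

    static int sig_pipe[2];    /* [0] is watched by epoll, [1] is for the handler */

    static void handler(int signum)
    {
        char byte = (char)signum;

        /* async-signal-safe: just wake up whoever watches the read end */
        write(sig_pipe[1], &byte, 1);
    }

    static int setup_signal_pipe(void)
    {
        struct sigaction sa;

        if (pipe(sig_pipe) < 0)
            return -1;
        fcntl(sig_pipe[0], F_SETFL, O_NONBLOCK);
        fcntl(sig_pipe[1], F_SETFL, O_NONBLOCK);

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handler;
        sigemptyset(&sa.sa_mask);
        return sigaction(SIGTERM, &sa, NULL);
    }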

The main problem I have with epoll is still that it doesn't centralise the event dispatching for libraries. A new API should let callers register a callback function to be invoked when events arrive, without needing cooperation between unrelated pieces of code. For example, if I write an asynchronous DNS resolver library, I should have a way to be notified when a file descriptor is ready or a timeout expires without having to cooperate with other code. Right now, code in a library has to provide a way to let whatever code calls epoll_wait() know that it has a specific timeout, or that events on a certain file descriptor should be passed on to it.

Some libraries, like Qt, libevent and such can do that, but the big problem is that it's a very basic functionality, and it's worthless if it's not standard (if my library registers its events with Qt, but the main program uses libevent, nothing happens and my library never gets its events). These libraries already do a good job, but the point here is to make one that will be good enough to be integrated as the Linux event API and be integrated in the glibc, so it can be relied upon.
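
For reference, this is roughly what that kind of callback registration looks like with libevent's 1.x-era API - the library hands off a descriptor and a function, and whoever owns the main loop dispatches:

    #include <event.h>    /* libevent 1.x */

    /* Called by the dispatcher when the resolver's socket is readable. */
    static void resolver_readable(int fd, short what, void *arg)
    {
        /* ... read and parse the DNS reply ... */
    }

    void resolver_register(int sock)
    {
        static struct event ev;

        /* EV_PERSIST keeps the registration alive across events */
        event_set(&ev, sock, EV_READ | EV_PERSIST, resolver_readable, NULL);
        event_add(&ev, NULL);    /* NULL: no timeout */
    }

    /* The application, not the library, runs the loop:
     *     event_init();
     *     ... everyone registers their events ...
     *     event_dispatch();
     */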

They're just guessing

Posted Sep 2, 2006 9:30 UTC (Sat) by slamb (guest, #1070)

The exact issue raised about epoll in Ulrich's paper was the overhead of registering a file descriptor with the epoll_ctl() call before getting the events, and I was wondering what he would have otherwise. Just getting all the events would be highly inefficient.

I haven't seen this paper (got a link?), but I'd say there are three options:

  1. make assumptions - like that because read() returned EWOULDBLOCK you want to know when it next becomes available for write
  2. abandon level-driven polling. Edge-driven polling should let you set your notification preferences to READ|WRITE and leave it there, even if it's available and you don't currently want to consume it.
  3. accept a list of changes at the same time as the blocking call. Of course, this is the BSD way, so the Linux people have to do something different.

This article makes it painfully obvious to me that the Linux developers as a whole are just guessing. They're going back to a kevent-like system after mocking it when creating epoll. Well, now they're finding that the complexity of those other event types is worth it, and that their system call overhead is too high. Probably should have listened the first time. If the sheer number of system calls is the problem, it's obvious that in level-driven notification applications, the FreeBSD approach of passing in all your change notifications at the same time as blocking is better than the unnecessary system calls of epoll_ctl(). (Do the Linux people only care about edge-driven stuff? Perhaps that's reasonable, but I don't see it stated anywhere.)
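
For comparison, FreeBSD's kqueue folds the two operations into one call: the changelist and the space for returned events are both arguments to the same blocking kevent() call, so updating the interest set costs no extra system calls. A minimal sketch:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    #define MAX_EVENTS 64

    void one_pass(int kq, int new_fd)
    {
        struct kevent change, events[MAX_EVENTS];
        int i, n;

        /* describe the interest-set change ... */
        EV_SET(&change, new_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);

        /* ... and submit it with the same call that blocks for new events */
        n = kevent(kq, &change, 1, events, MAX_EVENTS, NULL);
        for (i = 0; i < n; i++) {
            /* ... handle events[i] ... */
        }
    }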

This bizarre extreme of trying to eliminate all system calls by using a ring buffer... well, I agree with your comment that it sounds exactly like the signal-based polling mistake, and with your comment in an earlier thread that some sort of blocking call is clearly necessary. Maybe it is true that the copying of event buffers is significant, but I haven't seen benchmark numbers demonstrating that this is superior, so again it seems that they're just guessing. That's a poor reason for throwing out what someone has already done in favor of a much more convoluted and error-prone interface.

I'm glad to see Andrew Morton's voice of reason, both on needing a clear justification for going against the existing FreeBSD interface and on the documentation. The latter is a serious problem with Linux interfaces in general. Look at inotify - they have section 2 manual pages for the system calls but no section 4 manual page for the whole interface. That's worthless - the system calls are completely obvious; the section 4 manual page is needed to actually describe what the constants and structure elements mean, among other things.

if I write an asynchronous DNS resolver library, I should have a way to be notified when a file descriptor is ready or a timeout expires without having to cooperate with other code ... Some libraries, like Qt, libevent and such can do that, but the big problem is that it's a very basic functionality, and it's worthless if it's not standard (if my library registers its events with Qt, but the main program uses libevent, nothing happens and my library never gets its events) ... the point here is to make one that will be good enough to be integrated as the Linux event API and be integrated in the glibc, so it can be relied upon.

I've always liked liboop for this purpose. It's more usable by libraries because it is general - you can plug it in to Qt's event loop, glib's event loop, libevent, etc. You don't have to make the sort of assumptions you're talking about to use it. I'd strongly prefer a well-maintained, liboop-like library to one in glibc like you're talking about. Largely because I would like my code to run on FreeBSD as well, and because it doesn't require waiting for the Qt and glib people to rebase their stuff on it. It doesn't even require other people to use it, though it'd sure be nice.

Kevents and review of new APIs

Posted Sep 6, 2006 17:15 UTC (Wed) by wilck (subscriber, #29844)

Mr. Miell has comported himself in a way which is highly unlikely to inspire cooperation from anybody. He seems to carry a certain contempt for the interface, the process, and the people involved in it.

I read the thread before your comment, and I must say I can't follow you. I found Nicholas' concerns valid and his style not overly aggressive (well, for the most part). I liked his description of the differing opinions of developers wrt kernel API design (someone called that post a "rant" - strange).

Even if everyone is a volunteer, it's bad style to respond to criticism with "go fix it" and to requests for documentation with "go write it yourself".

I hope that the concerns that Nicholas and others have articulated will be dealt with constructively before this code is merged. No rush, please.
