EPOLL_CTL_DISABLE, epoll, and API design
In an article last week, we saw that the EPOLL_CTL_DISABLE operation proposed by Paton Lewis provides a way for multithreaded applications that cache information about file descriptors to safely delete those file descriptors from an epoll interest list. For the sake of brevity, in the remainder of this article we'll use the term "the EPOLL_CTL_DISABLE problem" to label the underlying problem that EPOLL_CTL_DISABLE solves.
This article revisits the EPOLL_CTL_DISABLE story from a different angle, with the aim of drawing some lessons about the design of the APIs that the kernel presents to user space. The initial motivation for pursuing this angle arises from the observation that the EPOLL_CTL_DISABLE solution has some difficulties of its own. It is neither intuitive (it relies on some non-obvious details of the epoll implementation) nor easy to use. Furthermore, the solution is somewhat limiting, since it forces the programmer to employ the EPOLLONESHOT flag. Of course, these difficulties arise at least in part because EPOLL_CTL_DISABLE is designed so as to satisfy one of the cardinal rules of Linux development: interface changes must not break existing user-space applications.
If there had been an awareness of the EPOLL_CTL_DISABLE problem when the epoll API was originally designed, it seems likely that a better solution would have been built, rather than bolting on EPOLL_CTL_DISABLE after the fact. Leaving aside the question of what that solution might have been, there's another interesting question: could the problem have been foreseen?
One might suppose that predicting the EPOLL_CTL_DISABLE problem would have been quite difficult. However, the synchronized-state problem is well known and the epoll API was designed to be thread friendly. Furthermore, the notion of employing a user-space cache of the ready list to prevent file descriptor starvation was documented in the epoll(7) man page (see the sections "Example for Suggested Usage" and "Possible Pitfalls and Ways to Avoid Them") that was supplied as part of the original implementation.
In other words, almost all of the pieces of the puzzle were known when the epoll API was designed. The one fact whose implications might not have been clear was the presence of a blocking interface (epoll_wait()) in the API. One wonders if more review (and building of test applications) as the epoll API was being designed might have uncovered the interaction of epoll_wait() with the remaining well-known pieces of the puzzle, and resulted in a better initial design that addressed the EPOLL_CTL_DISABLE problem.
So, the first lesson from the EPOLL_CTL_DISABLE story is that more review is necessary in order to create better API designs (and we'll see further evidence supporting that claim in a moment). Of course, the need for more review is a general problem in all aspects of Linux development. However, the effects of insufficient review can be especially painful when it comes to API design. The problem is that once an API has been released, applications come to depend on it, and it becomes at the very least difficult, or, more likely, impossible to later change the aspects of the API's behavior that applications depend upon. As a consequence, a mistake in API design by one kernel developer can create problems that thousands of user-space developers must live with for many years.
A second lesson about API design can be found in a comment that Paton made when responding to a question from Andrew Morton about the design of EPOLL_CTL_DISABLE. Paton was speculating about whether a call of the form:
epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &epoll_event);
could be used to provide the required functionality. The EPOLL_CTL_DEL operation does not currently use the fourth argument of epoll_ctl(), and applications should specify it as NULL (but more on that point in a moment). The idea would be that "epoll_ctl [EPOLL_CTL_DEL] could set a bit in epoll_event.events (perhaps called EPOLLNOTREADY)" to notify the caller that the file descriptor was in use by another thread.
But Paton noted a shortcoming of this approach:
In other words, although the EPOLL_CTL_DEL operation doesn't use the epoll_event argument, the caller is not required to specify it as NULL. Consequently, existing applications are free to pass random addresses in epoll_event. If the kernel now started using the epoll_event argument for EPOLL_CTL_DEL, it seems likely that some of those applications would break. Even though those applications might be considered poorly written, that's no justification for breaking them. Quoting Linus Torvalds:
The lesson here is that when an API doesn't use an argument, usually the right thing to do is for the implementation to include a check that requires the argument to have a suitable "empty" value, such as NULL or zero. Failure to do that means that we may later be prevented from making the kind of API extensions that Paton was talking about. (We can leave aside the question of whether this particular extension to the API was the right approach. The point is that the option to pursue this approach was unavailable.) The kernel-user-space API provides numerous examples of failure to do this sort of checking.
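The kind of check described above can be sketched in a few lines of C. This is a user-space model only: my_ctl(), struct my_event, and the MY_OP_* operations are hypothetical names invented for illustration, not part of epoll or any kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations and argument type, invented for this sketch. */
enum my_op { MY_OP_ADD, MY_OP_DEL };

struct my_event {
    unsigned int events;
};

/*
 * Model of an API entry point that rejects non-"empty" unused arguments.
 * Returns 0 on success or a negative errno-style value on failure.
 */
int my_ctl(enum my_op op, int fd, const struct my_event *ev)
{
    if (fd < 0)
        return -EBADF;

    switch (op) {
    case MY_OP_ADD:
        /* ADD uses the event argument, so it must be supplied. */
        if (ev == NULL)
            return -EFAULT;
        return 0;
    case MY_OP_DEL:
        /*
         * DEL does not use the event argument today. Insisting on NULL
         * keeps the argument available for a later extension, because
         * no existing caller can be passing anything else.
         */
        if (ev != NULL)
            return -EINVAL;
        return 0;
    }
    return -EINVAL;
}

/* Usage sketch: the unused argument is accepted only when it is NULL. */
void my_ctl_demo(void)
{
    struct my_event ev = { 0 };

    assert(my_ctl(MY_OP_ADD, 3, &ev) == 0);
    assert(my_ctl(MY_OP_DEL, 3, NULL) == 0);
    assert(my_ctl(MY_OP_DEL, 3, &ev) == -EINVAL);
}
```

The design point is that the check has to be present from the first release; adding it later would itself break callers that already pass non-NULL values.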
However, there is yet more life in this story. Although there have been many examples of system calls that failed to check that "empty" values were passed for unused arguments, it turns out that epoll_ctl(EPOLL_CTL_DEL) fails to include the check for another reason. Quoting the BUGS section of the epoll_ctl() man page:
In other words, applications that use EPOLL_CTL_DEL are not only permitted to pass random values in the epoll_event argument: if they want to be portable to Linux kernels before 2.6.9 (which fixed the problem), they are required to pass a pointer to some random, but valid user-space address. (Of course, most such applications would simply allocate an unused epoll_event structure and pass a pointer to that structure.) Here, we're back to the first lesson: more review of the initial epoll API design would almost certainly have uncovered this fairly basic design error. (It's this writer's contention that one of the best ways to conduct that sort of review is by thoroughly documenting the API, but he admits to a certain bias on this point.)
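As a minimal sketch of what such a portable application might do, the following wraps EPOLL_CTL_DEL so that a valid but unused epoll_event structure is always passed. The helper name epoll_del_portable() and the demo function are assumptions made for illustration; only epoll_create1(), epoll_ctl(), pipe(), and close() are real interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Remove fd from the interest list of epfd. A pointer to a dummy
 * structure is passed so that the call also works on kernels that
 * insist on a non-NULL event pointer for EPOLL_CTL_DEL.
 * Returns what epoll_ctl() returns: 0 on success, -1 with errno set.
 */
int epoll_del_portable(int epfd, int fd)
{
    struct epoll_event unused = { 0 };  /* valid address, contents ignored */

    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &unused);
}

/* Usage sketch: register the read end of a pipe, then remove it twice. */
void demo_add_then_delete(void)
{
    int pipefd[2];
    int epfd = epoll_create1(0);
    int rc = pipe(pipefd);
    assert(epfd != -1 && rc == 0);

    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);
    assert(rc == 0);

    rc = epoll_del_portable(epfd, pipefd[0]);
    assert(rc == 0);                      /* first removal succeeds */

    rc = epoll_del_portable(epfd, pipefd[0]);
    assert(rc == -1 && errno == ENOENT);  /* fd is no longer registered */

    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
}
```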
Failing to check that unused arguments (or unused pieces of arguments) have "empty" values can cause subtle problems long after the fact. Anyone looking for further evidence on that point does not need to go far: the epoll_ctl() system call provides another example.
Linux 3.5 added a new epoll flag, EPOLLWAKEUP, that can be specified in the epoll_event.events field passed to epoll_ctl(). The effect of this flag is to prevent the system from being suspended while epoll readiness events are pending for the corresponding file descriptor. Since this flag has a system-wide effect, the caller must have a capability, CAP_BLOCK_SUSPEND (initially misnamed CAP_EPOLLWAKEUP).
In the initial EPOLLWAKEUP implementation, if the caller did not have the CAP_BLOCK_SUSPEND capability, then epoll_ctl() returned an error so that the caller was informed of the problem. However, Jiri Slaby reported that the new flag caused a regression: an existing program failed because it was setting formerly unused bits in epoll_event.events when calling epoll_ctl(). When one of those bits acquired a meaning (as EPOLLWAKEUP), the call failed because the program lacked the required capability. The problem of course is that epoll_ctl() has never checked the flags in epoll_event.events to ensure that the caller has specified only flag bits that are actually implemented in the kernel. Consequently, applications were free to pass random garbage in the unused bits.
When one of those random bits suddenly caused the application to fail, what should be done? Following the logic outlined above, of course the answer is that the kernel must change. And that is exactly what happened in this case. A patch was applied so that if the EPOLLWAKEUP flag was specified in a call to epoll_ctl() and the caller did not have the CAP_BLOCK_SUSPEND capability, then epoll_ctl() silently ignored the flag instead of returning an error. Of course, in this case, the calling application might easily carry on, unaware that the request for EPOLLWAKEUP semantics had been ignored.
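The two policies at issue here can be modeled in user space as follows. This is a sketch under stated assumptions: the FLAG_* values and both function names are hypothetical and do not correspond to the real epoll flag values or to the kernel's code; the "privileged" parameter stands in for a capability check such as the one for CAP_BLOCK_SUSPEND.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; not the real epoll values. */
#define FLAG_IN      UINT32_C(0x01)
#define FLAG_OUT     UINT32_C(0x02)
#define FLAG_WAKEUP  UINT32_C(0x04)  /* privileged flag added in a later release */
#define KNOWN_FLAGS  (FLAG_IN | FLAG_OUT | FLAG_WAKEUP)

/*
 * Strict policy: any bit the implementation does not know is an error.
 * This is only an option if it is enforced from the first release.
 */
int check_flags_strict(uint32_t flags)
{
    return (flags & ~KNOWN_FLAGS) ? -EINVAL : 0;
}

/*
 * Compatible policy, modeling the fix described above: when the caller
 * lacks the privilege, the privileged flag is dropped silently instead
 * of failing the call. Returns the flags that actually take effect.
 */
uint32_t effective_flags(uint32_t flags, bool privileged)
{
    if ((flags & FLAG_WAKEUP) && !privileged)
        flags &= ~FLAG_WAKEUP;
    return flags;
}

/* Usage sketch showing the outcome of each policy. */
void flags_demo(void)
{
    /* Strict checking rejects a stray, unimplemented bit. */
    assert(check_flags_strict(FLAG_IN | UINT32_C(0x80)) == -EINVAL);
    assert(check_flags_strict(FLAG_IN | FLAG_OUT) == 0);

    /* Unprivileged caller: the request is quietly reduced. */
    assert(effective_flags(FLAG_IN | FLAG_WAKEUP, false) == FLAG_IN);

    /* Privileged caller: the flag is honored. */
    assert(effective_flags(FLAG_IN | FLAG_WAKEUP, true) ==
           (FLAG_IN | FLAG_WAKEUP));
}
```

Note that the caller of effective_flags() in this model learns the outcome only because the function returns the reduced mask; with the behavior described above, an application gets no such indication.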
One might observe that there is a certain arbitrariness about the approach taken to dealing with the EPOLLWAKEUP breakage. Taken to the extreme, this type of logic would say that the kernel can never add new flags to APIs that didn't hitherto check their bit-mask arguments—and there is a long list of such system calls (mmap(), splice(), and timer_settime(), to name just a few). Nevertheless, new flags are added. So, for example, Linux 2.6.17 added the epoll event flag EPOLLRDHUP, and since no one complained about a broken application, the flag remained. It seems likely that the same would have happened for the original implementation of EPOLLWAKEUP that returned an error when CAP_BLOCK_SUSPEND was lacking, if someone hadn't chanced to make an error report.
As an aside to the previous point, in cases where someone reports a regression after an API change has been officially released, there is a conundrum. On the one hand, there may be old applications that depend on the previous behavior; on the other hand, newer applications may already depend on the newly implemented change. At that point, there is no simple remedy: to fix things almost certainly means that some applications must break.
We can conclude with two observations, one specific, and the other more general. The specific observation is that, ironically, EPOLL_CTL_DISABLE itself seems to have had surprisingly little review before being accepted into the 3.7 merge window. And in fact, now that more attention has been focused on it, it looks as though the proposed API will see some changes. So, we have a further, very current, piece of evidence that there is still insufficient review of kernel-user-space APIs.
More generally, the problem seems to be that, while the kernel code gets reviewed on many dimensions, it is relatively uncommon for kernel-user-space APIs to be reviewed on their own merits. The kernel has maintainers for many subsystems. By now, the time seems ripe for there to be a kernel-user-space API maintainer: someone whose job it is to actively review and ack every kernel-user-space API change, and to ensure that test cases and sufficient documentation are supplied with the implementation of those changes. Lacking such a maintainer, it seems likely that we'll see many more cases where kernel developers add badly designed APIs that cause years of pain for user-space developers.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 23, 2012 16:19 UTC (Tue) by dankamongmen (subscriber, #35141)
I really wish I'd have followed this out to its conclusion back in 2010. :/
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 23, 2012 20:27 UTC (Tue) by Yorick (guest, #19241)
This is unfortunate. There is nothing more important to review than immutable interfaces (except for security audits). I'm going to be grossly unfair here, but in neglecting basic software engineering principles, the developers here come out as bumbling amateurs. (Of course I've made similar mistakes myself, but with slightly less severe consequences.)
With a cast-in-stone policy, these APIs must be subject to extreme scrutiny. I would like to go further than the editor of the excellent article: There should be working applications present, not just tiny test cases or proofs of concept, in addition to full documentation, before any proposals can be accepted. It's not just a matter of verifying that the APIs work, but that they are useful and complete as well.
The importance of checking unused parameter bits was learned dearly in the 1960s, over and over again, both for hardware and software. We should know this by now.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 23, 2012 20:30 UTC (Tue) by dlang (guest, #313)
The early applications that use an API are going to be written by people who are very familiar with the API and know not only what the API does, but how it _should_ be used.
A few years later, you get applications developed by people who don't know how it _should_ be used, and as they start trying to use it in other ways (some of them very good ways), they expose limitations that the people who were involved with the development, testing, and reviews of the API missed.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 23, 2012 21:01 UTC (Tue) by Yorick (guest, #19241)
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 23, 2012 21:07 UTC (Tue) by dlang (guest, #313)
If applications will be broken, then the API shouldn't change.
If there are no applications using an API, that API can be removed.
But if applications are using that API, creating new versions of the API doesn't solve the problem. Existing applications are not going to disappear.
This attitude that "the APIs are versioned, we can drop support for old versions" is exactly what's causing the problems in the Linux Desktop Environment world.
It doesn't matter how justified you think you are in getting rid of an old version, if it breaks users it's a regression and you should not do so.
Maintaining lots of different, incompatible versions of an API is a huge amount of work, so just versioning the API isn't nearly enough, and it's questionable if it really helps in the long run.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 23, 2012 23:04 UTC (Tue) by nybble41 (subscriber, #55106)
However, I agree that in general maintaining multiple API versions, where the old one is not a simple subset of the new one, is not likely to go well.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 25, 2012 9:42 UTC (Thu) by dgm (subscriber, #49227)
For that reason we should be putting much more effort into API design. In fact, taking into account that an API _will_ outlive its implementation (maybe several of them), most of the effort should go towards getting the API right.
Re API (and ABI) design and maintenance
Posted Oct 26, 2012 18:31 UTC (Fri) by davecb (subscriber, #1574)
I spent three years at the (late, lamented) Sun Microsystems doing ABI stability, and ended up sensitized enough that I notice it when it happens.
An easy example was a company's linker, which needed a different data structure for a method call that it had. We added a new "record type" in the linker, and converted the compilers to produce only it. We supported the old format for a year, then made its use produce a warning, and a year after that, made it require a link-line option.
We got two complaints, total, about the option. Out of all our customers, only two were so very very far behind that they ran one of the old compilers. A year or two later a hardware change made those compilers produce impossibly lousy code, and the two outliers upgraded to the new compilers. Then we retired the old interface.
If we had moved any faster, we would have annoyed at least two customers. If we had moved any slower in switching the compilers over, we would have made the OS developers unhappy. The time between the first switch and the final stages of the retirement made the maintainers unhappy, but there weren't many bugs in that interface nor were there many users of it, so the time cost of maintaining it was low.
We did have to manage it, and we had to do a fair bit of work behind the scenes to make it invisible to the users, but we succeeded at evolution.
Just as with humans, if you don't evolve, you might just die out. Homo habilis, anyone?
--dave
Re API (and ABI) design and maintenance
Posted Nov 13, 2012 12:39 UTC (Tue) by k3ninho (subscriber, #50375)
I suspect that Linus' view on never, ever, binning old binary interfaces will mean that there will be (un)dead interfaces supported forever. My thoughts on a management plan look like:
(*) build a test suite round existing, in-use interfaces to maintain their intended functionality
(*) build a versioning API where a version-aware program can call in to use versioned interfaces
(*) attach version info to the existing APIs and handle a 'this interface is not implemented in your version of Linux' error
(*) develop a plumbing metalanguage to support disabled/legacy interface functionality via newer interfaces
(*) have all the interfaces configurable in the makefile, defaulting to enabled
(*) stop talking and show you some code
K3n.
Re API (and ABI) design and maintenance
Posted Nov 15, 2012 17:15 UTC (Thu) by nix (subscriber, #2304)
Re API (and ABI) design and maintenance
Posted Nov 19, 2012 6:44 UTC (Mon) by k3ninho (subscriber, #50375)
K3n.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 24, 2012 4:25 UTC (Wed) by daniels (subscriber, #16193)
the developers here come out as bumbling amateurs
Err? They made a mistake, which in the context of something as complex and genuinely impressive as the Linux kernel, can be forgiven. How many kernels which scale from embedded to clusters have you written?
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 24, 2012 10:40 UTC (Wed) by Yorick (guest, #19241)
When we suffer from badly thought-out APIs made by Microsoft, say, we pour scorn on the developers who designed the mess and call them incompetent, fairly or not. Linux kernel programmers are not exempt and should not be.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 24, 2012 10:51 UTC (Wed) by mkerrisk (subscriber, #1978)
That comes out as sloppy, but is really more a sign of a development process in the need of improvement.
Yes. Having watched what goes on for quite a while now, I consider this mainly a process problem, rather than a problem of individual developers (though obviously some do a better job than others).
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 25, 2012 9:45 UTC (Thu) by dgm (subscriber, #49227)
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 24, 2012 16:30 UTC (Wed) by mjthayer (guest, #39183)
I am sure that there is a good reason why it won't work, but couldn't the original problem be solved, in user space, if the user space file descriptor cache included not just a "should be deleted" flag, but also a reference count of threads currently using a file descriptor? Then, before accessing the descriptor, a thread could check the "should be deleted" flag and if it is set decrease the reference count instead of accessing it, freeing the resources if the count reached zero.
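A minimal sketch of the scheme proposed in this comment, using C11 atomics, might look like the following. All names here (struct fd_entry, entry_new(), entry_get(), entry_put(), entry_may_use(), entry_mark_deleted()) are hypothetical, and the sketch assumes that the cache itself holds one reference. It does not address how a thread safely obtains its reference in the first place relative to epoll_wait(), which is the hard part of the original problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical cache entry for one file descriptor. */
struct fd_entry {
    int fd;
    atomic_int refs;            /* one for the cache, plus one per user */
    atomic_bool should_delete;  /* set when deletion has been requested */
};

struct fd_entry *entry_new(int fd)
{
    struct fd_entry *e = malloc(sizeof(*e));

    if (e == NULL)
        return NULL;
    e->fd = fd;
    atomic_init(&e->refs, 1);   /* the cache's own reference */
    atomic_init(&e->should_delete, false);
    return e;
}

/* Take a reference on behalf of a thread about to handle the entry. */
void entry_get(struct fd_entry *e)
{
    atomic_fetch_add(&e->refs, 1);
}

/* Drop a reference; returns true if this call freed the entry. */
bool entry_put(struct fd_entry *e)
{
    if (atomic_fetch_sub(&e->refs, 1) == 1) {
        free(e);
        return true;
    }
    return false;
}

/* A thread checks this before touching e->fd. */
bool entry_may_use(struct fd_entry *e)
{
    return !atomic_load(&e->should_delete);
}

/* Request deletion and drop the cache's reference. */
bool entry_mark_deleted(struct fd_entry *e)
{
    atomic_store(&e->should_delete, true);
    return entry_put(e);
}

/* Usage sketch, run on one thread for clarity. */
void refcount_demo(void)
{
    struct fd_entry *e = entry_new(5);
    assert(e != NULL);

    entry_get(e);                        /* a worker is handed the entry */
    bool freed = entry_mark_deleted(e);  /* deletion is requested */
    assert(!freed);                      /* the worker still holds a reference */

    bool usable = entry_may_use(e);
    assert(!usable);                     /* the worker sees the flag and skips the fd */

    freed = entry_put(e);
    assert(freed);                       /* the last reference frees the entry */
}
```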
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 25, 2012 0:02 UTC (Thu) by kjp (guest, #39639)
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 25, 2012 21:57 UTC (Thu) by Jandar (subscriber, #85683)
The solution using the cookie is simple and elegant. I don't understand the comments about not using the cookie because someone would like to use it otherwise. This line of reasoning means nobody should use the cookie.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 25, 2012 22:03 UTC (Thu) by dlang (guest, #313)
It shows you comments for all stories that you haven't read since the last time you viewed that page (and that you haven't read by going to the specific article page).
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 26, 2012 0:21 UTC (Fri) by Jandar (subscriber, #85683)
I read LWN in the "One big page" mode and go to the comments (in a new tab) with the "Comments (xxx posted)" button. After I have read the comments for one article I close the tab. Is there something I can do to make Comments/unread more useful?
What I would really like would be some means to hop from one unread comment to the next within the complete comment-section to see the surrounding context. It could be a link at each unread comment pointing to an anchor at the next. E.g. http://lwn.net/Articles/520198/#Comments-UnRead42.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 26, 2012 0:46 UTC (Fri) by dlang (guest, #313)
However, I suspect that if you go to the unread page again, you will not see all those comments any more and you will find it much more useful.
You should also look at the greasemonkey script for lwn; I think it does more of what you are looking for.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 26, 2012 1:22 UTC (Fri) by Jandar (subscriber, #85683)
I use konqueror not firefox but greasemonkey with fancyLWNComments seems to be a reason to switch for reading LWN. Thanks for pointing me to it.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 26, 2012 5:01 UTC (Fri) by dirtyepic (guest, #30178)
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 26, 2012 6:00 UTC (Fri) by dlang (guest, #313)
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 26, 2012 17:03 UTC (Fri) by Jandar (subscriber, #85683)
Greasemonkey's fancyLWNComments is the first working method to tell read and unread apart.
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 27, 2012 2:54 UTC (Sat) by dirtyepic (guest, #30178)
EPOLL_CTL_DISABLE, epoll, and API design
Posted Oct 25, 2012 9:05 UTC (Thu) by ncm (subscriber, #165)
I'm also much more impressed with kjp's solution than is our esteemed author. It attacks the problem at the root, even enabling rescue of such poorly architected designs (anyway until the next flaw uncloaks). Using the field suggested doesn't "burn" it: other uses can piggyback on the same hash node. By such reasoning any use at all would burn it, so no use can be deserving enough, and it never gets used for anything.
Unless I am misunderstanding the argument...
EPOLL_CTL_DISABLE, epoll, and API design
Posted Nov 1, 2012 0:45 UTC (Thu) by kevinm (guest, #69913)
By now, the time seems ripe for there to be a kernel-user-space API maintainer—someone whose job it is to actively review and ack every kernel-user-space API change, and to ensure that test cases and sufficient documentation are supplied with the implementation of those changes.
Hark, is that the sound of volunteering? ;)
EPOLL_CTL_DISABLE, epoll, and API design
Posted Nov 1, 2012 13:19 UTC (Thu) by Karellen (subscriber, #67644)
