LWN.net Logo

Choosing between portability and innovation

Choosing between portability and innovation

Posted Mar 3, 2011 3:25 UTC (Thu) by foom (subscriber, #14868)
In reply to: Choosing between portability and innovation by wahern
Parent article: Choosing between portability and innovation

And epoll certainly has a *HUGE* misdesign in it, that anyone who actually understood what a file descriptor is should've seen coming. But if you look back in the history of epoll, you'll see that it looks like the implementors apparently didn't understand the difference between file descriptors and file descriptions. :(

epoll_ctl lets you register a file descriptor for monitoring. However, it *actually* registers the underlying file description for monitoring. But then it remembers in kernel-land the file descriptor number you passed in and reports all events for the file descriptor with that file descriptor number you passed in originally. No matter what you do to the file descriptor afterwards.

Furthermore, it has an automatic deregistration feature: if you close an fd, it'll stop monitoring it automatically...at least, that would make sense. So, no, it doesn't *really* do that. What it really has is automatic deregistration when the file description is closed (that is: when *all* file descriptors referencing the file description are closed). That's just a pain in the ass!

So, if you close an fd, but have in the meantime have fork()'d, dup()'d the fd to another fd, or something of that nature, epoll will continue reporting events with the number of a closed fd. And sometimes you can't even remove it from the set, since the fd number it's reporting back isn't the right file anymore!

So if you're designing a library where user code might close an fd out from under you (not an unreasonable thing to support), you need to have special case code to go back and recreate the epoll set from scratch in case you start getting bogus responses to workaround this misdesign. Quite obviously, epoll ought to work solely on fds: if you register fd 3, start watching fd 3, dup fd 3 to fd 4, and then close fd 3, epoll should not report events anymore. That would be sane. But nope, it doesn't work that way. Mutter mutter.


(Log in to post comments)

Choosing between portability and innovation

Posted Mar 3, 2011 16:53 UTC (Thu) by nix (subscriber, #2304) [Link]

Good grief. That's appalling. And you can't fix it without breaking anyone who *expects* this stuff to work across fork() (though I'd be inclined to call such code broken).

Choosing between portability and innovation

Posted Mar 6, 2011 0:27 UTC (Sun) by Tet (subscriber, #5433) [Link]

This is one case where the pain of breaking compatibility with existing code is probably less than the pain of living with a broken design.

Choosing between portability and innovation

Posted Mar 6, 2011 4:30 UTC (Sun) by nix (subscriber, #2304) [Link]

Quite (though it can be used properly, it's definitely in the negative half of that API correctness thing gregkh(?) put up a while back: an API for which the obvious use is wrong).

Thankfully epoll() is not widely used yet, and being Linux-specific all the users can fall back to other mechanisms. (However, the old syscall would have to stay, anyway: we'd need an epoll2() syscall that did things right.)

Choosing between portability and innovation

Posted Mar 6, 2011 6:41 UTC (Sun) by jthill (guest, #56558) [Link]

I wouldn't knock epoll, myself. I recognize the design. It seems odd to me that over the last few days I've run across a spate of gripes about epoll in various places, all of them based on misconceptions like the ones in this thread. It isn't just the original complaint (since retracted), it's the remaining gripes too.

There's an unavoidable race between event reporting and any close processing that's most efficiently handled by keeping your own books. It's not so hard to EPOLL_CTL_DEL before you close(), and defer event-tracking cleanup rcu-style until after a nonblocking epoll_wait returns 0, to catch events reported between your decision to close and the actual close.

Note that nothing you can do avoids that race. Having close do the deletion processing will not save you: the event may have already been posted.

IBM don't want their manuals accessed by non-customers so I won't link the description, but epoll is down there in big-O range of mainframe-style event reporting. The main difference I can see is that on the mainframe, the table is in your address space and sized by how many events can get backed up (posted but as-yet unhandled) before something is badly wrong -- for anything but large-scale work you just size it to handle everything anyway.

I notice that nothing in the reporting api constrains epoll to file-based events. Tying any asynchronous event at all to epoll should be possible. Timers come to mind, of course. epoll_pwait handles signals. Me, I think it'd be good to have an epolled_fork, so child termination comes in as just another thing on your events list.

Choosing between portability and innovation

Posted Mar 4, 2011 21:04 UTC (Fri) by nevyn (subscriber, #33129) [Link]

> And epoll certainly has a *HUGE* misdesign in it [...]
> epoll_ctl lets you register a file descriptor for monitoring. However,
> it *actually* registers the underlying file description for monitoring.
> But then it remembers in kernel-land the file descriptor number you passed
> in and reports all events for the file descriptor with that file
> descriptor number you passed in originally.
> No matter what you do to the file descriptor afterwards.

That's a pretty misleading description. You register for events and give the kernel a pointer or a number as your callback data ... you then get that piece of data and a set of flags, when something happens.

If you pass in your "original fd number" as the data you want back, that's what you get. Personally, when I've used it, I used a pointer to a struct as the callback data.

Choosing between portability and innovation

Posted Mar 4, 2011 21:45 UTC (Fri) by foom (subscriber, #14868) [Link]

No, it really wasn't misleading. The userspace-visible key *is* the fd. You're required to pass that in, that's what epoll uses to look up the file object, it will always pass that back, and that's what you have to pass to epoll_ctl for further modifications/deletions of the watch. You can of course *in addition* register arbitrary user data that it'll pass back to you. But that's optional. The fd is required.

Segue into another rant of mine....that's yet another reason why the auto-removal-upon-file-close functionality is broken. There's no way to get a notification that epoll has autoremoved an entry from the epoll set, so, it's not actually possible for you to ever *free* the data (and other resources) you allocated and passed into epoll_ctl. Sigh. (same bug exists with kqueue API btw).

Choosing between portability and innovation

Posted Mar 4, 2011 22:11 UTC (Fri) by nevyn (subscriber, #33129) [Link]

> No, it really wasn't misleading. The userspace-visible key *is* the fd.
> You're required to pass that in, that's what epoll uses to look up the
> file object, it will always pass that back

Yes, you need an fd to register/change the epoll events. But it isn't ever passed back:

epoll_wait() returns a list of "struct epoll_event", which is defined as:

struct epoll_event
{
uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
} __attribute__ ((__packed__));

...epoll_data_t is the union I talked about before, so you can use a pointer or a number for your callback data. You don't get the fd back from the API, unless you use that as your callback data but if you do that it's just "a number" ... so I wouldn't expect the kernel to do anything special with it.

Choosing between portability and innovation

Posted Mar 4, 2011 23:08 UTC (Fri) by foom (subscriber, #14868) [Link]

Okay, I must apologize, you are completely right.

I was getting myself mixed up with kqueue's API (which actually does have you specify a struct containing both the fd and userdata), and then mis-reread the epoll docs. (As you might be able to tell, I always passed the fd as the user data. :))

However, that doesn't change the main point I wanted to make, which is that it ought to be tracking fds internally instead of files -- I don't want it to do anything special with the value it returns, I want it to stop watching (and tell me that it did stop watching), if the *fd* I originally gave is closed, not if the underlying file is closed. Returning events on a file that I no longer have a handle for in the current process is annoying.

Choosing between portability and innovation

Posted Mar 6, 2011 19:04 UTC (Sun) by intgr (subscriber, #39733) [Link]

> However, that doesn't change the main point I wanted to make, which is
> that it ought to be tracking fds internally instead of files

> I want it to stop watching (and tell me that it did stop watching), if
> the *fd* I originally gave is closed, not if the underlying file is closed

Seriously, if you want it to stop watching, you use EPOLL_CTL_DEL before closing a file descriptor. The implicit deregister-on-close() is just a safety net -- because it's the only sane thing that the kernel can do when you have watches on a file that's being closed. It's not intended to be used that way.

You tried to spin it as a "*HUGE* misdesign", but in practice it's just a trivial edge case that shouldn't affect real applications. Wouldn't be the first time *BSD people criticize parts of Linux they don't actually understand.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds