The problem with BSD cooperation is that Linux (and Linus) don't like to cooperate. They like to experiment and do their own thing. Look at epoll+timerfd+signalfd+(dnotify/fanotify), etc versus kqueue. BSDs had that functionality for a decade while Linux experimented and went their own way.
Linux is about features, while BSD emphasizes design. This makes the BSDs conservative adopters; not because they don't want to adopt the feature, but because they're concerned about the API, and the littlest doubt about an API will prevent adoption of the feature.
This might be a result of the fact that Linux has a mob of people jockeying to rewrite huge subsystems at the drop of a hat, so that the quality of APIs matters less. (As for backwards compatibility, Linux has supported several competing interfaces simultaneously.)
All of this is a difference of degree, though. Of course Linux developers are concerned about design; but bad design isn't as costly, so there are more risks taken. Or like with the case of epoll, what matters is getting feature X out before even worrying about feature Y; let the APIs accrete just like the features.
Posted Mar 3, 2011 3:25 UTC (Thu) by foom (subscriber, #14868)
[Link]
And epoll certainly has a *HUGE* misdesign in it, that anyone who actually understood what a file descriptor is should've seen coming. But if you look back in the history of epoll, you'll see that it looks like the implementors apparently didn't understand the difference between file descriptors and file descriptions. :(
epoll_ctl lets you register a file descriptor for monitoring. However, it *actually* registers the underlying file description for monitoring. But then it remembers in kernel-land the file descriptor number you passed in and reports all events for the file descriptor with that file descriptor number you passed in originally. No matter what you do to the file descriptor afterwards.
Furthermore, it has an automatic deregistration feature: if you close an fd, it'll stop monitoring it automatically...at least, that would make sense. So, no, it doesn't *really* do that. What it really has is automatic deregistration when the file description is closed (that is: when *all* file descriptors referencing the file description are closed). That's just a pain in the ass!
So, if you close an fd, but have in the meantime have fork()'d, dup()'d the fd to another fd, or something of that nature, epoll will continue reporting events with the number of a closed fd. And sometimes you can't even remove it from the set, since the fd number it's reporting back isn't the right file anymore!
So if you're designing a library where user code might close an fd out from under you (not an unreasonable thing to support), you need to have special case code to go back and recreate the epoll set from scratch in case you start getting bogus responses to workaround this misdesign. Quite obviously, epoll ought to work solely on fds: if you register fd 3, start watching fd 3, dup fd 3 to fd 4, and then close fd 3, epoll should not report events anymore. That would be sane. But nope, it doesn't work that way. Mutter mutter.
Choosing between portability and innovation
Posted Mar 3, 2011 16:53 UTC (Thu) by nix (subscriber, #2304)
[Link]
Good grief. That's appalling. And you can't fix it without breaking anyone who *expects* this stuff to work across fork() (though I'd be inclined to call such code broken).
Choosing between portability and innovation
Posted Mar 6, 2011 0:27 UTC (Sun) by Tet (subscriber, #5433)
[Link]
This is one case where the pain of breaking compatibility with existing code is probably less than the pain of living with a broken design.
Choosing between portability and innovation
Posted Mar 6, 2011 4:30 UTC (Sun) by nix (subscriber, #2304)
[Link]
Quite (though it can be used properly, it's definitely in the negative half of that API correctness thing gregkh(?) put up a while back: an API for which the obvious use is wrong).
Thankfully epoll() is not widely used yet, and being Linux-specific all the users can fall back to other mechanisms. (However, the old syscall would have to stay, anyway: we'd need an epoll2() syscall that did things right.)
Choosing between portability and innovation
Posted Mar 6, 2011 6:41 UTC (Sun) by jthill (guest, #56558)
[Link]
I wouldn't knock epoll, myself. I recognize the design. It seems odd to me that over the last few days I've run across a spate of gripes about epoll in various places, all of them based on misconceptions like the ones in this thread. It isn't just the original complaint (since retracted), it's the remaining gripes too.
There's an unavoidable race between event reporting and any close processing that's most efficiently handled by keeping your own books. It's not so hard to EPOLL_CTL_DEL before you close(), and defer event-tracking cleanup rcu-style until after a nonblocking epoll_wait returns 0, to catch events reported between your decision to close and the actual close.
Note that nothing you can do avoids that race. Having close do the deletion processing will not save you: the event may have already been posted.
IBM don't want their manuals accessed by non-customers so I won't link the description, but epoll is down there in big-O range of mainframe-style event reporting. The main difference I can see is that on the mainframe, the table is in your address space and sized by how many events can get backed up (posted but as-yet unhandled) before something is badly wrong -- for anything but large-scale work you just size it to handle everything anyway.
I notice that nothing in the reporting api constrains epoll to file-based events. Tying any asynchronous event at all to epoll should be possible. Timers come to mind, of course. epoll_pwait handles signals. Me, I think it'd be good to have an epolled_fork, so child termination comes in as just another thing on your events list.
Choosing between portability and innovation
Posted Mar 4, 2011 21:04 UTC (Fri) by nevyn (subscriber, #33129)
[Link]
> And epoll certainly has a *HUGE* misdesign in it [...]
> epoll_ctl lets you register a file descriptor for monitoring. However,
> it *actually* registers the underlying file description for monitoring.
> But then it remembers in kernel-land the file descriptor number you passed
> in and reports all events for the file descriptor with that file
> descriptor number you passed in originally.
> No matter what you do to the file descriptor afterwards.
That's a pretty misleading description. You register for events and give the kernel a pointer or a number as your callback data ... you then get that piece of data and a set of flags, when something happens.
If you pass in your "original fd number" as the data you want back, that's what you get. Personally, when I've used it, I used a pointer to a struct as the callback data.
Choosing between portability and innovation
Posted Mar 4, 2011 21:45 UTC (Fri) by foom (subscriber, #14868)
[Link]
No, it really wasn't misleading. The userspace-visible key *is* the fd. You're required to pass that in, that's what epoll uses to look up the file object, it will always pass that back, and that's what you have to pass to epoll_ctl for further modifications/deletions of the watch. You can of course *in addition* register arbitrary user data that it'll pass back to you. But that's optional. The fd is required.
Segue into another rant of mine....that's yet another reason why the auto-removal-upon-file-close functionality is broken. There's no way to get a notification that epoll has autoremoved an entry from the epoll set, so, it's not actually possible for you to ever *free* the data (and other resources) you allocated and passed into epoll_ctl. Sigh. (same bug exists with kqueue API btw).
Choosing between portability and innovation
Posted Mar 4, 2011 22:11 UTC (Fri) by nevyn (subscriber, #33129)
[Link]
> No, it really wasn't misleading. The userspace-visible key *is* the fd.
> You're required to pass that in, that's what epoll uses to look up the
> file object, it will always pass that back
Yes, you need an fd to register/change the epoll events. But it isn't ever passed back:
epoll_wait() returns a list of "struct epoll_event", which is defined as:
struct epoll_event
{
uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
} __attribute__ ((__packed__));
...epoll_data_t is the union I talked about before, so you can use a pointer or a number for your callback data. You don't get the fd back from the API, unless you use that as your callback data but if you do that it's just "a number" ... so I wouldn't expect the kernel to do anything special with it.
Choosing between portability and innovation
Posted Mar 4, 2011 23:08 UTC (Fri) by foom (subscriber, #14868)
[Link]
Okay, I must apologize, you are completely right.
I was getting myself mixed up with kqueue's API (which actually does have you specify a struct containing both the fd and userdata), and then mis-reread the epoll docs. (As you might be able to tell, I always passed the fd as the user data. :))
However, that doesn't change the main point I wanted to make, which is that it ought to be tracking fds internally instead of files -- I don't want it to do anything special with the value it returns, I want it to stop watching (and tell me that it did stop watching), if the *fd* I originally gave is closed, not if the underlying file is closed. Returning events on a file that I no longer have a handle for in the current process is annoying.
Choosing between portability and innovation
Posted Mar 6, 2011 19:04 UTC (Sun) by intgr (subscriber, #39733)
[Link]
> However, that doesn't change the main point I wanted to make, which is
> that it ought to be tracking fds internally instead of files
> I want it to stop watching (and tell me that it did stop watching), if
> the *fd* I originally gave is closed, not if the underlying file is closed
Seriously, if you want it to stop watching, you use EPOLL_CTL_DEL before closing a file descriptor. The implicit deregister-on-close() is just a safety net -- because it's the only sane thing that the kernel can do when you have watches on a file that's being closed. It's not intended to be used that way.
You tried to spin it as a "*HUGE* misdesign", but in practice it's just a trivial edge case that shouldn't affect real applications. Wouldn't be the first time *BSD people criticize parts of Linux they don't actually understand.
Choosing between portability and innovation
Posted Mar 3, 2011 20:24 UTC (Thu) by clump (subscriber, #27801)
[Link]
The problem with BSD cooperation is that Linux (and Linus) don't like to cooperate.
The article does a good job of articulating issues with resource constraints and rapid pace. You've mentioned functionality, but nothing to support the accusation that "Linux" doesn't like to cooperate.
Choosing between portability and innovation
Posted Mar 4, 2011 14:37 UTC (Fri) by trasz (guest, #45786)
[Link]
Kqueue vs epoll is a pretty good example of lack of willingness to coooperate - kqueue was earlier, yet Linux folks chose to implement something incompatible and functionally inferior.
Choosing between portability and innovation
Posted Mar 4, 2011 20:59 UTC (Fri) by nevyn (subscriber, #33129)
[Link]
kqueue is a _huge_ interface, combining a number of ideas, you basically need to design your kernel around it if you want users to use it. And much like SYSV streams, expecting other people to reimplement a giant redesign like that is often expecting _way_ too much.
On the other side, timerfd(), socketfd() and epoll() are all there to do a single thing. FreeBSD did the retarded thing when they reimplemented sendfile() roughly a week after Linux added it. TCP_CORK is still not implemented in FreeBSD. mremap() was NAKd, there's the whole MMAP_ANON vs. MMAP_ANONYMOUS, or the weird bits of mmap in general.
Cooperation is much more like to happen in the FreeBSD => Linux direction, IMO ... but that could also just be the much bigger developer pool.
Choosing between portability and innovation
Posted Mar 9, 2011 14:48 UTC (Wed) by mheily (guest, #27123)
[Link]
Actually, you don't need to redesign your kernel to implement the kqueue API on Linux. The libkqueue project provides a userspace wrapper library that translates each kevent() call into the equivalent epoll/inotify/timerfd/signalfd/etc call for Linux. On Solaris, it uses the event port framework. On Windows, it will use the WaitForMultipleObjects() function.
(Disclaimer: I am the main author of libkqueue)
Choosing between portability and innovation
Posted Mar 9, 2011 20:52 UTC (Wed) by nevyn (subscriber, #33129)
[Link]
My biggest concern with doing something like that would be how efficient it is compared to using the native interface (and why I said you'd need to redesign the kernel, so that it could be implemented efficiently), the only benchmark you have is vs. poll() (and uses ab) ... which is pretty sad.
epoll <=> kqueue is probably the best case test too, to be convincing you'd want something that benchmarked EVFILT_VNODE/SIGNAL/TIMER at least. To be really convincing you'd want PROC/USER/AIO and play with the EV_* flags.
Choosing between portability and innovation
Posted Mar 20, 2012 20:07 UTC (Tue) by scientes (guest, #83068)
[Link]
> Linux is about features, while BSD emphasizes design.
While I can't exactly agree with this flame, I will note that this very thing your are describing with kqueue/epoll also happened with /dev/cryto which was on OpenBSD and there even existed a Linux patch, and then Linux did their own thing with AF_ALG. (now, they did do some benchmarking...)