LWN: Comments on "EPOLL_CTL_DISABLE and multithreaded applications" https://lwn.net/Articles/520012/ This is a special feed containing comments posted to the individual LWN article titled "EPOLL_CTL_DISABLE and multithreaded applications". en-us Sun, 12 Oct 2025 10:05:56 +0000 Sun, 12 Oct 2025 10:05:56 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/521994/ https://lwn.net/Articles/521994/ normalperson <div class="FormattedComment"> I agree ONESHOT is awesome for MT and probably should've been the default.<br> <p> epoll_wait() already returns ONESHOT events in FIFO order based on my reading of fs/eventpoll.c (and my own testing/experience). kqueue also seems to return in FIFO order with ONESHOT.<br> <p> I wrote a server based on this behavior (along with concurrent calls to epoll_wait(...,maxevents=1,...)) for getting fair distribution between threads stuck in I/O wait a while back: <a href="http://bogomips.org/cmogstored.git">http://bogomips.org/cmogstored.git</a><br> <p> </div> Tue, 30 Oct 2012 00:23:12 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/521779/ https://lwn.net/Articles/521779/ happynut <div class="FormattedComment"> <p> I mean "fully processed" by the kernel, which is really the only issue; the application can (and indeed: must, even with the proposed EPOLL_CTL_DISABLE change) control its own concurrency issues with its own locks.<br> <p> The issue is that the kernel and app are running asynchronously with an implicit race condition around removing file descriptors from epoll; sending a notification through the normal epoll mechanism that the kernel is done should be enough to allow both sides of the API to run asynchronously.<br> </div> Sun, 28 Oct 2012 18:57:03 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/521773/ https://lwn.net/Articles/521773/ kjp <div class="FormattedComment"> My comment was 
imprecise at best. I'll clarify what I think you are doing:<br> <p> Thread 1 decides the fd is no longer needed, due to no events<br> Thread 2 gets a wakeup for a real event, but is then scheduled out and does not progress<br> Thread 1 deletes the socket from epoll, marks fd as needing deletion, and signals via a pipe.<br> Another thread 3 then reads the pipe and deletes the fd<br> <p> That does nothing to address the race with thread 2. There's still a race; all you've added is, in essence, a sleep() that delays things. (Like the arbitrary-delay solution the article mentioned.) <br> <p> <p> </div> Sun, 28 Oct 2012 17:59:27 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/521742/ https://lwn.net/Articles/521742/ kjp <div class="FormattedComment"> It sounds like the issue is a timeout case. The diagram shows one thread sees the file descriptor as not ready (no events) and decides to delete it. But, then suddenly an event for it comes in and starts processing on another thread. I don't see how your solution addresses that. Your pipe wakeup could happen at the same time as a 'real' socket wakeup event. <br> </div> Sun, 28 Oct 2012 14:49:04 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/521741/ https://lwn.net/Articles/521741/ kjp <div class="FormattedComment"> So both userspace and the kernel are modifying the reference count? I'm confused.<br> </div> Sun, 28 Oct 2012 14:39:44 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/521740/ https://lwn.net/Articles/521740/ kjp <div class="FormattedComment"> How do you know what 'fully processed' means? Another thread could have been 'woken up' by the kernel, but hasn't gotten around to looking in your internal structures. 
If another thread gets the 'delete processed' event, it could delete the data structure prematurely.<br> </div> Sun, 28 Oct 2012 14:38:51 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/521683/ https://lwn.net/Articles/521683/ runciter <div class="FormattedComment"> This is nonsense. The deleting thread should just mark the cache data for that fd as "ready for deletion" and interrupt the epoll_wait (using a write to a pipe monitored by epoll, for example). The thread doing epoll_wait() can then synchronously release the resources. You'll need a mutex for the "ready-for-deletion" flag, but you need it for the "exists" or "ready" flags anyway. It's just a matter of checking the flags: the deleting thread checks "ready" before deleting; the epoll_wait() thread checks "ready for deletion" before updating "ready". With a mutex in place there is no race.<br> <p> I don't get the point about losing data at all. You've decided to destroy the userspace cache entry *first*, before epoll_ctl() returned. 
Data will be lost either way.<br> </div> Sat, 27 Oct 2012 14:16:25 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/521318/ https://lwn.net/Articles/521318/ happynut <div class="FormattedComment"> Perhaps I'm missing something, but it seems like the proposed solution is to add a synchronous call to control an asynchronous queue.<br> <p> Couldn't this be solved with a flag (or an alternate version) of EPOLL_CTL_DEL to add an event to the queue reporting that the delete has been fully processed?<br> <p> The caller of epoll_wait() could then clean up the application's remaining data structures, with no new locks required.<br> </div> Thu, 25 Oct 2012 16:32:00 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/521156/ https://lwn.net/Articles/521156/ cyanit <div class="FormattedComment"> Other solution: pass a pointer to a userspace reference count in a new EPOLL_CTL_ADD_RC, which is incremented in the kernel under the epoll lock when an event concerning that fd is returned to userspace.<br> <p> This way, after EPOLL_CTL_DEL either the fd will never be returned or the reference count has been raised already.<br> <p> Userspace just needs to be changed to use EPOLL_CTL_ADD_RC and to decrement the reference count after it finishes processing the event, and delete the fd data if it goes to zero either at that point or after EPOLL_CTL_DEL.<br> <p> </div> Thu, 25 Oct 2012 05:14:24 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520547/ https://lwn.net/Articles/520547/ wahern <div class="FormattedComment"> ONESHOT is the obvious and easiest way to handle lockless multithreaded processing of an epoll queue. In fact, on Solaris ONESHOT is the only option. There are no persistent events. 
The kernel, of course, is free to optimize for persistence, but userspace threads don't need to worry about a loaded gun lying around.<br> <p> Edge-triggered signaling only provides a nominal percentage improvement in performance. If you're already going multithreaded and attempting lockless, then you're already massively multicore. Why bother adding all the complexity of edge-triggered events? Also, it's worth pointing out that with *BSD kqueue, ONESHOT automatically removes the descriptor from the queue. If epoll followed this excellent example then using ONESHOT would be the end of the story.<br> <p> Seems to me the simplest solution to starvation is to ask the kernel to return events in FIFO order, i.e. the last one installed will also be the last in the next reported pending queue. That way you can use ONESHOT and still guarantee that there's only ever a single owner of the object, i.e. exactly one of the threads or the kernel.<br> <p> For the early termination cases (e.g. a second thread walking a shared queue and destroying sockets), just call shutdown on the socket and let the kernel report it via the normal queue processing.<br> <p> The root of the problem here is that people want to use both a message passing pattern via epoll messaging, as well as allow arbitrary threads to jump into the fray and manipulate shared contexts. That's just asking for trouble.<br> <p> <p> </div> Fri, 19 Oct 2012 21:39:03 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520504/ https://lwn.net/Articles/520504/ mkerrisk <blockquote> But if I had multiple threads calling epoll, I think the solution I outlined would work fine. I don't see what else I would need the cookie for... as long as it's still doing its job, pointing to a real data structure of mine. </blockquote> <p> Yes, but other people may want to use the cookie in a quite different way, and it seems a shame to limit the generality of the API by requiring it to be used for this task. 
<blockquote> And the nice thing is, it works with all epoll modes. What's very distasteful about the kernel patch is that it requires ONESHOT (yuck!). </blockquote> <p> Yes, requiring the use of EPOLLONESHOT is rather unfortunate. I strongly suspect that there could be a solution quite similar to the EPOLL_CTL_DISABLE approach that doesn't require EPOLLONESHOT. I have something in mind, but I need to think about it a little more. Fri, 19 Oct 2012 15:35:35 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520503/ https://lwn.net/Articles/520503/ kjp <div class="FormattedComment"> To clarify, my solution is all user space, no kernel changes. It's just changing the kernel from holding a 'strong reference' to a 'weak reference'. (You know what they say about adding a layer of indirection...)<br> <p> When I did epoll, I used it in edge-triggered mode (and also not 'oneshot') and had a single thread processing the epoll events and scheduling workers. I had worker threads that just pumped data to and from the kernel. I put direct pointers in the epoll data (i.e. strong references) to my internal structures since I had only one thread calling epoll. <br> <p> But if I had multiple threads calling epoll, I think the solution I outlined would work fine. I don't see what else I would need the cookie for... as long as it's still doing its job, pointing to a real data structure of mine.<br> <p> And the nice thing is, it works with all epoll modes. What's very distasteful about the kernel patch is that it requires ONESHOT (yuck!). <br> <p> The epoll designer(s) had their thinking caps on with this API. Storing arbitrary cookies + edge-triggered mode = Insanely good.<br> </div> Fri, 19 Oct 2012 15:25:30 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520495/ https://lwn.net/Articles/520495/ mkerrisk <div class="FormattedComment"> Thanks for the kind words. 
I *do* wish I'd thought of the diagram at the time I wrote TLPI, though. I love having good diagrams...<br> </div> Fri, 19 Oct 2012 14:33:16 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520490/ https://lwn.net/Articles/520490/ mkerrisk <blockquote> Better solution: since epoll lets you store a generic 64 bit cookie, just use a 64 bit sequence that increments for each new file descriptor. </blockquote> <p> I haven't thought through your solution very far, but it seems unfortunate to have to chew up the cookie to solve this problem. User space might very well want to use the epoll_event.data field for other purposes. Fri, 19 Oct 2012 14:05:25 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520483/ https://lwn.net/Articles/520483/ pbonzini <div class="FormattedComment"> That pretty much boils down to delaying the deletion of the items to a moment where all epoll_waits have been done (since epoll_wait is an RCU quiescence point).<br> <p> An efficient solution for an arbitrary number of epoll_wait threads can be implemented even in userspace and without using a full-blown RCU.<br> <p> Equip each thread with a) an id or something else that lets each thread refer to "the next" thread; b) a list of "items waiting to be deleted". Then the deleting thread adds the item being deleted to the first thread's list. Before executing epoll_wait, thread K empties its list and "passes the buck", appending the old contents of its list to that of thread K+1. This is an O(1) operation no matter how many items are being deleted; only Thread N, being the last thread, actually has to go through the list and delete the items.<br> </div> Fri, 19 Oct 2012 12:41:32 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520476/ https://lwn.net/Articles/520476/ nix <div class="FormattedComment"> Now, how do I automatically add this excellent documentation to my copy of _The Linux Programming Interface_? 
The paper just won't update properly! :}<br> </div> Fri, 19 Oct 2012 10:47:09 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520449/ https://lwn.net/Articles/520449/ kjp <div class="FormattedComment"> Better solution: since epoll lets you store a generic 64 bit cookie, just use a 64 bit sequence that increments for each new file descriptor. In a hash table, store the cookie -&gt; fd mapping. The hash table should be thread-safe but still scalable, and you could have a ref count too. So all wakeups from epoll need to check the hash table to see if the fd still exists, and check it out (bump refcount) if so.<br> <p> So unless you need to process more than 2^63 sockets...<br> <p> <p> <p> </div> Fri, 19 Oct 2012 02:25:07 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520418/ https://lwn.net/Articles/520418/ mhelsley <div class="FormattedComment"> Rather than a mutex guarding the epoll fd (and thus the interest set), and rather than EPOLL_CTL_DISABLE, could userspace RCU be used to protect the shared resources of the set until they are unused? I haven't fully thought it through but if it works then that might be another scalable solution which is usable "today".<br> </div> Thu, 18 Oct 2012 20:49:53 +0000 EPOLL_CTL_DISABLE and multithreaded applications https://lwn.net/Articles/520360/ https://lwn.net/Articles/520360/ corbet Since this article was published, it has <a rel="nofollow" href="https://lwn.net/Articles/520358/">become clear</a> that, thanks partially to Michael's questions, this API is likely to be changed by the final 3.7 release. Stay tuned. Thu, 18 Oct 2012 16:00:42 +0000