Rethinking the futex API
There is a wide range of synchronization options within the kernel, but there have always been fewer options in user space. For years, the only real option was System V semaphores, but it is fair to say that they have never been universally loved. Developers have long wanted a mutex-like option for user space that does not kill performance.
Back in 2002, Rusty Russell proposed a fast user-space mutex mechanism that quickly became known as a "futex"; this feature was present in the 2.6.0 kernel release at the end of 2003 and immediately used to control concurrency for POSIX threads. The initial implementation was just a few hundred lines of code. At its core, a futex is a 32-bit word of memory shared between cooperating processes; a value of one indicates that the futex is available, while anything else marks it as unavailable. A process wishing to acquire a futex will issue a locked decrement instruction, then verify that the resulting value was zero; if so, the acquisition was successful and execution can continue. Releasing the futex is a simple matter of incrementing its value again.
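As a rough sketch of that fast path (using GCC-style atomic builtins; this illustrates the scheme just described rather than any particular C-library implementation):

    /* The futex word lives in memory shared by all participants; 1 means "free". */
    static int futex_word = 1;

    static int futex_try_acquire(int *f)
    {
        /* Locked decrement; if the result is zero, we now own the futex. */
        return __atomic_sub_fetch(f, 1, __ATOMIC_ACQUIRE) == 0;
    }

    static void futex_release(int *f)
    {
        /* Make the futex available again; the contended case, where
         * waiters must be woken, is what the kernel gets involved in. */
        __atomic_add_fetch(f, 1, __ATOMIC_RELEASE);
    }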
The nice thing about futexes as described so far is that the kernel is not involved in their operation at all; futexes can be acquired and released without the need to make system calls. That cannot be sustained if there is contention for the futex, though; at that point, a task will have to block to wait for the futex to become available. That is where the futex() system call comes in:
    int futex(int *uaddr, int futex_op, int val,
              const struct timespec *timeout,   /* or: uint32_t val2 */
              int *uaddr2, int val3);
The initial futex() implementation had only two arguments: uaddr (the address of the futex) and futex_op, which would be either +1 to increment the futex, or -1 to decrement it. The modern equivalents for futex_op are FUTEX_WAKE (to signal that the futex has been freed and wake task(s) waiting for it) or FUTEX_WAIT to block until the futex becomes available.
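To make the wait and wake sides concrete, here is a minimal sketch of a one-shot "event" built on those two operations; there is no glibc wrapper for futex(), so it is invoked via syscall(), and error handling is omitted:

    #include <limits.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long sys_futex(int *uaddr, int futex_op, int val)
    {
        return syscall(SYS_futex, uaddr, futex_op, val, NULL, NULL, 0);
    }

    /* Block until *f becomes nonzero. */
    static void wait_for_event(int *f)
    {
        while (__atomic_load_n(f, __ATOMIC_ACQUIRE) == 0) {
            /* FUTEX_WAIT sleeps only if *f still contains 0, so a
             * wake that races with this check is not lost. */
            sys_futex(f, FUTEX_WAIT, 0);
        }
    }

    /* Mark the event as having happened and wake all waiters. */
    static void signal_event(int *f)
    {
        __atomic_store_n(f, 1, __ATOMIC_RELEASE);
        sys_futex(f, FUTEX_WAKE, INT_MAX);
    }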
Many other operations also exist at this point. Over time, the futex interface has gained complexity, including "robust futexes", adaptive spinning, priority inheritance, and much more. See this article for a (somewhat dated) overview, the above-linked man page, or this low-level description for more information.
The current effort to rework futexes appears to be driven by a couple of concerns. One that goes mostly unstated is the desire to create a system-call interface that makes a bit more sense than futex(), which is a complex, multiplexed API with wildly varying arguments and a number of special cases. When a system call's man page must explain, operation by operation, how each argument is to be reinterpreted, one can conclude with a fair degree of certainty that the API design is not as great as it could be.
In past years, when C-library developers have discussed exposing futex(), they have proposed splitting it into a set of simpler wrapper functions; that work has never been merged. Now, though, the futex2() proposal from André Almeida does the same thing, adding three new system calls:
    int futex_wait(void *uaddr, unsigned long val, unsigned long flags,
                   ktime_t *timeout);
    int futex_wait_time32(void *uaddr, unsigned long val, unsigned long flags,
                          old_time32_t *timeout);
    int futex_wake(void *uaddr, unsigned long nr_wake, unsigned long flags);
It is a rare patch set that includes a question like: "Has anyone started worked on a implementation of this interface?". Almeida's patch set adds no new functionality; indeed, it is currently rather less functional than the existing futex() API, lacking support for features like priority inheritance. Basic futex functionality is implemented, though, by calling into the existing futex() code.
The purpose of this patch set is clearly not to show off new features at this point; instead, the hope is to nail down what a new futex API should look like, with the new features to be added after that is done. That said, there are some enhancements that the developers have in mind and would like to get into place.
One of those is the ability to wait on multiple futexes at once and be awakened when any one of them becomes available. Gabriel Krisman Bertazi posted an implementation of this functionality (for futex()) in July 2019; it is driven in particular by the needs of Wine, which is emulating a similar Windows feature. In a discussion sparked by another posting of this patch set in March, Thomas Gleixner gently suggested that perhaps the time had come to design a new futex interface where features like this could be added (and used) more easily. The current proposal is a direct result of that suggestion.
That said, the proposed API doesn't handle multiple futexes, but the cover letter from the current series describes a planned addition:
    struct futex_wait {
        void *uaddr;
        unsigned long val;
        unsigned long flags;
    };

    int futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
                    unsigned long flags, ktime_t *timeout);
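Assuming a user-space wrapper with that prototype eventually materializes, waiting on either of two futexes might look roughly like the following sketch; the details could well change before the interface is merged:

    /* Hypothetical use of the planned interface; futex_waitv() here is
     * assumed to be a wrapper with the prototype quoted above. */
    static unsigned int futex_a, futex_b;      /* shared futex words */

    static int wait_for_either(void)
    {
        struct futex_wait waiters[] = {
            { .uaddr = &futex_a, .val = 0, .flags = 0 },
            { .uaddr = &futex_b, .val = 0, .flags = 0 },
        };

        /* Sleeps until one of the two words no longer holds the
         * expected value (0 here), i.e. until it has been signaled. */
        return futex_waitv(waiters, 2, 0, NULL);
    }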
Another upcoming feature is the ability to handle futexes in a number of common sizes, not just the 32-bit ones supported today.
Then, there is the issue of performance on NUMA systems. The kernel must maintain internal data structures describing the futexes that are currently being waited on; if those structures are kept on the wrong NUMA node, futex operations can sustain a lot of remote-node cache misses, which slows them down considerably. See this article for details. Futexes are often used by threaded processes that are all running on the same NUMA node; their performance would be improved if the kernel kept its data structures on the same node. Thus, there is a "NUMA hint" functionality planned for the new API that would suggest that the kernel keep its associated data structures on a specific node.
While the thinking about how to improve futex functionality in the kernel has clearly entered a new phase, users should not hold their collective breath waiting for new futex features to enter production kernels. The complexity of this subsystem makes developers reluctant to push through quick changes; they have learned the hard way that it's easy for things to go wrong with futexes. So the new API and the implementation of any new features are likely to go through an extended period of review and revision. The "F" in "futex" may stand for "fast", but the implementation of new futex features may not be.
Posted Jun 18, 2020 23:31 UTC (Thu)
by ras (subscriber, #33059)
[Link] (33 responses)
Here I see a whole pile of candy that has nothing to do with "keep it simple and fast". It can all be done with a judicious combination of libatomic, a simple futex and some data structure in shared memory, wrapped in a userspace library because it can be difficult to get right. Why someone would need something other than 32 bits is beyond me. Waiting on multiple values can be done in user space. Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup. The thundering herd is already avoided by waking up a limited number of sleepers.
Posted Jun 19, 2020 0:24 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (26 responses)
Not efficiently.
> Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup.
Not in all environments.
Posted Jun 19, 2020 0:59 UTC (Fri)
by ras (subscriber, #33059)
[Link] (25 responses)
> Not efficiently.
Why not efficiently? Can you give me an example?
>> Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup.
> Not in all environments.
Again, what environments?
Posted Jun 19, 2020 1:25 UTC (Fri)
by HenrikH (subscriber, #31152)
[Link] (12 responses)
Posted Jun 19, 2020 2:32 UTC (Fri)
by ras (subscriber, #33059)
[Link] (11 responses)
Yes, it's complex, racy and horrible to debug. Putting it in the kernel doesn't solve that. User space RCU (and yes, there is such a thing) is no more or less complex than the kernel version. The complexity has to be somewhere - hiding it in the kernel may seem neater, but it binds it to an unchanging ABI, so support- and maintainability-wise it's worse than putting it in a user space library.
Putting stuff in the kernel is often a win because it can roll up multiple syscalls into one, but mutex's are peculiar because if they are not contested there is no need for a syscall. That peculiarity was the primary driving force behind futex's - the fast mutex. It's not like the kernel didn't already provide user space with multiple other mechanisms to do the same thing - they were just all slower because they involved one or more syscalls for the fast path. And yet here we are with a proposal for futex_waitv() that must always make a syscall, which could be avoided in the uncontested case if it was implemented in user space.
Posted Jun 19, 2020 4:55 UTC (Fri)
by krisman (subscriber, #102057)
[Link] (6 responses)
> create user space mutexes, perhaps known as a utex_t. The library stores it's utex_t implementation in memory shared between the threads / processing doing the waiting. To cater for a waitv, it associates a pointer with each utex_t that says "utex_wake has to wake a whole pile of utex_t's". The code isn't going to look much different regardless of whether it is in kernel space or user space.
How do you wake up multiple threads from your producer (1) without issuing multiple syscalls to wake each futex, (2) avoiding spurious wakeups of unrelated threads, or relying on (3) poll (which we've shown is not fast enough); while (4) not relying in your waiters spinning (which throws cpu utilization through the roof)? One of the main motivations of the patchset, which might have been lost in the article, is to reduce cpu utilization.
Can you better describe what your model would look like in memory? One of my use cases is a producer-consumer scenario where each consumer waits on a subset of events triggered by different producers. Each consumer waits on a different subset, but there are intersections between the subsets for different consumers. One or more consumers are woken up when any of their events trigger. How do you represent this design in your model?
>And yet here we are with a proposal for futex_waitv() that must always make a syscall
That is not true. For the uncontested case, a waiter can grab a futex without going into the kernel. The semantics for FWM are "any of the futexes in the list" so if the consumer acquires any futex, it doesn't call futex_waitv.
Posted Jun 19, 2020 6:42 UTC (Fri)
by ras (subscriber, #33059)
[Link]
If they are waiting on the same futex, one call will suffice. If they are waiting on multiple futexes, multiple wakes will be needed of course. But remember, futex's are optimised for the fast path. The fast path is when the lock is uncontested. In that case no one is waiting, and no one has to be woken up, and so there are no syscalls at all. If you do have to wake another thread up, you've already incurred the overhead of one syscall: the FUTEX_WAIT that put it to sleep, so it's not an untenable additional penalty.
> (2) avoiding spurious wakeups of unrelated threads,
Why are there wakeups on unrelated threads? This wonderful utex library has recorded, in its shared memory, which utex_t's each thread is willing to be woken up on, and wakes up as many as needed. It's not like the kernel can do it differently: it also will have to look up which user space threads are waiting on the futex, and wake some up. The basic algorithm and code doesn't change when it moves from kernel space to user space.
> (4) not relying in your waiters spinning (which throws cpu utilization through the roof)
Why is anybody spinning - kernel or user space? Well OK - the kernel might spin because your critical section isn't long and so it's faster than mutex_lock(). But that's just a speed optimisation - you could always just use a mutex_lock(). In user space spinning always requires a syscall - perhaps many if you end up spinning for a while, so you are always better off with the user space mutex_lock() equivalent, which is futex(), and hoping it is uncontested. If it is uncontested there are no syscalls, and remember, if the kernel is spinning, then someone must have done a syscall to the kernel already, which is another way of saying futex_waitv() takes the syscall hit every time.
> Can you better describe what your model would look like in memory?
Ahh, that is a long answer. And you're stretching my memory, because it was 30 years ago that I did it for an OS I developed and sold. It's too long to go into in detail, but briefly it was based on two primitives: one that looked like utex_waitv(). The second primitive was based on a new trick: every utex allowed you to attach an arbitrary number of child utex's to it. Waking up such a child transparently woke up the parent instead of the child, so a utex_waitv() on a child would never succeed. utex_waitv() returned the utex that woke up, which might be a child utex of one of the utex's in the top level utex_waitv() vector.
utex_wake() becomes more complex of course: it has to check if the child had a parent, and indeed if the parent had a parent in order to wake up the right utex. I put the woken up utex’s in a queue list attached to the parent because I wanted to preserve the order of wakeups, but simply setting a flag in the child utex then having utex_wake() go looking for it would also work.
I actually had a 3rd feature which would probably be frowned upon today: you could attach a function to a utex and instead of waking it up, utex_wake() would call that function, effectively turning the utex into an interrupt like thingy. That was how the parent / child feature was actually implemented - the child had a function attached that woke up the parent. This feature has the interesting property of letting a utex transparently defer to some other mutex / condition mechanism. For example, instead of waking up a parent, you could write a byte to a fifo, or signal a semop(). The person calling the utex_wake() is none the wiser.
In the general case you have to lock a lot of things when you are altering the resulting tree of mutex's. In my case I can't remember that happening. Instead one thread - the one that owned the parent mutex - added and deleted children, and since the children can only be attached to one parent there is no contention.
BTW, I've never found a use for waking up more than one thread at a time. It obviously has its uses if you really want to release a thundering herd, but a mutex usually guards a shared volatile resource that only one thread can change at a time, so there is no point waking more than one. If it discovers it doesn't want it, it can always wake someone else up.
This all requires a bit of work. You aren't dolling up futex's with a thin wrapper, you are implementing a new thing that uses futex's where the kernel would use spinlocks, IRQ masking or mutex_lock(). It is duplicating the existing code and data structures in the kernel. And there are more complexities I haven't mentioned, like when a wake calls a function that does a wait. However, because of that remarkable property of mutex that most code isn't executed unless there is contention, it doesn't matter speed wise because none of it is touched in the fast path.
Posted Jun 19, 2020 19:35 UTC (Fri)
by itsmycpu (guest, #139639)
[Link] (4 responses)
I agree with the posters who say wait-multiple can be implemented efficiently in user space using the existing Futex API.
Re (1): It might make sense to add a wake-multiple call (as opposed to a wait-multiple) since that doesn't require copying dynamic lists into the kernel. This would be much, much simpler.
Re (2): You can protect against spurious wakeups by re-checking atomic variables after wake up. However, what are "unrelated threads"?
Re (3): A user space implementation doesn't require polling (except once, which is good for performance).
Re (4): You have the same general options implementing this as you do when implementing locks.
> Can you better describe what your model would look like in memory?
When necessary, you associate a list of consumers with each event producer, each list entry containing a pointer to the waiter's futex and/or atomic guard variable.
The kernel would have to do the same or similar processing otherwise; however, in my understanding it is something to be avoided for the kernel, for example, to copy and maintain dynamic memory lists. It is better done in user space, also for performance (and to avoid impacting the performance of other applications). At least I know this works well in my own use cases.
Additionally, in use cases where the same "subset" is used repeatedly, these associations don't have to be re-established for each call.
> One of my use cases is a producer-consumer scenario where each consumer waits on a subset of events triggered by different producers.
Depending on the type of producer, you mark and/or wake one or all consumers. Why is that a question - did you try this approach and have problems with it?
Posted Jun 20, 2020 2:54 UTC (Sat)
by krisman (subscriber, #102057)
[Link] (3 responses)
So, your design has one futex per consumer. This means that when a producer signals, in order to wake all of the entries in its waitqueue, it walks the list and does a FUTEX_WAKE syscall per entry. That's much worse than FWM.
Posted Jun 20, 2020 3:38 UTC (Sat)
by ras (subscriber, #33059)
[Link]
At this point I'm lost. You obviously have some specific use case in mind none of this fits. Could you post a link / explain what it is?
Posted Jun 20, 2020 3:58 UTC (Sat)
by itsmycpu (guest, #139639)
[Link] (1 responses)
Almost, but not with that disadvantage. As I indicated, "in my design" there is (optionally) a FUTEX_WAKE_MULTIPLE (not wait_multiple) kernel API which is much simpler to implement, also because it doesn't need to copy and maintain the list internally in the kernel. It does multiple FUTEX_WAKE operations in a single syscall. And this can be added later on, as an optimization, if and when such a call becomes available in the kernel, while the rest can already be implemented, tested and tuned with the kernel as is.
Since a FUTEX_WAKE is only needed for futexes that actually have thread(s) waiting, and the waiting itself is already a time consuming kernel call, one non-blocking WAKE call per waiting futex shouldn't be that bad, although it will be better to have a wake_multiple call, which solves that problem (if it is one) completely.
Posted Jun 20, 2020 4:05 UTC (Sat)
by itsmycpu (guest, #139639)
[Link]
Posted Jun 19, 2020 4:58 UTC (Fri)
by nybble41 (subscriber, #55106)
[Link] (2 responses)
Why must it make a syscall in the uncontested case? It seems to me that it could first try to lock each of the futexes in user space and only make the syscall if none of them were available—much like the current implementation does for a single futex.
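Something along these lines, say (a rough sketch assuming the usual convention that zero means "unlocked"):

    /* Try to grab any one of the futexes without entering the kernel;
     * only if all of them are currently held does the caller need to
     * fall back to futex_waitv(). */
    static int try_lock_any(int *futexes[], int n)
    {
        for (int i = 0; i < n; i++) {
            int expected = 0;
            if (__atomic_compare_exchange_n(futexes[i], &expected, 1,
                                            0 /* strong */,
                                            __ATOMIC_ACQUIRE,
                                            __ATOMIC_RELAXED))
                return i;     /* acquired futexes[i], no syscall made */
        }
        return -1;            /* all contended: now ask the kernel */
    }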
Posted Jun 19, 2020 5:31 UTC (Fri)
by HenrikH (subscriber, #31152)
[Link]
"Before calling the syscall again, userspace should traverse the list, trying to re-acquire any of the other futexes, to prevent an immediate -EWOULDBLOCK return code from the kernel."
Posted Jun 19, 2020 6:58 UTC (Fri)
by ras (subscriber, #33059)
[Link]
It doesn't have to be. But this isn't just a simple compare and swap. You are checking a vector, which requires many steps. These steps have to be done atomically along with the following futex_waitv() if needed. How are you going to do it?
I've been programming long enough to be very confident you can't trust your average programmer to get that sort of thing right. Hell, I'm pretty confident I won't get it right on the first try. That in itself isn't a problem: you wrap it in a user space library so they can't get it wrong.
But then, if you are implementing a user space library anyway, and the library could implement the futex_waitv() in user space and it would be faster to boot in the uncontested case, what have you gained by doing it in the kernel?
Posted Jun 19, 2020 5:36 UTC (Fri)
by HenrikH (subscriber, #31152)
[Link]
Posted Jun 19, 2020 2:31 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (11 responses)
For example, if you can't open new files (not enough FDs or to not mess up the file descriptor numbers).
Posted Jun 19, 2020 2:50 UTC (Fri)
by ras (subscriber, #33059)
[Link] (10 responses)
Posted Jun 19, 2020 3:08 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
Posted Jun 19, 2020 3:50 UTC (Fri)
by ras (subscriber, #33059)
[Link] (8 responses)
I did not mean current futex's could be implemented in userspace. I meant the candy being proposed here could be implemented in user space, candy such as futex_waitv() and allowing different sized int's for the uaddr argument. You need something fast, simple and small to base that on, and that something is futex's. That was the whole point of futex's - the existing alternatives like semop()'s were not fast or simple. If we aren't careful, futex's will go the same way.
One of the complaints about the current futex implementation is the code is hard to follow. One reason for that is it's been expanded to do stuff that could have been done in user space. FUTEX_FD is one such thing. I suspect that FUTEX_REQUEUE is another - it could be done in user space with no more syscalls than are used now. futex_waitv() is definitely in the same class. You don't fix complexity by adding more features.
Posted Jun 19, 2020 5:14 UTC (Fri)
by krisman (subscriber, #102057)
[Link] (7 responses)
For what it's worth, yes, the futex interface has grown out of hand with deprecated features. FUTEX_FD was dropped long ago, so that is not a big deal. And yes, most futex features can be implemented in userspace with the WAKE/WAIT pair with additional cost.
REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakes. How would you requeue in userspace without more syscalls and unnecessary wakeups? I don't see how it is possible; in fact, CMP_REQUEUE is planned to exist in a future API.
Posted Jun 19, 2020 6:50 UTC (Fri)
by ras (subscriber, #33059)
[Link] (4 responses)
You couldn't.
I was querying whether saving the extra syscall is worth the added complexity. It's not like you can't do it in user space - it's just more expensive. You've already incurred one syscall, so the orders-of-magnitude speed improvement the futex fast path got you is gone. You are now trying to gain a factor of 2 or 3 or something. If it was a very frequent use case, then perhaps - but it looks esoteric to me. If complexity is an issue, esoteric should be shown the door.
Posted Jun 23, 2020 20:56 UTC (Tue)
by rweikusat2 (subscriber, #117920)
[Link] (3 responses)
With REQUEUE/ CMP_REQUEUE, it's possible to ask the kernel to wake up one thread which then acquires the mutex and (presumably) does something useful while the others are moved from the condvar futex to the mutex futex without waking them up.
Posted Jun 23, 2020 22:01 UTC (Tue)
by ras (subscriber, #33059)
[Link] (2 responses)
My point wasn't that thundering herd isn't a problem. It clearly is.
My point was that it is already solved by the current futex() API. You get to decide how many of those syscall(SYS_futex, &futex, FUTEX_WAIT)'s you wake up. It can be anywhere from 1 to all. The others just sit around, waiting for another FUTEX_WAKE.
To solve the pthread_cond_wait() case you just wake up one, so it could have been done without REQUEUE now. I'll grant you that's not all that REQUEUE does: it also prevents the futex from seeing any future wakeups. But that could equally well be done from userspace using more futex's. Perhaps it's a speed optimisation (I do hope they benchmarked it), or more likely arranging for everyone to share information the kernel already had was hard, so they took the easy way out.
Posted Jun 24, 2020 14:54 UTC (Wed)
by rweikusat2 (subscriber, #117920)
[Link] (1 responses)
That's an insanely complicated way to transfer a linked list from a head pointer a in the kernel to another head pointer b in the kernel. It's much simpler if the kernel just wakes one thread and transfers the existing list to the other futex.
Posted Jun 24, 2020 19:17 UTC (Wed)
by itsmycpu (guest, #139639)
[Link]
I'm not sure if you are discussing a more specific question, but just to point it out, for example a user space implementation of the producer/consumer model can (if so designed) just move a waiting entry from one producer's list to another producer's list without any wake ups or even kernel calls. (Except a wake call in the case one is ready.)
Posted Jun 19, 2020 10:43 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
You would just do things differently. A model that is relatively common is the "parking lot" where each thread has a single binary semaphore attached to it. The primitives in this model are "parking" yourself (consuming the token in the semaphore, or waiting for one and consuming it immediately if it's not there) and "unparking" another thread (placing a token on its semaphore). The parking lot itself is an associative array from addresses to waitqueues, similar to what the kernel does for futexes, so that you can "park" yourself on a list and "unpark" the head of a list.
By moving waitqueue handling to userspace, you don't need a system call anymore for either requeueing or multiple-futex wait. On the other hand it makes the implementation of pthread_barrier_t less efficient because every slow-path wakeup needs a separate system call.
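A minimal sketch of the per-thread park/unpark primitive in that scheme; the "parking lot" waitqueue bookkeeping is left out, the names are illustrative, and sys_futex() is a thin wrapper around syscall(SYS_futex, ...) as in the sketch earlier in the article:

    /* Each thread owns one binary semaphore: 1 = token present. */
    static void park(int *token)
    {
        for (;;) {
            int expected = 1;
            /* Consume the token if it is there... */
            if (__atomic_compare_exchange_n(token, &expected, 0, 0,
                                            __ATOMIC_ACQUIRE,
                                            __ATOMIC_RELAXED))
                return;
            /* ...otherwise sleep until somebody unparks us; FUTEX_WAIT
             * returns immediately if the token arrived in the meantime. */
            sys_futex(token, FUTEX_WAIT, 0);
        }
    }

    static void unpark(int *token)
    {
        /* Hand over a token and wake the thread if it is parked. */
        __atomic_store_n(token, 1, __ATOMIC_RELEASE);
        sys_futex(token, FUTEX_WAKE, 1);
    }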
Posted Jun 21, 2020 17:44 UTC (Sun)
by itsmycpu (guest, #139639)
[Link]
Again a reason to add a "FUTEX_WAKE_MULTIPLE" kernel API, or "futex_wakev" ("wake" not "wait"), that takes a vector of multiple futex addresses as parameter.
Posted Jun 19, 2020 0:58 UTC (Fri)
by krisman (subscriber, #102057)
[Link] (1 responses)
Yes, it can be done in userspace, as mentioned in my original submission of Futex Wait Multiple last year. Still, we experimented with all that, and the performance of a userspace solution, or even of a solution based on eventfd (which would be the most obvious way to implement these semantics in userspace with the existing kernel support), is a bottleneck in our real-world scenarios of emulating Windows applications over Wine.
Posted Jun 19, 2020 19:48 UTC (Fri)
by itsmycpu (guest, #139639)
[Link]
However I think the existing Futex API allows implementing wait-for-multiple-events with great performance. I have good experience implementing such functionality with existing APIs, and know of no conceptual restrictions there. See also my slightly more detailed comment above.
Posted Jun 19, 2020 9:58 UTC (Fri)
by bno1 (guest, #138051)
[Link] (1 responses)
Wine already has this. The esync patchset uses eventfd to wait on multiple winapi mutexes [1]. But some Windows programs run out of file descriptors because they leak mutexes. File descriptors are a much more limited resource than memory. Also, the patch for wait for multiple futexes (called fsync in the wine community) has slightly better performance [2].
[1] https://github.com/zfigura/wine/blob/esync/README.esync
[2] https://github.com/ValveSoftware/Proton/issues/2966
Posted Jun 23, 2020 7:04 UTC (Tue)
by ncm (guest, #165)
[Link]
See, this is why W(in)e can't have nice things.
Posted Jun 19, 2020 14:57 UTC (Fri)
by sbaugh (guest, #103291)
[Link] (1 responses)
As far as I'm aware, this can only be done by dedicating a thread to each futex you want to wait on - not exactly cheap.
Posted Jun 19, 2020 19:59 UTC (Fri)
by itsmycpu (guest, #139639)
[Link]
Conceptually, a futex is not something you wait on. It is something that you use to wait. It is a consumer, not a producer. The wake call is the producer.
You can use a single futex to wait for many things at the same time (using user space utility functions to establish the necessary connections).
Posted Jun 19, 2020 23:43 UTC (Fri)
by glenn (subscriber, #102223)
[Link] (7 responses)
Has any thought been given to waiting until _all_ the futexes are available? Some sorting of mutexes could be imparted to avoid deadlock issues. Yes, these locks can be obtained serially in userspace, but this may result in as many as N-1 pointless wake-ups if N futexes are needed.
Posted Jun 20, 2020 1:12 UTC (Sat)
by itsmycpu (guest, #139639)
[Link] (6 responses)
> Some sorting of mutexes could be imparted to avoid deadlock issues.
> Yes, these locks can be obtained serially in userspace, but this may result in as many as N-1 pointless wake-ups if N futexes are needed.
My suggestion: keep it that way.
This problem relates to a specific approach, not to user space solutions in general. The goal is to determine contention without any kernel call at all.
The way you ask this question suggests that your life doesn't depend on the answer.
Posted Jun 20, 2020 7:25 UTC (Sat)
by glenn (subscriber, #102223)
[Link] (5 responses)
Posted Jun 23, 2020 7:12 UTC (Tue)
by ncm (guest, #165)
[Link] (4 responses)
Posted Jun 25, 2020 8:39 UTC (Thu)
by glenn (subscriber, #102223)
[Link] (3 responses)
> See, if your life has taken you to a place where you are thinking about ways to reduce what it costs to do nested locking, you have gone astray, and need to find your way back.
Thanks for the life advice. I’ll bring it up with my therapist. Shall we get back to science and engineering?
Let me talk about some challenges I’ve run into developing a pipelined real-time task-graph runtime where I used the futex API directly to manage thread waiting and signalling. My workload is modeled by a set of nodes connected by directed edges. A node is a consumer of input edges, and it is a producer on output edges. Each node encapsulates a unit of work that is backed by a dedicated thread. This thread is ready to run when data is available on _all_ of its input edges.
Being a task-graph system, some nodes are join/fan-in points. These nodes consume data from several edges. How does a producer determine when it should wake the consumer? One method is to use a futex for each edge. The consumer can loop over these futexes until all inputs are satisfied. This may work, but it’s not particularly efficient: it doesn’t scale well (suppose your node has 10s or even 100s of inputs) as a thread may wake up many times before it does any real work. Recall that this is a runtime for a real-time application. These trivial wakeups can preempt other real-time work (e.g., active threads of other nodes) and induce threads to migrate to other CPUs (Linux is very aggressive about migrating preempted real-time threads to available CPUs), inducing cache affinity loss and additional context switch overheads. Also, this wakeup pattern is _very_ problematic for SCHED_DEADLINE. This scheduler allocates budget assuming that the thread in question exhibits a periodic/sporadic execution pattern. Our consumer thread does _not_ exhibit this pattern because it is constantly dealing with trivial wakeups. A SCHED_DEADLINE consumer can exhaust its budget dealing with trivial wakeups before the thread is ready to do any real work! This is due to SCHED_DEADLINE's budget reclaiming algorithm (see “2.2 Bandwidth reclaiming” in Documentation/scheduler/sched-deadline.txt).
An alternative approach is to have the consumer wait on a single futex. Now each producer is responsible for evaluating whether all of the inputs of the consumer have been satisfied. The last producer to satisfy these inputs wakes the consumer. This leads to a more efficient solution, but it is not without its trade-offs. A producer needs insight into the availability of data on the other edges of its consumer. This implies that the producers must run within the same process or share some state in shared memory. This makes for a cross-cutting software architecture where a composable one would feel much more natural. The lifetime of shared memory is not fun to manage. There are also security implications.
My task-graph framework would have been simpler if the futex API let my consumer wait on a set of futexes and wake only when all inputs had been satisfied.
In my runtime, I also explored the use of eventfd and epoll. I encountered the same problem: select, poll, and epoll all wake the caller when any one file descriptor is ready, not when all are ready. I am unaware of any synchronization or IPC mechanism in Linux that would let me design my software in a composable way while simultaneously avoiding the problems of trivial wakeups. I haven’t revisited my problem since io_uring landed--perhaps there’s a solution for me there. Does anyone have a suggestion? No need to suggest a therapist. I already have one. ;)
Posted Jun 25, 2020 11:22 UTC (Thu)
by itsmycpu (guest, #139639)
[Link] (1 responses)
Your situation may have specifics that I am not aware of, so I can only hope that this is going to make sense:
Each producer keeps track of whether a consumer has already been informed about the availability of data, in that consumer's list entry. That list entry of course also has a reference to the consumer. The consumer itself has an atomic counter for how many inputs it is still missing. When a producer is able to provide a new input, it reduces the consumer's (atomic) counter by 1. In case it reaches zero at that point, it wakes the consumer.
Instead of a counter, for example an atomic bitset can be used. Then the consumer's list entry contains the specific bit that this producer should set. When a producer sees that it has set the last necessary bit, it wakes the consumer (or just sets that flag if the consumer isn't waiting). This is what I have used in some cases. In my case, the consumer is often doing other work while the flag is set, so that it often does not have to wait, and the whole process takes place in user space.
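For example, a minimal sketch of the counter variant; all names here are made up for the example, and sys_futex() is again a thin wrapper around syscall(SYS_futex, ...):

    struct consumer {
        int missing;   /* atomic: inputs not yet delivered */
        int ready;     /* futex word: 0 = keep waiting, 1 = runnable */
    };

    /* Producer side: called once this producer's input is available. */
    static void deliver_input(struct consumer *c)
    {
        if (__atomic_sub_fetch(&c->missing, 1, __ATOMIC_ACQ_REL) == 0) {
            /* Last missing input: flag the consumer and wake it. */
            __atomic_store_n(&c->ready, 1, __ATOMIC_RELEASE);
            sys_futex(&c->ready, FUTEX_WAKE, 1);
        }
    }

    /* Consumer side: returns once every input has arrived. */
    static void wait_for_inputs(struct consumer *c)
    {
        while (__atomic_load_n(&c->ready, __ATOMIC_ACQUIRE) == 0)
            sys_futex(&c->ready, FUTEX_WAIT, 0);
    }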
> This implies that the producers must run within the same process or share some state in shared memory. This makes for a cross-cutting software architecture where a composable one would feel much more natural. The lifetime of shared memory is not fun to manage. There are also security implications.
As far as I know, using futexes between processes always requires shared memory, one way or the other. That probably makes them the fastest solution when blocking threads is required on the slow path.
(Of course futexes should only be used on the slow path.)
Posted Jun 30, 2020 15:42 UTC (Tue)
by ncm (guest, #165)
[Link]
Bits are set by compare-and-swap, atomically, with no system call needed.
To do it with a counter, the producer atomically decrements a counter, and wakes the thread when it sees the transition to zero.
Waking the thread might mean writing a byte to a pipe it is sleeping on.
Or, the producer's thread might just add the consumer to a scoreboard of runnable tasks, and some already-running thread will get to it. That way, the kernel is never involved at all. You keep exactly as many threads as you have cores on the job, and they spin when there is no work. Isolate those cores (isolcpus, nohz_full, rcu_nocbs) so the kernel does not steal them.
Posted Jun 27, 2020 7:49 UTC (Sat)
by ras (subscriber, #33059)
[Link]
Waking up precisely one waiter is always the fastest choice.
> This implies that the producers must run within the same process or share some state in shared memory.
Yes, but futex's only work with shared memory. Communicating via shared memory is orders of magnitude faster than any other method. That's where futex's get their speed from. So if you want speed, shared memory is unavoidable.
If speed is not such a big issue there are as you say lots of choices, including epoll(), pipes, semop()'s ...
> My task-graph framework would have been simpler if the futex API let my consumer wait on a set of futexes and wake only when all inputs had been satisfied.
Of course it would be. Things are always simpler when there is a pre-existing library that does exactly what you want. What triggered my interest here is the claim that library should be in the kernel.
Posted Jun 23, 2020 4:00 UTC (Tue)
by pr1268 (guest, #24648)
[Link] (2 responses)
From the futex(2) man page:

    int futex(int *uaddr, [...]

And from the proposed patch set mentioned above:

    int futex_wait(void *uaddr, [...]
    int futex_wait_time32(void *uaddr, [...]
    int futex_wake(void *uaddr, [...]

Just curious, and I couldn't find it in our editor's article, nor in the comments: why is uaddr being changed from an int * to a void *?
Posted Jun 23, 2020 13:02 UTC (Tue)
by corbet (editor, #1)
[Link] (1 responses)
I would guess that it has to do with the desire to support different sizes of futexes. Once that's in place, the type of that pointer will vary, so it pretty much has to be void *.
Posted Jun 23, 2020 17:46 UTC (Tue)
by itsmycpu (guest, #139639)
[Link]
Is it worth the effort and cost of a larger kernel API, though? An application level feature can easily use another atomic variable (which it might like to do anyway), it doesn't have to use the futex value itself, except on the slow path. Not every convenience requires its own kernel API.
The use of futexes requires specialized understanding *and* experience, and application writers are usually better off using higher level library functions like semaphores and mutexes. Leave the support of multiple sizes to the libraries. Or add a Linux-specific C library directly supported by the kernel developers (which can then perhaps also be used by the kernel to communicate with user space processes).
My suggestion is to (first) add a kernel API that can actually make a performance difference: a WAKE-multiple-futexes-with-one-call API. ("Wake" not "wait"). Should be easy to implement (for a kernel developer) and not interfere with the existing functionality and concepts.