
Rethinking the futex API

By Jonathan Corbet
June 18, 2020
The Linux futex() system call is a bit of a strange beast. It is widely used to provide low-level synchronization support in user space, but there is no wrapper for it in the GNU C Library. Its implementation was meant to be simple, but kernel developers have despaired at the complex beast that it has become, and few dare to venture into that code. Recently, though, a new effort has begun to rework futexes; it is limited to a new system-call interface for now, but the plans go far beyond that.

There is a wide range of synchronization options within the kernel, but there have always been fewer options in user space. For years, the only real option was System V semaphores, but it is fair to say that they have never been universally loved. Developers have long wanted a mutex-like option for user space that does not kill performance.

Back in 2002, Rusty Russell proposed a fast user-space mutex mechanism that quickly became known as a "futex"; this feature was present in the 2.6.0 kernel release at the end of 2003 and immediately used to control concurrency for POSIX threads. The initial implementation was just a few hundred lines of code. At its core, a futex is a 32-bit word of memory shared between cooperating processes; a value of one indicates that the futex is available, while anything else marks it as unavailable. A process wishing to acquire a futex will issue a locked decrement instruction, then verify that the resulting value was zero; if so, the acquisition was successful and execution can continue. Releasing the futex is a simple matter of incrementing its value again.
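
In code, the fast path described here amounts to a single atomic operation; below is a minimal sketch using C11 atomics, with the futex word initialized to one ("available"). It deliberately ignores the contended case, which is covered next.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Fast-path acquire: one locked decrement, then check that the result is zero. */
    static bool futex_try_acquire(atomic_int *word)
    {
        return atomic_fetch_sub(word, 1) == 1;    /* old value was 1, so it is now 0 */
    }

    /* Fast-path release: increment the word again. */
    static void futex_release(atomic_int *word)
    {
        atomic_fetch_add(word, 1);
    }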

The nice thing about futexes as described so far is that the kernel is not involved in their operation at all; futexes can be acquired and released without the need to make system calls. That cannot be sustained if there is contention for the futex, though; at that point, a task will have to block to wait for the futex to become available. That is where the futex() system call comes in:

    int futex(int *uaddr, int futex_op, int val,
              const struct timespec *timeout,   /* or: uint32_t val2 */
              int *uaddr2, int val3);

The initial futex() implementation had only two arguments: uaddr (the address of the futex) and futex_op, which would be either +1 to increment the futex, or -1 to decrement it. The modern equivalents for futex_op are FUTEX_WAKE (to signal that the futex has been freed and wake task(s) waiting for it) or FUTEX_WAIT to block until the futex becomes available.
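
To make the fast-path/slow-path split concrete, here is a minimal sketch of a futex-based mutex. It uses the common 0 = unlocked, 1 = locked, 2 = locked-with-waiters encoding (rather than the increment/decrement protocol described above) and omits the FUTEX_PRIVATE_FLAG and error handling that a production implementation would want.

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    void mutex_lock(atomic_int *f)
    {
        int c = 0;
        /* Fast path: 0 -> 1 with a single atomic instruction, no system call. */
        if (atomic_compare_exchange_strong(f, &c, 1))
            return;
        /* Contended: mark the futex as having waiters, then sleep in the kernel. */
        if (c != 2)
            c = atomic_exchange(f, 2);
        while (c != 0) {
            futex(f, FUTEX_WAIT, 2);          /* block as long as the value is still 2 */
            c = atomic_exchange(f, 2);
        }
    }

    void mutex_unlock(atomic_int *f)
    {
        /* Fast path: if nobody is waiting, no system call is needed. */
        if (atomic_fetch_sub(f, 1) != 1) {
            atomic_store(f, 0);
            futex(f, FUTEX_WAKE, 1);          /* wake one waiter */
        }
    }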

Many other operations also exist at this point. Over time, the futex interface has gained complexity, including "robust futexes", adaptive spinning, priority inheritance, and much more. See this article for a (somewhat dated) overview, the above-linked man page, or this low-level description for more information.

The current effort to rework futexes appears to be driven by a couple of concerns. One that goes mostly unstated is the desire to create a system-call interface that makes a bit more sense than futex(), which is a complex, multiplexed API with wildly varying arguments and a number of special cases. Whenever a system call is documented in terms like this:

For several blocking operations, the timeout argument is a pointer to a timespec structure that specifies a timeout for the operation. However, notwithstanding the prototype shown above, for some operations, the least significant four bytes of this argument are instead used as an integer whose meaning is determined by the operation.

one can conclude with a fair degree of certainty that the API design is not as great as it could be.

In past years, when C-library developers have discussed exposing futex(), they have proposed splitting it into a set of simpler wrapper functions; that work has never been merged. Now, though, the futex2() proposal from André Almeida does the same thing, adding three new system calls:

    int futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timeout);
    int futex_wait_time32(void *uaddr, unsigned long val, unsigned long flags, 
			  old_time32_t *timeout);
    int futex_wake(void *uaddr, unsigned long nr_wake, unsigned long flags);

It is a rare patch set that includes a question like: "Has anyone started worked on a implementation of this interface?". Almeida's patch set adds no new functionality; indeed, it is currently rather less functional than the existing futex() API, lacking support for features like priority inheritance. Basic futex functionality is implemented, though, by calling into the existing futex() code.

The purpose of this patch set is clearly not to show off new features at this point; instead, the hope is to nail down what a new futex API should look like, with the new features to be added after that is done. That said, there are some enhancements that the developers have in mind and would like to get into place.

One of those is the ability to wait on multiple futexes at once and be awakened when any one of them becomes available. Gabriel Krisman Bertazi posted an implementation of this functionality (for futex()) in July 2019; it is driven in particular by the needs of Wine, which is emulating a similar Windows feature. In a discussion sparked by another posting of this patch set in March, Thomas Gleixner gently suggested that perhaps the time had come to design a new futex interface where features like this could be added (and used) more easily. The current proposal is a direct result of that suggestion.

That said, the proposed API doesn't handle multiple futexes, but the cover letter from the current series describes a planned addition:

    struct futex_wait {
	void *uaddr;
	unsigned long val;
	unsigned long flags;
    };

    int futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
		    unsigned long flags, ktime_t *timeout);
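
As an illustration, waiting until either of two 32-bit event words is signaled might look roughly like the fragment below. This is only a sketch: no C-library wrapper or syscall number exists yet, consume_event() is a placeholder for application code, and the convention that the return value identifies which entry was woken is an assumption made for illustration.

    uint32_t event_a = 0, event_b = 0;     /* one 32-bit futex per event source */

    struct futex_wait waiters[] = {
        { .uaddr = &event_a, .val = 0, .flags = 0 },   /* sleep while event_a == 0 */
        { .uaddr = &event_b, .val = 0, .flags = 0 },   /* sleep while event_b == 0 */
    };

    int which = futex_waitv(waiters, 2, 0, NULL);      /* no timeout */
    if (which >= 0)
        consume_event(which);   /* re-check the word first; wakeups can be spurious */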

Another upcoming feature is the ability to handle futexes in a number of common sizes, not just the 32-bit ones supported today.

Then, there is the issue of performance on NUMA systems. The kernel must maintain internal data structures describing the futexes that are currently being waited on; if those structures are kept on the wrong NUMA node, futex operations can sustain a lot of remote-node cache misses, which slows them down considerably. See this article for details. Futexes are often used by threaded processes that are all running on the same NUMA node; their performance would be improved if the kernel kept its data structures on the same node. Thus, there is a "NUMA hint" functionality planned for the new API that would suggest that the kernel keep its associated data structures on a specific node.

While the thinking about how to improve futex functionality in the kernel has clearly entered a new phase, users should not hold their collective breath waiting for new futex features to enter production kernels. The complexity of this subsystem makes developers reluctant to push through quick changes; they have learned the hard way that it's easy for things to go wrong with futexes. So the new API and the implementation of any new features are likely to go through an extended period of review and revision. The "F" in "futex" may stand for "fast", but the implementation of new futex features may not be.


Rethinking the futex API

Posted Jun 18, 2020 23:31 UTC (Thu) by ras (subscriber, #33059) [Link] (33 responses)

Sometimes I think developers lose focus on the problem being solved. futex started life because user space needed a fast mutex. It solves the problem by avoiding a syscall into the kernel, which boils down to "no syscall" unless you have to block.

Here I see a whole pile of candy that has nothing to do with "keep it simple and fast". It can all be done with a judicious combination of libatomic, a simple futex, and some data structures in shared memory, wrapped in a userspace library because it can be difficult to get right. Why someone would need something other than 32 bits is beyond me. Waiting on multiple values can be done in user space. Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup. The thundering herd is already avoided by waking up a limited number of sleepers.

Rethinking the futex API

Posted Jun 19, 2020 0:24 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (26 responses)

> Waiting on multiple values can be done in user space.
Not efficiently.

> Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup.
Not in all environments.

Rethinking the futex API

Posted Jun 19, 2020 0:59 UTC (Fri) by ras (subscriber, #33059) [Link] (25 responses)

>> Waiting on multiple values can be done in user space.
>Not efficiently.

Why not efficiently? Can you give me an example?

>> Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup.
>Not in all environments.

Again, what environments?

Rethinking the futex API

Posted Jun 19, 2020 1:25 UTC (Fri) by HenrikH (subscriber, #31152) [Link] (12 responses)

How would you wait on multiple futexes efficiently in userspace? You would have to busy wait and sprinkle in some sleep(0)/sched_yield() nonsense calls here and there to not waste cpu cycles if none of the futexes is triggered, instead of just having the kernel put your thread to sleep until any of them gets triggered.

Rethinking the futex API

Posted Jun 19, 2020 2:32 UTC (Fri) by ras (subscriber, #33059) [Link] (11 responses)

You do it the same way you do it in kernel space. Maybe it's a user space library that uses futexes to create user space mutexes, perhaps known as a utex_t. The library stores its utex_t implementation in memory shared between the threads / processes doing the waiting. To cater for a waitv, it associates a pointer with each utex_t that says "utex_wake has to wake a whole pile of utex_t's". The code isn't going to look much different regardless of whether it is in kernel space or user space.

Yes, it's complex, racy and horrible to debug. Putting it in the kernel doesn't solve that. User space RCU (and yes, there is such a thing) is no more or less complex than the kernel version. The complexity has to be somewhere - hiding it in the kernel may seem neater, but it binds it to an unchanging ABI, so support- and maintainability-wise it's worse than putting it in a user space library.

Putting stuff in the kernel is often a win because it can roll up multiple syscalls into one, but mutexes are peculiar because if they are not contested there is no need for a syscall. That peculiarity was the primary driving force behind futexes - the fast mutex. It's not like the kernel didn't already provide user space with multiple other mechanisms to do the same thing - they were just all slower because they involved one or more syscalls for the fast path. And yet here we are with a proposal for futex_waitv() that must always make a syscall, which could be avoided in the uncontested case if it was implemented in user space.

Rethinking the futex API

Posted Jun 19, 2020 4:55 UTC (Fri) by krisman (subscriber, #102057) [Link] (6 responses)

> You do it the same way you do it in kernel space. Maybe its a user space library that uses futex's to
> create user space mutexes, perhaps known as a utex_t. The library stores it's utex_t implementation in
> memory shared between the threads / processing doing the waiting. To cater for a waitv, it associates
> a pointer with each utex_t that says "utex_wake has to wake a whole pile of utex_t's". The code isn't
> going to look much different regardless of whether it is in kernel space or user space.

How do you wake up multiple threads from your producer (1) without issuing multiple syscalls to wake each futex, (2) avoiding spurious wakeups of unrelated threads, or relying on (3) poll (which we've shown is not fast enough); while (4) not relying on your waiters spinning (which throws cpu utilization through the roof)? One of the main motivations of the patchset, which might have been lost in the article, is to reduce cpu utilization.

Can you better describe what your model would look like in memory? One of my use cases is a producer-consumer scenario where each consumer waits on a subset of events triggered by different producers. Each consumer waits on a different subset, but there are intersections between subsets for different consumers. One or more consumers are woken up when any of their events trigger. How do you represent this design in your model?

>And yet here we are with a proposal for futex_waitv() that must always make a syscall

That is not true. For the uncontested case, a waiter can grab a futex without going into the kernel. The semantics for FWM are "any of the futexes in the list" so if the consumer acquires any futex, it doesn't call futex_waitv.

Rethinking the futex API

Posted Jun 19, 2020 6:42 UTC (Fri) by ras (subscriber, #33059) [Link]

> How do you wake up multiple threads from your producer (1) without issuing multiple syscalls to wake each futex,

If they are waiting on the same futex, one call will suffice. If they are waiting on multiple futexes, multiple wakes will be needed of course. But remember, futexes are optimised for the fast path. The fast path is when the lock is uncontested. In that case no one is waiting, and no one has to be woken up, and so no syscalls at all. If you do have to wake another thread up, you've already incurred the overhead of one syscall: the FUTEX_WAIT that put it to sleep, so it's not an untenable additional penalty.

> (2) avoiding spurious wakeups of unrelated threads,

Why are there wakeups of unrelated threads? This wonderful utex library has recorded, in its shared memory, which utex_t's each thread is willing to be woken up for, and wakes up as many as needed. It's not like the kernel can do it differently: it also will have to look up which user space threads are waiting on the futex, and wake some up. The basic algorithm and code doesn't change when it moves from kernel space to user space.

> (4) not relying in your waiters spinning (which throws cpu utilization through the roof)

Why is anybody spinning - kernel or user space? Well OK - the kernel might spin because your critical section isn't long and so it's faster than mutex_lock(). But that's just a speed optimisation - you could always just use a mutex_lock(). In user space spinning always requires a syscall - perhaps many if you end up spinning for a while, so you are always better off with the user space mutex_lock() equivalent, which is futex(), and hope it is uncontested. If it is uncontested there are no syscalls, and remember, if the kernel is spinning, then someone must have done a syscall to the kernel already, which is another way of saying futex_waitv() takes the syscall hit every time.

> Can you better describe what your model would look like in memory?

Ahh, that is a long answer. And you're stretching my memory, because it was 30 years ago that I did it for an OS I developed and sold. It's too long to go into in detail, but briefly it was based on two primitives: one that looked like utex_waitv(). The second primitive was based on a new trick: every utex allowed you to attach an arbitrary number of child utex's to it. Waking up such a child transparently woke up the parent instead of the child, so a utex_waitv() on a child would never succeed. utex_waitv() returned the utex that woke up, which might be a child utex of one of the utex's in the top level utex_waitv() vector.

utex_wake() becomes more complex of course: it has to check whether the child had a parent, and indeed whether the parent had a parent, in order to wake up the right utex. I put the woken-up utex's in a queue attached to the parent because I wanted to preserve the order of wakeups, but simply setting a flag in the child utex and then having utex_wake() go looking for it would also work.

I actually had a 3rd feature which would probably be frowned upon today: you could attach a function to a utex and instead of waking it up, utex_wake() would call that function, effectively turning the utex into an interrupt like thingy. That was how the parent / child feature was actually implemented - the child had a function attached that woke up the parent. This feature has the interesting property of letting a utex transparently defer to some other mutex / condition mechanism. For example, instead of waking up a parent, you could write a byte to a fifo, or signal a semop(). The person calling the utex_wake() is none the wiser.

In the general case you have to lock a lot of things when you are altering the resulting tree of mutexes. In my case I can't remember that happening. Instead only one thread - the one that owned the parent mutex - added and deleted children, and since a child can only be attached to one parent there is no contention.

BTW, I've never found a use for waking up more than one thread at a time. It obviously has its uses if you really want to release a thundering herd, but a mutex usually guards a shared volatile resource that only one thread can change at a time, so there is no point waking more than one. If it discovers it doesn't want it, it can always wake someone else up.

This all requires a bit of work. You aren't dolling up futexes with a thin wrapper, you are implementing a new thing that uses futexes where the kernel would use spinlocks, IRQ masking or mutex_lock(). It is duplicating the existing code and data structures in the kernel. And there are more complexities I haven't mentioned, like when a wake calls a function that does a wait. However, because of that remarkable property of mutexes that most code isn't executed unless there is contention, it doesn't matter speed-wise because none of it is touched in the fast path.

Rethinking the futex API

Posted Jun 19, 2020 19:35 UTC (Fri) by itsmycpu (guest, #139639) [Link] (4 responses)

> How do you wake up multiple threads from your producer
> (1) without issuing multiple syscalls to wake each futex,
> (2) avoiding spurious wakeups of unrelated threads, or relying on
> (3) poll (which we've shown is not fast enough); while
> (4) not relying in your waiters spinning (which throws cpu utilization through the roof)?
> One of the main motivations of the patchset, which might have been lost in the article, is to reduce cpu utilization.

I agree with the posters who say wait-multiple can be implemented efficiently in user space using the existing Futex API.

Re (1): It might make sense to add a wake-multiple call (as opposed to a wait-multiple), since that doesn't require copying dynamic lists into the kernel.
This would be much, much simpler.
Re (2): You can protect against spurious wakeups by re-checking atomic variables after wake up. However, what are "unrelated threads"?
Re (3): A user space implementation doesn't require polling (except once, which is good for performance).
Re (4): You have the same general options implementing this as you do when implementing locks.

> Can you better describe what your model would look like in memory?
> One of my use cases is a producer-consumer scenario where each consumers wait on a subset of events triggered by different producers.

When necessary, you associate a list of consumers with each event producer, each list entry containing a pointer to the waiter's futex and/or atomic guard variable.
Depending on the type of producer, you mark and/or wake one or all consumers. Why is that a question - did you try this approach and have problems with it?

The kernel would have to do the same or similar processing otherwise; however, in my understanding, having the kernel copy and maintain dynamic memory lists, for example, is something to be avoided. It is better to do it in user space, also for performance (and to avoid impacting the performance of other applications). At least I know this works well in my own use cases.

Additionally, in use cases where the same "subset" is used repeatedly, these associations don't have to be re-established for each call.

Rethinking the futex API

Posted Jun 20, 2020 2:54 UTC (Sat) by krisman (subscriber, #102057) [Link] (3 responses)

> When necessary, you associate a list of consumers with each event producer, each list entry containing
> a pointer to the waiters futex and/or atomic guard variable.

So, your design has one futex per consumer. This means that when a producer signals, in order to wake all of the entries in its waitqueue, it walks the list and does a FUTEX_WAKE syscall per entry. That's much worse than FWM.

Rethinking the futex API

Posted Jun 20, 2020 3:38 UTC (Sat) by ras (subscriber, #33059) [Link]

> So, your design has one futex per consumer.

At this point I'm lost. You obviously have some specific use case in mind none of this fits. Could you post a link / explain what it is?

Rethinking the futex API

Posted Jun 20, 2020 3:58 UTC (Sat) by itsmycpu (guest, #139639) [Link] (1 responses)

> So, your design has one futex per consumer. This means that when a producer signals,
> in order to wake all of the entries in its waitqueue, it walks the list and does a FUTEX_WAKE syscall per entry. That's much worse than FWM.

Almost, but not with that disadvantage. As I indicated, "in my design" there is (optionally) a FUTEX_WAKE_MULTIPLE (not wait_multiple) kernel API which is much simpler to implement, also because it doesn't need to copy and maintain the list internally in the kernel. It does multiple FUTEX_WAKE operations in a single syscall. And this can be added later on, as an optimization, if and when such a call becomes available in the kernel, while the rest can already be implemented, tested and tuned with the kernel as is.

Since a FUTEX_WAKE is only needed for futexes that actually have thread(s) waiting, and the waiting itself is already a time consuming kernel call, one non-blocking WAKE call per waiting futex shouldn't be that bad, although it will be better to have a wake_multiple call, which solves that problem (if it is one) completely.
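
For concreteness, the loop that such a wake-multiple call would collapse into a single syscall looks roughly like this today (addrs[], has_waiters[], and n_ready are illustrative names, not part of any API):

    /* One FUTEX_WAKE per futex that actually has a sleeper; a hypothetical
     * wake-multiple operation would fold this loop into one system call. */
    for (unsigned int i = 0; i < n_ready; i++)
        if (has_waiters[i])
            syscall(SYS_futex, addrs[i], FUTEX_WAKE, 1, NULL, NULL, 0);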

Rethinking the futex API

Posted Jun 20, 2020 4:05 UTC (Sat) by itsmycpu (guest, #139639) [Link]

(Also, for example, the FUTEX_WAKE calls can be offloaded to a worker thread, if the callers are high priority.)

Rethinking the futex API

Posted Jun 19, 2020 4:58 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (2 responses)

> And yet here we are with a proposal for futex_waitv() that must always make a syscall, which could be avoid in the uncontested case if it was implemented in user space.

Why must it make a syscall in the uncontested case? It seems to me that it could first try to lock each of the futexes in user space and only make the syscall if none of them were available—much like the current implementation does for a single futex.
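
In sketch form, with try_acquire() standing in for whatever atomic fast path the application uses, and futex_waitv() being the planned system call discussed in the article:

    /* Acquire any one of nr futex-backed locks: try them all in user space
     * first, and only sleep in the kernel if every one of them is contended. */
    unsigned int acquire_any(struct lock *locks, struct futex_wait *waiters,
                             unsigned int nr)
    {
        for (;;) {
            for (unsigned int i = 0; i < nr; i++)
                if (try_acquire(&locks[i]))      /* e.g. a 0 -> 1 compare-and-swap */
                    return i;                    /* fast path: no system call made */
            futex_waitv(waiters, nr, 0, NULL);   /* slow path: sleep until a wake */
        }
    }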

Rethinking the futex API

Posted Jun 19, 2020 5:31 UTC (Fri) by HenrikH (subscriber, #31152) [Link]

Exactly! In Krisman's patch on LKML he specifically wrote:

"Before calling the syscall again, userspace should traverse the list, trying to re-acquire any of the other futexes, to prevent an immediate -EWOULDBLOCK return code from the kernel."

Rethinking the futex API

Posted Jun 19, 2020 6:58 UTC (Fri) by ras (subscriber, #33059) [Link]

> Why must it make a syscall in the uncontested case?

It doesn't have to be. But this isn't just a simple compare and swap. You are checking a vector, which requires many steps. These steps have to be done atomically along with the following futex_waitv() if needed. How are you going to do it?

I've been programming long enough to be very confident you can't trust your average programmer to get that sort of thing right. Hell, I'm pretty confident I won't get it right on the first try. That in itself isn't a problem: you wrap it in a user space library so they can't get it wrong.

But then, if you are implementing a user space library anyway, and the library could implement the futex_waitv() in user space and it would be faster to boot in the uncontested case, what have you gained by doing it in the kernel?

Rethinking the futex API

Posted Jun 19, 2020 5:36 UTC (Fri) by HenrikH (subscriber, #31152) [Link]

But this call is not about the uncontested case, it's so that the kernel can put a thread to sleep that is waiting for one (or more) of the contested futexes to no longer be contested. That you cannot do from userspace.

Rethinking the futex API

Posted Jun 19, 2020 2:31 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

> Again, what environments?
For example, if you can't open new files (not enough FDs or to not mess up the file descriptor numbers).

Rethinking the futex API

Posted Jun 19, 2020 2:50 UTC (Fri) by ras (subscriber, #33059) [Link] (10 responses)

What's that got to do with futex's? They don't use an FD.

Rethinking the futex API

Posted Jun 19, 2020 3:08 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

Uhh... You stated that futex functionality can be implemented in userspace: https://lwn.net/Articles/823579/

Rethinking the futex API

Posted Jun 19, 2020 3:50 UTC (Fri) by ras (subscriber, #33059) [Link] (8 responses)

If it reads that way I've written it badly.

I did not mean current futexes could be implemented in userspace. I meant the candy being proposed here could be implemented in user space, candy such as futex_waitv() and allowing different-sized ints for the uaddr argument. You need something fast, simple and small to base that on, and that something is futexes. That was the whole point of futexes - the existing alternatives like semop()'s were not fast or simple. If we aren't careful, futexes will go the same way.

One of the complaints about the current futex implementation is that the code is hard to follow. One reason for that is that it's been expanded to do stuff that could have been done in user space. FUTEX_FD is one such thing. I suspect that FUTEX_REQUEUE is another - it could be done in user space with no more syscalls than are used now. futex_waitv() is definitely in the same class. You don't fix complexity by adding more features.

Rethinking the futex API

Posted Jun 19, 2020 5:14 UTC (Fri) by krisman (subscriber, #102057) [Link] (7 responses)

> One of the complaints about the current futex implementation is the code is hard to follow. One reason
> for that is it's been expanded to do stuff that could have been done in user space. FUTEX_FD is one
> such thing. I suspect that FUTEX_REQUEUE is another - it could be done in user space with no more
> syscalls than are used now. futex_waitv() is definitely in the same class. You don't fix complexity by
> adding more features.

For what it's worth, yes, the futex interface has grown out of hand with deprecated features. FUTEX_FD was dropped long ago, so that is not a big deal. And yes, most futex features can be implemented in userspace with the WAKE/WAIT pair, at additional cost.

REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakes. How would you requeue in userspace without more syscalls and unnecessary wakes? I don't see how it is possible; in fact, CMP_REQUEUE is planned to exist in a future API.

Rethinking the futex API

Posted Jun 19, 2020 6:50 UTC (Fri) by ras (subscriber, #33059) [Link] (4 responses)

> REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakes. How would you requeue in userspace without more syscalls and unnecessary wakes? I don't see how it is possible; in fact, CMP_REQUEUE is planned to exist in a future API.

You couldn't.

I was querying whether the extra syscall complexity is worth the cost. It's not like you can't do it in user space - it's just more expensive. You've already incurred one syscall, so the orders-of-magnitude speed improvement the futex fast path got you is gone. You are now trying to gain a factor of 2 or 3 or something. If it was a very frequent use case, then perhaps - but it looks esoteric to me. If complexity is an issue, esoteric should be shown the door.

Rethinking the futex API

Posted Jun 23, 2020 20:56 UTC (Tue) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

It is a frequent use case: POSIX condition variables. These are associated with a mutex. A thread calling pthread_cond_wait must do so while owning the mutex associated with the condition variable. The mutex is then released and the thread put to sleep until the condition is signalled. On wakeup, the mutex must be reacquired by the thread prior to the function call returning. If more than one thread is woken up because the condition is signalled, they would thus all race to exit the kernel; most of them would find the mutex locked and enter the kernel again in order to block on it.

With REQUEUE/CMP_REQUEUE, it's possible to ask the kernel to wake up one thread, which then acquires the mutex and (presumably) does something useful, while the others are moved from the condvar futex to the mutex futex without waking them up.
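
The system-call pattern being described looks roughly like this; it is only a sketch (glibc's real condition-variable code is considerably more involved, and the retry needed when the expected value no longer matches is omitted):

    #include <limits.h>
    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    uint32_t cond_futex;     /* futex that pthread_cond_wait() callers sleep on */
    uint32_t mutex_futex;    /* futex backing the associated mutex              */

    /* Wake at most one waiter and requeue the rest onto the mutex's futex,
     * provided cond_futex still contains expected_val (otherwise EAGAIN). */
    long broadcast_requeue(uint32_t expected_val)
    {
        return syscall(SYS_futex, &cond_futex, FUTEX_CMP_REQUEUE,
                       1,                            /* nr_wake              */
                       (void *)(uintptr_t)INT_MAX,   /* nr_requeue           */
                       &mutex_futex,                 /* requeue target       */
                       expected_val);                /* value to check first */
    }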

Rethinking the futex API

Posted Jun 23, 2020 22:01 UTC (Tue) by ras (subscriber, #33059) [Link] (2 responses)

Thanks for that.

My point wasn't that thundering herd isn't a problem. It clearly is.

My point was that it is already solved by the current futex() API. You get to decide how many of those syscall(SYS_futex, &futex, FUTEX_WAIT)'s you wake up. It can be anywhere from 1 to all. The others just sit around, waiting for another FUTEX_WAKE.

To solve the pthread_cond_wait() case you just wake up one, so it could have been done without REQUEUE now. I'll grant you that's not all REQUEUE does: it also prevents the futex from seeing any future wakeups. But that could equally well be done from userspace using more futexes. Perhaps it's a speed optimisation (I do hope they benchmarked it), or more likely arranging for everyone to share information the kernel already had was hard, so they took the easy way out.

Rethinking the futex API

Posted Jun 24, 2020 14:54 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

The library implementation doesn't get to decide this: application code can either use pthread_cond_signal to wake exactly one thread or pthread_cond_broadcast to wake all threads. With the naive implementation, the kernel goes through a linked list and wakes everything while tearing the list apart. All the awoken threads exit the kernel and compete for ownership of the same mutex. One of these threads will end up getting the mutex; the others all do system calls in order to enter the kernel again, where kernel code then builds a new linked list of them which is attached to a different futex.

That's an insanely complicated way to transfer a linked list from a head pointer a in the kernel to another head pointer b in the kernel. It's much simpler if the kernel just wakes one thread and transfers the existing list to the other futex.


Rethinking the futex API

Posted Jun 24, 2020 19:17 UTC (Wed) by itsmycpu (guest, #139639) [Link]

> That's an insanely complicated way to transfer a linked list from a head pointer a in the kernel to another head pointer b in the kernel. It's much simpler if the kernel just wakes one thread and transfers the existing list to the other futex.

I'm not sure if you are discussing a more specific question, but just to point it out: a user space implementation of the producer/consumer model can (if so designed) just move a waiting entry from one producer's list to another producer's list without any wakeups or even kernel calls. (Except a wake call in the case where one is ready.)

Rethinking the futex API

Posted Jun 19, 2020 10:43 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (1 responses)

> REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakes. How would you requeue in userspace without more syscalls and unnecessary wakes?

You would just do things differently. A model that is relatively common is the "parking lot" where each thread has a single binary semaphore attached to it. The primitives in this model are "parking" yourself (consuming the token in the semaphore, or waiting for one and consuming it immediately if it's not there) and "unparking" another thread (placing a token on its semaphore). The parking lot itself is an associative array from addresses to waitqueues, similar to what the kernel does for futexes, so that you can "park" yourself on a list and "unpark" the head of a list.
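
The per-thread primitive underneath that model can be sketched as follows; the address-keyed waitqueue that forms the lot itself, and the memory-ordering refinements a real implementation needs, are left out.

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct parker { atomic_int token; };    /* 0 = no token, 1 = token available */

    void park(struct parker *p)             /* consume a token, sleeping until one arrives */
    {
        int one = 1;
        while (!atomic_compare_exchange_strong(&p->token, &one, 0)) {
            /* No token yet: sleep as long as the word is still 0. */
            syscall(SYS_futex, &p->token, FUTEX_WAIT, 0, NULL, NULL, 0);
            one = 1;
        }
    }

    void unpark(struct parker *p)           /* deposit a token and wake the owner */
    {
        atomic_store(&p->token, 1);
        syscall(SYS_futex, &p->token, FUTEX_WAKE, 1, NULL, NULL, 0);
    }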

By moving waitqueue handling to userspace, you don't need a system call anymore for either requeueing or multiple-futex wait. On the other hand it makes the implementation of pthread_barrier_t less efficient because every slow-path wakeup needs a separate system call.

Rethinking the futex API

Posted Jun 21, 2020 17:44 UTC (Sun) by itsmycpu (guest, #139639) [Link]

> On the other hand it makes the implementation of pthread_barrier_t less efficient because every slow-path wakeup needs a separate system call.

Again a reason to add a "FUTEX_WAKE_MULTIPLE" kernel API, or "futex_wakev" ("wake" not "wait"), that takes a vector of multiple futex addresses as parameter.

Rethinking the futex API

Posted Jun 19, 2020 0:58 UTC (Fri) by krisman (subscriber, #102057) [Link] (1 responses)

> Waiting on multiple values can be done in user space.

Yes, it can be done in userspace, as mentioned in my original submission of Futex Wait Multiple last year. Still, we experimented with all that, and the performance of a userspace solution, or even of a solution based on eventfd (which would be the most obvious way to implement these semantics in userspace with the existing kernel support), is a bottleneck in our real-world scenarios of emulating Windows applications with Wine.

Rethinking the futex API

Posted Jun 19, 2020 19:48 UTC (Fri) by itsmycpu (guest, #139639) [Link]

I believe your idea to use futexes is far better than using eventfd for dealing with a large number of calls or events.

However I think the existing Futex API allows implementing wait-for-multiple-events with great performance. I have good experience implementing such functionality with existing APIs, and know of no conceptual restrictions there. See also my slightly more detailed comment above.

Rethinking the futex API

Posted Jun 19, 2020 9:58 UTC (Fri) by bno1 (guest, #138051) [Link] (1 responses)

>Waiting on multiple values can be done in user space. Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup.

Wine already has this. The esync patchset uses eventfd to wait on multiple winapi mutexes [1]. But some Windows programs run out of file descriptors because they leak mutexes. File descriptors are a much more limited resource than memory. Also, the patch for wait for multiple futexes (called fsync in the wine community) has slightly better performance [2].

[1] https://github.com/zfigura/wine/blob/esync/README.esync
[2] https://github.com/ValveSoftware/Proton/issues/2966

Rethinking the futex API

Posted Jun 23, 2020 7:04 UTC (Tue) by ncm (guest, #165) [Link]

> "But some Windows programs run out of file descriptors because they leak mutexes."

See, this is why W(in)e can't have nice things.

Rethinking the futex API

Posted Jun 19, 2020 14:57 UTC (Fri) by sbaugh (guest, #103291) [Link] (1 responses)

>Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup.

As far as I'm aware, this can only be done by dedicating a thread to each futex you want to wait on - not exactly cheap.

Rethinking the futex API

Posted Jun 19, 2020 19:59 UTC (Fri) by itsmycpu (guest, #139639) [Link]

> As far as I'm aware, this can only be done by dedicating a thread to each futex you want to wait on - not exactly cheap.

Conceptually, a futex is not something you wait on. It is something that you use to wait. It is a consumer, not a producer. The wake call is the producer.

You can use a single futex to wait for many things at the same time (using user space utility functions to establish the necessary connections).

Wait until all futexes are available?

Posted Jun 19, 2020 23:43 UTC (Fri) by glenn (subscriber, #102223) [Link] (7 responses)

> One of those is the ability to wait on multiple futexes at once and be awakened when any one of them becomes available.
Has any thought been given to waiting until _all_ the futexes are available? Some ordering of the mutexes could be imposed to avoid deadlock issues. Yes, these locks can be obtained serially in userspace, but this may result in as many as N-1 pointless wake-ups if N futexes are needed.

Wait until all futexes are available?

Posted Jun 20, 2020 1:12 UTC (Sat) by itsmycpu (guest, #139639) [Link] (6 responses)

> Has any thought been given to waiting until _all_ the futexes are available?
> Some sorting of mutexes could be imparted to avoid deadlock issues.
> Yes, these locks can be obtained serially in userspace, but this may result in as many as N-1 pointless wake-ups of N futexes are needed.

This problem relates to a specific approach, not to user space solutions in general. The goal is to determine contention without any kernel call at all.

The way you ask this question suggests that your life doesn't depend on the answer.
My suggestion: keep it that way.

Wait until all futexes are available?

Posted Jun 20, 2020 7:25 UTC (Sat) by glenn (subscriber, #102223) [Link] (5 responses)

Perhaps I need to rephrase my question: What can the kernel do to reduce the system overheads associated with nested locking?

Wait until all futexes are available?

Posted Jun 23, 2020 7:12 UTC (Tue) by ncm (guest, #165) [Link] (4 responses)

See, if your life has taken you to a place where you are thinking about ways to reduce what it costs to do nested locking, you have gone astray, and need to find your way back.

Wait until all futexes are available?

Posted Jun 25, 2020 8:39 UTC (Thu) by glenn (subscriber, #102223) [Link] (3 responses)

> The way you ask this question suggests that your life doesn't depend on the answer.

> See, if your life has taken you to a place where you are thinking about ways to reduce what it costs to do nested locking, you have gone astray, and need to find your way back.

Thanks for the life advice. I’ll bring it up with my therapist. Shall we get back to science and engineering?

Let me talk about some challenges I’ve run into developing a pipelined real-time task-graph runtime where I used the futex API directly to manage thread waiting and signalling. My workload is modeled by a set of nodes connected by directed edges. A node is a consumer of input edges, and it is a producer on output edges. Each node encapsulates a unit of work that is backed by a dedicated thread. This thread is ready to run when data is available on _all_ of its input edges.

Being a task-graph system, some nodes are join/fan-in points. These nodes consume data from several edges. How does a producer determine when it should wake the consumer? One method is to use a futex for each edge. The consumer can loop over these futexes until all inputs are satisfied. This may work, but it’s not particularly efficient: it doesn’t scale well (suppose your node has 10s or even 100s of inputs) as a thread may wake up many times before it does any real work. Recall that this is a runtime for a real-time application. These trivial wakeups can preempt other real-time work (e.g., active threads of other nodes) and induce threads to migrate to other CPUs (Linux is very aggressive about migrating preempted real-time threads to available CPUs), inducing cache affinity loss and additional context switch overheads. Also, this wakeup pattern is _very_ problematic for SCHED_DEADLINE. This scheduler allocates budget assuming that the thread in question exhibits a periodic/sporadic execution pattern. Our consumer thread does _not_ exhibit this pattern because it is constantly dealing with trivial wakeups. A SCHED_DEADLINE consumer can exhaust its budget dealing with trivial wakeups before the thread is ready to do any real work! This is due to SCHED_DEADLINE's budget reclaiming algorithm (see “2.2 Bandwidth reclaiming” in Documentation/scheduler/sched-deadline.txt).

An alternative approach is to have the consumer wait on a single futex. Now each producer is responsible for evaluating whether all of the inputs of the consumer have been satisfied. The last producer to satisfy these inputs wakes the consumer. This leads to a more efficient solution, but it is not without its trade-offs. A producer needs insight into the availability of data on the other edges of its consumer. This implies that the producers must run within the same process or share some state in shared memory. This makes for a cross-cutting software architecture where a composable one would feel much more natural. The lifetime of shared memory is not fun to manage. There are also security implications.

My task-graph framework would have been simpler if the futex API let my consumer wait on a set of futexes and wake only when all inputs had been satisfied.

In my runtime, I also explored the use of eventfd and epoll. I encountered the same problem: select, poll, and epoll all wake the caller when any one file descriptor is ready, not when all are ready. I am unaware of any synchronization or IPC mechanism in Linux that would let me design my software in a composable way while simultaneously avoiding the problems of trivial wakeups. I haven’t revisited my problem since io_uring landed--perhaps there’s a solution for me there. Does anyone have a suggestion? No need to suggest a therapist. I already have one. ;)

Wait until all futexes are available?

Posted Jun 25, 2020 11:22 UTC (Thu) by itsmycpu (guest, #139639) [Link] (1 responses)

> An alternative approach is to have the consumer wait on a single futex. Now each producer is responsible for evaluating whether all of the inputs of the consumer have been satisfied. The last producer to satisfy these inputs wakes the consumer. This leads to a more efficient solution, but it is not without its trade-offs. A producer needs insight into the availability of data on the other edges of its consumer.

Your situation may have specifics that I am not aware of, so I can only hope that this is going to make sense:

Each producer keeps track of whether a consumer has already been informed about the availability of data, in that consumer's list entry. That list entry of course also has a reference to the consumer. The consumer itself has an atomic counter for how many inputs it is still missing. When a producer is able to provide a new input, it reduces the consumer's (atomic) counter by 1. If it reaches zero at that point, it wakes the consumer.

Instead of a counter, for example an atomic bitset can be used. Then the consumer's list entry contains the specific bit that this producer should set. When a producer sees that it has set the last necessary bit, it wakes the consumer (or just sets that flag if the consumer isn't waiting). This is what I have used in some cases. In my case, the consumer is often doing other work while the flag is set, so that it often does not have to wait, and the whole process takes place in user space.
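
A minimal sketch of the counter variant (the names are illustrative, and it assumes the consumer sets its counter and clears the ready word before registering with its producers):

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct consumer {
        atomic_int missing;     /* inputs still outstanding                */
        atomic_int ready;       /* the single futex this consumer waits on */
    };

    void producer_deliver(struct consumer *c)
    {
        /* The producer supplying the last missing input does the one wake. */
        if (atomic_fetch_sub(&c->missing, 1) == 1) {
            atomic_store(&c->ready, 1);
            syscall(SYS_futex, &c->ready, FUTEX_WAKE, 1, NULL, NULL, 0);
        }
    }

    void consumer_wait(struct consumer *c)
    {
        /* Sleep only while inputs are outstanding and ready is still zero. */
        while (atomic_load(&c->missing) > 0)
            syscall(SYS_futex, &c->ready, FUTEX_WAIT, 0, NULL, NULL, 0);
    }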

> This implies that the producers must run within the same process or share some state in shared memory. This makes for a cross-cutting software architecture where a composable one would feel much more natural. The lifetime of shared memory is not fun to manage. There are also security implications.

As far as I know, using futexes between processes always requires shared memory, one way or the other. That probably makes them the fastest solution when blocking threads is required on the slow path.

(Of course futexes should only be used on the slow path.)

Wait until all futexes are available?

Posted Jun 30, 2020 15:42 UTC (Tue) by ncm (guest, #165) [Link]

I endorse the above. For nodes with a large number of inputs, each bit can represent another bitmap, recursively, until there are enough bits for all the inputs. When any bitmap is filled, the producer sets the bit representing it at the next level up. When the topmost bitmap is filled, the producer that filled it wakes the thread.

Bits are set by compare-and-swap, atomically, with no system call needed.

To do it with a counter, the producer atomically decrements a counter, and wakes the thread when it sees the transition to zero.

Waking the thread might mean writing a byte to a pipe it is sleeping on.

Or, the producer's thread might just add the consumer to a scoreboard of runnable tasks, and some already-running thread will get to it. That way, the kernel is never involved at all. You keep exactly as many threads as you have cores on the job, and they spin when there is no work. Isolate those cores (isolcpus, nohz_full, rcu_nocbs) so the kernel does not steal them.

Wait until all futexes are available?

Posted Jun 27, 2020 7:49 UTC (Sat) by ras (subscriber, #33059) [Link]

> An alternative approach is to have the consumer wait on a single futex.

Waking up precisely one waiter is always the fastest choice.

> This implies that the producers must run within the same process or share some state in shared memory.

Yes, but futexes only work with shared memory. Communicating via shared memory is orders of magnitude faster than any other method. That's where futexes get their speed from. So if you want speed, shared memory is unavoidable.

If speed is not such a big issue there are as you say lots of choices, including epoll(), pipes, semop()'s ...

> My task-graph framework would have been simpler if the futex API let my consumer wait on a set of futexes and wake only when all inputs had been satisfied.

Of course it would be. Things are always simpler when there is a pre-existing library that does exactly what you want. What triggered my interest here is the claim that library should be in the kernel.

Rethinking the futex API - why change the type that uaddr points to?

Posted Jun 23, 2020 4:00 UTC (Tue) by pr1268 (guest, #24648) [Link] (2 responses)

From the futex(2) man page:

    int futex(int *uaddr, [...]

And from the proposed patch set mentioned below:

    int futex_wait(void *uaddr, [...]
    int futex_wait_time32(void *uaddr, [...]
    int futex_wake(void *uaddr, [...]

Just curious, since I couldn't find it in our editor's article, nor in the comments: why is uaddr being changed from an int * to a void *?

Rethinking the futex API - why change the type that uaddr points to?

Posted Jun 23, 2020 13:02 UTC (Tue) by corbet (editor, #1) [Link] (1 responses)

I would guess that it has to do with the desire to support different sizes of futexes. Once that's in place, the type of that pointer will vary, so it pretty much has to be void *.

Rethinking the futex API - why change the type that uaddr points to?

Posted Jun 23, 2020 17:46 UTC (Tue) by itsmycpu (guest, #139639) [Link]

> I would guess that it has to do with the desire to support different sizes of futexes. Once that's in place, the type of that pointer will vary, so it pretty much has to be void *.

Is it worth the effort and cost of a larger kernel API, though? An application level feature can easily use another atomic variable (which it might like to do anyway), it doesn't have to use the futex value itself, except on the slow path. Not every convenience requires its own kernel API.

The use of futexes requires specialized understanding *and* experience, and application writers are usually better off using higher level library functions like semaphores and mutexes. Leave the support of multiple sizes to the libraries. Or add a Linux-specific C library directly supported by the kernel developers (which can then perhaps also be used by the kernel to communicate with user space processes).

My suggestion is to (first) add a kernel API that can actually make a performance difference: a WAKE-multiple-futexes-with-one-call API. ("Wake" not "wait"). Should be easy to implement (for a kernel developer) and not interfere with the existing functionality and concepts.

