
Rethinking the futex API


Posted Jun 19, 2020 0:24 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
In reply to: Rethinking the futex API by ras
Parent article: Rethinking the futex API

> Waiting on multiple values can be done in user space.
Not efficiently.

> Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup.
Not in all environments.



Rethinking the futex API

Posted Jun 19, 2020 0:59 UTC (Fri) by ras (subscriber, #33059) [Link] (25 responses)

>> Waiting on multiple values can be done in user space.
>Not efficiently.

Why not efficiently? Can you give me an example?

>> Association with a fd can be done with a userspace library writing a byte to a fifo on wakeup.
>Not in all environments.

Again, what environments?

Rethinking the futex API

Posted Jun 19, 2020 1:25 UTC (Fri) by HenrikH (subscriber, #31152) [Link] (12 responses)

How would you wait on multiple futexes efficiently in userspace? You would have to busy-wait and sprinkle in some sleep(0)/sched_yield() nonsense calls here and there to avoid wasting CPU cycles when none of the futexes is triggered, instead of just having the kernel put your thread to sleep until any of them gets triggered.

Rethinking the futex API

Posted Jun 19, 2020 2:32 UTC (Fri) by ras (subscriber, #33059) [Link] (11 responses)

You do it the same way you do it in kernel space. Maybe it's a user space library that uses futexes to create user space mutexes, perhaps known as a utex_t. The library stores its utex_t implementation in memory shared between the threads / processes doing the waiting. To cater for a waitv, it associates a pointer with each utex_t that says "utex_wake has to wake a whole pile of utex_t's". The code isn't going to look much different regardless of whether it is in kernel space or user space.

Yes, it's complex, racy and horrible to debug. Putting it in the kernel doesn't solve that. User space RCU (and yes, there is such a thing) is no more or less complex than the kernel version. The complexity has to be somewhere - hiding it in the kernel may seem neater, but it binds it to an unchanging ABI, so support- and maintainability-wise it's worse than putting it in a user space library.

Putting stuff in the kernel is often a win because it can roll up multiple syscalls into one, but mutexes are peculiar because if they are not contested there is no need for a syscall. That peculiarity was the primary driving force behind futexes - the fast mutex. It's not like the kernel didn't already provide user space with multiple other mechanisms to do the same thing - they were just all slower because they involved one or more syscalls for the fast path. And yet here we are with a proposal for futex_waitv() that must always make a syscall, which could be avoided in the uncontested case if it were implemented in user space.

Rethinking the futex API

Posted Jun 19, 2020 4:55 UTC (Fri) by krisman (subscriber, #102057) [Link] (6 responses)

> You do it the same way you do it in kernel space. Maybe it's a user space library that uses futexes to
> create user space mutexes, perhaps known as a utex_t. The library stores its utex_t implementation in
> memory shared between the threads / processes doing the waiting. To cater for a waitv, it associates
> a pointer with each utex_t that says "utex_wake has to wake a whole pile of utex_t's". The code isn't
> going to look much different regardless of whether it is in kernel space or user space.

How do you wake up multiple threads from your producer (1) without issuing multiple syscalls to wake each futex, (2) avoiding spurious wakeups of unrelated threads, or relying on (3) poll (which we've shown is not fast enough); while (4) not relying on your waiters spinning (which throws cpu utilization through the roof)? One of the main motivations of the patchset, which might have been lost in the article, is to reduce cpu utilization.

Can you better describe what your model would look like in memory? One of my use cases is a producer-consumer scenario where each consumer waits on a subset of events triggered by different producers. Each consumer waits on a different subset, but there are intersections between subsets for different consumers. One or more consumers are woken up when any of their events trigger. How do you represent this design in your model?

>And yet here we are with a proposal for futex_waitv() that must always make a syscall

That is not true. For the uncontested case, a waiter can grab a futex without going into the kernel. The semantics for FWM are "any of the futexes in the list" so if the consumer acquires any futex, it doesn't call futex_waitv.

Rethinking the futex API

Posted Jun 19, 2020 6:42 UTC (Fri) by ras (subscriber, #33059) [Link]

> How do you wake up multiple threads from your producer (1) without issuing multiple syscalls to wake each futex,

If they are waiting on the same futex, one call will suffice. If they are waiting on multiple futexes, multiple wakes will be needed of course. But remember, futexes are optimised for the fast path. The fast path is when the lock is uncontested. In that case no one is waiting, and no one has to be woken up, and so there are no syscalls at all. If you do have to wake another thread up, you've already incurred the overhead of one syscall: the FUTEX_WAIT that put it to sleep, so it's not an untenable additional penalty.

> (2) avoiding spurious wakeups of unrelated threads,

Why are there wakeups on unrelated threads? This wonderful utex library has recorded, in its shared memory, which utex_t's each thread is willing to be woken up by, and wakes up as many as needed. It's not like the kernel can do it differently: it also has to look up what user space threads are waiting on the futex, and wake some up. The basic algorithm and code doesn't change when it moves from kernel space to user space.

> (4) not relying in your waiters spinning (which throws cpu utilization through the roof)

Why is anybody spinning - kernel or user space? Well OK - the kernel might spin because your critical section isn't long and so it's faster than mutex_lock(). But that's just a speed optimisation - you could always just use a mutex_lock(). In user space spinning always requires a syscall - perhaps many if you end up spinning for a while - so you are always better off with the user space mutex_lock() equivalent, which is futex(), and hope it is uncontested. If it is uncontested there are no syscalls, and remember, if the kernel is spinning, then someone must have done a syscall to the kernel already, which is another way of saying futex_waitv() takes the syscall hit every time.

> Can you better describe what your model would look like in memory?

Ahh, that is a long answer. And you're stretching my memory, because it was 30 years ago that I did it, for an OS I developed and sold. It's too long to go into in detail, but briefly it was based on two primitives: one that looked like utex_waitv(). The second primitive was based on a new trick: every utex allowed you to attach an arbitrary number of child utexes to it. Waking up such a child transparently woke up the parent instead of the child, so a utex_waitv() on a child would never succeed. utex_waitv() returned the utex that woke up, which might be a child utex of one of the utexes in the top level utex_waitv() vector.

utex_wake() becomes more complex of course: it has to check if the child had a parent, and indeed if the parent had a parent, in order to wake up the right utex. I put the woken-up utexes in a queue attached to the parent because I wanted to preserve the order of wakeups, but simply setting a flag in the child utex and then having utex_wake() go looking for it would also work.

I actually had a 3rd feature which would probably be frowned upon today: you could attach a function to a utex and instead of waking it up, utex_wake() would call that function, effectively turning the utex into an interrupt-like thingy. That was how the parent / child feature was actually implemented - the child had a function attached that woke up the parent. This feature has the interesting property of letting a utex transparently defer to some other mutex / condition mechanism. For example, instead of waking up a parent, you could write a byte to a fifo, or signal with semop(). The person calling utex_wake() is none the wiser.

In the general case you have to lock a lot of things when you are altering the resulting tree of mutexes. In my case I can't remember that happening. Instead, one thread, the one that owned the parent utex, added and deleted children, and since a child can only be attached to one parent there is no contention.

BTW, I've never found a use for waking up more than one thread at a time. It obviously has its uses if you really want to release a thundering herd, but a mutex usually guards a shared volatile resource that only one thread can change at a time, so there is no point waking more than one. If the woken thread discovers it doesn't want the resource, it can always wake someone else up.

This all requires a bit of work. You aren't dolling up futexes with a thin wrapper, you are implementing a new thing that uses futexes where the kernel would use spinlocks, IRQ masking or mutex_lock(). It is duplicating the existing code and data structures in the kernel. And there are more complexities I haven't mentioned, like when a wake calls a function that does a wait. However, because of that remarkable property of mutexes that most code isn't executed unless there is contention, it doesn't matter speed-wise, because none of it is touched in the fast path.

Rethinking the futex API

Posted Jun 19, 2020 19:35 UTC (Fri) by itsmycpu (guest, #139639) [Link] (4 responses)

> How do you wake up multiple threads from your producer
> (1) without issuing multiple syscalls to wake each futex,
> (2) avoiding spurious wakeups of unrelated threads, or relying on
> (3) poll (which we've shown is not fast enough); while
> (4) not relying in your waiters spinning (which throws cpu utilization through the roof)?
> One of the main motivations of the patchset, which might have been lost in the article, is to reduce cpu utilization.

I agree with the posters who say wait-multiple can be implemented efficiently in user space using the existing Futex API.

Re (1): It might make sense to add a wake-multiple call (as opposed to a wait-multiple) since that doesn't require copying dynamic lists into the kernel.
This would be much, much simpler.
Re (2): You can protect against spurious wakeups by re-checking atomic variables after wake up. However what are "unrelated threads"?
Re (3): A user space implementation doesn't require polling (except once, which is good for performance).
Re (4): You have the same general options implementing this as you do when implementing locks.

> Can you better describe what your model would look like in memory?
> One of my use cases is a producer-consumer scenario where each consumer waits on a subset of events triggered by different producers.

When necessary, you associate a list of consumers with each event producer, each list entry containing a pointer to the waiter's futex and/or atomic guard variable.
Depending on the type of producer, you mark and/or wake one or all consumers. Why is that a question, did you try and have problems with this approach?

The kernel would otherwise have to do the same or similar processing; however, in my understanding, the kernel should avoid, for example, copying and maintaining dynamic memory lists. It is better done in user space, also for performance (and to avoid impacting the performance of other applications). At least I know this works well in my own use cases.

Additionally, in use cases where the same "subset" is used repeatedly, these associations don't have to be re-established for each call.

Rethinking the futex API

Posted Jun 20, 2020 2:54 UTC (Sat) by krisman (subscriber, #102057) [Link] (3 responses)

> When necessary, you associate a list of consumers with each event producer, each list entry containing
> a pointer to the waiters futex and/or atomic guard variable.

So, your design has one futex per consumer. This means that when a producer signals, in order to wake all of the entries in its waitqueue, it walks the list and does a FUTEX_WAKE syscall per entry. That's much worse than FWM.

Rethinking the futex API

Posted Jun 20, 2020 3:38 UTC (Sat) by ras (subscriber, #33059) [Link]

> So, your design has one futex per consumer.

At this point I'm lost. You obviously have some specific use case in mind none of this fits. Could you post a link / explain what it is?

Rethinking the futex API

Posted Jun 20, 2020 3:58 UTC (Sat) by itsmycpu (guest, #139639) [Link] (1 responses)

> So, your design has one futex per consumer. This means that when a producer signals,
> in order to wake all of the entries in its waitqueue, it walks the list and does a FUTEX_WAKE syscall per entry. That's much worse than FWM.

Almost, but not with that disadvantage. As I indicated, "in my design" there is (optionally) a FUTEX_WAKE_MULTIPLE (not wait_multiple) kernel API which is much simpler to implement, also because it doesn't need to copy and maintain the list internally in the kernel. It does multiple FUTEX_WAKE operations in a single syscall. And this can be added later on, as an optimization, if and when such a call becomes available in the kernel, while the rest can already be implemented, tested and tuned with the kernel as is.

Since a FUTEX_WAKE is only needed for futexes that actually have thread(s) waiting, and the waiting itself is already a time consuming kernel call, one non-blocking WAKE call per waiting futex shouldn't be that bad, although it will be better to have a wake_multiple call, which solves that problem (if it is one) completely.

Rethinking the futex API

Posted Jun 20, 2020 4:05 UTC (Sat) by itsmycpu (guest, #139639) [Link]

(Also, for example, the FUTEX_WAKE calls can be offloaded to a worker thread, if the callers are high priority.)

Rethinking the futex API

Posted Jun 19, 2020 4:58 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (2 responses)

> And yet here we are with a proposal for futex_waitv() that must always make a syscall, which could be avoided in the uncontested case if it was implemented in user space.

Why must it make a syscall in the uncontested case? It seems to me that it could first try to lock each of the futexes in user space and only make the syscall if none of them were available—much like the current implementation does for a single futex.

Rethinking the futex API

Posted Jun 19, 2020 5:31 UTC (Fri) by HenrikH (subscriber, #31152) [Link]

Exactly! In Krisman's patch to LKML he specifically wrote:

"Before calling the syscall again, userspace should traverse the list, trying to re-acquire any of the other futexes, to prevent an immediate -EWOULDBLOCK return code from the kernel."

Rethinking the futex API

Posted Jun 19, 2020 6:58 UTC (Fri) by ras (subscriber, #33059) [Link]

> Why must it make a syscall in the uncontested case?

It doesn't have to be. But this isn't just a simple compare and swap. You are checking a vector, which requires many steps. These steps have to be done atomically along with the following futex_waitv() if needed. How are you going to do it?

I've been programming long enough to be very confident you can't trust your average programmer to get that sort of thing right. Hell, I'm pretty confident I won't get it right on the first try. That in itself isn't a problem: you wrap it in a user space library so they can't get it wrong.

But then, if you are implementing a user space library anyway, and the library could implement the futex_waitv() in user space and it would be faster to boot in the uncontested case, what have you gained by doing it in the kernel?

Rethinking the futex API

Posted Jun 19, 2020 5:36 UTC (Fri) by HenrikH (subscriber, #31152) [Link]

But this call is not about the uncontested case, it's so that the kernel can put a thread to sleep that is waiting for one (or more) of the contested futexes to no longer be contested. That you cannot do from userspace.

Rethinking the futex API

Posted Jun 19, 2020 2:31 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

> Again, what environments?
For example, if you can't open new files (not enough FDs or to not mess up the file descriptor numbers).

Rethinking the futex API

Posted Jun 19, 2020 2:50 UTC (Fri) by ras (subscriber, #33059) [Link] (10 responses)

What's that got to do with futexes? They don't use an FD.

Rethinking the futex API

Posted Jun 19, 2020 3:08 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

Uhh... You stated that futex functionality can be implemented in userspace: https://lwn.net/Articles/823579/

Rethinking the futex API

Posted Jun 19, 2020 3:50 UTC (Fri) by ras (subscriber, #33059) [Link] (8 responses)

If it reads that way I've written it badly.

I did not mean current futexes could be implemented in userspace. I meant the candy being proposed here could be implemented in user space, candy such as futex_waitv() and allowing different-sized ints for the uaddr argument. You need something fast, simple and small to base that on, and that something is futexes. That was the whole point of futexes - the existing alternatives like semop() were not fast or simple. If we aren't careful, futexes will go the same way.

One of the complaints about the current futex implementation is the code is hard to follow. One reason for that is it's been expanded to do stuff that could have been done in user space. FUTEX_FD is one such thing. I suspect that FUTEX_REQUEUE is another - it could be done in user space with no more syscalls than are used now. futex_waitv() is definitely in the same class. You don't fix complexity by adding more features.

Rethinking the futex API

Posted Jun 19, 2020 5:14 UTC (Fri) by krisman (subscriber, #102057) [Link] (7 responses)

> One of the complaints about the current futex implementation is the code is hard to follow. One reason
> for that is it's been expanded to do stuff that could have been done in user space. FUTEX_FD is one
> such thing. I suspect that FUTEX_REQUEUE is another - it could be done in user space with no more
> syscalls than are used now. futex_waitv() is definitely in the same class. You don't fix complexity by
> adding more features.

For what it's worth, yes, the futex interface has grown out of hand with deprecated features. FUTEX_FD was dropped long ago, so that is not a big deal. And yes, most futex features can be implemented in userspace with the WAKE/WAIT pair, at additional cost.

REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakes. How would you requeue in userspace without more syscalls and unnecessary wakeups? I don't see how it is possible; in fact, CMP_REQUEUE is planned to exist in a future API.

Rethinking the futex API

Posted Jun 19, 2020 6:50 UTC (Fri) by ras (subscriber, #33059) [Link] (4 responses)

> REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakes. How would you requeue in userspace without more syscalls and unnecessary wakeups? I don't see how it is possible; in fact, CMP_REQUEUE is planned to exist in a future API.

You couldn't.

I was querying whether the extra syscall complexity is worth the cost. It's not like you can't do it in user space - it's just more expensive. You've already incurred one syscall, so the orders-of-magnitude speed improvement the futex fast path got you is gone. You are now trying to gain a factor of 2 or 3 or something. If it was a very frequent use case, then perhaps - but it looks esoteric to me. If complexity is an issue, esoteric should be shown the door.

Rethinking the futex API

Posted Jun 23, 2020 20:56 UTC (Tue) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

It is a frequent use case: POSIX condition variables. These are associated with a mutex. A thread calling pthread_cond_wait must do so while owning the mutex associated with the condition variable. The mutex is then released and the thread put to sleep until the condition is signalled. On wakeup, the mutex must be reacquired by the thread prior to the function call returning. If more than one thread is woken up because the condition is signalled, they would thus all race to exit the kernel, most of them would find the mutex locked and enter the kernel again in order to block on that.

With REQUEUE/CMP_REQUEUE, it's possible to ask the kernel to wake up one thread, which then acquires the mutex and (presumably) does something useful, while the others are moved from the condvar futex to the mutex futex without waking them up.

Rethinking the futex API

Posted Jun 23, 2020 22:01 UTC (Tue) by ras (subscriber, #33059) [Link] (2 responses)

Thanks for that.

My point wasn't that thundering herd isn't a problem. It clearly is.

My point was that it is already solved by the current futex() API. You get to decide how many of those syscall(SYS_futex, &futex, FUTEX_WAIT)'s you wake up. It can be anywhere from 1 to all. The others just sit around, waiting for another FUTEX_WAKE.

To solve the pthread_cond_wait() case you just wake up one, so it could have been done without REQUEUE now. I'll grant you that's not all REQUEUE does: it also prevents the futex from seeing any future wakeups. But that could equally well be done from userspace using more futexes. Perhaps it's a speed optimisation (I do hope they benchmarked it), or more likely arranging for everyone to share information the kernel already had was hard, so they took the easy way out.

Rethinking the futex API

Posted Jun 24, 2020 14:54 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

The library implementation doesn't get to decide this: application code can either use pthread_cond_signal to wake exactly one thread or pthread_cond_broadcast to wake all threads. With the naive implementation, the kernel goes through a linked list and wakes everything while tearing the list apart. All the awoken threads exit the kernel and compete for ownership of the same mutex. One of these threads will end up getting the mutex; the others all do system calls in order to enter the kernel again, where kernel code then builds a new linked list of them which is attached to a different futex.

That's an insanely complicated way to transfer a linked list from a head pointer a in the kernel to another head pointer b in the kernel. It's much simpler if the kernel just wakes one thread and transfers the existing list to the other futex.


Rethinking the futex API

Posted Jun 24, 2020 19:17 UTC (Wed) by itsmycpu (guest, #139639) [Link]

> That's an insanely complicated way to transfer a linked list from a head pointer a in the kernel to another head pointer b in the kernel. It's much simpler if the kernel just wakes one thread and transfers the existing list to the other futex.

I'm not sure if you are discussing a more specific question, but just to point it out: for example, a user space implementation of the producer/consumer model can (if so designed) just move a waiting entry from one producer's list to another producer's list without any wakeups or even kernel calls (except a wake call in the case where one is ready).

Rethinking the futex API

Posted Jun 19, 2020 10:43 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (1 responses)

> REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakes. How would you requeue in userspace without more syscalls and unnecessary awakes?

You would just do things differently. A model that is relatively common is the "parking lot", where each thread has a single binary semaphore attached to it. The primitives in this model are "parking" yourself (consuming the token in the semaphore, or, if it's not there, waiting for one and consuming it as soon as it arrives) and "unparking" another thread (placing a token on its semaphore). The parking lot itself is an associative array from addresses to waitqueues, similar to what the kernel does for futexes, so that you can "park" yourself on a list and "unpark" the head of a list.

By moving waitqueue handling to userspace, you don't need a system call anymore for either requeueing or multiple-futex wait. On the other hand it makes the implementation of pthread_barrier_t less efficient because every slow-path wakeup needs a separate system call.

Rethinking the futex API

Posted Jun 21, 2020 17:44 UTC (Sun) by itsmycpu (guest, #139639) [Link]

> On the other hand it makes the implementation of pthread_barrier_t less efficient because every slow-path wakeup needs a separate system call.

Again a reason to add a "FUTEX_WAKE_MULTIPLE" kernel API, or "futex_wakev" ("wake" not "wait"), that takes a vector of multiple futex addresses as parameter.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds