
Rethinking the futex API

Posted Jun 19, 2020 1:25 UTC (Fri) by HenrikH (subscriber, #31152)
In reply to: Rethinking the futex API by ras
Parent article: Rethinking the futex API

How would you wait on multiple futexes efficiently in user space? You would have to busy-wait and sprinkle in some sleep(0)/sched_yield() nonsense calls here and there to avoid wasting CPU cycles when none of the futexes has been triggered, instead of just having the kernel put your thread to sleep until any of them gets triggered.



Rethinking the futex API

Posted Jun 19, 2020 2:32 UTC (Fri) by ras (subscriber, #33059) [Link] (11 responses)

You do it the same way you do it in kernel space. Maybe it's a user-space library that uses futexes to create user-space mutexes, perhaps known as a utex_t. The library stores its utex_t implementation in memory shared between the threads/processes doing the waiting. To cater for a waitv, it associates a pointer with each utex_t that says "utex_wake has to wake a whole pile of utex_t's". The code isn't going to look much different regardless of whether it is in kernel space or user space.

Yes, it's complex, racy and horrible to debug. Putting it in the kernel doesn't solve that. User-space RCU (and yes, there is such a thing) is no more or less complex than the kernel version. The complexity has to be somewhere; hiding it in the kernel may seem neater, but it binds it to an unchanging ABI, so support- and maintainability-wise it's worse than putting it in a user-space library.

Putting stuff in the kernel is often a win because it can roll up multiple syscalls into one, but mutexes are peculiar because, if they are not contested, there is no need for a syscall. That peculiarity was the primary driving force behind futexes: the fast mutex. It's not as if the kernel didn't already provide user space with multiple other mechanisms to do the same thing; they were just all slower because they involved one or more syscalls on the fast path. And yet here we are with a proposal for futex_waitv() that must always make a syscall, which could be avoided in the uncontested case if it were implemented in user space.

Rethinking the futex API

Posted Jun 19, 2020 4:55 UTC (Fri) by krisman (subscriber, #102057) [Link] (6 responses)

> You do it the same way you do it in kernel space. Maybe it's a user-space library that uses futexes to
> create user-space mutexes, perhaps known as a utex_t. The library stores its utex_t implementation in
> memory shared between the threads/processes doing the waiting. To cater for a waitv, it associates
> a pointer with each utex_t that says "utex_wake has to wake a whole pile of utex_t's". The code isn't
> going to look much different regardless of whether it is in kernel space or user space.

How do you wake up multiple threads from your producer (1) without issuing multiple syscalls to wake each futex, (2) avoiding spurious wakeups of unrelated threads, or relying on (3) poll (which we've shown is not fast enough), while (4) not relying on your waiters spinning (which throws CPU utilization through the roof)? One of the main motivations of the patchset, which might have been lost in the article, is to reduce CPU utilization.

Can you better describe what your model would look like in memory? One of my use cases is a producer-consumer scenario where each consumer waits on a subset of events triggered by different producers. Each consumer waits on a different subset, but there are intersections between the subsets for different consumers. One or more consumers are woken up when any of their events trigger. How do you represent this design in your model?

>And yet here we are with a proposal for futex_waitv() that must always make a syscall

That is not true. In the uncontested case, a waiter can grab a futex without going into the kernel. The semantics for FWM are "any of the futexes in the list", so if the consumer acquires any futex, it doesn't call futex_waitv().

Rethinking the futex API

Posted Jun 19, 2020 6:42 UTC (Fri) by ras (subscriber, #33059) [Link]

> How do you wake up multiple threads from your producer (1) without issuing multiple syscalls to wake each futex,

If they are waiting on the same futex, one call will suffice. If they are waiting on multiple futexes, multiple wakes will be needed, of course. But remember, futexes are optimised for the fast path. The fast path is when the lock is uncontested. In that case no one is waiting, no one has to be woken up, and so there are no syscalls at all. If you do have to wake another thread up, you've already incurred the overhead of one syscall, the FUTEX_WAIT that put it to sleep, so it's not an untenable additional penalty.

> (2) avoiding spurious wakeups of unrelated threads,

Why are there wakeups of unrelated threads? This wonderful utex library has recorded, in its shared memory, which utex_t's each thread is willing to be woken up on, and wakes up as many as needed. It's not as if the kernel can do it differently: it also has to look up which user-space threads are waiting on the futex and wake some of them up. The basic algorithm and code don't change when they move from kernel space to user space.

> (4) not relying in your waiters spinning (which throws cpu utilization through the roof)

Why is anybody spinning, kernel or user space? Well, OK, the kernel might spin because your critical section isn't long and so it's faster than mutex_lock(). But that's just a speed optimisation; you could always just use a mutex_lock(). In user space, spinning always requires a syscall, perhaps many if you end up spinning for a while, so you are always better off with the user-space mutex_lock() equivalent, which is futex(), and hoping it is uncontested. If it is uncontested there are no syscalls, and remember, if the kernel is spinning, then someone must have done a syscall to the kernel already, which is another way of saying futex_waitv() takes the syscall hit every time.

> Can you better describe what your model would look like in memory?

Ahh, that is a long answer. And you're stretching my memory, because it was 30 years ago that I did it, for an OS I developed and sold. It's too long to go into in detail, but briefly, it was based on two primitives. The first looked like utex_waitv(). The second was based on a new trick: every utex allowed you to attach an arbitrary number of child utexes to it. Waking up such a child transparently woke up the parent instead of the child, so a utex_waitv() on a child would never succeed. utex_waitv() returned the utex that woke up, which might be a child utex of one of the utexes in the top-level utex_waitv() vector.

utex_wake() becomes more complex, of course: it has to check whether the child has a parent, and indeed whether the parent has a parent, in order to wake up the right utex. I put the woken-up utexes in a queue attached to the parent because I wanted to preserve the order of wakeups, but simply setting a flag in the child utex and then having utex_wake() go looking for it would also work.

I actually had a third feature, which would probably be frowned upon today: you could attach a function to a utex, and instead of waking it up, utex_wake() would call that function, effectively turning the utex into an interrupt-like thingy. That was how the parent/child feature was actually implemented: the child had a function attached that woke up the parent. This feature has the interesting property of letting a utex transparently defer to some other mutex/condition mechanism. For example, instead of waking up a parent, you could write a byte to a fifo, or signal a semop(). The caller of utex_wake() is none the wiser.

In the general case you have to lock a lot of things when you are altering the resulting tree of utexes. In my case I can't remember that happening. Instead, only one thread, the one that owned the parent utex, added and deleted children, and since a child can only be attached to one parent there is no contention.

BTW, I've never found a use for waking up more than one thread at a time. It obviously has its uses if you really want to release a thundering herd, but a mutex usually guards a shared volatile resource that only one thread can change at a time, so there is no point in waking more than one. If the woken thread discovers it doesn't want the resource, it can always wake someone else up.

This all requires a bit of work. You aren't dolling up futexes with a thin wrapper; you are implementing a new thing that uses futexes where the kernel would use spinlocks, IRQ masking or mutex_lock(). It is duplicating the existing code and data structures in the kernel. And there are more complexities I haven't mentioned, like what happens when a wake calls a function that does a wait. However, because of that remarkable property of mutexes, that most of the code isn't executed unless there is contention, it doesn't matter speed-wise, because none of it is touched in the fast path.

Rethinking the futex API

Posted Jun 19, 2020 19:35 UTC (Fri) by itsmycpu (guest, #139639) [Link] (4 responses)

> How do you wake up multiple threads from your producer
> (1) without issuing multiple syscalls to wake each futex,
> (2) avoiding spurious wakeups of unrelated threads, or relying on
> (3) poll (which we've shown is not fast enough); while
> (4) not relying on your waiters spinning (which throws CPU utilization through the roof)?
> One of the main motivations of the patchset, which might have been lost in the article, is to reduce cpu utilization.

I agree with the posters who say wait-multiple can be implemented efficiently in user space using the existing Futex API.

Re (1): It might make sense to add a wake-multiple call (as opposed to a wait-multiple), since that doesn't require copying dynamic lists into the kernel.
This would be much, much simpler.
Re (2): You can protect against spurious wakeups by re-checking atomic variables after wakeup. However, what are "unrelated threads"?
Re (3): A user-space implementation doesn't require polling (except once, which is good for performance).
Re (4): You have the same general options implementing this as you do when implementing locks.

> Can you better describe what your model would look like in memory?
> One of my use cases is a producer-consumer scenario where each consumers wait on a subset of events triggered by different producers.

When necessary, you associate a list of consumers with each event producer, each list entry containing a pointer to the waiter's futex and/or atomic guard variable.
Depending on the type of producer, you mark and/or wake one or all consumers. Why is that a question? Did you try this approach and have problems with it?

The kernel would have to do the same or similar processing otherwise; however, in my understanding it is something to be avoided for the kernel, for example, to copy and maintain dynamic memory lists. It is better done in user space, also for performance (and to avoid impacting the performance of other applications). At least I know this works well in my own use cases.

Additionally, in use cases where the same "subset" is used repeatedly, these associations don't have to be re-established for each call.

Rethinking the futex API

Posted Jun 20, 2020 2:54 UTC (Sat) by krisman (subscriber, #102057) [Link] (3 responses)

> When necessary, you associate a list of consumers with each event producer, each list entry containing
> a pointer to the waiters futex and/or atomic guard variable.

So, your design has one futex per consumer. This means that when a producer signals, in order to wake all of the entries in its waitqueue, it walks the list and does a FUTEX_WAKE syscall per entry. That's much worse than FWM.

Rethinking the futex API

Posted Jun 20, 2020 3:38 UTC (Sat) by ras (subscriber, #33059) [Link]

> So, your design has one futex per consumer.

At this point I'm lost. You obviously have some specific use case in mind that none of this fits. Could you post a link or explain what it is?

Rethinking the futex API

Posted Jun 20, 2020 3:58 UTC (Sat) by itsmycpu (guest, #139639) [Link] (1 responses)

> So, your design has one futex per consumer. This means that when a producer signals,
> in order to wake all of the entries in its waitqueue, it walks the list and does a FUTEX_WAKE syscall per entry. That's much worse than FWM.

Almost, but not with that disadvantage. As I indicated, "in my design" there is (optionally) a FUTEX_WAKE_MULTIPLE (not wait-multiple) kernel API which is much simpler to implement, also because it doesn't need to copy and maintain the list internally in the kernel. It does multiple FUTEX_WAKE operations in a single syscall. And this can be added later on, as an optimization, if and when such a call becomes available in the kernel, while the rest can already be implemented, tested and tuned with the kernel as is.

Since a FUTEX_WAKE is only needed for futexes that actually have threads waiting, and the waiting itself is already a time-consuming kernel call, one non-blocking WAKE call per waiting futex shouldn't be that bad, although it would be better to have a wake-multiple call, which solves that problem (if it is one) completely.

Rethinking the futex API

Posted Jun 20, 2020 4:05 UTC (Sat) by itsmycpu (guest, #139639) [Link]

(Also, for example, the FUTEX_WAKE calls can be offloaded to a worker thread if the callers are high priority.)

Rethinking the futex API

Posted Jun 19, 2020 4:58 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (2 responses)

> And yet here we are with a proposal for futex_waitv() that must always make a syscall, which could be avoid in the uncontested case if it was implemented in user space.

Why must it make a syscall in the uncontested case? It seems to me that it could first try to lock each of the futexes in user space and only make the syscall if none of them were available—much like the current implementation does for a single futex.

Rethinking the futex API

Posted Jun 19, 2020 5:31 UTC (Fri) by HenrikH (subscriber, #31152) [Link]

Exactly! In Krisman's patch to LKML he specifically wrote:

"Before calling the syscall again, userspace should traverse the list, trying to re-acquire any of the other futexes, to prevent an immediate -EWOULDBLOCK return code from the kernel."

Rethinking the futex API

Posted Jun 19, 2020 6:58 UTC (Fri) by ras (subscriber, #33059) [Link]

> Why must it make a syscall in the uncontested case?

It doesn't have to be. But this isn't just a simple compare-and-swap. You are checking a vector, which requires many steps. These steps have to be done atomically along with the following futex_waitv(), if needed. How are you going to do that?

I've been programming long enough to be very confident that you can't trust your average programmer to get that sort of thing right. Hell, I'm pretty confident I won't get it right on the first try. That in itself isn't a problem: you wrap it in a user-space library so they can't get it wrong.

But then, if you are implementing a user-space library anyway, and the library could implement futex_waitv() in user space, and it would be faster to boot in the uncontested case, what have you gained by doing it in the kernel?

Rethinking the futex API

Posted Jun 19, 2020 5:36 UTC (Fri) by HenrikH (subscriber, #31152) [Link]

But this call is not about the uncontested case, it's so that the kernel can put a thread to sleep that is waiting for one (or more) of the contested futexes to no longer be contested. That you cannot do from userspace.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds