Rethinking the futex API

Posted Jun 19, 2020 4:55 UTC (Fri) by krisman (subscriber, #102057)
In reply to: Rethinking the futex API by ras
Parent article: Rethinking the futex API

> You do it the same way you do it in kernel space. Maybe it's a user space library that uses futexes to
> create user space mutexes, perhaps known as a utex_t. The library stores its utex_t implementation in
> memory shared between the threads / processes doing the waiting. To cater for a waitv, it associates
> a pointer with each utex_t that says "utex_wake has to wake a whole pile of utex_t's". The code isn't
> going to look much different regardless of whether it is in kernel space or user space.

How do you wake up multiple threads from your producer (1) without issuing multiple syscalls to wake each futex, (2) without spurious wakeups of unrelated threads, (3) without relying on poll (which we've shown is not fast enough), and (4) without relying on your waiters spinning (which sends CPU utilization through the roof)? One of the main motivations of the patchset, which might have been lost in the article, is to reduce CPU utilization.

Can you better describe what your model would look like in memory? One of my use cases is a producer-consumer scenario where each consumer waits on a subset of events triggered by different producers. Each consumer waits on a different subset, but the subsets of different consumers intersect. One or more consumers are woken up when any of their events trigger. How would you represent this design in your model?

> And yet here we are with a proposal for futex_waitv() that must always make a syscall

That is not true. For the uncontested case, a waiter can grab a futex without going into the kernel. The semantics for FWM are "any of the futexes in the list", so if the consumer acquires any futex, it doesn't call futex_waitv() at all.
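
As a rough illustration of that fast path, here is a minimal sketch in C, assuming a simple 0 = free / 1 = taken lock word; the names are invented here, not taken from the patchset:

    #include <stdatomic.h>

    /* Sketch: acquire purely in user space when uncontested.
     * Only if every futex in the waiter's set is already taken
     * does the thread fall back to a futex_waitv() syscall. */
    static int try_acquire(atomic_uint *ftx)
    {
        unsigned int expected = 0;   /* 0 = free, 1 = taken */
        return atomic_compare_exchange_strong(ftx, &expected, 1);
    }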



Rethinking the futex API

Posted Jun 19, 2020 6:42 UTC (Fri) by ras (subscriber, #33059)

> How do you wake up multiple threads from your producer (1) without issuing multiple syscalls to wake each futex,

If they are waiting on the same futex, one call will suffice. If they are waiting on multiple futexes, multiple wakes will be needed, of course. But remember, futexes are optimised for the fast path. The fast path is when the lock is uncontested. In that case no one is waiting, no one has to be woken up, and so there are no syscalls at all. If you do have to wake another thread up, you've already incurred the overhead of one syscall: the FUTEX_WAIT that put it to sleep, so it's not an untenable additional penalty.
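
For the wake side, the usual futex idiom is a three-state lock word, so the unlocker only enters the kernel when someone is actually asleep. A minimal sketch, assuming 0 = free, 1 = taken, 2 = taken with waiters (standard futex folklore, not code from this thread):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Sketch: the syscall happens only on the contended path. */
    static void unlock(atomic_uint *ftx)
    {
        if (atomic_exchange(ftx, 0) == 2)   /* were there waiters? */
            syscall(SYS_futex, ftx, FUTEX_WAKE, 1, NULL, NULL, 0);
    }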

> (2) without spurious wakeups of unrelated threads,

Why are there wakeups of unrelated threads? This wonderful utex library has recorded, in its shared memory, which utex_t's each thread is willing to be woken up on, and wakes up as many as needed. It's not like the kernel can do it differently: it too has to look up which user space threads are waiting on the futex, and wake some of them up. The basic algorithm and code don't change when they move from kernel space to user space.

> (4) without relying on your waiters spinning (which sends CPU utilization through the roof)

Why is anybody spinning - kernel or user space? Well, OK - the kernel might spin because your critical section is short, so spinning is faster than mutex_lock(). But that's just a speed optimisation - you could always just use mutex_lock(). From user space, spinning in the kernel always requires a syscall - perhaps many, if you end up spinning for a while - so you are always better off with the user space mutex_lock() equivalent, which is futex(), and hoping it is uncontested. If it is uncontested there are no syscalls; and remember, if the kernel is spinning, then someone must have done a syscall to the kernel already, which is another way of saying futex_waitv() takes the syscall hit every time.

> Can you better describe what your model would look like in memory?

Ahh, that is a long answer. And you're stretching my memory, because it was 30 years ago that I did this for an OS I developed and sold. It's too long to go into in detail, but briefly it was based on two primitives. The first looked like utex_waitv(). The second was based on a new trick: every utex allowed you to attach an arbitrary number of child utexes to it. Waking up such a child transparently woke up the parent instead of the child, so a utex_waitv() on a child would never succeed. utex_waitv() returned the utex that woke up, which might be a child utex of one of the utexes in the top-level utex_waitv() vector.
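
A guess at what such a structure might have looked like; all of these names are invented, since the original code isn't shown:

    #include <stdatomic.h>

    /* Hypothetical reconstruction of the described utex. */
    struct utex {
        atomic_uint  word;      /* the futex word itself          */
        struct utex *parent;    /* non-NULL for a child utex      */
        struct utex *children;  /* head of the child list         */
        struct utex *next;      /* sibling link within that list  */
    };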

utex_wake() becomes more complex, of course: it has to check whether the child had a parent, and indeed whether the parent had a parent, in order to wake up the right utex. I put the woken-up utexes in a queue attached to the parent because I wanted to preserve the order of wakeups, but simply setting a flag in the child utex and having utex_wake() go looking for it would also work.

I actually had a third feature which would probably be frowned upon today: you could attach a function to a utex, and instead of waking it up, utex_wake() would call that function, effectively turning the utex into an interrupt-like thingy. That was how the parent / child feature was actually implemented - the child had a function attached that woke up the parent. This feature has the interesting property of letting a utex transparently defer to some other mutex / condition mechanism. For example, instead of waking up a parent, you could write a byte to a fifo, or post a semaphore with semop(). The caller of utex_wake() is none the wiser.
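
Combining the parent / child chain with the attached function, the wake path described above might look roughly like this (again a guess; on_wake is my name for the attached function, and the struct is the one sketched earlier plus that field):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct utex {
        atomic_uint   word;
        struct utex  *parent;
        void        (*on_wake)(struct utex *);  /* optional callback */
    };

    /* Sketch: walk to the topmost parent, then either call the
     * attached function or wake the futex word.  The parent / child
     * feature amounts to a pre-attached callback that wakes the parent. */
    static void utex_wake(struct utex *u)
    {
        while (u->parent)
            u = u->parent;
        if (u->on_wake)
            u->on_wake(u);
        else
            syscall(SYS_futex, &u->word, FUTEX_WAKE, 1, NULL, NULL, 0);
    }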

In the general case you have to lock a lot of things when you are altering the resulting tree of utexes. In my case I can't remember that happening. Instead, one thread - the one that owned the parent utex - added and deleted children, and since a child can only be attached to one parent, there was no contention.

BTW, I've never found a use for waking up more than one thread at a time. It obviously has its uses if you really want to release a thundering herd, but a mutex usually guards a shared volatile resource that only one thread can change at a time, so there is no point in waking more than one. If the woken thread discovers it doesn't want the resource, it can always wake someone else up.

This all requires a bit of work. You aren't dolling up futexes with a thin wrapper; you are implementing a new thing that uses futexes where the kernel would use spinlocks, IRQ masking or mutex_lock(). It duplicates the existing code and data structures in the kernel. And there are more complexities I haven't mentioned, like what happens when a wake calls a function that does a wait. However, because of that remarkable property of mutexes that most of the code isn't executed unless there is contention, it doesn't matter speed-wise, because none of it is touched in the fast path.

Rethinking the futex API

Posted Jun 19, 2020 19:35 UTC (Fri) by itsmycpu (guest, #139639)

> How do you wake up multiple threads from your producer
> (1) without issuing multiple syscalls to wake each futex,
> (2) without spurious wakeups of unrelated threads,
> (3) without relying on poll (which we've shown is not fast enough), and
> (4) without relying on your waiters spinning (which sends CPU utilization through the roof)?
> One of the main motivations of the patchset, which might have been lost in the article, is to reduce CPU utilization.

I agree with the posters who say wait-multiple can be implemented efficiently in user space using the existing futex API.

Re (1): It might make sense to add a wake-multiple call (as opposed to a wait-multiple), since that doesn't require copying dynamic lists into the kernel. This would be much, much simpler; see the sketch after this list.
Re (2): You can protect against spurious wakeups by re-checking atomic variables after waking up. However, what are "unrelated threads"?
Re (3): A user space implementation doesn't require polling (beyond a single check, which is good for performance).
Re (4): You have the same general options when implementing this as you do when implementing locks.
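
A sketch of Re (1), with the batched operation clearly hypothetical (no FUTEX_WAKE_MULTIPLE exists in the kernel as of this thread); today it would just be a loop of plain FUTEX_WAKE syscalls, one per futex that actually has waiters:

    #include <limits.h>
    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Fallback available today: one syscall per futex with waiters.
     * A hypothetical FUTEX_WAKE_MULTIPLE op would collapse this loop
     * into a single syscall. */
    static void wake_multiple(atomic_uint **futexes, int n)
    {
        for (int i = 0; i < n; i++)
            syscall(SYS_futex, futexes[i], FUTEX_WAKE, INT_MAX,
                    NULL, NULL, 0);
    }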

> Can you better describe what your model would look like in memory?
> One of my use cases is a producer-consumer scenario where each consumer waits on a subset of events triggered by different producers.

When necessary, you associate a list of consumers with each event producer, each list entry containing a pointer to the waiter's futex and/or atomic guard variable.
Depending on the type of producer, you mark and/or wake one or all consumers (a sketch follows below). Why is that a question - did you try this approach and run into problems?
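
A minimal sketch of that layout in C, with invented names; each producer keeps a list of entries pointing at its consumers' futex words and guard variables:

    #include <stdatomic.h>

    /* Hypothetical layout: the guard is the atomic the consumer
     * re-checks after waking, which also handles spurious wakeups. */
    struct consumer_entry {
        atomic_uint           *ftx;    /* consumer's futex word */
        atomic_uint           *guard;  /* event-pending flag    */
        struct consumer_entry *next;
    };

    struct producer {
        struct consumer_entry *consumers;  /* list of waiters */
    };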

The kernel would otherwise have to do the same or similar processing; however, in my understanding, having the kernel copy and maintain dynamic memory lists, for example, is something to be avoided. It is better done in user space, also for performance (and to avoid impacting the performance of other applications). At least I know this works well in my own use cases.

Additionally, in use cases where the same "subset" is used repeatedly, these associations don't have to be re-established for each call.

Rethinking the futex API

Posted Jun 20, 2020 2:54 UTC (Sat) by krisman (subscriber, #102057)

> When necessary, you associate a list of consumers with each event producer, each list entry containing
> a pointer to the waiter's futex and/or atomic guard variable.

So, your design has one futex per consumer. This means that when a producer signals, in order to wake all of the entries in its waitqueue, it walks the list and does a FUTEX_WAKE syscall per entry. That's much worse than FWM.

Rethinking the futex API

Posted Jun 20, 2020 3:38 UTC (Sat) by ras (subscriber, #33059)

> So, your design has one futex per consumer.

At this point I'm lost. You obviously have some specific use case in mind that none of this fits. Could you post a link or explain what it is?

Rethinking the futex API

Posted Jun 20, 2020 3:58 UTC (Sat) by itsmycpu (guest, #139639)

> So, your design has one futex per consumer. This means that when a producer signals,
> in order to wake all of the entries in its waitqueue, it walks the list and does a FUTEX_WAKE syscall per entry. That's much worse than FWM.

Almost, but not with that disadvantage. As I indicated, "in my design" there is (optionally) a FUTEX_WAKE_MULTIPLE (not wait-multiple) kernel API, which is much simpler to implement, in part because it doesn't need to copy and maintain the list internally in the kernel. It does multiple FUTEX_WAKE operations in a single syscall. And it can be added later on, as an optimization, if and when such a call becomes available in the kernel, while the rest can already be implemented, tested and tuned with the kernel as is.

Since a FUTEX_WAKE is only needed for futexes that actually have threads waiting, and the waiting itself is already a time-consuming kernel call, one non-blocking wake call per waiting futex shouldn't be that bad, although it will be better to have a wake-multiple call, which solves that problem (if it is one) completely.

Rethinking the futex API

Posted Jun 20, 2020 4:05 UTC (Sat) by itsmycpu (guest, #139639)

(Also, for example, the FUTEX_WAKE calls can be offloaded to a worker thread if the callers are high priority; a sketch follows.)
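
As a sketch of that offload (all names invented): the high-priority caller only pushes the futex address onto a queue, and a worker thread issues the actual FUTEX_WAKE syscalls. A lock-free handoff would be better in practice; a mutex-protected ring buffer just shows the idea:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define QSIZE 64
    static atomic_uint *pending[QSIZE];
    static unsigned int head, tail;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

    /* Hot path: the high-priority caller never makes the
     * FUTEX_WAKE syscall itself. */
    static void defer_wake(atomic_uint *ftx)
    {
        pthread_mutex_lock(&qlock);
        pending[head++ % QSIZE] = ftx;
        pthread_cond_signal(&qcond);
        pthread_mutex_unlock(&qlock);
    }

    /* Low-priority worker drains the queue and does the syscalls. */
    static void *wake_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (head == tail)
                pthread_cond_wait(&qcond, &qlock);
            atomic_uint *ftx = pending[tail++ % QSIZE];
            pthread_mutex_unlock(&qlock);
            syscall(SYS_futex, ftx, FUTEX_WAKE, 1, NULL, NULL, 0);
        }
        return NULL;
    }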

