
Rethinking the futex API

Rethinking the futex API

Posted Jun 19, 2020 5:14 UTC (Fri) by krisman (subscriber, #102057)
In reply to: Rethinking the futex API by ras
Parent article: Rethinking the futex API

> One of the complaints about the current futex implementation is the code is hard to follow. One reason
> for that is it's been expanded to do stuff that could have been done in user space. FUTEX_FD is one
> such thing. I suspect that FUTEX_REQUEUE is another - it could be done in user space with no more
> syscalls than are used now. futex_waitv() is definitely in the same class. You don't fix complexity by
> adding more features.

For what it's worth, yes, the futex interface has grown out of hand, with deprecated features. FUTEX_FD was dropped long ago, so that one is not a big deal. And yes, most futex features can be implemented in userspace with the WAKE/WAIT pair, at additional cost.

REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakeups. How would you requeue in userspace without more syscalls and unnecessary wakeups? I don't see how it is possible; in fact, CMP_REQUEUE is planned to exist in a future API.



Rethinking the futex API

Posted Jun 19, 2020 6:50 UTC (Fri) by ras (subscriber, #33059) [Link] (4 responses)

> REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakeups. How would you requeue in userspace without more syscalls and unnecessary wakeups? I don't see how it is possible; in fact, CMP_REQUEUE is planned to exist in a future API.

You couldn't.

I was querying whether the extra syscall complexity is worth the cost. It's not that you can't do it in user space; it's just more expensive. You've already incurred one syscall, so the order-of-magnitude speed improvement the futex fast path bought you is gone. You are now trying to gain a factor of two or three. If it were a very frequent use case, then perhaps; but it looks esoteric to me. If complexity is an issue, esoteric should be shown the door.

Rethinking the futex API

Posted Jun 23, 2020 20:56 UTC (Tue) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

It is a frequent use case: POSIX condition variables. These are associated with a mutex. A thread calling pthread_cond_wait must do so while owning the mutex associated with the condition variable. The mutex is then released and the thread put to sleep until the condition is signalled. On wakeup, the mutex must be reacquired by the thread prior to the function call returning. If more than one thread is woken up because the condition is signalled, they would thus all race to exit the kernel; most of them would find the mutex locked and enter the kernel again in order to block on it.

With REQUEUE/CMP_REQUEUE, it's possible to ask the kernel to wake up one thread, which then acquires the mutex and (presumably) does something useful, while the others are moved from the condvar futex to the mutex futex without waking them up.

Rethinking the futex API

Posted Jun 23, 2020 22:01 UTC (Tue) by ras (subscriber, #33059) [Link] (2 responses)

Thanks for that.

My point wasn't that thundering herd isn't a problem. It clearly is.

My point was that it is already solved by the current futex() API. You get to decide how many of those syscall(SYS_futex, &futex, FUTEX_WAIT) waiters you wake up. It can be anywhere from 1 to all. The others just sit around, waiting for another FUTEX_WAKE.

To solve the pthread_cond_wait() case you just wake up one, so it could have been done without REQUEUE. I'll grant you that's not all REQUEUE does: it also prevents the remaining waiters from seeing any future wakeups on the old futex. But that could equally well be done from userspace using more futexes. Perhaps it's a speed optimisation (I do hope they benchmarked it), or more likely arranging for everyone to share information the kernel already had was hard, so they took the easy way out.

Rethinking the futex API

Posted Jun 24, 2020 14:54 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

The library implementation doesn't get to decide this: Application code can either use pthread_cond_signal to wake exactly one thread or pthread_cond_broadcast to wake all threads. With the naive implementation, the kernel goes through a linked list and wakes everything while tearing the list apart. All the awoken threads exit the kernel and compete for ownership of the same mutex. One of these threads will end up getting the mutex; the others all do system calls in order to enter the kernel again, where kernel code then builds a new linked list of them which is attached to a different futex.

That's an insanely complicated way to transfer a linked list from a head pointer a in the kernel to another head pointer b in the kernel. It's much simpler if the kernel just wakes one thread and transfers the existing list to the other futex.


Rethinking the futex API

Posted Jun 24, 2020 19:17 UTC (Wed) by itsmycpu (guest, #139639) [Link]

> That's an insanely complicated way to transfer a linked list from a head pointer a in the kernel to another head pointer b in the kernel. It's much simpler if the kernel just wakes one thread and transfers the existing list to the other futex.

I'm not sure whether you are discussing a more specific question, but just to point it out: a user-space implementation of the producer/consumer model can (if so designed) move a waiting entry from one producer's list to another producer's list without any wakeups or even kernel calls (except a wake call in the case where one is ready).

Rethinking the futex API

Posted Jun 19, 2020 10:43 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (1 responses)

> REQUEUE/CMP_REQUEUE exist to reduce system calls and unnecessary wakeups. How would you requeue in userspace without more syscalls and unnecessary wakeups?

You would just do things differently. A model that is relatively common is the "parking lot", where each thread has a single binary semaphore attached to it. The primitives in this model are "parking" yourself (consuming the token in the semaphore, or, if no token is there, waiting for one and consuming it as soon as it arrives) and "unparking" another thread (placing a token on its semaphore). The parking lot itself is an associative array from addresses to waitqueues, similar to what the kernel does for futexes, so that you can "park" yourself on a list and "unpark" the head of a list.

By moving waitqueue handling to userspace, you don't need a system call anymore for either requeueing or multiple-futex wait. On the other hand it makes the implementation of pthread_barrier_t less efficient because every slow-path wakeup needs a separate system call.

Rethinking the futex API

Posted Jun 21, 2020 17:44 UTC (Sun) by itsmycpu (guest, #139639) [Link]

> On the other hand it makes the implementation of pthread_barrier_t less efficient because every slow-path wakeup needs a separate system call.

Again a reason to add a "FUTEX_WAKE_MULTIPLE" kernel API, or "futex_wakev" ("wake", not "wait"), that takes a vector of futex addresses as a parameter.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds