|
|
Log in / Subscribe / Register

A new futex API

By Jonathan Corbet
August 14, 2023
The Linux fast user-space mutex ("futex") subsystem debuted with the 2.6.0 kernel; it provides a mechanism that can be used to implement user-space locking. Since futexes avoid calling into the kernel whenever possible, they can indeed be fast, especially in the uncontended case. The API used to access futexes has never been seen as one of Linux's strongest points, though, so there has long been a desire to improve it. This patch series from Peter Zijlstra shows what the future of futexes may look like.

A futex is a 32-bit value stored in user-space memory that is, presumably, shared between at least two threads or processes. When used as a lock, a futex can be acquired with a single compare-and-swap instruction, without kernel involvement. The kernel comes into the picture, though, in the contended case, where a thread must block until a futex becomes available. Waiting for a futex and waking threads that are waiting are some of the features provided by the futex() system call.

futex() is a multiplexed system call, meaning that it performs a number of unrelated operations depending on its arguments. Over the years, as futex functionality has grown, this system call has become complex and unwieldy, to put it mildly. It is difficult to use, and difficult to extend further. This API has, among other things, slowed the addition of features that developers would like to see.

For some years, there has been talk of splitting the futex API into a set of more focused system calls; LWN covered one such discussion in 2020. Thus far, though, actual progress in this direction has been limited to the addition of futex_waitv() to the 5.16 release in early 2022; work on futexes seemingly stalled after that. With the current patch set, Zijlstra appears to be trying to restart this project and bring a better futex API to the mainline.

This patch series includes three new system calls:

    int futex_wait(void *addr, unsigned long val, unsigned long mask, unsigned int flags,
    		   struct __kernel_timespec *timeout, clockid_t clockid);

    int futex_wake(void *addr, unsigned long mask, int nr, unsigned int flags);

    int futex_requeue(struct futex_waitv *waiters, unsigned int flags, int nr_wake,
		      int nr_requeue);

For the most part, so far, these system calls are just new interfaces to existing functionality. futex_wait() is the same as futex(FUTEX_WAIT_BITSET); it will cause the calling thread to wait on the futex stored at addr, assuming that futex contains val at the time of the call. The mask will be stored in the kernel's context for this thread. futex_wake() mirrors the FUTEX_WAKE_BITSET operation, using mask to identify which waiter(s) to wake. futex_requeue() is another interface to FUTEX_CMP_REQUEUE, which can wake some waiters and requeue others onto a different futex.

There is also some new functionality that is made available via the new API, generally using the flags argument. One of the new flags is FUTEX2_NUMA, which is intended to improve performance on NUMA systems. User space is in control of the placement of its futexes and can, thus, ensure that they live on a NUMA node that is close to the threads that are using it. But the kernel maintains its own data structures for futexes that are being waited on; poor placement of those structures can slow everything down. This problem was mentioned here in 2016, but still lacks a solution in mainline kernels.

As noted above, a futex is a 32-bit value. When the FUTEX2_NUMA flag is provided, though, the futex(es) referred to are, instead, interpreted as:

    struct futex_numa_32 {
	u32 val;
	u32 node;
    };

(Though this structure does not actually appear in this form in the kernel source).

The val field is the same old 32-bit quantity, while node is the number of the NUMA node on which the kernel should allocate its own data structures. The node value can also be set to all-ones (~0), in which case the current NUMA node will be used, and the node value will be updated accordingly by the kernel. The patch adding this structure admonishes: "If userspace corrupts the node value between WAIT and WAKE, the futex will not be found and no wakeup will happen".

The FUTEX2_NUMA flag thus gives user space some control over the placement of associated memory within the kernel. If this flag does not appear, these allocations will be spread across all of the nodes of the system (as is done with futexes in current kernels).

One other change in this series is reflected by another set of new flags. There has long been a desire to get away from the 32-bit requirement for futexes. Often, only a single bit is used; a smaller size would reduce waste and (more importantly) allow more futexes to be crammed into a single cache line. To accommodate this need, the new system calls require a flags value indicating the size of the futex(es) to be operated on; it can be one of FUTEX2_SIZE_U8, FUTEX2_SIZE_U16, or FUTEX2_SIZE_U32. There is also a FUTEX2_SIZE_U64 flag defined, but 64-bit futexes are not implemented in the current patch set.

When FUTEX2_NUMA is used, the node number has the same size as the futex value itself. Thus, a futex operation specifying FUTEX2_NUMA|FUTEX2_SIZE_U8 will provide an eight-bit node number, which could be modeled as:

    struct futex_numa_8 {
    	u8 val;
	u8 node;
    };

The end result of this work is a set of incremental improvements that cleans up the futex API and provides some functionality that developers have been asking for. Review comments have mostly been focused on relatively minor details, suggesting that there may not be much in the way of getting these changes merged into the mainline. Of course, the job does not end with these patches; there is still a lot of functionality provided by futex(), including priority inheritance, that is not available with the new API. But this work should make the common cases easier for developers to work with and might even, someday, lead to support for futexes in the GNU C library.

Index entries for this article
KernelFutex


to post comments

A new futex API

Posted Aug 15, 2023 4:38 UTC (Tue) by alison (subscriber, #63752) [Link] (2 responses)

Hopefully the ability to specify 8- and 16-bit futexes is a step to addressing the problems described in the 2016 talk Real-Time Summit 2016 meeting called "Futexes are cursed" (https://wiki.linuxfoundation.org/_media/realtime/events/r...) and the similarly themed next presentation called "Pthread Condvars: Posix Compliance and the PI Gap" (https://wiki.linuxfoundation.org/_media/realtime/events/r...). The conclusion then was there was not enough space in the existing futex either to entirely eliminate out-of-order wakeups or to implement priority inheritance. Letting userspace choose the size of futexes is great, as that means that embedded control systems should (eventually) be able to have PI and giant systems can still have many, many simultaneous futexes.

There's still the question of what libpthread will choose, but there are alternatives there as well (https://ossna2020.sched.com/event/cZIe/librtpi-conditiona...).

A new futex API

Posted Aug 15, 2023 18:35 UTC (Tue) by ghodgkins (subscriber, #157257) [Link] (1 responses)

What were the problems described in "Futexes are evil"? The linked slide deck isn't very informative.

A new futex API

Posted Aug 16, 2023 4:57 UTC (Wed) by alison (subscriber, #63752) [Link]

The videos are here: https://www.youtube.com/playlist?list=PLbzoR-pLrL6oauKOYM...

"Futexes are cursed" as I recall was basically about the difficulty of fixing them in the kernel, and "Pthread Condvars" was about the difficulty of fixing them in glibc. Problems, as described by the reliably excellent LWN article, were false (misordered) wakeups and no ability to implement priority inheritance. The root cause is too fee bits to indicate state, as is so often the case for older APIs. Recently I have thought that it's too bad that the kernel chose not to break the futex() ABI at the same time it essentially forced distros to switch to 64-bit time.

A new futex API

Posted Aug 17, 2023 1:51 UTC (Thu) by irogers (subscriber, #121692) [Link]

There has also been a proposal to add a FUTEX_SWAP operation as a primitive to build user land schedulers/fibers:
https://lore.kernel.org/lkml/20200722234538.166697-1-posk...
https://lore.kernel.org/lkml/20200803221510.170674-1-posk...


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds