
API changes for the futex robust list

By Jake Edge
February 4, 2026

LPC

The robust futex kernel API is a way for a user-space program to ensure that the locks it holds are properly cleaned up when it exits. But the API suffers from a number of different problems, as André Almeida described in a session in the "Gaming on Linux" microconference at the 2025 Linux Plumbers Conference in Tokyo. He had some ideas for a new API that would solve many of those problems, which he wanted to discuss with attendees; there is a difficult-to-trigger race condition that he wanted to talk about too.

"Some years ago, I made a new API for futex", Almeida said to start things off, "so why not do a new API for robust list as well?" The new futex API that he was referring to was merged for 5.16 in 2022 in the form of the futex_waitv() system call (documentation). Some further pieces of the futex2 API were released with Linux 6.7 in 2024.

[André Almeida]

The ABI for games on the SteamOS distribution, where much of the work in gaming on Linux is being done, is Windows on the x86 architecture. The games are mostly built for that ABI, but SteamOS also runs on Arm64, which leads to "a lot of interesting challenges". It adds the FEX emulator to run x86 binaries on the Arm64 processor in addition to the Proton compatibility layer that provides the Windows ABI. That has implications for various kernel areas, including futexes, memory management, and filesystems.

FEX is a just-in-time (JIT) compiler for turning x86 instructions, for both 32 and 64 bits, into Arm64 machine code. As part of that, when it finds a syscall instruction, it needs to translate that to the Arm64 system call, but that does not work well for some x86-32 system calls. The FEX project has a wiki page describing the problematic calls, one of which is set_robust_list().

set_robust_list() is used to avoid problems when a futex holder dies before releasing the lock, which will starve any other threads waiting on it. So, when a thread takes a lock, it can add the lock to the robust list, which is a linked list maintained in user space. The thread informs the kernel where the head of the list is using set_robust_list(). The exit path for a thread in the kernel uses that information to wake all of the threads waiting for each futex on the list; it also adds the FUTEX_OWNER_DIED mark to each futex. One other wrinkle that he mentioned is that a futex can be put into a "pending" field on the list head while an operation (taking or releasing the lock) is being done, but before the linked list has been updated, so that it can be cleaned up if a crash happens in that window.

Why?

A new API is needed for several reasons, he said. The first is that, unlike x86, Arm64 does not have both 32- and 64-bit system calls, so emulating 32-bit applications is difficult—the "compat" system calls are missing. For example, a 32-bit robust list cannot be handled by the 64-bit system call because it cannot parse the list due to the different pointer size. There is a need for the new interface to allow user space to inform the kernel whether it is a 32- or 64-bit robust list so that the kernel can parse the list correctly.

Another shortcoming of the existing interface is that only one robust list can be set for a thread, but FEX also wants to use robust futexes. If the application uses them, FEX has to choose which one gets that access. A new interface would provide a way to set multiple list heads for a thread.

There is currently a limit of 2048 entries on a robust list that will be processed by the kernel, which is meant to avoid getting trapped in an infinite loop. But that limit was never documented as part of the API, so user-space programs are unaware of it, which led to a bug report for the GNU C library. With a new API, either the limit should be documented and exposed as part of the API or it should be made limitless using countermeasures against circular lists, he said.

The final problem is "much more interesting" but "kind of tricky to explain"; it is a race condition that can occur when a futex is being unlocked. The normal sequence for unlocking a robust futex is as follows:

  1. The address of the futex is put into the pending slot of the robust list
  2. The futex is removed from the robust list
  3. The low-level unlock is done, which clears the futex and wakes any threads waiting on it
  4. The pending slot is cleared

Between steps three and four, though, another thread can decide to free the memory holding the futex, since with the lock released it appears to be the futex's only remaining user. That thread could then allocate new memory at the same location. If the original thread, which is about to perform step four, dies in that window, the kernel's exit path writes FUTEX_OWNER_DIED through the stale pending pointer, corrupting whatever now occupies that memory. The race is difficult to reproduce, but it does happen.

Almeida said that he is unsure how to address this. Perhaps serializing the exit path with all of the mmap() and munmap() calls made by the thread is a possibility. Another idea might be to change the API around the pending field somehow to avoid the race. The previous day he had participated in the extensible scheduler class (sched_ext) microconference, which got him thinking that perhaps a specialized scheduler could be written to reproduce the problem reliably; that would help in the fixing process and could be turned into a test case as well.

New API

The API he proposed in the session seems to have evolved somewhat since his v6 patch set posting in November 2025 (a few weeks before LPC). It consists of two new system calls:

    set_robust_list2(struct robust_list_head *head, unsigned int index,
                     unsigned int cmd, unsigned int flags);
                     
    get_robust_list2(int pid, void **head_ptr,
                     unsigned int index, unsigned int flags);

The index argument is used to distinguish between different lists so that libraries and applications can have their own lists. The cmd argument to set_robust_list2() can be CREATE_LIST_32 (or 64) to create a list of the appropriate "bitness" using the head pointer; in that case, the call returns an unused index that is associated with the list. A list can be overwritten using the SET_LIST_32 (or 64) command by passing the index of interest. The LIST_LIMIT command returns the number of lists supported for each task. (All of the command names will presumably have FUTEX_ROBUST_LIST_CMD_ as part of their full name.) get_robust_list2() will simply return the head of the robust list (in head_ptr) for a given pid and index.
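How the pieces would fit together can be sketched, with heavy caveats: the API is still under discussion, the syscall numbers do not exist, and the full command names are this article's guess based on the talk. The following is C-flavored pseudocode, not something that can be built against any current kernel.

```
/* Hypothetical usage of the proposed API -- not buildable today. */
struct robust_list_head head64;   /* native 64-bit list head */

/* Create a new 64-bit list; the kernel picks and returns a free index. */
int idx = set_robust_list2(&head64, /* index (ignored for create) */ 0,
                           FUTEX_ROBUST_LIST_CMD_CREATE_LIST_64, 0);

/* An emulator such as FEX could register its guest's 32-bit list
 * alongside it, under a separate kernel-chosen index: */
int guest_idx = set_robust_list2(guest_head32, 0,
                                 FUTEX_ROBUST_LIST_CMD_CREATE_LIST_32, 0);

/* Later, look up the head registered for a given task and index: */
void *h;
get_robust_list2(getpid(), &h, idx, 0);
```

Because each caller gets back its own index, glibc, an emulator, and the application could all hold robust lists in the same thread without coordinating, which is the central point of the proposal.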

Discussion

After that, Almeida opened the floor to questions and comments. Liam Howlett noted that the exit path for robust lists requires a delay to the out-of-memory (OOM) handling in the kernel, so the race condition could be more easily reproduced by setting the OOM-handler delay to zero and triggering an OOM-kill of the task. While that may be true, glibc maintainer Carlos O'Donell said, it does not really lead to a solution to the race, which both he and Rich Felker of the musl libc project have looked at. If there is going to be a new API, it is a "perfect opportunity" to sit down and figure out a proper solution and also determine how existing C libraries can transition to the new interface over time.

"It gets worse", Howlett said. Tasks that are exiting can be frozen by the control-group subsystem, which means that the OOM handler has to wait potentially forever before it can clean things up. That is another piece that should be unwound as part of the process of creating the new API, he said.

O'Donell said that it made sense that users of the new API will need to be able to register the number of bits in the structure that is being shared with the kernel. He asked if sizes other than 32 or 64 bits should be considered, but Howlett pointed out that there is an unused flags argument in the proposed API, which could be used if needed.

The conversation turned back to the delay for the OOM handler, which no one seemed to fully understand. O'Donell wondered if it was an attempt to fix the race condition that Almeida is concerned about when it arose in some other context. Howlett said that he believed it was meant to hold off the OOM killer from freeing memory holding the locks before the exit-handling code could process the robust list. Sebastian Siewior said that he was not clear on why the delay was added, either, but would put it on his list to look into.

There was some further discussion of why and how the OOM-killer delay came about, but the session ran out of time. Interested readers may want to consult the YouTube video and slides from the talk. Overall, participants seemed to agree that the new API was needed, and no real complaints about its proposed form were heard, but there are obviously still some details to be worked out before it can go upstream.

[ I would like to thank our travel sponsor, the Linux Foundation, for assistance with my travel to Tokyo for Linux Plumbers Conference. ]

Index entries for this article
Kernel: Futex
Conference: Linux Plumbers Conference/2025



how about endianness swapping versions?

Posted Feb 4, 2026 19:00 UTC (Wed) by pm215 (subscriber, #98099)

I don't suppose we can get variants of CREATE_LIST_{32,64} that say "data structure is the opposite endianness to the kernel's natural one" ? :). QEMU linux-user mode also doesn't emulate set_robust_list because of the problem with the guest's idea of the data structure layout potentially not matching the kernel's, but we also have the case of "big endian guest on little endian host" and vice versa...

Cooperation between multiple userspace runtimes?

Posted Feb 4, 2026 19:53 UTC (Wed) by josh (subscriber, #17465)

Are there any efforts trying to figure out how to coordinate between multiple prospective users of the robust list? As far as I can tell, you can only have one per thread, and libc doesn't provide any way to integrate with theirs when using futex natively rather than via libc's locks.

By contrast, as I understand it, a great deal of effort went into a userspace ABI that supports multiple users of rseq.

Cooperation between multiple userspace runtimes?

Posted Feb 4, 2026 20:11 UTC (Wed) by pm215 (subscriber, #98099)

The new API seems to give multiple lists per thread, and the CREATE_LIST operation causes the kernel to return the index of some not currently used list, so I think that means different users don't need to cooperate -- the only restriction is that you can't have more than some fixed number (6? 8? 10?) of such independent users of a robust-list.

Cooperation between multiple userspace runtimes?

Posted Feb 4, 2026 20:21 UTC (Wed) by josh (subscriber, #17465)

...apparently I should read more closely, thank you. That's exactly what I was looking for, and that sounds perfect.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds