Restartable sequences, TCMalloc, and Hyrum's Law
As a quick reminder: the restartable sequences feature, accessed by way of the rseq() system call, provides a mechanism for the execution of brief critical sections in user space. A shared-memory segment is used to communicate to the kernel when a critical section is active, and the kernel can redirect execution if the running thread is preempted or migrated during that critical section. There are a number of associated features, including the ability to quickly determine which CPU a thread is running on; the time-slice-extension feature merged for the 7.0 release is also tied to restartable sequences.
TCMalloc trouble
On April 22, Mathias Stearn reported two problems with TCMalloc resulting from the 6.19 improvements to restartable sequences. One of them turned out to be a simple bug in the 64-bit Arm implementation, tied to the fact that the Arm architecture does not yet fully use the generic entry code. This bug is uncontroversial and will be fixed like any other. The second problem, though, has deeper origins.
The memory shared between the kernel and user space for restartable sequences includes an instance of struct rseq; that structure contains a number of fields, one of which, in particular, is of interest for this discussion. The 32-bit cpu_id_start field contains the number of the CPU on which the thread is running. This value, which is maintained by the kernel, is explicitly defined as a read-only value for user space, and is guaranteed to always contain a valid CPU number, even if restartable sequences are not in use.
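For readers who want to see where cpu_id_start sits, here is a minimal sketch of the shared structure's layout. The type name struct rseq_sketch and the helper functions are hypothetical, made up for this illustration; the field order reflects the kernel's UAPI header as best understood here, and <linux/rseq.h> remains the authoritative definition.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative mirror of the kernel's struct rseq. This is NOT the real
 * header; the type name is hypothetical and the layout is an assumption
 * to be checked against <linux/rseq.h>.
 */
struct rseq_sketch {
    uint32_t cpu_id_start;  /* kernel-maintained CPU number; read-only for user space */
    uint32_t cpu_id;        /* CPU number once the thread has registered */
    uint64_t rseq_cs;       /* address of the active critical-section descriptor, or 0 */
    uint32_t flags;
    uint32_t node_id;
    uint32_t mm_cid;
    char end[];             /* room for future extensions */
} __attribute__((aligned(32)));

/* Hypothetical helpers exposing the layout facts discussed in the text. */
size_t rseq_sketch_size(void)
{
    return sizeof(struct rseq_sketch);
}

size_t rseq_sketch_cpu_id_start_offset(void)
{
    return offsetof(struct rseq_sketch, cpu_id_start);
}
```

In this sketch cpu_id_start is the first field of the structure, so any data placed immediately before the structure in memory ends just where that field begins.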
Prior to the 6.19 release, the kernel would update cpu_id_start on every return from the kernel to user space, regardless of whether the CPU number had changed or not. Storing a single integer value does not seem like an expensive operation, but looks can be deceiving; many CPUs have features that prevent the kernel from randomly changing user-space memory. Turning off that protection (and re-enabling it after the store) is expensive. Removing the redundant stores improved performance by 15% for many workloads without changing the restartable-sequences ABI in any way — or so it seemed.
The TCMalloc library makes extensive use of restartable sequences to improve performance. Notably, while it does use this feature for critical sections, it also uses it to detect scheduling interruptions outside of critical sections. The trick (to avoid more pejorative terms) used is described in detail in this document. In short, TCMalloc's internal data structures are designed to overlay the shared struct rseq so that cpu_id_start becomes the upper four bytes of an internal cache pointer. When TCMalloc stores this pointer, the result is to write zeroes into cpu_id_start; the topmost bit is set, though, to distinguish the contents of cpu_id_start from any valid CPU number. When the kernel stores into cpu_id_start, it will end up clearing that top bit, enabling TCMalloc to quickly detect the change and regenerate that pointer.
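The overlay can be modeled in a few lines of C. What follows is a simulation written for this article, not TCMalloc's code: the 64-bit word stands in for the cache pointer, its upper 32 bits stand in for cpu_id_start, and every name (CACHE_VALID_BIT, cache_word_publish(), and so on) is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Topmost bit of the 64-bit word, which is also the top bit of the
 * 32 bits standing in for cpu_id_start. No valid CPU number sets it. */
#define CACHE_VALID_BIT (UINT64_C(1) << 63)

/* User-space side: store the cached value. The upper half is zero
 * apart from the marker bit. */
uint64_t cache_word_publish(uint32_t low_bits)
{
    return (uint64_t)low_bits | CACHE_VALID_BIT;
}

/* Kernel side (simulated): overwrite the upper 32 bits with a CPU
 * number, leaving the lower half of the word alone. */
uint64_t simulate_kernel_store(uint64_t word, uint32_t cpu)
{
    return (word & UINT64_C(0xffffffff)) | ((uint64_t)cpu << 32);
}

/* User-space side: the cached value is usable only while the marker
 * bit survives. */
bool cache_word_is_valid(uint64_t word)
{
    return (word & CACHE_VALID_BIT) != 0;
}
```

A word produced by cache_word_publish() passes cache_word_is_valid() until simulate_kernel_store() is applied to it; after that, the marker is gone and the caller knows to rebuild its cached state.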
The key point is that TCMalloc needs that signal for any sort of interruption, even if the running thread did not move to a new CPU. Pre-6.19 kernels would always overwrite cpu_id_start — an undocumented but observable behavior — providing TCMalloc with that signal; as of 6.19, that overwriting only happens if the thread migrates to a new CPU. As a result, TCMalloc, which has come to depend on that undocumented behavior, ends up leaving a smoking crater in the middle of any application that is trying to use it.
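The difference between the two kernel behaviors can be captured in a small model. This is a sketch under stated assumptions rather than kernel code: the enum, the function names, and the MARKER constant are all hypothetical, and the model tracks only the contents of cpu_id_start across a single return to user space.

```c
#include <stdbool.h>
#include <stdint.h>

/* Marker bit that user space plants in the field; no valid CPU number
 * has it set. */
#define MARKER UINT32_C(0x80000000)

enum store_policy {
    STORE_ALWAYS,        /* models pre-6.19: store on every return to user space */
    STORE_ON_MIGRATION   /* models 6.19: store only when the CPU changed */
};

/* Returns the contents of the field after one return to user space. */
uint32_t return_to_user(enum store_policy policy, uint32_t field,
                        uint32_t prev_cpu, uint32_t cur_cpu)
{
    if (policy == STORE_ALWAYS || prev_cpu != cur_cpu)
        return cur_cpu;  /* the kernel's store wipes out the marker */
    return field;        /* store elided: the marker survives */
}

/* User space concludes that it was interrupted if the marker is gone. */
bool interruption_detected(uint32_t field)
{
    return (field & MARKER) == 0;
}
```

With STORE_ALWAYS, a preemption that leaves the thread on the same CPU still clears the marker. With STORE_ON_MIGRATION it does not, so the signal that user space was counting on never arrives.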
A regression?
The problematic nature of this behavior has been widely known for some years; the above-linked documentation advises that, since cpu_id_start can no longer be counted on to hold the current CPU number, "this makes __rseq_abi.cpu_id_start unusable for its original purpose". In other words, no other code running within a TCMalloc-using thread can use restartable sequences and expect it to work. That is somewhat awkward, given that one other user of the feature is the GNU C Library (glibc), as of version 2.35. Back in 2022, this problem was reported in the TCMalloc issue tracker and a change of behavior was requested but, despite a fair amount of discussion, no change was ever made. As a result, code using TCMalloc must be run with an environment variable set to prevent glibc from trying to use restartable sequences.
Kernel developers have had a dim view of this behavior for a while. Unsurprisingly, that view became rather dimmer yet when users started complaining that the 6.19 release broke TCMalloc entirely. The behavior described above violates the documented restartable-sequences ABI and makes the feature unusable for anybody else. It would have been detected by the kernel's debugging facilities, but those were clearly never used with TCMalloc, since they would have immediately killed the offending thread. In the issue discussion, kernel developers had offered an extension to restartable sequences to let TCMalloc stop overwriting cpu_id_start, but that offer was not accepted. The result, as Thomas Gleixner described it, is "everyone is in a hard place and up a creek without a paddle".
Gleixner made it clear that he thought TCMalloc's difficulties should not be considered to be a kernel regression, since the documented ABI guarantees are still being upheld by the kernel and the debugging feature would have caught the problem years ago. Linus Torvalds, though, was just as clear that the only thing that matters is that a once-working program stopped working as the result of a kernel change: "This is not some kind of gray area. It clearly violates our regression rules".
This response was clearly expected by Gleixner, though he still did not like it: "Feel free to enforce it, but be aware that you thereby set a precedence that a single abuser can then rightfully own a general shared interface of the kernel forever and force everybody else to give up". Glibc developer Florian Weimer was also unhappy, pointing out that TCMalloc's use breaks the modular design of the restartable-sequences ABI. Torvalds, though, was adamant that a fix had to be found.
Now what?
Various options for fixes were discussed; Stearn had been working on a simple, low-cost option that, seemingly, has not worked out. Another option, of course, is to simply go back to always updating cpu_id_start and accepting the associated performance penalty; there are not many supporters of this approach in the kernel community. The most likely fix, as of this writing, is something based on this patch from Gleixner, which works without requiring changes to either TCMalloc or glibc, albeit with the performance penalty in some environments.
As is the pattern with many recent system calls, rseq() requires the caller to pass in both the pointer to struct rseq and the size of that structure. In this way, future extensions can be made in compatible ways by increasing that size. Gleixner proposes increasing it from the current 32 bytes to 33 bytes, which would also have the effect of forcing a 64-byte alignment of the structure. Any caller presenting a 32-byte struct rseq or failing to align the structure properly would see the pre-6.19 behavior, with unconditional updating of cpu_id_start; more recent related features, such as time-slice extension, would also be unavailable. If, instead, the caller provides a 33-byte, 64-byte-aligned rseq structure, the kernel provides the 6.19 behavior, with full performance.
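The selection rule described above reduces to a simple predicate. The sketch below restates that rule as the paragraph gives it and is not taken from the patch itself; the function name and both constants are hypothetical, and the checks in the real code may well differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values taken from the description in the text. */
#define LEGACY_RSEQ_SIZE      32u  /* size presented by existing callers */
#define OPTIMIZED_RSEQ_ALIGN  64u  /* alignment demanded of new-style callers */

/*
 * Returns true if a registration with this address and length would get
 * the 6.19 behavior (stores only on migration), and false if it would
 * fall back to unconditional updates of cpu_id_start.
 */
bool gets_optimized_behavior(uintptr_t rseq_addr, size_t rseq_len)
{
    return rseq_len > LEGACY_RSEQ_SIZE &&
           (rseq_addr % OPTIMIZED_RSEQ_ALIGN) == 0;
}
```

A 32-byte registration fails the first test wherever it is placed, and a larger structure that is not 64-byte aligned fails the second, so both end up with the older behavior.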
The result should be a fully compatible change. Existing TCMalloc installations will use the older structure size; the overlay trick used by the library also prevents a 64-byte alignment for the structure. So TCMalloc will be given the older behavior that it depends on. Newer glibc versions query the expected structure size (using getauxval()) and will be rewarded with higher performance and full functionality, with no glibc update needed.
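The size query can be sketched as follows. The wrapper function names are hypothetical; getauxval() and the AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN keys are real, though the fallback definitions for older headers are an assumption about the installed system. How glibc actually uses the values is not shown here.

```c
#include <sys/auxv.h>

/* Older C library headers may lack these keys; the fallback values
 * match the kernel's UAPI auxvec.h. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN 28
#endif

/* Size in bytes of the rseq area that the running kernel knows how to
 * use, or 0 if the kernel does not supply the entry. */
unsigned long rseq_feature_size(void)
{
    return getauxval(AT_RSEQ_FEATURE_SIZE);
}

/* Alignment the kernel wants for that area, or 0 if not supplied. */
unsigned long rseq_required_align(void)
{
    return getauxval(AT_RSEQ_ALIGN);
}
```

A caller that sizes and aligns its allocation from these two values picks up whatever structure size the running kernel advertises without being rebuilt.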
Older glibc versions (those prior to 2.41), though, will be stuck with the performance penalty; Weimer indicated that updating those versions would not be an easy thing to do. Mathieu Desnoyers suggested adding a flag that could be passed to rseq() as an "I'm not TCMalloc" indicator, resulting in the faster behavior. Adding that flag would be a far easier backport to older glibc versions. Gleixner, though, dismissed that idea, saying that it would lead to unnecessary complexity in the code and, in any case, would be problematic in cases where there are multiple users of restartable sequences within a single application.
This solution appears to have survived initial testing, and has been put together into a proper patch series, along with some sharp words for the people who made it necessary:
As Linus decreed the onus is on the lack of ABI compliance enforcement in the original RSEQ kernel implementation and the clever abuse is fine. That's technically correct, but in the context of the larger ecosystem a fundamentally flawed decision. Though that's a completely different discussion to have as it affects the long term sustainability of the Open Source ecosystem in general and the ability to protect it against rogue actors, which are thereby officially entitled to hold a whole ecosystem hostage and force the people who provide them their operational base to go out on a limb to make progress.
Grumpiness aside, it seems that the form of the solution to this particular demonstration of Hyrum's law has been worked out. As long as there are no other users doing strange things, it should be possible to move forward, preserving all of the improvements that have been made, without breaking existing users. Further surprises seem fairly unlikely, since there are few users of restartable sequences. In the end, things clearly could have been worse.
That said, this episode will surely not be reassuring to developers who fear exposing features that could end up creating similar compatibility problems in the future. Many discussions about adding new interfaces have run aground on that point; see, for example, this response by David Hildenbrand to the idea of allowing BPF programs to control more aspects of memory management. It is easy to see Hyrum lurking in the corner, just waiting to ossify some behavior inadvertently exposed by the kernel.
