
Restartable sequences, TCMalloc, and Hyrum's Law

By Jonathan Corbet
April 30, 2026
Hyrum's Law states that any observable behavior of a system will eventually be depended upon by somebody. The kernel community is currently contending with a clear demonstration of that principle. The recent work to address some restartable-sequences performance problems in the 6.19 release maintained the documented API in all respects, but that was not enough; Google's TCMalloc library, as it turns out, violates the documented API, prevents other code from using restartable sequences, and breaks with 6.19. But the kernel's no-regressions rule is forcing developers to find a way to accommodate TCMalloc's behavior.

As a quick reminder: the restartable sequences feature, accessed by way of the rseq() system call, provides a mechanism for the execution of brief critical sections in user space. A shared-memory segment is used to communicate to the kernel when a critical section is active, and the kernel can redirect execution if the running thread is preempted or migrated during that critical section. There are a number of associated features, including the ability to quickly determine which CPU a thread is running on; the time-slice-extension feature merged for the 7.0 release is also tied to restartable sequences.

TCMalloc trouble

On April 22, Mathias Stearn reported two problems with TCMalloc resulting from the 6.19 improvements to restartable sequences. One of them turned out to be a simple bug in the 64-bit Arm implementation, tied to the fact that the Arm architecture does not yet fully use the generic entry code. This bug is uncontroversial and will be fixed like any other. The second problem, though, has deeper origins.

The memory shared between the kernel and user space for restartable sequences includes an instance of struct rseq; that structure contains a number of fields, one of which, in particular, is of interest for this discussion. The 32-bit cpu_id_start field contains the number of the CPU on which the thread is running. This value, which is maintained by the kernel, is explicitly defined as a read-only value for user space, and is guaranteed to always contain a valid CPU number, even if restartable sequences are not in use.
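For readers who want to picture the layout, here is a minimal sketch of the original 32-byte area. The name rseq_area_sketch and the reserved padding are illustrative assumptions, not the kernel's definition; real code should use the structure from <linux/rseq.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative mirror of the original 32-byte rseq area. The name
 * rseq_area_sketch is hypothetical; real programs should use the
 * definition in <linux/rseq.h>.
 */
struct rseq_area_sketch {
    uint32_t cpu_id_start; /* always a valid CPU number; read-only for user space */
    uint32_t cpu_id;       /* current CPU, or an error marker when not registered */
    uint64_t rseq_cs;      /* pointer to the active critical-section descriptor */
    uint32_t flags;
    uint8_t  reserved[12]; /* stand-in for space used by later extensions */
} __attribute__((aligned(32)));

static_assert(sizeof(struct rseq_area_sketch) == 32, "original area is 32 bytes");
static_assert(offsetof(struct rseq_area_sketch, cpu_id_start) == 0, "first field");

/* User space is only supposed to read this field; the kernel is the sole writer. */
static inline uint32_t read_cpu_id_start(const volatile struct rseq_area_sketch *area)
{
    return area->cpu_id_start;
}
```

The point of the sketch is the contract rather than the exact bytes: cpu_id_start sits at the very start of the area, and user space is expected to treat it as read-only.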

Prior to the 6.19 release, the kernel would update cpu_id_start on every return from the kernel to user space, regardless of whether the CPU number had changed or not. Storing a single integer value does not seem like an expensive operation, but looks can be deceiving; many CPUs have features that prevent the kernel from randomly changing user-space memory. Turning off that protection (and re-enabling it after the store) is expensive. Removing the redundant stores improved performance by 15% for many workloads without changing the restartable-sequences ABI in any way — or so it seemed.

The TCMalloc library makes extensive use of restartable sequences to improve performance. Notably, while it does use this feature for critical sections, it also uses it to detect scheduling interruptions outside of critical sections. The trick (to avoid more pejorative terms) used is described in detail in this document. In short, TCMalloc's internal data structures are designed to overlay the shared struct rseq so that cpu_id_start becomes the upper four bytes of an internal cache pointer. When TCMalloc stores this pointer, the result is to write zeroes into cpu_id_start; the topmost bit is set, though, to distinguish the contents of cpu_id_start from any valid CPU number. When the kernel stores into cpu_id_start, it will end up clearing that top bit, enabling TCMalloc to quickly detect the change and regenerate that pointer.
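The overlay can be modeled in a few lines. This is a simplified simulation, not TCMalloc's code: the 64-bit word stands in for the library's cached pointer, its upper 32 bits stand in for cpu_id_start, and all of the names (CACHED_BIT, publish_cached(), kernel_store_cpu(), cache_still_valid()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Top bit of the 64-bit word; it lands in the top bit of cpu_id_start. */
#define CACHED_BIT (UINT64_C(1) << 63)

/*
 * Library side: publish a cached value whose upper 32 bits are zero,
 * then set the marker bit. Only the low 32 bits carry payload here.
 */
static inline uint64_t publish_cached(uint32_t payload)
{
    return CACHED_BIT | (uint64_t)payload;
}

/* Kernel side (simulated): store a CPU number into the upper 32 bits. */
static inline uint64_t kernel_store_cpu(uint64_t word, uint32_t cpu)
{
    /* Valid CPU numbers never have the top bit set. */
    assert(cpu < (UINT32_C(1) << 31));
    return (word & UINT64_C(0xffffffff)) | ((uint64_t)cpu << 32);
}

/* Library side: the cached value is usable only while the marker survives. */
static inline bool cache_still_valid(uint64_t word)
{
    return (word & CACHED_BIT) != 0;
}
```

Any kernel store, even of CPU 0, wipes the marker while leaving the low half of the word intact, which is what lets the library notice the interruption with a single test.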

The key point is that TCMalloc needs that signal for any sort of interruption, even if the running thread did not move to a new CPU. Pre-6.19 kernels would always overwrite cpu_id_start — an undocumented but observable behavior — providing TCMalloc with that signal; as of 6.19, that overwriting only happens if the thread migrates to a new CPU. As a result, TCMalloc, which has come to depend on that undocumented behavior, ends up leaving a smoking crater in the middle of any application that is trying to use it.
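The difference between the two kernel behaviors can be sketched as a pair of store policies. This is a toy model under stated assumptions: struct sim_thread, MARKER, and the function names are invented for illustration, and the real kernel logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MARKER (UINT32_C(1) << 31) /* top bit of cpu_id_start, set by the library */

struct sim_thread {
    uint32_t cpu_id_start; /* simulated shared field */
    uint32_t last_cpu;     /* CPU the kernel last recorded for this thread */
};

/* Pre-6.19 policy: store on every return to user space. */
static inline void return_to_user_always(struct sim_thread *t, uint32_t cpu)
{
    assert((cpu & MARKER) == 0); /* valid CPU numbers never carry the marker */
    t->cpu_id_start = cpu;
    t->last_cpu = cpu;
}

/* 6.19 policy: store only when the thread has changed CPUs. */
static inline void return_to_user_on_migration(struct sim_thread *t, uint32_t cpu)
{
    assert((cpu & MARKER) == 0);
    if (cpu != t->last_cpu) {
        t->cpu_id_start = cpu;
        t->last_cpu = cpu;
    }
}

/* True when the library would notice that it was interrupted. */
static inline bool interruption_visible(const struct sim_thread *t)
{
    return (t->cpu_id_start & MARKER) == 0;
}
```

With the first policy, a preemption that leaves the thread on the same CPU still clears the marker; with the second, the marker survives and the library never learns that it was interrupted.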

A regression?

The problematic nature of this behavior has been widely known for some years; the above-linked documentation advises that, since cpu_id_start can no longer be counted on to hold the current CPU number, "this makes __rseq_abi.cpu_id_start unusable for its original purpose". In other words, no other code running within a TCMalloc-using thread can use restartable sequences and expect it to work. That is somewhat awkward, given that one other user of the feature is the GNU C Library (glibc), as of version 2.35. Back in 2022, this problem was reported in the TCMalloc issue tracker and a change of behavior was requested but, despite a fair amount of discussion, no change was ever made. As a result, code using TCMalloc must be run with an environment variable set to prevent glibc from trying to use restartable sequences.

Kernel developers have had a dim view of this behavior for a while. Unsurprisingly, that view became rather dimmer yet when users started complaining that the 6.19 release broke TCMalloc entirely. The behavior described above violates the documented restartable-sequences ABI and makes the feature unusable for anybody else. It would have been detected by the kernel's debugging facilities, but those were clearly never used with TCMalloc, since they would have caused an immediate killing of the offending thread. In the issue discussion, kernel developers had offered an extension to restartable sequences to let TCMalloc stop overwriting cpu_id_start, but that offer was not accepted. The result, as Thomas Gleixner described it, is "everyone is in a hard place and up a creek without a paddle".

Gleixner made it clear that he thought TCMalloc's difficulties should not be considered to be a kernel regression, since the documented ABI guarantees are still being upheld by the kernel and the debugging feature would have caught the problem years ago. Linus Torvalds, though, was just as clear that the only thing that matters is that a once-working program stopped working as the result of a kernel change: "This is not some kind of gray area. It clearly violates our regression rules".

This response was clearly expected by Gleixner, though he still did not like it: "Feel free to enforce it, but be aware that you thereby set a precedence that a single abuser can then rightfully own a general shared interface of the kernel forever and force everybody else to give up". Glibc developer Florian Weimer was also unhappy, pointing out that TCMalloc's use breaks the modular design of the restartable-sequences ABI. Torvalds, though, was adamant that a fix had to be found.

Now what?

Various options for fixes were discussed; Stearn had been working on a simple, low-cost option that, seemingly, has not worked out. Another option, of course, is to simply go back to always updating cpu_id_start and accepting the associated performance penalty; there are not many supporters of this approach in the kernel community. The most likely fix, as of this writing, is something based on this patch from Gleixner, which works without requiring changes to either TCMalloc or glibc, albeit with the performance penalty in some environments.

As is the pattern with many recent system calls, rseq() requires the caller to pass in both the pointer to struct rseq and the size of that structure. In this way, future extensions can be made in compatible ways by increasing that size. Gleixner proposes increasing it from the current 32 bytes to 33 bytes, which would also have the effect of forcing a 64-byte alignment of the structure. Any caller presenting a 32-byte struct rseq or failing to align the structure properly would see the pre-6.19 behavior, with unconditional updating of cpu_id_start; more recent related features, such as time-slice extension, would also be unavailable. If, instead, the caller provides a 33-byte, 64-byte-aligned rseq structure, the kernel provides the 6.19 behavior, with full performance.
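The selection rule described above might be sketched as follows. This is only a reading of the proposal as summarized here, not code from the patch; pick_mode(), the enum, and the constant names are hypothetical, and the sizes and alignment are the figures given in the text.

```c
#include <assert.h>
#include <stdint.h>

#define LEGACY_RSEQ_SIZE   32u /* structure size from the article */
#define EXTENDED_ALIGNMENT 64u /* alignment from the article */

enum rseq_mode_sketch {
    RSEQ_MODE_LEGACY,    /* unconditional cpu_id_start stores, no newer features */
    RSEQ_MODE_OPTIMIZED, /* 6.19 behavior */
};

/* Decide which behavior a registration gets, based on size and alignment. */
static inline enum rseq_mode_sketch pick_mode(uintptr_t area_addr, uint32_t area_len)
{
    assert(area_len >= LEGACY_RSEQ_SIZE); /* smaller areas are out of scope here */
    if (area_len == LEGACY_RSEQ_SIZE)
        return RSEQ_MODE_LEGACY;
    if (area_addr % EXTENDED_ALIGNMENT != 0)
        return RSEQ_MODE_LEGACY;
    return RSEQ_MODE_OPTIMIZED;
}
```

For example, a 33-byte area at address 0x1040 would be treated as optimized, while the same area at 0x1020 would fall back to the legacy behavior.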

The result should be a fully compatible change. Existing TCMalloc installations will use the older structure size; the overlay trick used by the library also prevents a 64-byte alignment for the structure. So TCMalloc will be given the older behavior that it depends on. Newer glibc versions query the expected structure size (using getauxval()) and will be rewarded with higher performance and full functionality, with no glibc update needed.
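As a rough illustration of the query side, a program can ask the auxiliary vector what the kernel expects. The sketch below assumes a Linux system; the AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN values are taken from the kernel's <linux/auxvec.h>, while the struct and function names are invented here and are not glibc's.

```c
#include <assert.h>
#include <sys/auxv.h>

/* Values from the kernel's <linux/auxvec.h>, for older libc headers. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN 28
#endif

struct rseq_area_hint {
    unsigned long feature_size; /* 0 when the kernel does not report it */
    unsigned long alignment;    /* 0 when the kernel does not report it */
};

/* Ask the kernel, via the auxiliary vector, how to size and align the area. */
static inline struct rseq_area_hint query_rseq_area_hint(void)
{
    struct rseq_area_hint hint;

    hint.feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
    hint.alignment = getauxval(AT_RSEQ_ALIGN);
    /* A reported alignment is expected to be a power of two (or absent). */
    assert((hint.alignment & (hint.alignment - 1)) == 0);
    return hint;
}
```

A caller that sizes its area from these values, rather than hard-coding 32 bytes, picks up whatever the running kernel advertises without needing to be rebuilt.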

Older glibc versions (those prior to 2.41), though, will be stuck with the performance penalty; Weimer indicated that updating those versions would not be an easy thing to do. Mathieu Desnoyers suggested adding a flag that could be passed to rseq() as an "I'm not TCMalloc" indicator, resulting in the faster behavior. Adding that flag would be a far easier backport to older glibc versions. Gleixner, though, dismissed that idea, saying that it would lead to unnecessary complexity in the code and, in any case, would be problematic in cases where there are multiple users of restartable sequences within a single application.

This solution appears to have survived initial testing, and has been put together into a proper patch series, along with some sharp words for the people who made it necessary:

As Linus decreed the onus is on the lack of ABI compliance enforcement in the original RSEQ kernel implementation and the clever abuse is fine. That's technically correct, but in the context of the larger ecosystem a fundamentally flawed decision. Though that's a completely different discussion to have as it affects the long term sustainability of the Open Source ecosystem in general and the ability to protect it against rogue actors, which are thereby officially entitled to hold a whole ecosystem hostage and force the people who provide them their operational base to go out on a limb to make progress.

Grumpiness aside, it seems that the form of the solution to this particular demonstration of Hyrum's Law has been worked out. As long as there are no other users doing strange things, it should be possible to move forward, preserving all of the improvements that have been made, without breaking existing users. The absence of other surprises seems fairly likely to be the case, since there are few users of restartable sequences. In the end, things clearly could have been worse.

That said, this episode will surely not be reassuring to developers who fear exposing features that could end up creating similar compatibility problems in the future. Many discussions about adding new interfaces have run aground on that point; see, for example, this response by David Hildenbrand to the idea of allowing BPF programs to control more aspects of memory management. It is easy to see Hyrum lurking in the corner, just waiting to ossify some behavior inadvertently exposed by the kernel.

Index entries for this article
Kernel: Development model/Regressions
Kernel: Restartable sequences



tcmalloc's weird hack

Posted Apr 30, 2026 19:39 UTC (Thu) by roc (subscriber, #30627) [Link]

When I implemented rseq support in rr to make tcmalloc work, I remember encountering this hack and getting very confused. The gift that keeps on giving!

"Lack of enforcement" and "exploitable" should be easier to tell apart

Posted Apr 30, 2026 21:57 UTC (Thu) by hailfinger (subscriber, #76962) [Link] (7 responses)

If the kernel has an exploitable interface and some large-scale user depends on it (think large malware campaign relying on a kernel bug), would Linus block fixing that bug as well because malware has come to depend on it? After all, Linus once famously wrote: "I personally consider security bugs to be just normal bugs".

An equally interesting question is: Would the malicious actor have to prove that exploitation is widespread in order to get protected against the "regression" which is a security bugfix?

"Lack of enforcement" and "exploitable" should be easier to tell apart

Posted Apr 30, 2026 22:19 UTC (Thu) by Kamilion (subscriber, #42576) [Link]

We'll see, I guess, now that copy.fail is demonstrating an exploitable kernel interface, today.

"Lack of enforcement" and "exploitable" should be easier to tell apart

Posted Apr 30, 2026 22:40 UTC (Thu) by geofft (subscriber, #59789) [Link]

Leaving aside, of course, the "I know it when I see it" method, I think this is distinguishable: the functionality that TCMalloc is relying on does not violate any intended security properties of the kernel. TCMalloc isn't gaining access to info that it shouldn't be able to, and it's clear that some other mechanism for what they want would be fine and mergeable—the dispute is just whether this particular ABI should expose the functionality. If rseq had never worked this way, a request to add some kind of API would have been considered; a request to add an explicitly local-root API would not be considered. Similarly, the change in behavior wasn't because the old behavior was buggy and they wanted to fix it, but because there was a performance boost from taking a different approach, and that happened to cause different user-visible behavior as a side effect.

"Lack of enforcement" and "exploitable" should be easier to tell apart

Posted May 1, 2026 8:36 UTC (Fri) by farnz (subscriber, #17727) [Link] (3 responses)

Note that Linus's test is a judgement call - Linus rules that the userspace ABI is defined by kernel behaviour, and as long as you're a desirable user of the kernel ABI (ignoring whether you're using it as intended - just whether Linus wants you as a user of his kernel), regressions are not permitted.

Thus, in the malware case, Linus would probably consider the malware an undesirable user, and refuse to "fix" the regression because he'd like the malware authors to target a different kernel.

"Lack of enforcement" and "exploitable" should be easier to tell apart

Posted May 1, 2026 19:35 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> Thus, in the malware case, Linus would probably consider the malware an undesirable user, and refuse to "fix" the regression because he'd like the malware authors to target a different kernel.

Funny you said that: https://www.linux.com/news/torvalds-creates-patch-cross-p...

"Lack of enforcement" and "exploitable" should be easier to tell apart

Posted May 4, 2026 7:39 UTC (Mon) by kleptog (subscriber, #1183) [Link] (1 responses)

Interesting way to frame it. Someone discovers that the kernel is violating the syscall ABI by not restoring the EBX register in the old syscall interface. This happened to work most of the time because glibc saves the register beforehand. But Linus codes a patch to fix the issue anyway.

And because it was found while researching an old virus, you can frame it as: Linus fixes bug so that old virus works again.

Clickbait I suppose...

"Lack of enforcement" and "exploitable" should be easier to tell apart

Posted May 4, 2026 17:28 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

I couldn't find the LKML message from Linus, but he said something like: "This code was stupid, but here, I fixed your virus".

"Lack of enforcement" and "exploitable" should be easier to tell apart

Posted May 3, 2026 17:24 UTC (Sun) by sionescu (subscriber, #59410) [Link]

I wonder when the companies that bankroll kernel development will realise that Linus is more of a liability than anything else. I hope we'll soon see an abdication or a hard fork.

What is an interface?

Posted Apr 30, 2026 22:11 UTC (Thu) by magfr (subscriber, #16052) [Link] (6 responses)

Despite what Linus says things have been removed.
CGroups v1 is just one of the things I can think of.
What is the difference here?

What is an interface?

Posted Apr 30, 2026 22:49 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

cgroups v1 are still supported by the kernel. They are being deprecated by user-space tools, but they are still supported.

What is an interface?

Posted May 3, 2026 9:55 UTC (Sun) by LtWorf (subscriber, #124958) [Link] (2 responses)

I have had failures because setting some flags on some type of descriptors was previously ignored but a later version decided to act on it instead.

Not to mention /proc and /sys that are for some reason not considered kernel API and free to change at will.

What is an interface?

Posted May 5, 2026 7:43 UTC (Tue) by dankamongmen (subscriber, #35141) [Link]

i had a console ioctl removed out from underneath my shipping code =\.

What is an interface?

Posted May 5, 2026 15:35 UTC (Tue) by mb (subscriber, #50428) [Link]

There is no hard line between what is a user space breaking change and what is not. Even adding a new driver can be a breaking change, if your program depends on the absence of a kernel driver.
There is no solution to the spacebar heating problem.

Whether interpreting previously unused bits or any other kernel change breaks user space is less of a technical question, because the technical answer is always: Yes, it can break something. It's a question of common sense and checking afterwards, if we broke something *important*.

What is an interface?

Posted May 1, 2026 0:13 UTC (Fri) by ebiederm (subscriber, #35028) [Link] (1 responses)

If someone makes the effort to update userspace so that the potential problem no longer exists in practice you can remove an interface.

This is all about the very practical test. Can a kernel be upgraded without having to change userspace.

That makes tracking down other problems much easier.

What is an interface?

Posted May 1, 2026 10:30 UTC (Fri) by ballombe (subscriber, #9523) [Link]

So your interpretation is that glibc is penalized for having accepted to add an environment variable for tcmalloc compatibility ?

Not only a kernel problem

Posted Apr 30, 2026 22:47 UTC (Thu) by ballombe (subscriber, #9523) [Link]

This is not only a kernel problem. Whatever Linux decides, the glibc team could decide to stop supporting tcmalloc by removing support for the environment variable.

rseq vs load_acquire then store_release

Posted May 1, 2026 8:29 UTC (Fri) by jreiser (subscriber, #11027) [Link] (7 responses)

Would rseq be unnecessary if all important hardware implemented the instructions load_acquire and store_release (also known as load_locked and store_conditional)? These instructions enhance a cache in order to provide enough accounting to detect actual collisions, and also potential future collisions that are enabled because of hardware interrupts and context switches. Some background is in https://lwn.net/Articles/844224/.

rseq vs load_acquire then store_release

Posted May 1, 2026 8:34 UTC (Fri) by Sesse (subscriber, #53779) [Link]

load_acquire and store_release (which exist in some form on nearly all modern CPUs) are used to guard access to shared data in a safe way. rseq is used to avoid having shared data in the first place.

rseq vs load_acquire then store_release

Posted May 1, 2026 9:27 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

Acquire/release is different from LL/SC. Acquire and release determine the synchronization point between threads, LL/SC delimit a "transaction" within a single thread.

rseq vs load_acquire then store_release

Posted May 1, 2026 10:41 UTC (Fri) by farnz (subscriber, #17727) [Link]

All important hardware does implement those atomic instructions you name - but using them comes at a cost because you're no longer just accessing memory the cheap way, but also ensuring that all other cache agents will see the memory access in the "right" order according to the system's memory ordering rules.

However, if you don't care about other cache agents, you can avoid paying that price - and that's the point of rseq. With rseq, you know if you're preempted and moved to another CPU, and thus can use a cheap sequence when you're not pre-empted, and a more expensive sequence (with atomics) if you're moved to a new CPU.

For many reasons, the kernel tries hard to keep threads on the same CPU unless it has to move them, and so on a system that's not overloaded, you'll do the cheap option most of the time, and thus can afford for the "moved CPU" option to be quite a bit more expensive than just using atomics.

rseq vs load_acquire then store_release

Posted May 1, 2026 15:43 UTC (Fri) by wahern (subscriber, #37304) [Link] (3 responses)

Probably not. Even existing LL/SC implementations are probably too weak: only one or two adjacent cache lines and too many spurious conflicts. And rseq semantics are in some ways orthogonal. Maybe a stronger LL/SC with much more well-defined and guaranteed semantics, e.g. N disjoint cache lines per core that don't fail updates just because some other cache lines on another core associatively overlapped? (Or maybe instead of disjoint just a whole page?)

Sun's Rock processor was supposed to offer this (billed as hardware transactional memory, but IIRC it was just strong'ish LL/SC like above), but it never got off the ground. I think strong LL/SC with a useful amount of operations/memory imposes too many hurdles in modern chips, or would cost too much silicon. rseq has a really good cost+benefit profile, and I had always wondered why it didn't exist 20+ years ago. But it would be nice to get stronger hardware primitives, rather than doubling the width of cmpxchg every other decade.

rseq vs load_acquire then store_release

Posted May 1, 2026 17:04 UTC (Fri) by farnz (subscriber, #17727) [Link] (2 responses)

The stronger LL/SC you're describing sounds a lot like Intel's Restricted Transactional Memory (part of the TSX extension). It's been removed from mobile and desktop processors, but not from Xeons. I suspect it's removed because it's a pain to validate, and thus bug-prone.

rseq vs load_acquire then store_release

Posted May 1, 2026 18:41 UTC (Fri) by wahern (subscriber, #37304) [Link]

Yeah, TSX was Intel's attempt to hack together something analogous to a less weak LL/SC into their ISA, but I think mostly implemented in microcode to leverage what cache coherency interfaces each particular microarchitecture could provide. But it was plagued by bugs and other issues, like side channel exploits, and even on chips that nominally support it, it's often disabled.

To get a stronger, durable primitive (whether labelled LL/SC or not) probably requires a major redesign of cache coherency architectures, etc. And that would probably stall performance progression more generally for a generation or two, and even if it succeeded it would take a few years at least to see any significant usage. So it's a huge risk for AMD and Intel, especially. The story is more complicated for ARM and RISC. Some implementations provide wider and more reliable LL/SC than others, enough to really broaden the horizon of lock-free and wait-free algorithms, but the landscape is too fragmented and dynamic so few people would want to predicate their concurrency architecture on what the best implementations can provide.

rseq vs load_acquire then store_release

Posted May 11, 2026 18:36 UTC (Mon) by anton (subscriber, #25547) [Link]

They certainly had their share of bugs in TSX instructions, but it makes no sense to leave them enabled in large Xeons if they have not fixed the bugs.

My guess is that TSX is disabled completely in the consumer and small Xeon CPUs because it provides a high-bandwidth speculative side channel attack opportunity. For the larger Xeons, Intel did not completely disable it because SAP HANA uses it, but I expect that only Xeons that run only trusted users will have it enabled, and the others will disable it in the BIOS or so.

workaround is okay

Posted May 1, 2026 10:30 UTC (Fri) by meyert (subscriber, #32097) [Link] (3 responses)

Given that the structure size acts like a version of the API structure, the workaround is okay, I think.
Next step: deprecate the old version and remove the hack, which is okay as it would not be a "regression", I think ;-)
What would be the earliest date for deprecation?

workaround is okay

Posted May 1, 2026 14:59 UTC (Fri) by hmh (subscriber, #3838) [Link] (2 responses)

Deprecation can slow down new adoptions of the undesired behaviour, but it is not going to help much unless you can actually push the existing users away from the deprecated API.

workaround is okay

Posted May 3, 2026 16:07 UTC (Sun) by jthill (subscriber, #56558) [Link] (1 responses)

That's the part I don't get. Making decisions based on what you expect is in the cpu caches is a perfectly cromulent goal, why not make some form of the existing behavior selectable in the new interface? Even just conditionally updating the cpu id if it doesn't match sounds doable (he says, from the peanut gallery).

workaround is okay

Posted May 20, 2026 7:30 UTC (Wed) by daenzer (subscriber, #7050) [Link]

> Even just conditionally updating the cpu id if it doesn't match sounds doable

Wouldn't that still incur a performance penalty? Based on the assumption that "features that prevent the kernel from randomly changing user-space memory" make reading from user-space memory more expensive as well.

Apply some soft pressure on TCMalloc

Posted May 1, 2026 17:23 UTC (Fri) by jbills (subscriber, #161176) [Link] (3 responses)

What about adding a small additional slowdown to those who use the legacy behavior? For example, mark the page with `cpu_id_start` as fault-on-write for every write. Their motivation is performance, and if their performance is deliberately killed maybe that will motivate them to fix their crap. This wouldn't be a breaking change, just a rude one.

Hard on Google

Posted May 2, 2026 5:17 UTC (Sat) by CChittleborough (subscriber, #60775) [Link] (1 responses)

AFAICT, Google went to these extremes with TCMalloc because they use it on their gigantic fleets of machines. Penalizing TCMalloc might well cause a detectable increase in the energy use at Google's main (non-AI?) compute centers.

Hard on Google

Posted May 2, 2026 7:13 UTC (Sat) by joib (subscriber, #8541) [Link]

Sounds like an excellent way to motivate Google to fix their issue, then.

Apply some soft pressure on TCMalloc

Posted May 2, 2026 9:52 UTC (Sat) by corbet (editor, #1) [Link]

The legacy behavior will already sacrifice the significant performance improvements that come with the newer code — and features like time-slice extension as well. That, hopefully, should be enough to motivate change.

Interesting.

Posted May 4, 2026 19:33 UTC (Mon) by gmatht (subscriber, #58961) [Link]

My instincts were telling me that the solution would be to do a quick check for the flag, hinting to the CPU that it is not expected to be set, and do the expensive write only if the flag was set. A branch misprediction seems like a small price to pay to exploit this undocumented behaviour.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds