
LWN.net Weekly Edition for August 10, 2023

Welcome to the LWN.net Weekly Edition for August 10, 2023

This edition contains the following feature content:

  • CPython without a global interpreter lock: a look at PEP 703 and the path toward a no-GIL CPython.
  • Making life (even) harder for proprietary modules: a symbol_get() change closes another hole used to get around GPL-only exports.
  • Beginning the software-interrupt lock pushdown: a BKL-style pushdown aimed at reducing software-interrupt latencies.
  • Shadow stacks for 64-bit Arm systems: the "guarded control stack" patches for arm64.
  • SFrame: fast, low-overhead stack traces: Indu Bhagat's OSSNA talk on the SFrame stack-trace format.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

CPython without a global interpreter lock

By Jake Edge
August 9, 2023

The global interpreter lock (GIL) has been a part of CPython since the beginning—nearly—but that seems likely to change over the next five or so years. As we described last week, the Python steering council has announced its intention to start moving toward a no-GIL CPython, potentially as soon as Python 3.13 in October 2024 for the preliminaries. The no-GIL version of CPython comes from Sam Gross, who introduced it as a proof-of-concept nearly two years ago; now, the idea has been formalized in a Python Enhancement Proposal (PEP) that describes no-GIL mode and how it interacts with the rest of the Python ecosystem.

The PEP

PEP 703 ("Making the Global Interpreter Lock Optional in CPython") was posted back in January, then revised in May. It proposes creating a second build of CPython using a new build configuration switch (‑‑disable-gil) that would "run Python code without the global interpreter lock and with the necessary changes needed to make the interpreter thread-safe". The GIL is a bottleneck for multi-threaded Python programs because it prevents more than one thread from executing Python at any given time. The PEP is not definitive about the eventual end state for the no-GIL build, but the steering council made it clear that its intent is to eventually have only a single build of CPython—either without the GIL if no-GIL works out, or rolling back to the with-GIL version if it does not.

Gross obviously recognized that acceptance of the PEP might be something of a struggle; one of the ways he dealt with that was by giving PEP 703 one of the more extensive "Motivation" sections ever seen. It looks at multiple different Python use cases (AI, numerical libraries, GPU-heavy workloads) and gets quotes from Python developers and maintainers about the problems they have encountered because of the GIL—and the lengths they have had to go to in order to work around the GIL. One project has already switched to using Gross's experimental no-GIL fork in order to avoid communication bottlenecks in its data-acquisition system.

The core of the PEP is the "Specification" section, which describes the changes needed to CPython for no-GIL operation. It mentions two ways to control the no-GIL operation of the ‑‑disable-gil build. First, if the Py_mod_gil slot of an extension module that is being loaded is not set to Py_mod_gil_not_used (or is not present at all), the interpreter will pause any current threads and re-enable the GIL before resuming them. It will issue a warning when that happens, so that users are notified that their code has loaded a module that is not compatible with no-GIL operation.
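
For illustration, a multi-phase (PEP 489 style) extension module that has been made thread-safe might declare the slot as in this sketch; the module itself is hypothetical, and the slot and value spellings follow the PEP as described here, so they may differ in the eventual implementation.

    #include <Python.h>

    static PyModuleDef_Slot mymod_slots[] = {
        /* Declare that this module has been checked for use without the
           GIL; a module that omits this slot causes the interpreter to
           re-enable the GIL, as described above. */
        {Py_mod_gil, Py_mod_gil_not_used},
        {0, NULL}
    };

    static struct PyModuleDef mymod_def = {
        PyModuleDef_HEAD_INIT,
        .m_name = "mymod",          /* hypothetical module name */
        .m_doc = "example no-GIL-ready extension",
        .m_size = 0,
        .m_slots = mymod_slots,
    };

    PyMODINIT_FUNC
    PyInit_mymod(void)
    {
        return PyModuleDef_Init(&mymod_def);
    }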

But, of course, there may well be extensions that have not been updated to use the Py_mod_gil slot (these slots came from PEP 489 ("Multi-phase extension module initialization")) even though they would work fine without the GIL. The PYTHONGIL environment variable can be used to override the slot check; if it is set to zero, the GIL will be disabled, while a value of one forces the GIL to be enabled. That will allow testing modules that may work fine without a GIL, but there are other reasons the override is useful:

The PYTHONGIL=0 override is important because extensions that are not thread-safe can still be useful in multi-threaded applications. For example, one may want to use the extension from only a single thread or guard access by locks. For context, there are already some extensions that [are] not thread-safe even with the GIL, and users already have to take these sorts of steps.

Garbage collection

Most of the changes to CPython for PEP 703 relate to memory management—garbage collection, in particular. The techniques used have not changed all that much since the initial posting of the no-GIL project (and our article describing it); we will review some of that here, but this article will mostly focus on other details in the proposal.

Python's garbage-collection mechanism relies on reference counts (for the most part), but the maintenance of those counts is currently protected by the GIL, so no other locking is required. Multiple concurrent accesses to these counts are a recipe for bugs and crashes, so those counts need to be protected some other way in the absence of the GIL. Operations on reference counts are ubiquitous in the interpreter, though, so adding atomic operations or locking to each of them would be a performance nightmare.

The PEP proposes three techniques to make the reference counts thread-safe in a performant manner. The first is to use biased reference counting, "which is a thread-safe reference counting technique with lower execution overhead than plain atomic reference counting". It uses the fact that most objects are generally not actually shared between threads, so the owning thread uses its own reference count that it maintains without any locking. Any other threads have to use atomic operations on a separate shared reference count; those two counts are then used to determine when the object can be freed.

Some Python objects, such as small integers, interned strings, True, False, and None, are immortal—they live as long as the interpreter does—so they do not need to participate in the reference-counting dance. These are marked as immortal objects, though the scheme used is slightly different from the one in PEP 683 ("Immortal Objects, Using a Fixed Refcount"), which was accepted for Python 3.12. Because the no-GIL interpreter uses biased reference counts, it cannot use the same representation for immortal objects as in PEP 683. In any case, incrementing or decrementing the reference count of an immortal object is a no-op.
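
A minimal sketch combining a biased reference count with the immortal-object fast path might look like the following; the layout and field names are invented for illustration and do not match CPython's actual object header.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Invented object header; CPython's real layout, field names, and
       deallocation logic are different and more involved. */
    struct obj {
        pthread_t owner;                 /* thread that created the object */
        bool immortal;                   /* e.g. None, True, small integers */
        uint32_t local_refcnt;           /* owner-thread count, no atomics needed */
        _Atomic uint32_t shared_refcnt;  /* all other threads, atomic updates */
    };

    static void obj_incref(struct obj *o)
    {
        if (o->immortal)
            return;                                  /* immortal: refcounting is a no-op */
        if (pthread_equal(o->owner, pthread_self()))
            o->local_refcnt++;                       /* biased fast path for the owner */
        else
            atomic_fetch_add(&o->shared_refcnt, 1);  /* slower, shared path */
    }

    /* Decrementing works the same way; an object is only freed once both
       counts, taken together, indicate that no references remain. */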

While most objects get freed when their reference count drops to zero, there are some objects that have reference cycles that prevent the count from reaching zero. These are currently detected and freed during a garbage-collection pass that is protected by the GIL. PEP 703 proposes two "stop-the-world" passes that would pause all threads, first to identify the objects to be freed, and second to identify any that are left after the finalizers from the first round have completed.

Those two phases will also handle the third mechanism (after biased reference counts and immortal objects) that is being added: deferred reference counting. Some objects are generally long-lived, but not immortal, such as modules, top-level functions, and code objects; those objects are commonly accessed by multiple threads as well. Instead of performing an expensive atomic reference-count operation for them, they would instead be marked for deferred reference counting. When those objects are pushed onto and popped from the interpreter's stack, no reference-count operations will be performed, so the true state of references to those objects can only be calculated during a stop-the-world garbage-collection phase. In practice, it is not a lot different from how they are handled today:

Note that the objects that use deferred reference counting already naturally form reference cycles in CPython, so they would typically be deallocated by the garbage collector even without deferred reference counting. For example, top-level functions and modules form a reference cycle as do methods and type objects.

Allocation and locking

The pymalloc memory allocator, which is not thread-safe without the GIL, has been replaced with mimalloc, which has been modified somewhat to support the CPython use case. The mimalloc internal data structures can be used to replace the existing linked list that allows the garbage collector to find all of the Python objects that have been allocated. Mimalloc has also been modified to support something similar to read-copy update (RCU) that allows locks to be avoided when retrieving items from dict and list objects:

A few operations on dict and list optimistically avoid acquiring the per-object locks. They have a fast path operation that does not acquire locks, but may fall back to a slower operation that acquires the dictionary's or list's lock when another thread is concurrently modifying that container.

[...] There are two motivations for avoiding lock acquisitions in these functions. The primary reason is that it is necessary for scalable multi-threaded performance even for simple applications. Dictionaries hold top-level functions in modules and methods for classes. These dictionaries are inherently highly shared by many threads in multi-threaded programs. Contention on these locks in multi-threaded programs for loading methods and functions would inhibit efficient scaling in many basic programs.

The secondary motivation for avoiding locking is to reduce overhead and improve single-threaded performance. Although lock acquisition has low overhead compared to most operations, accessing individual elements of lists and dictionaries are fast operations (so the locking overhead is comparatively larger) and frequent (so the overhead has more impact).
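
The general shape of such an optimistic read (try without the lock, detect a concurrent writer, fall back to locking) might look like the following sketch; the version-counter scheme is only illustrative and is not how CPython actually detects concurrent modification, which also relies on safe memory reclamation from mimalloc so that lock-free readers never touch freed memory.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stddef.h>

    struct container {
        pthread_mutex_t lock;
        _Atomic unsigned version;   /* writers make this odd while modifying */
        void **items;
        size_t nitems;
    };

    static void *container_get(struct container *c, size_t i)
    {
        unsigned before = atomic_load(&c->version);

        if ((before & 1) == 0 && i < c->nitems) {
            void *item = c->items[i];               /* optimistic, lock-free read */
            if (atomic_load(&c->version) == before)
                return item;                        /* no writer interfered */
        }

        /* Fall back to the slow path under the container's lock. */
        pthread_mutex_lock(&c->lock);
        void *item = (i < c->nitems) ? c->items[i] : NULL;
        pthread_mutex_unlock(&c->lock);
        return item;
    }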

In general, Python containers (dict, list, etc.) are protected from concurrent modification by the GIL, though there are some operations that even the GIL does not fully protect, as described in the PEP. For most container operations, the no-GIL interpreter uses per-object locking, which "aims for similar protections as the GIL", though, as mentioned above, read operations avoid locking entirely if they can. But per-object locking can lead to deadlocks:

Straightforward per-object locking could introduce deadlocks that were not present when running with the GIL. Threads may hold locks for multiple objects simultaneously because Python operations can nest. Operations on objects can invoke operations on other objects, acquiring multiple per-object locks. If threads try to acquire the same locks in different orders, they will deadlock.

Those deadlocks can be avoided using "Python critical sections". The idea is that a lock can only be held while an operation is being performed; if there is a nested operation, the lock is "suspended" by being released until the nested operation completes, when it must be reacquired. That suspension must also be done around blocking operations, such as I/O. As an optimization, the suspension is only done if the thread would block. "This reduces the number of lock acquisitions and releases for nested operations, while avoiding deadlocks."
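
A heavily simplified sketch of the idea, with invented names, might look like this; the point is the "suspend" step, which releases the enclosing lock before a nested acquisition, so two threads can never end up waiting on each other's per-object locks.

    #include <pthread.h>

    /* Lock of the innermost active critical section on this thread. */
    static _Thread_local pthread_mutex_t *held;

    /* Enter a critical section on m, suspending any enclosing section. */
    static pthread_mutex_t *cs_begin(pthread_mutex_t *m)
    {
        pthread_mutex_t *outer = held;

        if (outer)
            pthread_mutex_unlock(outer);   /* suspend the outer lock */
        pthread_mutex_lock(m);
        held = m;
        return outer;                      /* handed back so cs_end() can resume it */
    }

    /* Leave the section and resume (reacquire) the suspended outer lock. */
    static void cs_end(pthread_mutex_t *m, pthread_mutex_t *outer)
    {
        pthread_mutex_unlock(m);
        if (outer)
            pthread_mutex_lock(outer);
        held = outer;
    }

CPython's actual critical-section machinery is considerably more involved (it also handles suspension around blocking operations and only when a thread would actually block), but the deadlock-avoidance principle is the same.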

Backward compatibility

As the PEP notes, the vast majority of compatibility concerns with the existing CPython ecosystem are related to the C API. To start with, extensions built for today's CPython will not be ABI compatible with no-GIL, so they will require a recompile at minimum. The bigger problem is that the existence of the GIL has masked concurrency problems that exist in the C code of many extensions.

Even extension developers who wanted to develop thread-safe extensions had no real way to test them until no-GIL came along. By the sound of things, testing extensions with no-GIL is ongoing, especially for the larger, active extensions that have been chafing under the constraints of the GIL for many years. There is a long tail of extensions, however; not breaking those with no-GIL is important, thus the Py_mod_gil slot test. Beyond that, Gross plans to write a compatibility HOWTO that should help the process along.

As noted in last week's article, the lack of a GIL has some negative effects on the ongoing Faster CPython work, which is described in PEP 659 ("Specializing Adaptive Interpreter"). The no-GIL PEP mentions some of those problems and points to a specific specialization problem as an open issue for the no-GIL interpreter. For now, it looks like those problems are seen as challenges by the Faster CPython team, who are looking to work with Gross and others toward a no-GIL interpreter without sacrificing too much single-threaded performance.

Single-threaded performance is another area that the (quite comprehensive) PEP 703 touches on. Since the vast majority of Python code is single-threaded now, which is something that will only start to change slowly once no-GIL gets going, it is imperative that measures be taken to ensure that the performance of those programs does not regress. As Faster CPython developer Mark Shannon said, research will need to be done on "the optimizations necessary to make Python fast in a free-threading environment", but he and other members of the team seem up for the task.

While the numbers are somewhat disputed, PEP 703 gives a performance cost of 5-8% for the no-GIL changes relative to the in-progress Python 3.12. Those numbers are strictly the cost of the changes for no-GIL and do not reflect the gains that will come for multi-threaded Python programs that are not restricted by the GIL.

In conclusion

Even this fairly lengthy look only scratches the surface of the full contents of the PEP; it is well worth a read for those who are interested. One important thing to keep in mind, though, is that the steering council made it quite clear that the process will play out rather deliberately—slowly—over five years or more. There will be lots of opportunities to test and help fix no-GIL Python over that time frame, as well as to work on making extensions thread-safe without the GIL. To a large extent, the success of the no-GIL project is going to depend, at least in part, on the Python community—not just the core developers and the teams from various companies—pulling together to help make it succeed. It will be interesting to see (and report on) how it all goes.

Comments (17 posted)

Making life (even) harder for proprietary modules

By Jonathan Corbet
August 3, 2023
The kernel community has never had a smooth relationship with the purveyors of proprietary kernel modules. Developers tend to strongly dislike those modules, which cannot be debugged or fixed by anybody other than their creator, and many see them as a violation of the kernel's license and their copyrights on the code. Nonetheless, proprietary modules are tolerated, within bounds. A recent patch from Christoph Hellwig suggests that those bounds are about to be tightened slightly, in a somewhat surprising way.

Back in 2006, there was a brief effort to ban the loading of proprietary kernel modules altogether. That attempt was shut down by Linus Torvalds for a number of reasons, starting with the fact that simply loading a proprietary module into the Linux kernel is, on its own, not a copyright violation; it is something that any Linux user is allowed to do. Trying to ban such modules, Torvalds said, would be an indication that the development community is more interested in arguing about licenses than in improving the technology.

Distributing a proprietary module might be a copyright violation, though, if the module itself is a derived work of the kernel code. But "derived work" is a fuzzy concept, and the kernel itself cannot really make that judgment. There is a longstanding mechanism in the kernel designed to keep infringing modules out, though: GPL-only exports. A kernel module cannot do anything useful without accessing symbols (functions and data structures) exported to it by the kernel. Many of those symbols are restricted to modules that have declared a GPL-compatible license, thus fencing proprietary modules away from a lot of kernel functionality.
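
For illustration, the kernel side of this mechanism looks roughly like the following; the exported functions are hypothetical, but EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL() are the real macros involved.

    #include <linux/errno.h>
    #include <linux/export.h>
    #include <linux/module.h>

    /* Hypothetical in-kernel functions, used only to show the two export
       flavors; the names are invented. */
    int example_query_status(void)
    {
            return 0;
    }
    /* Available to any module, regardless of its declared license. */
    EXPORT_SYMBOL(example_query_status);

    int example_core_hook(void *data)
    {
            return data ? 0 : -EINVAL;
    }
    /* Only resolvable by modules declaring a GPL-compatible license. */
    EXPORT_SYMBOL_GPL(example_core_hook);

On the module side, it is the MODULE_LICENSE() declaration that determines whether the loader will resolve references to the GPL-only variant.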

In theory, the GPL-only marking indicates that a symbol is so deeply tied into the kernel that any code making use of it must necessarily be a derived work of the kernel. In practice, the developers making those decisions do not carry out an analysis to determine whether that is the case — and they are not usually qualified to do such an analysis anyway. Instead, symbols are routinely marked GPL-only as a way of making life harder for proprietary modules in general.

To the surprise of, well, almost nobody, the creators of proprietary modules have long sought ways around the limitations imposed by GPL-only exports. This has included falsely declaring a GPL-compatible license, but doing so is such an overt admission of guilt that companies instinctively avoid it. Instead, they look for more subtle ways to get the access they need. One scheme was inadvertently revealed in 2020 as part of a larger patch set intended to make a form of peer-to-peer DMA work. If a module declares itself to have a GPL-compatible license, it will have full access to all of the symbols exported by the kernel. If that module then imports symbols from a proprietary module, it can serve as a go-between, making the full kernel available to the proprietary code. This is a variant of the often-used "GPL condom" approach.

At that time, Hellwig merged a patch intended to make that method more difficult. In current kernels, any module that uses symbols from a proprietary module is treated as being proprietary itself; it immediately loses its ability to access GPL-only symbols. If the module has already gained access to any GPL-only symbols, any attempt it makes to import symbols from a proprietary module will fail. This check breaks the ability of a module to serve as a go-between, sending proprietary-module vendors scurrying back to their lairs complaining about how they have been foiled again.

There is always another way, though, it seems. The kernel provides a macro called symbol_get(), which turns around and calls __symbol_get() to do the real work, which happens to be looking up the address associated with a kernel symbol. This function, which has been in the kernel since the 2.5.48 release in 2002 (where it was added as a part of the wholesale replacement of the module loader), has a specific limitation, though: it only looks up symbols provided by loadable modules. It is intended for use with tightly linked modules that need to reference each other in situations where one of the modules may not be loaded, and without creating reference loops. By appearances, it cannot be used to find the location of GPL-only kernel symbols, and would seem to be of little use for proprietary-module vendors trying to bend the rules.
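
The intended use looks roughly like this sketch, in which one module optionally calls into another module that may or may not be loaded; the symbol name is hypothetical.

    #include <linux/errno.h>
    #include <linux/module.h>

    /* symbol_get() needs a declaration of the symbol in scope. */
    extern int other_module_fn(int arg);

    static int try_call_other_module(int arg)
    {
            int (*fn)(int) = symbol_get(other_module_fn);  /* NULL if not loaded */
            int ret = -ENODEV;

            if (fn) {
                    ret = fn(arg);
                    symbol_put(other_module_fn);   /* drop the module reference */
            }
            return ret;
    }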

What symbol_get() can be used for, though, is obtaining addresses from a proprietary module without going through the normal import mechanism (and its restrictions). It can, in other words, be used to circumvent the 2020 fix, making it once again possible for a nominally GPL-licensed module to call into a proprietary module and give that module access to the kernel functionality it needs. Hellwig has asserted that NVIDIA, a company long known for its proprietary kernel modules, has duly modified its code to make use of this workaround.

In response, he posted this patch set (since revised) to close the hole once again. It changes the behavior of symbol_get(), causing it to fail when asked to look up a symbol that is not marked GPL-only. This is an inversion of the usual test, which denies access to symbols that are marked GPL-only. The reasoning is that symbol_get() has always been intended for low-level cooperation deep within the kernel, where everything is expected to be GPL-only anyway. As it happens, a handful of the uses in the kernel were for symbols that were not so marked, so the patch set includes changes to make the symbols referenced in those cases GPL-only.

The symbol_get() change also, coincidentally, makes it impossible for a GPL-licensed kernel module to resolve symbols defined inside a proprietary module using symbol_get(). Once this change finds its way into a mainline release and, from there, into distributions, the developers of proprietary modules will have to find another way to gain access to the kernel internals they need. The change has been applied by module maintainer Luis Chamberlain, so the chances of it going into the mainline eventually seem fairly good.

Given how long this conflict has been simmering, there is little doubt that authors of proprietary modules will find another way to bypass the intent of the kernel community. Proprietary or not, modules are running within the kernel's address space, so the attack surface available to any module attempting to circumvent the kernel's symbol-access policies is large and can probably never be properly secured. The best that can be done is to continue to make life uncomfortable for those who would ship binary-only kernel modules in the hope that they eventually take the hint and create a freely licensed solution. It is not a perfect technique, but it has often worked over the years.

Comments (79 posted)

Beginning the software-interrupt lock pushdown

By Jonathan Corbet
August 4, 2023
The big kernel lock (BKL) is a distant memory now but, for years, it was one of the more intractable problems faced by the kernel development community. The end of the BKL does not mean that the kernel is without problematic locks, however. In recent times, some attention has been paid to the software-interrupt (or "bottom half") lock, which can create latency problems, especially on realtime systems. Frederic Weisbecker is taking a new tack in his campaign to cut this lock down to size, with an approach based on how the BKL was eventually removed.

The Linux kernel was initially developed on a uniprocessor system — understandably, since that was all any of us had back then — and the code was, as a result, heavily based on the assumption that it was running on the only CPU in the system. The BKL was eventually introduced to enable Linux to run on those multiprocessor machines that, industry analysts assured us, would eventually be all the rage. It ensured that only one CPU was ever running kernel code at any given time, making all kinds of concurrency problems go away, but at a substantial performance cost, especially as the number of CPUs increased. It did not take long for the realization to sink in that the BKL had to go.

The approach that was taken in many subsystems (described in more depth in this article) was to push the BKL down into lower levels of the system. Rather than call every driver's open() function with the BKL held, every driver was modified to acquire the BKL itself. Then, the open() functions could be safely called without the BKL, and each driver could be independently audited and fixed if needed, after which its use of the BKL could be removed. This pushdown broke up a big problem into a large number of smaller, much more tractable problems. It took years, but the BKL was finally removed in 2011.

Software interrupts are a deferred-execution method for work that is urgent, but which cannot be done directly in the context of a hardware interrupt. When there is this kind of work to do, a subsystem will raise the software interrupt by setting a flag; that will cause its handler to be called at the next opportune moment, usually immediately after the completion of hardware-interrupt processing or before returning to user space from a system call. Processing can also be pushed out to a separate ksoftirqd thread if it starts to take too much time. See this article for a more thorough discussion of this mechanism — and of one of Weisbecker's other attempts to improve it.

There are many users of software interrupts, including tasklets, networking, the block subsystem, read-copy-update, and kernel timers. On some workloads, software-interrupt processing can be a significant part of the overall load on the CPUs; it can run for long enough to create latency problems for software running in user space. Kernel code that disables software-interrupt processing (to avoid race conditions with the handlers) becomes non-preemptible and can also cause unwanted latencies. In general, like the BKL, software interrupts reflect a design that might have been suitable decades ago, but which is problematic now.
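
To make the deferral pattern concrete, a driver using a tasklet (one of the software-interrupt users listed above) might look roughly like this sketch; the device and function names are hypothetical.

    #include <linux/interrupt.h>

    /* Runs in softirq context, shortly after the hard IRQ handler returns. */
    static void mydev_do_deferred_work(struct tasklet_struct *t)
    {
            /* The heavier, non-urgent part of the interrupt handling. */
    }

    static DECLARE_TASKLET(mydev_tasklet, mydev_do_deferred_work);

    static irqreturn_t mydev_irq_handler(int irq, void *dev_id)
    {
            /* Acknowledge the hardware here, then defer the rest. */
            tasklet_schedule(&mydev_tasklet);
            return IRQ_HANDLED;
    }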

One aspect of that design is that software-interrupt handlers exclude each other; only one can be executing on a given CPU at any time. As a result, if the block software-interrupt handler runs for a long time, the handlers for networking and timers may be indefinitely delayed. This is the case even though it is rare for software-interrupt handlers of different types to race with each other. There is no way to know for sure that running two handlers concurrently is safe, so that is not done.

Weisbecker's patch set is meant to chip away at this problem using a BKL-style pushdown in the timer subsystem. Timer functions are queued from all over the kernel; they tend to be independent and to lack concurrency problems with other software-interrupt handlers. Almost all of them could be run entirely concurrently with other software-interrupt processing — but the "almost" part is the catch. Making timer processing entirely independent of software-interrupt processing without being sure of the safety of every timer function would be a way to introduce hard-to-debug problems.

Weisbecker, instead, takes a two-step approach to introducing more concurrency to timer processing. The first is to allow individual software-interrupt vectors to be disabled without disabling software-interrupt processing entirely. The purpose of the patch set is to allow timer functions to run concurrently with other software interrupts, but they still should not run concurrently with each other. By disabling the processing of timer events (on the local CPU), the timer handler can safely re-enable software-interrupt processing without the risk of being called again.

The second step is to allow individual timer functions to inform the timer subsystem that they can be run concurrently with other software-interrupt handling. Any timer function that will not race with software-interrupt handlers, or which performs its own software-interrupt disabling where needed, can be marked by adding the TIMER_SOFTINTERRUPTIBLE flag when setting up its timer event. When the timer subsystem sees this flag, it will re-enable software-interrupt processing while that timer function runs. As a result, the timer function can be preempted if some more-important work comes along.
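
A timer function opting in might look roughly like the following sketch; TIMER_SOFTINTERRUPTIBLE comes from the patch set (it is not in mainline kernels), and the assumption that it is passed like the existing TIMER_* flags via timer_setup() is mine.

    #include <linux/jiffies.h>
    #include <linux/timer.h>

    /* Safe to run with other software-interrupt handling enabled. */
    static void mydrv_timeout(struct timer_list *t)
    {
            /* ... timer work that does not race with softirq handlers ... */
    }

    static struct timer_list mydrv_timer;

    static void mydrv_start_timer(void)
    {
            timer_setup(&mydrv_timer, mydrv_timeout, TIMER_SOFTINTERRUPTIBLE);
            mod_timer(&mydrv_timer, jiffies + HZ);   /* fire in one second */
    }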

In the patch set, only one timer function, process_timeout(), is marked in this way. Weisbecker looks forward to a day, though, "after a few years", when all of the kernel's timer functions have been audited and made safe to run concurrently with software-interrupt handlers; at that point, it will be possible to remove timer processing from the software-interrupt mechanism entirely. That, in turn, will be a small step toward the eventual removal of software interrupts in general.

Clearly, there is a fair amount of work needed to get to that point. Even this patch set needs "more tweaks" to enable interruptible timer functions to preempt other software-interrupt handlers, which is an important part of the problem. But this work could represent a step in that direction, should it find its way into the mainline. Weisbecker has made a few attempts to address software interrupts by now without a lot of success. Eventually, though, as with the BKL, the right approach will be found and a longstanding problem will, finally, be resolved.

Comments (3 posted)

Shadow stacks for 64-bit Arm systems

By Jonathan Corbet
August 7, 2023
Return-oriented programming (ROP) has, for some years now, been a valuable tool for those who would subvert a system's security. It is thus not surprising that a lot of effort has gone into thwarting ROP attacks, which depend on corrupting the call stack with a carefully chosen set of return addresses, at both the hardware and software levels. One result of this work is shadow stacks, which can detect corruption of the call stack, allowing the operating system to react accordingly. The 64-bit Arm implementation of shadow stacks is called "guarded control stack" (GCS); patches implementing support for this feature are currently under discussion.

A shadow stack is a copy of a thread's call stack; it is often (but not necessarily) maintained by the CPU hardware. Whenever a function call is made, the current return address is pushed onto both the regular stack and the shadow stack. When the function returns, the addresses at the top of the two stacks are compared; if they do not match, the system concludes that the call stack has been corrupted and, probably, aborts execution. This check is enough to defeat most attacks that involve writing a sequence of return addresses to the stack. Even if the shadow stack is writable, the need to update it to match the call stack raises the bar for a successful exploit considerably.

Software shadow stacks can be effective, but there are advantages to implementing them in hardware; the performance can be better, and the CPU can prevent attempts to corrupt the shadow stack. Naturally, any such support will be architecture-specific, and so will require architecture-specific code to make use of. The effort to implement user-space shadow stacks for x86 has been underway for some time and will, with luck, land in the mainline in the near future.

The 64-bit Arm ("aarch64") processors — and the developers adding support for the new processor features — are coming later to the shadow-stack party, a fact that brings both advantages and disadvantages. On the "advantage" side, the x86 developers have spent years in extended discussions over how shadow stacks should be supported and what the interface to them should look like, and they have the scars to show for it. In the cover letter for the GCS support patch series, Mark Brown made it clear that he intends to avoid a similar experience if possible:

As there has been extensive discussion with the wider community around the ABI for shadow stacks I have as far as practical kept implementation decisions close to those for x86, anticipating that review would lead to similar conclusions in the absence of strong reasoning for divergence.

On the other hand, the first implementer of a kernel feature is not normally expected to make that implementation sufficiently general for the needs of those that will follow. Experience has shown that premature abstraction, like premature optimization, tends not to lead to good results. So it is often the second or third comer who has to create a framework that all implementations can fit into.

The Arm shadow-stack interface

In this case, there was relatively little work of this type to do. The Arm world doesn't use the term "shadow stack" much, preferring the GCS term, but x86 got there first, so "shadow stack" has become the generic way of referring to this feature. The x86 implementation adds some arch_prctl() calls to control the feature, but aarch64 does not implement arch_prctl() at all. So, instead, the GCS patches create a new set of prctl() calls meant to control the feature on all architectures. The main control operation is PR_SET_SHADOW_STACK_STATUS, which takes a number of flags.

Whenever a new thread is created, it will not have a shadow stack. One can be added using PR_SET_SHADOW_STACK_STATUS with the PR_SHADOW_STACK_ENABLE flag; that will cause a shadow stack to be allocated, and the calling thread will start using it. Since this call initializes the shadow stack, the portion of the call stack that was populated prior to the function that turned on the shadow stack will not be represented there; as a result, returning from that function will not be possible. Since the expectation is that the shadow stack will be enabled in the dynamic loader before jumping to a program's entry point, this limitation should not normally be a problem.
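
A thread (or, more realistically, the dynamic loader) enabling the feature might do something like the following sketch; the prctl() constants come from the patch series and are not yet in released kernel or libc headers, so this only builds against patched headers.

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
        /* PR_SET_SHADOW_STACK_STATUS and PR_SHADOW_STACK_ENABLE are from
           the proposed interface described above. */
        if (prctl(PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0) != 0) {
            perror("enabling the shadow stack failed");
            return 1;
        }
        /* From here on, function returns are checked against the guarded
           control stack; a mismatch will terminate the thread. */
        return 0;
    }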

Invoking PR_SET_SHADOW_STACK_STATUS without PR_SHADOW_STACK_ENABLE set will disable the shadow stack. On the Arm architecture, the setup of a shadow stack can only be done once for any given thread; if the shadow stack is subsequently disabled, it is gone forever.

Shadow-stack memory is specially marked and can be protected from manipulation by the owning process. There is a pair of flags that controls whether user space can make changes to the shadow stack (outside of those that happen automatically at function-call and return time). The PR_SHADOW_STACK_PUSH flag allows user space to push entries onto the stack using a special instruction, while PR_SHADOW_STACK_WRITE enables ordinary writes. Either or both of these capabilities may be needed to, for example, support user-space threading; enabling them reduces the security provided by the shadow stack, but the core defense against stack-smashing attacks (including ROP attacks) remains.

As with x86, shadow stacks are normally allocated automatically by the kernel. In cases where user space may need to allocate shadow stacks separately (again, user-space threading comes to mind), the map_shadow_stack() system call is supported:

    void *map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);

The returned pointer, on success, indicates a range of memory that has been properly prepared for shadow-stack use, with the protections set appropriately and the necessary tokens (which allow the CPU to recognize the stack and prevent concurrent use) put in place. Actually switching to the allocated stack requires using a dedicated Arm instruction.
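
Using the new system call might look roughly like this sketch; since there is no C-library wrapper yet, the raw syscall() interface is used, and the syscall number is only available from the pending patches.

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* __NR_map_shadow_stack comes from the pending kernel patches and will
       not be present in older installed headers. */
    static void *alloc_shadow_stack(unsigned long size)
    {
        /* addr = 0 lets the kernel choose the placement; the flags are
           left at zero in this sketch. */
        long ret = syscall(__NR_map_shadow_stack, 0UL, size, 0U);

        if (ret == -1) {
            perror("map_shadow_stack");
            return NULL;
        }
        return (void *)ret;
    }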

The PR_LOCK_SHADOW_STACK_STATUS flag locks the indicated configuration in place, preventing future changes. This flag can be used to prevent the thread from disabling the shadow stack or enabling writes to it. There is also a separate PR_GET_SHADOW_STACK_STATUS operation that can be used to query the current status.

This patch set only implements shadow-stack support for user space; there is no support for kernel-space shadow stacks.

Prospects

This work appears to be relatively uncontroversial and to be nearly ready to go, with one caveat: it depends on the x86 shadow-stack work in a number of ways. The x86 patches also seem nearly ready, but they were turned down by Linus Torvalds during the 6.4 merge window, and were not proposed for 6.5. Until the x86 work lands in the mainline, the Arm patches will not be able to do so. As a result, the 6.7 release seems like the earliest that can be expected to include Arm shadow-stack support.

It is also worth mentioning that shadow stacks are also coming to RISC-V, with the feature bearing the intuitive name "zisslpcfi". The support patches are still "RFC quality" and will likely need some work. They contain the generic prctl() operations (indeed, the first version of that interface appeared there), but do not include map_shadow_stack(), preferring an mmap() interface that has been deemed unsuitable elsewhere. The zisslpcfi patches also include support for forward-edge control-flow integrity, similar to the x86 indirect branch tracking.

Hardware-based protection for control-flow integrity is clearly seen by all of the vendors as an important part of their security strategy, with most processors of interest adding support. Updating the kernel to actually use these features has been a slow process, with a number of roadblocks appearing along the way. The indications are, though, that this multi-year journey is reaching its end and attackers will have to move on to new techniques in the ongoing security arms race.

Comments (6 posted)

SFrame: fast, low-overhead stack traces

By Jake Edge
August 8, 2023

OSSNA

Getting a stack trace of a running program is useful in a variety of scenarios: tracing, profiling, debugging, performance tuning, and more. There are existing mechanisms to get stack traces, but there are some downsides to them; the "Simple Frame" (SFrame) stack-trace format came about to address the shortcomings in the other techniques. Back in May, Steve Rostedt and Indu Bhagat gave a talk about SFrame support in the kernel as part of LSFMM+BPF; a few days later, Bhagat gave a more general talk about SFrame (YouTube video) at Open Source Summit North America in Vancouver. That second talk helped fill in some other aspects of SFrame and the overall stack-tracing picture.

Background

Bhagat began by defining what a stack trace is: "a list of function calls that are currently active in the thread". A useful stack trace will show information about the instruction pointer (IP) location within each function in the call-chain list, as well as some human-readable symbol names. That includes the function name, but it can also provide information like file names and line numbers. The symbolization piece is beyond the scope of her talk, however, which focuses on how to get the list of IPs in the call chain.

[Indu Bhagat]

Different kinds of tools will generate the call-chain IPs in different ways because they are focused on their own use cases; "a debugger will do it differently than a profiler". From her slides, she put up a list of five existing stack-tracing mechanisms, but said that she would be talking about three in particular: frame pointer, EH frame, and application-specific formats. Those three will provide enough context to explain the motivations behind the SFrame format, she said.

The frame-pointer technique is perhaps one of the oldest stack-tracing methods. It reserves a register to hold the frame pointer, which is a pointer to the start of the current stack frame; the compiler generates some extra code to save and restore the value of the stack pointer to that register on function entry and exit. So there is some performance impact of the extra code for each function call; beyond that, the compiler has to reserve the register just for the frame pointer, which also has performance implications. But it is an easy-to-understand mechanism that works well; "it's beautiful, it just works and it is so simple".

The EH-frame mechanism is a DWARF-based method that can do more than just stack tracing; it can also do stack unwinding, which means that it can restore the state of all of the registers at each point in the call chain. The information needed to do that is stored in the binaries in .eh_frame and .eh_frame_hdr ELF sections. The format itself is compact, versatile, and works well in practice, Bhagat said; it also has good library support for applications that want to handle the format.

Using EH frame does not require reserving a register for the frame pointer but "the stack tracer itself is slow and it is complex". The reason is that the DWARF information is a set of instructions needed to recover the stack offsets for the information of interest; some of the instructions are simple and some complex, "but you would need to implement sort of a stack machine where you can execute the opcodes". The main complaints about this method are its speed and complexity, which also make it unsuitable for use in the kernel. She noted that the discussion around the now-accepted Fedora 37 proposal to enable frame pointers by default in the builds for the distribution (which was covered by LWN) also touched on some of the problems with the frame pointer and EH-frame methods.

The application-specific formats have come about because of these shortcomings. For example, the kernel's ORC-based stack tracing is used because of the problems with the EH-frame method; there are other application-specific formats, which are not open source, but also demonstrate the need for a fast and simple stack-tracing solution. The application-specific solutions are not using information generated by the toolchain, so they may require reverse-engineering to use the formats in other ways; that may make it difficult to port and maintain those formats.

Requirements

She put up a slide that summarized the pros and cons of the three methods, which could be used to generate a set of requirements for a new stack-tracing method. The first was that given any program counter (or instruction pointer, she used both in the talk), a precise stack trace can be generated from it, which she called "asynchronous stack tracing". At the end of the talk, an audience member asked about what that term means in this context. Bhagat said that frame-pointer-based stack traces are not always precise because the compiler adds extra instructions in the prologue and epilogue of a function. If the stack trace starts on one of those instructions, some part of the trace will be missed because the frame-pointer handling is incomplete at those points.

The other requirements were more obviously derived from the pros and cons on her slide: low-overhead tracing, with a low-complexity tracer, using information that is generated by the toolchain. SFrame has been designed with those requirements in mind, she said.

The SFrame format was defined and implemented in Binutils 2.40 as SFrame version 1. Since the talk, Binutils 2.41 was released with some fairly small, but not backward compatible, changes to the format, which is now SFrame version 2. The format simply contains enough information to be able to do a stack trace: given a program counter (PC) value, the canonical frame address (CFA), frame pointer (FP), and return address (RA) can be retrieved. "That's all you need to stack trace and that is all that is encoded in the Simple Frame stack-trace format."

SFrame is defined for two ABIs: x86_64 and 64-bit Arm. It has support for encoding procedure linkage table entries (pltN). On Arm, it can encode pointer authentication information so that the return address values that have been mangled by the authentication can be decoded.

The SFrame information is stored in a .sframe ELF section, which is stored in its own PT_GNU_SFRAME segment. To generate it, the ‑‑gsframe option needs to be passed to the GNU assembler. The GNU linker "will do the right thing" if it sees multiple .sframe sections by combining them in the output. The readelf and objdump tools also have support for SFrame; using the ‑‑sframe option will provide human-readable text describing the SFrame information.

Format

The SFrame format has three parts: a header, a set of function descriptor entities (FDEs), and a set of frame row entries (FREs). The header contains a magic number, a version number, and offsets to the other two sections. The FDEs are fixed size, sorted in PC order, so a binary search can be used to find the function corresponding to a given PC value. FREs are variable-length in order to be as compact as possible. Offsets are used to access the various pieces of information in the format.

Each FDE corresponds to a single function. It stores the start PC and function size in bytes; it also indicates whether it is a regular code block or a pltN. After that, it has an offset to the first FRE along with the number and types of FREs that the function has.

The FREs are the backbone of the format, she said. They provide the stack offsets that can be used to recover the CFA, FP, and RA given a particular PC within the function. Because function sizes differ, the space needed to represent an offset from the starting PC value also differs; there are three different representations for FREs depending on whether the offsets can be encoded in one, two, or four bytes. Each FRE covers a contiguous range of addresses within the function and encodes the stack offsets for the CFA, FP, and RA values that are valid for that range.
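
The following structures are purely illustrative, capturing only the fields described above; they are not the definitions from Binutils' sframe.h, and the real on-disk encoding is more compact and unaligned (the FREs, in particular, come in one-, two-, and four-byte offset variants).

    #include <stdint.h>

    struct sketch_sframe_header {
        uint16_t magic;            /* identifies an SFrame section */
        uint8_t  version;          /* SFrame format version */
        uint32_t fde_offset;       /* where the FDE array starts */
        uint32_t fre_offset;       /* where the FRE data starts */
        uint32_t num_fdes;
    };

    struct sketch_sframe_fde {     /* one per function, sorted by start PC */
        int32_t  start_pc;         /* function start address (as an offset) */
        uint32_t func_size;        /* size of the function in bytes */
        uint32_t first_fre_offset; /* offset of this function's first FRE */
        uint32_t num_fres;
        uint8_t  info;             /* regular code vs. pltN, FRE size variant */
    };

    struct sketch_sframe_fre {     /* one per address range within a function */
        uint32_t pc_delta;         /* start of the range, relative to the function */
        int32_t  cfa_offset;       /* recover the canonical frame address */
        int32_t  fp_offset;        /* recover the frame pointer */
        int32_t  ra_offset;        /* recover the return address */
    };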

The ability to perform a binary search on the FDEs is one advantage that SFrame has; it allows speedy access to the starting point for a trace. Another advantage for the format is that the FREs directly encode the offsets needed to recover the CFA, FP, and RA; there is no need to execute stack-machine instructions to do so. The kernel's ORC format also directly encodes these offsets, but SFrame does some space optimizations to make its format more compact. She put up a graph ("take it with some grain of salt") showing the space savings on x86_64 for SFrame versus EH frame for ten different binaries from Binutils; it showed that SFrame was roughly 80% of the size required for EH frame. She did caution that the use case for EH frame is different, so it is not exactly a fair comparison.

Library

The libsframe library ships with Binutils (starting with 2.40) and has APIs to read and write SFrame data. The library was created with the linker in mind, so it includes a write API that a stack tracer likely will not need; there are also APIs that target stack tracers, such as for finding the FRE that corresponds to a PC value or getting the stack offsets from an FRE.

Libsframe is a young library, so it is too early to make any ABI stability guarantees, Bhagat said. The API is described in sframe-api.h. The SFrame format is not aligned on disk, but the library internally arranges the data so that there are no unaligned accesses. She showed some sample code to demonstrate "how easy it is to stack walk"; it found an FRE based on a PC value (find_fre()), then got the offsets for the CFA, FP, and RA values (get_*_offset()) in order to retrieve them.
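
A stack-walk step along the lines of her sample might look like the sketch below; the libsframe function names exist in sframe-api.h, but the exact signatures shown here are approximations and should be checked against the header.

    #include <stdio.h>
    #include "sframe-api.h"

    /* Given a decoder context for a .sframe section and a PC value, print
       the stack offsets needed to recover the CFA, FP, and RA. */
    static void show_offsets(sframe_decoder_ctx *dctx, int32_t pc)
    {
        sframe_frame_row_entry fre;
        int err;

        /* Find the frame row entry covering this PC. */
        if (sframe_find_fre(dctx, pc, &fre) != 0) {
            fprintf(stderr, "no FRE found for this PC\n");
            return;
        }

        /* Read out the stack offsets for CFA, FP, and RA. */
        int32_t cfa = sframe_fre_get_cfa_offset(dctx, &fre, &err);
        int32_t fp  = sframe_fre_get_fp_offset(dctx, &fre, &err);
        int32_t ra  = sframe_fre_get_ra_offset(dctx, &fre, &err);

        printf("cfa offset %d, fp offset %d, ra offset %d\n", cfa, fp, ra);
    }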

There is some future work needed in the assembler to support a CFI directive (.cfi_escape) that has been skipped at this point; it means that SFrame is not fully asynchronous, but the compiler rarely emits that directive so it is not a large problem in practice. There is also a need to add more regression tests for SFrame unwinding to the tests for the GNU assembler. Beyond that, she said that the SFrame developers plan to work with the community on use cases for SFrame, including for user-space applications and user-space stack tracing for the kernel. Adding SFrame support to LLVM has been suggested since the talk, as well.

Bhagat finished her talk by suggesting that those who are interested in working with SFrame get in touch with the developers via the Binutils mailing list. An audience member asked about applications currently using SFrame; Bhagat said that other than the work on the kernel piece, which could be used by perf, Ftrace, BPF, and others, there are no applications using the format. The SFrame developers started with the kernel use case and are now starting to look into user-space applications that could benefit from fast, low-overhead stack traces.

An attendee asked about other architectures, noting that he believed the RISC-V ABI was rather different. Bhagat said that SFrame already accommodated the differences between x86_64 and Arm64, but if another architecture had major differences in the way it handled its return address, SFrame would probably need to be changed to handle it. As it stands, x86_64 always uses the stack for its RA, while Arm64 uses both the stack and a dedicated register, which SFrame already handles.

Bhagat's colleague, Jose Marchesi, asked about the relationship between SFrame and ORC; he wondered why the kernel would need something like SFrame instead of simply using ORC. Bhagat said that because ORC is application-specific, it can represent the stack usage of all of the different kinds of code in the kernel, including its hand-rolled assembly code. But to do user-space stack tracing, the ORC format will need some changes; SFrame is not meant to replace the kernel's internal use of ORC, which has similar goals to those of SFrame, but to complement it for doing user-space tracing from the kernel.

Comments (12 posted)

Page editor: Jonathan Corbet


Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds