Controlling shadow-stack allocation in clone3()

By Jonathan Corbet
December 7, 2023

User-space shadow stacks are a relatively new feature in Linux; support was only added for 6.6, and is limited to the x86 architecture. As support for other architectures (including arm64 and RISC-V) approaches readiness, though, more thought is going into the API for this feature. As a recent discussion on the integration of shadow stacks with the clone3() system call shows, there are still some details to be worked out.

A shadow stack is a copy of the current call stack that contains only return addresses; it is maintained by the CPU. While user-space code can access (and even modify) the shadow stack, that access is limited in a number of ways by the hardware. When a shadow stack is enabled, every function call results in the return address being pushed onto both the regular and the shadow stacks. Whenever a function returns, the return address on the regular stack is compared to the copy on the shadow stack; if the two don't match, the processor will trap and (normally) the affected process will be killed. This feature is meant to provide a defense against attacks based on overrunning stack-based variables, including return-oriented programming (ROP) attacks.

There is code that will not work with a shadow stack, so the feature cannot be enabled by default. Thus, when a process is created, it does not have a shadow stack, even on an architecture that supports the feature; a shadow stack can be created and enabled with a prctl() call. If, however, a thread with a shadow stack already set up creates a new thread, the kernel will create and install a shadow stack for that thread before it begins execution; that ensures that the thread will never run without protection. As will be seen, though, there are reasons why a process may want a higher level of control over how that shadow stack is created.

In October, Mark Brown (who is working on the arm64 shadow-stack implementation) posted a patch series adding that control to clone3(), a relatively new system call that was designed to allow the addition of new features in this way. The initial version of the series added two fields to the clone_args structure used to pass parameters to clone3(): the address and size of the shadow stack to be provided to the new thread. Rick Edgecombe (who carried the x86 implementation over the finish line) quickly pointed out a problem with that API, though: the ability to place the shadow stack in memory could be used to put it in an inconvenient location — on top of another shadow stack, for example. Nothing good would come from such an action, and it could be used as an attack vector.

After some discussion, it was concluded that, while it might be useful to allow user space to be able to position the shadow stack exactly, there was no overwhelming need for that capability. So, in subsequent versions of the series (including the current fourth revision), only the size of the desired shadow stack can be provided to clone3(), in a clone_args field called, unsurprisingly, shadow_stack_size. If that size is provided, it will be used by the kernel to create the new thread's shadow stack; otherwise the default size (which is equal to the size of the regular stack) will be used instead.

By version 3, posted in in late November, the patch set appeared to be settling down. Christian Brauner, though, questioned whether this API was worth adding, worrying that it was a step toward turning clone3() (which he created) into "a fancier version of prctl()". He wondered why it was necessary to allow user space to affect the size of the shadow stack at thread-creation time. Recognizing that he perhaps did not fully understand the problem, he asked a few questions about the motivations for this change.

One of those motivations is to prevent over-allocation of the shadow stack, which can result from the current policy of allocating the shadow stack with a size equal to that of the regular stack. Szabolcs Nagy explained the problem in this case: if a thread is created with a large (regular) stack, perhaps so that it can store a large array of data there, the shadow stack will be just as large, and almost all of that space will be wasted. For a single thread, perhaps that waste could be tolerated, but in an application with a large number of threads, it could add up to a lot of lost memory.

There is also a case where an equally sized shadow stack could be too small. The sigaltstack() system call allows a thread to set up an alternative stack to be used for signal delivery. Even when a thread is switched to its alternative stack, though, it continues to use the same shadow stack. If the thread exhausts the regular stack, then handles a signal (perhaps even caused by running out of stack space) with a deep call chain on an alternative stack, the shadow stack could overflow.

The kernel can try to make an educated guess as to what the optimal shadow-stack size might be, but it will remain a guess. As Brown pointed out, the only way to improve on that guess is to accept information from user space, which (presumably) has a better idea of what its needs are. Creating a new thread without a shadow stack and letting that thread map one explicitly would be one way to solve the problem; creating a suitably sized shadow stack in clone3(), though, ensures that the new thread will never run without shadow-stack coverage.

Brauner seemed to accept the reasoning behind the addition of this feature to clone3(), but he worried that there is currently only one architecture with shadow-stack support in the mainline currently. The addition of others, he hinted, could drive changes in the proposed API; he suggested keeping the clone3() changes out of the mainline until arm64 support has been merged. Brown was amenable to that plan for now, as long as the arm64 and clone3() changes could be merged together.

That seems likely to be how things will go from here. The merging of arm64 shadow-stack support appears to be on a slow path while the user-space side is being finalized, so it may be a while before all this work lands in a mainline kernel. If all goes well, though, it will eventually be possible to control the size of the shadow stack given to new threads on all architectures that implement shadow stacks.

Index entries for this article
Kernel	Security/Control-flow integrity
Kernel	System calls/clone()

Controlling shadow-stack allocation in clone3()

Posted Dec 8, 2023 19:57 UTC (Fri) by roc (subscriber, #30627) [Link] (2 responses)

My understanding is that userspace can access shadow stack memory, at least under some conditions, so as an rr maintainer I suspect we are going to need the ability to control where shadow stacks are allocated, so we can ensure during replay the shadow stacks are allocated at the same address as during recording.

But I would have thought CRIU has the same issues, and yet I know CRIU maintainers have been talking to the CET people and this hasn't come up?

Controlling shadow-stack allocation in clone3()

Posted Dec 8, 2023 22:02 UTC (Fri) by redgecombe (subscriber, #126527) [Link] (1 responses)

Yep, map_shadow_stack syscall takes an optional address (like mmap), and the CRIU patches used it.

The earlier proposed clone3 design involved userspace allocating the shadow stack and then passing the address into clone3. So it was setting the shadow stack pointer register (SSP) to an arbitrary point, not telling the kernel to allocate the shadow stack at a specific point.

Do you mean rr needs to control where a newly created thread allocates a shadow stack? If so could you comment the details on the mailing list to that series? Keep in mind the SSP is controllable via ptrace, so a tracer should be able to write to shadow stacks, set the SSP wherever it wants, and map shadow stacks at specific locations (via map_shadow_stack injection). So it seems like something could me made to work, but it would be good to know if there are any hard requirements.

Controlling shadow-stack allocation in clone3()

Posted Dec 9, 2023 0:58 UTC (Sat) by roc (subscriber, #30627) [Link]

Done, thanks.