Shadow-stack control in clone3()
As its name suggests, a shadow stack is a sort of copy of a thread's ordinary call stack, but its contents are limited to return addresses. On a system with shadow-stack support, each function call will push the return address onto both the normal and shadow stacks. On return from a function, the return addresses are popped from both stacks and compared; if they do not match, some sort of corruption has occurred and the thread in question is killed.
The shadow stack is marked specially in the system's page tables and is not writable from user space. A shadow stack must also contain a special, hardware-generated token at the top; this token identifies it as a real shadow stack, and prevents multiple threads from using the same shadow stack. The setup requirements mean that the creation of a shadow stack for a thread must be done by the kernel.
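The push-and-compare behavior described above can be modeled in a few lines of C. This is only a software illustration of the idea; the names used here (struct model, model_call(), model_return(), STACK_DEPTH) are invented for the example, and a real shadow stack is maintained by the hardware rather than by code like this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_DEPTH 64 /* arbitrary depth chosen for the illustration */

struct model {
    uintptr_t normal[STACK_DEPTH]; /* return addresses as stored on the call stack */
    uintptr_t shadow[STACK_DEPTH]; /* the protected copy on the shadow stack */
    size_t depth;
};

/* A call pushes the return address onto both stacks. */
static bool model_call(struct model *m, uintptr_t ret_addr)
{
    if (m->depth == STACK_DEPTH)
        return false; /* no room left in this toy model */
    m->normal[m->depth] = ret_addr;
    m->shadow[m->depth] = ret_addr;
    m->depth++;
    return true;
}

/* A return pops both copies and compares them; false means the stacks
 * were empty or the two return addresses did not match. */
static bool model_return(struct model *m, uintptr_t *ret_addr)
{
    if (m->depth == 0)
        return false;
    m->depth--;
    if (m->normal[m->depth] != m->shadow[m->depth])
        return false; /* corruption: real hardware would fault here */
    *ret_addr = m->normal[m->depth];
    return true;
}
```

Overwriting an entry in normal[] (as a buffer overflow might) while shadow[] stays intact causes the next model_return() to fail, which is the property that the hardware feature enforces.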
When the kernel goes to create a thread's shadow stack, there is an immediate question to be answered: how large should that stack be? The creation of the initial shadow stack for a process can be influenced by the process itself, but that is not true for any threads that the process may create thereafter. Whenever a new thread comes into existence, it is given a shadow stack that is the same size as its regular stack as the kernel's best guess for the right size.
That guess, of course, could be far from the mark. While quite a bit of information — local variables and saved registers, for example — is pushed onto the call stack, the shadow stack only holds return addresses, so there is a good chance that an equally sized shadow stack will be far too large for the thread's needs. If the process creates many threads, the amount of memory wasted by oversized shadow stacks could become significant. There are also (less common) situations, described in the 2023 article, when an equally sized shadow stack could turn out to be too small.
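A rough calculation shows how large the gap can be. The sketch below assumes that every frame on the regular stack has the same size and contributes exactly one return address; both are simplifications, and shadow_bytes_needed() is a name made up for this example.

```c
#include <stddef.h>

/* Shadow-stack bytes needed if a regular stack of stack_bytes is filled
 * with frames of frame_bytes each (frame_bytes must be nonzero), with
 * every frame contributing one pointer-sized return address. */
static size_t shadow_bytes_needed(size_t stack_bytes, size_t frame_bytes)
{
    size_t max_frames = stack_bytes / frame_bytes;

    return max_frames * sizeof(void *);
}
```

With an 8MB regular stack and 64-byte frames, this works out to 131,072 frames; on a system with eight-byte pointers that is 1MB of shadow stack, one eighth of what an equally sized allocation would reserve.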
Either way, there would be value in giving a process a voice in the sizing of shadow stacks for the threads it creates. For some time now, Mark Brown has been working on the ability to control shadow-stack allocation when a thread is created with clone3(); that work was covered here in 2023. At that point, the patch set was in its fourth revision; it is now in its 19th iteration, a number of changes have been made, and Brown is expressing hopes that the series will soon be ready to merge.
Early versions of the series allowed user space to specify the address and size of the desired shadow stack, but that ran into opposition from other developers. Subsequent attempts, which only allowed the size of the shadow stack to be specified, proved to be too limiting. In the end, it was decided to just have the parent process create a shadow stack as it sees fit. So, since version 5, the process calling clone3() must first ask the kernel to create the shadow stack for the new thread to use with a call to map_shadow_stack():
    void *map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
The addr and size arguments describe where the newly created stack should be placed and how big it should be; setting addr to zero leaves the placement decision to the kernel. The flags argument must be SHADOW_STACK_SET_TOKEN to cause the kernel to place the special token at the top of the stack; otherwise the resulting shadow-stack mapping cannot be used for the new thread. This system call will return the virtual address where the stack is actually mapped.
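Since the C library may not provide a wrapper for this system call, a program might invoke it through syscall(). The following is a minimal sketch under that assumption; alloc_shadow_stack() is a hypothetical helper, and the fallback values for the system-call number and the flag are believed to match the kernel's UAPI headers but should be checked against the headers on the target system.

```c
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values; verify locally). */
#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 453
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
#endif

/* Hypothetical helper: map a shadow stack of the given size, letting the
 * kernel choose the address and asking it to place a token at the top.
 * Returns the mapped address, or MAP_FAILED with errno set. */
static void *alloc_shadow_stack(unsigned long size)
{
    long ret = syscall(__NR_map_shadow_stack, 0UL, size,
                       (unsigned long)SHADOW_STACK_SET_TOKEN);

    return ret == -1 ? MAP_FAILED : (void *)ret;
}
```

On kernels or hardware without shadow-stack support, the call is expected to fail (for example with ENOSYS or EOPNOTSUPP), so callers need to check for MAP_FAILED before using the result.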
When the time comes to call clone3(), the new shadow_stack_token field in struct clone_args must be set to point to the shadow-stack token. This requirement might be a bit unintuitive; the pointer is to the token, not to the base of the stack itself. So code using this feature will look something like this:
    struct clone_args args = { 0 };
    void *ss_addr;

    /* Let the kernel pick the address; ask for a token at the top. */
    ss_addr = map_shadow_stack(0, 4096, SHADOW_STACK_SET_TOKEN);
    /* Point at the token, not at the base of the mapping. */
    args.shadow_stack_token = (__u64)(unsigned long)ss_addr + 4096 - sizeof(void *);
    pid = clone3(&args, sizeof(args));

This code allocates a single, 4096-byte shadow stack, letting the kernel choose where the stack should be placed; it then points args.shadow_stack_token at the token that the kernel will have stored at the top of the stack. The thread created with the subsequent clone3() call will, if all goes well, be using the newly created shadow stack. Error handling, obviously, has been omitted.
Brown said, in the cover letter: "I think at this point everyone is OK with the ABI, and the x86 implementation has been tested so hopefully we are near to being able to get this merged?" So far, nobody has spoken up to disagree with that idea. In the absence of surprises, this long-under-development addition to the clone3() API may finally be headed for the mainline.
Posted Aug 26, 2025 16:41 UTC (Tue) by DemiMarie (subscriber, #164188)
Posted Aug 26, 2025 16:46 UTC (Tue) by quotemstr (subscriber, #45331)
Posted Aug 26, 2025 17:38 UTC (Tue) by Lionel_Debroux (subscriber, #30014)
Posted Aug 26, 2025 17:58 UTC (Tue) by quotemstr (subscriber, #45331)
Posted Aug 26, 2025 18:21 UTC (Tue) by Lionel_Debroux (subscriber, #30014)
Posted Aug 27, 2025 18:30 UTC (Wed) by quotemstr (subscriber, #45331)
Posted Aug 28, 2025 13:22 UTC (Thu) by DemiMarie (subscriber, #164188)
Posted Aug 27, 2025 9:23 UTC (Wed) by NYKevin (subscriber, #129325)
(That doesn't mean you just have to live with it, of course. You can ask the kernel folks nicely to make these things compatible, or to provide some suitable alternative, and maybe they'll do it. Or you can send them a patch yourself. What you cannot do is demand that the rest of userspace respect your requirements, just because that happens to be more convenient for you than the alternative.)
Posted Sep 5, 2025 0:39 UTC (Fri) by cypherpunks2 (guest, #152408)
For pthreads, which uses clone3() by default and is where these calls usually come from, if you blacklist clone3() so that it returns -ENOSYS, then it will transparently fall back to clone(), which you can filter more effectively. But yes, it's not an ideal situation since you'd lose the shadow stack.
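The ENOSYS approach mentioned above can be sketched as a classic-BPF seccomp program. This is an illustration only: the names deny_clone3 and deny_clone3_prog are invented here, the fallback system-call number is an assumption to be checked against local headers, and a production filter should also validate seccomp_data.arch before trusting the system-call number.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

#ifndef __NR_clone3
#define __NR_clone3 435 /* assumed value; verify against local headers */
#endif

/* Fail clone3() with ENOSYS and allow every other system call. */
static struct sock_filter deny_clone3[] = {
    /* Load the system-call number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* If it is clone3(), fall through; otherwise skip one instruction. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_clone3, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog deny_clone3_prog = {
    .len = sizeof(deny_clone3) / sizeof(deny_clone3[0]),
    .filter = deny_clone3,
};
```

An unprivileged process would typically install such a program by calling prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) followed by prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &deny_clone3_prog).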
Posted Aug 26, 2025 17:34 UTC (Tue) by geofft (subscriber, #59789)
Posted Aug 26, 2025 20:40 UTC (Tue) by alip (subscriber, #170176)
Posted Aug 27, 2025 3:34 UTC (Wed) by alip (subscriber, #170176)
Posted Aug 27, 2025 3:55 UTC (Wed) by cyphar (subscriber, #110703)
Posted Sep 5, 2025 0:29 UTC (Fri) by cypherpunks2 (guest, #152408)
This is how the Tor daemon's sandbox is able to whitelist things like paths that are passed to open().
It's a bit of a hack, but it works in a pinch.
Posted Aug 26, 2025 16:46 UTC (Tue) by jrtc27 (subscriber, #107748)
[^1]: Actually not quite; it's the logical stack pointer to use. 64-bit SPARC applies a constant bias to its stack pointer to improve code generation, but the caller is not, if I recall correctly, required to do so for clone(2); the kernel will apply the bias itself.
Posted Aug 27, 2025 13:27 UTC (Wed) by jreiser (subscriber, #11027)
Posted Aug 28, 2025 11:12 UTC (Thu) by broonie (subscriber, #7078)
We could add a new flag to map_shadow_stack() to make it a bit more directly useful here, as you suggest; that feels like another item for the list of things to do when factoring some of this code out into common implementations, now that we know that all the current architectures look very similar. The interface for that was locked in with the x86 implementation, which didn't consider clone3() at all. It does feel out of scope for this series (and I'm very reluctant to introduce new changes that aren't absolutely required; this has already been going on for two years at this point, for a bunch of reasons).
Posted Sep 4, 2025 5:46 UTC (Thu) by Vorpal (guest, #136011)
And what about fibers, goroutines, etc., where user space is doing its own thread scheduling?
It seems the current implementation of shadow stacks is incompatible with both of these, unless I'm mistaken. Is the answer simply that you don't get to use those security features?
Posted Sep 4, 2025 10:32 UTC (Thu) by jtepe (subscriber, #145026)
I only skimmed the patch series and did not find a reason for this. Why would I need to pass the flag explicitly, when not passing it results in garbage anyway? Seems to me, the sole purpose of the function is to allocate a shadow stack for the thread. Is there an alternative use for map_shadow_stack that I don't know of?
clone3() + seccomp = bad
eBPF is root-only
Take the Syd sandbox as an example. It is an unprivileged user-space sandbox that uses both SECCOMP_RET_USER_NOTIF (the default) and SECCOMP_RET_TRACE (which can be opted out of). To ensure the guarantees provided by the sandbox, subnamespace creation must be disallowed. At this point clone3(2) becomes problematic because its first argument, cl_args, is a pointer to a struct clone_args. If the sandbox reads this structure from sandbox process memory, checks for unsafe flags, and decides it is safe to proceed with SECCOMP_USER_NOTIF_FLAG_CONTINUE or PTRACE_CONT respectively, there is a time window in which a fellow thread (or a fellow process with process_vm_writev(2) or proc_pid_mem(5) access rights) can change the security-sensitive data in the structure before the Linux kernel reads the structure and acts on it. For more information on how easy this is to exploit, see this article.
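The check-then-use window described above can be replayed deterministically in a few lines. This is a simulation under stated assumptions, not sandbox or kernel code: struct fake_clone_args, supervisor_allows(), kernel_fetch_flags(), and replay_race() are all invented for the example, and the "sibling thread" is represented by a plain assignment between the two reads.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/sched.h> /* CLONE_NEWUSER */

/* Stand-in for the part of struct clone_args that matters here. */
struct fake_clone_args {
    uint64_t flags;
};

/* Step 1: the supervisor copies the structure and checks its copy. */
static bool supervisor_allows(const struct fake_clone_args *in_target)
{
    struct fake_clone_args copy = *in_target;

    return (copy.flags & CLONE_NEWUSER) == 0;
}

/* Step 2: the "kernel" fetches the same memory again, later. */
static uint64_t kernel_fetch_flags(const struct fake_clone_args *in_target)
{
    return in_target->flags;
}

/* Replay the race in a fixed order; returns true if a namespace flag
 * reached the second fetch even though the check had passed. */
static bool replay_race(void)
{
    struct fake_clone_args args = { .flags = 0 };
    bool allowed = supervisor_allows(&args);   /* time of check */

    args.flags |= CLONE_NEWUSER;               /* sibling rewrites memory */

    uint64_t seen = kernel_fetch_flags(&args); /* time of use */

    return allowed && (seen & CLONE_NEWUSER) != 0;
}
```

Because the decision is made on one read of the memory and the action is taken on another, no amount of care in the check itself closes the gap; only a single fetch shared by the checker and the kernel would.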
Therefore, there remain two options for the sandbox:
Option 1 is simple and therefore common practice. However, it means giving up support for newly added features, which is an unfortunate consequence. Going forward, this situation is only going to get worse as Linux keeps adding more system calls with security-sensitive data in arguments hidden from seccomp(2) behind pointer indirection.
Option 2 is the obvious solution to the problem. Two decades ago, when I first got interested in sandboxing, providing this option in unprivileged userspace was a pipe dream. Most unprivileged userspace sandbox tools, such as limon, mbox, nsjail, subterfugue, sydbox and systrace, used ptrace(2), which is a debugging interface and was not designed to act as a security boundary. The sandbox developer had to choose between PTRACE_SYSEMU and PTRACE_SYSCALL. Both came with a noticeable overhead. Considering the lack of selective filtering capabilities of PTRACE_O_TRACESECCOMP, stopping the sandbox process at each and every system call boundary, once on syscall-entry and once on syscall-exit, was a dealbreaker for most interested users. There was no PTRACE_O_EXITKILL either, so it was difficult to ensure that the sandbox process did not outlive the sandboxer. The ptrace(2) requests PTRACE_GET_SYSCALL_INFO and the newer PTRACE_SET_SYSCALL_INFO were missing, so editing the sandbox process's system call number, return value, or arguments in a portable way was a major pain. Reading and writing sandbox process memory using process_vm_readv(2) and process_vm_writev(2) was not possible because those calls were missing. The remaining choice was proc_pid_mem(5) but, again, it is designed as a debug interface which, unlike process_vm_*, does not honour the page protections of the sandbox process, so the sandboxer could be used as a confused deputy to leak or overwrite memory regions that were otherwise inaccessible to the sandbox process. Finally, ptrace(2) is easily detectable with an EPERM return on a ptrace(2) PTRACE_TRACEME, so most malware would not run under the sandbox, completely defeating the purpose for malware analysts.
These were only the beginning of things to worry about. Tackling the issue of TOCTOU in userspace was simply impossible. Many helpful interfaces that allow the current Syd sandbox to (attempt to) act as a (TOCTOU-free) security boundary in unprivileged user space today, such as pidfd_getfd(2), SECCOMP_IOCTL_NOTIF_ADDFD and openat2(2) with RESOLVE_BENEATH, RESOLVE_NO_SYMLINKS and RESOLVE_NO_MAGICLINKS, were missing. The list goes on. Therefore, it is fair to say things have improved substantially. However, there are still some open ends, with clone3(2) being one of them. Within the context of the Syd sandbox, the others are chdir(2), execve(2), and open(2) family system calls with the O_PATH flag. Syd cannot provide safe access to AMD GPUs either because, unlike Nvidia GPUs, opening the /dev/kfd character device has per-process limitations preventing the use of SECCOMP_IOCTL_NOTIF_ADDFD with them. There's a feature request on Linux kernel bugzilla about these issues. There may be other unprivileged userspace sandboxes out there with different requirements, but it's safe to say pidfds with their ability to duplicate sandbox process file descriptors using pidfd_getfd(2) cover most of the ground.
We need a solution to the clone3(2) filtering problem without hindering future system call innovation. Preventing the addition of new system calls with pointer arguments is not feasible. Adding pointer indirection support to seccomp(2) comes with its own problems with respect to security and portability. Support for multiple co-existing architectures in seccomp(2) filters is going to require per-architecture information about data structures used in system call arguments, which can vary in unexpected ways for niche architectures such as x32 or mips.
Pointer indirection is not the only problem. In cases of filtering chdir(2) and execve(2), the lack of a reliable way to perform the action on behalf of the sandbox process prevents sandboxes using SECCOMP_RET_USER_NOTIF and SECCOMP_RET_TRACE from emulating them safely.
gVisor's systrap platform provides a solution to this problem with SECCOMP_RET_TRAP, where the Sentry userspace kernel emulates almost everything, including the memory and process subsystems. When a process running under the gVisor sandbox calls execve(2) or mmap(2), these system calls are not going to reach the host Linux kernel and are directly handled by the Sentry user-space kernel. This is safe and stops attack vectors such as DirtyCoW, but it has issues of its own. All processes run in the same address space from the perspective of the Linux kernel. This means battle-tested defenses provided by the Linux kernel, such as ASLR, need to be reimplemented. It also requires the sandbox to implement the illusion of an inter-process boundary in addition to a user-space-user/user-space-kernel boundary. Providing these isolation boundaries has a noticeable performance impact. NanoVisor is a gVisor fork which weakens these boundaries to reduce the performance overhead. Portability is an issue as well because hand-written assembly is hardly avoidable. As a result, gVisor supports x86-64 and arm64 architectures only, whereas Syd works on x86-64, i586, x32, arm64, armv7, riscv64, loongarch64, powerpc64, powerpc64le, powerpc, s390x, s390 with multipersonality support and outperforms gVisor in benchmarks.
In my humble opinion, a generic workable solution to the system call filtering problem is to have the Linux kernel provide process-guided versions of problematic system calls on a best-effort basis based on pidfds, such as process_clone3, process_map_shadow_stack, process_chdir, and process_execve. The precedents for this are the system calls pidfd_getfd(2), pidfd_send_signal(2) and process_mrelease(2).
Abstraction
Offset for stack pointer
Segmented stacks? Fibers?
Why the need for the special flag?
> A shadow stack must also contain a special, hardware-generated token at the top; this token identifies it as a real shadow stack, and prevents multiple threads from using the same shadow stack.
and
> The flags argument must be SHADOW_STACK_SET_TOKEN to cause the kernel to place the special token at the top of the stack; otherwise the resulting shadow-stack mapping cannot be used for the new thread.
I understand you'd always need the flag in the call to map_shadow_stack.