clone3() + seccomp = bad

Posted Aug 26, 2025 17:34 UTC (Tue) by geofft (subscriber, #59789)
In reply to: clone3() + seccomp = bad by DemiMarie
Parent article: Shadow-stack control in clone3()

What's a case where you'd want to do argument filtering on clone3 and neither SECCOMP_RET_USER_NOTIF nor SECCOMP_RET_TRACE are workable?

clone3() + seccomp = bad

Posted Aug 26, 2025 20:40 UTC (Tue) by alip (subscriber, #170176) [Link] (1 responses)

You cannot filter clone3 arguments with a SECCOMP_RET_USER_NOTIF or SECCOMP_RET_TRACE handler unless you can fully emulate the clone otherwise the struct pointer indirection will lead to a TOCTOU.

clone3() + seccomp = bad

Posted Aug 27, 2025 3:34 UTC (Wed) by alip (subscriber, #170176) [Link]

Take the Syd sandbox as an example. It is an unprivileged user-space sandbox that uses both SECCOMP_RET_USER_NOTIF (default) and SECCOMP_RET_TRACE (can be opted-out). To ensure the guarantees provided by the sandbox, subnamespace creation must be disallowed. At this point clone3(2) becomes problematic because the second argument is a pointer to struct cl_args. If the sandbox reads this structure from sandbox process memory, check for unsafe flags and decide it is safe to proceed with a SECCOMP_USER_NOTIF_FLAG_CONTINUE or PTRACE_CONT respectively, there exists a time-window when a fellow thread (or a fellow process with process_vm_writev(2) or proc_pid_mem(5) access rights) can change the security-sensitive data in the structure before the Linux kernel reads this structure and acts on it. For more information on how easy this is to exploit, see this article.

Therefore, there remains two options for the sandbox:

Deny clone3(2) with ENOSYS signaling libc to fallback to clone(2).
Emulate clone3(2) completely on behalf of the sandbox process.

Option 1 is simple and therefore common practice. However, this means getting no support of added features which is an unfortunate consequence. Going forward, this situation is only going to escalate further as Linux keeps adding more system calls with security-sensitive data in arguments hidden from seccomp(2) behind pointer indirection.

Option 2 is the obvious solution to the problem. Two decades ago, when I first got interested in sandboxing, providing this option in unprivileged userspace was a pipe-dream. Most unprivileged userspace sandbox tools, such as limon, mbox, nsjail, subterfugue, sydbox and systrace, used ptrace(2), which is a debugging interface and was not designed with the goal to act as a security boundary. The sandbox developer had to choose between PTRACE_SYSEMU and PTRACE_SYSCALL. Both came with a noticable overhead. Considering the lack of selective filtering capabilities of PTRACE_O_TRACESECCOMP, stopping the sandbox process at each and every system call boundary, once on syscall-entry and once on syscall-exit, was a dealbreaker for most interested users. There was no PTRACE_O_EXITKILL either so it was difficult to ensure the sandbox process does not outlive the sandboxer. The ptrace(2) requests PTRACE_GET_SYSCALL_INFO, and the newer PTRACE_SET_SYSCALL_INFO were missing so editing sandbox process' system call number, return value, or arguments in a portable way was a major pain. Reading/writing sandbox process memory using process_vm_readv(2) and process_vm_writev(2) were missing. The remaining choice is proc_pid_mem(5) but again it is designed as a debug interface which does not honour the page protections of the sandbox process unlike process_vm_* so the sandboxer could be used as a confused deputy to leak or overwrite memory regions that were otherwise inaccessible to the sandbox process. Finally, ptrace(2) is easily detectable with an EPERM return on a ptrace(2) PTRACE_TRACEME so most malware would not run under the sandbox, completely beating the purpose for malware analysts.

These were only the beginning of things to worry about. Tackling with the issue of TOCTOU in userspace was simply impossible. Many helpful interfaces that allows the current Syd sandbox to (attempt to) act as a (TOCTOU-free) security boundary in unprivileged user-space today, such as pidfd_getfd(2), SECCOMP_IOCTL_NOTIF_ADDFD and openat2(2) with RESOLVE_BENEATH, RESOLVE_SYMLINKS and RESOLVE_MAGICLINKS were missing. The list goes on. Therefore, it is fair to say things have improved substantially. However, there're still some open ends with clone3(2) being one of them. Within the context of the Syd sandbox, the others are chdir(2), execve(2), and open(2) family system calls with the O_PATH flag. Syd cannot provide safe access to AMD GPUs either because unlike Nvidia GPUs, opening the /dev/kfd character device has per-process limitations preventing the use of SECCOMP_IOCTL_NOTIF_ADDFD with them. There's a feature request on Linux kernel bugzilla about these issues. There may be other unprivileged userspace sandboxes out there with different requirements, but it's safe to say pidfds with their ability to duplicate sandbox process file descriptors using pidfd_getfd(2) cover most of the ground.

We need a solution to the clone3(2) filtering problem without hindering future system call innovation. Preventing addition of new system calls with pointer arguments is not feasible. Adding pointer indirection support to seccomp(2) comes with its own problems wrt. security and portability. Support for multiple co-existing architectures in seccomp(2) filters is going to require per-architecture information about data structures used in system call arguments which can vary in unexpected ways for niche architectures such as x32 or mips.

Pointer indirection is not the only problem. In cases of filtering chdir(2) and execve(2), the lack of a reliable way to perform the action on behalf of the sandbox process prevents sandboxes using SECCOMP_RET_USER_NOTIF and SECCOMP_RET_USER_TRACE from emulating them safely.

gVisor's systrap platform provides a solution to this problem with SECCOMP_RET_TRAP where the Sentry userspace kernel emulates almost everything, including the memory and process subsystems. When a process running under the gVisor sandbox calls execve(2) or mmap(2), these system calls are not going to reach the host Linux kernel and are directly handled by the Sentry user-space kernel. This is safe and stops attack vectors such as DirtyCoW, but it has issues on its own. All processes run in the same address space from the perspective of the Linux kernel. This means battle-tested defenses provided by the Linux kernel such as ASLR needs to be reimplemented. It also requires the sandbox to implement the illusion of an inter-process boundary in addition to a user-space-user/user-space-kernel boundary. Providing these isolation boundaries has noticable performance impact. NanoVisor is a gVisor fork which weakens these boundaries to reduce the performance overhead. Portability is an issue as well because hand-written assembly is hardly unavoidable. As a result, gVisor supports x86-64 and arm64 architectures only, whereas Syd works on x86-64, i586, x32, arm64, armv7, riscv64, loongarch64, powerpc64, powerpc64le, powerpc, s390x, s390 with multipersonality support and outperforms gVisor in benchmarks.

In my humble opinion, a generic workable solution to the system call filtering problem is to have the Linux kernel provide process-guided versions of problematic system calls on a best-effort basis based on pidfds such as process_clone3, process_map_shadow_stack, process_chdir, and process_execve. The precedent for this are the system calls pidfd_getfd(2), pidfd_send_signal(2) and process_mrelease(2).