clone3() + seccomp = bad [LWN.net]

clone3() + seccomp = bad

Posted Aug 26, 2025 16:46 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (7 responses)

I'll use it in mine, thanks. I'm not going to slow innovation for everyone because a few insist on using outdated security products.

clone3() + seccomp = bad

Posted Aug 26, 2025 17:38 UTC (Tue) by Lionel_Debroux (subscriber, #30014) [Link] (5 responses)

Out of curiosity, what would you suggest using instead of, or maybe in addition to, seccomp? AFAIK at least, it is neither niche (it's used in containers and various sandboxing runtimes) nor outdated: it is the state of the art in that area of Linux sandboxing despite its shortcomings, one of which was mentioned in the post you're replying to.

clone3() + seccomp = bad

Posted Aug 26, 2025 17:58 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (3 responses)

eBPF LSMs come to mind. It's far better to enforce policy in the kernel at the layer of objects and operations than at the interface to the kernel at the layer of registers and function parameters.

clone3() + seccomp = bad

Posted Aug 26, 2025 18:21 UTC (Tue) by Lionel_Debroux (subscriber, #30014) [Link] (2 responses)

Point taken. However, I'm not yet convinced about using eBPF for security purposes, given how less than stellar the security track record of the eBPF infrastructure has proven to be over the years, in terms of high-severity issues in the eBPF JIT and the eBPF verifier. In that respect it rivals the likes of unprivileged user namespaces and io_uring. Yet most of the main distros force-enable the BPF JIT, leaving sysadmins no easy way to disable it administratively.

clone3() + seccomp = bad

Posted Aug 27, 2025 18:30 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (1 responses)

Okay, but progress isn't going to wait for your sense of comfort. Plenty of people have been using eBPF for a long time. You don't get to sit there and tell people not to use clone3 because you have a feeling.

eBPF is root-only

Posted Aug 28, 2025 13:22 UTC (Thu) by DemiMarie (subscriber, #164188) [Link]

eBPF is limited to root. For many use-cases, this is a dealbreaker.

clone3() + seccomp = bad

Posted Aug 27, 2025 9:23 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

IMHO quotemstr phrased it in a rather dismissive tone, but I think the idea is roughly correct. If you want to use seccomp, then that's your problem. You don't get to impose your compatibility requirements on everyone else, even if there is no better way of doing whatever you're trying to do.

(That doesn't mean you just have to live with it, of course. You can ask the kernel folks nicely to make these things compatible, or to provide some suitable alternative, and maybe they'll do it. Or you can send them a patch yourself. What you cannot do is demand that the rest of userspace respect your requirements, just because that happens to be more convenient for you than the alternative.)

clone3() + seccomp = bad

Posted Sep 5, 2025 0:39 UTC (Fri) by cypherpunks2 (guest, #152408) [Link]

You might be thinking of seccomp mode 1 (strict mode), which is outdated and virtually never used. Seccomp mode 2 (filter mode), which is based on BPF internally, is not outdated: it is increasingly being used for security and is continuously being improved. Just recently, the Landlock LSM was created to compensate for seccomp's weaknesses (namely the difficulty of whitelisting paths).

pthreads, which is the usual place you'll find clone3() calls, uses it by default. If you blacklist clone3() so that it fails with ENOSYS, pthreads will transparently fall back to clone(), which you can filter more effectively because its flags are passed directly as a system call argument. But yes, it's not an ideal situation, since you'd lose the shadow stack.
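A minimal sketch of the deny-with-ENOSYS approach, in C, assuming a classic-BPF seccomp filter installed through prctl(2). The names deny_clone3 and install_deny_clone3 are made up for illustration, and the filter deliberately leaves out the seccomp_data.arch check that a real filter needs:

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef __NR_clone3
#define __NR_clone3 435 /* fallback for old headers; value on x86-64 and arm64 */
#endif

/* Illustrative filter: clone3 fails with ENOSYS, everything else is allowed.
 * Assumption: no seccomp_data.arch check, which a production filter must add. */
static const struct sock_filter deny_clone3[] = {
    /* A = seccomp_data.nr */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* clone3: fall through to the ERRNO return; anything else: skip it */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_clone3, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installs the filter on the calling thread. Returns 0 on success, -1 on error. */
int install_deny_clone3(void)
{
    struct sock_fprog prog = {
        .len = sizeof(deny_clone3) / sizeof(deny_clone3[0]),
        .filter = (struct sock_filter *)deny_clone3,
    };

    /* Required so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Calling install_deny_clone3() before any threads are created should be enough for the fallback to kick in, since threads and children created afterwards inherit the filter.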

clone3() + seccomp = bad

Posted Aug 26, 2025 17:34 UTC (Tue) by geofft (subscriber, #59789) [Link] (2 responses)

What's a case where you'd want to do argument filtering on clone3 and neither SECCOMP_RET_USER_NOTIF nor SECCOMP_RET_TRACE is workable?

clone3() + seccomp = bad

Posted Aug 26, 2025 20:40 UTC (Tue) by alip (subscriber, #170176) [Link] (1 responses)

You cannot filter clone3 arguments with a SECCOMP_RET_USER_NOTIF or SECCOMP_RET_TRACE handler unless you can fully emulate the clone; otherwise, the struct pointer indirection will lead to a TOCTOU race.

clone3() + seccomp = bad

Posted Aug 27, 2025 3:34 UTC (Wed) by alip (subscriber, #170176) [Link]

Take the Syd sandbox as an example. It is an unprivileged user-space sandbox that uses both SECCOMP_RET_USER_NOTIF (the default) and SECCOMP_RET_TRACE (which can be opted out of). To ensure the guarantees provided by the sandbox, sub-namespace creation must be disallowed. At this point clone3(2) becomes problematic, because its first argument, cl_args, is a pointer to a struct clone_args. If the sandbox reads this structure from sandbox process memory, checks for unsafe flags, and decides it is safe to proceed with SECCOMP_USER_NOTIF_FLAG_CONTINUE or PTRACE_CONT respectively, there is a time window in which a fellow thread (or a fellow process with process_vm_writev(2) or proc_pid_mem(5) access rights) can change the security-sensitive data in the structure before the Linux kernel reads the structure and acts on it. For more information on how easy this is to exploit, see this article.
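The window can be illustrated without any kernel involvement. The following is a deterministic C simulation, not an exploit: struct fake_clone_args, attacker_flip() and both "supervisor" functions are hypothetical stand-ins, and the attacker's write is placed by hand at the point where a racing thread would land it.

```c
#include <stdint.h>
#include <string.h>
#include <linux/sched.h> /* CLONE_NEWUSER and friends */

/* Hypothetical stand-in for the flags field of struct clone_args. */
struct fake_clone_args {
    uint64_t flags;
};

#define UNSAFE_FLAGS (CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET)

/* Supervisor policy: refuse any namespace-creating flag. */
int policy_allows(const struct fake_clone_args *a)
{
    return (a->flags & UNSAFE_FLAGS) == 0;
}

/* Stand-in for a sibling thread rewriting the struct in shared memory. */
void attacker_flip(struct fake_clone_args *shared)
{
    shared->flags |= CLONE_NEWUSER;
}

/* Check-then-continue: the supervisor inspects the memory, then the
 * "kernel" reads the same memory again. Returns 0 if allowed, -1 if
 * denied; *acted_on receives the flags the kernel would act on. */
int check_then_continue(struct fake_clone_args *shared, uint64_t *acted_on)
{
    if (!policy_allows(shared))   /* first fetch, by the supervisor */
        return -1;
    attacker_flip(shared);        /* the race window */
    *acted_on = shared->flags;    /* second fetch, by the kernel */
    return 0;
}

/* Full emulation: copy once, check the copy, act on the copy. */
int copy_then_emulate(struct fake_clone_args *shared, uint64_t *acted_on)
{
    struct fake_clone_args snap;

    memcpy(&snap, shared, sizeof(snap)); /* single fetch */
    if (!policy_allows(&snap))
        return -1;
    attacker_flip(shared);        /* too late: the snapshot is private */
    *acted_on = snap.flags;
    return 0;
}
```

Starting from harmless flags, check_then_continue() passes the policy check yet hands CLONE_NEWUSER to the "kernel", while copy_then_emulate() acts only on the values it checked.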

Therefore, there remain two options for the sandbox:

1. Deny clone3(2) with ENOSYS, signaling libc to fall back to clone(2).
2. Emulate clone3(2) completely on behalf of the sandbox process.

Option 1 is simple and therefore common practice. However, it means giving up the features that clone3(2) adds, which is an unfortunate consequence. Going forward, this situation is only going to get worse as Linux keeps adding system calls whose security-sensitive data is hidden from seccomp(2) behind pointer indirection.

Option 2 is the obvious solution to the problem. Two decades ago, when I first got interested in sandboxing, providing this option in unprivileged userspace was a pipe dream. Most unprivileged userspace sandbox tools, such as limon, mbox, nsjail, subterfugue, sydbox and systrace, used ptrace(2), which is a debugging interface that was not designed to act as a security boundary. The sandbox developer had to choose between PTRACE_SYSEMU and PTRACE_SYSCALL, and both came with a noticeable overhead. Without the selective filtering that PTRACE_O_TRACESECCOMP provides, the sandbox process had to be stopped at each and every system call boundary, once on syscall entry and once on syscall exit, which was a dealbreaker for most interested users. There was no PTRACE_O_EXITKILL either, so it was difficult to ensure that the sandbox process did not outlive the sandboxer. The ptrace(2) request PTRACE_GET_SYSCALL_INFO and the newer PTRACE_SET_SYSCALL_INFO were missing, so editing the sandbox process's system call number, return value, or arguments in a portable way was a major pain. process_vm_readv(2) and process_vm_writev(2), for reading and writing sandbox process memory, were missing too. The remaining choice was proc_pid_mem(5), but it is also designed as a debugging interface and, unlike process_vm_*, does not honour the page protections of the sandbox process, so the sandboxer could be used as a confused deputy to leak or overwrite memory regions that were otherwise inaccessible to the sandbox process. Finally, ptrace(2) is easily detected through the EPERM that a PTRACE_TRACEME request returns when the process is already being traced, so most malware would not run under the sandbox, completely defeating the purpose for malware analysts.

These were only the beginning of things to worry about. Tackling the issue of TOCTOU in userspace was simply impossible. Many helpful interfaces that allow the current Syd sandbox to (attempt to) act as a (TOCTOU-free) security boundary in unprivileged user-space today, such as pidfd_getfd(2), SECCOMP_IOCTL_NOTIF_ADDFD and openat2(2) with RESOLVE_BENEATH, RESOLVE_NO_SYMLINKS and RESOLVE_NO_MAGICLINKS, were missing. The list goes on. Therefore, it is fair to say things have improved substantially. However, there are still some loose ends, clone3(2) being one of them. Within the context of the Syd sandbox, the others are chdir(2), execve(2), and the open(2) family of system calls with the O_PATH flag. Syd cannot provide safe access to AMD GPUs either because, unlike with Nvidia GPUs, opening the /dev/kfd character device has per-process limitations that prevent the use of SECCOMP_IOCTL_NOTIF_ADDFD with it. There's a feature request on the Linux kernel bugzilla about these issues. There may be other unprivileged userspace sandboxes out there with different requirements, but it's safe to say pidfds, with their ability to duplicate sandbox process file descriptors using pidfd_getfd(2), cover most of the ground.

We need a solution to the clone3(2) filtering problem that does not hinder future system call innovation. Preventing the addition of new system calls with pointer arguments is not feasible. Adding pointer indirection support to seccomp(2) comes with its own problems with respect to security and portability. Supporting multiple co-existing architectures in seccomp(2) filters would require per-architecture information about the data structures used in system call arguments, which can vary in unexpected ways on niche architectures such as x32 or mips.

Pointer indirection is not the only problem. When filtering chdir(2) and execve(2), the lack of a reliable way to perform the action on behalf of the sandbox process prevents sandboxes that use SECCOMP_RET_USER_NOTIF and SECCOMP_RET_TRACE from emulating them safely.

gVisor's systrap platform provides a solution to this problem with SECCOMP_RET_TRAP, where the Sentry userspace kernel emulates almost everything, including the memory and process subsystems. When a process running under the gVisor sandbox calls execve(2) or mmap(2), these system calls do not reach the host Linux kernel and are handled directly by the Sentry user-space kernel. This is safe and stops attack vectors such as DirtyCoW, but it has issues of its own. All processes run in the same address space from the perspective of the Linux kernel. This means battle-tested defenses provided by the Linux kernel, such as ASLR, need to be reimplemented. It also requires the sandbox to implement the illusion of an inter-process boundary in addition to a user-space-user/user-space-kernel boundary. Providing these isolation boundaries has a noticeable performance impact. NanoVisor is a gVisor fork which weakens these boundaries to reduce the performance overhead. Portability is an issue as well, because hand-written assembly is hardly avoidable. As a result, gVisor supports the x86-64 and arm64 architectures only, whereas Syd works on x86-64, i586, x32, arm64, armv7, riscv64, loongarch64, powerpc64, powerpc64le, powerpc, s390x and s390 with multipersonality support, and outperforms gVisor in benchmarks.

In my humble opinion, a generic, workable solution to the system call filtering problem is for the Linux kernel to provide, on a best-effort basis, pidfd-based process-guided versions of the problematic system calls, such as process_clone3, process_map_shadow_stack, process_chdir, and process_execve. The precedents for this are the system calls pidfd_getfd(2), pidfd_send_signal(2) and process_mrelease(2).
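For reference, the pidfd_getfd(2) precedent looks roughly like this from the supervisor's side. The proposed process_* calls do not exist, so nothing here attempts to sketch them. grab_fd_from is a hypothetical helper name, and raw syscall(2) is used on the assumption that libc may lack wrappers:

```c
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for old headers; values on x86-64 and arm64. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/* Duplicate file descriptor remote_fd of process pid into the caller.
 * The caller needs ptrace-attach permission over the target.
 * Returns the new descriptor, or -1 with errno set. */
int grab_fd_from(pid_t pid, int remote_fd)
{
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    int fd = (int)syscall(SYS_pidfd_getfd, pidfd, remote_fd, 0);
    int saved_errno = errno;

    close(pidfd);          /* the pidfd is only needed for the lookup */
    errno = saved_errno;   /* keep the error from pidfd_getfd, if any */
    return fd;
}
```

The point of the pattern is that the supervisor names the target by pidfd and the kernel performs the operation on its behalf, which is the shape the process_* proposals would extend to clone3, chdir and execve.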

clone3() + seccomp = bad

Posted Aug 27, 2025 3:55 UTC (Wed) by cyphar (subscriber, #110703) [Link]

I presented a fairly minimal solution for this that doesn't require seccomp to use eBPF at last year's LPC [1]. Unfortunately, I haven't had time to finish this work yet -- I'm hoping to get some time in the coming months.

[1]: https://www.youtube.com/watch?v=CHpLLR0CwSw

clone3() + seccomp = bad

Posted Sep 5, 2025 0:29 UTC (Fri) by cypherpunks2 (guest, #152408) [Link]

You can filter it, in a way. If you have the struct in rodata and use seccomp to limit access to that memory range so that the permissions cannot be changed (such as with mprotect(), or a second VMA created over it using mmap() or something), you can whitelist the address and call it a day.

This is how the Tor daemon's sandbox is able to whitelist things like paths that are passed to open().

It's a bit of a hack, but it works in a pinch.
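A rough C sketch of the address-whitelisting trick, assuming a classic-BPF filter on a little-endian machine. build_clone3_ptr_filter is a made-up name, the arch check is omitted, and this is not taken from the Tor source. Since classic BPF compares 32 bits at a time, the 64-bit pointer in args[0] is checked in two halves:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

#ifndef __NR_clone3
#define __NR_clone3 435 /* fallback for old headers; value on x86-64 and arm64 */
#endif

/* Offsets of the two 32-bit halves of args[0]; assumes little-endian. */
#define ARG0_LO (offsetof(struct seccomp_data, args[0]))
#define ARG0_HI (offsetof(struct seccomp_data, args[0]) + sizeof(uint32_t))

/* Writes a filter into f (room for at least 9 instructions) that allows
 * clone3 only when its first argument equals `allowed`, and allows every
 * other system call. Returns the number of instructions written. */
size_t build_clone3_ptr_filter(struct sock_filter *f, const void *allowed)
{
    uint64_t addr = (uint64_t)(uintptr_t)allowed;
    struct sock_filter prog[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* clone3: jump over the ALLOW below; anything else: take it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_clone3, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        /* low half of the pointer must match, else jump to EPERM */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_LO),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)addr, 0, 3),
        /* high half of the pointer must match, else jump to EPERM */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_HI),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)(addr >> 32), 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
    };

    memcpy(f, prog, sizeof(prog));
    return sizeof(prog) / sizeof(prog[0]);
}
```

The filter only pins the pointer value. As noted above, it is only meaningful if the struct lives in read-only memory and the same filter also stops the process from changing that mapping, for example through mprotect() or mmap() over the range.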