Protocol ossification

Posted Jan 24, 2025 7:16 UTC (Fri) by mb (subscriber, #50428)
In reply to: Protocol ossification by cesarb
Parent article: The trouble with the new uretprobes

I do not agree that this is an abuse of seccomp. It is the intended use case. The first version of seccomp actually implemented exactly this: just allow a handful of kernel-decided, hardcoded syscalls.

Running seccomp with a deny list is just not going to work, given the current number of syscalls and the constant influx of new ones.

This is a special case, because it breaks without updating the application. Seccomp filter updates on application updates are expected. On kernel updates the requirement of seccomp filter updates (that would mean app update) is unexpected.
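A minimal sketch of the difference, as a toy Python model rather than real seccomp BPF. The syscall sets and the `evaluate` helper are hypothetical names chosen for illustration; only `uretprobe` as the newly appearing syscall comes from the parent article.

```python
# Toy model of seccomp policy evaluation; not real BPF, all names are illustrative.
ALLOW, DENY = "allow", "deny"

def evaluate(policy_kind, listed, syscall):
    """Return the verdict of an allowlist or denylist policy for one syscall."""
    if policy_kind == "allowlist":
        return ALLOW if syscall in listed else DENY
    if policy_kind == "denylist":
        return DENY if syscall in listed else ALLOW
    raise ValueError(f"unknown policy kind: {policy_kind}")

# Policies written when the application was built (hypothetical contents).
app_allowlist = {"read", "write", "openat", "close", "exit_group"}
app_denylist = {"reboot", "kexec_load"}

# After a kernel update, a syscall the policy author never saw shows up.
new_syscall = "uretprobe"

print(evaluate("allowlist", app_allowlist, new_syscall))  # deny
print(evaluate("denylist", app_denylist, new_syscall))    # allow
```

The allowlist fails closed on anything unknown, which is what breaks here without an application update; the denylist fails open, which is why it cannot keep up with new syscalls.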

Protocol ossification

Posted Jan 24, 2025 9:21 UTC (Fri) by taladar (subscriber, #68407)

I would argue that it is an abuse of seccomp because seccomp was not intended to be used with a fixed allow list while both components in user space and kernels get updated. It is similar to using a firewall with deep packet inspection from the HTTP/1.0 days on today's HTTP/2 connections and complaining that things break, while refusing to assign the blame to the one outdated component in the system.

Protocol ossification

Posted Jan 24, 2025 14:12 UTC (Fri) by cesarb (subscriber, #6266)

> I would argue that it is an abuse of seccomp because seccomp was not intended to be used with a fixed allow list while both components in user space and kernels get updated.

I'm not one of the designers of seccomp, but my reading of the original intention is that seccomp was supposed to be used by the final user space component (that is how it's used, for instance, in the Firefox/Chrome sandboxes). When that component gets updated, the seccomp policy naturally gets updated at the same time. Separate dynamic libraries like glibc add an extra complication, and it's not always possible to avoid these libraries by doing system calls directly, but it's still the component being "protected" by seccomp that's defining the seccomp policy.

Using seccomp to sandbox third-party components (like Docker does) is where it becomes more problematic, since the sandboxing code has no idea which system calls the third-party component requires, and it's easy for it to stay in an old version while both the kernel and the sandboxed code get updated. A light denylist-only use of seccomp would cause few issues (for instance, there's no need for a sandboxed process to call the reboot system call), but for security paranoia reasons their default policy blocks too much.

Protocol ossification

Posted Jan 25, 2025 11:40 UTC (Sat) by fw (subscriber, #26023)

However, as things have turned out, the modern container approach (with ENOSYS) works more reliably than what the browser sandboxes are doing. Software (whether it runs in containers or not) usually supports older kernels as well, so a blocked system call that fails with ENOSYS hits an existing run-time dispatch. At worst there is a performance penalty, or some fixed vulnerabilities and reliability issues come back.
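A rough sketch of that run-time dispatch, using simulated calls rather than real system calls. The helpers `call_with_fallback`, `blocked_with` and `legacy_impl` are hypothetical stand-ins; the EPERM case is included only for contrast and is not something this comment discusses.

```python
import errno

def call_with_fallback(new_impl, old_impl, *args):
    """Try the newer call; fall back only if it reports ENOSYS."""
    try:
        return new_impl(*args)
    except OSError as exc:
        if exc.errno == errno.ENOSYS:
            return old_impl(*args)
        raise

def blocked_with(err):
    """Hypothetical stand-in for a new syscall rejected by a seccomp filter."""
    def impl(*args):
        raise OSError(err, "blocked by filter (simulated)")
    return impl

def legacy_impl(path):
    """Hypothetical stand-in for the older syscall that the filter allows."""
    return f"legacy:{path}"

# Filter answers ENOSYS: the caller behaves as it would on an old kernel.
print(call_with_fallback(blocked_with(errno.ENOSYS), legacy_impl, "/tmp/x"))

# Filter answers EPERM: the fallback never runs and the error escapes.
try:
    call_with_fallback(blocked_with(errno.EPERM), legacy_impl, "/tmp/x")
except OSError as exc:
    print("failed with", errno.errorcode[exc.errno])
```

The point is that ENOSYS is an answer the software already knows how to handle, because it is the same answer an older kernel would give.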

With browser sandboxes, the desire is to disable everything that the browser does not need, usually after the process has been created (and not outside the process, as with containers). There is in-process emulation/checking of some system calls via SIGSYS. And yet, browsers link in lots of system components, and their system call requirements keep changing. This is largely invisible because of early testing in distributions like Fedora rawhide. Incompatible things are temporarily reverted until browsers catch up.

Protocol ossification

Posted Jan 25, 2025 13:59 UTC (Sat) by mokki (subscriber, #33200)

Why can't Docker have a dynamic allowlist to which users can add new syscalls without requiring a new Docker version?
This has been a problem for 10+ years, and there is still no solution from the Docker developers.
Does podman handle this better?

Protocol ossification

Posted Jan 25, 2025 16:47 UTC (Sat) by mathstuf (subscriber, #69389)

I see this (experimental) flag in podman's docs: https://docs.podman.io/en/v4.6.1/markdown/options/seccomp...
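For reference, Docker also accepts a user-supplied seccomp profile as a JSON file through `--security-opt seccomp=<file>`, so a profile can in principle be extended without waiting for a new release. A rough sketch in Python of adding names to such a profile; the `add_allowed_syscalls` helper and the trimmed-down profile are hypothetical, and the field names reflect my understanding of the profile format, so check them against the Docker documentation before relying on this.

```python
import json

def add_allowed_syscalls(profile, names):
    """Return a copy of a seccomp profile with extra syscalls allowed."""
    updated = json.loads(json.dumps(profile))  # deep copy via JSON round trip
    updated.setdefault("syscalls", []).append(
        {"names": sorted(names), "action": "SCMP_ACT_ALLOW"}
    )
    return updated

# Heavily trimmed, illustrative profile; a real default profile is much longer.
base_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
    ],
}

# Assumption: the seccomp library used by the runtime must already know the
# syscall name for this entry to have any effect.
extended = add_allowed_syscalls(base_profile, {"uretprobe"})
print(json.dumps(extended, indent=2))
```

This still leaves the user maintaining a copy of the whole profile, which is not quite the dynamic allowlist asked about above.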

Protocol ossification

Posted Jan 24, 2025 16:24 UTC (Fri) by hmh (subscriber, #3838)

"This is a special case, because it breaks without updating the application. Seccomp filter updates on application updates are expected. On kernel updates the requirement of seccomp filter updates (that would mean app update) is unexpected."

IMO, what you described *is* very much an ossification danger from the PoV of the kernel. Kernel updates (and not only the kernel, there are also library updates of stuff like the libc and kernel-interface libraries, and no-libc runtimes like golang's) need to be able to change the syscalls they use, if ossification is to be avoided.

I do agree that changes that would require seccomp updates on *stable* kernels (and minor/patch/stable-train updates of libraries and language runtimes) should be both a do-it-only-as-a-last-resort thing *and* very explicitly documented.

Of course, there is no ossification issue with the original "restricted compute worker process" seccomp mode, because the hard-coded policy for that mode is in the kernel and it will be kept in sync with any changes to the syscalls.

Protocol ossification

Posted Jan 24, 2025 16:36 UTC (Fri) by hmh (subscriber, #3838)

(It is a pity one cannot edit posts). A clarification: on this *specific* case of the restricted uretprobe syscall, I agree with others that it should just be hardcoded-allow-listed by seccomp (i.e. "invisible" to seccomp). And my opinion is heavily based on the very specific detail that this syscall cannot be called by general code without triggering a SIGILL: this specific syscall is just an internal implementation detail of uretprobes.

So, IMHO, if one wants to restrict uretprobes to an application (or even system-wide), it should be done in some other higher abstraction level that deals with uretprobes itself, or ptracing and the like.

Protocol ossification

Posted Jan 24, 2025 20:45 UTC (Fri) by NYKevin (subscriber, #129325)

IMHO this is not unreasonable, but it would be helpful to make an explicit category for syscalls of this nature, with documentation and possibly new seccomp flags for dealing with it.

Actually, I think it would probably be helpful to have multiple categories for syscalls of different types based on what they can do and whether they can affect things outside of the process. A seccomp BPF filter could, hypothetically, receive a bitmask indicating the properties of a given syscall, such as:

* Whether it can read/modify state, and separate flags for the kind of state it reads or modifies. dup2 would be flagged differently from write, because dup2 modifies the process's file descriptor table, while write can modify the filesystem or do IPC (pipes etc.). (No, procfs is not "the filesystem" for the purposes of this discussion.)
* Whether it is considered part of the kernel's userspace API. Yes for most syscalls, no for uretprobe. Denying syscalls where this flag is not set would be considered poor practice, and might result in compatibility issues on the next kernel release, but you can still do it if you really want to.
* Whether it requires at least one capability to call with the specified arguments, for any reason other than filesystem permissions (because that would require looking them up, which slows things down a ton, and is also inherently racy).
* Whether it would require a filesystem permission check, if executed by a non-privileged process (i.e. the flag is still set if you are root). Does not indicate what the result of that permission check would have been, only that it would have been done.

The BPF filter could then use those flags to make an informed choice about unrecognized syscalls, and you could even pass some kind of mask to seccomp/prctl to indicate which syscalls you want to filter in the first place.
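To make the idea concrete, here is a toy Python model of such a filter. Everything in it is hypothetical: no such bitmask exists in seccomp today, and the flag names, the `decide` function and the example syscall names are invented for illustration.

```python
from enum import IntFlag

class SyscallProps(IntFlag):
    """Hypothetical per-syscall property bits; nothing like this exists today."""
    MODIFIES_PROCESS_STATE = 1
    MODIFIES_EXTERNAL_STATE = 2
    USERSPACE_API = 4
    NEEDS_CAPABILITY = 8
    NEEDS_FS_PERMISSION_CHECK = 16

# Syscalls the policy author knew about and explicitly allowed.
KNOWN_ALLOWED = {"read", "write", "dup2"}

def decide(name, props):
    """Pick a verdict, using the property bits for unrecognized syscalls."""
    if name in KNOWN_ALLOWED:
        return "allow"
    # Not part of the userspace API (the uretprobe case): let it through.
    if not props & SyscallProps.USERSPACE_API:
        return "allow"
    # One possible policy choice: unknown syscalls that need a capability
    # or reach outside the process are denied.
    if props & (SyscallProps.NEEDS_CAPABILITY | SyscallProps.MODIFIES_EXTERNAL_STATE):
        return "deny"
    return "allow"

print(decide("uretprobe", SyscallProps(0)))
print(decide(
    "some_new_mount_call",
    SyscallProps.USERSPACE_API
    | SyscallProps.NEEDS_CAPABILITY
    | SyscallProps.MODIFIES_EXTERNAL_STATE,
))
print(decide(
    "some_new_fd_call",
    SyscallProps.USERSPACE_API | SyscallProps.MODIFIES_PROCESS_STATE,
))
```

With these made-up inputs the three calls come out as allow, deny and allow, so a filter written before any of these syscalls existed would still behave sensibly.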

I don't know how backwards compatible this is, or whether there is the will to implement something this complicated. But it would be nice to have.