Protocol ossification
Posted Jan 24, 2025 7:16 UTC (Fri) by mb (subscriber, #50428)
In reply to: Protocol ossification by cesarb
Parent article: The trouble with the new uretprobes
Running seccomp with a deny list is just not going to work, given the current number of syscalls and the constant influx of new ones.
This is a special case, because things break without the application having been updated. Seccomp filter updates that come with application updates are expected. Having to update the seccomp filter (which means updating the app) because of a kernel update is unexpected.
Posted Jan 24, 2025 9:21 UTC (Fri)
by taladar (subscriber, #68407)
[Link] (4 responses)
Posted Jan 24, 2025 14:12 UTC (Fri)
by cesarb (subscriber, #6266)
[Link] (3 responses)
I'm not one of the designers of seccomp, but my reading of the original intention is that seccomp was supposed to be used by the final user space component (that is how it's used, for instance, in the Firefox/Chrome sandboxes). When that component gets updated, the seccomp policy naturally gets updated at the same time. Separate dynamic libraries like glibc add an extra complication, and it's not always possible to avoid these libraries by doing system calls directly, but it's still the component being "protected" by seccomp that's defining the seccomp policy.
Using seccomp to sandbox third-party components (like Docker does) is where it becomes more problematic, since the sandboxing code has no idea which system calls the third-party component requires, and it's easy for it to stay in an old version while both the kernel and the sandboxed code get updated. A light denylist-only use of seccomp would cause few issues (for instance, there's no need for a sandboxed process to call the reboot system call), but for security paranoia reasons their default policy blocks too much.
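To make the difference concrete, here is a toy model of the two policy styles. It is only a sketch: real seccomp filters are BPF programs that match syscall numbers, and the helper names and the specific syscall sets below are assumptions made up for illustration, not Docker's actual policy.

```python
# Toy model only: real seccomp filters are BPF programs matching syscall
# numbers, not Python sets of names.
ALLOW, DENY = "allow", "deny"


def allowlist_policy(allowed):
    """Default-deny: any syscall that is not listed is blocked."""
    return lambda syscall: ALLOW if syscall in allowed else DENY


def denylist_policy(denied):
    """Default-allow: only the listed syscalls are blocked."""
    return lambda syscall: DENY if syscall in denied else ALLOW


# Hypothetical policies, written before the kernel gained uretprobe.
strict = allowlist_policy({"read", "write", "exit_group", "rt_sigreturn"})
light = denylist_policy({"reboot", "kexec_load"})

# After a kernel update the sandboxed process starts issuing a syscall
# that neither policy author had ever heard of.
for name in ("read", "reboot", "uretprobe"):
    print(f"{name}: allowlist={strict(name)} denylist={light(name)}")
```

The stale allowlist blocks the unknown syscall while the light denylist lets it through, which is the trade-off being argued about here: the denylist survives kernel updates but offers no protection against syscalls its author never considered.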
Posted Jan 25, 2025 11:40 UTC (Sat)
by fw (subscriber, #26023)
[Link]
With browser sandboxes, the desire is to disable everything that the browser does not need, usually after the process has been created (and not outside the process, as with containers). There is in-process emulation/checking of some system calls via SIGSYS. And yet, browsers link in lots of system components, and their system call requirements keep changing. This is largely invisible because of early testing in distributions like Fedora rawhide. Incompatible things are temporarily reverted until browsers catch up.
Posted Jan 25, 2025 13:59 UTC (Sat)
by mokki (subscriber, #33200)
[Link] (1 responses)
Posted Jan 25, 2025 16:47 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link]
Posted Jan 24, 2025 16:24 UTC (Fri)
by hmh (subscriber, #3838)
[Link] (2 responses)
IMO, what you described *is* very much an ossification danger from the PoV of the kernel. Kernel updates (and not only the kernel, there are also library updates of stuff like the libc and kernel-interface libraries, and no-libc runtimes like golang's) need to be able to change the syscalls they use, if ossification is to be avoided.
I do agree that changes that would require seccomp updates on *stable* kernels (and minor/patch/stable-train updates of libraries and language runtimes) should be both a do-it-only-as-a-last-resort thing *and* very explicitly documented.
Of course, there is no ossification issue with the original "restricted compute worker process" seccomp mode, because the hard-coded policy for that mode is in the kernel and it will be kept in sync with any changes to the syscalls.
Posted Jan 24, 2025 16:36 UTC (Fri)
by hmh (subscriber, #3838)
[Link] (1 responses)
So, IMHO, if one wants to restrict uretprobes to an application (or even system-wide), it should be done in some other higher abstraction level that deals with uretprobes itself, or ptracing and the like.
Posted Jan 24, 2025 20:45 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
Actually, I think it would probably be helpful to have multiple categories for syscalls of different types based on what they can do and whether they can affect things outside of the process. A seccomp BPF filter could, hypothetically, receive a bitmask indicating the properties of a given syscall, such as:
* Whether it is considered part of the kernel's userspace API. Yes for most syscalls, no for uretprobe. Denying syscalls where this flag is not set would be considered poor practice, and might result in compatibility issues on the next kernel release, but you can still do it if you really want to.
* Whether it requires at least one capability to call with the specified arguments, for any reason other than filesystem permissions (because that would require looking them up, which slows things down a ton, and is also inherently racy).
* Whether it would require a filesystem permission check, if executed by a non-privileged process (i.e. the flag is still set if you are root). Does not indicate what the result of that permission check would have been, only that it would have been done.
* Whether it can read/modify state, and separate flags for the kind of state it reads or modifies. dup2 would be flagged differently from write, because dup2 modifies the process's file descriptor table, while write can modify the filesystem or do IPC (pipes etc.). (No, procfs is not "the filesystem" for the purposes of this discussion.)
The BPF filter could then use those flags to make an informed choice about unrecognized syscalls, and you could even pass some kind of mask to seccomp/prctl to indicate which syscalls you want to filter in the first place.
I don't know how backwards compatible this is, or whether there is the will to implement something this complicated. But it would be nice to have.
This has been a problem for 10+ years and still no solution from docker developers on this.
Does podman handle this better?
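The per-syscall flag bitmask proposed in NYKevin's comment could be modeled roughly as follows. This is a sketch of a hypothetical interface: no such kernel facility exists, and the flag names, the `decide` helper, and the made-up syscall `hypothetical_new_mount` are all assumptions for illustration. Only "uretprobe is not part of the userspace API" is taken from the discussion itself.

```python
from enum import IntFlag


class SyscallFlags(IntFlag):
    # Hypothetical flag names; the kernel exposes nothing like this today.
    STABLE_UAPI = 1              # considered part of the kernel's userspace API
    NEEDS_CAPABILITY = 2         # requires at least one capability
    FS_PERMISSION_CHECK = 4      # would trigger a filesystem permission check
    MODIFIES_PROCESS_STATE = 8   # e.g. dup2 (file descriptor table)
    MODIFIES_EXTERNAL_STATE = 16 # e.g. write (filesystem, pipes)


# Syscalls the filter author explicitly knew about and allowed.
KNOWN_ALLOWED = {"dup2", "write"}


def decide(name, flags):
    """Return "allow" or "deny" for a syscall, using flags for unknown ones."""
    if name in KNOWN_ALLOWED:
        return "allow"
    # Not part of the userspace API: denying it risks breaking on the next
    # kernel release, so this sketch lets it through.
    if not flags & SyscallFlags.STABLE_UAPI:
        return "allow"
    # Unknown syscall that is privileged or reaches outside the process.
    if flags & (SyscallFlags.NEEDS_CAPABILITY | SyscallFlags.MODIFIES_EXTERNAL_STATE):
        return "deny"
    return "allow"


print(decide("uretprobe", SyscallFlags(0)))
print(decide(
    "hypothetical_new_mount",
    SyscallFlags.STABLE_UAPI | SyscallFlags.NEEDS_CAPABILITY
    | SyscallFlags.MODIFIES_EXTERNAL_STATE,
))
```

The point of the design is that the filter no longer needs an entry for every syscall: a policy written today can still make a reasonable decision about a syscall added later, based on what the kernel says that syscall is able to do.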