
The trouble with the new uretprobes

By Jonathan Corbet
January 23, 2025
A "uretprobe" is a dynamic, user-space tracepoint injected by the kernel into a running process; this document tersely describes their use. Among other things, uretprobes are used by the perf utility to time function calls. The 6.11 kernel saw a significant change to uretprobes that improved their performance, but that change is also creating trouble for some users. The best way to solve the problem is not entirely clear.

Specifically, a uretprobe exists to gain information at the return from a function in the process of interest. Older kernels implemented uretprobes by injecting code that, on entry to a function, changed the return address to a special trampoline that, in turn, contained a breakpoint trap instruction. When the target process executed that instruction, it would trap back into the kernel, which would then extract the information of interest (such as the function's return value) and run any other attached code (a BPF program, perhaps) before allowing the process to resume. This method worked, but it also had a noticeable performance impact on the probed process.

In an attempt to improve uretprobe performance, Jiri Olsa put together a patch set that changed the implementation on x86 systems. The return trampoline still exists but, rather than triggering a trap, it just calls the new uretprobe() system call, which then takes care of all of the associated work. Since system-call handling is faster than taking a trap, the cost to the probed process is lower when uretprobe() is used. This new system call takes no arguments, and it can only be called from the kernel-injected special trampoline; otherwise it will just deliver a SIGILL signal to the calling process.
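
That restriction is easy to demonstrate. The sketch below (my illustration, not code from the patch set) assumes a 6.11-or-later kernel whose headers define __NR_uretprobe; since the call is made from ordinary code rather than from the kernel-injected trampoline, the expected result is that the process dies with SIGILL instead of getting a return value.

    /*
     * Illustration only (not from the patch set): invoke uretprobe() from
     * ordinary code.  Per the behavior described above, the kernel is
     * expected to kill this process with SIGILL, because the call is not
     * coming from the kernel-injected trampoline.
     */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
    #ifdef __NR_uretprobe
        long ret = syscall(__NR_uretprobe);     /* the call takes no arguments */

        /* Normally never reached: SIGILL terminates the process first. */
        printf("uretprobe() returned %ld\n", ret);
    #else
        puts("these kernel headers do not define __NR_uretprobe");
    #endif
        return 0;
    }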

Arguably, all system calls are special, but this one takes "special" to a whole new level. It is not something that a process can just call to obtain a useful service from the kernel. uretprobe() is thus unlikely to be on anybody's list of "five new system calls that every programmer should know". It does, however, succeed in accelerating uretprobes by as much as about 30%. This change went into the 6.11 release, seemingly without ill effect.

On January 10, though, Eyal Birger reported an ill effect; kernels that implement uretprobe() were causing Docker containers to crash. The problem is that Docker uses seccomp() to impose a policy on which system calls a containerized system can invoke. The policy used by Docker is, as standard practice would suggest, a default-deny arrangement; if a given system call has not been explicitly enabled, it will be blocked. uretprobe(), not being on any Docker developer's list of new exciting system calls, is not found in the allowlist. As a result, the injection of a uretprobe into a process running under Docker will result in that process's untimely and mysterious death. Docker users will, indeed, no longer notice a performance hit on a traced process, but they are unlikely to express their gratitude for that.
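
To make the failure mode concrete, here is a minimal sketch of a default-deny policy in the style Docker applies, written with libseccomp rather than Docker's actual JSON profile (so the function calls below are libseccomp's, and the allowlist is far shorter than a real one): any system call not explicitly listed, including one that did not exist when the list was written, fails with EPERM.

    /*
     * Minimal sketch of a default-deny seccomp policy, in the spirit of what
     * Docker installs, using libseccomp (build with -lseccomp).  This is an
     * illustration, not Docker's code or its real (much longer) allowlist.
     */
    #include <errno.h>
    #include <seccomp.h>
    #include <unistd.h>

    int main(void)
    {
        /* Anything not explicitly allowed below fails with EPERM. */
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
        if (ctx == NULL)
            return 1;

        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

        if (seccomp_load(ctx) != 0)
            return 1;
        seccomp_release(ctx);   /* frees userspace state; the filter stays */

        /* write() still works... */
        write(1, "write() allowed\n", 16);

        /* ...but any call missing from the list -- getpid() here, or a
         * kernel-injected uretprobe() -- now fails with EPERM. */
        if (getpid() < 0 && errno == EPERM)
            write(1, "getpid() blocked\n", 17);
        return 0;
    }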

Various possibilities for fixing the problem were suggested at the time. Olsa put together a quick patch to detect a failure to execute uretprobe() and fall back to the old implementation in that case. He also considered simply disabling the uretprobe() implementation entirely when seccomp() is in use, or adding a sysctl knob to control whether uretprobe() is used. Birger, though, disliked the sysctl idea, saying: "'Give me speed but potentially crash processes I don't control' is a curious semantic".

Oleg Nesterov, instead, suggested patching seccomp() to simply ignore calls to uretprobe(), making that system call even more special. Birger returned with a patch implementing Nesterov's suggestion. Kees Cook, however, questioned this change, wondering why uretprobe() needs to be so special. Docker, he pointed out, already handles other weird system calls like sigreturn(); it should be able to do the same with uretprobe():

Basically, this is a Docker issue, not a kernel issue. Seccomp is behaving correctly. I don't want to start making syscalls invisible without an extremely good reason.

Birger responded that this case is indeed different:

I think the difference is that this syscall is not part of the process's code - it is inserted there by another process tracing it. So this is different than desiring to deploy a new version of a binary that uses a new libc or a new syscall.

That reasoning just hardened Cook's position, though. A process might want to defend against injection of uretprobe() by blocking it with seccomp(), he said. The whole point of seccomp() is to implement a policy given to it, he added; there should not be a separate policy within seccomp() itself.

This reasoning, sound as it may be, does little to solve Birger's problem, which, he said, is simply:

The problem we're facing is that existing workloads are breaking, and as mentioned I'm not sure how practical it is to demand replacing a working docker environment because of a new syscall that was added for performance reasons.

This replacement, he said, would not be easy to accomplish. He concluded by wondering if the right solution might be to just revert the uretprobe() change. Olsa said, again, that it might be better to introduce a new sysctl knob to control whether uretprobe() is used, but Cook answered that reverting may be the best choice, at least for now, while some thought goes into how this implementation should interact with features like seccomp(). Olsa then suggested that the best solution might be to disable uretprobe() temporarily, without removing it from the kernel, until Docker can be updated to handle it correctly. Birger went off to consider that idea.

This approach may lead to a solution for this specific problem, though it could take years before enough Docker installations have been updated to make re-enabling uretprobe() safe. But we will be seeing this problem again. Running systems within a sandbox that denies all system calls that have not been explicitly enabled may well be good for security, but that practice will run into trouble when the kernel underlying the whole system routinely adds new system calls. Beyond uretprobe(), the x86 architecture saw the addition of nine new system calls in 2024: setxattrat(), getxattrat(), listxattrat(), removexattrat(), mseal(), map_shadow_stack(), lsm_get_self_attr(), lsm_set_self_attr(), and lsm_list_modules(). There is no reason to believe that the addition of system calls will stop now.

Something will have to give; in this case, that something would appear to be the new uretprobe implementation. But it is hard to imagine that the development community will be pleased at the idea that it cannot add new system calls lest existing Docker implementations break. Perhaps there will never be another system call as special as uretprobe(), with its ability to break systems with just a kernel change but, as Cook pointed out, there have been cases where the addition of a more "normal" system call has caused similar crashes. In summary, it would be surprising if the combination of "don't allow anything new" and "add lots of new things" didn't explode every now and then.

Index entries for this article
Kernel: Security/seccomp
Kernel: System calls



Surely default-deny should fail the syscall with ENOSYS?

Posted Jan 24, 2025 0:38 UTC (Fri) by smcv (subscriber, #53363) [Link]

It's appropriate that a container runtime that wants to be a security boundary has a default-deny policy, but if it doesn't recognise a syscall, it should do the same thing the kernel does: make the syscall fail with ENOSYS, indistinguishable (except possibly by timing) from the effect of running on an older kernel that has no support for that syscall. That way, for ordinary syscalls (that are less odd than this one), well-behaved user-space can fall back to a less optimal implementation, for example emulating splice with read and write.
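
The fallback pattern being described might look roughly like the following sketch (a hand-rolled illustration, not code from any particular project): try the newer call and only take the slow path when the kernel, or a well-behaved sandbox, answers ENOSYS. An EPERM from a default-deny profile defeats exactly this logic.

    /*
     * Sketch of the ENOSYS-fallback pattern described above (a hand-rolled
     * example, not code from any real project).  splice() needs one of the
     * two descriptors to be a pipe; on ENOSYS we behave as if it had never
     * been added to the kernel and copy with plain read()/write() instead.
     */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    static ssize_t copy_chunk(int in_fd, int out_fd, size_t len)
    {
        ssize_t n = splice(in_fd, NULL, out_fd, NULL, len, 0);
        if (n >= 0 || errno != ENOSYS)
            return n;            /* it worked, or failed for a "real" reason */

        /* Fallback path, same as running on a pre-splice() kernel. */
        char buf[4096];
        size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t r = read(in_fd, buf, chunk);
        if (r <= 0)
            return r;
        return write(out_fd, buf, r);
    }

    int main(void)
    {
        /* Copy standard input to standard output, chunk by chunk. */
        while (copy_chunk(0, 1, 65536) > 0)
            ;
        return 0;
    }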

I thought Docker had addressed this already, last time newer container payloads were broken by Docker default-denying a newly-added syscall with EPERM and glibc not expecting that result?

Protocol ossification

Posted Jan 24, 2025 2:53 UTC (Fri) by cesarb (subscriber, #6266) [Link] (15 responses)

There's a term for that situation: ossification. It's the same shit we have to deal with when designing new network protocols, or extending existing ones: some middleware is going to assume everything it doesn't know (or even something it does know but misinterprets) is malicious and must be immediately discarded, rejected, or corrupted, sometimes also temporarily or permanently blocking the sender or the receiver, just in case.

We are seeing the exact same dynamics with abusers of seccomp. New network protocols (io_uring) get blocked. Variations on existing network protocols (clone3) get corrupted (returning an invalid error code, as if it existed but failed) or blocked. Now we're seeing a new network protocol (uretprobe) being abandoned just because a firewall somewhere (seccomp) misbehaves. I wonder how long until we start playing games with system calls (pretending to be something else, like how the TLS 1.3 new connection handshake pretends to be a TLS 1.2 session resume; or doing GREASE with system call parameters, to ensure they are kept available for future use). Perhaps something like doing a write() with a bogus fd (above the runtime limit), and hoping Docker's seccomp policy doesn't look too closely at the fd number?
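
For what that last idea might look like, here is a purely hypothetical GREASE-style probe (nothing does this today, and it is not a recommendation): push an absurd descriptor number through write() and check whether the kernel gets to reject it with its normal EBADF, or whether a filter intercepts it first.

    /*
     * Purely hypothetical GREASE-style probe, as mused about above; an
     * illustration, not a recommendation.  A deliberately bogus descriptor
     * should reach the kernel and fail with EBADF; any other errno (or a
     * SIGSYS/kill) would betray a seccomp filter sitting in the path.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        errno = 0;
        ssize_t n = write(1 << 20, "", 0);   /* fd far above any real limit */

        if (n < 0 && errno == EBADF)
            puts("write() reached the kernel: EBADF, as expected");
        else
            printf("unexpected result: n=%zd errno=%d (filtered?)\n", n, errno);
        return 0;
    }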

Protocol ossification

Posted Jan 24, 2025 7:16 UTC (Fri) by mb (subscriber, #50428) [Link] (8 responses)

I do not agree that this is an abuse of seccomp. It is the intended use case. The first version of seccomp actually implemented exactly this: just allow a handful of kernel-decided, hardcoded syscalls.

Running seccomp with a deny list is just not going to work, given the current number of syscalls and the constant influx of new ones.

This is a special case, because it breaks without the application being updated. Updating the seccomp filter when the application is updated is expected; having to update the filter (which would mean an application update) because of a kernel update is not.

Protocol ossification

Posted Jan 24, 2025 9:21 UTC (Fri) by taladar (subscriber, #68407) [Link] (4 responses)

I would argue that it is an abuse of seccomp because seccomp was not intended to be used with a fixed allow list while both components in user space and kernels get updated. It is similar to using a firewall with deep packet inspection from the HTTP 1.0 days on today's HTTP 2.0 connections and complaining that things break while refusing to assign the blame to the one outdated component in the system.

Protocol ossification

Posted Jan 24, 2025 14:12 UTC (Fri) by cesarb (subscriber, #6266) [Link] (3 responses)

> I would argue that it is an abuse of seccomp because seccomp was not intended to be used with a fixed allow list while both components in user space and kernels get updated.

I'm not one of the designers of seccomp, but my reading of the original intention is that seccomp was supposed to be used by the final user space component (that is how it's used, for instance, in the Firefox/Chrome sandboxes). When that component gets updated, the seccomp policy naturally gets updated at the same time. Separate dynamic libraries like glibc add an extra complication, and it's not always possible to avoid these libraries by doing system calls directly, but it's still the component being "protected" by seccomp that's defining the seccomp policy.

Using seccomp to sandbox third-party components (like Docker does) is where it becomes more problematic, since the sandboxing code has no idea which system calls the third-party component requires, and it's easy for it to stay in an old version while both the kernel and the sandboxed code get updated. A light denylist-only use of seccomp would cause few issues (for instance, there's no need for a sandboxed process to call the reboot system call), but for security paranoia reasons their default policy blocks too much.

Protocol ossification

Posted Jan 25, 2025 11:40 UTC (Sat) by fw (subscriber, #26023) [Link]

However, as things have turned out, the modern container approach (with ENOSYS) works more reliably than what the browser sandboxes are doing: software (whether it runs in containers or not) usually supports older kernels as well, so the run-time dispatch for a fallback already exists; at worst there is a performance penalty, or some previously fixed vulnerabilities and reliability issues come back.

With browser sandboxes, the desire is to disable everything that the browser does not need, usually after the process has been created (and not outside the process, as with containers). There is in-process emulation/checking of some system calls via SIGSYS. And yet, browsers link in lots of system components, and their system call requirements keep changing. This is largely invisible because of early testing in distributions like Fedora rawhide. Incompatible things are temporarily reverted until browsers catch up.

Protocol ossification

Posted Jan 25, 2025 13:59 UTC (Sat) by mokki (subscriber, #33200) [Link] (1 responses)

Why can't docker have a dynamic allowlist that users can add new syscalls to, without requiring a new docker version?
This has been a problem for 10+ years and there is still no solution from the docker developers.
Does podman handle this better?

Protocol ossification

Posted Jan 25, 2025 16:47 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

I see this (experimental) flag in podman's docs: https://docs.podman.io/en/v4.6.1/markdown/options/seccomp...

Protocol ossification

Posted Jan 24, 2025 16:24 UTC (Fri) by hmh (subscriber, #3838) [Link] (2 responses)

"This is a special case, because it breaks without updating the application. Seccomp filter updates on application updates are expected. On kernel updates the requirement of seccomp filter updates (that would mean app update) is unexpected."

IMO, what you described *is* very much an ossification danger from the PoV of the kernel. Kernel updates (and not only the kernel, there are also library updates of stuff like the libc and kernel-interface libraries, and no-libc runtimes like golang's) need to be able to change the syscalls they use, if ossification is to be avoided.

I do agree that changes that would require seccomp updates on *stable* kernels (and minor/patch/stable-train updates of libraries and language runtimes) should be both a do-it-only-as-a-last-resort thing *and* very explicitly documented.

Of course, there is no ossification issue with the original "restricted compute worker process" seccomp mode, because the hard-coded policy for that mode is in the kernel and it will be kept in sync with any changes to the syscalls.

Protocol ossification

Posted Jan 24, 2025 16:36 UTC (Fri) by hmh (subscriber, #3838) [Link] (1 responses)

(It is a pity one cannot edit posts). A clarification: on this *specific* case of the restricted uretprobe syscall, I agree with others that it should just be hardcoded-allow-listed by seccomp (i.e. "invisible" to seccomp). And my opinion is heavily based on the very specific detail that this syscall cannot be called by general code without triggering a SIGILL: this specific syscall is just an internal implementation detail of uretprobes.

So, IMHO, if one wants to restrict uretprobes to an application (or even system-wide), it should be done in some other higher abstraction level that deals with uretprobes itself, or ptracing and the like.

Protocol ossification

Posted Jan 24, 2025 20:45 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

IMHO this is not unreasonable, but it would be helpful to make an explicit category for syscalls of this nature, with documentation and possibly new seccomp flags for dealing with it.

Actually, I think it would probably be helpful to have multiple categories for syscalls of different types based on what they can do and whether they can affect things outside of the process. A seccomp BPF filter could, hypothetically, receive a bitmask indicating the properties of a given syscall, such as:

* Whether it can read/modify state, and separate flags for the kind of state it reads or modifies. dup2 would be flagged differently from write, because dup2 modifies the process's file descriptor table, while write can modify the filesystem or do IPC (pipes etc.). (No, procfs is not "the filesystem" for the purposes of this discussion.)
* Whether it is considered part of the kernel's userspace API. Yes for most syscalls, no for uretprobe. Denying syscalls where this flag is not set would be considered poor practice, and might result in compatibility issues on the next kernel release, but you can still do it if you really want to.
* Whether it requires at least one capability to call with the specified arguments, for any reason other than filesystem permissions (because that would require looking them up, which slows things down a ton, and is also inherently racey).
* Whether it would require a filesystem permission check, if executed by a non-privileged process (i.e. the flag is still set if you are root). Does not indicate what the result of that permission check would have been, only that it would have been done.

The BPF filter could then use those flags to make an informed choice about unrecognized syscalls, and you could even pass some kind of mask to seccomp/prctl to indicate which syscalls you want to filter in the first place.

I don't know how backwards compatible this is, or whether there is the will to implement something this complicated. But it would be nice to have.

Protocol ossification

Posted Jan 24, 2025 13:41 UTC (Fri) by tialaramex (subscriber, #21167) [Link] (5 responses)

Ossification is indeed a problem here, but seccomp didn't help itself by being designed in a way that's more or less guaranteed to ossify as I understand it.

In TLS for example it was from the outset documented how to avoid ossifying, and what we saw each iteration is that the "Network Security" industry had not bothered because what they're actually selling is rocks that keep tigers away - the customers don't understand that this is impossible and aren't interested in learning what's actually for sale, I want your tiger rock and I want it immediately, take my money.

You mentioned the workaround which was (had to be) sanctioned for TLS 1.3, but there's worse: one of the vendors was selling tiger rocks which would re-use a third-party secret and pretend it was their own. This destroys security, but that's OK; the customer isn't buying actual security, they want a tiger rock, and this says it's a tiger rock, so all good. However, in TLS 1.3, when their proxy tells a remote server "Sorry, I don't know TLS 1.3, give me TLS 1.2", the server goes: OK, that's cool, the random data I picked was some-random-data-DOWNGRADE-more-random-data†. If you're really a TLS 1.2 client this doesn't seem suspicious; it's extremely unlikely, but so is any particular value, so everything works. And the proxy isn't bothered.

But when this idiot vendor's proxy copies the data into their reply to a client asking to perform TLS 1.3 that's a huge red flag for the client. Why is the server which claims it can't speak TLS 1.3 also telling me (as per TLS 1.3 protocol design) that it was asked to downgrade from TLS 1.3? I didn't ask for a downgrade, clearly there's an attack -- and there is, it's your own systems attacking you because you (most likely the corporation you work for or at) bought a tiger rock.

Anyway, all this to say: in TLS 1.3 we had these problems _despite_ the existence of obvious ways to never have the problem. seccomp is worse because it's not obvious how, say, Docker should magically avoid this problem. If you sat down in 2010 to write a "good" TLS proxy, yours still works today. Of course you're out of business, because unscrupulous, incompetent and greedy vendors will say their tiger rock (which doesn't work) is better, faster, cheaper, whatever, but that's too bad.

Instead: Seccomp should have separate allow & block lists. If you neither allow nor block a system call, what happens depends on the kernel, and so NOW there's an opportunity for a policy debate about what goes in which pile and why but it's not a magic special case it's just a kernel policy decision which is nothing new.

† That's not the real data, but you can go read the RFC if you need details

Protocol ossification

Posted Jan 24, 2025 20:55 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (2 responses)

> Instead: Seccomp should have separate allow & block lists. If you neither allow nor block a system call, what happens depends on the kernel, and so NOW there's an opportunity for a policy debate about what goes in which pile and why but it's not a magic special case it's just a kernel policy decision which is nothing new.

Seccomp does not have allowlists or blocklists (except for SECCOMP_SET_MODE_STRICT, which sets an extremely restrictive hard-coded allowlist for compute-only applications). If you want to customize its behavior, you have to use BPF filters. But a BPF filter is just arbitrary-ish code. You can write whatever logic you want - if you want to have an allowlist, a blocklist, and some userspace-configurable behavior for unrecognized syscalls, you can do that today.

In practice, everybody just writes a simple allowlist implementation for their BPF filter, but the kernel did not make them do that.
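
For reference, the "simple allowlist" in question is typically a short classic-BPF program along these lines (a generic sketch, not any particular project's filter; a real one would also check seccomp_data.arch before trusting the syscall number):

    /*
     * Generic sketch of the "simple allowlist" classic-BPF filter (not any
     * particular project's policy).  Load the syscall number, allow a short
     * list, and fail everything else -- known or not yet invented -- with
     * EPERM.  A real filter would also check seccomp_data.arch first.
     */
    #include <errno.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>

    static int install_allowlist(void)
    {
        struct sock_filter filter[] = {
            /* Load the syscall number from struct seccomp_data. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
            /* Allow a handful of calls (the allow return is the last insn). */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read,       3, 0),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write,      2, 0),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
            /* Everything else fails with EPERM. */
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
            return -1;
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
    }

    int main(void)
    {
        if (install_allowlist() != 0)
            return 1;
        /* From here on, anything outside the list above fails with EPERM. */
        return 0;
    }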

Protocol ossification

Posted Jan 25, 2025 14:46 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (1 responses)

It's all very well to say the kernel didn't "make them" choose this, but what else was available? Imagine that next week we're adding a new kernel system call; you know nothing about it except that it doesn't exist yet. Now, what does your BPF filter say to ensure that this call, in addition to the existing set, is callable, but no others?

Actually wait, changed my mind: after you wrote that BPF, I'm actually adding three new calls, two of which you obviously need to allow and one you definitely must not. Does that just work with the BPF you wrote for the original statement?

Protocol ossification

Posted Jan 25, 2025 14:58 UTC (Sat) by intelfx (subscriber, #130118) [Link]

>>> Instead: Seccomp should have separate allow & block lists. If you neither allow nor block a system call, what happens depends on the kernel, and so NOW there's an opportunity for a policy debate about what goes in which pile and why but it's not a magic special case it's just a kernel policy decision which is nothing new.
>>
>> In practice, everybody just writes a simple allowlist implementation for their BPF filter, but the kernel did not make them do that.
>
> It's all very well to say the kernel didn't "make them" choose this, but what else was available? Imagine that next week we're adding a new kernel system call; you know nothing about it except that it doesn't exist yet. Now, what does your BPF filter say to ensure that this call, in addition to the existing set, is callable, but no others?

The problem isn't even that seccomp does not have a rigid whitelist/blacklist mechanic (as NYKevin says). You can, for instance, let the BPF program return a third verdict "defer to the kernel judgment" in addition to the "allow" and "deny" verdicts.

However, in order to be able to have a meaningful "default policy" in the kernel, there has to be some agreed-upon overarching semantics for the entire seccomp mode 2 mechanism, and there isn't. Seccomp is just "a mechanism to mess with syscalls". Someone might use it for security, someone else might use it for debugging, or even a rudimentary form of fault injection. There is no way to have a meaningful default policy in the kernel when there is no predefined goal that this policy must fulfill.

Protocol ossification

Posted Jan 25, 2025 12:42 UTC (Sat) by quotemstr (subscriber, #45331) [Link]

The obvious way to avoid this problem is to stop sandboxing things at the level of individual system calls. Access restrictions should be on *actions* applied to *objects*, not on the specific nouns and verbs you're using to talk about those objects.

It's like I'm a bank teller (remember those?) and we're trying to solve the problem of bank robbery by installing an automated gun turret that shoots anyone saying "give me all your money". Robbers just say "give me all your currency", so now the automated gun turret, which can't enumerate every possible sentence, has to shoot anyone saying anything other than a few pre-defined sentences, like "I would like to make a deposit". Woe be to the client who instead says "I'd like to deposit a check" and gets shot.

The right approach is to make a rule against bank robbery!

Likewise, docker and others shouldn't be whitelisting specific system calls. They should talk about what actions processes can take on which objects.

Protocol ossification

Posted Jan 25, 2025 21:45 UTC (Sat) by Wol (subscriber, #4433) [Link]

> But when this idiot vendor's proxy copies the data into their reply to a client asking to perform TLS 1.3 that's a huge red flag for the client. Why is the server which claims it can't speak TLS 1.3 also telling me (as per TLS 1.3 protocol design) that it was asked to downgrade from TLS 1.3? I didn't ask for a downgrade, clearly there's an attack -- and there is, it's your own systems attacking you because you (most likely the corporation you work for or at) bought a tiger rock.

And this is where the European Computer Security Act (or whatever it was called) would come into effect. If the device claims to do TLS 1.3 and doesn't, it's clearly defective. And if it isn't fixed as per the CE mark ...

This is where I would like the government to say "save money, buy COTS gear in bulk, but have a supplier blacklist. If you have to replace gear because the supplier welched on the CE mark, they go on the blacklist for the life of the REPLACEMENT gear".

So if another supplier comes in and says "I'll give you a 10-year CE life instead of the standard 5", they're taking a risk, but they're also locking a competitor out of a lucrative market ...

Cheers,
Wol

Why a syscall?

Posted Jan 24, 2025 3:19 UTC (Fri) by JesseW (subscriber, #41816) [Link] (14 responses)

I am confused why the performance-improving patch was made using a new syscall at all. I am not very familiar with kernels, but are there really only two ways to get into the kernel -- a software interrupt/trap and a visible (and blockable) syscall? Since the kernel is injecting the code, can't it branch to itself without any visible ceremony?

I'm sure I'm missing something obvious here, but hopefully others are too, so this question is useful.

Why a syscall?

Posted Jan 24, 2025 4:08 UTC (Fri) by tullmann (subscriber, #20149) [Link]

The syscall is the branch into the kernel. To get the CPU into a state where kernel code can be executed the CPU needs to change protection levels and memory (or even the address space), and doing those things is basically the definition of a syscall. Doing a branch without the protection level change would just trigger a fault of some kind. (There are kernels, like DOS, where there is little to no protection and the branch is really all it takes. But not Linux.)

Why a syscall?

Posted Jan 24, 2025 7:05 UTC (Fri) by geuder (guest, #62854) [Link] (12 responses)

As already answered we are talking about a security boundary.

The kernel has injected the code, but the user space process executes it. The fact that this code has a somewhat special history is not sticky; it's just user space code. If it could cross the security boundary by some shortcut, there would be no way to prevent "original" user space code from using the same shortcut, i.e. crossing the security boundary where it shouldn't.

Why a syscall?

Posted Jan 24, 2025 7:19 UTC (Fri) by mb (subscriber, #50428) [Link] (10 responses)

This kind of shortcut exists with vdso, but that is basically read-only and highly restricted to a couple of selected syscalls.

Why a syscall?

Posted Jan 24, 2025 11:16 UTC (Fri) by kleptog (subscriber, #1183) [Link]

Which just raises the question: why not just have the trampoline use the vdso?

Why a syscall?

Posted Jan 24, 2025 13:59 UTC (Fri) by cesarb (subscriber, #6266) [Link] (8 responses)

No, the vdso doesn't cross any security boundary, it runs solely in user space; there's no shortcut. When it needs to cross the security boundary, the code in the vdso has to do a system call. The kernel does share some read-only data pages which the vdso can use to implement some system calls completely in user space, without entering the kernel; for instance, the adjustment variables necessary to convert the CPU clock (which is readable in user space, through the rdtsc instruction on x86 or equivalents in other architectures) into the real time or monotonic clocks. But it's still user-space only code, it could even be reimplemented outside the vdso (other than it being very kernel-version-specific, since the layout of the shared data pages is an implementation detail of the kernel and can change at any time).
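
The classic beneficiary of that arrangement is clock_gettime(): on common architectures, glibc routes it through the vDSO's __vdso_clock_gettime so that typically no kernel entry happens at all. A minimal sketch (ordinary userspace code, nothing project-specific):

    /*
     * Ordinary userspace code; nothing project-specific.  On common
     * architectures glibc services this call through the vDSO (reading the
     * TSC or equivalent plus the kernel-maintained conversion data), so
     * typically no kernel entry happens and strace shows nothing for it.
     */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec ts;

        if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
            printf("monotonic: %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
        return 0;
    }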

Why a syscall?

Posted Jan 24, 2025 18:04 UTC (Fri) by mb (subscriber, #50428) [Link] (7 responses)

If you define the read access as "not crossing a security barrier", then yes.
But that's a rather strange definition, IMO.

Why a syscall?

Posted Jan 24, 2025 21:02 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (6 responses)

The vDSO is userspace code reading userspace memory. True, the kernel does take care to set up those pages of memory with certain "magic" values, but those values are not sensitive (or if they are, the kernel does the relevant security checks at program startup before it maps the pages). After program startup, it's just another shared object, no different from libc or any other userspace library.

Why a syscall?

Posted Jan 24, 2025 21:08 UTC (Fri) by mb (subscriber, #50428) [Link] (5 responses)

Would you please read the thread I replied to?
I never said that this was not userspace code. Quite the contrary.

Why a syscall?

Posted Jan 24, 2025 22:13 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (4 responses)

If your definition of "crossing a security boundary" includes userspace code dereferencing a pointer into userspace memory, then we will have to agree to disagree.

Why a syscall?

Posted Jan 25, 2025 1:45 UTC (Sat) by mb (subscriber, #50428) [Link] (3 responses)

So, why isn't vdso implemented as an actual .so file, if it's just an ordinary userspace thing?

Why a syscall?

Posted Jan 25, 2025 2:37 UTC (Sat) by intelfx (subscriber, #130118) [Link]

Because it is dependent on the layout of the data page that the kernel maps into the process.

Why a syscall?

Posted Jan 25, 2025 3:40 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

Layout and logistics. It can be implemented as a shared object somewhere in /proc or /sys, but none of these are guaranteed to be available for the applications (and on Android neither of them are, btw).

Why a syscall?

Posted Jan 25, 2025 15:55 UTC (Sat) by dezgeg (subscriber, #92243) [Link]

Furthermore, on 32-bit x86 the fastest way to make syscalls depends on what the CPU supports - thus making syscalls is preferably done by calling into the VDSO. So you couldn't even make the syscalls to open/map the vdso without ugly hacks like having separate syscall stubs for early process startup code.

Why a syscall?

Posted Jan 24, 2025 14:32 UTC (Fri) by intelfx (subscriber, #130118) [Link]

> If it could cross the security boundary by some shortcut, there would be no way to prevent "original" user space code from using the same shortcut, i.e. crossing the security boundary where it shouldn't.

This is wrong, the article very explicitly says there *is* a way:

> it can only be called from the kernel-injected special trampoline; otherwise it will just deliver a SIGILL signal to the calling process.

Why no sysctl?

Posted Jan 24, 2025 7:16 UTC (Fri) by geuder (guest, #62854) [Link]

I fail to see why they don't want a sysctl. Of course there are too many... But from a kernel perspective that seems like the best solution to me. Just let the distro decide what user experience it wants to give: best performance (but legacy code might fail), or maximum compatibility with a performance hit. And the expert user can switch if their need differs from what the distro chose.

Just accelerate traps?

Posted Jan 24, 2025 14:09 UTC (Fri) by bushdave (subscriber, #58418) [Link] (1 responses)

If the problem is that the trap instruction is slower than a syscall, why not just make a trap() syscall documented to behave just like a trap instruction? This syscall should not be affected by seccomp(), since a trap instruction is not affected by seccomp(). A solution like this is also re-usable for other scenarios.

Just accelerate traps?

Posted Jan 26, 2025 22:35 UTC (Sun) by Sesse (subscriber, #53779) [Link]

Ironically, in the old days (386), it was the other way round: Traps were faster than syscalls.[1]

https://devblogs.microsoft.com/oldnewthing/20041215-00/?p...

[1] To the degree “syscall” was a more sanctioned way to make a syscall, e.g., INT some_value. The SYSCALL instruction didn't exist yet.

Hard disagree

Posted Jan 24, 2025 14:27 UTC (Fri) by intelfx (subscriber, #130118) [Link] (13 responses)

I'm not a kernel maintainer, but I definitely disagree with Kees here.

The uretprobe syscall is an implementation detail. The application does not contain it, the application has no choice about it (because uretprobes would be attached by an outside mechanism like a debugger), so why should the application know or care about adding it to its seccomp mask?

Imagine if we had to authorize internal kernel functions via seccomp. Implementation of some other syscall changed in kernel X.Y? Too bad, if you don't add all of the new contents of the call graph, the application stops working.

The security arguments don't appear to hold any water: the article says that uretprobe() syscall is guarded against unauthorized usage in its implementation, so if an application calls one itself, it'll just deliver a SIGILL.

So yes, uretprobe() _is_ special: it's not used by the application, it's injected by the kernel dynamically as part of operation of another feature, it has its own protections... It *should* be invisible.

Hard disagree

Posted Jan 24, 2025 15:02 UTC (Fri) by epa (subscriber, #39769) [Link]

You are right. But that suggests there should be a single syscall which is the only one the kernel will ever inject into processes (and which cannot ordinarily be called from user space). Uretprobes would be the first application of this new call (passing 0 as the first parameter, or something), but others may be added in the future. Then this call, and this one only, could be ignored by seccomp.

Hard disagree

Posted Jan 24, 2025 19:59 UTC (Fri) by ibukanov (subscriber, #3942) [Link] (11 responses)

It does not matter that it was the kernel that injected the code. The code runs in user space, so malicious code can also try to call the uretprobe syscall. And the implementation may have bugs. As a defense in depth, one may want to block all of that functionality until the code has matured and been sufficiently tested to add it to the whitelist.

The real trouble with seccomp is that it is implemented as BPF code, so the uretprobe machinery has no way to know in advance, before injecting the code, whether the call will be denied. If, for example, the seccomp API applied per syscall, so that one could check whether a particular call has a filter associated with it, the implementation could decide whether it can use uretprobe() or should fall back to the older, less performant trap code.

Hard disagree

Posted Jan 24, 2025 20:28 UTC (Fri) by intelfx (subscriber, #130118) [Link] (10 responses)

> It does not matter that it was the kernel that injected the code. The code runs in user space, so malicious code can also try to call the uretprobe syscall.

It does, though. As said above, the implementation handles this case.

> And the implementation may have bugs

Just like any other functionality.

> As a defense in depth, one may want to block all of that functionality until the code has matured and been sufficiently tested to add it to the whitelist.

As said elsewhere by another commenter, if the goal is "defense-in-depth" conservatism, then it's the wrong layer to make these restrictions at. If it's a defense-in-depth mechanism, then it must be handled with a more suitable mechanism, like a separate sysctl or in the same vein as other ptrace-related restrictions.

In other words: if the goal is to protect against "immature not sufficiently tested code", then it's a policy decision that must be taken by the local administrator, not by every single application which **has nothing to do with the syscall being injected**.

Hard disagree

Posted Jan 25, 2025 19:43 UTC (Sat) by ibukanov (subscriber, #3942) [Link] (1 responses)

> In other words: if the goal is to protect against "immature not sufficiently tested code", then it's a policy decision that must be taken by the local administrator, not by every single application

The problem was caused by Docker, not the application code. Configuring Docker's default policy is the responsibility of the administrator, or at least the distribution, not of applications.

And Docker is absolutely right here. Its policy is about minimizing the attack surface against the kernel.

Hard disagree

Posted Jan 25, 2025 20:12 UTC (Sat) by intelfx (subscriber, #130118) [Link]

> The problem was caused by Docker, not the application code.

Well, that's even worse. That's double "spooky action at a distance".

> Its policy is about minimizing the attack surface against the kernel.

You're making precisely zero sense. It's not Docker's business to accidentally restrict the administrator from injecting tracepoints using unrelated mechanisms into unrelated applications, and it's not Docker's business to enact such policy (even if it was intentional, which it is not, due to lousy architecture all around).

Hard disagree

Posted Jan 26, 2025 1:39 UTC (Sun) by wahern (subscriber, #37304) [Link] (7 responses)

> > And the implementation may have bugs

> Just like any other functionality.

Right. That's the whole point of seccomp--that the software, both userspace *and* kernel, might have exploitable bugs, and that minimizing exposed kernel surface area is no less important (if not *more* important) than in-process mitigations for userspace code. seccomp has very little value as a defensive, mitigation layer if it's not deny by default.

Someone else mentioned that everybody should be focused on capability systems, not seccomp. Well, I don't think many on the seccomp side would disagree that the community should be designing, implementing, and adopting capability systems more strongly. But that takes highly coordinated effort across all layers of the stack that the Linux software ecosystem in particular hasn't been particularly successful at. After all, what's the capability story with uretprobe? There's no file descriptor/token involved. How would it even work--the profiler is effectively injecting code in the application. And presuming there was some proper capability system involved, Docker's strategy here is to impose a jail without the cooperation of the application (it's the administrator, via Docker making a policy decision for an application that itself isn't even aware of the mechanism), and in a proper capability system Docker would likely be a broker that would presumably deny by default. Docker exists for the same reason seccomp exists--because administrative tooling is easier for the mainstream to adopt than to coordinate refactoring of application stacks.

There are no easy answers, here. OpenBSD has pledge, a saner, more comprehensive seccomp, and it works very well there. But pledge is premised on the notion that each application is refactored to make proper use of it; it doesn't work much better as a practical matter than seccomp when imposed administratively. FreeBSD has Capsicum, a capability architecture. But FreeBSD has a much more diverse ecosystem, refactoring for Capsicum is a much heavier lift, and it's seen little uptake by non-core software.

Hard disagree

Posted Jan 26, 2025 2:03 UTC (Sun) by intelfx (subscriber, #130118) [Link] (6 responses)

> Right. That's the whole point of seccomp--that the software, both userspace *and* kernel, might have exploitable bugs, and that minimizing exposed kernel surface area is no less important (if not *more* important) than in-process mitigations for userspace code. seccomp has very little value as a defensive, mitigation layer if it's not deny by default.
> <...>
> Docker's strategy here is to impose a jail without the cooperation of the application (it's the administrator, via Docker making a policy decision for an application that itself isn't even aware of the mechanism), and in a proper capability system Docker would likely be a broker that would presumably deny by default. Docker exists for the same reason seccomp exists--because administrative tooling is easier for the mainstream to adopt than to coordinate refactoring of application stacks.

*What* capability would Docker be denying by default, if this was a capability-based system?

There is no (hypothetical) capability that is being used by the target application. I, as the administrator (presumably in possession of root-equivalent privileges), am requesting the kernel to inject some code into the target application on my behalf. Nothing else in the system (not Docker, not the target application) has any business meddling with this request in any way.

Seccomp has about as much reason to block this pseudo-syscall as it has to block, say, a trap instruction. Seccomp doesn't block trap instructions, now does it? They are entry points into the kernel too, after all.

Hard disagree

Posted Jan 26, 2025 2:10 UTC (Sun) by intelfx (subscriber, #130118) [Link] (5 responses)

> I, as the administrator (presumably in possession of root-equivalent privileges), am requesting the kernel to inject some code into the target application on my behalf.

Slight correction: I am requesting the kernel to "do something" to let me trace the application. What this "something" is is an implementation detail of the kernel. This implementation detail has its own protections against being abused. So it makes even less sense that Docker can somehow interfere with this implementation detail.

Like I said: it's as if we had to use seccomp to whitelist internal kernel functions that are being invoked during the course of execution of an (otherwise allowed) syscall, on the grounds that "if the kernel is calling some new functions, those represent untested code paths which we want to deny by default because they are untested and immature".

It makes no sense.

Hard disagree

Posted Jan 26, 2025 8:33 UTC (Sun) by ibukanov (subscriber, #3942) [Link] (1 responses)

Malicious user space can execute the new syscall even if no system administrator has asked for it. One can argue that the implementation on the kernel side is bulletproof and clearly rejects such attempts, but past experience is full of cases where this was false.

So seccomp is right to reject this case. The trap case is fundamentally different because that code is extremely mature and well tested, allowing seccomp to trust it by default.

Hard disagree

Posted Jan 26, 2025 14:40 UTC (Sun) by intelfx (subscriber, #130118) [Link]

> The trap case is fundamentally different because that code is extremely mature and well tested, allowing seccomp to trust it by default.

Seccomp never "distrusted" trap instructions. It cannot prevent trap instructions from being executed, and it never could. It's not that "seccomp trusts it by default"; it's that trap instructions are out of scope for seccomp, always have been, and always will be.

So no, this reasoning is invalid.

Hard disagree

Posted Jan 26, 2025 12:25 UTC (Sun) by glettieri (subscriber, #15705) [Link] (2 responses)

> This implementation detail has its own protections against being abused.

I may be wrong, but I think you are missing a point here. The protection is against calls coming from outside the injected trampoline (or, more precisely, from anywhere other than the exact location in the trampoline). But an attacker who has hijacked the control flow in the traced application can make it jump into the trampoline and issue a uretprobe syscall that passes the protection check. Therefore, if there are bugs in the uretprobe implementation, the injected trampoline potentially exposes those bugs to the attacker.

Hard disagree

Posted Jan 26, 2025 14:51 UTC (Sun) by intelfx (subscriber, #130118) [Link] (1 responses)

> I think you are missing a point here

I think you are missing mine. How is it different from an application hijacking control flow or whatever to jump to the previous implementation of this mechanism, i.e., a trap instruction? The answer is "it's not", and we were okay with it.

This argument is clearly going in circles, so in order not to incur the wrath of our editors, I will stop participating in this subthread. (However, I must note that this is not equal to conceding.)

Hard disagree

Posted Jan 26, 2025 15:35 UTC (Sun) by glettieri (subscriber, #15705) [Link]

> I think you are missing mine. How is it different from an application hijacking control flow or whatever to jump to the previous implementation of this mechanism, i.e., a trap instruction? The answer is "it's not", and we were okay with it.

Good point, I see what you mean now.

update docker's seccomp policy

Posted Jan 28, 2025 18:50 UTC (Tue) by meyert (subscriber, #32097) [Link] (1 responses)

So the user upgraded the Linux kernel, and docker broke because of the strict seccomp deny policy?
So why didn't the user update the docker seccomp profile after the kernel upgrade?
I had to do that on an old rhel 7 system with very new jvms for a similar reason.

update docker's seccomp policy

Posted Jan 30, 2025 6:54 UTC (Thu) by rwmj (subscriber, #5474) [Link]

Yeah, I was confused by this article too. Updating the kernel is possible, but updating docker (or just the seccomp policies) is impossible and will take years and years? That's not any Linux distro that I'm familiar with.

Effects on systemd service units

Posted Jan 30, 2025 7:18 UTC (Thu) by kpfleming (subscriber, #23250) [Link] (2 responses)

It is becoming common for systemd service units (shipped in packages) to use systemd's similar facility for allowing only specified groups of syscalls, and the new uretprobe() syscall isn't going to be included in any of those lists in existing installations of systemd.

see: https://www.freedesktop.org/software/systemd/man/latest/s...

Effects on systemd service units

Posted Jan 30, 2025 18:11 UTC (Thu) by nix (subscriber, #2304) [Link] (1 responses)

I note that BIND removed its seccomp jail because of repeated instances of things like this hanging named, sometimes even before daemonization (which often prevented boot from continuing).

The OpenSSH seccomp jail is the only one I know of in core daemons that hasn't hit disastrous problems and been ripped out.

Effects on systemd service units

Posted Jan 31, 2025 11:43 UTC (Fri) by taladar (subscriber, #68407) [Link]

A lot of those security technologies built into systemd can have some unexpected additional failure modes. For example, the other day I had a failure in one of the services using some of the mount restrictions (PrivateTmp or something equally common, I don't remember which one exactly); the unit failed to start because the host also has a network mount that had lost contact with its CIFS server, so creating the new mount namespace failed.

It is unfortunate since I would prefer to use more of them in as many units as possible.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds