LWN: Comments on "The inherent fragility of seccomp()"

The inherent fragility of seccomp()

mjg59 — Mon, 11 Dec 2017 07:14:39 +0000

Not necessarily - in combination with an LSM policy you could restrict which things can be execve()ed. But the fact that all of these security features are effectively orthogonal makes it pretty hard to write an overarching policy.

The inherent fragility of seccomp()

roc — Mon, 11 Dec 2017 01:13:22 +0000

If the security constraints are not carried across execve(), then execve() has to be blocked or the constraints are worthless. That's a problem; I've recently implemented a seccomp sandbox around an application that definitely had to use execve().

openat() was available before - why they are blaming glibc?

corbet — Thu, 16 Nov 2017 14:09:31 +0000

This conversation confuses me a bit. Who is blaming glibc? The article is about a particular kernel functionality that is prone to breakage. The term "fragility" in the title was applied to seccomp(), not glibc, after all. I've not seen comments blaming glibc either...?

openat() was available before - why they are blaming glibc?

vadim — Thu, 16 Nov 2017 13:27:00 +0000

The problem goes as follows:

1. Programmer writes code, wants more protection and decides on the use of seccomp.
2. Programmer looks at what the code needs, and comes up with the 'open' syscall. However the code doesn't call the syscall directly, but the wrapper glibc provides.
3. Code is finished, programmer moves on to the next project.
4. Kernel development goes on, and the 'openat' syscall gets created.
5. Glibc starts using openat, so that in some cases the glibc-provided open wrapper actually calls 'openat'.
6. In those cases, the previously written code ends up using the 'openat' syscall which is not whitelisted because it didn't exist when the code was written, or because the 'open' wrapper always used the 'open' syscall and nothing else, and this changed later.

The 'open' call doesn't go anywhere. Glibc just doesn't promise to do an exact 1 to 1 wrapper, or not to introduce internal usage of additional new syscalls for its own internal reasons. When you call glibc open(), glibc may actually invoke a new, more advanced syscall like openat instead, or use additional syscalls in the wrapper.
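The sequence above can be mimicked with a toy model. This is only a sketch: no real seccomp filter is installed, and `ALLOWED`, `filtered_syscall`, `wrapper_old` and `wrapper_new` are hypothetical stand-ins for the whitelist, the kernel's filter check, and an older and a newer libc open() wrapper. The filter is assumed to fail disallowed calls with EPERM; a real filter might kill the process instead.

```python
import errno

# Whitelist written at step 2: only the syscalls the programmer observed.
ALLOWED = {"open", "read", "write", "close"}


def filtered_syscall(name: str) -> int:
    """Toy kernel: 0 if the filter allows the call, otherwise -EPERM."""
    return 0 if name in ALLOWED else -errno.EPERM


def wrapper_old(path: str) -> int:
    """Hypothetical older libc: open() maps straight to the 'open' syscall."""
    return filtered_syscall("open")


def wrapper_new(path: str) -> int:
    """Hypothetical newer libc: the same open() now issues 'openat'."""
    return filtered_syscall("openat")
```

The application calls the same open() wrapper in both cases; only the library underneath changed, and the unchanged whitelist now rejects the call.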

openat() was available before - why they are blaming glibc?

jem — Thu, 16 Nov 2017 09:19:30 +0000

I'm not convinced. I don't think it is fair to blame Glibc for system calls suddenly disappearing from underneath it at the whim of some random system administrator or application developer. If you use drastic tools like seccomp(), you should really know what you are doing and be prepared for surprises like changing library implementations. In the case of open() vs. openat(), I wonder what the reason was for whitelisting one but not the other. Maybe somebody was just sloppy and simply forgot openat() existed?

openat() was available before - why they are blaming glibc?

smcv — Thu, 16 Nov 2017 08:24:26 +0000

> Some "security" filter decided that it wants to prevent user from calling open(), but they forget about openat().

The situation here seems to be the other way round: a whitelist-based filter allowed a particular program to call the open syscall (and therefore open files), but in recent glibc, the open(2) wrapper function actually uses the more general openat syscall, which the filter didn't allow. This caused that program to become unable to open files - not vulnerable, but also not usable ("failing closed").

openat() was available before - why they are blaming glibc?

sasha — Wed, 15 Nov 2017 12:29:11 +0000

I do not understand why somebody blames glibc at all. There were 2 system calls: open() and openat(). Some "security" filter decided that it wants to prevent user from calling open(), but they forget about openat(). An application may use openat() with any (g|uc|musl)libc, it is just a syscall. So this "security filter" is just stupid and does not provide any security at all. The new glibc release accidentally showed the hole in the filter, thank you very much. If the developers of this "security filter" blame glibc for this, then it looks... strange.

The inherent fragility of seccomp()

nix — Wed, 15 Nov 2017 00:34:34 +0000

> Which brings up another benefit of pledge over seccomp--pledge doesn't require root privileges to invoke.

Neither does the installation of a seccomp filter, as long as you have done a prctl(PR_SET_NO_NEW_PRIVS, 1) first to ensure that you can't go invoking setuid programs, etc, later on. Heck, it was basically designed for Chromium's renderers, and no way are they run as root except by absolute lunatics :)

(This is how it avoids the old sendmail cap attack: setuid programs or their children can't be fooled into running with an unexpected seccomp filter installed before the setuid took effect, because installation of a filter requires turning permanently off the ability to invoke setuid programs in the process hierarchy that has the filter in force.)
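A minimal sketch of the prctl(PR_SET_NO_NEW_PRIVS, 1) step, assuming Linux and a libc reachable through ctypes. The helper name `no_new_privs_in_child` is made up for this example. Because the flag cannot be cleared once set, the sketch sets it in a forked child so the calling process is left untouched. Installing the actual filter afterwards is omitted.

```python
import ctypes
import os

# Option numbers from <linux/prctl.h>.
PR_SET_NO_NEW_PRIVS = 38
PR_GET_NO_NEW_PRIVS = 39


def no_new_privs_in_child() -> int:
    """Fork, set no_new_privs in the child only, and report what it reads back.

    Returns 1 if the flag ended up set, 0 if not, 2 if prctl() failed.
    """
    pid = os.fork()
    if pid == 0:
        status = 2
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            zero = ctypes.c_ulong(0)
            one = ctypes.c_ulong(1)
            # No root privileges are needed for this call.
            if libc.prctl(PR_SET_NO_NEW_PRIVS, one, zero, zero, zero) == 0:
                status = libc.prctl(PR_GET_NO_NEW_PRIVS, zero, zero, zero, zero)
        finally:
            os._exit(status)  # never return into the parent's code path
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)
```

Once the flag is set, it is inherited by children and preserved across execve(), so setuid binaries run from that process tree no longer gain privileges.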

The inherent fragility of seccomp()

wahern — Tue, 14 Nov 2017 23:41:15 +0000

> OK, in this case, the blind user would more likely be preloading something into the ssh *client*, which is not seccomped

Which brings up another benefit of pledge over seccomp--pledge doesn't require root privileges to invoke. Almost all the standard utilities in OpenBSD call pledge, _including_ ssh(1). pledge can do this because it's not inherited across exec, which smartly sidesteps all the messy security issues with the setuid and setgid executable bits.

The inherent fragility of seccomp()

wahern — Mon, 13 Nov 2017 21:42:38 +0000

True, actually using Capsicum from applications takes considerable work unless you're starting from scratch. But getting it merged seems more like a political rather than a technical issue, as most of the technical work exists for the taking.

Getting over that political hurdle seems daunting, unfortunately. AFAIK the CLONE_FD patch (https://lwn.net/Articles/638613/), necessary for implementing Capsicum's pdfork() interface, _still_ hasn't been merged.

Regarding glibc, I'm not sure how much of an impact it would have on glibc. The particular case of open v openat is irrelevant because applications are supposed to be using openat in the Capsicum model, anyhow. The benefit of Capsicum is that it builds upon the existing, de facto file descriptors-as-capabilities model in Unix. From the perspective of libc, playing nice with Capsicum is roughly similar to refactoring to better leverage the latest evolutions of POSIX and privilege separation best practices. For example, use getrandom() instead of expecting to open /dev/urandom. And stop relying on /proc so heavily because it's not always visible. These are things glibc has to do, anyhow.

The inherent fragility of seccomp()

nix — Mon, 13 Nov 2017 15:08:16 +0000

The problem is this doesn't just apply to "foundational" libraries. It applies to all of them, the complete set of all syscalls that might be made by all libraries in the address space throughout the lifetime of the seccomped process, and you just can't tell what those might be, not at compile time, not at link time, not even at install time.

e.g. if your sshd has something obscure LD_PRELOADed into it for the sake of a blind user, now you have to adapt to the new syscalls it makes in routine operation, even though you probably had no idea the thing existed. (OK, in this case, the blind user would more likely be preloading something into the ssh *client*, which is not seccomped, but if we're going to try seccomping anything associated with a user interface we'll suddenly have to consider input methods and God knows what getting preloaded in or plugged in).

The inherent fragility of seccomp()

musicinmybrain — Mon, 13 Nov 2017 14:43:46 +0000

There’s more than one credible libc out there.

The inherent fragility of seccomp()

quotemstr — Mon, 13 Nov 2017 07:00:14 +0000

Maybe it's time to move the development of libc and other foundational userspace libraries into the kernel repository so that they evolve together.

The inherent fragility of seccomp()

simcop2387 — Mon, 13 Nov 2017 01:01:26 +0000

This is one of the things I do for enabling arbitrary code execution on a pastebin I run, along with a bunch of other techniques to sandbox the whole thing and prevent it from having any system-visible side effects. The code is published on CPAN: https://metacpan.org/pod/App::EvalServerAdvanced . I'm working on making it easier to handle arbitrary programs that can be sent to the sandbox while keeping things secure.

The inherent fragility of seccomp()

roc — Sun, 12 Nov 2017 09:27:05 +0000

Getting that merged would in itself be a monumental task.

Then you'd have to rewrite glibc and most other userspace libraries and applications to use capsicum-enabled APIs.

It could be great, but don't claim it's easy.

The inherent fragility of seccomp()

wahern — Sun, 12 Nov 2017 08:59:36 +0000

Getting there from here is nearly as easy as a single merge: https://github.com/google/capsicum-linux

The inherent fragility of seccomp()

epa — Sun, 12 Nov 2017 08:27:43 +0000

Surely the right way to do seccomp() is to just kill the process with a signal if it calls a system call that isn't allowed. That would be much less dangerous than a pernicious weakening of all API promises, where random things can start failing even if POSIX guarantees they don't.

Programs that are seccomp-aware, and want to handle these things defensively, could arrange to trap the signal. Otherwise existing code would at least either work correctly or fail cleanly.

The inherent fragility of seccomp()

marcH — Sat, 11 Nov 2017 16:26:55 +0000

Another random example of how seccomp breaks corner cases. This one took a few months to notice:

https://bugs.chromium.org/p/chromium/issues/detail?id=772273
sslh seccomp policy blocks ssh to ChromeOS over link-local IPv6 addresses
https://chromium-review.googlesource.com/c/chromiumos/ove...

The inherent fragility of seccomp()

alonz — Sat, 11 Nov 2017 08:38:18 +0000

I believe the OP meant something subtly different: he wasn't planning to check which syscalls the program uses, rather just what syscalls exist in the kernel. When new syscalls are added, he would add them to the appropriate group in the filters (e.g., if it's a new way to open files, it will be filtered the same as all other open* syscalls). And until this update happens, the filters will ensure glibc (or any other library) will get ENOSYS for this new syscall, forcing it to fall back to older syscalls.

In a sense, this just implements a poor-man's-pledge, with the CI system ensuring it evolves together with the kernel (or at least trying to).

The inherent fragility of seccomp()

roc — Sat, 11 Nov 2017 04:59:26 +0000

We hit similar issues with some rr tests as well. Various rr things not related to seccomp also had to be updated to handle openat efficiently.

There isn't really a good solution here. pledge() won't scale to a broader software ecosystem. Trying to let libraries express their syscall requirements and collect those transitively would be complicated and prone to errors that over-expose the kernel. Probably a more capability-based kernel API would be better, but it's hard to get there from here.

The inherent fragility of seccomp()

patrakov — Sat, 11 Nov 2017 03:37:39 +0000

I believe the current situation has some similarity to the decade-old sendmail bug:
https://sites.google.com/site/fullycapable/Home/thesendma...

There, too, a syscall that previously could not fail started failing (with sendmail not checking the return value), due to a new security mechanism (capabilities).

The inherent fragility of seccomp()

pkern — Sat, 11 Nov 2017 01:45:24 +0000

Well, systemd does that: https://www.freedesktop.org/software/systemd/man/systemd....

At the same time, as stated in the original post, AppArmor also leaks the details of the libraries an application loads into the profiles. And if they exec something, you need to account for whatever the exec'ed app does.

The inherent fragility of seccomp()

nix — Sat, 11 Nov 2017 00:14:42 +0000

> (1) Keep a list of all syscalls that have been checked in the source code, and regularly (on CI) check if there are new ones. If new ones appear, they have to be compared to the existing set, and if similar enough, added to the list.

You have to check all libraries your program uses, as well, and all libraries those libraries use, and so on ad infinitum. Oh and don't forget LD_PRELOADed libraries, dynamically loaded plugins, etc etc etc. (Particularly relevant if things like Gtk are in use because of the possibility of accessibility and IM plugins that call out to weird hardware and the like that you have quite possibly never realised even exists: but speech recognition for blind people sometimes relies on LD_PRELOAD to interpose all console I/O, etc etc etc... the list of obscure edge cases crucial to someone that this breaks is endless, and IMHO unmaintainable.)

> (2) make syscalls return ENOSYS instead of aborting the program. This should cause libc to fall back from new optimised syscalls to old syscalls, as it has to maintain a certain base level

See my comment below for a case where the affected syscall was getpid(). getpid() is guaranteed to never fail, so nobody ever checks to see if it failed.

I just checked the seccomp filters active in a bunch of programs running on the system on which I'm typing this. Several of them still do not whitelist getpid(), almost a year after glibc 2.25 was released. I guess they're working by luck. The first such example is something that really *needs* seccomp, too: ntpd 4.2.8p10. It calls getpid() multiple times in the very same source file where it sets up a filter list that excludes getpid(): the obscure and out-of-the-way ntpd/ntpd.c. One of its calls does not check for failure, so can easily end up trying to set a process group of (pid_t)-1... it's in a tangle of conditionals that mean that most of the time, if you're lucky, you'll end up not compiling in that code -- but there are several other calls elsewhere in the source tree... and oh yes it also links to OpenSSL's libcrypto. Any bets on whether *that* calls getpid()? Repeat for every other syscall it doesn't allow past, and every syscall it allows past but only with argument checking.

This is not a maintainable strategy for any but the simplest programs.

The inherent fragility of seccomp()

nix — Sat, 11 Nov 2017 00:00:31 +0000

That entirely depends on the syscall.

This is not the first instance of such breakage: in glibc 2.25, BIND's named daemon stopped working. The failure was catastrophic: rather than daemonizing, it hung forever, which had a tendency to bring boots to a grinding halt: if it didn't, it had a tendency to bring whole networks to a halt if this wasn't noticed and all machines were eventually rebooted after an update. The cause? glibc 2.25 dropped the internal caching of getpid() which it had long done, since it didn't speed much up, added a lot of complexity, introduced subtle bugs, and broke horribly with PID namespaces. When this was done, threaded programs which called getpid() for the first time after activating their seccomp filters needed to whitelist it in those filters, where they never needed to before. BIND had not done so, and called getpid() before daemonizing. Strangely neither it nor glibc expected getpid() to fail. POSIX guarantees it cannot fail, but thanks to seccomp it now can.
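The effect of dropping the cache can be shown with a toy model. Nothing here touches real seccomp or glibc: `ToyProcess` is a made-up class, the cache is assumed to be filled at process start, and the filter is assumed to fail disallowed calls with EPERM.

```python
import errno


class ToyProcess:
    """Toy model of a process with an optional libc-level PID cache."""

    def __init__(self, pid: int, libc_caches_pid: bool):
        self.pid = pid
        # Assumption: a caching libc already knows the PID at startup.
        self.cached_pid = pid if libc_caches_pid else None
        self.allowed = None  # None means no filter installed yet

    def install_filter(self, allowed: set) -> None:
        self.allowed = set(allowed)

    def _getpid_syscall(self) -> int:
        if self.allowed is not None and "getpid" not in self.allowed:
            return -errno.EPERM  # assumed filter action: fail the call
        return self.pid

    def getpid(self) -> int:
        """Library-level getpid(): served from the cache when one exists."""
        if self.cached_pid is not None:
            return self.cached_pid
        return self._getpid_syscall()
```

With the cache, the filter never sees a getpid syscall, so leaving it off the whitelist goes unnoticed. Without the cache, the same whitelist makes a "cannot fail" call return an error that callers were never written to check.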

Worse yet, this sort of failure can happen even if the call is only made in some non-glibc library, even if the library has no idea the seccomp filters are in force in the first place, and even if the program installing the filter has no idea the library was calling the function (perhaps it wasn't when the filters were added, and who can check every change ever made to every library your program depends on, even indirectly?)

Expecting glibc and other libraries to avoid making changes that break seccomp filters is tantamount to demanding that they never change the set of syscalls they invoke (or the arguments passed to them, because who knows what validation those filters are carrying out) in any situation ever, which would make library development on Linux essentially impossible.

I don't see a way to fix this in the current model other than to demand that all seccomp-filtered programs be statically linked and never upgraded (which would make it impossible to fix security holes in them or any libraries they used: this is of course ridiculous). The increasing use of seccomp is placing silent landmines beneath the feet of everyone using every seccomp-filtered program. This is a shame, because if programs were never upgraded and their behaviour was completely predictable, seccomp would be an excellent way to prevent malicious behaviour. However, in a world like that, programs would all already be secure and we wouldn't need seccomp in the first place.

The obvious fix, to introduce LD_AUDIT-style filtering on *library* arguments, falls at the same hurdle, for the same reason: as long as the filters are process-wide, some library getting upgraded can unintentionally violate the contract of the filter and break. The only solution I can see that would work reliably would be for each library to filter *its own* calls, so that it could at least in theory adjust its filters as its own set of expected calls changed: a sort of DT_SYMBOLIC per-.so filter for inter-shared library function calls. God knows how to implement that without totally wrecking performance though: it would mean every filtered call would have to go through the PLT and ld.so, at the very least: the very opposite of the increasing reduction in lazy binding that's actually happening. I suspect there are more complexities I haven't considered, too.

The inherent fragility of seccomp()

juliank — Fri, 10 Nov 2017 22:45:33 +0000

One problem with ENOSYS is that you can get weird behaviour in programs due to them not checking errors properly. It's much easier to detect issues when trapping, you can even write a signal handler that writes the blocked syscall to stdout (well, to fd 1 :D) [or just look at a backtrace]. My approach would be to have a list of syscalls, mark the good ones, add traps for all other syscalls in the list, and return -ENOSYS from all other (new) syscalls (or EINVAL for stuff like prctl). This way you have a defined baseline. You can even regularly trap new syscalls if you continue maintaining the software.

The backtrace thing with the trap signal is especially useful for stuff like NSS modules and preloaded libraries.
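The three-way split described above can be written down as a small decision function. This is a sketch of the policy logic only, not a real filter: the lists are deliberately tiny and hypothetical, and the EINVAL special case for prctl is left out.

```python
import errno

ALLOW = "allow"
TRAP = "trap"

# Every syscall that existed when the policy was written (hypothetical, truncated).
KNOWN = {"read", "write", "open", "openat", "close", "ptrace", "mount"}
# The subset of KNOWN that was reviewed and marked good.
GOOD = {"read", "write", "open", "openat", "close"}


def decide(syscall: str):
    """Return ALLOW, TRAP, or a negative errno for a syscall name."""
    if syscall in GOOD:
        return ALLOW
    if syscall in KNOWN:
        # Known but not marked good: trap, so a handler or backtrace shows it.
        return TRAP
    # Newer than the policy: pretend the kernel does not have it.
    return -errno.ENOSYS
```

Trapping the known-but-unmarked calls makes mistakes loud during development, while ENOSYS for unknown calls gives libraries a chance to fall back.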

The inherent fragility of seccomp()

pbonzini — Fri, 10 Nov 2017 22:43:48 +0000

glibc can be compiled with a guaranteed minimum kernel version, and will skip compatibility code if the minimum kernel version is higher than or equal to the one that included a particular system call. You can search the libc manual for "--enable-kernel".
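The rule can be expressed as a version comparison. This is a sketch of the decision only, not glibc's actual build logic: `INTRODUCED_IN` and `needs_fallback` are made-up names, the table holds just two entries, and per-architecture differences are ignored.

```python
# Kernel release that first shipped each syscall, as given in the man pages.
INTRODUCED_IN = {
    "openat": (2, 6, 16),
    "getrandom": (3, 17, 0),
}


def needs_fallback(syscall: str, minimum_kernel: tuple) -> bool:
    """True if a library built for minimum_kernel must keep compatibility code."""
    return minimum_kernel < INTRODUCED_IN[syscall]
```

For a minimum kernel of 3.2 (an example value), `needs_fallback("openat", (3, 2, 0))` is false, so the wrapper may call openat unconditionally, while getrandom would still need a fallback.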

The inherent fragility of seccomp()

arnd — Fri, 10 Nov 2017 22:41:39 +0000

glibc has the concept of a minimum kernel version, currently linux-3.2 IIRC. If a system call was available on all architectures in that version, the glibc policy is to assume it works. Removing backwards compatibility fallbacks is generally considered a good thing here, but that is what caused the issue.

Part of the problem is that we have reduced the set of available syscalls on modern architectures: anything that uses include/uapi/asm-generic/unistd.h, for instance, intentionally offers only openat() but not open(). When glibc can reasonably assume that openat() is available on all architectures, the logical next step is to always call that to reduce the differences between architectures.

The inherent fragility of seccomp()

luto — Fri, 10 Nov 2017 21:59:19 +0000

I've never understood why this is such a big deal. Whitelist the okay syscalls, handle known-bad ones sensibly, and force -ENOSYS from everything else. Glibc needs to work on old kernels, so it has to handle -ENOSYS correctly.
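The library side of this argument is a fallback on ENOSYS, which can be sketched with a toy kernel. `make_kernel` and `libc_open` are hypothetical names, the returned 3 is a pretend file descriptor, and no real syscalls are made.

```python
import errno


def make_kernel(available: set):
    """Toy kernel (or filter): anything not in `available` yields -ENOSYS."""
    def syscall(name: str) -> int:
        return 3 if name in available else -errno.ENOSYS
    return syscall


def libc_open(syscall) -> int:
    """Hypothetical wrapper: prefer openat, fall back to open on ENOSYS."""
    result = syscall("openat")
    if result == -errno.ENOSYS:
        result = syscall("open")
    return result
```

This only helps while the library still carries the fallback path; once the older call is dropped from the wrapper, forcing ENOSYS on the newer one leaves nothing to fall back to.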

The inherent fragility of seccomp()

juliank — Fri, 10 Nov 2017 21:10:34 +0000

Maybe we could start a libseccomp-easy where we consolidate groups of syscalls and maintain that in a central place, optimally in libseccomp. It would be similar to pledge, except for the paths component - that would require kernel changes AFAICT.

The inherent fragility of seccomp()

juliank — Fri, 10 Nov 2017 21:07:08 +0000

This is why, when designing the seccomp filtering for APT's downloading code [1], I looked at a list of all syscalls and picked all similar ones. So if I pick open(), I also pick openat(), for example. In fact, I broadly categorized it into

(1) base set of permissions (normal file I/O, sysv IPC [if fakeroot is used])
(2) directory reading
(3) sockets

See https://anonscm.debian.org/cgit/apt/apt.git/tree/methods/... and later lines.

This will break eventually if a new syscall is introduced. I consider two ways to solve that:

(1) Keep a list of all syscalls that have been checked in the source code, and regularly (on CI) check if there are new ones. If new ones appear, they have to be compared to the existing set, and if similar enough, added to the list.
(2) make syscalls return ENOSYS instead of aborting the program. This should cause libc to fall back from new optimised syscalls to old syscalls, as it has to maintain a certain base level

Combining the two should yield a maintainable result.
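The CI check in (1) could look roughly like this. It is a sketch under assumptions: the job is assumed to feed it the text of a kernel header containing `#define __NR_<name> <number>` lines, and the function names are invented for this example.

```python
import re

# Matches lines such as "#define __NR_openat 257".
NR_DEFINE = re.compile(r"^\s*#\s*define\s+__NR_(\w+)\s+\d+", re.MULTILINE)


def syscalls_in_header(header_text: str) -> set:
    """Collect syscall names from __NR_ defines in the given header text."""
    return set(NR_DEFINE.findall(header_text))


def unreviewed(header_text: str, reviewed: set) -> list:
    """Names present in the header but missing from the reviewed list."""
    return sorted(syscalls_in_header(header_text) - set(reviewed))
```

A CI job would fail when `unreviewed(...)` is non-empty, prompting someone to sort each new syscall into an existing group or leave it denied. Some headers also define helper macros with the same prefix, so the result may need a small exclusion list.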

[1] https://juliank.wordpress.com/2017/10/23/apt-1-6-alpha-1-...

Another thing people don't consider is NSS modules and LD_PRELOAD. They could be doing all kinds of weird stuff when you call getaddrinfo(). For example, they could use SYSV IPC to talk to another process, like a DNS cache. Evil little bastards. We had the same problem with people running apt in fakeroot: fakeroot needs sysv ipc to talk to its metadata daemon thing, and these were not whitelisted. I hacked in support for that - if FAKED_MODE is set in the environment, it now adds ipc syscalls. Ugly.
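The FAKED_MODE workaround amounts to building the whitelist from the environment. The sketch below shows the shape of that logic, not APT's actual code: `BASE`, `SYSV_IPC` and `build_allowlist` are illustrative names, and the syscall lists are guesses at what such groups might contain.

```python
# Illustrative groups only; the real lists in APT are not reproduced here.
BASE = {"read", "write", "open", "openat", "close", "fstat", "exit_group"}
SYSV_IPC = {
    "msgget", "msgsnd", "msgrcv", "msgctl",
    "semget", "semop", "semctl",
    "shmget", "shmat", "shmdt", "shmctl",
}


def build_allowlist(environ: dict) -> set:
    """Start from the base set and widen it when fakeroot appears to be active."""
    allowed = set(BASE)
    if "FAKED_MODE" in environ:
        allowed |= SYSV_IPC
    return allowed
```

In a real program the argument would be `os.environ`. The ugliness is visible even here: the filter has to know about an unrelated tool's environment variable in order to keep working under it.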