PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 0:14 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
In reply to: PostgreSQL considers seccomp() filters by cyphar
Parent article: PostgreSQL considers seccomp() filters

seccomp+bpf is a bad idea in general. It's really better to use other mechanisms. We also have:
1) SELinux which is unusable and should be trashed.
2) AppArmor that is better but is not compatible with unprivileged users.
3) pledge() call that solves most of the issues but is absent on Linux.
...

Seriously, Linux security at this point is a huge Rube-Goldbergesque machine that probably no single person understands in its entirety. And it's getting worse, not better.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 0:32 UTC (Wed) by cyphar (subscriber, #110703) [Link] (9 responses)

I agree in spirit with your point, but the problem is (as it has been for almost the entire history of Linux) that Conway's Law applies to Linux just as much as it applies to any project. Iterative development within subsystems results in subsystem-specific security features which require users to rope everything together themselves -- you can see this with containers just as easily as you can see it with Linux's security story.

If we had our own hypothetical pledge(2) proposal you would need to hook all of the existing security primitives together, which would definitely make for a very lively LKML thread. But for the record, I agree that pledge(2) is a much better model for a security API that userspace sees (with the caveat that it probably should be slightly more granular if we ever get it on Linux but that's a fairly minor bike-shed to paint).

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 10:54 UTC (Wed) by brauner (subscriber, #109349) [Link] (8 responses)

As Kees mentioned I don't think you need a separate syscall for this. You can get this with seccomp() in userspace.
I agree with your earlier point though, that we need an easier way of generating seccomp profiles.
libseccomp does not provide an abstract enough interface to do this easily. It could grow support probably.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 14:26 UTC (Wed) by cyphar (subscriber, #110703) [Link] (7 responses)

> You can get this with seccomp() in userspace.

Maybe, but the main benefit of pledge(2) is that the kernel knows what syscalls each pledge refers to -- and any new syscalls will be automatically included. Coming up with a userspace wrapper for this will be (AFAICS) quite difficult for any method you pick (and all the complications boil down to trying to work around the fact that we aren't doing it in-kernel), such as:

* You just have hard-coded mappings of pledge(3) arguments to syscall sets. This means you're limited to what syscalls your userspace library understands. This results in new syscalls not being included (and if you expose lower-level seccomp primitives to handle unknown syscalls, you're back at manually-managed seccomp filters).

* You use some kind of symbol attribute which lists what syscalls a function calls, and then you're limited to the effectiveness (and overhead) of static analysis (which has to be run at least partially at run-time because shared libraries can be updated -- though we could probably cache the call graph analysis). This is more resilient to changes, but won't work for all programs.

* You expose some kind of metadata about available syscalls from the kernel, which includes some kind of grouping or tags for the syscalls (to allow for you to dynamically add all of the syscalls matching a tag to the whitelist). This is much more flexible, but now you're making pledge-grouping decisions in-kernel -- why not do the whole thing in-kernel?
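To make the first option concrete, here is a minimal sketch of a hard-coded userspace mapping from pledge-style promise names to syscall sets. Everything in it is an assumption for illustration: the table contents are not OpenBSD's real promise definitions, and promises_allow() is a hypothetical helper, not part of libseccomp or glibc. A real library would turn the table into a seccomp filter rather than answer lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

/* One promise name and the syscalls the library believes it covers. */
struct promise_set {
    const char *name;
    const long *syscalls;
    size_t count;
};

/* Illustrative sets only; a real library would need far longer lists. */
static const long stdio_calls[] = { SYS_read, SYS_write, SYS_close, SYS_exit_group };
static const long inet_calls[]  = { SYS_socket, SYS_connect, SYS_sendto, SYS_recvfrom };

static const struct promise_set promise_table[] = {
    { "stdio", stdio_calls, sizeof stdio_calls / sizeof stdio_calls[0] },
    { "inet",  inet_calls,  sizeof inet_calls / sizeof inet_calls[0] },
};

/* Returns true if `name` appears as a whole token in the space-separated list. */
static bool has_promise(const char *list, const char *name)
{
    size_t len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        size_t tok = strcspn(p, " ");
        if (tok == len && strncmp(p, name, len) == 0)
            return true;
        p += tok;
        p += strspn(p, " ");
    }
    return false;
}

/* Returns true if syscall `nr` belongs to any promise named in `promises`. */
bool promises_allow(const char *promises, long nr)
{
    assert(promises != NULL);

    for (size_t i = 0; i < sizeof promise_table / sizeof promise_table[0]; i++) {
        const struct promise_set *set = &promise_table[i];
        if (!has_promise(promises, set->name))
            continue;
        for (size_t j = 0; j < set->count; j++) {
            if (set->syscalls[j] == nr)
                return true;
        }
    }
    /* Anything the table was not compiled with is denied, including
     * syscalls added to the kernel after the library was built. */
    return false;
}
```

The last comment is the limitation in question: a syscall the table does not list can only be refused, whichever promise it would logically belong to.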

And an overarching problem is that (for unknown-to-userspace syscalls), the best you can really do is block the syscall outright. But maybe some pledge(2)s should only block certain flags (an obvious example would be a hypothetical socket(2)-like syscall -- how would you implement pledge(2) for "only allow unix sockets" in user-space without having code that knows about socket(2)?). The last proposal might help solve this if you exposed "enough" metadata, but it feels wrong to me to try to expose a bunch of metadata in the hopes that userspace will be able to make sense of it.
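As a sketch of what "code that knows about socket(2)" looks like with today's seccomp, the filter below lets socket(2) through only for AF_UNIX and fails every other domain with EAFNOSUPPORT. The names unix_only_filter and install_unix_only_filter are made up for this example. It also makes two simplifying assumptions that a production filter should not: it skips the seccomp_data.arch check, and it reads only the low 32 bits of the first argument, at an offset that is correct on little-endian machines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>

/* Classic-BPF program evaluated against struct seccomp_data on every syscall. */
static const struct sock_filter unix_only_filter[] = {
    /* [0] A = syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] not socket(2): jump three instructions ahead to the ALLOW at [5] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_socket, 0, 3),
    /* [2] A = low 32 bits of args[0], the socket domain (little-endian assumption) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
    /* [3] AF_UNIX: skip the error return and land on ALLOW at [5] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 1, 0),
    /* [4] any other domain: fail the call with EAFNOSUPPORT */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EAFNOSUPPORT & SECCOMP_RET_DATA)),
    /* [5] allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

#define UNIX_ONLY_LEN (sizeof unix_only_filter / sizeof unix_only_filter[0])

static_assert(UNIX_ONLY_LEN <= 0xffff, "sock_fprog.len is an unsigned short");

/* Installs the filter on the calling thread. This cannot be undone.
 * Returns 0 on success, -1 with errno set on failure. */
int install_unix_only_filter(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)UNIX_ONLY_LEN,
        .filter = (struct sock_filter *)unix_only_filter,
    };

    /* Required for unprivileged callers before installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Every constant here (the syscall number, which argument holds the domain, the AF_UNIX value) is knowledge about socket(2) baked into userspace. For a hypothetical socket-like syscall added later, a default-deny version of this filter could only block the call outright.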

But then again, I might be missing something obvious. If we can solve this problem in a sane way, I'm all for it.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 14:58 UTC (Wed) by brauner (subscriber, #109349) [Link] (6 responses)

> * You expose some kind of metadata about available syscalls from the kernel

You mean btf...

(I'm only partially trolling btw.)

We don't actually need pledge(2). seccomp(2) could be extended to do this. There's even precedent: SECCOMP_SET_MODE_STRICT restricts you to a very limited set of syscalls. We could extend it with seccomp(SECCOMP_PLEDGE, 0, "stdio,sendfd,recvfd") and then seccomp would just create a bpf filter, or something more elaborate for future extensibility :):

struct seccomp_pledge pledge;
seccomp(SECCOMP_PLEDGE, 0, &pledge);
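For comparison, the existing precedent can be exercised today. The sketch below forks a child that enters strict mode with the real SECCOMP_SET_MODE_STRICT operation and then either exits cleanly or makes a syscall outside the fixed set. The helper name run_under_strict_mode and its result codes are invented for this example. It assumes the process is not already in seccomp filter mode (for instance inside a container with a seccomp profile), in which case the kernel refuses the switch and the helper reports a setup failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum strict_result {
    STRICT_SETUP_FAILED = -1, /* fork, seccomp or wait failed */
    STRICT_EXITED_OK = 0,     /* child stayed inside the allowed set */
    STRICT_KILLED = 1,        /* child was killed for a forbidden syscall */
};

/* Runs a child under strict mode; if try_forbidden is nonzero the child
 * calls getpid(2), which strict mode does not allow. */
int run_under_strict_mode(int try_forbidden)
{
    pid_t pid = fork();
    if (pid < 0)
        return STRICT_SETUP_FAILED;

    if (pid == 0) {
        /* From here on only read, write, exit and sigreturn are permitted. */
        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL) != 0)
            syscall(SYS_exit, 2);
        if (try_forbidden)
            syscall(SYS_getpid); /* outside the strict set: SIGKILL */
        /* Use exit(2) directly: glibc's _exit() calls exit_group(2),
         * which strict mode would also kill. */
        syscall(SYS_exit, 0);
    }

    assert(pid > 0); /* only the parent reaches this point */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return STRICT_SETUP_FAILED;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return STRICT_KILLED;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return STRICT_EXITED_OK;
    return STRICT_SETUP_FAILED;
}
```

The allowed set in strict mode is fixed by the kernel, with no list supplied by userspace, which is the property a SECCOMP_PLEDGE operation would generalize.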

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 15:08 UTC (Wed) by cyphar (subscriber, #110703) [Link] (5 responses)

> You mean btf...

I was thinking of BTF while writing it, though I don't know if BTF currently gives us the details we want -- we don't just want lists of functions and structure layouts. We need to have a way for the kernel to tell userspace "this syscall is part of the net/tcp pledge-set" or something similar (and probably a way to indicate "if this flag is set then the syscall is (also?) part of the foobar pledge-set").

> We could extend seccomp(SECCOMP_PLEDGE, 0, "stdio,sendfd,recvfd") ...

"What's in a name? That which we call [pledge(2)]
By any other name would [provide the same functionality];"

#define pledge(list) seccomp(SECCOMP_PLEDGE, 0, list)

But yes, in that case we are in agreement -- let's do it in-kernel (but taking care to be incompatible with OpenBSD, so that we can pretend we came up with the idea :P).

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 20:37 UTC (Wed) by wahern (subscriber, #37304) [Link] (4 responses)

Part of what makes pledge so convenient are the edge cases. For example, say you don't want a network daemon (or an unprivileged subprocess of a network daemon) to have any access to the file system. But what about /etc/resolv.conf? One of the niceties of pledge is that the "dns" privilege permits opening /etc/resolv.conf even if open(2) is otherwise disallowed. seccomp doesn't permit path name filtering at all.

OpenBSD added the sendsyslog syscall so that processes could be denied socket access but still be able to use the syslog facility. I don't think this could be emulated with seccomp, either, as filtering on the sun.sun_path argument to bind(2) has the same problems as filtering on the path argument to open(2). You could require processes to open the socket before dropping privileges, but what happens if the syslogd daemon restarts?

There are several little pragmatic tweaks like this that make pledge functional. A fundamental hurdle on Linux is that the project is so large and diverse that there's enormous pressure to prevent leaky abstractions that require far flung tweaks across the system. Such tweaks are especially brittle in the Linux development model, and people are wary of unintended consequences. That's completely understandable, but sometimes such tweaks are simply unavoidable if your goal is maximizing userland convenience and security. Irreducible complexity has to be apportioned among userland and various kernel subsystems somehow; OpenBSD tends to apportion it quite differently than Linux, partly because of the different development models.

The irony is that the path of least resistance for Linux has been containers--namespaces, cgroups, etc--which has become precisely the slippery slope of complexity and code churn people feared. (Which is why OpenBSD rejected FreeBSD jails.) Not that containers weren't worth it for their own sake, I just find the path dependency and contradictions interesting.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 23:41 UTC (Wed) by roc (subscriber, #30627) [Link]

That convenience comes at the price of baking details of specific userspace behavior --- in your example, DNS resolution and /etc/resolv.conf --- into the kernel, which is currently independent of those details. That's acceptable for OpenBSD which largely controls their userspace and is used in far less diverse ways than Linux, but it is a problem for Linux.

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 6:14 UTC (Thu) by cyphar (subscriber, #110703) [Link] (2 responses)

> One of the niceties of pledge is that the "dns" privilege permits opening /etc/resolv.conf even if open(2) is otherwise disallowed. seccomp doesn't permit path name filtering at all.

In my view, this is actually a good thing -- pathname filtering based on the string value of the path is destined to be a bad idea (I explain this further in [1]). I reckon that the right combination of bind-mounts and AppArmor/SELinux would be a far more effective method for doing this without all of the foot-guns.

> There are several little pragmatic tweaks like this that make pledge functional.

I agree that these sorts of niceties are very useful for making pledge(2) much easier to use for userspace, but I'm not convinced that we need to do all of them in-kernel. There is no reason why we can't also have a libpledge which can help deal with some of the more peculiar userspace bits (that are separate from the core "these syscalls on this kernel form this pledge-group" feature we need to be in-kernel).

[1]: https://lwn.net/Articles/801187/

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 8:51 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> I reckon that the right combination of bind-mounts and AppArmor/SELinux would be a far more effective method for doing this without all of the foot-guns.
SELinux is never a solution...

AppArmor has several problems, though. In particular, it can't be effectively used in unprivileged contexts. For example, you can't run a program that you just compiled with a custom policy.

It also was not possible to use AppArmor from inside containers (has this changed?).

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 10:48 UTC (Thu) by jem (subscriber, #24231) [Link]

Hard coding /etc/resolv.conf is a horrible example of putting policy in the kernel. Name resolution is a library level thing, and even there the resolv.conf file is an implementation detail, not specified by any standards. getaddrinfo() is the POSIX standard name resolution interface, and it can be, and has been, implemented without the need for /etc/resolv.conf.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 3:06 UTC (Wed) by kees (subscriber, #27264) [Link] (5 responses)

I'd love to see glibc support pledge(). It knows which of its own APIs map to which syscalls, etc. I don't think there is anything missing from the seccomp interface that a glibc pledge() implementation would need. (And if there was, I'd be happy to help get it implemented in seccomp.)

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 5:50 UTC (Wed) by mjg59 (subscriber, #23239) [Link] (4 responses)

Filtering based on path arguments?

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 10:51 UTC (Wed) by brauner (subscriber, #109349) [Link]

Hm, we discussed this at KSummit and I'm not sure seccomp is the right tool for this.
Not without introducing all kinds of races or moving that part of seccomp into its own LSM which has its own problems.
In general path-based filtering seems LSM territory.
However, we intend to bring aspects of deep argument inspection to seccomp eventually.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 18:27 UTC (Wed) by kees (subscriber, #27264) [Link]

Let's skip that bit for now; we can do the rest, though, yes?

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 23:44 UTC (Wed) by roc (subscriber, #30627) [Link]

You probably know this but I haven't seen it mentioned: Firefox and Chrome sandboxes implement path filtering by having seccomp filters trigger SIGSYS on path-related syscalls, and having the signal handler fake the syscall using IPC to a trusted broker process.

It's not great for performance, but a pledge-like sandboxing library/API can take this approach.

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 6:09 UTC (Thu) by cyphar (subscriber, #110703) [Link]

I'll admit that I'm not convinced that pathname-as-a-string filtering is a good idea (though I'd love to know which use cases it would provide real protection in). Unless you restrict the set of syscalls the process can use to be incredibly small (i.e. only allowing open("/foo", O_RDWR) and no other filesystem-related syscalls), there are all sorts of ways you can get around pathname restrictions (symlink races, bind-mounts in mount namespaces, magic-links, and so on).
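A small illustration of why the string is the wrong thing to match on: the two hypothetical checks below (the function names are invented here) compare a requested path against a denied one, first as raw strings and then after canonicalizing both with realpath(3). This is only a sketch of the spelling problem. It does not demonstrate symlink races, bind-mounts or magic-links, and the realpath(3) variant is itself racy, since the filesystem can change between the check and the later open.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Naive policy: deny only if the caller spelled the path exactly as listed. */
bool naive_path_denied(const char *requested, const char *denied)
{
    assert(requested != NULL && denied != NULL);
    return strcmp(requested, denied) == 0;
}

/* Canonicalizing policy: resolve "..", "." and symlinks on both sides
 * before comparing. Fails closed if either path cannot be resolved.
 * Still subject to time-of-check/time-of-use races. */
bool resolved_path_denied(const char *requested, const char *denied)
{
    char real_requested[PATH_MAX];
    char real_denied[PATH_MAX];

    assert(requested != NULL && denied != NULL);
    if (realpath(requested, real_requested) == NULL ||
        realpath(denied, real_denied) == NULL)
        return true;
    return strcmp(real_requested, real_denied) == 0;
}
```

With "/" as the denied path, a request for "/." passes the naive check but is caught by the canonicalizing one, and a symlink pointing at the denied file behaves the same way.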

If the purpose is to stop malicious programs from doing something bad -- I think it would be more productive to just use mount namespaces and isolate away the rest of the filesystem entirely. Maybe you could make use of path-based filtering in combination with a read-only mount namespace, but I'm still not completely convinced.

If the purpose is to stop a trusted program from being tricked into operating on the wrong kinds of paths by an attacker, I think openat2 and the stuff I'm working on with libpathrs[1] (which takes advantage of existing tricks involving O_PATH and procfs) would be a better solution.

[1]: https://github.com/openSUSE/libpathrs