PostgreSQL considers seccomp() filters

By Jake Edge
October 1, 2019

A discussion on the pgsql-hackers mailing list at the end of August is another reminder that the suitability of seccomp() filters is likely more narrow than was hoped. Applying filters to the PostgreSQL database is difficult for a number of reasons and the benefit for the project and its users is not entirely clear. The discussion highlights the tradeoffs inherent in adding system-call filtering to a complex software suite; it may help crystallize the thinking of other projects that are also looking at supporting seccomp() filters.

Joe Conway raised the idea in an RFC patch posting. It added a way to filter system calls in the main postmaster process and, with a separate system-call list, in the per-session backends. It also showed how to generate the list of system calls that are being used by PostgreSQL under various workloads, such as the test targets in the Makefile or by running a specific application. Information on the system calls made is logged by the audit subsystem; those logs are then processed to produce the list. Once there is confidence that the list is complete—which may be a sticking point—the remaining system calls could be blocked so that executing them would cause an error.

But Peter Eisentraut was concerned that the list is going to be incomplete due to the "fantastic test coverage" needed to generate it and that it will require constant maintenance to keep up with GNU C Library (glibc) and other changes. Beyond that, PostgreSQL extensions will need their own lists of allowed system calls. Conway seems to see the support as something that those interested will maintain for themselves, rather than having a list that the project will distribute. "Perhaps most people never use this, but when needed (and increasingly will be required) it is available."

Tom Lane suggested that it made more sense to use some kind of static analysis to determine the system calls that PostgreSQL legitimately makes, rather than simply testing to produce the list. But he also doesn't quite see what threat model the feature is protecting against. Since it is PostgreSQL itself that is maintaining and configuring the system-call filter list, a compromise that allowed privileged code execution in PostgreSQL could just disable the filtering and restart PostgreSQL, making the filters moot:

Given that we'll allow any syscall that an unmodified PG executable might use, it seems like the only scenarios being protected against involve someone having already compromised the server enough to have arbitrary code execution. OK, fine, but then why wouldn't the attacker just bypass libseccomp? Or tell it to let through the syscall he wants to use? Having the list of allowed syscalls be determined inside the process seems like fundamentally the wrong implementation.

Joshua Brindle thought that at least blacklisting some high-risk system calls would help bolster the security of a system running PostgreSQL. Systemd has some predefined lists that might be used as a starting point. He is also concerned that since the feature is just one component of a full solution, looking at it in isolation is not the right approach:

The goal is to prevent an ACE [arbitrary code execution] hole in Postgres from becoming a complete system compromise. This may not do it alone, and security conscious integrators will want to, for example, add seccomp filters to systemd to prevent superuser from disabling them. The postmaster and per-role lists can further reduce the available syscalls based on the exact extensions and PLs being used. Each step reduced the surface more and throwing it all out because one step can go rogue is unsatisfying.

The fragility of seccomp() filters is also part of what concerned Andres Freund. He noted that there have already been PostgreSQL bug reports about seccomp() because of how it is used by some container-management systems. The system-call landscape is constantly shifting as well, he said, pointing to an LWN article about one seccomp()-related problem:

There's regularly new syscalls (e.g. epoll_create1(), and we'll soon get openat2()), different versions of glibc use different syscalls (e.g. switching from open() to always using openat()), the system configuration influences which syscalls are being used (e.g. using vsyscalls only being used for certain clock sources), and kernel bugfixes change the exact set of syscalls being used.

Lane wondered why SELinux was not being used; Brindle made it clear that while SELinux cannot be used to do system-call filtering, it is part of the overall PostgreSQL hardening effort. As an example of the kind of system call that could be blacklisted for PostgreSQL using seccomp(), Brindle pointed to madvise(), which he said is not preventable by SELinux, not used by PostgreSQL, and "a clear win in the dont-let-PG-be-a-vector-for-kernel-compromise arena".

But Freund cautioned that madvise() is used by glibc as part of its malloc() implementation. "That's *precisely* my problem with this approach." As Lane pointed out, calls that are buried deeply in dependencies of PostgreSQL are not going to be found easily via testing:

So this makes a perfect example for [Peter Eisentraut's] point that testing is going to be a very fallible way of finding the set of syscalls that need to be allowed. Even if we had 100.00% code coverage of PG proper, we would not necessarily find calls like this.

Yet another instance of seccomp() fragility was raised by Thomas Munro. The sync_file_range() system call on PowerPC and Arm has a sync_file_range2() variant with better argument ordering; glibc helpfully remaps calls to that variant on the relevant architectures. But Docker and other container managers did not include sync_file_range2() in their whitelists, leading to unexpected errors.
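One defensive habit that follows from this report is to list every architecture-specific variant of a call when building an allow-list. The fragment below is an illustrative sketch only: the table and the name `sync_syscall_allowed` are hypothetical and do not come from Docker or PostgreSQL. The preprocessor guards let the same source pick up whichever variants the build architecture defines.

```c
#include <stddef.h>
#include <sys/syscall.h>

/* File-sync syscalls an allow-list would need, including the variant
 * that exists only on some architectures. */
static const long file_sync_syscalls[] = {
    SYS_fsync,
    SYS_fdatasync,
#ifdef SYS_sync_file_range
    SYS_sync_file_range,
#endif
#ifdef SYS_sync_file_range2
    SYS_sync_file_range2,   /* the variant glibc remaps to on some architectures */
#endif
};

/* Returns 1 if `nr` is one of the file-sync syscalls above, else 0. */
int sync_syscall_allowed(long nr)
{
    size_t count = sizeof file_sync_syscalls / sizeof file_sync_syscalls[0];

    for (size_t i = 0; i < count; i++) {
        if (file_sync_syscalls[i] == nr)
            return 1;
    }
    return 0;
}
```

This only helps with variants the list's author knows about, so it narrows the problem Munro described rather than removing it.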

While generally acknowledging the problems mentioned, Brindle and Conway still think it makes sense to provide the hooks for the feature for those who want it. Brindle said:

The feature allows end users to define some sandboxing within PG. Nothing is being forced on anyone but we would like the capability to harden a PG installation for many reasons already stated. This is being done in places all across the Linux ecosystem and is IMO a very useful mitigation.

But even for an optional feature, there is still a cost to PostgreSQL, Eisentraut said:

Features come with a maintenance cost. If we ship it, then people are going to try it out. Then weird things will happen. They will report mysterious bugs. They will complain to their colleagues. It all comes with a cost.

Conway noted that PostgreSQL is already being run under seccomp() filters, however. He and Brindle think that it would be better for the project to proactively implement support:

It is our assessment that PostgreSQL will be subject to seccomp willingly or not (e.g., via docker, systemd, etc.) and the community might be better served to get out in front and have first class support.

Conway wondered if just adding the hooks to load the filters would be a path forward, but Lane was not in favor of putting the filter controls inside the PostgreSQL process. That "seems like a fundamentally incorrect architecture", he said. In order to be a "credible security improvement", the filters need to be imposed on the PostgreSQL processes from the outside. As might be guessed, Brindle and Conway disagreed with that characterization. Their company (Crunchy Data) has customers that need the feature, so they will continue to pursue it, Conway said. For the PostgreSQL mainline, though, it would seem that the feature is not really welcome—at least in its present form.

While the appeal of filtering at the system-call level is strong, it is not entirely clear that it is the best way forward for everything. Processes like a browser's rendering engine, which was an initial seccomp() filtering target, are well suited to the approach. By their very nature, database engines—effectively general-purpose programming languages—do not really fit that mold. Adding system-call filters to something like Python (or Perl, Ruby, PHP, ...) is similarly problematic. Alternatives, at least for Linux, are not readily available, however, which may be causing people to try to fit round pegs in square holes.


PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 17:51 UTC (Tue) by rweikusat2 (subscriber, #117920) [Link] (12 responses)

How can anyone "need" a mechanism to disable something unused? At best, that's a no-op.

PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 18:39 UTC (Tue) by kfox1111 (subscriber, #51633) [Link] (10 responses)

It's not:
How can anyone "need" a mechanism to disable something unused?
But:
How can anyone "need" a mechanism to block something unused from getting used?

The desire is to not only disable something unused (passive) but prevent it from ever being used, as it never should be used in the first place (active).

What this looks like: "postgres will never make syscall X". An active block rule is added to prevent postgres from ever successfully making syscall X. This should not affect a normal postgres. An attacker manages to break into postgres and execute their own code. If their own code tries to make syscall X, it now fails where it would normally succeed, preventing a bigger security issue.

Those needing to harden their systems need that feature.

PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 19:14 UTC (Tue) by rweikusat2 (subscriber, #117920) [Link] (9 responses)

That was the point I was trying to make: Nobody needs this feature because it doesn't do anything productive.

Some people believe this would limit the amount of damage a prospective attacker could do after gaining the ability to inject arbitrary code for execution into a running Postgres process, IOW, they want to use system call filtering because they think it would increase the safety of their operations.

PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 19:18 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

> That was the point I was trying to make: Nobody needs this feature because it doesn't do anything productive.
:facepalm:

Programs have bugs. Programs in C have A LOT of bugs that easily result in code injection. Mitigating the damage from them is a no-brainer.

PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 20:07 UTC (Tue) by rweikusat2 (subscriber, #117920) [Link] (7 responses)

The kernel is a program. Consequently, it has bugs and considering that it's a C program, it should have lots of bugs. In absence of any further information, all non-trivial system calls could be equally exploitable, hence, there's no reason to assume that any particular subset of the available system calls is 'safer' than any other subset. But that's beside the point, which was that "we need feature X" and "we are strongly convinced that feature X would be somewhat helpful[*] in hypothetical situation Y" are two very different things.

[*] Executing arbitrary code in the context of a database server process is a pretty devastating "security breach" in its own right. If this happens, the organisation on whose behalf the database server was running is going to get some serious and possibly even very public problems (eg, as in "all our customer information just got published on the internet").

PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 21:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> The kernel is a program. Consequently, it has bugs and considering that it's a C program, it should have lots of bugs.
Indeed. And that's the main motivator for seccomp filtering, to make sure that as little kernel is exposed to a potential attacker as possible.

> hence, there's no reason to assume that any particular subset of the available system calls is 'safer' than any other subset.
Not quite. Objectively some system calls are exercised much less than others. Additionally, some system calls make no sense at all for Postgres (e.g. vm86) but present a clear threat because they exercise rarely used codepaths and hardware paths.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 18:35 UTC (Wed) by nivedita76 (subscriber, #121790) [Link] (4 responses)

Yes, this seems kind of superfluous for something like PostgreSQL. That is most likely going to be THE thing that the server is doing anyway. Compromising it means it's game over already. It could be useful for small stuff like, e.g., ntpd, to prevent a bug in ntpd being escalated into a compromise of the database process, but the other way is just brain-dead.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 21:19 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

In theory, this would make more sense. But in practice, it's just built on the same, shaky foundation: A single, exploitable bug is sufficient for such a compromise. And for as long as there's no guaranteed way to determine that the code implementing a certain system call has no exploitable bugs and won't ever have exploitable bugs (system call implementations change), there's no way to determine if restricting the set of allowed system calls to a certain subset of the set of available system calls will actually reduce the number of exploitable errors a prospective attacker could try to utilize, let alone reduce it to zero.

There's some outright paradoxical reasoning in here: seccomp is supposed to defend against the issue that it's conjectured to be impossible to determine if the implementation of a given system call is free of exploitable errors but in order to use seccomp sensibly, ie not in a "fire a shotgun in the dark and hope that some of the projectiles hit something" mode of operation, this very information would need to be known.

The best one could sensibly use this for is to block access to system calls known to be exploitable until a fix becomes available. Or for its original purpose: Run unknown code in a sandbox which is supposed to be prohibited from doing certain things, eg, most file system manipulations.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 22:21 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> there's no way to determine if restricting the set of allowed system calls to a certain subset of the set of available system calls will actually reduce the number of exploitable errors a prospective attacker could try to utilize, let alone reduce it to zero.
This is just nonsense. Reducing the amount of code exposed to an attacker reduces the chances that an exploitable bug will be accessible to them.

> There's some outright paradoxical reasoning in here: seccomp is supposed to defend against the issue that it's conjectured to be impossible to determine if the implementation of a given system call is free of exploitable errors
You clearly live in a fantasy world. The seccomp sandboxing is designed to prevent access to as much of the attack surface as possible. This automatically makes sure that the probability of an exploit goes down.

Note the word "probability". This is not prevention, it's mitigation.

History shows that this approach actually works in practice. Even the misguided SELinux has prevented multiple exploitable bugs.

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 17:04 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

I seem to live in a fantasy world called 'reality',

https://en.wikipedia.org/wiki/Seccomp

PostgreSQL considers seccomp() filters

Posted Oct 4, 2019 8:43 UTC (Fri) by cyphar (subscriber, #110703) [Link]

The default seccomp rules that Docker/LXC/cri-o/etc specify have blocked more than 95% of kernel 0day exploits in the past 6 years or so[1], purely by blocking esoteric syscalls and strange flags. There is clear and undeniable evidence that even a very generic seccomp profile does help protect systems running untrusted workloads against kernel bugs.

(As an aside, note that Docker doesn't use user namespaces by default; LXC has been protected against even more exploits. But that's a very different topic.)

[1]: https://docs.docker.com/engine/security/non-events/

PostgreSQL considers seccomp() filters

Posted Oct 25, 2019 5:42 UTC (Fri) by ssmith32 (subscriber, #72404) [Link]

> In absence of any further information, all non-trivial system calls could be equally exploitable, ..

Perhaps, but perhaps not.

Either way, though, the whole point of having people familiar with a particular code base make a list of system calls that, to the best of their knowledge, are not needed and are known to give their caller unnecessary privileges is that it is a way of adding "further information".

And, your assumption now being incorrect, the reasoning based on it should be reworked.

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 22:35 UTC (Thu) by flussence (guest, #85566) [Link]

And nobody “needs” side impact airbags in cars, because cars do not drive sideways?

PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 18:42 UTC (Tue) by kfox1111 (subscriber, #51633) [Link] (7 responses)

I think containers may be part of the solution.

The concern is that postgres can be used with arbitrary dependencies, and getting a perfect list for all possible combinations of dependencies with postgres would be next to impossible. That I might agree with.

But, building a container, so that all its dependencies are fixed in stone, then running the test suite on it recording the syscalls would be much more reliable. The syscall list then would be static along with the static container.

PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 19:56 UTC (Tue) by dezgeg (subscriber, #92243) [Link] (2 responses)

Never applying any updates sounds quite counterproductive from a security point of view (which was the whole reason for the syscall filtering)...

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 0:31 UTC (Wed) by kfox1111 (subscriber, #51633) [Link] (1 responses)

You're assuming updates need to be applied from within, rather than from without.

You don't upgrade the contents of a container. You launch an upgraded container.

PostgreSQL considers seccomp() filters

Posted Oct 25, 2019 5:45 UTC (Fri) by ssmith32 (subscriber, #72404) [Link]

This is generally a good idea, but I find that the fixed-state model of containers, which works well for well-designed, useful, but simple (in a good way) services, inflicts a lot of pain when you apply it to a service whose main point is to manipulate complex state in complex ways.

PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 21:29 UTC (Tue) by nix (subscriber, #2304) [Link]

> running the test suite on it recording the syscalls would be much more reliable.

As was noted, the testsuite doesn't have anything like 100% coverage, particularly of error paths (not even SQLite's manages that, and it goes to incredible lengths to get closer than anyone else I've ever heard of) -- and even if it did, changes in the syscalls used by dependent libraries would break things anyway (this is not academic and has happened multiple times. Heck, it's happened multiple times to me alone, so I'm sure it's downright common for this to go wrong.)

Worse yet, PostgreSQL can execute arbitrary syscalls because it can invoke pluggable language interpreters and much else. I see no sane way to sandbox this without a major rearchitecture to move components seen as vulnerable into sandboxable subprocesses that do nothing else -- and even then you'd have the problem that any decent database server with server-side languages is supposed to execute more or less arbitrary code on behalf of users and is useless if it cannot. Diagnosing which arbitrary operations are suspicious and which are not seems a very difficult problem, and one almost certainly unimplementable under the constraints of seccomp sandboxes (which can look at args but cannot dereference pointers if an arg is a pointer, etc).
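The limit on argument inspection mentioned above can be seen in a short sketch. A classic-BPF seccomp filter can compare an integer argument, such as the domain passed to socket(), but it has no way to follow a pointer argument such as the path given to open(). The code below is an illustration under stated assumptions, not anything from PostgreSQL: the function names are invented, it assumes an architecture where socket() is its own system call, it compares only the low 32 bits of the argument, and it skips the architecture check.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Offset of the low 32 bits of syscall argument 0 in struct seccomp_data. */
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define ARG0_LO (offsetof(struct seccomp_data, args[0]) + 4)
#else
#define ARG0_LO (offsetof(struct seccomp_data, args[0]))
#endif

/* socket() succeeds only for AF_UNIX; other domains fail with EACCES.
 * All other syscalls are allowed. */
int allow_only_unix_sockets(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* Not socket(): jump to the final ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_socket, 0, 3),
        /* socket(): inspect the integer domain argument. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_LO),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EACCES),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof insns / sizeof insns[0],
        .filter = insns,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 0 if, in a filtered child, AF_INET was refused with EACCES and
 * AF_UNIX still worked; 1 if not; -1 on setup failure. */
int check_unix_only_filter(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (allow_only_unix_sockets() != 0)
            _exit(255);
        int inet_fd = socket(AF_INET, SOCK_STREAM, 0);
        int inet_errno = errno;
        int unix_fd = socket(AF_UNIX, SOCK_STREAM, 0);
        _exit(inet_fd == -1 && inet_errno == EACCES && unix_fd >= 0 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
}
```

Nothing comparable can be written for "only allow opening this path", because the filter sees the address of the string and not its contents.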

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 0:37 UTC (Wed) by KaiRo (subscriber, #1987) [Link] (2 responses)

Except that PostgreSQL is not very suitable for containerization, as there is no good upgrade path between major versions when you just switch containers: pg_upgrade needs both the old and the new binaries installed in parallel on the same system, with access to both (which is almost impossible with containers). I hope this will be solved one day, as it will make pg containers a lot more attractive.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 0:39 UTC (Wed) by cyphar (subscriber, #110703) [Link]

PostgreSQL is not really well-mapped to Docker-style application containers, but I run PostgreSQL inside an LXC/LXD container quite well (after all, it tastes almost exactly like a VM).

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 8:45 UTC (Wed) by knan (subscriber, #3940) [Link]

The most promising route there is probably using replication from the still running old major version to the new in a new set of containers. Inconvenient but probably workable.

PostgreSQL considers seccomp() filters

Posted Oct 1, 2019 23:43 UTC (Tue) by cyphar (subscriber, #110703) [Link] (17 responses)

One important thing to note is that even if you accidentally block a new syscall and your users update, glibc deals with -ENOSYS **exceptionally** well. So the only thing you really need to worry about is what syscalls PostgreSQL itself uses directly.

It is a shame that seccomp has become such a complicated beast to use for upstream projects, and it feels as though there needs to be more work put into making seccomp profile generation more usable (as a very hand-wavey example, glibc could include information about what syscalls are called by each libc function -- and then consumers could generate seccomp filters based on those lists -- updating glibc would update the lists of syscalls potentially used).

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 0:14 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (16 responses)

seccomp+bpf is a bad idea in general. It's really better to use other mechanisms. We also have:
1) SELinux which is unusable and should be trashed.
2) AppArmor that is better but is not compatible with unprivileged users.
3) pledge() call that solves most of the issues but is absent on Linux.
...

Seriously, Linux security at this point is a huge Rube-Goldbergesque machine that probably no single person understands in its entirety. And it's getting worse, not better.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 0:32 UTC (Wed) by cyphar (subscriber, #110703) [Link] (9 responses)

I agree in spirit with your point, but the problem is (as it has been for almost the entire history of Linux) that Conway's Law applies to Linux just as much as it applies to any project. Iterative development within subsystems results in subsystem-specific security features which require users to rope everything together themselves -- you can see this with containers just as easily as you can see it with Linux's security story.

If we had our own hypothetical pledge(2) proposal you would need to hook together all of the existing security primitives, which would definitely make for a very lively LKML thread. But for the record, I agree that pledge(2) is a much better model for a security API that userspace sees (with the caveat that it probably should be slightly more granular if we ever get it on Linux but that's a fairly minor bike-shed to paint).

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 10:54 UTC (Wed) by brauner (subscriber, #109349) [Link] (8 responses)

As Kees mentioned I don't think you need a separate syscall for this. You can get this with seccomp() in userspace.
I agree with your earlier point though, that we need an easier way of generating seccomp profiles.
libseccomp does not provide an abstract enough interface to do this easily. It could grow support probably.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 14:26 UTC (Wed) by cyphar (subscriber, #110703) [Link] (7 responses)

> You can get this with seccomp() in userspace.

Maybe, but the main benefit of pledge(2) is that the kernel knows what syscalls each pledge refers to -- and any new syscalls will be automatically included. Coming up with a userspace wrapper for this will be (AFAICS) quite difficult for any method you pick (and all the complications boil down to trying to work around the fact that we aren't doing it in-kernel), such as:

* You just have hard-coded mappings of pledge(3) arguments to syscall sets. This means you're limited to what syscalls your userspace library understands. This results in new syscalls not being included (and if you expose lower-level seccomp primitives to handle unknown syscalls, you're back at manually-managed seccomp filters).

* You use some kind of symbol attribute which lists what syscalls a function calls, and then you're limited to the effectiveness (and overhead) of static analysis (which has to be run at least partially at run-time because shared libraries can be updated -- though we could probably cache the call graph analysis). This is more resilient to changes, but won't work for all programs.

* You expose some kind of metadata about available syscalls from the kernel, which includes some kind of grouping or tags for the syscalls (to allow for you to dynamically add all of the syscalls matching a tag to the whitelist). This is much more flexible, but now you're making pledge-grouping decisions in-kernel -- why not do the whole thing in-kernel?

And an overarching problem is that (for unknown-to-userspace syscalls), the best you can really do is block the syscall outright. But maybe some pledge(2)s should only block certain flags (an obvious example would be a hypothetical socket(2)-like syscall -- how would you implement pledge(2) for "only allow unix sockets" in user-space without having code that knows about socket(2)?). The last proposal might help solve this if you exposed "enough" metadata, but it feels wrong to me to try to expose a bunch of metadata in the hopes that userspace will be able to make sense of it.

But then again, I might be missing something obvious. If we can solve this problem in a sane way, I'm all for it.
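The first option in the list above, a hard-coded mapping in userspace, might look like the following sketch. Everything in it is hypothetical: the promise names borrow OpenBSD's vocabulary, but the syscall groupings are invented for illustration and do not match OpenBSD's pledge(2) or any existing Linux library. It also makes the stated weakness concrete, since a syscall that is missing from the tables is simply not allowed.

```c
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

/* One promise name and the syscalls it (hypothetically) permits. */
struct promise {
    const char *name;
    const long *syscalls;
    size_t count;
};

static const long stdio_calls[] = { SYS_read, SYS_write, SYS_close, SYS_exit_group };
static const long inet_calls[]  = { SYS_socket, SYS_connect, SYS_sendto, SYS_recvfrom };

static const struct promise promises[] = {
    { "stdio", stdio_calls, sizeof stdio_calls / sizeof stdio_calls[0] },
    { "inet",  inet_calls,  sizeof inet_calls / sizeof inet_calls[0] },
};

/* Returns 1 if syscall `nr` belongs to any promise named in the
 * comma-separated `list` (for example "stdio,inet"), else 0. */
int promises_allow(const char *list, long nr)
{
    const char *p = list;

    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the current name */

        for (size_t i = 0; i < sizeof promises / sizeof promises[0]; i++) {
            if (strlen(promises[i].name) != len ||
                strncmp(p, promises[i].name, len) != 0)
                continue;
            for (size_t j = 0; j < promises[i].count; j++) {
                if (promises[i].syscalls[j] == nr)
                    return 1;
            }
        }

        p += len;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

A real wrapper would turn the selected sets into a seccomp filter rather than answer queries, but the maintenance burden sits in the tables either way.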

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 14:58 UTC (Wed) by brauner (subscriber, #109349) [Link] (6 responses)

> * You expose some kind of metadata about available syscalls from the kernel

You mean btf...

(I'm only partially trolling btw.)

We don't actually need pledge(2). seccomp(2) could be extended to do this. There's even precedent: SECCOMP_SET_MODE_STRICT restricts you to a very limited set of syscalls. We could add seccomp(SECCOMP_PLEDGE, 0, "stdio,sendfd,recvfd") and then seccomp would just create a bpf filter, or something more elaborate for future extensibility :):

struct seccomp_pledge pledge;
seccomp(SECCOMP_PLEDGE, 0, &pledge);
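SECCOMP_PLEDGE above is a proposal and does not exist, but the strict-mode precedent it builds on can be tried today. The sketch below uses the older prctl() entry point to the same mode; the function name is made up. In strict mode only read(), write(), _exit() (the plain exit system call) and sigreturn() are permitted, and anything else kills the process with SIGKILL.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Fork a child, put it in seccomp strict mode, and optionally have it
 * make a syscall outside the strict set.
 * Returns the signal that killed the child, 0 if it exited cleanly,
 * or -1 on setup failure. */
int run_in_strict_mode(int make_forbidden_call)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);
        if (make_forbidden_call)
            syscall(SYS_getpid);    /* not in the strict set: SIGKILL */
        /* The libc _exit() uses exit_group, which strict mode forbids,
         * so call the plain exit syscall directly. */
        syscall(SYS_exit, 0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

Calling `run_in_strict_mode(1)` is expected to report SIGKILL, while `run_in_strict_mode(0)` should report a clean exit. A pledge-style operation would sit between this fixed set and a hand-written BPF filter.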

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 15:08 UTC (Wed) by cyphar (subscriber, #110703) [Link] (5 responses)

> You mean btf...

I was thinking of BTF while writing it, though I don't know if BTF currently gives us the details we want -- we don't just want lists of functions and structure layouts. We need to have a way for the kernel to tell userspace "this syscall is part of the net/tcp pledge-set" or something similar (and probably a way to indicate "if this flag is set then the syscall is (also?) part of the foobar pledge-set").

> We could extend seccomp(SECCOMP_PLEDGE, 0, "stdio,sendfd,recvfd") ...

"What's in a name? That which we call [pledge(2)]
By any other name would [provide the same functionality];"

#define pledge(list) seccomp(SECCOMP_PLEDGE, 0, list)

But yes, in that case we are in agreement -- let's do it in-kernel (but taking care to be incompatible with OpenBSD, so that we can pretend we came up with the idea :P).

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 20:37 UTC (Wed) by wahern (subscriber, #37304) [Link] (4 responses)

Part of what makes pledge so convenient are the edge cases. For example, say you don't want a network daemon (or an unprivileged subprocess of a network daemon) to have any access to the file system. But what about /etc/resolv.conf? One of the niceties of pledge is that the "dns" privilege permits opening /etc/resolv.conf even if open(2) is otherwise disallowed. seccomp doesn't permit path name filtering at all.

OpenBSD added the sendsyslog syscall so that processes could be denied socket access but still be able to use the syslog facility. I don't think this could be emulated with seccomp, either, as filtering on the sun.sun_path argument to bind(2) has the same problems as filtering on the path argument to open(2). You could require processes to open the socket before dropping privileges, but what happens if the syslogd daemon restarts?

There are several little pragmatic tweaks like this that make pledge functional. A fundamental hurdle on Linux is that the project is so large and diverse that there's enormous pressure to prevent leaky abstractions that require far flung tweaks across the system. Such tweaks are especially brittle in the Linux development model, and people are wary of unintended consequences. That's completely understandable, but sometimes such tweaks are simply unavoidable if your goal is maximizing userland convenience and security. Irreducible complexity has to be apportioned among userland and various kernel subsystems somehow; OpenBSD tends to apportion it quite differently than Linux, partly because of the different development models.

The irony is that the path of least resistance for Linux has been containers--namespaces, cgroups, etc--which has become precisely the slippery slope of complexity and code churn people feared. (Which is why OpenBSD rejected FreeBSD jails.) Not that containers weren't worth it for their own sake, I just find the path dependency and contradictions interesting.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 23:41 UTC (Wed) by roc (subscriber, #30627) [Link]

That convenience comes at the price of baking details of specific userspace behavior --- in your example, DNS resolution and /etc/resolv.conf --- into the kernel, which currently is independent of those details. That's acceptable for OpenBSD which largely controls their userspace and is used in far less diverse ways than Linux, but it is a problem.

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 6:14 UTC (Thu) by cyphar (subscriber, #110703) [Link] (2 responses)

> One of the niceties of pledge is that the "dns" privilege permits opening /etc/resolv.conf even if open(2) is otherwise disallowed. seccomp doesn't permit path name filtering at all.

In my view, this is actually a good thing -- pathname filtering based on the string value of the path is destined to be a bad idea (I explain this further in [1]). I reckon that the right combination of bind-mounts and AppArmor/SELinux would be a far more effective method for doing this without all of the foot-guns.

> There are several little pragmatic tweaks like this that make pledge functional.

I agree that these sorts of niceties are very useful for making pledge(2) much easier to use for userspace, but I'm not convinced that we need to do all of them in-kernel. There is no reason why we can't also have a libpledge which can help deal with some of the more peculiar userspace bits (that are separate from the core "these syscalls on this kernel form this pledge-group" feature we need to be in-kernel).

[1]: https://lwn.net/Articles/801187/

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 8:51 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> I reckon that the right combination of bind-mounts and AppArmor/SELinux would be a far more effective method for doing this without all of the foot-guns.
SELinux is never a solution...

AppArmor has several problems, though. In particular, it can't be effectively used in unprivileged contexts. For example, you can't run a program that you just compiled with a custom policy.

It also was not possible to use AppArmor from inside containers (has this changed?).

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 10:48 UTC (Thu) by jem (subscriber, #24231) [Link]

Hard coding /etc/resolv.conf is a horrible example of putting policy in the kernel. Name resolution is a library level thing, and even there the resolv.conf file is an implementation detail, not specified by any standards. getaddrinfo() is the POSIX standard name resolution interface, and it can be, and has been, implemented without the need for /etc/resolv.conf.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 3:06 UTC (Wed) by kees (subscriber, #27264) [Link] (5 responses)

I'd love to see glibc support pledge(). It knows which of its own APIs map to which syscalls, etc. I don't think there is anything missing from the seccomp interface that a glibc pledge() implementation would need. (And if there was, I'd be happy to help get it implemented in seccomp.)

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 5:50 UTC (Wed) by mjg59 (subscriber, #23239) [Link] (4 responses)

Filtering based on path arguments?

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 10:51 UTC (Wed) by brauner (subscriber, #109349) [Link]

Hm, we discussed this at KSummit and I'm not sure seccomp is the right tool for this.
Not without introducing all kinds of races or moving that part of seccomp into its own LSM which has its own problems.
In general path-based filtering seems LSM territory.
However, we intend to bring aspects of deep argument inspection to seccomp eventually.

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 18:27 UTC (Wed) by kees (subscriber, #27264) [Link]

Let's skip that bit for now; we can do the rest, though, yes?

PostgreSQL considers seccomp() filters

Posted Oct 2, 2019 23:44 UTC (Wed) by roc (subscriber, #30627) [Link]

You probably know this but I haven't seen it mentioned: Firefox and Chrome sandboxes implement path filtering by having seccomp filters trigger SIGSYS on path-related syscalls, and having the signal handler fake the syscall using IPC to a trusted broker process.

It's not great for performance, but a pledge-like sandboxing library/API can take this approach.

PostgreSQL considers seccomp() filters

Posted Oct 3, 2019 6:09 UTC (Thu) by cyphar (subscriber, #110703) [Link]

I'll admit that I'm not convinced that pathname-as-a-string filtering is a good idea (though I'd love to know what usecases it would provide real protection in). Unless you restrict the set of syscalls the process can use to be incredibly small (i.e. only allowing open("/foo", O_RDWR) and no other filesystem-related syscalls), there are all sorts of ways you can get around pathname restrictions (symlink races, bind-mounts in mount namespaces, magic-links, and so on).

If the purpose would be to stop malicious programs from doing something bad -- I think it would be more productive to just use mount namespaces and isolate away the rest of the filesystem entirely. Maybe you could make use of path-based filtering in combination with a read-only mount namespace, but I'm still not completely convinced.

If the purpose is to stop a trusted program from being tricked into operating on the wrong kinds of paths by an attacker, I think openat2 and the stuff I'm working on with libpathrs[1] (which takes advantage of existing tricks involving O_PATH and procfs) would be a better solution.

[1]: https://github.com/openSUSE/libpathrs