LWN: Comments on "CAP_PERFMON — and new capabilities in general"

CAP_PERFMON — and new capabilities in general

immibis — Thu, 12 Mar 2020 16:29:38 +0000

I think that was his/her point - you know that any subversion has to go through the mechanism to which permission is granted, so you only need to be especially careful there. You don't need to check the output file path the parent passed to you, because you don't have any special permission to write to files that the parent process couldn't write to anyway.

Reducing CAP_SYS_ADMIN

mathstuf — Wed, 04 Mar 2020 14:20:35 +0000

Won't that break containers running older distros on newer kernels? Would we need capability namespaces then?

Reducing CAP_SYS_ADMIN

Wol — Wed, 04 Mar 2020 13:40:59 +0000

I think that as a new capability is added, that ability should be "deleted" from CAP_SYS_ADMIN. Have a boot flag that says "restrict CAP_SYS_ADMIN" and those abilities will no longer be there (okay the default is don't restrict, and those abilities will still be available to anything that needs them).

But then, if we get the distros on board, especially long term distros like RHEL, they should state that "anything that won't compile and run when the flag is on, is not supported". If the long-term-kernel maintainers also agree that no capability-removal code will be back-ported, so the CAP_SYS_ADMIN capabilities are fixed for any individual x.y kernel, then there is clear pressure on upstream to support new capabilities, and users who run longer-term kernels can rely on the capability system to provide the protection it was designed to.

Cheers,
Wol

CAP_PERFMON — and new capabilities in general

Freeaqingme — Tue, 25 Feb 2020 17:35:15 +0000

I concur.

A few years ago I had a process that spawned workers. The parent process would then assign jobs to these workers. I wanted every job to be performed in a specific namespace/cgroup. Therefore, I needed the master process to change the namespaces of the spawned workers. Such a syscall does not exist, so we decided the worker should switch to that namespace (setns()) itself.

Ideally, we'd not run the child processes as root because they also executed/processed user input. As such, I set out to implement a custom capability that would grant a process the rights to change its namespace/cgroups without having to run as root.

A few limitations I ran into:
- There's indeed a max of 64 (IIRC) capabilities. This makes it difficult to pick a number of which you're sure it won't be used by another capability (introduced by 'upstream') in the future.
- I don't entirely recall it anymore, but I believe we'd have to modify libcap, libapparmor, libc as well as the kernel itself.

These constraints make it hard to prototype something. Lack of prototypes will probably also - at least in part - be a reason why there's not much of it upstreamed.

Also, because it's very specific to our use case, I did expect that upstream would not be willing to accept this new privilege. That may be a reason why there's so relatively few capabilities. For every scenario a different capability could probably be thought of.

I'm not a seasoned kernel developer, so I may have had some more challenges than someone more experienced in this regard would have had. However, after trying various options for a couple of days, our solution simply was to run the mentioned master process as root, and harden it through things like AppArmor instead.

CAP_PERFMON — and new capabilities in general

andresfreund — Mon, 24 Feb 2020 19:16:07 +0000

And then when the web server wants to e.g. use SO_REUSEPORT to have a separate socket for each socket / group of cores/core, you have to teach your system startup tooling that. And can't configure it in the application's config file anymore.

Not saying that passing the fd in is not a good solution in some cases, just that it does have its own set of implied limitations.
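
To make the SO_REUSEPORT case concrete, here is a minimal sketch, assuming Linux and Python's standard socket module. The function name `reuseport_listeners` is made up for illustration, and port 0 (kernel-chosen) stands in for 80/443 so that it runs unprivileged. The point is that the application itself decides how many listening sockets exist on the port, which is exactly what gets awkward when an external launcher has to create them.

```python
import socket

def reuseport_listeners(count, host="127.0.0.1", port=0):
    """Open `count` listening sockets that all share one TCP port.

    Every socket must set SO_REUSEPORT before bind(); the kernel then
    distributes incoming connections across the listeners.
    """
    listeners = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        # After the first bind, reuse whatever port the kernel picked.
        port = s.getsockname()[1]
        s.listen()
        listeners.append(s)
    return listeners
```

A server would typically hand one of these sockets to each worker (per core or per CPU socket), with the count coming from its own config file.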

CAP_PERFMON — and new capabilities in general

smurf — Mon, 24 Feb 2020 18:38:52 +0000

You could pass the open port to the web server as an open file descriptor.
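
A rough sketch of what that looks like, assuming Python's standard socket module; `open_listener` and `adopt_listener` are hypothetical names, and port 0 is used instead of 80 so the example runs without privileges. The privileged launcher does the bind(), and the server only wraps the descriptor it was handed.

```python
import os
import socket

def open_listener(host="127.0.0.1", port=0):
    """Launcher side: bind and listen.

    Port 0 lets the kernel pick a free port; a real launcher would
    bind 80 or 443 here while it still has the privilege to do so.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, port))
    s.listen()
    return s

def adopt_listener(fd):
    """Server side: wrap an already-bound descriptor.

    No bind() call happens here, so no special privilege is needed.
    """
    return socket.socket(fileno=fd)
```

Within one process this can be tried with `adopt_listener(os.dup(listener.fileno()))`. Across an exec boundary the launcher has to keep the descriptor open in the child, for example with the `pass_fds` argument of `subprocess.Popen`, and tell the server which fd number to adopt.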

CAP_PERFMON — and new capabilities in general

Cyberax — Mon, 24 Feb 2020 17:49:37 +0000

> Why do you believe CAP_NET_BIND_SERVICE shouldn't exist?
Because there should have been no restriction on <1024 ports to begin with (i.e. everything should have CAP_NET_BIND_SERVICE).

CAP_PERFMON — and new capabilities in general

imMute — Mon, 24 Feb 2020 17:24:32 +0000

Why do you believe CAP_NET_BIND_SERVICE shouldn't exist?
I, for one example, believe it would be useful to allow an HTTP server to bind to port 80/443 without needing to be started as root.

CAP_PERFMON — and new capabilities in general

epa — Mon, 24 Feb 2020 13:28:43 +0000

True. I think that splitting the root account's powers into umpteen different capability bits is conceptually pretty simple. Instead of checking uid==0 you check whether the relevant bit is set. There's not too much to go wrong in that, and it's certainly less code than SELinux or seccomp. The hard part seems to be finding space for the bitmask in relevant structures and perhaps in filesystems.
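
As a minimal sketch of "check whether the relevant bit is set", in Python rather than kernel C: the capability numbers below are the ones from linux/capability.h, while the function names are invented for illustration and the /proc/self/status reader assumes Linux.

```python
# Capability numbers as defined in <linux/capability.h>.
CAP_SETUID = 7
CAP_NET_BIND_SERVICE = 10
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38

def has_cap(mask, cap):
    """Return True if bit number `cap` is set in the bitmask `mask`."""
    return bool((mask >> cap) & 1)

def parse_cap_line(line):
    """Parse a /proc/<pid>/status line such as 'CapEff:\\t0000000000000400'."""
    _, value = line.split(":", 1)
    return int(value.strip(), 16)

def effective_caps(status_path="/proc/self/status"):
    """Read the effective capability mask of the current process (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return parse_cap_line(line)
    raise RuntimeError("CapEff line not found")
```

For example, `has_cap(effective_caps(), CAP_NET_BIND_SERVICE)` asks whether the current process may bind to ports below 1024. Since the mask is a 64-bit field, the number of distinct capabilities is bounded by its width.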

CAP_PERFMON — and new capabilities in general

diconico07 — Mon, 24 Feb 2020 07:53:45 +0000

Another reason to add capabilities carefully is the fact there can only be a limited number of these (64 if I remember correctly), so a badly defined or "useless" capability (e.g. CAP_SYS_PACCT or CAP_NET_BROADCAST) only lowers the number of capabilities for future features that would be needing a clearly separated capability.

I really think CAP_SYS_ADMIN is bloated and unusable, and that some of its features could have gotten their own capability (e.g. seccomp-related checks). For now I prefer giving root rather than CAP_SYS_ADMIN, as it shows more clearly that the process might do dangerous things in a quite uncontrollable manner (without other things like seccomp et al.). However, I don't think we can have this balance of having some really usable and useful capabilities without having some bloated ones (remember that there is nothing just checking for root in the kernel anymore; becoming root just means getting all capabilities).

CAP_PERFMON — and new capabilities in general

NYKevin — Sun, 23 Feb 2020 19:20:26 +0000

> On a large number of deployed systems, any ordinary user account can be escalated to root, either because of unpatched bugs or because the architecture is inherently not that secure. That does not mean the whole structure of user permissions is useless.

The difference, I think, is that you are describing a bug, and I am describing how the system was designed to work.

Obviously, defense in depth is a Good Thing. I am not suggesting we eliminate capabilities entirely, or that we do anything at all, for that matter. The concern is that additional complexity in privileged code (such as the kernel) carries additional risk. So when adding new layers of security, we need to balance the security benefits with the complexity. It's not clear to me how capabilities strike that balance, and under what circumstances they ought to be used in concert with or in lieu of seccomp, containerization, SELinux, etc. As a sysadmin, I would like to know which security subsystems are actually best practices, and which ones are just there because somebody wanted them to be there.

> A buggy daemon running as root will be much easier to subvert than one that runs as a normal user account with a couple of extra capabilities. Those capabilities might get you root through a few tricks, but getting the daemon to perform those steps is harder than getting it to overwrite a random file because of missing path sanitization.

This is a reasonable point. As I said, capabilities do offer some defense against confused deputies. It's just not clear to me that they are the Right Way to go about doing that.

(Of course, this is a more general problem with Linux. The man pages are great at telling you what syscall X does, but often not so good at telling you why you might want that functionality, or how you might choose to compose it with other syscalls. Section 7 pages frequently do provide this information, but they can be hard to find because it's less obvious what name you should give to man. Section 2 pages, on the other hand, tend to be rather terse. I realize this is by design, but rightly or wrongly, many people learn to program Unix by reading man pages, and this is not a great first impression.)

CAP_PERFMON — and new capabilities in general

mpr22 — Sun, 23 Feb 2020 19:01:59 +0000

A majver bump for Linux means "Linus woke up and felt like bumping the majver instead of the minver", and nothing more.

To be allowed to break a userspace interface, you have to be able to demonstrate that nobody who's paying attention is using that interface on a system that has a realistic prospect of being upgraded to the new kernel version.

CAP_PERFMON — and new capabilities in general

intelfx — Sun, 23 Feb 2020 18:35:29 +0000

Alas — Linux kernel does not use semantic versioning. The model is "we do not break userspace".

CAP_PERFMON — and new capabilities in general

pbonzini — Sun, 23 Feb 2020 12:45:59 +0000

Capabilities alone are useless. Capabilities make no-new-privs and seccomp stronger, and seccomp makes capabilities usable.

CAP_PERFMON — and new capabilities in general

ibukanov — Sun, 23 Feb 2020 12:35:02 +0000

Those examples actually prove the grand-parent point. In my experience things like no-new-privileges, namespaces, syscall filters are vastly more useful to secure systems than capabilities. With those it is possible to secure a system even without restricting capabilities, while capabilities alone cannot realistically secure the system. Then again, why did it take so long to come up with ambient capabilities that allow granting a particular capability to a particular invocation of a process, not each and every execution of a binary?

CAP_PERFMON — and new capabilities in general

meyert — Sun, 23 Feb 2020 12:10:14 +0000

Increasing the major version number to 5 could have been used to introduce breaking changes like the above deadlock situation, and get rid of other legacy stuff.

CAP_PERFMON — and new capabilities in general

matthias — Sun, 23 Feb 2020 07:08:44 +0000

> One reason, of course, is the aforementioned compatibility issue: once CAP_SYS_ADMIN allows an action, it can never lose that power without possibly breaking existing systems. When Serge Hallyn added CAP_SYSLOG, he added the usual code that made things continue to work if the process in question had CAP_SYS_ADMIN. In that case, though, the kernel issues a warning that use of CAP_SYS_ADMIN for these operations is deprecated. Nearly ten years later, the compatibility code — and the warning — remain. Splitting capabilities out of CAP_SYS_ADMIN is less than fully rewarding when the power of CAP_SYS_ADMIN itself can never be reduced.

I do not buy this. The compatibility code could be made optional in kernel config. There already are a bunch of options that say in the help text "Only enable this if you want to run binaries from the stone age." Probably there is no demand for such an option because CAP_SYS_ADMIN is omnipotent anyway. The reward for splitting capabilities out of CAP_SYS_ADMIN is not that CAP_SYS_ADMIN becomes less powerful. The reward is that fewer processes need the power of CAP_SYS_ADMIN and processes can use less privileged capabilities instead.
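
To illustrate what an optional fallback could look like, here is a small Python model, not kernel code. The `legacy_fallback` parameter is a hypothetical stand-in for the kernel config option suggested above, and `may_read_syslog` is an invented name; the capability numbers are the real ones from linux/capability.h.

```python
import warnings

# Capability numbers as defined in <linux/capability.h>.
CAP_SYS_ADMIN = 21
CAP_SYSLOG = 34

def may_read_syslog(caps, legacy_fallback=True):
    """Model of a permission check with an optional compatibility path.

    `caps` is the caller's capability bitmask. `legacy_fallback` models
    the hypothetical config option: when it is off, CAP_SYS_ADMIN no
    longer implies the newer, narrower capability.
    """
    if (caps >> CAP_SYSLOG) & 1:
        return True
    if legacy_fallback and (caps >> CAP_SYS_ADMIN) & 1:
        # Still allowed for compatibility, but flagged as deprecated.
        warnings.warn(
            "CAP_SYS_ADMIN used for syslog access; use CAP_SYSLOG instead",
            DeprecationWarning,
        )
        return True
    return False
```

With the fallback disabled, a process holding only CAP_SYS_ADMIN is refused, which is the behaviour a "no stone-age binaries" build would get.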

CAP_PERFMON — and new capabilities in general

Cyberax — Sat, 22 Feb 2020 20:03:31 +0000

> A buggy daemon running as root will be much easier to subvert than one that runs as a normal user account with a couple of extra capabilities.
The problem is that there are almost no capabilities that are useful for regular daemons, with the exception of CAP_NET_BIND_SERVICE (which shouldn't have existed in the first place).

So if your daemon runs as root then it probably needs it for something that can't be expressed as capabilities anyway.

CAP_PERFMON — and new capabilities in general

epa — Sat, 22 Feb 2020 08:03:45 +0000

On a large number of deployed systems, any ordinary user account can be escalated to root, either because of unpatched bugs or because the architecture is inherently not that secure. That does not mean the whole structure of user permissions is useless.

A buggy daemon running as root will be much easier to subvert than one that runs as a normal user account with a couple of extra capabilities. Those capabilities might get you root through a few tricks, but getting the daemon to perform those steps is harder than getting it to overwrite a random file because of missing path sanitization.

For human accounts it can also work to have specific administrative roles with their needed capabilities rather than an all-powerful root account. This is why even classical Unix systems have a sudoers file, so admins log in with an ordinary account and ‘sudo’ particular commands when needed. In principle this gives the same power as just logging in as root, but it gives better protection against mistakes and some auditing of what the admin does, even if half the time the command is ‘sudo bash’.

In simpler times there were efforts to split the human admin account into its capabilities too. Windows NT defined roles like ‘Backup operator’. Unfortunately the messy world we inhabit means that any admin probably does need full access to get anything done.

CAP_PERFMON — and new capabilities in general

pbonzini — Fri, 21 Feb 2020 21:57:03 +0000

It depends on the use case. Some capabilities are not equivalent to root, and others can be paired with other defense mechanisms:

> "Mount and unmount any filesystem" can be used to create a setuid-root binary, to backdoor anything in /bin or /sbin, and for a variety of other privilege escalation attacks.

Not in combination with mount namespaces + seccomp to block exec, for example. A program that is launched as root can set them up before dropping all other capabilities.

> "Call setuid(2) with any value" can be used to become root, and then full capabilities are regained on calling execve(2).

Besides using seccomp to block execve, you can also use inheritable capabilities so that children do not keep them.

In other cases, the environment around the program can limit the root-equivalence of capabilities:

> "Load kernel modules" can be used to execute arbitrary code in kernel space, because that's exactly what it is meant to do.

You can use SELinux to prevent the program from loading a .ko file that wasn't given a particular SELinux label; or you can reject non-signed modules.

> "ptrace any process" can be used to execute arbitrary code as any user who is running code on the machine, which will generally include root.

A process that runs in a pid namespace will not be able to exit it and do ptrace outside its pid namespace (IIRC).

CAP_PERFMON — and new capabilities in general

smcv — Fri, 21 Feb 2020 19:54:04 +0000

> The idea is that the program that's been granted the privilege needs only be careful when using that exact privilege

... and when defending itself against being subverted by processes that don't have the privilege, including its parent process.

CAP_PERFMON — and new capabilities in general

smurf — Fri, 21 Feb 2020 19:17:11 +0000

The operative word is "can be". These granular privileges aren't supposed to be granted to any random user process.

The idea is that the program that's been granted the privilege needs only be careful when using that exact privilege.

As an example, a program that has "mount any filesystem" privileges needs only be careful when actually mounting a file system, but not when opening the file that's backing the data for the file system (just as a random example). Similarly, the system profiler might be allowed to profile the system, but not to overwrite /etc/shadow with the resulting data.

CAP_PERFMON — and new capabilities in general

NYKevin — Fri, 21 Feb 2020 18:03:40 +0000

Perhaps I just don't understand what the kernel developers are trying to do (which is a very real possibility as I don't read LKML religiously). But it appears that quite a lot of capability-guarded operations are inherently root-equivalent and cannot be meaningfully sandboxed without a complete redesign of Linux's security model. Some examples:

"Mount and unmount any filesystem" can be used to create a setuid-root binary, to backdoor anything in /bin or /sbin, and for a variety of other privilege escalation attacks.
"ptrace any process" can be used to execute arbitrary code as any user who is running code on the machine, which will generally include root.
"Load kernel modules" can be used to execute arbitrary code in kernel space, because that's exactly what it is meant to do.
"Call setuid(2) with any value" can be used to become root, and then full capabilities are regained on calling execve(2).

I don't really understand the purpose of trying to sandbox operations similar to the above. I suppose capabilities could be used to mitigate the confused-deputy problem in some cases, but they seem like a rather roundabout way of doing that (contrast seccomp, containers, etc.). Of course, there are privileged operations which are not root-equivalent, and sandboxing those does make sense. I just don't understand why capabilities are applied to literally every privileged operation under the sun.