
Linux capabilities support for user namespaces

By Jake Edge
December 22, 2010

Linux capabilities are a sparsely used kernel facility to add granularity to the set of privileges that a process can have. By using capabilities, an administrator can grant a process a limited set of privileges, rather than the usual, essentially binary, choice between granting all privileges via a setuid-root binary or granting just those of the user running the program. Combining capabilities with user namespaces will allow administrators to apply those fine-grained privileges to containers, which is just what a patch set proposed by Serge E. Hallyn sets out to do.

We have looked at capabilities several times in the past, most recently in the context of adding capability sets to files, though an earlier article provides more details on the rules that govern how capabilities are applied and inherited. With the addition of file capabilities, Linux systems have all the tools needed to eliminate most setuid programs though, in practice, that hasn't happened. There is an effort underway to eliminate most setuid programs for Fedora 15, however.

Namespaces are part of the Linux containers implementation, which is a lightweight virtualization technique that allows groups of processes to run in their own little world, separate from the rest of the processes running on the system. These containers must not be able to see or interact with things outside, so various global resources (things like process IDs, network devices, filesystems, and so on) need to be wrapped in a namespace layer that provides the illusion that the container is its own system. User namespaces provide a container with its own set of UIDs, completely separate from those in the parent. Each of the different kinds of namespaces can be created by using flags to the clone() system call.
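
As a rough illustration of the clone() call mentioned above, the sketch below asks for a new user namespace by passing the CLONE_NEWUSER flag. The function and helper names are invented for this example, and whether the call succeeds depends on the kernel version, its configuration, and the caller's privileges, so the function simply reports the errno value it gets back rather than assuming success.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for clone() and CLONE_NEWUSER */
#endif
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Hypothetical child body: it runs inside the new user namespace and exits. */
static int child_main(void *arg)
{
    (void)arg;
    return 0;
}

/*
 * Try to create a child process in a new user namespace.
 * Returns 0 if the child was created and exited cleanly,
 * otherwise a positive errno value describing the failure.
 */
int try_clone_newuser(void)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return ENOMEM;

    /* The stack grows downward on common architectures, so pass its top. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE,
                      CLONE_NEWUSER | SIGCHLD, NULL);
    if (pid == -1) {
        int err = errno;        /* typically EPERM or EINVAL */
        free(stack);
        return err;
    }

    int status = 0;
    int err = 0;
    if (waitpid(pid, &status, 0) == -1)
        err = errno;
    else if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        err = ECHILD;           /* child did not exit cleanly */

    free(stack);
    return err;
}
```

A caller might treat a return of 0 as "user namespaces are available to me" and anything else as a reason to fall back. Other kinds of namespaces are requested the same way with sibling flags such as CLONE_NEWPID or CLONE_NEWNET.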

The idea behind Hallyn's patches, the core of which was originally developed by Eric Biederman, is to eventually allow unprivileged users to create namespaces. In order to do that, the capabilities of processes in a namespace must not leak out to parent (or even sibling) namespaces. In the core patch, Hallyn says that the proposed changes accomplish 90% of the goal to allow unprivileged namespace creation, with some UID confusion issues still to be addressed.

In the initial user namespace—the "normal" namespace that is created at boot time—capabilities for a task are calculated in the usual way, using the permitted, effective, and inheritable capability sets associated with the task. The proposed changes will restrict any capabilities in a child user namespace to only act within that namespace or on any of its descendants.

Each capability set is contained in a structure that references the user it corresponds to, and those user structures have a namespace to which they are attached. When checking to determine whether a particular set of capabilities should be used, the code looks at whether the user is part of the target namespace. If so, its capabilities are used; if not, each parent namespace is checked all the way back to the initial user namespace. Since the capabilities can only be associated with one namespace (via a user in that namespace), they are only active in the namespace that contains them or any descendant of that namespace.

The user that creates the namespace will have all capabilities in that namespace, not just the set of capabilities they have in the parent. Essentially, the creator has the privileges of the root user in any namespace he or she creates.
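
The lookup described in the last two paragraphs can be modeled in a few lines of user-space C. This is only a sketch of the rules as the article states them, not code from the patch set: the `model_*` structures, field names, and function are all assumptions made for illustration, and the real kernel structures carry far more state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; these are not the kernel's structures. */
struct model_user_ns;

struct model_user {
    struct model_user_ns *ns;           /* namespace this user belongs to */
};

struct model_user_ns {
    struct model_user_ns *parent;       /* NULL for the initial namespace */
    const struct model_user *creator;   /* NULL for the initial namespace */
};

struct model_cred {
    const struct model_user *user;
    uint64_t effective;                 /* one bit per capability */
};

/* Bit positions chosen for the model. */
enum { MODEL_CAP_KILL = 5, MODEL_CAP_SYS_ADMIN = 21 };

/*
 * Does `cred` hold capability `cap` with respect to namespace `target`?
 * Walk from the target toward the initial namespace, as the text describes.
 */
bool model_cap_capable(const struct model_cred *cred,
                       const struct model_user_ns *target, int cap)
{
    for (const struct model_user_ns *ns = target; ns != NULL; ns = ns->parent) {
        /* The user lives in this namespace: its own capability set decides. */
        if (ns == cred->user->ns)
            return (cred->effective >> cap) & 1u;

        /* The user created this namespace: all capabilities apply inside it. */
        if (ns->creator == cred->user)
            return true;
    }

    /* Reached the top without meeting the user's namespace: no privilege. */
    return false;
}
```

In this model, a user in the initial namespace who creates a child namespace passes every check against that child (and its descendants), while a task that is "root" inside the child fails every check against the parent or a sibling, because the walk upward from those targets never reaches the child.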

In order to ensure that the namespace creator's capabilities don't leak out to the rest of the system, a new capability check is added in the patch:

    int ns_capable(struct user_namespace *ns, int cap);

The existing capable() function, which determines whether a task has a particular capability or not, has been changed to call ns_capable(), but it passes the initial user namespace for ns. That means that the existing calls to capable() currently sprinkled around the kernel do not suddenly change their semantics. In order to allow specific capabilities to function in a user namespace, calls to capable() need to be changed to ns_capable() while passing the appropriate namespace. The cap_capable() function, which is eventually called from ns_capable(), has been changed to properly handle capabilities in user namespaces.

In this way, kernel functionality that requires certain capabilities can be incrementally added to user namespaces while still protecting the rest of the kernel from being affected. Hallyn's patches enable three specific capabilities for user namespaces by making the change from capable() to ns_capable(). The first, and simplest, just allows the sethostname() system call to be successfully called if the user in the namespace has CAP_SYS_ADMIN. The second, which is slightly more complicated, but still a pretty small change, alters check_kill_permission() to allow CAP_KILL-enabled tasks to send a signal to another task. The last patch allows CAP_SYS_PTRACE-capable tasks to use ptrace() on other tasks in the user namespace.
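
To make the capable()/ns_capable() relationship concrete, here is a simplified user-space model of a converted sethostname() permission check. Everything prefixed `model_` is hypothetical, including the `owner` field that ties a hostname to a user namespace; the namespace creator's implicit privileges are left out to keep the sketch short, and none of this is taken from the patches themselves.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space model, not kernel code. */
struct model_user_ns {
    struct model_user_ns *parent;       /* NULL for the initial namespace */
};

struct model_task {
    struct model_user_ns *user_ns;      /* namespace holding the task's user */
    uint64_t effective;                 /* one bit per capability */
};

struct model_uts_ns {
    struct model_user_ns *owner;        /* assumed link: who owns this hostname */
    char nodename[65];
};

enum { MODEL_CAP_SYS_ADMIN = 21 };

struct model_user_ns model_init_user_ns = { NULL };

/* A capability counts in the task's own namespace and in its descendants. */
bool model_ns_capable(const struct model_task *t,
                      const struct model_user_ns *ns, int cap)
{
    for (; ns != NULL; ns = ns->parent)
        if (ns == t->user_ns)
            return (t->effective >> cap) & 1u;
    return false;
}

/* Unconverted call sites keep checking against the initial namespace. */
bool model_capable(const struct model_task *t, int cap)
{
    return model_ns_capable(t, &model_init_user_ns, cap);
}

/* A converted call site checks against the namespace that owns the resource. */
int model_sethostname(const struct model_task *t, struct model_uts_ns *uts,
                      const char *name)
{
    /* Before conversion: model_capable(t, MODEL_CAP_SYS_ADMIN). */
    if (!model_ns_capable(t, uts->owner, MODEL_CAP_SYS_ADMIN))
        return -EPERM;

    size_t len = strlen(name);
    if (len >= sizeof(uts->nodename))
        return -EINVAL;

    memcpy(uts->nodename, name, len + 1);
    return 0;
}
```

With this shape, a task holding CAP_SYS_ADMIN only inside a child namespace can rename a hostname owned by that namespace but still fails model_capable(), so every call site that has not been converted continues to refuse it.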

This is an incremental approach that will allow each addition of user namespace capabilities to be reviewed and tested separately before adding them into the mainline. Hallyn notes his current plans for enabling some additional capabilities from user namespaces:

My near-term next goals will be to enable setuid and setgid, and to provide a way for the filesystem to be usable in child user namespaces. At the very least I'd like a fresh loopback or LVM mount and proc mounts to be supported.

Capabilities are something of a gnarly corner of the kernel, and one that has caused problems in the past (e.g. the "sendmail capabilities" bug). Combining them with namespaces is a bit of a delicate task. Clearly, if regular users are able to create these namespaces, it is imperative that any tricky interactions caused by capabilities in namespaces do not lead to privilege escalations. From that perspective, Hallyn's approach seems sound.



OpenWall 3.0

Posted Dec 23, 2010 21:49 UTC (Thu) by smoogen (subscriber, #97)

As a Fedora person, I just wanted to point out that there may have been a setuid-less OS shipped before Fedora 15. The Openwall project released their 3.0 on December 10th, and the announcement states that their default install has no setuid programs, only some setgid ones. I don't know if they are using capabilities or not (I have it on my after-Xmas list of things to check out) but thought it should be noted.

http://www.openwall.com/lists/announce/2010/12/15/1

OpenWall 3.0

Posted Dec 23, 2010 22:03 UTC (Thu) by smoogen (subscriber, #97)

And in reading the next section I see they have instead implemented kernel fixes to allow setgid programs to open ICMP sockets, rather than using capabilities.

OpenWall 3.0

Posted Jan 6, 2011 16:23 UTC (Thu) by solardiz (guest, #35993)

ping is a special case. For everything else, we made purely userland changes to eliminate the need for having any SUID programs. You may want to check out these links:

http://www.openwall.com/tcb/
http://www.openwall.com/presentations/Owl/mgp00013.html
http://www.openwall.com/presentations/Owl/mgp00020.html
http://www.openwall.com/presentations/Owl/mgp00021.html
http://www.openwall.com/presentations/Owl/mgp00022.html
http://www.openwall.com/presentations/Owl/mgp00023.html

also mentioned in the "next section" that you referred to:

http://lwn.net/Articles/420801/

OpenWall 3.0

Posted Jan 6, 2011 16:38 UTC (Thu) by solardiz (guest, #35993)

Also, this description - "kernel fixes to allow for setgid programs to open icmp" (from your comment) - is not entirely correct. What we're proposing on LKML is adding non-raw ICMP sockets (where one can only send certain things and receive certain relevant responses). This is not the same as permitting some programs to access the existing (raw) ICMP sockets. And this is post-Owl-3.0 stuff; on our 3.0 release, we left out the ping special case (ping is simply restricted to invocation by root by default, although this is configurable; our traceroute works as non-root fine).

Overall, Owl 3.0 is primarily about the hardened userland. We do not use filesystem capabilities, and our userland is usable with mainstream kernels (although we do provide and recommend a specific RHEL5/OpenVZ patched kernel). In fact, some people are running our userland in OpenVZ containers on non-Owl host systems (we provide pre-created OpenVZ templates of the userland), although we generally use Owl for both "host" and "guest" ourselves.

suid less os's.

Posted Dec 24, 2010 5:59 UTC (Fri) by ebiederm (subscriber, #35028)

Sigh, Plan 9 did this years ago; without suid, without sgid, and without capabilities.

Linux has all of the capabilities Plan 9 did, so going suidless without caveats is possible if someone would care.

Frankly, being able to raise the privileges of an existing process is such a dangerous mechanism and so limiting on system design that I wish someone would care, and remove all suid, sgid, and capabilities use from a distro. It is hard to count how many neat new features have been shelved because of the requirement to support suid root executables.

suid less os's.

Posted Jan 5, 2011 12:13 UTC (Wed) by michaeljt (subscriber, #39183)

> Sigh, Plan 9 did this years ago; without suid, without sgid, and without capabilities.

I'm no expert on Plan 9, but from a bit of quick googling it looks to me like it had local server processes to do privileged things for other processes that didn't have the rights to do them themselves. Which sounds rather like DBus/PolicyKit to me.

OpenWall 3.0

Posted Dec 26, 2010 22:42 UTC (Sun) by Cyberax (✭ supporter ✭, #52523)

I don't really understand all this setuidless craziness.

Sure, having setuid on 'ping' is crazy, but having the setuid bit on 'sudo' is downright logical.

OpenWall 3.0

Posted Jan 6, 2011 17:01 UTC (Thu) by solardiz (guest, #35993)

Having sudo and allowing for the use of su to elevate privileges is downright illogical in most cases (on servers, which is what Openwall GNU/*/Linux is for). Here are some excerpts from past discussions on the topic:

http://www.openwall.com/lists/owl-users/2004/10/20/6
http://lwn.net/Articles/413891/
http://linux.slashdot.org/comments.pl?sid=1915256&cid...

The alternative to the su/sudo approach is direct root logins. And the solution to the accountability problem (with multiple sysadmins) is multiple root-privileged accounts (with a distinct naming convention for clarity).

Occasional exceptions do exist. In our experience, less than 10% of server systems would potentially benefit from sudo, and a safer approach can be used on those anyway: we generally prefer ssh forced commands - that is, command=... in authorized_keys - even if this is to be invoked by a local account on the system itself, such as by a support person who is not a "full" sysadmin.

OpenWall 3.0

Posted Jan 6, 2011 17:15 UTC (Thu) by solardiz (guest, #35993)

As to the "setuidless craziness" in general, it makes more sense once you actually have no SUID programs(*) left on the system - like we do not on a default install of Owl 3.0. This mitigates the impact of potential vulnerabilities in parts of ld.so, libc, and the kernel. Relevant vulnerabilities in each one of these components have been discovered (and fixed) in the past, and more are to be introduced/discovered/fixed.

(*) ...nor any similarly-privileged-on-exec programs, such as with fscaps with a root-equivalent capability set. We do not use fscaps in Owl 3.0.

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds