By Jake Edge
December 22, 2010
Linux capabilities are a sparsely used kernel facility to add granularity
to the set of privileges that a process can have. By using capabilities,
an administrator can grant a process a limited set of privileges, rather
than the usual, essentially binary, choice between granting all privileges
via a setuid-root binary or granting just those of the user running the
program.
Combining capabilities with user namespaces will allow administrators to
apply those fine-grained privileges to containers, which is just what a patch set proposed by Serge
E. Hallyn sets out to do.
We have looked at capabilities several times in the past, most recently in
the context of adding capability
sets to files, though an earlier article provides more
details on the rules that govern how capabilities are applied and
inherited. With the addition of file capabilities, Linux systems have all
the tools needed to eliminate most setuid programs though, in practice,
that hasn't happened. There is an effort underway to eliminate most
setuid programs for Fedora 15, however.
Namespaces are part of the Linux containers implementation, which is a
lightweight virtualization technique that allows groups of processes to run
in their own little world, separate from the rest of the processes running
on the system. These containers must not be able to see or interact with
things outside, so various global resources (things like process
IDs, network devices, filesystems, and so on) need to be wrapped in a
namespace layer that provides the illusion that the container is its own
system. User namespaces provide a container with its own set of UIDs,
completely separate from those in the parent. Each of the different kinds
of namespaces can be created by using flags to the clone() system
call.
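As a rough illustration of what "flags to clone()" means in practice, the following userspace sketch starts a child with whatever namespace flags the caller passes. The names run_in_new_namespaces() and child_main() are invented for this example, and the 1MB child stack is an arbitrary choice; it is not code from the patch set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)   /* arbitrary size for this sketch */

/* Runs in the child; reports the IDs that the child sees. */
static int child_main(void *arg)
{
    (void)arg;
    printf("child: pid=%ld uid=%ld\n", (long)getpid(), (long)getuid());
    fflush(stdout);   /* flush explicitly rather than rely on exit-time flushing */
    return 0;
}

/*
 * Hypothetical helper: start child_main() in the namespaces selected by
 * ns_flags (for example CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWUSER).
 * Returns the child's exit status, or -1 if the child could not be
 * created or did not exit normally.
 */
int run_in_new_namespaces(int ns_flags)
{
    /* The low byte of the clone() flags is reserved for the exit signal. */
    assert((ns_flags & CSIGNAL) == 0);

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    fflush(stdout);   /* avoid duplicating buffered output in the child */

    /* The stack grows down on common architectures, so pass its top. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE,
                      ns_flags | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Passing 0 behaves much like fork(). Passing namespace flags such as CLONE_NEWUTS or CLONE_NEWUSER will typically fail with EPERM for an unprivileged caller on kernels that require privilege to create namespaces, which is the restriction these patches aim to lift eventually.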
The idea behind Hallyn's patches, the core of which was originally
developed by Eric Biederman, is to eventually allow unprivileged users to
create namespaces. In order to do that, the capabilities of processes in a
namespace must not leak out to parent (or even sibling) namespaces. In the
core patch, Hallyn says that the proposed
changes accomplish 90% of the goal to allow unprivileged namespace
creation, with some UID confusion issues still
to be addressed.
In the initial user namespace—the "normal" namespace that is created
at boot
time—capabilities for a task are calculated in the usual way, using
the permitted, effective, and inheritable capability sets associated with
the task. The proposed changes will restrict any capabilities in a child
user namespace to only act within that namespace or on any of its
descendants.
Each capability set is contained in a structure that references the user it
corresponds to, and each user structure is attached to a namespace. When
checking whether a particular set of capabilities should be used, the code
looks at whether the user is part of the target namespace. If so, its
capabilities are used; if not, each parent namespace is checked in turn, all
the way back to the initial user namespace. Since the capabilities can only be
associated with one namespace (via a user in that namespace), they are only
active in the namespace that contains them or in any of its descendants.
The user that creates the namespace will have all
capabilities in that namespace, not just the set of capabilities they have
in the parent. Essentially, the creator has the privileges of the root
user in any namespace he or she creates.
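The lookup described above can be modeled in a few lines of userspace C. This is only a sketch of the rules as the article states them, not the kernel's code: the structures, their fields, and the name model_cap_capable() are all invented for illustration, and only the effective set is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define CAP_KILL        5
#define CAP_SYS_PTRACE 19
#define CAP_SYS_ADMIN  21

struct user_ns;

/* Hypothetical model: a user belongs to exactly one user namespace. */
struct user {
    struct user_ns *ns;
};

/* A namespace records its creator; NULL marks the initial namespace. */
struct user_ns {
    struct user *creator;
};

/* Simplified credentials: a user plus an effective capability bitmask. */
struct cred {
    struct user *user;
    uint64_t effective;
};

/*
 * Does 'cred' hold capability 'cap' with respect to namespace 'target'?
 * Walks from the target namespace up through its ancestors.
 */
bool model_cap_capable(const struct cred *cred,
                       const struct user_ns *target, int cap)
{
    assert(cred != NULL && cred->user != NULL && target != NULL);

    for (;;) {
        /* The creator of a non-initial namespace has all capabilities in it. */
        if (target->creator != NULL && target->creator == cred->user)
            return true;

        /* Same namespace as the user: consult the ordinary effective set. */
        if (target == cred->user->ns)
            return ((cred->effective >> cap) & 1u) != 0;

        /* Reached the initial namespace without finding the user. */
        if (target->creator == NULL)
            return false;

        /* The parent namespace is the one the creator lives in. */
        target = target->creator->ns;
    }
}
```

In this model, a task with a full capability set inside a child namespace still fails any check against the initial namespace, because the upward walk starts at the target and never passes through the child.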
In order to ensure that the namespace creator's capabilities don't leak out
to the rest of the system, a new capability check is added in the patch:
int ns_capable(struct user_namespace *ns, int cap);
The existing capable() function, which determines whether a task has a
particular capability or not, has been changed to call ns_capable(), but
it passes the initial user namespace for ns. That means that the existing
calls to capable() currently sprinkled around the kernel do not suddenly
change their semantics. In order to allow specific capabilities to function
in a user namespace, calls to capable() need to be changed to ns_capable()
while passing the appropriate namespace. The cap_capable() function, which
is eventually called from ns_capable(), has been changed to properly handle
capabilities in user namespaces.
In this way, kernel functionality that requires certain
capabilities can be incrementally added to user
namespaces while still protecting the rest of the kernel from being
affected.
Hallyn's patches enable three specific capabilities for user namespaces by
making the change from capable() to ns_capable(). The
first, and simplest, just allows the sethostname() system call to
succeed if the user in the namespace has
CAP_SYS_ADMIN. The second, which is slightly more complicated but still a
pretty small change, alters check_kill_permission() to allow CAP_KILL-enabled
tasks to send a signal to another task. The last patch allows
CAP_SYS_PTRACE-capable tasks to use ptrace() on other tasks in the user
namespace.
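To make the capable()-to-ns_capable() conversion concrete, here is a small userspace model of a sethostname()-style check before and after the change. Everything in it is an assumption made for illustration: the structures, the model_ and sethostname_ function names, and the idea that the hostname's namespace records an owning user namespace. The namespace-creator rule is left out to keep the sketch short.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CAP_SYS_ADMIN 21   /* value from <linux/capability.h> */
#define NODENAME_MAX  64

/* Hypothetical model types; none of these mirror kernel structures exactly. */
struct user_ns { struct user_ns *parent; };   /* parent is NULL for the initial ns */
struct uts_ns  { struct user_ns *owner; char nodename[NODENAME_MAX + 1]; };
struct task    { struct user_ns *user_ns; uint64_t effective; struct uts_ns *uts; };

struct user_ns init_user_ns = { NULL };

/* A task's capabilities apply to its own namespace and to its descendants. */
bool model_ns_capable(const struct task *t, const struct user_ns *target, int cap)
{
    assert(t != NULL && target != NULL);
    for (const struct user_ns *ns = target; ns != NULL; ns = ns->parent) {
        if (ns == t->user_ns)
            return ((t->effective >> cap) & 1u) != 0;
    }
    return false;
}

/* Unconverted callers keep their meaning: always test the initial namespace. */
bool model_capable(const struct task *t, int cap)
{
    return model_ns_capable(t, &init_user_ns, cap);
}

static int store_nodename(struct uts_ns *uts, const char *name)
{
    if (strlen(name) > NODENAME_MAX)
        return -EINVAL;
    strcpy(uts->nodename, name);
    return 0;
}

/* Before conversion: only a task privileged in the initial namespace succeeds. */
int sethostname_before(struct task *t, const char *name)
{
    if (!model_capable(t, CAP_SYS_ADMIN))
        return -EPERM;
    return store_nodename(t->uts, name);
}

/* After conversion: privilege is tested against the namespace owning the name. */
int sethostname_after(struct task *t, const char *name)
{
    if (!model_ns_capable(t, t->uts->owner, CAP_SYS_ADMIN))
        return -EPERM;
    return store_nodename(t->uts, name);
}
```

With a task that holds CAP_SYS_ADMIN only inside a child namespace, sethostname_before() returns -EPERM while sethostname_after() succeeds and changes only that container's name. That is the incremental opt-in the patches rely on: each call site changes behavior only when it is explicitly converted.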
This is an incremental approach that will allow each addition of user
namespace capabilities to be reviewed and tested separately before adding
them into the mainline. Hallyn notes his current plans for enabling some
additional
capabilities from user namespaces:
    My near-term next goals will be to enable setuid and setgid,
    and to provide a way for the filesystem to be usable in child
    user namespaces. At the very least I'd like a fresh loopback
    or LVM mount and proc mounts to be supported.
Capabilities are something of a gnarly corner of the
kernel, and one that has caused problems in the past (e.g. the "sendmail
capabilities" bug). Combining them with namespaces is a bit of a
delicate task. Clearly, if regular users are able to create these
namespaces, it is imperative that any tricky interactions caused by
capabilities in namespaces do not lead to privilege escalations. From that
perspective, Hallyn's approach seems sound.