By Michael Kerrisk
December 13, 2012
The results of the user namespaces work on Linux have been a long time in
coming, probably because they are the most complex of the various namespaces that have been
added to the kernel so far. The first pieces of the implementation started
appearing when Linux 2.6.23 (released in late 2007) added the
CLONE_NEWUSER flag for the clone() and unshare() system calls. By
Linux 2.6.29, that flag also became meaningful in the clone()
system call. However, until now many of the pieces necessary for a complete
implementation have remained absent.
We last looked at user
namespaces back in April, when Eric Biederman was working to push a raft of
patches into the kernel with the goal of bringing the implementation closer
to completion. Eric is now engaged in pushing further patches into the
kernel with the goal of having a more or less complete implementation of
user namespaces in Linux 3.8. Thus, it seems to be time to have another look
at this work. First, however, a brief recap of user namespaces is probably
in order.
User namespaces allow per-namespace mappings of user and group IDs. In
the context of containers, this means that
users and groups may have privileges for certain operations inside the
container without having those privileges outside the container. (In other
words, a process's set of capabilities for operations inside a user
namespace can be quite different from its set of capabilities in the
host system.) One of the specific goals of user namespaces is to allow a
process to have root privileges for operations inside the container, while
at the same time being a normal unprivileged process on the wider system
hosting the container.
To support this behavior, each of a process's user IDs has, in effect,
two values: one inside the container and another outside the
container. Similar remarks hold true for group IDs. This duality is
accomplished by maintaining a per-user-namespace mapping of user IDs: each
user namespace has a table that maps user IDs on the host system to
corresponding user IDs in the namespace. This mapping is set and viewed by
writing and reading the /proc/PID/uid_map
pseudo-file, where PID is the process ID of one of the processes in
the user namespace. Thus, for example, user ID 1000 on the host system
might be mapped to user ID 0 inside a namespace; a process with a user ID
of 1000 would thus be a normal user on the host system, but would have root
privileges inside the namespace. If no mapping is provided for a particular
user ID on the host system, then, within the namespace, the user ID is
mapped to the value provided in the file
/proc/sys/kernel/overflowuid (the default value in this file is
65534). Our earlier article went into more details of the
implementation.
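To make the mapping rule concrete, here is a minimal sketch in Python. It assumes the three-field line format used by uid_map (first ID inside the namespace, first ID outside it, and the length of the range), which the text above does not spell out, and it treats "outside" as simply meaning the host system. The function names are illustrative only, and the fallback value of 65534 mirrors the default overflowuid mentioned above.

```python
def parse_id_map(text):
    """Parse uid_map-style text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            inside, outside, length = (int(f) for f in fields)
            entries.append((inside, outside, length))
    return entries


def host_to_namespace_id(entries, host_id, overflow_id=65534):
    """Translate a host ID to the value it has inside the namespace."""
    for inside, outside, length in entries:
        if outside <= host_id < outside + length:
            return inside + (host_id - outside)
    # No mapping covers this ID: fall back to the overflow ID.
    return overflow_id


# The example from the text: host user ID 1000 is root inside the namespace.
example_map = parse_id_map("0 1000 1\n")
print(host_to_namespace_id(example_map, 1000))  # 0
print(host_to_namespace_id(example_map, 1001))  # 65534
```

On a running system, the text passed to parse_id_map() could be read from /proc/PID/uid_map, with the caveat that the "outside" column is reported relative to the reading process's own user namespace.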
One further point worth noting is that the description given in the
previous paragraph looks at things from the perspective of a single user
namespace. However, user namespaces can be nested, with user and group ID
mappings applied at each nesting level. This means that a process might
have distinct user and group IDs in each of the nested user namespaces in
which it is a member.
Eric has assembled a number of namespace-related patch sets for
submission in the upcoming 3.8 merge window. Chief among these is the set that completes the main pieces of the
user namespace infrastructure. With the changes in this set,
unprivileged processes can now create new user namespaces (using
clone(CLONE_NEWUSER)). This is safe, says Eric, because:
    Now that we have been through every permission check in the kernel
    having uid == 0 and gid == 0 in your local user namespace no
    longer adds any special privileges.
    Even having a full set of caps in your local user namespace is safe because
    capabilities are relative to your local user namespace, and do not confer
    unexpected privileges.
The point that Eric is making here is that following the work
(described in our earlier article) to implement the kuid_t and
kgid_t types within the kernel, and the conversion of various
calls to capable() to its namespace analog, ns_capable(),
having a user ID of zero inside a user namespace no longer grants special
privileges outside the namespace. (capable() is the kernel
function that checks whether a process has a capability;
ns_capable() checks whether a process has a capability
inside a namespace.)
The creator of a new user namespace starts off with a full set of
permitted and effective capabilities within the namespace, regardless of
its user ID or capabilities on the host system. The creating process thus
has root privileges, for the purpose of setting up the environment inside
the namespace in preparation for the creation or the addition of other
processes inside the namespace. Among other things, this means that the
(unprivileged) creator of the user namespace (or indeed any process with
suitable capabilities inside the namespace) can in turn create all other
types of namespaces, such as network, mount, and PID namespaces (those
operations require the CAP_SYS_ADMIN capability). Because the
effect of creating those namespaces is limited to the members of the user
namespace, no damage can be done in the host system.
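The following Python sketch shows what this might look like from user space. It is an illustration under several assumptions rather than a reference implementation: it calls unshare() through ctypes instead of using clone() (support for unshare() is among the patches discussed in this article), it hard-codes the CLONE_NEWUSER value from the kernel headers, and it maps only the caller's own user ID to 0 by writing a single line to /proc/self/uid_map. The function name is hypothetical. On kernels or sandboxes that do not permit unprivileged user namespaces, it simply returns None.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # flag value as defined in <linux/sched.h>


def uid_seen_as_namespace_creator():
    """Fork a child that creates a user namespace and maps its own UID to 0.

    Returns the child's user ID as seen inside the new namespace, or None
    if the kernel or sandbox refuses any of the steps.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: do the work, report through the pipe, then exit
        os.close(read_fd)
        message = b"fail"
        try:
            outer_uid = os.geteuid()
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER) == 0:
                # Map our host user ID to 0 inside the new namespace.
                with open("/proc/self/uid_map", "w") as uid_map:
                    uid_map.write(f"0 {outer_uid} 1\n")
                message = str(os.getuid()).encode()
        except Exception:
            pass  # any failure is reported as "fail"
        finally:
            os.write(write_fd, message)
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return int(data) if data.isdigit() else None


# 0 where unprivileged user namespaces are permitted, None otherwise.
print(uid_seen_as_namespace_creator())
```

The work is done in a forked child so that the calling process stays in its original user namespace.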
Other notable user-space changes in Eric's patches include extending the unshare()
system call so that it can be employed to create user namespaces, and extensions that allow a process to use the setns()
system call to enter an existing user namespace.
Looking at some of the other patches in the series gives an idea of
just how subtle some of the details are that must be dealt with in order to
create a workable implementation of user namespaces. For example, one of the patches deals with the behavior of
set-user-ID (and set-group-ID) programs. When a set-user-ID program is
executed (via the execve() system call), the effective user ID of
the process is changed to match the user ID of the executable file. When a
process inside a user namespace executes a set-user-ID program, the effect
is to change the process's effective user ID inside the namespace to
whatever value was mapped for the file user ID. Returning to the example
used above, where user ID 1000 on the host system is mapped to user ID 0
inside the namespace, if a process inside the user namespace executes a
set-user-ID program owned by user ID 1000, then the process will assume an
effective user ID of 0 (inside the namespace).
However, what should be done if the file user ID has no mapping inside
the namespace? One possibility would be for the execve() call to
fail. However, Eric's patch implements another approach: the set-user-ID
bit is ignored in this case, so that the new program is executed, but the
process's effective user ID is left unchanged. Eric's reasoning is that
this mirrors the semantics of executing a set-user-ID program that resides
on a filesystem that was mounted with the MS_NOSUID flag. Those
semantics have been in place since Linux 2.4, so the kernel code paths
for this behavior should be well tested.
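The rule described in the last two paragraphs can be modeled in a few lines of Python. This is a toy model of the decision only, not the kernel's code: the function name is hypothetical, and the ID map is assumed to be a list of (first ID inside the namespace, first ID on the host, length) ranges.

```python
def euid_after_exec(current_euid, file_owner_host_uid, setuid_bit, id_map):
    """Model the effective user ID (inside the namespace) after execve().

    id_map is a list of (inside_start, host_start, length) ranges.
    """
    if not setuid_bit:
        return current_euid
    for inside, host, length in id_map:
        if host <= file_owner_host_uid < host + length:
            # The file owner is mapped: assume the mapped user ID.
            return inside + (file_owner_host_uid - host)
    # The file owner has no mapping: the set-user-ID bit is ignored.
    return current_euid


# Host user IDs 1000..1009 appear as 0..9 inside the namespace.
id_map = [(0, 1000, 10)]
print(euid_after_exec(5, 1000, True, id_map))  # 0
print(euid_after_exec(5, 0, True, id_map))     # 5
```

The second call illustrates the interesting case: a set-user-ID program owned by the host's real root user (ID 0, which has no mapping here) runs, but the process's effective user ID does not change.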
Another notable piece of work in Eric's patch set concerns the files in
the /proc/PID/ns directory. This directory
contains one file for each type of namespace of which the process is a
member (thus, for each process, there are the files ipc, mnt,
net, pid, user, and uts). These files
already serve a couple of purposes. Passing an open file descriptor for one
of these files to setns() allows a process to join an existing
namespace. Holding an open file descriptor for one of these files, or bind
mounting one of the files to some other location in the filesystem, will
keep a namespace alive even if all current members of the namespace
terminate. Among other things, the latter feature allows the piecemeal
construction of the contents of a container. With one of the patches in Eric's recent series, a single
/proc inode is now created per namespace, and the
/proc/PID/ns files are instead implemented as
special symbolic links that refer to that inode. The practical upshot is
that if two processes are in, say, the same user namespace, then calling
stat() on the respective
/proc/PID/ns/user files will return the same inode
numbers (in the st_ino field of the returned stat
structure). This provides a mechanism for discovering if two processes are
in the same namespace, a long-requested feature.
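A check along these lines might be written as follows in Python. The helper name is hypothetical. The sketch also compares the device number (st_dev) alongside the inode number, as a precaution that the text above does not mention, and it assumes that the caller has permission to examine both processes' /proc/PID/ns entries.

```python
import os


def in_same_namespace(pid_a, pid_b, ns_type="user"):
    """Return True if both processes are members of the same namespace.

    ns_type names a file in /proc/PID/ns, such as "user", "net", or "pid".
    """
    # os.stat() follows the special symbolic link to the namespace's inode.
    st_a = os.stat(f"/proc/{pid_a}/ns/{ns_type}")
    st_b = os.stat(f"/proc/{pid_b}/ns/{ns_type}")
    return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)


# A process is trivially in the same user namespace as itself.
print(in_same_namespace(os.getpid(), os.getpid()))  # True
```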
This article has covered just the patch set to complete the user
namespace implementation. However, at the same time, Eric is pushing a
number of related patch sets towards the mainline, including: changes to the networking stack so that user
namespace root users can create network namespaces; enhancements and clean-ups of the PID namespace
code that, among other things, add unshare() and
setns() support for PID namespaces; enhancements to the mount namespace code that
allow user namespace root users to call chroot() and to create and
manipulate mount namespaces; and a series of
patches that add support for user namespaces to a number of file
systems that do not yet provide that support.
It's worth emphasizing one of the points that Eric noted
in a documentation patch for the user
namespace work, and elaborated on in a private mail. Beyond the
practicalities of supporting containers, there is another significant
driving force behind the user namespaces work: to free the UNIX/Linux API
of the "handcuffs" imposed by set-user-ID and set-group-ID programs. Many
of the user-space APIs provided by the kernel are root-only simply to
prevent the possibility of accidentally or maliciously distorting the
run-time environment of privileged programs, with the effect that those
programs are confused into doing something that they were not designed to
do. By limiting the effect of root privileges to a user namespace, and
allowing unprivileged users to create user namespaces, it now becomes
possible to give non-root programs access to interesting functionality that
was formerly limited to the root user.
There have been a few Acked-by: mails sent in response to
Eric's patches, and a few small questions, but the patches have otherwise
passed largely without comment, and no one has raised objections. It seems
likely that this is because the patches have been around in one form or
another for a considerable period, and Eric has gone to considerable effort
to address objections that were raised earlier during the user namespaces
work. Thus, it seems that there's a good chance that Eric's pull request to have the patches merged in the
currently open 3.8 merge window will be successful, and that a complete
implementation of user namespaces is now very close to reality.