By Michael Kerrisk
February 27, 2013
Continuing our ongoing series on namespaces, this
article looks more closely at user namespaces, a feature whose
implementation was (largely) completed in Linux 3.8. (The remaining work
consists of changes for XFS and a number of other filesystems; the latter has
already been merged for 3.9.) User namespaces allow per-namespace mappings
of user and group IDs. This means that a process's user and group IDs
inside a user namespace can be different from its IDs outside of the
namespace. Most notably, a process can have a nonzero user ID outside a
namespace while at the same time having a user ID of zero inside the
namespace; in other words, the process is unprivileged for operations
outside the user namespace but has root privileges inside the namespace.
Creating user namespaces
User namespaces are created by specifying
the CLONE_NEWUSER flag when
calling clone() or unshare(). Starting with Linux 3.8
(and unlike the flags used for creating other types of namespaces), no
privilege is required to create a user namespace. In our examples below,
all of the user namespaces are created using the unprivileged user ID 1000.
To begin investigating user namespaces, we'll make use of a small
program, demo_userns.c, that
creates a child in a new user namespace. The child simply displays its
effective user and group IDs as well as its capabilities.
Running this program as an unprivileged user produces the following result:
$ id -u # Display effective user ID of shell process
1000
$ id -g # Effective group ID of shell
1000
$ ./demo_userns
eUID = 65534; eGID = 65534; capabilities: =ep
The output from this program shows some interesting details. One of
these is the capabilities that were assigned to the child process. The
string "=ep" (produced by the library function
cap_to_text(), which converts capability sets to a textual
representation) indicates that the child has a full set of permitted and
effective capabilities, even though the program was run from an
unprivileged account. When a user namespace is created, the first process
in the namespace is granted a full set of capabilities in the namespace.
This allows that process to perform any initializations that are necessary
in the namespace before other process are created in the namespace.
The second point of interest is the user and group IDs of the child
process. As noted above, a process's user and group IDs inside and outside
a user namespace can be different. However, there needs to be a mapping
from the user IDs inside a user namespace to a corresponding set of user
IDs outside the namespace; the same is true of group IDs. This allows the
system to perform the appropriate permission checks when a process in a
user namespace performs operations that affect the wider system (e.g.,
sending a signal to a process outside the namespace or accessing a file).
System calls that return process user and group IDs—for example,
getuid() and getgid()—always return credentials as
they appear inside the user namespace in which the calling process resides.
If a user ID has no mapping inside the namespace, then system calls that
return user IDs return the value defined in the file
/proc/sys/kernel/overflowuid, which on a standard system defaults
to the value 65534. Initially, a user namespace has no user ID mapping, so
all user IDs inside the namespace map to this value. Likewise, a new user
namespace has no mappings for group IDs, and all unmapped group IDs map to
/proc/sys/kernel/overflowgid (which has the same default as
overflowuid).
There is one other important point worth noting that can't be gleaned from
the output above. Although the new process has a full set of capabilities
in the new user namespace, it has no capabilities in the parent namespace.
This is true regardless of the credentials and capabilities of the process
that calls clone(). In particular, even if root employs
clone(CLONE_NEWUSER), the resulting child process will have no
capabilities in the parent namespace.
One final point to be made about the creation of user namespaces is
that namespaces can be nested; that is, each user namespace (other than the
initial user namespace) has a parent user namespace, and can have zero or
more child user namespaces. The parent of a user namespace is the user
namespace of the process that creates the user namespace via a call to
clone() or unshare() with the CLONE_NEWUSER
flag. The significance of the parent-child relationship between user
namespaces will become clearer in the remainder of this article.
Mapping user and group IDs
Normally, one of the first steps after creating a new user namespace is to define
the mappings used for the user and group IDs of the processes that will be
created in that namespace.
This is done by writing mapping information to the
/proc/PID/uid_map and
/proc/PID/gid_map files corresponding to one of
the processes in the user namespace. (Initially, these two files are
empty.) This information consists of one or more lines, each of which
contains three values separated by white space:
ID-inside-ns ID-outside-ns length
Together, the ID-inside-ns and length values define a
range of IDs inside the namespace that are to be mapped to an ID range of
the same length outside the namespace. The ID-outside-ns value
specifies the starting point of the outside range. How
ID-outside-ns is interpreted depends on the whether the process
opening the file /proc/PID/uid_map
(or /proc/PID/gid_map) is in the same user namespace
as the process PID:
-
If the two processes are in the same namespace, then ID-outside-ns
is interpreted as a user ID (group ID) in the parent user namespace of the
process PID. The common case here is that a process is writing to
its own mapping file (/proc/self/uid_map or
/proc/self/gid_map).
-
If the two processes are in different namespaces, then
ID-outside-ns is interpreted as a user ID (group ID) in the user
namespace of the process opening /proc/PID/uid_map
(/proc/PID/gid_map). The writing process is then
defining the mapping relative to its own user namespace.
Suppose that we once more invoke our demo_userns program, but
this time with a single command-line argument (any string). This causes
the program to loop, continuously displaying credentials and capabilities
every few seconds:
$ ./demo_userns x
eUID = 65534; eGID = 65534; capabilities: =ep
eUID = 65534; eGID = 65534; capabilities: =ep
Now we switch to another terminal window—to a shell process
running in another namespace (namely, the parent user namespace of the
process running demo_userns) and create a user ID mapping for the
child process in the new user namespace created by demo_userns:
$ ps -C demo_userns -o 'pid uid comm' # Determine PID of clone child
PID UID COMMAND
4712 1000 demo_userns # This is the parent
4713 1000 demo_userns # Child in a new user namespace
$ echo '0 1000 1' > /proc/4713/uid_map
If we return to the window running demo_userns, we now see:
eUID = 0; eGID = 65534; capabilities: =ep
In other words, the user ID 1000 in the parent user namespace (which
was formerly mapped to 65534) has been
mapped to user ID 0 in the user namespace created by demo_userns.
From this point, all operations within the new user namespace that deal
with this user ID will see the number 0, while corresponding operations in
the parent user namespace will see the same process as having user ID 1000.
We can likewise create a mapping for group IDs in the new user
namespace. Switching to another terminal window, we create a mapping for
the single group ID 1000 in the parent user namespace to the group ID 0 in
the new user namespace:
$ echo '0 1000 1' > /proc/4713/gid_map
Switching back to the window running demo_userns, we see that
change reflected in the display of the effective group ID:
eUID = 0; eGID = 0; capabilities: =ep
Rules for writing to mapping files
There are a number of rules governing writing to uid_map
files; analogous rules apply for writing to gid_map files. The
most important of these rules are as follows.
Defining a mapping is a one-time operation per namespace: we can perform only a
single write (that may contain multiple newline-delimited records) to a
uid_map file of exactly one of the processes in the user
namespace. Furthermore, the number of lines that may be written to the file
is currently limited to five (an arbitrary limit that may be increased in
the future).
The /proc/PID/uid_map file is owned by the user
ID that created the namespace, and is writeable only by that user (or a
privileged user). In addition, all of the following requirements must be
met:
-
The writing process must have the CAP_SETUID (CAP_SETGID
for gid_map) capability in the user namespace of the process
PID.
-
Regardless of capabilities, the writing process must be in either the user
namespace of the process PID or inside the (immediate) parent user namespace of
the process PID.
-
One of the following must be true:
-
The data written to uid_map (gid_map) consists of a
single line that maps (only) the writing process's effective user ID (group ID) in the
parent user namespace to a user ID (group ID) in the user namespace.
This rule allows the initial process in a user namespace (i.e., the child
created by clone()) to write a mapping for its
own user ID (group ID).
-
The process has the CAP_SETUID (CAP_SETGID for
gid_map) capability in the parent user namespace. Such a process
can define mappings to arbitrary user IDs (group IDs) in the parent user
namespace. As we noted earlier, the initial process in a new user
namespace has no capabilities in the parent namespace. Thus, only a process
in the parent namespace can write a mapping that maps arbitrary IDs in the
parent user namespace.
Capabilities, execve(), and user ID 0
In an earlier article in this series, we developed the ns_child_exec program.
This program uses clone() to create a child process in new
namespaces specified by command-line options and then executes a shell
command in the child process.
Suppose that we use this program to execute a shell in a new user
namespace and then within that shell we try to define the user ID mapping for
the new user namespace. In doing so, we run into a problem:
$ ./ns_child_exec -U bash
$ echo '0 1000 1' > /proc/$$/uid_map # $$ is the PID of the shell
bash: echo: write error: Operation not permitted
This error occurs because the shell has no capabilities inside the new user
namespace, as can be seen from the following commands:
$ id -u # Verify that user ID and group ID are not mapped
65534
$ id -g
65534
$ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
The problem occurred at the execve() call that executed the
bash shell: when a process with non-zero user IDs performs an
execve(), the process's capability sets are cleared. (The capabilities(7)
manual page details the treatment of capabilities during an
execve().)
To avoid this problem, it is necessary to create a user ID mapping
inside the user namespace before performing the
execve(). This is not possible with the ns_child_exec
program; we need a slightly enhanced version of the program that does allow
this.
The userns_child_exec.c
program performs the same task as the ns_child_exec program, and
has the same command-line interface, except that it allows two additional
command-line options, -M and -G. These options accept
string arguments that are used to define user and group ID maps for the new
user namespace. For example, the following command maps both user ID 1000
and group ID 1000 to 0 in the new user namespace:
$ ./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' bash
This time, updating the mapping files succeeds, and we see that the shell
has the expected user ID, group ID, and capabilities:
$ id -u
0
$ id -g
0
$ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
There are some subtleties to the implementation of the
userns_child_exec program. First, either the parent process
(i.e., the caller of clone()) or the new child process could
update the user ID and group ID maps of the new user namespace. However, following
the rules above, the only kind of mapping that the child process could
define would be one that maps just its own effective user ID. If we want to
define arbitrary user and group ID mappings in the child, then that must be done
by the parent process. Furthermore, the parent process must have suitable
capabilities, namely CAP_SETUID, CAP_SETGID, and (to ensure that
the parent has the permissions needed to open the mapping files)
CAP_DAC_OVERRIDE.
Furthermore, the parent must ensure that it updates the mapping files
before the child calls execve() (otherwise we have exactly the
problem described above, where the child will lose capabilities during the
execve()). To do this, the two processes employ a pipe to ensure
the required synchronization; comments in the program source code give full
details.
Viewing user and group ID mappings
The examples so far showed the use of
/proc/PID/uid_map and
/proc/PID/gid_map files for defining a
mapping. These files can also be used to view the mappings governing a
process. As when writing to these files, the second
(ID-outside-ns) value is interpreted according to which process is
opening the file. If the process opening the file is in the same user
namespace as the process PID, then ID-outside-ns is defined
with respect to the parent user namespace. If the process opening the file
is in a different user namespace, then ID-outside-ns is defined
with respect to the user namespace of the process opening the file.
We can illustrate this by creating a couple of user namespaces running
shells, and
examining the uid_map files of the processes in the namespaces.
We begin by creating a new user namespace with a process running a shell:
$ id -u # Display effective user ID
1000
$ ./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' bash
$ echo $$ # Show shell's PID for later reference
2465
$ cat /proc/2465/uid_map
0 1000 1
$ id -u # Mapping gives this process an effective user ID of 0
0
Now suppose we switch to another terminal window and create a sibling
user namespace that employs different user and group ID mappings:
$ ./userns_child_exec -U -M '200 1000 1' -G '200 1000 1' bash
$ cat /proc/self/uid_map
200 1000 1
$ id -u # Mapping gives this process an effective user ID of 200
200
$ echo $$ # Show shell's PID for later reference
2535
Continuing in the second terminal window, which is running in the
second user namespace, we view the user ID mapping of the process in the other
user namespace:
$ cat /proc/2465/uid_map
0 200 1
The output of this command shows that user ID 0 in the other user
namespace maps to user ID 200 in this namespace. Note
that the same command produced different output when executed in the other
user namespace, because the kernel generates the ID-outside-ns
value according to the user namespace of the process that is reading from
the file.
If we switch back to the first terminal window, and display the user ID mapping
file for the process in the second user namespace, we see the converse
mapping:
$ cat /proc/2535/uid_map
200 0 1
Again, the output here is different from the same command when executed
in the second user namespace, because the ID-outside-ns value is
generated according to the user namespace of the process that is reading
from the file. Of course, in the initial namespace, user ID 0 in the first
namespace and user ID 200 in the second namespace both map to user ID
1000. We can verify this by executing the following commands in a third
shell window inside the initial user namespace:
$ cat /proc/2465/uid_map
0 1000 1
$ cat /proc/2535/uid_map
200 1000 1
Concluding remarks
In this article, we've looked at the basics of user namespaces: creating a
user namespace, using user and group ID map files, and the interaction of user
namespaces and capabilities.
As we noted in an earlier article, one
of the motivations for implementing user namespaces is to give non-root
applications access to functionality that was formerly limited to the root
user. In traditional UNIX systems, various pieces of functionality have
been limited to the root user in order to prevent unprivileged users from
manipulating the runtime environment of privileged programs, which could
affect the operation of those programs in unexpected or
undesirable ways.
A user namespace allows a process (that is unprivileged
outside the namespace) to have root privileges while at the same time
limiting the scope of that privilege to the namespace, with the result that
the process cannot manipulate the runtime environment of privileged
programs in the wider system. In order to use these root privileges
meaningfully, we need to combine user namespaces with other types of
namespaces—that topic will form the subject of the next article in
this series.
Comments (16 posted)
Brief items
properly quote rpath $ORIGIN so it can be passed from make to shell to configure to generated Makefile to libtool to invoked gcc without loss of valuable dollars.
It is an open question to which extent this commit should be credited to the designers of sh, autoconf, libtool, make, and/or Solaris ld.
—
Michael
Stahl (hat tip to Cesar Eduardo Barros)
Comments (40 posted)
Version
2.0.0 of the Ruby language is now available. "
Ruby 2.0.0 is the
first stable release of the Ruby 2.0 series, with many new features and
improvements in response to the increasingly diverse and expanding demands
for Ruby." Changes include keyword arguments, UTF-8 encoding by
default, a number of new libraries, some performance improvements, and
more. See
this
article for more information about the changes in this release.
Comments (13 posted)
Version 1.5 of the Django web framework is
available;
new features include a new configurable user model, Python 3 support,
a lot of documentation improvements, and more; see
the release
notes for details.
Comments (8 posted)
Dirk Hohndel has announced the release of Subsurface 3.0, the open source dive-logging program. Improvements include map display of GPS locations, a dive planner, automatic dive numbering, and support for many new dive computers.
Full Story (comments: none)
Version 10.0 of BIND has been released. Although known to many as a DNS server, BIND 10 provides a number of additional features, including "dynamic DNS, zone
transfers, and experimental forwarding and recursive name service," plus statistics collection, reporting, and remote configuration. Among the many changes are DDNS (Dynamic Updates) support, an SQLite3 backend, and a "semi-interactive client to conveniently
look at and set some configuration settings."
Full Story (comments: none)
GNOME 3.7.90, the first beta release of the development cycle that will eventually become GNOME 3.8, is now available. Lengthy changelogs for both the core and the base applications are available.
Full Story (comments: none)
Newsletters and articles
Comments (none posted)
Xiph.org, the purveyors of widely-used open audio and video codecs, has released Digital Show and Tell, a video demonstrating facets of digital audio processing such as "of sampling, quantization, bit-depth, and dither show
digital audio behavior on real audio equipment using both modern
digital analysis and vintage analog bench equipment... just in case we
can't trust those newfangled digital gizmos." Accompanying the video is source code with which interested viewers can reproduce the demos shown.
Full Story (comments: 1)
Page editor: Nathan Willis
Next page: Announcements>>