Development

Namespaces in operation, part 5: User namespaces

By Michael Kerrisk
February 27, 2013

Continuing our ongoing series on namespaces, this article looks more closely at user namespaces, a feature whose implementation was (largely) completed in Linux 3.8. (The remaining work consists of changes for XFS and a number of other filesystems; the latter has already been merged for 3.9.) User namespaces allow per-namespace mappings of user and group IDs. This means that a process's user and group IDs inside a user namespace can be different from its IDs outside of the namespace. Most notably, a process can have a nonzero user ID outside a namespace while at the same time having a user ID of zero inside the namespace; in other words, the process is unprivileged for operations outside the user namespace but has root privileges inside the namespace.

Creating user namespaces

User namespaces are created by specifying the CLONE_NEWUSER flag when calling clone() or unshare(). Starting with Linux 3.8 (and unlike the flags used for creating other types of namespaces), no privilege is required to create a user namespace. In our examples below, all of the user namespaces are created using the unprivileged user ID 1000.

To begin investigating user namespaces, we'll make use of a small program, demo_userns.c, that creates a child in a new user namespace. The child simply displays its effective user and group IDs as well as its capabilities. Running this program as an unprivileged user produces the following result:

    $ id -u          # Display effective user ID of shell process
    1000
    $ id -g          # Effective group ID of shell
    1000
    $ ./demo_userns 
    eUID = 65534;  eGID = 65534;  capabilities: =ep

The output from this program shows some interesting details. One of these is the capabilities that were assigned to the child process. The string "=ep" (produced by the library function cap_to_text(), which converts capability sets to a textual representation) indicates that the child has a full set of permitted and effective capabilities, even though the program was run from an unprivileged account. When a user namespace is created, the first process in the namespace is granted a full set of capabilities in the namespace. This allows that process to perform any initializations that are necessary in the namespace before other processes are created in the namespace.

The second point of interest is the user and group IDs of the child process. As noted above, a process's user and group IDs inside and outside a user namespace can be different. However, there needs to be a mapping from the user IDs inside a user namespace to a corresponding set of user IDs outside the namespace; the same is true of group IDs. This allows the system to perform the appropriate permission checks when a process in a user namespace performs operations that affect the wider system (e.g., sending a signal to a process outside the namespace or accessing a file).

System calls that return process user and group IDs—for example, getuid() and getgid()—always return credentials as they appear inside the user namespace in which the calling process resides. If a user ID has no mapping inside the namespace, then system calls that return user IDs return the value defined in the file /proc/sys/kernel/overflowuid, which on a standard system defaults to the value 65534. Initially, a user namespace has no user ID mapping, so all user IDs inside the namespace map to this value. Likewise, a new user namespace has no mappings for group IDs, and all unmapped group IDs map to /proc/sys/kernel/overflowgid (which has the same default as overflowuid).
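As a rough illustration of the behavior just described, the following Python sketch (it is not part of demo_userns.c) reads the overflow user ID and then tries to create a user namespace in a forked child via unshare(), reporting the effective user ID that the child sees. The helper names read_overflow_id() and euid_in_new_userns() are our own, the CLONE_NEWUSER value is copied from <linux/sched.h>, and the sketch assumes a Linux system whose C library can be reached through ctypes. Where the kernel or a sandbox refuses unprivileged user namespaces, the second function simply returns None.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # flag value from <linux/sched.h>


def read_overflow_id(path="/proc/sys/kernel/overflowuid", default=65534):
    """Return the ID reported for unmapped user IDs (falls back to 65534)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def euid_in_new_userns():
    """Fork a child, unshare a user namespace there, and return the child's
    effective user ID as seen inside the namespace (None if unshare fails)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child
        status = 1
        try:
            os.close(r)
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER) == 0:
                # No mapping has been written yet, so this is unmapped.
                os.write(w, str(os.geteuid()).encode())
                status = 0
        finally:
            os._exit(status)          # never return into the caller's code
    os.close(w)                       # parent
    with os.fdopen(r) as f:
        data = f.read()
    os.waitpid(pid, 0)
    return int(data) if data else None


if __name__ == "__main__":
    print("overflow uid:", read_overflow_id())
    print("eUID in new user namespace:", euid_in_new_userns())
```

If the namespace can be created, the second value printed should match the first, since the new namespace has no user ID mapping yet.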

There is one other important point worth noting that can't be gleaned from the output above. Although the new process has a full set of capabilities in the new user namespace, it has no capabilities in the parent namespace. This is true regardless of the credentials and capabilities of the process that calls clone(). In particular, even if root employs clone(CLONE_NEWUSER), the resulting child process will have no capabilities in the parent namespace.

One final point to be made about the creation of user namespaces is that namespaces can be nested; that is, each user namespace (other than the initial user namespace) has a parent user namespace, and can have zero or more child user namespaces. The parent of a user namespace is the user namespace of the process that creates the user namespace via a call to clone() or unshare() with the CLONE_NEWUSER flag. The significance of the parent-child relationship between user namespaces will become clearer in the remainder of this article.

Mapping user and group IDs

Normally, one of the first steps after creating a new user namespace is to define the mappings used for the user and group IDs of the processes that will be created in that namespace. This is done by writing mapping information to the /proc/PID/uid_map and /proc/PID/gid_map files corresponding to one of the processes in the user namespace. (Initially, these two files are empty.) This information consists of one or more lines, each of which contains three values separated by white space:

    ID-inside-ns   ID-outside-ns   length

Together, the ID-inside-ns and length values define a range of IDs inside the namespace that are to be mapped to an ID range of the same length outside the namespace. The ID-outside-ns value specifies the starting point of the outside range. How ID-outside-ns is interpreted depends on whether the process opening the file /proc/PID/uid_map (or /proc/PID/gid_map) is in the same user namespace as the process PID:

  • If the two processes are in the same namespace, then ID-outside-ns is interpreted as a user ID (group ID) in the parent user namespace of the process PID. The common case here is that a process is writing to its own mapping file (/proc/self/uid_map or /proc/self/gid_map).
  • If the two processes are in different namespaces, then ID-outside-ns is interpreted as a user ID (group ID) in the user namespace of the process opening /proc/PID/uid_map (/proc/PID/gid_map). The writing process is then defining the mapping relative to its own user namespace.
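The three-field format can be made concrete with a small sketch. The Python below is only a model of the arithmetic implied by the format, not kernel code; the names Extent, parse_id_map(), to_outside(), and to_inside() are hypothetical, and the overflow default of 65534 is the standard value mentioned earlier.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    inside: int    # first ID inside the namespace
    outside: int   # first ID outside the namespace
    length: int    # number of consecutive IDs covered


def parse_id_map(text: str) -> List[Extent]:
    """Parse the contents of a uid_map or gid_map file."""
    extents = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, length = (int(field) for field in line.split())
        extents.append(Extent(inside, outside, length))
    return extents


def to_outside(extents: List[Extent], inside_id: int) -> Optional[int]:
    """Translate an ID inside the namespace to the outside ID (None if unmapped)."""
    for e in extents:
        if e.inside <= inside_id < e.inside + e.length:
            return e.outside + (inside_id - e.inside)
    return None


def to_inside(extents: List[Extent], outside_id: int, overflow: int = 65534) -> int:
    """Translate an outside ID to the ID seen inside the namespace.
    Unmapped IDs are reported as the overflow value."""
    for e in extents:
        if e.outside <= outside_id < e.outside + e.length:
            return e.inside + (outside_id - e.outside)
    return overflow


if __name__ == "__main__":
    mapping = parse_id_map("0 1000 1\n")
    print(to_inside(mapping, 1000), to_inside(mapping, 1001), to_outside(mapping, 0))
```

With the single line "0 1000 1", outside ID 1000 appears as 0 inside the namespace, while any other outside ID appears as the overflow value.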

Suppose that we once more invoke our demo_userns program, but this time with a single command-line argument (any string). This causes the program to loop, continuously displaying credentials and capabilities every few seconds:

    $ ./demo_userns x
    eUID = 65534;  eGID = 65534;  capabilities: =ep
    eUID = 65534;  eGID = 65534;  capabilities: =ep

Now we switch to another terminal window, that is, to a shell process running in another namespace (namely, the parent user namespace of the process running demo_userns), and create a user ID mapping for the child process in the new user namespace created by demo_userns:

    $ ps -C demo_userns -o 'pid uid comm'      # Determine PID of clone child
      PID   UID COMMAND 
     4712  1000 demo_userns                    # This is the parent
     4713  1000 demo_userns                    # Child in a new user namespace
    $ echo '0 1000 1' > /proc/4713/uid_map

If we return to the window running demo_userns, we now see:

    eUID = 0;  eGID = 65534;  capabilities: =ep

In other words, the user ID 1000 in the parent user namespace (which was formerly mapped to 65534) has been mapped to user ID 0 in the user namespace created by demo_userns. From this point, all operations within the new user namespace that deal with this user ID will see the number 0, while corresponding operations in the parent user namespace will see the same process as having user ID 1000.

We can likewise create a mapping for group IDs in the new user namespace. Switching to another terminal window, we create a mapping for the single group ID 1000 in the parent user namespace to the group ID 0 in the new user namespace:

    $ echo '0 1000 1' > /proc/4713/gid_map

Switching back to the window running demo_userns, we see that change reflected in the display of the effective group ID:

    eUID = 0;  eGID = 0;  capabilities: =ep

Rules for writing to mapping files

There are a number of rules governing writing to uid_map files; analogous rules apply for writing to gid_map files. The most important of these rules are as follows.

Defining a mapping is a one-time operation per namespace: we can perform only a single write (that may contain multiple newline-delimited records) to a uid_map file of exactly one of the processes in the user namespace. Furthermore, the number of lines that may be written to the file is currently limited to five (an arbitrary limit that may be increased in the future).
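Because only one write is allowed, a program that defines a multi-line mapping needs to assemble all of the records first and hand them to the kernel in a single write() call. The following Python sketch shows one way to do that; build_map_payload() and write_map_once() are hypothetical helper names, and the limit of five lines is the one stated above for Linux 3.8.

```python
import os

MAX_MAP_LINES = 5  # limit described in the text; may differ on other kernels


def build_map_payload(entries):
    """Turn (inside, outside, length) tuples into the text for one write."""
    if not entries:
        raise ValueError("at least one mapping line is required")
    if len(entries) > MAX_MAP_LINES:
        raise ValueError("too many mapping lines")
    lines = []
    for inside, outside, length in entries:
        if inside < 0 or outside < 0 or length <= 0:
            raise ValueError("IDs must be non-negative and length positive")
        lines.append(f"{inside} {outside} {length}")
    return "\n".join(lines) + "\n"


def write_map_once(path, entries):
    """Write the whole mapping with a single write() call.
    Returns the number of bytes written."""
    payload = build_map_payload(entries).encode()
    fd = os.open(path, os.O_WRONLY)
    try:
        return os.write(fd, payload)
    finally:
        os.close(fd)
```

A call such as write_map_once("/proc/4713/uid_map", [(0, 1000, 1)]) would then correspond to the echo command used earlier, subject to the permission rules that follow.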

The /proc/PID/uid_map file is owned by the user ID that created the namespace, and is writeable only by that user (or a privileged user). In addition, all of the following requirements must be met:

  • The writing process must have the CAP_SETUID (CAP_SETGID for gid_map) capability in the user namespace of the process PID.
  • Regardless of capabilities, the writing process must be in either the user namespace of the process PID or inside the (immediate) parent user namespace of the process PID.
  • One of the following must be true:
    • The data written to uid_map (gid_map) consists of a single line that maps (only) the writing process's effective user ID (group ID) in the parent user namespace to a user ID (group ID) in the user namespace. This rule allows the initial process in a user namespace (i.e., the child created by clone()) to write a mapping for its own user ID (group ID).
    • The process has the CAP_SETUID (CAP_SETGID for gid_map) capability in the parent user namespace. Such a process can define mappings to arbitrary user IDs (group IDs) in the parent user namespace. As we noted earlier, the initial process in a new user namespace has no capabilities in the parent namespace. Thus, only a process in the parent namespace can write a mapping that maps arbitrary IDs in the parent user namespace.
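The requirements in this list can be read as a decision procedure. The Python sketch below encodes them as a simplified model for uid_map; it does not cover the file-ownership check or the one-write rule, and the Writer class, its field names, and may_write_uid_map() are all our own invention rather than anything the kernel exposes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Writer:
    cap_setuid_in_target_ns: bool   # CAP_SETUID in the namespace of process PID
    in_target_ns: bool              # writer is in the same user namespace as PID
    in_parent_ns: bool              # writer is in the immediate parent namespace
    cap_setuid_in_parent_ns: bool   # CAP_SETUID in the parent namespace
    euid_in_parent_ns: int          # writer's effective UID, as seen in the parent


def may_write_uid_map(writer: Writer, entries: List[Tuple[int, int, int]]) -> bool:
    """Apply the listed rules to a proposed mapping.
    Each entry is (ID-inside-ns, ID-outside-ns, length)."""
    if not writer.cap_setuid_in_target_ns:
        return False
    if not (writer.in_target_ns or writer.in_parent_ns):
        return False
    if writer.cap_setuid_in_parent_ns:
        return True                 # arbitrary mappings are allowed
    # Otherwise: a single line that maps only the writer's own effective UID.
    if len(entries) != 1:
        return False
    _inside, outside, length = entries[0]
    return length == 1 and outside == writer.euid_in_parent_ns


if __name__ == "__main__":
    child = Writer(True, True, False, False, 1000)
    print(may_write_uid_map(child, [(0, 1000, 1)]))   # own UID only
    print(may_write_uid_map(child, [(0, 1000, 10)]))  # a range of ten IDs
```

In this model, the initial process in a new namespace may map its own user ID but not a range of ten IDs starting there.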

Capabilities, execve(), and user ID 0

In an earlier article in this series, we developed the ns_child_exec program. This program uses clone() to create a child process in new namespaces specified by command-line options and then executes a shell command in the child process.

Suppose that we use this program to execute a shell in a new user namespace and then within that shell we try to define the user ID mapping for the new user namespace. In doing so, we run into a problem:

    $ ./ns_child_exec -U  bash
    $ echo '0 1000 1' > /proc/$$/uid_map       # $$ is the PID of the shell
    bash: echo: write error: Operation not permitted

This error occurs because the shell has no capabilities inside the new user namespace, as can be seen from the following commands:

    $ id -u         # Verify that user ID and group ID are not mapped
    65534
    $ id -g
    65534
    $ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000

The problem occurred at the execve() call that executed the bash shell: when a process with non-zero user IDs performs an execve(), the process's capability sets are cleared. (The capabilities(7) manual page details the treatment of capabilities during an execve().)

To avoid this problem, it is necessary to create a user ID mapping inside the user namespace before performing the execve(). This is not possible with the ns_child_exec program; we need a slightly enhanced version of the program that does allow this.

The userns_child_exec.c program performs the same task as the ns_child_exec program, and has the same command-line interface, except that it allows two additional command-line options, -M and -G. These options accept string arguments that are used to define user and group ID maps for the new user namespace. For example, the following command maps both user ID 1000 and group ID 1000 to 0 in the new user namespace:

    $ ./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' bash

This time, updating the mapping files succeeds, and we see that the shell has the expected user ID, group ID, and capabilities:

    $ id -u
    0
    $ id -g
    0
    $ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
    CapInh: 0000000000000000
    CapPrm: 0000001fffffffff
    CapEff: 0000001fffffffff
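The hexadecimal masks in /proc/PID/status are bit sets in which bit N stands for capability number N. As a small aid to reading them, the following Python sketch extracts the three sets and tests individual bits; parse_cap_sets() and has_cap() are hypothetical helpers, and the capability numbers listed are those defined in <linux/capability.h>.

```python
import re

# Capability numbers from <linux/capability.h> (only the ones discussed here).
CAP_NUMBERS = {"CAP_DAC_OVERRIDE": 1, "CAP_SETGID": 6, "CAP_SETUID": 7}

_CAP_LINE = re.compile(r"^(Cap(?:Inh|Prm|Eff)):\s*([0-9a-fA-F]+)[ \t]*$", re.MULTILINE)


def parse_cap_sets(status_text):
    """Return {'CapInh': int, 'CapPrm': int, 'CapEff': int} from status text."""
    return {m.group(1): int(m.group(2), 16) for m in _CAP_LINE.finditer(status_text)}


def has_cap(mask, cap_name):
    """True if the bit for cap_name is set in the mask."""
    return bool((mask >> CAP_NUMBERS[cap_name]) & 1)


if __name__ == "__main__":
    sample = (
        "CapInh: 0000000000000000\n"
        "CapPrm: 0000001fffffffff\n"
        "CapEff: 0000001fffffffff\n"
    )
    sets = parse_cap_sets(sample)
    print(bin(sets["CapEff"]).count("1"), "capabilities in the effective set")
    print("CAP_SETUID effective:", has_cap(sets["CapEff"], "CAP_SETUID"))
```

Applied to the output above, the effective set has its 37 lowest bits set, which includes CAP_SETUID and CAP_SETGID; the all-zero masks shown in the earlier failed attempt contain neither.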

There are some subtleties to the implementation of the userns_child_exec program. First, either the parent process (i.e., the caller of clone()) or the new child process could update the user ID and group ID maps of the new user namespace. However, following the rules above, the only kind of mapping that the child process could define would be one that maps just its own effective user ID. If we want to define arbitrary user and group ID mappings in the child, then that must be done by the parent process. Furthermore, the parent process must have suitable capabilities, namely CAP_SETUID, CAP_SETGID, and (to ensure that the parent has the permissions needed to open the mapping files) CAP_DAC_OVERRIDE.

Second, the parent must ensure that it updates the mapping files before the child calls execve() (otherwise we have exactly the problem described above, where the child will lose capabilities during the execve()). To do this, the two processes employ a pipe to ensure the required synchronization; comments in the program source code give full details.
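One common way to arrange this kind of synchronization is for the child to block reading from a pipe until the parent closes the write end. The Python sketch below shows that pattern in isolation, without creating any namespaces; it is an assumption about the general technique rather than a transcription of userns_child_exec.c, and run_synchronized(), parent_setup, and child_action are hypothetical names.

```python
import os


def run_synchronized(parent_setup, child_action):
    """Fork a child that waits until parent_setup(child_pid) has finished.

    The child blocks reading a pipe; the parent signals completion by
    closing the write end, which the child sees as end-of-file. Returns
    the child's exit status (or -1 if it did not exit normally)."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child
        status = 1
        try:
            os.close(write_end)           # or EOF would never arrive
            while os.read(read_end, 1):   # returns b"" at end-of-file
                pass
            os.close(read_end)
            # In a real program this is where execve() would be called.
            status = int(child_action())
        finally:
            os._exit(status)
    os.close(read_end)                    # parent
    try:
        parent_setup(pid)                 # e.g. write uid_map and gid_map
    finally:
        os.close(write_end)               # releases the child
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) if os.WIFEXITED(wait_status) else -1
```

The child must close its own copy of the write end first; otherwise the pipe never reaches end-of-file and the child would wait forever.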

Viewing user and group ID mappings

The examples so far showed the use of /proc/PID/uid_map and /proc/PID/gid_map files for defining a mapping. These files can also be used to view the mappings governing a process. As when writing to these files, the second (ID-outside-ns) value is interpreted according to which process is opening the file. If the process opening the file is in the same user namespace as the process PID, then ID-outside-ns is defined with respect to the parent user namespace. If the process opening the file is in a different user namespace, then ID-outside-ns is defined with respect to the user namespace of the process opening the file.

We can illustrate this by creating a couple of user namespaces running shells, and examining the uid_map files of the processes in the namespaces. We begin by creating a new user namespace with a process running a shell:

    $ id -u            # Display effective user ID
    1000
    $ ./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' bash
    $ echo $$          # Show shell's PID for later reference
    2465
    $ cat /proc/2465/uid_map
             0       1000          1
    $ id -u            # Mapping gives this process an effective user ID of 0
    0

Now suppose we switch to another terminal window and create a sibling user namespace that employs different user and group ID mappings:

    $ ./userns_child_exec -U -M '200 1000 1' -G '200 1000 1' bash
    $ cat /proc/self/uid_map
           200       1000          1
    $ id -u            # Mapping gives this process an effective user ID of 200
    200
    $ echo $$          # Show shell's PID for later reference
    2535

Continuing in the second terminal window, which is running in the second user namespace, we view the user ID mapping of the process in the other user namespace:

    $ cat /proc/2465/uid_map
             0        200          1

The output of this command shows that user ID 0 in the other user namespace maps to user ID 200 in this namespace. Note that the same command produced different output when executed in the other user namespace, because the kernel generates the ID-outside-ns value according to the user namespace of the process that is reading from the file.

If we switch back to the first terminal window, and display the user ID mapping file for the process in the second user namespace, we see the converse mapping:

    $ cat /proc/2535/uid_map
           200          0          1

Again, the output here is different from the same command when executed in the second user namespace, because the ID-outside-ns value is generated according to the user namespace of the process that is reading from the file. Of course, in the initial namespace, user ID 0 in the first namespace and user ID 200 in the second namespace both map to user ID 1000. We can verify this by executing the following commands in a third shell window inside the initial user namespace:

    $ cat /proc/2465/uid_map
             0       1000          1
    $ cat /proc/2535/uid_map
           200       1000          1
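The three views above are consistent with a simple composition of mappings. The following Python sketch models it for the case shown here, two sibling namespaces whose mappings are both expressed relative to the same parent; it is our own illustration rather than the kernel's algorithm, and view_from() is a hypothetical name. The map used for the initial namespace (every ID mapped to itself) is likewise an assumption made for the sake of the model.

```python
def view_from(target_map, reader_map):
    """Re-express target_map as seen by a reader in another namespace.

    Both arguments are lists of (inside, outside, length) tuples whose
    outside IDs refer to the same parent namespace. The result replaces
    each outside ID with the ID it has inside the reader's namespace."""
    result = []
    for t_in, t_out, t_len in target_map:
        for r_in, r_out, r_len in reader_map:
            # Overlap of the two outside ranges in the common parent.
            start = max(t_out, r_out)
            end = min(t_out + t_len, r_out + r_len)
            if start < end:
                result.append((t_in + (start - t_out),
                               r_in + (start - r_out),
                               end - start))
    return result


if __name__ == "__main__":
    first = [(0, 1000, 1)]      # namespace of PID 2465
    second = [(200, 1000, 1)]   # namespace of PID 2535
    initial = [(0, 0, 4294967295)]
    print(view_from(first, second))    # PID 2465's map read from the second shell
    print(view_from(second, first))    # PID 2535's map read from the first shell
    print(view_from(first, initial))   # PID 2465's map read from the initial namespace
```

Under these assumptions the sketch yields the triples (0, 200, 1), (200, 0, 1), and (0, 1000, 1), matching the three cat outputs shown above.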

Concluding remarks

In this article, we've looked at the basics of user namespaces: creating a user namespace, using user and group ID map files, and the interaction of user namespaces and capabilities.

As we noted in an earlier article, one of the motivations for implementing user namespaces is to give non-root applications access to functionality that was formerly limited to the root user. In traditional UNIX systems, various pieces of functionality have been limited to the root user in order to prevent unprivileged users from manipulating the runtime environment of privileged programs, which could affect the operation of those programs in unexpected or undesirable ways.

A user namespace allows a process (that is unprivileged outside the namespace) to have root privileges while at the same time limiting the scope of that privilege to the namespace, with the result that the process cannot manipulate the runtime environment of privileged programs in the wider system. In order to use these root privileges meaningfully, we need to combine user namespaces with other types of namespaces—that topic will form the subject of the next article in this series.

Brief items

Quote of the week

properly quote rpath $ORIGIN so it can be passed from make to shell to configure to generated Makefile to libtool to invoked gcc without loss of valuable dollars.

It is an open question to which extent this commit should be credited to the designers of sh, autoconf, libtool, make, and/or Solaris ld.

Michael Stahl (hat tip to Cesar Eduardo Barros)

Ruby 2.0.0 released

Version 2.0.0 of the Ruby language is now available. "Ruby 2.0.0 is the first stable release of the Ruby 2.0 series, with many new features and improvements in response to the increasingly diverse and expanding demands for Ruby." Changes include keyword arguments, UTF-8 encoding by default, a number of new libraries, some performance improvements, and more. See this article for more information about the changes in this release.

Django 1.5 released

Version 1.5 of the Django web framework is available; new features include a new configurable user model, Python 3 support, a lot of documentation improvements, and more; see the release notes for details.

Subsurface 3.0 has been released

Dirk Hohndel has announced the release of Subsurface 3.0, the open source dive-logging program. Improvements include map display of GPS locations, a dive planner, automatic dive numbering, and support for many new dive computers.

BIND10 1.0.0 available

Version 1.0.0 of BIND 10 has been released. Although known to many as a DNS server, BIND 10 provides a number of additional features, including "dynamic DNS, zone transfers, and experimental forwarding and recursive name service," plus statistics collection, reporting, and remote configuration. Among the many changes are DDNS (Dynamic Updates) support, an SQLite3 backend, and a "semi-interactive client to conveniently look at and set some configuration settings."

GNOME 3.7.90 beta development release

GNOME 3.7.90, the first beta release of the development cycle that will eventually become GNOME 3.8, is now available. Lengthy changelogs for both the core and the base applications are available.

Newsletters and articles

Development newsletters from the past week

Digital show and tell video from Xiph.org

Xiph.org, the purveyor of widely used open audio and video codecs, has released Digital Show and Tell, a video on digital audio processing in which demonstrations "of sampling, quantization, bit-depth, and dither show digital audio behavior on real audio equipment using both modern digital analysis and vintage analog bench equipment... just in case we can't trust those newfangled digital gizmos." Accompanying the video is source code with which interested viewers can reproduce the demos shown.

Page editor: Nathan Willis

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds