Development

Namespaces in operation, part 2: the namespaces API

By Michael Kerrisk
January 8, 2013

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource. Namespaces are used for a variety of purposes, with the most notable being the implementation of containers, a technique for lightweight virtualization. This is the second part in a series of articles that looks in some detail at namespaces and the namespaces API. The first article in this series provided an overview of namespaces. This article looks at the namespaces API in some detail and shows the API in action in a number of example programs.

The namespace API consists of three system calls—clone(), unshare(), and setns()—and a number of /proc files. In this article, we'll look at all of these system calls and some of the /proc files. In order to specify a namespace type on which to operate, the three system calls make use of the CLONE_NEW* constants listed in the previous article: CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS.

Creating a child in a new namespace: clone()

One way of creating a namespace is via the use of clone(), a system call that creates a new process. For our purposes, clone() has the following prototype:

    int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

Essentially, clone() is a more general version of the traditional UNIX fork() system call whose functionality can be controlled via the flags argument. In all, there are more than twenty different CLONE_* flags that control various aspects of the operation of clone(), including whether the parent and child process share resources such as virtual memory, open file descriptors, and signal dispositions. If one of the CLONE_NEW* bits is specified in the call, then a new namespace of the corresponding type is created, and the new process is made a member of that namespace; multiple CLONE_NEW* bits can be specified in flags.

Our example program (demo_uts_namespace.c) uses clone() with the CLONE_NEWUTS flag to create a UTS namespace. As we saw last week, UTS namespaces isolate two system identifiers—the hostname and the NIS domain name—that are set using the sethostname() and setdomainname() system calls and returned by the uname() system call. You can find the full source of the program here. Below, we'll focus on just some of the key pieces of the program (and for brevity, we'll omit the error checking code that is present in the full version of the program).

The example program takes one command-line argument. When run, it creates a child that executes in a new UTS namespace. Inside that namespace, the child changes the hostname to the string given as the program's command-line argument.

The first significant piece of the main program is the clone() call that creates the child process:

    child_pid = clone(childFunc, 
                      child_stack + STACK_SIZE,   /* Points to start of 
                                                     downwardly growing stack */ 
                      CLONE_NEWUTS | SIGCHLD, argv[1]);

    printf("PID of child created by clone() is %ld\n", (long) child_pid);

The new child will begin execution in the user-defined function childFunc(); that function will receive the final clone() argument (argv[1]) as its argument. Since CLONE_NEWUTS is specified as part of the flags argument, the child will execute in a newly created UTS namespace.

The main program then sleeps for a moment. This is a (crude) way of giving the child time to change the hostname in its UTS namespace. The program then uses uname() to retrieve the host name in the parent's UTS namespace, and displays that hostname:

    sleep(1);           /* Give child time to change its hostname */

    uname(&uts);
    printf("uts.nodename in parent: %s\n", uts.nodename);

Meanwhile, the childFunc() function executed by the child created by clone() first changes the hostname to the value supplied in its argument, and then retrieves and displays the modified hostname:

    sethostname(arg, strlen(arg);
    
    uname(&uts);
    printf("uts.nodename in child:  %s\n", uts.nodename);

Before terminating, the child sleeps for a while. This has the effect of keeping the child's UTS namespace open, and gives us a chance to conduct some of the experiments that we show later.

Running the program demonstrates that the parent and child processes have independent UTS namespaces:

    $ su                   # Need privilege to create a UTS namespace
    Password: 
    # uname -n
    antero
    # ./demo_uts_namespaces bizarro
    PID of child created by clone() is 27514
    uts.nodename in child:  bizarro
    uts.nodename in parent: antero

As with most other namespaces (user namespaces are the exception), creating a UTS namespace requires privilege (specifically, CAP_SYS_ADMIN). This is necessary to avoid scenarios where set-user-ID applications could be fooled into doing the wrong thing because the system has an unexpected hostname.

Another possibility is that a set-user-ID application might be using the hostname as part of the name of a lock file. If an unprivileged user could run the application in a UTS namespace with an arbitrary hostname, this would open the application to various attacks. Most simply, this would nullify the effect of the lock file, triggering misbehavior in instances of the application that run in different UTS namespaces. Alternatively, a malicious user could run a set-user-ID application in a UTS namespace with a hostname that causes creation of the lock file to overwrite an important file. (Hostname strings can contain arbitrary characters, including slashes.)

The `/proc/`PID`/ns` files

Each process has a /proc/PID/ns directory that contains one file for each type of namespace. Starting in Linux 3.8, each of these files is a special symbolic link that provides a kind of handle for performing certain operations on the associated namespace for the process.

    $ ls -l /proc/$$/ns         # $$ is replaced by shell's PID
    total 0
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 ipc -> ipc:[4026531839]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 mnt -> mnt:[4026531840]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 net -> net:[4026531956]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 pid -> pid:[4026531836]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 user -> user:[4026531837]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 uts -> uts:[4026531838]

One use of these symbolic links is to discover whether two processes are in the same namespace. The kernel does some magic to ensure that if two processes are in the same namespace, then the inode numbers reported for the corresponding symbolic links in /proc/PID/ns will be the same. The inode numbers can be obtained using the stat() system call (in the st_ino field of the returned structure).

However, the kernel also constructs each of the /proc/PID/ns symbolic links so that it points to a name consisting of a string that identifies the namespace type, followed by the inode number. We can examine this name using either the ls -l or the readlink command.

Let's return to the shell session above where we ran the demo_uts_namespaces program. Looking at the /proc/PID/ns symbolic links for the parent and child process provides an alternative method of checking whether the two processes are in the same or different UTS namespaces:

    ^Z                                # Stop parent and child
    [1]+  Stopped          ./demo_uts_namespaces bizarro
    # jobs -l                         # Show PID of parent process
    [1]+ 27513 Stopped         ./demo_uts_namespaces bizarro
    # readlink /proc/27513/ns/uts     # Show parent UTS namespace
    uts:[4026531838]
    # readlink /proc/27514/ns/uts     # Show child UTS namespace
    uts:[4026532338]

As can be seen, the content of the /proc/PID/ns/uts symbolic links differs, indicating that the two processes are in different UTS namespaces.

The /proc/PID/ns symbolic links also serve other purposes. If we open one of these files, then the namespace will continue to exist as long as the file descriptor remains open, even if all processes in the namespace terminate. The same effect can also be obtained by bind mounting one of the symbolic links to another location in the file system:

    # touch ~/uts                            # Create mount point
    # mount --bind /proc/27514/ns/uts ~/uts

Before Linux 3.8, the files in /proc/PID/ns were hard links rather than special symbolic links of the form described above. In addition, only the ipc, net, and uts files were present.

Joining an existing namespace: setns()

Keeping a namespace open when it contains no processes is of course only useful if we intend to later add processes to it. That is the task of the setns() system call, which allows the calling process to join an existing namespace:

    int setns(int fd, int nstype);

More precisely, setns() disassociates the calling process from one instance of a particular namespace type and reassociates the process with another instance of the same namespace type.

The fd argument specifies the namespace to join; it is a file descriptor that refers to one of the symbolic links in a /proc/PID/ns directory. That file descriptor can be obtained either by opening one of those symbolic links directly or by opening a file that was bind mounted to one of the links.

The nstype argument allows the caller to check the type of namespace that fd refers to. If this argument is specified as zero, no check is performed. This can be useful if the caller already knows the namespace type, or does not care about the type. The example program that we discuss in a moment (ns_exec.c) falls into the latter category: it is designed to work with any namespace type. Specifying nstype instead as one of the CLONE_NEW* constants causes the kernel to verify that fd is a file descriptor for the corresponding namespace type. This can be useful if, for example, the caller was passed the file descriptor via a UNIX domain socket and needs to verify what type of namespace it refers to.

Using setns() and execve() (or one of the other exec() functions) allows us to construct a simple but useful tool: a program that joins a specified namespace and then executes a command in that namespace.

Our program (ns_exec.c, whose full source can be found here) takes two or more command-line arguments. The first argument is the pathname of a /proc/PID/ns/* symbolic link (or a file that is bind mounted to one of those symbolic links). The remaining arguments are the name of a program to be executed inside the namespace that corresponds to that symbolic link and optional command-line arguments to be given to that program. The key steps in the program are the following:

    fd = open(argv[1], O_RDONLY);   /* Get descriptor for namespace */

    setns(fd, 0);                   /* Join that namespace */

    execvp(argv[2], &argv[2]);      /* Execute a command in namespace */

An interesting program to execute inside a namespace is, of course, a shell. We can use the bind mount for the UTS namespace that we created earlier in conjunction with the ns_exec program to execute a shell in the new UTS namespace created by our invocation of demo_uts_namespaces:

    # ./ns_exec ~/uts /bin/bash     # ~/uts is bound to /proc/27514/ns/uts
    My PID is: 28788

We can then verify that the shell is in the same UTS namespace as the child process created by demo_uts_namespaces, both by inspecting the hostname and by comparing the inode numbers of the /proc/PID/ns/uts files:

    # hostname
    bizarro
    # readlink /proc/27514/ns/uts
    uts:[4026532338]
    # readlink /proc/$$/ns/uts      # $$ is replaced by shell's PID
    uts:[4026532338]

In earlier kernel versions, it was not possible to use setns() to join mount, PID, and user namespaces, but, starting with Linux 3.8, setns() now supports joining all namespace types.

Leaving a namespace: unshare()

The final system call in the namespaces API is unshare():

    int unshare(int flags);

The unshare() system call provides functionality similar to clone(), but operates on the calling process: it creates the new namespaces specified by the CLONE_NEW* bits in its flags argument and makes the caller a member of the namespaces. (As with clone(), unshare() provides functionality beyond working with namespaces that we'll ignore here.) The main purpose of unshare() is to isolate namespace (and other) side effects without having to create a new process or thread (as is done by clone()).

Leaving aside the other effects of the clone() system call, a call of the form:

    clone(..., CLONE_NEWXXX, ....);

is roughly equivalent, in namespace terms, to the sequence:

    if (fork() == 0)
        unshare(CLONE_NEWXXX);      /* Executed in the child process */

One use of the unshare() system call is in the implementation of the unshare command, which allows the user to execute a command in a separate namespace from the shell. The general form of this command is:

    unshare [options] program [arguments]

The options are command-line flags that specify the namespaces to unshare before executing program with the specified arguments.

The key steps in the implementation of the unshare command are straightforward:

     /* Code to initialize 'flags' according to command-line options
        omitted */

     unshare(flags);

     /* Now execute 'program' with 'arguments'; 'optind' is the index
        of the next command-line argument after options */

     execvp(argv[optind], &argv[optind]);

A simple implementation of the unshare command (unshare.c) can be found here.

In the following shell session, we use our unshare.c program to execute a shell in a separate mount namespace. As we noted in last week's article, mount namespaces isolate the set of filesystem mount points seen by a group of processes, allowing processes in different mount namespaces to have different views of the filesystem hierarchy.

    # echo $$                             # Show PID of shell
    8490
    # cat /proc/8490/mounts | grep mq     # Show one of the mounts in namespace
    mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
    # readlink /proc/8490/ns/mnt          # Show mount namespace ID 
    mnt:[4026531840]
    # ./unshare -m /bin/bash              # Start new shell in separate mount namespace
    # readlink /proc/$$/ns/mnt            # Show mount namespace ID 
    mnt:[4026532325]

Comparing the output of the two readlink commands shows that the two shells are in separate mount namespaces. Altering the set of mount points in one of the namespaces and checking whether that change is visible in the other namespace provides another way of demonstrating that the two programs are in separate namespaces:

    # umount /dev/mqueue                  # Remove a mount point in this shell
    # cat /proc/$$/mounts | grep mq       # Verify that mount point is gone
    # cat /proc/8490/mounts | grep mq     # Is it still present in the other namespace?
    mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0

As can be seen from the output of the last two commands, the /dev/mqueue mount point has disappeared in one mount namespace, but continues to exist in the other.

Concluding remarks

In this article we've looked at the fundamental pieces of the namespace API and how they are employed together. In the follow-on articles, we'll look in more depth at some other namespaces, in particular, the PID and user namespaces; user namespaces open up a range of new possibilities for applications to use kernel interfaces that were formerly restricted to privileged applications.

(2013-01-15: updated the concluding remarks to reflect the fact that there will be more than one following article.)

Comments (11 posted)

Brief items

Quotes of the week

...with so much drama, he now praises the iPad.

— Google's text-to-speech engine, when fed any of a number of input strings containing "-ed with."

A programmer had a problem. He thought to himself, "I know, I'll solve it with threads!". has Now problems. two he

— Davidlohr Bueso

Comments (4 posted)

Systemd 197 released

For those watching systemd, version 197 has a number of interesting new features, including a new mechanism for stable network interface naming, an integrated bootchart implementation, optimized readahead behavior on Btrfs, and more. "systemd will no longer detect and recognize specific distributions. All distribution-specific #ifdeffery has been removed, systemd is now fully generic and distribution-agnostic. Effectively, not too much is lost as a lot of the code is still accessible via explicit configure switches. However, support for some distribution specific legacy configuration file formats has been dropped. We recommend distributions to simply adopt the configuration files everybody else uses now and convert the old configuration from packaging scripts."

Full Story (comments: 44)

Firefox 18 is now available

Firefox 18 has been released. See the release notes for the details. There are Firefox for Android release notes also.

Full Story (comments: 48)

wiki.python.org compromised

The python.org wiki was compromised and the attacker gained shell access to the "moin" user account. "Some time later, the attacker deleted all files owned by the "moin" user, including all instance data for both the Python and Jython wikis. The attack also had full access to all MoinMoin user data on all wikis. In light of this, the Python Software Foundation encourages all wiki users to change their password on other sites if the same one is in use elsewhere. We apologize for the inconvenience and will post further news as we bring the new and improved wiki.python.org online."

This is the second high-profile compromise of a Moin-based wiki reported recently; anybody running such a site should be sure they are current with their security patches.

Full Story (comments: 2)

GCC 4.8.0 status report: only regression fixes and docs accepted as of now

Richard Biener has posted another report on the status of GCC 4.8.0 development. Noting that the code had "stabilized itself over the holidays", stage 3 of the development cycle is over, so "GCC trunk is now in release branch mode, thus only regression fixes and documentation changes are allowed now."

Full Story (comments: none)

GNU Radio 3.6.3 available

Version 3.6.3 of GNU Radio has been released. This release adds "major new capabilities and many bug fixes, while maintaining strict source compatibility with user code already written for the 3.6 API." Enhancements include asynchronous message passing, new blocks for interacting with the operating system's networking stack, and the ability to write signal processing blocks in Python.

Full Story (comments: none)

Newsletters and articles

Development newsletters from the past week

Caml Weekly News (January 8)
What's cooking in git.git (January 6)
What's cooking in git.git (January 9)
Openstack Community Weekly Newsletter (January 4)
Perl Weekly (January 7)

Comments (none posted)

The USPTO Would Like to Partner with the Software Community ... Wait. What? Really? (Groklaw)

Groklaw reports on an invitation from the United States Patent and Trademark Office (USPTO) for software developers to join two roundtable discussions aimed at "enhancing" the quality of software patents. Both events are in February: one in New York City and one in Silicon Valley. As Groklaw points out, the events are space-limited and proprietary software vendors are sure to attend. "Large companies with patent portfolios they treasure and don't want to lose can't represent the interests of individual developers or the FOSS community, those most seriously damaged by toxic software patents." (Thanks to Davide Del Vento)

Comments (36 posted)

Fox: How to un-break GNOME menus

Taryn Fox has written a blog entry examining GNOME 3's global application menu, including a few outstanding problems, and a proposal for how they could be addressed. "Ideally, the App Menu will contain all of a given app's functionality. This is the assumption new GNOME apps (like Documents) are building on, and the one certain existing apps (like Empathy) are adopting."

Comments (none posted)

Crouch: Packaged HTML5 Apps: Are we emulating failure?

At his blog, Mozilla's Luke Crouch explores whether or not HTML applications deployed on "web runtime" environments offer a better experience than those running in a traditional browser. The answer is evidently no: "If you're making an HTML5 app, consider - do you want to make a native desktop application? Why or why not? Then consider if the same reasoning is true for the native mobile application."

Comments (none posted)

Page editor: Nathan Willis
Next page: Announcements>>

Development

Creating a child in a new namespace: clone()

The /proc/PID/ns files

Joining an existing namespace: setns()

Leaving a namespace: unshare()

Concluding remarks

Brief items

Newsletters and articles

The `/proc/`PID`/ns` files