Namespaces in operation, part 2: the namespaces API

By Michael Kerrisk
January 8, 2013

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource. Namespaces are used for a variety of purposes, with the most notable being the implementation of containers, a technique for lightweight virtualization. This is the second part in a series of articles that looks in some detail at namespaces and the namespaces API. The first article in this series provided an overview of namespaces. This article looks at the namespaces API in some detail and shows the API in action in a number of example programs.

The namespace API consists of three system calls—clone(), unshare(), and setns()—and a number of /proc files. In this article, we'll look at all of these system calls and some of the /proc files. In order to specify a namespace type on which to operate, the three system calls make use of the CLONE_NEW* constants listed in the previous article: CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS.

Creating a child in a new namespace: clone()

One way of creating a namespace is via the use of clone(), a system call that creates a new process. For our purposes, clone() has the following prototype:

    int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

Essentially, clone() is a more general version of the traditional UNIX fork() system call whose functionality can be controlled via the flags argument. In all, there are more than twenty different CLONE_* flags that control various aspects of the operation of clone(), including whether the parent and child process share resources such as virtual memory, open file descriptors, and signal dispositions. If one of the CLONE_NEW* bits is specified in the call, then a new namespace of the corresponding type is created, and the new process is made a member of that namespace; multiple CLONE_NEW* bits can be specified in flags.

Our example program (demo_uts_namespace.c) uses clone() with the CLONE_NEWUTS flag to create a UTS namespace. As we saw last week, UTS namespaces isolate two system identifiers—the hostname and the NIS domain name—that are set using the sethostname() and setdomainname() system calls and returned by the uname() system call. You can find the full source of the program here. Below, we'll focus on just some of the key pieces of the program (and for brevity, we'll omit the error checking code that is present in the full version of the program).

The example program takes one command-line argument. When run, it creates a child that executes in a new UTS namespace. Inside that namespace, the child changes the hostname to the string given as the program's command-line argument.

The first significant piece of the main program is the clone() call that creates the child process:

    child_pid = clone(childFunc, 
                      child_stack + STACK_SIZE,   /* Points to start of 
                                                     downwardly growing stack */ 
                      CLONE_NEWUTS | SIGCHLD, argv[1]);

    printf("PID of child created by clone() is %ld\n", (long) child_pid);

The new child will begin execution in the user-defined function childFunc(); that function will receive the final clone() argument (argv[1]) as its argument. Since CLONE_NEWUTS is specified as part of the flags argument, the child will execute in a newly created UTS namespace.

The main program then sleeps for a moment. This is a (crude) way of giving the child time to change the hostname in its UTS namespace. The program then uses uname() to retrieve the host name in the parent's UTS namespace, and displays that hostname:

    sleep(1);           /* Give child time to change its hostname */

    uname(&uts);
    printf("uts.nodename in parent: %s\n", uts.nodename);

Meanwhile, the childFunc() function executed by the child created by clone() first changes the hostname to the value supplied in its argument, and then retrieves and displays the modified hostname:

    sethostname(arg, strlen(arg);
    
    uname(&uts);
    printf("uts.nodename in child:  %s\n", uts.nodename);

Before terminating, the child sleeps for a while. This has the effect of keeping the child's UTS namespace open, and gives us a chance to conduct some of the experiments that we show later.

Running the program demonstrates that the parent and child processes have independent UTS namespaces:

    $ su                   # Need privilege to create a UTS namespace
    Password: 
    # uname -n
    antero
    # ./demo_uts_namespaces bizarro
    PID of child created by clone() is 27514
    uts.nodename in child:  bizarro
    uts.nodename in parent: antero

As with most other namespaces (user namespaces are the exception), creating a UTS namespace requires privilege (specifically, CAP_SYS_ADMIN). This is necessary to avoid scenarios where set-user-ID applications could be fooled into doing the wrong thing because the system has an unexpected hostname.

Another possibility is that a set-user-ID application might be using the hostname as part of the name of a lock file. If an unprivileged user could run the application in a UTS namespace with an arbitrary hostname, this would open the application to various attacks. Most simply, this would nullify the effect of the lock file, triggering misbehavior in instances of the application that run in different UTS namespaces. Alternatively, a malicious user could run a set-user-ID application in a UTS namespace with a hostname that causes creation of the lock file to overwrite an important file. (Hostname strings can contain arbitrary characters, including slashes.)

The `/proc/`PID`/ns` files

Each process has a /proc/PID/ns directory that contains one file for each type of namespace. Starting in Linux 3.8, each of these files is a special symbolic link that provides a kind of handle for performing certain operations on the associated namespace for the process.

    $ ls -l /proc/$$/ns         # $$ is replaced by shell's PID
    total 0
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 ipc -> ipc:[4026531839]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 mnt -> mnt:[4026531840]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 net -> net:[4026531956]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 pid -> pid:[4026531836]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 user -> user:[4026531837]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 uts -> uts:[4026531838]

One use of these symbolic links is to discover whether two processes are in the same namespace. The kernel does some magic to ensure that if two processes are in the same namespace, then the inode numbers reported for the corresponding symbolic links in /proc/PID/ns will be the same. The inode numbers can be obtained using the stat() system call (in the st_ino field of the returned structure).

However, the kernel also constructs each of the /proc/PID/ns symbolic links so that it points to a name consisting of a string that identifies the namespace type, followed by the inode number. We can examine this name using either the ls -l or the readlink command.

Let's return to the shell session above where we ran the demo_uts_namespaces program. Looking at the /proc/PID/ns symbolic links for the parent and child process provides an alternative method of checking whether the two processes are in the same or different UTS namespaces:

    ^Z                                # Stop parent and child
    [1]+  Stopped          ./demo_uts_namespaces bizarro
    # jobs -l                         # Show PID of parent process
    [1]+ 27513 Stopped         ./demo_uts_namespaces bizarro
    # readlink /proc/27513/ns/uts     # Show parent UTS namespace
    uts:[4026531838]
    # readlink /proc/27514/ns/uts     # Show child UTS namespace
    uts:[4026532338]

As can be seen, the content of the /proc/PID/ns/uts symbolic links differs, indicating that the two processes are in different UTS namespaces.

The /proc/PID/ns symbolic links also serve other purposes. If we open one of these files, then the namespace will continue to exist as long as the file descriptor remains open, even if all processes in the namespace terminate. The same effect can also be obtained by bind mounting one of the symbolic links to another location in the file system:

    # touch ~/uts                            # Create mount point
    # mount --bind /proc/27514/ns/uts ~/uts

Before Linux 3.8, the files in /proc/PID/ns were hard links rather than special symbolic links of the form described above. In addition, only the ipc, net, and uts files were present.

Joining an existing namespace: setns()

Keeping a namespace open when it contains no processes is of course only useful if we intend to later add processes to it. That is the task of the setns() system call, which allows the calling process to join an existing namespace:

    int setns(int fd, int nstype);

More precisely, setns() disassociates the calling process from one instance of a particular namespace type and reassociates the process with another instance of the same namespace type.

The fd argument specifies the namespace to join; it is a file descriptor that refers to one of the symbolic links in a /proc/PID/ns directory. That file descriptor can be obtained either by opening one of those symbolic links directly or by opening a file that was bind mounted to one of the links.

The nstype argument allows the caller to check the type of namespace that fd refers to. If this argument is specified as zero, no check is performed. This can be useful if the caller already knows the namespace type, or does not care about the type. The example program that we discuss in a moment (ns_exec.c) falls into the latter category: it is designed to work with any namespace type. Specifying nstype instead as one of the CLONE_NEW* constants causes the kernel to verify that fd is a file descriptor for the corresponding namespace type. This can be useful if, for example, the caller was passed the file descriptor via a UNIX domain socket and needs to verify what type of namespace it refers to.

Using setns() and execve() (or one of the other exec() functions) allows us to construct a simple but useful tool: a program that joins a specified namespace and then executes a command in that namespace.

Our program (ns_exec.c, whose full source can be found here) takes two or more command-line arguments. The first argument is the pathname of a /proc/PID/ns/* symbolic link (or a file that is bind mounted to one of those symbolic links). The remaining arguments are the name of a program to be executed inside the namespace that corresponds to that symbolic link and optional command-line arguments to be given to that program. The key steps in the program are the following:

    fd = open(argv[1], O_RDONLY);   /* Get descriptor for namespace */

    setns(fd, 0);                   /* Join that namespace */

    execvp(argv[2], &argv[2]);      /* Execute a command in namespace */

An interesting program to execute inside a namespace is, of course, a shell. We can use the bind mount for the UTS namespace that we created earlier in conjunction with the ns_exec program to execute a shell in the new UTS namespace created by our invocation of demo_uts_namespaces:

    # ./ns_exec ~/uts /bin/bash     # ~/uts is bound to /proc/27514/ns/uts
    My PID is: 28788

We can then verify that the shell is in the same UTS namespace as the child process created by demo_uts_namespaces, both by inspecting the hostname and by comparing the inode numbers of the /proc/PID/ns/uts files:

    # hostname
    bizarro
    # readlink /proc/27514/ns/uts
    uts:[4026532338]
    # readlink /proc/$$/ns/uts      # $$ is replaced by shell's PID
    uts:[4026532338]

In earlier kernel versions, it was not possible to use setns() to join mount, PID, and user namespaces, but, starting with Linux 3.8, setns() now supports joining all namespace types.

Leaving a namespace: unshare()

The final system call in the namespaces API is unshare():

    int unshare(int flags);

The unshare() system call provides functionality similar to clone(), but operates on the calling process: it creates the new namespaces specified by the CLONE_NEW* bits in its flags argument and makes the caller a member of the namespaces. (As with clone(), unshare() provides functionality beyond working with namespaces that we'll ignore here.) The main purpose of unshare() is to isolate namespace (and other) side effects without having to create a new process or thread (as is done by clone()).

Leaving aside the other effects of the clone() system call, a call of the form:

    clone(..., CLONE_NEWXXX, ....);

is roughly equivalent, in namespace terms, to the sequence:

    if (fork() == 0)
        unshare(CLONE_NEWXXX);      /* Executed in the child process */

One use of the unshare() system call is in the implementation of the unshare command, which allows the user to execute a command in a separate namespace from the shell. The general form of this command is:

    unshare [options] program [arguments]

The options are command-line flags that specify the namespaces to unshare before executing program with the specified arguments.

The key steps in the implementation of the unshare command are straightforward:

     /* Code to initialize 'flags' according to command-line options
        omitted */

     unshare(flags);

     /* Now execute 'program' with 'arguments'; 'optind' is the index
        of the next command-line argument after options */

     execvp(argv[optind], &argv[optind]);

A simple implementation of the unshare command (unshare.c) can be found here.

In the following shell session, we use our unshare.c program to execute a shell in a separate mount namespace. As we noted in last week's article, mount namespaces isolate the set of filesystem mount points seen by a group of processes, allowing processes in different mount namespaces to have different views of the filesystem hierarchy.

    # echo $$                             # Show PID of shell
    8490
    # cat /proc/8490/mounts | grep mq     # Show one of the mounts in namespace
    mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
    # readlink /proc/8490/ns/mnt          # Show mount namespace ID 
    mnt:[4026531840]
    # ./unshare -m /bin/bash              # Start new shell in separate mount namespace
    # readlink /proc/$$/ns/mnt            # Show mount namespace ID 
    mnt:[4026532325]

Comparing the output of the two readlink commands shows that the two shells are in separate mount namespaces. Altering the set of mount points in one of the namespaces and checking whether that change is visible in the other namespace provides another way of demonstrating that the two programs are in separate namespaces:

    # umount /dev/mqueue                  # Remove a mount point in this shell
    # cat /proc/$$/mounts | grep mq       # Verify that mount point is gone
    # cat /proc/8490/mounts | grep mq     # Is it still present in the other namespace?
    mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0

As can be seen from the output of the last two commands, the /dev/mqueue mount point has disappeared in one mount namespace, but continues to exist in the other.

Concluding remarks

In this article we've looked at the fundamental pieces of the namespace API and how they are employed together. In the follow-on articles, we'll look in more depth at some other namespaces, in particular, the PID and user namespaces; user namespaces open up a range of new possibilities for applications to use kernel interfaces that were formerly restricted to privileged applications.

(2013-01-15: updated the concluding remarks to reflect the fact that there will be more than one following article.)

Index entries for this article
Kernel	Namespaces
Kernel	Namespaces/UTS namespaces
Kernel	System calls/clone()
Kernel	System calls/setns()
Kernel	System calls/unshare()

Namespaces in operation, part 2: the namespaces API

Posted Jan 8, 2013 22:04 UTC (Tue) by justincormack (subscriber, #70439) [Link] (3 responses)

I have spent a fair amount of time with these interfaces, except the shiny new user namespace, so I am a bit confused by that. If you change to a new user ns and therefore become "root" what can you do? Is it affected by which other namespaces you are in? eg if you create a new user ns and new netns can you say use ping or other root-requiring network ops? I guess I should install a new kernel and experiment...

Namespaces in operation, part 2: the namespaces API

Posted Jan 9, 2013 0:23 UTC (Wed) by rvolgers (guest, #63218) [Link]

Looking at the source a user namespace root user has all capabilities within that namespace, and raw socket access is controlled by a ns_capable(...) check, so it should be possible.

I have not tested this, so take it with a grain of salt.

Namespaces in operation, part 2: the namespaces API

Posted Jan 9, 2013 1:24 UTC (Wed) by hallyn (subscriber, #22558) [Link] (1 responses)

> eg if you create a new user ns and new netns can you say use ping or other root-requiring network ops?

Yes - but only with nics owned by your new network namespace. Which means nics which you create (which won't be hooked into the parent ns), or nics which a privileged task in the parent netns passed into your ns.

Namespaces in operation, part 2: the namespaces API

Posted Jan 17, 2013 3:03 UTC (Thu) by kevinm (guest, #69913) [Link]

So with a VPN (or IPv6 tunnel) endpoint that uses TUN/TAP, you could bring up your VPN and pingflood away to your heart's content.

Namespaces in operation, part 2: the namespaces API

Posted Jan 11, 2013 11:19 UTC (Fri) by tpo (subscriber, #25713) [Link]

What a wonderful article.

It clearly and comprehensively explains the basics and the use in practice, is technical and easy to read.

Thanks Michael!
*t

setns syscall via syscall()

Posted Jan 18, 2013 9:15 UTC (Fri) by bourbaki (guest, #84259) [Link]

If your libc is too old and and does not have the setns() wrapper for the sys_setns syscall, you can use the syscall() function instead (in <sys/syscall.h>).

On x86_64, the sys_setns syscall number is 308, so in ns_exec.c you can do :

syscall(308,fd,0)

instead of

setns(fd,0)

Namespaces in operation, part 2: the namespaces API

Posted Apr 27, 2013 15:50 UTC (Sat) by wkevils (guest, #90640) [Link] (1 responses)

Did anybody tried the example with the mount namespaces ?

I tried it both on Fedora 18 and Ubuntu 12.10 and it did not
work.
For Fedora 18, it could be because of the shared flags.
(running cat /proc/$$/mountinfo show shared).

But for ubuntu:
cat /proc/$$/mountinfo |grep shared
gives nothing.

I would appreciate if someone that tested it and it worked
will post which OS and which kernel he uses.

rgs
Kevin Wilson

Namespaces in operation, part 2: the namespaces API

Posted Apr 27, 2013 20:38 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

Anything that ships with XFS enabled (pretty much any distro) can't support user namespaces yet since XFS won't work with them. You're probably stuck compiling your own for now (which is what I did and then tossed it into a VM).

Namespaces in operation, part 2: the namespaces API

Posted Mar 2, 2015 12:08 UTC (Mon) by mkerrisk (subscriber, #1978) [Link]

One point to note regarding the unshare.c experiment with mount namespaces (shown toward the end of the article)... These days, some distributions (e.g., Fedora) enable mount event propagation (mount --make-shared) by default, so that an unmount in the second namespace would automatically affect the initial namespace as well. To prevent mount event propagation, we need to make / a private mount in the second namespace. See the following example:

$ echo $$      # Show PID of shell in initial mount NS
989
$ readlink /proc/989/ns/mnt
mnt:[4026531840]
$ cat /proc/989/mounts | awk '/test/ { print $1 , $2 , $3}'
/dev/sda3 /test ext4
$ PS1='$sh2 ' sudo ./unshare -m /bin/bash   # Start a new shell in a new mount NS
sh2$ readlink /proc/$$/ns/mnt       # Verify that shell is in different mount NS
mnt:[4026532640]
sh2$ # Check whether / mount point propagates mount events
sh2$ cat /proc/$$/mountinfo | awk '/\/ \/ / {print $4, $5, $6, $7}'
/ / rw,relatime shared:1
sh2$ sudo mount --make-private /    # Prevent propagation of events for /
sh2$ cat /proc/$$/mountinfo | awk '/\/ \/ / {print $4, $5, $6, $7}'
/ / rw,relatime -
sh2$ sudo umount /test              # Unmount /test in second mount NS
sh2$ Verify that mount has been removed in second mount NS
sh2$ cat /proc/$$/mounts | awk '/test/ { print $1 , $2 , $3}'
sh2$ Verify that mount is still present in initial mount NS
sh2$ cat /proc/989/mounts | awk '/test/ { print $1 , $2 ,$3}'
/dev/sda3 /test ext4

For more info about mount propagation, see the kernel source file Documentation/filesystems/sharedsubtree.txt and the mount(8) man page.

Namespaces in operation, part 2: the namespaces API

Posted Mar 3, 2015 6:40 UTC (Tue) by apollock (guest, #14629) [Link] (1 responses)

I'm curious. If you bind mount the /proc/PID/ns/uts entry, and then the original process goes away, is that PID available for reuse?

I presume the answer is yes, and the new /proc/PID just winds up with a completely unrelated inode from the old one.

Namespaces in operation, part 2: the namespaces API

Posted Jan 30, 2020 1:09 UTC (Thu) by mkerrisk (subscriber, #1978) [Link]

> I presume the answer is yes, and the new
> /proc/PID just winds up with a completely
> unrelated inode from the old one.

That's correct.

Namespaces in operation, part 2: the namespaces API

Creating a child in a new namespace: clone()

The /proc/PID/ns files

Joining an existing namespace: setns()

Leaving a namespace: unshare()

Concluding remarks

Namespaces in operation, part 2: the namespaces API

Namespaces in operation, part 2: the namespaces API

Namespaces in operation, part 2: the namespaces API

Namespaces in operation, part 2: the namespaces API

Namespaces in operation, part 2: the namespaces API

setns syscall via syscall()

Namespaces in operation, part 2: the namespaces API

Namespaces in operation, part 2: the namespaces API

Namespaces in operation, part 2: the namespaces API

Namespaces in operation, part 2: the namespaces API

Namespaces in operation, part 2: the namespaces API

The `/proc/`PID`/ns` files