Namespaces in operation, part 4: more on PID namespaces

By Michael Kerrisk
January 23, 2013

In this article, we continue last week's discussion of PID namespaces (and extend our ongoing series on namespaces). One use of PID namespaces is to implement a package of processes (a container) that behaves like a self-contained Linux system. A key part of a traditional system—and likewise a PID namespace container—is the init process. Thus, we'll look at the special role of the init process and note one or two areas where it differs from the traditional init process. In addition, we'll look at some other details of the namespaces API as it applies to PID namespaces.

The PID namespace init process

The first process created inside a PID namespace gets a process ID of 1 within the namespace. This process has a similar role to the init process on traditional Linux systems. In particular, the init process can perform initializations required for the PID namespace as a whole (e.g., perhaps starting other processes that should be a standard part of the namespace) and becomes the parent for processes in the namespace that become orphaned.

In order to explain the operation of PID namespaces, we'll make use of a few purpose-built example programs. The first of these programs, ns_child_exec.c, has the following command-line syntax:

    ns_child_exec [options] command [arguments]

The ns_child_exec program uses the clone() system call to create a child process; the child then executes the given command with the optional arguments. The main purpose of the options is to specify new namespaces that should be created as part of the clone() call. For example, the -p option causes the child to be created in a new PID namespace, as in the following example:

    $ su                  # Need privilege to create a PID namespace
    Password:
    # ./ns_child_exec -p sh -c 'echo $$'
    1

That command line creates a child in a new PID namespace to execute a shell echo command that displays the shell's PID. With a PID of 1, the shell was the init process for the PID namespace that (briefly) existed while the shell was running.
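The essential step that ns_child_exec performs for the -p option can be sketched in a few lines of C. This is not the article's ns_child_exec.c; it is a minimal illustration that assumes Linux with glibc, and the names pid_in_new_namespace() and report_pid() are invented here. Instead of executing a command, the child simply reports the PID that it sees for itself.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Child entry point (hypothetical name): send our own PID, as seen from
   inside the new namespace, back to the parent through a pipe. */
static int report_pid(void *arg)
{
    int *fds = arg;
    long pid = (long) getpid();

    close(fds[0]);
    return write(fds[1], &pid, sizeof(pid)) == (ssize_t) sizeof(pid) ? 0 : 1;
}

/* Create a child in a new PID namespace and return the PID that the child
   sees for itself (expected to be 1). On failure, return a negative errno
   value; without privilege, clone() typically fails with EPERM. */
long pid_in_new_namespace(void)
{
    int fds[2];
    long pid_seen;
    char *stack;
    pid_t child;

    if (pipe(fds) == -1)
        return -errno;

    stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -ENOMEM;
    }

    /* The stack grows downward on most architectures, so pass its top. */
    child = clone(report_pid, stack + STACK_SIZE, CLONE_NEWPID | SIGCHLD, fds);
    if (child == -1) {
        pid_seen = -errno;
        close(fds[0]);
        close(fds[1]);
        free(stack);
        return pid_seen;
    }

    close(fds[1]);
    if (read(fds[0], &pid_seen, sizeof(pid_seen)) != (ssize_t) sizeof(pid_seen))
        pid_seen = -EIO;
    close(fds[0]);
    waitpid(child, NULL, 0);
    free(stack);
    return pid_seen;
}
```

When called with sufficient privilege, pid_in_new_namespace() should return 1, matching the shell session above; otherwise it returns a negative errno value.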

Our next example program, simple_init.c, is a program that we'll execute as the init process of a PID namespace. This program is designed to allow us to demonstrate some features of PID namespaces and the init process.

The simple_init program performs the two main functions of init. One of these functions is "system initialization". Most init systems are more complex programs that take a table-driven approach to system initialization. Our (much simpler) simple_init program provides a simple shell facility that allows the user to manually execute any shell commands that might be needed to initialize the namespace; this approach also allows us to freely execute shell commands in order to conduct experiments in the namespace. The other function performed by simple_init is to reap the status of its terminated children using waitpid().

Thus, for example, we can use the ns_child_exec program in conjunction with simple_init to fire up an init process that runs in a new PID namespace:

    # ./ns_child_exec -p ./simple_init
    init$

The init$ prompt indicates that the simple_init program is ready to read and execute a shell command.

We'll now use the two programs we've presented so far in conjunction with another small program, orphan.c, to demonstrate that processes that become orphaned inside a PID namespace are adopted by the PID namespace init process, rather than the system-wide init process.

The orphan program performs a fork() to create a child process. The parent process then exits while the child continues to run; when the parent exits, the child becomes an orphan. The child executes a loop that continues until it becomes an orphan (i.e., getppid() returns 1); once the child becomes an orphan, it terminates. The parent and the child print messages so that we can see when the two processes terminate and when the child becomes an orphan.

In order to see that our simple_init program reaps the orphaned child process, we'll employ that program's -v option, which causes it to produce verbose messages about the children that it creates and the terminated children whose status it reaps:

    # ./ns_child_exec -p ./simple_init -v
            init: my PID is 1
    init$ ./orphan
            init: created child 2
    Parent (PID=2) created child with PID 3
    Parent (PID=2; PPID=1) terminating
            init: SIGCHLD handler: PID 2 terminated
    init$                   # simple_init prompt interleaved with output from child
    Child  (PID=3) now an orphan (parent PID=1)
    Child  (PID=3) terminating
            init: SIGCHLD handler: PID 3 terminated

In the above output, the indented messages prefixed with init: are printed by the simple_init program's verbose mode. All of the other messages (other than the init$ prompts) are produced by the orphan program. From the output, we can see that the child process (PID 3) becomes an orphan when its parent (PID 2) terminates. At that point, the child is adopted by the PID namespace init process (PID 1), which reaps the child when it terminates.
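The logic of an orphan-style program can be sketched as below. This is not the article's orphan.c, and the function name parent_after_orphaning() is invented. One deliberate difference: rather than waiting for getppid() to return 1, the child waits until getppid() stops returning the original parent's PID, an assumption made here so that the sketch also behaves sensibly where some process other than PID 1 adopts orphans.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a short-lived "Parent" that forks a "Child" and exits at once.
   Return the parent PID that the Child sees after it has been orphaned,
   or a negative errno value on failure. */
long parent_after_orphaning(void)
{
    int fds[2];
    long new_parent;
    pid_t parent;

    if (pipe(fds) == -1)
        return -errno;

    parent = fork();
    if (parent == -1) {
        new_parent = -errno;
        close(fds[0]);
        close(fds[1]);
        return new_parent;
    }

    if (parent == 0) {                        /* the "Parent" */
        pid_t parent_pid = getpid();
        pid_t child;

        close(fds[0]);
        child = fork();
        if (child == 0) {                     /* the "Child" */
            long ppid;

            while (getppid() == parent_pid)   /* not yet an orphan */
                usleep(1000);
            ppid = (long) getppid();
            if (write(fds[1], &ppid, sizeof(ppid)) != (ssize_t) sizeof(ppid))
                _exit(1);
            _exit(0);
        }
        _exit(child == -1 ? 1 : 0);           /* exiting orphans the Child */
    }

    close(fds[1]);
    waitpid(parent, NULL, 0);
    if (read(fds[0], &new_parent, sizeof(new_parent)) != (ssize_t) sizeof(new_parent))
        new_parent = -EIO;
    close(fds[0]);
    return new_parent;
}
```

Run under simple_init inside a PID namespace, the value returned would be expected to be 1, corresponding to the "now an orphan (parent PID=1)" message in the session above.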

Signals and the init process

The traditional Linux init process is treated specially with respect to signals. The only signals that can be delivered to init are those for which the process has established a signal handler; all other signals are ignored. This prevents the init process—whose presence is essential for the stable operation of the system—from being accidentally killed, even by the superuser.

PID namespaces implement some analogous behavior for the namespace-specific init process. Other processes in the namespace (even privileged processes) can send only those signals for which the init process has established a handler. This prevents members of the namespace from inadvertently killing a process that has an essential role in the namespace. Note, however, that (as for the traditional init process) the kernel can still generate signals for the PID namespace init process in all of the usual circumstances (e.g., hardware exceptions, terminal-generated signals such as SIGTTOU, and expiration of a timer).

Signals can also (subject to the usual permission checks) be sent to the PID namespace init process by processes in ancestor PID namespaces. Again, only the signals for which the init process has established a handler can be sent, with two exceptions: SIGKILL and SIGSTOP. When a process in an ancestor PID namespace sends these two signals to the init process, they are forcibly delivered (and can't be caught). The SIGSTOP signal stops the init process; SIGKILL terminates it. Since the init process is essential to the functioning of the PID namespace, if the init process is terminated by SIGKILL (or it terminates for any other reason), the kernel terminates all other processes in the namespace by sending them a SIGKILL signal.
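The handler rule can be observed from inside the namespace with a small experiment, sketched below under the assumption of Linux with glibc. The names init_signal_check() and init_main() are invented for this illustration. The child, running as PID 1 of a new PID namespace, sends itself SIGTERM twice: once with the default disposition and once after establishing a handler.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static volatile sig_atomic_t got_sigterm = 0;

static void on_sigterm(int sig)
{
    (void) sig;
    got_sigterm = 1;
}

/* Runs as PID 1 of the new namespace (hypothetical helper). */
static int init_main(void *arg)
{
    int *fds = arg;
    struct sigaction sa;
    long report;

    close(fds[0]);

    /* No handler yet: a SIGTERM sent from inside the namespace should be
       discarded. If it were delivered, the default action would kill us. */
    kill(getpid(), SIGTERM);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);

    /* With a handler established, the same signal is delivered. */
    kill(getpid(), SIGTERM);

    report = got_sigterm ? 1 : 0;
    return write(fds[1], &report, sizeof(report)) == (ssize_t) sizeof(report) ? 0 : 1;
}

/* Returns 1 if the namespace init survived the first SIGTERM and caught the
   second, 0 if the handler never ran, or a negative errno value on failure
   (for example -EPERM without privilege, or -EIO if the child died). */
long init_signal_check(void)
{
    int fds[2];
    long result;
    char *stack;
    pid_t child;

    if (pipe(fds) == -1)
        return -errno;
    stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -ENOMEM;
    }

    child = clone(init_main, stack + STACK_SIZE, CLONE_NEWPID | SIGCHLD, fds);
    if (child == -1) {
        result = -errno;
        close(fds[0]);
        close(fds[1]);
        free(stack);
        return result;
    }

    close(fds[1]);
    if (read(fds[0], &result, sizeof(result)) != (ssize_t) sizeof(result))
        result = -EIO;
    close(fds[0]);
    waitpid(child, NULL, 0);
    free(stack);
    return result;
}
```

With privilege, the expected result is 1. The same two kill() calls made by an ordinary process would terminate it at the first call.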

Normally, a PID namespace will also be destroyed when its init process terminates. However, there is an unusual corner case: the namespace won't be destroyed as long as a /proc/PID/ns/pid file for one of the processes in that namespace is bind mounted or held open. Even so, it is not possible to create new processes in the namespace (via setns() plus fork()): the lack of an init process is detected during the fork() call, which fails with an ENOMEM error (the traditional error indicating that a PID cannot be allocated). In other words, the PID namespace continues to exist, but is no longer usable.
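The ENOMEM failure can be reproduced with a short experiment. The sketch below is an assumption-laden illustration, not code from the article: it reaches the namespace through unshare() rather than setns(), and the function names are invented. Everything happens in a throwaway helper process so that the caller itself is unaffected.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Helper process body (hypothetical name). Reports the errno of the second
   fork(), 0 if that fork() unexpectedly succeeded, or a negative errno
   value if the setup failed. */
static void helper_main(int out_fd)
{
    long report;
    pid_t pid;

    if (unshare(CLONE_NEWPID) == -1) {
        report = -errno;
    } else if ((pid = fork()) == -1) {
        report = -errno;
    } else if (pid == 0) {
        _exit(0);                  /* PID 1 of the namespace exits at once */
    } else {
        waitpid(pid, NULL, 0);     /* the namespace has now lost its init */
        pid = fork();              /* expected to fail */
        if (pid == 0)
            _exit(0);
        if (pid == -1) {
            report = errno;
        } else {
            waitpid(pid, NULL, 0);
            report = 0;
        }
    }
    if (write(out_fd, &report, sizeof(report)) != (ssize_t) sizeof(report))
        _exit(1);
    _exit(0);
}

/* Returns ENOMEM in the expected case; see helper_main() for other values. */
long fork_errno_after_init_exit(void)
{
    int fds[2];
    long result;
    pid_t helper;

    if (pipe(fds) == -1)
        return -errno;
    helper = fork();
    if (helper == -1) {
        result = -errno;
        close(fds[0]);
        close(fds[1]);
        return result;
    }
    if (helper == 0) {
        close(fds[0]);
        helper_main(fds[1]);       /* does not return */
    }
    close(fds[1]);
    if (read(fds[0], &result, sizeof(result)) != (ssize_t) sizeof(result))
        result = -EIO;
    close(fds[0]);
    waitpid(helper, NULL, 0);
    return result;
}
```

With privilege, the function is expected to return ENOMEM; without it, unshare() fails and a negative value such as -EPERM comes back instead.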

Mounting a procfs filesystem (revisited)

In the previous article in this series, the /proc filesystems (procfs) for the PID namespaces were mounted at various locations other than the traditional /proc mount point. This allowed us to use shell commands to look at the contents of the /proc/PID directories that corresponded to each of the new PID namespaces while at the same time using the ps command to look at the processes visible in the root PID namespace.

However, tools such as ps rely on the contents of the procfs mounted at /proc to obtain the information that they require. Therefore, if we want ps to operate correctly inside a PID namespace, we need to mount a procfs for that namespace. Since the simple_init program permits us to execute shell commands, we can perform this task from the command line, using the mount command:

    # ./ns_child_exec -p -m ./simple_init
    init$ mount -t proc proc /proc
    init$ ps a
      PID TTY      STAT   TIME COMMAND
        1 pts/8    S      0:00 ./simple_init
        3 pts/8    R+     0:00 ps a

The ps a command lists all processes accessible via /proc. In this case, we see only two processes, reflecting the fact that there are only two processes running in the namespace.

When running the ns_child_exec command above, we employed that program's -m option, which places the child that it creates (i.e., the process running simple_init) inside a separate mount namespace. As a consequence, the mount command does not affect the /proc mount seen by processes outside the namespace.
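The same steps can be carried out programmatically. The following sketch (hypothetical names, Linux with glibc assumed) creates a child in new PID and mount namespaces, mounts a procfs at /proc, and counts the per-process directories that appear there. It also marks the mount tree as a slave before mounting; as noted in the comments following this article, some distributions enable mount propagation by default, and without that step the new /proc mount could become visible outside the namespace.

```c
#define _GNU_SOURCE
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Mount a procfs for the current PID namespace and return the number of
   numeric (per-process) entries in /proc, or a negative errno value. */
static long mount_proc_and_count(void)
{
    DIR *dir;
    struct dirent *entry;
    long count = 0;

    /* Stop mount events from propagating out of this mount namespace. */
    if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) == -1)
        return -errno;
    if (mount("proc", "/proc", "proc", 0, NULL) == -1)
        return -errno;

    dir = opendir("/proc");
    if (dir == NULL)
        return -errno;
    while ((entry = readdir(dir)) != NULL)
        if (isdigit((unsigned char) entry->d_name[0]))
            count++;
    closedir(dir);
    return count;
}

/* Runs as PID 1 of the new PID namespace, in the new mount namespace. */
static int init_main(void *arg)
{
    int *fds = arg;
    long report = mount_proc_and_count();

    close(fds[0]);
    return write(fds[1], &report, sizeof(report)) == (ssize_t) sizeof(report) ? 0 : 1;
}

/* Returns the number of processes listed in the fresh /proc (expected: 1,
   the namespace init itself), or a negative errno value on failure. */
long pids_listed_in_fresh_proc(void)
{
    int fds[2];
    long result;
    char *stack;
    pid_t child;

    if (pipe(fds) == -1)
        return -errno;
    stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -ENOMEM;
    }

    child = clone(init_main, stack + STACK_SIZE,
                  CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, fds);
    if (child == -1) {
        result = -errno;
        close(fds[0]);
        close(fds[1]);
        free(stack);
        return result;
    }

    close(fds[1]);
    if (read(fds[0], &result, sizeof(result)) != (ssize_t) sizeof(result))
        result = -EIO;
    close(fds[0]);
    waitpid(child, NULL, 0);
    free(stack);
    return result;
}
```

The mount made by the child disappears together with its mount namespace when the child exits, so the caller's view of /proc is not changed.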

unshare() and setns()

In the second article in this series, we described two system calls that are part of the namespaces API: unshare() and setns(). Since Linux 3.8, these system calls can be employed with PID namespaces, but they have some idiosyncrasies when used with those namespaces.

Specifying the CLONE_NEWPID flag in a call to unshare() creates a new PID namespace, but does not place the caller in the new namespace. Rather, any children created by the caller will be placed in the new namespace; the first such child will become the init process for the namespace.
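A short sketch makes the behavior concrete. The function names are invented and the code assumes Linux 3.8 or later with glibc. A helper process calls unshare(CLONE_NEWPID), checks that its own PID has not changed, and then creates a child, which reports the PID that it sees for itself.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Helper process body (hypothetical name). */
static void helper_main(int out_fd)
{
    pid_t before = getpid();
    long report;
    pid_t child;

    if (unshare(CLONE_NEWPID) == -1) {
        report = -errno;
    } else if (getpid() != before) {
        report = -EINVAL;            /* not expected: our PID is unchanged */
    } else if ((child = fork()) == -1) {
        report = -errno;
    } else if (child == 0) {
        report = (long) getpid();    /* first child: init of the namespace */
    } else {
        waitpid(child, NULL, 0);
        _exit(0);                    /* the child has already reported */
    }
    if (write(out_fd, &report, sizeof(report)) != (ssize_t) sizeof(report))
        _exit(1);
    _exit(0);
}

/* Returns the PID that the first child created after unshare() sees for
   itself (expected: 1), or a negative errno value on failure. */
long first_child_pid_after_unshare(void)
{
    int fds[2];
    long result;
    pid_t helper;

    if (pipe(fds) == -1)
        return -errno;
    helper = fork();
    if (helper == -1) {
        result = -errno;
        close(fds[0]);
        close(fds[1]);
        return result;
    }
    if (helper == 0) {
        close(fds[0]);
        helper_main(fds[1]);         /* does not return */
    }
    close(fds[1]);
    if (read(fds[0], &result, sizeof(result)) != (ssize_t) sizeof(result))
        result = -EIO;
    close(fds[0]);
    waitpid(helper, NULL, 0);
    return result;
}
```

The unshare() call is made in a separate helper process on purpose: once a process has called unshare(CLONE_NEWPID), all of its later children land in the new namespace, which is rarely what the main program wants.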

The setns() system call now supports PID namespaces:

    setns(fd, 0);   /* Second argument can be CLONE_NEWPID to force a
                       check that 'fd' refers to a PID namespace */

The fd argument is a file descriptor that identifies a PID namespace that is a descendant of the PID namespace of the caller; that file descriptor is obtained by opening the /proc/PID/ns/pid file for one of the processes in the target namespace. As with unshare(), setns() does not move the caller to the PID namespace; instead, children that are subsequently created by the caller will be placed in the namespace.
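Put together, the open, setns(), and fork() steps might look like the sketch below. It is only an illustration with invented function names, assuming Linux 3.8 or later with glibc; the path argument is whatever /proc/PID/ns/pid file identifies the target namespace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Helper process body (hypothetical name): join the PID namespace named by
   'ns_path' for future children, then create one child. */
static void helper_main(int out_fd, const char *ns_path)
{
    long report;
    pid_t child;
    int fd = open(ns_path, O_RDONLY | O_CLOEXEC);

    if (fd == -1) {
        report = -errno;
    } else if (setns(fd, CLONE_NEWPID) == -1) {   /* verify it is a PID ns */
        report = -errno;
    } else if ((child = fork()) == -1) {
        report = -errno;
    } else if (child == 0) {
        report = (long) getpid();    /* PID within the target namespace */
    } else {
        waitpid(child, NULL, 0);
        _exit(0);                    /* the child has already reported */
    }
    if (write(out_fd, &report, sizeof(report)) != (ssize_t) sizeof(report))
        _exit(1);
    _exit(0);
}

/* Returns the PID that a child created after setns() sees for itself, or a
   negative errno value on failure. */
long child_pid_after_setns(const char *ns_path)
{
    int fds[2];
    long result;
    pid_t helper;

    if (pipe(fds) == -1)
        return -errno;
    helper = fork();
    if (helper == -1) {
        result = -errno;
        close(fds[0]);
        close(fds[1]);
        return result;
    }
    if (helper == 0) {
        close(fds[0]);
        helper_main(fds[1], ns_path);   /* does not return */
    }
    close(fds[1]);
    if (read(fds[0], &result, sizeof(result)) != (ssize_t) sizeof(result))
        result = -EIO;
    close(fds[0]);
    waitpid(helper, NULL, 0);
    return result;
}
```

For example, child_pid_after_setns("/proc/9147/ns/pid") would be the call corresponding to a simple_init process whose PID in the caller's namespace is 9147; the value returned is the new child's PID inside that namespace.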

We can use an enhanced version of the ns_exec.c program that we presented in the second article in this series to demonstrate some aspects of using setns() with PID namespaces that appear surprising until we understand what is going on. The new program, ns_run.c, has the following syntax:

    ns_run [-f] [-n /proc/PID/ns/FILE]... command [arguments]

The program uses setns() to join the namespaces specified by the /proc/PID/ns files contained within -n options. It then goes on to execute the given command with optional arguments. If the -f option is specified, it uses fork() to create a child process that is used to execute the command.

Suppose that, in one terminal window, we fire up our simple_init program in a new PID namespace in the usual manner, with verbose logging so that we are informed when it reaps child processes:

    # ./ns_child_exec -p ./simple_init -v
            init: my PID is 1
    init$ 

Then we switch to a second terminal window where we use the ns_run program to execute our orphan program. This will have the effect of creating two processes in the PID namespace governed by simple_init:

    # ps -C sleep -C simple_init
      PID TTY          TIME CMD
     9147 pts/8    00:00:00 simple_init
    # ./ns_run -f -n /proc/9147/ns/pid ./orphan
    Parent (PID=2) created child with PID 3
    Parent (PID=2; PPID=0) terminating
    # 
    Child  (PID=3) now an orphan (parent PID=1)
    Child  (PID=3) terminating

Looking at the output from the "Parent" process (PID 2) created when the orphan program is executed, we see that its parent process ID is 0. This reflects the fact that the process that started the orphan process (ns_run) is in a different namespace—one whose members are invisible to the "Parent" process. As already noted in the previous article, getppid() returns 0 in this case.

The following diagram shows the relationships of the various processes before the orphan "Parent" process terminates. The arrows indicate parent-child relationships between processes.

[Figure: relationship of processes inside PID namespaces]

Returning to the window running the simple_init program, we see the following output:

    init: SIGCHLD handler: PID 3 terminated

The "Child" process (PID 3) created by the orphan program was reaped by simple_init, but the "Parent" process (PID 2) was not. This is because the "Parent" process was reaped by its parent (ns_run) in a different namespace. The following diagram shows the processes and their relationships after the orphan "Parent" process has terminated and before the "Child" terminates.

[Figure: relationship of processes inside PID namespaces]

It's worth emphasizing that setns() and unshare() treat PID namespaces specially. For other types of namespaces, these system calls do change the namespace of the caller. They do not change the PID namespace of the calling process because becoming a member of another PID namespace would cause the process's idea of its own PID to change, since getpid() reports the process's PID with respect to the PID namespace in which the process resides. Many user-space programs and libraries rely on the assumption that a process's PID (as reported by getpid()) is constant (in fact, the GNU C library getpid() wrapper function caches the PID); those programs would break if a process's PID changed. To put things another way: a process's PID namespace membership is determined when the process is created, and (unlike other types of namespace membership) cannot be changed thereafter.

Concluding remarks

In this article we've looked at the special role of the PID namespace init process, shown how to mount a procfs for a PID namespace so that it can be used by tools such as ps, and looked at some of the peculiarities of unshare() and setns() when employed with PID namespaces. This completes our discussion of PID namespaces; in the next article, we'll turn to look at user namespaces.


Namespaces in operation, part 4: more on PID namespaces

Posted Jan 23, 2013 19:09 UTC (Wed) by luto (subscriber, #39314) [Link]

I wrote a little tool awhile ago to play with this stuff. It's here:

http://web.mit.edu/luto/www/linux/nnp/newns.c

Have fun! If any of you find it useful, let me know -- I can probably polish it a bit and send it to util-linux or something.

Also, on very new kernels (3.8+), a lot of this stuff can be done without privilege if you're willing to accept a few restrictions.

Namespaces in operation, part 4: more on PID namespaces

Posted Jan 24, 2013 16:56 UTC (Thu) by dashesy (guest, #74652) [Link]

Integrating these nice utilities into util-linux is an excellent idea. A few different simple utilities, along with some simple SysV-init scripts (that go in /etc/init.d of the ns for convenience) and a generic init suitable for namespaces (a polished simple_init.c), are all that is needed to experiment with ns.

Namespaces in operation, part 4: more on PID namespaces

Posted Jan 25, 2013 16:44 UTC (Fri) by fishface60 (subscriber, #88700) [Link]

Util-Linux already has such tools in git: unshare works with all the namespaces and nsenter lets you create a new process in an existing namespace.

I'm quite looking forward to being able to launch a container in shell.

Namespaces in operation, part 4: more on PID namespaces

Posted Jan 23, 2013 21:44 UTC (Wed) by dashesy (guest, #74652) [Link]

Thanks, I have become quite fond of this excellent series!

Is there a way I can bookmark the entire series rather than individual articles?

Bookmarking the series

Posted Jan 23, 2013 21:50 UTC (Wed) by corbet (editor, #1) [Link]

The initial article is gaining links to the rest as we go along. The namespaces section in the Kernel Index might also prove useful.

Bookmarking the series

Posted Jan 23, 2013 23:31 UTC (Wed) by dashesy (guest, #74652) [Link]

Thanks, yes the index is better than I could imagine.

Bookmarking the series

Posted Jan 24, 2013 4:44 UTC (Thu) by xxiao (guest, #9631) [Link]

I was thinking about the same question today. It would be _great_ if I could somehow bookmark interesting LWN articles and have them associated with my account for future reference. How hard would that be to implement?

Thanks!

Bookmarking the series

Posted Jan 25, 2013 15:03 UTC (Fri) by nix (subscriber, #2304) [Link]

You've already got it. A web browser with sync support (FF or Chrome) and, um, ordinary bookmarks.

Bookmarking the series

Posted Jan 27, 2013 0:23 UTC (Sun) by xxiao (guest, #9631) [Link]

that's not very useful when I'm on the go, even with xmarks' help.
it's much better to login to lwn then see all my favourite links booked on this site.

Bookmarking the series

Posted Jan 27, 2013 18:06 UTC (Sun) by nix (subscriber, #2304) [Link]

So... you want every single website you use to implement its own implementation of bookmarks (all different), because you have web browsers that don't implement syncing? (What web browser would that be? Chrome does it, Firefox does it... Epiphany, perhaps?)

Namespaces in operation, part 4: more on PID namespaces

Posted Jan 25, 2013 10:00 UTC (Fri) by sorokin (subscriber, #88478) [Link]

"The only signals that can be delivered to init are those for which the process has established a signal handler; all other signals are ignored"

Looks like some dirty hack. Why init process can not disable signals itself?

Namespaces in operation, part 4: more on PID namespaces

Posted Jan 25, 2013 10:27 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

I can think of three reasons. The first is that establishing the signal handlers takes some time. The second is that the list of signals you know might not be exhaustive, especially some time later. The third I am not sure about, and I am too lazy to check the code right now, but does that perhaps include signal handlers you can't block yourself?

Namespaces in operation, part 4: more on PID namespaces

Posted Jan 25, 2013 10:35 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

I would hope so. (The easy way to check is to sit in front of a machine you don't mind having crash, and try sending SIGKILL to PID 1.)

Namespaces in operation, part 4: more on PID namespaces

Posted Jan 25, 2013 18:07 UTC (Fri) by ebiederm (subscriber, #35028) [Link]

Yes SIGKILL and SIGSTOP are the biggies.

The reason for ignoring the others is that is the way things have worked for "init" processes as far back in the linux history as I have looked, and maintaining backwards compatibility is important.

Namespaces in operation, part 4: more on PID namespaces

Posted Jan 31, 2013 16:26 UTC (Thu) by alex2 (guest, #73934) [Link]

Can the network namespace be used to restrict a program so it can't phone home? Sometimes I'd like to test some commercial demo version but I really don't want it to report an unknown amount of information back to the company.

Namespaces in operation, part 4: more on PID namespaces

Posted Feb 5, 2013 12:50 UTC (Tue) by Lennie (guest, #49641) [Link]

You can set up iptables inside the network namespace; if you trust the program not to change it, you'll be fine.

This is because I'm not sure how well you can control, from the parent namespace, what packets can and cannot be sent from the network namespace.

Namespaces in operation, part 4: more on PID namespaces

Posted Feb 5, 2013 16:18 UTC (Tue) by bjencks (subscriber, #80303) [Link]

A fresh network namespace only has a loopback interface. If you don't add any other interfaces, it's totally isolated network-wise.

(Note that you can still connect to filesystem-namespace unix sockets if you can access them as files -- you need to chroot or use mount namespaces if you want to hide them as well. I believe abstract namespace unix sockets are isolated per-namespace.)

Namespaces in operation, part 4: more on PID namespaces

Posted Jun 27, 2013 20:00 UTC (Thu) by Urhixidur (guest, #91620) [Link]

Creating a mount namespace at the same time that a PID namespace is created is an elegant solution, but seems fallible. Nothing prevents new processes from being created in the new PID namespace without simultaneously joining the mount namespace. This seems error-prone. Or am I missing something?

Can't get mount namespaces to behave as expected

Posted Mar 5, 2015 2:04 UTC (Thu) by apollock (subscriber, #14629) [Link]

Hi, I'm messing around with your utilities, as well as unshare and the first commentator's newns, and I can't seem to get mount namespaces to work as I'd expect.

I'm creating a new everything, i.e.:

    sudo unshare --mount --uts --net --pid --fork --mount-proc /bin/bash
    sudo /tmp/newns --uts --mount --pid --init --net /bin/bash
    sudo /tmp/ns_child_exec -p -m /tmp/simple_init

and then unmounting a filesystem from that shell, and it's getting unmounted in another shell that hasn't been interacting with the namespace, which isn't what I would have expected? Similarly, if I mount /proc in the last two example invocations above, it clobbers the systemwide /proc mount with what's going on inside my new PID namespace. Also not what I would have expected?

I'm using 3.19.0

Can't get mount namespaces to behave as expected

Posted Mar 5, 2015 9:25 UTC (Thu) by mkerrisk (subscriber, #1978) [Link]

@apollock: yes, I recently commented on this in another article in this series. Basically, some distros (e.g., Fedora) these days enable mount propagation by default, which means that when you mount /proc in the new mount namespace, you do indeed clobber /proc in the initial mount namespace.

So, in the new namespace, you need to disable propagation of mount events on /, either by making it a private mount (prevents propagation in both directions) or by making it a slave mount (allows propagation of mount events under / into the new namespace, but doesn't propagate events outside the new namespace). So, for example, in the shell session under the heading Mounting a procfs filesystem (revisited), we should add one further shell command:

    # ./ns_child_exec -p -m ./simple_init
    init$ mount --make-slave /            # <== NEW
    init$ mount -t proc proc /proc
    init$ ps a

For more info about mount propagation, see the kernel source file Documentation/filesystems/sharedsubtree.txt and the mount(8) man page.

Can't get mount namespaces to behave as expected

Posted Mar 5, 2015 23:55 UTC (Thu) by apollock (subscriber, #14629) [Link]

Thanks for the quick response.

It looks like I have to do the same thing to /proc prior to mounting it

Can't get mount namespaces to behave as expected

Posted Mar 6, 2015 9:19 UTC (Fri) by mkerrisk (subscriber, #1978) [Link]

> Thanks for the quick response.

Actually, it was quite by chance. I happened to be checking some details in these articles myself.

> It looks like I have to do the same thing to /proc prior to mounting it

I don't believe that should be necessary. What makes you think that it is?

Can't get mount namespaces to behave as expected

Posted Mar 7, 2015 7:53 UTC (Sat) by apollock (subscriber, #14629) [Link]

Because /proc was still getting clobbered outside of my namespace without it when I mounted it inside my namespace.

I was basically testing two scenarios:

1) Unmounting a filesystem that was mounted inside and outside the new namespace. Expected behaviour: it was only unmounted inside the new namespace

2) Mounting /proc inside the new namespace. Expected behaviour: only seeing the process entries for processes inside the new namespace inside the namespace, and there being no impact outside this namespace

Can't get mount namespaces to behave as expected

Posted Mar 7, 2015 10:04 UTC (Sat) by mkerrisk (subscriber, #1978) [Link]

So, going back to your earlier comment:

> It looks like I have to do the same thing to /proc prior to mounting it

Yes, you're right. I was getting confused with another case, where if we mount a procfs at a location other than the usual /proc, then we need to make / a private or slave mount in order not to have that mount appear in the initial mount namespace.

So, in fact all that's needed if we're mounting at /proc inside the simple_init program is

    # ./ns_child_exec -p -m ./simple_init
    init$ mount --make-slave /proc            # <== NEW
    init$ mount -t proc proc /proc
    init$ ps a

Nothing needs to be done to /, as far as I can tell.

Namespaces in operation, part 4: more on PID namespaces

Posted Dec 15, 2016 2:26 UTC (Thu) by orbisvicis (guest, #113024) [Link]

The final example isn't working properly. Orphan's child does nothing after the parent successfully exits - no "Child ..." lines. Also, orphan's child is rarely reparented to ns_child_exec: the line "init: SIGCHLD handler: PID ... terminated" rarely happens.


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds