
Process IDs in a multi-namespace world

By Jonathan Corbet
November 6, 2007
Last week's article on containers discussed process ID namespaces. The purpose of these namespaces is to manage which processes are visible to a process inside a container. The heavy use of PIDs to identify processes has caused this particular patch to go through a long period of development before being merged for 2.6.24. It appears that there are some remaining issues, though, which could prevent this feature from being available in the next kernel release. As is often the case, the biggest problems come down to user-space API issues.

On November 1, Ingo Molnar pointed out that some questions raised by Ulrich Drepper back in early 2006 remained unanswered. These questions all have to do with what happens when the use of a PID escapes the namespace that it belongs to. There are a number of kernel APIs related to interprocess communication and synchronization where this could happen. Realtime signals carry process ID information, as do SYSV message queues. At best, making these interfaces work properly across PID namespaces will require that the kernel perform magic PID translations whenever a PID crosses a namespace boundary.
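What such a translation might look like can be sketched in a few lines of Python. This is a toy model, not kernel code: the function names, the namespace labels, and the choice to report 0 for a sender that is invisible to the receiver are all assumptions made for illustration.

```python
# Toy model of per-namespace PID translation. All names here are
# hypothetical; this is not how the kernel stores or looks up PIDs.


def translate_pid(task_pids, reader_ns):
    """Return the PID by which a task is known in reader_ns.

    task_pids maps a namespace label to the task's PID there. A task that
    is not visible from reader_ns has no entry, and this sketch assumes
    that 0 ("no such process here") is reported in that case.
    """
    return task_pids.get(reader_ns, 0)


def deliver_signal(sender_pids, receiver_ns, signo):
    """Build the sender information a receiver in receiver_ns would see."""
    return {
        "signo": signo,
        "sender_pid": translate_pid(sender_pids, receiver_ns),
    }


# A task inside container "a" is also visible from the initial namespace,
# but under a different number. It is not visible from container "b".
sender = {"init": 4711, "a": 12}
RT_SIGNAL = 34  # assumed realtime signal number, for illustration only

seen_in_a = deliver_signal(sender, "a", RT_SIGNAL)
seen_in_init = deliver_signal(sender, "init", RT_SIGNAL)
seen_in_b = deliver_signal(sender, "b", RT_SIGNAL)
```

The point of the sketch is that the same sender has to be described by a different number depending on who is asking, so every interface that hands a PID to another process needs a hook of this kind.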

The biggest sticking point, though, would appear to be the robust futex mechanism, which uses PIDs to track which process owns a specific futex at any given time. One of the key points behind futexes is that the fast acquisition path (when there is no contention for the futex) does not require the kernel's involvement at all. But that acquisition path is also where the PID field is set. So there is no way to let the kernel perform magic PID translation without destroying the performance feature that was the motivation for futexes in the first place.
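A small simulation may make the difficulty clearer. The three constants below match the futex lock-word layout in the kernel's futex header; everything else (the FutexWord class, the helper names, the two containers) is hypothetical, and the compare-and-swap is ordinary Python standing in for an atomic instruction.

```python
# Layout of the 32-bit robust futex lock word (from <linux/futex.h>).
FUTEX_WAITERS = 0x80000000
FUTEX_OWNER_DIED = 0x40000000
FUTEX_TID_MASK = 0x3FFFFFFF


class FutexWord:
    """Stands in for a 32-bit lock word living in shared memory."""

    def __init__(self):
        self.value = 0

    def compare_and_swap(self, expected, new):
        # Real code uses an atomic CPU instruction; this is a plain
        # Python stand-in that is only safe in a single thread.
        if self.value == expected:
            self.value = new
            return True
        return False


def try_lock_fast(word, my_tid):
    """Uncontended acquire: user space stores its own thread ID.

    No system call is made, so the kernel never sees (and cannot
    translate) the ID written here.
    """
    return word.compare_and_swap(0, my_tid & FUTEX_TID_MASK)


def owner_tid(word):
    """Extract the owner ID from the lock word."""
    return word.value & FUTEX_TID_MASK


word = FutexWord()

# A task in container A, known as ID 5 inside its own namespace,
# takes the lock on the fast path.
locked = try_lock_fast(word, 5)

# A task in container B maps the same memory and reads the owner field.
# In B's namespace, ID 5 names a different task or nothing at all.
owner_seen_from_b = owner_tid(word)
```

The number stored in the word is only meaningful relative to the namespace of whoever wrote it, and nothing in the word records which namespace that was.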

Ingo, Ulrich, and others who are concerned about this problem would like to see the PID namespace feature completely disabled in the 2.6.24 release so that there will be time to come up with a proper solution. But it is not clear what form that solution would take, or if it is even necessary.

The approach seemingly favored by Ulrich is to eliminate some of the fine-grained control that the kernel currently provides over the sharing of namespaces. With the 2.6.24-rc1 interface, a process which calls clone() can request that the child be placed into a new PID namespace, but that other namespaces (filesystems, for example, or networking) be shared. That, says Ulrich, is asking for trouble:

This whole approach to allow switching on and off each of the namespaces is just wrong. Do it all or nothing, at least for the problematic ones like NEWPID. Having access to the same filesystem but using separate PID namespaces is simply not going to work.

Coalescing a number of the namespace options into a single "new container" bit would help with the current shortage of clone bits. But it might well not succeed in solving the API issues. Even processes with different filesystem namespaces might be able to find the same futex via a file visible in both namespaces. The passing of credentials over Unix-domain sockets could throw in an interesting twist. And it would seem that there are other places where PIDs are used that nobody has really thought of yet.
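The difference between the two interfaces can be sketched as follows. The CLONE_NEW* values are the ones found in the kernel's sched.h header; the CLONE_NEWCONTAINER mask and the pairing rule in flags_allowed() are hypothetical, intended only to illustrate an "all or nothing" policy of the sort Ulrich describes.

```python
# Per-namespace clone() flags, as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000   # mount (filesystem) namespace
CLONE_NEWUTS = 0x04000000  # hostname namespace
CLONE_NEWIPC = 0x08000000  # SYSV IPC namespace
CLONE_NEWPID = 0x20000000  # PID namespace
CLONE_NEWNET = 0x40000000  # network namespace

# Hypothetical: one "new container" request that implies all of them.
CLONE_NEWCONTAINER = (
    CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET
)

# Hypothetical policy: these two must be requested together or not at all.
MUST_PAIR = CLONE_NEWPID | CLONE_NEWNS


def flags_allowed(flags):
    """Reject a new PID namespace that still shares the filesystem."""
    requested = flags & MUST_PAIR
    return requested == 0 or requested == MUST_PAIR


fine_grained = CLONE_NEWPID                 # what 2.6.24-rc1 permits
paired = CLONE_NEWPID | CLONE_NEWNS
results = [flags_allowed(f) for f in (fine_grained, paired, CLONE_NEWCONTAINER, 0)]
```

As the paragraph above notes, even a rule like this does not close every path by which a PID can leak between namespaces.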

Another possible approach, one which hasn't really featured in the current debate, would be to create globally-unique PIDs which would work across namespaces. The current 32-bit PID value could be split into two fields, with the most significant bits indicating which namespace the PID (contained in the least significant bits) is defined in. Most of the time, only the low-order part of the PID would be needed; it would be interpreted relative to the current PID namespace. But, in places where it makes sense, the full, unique PID could be used. That would enable features like futexes to work across PID namespaces.
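As a rough illustration, the split could look like the following. The 10-bit namespace field is an arbitrary assumption (no particular width has been proposed), and the function names are made up for this sketch.

```python
# Assumed split of a 32-bit value: the width of the namespace field is
# an arbitrary choice made for this sketch.
NS_BITS = 10
LOCAL_BITS = 32 - NS_BITS
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def make_global_pid(ns_id, local_pid):
    """Pack a namespace ID and a namespace-local PID into one value."""
    if not 0 <= ns_id < (1 << NS_BITS):
        raise ValueError("namespace ID does not fit in the high bits")
    if not 0 < local_pid <= LOCAL_MASK:
        raise ValueError("local PID does not fit in the low bits")
    return (ns_id << LOCAL_BITS) | local_pid


def split_global_pid(global_pid):
    """Return (namespace ID, local PID) for a packed value."""
    return global_pid >> LOCAL_BITS, global_pid & LOCAL_MASK


def local_view(global_pid):
    """What most interfaces would use: just the low-order part."""
    return global_pid & LOCAL_MASK


# Two tasks that both see themselves as PID 5, in namespaces 3 and 4.
pid_a = make_global_pid(3, 5)
pid_b = make_global_pid(4, 5)
```

Both tasks have the same local view, but the packed values differ, which is what a futex owner field would need. The cost is visible in the constants: every bit given to the namespace field halves the number of PIDs available within each namespace.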

There are still problems, of course. The whole point of PID namespaces is to completely hide processes which are outside of the current namespace; the creation and use of globally-unique PIDs pokes holes in that isolation. And there are sure to be some complications in the user-space API which prove to be hard to paper over.

Then, there is the question of whether this problem is truly important or not. Linus thinks not, pointing out that the sharing of PIDs across namespaces is analogous to the use of PIDs in lock files shared across a network. PID-based locking does not work on NFS-mounted files, and PID-based interfaces will not work between PID namespaces. Linus concludes:

I don't understand how you can call this a "PID namespace design bug", when it clearly has nothing what-so-ever to do with pid namespaces, and everything to do with the *futexes* that blithely assume that pid's are unique and that made it part of the user-visible interface.

One could argue that the conflict with PID namespaces was known when the robust futex feature was merged and that something could have been done at that time. But that does not really help anybody now. And, in any case, there are issues beyond futexes.

PID namespaces are a significant complication of the user-space API; they redefine a basic value which has had a well-understood meaning since the early days of Unix. So it is not surprising that some interesting questions have come to light. Getting solid answers to nagging API questions has not always been the strongest point of the Linux development process, but things could always change. With luck and some effort, these questions can be worked through so that PID namespaces, when they become available, will have well-thought-out and well-defined semantics in all cases and will support the functionality that users need.


Process IDs in a multi-namespace world

Posted Nov 6, 2007 19:34 UTC (Tue) by sayler (guest, #3164)

One thing I'm unclear on:

Are people arguing that you would want to (eg) share futexes between separate PID-spaces, or
is there some other reason that people find this offensive?

I find it hard to argue that sharing futexes between containers is a good idea, which
makes me think I'm missing something.

Process IDs in a multi-namespace world

Posted Nov 6, 2007 19:45 UTC (Tue) by elanthis (guest, #6227)

That's pretty much Linus' point in all this.  There are things that won't work with separate
PID namespaces, but doing those things is crazy, and there's not much reason for us to care.

Process migration

Posted Nov 6, 2007 19:51 UTC (Tue) by i3839 (guest, #31386)

Globally unique PIDs seem to solve the whole problem. The only counterargument I know of is that
it could make migrating processes harder. But if the range of PIDs can be configured, the
systems where processes are migrated to and from can be configured in such a way that there's no
PID collision.

Process migration

Posted Nov 8, 2007 8:46 UTC (Thu) by alexl (subscriber, #19068)

Why not just have two kinds of pids, namespace-local and global?

Local pids look like current pids, global ones have the high bit set. Each process can now be
"named" in two ways (namespace relative or absolute). 

Clearly some things (like kill(2)) have to verify that a pid referenced through a global
identifier is in the same namespace (or has right to affect the other namespace), but the
global id is useful for things like the robust futexes.

Again - what if different systems are involved?

Posted Nov 9, 2007 22:31 UTC (Fri) by khim (subscriber, #9252)

Suppose I have 400'000-500'000 computers (Google actually does) and I want to use globally unique IDs. And one bit to separate global vs local PIDs. I can only have ~4000 processes on one computer! In all containers! Combined! Plus I'll need a complex system to keep all these tables around somewhere and do a lot of other things - just to make it possible to do some insane things...

Looks like a bad tradeoff to me...

Process IDs in a multi-namespace world

Posted Nov 6, 2007 20:44 UTC (Tue) by iq-0 (subscriber, #36655)

It's an interesting puzzle. I'm still pondering what the direct consequence would be if the
pid number would be completely de-coupled from the container logic (pid numbers are unique
within the system and don't try to magically encode container membership).
The only theoretical problem I currently see is that creating new processes will show you how
many new processes were created in the whole system (not just this container), but is that
really that bad? Or is it just a part of containers not being "invisible"? Because containers
simply aren't invisible and this one little piece of more evidence that they aren't isn't
really that big a deal.
But somebody will hopefully prove me wrong and point out that this really is a big deal ;-)

Process IDs in a multi-namespace world

Posted Nov 6, 2007 20:52 UTC (Tue) by martinfick (subscriber, #4455)

One problem with this is that you can't have the 'root-parent' for each namespace with a PID
of 1 then.  This could be faked so that this process has a unique global id but appears to
system tools (other processes) as PID 1.  I don't think it needs to think of itself as PID 1,
does it?

I'm not sure if there are other problems.

Process IDs in a multi-namespace world

Posted Nov 6, 2007 21:36 UTC (Tue) by iq-0 (subscriber, #36655)

Good point, I had overlooked that one. There are some programs that check for pid 1 (I thought
bash did that, and init too). From a pure technical standpoint it shouldn't matter too much
but I can imagine tons of code checking if their parent is 1 (daemonize checks, e.g.)

Process IDs in a multi-namespace world

Posted Nov 8, 2007 18:43 UTC (Thu) by brouhaha (subscriber, #1698)

I can't think of any good reason for any program (other than perhaps init itself and telinit)
to care what PID init has, and even those could be changed to use a better mechanism.

In particular, whether a program runs as a daemon or not should definitely NOT be determined
by the PPID.

The PID should be viewed as an entirely opaque data type, and shouldn't need namespaces.

Process IDs in a multi-namespace world

Posted Nov 7, 2007 18:27 UTC (Wed) by mrjk (subscriber, #48482)

A global pid is context information that probably shouldn't be shared across "context boxes".
You could possibly figure out what pid a "targeted" process is by knowing about when it 
started and what processes started before and after. This would be useful in an attack 
breaking confinement. I know this is all theoretical, and may not even be possible, but if 
you can design it out you don't have to seriously think about it. 

Why should processes in one box have any knowledge of the processes running in another 
that don't explicitly announce themselves? If you are relying on this then you are really 
in the same context (process namespace) anyway. It doesn't matter whether the containers
are visible or not; this is one of the points in having containers with namespaces in
the first place, I would think.

Process IDs in a multi-namespace world

Posted Nov 7, 2007 6:18 UTC (Wed) by flewellyn (subscriber, #5047)

As usual, Linus manages to cut through a whole Gordian knot of confusion over how to solve a problem by asking "Why do we even care?" In this case, why do they even care about sharing userspace resources like pids, futexes, filesystem mounts, and the like between containers, when the whole point of containers is that each container appears to the contained processes to be its own separate system?

I think he's right. Disallow this sharing and treat each container as a completely separate userspace, which means each one has its own set of every resource from the userspace point of view. Let the kernel use namespaces in-kernel, and take care of the translating; if containers want to communicate with each other, we have well-defined means of doing that, namely TCP/IP sockets, network file systems, distributed systems, and the like. Linux could speed things up a bit by using in-kernel "zero-copy" communication between containers, so that TCP/IP sockets between containers would be as fast as Unix domain sockets, but userspace should not have to care or even know about it.

Process IDs in a multi-namespace world

Posted Nov 7, 2007 18:23 UTC (Wed) by samroberts (subscriber, #46749)

Last week's container article and this one talk about what pid 
namespaces are, but don't say why. It's not obvious; why is this even 
being discussed? What purpose does it have? I can understand wanting to 
give a particular process a view of the filesystem namespace, but a 
custom view of the pid space???


What it's for

Posted Nov 7, 2007 18:30 UTC (Wed) by corbet (editor, #1)

The idea behind containers is to give the contained processes the illusion of having the system to themselves. It's a security and isolation thing; in a complete container implementation it should be possible to give root privileges to a contained process and not have problems outside of the container. That clearly would not be the case if contained processes could see (and operate upon) processes running elsewhere in the system.

What it's for

Posted Nov 7, 2007 21:59 UTC (Wed) by samroberts (subscriber, #46749)

OK, that could be useful, maybe.

But don't the many flavors of LSM we've seen endlessly discussed solve 
the problem of what processes can do, and to whom?

Using containers to associate processes so that they can be managed as a 
group (scheduling priority, permissions, etc.) makes sense to me, but 
doesn't seem to need pid hiding.

Just making processes invisible to each other by pid seems a bit fishy as 
a security mechanism. It reminds me of using chroot for security, which 
seems to be in disrepute:

http://kerneltrap.org/Linux/Abusing_chroot

Or is it more just lightweight virtualization?


What it's for

Posted Nov 8, 2007 0:45 UTC (Thu) by i3839 (guest, #31386)

There are quite a lot of system calls taking a pid as argument, so isolating processes' pids has
the effect of containing those calls. To name a couple of important ones: ptrace(2) and kill(2).

Process IDs in a multi-namespace world

Posted Nov 8, 2007 3:44 UTC (Thu) by Gollum (subscriber, #25237)

If the processes are isolated in a container, it makes it possible at some point in the future
to migrate the entire container to different hardware (assuming that other resources like
filesystems are still reachable).

Process IDs in a multi-namespace world

Posted Nov 8, 2007 23:42 UTC (Thu) by giraffedata (subscriber, #1954)

"If the processes are isolated in a container, it makes it possible at some point in the future to migrate the entire container"

The PID problem fades into insignificance compared to the difficulty of migrating all the other state of a container - all the state in the kernel that uses the global kernel address space, such as inodes, plus the state that lives outside Linux, such as TCP connections and SCSI tasks.

I would wait until those problems are solved before complicating the PID namespace in the name of migration.

Process IDs in a multi-namespace world

Posted Nov 11, 2007 22:37 UTC (Sun) by kolyshkin (subscriber, #34342)

"The PID problem fades into insignificance compared to the difficulty of migrating all the other state of a container - all the state in the kernel that uses the global kernel address space, such as inodes, plus the state that lives outside Linux, such as TCP connections and SCSI tasks.

I would wait until those problems are solved before complicating the PID namespace in the name of migration."

I guess you might want to take a look at OpenVZ (and if you want to see the actual kernel code it's under kernel/cpt/ in the source tree, for example, here).

And OpenVZ is not the only available implementation of container migration in Linux -- two others I know of are Meiosys Metacluster and Zap (both are closed-source, unfortunately, although Zap may become open source; also they tend to concentrate on migration while OpenVZ sees it as just another feature of containers).

Process IDs in a multi-namespace world

Posted Nov 8, 2007 13:43 UTC (Thu) by davecb (subscriber, #1574)

One might do a literature survey* and see what,
for example, the Solarii did to address this
problem when creating zones.

--dave
[* Computer science students are famously
   reluctant to do literature surveys (:-)]

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds