Containers and PID virtualization

The folks at IBM would like to add a "container" capability to the Linux kernel. Containers are a way of walling a group of processes off from the rest of the system; a process within a container will only see its fellow inmate processes and whatever resources are made accessible to that container. This feature has some obvious security-related applications. IBM's plans, evidently, also include the ability to pack up a container and move it to another physical host without disrupting the processes trapped inside.

The patches which have been circulating so far fall short of the final plan, but they already disturb enough code to have attracted some skeptical criticism. In particular, the 34-part PID virtualization patch creates a simple container type, and implements a separate process ID space within containers. But, as we'll see, doing even that much involves some significant kernel changes.

The containers themselves are fairly simple. The patches create a virtual file called /proc/container. If a process writes a string to that file, a new container is created for that process, using the string as its name. The namespace is global, so every container on the system must have a unique name. Any child processes created by the newly-contained process will also be trapped within the container, with no way out.
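
For illustration, here is a minimal sketch of how that interface might be used from user space. The /proc/container file and its write-a-name behavior come from the description above; the error handling and the container name are assumptions made for the example.

    /*
     * Sketch: place the current process (and all of its future children)
     * into a new container by writing a name to /proc/container.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *name = "webfarm";  /* names are global, so must be unique */
        int fd = open("/proc/container", O_WRONLY);

        if (fd < 0) {
            perror("open /proc/container");
            return 1;
        }
        if (write(fd, name, strlen(name)) < 0) {
            perror("write container name");
            return 1;
        }
        close(fd);

        /* From this point on, this process and any children it forks
         * can only see other processes in the "webfarm" container. */
        return 0;
    }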

At this point, being inside a container does not affect a process's life that much. The one thing that does change, however, is that each container has its own process ID (PID) space. Processes within the container can only see others in the same container. There is nothing particularly controversial about that behavior, but the developers have another objective in mind: they want to be able to change the PIDs of contained processes without the processes themselves noticing. In particular, they would like to be able to migrate a container to a different system, which will certainly assign new PIDs to every process within the container. Code written for Unix-like systems does not normally expect its PID to change over time, however; so switching PIDs underneath a process could lead to all kinds of strange behavior. To avoid this problem, the plan is that PIDs remain constant within the container, even if those PIDs change in the real world.

Implementing constant PIDs (from a viewpoint inside the container) is not a straightforward task; it involves adding a whole new virtualization layer inside the kernel. There are two types of PIDs now, "real" PIDs and the virtual PIDs used by contained processes. Any place in the kernel which deals with PID values must become aware of which type of PID it is using, and convert to the other type when necessary. So, as a general rule, any code which exchanges PIDs with user space must use the virtual variety, while PIDs handled within the kernel are real.

The PID logic is complicated by a few little details, like: what happens when containers are nested? A process living within a container has a real PID and a virtual PID associated with the container. If that process creates a container of its own, it will acquire yet another PID associated with the new container. So it is not possible to simply convert a real PID to a virtual PID; such questions require a "context" so that the kernel knows which virtual PID is wanted.

The result of all this is that PID handling within the kernel changes significantly. Code which used to get the current process's PID with current->pid must now use tsk_pid(current) for the real PID, or tsk_vpid(current) for the virtual PID - and it must know which one it wants. In situations where more than one virtual PID might be appropriate, tsk_vpid_ctx() must be used to supply the context. Much of the patch set is concerned simply with making these conversions; for good measure, it also renames the pid field of struct task_struct to catch any code still trying to access it directly.
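
The accessor names above come straight from the patch description; the fragment below is only a sketch of the kind of conversion involved, with a simplified sys_getpid() chosen as a convenient example rather than copied from the actual patches, and the exact argument list of tsk_vpid_ctx() is a guess.

    /* Old style: the PID was read straight out of the task structure
     * and handed to user space:
     *
     *      return current->pid;
     *
     * New style: anything crossing into user space must be the virtual
     * PID, so a contained process never sees its real one. */
    asmlinkage long sys_getpid(void)
    {
        return tsk_vpid(current);
    }

    /* Code which stays inside the kernel uses real PIDs instead:
     *
     *      pid_t real = tsk_pid(task);
     *
     * and where nested containers make more than one virtual PID
     * possible, the caller has to say which container's view it wants:
     *
     *      pid_t vpid = tsk_vpid_ctx(task, container);
     */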

Behind all of this is a concept called "pidspaces." The patch carves up the global PID space by taking the upper 9 bits of the 32-bit PID value and putting the pidspace number there. A virtual PID as seen within a container is turned into a real kernel PID by stuffing the pidspace number into those upper bits. Since the contained processes only see virtual PIDs, they never see the pidspace number, and they will not notice if that number changes.
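
To make the arithmetic concrete, here is one way that split could look. The nine-bit figure comes from the patch as described above; the macro and helper names are invented for this sketch and are not the patch's own.

    #define PIDSPACE_BITS   9
    #define PIDSPACE_SHIFT  (32 - PIDSPACE_BITS)        /* 23 */
    #define VPID_MASK       ((1U << PIDSPACE_SHIFT) - 1)

    /* Combine a pidspace number with the virtual PID seen inside the
     * container to form the real, kernel-wide PID... */
    static inline unsigned int mk_real_pid(unsigned int pidspace,
                                           unsigned int vpid)
    {
        return (pidspace << PIDSPACE_SHIFT) | (vpid & VPID_MASK);
    }

    /* ...and take it apart again. */
    static inline unsigned int pid_to_pidspace(unsigned int pid)
    {
        return pid >> PIDSPACE_SHIFT;
    }

    static inline unsigned int pid_to_vpid(unsigned int pid)
    {
        return pid & VPID_MASK;
    }

Migrating a container then amounts to giving it a new pidspace number on the destination host; the low bits, which are all a contained process ever sees, stay the same.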

All of this code seems to work, but there is a certain amount of opposition to merging it. As Alan Cox put it:

This is an obscure, weird piece of functionality for some special case usages most of which are going to be eliminated by Xen. I don't see the kernel side justification for it at all.

The developers answer that the ability to checkpoint and restart process trees, possibly moving them in between, will be highly useful. Some other virtualization projects also require this capability - not everybody wants to use Xen. So the pressure for PID virtualization probably won't just go away.

What might happen is that the hiding of current->pid might be taken out, greatly reducing the size of the patch. Another idea which has been floated is to eliminate, to the greatest degree possible, the use of PIDs within the kernel. Almost any in-kernel use of a PID can be replaced with a direct pointer to the task structure. If a PID eventually is reduced to little more than a process-identifying cookie used for communication with user space, it will be easier to virtualize without complicating large amounts of kernel code.
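
As a rough sketch of that idea, consider a subsystem which currently remembers a requesting process by its PID. The structure and helper functions below are hypothetical; get_task_struct() and put_task_struct() are the kernel's usual way of holding a reference to a task.

    struct pending_work {
        /* Old approach: remember the owner by PID and look it up again
         * later; every such lookup would have to be virtualization-aware.
         *
         *      pid_t owner_pid;
         */

        /* New approach: hold a reference to the task itself.  No PID is
         * stored, so there is nothing to translate if the container (and
         * its PIDs) migrates. */
        struct task_struct *owner;
    };

    static void remember_owner(struct pending_work *w)
    {
        get_task_struct(current);       /* take a reference */
        w->owner = current;
    }

    static void forget_owner(struct pending_work *w)
    {
        put_task_struct(w->owner);      /* and drop it again later */
        w->owner = NULL;
    }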


Containers and PID virtualization

Posted Jan 19, 2006 10:03 UTC (Thu) by hingo (guest, #14792) [Link]

Just a curious, uninformed question: what does this really offer that cannot be done with UML or Xen, aside from possible performance enhancements and such? I mean, I do understand how it's different technically, but if it's so much trouble and so invasive elsewhere, what's the real benefit?

Containers and PID virtualization

Posted Jan 19, 2006 11:26 UTC (Thu) by eru (subscriber, #2753) [Link]

> Just a curious uninformed question, what does this really offer that cannot be done with UML or Xen?

Is UML even developed any more these days? Looking at its home page, it seems to talk about 2.4 kernels, like here: http://user-mode-linux.sourceforge.net/dl-sf.html

Anyway, getting these isolated containers without having many copies of the kernel is probably a big enough performance enhancement to be worth the trouble.

Containers and PID virtualization

Posted Jan 19, 2006 14:49 UTC (Thu) by Segora (subscriber, #8209) [Link]

> Is UML even developed any more these days?

UML is included in 2.6 kernels and seems to be actively developed; see also the UML diary[1].

Segora

[1] http://user-mode-linux.sourceforge.net/diary.html

Containers and PID virtualization

Posted Jan 22, 2006 8:38 UTC (Sun) by xorbe (subscriber, #3165) [Link]

Just look further down that page; it says to look at http://www.user-mode-linux.org/~blaisorblade/ for 2.6.9+ updates (which it has through 2.6.15).

Containers and PID virtualization

Posted Jan 19, 2006 14:05 UTC (Thu) by smoogen (subscriber, #97) [Link]

The future part of this that would seem to be useful would be the cluster aspects.

Machine A detects a system fault and is going to shut down.

A tells the cluster manager(s) it has these containers that need to be moved.

The cluster manager finds other machines (C and D) and tells Machine A.

Machines A and C/D do the needed process handover and the containers start running on C/D.

Machine A shuts down.

[Of course this may be possible with Xen, but it would seem to be a heavier solution, given that C/D would then need to instantiate and start a new sub-kernel and every other sub-process (nfs, etc.) that was running under the old Xen.]

Containers and PID virtualization

Posted Jan 19, 2006 17:03 UTC (Thu) by swiftone (guest, #17420) [Link]

> what does this really offer that cannot be done with UML or Xen?

I'm hardly an expert, but I'll post my understanding here so that if I'm wrong someone can point it out to me :)

Xen, as far as I know, runs a virtual machine within the kernel (okay, it's not a virtual machine in the VMware sense, but that's the concept).

A container is a collection of processes that are aware of each other. Basically, Xen is the kernel, and the container is the processes RUNNING on the kernel.

If you have a series of long-running processes that can grow in memory/CPU usage, conventional load-balancing techniques won't help you at all. This would let you move some of those processes to other machines. (or perhaps to another CPU on the same machine...LWN didn't mention anything about threads). Heck, this could give you a "suspend-to-disk" method that would let you take your work from machine to machine. Imagine carrying a USB drive with your work container on it, and being able to load up that container on whatever linux system you're at. (although Xen probably can/will do something similar to that, except that it'd have to carry your whole OS with it)

Containers and PID virtualization

Posted Jan 24, 2006 21:44 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

If I make a few assumptions about process migration:

1. A container cannot be divided--all of its processes move, or none.

2. Moving a container is transparent--the processes have the same open files and network sockets when they arrive at their destination

then there's very little difference between a VM and a migratable process container. A plain process container isn't sufficient--you'd need to keep file descriptors, memory maps, and a bunch of other state to make migration work. VMs have all that, but need nothing else since they can defer the rest to their host kernel.

I suspect that the cost/capability curves of "light VM" and "heavy process container" will intersect each other at some point. The nice thing about VMs is that they start off isolated from the kernel and gradually intrude into the kernel, while process containers start out intrusive and gradually become less intrusive.

Containers and PID virtualization

Posted Feb 1, 2006 17:12 UTC (Wed) by dev (guest, #34359) [Link]

This was implemented in OpenVZ half a year ago already.
The feature is mostly important for VPS checkpointing/restoring/migration.
And not everyone wants to use virtual machines just to have the ability to move applications across machines.
