Notes from a container
As of 2.6.23, virtualization is quite well supported on Linux, at least for the x86 architecture. Containers, by contrast, lag somewhat behind. It turns out that, in many ways, containers are harder to implement than virtualization is. A container implementation must wrap a namespace layer around every global resource found in the kernel, and there are a lot of these resources: processes, filesystems, devices, firewall rules, even the system time. Finding ways to wrap all of these resources in a manner which satisfies the needs of the various container projects, and which does not irritate kernel developers who may have no interest in containers, has been a bit of a challenge.
Full container support will get quite a bit closer once the 2.6.24 kernel is released. The merger of a number of significant patches in this development cycle fills in some important pieces, though a fair amount of work remains to be done.
Once upon a time, there was a patch set called process containers. The containers subsystem allows an administrator (or administrative daemon) to group processes into hierarchies of containers; each hierarchy is managed by one or more "subsystems." The original "containers" name was considered to be too generic - this code is an important part of a container solution, but it's far from the whole thing. So containers have now been renamed "control groups" (or "cgroups") and merged for 2.6.24.
Control groups need not be used for containers; for example, the group scheduling feature (also merged for 2.6.24) uses control groups to set the scheduling boundaries. But it makes sense to pair control groups with namespace handling and resource management in general to create a framework for a containers implementation.
The management of control groups is straightforward. The system administrator starts by mounting a special cgroup filesystem, associating the subsystems of interest with it at mount time. More than one such filesystem can be mounted, as long as each subsystem is attached to at most one of them. So the administrator could create one cgroup hierarchy to manage scheduling and a completely different one to associate processes with namespaces.
Once the filesystem is mounted, specific groups are created by making directories within the cgroup filesystem. Putting a process into a control group is a simple matter of writing its process ID into the tasks virtual file in the cgroup directory. Processes can be moved between control groups at will.
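The procedure described above can be sketched as a short session. The mount point, group name, and the "cpu" subsystem are illustrative choices, not requirements, and the commands require root:

```shell
# Mount a cgroup hierarchy with the cpu subsystem attached to it
mkdir -p /dev/cgroup
mount -t cgroup -o cpu none /dev/cgroup

# Create a control group by making a directory
mkdir /dev/cgroup/browsers

# Move the current shell into the group by writing its PID
echo $$ > /dev/cgroup/browsers/tasks

# List the processes now contained in the group
cat /dev/cgroup/browsers/tasks
```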
The concept of a process ID has gotten more complicated, though, since the PID namespace code was also merged. A PID namespace is a view of the processes on the system. On a "normal" Linux system, there is only the global PID namespace, and all processes can be found there. On a system with PID namespaces, different processes can have very different views of what is running on the system. When a new PID namespace is created, the only visible process is the one which created that namespace; it becomes, in essence, the init process for that namespace. Any descendants of that process will be visible in the new namespace, but they will never be able to see anything running outside of that namespace.
Virtualizing process IDs in this way complicates a number of things. A process which creates a namespace remains visible to its parent in the old namespace - and it may not have the same process ID in both namespaces. So processes can have more than one ID, and the same process ID may refer to different processes in different namespaces. For example, it is fairly common in container implementations for the per-namespace init process to have ID 1 in its namespace.
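The two-IDs-for-one-process situation can be seen from user space with a small sketch, assuming the 2.6.24 interface works as described; creating a PID namespace requires CAP_SYS_ADMIN, so this is illustrative rather than something to run casually:

```
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_NEWPID
#define CLONE_NEWPID 0x20000000     /* from the 2.6.24 <linux/sched.h> */
#endif

static char stack[16 * 1024];

static int child(void *arg)
{
        /* Inside the new namespace, this process is init: getpid() is 1 */
        printf("child sees itself as pid %d\n", (int)getpid());
        return 0;
}

int main(void)
{
        /* On x86 the stack grows down, so pass the top of the buffer */
        pid_t pid = clone(child, stack + sizeof(stack),
                          CLONE_NEWPID | SIGCHLD, NULL);
        if (pid < 0) {
                perror("clone");    /* typically EPERM without CAP_SYS_ADMIN */
                return 1;
        }
        /* The parent sees the same child under a different, global pid */
        printf("parent sees the child as pid %d\n", (int)pid);
        waitpid(pid, NULL, 0);
        return 0;
}
```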
What all of this means is that process IDs only make sense when placed into a specific context. That, in turn, sets a trap for any kernel code which works with process IDs; any such code must take care to maintain the association between a process ID and the namespace in which it is defined. To make life easier (and safer), the containers developers have been working for some time to eliminate (to the greatest extent possible) use of process IDs within the kernel itself. Kernel code should use task_struct pointers (which are always unambiguous) to refer to specific processes; a process ID, instead, has become a cookie for communication with user space, and not much more.
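In kernel terms, the preferred pattern looks roughly like this. This is a kernel-internal sketch, not user-space code; the helper names (find_get_pid(), get_pid_task(), put_pid()) are those used by kernels of this era:

```
/* Kernel-internal sketch: turn a user-supplied pid number into an
 * unambiguous reference as early as possible, then forget the number. */
#include <linux/pid.h>
#include <linux/sched.h>

static struct task_struct *grab_task(pid_t nr)
{
        struct pid *pid;
        struct task_struct *task;

        pid = find_get_pid(nr);   /* resolved in the caller's namespace */
        if (!pid)
                return NULL;
        task = get_pid_task(pid, PIDTYPE_PID);  /* takes its own reference */
        put_pid(pid);
        return task;              /* caller must put_task_struct() when done */
}
```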
This job of cleaning up PID use is not complete at this point. In fact, the process ID namespace work has a great many loose ends in general, to the point that some of the developers do not think that it is really ready to be used yet. In particular, there is concern that some of the management APIs could change, breaking code which is written for the 2.6.24 API. Adding new user-space APIs is always problematic in this regard: getting an API right is hard, and getting it right the first time is even harder. But user-space APIs are supposed to stay constant once they are merged; there is no provision for any sort of stabilization period where things can change. For PID namespaces, what's likely to happen is that the feature will be marked "experimental" in the hope that nobody will use it in its 2.6.24 form.
Also merged for 2.6.24 is the network namespace patch. The idea behind this code is to allow processes within each namespace to have an entirely different view of the network stack. That includes the available interfaces, routing tables, firewall rules, and so on. These patches are in a relatively early state; they add the infrastructure to track different namespaces, but not a whole lot more. Quite a few internal networking APIs have been changed to take a namespace parameter, but, in most cases, the code simply fails any operation which is attempted in anything other than the default, root namespace. There is a new "veth" virtual network device which can be used to create tunnels between namespaces.
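The transitional pattern in the networking code can be sketched as follows; the function here is hypothetical, though init_net is the real name of the root namespace object:

```
/* Sketch: an internal API gains a "struct net *" argument, but
 * anything outside the root namespace is refused until the
 * operation in question has actually been virtualized. */
static int example_netns_op(struct net *net)
{
        if (net != &init_net)
                return -EOPNOTSUPP;     /* root namespace only, for now */
        /* ... the existing, namespace-unaware implementation ... */
        return 0;
}
```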
The PID and network namespace patches have added a couple of lines to <linux/sched.h>:
    #define CLONE_NEWPID        0x20000000      /* New pid namespace */
    #define CLONE_NEWNET        0x40000000      /* New network namespace */
These entries highlight an interesting problem: the CLONE_ flags are passed to the kernel as a 32-bit value. As of this writing, there are only two bits left for new flags. So the containers developers are going to run out of flags; how they plan to deal with that problem is not clear at this point.
These developers are also working on the management of containers, and, in particular, how to move between them. One of the things likely to come out of that work in the near future is a proposal for a new system call:
int hijack(unsigned long clone_flags, int which, int id);
This system call behaves much like clone() in that it creates a new process, but with an interesting twist. The new process created by clone() takes all of its resources - including namespaces - from the calling process; these resources will be copied or shared as directed by the clone_flags argument. A call to hijack(), instead, obtains all of those resources from the process whose ID is given in the id parameter. So it is possible to write a little program which forks via a hijack() call and runs a shell in the resulting child process; that shell will be running with all of the namespaces of the hijacked process.
To make life easier for people working with containers, the which parameter was added in recent versions of this API. If which is passed as 1, the call treats id as a process ID, as described above. A value of 2, instead, says that id is actually an open file descriptor for the tasks file in a cgroup control directory. In this case, hijack() finds the lead process for that control group and obtains resources from there.
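Since hijack() exists only as a proposal on the containers list, there is no syscall number or C library wrapper to show; the following is purely a hypothetical sketch of how the two modes described above might be used, with fork-like return semantics assumed:

```
/* Hypothetical sketch - hijack() is a proposal, not a merged syscall. */
int fd = open("/dev/cgroup/mygroup/tasks", O_RDONLY);

/* which == 2: id is an open fd for a cgroup "tasks" file */
pid_t child = hijack(SIGCHLD, 2, fd);
if (child == 0) {
        /* Child: now running inside that group's namespaces */
        execl("/bin/sh", "sh", (char *)NULL);
} else if (child > 0) {
        /* Parent: still in its original namespaces */
        waitpid(child, NULL, 0);
}
```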
This system call is new, and it has not seen a whole lot of review outside
of the containers mailing list. So chances are that some changes will be
requested once it becomes more widely visible; among other things, a name
change might be called for. In general, there is a lot yet to be done with
the containers code, but progress is visibly being made. There will come a
point where the mainline kernel comes equipped with complete container
capabilities.
Not a new concept
Posted Nov 1, 2007 15:39 UTC (Thu) by mheily (subscriber, #27123)

This is a good feature that provides another layer of security. It's about time that Linux gained this functionality in the mainline kernel. The 'cgroups' concept has been successfully implemented in several other Unix systems. In FreeBSD, this type of container is called a jail and was first published in FreeBSD 4.0 almost seven years ago. In Solaris 10, it is called a zone.

Why couldn't the Linux developers use one of these terms to describe their containers? Better yet, why not import the entire jail(8) subsystem from FreeBSD? This would give Linux a proven design, along with documentation, manpages, header files, and userland tools.

The "not invented here" syndrome strikes again...

Not a new concept
Posted Nov 1, 2007 18:50 UTC (Thu) by dowdle (subscriber, #659)

SWsoft's Virtuozzo has been around for a little over six years, I think... and OpenVZ is the GPL'ed release of its kernel code and some of the userland code, documentation, and so on. Linux-VServer has been around for as long, if not longer. Much of the code being adapted into the Linux kernel as control groups (trying to get used to the new name) has been contributed by Google, IBM, OpenVZ, and others, working together to reach a consensus.

If they wanted to borrow from FreeBSD, they'd probably be using FreeBSD. My guess is that Linux and FreeBSD are different enough that adapting the existing FreeBSD jails code would take as much work as adapting any other project's code, if not more... and having everyone work together will hopefully lead to a situation where everyone is happier than they would have been if a single, canned solution had been chosen. It also gives them the opportunity to take what they've already done and improve it yet again.

Not a new concept
Posted Nov 2, 2007 1:45 UTC (Fri) by dvdeug (guest, #10998)

I seriously doubt that the code from FreeBSD would be a good fit for Linux; you almost never hear of major programs borrowing much deep code from each other. Generalizing the concept of jails as Linux has done is interesting, and it may or may not be useful. I suspect you could write an API-compatible implementation of BSD's jail in userspace pretty easily on top of this, which is really the important thing.

Not a new concept
Posted Nov 3, 2007 15:09 UTC (Sat) by TRS-80 (guest, #1804)

As well as OpenVZ, Linux-VServer has been around since 2003, and is one of the groups participating in the containers merging effort. In Debian Etch, using Linux-VServer is as simple as apt-get installing a new kernel - you can even run VServer and Xen on the one system if you so desire.

Not a new concept
Posted Nov 6, 2007 18:39 UTC (Tue) by sayler (guest, #3164)

Yep. I've used Linux-VServer for many years now (in production environments), and it has performed great. It *is* a shame that nothing made it into the mainline kernel before recent times, but my impression was that neither of the major codebases (VServer, OpenVZ) was particularly merge-worthy.

I'd also like to second the recommendation for Debian Etch's Linux-VServer integration. A few minutes of download and a reboot, and you're ready to go.

PID virtualization?
Posted Nov 2, 2007 2:16 UTC (Fri) by quotemstr (subscriber, #45331)

Why do PIDs need to be virtualized? Each process could retain its globally unique PID. From the point of view of a process in a cgroup, processes that aren't in that particular group would simply not exist. Any new process would get a free entry in the global PID list. Granted, that doesn't allow each namespace to have its own PID 1, but is that a big deal?

PID virtualization?
Posted Nov 2, 2007 6:14 UTC (Fri) by dlang (guest, #313)

One reason is that, with full container virtualization, it should be possible to pick up a container from one machine, drop it on another machine, and have everything keep running. That's the goal the container people are aiming for. It's significantly more than a BSD jail, but without the overhead of system virtualization.

PID virtualization?
Posted Nov 6, 2007 7:28 UTC (Tue) by AndrewHuo (guest, #28799)

PID virtualization can be used to avoid PID conflicts in process migration (checkpoint and restart), though the probability of a PID conflict is very low.

struct pid, struct pid, struct pid
Posted Nov 4, 2007 3:44 UTC (Sun) by ebiederm (subscriber, #35028)

If you need to store a persistent reference to a user-space process, use a struct pid *, not a task_struct reference. References can potentially last for a long time after the user-space process has exited, so holding a reference to a tiny struct pid is much cheaper memory-wise. In addition, a struct pid is a drop-in replacement for a pid_t (except for the need for reference counting). A struct pid can reference process groups, thread groups, and sessions, not just individual processes. Further, a struct pid is immune to PID wraparound, which removes a whole class of theoretical problems from the kernel.

Hijack for debugging
Posted Nov 8, 2007 11:39 UTC (Thu) by endecotp (guest, #36428)

hijack() sounds like a potentially useful debugging tool: you can make a copy of another process and look inside it while the original continues to execute.

Notes from a container
Posted May 23, 2008 12:40 UTC (Fri) by Manromen (guest, #52223)

Hi - sorry, I don't quite understand: container != control group?