By Jonathan Corbet
October 29, 2007
"Containers" are a form of lightweight virtualization as represented by
projects like
OpenVZ. While
virtualization creates a new virtual machine upon which the guest system
runs, containers implementations work by making walls around groups of
processes. The result is that, while virtualized guests each run their own
kernel (and can run different operating systems than the host),
containerized systems all run on the host's kernel. So containers lack
some of the flexibility of full virtualization, but they tend to be quite a
bit more efficient.
As of 2.6.23, virtualization is quite well supported on Linux, at least for
the x86 architecture. Containers, though, lag a little behind. It turns
out that, in many ways, containers are harder to implement than
virtualization is. A container implementation must wrap a namespace layer
around every global resource found in the kernel, and there are a lot of
these resources: processes, filesystems, devices, firewall rules, even the
system time. Finding ways to wrap all of these resources in a way which
satisfies the needs of the various container projects out there, and which
also does not irritate kernel developers who may have no interest in
containers, has been a bit of a challenge.
Full container support will get quite a bit closer once the 2.6.24 kernel
is released. The merger of a number of important patches in this
development cycle fills in some important pieces, though a certain amount
of work remains to be done.
Once upon a time, there was a patch set called process containers. The
containers subsystem allows an administrator (or administrative daemon) to
group processes into hierarchies of containers; each hierarchy is managed
by one or more "subsystems." The original "containers" name was considered
to be too generic - this code is an important part of a container solution,
but it's far from the whole thing. So containers have now been renamed
"control groups" (or "cgroups") and merged for 2.6.24.
Control groups need not be used for containers; for example, the group
scheduling feature (also merged for 2.6.24) uses control groups to set the
scheduling boundaries. But it makes sense to pair control groups with
namespace management and resource control in general, creating a framework
for a containers implementation.
The management of control groups is straightforward. The system
administrator starts by mounting a special cgroup filesystem,
associating the subsystems of interest with the filesystem at mount time.
There can be more than one such filesystem mounted, as long as each
subsystem is associated with at most one of them. So the administrator could
create one cgroup filesystem to
manage scheduling and a completely different one to associate processes
with namespaces.
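By way of illustration, that first step might look like the following
minimal sketch; the /dev/cgroup mount point and the choice of the "cpu"
subsystem are arbitrary here:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* The mount options name the subsystems to bind to this
           hierarchy; with no options, all available subsystems are
           bound.  The /dev/cgroup directory must already exist. */
        if (mount("none", "/dev/cgroup", "cgroup", 0, "cpu") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }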
Once the filesystem is mounted, specific groups are created by making
directories within the cgroup filesystem. Putting a process into a control
group is a simple matter of writing its process ID into the tasks
virtual file in the cgroup directory. Processes can be moved between
control groups at will.
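Continuing the sketch, creating a group and moving the current process
into it requires nothing more than ordinary filesystem operations (the
group name is, again, arbitrary):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int main(void)
    {
        FILE *tasks;

        /* Each directory in the cgroup filesystem is a control group. */
        if (mkdir("/dev/cgroup/mygroup", 0755) < 0) {
            perror("mkdir");
            return 1;
        }
        /* Writing a process ID to the tasks file moves that process
           into the group. */
        tasks = fopen("/dev/cgroup/mygroup/tasks", "w");
        if (!tasks) {
            perror("fopen");
            return 1;
        }
        fprintf(tasks, "%d\n", getpid());
        fclose(tasks);
        return 0;
    }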
The concept of a process ID has gotten more complicated, though, since the
PID namespace code was also merged. A PID namespace is a view of the
processes on the system. On a "normal" Linux system, there is only the
global PID namespace, and all processes can be found there. On a system
with PID namespaces, different processes can have very different views of
what is running on the system. When a new PID namespace is created, the
only visible process is the one which created that namespace; it becomes,
in essence, the init process for that namespace. Any descendants
of that process will be visible in the new namespace, but they will never
be able to see anything running outside of that namespace.
Virtualizing process IDs in this way complicates a number of things. A
process which creates a namespace remains visible to its parent in the old
namespace - and it may not have the same process ID in both namespaces. So
processes can have more than one ID, and the same process ID may be found
referring to different processes in different namespaces. For example, it
is fairly common in containers implementations to have the per-namespace
init process have ID 1 in its namespace.
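This dual-identity behavior can be observed with a small test program;
the sketch below assumes a 2.6.24 kernel with PID namespaces enabled and
the privilege needed to create one:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #ifndef CLONE_NEWPID
    #define CLONE_NEWPID 0x20000000
    #endif

    static char stack[16384];

    static int child(void *arg)
    {
        /* Inside the new namespace, this should print 1. */
        printf("child sees itself as PID %d\n", getpid());
        return 0;
    }

    int main(void)
    {
        /* The stack grows down on most architectures, so pass the
           top of the stack area to clone(). */
        pid_t pid = clone(child, stack + sizeof(stack),
                          CLONE_NEWPID | SIGCHLD, NULL);
        if (pid < 0) {
            perror("clone");
            exit(1);
        }
        /* In the parent's namespace, the same child has an ordinary,
           global process ID. */
        printf("parent sees child as PID %d\n", pid);
        waitpid(pid, NULL, 0);
        return 0;
    }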
What all of this means is that process IDs only make sense when placed into
a specific context. That, in turn, sets a trap for any kernel code which
works with process IDs; any such code must take care to maintain the
association between a process ID and the namespace in which it is defined.
To make life easier (and safer), the containers developers have been
working for some time to eliminate (to the greatest extent possible) use of
process IDs within the kernel itself. Kernel code should use
task_struct pointers (which are always unambiguous) to refer to
specific processes; a process ID, instead, has become a cookie for
communication with user space, and not much more.
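In kernel terms, the recommended pattern looks something like this
sketch, which uses the find_task_by_vpid() helper (part of the namespace
work) to resolve a user-supplied ID within the calling process's own
namespace:

    #include <linux/sched.h>
    #include <linux/rcupdate.h>

    /* Resolve a PID, as seen by current, into a task_struct pointer,
       taking a reference; the caller must use put_task_struct() when
       done.  Past this point, kernel code should pass around the
       pointer, not the numeric ID. */
    static struct task_struct *task_from_user_pid(pid_t vpid)
    {
        struct task_struct *task;

        rcu_read_lock();
        task = find_task_by_vpid(vpid);
        if (task)
            get_task_struct(task);
        rcu_read_unlock();
        return task;
    }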
This job of cleaning up PID use is not complete at this point. In fact,
the process ID namespace work has a great many loose ends in general, to
the point that some of the developers do not think that it is really ready
to be used yet. In particular, there is concern that some of the
management APIs could change, breaking code which is written for the 2.6.24
API. Adding new user-space APIs is always problematic in this regard:
getting an API right is hard, and getting it right the first time is even
harder. But user-space APIs are supposed to stay constant once they are
merged; there is no provision for any sort of stabilization period where
things can change. For PID namespaces, what's likely to happen is that the
feature will be marked "experimental" in the hope that nobody will use it
in its 2.6.24 form.
Also merged for 2.6.24 is the network namespace patch. The idea behind
this code is to allow processes within each namespace to have an entirely
different view of the network stack. That includes the available
interfaces, routing tables, firewall rules, and so on. These patches are
in a relatively early state; they add the infrastructure to track different
namespaces, but not a whole lot more. Quite a few internal networking APIs
have been changed to take a namespace parameter, but, in most cases, the
code simply fails any operation which is attempted in anything other than
the default, root namespace. There is a new "veth" virtual network device
which can be used to create tunnels between namespaces.
The PID and network namespace patches have added a couple of lines to
<linux/sched.h>:
#define CLONE_NEWPID 0x20000000 /* New pid namespace */
#define CLONE_NEWNET 0x40000000 /* New network namespace */
These entries highlight an interesting problem: the CLONE_ flags
are passed to the kernel as a 32-bit value. As of this writing, there are
only two bits left for new flags. So the containers developers are going
to run out of flags; how they plan to deal with that problem is not clear
at this point.
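In the meantime, the two new flags can be combined in a single clone()
call; this sketch, which assumes a kernel with both (experimental)
options enabled, starts a shell with fresh PID and network namespaces:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #ifndef CLONE_NEWPID
    #define CLONE_NEWPID 0x20000000
    #endif
    #ifndef CLONE_NEWNET
    #define CLONE_NEWNET 0x40000000
    #endif

    static char stack[16384];

    static int child(void *arg)
    {
        /* Only the loopback interface exists in the new network
           namespace; most networking operations outside the root
           namespace simply fail at this stage of development. */
        execlp("sh", "sh", (char *)NULL);
        return 1;
    }

    int main(void)
    {
        if (clone(child, stack + sizeof(stack),
                  CLONE_NEWPID | CLONE_NEWNET | SIGCHLD, NULL) < 0) {
            perror("clone");
            return 1;
        }
        wait(NULL);
        return 0;
    }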
These developers are also working on the management of containers, and, in
particular, how to move between them. One of the things likely to come out
of that work in the near future is a
proposal for a new system call:
int hijack(unsigned long clone_flags, int which, int id);
This system call behaves much like clone() in that it creates a
new process, but with an interesting twist. The new process created by
clone() takes all of its resources - including namespaces - from
the calling process; these resources will be copied or shared as directed
by the clone_flags argument. A call to hijack(),
instead, obtains all of those resources from the process whose ID is given
in the id parameter. So it is possible to write a little program
which forks via a hijack() call and runs a shell in the resulting
child process; that shell will be running with all of the namespaces of the
hijacked process.
To make life easier for people working with containers, the which
parameter was added in recent versions of this API. If which is
passed as 1, the call treats id as a process ID, as described
above. A value of 2, instead, says that id is actually an open
file descriptor for the tasks file in a cgroup control directory.
In this case, hijack() finds the lead process for that control
group and obtains resources from there.
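What usage might look like can be sketched as follows; since
hijack() remains a proposal, there is no system call number or C library
wrapper, so the declaration below merely stands in for the eventual
interface (fork()-like return semantics are assumed):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Hypothetical declaration; no real wrapper exists yet. */
    extern int hijack(unsigned long clone_flags, int which, int id);

    int main(int argc, char *argv[])
    {
        int pid;

        if (argc != 2) {
            fprintf(stderr, "usage: %s pid\n", argv[0]);
            exit(1);
        }
        /* which = 1: treat the last argument as a process ID. */
        pid = hijack(0, 1, atoi(argv[1]));
        if (pid < 0) {
            perror("hijack");
            exit(1);
        }
        if (pid == 0) {
            /* Child: running with the hijacked process's namespaces. */
            execlp("sh", "sh", (char *)NULL);
            exit(1);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }

Passing 2 for which, along with an open file descriptor for a group's
tasks file, would instead drop the shell into the namespaces of that
group's lead process.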
This system call is new, and it has not seen a whole lot of review outside
of the containers mailing list. So chances are that some changes will be
requested once it becomes more widely visible; among other things, a name
change might be called for. In general, there is a lot yet to be done with
the containers code, but progress is visibly being made. There will come a
point where the mainline kernel comes equipped with complete container
capabilities.