For the first time in a few years, virtualization was not on the agenda at
the 2007 kernel summit. The related field of containers, however, was
deemed worth talking about. The virtualization problem has been mostly
solved, at least at the kernel level, but there is still a lot of work to
do in the containers area.
Paul Menage talked about the process containers patch, which
has recently been rebranded "control groups." The control groups API is
currently being used by the CFS scheduler, cpusets, and the memory
controller code. Work in progress includes rlimits and an interface to the
process freezer used by the suspend/resume code. Controlling the freezer
via control groups allows user space to freeze specific groups of
processes, which, in turn, is very useful when implementing checkpointing
and live migration. In particular, with control groups, it will be
possible to freeze an entire group of processes in an atomic way.
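
As a rough illustration of what freezing a group from user space might look like, here is a minimal Python sketch. It assumes the freezer is exposed as a per-group control file that accepts a state string. The file name `freezer.state` and the values `FROZEN` and `THAWED` are assumptions borrowed from the interface that later kernels adopted; nothing in the talk, as reported here, specified them. The helper name `set_frozen` is purely illustrative.

```python
from pathlib import Path


def set_frozen(group_dir, frozen):
    """Request that every task in one control group be frozen or thawed.

    group_dir is the directory representing the group in the control
    groups filesystem. The control file name and state strings are
    assumptions, not details taken from the talk.
    """
    state = "FROZEN" if frozen else "THAWED"
    # One write covers the whole group; that is what makes the operation
    # look atomic from user space.
    (Path(group_dir) / "freezer.state").write_text(state + "\n")
    return state
```

A checkpointing tool built along these lines would call `set_frozen(group, True)`, save the state of the group's processes, then call `set_frozen(group, False)`.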
Control groups have very little overhead when not in use. There is an
approximately 1% hit on the fork() and exec() calls when
control groups are being used. The control groups code is managed by way
of a virtual filesystem. This filesystem is a user-space API which must be
managed carefully; there needs to be consistency across the various
controllers which can work with control groups. To that end, parts of this
interface are being pushed into generic code when possible. One other
issue is the use of control groups within containers. It would be nice if
a containerized system could manage control groups for processes within the
container, but that is not yet implemented.
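
Since the interface is a filesystem, managing groups comes down to ordinary directory and file operations. The sketch below shows the general idea in Python, under assumptions: that a group is created by making a directory, and that a process is moved into a group by writing its PID to a file named `tasks`. Both conventions come from the control groups interface as it later shipped, not from the talk itself. The function names are hypothetical, and the root directory is a parameter so the code can be tried against a scratch directory rather than a real mount.

```python
from pathlib import Path


def create_group(root, name):
    """Create (or reuse) a control group as a subdirectory of root."""
    group = Path(root) / name
    group.mkdir(exist_ok=True)
    return group


def attach(group, pid):
    """Move one process into the group by writing its PID to 'tasks'.

    The file name 'tasks' is an assumption. Append mode lets a plain
    scratch directory accumulate PIDs the way the real file would.
    """
    with open(Path(group) / "tasks", "a") as f:
        f.write(f"{pid}\n")


def members(group):
    """Return the PIDs currently listed in the group's 'tasks' file."""
    path = Path(group) / "tasks"
    if not path.exists():
        return []
    return [int(field) for field in path.read_text().split()]
```

Because every controller hangs its own files off the same directories, any inconsistency between controllers is visible to every tool written this way, which is one reason to push the common parts into generic code.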
Eric Biederman talked about the container situation in general.
Implementing containers requires the creation of container-specific
namespaces for all of the global resources found on the system. Namespaces
for UTS data (the host and domain names), SYSV interprocess communication
primitives, and users are in the
mainline now. There is a process ID namespace patch in -mm which is
getting close. Network namespaces are in development now. Resources which
still need to have namespaces created for them include system time
(important to keep time from moving backward when containers are migrated
from one system to another) and devices.
Each namespace which is created requires an option to the clone()
system call to say whether it should be shared or not. It seems that there
may not be enough clone bits to go around; how that problem will be solved
is not clear.
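
To make the bit-shortage concern concrete, the following Python sketch models the namespace options as single bits in the clone() flags word. The constant values are the ones found in later versions of the kernel's `linux/sched.h`; the PID and network flags had not been merged when this talk was given, so treat them as illustrative. The sketch also assumes a 32-bit flags word whose low eight bits are reserved for the exit signal, which leaves 24 bits for every clone option, namespace-related or not. The names `NAMESPACE_FLAGS` and `clone_flags` are hypothetical helpers.

```python
# Namespace-related clone() flags; values taken from later kernel headers.
NAMESPACE_FLAGS = {
    "mount": 0x00020000,  # CLONE_NEWNS
    "uts":   0x04000000,  # CLONE_NEWUTS
    "ipc":   0x08000000,  # CLONE_NEWIPC
    "user":  0x10000000,  # CLONE_NEWUSER
    "pid":   0x20000000,  # CLONE_NEWPID (not yet merged at the time)
    "net":   0x40000000,  # CLONE_NEWNET (not yet merged at the time)
}

CSIGNAL = 0x000000FF  # low byte of the flags word carries the exit signal
FLAG_BITS = 32 - bin(CSIGNAL).count("1")  # bits left for all clone options


def clone_flags(*namespaces):
    """OR together the flag bits for the namespaces to be unshared."""
    flags = 0
    for name in namespaces:
        flags |= NAMESPACE_FLAGS[name]
    return flags


def is_single_bit(value):
    """True when exactly one bit is set in value."""
    return value != 0 and value & (value - 1) == 0
```

Each new kind of namespace consumes one more of those `FLAG_BITS` positions, and most of them were already taken by options unrelated to namespaces.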
So, how close are we to having a working container solution? It is still
somewhat distant, says Eric. But, when it's done, the support for
containers in Linux will be more general and more capable than the options
which are available now. It is, he says, a more general solution than
OpenVZ, and, unlike Solaris Zones, it will have network namespaces. An
important milestone will be the incorporation of PID namespaces, which will
make it possible to start actually playing with Linux containers. That
code should, with luck, be merged before too long, though it is proving to
be a bit of a challenge: kernel code has process IDs hidden away in a
number of unexpected places.
Stay tuned; perhaps, by the next kernel summit, containers will be
considered to be a solved problem as well.