Containers and lightweight virtualization
[Posted April 10, 2006 by corbet]
"Virtualization" is the act of making a set of processes believe that it
has a dedicated system to itself. There are a number of approaches being
taken to the virtualization problem, with Xen, VMWare, and User-mode Linux
being some of the better-known options. Those are relatively heavy-weight
solutions, however, with a separate kernel being run for each virtual
machine. Often, that is exactly the right solution to the problem; running
independent kernels gives strong separation between environments and
enables the running of multiple operating systems on the same hardware.
Full virtualization and paravirtualization are not the only approaches
being taken, however. An alternative is lightweight virtualization,
generally based on some sort of container concept. With containers, a
group of processes still appears to have its own dedicated system, but it
is really running in a specially isolated environment. All containers run
on top of the same kernel. With containers, the ability to run different
operating systems is lost, as is the strong separation between virtual
systems. Thus, one might not want to give root access to processes running
within a container environment. On the other hand, containers can have
considerable performance advantages, enabling large numbers of them to run
on the same physical host.
There is no shortage of container-oriented projects. These include
relatively simple efforts like the BSD jail module through more
thorough efforts like Linux-VServer, OpenVZ, and the proprietary Virtuozzo (based on OpenVZ) offering. Many of these
projects would like to get at least some of their code into the kernel and
shed the load of carrying out-of-tree patches. There is little
interest, however, in merging code which only supports some of these
projects. The container people are going to have to get together and work
out some common solutions which they can all use.
It appears that this is exactly what the container developers are doing. A
loose agreement has been put in place
wherein developers from a few projects will discuss proposed changes and
jointly work them into a form where they meet everybody's needs. Once a
particular patch has reached a point where all of the developers are
willing to sign off on it, it can be forwarded for eventual merging into
the mainline.
The more complex and intrusive changes, such as PID virtualization, appear to be
on hold for now. Instead, it looks like the first jointly-agreed patch
might be the UTS namespace
virtualization patch. The aim of the patch is relatively straightforward:
it allows each container (as represented by a family tree of processes) to
have its own version of the utsname structure, which holds the
node name, domain name, operating system version, and a few other things.
In essence, it replaces a single global structure with multiple structures
attached at various places in the process tree. It still requires a
five-part patch, with every reference to the global system_utsname
structure replaced by a call to the new utsname() function.
Longer-range plans call for the virtualization of every global namespace in
the kernel, including SYSV IPC, process IDs, and even netfilter rules.
There was an interesting discussion on the virtualization of security
modules; some think that each container should be able to load its own
security policy, while others argue in favor of a single system security
policy which is aware of (and able to use) containers. Unsurprisingly,
SELinux is already equipped with a type hierarchy mechanism which can be
used with containers in the single-policy approach.
Containers might still prove to be a hard sell with some developers, who
will see them as complicating access to many internal kernel data structures
without adding a whole lot of value. It is clear, however, that there is a
demand for this sort of lightweight virtualization - OpenVZ, alone, claims to be running over 300,000 virtual
environments. So the pressure to standardize this code and move it into
the mainline will only grow over time. Once they are clean enough to
satisfy the development community, pieces of the container concept are
likely to be merged.
(
Log in to post comments)