KS2007: Containers
For the first time in a few years, virtualization was not on the agenda at the 2007 kernel summit. The related field of containers, however, was deemed worth talking about. The virtualization problem has been mostly solved, at least at the kernel level, but there is still a lot of work to do in the containers area.
Paul Menage talked about the process containers patch, which has recently been rebranded "control groups." The control groups API is currently being used by the CFS scheduler, cpusets, and the memory controller code. Work in progress includes rlimits and an interface to the process freezer used by the suspend/resume code. Controlling the freezer via control groups allows user space to freeze specific groups of processes, which, in turn, is very useful when implementing checkpointing and live migration. In particular, with control groups, it will be possible to freeze an entire group of processes in an atomic way.
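The freezer interface was still work in progress at the time of the summit. As a rough sketch of what freezing a whole group could look like from user space, the code below is modeled on the `freezer.state` file that the cgroup v1 freezer controller later exposed, which accepts `FROZEN` and `THAWED`. The group path and all function names are hypothetical, and the code works on any directory so it can be tried without a mounted hierarchy.

```python
import os

# States accepted by the (later) cgroup v1 freezer.state file.
VALID_STATES = ("FROZEN", "THAWED")


def set_freezer_state(group_path, state):
    """Request a state change for every task in the group with one write."""
    if state not in VALID_STATES:
        raise ValueError(f"unsupported freezer state: {state!r}")
    with open(os.path.join(group_path, "freezer.state"), "w") as f:
        f.write(state + "\n")


def get_freezer_state(group_path):
    """Read the state back; a real kernel may briefly report FREEZING."""
    with open(os.path.join(group_path, "freezer.state")) as f:
        return f.read().strip()


def freeze_group(group_path):
    set_freezer_state(group_path, "FROZEN")


def thaw_group(group_path):
    set_freezer_state(group_path, "THAWED")
```

A checkpointing tool built this way would freeze the group, save the state of its processes, and then thaw the group again (or restart it elsewhere).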
Control groups have very little overhead when not in use. There is an approximately 1% hit on the fork() and exit() calls when control groups are being used. The control groups code is managed by way of a virtual filesystem. This filesystem is a user-space API which must be managed carefully; there needs to be consistency across the various controllers which can work with control groups. To that end, parts of this interface are being pushed into generic code when possible. One other issue is the use of control groups within containers. It would be nice if a containerized system could manage control groups for processes within the container, but that is not yet implemented.
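For a feel of what a filesystem-based API means in practice, here is a minimal sketch: a group is created by making a directory, and a process joins it when its PID is written into the group's `tasks` file. The `tasks` file name follows the interface that later shipped as cgroup v1; the hierarchy root, the group name, and the helper functions are assumptions made for illustration.

```python
import os
import tempfile


def create_group(hierarchy_root, name):
    """Create a control group: in the filesystem API, a group is a directory."""
    path = os.path.join(hierarchy_root, name)
    os.makedirs(path, exist_ok=True)
    return path


def attach_task(group_path, pid):
    """Move a process into the group by writing its PID to the tasks file.

    On a real hierarchy the kernel adds the PID to the group's membership;
    in a plain directory this simply overwrites the file.
    """
    with open(os.path.join(group_path, "tasks"), "w") as f:
        f.write(f"{pid}\n")


def read_tasks(group_path):
    """List the PIDs recorded in the group's tasks file."""
    with open(os.path.join(group_path, "tasks")) as f:
        return [int(line) for line in f if line.strip()]


if __name__ == "__main__":
    # A temporary directory stands in for a mounted hierarchy.
    with tempfile.TemporaryDirectory() as root:
        group = create_group(root, "batch_jobs")
        attach_task(group, os.getpid())
        print(read_tasks(group))
```

Because every controller is driven through the same files and directories, any inconsistency between controllers shows up directly in user-space tools, which is why the interface has to be managed carefully.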
Eric Biederman talked about the container situation in general. Implementing containers requires the creation of container-specific namespaces for all of the global resources found on the system. Namespaces for utsname (the hostname and related identifiers), SYSV interprocess communication primitives, and users are in the mainline now. There is a process ID namespace patch in -mm which is getting close. Network namespaces are in development now. Resources which still need to have namespaces created for them include system time (important to keep time from moving backward when containers are migrated from one system to another) and devices.
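To see why migration pushes the design toward namespaces, consider a toy model. This is not kernel code, and every class and function name in it is invented for illustration. With a single global table of identifiers, a restored process can find its old ID already taken on the destination machine; with one table per container, the same ID can exist in several containers at once.

```python
class IdTable:
    """Identifiers (for example process IDs) in use within one scope."""

    def __init__(self):
        self._in_use = set()

    def claim(self, ident):
        """Reserve a specific identifier; fail if it is already taken."""
        if ident in self._in_use:
            raise KeyError(f"identifier {ident} already in use")
        self._in_use.add(ident)


def restore(table, saved_ids):
    """Re-create a set of processes under the IDs they had before."""
    for ident in saved_ids:
        table.claim(ident)


migrated_ids = [1, 1234]  # IDs the migrated container used on the old host

# Case 1: one global table, and the destination already uses those IDs.
global_table = IdTable()
restore(global_table, [1, 1234])
try:
    restore(global_table, migrated_ids)
    global_ok = True
except KeyError:
    global_ok = False

# Case 2: each container gets its own table (its own namespace).
namespaces = {"host": IdTable(), "migrated": IdTable()}
restore(namespaces["host"], [1, 1234])
restore(namespaces["migrated"], migrated_ids)
namespaced_ok = True

print(global_ok, namespaced_ok)  # False True
```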
Each namespace which is created requires an option to the clone() system call to say whether it should be shared or not. It seems that there may not be enough clone bits to go around; how that problem will be solved is not clear.
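The pressure on clone bits is easy to check with a little arithmetic. The sketch below uses the flag values from present-day `linux/sched.h`, so it postdates the summit and shows where the 32-bit flags word ended up rather than its 2007 state; the helper functions are illustrative only.

```python
# Low byte of the clone() flags word: signal sent to the parent on child exit.
CSIGNAL = 0x000000FF

# Flag values as found in present-day linux/sched.h (not the 2007 headers).
CLONE_FLAGS = {
    "CLONE_VM": 0x00000100,
    "CLONE_FS": 0x00000200,
    "CLONE_FILES": 0x00000400,
    "CLONE_SIGHAND": 0x00000800,
    "CLONE_PIDFD": 0x00001000,
    "CLONE_PTRACE": 0x00002000,
    "CLONE_VFORK": 0x00004000,
    "CLONE_PARENT": 0x00008000,
    "CLONE_THREAD": 0x00010000,
    "CLONE_NEWNS": 0x00020000,
    "CLONE_SYSVSEM": 0x00040000,
    "CLONE_SETTLS": 0x00080000,
    "CLONE_PARENT_SETTID": 0x00100000,
    "CLONE_CHILD_CLEARTID": 0x00200000,
    "CLONE_DETACHED": 0x00400000,
    "CLONE_UNTRACED": 0x00800000,
    "CLONE_CHILD_SETTID": 0x01000000,
    "CLONE_NEWCGROUP": 0x02000000,
    "CLONE_NEWUTS": 0x04000000,
    "CLONE_NEWIPC": 0x08000000,
    "CLONE_NEWUSER": 0x10000000,
    "CLONE_NEWPID": 0x20000000,
    "CLONE_NEWNET": 0x40000000,
    "CLONE_IO": 0x80000000,
}


def used_mask():
    """OR together the signal byte and every defined flag."""
    mask = CSIGNAL
    for value in CLONE_FLAGS.values():
        mask |= value
    return mask


def free_bits(width=32):
    """Return the bit positions in a flags word of this width that are unused."""
    mask = used_mask()
    return [bit for bit in range(width) if not mask & (1 << bit)]


def namespace_flags():
    """Select the flags that request a new namespace for the child."""
    return {n: v for n, v in CLONE_FLAGS.items() if n.startswith("CLONE_NEW")}


if __name__ == "__main__":
    print(hex(used_mask()), free_bits(), sorted(namespace_flags()))
```

With these values every bit of the 32-bit word is spoken for, and seven of the flags are namespace requests.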
So, how close are we to having a working container solution? It is still somewhat distant, says Eric. But, when it's done, the support for containers in Linux will be more general and more capable than the options which are available now. It is, he says, a more general solution than OpenVZ, and, unlike Solaris Zones, it will have network namespaces. An important milestone will be the incorporation of PID namespaces, which will make it possible to start actually playing with Linux containers. That code should, with luck, be merged before too long, though it is proving to be a bit of a challenge: kernel code has process IDs hidden away in a number of unexpected places.
Stay tuned; perhaps, by the next kernel summit, containers will be considered to be a solved problem as well.
KS2007: Containers
Posted Sep 10, 2007 22:59 UTC (Mon)
by kolyshkin (guest, #34342)

By the way, slides used for this session are available here.

> An important milestone will be the incorporation of PID namespaces, which will make it possible to start actually playing with Linux containers. That code should, with luck, be merged before too long

(Most of) the PID namespaces code is already in the -mm tree.

> It is, he says, a more general solution than OpenVZ

Yes, in the sense that one can use only parts of the container functionality (like having only a PID namespace, or a network namespace) -- which makes sense in some situations. Currently, the OpenVZ kernel lets you use only some parts separately (like beancounters, or the fair CPU scheduler), and this is only from the kernel side -- user-level tools can only deal with "full scale" containers.

On the other hand, checkpointing is only possible when a container is a closed object, so "half-containers" cannot be checkpointed.

> So, how close are we to having a working container solution?

A big part here is resource management. The memory controller that is now in -mm is just the very beginning -- there is a whole lot more than RSS and page cache (on the other hand, Pavel Emelyanov has already sent a kernel memory controller patchset as an RFC). Group-based CFQ scheduling is not yet merged AFAIK. Group I/O scheduling (based on Jens Axboe's CFQ) will probably be sent for review soon, but scheduling delayed writes requires some dirty page tracking mechanism that only exists in OpenVZ for now (described in Pavel's paper); a discussion of how to implement that for mainline has not even started. In the end, there are a lot of issues to be solved, but given the latest progress, most of the functionality could be there in a year or so, so I more or less agree with your optimistic forecast. :) When containers are ready, we can start work on checkpointing.

What is a network namespace?
Posted Sep 11, 2007 19:33 UTC (Tue)
by cajal (guest, #4167)

I'm puzzled by this quote: "unlike Solaris Zones, it will have network namespaces." What is a network namespace?

What is a network namespace?
Posted Sep 12, 2007 9:13 UTC (Wed)
by zdzichu (subscriber, #17118)

It's the ability to have different network stacks running side by side. It's network stack virtualization. And, contrary to the comment above, it's available in Solaris 10u4 and OpenSolaris. It's nicknamed project Crossbow.

What is a network namespace?
Posted Sep 12, 2007 14:42 UTC (Wed)
by ebiederm (subscriber, #35028)

Odd. I don't think I actually made that comment.

KS2007: Containers
Posted Sep 12, 2007 14:41 UTC (Wed)
by ebiederm (subscriber, #35028)

When I claimed the current kernel infrastructure is more general than vserver and OpenVZ, what I meant is that we have to support the entire kernel and everything it can do, and do it with code that can pass a code review by the kernel community. Ensuring that every architecture and subarchitecture will work, and that every weird kernel subsystem will work, appears to me to be more than the out-of-tree projects have tackled.

Doing this with namespaces decomposes the problem so we can have an incremental merge (simplifying things). It also makes things a little harder, as we have to handle all of the weird partial interactions.

The question asked of me is how long until we have in-kernel support that is equal to OpenVZ or Solaris Zones. Getting there pretty much requires us to get everything complete and will take a while.

If you only need a subset of that functionality (like a lot of projects), we should have something interesting when we get things like the pid namespace merged.

Having the additional resource management seems to be a big part of the existing out-of-tree solutions because when you load the machine heavily you have more contention between users. However, for some uses, like a better chroot for rpm installs or an isolated set of processes for checkpoint/restart, you don't need additional resource management.

For global resources there are two approaches that a designer can choose from: namespaces, where you allow two instances of the same global name to exist in different namespaces, and pure isolation (which is almost exclusively what vserver provides), which only allows you to see a subset of the global names. If you are not supporting process migration they are about the same. Without namespaces, process migration is in trouble because there is no guarantee that you can restore your global identifiers and keep running.

What little I know of Solaris Zones is that they grew out of efforts to improve chroot-type solutions, and thus do primarily global resource isolation and do not provide namespaces. The implication of that is that Solaris Zones do not provide an easy path to container migration from one machine to another. However, everything is evolving, and even if my understanding was right at one time, Solaris may have changed since then.

As for the question of what network namespaces are: they are a way to make it appear to user space as if you have multiple network stacks, each logical stack with its own routing tables, firewall tables, network devices and the works. Fundamentally they aren't too hard to implement, but they need a bit of work on how the network stack handles global data.

Eric

KS2007: Containers
Posted Sep 12, 2007 20:51 UTC (Wed)
by kolyshkin (guest, #34342)

Gerrit Huizenga's coverage of the same containers session is here:
http://gh-linux.blogspot.com/2007/09/linux-kernel-summit-...
