By Jake Edge
February 6, 2013
Linux containers, which are implemented using kernel namespaces and control groups, allow
processes to operate
in an isolated manner, so that the interactions with other processes and
kernel services are limited. That makes containers attractive for a
variety of tasks, including many that might have once been done using
chroot(). As namespace support in the kernel matures, tools
to set up and use containers are becoming more prevalent—and easier
to use. A feature
proposed for Fedora 19 will make use of systemd to create and manage
containers.
At first blush, systemd does not really seem like a
container-management tool. In fact, detractors might see that as feature
creep. But systemd already has infrastructure to spawn containers in
the form of the systemd-nspawn
command. In addition, creating a new process ID (PID) namespace means that
an init program (i.e. PID 1) is needed, which is, of course, the
role that systemd normally fills.
Beyond that, systemd is designed around the idea of "socket activation", so
that services can be started when the first connection is made to them.
That idea can be applied to containers, so that a new container gets
started when a connection is made to a certain port. This "container
activation" feature is reminiscent of the SELinux-based secure containers feature that
was added to Fedora 18. Unlike the secure containers, though, those
created with systemd-nspawn are not primarily intended for
security. With proper care and feeding, however, they can
provide another layer of "defense in depth".
One goal of the "systemd lightweight containers" feature is to make it easy
to run an unmodified Fedora 19 inside the containers created by
systemd-nspawn. But it isn't just Fedora that could run in those
containers; Debian is another candidate, and other distributions are possible
too. By installing a minimal system
into a directory somewhere—using yum or
debootstrap for example—and
then pointing systemd-nspawn at it, a usable version of the
distribution can be run. Users can log into it from the "console", set
up a service or services to run inside of it, and so on. Rudimentary
directions on setting that up are part of the feature proposal.
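As a rough sketch of that installation step, the shell function below prints (rather than runs) a command that would populate a directory with each of the two tools mentioned. The function name bootstrap_cmd, the package list, the Debian suite, and the architecture are illustrative assumptions modeled on the examples in the systemd-nspawn man page, not taken from the feature proposal; adjust them for the release and mirror actually in use.

```shell
# Print the command that would install a minimal system into a directory.
# Nothing is executed; the output is meant to be reviewed and run as root.
#   $1 = distribution ("fedora" or "debian"), $2 = target root directory
bootstrap_cmd() {
    case "$1" in
        fedora)
            # Package set and release number are assumptions for illustration.
            echo "yum -y --releasever=19 --nogpg --installroot=$2 --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal"
            ;;
        debian)
            # Suite and architecture are assumptions for illustration.
            echo "debootstrap --arch=amd64 unstable $2"
            ;;
        *)
            echo "unsupported distribution: $1" >&2
            return 1
            ;;
    esac
}
```

Calling bootstrap_cmd fedora /srv/rawhide would print a yum invocation targeting /srv/rawhide, which is the directory that systemd-nspawn would later be pointed at.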
By default, systemd-nspawn sets up separate PID, mount, IPC
(inter-process communication), and
UTS (host and domain name) namespaces, and executes the given command
inside of them. If invoked with the -b option, it will search for
an init binary to execute, and pass any arguments to that
program. This command:
systemd-nspawn -bD /srv/rawhide 3
would start a container with a root filesystem at /srv/rawhide, execute the
init found there (which would be Rawhide's version of systemd) and pass the
runlevel "3" to it. Note that due to a bug in Fedora's audit support (or the
kernel, or systemd-nspawn, depending on who you talk to), auditing needs to
be disabled in the kernel by booting with "audit=0". Even then, some systems
will still experience problems unless they give the container extra
capabilities using a command like:
systemd-nspawn --capability=cap_audit_write,cap_audit_control -bD /srv/rawhide 3
Presumably, that particular problem will be shaken out before long, as
giving those capabilities to the container allows it to control auditing in
the host—just the kind of thing a container is meant to avoid.
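Because the container may misbehave unless the host was booted with audit=0, a wrapper script might want to check for that before launching anything. The following is a minimal sketch under that assumption; the helper name audit_disabled is hypothetical, and it only inspects a kernel command-line string handed to it.

```shell
# Succeed if the given kernel command line contains the exact token audit=0.
#   $1 = kernel command line, e.g. the contents of /proc/cmdline
audit_disabled() {
    # Pad with spaces so that only a whole token matches (not audit=01).
    case " $1 " in
        *" audit=0 "*) return 0 ;;
    esac
    return 1
}

# Usage (not run here):
#   audit_disabled "$(cat /proc/cmdline)" ||
#       echo "warning: host was not booted with audit=0" >&2
```

A check like this only reports how the host was booted; it does not change the audit configuration, and it says nothing about whether the extra capabilities are still needed.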
With a simple unit file, the container can be turned into a service that
can be started, stopped, and monitored with systemctl. Fans of
the systemd journal can use the -j option of
systemd-nspawn to effectively export
the container's journal to the host. A "journalctl -m"
command on the host will then show merged journal entries from the host and any
containers.
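To illustrate what such a "simple unit file" might look like, the sketch below writes one to a directory of the caller's choosing. The function name write_container_unit, the unit description, and the KillMode=process setting are assumptions for illustration (the latter is intended to leave shutdown of the container's processes to its own init); the ExecStart line combines the -j and -b options described above.

```shell
# Write a service unit that boots a container with systemd-nspawn.
#   $1 = container name, $2 = container root directory,
#   $3 = directory to write the unit into (e.g. /etc/systemd/system)
write_container_unit() {
    cat > "$3/$1.service" <<EOF
[Unit]
Description=Container $1

[Service]
ExecStart=/usr/bin/systemd-nspawn -jbD $2 3
KillMode=process
EOF
}
```

After something like write_container_unit rawhide /srv/rawhide /etc/systemd/system and a "systemctl daemon-reload", the container could presumably be managed with "systemctl start rawhide.service" and "systemctl status rawhide.service", with its log entries showing up in "journalctl -m" on the host.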
Multiple containers can be started using the same directory
and they won't be able to see each other. Changes to the filesystem will
be immediately visible in any container using it, but processes in one container cannot interact with processes
in another, nor with the processes on the host.
Using the techniques described in "systemd for Administrators, Part XX",
these containers can easily be made
socket activated. An incoming connection on a particular host port would
spawn the container, which would have unit files that recognized the
incoming connection to start the right service on the inside. Users will
likely also want to set up sshd inside the container to run on a different
port (the host presumably already uses 22) for ease of accessing the
container.
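The host side of that arrangement can be sketched as a socket unit paired with a container service unit of the same name. The function name write_container_socket and the port in the usage note are hypothetical; the sketch relies on systemd's default behavior of starting the service whose name matches the socket unit, and it assumes the container carries its own units that pick up the passed-in socket.

```shell
# Write a socket unit whose first incoming connection starts the
# identically named container service.
#   $1 = container name, $2 = host port to listen on,
#   $3 = directory to write the unit into (e.g. /etc/systemd/system)
write_container_socket() {
    cat > "$3/$1.socket" <<EOF
[Unit]
Description=Activation socket for container $1

[Socket]
ListenStream=$2

[Install]
WantedBy=sockets.target
EOF
}
```

For example, write_container_socket rawhide 2222 /etc/systemd/system followed by "systemctl start rawhide.socket" would have the host listen on port 2222 and start rawhide.service when the first connection arrives. What happens to that connection inside the container depends entirely on the units installed there.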
There is also an option to run the container in a separate network
namespace (--private-network), which essentially turns off
networking for the container. Only the loopback interface is available to
the container, so no network connections of any kind can be made, though
the container could still read and write using socket file descriptors that were
passed to it. That would be a way to isolate an internet-facing service,
for example.
There are a number of different use cases for the feature, but it also
looks like something that will be built upon in the future. Allowing for tightened
security, possibly using user ID namespaces, would be one possibility.
Adding support for network namespaces that have more than just the loopback
interface could be interesting as well. Since FESCo approved the feature
for Fedora 19 at its February 6 meeting, more users of the feature can be
expected. That means that
more use cases will be found, which seems likely to lead to expanded
functionality, but it's a useful feature as it stands.