Systemd lightweight containers

By Jake Edge
February 6, 2013

Linux containers, which are implemented using kernel namespaces and control groups, allow processes to operate in an isolated manner, so that the interactions with other processes and kernel services are limited. That makes containers attractive for a variety of tasks, including many that might have once been done using chroot(). As namespace support in the kernel matures, tools to set up and use containers are becoming more prevalent—and easier to use. A feature proposed for Fedora 19 will make use of systemd to create and manage containers.

At first blush, systemd does not really seem like a container-management tool. In fact, detractors might see that as feature creep. But systemd already has infrastructure to spawn containers in the form of the systemd-nspawn command. In addition, creating a new process ID (PID) namespace means that an init program (i.e. PID 1) is needed, which is, of course, the role that systemd normally fills.

Beyond that, systemd is designed around the idea of "socket activation", so that services can be started when the first connection is made to them. That idea can be applied to containers, so that a new container gets started when a connection is made to a certain port. This "container activation" feature is reminiscent of a similar idea in the SELinux-based secure containers feature that was added to Fedora 18. Unlike the secure containers, though, those created with systemd-nspawn are not primarily intended for security. With proper care and feeding, however, they can provide another layer in a "defense in depth" strategy.

One goal of the "systemd lightweight containers" feature is to make it easy to run an unmodified Fedora 19 inside the containers created by systemd-nspawn. But Fedora is not the only distribution that could run in those containers; Debian is another candidate, and others are possible too. By installing a minimal system into a directory somewhere (using yum or debootstrap, for example) and then pointing systemd-nspawn at it, a usable version of the distribution can be run. Users can log into it from the "console", set up a service or services to run inside of it, and so on. Rudimentary directions on setting that up are part of the feature proposal.
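As a rough illustration of that installation step, the sketch below wraps the two installers in small shell functions. The target directory, release version, repository name, and package list are assumptions to adjust for the system at hand, and the helper names (run, install_fedora, install_debian) are made up for this example. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: populate a directory with a minimal OS tree for systemd-nspawn.
# ROOT, the release version, the repository name and the package list are
# illustrative assumptions, not values taken from the feature proposal.
ROOT=${ROOT:-/srv/rawhide}
DRY_RUN=${DRY_RUN:-1}    # 1: only print the command; anything else: run it

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Install a minimal Fedora tree into the directory given as $1.
install_fedora() {
    run yum -y --releasever=19 --nogpg --installroot="$1" \
        --disablerepo='*' --enablerepo=fedora \
        install systemd passwd yum fedora-release vim-minimal
}

# Install a minimal Debian tree into the directory given as $1.
install_debian() {
    run debootstrap --arch=amd64 unstable "$1"
}

install_fedora "$ROOT"
```

Running it with DRY_RUN=0 as root would perform the installation for real; the resulting directory is what gets handed to systemd-nspawn with the -D option.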

By default, systemd-nspawn sets up separate PID, mount, IPC (inter-process communication), and UTS (host and domain name) namespaces, and executes the given command inside of them. If invoked with the -b option, it will search for an init binary to execute, and pass any arguments to that program. This command:

    systemd-nspawn -bD /srv/rawhide 3

would start a container with a root filesystem at /srv/rawhide, execute the init found there (which would be Rawhide's version of systemd) and pass the runlevel "3" to it. Note that due to a bug in Fedora's audit support (or the kernel, or systemd-nspawn, depending on who you talk to), auditing needs to be disabled in the kernel by booting with "audit=0". Even then, some systems will still experience problems unless they give the container extra capabilities using a command like:

    systemd-nspawn --capability=cap_audit_write,cap_audit_control -bD /srv/rawhide 3

Presumably, that particular problem will be shaken out before long, as giving those capabilities to the container allows it to control auditing in the host—just the kind of thing a container is meant to avoid.

With a simple unit file, the container can be turned into a service that can be started, stopped, and monitored with systemctl. Fans of the systemd journal can use the -j option of systemd-nspawn to effectively export the container's journal to the host. A "journalctl -m" command on the host will then show merged journal entries from the host and any containers.
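What such a unit file might look like is sketched below as a shell script that writes it to a staging directory. The unit name, the staging directory, and the KillMode=process setting are assumptions made for this example; the systemd-nspawn command line is the one used above, with -j added.

```shell
#!/bin/sh
# Sketch: generate a service unit that runs the container under systemd.
# UNIT_DIR and NAME are hypothetical; on a real host the file would be
# copied to /etc/systemd/system.
UNIT_DIR=${UNIT_DIR:-$(mktemp -d)}
NAME=${NAME:-rawhide}
ROOT=${ROOT:-/srv/rawhide}

mkdir -p "$UNIT_DIR"

# KillMode=process (an assumption here) asks systemd to signal only the
# nspawn process on stop, rather than every process in the service.
cat > "$UNIT_DIR/$NAME.service" <<EOF
[Unit]
Description=Container $NAME

[Service]
ExecStart=/usr/bin/systemd-nspawn -jbD $ROOT 3
KillMode=process
EOF

echo "wrote $UNIT_DIR/$NAME.service"
```

Once the file is installed on the host, "systemctl start rawhide.service" and "systemctl status rawhide.service" would start and inspect the container like any other service.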

Multiple containers can be started using the same directory and they won't be able to see each other. Changes to the filesystem will be immediately visible in any container using it, but processes in one container cannot interact with processes in another, nor with the processes on the host.
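That isolation comes from each container getting its own namespaces. One way to observe it, assuming a kernel that exposes the entries under /proc/self/ns as symbolic links, is to print the namespace identifiers on the host and again inside a container, then compare them. The helper name ns_id is made up for this sketch.

```shell
#!/bin/sh
# Sketch: print the namespace identifiers of the current process.
# ns_id is a hypothetical helper; it prints "unavailable" when the kernel
# does not expose the requested namespace as a symbolic link.
ns_id() {
    readlink "/proc/self/ns/$1" 2>/dev/null || echo "unavailable"
}

for ns in pid mnt ipc uts net; do
    printf '%s\t%s\n' "$ns" "$(ns_id "$ns")"
done
```

Namespaces that are private to the container should show identifiers that differ from the host's, while a shared namespace shows the same identifier in both places.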

Using the techniques described in "systemd for Administrators, Part XX", these containers can easily be made socket activated. An incoming connection on a particular host port would spawn the container, which would have unit files that recognized the incoming connection to start the right service on the inside. Users will likely also want to set up sshd inside the container to run on a different port (the host presumably already uses 22) for ease of accessing the container.
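The units involved might look roughly like the following sketch, which writes them to a staging directory. The port number 2222, the unit names, and the staging layout are assumptions for illustration. The host socket unit relies on systemd starting the service of the same name, so it presumes a rawhide.service that runs the container.

```shell
#!/bin/sh
# Sketch: socket units for a socket-activated container running sshd.
# STAGE, NAME and PORT are hypothetical values chosen for this example.
STAGE=${STAGE:-$(mktemp -d)}
NAME=${NAME:-rawhide}
PORT=${PORT:-2222}

mkdir -p "$STAGE/host" "$STAGE/container"

# Host side: systemd listens on PORT and starts NAME.service (the unit
# with the matching name) when the first connection arrives.
cat > "$STAGE/host/$NAME.socket" <<EOF
[Unit]
Description=SSH socket of container $NAME

[Socket]
ListenStream=$PORT
EOF

# Container side: the systemd inside listens on the same port, so it can
# match the socket passed in from the host, and spawns one sshd instance
# per connection.
cat > "$STAGE/container/sshd.socket" <<EOF
[Unit]
Description=SSH socket for per-connection servers

[Socket]
ListenStream=$PORT
Accept=yes
EOF

# Quoted delimiter: %I must reach the unit file unchanged.
cat > "$STAGE/container/sshd@.service" <<'EOF'
[Unit]
Description=SSH per-connection server for %I

[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket
EOF

echo "units written to $STAGE"
```

The host unit would go into /etc/systemd/system on the host, and the other two into the same directory beneath the container's root, with sshd.socket enabled there.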

There is also an option to run the container in a separate network namespace (--private-network), which essentially turns off networking for the container. Only the loopback interface is available to the container, so no network connections of any kind can be made, though it could still read and write using socket file descriptors that were passed to it. That would be a way to isolate an internet-facing service, for example.
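A quick way to confirm the effect from inside a container started with --private-network is to list the interfaces the kernel reports. The sketch below reads /proc/net/dev, which reflects the network namespace of the reading process; the function names are made up for this example.

```shell
#!/bin/sh
# Sketch: report whether only the loopback interface is visible.

# List interface names from a file in /proc/net/dev format: two header
# lines, then one "name: counters..." line per interface.
list_interfaces() {
    awk -F: 'NR > 2 { gsub(/[ \t]/, "", $1); print $1 }' "${1:-/proc/net/dev}"
}

# Succeed only if the arguments are exactly the single name "lo".
only_loopback() {
    [ "$#" -eq 1 ] && [ "$1" = "lo" ]
}

if [ -r /proc/net/dev ]; then
    # Word splitting is intended here: one argument per interface name.
    if only_loopback $(list_interfaces); then
        echo "loopback only"
    else
        echo "not loopback only"
    fi
fi
```

On the host, or in a container that inherits the host's network, the script should report "not loopback only" whenever any other interface is present.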

There are a number of different use cases for the feature, but it also looks like something that will be built upon in the future. Allowing for tightened security, possibly using user ID namespaces, would be one possibility. Adding support for network namespaces that have more than just the loopback interface could be interesting as well. Since FESCo approved the feature for Fedora 19 at its February 6 meeting, more users of the feature can be expected. That means that more use cases will be found, which seems likely to lead to expanded functionality, but it's a useful feature as it stands.

Systemd lightweight containers

Posted Feb 7, 2013 9:19 UTC (Thu) by mezcalero (subscriber, #45103) [Link]

Regarding the difference between the two command lines:

systemd git (which will appear in F19) implicitly adds CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL to the capability set of the nspawn container, which nspawn in F18 did not do. Hence the first command line will work fine on F19 hosts, while the second one already works on F18 hosts (and will still work on F19). Just granting these capabilities is not enough to make everything work, however; you also need to turn off the entire audit kernel stack with audit=0 on the kernel command line. Ironically, turning off the kernel stack alone, without also granting CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE to the container, won't make things work, as the audit userspace code is really bad.

I hope one day audit gets fixed so that audit=0 won't be necessary anymore.

Systemd lightweight containers

Posted Feb 7, 2013 9:40 UTC (Thu) by bgmarete (guest, #47484) [Link] (8 responses)

Am I right in assuming that until UID namespaces are complete in the kernel, root in a container (on an un-patched mainline kernel) is also root on the host? Further, other than the UID namespace work, what further work remains to make hosts completely safe from containers?

Systemd lightweight containers

Posted Feb 7, 2013 9:58 UTC (Thu) by mezcalero (subscriber, #45103) [Link] (7 responses)

It's not as bad as it sounds, as in most cases the root inside the container won't see much of the host's resources and hence can't do much harm to it. It also lacks many capabilities, so it is much less powerful than a real root anyway. And the per-user settings the kernel maintains for root, such as RLIMIT_NPROC, generally don't have much effect in any case.

I also doubt that UID namespaces are really a magic bullet. Their support in file systems is really awkward (if your processes have multiple UIDs, but your file system only maintains one per file, how could that ever work?), so I kinda get the impression they create more problems than they solve.

I am not convinced that I'll ever update nspawn to make use of UID namespaces.

Systemd lightweight containers

Posted Feb 7, 2013 11:58 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Can we also automatically set up a network namespace for the container?

For example, I might want to create a virtual network interface inside the container and hook it up with a TUN device on the host. That would allow using the usual DHCP infrastructure to assign addresses to containers.

Systemd lightweight containers

Posted Feb 7, 2013 21:57 UTC (Thu) by mezcalero (subscriber, #45103) [Link] (1 responses)

Currently nspawn has two modes: a) inherited network from the host, or b) no connectivity at all, only a private loopback device and nothing else.

We might add more later on; the kernel certainly supports other modes. We try to be careful, though, that we don't end up with too much complexity. After all, nspawn really should stay the simple tool that just works, rather than becoming a super-configurable beast.

Systemd lightweight containers

Posted Feb 7, 2013 22:00 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Makes sense. Is it possible to attach scripted hooks that will set up the networking infrastructure? Also, how would one connect a new interface to the accepted socket?

Systemd lightweight containers

Posted Feb 7, 2013 15:56 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (3 responses)

> [UID namespace] support in file systems is really awkward (if your processes have multiple UIDs, but your file system only maintains one per file, how could that ever work?), so I kinda get the impression they create more problems than they solve.

I wouldn't expect that to be as difficult as you make it sound. The filesystem should see a single UID within its own namespace; if I mount a filesystem and then create a UID namespace mapping UID 0 to UID 1000, then a process with UID 0 inside the namespace and UID 1000 outside should appear to the filesystem as UID 1000. On the other hand, if I created the UID namespace first and then mounted the filesystem inside it, the filesystem should see UID 0, since the filesystem and the process are in the same UID namespace.

Systemd lightweight containers

Posted Feb 7, 2013 21:55 UTC (Thu) by mezcalero (subscriber, #45103) [Link] (2 responses)

So, you boot your OS in your container, and it uses something like 50 different UIDs, because it runs various daemons unprivileged. And all of these map to the same UID 1000 on the host, and in the file system. Which component then knows how the files on disk map to those 50 different UIDs? How would that ever work?

Systemd lightweight containers

Posted Feb 7, 2013 22:20 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

You make a good point. I think that you would need to map multiple container UIDs to multiple UIDs on the host if you want a filesystem mounted outside the container to keep them separate. IIRC the UID namespace code allows each container UID to map to a different host UID.

Mapping from container to host is always well-defined, but I'm not sure what the kernel does when the reverse translation is one-to-many. Most likely it just picks the first match, so all UID 1000 files would appear to be owned by root. (However, the permissions check should be performed using UID 1000 on the host side, in the filesystem's namespace, regardless of what the processes inside the container see.)

Systemd lightweight containers

Posted Feb 7, 2013 23:33 UTC (Thu) by zuki (subscriber, #41808) [Link]

Eric Biederman has a helper library which reserves a range of UIDs on the host and lets users use them (http://thread.gmane.org/gmane.linux.kernel.containers/25062). Afaicu, it is up to the user which UID means what.