Systemd lightweight containers
Linux containers, which are implemented using kernel namespaces and control groups, allow processes to operate in an isolated manner, so that the interactions with other processes and kernel services are limited. That makes containers attractive for a variety of tasks, including many that might have once been done using chroot(). As namespace support in the kernel matures, tools to set up and use containers are becoming more prevalent—and easier to use. A feature proposed for Fedora 19 will make use of systemd to create and manage containers.
At first blush, systemd does not really seem like a container-management tool. In fact, detractors might see that as feature creep. But systemd already has infrastructure to spawn containers in the form of the systemd-nspawn command. In addition, creating a new process ID (PID) namespace means that an init program (i.e. PID 1) is needed, which is, of course, the role that systemd normally fills.
Beyond that, systemd is designed around the idea of "socket activation", so that services can be started when the first connection is made to them. That idea can be applied to containers, so that a new container gets started when a connection is made to a certain port. This "container activation" feature is reminiscent of a similar idea in the SELinux-based secure containers feature that was added to Fedora 18. Unlike the secure containers, though, those created with systemd-nspawn are not primarily intended for security. With proper care and feeding, however, they can provide another layer of "defense in depth".
One goal of the "systemd lightweight containers" feature is to make it easy to run an unmodified Fedora 19 inside the containers created by systemd-nspawn. Fedora is not the only distribution that could run in those containers, though; Debian is another candidate, and other distributions are possible too. Installing a minimal system into a directory somewhere (using yum or debootstrap, for example) and then pointing systemd-nspawn at it yields a usable version of the distribution. Users can log into it from the "console", set up a service or services to run inside of it, and so on. Rudimentary directions on setting that up are part of the feature proposal.
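As a rough sketch of that bootstrap step, the script below shows one way to populate a directory with yum or debootstrap. The target path /srv/mycontainer, the package list, the Debian suite, and the DRY_RUN helper are assumptions chosen for illustration and are not taken from the feature proposal. By default the script only prints the commands, since actually running them requires root and network access.

```shell
#!/bin/sh
# Sketch: populate a directory with a minimal distribution tree.
# ROOT, the package list, and the DRY_RUN switch are illustrative assumptions.
ROOT="${ROOT:-/srv/mycontainer}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1; execute it otherwise (needs root).
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Fedora: install a small package set into an alternate root directory.
bootstrap_fedora() {
    run yum -y --releasever=19 --nogpgcheck --installroot="$ROOT" \
        install systemd passwd yum fedora-release vim-minimal
}

# Debian: let debootstrap build the tree.
bootstrap_debian() {
    run debootstrap --arch=amd64 unstable "$ROOT"
}

bootstrap_fedora
bootstrap_debian
```

Setting DRY_RUN=0 (as root) would execute the commands instead of printing them; only one of the two bootstrap functions is needed for any given container.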
By default, systemd-nspawn sets up separate PID, mount, IPC (inter-process communication), and UTS (host and domain name) namespaces, and executes the given command inside of them. If invoked with the -b option, it will search for an init binary to execute, and pass any arguments to that program. This command:
systemd-nspawn -bD /srv/rawhide 3
would start a container with a root filesystem at /srv/rawhide, execute the init found there (which would be Rawhide's version of systemd) and pass the runlevel "3" to it. Note that due to a bug in Fedora's audit support (or the kernel, or systemd-nspawn, depending on who you talk to), auditing needs to be disabled in the kernel by booting with "audit=0". Even then, some systems will still experience problems unless they give the container extra capabilities using a command like:
systemd-nspawn --capability=cap_audit_write,cap_audit_control -bD /srv/rawhide 3
Presumably, that particular problem will be shaken out before long, as giving those capabilities to the container allows it to control auditing in the host, which is just the kind of thing a container is meant to avoid.
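Before starting a container, it can be handy to check whether the host was actually booted with "audit=0". The following is a minimal sketch of such a check; the helper name audit_disabled and the messages it prints are hypothetical, and the script only inspects the kernel command line without starting anything.

```shell
#!/bin/sh
# Sketch: report whether the kernel command line disables auditing.
# The helper name and the messages are illustrative assumptions.

# Succeeds if the given command line contains the exact word "audit=0".
audit_disabled() {
    case " $1 " in
        *" audit=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the parameters the running kernel was booted with.
cmdline="$(cat /proc/cmdline 2>/dev/null || true)"

if audit_disabled "$cmdline"; then
    echo "audit=0 is set on the kernel command line"
else
    echo "audit=0 is not set; reboot the host with audit=0 before using systemd-nspawn -b"
fi
```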
With a simple unit file, the container can be turned into a service that can be started, stopped, and monitored with systemctl. Fans of the systemd journal can use the -j option of systemd-nspawn to effectively export the container's journal to the host. A "journalctl -m" command on the host will then show merged journal entries from the host and any containers.
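A minimal sketch of such a unit file follows, written out by a small shell script. The unit name mycontainer.service, the path /srv/mycontainer, and the UNIT_DIR variable are assumptions made for illustration; on a real host the file would normally live in /etc/systemd/system, which requires root.

```shell
#!/bin/sh
# Sketch: write a unit file that wraps the container in a service.
# UNIT_DIR, the unit name, and /srv/mycontainer are illustrative assumptions.
UNIT_DIR="${UNIT_DIR:-$(mktemp -d)}"
mkdir -p "$UNIT_DIR"

cat > "$UNIT_DIR/mycontainer.service" <<'EOF'
[Unit]
Description=My little container

[Service]
# -j exports the container's journal to the host, -b boots the init found
# in the directory given with -D, and "3" is passed on to that init.
ExecStart=/usr/bin/systemd-nspawn -jbD /srv/mycontainer 3
# Signal only the main systemd-nspawn process when the service is stopped.
KillMode=process
EOF

echo "wrote $UNIT_DIR/mycontainer.service"
```

Once the file is in /etc/systemd/system and "systemctl daemon-reload" has been run, the container could be started with "systemctl start mycontainer.service" and inspected with "systemctl status mycontainer.service".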
Multiple containers can be started using the same directory and they won't be able to see each other. Changes to the filesystem will be immediately visible in any container using it, but processes in one container cannot interact with processes in another, nor with the processes on the host.
Using the techniques described in "systemd for Administrators, Part XX", these containers can easily be made socket-activated. An incoming connection on a particular host port would spawn the container, which would have unit files that recognize the incoming connection and start the right service on the inside. Users will likely also want to set up sshd inside the container to run on a different port (the host presumably already uses 22) for ease of accessing the container.
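The host side of that arrangement might look like the sketch below, which writes a socket unit. By default, a socket unit activates the service unit of the same name, so this assumes a hypothetical mycontainer.service that runs systemd-nspawn. The port number 2222, the unit name, and the UNIT_DIR variable are illustrative assumptions, and the container would still need its own unit files listening on the same port.

```shell
#!/bin/sh
# Sketch: host-side socket unit for a socket-activated container.
# UNIT_DIR, PORT, and the unit name are illustrative assumptions.
UNIT_DIR="${UNIT_DIR:-$(mktemp -d)}"
PORT="${PORT:-2222}"
mkdir -p "$UNIT_DIR"

# The first connection to $PORT starts mycontainer.service (same base name).
cat > "$UNIT_DIR/mycontainer.socket" <<EOF
[Unit]
Description=SSH socket of my little container

[Socket]
ListenStream=$PORT

[Install]
WantedBy=sockets.target
EOF

echo "wrote $UNIT_DIR/mycontainer.socket"
```

With both units installed on the host, "systemctl start mycontainer.socket" would leave the container stopped until something connects to the port.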
There is also an option to run the container in a separate network namespace (--private-network), which essentially turns off networking for the container. Only the loopback interface is available inside, so the container cannot make network connections of any kind, though it can still read and write using socket file descriptors that were passed to it. That would be a way to isolate an internet-facing service, for example.
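For illustration, the helper below assembles the command line with or without the option. The function name nspawn_cmd and its arguments are hypothetical, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: build a systemd-nspawn command line, optionally isolated from
# the network. The function name and its arguments are illustrative.
nspawn_cmd() {
    dir="$1"
    private="${2:-no}"
    if [ "$private" = "yes" ]; then
        echo "systemd-nspawn --private-network -bD $dir 3"
    else
        echo "systemd-nspawn -bD $dir 3"
    fi
}

# Print the isolated variant for the Rawhide tree used earlier.
nspawn_cmd /srv/rawhide yes
```

Inside a container started this way, listing the network interfaces (with "ip link", say) should show only the loopback device.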
There are a number of different use cases for the feature, but it also looks like something that will be built upon in the future. Tightening security, perhaps by using user ID namespaces, is one possibility. Adding support for network namespaces that have more than just the loopback interface could be interesting as well. Since FESCo approved the feature for Fedora 19 at its February 6 meeting, more users of the feature can be expected. That means that more use cases will be found, which seems likely to lead to expanded functionality, but it's a useful feature as it stands.
