While the Linux
Security Summit (LSS) was held later in the week, it
was logically part of the minisummits that accompanied the Kernel
Summit—organizer James Morris made a forward-reference report on LSS
as part of the minisummit reports. Day one was filled with talks on
various topics of interest to the
assembled security developers, while day two was mostly devoted to reports
from the kernel security subsystems. We plan to write up much of LSS over
the coming weeks; the first installment covers a talk given by SELinux
developer Dan Walsh on secure Linux containers.
Walsh's opening slide had a picture of a "secure" Linux
container—a plastic "unix ware" storage container—but his
talk was a tad more serious. Application sandboxes are becoming more common
for isolating general-purpose applications from each other. There are a
variety of Linux tools that can be used to create sandboxes, including seccomp,
SELinux, the Java virtual machine, and virtualization. The idea behind
sandboxing is the age-old concept of "defense in depth".
There is another mechanism that can be used to isolate applications:
containers. When most people think of containers, they think of LXC, which
is a command-line tool created by IBM. But, the Linux kernel knows nothing
about containers, per se, and LXC is built atop Linux namespaces.
The secure containers project does not use LXC directly; instead, it
uses libvirt-lxc.
Using namespaces, child processes can have an entirely different view
of the system than does the parent. Namespaces are not all that new;
RHEL5 and Fedora 6 used the pam_namespace module to partition logins into
"secret" vs. "top secret", for example. The SELinux sandbox also used
namespaces and was available in RHEL6 and Fedora 8. More recently,
Fedora 17 uses systemd, which has PrivateTmp and PrivateNetwork
directives for unit files that can be used
that can be used
to give services their own view of /tmp or the network. There are
20-30 services in Fedora 17 that are running with their own /tmp,
Walsh said.
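As a minimal sketch (not from the talk) of how a service's unit file can
request that isolation, assuming a hypothetical daemon at
/usr/sbin/mydaemon:

[Service]
ExecStart=/usr/sbin/mydaemon
PrivateTmp=yes
PrivateNetwork=yes

With those two directives, systemd gives the service a private /tmp and
its own network namespace containing only a loopback device, with no
changes to the daemon itself.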
In addition, Red Hat offers the OpenShift service which allows
anyone to have their own Apache webserver for free on Red Hat servers. It
is meant to remove the management aspect so that developers can concentrate
on developing web applications that can eventually be deployed elsewhere.
Since there are many different Apache instances running on the OpenShift
servers, sandboxing is used to keep them from interfering with each other.
There are several different kinds of namespaces in Linux. The mount
namespace gives processes their own view of the filesystem, while the PID
namespace gives them their own set of process IDs. The IPC and Network
namespaces allow for private views of those resources, and the UTS
namespace allows the processes to have their own host and domain names.
The UID namespace is
another that is not yet available, and one that concerns Walsh because of
its intrusiveness. It would give a private set of UIDs, such that UID 0
inside of the namespace is not the same as root outside.
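The unshare utility from util-linux offers a quick way to experiment with
most of these; here is a hedged sketch that assumes a sufficiently recent
version (PID namespace support needs --fork) and root privileges:

# create new mount, UTS, IPC, network, and PID namespaces
unshare --mount --uts --ipc --net --pid --fork /bin/bash
hostname sandbox1          # invisible to the host
mount -t proc proc /proc   # remount /proc so ps reflects the new PID namespace
ps ax                      # shows only this shell and ps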
The secure Linux containers project uses libvirt-lxc to set up namespaces
that effectively create
containers to hold processes that are isolated from those in other
containers. Libvirt-lxc has a C API, along with bindings for several different
higher-level languages. It can set up a container, with a firewall,
SELinux type enforcement (TE) and multi-category security (MCS), bind
mounts that pass through to the host filesystem, and so on. Once that is
done, it can start an init process (systemd in this case) inside
the container so that it appears to be almost a full Linux system inside the
container. In addition, these containers can be managed using control
groups (cgroups) so that no one container can monopolize resources like
memory or CPU.
But libvirt-lxc has a complex, XML-based API. Walsh wanted something
simpler, so he created libvirt-sandbox with a key-value-based
configuration. He intends to replace the SELinux sandbox using
libvirt-sandbox, but it is not quite ready for that yet.
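For a taste of that complexity, here is a hedged sketch of a minimal
libvirt LXC domain definition; the name, memory size, and directory paths
are illustrative, not taken from the talk:

<domain type='lxc'>
  <name>apache1</name>
  <memory unit='KiB'>524288</memory>
  <os>
    <type>exe</type>
    <init>/usr/lib/systemd/systemd</init>
  </os>
  <devices>
    <!-- pass a private /etc copy through to the container -->
    <filesystem type='mount'>
      <source dir='/var/lib/containers/apache1/etc'/>
      <target dir='/etc'/>
    </filesystem>
    <interface type='network'>
      <source network='default'/>
    </interface>
  </devices>
</domain>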
To make things even easier, Walsh created a Python script that makes it
"dirt simple" for an administrator to build a container or set of
containers. He said that Red Hat is famous for building "cool tools that
no one uses" because they are too complicated, so he set out to make
something very simple to use.
The tool can be used as follows:
virt-sandbox-service create -C -u httpd.service apache1
That call will do multiple things under the covers. It creates a systemd
unit file for the container, which means that standard systemd commands can
be used to manage it. In addition, if someone puts a GUI on systemd
someday, administrators can use that to manage their containers, he said.
It also
creates the filesystems for the container. It does not use a full
chroot(), Walsh said, because he wants to be able to share
/usr between containers. For this use case (an Apache web server
container), he wants the individual containers to pick up any updates that
come from doing a
yum update on the host.
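One hedged way to see that arrangement from a shell inside a running
container (exact mount options, and whether each appears as a separate
entry, will vary):

grep ' /usr ' /proc/mounts   # /usr is passed through from the host
grep ' /etc ' /proc/mounts   # /etc is the container's own private copy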
It also clones the /var and
/etc configuration files into the container's own copies. In a perfect world,
the container would bind mount over /etc, but it can't do that,
partly because /etc has so many needed configuration files
("/etc is a cesspool of garbage" was his colorful way of describing
that). In addition, it allocates a unique SELinux MCS label that restricts the
processes inside the container. "Containers are not for security", he
said, because root inside the container can always escape, so the container
gets wrapped in SELinux to restrict it.
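As a hedged illustration (the SELinux type and the category pair below are
invented for the example), the host can see the per-container MCS label on
the container's processes:

ps -eZ | grep svirt_lxc
system_u:system_r:svirt_lxc_net_t:s0:c57,c686 ... httpd

Because each container gets different categories, a compromised root
process in one container is still blocked by SELinux from touching another
container's files or processes.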
Once the container has been created, it can be started with:
virt-sandbox-service start apache1
Similarly, the
stop command can terminate the container. One can
also use the
connect command to get a shell in the container.
virt-sandbox-service execute -C ifconfig apache1
will run a command in the container. For example, there is no
separate cron running in each of the containers; instead,
execute is used to run things like logrotate from the
host's cron.
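A hedged sketch of what that might look like as a system crontab entry on
the host (the schedule is arbitrary, and the exact argument handling of
execute is assumed):

# rotate logs inside the apache1 container every night at 3am
0 3 * * * root virt-sandbox-service execute -C logrotate apache1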
The systemd unit file that gets created can start and stop multiple
container instances with a single command. Beyond that, using the
ReloadPropagatedFrom directive in the unit file will allow an
update of the host's apache package to restart all of the servers in the
containers. So:
systemctl reload httpd.service
will trigger a reload in all container instances, while:
systemctl start httpd@.service
will start up all such services (which means all of the defined containers).
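What the generated template unit might contain is sketched below; the
description and paths are guesses for illustration, but
ReloadPropagatedFrom is a real systemd directive:

[Unit]
Description=Secure container for %i
ReloadPropagatedFrom=httpd.service

[Service]
ExecStart=/usr/bin/virt-sandbox-service start %i
ExecStop=/usr/bin/virt-sandbox-service stop %i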
This is all recent work, Walsh said. It works "relatively well", but still
needs work. There are other use cases for these containers, beyond just
the OpenShift-like example he used. For instance, the Fedora project
uses Mock to
build packages, and Mock runs as root. That means there are some 3000 Fedora
packagers who could do "bad stuff" on the build systems, so putting Mock
into a secure container would provide better security. Another possibility
would be to run customer processes (e.g. Hadoop) on a GlusterFS node.
Walsh has also containerized MySQL, and more services are possible.
Walsh demonstrated virt-sandbox-service at the end of his talk.
He demonstrated some of the differences
inside and outside of the container, including a surprising answer to
getenforce inside the container. It reports that SELinux is
disabled, but that is a lie, he said, to stop various scripts from trying to do
SELinux things within the container. In addition, he showed that the
eth0 device inside the container did not even appear in the host's
ifconfig output (nor, of course, did the host's wlan0
appear in the container).
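A hedged reconstruction of that part of the demo (output abridged):

getenforce                           # on the host
Enforcing
virt-sandbox-service connect apache1
getenforce                           # now inside the container
Disabled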
A number of steps have been taken to try to prevent root from breaking out
of the container, but there is more to be done. Both mount and
mknod will fail inside the container, for example. These
containers are not as secure as full virtualization, Walsh said, but they are
much easier to manage than handling the multiple full operating systems that
virtualization requires. For many use cases, secure containers may be the
right fit.