Measuring container security
There are a lot of claims about the relative security of containers versus virtual machines (VMs), but there has been little effort to actually measure those differences. James Bottomley gave a talk in the refereed track of the 2018 Linux Plumbers Conference (LPC) describing work aimed at filling that gap. He and his colleagues have come up with a measure that, while not perfect, gives a starting point for further efforts.
Bottomley introduced himself as a "container evangelist" for IBM. He used to help businesses become part of the open-source community; doing that work at Parallels is how he got involved with containers. He is also a kernel developer and maintainer.
Containers and hypervisors
He began with some "container basics". The difference between containers and hypervisors, which run VMs, is where the interface is located. Hypervisors are based on emulating hardware, he said; they can bring up anything that expects to talk directly to the hardware (e.g. a Linux kernel). Containers are about virtualizing the subsystems of the operating system (OS); the interface is the system-call interface provided by the OS.
![James Bottomley](https://static.lwn.net/images/2018/lpc-bottomley-sm.jpg)
If you look at the "container revolution" today, "everything is Linux". There is really no other kind of container out there, he said. That is because of the "hardness guarantees" of the Linux system-call interface and its stability over time. There has been no container system for Windows because its system-call interface changes with every release and the line between user space and the kernel is blurry—and changeable.
Containers all run on a single kernel, while hypervisors run a host kernel and a guest kernel per VM. There is a benefit to that for containers because one kernel means that there is only one resource manager for the whole system. Resource sharing between containers provides agility; there is instant scaling up or down without needing to "inflate balloons" or do any of the other tricks that are needed for hypervisors. Resource decisions can be made more efficiently for containers since they aren't done across a virtual hardware interface.
Docker is rather famous in the container world, but containers are not really what led to that fame. Docker provided a way to box up an application and all of its dependencies so that what got tested on a developer's laptop was exactly the same as what runs on the host. Docker is really nothing more than "an application packaging and transport system" that was enabled by containers, Bottomley said. One of the "dirty secrets of hypervisors" is that you can't build and run a VM on your laptop and then deploy it to, say, the Amazon cloud; there is a mismatch in drivers that will prevent that from working.
Security risks
The crux of his talk is that there is a great benefit to the sharing that containers allow, but sharing increases the security risk. Hypervisor advocates seek to exploit this by noting that hypervisors use a small interface, while "containers are shit" because they can use any of the 300 Linux system calls. Those system calls are the most easily exploited part of Linux, so container security is clearly worse, they would argue. In truth, Bottomley said, hypervisor advocates have a bit of container envy because of all the excitement around containers. They want to "Make Hypervisors Great Again", he said, showing an image of a red baseball cap with that slogan. Luckily, most of what the hypervisor advocates are saying is "fake news", he said with a grin. Bottomley's web-based slides are available; use the arrow keys to navigate.
The real problem with all of this advocacy around security is a lack of facts. He admitted that he had not provided any real facts about containers either; he had just "waved my hands around". There is no "intellectual rigor" in any of these debates. Part of what he and his colleagues are trying to do is to put together some way to measure relative security in order to provide some rigor.
Containers on Linux all use control groups (cgroups) and namespaces. LXC, Docker, Mesos, and others all build on the same interfaces. Hypervisors all use different interfaces. The container API was agreed upon at the Kernel Summit in 2011. The various players converged on a unified upstream API to avoid a repeat of the split between Xen and KVM.
The cgroups are available for resource control; there are many of them, each controlling a different resource. Those cgroups can be turned on or off for any given container; this makes the definition of a container less hard and fast than that of a hypervisor, which has a fixed interface based on the virtual hardware. There are also lots of namespaces for virtualizing various kernel resources, which can be used or not by different container solutions. User namespaces, in particular, are a tool that can help secure containers.
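As a concrete illustration of that shared API (a minimal sketch in Python, not anything from the talk), the namespace side boils down to a single unshare(2) call; the clone-flag values below are taken from &lt;linux/sched.h&gt;, and running it unprivileged assumes a kernel that allows unprivileged user namespaces:

```python
# Minimal sketch (not from the talk): the namespace primitives that LXC,
# Docker, and the other runtimes build on.  Linux-only; unprivileged use
# needs unprivileged user namespaces enabled on the kernel.
import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# Flag values from <linux/sched.h>
CLONE_NEWNS   = 0x00020000   # mount namespace
CLONE_NEWUSER = 0x10000000   # user namespace
CLONE_NEWPID  = 0x20000000   # PID namespace (takes effect for children)
CLONE_NEWNET  = 0x40000000   # network namespace

flags = CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET
if libc.unshare(flags) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))

# The next child forked becomes PID 1 of the new PID namespace and sees
# its own mount and network namespaces.
pid = os.fork()
if pid == 0:
    os.execvp("sh", ["sh"])
os.waitpid(pid, 0)
```

Cgroup limits are applied separately, by writing the child's PID into a controller directory under /sys/fs/cgroup, which is why a runtime can pick and choose which controllers to enable for a given container.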
Docker advocates will say that the Linux container API is "almost impossible to use", but that's not true. It is a myth that orchestration people push, Bottomley said. Docker is just the start of what containers can do and it is the source of a lot of the security problems because most Docker containers do not use a user namespace. That means that "real root" is used inside the container, which is a real security problem, as hypervisor advocates are eager to bring up. One of the discussions in the containers microconference the previous day was about how to get Docker to incorporate user namespaces into its containers, he said.
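The fix user namespaces provide is to map UID 0 inside the container to an unprivileged UID outside; a rough sketch of the mechanism (illustrative values, not Docker's actual code; Docker's userns-remap option arranges something similar via /etc/subuid ranges) might look like:

```python
# Sketch only (illustrative UID value): map root inside a new user
# namespace to an unprivileged UID on the host, so container "root" has
# no special power over the rest of the system.  child_pid must already
# be in a new user namespace (e.g. via the unshare() call above), and
# mapping to an arbitrary host UID needs privilege in the parent namespace.
def map_container_root(child_pid: int, host_uid: int = 100000) -> None:
    # setgroups must be denied before gid_map may be written by an
    # unprivileged writer.
    with open(f"/proc/{child_pid}/setgroups", "w") as f:
        f.write("deny")
    with open(f"/proc/{child_pid}/gid_map", "w") as f:
        f.write(f"0 {host_uid} 1")    # GID 0 inside -> host_uid outside
    with open(f"/proc/{child_pid}/uid_map", "w") as f:
        f.write(f"0 {host_uid} 1")    # UID 0 inside -> host_uid outside
```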
Attack profiles
In order to counter the hype, there is a need to find a measure to define "security" or "containment". Even when security experts talk about "security" they often cannot agree because there are no numerical measurements that can be compared dispassionately. He asked his research group to help him figure out some kind of numerical measure that could be used. What they came up with is something called the "attack profile"; it is a good first approximation of a measure for security.
An internet-exposed application takes input from the net, makes calls down the stack until it reaches the hardware, then traverses back up to give a response. As it traverses the stack, it could encounter bugs that would allow attackers to exploit the system. So the vertical attack profile is the number of lines of code encountered as the application performs its normal function, multiplied by the bug density. But the bug density is a constant for the kernel as a whole, which is the shared piece, so it can be thrown away, leaving just the lines of code as a rough measure of the attack profile.
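Written as a formula (my notation, not the slides'):

```latex
\text{vertical attack profile}
  \;=\; L_{\text{traversed}} \times \rho_{\text{bug}}
  \;\propto\; L_{\text{traversed}}
\qquad (\rho_{\text{bug}} \text{ roughly constant for the shared kernel code})
```

where L_traversed is the number of lines of code the application's normal operation passes through and rho_bug is the bug density.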
Where the discussion gets interesting is that some of the attack profile belongs to the tenant and some to the hosting provider. In a hypervisor system, the provider is only responsible for a relatively thin layer that provides the hardware emulation, so the tenant is responsible for everything else. In a container system, the tenant's responsibility stops at the system-call interface; the tenant is responsible for much less of the vertical attack profile.
Hosting providers would generally prefer their tenants to use VMs, rather than containers, because it makes the provider responsible for less code, Bottomley said. While providers might claim that hypervisors make their tenants more secure, they really are simply trying to shift responsibility for the kernel to the tenant.
The horizontal attack profile looks at the whole system and all of its containers or VMs; it gives a measure of the overall chance of being attacked by the exposed shared code. The kernel has a large horizontal attack profile as it exposes a huge number of interfaces.
This large horizontal attack profile is also why there are almost no providers of bare-metal containers; they nearly all run containers on top of a hypervisor. IBM's Bluemix cloud did provide bare-metal containers because IBM wanted to show that, even though the hypervisor fans claim containers are hundreds of times less secure than hypervisors, it simply isn't true, Bottomley said. If it were, the Bluemix cloud should have seen many more exploits than the hypervisor-based container providers, but it didn't. "There must be a reason for this", which is what led to the research and the idea of using attack profiles as a measure of security.
An exploit in the horizontal attack profile is the worst kind for a hosting provider. It means that a container can attack other containers via the system-call interface. That profile is roughly the number of lines of code in the kernel multiplied by its bug density.
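In the same (assumed) notation, the two profiles differ only in which code gets counted:

```latex
\text{HAP} \;\approx\; L_{\text{shared kernel}} \times \rho_{\text{bug}},
\qquad
\text{VAP} \;\approx\; L_{\text{tenant stack}} \times \rho_{\text{bug}}
```

The horizontal profile counts the kernel code exposed to every tenant through the system-call interface, while the vertical profile counts the code that a single tenant's requests traverse.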
He and his team decided to measure the horizontal profile of Docker versus that of a Kata container running on KVM. They used ftrace to determine which kernel functions were being called from the Kata container through KVM to the host, versus the functions being called by a Docker container on the host. He put up a graph showing the number of unique kernel functions accessed by each for three separate tests (Node.js, Redis, and Python Tornado). While Docker is worse than Kata for all three, it is not many times worse: it is not a factor of 100, more like 10-30%, he said.
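A rough sketch of that kind of measurement (assumed paths and parsing, not the team's actual tooling) would use the ftrace function tracer and count the unique function names that show up while the workload runs:

```python
# Sketch (not the team's tooling): count the unique kernel functions hit
# while tracing one process.  Needs root, ftrace support, and debugfs
# mounted; the trace-line parsing is approximate, and children are only
# followed if the "function-fork" trace option is enabled.
import time

TRACING = "/sys/kernel/debug/tracing"

def write_tracefs(name: str, value: str) -> None:
    with open(f"{TRACING}/{name}", "w") as f:
        f.write(value)

def unique_kernel_functions(pid: int, seconds: float = 30.0) -> set[str]:
    write_tracefs("current_tracer", "function")
    write_tracefs("set_ftrace_pid", str(pid))   # restrict tracing to the workload
    write_tracefs("tracing_on", "1")
    time.sleep(seconds)                          # let the benchmark run
    write_tracefs("tracing_on", "0")

    funcs = set()
    with open(f"{TRACING}/trace") as f:
        for line in f:
            if line.startswith("#"):
                continue
            # Default format: task-pid [cpu] flags timestamp: function <-caller
            parts = line.split()
            for i, token in enumerate(parts):
                if token.endswith(":") and i + 1 < len(parts):
                    funcs.add(parts[i + 1])
                    break
    return funcs
```

Comparing len(unique_kernel_functions(...)) for the Docker case against the qemu/KVM process backing a Kata container gives numbers of the sort shown in the graph.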
The tests that were run used well-behaved applications, whereas a more comprehensive test would perhaps use a fuzzer to better simulate misbehaving attackers. But if the system calls invoked by an application can be identified, a reasonable seccomp filter could be applied to the Docker container so that it could only use those calls. He is not promising that Docker can be made completely secure, but he is promising that there is a way to secure Docker such that its risk matches that of a hypervisor within a few tens of percent. Putting a good seccomp filter in place for Docker is definitely difficult ("horrible"), but it can be done, he said.
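One way to act on that (a sketch of the workflow, not something presented in the talk) is to turn the observed system-call list into an allow-list seccomp profile in the JSON format Docker accepts:

```python
# Sketch: build an allow-list seccomp profile for Docker from the system
# calls observed during a well-behaved run (collected with strace, perf
# trace, or similar).  The JSON layout follows Docker's seccomp profile
# format; everything not listed is refused with an error.
import json

def make_seccomp_profile(allowed_syscalls: set[str]) -> str:
    profile = {
        "defaultAction": "SCMP_ACT_ERRNO",        # deny anything not listed
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{
            "names": sorted(allowed_syscalls),
            "action": "SCMP_ACT_ALLOW",
        }],
    }
    return json.dumps(profile, indent=2)

# Written to a file, the profile is applied with:
#   docker run --security-opt seccomp=profile.json ...
```

Presumably the "horrible" part is getting the list right: missing a call that the C library or the runtime needs only occasionally leads to failures that are hard to track down.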
Now that there is a measure, it can be minimized. Experiments can be run with different container descriptions; those descriptions can be crafted to reduce the horizontal attack profile. This is what projects like Google's gVisor and IBM's Nabla containers are trying to do, he said. With Nabla containers, he set out to be able to claim that they are more secure than hypervisors based on the horizontal attack profile measure because he knew that would get press attention.
But he also knows that measuring the horizontal attack profile is not the end point. "I think this is the beginning of the conversation of how we measure security." He has a "funny feeling" there are much better ways to measure the security of these systems; "now we need to find out what they are".
There are some problems with using the horizontal attack profile as a measure. The kernel contains many bugs, but not all of them are exploitable; there is a need to incorporate some measure of "exploitability" into the equation. He suspects that the interface description plays a larger role than the number of lines of code: system calls that have more sweeping effects are more likely to harbor exploitable bugs. But that is all speculation at this point.
Bottomley's best guess is that the next generation of security measurements will focus more on the interfaces. Some interfaces are "inherently more exploitable" than others; there are kernel system calls that are often exploited and some that no one has ever found a way to exploit. Somehow that needs to be taken into account. Perhaps having a way to monitor calls to the dangerous system calls "with eBPF or something", and disallowing any of the insecure uses, might be a way to help secure the kernel interfaces. That would be useful "independently of whether we use it for containers or not".
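As an illustration of the "eBPF or something" idea (my sketch, not Bottomley's; the choice of calls to watch is purely illustrative), a counter for selected system calls using the BCC Python bindings might look like this:

```python
# Illustration only: count invocations of a few arbitrarily chosen
# "dangerous" system calls with the BCC Python bindings.  Syscall numbers
# are for x86_64; a real tool would presumably flag or deny the calls
# rather than merely count them.
import time
from bcc import BPF

PROGRAM = r"""
BPF_HASH(counts, u64, u64);

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u64 id = args->id;
    counts.increment(id);
    return 0;
}
"""

WATCHED = {101: "ptrace", 165: "mount", 310: "process_vm_readv"}

b = BPF(text=PROGRAM)
time.sleep(30)                          # observe the workload for a while
for key, value in b["counts"].items():
    name = WATCHED.get(key.value)
    if name:
        print(f"{name}: called {value.value} times")
```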
Other kinds of containers
Sandboxing is a way to run a container that does not expose it to the full Linux interface. Instead, some of the system calls are emulated inside the sandbox. That starts to sound "a bit like a mini-hypervisor running and calling itself a container", he said. But a sandbox is a way to get security containment without doing full hypervisor set up and image building.
So, for example, Nabla containers do not use namespaces, just cgroups; namespaces are not needed because of the system-call emulation. In reality, though, Nabla containers do use a network namespace because that is required by Kubernetes. But emulation in the sandbox means that the code is not shared between containers, thus reducing the horizontal attack profile, at least in theory. It does push the responsibility for that emulation layer onto the tenant, however, because it is not shared. With Nabla containers, IBM has found a way to push the emulation layer back into the remit of the provider, though Bottomley said he wouldn't go into details because describing it would take a whole other 45-minute talk.
The difficulty is to get the sandboxing right without pushing the whole thing out to the tenant as hypervisors do. A benefit of the cloud should be that it shifts the burden of infrastructure security away from the tenant. Providers have response teams to handle various infrastructure problems, which should relieve the tenant from having to handle them. Most hosting providers are trying to make as much as possible the responsibility of the tenant, but that runs counter to the advantages of the cloud. In the end, though, "it doesn't mean that your crap application is any less crap", he said; tenants do need to focus on securing their applications, but they shouldn't need to do more than that.
Two of the better known sandboxing systems are gVisor and Nabla, he said. GVisor has rewritten many system calls in Go, while Nabla uses unikernel techniques to create a library operating system with a container profile. Nabla is meant to create containers that are more secure than hypervisors, at least based on the horizontal attack profile measure that his team developed.
He put up a graph (using the same data as in a graph in this blog post) showing the number of unique kernel functions accessed for the same three tests. One thing to note about the results, he said, is that gVisor came in worse than most of the other solutions; that is because it uses Go for the system-call emulation and the Go runtime is "incredibly profligate in terms of system calls". For the horizontal attack profile measure, that makes it appear to be far less secure, though that may not actually tell the full story. As might be guessed, the graph shows that Nabla fares better than any of the others: Docker, Kata, gVisor, and gVisor on KVM. Nabla "is not significantly better than hypervisors", but it is better on all of the tests; on the Node.js test Nabla uses less than half of the kernel functions that Kata does. He is hoping this "puts to rest" the argument about whether hypervisors or containers are more secure; "the answer is neither, unless you build them correctly".
He also noted that the analysis of these tests found an anomaly in the way that Kata containers accessed the Linux kernel; he asked an intern to see if it could be exploited. He wanted to see if there was a way to run code in a Kata container that could oops the host kernel. The intern was successful in doing that and sent him a video of the exploit just that morning; it was written up on the Nabla blog after the conference. The exploit is not particularly serious; it exploits the Plan 9 filesystem interface that Kata uses to transport filesystem operations from the guest to the host. But it does show that "you have to be incredibly careful when you set up interface descriptions", even for VMs; without the right interfaces, "you may not be getting all of the benefits of a virtual machine", Bottomley said.
Next up
He then turned to some ideas for the future. Separation or segmentation of system calls within the kernel might be a useful thing to look at. If, for example, system calls could each be run in their own tenant-owned address space, they could no longer interfere with each other, which might be helpful. Perhaps running parts of the kernel in user context would be another way to segment system calls, though "Linus [Torvalds] is going to scream" when people start experimenting with that. Another possibility is to use supervisors, such as Linux Security Modules (LSMs), to correct problems in the system-call interface. We are now starting to measure where those problem areas are, he said, so we may be able to craft an LSM that protects the system calls better than the existing LSMs, which are all based on "security, handwavy stuff of 'we think these interfaces are insecure'".
There is also a need to look at the vertical attack profile. Hosting providers should be trying to reduce that for tenants, not by reducing the stack, but by taking responsibility for more of it. There is a need to look at ways of moving up the stack and taking that responsibility, he said.
In conclusion, there is "lots of exciting progress to be made". The field of measuring container and hypervisor security is in its infancy; the current measure is crude. There is a need for second and third generation measures that go much further. There is, Bottomley said, a need for much more research into all of this.
A YouTube video of the talk is available.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Vancouver for LPC.]
| Index entries for this article | |
|---|---|
| Security | Containers |
| Conference | Linux Plumbers Conference/2018 |
Posted Dec 11, 2018 22:23 UTC (Tue) by dowdle (subscriber, #659):

LISA18 - Container Security
https://www.youtube.com/watch?v=hH9kzvDE_gM
Posted Dec 12, 2018 7:21 UTC (Wed) by joncb (guest, #128491):
Posted Dec 12, 2018 10:07 UTC (Wed) by farnz (subscriber, #17727):
They exist, but they work by exposing a Linux userspace ABI to the contained process, rather than a Windows ABI. Hence the only thing you can do is Docker on Windows, as that exposes a Linux ABI to the contained process.
Posted Dec 12, 2018 10:53 UTC (Wed) by dowdle (subscriber, #659):
Virtuozzo also did Windows Containers on Windows for many years although they had to figure out the Windows kernel to do it... and I'm not sure they have kept up with newer versions of Windows.
Posted Dec 16, 2018 1:58 UTC (Sun) by ThinkRob (guest, #64513):
I know you're being tongue-in-cheek, but as an aside: this idea may be due for retirement. Microsoft has made some serious progress on securing Windows in the years since Blaster et al. roamed the earth. They still get hit with a bunch of stuff (due to their prominence), but it's no longer like the bad old days where half-baked ideas like ActiveX opened holes in the OS a mile wide. AFAIK their NX and ASLR support are state-of-the-art, and I know they've put a lot of work into static analysis etc. in an attempt to cut down on security bug counts. Plus, UAC is a pretty solid idea (initial UI disaster aside)...
Posted Dec 12, 2018 8:19 UTC (Wed) by smurf (subscriber, #17840):
I do wonder why UML does not seem to be used by anybody. Too slow? Too boring?
Posted Dec 12, 2018 8:41 UTC (Wed) by k8to (guest, #15413):
UML made some kinds of debugging much easier, but it has to be more reliable to be worth it.
Posted Dec 13, 2018 2:08 UTC (Thu) by bergwolf (guest, #55931):
The tests only showed the kernel functions accessed by specific workloads. I'm still missing the attack surface Nabla exposes. Is it something very strictly limited? If so, how many syscalls?
Posted Dec 13, 2018 12:03 UTC (Thu) by mato (guest, #964):
Nabla uses a modified version of Solo5 [1] for its low-level sandbox, using seccomp for the sandboxing instead of hardware virtualization. Through the use of unikernel (to be precise, library operating system) techniques, you can essentially run a POSIX-like environment in the "guest" with just 8 system calls. See our paper [2] for the technical details.
Disclaimer: I'm a co-author of Solo5, also, I do not work for IBM.
[1] https://github.com/Solo5/solo5
[2] https://dl.acm.org/citation.cfm?id=3267845
Posted Dec 18, 2018 17:29 UTC (Tue) by iwan (subscriber, #108557):

I was just going to post a link to your paper. I found it really interesting!
Posted Dec 13, 2018 13:43 UTC (Thu) by walters (subscriber, #7396):
Posted Dec 14, 2018 20:54 UTC (Fri) by rweikusat2 (subscriber, #117920):
Or putting it in other words: There is no "intellectual rigor" in any of these debates.
I liked the open admission better.
Posted Dec 15, 2018 9:29 UTC (Sat) by zdzichu (guest, #17118):
Posted Dec 20, 2018 12:07 UTC (Thu) by Wol (subscriber, #4433):
When you see other people touting rigged figures in favour of *their* product, what's wrong with doing the same thing in favour of your own? SO LONG as you're honest about it, which afaict James was.
(You may remember me sounding off about relational, which I feel has been pretty much rigged in exactly the same way ... :-)
And as he says, maybe we'll now get a bit of impartial debate.
Cheers,
Wol
Posted Dec 24, 2018 15:04 UTC (Mon) by amarao (guest, #87073):
Tenant isolation is the keystone issue for hosters.

First is resource isolation. VMs have four visible resources: CPU, net, memory, and block I/O, and three hidden ones: host services memory, CPU overhead, and cache depletion. It is really easy to have two heavyweight VMs running on the same host without affecting each other. With a little tinkering I can push it to a real-time level of isolation (PCI passthrough, CPU pinning, etc.).

For containers it's not possible. There are a few dozen shared kernel resources, some of which aren't even counted against container cgroup limits. Any attempt to put limits on them is a double problem: first, the vast number of those resources; second, the way those limits are translated to customers. No one wants a list of 4k+ arbitrary restrictions.

Second is crash isolation. It is somewhat related to 'security', but not exactly. The kernel can sometimes put processes into the TASK_UNINTERRUPTIBLE state. Any ideas how to terminate a container with such nastiness in it? Hint: 'virsh destroy' for VMs. There are many such corner cases in the kernel, where bugs in one workload can affect another. If it happens in a VM environment, all a rogue guest can do is hog resources up to its quota (see above on resources). In a container, a crazy guest can badly, and undebuggably (for the container-level operator), affect innocent (other) containers.

Up to a point I thought that VMs were the way: run a VM for isolation purposes, put containers inside for management purposes.

Unfortunately, Spectre has been casting a shadow long enough to kill all prospects for VMs for isolation purposes. Bare metal for isolation, containers for management.

We should thank VMs and clouds for helping to build an infrastructure of image management and deployment patterns that can be adopted for bare-metal deployment, but they (VMs) are now limited to staging or development purposes. One should not trust isolation to either containers or VMs.