
Another runc container breakout


Posted Feb 12, 2024 17:35 UTC (Mon) by ms (subscriber, #41272)
In reply to: Another runc container breakout by Cyberax
Parent article: Another runc container breakout

For someone who's mildly interested and extremely ignorant of this area: what are the more secure solutions? Do you have to go whole hog and use full virtualisation with qemu?



Another runc container breakout

Posted Feb 12, 2024 17:50 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

If you want true privilege separation where violations can cost you a lot of money (banking, HIPAA, etc.) then you go with dedicated hosts. So each workload gets its own physical host, possibly running multiple VMs/containers but at the same privilege level.

If you're slightly less paranoid, then VMs that don't share resources (including CPUs in the same socket) are fine. One more step, and you can use VMs on oversubscribed hardware or VMs with dynamic resources. All these VMs can also run multiple containers, typically for the same application.

I don't think containers are good enough for _any_ privilege separation.

Another runc container breakout

Posted Feb 12, 2024 18:49 UTC (Mon) by raven667 (subscriber, #5198) [Link] (3 responses)

> each workload gets its own physical host, possibly running multiple VMs/containers but at the same privilege level

I would add that while this may sound expensive, if you have more than a very small workload, partitioning the VM clusters by security level doesn't really cost much more in hardware; it's the same amount of resources you would have needed to provision anyway, as long as you don't go crazy defining fine-grained boundaries at the VM level and instead constrain yourself to broad security risk zones based on the sensitivity of the data being processed or the auditability/control of the devices connecting to it.

Another runc container breakout

Posted Feb 12, 2024 20:30 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link] (2 responses)

A lot of this stuff sounds expensive only because we tend to underestimate the cost of a security breach. If you properly account for that cost, making the system more secure by design starts to look more sensible. Until you get to the extreme, security is generally cheaper than insecurity. It's just that the cost of insecurity is sporadic, so bad managers pretend it doesn't exist.

Another runc container breakout

Posted Feb 13, 2024 4:51 UTC (Tue) by marcH (subscriber, #57642) [Link] (1 responses)

> It's just that the cost of insecurity is sporadic, so bad managers pretend it doesn't exist.

Same as planes crashing, or plane doors blowing off, and quality in general: it's all very rare, so surely you can cut corners here and there and save a fair amount of money.

I don't agree that "a security bug is just another bug" in general, but from that pure cost-and-management perspective a security bug is indeed just a quality defect like any other.

That's why managers are paid so much: because _in theory_ they're able to find the best quality trade-offs - and it's hard. In practice, however...

Another runc container breakout

Posted Feb 13, 2024 11:31 UTC (Tue) by Wol (subscriber, #4433) [Link]

> That's why managers are paid so much: because _in theory_ they're able to find the best quality trade-offs - and it's hard. In practice, however...

It also assumes the necessary data exists. My employer collects all the data imaginable, but as far as I can tell, the data I need to do my job doesn't exist ... namely "what are the discrepancies between what we've planned and the resources we've got on the ground".

HOPEFULLY I'm going to succeed in doing something about it, but one lone voice crying "help" while drowning in a data lake can't achieve much...

Cheers,
Wol

Another runc container breakout

Posted Feb 14, 2024 20:22 UTC (Wed) by ringerc (subscriber, #3071) [Link]

They're good for casual isolation of cooperating services. They help manage failure domains.

They also have benefits for deterring casual attackers. "To evade the bear you don't have to outrun the bear, just run faster than the next guy".

When combined with strict seccomp rules, dropping all capabilities, a nonzero UID, and a restrictive AppArmor config, they can be quite strong for suitable workloads. But in reality it's rarely feasible to lock a workload down enough that an attacker couldn't potentially exploit their way out; most things need general-purpose file I/O and/or network I/O after all, and that gives them the ability to touch system interfaces that could be exploitable. So they're not fit for purpose for isolating general-purpose untrusted workloads where there are strong incentives to escape the container and high impacts if an attacker succeeds.
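
(For concreteness, here is a rough Go sketch of what those knobs look like at the OCI runtime level. It is purely illustrative: it prints a fragment of a config.json with a nonzero UID, every capability set emptied, a hypothetical AppArmor profile name, and a deny-by-default seccomp policy. The tiny syscall allow-list is for illustration only and would not run a real workload.)

    package main

    // Illustrative only: emits a fragment of an OCI runtime config.json showing
    // the hardening knobs mentioned above (nonzero uid, all capabilities dropped,
    // an AppArmor profile, and a deny-by-default seccomp allow-list). The profile
    // name is hypothetical and the syscall list is far too small for real work.

    import (
    	"encoding/json"
    	"fmt"
    )

    func main() {
    	cfg := map[string]interface{}{
    		"process": map[string]interface{}{
    			// run as an unprivileged uid/gid inside the container
    			"user": map[string]int{"uid": 1000, "gid": 1000},
    			// drop all capabilities by leaving every capability set empty
    			"capabilities": map[string][]string{
    				"bounding": {}, "effective": {}, "permitted": {},
    				"inheritable": {}, "ambient": {},
    			},
    			"noNewPrivileges": true,
    			"apparmorProfile": "my-restrictive-profile", // hypothetical name
    		},
    		"linux": map[string]interface{}{
    			// strict seccomp: deny everything except an explicit allow-list
    			"seccomp": map[string]interface{}{
    				"defaultAction": "SCMP_ACT_ERRNO",
    				"syscalls": []map[string]interface{}{
    					{
    						"names":  []string{"read", "write", "futex", "exit_group"},
    						"action": "SCMP_ACT_ALLOW",
    					},
    				},
    			},
    		},
    	}
    	out, _ := json.MarshalIndent(cfg, "", "  ")
    	fmt.Println(string(out))
    }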

I have significantly more confidence in lightweight VM/hypervisor models like Firecracker. My work is moving in that direction for some of our shared tenancy stuff.

But I agree you can't beat physical node isolation for workloads where the bill for a breach is in the many, many millions.

Another runc container breakout

Posted Feb 12, 2024 18:17 UTC (Mon) by atnot (subscriber, #124910) [Link]

In general, you still use containers, but use some layer in between to provide each customer with a separate kernel. There's a large variety of ways to do this. On the simpler end, you can just statically provision normal dedicated hardware or VMs per customer that you operate with a shared control plane. This is kind of wasteful because you need to run a whole intermediate OS and size it for worst-case memory use. You can be a bit smarter than this, though: there are a number of microVM-based solutions like Kata Containers that run each container as a lightweight KVM-based virtual machine instead, with impressively small overhead. But if even that is too much for you, you can also use something like gVisor, which implements the kernel interface entirely in userspace. Kind of like Wine, but for running Linux applications on Linux.
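
(A hedged sketch of what that looks like from the Kubernetes side, assuming the k8s.io/api and k8s.io/apimachinery modules: a RuntimeClass pointing at a hypothetical "kata" CRI handler, and a pod that opts into it so it gets its own guest kernel. The handler and image names are placeholders, not anything from the article.)

    package main

    // Sketch only: builds a RuntimeClass for a hypothetical "kata" handler and a
    // pod that selects it via runtimeClassName, then prints both as JSON. The
    // handler name must match whatever the node's container runtime is configured
    // with (e.g. Kata Containers or gVisor's runsc).

    import (
    	"encoding/json"
    	"fmt"

    	corev1 "k8s.io/api/core/v1"
    	nodev1 "k8s.io/api/node/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
    	rc := nodev1.RuntimeClass{
    		TypeMeta:   metav1.TypeMeta{APIVersion: "node.k8s.io/v1", Kind: "RuntimeClass"},
    		ObjectMeta: metav1.ObjectMeta{Name: "kata"},
    		Handler:    "kata", // hypothetical CRI handler name
    	}

    	runtimeClass := "kata"
    	pod := corev1.Pod{
    		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
    		ObjectMeta: metav1.ObjectMeta{Name: "untrusted-workload"},
    		Spec: corev1.PodSpec{
    			// every pod using this RuntimeClass runs under its own guest kernel
    			RuntimeClassName: &runtimeClass,
    			Containers: []corev1.Container{{
    				Name:  "app",
    				Image: "registry.example.com/app:latest", // placeholder image
    			}},
    		},
    	}

    	for _, obj := range []interface{}{rc, pod} {
    		out, _ := json.MarshalIndent(obj, "", "  ")
    		fmt.Println(string(out))
    	}
    }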

There's a huge variety of solutions here, but what they have in common is that nobody trusts multiple customers with one kernel.

Another runc container breakout

Posted Feb 12, 2024 19:47 UTC (Mon) by geofft (subscriber, #59789) [Link] (1 responses)

There are a couple of options. At my day job we've been running a traditional UNIX environment for about 20 years for our HPC use case, and so we have relied on the normal UNIX privilege separation mechanism, namely UIDs. We mostly kept this when we started moving to containers: user-provided code runs only as that user even inside the container. Users have historically had non-root shell access to our systems via plain SSH, and so as long as they remain non-root, a containerization breakout like this doesn't impact our security model. (Of course, keeping a Linux environment secure against local privilege escalation is challenging on its own, and if our untrusted users were anonymous people on the internet instead of employees who have gone through background checks, we would likely be more paranoid.)

We benefited from the fact that our workloads were already designed for non-containerized environments (and so we already had non-root ways of installing software etc.) and we could just containerize that model. If you're running container-native workloads that expect to have root, or even multi-tenant workloads that all use the same non-root UID (e.g. everyone runs as UID 1000), user namespaces are a great option: the code in the container feels like it's running as whatever UID, and if it's root it can do things like mount a tmpfs or reconfigure its own networking. But from outside the container, these are separate, unprivileged UIDs. So the attack in the article about reading /etc/passwd is still possible, but a container running as "root" cannot write to it. Kubernetes has user namespace support in alpha.
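
(A minimal, Linux-only Go sketch of that user-namespace behaviour, assuming the kernel allows unprivileged user namespaces: the child sees itself as uid 0, but that "root" is just the ordinary unprivileged uid that launched it. Real runtimes map a whole UID range via newuidmap; this toy maps a single uid so it can run without privileges.)

    package main

    // Runs "id" in a new user namespace. Inside, the process reports uid 0;
    // outside, it is still the unprivileged uid of whoever ran this program.

    import (
    	"os"
    	"os/exec"
    	"syscall"
    )

    func main() {
    	cmd := exec.Command("id") // prints uid 0 ("root") inside the namespace
    	cmd.Stdout = os.Stdout
    	cmd.Stderr = os.Stderr
    	cmd.SysProcAttr = &syscall.SysProcAttr{
    		Cloneflags: syscall.CLONE_NEWUSER,
    		UidMappings: []syscall.SysProcIDMap{
    			// uid 0 inside the namespace == our own unprivileged uid outside
    			{ContainerID: 0, HostID: os.Getuid(), Size: 1},
    		},
    		GidMappings: []syscall.SysProcIDMap{
    			{ContainerID: 0, HostID: os.Getgid(), Size: 1},
    		},
    	}
    	if err := cmd.Run(); err != nil {
    		panic(err)
    	}
    }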

And then yes, you can use virtualization (of any level of hog). If you're on a public cloud platform and you're on VMs anyway, setting up your VMs to be per workload is pretty easy. There's also stuff like Kata Containers or gVisor (commercially offered by Google as GKE Sandbox) for having a "container" runtime that actually uses hardware virtualization.

Another runc container breakout

Posted Feb 17, 2024 15:12 UTC (Sat) by raven667 (subscriber, #5198) [Link]

> Of course, keeping a Linux environment secure against local privilege escalation is challenging on its own, and if our untrusted users were anonymous people on the internet instead of employees who have gone through background checks, we would likely be more paranoid

This is something I bang on about all the time: one of the corollaries is that risk comes from a lack of accountability; in places where you have accountability there is less likelihood of issues, and security processes should be driven more by impact. Making a reasonable and effective risk calculation for where to spend your limited time instituting security is hard to do; it's mostly vibes and tribal embarrassment that drive people to apply all the "best practices" they've heard of, without regard to which ones actually help in their situation and which are wastes of time.

Another runc container breakout

Posted Feb 15, 2024 9:14 UTC (Thu) by cortana (subscriber, #24596) [Link] (1 responses)

I think about it like this: containers are a technology for distributing software artefacts, packaged up and ready to run on a host. Think shipping containers. They aren't designed to confine an executing workload within its container's namespaces; you still need other technologies, which can be used in conjunction with containers, for that.

So, for instance, looking at a platform like Red Hat OpenShift, which runs on top of RHEL CoreOS, a typical container:

* Must run with a UID, GID and supplemental groups from the range assigned to its project
* Must run within a restrictive domain (container_t) and a unique MCS category (e.g., c5,c67)
* Must run with the runtime/default seccomp profile applied

So for one workload to read the files of another, it would have to:

* Use an exploit to escape out of its namespaces
* Use an exploit to override DAC (i.e., read a file owned by another user; made more difficult by the seccomp profile which prevents the use of less common system calls)
* Use an exploit to override the SELinux policy which prevents, for instance, a process running as container_t:c5,c67 from accessing objects labelled with container_file_t:c78,c200

It's interesting you mention qemu, because the idea of using a restrictive label & MCS category to protect container workloads from one another came from the very same setup that RHEL uses to protect VMs from one another: the container_t label is (or was at some point) an alias to svirt_lxc_net_t, the label that libvirt applies to all the qemu processes that it launches (again, with each one being given a unique pair of MCS categories so that VMs can't attack each other).
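
(A toy Go sketch of the MCS idea, purely illustrative: the real check is made by the SELinux policy in the kernel, but the effect is roughly that the process label must carry every category the file label carries, on top of the type-enforcement rules.)

    package main

    // Parses labels of the form "system_u:system_r:container_t:s0:c5,c67" and
    // compares category sets, mimicking the MCS constraint described above.

    import (
    	"fmt"
    	"strings"
    )

    // categories extracts the MCS categories ("c5", "c67", ...) from a label.
    func categories(label string) map[string]bool {
    	cats := map[string]bool{}
    	parts := strings.Split(label, ":")
    	if len(parts) < 5 {
    		return cats // no categories present
    	}
    	for _, c := range strings.Split(parts[4], ",") {
    		cats[c] = true
    	}
    	return cats
    }

    // dominates reports whether the process's category set includes every
    // category on the object, the extra condition MCS adds to type enforcement.
    func dominates(process, object string) bool {
    	pc, oc := categories(process), categories(object)
    	for c := range oc {
    		if !pc[c] {
    			return false
    		}
    	}
    	return true
    }

    func main() {
    	proc := "system_u:system_r:container_t:s0:c5,c67"
    	own := "system_u:object_r:container_file_t:s0:c5,c67"
    	other := "system_u:object_r:container_file_t:s0:c78,c200"
    	fmt.Println(dominates(proc, own))   // true: same categories, access allowed
    	fmt.Println(dominates(proc, other)) // false: different categories, denied
    }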

Another runc container breakout

Posted Feb 17, 2024 15:17 UTC (Sat) by raven667 (subscriber, #5198) [Link]

Thanks for the summary of how this works; those all look like good things to do. However, there are an awful lot of local kernel exploits in Linux that can make any permission system moot, so that's where having different hardware for different security/risk workloads can help: a remote exploit is much harder to do and easier to defend against or detect.

