LWN: Comments on "Making containers safer"

Making containers safer

Margaret48 — Fri, 30 Aug 2019 10:26:46 +0000

Security-focused distros patch userns to be restricted to root by default, which blocks unprivileged usage. This is what Debian, Linux-hardened, and Grsecurity do. Disabling userns is also an official KSPP recommendation.

It's also worth noting that granting a user membership in the lxd group is equivalent to granting root[1], same as for docker. That means the "unprivileged" term is meaningless.

Systemd maintainers rejected userns support for systemd-nspawn, saying that they always rely on some privileged process running behind the curtain.

[1] https://bugs.launchpad.net/ubuntu/+source/lxd/+bug/1829071

Making containers safer

mathstuf — Mon, 26 Aug 2019 20:02:43 +0000

CI builds usually occur within containers. How is one supposed to CI their `Dockerfile` builds inside of such an environment if nesting isn't supported?

> With nested containers would it make sense to have one "host" container that then creates nested containers for apps x and y?

It means that my application can work from within a container no differently than outside a container.

Making containers safer

gradey — Mon, 26 Aug 2019 16:20:39 +0000

What's a use case for nested containers? A traditional docker setup has you create a container per application. So if I have app x and y, then docker creates two containers. With nested containers would it make sense to have one "host" container that then creates nested containers for apps x and y? What's advantageous about this?

Making containers safer

jgg — Sun, 25 Aug 2019 20:27:36 +0000

I've said before that HPC people should not call their use of mount namespaces to package software 'containers' - it just confuses people. It is much closer to Flatpak and Snap, IMHO.

podman completely overlooked in the discussion

dowdle — Fri, 23 Aug 2019 10:36:41 +0000

On YouTube there are several video recordings of presentations done by Red Hat's Dan Walsh... several specifically about the various security mechanisms that are available and how they are being utilized with podman. Perhaps podman was overlooked intentionally because it was considered to be primarily targeting application containers... but in later releases of podman, system containers are also doable.

podman is also striving to make unprivileged containers the norm and has put a lot of work into that by utilizing user namespaces, among other things. They also have the issue with filesystem ownership on the host vs in the container and are trying to find the best way to solve it. I hadn't heard of shiftfs before, but perhaps the two projects (LXD and podman) could work on this problem together?

Making containers safer

Freeaqingme — Fri, 23 Aug 2019 07:54:15 +0000

> LSM support is also essential for privileged containers, he said. Access to various files in procfs and sysfs must be blocked or the container can be compromised. The LSMs most frequently used by container managers are SELinux and AppArmor, but other "minor" LSMs (which can stack) are also added into the mix sometimes.

I'd like to counter this. Denying access to 'various files in procfs and sysfs' is too fragile. All it takes is a new file to be added in a new kernel release for this to cause a security issue. A container should be fully usable without having a procfs or sysfs mounted at all.

If that's not possible for some reason, access to various files in procfs/sysfs should be based on a whitelist, not a blacklist.
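
To make the difference concrete, here is a minimal sketch of the two policies. The path lists are hypothetical and chosen only for illustration; they are not taken from any real AppArmor or SELinux profile.

```python
from fnmatch import fnmatch

# Hypothetical policy lists, for illustration only.
DENYLIST = ["/proc/sysrq-trigger", "/proc/kcore", "/sys/firmware/*"]
ALLOWLIST = ["/proc/self/*", "/proc/cpuinfo", "/proc/meminfo"]


def denylist_allows(path, denied=DENYLIST):
    """Blacklist policy: permit everything except paths matching a denied pattern."""
    return not any(fnmatch(path, pattern) for pattern in denied)


def allowlist_allows(path, allowed=ALLOWLIST):
    """Whitelist policy: permit only paths matching an allowed pattern."""
    return any(fnmatch(path, pattern) for pattern in allowed)


# Hypothetical file that a later kernel release might add.
new_file = "/proc/some_new_interface"
print(denylist_allows(new_file))   # the blacklist lets the unknown file through
print(allowlist_allows(new_file))  # the whitelist rejects it
```

With the blacklist, the policy has to be updated every time the kernel grows a sensitive file; with the whitelist, a new file stays unreachable until someone decides to expose it.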

Making containers safer

stgraber — Thu, 22 Aug 2019 18:01:45 +0000

lxc launch ubuntu:18.04 c1 -c security.nesting=true
lxc exec c1 bash
  lxd init
  lxc launch images:alpine/edge a1
  lxc list

This should work fine. During "lxd init", the one thing that you'll need to pick which isn't already the default value is the IPv4 subnet. In my test, I used "192.168.0.1/24" which worked fine.

The reason for this is that the try-it environment has a subnet of 10.0.0.0/8, which prevents LXD from automatically picking an unused subnet in that range. Manually specifying one is therefore required.
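
The overlap can be checked with Python's standard `ipaddress` module. This is only a sketch of the arithmetic, not how LXD itself selects a subnet; the helper name `usable_for_bridge` is made up for the example.

```python
import ipaddress


def usable_for_bridge(candidate, in_use):
    """Return True if the candidate's network overlaps none of the subnets in use."""
    network = ipaddress.ip_interface(candidate).network
    return not any(network.overlaps(ipaddress.ip_network(used)) for used in in_use)


# The try-it environment already occupies all of 10.0.0.0/8.
in_use = ["10.0.0.0/8"]

print(usable_for_bridge("10.1.2.1/24", in_use))     # False: inside 10.0.0.0/8
print(usable_for_bridge("192.168.0.1/24", in_use))  # True: no overlap
```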

PS: Note that the try-it session is already itself a LXD container, so doing the above actually gets you a nested, nested container :)

Making containers safer

rc — Thu, 22 Aug 2019 17:54:26 +0000

I'm still not understanding this so maybe I'll explain where I'm coming from and see if that helps. I am specifically referring to HPC. HPC environments (with some rare exceptions) are multi-user environments. There can be dozens or hundreds of people logged into the same systems at any moment. Job schedulers are used to farm out work to various compute nodes (node==server in HPC parlance, just to be clear). Many HPC centers allow jobs from different users to be run on the same nodes at the same time, assuming there are enough resources available.

Generally, in HPC the desire to use containers is not to run containers. The desire is to package up terrible software in such a way that it will work. That is all. Some code that users want to run was cobbled together by grad students who barely got the thing running so they could finish up their research and graduate. Then people all over the world want to take that (often) unmaintainable garbage and use it themselves. That's where containers come in. Most code tends to be of decent quality, but it's that garbage software that we want to put in a garbage bin (aka container).

In what way would using namespaces other than the mount or user namespace help? (or using seccomp, etc)? Users can already run arbitrary code so how is allowing arbitrary code *in a container* any worse if it is only using mount or user namespacing? Sure, using the pid or network namespaces could help isolate users from each other, but that's orthogonal since it should be done for *all* user processes and not just containers.

Long story short, I fail to see how this particular usage in HPC somehow makes "run arbitrary code x in a container with mount and user namespacing only, all launched in an unprivileged manner by a normal user" any worse than "run arbitrary program y launched in an unprivileged manner by a normal user not inside a container".

Our goal is to isolate users from each other, not isolate containers from each other or prevent code in containers from escaping into the environment of the user that launched the container.

Making containers safer

cyphar — Thu, 22 Aug 2019 13:02:59 +0000

> Why is that bad practice? Only use what you need, right?

Not using security features is not a form of "principle of least privilege", it's an example of worrisome (if not outright bad) engineering.

> And yes, I agree with others who are concerned about the increased kernel attack surface of user namespaces.

The attack surface is increased by setting CONFIG_USER_NS=y in your kernel config. Using them to contain workloads on a machine that has CONFIG_USER_NS=y does not increase the attack surface, because even if you don't use them the container could call unshare(CLONE_NEWUSER) itself. In addition, most container runtimes block CLONE_NEWUSER with seccomp by default.
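
As a rough model of that last point, the decision such a filter makes can be sketched as below. This is plain Python that imitates the check, not a real seccomp BPF program and not any runtime's actual profile; only the flag values come from the kernel headers.

```python
# Flag values as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000


def filter_allows(syscall, flags):
    """Deny unshare()/clone() when CLONE_NEWUSER is requested; allow everything else."""
    if syscall in ("unshare", "clone") and flags & CLONE_NEWUSER:
        return False
    return True


print(filter_allows("unshare", CLONE_NEWUSER))                # False
print(filter_allows("unshare", CLONE_NEWNS | CLONE_NEWUSER))  # False
print(filter_allows("clone", 0))                              # True
```

The point is that the runtime itself can set up the user namespace before installing the filter, while the workload inside can no longer create new ones.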

Making containers safer

cyphar — Thu, 22 Aug 2019 12:55:33 +0000

Unless you are setting CONFIG_USER_NS=n in your kernels (which isn't the case on basically every distribution these days), then you aren't reducing the attack surface by not using user namespaces (the code is still in your kernel) -- you're just choosing not to use an additional security feature. Any unprivileged user on your host can call unshare(CLONE_NEWUSER) and start exploiting user namespace 0days. But in containers, we block unshare(CLONE_NEWUSER) so you can use user namespaces but the container process cannot. In addition, user namespaces are used *alongside* capability dropping, seccomp, devices cgroup, AppArmor/SELinux, no_new_privs, and so on. Using user namespaces doesn't make any of those other security features stop working, it complements them.

As for uid=0, I would suggest that it's always a Very Bad Idea™ to run code as uid=0 unless it's absolutely necessary, even if you're doing it with user namespaces. But if you are going to do it, then using user namespaces is still much better than not using them (assuming the capability set is the same in both cases).

Making containers safer

walters — Thu, 22 Aug 2019 12:55:02 +0000

Speaking of credit, from the article:

> Sadly, he said, the vast majority of containers that are run today are privileged containers. That includes most Docker containers and most of the containers that are run with Kubernetes.

I also think OpenShift deserves a lot of credit for coming out of the box from the very first 3.0 (Kubernetes-based) release in 2015 with the `MustRunAsRange` security policy - i.e. the pods aren't running as uid 0. To this day, this still causes a lot of incompatibility with apps that run on "stock Kubernetes".

At the time, user namespaces were a lot more immature, so I think it was the right call.

(To be clear, I work on OpenShift now, but I didn't have anything to do with implementing that feature)

Making containers safer

walters — Thu, 22 Aug 2019 12:44:13 +0000

Let me rephrase the original point I was trying to make:

The LXD team's push for user namespaces is great, and worth a lot of credit. The article's authors (and you) are right to highlight the risks of running without user namespaces.

The way I think about security is: I often use the term "secure" when talking about code to mean "we believe we can ship fixes for the security issues that arise using this", and I think that's true of "uid 0 containers". You're right there have been numerous CVEs, and there are required band-aids like seccomp for open_by_handle_at() - but this all got fixed.

So again, I think calling them "privileged containers" is taking things a step too far.

Making containers safer

corsac — Thu, 22 Aug 2019 08:41:47 +0000

Fair points, but I think you missed the “et al” part. And yes, I'm aware that capabilities are not perfect (far from it) and that a lot of them are equivalent to SYS_ADMIN / full root. But dropping the relevant caps still seems more reasonable to me than exposing the kernel. There's still a lot of stuff that is not namespace-aware and thus a large attack surface which is reachable when you're uid=0 in a user namespace.

Making containers safer

cyphar — Thu, 22 Aug 2019 08:41:06 +0000

Though (to add even more pedantry), both LXD and Docker use the mappings specified in /etc/sub[ug]id.
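
For reference, each line in those files has the form `name:start:count`. A minimal parser might look like this; the sample entries are made up and do not come from any particular system.

```python
from typing import NamedTuple


class SubIdRange(NamedTuple):
    owner: str
    start: int  # first subordinate ID delegated to the owner
    count: int  # number of consecutive IDs in the range


def parse_subid(text):
    """Parse /etc/subuid- or /etc/subgid-style text into a list of ranges."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        owner, start, count = line.split(":")
        ranges.append(SubIdRange(owner, int(start), int(count)))
    return ranges


# Hypothetical file contents.
SAMPLE = """\
root:1000000:65536
alice:165536:65536
"""

for entry in parse_subid(SAMPLE):
    last = entry.start + entry.count - 1
    print(f"{entry.owner}: host IDs {entry.start}-{last}")
```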

Making containers safer

cyphar — Thu, 22 Aug 2019 08:36:31 +0000

User namespaces are an incredibly important security feature, and disabling them is unequivocally a bad idea. There are hundreds of userns-related security checks in-kernel that you simply cannot emulate without using user namespaces. Dropping CAP_SYS_ADMIN is nowhere near sufficient to protect you, just for a taste of the problem:

* You need to use seccomp (or drop CAP_DAC_READ_SEARCH) to stop the open_by_handle_at(2) attacks that allow a container to open the root filesystem of the host (don't forget that you're running code as kuid=0).
* There are a bunch of attacks against attaching processes if you don't drop CAP_SYS_PTRACE (CVE-2016-9962 is a good example, but there are many more such as variations on CVE-2019-5736). Even after dropping CAP_SYS_PTRACE, there are userns-related security checks (in ptrace_may_access()) that completely eliminate a bunch of container-attach attacks -- protections you don't get without using user namespaces.
* If (for whatever reason) the container gets access to a file descriptor from the host's mount namespace, it's game over without user namespaces (SELinux can protect you here too, but given most people run on Ubuntu that doesn't help much).

There are plenty of other examples, but those are the ones that immediately came to mind.

Making containers safer

skissane — Thu, 22 Aug 2019 04:49:22 +0000

> It's excellent that LXD containers can be nested. That would have been my first question about them.

I tried starting a container inside a container using tryit. I couldn't get it to work, lots of permissions issues. (I don't really know what I am doing though, maybe I used the wrong steps or config options.)

Making containers safer

jeffcook — Thu, 22 Aug 2019 04:40:43 +0000

Things may have changed since I initially set up my LXC-based local containerization a couple of years ago, but unprivileged containers at least used to come with many caveats. I tried running one unprivileged and hit enough roadblocks that I just decided to go privileged on the rest of them. Some distros didn't even bundle the USER_NS patchset until fairly recently, making the unprivileged containers a non-starter.

Note that I am using just plain LXC, *not* LXD, so maybe things are somewhat easier through LXD. But that brings me to the next point: it's disappointing that the LXC project is so laser-focused on intertwining LXD, which has a very Docker-ish feel to it; it requires a running daemon, config primarily via commands instead of files, etc. A little odd since if you're coming to LXC/LXD, you're probably looking specifically for something non-Dockery anyway.

All in all the Linux containers thing is just a total mess, sad as it is to say. cgroups, v1 and v2, are a mess. USER_NS, unprivileged containers, and root daemons to control everything is a mess. It's great that LXC is continuing in the tradition of things like OpenVZ and making an actually-usable containerized system that can at least run an init without some obtuse black magic and without fear of the whole thing getting vaporized if it's stopped the wrong way, but to be frank, Linux should deprecate all of that junk and just do as near as possible to a 1:1 copy of jails.

Making containers safer

walters — Wed, 21 Aug 2019 23:45:56 +0000

subuids aren't involved in "uid 0 root-userns containers", that's for user namespaces.

Making containers safer

SEJeff — Wed, 21 Aug 2019 22:11:26 +0000

Well, while we're being pedantic, should we more correctly call them "uid 0 subuid containers"?

See: newuidmap(1) and newgidmap(1) in a pretty recent version of shadow-utils.

Making containers safer

rc — Wed, 21 Aug 2019 22:09:34 +0000

Why is that bad practice? Only use what you need, right? Some of the solutions use a user namespace to set up the mounts so it can all be done by an unprivileged user rather than a setuid process (or with capabilities).

Unless I'm missing something, that sounds ideal. (And yes, I agree with others who are concerned about the increased kernel attack surface of user namespaces)

Making containers safer

walters — Wed, 21 Aug 2019 21:09:51 +0000

Let's use the term "uid 0 containers" (more precisely, "uid 0 root-userns containers") or something - calling them "privileged containers" is very confusing given you're *not* talking about `docker run --privileged` right?

Making containers safer

corsac — Wed, 21 Aug 2019 20:02:45 +0000

For me, “unprivileged containers” are those running without CAP_SYS_ADMIN et al. (which is increasingly difficult these days). (Unprivileged) user namespaces are still dangerous imho because of the wide attack surface they expose on the kernel. It's kind of going in the opposite direction from where kernel hardening is going.

Making containers safer

brauner — Wed, 21 Aug 2019 17:40:34 +0000

I should clarify that my comment about some app container workloads essentially using mount namespaces only refers to something you can find in some HPC workloads. That insanely bad practice is hopefully fading though!

Making containers safer

stgraber — Wed, 21 Aug 2019 16:45:30 +0000

The LXD daemon requires root privileges. That's effectively needed by a lot of features, to the point where adding logic to handle an unprivileged daemon everywhere would have been very impractical (we did at the very beginning).
Some of those features involve network/storage management, system call interception in containers, checkpoint/restore, injection of uevents, mounts and devices inside containers, ...

Now that being said, LXD can absolutely work nested inside an unprivileged container, which is the very setup we use on that demo server. In such an environment, LXD effectively runs as root inside a user namespace, so as a globally unprivileged user which does have privileges against the namespaces tied to that user namespace.
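
To illustrate what "root inside a user namespace" means in terms of IDs, here is a small sketch that translates an ID through a map in the format of /proc/PID/uid_map (ID inside the namespace, ID outside, length). The sample map is an assumption for the example, not the demo server's actual configuration.

```python
def to_host_id(inside_id, id_map):
    """Translate an ID inside a user namespace to the ID seen outside it.

    Returns None when the ID is not covered by any line of the map.
    """
    for line in id_map.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside_start, outside_start, length = (int(field) for field in fields)
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None


# Hypothetical map: IDs 0-65535 in the container are 1000000-1065535 outside.
SAMPLE_MAP = "0 1000000 65536"

print(to_host_id(0, SAMPLE_MAP))      # 1000000: container root is an ordinary uid outside
print(to_host_id(1000, SAMPLE_MAP))   # 1001000
print(to_host_id(70000, SAMPLE_MAP))  # None: unmapped
```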

It's worth pointing out that, as a result of our daemon running as root, we do take any communication between the container and daemon very seriously, as flaws there would be disastrous from a security standpoint (potentially allowing escape from an unprivileged container). The only such interface is a /dev/lxd REST API that we expose to containers so they can fetch properties from the daemon. This interface can be disabled through a configuration key, at which point the container would no longer have any way to interact with the daemon that spawned it.

Making containers safer

epa — Wed, 21 Aug 2019 15:45:03 +0000

It's excellent that LXD containers can be nested. That would have been my first question about them.

Is it possible to use LXD as a non-privileged user, or do you need to be root to set it up?

Making containers safer

stgraber — Wed, 21 Aug 2019 15:34:29 +0000

For anyone interested in playing with LXD, we have an online demo available here: https://linuxcontainers.org/lxd/try-it/

This gives you 30min of root access to a LXD container that itself has LXD installed.
That way you can play with LXD, start containers and explore some of the basic features without ever installing anything locally.
There is a small tutorial that you can follow which will get you through some of the basics, or you can just play with it whichever way you want instead.