Containers as kernel objects — again

Posted Feb 22, 2019 16:59 UTC (Fri) by jejb (subscriber, #6654)
Parent article: Containers as kernel objects — again

> perhaps the time has come to codify some of what has been understood into kernel features that make containers as a whole easier to deal with.

The problem still is that having a container construction imposed from userspace allows for huge flexibility and is incredibly powerful. The down side is that the kernel doesn't know what constitutes a container. The solution we came up with was to have userspace tell the kernel for audit purposes (the audit label) what the container is.

If the kernel is going to impose its view of what a container is, the question becomes which container construction should it be? The obvious answer might be what docker/kubernetes does, but some of those practices (like no user namespace, pod shared ipc and net namespace) are somewhat incompatible with what LXC does and they're definitely wholly incompatible with other less popular container use cases, like the architecture emulation containers I use to maintain cross arch builds of my projects. This is the fundamental problem: imposing a kernel view of a container is pejorative and eliminates all other non-conforming uses. The argument is mostly about whether you see this as a bug or a feature.

Containers as kernel objects — again

Posted Feb 22, 2019 19:11 UTC (Fri) by jhoblitt (subscriber, #77733)

> The obvious answer might be what docker/kubernetes does, but some of those practices (like no user namespace, pod shared ipc and net namespace)

I'm not a docker apologist but I'd like to point out that docker certainly can take advantage of userns. At my $day_job I have built a production service that makes use of this. However, userns has a few serious usability drawbacks. While it does provide [the rather important] removal of true uid 0 processes from a container, it doesn't provide for unique uid reservations -- meaning it takes careful planning to keep other service role uids from overlapping with the range mapped into a userns. Another issue, specific to how docker uses userns, is that every container has the same system<->container uid mapping, resulting in the possibility for many processes in different containers to all share the same real system uid. This isn't a major issue but it certainly feels untidy if the goal is maximum isolation. Finally, the most serious limitation shows up when trying to bind mount a filesystem into the container for persistence (yes, I'm aware that the "docker way" is to use docker volumes but that isn't always convenient and has its own set of limitations): the userns system<->ns mapping is a 1:1 range and no overlapping is allowed.
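To make the "careful planning" point concrete, here is a minimal sketch of the kind of check one ends up doing by hand. The service names, uids, and the mapped range (start 100000, length 65536) are all made-up values for illustration, not anything docker prescribes.

```python
def overlapping_uids(service_uids, map_start, map_length):
    """Return the host uids that fall inside the host side of a userns mapping."""
    mapped = range(map_start, map_start + map_length)
    return sorted(uid for uid in service_uids if uid in mapped)


# Hypothetical host layout: service role accounts and their uids.
service_uids = {"postgres": 999, "builder": 100500, "backup": 165535, "ci": 165536}

# Hypothetical userns mapping: host uids 100000..165535 are handed to containers.
clashes = overlapping_uids(service_uids.values(), 100000, 65536)
print(clashes)
```

Any uid that shows up in the result is a host account that a container process could end up sharing an identity with.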

Suppose that you want to persist files with a system uid of 5000 and use the same uid inside the container. To do this with a single mapping for the namespace, you'd have to start the mapping at 0 and have a range of more than 5000 uids. That's a no-go as then system uid 0 == container uid 0. This means for a lot of scenarios (say, systemd running the container) one mapping is needed for the root uid and one for uid 5000. However, there is now the problem that without caps, uid 0 in the container can't access uid 5000's files unless they are world accessible. It also means that every container run needs to follow this uid pattern. Want to use a docker image packaged utility (terraform, etc.)? A "wrapper" image needs to be built to change the uids -- exactly the same as it was without userns except now knowledge of the mapping is also required.
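A small sketch of the arithmetic, assuming the usual /proc/<pid>/uid_map line format of "inside-uid outside-uid length". The helper names and the host uid 100000 chosen for container root are hypothetical; this only models the translation and does not create a namespace.

```python
from typing import NamedTuple, Optional


class MapEntry(NamedTuple):
    inside: int   # first uid as seen inside the user namespace
    outside: int  # first uid as seen on the host
    length: int   # number of consecutive uids covered


def parse_uid_map(text):
    """Parse uid_map-style lines: '<inside> <outside> <length>'."""
    entries = []
    for line in text.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        entries.append(MapEntry(inside, outside, length))
    return entries


def to_host(entries, container_uid) -> Optional[int]:
    """Translate a container uid to a host uid, or None if it is unmapped."""
    for e in entries:
        if e.inside <= container_uid < e.inside + e.length:
            return e.outside + (container_uid - e.inside)
    return None


# One mapping starting at 0: reaching uid 5000 takes 5001 ids,
# and container root becomes real root on the host.
single = parse_uid_map("0 0 5001")

# Two mappings: root is moved to an unprivileged host uid (100000 is
# an arbitrary choice here) and uid 5000 is passed straight through.
split = parse_uid_map("0 100000 1\n5000 5000 1")

print(to_host(single, 0), to_host(single, 5000))
print(to_host(split, 0), to_host(split, 5000), to_host(split, 1))
```

In the split layout every other container uid is unmapped, which is why images that expect different uids need the "wrapper" treatment.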

I believe what most users actually want is the equivalent of an NFS uid squash between the system and the container userns -- I am aware of an example of this essentially being done using a k8s storage driver. While I am a heavy k8s user, it isn't a realistic solution to the typical case of wanting to run multiple docker containers in a sequence that interact with the same files, which is why the DinD pattern ends up being employed in k8s pods. Docker storage volumes don't solve this issue either. Solutions other than uid squashing are painful: give up on posix semantics completely and move to object storage, use a utility with caps to re-chown files, local NFS exports/mounts, and/or FUSE games.

Containers as kernel objects — again

Posted Feb 22, 2019 20:26 UTC (Fri) by jejb (subscriber, #6654)

OK, so it is possible to set up unprivileged docker and a few people are doing it. I always use unprivileged containers for my use case as well. However, I don't think you would dispute that the number of people doing any form of unprivileged containers is dwarfed by the number of people who simply add privilege to the standard docker container to make it work (and usually this means real root). This is known to be a huge source of security issues (the latest being the runc CVE) but people do it anyway. Therefore, I stand by my statement that if you were to enforce the container description to be what the majority do today it would be without the user namespace. I'm in no way arguing this is right, and it's definitely not secure, but it is the majority container construction.

The meta point here, I think, is that the notion we've been experimenting long enough to have an idea of what a good container construction consists of is actually wrong and we still need to experiment further. Which also means we really don't yet want to be pejorative about container constructions at the kernel level because that hobbles the experimentation.

Containers as kernel objects — again

Posted Feb 22, 2019 20:33 UTC (Fri) by bfields (subscriber, #19510)

I'm confused. David's container_create() still has a flags argument allowing the caller to choose which namespaces to inherit. So you're free to either create a new user namespace or not.

Containers as kernel objects — again

Posted Feb 22, 2019 20:47 UTC (Fri) by jejb (subscriber, #6654)

> I'm confused. David's container_create() still has a flags argument allowing the caller to choose which namespaces to inherit. So you're free to either create a new user namespace or not.

Allowing limited flexibility over the current interface doesn't make it non-pejorative. For instance:

1. It has a concept of init meaning it seems to require the PID namespace regardless of the flag.
2. Requiring init also requires that a container be populated by at least one process. This seems to completely deny the current concept of bind mounting a namespace (i.e. creating an empty container).
3. Nesting doesn't seem to be thought through.
4. In kubernetes terms is your container id the container or the pod? The common audit use case seems to imply it should be the pod.
And so on ...

As I said: you can regard the above as bugs or features, but you can't deny it introduces a pejorative view of a kernel container.

Containers as kernel objects — again

Posted Feb 23, 2019 17:22 UTC (Sat) by drag (guest, #31333)

> 4. In kubernetes terms is your container id the container or the pod? The common audit use case seems to imply it should be the pod.

Not exactly sure what you are talking about here, but I would like to point out that in Kubernetes a pod can be made up of any number of containers. When you are doing things like sidecar containers or init containers (as in stuff that runs before the application starts) then you can have containers made by different projects and different people with different assumptions about uids and whatnot.

So certainly you want to be able to audit and interact with things on a per container level. Such interactions should be avoided as much as possible, but occasionally you still need to deal with individual containers, usually when viewing logs or debugging things.

Containers as kernel objects — again

Posted Feb 24, 2019 19:18 UTC (Sun) by NYKevin (subscriber, #129325)

> 1. It has a concept of init meaning it seems to require the PID namespace regardless of the flag.

I don't think that necessarily follows. See for example prctl(PR_SET_CHILD_SUBREAPER) (which lets a process become init-like with respect to its children, without having PID 1).

I do agree that running (for example) systemd with PID != 1 is likely to be a minor headache, but nobody said you have to use systemd (or whatever) as your init system. You could just as easily write a bespoke program that forks off some hard-coded set of children and wait()s for them.
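A rough sketch of such a bespoke init, in Python rather than C for brevity. It assumes Linux and Python 3.9 or later; prctl() is reached through ctypes because the standard library has no wrapper for it, and 36 is the value of PR_SET_CHILD_SUBREAPER in <linux/prctl.h>. The function names are invented for illustration.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(
        ctypes.c_int(PR_SET_CHILD_SUBREAPER),
        ctypes.c_ulong(1),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def run_init(commands):
    """Fork one child per command, then reap until no children remain."""
    become_subreaper()
    for argv in commands:
        if os.fork() == 0:
            try:
                os.execvp(argv[0], argv)  # does not return on success
            finally:
                os._exit(127)             # only reached if exec failed
    exit_codes = {}
    while True:
        try:
            pid, status = os.wait()       # also collects reparented orphans
        except ChildProcessError:
            break                         # nothing left to wait for
        exit_codes[pid] = os.waitstatus_to_exitcode(status)
    return exit_codes
```

Calling something like run_init([["sleep", "1"], ["true"]]) from a process with any PID returns a dict of pid to exit code once everything has exited; no PID namespace is involved.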

Containers as kernel objects — again

Posted Feb 25, 2019 8:18 UTC (Mon) by smcv (subscriber, #53363)

bubblewrap is an example of a program that forks into a container, turns the forked child into pid 1/the reaper for the container, and forks again to run the useful content of the container. It's the container-runner for Flatpak, among others (analogous to the role of runc in Docker), and Flatpak apps all run as pid 2 inside the container, unless they fork again.

The actual reaper process is very simple: it just calls wait() in a loop. The complicated parts of something like systemd (or even sysvinit) are the parts that set up and run all the services, not the part that reaps processes.
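To show how little there is to it, here is a sketch of a reaper loop in Python (3.9 or later). It is not bubblewrap's code, which is C and runs inside a new PID namespace; this version skips the namespace setup entirely and only shows the fork, exec, and wait() loop. The function names are made up.

```python
import os


def reap_until(main_pid):
    """Reap every child that exits; return the exit code of main_pid."""
    while True:
        pid, status = os.wait()  # blocks until any child exits
        if pid == main_pid:
            return os.waitstatus_to_exitcode(status)
        # Any other pid is some stray child; it is reaped and ignored.


def run_under_reaper(argv):
    """Fork, exec argv in the child, and act as the reaper in the parent."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)  # does not return on success
        finally:
            os._exit(127)             # only reached if exec failed
    return reap_until(main_pid)
```

For example, run_under_reaper(["sh", "-c", "exit 7"]) hands back the child's exit status, so the caller can exit with the same code as the useful content of the container.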

Containers as kernel objects — again

Posted Feb 23, 2019 17:16 UTC (Sat) by drag (guest, #31333)

> I believe what most users actually want is the equivalent of an NFS uid squash between the system and the container userns

YES. THIS.

This would make life so much easier.

Containers as kernel objects — again

Posted Mar 2, 2019 7:58 UTC (Sat) by ThinkRob (guest, #64513)

> The problem still is that having a container construction imposed from userspace allows for huge flexibility and is incredibly powerful. The down side is that the kernel doesn't know what constitutes a container.

Well that's the cathedral vs. the bazaar in a nutshell, isn't it?

Containers (and really any features designed/imposed primarily by/because of the kernel) require userspace cooperation/config. So you get whatever common spanning set of features the two agree on. Which may not be a set/superset of what's available in kernel-land. :(

Compare and contrast to Illumos zones or FreeBSD jails: something is added, and it's generally available ASAP in the tooling.

There's something to be said for a tool that matches ring 0's contour.