Containers as kernel objects — again

Posted Feb 22, 2019 20:26 UTC (Fri) by jejb (subscriber, #6654)
In reply to: Containers as kernel objects — again by jhoblitt
Parent article: Containers as kernel objects — again

OK, so it is possible to set up unprivileged docker and a few people are doing it. I always use unprivileged containers for my use case as well. However, I don't think you would argue that the number of people doing any form of unprivileged containers is dwarfed by the number of people who simply add privilege to the standard docker container to make it work (and usually this means real root). This is known to be a huge source of security issues (the latest being the runc CVE) but people do it anyway. Therefore, I stand by my statement that if you were to enforce the container description to be what the majority do today it would be without the user namespace. I'm in no way arguing this is right, and it's definitely not secure, but it is the majority container construction.

The meta point here, I think, is that the notion we've been experimenting long enough to have an idea of what a good container construction consists of is actually wrong and we're still need to experiment further. Which also means we really don't yet want to be pejorative about container constructions at the kernel level because that hobbles the experimentation.

Containers as kernel objects — again

Posted Feb 22, 2019 20:33 UTC (Fri) by bfields (subscriber, #19510) [Link] (4 responses)

I'm confused. David's container_create() still has a flags argument allowing the caller to choose which namespaces to inherit. So you're free to either create a new user namespace or not.

Containers as kernel objects — again

Posted Feb 22, 2019 20:47 UTC (Fri) by jejb (subscriber, #6654) [Link] (3 responses)

> I'm confused. David's container_create() still has a flags argument allowing the caller to choose which namespaces to inherit. So you're free to either create a new user namespace or not.

Allowing limited flexibility over the current interface doesn't make it non-pejorative. For instance:

1. It has a concept of init meaning it seems to require the PID namespace regardless of the flag.
2. requiring init also requires a container be populated by at least one process. This seems to completely deny the current concept of bind mounting a namespace (i.e. creating an empty container)
3. Nesting doesn't seem to be thought through
4. In kubernetes terms is your container id the container or the pod? The common audit use case seems to imply it should be the pod.
And so on ...

As I said: you can regard the above as bugs or features, but you can't deny it introduces a pejorative view of a kernel container.

Containers as kernel objects — again

Posted Feb 23, 2019 17:22 UTC (Sat) by drag (guest, #31333) [Link]

> 4. In kubernetes terms is your container id the container or the pod? The common audit use case seems to imply it should be the pod.

Not exactly sure what you are talking about here, but I would like to point out that in Kubernetes a pod can be made up of any number of containers. When you are doing things like sidecar containers or init containers (as in stuff that runs before the application starts) then you can have containers made by different projects and different people with different assumptions about uids and whatnot.

So certainly you want to be able to audit and interact with things on a per container level. Such interactions should be avoided as much as possible, but occasionally you need to still deal with individual containers. Usually when viewing logs or debugging things.

Containers as kernel objects — again

Posted Feb 24, 2019 19:18 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

> 1. It has a concept of init meaning it seems to require the PID namespace regardless of the flag.

I don't think that necessarily follows. See for example prctl(PR_SET_CHILD_SUBREAPER) (which lets a process become init-like with respect to its children, without having PID 1).

I do agree that running (for example) systemd with PID != 1 is likely to be a minor headache, but nobody said you have to use systemd (or whatever) as your init system. You could just as easily write a bespoke program that forks off some hard-coded set of children and wait()s for them.

Containers as kernel objects — again

Posted Feb 25, 2019 8:18 UTC (Mon) by smcv (subscriber, #53363) [Link]

bubblewrap is an example of a program that forks into a container, turns the forked child into pid 1/the reaper for the container, and forks again to run the useful content of the container. It's the container-runner for Flatpak, among others (analogous to the role of runc in Docker), and Flatpak apps all run as pid 2 inside the container, unless they fork again.

The actual reaper process is very simple: it just calls wait() in a loop. The complicated parts of something like systemd (or even sysvinit) are the parts that set up and run all the services, not the part that reaps processes.