Containers as kernel objects — again
Posted Feb 22, 2019 16:59 UTC (Fri) by jejb (subscriber, #6654)
Parent article: Containers as kernel objects — again
The problem still is that having a container construction imposed from userspace allows for huge flexibility and is incredibly powerful. The downside is that the kernel doesn't know what constitutes a container. The solution we came up with was to have userspace tell the kernel for audit purposes (the audit label) what the container is.
If the kernel is going to impose its view of what a container is, the question becomes: which container construction should it be? The obvious answer might be what docker/kubernetes does, but some of those practices (like no user namespace, pod-shared ipc and net namespaces) are somewhat incompatible with what LXC does, and they're definitely wholly incompatible with other less popular container use cases, like the architecture emulation containers I use to maintain cross-arch builds of my projects. This is the fundamental problem: imposing a kernel view of a container is pejorative and eliminates all other non-conforming uses. The argument is mostly about whether you see this as a bug or a feature.
Posted Feb 22, 2019 19:11 UTC (Fri)
by jhoblitt (subscriber, #77733)
[Link] (7 responses)
I'm not a docker apologist but I'd like to point out that docker certainly can take advantage of userns. At my $day_job I have built a production service that makes use of this. However, userns has a few serious usability drawbacks. While it does provide [the rather important] removal of true uid 0 processes from a container, it doesn't provide for unique uid reservations -- meaning it takes careful planning to keep other service role uids from overlapping with the range mapped into a userns. Another issue, specific to how docker uses userns, is that every container has the same system<->container uid mapping, resulting in the possibility for many processes in different containers to all share the same real system uid. This isn't a major issue but it certainly feels untidy if the goal is maximum isolation. Finally, the most serious limitation shows up when trying to bind mount a filesystem into the container for persistence (yes, I'm aware that the "docker way" is to use docker volumes, but that isn't always convenient and has its own set of limitations): the userns mapping system<->ns is a 1:1 range and no overlapping is allowed.
Suppose that you want to persist files with a system uid of 5000 and use the same uid inside the container. To do this with a single mapping for the namespace, you'd have to start the mapping at 0 and have a range of at least 5000 uids. That's a no go as then system uid 0 == container uid 0. This means for a lot of scenarios (say, systemd running the container) one mapping is needed for the root uid and one for uid 5000. However, there is now the problem that without caps, uid 0 in the container can't access uid 5000's files unless they are world accessible. It also means that every container run needs to follow this uid pattern. Want to use a docker image packaged utility (terraform, etc.)? A "wrapper" image needs to be built to change the uids -- exactly the same as it was without userns except now knowledge of the mapping is also required.
I believe what most users actually want is the equivalent of an NFS uid squash between the system and the container userns -- I am aware of an example of this essentially being done using a k8s storage driver. While I am a heavy k8s user, it isn't a realistic solution to the typical case of wanting to run multiple docker containers in a sequence that interact with the same files, which is why the DinD pattern ends up being employed in k8s pods. Docker storage volumes don't solve this issue either. Solutions other than uid squashing are painful: give up on POSIX semantics completely and move to object storage, use a utility with caps to re-chown files, local NFS exports/mounts, and/or FUSE games.
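To make the uid 5000 scenario concrete, here is a rough sketch of how range-based uid maps translate ids. Each tuple follows the three-field layout of a line in /proc/<pid>/uid_map (first uid inside the namespace, first uid outside, range length). The helper name to_host_uid and the host uid 100000 are illustrative assumptions, not anything docker itself defines.

```python
from typing import List, Optional, Tuple

# One entry mirrors a line of /proc/<pid>/uid_map:
# (first uid inside the namespace, first uid outside, length of the range)
UidMap = List[Tuple[int, int, int]]


def to_host_uid(uid_map: UidMap, container_uid: int) -> Optional[int]:
    """Translate a container uid to a host uid, or None if it is unmapped."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


# A single mapping starting at 0 and long enough to include uid 5000:
# uid 5000 lines up with the host, but so does root.
single = [(0, 0, 5001)]

# Two mappings: container root lands on an unprivileged host uid
# (100000 is an arbitrary, hypothetical choice) and 5000 maps to itself.
split = [(0, 100000, 1), (5000, 5000, 1)]

if __name__ == "__main__":
    for name, uid_map in (("single", single), ("split", split)):
        for uid in (0, 5000, 1000):
            print(name, uid, "->", to_host_uid(uid_map, uid))
```

With the split map, container uid 0 and uid 5000 end up as unrelated host uids, which is why container root can no longer touch uid 5000's files without extra capabilities.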
Posted Feb 22, 2019 20:26 UTC (Fri)
by jejb (subscriber, #6654)
[Link] (5 responses)
The meta point here, I think, is that the notion that we've been experimenting long enough to have an idea of what a good container construction consists of is actually wrong, and we still need to experiment further. Which also means we really don't yet want to be pejorative about container constructions at the kernel level because that hobbles the experimentation.
Posted Feb 22, 2019 20:33 UTC (Fri)
by bfields (subscriber, #19510)
[Link] (4 responses)
Posted Feb 22, 2019 20:47 UTC (Fri)
by jejb (subscriber, #6654)
[Link] (3 responses)
Allowing limited flexibility over the current interface doesn't make it non-pejorative. For instance:
1. It has a concept of init meaning it seems to require the PID namespace regardless of the flag.
As I said: you can regard the above as bugs or features, but you can't deny it introduces a pejorative view of a kernel container.
Posted Feb 23, 2019 17:22 UTC (Sat)
by drag (guest, #31333)
[Link]
Not exactly sure what you are talking about here, but I would like to point out that in Kubernetes a pod can be made up of any number of containers. When you are doing things like sidecar containers or init containers (as in stuff that runs before the application starts) then you can have containers made by different projects and different people with different assumptions about uids and whatnot.
So certainly you want to be able to audit and interact with things on a per-container level. Such interactions should be avoided as much as possible, but occasionally you still need to deal with individual containers, usually when viewing logs or debugging things.
Posted Feb 24, 2019 19:18 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
I don't think that necessarily follows. See for example prctl(PR_SET_CHILD_SUBREAPER) (which lets a process become init-like with respect to its children, without having PID 1).
I do agree that running (for example) systemd with PID != 1 is likely to be a minor headache, but nobody said you have to use systemd (or whatever) as your init system. You could just as easily write a bespoke program that forks off some hard-coded set of children and wait()s for them.
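As a rough illustration of both points, the sketch below marks the current process as a child subreaper and then reaps whatever ends up under it. It is Linux-only and calls prctl() through ctypes; the constant 36 is the value of PR_SET_CHILD_SUBREAPER in <linux/prctl.h>. The function names and the deliberately orphaned grandchild are made up for the demonstration.

```python
import ctypes
import os
import time

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper() -> None:
    """Mark this process as a child subreaper (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    rc = libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1), zero, zero, zero)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def spawn_orphan(exit_code: int) -> None:
    """Fork a child that forks a grandchild and exits at once, orphaning it."""
    if os.fork() == 0:
        if os.fork() == 0:
            time.sleep(0.2)      # grandchild: outlive the intermediate child
            os._exit(exit_code)
        os._exit(0)              # intermediate child exits immediately


def reap_all() -> list:
    """The init-like part: call wait() in a loop until no children remain."""
    codes = []
    while True:
        try:
            _pid, status = os.wait()
        except ChildProcessError:
            return sorted(codes)
        codes.append(os.WEXITSTATUS(status))


if __name__ == "__main__":
    become_subreaper()
    spawn_orphan(7)
    print(reap_all())
```

Without the prctl() call the orphaned grandchild would be reparented to PID 1 of its PID namespace (or to some other subreaper further up), and this process would only ever see the intermediate child exit.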
Posted Feb 25, 2019 8:18 UTC (Mon)
by smcv (subscriber, #53363)
[Link]
The actual reaper process is very simple: it just calls wait() in a loop. The complicated parts of something like systemd (or even sysvinit) are the parts that set up and run all the services, not the part that reaps processes.
Posted Feb 23, 2019 17:16 UTC (Sat)
by drag (guest, #31333)
[Link]
YES. THIS.
This would make life so much easier.
Posted Mar 2, 2019 7:58 UTC (Sat)
by ThinkRob (guest, #64513)
[Link]
Well that's the cathedral vs. the bazaar in a nutshell, isn't it?
Containers (and really any features designed/imposed primarily by/because of the kernel) require userspace cooperation/config. So you get whatever common spanning set of features the two agree on. Which may not be a set/superset of what's available in kernel-land. :(
Compare and contrast to Illumos zones or FreeBSD jails: something is added, and it's generally available ASAP in the tooling.
There's something to be said for a tool that matches ring 0's contour.
2. Requiring init also requires that a container be populated by at least one process. This seems to completely deny the current concept of bind mounting a namespace (i.e. creating an empty container).
3. Nesting doesn't seem to be thought through.
4. In kubernetes terms, is your container id the container or the pod? The common audit use case seems to imply it should be the pod.
And so on ...
