Containers as kernel objects

By Jonathan Corbet
May 23, 2017

The kernel has, over the years, gained comprehensive support for containers; that, in turn, has helped to drive the rapid growth of a number of containerization systems. Interestingly, though, the kernel itself has no concept of what a container is; it just provides a number of facilities that can be used in the creation of containers in user space. David Howells is trying to change that state of affairs with a patch set adding containers as a first-class kernel object, but the idea is proving to be a hard sell in the kernel community.

Containers can be thought of as a form of lightweight virtualization. Processes running within a container have the illusion of running on an independent system but, in truth, many containers can be running simultaneously on the same host kernel. The container illusion is created using namespaces, giving each container its own view of the network, the filesystem, and more, and control groups, which isolate containers from each other and control resource usage. Security modules or seccomp can be used to further restrict what a container can do. The result is a mechanism that, like so many things in Linux, offers a great deal of flexibility at the cost of a fair amount of complexity. Setting up a container in a way that ensures it will stay contained is not a trivial task and, as we'll see, the lack of a container primitive also complicates things on the kernel side.

Adding a container object

Howells's patch creates (or modifies) a set of system calls to make it possible for user space to manipulate containers. It all starts with:

    int container_create(const char *name, unsigned int flags);

This new system call creates a container with the given name. The flags mainly specify which namespaces from the caller should be replaced by new namespaces in the created container. For example, specifying CONTAINER_NEW_USER_NS will cause the container to be created with a new user namespace. The return value is a file descriptor that can be used to refer to the container. There are a couple of flags that indicate whether the container should be destroyed when the file descriptor is closed, and whether the descriptor should be closed if the calling process calls exec().

The container starts out empty, with no processes running within it; if it is created with a new mount namespace, there are no filesystems mounted inside it either. Two new system calls (fsopen() and fsmount(), added in a separate patch set) can be used to add filesystems to the container. The "at" versions of the file system calls (openat(), for example) can take a container file descriptor as the starting point, easing the creation of files inside the container. It is possible to open a socket within the container with:

    int container_socket(int container_fd, int domain, int type, int protocol);

The main purpose of container_socket() appears to be to make it easy to use netlink sockets to configure the container's networking from the outside. It can help an orchestration system avoid the need to run a process inside the container to do this configuration.

When it comes time to start things running inside the container, a call can be made to:

    pid_t fork_into_container(int container_fd);

The new process created by this call will be the init process inside the given container, and will run inside the container's namespaces. It can only be called once for any given container.

There are a number of things that, Howells said, could still be added to this mechanism. They include the ability to set a container's namespaces directly, support for the management of a container's control groups, the ability to suspend and restart a container, and more. But it is not clear that this work will progress far in its current form.

A poor match?

A number of developers expressed concerns about this proposal, mostly focused on two issues: the proposed container object is not seen as a good match for how containers are actually used now, and it is seen as the wrong solution to a specific problem. On the first issue, the flexibility of the current mechanisms is seen by many as an advantage, one that they would rather not lose. Jessica Frazelle said:

Adding a container object seems a bit odd to me because there are so many different ways to make containers, aka not all namespaces are always used as well as not all cgroups, various LSM objects sometimes apply, mounts blah blah blah. The OCI spec was made to cover all these things so why a kernel object?

Here, she was referring to the runtime specification from the Open Containers Initiative. James Bottomley was more direct, saying that:

This sounds like a step in the wrong direction: the strength of the current container interfaces in Linux is that people who set up containers don't have to agree what they look like. So I can set up a user namespace without a mount namespace or an architecture emulation container with only a mount namespace.

He pointed out, in particular, an apparent mismatch between the proposed container object and the concepts of containers and "pods" implemented in Kubernetes. Some namespaces are specific to a container, while others are shared across a pod, blurring the boundaries somewhat. The kernel container object, he added, "isn't something the orchestration systems would find usable".

Eric Biederman took an even stronger position by rejecting the patch outright. As he put it:

Embracing the complexity of namespaces head on tends to mean all of the goofy scary semantic corner cases are visible from the first version of the design, and so developers can't take short cuts that result in buggy kernel code that persists for decades. I am rather tired of finding and fixing those.

Unlike the others, he is not so deeply concerned with what existing orchestration systems do; his worries have to do with the exposing of the container object to user space at all. That is where the second issue comes up.

Upcalls

To a great extent, it appears that the motivation behind this patch set isn't to make the management of containers easier for user-space code. Instead, it is trying to solve a nagging problem that has become increasingly irritating for kernel developers: how to make kernel "upcalls" work properly in a containerized environment.

As a general rule, the kernel, as the lowest level of the system, tries to be self-sufficient in all things. There really is nobody else to rely on to get things done, after all. There are times, though, when the kernel has to ask user space for help. That is typically done with a call to call_usermodehelper(), an internal function that will create a user-space process and run a specific program to get something done — "calling up" to user space, in other words.

There are a number of call_usermodehelper() call sites in the kernel. Some of the tasks it is used for include:

The core-dump code can use it to invoke a program to do something useful with the dumped data.
The NFSv4 client can call a helper program to perform DNS resolution.
The module loader can invoke a helper to perform demand-loading of modules.
The kernel's key-management code will call to user space when a key is needed to perform a specific function — to mount an encrypted filesystem, for example.

Once upon a time, when life was simple, these upcalls would create a process running as root that could run the requested program. Now, however, the action that provoked the upcall in the first place may well have come from inside a container, and it may well be that the upcall should run within that container as well. At least, it should run inside that container's particular mix of namespaces. But, since the kernel has no concept of a container, it has no way to know which container to run any particular upcall within. A kernel upcall that is run in the wrong namespace might do the wrong thing — or allow a process to escape its container.

Adding a container concept to the kernel is one way to fix this problem. But this particular patch has raised questions of whether (1) a container object is the best solution to the upcall problem, and (2) if a container object does make sense, does it need to be exposed to user space? The kernel might be able to keep track of the proper namespaces to use for specific upcalls without creating a bunch of new infrastructure or exposing a new API that would have to be maintained forever. Biederman suggested one possible scheme that could be used to track namespaces for the key-management upcalls, for example.

Another possible approach, proposed by Colin Walters, is to drop the upcall model entirely. Instead, a protocol could be created to report events to a user-space daemon that could act on them in the proper context. That kind of change has been made in the past; device-related events were once handled via upcalls, but now they are communicated directly to the udev (or systemd-udevd) process instead. But, as Jeff Layton pointed out, that model only works in some settings. In others, it just leads to a proliferation of daemon processes that clutter up the system and can create reliability issues. So the events model isn't necessarily a replacement for all kernel upcalls.

This discussion is young as of this writing, and may yet progress in unexpected directions. From the early indications, it seems relatively unlikely that a container object visible to user space will be added to the kernel anytime soon. If, perhaps, some future attempt creates a container concept that is useful to existing orchestration systems, that could change. Meanwhile, we may well see an attempt to improve the kernel's internal ability to determine the proper namespace for any given upcall. Either way, the inherent complexity of the container problem seems likely to be with us for a long time.

Index entries for this article
Kernel	Containers

Containers as kernel objects

Posted May 24, 2017 1:34 UTC (Wed) by neilbrown (subscriber, #359) [Link] (7 responses)

call_usermodehelper() effectively works by forking init (pid=1).
Maybe we need an "upcall namespace" which identifies a different process to (transparently) fork.

Containers as kernel objects

Posted May 24, 2017 4:45 UTC (Wed) by garloff (subscriber, #319) [Link] (2 responses)

So a container gets its own "init" if it needs to be host to upcalls?
Sounds comparably clean... Kernel still would need to somehow determine at the upcall place which init it belongs to.

Containers as kernel objects

Posted May 26, 2017 9:50 UTC (Fri) by Albis (guest, #115537) [Link] (1 responses)

Which means that the kernel has to maintain a table of which PID lives in which container?

Containers as kernel objects

Posted May 26, 2017 10:25 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

Wouldn't the kernel already have to know which init is paired with a process for reparenting purposes?

Containers as kernel objects

Posted May 24, 2017 10:47 UTC (Wed) by jlayton (subscriber, #31672) [Link] (3 responses)

That was one of the suggestions from Eric B. Basically we could allow kernel subsystems to spawn a "private" kthreadd and use that as the context for cloning new upcall processes. That would work reasonably well for something like nfsdcltrack upcalls from knfsd, but it's less clear how that would work in the case of something like keyrings.

Containers as kernel objects

Posted May 25, 2017 2:12 UTC (Thu) by neilbrown (subscriber, #359) [Link] (2 responses)

> but it's less clear how that would work in the case of something like keyrings.

So what is the important details of the context in which request_key() is called, that determines the desired container? A lot of the calls come from filesystems. Should the container context come from the filesystem, or the mount point, or the process making some request?
Either way there is a well defined kernel object that can hold a reference to a fork-for-upcalls kthread - isn't there?

Containers as kernel objects

Posted May 25, 2017 11:50 UTC (Thu) by ebiederm (subscriber, #35028) [Link] (1 responses)

For filesystems we need to determine context for user mode helpers at mount time.

Containers as kernel objects

Posted May 26, 2017 14:12 UTC (Fri) by cyphar (subscriber, #110703) [Link]

I'm not fully aware of all of the users of upcalls, but can't we do something like PR_SET_CHILD_SUBREAPER which would effectively mean "if any of the processes under me do an action that result in an upcall, call the upcall in my namespaces and cgroups [effectively like doing a fork-exec from the process]". Obviously there are some huge semantic discussions to be had (how to handle the process dying or something like seccomp come to mind almost immediately) but I feel like a "blessed process" model might make this palatable.

No

Posted May 24, 2017 3:47 UTC (Wed) by cyphar (subscriber, #110703) [Link]

As someone who maintains a container runtime, this whole idea is just ridiculous. While there are downsides to the split-out model of namespaces, all of the upsides are immediately lost when you suddenly decide to create a mirror interface that removes all of the benefits (and is also just ridiculous).

The complexity in runtimes is something that we've already made peace with. Why are you changing the interfaces underneath us? Who actually asked for this?

If you wanted to implement Zones or Jails on Linux, you should've just done that in the first place.

Containers as kernel objects

Posted May 24, 2017 3:50 UTC (Wed) by brauner (subscriber, #109349) [Link] (1 responses)

No.
I think all maintainers of current container runtimes uniformely agree, including me, that this is not a good idea. I think the strongest argument against this interface is that it defines what a container actually is *supposed* to be. This is something which we should not to. The strength of containerization features in the Linux kernel is that they allow to be combined in multiple ways to create different breeds of container objects. A first class container citizen doesn't make sense. Most of all, because the point where something like this could have been implemented has passed. There just is not concept of a container for Linux. We should embrace this fact in all its complexity.

Containers as kernel objects

Posted May 24, 2017 14:41 UTC (Wed) by Tara_Li (guest, #26706) [Link]

That was my first thought reading the headlines - who thought this would be a good idea? My understanding is that there are a variety of container implementations, and this would give one particular version of it a standing above the others. Kind of like why Washington DC is not, and never should be, a state or in a state - by being a state, or in a state, it creates a first among equals.

Socket activation, not upcalls?

Posted May 24, 2017 8:12 UTC (Wed) by bbockelm (subscriber, #71069) [Link]

With sysvinit, the upcall model made a lot of sense for infrequent upcalls - if you sent this information via a socket associated with the user namespace (or netlink messages), you would indeed need a daemon per upcall.

With systemd, the upcall model makes less sense. systemd is very capable of socket activation (not sure about activation based on netlink messages) and launching a one-shot service to handle the task.

Heck, even the container daemon or runtime can do this. I'm not convinced this would lead to a proliferation of standing processes per container for infrequent calls.

Containers as kernel objects

Posted May 25, 2017 14:44 UTC (Thu) by mezcalero (subscriber, #45103) [Link] (11 responses)

Upcalls are a really flawed concept. Doing upcalls is always the wrong choice. Having processes around that appear out of thin air and are not resource or runtime managed or within a clear security context is really awful. For example, the coredumper is a primary example where this concept is really terrible: processing coredumps is highly resource consuming (think: firefox crashing) and security sensitive (since unprivileged processes shouldn't be processed by privileged code, parsing is nasty and binutils and friends too complex to be entirely safe). If we do this outside of PID1's management then this stuff will compete against everything else on the system and cannot take benefit of the usual sandboxing applied to services but needs to do this on its own. In systemd we hence decided to immediately hand off the coredump pipe of the kernel to a local service running under systemd's management and exit very quickly in the upcall handler. And thats the scheme we generally have to follow to deal with the fucked up concept of upcalls: hand over the event to something properly part of the system.

Note that systemd can do socket activation for you, including netlink activation. Hence there isn't really proliferation of daemons if people would just give up on upcalls: the stuff in userspace would only run when something is happening much the same as before, but everything cleanly managed, introspectable by the user with all security policies properly in effect and so on.

Hence my suggestion: dont try to make upcalls work in more cases. Instead just stip doing them altogether. They are awful. There should be a singlr upcall only on the system: the one invoking PID 1.

Lennart

Containers as kernel objects

Posted May 26, 2017 23:16 UTC (Fri) by nix (subscriber, #2304) [Link] (10 responses)

Please don't give up on upcalls. Some of us wrote our own scripts for e.g. coredump management, and not everyone wants PID 1 to do things like this...

(Side note: at least one NFS developer is a sysvinit user. The likelihood of upcalls going away from NFSv4 in favour of just throwing everything at PID 1 is thus essentially nil, though it is possible that a PID 1-throwing *option* might be added. Removing everything in favour of your proposed PID 1 approach would break every non-systemd system out there, and also systemd systems too old to understand whatever this upcall request might turn out to be. Some people really do have existing systems to maintain and don't want gratuitous breakage of this sort of thing, thanks!)

Containers as kernel objects

Posted May 26, 2017 23:56 UTC (Fri) by walters (subscriber, #7396) [Link] (7 responses)

systemd-coredump isn't pid 1. Moving away from upcalls doesn't imply requiring systemd. And yes, in the end a non-upcall model would really have to be an option anyways for backwards compat.

Containers as kernel objects

Posted May 28, 2017 0:06 UTC (Sun) by nix (subscriber, #2304) [Link] (6 responses)

Lennart said 'PID 1', which implies that systemd itself, the init daemon, would be the thing receiving the upcalls, not systemd-coredump. (Though it would doubtless respond by talking to systemd-coredump.)

Containers as kernel objects

Posted May 28, 2017 15:17 UTC (Sun) by MattJD (subscriber, #91390) [Link]

I think the points was systemd would receive the notification and launch the daemon. Nothing stops an alternate coredump daemon to just always listen for events.

If having several daemons running is an issue, an alternate daemon could do the systemd's part, by listening and then launching the appropriate daemon on a notification.

I think Lennart was saying how systemd could implement it, using it's existing functionality. Nothing in that example implies to me you have to run systemd, nor have the listener be PID 1 if you are willing to reimplement some functionality.

Containers as kernel objects

Posted May 29, 2017 8:18 UTC (Mon) by matthias (subscriber, #94967) [Link] (4 responses)

Lennart said, that the code should not run outside of PID 1's management system. This does not necessarily mean that PID 1 manages the events, but just that the process dealing with the events is ultimately managed by PID 1, i.e., it is started directly or indirectly from PID 1. In the case of systemd, this process would be systemd-coredump, which is managed by PID 1, like any other process that is started from userspace.

Containers as kernel objects

Posted May 30, 2017 0:19 UTC (Tue) by nix (subscriber, #2304) [Link] (3 responses)

Frankly, I don't see why PID 1 should have anything to do with this, except that Lennart reasonably considers his own PID 1 a useful place to dump everything to avoid having to put policy anywhere: rather than registering a binary to be upcalled, or having something listen on a netlink protocol (problematical as an upcall replacement, as previously noted)... why not stick it in PID 1, since it's the only thing you know must always be running?

Well, one problem is that asking the systemwide PID 1 doesn't help much in the very container case that triggered this, unless you are lucky enough to have the PID 1-associated framework know about every container on the system -- and we know where *that*'s gone, with a mass of argument bordering on open warfare over containerization systems and their degree of talking to systemd, and no sign of a resolution. The only way to fix this would be to hand a containerized upcall off to PID 1 in the relevant container. This would work great except that not all containers use the PID namespace, leaving you with nowhere to hand the upcall off to, so you're screwed; also, most containerization systems that do use PID namespaces seem to run nothing resembling init but just run the contained binary *as* PID 1 in the container: of course, the contained binary would have no idea how to handle these upcall messages, so you're screwed. (I think that running anything not an init as PID 1 is somewhere between horrifying and outright demented, but a lot of systems already do this, so we must allow for their existence.)

No, I don't think handing things off to PID 1 would work. It might work in an ideal universe in which every containerization system knew how to tell every init system that knew how to respond to these upcalls about the container's existence, but we do not live in that universe, and as long as containers are non-first-class objects and anyone can take a pile of random namespaces and call it a container, we will not live in it. (I exploit this fairly heavily on my systems, like, I expect, many others: you can kick off a "container" with a ten-line sudo-invoked shell script calling unshare(1), and with a couple of extra lines you can store enough state to have arbitrary other stuff pop in to join the container too. It's really flexible but y'know my random shell scripts invoking compilation environments in fs trees for various distros and the like really do not know how to handle upcalls and certainly aren't going to tell PID 1 about their existence either.)

Containers as kernel objects

Posted May 30, 2017 4:54 UTC (Tue) by MattJD (subscriber, #91390) [Link] (2 responses)

AFAIU, the argument is not to move everything to PID 1 or systemd. It's to move upcalls from calling random processes to communicating over some protocol (such as netlink). Systemd only enters the picture by providing inetd like services for netlink. This only removes the argument about having too many daemons running on the system. nix-coredump could just always run listening on netlink, no need for systemd/any particular PID 1.

This seems to have the best chance of capturing upcall related behaviour. A daemon can tell the kernel over netlink what namespaces it cares about (whether it's mount/network/etc) with the kernel enforcing security boundaries to avoid process escaping (maybe only allowing the current one?). The kernel can communicate about what namespace (when the information is available) as well, to allow a global monitor to process an upcall. This makes much more sense, and allows the daemon to be started in the context it wants, managed by the administrator (whether through init scripts or systemd/upstart/etc). The kernel doesn't have to dictate any of that, which seems a win.

This doesn't even need to invalidate your use case of customized containers. If you use any functionality requiring an upcall, it can be handled by an appropriate daemon of your design. And it doesn't require systemd, nor any particular PID 1, nor any functionality be in PID 1.

And to be clear, I'd be against a system that required PID 1 to be systemd, and would dislike a system requiring this functionality to be integrated in PID 1. Whether it's a good idea is a different question, but it shouldn't be a requirement.

Containers as kernel objects

Posted May 31, 2017 13:25 UTC (Wed) by nix (subscriber, #2304) [Link] (1 responses)

Didn't we just have a conversation about this, that the general consensus was that daemons receiving messages (over whatever protocol) are a terrible replacement for upcalls? Sure, using PID 1 solves one of the many problems, that the thing always has to be running, but replaces it with the other problem that you are now obliged to make PID 1 respond to it. Using "an appropriate daemon of your design" is no solution because unless you register the PID of that daemon with the kernel, or have the kernel run it, you *still* need to run it, not your contained program, as PID 1, which many, perhaps most, current containerization solutions are not bothering to do.

It appears you're suggesting having one daemon in the root namespace somehow communicate with the kernel and somehow partition the space of namespaces into those it cares about and those it doesn't (how it does this when it may not have been told about the existence of half of them, without tiresomely iterating over the lot, is unclear to me). This seems terribly complex and fragile, for almost no benefits over the current solution -- and all the complexity is layered into one of the most diverse parts of the Linux ecosystem, a place which is correspondingly hard to change in any coherent way.

Containers as kernel objects

Posted May 31, 2017 13:49 UTC (Wed) by MattJD (subscriber, #91390) [Link]

As I understood the article and comments, daemon were not wanted as you might have to run several different ones to handle all the upcalls. The suggested solution is to have a different daemon do socket activation for the relevant daemons (with the obvious suggestion being systemd, since it already supports this, but any would do).

I'm not sure exactly how such a system would look, as I'm not familiar with all the moving pieces nor use cases, so I was just throwing out a general picture. I was generally thinking that appropriate sockets would opened with the kernel for communication. Depending upon the namespace (and again I'm not familiar with all the details), the process would either need to be in the namespace or the kernel would need to identify the namespace to the process somehow. Ideally a process in a given namespace should be able to take over handling that namespace, which should allow any containerization solution to handle itself without caring what runs in the root/parent namespace.

That general sketch seems cleaner to me, as it moves policy about how to handle a given upcall to userspace, which seems to be line with the kernel's wishes. If your container is complicated enough to require upcall handling, then yes a process will need to run to listen for those events (whether it's your process itself, or some process started by the init of the container). Ideally container managers like docker/rkt could provide handling for their containers, to ease system administration. If you are hand rolling your own, you'll need to handle all these details. But that won't change from the status quo, where you still need something to handle the upcall. And many simple cases should avoid requiring this discussion all together, like your example of a build container.

Containers as kernel objects

Posted May 29, 2017 8:49 UTC (Mon) by mezcalero (subscriber, #45103) [Link] (1 responses)

On systemd systems PID 1 does not do coredump management. PID 1 supports service activation by socket and it is configured to activate the actual coredump handler if it the coredump upcall process asks it to and passes the kernel coredump pipe to it. On systemd systems activation logic is unified at one place, so that the execution environment of all services on the system is properly clean and managed.

Whether you run systemd or something else is just an implementation detail. Whatever you run, it's always highly problematic if you have unmanaged processes around, that live entirely outside of the context of the rest of the system resource-management-wise, security-wise, introspection-wise, monitoring-wise and runtime-wise.

The three most relevant upcalls are probably the modprobe requester, the cgroups agent and the coredump handler (at least on systemd systems). In the first case we turn it off these days in udev, and use the netlink logic, and in the latter two cases we install a small binary that notifies something else and exits quickly in order to keep the surface small for code that runs outside of any lifecycle management, resource-management, security management. The only logic next step is to avoid this altogether, and just notify that something else directly.

Upcalls are really little more than workarounds for dumb init systems which cannot do any form of event-based activation. I figure it's fine if they continue to exist, already for compat reasons, but I think it's important to get the message out that they are a hack, and a bad thing and that new mechanisms should use proper notification, the kernel has many better options for asking userspace to do things.

Containers as kernel objects

Posted May 30, 2017 0:23 UTC (Tue) by nix (subscriber, #2304) [Link]

Unfortunately, as noted in https://lwn.net/Articles/724078/ just now, I don't think this can be made to work: many containers do not use PID namespaces, many which do do not tell the root namespace's init system about their existence, and many of those also have things running as PID 1 that cannot have upcalls sent to them -- and this will always be true as long as you can make containers using a shell script and a bunch of calls to unshare(1). (The ability to do this is really convenient, but leaves little space for decreeing any sort of policy, let alone telling other systems about this new container, and little space for responding to upcalls via your new namespace's PID 1 -- which probably isn't even an init -- either).