Containers as kernel objects

Posted May 30, 2017 4:54 UTC (Tue) by MattJD (subscriber, #91390)
In reply to: Containers as kernel objects by nix
Parent article: Containers as kernel objects

AFAIU, the argument is not to move everything into PID 1 or systemd. It's to move upcalls from calling random processes to communicating over some protocol (such as netlink). Systemd only enters the picture by providing inetd-like services for netlink, and that only answers the objection about having too many daemons running on the system. nix-coredump could just always run listening on netlink, with no need for systemd or any particular PID 1.

This seems to have the best chance of capturing upcall-related behaviour. A daemon can tell the kernel over netlink which namespaces it cares about (whether mount, network, etc.), with the kernel enforcing security boundaries so that a process cannot reach outside its own (maybe only allowing the current one?). The kernel can also report which namespace an event came from (when that information is available), to allow a global monitor to process an upcall. This makes much more sense, and allows the daemon to be started in the context it wants, managed by the administrator (whether through init scripts or systemd/upstart/etc). The kernel doesn't have to dictate any of that, which seems a win.
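To make the registration idea concrete, here is a toy user-space model of it in Python. Nothing in it is a real kernel or netlink interface: `UpcallRouter`, `register`, `register_global` and the string namespace names are all made up for illustration, and the policy of "a daemon may only claim its own namespace" is just the guess floated above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical handler signature: (namespace, event) -> result string.
Handler = Callable[[str, str], str]


@dataclass
class UpcallRouter:
    """Stands in for the kernel side of an imagined upcall protocol."""

    by_namespace: Dict[str, Handler] = field(default_factory=dict)
    global_monitor: Optional[Handler] = None

    def register(self, daemon_ns: str, wanted_ns: str, handler: Handler) -> bool:
        # Assumed security boundary: only the daemon's current namespace is allowed.
        if daemon_ns != wanted_ns:
            return False
        self.by_namespace[wanted_ns] = handler
        return True

    def register_global(self, daemon_ns: str, handler: Handler) -> bool:
        # Assumed policy: only a daemon in the root namespace may monitor everything.
        if daemon_ns != "root":
            return False
        self.global_monitor = handler
        return True

    def upcall(self, ns: str, event: str) -> Optional[str]:
        # Prefer the namespace's own daemon; otherwise fall back to the global
        # monitor, telling it which namespace the event came from.
        handler = self.by_namespace.get(ns) or self.global_monitor
        if handler is None:
            return None
        return handler(ns, event)
```

In this model a daemon in a namespace handles its own events, a root-namespace monitor picks up everything unclaimed, and an attempt to claim somebody else's namespace is refused at registration time rather than at delivery time.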

This doesn't even need to invalidate your use case of customized containers. If you use any functionality requiring an upcall, it can be handled by an appropriate daemon of your design. And it doesn't require systemd, nor any particular PID 1, nor that any functionality be in PID 1.

And to be clear, I'd be against a system that required PID 1 to be systemd, and would dislike a system requiring this functionality to be integrated in PID 1. Whether it's a good idea is a different question, but it shouldn't be a requirement.

Containers as kernel objects

Posted May 31, 2017 13:25 UTC (Wed) by nix (subscriber, #2304)

Didn't we just have a conversation about this, in which the general consensus was that daemons receiving messages (over whatever protocol) are a terrible replacement for upcalls? Sure, using PID 1 solves one of the many problems, that the thing always has to be running, but replaces it with the other problem that you are now obliged to make PID 1 respond to it. Using "an appropriate daemon of your design" is no solution because unless you register the PID of that daemon with the kernel, or have the kernel run it, you *still* need to run it, not your contained program, as PID 1, which many, perhaps most, current containerization solutions are not bothering to do.

It appears you're suggesting having one daemon in the root namespace somehow communicate with the kernel and somehow partition the space of namespaces into those it cares about and those it doesn't (how it does this when it may not have been told about the existence of half of them, without tiresomely iterating over the lot, is unclear to me). This seems terribly complex and fragile, for almost no benefits over the current solution -- and all the complexity is layered into one of the most diverse parts of the Linux ecosystem, a place which is correspondingly hard to change in any coherent way.

Containers as kernel objects

Posted May 31, 2017 13:49 UTC (Wed) by MattJD (subscriber, #91390)

As I understood the article and comments, daemons were not wanted because you might have to run several different ones to handle all the upcalls. The suggested solution is to have a different daemon do socket activation for the relevant daemons (with the obvious suggestion being systemd, since it already supports this, but any would do).
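For anyone who hasn't looked at socket activation from the activated daemon's side, here is a minimal sketch in Python of the systemd convention: the activator opens the sockets, then starts the daemon with `LISTEN_PID` and `LISTEN_FDS` set, and the inherited descriptors start at fd 3. The function name `inherited_fds` and its parameters are mine, and whether the descriptors handed over would be netlink sockets in any upcall replacement is an assumption, not something the kernel or systemd offers for upcalls today.

```python
import os
from typing import List, Mapping, Optional

SD_LISTEN_FDS_START = 3  # first inherited descriptor in the systemd convention


def inherited_fds(environ: Optional[Mapping[str, str]] = None,
                  pid: Optional[int] = None) -> List[int]:
    """Return the descriptors passed in by a socket activator, or [] if none."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        # LISTEN_PID guards against the variables leaking into a child process.
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        # Variables missing or malformed: we were not socket-activated.
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + max(count, 0)))
```

A handler daemon written this way only runs when the first message arrives, which is the whole point of the suggestion: one always-running activator instead of one always-running daemon per kind of upcall.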

I'm not sure exactly how such a system would look, as I'm not familiar with all the moving pieces nor use cases, so I was just throwing out a general picture. I was generally thinking that appropriate sockets would be opened with the kernel for communication. Depending upon the namespace (and again I'm not familiar with all the details), the process would either need to be in the namespace or the kernel would need to identify the namespace to the process somehow. Ideally a process in a given namespace should be able to take over handling that namespace, which should allow any containerization solution to handle itself without caring what runs in the root/parent namespace.
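One way to picture the "take over handling" part is as a nearest-ancestor lookup. The Python sketch below is purely illustrative: the namespace names, the `parent` map and `find_handler` are hypothetical, and real namespaces are not all arranged in a tree like this.

```python
from typing import Dict, Optional


def find_handler(ns: str,
                 parent: Dict[str, Optional[str]],
                 handlers: Dict[str, str]) -> Optional[str]:
    """Walk from ns towards the root and return the first registered handler."""
    current: Optional[str] = ns
    while current is not None:
        if current in handlers:
            return handlers[current]
        # A missing entry is treated the same as reaching the root.
        current = parent.get(current)
    return None


# Hypothetical layout: a build container nested under a container manager.
parent = {"root": None, "manager": "root", "build": "manager"}
handlers = {"root": "global-monitor"}
```

With only the root handler registered, an event from `build` ends up at `global-monitor`. Once something inside `manager` or `build` registers itself, it shadows the root handler for that subtree, so the container never needs to care what runs above it.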

That general sketch seems cleaner to me, as it moves policy about how to handle a given upcall to userspace, which seems to be in line with the kernel's wishes. If your container is complicated enough to require upcall handling, then yes, a process will need to run to listen for those events (whether it's your process itself, or some process started by the init of the container). Ideally container managers like docker/rkt could provide handling for their containers, to ease system administration. If you are hand-rolling your own, you'll need to handle all these details. But that won't change from the status quo, where you still need something to handle the upcall. And many simple cases should avoid requiring this discussion altogether, like your example of a build container.