Containers as kernel objects
Posted May 26, 2017 23:16 UTC (Fri) by nix (subscriber, #2304)
In reply to: Containers as kernel objects by mezcalero
Parent article: Containers as kernel objects
(Side note: at least one NFS developer is a sysvinit user. The likelihood of upcalls going away from NFSv4 in favour of just throwing everything at PID 1 is thus essentially nil, though it is possible that a PID 1-throwing *option* might be added. Removing everything in favour of your proposed PID 1 approach would break every non-systemd system out there, and also systemd systems too old to understand whatever this upcall request might turn out to be. Some people really do have existing systems to maintain and don't want gratuitous breakage of this sort of thing, thanks!)
Posted May 26, 2017 23:56 UTC (Fri)
by walters (subscriber, #7396)
[Link] (7 responses)
Posted May 28, 2017 0:06 UTC (Sun)
by nix (subscriber, #2304)
[Link] (6 responses)
Posted May 28, 2017 15:17 UTC (Sun)
by MattJD (subscriber, #91390)
[Link]
If having several daemons running is an issue, an alternate daemon could do systemd's part by listening and then launching the appropriate daemon on a notification.
I think Lennart was saying how systemd could implement it using its existing functionality. Nothing in that example implies to me that you have to run systemd, or that the listener has to be PID 1, if you are willing to reimplement some functionality.
Posted May 29, 2017 8:18 UTC (Mon)
by matthias (subscriber, #94967)
[Link] (4 responses)
Posted May 30, 2017 0:19 UTC (Tue)
by nix (subscriber, #2304)
[Link] (3 responses)
Well, one problem is that asking the systemwide PID 1 doesn't help much in the very container case that triggered this, unless you are lucky enough to have the PID 1-associated framework know about every container on the system -- and we know where *that*'s gone, with a mass of argument bordering on open warfare over containerization systems and their degree of talking to systemd, and no sign of a resolution. The only way to fix this would be to hand a containerized upcall off to PID 1 in the relevant container. This would work great except that not all containers use the PID namespace, leaving you with nowhere to hand the upcall off to, so you're screwed; also, most containerization systems that do use PID namespaces seem to run nothing resembling init but just run the contained binary *as* PID 1 in the container: of course, the contained binary would have no idea how to handle these upcall messages, so you're screwed. (I think that running anything not an init as PID 1 is somewhere between horrifying and outright demented, but a lot of systems already do this, so we must allow for their existence.)
No, I don't think handing things off to PID 1 would work. It might work in an ideal universe in which every containerization system knew how to tell every init system that knew how to respond to these upcalls about the container's existence, but we do not live in that universe, and as long as containers are non-first-class objects and anyone can take a pile of random namespaces and call it a container, we will not live in it. (I exploit this fairly heavily on my systems, like, I expect, many others: you can kick off a "container" with a ten-line sudo-invoked shell script calling unshare(1), and with a couple of extra lines you can store enough state to have arbitrary other stuff pop in to join the container too. It's really flexible but y'know my random shell scripts invoking compilation environments in fs trees for various distros and the like really do not know how to handle upcalls and certainly aren't going to tell PID 1 about their existence either.)
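For readers who have not built one of these ad hoc "containers", here is a minimal sketch of the idea. The actual script is not shown in this thread, so everything below is an assumption: the function names are hypothetical, the set of namespaces is a guess, and Python is used only to assemble command lines for the real unshare(1), chroot(8) and nsenter(1) tools from util-linux and coreutils. Note that nothing here registers the "container" with PID 1 or with anything else, which is exactly the point being made above.

```python
def unshare_argv(root, command, net=False):
    """Build an unshare(1) command line that starts `command` in fresh
    mount, UTS, IPC and PID namespaces, chrooted into `root`.

    Hypothetical helper: the namespace selection is an assumption.
    """
    argv = ["unshare", "--mount", "--uts", "--ipc", "--pid", "--fork"]
    if net:
        argv.append("--net")
    # chroot(8) runs inside the new namespaces and confines the command
    # to the distro tree.
    return argv + ["chroot", root] + list(command)


def nsenter_argv(target_pid, command, net=False):
    """Build an nsenter(1) command line that joins the namespaces of a
    process already running inside the "container".

    `target_pid` is the only state that has to be stored to let other
    programs join later; how it is recorded is left to the caller.
    """
    argv = ["nsenter", "--target", str(target_pid), "--root",
            "--mount", "--uts", "--ipc", "--pid"]
    if net:
        argv.append("--net")
    return argv + list(command)
```

Either list could be handed to `subprocess.run(["sudo"] + argv)`; both need root privileges, so the sketch only builds the arguments.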
Posted May 30, 2017 4:54 UTC (Tue)
by MattJD (subscriber, #91390)
[Link] (2 responses)
This seems to have the best chance of capturing upcall-related behaviour. A daemon can tell the kernel over netlink which namespaces it cares about (whether mount, network, etc.), with the kernel enforcing security boundaries to keep processes from escaping (maybe only allowing the current namespace?). The kernel can also communicate which namespace an upcall came from (when that information is available), to allow a global monitor to process it. This makes much more sense, and allows the daemon to be started in the context it wants, managed by the administrator (whether through init scripts or systemd/upstart/etc). The kernel doesn't have to dictate any of that, which seems a win.
This doesn't even need to invalidate your use case of customized containers. If you use any functionality requiring an upcall, it can be handled by an appropriate daemon of your design. And it doesn't require systemd, nor any particular PID 1, nor any functionality be in PID 1.
And to be clear, I'd be against a system that required PID 1 to be systemd, and would dislike a system requiring this functionality to be integrated in PID 1. Whether it's a good idea is a different question, but it shouldn't be a requirement.
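To make the "daemon listening on netlink" idea a little more concrete, here is a minimal sketch of the listening side using a channel that already exists: kernel uevents, the netlink messages that udev consumes. This is only an analogy. The namespace-aware registration described above is a proposal and has no kernel interface, so nothing below implements it; `parse_uevent` and `listen` are hypothetical names, and the protocol number 15 (NETLINK_KOBJECT_UEVENT) is spelled out by hand because it is not assumed to be exported by Python's socket module.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # from <linux/netlink.h>
KERNEL_UEVENT_GROUP = 1      # multicast group carrying kernel-originated events


def parse_uevent(raw):
    """Split a kernel uevent into its 'action@devpath' header and a dict
    of KEY=VALUE fields. Fields are separated by NUL bytes."""
    parts = [p for p in raw.split(b"\0") if p]
    header = parts[0].decode(errors="replace") if parts else ""
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.decode(errors="replace").partition("=")
        if sep:
            fields[key] = value
    return header, fields


def listen(handler, max_events=None):
    """Receive kernel uevents and pass each one to handler(header, fields).

    Linux only. Runs until max_events messages have been handled, or
    forever if max_events is None.
    """
    with socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                       NETLINK_KOBJECT_UEVENT) as sock:
        # Netlink addresses are (pid, groups); pid 0 lets the kernel pick.
        sock.bind((0, KERNEL_UEVENT_GROUP))
        seen = 0
        while max_events is None or seen < max_events:
            handler(*parse_uevent(sock.recv(65536)))
            seen += 1
```

A daemon built this way is started and supervised like any other service, which is the property being argued for: the kernel only emits the event, and userspace decides who handles it and in which context.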
Posted May 31, 2017 13:25 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
It appears you're suggesting having one daemon in the root namespace somehow communicate with the kernel and somehow partition the space of namespaces into those it cares about and those it doesn't (how it does this when it may not have been told about the existence of half of them, without tiresomely iterating over the lot, is unclear to me). This seems terribly complex and fragile, for almost no benefits over the current solution -- and all the complexity is layered into one of the most diverse parts of the Linux ecosystem, a place which is correspondingly hard to change in any coherent way.
Posted May 31, 2017 13:49 UTC (Wed)
by MattJD (subscriber, #91390)
[Link]
I'm not sure exactly how such a system would look, as I'm not familiar with all the moving pieces nor use cases, so I was just throwing out a general picture. I was generally thinking that appropriate sockets would be opened with the kernel for communication. Depending upon the namespace (and again I'm not familiar with all the details), the process would either need to be in the namespace or the kernel would need to identify the namespace to the process somehow. Ideally a process in a given namespace should be able to take over handling that namespace, which should allow any containerization solution to handle itself without caring what runs in the root/parent namespace.
That general sketch seems cleaner to me, as it moves policy about how to handle a given upcall to userspace, which seems to be in line with the kernel's wishes. If your container is complicated enough to require upcall handling, then yes, a process will need to run to listen for those events (whether it's your process itself, or some process started by the init of the container). Ideally container managers like docker/rkt could provide handling for their containers, to ease system administration. If you are hand-rolling your own, you'll need to handle all these details. But that won't change from the status quo, where you still need something to handle the upcall. And many simple cases should avoid requiring this discussion altogether, like your example of a build container.
Posted May 29, 2017 8:49 UTC (Mon)
by mezcalero (subscriber, #45103)
[Link] (1 responses)
Whether you run systemd or something else is just an implementation detail. Whatever you run, it's always highly problematic if you have unmanaged processes around, that live entirely outside of the context of the rest of the system resource-management-wise, security-wise, introspection-wise, monitoring-wise and runtime-wise.
The three most relevant upcalls are probably the modprobe requester, the cgroups agent and the coredump handler (at least on systemd systems). In the first case we turn it off these days in udev and use the netlink logic, and in the latter two cases we install a small binary that notifies something else and exits quickly, in order to keep the surface small for code that runs outside of any lifecycle management, resource management or security management. The only logical next step is to avoid this altogether, and just notify that something else directly.
Upcalls are really little more than workarounds for dumb init systems which cannot do any form of event-based activation. I figure it's fine if they continue to exist, if only for compat reasons, but I think it's important to get the message out that they are a hack and a bad thing, and that new mechanisms should use proper notification; the kernel has many better options for asking userspace to do things.
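Two of the upcalls named above can be inspected on a running system through procfs: `/proc/sys/kernel/modprobe` holds the path of the module-loading helper, and `/proc/sys/kernel/core_pattern` names a helper program when it starts with a pipe character. The sketch below only reads those settings. The function names are hypothetical, and the cgroup release agent is left out because its location depends on where the cgroup hierarchy is mounted.

```python
def read_setting(path):
    """Return the stripped contents of a procfs file, or None if the file
    is missing or unreadable (for example on a non-Linux system)."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def core_pattern_helper(pattern):
    """Return the helper program named by a core_pattern value, or None
    if core dumps go to a plain file instead of being piped to a helper."""
    pattern = pattern.strip()
    if not pattern.startswith("|"):
        return None
    fields = pattern[1:].split()
    return fields[0] if fields else None


def describe_upcalls():
    """Collect the modprobe and coredump helpers configured right now."""
    core_pattern = read_setting("/proc/sys/kernel/core_pattern") or ""
    return {
        "modprobe": read_setting("/proc/sys/kernel/modprobe"),
        "coredump": core_pattern_helper(core_pattern),
    }
```

On a systemd machine the coredump entry will typically point at the small notifier binary described above, though the exact value varies by distribution.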
Posted May 30, 2017 0:23 UTC (Tue)
by nix (subscriber, #2304)
[Link]