A control group manager
CGManager is a year-old project to develop a daemon to manage control groups (cgroups) on a Linux system. These days, it is mostly targeted at doing that management for LXC containers, but it was originally envisioned as an alternative to systemd's cgroup management for those distributions that were not using systemd as their init. LXC maintainer Serge Hallyn gave a presentation about CGManager on October 13 at LinuxCon Europe in Düsseldorf, Germany.
![Serge Hallyn](https://static.lwn.net/images/2014/lce-hallyn-sm.jpg)
Hallyn began his talk by saying that he and others didn't really care about CGManager, per se, but needed the features that it currently provides. If there are alternatives that can still solve the problems that LXC has, he is not tied to keeping CGManager around. He hoped that conversations during the week (at LinuxCon, the containers track at Linux Plumbers, and the systemd hackfest) would be productive in that regard.
Background
Control groups started out as "task containers" when they were first introduced in 2007, Hallyn said. The idea was to add core kernel functionality to group tasks, along with code for tracking and limiting resource usage of each group as a whole. Task containers were eventually renamed and, over the years, controllers for resources like memory, CPU, block I/O, and so on have been added. Cgroups are administered through a filesystem interface.
Containers are an operating system (OS) level virtualization mechanism that uses many different kernel features to emulate virtual machines (VMs). Containers provide a separate, clean environment for the processes they contain using just the base OS, without any hardware support (unlike full virtualization solutions such as KVM). He called containers a "user space fiction" that builds on cgroups, bind mounts, namespaces, and other features to give the illusion of isolated systems. In addition, containers can be used without requiring privilege and they can be nested, he said.
CGManager was born out of the need for safe, unprivileged, nested containers. The cgroups maintainer (Tejun Heo) discourages the delegation of portions of the cgroup filesystem (cgroupfs) to unprivileged processes, Hallyn said, which was the mechanism used by LXC to support unprivileged containers in the past. Thus, CGManager avoids the need to grant safe access to cgroupfs to other processes and prevents processes from "escaping" into parent cgroups, even if they are root within the container.
In order to support that use case, he proposed the idea of CGManager in November 2013. Since that time, it has been developed and is in use by LXC, upstart, systemd-shim, and libvirt on Ubuntu and other systems.
Design
There is one CGManager daemon per host and requests to it are sent over D-Bus. The kinds of requests that are made are things like "create a new cgroup" or "move this process into that cgroup". D-Bus uses a Unix-domain socket, so the SCM_CREDENTIALS message can provide the UID, GID, and PID of the requesting process; the kernel will translate them to the appropriate value in the receiving namespace.
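The credential passing described here can be tried out with a plain Unix-domain socket pair. The following is a minimal sketch, not CGManager's code: it assumes Linux, and the function names `recv_with_credentials` and `demo` are made up for illustration. The receiver opts in with `SO_PASSCRED`, and the kernel then attaches the sender's PID, UID, and GID to each message as `SCM_CREDENTIALS` ancillary data.

```python
import os
import socket
import struct

# Layout of struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid
UCRED = struct.Struct("iII")


def recv_with_credentials(sock, bufsize=1024):
    """Receive one message plus the (pid, uid, gid) the kernel attached to it."""
    data, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_SPACE(UCRED.size)
    )
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            return data, UCRED.unpack(cdata[:UCRED.size])
    return data, None


def demo():
    """Send a request over a socket pair and return it with the sender's credentials."""
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # The receiver has to ask for credentials before the message is sent.
        server.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
        client.send(b"create cgroup")
        return recv_with_credentials(server)
    finally:
        client.close()
        server.close()
```

Because the kernel fills in the values, the receiver does not have to trust what the sender claims about itself. When sender and receiver are in different PID or user namespaces, the values arrive translated into the receiver's namespace.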
But when a request to move a process to a new cgroup is sent to CGManager, the PID used to identify it is local to the requesting process's namespace. CGManager would somehow have to translate that to the PID in the root namespace, but that is tricky to do. One possibility would be to use the setns() system call to put CGManager into the namespace of the requester. There are two problems there, however. For one, he wanted to be able to support kernels prior to the introduction of setns() in 3.0. More importantly, though, switching to the requester's namespace could cause CGManager to lose the privileges it needs to do its job, possibly including the ability to switch back to the root namespace.
Beyond those problems, though, he wanted users of CGManager to be able to send simple D-Bus requests, without adding credentials or doing anything special. So there is a proxy that lives in each container to accept simple D-Bus requests and translate them to requests with credentials to send to CGManager. It is worth noting that the proxies cannot chain, as their input and output have different characteristics; the proxies talk directly to CGManager, no matter how deeply nested they are. Chaining them could introduce performance problems for deeply nested containers, he said.
The standard socket for CGManager is created at /sys/fs/cgroup/cgmanager/sock. LXC bind mounts the /sys/fs/cgroup/cgmanager directory into each container. The proxy moves the cgmanager directory aside and puts its own socket in its place. That way, processes inside and outside of containers do not have to be aware of the difference.
Hallyn listed the 18 or so D-Bus methods that CGManager provides. They allow requesters to create cgroups, move processes to or from them, list processes in a cgroup, and so on. All of the requests are handled as being relative to the cgroup of the requester, so there is no way to create or access cgroups further up the hierarchy.
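The "relative to the cgroup of the requester" rule can be illustrated with a small path check. This is only a sketch of the idea under assumed names (`resolve_request` is hypothetical, and the talk did not describe how CGManager implements the check): every requested name is joined onto the requester's own cgroup, and anything that normalizes to a path outside that subtree is refused.

```python
import posixpath


def resolve_request(requester_cgroup, requested):
    """Map a requested cgroup name onto the host hierarchy.

    The result is always confined to the requester's own subtree;
    attempts to climb out with ".." raise PermissionError.
    """
    base = posixpath.normpath("/" + requester_cgroup.strip("/"))
    # A leading "/" in the request still means "my own cgroup", not the host root.
    candidate = posixpath.normpath(posixpath.join(base, requested.lstrip("/")))
    inside = candidate == base or candidate.startswith(base.rstrip("/") + "/")
    if not inside:
        raise PermissionError(f"{requested!r} escapes {base!r}")
    return candidate
```

For a requester in `/lxc/c1`, a request for `web` resolves to `/lxc/c1/web`, while a request for `../c2` is rejected.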
The GetValue and SetValue requests allow processes to set cgroup subsystem resource limits using the names currently exported by cgroupfs. There was discussion early on about creating an API to separate cgroup users from the exact names that are currently exported, but that has not happened. The idea is to allow the kernel to change those parameter names and other characteristics without requiring changes in various user-space programs. It is a "worthwhile goal", Hallyn said, but LXC has been exporting those names for longer than he has been maintaining it. For now, LXC will continue using them, but the project is willing to work on a higher-level API down the road.
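To make the idea of a higher-level API concrete, here is a purely hypothetical sketch of a translation layer. The abstract names on the left (`memory-limit` and so on) are invented for this example and are not part of CGManager or LXC; the file names on the right are ones that cgroupfs exported at the time. Callers would use only the abstract names, so a kernel-side rename would mean updating one table rather than every program.

```python
# Hypothetical abstract limit names -> (controller, cgroupfs file name).
ABSTRACT_LIMITS = {
    "memory-limit": ("memory", "memory.limit_in_bytes"),
    "cpu-weight": ("cpu", "cpu.shares"),
    "io-weight": ("blkio", "blkio.weight"),
}


def translate(name):
    """Return the (controller, file name) pair behind an abstract limit name."""
    try:
        return ABSTRACT_LIMITS[name]
    except KeyError:
        raise ValueError(f"unknown limit name: {name!r}") from None
```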
Future
There are several alternatives for the future of CGManager, he said. One possibility is to enhance cgroupfs to virtualize cgroups. That would mean cgroups would not be able to see other cgroups that are above (or at the same level) in the hierarchy. Currently, /proc/self/cgroup and cgroupfs leak information about other cgroups. One way to avoid that would be via cgroups namespaces that have been proposed by Aditya Kali. That, coupled with the ability to fully delegate parts of the hierarchy to other processes, would obviate the need for CGManager. Since the cgroups maintainer does not favor that approach, though, it is unlikely, Hallyn said.
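The leak through /proc/self/cgroup is easy to see by parsing the file, where each line has the form `hierarchy-ID:controller-list:path`. The sketch below is illustrative only; the function name `parse_proc_cgroup` and the sample paths in the note that follows are assumptions, not output from a real system.

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup.

    Returns a list of (hierarchy_id, [controllers], path) tuples.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The path may itself contain ":", so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries
```

Without a cgroup namespace, a process inside a container that reads a line such as `4:memory:/lxc/c1` learns its full path in the host's hierarchy, rather than seeing `/`.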
Another idea might be to move the functionality provided by CGManager into systemd. That would require enhancing the functionality of systemd slices and allowing users to specify sub-slices. In order to support the use cases that LXC users require, it would mean moving much of what CGManager does into systemd, and it is unclear whether that would be possible or not.
Lastly, CGManager could continue on. There are features that need work, including a higher-level API to abstract away the cgroupfs resource file names, that should be done in conjunction with others in the community. Support for the new "remove on empty" feature of cgroups is another. There is also work to be done on integrating with systemd, since CGManager will have to coexist with systemd on some systems. All of those topics were things he hoped to discuss with others during the week.
[ I would like to thank the Linux Foundation for travel assistance to Düsseldorf for LinuxCon Europe. ]
Index entries for this article | |
---|---|
Conference | LinuxCon Europe/2014 |
Posted Nov 5, 2014 19:31 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
The next step is to admit that actually replicating all the access controls in userspace is a not-so-smart task and add cgroups namespacing. With the single tree mode it's also a pretty logical thing to do.
Posted Nov 5, 2014 22:20 UTC (Wed)
by stgraber (subscriber, #57367)
That means that we can finally have containers mount cgroupfs and write assuming that / means their namespace rather than have to jump through hoops and bind-mount part of the hierarchy read-only and part of it writable.
Posted Nov 6, 2014 0:28 UTC (Thu)
by luto (guest, #39314)
They're not the full solution, though -- whatever manages cgroups on the host will have to play along to a limited extent. It remains to be seen whether systemd will do so. I imagine it will.
Posted Nov 7, 2014 5:53 UTC (Fri)
by CameronNemo (guest, #94700)
Posted Nov 6, 2014 2:31 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
With those patches, it's possible for a task that's already in a cgroup, say /cg1, to unshare the cgroup namespace and then appear as being in cgroup /, while that new / actually refers to /cg1 from the host's point of view.