Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:57 UTC (Tue) by smurf (subscriber, #17840)
In reply to: Shuttleworth: Losing graciously by Cyberax
Parent article: Shuttleworth: Losing graciously

Dbus gives you the mechanisms to do all of that. A file system interface does not.

I'm not involved with the details of which dbus call does, or will do, exactly what, and I like it that way. So, sorry but you'll have to get the actual implementation details from somebody else.

No, the kernel people are not "idiotic" when they want to impose a one-writer-only policy on the cgroups subsystem. It makes perfect sense to have one process arbitrate access instead of adding ACL support to cgroupfs and dealing with multiple processes stepping on each other's toes.

In any case, unless you can actually convince them to not enforce a single-writer policy after all, demanding that a multi-writer cgroupfs should continue to be available is … somewhat futile. Especially here; this is not a kernel mailing list.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 20:00 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (52 responses)

> I'm not involved with the details of which dbus call does, or will do, exactly what, and I like it that way. So, sorry but you'll have to get the actual implementation details from somebody else.
So I conclude that NOBODY knows how to do it. Does it not ring any alarm bells?

>No, the kernel people are not "idiotic" when they want to impose a one-writer-only policy on the cgroups subsystem.
Yes, they are. They are total idiots in this regard.

>It makes perfect sense to have one process arbitrate access instead of adding ACL support to cgroupfs and dealing with multiple processes stepping on each other's toes.
Why does it make perfect sense? What are the reasons? Can you point out a design document with them?

> In any case, unless you can actually convince them to not enforce a single-writer policy after all, demanding that a multi-writer cgroupfs should continue to be available is … somewhat futile. Especially here; this is not a kernel mailing list.
LKML is a dump. Asking there is an almost certain guarantee for a message to be lost.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 20:51 UTC (Tue) by fandingo (guest, #67019) [Link] (51 responses)

> So I conclude that NOBODY knows how to do it. Does it not ring any alarm bells?

It's because these features are currently being developed. They're not finished. None of this work will be completed until KDBus is merged.

I suggest that if you have such pressing concerns and questions about how all of this works, it's more appropriate to take your complaints to the developers directly.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:32 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (50 responses)

> It's because these features are currently being developed. They're not finished. None of this work will be completed until KDBus is merged.
Incorrect. Single-writer mode is already there and KDBus is going to be an optional dependency for a long time even after that.

All the policy mechanisms are already there. Yet nobody here can tell me how to do the simplest thing possible - delegate a DBUS subtree to a user.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:41 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (49 responses)

You don't delegate subtrees of DBus APIs. I imagine that requests for changes to cgroup would come with a parameter for which subtree to apply the changes to. By default, everyone passes '/' as the subtree to apply things to, but if you're delegating a subtree, pass '/machine/vm0' as the subtree. You then authenticate the caller against who is allowed to manage the '/machine/vm0' subtree. Or you attach to the 'org.freedesktop.systemd.cgroupdelegate1' interface at the '/machine/vm0' path and call methods there (everyone else calls the 'org.freedesktop.systemd.cgroup1' interface methods).
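
To make the "authenticate the caller against the subtree" idea concrete, here is a rough Python sketch of the check a manager might run before touching anything. Everything in it is made up for illustration: the DELEGATIONS table, the uids and the function names are assumptions, not anything systemd actually exposes.

```python
import posixpath

# Hypothetical delegation table: subtree -> uids allowed to manage it.
DELEGATIONS = {
    "/": {0},
    "/machine/vm0": {1000},
}

def is_within(subtree, path):
    """True if path is the subtree itself or lies somewhere below it."""
    subtree = posixpath.normpath(subtree)
    path = posixpath.normpath(path)  # collapses "..", so "vm0/../vm1" is vm1
    if subtree == "/":
        return path.startswith("/")
    return path == subtree or path.startswith(subtree + "/")

def may_manage(uid, path):
    """A caller may manage path if some subtree delegated to it contains path."""
    return any(uid in uids and is_within(subtree, path)
               for subtree, uids in DELEGATIONS.items())
```

The point of normalizing before comparing is that a delegate for '/machine/vm0' must not be able to reach '/machine/vm1' by way of '..', and a plain prefix match would also wrongly accept '/machine/vm00'.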

I have a feeling that your emotions here are getting in the way of seeing potential solutions and mixing up pieces of information. May I suggest putting your concerns into a wiki page of some sort so that responses to them aren't spread across umpteen LWN articles and subthreads?

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:51 UTC (Tue) by fandingo (guest, #67019) [Link]

This is correct, but just to clarify:

Telling systemd that user X (even if that's a user that needs to be resolved several times from namespaces) is allowed to perform actions A,B,C on a subtree does not exist presently. That's why no one can tell Cyberax what API calls are needed.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 22:09 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I'm now reading the kernel and systemd code. I still don't understand how authentication is going to work. Also, what happens next?

Suppose that I have systemd managing the root host and I create a cgmanager-based container. Suppose that they interoperate, so somehow cgmanager should connect to the host? How do I pass credentials for it? Or does systemd simply trust the first connection?

Then the cgmanager connects to its local DBUS running inside the container and starts serving its local clients. KDBus doesn't really change this, their namespaces would be separate.

Then the next question, would the cgmanager-based partitions be visible in the global manager? Probably yes, since there's no delegation. However, access rights would definitely be lost because cgmanager is probably going to implement its own policies. So there's not going to be any way to check what users and/or containers are using subtrees.

Still a mess. I guess I'll have to dive into it headfirst and try to make sense of it.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 22:46 UTC (Tue) by fandingo (guest, #67019) [Link]

Users inside a namespace are mapped to UIDs outside the namespace. There are more details here http://lwn.net/Articles/532593/, but it seems that some privileged inner UID needs to be mapped.
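
As a small illustration of that mapping, here is a Python sketch that resolves an inner UID through the text of a /proc/PID/uid_map file, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The function name and the sample map are assumptions for the example only.

```python
def outer_uid(uid_map_text, inner_uid):
    """Translate a UID inside a user namespace to the UID it maps to outside.

    uid_map_text is the content of a uid_map file: one range per line,
    "<first inside> <first outside> <count>". Returns None if unmapped.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, count = (int(f) for f in fields)
        if inside <= inner_uid < inside + count:
            return outside + (inner_uid - inside)
    return None

# Hypothetical map: container UIDs 0..65535 live at 100000..165535 outside.
SAMPLE_MAP = "         0     100000      65536\n"
```

With that sample, root inside the container (UID 0) is UID 100000 outside, which is the identity an outer policy would have to be written against.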

Here's my understanding on how it would work. Besides some terminology, the model seems common between systemd and CGManager.

1) Bind mount the DBus socket dir into where the container will run.
2) Create cgroup and apply controllers as needed.
3) User namespace is created.
4) Root has an outer UID mapped. If the cgroup manager runs as a different user, it is also mapped.
5) PolicyKit is updated to allow access to the mapped user on the specific cgroup subtree.
6) The container OS boots.
7) The cgroup manager inside the container connects to the DBus socket. (This does not serve as the system bus inside the container. That is separate.)
8) The cgroup manager inside the container attaches to its system bus.

It's expected that the container software takes care of at least 3-8 and possibly the first two as well.

Operation:

A process inside the container wants to make a cgroup modification.

1) It connects to DBus (or cgroupfs if desired and CGManager is running inside the container) and sends the request.
2) PolicyKit (or file system permissions if using cgroupfs via CGManager) inside the container authorizes the action.
3) The cgroup manager inside the container accepts the command, and sends the command over the DBus socket to the outer cgroup manager.

4) The outer cgroup manager receives the message.
5) The outer cgroup manager translates the inner cgroup path to its relative position outside and consults PolicyKit for authorization. If authorized, the cgroup action is completed. A return message (with properly sanitized path) is sent across the socket to the inner cgroup manager, which forwards the message to the process that initiated the call.
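
The path translation in step 5 could look roughly like the following Python sketch. It is only my guess at the shape of it: the function names are hypothetical, and it assumes the container's subtree root is some path other than '/'.

```python
import posixpath

def to_outer(container_root, inner_path):
    """Map a cgroup path as seen inside the container onto the outer hierarchy."""
    if not inner_path.startswith("/"):
        raise ValueError("inner path must be absolute")
    # normpath on an absolute path resolves ".." without climbing above "/",
    # so the result can never leave the container's subtree.
    inner = posixpath.normpath(inner_path).lstrip("/")
    root = posixpath.normpath(container_root)
    return posixpath.join(root, inner) if inner else root

def to_inner(container_root, outer_path):
    """Sanitize an outer path for the reply: strip the container's prefix."""
    root = posixpath.normpath(container_root)
    outer = posixpath.normpath(outer_path)
    if outer == root:
        return "/"
    if not outer.startswith(root + "/"):
        raise ValueError("path is outside the container's subtree")
    return outer[len(root):]
```

So a request for '/web' from a container rooted at '/machine/vm0' would be authorized and applied as '/machine/vm0/web', and the reply would mention only '/web'.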

=====

Let's say that some process inside the container wants delegation of a part of the container's subtree. That authorization doesn't take place in the outer PolicyKit. It happens inside.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 15:32 UTC (Wed) by HelloWorld (guest, #56129) [Link] (45 responses)

I have to say that I tend to sympathize with Cyberax here. cgroups are a hierarchy of objects, and we have an API to manipulate those: the file system. Of course you can build what essentially amounts to a copy of the file system API with D-Bus, but you'll lose all the tool support along the way, and people won't know how to use it, so why bother?

By the way, I feel similarly with regard to cgroups as a whole. We already have a process hierarchy, why do we need another one? Of course the problem with the traditional process hierarchy was that processes could escape by double-forking, but that was fixed with prctl(PR_SET_CHILD_SUBREAPER). So why do we need cgroups at all?

Oh well, by now it's probably too late to change any of this.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 16:22 UTC (Wed) by fandingo (guest, #67019) [Link] (17 responses)

I'm wondering what tool support actually exists. It seems more likely that there's a mash of shell scripts that mkdir, chmod, chown, and echo their way through using cgroups. That's pretty lousy. I've already posted the warnings from cgroups.txt on using echo. If you're using something beyond a shell script, you'll end up with a better program by interfacing with DBus if only due to better return information and error handling.

> Of course the problem with the traditional process hierarchy was that processes could escape by double-forking, but that was fixed with prctl(PR_SET_CHILD_SUBREAPER). So why do we need cgroups at all?

Doesn't that require one of the following two situations?

* Well behaved main process of the service that sets PR_SET_CHILD_SUBREAPER, so none of its descendants escape. If this process ever dies, is killed without cleaning up (e.g. SIGKILL), or fails to set itself as the subreaper, processes can escape the hierarchy.

* The init system has to maintain a process for each service that sets itself as the subreaper. It's not responsible for anything besides executing/stopping/killing the service, and cleaning up PIDs. That certainly adds a lot of overhead and complexity. Without dedicating a process to each service, you just end up with everything having subreaper set to PID 1 (or whatever a modular service manager runs as); these service hierarchies would all point to the same parent, making it impossible to distinguish between them.

The first option does not seem appealing because it requires a significant amount of trust in the service, there are reliability concerns, and the developers of each service need to do work to explicitly support this init model.

The second option mainly suffers from complexity. Init has far more processes running as part of its service management. Some IPC mechanism would be needed to track these processes and allow start/stop/restart/kill/etc. commands from the user to the service manager to the hierarchy manager to work.

Subreaper is designed to be used for proper process cleanup, not for tracking process hierarchies.
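
For anyone who wants to see the reparenting behaviour for themselves, here is a Linux-only Python sketch using ctypes. The prctl option numbers are the ones from <linux/prctl.h>; the helper names are my own, and this is a demonstration rather than how any init system does it.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # values from <linux/prctl.h>, Linux 3.4+
PR_GET_CHILD_SUBREAPER = 37

libc = ctypes.CDLL(None, use_errno=True)

def _prctl(option, arg2):
    zero = ctypes.c_ulong(0)
    if libc.prctl(ctypes.c_int(option), arg2, zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

def become_subreaper():
    """Ask the kernel to reparent our orphaned descendants to us."""
    _prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1))

def is_subreaper():
    flag = ctypes.c_int(0)
    _prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag))
    return bool(flag.value)

def orphan_parent_pid():
    """Double-fork, then return the PID the orphaned grandchild sees as parent."""
    report_r, report_w = os.pipe()  # grandchild -> us
    gate_r, gate_w = os.pipe()      # we close gate_w to release the grandchild
    middle = os.fork()
    if middle == 0:
        os.close(report_r)
        os.close(gate_w)
        if os.fork() == 0:
            os.read(gate_r, 1)  # blocks until the gate's write end is closed
            os.write(report_w, f"{os.getpid()} {os.getppid()}".encode())
        os._exit(0)  # middle exits at once; grandchild exits after reporting
    os.close(report_w)
    os.waitpid(middle, 0)  # middle is reaped, so the grandchild is an orphan
    os.close(gate_w)
    grandchild, parent = map(int, os.read(report_r, 64).split())
    os.close(report_r)
    os.close(gate_r)
    if parent == os.getpid():
        os.waitpid(grandchild, 0)  # it is our child now, so reap it
    return parent
```

After become_subreaper(), orphan_parent_pid() should report the caller's own PID instead of 1. The tracking only lasts as long as the subreaper process itself stays alive.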

Lastly, the ability to set resource limits is limited. It would be possible to use something like setrlimit or prlimit, but those are both process-specific, and don't allow the flexibility of group-based resource limits.
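
To show what "process-specific" means in practice, here is a short Python sketch using the standard resource module. The helper name and the cap of 256 are arbitrary choices for the example.

```python
import resource

def cap_soft_limit(res, cap):
    """Lower this process's soft limit for res to at most cap.

    The limit belongs to the calling process. Children inherit a copy at
    fork time, but each one is then counted against its own copy, so this
    cannot express "these N processes may use X in total".
    """
    soft, hard = resource.getrlimit(res)
    if soft == resource.RLIM_INFINITY or soft > cap:
        soft = cap
    resource.setrlimit(res, (soft, hard))  # lowering the soft limit needs no privilege
    return resource.getrlimit(res)

new_soft, new_hard = cap_soft_limit(resource.RLIMIT_NOFILE, 256)
```

Every forked child would start with the same 256-descriptor ceiling, but ten children could still hold 2560 descriptors between them.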

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:16 UTC (Wed) by HelloWorld (guest, #56129) [Link] (4 responses)

> * The init system has to maintain a process for each service that sets itself as the subreaper. It's not responsible for anything besides executing/stopping/killing the service, and cleaning up PIDs. That certainly adds a lot of overhead and complexity.
I don't think so. Processes are cheap on Linux, and I don't think that reaping child processes is likely to ever be a bottleneck for any realistic program.

> Some IPC mechanism would be needed to track these processes and allow start/stop/restart/kill/etc. commands from the user to the service manager to the hierarchy manager to work.
Where “Some IPC mechanism” would obviously be D-Bus, which makes this sort of thing very easy.

> Lastly, the ability to set resource limits is limited. It would be possible to use something like setrlimit or prlimit, but those are both process-specific, and don't allow the flexibility of group-based resource limits.
That's not a fundamental limitation. Just allow processes to specify that an rlimit is supposed to apply to them as well as their descendants, and PR_SET_CHILD_SUBREAPER ensures that your descendants can't reparent themselves to init. So I think this whole thing could be made to work fine if somebody bothered to do the work. Otoh, I'm not sure if anything would actually be gained by doing that instead of cgroups.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:43 UTC (Wed) by fandingo (guest, #67019) [Link] (2 responses)

> Just allow processes to specify that an rlimit is supposed to apply to them as well as their descendants, and PR_SET_CHILD_SUBREAPER ensures that your descendants can't reparent themselves to init.

That's true if and only if that ancestor reaper never dies or is killed. The security implications complicate things. It should be possible to overcome them, but the warts add up.

> That's not a fundamental limitation. Just allow processes to specify that an rlimit is supposed to apply to them as well as their descendants[...] Otoh, I'm not sure if anything would actually be gained by doing that instead of cgroups.

I totally agree, but it would require a change to those functions (or new recursive versions).

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:59 UTC (Wed) by HelloWorld (guest, #56129) [Link] (1 responses)

> That's true if and only if that ancestor reaper never dies or is killed.
So what? systemd is already required to never die.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:29 UTC (Wed) by fandingo (guest, #67019) [Link]

If the reaper dies, then processes in that service have escaped their "container" (because their PPID is now 1). The service manager can no longer track them, and has lost all reliable control of the service.

This becomes a major problem with privileged services. Many services maintain a parent process that runs as root. Any compromise of this privileged service process (or malfeasance by it) allows it to kill its hierarchy manager process and escape all control.

On the other hand, cgroups in a single-writer environment should be immune to this.

With systemd cgroup manager, PolKit would not authorize a process move outside all cgroups or to another cgroup (outside specific definitions like system.slice/sshd.service/ --> /user.slice/session.scope/).

The major benefit to systemd's cgroup manager is that it is not attackable via this style. It cannot be intentionally killed (it ignores all signals, even SIGKILL, since it is PID 1), and if it were somehow forced to crash, the system would panic. Since PID 1 is the cgroup manager, there is no way to gain control of the kernel interface either.

There's no meaningful way to protect a reaper, unless you mandate that nothing in a hierarchy can run with enough privileges to kill the reaper. That would require a substantial change in many services, or requires additional sandboxing mechanisms in the kernel. (The kernel would need to perform a check that a caller of kill(2) is not trying to kill its reaper.)

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:58 UTC (Wed) by smurf (subscriber, #17840) [Link]

Umm … did you ever happen to run across the idea that just maybe process hierarchies and cgroups are a VERY bad fit? I can think of a couple of use cases where that wouldn't work at all well.

What if I want to fork off a bundle of programs which need to share the same memory limit (i.e. 200MBytes for all of them in sum, not individually … like for instance all the processes in James' sessions … and what if James logs in with X *and* with ssh)?

What if I realize, after starting my disk copy program, that it eats too much memory / disk bandwidth, and I want to retroactively park it in a more limiting cgroup? Does my shell suddenly need to know about that stuff?

Sorry -- won't work.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:42 UTC (Wed) by HelloWorld (guest, #56129) [Link] (11 responses)

> I'm wondering what tool support actually exists. It seems more likely that there's a mash of shell scripts that mkdir, chmod, chown, and echo their way through using cgroups. That's pretty lousy.
It's not lousy, there's nothing wrong with that! Those are tools that every admin knows and uses, and that's a Good Thing. The reason devices are exposed as “files” in /dev is precisely that one can do things like access control just as if they were proper files. Do you want to replace that too? It's certainly possible to give udev a D-Bus interface and use fd passing to open device files!

> I've already posted the warnings from cgroups.txt on using echo. If you're using something beyond shell script, you'll end up with a better program by interfacing with DBus if only due to better return information and error handling.
“We can't use a file system based interface because bash's echo builtin is broken” is about as lame an excuse as it gets. The answer to that is to fix bash or to use printf.
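
Whatever the exact failure mode of echo is, anything that issues the write itself gets the error back directly. A minimal Python sketch, where the function name is hypothetical and the path is whatever cgroup.procs file you are allowed to write to:

```python
import os

def add_pid(procs_path, pid):
    """Write one PID with a single write(2); raise OSError if it is refused."""
    data = f"{pid}\n".encode()
    fd = os.open(procs_path, os.O_WRONLY)
    try:
        written = os.write(fd, data)  # a rejected write raises OSError here
        if written != len(data):
            raise OSError(f"short write to {procs_path}")
    finally:
        os.close(fd)
```

Something like add_pid("/sys/fs/cgroup/yaddah/cgroup.procs", 42) then either succeeds or raises with the errno from the kernel, which is all the "return information and error handling" a file interface needs.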

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:48 UTC (Wed) by smurf (subscriber, #17840) [Link] (10 responses)

> The reason devices are exposed as “files” in /dev is precisely that
> one can do things like access control just as if they were proper files.

"It behaves like a plain file" doesn't work for quite a few device nodes, and most Linux subsystems are not controlled by echo: You don't emit sound by "cat rhapsody.wav >/dev/snd" these days, and you don't resize an LVM partition by "echo 10TB >/sys/devices/virtual/block/volgroup/master/varlog/size".

This is Linux. This is not Plan 9 where you can open a TCP connection with mkdir. cgroupfs is fine for introspection, but control? that always seemed a bit unnatural to me.

Besides, pragmatically, a sensible "cgroupctl"-style program will have a --help option and a manpage. To me that seems a lot more useful than traipsing around in cgroupfs and wondering which magic mkdir+echo+mv combo I need to evoke to limit my disk copy program's memory usage.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 20:22 UTC (Wed) by HelloWorld (guest, #56129) [Link] (9 responses)

> "It behaves like a plain file" doesn't work for quite a few device nodes,
So what? Just because you can't read(2) or write(2) to some device nodes doesn't mean you need to use another interface for things like poll(2) or chmod(2). Stop thinking about the “file system” and start thinking about a general hierarchical namespace for all kinds of objects. This is where we are today with files, sockets, fifos, device files, etc. It's only natural to extend that further.

> This is Linux. This is not Plan 9 where you can open a TCP connection with mkdir.
Uh, I know this is Linux and not Plan 9. How is that supposed to be an argument? We should learn from Plan 9 instead of taking that kind of “us vs. them” stance.

> cgroupfs is fine for introspection, but control? that always seemed a bit unnatural to me.
And to me it seems unnatural that access control for cgroups is supposed to be done through a completely different mechanism than access control to files or devices. Though I agree with you that the current cgroups API isn't ideal. For one thing, I think the natural thing is to use
ln /proc/42 /sys/fs/cgroup/yaddah/cgroup.procs
and not
echo 42 > /sys/fs/cgroup/yaddah/cgroup.procs
to add processes to a cgroup.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 21:17 UTC (Wed) by smurf (subscriber, #17840) [Link] (8 responses)

> And to me it seems unnatural that access control for cgroups
> is supposed to be done through a completely different mechanism
> than access control to files or devices

I strongly suspect that the main reason for that is because you're used to it.

> ln /proc/42 /sys/fs/cgroup/yaddah/cgroup.procs

Linking.
Across file systems.
Yeah, right.

Sorry, but this is the point where I stop responding to you.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 21:44 UTC (Wed) by HelloWorld (guest, #56129) [Link] (7 responses)

> Linking.
> Across file systems.
> Yeah, right.
So what? It's not allowed for conventional file systems because it doesn't make sense there. It does make sense for this case, so there's no reason for it not to be allowed.

> Sorry, but this is the point where I stop responding to you.
You're acting as if I had somehow offended you. I haven't.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:06 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (6 responses)

Not only are you linking across filesystems (how would one find out that it is hardlinked elsewhere?), but you're hardlinking a directory. When process 42 ends, does the "hardlink" disappear? If not (as one might expect of hardlinks), does a new process with PID 42 get put there? The /sys and /proc filesystems are already pretty magical, but that magic only covers read and write (AFAIK), not who knows how many other syscalls as well. Really, even echoing the PID to a file is racy. I'd much rather have something like a procfd to use here.

These behaviors you're asking for are quite different than the usual semantics these tools imply. Sure, filesystems and cgroups are both hierarchical, but there is such a thing as stretching a metaphor too far. To make a meta-metaphor: Should we abandon databases and just use spreadsheets instead? Vice versa? They're both "just" grids of data cells.

Shuttleworth: Losing graciously

Posted Feb 20, 2014 0:13 UTC (Thu) by HelloWorld (guest, #56129) [Link] (5 responses)

Alright, you have a point. Using link(2) is probably not a good idea.

Shuttleworth: Losing graciously

Posted Feb 20, 2014 2:08 UTC (Thu) by MrWim (subscriber, #47432) [Link] (4 responses)

rename() might be though. AFAIU all pids have to appear in the cgroup tree so to put a pid in a cgroup you have to remove it from another. You would need permissions for both cgroups and it happens atomically.

Shuttleworth: Losing graciously

Posted Feb 20, 2014 15:20 UTC (Thu) by HelloWorld (guest, #56129) [Link] (3 responses)

rename() was my first thought. But that would remove the process from the /proc directory, and that doesn't really make sense, does it?

Shuttleworth: Losing graciously

Posted Feb 20, 2014 15:23 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

I think the suggestion was to move the pid from one cgroup directory to another, not from /proc.

Shuttleworth: Losing graciously

Posted Feb 20, 2014 17:34 UTC (Thu) by HelloWorld (guest, #56129) [Link] (1 responses)

Well, that would work, but then how do you move a process that isn't a member of any cgroup into one?

Shuttleworth: Losing graciously

Posted Feb 20, 2014 18:15 UTC (Thu) by MrWim (subscriber, #47432) [Link]

That's what I meant by "all pids have to appear in the cgroup tree so to put a pid in a cgroup you have to remove it from another". My assumption is that there cannot be a process which isn't a member of any cgroup. If init starts in a cgroup and its children end up in the same cgroup and there's no way of unlink()ing pids from the cgroup tree then you're guaranteed that every process is in the tree.

In that setup you can't steal other users' processes and put them in your subtree, you can only move pids around in the trees you own. You can then use whichever cgroup manager that you desire in your subtree. Containers work while still only co-operating with the kernel, rather than having to communicate with other user-space programs running outside.
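
Here is a toy Python model of the semantics I have in mind. To be clear, this is not how the kernel behaves today; the class and its methods are invented purely to show the rule "you need permission on both the source and the destination".

```python
class CgroupTree:
    """Toy model: every pid is in exactly one cgroup; moves are rename-like."""

    def __init__(self):
        self.owner = {"/": 0}         # cgroup path -> owning uid
        self.members = {"/": set()}   # cgroup path -> pids in it

    def mkdir(self, path, uid):
        self.owner[path] = uid
        self.members[path] = set()

    def cgroup_of(self, pid):
        for path, pids in self.members.items():
            if pid in pids:
                return path
        raise KeyError(pid)

    def spawn(self, pid, parent=None):
        # A new process starts in its parent's cgroup (init starts in "/").
        group = self.cgroup_of(parent) if parent is not None else "/"
        self.members[group].add(pid)

    def rename(self, uid, pid, src, dst):
        # Permission is required on BOTH ends, as with rename(2) on directories.
        for path in (src, dst):
            if uid != 0 and self.owner[path] != uid:
                raise PermissionError(f"uid {uid} does not own {path}")
        if pid not in self.members[src]:
            raise FileNotFoundError(f"pid {pid} is not in {src}")
        # There is no unlink: the pid leaves src only by arriving in dst.
        self.members[src].remove(pid)
        self.members[dst].add(pid)
```

Because spawn and rename are the only ways to change membership, a pid can never end up outside the tree, and a user who owns only '/alice' and below has no way to pull pid 1 out of '/'.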

Shuttleworth: Losing graciously

Posted Feb 19, 2014 16:29 UTC (Wed) by paulj (subscriber, #341) [Link] (26 responses)

+1 on the parent. Cyberax makes good points.

Further, no one has been able to explain why this is better implemented in user-space via DBus. There only seem to be assertions from Tejun on mailing lists that there are problems, but pretty much no detail on what those problems are. Worse, nothing, not even in the abstract, on why these problems would be any easier to tackle in user-space.

If the argument is that it is difficult to get multi-writer access to filesystems right, or multi-writer setting of permissions, then the kernel surely has bigger problems.

If the issue is ABI stability, those problems also do not magically become easier in user-space. Except perhaps that it is easier to circumvent Linus' determination to keep the kernel:user-space ABI stable (?) by simply not having one.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:31 UTC (Wed) by fandingo (guest, #67019) [Link] (25 responses)

There are plenty of concerns with the status quo. http://www.linux.com/news/featured-blogs/200-libby-clark/... and https://lwn.net/Articles/574317/ identify many of the issues.

From my perspective, the two most notable deficiencies are:

* Security. Delegating direct access to kernel interfaces is dangerous. The kernel is not running anything resembling a full security policy and shouldn't be expected to. Commands which have the ability to drastically affect a running system need to be vetted by something, and it only makes sense to do that in user space. Traditional UNIX ownership and permissions are insufficient for building comprehensive policies. While this may not be a concern to some, it's a major weakness. There's no excuse for providing an insecure interface to the kernel.

* Exposing too much interior detail. If the cgroups API had been more abstract from the start, it probably would have been possible to fix at least some of the other deficiencies. Unfortunately, cgroupfs exposes too much internal information, making it, in Tejun's opinion, infeasible to fix without major changes.

> If the argument is that it is difficult to get multi-writer access to filesystems right, or multi-writer setting of permissions, then the kernel surely has bigger problems.

Not all kernel developers work on every part of the kernel. The people who maintain and develop cgroups have decided that this is the highest priority undertaking for them.

> If the issue is ABI stability, those problems also do not magically become easier in user-space. Except perhaps that it is easier to circumvent Linus' determination to keep the kernel:user-space ABI stable (?) by simply not having one.

The kernel will still have an ABI for user space. I'm not sure why people keep saying otherwise. It's only usable by one process, but it's certainly still there. And for user space, it actually does become substantially easier. Just look at CGManager, which has decided to support two APIs simultaneously.

If there is a need to modify the cgroups ABI in the future, it's so much easier. Rather than having hundreds of thousands of users (everything from systemd to shell scripts) that will be impacted, it's just a handful of cgroup managers.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:10 UTC (Wed) by paulj (subscriber, #341) [Link] (20 responses)

On security: If delegating direct access to kernel interfaces is inherently dangerous (richness/flexibility of security models excepted), but user-space mediation is not, then we need to ban all direct filesystem access, and make all code access files and data via IPC to a management daemon.

I simply don't think that's the case. If we can't do this securely in the kernel, the user-space mediator won't be any better (certainly not when it's coded in a similar style), security wise (with that exception). Otherwise, please explain how user-space is more secure?

With regard to the Unix ownership/permissions model:

1. This may suffice for many. There's little evidence, in the history of operating systems, of complex security models getting wide-spread end-user use.

2. However, it is incorrect to say the kernel to userspace delegation API is limited to Unix owner/group permissions. It also allows for ACLs - which an fs based cgroups API (not necessarily the current version) could implement.

3. Even if Unix perms and/or ACLs *were* still insufficient for all users, that is *NOT* a reason to not offer the FS API. If a user-space daemon wants to offer some other security model on top, the existence of the FS API does not stop that. They can live together. Why does it mean the fs interface has to be removed?

On exposing too much detail: Then that's a problem of the current cgroupfs API. Fix it with a new one. Why is it better for that new API to live in user-space?

One of the problems Tejun has is that the fs API *allows delegation*, and hence allows an admin to give resources to non-privileged processes that might affect other users/processes. But isn't that perhaps inherent to an over-committed resource sharing system like normal Unix/Linux? Further, if the problem is inherently to do with the delegation, how will a user-space API that allows delegation fix things? The answer, if delegation really is a problem, must be that that delegation has to be removed. There is no reason this can't be done in a kernel API, surely? Or would you argue it is easier to remove things in userspace APIs? (That argument would scare me).

Why not just try and get the kernel API right? Why will it be any easier to get things right if the thousands of shell scripts are calling dbus-send instead of writing to a virtual fs? How will having this in user-space make it any easier to deal with all the thousands of users, just cause they talk to a manager instead of the kernel?

It very much sounds like the kernel cgroups people simply don't want to have to work out the details of what is needed, and so want to punt it to user-space. It sounds almost a social problem, more than a technical one.

Lastly, on the "not all kernel developers are familiar with implementing a virtual fs" issue - they can ask for help, surely. :) Eventually viro, or someone similar, will get annoyed enough to further extend VFS and library support to ease implementation of virtual fses, if needed. As (IIRC) happened yonks ago when he got fed up with procfs and others. ;)

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:55 UTC (Wed) by fandingo (guest, #67019) [Link] (9 responses)

I think the primary reason why it was moved out of the kernel is that the policies authorizing access are not simple and may not follow traditional methods. Here are a couple of situations where non-traditional policies may be desired:

* Only allow PID X to manage cgroup subtree /A/B/C.
* Processes in subtree /A/B/C/ may move their processes in /D/E/F but should not have any further control.

Traditional file permissions and even ACLs are incapable of dealing with either situation. In the first case, there's no way to grant that access without also allowing all processes owned by that same user/group to control /A/B/C. In the second situation, there's no way to allow a one-way move.
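
By contrast, both rules are a few lines in a user-space policy check. The following Python sketch is purely illustrative: the Request fields, the MANAGER_PID value and the authorize function are assumptions, not any real manager's API.

```python
import posixpath
from dataclasses import dataclass

MANAGER_PID = 4242  # hypothetical "PID X" from the first rule

@dataclass(frozen=True)
class Request:
    caller_pid: int
    caller_cgroup: str   # cgroup the caller itself lives in
    action: str          # "manage" or "move"
    target: str          # cgroup to manage, or destination of a move
    source: str = ""     # for "move": cgroup the moved process is in now

def under(subtree, path):
    """True if path is subtree or lies below it (after normalization)."""
    subtree = posixpath.normpath(subtree)
    path = posixpath.normpath(path)
    return path == subtree or path.startswith(subtree + "/")

def authorize(req):
    if req.action == "manage":
        # Rule 1: only one specific PID may manage /A/B/C, whatever its uid.
        return req.caller_pid == MANAGER_PID and under("/A/B/C", req.target)
    if req.action == "move":
        # Rule 2: one-way move, from /A/B/C into /D/E/F and never back.
        return (under("/A/B/C", req.caller_cgroup)
                and under("/A/B/C", req.source)
                and under("/D/E/F", req.target))
    return False
```

Neither check depends on the caller's user or group, which is exactly the information file permissions are limited to.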

Yeah, we can start adding more files to the cgroups to attempt finer control, but now things get messy, and it's likely the whole permission issue goes out the window. That's just two random policies that a system may want to enforce. To do #1 you would likely have to give one of U, G, or O +w to the subtree for that user, but that's a broken definition because hidden kernel policy will revoke writes by other processes, even though they have the proper U or G or fall into O. In the end, a cgroup would sprout many files to cover all sorts of policy combinations, and the kernel developers will be left with a monstrosity that can't be fixed because next time the same cries about API will arise.

A filesystem hierarchy is not suited to this complexity. It will have to be contorted in all kinds of ways where the permissions scheme (even with ACLs) doesn't match traditional behavior.

It's far preferable to take the LSM approach. The kernel provides the primitives throughout the kernel subsystems and talks to another module to establish and enforce policy. It's true that LSM modules are kernel modules, but they remain separate from the LSM code. (I wouldn't have a problem with a cgroup manager living as a kernel module, but I don't see any inherent benefit either.)

Shuttleworth: Losing graciously

Posted Feb 19, 2014 20:36 UTC (Wed) by HelloWorld (guest, #56129) [Link] (3 responses)

> * Only allow PID X to manage cgroup subtree /A/B/C.
> * Processes in subtree /A/B/C/ may move their processes in /D/E/F but should not have any further control.
>
> Traditional file permissions and even ACLs are incapable of dealing with either situation.
But similar restrictions might also make sense for other kinds of hierarchically organised objects. So why not generalise the existing access control mechanisms to allow for things like that instead of inventing something new?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:28 UTC (Wed) by fandingo (guest, #67019) [Link] (1 responses)

The only existing mechanism that could possibly be used would be an LSM. That probably wouldn't be a bad approach.

I don't think that there's any way that traditional permissions, even with ACLs, could be massaged into giving the necessary flexibility and clean interfaces.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:46 UTC (Wed) by dlang (guest, #313) [Link]

the thing is, existing LSMs know how to deal with permissions to filesystem objects. SELinux and AppArmor work on the existing cgroups interfaces today (as Cyberax has noted).

Plus there is the entire extended ACL structure that's available (but very seldom used because it's not needed)

On Linux, permissions for filesystem objects have not been limited to the unix wrx bits for a long time.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:49 UTC (Wed) by vonbrand (subscriber, #4458) [Link]

So the solution to the problem that the API isn't well known/standard is to create another totally new, in practice untested, "general hierarchical security model" to be applied across the board to anything with a hierarchical structure. That sounds much, much harder to do right to me.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:00 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

>* Only allow PID X to manage cgroup subtree /A/B/C.
>* Processes in subtree /A/B/C/ may move their processes in /D/E/F but should not have any further control.
Then they should talk to some kind of privileged program that can do this. Traditional UNIX used suid programs for that, and it totally makes sense to use something like cgmanager/systemd for this.

However, such situations are not really normal. In particular, changing levels in cgroups hierarchy is not a trivial operation - new subtree might have limits that the subtree which is being moved already exceeds.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:32 UTC (Wed) by fandingo (guest, #67019) [Link] (3 responses)

> However, such situations are not really normal. In particular, changing levels in cgroups hierarchy is not a trivial operation - new subtree might have limits that the subtree which is being moved already exceeds.

That's really a question of policy, though. If the policy says that some processes need to move to /D/E/F, then they need to go there, regardless of what the resource controllers say. (I'd argue that the process should be moved first, and then the resource controller terminates processes to get back into proper configuration. I don't think that it is acceptable to leave a process in the wrong subtree.)

It's worth noting that systemd-logind, which is not PID 1, does the second action today on user login.

> Then they should talk to some kind of privileged program that can do this. Traditional UNIX used suid programs for that, and it totally makes sense to use something like cgmanager/systemd for this.

If there are acknowledged shortcomings of cgroupfs, shouldn't the API be changed to support all reasonable actions? Why should the kernel keep interfaces that clearly have shortcomings that cannot be resolved without massive API incompatibilities?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:40 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> That's really a question of policy, though. If the policy says that some processes need to move to /D/E/F, then they need to go there, regardless of what the resource controllers say.
Your policy might say that a process can use 100GB of RAM, but that's not going to help you if you only have 500MB. Right now if you try to do this trick the kernel simply kills the over-limit processes.

> If there are acknowledged shortcomings of cgroupfs, shouldn't the API be changed to support all reasonable actions? Why should the kernel keep interfaces that clearly have shortcomings that cannot be resolved without massive API incompatibilities?
Like filesystem interface? Perhaps we should switch to DBUS instead of using open()?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:43 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

backwards compatibility would be a reason for keeping poor interfaces, but if you are going to break them, then you need to do so once, not multiple times.

And the new systemd API is just that, a systemd API, by definition it doesn't deal with use cases that don't use systemd.

and the currently proposed single-writer API is known to not support all use cases, so why should it replace the existing API?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:59 UTC (Wed) by fandingo (guest, #67019) [Link]

> and the currently proposed single-writer API is known to not support all use cases, so why should it replace the existing API?

You are talking about the Google multi-hierarchy complaint, right? The cgroups maintainers are on record as saying that this is not reasonable, and they intend to eliminate it.

> backwards compatibility would be a reason for keeping poor interfaces

The cgroups developers believe them to be broken, not poor. Plus, they get in the way of fixing cgroups since cgroupfs leaks too many implementation details.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 19:04 UTC (Wed) by jspaleta (subscriber, #50639) [Link] (2 responses)

I believe Tejun directly speaks to a lot of this in the 2012 discussion.

Right now... it's a mess. He even comments on the fact that the gentlemen's agreement in the form of PaxControlGroups shows exactly how problematic trying to support multiple writers actually is right now in the multi-hierarchy cgroups because they aren't all multi-hierarchy aware. You can't actually use all the controllers in a multiple-writer fashion without them stepping on each other. Caveats abound. He even correctly notes that it works quite well right now for hand-crafted setups... where one human has crafted all the cgroup interactions via scripting and it's effectively "single writer" in a sense. But once you tack on automation or applications which want to make use of cgroups side-by-side with other applications or other automation or even hand-crafted scripts... you run into problems. PaxControlGroups details the ins and outs of those problems.

So as part of making room to clean up all the problems the single userspace writer will be mandated in the middle term of the kernel side work to make a flat hierarchy api. I'm not even sure the plan is to require that single writer model forever. I believe the plan right now is to stop pretending that cgroups and all the controllers work with the multiple writer model while that flat hierarchy is being developed and controllers are all reworked to correctly support that new model.

I also think you need to keep in mind that Tejun also mentions a pie-in-the-sky goal of merging cgroups into the process hierarchy.

Look, I think the real issue here is that until the new work (both kernel and userspace) has progressed further along, there are going to be existing use cases that are only served by the deprecated API. The kernel developers have always recognized this, from the start of the discussion.
I think the discussion here is only filling in the details of what Tejun knew were local admin policy scripting centric use cases that would be impacted in the short to mid term. Tejun clearly states the choices on how to proceed involved a trade-off that would impact some existing use cases.

The real question is, will the Linux distribution vendors continue to provide userspace solutions which support the old API as an option? I have no idea on that. I know that libvirt, for example, has a transition plan in place to support the older cgroupfs if the systemd D-Bus API is not available on the host. But I have no idea if any distribution vendors are going to expose a configuration option to pick which cgroups API to use when mounting cgroups. So Cyberax should probably start making the case to his distribution vendors about supporting his use case by making it possible to choose to run the deprecated cgroupfs API in the future, so long as the API is available and not pulled from the kernel the vendor ships.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:56 UTC (Wed) by dlang (guest, #313) [Link]

Everyone is in agreement that the old multiple hierarchy approach is a problem and that all controllers need to use the same hierarchy.

But using that as justification for a single-writer model doesn't compute, what does one have to do with the other?

> I'm not even sure the plan is to require that single writer model forever.

I agree with you that the kernel developers have said this, I just can't find a quote easily to back this up.

But if this is the case, having systemd take complete control and then defining a complex DBUS interface to be used for delegation strikes me as a very bad thing to do 'temporarily'

> The real question is, will the linux distribution vendors continue to provide userspace solution which support the old API as an option?

is systemd going to even allow this? or is systemd going to say that it's broken if the new interface isn't there?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:12 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Sure, multiple-hierarchy writers will have to be rewritten. That's totally OK because the old cgroups API definitely needs to be fixed with a jackhammer. Nobody argues that. It's also clear that the unified tree will make some use-cases impossible, at least for now.

And that's OK - there _are_ valid technical reasons why the current multiple hierarchies model is broken. These reasons are clearly spelled out in scores of mailing list messages.

For example, there are problems with memory accounting and blkio and that's why blkio can't account for buffered writes right now.

The switch to the single-writer model, however, has no such rationale. There are literally NO arguments at all for it that I can find. One would think that unfixable security problems deserve at least a mailing list message, but there's literally _nothing_ there.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 19:13 UTC (Wed) by smurf (subscriber, #17840) [Link] (2 responses)

> It very much sounds like the kernel cgroups people simply don't
> want to have to work out the details of what is needed, and so
> want to punt it to user-space. It sounds almost a social problem,
> more than a technical one.

Surprise: You are exactly right. The kernel people do not WANT to set policy there because the requirements are unknown / too diverse / we don't have much experience what we actually need in complex real-world scenarios / take your pick.

We do know that a user/group scheme will not work: you can nest namespaces, and the process which sets up the outer namespace's access rights does not know which user IDs will eventually end up being mapped to the processes inside these (sub)containers. This is (probably) why cgroupfs does not have ACLs: they'd be insufficient anyway.

It's not the kernel's job to set policy. It's the kernel's job to facilitate a stable ABI, and leave policy to user space where it belongs.

Besides, access rights are insufficient for another reason. Suppose you want to limit users' memory usage to 100MB each; if they want to have 1GB of main memory they can -- but only two at the same time, and only for 10 minutes.

This is not a particularly exotic requirement for a multiuser system. If somebody adds code for that kind of thing to their favorite cgroupsmanager program, no problem whatsoever. In the kernel? don't even think of doing that.
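As a rough illustration of why this kind of rule belongs in a userspace manager, here is a minimal sketch of the bookkeeping it needs. The `BoostArbiter` class and its method names are made up for this example; the numbers are the ones from the scenario above. A real manager would additionally have to write the resulting limit into the user's cgroup, which is deliberately left out.

```python
import time

BASE_LIMIT = 100 * 1024**2   # 100MB default per user
BOOST_LIMIT = 1024**3        # 1GB while boosted
MAX_BOOSTED = 2              # at most two boosted users at the same time
BOOST_SECONDS = 10 * 60      # a boost lasts ten minutes


class BoostArbiter:
    """Hypothetical policy bookkeeping; it never touches cgroups itself."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the policy can be tested without waiting
        self._expiry = {}     # user -> time at which that user's boost ends

    def _expire(self):
        # Drop every boost whose ten minutes are over.
        now = self._clock()
        self._expiry = {u: t for u, t in self._expiry.items() if t > now}

    def request_boost(self, user: str) -> bool:
        """Grant a boost if the user already has one or a slot is free."""
        self._expire()
        if user in self._expiry:
            return True
        if len(self._expiry) >= MAX_BOOSTED:
            return False
        self._expiry[user] = self._clock() + BOOST_SECONDS
        return True

    def limit_for(self, user: str) -> int:
        """Memory limit in bytes the manager should currently apply to `user`."""
        self._expire()
        return BOOST_LIMIT if user in self._expiry else BASE_LIMIT
```

The state involved (who is boosted, until when, how many slots remain) is ordinary application state, which is the point: it is easy to keep in a daemon and awkward to encode as file permissions.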

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:06 UTC (Wed) by dlang (guest, #313) [Link]

> Besides, access rights are insufficient for another reason. Suppose you want to limit users' memory usage to 100MB each; if they want to have 1GB of main memory they can -- but only two at the same time, and only for 10 minutes.

> This is not a particularly exotic requirement for a multiuser system. If somebody adds code for that kind of thing to their favorite cgroupsmanager program, no problem whatsoever. In the kernel? don't even think of doing that.

what multiuser system supports these sorts of limits today? If you are claiming that they are not unusual, that must mean that something common supports them.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:22 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

First, you create this hierarchy: root/user1/delegate, root/user2/delegate. Then you set memory limits on them and start a daemon that does the balancing act. This daemon should have permissions to change 'user1' and 'user2' hierarchies.

But nobody stops you from making 'delegate' directories writable for the users! They won't be able to affect the settings in the parent levels of cgroups, and they'll be limited by them.
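A minimal sketch of that layout, using plain directories so it can be run anywhere. The function name `build_delegated_tree` and the choice of mode bits are assumptions made for this illustration; on a real system `root` would be a mounted cgroup hierarchy, the UIDs would be the users' real ones, and the `chown` would need privileges.

```python
import os
import stat
import tempfile
from typing import Dict, Optional


def build_delegated_tree(root: str, users: Dict[str, Optional[int]]) -> Dict[str, str]:
    """Create root/<user>/delegate for each user and return the delegate paths.

    `users` maps a user name to the UID that should own the delegate
    directory, or None to leave ownership unchanged (handy for a dry run).
    """
    delegates = {}
    for name, uid in users.items():
        user_dir = os.path.join(root, name)
        delegate_dir = os.path.join(user_dir, "delegate")
        os.makedirs(delegate_dir)
        # The per-user level stays with the manager: limits are set here.
        os.chmod(user_dir, 0o755)
        # The delegate level is handed to the user, who may create children
        # below it but cannot write to the parent directory's settings.
        os.chmod(delegate_dir, 0o755)
        if uid is not None:
            os.chown(delegate_dir, uid, -1)   # -1 leaves the group unchanged
        delegates[name] = delegate_dir
    return delegates
```

For a dry run, `build_delegated_tree(tempfile.mkdtemp(), {"user1": None, "user2": None})` creates the two-level layout without changing ownership.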

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:55 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Except that delegation to untrusted processes is not inherently dangerous. Cgroups (by design!) limit what the children processes can do by setting limits on their parents. Well, except for the broken blkio controller that is being fixed anyway.

I'm pretty familiar with the current cgroups interface and it seems that an untrusted process _at_ _most_ can cause high load on the kernel and perhaps significantly slow down other processes.

There's also a small problem with several controllers which use weights to distribute resources, so it's possible for a cgroup to affect its siblings. But again, that's trivially worked around by using an intermediary tree level if one wants to delegate a subtree to untrusted processes.
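A small worked example of the sibling effect and the workaround, under a deliberately simplified model in which a parent's resource is split among its children strictly in proportion to their weights. The group names ("trusted", "untrusted", "box", "inner") and the weight values are invented for the illustration and do not correspond to any particular controller's files.

```python
from fractions import Fraction


def sibling_shares(weights):
    """Split a parent's resource among its children in proportion to weight."""
    total = sum(weights.values())
    return {name: Fraction(w, total) for name, w in weights.items()}


def flat_share_of_trusted(untrusted_weight):
    # Flat layout: the delegated cgroup is a direct sibling of "trusted",
    # and its owner is assumed to be able to rewrite its own weight.
    return sibling_shares({"trusted": 100, "untrusted": untrusted_weight})["trusted"]


def nested_share_of_trusted(untrusted_weight):
    # Nested layout: the admin keeps "box" (weight fixed at 100) and only
    # delegates "box/inner", so the untrusted weight competes inside "box".
    top = sibling_shares({"trusted": 100, "box": 100})
    inner = sibling_shares({"inner": untrusted_weight})
    assert inner["inner"] == 1   # no siblings inside box, so inner gets all of box
    return top["trusted"]
```

In the flat layout, raising the untrusted weight from 100 to 1000 drops the trusted group's share from 1/2 to 1/11; in the nested layout the trusted share stays at 1/2 whatever weight is written inside the delegated subtree.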

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:15 UTC (Wed) by fandingo (guest, #67019) [Link] (2 responses)

> it seems that an untrusted process _at_ _most_ can cause high load on the kernel and perhaps significantly slow down other processes.

Only if that process is unprivileged. If you have a service that runs a privileged process (like the parent PID of Apache or OpenSSH), it can modify any part of the cgroup hierarchy.

A single-writer model (especially if the writer is PID 1) with policy enforcement precludes this behavior. Even a privileged user would not be able to gain authorization to perform cgroup changes outside what the policy allows (like managing its subtree). Furthermore, a privileged user couldn't even connect to the kernel cgroup API directly, because a writer is already registered, and if it's PID 1, cannot be crashed in order to register a malicious writer.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:20 UTC (Wed) by dlang (guest, #313) [Link]

existing LSMs can block access to cgroups by even root processes today.

or you can play games with the PID namespace so that those processes are only root within their limited context, not for the whole system.

But if you are concerned about a malicious root process, the fact that it can change cgroups settings seems like a pretty minor thing to worry about.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:25 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> Only if that process is unprivileged. If you have a service that runs a privileged process (like the parent PID of Apache or OpenSSH), it can modify any part of the cgroup hierarchy.
It might as well simply do 'chmod -R a+r+w+x /' to the same effect.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:48 UTC (Wed) by dlang (guest, #313) [Link] (3 responses)

by the way, the new interface apparently isn't actually limited to being accessed by a single process; a group of cooperating processes can be used instead.

this sounds like an even bigger problem to me: if the group of coordinating processes don't coordinate well, nothing in the kernel can know that they aren't, and big problems can result.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:53 UTC (Wed) by jspaleta (subscriber, #50639) [Link] (2 responses)

Are you sure about that? Can you provide me instructions on how to do that when the cgroups is mounted with the new API?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:09 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

I don't know how to do it, but I saw something posted in the last week or so (I think on lwn related to systemd) that stated that this was the case.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:14 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Lennart (who is known as 'mezcalero' here) meant that systemd can give other processes access to a subtree, through systemd's interface.

That doesn't change the fact that only one process can do direct cgroups manipulations.