Shuttleworth: Losing graciously

Posted Feb 15, 2014 21:30 UTC (Sat) by jspaleta (subscriber, #50639)
In reply to: Shuttleworth: Losing graciously by stgraber
Parent article: Shuttleworth: Losing graciously

so help me out....
Am I understanding your future scenario correctly?
in the future....Ubuntu host system running systemd as init by default:
1) runs systemd PID=1 and is the cgroups manager for the host system exposing the documented cgroup management API.
2) runs a cgmanager process which exposes its own API, but internally will (in the future) communicate cgroup management requests back to systemd using systemd's native API and will not be touching cgroupfs directly.
3) guest lxc containers run something that talks to the cgmanager process talking the cgmanager API.

Is that the future scenario you expect to see once Ubuntu has switched to systemd?

Assuming yes, here's what's baking my noodle. If in the future cgmanager is just going to end up talking to systemd using systemd's API, how does this configuration provide any functionality or serve any use cases above and beyond what systemd's API already exposes? Serious question. cgmanager's API doesn't appear to be an abstraction; it appears to be mired in the details of what cgroupfs exposes. So I don't get how a future where cgmanager just talks to systemd via systemd's API is more capable or can cover additional use cases than systemd can directly. Well, not without patching the bejesus out of systemd on the host, which isn't what you seem to be proposing. It just looks like cgmanager is going to be wedged in between for no benefit at all.

In contrast, I'm far less confused about how libvirt's future roadmap is going to work. Libvirt exposes an abstracted API that doesn't go into cgroupfs minutiae. So I get how libvirt can expose a stable abstracted API for containers to make use of, and can then internally talk to systemd's abstracted cgroup API, and it will all work out. Libvirt's API doesn't propose to expose capabilities or use cases thought to be unsupported by systemd's API.

-jef

Shuttleworth: Losing graciously

Posted Feb 15, 2014 22:34 UTC (Sat) by stgraber (subscriber, #57367) [Link] (110 responses)

The main use of cgmanager is when doing nested LXC containers where the LXC containers run unprivileged as users.

In those cases you'll get requests coming from sub-containers where the emitter of those requests is root in their own namespace but not on the host. The whole cgmanager/cgproxy API is designed so that we can safely check what process the requester actually owns and then allow it to mess with those and those only.

So we basically track the various pid namespaces and user namespaces, deal with the uid and pid translation and then do ACL checks on the host.

Shuttleworth: Losing graciously

Posted Feb 17, 2014 21:03 UTC (Mon) by fandingo (guest, #67019) [Link] (109 responses)

The systemd developers (including Lennart in this very thread) have stated that they intend to allow this sort of behavior where systemd in a container will have access to the proper subtree outside the container. What's the rationale of developing this capability in cgmanager and not doing the work directly in systemd?

Shuttleworth: Losing graciously

Posted Feb 17, 2014 21:28 UTC (Mon) by stgraber (subscriber, #57367) [Link] (108 responses)

While it's great that eventually we may get an API from systemd that may cover the needs of LXC, this isn't the case now and I suspect won't be in the near future.

What Lennart refers to is running systemd in a container managed by systemd (with nspawn) within the host's user namespace and probably without running a full distro inside it (though that last bit doesn't matter that much).

As I already stated before, LXC also supports distros that do not have systemd, including Android. cgmanager was designed to be generic enough to work on any of those and will itself talk to the systemd API or any other similar API instead of cgroupfs if they offer an API that's low level enough for us.

Now if you want an example of complex setup which I need to support with LXC (due to actual user demand, not because I want to find a far fetched example), consider this:

Host runs Ubuntu 14.04 with a 3.13 kernel (upstart).
-> User x with uid 1000 runs an unprivileged container running Debian Testing (using sysvinit)
-> Root in this container (uid 100000 on the host) runs a Plamo Linux system container (some custom init)
-> User nobody with uid 65534 (uid 165534 on the host) runs an unprivileged Ubuntu 12.04 container (upstart)

This all works today with LXC 1.0 and cgmanager, the cgmanager host socket gets passed from one level of container into the next. If the container cares about cgroups (all of the above except the last one), they need to spawn a cgproxy process that'll do SCM calls over DBus to pass user credentials and PIDs in a way that gets translated by the kernel when crossing namespace boundaries.
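To make the credential-passing step concrete, here is a minimal sketch of sending a request with SCM_CREDENTIALS attached over a plain Unix socket on Linux. It is not cgmanager's or cgproxy's actual wire protocol (which, per the comment above, runs over DBus); the `send_request`/`recv_request` helpers and the request string are made up for illustration. The point is only that the receiver reads the pid, uid and gid from kernel-checked ancillary data rather than trusting the message body.

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t pid, uid_t uid, gid_t gid
UCRED = struct.Struct("iII")


def send_request(sock, payload):
    """Send payload with the caller's credentials attached (hypothetical helper)."""
    creds = UCRED.pack(os.getpid(), os.getuid(), os.getgid())
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, creds)])


def recv_request(sock):
    """Return (payload, (pid, uid, gid)) as seen from the receiver's namespaces."""
    payload, ancdata, _flags, _addr = sock.recvmsg(
        4096, socket.CMSG_SPACE(UCRED.size)
    )
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            return payload, UCRED.unpack(data[:UCRED.size])
    raise RuntimeError("no credentials attached to the message")


def demo():
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # The receiver must opt in to receiving credentials.
        server.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
        send_request(client, b"move-pid 50 a")  # made-up request format
        return recv_request(server)
    finally:
        client.close()
        server.close()
```

In this single-process demo both ends live in the same namespaces, so the values come back unchanged; the interesting case described in the thread is when the two ends sit in different pid and user namespaces.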

The main difficulty in the above is when uid 0 in the leaf container with a mapped uid of 200000 (depending on the configured mappings) is requesting for PID 50 to be moved into cgroup "a".

That's because:
- uid 0 is actually uid 200000
- pid 50 is actually pid 123123
- cgroup "a" is actually cgroup "lxc/c1/c2/c3/a"

So that's why we have cgmanager, why we use ucreds to get translated uids and pids and why we need complex logic (using namespace attach and such) to check whether uid 200000 on the host is indeed uid 0 in its namespace and whether pid 123123 is in the pid namespace that's linked with its user namespace and finally whether it's actually supposed to be able to write to lxc/c1/c2/c3/a.
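The uid and cgroup translation in that example can be sketched roughly as follows. This is only an illustration of the arithmetic, not cgmanager's code: the map entries use the (inside start, outside start, length) layout of /proc/<pid>/uid_map, but the concrete ranges below are invented so that they reproduce the numbers quoted above, and `host_cgroup` is a hypothetical helper. PID translation is left out because, as described, the kernel does that when the credentials cross the namespace boundary.

```python
import posixpath

# Each map is a list of (first id inside, first id in parent, range length),
# the same three columns as a line of /proc/<pid>/uid_map.


def map_to_parent(inner_id, id_map):
    """Translate an id from a namespace into its parent namespace."""
    for inside, outside, length in id_map:
        if inside <= inner_id < inside + length:
            return outside + (inner_id - inside)
    raise ValueError(f"id {inner_id} is not mapped in the parent namespace")


def map_to_host(inner_id, chain):
    """Apply the maps from the innermost namespace outward to the host."""
    for id_map in chain:
        inner_id = map_to_parent(inner_id, id_map)
    return inner_id


def host_cgroup(container_root, requested):
    """Resolve a cgroup name requested inside a container to a host path."""
    path = posixpath.normpath(posixpath.join(container_root, requested.lstrip("/")))
    if path != container_root and not path.startswith(container_root + "/"):
        raise PermissionError(f"{requested!r} escapes {container_root!r}")
    return path


# Hypothetical mappings chosen to match the numbers in the comment above.
C3_MAP = [(0, 100000, 65536)]   # leaf Ubuntu 12.04 container -> Plamo container
C2_MAP = [(0, 0, 200000)]       # Plamo system container -> Debian container
C1_MAP = [(0, 100000, 200000)]  # Debian container -> host

leaf_root_on_host = map_to_host(0, [C3_MAP, C2_MAP, C1_MAP])  # 200000
nobody_on_host = map_to_host(65534, [C2_MAP, C1_MAP])         # 165534
target = host_cgroup("lxc/c1/c2/c3", "a")                     # lxc/c1/c2/c3/a
```

The ownership checks described above (is this host pid really inside the requester's pid namespace, may this uid write to that cgroup) would sit on top of this and are not modeled here.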

That example is actually a fairly simple and common example of what cgmanager does. We have way trickier cases, but those usually take me an hour or so to properly express (they mostly happen on older kernels or when a sub-sub-container wants to add a pid to a cgroup which is owned by a user in that namespace; the PID ownership logic becomes pretty tricky pretty quickly).

Shuttleworth: Losing graciously

Posted Feb 18, 2014 1:45 UTC (Tue) by fandingo (guest, #67019) [Link] (107 responses)

Thanks for the detailed reply.

> As I already stated before, LXC also supports distros that do not have systemd, including Android. cgmanager was designed to be generic enough to work on any of those and will itself talk to the systemd API or any other similar API instead of cgroupfs if they offer an API that's low level enough for us.

What's the reason for not adopting the systemd DBus API, especially when it preceded CGManager? That clearly complicates the situation for application developers, or now we've added another mandatory abstraction layer, CGManager, that should not be needed on systems that already have a cgroups manager.

I guess I don't see what the future possibly holds for CGManager. Even the present is dicey beyond the cgroupfs driver. There's not even *one* page of documentation or explanation on how to use CGManager. (This appears to be the official project page: http://cgmanager.linuxcontainers.org/.) I was perplexed and spent far too long on Google before coming to the conclusion that CGManager's only mention is on a few mailing list threads. I can't find any definition of the DBus API for CGManager. I was under the impression that CGManager was ready for use.

Over the next year and a half, it is extremely likely that new GNU/Linux installations will overwhelmingly use systemd. During that time, it is hard to envision the actual kernel cgroupfs driver disappearing.

Combine the longevity that the kernel cgroupfs will have with the simplicity of developing the missing features of systemd (system bus delegation and policies), it's not clear that CGManager has much purpose or will become the generic cgroup manager as it was initially advertised.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 2:18 UTC (Tue) by stgraber (subscriber, #57367) [Link] (7 responses)

Documentation wasn't our first priority due to the tight schedule caused by the expected 1.0 release of LXC, we expect more documentation to be published once we're done with the LXC release.

In the meantime, Serge published some notes on github:
https://github.com/cgmanager/cgmanager

As for application developers, our biggest user is LXC and LXC certainly knows why and how to use it (as it's the same group of people who developed both and cgmanager was mostly built from LXC's old cgroup management code). For the others, the choice is relatively straightforward, if you want something very simple that works everywhere, just use cgroupfs directly. If you care about namespaces, uid/pid translation and nesting, use cgmanager. If you prefer to use a standard DBus API and only care about systemd-based distros, use systemd.
At the end of the day, all of those configure the exact same thing. It's not ideal when accesses aren't centralized but we've lived with that for years without any major problem.

I believe my earlier comment explains why the systemd API isn't sufficient for LXC's needs and for the other group of people involved with cgmanager.

As for all distros moving to systemd, I personally think this would be a pretty sad day, diversity is very important and is the main source of improvements. Anyway, we don't expect Android to start using systemd in the near future and that's one of the reasons why LXC will be using cgmanager.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 5:28 UTC (Tue) by fandingo (guest, #67019) [Link] (6 responses)

> Documentation wasn't our first priority due to the tight schedule caused by the expected 1.0 release of LXC, we expect more documentation to be published once we're done with the LXC release.

That's a real shame. How can you be confident that it is tested or used outside the LXC use-cases, or that the API is stable for a 1.0 release?

> I believe my earlier comment explains why the systemd API isn't sufficient for LXC's needs and for the other group of people involved with cgmanager.

Besides the delegation and policy components, what's inadequate with systemd's API (not implementation)?

I feel like CGManager was advertised as the cgroup manager for everyone not using systemd. After talking to you, it seems like a LXC utility only that isn't likely to see much additional use. The use cases that you have outlined strongly indicate that the cgroupfs API is likely to be the only thing that is used.

> diversity is very important and is the main source of improvements.

This is a truism that is oft repeated, but I don't see objective evidence that it's actually true. In fact, the Linux kernel is a perfect counter example. There hasn't been useful competition to Linux for years now, and kernel developers have not had trouble innovating.

> Anyway, we don't expect Android to start using systemd in the near future and that's one of the reasons why LXC will be using cgmanager.

Is Android going to switch to CGManager? If not, how useful is it for testing or use when official Android uses the kernel cgroupfs, not the cgroupfs provided by CGManager?

Shuttleworth: Losing graciously

Posted Feb 18, 2014 7:57 UTC (Tue) by mbunkus (subscriber, #87248) [Link] (1 responses)

> This is a truism that is oft repeated, but I don't see objective evidence that it's actually true. In fact, the Linux kernel is a perfect counter example. There hasn't been useful competition to Linux for years now, and kernel developers have not had trouble innovating.

But the Linux kernel does have competition. It's called Windows, Mac OS, the BSDs and the commercial Unices.

And monocultures are bad for innovation. Just look at the regulated telecommunication industries before they were split up (e.g. in the US) or the governmentally-mandated restrictions lifted (e.g. Germany). The Deutsche Bundespost (predecessor to what today is Deutsche Post AG, the postal service; Deutsche Telekom AG with its offspring T-Online; Deutsche Postbank AG, a bank) was known for bad service, obscene prices, a snail-like pace of innovation, complete lack of flexibility.

However, systemd works in a totally different environment. There are no regulatory authorities here; the only thing preventing yet another init system from coming along and taking its place is technical excellence, which translates into people seeing the need for it and then following through with a proper implementation. Therefore I'm not worried about a perceived lack of diversity regarding systemd, especially if the alternatives are so far behind in terms of functionality.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 16:43 UTC (Tue) by fandingo (guest, #67019) [Link]

> Therefore I'm not worried about a perceived lack of diversity regarding systemd, especially if the alternatives are so far behind in terms of functionality.

It's not clear that *anyone* is interested in an alternative and modern init system. If 14.04 weren't the LTS release of Ubuntu, Canonical would be done with Upstart at this point, and Mark Shuttleworth has said that they will switch to systemd as soon as Debian makes the switch. That's the last major holdout from GNU/Linux distributions. The only other system, which is not GNU/Linux, is Android, and I'm sure they'll continue to do their own quasi-proprietary thing.

I guess the bigger question is if something were to come along and try to fulfill the features that systemd does: why wouldn't that init system bring its own cgroup manager?

Shuttleworth: Losing graciously

Posted Feb 18, 2014 10:43 UTC (Tue) by hummassa (subscriber, #307) [Link]

This is a silly oversimplification. There is lots of competition in kernel-space: Windows (DOS-based 98 and VMS-based NT), at least five BSDs, commercial unices (I worked with SunOS/Solaris, HP/UX, AIX, ULTRIX, the infamous Microsoft/SCO Xenix, amongst a dozen others).

IIRC, once upon a time Linux got inspired by VMS/WinNT for its asynchronous IO, AIX via Sequent for its RCU synchronization, FreeBSD was faster in the same hardware, NetBSD supported more hardware architectures, the BSDs got plug-and-play hardware first/better, firewalls, USB, etc.

Shuttleworth: Losing graciously

Posted Feb 20, 2014 19:50 UTC (Thu) by lsl (subscriber, #86508) [Link] (2 responses)

> There hasn't been useful competition to Linux for years now, and kernel developers have not had trouble innovating.

If there's any ongoing innovation left at all in the OS kernel space it isn't happening in Linux. Not that that's (necessarily) a bad thing: Linux is (and is supposed to be) a 'production' OS with users relying on it for day-to-day work. While it has some cool new stuff that wasn't there in Unix back then I still sometimes wish the attention given to new OS research was a bit greater.

Well, that particular train probably left the station more than a decade ago. It seems that what we have is 'good enough' for people not to consider putting up with the pain of transitioning to something new and unknown.

Then again, they seem to gladly endure the torture of gigantic 'programming frameworks' aimed at making up for weaknesses in the operating system interface. ;-)

Shuttleworth: Losing graciously

Posted Feb 20, 2014 23:54 UTC (Thu) by vonbrand (subscriber, #4458) [Link]

I wonder what the current systemd brouhaha is all about then. Also the recent article here on file-owned locks...

Shuttleworth: Losing graciously

Posted Feb 21, 2014 4:00 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

I'd bet that there is lots of research code being written on top of Linux. And, really, if a team wants their research to ship, they'd build it on Linux (capsicum for FreeBSD is one exception that comes to mind though). Have we seen much change from anything in Singularity yet? What about other prototype research kernels? It may take a few releases to actually ship, but if that were my field, I'd base it on Linux (provided it would make sense; testing something like a completely new ABI would not make sense).

Shuttleworth: Losing graciously

Posted Feb 18, 2014 12:38 UTC (Tue) by rleigh (guest, #14622) [Link] (98 responses)

> What's the reason for not adopting the systemd DBus API, especially when it preceded CGManager?

As a systems programmer, I find the use of DBUS APIs as opposed to properly designed and implemented system calls and filesystem interfaces abhorrent. Mandating the use of DBUS for fundamental system functions is wrong on many levels.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 12:43 UTC (Tue) by HelloWorld (guest, #56129) [Link] (26 responses)

Why would anybody care if you like it? Maybe if you gave some reasons for your opinion it would be interesting; like this it's just meaningless trolling.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 12:46 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (25 responses)

I gave the reasons multiple times:
1) Auditing.
2) Security.
3) Transparency.
4) Delegation.

It _all_ works for good Linux filesystem-based interfaces, like /proc or /sys. But somehow not for cgroups.

WTF?

Shuttleworth: Losing graciously

Posted Feb 18, 2014 17:07 UTC (Tue) by fandingo (guest, #67019) [Link] (24 responses)

1) It's easy to see which policies are in effect on a system. Since policies are XML files, they can also be checked into version control. I'm not sure how that happens with run-time directory permission modifications.

2) It's not possible to implement a complex policy with cgroupfs. The cgroup filesystem does not support ACLs, and consequently, you're left with the limited UGO permissions.

3) I don't know what this is supposed to mean. DBus methods are more transparent to the caller since the call returns with a meaningful response (even if empty). In fact, it seems that `echo` is the primary way that people write to special file systems. From the cgroups.txt documentation:

> bash's builtin 'echo' command does not check calls to write() against errors. If you use it in the cgroup file system, you won't be able to tell whether a command succeeded or failed.

4) Delegation is currently missing, but the systemd developers have affirmatively stated that they intend to add it.

Lastly, it's pretty clear that the kernel developers don't like /sys that much. I wouldn't be surprised to see it gradually moved to DBus over the next few years either.
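On point 3, a caller that does want to know whether a write to a cgroup file succeeded can check it explicitly instead of relying on `echo`. A minimal sketch, assuming a mounted cgroup filesystem; the helper name and the example path are illustrative only:

```python
import os


def write_cgroup_file(path, value):
    """Write value to a cgroup control file and raise OSError on any failure."""
    data = f"{value}\n".encode()
    # os.open and os.write raise OSError (EACCES, ENOENT, ESRCH, ...) on failure.
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.write(fd, data)
        if written != len(data):
            raise OSError(f"short write to {path}: {written} of {len(data)} bytes")
    finally:
        os.close(fd)


# Example call (hypothetical path, depends on how cgroupfs is mounted):
# write_cgroup_file("/sys/fs/cgroup/cpu/mygroup/tasks", 1234)
```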

Shuttleworth: Losing graciously

Posted Feb 18, 2014 17:46 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (23 responses)

> 1) It's easy to see which policies are in effect on a system. Since policies are XML files, they can also be checked into version control. I'm not sure how that happens with run-time directory permission modifications.
How? Can you point me to a command-line utility that can show who has access to a given group? Do I have to parse XML?
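For what "parsing the XML" would involve, here is a small sketch that lists the allow/deny rules in a D-Bus bus policy file. The `<busconfig>`/`<policy>`/`<allow>`/`<deny>` elements are the standard dbus-daemon configuration format, but the sample policy and the `org.example.CgroupManager` name are invented, and nothing here reflects an actual systemd or cgmanager policy.

```python
import xml.etree.ElementTree as ET

# Invented sample policy in the dbus-daemon configuration format.
SAMPLE_POLICY = """<busconfig>
  <policy user="root">
    <allow own="org.example.CgroupManager"/>
    <allow send_destination="org.example.CgroupManager"/>
  </policy>
  <policy context="default">
    <deny send_destination="org.example.CgroupManager"/>
  </policy>
</busconfig>"""


def list_rules(xml_text):
    """Return (policy subject, 'allow' or 'deny', rule attributes) tuples."""
    rules = []
    root = ET.fromstring(xml_text)
    for policy in root.iter("policy"):
        subject = ", ".join(f"{key}={val}" for key, val in policy.attrib.items())
        for rule in policy:
            if rule.tag in ("allow", "deny"):
                rules.append((subject, rule.tag, dict(rule.attrib)))
    return rules


for subject, verdict, attrs in list_rules(SAMPLE_POLICY):
    print(f"[{subject}] {verdict} {attrs}")
```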

> 2) It's not possible to implement a complex policy with cgroupfs. The cgroup filesystem does not support ACLs, and consequently, you're left with the limited UGO permissions.
So let's add SELinux policies and ACLs to cgroupfs. It's going to be useful in other situations, like /sys delegation. For me, UGO permissions are plenty enough.

> 3) I don't know what this is supposed to mean. DBus methods are more transparent to the caller since the call returns with a meaningful response (even if empty).
How do I check which cgroups are writable by me, for example? I have tons of tools for that for the classic filesystem interfaces.
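With the filesystem interface, that check is an ordinary permission test. A rough sketch, assuming only that cgroups appear as directories under some mount point (the root path in the usage line is an example, not a given):

```python
import os


def writable_cgroups(root):
    """List directories under root that the calling process may write into."""
    found = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        # Creating or removing entries needs write plus search permission.
        if os.access(dirpath, os.W_OK | os.X_OK):
            found.append(dirpath)
    return sorted(found)


# Example (the mount point varies between systems):
# print("\n".join(writable_cgroups("/sys/fs/cgroup/cpu")))
```

`os.access` tests against the real uid and gid, so it answers "writable by me" in the same sense as the usual shell tools.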

> In fact, it seems that `echo` is the primary way that people write to special file systems. From the cgroups.txt documentation
Sure, and it's convenient. I can write to cgroups from a pure Java program - can I do the same with DBUS?

> 4) Delegation is currently missing, but the systemd developers have affirmatively stated that they intend to add it.
Only for other systemd containers. There are no plans to support cgmanager or my own incompatible manager that I'm just going to write out of spite.

> Lastly, it's pretty clear that the kernel developers don't like /sys that much. I wouldn't be surprised to see it gradually moved to DBus over the next few years either.
Doesn't matter. /sys and /proc virtualization and delegation are here to stay, forever. And also, [citation needed]

Shuttleworth: Losing graciously

Posted Feb 18, 2014 17:57 UTC (Tue) by jspaleta (subscriber, #50639) [Link] (18 responses)

General rule of thumb, if you want to interop with other software, document and version your stable APIs. It's a bit difficult to support cgmanager or your yet to be coded spitemanager if they use undocumented D-Bus APIs.

For example, cgmanager's draft readme, containing a draft design spec for its D-BUS API showed up in the source tree only like 5 days ago.
And even from this draft, it's unclear to me if cgmanager's D-Bus API can be considered stable at present. Since there doesn't appear to be any versioning on the API internally, I'd have to assume it's prudent to still consider it unstable and subject to change. As of right now cgmanager's API should be considered an LXC-private API, not suitable to be relied on by external projects, until such time as the API is versioned and marked as stable by its developers.

I bet once cgmanager's API is deemed stable, libvirt developers will look at supporting it.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:04 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (17 responses)

> General rule of thumb, if you want to interop with other software, document and version your stable APIs. It's a bit difficult to support cgmanager or your yet to be coded spitemanager if they use undocumented D-Bus APIs.
And by the time you write this interface, it's going to be so indistinguishable from a filesystem interface that people are going to start asking WTF it was all for.

> I bet once cgmanager's API is deemed stable, libvirt developers will look at supporting it.
Does it have AppArmor support? It works fine for delegated cgroups. How about using fanotify to screen for malicious attacks (yes, I can haz an antivirus on Linux)?

And how about delegation to Android userspace which does not use DBUS at all?

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:05 UTC (Tue) by jspaleta (subscriber, #50639) [Link] (5 responses)

Does what have AppArmor support? Pronouns kill. Were you referring to libvirt or cgmanager in your question?

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:17 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

Sorry, I was not clear. The current cgroupfs interface supports AppArmor.

Or to be precise, AppArmor simply treats it as usual file operations and can apply all the regular policies.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:24 UTC (Tue) by fandingo (guest, #67019) [Link] (1 responses)

KDBus is LSM-aware[1], so it can be secured with any of those providers. Whether the AppArmor developers actually expose that in their policy language is another matter. (The AppArmor project is sorely lacking in documentation, and I can't even figure out if it currently supports the DBus1 system bus.)

[1] - http://lwn.net/Articles/551969/

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:34 UTC (Tue) by jspaleta (subscriber, #50639) [Link]

The legacy dbus-daemon has security hooks originally written for SELinux support and extended for AppArmor support (at least in Ubuntu, and I'd assume on any distro that supports AppArmor by default). My understanding is that the reference daemon can choose not to send messages based on whichever security policy is being used on the host system. This particular feature of the reference daemon is tersely noted as an optional implementation feature in the D-Bus specification document. (I've not checked whether the AppArmor extension is in the mainline sources, but there's no reason to think it's not.)

What's not clear to me is how kdbus's support for LSM will practically differ from how the reference userspace daemon's hooks worked. As in will it be more expressive or less expressive in terms of how you can lock down how applications use the bus. Still trying to wrap my head around that.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:25 UTC (Tue) by jspaleta (subscriber, #50639) [Link] (1 responses)

I don't understand how the question is relevant to the question of whether systemd will work with cgmanager and your hypothetical spitemanager.

If you code your spitemanager and you want it to interop with the other managers, then you'll have to expose an API for them to work with.

I recognize that you think any manager construct is sub-optimal to the cgroupfs construct. Noted. But your original question was about whether systemd would support cgmanager and your hypothetical spitemanager, not whether any specific manager would support the same thing that the cgroupfs construct does. My point stands: the alternative managers to systemd's manager have to expose a stable API to interoperate with. Demanding that systemd support an alternative manager that doesn't have a stable API is putting the cart before the horse.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 19:53 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> I don't understand how the question is relevant to the question of whether systemd will work with cgmanager and your hypothetical spitemanager.
This is relevant. Right now the kernel interface is manager-agnostic - it can be used by anything.

With the brain-dead change to single-writer you'd have to reinvent the whole filesystem in DBUS to replicate the functionality. Look, we've already reinvented security policies, delegation (bind-mounts) and almost reinvented containers!

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:14 UTC (Tue) by fandingo (guest, #67019) [Link] (10 responses)

> And how about delegation to Android userspace which does not use DBUS at all?

This is going to be required no matter what. Your choices are either systemd or CGManager for your cgroups manager. Both use DBus. All containers will need support for interfacing with a DBus cgroup manager. CGManager just provides two APIs for its users: traditional cgroupfs style and DBus. The cgroupfs interface cannot be passed through a container.

Arguing against DBus as the principal API for any cgroup manager is a losing cause.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:18 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

Right now I _can_ do this with a simple --bind mount. It works perfectly fine.

> Arguing against DBus as the principal API for any cgroup manager is a losing cause.
Only because of idiotic kernel developers.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:29 UTC (Tue) by fandingo (guest, #67019) [Link] (8 responses)

>> Arguing against DBus as the principal API for any cgroup manager is a losing cause.
> Only because of idiotic kernel developers.

Huh? The kernel cgroups are exposed by system calls to a single writer. The only two writers that exist today (or have even been announced) both principally expose cgroups using a DBus API. CGManager also supports the cgroupfs API.

The inclusion of kDBus (when that happens later this year) is orthogonal to how the kernel exposes cgroups to the manager.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 19:54 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

>Huh? The kernel cgroups are exposed by system calls to a single writer.
Incorrect. Right now cgroups can be manipulated by any number of processes.

I repeat, IT WORKS ALREADY.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 20:44 UTC (Tue) by fandingo (guest, #67019) [Link] (6 responses)

The subject has been thoroughly covered: cgroupfs as provided by the kernel will eventually go away. There's no ambiguity to the situation. It is being left on for the short term, so user space is not broken immediately.

You're complaining about a deprecated feature that will be removed. You have four options:

1) Start using the systemd or CGManager DBus APIs.

2) Use CGManager with its cgroupfs provider.

3) Implement spitemanager for whatever API (presumably cgroupfs) you desire.

4) Fork the kernel or stop using new versions.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:04 UTC (Tue) by jspaleta (subscriber, #50639) [Link]

I should point out that libvirt is also planning on providing transitional cgroupfs support for hosts not using systemd as manager yet, for libvirt-based container users. That's not to say that libvirt provides universal use case coverage. Just pointing out that libvirt was yet another cgroupfs writer, that it has a transition plan in place to work in the single-writer world right now, and that the transition plan is documented on the libvirt project site. They have no plans for cgmanager support yet, but as the API for cgmanager hasn't really been communicated as stable yet, you can't really expect them to be able to support that API.

Though it will be interesting to see if Ubuntu patches libvirt as shipped in Trusty to talk to cgmanager. Right now it appears that libvirt as packaged in Trusty isn't patched for that yet and will be relying on cgroupfs in 14.04. So that's an interesting little wrinkle. Will a Trusty host running cgmanager be able to work with Trusty libvirt-based containers?

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:34 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

Basically, all the solutions are sub-optimal compared to the only real solution: use filesystem-based interface.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:57 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (3 responses)

What about another system call where a process in a new PID namespace asks to be the manager, the kernel sees that it's in a namespace and asks the parent cgroup manager whether it should be allowed. If it is, the kernel allows delegation of that subtree to only that manager (the parent can't touch inside of it anymore). If the parent denies access, the kernel gives an error to the container manager that the subtree is already managed (I assume there is such an error condition already). Would that be sufficient for the New World Order[1]?

[1]Assuming that you don't somehow convince kernel developers to forego the single-writer changes (which seems very unlikely at this point).

Shuttleworth: Losing graciously

Posted Feb 18, 2014 22:20 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Yes, I have something like this in mind:

0) Regular permissions apply.

1) Add 'pid-lock' file at each level of cgroups tree. Everyone can modify cgroups tree if this file is empty.

2) Once you write a pid into this file only this process can make modifications to this tree level and deeper.

3) The pid-lock process can modify pid-lock files in its subtree, either clearing them completely or by writing another pid. It doesn't lose access as long as it's still alive.

4) Subtree moves must respect pid-locks and permissions.

That's basically it. It still allows locking the tree against accidental modifications and also gives a clear path for delegation. DBUS connoisseurs can still use fully DBUS-based delegation and access control, and everybody else can use the normal filesystem-based API.

It's also possible to add a bitmask of delegated controllers. For example, so that the parent controller can limit the delegated controllers to cpu manager but not costly memcg.
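To check that the rules above hang together, here is a toy in-memory model of the pid-lock idea. It is one reading of rules 1 to 3 only: nothing like this exists in the kernel, the class and method names are invented, and regular permissions, process liveness, subtree moves and the controller bitmask are all left out.

```python
import posixpath


class PidLockTree:
    """Toy model: maps each cgroup path to its pid-lock content (None = empty)."""

    def __init__(self):
        self.pid_lock = {"/": None}

    @staticmethod
    def _levels(path):
        # For "/a/b" yield "/", "/a", "/a/b".
        current = "/"
        yield current
        for part in (p for p in path.split("/") if p):
            current = posixpath.join(current, part)
            yield current

    def may_modify(self, pid, path):
        """Rules 1 and 2: anyone if no lock applies, else only a lock holder."""
        holders = [
            self.pid_lock[level]
            for level in self._levels(path)
            if self.pid_lock.get(level) is not None
        ]
        return not holders or pid in holders

    def mkdir(self, pid, path):
        parent = posixpath.dirname(path)
        if parent not in self.pid_lock:
            raise FileNotFoundError(parent)
        if not self.may_modify(pid, parent):
            raise PermissionError(f"pid {pid} may not modify {parent}")
        self.pid_lock[path] = None

    def set_lock(self, pid, path, holder):
        """Rule 3: write a pid into pid-lock, or clear it with holder=None."""
        if path not in self.pid_lock:
            raise FileNotFoundError(path)
        if not self.may_modify(pid, path):
            raise PermissionError(f"pid {pid} may not lock {path}")
        self.pid_lock[path] = holder


tree = PidLockTree()
tree.set_lock(1, "/", 1)            # the host manager (pid 1) locks the whole tree
tree.mkdir(1, "/lxc")
tree.mkdir(1, "/lxc/c1")
tree.set_lock(1, "/lxc/c1", 500)    # delegate the c1 subtree to pid 500
tree.mkdir(500, "/lxc/c1/a")        # the delegate can work inside its subtree
```

In this model pid 500 can modify `/lxc/c1` and below but not `/lxc`, while pid 1 keeps access everywhere, which matches the "doesn't lose access" part of rule 3.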

Shuttleworth: Losing graciously

Posted Feb 19, 2014 14:24 UTC (Wed) by HelloWorld (guest, #56129) [Link] (1 responses)

Well, fair enough. Are you going to implement it? I think it's hardly a secret that kernel developers are usually not interested in designs that aren't accompanied by working code.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 14:28 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

It'd be worth it to ask whether it would even be considered first at least.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:05 UTC (Tue) by fandingo (guest, #67019) [Link] (3 responses)

>> 1) It's easy to see which policies are in effect on a system. Since policies are XML files, they can also be checked into version control. I'm not sure how that happens with run-time directory permission modifications.

> How? Can you point me out a command-line utility that can show who has access to a given group? Do I have to parse XML?

The cgroup delegation is not finished. Systemd generally allows read access on most things, and I doubt this would be different for cgroups. Therefore, it should be as simple as running a dbus-send command.

>> 2) It's not possible to implement a complex policy with cgroupfs. The cgroup filesystem does not support ACLs, and consequently, you're left with the limited UGO permissions.

> So let's add SELinux policies and ACLs to cgroupfs. It's going to be useful in other situations, like /sys delegation. For me, UGO permissions are plenty enough.

You've clearly decided to go it alone on this (or to just keep complaining). Step on up and show us the progress you've made.

>> 3) I don't know what this is supposed to mean. DBus methods are more transparent to the caller since the call returns with a meaningful response (even if empty).

> How do I check which cgroups are writable by me, for example? I have tons of tools for that for the classic filesystem interfaces.

See #1.

>> 4) Delegation is currently missing, but the systemd developers have affirmatively stated that they intend to add it.

> Only for other systemd containers. There are no plans to support cgmanager or my own incompatible manager that I'm just going to write out of spite.

Except systemd has a good record of keeping API stability. A container will be free to connect to DBus and talk to systemd. The container's cgroup manager can expose whatever interface it wants inside. (This is exactly the same approach that CGManager is taking. There will always be a requirement that the cgroup manager inside the container knows how to talk to the manager outside the container.)

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:11 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> The cgroup delegation is not finished. Systemd generally allows read access on most things, and I doubt this would be different for cgroups. Therefore, it should be as simple as running a dbus-send command.
Presumably, systemd uses the magical powerz of DBUS for access controls. So all the tools should be already there.

Where are they?

I suspect that all the people pontificating about 'just connect to DBUS' do not even understand how it works. For example, what happens if a container starts its own DBUS daemon that knows nothing about the external daemon? How is authorization of connections handled?

> Except systemd has a good record of keeping API stability. A container will be free to connect to DBus and talk to systemd. The container's cgroup manager can expose whatever interface it wants inside.
Wrong. I can't expose cgroups filesystem interface, for example. Or re-delegate to a manager that can only act as the root manager.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 1:48 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

>Except systemd has a good record of keeping API stability.

I'm not sure how you can say that.

Systemd hasn't really been around very long, so they are almost entirely still on their first system; they haven't had much reason to modify much.

But their attitude that they are the only thing that matters, and willingness to take over and replace existing APIs with their different replacement doesn't give _me_ much confidence that they will maintain the old APIs long term.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 12:34 UTC (Wed) by pizza (subscriber, #46) [Link]

> But their attitude that they are the only thing that matters, and willingness to take over and replace existing APIs with their different replacement doesn't give _me_ much confidence that they will maintain the old APIs long term.

So, your argument against their public commitment to (and track record of) API stability is to say "I don't believe them."

You then try to justify that attitude by saying that they're still doing new things, and those new things may require new APIs. Well... duh. It's rather pointless to do something new if you don't create a way of managing it.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 13:01 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

I don't disagree, but that's not really systemd's or cgmanager's fault, they didn't decide that way.
And there actually is a good reason to delegate responsibility in cgmanager's case: delegating part of the responsibility to higher privileges makes sense.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 16:53 UTC (Tue) by fandingo (guest, #67019) [Link] (69 responses)

You do realize that the DBus is moving into the kernel this year, right? It's going to become one of the core IPC mechanisms on Linux (if you didn't already consider it one).

What's actually good about filesystem interfaces besides familiarity? There's nothing advantageous about them, and in fact, there is a great difficulty in sending information back to the caller.

System calls are only good when you're talking to the kernel. Without a more full-fledged IPC mechanism (like DBus perhaps), you don't want the kernel acting as the arbiter translating calls to other programs.

Lastly, I don't understand the dislike of DBus. What's bad about reliability, easy type support, extensive policy support, and multiple messaging paradigms? I'm also a programmer, and I don't understand why anyone wouldn't want that. Perhaps you could elaborate.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 16:59 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

>What's actually good about filesystem interfaces besides familiarity?
How do I delegate '/sys/fs/cgroup/some/cgroup/container/path' or whatever its counterpart in DBUS is going to be to user 'root' inside a namespaced container?

I have no idea. People usually make some noises about PolKit so I checked it. Its configuration looks something like this: http://cgit.freedesktop.org/polkit/tree/data/org.freedesk...

Yeehaw! An XML config with lots of strange options. Documentation is also quite impenetrable.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 17:12 UTC (Tue) by michich (guest, #17902) [Link] (1 responses)

That's not polkit's configuration. That's its DBus policy description, which is parsed by dbus-daemon. I.e. it's one of the things that's going away with the move to kdbus.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 17:30 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

And how? A lot of people are saying that it's easy, can anyone point me to a tutorial that describes it?

Shuttleworth: Losing graciously

Posted Feb 18, 2014 19:44 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (4 responses)

Not that this is going to make you happier, but PolicyKit *rules* are written in JavaScript[1][2]. The XML stuff is for the declaration of *actions* which can take place.

[1]https://wiki.archlinux.org/index.php/Polkit#Structure
[2]http://blog.christophersmart.com/2014/01/06/policykit-jav...

Shuttleworth: Losing graciously

Posted Feb 18, 2014 19:57 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Are you joking??? It just keeps getting better and better.

Now rules are not only opaque, they are not analyzable even in principle! Never mind dirty little tricks of JavaScript like using floats instead of ints for numbers.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 20:28 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (2 responses)

Well, as long as you keep the JS to a subset which is not Turing complete (probably doable; avoiding loops and recursion accomplishes that). Other than that, I think the justification was that some policy decisions need more complex logic and JS was the easiest language to embed. I don't know how seriously languages like Perl, Tcl, and Lua were considered (I'm 100% OK with shell not being the language).

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:38 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I don't see how this can be done automatically, without writing something like SMACK for JavaScript.

Sigh... It looks like DBUS developers have gone off the rails completely and systemd+kernel people are happy to join the bandwagon.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:46 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

These aren't DBus developers; these are PolicyKit developers. There was a decently large thread on fedora-devel about it when it came out. I'm pretty sure that if you want to write a different access mechanism from PolicyKit, DBus doesn't itself care. I don't know what KDBus will do here since calling back out to PolicyKit for every call seems to work at cross-purposes to some of the goals of it (the number of context switches mainly).

Shuttleworth: Losing graciously

Posted Feb 18, 2014 16:59 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (60 responses)

> System calls are only good when you're talking to the kernel. Without a more full-fledged IPC mechanism (like DBus perhaps), you don't want the kernel acting as the arbiter translating calls to other programs.

Uhm. Cgroups is a kernel interface. It's not an interface to some userspace program.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 17:08 UTC (Tue) by fandingo (guest, #67019) [Link] (59 responses)

The kernel cgroupfs interface is deprecated and will disappear in due time.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 17:30 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (58 responses)

No, it's not. Only delegation and multiple-writers are deprecated.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 17:57 UTC (Tue) by smurf (subscriber, #17840) [Link] (57 responses)

Exactly. Now what is a poor little init in a nonprivileged namespace to do if it wants to partition its slice of the universe into cgroups, if it can no longer write to cgroupfs?

Right -- it needs to talk to whatever process does the actual cgroups work. Presumably, DBus is a reasonable way to do that -- you can implement more complex permissions, send structured data in, get structured replies out, and have an altogether more high-level interface for what you actually want to do instead of doing baby steps in cgroupfs.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 17:59 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (56 responses)

> you can implement more complex permissions, send structured data in, get structured replies out, and have an altogether more high-level interface for what you actually want to do instead of doing baby steps in cgroupfs.
A simple question: "How?"

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:05 UTC (Tue) by fandingo (guest, #67019) [Link] (1 responses)

PolicyKit.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:12 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Again, show me the code.

To make cgroupfs subtree accessible to a user 'vasja' I simply need to do 'chown -R vasja /sys/fs/cgroup/some/subtree' and that's it. How do I do the same with DBUS?
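For reference, the filesystem-side delegation described here is small enough to write out. The sketch below is roughly what 'chown -R' does for a subtree, using only the standard library; the function name `chown_tree` is made up for illustration, and the group is left unchanged.

```python
import os


def chown_tree(root, uid):
    """Roughly 'chown -R <uid> root': change the owner, keep the group."""
    os.chown(root, uid, -1)  # -1 leaves the group untouched
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # Change the entry itself rather than whatever a symlink points at.
            os.chown(os.path.join(dirpath, name), uid, -1, follow_symlinks=False)
```

Something like `chown_tree("/sys/fs/cgroup/some/subtree", uid_of_vasja)` would then hand the whole subtree to that user, where `uid_of_vasja` stands for the numeric UID.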

Shuttleworth: Losing graciously

Posted Feb 18, 2014 18:57 UTC (Tue) by smurf (subscriber, #17840) [Link] (53 responses)

Dbus gives you the mechanisms to do all of that. A file system interface does not.

I'm not involved with the details of which dbus call does, or will do, exactly what, and I like it that way. So, sorry but you'll have to get the actual implementation details from somebody else.

No, the kernel people are not "idiotic" when they want to impose a one-writer-only policy on the cgroups subsystem. It makes perfect sense to have one process arbitrate access instead of adding ACL support to cgroupfs and dealing with multiple processes stepping onto each other's toes.

In any case, unless you can actually convince them to not enforce a single-writer policy after all, demanding that a multi-writer cgroupfs should continue to be available is … somewhat futile. Especially here; this is not a kernel mailing list.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 20:00 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (52 responses)

> I'm not involved with the details of which dbus call does, or will do, exactly what, and I like it that way. So, sorry but you'll have to get the actual implementation details from somebody else.
So I conclude that NOBODY knows how to do it. Does it not ring any alarm bells?

>No, the kernel people are not "idiotic" when they want to impose a one-writer-only policy on the cgroups subsystem.
Yes, they are. They are total idiots in this regard.

>It makes perfect sense to have one process arbitrate access instead of adding ACL support to cgroupfs and dealing with multiple processes stepping onto each other's toes.
Why does it make perfect sense? What are the reasons? Can you point out a design document with them?

> In any case, unless you can actually convince them to not enforce a single-writer policy after all, demanding that a multi-writer cgroupfs should continue to be available is … somewhat futile. Especially here; this is not a kernel mailing list.
LKML is a dump. Asking there is an almost certain guarantee for a message to be lost.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 20:51 UTC (Tue) by fandingo (guest, #67019) [Link] (51 responses)

> So I conclude that NOBODY knows how to do it. Does it not ring any alarm bells?

It's because these features are currently being developed. They're not finished. None of this work will be completed until KDBus is merged.

I suggest that if you have such pressing concerns and questions about how all of this works, it's more appropriate to take your complaints to the developers directly.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:32 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (50 responses)

> It's because these features are currently being developed. They're not finished. None of this work will be completed until KDBus is merged.
Incorrect. Single-writer mode is already there and KDBus is going to be an optional dependency for a long time even after that.

All the policy mechanisms are already there. Yet nobody here can tell me how to do the simplest thing possible - delegate a DBUS subtree to a user.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:41 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (49 responses)

You don't delegate subtrees of DBus APIs. I imagine that requests for changes to cgroup would come with a parameter for which subtree to apply the changes to. By default, everyone passes '/' as the subtree to apply things to, but if you're delegating a subtree, pass '/machine/vm0' as the subtree. You then authenticate the caller against who is allowed to manage the '/machine/vm0' subtree. Or you attach to the 'org.freedesktop.systemd.cgroupdelegate1' interface at the '/machine/vm0' path and call methods there (everyone else calls the 'org.freedesktop.systemd.cgroup1' interface methods).
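The first variant (a subtree parameter checked against the caller) could look something like the sketch below. Everything in it is an assumption made for illustration: the `DELEGATIONS` table, the UIDs and the function names are hypothetical, and none of this is an existing systemd or D-Bus API.

```python
import posixpath

# Hypothetical delegation table: cgroup subtree -> UIDs allowed to manage it.
DELEGATIONS = {
    "/": {0},
    "/machine/vm0": {100000},
}


def _norm(path):
    """Normalise to an absolute path with '..' and duplicate slashes resolved."""
    return posixpath.normpath("/" + path.lstrip("/"))


def is_within(subtree, target):
    """True if target is the subtree itself or lies below it."""
    subtree, target = _norm(subtree), _norm(target)
    return subtree == "/" or target == subtree or target.startswith(subtree + "/")


def authorize(caller_uid, subtree, target):
    """Allow a change only inside a subtree that was delegated to the caller."""
    if not is_within(subtree, target):
        return False
    return caller_uid in DELEGATIONS.get(_norm(subtree), set())
```

Normalising before comparing matters: without it a request for '/machine/vm0/../vm1' would pass a naive prefix check.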

I have a feeling that your emotions here are getting in the way of seeing potential solutions and mixing up pieces of information. May I suggest putting your concerns into a wiki page of some sort so that responses to them aren't spread across umpteen LWN articles and subthreads?

Shuttleworth: Losing graciously

Posted Feb 18, 2014 21:51 UTC (Tue) by fandingo (guest, #67019) [Link]

This is correct, but just to clarify:

Telling systemd that user X (even if that's a user that needs to be resolved several times from namespaces) is allowed to perform actions A,B,C on a subtree does not exist presently. That's why no one can tell Cyberax what API calls are needed.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 22:09 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I'm now reading the kernel and systemd code. I still don't understand how authentication is going to work. Also, what happens next?

Suppose that I have systemd managing the root host and I create a cgmanager-based container. Suppose that they interoperate, so somehow cgmanager should connect to the host? How do I pass credentials for it? Or does systemd simply trust the first connection?

Then the cgmanager connects to its local DBUS running inside the container and starts serving its local clients. KDBus doesn't really change this, their namespaces would be separate.

Then the next question, would the cgmanager-based partitions be visible in the global manager? Probably yes, since there's no delegation. However, access rights would definitely be lost because cgmanager is probably going to implement its own policies. So there's not going to be any way to check what users and/or containers are using subtrees.

Still a mess. I guess I'll have to dive into it headfirst and try to make sense of it.

Shuttleworth: Losing graciously

Posted Feb 18, 2014 22:46 UTC (Tue) by fandingo (guest, #67019) [Link]

Users inside a namespace are mapped to UIDs outside the namespace. There are more details here http://lwn.net/Articles/532593/, but it seems that some privileged inner UID needs to be mapped.
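To make the mapping concrete: each line of /proc/PID/uid_map has three fields (first UID inside the namespace, first UID outside, length of the range). A minimal sketch of the translation follows, assuming the map text is read from the parent namespace's point of view; the function names are made up for illustration.

```python
def parse_uid_map(text):
    """Parse uid_map text into (inner_start, outer_start, length) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def to_outer_uid(uid_map, inner_uid):
    """Translate a UID inside the namespace to the outer UID, or None if unmapped."""
    for inner_start, outer_start, length in uid_map:
        if inner_start <= inner_uid < inner_start + length:
            return outer_start + (inner_uid - inner_start)
    return None
```

With a map of "0 100000 65536", root inside the container is UID 100000 outside, and that outer UID is what the host-side manager would have to authorize.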

Here's my understanding on how it would work. Besides some terminology, the model seems common between systemd and CGManager.

1) Bind mount the DBus socket dir into where the container will run.
2) Create cgroup and apply controllers as needed.
3) User namespace is created.
4) Root has an outer UID mapped. If the cgroup manager runs as a different user, it is also mapped.
5) PolicyKit is updated to allow access to the mapped user on the specific cgroup subtree.
6) The container OS boots.
7) The cgroup manager inside the container connects to the DBus socket. (This does not serve as the system bus inside the container. That is separate.)
8) The cgroup manager inside the container attaches to its system bus.

It's expected that the container software takes care of at least 3-8 and possibly the first two as well.

Operation:

A process inside the container wants to make a cgroup modification.

1) It connects to DBus (or cgroupfs if desired and CGManager is running inside the container) and sends the request.
2) PolicyKit (or file system permissions if using cgroupfs via CGManager) inside the container authorizes the action.
3) The cgroup manager inside the container accepts the command, and sends the command over the DBus socket to the outer cgroup manager.

4) The outer cgroup manager receives the message.
5) The outer cgroup manager translates the inner cgroup path to its relative position outside and consults PolicyKit for authorization. If authorized, the cgroup action is completed. A return message (with properly sanitized path) is sent across the socket to the inner cgroup manager, which forwards the message to the process that initiated the call.
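The path handling in step 5 might be sketched as follows. This is an illustration only: `to_outer` and `to_inner` are hypothetical helpers, the container's subtree ('/machine/vm0' in the usage note) is an assumed example, and the PolicyKit check itself is left out.

```python
import posixpath


def to_outer(container_root, inner_path):
    """Map a path as seen inside the container to its position in the host tree."""
    # Normalising an absolute path clamps any leading '..' at the container's root.
    inner = posixpath.normpath("/" + inner_path.lstrip("/"))
    root = posixpath.normpath(container_root)
    return root if inner == "/" else root.rstrip("/") + inner


def to_inner(container_root, outer_path):
    """Sanitise a host path for the reply: strip the prefix the container must not see."""
    root = posixpath.normpath(container_root)
    outer = posixpath.normpath(outer_path)
    if outer == root:
        return "/"
    prefix = root.rstrip("/") + "/"
    if not outer.startswith(prefix):
        raise ValueError("path is outside the container's subtree")
    return "/" + outer[len(prefix):]
```

For a container rooted at '/machine/vm0', a request for '/web' becomes '/machine/vm0/web' outside, and an attempt to climb out with '/../../system.slice' stays inside the container's subtree.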

=====

Let's say that some process inside the container wants delegation of a part of the container's subtree. That authorization doesn't take place in the outer PolicyKit. It happens inside.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 15:32 UTC (Wed) by HelloWorld (guest, #56129) [Link] (45 responses)

I have to say that I tend to sympathize with Cyberax here. cgroups are a hierarchy of objects, and we have an API to manipulate those: the file system. Of course you can build what essentially amounts to a copy of the file system API with D-Bus, but you'll lose all the tool support along the way, and people won't know how to use it, so why bother?

By the way, I feel similarly with regard to cgroups as a whole. We already have a process hierarchy, why do we need another one? Of course the problem with the traditional process hierarchy was that processes could escape by double-forking, but that was fixed with prctl(PR_SET_CHILD_SUBREAPER). So why do we need cgroups at all?

Oh well, by now it's probably too late to change any of this.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 16:22 UTC (Wed) by fandingo (guest, #67019) [Link] (17 responses)

I'm wondering what tool support actually exists. It seems more likely that there's a mash of shell scripts that mkdir, chmod, chown, and echo their way through using cgroups. That's pretty lousy. I've already posted the warnings from cgroups.txt on using echo. If you're using something beyond shell script, you'll end up with a better program by interfacing with DBus if only due to better return information and error handling.

> Of course the problem with the traditional process hierarchy was that processes could escape by double-forking, but that was fixed with prctl(PR_SET_CHILD_SUBREAPER). So why do we need cgroups at all?

Doesn't that require one of the following two situations?

* Well behaved main process of the service that sets PR_SET_CHILD_SUBREAPER, so none of its descendants escape. If this process ever dies, is killed without cleaning up (e.g. sigkill), or fails to set itself as the subreaper, processes can escape the hierarchy.

* The init system has to maintain a process for each service that sets itself as the subreaper. It's not responsible for anything besides executing/stopping/killing the service, and cleaning up PIDs. That certainly adds a lot of overhead and complexity. Without dedicating a process to each service, you just end up with everything having subreaper set to PID 1 (or whatever a modular service manager runs as); these service hierarchies would all point to the same parent, making it impossible to distinguish between them.

The first option does not seem appealing because it requires a significant amount of trust in the service, there are reliability concerns, and the developers of each service need to do work to explicitly support this init model.

The second option mainly suffers from complexity. Init has far more processes running as part of its service management. Some IPC mechanism would be needed to track these processes and allow start/stop/restart/kill/etc. commands from the user to the service manager to the hierarchy manager to work.

Subreaper is designed to be used for proper process cleanup, not for tracking process hierarchies.

Lastly, the ability to set resource limits is limited. It would be possible to use something like setrlimit or prlimit, but those are both process-specific, and don't allow the flexibility of group-based resource limits.
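For readers who haven't used it, here is roughly what the per-service supervisor in the second option would do. It is a Linux-only sketch that calls prctl(2) through ctypes; the constant 36 is PR_SET_CHILD_SUBREAPER from <linux/prctl.h>, the function names are made up for illustration, and a real supervisor would of course keep running instead of reaping one process and returning.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>, Linux 3.4 and later


def set_subreaper(enabled):
    """Mark (or unmark) the calling process as a child subreaper."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(int(enabled)),
                    ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def reap_escaped_grandchild():
    """Double-fork a 'daemon' and show that it is reparented to us, not to PID 1."""
    set_subreaper(True)
    try:
        r, w = os.pipe()
        child = os.fork()
        if child == 0:
            os.close(r)
            grandchild = os.fork()
            if grandchild == 0:
                os._exit(7)  # the "daemon" that tried to escape
            os.write(w, str(grandchild).encode())
            os._exit(0)      # intermediate parent exits: classic double fork
        os.close(w)
        grandchild_pid = int(os.read(r, 32).decode())
        os.close(r)
        os.waitpid(child, 0)
        # Without the subreaper flag this raises ChildProcessError,
        # because the orphan would belong to init instead of us.
        _, status = os.waitpid(grandchild_pid, 0)
        return os.WEXITSTATUS(status)
    finally:
        set_subreaper(False)
```

Note that this only tells the supervisor which orphans end up with it; it does nothing about resource limits, which is the point made in the last paragraph above.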

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:16 UTC (Wed) by HelloWorld (guest, #56129) [Link] (4 responses)

> * The init system has to maintain a process for each service that sets itself as the subreaper. It's not responsible for anything besides executing/stopping/killing the service, and cleaning up PIDs. That certainly adds a lot of overhead and complexity.
I don't think so. Processes are cheap on Linux, and I don't think that reaping child processes is likely to ever be a bottleneck for any realistic program.

> Some IPC mechanism would be needed to track these processes and allow start/stop/restart/kill/etc. commands from the user to the service manager to the hierarchy manager to work.
Where “Some IPC mechanism” would obviously be D-Bus, which makes this sort of thing very easy.

> Lastly, the ability to set resource limits is limited. It would be possible to use something like setrlimit or prlimit, but those are both process-specific, and don't allow the flexibility of group-based resource limits.
That's not a fundamental limitation. Just allow processes to specify that an rlimit is supposed to apply to them as well as their descendants, and PR_SET_CHILD_SUBREAPER ensures that your descendants can't reparent themselves to init. So I think this whole thing could be made to work fine if somebody bothered to do the work. Otoh, I'm not sure if anything would actually be gained by doing that instead of cgroups.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:43 UTC (Wed) by fandingo (guest, #67019) [Link] (2 responses)

> Just allow processes to specify that an rlimit is supposed to apply to them as well as their descendants, and PR_SET_CHILD_SUBREAPER ensures that your descendants can't reparent themselves to init.

That's true if and only if that ancestor reaper never dies or is killed. The security implications complicate things. It should be possible to overcome them, but the warts add up.

> That's not a fundamental limitation. Just allow processes to specify that an rlimit is supposed to apply to them as well as their descendants[...] Otoh, I'm not sure if anything would actually be gained by doing that instead of cgroups.

I totally agree, but it would require a change to those functions (or new recursive versions).

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:59 UTC (Wed) by HelloWorld (guest, #56129) [Link] (1 responses)

> That's true if and only if that ancestor reaper never dies or is killed.
So what? systemd is already required to never die.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:29 UTC (Wed) by fandingo (guest, #67019) [Link]

If the reaper dies, then processes in that service escape their "container" (because PPID is now 1). The service manager can no longer track them, and has lost all reliable control of the service.

This becomes a major problem with privileged services. Many services maintain a parent process that runs as root. Any compromise of this privileged service process (or malfeasance by it) allows it to kill its hierarchy manager process and escape all control.

On the other hand, cgroups in a single-writer environment should be immune to this.

With systemd cgroup manager, PolKit would not authorize a process move outside all cgroups or to another cgroup (outside specific definitions like system.slice/sshd.service/ --> /user.slice/session.scope/).

The major benefit to systemd's cgroup manager is that it is not attackable via this style. It cannot be intentionally killed (it ignores all signals, even sigkill since it is PID 1), and if it were somehow forced to crash, the system would panic. Since PID 1 is the cgroup manager, there is no way to gain control of the kernel interface either.

There's no meaningful way to protect a reaper, unless you mandate that nothing in a hierarchy can run with enough privileges to kill the reaper. That would require a substantial change in many services, or require additional sandboxing mechanisms in the kernel. (The kernel would need to perform a check that a caller of kill(2) is not trying to kill its reaper.)

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:58 UTC (Wed) by smurf (subscriber, #17840) [Link]

Umm … did you ever happen to run across the idea that just maybe process hierarchies and cgroups are a VERY bad fit? I can think of a couple of use cases where that wouldn't work at all well.

What if I want to fork off a bundle of programs which need to share the same memory limit (i.e. 200MBytes for all of them in sum, not individually … like for instance all the processes in James' sessions … and what if James logs in with X *and* with ssh)?

What if I realize, after starting my disk copy program, that it eats too much memory / disk bandwidth, and I want to retroactively park it in a more limiting cgroup? Does my shell suddenly need to know about that stuff?

Sorry -- won't work.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:42 UTC (Wed) by HelloWorld (guest, #56129) [Link] (11 responses)

> I'm wondering what tool support actually exists. It seems more likely that there's a mash of shell scripts that mkdir, chmod, chown, and echo their way through using cgroups. That's pretty lousy.
It's not lousy, there's nothing wrong with that! Those are tools that every admin knows and uses, and that's a Good Thing. The reason devices are exposed as “files” in /dev is precisely that one can do things like access control just as if they were proper files. Do you want to replace that too? It's certainly possible to give udev a D-Bus interface and use fd passing to open device files!

> I've already posted the warnings from cgroups.txt on using echo. If you're using something beyond shell script, you'll end up with a better program by interfacing with DBus if only due to better return information and error handling.
“We can't use a file system based interface because bash's echo builtin is broken” is about as lame an excuse as it gets. The answer to that is to fix bash or to use printf.
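Whatever tool does the writing, the point is that a single write(2) either succeeds or fails with an errno the caller can see. A minimal sketch, assuming the usual cgroup.procs convention of one PID per write; the helper name is made up for illustration.

```python
import os


def add_pid_to_cgroup(procs_path, pid):
    """Write one PID to a cgroup.procs-style file with a single write(2).

    Any failure (missing cgroup, permission denied, rejected PID) surfaces
    as an OSError carrying the kernel's errno instead of being swallowed.
    """
    fd = os.open(procs_path, os.O_WRONLY)
    try:
        os.write(fd, f"{pid}\n".encode())
    finally:
        os.close(fd)
```

A caller can then catch OSError and report `exc.errno` and `exc.strerror`, which is the error handling a shell script using echo tends to skip.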

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:48 UTC (Wed) by smurf (subscriber, #17840) [Link] (10 responses)

> The reason devices are exposed as “files” in /dev is precisely that
> one can do things like access control just as if they were proper files.

"It behaves like a plain file" doesn't work for quite a few device nodes, and most Linux subsystems are not controlled by echo: You don't emit sound by "cat rhapsody.wav >/dev/snd" these days, and you don't resize a LVM partition by "echo 10TB >/sys/devices/virtual/block/volgroup/master/varlog/size".

This is Linux. This is not Plan 9 where you can open a TCP connection with mkdir. cgroupfs is fine for introspection, but control? that always seemed a bit unnatural to me.

Besides, pragmatically, a sensible "cgroupctl"-style program will have a --help option and a manpage. To me that seems a lot more useful than traipsing around in cgroupfs and wondering which magic mkdir+echo+mv combo I need to evoke to limit my disk copy program's memory usage.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 20:22 UTC (Wed) by HelloWorld (guest, #56129) [Link] (9 responses)

> "It behaves like a plain file" doesn't work for quite a few device nodes,
So what? Just because you can't read(2) or write(2) to some device nodes doesn't mean you need to use another interface for things like poll(2) or chmod(2). Stop thinking about the “file system” and start thinking about a general hierarchical namespace for all kinds of objects. This is where we are today with files, sockets, fifos, device files etc. It's only natural to extend that further.

> This is Linux. This is not Plan 9 where you can open a TCP connection with mkdir.
Uh, I know this is Linux and not Plan 9. How is that supposed to be an argument? We should learn from Plan 9 instead of taking that kind of “us vs. them” stance.

> cgroupfs is fine for introspection, but control? that always seemed a bit unnatural to me.
And to me it seems unnatural that access control for cgroups is supposed to be done through a completely different mechanism than access control to files or devices. Though I agree with you that the current cgroups API isn't ideal. For one thing, I think the natural thing is to use
ln /proc/42 /sys/fs/cgroup/yaddah/cgroup.procs
and not
echo 42 > /sys/fs/cgroup/yaddah/cgroup.procs
to add processes to a cgroup.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 21:17 UTC (Wed) by smurf (subscriber, #17840) [Link] (8 responses)

> And to me it seems unnatural that access control for cgroups
> is supposed to be done through a completely different mechanism
> than access control to files or devices

I strongly suspect that the main reason for that is because you're used to it.

> ln /proc/42 /sys/fs/cgroup/yaddah/cgroup.procs

Linking.
Across file systems.
Yeah, right.

Sorry, but this is the point where I stop responding to you.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 21:44 UTC (Wed) by HelloWorld (guest, #56129) [Link] (7 responses)

> Linking.
> Across file systems.
> Yeah, right.
So what? It's not allowed for conventional file systems because it doesn't make sense there. It does make sense for this case, so there's no reason for it not to be allowed.

> Sorry, but this is the point where I stop responding to you.
You're acting as if I had somehow offended you. I haven't.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:06 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (6 responses)

Not only are you linking across filesystems (how would one find out that it is hardlinked elsewhere?), but you're hardlinking a directory. When process 42 ends, does the "hardlink" disappear? If not (as one might expect of hardlinks), does a new process with PID 42 get put there? The /sys and /proc filesystems are already pretty magical, but that magic is limited to read and write (AFAIK), not any number of other syscalls as well. Really, even echoing the PID to a file is racy. I'd much rather have something like a procfd to use here.

These behaviors you're asking for are quite different than the usual semantics these tools imply. Sure, filesystems and cgroups are both hierarchical, but there is such a thing as stretching a metaphor too far. To make a meta-metaphor: Should we abandon databases and just use spreadsheets instead? Vice versa? They're both "just" grids of data cells.

Shuttleworth: Losing graciously

Posted Feb 20, 2014 0:13 UTC (Thu) by HelloWorld (guest, #56129) [Link] (5 responses)

Alright, you have a point. Using link(2) is probably not a good idea.

Shuttleworth: Losing graciously

Posted Feb 20, 2014 2:08 UTC (Thu) by MrWim (subscriber, #47432) [Link] (4 responses)

rename() might be though. AFAIU all pids have to appear in the cgroup tree so to put a pid in a cgroup you have to remove it from another. You would need permissions for both cgroups and it happens atomically.

Shuttleworth: Losing graciously

Posted Feb 20, 2014 15:20 UTC (Thu) by HelloWorld (guest, #56129) [Link] (3 responses)

rename() was my first thought. But that would remove the process from the /proc directory, and that doesn't really make sense, does it?

Shuttleworth: Losing graciously

Posted Feb 20, 2014 15:23 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

I think the suggestion was to move the pid from one cgroup directory to another, not from /proc.

Shuttleworth: Losing graciously

Posted Feb 20, 2014 17:34 UTC (Thu) by HelloWorld (guest, #56129) [Link] (1 responses)

Well, that would work, but then how do you move a process that isn't a member of any cgroup into one?

Shuttleworth: Losing graciously

Posted Feb 20, 2014 18:15 UTC (Thu) by MrWim (subscriber, #47432) [Link]

That's what I meant by "all pids have to appear in the cgroup tree so to put a pid in a cgroup you have to remove it from another". My assumption is that there cannot be a process which isn't a member of any cgroup. If init starts in a cgroup and its children end up in the same cgroup and there's no way of unlink()ing pids from the cgroup tree then you're guaranteed that every process is in the tree.

In that setup you can't steal other users' processes and put them in your subtree, you can only move pids around in the trees you own. You can then use whichever cgroup manager you desire in your subtree. Containers work while still only co-operating with the kernel, rather than having to communicate with other user-space programs running outside.
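
To illustrate the semantics I have in mind, here is a rough Python model on an ordinary filesystem. It is not an existing kernel interface: the layout (one entry per pid inside each cgroup directory) and the name move_pid are assumptions made up for this sketch.

```python
import os

def move_pid(tree_root, pid, src_group, dst_group):
    """Move the entry for `pid` from one cgroup directory to another."""
    src = os.path.join(tree_root, src_group, str(pid))
    dst = os.path.join(tree_root, dst_group, str(pid))
    # rename(2) is atomic within one filesystem and needs write permission
    # on both parent directories, so the pid is never in zero or two groups.
    os.rename(src, dst)
    return dst
```

Since there is no unlink in this model, the only way a pid entry leaves one group is by arriving in another.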

Shuttleworth: Losing graciously

Posted Feb 19, 2014 16:29 UTC (Wed) by paulj (subscriber, #341) [Link] (26 responses)

+1 on the parent. Cyberax makes good points.

Further, no one has been able to explain why this is better implemented in user-space via DBus. There only seem to be assertions from Tejun on mailing lists that there are problems, but pretty much no detail on what those problems are. Worse, nothing, not even in the abstract, on why these problems would be any easier to tackle in user-space.

If the argument is that it is difficult to get multi-writer access to filesystems right, or multi-writer setting of permissions, then the kernel surely has bigger problems.

If the issue is ABI stability, those problems also do not magically become easier in user-space. Except perhaps that it is easier to circumvent Linus' determination to keep the kernel:user-space ABI stable (?) by simply not having one.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 17:31 UTC (Wed) by fandingo (guest, #67019) [Link] (25 responses)

There are plenty of concerns with the status quo. http://www.linux.com/news/featured-blogs/200-libby-clark/... and https://lwn.net/Articles/574317/ identify many of the issues.

From my perspective, the two most notable deficiencies are:

* Security. Delegating direct access to kernel interfaces is dangerous. The kernel is not running anything resembling a full security policy and shouldn't be expected to. Commands which have the ability to drastically affect a running system need to be vetted by something, and it only makes sense to do that in user space. Traditional UNIX ownership and permissions are insufficient for building comprehensive policies. While this may not be a concern to some, it's a major weakness. There's no excuse for providing an insecure interface to the kernel.

* Exposing too much interior detail. If the cgroups API had been more abstract from the start, it probably would have been possible to fix at least some of the other deficiencies. Unfortunately, cgroupfs exposes too much internal information, making it, in Tejun's opinion, infeasible to fix without major changes.

> If the argument is that it is difficult to get multi-writer access to filesystems right, or multi-writer setting of permissions, then the kernel surely has bigger problems.

Not all kernel developers work on every part of the kernel. The people who maintain and develop cgroups have decided that this is the highest priority undertaking for them.

> If the issue is ABI stability, those problems also do not magically become easier in user-space. Except perhaps that it is easier to circumvent Linus' determination to keep the kernel:user-space ABI stable (?) by simply not having one.

The kernel will still have an ABI for user space. I'm not sure why people keep saying otherwise. It's only usable by one process, but it's certainly still there. And for user space, it actually does become substantially easier. Just look at CGManager, which has decided to support two APIs simultaneously.

If there is a need to modify the cgroups ABI in the future, it's so much easier. Rather than having hundreds of thousands of users (everything from systemd to shell scripts) that will be impacted, it's just a handful of cgroup managers.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:10 UTC (Wed) by paulj (subscriber, #341) [Link] (20 responses)

On security: If delegating direct access to kernel interfaces is inherently dangerous (richness/flexibility of security models excepted), but user-space mediation is not, then we need to ban all direct filesystem access, and make all code access files and data via IPC to a management daemon.

I simply don't think that's the case. If we can't do this securely in the kernel, the user-space mediator won't be any better (certainly not when it's coded in a similar style), security wise (with that exception). Otherwise, please explain how user-space is more secure?

With regard to the Unix ownership/permissions model:

1. This may suffice for many. There's little evidence, in the history of operating systems, of complex security models getting wide-spread end-user use.

2. However, it is incorrect to say the kernel to userspace delegation API is limited to Unix owner/group permissions. It also allows for ACLs - which an fs based cgroups API (not necessarily the current version) could implement.

3. Even if Unix perms and/or ACLs *were* still insufficient for all users, that is *NOT* a reason to not offer the FS API. If a user-space daemon wants to offer some other security model on top, the existence of the FS API does not stop that. They can live together. Why does it mean the fs interface has to be removed?

On exposing too much detail: Then that's a problem of the current cgroupfs API. Fix it with a new one. Why is it better for that new API to live in user-space?

One of the problems Tejun has is that the fs API *allows delegation*, and hence allows an admin to give resources to non-privileged processes that might affect other users/processes. But isn't that perhaps inherent to an over-committed resource sharing system like normal Unix/Linux? Further, if the problem is inherently to do with the delegation, how will a user-space API that allows delegation fix things? The answer, if delegation really is a problem, must be that that delegation has to be removed. There is no reason this can't be done in a kernel API, surely? Or would you argue it is easier to remove things in userspace APIs? (That argument would scare me).

Why not just try and get the kernel API right? Why will it be any easier to get things right if the thousands of shell scripts are calling dbus-send instead of writing to a virtual fs? How will having this in user-space make it any easier to deal with all the thousands of users, just cause they talk to a manager instead of the kernel?

It very much sounds like the kernel cgroups people simply don't want to have to work out the details of what is needed, and so want to punt it to user-space. It sounds almost a social problem, more than a technical one.

Lastly, on the "not all kernel developers are familiar with implementing a virtual fs" issue - they can ask for help, surely. :) Eventually viro, or someone similar, will get annoyed enough to further extend VFS and library support to ease implementation of virtual fses, if needed. As (IIRC) happened yonks ago when he got fed up with procfs and others. ;)

Shuttleworth: Losing graciously

Posted Feb 19, 2014 18:55 UTC (Wed) by fandingo (guest, #67019) [Link] (9 responses)

I think the primary reason why it was moved out of the kernel is that the policies authorizing access are not simple and may not follow traditional methods. Here are a couple of situations where non-traditional policies may be desired:

* Only allow PID X to manage cgroup subtree /A/B/C.
* Processes in subtree /A/B/C/ may move their processes in /D/E/F but should not have any further control.

Traditional file permissions and even ACLs are incapable of dealing with either situation. In the first case, there's no way to grant that access without also allowing every process owned by that same user/group to control /A/B/C. In the second situation, there's no way to allow a one-way move.
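
As a rough sketch of what a user-space manager could check instead, here is how those two rules might be written down in Python. The Policy class and its method names are hypothetical and do not correspond to any existing manager's API:

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    # subtree -> the single PID allowed to manage it (first rule)
    managers: dict = field(default_factory=dict)
    # (source subtree, destination subtree) pairs where a move is allowed
    # in that direction only (second rule)
    one_way_moves: set = field(default_factory=set)

    def may_manage(self, pid, subtree):
        """True only for the one PID registered for this subtree."""
        return self.managers.get(subtree) == pid

    def may_move(self, src, dst):
        """True only if a move from src to dst was explicitly allowed."""
        return (src, dst) in self.one_way_moves
```

Neither check looks at the user or group, which is the part that U/G/O bits and ACLs have no way to express.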

Yeah, we can start adding more files to the cgroups to attempt finer control, but now things get messy, and it's likely the whole permission issue goes out the window. That's just two random policies that a system may want to enforce. To do #1 you would likely have to give one of U, G, or O +w to the subtree for that user, but that's a broken definition because hidden kernel policy will revoke writes by other processes, even though they have the proper U or G or fall into O. In the end, a cgroup would sprout many files to cover all sorts of policy combinations, and the kernel developers will be left with a monstrosity that can't be fixed because next time the same cries about API will arise.

A filesystem hierarchy is not suited to this complexity. It will have to be contorted in all kinds of ways where the permissions scheme (even with ACLs) doesn't match traditional behavior.

It's far preferable to take the LSM approach. The kernel provides the primitives throughout the kernel subsystems and talks to another module to establish and enforce policy. It's true that LSM modules are kernel modules, but they remain separate from the LSM code. (I wouldn't have a problem with a cgroup manager living as a kernel module, but I don't see any inherent benefit either.)

Shuttleworth: Losing graciously

Posted Feb 19, 2014 20:36 UTC (Wed) by HelloWorld (guest, #56129) [Link] (3 responses)

> * Only allow PID X to manage cgroup subtree /A/B/C.
> * Processes in subtree /A/B/C/ may move their processes in /D/E/F but should not have any further control.
>
> Traditional file permissions and even ACLs are incapable of dealing with either situation.
But similar restrictions might also make sense for other kinds of hierarchically organised objects. So why not generalise the existing access control mechanisms to allow for things like that instead of inventing something new?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:28 UTC (Wed) by fandingo (guest, #67019) [Link] (1 responses)

The only existing mechanism that could possibly be used would be an LSM. That probably wouldn't be a bad approach.

I don't think that there's any way that traditional permissions, even with ACLs, could be massaged into giving the necessary flexibility and clean interfaces.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:46 UTC (Wed) by dlang (guest, #313) [Link]

the thing is, existing LSMs know how to deal with permissions to filesystem objects. SELinux and AppArmor work on the existing cgroups interfaces today (as Cyberax has noted).

Plus there is the entire extended ACL structure that's available (but very seldom used because it's not needed)

On Linux, permissions for filesystem objects have not been limited to the unix rwx bits for a long time.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:49 UTC (Wed) by vonbrand (subscriber, #4458) [Link]

So the solution to the problem that the API isn't well known/standard is to create another totally new, in practice untested, "general hierarchical security model" to be applied across the board to anything with a hierarchical structure. That sounds much, much harder to do right to me.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:00 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

>* Only allow PID X to manage cgroup subtree /A/B/C.
>* Processes in subtree /A/B/C/ may move their processes in /D/E/F but should not have any further control.
Then they should talk to some kind of privileged program that can do this. Traditional UNIX used suid programs for that, and it totally makes sense to use something like cgmanager/systemd for this.

However, such situations are not really normal. In particular, changing levels in cgroups hierarchy is not a trivial operation - new subtree might have limits that the subtree which is being moved already exceeds.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:32 UTC (Wed) by fandingo (guest, #67019) [Link] (3 responses)

> However, such situations are not really normal. In particular, changing levels in cgroups hierarchy is not a trivial operation - new subtree might have limits that the subtree which is being moved already exceeds.

That's really a question of policy, though. If the policy says that some processes need to move to /D/E/F, then they need to go there, regardless of what the resource controllers say. (I'd argue that the process should be moved first, and then the resource controller terminates processes to get back into proper configuration. I don't think that it is acceptable to leave a process in the wrong subtree.)

It's worth noting that systemd-logind, which is not PID 1, does the second action today on user login.

> Then they should talk to a some kind of privileged program that can do this. Traditional UNIX used suid programs for that, and it totally makes sense to use something like cgmanager/systemd for this.

If there are acknowledged shortcomings of cgroupfs, shouldn't the API be changed to support all reasonable actions? Why should the kernel keep interfaces that clearly have shortcomings that cannot be resolved without massive API incompatibilities?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:40 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> That's really a question of policy, though. If the policy says that some processes need to move to /D/E/F, then they need to go there, regardless of what the resource controllers say.
Your policy might say that a process can use 100G of RAM, but that's not going to help you if you only have 500MB. Right now if you try to do this trick the kernel simply kills the over-limit processes.

> If there are acknowledged shortcomings of cgroupfs, shouldn't the API be changed to support all reasonable actions? Why should the kernel keep interfaces that clearly have shortcomings that cannot be resolved without massive API incompatibilities?
Like filesystem interface? Perhaps we should switch to DBUS instead of using open()?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:43 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

backwards compatibility would be a reason for keeping poor interfaces, but if you are going to break them, then you need to do so once, not multiple times.

And the new systemd API is just that, a systemd API, by definition it doesn't deal with use cases that don't use systemd.

and the currently proposed single-writer API is known to not support all use cases, so why should it replace the existing API?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:59 UTC (Wed) by fandingo (guest, #67019) [Link]

> and the currently proposed single-writer API is known to not support all use cases, so why should it replace the existing API?

You are talking about the Google multi-hierarchy complaint, right? The cgroups maintainers are on record as saying that this is not reasonable, and they intend to eliminate it.

> backwards compatibility would be a reason for keeping poor interfaces

The cgroups developers believe them to be broken, not poor. Plus, they get in the way of fixing cgroups since cgroupfs leaks too many implementation details.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 19:04 UTC (Wed) by jspaleta (subscriber, #50639) [Link] (2 responses)

I believe Tejun directly speaks to a lot of this in the 2012 discussion.

Right now... it's a mess. He even comments on the fact that the gentlemen's agreement in the form of PaxControlGroups shows exactly how problematic trying to support multiple writers actually is right now in multi-hierarchy cgroups, because they aren't all multi-hierarchy aware. You can't actually use all the controllers in a multiple-writer fashion without them stepping on each other. Caveats abound. He even correctly notes that it works quite well right now for hand-crafted setups... where one human has crafted all the cgroup interactions via scripting and it's effectively "single writer" in a sense. But once you tack on automation or applications which want to make use of cgroups side-by-side with other applications or other automation or even hand-crafted scripts... you run into problems. PaxControlGroups details the ins and outs of those problems.

So as part of making room to clean up all the problems the single userspace writer will be mandated in the middle term of the kernel side work to make a flat hierarchy api. I'm not even sure the plan is to require that single writer model forever. I believe the plan right now is to stop pretending that cgroups and all the controllers work with the multiple writer model while that flat hierarchy is being developed and controllers are all reworked to correctly support that new model.

I also think you need to keep in mind that Tejun also mentions a pie-in-the-sky goal of merging cgroups into the process hierarchy.

Look, I think the real issue here is that until the new work (both kernel and userspace) has progressed further along, there are going to be existing use cases that are only served by the deprecated API. The kernel developers have always recognized this, from the start of the discussion.
I think the discussion here is only filling in the details of what Tejun knew were local admin policy scripting centric use cases that would be impacted in the short to mid term. Tejun clearly states the choices on how to proceed involved a trade-off that would impact some existing use cases.

The real question is, will the linux distribution vendors continue to provide userspace solutions which support the old API as an option? I have no idea on that. I know that libvirt for example has a transition plan in place to support the older cgroupfs if the systemd D-Bus API is not available on the host. But I have no idea if any distribution vendors are going to expose a configuration option to pick which cgroups API to use when mounting cgroups. So Cyberax should probably start making the case to his distribution vendors about supporting his use case by making it possible to choose to run the deprecated cgroupfs API in the future, so long as the API is available and not pulled from the kernel the vendor ships.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:56 UTC (Wed) by dlang (guest, #313) [Link]

Everyone is in agreement that the old multiple hierarchy approach is a problem and that all controllers need to use the same hierarchy.

But using that as justification for a single-writer model doesn't compute, what does one have to do with the other?

> I'm not even sure the plan is to require that single writer model forever.

I agree with you that the kernel developers have said this, I just can't find a quote easily to back this up.

But if this is the case, having systemd take complete control and then defining a complex DBUS interface to be used for delegation strikes me as a very bad thing to do 'temporarily'

> The real question is, will the linux distribution vendors continue to provide userspace solutions which support the old API as an option?

is systemd going to even allow this? or is systemd going to say that it's broken if the new interface isn't there?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:12 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Sure, multiple hierarchies writers will have to be rewritten. That's totally OK because the old cgroups API definitely needs to be fixed with a jackhammer. Nobody argues that. It's also clear that the unified tree will make some use-cases impossible, at least for now.

And that's OK - there _are_ valid technical reasons why the current multiple hierarchies model is broken. These reasons are clearly spelled out in scores of mailing list messages.

For example, there are problems with memory accounting and blkio and that's why blkio can't account for buffered writes right now.

The switch to the single-writer model, however, has no such rationale. There are literally NO arguments at all for it that I can find. One would think that unfixable security problems deserve at least a mailing list message, but there's literally _nothing_ there.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 19:13 UTC (Wed) by smurf (subscriber, #17840) [Link] (2 responses)

> It very much sounds like the kernel cgroups people simply don't
> want to have to work out the details of what is needed, and so
> want to punt it to user-space. It sounds almost a social problem,
> more than a technical one.

Surprise: You are exactly right. The kernel people do not WANT to set policy there because the requirements are unknown / too diverse / we don't have much experience what we actually need in complex real-world scenarios / take your pick.

We do know that a user/group scheme will not work: you can nest namespaces, and the process which sets up the outer namespace's access rights does not know which user IDs will eventually end up being mapped to the processes inside these (sub)containers. This is (probably) why cgroupfs does not have ACLs: they'd be insufficient anyway.

It's not the kernel's job to set policy. It's the kernel's job to facilitate a stable ABI, and leave policy to user space where it belongs.

Besides, access rights are insufficient for another reason. Suppose you want to limit users' memory usage to 100MB each; if they want to have 1GB of main memory they can -- but only two at the same time, and only for 10 minutes.

This is not a particularly exotic requirement for a multiuser system. If somebody adds code for that kind of thing to their favorite cgroupsmanager program, no problem whatsoever. In the kernel? don't even think of doing that.
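
A minimal sketch of what such code could look like in a user-space manager, in Python. The class name BoostPolicy and the constants are made up for illustration, and actually applying the returned limit to a cgroup is left out:

```python
import time

BASE_LIMIT = 100 * 1024**2   # 100MB default per user
BOOST_LIMIT = 1024**3        # 1GB while boosted
MAX_BOOSTED = 2              # at most two boosted users at once
BOOST_SECONDS = 10 * 60      # each boost lasts ten minutes

class BoostPolicy:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.expiry = {}  # user -> time at which the boost ends

    def _expire(self):
        # Drop every boost whose ten minutes are over.
        now = self.clock()
        self.expiry = {u: t for u, t in self.expiry.items() if t > now}

    def request_boost(self, user):
        """Grant a boost if a slot is free; return True on success."""
        self._expire()
        if user in self.expiry:
            return True
        if len(self.expiry) >= MAX_BOOSTED:
            return False
        self.expiry[user] = self.clock() + BOOST_SECONDS
        return True

    def limit_for(self, user):
        """Memory limit in bytes that should currently apply to `user`."""
        self._expire()
        return BOOST_LIMIT if user in self.expiry else BASE_LIMIT
```

The manager would call limit_for() whenever it refreshes a user's cgroup; the kernel only ever sees the resulting number.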

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:06 UTC (Wed) by dlang (guest, #313) [Link]

> Besides, access rights are insufficient for another reason. Suppose you want to limit users' memory usage to 100MB each; if they want to have 1GB of main memory they can -- but only two at the same time, and only for 10 minutes.

> This is not a particularly exotic requirement for a multiuser system. If somebody adds code for that kind of thing to their favorite cgroupsmanager program, no problem whatsoever. In the kernel? don't even think of doing that.

what multiuser system supports these sorts of limits today? If you are claiming that they are not unusual, that must mean that something common supports them.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:22 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

First, you create this hierarchy: root/user1/delegate, root/user2/delegate. Then you set memory limits on them and start a daemon that does the balancing act. This daemon should have permissions to change 'user1' and 'user2' hierarchies.

But nobody stops you from making 'delegate' directories writable for the users! They won't be able to affect the settings in the parent levels of cgroups, and they'll be limited by them.
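
Roughly, in Python (a sketch only: the function name is made up, the root is whatever directory you pass in, and a real setup would chown each 'delegate' directory to its user rather than opening it up with chmod as done here):

```python
import os

def make_delegated_tree(root, users):
    """Create root/<user>/delegate, with only the delegate level writable."""
    paths = {}
    for user in users:
        user_dir = os.path.join(root, user)
        delegate = os.path.join(user_dir, "delegate")
        os.makedirs(delegate, exist_ok=True)
        # Limits are set at this level, which stays under the daemon's control.
        os.chmod(user_dir, 0o755)
        # Stand-in for "writable for the user"; see the chown note above.
        os.chmod(delegate, 0o777)
        paths[user] = delegate
    return paths
```

Whatever the users create under 'delegate' is still bounded by the limits on the level above it.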

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:55 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Except that delegation to untrusted processes is not inherently dangerous. Cgroups (by design!) limit what the children processes can do by setting limits on their parents. Well, except for the broken blkio controller that is being fixed anyway.

I'm pretty familiar with the current cgroups interface and it seems that an untrusted process _at_ _most_ can cause high load on the kernel and perhaps significantly slow down other processes.

There's also a small problem with several controllers which use weights to distribute resources, so it's possible for a cgroup to affect its siblings. But again, that's trivially worked around by using an intermediary tree level if one wants to delegate a subtree to untrusted processes.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:15 UTC (Wed) by fandingo (guest, #67019) [Link] (2 responses)

> it seems that an untrusted process _at_ _most_ can cause high load on the kernel and perhaps significantly slow down other processes.

Only if that process is unprivileged. If you have a service that runs a privileged process (like the parent PID of Apache or OpenSSH), it can modify any part of the cgroup hierarchy.

A single-writer model (especially if the writer is PID 1) with policy enforcement precludes this behavior. Even a privileged user would not be able to gain authorization to perform cgroup changes outside what the policy allows (like managing its subtree). Furthermore, a privileged user couldn't even connect to the kernel cgroup API directly, because a writer is already registered, and if it's PID 1, cannot be crashed in order to register a malicious writer.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:20 UTC (Wed) by dlang (guest, #313) [Link]

existing LSMs can block access to cgroups by even root processes today.

or you can play games with the PID namespace so that those processes are only root within their limited context, not for the whole system.

But if you are concerned about a malicious root process, the fact that it can change cgroups settings seems like a pretty minor thing to worry about.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:25 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> Only if that process is unprivileged. If you have a service that runs a privileged process (like the parent PID of Apache or OpenSSH), it can modify any part of the cgroup hierarchy.
It might as well simply do 'chmod -R a+r+w+x /' to the same effect.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:48 UTC (Wed) by dlang (guest, #313) [Link] (3 responses)

by the way, the new interface apparently isn't actually limited to being accessed by a single process, a group of cooperating processes can be used instead.

this sounds like an even bigger problem to me: if the group of coordinating processes don't coordinate well, nothing in the kernel can know that they aren't, and big problems can result.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 22:53 UTC (Wed) by jspaleta (subscriber, #50639) [Link] (2 responses)

Are you sure about that? Can you provide me instructions on how to do that when the cgroups is mounted with the new API?

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:09 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

I don't know how to do it, but I saw something posted in the last week or so (I think on lwn related to systemd) that stated that this was the case.

Shuttleworth: Losing graciously

Posted Feb 19, 2014 23:14 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Lennart (who is known as 'mezcalero' here) meant that systemd can give other processes access to a subtree, through systemd's interface.

That doesn't change the fact that only one process can do direct cgroups manipulations.