Shuttleworth: Losing graciously
Posted Feb 15, 2014 21:30 UTC (Sat) by jspaleta (subscriber, #50639) In reply to: Shuttleworth: Losing graciously by stgraber
Parent article: Shuttleworth: Losing graciously
Am I understanding your future scenario correctly?
In the future, an Ubuntu host system running systemd as init by default:
1) runs systemd PID=1 and is the cgroups manager for the host system exposing the documented cgroup management API.
2) runs a cgmanager process which exposes its own API, but internally will (in the future) communicate cgroup management requests back to systemd using systemd's native API and will not be touching cgroupfs directly.
3) guest lxc containers run something that talks to the cgmanager process talking the cgmanager API.
Is that the future scenario you expect to see once Ubuntu has switched to systemd?
Assuming yes, here's what's baking my noodle. If in the future cgmanager is just going to end up talking to systemd using systemd's API, how does this configuration provide any functionality or service any use cases above and beyond what systemd's API already exposes? Serious question. cgmanager's API doesn't appear to be an abstraction; it appears to be mired in the details of what cgroupfs exposes. So I don't get how a future where cgmanager just talks to systemd via systemd's API is more capable or can cover additional use cases than systemd can directly. Well, not without patching the bejesus out of systemd on the host, which isn't what you seem to be proposing. It just looks like cgmanager is going to be wedged in between for no benefit at all.
In contrast, I'm far less confused about how libvirt's future roadmap is going to work. Libvirt exposes an abstracted API that doesn't go into cgroupfs minutiae. So I get how libvirt can expose a stable abstracted API for containers to make use of, and can then internally talk to systemd's abstracted cgroup API, and it will all work out. Libvirt's API doesn't propose to expose capabilities or use cases thought to be unsupported by systemd's API.
-jef
      Posted Feb 15, 2014 22:34 UTC (Sat)
                               by stgraber (subscriber, #57367)
                              [Link] (110 responses)
       
In those cases you'll get requests coming from sub-containers where the emitter of those requests is root in their own namespace but not on the host. The whole cgmanager/cgproxy API is designed so that we can safely check what process the requester actually owns and then allow it to mess with those and those only. 
So we basically track the various pid namespaces and user namespaces, deal with the uid and pid translation and then do ACL checks on the host. 
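The uid translation step can be illustrated with a minimal sketch, assuming the mapping is available in the three-column format that the kernel exposes in /proc/<pid>/uid_map (first id inside the namespace, first id outside, range length). The function names and the sample range length of 65536 are illustrative assumptions; this is not cgmanager's actual code.

```python
from typing import List, Optional, Tuple

# One row of a uid_map/gid_map file:
# (first id inside the namespace, first id outside it, length of the range)
IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse the contents of a uid_map or gid_map file into rows of ints."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            rows.append((inside, outside, count))
    return rows


def to_host(id_in_ns: int, id_map: IdMap) -> Optional[int]:
    """Translate an id as seen inside the namespace to the id outside it."""
    for inside, outside, count in id_map:
        if inside <= id_in_ns < inside + count:
            return outside + (id_in_ns - inside)
    return None  # the id is not mapped


def to_namespace(host_id: int, id_map: IdMap) -> Optional[int]:
    """Translate an outside id to the id seen inside the namespace."""
    for inside, outside, count in id_map:
        if outside <= host_id < outside + count:
            return inside + (host_id - outside)
    return None  # the id has no meaning inside this namespace
```

With a map of "0 200000 65536", root inside the container corresponds to uid 200000 outside, and an outside uid of 1000 has no counterpart inside, which is the kind of case an ACL check on the host has to reject.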
     
    
      Posted Feb 17, 2014 21:03 UTC (Mon)
                               by fandingo (guest, #67019)
                              [Link] (109 responses)
       
     
    
      Posted Feb 17, 2014 21:28 UTC (Mon)
                               by stgraber (subscriber, #57367)
                              [Link] (108 responses)
       
What Lennart refers to is running systemd in a container managed by systemd (with nspawn) within the host's user namespace and probably without running a full distro inside it (though that last bit doesn't matter that much). 
As I already stated before, LXC also supports distros that do not have systemd, including Android. cgmanager was designed to be generic enough to work on any of those and will itself talk to the systemd API or any other similar API instead of cgroupfs if they offer an API that's low level enough for us. 
 
Now if you want an example of complex setup which I need to support with LXC (due to actual user demand, not because I want to find a far fetched example), consider this: 
Host runs Ubuntu 14.04 with a 3.13 kernel (upstart). 
This all works today with LXC 1.0 and cgmanager, the cgmanager host socket gets passed from one level of container into the next. If the container cares about cgroups (all of the above except the last one), they need to spawn a cgproxy process that'll do SCM calls over DBus to pass user credentials and PIDs in a way that gets translated by the kernel when crossing namespace boundaries. 
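For readers unfamiliar with the credential-passing part, here is a minimal sketch of the underlying kernel mechanism, assuming a plain Linux Unix socket pair rather than the DBus transport described above. The receiver sets SO_PASSCRED and the kernel attaches the sender's pid, uid and gid as SCM_CREDENTIALS ancillary data; those values are reported as seen from the receiver's namespaces. The helper names and the message payload are made up for illustration.

```python
import os
import socket
import struct

# Layout of struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid
UCRED = struct.Struct("iII")


def recv_with_creds(sock, bufsize=1024):
    """Receive one message plus the kernel-supplied (pid, uid, gid) of its sender."""
    data, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_SPACE(UCRED.size)
    )
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            return data, UCRED.unpack(cdata[:UCRED.size])
    return data, None  # no credentials were attached


def demo():
    """Send a request over a socket pair and read back the sender's credentials."""
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # Ask the kernel to attach sender credentials to incoming messages.
        server.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
        client.send(b"move pid 50 to cgroup a")
        return recv_with_creds(server)
    finally:
        client.close()
        server.close()
```

The point of the design is that the receiver never trusts ids written in the message body; it only trusts what the kernel attached.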
The main difficulty in the above is when uid 0 in the leaf container with a mapped uid of 200000 (depending on the configured mappings) is requesting for PID 50 to be moved into cgroup "a". 
That's because: 
So that's why we have cgmanager, why we use ucreds to get translated uids and pids and why we need complex logic (using namespace attach and such) to check whether uid 200000 on the host is indeed uid 0 in its namespace and whether pid 123123 is in the pid namespace that's linked with its user namespace and finally whether it's actually supposed to be able to write to lxc/c1/c2/c3/a. 
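The last of those checks, whether the requester is supposed to be able to write to lxc/c1/c2/c3/a, can be sketched as a path containment test. This is a simplified illustration under the assumption that the manager already knows the requester's own cgroup; `resolve_cgroup` is a hypothetical helper, not part of cgmanager's API, and ownership checks are left out.

```python
import posixpath


def resolve_cgroup(requester_cgroup: str, requested: str) -> str:
    """Resolve a requested cgroup name relative to the requester's own cgroup.

    Raises PermissionError if the result would land outside the requester's
    subtree, for example through ".." components.
    """
    base = posixpath.normpath("/" + requester_cgroup.strip("/"))
    target = posixpath.normpath(posixpath.join(base, requested.lstrip("/")))
    prefix = base.rstrip("/") + "/"
    if target != base and not target.startswith(prefix):
        raise PermissionError(f"{requested!r} escapes {base!r}")
    return target
```

A request for "a" from a container living in lxc/c1/c2/c3 resolves to /lxc/c1/c2/c3/a, while a request for "../../other" is refused.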
That example is actually a fairly simple and common example of what cgmanager does. We have way trickier cases, but those usually take me an hour or so to express properly (they mostly happen on older kernels or when a sub-sub-container wants to add a pid to a cgroup which is owned by a user in that namespace. The PID ownership logic becomes pretty tricky pretty quickly.) 
     
    
      Posted Feb 18, 2014 1:45 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (107 responses)
       
> As I already stated before, LXC also supports distros that do not have systemd, including Android. cgmanager was designed to be generic enough to work on any of those and will itself talk to the systemd API or any other similar API instead of cgroupfs if they offer an API that's low level enough for us. 
What's the reason for not adopting the systemd DBus API, especially when it preceded CGManager? That clearly complicates the situation for application developers, or now we've added another mandatory abstraction layer, CGManager, that should not be needed on systems that already have a cgroups manager.  
I guess I don't see what the future possibly holds for CGManager. Even the present is dicey beyond the cgroupfs driver. There's not even *one* page of documentation or explanation on how to use CGManager. (This appears to be the official project page: http://cgmanager.linuxcontainers.org/.) I was perplexed and spent far too long on Google before coming to the conclusion that CGManager's only mention is on a few mailing list threads. I can't find any definition of the DBus API for CGManager. I was under the impression that CGManager was ready for use.  
Over the next year and a half, it is extremely likely that new GNU/Linux installations will overwhelmingly use systemd. During that time, it is hard to envision the actual kernel cgroupfs driver disappearing.  
Combining the longevity that the kernel cgroupfs will have with the simplicity of developing the missing features of systemd (system bus delegation and policies), it's not clear that CGManager has much purpose or will become the generic cgroup manager it was initially advertised as.  
     
    
      Posted Feb 18, 2014 2:18 UTC (Tue)
                               by stgraber (subscriber, #57367)
                              [Link] (7 responses)
       
In the mean time, Serge published some notes on github: 
As for application developers, our biggest user is LXC and LXC certainly knows why and how to use it (as it's the same group of people who developed both and cgmanager was mostly built from LXC's old cgroup management code).  For the others, the choice is relatively straightforward, if you want something very simple that works everywhere, just use cgroupfs directly. If you care about namespaces, uid/pid translation and nesting, use cgmanager. If you prefer to use a standard DBus API and only care about systemd-based distros, use systemd. 
I believe my earlier comment explains why the systemd API isn't sufficient for LXC's needs and for the other group of people involved with cgmanager. 
As for all distros moving to systemd, I personally think this would be a pretty sad day, diversity is very important and is the main source of improvements. Anyway, we don't expect Android to start using systemd in the near future and that's one of the reasons why LXC will be using cgmanager. 
     
    
      Posted Feb 18, 2014 5:28 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (6 responses)
       
That's a real shame. How can you be confident that it is tested or used outside the LXC use-cases, or that the API is stable for a 1.0 release?  
> I believe my earlier comment explains why the systemd API isn't sufficient for LXC's needs and for the other group of people involved with cgmanager. 
Besides the delegation and policy components, what's inadequate with systemd's API (not implementation)?  
I feel like CGManager was advertised as the cgroup manager for everyone not using systemd. After talking to you, it seems like an LXC-only utility that isn't likely to see much additional use. The use cases that you have outlined strongly indicate that the cgroupfs API is likely to be the only thing that is used.  
> diversity is very important and is the main source of improvements. 
This is a truism that is oft repeated, but I don't see objective evidence that it's actually true. In fact, the Linux kernel is a perfect counter example. There hasn't been useful competition to Linux for years now, and kernel developers have not had trouble innovating.  
> Anyway, we don't expect Android to start using systemd in the near future and that's one of the reasons why LXC will be using cgmanager. 
Is Android going to switch to CGManager? If not, how useful is it for testing or use when official Android uses the kernel cgroupfs, not the cgroupfs provided by CGManager? 
     
    
      Posted Feb 18, 2014 7:57 UTC (Tue)
                               by mbunkus (subscriber, #87248)
                              [Link] (1 responses)
       
But the Linux kernel does have competition. It's called Windows, Mac OS, the BSDs and the commercial Unices. 
And monocultures are bad for innovation. Just look at the regulated telecommunication industries before they were split up (e.g. in the US) or the governmentally-mandated restrictions lifted (e.g. Germany). The Deutsche Bundespost (predecessor to what today is Deutsche Post AG, the postal service; Deutsche Telekom AG with its offspring T-Online; Deutsche Postbank AG, a bank) was known for bad service, obscene prices, a snail-like pace of innovation, complete lack of flexibility. 
However, systemd works in a totally different environment. There are no regulatory authorities here; the only thing preventing yet another init system from coming along and taking its place is technical excellence, which translates into people seeing the need for it and then following through with a proper implementation. Therefore I'm not worried about a perceived lack of diversity regarding systemd, especially if the alternatives are so far behind in terms of functionality. 
     
    
      Posted Feb 18, 2014 16:43 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] 
       
It's not clear that *anyone* is interested in an alternative and modern init system. If 14.04 weren't the LTS release of Ubuntu, Canonical would be done with Upstart at this point, and Mark Shuttleworth has said that they will switch to systemd as soon as Debian makes the switch. That's the last major holdout from GNU/Linux distributions. The only other system, which is not GNU/Linux, is Android, and I'm sure they'll continue to do their own quasi-proprietary thing. 
I guess the bigger question is if something were to come along and try to fulfill the features that systemd does: why wouldn't that init system bring its own cgroup manager? 
 
 
     
      Posted Feb 18, 2014 10:43 UTC (Tue)
                               by hummassa (subscriber, #307)
                              [Link] 
       
This is a silly oversimplification. There is lots of competition in kernel-space: Windows (DOS-based 98 and VMS-based NT), at least five BSDs, commercial unices (I worked with SunOS/Solaris, HP/UX, AIX, ULTRIX, the infamous Microsoft/SCO Xenix, amongst a dozen others). 
IIRC, once upon a time Linux got inspired by VMS/WinNT for its asynchronous IO and by AIX via Sequent for its RCU synchronization; FreeBSD was faster on the same hardware, NetBSD supported more hardware architectures, and the BSDs got plug-and-play hardware, firewalls, USB, etc. first or better.  
     
      Posted Feb 20, 2014 19:50 UTC (Thu)
                               by lsl (subscriber, #86508)
                              [Link] (2 responses)
       
If there's any ongoing innovation left at all in the OS kernel space it isn't happening in Linux. Not that that's (necessarily) a bad thing: Linux is (and is supposed to be) a 'production' OS with users relying on it for day-to-day work. While it has some cool new stuff that wasn't there in Unix back then I still sometimes wish the attention given to new OS research was a bit greater. 
Well, that particular train probably left the station more than a decade ago. It seems that what we have is 'good enough' that people won't consider putting up with the pain of transitioning to something new and unknown. 
Then again, they seem to gladly endure the torture of gigantic 'programming frameworks' aimed at making up for weaknesses in the operating system interface. ;-) 
     
    
      Posted Feb 20, 2014 23:54 UTC (Thu)
                               by vonbrand (subscriber, #4458)
                              [Link] 
       I wonder what the current systemd brouhaha is all about then. Also the recent article here on file-owned locks... 
     
      Posted Feb 21, 2014 4:00 UTC (Fri)
                               by mathstuf (subscriber, #69389)
                              [Link] 
       
     
      Posted Feb 18, 2014 12:38 UTC (Tue)
                               by rleigh (guest, #14622)
                              [Link] (98 responses)
       
As a systems programmer, I find the use of DBUS APIs as opposed to properly designed and implemented system calls and filesystem interfaces abhorrent.  Mandating the use of DBUS for fundamental system functions is wrong on many levels. 
     
    
      Posted Feb 18, 2014 12:43 UTC (Tue)
                               by HelloWorld (guest, #56129)
                              [Link] (26 responses)
       
     
    
      Posted Feb 18, 2014 12:46 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (25 responses)
       
It _all_ works for good Linux filesystem-based interfaces, like /proc or /sys. But somehow not for cgroups. 
WTF? 
     
    
      Posted Feb 18, 2014 17:07 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (24 responses)
       
2) It's not possible to implement a complex policy with cgroupfs. The cgroup filesystem does not support ACLs, and consequently, you're left with the limited UGO permissions.  
3) I don't know what this is supposed to mean. DBus methods are more transparent to the caller since the call returns with a meaningful response (even if empty). In fact, it seems that `echo` is the primary way that people write to special file systems. From the cgroups.txt documentation:  
> bash's builtin 'echo' command does not check calls to write() against errors. If you use it in the cgroup file system, you won't be able to tell whether a command succeeded or failed. 
4) Delegation is currently missing, but the systemd developers have affirmatively stated that they intend to add it.  
Lastly, it's pretty clear that the kernel developers don't like /sys that much. I wouldn't be surprised to see it gradually moved to DBus over the next few years either.  
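The `echo` caveat quoted above from cgroups.txt is about the caller ignoring the result of write(). A minimal sketch of a checked write, in Python rather than shell, looks like this; `write_control_file` is a hypothetical helper name and the paths it would be used on are up to the caller.

```python
import os


def write_control_file(path: str, value: str) -> None:
    """Write a value to a control file and raise if the kernel rejects it.

    os.open() and os.write() raise OSError with the errno the kernel
    returned (EACCES, EINVAL, ESRCH, ...), so failures are not silent.
    """
    data = value.encode()
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.write(fd, data)
        if written != len(data):
            raise OSError(f"short write to {path}: {written} of {len(data)} bytes")
    finally:
        os.close(fd)
```

Whichever interface ends up being used, the caller only learns about a rejected request if it looks at the result.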
     
    
      Posted Feb 18, 2014 17:46 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (23 responses)
       
> 2) It's not possible to implement a complex policy with cgroupfs. The cgroup filesystem does not support ACLs, and consequently, you're left with the limited UGO permissions.  
> 3) I don't know what this is supposed to mean. DBus methods are more transparent to the caller since the call returns with a meaningful response (even if empty).  
> In fact, it seems that `echo` is the primary way that people write to special file systems. From the cgroups.txt documentation 
>  4) Delegation is currently missing, but the systemd developers have affirmatively stated that they intend to add it.  
>  Lastly, it's pretty clear that the kernel developers don't like /sys that much. I wouldn't be surprised to see it gradually moved to DBus over the next few years either.  
     
    
      Posted Feb 18, 2014 17:57 UTC (Tue)
                               by jspaleta (subscriber, #50639)
                              [Link] (18 responses)
       
For example, cgmanager's draft readme, containing a draft design spec for its D-BUS API showed up in the source tree only like 5 days ago.  
I bet once cgmanager's API is deemed stable, libvirt developers will look at supporting it. 
 
 
     
    
      Posted Feb 18, 2014 18:04 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (17 responses)
       
> I bet once cgmanager's API is deemed stable, libvirt developers will look at supporting it. 
And how about delegation to Android userspace which does not use DBUS at all? 
     
    
      Posted Feb 18, 2014 18:05 UTC (Tue)
                               by jspaleta (subscriber, #50639)
                              [Link] (5 responses)
       
 
 
     
    
      Posted Feb 18, 2014 18:17 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (4 responses)
       
Or to be precise, AppArmor simply treats it as usual file operations and can apply all the regular policies. 
     
    
      Posted Feb 18, 2014 18:24 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (1 responses)
       
     
    
      Posted Feb 18, 2014 18:34 UTC (Tue)
                               by jspaleta (subscriber, #50639)
                              [Link] 
       
What's not clear to me is how kdbus's support for LSM will practically differ from how the reference userspace daemon's hooks worked.  As in will it be more expressive or less expressive in terms of how you can lock down how applications use the bus.  Still trying to wrap my head around that.  
     
      Posted Feb 18, 2014 18:25 UTC (Tue)
                               by jspaleta (subscriber, #50639)
                              [Link] (1 responses)
       
If you code your spitemanager and you want it to interop with the other managers, then you'll have to expose an API for them to work with. 
I recognize that you think any manager construct is sub-optimal to the cgroupfs construct. Noted.  But your original question was about whether systemd would support cgmanager and your hypothetical spitemanager, not whether any specific manager would support the same thing that the cgroupfs construct does.  My point stands: the alternative managers to systemd's manager have to expose a stable API to interoperate with. Demanding that systemd support an alternative manager that doesn't have a stable API is putting the cart before the horse.   
     
    
      Posted Feb 18, 2014 19:53 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
With the brain-dead change to single-writer you'd have to reinvent the whole filesystem in DBUS to replicate the functionality. Look, we've already reinvented security policies, delegation (bind-mounts) and almost reinvented containers! 
     
      Posted Feb 18, 2014 18:14 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (10 responses)
       
This is going to be required no matter what. Your choices are either systemd or CGManager for your cgroups manager. Both use DBus. All containers will need support for interfacing with a DBus cgroup manager. CGManager just provides two APIs for its users: traditional cgroupfs style and DBus. The cgroupfs interface cannot be passed through a container.  
Arguing against DBus as the principal API for any cgroup manager is a losing cause.  
     
    
      Posted Feb 18, 2014 18:18 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (9 responses)
       
> Arguing against DBus as the principal API for any cgroup manager is a losing cause.  
     
    
      Posted Feb 18, 2014 18:29 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (8 responses)
       
Huh? The kernel cgroups are exposed by system calls to a single writer. The only two writers that exist today (or have even been announced) both principally expose cgroups using a DBus API. CGManager also supports the cgroupfs API.  
The inclusion of kDBus (when that happens later this year) is orthogonal to how the kernel exposes cgroups to the manager.  
     
    
      Posted Feb 18, 2014 19:54 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (7 responses)
       
I repeat, IT WORKS ALREADY. 
     
    
      Posted Feb 18, 2014 20:44 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (6 responses)
       
You're complaining about a deprecated feature that will be removed. You have four options: 
1) Start using the systemd or CGManager DBus APIs. 
2) Use CGManager with its cgroupfs provider. 
3) Implement spitemanager for whatever API (presumably cgroupfs) you desire.  
4) Fork the kernel or stop using new versions.  
     
    
      Posted Feb 18, 2014 21:04 UTC (Tue)
                               by jspaleta (subscriber, #50639)
                              [Link] 
       
Though it will be interesting to see if Ubuntu patches libvirt as shipped in Trusty to talk to cgmanager. Right now it appears that libvirt as packaged in Trusty isn't patched for that yet and will be relying on the cgroupfs in 14.04.  So that's an interesting little wrinkle. Will a Trusty host running cgmanager be able to work with Trusty libvirt-based containers?  
     
      Posted Feb 18, 2014 21:34 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (4 responses)
       
     
    
      Posted Feb 18, 2014 21:57 UTC (Tue)
                               by mathstuf (subscriber, #69389)
                              [Link] (3 responses)
       
[1]Assuming that you don't somehow convince kernel developers to forego the single-writer changes (which seems very unlikely at this point). 
     
    
      Posted Feb 18, 2014 22:20 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (2 responses)
       
0) Regular permissions apply. 
1) Add 'pid-lock' file at each level of cgroups tree. Everyone can modify cgroups tree if this file is empty.  
2) Once you write a pid into this file only this process can make modifications to this tree level and deeper. 
3) The pid-lock process can modify pid-lock files in its subtree, either clearing them completely or by writing another pid. It doesn't lose access as long as it's still alive. 
4) Subtree moves must respect pid-locks and permissions. 
That's basically it. It still allows locking the tree against accidental modifications and also gives a clear path for delegation. DBUS connoisseurs can still use fully DBUS-based delegation and access control and everybody else can use the normal filesystem-based API. 
It's also possible to add a bitmask of delegated controllers. For example, so that the parent controller can limit the delegated controllers to cpu manager but not costly memcg. 
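The pid-lock rules above can be modelled in a few lines. This is a toy in-memory sketch of one possible reading of rules 1 to 3, not kernel code: regular permissions (rule 0), subtree moves (rule 4), liveness of the locking process and the controller bitmask are left out, and the class and method names are invented for illustration.

```python
from typing import Dict, List, Optional


class PidLockTree:
    """Toy model of a cgroup tree with a 'pid-lock' file at each level."""

    def __init__(self) -> None:
        # Maps a cgroup path to the pid written into its pid-lock file.
        self.locks: Dict[str, int] = {}

    @staticmethod
    def _levels(path: str) -> List[str]:
        """Return every level from the root down to the given path."""
        parts = [p for p in path.split("/") if p]
        return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

    def may_modify(self, pid: int, path: str) -> bool:
        """Rules 1 and 2: free if no lock applies, otherwise lock holders only."""
        holders = [self.locks[lvl] for lvl in self._levels(path) if lvl in self.locks]
        return not holders or pid in holders

    def set_lock(self, pid: int, path: str, new_pid: Optional[int]) -> None:
        """Rule 3: a process allowed at this level may clear or rewrite its lock."""
        if not self.may_modify(pid, path):
            raise PermissionError(f"pid {pid} may not change the lock on {path}")
        key = self._levels(path)[-1]
        if new_pid is None:
            self.locks.pop(key, None)
        else:
            self.locks[key] = new_pid
```

In this model a manager that locks /a and then writes another pid into the lock of /a/b has delegated that subtree: the delegate can modify /a/b and below but nothing else under /a, while the manager keeps access everywhere.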
     
    
      Posted Feb 19, 2014 14:24 UTC (Wed)
                               by HelloWorld (guest, #56129)
                              [Link] (1 responses)
       
     
    
      Posted Feb 19, 2014 14:28 UTC (Wed)
                               by mathstuf (subscriber, #69389)
                              [Link] 
       
     
      Posted Feb 18, 2014 18:05 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (3 responses)
       
> How? Can you point me out a command-line utility that can show who has access to a given group? Do I have to parse XML? 
The cgroup delegation is not finished. Systemd generally allows read access on most things, and I doubt this would be different for cgroups. Therefore, it should be as simple as running a dbus-send command.  
>> 2) It's not possible to implement a complex policy with cgroupfs. The cgroup filesystem does not support ACLs, and consequently, you're left with the limited UGO permissions.  
> So let's add SELinux policies and ACLs to cgroupfs. It's going to be useful in other situations, like /sys delegation. For me, UGO permissions are plenty enough. 
You've clearly decided to go it alone on this (or to just keep complaining). Step on up and show us the progress you've made.  
>> 3) I don't know what this is supposed to mean. DBus methods are more transparent to the caller since the call returns with a meaningful response (even if empty).  
> How do I check which cgroups are writable by me, for example? I have tons of tools for that for the classic filesystem interfaces. 
See #1.  
>> 4) Delegation is currently missing, but the systemd developers have affirmatively stated that they intend to add it.  
> Only for other systemd containers. There are no plans to support cgmanager or my own incompatible manager that I'm just going to write out of spite. 
Except systemd has a good record of keeping API stability. A container will be free to connect to DBus and talk to systemd. The container's cgroup manager can expose whatever interface it wants inside. (This is exactly the same approach that CGManager is taking. There will always be a requirement that the cgroup manager inside the container knows how to talk to the manager outside the container.) 
 
 
     
    
      Posted Feb 18, 2014 18:11 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
Where are they? 
I suspect that all the people pontificating about 'just connect to DBUS' do not even understand how it works. For example, what happens if a container starts its own DBUS daemon that knows nothing about the external daemon? How is authorization of connections handled? 
> Except systemd has a good record of keeping API stability. A container will be free to connect to DBus and talk to systemd. The container's cgroup manager can expose whatever interface it wants inside. 
     
      Posted Feb 19, 2014 1:48 UTC (Wed)
                               by dlang (guest, #313)
                              [Link] (1 responses)
       
I'm not sure how you can say that. 
Systemd hasn't really been around very long, so they are almost entirely still on their first system; they haven't had much reason to modify much. 
But their attitude that they are the only thing that matters, and willingness to take over and replace existing APIs with their different replacement doesn't give _me_ much confidence that they will maintain the old APIs long term. 
     
    
      Posted Feb 19, 2014 12:34 UTC (Wed)
                               by pizza (subscriber, #46)
                              [Link] 
       
So, your argument against their public commitment to (and track record of) API stability is to say "I don't believe them." 
You then try to justify that attitude by saying that they're still doing new things, and those new things may require new APIs.  Well... duh.  It's rather pointless to do something new if you don't create a way of managing it. 
 
 
     
      Posted Feb 18, 2014 13:01 UTC (Tue)
                               by andresfreund (subscriber, #69562)
                              [Link] 
       
     
      Posted Feb 18, 2014 16:53 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (69 responses)
       
What's actually good about filesystem interfaces besides familiarity? There's nothing advantageous about them, and in fact, there is a great difficulty in sending information back to the caller.  
System calls are only good when you're talking to the kernel. Without a more full-fledged IPC mechanism (like DBus perhaps), you don't want the kernel acting as the arbiter translating calls to other programs.  
Lastly, I don't understand the dislike of DBus. What's bad about reliability, easy type support, extensive policy support, and multiple messaging paradigms? I'm also a programmer, and I don't understand why anyone wouldn't want that. Perhaps you could elaborate.  
     
    
      Posted Feb 18, 2014 16:59 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (7 responses)
       
I have no idea. People usually make some noises about PolKit so I checked it. Its configuration looks something like this: http://cgit.freedesktop.org/polkit/tree/data/org.freedesk... 
Yeehaw! An XML config with lots of strange options. Documentation is also quite impenetrable. 
     
    
      Posted Feb 18, 2014 17:12 UTC (Tue)
                               by michich (guest, #17902)
                              [Link] (1 responses)
       
     
    
      Posted Feb 18, 2014 17:30 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
 
     
      Posted Feb 18, 2014 19:44 UTC (Tue)
                               by mathstuf (subscriber, #69389)
                              [Link] (4 responses)
       
[1]https://wiki.archlinux.org/index.php/Polkit#Structure 
     
    
      Posted Feb 18, 2014 19:57 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (3 responses)
       
Now rules are not only opaque, they are not analyzable even in principle! Never mind dirty little tricks of JavaScript like using floats instead of ints for numbers. 
     
    
      Posted Feb 18, 2014 20:28 UTC (Tue)
                               by mathstuf (subscriber, #69389)
                              [Link] (2 responses)
       
     
    
      Posted Feb 18, 2014 21:38 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (1 responses)
       
Sigh... It looks like DBUS developers have gone off the rails completely and systemd+kernel people are happy to join the bandwagon. 
     
    
      Posted Feb 18, 2014 21:46 UTC (Tue)
                               by mathstuf (subscriber, #69389)
                              [Link] 
       
     
      Posted Feb 18, 2014 16:59 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (60 responses)
       
Uhm. Cgroups is a kernel interface. It's not an interface to some userspace program. 
     
    
      Posted Feb 18, 2014 17:08 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (59 responses)
       
     
    
      Posted Feb 18, 2014 17:30 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (58 responses)
       
     
    
      Posted Feb 18, 2014 17:57 UTC (Tue)
                               by smurf (subscriber, #17840)
                              [Link] (57 responses)
       
Right -- it needs to talk to whatever process does the actual cgroups work. Presumably, DBus is a reasonable way to do that -- you can implement more complex permissions, send structured data in, get structured replies out, and have an altogether more high-level interface for what you actually want to do instead of doing baby steps in cgroupfs. 
 
     
    
      Posted Feb 18, 2014 17:59 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (56 responses)
       
     
    
      Posted Feb 18, 2014 18:05 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (1 responses)
       
     
    
      Posted Feb 18, 2014 18:12 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
To make cgroupfs subtree accessible to a user 'vasja' I simply need to do 'chown -R vasja /sys/fs/cgroup/some/subtree' and that's it. How do I do the same with DBUS? 
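For reference, the recursive chown described here could be sketched in Python roughly as follows. This is only an illustration: `delegate_subtree` is a hypothetical helper name, and the demo runs against a throwaway temporary directory with the current user's uid rather than a real cgroupfs mount.

```python
import os
import tempfile

def delegate_subtree(root, uid, gid=-1):
    """Hand root and everything below it to uid, like `chown -R`.

    gid=-1 leaves the group unchanged.
    """
    os.chown(root, uid, gid)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # Do not follow symlinks, so changes stay inside the subtree.
            os.chown(os.path.join(dirpath, name), uid, gid,
                     follow_symlinks=False)

# Demo on a scratch directory standing in for /sys/fs/cgroup/some/subtree.
with tempfile.TemporaryDirectory() as scratch:
    os.makedirs(os.path.join(scratch, "child"))
    open(os.path.join(scratch, "child", "tasks"), "w").close()
    delegate_subtree(scratch, os.getuid())
    owners = {os.stat(os.path.join(scratch, p)).st_uid
              for p in ("child", "child/tasks")}
```

On a real host the call would presumably be something like `delegate_subtree("/sys/fs/cgroup/some/subtree", pwd.getpwnam("vasja").pw_uid)`, run as root.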
     
      Posted Feb 18, 2014 18:57 UTC (Tue)
                               by smurf (subscriber, #17840)
                              [Link] (53 responses)
       
I'm not involved with the details of which dbus call does, or will do, exactly what, and I like it that way. So, sorry but you'll have to get the actual implementation details from somebody else. 
No, the kernel people are not "idiotic" when they want to impose a one-writer-only policy on the cgroups subsystem. It makes perfect sense to have one process arbitrate access instead of adding ACL support to cgroupfs and dealing with multiple processes stepping onto each other's toes. 
In any case, unless you can actually convince them to not enforce a single-writer policy after all, demanding that a multi-writer cgroupfs should continue to be available is … somewhat futile. Especially here; this is not a kernel mailing list. 
     
    
      Posted Feb 18, 2014 20:00 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (52 responses)
       
>No, the kernel people are not "idiotic" when they want to impose a one-writer-only policy on the cgroups subsystem.  
>It makes perfect sense to have one process arbitrate access instead of adding ACL support to cgroupfs and dealing with multiple processes stepping onto each other's toes. 
> In any case, unless you can actually convince them to not enforce a single-writer policy after all, demanding that a multi-writer cgroupfs should continue to be available is … somewhat futile. Especially here; this is not a kernel mailing list. 
     
    
      Posted Feb 18, 2014 20:51 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] (51 responses)
       
It's because these features are currently being developed. They're not finished. None of this work will be completed until KDBus is merged. 
I suggest that if you have such pressing concerns and questions about how all of this works, it's more appropriate to take your complaints to the developers directly.  
     
    
      Posted Feb 18, 2014 21:32 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (50 responses)
       
All the policy mechanisms are already there. Yet nobody here can tell me how to do the simplest thing possible - delegate a DBUS subtree to a user. 
     
    
      Posted Feb 18, 2014 21:41 UTC (Tue)
                               by mathstuf (subscriber, #69389)
                              [Link] (49 responses)
       
I have a feeling that your emotions here are getting in the way of seeing potential solutions and mixing up pieces of information. May I suggest putting your concerns into a wiki page of some sort so that responses to them aren't spread across umpteen LWN articles and subthreads? 
     
    
      Posted Feb 18, 2014 21:51 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] 
       
Telling systemd that user X (even if that's a user that needs to be resolved several times from namespaces) is allowed to perform actions A,B,C on a subtree does not exist presently. That's why no one can tell Cyberax what API calls are needed. 
     
      Posted Feb 18, 2014 22:09 UTC (Tue)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (1 responses)
       
Suppose that I have systemd managing the root host and I create a cgmanager-based container. Suppose that they interoperate, so somehow cgmanager should connect to the host? How do I pass credentials for it? Or does systemd simply trust the first connection? 
Then the cgmanager connects to its local DBUS running inside the container and starts serving its local clients. KDBus doesn't really change this, their namespaces would be separate. 
Then the next question, would the cgmanager-based partitions be visible in the global manager? Probably yes, since there's no delegation. However, access rights would definitely be lost because cgmanager is probably going to implement its own policies. So there's not going to be any way to check what users and/or containers are using subtrees. 
Still a mess. I guess I'll have to dive into it headfirst and try to make sense of it. 
     
    
      Posted Feb 18, 2014 22:46 UTC (Tue)
                               by fandingo (guest, #67019)
                              [Link] 
       
Here's my understanding on how it would work. Besides some terminology, the model seems common between systemd and CGManager. 
1) Bind mount the DBus socket dir into where the container will run. 
It's expected that the container software takes care of at least 3-8 and possibly the first two as well. 
Operation: 
A process inside the container wants to make a cgroup modification. 
1) It connects to DBus (or cgroupfs if desired and CGManager is running inside the container) and sends the request. 
4) The outer cgroup manager receives the message. 
===== 
Let's say that some process inside the container wants delegation of a part of the container's subtree. That authorization doesn't take place in the outer PolicyKit. It happens inside.  
     
      Posted Feb 19, 2014 15:32 UTC (Wed)
                               by HelloWorld (guest, #56129)
                              [Link] (45 responses)
       
By the way, I feel similarly with regard to cgroups as a whole. We already have a process hierarchy, why do we need another one? Of course the problem with the traditional process hierarchy was that processes could escape by double-forking, but that was fixed with prctl(PR_SET_CHILD_SUBREAPER). So why do we need cgroups at all? 
Oh well, by now it's probably too late to change any of this. 
     
    
      Posted Feb 19, 2014 16:22 UTC (Wed)
                               by fandingo (guest, #67019)
                              [Link] (17 responses)
       
> Of course the problem with the traditional process hierarchy was that processes could escape by double-forking, but that was fixed with prctl(PR_SET_CHILD_SUBREAPER). So why do we need cgroups at all? 
Doesn't that require one of the following three situations? 
* Well-behaved main process of the service that sets PR_SET_CHILD_SUBREAPER, so none of its descendants escape. If this process ever dies, is killed without cleaning up (e.g. sigkill), or fails to set itself as the subreaper, processes can escape the hierarchy. 
* The init system has to maintain a process for each service that sets itself as the subreaper. It's not responsible for anything besides executing/stopping/killing the service, and cleaning up PIDs. That certainly adds a lot of overhead and complexity. Without dedicating a process to each service, you just end up with everything having subreaper set to PID 1 (or whatever a modular service manager runs as); these service hierarchies would all point to the same parent, making it impossible to distinguish between them.  
The first option does not seem appealing because it requires a significant amount of trust in the service, there are reliability concerns, and the developers of each service need to do work to explicitly support this init model. 
The second option mainly suffers from complexity. Init has far more processes running as part of its service management. Some IPC mechanism would be needed to track these processes and allow start/stop/restart/kill/etc. commands from the user to the service manager to the hierarchy manager to work.  
Subreaper is designed to be used for proper process cleanup, not for tracking process hierarchies.  
Lastly, the ability to set resource limits is limited. It would be possible to use something like setrlimit or prlimit, but those are both process-specific, and don't allow the flexibility of group-based resource limits.  
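To illustrate the "process-specific" point: a child inherits a copy of its parent's rlimit values across fork, but each process is then checked against that value on its own, so there is no shared budget for the group. A small sketch, assuming a Unix system and using RLIMIT_NOFILE purely as an example (`child_soft_limit` is a made-up helper name):

```python
import os
import resource

def child_soft_limit(which=resource.RLIMIT_NOFILE):
    """Fork a child and return the soft limit that the child observes."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report the inherited soft limit to the parent and exit.
        os.close(read_fd)
        soft, _hard = resource.getrlimit(which)
        os.write(write_fd, str(soft).encode())
        os._exit(0)
    # Parent: read the child's answer and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return int(data)

parent_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
child_soft = child_soft_limit()
```

Parent and child see the same number, but each of them may open that many files independently, which is exactly what a group-wide limit would have to prevent.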
     
    
      Posted Feb 19, 2014 17:16 UTC (Wed)
                               by HelloWorld (guest, #56129)
                              [Link] (4 responses)
       
> Some IPC mechanism would be needed to track these processes and allow start/stop/restart/kill/etc. commands from the user to the service manager to the hierarchy manager to work. 
> Lastly, the ability to set resource limits is limited. It would be possible to use something like setrlimit or prlimit, but those are both process-specific, and don't allow the flexibility of group-based resource limits.  
     
    
      Posted Feb 19, 2014 17:43 UTC (Wed)
                               by fandingo (guest, #67019)
                              [Link] (2 responses)
       
That's true if and only if that ancestor reaper never dies or is killed. The security implications complicate things. It should be possible to overcome them, but the warts add up. 
> That's not a fundamental limitation. Just allow processes to specify that an rlimit is supposed to apply to them as well as their descendants[...] Otoh, I'm not sure if anything would actually be gained by doing that instead of cgroups. 
I totally agree, but it would require a change to those functions (or new recursive versions).  
     
    
      Posted Feb 19, 2014 17:59 UTC (Wed)
                               by HelloWorld (guest, #56129)
                              [Link] (1 responses)
       
     
    
      Posted Feb 19, 2014 18:29 UTC (Wed)
                               by fandingo (guest, #67019)
                              [Link] 
       
This becomes a major problem with privileged services. Many services maintain a parent process that runs as root. Any compromise of this privileged service process (or malfeasance by it) allows it to kill its hierarchy manager process and escape all control. 
On the other hand, cgroups in a single-writer environment should be immune to this.  
With systemd cgroup manager, PolKit would not authorize a process move outside all cgroups or to another cgroup (outside specific definitions like system.slice/sshd.service/ --> /user.slice/session.scope/).  
The major benefit to systemd's cgroup manager is that it is not attackable via this style. It cannot be intentionally killed (it ignores all signals, even sigkill since it is PID 1), and if it were somehow forced to crash, the system would panic. Since PID 1 is the cgroup manager, there is no way to gain control of the kernel interface either.  
There's no meaningful way to protect a reaper, unless you mandate that nothing in a hierarchy can run with enough privileges to kill the reaper. That would require a substantial change in many services, or require additional sandboxing mechanisms in the kernel. (The kernel would need to perform a check that a caller of kill(2) is not trying to kill its reaper.) 
     
      Posted Feb 19, 2014 18:58 UTC (Wed)
                               by smurf (subscriber, #17840)
                              [Link] 
       
What if I want to fork off a bundle of programs which need to share the same memory limit (i.e. 200MBytes for all of them in sum, not individually … like for instance all the processes in James' sessions … and what if James logs in with X *and* with ssh)? 
What if I realize, after starting my disk copy program, that it eats too much memory / disk bandwidth, and I want to retroactively park it in a more limiting cgroup? Does my shell suddenly need to know about that stuff? 
Sorry -- won't work. 
     
      Posted Feb 19, 2014 17:42 UTC (Wed)
                               by HelloWorld (guest, #56129)
                              [Link] (11 responses)
       
> I've already posted the warnings from cgroups.txt on using echo. If you're using something beyond shell script, you'll end up with a better program by interfacing with DBus if only due to better return information and error handling. 
     
    
      Posted Feb 19, 2014 18:48 UTC (Wed)
                               by smurf (subscriber, #17840)
                              [Link] (10 responses)
       
"It behaves like a plain file" doesn't work for quite a few device nodes, and most Linux subsystems are not controlled by echo: You don't emit sound by "cat rhapsody.wav >/dev/snd" these days, and you don't resize an LVM partition by "echo 10TB >/sys/devices/virtual/block/volgroup/master/varlog/size". 
This is Linux. This is not Plan 9 where you can open a TCP connection with mkdir. cgroupfs is fine for introspection, but control? That always seemed a bit unnatural to me. 
Besides, pragmatically, a sensible "cgroupctl"-style program will have a --help option and a manpage. To me that seems a lot more useful than traipsing around in cgroupfs and wondering which magic mkdir+echo+mv combo I need to evoke to limit my disk copy program's memory usage. 
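For what it's worth, with the cgroup v1 memory controller the combo in question is roughly one mkdir plus two writes. A hedged sketch, assuming the v1 file names `memory.limit_in_bytes` and `cgroup.procs`; the helper name `limit_memory` and the group name `diskcopy` are invented for illustration, and the demo writes into a temporary directory rather than a real `/sys/fs/cgroup/memory` mount:

```python
import os
import tempfile

def limit_memory(cgroup_root, name, limit_bytes, pid):
    """Create a cgroup, cap its memory, and move one process into it."""
    group = os.path.join(cgroup_root, name)
    os.makedirs(group, exist_ok=True)            # the "mkdir" step
    with open(os.path.join(group, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))                # first "echo"
    with open(os.path.join(group, "cgroup.procs"), "w") as f:
        f.write(str(pid))                        # second "echo"
    return group

# Demo against a scratch directory (a real cgroupfs creates the control
# files itself when the directory is made; here open() creates them).
with tempfile.TemporaryDirectory() as fake_root:
    group = limit_memory(fake_root, "diskcopy", 200 * 1024 * 1024, os.getpid())
    with open(os.path.join(group, "memory.limit_in_bytes")) as f:
        written_limit = f.read()
```

Nothing here tells the caller which file names exist or what units they take, which is the discoverability gap a documented tool would fill.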
 
     
    
      Posted Feb 19, 2014 20:22 UTC (Wed)
                               by HelloWorld (guest, #56129)
                              [Link] (9 responses)
       
> This is Linux. This is not Plan 9 where you can open a TCP connection with mkdir. 
> cgroupfs is fine for introspection, but control? that always seemed a bit unnatural to me. 
 
     
    
      Posted Feb 19, 2014 21:17 UTC (Wed)
                               by smurf (subscriber, #17840)
                              [Link] (8 responses)
       
I strongly suspect that the main reason for that is because you're used to it.  
> ln /proc/42 /sys/fs/cgroup/yaddah/cgroup.procs 
Linking. 
Sorry, but this is the point where I stop responding to you. 
     
    
      Posted Feb 19, 2014 21:44 UTC (Wed)
                               by HelloWorld (guest, #56129)
                              [Link] (7 responses)
       
> Sorry, but this is the point where I stop responding to you. 
     
    
      Posted Feb 19, 2014 22:06 UTC (Wed)
                               by mathstuf (subscriber, #69389)
                              [Link] (6 responses)
       
These behaviors you're asking for are quite different than the usual semantics these tools imply. Sure, filesystems and cgroups are both hierarchical, but there is such a thing as stretching a metaphor too far. To make a meta-metaphor: Should we abandon databases and just use spreadsheets instead? Vice versa? They're both "just" grids of data cells. 
     
    
      Posted Feb 20, 2014 0:13 UTC (Thu)
                               by HelloWorld (guest, #56129)
                              [Link] (5 responses)
       
     
    
      Posted Feb 20, 2014 2:08 UTC (Thu)
                               by MrWim (subscriber, #47432)
                              [Link] (4 responses)
       
     
    
      Posted Feb 20, 2014 15:20 UTC (Thu)
                               by HelloWorld (guest, #56129)
                              [Link] (3 responses)
       
     
    
      Posted Feb 20, 2014 15:23 UTC (Thu)
                               by mathstuf (subscriber, #69389)
                              [Link] (2 responses)
       
     
    
      Posted Feb 20, 2014 17:34 UTC (Thu)
                               by HelloWorld (guest, #56129)
                              [Link] (1 responses)
       
     
    
      Posted Feb 20, 2014 18:15 UTC (Thu)
                               by MrWim (subscriber, #47432)
                              [Link] 
       That's what I meant by "all pids have to appear in the cgroup tree so to put a pid in a cgroup you have to remove it from another". My assumption is that there cannot be a process which isn't a member of any cgroup. If init starts in a cgroup and its children end up in the same cgroup and there's no way of  In that setup you can't steal other users' processes and put them in your subtree, you can only move pids around in the trees you own. You can then use whichever cgroup manager you desire in your subtree. Containers work while still only co-operating with the kernel, rather than having to communicate with other user-space programs running outside. 
     
      Posted Feb 19, 2014 16:29 UTC (Wed)
                               by paulj (subscriber, #341)
                              [Link] (26 responses)
       
Further, no one has been able to explain why this is better implemented in user-space via DBus. There only seem to be assertions from Tejun on mailing lists that there are problems, but pretty much no detail on what those problems are. Worse, nothing, not even in the abstract, on why these problems would be any easier to tackle in user-space.  
If the argument is that it is difficult to get multi-writer access to filesystems right, or multi-writer setting of permissions, then the kernel surely has bigger problems. 
If the issue is ABI stability, those problems also do not magically become easier in user-space. Except perhaps that it is easier to circumvent Linus' determination to keep the kernel:user-space ABI stable (?) by simply not having one. 
     
    
      Posted Feb 19, 2014 17:31 UTC (Wed)
                               by fandingo (guest, #67019)
                              [Link] (25 responses)
       
From my perspective, the two most notable deficiencies are: 
* Security. Delegating direct access to kernel interfaces is dangerous. The kernel is not running anything resembling a full security policy and shouldn't be expected to. Commands which have the ability to drastically affect a running system need to be vetted by something, and it only makes sense to do that in user space. Traditional UNIX ownership and permissions are insufficient for building comprehensive policies. While this may not be a concern to some, it's a major weakness. There's no excuse for providing an insecure interface to the kernel.  
* Exposing too much interior detail. If the cgroups API had been more abstract from the start, it probably would have been possible to fix at least some of the other deficiencies. Unfortunately, cgroupfs exposes too much internal information, making it, in Tejun's opinion, infeasible to fix without major changes. 
> If the argument is that it is difficult to get multi-writer access to filesystems right, or multi-writer setting of permissions, then the kernel surely has bigger problems. 
Not all kernel developers work on every part of the kernel. The people who maintain and develop cgroups have decided that this is the highest priority undertaking for them. 
> If the issue is ABI stability, those problems also do not magically become easier in user-space. Except perhaps that it is easier to circumvent Linus' determination to keep the kernel:user-space ABI stable (?) by simply not having one. 
The kernel will still have an ABI for user space. I'm not sure why people keep saying otherwise. It's only usable by one process, but it's certainly still there. And for user space, it actually does become substantially easier. Just look at CGManager, which has decided to support two APIs simultaneously. 
If there is a need to modify the cgroups ABI in the future, it's so much easier. Rather than having hundreds of thousands of users (everything from systemd to shell scripts) that will be impacted, it's just a handful of cgroup managers.   
     
    
      Posted Feb 19, 2014 18:10 UTC (Wed)
                               by paulj (subscriber, #341)
                              [Link] (20 responses)
       
I simply don't think that's the case. If we can't do this securely in the kernel, the user-space mediator won't be any better (certainly not when it's coded in a similar style), security wise (with that exception). Otherwise, please explain how user-space is more secure? 
With regard to the Unix ownership/permissions model: 
1. This may suffice for many. There's little evidence, in the history of operating systems, of complex security models getting wide-spread end-user use. 
2. However, it is incorrect to say the kernel to userspace delegation API is limited to Unix owner/group permissions. It also allows for ACLs - which an fs based cgroups API (not necessarily the current version) could implement. 
3. Even if Unix perms and/or ACLs *were* still insufficient for all users, that is *NOT* a reason to not offer the FS API. If a user-space daemon wants to offer some other security model on top, the existence of the FS API does not stop that. They can live together. Why does it mean the fs interface has to be removed? 
On exposing too much detail: Then that's a problem of the current cgroupfs API. Fix it with a new one. Why is it better for that new API to live in user-space?  
One of the problems Tejun has is that the fs API *allows delegation*, and hence allows an admin to give resources to non-privileged processes that might affect other users/processes. But isn't that perhaps inherent to an over-committed resource sharing system like normal Unix/Linux? Further, if the problem is inherently to do with the delegation, how will a user-space API that allows delegation fix things? The answer, if delegation really is a problem, must be that that delegation has to be removed. There is no reason this can't be done in a kernel API, surely? Or would you argue it is easier to remove things in userspace APIs? (That argument would scare me). 
Why not just try and get the kernel API right? Why will it be any easier to get things right if the thousands of shell scripts are calling dbus-send instead of writing to a virtual fs? How will having this in user-space make it any easier to deal with all the thousands of users, just cause they talk to a manager instead of the kernel? 
It very much sounds like the kernel cgroups people simply don't want to have to work out the details of what is needed, and so want to punt it to user-space. It sounds almost a social problem, more than a technical one.  
Lastly, on the "not all kernel developers are familiar with implementing a virtual fs" issue - they can ask for help, surely. :) Eventually viro, or someone similar, will get annoyed enough to further extend VFS and library support to further ease implementation of virtual fses, if needed. As (IIRC) happened yonks ago when he got fed up with procfs and others. ;) 
     
    
      Posted Feb 19, 2014 18:55 UTC (Wed)
                               by fandingo (guest, #67019)
                              [Link] (9 responses)
       
* Only allow PID X to manage cgroup subtree /A/B/C. 
Traditional file permissions and even ACLs are incapable of dealing with either situation. In the first case, there's no way to grant that access without allowing all processes owned by that same user/group to control /A/B/C. In the second situation, there's no way to allow a one-way move. 
Yeah, we can start adding more files to the cgroups to attempt finer control, but now things get messy, and it's likely the whole permission issue goes out the window. That's just two random policies that a system may want to enforce. To do #1 you would likely have to give one of U, G, or O +w to the subtree for that user, but that's a broken definition because hidden kernel policy will revoke writes by other processes, even though they have the proper U or G or fall into O. In the end, a cgroup would sprout many files to cover all sorts of policy combinations, and the kernel developers will be left with a monstrosity that can't be fixed because next time the same cries about API will arise. 
A filesystem hierarchy is not suited to this complexity. It will have to be contorted in all kinds of ways where the permissions scheme (even with ACLs) doesn't match traditional behavior. 
It's far preferable to take the LSM approach. The kernel provides the primitives throughout the kernel subsystems and talks to another module to establish and enforce policy. It's true that LSM modules are kernel modules, but they remain separate from the LSM code. (I wouldn't have a problem with a cgroup manager living as a kernel module, but I don't see any inherent benefit either.) 
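As a rough illustration of the kind of per-PID rule a user-space manager could check, and which owner/group bits cannot express: the sketch below is entirely hypothetical (`Rule` and `is_allowed` are invented names, not any real manager's API), and a real implementation would also have to deal with PID reuse and with how the caller is identified.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

@dataclass(frozen=True)
class Rule:
    pid: int        # the single process allowed to manage the subtree
    subtree: str    # e.g. "/A/B/C"

def is_allowed(rules, pid, target):
    """Return True if pid may manage target under one of the rules."""
    target_path = PurePosixPath(target)
    for rule in rules:
        root = PurePosixPath(rule.subtree)
        inside = target_path == root or root in target_path.parents
        if rule.pid == pid and inside:
            return True
    return False

rules = [Rule(pid=1234, subtree="/A/B/C")]
```

Here `is_allowed(rules, 1234, "/A/B/C/D")` is accepted, while another process owned by the same user, or the same PID asking for `/A/B`, is refused.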
     
    
      Posted Feb 19, 2014 20:36 UTC (Wed)
                               by HelloWorld (guest, #56129)
                              [Link] (3 responses)
       
     
    
      Posted Feb 19, 2014 22:28 UTC (Wed)
                               by fandingo (guest, #67019)
                              [Link] (1 responses)
       
I don't think that there's any way that traditional permissions, even with ACLs, could be massaged into giving the necessary flexibility and clean interfaces.  
     
    
      Posted Feb 19, 2014 22:46 UTC (Wed)
                               by dlang (guest, #313)
                              [Link] 
       
Plus there is the entire extended ACL structure that's available (but very seldom used because it's not needed) 
On Linux, permissions for filesystem objects have not been limited to the unix rwx bits for a long time. 
     
      Posted Feb 19, 2014 22:49 UTC (Wed)
                               by vonbrand (subscriber, #4458)
                              [Link] 
       So the solution to the problem that the API isn't well known/standard is to create another totally new, in practice untested, "general hierarchical security model" to be applied across the board to anything with a hierarchical structure. That sounds much, much harder to do right to me. 
     
      Posted Feb 19, 2014 23:00 UTC (Wed)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (4 responses)
       
However, such situations are not really normal. In particular, changing levels in cgroups hierarchy is not a trivial operation - new subtree might have limits that the subtree which is being moved already exceeds. 
     
    
      Posted Feb 19, 2014 23:32 UTC (Wed)
                               by fandingo (guest, #67019)
                              [Link] (3 responses)
       
That's really a question of policy, though. If the policy says that some processes need to move to /D/E/F, then they need to go there, regardless of what the resource controllers say. (I'd argue that the process should be moved first, and then the resource controller terminates processes to get back into proper configuration. I don't think that it is acceptable to leave a process in the wrong subtree.) 
It's worth noting that systemd-logind, which is not PID 1, does the second action today on user login. 
> Then they should talk to some kind of privileged program that can do this. Traditional UNIX used suid programs for that, and it totally makes sense to use something like cgmanager/systemd for this. 
If there are acknowledged shortcomings of cgroupfs, shouldn't the API be changed to support all reasonable actions? Why should the kernel keep interfaces that clearly have shortcomings that cannot be resolved without massive API incompatibilities? 
     
    
      Posted Feb 19, 2014 23:40 UTC (Wed)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
> If there are acknowledged shortcomings of cgroupfs, shouldn't the API be changed to support all reasonable actions? Why should the kernel keep interfaces that clearly have shortcomings that cannot be resolved without massive API incompatibilities? 
     
      Posted Feb 19, 2014 23:43 UTC (Wed)
                               by dlang (guest, #313)
                              [Link] (1 responses)
       
backwards compatibility would be a reason for keeping poor interfaces, but if you are going to break them, then you need to do so once, not multiple times. 
And the new systemd API is just that, a systemd API, by definition it doesn't deal with use cases that don't use systemd. 
and the currently proposed single-writer API is known to not support all use cases, so why should it replace the existing API? 
     
    
      Posted Feb 19, 2014 23:59 UTC (Wed)
                               by fandingo (guest, #67019)
                              [Link] 
       
You are talking about the Google multi-hierarchy complaint, right? The cgroups maintainers are on record as saying that this is not reasonable, and they intend to eliminate it.  
> backwards compatibility would be a reason for keeping poor interfaces 
The cgroups developers believe them to be broken, not poor. Plus, they get in the way of fixing cgroups since cgroupfs leaks too many implementation details. 
 
     
      Posted Feb 19, 2014 19:04 UTC (Wed)
                               by jspaleta (subscriber, #50639)
                              [Link] (2 responses)
       
Right now... it's a mess. He even comments on the fact that the gentlemen's agreement in the form of PaxControlGroups shows exactly how problematic trying to support multiple writers actually is right now in the multi-hierarchy cgroups because they aren't all multi-hierarchy aware. You can't actually use all the controllers in a multiple writer fashion without them stepping on each other. Caveats abound. He even correctly notes that it works quite well right now for hand-crafted setups... where one human has crafted all the cgroup interactions via scripting and it's effectively "single writer" in a sense. But once you tack on automation or applications which want to make use of cgroups side-by-side with other applications or other automation or even hand-crafted scripts... you run into problems. PaxControlGroups details the ins and outs of those problems. 
So as part of making room to clean up all the problems the single userspace writer will be mandated in the middle term of the kernel side work to make a flat hierarchy api.  I'm not even sure the plan is to require that single writer model forever. I believe the plan right now is to stop pretending that cgroups and all the controllers work with the multiple writer model while that flat hierarchy is being developed and controllers are all reworked to correctly support that new model.   
I also think you need to keep in mind that Tejun also mentions a pie-in-the-sky goal of merging cgroups into the process hierarchy. 
 
Look, I think the real issue here is that until the new work (both kernel and userspace) has progressed further along, there are going to be existing use cases that are only served by the deprecated API. The kernel developers have always recognized this, from the start of the discussion. 
The real question is, will the linux distribution vendors continue to provide userspace solutions which support the old API as an option? I have no idea on that. I know that libvirt for example has a transition plan in place to support the older cgroupfs if the systemd D-Bus API is not available on the host. But I have no idea if any distribution vendors are going to expose a configuration option to pick which cgroups API to use when mounting cgroups. So Cyberax should probably start making the case to his distribution vendors about supporting his use case by making it possible to choose to run the deprecated cgroupfs API in the future, so long as the API is available and not pulled from the kernel the vendor ships. 
 
 
 
 
     
    
      Posted Feb 19, 2014 22:56 UTC (Wed)
                               by dlang (guest, #313)
                              [Link] 
       
But using that as justification for a single-writer model doesn't compute. What does one have to do with the other? 
> I'm not even sure the plan is to require that single writer model forever. 
I agree with you that the kernel developers have said this, I just can't find a quote easily to back this up. 
But if this is the case, having systemd take complete control and then defining a complex DBUS interface to be used for delegation strikes me as a very bad thing to do 'temporarily' 
> The real question is, will the linux distribution vendors continue to provide userspace solution which support the old API as an option?  
is systemd going to even allow this? or is systemd going to say that it's broken if the new interface isn't there? 
     
      Posted Feb 19, 2014 23:12 UTC (Wed)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
And that's OK - there _are_ valid technical reasons why the current multiple hierarchies model is broken. These reasons are clearly spelled out in scores of mailing list messages.  
For example, there are problems with memory accounting and blkio and that's why blkio can't account for buffered writes right now. 
The switch to the single-writer model, however, has no such rationale. There are literally NO arguments at all for it that I can find. One would think that unfixable security problems deserve at least a mailing list message, but there's literally _nothing_ there. 
     
      Posted Feb 19, 2014 19:13 UTC (Wed)
                               by smurf (subscriber, #17840)
                              [Link] (2 responses)
       
Surprise: You are exactly right. The kernel people do not WANT to set policy there because the requirements are unknown / too diverse / we don't have much experience what we actually need in complex real-world scenarios / take your pick. 
We do know that a user/group scheme will not work: you can nest namespaces, and the process which sets up the outer namespace's access rights does not know which user IDs will eventually end up being mapped to the processes inside these (sub)containers. This is (probably) why cgroupfs does not have ACLs: they'd be insufficient anyway. 
It's not the kernel's job to set policy. It's the kernel's job to facilitate a stable ABI, and leave policy to user space where it belongs. 
Besides, access rights are insufficient for another reason. Suppose you want to limit users' memory usage to 100MB each; if they want to have 1GB of main memory they can -- but only two at the same time, and only for 10 minutes. 
This is not a particularly exotic requirement for a multiuser system. If somebody adds code for that kind of thing to their favorite cgroupsmanager program, no problem whatsoever. In the kernel? don't even think of doing that. 
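To make that concrete, here is a minimal sketch of such a policy as it might live in a userspace manager. Everything in it is an assumption for illustration: the names `BoostPolicy`, `request_boost` and `limit_for` are made up, and the sketch only decides which limit a user should get; actually writing that limit into a cgroup is left out.

```python
import time

MB = 1024 * 1024


class BoostPolicy:
    """Hypothetical policy: 100MB per user by default; at most two users
    may hold a 1GB boost at the same time, each for at most 10 minutes."""

    def __init__(self, base=100 * MB, boosted=1024 * MB,
                 max_boosts=2, boost_seconds=600, clock=time.monotonic):
        self.base = base
        self.boosted = boosted
        self.max_boosts = max_boosts
        self.boost_seconds = boost_seconds
        self.clock = clock          # injectable so the policy can be tested
        self._expiry = {}           # user -> time at which the boost ends

    def _expire(self):
        # Forget boosts whose time window has passed.
        now = self.clock()
        self._expiry = {u: t for u, t in self._expiry.items() if t > now}

    def request_boost(self, user):
        """Return True if the user holds a boost after this call."""
        self._expire()
        if user in self._expiry:
            return True
        if len(self._expiry) >= self.max_boosts:
            return False
        self._expiry[user] = self.clock() + self.boost_seconds
        return True

    def limit_for(self, user):
        """Memory limit in bytes that a manager would apply for this user."""
        self._expire()
        return self.boosted if user in self._expiry else self.base
```

A manager would call `limit_for(user)` whenever it is about to set a user's memory limit. Nothing here needs kernel support beyond a plain per-cgroup limit, which is the point being made above.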
 
     
    
      Posted Feb 19, 2014 23:06 UTC (Wed)
                               by dlang (guest, #313)
                              [Link] 
       
> This is not a particularly exotic requirement for a multiuser system. If somebody adds code for that kind of thing to their favorite cgroupsmanager program, no problem whatsoever. In the kernel? don't even think of doing that. 
what multiuser system supports these sorts of limits today? If you are claiming that they are not unusual, that must mean that something common supports them. 
     
      Posted Feb 19, 2014 23:22 UTC (Wed)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
First, you create this hierarchy: root/user1/delegate, root/user2/delegate. Then you set memory limits on them and start a daemon that does the balancing act. This daemon should have permissions to change 'user1' and 'user2' hierarchies. 
But nobody stops you from making 'delegate' directories writable for the users! They won't be able to affect the settings in the parent levels of cgroups, and they'll be limited by them. 
     
      Posted Feb 19, 2014 22:55 UTC (Wed)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] (3 responses)
       
I'm pretty familiar with the current cgroups interface and it seems that an untrusted process _at_ _most_ can cause high load on the kernel and perhaps significantly slow down other processes.  
There's also a small problem with several controllers which use weights to distribute resources, so it's possible for a cgroup to affect its siblings. But again, that's trivially worked around by using an intermediary tree level if one wants to delegate a subtree to untrusted processes. 
     
    
      Posted Feb 19, 2014 23:15 UTC (Wed)
                               by fandingo (guest, #67019)
                              [Link] (2 responses)
       
Only if that process is unprivileged. If you have a service that runs a privileged process (like the parent PID of Apache or OpenSSH), it can modify any part of the cgroup hierarchy.  
A single-writer model (especially if the writer is PID 1) with policy enforcement precludes this behavior. Even a privileged user would not be able to gain authorization to perform cgroup changes outside what the policy allows (like managing its subtree). Furthermore, a privileged user couldn't even connect to the kernel cgroup API directly, because a writer is already registered, and if it's PID 1, cannot be crashed in order to register a malicious writer.  
     
    
      Posted Feb 19, 2014 23:20 UTC (Wed)
                               by dlang (guest, #313)
                              [Link] 
       
or you can play games with the PID namespace so that those processes are only root within their limited context, not for the whole system. 
But if you are concerned about a malicious root process, the fact that it can change cgroups settings seems like a pretty minor thing to worry about. 
     
      Posted Feb 19, 2014 23:25 UTC (Wed)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
     
      Posted Feb 19, 2014 22:48 UTC (Wed)
                               by dlang (guest, #313)
                              [Link] (3 responses)
       
This sounds like an even bigger problem to me: if the group of coordinating processes don't coordinate well, nothing in the kernel can know that they aren't, and big problems can result. 
     
    
      Posted Feb 19, 2014 22:53 UTC (Wed)
                               by jspaleta (subscriber, #50639)
                              [Link] (2 responses)
       
 
 
     
    
      Posted Feb 19, 2014 23:09 UTC (Wed)
                               by dlang (guest, #313)
                              [Link] (1 responses)
       
     
    
      Posted Feb 19, 2014 23:14 UTC (Wed)
                               by Cyberax (✭ supporter ✭, #52523)
                              [Link] 
       
That doesn't change the fact that only one process can do direct cgroups manipulations. 
     
      
 -> User x with uid 1000 runs an unprivileged container running Debian Testing (using sysvinit)
    -> Root in this container (uid 100000 on the host) runs a Plamo Linux system container (some custom init)
      -> User nobody with uid 65534 (uid 165534 on the host) runs an unprivileged Ubuntu 12.04 container (upstart)
 - uid 0 is actually uid 200000
 - pid 50 is actually pid 123123
 - cgroup "a" is actually cgroup "lxc/c1/c2/c3/a"
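The translation described here can be sketched in a few lines of Python. This is only an illustration: the three uid maps below are assumptions picked so that the numbers match the example above (the actual maps are not given in this thread), and the function names are made up. The tuple layout (inside start, outside start, count) follows the format of /proc/PID/uid_map.

```python
def map_uid(uid, uid_map):
    """Translate a uid one level outwards through a list of
    (inside_start, outside_start, count) entries."""
    for inside, outside, count in uid_map:
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    raise ValueError(f"uid {uid} is not mapped")


def to_host_uid(uid, maps_inner_to_outer):
    """Apply each container's map in turn, innermost first."""
    for uid_map in maps_inner_to_outer:
        uid = map_uid(uid, uid_map)
    return uid


def to_host_cgroup(path, container_prefix):
    """Prefix a cgroup path seen in the container with its host location."""
    return container_prefix.rstrip("/") + "/" + path.lstrip("/")


# Assumed maps, chosen only to reproduce the numbers in the example.
DEBIAN = [(0, 100000, 165536)]   # Debian container -> host
PLAMO = [(0, 0, 165536)]         # Plamo container -> Debian container
UBUNTU = [(0, 100000, 65536)]    # Ubuntu container -> Plamo container

print(to_host_uid(0, [DEBIAN]))
print(to_host_uid(65534, [PLAMO, DEBIAN]))
print(to_host_uid(0, [UBUNTU, PLAMO, DEBIAN]))
print(to_host_cgroup("a", "lxc/c1/c2/c3"))
```

Whatever sits in the middle, a manager daemon or the kernel, has to do this kind of translation in both directions for uids, pids and cgroup paths.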
      
https://github.com/cgmanager/cgmanager
At the end of the day, all of those configure the exact same thing. It's not ideal when accesses aren't centralized but we've lived with that for years without any major problem.
      
1) Auditing.
2) Security.
3) Transparency.
4) Delegation.
      
How? Can you point me to a command-line utility that can show who has access to a given group? Do I have to parse XML?
So let's add SELinux policies and ACLs to cgroupfs. It's going to be useful in other situations, like /sys delegation. For me, UGO permissions are plenty enough.
How do I check which cgroups are writable by me, for example? I have tons of tools for that for the classic filesystem interfaces.
Sure, and it's convenient. I can write to cgroups from a pure Java program - can I do the same with DBUS?
Only for other systemd containers. There are no plans to support cgmanager or my own incompatible manager that I'm just going to write out of spite.
Doesn't matter. /sys and /proc virtualization and delegation are here to stay, forever. And also, [citation needed]
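For the "which cgroups are writable by me" question, the filesystem interface makes the check short in any language. A minimal Python sketch, assuming cgroupfs is mounted at /sys/fs/cgroup and that plain permission bits are all that matters (os.access() checks against the real uid and knows nothing about policy kept in a separate manager daemon; for root it will report everything as writable). The function name is made up for this example.

```python
import os


def writable_cgroups(root="/sys/fs/cgroup"):
    """Yield every directory under root that the calling user may write to."""
    for dirpath, _dirnames, _filenames in os.walk(root):
        if os.access(dirpath, os.W_OK):
            yield dirpath


if __name__ == "__main__":
    for path in writable_cgroups():
        print(path)
```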
      
And even from this draft, it's unclear to me if cgmanager's D-Bus API can be considered stable at present. Since there doesn't appear to be any versioning on the API internally, I'd have to assume it's prudent to still consider it unstable and subject to change.  As of right now cgmanager's API should be considered an lxc private API, and not suitable to be relied on by external projects, until such time that the API is versioned and marked as stable by its developers. 
      
And by the time you write this interface, it's going to be so indistinguishable from a filesystem interface that people are going to start asking WTF it was all for.
Does it have AppArmor support? It works fine for delegated cgroups. How about using fanotify to screen for malicious attacks (yes, I can haz an antivirus on Linux)? 
      
This is relevant. Right now the kernel interface is manager-agnostic - it can be used by anything.
      
Only because of idiotic kernel developers.
      
> Only because of idiotic kernel developers.
      
Incorrect. Right now cgroups can be manipulated by any number of processes.
      
Presumably, systemd uses the magical powerz of DBUS for access controls. So all the tools should be already there.
Wrong. I can't expose cgroups filesystem interface, for example. Or re-delegate to a manager that can only act as the root manager.
      
And there actually is a good reason to delegate responsibility in cgmanager's case: delegating part of the responsibility to higher privileges makes sense.
      
How do I delegate '/sys/fs/cgroup/some/cgroup/container/path' or whatever its counterpart in DBUS is going to be to user 'root' inside a namespaced container?
      
[2]http://blog.christophersmart.com/2014/01/06/policykit-jav...
      
A simple question: "How?"
      
So I conclude that NOBODY knows how to do it. Does it not ring any alarm bells? 
Yes, they are. They are total idiots in this regard.
Why does it make perfect sense? What are the reasons? Can you point out a design document with them?
LKML is a dump. Asking there is an almost certain guarantee for a message to be lost.
      
Incorrect. Single-writer mode is already there and KDBus is going to be an optional dependency for a long time even after that.
      
2) Create cgroup and apply controllers as needed.
3) User namespace is created.
4) Root has an outer UID mapped. If the cgroup manager runs as a different user, it is also mapped. 
5) PolicyKit is updated to allow access to the mapped user on the specific cgroup subtree.
6) The container OS boots.
7) The cgroup manager inside the container connects to the DBus socket. (This does not serve as the system bus inside the container. That is separate.)
8) The cgroup manager inside the container attaches to its system bus. 
2) PolicyKit (or file system permissions if using cgroupfs via CGManager) inside the container authorizes the action.
3) The cgroup manager inside the container accepts the command, and sends the command over the DBus socket to the outer cgroup manager.
5) The outer cgroup manager translates the inner cgroup path to its relative position outside and consults PolicyKit for authorization. If authorized, the cgroup action is completed. A return message (with properly sanitized path) is sent across the socket to the inner cgroup manager, which forwards the message to the process that initiated the call.
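The path handling in that last step might look roughly like the sketch below. It is only an illustration under stated assumptions: the function names are invented, the PolicyKit check is left out entirely, and the sketch covers just the two path operations, translating an inner path to its outer position (refusing anything that escapes the delegated subtree) and stripping the outer prefix again before replying.

```python
import posixpath


def _norm_root(container_root):
    """Normalise the container's subtree to an absolute path."""
    return posixpath.normpath("/" + container_root.strip("/"))


def translate_inner_path(inner_path, container_root):
    """Map a cgroup path as seen inside the container to the outer path."""
    root = _norm_root(container_root)
    outer = posixpath.normpath(posixpath.join(root, inner_path.lstrip("/")))
    prefix = root if root.endswith("/") else root + "/"
    if outer != root and not outer.startswith(prefix):
        raise PermissionError(f"{inner_path!r} escapes the delegated subtree")
    return outer


def sanitize_outer_path(outer_path, container_root):
    """Strip the outer prefix so the reply only shows the inner view."""
    rel = posixpath.relpath(outer_path, _norm_root(container_root))
    return "/" if rel == "." else "/" + rel


print(translate_inner_path("/a", "lxc/c1"))
print(sanitize_outer_path("/lxc/c1/a", "lxc/c1"))
```

The authorization decision itself would sit between the two calls; which mechanism makes it is exactly what this thread is arguing about.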
      
I don't think so. Processes are cheap on Linux, and I don't think that reaping child processes is likely to ever be a bottleneck for any realistic program.
Where “Some IPC mechanism” would obviously be D-Bus, which makes this sort of thing very easy.
That's not a fundamental limitation. Just allow processes to specify that an rlimit is supposed to apply to them as well as their descendants, and PR_SET_CHILD_SUBREAPER ensures that your descendants can't reparent themselves to init. So I think this whole thing could be made to work fine if somebody bothered to do the work. Otoh, I'm not sure if anything would actually be gained by doing that instead of cgroups.
      
So what? systemd is already required to never die.
      
It's not lousy, there's nothing wrong with that! Those are tools that every admin knows and uses, and that's a Good Thing. The reason devices are exposed as “files” in /dev is precisely that one can do things like access control just as if they were proper files. Do you want to replace that too? It's certainly possible to give udev a D-Bus interface and use fd passing to open device files!
“We can't use a file system based interface because bash's echo builtin is broken” is about as lame an excuse as it gets. The answer to that is to fix bash or to use printf.
      
> one can do things like access control just as if they were proper files.
      
So what? Just because you can't read(2) or write(2) to some device nodes doesn't mean you need to use another interface for things like poll(2) or chmod(2). Stop thinking about the “file system” and start thinking about a general hierarchical namespace for all kinds of objects. This is where we are today with files, sockets, fifos, device files etc. It's only natural to extend that further.
Uh, I know this is Linux and not Plan 9. How is that supposed to be an argument? We should learn from Plan 9 instead of taking that kind of “us vs. them” stance.
And to me it seems unnatural that access control for cgroups is supposed to be done through a completely different mechanism than access control to files or devices. Though I agree with you that the current cgroups API isn't ideal. For one thing, I think the natural thing is to use 
ln /proc/42 /sys/fs/cgroup/yaddah/cgroup.procs
and not
echo 42 > /sys/fs/cgroup/yaddah/cgroup.procs
to add processes to a cgroup. 
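For comparison, the existing echo form is just as short from a program. A minimal Python sketch of the current write-a-PID interface (the function name is made up; the ln form above is a proposal and has no equivalent to show):

```python
import os


def add_to_cgroup(cgroup_dir, pid):
    """Move pid into a cgroup by writing it to that cgroup's cgroup.procs."""
    # Any failure reported by the kernel surfaces as an OSError when the
    # buffered write is flushed on leaving the with block.
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(f"{pid}\n")
```

Called as `add_to_cgroup("/sys/fs/cgroup/yaddah", 42)`, this does the same thing as the echo line.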
      
> is supposed to be done through a completely different mechanism
> than access control to files or devices
Across file systems.
Yeah, right.
      
> Across file systems.
> Yeah, right.
So what? It's not allowed for conventional file systems because it doesn't make sense there. It does make sense for this case, so there's no reason for it not to be allowed.
You're acting as if I had somehow offended you. I haven't.
      
rename() might be though.  AFAIU all pids have to appear in the cgroup tree so to put a pid in a cgroup you have to remove it from another.  You would need permissions for both cgroups and it happens atomically.
      
      
unlink()ing pids from the cgroup tree then you're guaranteed that every process is in the tree.
      
      
* Processes in subtree /A/B/C/ may move their processes in /D/E/F but should not have any further control.
      
> * Processes in subtree /A/B/C/ may move their processes in /D/E/F but should not have any further control.
>
> Traditional file permissions and even ACLs are incapable of dealing with either situation. 
But similar restrictions might also make sense for other kinds of hierarchically organised objects. So why not generalise the existing access control mechanisms to allow for things like that instead of inventing something new? 
      
>* Processes in subtree /A/B/C/ may move their processes in /D/E/F but should not have any further control.
Then they should talk to some kind of privileged program that can do this. Traditional UNIX used suid programs for that, and it totally makes sense to use something like cgmanager/systemd for this.
      
Your policy might say that a process can use 100G of RAM, but that's not going to help you if you only have 500MB. Right now if you try to do this trick the kernel simply kills the over-limit processes.
Like filesystem interface? Perhaps we should switch to DBUS instead of using open()?
      
I think the discussion here is only filling in the details of what Tejun knew were local admin policy scripting centric use cases that would be impacted in the short to mid term. Tejun clearly states the choices on how to proceed involved a trade-off that would impact some existing use cases.
      
> want to have to work out the details of what is needed, and so
> want to punt it to user-space. It sounds almost a social problem,
> more than a technical one. 
      
Sure. And the current delegation-based interface works perfectly fine in this scenario.
      
It might as well simply do 'chmod -R a+r+w+x /' to the same effect.
      
 
           