
When the kernel ABI has to change

By Jonathan Corbet
July 2, 2013
Maintaining user-space ABI compatibility is one of the key guiding principles of Linux kernel development; changes that break user space are likely to be reverted quickly, often after an incendiary message from Linus. But what is to be done in cases where an ABI is deemed to be unworkable and unmaintainable? Control group maintainer Tejun Heo is trying to solve that problem, but, in the process, he is running into opposition from one of Linux's highest-profile users.

Control groups ("cgroups") allow an administrator to divide the processes in a system into a hierarchy of groups; this hierarchy need not match the process tree. The grouping function alone is useful; systemd uses it to keep track of all of the processes involved with a given service, for example. But the real purpose of control groups is to allow resource control policies to be applied to the processes within each group; to that end, the kernel contains a range of "controllers" that enforce policies on CPU time, block I/O bandwidth, memory usage, and more. Control groups are managed with a virtual filesystem exported by the kernel; see Documentation/cgroups/cgroups.txt for a thorough (if slightly dated) description of how this subsystem works.
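For illustration, the filesystem interface can be sketched roughly as follows. This is a runnable toy: a temporary directory stands in for a controller mount such as /sys/fs/cgroup/memory, and the helper functions create the control files themselves, whereas on a real system the kernel creates files like tasks and memory.limit_in_bytes automatically when the cgroup directory is made.

```python
import os
import tempfile

def create_cgroup(root, name):
    """Making a directory under a controller mount creates a cgroup."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    return path

def set_memory_limit(cgroup_path, limit_bytes):
    # Writing to memory.limit_in_bytes caps the group's RAM usage.
    with open(os.path.join(cgroup_path, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))

def add_task(cgroup_path, pid):
    # Writing a PID to the 'tasks' file moves that process into the group.
    with open(os.path.join(cgroup_path, "tasks"), "a") as f:
        f.write(f"{pid}\n")

root = tempfile.mkdtemp()            # stands in for /sys/fs/cgroup/memory
grp = create_cgroup(root, "batch")
set_memory_limit(grp, 512 * 1024 * 1024)
add_task(grp, os.getpid())
print(sorted(os.listdir(grp)))       # ['memory.limit_in_bytes', 'tasks']
```

The point of the sketch is how low-level the interface is: policy is expressed by writing strings into magic files, which is exactly the implementation exposure discussed below.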

The trouble with control groups

There is no doubt that the functionality provided by control groups is both extensive and flexible. Indeed, part of the problem is that it is too flexible. Consider, for example, the support for multiple hierarchies in the control group subsystem. Cgroups allow the creation of a hierarchy of processes to be used in dividing up a limited resource, such as available CPU time. But they allow the creation of an entirely different hierarchy for the control of a different resource. Thus, for example, CPU time could be placed under a policy that favors certain users over others, while memory use could, instead, be regulated depending on what program a process is running. Processes can be grouped in entirely different ways in each hierarchy.
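The multiple-hierarchy idea can be made concrete with a toy model (process IDs, users, and program names are invented): each controller's hierarchy is just an independent mapping from processes to groups.

```python
# Two independent cgroup hierarchies grouping the same processes
# differently: the CPU hierarchy splits by user, the memory
# hierarchy by program.

processes = {
    101: {"user": "alice", "program": "postgres"},
    102: {"user": "alice", "program": "firefox"},
    103: {"user": "bob",   "program": "postgres"},
}

# Each controller may use a completely different grouping.
cpu_hierarchy    = {pid: p["user"]    for pid, p in processes.items()}
memory_hierarchy = {pid: p["program"] for pid, p in processes.items()}

print(cpu_hierarchy[101], memory_hierarchy[101])   # alice postgres
print(cpu_hierarchy[103], memory_hierarchy[103])   # bob postgres
# PIDs 101 and 103 share a memory cgroup but sit in different CPU
# cgroups -- the flexibility (and the tracking headache) at issue.
```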

The problem here is that, while the design allowing each controller to have its own hierarchy seems nice and orthogonal, the implementation cannot be that way. The controllers for memory usage, I/O bandwidth, and writeback throttling all look independent on the surface, but those problems are all intertwined in the memory management system in the kernel. All three of those controllers will need to associate pages of memory with specific control groups; if a given process is in one cgroup from the memory controller's point of view, but a different cgroup for the I/O bandwidth controller, that tracking quickly becomes difficult or impossible. It is easy to set up policies that conflict or that simply cannot be properly implemented within the kernel.

Another perceived problem is that the virtual filesystem interface is too low-level, exposing too many details of how control groups are implemented in the kernel. As the number of users of control groups grows, it will become increasingly hard to make changes without breaking existing applications. It's not clear what the correct cgroup interface should be, but those who spend enough time looking at the current implementation tend to come away convinced that changes are needed.

This problem is aggravated by an increasing tendency to use file permissions to hand subtrees of a cgroup hierarchy over to unprivileged processes. There are legitimate reasons to want to delegate authority in that way; complex applications may want to use cgroups to implement their own internal policies, for example. There are also use cases associated with virtualization and containers. But that delegation greatly increases the number of programs with an intimate understanding of how cgroups work, complicating any future changes. There are also any number of security issues that come with unprivileged access to a cgroup hierarchy; it is trivially easy to run denial-of-service attacks against a system if one has write access to a cgroup hierarchy. In short, the interface was just never meant to be used in this way.

For these reasons and more, there is a strong desire to rework the cgroup interface into something that is more maintainable, more secure, and easier to use. Getting there, though, is likely to be a long and painful process, as can be seen by the early discussions around the subject.

The solution and its discontents

The plan for control groups can be described in relatively few words; the resulting discussion, instead, is rather more verbose. Multiple hierarchies are seen to be misconceived and unmaintainable on their face; the plan is to phase out that functionality so that, in the end, all controllers are attached to a single, unified hierarchy of processes. Unprivileged access to the cgroup hierarchy will be strongly discouraged; the hope is to have a single, privileged process handling all of the cgroup management tasks. That process will, in turn, provide some sort of higher-level interface to the rest of the system.

Tim Hockin is charged with making Google's massive cluster of machines work properly for a wide variety of internal users. Google uses cgroups extensively for internal resource management; more to the point, the company also makes extensive use of multiple hierarchies. So, needless to say, Tim is not at all pleased with the prospect of that functionality going away. As he put it:

So yeah, I'm in a bit of a panic. You're making a huge amount of work for us. You're breaking binary compatibility of the (probably) largest single installation of Linux in the world. And you're being kind of flip about the reality of it...

Part of the reason for Tim's panic is that he was under the impression that the existing functionality would be removed within a year or two. That is decidedly not the case; the kernel's ABI rules have not been suspended for control groups. The plan is to add a new control interface, and any new features will probably only work with that new interface, but the existing interface, including multiple hierarchies, will continue to be supported until it's clear that it is no longer being used.

Tim described, in general terms, how Google uses multiple hierarchies. Essentially, every job in the system has two attributes: whether it's a production or "batch" job, and whether it gets I/O bandwidth guarantees. The result is a 2x2 matrix describing resource allocation policies (though one of the entries, batch processes with I/O guarantees, makes little sense and is not used). Using two independent cgroup hierarchies makes this set of policies relatively easy to express; Tim asserts that a unified hierarchy would not be usable in the same way.
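The scheme as described can be modeled with two orthogonal groupings (the job names here are invented for illustration):

```python
# Each attribute is expressed as membership in its own hierarchy;
# the 2x2 policy matrix falls out of the intersection.

jobs = {
    "websearch": {"tier": "production", "io_guarantee": True},
    "indexer":   {"tier": "batch",      "io_guarantee": False},
    "frontend":  {"tier": "production", "io_guarantee": False},
}

tier_groups = {}   # hierarchy 1 (e.g. cpu): production vs. batch
io_groups = {}     # hierarchy 2 (e.g. blkio): guaranteed vs. best-effort
for name, job in jobs.items():
    tier_groups.setdefault(job["tier"], set()).add(name)
    io_groups.setdefault(job["io_guarantee"], set()).add(name)

# One cell of the matrix: production jobs without I/O guarantees.
print(tier_groups["production"] & io_groups[False])  # {'frontend'}
# The batch + guaranteed-I/O cell is the one that goes unused.
```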

Tejun was unimpressed, responding that this case could be managed by setting up three cgroups at the same level of the hierarchy, each of which would implement one of the three useful policy combinations. The problem with this solution, according to Tim, is that the processes without I/O bandwidth guarantees would be split into two groups, whereas in the current solution they are in one group. If one of those two groups has far more members than the other, the members of that larger group will get far less of the available bandwidth than the members of the small group. Tejun still thinks that the problem should be solvable, perhaps with the use of a user-space management daemon that would adjust the relative bandwidth allocations depending on the workload. Tim has answered that the situation is actually a lot more complicated, but he has not yet shared the details of how, so it is hard to understand what the real difficulties with a single hierarchy are.
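The dilution problem Tim describes can be shown with simple arithmetic, assuming equal-weight sibling cgroups whose bandwidth slice is divided evenly among their members (group sizes and the bandwidth figure are invented):

```python
total_bw = 900.0  # MB/s, an invented figure

def per_process_share(group_sizes, total):
    # Equal-weight sibling groups each get an equal slice of the
    # total, divided evenly among that group's members.
    slice_ = total / len(group_sizes)
    return [slice_ / n for n in group_sizes]

# Unified hierarchy, three sibling groups: production with guarantees
# (10 procs), production without (2 procs), batch without (100 procs).
shares = per_process_share([10, 2, 100], total_bw)
print(shares)  # [30.0, 150.0, 3.0]
# A process in the 2-member group gets 50x the bandwidth of one in
# the 100-member group, though neither has an I/O guarantee; with
# separate hierarchies, all 102 would sit in one non-guaranteed group.
```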

A single management process?

Tim also dislikes the plan to have a single process managing the control group hierarchy. That process could be made to provide the functionality that Google (along with others) needs, though there are performance concerns associated with adding a process in the middle. But Tim was not alone in being concerned by this message from Lennart Poettering on the nature of that single process:

This hierarchy becomes private property of systemd. systemd will set it up. Systemd will maintain it. Systemd will rearrange it. Other software that wants to make use of cgroups can do so only through systemd's APIs.

Google does not currently run systemd and is not thrilled by the prospect of having to switch to be able to make use of cgroup functionality. So Tim responded that "If systemd is the only upstream implementation of this single-agent idea, we will have to invent our own, and continue to diverge rather than converge." There is no particular judgment against systemd implied by that position; it is simply that making that switch would affect a whole lot of things beyond cgroups, and that is more than Google feels like it would want to take on at the moment. But, in general, it would not be surprising if, in the long term, some users remain opposed to the idea of systemd as the only interface to cgroups. That suggests that we will be seeing competing implementations of the cgroup management daemon concept.

One of those alternatives may be about to come into view; Serge Hallyn confessed that he is working on a cgroup management daemon of his own. In some situations, a separate daemon might meet a lot of needs, but Lennart was clear that he would never have systemd defer to such a daemon. His position — not an entirely unreasonable one — is that the init process, as the creator of all other processes in the system, should not be dependent on any other process for its normal operation. He also seems to feel that it would not be possible to put the cgroup management code into a library that could be used in multiple places. So we are likely to see multiple implementations of this functionality in use before this story is done. That, in turn, could create headaches for developers of applications that need to interface with the cgroup subsystem.

The discussion, thus far, seems to have changed few minds. But Tejun has made it clear that he doesn't intend to just ignore complaints from users:

While the bar to overcome is pretty high, I do want to learn about the problems you guys are foreseeing, so that I can at least evaluate the graveness properly and hopefully compromises which can mitigate the most sore ones can be made wherever necessary.

He also acknowledged the biggest problem faced by the development community: despite having accumulated some experience on wrong ways to solve the problem, nobody really knows what the right solution is. More mistakes are almost certain, so it's too soon to try to settle on final solutions.

In the early years of Linux, most of the ABIs implemented by the kernel were specified by groups like POSIX or by prior implementation in other kernels. That made the ABI design problem mostly go away; it was just a matter of doing what had already been done before. For current problems, though, there are rather fewer places to look for guidance, so we are having to figure out the best designs as we go. Mistakes are certain to happen in such a setting. So we are going to have to get better at learning from those mistakes, coming up with better designs, and moving to them without causing misery for our users. The control group transition is likely to set a lot of precedents regarding how these changes should (or should not) be handled in the future.



When the kernel ABI has to change

Posted Jul 2, 2013 17:13 UTC (Tue) by smoogen (subscriber, #97)

Extremely naive solution?

Fork cgroups to being dgroups (designated groups) which are the single hierarchy of resources. systemd and other items focus on dgroups. cgroups are maintained by google and other users to allow for their use cases.

When the kernel ABI has to change

Posted Jul 3, 2013 5:26 UTC (Wed) by alankila (subscriber, #47141)

That implies a fork of the kernel, nothing less. Rationale:

If you read carefully, the problems with cgroups are with exposing implementation details and therefore locking the current implementation in its place, along with some misdesigned features that utilize current implementation details in a way that appears to be both difficult to maintain and nonsensical to end users.

If "dgroups" remain in the kernel tree, then the current implementation is still locked in place unless emulation of the old interface can occur. I doubt it can. If dgroups remain out of the kernel tree, then the "dgroups patch" likely won't apply cleanly and can't be made to apply cleanly a few minor releases down the line.

When the kernel ABI has to change

Posted Jul 2, 2013 21:49 UTC (Tue) by zblaxell (subscriber, #26385)

Part of the problem with ABI stability is that cgroups are a thin layer patched over top of kernel implementation details that mostly preceded it, not an interface designed to implement real use cases. Existing users of the cgroup ABI discovered the implementation details and adapted them to their requirements, and now expect those implementation details to be stable much to the surprise of the people doing the implementing ("Wait...we have users?"). The fact that people with interesting cgroup use cases seem averse to discussing their cgroup use cases in public also contributes to the problem. ;)

If anything, we should use an incompatible ABI change as an opportunity to implement behavior that is _intentionally_ useful.

We have a CPU controller that lets you tweak some knobs the scheduler already had. The memory controller lets you put some lightweight counters on top of the mostly unmodified VM subsystem without adapting that subsystem's behavior to operate in terms of distinct and enforceable partitions (this is part of the reason why blkio and memory controllers interact with each other in such ugly ways). The freezer controller lets you start an existing kernel procedure which happens to be really handy for temporarily descheduling processes without interfering too much with them, but was originally designed to do something very different.

This approach leads to some face-palming quirks and limitations. The block IO cgroup controller's write bandwidth limiting feature does exactly the opposite of the useful thing: filling RAM with dirty pages that can't be quickly flushed instead of limiting the rate at which RAM pages can be dirtied (this is a disaster if you haven't also limited the RAM usage of the write-rate-limited cgroup and the write rate limit is small). There is a memory controller that limits RAM usage and RAM+swap usage, but not swap usage (the two interesting limit values being 'unlimited' and 'zero') so you can't express a requirement like "I want these processes to never swap, but every other process can." You can't add a process to a frozen cgroup without first unfreezing the cgroup, which creates interesting race conditions if you have tasks trying to add themselves to a frozen cgroup while tasks in that cgroup are trying to (re)freeze themselves.
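The swap-limit quirk in particular reduces to arithmetic: memory.limit_in_bytes caps RAM and memory.memsw.limit_in_bytes caps RAM+swap, so the implied swap ceiling is only ever the difference between the two. A small sketch (the helper function is invented; the two knob names are the real memcg v1 interface):

```python
INF = float("inf")

def implied_swap_cap(ram_limit, ramswap_limit):
    # With an unlimited RAM limit, no finite memsw limit can express
    # "limit swap but not RAM" -- the complaint in the text.
    return ramswap_limit - ram_limit

print(implied_swap_cap(2**30, 2**30))      # 0 -> "never swap"
print(implied_swap_cap(2**30, 2 * 2**30))  # 1073741824 -> up to 1 GiB of swap
# implied_swap_cap(INF, INF) is undefined (inf - inf): swap alone
# cannot be bounded once RAM is unbounded.
```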

RAM is the resource I care the most about. All I want is a bunch of named nested boxes with minimum and maximum size parameters so that I can get predictable behavior from the memory subsystem (*). If the minimum is non-zero, then that amount of RAM is immediately allocated to that box and can never be used by any other (non-child) box, and if the minimum is equal to the maximum then the box never interacts with any other (non-child) box during any future memory allocation or deallocation. Pages shared between boxes are counted as usage by both. There will be unused RAM under such a regime--this is OK, because predictable is better than efficient.
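The "named boxes" idea can be sketched as a toy model; this is the commenter's wished-for design, not actual memcg behavior, and all names and figures are invented:

```python
class Box:
    """Toy RAM box: reserves its minimum up front, never exceeds its maximum."""
    def __init__(self, name, minimum, maximum):
        assert minimum <= maximum
        self.name, self.min, self.max = name, minimum, maximum
        self.used = 0

    def reserved(self):
        # RAM promised to this box, unavailable to any non-child box.
        return max(self.min, self.used)

    def allocate(self, pages):
        if self.used + pages > self.max:
            raise MemoryError(f"{self.name}: over its maximum")
        self.used += pages

total = 1000  # pages in the machine, invented
# min == max: this box never interacts with any other during allocation.
b = Box("latency-sensitive", minimum=400, maximum=400)
a = Box("everything-else", minimum=0, maximum=total - b.reserved())
a.allocate(550)
print(a.used, b.reserved(), total - a.used - b.reserved())  # 550 400 50
# The 50 leftover pages go unused -- predictable beats efficient.
```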

If I've put two processes in the same RAM box, it should never imply that those two processes are also in the same CPU or freezer boxes. In simple cases a hierarchy can be built with RAM at the root and CPU and freezer as its children; however, non-simple cases are pretty common especially if any other resource controllers are involved.

(*) Unpredictable behavior from the memory subsystem includes (so far): swapping out latency-sensitive processes while leaving latency-insensitive processes in RAM even though the latency-insensitive processes are in a RAM-limited cgroup that forbids them ever using enough RAM to require swapping out of the latency-sensitive cgroup; or burning CPU endlessly mining the page tables for free pages that aren't there; or killing random processes in the root cgroup because _any_ cgroup is low on free RAM.

When the kernel ABI has to change

Posted Jul 2, 2013 22:14 UTC (Tue) by marcH (subscriber, #57642)

> Part of the problem with ABI stability is that .... are a thin layer patched over top of ... implementation details that mostly preceded it, not an interface designed to implement real use cases. Existing users ... discovered the implementation details and adapted them to their requirements, and now expect those implementation details to be stable much to the surprise of the people doing the implementing ("Wait...we have users?").

And here we go: yet another discussion about git...

When the kernel ABI has to change

Posted Jul 2, 2013 23:29 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)

I'm using cgroups very heavily and I'm not aware of any serious _duplicating_ functionality. Cgroups are just that - a way to GROUP processes and subdivide resources between them.

There is no way to do this subdivision using existing kernel interfaces. All 'traditional' Unix ways are insufficient - they are racy and unreliable (often at the same time).

Now to the other details in your post.

blkio and RAM controller allow me to create lightweight containers. Sure, you can make SNAFUs like limiting the IO bandwidth too much, but that's entirely your fault. Don't do this unless that's exactly what you want.

You certainly CAN limit the swap usage by setting memory.memsw.limit_in_bytes and memory.limit_in_bytes. It is not possible to use swap limits if you want unlimited RAM usage - but that's simply the way RAM accounting works in Linux. I also don't understand why you would want to do this.

> RAM is the resource I care the most about. All I want is a bunch of named nested boxes with minimum and maximum size parameters so that I can get predictable behavior from the memory subsystem (*). If the minimum is non-zero, then that amount of RAM is immediately allocated to that box and can never be used by any other (non-child) box
We use RAM cgroups a lot - they are predictable and fast. And if you don't want overcommit - then don't overcommit RAM limits on cgroups.

When the kernel ABI has to change

Posted Jul 3, 2013 1:31 UTC (Wed) by zblaxell (subscriber, #26385)

cgroups are lightweight containers, and not useless ones; however, the containers are egregiously leaky, and a lot of behavior occurs in practice that isn't sane.

Consider the following example: create two cgroups A and B, each one limited to 40% of the RAM (so 20% of the RAM is outside of all cgroups). All the processes on the system are in one of these two cgroups. The cgroup B processes are latency-sensitive and include the X server, window manager, xterms, sshd, and so forth; cgroup A is everything else.

One of these insane behaviors is that low-memory conditions cause losses on _all_ cgroups, not just the ones colliding directly with low RAM limits. In theory, it is impossible for processes in either cgroup to run out of total system RAM, so each cgroup should be able to fully utilize the RAM it has been allocated without interfering with the other. In practice, if processes in cgroup A aggressively write data to a filesystem or allocate anonymous memory, processes in cgroup B will be swapped out, and cgroup B will shed cache pages at a few GB/sec (requiring them to be re-fetched from disk later, increasing the latency on processes in cgroup B). In some cases I can get the OOM killer to kill processes in cgroup B, even though both cgroup B and the system at large have gigabytes of free RAM, and some offensive process in cgroup A continues to run uninterrupted.

To work around the OOM killer, I can turn it off for each cgroup, in which case I run into another problem: as a cgroup gets low on its available RAM, it spends more and more system time trying to find free pages. This CPU usage climbs to 100% on each core, and lasts for several minutes before the kernel finally gives up. This occurs even with overcommit disabled (which IIRC isn't a cgroup-specific parameter, so it has to be applied across the entire system and not the one or two cgroups that need it). Several minutes of latency is a problem that only affects cgroup A, but during those minutes the CPUs are not available to cgroup B, and quite a lot of battery power is consumed. To work around this I have hacked up a monitor process that looks for the symptoms and then starts slaughtering expendable processes (mostly Chromium tab processes since they are huge, sensitive to latency, easy to automatically identify, and they tolerate arbitrary fatal signals well) but that only works in the most frequently occurring problem cases.

Limiting swap usage as you describe isn't useful. The problem is that the memory controller can satisfy the combined ram+swap constraints by swapping out pages instead of using RAM. I never want to swap pages out for any process in cgroup B because swapping the pages back in is very slow. I don't want to run the system entirely without swap because I want processes in cgroup A to be able to use unlimited swap. Modifying every program that runs in cgroup B to call mlockall() doesn't help with RAM cached files that are often important for performance. I would want to limit the B cgroup's swap to zero, and leave A's swap unlimited, so that all processes in cgroup B are effectively locked in RAM. Nothing like this can happen with the current cgroup implementation due to side-effects leaking from one container to another, and even the closest possible approximation wastes a lot of RAM because it can only be effective when configured for the worst case.

It's much more convenient to specify "cgroup B has a minimum of 40% of the RAM" than "every cgroup except B has a combined dynamically allocated maximum that happens to be less than 60% of what we calculate to be the total available RAM" (a number that varies significantly over time, which might be another bug). This is doubly true if we lose the ability to have parallel hierarchies, when it will become that much harder to implement this in user space.

If I configure cgroup A with a low write bandwidth limit, there can be profound negative effects on processes in cgroup B. In this case all the above symptoms occur, plus anything sharing a filesystem with cgroup A has to deal with huge latencies allocating space for files, and there are long-held locks on directories where a rename() operation is occurring or a file is being created. To fix this requires some tunable like vm.dirty_ratio et al, but per-cgroup instead of system-wide. In practice I find blkio's write bandwidth limitation feature to do much more harm than good, so I no longer attempt to use it (I use a PWM-like algorithm on freezer.state instead).

When the kernel ABI has to change

Posted Jul 3, 2013 2:09 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

> Consider the following example: create two cgroups A and B, each one limited to 40% of the RAM (so 20% of the RAM is outside of all cgroups). All the processes on the system are in one of these two cgroups. The cgroup B processes are latency-sensitive and include the X server, window manager, xterms, sshd, and so forth; cgroup A is everything else.
Why would you do it? The remaining 20% won't be used for anything.

> In practice, if processes in cgroup A aggressively write data to a filesystem or allocate anonymous memory, processes in cgroup B will be swapped out, and cgroup B will shed cache pages at a few GB/sec (requiring them to be re-fetched from disk later, increasing the latency on processes in cgroup B).
Actually, no. It doesn't happen if you set up limits correctly. We have this very use-case on our cluster and it works just fine. There are some leaks due to mis-assigned pages, but they are trivial (<100Mb).

>To work around the OOM killer, I can turn it off for each cgroup, in which case I run into another problem: as a cgroup gets low on its available RAM, it spends more and more system time trying to find free pages.
'Find available pages'? That's not how it works. You'll get some CPU usage from reclaim attempts, but it's trivial. 100% for several minutes is definitely a sign that something else goes awry, like massive disk thrashing because all of the file cache pages are evicted.

>I never want to swap pages out for any process in cgroup B because swapping the pages back in is very slow. I don't want to run the system entirely without swap because I want processes in cgroup A to be able to use unlimited swap.
Adjust swappiness for each cgroup. Works wonders.

So, from your description I can tell that you might be:
1) Using cgroups incorrectly.
2) Hitting kernel bugs.

Neither of these is a systemd problem.

When the kernel ABI has to change

Posted Jul 3, 2013 3:25 UTC (Wed) by zblaxell (subscriber, #26385)

> Why would you do it? The remaining 20% won't be used for anything.
To ensure that there is no legitimate reason for swapping or page movement between cgroups to occur. The 40/40/20 split is a contrived scenario to make it easy to demonstrate the problems without a lot of confounding factors. A more realistic scenario has a dozen much smaller partitions with more variety in parameters but isn't necessary if the goal is to see problems.

> Actually, no. It doesn't happen if you set up limits correctly. We have this very use-case on our cluster and it works just fine. There are some leaks due to mis-assigned pages, but they are trivial (<100Mb).
Please explain how the limits could be not set up correctly. It's a really simple setup, setting just memory.limit_in_bytes, memory.swappiness = 0, and memory.oom_control (oom_kill_disable = 1).

On some of my tinier systems a "trivial" 100MB _is_ 20% of RAM. :-O

> 'Find available pages'? That's not how it works. You'll get some CPU usage from reclaim attempts, but it's trivial. 100% for several minutes is definitely a sign that something else goes awry, like massive disk thrashing because all of the file cache pages are evicted.
File cache pages are evicted--there are maybe a few hundred KB of file cache when this starts. There is no disk or swap I/O while the symptoms occur (swappiness = 0, and it occurs even if there is no swap on the system. No swap might even make it worse, but I haven't tested that case rigorously). It seems to be doing something trivial in the kernel, but it is doing it many thousands of times (perhaps it succeeds in reclaiming a page, only to immediately need to reclaim another one, or replace the page it just reclaimed?), and blocking userspace processes while it happens. I'm guessing it's constantly trading file cache pages for other memory because the numbers for cache and one or two other stats in memory.stat are constantly bouncing back and forth by small amounts while everything else stays mostly constant. If left on their own, userspace processes eventually get ENOMEM, usually exit immediately, free some memory, and the system resumes normal-ish operation, but it takes anywhere from 30 seconds to 20 minutes depending on the specific machine and application that is being constrained.

The symptoms are almost identical to a plain Linux system, not using cgroups, when it runs out of RAM and has no swap and no oom_killer.

> Neither of these is a systemd problem.
I haven't mentioned systemd, but I agree. I don't run systemd in production, this behavior is all from bare cgroups. ;)

When the kernel ABI has to change

Posted Jul 7, 2013 20:20 UTC (Sun) by dlang (✭ supporter ✭, #313)

> Neither of these is a systemd problem.

This comment is a perfect example of why systemd should not take exclusive control of cgroups.

When the kernel ABI has to change

Posted Jul 7, 2013 20:24 UTC (Sun) by Cyberax (✭ supporter ✭, #52523)

Why?

I'm not exactly thrilled with SystemD becoming the only master of cgroups. I'd much prefer to have a mechanism to delegate cgroups subtrees to certain processes.

But it's reasonable, if the single cgroup hierarchy (which I absolutely hate) ever comes to pass.

When the kernel ABI has to change

Posted Jul 7, 2013 21:04 UTC (Sun) by dlang (✭ supporter ✭, #313)

If systemd says "I control everything related to cgroups" and then someone has a problem with using cgroups, it _is_ a problem for systemd, because systemd has claimed to be taking complete ownership of cgroups.

To the extent that systemd says "I completely own this" and at the same time says "that's not my problem" when users run into problems they are creating a major problem.

or to put it another way, Authority == Responsibility

if you try to hold someone Responsible for something they have no Authority to fix it just leads to frustration on everyone's part

Similarly, if you give someone Authority over something, but don't hold them Responsible for the results, things end up very badly.

This works the same way in software components as it does in all other areas of life.

When the kernel ABI has to change

Posted Jul 8, 2013 5:01 UTC (Mon) by raven667 (subscriber, #5198)

Let me see if I can summarise what is going on, maybe I don't understand, but the proposal is that the kernel interface only have one reader/writer on any given system, and on systems which are using systemd as PID1 that will be systemd. The systemd implementation is too dependent on cgroups at a low level to be able to abstract this out to a library or separate policy daemon which could be shared. On systems which don't use systemd they will be free to write whatever management utility is desired for handling cgroups, but there can be only one active on a system at any time. So it moves the problem of multiple readers/writers from the user/kernel cgroupfs interface into a custom client/management daemon API. You are free to write your own management daemon that works however you want but systemd has committed to writing a user-facing API so you can still implement your custom policies on a systemd host, if you are willing to write to the systemd cgroup management API. Or don't use systemd and write your own management daemon and API. It would be nice if whatever cgroup management daemons exist would also provide a secure API that can be provided to sub-containers so that they can still do their own thing but also be beholden to the general system policy.
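The single-writer arrangement can be sketched as a toy API; the class, method names, and delegation scheme here are all invented for illustration, and systemd's actual interface differs.

```python
class CgroupManager:
    """The only process allowed to modify the cgroup hierarchy."""
    def __init__(self):
        self.tree = {"/": {}}
        self.delegations = {}   # client -> subtree it may manage

    def delegate(self, client, subtree):
        # Hand a subtree to a container/client, still mediated by
        # the daemon rather than by raw cgroupfs permissions.
        self.delegations[client] = subtree

    def create_scope(self, client, path, limits):
        # Clients request groups through the API; the daemon applies
        # policy centrally instead of letting them touch cgroupfs.
        sub = self.delegations.get(client)
        if sub is None or not path.startswith(sub):
            raise PermissionError(f"{client} may not manage {path}")
        self.tree[path] = dict(limits)

mgr = CgroupManager()
mgr.delegate("container-1", "/machine/c1")
mgr.create_scope("container-1", "/machine/c1/web", {"memory": "1G"})
print(sorted(mgr.tree))  # ['/', '/machine/c1/web']
```

The design question in the thread is precisely who gets to be this object: systemd, a standalone daemon like Serge Hallyn's, or several competing implementations.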

It seems that this move means that there will be several different userspace APIs which work in different ways, but that the kernel developers won't be responsible for them or responsible for figuring out what the "right" API is; we'll all duke it out in userspace and maybe in five years there will be a more clear path forward for what the kernel API should look like. In the interim there will be a small, tightly knit community of developers who write the management daemons which will be the only consumers of the kernel cgroup interface.

When the kernel ABI has to change

Posted Jul 8, 2013 6:35 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

That's not all.

Cgroup developers have gone off the deep end. They are also removing _useful_ functionality and calling it 'sane'.

For instance, now I can easily freeze PostgreSQL server and Apache atomically using the freezer cgroup. With the 'single tree' implementation it's going to be impossible. Ditto for separate IO and CPU scheduling hierarchies (Google also uses this).

I can now easily delegate a part of subtree to untrusted users. For instance, I have a 'deeptime' utility that uses cgroups to get CPU time used by a process and its descendants. Works totally fine - I just need to set up correct permissions. Tejun Heo wants to remove this.
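[A "deeptime"-style reader (the utility name is the commenter's; the sketch below is not his code) can be tiny on cgroup v1, because cpuacct accounting is hierarchical: the subtree root's counter already includes every descendant group, and an unprivileged user only needs read permission on the delegated directory. Mount point assumed:]

```python
import os

CPUACCT_ROOT = "/sys/fs/cgroup/cpuacct"   # assumed v1 mount point

def subtree_cpu_seconds(group, root=CPUACCT_ROOT):
    """Total CPU time, in seconds, used by a cgroup and all of its
    descendants.  cpuacct.usage holds cumulative nanoseconds and is
    accounted hierarchically, so one read covers the whole subtree."""
    with open(os.path.join(root, group, "cpuacct.usage")) as f:
        return int(f.read().strip()) / 1e9
```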

In short, Tejun Heo is thinking about HIS convenience, not users'.

When the kernel ABI has to change

Posted Jul 8, 2013 9:36 UTC (Mon) by micka (subscriber, #38720) [Link]

Isn't it to my convenience that the maintainer of a complex part of my system is comfortable with what he maintains?

When the kernel ABI has to change

Posted Jul 8, 2013 9:38 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

Yes, unless this maintainer decides that you don't need several useful features and asks you to do his work for him instead.

When the kernel ABI has to change

Posted Jul 8, 2013 18:29 UTC (Mon) by raven667 (subscriber, #5198) [Link]

> In short, Tejun Heo is thinking about HIS convenience, not users'.

Sure, kernel developers often do what is best for kernel developers and the kernel project first. If this is too disruptive to users though it may get smacked down.

It seems that Tejun Heo has articulated some technical reasons for the other changes: exposing independent CPU and IO scheduling settings is problematic because they are not actually independent and have complex interactions with one another which can easily cause pathological badness. Similarly, delegation has problems because a delegated entity, like a container, doesn't have enough visibility into the whole system to set proper priorities, as the priorities are evaluated relative to one another and are systemwide; only the administrator of the root context has enough knowledge to pick values here that can't cause interference with other workloads.

So in each of these cases the existing implementation has some bad failure modes that probably shouldn't exist and would definitely cause problems if access is delegated to an untrusted party. Their thought is to filter this through an API to a user space daemon to deal with the trust issues rather than trying to handle the trust policy in kernel space.

In any event all of this discussion should be useful, this is a work in progress and can be changed I think.

When the kernel ABI has to change

Posted Jul 8, 2013 18:36 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

Nope. We'll most likely get the dm-vs-md situation all over again.

I.e. Tejun Heo will go on and commit his "insane mode" fixes, tailored specifically for systemd. Namespace people would look at them and run away in horror - there'll be no simple delegation mechanism for cgroups anymore.

Then Google would look at this mess and decide to stick with the old multiple hierarchies interface.

So we'll get the worst of both worlds - two semi-functional interfaces co-existing at the same time, with almost the same functionality. Oh, and the single hierarchy won't make these pathological corner cases go away.

>Similarly to that delegation has problems because a delegated entity, like a container, dosen't have enough visibility into the whole system to set proper priorities as the priorities are evaluated relative to one another and are systemwide, only the administrator of the root context has enough knowledge to pick values here that can't cause interference with other workloads.
It's EASY right now to work around weight manipulations affecting sibling groups simply by creating an additional tree level. Problem solved.

And a mechanism for subtree delegation will STILL be required, even if systemd is used. And this mechanism would also need ACLs, security policies and discoverability. I.e. it would need to duplicate the existing FS functionality.

Epic fail, all around.

When the kernel ABI has to change

Posted Jul 8, 2013 18:46 UTC (Mon) by raven667 (subscriber, #5198) [Link]

> Epic fail, all around.

Hyperbolic, obviously, but this all does seem sub-optimal.

When the kernel ABI has to change

Posted Jul 9, 2013 19:06 UTC (Tue) by lambda (subscriber, #40735) [Link]

I'm curious; how are you limiting the write bandwidth of processes? I have tried to find a way to do that recently, as I am trying to preserve available write bandwidth for high-priority processes by throttling low-priority ones. But as far as I can tell, in the latest kernels, write bandwidth limits apply only to direct I/O, not normal buffered writes (since buffered writes get accounted to the global flush process rather than the processes that dirtied the pages in the first place).

Is there currently a way to bandwidth limit buffered writes that I'm missing? Or were you using a non-stock kernel that had one of the proposed patch series applied?

I thought that one of the primary motivations for requiring a single hierarchy was so that mem cgroups and blkio could be managed together, allowing for accounting of dirty pages to push blkio pressure up through the memory management system, rather than encountering the kinds of problems that you're describing.

Am I just confused or missing something?

When the kernel ABI has to change

Posted Jul 10, 2013 14:41 UTC (Wed) by zblaxell (subscriber, #26385) [Link]

The write bandwidth limit does seem to apply to buffered writes, but only in the special case when buffered writes are forced to be synchronous due to lack of RAM. This means our RAM is 1) full of dirty pages, and 2) writing those pages to clean them is slow, because we've limited the write bandwidth. There are huge variations in shared filesystem latency inside and outside of the limited cgroup. It's OK for limiting the I/O bandwidth of kvm instances, but unusable for e.g. preventing a large C compile from flooding the disk with I/O.

I put any process that is going to do a potentially crippling amount of buffered block writing into a RAM-limited cgroup. This prevents it from flooding RAM with dirty pages, and helps a lot with latency in processes belonging to other cgroups. It still enables the cgroup to flood a device with writes, but processes in non-limited cgroups can still use buffered writes while processes in the RAM-limited cgroup are forced to use synchronous writes. The deadline I/O scheduler can sort that sort of contention out reasonably well, but if you wanted strict block write rate limiting, this is not it.
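[The "put the big buffered writer in a RAM-limited cgroup" practice described above boils down to two cgroupfs writes on a v1 memory controller. A minimal sketch, with the mount point and knob usage assumed rather than taken from the comment:]

```python
import os

MEMCG_ROOT = "/sys/fs/cgroup/memory"   # assumed v1 mount point

def ram_limited_group(name, limit_bytes, pid, root=MEMCG_ROOT):
    """Create a v1 memory cgroup, cap the RAM it may occupy (dirty
    page cache included), and move `pid` into it.  A heavy buffered
    writer in such a group is forced into early writeback instead of
    filling memory with dirty pages."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))
    return path
```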

If a process is going to do a lot of synchronous block writing through a filesystem (e.g. a process that calls sync, fsync, mkdir or rename hundreds of thousands of times) then I will have a process freeze and thaw the cgroup according to some policy (the policy varies considerably. Some examples: a 10% duty cycle at 1Hz; freeze for a few seconds when a user moves the keyboard or mouse, or when a camera detects that a user in front of the computer is moving; freeze when disk I/O rates or latencies exceed some threshold and thaw otherwise). This isn't really what I want--I want these cgroups to run continuously but without adding unacceptable latency to the rest of the system--but it does solve most of the practical problems I have.
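[The duty-cycle policy above can be sketched as a small loop toggling the v1 freezer.state file. This is an illustrative reconstruction, not the commenter's code; the path and parameters are assumed, and the sleep function is injectable so the schedule can be tested without waiting:]

```python
import os
import time

def freezer_duty_cycle(freezer_dir, duty=0.10, period=1.0,
                       cycles=1, sleep=time.sleep):
    """Run a v1 freezer cgroup at a duty cycle: THAWED for
    duty*period seconds, then FROZEN for the remainder of each
    period (duty=0.10, period=1.0 is the 10% duty cycle at 1 Hz
    mentioned above)."""
    state = os.path.join(freezer_dir, "freezer.state")
    for _ in range(cycles):
        with open(state, "w") as f:
            f.write("THAWED")
        sleep(duty * period)          # let the group run briefly
        with open(state, "w") as f:
            f.write("FROZEN")
        sleep((1.0 - duty) * period)  # hold it frozen for the rest
```

[The event-driven variants (keyboard activity, I/O latency thresholds) would replace the fixed sleeps with waits on those conditions.]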

I do agree with merging blkio and memory cgroups (and pull in the other sysctl tunables like dirty_ratio and vm_overcommit too); however, it would also be useful if blkio could throttle buffered writes directly without changing anything else. Buffered writes are tricky--does mmap()ing a file and then modifying it with the CPU count as a buffered write? What about swap? Swap devices are not bound to any particular cgroup, so we might want to limit swap I/O rates differently from other block I/O rates, or invent a way to bind swap devices to cgroups (or just fix the existing throttle mechanism so it works for device-mapper devices). There is MUCH room for improvement in the blkio cgroup hierarchy!

Presumably there's a similar benefit to merging cpu, cpuacct, and cpuset, and maybe cpuset and memory, but I don't see one for merging cpu+cpuacct with blkio+memory. cpuset presents a puzzle: "being a good candidate to merge with other cgroup controllers" is not necessarily a transitive property.

There is no userspace benefit from merging the other cgroup types, like net_cls, debug, devices, and freezer. There are solid use-case-based reasons to keep those hierarchies separate from everything else, including each other.

When the kernel ABI has to change

Posted Jul 10, 2013 13:57 UTC (Wed) by mstsxfx (subscriber, #41804) [Link]

> Consider the following example: create two cgroups A and B, each
> one limited to 40% of the RAM (so 20% of the RAM is outside of all
> cgroups). All the processes on the system are in one of these two
> cgroups.
[...]
> One of these insane behaviors is that low-memory conditions cause losses
> on _all_ cgroups,

By low-memory conditions you mean global low-memory conditions or per-group
low memory condition?

If the former, then there is nothing else to do than reclaim from all
groups, and we do that in a round-robin manner to be as fair as
possible.
If the latter, then your groups must be sharing a lot of pages. Unfortunately
nobody has come up with anything more clever than the first-touch-gets-charged
approach for shared data that would work reasonably well in most cases. So
yes, shared data is a problem. On the positive side, shared data is not
the first target of memory reclaim, so the problem shouldn't happen very
often. Anyway, if your groups share resources then you cannot be surprised by
the fact that the groups interfere as well.

> In theory, it is impossible for processes in either cgroup to run out
> of total system RAM,

Well, not quite right. You have kernel allocations as well, not just the
memory tracked by memcg (pages that are on the LRU). Recent kernels have
learned kmem accounting as well, so this might help in your use case if
kernel memory consumption is high.

[...]
> In practice, if processes in cgroup A aggressively write data
> to a filesystem or allocate anonymous memory, processes in cgroup B
> will be swapped out,

This would be a bug (unless both groups share a lot of memory)

When the kernel ABI has to change

Posted Jul 10, 2013 19:33 UTC (Wed) by zblaxell (subscriber, #26385) [Link]

> By low-memory conditions you mean global low-memory conditions or per-group low memory condition?

It seems to be triggered by low-memory conditions in a RAM-limited cgroup--specifically very high memory pressure conditions that would normally trigger the oom killer. In the field I've seen a 1GB cgroup with an aggressive memory/block write workload blow away 10GB of cache on a system that had a total of 24GB of RAM and about 6GB of it free. That workload was 'git gc' on a 6GB git repo with 512MB window and 8 threads, and bad things started happening when it wanted to start writing while it was already heavily swapping--precisely the kind of insane workload that I want a resource controlling subsystem to protect the rest of my system from. ;)

I've also seen the oom killer kill processes from a cgroup that was well below its RAM limit instead of an aggressively allocating process from a cgroup that was at its RAM limit and swapping. I turned off the oom killer when I noticed that happened, and haven't had a reason to turn it back on since, so I don't know if that still happens in recent kernels.

This problem pops up in a number of places--some of the system-level features, like the OOM killer and cache, don't properly infer intent from cgroup parameters yet, and make some really annoying decisions.

> You have kernel allocations as well, not just the memory tracked by memcg (pages that are on LRU). Recent kernels learned kmem accounting as well so this might help in your use case if the kernel memory consumption is high.

Kernel memory use is the first thing I look for when this problem pops up in the field, and I didn't find anywhere near sufficient amounts of it to have this kind of impact. I also suspected NVidia Xorg drivers until I saw it happening to headless servers and systems with Intel video hardware.

There is a lot of shared data in my workloads--gigabytes of it, in fact. It's unavoidable given that processes in some of the cgroups are high-priority tasks focused on some highly localized subset of data, while others are low-priority tasks that browse lazily through everything. Sooner or later both groups iterate over exactly the same data like hands passing over each other on an analog clock--that instant they're fully shared, and then become less and less shared. On the other hand, this happens at least four times a day and is not usually a problem. :-/

What I'd prefer is that the memory cgroup controllers either count each shared page against each cgroup that uses it (possibly leaving some RAM unallocated, but the decreased latency of predictable RAM caching far outweighs minor inefficiency of RAM usage), or duplicate/copy-on-write the pages to avoid the pathological behavior--at least among sibling cgroups. I could see value in allowing pages to be shared between a parent and its children (hierarchy might not make sense without it). That's maybe a bit beyond the scope of cgroups, but it is a necessary capability for a memory partitioning implementation.

Another possible solution is to give cgroups hard minimum RAM sizes, so that any page shared with a cgroup below its minimum RAM size can't ever be reclaimed from that cgroup, only unshared with a cgroup that is above its minimum size.

When the kernel ABI has to change

Posted Jul 3, 2013 9:24 UTC (Wed) by walken (subscriber, #7089) [Link]

Google MM person here.

What bothers me is not the talk about ABI changes - we can adapt to new ABIs if needed - but the fact that the proposed replacements are dropping some useful functionality.

When the kernel ABI has to change

Posted Jul 3, 2013 16:14 UTC (Wed) by JEFFREY (subscriber, #79095) [Link]

Seems to me that Lennart wants to supplant all of Linux, and bring it under systemd.

When the kernel ABI has to change

Posted Jul 3, 2013 21:00 UTC (Wed) by judas_iscariote (subscriber, #47386) [Link]

Well, then do not use systemd; also, ad hominems do not help your cause.

When the kernel ABI has to change

Posted Jul 4, 2013 1:56 UTC (Thu) by luto (subscriber, #39314) [Link]

The issue here is that you may not have a choice. On current kernels, you can set DefaultControllers= in systemd (if you're on, say, Fedora) and systemd will not do anything that will impact your alternative use of cgroups.

With the proposed changes, there is no equivalent. systemd needs cgroups for grouping, and there's absolutely nothing wrong with having a grouping hierarchy that doesn't match other hierarchies -- the groups don't control anything. But this is going away, so the endgame is presumably that DefaultControllers= results in nothing being able to use cgroups for resource control.

This is bad.

When the kernel ABI has to change

Posted Jul 4, 2013 7:12 UTC (Thu) by zdzichu (subscriber, #17118) [Link]

Why using systemd API is bad?

When the kernel ABI has to change

Posted Jul 4, 2013 12:42 UTC (Thu) by jubal (subscriber, #67202) [Link]

You may want to read this e-mail exchange – it helps to understand why people might be rather reluctant to trust systemd to do the job they want it to do.

(As a side note, it's a beautiful example of why trying a reasoned discussion with a petulant child and a school bully is rather less than productive.)

When the kernel ABI has to change

Posted Jul 4, 2013 15:23 UTC (Thu) by luto (subscriber, #39314) [Link]

A lot of systemd APIs are great. For example, I can take a daemon, add a couple lines of code to make it socket-activated, and it will still work on non-systemd systems.

With cgroups, I can code to the systemd API for low-level stuff, which may or may not do what I need it to do, and even if it works, it will be totally incompatible with everything else.

When the kernel ABI has to change

Posted Jul 4, 2013 16:35 UTC (Thu) by zdzichu (subscriber, #17118) [Link]

"everything else" could implement the same API as systemd will provide. Please note, there is no "everything else" at the moment; systemd is going to be one of the first cgroup management daemons.

When the kernel ABI has to change

Posted Jul 4, 2013 16:48 UTC (Thu) by jubal (subscriber, #67202) [Link]

…“everything else” will be limited to what systemd will provide. And there is absolutely no guarantee that the systemd developers will be interested at all in listening to requests from their non-systemd users, as the thread I linked to shows very neatly.

When the kernel ABI has to change

Posted Jul 4, 2013 19:13 UTC (Thu) by martin.langhoff (subscriber, #61417) [Link]

The Linux kernel is moving boldly into offering new facilities that give us great advantages... if we get userland tooled up to use them. And userland hasn't been dancing to the tune at all; between ossified development processes (e.g. glibc) and a focus on compatibility, the pace has been abysmal.

There is a strong push for a small core userland that moves fast and is more tightly integrated with the latest kernel APIs. It probably comes as an offshoot of the Linux Plumbers folks. All in all, I think it is a winning strategy -- don't hold back progress worrying about Debian/kFreeBSD, and seize the new kernel facilities to make impossible things possible.

I wrote more about this at https://plus.google.com/u/0/104365545644317805353/posts/M...

At the end of the day, the differences will be resolved with working code. IOWs, for all the folks that dislike systemd... it's time to stfu and out-code Lennart Poettering and his band. Maybe help the Upstart developers.

They are shaping the future; join the party.

(Side note: this isn't a popularity contest. The Lennart post you link isn't even nasty; I am sure he has worse. And Linus Torvalds has used strong words on occasion. Who cares? Working code talks.)

When the kernel ABI has to change

Posted Jul 5, 2013 8:37 UTC (Fri) by dvdeug (subscriber, #10998) [Link]

And Google has working code. They will STFU, if necessary, but I seem to recall Linux people being upset when Google STFU'd and carried quite a variety of patches to the base kernel in the kernels they were shipping with Android.

When the kernel ABI has to change

Posted Jul 6, 2013 19:47 UTC (Sat) by lacos (subscriber, #70616) [Link]

> this isn't a popularity contest. [...] Who cares? Working code talks.

This statement may or may not be true, but it sure as hell wasn't stressed much (or popular) when it came to Ulrich Drepper. Double standard?

When the kernel ABI has to change

Posted Jul 6, 2013 22:21 UTC (Sat) by martin.langhoff (subscriber, #61417) [Link]

Just multiple variables at play. I don't know what mix of talent, opportunity and politics kept Ulrich in place as maintainer, while he was reportedly unpopular. (Note that I know nothing of glibc politics, except as reported in LWN.)

And again, what has Ulrich or some easily bruised egos got to do with getting a really good init that makes use of modern kernel facilities? That's what matters! Where is a modern init going to come from? Show us.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds