
CFS group scheduling

Ingo Molnar's completely fair scheduler (CFS) patch continues to develop; the current version, as of this writing, is v18. One aspect of CFS behavior is seen as a serious shortcoming by many potential users, however: it only implements fairness between individual processes. If 50 processes are trying to run at any given time, CFS will carefully ensure that each gets 2% of the CPU. It could be, however, that one of those processes is the X server belonging to Alice, while the other 49 are part of a massive kernel build launched by Karl the opportunistic kernel hacker, who logged in over the net to take advantage of some free CPU time. Assuming that allowing Karl on the system is considered fair at all, it is reasonable to say that his 49 compiler processes should, as a group, share the processor with Alice's X server. In other words, X should get 50% of the CPU (if it needs it) while all of Karl's processes share the other 50%.

This type of scheduling is called "group scheduling"; Linux has never really supported it with any scheduler. It would be nice if a "completely fair scheduler" to be merged in the future had the potential to be completely fair in this regard too. Thanks to work by Srivatsa Vaddagiri and others, things may well happen in just that way.

The first part of Srivatsa's work was merged into v17 of the CFS patch. It creates the concept of a "scheduling entity" - something to be scheduled, which might not be a process. This work takes the per-process scheduling information and packages it up within a sched_entity structure. In this form, it is essentially a cleanup - it encapsulates the relevant information (a useful thing to do in its own right) without actually changing how the CFS scheduler works.
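The details differ from version to version of the patch, but the basic shape of this encapsulation can be sketched roughly as follows; the field names here are illustrative rather than copied from the v17 code:

    /* Illustrative sketch only - not the actual CFS v17 definitions. */
    struct sched_entity {
            struct rb_node   run_node;          /* position in the scheduler's rbtree */
            unsigned int     on_rq;             /* is this entity currently queued? */
            u64              exec_start;        /* when it last started running */
            u64              sum_exec_runtime;  /* CPU time accumulated so far */
    };

    struct task_struct {
            /* ... */
            struct sched_entity se;   /* per-process scheduling state now lives here */
            /* ... */
    };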

Group scheduling is implemented in a separate set of patches which are not yet part of the CFS code. These patches turn a scheduling entity into a hierarchical structure. There can now be scheduling entities which are not directly associated with processes; instead, they represent a specific group of processes. Each scheduling entity of this type has its own run queue within it. All scheduling entities also now have a parent pointer and a pointer to the run queue into which they should be scheduled.
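In rough terms, the hierarchical form of a scheduling entity might look something like the sketch below; the field names are guesses at the shape of the patch, not its literal code:

    /* Sketch of a hierarchical scheduling entity; names are illustrative. */
    struct sched_entity {
            /* ... per-entity statistics as before ... */
            struct sched_entity *parent;  /* NULL for a top-level entity */
            struct cfs_rq       *cfs_rq;  /* the run queue this entity is queued on */
            struct cfs_rq       *my_q;    /* the run queue this entity owns;
                                             NULL if the entity is a plain process */
    };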

By default, processes are at the top of the hierarchy, and each is scheduled independently. A process can be moved underneath another scheduling entity, though, essentially removing it from the primary run queue. When that process becomes runnable, it is put on the run queue associated with its parent scheduling entity.
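A minimal sketch of that enqueueing decision, using helper names that are assumptions rather than the patch's actual functions:

    /* Sketch: enqueue a newly-runnable entity onto its parent's queue. */
    static void enqueue_hierarchical(struct rq *rq, struct sched_entity *se)
    {
            struct cfs_rq *q;

            if (se->parent)
                    q = se->parent->my_q;   /* the group's private run queue */
            else
                    q = &rq->cfs;           /* the CPU's top-level run queue */

            se->cfs_rq = q;
            enqueue_entity(q, se);          /* insert into that queue's rbtree */
    }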

When the scheduler goes to pick the next task to run, it looks at all of the top-level scheduling entities and takes the one which is considered most deserving of the CPU. If that entity is not a process (it's a higher-level scheduling entity), then the scheduler looks at the run queue contained within that entity and starts over again. Things continue down the hierarchy until an actual process is found, at which point it is run. As the process runs, its runtime statistics are collected as usual, but they are also propagated up the hierarchy so that its CPU usage is properly reflected at each level.
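The descent described above boils down to a simple loop; this is a sketch of the logic, not the patch's actual pick-next-task code:

    /* Sketch: walk down the entity hierarchy until a real task is found. */
    static struct task_struct *pick_next_task_sketch(struct cfs_rq *top)
    {
            struct cfs_rq *q = top;
            struct sched_entity *se;

            do {
                    se = pick_next_entity(q);  /* most deserving entity on this queue */
                    if (!se)
                            return NULL;       /* nothing runnable at this level */
                    q = se->my_q;              /* non-NULL only for group entities */
            } while (q);

            return task_of(se);   /* container_of() back to the owning task_struct */
    }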

So now the system administrator can create one scheduling entity for Alice, and another for Karl. All of Alice's processes are placed under her representative scheduling entity; a similar thing happens to all of the processes in Karl's big kernel build. The CFS scheduler will enforce fairness between Alice and Karl; once it decides who deserves the CPU, it will drop down a level and perform fair scheduling of that user's processes.

The creation of the process hierarchy need not be done on a per-user basis; processes can be organized in any way that the administrator sees fit. The grouping could be coarser; for example, on a university machine, all students could be placed in one group and faculty in another. Or the hierarchy could be based on the type of process: there could be scheduling entities representing system daemons, interactive tools, number-crunching CPU hogs, etc. There is nothing in the patch which limits the ways in which processes can be grouped.

One remaining question might be: how does the system administrator actually cause this grouping to happen? The answer is in the second part of the group scheduling patch, which integrates scheduling entities with the process container mechanism. The administrator mounts a container filesystem with the cpuctl option; scheduling groups can then be created as directories within that filesystem. Processes can be moved into (and out of) groups using the usual container interface. So any particular policy can be implemented through the creation of a simple, user-space daemon which responds to process creation events by placing newly-created processes in the right group.
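As a concrete (and entirely hypothetical) illustration, such a daemon might do something like the following; the mount arguments, the /containers mount point, and the per-group "tasks" file are assumptions based on the container interface as described here, not a documented ABI:

    /* Hypothetical sketch of a policy daemon's group setup. */
    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static int move_pid_into_group(const char *group, pid_t pid)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path), "/containers/%s/tasks", group);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%d\n", pid);        /* attach the process to the group */
            return fclose(f);
    }

    int main(void)
    {
            /* Roughly: mount -t container -o cpuctl none /containers */
            mount("none", "/containers", "container", 0, "cpuctl");

            mkdir("/containers/alice", 0755);  /* one scheduling group per user, say */
            mkdir("/containers/karl", 0755);

            return move_pid_into_group("karl", 4242 /* a pid from Karl's build */);
    }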

In its current form, the container code only supports a single level of group hierarchy, so a two-level scheme (divide users into administrators, employees, and guests, then enforce fairness between users in each group, for example) cannot be implemented. This appears to be a "didn't get around to it yet" sort of limitation, though, rather than something which is inherent in the code.

With this feature in place, CFS will become more interesting to a number of potential users. Those users may have to wait a little longer, though. The 2.6.23 merge window will be opening soon, but it seems unlikely that this work will be considered ready for inclusion at that time. Maybe 2.6.24 will be a good release for people wanting a shiny, new, group-aware scheduler.

Index entries for this article
Kernel: Group scheduling
Kernel: Scheduler/Completely fair scheduler
Kernel: Scheduler/Group scheduling



CFS group scheduling

Posted Jul 5, 2007 13:33 UTC (Thu) by mclasen@redhat.com (subscriber, #31786)

Might be worthwhile to point out that the CFS scheduler has been included in the Fedora devel kernel for a few weeks now.

Userspace scheduling policy daemon

Posted Jul 5, 2007 13:49 UTC (Thu) by abatters (✭ supporter ✭, #6932)

> So any particular policy can be implemented through the creation of a simple, user-space daemon which responds to process creation events by placing newly-created processes in the right group.

Using a userspace daemon to set process scheduling policy might be fine for long-running processes like the X server, but wouldn't it introduce a lot of overhead for short-lived processes like gcc during kernel builds? I expect it would add a few context switches of overhead to every fork(); that doesn't seem consistent with the general kernel developer attitude toward efficiency.

Userspace scheduling policy daemon

Posted Jul 6, 2007 10:11 UTC (Fri) by TRS-80 (guest, #1804)

The usual way to handle this is child processes get put into the same group as their parent. Obviously you wouldn't want this for all processes, but the daemon could mark things like make and gcc as creating a new sub-hierarchy.

CFS group scheduling

Posted Jul 5, 2007 14:35 UTC (Thu) by davecb (subscriber, #1574)

I've been a happy user of this kind of functionality in Solaris 9 and 10 (before that it was pretty primitive).

I originally used it for production machines, where a guarantee of a certain amount of CPU to a bunch of processes allows for everything from consolidation to having the CPU available for root to use to kill a runaway process.

However, I now run it on my laptop, and shove background processes into a different scheduling group so they don't interfere with anything in the foreground. Think of this as bg and nice done right (;-))

--dave

CFS group scheduling

Posted Aug 28, 2007 15:41 UTC (Tue) by kolyshkin (guest, #34342)

> This type of scheduling is called "group scheduling"; Linux has never really supported it with any scheduler.

Well, the scheduler we have in OpenVZ does just that. It is a two-level CPU scheduler: on the first level, it schedules between groups of processes (with each group being a container), taking into account the (relative) priorities of those groups and the (absolute) limits on the CPU time used. On the second level, it schedules the processes within the chosen group.

It is indeed nice that such a feature is appearing in the vanilla kernel.
