CFS group scheduling
This type of scheduling is called "group scheduling"; Linux has never really supported it with any scheduler. It would be nice if a "completely fair scheduler" to be merged in the future had the potential to be completely fair in this regard too. Thanks to work by Srivatsa Vaddagiri and others, things may well happen in just that way.
The first part of Srivatsa's work was merged into v17 of the CFS patch. It creates the concept of a "scheduling entity" - something to be scheduled, which might not be a process. This work takes the per-process scheduling information and packages it up within a sched_entity structure. In this form, it is essentially a cleanup - it encapsulates the relevant information (a useful thing to do in its own right) without actually changing how the CFS scheduler works.
Group scheduling is implemented in a separate set of patches which are not yet part of the CFS code. These patches turn a scheduling entity into a hierarchical structure. There can now be scheduling entities which are not directly associated with processes; instead, they represent a specific group of processes. Each scheduling entity of this type has its own run queue within it. All scheduling entities also now have a parent pointer and a pointer to the run queue into which they should be scheduled.
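As a rough illustration of the structure just described, here is a toy Python model (not kernel code). The names `RunQueue`, `SchedEntity`, `queued_on`, `own_queue`, `make_group` and `add_process` are invented for this sketch and are not taken from the patches; only the ideas of a parent pointer, a per-group run queue, and a pointer to the queue an entity is scheduled into come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class RunQueue:
    """A queue of runnable scheduling entities (toy model)."""
    entities: List["SchedEntity"] = field(default_factory=list)


@dataclass(eq=False)
class SchedEntity:
    """Something that can be scheduled: a process or a group of processes."""
    name: str
    parent: Optional["SchedEntity"] = None   # None for top-level entities
    queued_on: Optional[RunQueue] = None     # run queue this entity is scheduled into
    own_queue: Optional[RunQueue] = None     # set only for group entities

    def is_group(self) -> bool:
        return self.own_queue is not None


def make_group(name: str, top: RunQueue) -> SchedEntity:
    """Create a group entity with its own run queue, queued at the top level."""
    group = SchedEntity(name, queued_on=top, own_queue=RunQueue())
    top.entities.append(group)
    return group


def add_process(name: str, top: RunQueue,
                group: Optional[SchedEntity] = None) -> SchedEntity:
    """Queue a process at the top level, or underneath `group` if one is given."""
    queue = group.own_queue if group is not None else top
    proc = SchedEntity(name, parent=group, queued_on=queue)
    queue.entities.append(proc)
    return proc
```

A process created with `add_process("editor", top, alice)` never appears on the top-level queue; it lives only on the queue owned by the `alice` entity.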
By default, processes are at the top of the hierarchy, and each is scheduled independently. A process can be moved underneath another scheduling entity, though, essentially removing it from the primary run queue. When that process becomes runnable, it is put on the run queue associated with its parent scheduling entity.
When the scheduler goes to pick the next task to run, it looks at all of the top-level scheduling entities and takes the one which is considered most deserving of the CPU. If that entity is not a process (it's a higher-level scheduling entity), then the scheduler looks at the run queue contained within that entity and starts over again. Things continue down the hierarchy until an actual process is found, at which point it is run. As the process runs, its runtime statistics are collected as usual, but they are also propagated up the hierarchy so that its CPU usage is properly reflected at each level.
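The descent and the upward accounting can be sketched in a few lines of Python. This is a deliberately simplified model: "most deserving" is taken to mean "has accumulated the least runtime", and groups on a queue are assumed to be non-empty. The names `Entity`, `pick_next` and `account` are hypothetical.

```python
class Entity:
    """Toy scheduling entity: a process, or a group with its own run queue."""

    def __init__(self, name, parent=None, is_group=False):
        self.name = name
        self.parent = parent
        self.runtime = 0                            # CPU time charged so far
        self.children = [] if is_group else None    # a group's own run queue
        if parent is not None:
            parent.children.append(self)


def pick_next(run_queue):
    """Descend from the top-level queue until an actual process is found."""
    while True:
        # Simplification: least accumulated runtime counts as "most deserving".
        entity = min(run_queue, key=lambda e: e.runtime)
        if entity.children is None:
            return entity                 # a real process: run it
        run_queue = entity.children       # a group: start over inside it


def account(process, ticks):
    """Charge runtime to the process and to every ancestor entity."""
    entity = process
    while entity is not None:
        entity.runtime += ticks
        entity = entity.parent
```

Because `account()` walks the parent pointers, a group's `runtime` is always the sum of what its members have consumed, which is what lets the top-level comparison treat a whole group as a single competitor.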
So now the system administrator can create one scheduling entity for Alice, and another for Karl. All of Alice's processes are placed under her representative scheduling entity; a similar thing happens to all of the processes in Karl's big kernel build. The CFS scheduler will enforce fairness between Alice and Karl; once it decides who deserves the CPU, it will drop down a level and perform fair scheduling of that user's processes.
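To put numbers on the difference, the following small simulation compares flat scheduling with two-level scheduling. It is only a sketch: every process is assumed to be always runnable, time is handed out in equal one-tick slices, and the process counts (one for Alice, eleven for Karl) are made up for the example.

```python
def simulate(procs_per_user, grouped, ticks=1200):
    """Return each user's share of the CPU after `ticks` one-tick slices.

    procs_per_user maps a user name to a (hypothetical) number of
    always-runnable processes.
    """
    proc_time = {(user, i): 0
                 for user, count in procs_per_user.items()
                 for i in range(count)}
    user_time = {user: 0 for user in procs_per_user}
    for _ in range(ticks):
        if grouped:
            # First level: pick the user with the least CPU time so far.
            user = min(user_time, key=user_time.get)
            candidates = [p for p in proc_time if p[0] == user]
        else:
            # Flat: every process competes directly with every other.
            candidates = list(proc_time)
        # Second (or only) level: pick the least-run candidate process.
        proc = min(candidates, key=proc_time.get)
        proc_time[proc] += 1
        user_time[proc[0]] += 1
    return {user: user_time[user] / ticks for user in procs_per_user}
```

With `{"alice": 1, "karl": 11}`, the flat run gives Alice about 8% of the CPU (one process out of twelve), while the grouped run gives each user 50%, with Karl's half then divided evenly among his eleven processes.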
The creation of the process hierarchy need not be done on a per-user basis; processes can be organized in any way that the administrator sees fit. The grouping could be coarser; for example, on a university machine, all students could be placed in one group and faculty in another. Or the hierarchy could be based on the type of process: there could be scheduling entities representing system daemons, interactive tools, monster cranker CPU hogs, etc. There is nothing in the patch which limits the ways in which processes can be grouped.
One remaining question might be: how does the system administrator actually cause this grouping to happen? The answer is in the second part of the group scheduling patch, which integrates scheduling entities with the process container mechanism. The administrator mounts a container filesystem with the cpuctl option; scheduling groups can then be created as directories within that filesystem. Processes can be moved into (and out of) groups using the usual container interface. So any particular policy can be implemented through the creation of a simple, user-space daemon which responds to process creation events by placing newly-created processes in the right group.
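The placement step of such a daemon might look something like the sketch below. It assumes that each group directory exposes a `tasks` file which accepts a PID, as cpusets do; the mount point used in the usage note is a hypothetical path, and how the daemon learns about process creation is left out entirely.

```python
import os


def move_to_group(mount_point, group, pid):
    """Create the group directory if needed and move `pid` into it.

    Assumption: writing a PID to the group's `tasks` file moves that
    process into the group, as with the cpuset interface.
    """
    group_dir = os.path.join(mount_point, group)
    os.makedirs(group_dir, exist_ok=True)
    with open(os.path.join(group_dir, "tasks"), "a") as tasks:
        tasks.write(f"{pid}\n")
    return group_dir
```

A daemon would call something like `move_to_group("/dev/cpuctl", "alice", pid)` for each new process it decides belongs to Alice, where `/dev/cpuctl` stands in for wherever the administrator mounted the container filesystem.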
In its current form, the container code only supports a single level of group hierarchy, so a two-level scheme (divide users into administrators, employees, and guests, then enforce fairness between users in each group, for example) cannot be implemented. This appears to be a "didn't get around to it yet" sort of limitation, though, rather than something which is inherent in the code.
With this feature in place, CFS will become more interesting to a number of potential users. Those users may have to wait a little longer, though. The 2.6.23 merge window will be opening soon, but it seems unlikely that this work will be considered ready for inclusion at that time. Maybe 2.6.24 will be a good release for people wanting a shiny, new, group-aware scheduler.
CFS group scheduling
Posted Jul 5, 2007 13:33 UTC (Thu)
by mclasen@redhat.com (subscriber, #31786)
Might be worthwhile to point out that the CFS scheduler has been included in the Fedora devel kernel for a few weeks now.
Userspace scheduling policy daemon
Posted Jul 5, 2007 13:49 UTC (Thu)
by abatters (✭ supporter ✭, #6932)
Using a userspace daemon to set process scheduling policy might be fine for long-running processes like the X server, but wouldn't it introduce a lot of overhead for short-lived processes like gcc during kernel builds? I expect it would add a few context switches of overhead to every fork(); that doesn't seem consistent with the general kernel developer attitude towards efficiency.
Userspace scheduling policy daemon
Posted Jul 6, 2007 10:11 UTC (Fri)
by TRS-80 (guest, #1804)
> So any particular policy can be implemented through the creation of a simple, user-space daemon which responds to process creation events by placing newly-created processes in the right group.
The usual way to handle this is for child processes to be put into the same group as their parent. Obviously you wouldn't want this for all processes, but the daemon could mark things like make and gcc as creating a new sub-hierarchy.
CFS group scheduling
Posted Jul 5, 2007 14:35 UTC (Thu)
by davecb (subscriber, #1574)
I've been a happy user of this kind of functionality in Solaris 9 and 10 (before that it was pretty primitive).
I originally used it for production machines, where a guarantee of a certain amount of CPU to a bunch of processes allows for everything from consolidation to having the CPU available for root to use to kill a runaway process.
However, I now run it on my laptop, and shove background processes into a different scheduling group so they don't interfere with anything in the foreground. Think of this as bg and nice done right (;-))
--dave
CFS group scheduling
Posted Aug 28, 2007 15:41 UTC (Tue)
by kolyshkin (guest, #34342)
> This type of scheduling is called "group scheduling"; Linux has never really supported it with any scheduler.
Well, the scheduler we have in OpenVZ is doing just that. It's a two-level CPU scheduler. On the first level, it schedules between groups of processes (with each group being a container), taking into account the (relative) priorities for those groups and the (absolute) limits on the CPU time being used. On the second level, it schedules a process within the chosen group. It's indeed nice that such a feature is appearing in the vanilla kernel.