By Jonathan Corbet
October 16, 2007
The Completely Fair Scheduler (CFS) was merged for the 2.6.23 kernel. One
CFS feature which did not get in, though, was the
group scheduling facility.
Group scheduling makes the CFS fairness algorithm operate in a hierarchical
fashion: processes are divided into groups, and, within each group,
processes are scheduled fairly against one another. At the higher level,
each group as a whole is given a fair share of the processor. The grouping
of processes is done in user space in a highly flexible manner; the control
groups (formerly "process containers") mechanism allows a management daemon
to classify processes according to almost any policy.
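As a rough illustration of how the hierarchy works (this is illustrative arithmetic only, not the scheduler's code, and the group sizes are hypothetical), consider two groups with three and one runnable processes respectively; each group gets half of the CPU, and each group's half is then split among its own processes:

    #include <stdio.h>

    /* Illustration only: hierarchical fair shares with equal weights.
     * Each group gets an equal slice of the CPU; the processes within
     * a group then split that group's slice equally among themselves. */
    int main(void)
    {
        const int nr_groups = 2;
        const int procs_in_group[] = { 3, 1 };   /* hypothetical group sizes */

        for (int g = 0; g < nr_groups; g++) {
            double group_share = 100.0 / nr_groups;
            double per_process = group_share / procs_in_group[g];
            printf("group %d: %.1f%% total, %.1f%% per process\n",
                   g, group_share, per_process);
        }
        return 0;
    }

With these numbers, the lone process in the second group gets 50% of the CPU while each process in the first group gets about 17%; without grouping, all four processes would simply get 25% apiece.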
One of the reasons why group scheduling did not get into 2.6.23 is that the
control groups patch was not ready for merging. Your editor had expected
control groups to go in for 2.6.24, but, as of this writing, it is looking
like that patch might still be under too much active development to get
into the mainline. The group scheduling feature is not waiting, though; it
has been merged for the 2.6.24 release. In the absence of control groups,
the general group scheduling mechanism will not be available. Over the
last few months, though, the group scheduler has evolved a new feature which will
allow it to be used without control groups, and which implements what is
likely to be the most common use case.
That feature is per-user scheduling: creating a separate group for each
user running on the system and using those groups to give each user a fair share of the
processor. Since the groups are created implicitly by the scheduler, there
is no need for the control groups interface. Instead, if the
"fair user" configuration option is selected, per-user group scheduling
goes into effect with no further administrator intervention required.
Of course, once the system provides fair per-user scheduling,
administrators will immediately want to make it unfair by arranging for
some users to get more CPU time than others. The age-old technique of
raising the priority of that crucial administrative wesnoth process still
works, but it is a crude and transparent tool. It would be much nicer to
be able to tweak the scheduler so that certain users get a larger share of
the CPU for the running of their crucial games (er, video diagnostic tools).
To achieve such ends with the 2.6.24 scheduler, it will only be necessary
to go to the new sysfs directory /sys/kernel/uids. There will be
a subdirectory there for every active user ID on the system, and each
subdirectory will contain a file called cpu_share. The integer
value found in that file defaults to 1024. For the purposes of adjusting
scheduling, all that really matters is the ratio between two users'
cpu_share values. If one user's cpu_share is set to 2048, that user will
get twice as much CPU time as any user whose value remains at the default
1024. The end result is that adjusting how the CPU is divided between
users is quite easy for the administrator to do.
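As a sketch of how such an adjustment might be automated (assuming the /sys/kernel/uids layout described above; the UID 1000 used here is purely an example), a small program need only write the new value to the relevant file:

    #include <stdio.h>

    /* Sketch: give the (hypothetical) user with UID 1000 twice the
     * default CPU share by writing to the cpu_share file described
     * above.  Requires sufficient privileges; error handling is minimal. */
    int main(void)
    {
        const char *path = "/sys/kernel/uids/1000/cpu_share";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            return 1;
        }
        fprintf(f, "2048\n");
        fclose(f);
        return 0;
    }

Since only the ratio between values matters, the same effect could also be had by halving every other user's cpu_share instead.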
A rather large number of other patches were also merged for 2.6.24. Most of
those are cleanups and small improvements. Some of the math within the
scheduler has been made less intensive, and fairness has been improved in a
number of ways. There is also a new facility for performing guest CPU
accounting for virtualized systems running under KVM. It's a lot of
patches, but the rate of change in the core CPU scheduler should be
beginning to slow down again.
There are some other scheduler-related patches in the works, though. A
couple of them address the problem of getting realtime tasks into a CPU
promptly. Normally, the CPU scheduler will make a significant effort to
avoid moving processes between CPUs because the cost of that migration
(resulting from lost memory cache contents) is high. If a realtime process
wants to run, though, the system is obligated to give it a processor even
if there is a price to be paid in terms of overall throughput. The current
CPU scheduler, however, will cause a realtime process to languish if a
higher-priority process is running on the same CPU, even if other
processors are available in the system.
Fixing this problem involves a couple of different patches. This one from Steven Rostedt
addresses the situation where the scheduling of one realtime task causes a
lower-priority (but still realtime) task to be pushed out of the CPU.
Rather than leave that luckless task in the run queue, Steven's patch
searches through the other processors on the system to find the one running
the lowest-priority process. If a processor running a sufficiently
low-priority process is found, the displaced realtime process is moved over
to that processor.
Gregory Haskins has posted a
similar patch which addresses a slightly different situation: a
realtime task has just been awakened, but the CPU it is on is already
running a higher-priority process. Once again, a search is made for the
CPU running the lowest-priority process, and the realtime task is moved
there if a suitable home is found. In either case, the moved process will
suffer a small performance hit as it finds a completely cold cache waiting
for it. But it will still be able to respond much more quickly to the real
world than it would if it were sitting on a run queue somewhere; that, of
course, is what realtime scheduling is all about.
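At the heart of both patches is a search for the CPU running the lowest-priority task. A rough user-space sketch of that selection logic (not the actual kernel code; the priority convention and the numbers are invented for illustration) might look like this:

    #include <stdio.h>

    #define NR_CPUS 4

    /* Illustration only: pick the CPU whose current task has the lowest
     * priority so that a displaced (or newly-woken) realtime task can be
     * pushed there.  A higher number means higher priority in this sketch. */
    static int find_lowest_prio_cpu(const int curr_prio[], int nr_cpus)
    {
        int best_cpu = 0;

        for (int cpu = 1; cpu < nr_cpus; cpu++)
            if (curr_prio[cpu] < curr_prio[best_cpu])
                best_cpu = cpu;
        return best_cpu;
    }

    int main(void)
    {
        /* Hypothetical priorities of the task currently running on each CPU. */
        int curr_prio[NR_CPUS] = { 90, 50, 10, 70 };
        int waiting_task_prio = 60;
        int target = find_lowest_prio_cpu(curr_prio, NR_CPUS);

        if (waiting_task_prio > curr_prio[target])
            printf("push the realtime task to CPU %d (prio %d)\n",
                   target, curr_prio[target]);
        else
            printf("no suitable CPU; the task stays queued\n");
        return 0;
    }

The real scheduler must, of course, perform this search against data that other CPUs are changing at the same time, so the actual patches are rather more involved.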
Another issue which has come up in some situations is that the accuracy of
fair scheduling decisions is constrained by the scheduler tick frequency.
In the absence of external events (such as I/O completions), one process
can only preempt another when the periodic timer tick comes in. As a
result, processes might run longer than their time slices would otherwise
allow. The scheduler will compensate for the extra time used by that process
by causing it to wait longer than it otherwise would for its next time
slice. The result is fair scheduling, but higher latencies than one might
like.
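To put some invented numbers on that (both the tick rate and the slice length here are assumptions for illustration): with a 1ms tick, a process whose fair allocation runs out partway between ticks keeps running until the next tick, and the scheduler then charges it for the full amount:

    #include <stdio.h>

    /* Illustration only, with made-up numbers: under a periodic tick,
     * preemption can only happen on tick boundaries, so a task may run
     * past the point where it should have been preempted.  The scheduler
     * later charges it for the extra time used. */
    int main(void)
    {
        double tick_ms = 1.0;            /* 1000Hz tick (assumed) */
        double intended_slice_ms = 2.4;  /* hypothetical fair allocation */

        /* The task runs until the first tick after its allocation ends. */
        double actual_run_ms = tick_ms * ((int)(intended_slice_ms / tick_ms) + 1);
        double overrun_ms = actual_run_ms - intended_slice_ms;

        printf("intended %.1fms, actually ran %.1fms, overran by %.1fms\n",
               intended_slice_ms, actual_run_ms, overrun_ms);
        printf("that %.1fms is charged back, delaying the next run\n", overrun_ms);
        return 0;
    }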
Peter Zijlstra has posted a
solution to this problem: a patch which uses the high-resolution timer
mechanism to preempt processes exactly at the end of their time slices.
When the scheduler notes that a time slice will run out between timer
ticks, it arranges for a special one-time timer interrupt at the time slice
expiration time. When that interrupt arrives, the running process can be
turfed out right on schedule. As a result, the process will not overrun
its time slice and will not have to face a longer-than-usual wait before it
is able to run again.
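The patch itself works inside the scheduler with the kernel's high-resolution timer mechanism; as a purely user-space analogy of the idea (not the scheduler code, and with a hypothetical 2.4ms remaining in the slice), one can arm a one-shot high-resolution timer for exactly the time left:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/timerfd.h>

    /* User-space analogy only: arm a one-shot timer for the exact time
     * remaining in a hypothetical 2.4ms slice instead of waiting for the
     * next periodic tick.  The patch does the equivalent with an hrtimer
     * inside the scheduler. */
    int main(void)
    {
        int fd = timerfd_create(CLOCK_MONOTONIC, 0);
        struct itimerspec its = {
            .it_value = { .tv_sec = 0, .tv_nsec = 2400000 },  /* 2.4ms, one-shot */
        };
        uint64_t expirations;

        if (fd < 0 || timerfd_settime(fd, 0, &its, NULL) < 0) {
            perror("timerfd");
            return 1;
        }
        /* In the scheduler this would arrive as an interrupt; here we block. */
        if (read(fd, &expirations, sizeof(expirations)) < 0) {
            perror("read");
            return 1;
        }
        printf("slice expired precisely on time; preempt now\n");
        return 0;
    }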
Mike Galbraith has reported that this patch
results in reduced context switching on his system, and higher throughput
as well. So it looks like the right solution to the problem, at least in
the absence of a true dynamic tick mechanism. The current dynamic tick
code turns off the periodic clock interrupt when the processor is idle, but
that interrupt continues to run when the processor is busy. In a fully
dynamic environment, periodic ticks would never be used and special
interrupts at the end of time slices would be the normal way of doing
business. Implementing full dynamic tick is a big job, though; in the
meantime the addition of an occasional extra tick can help the scheduler to
do a quick and accurate job.