A safe SCHED_IDLE implementation
[Posted July 3, 2002 by corbet]
A longstanding kernel feature request is a
SCHED_IDLE scheduler
class. Tasks running as SCHED_IDLE would only run when the processor would
otherwise be idle. The "niceness" scheme in the current scheduler does not
provide this behavior: even the lowest-priority processes will run
sometimes. Users who want to search out encryption keys,
model proteins, or search for
extraterrestrial life on their systems generally want that work to not take
any time from other tasks running on the system. Thus the request for
SCHED_IDLE.
In principle, SCHED_IDLE is not that hard to implement. The
problem, of course, is the classic priority inversion trap. If a
SCHED_IDLE process acquires an important shared resource, such as
an internal filesystem semaphore, there is no way to know how long the
process may have to wait before it can run long enough to release that
resource. A SCHED_IDLE process can be preempted at any time by a
higher-priority process; it could then keep needed resources unavailable
indefinitely. Priority inversion problems can come up by themselves; this
situation could also be brought about intentionally as a denial of service
attack.
So far, no solution to this problem has been implemented, so no
SCHED_IDLE patch has ever been merged into the kernel. It is
easier to simply ensure that every process makes a little progress
occasionally so that priority inversion problems resolve themselves.
Now Ingo Molnar has posted a patch which, he
claims, implements SCHED_IDLE (which he calls
SCHED_BATCH) in a safe way. Those who are curious are encouraged
to read his posting, which describes the work in far more detail than you
will find here.
The fundamental observation behind Ingo's approach is that processes only
hold important kernel resources, such as semaphores, when they are running
in kernel mode. If a SCHED_BATCH process is preempted when
running in user mode, it is safe to set that process aside indefinitely.
If, instead, it is running in kernel mode, it must be allowed to finish it
work within a reasonable period of time.
So Ingo's patch splits the schedule() call into two variants.
schedule_userspace() is called when the preempted process is
running in user mode; it implements the full SCHED_BATCH
semantics. schedule(), instead, is invoked when the process is in
kernel mode; it will handle a SCHED_BATCH process like any other,
normal process. Thus SCHED_BATCH processes essentially have their
priorities raised while running in kernel mode.
Raising the priority of processes that hold critical resources is a classic
response to priority inversion problems. Ingo's patch takes a slightly
simpler approach by treating the entire kernel as such a resource. This
patch will raise the priority of SCHED_BATCH processes a bit more
than is strictly necessary; the approach should be robust, however, and the
difference in scheduling behavior would be difficult to measure.
(
Log in to post comments)