Much of the realtime scheduling work in Linux has been based around getting
the best behavior out of the POSIX realtime scheduling classes. Techniques
like priority inheritance, for example, exist to ensure that the
highest-priority task really can run within a bounded period of time. In
much of the rest of the world, though, priorities and POSIX realtime are no
longer seen as the best way to solve the problem. Instead, the realtime
community likes to talk about "deadlines" and deadline-oriented
scheduling. In this article, we'll look at a deadline scheduler has
recently been posted for review and related discussion at the recent Real
Time Linux Workshop in Dresden.
Priority-based realtime scheduling has the advantage of being fully
deterministic - the highest-priority task always runs. But priority-based
scheduling is subject to some unpleasant failure modes (priority
and starvation, for example), does not really isolate tasks running on the
same system, and is often not the best way to describe the problem. Most
tasks are more readily described in terms of an amount of work which must
be accomplished within a specific time period; the desire to work in those
terms has led to a lot of research in deadline-based scheduling in recent
A deadline system does away with static priorities. Instead, each running
task provides a set of three scheduling parameters:
- A deadline - when the work must be completed.
- An execution period - how often the work must be performed.
- The worst-case execution time (WCET) - the maximum amount of CPU
time which will be required to get the work done.
Deadline-scheduled tasks usually recur on a regular basis - thus the period
parameter - but sporadic work can also be handled with this model.
There are some advantages to this model. The "bandwidth" requirement of a
process - what percentage of a CPU it needs - is easily calculated, so the
scheduler knows at the outset whether the system is oversubscribed or not.
The scheduler can (and should) refuse to accept tasks which would require
more bandwidth than the system has available.
By refusing excess work, the scheduler
will always be able to provide the requisite CPU time to every process
within the specified deadline. That kind of promise makes realtime
Linux currently has no deadline scheduler. There is, however, an implementation posted for
review by Dario Faggioli and others; Dario also presented this
scheduler in Dresden. This implementation uses the "earliest deadline first"
(EDF) algorithm, which is based on a simple concept: the process with the earliest
deadline will be the first to run. Essentially, EDF attempts to ensure that
every process begins executing by its deadline, not that it actually
gets all of its work done by then. Since EDF runs work as early as
possible, most tasks should complete well ahead of their declared
This scheduler is implemented with the creation of a new scheduling class
called SCHED_EDF. It does away with the distinction between the
"deadline" and "period" parameters, using a single time period for both.
The patch places this class between the existing
realtime classes (SCHED_FIFO and SCHED_RR) and the normal
interactive scheduling class (SCHED_FAIR). The idea behind this
placement was to avoid breaking the "highest priority always runs" promise
provided by the POSIX realtime classes. Peter Zijlstra, though, thinks that deadline scheduling should run at
the highest priority; otherwise it cannot ensure that the deadlines will be
met. That placement could be seen as violating POSIX requirements; to
that, Peter responds, "In short, sod POSIX."
Peter would also like to name the scheduler SCHED_DEADLINE, for
the simple reason that EDF is not the only deadline algorithm out there.
In the future, it may be desirable to switch to a different algorithm
without forcing applications to change which scheduling class they
request. At the moment, the other contender would appear to be "least
laxity first" scheduling, which picks the task with the smallest amount of
"cushion" time between its remaining compute time and its deadline. Least
laxity first tries to ensure that each process can complete its computing
by the deadline. It tends to suffer from much higher context-switching
rates than EDF, though, and nobody is pushing such a scheduler for Linux at
One nice feature of deadline schedulers is that no process should be able
to prevent another from completing its work before its deadline passes. The
real world is messier
than that, as we will see below, but, even in the absence of deeper
problems, the scheduler can only make that guarantee if every process
actually stops running within its declared WCET. The EDF scheduler solves
that problem in an unsubtle way: when a process exceeds its bandwidth, it
is simply pushed out of the CPU until its next deadline period begins.
This approach is simple to implement and ensures that deadlines will be
met, but it can be hard on a process which must do a bit of extra computing
In the SCHED_EDF patch, processes indicate the end of their
processing period by calling sched_yield(). This modification to
the semantics of that system call makes some developers uneasy, though; it
is likely that the final patch will do something different. There may be a
new "I'm done for now" system call added for this purpose.
Peter also gave a talk in Dresden; his was mostly about why Linux does not
have a deadline
scheduler yet. The "what happens when a process exceeds its WCET" problem
was one of the reasons he gave. Calculating the
worst-case execution time is exceedingly difficult for any sort of
non-trivial program. As Peter puts it, researchers have spent their entire
lives trying to solve it. There are people working on automatically
deriving WCET from the source, but they are far from being able to do this
with real-world systems. So, for now, specification of the WCET comes down
to empirical observations and guesswork.
Another serious problem with EDF is that it works much better on
single-processor systems than on SMP systems. True EDF on a multiprocessor
system requires the maintenance of a global run queue, with all of the
scalability problems that entails. One solution is to partition SMP
systems, so that each CPU becomes an essentially independent scheduling domain;
the SCHED_EDF patch works this way. Partitioned systems have their own
problems, of course; the assignment of tasks to CPUs can be a pain, and it
is hard (or impossible) to get full utilization if tasks cannot move
Another problem with partitioning is that some scheduling problems simply
cannot be solved without occasional process migration. Imagine a two-CPU
system running three processes, each of which needs 60% of a single CPU's
time. The system clearly has the resources to run those three processes,
but not if it is unable to move processes between CPUs. So a partitioned
EDF scheduler needs to be able to migrate processes occasionally; the
SCHED_EDF developers have migration logic in the works, but it has
not yet been posted.
Yet another serious problem, according to Peter, is priority inversion. The
priority inheritance techniques used to solve priority inversion are tied
to priorities; it is not clear how to apply them to deadline schedulers.
But the problem is real: imagine a process acquiring an important lock,
then being preempted or forced out because it has exceeded its WCET. That
then block the execution of otherwise runnable processes with urgent
There are a few ways to approach this issue. Simplest, perhaps, is
deadline inheritance: lock owners inherit the earliest deadline in the
system for as long as they hold the lock. More sophisticated is bandwidth
inheritance; in this case, a lock owner which has exhausted its WCET will
receive a "donation" of time from the process(es) blocked on that lock. A
variant of that technique is proxy execution: blocked processes are left on the run
queue, but, when they "run," the lock owner runs in their place. Proxy
execution gets tricky in SMP environments when multiple processes are
blocked on the same lock; the result could be multiple CPUs trying to
proxy-execute the same process. The solution to that problem appears to be
to migrate blocked processes to the owner's CPU.
Proxy execution also runs into difficulties when the lock-owning process is
blocked for I/O. In that case, it cannot run as a proxy for the original
blocked task, which must then be taken off the run queue. That, in turn,
forces the creation of a "wait list" of processes which must be returned to
a runnable state when a different process (the lock owner) becomes
runnable. Needless to say, all this logic adds complexity and increases
The final problem, according to Peter, is POSIX, but it's an easy one to
solve. Since POSIX is silent on the topic of deadline schedulers, we can
do anything we want and life is good. He repeated that
SCHED_DEADLINE will probably be placed above SCHED_FIFO
in priority. There will be a new system call -
sched_setscheduler_ex() - to enable processes to request the
deadline scheduler and set the parameters accordingly; the
SCHED_EDF patch already implements that call. So many of the
pieces for deadline scheduling for Linux are in place, but a number of the
details are yet to be resolved.
The bottom line is that deadline schedulers in the real world are a
non-trivial problem - something that is true of real-world scheduling in
general. These problems should be solvable, though, and Linux should be
able to support a deadline scheduler at some point in the future. That
scheduler will probably make its first appearance in the realtime tree,
naturally, but it could eventually find users well beyond the realtime
community. Deadline schedulers are a fairly natural fit for periodic tasks
like the management of streaming media, which could
profitably make use of deadline scheduling to help eliminate jitter and
dropped-data problems. But that remains a little while in the future;
first, the code must be made ready for widespread use. And that, as we all
know, is a process which recognizes few deadlines.
to post comments)