
Adding periods to SCHED_DEADLINE

By Jonathan Corbet
July 20, 2010
The Linux scheduler, in both the mainline and realtime versions, provides a couple of scheduling classes for realtime tasks. These classes implement the classic POSIX priority-based semantics, wherein the highest-priority runnable task is guaranteed to have access to the CPU. While this scheduler works as advertised, priority-based scheduling has a number of problems and has not been the focus of realtime research for some time. Cool schedulers in this century are based on deadlines instead. Linux does not yet have a deadline scheduler, though there is one in the works. A recent discussion on implementing the full deadline model has shown, once again, just how complex it can be to get deadline scheduling right in the real world.

Deadline scheduling does away with priorities, replacing them with a three-parameter tuple: a worst-case execution time (or budget), a deadline, and a period. In essence, a process tells the scheduler that it will require up to a certain amount of CPU time (the budget) by the given deadline, and that the deadline optionally repeats with the given period. So, for example, a video-processing application might request 1ms of CPU time to process the next incoming frame, expected in 10ms, with a 33ms period thereafter for subsequent frames. Deadline scheduling is appealing because it allows the specification of a process's requirements in a natural way which is not affected by any other processes running in the system. There is also great value, though, in using the deadline parameters to guarantee that a process will be able to meet its deadline, and to reject any process which might cause a failure to keep those guarantees.
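
As an illustration of what those three numbers look like from user space, here is a minimal sketch of the video example; the struct sched_attr layout and the sched_setattr() call follow the interface that eventually appeared in mainline kernels, not the API of the patch set discussed below, so treat it purely as an example of the parameters:

    /*
     * Illustrative only: this uses the sched_attr/sched_setattr() interface
     * that eventually appeared in mainline kernels, not the API of the 2010
     * SCHED_DEADLINE patch set.
     */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        /* The three deadline parameters, all in nanoseconds. */
        uint64_t sched_runtime;     /* worst-case execution time (budget) */
        uint64_t sched_deadline;    /* relative deadline */
        uint64_t sched_period;      /* period */
    };

    int main(void)
    {
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy   = SCHED_DEADLINE;
        attr.sched_runtime  =  1 * 1000 * 1000ULL;  /* 1ms of CPU per frame */
        attr.sched_deadline = 10 * 1000 * 1000ULL;  /* needed within 10ms */
        attr.sched_period   = 33 * 1000 * 1000ULL;  /* a frame every 33ms */

        /* No glibc wrapper for this call; invoke the syscall directly. */
        if (syscall(SYS_sched_setattr, 0, &attr, 0))
            perror("sched_setattr");
        return 0;
    }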

The SCHED_DEADLINE scheduler being developed by Dario Faggioli appears to be on track for an eventual mainline merger, though nobody, yet, has been so foolish as to set a deadline for that particular task. This scheduler works, but, thus far, it takes a bit of a shortcut: in SCHED_DEADLINE, the deadline and the period are assumed to be the same. This simplification makes the "admission test" - the decision as to whether to accept a new SCHED_DEADLINE task - relatively easy. Each process gets a "bandwidth" parameter, being the ratio of the CPU budget to the deadline/period value. As long as the sum of the bandwidth values for all processes on a given CPU does not exceed 1.0, the scheduler can guarantee that the deadlines will be met.
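
The test itself is simple enough to sketch in a few lines of C; this is not the scheduler's code, just an illustration of the arithmetic:

    /*
     * Illustrative admission test, not the patch set's code: with the
     * deadline equal to the period, accept a new task only if the sum of
     * runtime/period ratios stays at or below 1.0. The kernel does this
     * with integer arithmetic; floating point keeps the sketch short.
     */
    struct dl_params {
        unsigned long long runtime_ns;   /* worst-case execution time */
        unsigned long long period_ns;    /* period (== deadline here) */
    };

    static int admit(const struct dl_params *tasks, int n,
                     const struct dl_params *newcomer)
    {
        double total = (double)newcomer->runtime_ns / newcomer->period_ns;
        int i;

        for (i = 0; i < n; i++)
            total += (double)tasks[i].runtime_ns / tasks[i].period_ns;

        return total <= 1.0;    /* reject anything that pushes past 1.0 */
    }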

As Dario recently brought up on linux-kernel, there are users who would like to be able to specify separate deadline and period values. Adjusting the scheduler to implement those semantics is not particularly hard. Coming up with an admission test which ensures that deadlines can still be met is rather harder, though. Once the period and the deadline are separated from each other, it becomes possible for processes to miss their deadlines even if the total bandwidth of the CPU has not been oversubscribed.

To see how this might come about, consider an example posted by Dario. Imagine a process which needs 4ms of CPU time with a period of 10ms and a deadline of 5ms. A timeline of how that process might be scheduled could look like this:

[Scheduling timeline]

Here, the scheduler is able to run the process within its deadline; indeed, there is 1ms of time to spare. Now, though, if a second process comes in with the same set of requirements, the results will not be as good:

[Scheduling timeline]

Despite the fact that 20% of the available CPU time remains unused, process P2 will miss its deadline by 3ms. In a hard realtime situation, that tardiness could prove fatal. Clearly, the scheduler should reject P2 here. The trouble is that detecting this kind of problem is not easy, especially if the scheduler is (as seems reasonable) expected to leave some CPU time for applications and not use it all performing complex admission calculations. For this reason, admission decision algorithms are currently an area of considerable research interest in the academic realtime community. See this paper by Alejandro Masrur et al. [PDF] or this one by Marko Bertogna [PDF] for examples of how involved it can get.
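
For a flavor of what a more thorough test involves, here is an illustrative sketch (with made-up names, nothing from the patch set) of the classic "processor demand" check for earliest-deadline-first scheduling on a single CPU. It is precise for the single-CPU case, but it has to examine many points in time - exactly the kind of expense a scheduler would rather avoid - and it says nothing about the multiprocessor problem:

    #include <stdio.h>

    struct task {
        long wcet;        /* worst-case execution time, in ms */
        long deadline;    /* relative deadline, in ms */
        long period;      /* period, in ms */
    };

    /* CPU time that one task must receive within any window of length t. */
    static long demand(const struct task *tsk, long t)
    {
        if (t < tsk->deadline)
            return 0;
        return ((t - tsk->deadline) / tsk->period + 1) * tsk->wcet;
    }

    /*
     * Brute force: check every millisecond up to a horizon (the task set's
     * hyperperiod will do); real tests only check the absolute deadlines.
     */
    static int edf_feasible(const struct task *tasks, int n, long horizon)
    {
        long t, total;
        int i;

        for (t = 1; t <= horizon; t++) {
            total = 0;
            for (i = 0; i < n; i++)
                total += demand(&tasks[i], t);
            if (total > t)
                return 0;    /* more work due in the window than time in it */
        }
        return 1;
    }

    int main(void)
    {
        /* The example above: 4ms budget, 5ms deadline, 10ms period, twice. */
        struct task example[2] = {
            { .wcet = 4, .deadline = 5, .period = 10 },
            { .wcet = 4, .deadline = 5, .period = 10 },
        };

        /* Prints 0: by t = 5ms, 8ms of work is due, so the set is rejected. */
        printf("feasible: %d\n", edf_feasible(example, 2, 10));
        return 0;
    }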

There are a couple of ways to simplify the problem. One of those would be to change the bandwidth calculation to be the ratio of the CPU budget to the relative deadline time (rather than to the period). For the example processes shown above, each has a bandwidth of 0.8 using this formula; the scheduler, on seeing that the second process would bump the total to 1.6, could then decide to reject it. Using this calculation, the scheduler can, once again, guarantee that deadlines will be met, but at a cost: this formula will cause the rejection of processes that, in reality, could be scheduled without trouble. It is an overly pessimistic heuristic which will prevent full utilization of the available resources.
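
A sketch of that deadline-based calculation - often called "density" in the realtime literature - is equally short; again, the code is purely illustrative:

    /*
     * Illustrative sketch of the pessimistic test: compute each task's
     * bandwidth as budget over relative deadline (its "density") and
     * refuse anything that pushes the sum past 1.0.
     */
    struct task {
        long wcet;        /* worst-case execution time, in ms */
        long deadline;    /* relative deadline, in ms */
        long period;      /* period, in ms (ignored by this test) */
    };

    static int density_admit(const struct task *tasks, int n)
    {
        double total = 0.0;
        int i;

        for (i = 0; i < n; i++)
            total += (double)tasks[i].wcet / tasks[i].deadline;

        /* Safe, but turns away some task sets that would actually work. */
        return total <= 1.0;
    }

For the two example tasks this yields 0.8 + 0.8 = 1.6, so the second one is turned away; in that case it is the right answer. The pessimism shows up elsewhere: a task with a 1ms budget, a 1ms deadline, and a 2ms period, paired with a task with a 1ms budget, a 2ms deadline, and a 2ms period, has a total density of 1.5 and would be rejected, even though earliest-deadline-first scheduling handles that pair without ever missing a deadline.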

An alternative, proposed by Dario, would be to stick with the period-based bandwidth values for admission decisions and to take the risk that some deadlines might not be met. In this case, user-space developers would be responsible for ensuring that the full set of tasks on the system can be scheduled. They might take some comfort in the fact that, since the overall bandwidth of the CPU would still not be oversubscribed, the amount by which a deadline could be missed would be deterministically bounded.

That idea did not survive its encounter with Peter Zijlstra, who thinks it ruins everything that a deadline scheduler is supposed to provide:

The whole reason SCHED_FIFO and friends suck so much is that they don't provide any kind of isolation, and thus, as an Operating-System abstraction they're an utter failure. If you take out admission control you end up with a similar situation.

Peter's suggestion, instead, was to split deadline scheduling logically into two different schedulers. The hard realtime scheduler would retain the highest priority, and would require that the deadline and period values be the same. If, at some future time, a suitable admission controller is developed then that requirement could be relaxed as long as deadlines could still be guaranteed.

Below the hard realtime scheduler would be a soft realtime scheduler which would have access to (most of) the CPU bandwidth left unused by the hard scheduler. That scheduler could accept processes using period-based bandwidth values with the explicit understanding that deadlines might be missed by bounded amounts. Soft realtime is entirely good enough for a great many applications, so there is no real reason not to provide it as long as hard realtime is not adversely affected.

So that is probably how things will go, though the real shape of the solution will not be seen until Dario posts the next version of the SCHED_DEADLINE patch. Even after this problem is solved, though, deadline scheduling has a number of other challenges to overcome, with good multi-core performance being near the top of the list. So, while Linux will almost certainly have a deadline scheduler at some point, it's still hard to say just when that might be.

(Readers who are interested in the intersection of academic and practical realtime work might want to peruse the recently-released proceedings [PDF] from the OSPERT 2010 conference, held in Brussels in early July.)

Index entries for this article:
Kernel: Realtime/Deadline scheduling
Kernel: Scheduler/Deadline scheduling



Adding periods to SCHED_DEADLINE

Posted Jul 20, 2010 22:04 UTC (Tue) by xxiao (guest, #9631) [Link] (4 responses)

How about a new scheduler option... CFS is hard to tune and has really bad performance on my multithreaded embedded apps (>10% CPU usage increase, etc.); the same app ran perfectly well with the O(1) scheduler for years.

Adding periods to SCHED_DEADLINE

Posted Jul 21, 2010 7:46 UTC (Wed) by abacus (guest, #49001) [Link] (2 responses)

Please report this as a kernel bug so that Ingo Molnar can have a look at it.

Adding periods to SCHED_DEADLINE

Posted Jul 21, 2010 16:18 UTC (Wed) by cesarb (subscriber, #6266) [Link] (1 responses)

It would be cool if people could "record" a trace of their scheduling so that the sched maintainers could "replay" it and find and fix bugs without having to find a reproducible test case (i.e. even if a problem happened just once, a trace of it could be used to find the cause).

I do not know if this already exists; I have not looked.

Adding periods to SCHED_DEADLINE

Posted Jul 21, 2010 19:31 UTC (Wed) by sbohrer (guest, #61058) [Link]

I believe "perf sched record" and "perf sched replay" will do exactly that.

Adding periods to SCHED_DEADLINE

Posted Jul 21, 2010 8:01 UTC (Wed) by viiru (subscriber, #53129) [Link]

> How about a new scheduler option... CFS is hard to tune and has really bad
> performance on my multithreaded embedded apps (>10% CPU usage increase,
> etc.); the same app ran perfectly well with the O(1) scheduler for years.

Google is having the same problem with some of their apps, which is why they have ported the old scheduler over to the 2.6 series. I'm not sure if they have a kernel tree with that publicly available, though.

Adding periods to SCHED_DEADLINE

Posted Jul 21, 2010 15:09 UTC (Wed) by ms (subscriber, #41272) [Link] (9 responses)

I'm clearly missing something - I don't understand why there's a problem with P2. The only issue that I can see is that the scheduler should tell it that it can meet its demands, but only if P2 is happy to wait for 2ms initially - after that the interleaving with P1 works out fine for both of them.

Is this initial scheduling problem the only issue with P2 here or am I missing something else?

Adding periods to SCHED_DEADLINE

Posted Jul 21, 2010 15:10 UTC (Wed) by ms (subscriber, #41272) [Link]

(Err, I meant "wait for 3ms initially", not 2ms)

Adding periods to SCHED_DEADLINE

Posted Jul 21, 2010 16:03 UTC (Wed) by farnz (subscriber, #17727) [Link] (7 responses)

That's the entire problem: on the kernel side, spotting that P2's request for CPU time cannot be met as-is, but could be met if P2 delayed its initial deadline, is the hard part. It's easy to tell that there is enough CPU time in total in the 10ms period, but not (in general) easy to tell whether P1 and P2 can both meet their first deadline. If P2 could cope with having its deadline delayed by 2ms, it would say so - IOW, it would ask for its first deadline 7ms from now, not 5ms from now.

Note that while this example is rather simple (we have two 4ms processes to fit into 5ms, which isn't going to work), you've also got to cope with more complex examples - e.g. P1 has a deadline of 5ms from now, 4ms of work to do, and a period of 10ms (so deadlines at 5ms from now, 15ms from now, 25ms from now etc), while P2 has a deadline of 3ms from now, 1ms of work to do, and a period of 16ms (so deadlines at 3ms, 19ms, 35ms etc). Can P1 and P2 co-exist on the same CPU, or does P2 have to be told that its demands can't be met?
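
For a specific pair like this one, the numbers can be run through the same demand calculation sketched in the article; doing so suggests that P1 and P2 do fit together, just barely, with the tightest squeeze coming when both jobs arrive at once (P2's 1ms runs first, and P1's 4ms finishes exactly at the 5ms mark). Illustrative code only:

    #include <stdio.h>

    struct task { long wcet, deadline, period; };    /* all in ms */

    /* CPU time that must complete within any window of length "win". */
    static long demand(const struct task *t, long win)
    {
        if (win < t->deadline)
            return 0;
        return ((win - t->deadline) / t->period + 1) * t->wcet;
    }

    int main(void)
    {
        struct task p1 = { 4, 5, 10 };
        struct task p2 = { 1, 3, 16 };
        long win;
        int ok = 1;

        /* Check every window up to the 80ms hyperperiod of the pair. */
        for (win = 1; win <= 80; win++)
            if (demand(&p1, win) + demand(&p2, win) > win)
                ok = 0;

        printf("P1 and P2 can co-exist: %s\n", ok ? "yes" : "no");
        return 0;
    }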

Adding periods to SCHED_DEADLINE

Posted Jul 21, 2010 17:02 UTC (Wed) by iabervon (subscriber, #722) [Link] (6 responses)

Actually, it should be trivial to tell that, before the first deadline, we already have 4ms of work and can't allocate another 4ms of work. I think the real issue is that, if you've got two tasks with slightly different periods, they may be fine for the first deadlines, but slowly drift into phase, at which point the processor is oversubscribed until they drift back out of phase.

Say you're doing video processing on a stream at 60 Hz and one at 50 Hz. If you've got tight windows between when the input is available and when the output is required (say 2ms of work out of 3ms of real time), you may be fine for the first 100ms and then have to do both frames at the same time. Or the streams might be staggered such that they never conflict.

Adding periods to SCHED_DEADLINE

Posted Jul 21, 2010 20:34 UTC (Wed) by blitzkrieg3 (guest, #57873) [Link] (5 responses)

I agree wholeheartedly. The example provided was pretty bad. Clearly what the scheduler does in the case above is just stagger the CPU time. I don't think any realtime guys consider it unreasonable for a process to have a startup cost.

The example you provided was exactly what I thought of when I started reading, sort of like how blinkers will phase in and out of sync while sitting in the turning lane.

Example

Posted Jul 21, 2010 20:43 UTC (Wed) by corbet (editor, #1) [Link] (4 responses)

I'll confess to being totally confused by this comment. The example was (deliberately) simple, but I think it's clear and valid: the scheduler simply cannot meet the deadlines in that case. I don't see how that can be fixed by "staggering the CPU time." The scheduler can't change the deadline... What am I missing?

Example

Posted Jul 21, 2010 21:58 UTC (Wed) by Shewmaker (guest, #1126) [Link] (2 responses)

I think their confusion centers around the implications of allowing separate deadline and period values. If the commenters read Dario's email, then they would have seen his reference to the "sporadic task model" which may have helped.

Perhaps another way to describe your example is that it is specifically constructed to show that when a scheduler tries to allow admission of a task at any time, and not just on period boundaries, then it gets harder. Yes, the scheduler could delay the admission of the new task until the deadline of the existing task, but then you have a periodic scheduler and not one that handles sporadic job arrivals.

I hope that helps.

Example

Posted Jul 21, 2010 22:37 UTC (Wed) by blitzkrieg3 (guest, #57873) [Link] (1 responses)

> I think their confusion centers around the implications of allowing
> separate deadline and period values. If the commenters read Dario's
> email, then they would have seen his reference to the "sporadic task
> model" which may have helped.

Yeah exactly. I went back and read the email after seeing the response, and the key point is:

> one of the tasks can miss its deadline by 3 time units in the worst case
> (if, and if yes which one, depends on the order with which they arrive).

Obviously if we have the good fortune of having P2 arrive after 5 time units, we're golden. Before reading the email, I would have assumed that if you're just starting an executable, rather than acting on an event such as a udp packet or something, then you can sacrifice 5 time units.

I've illustrated exactly what I mean: http://imgur.com/CKMcp.png

For the purposes of explaining how this works and what these values mean, it might make more sense to use a simple example that isn't schedulable no matter what order the tasks arrive in, like:

T_1 = (2, 4, 3)
T_2 = (2, 5, 3)

Even if T_2 can be scheduled after 2 time units, using the naive algorithm that I illustrated in the picture, it's clear that they're both going to wake up at the same time after 10 units, at which time you have 3 units to schedule 4 slots of cpu time.

Example

Posted Jul 23, 2010 22:33 UTC (Fri) by giraffedata (guest, #1954) [Link]

I think where people really get lost in these explanations is that they appear to describe a system where the scheduler gets to decide exactly which process to give CPU time to, and when, or where a process requests a precise CPU schedule for itself for the rest of time.

But in reality, the scheduler must wait for a process to request each chunk of time, and even the process can't know exactly when it will do that.

I presume the parameters actually say, "I am going to be requesting CPU time in the future. There will be at least 10 ms between requests. Each one will be for less than 4 ms, but whatever it is, I'll need it all within 5 ms of when I ask for it."

Based on that, the scheduler is supposed to say, "As long as you stick to those constraints, I guarantee I will fill all your requests" or "I can't guarantee that, so don't bother trying to run."

I really don't get the one statement about it being OK to miss the deadline by a bounded amount. If the process is OK with getting its 4ms within 10 ms, why did it state a deadline of 5 ms?

Example

Posted Jul 22, 2010 19:57 UTC (Thu) by droundy (subscriber, #4559) [Link]

I think the problem with the example was just that, in the case where all the periods are the same (as in the example), the kernel need only look at the first period to decide whether it can accept the request, which wouldn't be particularly hard to figure out. But when there are different periods involved, it could be much harder...

Adding periods to SCHED_DEADLINE

Posted Jul 22, 2010 23:27 UTC (Thu) by ejr (subscriber, #51652) [Link]

Is it reasonable to assume these tasks will be in the scheduler for a "long" period of time? If so, perhaps the best idea is the simplest possible in-kernel schedule builder coupled with an optional user-space schedule builder. Using a more complicated scheduler (and admission decider) adds a large initial cost, but that *may* be acceptable if the initial cost is much smaller than the total span of time the task will be in the scheduler. Statically analyzed systems and those with only one or two such tasks may be content with the in-kernel simple solution.

Adding periods to SCHED_DEADLINE

Posted Jul 23, 2010 0:36 UTC (Fri) by gdt (subscriber, #6284) [Link]

From an application programming point of view I don't want to have to profile my application for each target system. I'd very much like to run a few AV frames, BGP Hellos or whatever and then say "deadline schedule me that". It would be nice if the APIs were better thought out so that this could be simple, but they aren't. The current APIs really encourage reserving the worst-case behaviour of the algorithm on the worst-case target (not really appropriate for soft realtime uses such as video decode or protocol handling, where we can let the occasional extraordinary event pass through to the keeper), so we can end up asking for two or three orders of magnitude too many resources.

Adding periods to SCHED_DEADLINE

Posted Jul 23, 2010 10:23 UTC (Fri) by mjthayer (guest, #39183) [Link]

If the granularity of the periods is not too fine (milliseconds should be alright) then a brute-force check - running through a few seconds of all the real-time tasks on the system along with the hypothetical newly scheduled task - might do the trick. I could also imagine the task wanting to be scheduled giving a deadline for when it needs a definite answer as to whether or not scheduling it is possible, with the kernel successively trying increasingly "tight" methods in that time (accounting the time to the task) to try to schedule it. Bearing in mind, of course, that the checks might need to be restarted if something else was successfully scheduled - or descheduled - in the meantime.
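
A minimal sketch of what such a brute-force check might look like, assuming strictly periodic tasks that all start at the same instant and millisecond granularity - illustrative only, with made-up names, and nothing like what the kernel would actually do:

    #include <stdio.h>

    #define HORIZON 10    /* ms; use the task set's hyperperiod */

    struct task {
        long wcet, deadline, period;     /* parameters, in ms */
        long remaining, abs_deadline;    /* state of the current job */
    };

    /*
     * Simulate earliest-deadline-first scheduling millisecond by
     * millisecond. Returns 1 if no deadline is missed, 0 otherwise.
     * Assumes deadlines no later than periods and that every task
     * releases its first job at t = 0.
     */
    static int simulate_edf(struct task *tasks, int n)
    {
        long t;
        int i, run;

        for (t = 0; t < HORIZON; t++) {
            run = -1;
            for (i = 0; i < n; i++) {
                struct task *tsk = &tasks[i];

                /* Release a new job at each period boundary. */
                if (t % tsk->period == 0) {
                    if (tsk->remaining > 0)
                        return 0;    /* previous job never finished */
                    tsk->remaining = tsk->wcet;
                    tsk->abs_deadline = t + tsk->deadline;
                }
                /* Work still left when the deadline arrives is a miss. */
                if (tsk->remaining > 0 && t >= tsk->abs_deadline)
                    return 0;
                /* Pick the ready job with the earliest deadline. */
                if (tsk->remaining > 0 &&
                    (run < 0 || tsk->abs_deadline < tasks[run].abs_deadline))
                    run = i;
            }
            if (run >= 0)
                tasks[run].remaining--;    /* give it this millisecond */
        }
        return 1;
    }

    int main(void)
    {
        /* The article's example; prints 0, because the second task still
         * has 3ms of work left when its 5ms deadline arrives. */
        struct task set[2] = {
            { .wcet = 4, .deadline = 5, .period = 10 },
            { .wcet = 4, .deadline = 5, .period = 10 },
        };

        printf("schedulable: %d\n", simulate_edf(set, 2));
        return 0;
    }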

Adding periods to SCHED_DEADLINE

Posted Jul 24, 2010 1:42 UTC (Sat) by mcmechanjw (subscriber, #38173) [Link]

When looking at the example, I believe both the current bandwidth and the future bandwidth need to be checked separately. I can envision the scheduler having 100% usage from now into the future and yet having a very low future bandwidth.

For example a current SCHED_DEADLINE process has claimed 100% until its deadline in 10ms but won't reschedule for 1000ms afterwards.

I can also see having 100% free for a while and 100% used later.
E.g. the current process has completed and won't reschedule for 1000ms but the total reserved is 100% of the CPU then...

Looking at it, it seems it does not matter whether the period and deadline are separate or not: the current schedule must be tested against the first deadline, regardless of when that will be, and the periodic bandwidth must be tested separately against future bandwidth reservations.

The simplest method that comes to mind for the current schedule is to just walk the schedule forward and see whether enough free milliseconds accumulate before the proposed deadline is hit...
I am fairly sure most heuristics will have bad to very bad corner cases, either not offering CPU time when it is available or oversubscribing.

Adding periods to SCHED_DEADLINE

Posted Jul 29, 2010 6:04 UTC (Thu) by eternaleye (guest, #67051) [Link] (1 responses)

While reading up on this I came across this mail: http://lkml.org/lkml/2008/10/31/52 (though I forget exactly how), which made me curious what these 'pfair' algorithms were. I am posting here a link to a paper which explains the entire pfair/erfair family of scheduling algorithms in some depth, as it's really fascinating - it also makes me wonder why Dario is implementing EDF (Earliest Deadline First), despite its poor SMP properties, rather than, say, the ERFair variant of PD2, which not only schedules more effectively but also has a much simpler admission test. (I do admit that 'simplicity of concept' may well be a valid argument here, but simplicity at the cost of usefulness isn't likely to go over well.)

Link:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1....

Adding periods to SCHED_DEADLINE

Posted Aug 2, 2010 14:22 UTC (Mon) by Arakageeta (guest, #60609) [Link]

Unfortunately, preemption and CPU migration overheads limit pfair performance. Global earliest-deadline-first scheduling is, practically speaking, the best algorithm for soft realtime scheduling.

BTW, the research group that published the paper you linked also maintains a Linux testbed (Litmus^RT) to test real-time scheduling (pfair, various deadline schedulers) and synchronization algorithms: http://www.cs.unc.edu/~anderson/litmus-rt/


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds