SCHED_FIFO and realtime throttling
One of the many features merged back in the 2.6.25 cycle was realtime group scheduling. As a way of balancing CPU usage between competing groups of processes, each of which can be running realtime tasks, the group scheduler introduced the concept of "realtime bandwidth," or rt_bandwidth. This bandwidth consists of a pair of values: a CPU time accounting period, and the amount of CPU that the group is allowed to use - at realtime priority - during that period. Once a SCHED_FIFO task causes a group to exceed its rt_bandwidth, it will be pushed out of the processor whether it wants to go or not.
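The period/allowance pair can be pictured with a small accounting model. What follows is only a toy sketch: the struct and function names (rt_budget, rt_budget_charge, rt_budget_refill) are invented for illustration and are not the kernel's data structures or code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a realtime bandwidth budget; NOT the kernel's structure. */
struct rt_budget {
    uint64_t period_us;   /* length of the accounting period */
    uint64_t runtime_us;  /* realtime CPU time allowed per period */
    uint64_t used_us;     /* realtime CPU time consumed in this period */
};

/* Start a new accounting period: the group gets its full allowance back. */
static void rt_budget_refill(struct rt_budget *b)
{
    b->used_us = 0;
}

/*
 * Charge ran_us microseconds of realtime execution to the group.
 * Returns true once the allowance is used up, meaning the group's
 * realtime tasks would have to wait for the next period.
 */
static bool rt_budget_charge(struct rt_budget *b, uint64_t ran_us)
{
    assert(b->runtime_us <= b->period_us);
    b->used_us += ran_us;
    return b->used_us >= b->runtime_us;
}
```

With a one-second period and a 950000-microsecond allowance, charging 900000 microseconds leaves the group runnable, while a further 50000 exhausts the budget until the next refill.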
This feature is required if one wants to allow multiple groups to split a system's realtime processing power. But it also turns out to have its uses in the default situation, where all processes on the system are contained within a single, default group. Kernels shipped since 2.6.25 have set the rt_bandwidth value for the default group to be 0.95 seconds out of every 1.0 seconds. In other words, the group scheduler is configured, by default, to reserve 5% of the CPU for tasks which are not running at realtime priority.
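On a running system the two values can be inspected through the sched_rt_period_us and sched_rt_runtime_us files under /proc/sys/kernel, where a runtime of -1 means "no limit." The sketch below reads them and works out the reserved share; the helper names (read_long, reserved_percent, print_rt_throttle) are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer from a file; returns 0 on success, -1 on failure. */
static int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/*
 * Percentage of each period left over for non-realtime tasks.
 * A negative runtime means "unlimited", so nothing is reserved.
 */
static double reserved_percent(long runtime_us, long period_us)
{
    assert(period_us > 0);
    if (runtime_us < 0)
        return 0.0;
    return 100.0 * (double)(period_us - runtime_us) / (double)period_us;
}

/* Print the current setting; returns -1 if the files cannot be read. */
static int print_rt_throttle(void)
{
    long period, runtime;

    if (read_long("/proc/sys/kernel/sched_rt_period_us", &period) ||
        read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime))
        return -1;
    printf("period=%ldus runtime=%ldus reserved=%.1f%%\n",
           period, runtime, reserved_percent(runtime, period));
    return 0;
}
```

For the default pair described above (950000 out of 1000000 microseconds), reserved_percent() works out to 5%.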
It seems that nobody really noticed this feature until mid-August, when Peter Zijlstra posted a patch which set the default value to "unlimited." At that point it became clear that some developers have a different idea about how this kind of policy should be set than others do.
Ingo Molnar disagreed with the patch, saying:
Ingo's suggestion was to raise the limit to ten seconds of CPU time. As he (and others) pointed out: any SCHED_FIFO application which needs to monopolize the CPU for that long has serious problems and needs to be fixed.
There are real problems associated with letting a SCHED_FIFO process run indefinitely. Should that process never get around to relinquishing the CPU, the system will simply hang forevermore; there is no possibility of the administrator slipping in with a kill command. This process will also block important things like kernel threads; even if it releases the processor after ten seconds, it will have seriously degraded the operation of the rest of the system. Even on a multiprocessor system, there will typically be processes bound to the CPU where the SCHED_FIFO process is running; there will be no way to recover those processes without breaking their CPU affinity, which is not a step anybody wants to take.
So, it is argued, the rt_bandwidth limit is an important safety breaker. With it in place, even a runaway SCHED_FIFO task cannot prevent the administrator from (eventually) regaining control of the system and figuring out what is going on. In exchange for this safety, this feature only robs SCHED_FIFO tasks of a small amount of CPU time - the equivalent of running the application on a slightly weaker processor.
Those opposed to the default rt_bandwidth limit cite two main points: it is a user-space API change (which also breaks POSIX compliance) and represents an imposition of policy by the kernel. On the first point, Nick Piggin worries that this change could lead to broken applications:
Or a realtime app could definitely use the CPU adaptively up to 100% but still unable to tolerate an unexpected preemption.
What could make the problem worse is that the throttle might not cut in during testing; it could, instead, wait until something unexpected comes up in a production system. Needless to say, that is a prospect which can prove scary for people who create and deploy this kind of system.
The "policy in the kernel" argument was mostly shot down by Linus, who pointed out that there's lots of policy in the kernel, especially when it comes to the default settings of tunable parameters. He says:
Linus carefully avoided taking a position on which setting makes sense for the most people here. One could certainly argue that making systems resistant to being taken over by runaway realtime processes is the more sensible setting, especially considering that there is a certain amount of interest in running scary applications like PulseAudio with realtime priority. On the other hand, one can also make the case that conforming to the standard (and expected) SCHED_FIFO semantics is the only option which makes sense at all.
There has been some talk of creating a new realtime scheduling class with throttling being explicitly part of its semantics; this class could, with a suitably low limit, even be made available to unprivileged processes.
Meanwhile, as of this writing, the 0.95-second limit - the one option that nobody seems to like - remains unchanged. It will almost certainly be raised; how much is something we'll have to wait to see.
| Index entries for this article | |
|---|---|
| Kernel | Realtime |
| Kernel | Scheduler/Realtime |
Posted Sep 2, 2008 0:13 UTC (Tue)
by jengelh (guest, #33263)
[Link] (2 responses)
Posted Sep 2, 2008 6:42 UTC (Tue)
by simlo (guest, #10866)
[Link] (1 responses)
Posted Sep 3, 2008 5:48 UTC (Wed)
by jzbiciak (guest, #5246)
[Link]
I guess it's only a minor point. I don't know if there are other housekeeping tasks the kernel can quickly dispatch of (CPU load balancing, for instance) if it periodically reschedules a SCHED_RR task, even if that prio level has only one task. Clearly, a running SCHED_RR task is the highest priority thing the CPU can and should run at that point.
Posted Sep 2, 2008 7:01 UTC (Tue)
by abacus (guest, #49001)
[Link]
Posted Sep 2, 2008 8:06 UTC (Tue)
by mjthayer (guest, #39183)
[Link] (12 responses)
Posted Sep 2, 2008 8:43 UTC (Tue)
by rvfh (guest, #31018)
[Link] (11 responses)
I think the default should be 'illimited' indeed, and changeable at run time via proc or sys, so everyone can make their choice.
Posted Sep 2, 2008 13:46 UTC (Tue)
by jamesh (guest, #1159)
[Link]
Posted Sep 2, 2008 15:48 UTC (Tue)
by IkeTo (subscriber, #2122)
[Link] (9 responses)
Should polling hardware be the task of the kernel, especially if it is so sensitive to timing?
> I think the default should be 'illimited' indeed, and changeable at run
If "everyone can make their choice" anyway, why not have the default be limited to a sensible value, so that the choice to betray themselves is made only consciously?
Be reminded that people (distributions) do use SCHED_FIFO, e.g., to play movies in a way that hopefully keeps the video in sync with the audio. It isn't nice if such programs "normally" put their users at risk of a hung computer.
Posted Sep 2, 2008 19:38 UTC (Tue)
by k8to (guest, #15413)
[Link] (8 responses)
Let's say I have the software to run a robot implemented in Linux. My route plotting software is implemented in userland, and I've carefully written it so that it has a worst-case time of 50ms to reroute the robot in an emergency, so it will never collide with a human.
Oops, my cpu got yanked away from me in the middle of that and now I ran someone over.
Posted Sep 2, 2008 21:46 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
Besides, if you are building your super-fast robot control system, you have the ability to turn the knob and give yourself absolute control of the CPU.
Posted Sep 2, 2008 22:13 UTC (Tue)
by nix (subscriber, #2304)
[Link]
I'd also say that this is an extremely rare case: most Linux users are not
Posted Sep 3, 2008 2:26 UTC (Wed)
by walken (subscriber, #7089)
[Link]
Posted Sep 4, 2008 8:26 UTC (Thu)
by ekj (guest, #1524)
[Link] (2 responses)
In other words, they claim that most people's machines are probably better off, in most cases, letting a gone-crazy SCHED_FIFO process hog only 95% of the CPU rather than 100%, because that makes it possible to, for example, open a shell and kill the bugger.
The existence of special rare cases where you really want 100% is a non-argument: simply *turn* the provided knob up to 100 for those cases.
You're in essence saying that 99% of users should turn the knob down to 95 - so that the last 1% won't have to turn it up to 100. Which makes no sense.
Also, your example is extremely contrived. Indeed it serves as a good example of just how uncommon such situations are likely to be.
Posted Sep 5, 2008 7:26 UTC (Fri)
by rvfh (guest, #31018)
[Link] (1 responses)
But the issue here is not just 'what is best for most people' but also 'what's expected from the spec'. Breaking the spec is exactly what we reproach some companies for, and we need to think carefully before doing it.
I guess that's why Linus himself did not take sides, and that's why I don't really know which is best either, but reading the comments here makes me think that I would rather have the 90% myself :-)
Posted Sep 5, 2008 12:10 UTC (Fri)
by ekj (guest, #1524)
[Link]
The overwhelming majority of machines should be configured so that a bug in a single SCHED_FIFO priority task cannot completely lock the machine up hard.
That makes it a sensible default in my book.
Okay okay, so you can argue the KERNEL should default to 100%, whereupon 99% of the distributions out there should default to turning the knob to 95% or 90% or whatever.
But -someone-, earlier in the chain than the end-user, should change their default. It's not reasonable (as is the case today!) to expect Joe User to know that he needs to turn the knob to avoid hard lockups on stumbling over a PulseAudio bug or whatever.
I still think the extremely few projects that DO need true 100% can turn the knob themselves; are you aware of even a single real project of that kind? (As opposed to theoretical constructs like the one in the grandparent.)
Posted Sep 4, 2008 11:19 UTC (Thu)
by mb (subscriber, #50428)
[Link] (1 responses)
Posted Sep 8, 2008 0:53 UTC (Mon)
by vonbrand (subscriber, #4458)
[Link]
Part of the whole idea is getting rid of RTLinux (i.e., integrating that functionality into the vanilla kernel source).
Posted Sep 2, 2008 8:43 UTC (Tue)
by ballombe (subscriber, #9523)
[Link] (5 responses)
Posted Sep 2, 2008 22:07 UTC (Tue)
by nix (subscriber, #2304)
[Link] (4 responses)
And that's not good enough. Regardless of what the default is, I want some
(This is particularly important now that RT-prio stuff can run as
Posted Sep 3, 2008 2:32 UTC (Wed)
by walken (subscriber, #7089)
[Link] (2 responses)
The process running 100% of the time on a slow machine will see time progress in small increments (say, if it calls gettimeofday in a loop, it'll see it progress a few microseconds at a time) while the process running 95% on a fast machine will see a large time increment once in a while, when it is descheduled. So, there definitely is a difference, for realtime processes.
Still, I do not understand why one would try to yank 100% of the CPU on a POSIX system and expect the underlying OS to work fine...
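The difference described above can be observed by sampling a clock in a tight loop and looking for the largest jump between consecutive readings. This is a minimal sketch that uses clock_gettime(CLOCK_MONOTONIC) in place of the gettimeofday() call mentioned in the comment; the function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Largest difference between consecutive samples; 0 if fewer than two. */
static uint64_t max_gap_ns(const uint64_t *samples, size_t n)
{
    uint64_t worst = 0;

    for (size_t i = 1; i < n; i++) {
        assert(samples[i] >= samples[i - 1]);  /* clock never goes back */
        uint64_t gap = samples[i] - samples[i - 1];
        if (gap > worst)
            worst = gap;
    }
    return worst;
}

/* Read the clock n times back to back and report the worst gap seen. */
static uint64_t measure_worst_gap_ns(uint64_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = now_ns();
    return max_gap_ns(buf, n);
}
```

A task that is never descheduled should see gaps that stay small and fairly uniform; a task that is throttled or preempted shows up as one sample pair with a much larger gap.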
Posted Sep 4, 2008 11:24 UTC (Thu)
by mb (subscriber, #50428)
[Link] (1 responses)
Posted Sep 5, 2008 10:27 UTC (Fri)
by dmarkh (guest, #40670)
[Link]
I will admit, though, that for a non-SMP box this is probably a good thing as a default setting.
Posted Sep 3, 2008 8:40 UTC (Wed)
by njs (subscriber, #40338)
[Link]
It seems like there are two use cases, traditional real-time systems with more-or-less dedicated systems running custom programs, and the emerging desktop real-time stuff. A process will generally know which sort of process it is? It further seems to me that the former case wants weak emergency-only throttling for all tasks as a development aid, while the latter wants much stronger always-on throttling for unprivileged tasks in particular. (I don't see why unprivileged realtime tasks should have access to "most but not all" of the CPU... the ideal limit would be "you get exactly as much time as the fair staircase scheduler would give you, but since you flipped the realtime bit I will guarantee that you receive it exactly when you ask for it".) But they've made the default limit ~10s, which seems way higher than desired for ordinary desktop use, *and* annoys the hard-core real-time folks... (who aren't annoyed that they'll run uniformly 5% slower, they're scared that the 5% might come in the form of being abruptly descheduled just before a deadline. But it's still hard to imagine a task actually running 10s straight without sleeping, which is why the discussion is all in terms of API guarantees and abstract junk like that.)
Of course, in practice what will probably happen is that we won't end up with anyone's idea of the Platonic real-time API, kernels for both sorts of systems will end up tweaked away from the default anyway, desktop systems will use more elaborate schemes with different users segregated into different control groups with different limits, etc., and things will work out.
Posted Sep 5, 2008 9:49 UTC (Fri)
by sdalley (subscriber, #18550)
[Link] (1 responses)
The normal state of a realtime thread is sleeping. Waiting on an event of some sort, a timer, a mutex, a file event, a message/semaphore from an interrupt routine etc. In such a state, the throttle accounting is not kicking in. When the event occurs, the thread is immediately scheduled according to priority, does what it has to, and goes back to sleep.
The realtime thread should absolutely not be busy-waiting in a screaming poll loop or performing compute-bound processing for hundreds of milliseconds at a time. You pass the data to non-realtime tasks for that.
If a well-designed program is running into RT-throttling problems, it's time for a design rethink or a faster processor. IMHO it's quite proper in such cases that the throttling kicks in for the good of the system as a whole.
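The sleep-then-work pattern described above might look roughly like the following for a timer-driven thread. It is a sketch under assumptions: the helper names are invented, the event source is an absolute-deadline sleep via clock_nanosleep(), and work() stands in for whatever short job the thread performs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance an absolute deadline by period_ns, keeping tv_nsec normalized. */
static void timespec_add_ns(struct timespec *t, long period_ns)
{
    assert(period_ns >= 0);
    t->tv_sec += period_ns / NSEC_PER_SEC;
    t->tv_nsec += period_ns % NSEC_PER_SEC;
    if (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec += 1;
    }
}

/*
 * Call work() once per period for the given number of cycles, sleeping
 * until each absolute deadline so that the thread consumes no CPU time
 * between events. Returns 0, or the clock_nanosleep() error number.
 */
static int periodic_loop(long period_ns, int cycles, void (*work)(void))
{
    struct timespec next;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < cycles; i++) {
        timespec_add_ns(&next, period_ns);
        int err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        if (err)
            return err;  /* for example EINTR if a signal arrived */
        work();
    }
    return 0;
}
```

Using absolute deadlines keeps the period from drifting when work() takes a variable amount of time, and a thread structured this way only accumulates realtime runtime while work() is actually executing.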
Posted Sep 10, 2008 1:08 UTC (Wed)
by sethml (guest, #8471)
[Link]
What I would have killed for at the time was a real-time scheduler for which I could specify much more complex constraints. I'd have loved to be able to express: thread A gets top priority, but in no more than 100 1ms chunks per second, and thread B gets medium priority, but uses no more than 90% of the CPU available to it, and thread C gets whatever's left. (Thread C would be my communications thread, so that the controller of the system could actually fix it if something went wrong.)
SCHED_RR doesn't limit CPU usage
priority task a SCHED_RR task is the same as a SCHED_FIFO task.
A clarification: SCHED_FIFO applies to individual threads, not to whole processes. This holds both for the POSIX standard and the Linux implementation. See also pthread_setschedparam().
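As an illustration of the per-thread nature of the policy, here is a minimal sketch that requests SCHED_FIFO for a single thread with pthread_setschedparam() and reads the result back. The wrapper names make_thread_fifo and thread_policy are hypothetical, and the request normally fails with EPERM unless the caller has realtime privileges.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Ask for SCHED_FIFO on one thread only. Returns 0 on success or an
 * error number (typically EPERM without realtime privileges).
 */
static int make_thread_fifo(pthread_t thread, int priority)
{
    assert(priority >= sched_get_priority_min(SCHED_FIFO) &&
           priority <= sched_get_priority_max(SCHED_FIFO));
    struct sched_param sp = { .sched_priority = priority };
    return pthread_setschedparam(thread, SCHED_FIFO, &sp);
}

/* Report the policy a thread currently runs under, or -1 on error. */
static int thread_policy(pthread_t thread)
{
    int policy;
    struct sched_param sp;

    if (pthread_getschedparam(thread, &policy, &sp) != 0)
        return -1;
    return policy;
}
```

Calling make_thread_fifo(pthread_self(), sched_get_priority_min(SCHED_FIFO)) changes only the calling thread; other threads in the same process keep whatever policy they had.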
Processes versus threads
> when it should be getting some important data before it is overwritten?
> time via proc or sys, so everyone can make their choice.
I'd say you're negligent if you don't find and turn that knob (and a good
few extra ones).
working on systems that can harm people if held up for a fraction of a
second. The needs of the overwhelmingly common case (`don't let buggy code
lock up my system') should govern.
Yeah, really, you should.
It seems better to have both SCHED_FIFO and, e.g., SCHED_FIFO_THROTTLED, so that systems needing both kinds of policy (maybe not at the same time) do not need to change the setting back and forth.
SCHED_FIFO or this new _THROTTLED variant. If I wanted to impose a site
policy that no process is allowed to lock up my system by taking total
ownership of the CPU forever, even if it wants to, I'd have to change the
source of every application that used SCHED_FIFO, and if it was
closed-source I'd be out of luck.
way to say 'let realtime-prio stuff run without scaring me', and that
means they take most, but not all of the CPU. (What's the difference
between a process taking most of the CPU and one taking all of it on a
slightly slower machine? A process, realtime or not, that cares that it's
running 5% slower than otherwise is a broken process --- just like one
that hogs the CPU constantly for ten seconds and doesn't let anyone else
get a word in edgeways.)
non-root, because that means it might actually get used for
non-specialized applications, not all of which can be guaranteed not to
have infloop-causing bugs.)
> taking all of it on a slightly slower machine? A process, realtime or not,
> that cares that it's running 5% slower than otherwise is a broken process
> --- just like one that hogs the CPU constantly for ten seconds and doesn't
> let anyone else get a word in edgeways.)
If you really care about not getting interrupted at all, you should be running inside of the kernel or probably with rtai or rtlinux.
I second this. I've actually written systems using SCHED_FIFO, in which losing a few ms would result in catastrophe. (Not human-killing-robot catastrophe, but you might end up with a bacteria-infested bottle of juice because the system failed to remove a bottle with a damaged cap from the assembly line.) I would totally support the current 95% limit. Any thread running at real-time priority should only be running a small percentage of the time, or else it's screwed. Having the 95% limit would have saved me many hours - instead I'd have to power-cycle the system, since there was no way to kill the rogue thread. (And for reasons I never figured out, setting up an sshd at an even higher SCHED_FIFO priority didn't work.)
