CFS and SD internals, design
Posted Jul 26, 2007 15:08 UTC (Thu) by mingo
In reply to: RDSL and ignoring feedback
Parent article: Still waiting for swap prefetch
As I understand it, the CFS brand of "fairness" takes a longer-term view, allowing tasks to get their "fair" share even if they sleep from time to time.
Correct, and i call this concept "sleeper fairness".
The simplest way to describe it is via a specific example: on my box, if i run glxgears alone, it uses exactly 50% of CPU time. If i boot into the SD scheduler and start a CPU hog in parallel to the glxgears task, the two tasks share the CPU: the CPU hog gets ~60% of CPU time and glxgears gets ~40%. If i boot into CFS, both tasks get exactly 50% of CPU time.
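(For anyone who wants to reproduce the experiment, the "CPU hog" can be any busy-looping task - e.g. this trivial, made-up snippet, which has nothing to do with the scheduler code itself:)

/* busy-loop CPU hog: compile with "cc hog.c -o hog", run it next to
 * glxgears, and watch the CPU shares of the two tasks in top(1) */
int main(void)
{
        for (;;)
                ;
        return 0;
}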
I've described this mechanism and other internal details in another thread already, but i think it makes sense to paste that reply here too:
wait_runtime is a scheduler-internal metric that shows how far out of balance this task's execution history is compared to the execution time it would get on a "perfect, ideal multi-tasking CPU". If wait_runtime goes negative, the task has spent more time on the CPU than it should have; if it goes positive, the task has spent less time than it "should have". CFS sorts tasks in an rbtree with this value as the key and uses it to choose the next task to run: it picks the task with the largest wait_runtime value, i.e. the task most in need of CPU time. (There are lots of additional details - but this is the raw scheme.)
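To make the bookkeeping concrete, here is a minimal userspace toy model of the scheme. This is an illustration, not CFS code: a linear scan stands in for the rbtree, the slice accounting is simplified, and all names and values are invented.

#include <stdio.h>

#define NR_TASKS 2

struct task {
        const char *name;
        long long wait_runtime;         /* positive: the task is owed CPU time */
};

/*
 * Pick the task most in need of CPU time: the one with the largest
 * wait_runtime. (CFS keeps runnable tasks sorted in an rbtree; a linear
 * scan stands in for that lookup here.)
 */
static struct task *pick_next(struct task *tasks, int nr)
{
        struct task *best = &tasks[0];
        int i;

        for (i = 1; i < nr; i++)
                if (tasks[i].wait_runtime > best->wait_runtime)
                        best = &tasks[i];
        return best;
}

/*
 * Account one timeslice: on a "perfect, ideal multi-tasking CPU" every
 * runnable task would have received slice/nr. The task that actually ran
 * got the whole slice, so its balance drops below ideal while every
 * waiting task's balance rises.
 */
static void account(struct task *tasks, int nr, struct task *ran,
                    long long slice)
{
        long long ideal = slice / nr;
        int i;

        for (i = 0; i < nr; i++)
                tasks[i].wait_runtime += (&tasks[i] == ran) ?
                                         ideal - slice : ideal;
}

int main(void)
{
        struct task tasks[NR_TASKS] = { { "hog", 0 }, { "gears", 0 } };
        int tick;

        for (tick = 0; tick < 6; tick++) {
                struct task *next = pick_next(tasks, NR_TASKS);

                account(tasks, NR_TASKS, next, 1000000);        /* 1ms slice */
                printf("ran %-5s  hog: %+8lld  gears: %+8lld\n", next->name,
                       tasks[0].wait_runtime, tasks[1].wait_runtime);
        }
        return 0;
}

Running it shows the two tasks strictly alternating: whichever task ran last has the smaller wait_runtime, so the other one is picked next.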
This mechanism and implementation is basically not comparable to SD in any way; the two schedulers are that different. Basically the only thing they have in common is that both aim to schedule tasks "fairly" - but even the definition of "fairness" differs: SD strictly considers time spent on the CPU and on the runqueue, while CFS takes time spent sleeping into account as well. (Hence the "sleep average" approach and the "rewarding" of sleepy tasks, which was the main interactivity mechanism of the old scheduler, survives in CFS. Con was fundamentally against sleep-average methods. CFS tried to be a no-tradeoffs replacement for the existing scheduler, and the sleeper-fairness method was key to that.)
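As a rough illustration of the sleeper-fairness idea, in the spirit of the toy model above (the cap value and the helper name are invented for this sketch, not taken from CFS):

#include <stdio.h>

/* invented cap: a sleeper may bank at most 2ms of credit in this sketch */
#define SLEEP_CREDIT_CAP 2000000LL

/*
 * While a task sleeps it is treated as if it had been waiting on the
 * runqueue, so its wait_runtime keeps growing - but only up to a cap,
 * so a task that slept for an hour cannot monopolize the CPU afterwards.
 */
static long long credit_sleeper(long long wait_runtime, long long slept_ns)
{
        wait_runtime += slept_ns;
        if (wait_runtime > SLEEP_CREDIT_CAP)
                wait_runtime = SLEEP_CREDIT_CAP;
        return wait_runtime;
}

int main(void)
{
        /* glxgears sleeps ~10ms waiting for the next frame; on wakeup it
         * has earned (capped) credit, so it preempts the CPU hog. */
        long long gears = credit_sleeper(0, 10000000LL);

        printf("gears wait_runtime after sleeping: %+lld ns\n", gears);
        return 0;
}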
These (and other) design differences and approaches - not surprisingly - produced two completely different scheduler implementations. Anyone who has tried both schedulers will attest that they "feel" different and behave differently as well.
Due to these fundamental design differences, the data structures and algorithms are necessarily very different, so there was basically no opportunity to share code (besides the scheduler glue code that was already in sched.c), and there's only 1 line of code in common between CFS and SD (out of thousands of lines of code):

/*
 * This idea comes from the SD scheduler of Con Kolivas:
 */
static inline void sched_init_granularity(void)
{
        unsigned int factor = 1 + ilog2(num_online_cpus());
        ...
}
This boot-time "ilog2()" tuning, based on the number of CPUs available, is an approach i saw in SD, and i asked Con whether i could use it in CFS (to which Con kindly agreed).
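As a standalone illustration of how such a factor scales a base granularity with CPU count (the base value here is invented, and ilog2() is reimplemented because the kernel helper isn't available in userspace):

#include <stdio.h>

/* userspace stand-in for the kernel's ilog2(): floor(log2(n)) for n > 0 */
static unsigned int ilog2(unsigned int n)
{
        unsigned int log = 0;

        while (n >>= 1)
                log++;
        return log;
}

int main(void)
{
        /* invented base granularity of 1ms, scaled the way described
         * above: factor = 1 + ilog2(number of online CPUs) */
        const unsigned long base_granularity_ns = 1000000;
        unsigned int ncpus;

        for (ncpus = 1; ncpus <= 16; ncpus *= 2) {
                unsigned int factor = 1 + ilog2(ncpus);

                printf("%2u CPUs -> factor %u -> granularity %lu ns\n",
                       ncpus, factor, factor * base_granularity_ns);
        }
        return 0;
}

The logarithmic scaling means granularity grows slowly as CPUs are added: doubling the CPU count only bumps the factor by one.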