
About documentation...

Posted Jul 25, 2007 21:23 UTC (Wed) by mingo (guest, #31122)
In reply to: About documentation... by i3839
Parent article: Interview with Con Kolivas (APC)

If people are expected to ever use these knobs, it might be good to document what those wakeup and stat variants are, and the meaning of sched_features. When that's done all fields are easy to understand.

Yeah, i'll do that. _Normally_ you should not need to change any knobs - the scheduler auto-tunes itself. That's why they are only accessible under CONFIG_SCHED_DEBUG. (But when diagnosing scheduler problems it helps to be able to tune various aspects of the scheduler without having to reboot the kernel.)
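
As a rough illustration (this is nothing that ships with the kernel, and the knob name sched_granularity_ns below is only an assumed example - the exact set of files under /proc/sys/kernel/ varies between CFS versions), a small userspace program can read and change such a knob on the fly:

 /* Minimal sketch: read one scheduler knob via procfs and double it,
  * taking effect immediately, no reboot needed.  The knob name is an
  * assumption; which files exist depends on the kernel version and on
  * CONFIG_SCHED_DEBUG being enabled. */
 #include <stdio.h>

 int main(void)
 {
         const char *knob = "/proc/sys/kernel/sched_granularity_ns";
         unsigned long long val;
         FILE *f = fopen(knob, "r");

         if (!f || fscanf(f, "%llu", &val) != 1) {
                 perror(knob);
                 return 1;
         }
         fclose(f);
         printf("current value: %llu ns\n", val);

         f = fopen(knob, "w");
         if (!f || fprintf(f, "%llu\n", val * 2) < 0) {
                 perror(knob);
                 return 1;
         }
         fclose(f);
         return 0;
 }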

One other interesting pair of fields is sum_exec_runtime versus sum_wait_runtime: the accumulated amount of time spent on the CPU, compared to the time the task had to wait to get onto the CPU.

The "sum_exec_runtime/nr_switches" number is also interesting: it shows the average time ('scheduling atom') a task has spent executing on the CPU between two context-switches. The lower this value, the more context-switching-happy a task is.

se.wait_runtime is a scheduler-internal metric that shows how far out of balance this task's execution history is, compared to the execution time it would have received on a "perfect, ideal multi-tasking CPU". If wait_runtime goes negative, the task has spent more time on the CPU than it should have; if it goes positive, the task has received less time than it "should have". CFS sorts tasks in an rbtree keyed by this value and uses it to choose the next task to run: it picks the task with the largest wait_runtime value, i.e. the task that is most in need of CPU time. (There are lots of additional details - but this is the raw scheme.)
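
In toy-model form (this illustrates the idea only - it is not the CFS code or its data structure), the raw scheme looks like this:

 /* Toy model of the raw scheme: each task carries a wait_runtime value
  * and the scheduler picks the task most in need of CPU time, i.e. the
  * largest value.  Real CFS keeps tasks sorted in an rbtree keyed by
  * this value, so the pick is the leftmost node, not a linear scan. */
 #include <stddef.h>

 struct toy_task {
         long long wait_runtime;    /* positive: owed CPU time; negative: ran too much */
         const char *comm;
 };

 static struct toy_task *pick_next_task(struct toy_task *tasks, size_t nr)
 {
         struct toy_task *best = NULL;
         size_t i;

         for (i = 0; i < nr; i++)
                 if (!best || tasks[i].wait_runtime > best->wait_runtime)
                         best = &tasks[i];
         return best;
 }

 int main(void)
 {
         struct toy_task tasks[] = {
                 { .wait_runtime =  500000, .comm = "waited a lot" },
                 { .wait_runtime = -200000, .comm = "ran too much" },
         };
         struct toy_task *next = pick_next_task(tasks, 2);

         return next == &tasks[0] ? 0 : 1;   /* the starved task wins */
 }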

This mechanism and implementation are basically not comparable to SD in any way - the two schedulers are that different. About the only thing they have in common is that both aim to schedule tasks "fairly", and even the definition of "fairness" differs: SD strictly considers time spent on the CPU and on the runqueue, while CFS takes time spent sleeping into account as well. Hence the "sleep average" approach and the "rewarding" of sleepy tasks - the main interactivity mechanism of the old scheduler - survive in CFS, whereas Con was fundamentally against sleep-average methods. CFS tried to be a no-tradeoffs replacement for the existing scheduler, and the sleeper-fairness method was key to that.

These (and other) design differences and approaches - not surprisingly - produced two completely different scheduler implementations. Anyone who has tried both schedulers will attest that they "feel" different and behave differently as well.

Due to these fundamental design differences the data structures and algorithms are necessarily very different, so there was basically no opportunity to share code (besides the scheduler glue code that was already in sched.c), and there's only 1 line of code in common between CFS and SD (out of thousands of lines of code):

 /*
  * This idea comes from the SD scheduler of Con Kolivas:
  */
 static inline void sched_init_granularity(void)
 {
         unsigned int factor = 1 + ilog2(num_online_cpus());

         /* ... the rest of the function scales the scheduler's
          * granularity settings by this factor ... */
 }

This boot-time "ilog2()" tuning based on the number of available CPUs is an approach i saw in SD, and i asked Con whether i could use it in CFS (to which Con kindly agreed).
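
To give a feel for the numbers (the snippet below is a userspace re-implementation for illustration only - ilog2() here is a stand-in for the kernel helper of the same name): with 1, 2, 4, 8 and 16 CPUs the factor comes out as 1, 2, 3, 4 and 5, so the granularity scaling grows only logarithmically with the size of the machine:

 /* Userspace sketch of that boot-time calculation. */
 #include <stdio.h>

 static unsigned int ilog2(unsigned int n)
 {
         unsigned int log = 0;

         while (n >>= 1)
                 log++;
         return log;
 }

 int main(void)
 {
         unsigned int cpus;

         for (cpus = 1; cpus <= 16; cpus *= 2)
                 printf("%2u CPUs -> factor %u\n", cpus, 1 + ilog2(cpus));
         /* prints factors 1, 2, 3, 4, 5 for 1, 2, 4, 8, 16 CPUs */
         return 0;
 }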



About documentation...

Posted Jul 25, 2007 23:38 UTC (Wed) by i3839 (guest, #31386)

Interesting info. But nothing about the fields I asked about. ;-)

Anyway, if those knobs only appear with CONFIG_SCHED_DEBUG enabled, I think it's better to document them in the Kconfig entry than in that documentation file. That way people interested in them can find the information easily, and if the debug option ever disappears the documentation file won't need to be updated. People look at the Kconfig text when deciding whether to enable an option, so give them all the information they need there.
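
Something along these lines, for instance (just a rough sketch of what an expanded help text could say - I'm making up the exact wording and the surrounding entry here):

 config SCHED_DEBUG
 	bool "Collect scheduler debugging info"
 	depends on DEBUG_KERNEL && PROC_FS
 	help
 	  Say Y to get extra per-task scheduler statistics in
 	  /proc/<pid>/sched and a set of tunable knobs under
 	  /proc/sys/kernel/ for diagnosing scheduler problems at
 	  runtime. Normally the scheduler tunes itself and none of
 	  these knobs need to be touched.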

The sum_exec_runtime/sum_wait_runtime ratio also seems interesting. It is 1 to 1442 for ksoftirqd (it ran for 5 ms and waited 7 seconds for that, ouch), while wait_runtime_overruns is 232 with zero underruns. (Are you sure those fields aren't accidentally swapped?)

The events/0 info is also interesting: it has a se.block_max of 1.1 seconds, which seems suspiciously high.

se.wait_runtime includes the time a task slept, right? Otherwise it should be zero for all tasks that are sleeping, and that isn't the case.

Another strange thing is that really a lot of tasks have almost the same block_max of 1.818 or 1.1816 seconds. The lower digits are so close together that it seems like all tasks were blocked and unblocked at the same time. Oh wait, that is probably caused by suspend to/resume from RAM.

