
Benchmarks, scheduler instrumentation, PowerTop


Posted Jul 25, 2007 13:37 UTC (Wed) by mingo (guest, #31122)
In reply to: Con hit the nail on the head and ignored it by dion
Parent article: Interview with Con Kolivas (APC)

Con discovered what the PowerTop people did: that anything that can be quantified will be improved.

Benchmarks like Con's interbench (a latency measurement test-suite) are very useful because they provide standardized metrics of system behavior and allow the precise tracking of them.

Instrumentation of latencies is an orthogonal thing, and it allows users to be involved much more directly - as PowerTop has proven.

Incidentally, i happened to be the one who wrote most of the kernel instrumentation code that enabled PowerTop (/proc/timer_stats, introduced upstream as part of dynticks) - and i'm quite impressed by the user-space tool, the phenomenon and the momentum that Arjan managed to build on top of what i originally envisioned to be a raw, obscure, kernel-developers-only debugging feature.

So in CFS i've added various pieces of instrumentation to allow a powertop-like tool to be built: the scheduler now tracks worst-case blocking and sleeping latencies, how many times a task has context-switched, etc.

Those stats are already used to report CFS regressions, but in addition to that a new tool could be written (let's call it "latencytop" :-) which would display the longest latencies that various tasks are observing (or the tasks with the largest context-switch rate, etc.).

Enable CONFIG_SCHED_DEBUG for the new scheduler instrumentation code (all included in v2.6.23-rc1); the new stats are in /proc/[PID]/sched. These stats were very useful while developing CFS - nothing beats users being able to report: "my kmail just saw a 10.5 second latency while I built a kernel - Ingo you suck!".
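
As a rough sketch of what such a "latencytop" could do today, plain shell over the new files is already enough. This assumes the field names as they are printed under CONFIG_SCHED_DEBUG in 2.6.23-rc1 (se.wait_max and nr_switches - adjust the patterns to whatever your kernel actually prints), with values in nanoseconds:

  # worst run-queue wait observed per task, biggest offenders last
  grep se.wait_max /proc/[0-9]*/sched | sort -n -k 3 | tail -5

  # tasks with the most context switches so far
  grep nr_switches /proc/[0-9]*/sched | sort -n -k 3 | tail -5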



About documentation...

Posted Jul 25, 2007 20:16 UTC (Wed) by i3839 (guest, #31386)

While searching (unsuccessfully) for documentation about that /proc/<PID>/sched file, I noticed that sched-design.txt is outdated and could be removed, and that the new sched-design-CFS.txt is already lagging behind:

$ ls /proc/sys/kernel/sched_*
/proc/sys/kernel/sched_batch_wakeup_granularity_ns
/proc/sys/kernel/sched_child_runs_first
/proc/sys/kernel/sched_features
/proc/sys/kernel/sched_granularity_ns
/proc/sys/kernel/sched_runtime_limit_ns
/proc/sys/kernel/sched_stat_granularity_ns
/proc/sys/kernel/sched_wakeup_granularity_ns

But only sched_granularity_ns is documented.

If people are expected to ever use these knobs, it might be good to document what those wakeup and stat variants are, and the meaning of sched_features. When that's done, all fields are easy to understand.

Interpreting and using /proc/<PID>/sched and /proc/sched_debug would also be much easier if they were documented, though as it's a debugging feature it's less important. But still.

Anyway, when hunting for latency spikes, sluggish apps and similar creatures, I guess the se.wait_max and se.block_max are most interesting?

A bit of poking to find the top offenders turns up:

proc # grep se.wait_max */sched | sort -n -k 3 | tail -n 2
1182/sched:se.wait_max : 81381345
3/sched:se.wait_max : 111139352

proc # grep se.block_max */sched | sort -n -k 3 | tail -n 3
1182/sched:se.block_max : 3749201713
367/sched:se.block_max : 3938538101
721/sched:se.block_max : 4027921788

proc # ps 3 1182 367 721
PID TTY STAT TIME COMMAND
3 ? S< 0:00 [ksoftirqd/0]
367 ? S< 0:02 [kjournald]
721 ? Ss 0:00 /usr/sbin/syslogd -m 0
1182 ? Ss+ 17:25 X :0 -dpi 96 -nolisten tcp -br

So assuming that these values are in nanoseconds, ksoftirqd waited at most 111 ms before it could finally run, and X 81 ms.

And kjournald, syslogd and X blocked at most about 4 seconds on IO, which is a bit worrying, especially X as it's in both lists.

kjournald probably does huge IO requests to get optimal throughput, but even then 4 seconds is bad.

syslogd does synchronous writes, so it's not that strange that it's at the top of the list. But it won't write much more than 100KB at once, so that's slightly scary too, but not too worrisome.

As for X, it doesn't seem to do much file IO, so it's probably blocking on something else. Video hardware? Unix sockets? Unclear, but 4s isn't healthy. Hopefully both cases happened at X startup and it's less serious than it looks (any way to reset those stats?).

About documentation...

Posted Jul 25, 2007 20:43 UTC (Wed) by mingo (guest, #31122)

any way to reset those stats?

Yeah: you can reset them on a per-task/thread basis by writing 0 to the /proc/[PID]/sched file. Then they'll go down to 0. (Unprivileged users can do it too, to their own tasks. root can reset it for everyone.)

You can reset it periodically as well if you want to sample/profile their typical sleep/block behavior.
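
For instance, to sample one task's behavior over a 10 second window (a sketch only - PID 1182 is the X server from the listing earlier in the thread, and the se.*_max field names are the ones shown there):

  echo 0 > /proc/1182/sched     # reset the counters (as root, or as the task's owner)
  sleep 10
  grep -E 'se\.(wait|block|sleep)_max' /proc/1182/sched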

About documentation...

Posted Jul 25, 2007 21:54 UTC (Wed) by i3839 (guest, #31386)

Yes, I figured that out by doing the only logical thing I could think of. :-)
Good interface.

About documentation...

Posted Jul 25, 2007 20:46 UTC (Wed) by mingo (guest, #31122)

So assuming that these values are in nanoseconds, ksoftirqd waited at most 111 ms before it could finally run, and X 81 ms.

Yes, the values are in nanoseconds. What priority does it have? [the prio field in the /proc/[PID]/sched file] If it's niced to +19 then a longer delay is possible, because other, higher-prio tasks might delay it.
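
E.g. something like this shows it (assuming the field is printed as "prio" in this kernel's output):

  grep ^prio /proc/3/sched     # 3 = ksoftirqd/0 from the ps listing above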

About documentation...

Posted Jul 25, 2007 22:03 UTC (Wed) by i3839 (guest, #31386)

ksoftirqd has prio 115 and X has prio 120. I didn't nice anything, so it's all default (all kernel threads at -5, user processes 0, except for pppd at -2 and udevd at -4).

About documentation...

Posted Jul 25, 2007 22:18 UTC (Wed) by mingo (guest, #31122)

Ok. Perhaps the 100+ msec ksoftirqd delay was during bootup. Or, if you are running a !CONFIG_PREEMPT kernel, such delays could happen too. But if you are running a CONFIG_PREEMPT kernel and ksoftirqd shows such large latencies even after resetting its counters, that would be an anomaly. (If so, please report it to me and/or to lkml by email.)
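
In case it helps, a quick way to check which preemption model the running kernel was built with (assuming the config was installed under /boot, or that CONFIG_IKCONFIG_PROC is enabled for the /proc/config.gz fallback):

  grep PREEMPT /boot/config-$(uname -r) 2>/dev/null || zcat /proc/config.gz | grep PREEMPT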

About documentation...

Posted Jul 25, 2007 22:50 UTC (Wed) by i3839 (guest, #31386)

Ok, I'll keep an eye on it. Running PREEMPT here.

I suppose the best way to track any anomalies down is by applying latency-tracing-v2.6.23-rc1-combo.patch from your homepage? WAKEUP_TIMING seems slightly redundant now. I'll enable it anyway.

About documentation...

Posted Jul 25, 2007 23:41 UTC (Wed) by i3839 (guest, #31386)

I get compile errors with that patch, I'll send them by email.

About documentation...

Posted Jul 25, 2007 20:50 UTC (Wed) by mingo (guest, #31122)

Anyway, when hunting for latency spikes, sluggish apps and similar creatures, I guess the se.wait_max and se.block_max are most interesting?

Yes. There's also se.sleep_max - that's the maximum time the task spent sleeping voluntarily. ('se' stands for 'scheduling entity' - a task here)

block_max stands for the maximum involuntary delay. (waiting for disk IO, etc.)

wait_max stands for the maximum delay that a task saw, from the point it got on the runqueue to the point it actually started executing its first instruction.

Note that for multithreaded apps such as firefox the worker threads are not in /proc/[PID]/sched but in /proc/[PID]/task/[TID]/sched. Often firefox latencies are in those threads, not in the main thread.
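
So a per-thread variant of the earlier grep would look something like this (a sketch - firefox-bin is just an example process name, and this assumes pidof returns a single PID):

  # worst run-queue wait across all threads of one process
  grep se.wait_max /proc/$(pidof firefox-bin)/task/*/sched | sort -n -k 3 | tail -3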

About documentation...

Posted Jul 25, 2007 22:21 UTC (Wed) by i3839 (guest, #31386)

se.sleep_max is more interesting if you want to get the most out of dynticks, I suppose. Currently I'm mainly interested in wait_max.

Yes, I noticed how multithreaded apps were handled, but forgot to take that into account in my grep. The numbers are in the same range (but I reset X and FF, and I don't see X high in the list now; FF is, though).

(I'm interested in this because I wonder where the strange keyboard/mouse behaviour I irregularly get comes from. Warping mouse, "sticking" keys, since 2.6.22 or so - but it could be shaky hardware too. A pain to debug.)

About documentation...

Posted Jul 25, 2007 21:23 UTC (Wed) by mingo (guest, #31122)

If people are expected to ever use these knobs, it might be good to document what those wakeup and stat variants are, and the meaning of sched_features. When that's done, all fields are easy to understand.

Yeah, i'll do that. _Normally_ you should not need to change any knobs - the scheduler auto-tunes itself. That's why they are only accessible under CONFIG_SCHED_DEBUG. (But it helps when diagnosing scheduler problems that you can tune various aspects of it without having to reboot the kernel.)
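
For example, experimenting with a larger granularity at runtime could look like this (purely illustrative values - sched_granularity_ns is one of the knobs listed earlier in the thread, and it takes nanoseconds):

  cat /proc/sys/kernel/sched_granularity_ns                # current (auto-tuned) value
  echo 20000000 > /proc/sys/kernel/sched_granularity_ns    # try 20 ms, as root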

One other interesting field is sum_exec_runtime versus sum_wait_runtime: the accumulated amount of time spent on the CPU, compared to the time the task had to wait to get onto the CPU.

The "sum_exec_runtime/nr_switches" number is also interesting: it shows the average time ('scheduling atom') a task has spent executing on the CPU between two context-switches. The lower this value, the more context-switching-happy a task is.

se.wait_runtime is a scheduler-internal metric that shows how far out of balance this task's execution history is, compared to the execution time it could get on a "perfect, ideal multi-tasking CPU". So if wait_runtime goes negative, the task has spent more time on the CPU than it should have; if it goes positive, it has spent less time than it "should have". CFS sorts tasks in an rbtree with this value as the key and uses it to choose the next task to run (with lots of additional details - but this is the raw scheme): it picks the task with the largest wait_runtime value, i.e. the task that is most in need of CPU time.

This mechanism and implementation is basically not comparable to SD in any way - the two schedulers are that different. Basically the only thing they have in common is that both aim to schedule tasks "fairly" - but even the definition of "fairness" is different: SD strictly considers time spent on the CPU and on the runqueue, while CFS takes time spent sleeping into account as well. (And hence the approach of "sleep average" and the act of "rewarding" sleepy tasks, which was the main interactivity mechanism of the old scheduler, survives in CFS. Con was fundamentally against sleep-average methods. CFS tried to be a no-tradeoffs replacement for the existing scheduler, and the sleeper-fairness method was key to that.)

These (and other) design differences and approaches - not surprisingly - produced two completely different scheduler implementations. Anyone who has tried both schedulers will attest to the fact that they "feel" different and behave differently as well.

Due to these fundamental design differences the data structures and algorithms are necessarily very different, so there was basically no opportunity to share code (besides the scheduler glue code that was already in sched.c), and there's only 1 line of code in common between CFS and SD (out of thousands of lines of code):

  * This idea comes from the SD scheduler of Con Kolivas:
  */
 static inline void sched_init_granularity(void)
 {
         unsigned int factor = 1 + ilog2(num_online_cpus());

This boot-time "ilog2()" scaling based on the number of CPUs available is a tuning approach i saw in SD, and i asked Con whether i could use it in CFS (to which Con kindly agreed).
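
(So on a dual-core box the factor is 1 + ilog2(2) = 2, on a 4-way it's 1 + ilog2(4) = 3, and on an 8-way 1 + ilog2(8) = 4. Going by the function name, the granularity defaults then get scaled by that factor, so the scaling grows only logarithmically with the CPU count.)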

About documentation...

Posted Jul 25, 2007 23:38 UTC (Wed) by i3839 (guest, #31386)

Interesting info. But nothing about the fields I asked about. ;-)

Anyway, if those knobs only appear with CONFIG_SCHED_DEBUG enabled, I think it's better to document them in the Kconfig entry than in that documentation file. That way people interested in them can find the information easily, and if the debug option ever disappears the documentation file won't need to be updated. When deciding whether to enable an option, people look at the Kconfig text, so give them all the info they need there.

sum_exec_runtime/sum_wait_runtime also seems interesting. The ratio is 1 to 1442 for ksoftirqd (it ran for 5 ms and waited 7 seconds for that, ouch). Meanwhile wait_runtime_overruns is 232, with zero underruns. (Are you sure those fields aren't swapped accidentally?)

The events/0 info is also interesting: it has a se.block_max of 1.1 seconds, which seems suspiciously high.

se.wait_runtime includes the time a task slept, right? Otherwise it should be zero for all tasks that are sleeping, and that isn't the case.

Another strange thing is that a lot of tasks have almost the same block_max of 1.818 or 1.1816 seconds. The lower digits are so close together that it seems like all tasks were blocked and unblocked at the same time. Oh wait, that is probably caused by suspend to/resume from RAM.

About documentation...

Posted Feb 7, 2008 12:32 UTC (Thu) by anders.blomdell@control.lth.se (guest, #50377)

I had similar problems with sluggish performance when doing heavy disk I/O; turning off Hyper-Threading (in the BIOS) on the processor put the system back to more normal behavior (i.e. worst-case blocking for syslog dropped from 25 seconds to 9 milliseconds).

Benchmarks, scheduler instrumentation, PowerTop

Posted Jul 26, 2007 16:33 UTC (Thu) by pheldens (guest, #19366)

Call me blind, but how do i turn on CFS in 23-rc1(-git*)?
I don't see the .config options anywhere, or is it on by default?

Benchmarks, scheduler instrumentation, PowerTop

Posted Jul 26, 2007 21:48 UTC (Thu) by mingo (guest, #31122)

Yeah, it's on by default - got merged early in the .23 merge window and is included in the 2.6.23-rc1 kernel (and -git*).
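
A quick way to double-check from userspace: if the kernel was also built with CONFIG_SCHED_DEBUG, the CFS instrumentation files discussed above will be present, e.g.:

  ls /proc/sched_debug /proc/self/sched /proc/sys/kernel/sched_granularity_ns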

