I disagree. I used oprofile, not perf, but the principle is the same. Looking at areas of high cache misses can tell you where to add cache prefetch code. This always helps, virtual memory layout oddities or not.
Posted Jul 3, 2009 23:38 UTC (Fri) by mingo (subscriber, #31122)
Correct.
Another thing to note is that perf stat has a '--repeat N' parameter. This option directs perf stat to run the measured command N times. It saves the various counter results, and then emits basic (avg, std-dev) statistics about them.
For example, running the 'hackbench' messaging benchmark 10 times gives:
This shows us the statistical properties of the counters. If your system is 'noisy', or if the metric is a fundamentally volatile one (cycles, or cache-misses), the noise level will be higher.
Other metrics such as instructions or branches executed are a lot more stable.
But for any of the metrics, 'perf stat --repeat 10' gives you a good guess about how reliable that metric is on that particular system.
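As a rough illustration of what such (avg, std-dev) statistics tell you, here is a minimal Python sketch. The counter values are made-up numbers, not real perf output, and the sample standard deviation used here is an assumption; it is not necessarily the exact formula perf stat applies.

```python
import statistics

def summarize(samples):
    """Return (mean, sample std-dev, std-dev as a percentage of the mean)."""
    mean = statistics.mean(samples)
    stddev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    rel = 100.0 * stddev / mean if mean else 0.0
    return mean, stddev, rel

# Hypothetical counter values from 5 repeated runs (invented for illustration).
runs = {
    "instructions": [1000, 1002, 998, 1001, 999],
    "context-switches": [80, 120, 60, 140, 100],
}

for name, samples in runs.items():
    mean, stddev, rel = summarize(samples)
    # A small relative std-dev means the metric is stable on this system.
    print(f"{name:>18}: {mean:10.1f}  +- {rel:5.2f}%")
```

With these invented numbers the 'instructions' counter has a relative spread well under 1%, while 'context-switches' is spread out by tens of percent, which is the kind of difference described above.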
Somewhat surprisingly, for this particular workload, the noisiest metrics are 'context-switches' and 'CPU-migrations', which measure the number of task switches and the number of cross-CPU task migrations. (These are not PMU metrics but perfcounter metrics offered by the kernel.)
(The reason for the noise here is that hackbench starts and stops a lot of tasks in a bursty way, and any noise in initial conditions gets magnified by the chance placement of tasks. 100 msecs is not a lot of time to run, so depending on when the scheduler's balancing algorithm kicks in, the placement of tasks is randomized to a certain degree (due to the high overload) and the metric gets spread out.)
The conclusion is that noisy metrics are just as useful as stable metrics, as long as you can measure the noise and as long as you know how to reduce the noise to acceptable levels. Modern CPUs with huge caches and complex heuristics are fundamentally random in their characteristics, so deterministic results can rarely be expected.
Call-graph / call-chain support
Posted Jul 5, 2009 10:28 UTC (Sun) by mingo (subscriber, #31122)
btw., another thing worth mentioning about perfcounters is turn-key call-graph support and call-graph visualization:
Here we record and output full call-chains (down to and including user-space call-chains) and display the overhead in a tree - detailing the call-path that results in that profile entry - and recursively so. (the '5' is a 5% filter - to skip entries below a 5% (relative-)overhead threshold)
This tells us that in this workload there's a combined overhead of 3.75% from user-copies (copy_user_generic_string()), and that ~52% of that overhead comes from a user-space read() call and 45% comes from a user-space write() call.
With traditional 'flat' profiling output we'd only know that there's 3.75% overhead in copy_user_generic_string() - we would not know where it comes from.
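To make the flat vs. call-graph difference concrete, here is a small Python sketch. It is not how perf implements this; the sample call chains (and the 'main' and 'compute' names) are hypothetical, chosen only to show why keeping the chain lets you attribute a function's overhead to its callers.

```python
from collections import Counter

# Hypothetical samples: each is a call chain, outermost caller first,
# sampled function last. The counts are invented for illustration.
samples = (
    [("read", "copy_user_generic_string")] * 3
    + [("write", "copy_user_generic_string")] * 2
    + [("main", "compute")] * 5
)

def flat_profile(samples):
    """Flat view: count samples per sampled function, ignoring callers."""
    return Counter(chain[-1] for chain in samples)

def callers_of(samples, func):
    """Call-graph view: percentage of func's samples per outermost caller."""
    roots = Counter(chain[0] for chain in samples if chain[-1] == func)
    total = sum(roots.values())
    return {root: 100.0 * n / total for root, n in roots.items()}

print(flat_profile(samples))
print(callers_of(samples, "copy_user_generic_string"))
```

The flat view only says that copy_user_generic_string accounts for half of these samples; the second function splits that share between read and write.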
Call-graph / call-chain support
Posted Jul 5, 2009 11:52 UTC (Sun) by njs (guest, #40338)
That's lovely. What would be even more lovely would be the ability to output KCacheGrind's format (documentation), or at least something easily parseable (with less ascii art) and thus mungeable into said format, if that's possible?