LWN.net Logo

Perfcounters added to the mainline

Perfcounters added to the mainline

Posted Jul 3, 2009 23:38 UTC (Fri) by mingo (subscriber, #31122)
In reply to: Perfcounters added to the mainline by zlynx
Parent article: Perfcounters added to the mainline

Correct.

Another thing to note is that perf stat has a '--repeat N' parameter. This option directs perf stat to run the measured command N times. It saves the various counter results, and then emits basic (avg, std-dev) statistics about them.

For example, running the 'hackbench' messaging benchmark 10 times gives:

aldebaran:~> perf stat --repeat 10 ./hackbench 10
Time: 0.121
Time: 0.091
Time: 0.114
Time: 0.094
Time: 0.090
Time: 0.095
Time: 0.094
Time: 0.107
Time: 0.094
Time: 0.095

 Performance counter stats for './hackbench 10' (10 runs):

    1259.878957  task-clock-msecs         #     10.597 CPUs    ( +-   1.799% )
          51812  context-switches         #      0.041 M/sec   ( +-   5.103% )
           3519  CPU-migrations           #      0.003 M/sec   ( +-   4.915% )
          17870  page-faults              #      0.014 M/sec   ( +-   0.392% )
     3802645216  cycles                   #   3018.262 M/sec   ( +-   1.747% )
     1588586719  instructions             #      0.418 IPC     ( +-   0.837% )
       16885948  cache-references         #     13.403 M/sec   ( +-   1.503% )
        7328059  cache-misses             #      5.816 M/sec   ( +-   1.773% )

    0.118889101  seconds time elapsed   ( +-   3.398% )


Shows us the statistical properties of the counters. If your system is 'noisy', or if the metric is a fundamentally volatile one (cycles, or cache-misses), the noise level will be higher.

Other metrics such as instructions or branches executed are a lot more stable.

But for any of the metrics, 'perf stat --repeat 10' gives you a good guess about how reliable that metric is on that particular system.

Somewhat surprisingly, for this particular workload, the most noisy metric is 'context-switches' and 'CPU-migrations' - which measures the number task switches and the number of cross-CPU task migrations. (this is not a PMU metric but a perfcounter metrics offered by the kernel.)

(The reason for the noise here is that hackbench starts and stops a lot of tasks in a bursty way, and any noise in initial conditions get magnified by the chance placement of tasks. 100 msecs is not a lot of time to run, so depending on when the scheduler's balancing algorithm kicks in the placement of tasks is randomized to a certain degree (due to the high overload) and the metric gets spread out.)

The conclusion is that noisy metrics are just as useful as stable metrics, as long as you can measure the noise and as long as you know how to reduce the noise to acceptable levels. Modern CPUs with huge caches and complex heuristics are fundamentally random in their characteristics, so deterministic results can rarely be expected.


(Log in to post comments)

Call-graph / call-chain support

Posted Jul 5, 2009 10:28 UTC (Sun) by mingo (subscriber, #31122) [Link]

btw., another thing worth mentioning about perfcounters is turn-key call-graph support and call-graph visualization:

 $ perf record -g -f ./pipe-test-1m
 [ perf record: Captured and wrote 8.169 MB perf.data (~356901 samples) ]

 $ perf report --sort symbol --callchain fractal,5 | cat

 #
 # (80245 samples)
 #
 # Overhead  Symbol
 # ........  ......
 #
     4.50%  [k] pipe_read
                |          
                 --99.00%-- do_sync_read
                           vfs_read
                           sys_read
                           system_call_fastpath
                           __GI___libc_read
                           __libc_start_main

     4.39%  [.] main

     4.27%  [k] __switch_to
                |          
                |          |          
                |          |--50.97%-- __GI___libc_write
                |          |          
                |           --49.06%-- __GI___libc_read
                |                     __libc_start_main
                |          
                 --11.19%-- thread_return
                           |          
                           |--51.44%-- __GI___libc_write
                           |          
                            --48.83%-- __GI___libc_read
                                      __libc_start_main

     3.75%  [k] copy_user_generic_string
                |          
                |--52.11%-- do_sync_read
                |          vfs_read
                |          sys_read
                |          system_call_fastpath
                |          __GI___libc_read
                |          __libc_start_main
                |          
                 --45.39%-- pipe_write
                           do_sync_write
                           vfs_write
                           sys_write
                           system_call_fastpath
                           __GI___libc_write
                           __libc_start_main

     3.36%  [k] avc_has_perm_noaudit
                |          
                 --96.59%-- avc_has_perm
                           inode_has_perm
                           file_has_perm
                           selinux_file_permission
                           security_file_permission
                           rw_verify_area
                           |          
                           |--51.34%-- vfs_read
                           |          sys_read
                           |          system_call_fastpath
                           |          __GI___libc_read
                           |          __libc_start_main
                           |          
                            --48.66%-- vfs_write
                                      sys_write
                                      system_call_fastpath
                                      |          
                                       --99.53%-- __GI___libc_write
                                                 __libc_start_main

     3.29%  [k] schedule
                |          
                |--50.34%-- sysret_careful
                |          __GI___libc_write
                |          
                 --46.55%-- pipe_wait
                           pipe_read
                           do_sync_read
                           vfs_read
                           sys_read
                           system_call_fastpath
                           __GI___libc_read
                           __libc_start_main

     2.89%  [k] switch_mm
                |          
                 --97.67%-- schedule
                           |          
                           |--50.75%-- pipe_wait
                           |          pipe_read
                           |          do_sync_read
                           |          vfs_read
                           |          sys_read
                           |          system_call_fastpath
                           |          __GI___libc_read
                           |          __libc_start_main
                           |          
                            --49.29%-- sysret_careful
                                      __GI___libc_write

     2.85%  [.] __GI___libc_write

     2.70%  [.] __GI___libc_read

     2.60%  [k] file_has_perm
                |          
                |--93.67%-- selinux_file_permission
                |          security_file_permission
                |          rw_verify_area
                |          |          
                |          |--55.02%-- vfs_write
                |          |          sys_write
                |          |          system_call_fastpath
                |          |          |          
                |          |           --99.81%-- __GI___libc_write
                |          |                     __libc_start_main
                |          |          
                |           --44.98%-- vfs_read
                |                     sys_read
                |                     system_call_fastpath
                |                     __GI___libc_read
                |                     __libc_start_main
                |          
                 --6.33%-- security_file_permission
                           rw_verify_area
                           |          
                           |--52.27%-- vfs_write
                           |          sys_write
                           |          system_call_fastpath
                           |          |          
                           |          |--97.10%-- __GI___libc_write
                           |          |          __libc_start_main
                           |          |          
                           |           --4.35%-- __write_nocancel
                           |          
                            --47.73%-- vfs_read
                                      sys_read
                                      system_call_fastpath
                                      __GI___libc_read
                                      __libc_start_main

     2.15%  [k] pipe_write
                |          
                 --98.31%-- do_sync_write
                           vfs_write
                           sys_write
                           system_call_fastpath
                           __GI___libc_write
                           __libc_start_main

     2.05%  [k] system_call
                |          
                |--50.64%-- __GI___libc_write
                |          |          
                |           --49.46%-- __libc_start_main
                |          
                 --49.24%-- __GI___libc_read
                           __libc_start_main

Here we record and output full call-chains (down to and including user-space call-chains) and display the overhead in a tree - detailing the call-path that results in that profile entry - and recursively so. (the '5' is a 5% filter - to skip entries below a 5% (relative-)overhead threshold)

For example this portion:

     3.75%  [k] copy_user_generic_string
                |          
                |--52.11%-- do_sync_read
                |          vfs_read
                |          sys_read
                |          system_call_fastpath
                |          __GI___libc_read
                |          __libc_start_main
                |          
                 --45.39%-- pipe_write
                           do_sync_write
                           vfs_write
                           sys_write
                           system_call_fastpath
                           __GI___libc_write
                           __libc_start_main
Tells us that in this workload there's a combined overhead of 3.75% from user-copies (copy_user_generic_string()), and that ~52% overhead of that comes from a user-space read() and 45% comes from a user-space write() call.

With traditional 'flat' profiling output we'd only know that there's 3.75% overhead in copy_user_generic_string() - we would not know where it comes from.

Call-graph / call-chain support

Posted Jul 5, 2009 11:52 UTC (Sun) by njs (guest, #40338) [Link]

That's lovely. What would be even more lovely would be the ability to output KCacheGrind's format (documentation), or at least something easily parseable (with less ascii art) and thus mungeable into said format, if that's possible?

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds