
Perfcounters added to the mainline

By Jake Edge
July 1, 2009

We last looked at the perfcounters patches back in December, shortly after they first appeared. Since then, a great deal of work has been done, culminating in perfcounters being merged into the mainline during the recently completed 2.6.31 merge window. Along the way, a tool for using perfcounters, called perf, was added to the mainline as well.

Adding perf to the kernel tools/ directory is one of the more surprising aspects of the perfcounters merge. Kernel hackers have long been leery of adding user-space tools into the kernel source tree, but Linus Torvalds was unconvinced by multiple complaints about that approach. He pointed to oprofile to explain:

It took literally months for the user mode tools to catch up and get the patches to support new functionality into CVS (or is it SVN?), and after that it took even longer for them to become part of a release and be picked up by distributions. In fact, I'm not sure it is part of a release even now - I had to make a bug report to Fedora to get atom and Nehalem support in my tools: I think they took the unofficial patch.

Others were not so sure that oprofile's development separately from the kernel was the root cause of those failures. Christoph Hellwig had other ideas: "I don't think oprofile has been a [disaster] because of any kind of split, but because the design has been a failure from day 1." But, Torvalds wants to try including the tool to see where it leads: "Let's give a _new_ approach a chance, and see if we can avoid the mistakes of yesteryear this time."

The perf tool itself is a fairly simple command-line program, which can be built from the tools/perf directory. It includes some documentation in the form of man pages, which are also available via perf help (as well as in HTML and other formats). At its simplest, it gathers and reports some statistics for a particular command:

    $ perf stat ./hackbench 10
    Time: 4.174

     Performance counter stats for './hackbench 10':

	8134.135358  task-clock-msecs     #      1.859 CPUs
	      23524  context-switches     #      0.003 M/sec
	       1095  CPU-migrations       #      0.000 M/sec
	      16964  page-faults          #      0.002 M/sec
	10734363561  cycles               #   1319.669 M/sec
	12281522014  instructions         #      1.144 IPC
	  121964514  cache-references     #     14.994 M/sec
	   10280836  cache-misses         #      1.264 M/sec

	4.376588249  seconds time elapsed.
This summarizes the performance events that occurred while running the hackbench micro-benchmark program. There is a combination of hardware events (cycles, instructions, cache-references, and cache-misses) and software events (task-clock-msecs, context-switches, CPU-migrations, and page-faults); the latter are derived from kernel code rather than from the processor-specific performance monitoring unit (PMU). Currently, support for hardware events is available for Intel, AMD, and PowerPC PMUs, but other architectures still have support for the software events.
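Individual events can also be selected explicitly, and hardware and software events can be mixed in a single invocation, since the kernel presents both through the same interface. A minimal sketch, assuming the event names and the repeatable -e option of the 2.6.31-era tool (spellings may vary between versions):

    # count only the named events; -e may be given multiple times
    $ perf stat -e cycles -e instructions -e context-switches ./hackbench 10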

There is also a top-like mode for observing which kernel functions are being executed most frequently in a continuously updating display:

    $ perf top -c 1000 -p 3216

    ------------------------------------------------------------------------------
       PerfTop:     360 irqs/sec  kernel:65.0% [1000 cycles],  (target_pid: 3216)
    ------------------------------------------------------------------------------

		 samples    pcnt         RIP          kernel function
      ______     _______   _____   ________________   _______________

		 1214.00 -  5.3% - 00000000c045cb4d : lock_acquire
		 1148.00 -  5.0% - 00000000c045d1d3 : lock_release
		  911.00 -  4.0% - 00000000c045d377 : lock_acquired
		  509.00 -  2.2% - 00000000c05a0cbc : debug_locks_off
		  490.00 -  2.2% - 00000000c05a2f08 : _raw_spin_trylock
		  489.00 -  2.1% - 00000000c041d1d8 : read_hpet
		  488.00 -  2.1% - 00000000c04419b8 : run_timer_softirq
		  483.00 -  2.1% - 00000000c04d5f72 : do_sys_poll
		  477.00 -  2.1% - 00000000c05a34a0 : debug_smp_processor_id
		  462.00 -  2.0% - 00000000c043df85 : __do_softirq
		  404.00 -  1.8% - 00000000c074d93f : sub_preempt_count
		  353.00 -  1.5% - 00000000c074d9d2 : add_preempt_count
		  338.00 -  1.5% - 00000000c0408a76 : native_sched_clock
		  318.00 -  1.4% - 00000000c074b4c3 : _spin_lock_irqsave
		  309.00 -  1.4% - 00000000c044ea10 : enqueue_hrtimer
This is a static version of the output from looking at a largely quiescent firefox process (pid 3216), sampling every 1000 cycles.

There is quite a bit more that perf can do. The record sub-command gathers performance counter data into a perf.data file, which can then be used by other commands:

    $ perf record ./hackbench 10
    Time: 4.348
    [ perf record: Captured and wrote 2.528 MB perf.data (~110448 samples) ]

    $ perf report --sort comm,dso,symbol | head -15

    #
    # (110146 samples)
    #
    # Overhead           Command  Shared Object              Symbol
    # ........  ................  .........................  ......
    #
	10.70%         hackbench  [kernel]                   [k] check_bytes_and_report
	 9.07%         hackbench  [kernel]                   [k] slab_pad_check
	 5.67%         hackbench  [kernel]                   [k] on_freelist
	 5.28%         hackbench  [kernel]                   [k] lock_acquire
	 5.03%         hackbench  [kernel]                   [k] lock_release
	 3.19%         hackbench  [kernel]                   [k] init_object
	 3.02%         hackbench  [kernel]                   [k] lock_acquired
	 2.47%         hackbench  [kernel]                   [k] _raw_spin_trylock
This output shows the top eight kernel functions executed while running hackbench. The same data file can also be used by perf annotate (when given a symbol name and the appropriate vmlinux file) to show the disassembled code for a function, along with the number of samples recorded on each instruction. There is clearly a wealth of information that can be derived from the tool.
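As a rough sketch of that annotate step (the -k option for naming a symbol-bearing kernel image and the bare symbol-name argument are assumptions based on the 2.6.31-era tool; /path/to/vmlinux is a stand-in):

    # disassemble lock_acquire, annotating each instruction with the
    # number of samples recorded on it in the perf.data file above
    $ perf annotate -k /path/to/vmlinux lock_acquire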

The original posting of the perfcounters patches came as something of a surprise to Stéphane Eranian, who had long been working on another performance monitoring solution, "perfmon". While he is still a bit skeptical of perfcounters, which were originally proposed by Ingo Molnar and Thomas Gleixner, he has been reviewing the patches and providing lengthy comments. Molnar also responded at length, breaking his reply into multiple chunks that can be found in the thread.

Perfmon was targeted at exposing as much of the underlying PMU data as possible to user space, but Molnar explicitly rejects that goal:

So for every "will you support advanced PMU feature X, Y and Z" question you ask, the first-level answer is: 'please show the developer usecase and integrate it into our tools so we can see how it all works and how useful it is'.

"A tool might want to do this" is not a good enough answer. We now have a working OSS tool-space with 'perf' where such arguments for more PMU features can be made in very specific terms: patches, numbers and comparisons. Actual hands-on utility, happy developers and faster apps is what matters in the end - not just the list of PMU features we expose.

His focus, presumably shared with his co-maintainers Peter Zijlstra and Paul Mackerras, is to generalize performance measurement features so that they are not dependent on any particular CPU and that they fit well with developer work flow: "I do claim we had few if any sane performance analysis tools before under Linux, and i think we are still in the stone ages and still have a lot of work to do in this area." From Molnar's perspective, that ease of use for users and developers is one of the main areas where perfmon fell short.
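That generalization shows up directly in the tool. As a hedged illustration (the perf list sub-command and the generic event names are assumptions based on contemporary versions of the tool), the same event names can be used on any supported PMU:

    # enumerate the generalized event names the kernel exposes,
    # independent of the underlying PMU's programming details
    $ perf list

    # the generic names work unchanged across Intel, AMD, and PowerPC
    $ perf stat -e cycles -e branches ./hackbench 10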

Molnar is not shy about pointing out that perfcounters still needs a lot of work, but the framework is there, so features can be added to it. As yet, there is no documentation in the kernel Documentation/ directory, but one presumes that will be handled sometime soon. Overall, perfcounters and the perf tool look to be a highly useful addition to the kernel, one that should start providing benefits—in the form of better performance—in the near term.



Perfcounters added to the mainline

Posted Jul 2, 2009 3:39 UTC (Thu) by jreiser (subscriber, #11027) [Link]

Linux does not use coloring algorithms when allocating page frames to virtual addresses. On popular architectures with large caches and small pages and small associativity (including x86 of the last few years) most statistics on cache misses are subject to large variances because of the uneven mapping of virtual address space to physical cache lines. The statistics have only small predictive value, and thus are little help to application programmers. I have seen the numbers vary by 15% for consecutive invocations of the same long-running CPU-bound deterministic program on the same data (the sequence of virtual addresses accessed by each run was identical) on an "otherwise idle" machine. Do not use the output of perf stat to tune your application.

Perfcounters added to the mainline

Posted Jul 2, 2009 14:14 UTC (Thu) by zlynx (subscriber, #2285) [Link]

I disagree. I used oprofile, not perf, but the principle is the same. Looking at areas of high cache misses can tell you where to add cache prefetch code. This always helps, virtual memory layout oddities or not.

Perfcounters added to the mainline

Posted Jul 3, 2009 23:38 UTC (Fri) by mingo (subscriber, #31122) [Link]

Correct.

Another thing to note is that perf stat has a '--repeat N' parameter. This option directs perf stat to run the measured command N times. It saves the various counter results, and then emits basic (avg, std-dev) statistics about them.

For example, running the 'hackbench' messaging benchmark 10 times gives:

aldebaran:~> perf stat --repeat 10 ./hackbench 10
Time: 0.121
Time: 0.091
Time: 0.114
Time: 0.094
Time: 0.090
Time: 0.095
Time: 0.094
Time: 0.107
Time: 0.094
Time: 0.095

 Performance counter stats for './hackbench 10' (10 runs):

    1259.878957  task-clock-msecs         #     10.597 CPUs    ( +-   1.799% )
          51812  context-switches         #      0.041 M/sec   ( +-   5.103% )
           3519  CPU-migrations           #      0.003 M/sec   ( +-   4.915% )
          17870  page-faults              #      0.014 M/sec   ( +-   0.392% )
     3802645216  cycles                   #   3018.262 M/sec   ( +-   1.747% )
     1588586719  instructions             #      0.418 IPC     ( +-   0.837% )
       16885948  cache-references         #     13.403 M/sec   ( +-   1.503% )
        7328059  cache-misses             #      5.816 M/sec   ( +-   1.773% )

    0.118889101  seconds time elapsed   ( +-   3.398% )


This shows us the statistical properties of the counters. If your system is 'noisy', or if the metric is a fundamentally volatile one (cycles, or cache-misses), the noise level will be higher.

Other metrics such as instructions or branches executed are a lot more stable.

But for any of the metrics, 'perf stat --repeat 10' gives you a good guess about how reliable that metric is on that particular system.

Somewhat surprisingly, for this particular workload, the noisiest metrics are 'context-switches' and 'CPU-migrations', which measure the number of task switches and the number of cross-CPU task migrations. (These are not PMU metrics but perfcounter metrics offered by the kernel.)

(The reason for the noise here is that hackbench starts and stops a lot of tasks in a bursty way, and any noise in the initial conditions gets magnified by the chance placement of tasks. 100 msecs is not a lot of time to run, so depending on when the scheduler's balancing algorithm kicks in, the placement of tasks is randomized to a certain degree (due to the high overload) and the metric gets spread out.)

The conclusion is that noisy metrics are just as useful as stable metrics, as long as you can measure the noise and as long as you know how to reduce the noise to acceptable levels. Modern CPUs with huge caches and complex heuristics are fundamentally random in their characteristics, so deterministic results can rarely be expected.

Call-graph / call-chain support

Posted Jul 5, 2009 10:28 UTC (Sun) by mingo (subscriber, #31122) [Link]

btw., another thing worth mentioning about perfcounters is turn-key call-graph support and call-graph visualization:

 $ perf record -g -f ./pipe-test-1m
 [ perf record: Captured and wrote 8.169 MB perf.data (~356901 samples) ]

 $ perf report --sort symbol --callchain fractal,5 | cat

 #
 # (80245 samples)
 #
 # Overhead  Symbol
 # ........  ......
 #
     4.50%  [k] pipe_read
                |          
                 --99.00%-- do_sync_read
                           vfs_read
                           sys_read
                           system_call_fastpath
                           __GI___libc_read
                           __libc_start_main

     4.39%  [.] main

     4.27%  [k] __switch_to
                |          
                |          |          
                |          |--50.97%-- __GI___libc_write
                |          |          
                |           --49.06%-- __GI___libc_read
                |                     __libc_start_main
                |          
                 --11.19%-- thread_return
                           |          
                           |--51.44%-- __GI___libc_write
                           |          
                            --48.83%-- __GI___libc_read
                                      __libc_start_main

     3.75%  [k] copy_user_generic_string
                |          
                |--52.11%-- do_sync_read
                |          vfs_read
                |          sys_read
                |          system_call_fastpath
                |          __GI___libc_read
                |          __libc_start_main
                |          
                 --45.39%-- pipe_write
                           do_sync_write
                           vfs_write
                           sys_write
                           system_call_fastpath
                           __GI___libc_write
                           __libc_start_main

     3.36%  [k] avc_has_perm_noaudit
                |          
                 --96.59%-- avc_has_perm
                           inode_has_perm
                           file_has_perm
                           selinux_file_permission
                           security_file_permission
                           rw_verify_area
                           |          
                           |--51.34%-- vfs_read
                           |          sys_read
                           |          system_call_fastpath
                           |          __GI___libc_read
                           |          __libc_start_main
                           |          
                            --48.66%-- vfs_write
                                      sys_write
                                      system_call_fastpath
                                      |          
                                       --99.53%-- __GI___libc_write
                                                 __libc_start_main

     3.29%  [k] schedule
                |          
                |--50.34%-- sysret_careful
                |          __GI___libc_write
                |          
                 --46.55%-- pipe_wait
                           pipe_read
                           do_sync_read
                           vfs_read
                           sys_read
                           system_call_fastpath
                           __GI___libc_read
                           __libc_start_main

     2.89%  [k] switch_mm
                |          
                 --97.67%-- schedule
                           |          
                           |--50.75%-- pipe_wait
                           |          pipe_read
                           |          do_sync_read
                           |          vfs_read
                           |          sys_read
                           |          system_call_fastpath
                           |          __GI___libc_read
                           |          __libc_start_main
                           |          
                            --49.29%-- sysret_careful
                                      __GI___libc_write

     2.85%  [.] __GI___libc_write

     2.70%  [.] __GI___libc_read

     2.60%  [k] file_has_perm
                |          
                |--93.67%-- selinux_file_permission
                |          security_file_permission
                |          rw_verify_area
                |          |          
                |          |--55.02%-- vfs_write
                |          |          sys_write
                |          |          system_call_fastpath
                |          |          |          
                |          |           --99.81%-- __GI___libc_write
                |          |                     __libc_start_main
                |          |          
                |           --44.98%-- vfs_read
                |                     sys_read
                |                     system_call_fastpath
                |                     __GI___libc_read
                |                     __libc_start_main
                |          
                 --6.33%-- security_file_permission
                           rw_verify_area
                           |          
                           |--52.27%-- vfs_write
                           |          sys_write
                           |          system_call_fastpath
                           |          |          
                           |          |--97.10%-- __GI___libc_write
                           |          |          __libc_start_main
                           |          |          
                           |           --4.35%-- __write_nocancel
                           |          
                            --47.73%-- vfs_read
                                      sys_read
                                      system_call_fastpath
                                      __GI___libc_read
                                      __libc_start_main

     2.15%  [k] pipe_write
                |          
                 --98.31%-- do_sync_write
                           vfs_write
                           sys_write
                           system_call_fastpath
                           __GI___libc_write
                           __libc_start_main

     2.05%  [k] system_call
                |          
                |--50.64%-- __GI___libc_write
                |          |          
                |           --49.46%-- __libc_start_main
                |          
                 --49.24%-- __GI___libc_read
                           __libc_start_main

Here we record and output full call-chains (down to and including user-space call-chains) and display the overhead in a tree - detailing the call-path that results in that profile entry - and recursively so. (The '5' is a 5% filter, to skip entries below a 5% relative-overhead threshold.)

For example this portion:

     3.75%  [k] copy_user_generic_string
                |          
                |--52.11%-- do_sync_read
                |          vfs_read
                |          sys_read
                |          system_call_fastpath
                |          __GI___libc_read
                |          __libc_start_main
                |          
                 --45.39%-- pipe_write
                           do_sync_write
                           vfs_write
                           sys_write
                           system_call_fastpath
                           __GI___libc_write
                           __libc_start_main
This tells us that in this workload there's a combined overhead of 3.75% from user-copies (copy_user_generic_string()); ~52% of that overhead comes from a user-space read() and ~45% from a user-space write() call.

With traditional 'flat' profiling output we'd only know that there's 3.75% overhead in copy_user_generic_string() - we would not know where it comes from.

Call-graph / call-chain support

Posted Jul 5, 2009 11:52 UTC (Sun) by njs (guest, #40338) [Link]

That's lovely. What would be even more lovely would be the ability to output KCacheGrind's format (documentation), or at least something easily parseable (with less ascii art) and thus mungeable into said format, if that's possible?

Perfcounters added to the mainline

Posted Jul 2, 2009 14:59 UTC (Thu) by deater (subscriber, #11746) [Link]

A few comments.

* Not all Intel chips have full hardware counter support under perf. Most notably, Pentium Pro/II/III or Pentium 4. One might argue that these are old and don't matter, but they are supported by perfmon2. Pentium 4 is the troublesome one, because the performance counters for that architecture don't map well at all to the abstraction chosen by the perf developers.

* I find calling perf a "simple" command line tool to be a bit deceptive. It is quite complicated and not very well documented yet.

* There is still some lingering bitterness about how Ingo took over perfcounters, sort of the same way he took over amd64 and the CFS scheduler, mainly because he is re-inventing everything from scratch and making many mistakes that the other implementations already learned the hard way. Also, the perf implementation does a lot of things, such as abusing ioctl()s, that the perfmon2 developers were told would not be allowed in the kernel; they wasted a lot of time working around those restrictions, only to find out that it didn't matter in the end.

* I will admit the perf developers can be helpful, especially if you bug them enough. Upon prompting, they've reduced the static aggregate count overhead in the perf tool from a few thousand instructions to near zero.

* I personally think the abstraction they chose, of having "common" counters hard-wired in the kernel, is a bad one because, as they are already finding out, every chip and chip revision has different counters with different issues. perfmon2 took the saner route of doing this in user space; once 2.6.31 is released the ABI is frozen and we're going to be stuck with this.

Perfcounters added to the mainline

Posted Jul 2, 2009 20:01 UTC (Thu) by ejr (subscriber, #51652) [Link]

Not only is he re-inventing the wheel, he's forcing users to add yet another perfctr-alike support layer. PAPI has been around a long time and has many users, but the package and its users keep being declared not to exist. Bizarre. Many performance counter tools are one-shots meant to work for a particular application or stack. That's one reason why we (users) don't have a "flagship application" to wave in front of other developers. We want a flexible way to dig at the hardware without waiting for kernel developer X to decide a particular counter is worth-while. Let us deal with things in user space.

Perfcounters added to the mainline

Posted Jul 2, 2009 21:53 UTC (Thu) by deater (subscriber, #11746) [Link]

There is hope that PAPI and pfmon (the perfmon2 tool) will soon be ported to perfcounters, so, assuming the features they need are available, things might work out in the end once things stabilize for a few months/years.

It will be nice that _finally_ performance counters will be available under Linux without having to patch the kernel. It's just a shame that it happened the way it did.

Perfcounters added to the mainline

Posted Jul 3, 2009 22:54 UTC (Fri) by mingo (subscriber, #31122) [Link]

* I personally think the abstraction they chose, of having "common" counters hard-wired in the kernel, is a bad one because, as they are already finding out, every chip and chip revision has different counters with different issues. perfmon2 took the saner route of doing this in user space; once 2.6.31 is released the ABI is frozen and we're going to be stuck with this.

Not really. The days of weird x86 PMUs changing with every CPU model are gone, fortunately.

The AMD PMU programming low level details have stayed pretty stable since around the K7 or so - for many years.

Intel has also introduced 'architectural performance monitoring' starting with the Core2 (and has extended it in the Core i7), which is also a future-proof method.

Also, even assuming weird PMUs, the perfcounters ABI does not hard-code low level details like that. Proof of this is in the fact that most Power CPUs are supported by perfcounters - which all have PMUs that are wildly different from x86 PMUs.

The reason perfcounters abstracts away common events is utility: it is convenient for tools to use a 'cycles' or a 'branches executed' event regardless of which CPU model they are running on. perfmon/pfmon, despite many years of development, never achieved this kind of basic utility.

Perfcounters added to the mainline

Posted Jul 3, 2009 23:09 UTC (Fri) by mingo (subscriber, #31122) [Link]

* There is still some lingering bitterness about how Ingo took over perfcounters [...]

(Just a question - from your post i gather that you are an (ex?) perfmon developer. If yes then i'm not surprised that you feel bitter about it - still it would have been nice had you openly disclosed your direct involvement and bias in this matter.)

The thing is, perfmon, despite being available for years, never achieved any measurable usage amongst kernel developers. You can check this yourself, just type: "git log --grep=pfmon" in an upstream kernel repository. It comes up empty: no-one ever found it important to mention pfmon in an upstream kernel changelog. Not one commit out of more than 150,000 upstream kernel commits ever mentioned that pfmon was used to measure something or to solve a problem.

The reason? I cannot speak for other kernel developers but i have my guesses: i tried it, and the thing is close to unusable for kernel developers. It has way too much overhead to measure workloads with lots of tasks and lots of overhead. It relies on a fat and quirky library and takes way too much effort to install and use. Its sampling (profiling) does not read ELF symbols as far as i could see.

perfmon had many other design problems as well (not directly visible to users) - i explained the reasons in the (many) posts i wrote about perfcounters in the past ~6 months. The main failure was that it tried to abstract at way too low a level, on an almost per-register basis - without the kernel having any knowledge about the structure of the hardware it abstracts away.

That design choice is lethal: it has shut off many interesting capabilities that perfcounters offers here and today: inherited counters, software counters, transparent workload monitoring, nested counters, various context-switch performance optimizations, etc. etc.

Perfcounters added to the mainline

Posted Jul 3, 2009 23:24 UTC (Fri) by mingo (subscriber, #31122) [Link]

* Not all Intel chips have full hardware counter support under perf. Most notably, Pentium Pro/II/III or Pentium 4. One might argue that these are old and don't matter, but they are supported by perfmon2. Pentium 4 is the troublesome one, because the performance counters for that architecture don't map well at all to the abstraction chosen by the perf developers.

Those chips are indeed old, abandoned, and don't matter. perfmon has support for them partly because perfmon was started when those chips were still relevant. Alas, IMO, this also created the wrong design and the wrong mindset for perfmon.

So in a sense, perfcounters was lucky to have come later, when saner PMUs emerged on x86.

The central concept of perfmon is to expose the PMU to user-space and to push all the complexity to user-space.

The central concept of perfcounters is to provide rich, kernel-based abstractions to measure performance characteristics of a Linux system in a coherent, unified framework - regardless of whether the information comes from a PMU, a software counter, a tracepoint or some data field somewhere.

Those are two wildly different and fundamentally conflicting sets of design goals.

But you would be wrong to suggest that P4 support is not possible under perfcounters. For example PowerPC support (which was cited as the primary counter argument against perfcounters in the perfmon vs. perfcounters discussions) is alive, well and kicking under perfcounters.

The reason why people are not rushing to implement perfcounters for the P4 is probably that Core2 and later CPUs are just so different in their performance profile from the abandoned Netburst architecture. They are also a lot more pleasant CPUs from many perspectives.

It also makes little sense to profile on CPUs that are too old, as any performance optimization would have to be re-validated on more recent CPUs anyway. People who care about performance tend to try to stay on the hardware edge, and don't tend to use obsolete systems.

As the years advance, perfcounters' currently 'cutting edge' PMU support will create the same kind of backwards-pointing trail of CPU models. Or, if anyone cares about P4 PMU support, it can be implemented just fine as well - patches are certainly welcome.

Perfcounters added to the mainline

Posted Jul 3, 2009 4:14 UTC (Fri) by njs (guest, #40338) [Link]

Does anyone know what is meant by oprofile being an "abject failure"? (This is an honest question; I use it all the time and for my use it's awesome.)

Perfcounters added to the mainline

Posted Jul 3, 2009 6:56 UTC (Fri) by graydon (subscriber, #5009) [Link]

Caveat: I use oprofile every couple of days, and I helped write some of the drivers for it. Still, I'm sympathetic to complaints about it. From what I've observed:

- There's a lot of drift between userspace and kernelspace. Event names and communication mechanisms seem to change from release to release. I've had to edit the shell scripts and source code repeatedly on installed versions.

- Apparently the kernel people wanted a much faster cycle time on new hardware support, and oprofile userspace is "lightly maintained". This hasn't been as much of a problem for me since I stay behind the curve intentionally, and mostly just want to monitor clocks, CPI, and crude cache and branch-mispredict hotspots; but I also know where to patch in userspace if necessary.

- There is further drift against libbfd, gcc, and the various toolchain pieces involved in mapping samples to reasonable debug information. I don't expect the kernel developers to do much more on this than snark about how userspace sucks; but who knows, maybe they'll write another toolchain and be free at last.

- The drift has been sufficiently bad that most non-kernel, non-oprofile developers on linux I talk to have no idea how to "get oprofile working" and need to be hand-held through it. And when they do get it working and it suddenly stops (or lies, which it can do if the drift is *just shy* of enough to break it), they don't know how to fix it.

- As an additional data point, I've seen two (perhaps three now?) separate projects spring up that ship *both* their own batch of replacement userspace tools *and* their own forked copy of the oprofile kernel module, in order to keep the interface, expectations and capabilities pinned down.

All that aside though, I really do use oprofile all the time and find it (and kcachegrind) *way* more useful for performance tuning than, say, shark or vtune. Totally worth the price of admission. The other-platform competition seems to lie, crash, wedge, and otherwise misbehave even worse.

Maybe there were some other complaints I missed?

Perfcounters added to the mainline

Posted Jul 3, 2009 9:04 UTC (Fri) by njs (guest, #40338) [Link]

Thanks for the explanation. On further thought, the tools *could* be somewhat better. I do now recall stalking people in #oprofile a few years ago to get them to explain what some of the output fields actually were, and the upstream oprofile-to-kcachegrind script is ~useless. (But then, that's partly my fault for not sending patches. I wouldn't want to live with oprofile long without real call tree visualization.)

That doesn't seem like the sort of thing the kernel folks would deign to notice, though.

Certainly it makes sense in general to keep coupled tools together, to reduce drift and lower the barrier to entry. When it comes to maintaining an ABI, though, I think I trust the kernel folks (when they decide they care) a bit more than binutils...

Perfcounters added to the mainline

Posted Jul 10, 2009 14:36 UTC (Fri) by oak (guest, #2786) [Link]

> the upstream oprofile-to-kcachegrind script is ~useless ...
> I wouldn't want to live with oprofile long without real call tree visualization.

Because of that, I've used this:
http://code.google.com/p/jrfonseca/wiki/Gprof2Dot

(Not as nice as having Kcachegrind + code browsing, but mostly good enough.)

Perfcounters added to the mainline

Posted Jul 5, 2009 10:04 UTC (Sun) by mingo (subscriber, #31122) [Link]

As yet, there is no documentation in the kernel Documentation/ directory, but one presumes that will be handled sometime soon.

It can be found under: tools/perf/design.txt

It moved to that place when Documentation/perf_counter/ moved to tools/perf/. I suspect we could move the .txt file back.

Perfcounters added to the mainline

Posted Jul 5, 2009 10:41 UTC (Sun) by mingo (subscriber, #31122) [Link]

    $ perf top -c 1000 -p 3216
[...]

This is a static version of the output from looking at a largely quiescent firefox process (pid 3216), sampling every 1000 cycles.

A sidenote: for the profiling of largely idle workloads one can use 'auto-frequency counters'. These are counters where the kernel does not use fixed-period sampling, but adapts the sampling period dynamically to the workload's intensity.

This can be done via the -F/--freq parameter to perf record and perf top:

   $ perf top -F 1000 -p $(pidof firefox-bin)

This will sample Firefox at 1 KHz, regardless of its intensity. If Firefox executes a lot, the cycle-sampling intervals increase automatically - if Firefox is more idle, they shorten.

(All the perf tools handle such type of 'dynamic samples' correctly so the resulting profile will not be skewed by workload fluctuations.)

The advantage of auto-freq counters is convenience: one does not have to guess the '1000 cycles' magic interval that you had to use in your example above (you probably first tried the default 100,000 cycles, saw that no output was coming, then went down to 10,000 and then to 1000?) - and the sampling will also be more workload-uniform.

What about GPU perf counters?

Posted Jul 10, 2009 15:07 UTC (Fri) by oak (guest, #2786) [Link]

Regarding the "Infrastructure for tracking driver performance events" patch article listed later on the LWN kernel page... Why don't Intel GPUs (or just that particular GPU?) have performance counters? Or, if they actually do have them, could this perfcounters thing expose them in addition to the CPU counters?

It would be pretty important information for anything 3D-related, maybe even just for today's composited desktops.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds