Not logged in
Log in now
Create an account
Subscribe to LWN
LWN.net Weekly Edition for May 23, 2013
An "enum" for Python 3
An unexpected perf feature
LWN.net Weekly Edition for May 16, 2013
A look at the PyPy 2.0 release
unikernels and unified projects
Posted Mar 23, 2010 21:38 UTC (Tue) by mingo (subscriber, #31122)
Nah, no need to wait 5-10 years, as perf already exceeds Oprofile in quite a few areas:
- PEBS (preicise profiling) hw support on Intel CPUs
- supports multiple, overlapping profiling sessions done at once
- supports precise per task and per workload profiling (Oprofile only does system-wide profiling)
- working call-graph profiling
- integration with tracepoints (for example you can profile based not on some hardware counter such as cycles, but based in "IO submitted" events)
- 'perf diff': view differential profiles
- 'perf lock': lock profiling
- 'perf sched': scheduler profiling/tracing
- 'perf kmem': SLAB subsystem profiling
- 'perf probe': dynamic probe based tracing, profiling
- 'perf timechart': analyse delays
- superior usability ;-)
So yes, while it's still a very young project, it's working quite well and is growing fast. Also, i'm curious, which particular features/capabilities are you looking for?
Posted Mar 23, 2010 22:17 UTC (Tue) by ejr (subscriber, #51652)
Posted Mar 23, 2010 22:49 UTC (Tue) by mingo (subscriber, #31122)
If you are arguing that those are not important then i disagree with you.
(/dev/perfctr on the other hand was an external driver that never saw enough use to find someone to push it upstream.)
[...] You refuse to do your homework and keep pulling out strawmen. Look up Paradyn, TAU, and PAPI.
FYI, PAPI has already been ported to the perf syscall, and both TAU and Paradyn are PAPI based - so they should work. I cannot see wide use of them beyond the HPC world, can you cite examples of big OSS projects making use of them?
In any case tools/perf/ does not try to be the exclusive user of the perf syscall - there are other user-space tools that make use of it.
Also, could you please skip the insults and the condescending tone?
Posted Mar 24, 2010 18:29 UTC (Wed) by deater (subscriber, #11746)
Posted Mar 26, 2010 9:43 UTC (Fri) by mingo (subscriber, #31122)
I'm still waiting for perf to have half the features of the
pfmon tool that came with perfmon2.
I'm still waiting for perf to have half the features of the
pfmon tool that came with perfmon2.
I know pfmon and as far as i'm aware perf currently has a lot more features than pfmon ever had.
That's obviously in stark contrast with your statement - so could you please cite the features that perf doesnt have? Either we missed them (and thus will implement them) or they are implemented but you are not aware of them.
Posted Mar 26, 2010 21:41 UTC (Fri) by deater (subscriber, #11746)
Being able to specify an output file that is nice ascii text, dumped every time a counter hits a certain threshold. This is probably possible with perf, but the perf --help is hard to follow.
It would be nice to have a perf-only mailing list, but I guess that will never happen.
It would also be nice to be able to build perf without having to download the entire kernel tree, I often don't have the space or the bandwidth for hundreds of megabytes of kernel source when I just want to quick build perf on a new machine.
Posted Mar 23, 2010 22:22 UTC (Tue) by ejr (subscriber, #51652)
Posted Mar 23, 2010 23:02 UTC (Tue) by mingo (subscriber, #31122)
Most of those tools are PAPI based, and PAPI has been ported to the perf syscall as well - so there should be no problem whatsoever.
So, tools/perf/ is in no way an exclusive user of performance events - if that is your worry/argument. It uses the perf events system call, and that system call has a strict ABI and other tools can make use of it (and are making use of it today).
Finding a profiler that works, damnit
Posted Mar 23, 2010 23:19 UTC (Tue) by foom (subscriber, #14868)
Over the last week, I've tried just about every profiler I can get my hands on, in hopes of finding
works properly on x86-64. And I did find one. Just one. And it doesn't do system profiling. If you
know some secret way of making perf actually work, do tell.
gprof: Old standby -- I used to use this all the time back in the day... Unfortunately,
works on staticly linked programs. Not only does it not report functions in shared libraries, it
IGNORES any samples where the PC was outside the staticly linked area entirely. So, pretty much
sprof: Doesn't even work at all. Not even a little bit. Completely useless, waste of disk
oprofile: Works fine unless you want callgraphs. Then you need to recompile
everything with -
fno-omit-frame-pointers. Then it works okay. Also, can get *really* confused when you have
which is rather annoying. Not particularly practical, due to recompilation requirement. Also found
to convert to callgrind format (except a hacky script which only does flat profiles)
perf: I upgraded to kernel 2.6.32, and installed this, cause I heard it was the great new
but it *also* only seems to output proper callgraphs when programs (and all libraries...) are
-fno-omit-frame-pointer. Furthermore, even when having done that, I can't make heads or tails
callgraph output report. It is almost 100% unintelligible. It looks nothing like any callgraph
ever seen before. I also found no way to convert to output to callgrind format. Bonus fail:
valgrind callgrind: Best so far. Has a *GREAT* UI (kcachegrind), and outputs a reliable
Downside: program runs 100 times slower. (ARGH!).
google-perftools libprofile: Seems to actually work. Hooray! Bonus: Outputs callgrind
format, so I
can use kcachegrind. Only downside: no system profiling.
Posted Mar 24, 2010 0:30 UTC (Wed) by chantecode (subscriber, #54535)
but it *also* only seems to output proper callgraphs when programs (and all libraries...) are compiled with -fno-omit-frame-pointer.
Sure, how could it be another way? Without frame pointers you can't have reliable stacktraces. Or if you have a tip to go round this requirement I would be happy to implement it. The only arch I know that is able to walk the stack correctly without frame pointers is PowerPc.
Furthermore, even when having done that, I can't make heads or tails of the callgraph output report. It is almost 100% unintelligible. It looks nothing like any callgraph profile I've ever seen before.
The default output is a fractal statistical profile of the stacktraces, starting from the inner most caller (the origin of the event) to the outer most.
Let's look at a sample: http://tglx.de/~fweisbec/callchain_sample.txt
What this profile tells you is that the ls process, while entering the kernel, is part of 3.34 % of the total overhead, and the origin of this overhead is in the __lock_acquire function. Among all the callers of __lock_acquire() when it caused this overhead, lock_acquire() has been its caller 97 % of the time, and _raw_spin_lock() has been the caller of lock_acquire() 48 % of the time, etc...
This is why it is called a fractal profiling: each branch is a new profile on its own. _raw_spin_lock() is profiled relatively to its parent lock_acquire().
May be it doesn't look like other kind of callgraph profiler as you said. I just don't know as I haven't looked much at what other projects do. May be I took some inspiration from sysprof callgraphs, except sysprof does a outer most to inner most callgraph direction. The other direction we took for perf (from inner to outer) seems to me much more natural as we start from the hottest, deepest, most relevant origin to end up on the highest level origin.
But I can implement the other direction, shouldn't be that hard, I'm just not sure it will give us nice and useful results but it's worth the try, I think I'll put my hands on it soon.
BTW, we have a newt based TUI that can output the callgraph in a collapsed/expanded (toggable at will) fashion. May be that could better fit your needs.
An example here: http://tglx.de/~acme/perf-newt-callgraph.png
But you need to fetch the -tip tree for that as it's scheduled for 2.6.35
But please if you have suggestions to make our callgraphs better, tell us.
It would be very appreciated, we lack feedbacks in this area and it's still a young feature (although pretty functional).
I also found no way to convert to output to callgrind format.
You're right. I'll try to get this too once I get more time.
Posted Mar 24, 2010 1:11 UTC (Wed) by nix (subscriber, #2304)
Sure, how could it be another way? Without frame pointers you can't have
Posted Mar 24, 2010 1:25 UTC (Wed) by HelloWorld (guest, #56129)
Posted Mar 24, 2010 3:12 UTC (Wed) by joseph.h.garvin (subscriber, #64486)
Posted Mar 24, 2010 8:12 UTC (Wed) by ringerc (subscriber, #3071)
Of course, many profiling tools don't know about external debuginfo files...
Posted Mar 24, 2010 10:31 UTC (Wed) by nix (subscriber, #2304)
It's been some years since I've come across any tool that reads DWARF2 that doesn't know about at least one of these, and in the last year or so everything has learnt about both (including perf).
Since pretty much every distribution is now compiling everything with separated debugging information, anything that doesn't support it isn't going to be particularly useful.
Outside rare code (e.g. the occasional garbage collector) that actually needs frame pointers for correct operation, -fno-omit-frame-pointer is needless, and using -fomit-frame-pointer instead gives a substantial speedup on many common platforms. IIRC it's the default on x86-64, but I don't have a recent GCC source tree here so I can't be sure...
Posted Mar 25, 2010 4:23 UTC (Thu) by adsharma (subscriber, #8967)
libunwind supports multiple architectures and can unwind without frame
Posted Mar 24, 2010 1:55 UTC (Wed) by foom (subscriber, #14868)
GCC always includes an eh_frame section (unless you explicitly pass -
fnoasynchronous-unwind-tables, which of
should never do), which allows you to unwind reliably from any point within a program. This is
actually required by the X86-64
This data is never stripped, and actually is mapped into memory with the code and data of the
executable, so you can read it directly from the process's memory space. It's almost the
same format data as the DWARF debug info, but with a couple minor differences and with only a
subset of the functionality allowed.
See libunwind for a userspace library which
knows how to read and interpret this data. Google-perftools uses libunwind, which is how it's
able to work properly.
Posted Mar 24, 2010 2:25 UTC (Wed) by chantecode (subscriber, #54535)
We often save the stacktraces from an NMI path (covers all hardware pmu events). So this operation must be as fast and simple as possible.
With such DWARF based thing, we'd need to find where is mapped this section for the current task, we also need to be lucky enough for this page to be available (we can't do much from NMI, it's a best effort trial), and it may be unlikely considering this DWARF section is mostly used for debugging. Then if we are lucky enough, we need to parse all these DWARF infos, and finally walk the stack based on these info, which is probably much slower than a fast frame pointer walking.
This seems to me really too complicated, too slow and unlikely to work most of the time.
Posted Mar 24, 2010 3:19 UTC (Wed) by foom (subscriber, #14868)
Shrug. Sure, it's complicated, but it's totally necessary for the callgraph profiler to be useful at all
on x86-64. Without implementing it, callgraph profiling will continue to be broken for the most
popular architecture when attempting to profile a normal system installation with normal programs
If you don't think it's possible to implement support for EH_FRAME-based unwind, you might as
well say up front that callgraph profiling doesn't work, so people don't waste their time trying to
use it. (or, I guess gentoo users could at least waste their time usefully by recompiling the whole
system with -fno-omit-frame-pointers).
Posted Mar 24, 2010 10:34 UTC (Wed) by nix (subscriber, #2304)
Posted Mar 24, 2010 15:18 UTC (Wed) by foom (subscriber, #14868)
You can't convert a single address to a stacktrace later on! You'd need a copy of the whole stack to
do it offline...which I doubt is any faster than just running the unwinder to save out the PC of every
Anyways, thanks for mentioning sysprof: I hadn't heard of that one before. But looking at the
source, it doesn't seem like it's likely to work either, given the function heuristic_trace in:
Posted Mar 24, 2010 17:48 UTC (Wed) by nix (subscriber, #2304)
So it's call the unwinder or nothing, really. Unfortunately the job of figuring out what the call stack looks like really *is* quite expensive :/ any effort should presumably go into optimizing libunwind...
Posted Mar 26, 2010 17:55 UTC (Fri) by sandmann (subscriber, #473)
You are right that it doesn't generate good callgraphs on x86-64 unless you compile the application with -fno-omit-frame-pointer. I very much would like to fix this somehow, but I just don't see any really good way to do it.
Fundamentally, it needs to take a stack trace in the kernel in NMI context. You cannot read the .eh_frame information at that time because that would cause page faults and you are not allowed to sleep in NMI context.
Even if there were a way around that problem, you would still have to *interpret* the information which is a pretty hairy algorithm to put in the kernel (though Andi Kleen did exactly that I believe, resulting in flame wars later on).
You could try copying the entire user stack, but there is considerable overhead associated with that because user stacks can be very large (emacs for example allocates a 100K buffer on the stack). You could also try a heuristic stack walk (which is what an old version of sysprof - new versions use the same kernel interface as perf). That sort of works, but there is a lot of false positives from function pointers and left-over return addresses. The function pointers can be filtered out, but the return addresses can't. These false positives tend to make sysprof's UI display somewhat confusing, though not completely unusable.
You could also try some hybrid scheme where the kernel does a heuristic stack walk and userspace then uses the .eh_frame information to filter out the false positives. This is what I think is the most promising approach at this point. Some day I'll try it.
Finally, the distributions could just compile with -fno-omit-frame-pointer by default. The x86-64 is not really register-starved so it wouldn't be a significant performance problem. The Solaris compiler does precisely this because they need to take stack traces for dtrace.
But, I fully expect to be told that for performance reasons we can't have working profilers.
Posted Mar 26, 2010 18:06 UTC (Fri) by rahulsundaram (subscriber, #21946)
Posted Mar 26, 2010 21:44 UTC (Fri) by foom (subscriber, #14868)
Why does it need to happen at NMI time? Why can't you just do it in the user process's context,
before resuming execution of their code?
The setitimer(ITIMER_PROF) solution that userspace profilers use clearly works out fine for
userspace profiling. Can't you do something similar for userspace profiling from within the kernel?
The stack trace of the userspace half clearly can't change between when you received the NMI and
when you resume execution of the process...
That just leaves the complication of implementing the DWARF unwinder in the kernel, but there's
already much more complex code in the kernel...that really seems like it should be a non-issue.
Posted Mar 26, 2010 23:28 UTC (Fri) by nix (subscriber, #2304)
Posted Mar 27, 2010 14:10 UTC (Sat) by garloff (subscriber, #319)
Posted Mar 27, 2010 15:12 UTC (Sat) by foom (subscriber, #14868)
Posted Mar 28, 2010 0:26 UTC (Sun) by garloff (subscriber, #319)
Posted Apr 4, 2010 12:35 UTC (Sun) by chantecode (subscriber, #54535)
Posted Apr 5, 2010 1:37 UTC (Mon) by foom (subscriber, #14868)
Posted Mar 30, 2010 18:51 UTC (Tue) by fuhchee (subscriber, #40059)
Posted Mar 24, 2010 2:15 UTC (Wed) by foom (subscriber, #14868)
http://oprofile.sourceforge.net/doc/opreport.html (section 2.3. Callgraph output)
The important points being that it shows, linearly, a section of output for every function, sorted by
time (inclusive of subroutines). Each section includes, in addition to the line for the function itself,
1) above the function's line: who called it
2) below the function's line: who it calls
Maybe your method is better for some uses, but I can't quite get my ahead around how to
use it, even after your explanation of what it means. I don't expect I'm unique in this regard, since
basically all callgraph profilers I've seen work the same way.
Posted Mar 24, 2010 2:48 UTC (Wed) by chantecode (subscriber, #54535)
We have three kind of formats: flat (as in above link), fractal (as shown in my previous example) and graph (like fractal, except that percentages of overhead are absolute wrt to the whole overhead profiling, unlike fractal where each overhead percentage is relative to its parent branch).
syntax: perf report -g flat|fractal|graph
I've never heard about someone using the flat format. Everybody seem to use the fractal (probably because it's the default, and IMHO, the most useful). And because I've never heard about it, I've not tested it for a while so it works on 2.6.32 but seem to have been broken in 2.6.33 (the graph format too), but fractal works. I'm going to send a fix for that and get it to the 2.6.33 stable branch.
>Maybe your method is better for some uses, but I can't quite get my ahead around how to use it, even after your explanation of what it means. I don't expect I'm unique in this regard, since basically all callgraph profilers I've seen work the same way.
I must confess it's the first time I hear this complaint. That said may be it's a matter of playing a bit with it before being used to its layout.
Posted Mar 24, 2010 3:41 UTC (Wed) by foom (subscriber, #14868)
I'm not surprised about that: it looks like it's basically just the same as fractal, except without
combining the similar parts of call chains together. Doesn't seem useful at all. And that also isn't
what I want. (Sidenote: I'd have expected something named "flat" to actually show me a normal
flat profile; it seems a bit odd that perf report can't actually show a normal flat profile when
callgraph info was collected.)
I just like to see a list of the *direct* callers and callees of every function. (having both easily
available is an important component). I don't actually want to be shown every unique complete
call-chain starting from the named function: it's just too much data, and yet simultaneously not
enough, since it doesn't show callees.
I think the gprof docs explain its format pretty well, and that's basically exactly what I want to
see. It's a nice, concise, easy-to-understand summarization of the data that usually has enough
information to figure out where the problem is.
Or, heck, if you get callgrind output conversion working, kcachegrind also presents the data in
this way, so you wouldn't even have to implement the gprof-like output format. :)
Posted Mar 25, 2010 2:52 UTC (Thu) by chantecode (subscriber, #54535)
Yep, may be we can implement such layout mode too, or at least getting it through kcachegrind.
Posted Apr 1, 2010 20:42 UTC (Thu) by oak (subscriber, #2786)
Or similar callgraph visualization support in perf itself would be nice (including filtering options, highlighting the nodes based on CPU used by given node and the nodes called by it etc).
Posted Apr 4, 2010 12:44 UTC (Sun) by chantecode (subscriber, #54535)
Posted Apr 5, 2010 1:39 UTC (Mon) by foom (subscriber, #14868)
Posted Mar 24, 2010 7:11 UTC (Wed) by njs (guest, #40338)
Here's a hacky script that does callgraphs: http://roberts.vorpus.org/~njs/op2calltree.py
Posted Mar 24, 2010 8:10 UTC (Wed) by ringerc (subscriber, #3071)
gprof: Old standby -- I used to use this all the time back in the day... Unfortunately, only works on staticly linked programs.
gprof works fine on dynamically linked programs. In fact, it's awfully rare to find a genuinely statically linked program these days - most "static" programs dynamically link to libc via ld-linux.so and statically link in all the other libraries they use, as ldd will reveal. Not always, but usually.
I suspect what you were going for was "gprof only profiles one particular library or executable at a time, not a whole program including any shared libraries or plugins it uses." (Even that's not strictly true, it's just that you have to rebuild those plugins or shared libraries with gprof instrumentation too).
I'm totally with you on the general "argh" re callgraph profiling across multiple programs or a subset of the system, though.
Posted Mar 24, 2010 15:05 UTC (Wed) by foom (subscriber, #14868)
No, gprof really just doesn't work at all with shared libraries. If you link an executable with -pg,
you'll only get samples from when the PC is inside the base executable. Any times the timer tick
goes off when the PC is in another location, it's simply ignored. This is true even if you did
compile all libraries with -pg too.
So if you do run gprof with dynamically linked executable, your timing measures will all be
wrong, if you spend any time whatsoever inside shared libraries. E.g. I was profiling a program
where it turned out 1/4 the time was spent in libc and libstdc++ (shouldn't have been that
much..but that's what the profiler is for!). But using gprof, that time simply disappears, it's like
those functions took no time whatsoever even in the "inclusive" measures of higher level
functions that *are* in the executable. And the total run-time is also reported to be much
smaller than it actually is.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds