LWN.net Logo

Finding a profiler that works, damnit

Finding a profiler that works, damnit

Posted Mar 23, 2010 23:19 UTC (Tue) by foom (subscriber, #14868)
In reply to: unikernels and unified projects by mingo
Parent article: KVM, QEMU, and kernel project management

> - working call-graph profiling

Really?

Over the last week, I've tried just about every profiler I can get my hands on, in hopes of finding ONE that works properly on x86-64. And I did find one. Just one. And it doesn't do system profiling. If you know some secret way of making perf actually work, do tell.

gprof: Old standby -- I used to use this all the time back in the day... Unfortunately, only works on staticly linked programs. Not only does it not report functions in shared libraries, it simply IGNORES any samples where the PC was outside the staticly linked area entirely. So, pretty much useless.

sprof: Doesn't even work at all. Not even a little bit. Completely useless, waste of disk space.

oprofile: Works fine unless you want callgraphs. Then you need to recompile everything with - fno-omit-frame-pointers. Then it works okay. Also, can get *really* confused when you have chroots, which is rather annoying. Not particularly practical, due to recompilation requirement. Also found no way to convert to callgrind format (except a hacky script which only does flat profiles)

perf: I upgraded to kernel 2.6.32, and installed this, cause I heard it was the great new thing.... but it *also* only seems to output proper callgraphs when programs (and all libraries...) are compiled with -fno-omit-frame-pointer. Furthermore, even when having done that, I can't make heads or tails of the callgraph output report. It is almost 100% unintelligible. It looks nothing like any callgraph profile I've ever seen before. I also found no way to convert to output to callgrind format. Bonus fail: ungoogleable name.

valgrind callgrind: Best so far. Has a *GREAT* UI (kcachegrind), and outputs a reliable profile. Downside: program runs 100 times slower. (ARGH!).

google-perftools libprofile: Seems to actually work. Hooray! Bonus: Outputs callgrind format, so I can use kcachegrind. Only downside: no system profiling.


(Log in to post comments)

Finding a profiler that works, damnit

Posted Mar 24, 2010 0:30 UTC (Wed) by chantecode (subscriber, #54535) [Link]

but it *also* only seems to output proper callgraphs when programs (and all libraries...) are compiled with -fno-omit-frame-pointer.

Sure, how could it be another way? Without frame pointers you can't have reliable stacktraces. Or if you have a tip to go round this requirement I would be happy to implement it. The only arch I know that is able to walk the stack correctly without frame pointers is PowerPc.

Furthermore, even when having done that, I can't make heads or tails of the callgraph output report. It is almost 100% unintelligible. It looks nothing like any callgraph profile I've ever seen before.

The default output is a fractal statistical profile of the stacktraces, starting from the inner most caller (the origin of the event) to the outer most.

Let's look at a sample: http://tglx.de/~fweisbec/callchain_sample.txt

What this profile tells you is that the ls process, while entering the kernel, is part of 3.34 % of the total overhead, and the origin of this overhead is in the __lock_acquire function. Among all the callers of __lock_acquire() when it caused this overhead, lock_acquire() has been its caller 97 % of the time, and _raw_spin_lock() has been the caller of lock_acquire() 48 % of the time, etc...

This is why it is called a fractal profiling: each branch is a new profile on its own. _raw_spin_lock() is profiled relatively to its parent lock_acquire().

May be it doesn't look like other kind of callgraph profiler as you said. I just don't know as I haven't looked much at what other projects do. May be I took some inspiration from sysprof callgraphs, except sysprof does a outer most to inner most callgraph direction. The other direction we took for perf (from inner to outer) seems to me much more natural as we start from the hottest, deepest, most relevant origin to end up on the highest level origin.

But I can implement the other direction, shouldn't be that hard, I'm just not sure it will give us nice and useful results but it's worth the try, I think I'll put my hands on it soon.

BTW, we have a newt based TUI that can output the callgraph in a collapsed/expanded (toggable at will) fashion. May be that could better fit your needs. An example here: http://tglx.de/~acme/perf-newt-callgraph.png

But you need to fetch the -tip tree for that as it's scheduled for 2.6.35 (git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip.git)

But please if you have suggestions to make our callgraphs better, tell us. It would be very appreciated, we lack feedbacks in this area and it's still a young feature (although pretty functional).

I also found no way to convert to output to callgrind format.

You're right. I'll try to get this too once I get more time.

Finding a profiler that works, damnit

Posted Mar 24, 2010 1:11 UTC (Wed) by nix (subscriber, #2304) [Link]

Sure, how could it be another way? Without frame pointers you can't have reliable stacktraces.
GDB has been able to use DWARF2 information to produce stack traces without frame pointers for many releases now.

Finding a profiler that works, damnit

Posted Mar 24, 2010 1:25 UTC (Wed) by HelloWorld (guest, #56129) [Link]

How does that help? The problem seems to be that one has to recompile a lot of stuff in order to get stack traces. You have to recompile in order to get dwarf2 debugging information too.

Finding a profiler that works, damnit

Posted Mar 24, 2010 3:12 UTC (Wed) by joseph.h.garvin (subscriber, #64486) [Link]

It's slightly more usable, because people are usually doing debug builds already, but not -fno-omit-frame-pointer builds. It's also not uncommon to release optimized builds with debug symbols to make cores easier to understand in the wild.

Finding a profiler that works, damnit

Posted Mar 24, 2010 8:12 UTC (Wed) by ringerc (subscriber, #3071) [Link]

... and often those debug symbols are included as separate -dbg or -debug packages in the package management system, so if your tool knows how to read gdb external debug symbol files, you get to use the original binaries AND get debug info.

Of course, many profiling tools don't know about external debuginfo files...

Finding a profiler that works, damnit

Posted Mar 24, 2010 10:31 UTC (Wed) by nix (subscriber, #2304) [Link]

Virtually everything that reads debugging information knows how to use external debug files, both in /usr/lib/debug/$PATH.dbg format and in /usr/lib/debug/$BUILD_ID format, because both of these have been used by widely-released distributions (Ubuntu and Red Hat have both used various versions of these).

It's been some years since I've come across any tool that reads DWARF2 that doesn't know about at least one of these, and in the last year or so everything has learnt about both (including perf).

Since pretty much every distribution is now compiling everything with separated debugging information, anything that doesn't support it isn't going to be particularly useful.

Outside rare code (e.g. the occasional garbage collector) that actually needs frame pointers for correct operation, -fno-omit-frame-pointer is needless, and using -fomit-frame-pointer instead gives a substantial speedup on many common platforms. IIRC it's the default on x86-64, but I don't have a recent GCC source tree here so I can't be sure...

Finding a profiler that works, damnit

Posted Mar 25, 2010 4:23 UTC (Thu) by adsharma (subscriber, #8967) [Link]

libunwind supports multiple architectures and can unwind without frame
pointers.

http://www.nongnu.org/libunwind/

Finding a profiler that works, damnit

Posted Mar 24, 2010 1:55 UTC (Wed) by foom (subscriber, #14868) [Link]

>Sure, how could it be another way? Without frame pointers you can't have reliable stacktraces. Or if you have a tip to go round this requirement I would be happy to implement it. The only arch I know that is able to walk the stack correctly without frame pointers is PowerPc.

GCC always includes an eh_frame section (unless you explicitly pass - fnoasynchronous-unwind-tables, which of course you should never do), which allows you to unwind reliably from any point within a program. This is actually required by the X86-64 ABI.

This data is never stripped, and actually is mapped into memory with the code and data of the executable, so you can read it directly from the process's memory space. It's almost the same format data as the DWARF debug info, but with a couple minor differences and with only a subset of the functionality allowed.

See libunwind for a userspace library which knows how to read and interpret this data. Google-perftools uses libunwind, which is how it's able to work properly.

Finding a profiler that works, damnit

Posted Mar 24, 2010 2:25 UTC (Wed) by chantecode (subscriber, #54535) [Link]

Ok.Thanks for this pointer, I had no idea about this.
It doesn't look usable by perf though.

We often save the stacktraces from an NMI path (covers all hardware pmu events). So this operation must be as fast and simple as possible.
With such DWARF based thing, we'd need to find where is mapped this section for the current task, we also need to be lucky enough for this page to be available (we can't do much from NMI, it's a best effort trial), and it may be unlikely considering this DWARF section is mostly used for debugging. Then if we are lucky enough, we need to parse all these DWARF infos, and finally walk the stack based on these info, which is probably much slower than a fast frame pointer walking.

This seems to me really too complicated, too slow and unlikely to work most of the time.

Finding a profiler that works, damnit

Posted Mar 24, 2010 3:19 UTC (Wed) by foom (subscriber, #14868) [Link]

> This seems to me really too complicated, too slow and unlikely to work most of the time.

Shrug. Sure, it's complicated, but it's totally necessary for the callgraph profiler to be useful at all
on x86-64. Without implementing it, callgraph profiling will continue to be broken for the most
popular architecture when attempting to profile a normal system installation with normal programs
on it.

If you don't think it's possible to implement support for EH_FRAME-based unwind, you might as
well say up front that callgraph profiling doesn't work, so people don't waste their time trying to
use it. (or, I guess gentoo users could at least waste their time usefully by recompiling the whole
system with -fno-omit-frame-pointers).

Finding a profiler that works, damnit

Posted Mar 24, 2010 10:34 UTC (Wed) by nix (subscriber, #2304) [Link]

In any case, the fix is obvious. Dump an address out in the fast path, and convert it to a stacktrace later on. This is what absolutely everything else does, from GDB to valgrind to sysprof.

Finding a profiler that works, damnit

Posted Mar 24, 2010 15:18 UTC (Wed) by foom (subscriber, #14868) [Link]

> Dump an address out in the fast path, and convert it to a stacktrace later on.

...huh?

You can't convert a single address to a stacktrace later on! You'd need a copy of the whole stack to
do it offline...which I doubt is any faster than just running the unwinder to save out the PC of every
frame.

Anyways, thanks for mentioning sysprof: I hadn't heard of that one before. But looking at the
source, it doesn't seem like it's likely to work either, given the function heuristic_trace in:
http://git.gnome.org/browse/sysprof/tree/module/sysprof-m...

Finding a profiler that works, damnit

Posted Mar 24, 2010 17:48 UTC (Wed) by nix (subscriber, #2304) [Link]

Oh crap, you're right, of course. I plead temporary insanity caused by hours of sitting through mind-numbing presentations on stuff I already knew.

So it's call the unwinder or nothing, really. Unfortunately the job of figuring out what the call stack looks like really *is* quite expensive :/ any effort should presumably go into optimizing libunwind...

Finding a profiler that works, damnit

Posted Mar 26, 2010 17:55 UTC (Fri) by sandmann (subscriber, #473) [Link]

I am the author of sysprof.

You are right that it doesn't generate good callgraphs on x86-64 unless you compile the application with -fno-omit-frame-pointer. I very much would like to fix this somehow, but I just don't see any really good way to do it.

Fundamentally, it needs to take a stack trace in the kernel in NMI context. You cannot read the .eh_frame information at that time because that would cause page faults and you are not allowed to sleep in NMI context.

Even if there were a way around that problem, you would still have to *interpret* the information which is a pretty hairy algorithm to put in the kernel (though Andi Kleen did exactly that I believe, resulting in flame wars later on).

You could try copying the entire user stack, but there is considerable overhead associated with that because user stacks can be very large (emacs for example allocates a 100K buffer on the stack). You could also try a heuristic stack walk (which is what an old version of sysprof - new versions use the same kernel interface as perf). That sort of works, but there is a lot of false positives from function pointers and left-over return addresses. The function pointers can be filtered out, but the return addresses can't. These false positives tend to make sysprof's UI display somewhat confusing, though not completely unusable.

You could also try some hybrid scheme where the kernel does a heuristic stack walk and userspace then uses the .eh_frame information to filter out the false positives. This is what I think is the most promising approach at this point. Some day I'll try it.

Finally, the distributions could just compile with -fno-omit-frame-pointer by default. The x86-64 is not really register-starved so it wouldn't be a significant performance problem. The Solaris compiler does precisely this because they need to take stack traces for dtrace.

But, I fully expect to be told that for performance reasons we can't have working profilers.

Finding a profiler that works, damnit

Posted Mar 26, 2010 18:06 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link]

Have you tried asking about that in fedora devel list? Maybe, we can change
the compiler options for Fedora 14.

Finding a profiler that works, damnit

Posted Mar 26, 2010 21:44 UTC (Fri) by foom (subscriber, #14868) [Link]

I'm probably missing something, but...

Why does it need to happen at NMI time? Why can't you just do it in the user process's context,
before resuming execution of their code?

The setitimer(ITIMER_PROF) solution that userspace profilers use clearly works out fine for
userspace profiling. Can't you do something similar for userspace profiling from within the kernel?

The stack trace of the userspace half clearly can't change between when you received the NMI and
when you resume execution of the process...

That just leaves the complication of implementing the DWARF unwinder in the kernel, but there's
already much more complex code in the kernel...that really seems like it should be a non-issue.

Finding a profiler that works, damnit

Posted Mar 26, 2010 23:28 UTC (Fri) by nix (subscriber, #2304) [Link]

There is already a DWARF unwinder in the kernel (or was, and it could be
resurrected). The tricky part is making it paranoid enough to be
non-DoSable, even by hostile generators of DWARF2. IIRC the kernel
unwinder was ripped out by Linus because it kept falling over when
unwinding purely kernel stack frames...

Finding a profiler that works, damnit

Posted Mar 27, 2010 14:10 UTC (Sat) by garloff (subscriber, #319) [Link]

Finding a profiler that works, damnit

Posted Mar 27, 2010 15:12 UTC (Sat) by foom (subscriber, #14868) [Link]

Care to expand upon that link with some explanatory text?

Finding a profiler that works, damnit

Posted Mar 28, 2010 0:26 UTC (Sun) by garloff (subscriber, #319) [Link]

Sorry, that was somewhat terse. The Novell Kernel Debugger has stack
unwinding features built-in; so this is something that might be leveraged
in another project.

Finding a profiler that works, damnit

Posted Apr 4, 2010 12:35 UTC (Sun) by chantecode (subscriber, #54535) [Link]

We need to profile from NMI if we want to profile irqs as well. Otherwise a hardware pmu event would occur at the end of an irq disabled section, not at the exact place of the event, messing completely the result.

Finding a profiler that works, damnit

Posted Apr 5, 2010 1:37 UTC (Mon) by foom (subscriber, #14868) [Link]

You need to get stack traces for the *kernel* from an NMI. Surely the userspace backtracing can
wait until a more convenient time...

Finding a profiler that works, damnit

Posted Mar 30, 2010 18:51 UTC (Tue) by fuhchee (subscriber, #40059) [Link]

In systemtap, we do unwinding of kernel/userspace under similar constraints. We work around the "can't page in user data" by preemptively uploading the unwind data into the kernel, so it's ready for use when needed. It costs some memory but it saves time.

Finding a profiler that works, damnit

Posted Mar 24, 2010 2:15 UTC (Wed) by foom (subscriber, #14868) [Link]

BTW, the callgraph output I'm used to reading is like this:
http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC6

Or this:
http://oprofile.sourceforge.net/doc/opreport.html (section 2.3. Callgraph output)

The important points being that it shows, linearly, a section of output for every function, sorted by
time (inclusive of subroutines). Each section includes, in addition to the line for the function itself,
1) above the function's line: who called it
2) below the function's line: who it calls

Maybe your method is better for some uses, but I can't quite get my ahead around how to
use it, even after your explanation of what it means. I don't expect I'm unique in this regard, since
basically all callgraph profilers I've seen work the same way.

Finding a profiler that works, damnit

Posted Mar 24, 2010 2:48 UTC (Wed) by chantecode (subscriber, #54535) [Link]

Ah I see what you mean. We have this format too. We call that "flat" callchains. It's a sequence of traditional callgraphs, linear and sorted by overhead.

Example: http://tglx.de/~fweisbec/callchain_flat.txt

We have three kind of formats: flat (as in above link), fractal (as shown in my previous example) and graph (like fractal, except that percentages of overhead are absolute wrt to the whole overhead profiling, unlike fractal where each overhead percentage is relative to its parent branch).

syntax: perf report -g flat|fractal|graph

I've never heard about someone using the flat format. Everybody seem to use the fractal (probably because it's the default, and IMHO, the most useful). And because I've never heard about it, I've not tested it for a while so it works on 2.6.32 but seem to have been broken in 2.6.33 (the graph format too), but fractal works. I'm going to send a fix for that and get it to the 2.6.33 stable branch.

>Maybe your method is better for some uses, but I can't quite get my ahead around how to use it, even after your explanation of what it means. I don't expect I'm unique in this regard, since basically all callgraph profilers I've seen work the same way.

I must confess it's the first time I hear this complaint. That said may be it's a matter of playing a bit with it before being used to its layout.

Finding a profiler that works, damnit

Posted Mar 24, 2010 3:41 UTC (Wed) by foom (subscriber, #14868) [Link]

> I've never heard about someone using the flat format.

I'm not surprised about that: it looks like it's basically just the same as fractal, except without
combining the similar parts of call chains together. Doesn't seem useful at all. And that also isn't
what I want. (Sidenote: I'd have expected something named "flat" to actually show me a normal
flat profile; it seems a bit odd that perf report can't actually show a normal flat profile when
callgraph info was collected.)

I just like to see a list of the *direct* callers and callees of every function. (having both easily
available is an important component). I don't actually want to be shown every unique complete
call-chain starting from the named function: it's just too much data, and yet simultaneously not
enough, since it doesn't show callees.

I think the gprof docs explain its format pretty well, and that's basically exactly what I want to
see. It's a nice, concise, easy-to-understand summarization of the data that usually has enough
information to figure out where the problem is.

Or, heck, if you get callgrind output conversion working, kcachegrind also presents the data in
this way, so you wouldn't even have to implement the gprof-like output format. :)

Finding a profiler that works, damnit

Posted Mar 25, 2010 2:52 UTC (Thu) by chantecode (subscriber, #54535) [Link]

Ah ok, I see what you mean know. I parsed too quickly the gprof documentation.

Yep, may be we can implement such layout mode too, or at least getting it through kcachegrind.

Finding a profiler that works, damnit

Posted Apr 1, 2010 20:42 UTC (Thu) by oak (guest, #2786) [Link]

Having support for perf callgraph format in this:
http://code.google.com/p/jrfonseca/wiki/Gprof2Dot

Or similar callgraph visualization support in perf itself would be nice (including filtering options, highlighting the nodes based on CPU used by given node and the nodes called by it etc).

Finding a profiler that works, damnit

Posted Apr 4, 2010 12:44 UTC (Sun) by chantecode (subscriber, #54535) [Link]

I really like this. We may indeed want to support it.

Finding a profiler that works, damnit

Posted Apr 5, 2010 1:39 UTC (Mon) by foom (subscriber, #14868) [Link]

Check out the google-perftools' pprof's "gv" command; it's also pretty nice. (although I still usually
prefer kcachegrind's interactive viewer)

Finding a profiler that works, damnit

Posted Mar 24, 2010 7:11 UTC (Wed) by njs (guest, #40338) [Link]

> Also found no way to convert to callgrind format (except a hacky script which only does flat profiles)

Here's a hacky script that does callgraphs: http://roberts.vorpus.org/~njs/op2calltree.py

Finding a profiler that works, damnit

Posted Mar 24, 2010 8:10 UTC (Wed) by ringerc (subscriber, #3071) [Link]

gprof: Old standby -- I used to use this all the time back in the day... Unfortunately, only works on staticly linked programs.

gprof works fine on dynamically linked programs. In fact, it's awfully rare to find a genuinely statically linked program these days - most "static" programs dynamically link to libc via ld-linux.so and statically link in all the other libraries they use, as ldd will reveal. Not always, but usually.

I suspect what you were going for was "gprof only profiles one particular library or executable at a time, not a whole program including any shared libraries or plugins it uses." (Even that's not strictly true, it's just that you have to rebuild those plugins or shared libraries with gprof instrumentation too).

I'm totally with you on the general "argh" re callgraph profiling across multiple programs or a subset of the system, though.

Finding a profiler that works, damnit

Posted Mar 24, 2010 15:05 UTC (Wed) by foom (subscriber, #14868) [Link]

> Even that's not strictly true, it's just that you have to rebuild those plugins or shared libraries
> with gprof instrumentation too.

No, gprof really just doesn't work at all with shared libraries. If you link an executable with -pg,
you'll only get samples from when the PC is inside the base executable. Any times the timer tick
goes off when the PC is in another location, it's simply ignored. This is true even if you did
compile all libraries with -pg too.

So if you do run gprof with dynamically linked executable, your timing measures will all be
wrong, if you spend any time whatsoever inside shared libraries. E.g. I was profiling a program
where it turned out 1/4 the time was spent in libc and libstdc++ (shouldn't have been that
much..but that's what the profiler is for!). But using gprof, that time simply disappears, it's like
those functions took no time whatsoever even in the "inclusive" measures of higher level
functions that *are* in the executable. And the total run-time is also reported to be much
smaller than it actually is.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds