Finding a profiler that works, damnit

Posted Mar 24, 2010 15:18 UTC (Wed) by foom (subscriber, #14868)
In reply to: Finding a profiler that works, damnit by nix
Parent article: KVM, QEMU, and kernel project management

> Dump an address out in the fast path, and convert it to a stacktrace later on.


You can't convert a single address to a stacktrace later on! You'd need a copy of the whole stack to
do it offline...which I doubt is any faster than just running the unwinder to save out the PC of every

Anyways, thanks for mentioning sysprof: I hadn't heard of that one before. But looking at the
source, it doesn't seem like it's likely to work either, given the function heuristic_trace in:

Finding a profiler that works, damnit

Posted Mar 24, 2010 17:48 UTC (Wed) by nix (subscriber, #2304) [Link]

Oh crap, you're right, of course. I plead temporary insanity caused by hours of sitting through mind-numbing presentations on stuff I already knew.

So it's call the unwinder or nothing, really. Unfortunately the job of figuring out what the call stack looks like really *is* quite expensive :/ any effort should presumably go into optimizing libunwind...

Finding a profiler that works, damnit

Posted Mar 26, 2010 17:55 UTC (Fri) by sandmann (subscriber, #473) [Link]

I am the author of sysprof.

You are right that it doesn't generate good callgraphs on x86-64 unless you compile the application with -fno-omit-frame-pointer. I very much would like to fix this somehow, but I just don't see any really good way to do it.

Fundamentally, it needs to take a stack trace in the kernel in NMI context. You cannot read the .eh_frame information at that time because that would cause page faults and you are not allowed to sleep in NMI context.

Even if there were a way around that problem, you would still have to *interpret* the information, which takes a pretty hairy algorithm to put in the kernel (though Andi Kleen did exactly that, I believe, resulting in flame wars later on).

You could try copying the entire user stack, but there is considerable overhead associated with that because user stacks can be very large (emacs, for example, allocates a 100K buffer on the stack). You could also try a heuristic stack walk (which is what old versions of sysprof did; new versions use the same kernel interface as perf). That sort of works, but there are a lot of false positives from function pointers and left-over return addresses. The function pointers can be filtered out, but the return addresses can't. These false positives tend to make sysprof's UI display somewhat confusing, though not completely unusable.
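To make the false-positive problem concrete, here is a minimal Python simulation of a heuristic stack walk. It is only a sketch and not sysprof's code: the address ranges, the stack contents, and the names `TEXT_RANGES`, `looks_like_code`, and `scan_stack_for_code_addresses` are all made up for illustration. The scan keeps every stack word that points into an executable mapping, so a stored function pointer and a stale return address survive alongside the genuine ones.

```python
from bisect import bisect_right

# Hypothetical executable mappings as (start, end) pairs, sorted by start.
TEXT_RANGES = [(0x400000, 0x401000), (0x7F0000, 0x7F8000)]


def looks_like_code(addr, ranges=TEXT_RANGES):
    """Return True if addr falls inside one of the executable ranges."""
    starts = [start for start, _ in ranges]
    i = bisect_right(starts, addr) - 1
    return i >= 0 and ranges[i][0] <= addr < ranges[i][1]


def scan_stack_for_code_addresses(stack_words, ranges=TEXT_RANGES):
    """Keep every stack word that points into code, in stack order."""
    return [word for word in stack_words if looks_like_code(word, ranges)]


# A made-up stack, innermost word first.
stack = [
    0x7FFF1000,  # saved data pointer, not a code address
    0x400123,    # genuine return address
    0x2A,        # small integer local
    0x7F0040,    # function pointer stored in a local (false positive)
    0x400456,    # stale return address left by an earlier call (false positive)
    0x400789,    # genuine return address
]
candidates = scan_stack_for_code_addresses(stack)
```

Nothing in the scan itself can tell the stale `0x400456` apart from a live return address, which is the limitation described above.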

You could also try some hybrid scheme where the kernel does a heuristic stack walk and userspace then uses the .eh_frame information to filter out the false positives. This is what I think is the most promising approach at this point. Some day I'll try it.
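One way the hybrid scheme might look, sketched in Python under heavy assumptions: the kernel is assumed to record each candidate together with its stack address, and the .eh_frame data is reduced to a hypothetical lookup that maps a pc to a fixed distance from the stack pointer to the canonical frame address (CFA). Real unwind rules are more varied than that. The names `filter_candidates`, `toy_cfa_offset`, and `TOY_TABLE` are illustrative only.

```python
def filter_candidates(candidates, cfa_offset_for, pc, sp):
    """Keep only the candidates that lie on the chain the unwind table predicts.

    candidates: {stack_address: value} recorded by the heuristic scan.
    cfa_offset_for: hypothetical callable mapping a pc to the distance from
        sp to the CFA, or None when the pc is not covered by the table.
    """
    confirmed = []
    while True:
        offset = cfa_offset_for(pc)
        if offset is None or offset < 8:
            break
        slot = sp + offset - 8  # on x86-64 the return address sits just below the CFA
        if slot not in candidates:
            break
        pc = candidates[slot]
        confirmed.append(pc)
        sp += offset  # the caller's sp at the return point equals the CFA
    return confirmed


# Made-up unwind rows: (start, end, cfa_offset).
TOY_TABLE = [
    (0x400000, 0x400100, 32),
    (0x400100, 0x400200, 16),
    (0x400700, 0x400800, 48),
]


def toy_cfa_offset(pc):
    for start, end, offset in TOY_TABLE:
        if start <= pc < end:
            return offset
    return None


candidates = {
    0x7010: 0x7F0040,  # function pointer (false positive)
    0x7018: 0x400123,  # genuine return address
    0x7020: 0x400456,  # stale return address (false positive)
    0x7028: 0x400789,  # genuine return address
    0x7058: 0x400A00,  # genuine return address into code the table does not cover
}
confirmed = filter_candidates(candidates, toy_cfa_offset, pc=0x400050, sp=0x7000)
```

In this toy run the two false positives are dropped because neither sits in a slot the table predicts for a return address.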

Finally, the distributions could just compile with -fno-omit-frame-pointer by default. The x86-64 is not really register-starved so it wouldn't be a significant performance problem. The Solaris compiler does precisely this because they need to take stack traces for dtrace.
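For comparison, this is roughly why frame pointers make the job easy. The Python sketch below simulates the walk over a made-up memory image; it assumes the usual x86-64 layout when code is built with -fno-omit-frame-pointer, where the word at the frame pointer holds the caller's saved frame pointer and the word 8 bytes above it holds the return address. The function name `walk_frame_pointers` and all addresses are illustrative.

```python
def walk_frame_pointers(memory, pc, fp, max_depth=64):
    """Follow the saved-frame-pointer chain and collect return addresses.

    memory maps an address to the 8-byte word stored there.
    """
    trace = [pc]
    for _ in range(max_depth):
        if fp == 0 or fp not in memory or fp + 8 not in memory:
            break
        trace.append(memory[fp + 8])
        next_fp = memory[fp]
        if next_fp <= fp:  # the stack grows down, so callers sit at higher addresses
            break
        fp = next_fp
    return trace


# Three made-up frames; the outermost one has a saved frame pointer of 0.
memory = {
    0x7000: 0x7020, 0x7008: 0x400123,
    0x7020: 0x7040, 0x7028: 0x400789,
    0x7040: 0x0,    0x7048: 0x400A00,
}
trace = walk_frame_pointers(memory, pc=0x400050, fp=0x7000)
```

Each step is two loads and a comparison, with no unwind tables to read or interpret.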

But, I fully expect to be told that for performance reasons we can't have working profilers.

Finding a profiler that works, damnit

Posted Mar 26, 2010 18:06 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link]

Have you tried asking about that on the fedora-devel list? Maybe we can change
the compiler options for Fedora 14.

Finding a profiler that works, damnit

Posted Mar 26, 2010 21:44 UTC (Fri) by foom (subscriber, #14868) [Link]

I'm probably missing something, but...

Why does it need to happen at NMI time? Why can't you just do it in the user process's context,
before resuming execution of their code?

The setitimer(ITIMER_PROF) solution that userspace profilers use clearly works out fine for
userspace profiling. Can't you do something similar for userspace profiling from within the kernel?
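As a small illustration of the setitimer(ITIMER_PROF) approach, here is a sketch of an in-process sampling profiler in Python. It samples the interpreter's own call stack rather than native frames, and the names `profile` and `spin` are invented for the example. `signal.setitimer` and `signal.ITIMER_PROF` are the standard library's wrappers and are only available on Unix.

```python
import signal
import time
from collections import Counter


def profile(func, interval=0.005):
    """Run func() while sampling its Python call stack on every SIGPROF."""
    samples = Counter()

    def on_sigprof(signum, frame):
        # Walk outward from the interrupted frame, collecting function names.
        names = []
        while frame is not None:
            names.append(frame.f_code.co_name)
            frame = frame.f_back
        samples[tuple(reversed(names))] += 1  # outermost first

    old_handler = signal.signal(signal.SIGPROF, on_sigprof)
    # ITIMER_PROF counts CPU time used by the process, in user and kernel mode.
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # disarm the timer
        signal.signal(signal.SIGPROF, old_handler)
    return samples


def spin(cpu_seconds=0.3):
    """Burn CPU time so the profiling timer has something to count."""
    deadline = time.process_time() + cpu_seconds
    while time.process_time() < deadline:
        pass


samples = profile(spin)
```

The stack is captured in the context of the interrupted process, after the timer has fired, which is the property the question above is asking about.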

The stack trace of the userspace half clearly can't change between when you received the NMI and
when you resume execution of the process...

That just leaves the complication of implementing the DWARF unwinder in the kernel, but there's
already much more complex code in the kernel...that really seems like it should be a non-issue.

Finding a profiler that works, damnit

Posted Mar 26, 2010 23:28 UTC (Fri) by nix (subscriber, #2304) [Link]

There is already a DWARF unwinder in the kernel (or was, and it could be
resurrected). The tricky part is making it paranoid enough to be
non-DoSable, even by hostile generators of DWARF2. IIRC the kernel
unwinder was ripped out by Linus because it kept falling over when
unwinding purely kernel stack frames...

Finding a profiler that works, damnit

Posted Mar 27, 2010 14:10 UTC (Sat) by garloff (subscriber, #319) [Link]

Finding a profiler that works, damnit

Posted Mar 27, 2010 15:12 UTC (Sat) by foom (subscriber, #14868) [Link]

Care to expand upon that link with some explanatory text?

Finding a profiler that works, damnit

Posted Mar 28, 2010 0:26 UTC (Sun) by garloff (subscriber, #319) [Link]

Sorry, that was somewhat terse. The Novell Kernel Debugger has stack
unwinding features built-in; so this is something that might be leveraged
in another project.

Finding a profiler that works, damnit

Posted Apr 4, 2010 12:35 UTC (Sun) by chantecode (subscriber, #54535) [Link]

We need to profile from NMI if we want to profile irqs as well. Otherwise a hardware pmu event would be reported at the end of an irq-disabled section, not at the exact place of the event, completely messing up the result.

Finding a profiler that works, damnit

Posted Apr 5, 2010 1:37 UTC (Mon) by foom (subscriber, #14868) [Link]

You need to get stack traces for the *kernel* from an NMI. Surely the userspace backtracing can
wait until a more convenient time...

Finding a profiler that works, damnit

Posted Mar 30, 2010 18:51 UTC (Tue) by fuhchee (subscriber, #40059) [Link]

In systemtap, we do unwinding of kernel/userspace under similar constraints. We work around the "can't page in user data" by preemptively uploading the unwind data into the kernel, so it's ready for use when needed. It costs some memory but it saves time.
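The general idea of paying the loading cost up front can be sketched as follows. This is a Python analogy only, not systemtap's implementation: the class `PreloadedUnwindData`, its methods, and the row format are hypothetical. All rows are copied and sorted before sampling starts, so the per-sample lookup is a binary search over data that is already in memory.

```python
from bisect import bisect_right


class PreloadedUnwindData:
    """Unwind rows loaded ahead of time so that sampling never has to fetch them."""

    def __init__(self):
        self._rows = []    # (start, end, cfa_offset), sorted by start
        self._starts = []  # start addresses, parallel to _rows

    def preload(self, rows):
        """Slow path, run once before profiling starts: copy and sort all rows."""
        self._rows = sorted(self._rows + list(rows))
        self._starts = [start for start, _, _ in self._rows]

    def lookup(self, pc):
        """Fast path, run per sample: binary search, no loading of new data."""
        i = bisect_right(self._starts, pc) - 1
        if i >= 0:
            start, end, cfa_offset = self._rows[i]
            if start <= pc < end:
                return cfa_offset
        return None


table = PreloadedUnwindData()
table.preload([(0x400700, 0x400800, 48), (0x400000, 0x400100, 32)])
table.preload([(0x400100, 0x400200, 16)])
```

The memory cost mentioned above shows up here as the table holding every row for every registered binary, whether or not a sample ever lands in it.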

Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds