Fedora's tempest in a stack frame
A stack frame contains information relevant to a function call in a running program; this includes the return address, local variables, and saved registers. A frame pointer is a CPU register pointing to the base of the current stack frame; it can be useful for properly clearing the stack frame when returning from a function. Compilers, though, are well aware of the space they allocate on the stack and do not actually need a frame pointer to manage stack frames properly. It is, thus, common to build programs without the use of frame pointers.
Other code, though, lacks insights into the compiler's internal state and may struggle to interpret a stack's contents properly. As a result, code built without frame pointers can be harder to profile or to obtain useful crash dumps from. Both debugging and performance-optimization work are made much easier if frame pointers are present.
Back in June 2022, a Fedora system-wide change proposal, then aimed at the Fedora 37 release, called for the enabling of frame pointers for all binaries built for the distribution. While developers can build a specific program with frame pointers relatively easily when the need arises, the proposal stated, it is often necessary to rebuild a long list of libraries as well; that makes the job rather more difficult. Some types of profiling need to be done on a system-wide basis to be useful; that can only be done if the whole system has frame pointers enabled. Simply building the distribution that way to begin with would make life easier for developers and, it was argued, set the stage for many performance improvements in the future.
There is, of course, a cost to enabling frame pointers. Each function call must save the current frame pointer to the stack, slightly increasing the cost of that call and the size of the code. The frame pointer also occupies a general-purpose register, increasing register spills and slowing down code that might put the register to better use. Avoiding these costs is the main reason why distributions are built without frame pointers in the first place.
The proposal resulted in an extensive discussion on both the mailing list and the associated Fedora Engineering Steering Committee (FESCo) ticket. As would be expected, the primary objection was the performance cost, some of which was benchmarked on the Fedora wiki. Compiling the kernel turned out to be 2.4% slower, and a Blender test case regressed by 2%. The worst case appears to be Python programs, which can see as much as a 10% performance hit. To many, these costs seemed unacceptable.
The immediate reaction was enough to cause the proposed change to be deferred to Fedora 38. But the discussion went on. The proponents of the change were undeterred by any potential performance loss; for example, Andrii Nakryiko argued:
Even if we lose 1-2% of benchmark performance, what we gain instead is lots of systems enthusiasts that now can do ad-hoc profiling and investigations, without the need to recompile the system or application in special configuration. It's extremely underappreciated how big of a barrier it is for people contribution towards performance and efficiency, if even trying to do anything useful in this space takes tons of effort. If we care about the community to contribute, we need to make it simple for that community to observe applications.
He added that Meta builds its internal applications with frame pointers enabled because the cost was seen as being more than justified by the benefits. Brendan Gregg described the benefits seen from frame pointers at Netflix, and Ian Rogers told a similar story about the experience at Google. On the other hand, the developers in Red Hat's platform tools team, represented by Florian Weimer, remained steadfastly opposed to enabling frame pointers. Neal Gompa, instead, supported the change but worried that Fedora would be "roasted" on certain benchmark-oriented web sites for reducing performance across the entire distribution.
The change was discussed at the November 15 FESCo meeting (the IRC log is available) and the proposal was ultimately rejected. That led to some unhappiness among proponents of the change, who were unwilling to let the idea go, despite Kevin Kofler's admonition that "the toolchain people are the most qualified experts on the topic" and that it was time to move on. Michael Catanzaro complained that he could "no longer trust the toolchain developers to make rational decisions regarding real-world performance impact due to their handling of this issue". But even Catanzaro said that it was time to move on.
But that is not what happened. On January 3, FESCo held another meeting in which an entirely new ticket calling for a "revote" on the frame-pointer proposal was discussed; this was the first that most people had heard that the idea was back. The new ticket had been opened six days prior — on December 28 — by Gompa; it was approved by a vote of six to one (with one abstention). So, as of this writing, the plan is to enable frame pointers for the Fedora 38 release, which is currently scheduled for late April.
There appear to be a few factors that brought about FESCo's change of heart, starting with the ongoing requests from the proposal's proponents. While this whole discussion was going on, FESCo approved another build-option change (setting _FORTIFY_SOURCE=3 for increased security). That change also has a performance cost (though how much is not clear); the fact that it won approval while frame pointers did not was seen by some as the result of a double standard. The proposal was also modified to exempt Python — which is where the worst performance costs were seen — from the use of frame pointers. All of that, it seems, was enough to convince most FESCo members to support the idea.
As might be imagined, not all participants in the discussion saw things the same way. There were complaints about the short notice for the new ticket, which was also opened in the middle of the holiday break, and that participants in the discussion on the older ticket were not notified of the new one. Vitaly Zaitsev said that the proposal came back "because big corporations weren't happy with the results" and called it a bad precedent; Kofler called the process "deliberately rigged". Fedora leader Matthew Miller disputed that claim, but did acknowledge that things could have been done better:
I agree with your earlier post that this did not have enough visibility, enough notice, or enough time. I was certainly taken by surprise, and I was trying to keep an eye on this one in particular. [...] BUT, I do not think it was done with malice, as "deliberately rigged" implies. I don't see that at all -- I see excitement and interest in moving forward on something that already has taken a long time, and looming practical deadlines.
The second vote was rushed, it seems, so that a result could be in hand in time for the imminent mass rebuild. It obviously makes sense to make a change like that before rebuilding the entire distribution from source rather than after. But even some of the participants in the wider discussion who understood that point felt that the process had not worked well.
There is still time for FESCo to revisit (again) the decision, should that seem warranted, but that seems unlikely. As FESCo member Zbigniew Jędrzejewski-Szmek pointed out, much of the discussion has already moved on to the technical details of how to manage the change. Thus, Fedora 38 will probably be a little slower than its predecessors, but hopefully the performance improvements that will follow from this change in future releases will more than make up for that cost.
Posted Jan 16, 2023 16:29 UTC (Mon)
by kushal (subscriber, #50806)
[Link] (13 responses)
This is bad, and also kind of sad as the Python upstream is working so hard to make things faster.
Posted Jan 16, 2023 16:58 UTC (Mon)
by mcatanzaro (subscriber, #93033)
[Link] (12 responses)
Posted Jan 16, 2023 17:00 UTC (Mon)
by kushal (subscriber, #50806)
[Link]
Posted Jan 16, 2023 22:39 UTC (Mon)
by jreiser (subscriber, #11027)
[Link] (3 responses)
Why would using a frame pointer cause so many more 32-bit displacements? Because of the placement of saved registers in the stack frame. On x86_64, traditional entry to a subroutine which saves all 6 saved registers (with a frame pointer) begins with push %rbp; mov %rsp,%rbp, followed by pushes of the remaining saved registers, after which locals end up addressed at large negative offsets from %rbp.
Changing the entry sequence so that locals are instead addressed at small positive offsets from %rsp avoids most of the 32-bit displacements. As observed in the blog, nothing except intellectual inertia by the compiler prevents the use of +d8(%rsp) with a frame pointer, too.
Posted Jan 18, 2023 10:22 UTC (Wed)
by jengelh (guest, #33263)
[Link] (2 responses)
Have you considered ABI? This looks like a change of calling convention.
Posted Jan 18, 2023 13:54 UTC (Wed)
by dezgeg (subscriber, #92243)
[Link]
Posted Jan 19, 2023 1:59 UTC (Thu)
by ncm (guest, #165)
[Link]
The compiler is free to push the frame pointer and then ignore it in subsequent stack accesses and use the same addressing mode, relative to the stack pointer, as it does now with no frame pointer. It also does not need to pop the value it pushed, but just increment the stack pointer past it.
Posted Jan 18, 2023 21:27 UTC (Wed)
by jlargentaye (subscriber, #75206)
[Link] (5 responses)
Posted Jan 18, 2023 23:06 UTC (Wed)
by intgr (subscriber, #39733)
[Link] (4 responses)
He is probably right that the tradeoff makes sense for Netflix.
But I seriously doubt being able to profile Fedora more easily will result in enough optimization patches to improve performance at least 2% across the board *all around the distribution*. Maybe some specific places do end up getting optimized thanks to this change, but it's almost guaranteed to be a net loss given the large amount of software shipped by the distro.
Posted Jan 18, 2023 23:48 UTC (Wed)
by pizza (subscriber, #46)
[Link] (3 responses)
I'm looking forward to finding out who's correct here.
Speaking personally, being able to do system-wide (i.e. not just single-application) profiling is worth a slight overall performance hit, because currently it is effectively impossible to do so (at least for anything non-trivial).
Posted Jan 19, 2023 0:15 UTC (Thu)
by mcatanzaro (subscriber, #93033)
[Link] (1 responses)
Hopefully. ;)
Like Brendan said, "once you find a 500% perf win you have a different perspective about the <1% cost." Well, the cost may be a little higher than 1%, but the point remains.
Posted Jan 19, 2023 4:50 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Thank you. Who cares about a 2% hit on software that runs 0.001% of the time = the vast majority of software. Performance is ALL about bottlenecks and critical sections. Which you can't do anything about if you don't even know where they are.
Posted Jan 20, 2023 7:58 UTC (Fri)
by mgedmin (subscriber, #34497)
[Link]
Sadly I use Ubuntu, so Fedora's decision won't help me directly. It does make it somewhat more tempting to switch distros.
Posted Jan 28, 2023 19:17 UTC (Sat)
by sammythesnake (guest, #17693)
[Link]
Does anyone happen to know? Did Fedora benchmark that, or is pypy enough of a corner case that it wasn't part of the benchmarking?
If the performance hit for pypy is more typical than it is for cpython then maybe it matters less - those with more of a performance-weighted set of priorities might already largely be using pypy...?
Posted Jan 16, 2023 17:00 UTC (Mon)
by bredelings (subscriber, #53082)
[Link] (17 responses)
However, the F37 proposal authors consider this to be unacceptable for a number of reasons:
It's not clear to me that enabling frame pointers everywhere is the only solution here...
Posted Jan 16, 2023 17:03 UTC (Mon)
by vwduder (subscriber, #58547)
[Link] (12 responses)
Even if we could unwind fast enough by stashing stacks for user-space to decode, it potentially copies sensitive user information meaning we can't share the recordings with developers to further investigate.
Posted Jan 16, 2023 17:20 UTC (Mon)
by bredelings (subscriber, #53082)
[Link]
At the end of that blog post, you say that enabling frame pointers on everything is a practical, short-term solution. That seems reasonable.
What are some of the options for a long-term solution?
For example, one option might be to include ORC info on everything.
But it seems like there are probably a lot of other options.
Posted Jan 16, 2023 20:46 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (10 responses)
This blog post seems to be written under the assumption that the kernel has to do the unwinding. It explores several options for doing so, all with serious trade-offs (e.g. having to keep large debug information tables resident just in case the kernel might need them). Why not let userspace do its own unwinding? Doing so moves concerns about containers, static binaries, eBPF program installation and so on from the kernel to userspace.
> If we can’t do this thousands of times per second, it’s not fast enough.
Does the kernel have magical performance superpowers? Whatever userspace unwinding strategy you choose, ISTM userspace can implement that strategy at least as efficiently as the kernel. Performance doesn't seem like a good reason to keep userspace unwinding out of userspace.
Speaking of performance: why is nobody talking about profiling by snapshotting the shadow call stack? For the case of native code, the SCS is *exactly* the right data structure: it's a dense list of frame pointers!
> If an RPM is upgraded, you lose access to the library mapped into memory from outside the process as both the inode and CRC will have changed on disk. You can’t build unwind tables, so accounting in that process breaks.
The right way to deal with this issue is to log all mmap()s in the perf/ftrace ring buffer along with the build-ID tags of all mapped binaries. This way, postprocessing tools can query debug servers (by build-ID) and obtain the right PC->source mapping even if some system package has been updated or inode assignments changed.
> Even if we could unwind fast enough by stashing stacks for user-space to decode, it potentially copies sensitive user information meaning we can't share the recordings with developers to further investigate.
Aren't you assuming that the dump of the kernel perf/ftrace ring buffer would have to be the same as the data file shared outside the machine? ISTM that a post-processing step could remove arbitrary sensitive information from traces before sharing.
Posted Jan 16, 2023 22:44 UTC (Mon)
by vwduder (subscriber, #58547)
[Link] (8 responses)
Can you explain how you would expect that to work?
If you mean to snapshot the stack and to unwind in userspace using something like libunwind, then it's simply too slow. If you mean to create something that just uses SIGPROF+unwind in-process, well, that isn't system-wide profiling.
The perf sample point happens within the kernel, and you can't take faults so everything has to be available at that moment. If you don't have enough information to unwind there, you can't get stacks that cross kernel and user-space or enough information to symbolize after the fact.
Most of what Sysprof does, at least, is from user-space after the samples are captured.
> Does the kernel have magical performance superpowers? Whatever userspace unwinding strategy you choose, ISTM userspace can implement that strategy at least as efficiently as the kernel. Performance doesn't seem like a good reason to keep userspace unwinding out of userspace.
Again, I don't see how this would work. If you provide me some basic understanding of how this should work, I can give you some examples of how I can break it.
> Speaking of performance: why is nobody talking about profiling by snapshotting the shadow call stack? For the case of native code, the SCS is *exactly* the right data structure: it's a dense list of frame pointers!
Because it's not available in most configurations.
> The right way to deal with this issue is to log all mmap()s in the perf/ftrace ring buffer along with the build-ID tags of all mapped binaries. This way, postprocessing tools can query debug servers (by build-ID) and obtain the right PC->source mapping even if some system package has been updated or inode assignments changed.
Sysprof does in fact log all of these so we can symbolize properly. However, not everything has a build-id. The things that often trip up unwinding to the point of uselessness are JIT/FFI.
If you can't even handle unwinding libffi properly, the design is probably broken.
> Aren't you assuming that the dump of the kernel perf/ftrace ring buffer would have to be the same as the data file shared outside the machine? ISTM that a post-processing step could remove arbitrary sensitive information from traces before sharing.
Sysprof already has options to tack on symbol decode at the end of a trace file so they can be shared across machines. But its symbolizer is an interface which has multiple implementations, so you could symbolize using something like debuginfod or a directory of binaries, etc.
Posted Jan 16, 2023 23:07 UTC (Mon)
by vwduder (subscriber, #58547)
[Link] (2 responses)
Read your post below and replied there.
In short, yes, what we all would want to do is to unwind a single time in user-space during scheduler transition. That is not something that is likely to ship in the next 6 months to a year if you ask me. And we very much need things working up until someone shows up to write that code and can:
1. Prove it works
Posted Jan 16, 2023 23:23 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Where is the urgency coming from? We've lived with the current situation this long. We can spend a year fixing it the right way.
Posted Jan 17, 2023 5:25 UTC (Tue)
by vwduder (subscriber, #58547)
[Link]
Because we've been waiting for this for a decade and nothing has materialized.
Are you signing up to implement your design? Because if so, that sounds great! We can disable frame-pointers when it's ready to ship. I'll even do you a solid and implement the Sysprof part.
Posted Jan 16, 2023 23:52 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (4 responses)
Keep in mind that there's no explicit "snapshot" involved. When a thread returns from kernel mode to user mode, its stack is effectively frozen because that thread is running the unwind signal handler immediately upon returning to user mode and before resuming whatever it was doing before it entered the kernel. The stack is "snapshotted" implicitly in the same way that the stack above a call to sleep(10) is "snapshotted".
> and to unwind in userspace using something like libunwind, then it's simply to slow.
Why would it be any slower than whatever the kernel does? Perhaps full asynchronous DWARF bytecode interpretation would be too slow, but ORC unwinding probably wouldn't be. Either way, you take a kernel problem and make it userspace's problem. Userspace can use every unwinding strategy that the kernel can use and a lot more. There's no downside.
> If you mean to create something that just uses SIGPROF+unwind in-process, well that isn't system-wide profiling.
I'm talking about something a bit different from SIGPROF fired via setitimer(2). I'm talking about using a signal handler to *collect* the user segment of a stack the collection of which is triggered using exactly the same triggers we have for kernel-side stack collection today. It *is* system-wide profiling.
> The perf sample point happens within the kernel, and you can't take faults so everything has to be available at that moment.
It's not the case that "everything" has to be available "at that moment". At the moment the perf event fires, we have to capture only the *kernel stack*. The *user stack* of any thread running in kernel mode is frozen until that thread leaves kernel mode and re-enters user mode, so we can defer the user mode stack collection until the thread returns to user space without loss of information!
> Again, I don't see how this would work. If you provide me some basic understanding of how this should work, I can give you some examples of how I can break it.
When a perf event fires, we want "the" stack corresponding to that event. That stack always has two parts: the kernel stack and the user stack. (We always have a kernel stack because perf events always fire in the kernel, even if the kernel is just running an ISR.) I'm suggesting that we capture the kernel and user stacks separately and that we combine them later, in post-processing. User-space collects user stacks and can use whatever stack traversal approach it wants, e.g. an asynchronous DWARF unwinder, frame-pointer walking, or various runtime-specific approaches.
Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses? User space can traverse frame pointers just as well as the kernel. LDR, B.LE, and MOV don't execute any faster in kernel mode than they do in user mode.
> Because it's not available in most configurations.
But when it is, it's exactly the right thing. That's why unwinding should be modular and implemented in user-space: flexibility.
> The things that often trip up unwinding to the point of uselessness is JIT/FFI.
That's why we need managed runtime environments to participate explicitly in the unwinding process.
Posted Jan 17, 2023 5:46 UTC (Tue)
by vwduder (subscriber, #58547)
[Link] (3 responses)
https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html
> Why would it be any slower than whatever the kernel does?
The point was it's too slow if we have to copy stack pages via perf to the user-space capture. Obviously it's irrelevant if coordinating processes are unwinding themselves from a glorified signal handler.
> I'm talking about something a bit different from SIGPROF fired via setitimer(2). I'm talking about using a signal handler to *collect* the user segment of a stack the collection of which is triggered using exactly the same triggers we have for kernel-side stack collection today. It *is* system-wide profiling.
Would you anticipate the unwound stack could be placed directly into a map for the kernel to consume on its next sample (and provide to the perf stream)?
> Because it's not available in most configurations.
I agree.
Somebody (clearly you have the expertise) needs to go implement it and then we need to wait for years for it to be available everywhere so we can finally rely on it.
In the meantime, I'll have to enable frame-pointers so I can get some work done.
> The things that often trip up unwinding to the point of uselessness is JIT/FFI.
I fully agree, and it's what I too am advocating for.
Posted Jan 17, 2023 6:02 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
More or less, although Zijlstra doesn't seem to talk about making the unwinder pluggable. That, and what he calls a "gloriously ugly hack" I'd call a clean separation of concerns. :-)
> Would you anticipate the unwound stack could be placed directly into a map for the kernel to consume on it's next sample (and provide to the perf stream)?
Yes, and userspace could point to that "map" as an rseq extension. It might be easier, however, to just have threads make a system call (or post an io_uring request) to donate their stacks each time they finish an unwind. (Although, come to think of it, threads could in principle batch stack submission.) You could implement an explicit system call and then the rseq thing (supporting both) if the performance of the system call (even batched) proved inadequate.
> In the meantime, I'll have to enable frame-pointers so I can get some work done.
The point of this LWN article is that not everyone agrees that this is the right approach.
Posted Jan 17, 2023 16:55 UTC (Tue)
by vwduder (subscriber, #58547)
[Link] (1 responses)
I think you'll have a hard time finding people who want to give up a limited resource (max io_uring per process) for stack delivery via their libc.
Posted Jan 17, 2023 17:14 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link]
In any case, the use of io_uring is not central to my proposal, so the point is irrelevant. That said, all authors of new kernel interfaces should be designing with io_uring support in mind.
Posted Feb 26, 2023 15:27 UTC (Sun)
by nix (subscriber, #2304)
[Link]
If the binary is still running and ptraceable, you don't: you can open files in /proc/$pid/map_files/ and bingo (yes, that gives you the *whole file*, not just the mapping, including non-loaded sections). We use this in DTrace, and it just works, where Solaris had to use this horrific baroque scheme involving serial numbers nailed into ELF objects to detect a deletion/recreation of a binary.
Posted Jan 16, 2023 19:47 UTC (Mon)
by Sesse (subscriber, #53779)
[Link] (3 responses)
The biggest advantage with DWARF for call stacks is that it is very precise when it works; in particular, it understands inlining. The biggest disadvantage is that it's very slow.
Posted Jan 16, 2023 20:24 UTC (Mon)
by vwduder (subscriber, #58547)
[Link] (2 responses)
A significant portion of the stack traces that are interesting to us in whole-system profiling are deeper than 32 stack frames.
Without a common root, it's difficult to build callgraphs that have the slightest amount of value.
Posted Jan 16, 2023 20:25 UTC (Mon)
by Sesse (subscriber, #53779)
[Link]
The nuclear option is Intel PT, of course.
Posted Jan 19, 2023 17:25 UTC (Thu)
by irogers (subscriber, #121692)
[Link]
Posted Jan 16, 2023 17:07 UTC (Mon)
by mcatanzaro (subscriber, #93033)
[Link] (1 responses)
Posted Jan 17, 2023 1:28 UTC (Tue)
by ringerc (subscriber, #3071)
[Link]
Posted Jan 16, 2023 17:20 UTC (Mon)
by siddhesh (guest, #64914)
[Link]
It is largely clear; the broad-based impact (measured with SPEC2000 and SPEC2017) is zero[1], the only concern is the possibility of some corner cases. I've got a post coming out that should hopefully make it a bit clearer; I had hypothesized about the performance impact in my post[2] introducing _FORTIFY_SOURCE=3 and my hypothesis got blown out of proportion in this comparison.
[1] https://fedoraproject.org/w/index.php?title=Changes/Add_F...
Posted Jan 16, 2023 18:06 UTC (Mon)
by ballombe (subscriber, #9523)
[Link] (4 responses)
Posted Jan 17, 2023 17:15 UTC (Tue)
by mwsealey (subscriber, #71282)
[Link] (3 responses)
Posted Jan 28, 2023 21:09 UTC (Sat)
by sammythesnake (guest, #17693)
[Link] (2 responses)
Posted Jan 29, 2023 12:52 UTC (Sun)
by rahulsundaram (subscriber, #21946)
[Link]
Yes, Fedora already does that. For instance, Python excludes it based on the performance impact cited in the article. Defaults only apply when the maintainer hasn't changed it. It is not enforced for every package.
Posted Jan 29, 2023 12:53 UTC (Sun)
by mathstuf (subscriber, #69389)
[Link]
Posted Jan 16, 2023 20:33 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (9 responses)
Frame pointers also have the disadvantage of working only with AOT-compiled languages, for the most part.
There's a third way.
See, both pro-FP and anti-FP camps think that it's the kernel that has to do the unwinding.
1) backtrace requested in the kernel (e.g. in response to a perf counter overflow)
2) kernel unwinds itself to the userspace boundary the usual way
3) kernel forms a nonce (e.g. by incrementing a 64-bit counter)
4) kernel logs a stack trace the usual way (e.g. to the ftrace ring buffer), but with the nonce attached
5) kernel queues a signal (one userspace has explicitly opted into via a new prctl()); the nonce is made available to the signal handler
6) kernel eventually returns to userspace; queued signal handler gains control
7) signal handler unwinds the calling thread however it wants (and can sleep and take page faults)
8) signal handler logs the result of its unwind, along with the nonce, to the system log
Post-processing tools can associate kernel stacks with user stacks tagged with the same nonce.
We can avoid duplicating unwinds too: at step #3, if the kernel finds that the current thread has not yet returned to userspace since the last request, it can reuse the pending nonce instead of queuing another unwind.
One nice property of this scheme is that the userspace unwinding isn't limited to any single strategy; each process or runtime can supply its own.
A pluggable userspace unwind mechanism would have zero cost in the case that we're not profiling.
In other words, the choice between frame pointers and no frame pointers is a false dichotomy.
[1] https://lists.fedoraproject.org/archives/list/devel@lists...
Posted Jan 16, 2023 21:08 UTC (Mon)
by Sesse (subscriber, #53779)
[Link] (3 responses)
Posted Jan 16, 2023 21:46 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Integration with kernel stack traces and the ability to fire in response to any event for which the kernel would otherwise collect a stack trace.
ITIMER_PROF is less flexible and precise: it fires in response only to accumulated CPU time, and CPU time from the whole process (as opposed to each thread individually) at that. There's also no general-purpose way to collect information in response to SIGPROF across the whole system: each process hooking into SIGPROF records information in its own way. (There's also no multi-tenant support for SIGPROF out of the box.)
The mechanism I'm proposing is no less flexible or accurate than kernel collection of usermode stack traces: the user mode stack of a thread cannot change between the time we notice a perf-relevant event in the kernel and the time at which that thread returns to userspace, and each thread is independent. On top of that, I'm proposing a mechanism through which userspace can contribute its unwound stack information back into the kernel performance ring buffer (a capability missing in SIGPROF), allowing tools to synthesize user and kernel stacks into coherent stacks that work the same way across the whole system.
> Which never got much traction (certainly never from interpreters, and -pg never really worked well with multithreaded programs)
Without an end-to-end integration of the sort I'm describing, there's limited motivation for interpreters to support SIGPROF.
Posted Jan 16, 2023 23:03 UTC (Mon)
by vwduder (subscriber, #58547)
[Link] (1 responses)
Additionally, in my blog post, I mentioned that none of us actually _want_ frame-pointers, what we want is _works out of the box_.
So until a scheme like the one described here can be implemented and propagated to language runtimes and tooling, we need something else as a stop gap (e.g. frame-pointers).
Posted Jan 16, 2023 23:39 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link]
Great! Any links you can share?
> So until a scheme like the one described here can be implemented and propagated to language runtimes and tooling, we need something else as a stop gap (e.g. frame-pointers).
I'm worried about the moral hazard.
Posted Jan 16, 2023 23:00 UTC (Mon)
by atnot (subscriber, #124910)
[Link] (3 responses)
For example, a common scenario might involve noticing that you are running out of disk bandwidth on some device. You attach a probe to look for disk events taking a certain amount of time and get their stack traces. The events look normal, but the deeper frames of the stack trace help you discover that they are actually called from a background job: some recently deployed software has a poorly written batch job that repeatedly rereads and rewrites the same file, a pathological case you only sometimes hit.
This isn't really a thing that can be discovered by instrumenting a single application. This sort of thing has to be universal for it to work, and that's not really the case if you introduce something applications must specifically add and opt into today.
Posted Jan 16, 2023 23:22 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Sure it does. If you put the unwind code in libc, every program that links against libc can unwind in userspace. Programs that haven't opted into userspace unwinding could presumably be detected and unwound like they are today.
> This sort of thing has to be universal for it to work, and that's not really the case if you introduce something applications must especially add and opt into today.
It's libc that would opt into the mechanism on behalf of each application using that libc. Individual application authors wouldn't do anything.
Posted Jan 17, 2023 5:49 UTC (Tue)
by vwduder (subscriber, #58547)
[Link] (1 responses)
This is, surprisingly, a lot more code than one might think. This is particularly true of Go, where it is much more common to link without glibc. I do think Go includes frame pointers, though ;)
Posted Jan 17, 2023 17:20 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link]
The amount of code involved isn't relevant to the discussion. However much code it is, the kernel doesn't have magical powers that let it run that code more efficiently than userspace can.
Posted Jan 26, 2023 6:57 UTC (Thu)
by njs (subscriber, #40338)
[Link]
perf does have a mechanism for JITs and interpreted languages to record their function calls in perf call stacks. The next Python release will even have support built in: https://docs.python.org/3.12/howto/perf_profiling.html
It is pretty janky, though, and if you could get all the pieces lined up to support your approach, I think everyone would be very happy!
Posted Jan 17, 2023 18:04 UTC (Tue)
by mwsealey (subscriber, #71282)
[Link] (3 responses)
Posted Jan 17, 2023 19:17 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jan 19, 2023 2:14 UTC (Thu)
by ncm (guest, #165)
[Link] (1 responses)
So, the performance cost to Python is just the lack of a trivial optimization in the compiler. That optimization is already implemented, and simply not applied when the frame pointer is being pushed.
Posted Jan 19, 2023 19:36 UTC (Thu)
by dezgeg (subscriber, #92243)
[Link]
Posted Jan 17, 2023 20:37 UTC (Tue)
by flussence (guest, #85566)
[Link]
Posted Jan 22, 2023 21:30 UTC (Sun)
by smitty_one_each (subscriber, #28989)
[Link]
One of the issues identified is (in many cases) the use of +d8(%rsp) addressing without a frame pointer, versus -d32(%rbp) addressing with a frame pointer. +d8(%rsp) costs one byte for the 8-bit displacement plus one byte for the SIB addressing mode needed to use the stack pointer %rsp as a base register; -d32(%rbp) costs four bytes for the 32-bit displacement.
Placement of saved registers in a stack frame
push %rbp; mov %rsp,%rbp
push %r15; push %r14; push %r13; push %r12; push %rbx
in which 40 bytes of the 128-byte range for 8-bit displacements beneath the frame pointer %rbp are consumed by saved registers, leaving only 88 bytes for programmer-defined values. This contrasts with the 128 bytes available for 8-bit positive displacements from %rsp.
push %r15; push %r14; push %r13; push %r12; push %rbx
push %rbp; mov %rsp,%rbp
would move the saved registers to the other side of the frame pointer %rbp, which would recoup the 40 bytes as long as there were at most 80 bytes of incoming on-stack actual arguments (10 or fewer pointers, etc.). The return address has a different position relative to the frame pointer, but it can be found by disassembling the function's entry code, looking for the push %rbp.
* the kernel records stack-unwinding info in ORC format instead of DWARF
* when using DWARF unwinding, the kernel saves all of the stack data at each snapshot, as well as the instruction pointers; this stack info is then decoded later in user space
* there can be problems getting a complete stack trace, versus just the last N calls, for some N
2. Get new syscalls into the kernel
3. Get language runtimes, JIT, FFI trampolines, etc to buy in
4. Profiler tooling to consume the different event types and cook the callgraph results
https://lore.kernel.org/lkml/20200309174639.4594-1-kan.li...
[2] https://developers.redhat.com/articles/2022/09/17/gccs-ne...
distro wide cflags are a bad idea anyway
Some packages are better optimized for speed, others for security, and others for troubleshooting.
A trace-analysis tool needs a way to associate an instruction pointer with a semantically relevant bit of code. If you try to use frame pointers to profile a Python program, all you're going to get is a profile of the interpreter. It seems like the debate is between those who want observability (via frame pointers) and those who want the performance benefits of -fomit-frame-pointer.
Today, deferred unwinding works only if we copy whole stacks into traces. Why should that be? As mentioned in [1], instead of finding a way to have the kernel unwind user programs, we can create a protocol through which the kernel asks user mode to unwind itself. It could work like this:
1. When the kernel wants a user stack trace, it creates a nonce identifying the unwind request.
2. The kernel logs the kernel portion of the stack trace, ending with a final frame referring to the nonce created in the previous step.
3. The kernel queues a signal to the thread to be unwound; the siginfo_t structure encodes (e.g. via si_status and si_value) the nonce.
4. Back in user mode, the thread's signal handler unwinds the thread's own stack (taking faults if needed).
5. The handler reports the unwound stack, tagged with the nonce, back to the kernel (e.g. via a new system call, a sysfs write, an io_uring submission, etc.)
6. Analysis tools later match up corresponding nonces and reconstitute the full stacks in effect at the time of each logged event.
As an optimization, if a thread already has an unwind pending, the kernel can use the already-pending nonce instead of making a new one and queuing a signal: many kernel stacks can end with the same user stack "tail".
This approach isn't limited to native code, either. Libc could arbitrate unwinding across an arbitrary number of managed runtime environments in the context of a single process: the system could be smart enough to know that, instead of unwinding through, e.g., Python interpreter frames, the unwinder (which is normal userspace code, pluggable via DSO!) could traverse and log *Python* stack frames instead, with meaningful function names. And if you happened to have, say, a JavaScript runtime in the same process, both JavaScript and Python could participate in the semantic unwinding process.
On top of that, a pluggable userspace unwinder *could* be written to traverse frame pointers just as the kernel unwinder does today, if userspace thinks that's the best option. Without breaking kernel ABI, that userspace unwinder could use DWARF, or ORC, or any other userspace unwinding approach. It's future-proof.
There's a better approach. The Linux ecosystem as a whole would be better off building
something like the pluggable userspace asynchronous unwinding infrastructure described
above.