Fedora's tempest in a stack frame

By Jonathan Corbet
January 16, 2023

It is rare to see an extensive and unhappy discussion over the selection of compiler options used to build a distribution, but it does happen. A case in point is the debate over whether Fedora should be built with frame pointers or not. It comes down to a tradeoff between a performance loss on current systems and hopes for gains that exceed that loss in the future — and some disagreements over how these decisions should be made within the Fedora community.

A stack frame contains information relevant to a function call in a running program; this includes the return address, local variables, and saved registers. A frame pointer is a CPU register pointing to the base of the current stack frame; it can be useful for properly clearing the stack frame when returning from a function. Compilers, though, are well aware of the space they allocate on the stack and do not actually need a frame pointer to manage stack frames properly. It is, thus, common to build programs without the use of frame pointers.

Other code, though, lacks insights into the compiler's internal state and may struggle to interpret a stack's contents properly. As a result, code built without frame pointers can be harder to profile or to obtain useful crash dumps from. Both debugging and performance-optimization work are made much easier if frame pointers are present.

Back in June 2022, a Fedora system-wide change proposal, then aimed at the Fedora 37 release, called for the enabling of frame pointers for all binaries built for the distribution. While developers can build a specific program with frame pointers relatively easily when the need arises, the proposal stated, it is often necessary to rebuild a long list of libraries as well; that makes the job rather more difficult. Some types of profiling need to be done on a system-wide basis to be useful; that can only be done if the whole system has frame pointers enabled. Simply building the distribution that way to begin with would make life easier for developers and, it was argued, set the stage for many performance improvements in the future.

There is, of course, a cost to enabling frame pointers. Each function call must save the current frame pointer to the stack, slightly increasing the cost of that call and the size of the code. The frame pointer also occupies a general-purpose register, increasing register spills and slowing down code that might put the register to better use. Avoiding these costs is the main reasons why distributions are built without frame pointers in the first place.

The proposal resulted in an extensive discussion on both the mailing list and the associated Fedora Engineering Steering Council (FESCo) ticket. As would be expected, the primary objection was the performance cost, some of which was benchmarked on the Fedora wiki. Compiling the kernel turned out to be 2.4% slower, and a Blender test case regressed by 2%. The worst case appears to be Python programs, which can see as much as a 10% performance hit. To many, these costs were seen as unacceptable.

The immediate reaction was enough to cause the proposed changed to be deferred to Fedora 38. But the discussion went on. The proponents of the change were undeterred by any potential performance loss; for example, Andrii Nakryiko argued:

Even if we lose 1-2% of benchmark performance, what we gain instead is lots of systems enthusiasts that now can do ad-hoc profiling and investigations, without the need to recompile the system or application in special configuration. It's extremely underappreciated how big of a barrier it is for people contribution towards performance and efficiency, if even trying to do anything useful in this space takes tons of effort. If we care about the community to contribute, we need to make it simple for that community to observe applications.

He added that Meta builds its internal applications with frame pointers enabled because the cost as seen as being more than justified by the benefits. Brendan Gregg described the benefits seen from frame pointers at Netflix, and Ian Rogers told a similar story about the experience at Google. On the other hand, the developers in Red Hat's platform tools team, represented by Florian Weimer, remained steadfastly opposed to enabling frame pointers. Neal Gompa, instead, supported the change but worried that Fedora would be "roasted" on certain benchmark-oriented web sites for reducing performance across the entire distribution.

The change was discussed at the November 15 FESCo meeting (the IRC log is available) and the proposal was ultimately rejected. That led to some unhappiness among proponents of the change, who were unwilling to let the idea go, despite Kevin Kofler's admonition that "the toolchain people are the most qualified experts on the topic" and that it was time to move on. Michael Catanzaro complained that he could "no longer trust the toolchain developers to make rational decisions regarding real-world performance impact due to their handling of this issue". But even Catanzaro said that it was time to move on.

But that is not what happened. On January 3, FESCo held another meeting in which an entirely new ticket calling for a "revote" on the frame-pointer proposal was discussed; this was the first that most people had heard that the idea was back. The new ticket had been opened six days prior — on December 28 — by Gompa; it was approved by a vote of six to one (with one abstention). So, as of this writing, the plan is to enable frame pointers for the Fedora 38 release, which is currently scheduled for a late-April release.

There appear to be a few factors that brought about FESCo's change of heart, starting with the ongoing requests from the proposal's proponents. While this whole discussion was going on, FESCo approved another build-option change (setting _FORTIFY_SOURCE=3 for increased security). That change also has a performance cost (though how much is not clear); the fact that it won approval while frame pointers did not was seen by some as the result of a double standard. The proposal was also modified to exempt Python — which is where the worse performance costs were seen — from the use of frame pointers. All of that, it seems, was enough to convince most FESCo members to support the idea.

As might be imagined; not all participants in the discussion saw things the same way. There were complaints about the short notice for the new ticket, which was also opened in the middle of the holiday break, and that participants in the discussion on the older ticket were not notified of the new one. Vitaly Zaitsev said that the proposal came back "because big corporations weren't happy with the results" and called it a bad precedent; Kofler called the process "deliberately rigged". Fedora leader Matthew Miller disputed that claim, but did acknowledge that things could have been done better:

I agree with your earlier post that this did not have enough visibility, enough notice, or enough time. I was certainly taken by surprise, and I was trying to keep an eye on this one in particular. [...] BUT, I do not think it was done with malice, as "deliberately rigged" implies. I don't see that at all -- I see excitement and interest in moving forward on something that already has taken a long time, and looming practical deadlines.

The rushed timing for the second vote, it seems, was done so that a result could be had in time for the imminent mass rebuild. It obviously makes sense to make a change like that before rebuilding the entire distribution from source rather than after. But even some of the participants in the wider discussion who understand that point felt that the process had not worked well.

There is still time for FESCo to revisit (again) the decision, should that seem warranted, but that seems unlikely. As FESCo member Zbigniew Jędrzejewski-Szmek pointed out, much of the discussion has already moved on to the technical details of how to manage the change. Thus, Fedora 38 will probably be a little slower than its predecessors, but hopefully the performance improvements that will follow from this change in future releases will more than make up for that cost.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 16:29 UTC (Mon) by kushal (subscriber, #50806) [Link] (13 responses)

>The worst case appears to be Python programs, which can see as much as a 10% performance hit.

This is bad, and also kind of sad as the Python upstream is working so hard to make things faster.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 16:58 UTC (Mon) by mcatanzaro (subscriber, #93033) [Link] (12 responses)

Note that python will not be built with frame pointers until the performance issue is resolved. See this link for analysis of what's going wrong with python: https://pagure.io/fesco/issue/2817#comment-826636

Fedora's tempest in a stack frame

Posted Jan 16, 2023 17:00 UTC (Mon) by kushal (subscriber, #50806) [Link]

Thank you for pointing this out.

Placement of saved registers in a stack frame

Posted Jan 16, 2023 22:39 UTC (Mon) by jreiser (subscriber, #11027) [Link] (3 responses)

One of the issues identified is (in many cases) the use of +d8(%rsp) addressing without frame pointer, versus -d32(%rbp) addressing with a frame pointer. +d8(%rsp) costs one byte for the 8-bit displacement plus one byte for s-i-b addressing mode to allow the stack pointer %rsp as a base register. -d32(%rbp) costs 4 bytes for the 32-bit displacement.

Why would using a frame pointer cause so many more 32-bit displacements? Because of the placement of saved registers in the stack frame. On x86_64, traditional entry to a subroutine which saves all 6 saved registers (with a frame pointer) looks like

push %rbp; mov %rsp,%rbp
push %r15; push %r14; push %r13; push %r12; push %rbx

in which 40 bytes of the 128-byte range for 8-bit displacement beneath the frame pointer %rbp are consumed by saved registers, leaving only 88 bytes for programmer-defined values. This contrasts to the 128 bytes available for eight-bit positive displacements from %rsp.

Changing the entry sequence to

push %r15; push %r14; push %r13; push %r12; push %rbx
push %rbp; mov %rsp,%rbp

would move the saved registers to the other side of the frame pointer %rbp, which would recoup the 40 bytes as long as there were at most 80 bytes of incoming on-stack actual arguments (10 or fewer pointers, etc.) The return address has a different position relative to the frame pointer, but it can be found by disassembling the entry code, looking for push %rbp.

As observed in the blog, nothing except intellectual inertia by the compiler prevents the use +d8(%rsp) with a frame pointer, too.

Placement of saved registers in a stack frame

Posted Jan 18, 2023 10:22 UTC (Wed) by jengelh (guest, #33263) [Link] (2 responses)

>nothing except intellectual inertia by the compiler prevents

Have you considered ABI? This looks like a change of calling convention.

Placement of saved registers in a stack frame

Posted Jan 18, 2023 13:54 UTC (Wed) by dezgeg (subscriber, #92243) [Link]

Changing location of rbp on the stack would be an ABI break (for anyone trying to unwind the stack), yes. But the last sentence is (if I understood right) an independent optimization of having the compiler still generate local variable accesses via rsp-relative offsets when beneficial, instead of always through the frame pointer (when frame pointers are enabled, of course).

Placement of saved registers in a stack frame

Posted Jan 19, 2023 1:59 UTC (Thu) by ncm (guest, #165) [Link]

There is no suggestion to change the calling convention.

The compiler is free to push the frame pointer and then ignore it in subsequent stack accesses and use the same addressing mode, relative to the stack pointer, as it does now with no frame pointer. It also does not need to pop the value it pushed, but just increment the stack pointer past it.

Fedora's tempest in a stack frame

Posted Jan 18, 2023 21:27 UTC (Wed) by jlargentaye (subscriber, #75206) [Link] (5 responses)

Interesting to see Brendan Gregg, IMHO the foremost expert in profiling, chiming in later in that conversation:

https://pagure.io/fesco/issue/2817#comment-826805

Fedora's tempest in a stack frame

Posted Jan 18, 2023 23:06 UTC (Wed) by intgr (subscriber, #39733) [Link] (4 responses)

Being a foremost expert doesn't guarantee that they see the big picture though.

He is probably right that the tradeoff makes sense for Netflix.

But I seriously doubt being able to profile Fedora more easily will result in enough optimization patches to improve performance at least 2% across the board *all around the distribution*. Maybe some specific places do end up getting optimized thanks to this change, but it's almost guaranteed to be a net loss given the large amount of software shipped by the distro.

Fedora's tempest in a stack frame

Posted Jan 18, 2023 23:48 UTC (Wed) by pizza (subscriber, #46) [Link] (3 responses)

> But I seriously doubt being able to profile Fedora more easily will result in enough optimization patches to improve performance at least 2% across the board *all around the distribution*.

I'm looking forward to finding out who's correct here.

Speaking personally, being able to do system-wide (ie not just single-application) profiling is worth a slight overall performance hit. Because currently it is effectively impossible to do so (at least for anything non-trivial)

Fedora's tempest in a stack frame

Posted Jan 19, 2023 0:15 UTC (Thu) by mcatanzaro (subscriber, #93033) [Link] (1 responses)

I think this is a misunderstanding. This isn't going to result in generalized performance improvements. It's a general pessimization. But it seems extremely likely to facilitate specific performance fixes that will make a huge difference to users who are suffering from particular performance problems. This should benefit users in practice to a much greater extent than it hurts.

Hopefully. ;)

Like Brendan said, "once you find a 500% perf win you have a different perspective about the <1% cost." Well the cost may be a little higher than 1%, but point remains.

Fedora's tempest in a stack frame

Posted Jan 19, 2023 4:50 UTC (Thu) by marcH (subscriber, #57642) [Link]

> But it seems extremely likely to facilitate specific performance fixes that will make a huge difference to users who are suffering from particular performance problems.

Thank you. Who cares about a 2% hit on software that runs 0.001% of the time = the vast majority of software. Performance is ALL about bottlenecks and critical sections. Which you can't do anything about if you don't even know where they are.

Fedora's tempest in a stack frame

Posted Jan 20, 2023 7:58 UTC (Fri) by mgedmin (subscriber, #34497) [Link]

As a somewhat technical user, I'll take the 1-5% performance hit if it enables me to debug issues like firefox + gnome-shell randomly deciding to collectively eat 120% CPU while the screen is locked. I've tried sysprof and the profile is gibberish without frame pointers.

Sadly I use Ubuntu, so Fedora's decision won't help me directly. It does make it somewhat more tempting to switch distros.

Fedora's tempest in a stack frame

Posted Jan 28, 2023 19:17 UTC (Sat) by sammythesnake (guest, #17693) [Link]

I'd be interested to know if the "unusually large function that happens to be extra hard hit by frame pointers" issue that affects cpython also applies to pypy.

Does anyone happen to know? Did Fedora benchmark that, or is pypy enough of a corner case that it wasn't part of the benchmarking?

If the performance hit for pypy is more typical than it is for cpython then maybe it matters less - those with more of a performance-weighted set of priorities might already largely be using pypy...?

Fedora's tempest in a stack frame

Posted Jan 16, 2023 17:00 UTC (Mon) by bredelings (subscriber, #53082) [Link] (17 responses)

Profiling doesn't always need a frame pointer. As the F37 proposal mentions, you can use DWARF information in the binary to unwind the stack frame instead of using frame pointers.

However, the F37 proposal authors consider this to be unacceptable for a number of reasons:
* the kernel records stack-unwinding info in ORC instead of DWARF
* when using DWARF unwinding, the kernel saves all the stack data at each snapshot, as well as the instruction pointers. this stack info is then decoded later in userspace.
* there can be problems getting a COMPLETE stack trace, versus just the last N calls, for some N.

Its not clear to me that enabling frame pointers everywhere is the only solution here...

Fedora's tempest in a stack frame

Posted Jan 16, 2023 17:03 UTC (Mon) by vwduder (subscriber, #58547) [Link] (12 responses)

I cover a lot more of the issues in https://blogs.gnome.org/chergert/2022/12/31/frame-pointer...

Even if we could unwind fast enough by stashing stacks for user-space to decode, it potentially copies sensitive user information meaning we can't share the recordings with developers to further investigate.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 17:20 UTC (Mon) by bredelings (subscriber, #53082) [Link]

Thanks, that's informative. Its more clear that system-wide profiling is one of the driving issues.

At the end of that blog post, you say that enabling frame pointers on everything is a practical, short-term solution. That seems reasonable.

What are some of the options for a long-term solution?

For example, one option might be to include ORC info on everything.

But it seems like there are probably be a lot of other options.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 20:46 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (10 responses)

> I cover a lot more of the issues in https://blogs.gnome.org/chergert/2022/12/31/frame-pointer...

This blog post seems to be written under the assumption that the kernel has to do the unwinding. It explores several options for doing so, all with serious trade-offs (e.g. having to keep large debug information tables resident just in case the kernel might need them). Why not let userspace do its own unwinding? Doing so moves concerns about containers, static binaries, eBPF program installation and so on from the kernel to userspace.

> If we can’t do this thousands of times per second, it’s not fast enough.

Does the kernel have magical performance superpowers? Whatever userspace unwinding strategy you choose, ISTM userspace can implement that strategy at least as efficiently as the kernel. Performance doesn't seem like a good reason to keep userspace unwinding out of userspace.

Speaking of performance: why is nobody talking about profiling by snapshotting the shadow call stack? For the case of native code, the SCS is *exactly* the right data structure: it's a dense list of frame pointers!

> If an RPM is upgraded, you lose access to the library mapped into memory from outside the process as both the inode and CRC will have changed on disk. You can’t build unwind tables, so accounting in that process breaks.

The right way to deal with this issue is to log all mmap()s in the perf/ftrace ring buffer along with the build-ID tags of all mapped binaries. This way, postprocessing tools can query debug servers (by build-ID) and obtain the right PC->source mapping even if some system package has been updated or inode assignments changed.

> Even if we could unwind fast enough by stashing stacks for user-space to decode, it potentially copies sensitive user information meaning we can't share the recordings with developers to further investigate.

Aren't you assuming that the dump of the kernel perf/ftrace ring buffer would have to be the same as the data file shared outside the machine? ISTM that a post-processing step could remove arbitrary sensitive information from traces before sharing.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 22:44 UTC (Mon) by vwduder (subscriber, #58547) [Link] (8 responses)

> Why not let userspace do its own unwinding? Doing so moves concerns about containers, static binaries, eBPF program installation and so on from the kernel to userspace.

Can you explain how you would expect that to work?

If you mean to snapshot the stack and to unwind in userspace using something like libunwind, then it's simply to slow. If you mean to create something that just uses SIGPROF+unwind in-process, well that isn't system-wide profiling.

The perf sample point happens within the kernel, and you can't take faults so everything has to be available at that moment. If you don't have enough information to unwind there, you can't get stacks that cross kernel and user-space or enough information to symbolize after the fact.

Most of what Sysprof does, at least, is from user-space after the samples are captured.

> Does the kernel have magical performance superpowers? Whatever userspace unwinding strategy you choose, ISTM userspace can implement that strategy at least as efficiently as the kernel. Performance doesn't seem like a good reason to keep userspace unwinding out of userspace.

Again, I don't see how this would work. If you provide me some basic understanding of how this should work, I can give you some examples of how I can break it.

> Speaking of performance: why is nobody talking about profiling by snapshotting the shadow call stack? For the case of native code, the SCS is *exactly* the right data structure: it's a dense list of frame pointers!

Because it's not available in most configurations.

> The right way to deal with this issue is to log all mmap()s in the perf/ftrace ring buffer along with the build-ID tags of all mapped binaries. This way, postprocessing tools can query debug servers (by build-ID) and obtain the right PC->source mapping even if some system package has been updated or inode assignments changed.

Sysprof does in fact log all of these so we can symbolize properly. However, not everything has a build-id. The things that often trip up unwinding to the point of uselessness is JIT/FFI.

If you can't handle even handle unwinding libffi properly, the design is probably broken.

> Aren't you assuming that the dump of the kernel perf/ftrace ring buffer would have to be the same as the data file shared outside the machine? ISTM that a post-processing step could remove arbitrary sensitive information from traces before sharing.

Sysprof already has options to tack on symbol decode at the end of a trace file so they can be shared across machines. But it's symbolizer is an interface which has multiple implementations, so you could symbolize using something like debuginfod or a directory of binaries, etc.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 23:07 UTC (Mon) by vwduder (subscriber, #58547) [Link] (2 responses)

> Why not let userspace do its own unwinding? Doing so moves concerns about containers, static binaries, eBPF program installation and so on from the kernel to userspace.

Read your post below and replied there.

In short, yes, what we all would want to do is to unwind a single time in user-space during scheduler transition. That is not something that is likely to ship in the next 6 months to a year if you ask me. And we very much need things working up until someone shows up to write that code and can:

1. Prove it works
2. Get new syscalls into the kernel
3. Get language runtimes, JIT, FFI trampolines, etc to buy in
4. Profiler tooling to consume the different event types and cook the callgraph results

Fedora's tempest in a stack frame

Posted Jan 16, 2023 23:23 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (1 responses)

> That is not something that is likely to ship in the next 6 months to a year if you ask me.

Where is the urgency coming from? We've lived with the current situation this long. We can spend a year fixing it the right way.

Fedora's tempest in a stack frame

Posted Jan 17, 2023 5:25 UTC (Tue) by vwduder (subscriber, #58547) [Link]

> Where is the urgency coming from? We've lived with the current situation this long. We can spend a year fixing it the right way.

Because we've been waiting for this for a decade and nothing has materialized.

Are you signing up to implement your design? Because if so, that sound great! We can disable frame-pointers when it's ready to ship. I'll even do you a solid and implement the Sysprof part.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 23:52 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (4 responses)

> If you mean to snapshot the stack

Keep in mind that there's no "snapshot" involved. When a thread returns from kernel mode to user mode, its stack is snapshotted because that thread is running the unwind signal handler immediately upon returning to user mode and before resuming whatever it was doing before it entered the kernel. The stack is "snapshotted" implicitly in the same way that the stack above a call to sleep(10) is "snapshotted".

> and to unwind in userspace using something like libunwind, then it's simply to slow.

Why would it be any slower than whatever the kernel does? Perhaps full asynchronous DWARF bytecode interpretation would be too slow, but ORC unwinding probably wouldn't be. Either way, you take a kernel problem and make it userspace's problem. Userspace can use every unwinding strategy that the kernel can use and a lot more. There's no downside.

> If you mean to create something that just uses SIGPROF+unwind in-process, well that isn't system-wide profiling.

I'm talking about something a bit different from SIGPROF fired via setitimer(2). I'm talking about using a signal handler to *collect* the user segment of a stack the collection of which is triggered using exactly the same triggers we have for kernel-side stack collection today. It *is* system-wide profiling.

> The perf sample point happens within the kernel, and you can't take faults so everything has to be available at that moment.

It's not the case that "everything" has to be available "at that moment". At the moment the perf event fires, we have to capture only the *kernel stack*. The *user stack* of any thread running in kernel mode is frozen until that thread leaves kernel mode and re-enters user mode, so we can defer the user mode stack collection until the thread returns to user space without loss of information!

> Again, I don't see how this would work. If you provide me some basic understanding of how this should work, I can give you some examples of how I can break it.

When a perf event fires, we want "the" stack corresponding to that event. That stack always has two parts: the kernel stack and the user stack. (We always have a kernel stack because perf events always fire in the kernel, even if the kernel is just running an ISR.) I'm suggesting that we capture the kernel and user stacks separately and that we combine them later, in post-processing. User-space collects user stacks and can use whatever stack traversal approach it wants, e.g. an asynchronous DWARF unwinder, frame-pointer walking, or various runtime-specific approaches.

Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses? User space can traverse frame pointers just as well as the kernel. LDR, B.LE, and MOV don't execute any faster in kernel mode than they do in user mode.

> Because it's not available in most configurations.

But when it is, it's exactly the right thing. That's why unwinding should be modular and implemented in user-space: flexibility.

> The things that often trip up unwinding to the point of uselessness is JIT/FFI.

That's why we need managed runtime environments to participate explicitly in the unwinding process.

Fedora's tempest in a stack frame

Posted Jan 17, 2023 5:46 UTC (Tue) by vwduder (subscriber, #58547) [Link] (3 responses)

It sounds like you're describing Peter Zijlstra 's idea from 2017?

https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html

> Why would it be any slower than whatever the kernel does?
> Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses?

The point was it's too slow if we have to copy stack pages via perf to the user-space capture. Obviously it's irrelevant if coordinating processes are unwinding themselves from a glorified signal handler.

> I'm talking about something a bit different from SIGPROF fired via setitimer(2). I'm talking about using a signal handler to *collect* the user segment of a stack the collection of which is triggered using exactly the same triggers we have for kernel-side stack collection today. It *is* system-wide profiling.

Would you anticipate the unwound stack could be placed directly into a map for the kernel to consume on it's next sample (and provide to the perf stream)?

> Because it's not available in most configurations.
>> But when it is, it's exactly the right thing. That's why unwinding should be modular and implemented in user-space: flexibility.

I agree.

Somebody (clearly you have the expertise) needs to go implement it and then we need to wait for years for it to be available everywhere so we can finally rely on it.

In the meantime, I'll have to enable frame-pointers so I can get some work done.

> The things that often trip up unwinding to the point of uselessness is JIT/FFI.
>> That's why we need managed runtime environments to participate explicitly in the unwinding process.

I fully agree, and it's what I too am advocating for.

Fedora's tempest in a stack frame

Posted Jan 17, 2023 6:02 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (2 responses)

> https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html

More or less, although Zijlstra doesn't seem to talk about making the unwinder pluggable. That, and what he calls a "gloriously ugly hack" I'd call a clean separation of concerns. :-)

> Would you anticipate the unwound stack could be placed directly into a map for the kernel to consume on it's next sample (and provide to the perf stream)?

Yes, and userspace could point to that "map" as an rseq extension. It might be easier, however, to just have threads make a system call (or post an io_uring request) to donate their stacks each time they finish an unwind. (Although, come to think of it, threads could in principle batch stack submission.) You could implement an explicit system call and then the rseq thing (supporting both) if the performance of the system call (even batched) proved inadequate.

> In the meantime, I'll have to enable frame-pointers so I can get some work done.

The point of this LWN article is that not everyone agrees that this is the right approach.

Fedora's tempest in a stack frame

Posted Jan 17, 2023 16:55 UTC (Tue) by vwduder (subscriber, #58547) [Link] (1 responses)

> It might be easier, however, to just have threads make a system call (or post an io_uring request) to donate their stacks each time they finish an unwind.

I think you'll have a hard time finding people who want to give up a limited resource (max io_uring per process) for stack delivery via their libc.

Fedora's tempest in a stack frame

Posted Jan 17, 2023 17:14 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

Huh? Why would io_uring requests be some kind of limited resource? Do we want to promote efficient alternatives to conventional system calls or not? It makes no sense to say "Hey, we have a more efficient way to communicate with the kernel" and at the same time, "You can't use this efficient interface, library author, due to some random arbitrary restrictions we imposed on ourselves ".

In any case, the use of io_uring is not central to my proposal, so the point is irrelevant. That said, all authors of new kernel interfaces should be designing with io_uring support in mind.

Fedora's tempest in a stack frame

Posted Feb 26, 2023 15:27 UTC (Sun) by nix (subscriber, #2304) [Link]

> If an RPM is upgraded, you lose access to the library mapped into memory from outside the process as both the inode and CRC will have changed on disk.

If the binary is still running and ptraceable, you don't: you can open files in /proc/$pid/map_files/ and bingo (yes, that gives you the *whole file*, not just the mapping, including non-loaded sections). We use this in DTrace, and it just works, where Solaris had to use this horrific baroque scheme involving serial numbers nailed into ELF objects to detect a deletion/recreation of a binary.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 19:47 UTC (Mon) by Sesse (subscriber, #53779) [Link] (3 responses)

You can also use LBR if you're on an Intel chip, but it tends to be pretty spotty.

The biggest advantage with DWARF for call stacks is that it is very precise when it works; in particular, it understands inlining. The biggest disadvantage is that it's very slow.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 20:24 UTC (Mon) by vwduder (subscriber, #58547) [Link] (2 responses)

LBR has roughly 8..32 entries in it, depending how much you're willing to pay Intel for your CPU.

A significant portion of stack traces, that are interesting to us in whole system profiling, are deeper than 32 stack frames.

Without a common root, it's difficult to build callgraphs that have the slightest amount of value.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 20:25 UTC (Mon) by Sesse (subscriber, #53779) [Link]

I will not deny that it has a bunch of limitations. But I've found it useful sometimes as a last resort (it can occasionally get out call stacks where nothing else can).

The nuclear option is Intel PT, of course.

Fedora's tempest in a stack frame

Posted Jan 19, 2023 17:25 UTC (Thu) by irogers (subscriber, #121692) [Link]

Just to advertise better LBR callgraphs with 'perf report --stitch-lbr' which is disabled by default. More context in the patch series:
https://lore.kernel.org/lkml/20200309174639.4594-1-kan.li...

Fedora's tempest in a stack frame

Posted Jan 16, 2023 17:07 UTC (Mon) by mcatanzaro (subscriber, #93033) [Link] (1 responses)

I'm quoted in this article saying something not very nice to fellow Fedora developers. This was not a good example of constructive disagreement, and I previously apologized for this comment. What I should have said instead was that I disagree with how the opponents of the change proposal were evaluating the performance considerations at stake here.

Fedora's tempest in a stack frame

Posted Jan 17, 2023 1:28 UTC (Tue) by ringerc (subscriber, #3071) [Link]

Thanks. That's the sort of response it's good to see. I'm an uninvolved 3rd party, but I respect people who acknowledge poor communication choices and work to improve them. It helps improve the culture and community across the wider open source world, within and outside that specific community.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 17:20 UTC (Mon) by siddhesh (guest, #64914) [Link]

> While this whole discussion was going on, FESCo approved another build-option change (setting _FORTIFY_SOURCE=3 for increased security). That change also has a performance cost (though how much is not clear);

It is largely clear; the broad based impact (measured with SPEC2000 and SPEC2017) is zero[1], the only concern is the possibility of some corner cases. I've got a post coming out that should hopefully make it a bit clearer; I had hypothesized about the performance impact in my post[2] introducing _FORTIFY_SOURCE=3 and my hypothesis got blown out of proportion in this comparison.

[1] https://fedoraproject.org/w/index.php?title=Changes/Add_F...
[2] https://developers.redhat.com/articles/2022/09/17/gccs-ne...

distro wide cflags are a bad idea anyway

Posted Jan 16, 2023 18:06 UTC (Mon) by ballombe (subscriber, #9523) [Link] (4 responses)

Different packages are for different use cases, which call for different compiler flags.
Some packages are better optimized for speed other for security, and other for trouble shooting.

distro wide cflags are a bad idea anyway

Posted Jan 17, 2023 17:15 UTC (Tue) by mwsealey (subscriber, #71282) [Link] (3 responses)

I hear you just volunteered to audit every package in the repository and add a special toggle as to whether it needs -fomit-frame-pointers or no -fomit-frame-pointers.

distro wide cflags are a bad idea anyway

Posted Jan 28, 2023 21:09 UTC (Sat) by sammythesnake (guest, #17693) [Link] (2 responses)

I'm not very familiar with Fedora's organisational structure (or much more familiar with that of $other_distro that I use) but couldn't that be a choice made in a package by package basis by the relevant package maintainers?

distro wide cflags are a bad idea anyway

Posted Jan 29, 2023 12:52 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

> I'm not very familiar with Fedora's organisational structure (or much more familiar with that of $other_distro that I use) but couldn't that be a choice made in a package by package basis by the relevant package maintainers?

Yes, Fedora already does that. For instance, Python excludes it based on the performance impact cited in the article. Defaults only apply when the maintainer hasn't changed it. It is not enforced for every package.

distro wide cflags are a bad idea anyway

Posted Jan 29, 2023 12:53 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

That will just result in a whack-a-mole flurry of issues for people asking their favorite packages to turn it on/off depending on their preferences. A distro-wide with specific exclusions makes more sense IMO.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 20:33 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (9 responses)

Crossposted from https://lists.fedoraproject.org/archives/list/devel@lists...

Frame pointers also have the disadvantage of working only with AOT-compiled languages for
which a trace analysis tool can associate an instruction pointer with a
semantically-relevant bit of code. If you try to use frame pointers to profile a Python
program, all you're going to get is a profile of the interpreter. It seems like the
debate is between those who want observability (via frame pointers) and those who want the
performance benefits of -fomit-frame-pointer.

There's a third way.

See, both pro-FP and anti-FP camps think that it's the kernel that has to do the
unwinding unless we copy whole stacks into traces. Why should that be? As mentioned in
[1], instead of finding a way to have the kernel unwind user programs, we can create a
protocol through which the kernel can ask usermode to unwind itself. It could work like
this:

1) backtrace requested in the kernel (e.g. to a perf counter overflow)

2) kernel unwinds itself to the userspace boundary the usual way

3) kernel forms a nonce (e.g. by incrementing a 64-bit counter)

4) kernel logs a stack trace the usual way (e.g. to the ftrace ring buffer), but with the
final frame referring to the nonce created in the previous step

5) kernel queues a signal (one userspace has explicitly opted into via a new prctl()); the
siginfo_t structure encodes (e.g. via si_status and si_value) the nonce

6) kernel eventually returns to userspace; queued signal handler gains control

7) signal handler unwinds the calling thread however it wants (and can sleep and take page
faults if needed)

8) signal handler logs the result of its unwind, along with the nonce, to the system log
(e.g. via a new system call, a sysfs write, an io_uring submission, etc.)

Post-processing tools can associate kernel stacks with user stacks tagged with the
corresponding nonces and reconstitute the full stacks in effect at the time of each logged
event.

We can avoid duplicating unwindgs too: at step #3, if the kernel finds that the current
thread already has an unwind pending, it can uses the already-pending nonce instead of
making a new one and queuing a signal: many kernel stacks can end with the same user stack
"tail".

One nice property of this scheme is that the userspace unwinding isn't limited to
native code. Libc could arbitrate unwinding across an arbitrary number of managed runtime
environments in the context of a single process: the system could be smart enough to know
that instead of unwinding through, e.g. Python interpreter frames, the unwinder (which is
normal userspace code, pluggable via DSO!) could traverse and log *Python* stack frames
instead, with meaningful function names. And if you happened to have, say, a JavaScript
runtime in the same process, both JavaScript and Python could participate in the semantic
unwinding process.

A pluggable userspace unwind mechanism would have zero cost in the case that we're not
recording stack frames. On top of that, a pluggable userspace unwinder *could* be written
to traverse frame pointers just as the kernel unwinder does today, if userspace thinks
that's the best option. Without breaking kernel ABI, that userspace unwinder could use
DWARF, or ORC, or any other userspace unwinding approach. It's future-proof.

In other words, choice between frame pointers and no frame pointers is a false dichotomy.
There's a better approach. The Linux ecosystem as a whole would be better off building
something like the pluggable userspace asynchronous unwinding infrastructure described
above.

[1] https://lists.fedoraproject.org/archives/list/devel@lists...

Fedora's tempest in a stack frame

Posted Jan 16, 2023 21:08 UTC (Mon) by Sesse (subscriber, #53779) [Link] (3 responses)

What distinguishes your steps 5 onwards from the old SIGPROF-based way of profiling? Which never got much traction (certainly never from interpreters, and -pg never really worked well with multithreaded programs).

Fedora's tempest in a stack frame

Posted Jan 16, 2023 21:46 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (2 responses)

> What distinguishes your steps 5 onwards from the old SIGPROF-based way of profiling?

Integration with kernel stack traces and the ability to fire in response to any event for which the kernel would otherwise collect a stack trace.

ITIMER_PROF is less flexible and precise: it fires in response only to accumulated CPU time, and CPU time from the whole process (as opposed to each thread individually) at that. There's also no general-purpose way to collect information in response to SIGPERF across the whole system: each process hooking into SIGPERF records information in its own way. (There's also no multi-tenant support for SIGPERF out of the box.)

The mechanism I'm proposing is no less flexible or accurate than kernel collection of usermode stack traces: the user mode stack of a thread cannot change between the time we notice a perf-relevant event in the kernel and the time at which that thread returns to userspace, and each thread is independent. On top of that, I'm proposing a mechanism through which userspace can contribute its unwound stack information back into the kernel performance ring buffer (a capability missing in SIGPERF), allowing tools to synthesize user and kernel stacks into coherent stacks that work the same way across the whole system.

> Which never got much traction (certainly never from interpreters, and -pg never really worked well with multithreaded programs)

Without an end-to-end integration of the sort I'm describing, there's limited motivation for interpreters to support SIGPERF.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 23:03 UTC (Mon) by vwduder (subscriber, #58547) [Link] (1 responses)

I've proposed something very similar to a number of people recently, but as you write, is going to require a significant amount of work to plumb through the entire stack.

Additionally, in my blog post, I mentioned that none of us actually _want_ frame-pointers, what we want is _works out of the box_.

So until a scheme like is described here can be implemented, propagated to language runtimes and tooling, we need something else as a stop gap (e.g. frame-pointers).

Fedora's tempest in a stack frame

Posted Jan 16, 2023 23:39 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

> I've proposed something very similar to a number of people recently

Great! Any links you can share?

> So until a scheme like is described here can be implemented, propagated to language runtimes and tooling, we need something else as a stop gap (e.g. frame-pointers).

I'm worried about the moral hazard.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 23:00 UTC (Mon) by atnot (subscriber, #124910) [Link] (3 responses)

One issue with this is that it does not address whole system profiling. This is all well and good when you want to profile one specific bit of code that can opt into profiling, but this is not really a luxury one has in many situations.

For example, a common scenario might involve noticing that you are running out of disk bandwidth on some device. You attach a probe to look for disk events taking a certain amount of time and get their stack traces. The events look normal, but the deeper frames of the stack trace help you discover they are actually called from a background job. This helps you discover that this new software that was recently set up has a poorly written batch job which you are sometimes hitting a pathological case on that repeatedly rereads and rewrites the same file.

This isn't really a thing that can be discovered by instrumenting a single application. This sort of thing has to be universal for it to work, and that's not really the case if you introduce something applications must especially add and opt into today.

Fedora's tempest in a stack frame

Posted Jan 16, 2023 23:22 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (2 responses)

> One issue with this is that it does not address whole system profiling. This is all well and good when you want to profile one specific bit of code that can opt into profiling, but this is not really a luxury one has in many situations.

Sure it does. If you put the unwind code in libc, every program that links against libc can unwind in userspace. Programs that haven't opted into userspace unwinding could presumably be detected and unwound like they are today.

> This sort of thing has to be universal for it to work, and that's not really the case if you introduce something applications must especially add and opt into today.

It's libc that would opt into the mechanism on behalf of each application using that libc. Individual application authors wouldn't do anything.

Fedora's tempest in a stack frame

Posted Jan 17, 2023 5:49 UTC (Tue) by vwduder (subscriber, #58547) [Link] (1 responses)

> > Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses?

This is surprisingly a lot more code that one might think. Particularly Golang where it is much more common to link without glibc. I do think they include frame-pointers though ;)

Fedora's tempest in a stack frame

Posted Jan 17, 2023 17:20 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

> This is surprisingly a lot more code that one might think

The amount of code involved isn't relevant to the discussion. However much code it is, the kernel doesn't have magical powers that lets it run that code more efficiently than userspace can.

Fedora's tempest in a stack frame

Posted Jan 26, 2023 6:57 UTC (Thu) by njs (subscriber, #40338) [Link]

> If you try to use frame pointers to profile a Python program, all you're going to get is a profile of the interpreter.

perf does have a mechanism for JITs and interpreted languages to record their function calls in perf call stacks. The next python release will even have support built in: https://docs.python.org/3.12/howto/perf_profiling.html

It is pretty janky though, and if you could get all the pieces lined up to support your approach I think everyone would be very happy!

Fedora's tempest in a stack frame

Posted Jan 17, 2023 18:04 UTC (Tue) by mwsealey (subscriber, #71282) [Link] (3 responses)

If the Python problem is because frame pointer usage causes stack spilling due to a controversially large switch statement, if register pressure was that high that performance is that fragile for the loss of a single register, then that function is a performance problem, not the use of the frame pointer. One can only boggle at the idea of hiding this kind of technical debt.

Fedora's tempest in a stack frame

Posted Jan 17, 2023 19:17 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

The problem here is that it has to be one function to utilize the computed jump optimizations for this switch statements. Splitting it into multiple functions results in a similar loss of performance.

Fedora's tempest in a stack frame

Posted Jan 19, 2023 2:14 UTC (Thu) by ncm (guest, #165) [Link] (1 responses)

Just because you push a frame-pointer value on the stack does not mean you need to reserve a register to keep that value in. The value to push can be computed on demand. There is no need to access stack variables relative to that value when the stack pointer is right there.

So, the performance cost to Python is just lack of a trivial optimization in the compiler. That optimization is already implemented, and just not applied when the frame pointer is being pushed.

Fedora's tempest in a stack frame

Posted Jan 19, 2023 19:36 UTC (Thu) by dezgeg (subscriber, #92243) [Link]

No, you do need to keep the frame pointer in RBP so that the call stack can be traversed by perf at any arbitrary sampling point.

Fedora's tempest in a stack frame

Posted Jan 17, 2023 20:37 UTC (Tue) by flussence (guest, #85566) [Link]

There must be something I'm misunderstanding here; this is the first I've heard of the frame pointer compile option having any relevance to performance or debuggability since I last installed Gentoo on an i686 a decade ago. I thought it made no difference on amd64 (and I haven't personally noticed any loss of function in gdb or perf there) and that's why it's on by default in the first place.

Fedora's tempest in a stack frame

Posted Jan 22, 2023 21:30 UTC (Sun) by smitty_one_each (subscriber, #28989) [Link]

Gentoo users are all: "Tee hee hee".