LWN: Comments on "Fedora's tempest in a stack frame" https://lwn.net/Articles/919940/ This is a special feed containing comments posted to the individual LWN article titled "Fedora's tempest in a stack frame". en-us Sun, 12 Oct 2025 01:40:37 +0000 Sun, 12 Oct 2025 01:40:37 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Fedora's tempest in a stack frame https://lwn.net/Articles/924468/ https://lwn.net/Articles/924468/ nix <div class="FormattedComment"> <span class="QuotedText">&gt; If an RPM is upgraded, you lose access to the library mapped into memory from outside the process as both the inode and CRC will have changed on disk. </span><br> <p> If the binary is still running and ptraceable, you don't: you can open files in /proc/$pid/map_files/ and bingo (yes, that gives you the *whole file*, not just the mapping, including non-loaded sections). We use this in DTrace, and it just works, where Solaris had to use this horrific baroque scheme involving serial numbers nailed into ELF objects to detect a deletion/recreation of a binary.<br> </div> Sun, 26 Feb 2023 15:27:24 +0000 distro wide cflags are a bad idea anyway https://lwn.net/Articles/921547/ https://lwn.net/Articles/921547/ mathstuf <div class="FormattedComment"> That will just result in a whack-a-mole flurry of issues for people asking their favorite packages to turn it on/off depending on their preferences. A distro-wide with specific exclusions makes more sense IMO.<br> </div> Sun, 29 Jan 2023 12:53:58 +0000 distro wide cflags are a bad idea anyway https://lwn.net/Articles/921546/ https://lwn.net/Articles/921546/ rahulsundaram <div class="FormattedComment"> <span class="QuotedText">&gt; I'm not very familiar with Fedora's organisational structure (or much more familiar with that of $other_distro that I use) but couldn't that be a choice made in a package by package basis by the relevant package maintainers?</span><br> <p> Yes, Fedora already does that. For instance, Python excludes it based on the performance impact cited in the article. Defaults only apply when the maintainer hasn't changed it. It is not enforced for every package.<br> </div> Sun, 29 Jan 2023 12:52:38 +0000 distro wide cflags are a bad idea anyway https://lwn.net/Articles/921542/ https://lwn.net/Articles/921542/ sammythesnake <div class="FormattedComment"> I'm not very familiar with Fedora's organisational structure (or much more familiar with that of $other_distro that I use) but couldn't that be a choice made in a package by package basis by the relevant package maintainers?<br> </div> Sat, 28 Jan 2023 21:09:45 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/921538/ https://lwn.net/Articles/921538/ sammythesnake <div class="FormattedComment"> I'd be interested to know if the "unusually large function that happens to be extra hard hit by frame pointers" issue that affects cpython also applies to pypy.<br> <p> Does anyone happen to know? 
Did Fedora benchmark that, or is pypy enough of a corner case that it wasn't part of the benchmarking?<br> <p> If the performance hit for pypy is more typical than it is for cpython then maybe it matters less - those with more of a performance-weighted set of priorities might already largely be using pypy...?<br> </div> Sat, 28 Jan 2023 19:17:36 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/921281/ https://lwn.net/Articles/921281/ njs <div class="FormattedComment"> <span class="QuotedText">&gt; If you try to use frame pointers to profile a Python program, all you're going to get is a profile of the interpreter. </span><br> <p> perf does have a mechanism for JITs and interpreted languages to record their function calls in perf call stacks. The next python release will even have support built in: <a rel="nofollow" href="https://docs.python.org/3.12/howto/perf_profiling.html">https://docs.python.org/3.12/howto/perf_profiling.html</a><br> <p> It is pretty janky though, and if you could get all the pieces lined up to support your approach I think everyone would be very happy!<br> </div> Thu, 26 Jan 2023 06:57:53 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920758/ https://lwn.net/Articles/920758/ smitty_one_each <div class="FormattedComment"> Gentoo users are all: "Tee hee hee".<br> </div> Sun, 22 Jan 2023 21:30:14 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920588/ https://lwn.net/Articles/920588/ mgedmin <div class="FormattedComment"> As a somewhat technical user, I'll take the 1-5% performance hit if it enables me to debug issues like firefox + gnome-shell randomly deciding to collectively eat 120% CPU while the screen is locked. I've tried sysprof and the profile is gibberish without frame pointers.<br> <p> Sadly I use Ubuntu, so Fedora's decision won't help me directly. It does make it somewhat more tempting to switch distros.<br> </div> Fri, 20 Jan 2023 07:58:27 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920545/ https://lwn.net/Articles/920545/ dezgeg <div class="FormattedComment"> No, you do need to keep the frame pointer in RBP so that the call stack can be traversed by perf at any arbitrary sampling point.<br> </div> Thu, 19 Jan 2023 19:36:26 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920529/ https://lwn.net/Articles/920529/ irogers <div class="FormattedComment"> Just to advertise better LBR callgraphs with 'perf report --stitch-lbr' which is disabled by default. More context in the patch series:<br> <a href="https://lore.kernel.org/lkml/20200309174639.4594-1-kan.liang@linux.intel.com/">https://lore.kernel.org/lkml/20200309174639.4594-1-kan.li...</a><br> </div> Thu, 19 Jan 2023 17:25:56 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920414/ https://lwn.net/Articles/920414/ marcH <div class="FormattedComment"> <span class="QuotedText">&gt; But it seems extremely likely to facilitate specific performance fixes that will make a huge difference to users who are suffering from particular performance problems.</span><br> <p> Thank you. Who cares about a 2% hit on software that runs 0.001% of the time = the vast majority of software. Performance is ALL about bottlenecks and critical sections. 
Which you can't do anything about if you don't even know where they are.<br> <p> </div> Thu, 19 Jan 2023 04:50:31 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920408/ https://lwn.net/Articles/920408/ ncm <div class="FormattedComment"> Just because you push a frame-pointer value on the stack does not mean you need to reserve a register to keep that value in. The value to push can be computed on demand. There is no need to access stack variables relative to that value when the stack pointer is right there.<br> <p> So, the performance cost to Python is just lack of a trivial optimization in the compiler. That optimization is already implemented, and just not applied when the frame pointer is being pushed.<br> </div> Thu, 19 Jan 2023 02:14:39 +0000 Placement of saved registers in a stack frame https://lwn.net/Articles/920406/ https://lwn.net/Articles/920406/ ncm <div class="FormattedComment"> There is no suggestion to change the calling convention.<br> <p> The compiler is free to push the frame pointer and then ignore it in subsequent stack accesses and use the same addressing mode, relative to the stack pointer, as it does now with no frame pointer. It also does not need to pop the value it pushed, but just increment the stack pointer past it.<br> <p> </div> Thu, 19 Jan 2023 01:59:57 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920401/ https://lwn.net/Articles/920401/ mcatanzaro <div class="FormattedComment"> I think this is a misunderstanding. This isn't going to result in generalized performance improvements. It's a general pessimization. But it seems extremely likely to facilitate specific performance fixes that will make a huge difference to users who are suffering from particular performance problems. This should benefit users in practice to a much greater extent than it hurts.<br> <p> Hopefully. ;)<br> <p> Like Brendan said, "once you find a 500% perf win you have a different perspective about the &lt;1% cost." Well the cost may be a little higher than 1%, but point remains.<br> </div> Thu, 19 Jan 2023 00:15:22 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920398/ https://lwn.net/Articles/920398/ pizza <div class="FormattedComment"> <span class="QuotedText">&gt; But I seriously doubt being able to profile Fedora more easily will result in enough optimization patches to improve performance at least 2% across the board *all around the distribution*.</span><br> <p> I'm looking forward to finding out who's correct here.<br> <p> Speaking personally, being able to do system-wide (ie not just single-application) profiling is worth a slight overall performance hit. Because currently it is effectively impossible to do so (at least for anything non-trivial)<br> <p> </div> Wed, 18 Jan 2023 23:48:38 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920392/ https://lwn.net/Articles/920392/ intgr <div class="FormattedComment"> Being a foremost expert doesn't guarantee that they see the big picture though.<br> <p> He is probably right that the tradeoff makes sense for Netflix.<br> <p> But I seriously doubt being able to profile Fedora more easily will result in enough optimization patches to improve performance at least 2% across the board *all around the distribution*. 
Maybe some specific places do end up getting optimized thanks to this change, but it's almost guaranteed to be a net loss given the large amount of software shipped by the distro.<br> </div> Wed, 18 Jan 2023 23:06:44 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920385/ https://lwn.net/Articles/920385/ jlargentaye <div class="FormattedComment"> Interesting to see Brendan Gregg, IMHO the foremost expert in profiling, chiming in later in that conversation:<br> <p> <a href="https://pagure.io/fesco/issue/2817#comment-826805">https://pagure.io/fesco/issue/2817#comment-826805</a><br> </div> Wed, 18 Jan 2023 21:27:55 +0000 Placement of saved registers in a stack frame https://lwn.net/Articles/920294/ https://lwn.net/Articles/920294/ dezgeg <div class="FormattedComment"> Changing location of rbp on the stack would be an ABI break (for anyone trying to unwind the stack), yes. But the last sentence is (if I understood right) an independent optimization of having the compiler still generate local variable accesses via rsp-relative offsets when beneficial, instead of always through the frame pointer (when frame pointers are enabled, of course). <br> </div> Wed, 18 Jan 2023 13:54:13 +0000 Placement of saved registers in a stack frame https://lwn.net/Articles/920289/ https://lwn.net/Articles/920289/ jengelh <div class="FormattedComment"> <span class="QuotedText">&gt;nothing except intellectual inertia by the compiler prevents</span><br> <p> Have you considered ABI? This looks like a change of calling convention.<br> </div> Wed, 18 Jan 2023 10:22:56 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920253/ https://lwn.net/Articles/920253/ flussence <div class="FormattedComment"> There must be something I'm misunderstanding here; this is the first I've heard of the frame pointer compile option having any relevance to performance or debuggability since I last installed Gentoo on an i686 a decade ago. I thought it made no difference on amd64 (and I haven't personally noticed any loss of function in gdb or perf there) and that's why it's on by default in the first place.<br> </div> Tue, 17 Jan 2023 20:37:41 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920250/ https://lwn.net/Articles/920250/ Cyberax <div class="FormattedComment"> The problem here is that it has to be one function to utilize the computed jump optimizations for this switch statements. Splitting it into multiple functions results in a similar loss of performance.<br> </div> Tue, 17 Jan 2023 19:17:45 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920242/ https://lwn.net/Articles/920242/ mwsealey <div class="FormattedComment"> If the Python problem is because frame pointer usage causes stack spilling due to a controversially large switch statement, if register pressure was that high that performance is that fragile for the loss of a single register, then that function is a performance problem, not the use of the frame pointer. One can only boggle at the idea of hiding this kind of technical debt.<br> </div> Tue, 17 Jan 2023 18:04:00 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920239/ https://lwn.net/Articles/920239/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; This is surprisingly a lot more code that one might think</span><br> <p> The amount of code involved isn't relevant to the discussion. 
However much code it is, the kernel doesn't have magical powers that lets it run that code more efficiently than userspace can.<br> </div> Tue, 17 Jan 2023 17:20:59 +0000 distro wide cflags are a bad idea anyway https://lwn.net/Articles/920238/ https://lwn.net/Articles/920238/ mwsealey <div class="FormattedComment"> I hear you just volunteered to audit every package in the repository and add a special toggle as to whether it needs -fomit-frame-pointers or no -fomit-frame-pointers.<br> </div> Tue, 17 Jan 2023 17:15:04 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920237/ https://lwn.net/Articles/920237/ quotemstr <div class="FormattedComment"> Huh? Why would io_uring requests be some kind of limited resource? Do we want to promote efficient alternatives to conventional system calls or not? It makes no sense to say "Hey, we have a more efficient way to communicate with the kernel" and at the same time, "You can't use this efficient interface, library author, due to some random arbitrary restrictions we imposed on ourselves ".<br> <p> In any case, the use of io_uring is not central to my proposal, so the point is irrelevant. That said, all authors of new kernel interfaces should be designing with io_uring support in mind.<br> </div> Tue, 17 Jan 2023 17:14:48 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920235/ https://lwn.net/Articles/920235/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; It might be easier, however, to just have threads make a system call (or post an io_uring request) to donate their stacks each time they finish an unwind.</span><br> <p> I think you'll have a hard time finding people who want to give up a limited resource (max io_uring per process) for stack delivery via their libc.<br> </div> Tue, 17 Jan 2023 16:55:55 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920198/ https://lwn.net/Articles/920198/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; <a href="https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html">https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html</a></span><br> <p> More or less, although Zijlstra doesn't seem to talk about making the unwinder pluggable. That, and what he calls a "gloriously ugly hack" I'd call a clean separation of concerns. :-)<br> <p> <span class="QuotedText">&gt; Would you anticipate the unwound stack could be placed directly into a map for the kernel to consume on it's next sample (and provide to the perf stream)?</span><br> <p> Yes, and userspace could point to that "map" as an rseq extension. It might be easier, however, to just have threads make a system call (or post an io_uring request) to donate their stacks each time they finish an unwind. (Although, come to think of it, threads could in principle batch stack submission.) 
You could implement an explicit system call and then the rseq thing (supporting both) if the performance of the system call (even batched) proved inadequate.<br> <p> <span class="QuotedText">&gt; In the meantime, I'll have to enable frame-pointers so I can get some work done.</span><br> <p> The point of this LWN article is that not everyone agrees that this is the right approach.<br> </div> Tue, 17 Jan 2023 06:02:49 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920197/ https://lwn.net/Articles/920197/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; &gt; Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses?</span><br> <p> This is surprisingly a lot more code that one might think. Particularly Golang where it is much more common to link without glibc. I do think they include frame-pointers though ;)<br> </div> Tue, 17 Jan 2023 05:49:23 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920195/ https://lwn.net/Articles/920195/ vwduder <div class="FormattedComment"> It sounds like you're describing Peter Zijlstra 's idea from 2017?<br> <p> <a href="https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html">https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html</a><br> <p> <span class="QuotedText">&gt; Why would it be any slower than whatever the kernel does?</span><br> <span class="QuotedText">&gt; Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses?</span><br> <p> The point was it's too slow if we have to copy stack pages via perf to the user-space capture. Obviously it's irrelevant if coordinating processes are unwinding themselves from a glorified signal handler.<br> <p> <span class="QuotedText">&gt; I'm talking about something a bit different from SIGPROF fired via setitimer(2). I'm talking about using a signal handler to *collect* the user segment of a stack the collection of which is triggered using exactly the same triggers we have for kernel-side stack collection today. It *is* system-wide profiling.</span><br> <p> Would you anticipate the unwound stack could be placed directly into a map for the kernel to consume on it's next sample (and provide to the perf stream)?<br> <p> <span class="QuotedText">&gt; Because it's not available in most configurations.</span><br> <span class="QuotedText">&gt;&gt; But when it is, it's exactly the right thing. That's why unwinding should be modular and implemented in user-space: flexibility.</span><br> <p> I agree.<br> <p> Somebody (clearly you have the expertise) needs to go implement it and then we need to wait for years for it to be available everywhere so we can finally rely on it.<br> <p> In the meantime, I'll have to enable frame-pointers so I can get some work done.<br> <p> <span class="QuotedText">&gt; The things that often trip up unwinding to the point of uselessness is JIT/FFI.</span><br> <span class="QuotedText">&gt;&gt; That's why we need managed runtime environments to participate explicitly in the unwinding process.</span><br> <p> I fully agree, and it's what I too am advocating for.<br> </div> Tue, 17 Jan 2023 05:46:21 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920193/ https://lwn.net/Articles/920193/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; Where is the urgency coming from? We've lived with the current situation this long. 
We can spend a year fixing it the right way.</span><br> <p> Because we've been waiting for this for a decade and nothing has materialized.<br> <p> Are you signing up to implement your design? Because if so, that sound great! We can disable frame-pointers when it's ready to ship. I'll even do you a solid and implement the Sysprof part.<br> </div> Tue, 17 Jan 2023 05:25:00 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920189/ https://lwn.net/Articles/920189/ ringerc <div class="FormattedComment"> Thanks. That's the sort of response it's good to see. I'm an uninvolved 3rd party, but I respect people who acknowledge poor communication choices and work to improve them. It helps improve the culture and community across the wider open source world, within and outside that specific community.<br> </div> Tue, 17 Jan 2023 01:28:06 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920185/ https://lwn.net/Articles/920185/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; If you mean to snapshot the stack </span><br> <p> Keep in mind that there's no "snapshot" involved. When a thread returns from kernel mode to user mode, its stack is snapshotted because that thread is running the unwind signal handler immediately upon returning to user mode and before resuming whatever it was doing before it entered the kernel. The stack is "snapshotted" implicitly in the same way that the stack above a call to sleep(10) is "snapshotted".<br> <p> <span class="QuotedText">&gt; and to unwind in userspace using something like libunwind, then it's simply to slow.</span><br> <p> Why would it be any slower than whatever the kernel does? Perhaps full asynchronous DWARF bytecode interpretation would be too slow, but ORC unwinding probably wouldn't be. Either way, you take a kernel problem and make it userspace's problem. Userspace can use every unwinding strategy that the kernel can use and a lot more. There's no downside.<br> <p> <span class="QuotedText">&gt; If you mean to create something that just uses SIGPROF+unwind in-process, well that isn't system-wide profiling.</span><br> <p> I'm talking about something a bit different from SIGPROF fired via setitimer(2). I'm talking about using a signal handler to *collect* the user segment of a stack the collection of which is triggered using exactly the same triggers we have for kernel-side stack collection today. It *is* system-wide profiling.<br> <p> <span class="QuotedText">&gt; The perf sample point happens within the kernel, and you can't take faults so everything has to be available at that moment. </span><br> <p> It's not the case that "everything" has to be available "at that moment". At the moment the perf event fires, we have to capture only the *kernel stack*. The *user stack* of any thread running in kernel mode is frozen until that thread leaves kernel mode and re-enters user mode, so we can defer the user mode stack collection until the thread returns to user space without loss of information!<br> <p> <span class="QuotedText">&gt; Again, I don't see how this would work. If you provide me some basic understanding of how this should work, I can give you some examples of how I can break it.</span><br> <p> When a perf event fires, we want "the" stack corresponding to that event. That stack always has two parts: the kernel stack and the user stack. (We always have a kernel stack because perf events always fire in the kernel, even if the kernel is just running an ISR.) 
I'm suggesting that we capture the kernel and user stacks separately and that we combine them later, in post-processing. User-space collects user stacks and can use whatever stack traversal approach it wants, e.g. an asynchronous DWARF unwinder, frame-pointer walking, or various runtime-specific approaches.<br> <p> Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses? User space can traverse frame pointers just as well as the kernel. LDR, B.LE, and MOV don't execute any faster in kernel mode than they do in user mode.<br> <p> <span class="QuotedText">&gt; Because it's not available in most configurations.</span><br> <p> But when it is, it's exactly the right thing. That's why unwinding should be modular and implemented in user-space: flexibility.<br> <p> <span class="QuotedText">&gt; The things that often trip up unwinding to the point of uselessness is JIT/FFI.</span><br> <p> That's why we need managed runtime environments to participate explicitly in the unwinding process.<br> <p> <p> </div> Mon, 16 Jan 2023 23:52:35 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920184/ https://lwn.net/Articles/920184/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; I've proposed something very similar to a number of people recently</span><br> <p> Great! Any links you can share?<br> <p> <span class="QuotedText">&gt; So until a scheme like is described here can be implemented, propagated to language runtimes and tooling, we need something else as a stop gap (e.g. frame-pointers).</span><br> <p> I'm worried about the moral hazard.<br> </div> Mon, 16 Jan 2023 23:39:47 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920181/ https://lwn.net/Articles/920181/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; That is not something that is likely to ship in the next 6 months to a year if you ask me.</span><br> <p> Where is the urgency coming from? We've lived with the current situation this long. We can spend a year fixing it the right way.<br> </div> Mon, 16 Jan 2023 23:23:39 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920180/ https://lwn.net/Articles/920180/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; One issue with this is that it does not address whole system profiling. This is all well and good when you want to profile one specific bit of code that can opt into profiling, but this is not really a luxury one has in many situations.</span><br> <p> Sure it does. If you put the unwind code in libc, every program that links against libc can unwind in userspace. Programs that haven't opted into userspace unwinding could presumably be detected and unwound like they are today. <br> <p> <span class="QuotedText">&gt; This sort of thing has to be universal for it to work, and that's not really the case if you introduce something applications must especially add and opt into today.</span><br> <p> It's libc that would opt into the mechanism on behalf of each application using that libc. Individual application authors wouldn't do anything.<br> </div> Mon, 16 Jan 2023 23:22:44 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920176/ https://lwn.net/Articles/920176/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; Why not let userspace do its own unwinding? 
Doing so moves concerns about containers, static binaries, eBPF program installation and so on from the kernel to userspace.</span><br> <p> Read your post below and replied there.<br> <p> In short, yes, what we all would want to do is to unwind a single time in user-space during scheduler transition. That is not something that is likely to ship in the next 6 months to a year if you ask me. And we very much need things working up until someone shows up to write that code and can:<br> <p> 1. Prove it works<br> 2. Get new syscalls into the kernel<br> 3. Get language runtimes, JIT, FFI trampolines, etc. to buy in<br> 4. Get profiler tooling to consume the different event types and cook the callgraph results<br> </div> Mon, 16 Jan 2023 23:07:01 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920175/ https://lwn.net/Articles/920175/ vwduder <div class="FormattedComment"> I've proposed something very similar to a number of people recently, but as you write, it is going to require a significant amount of work to plumb through the entire stack.<br> <p> Additionally, in my blog post, I mentioned that none of us actually _want_ frame-pointers; what we want is _works out of the box_.<br> <p> So until a scheme like the one described here can be implemented, propagated to language runtimes and tooling, we need something else as a stopgap (e.g. frame-pointers).<br> </div> Mon, 16 Jan 2023 23:03:11 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920173/ https://lwn.net/Articles/920173/ atnot <div class="FormattedComment"> One issue with this is that it does not address whole-system profiling. This is all well and good when you want to profile one specific bit of code that can opt into profiling, but this is not really a luxury one has in many situations.<br> <p> For example, a common scenario might involve noticing that you are running out of disk bandwidth on some device. You attach a probe to look for disk events taking a certain amount of time and get their stack traces. The events look normal, but the deeper frames of the stack trace help you discover they are actually called from a background job. This helps you discover that this new software that was recently set up has a poorly written batch job that repeatedly rereads and rewrites the same file, and that you sometimes hit a pathological case on.<br> <p> This isn't really a thing that can be discovered by instrumenting a single application. This sort of thing has to be universal for it to work, and that's not really the case if you introduce something applications must explicitly add and opt into today.<br> </div> Mon, 16 Jan 2023 23:00:35 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920169/ https://lwn.net/Articles/920169/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; Why not let userspace do its own unwinding? Doing so moves concerns about containers, static binaries, eBPF program installation and so on from the kernel to userspace.</span><br> <p> Can you explain how you would expect that to work?<br> <p> If you mean to snapshot the stack and to unwind in userspace using something like libunwind, then it's simply too slow. If you mean to create something that just uses SIGPROF+unwind in-process, well that isn't system-wide profiling.<br> <p> The perf sample point happens within the kernel, and you can't take faults so everything has to be available at that moment.
If you don't have enough information to unwind there, you can't get stacks that cross kernel and user-space or enough information to symbolize after the fact.<br> <p> Most of what Sysprof does, at least, is from user-space after the samples are captured.<br> <p> <span class="QuotedText">&gt; Does the kernel have magical performance superpowers? Whatever userspace unwinding strategy you choose, ISTM userspace can implement that strategy at least as efficiently as the kernel. Performance doesn't seem like a good reason to keep userspace unwinding out of userspace.</span><br> <p> Again, I don't see how this would work. If you provide me some basic understanding of how this should work, I can give you some examples of how I can break it.<br> <p> <span class="QuotedText">&gt; Speaking of performance: why is nobody talking about profiling by snapshotting the shadow call stack? For the case of native code, the SCS is *exactly* the right data structure: it's a dense list of frame pointers!</span><br> <p> Because it's not available in most configurations.<br> <p> <span class="QuotedText">&gt; The right way to deal with this issue is to log all mmap()s in the perf/ftrace ring buffer along with the build-ID tags of all mapped binaries. This way, postprocessing tools can query debug servers (by build-ID) and obtain the right PC-&gt;source mapping even if some system package has been updated or inode assignments changed.</span><br> <p> Sysprof does in fact log all of these so we can symbolize properly. However, not everything has a build-id. The things that often trip up unwinding to the point of uselessness are JIT/FFI.<br> <p> If you can't even handle unwinding libffi properly, the design is probably broken.<br> <p> <span class="QuotedText">&gt; Aren't you assuming that the dump of the kernel perf/ftrace ring buffer would have to be the same as the data file shared outside the machine? ISTM that a post-processing step could remove arbitrary sensitive information from traces before sharing.</span><br> <p> Sysprof already has options to tack on symbol decode at the end of a trace file so it can be shared across machines. But its symbolizer is an interface which has multiple implementations, so you could symbolize using something like debuginfod or a directory of binaries, etc.<br> </div> Mon, 16 Jan 2023 22:44:50 +0000 Placement of saved registers in a stack frame https://lwn.net/Articles/920165/ https://lwn.net/Articles/920165/ jreiser One of the issues identified is (in many cases) the use of <tt>+d8(%rsp)</tt> addressing without a frame pointer, versus <tt>-d32(%rbp)</tt> addressing with a frame pointer. <tt>+d8(%rsp)</tt> costs one byte for the 8-bit displacement plus one byte for <tt>s-i-b</tt> addressing mode to allow the stack pointer <tt>%rsp</tt> as a base register. <tt>-d32(%rbp)</tt> costs 4 bytes for the 32-bit displacement. <p>Why would using a frame pointer cause so many more 32-bit displacements? Because of the placement of saved registers in the stack frame. On x86_64, traditional entry to a subroutine which saves all 6 saved registers (with a frame pointer) looks like <pre>push %rbp; mov %rsp,%rbp push %r15; push %r14; push %r13; push %r12; push %rbx</pre> in which 40 bytes of the 128-byte range for 8-bit displacement beneath the frame pointer <tt>%rbp</tt> are consumed by saved registers, leaving only 88 bytes for programmer-defined values. This contrasts with the 128 bytes available for eight-bit positive displacements from <tt>%rsp</tt>.
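<p>For concreteness, a minimal, compiler-dependent sketch (assuming GCC or Clang on x86_64, built with <tt>-O2 -fno-omit-frame-pointer</tt>; the exact frame layout varies by compiler and flags) that prints where a block of locals lands relative to the frame pointer: <pre>
/* Prints the offsets of a local buffer relative to %rbp.
 * With the traditional prologue shown above, callee-saved registers
 * occupy the bytes just below %rbp, so locals placed further down can
 * fall outside the 8-bit displacement window and need 32-bit offsets.
 * Placement is entirely up to the compiler; this only illustrates it. */
#include &lt;stdio.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

__attribute__((noinline)) static void show_offsets(void)
{
    char locals[160];                                     /* big enough to spill past an imm8 window */
    uintptr_t fp = (uintptr_t)__builtin_frame_address(0); /* equals %rbp when frame pointers are kept */
    ptrdiff_t lo = (ptrdiff_t)((uintptr_t)locals - fp);
    ptrdiff_t hi = lo + (ptrdiff_t)sizeof locals - 1;
    printf("locals span %%rbp%+td .. %%rbp%+td\n", lo, hi);
}

int main(void) { show_offsets(); return 0; }
</pre>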
<p>Changing the entry sequence to <pre>push %r15; push %r14; push %r13; push %r12; push %rbx push %rbp; mov %rsp,%rbp</pre> would move the saved registers to the other side of the frame pointer <tt>%rbp</tt>, which would recoup the 40 bytes as long as there were at most 80 bytes of incoming on-stack actual arguments (10 or fewer pointers, etc.). The return address has a different position relative to the frame pointer, but it can be found by disassembling the entry code, looking for <tt>push %rbp</tt>. <p>As observed in the blog, nothing except intellectual inertia by the compiler <b>prevents</b> the use of <tt>+d8(%rsp)</tt> with a frame pointer, too. Mon, 16 Jan 2023 22:39:27 +0000
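<p>jreiser's caveat about the return address moving matters because frame-pointer unwinders hard-code today's layout: the caller's <tt>%rbp</tt> at offset 0 from the current frame pointer and the return address at offset 8. A minimal illustrative sketch of such a walker (not perf's actual code; it assumes x86_64, the conventional layout, and that every frame on the stack was built with <tt>-fno-omit-frame-pointer</tt>): <pre>
/* Walks the chain of saved frame pointers for the calling thread and
 * prints each return address. Frames compiled without frame pointers,
 * or with a reordered prologue, break the chain. */
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

static void walk_frame_pointers(void)
{
    uintptr_t *fp = __builtin_frame_address(0);   /* current %rbp */

    while (fp != NULL) {
        uintptr_t ret = fp[1];                    /* return address at 8(%rbp) */
        uintptr_t *next = (uintptr_t *)fp[0];     /* caller's saved %rbp at 0(%rbp) */

        if (ret == 0)
            break;
        printf("pc = %#lx\n", (unsigned long)ret);

        if (next &lt;= fp)                        /* stack grows down; caller frames sit higher */
            break;
        fp = next;
    }
}

__attribute__((noinline)) static void inner(void) { walk_frame_pointers(); }
__attribute__((noinline)) static void outer(void) { inner(); }

int main(void) { outer(); return 0; }
</pre> Build with <tt>gcc -O2 -fno-omit-frame-pointer</tt> to keep the chain intact. A real profiler does the equivalent walk over a sampled copy of another thread's registers and stack rather than its own, which is why the whole userspace stack, libraries included, has to be built with frame pointers for the chain to survive.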