LWN: Comments on "Fedora's tempest in a stack frame" https://lwn.net/Articles/919940/ This is a special feed containing comments posted to the individual LWN article titled "Fedora's tempest in a stack frame". en-us Sun, 12 Oct 2025 01:40:37 +0000 Sun, 12 Oct 2025 01:40:37 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Fedora's tempest in a stack frame https://lwn.net/Articles/924468/ https://lwn.net/Articles/924468/ nix <div class="FormattedComment"> <span class="QuotedText">&gt; If an RPM is upgraded, you lose access to the library mapped into memory from outside the process as both the inode and CRC will have changed on disk. </span><br> <p> If the binary is still running and ptraceable, you don't: you can open files in /proc/$pid/map_files/ and bingo (yes, that gives you the *whole file*, not just the mapping, including non-loaded sections). We use this in DTrace, and it just works, where Solaris had to use this horrific baroque scheme involving serial numbers nailed into ELF objects to detect a deletion/recreation of a binary.<br> </div> Sun, 26 Feb 2023 15:27:24 +0000 distro wide cflags are a bad idea anyway https://lwn.net/Articles/921547/ https://lwn.net/Articles/921547/ mathstuf <div class="FormattedComment"> That will just result in a whack-a-mole flurry of issues for people asking their favorite packages to turn it on/off depending on their preferences. A distro-wide with specific exclusions makes more sense IMO.<br> </div> Sun, 29 Jan 2023 12:53:58 +0000 distro wide cflags are a bad idea anyway https://lwn.net/Articles/921546/ https://lwn.net/Articles/921546/ rahulsundaram <div class="FormattedComment"> <span class="QuotedText">&gt; I'm not very familiar with Fedora's organisational structure (or much more familiar with that of $other_distro that I use) but couldn't that be a choice made in a package by package basis by the relevant package maintainers?</span><br> <p> Yes, Fedora already does that. For instance, Python excludes it based on the performance impact cited in the article. Defaults only apply when the maintainer hasn't changed it. It is not enforced for every package.<br> </div> Sun, 29 Jan 2023 12:52:38 +0000 distro wide cflags are a bad idea anyway https://lwn.net/Articles/921542/ https://lwn.net/Articles/921542/ sammythesnake <div class="FormattedComment"> I'm not very familiar with Fedora's organisational structure (or much more familiar with that of $other_distro that I use) but couldn't that be a choice made in a package by package basis by the relevant package maintainers?<br> </div> Sat, 28 Jan 2023 21:09:45 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/921538/ https://lwn.net/Articles/921538/ sammythesnake <div class="FormattedComment"> I'd be interested to know if the "unusually large function that happens to be extra hard hit by frame pointers" issue that affects cpython also applies to pypy.<br> <p> Does anyone happen to know? 
Did Fedora benchmark that, or is pypy enough of a corner case that it wasn't part of the benchmarking?<br> <p> If the performance hit for pypy is more typical than it is for cpython then maybe it matters less - those with more of a performance-weighted set of priorities might already largely be using pypy...?<br> </div> Sat, 28 Jan 2023 19:17:36 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/921281/ https://lwn.net/Articles/921281/ njs <div class="FormattedComment"> <span class="QuotedText">&gt; If you try to use frame pointers to profile a Python program, all you're going to get is a profile of the interpreter. </span><br> <p> perf does have a mechanism for JITs and interpreted languages to record their function calls in perf call stacks. The next python release will even have support built in: <a rel="nofollow" href="https://docs.python.org/3.12/howto/perf_profiling.html">https://docs.python.org/3.12/howto/perf_profiling.html</a><br> <p> It is pretty janky though, and if you could get all the pieces lined up to support your approach I think everyone would be very happy!<br> </div> Thu, 26 Jan 2023 06:57:53 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920758/ https://lwn.net/Articles/920758/ smitty_one_each <div class="FormattedComment"> Gentoo users are all: "Tee hee hee".<br> </div> Sun, 22 Jan 2023 21:30:14 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920588/ https://lwn.net/Articles/920588/ mgedmin <div class="FormattedComment"> As a somewhat technical user, I'll take the 1-5% performance hit if it enables me to debug issues like firefox + gnome-shell randomly deciding to collectively eat 120% CPU while the screen is locked. I've tried sysprof and the profile is gibberish without frame pointers.<br> <p> Sadly I use Ubuntu, so Fedora's decision won't help me directly. It does make it somewhat more tempting to switch distros.<br> </div> Fri, 20 Jan 2023 07:58:27 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920545/ https://lwn.net/Articles/920545/ dezgeg <div class="FormattedComment"> No, you do need to keep the frame pointer in RBP so that the call stack can be traversed by perf at any arbitrary sampling point.<br> </div> Thu, 19 Jan 2023 19:36:26 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920529/ https://lwn.net/Articles/920529/ irogers <div class="FormattedComment"> Just to advertise better LBR callgraphs with 'perf report --stitch-lbr' which is disabled by default. More context in the patch series:<br> <a href="https://lore.kernel.org/lkml/20200309174639.4594-1-kan.liang@linux.intel.com/">https://lore.kernel.org/lkml/20200309174639.4594-1-kan.li...</a><br> </div> Thu, 19 Jan 2023 17:25:56 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920414/ https://lwn.net/Articles/920414/ marcH <div class="FormattedComment"> <span class="QuotedText">&gt; But it seems extremely likely to facilitate specific performance fixes that will make a huge difference to users who are suffering from particular performance problems.</span><br> <p> Thank you. Who cares about a 2% hit on software that runs 0.001% of the time = the vast majority of software. Performance is ALL about bottlenecks and critical sections. 
Which you can't do anything about if you don't even know where they are.<br> <p> </div> Thu, 19 Jan 2023 04:50:31 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920408/ https://lwn.net/Articles/920408/ ncm <div class="FormattedComment"> Just because you push a frame-pointer value on the stack does not mean you need to reserve a register to keep that value in. The value to push can be computed on demand. There is no need to access stack variables relative to that value when the stack pointer is right there.<br> <p> So, the performance cost to Python is just lack of a trivial optimization in the compiler. That optimization is already implemented, and just not applied when the frame pointer is being pushed.<br> </div> Thu, 19 Jan 2023 02:14:39 +0000 Placement of saved registers in a stack frame https://lwn.net/Articles/920406/ https://lwn.net/Articles/920406/ ncm <div class="FormattedComment"> There is no suggestion to change the calling convention.<br> <p> The compiler is free to push the frame pointer and then ignore it in subsequent stack accesses and use the same addressing mode, relative to the stack pointer, as it does now with no frame pointer. It also does not need to pop the value it pushed, but just increment the stack pointer past it.<br> <p> </div> Thu, 19 Jan 2023 01:59:57 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920401/ https://lwn.net/Articles/920401/ mcatanzaro <div class="FormattedComment"> I think this is a misunderstanding. This isn't going to result in generalized performance improvements. It's a general pessimization. But it seems extremely likely to facilitate specific performance fixes that will make a huge difference to users who are suffering from particular performance problems. This should benefit users in practice to a much greater extent than it hurts.<br> <p> Hopefully. ;)<br> <p> Like Brendan said, "once you find a 500% perf win you have a different perspective about the &lt;1% cost." Well the cost may be a little higher than 1%, but point remains.<br> </div> Thu, 19 Jan 2023 00:15:22 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920398/ https://lwn.net/Articles/920398/ pizza <div class="FormattedComment"> <span class="QuotedText">&gt; But I seriously doubt being able to profile Fedora more easily will result in enough optimization patches to improve performance at least 2% across the board *all around the distribution*.</span><br> <p> I'm looking forward to finding out who's correct here.<br> <p> Speaking personally, being able to do system-wide (ie not just single-application) profiling is worth a slight overall performance hit. Because currently it is effectively impossible to do so (at least for anything non-trivial)<br> <p> </div> Wed, 18 Jan 2023 23:48:38 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920392/ https://lwn.net/Articles/920392/ intgr <div class="FormattedComment"> Being a foremost expert doesn't guarantee that they see the big picture though.<br> <p> He is probably right that the tradeoff makes sense for Netflix.<br> <p> But I seriously doubt being able to profile Fedora more easily will result in enough optimization patches to improve performance at least 2% across the board *all around the distribution*. 
Maybe some specific places do end up getting optimized thanks to this change, but it's almost guaranteed to be a net loss given the large amount of software shipped by the distro.<br> </div> Wed, 18 Jan 2023 23:06:44 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920385/ https://lwn.net/Articles/920385/ jlargentaye <div class="FormattedComment"> Interesting to see Brendan Gregg, IMHO the foremost expert in profiling, chiming in later in that conversation:<br> <p> <a href="https://pagure.io/fesco/issue/2817#comment-826805">https://pagure.io/fesco/issue/2817#comment-826805</a><br> </div> Wed, 18 Jan 2023 21:27:55 +0000 Placement of saved registers in a stack frame https://lwn.net/Articles/920294/ https://lwn.net/Articles/920294/ dezgeg <div class="FormattedComment"> Changing location of rbp on the stack would be an ABI break (for anyone trying to unwind the stack), yes. But the last sentence is (if I understood right) an independent optimization of having the compiler still generate local variable accesses via rsp-relative offsets when beneficial, instead of always through the frame pointer (when frame pointers are enabled, of course). <br> </div> Wed, 18 Jan 2023 13:54:13 +0000 Placement of saved registers in a stack frame https://lwn.net/Articles/920289/ https://lwn.net/Articles/920289/ jengelh <div class="FormattedComment"> <span class="QuotedText">&gt;nothing except intellectual inertia by the compiler prevents</span><br> <p> Have you considered ABI? This looks like a change of calling convention.<br> </div> Wed, 18 Jan 2023 10:22:56 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920253/ https://lwn.net/Articles/920253/ flussence <div class="FormattedComment"> There must be something I'm misunderstanding here; this is the first I've heard of the frame pointer compile option having any relevance to performance or debuggability since I last installed Gentoo on an i686 a decade ago. I thought it made no difference on amd64 (and I haven't personally noticed any loss of function in gdb or perf there) and that's why it's on by default in the first place.<br> </div> Tue, 17 Jan 2023 20:37:41 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920250/ https://lwn.net/Articles/920250/ Cyberax <div class="FormattedComment"> The problem here is that it has to be one function to utilize the computed jump optimizations for this switch statements. Splitting it into multiple functions results in a similar loss of performance.<br> </div> Tue, 17 Jan 2023 19:17:45 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920242/ https://lwn.net/Articles/920242/ mwsealey <div class="FormattedComment"> If the Python problem is because frame pointer usage causes stack spilling due to a controversially large switch statement, if register pressure was that high that performance is that fragile for the loss of a single register, then that function is a performance problem, not the use of the frame pointer. One can only boggle at the idea of hiding this kind of technical debt.<br> </div> Tue, 17 Jan 2023 18:04:00 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920239/ https://lwn.net/Articles/920239/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; This is surprisingly a lot more code that one might think</span><br> <p> The amount of code involved isn't relevant to the discussion. 
However much code it is, the kernel doesn't have magical powers that lets it run that code more efficiently than userspace can.<br> </div> Tue, 17 Jan 2023 17:20:59 +0000 distro wide cflags are a bad idea anyway https://lwn.net/Articles/920238/ https://lwn.net/Articles/920238/ mwsealey <div class="FormattedComment"> I hear you just volunteered to audit every package in the repository and add a special toggle as to whether it needs -fomit-frame-pointers or no -fomit-frame-pointers.<br> </div> Tue, 17 Jan 2023 17:15:04 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920237/ https://lwn.net/Articles/920237/ quotemstr <div class="FormattedComment"> Huh? Why would io_uring requests be some kind of limited resource? Do we want to promote efficient alternatives to conventional system calls or not? It makes no sense to say "Hey, we have a more efficient way to communicate with the kernel" and at the same time, "You can't use this efficient interface, library author, due to some random arbitrary restrictions we imposed on ourselves ".<br> <p> In any case, the use of io_uring is not central to my proposal, so the point is irrelevant. That said, all authors of new kernel interfaces should be designing with io_uring support in mind.<br> </div> Tue, 17 Jan 2023 17:14:48 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920235/ https://lwn.net/Articles/920235/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; It might be easier, however, to just have threads make a system call (or post an io_uring request) to donate their stacks each time they finish an unwind.</span><br> <p> I think you'll have a hard time finding people who want to give up a limited resource (max io_uring per process) for stack delivery via their libc.<br> </div> Tue, 17 Jan 2023 16:55:55 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920198/ https://lwn.net/Articles/920198/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; <a href="https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html">https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html</a></span><br> <p> More or less, although Zijlstra doesn't seem to talk about making the unwinder pluggable. That, and what he calls a "gloriously ugly hack" I'd call a clean separation of concerns. :-)<br> <p> <span class="QuotedText">&gt; Would you anticipate the unwound stack could be placed directly into a map for the kernel to consume on it's next sample (and provide to the perf stream)?</span><br> <p> Yes, and userspace could point to that "map" as an rseq extension. It might be easier, however, to just have threads make a system call (or post an io_uring request) to donate their stacks each time they finish an unwind. (Although, come to think of it, threads could in principle batch stack submission.) 
You could implement an explicit system call and then the rseq thing (supporting both) if the performance of the system call (even batched) proved inadequate.<br> <p> <span class="QuotedText">&gt; In the meantime, I'll have to enable frame-pointers so I can get some work done.</span><br> <p> The point of this LWN article is that not everyone agrees that this is the right approach.<br> </div> Tue, 17 Jan 2023 06:02:49 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920197/ https://lwn.net/Articles/920197/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; &gt; Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses?</span><br> <p> This is surprisingly a lot more code that one might think. Particularly Golang where it is much more common to link without glibc. I do think they include frame-pointers though ;)<br> </div> Tue, 17 Jan 2023 05:49:23 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920195/ https://lwn.net/Articles/920195/ vwduder <div class="FormattedComment"> It sounds like you're describing Peter Zijlstra 's idea from 2017?<br> <p> <a href="https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html">https://lkml.iu.edu/hypermail/linux/kernel/1707.1/03003.html</a><br> <p> <span class="QuotedText">&gt; Why would it be any slower than whatever the kernel does?</span><br> <span class="QuotedText">&gt; Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses?</span><br> <p> The point was it's too slow if we have to copy stack pages via perf to the user-space capture. Obviously it's irrelevant if coordinating processes are unwinding themselves from a glorified signal handler.<br> <p> <span class="QuotedText">&gt; I'm talking about something a bit different from SIGPROF fired via setitimer(2). I'm talking about using a signal handler to *collect* the user segment of a stack the collection of which is triggered using exactly the same triggers we have for kernel-side stack collection today. It *is* system-wide profiling.</span><br> <p> Would you anticipate the unwound stack could be placed directly into a map for the kernel to consume on it's next sample (and provide to the perf stream)?<br> <p> <span class="QuotedText">&gt; Because it's not available in most configurations.</span><br> <span class="QuotedText">&gt;&gt; But when it is, it's exactly the right thing. That's why unwinding should be modular and implemented in user-space: flexibility.</span><br> <p> I agree.<br> <p> Somebody (clearly you have the expertise) needs to go implement it and then we need to wait for years for it to be available everywhere so we can finally rely on it.<br> <p> In the meantime, I'll have to enable frame-pointers so I can get some work done.<br> <p> <span class="QuotedText">&gt; The things that often trip up unwinding to the point of uselessness is JIT/FFI.</span><br> <span class="QuotedText">&gt;&gt; That's why we need managed runtime environments to participate explicitly in the unwinding process.</span><br> <p> I fully agree, and it's what I too am advocating for.<br> </div> Tue, 17 Jan 2023 05:46:21 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920193/ https://lwn.net/Articles/920193/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; Where is the urgency coming from? We've lived with the current situation this long. 
We can spend a year fixing it the right way.</span><br> <p> Because we've been waiting for this for a decade and nothing has materialized.<br> <p> Are you signing up to implement your design? Because if so, that sound great! We can disable frame-pointers when it's ready to ship. I'll even do you a solid and implement the Sysprof part.<br> </div> Tue, 17 Jan 2023 05:25:00 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920189/ https://lwn.net/Articles/920189/ ringerc <div class="FormattedComment"> Thanks. That's the sort of response it's good to see. I'm an uninvolved 3rd party, but I respect people who acknowledge poor communication choices and work to improve them. It helps improve the culture and community across the wider open source world, within and outside that specific community.<br> </div> Tue, 17 Jan 2023 01:28:06 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920185/ https://lwn.net/Articles/920185/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; If you mean to snapshot the stack </span><br> <p> Keep in mind that there's no "snapshot" involved. When a thread returns from kernel mode to user mode, its stack is snapshotted because that thread is running the unwind signal handler immediately upon returning to user mode and before resuming whatever it was doing before it entered the kernel. The stack is "snapshotted" implicitly in the same way that the stack above a call to sleep(10) is "snapshotted".<br> <p> <span class="QuotedText">&gt; and to unwind in userspace using something like libunwind, then it's simply to slow.</span><br> <p> Why would it be any slower than whatever the kernel does? Perhaps full asynchronous DWARF bytecode interpretation would be too slow, but ORC unwinding probably wouldn't be. Either way, you take a kernel problem and make it userspace's problem. Userspace can use every unwinding strategy that the kernel can use and a lot more. There's no downside.<br> <p> <span class="QuotedText">&gt; If you mean to create something that just uses SIGPROF+unwind in-process, well that isn't system-wide profiling.</span><br> <p> I'm talking about something a bit different from SIGPROF fired via setitimer(2). I'm talking about using a signal handler to *collect* the user segment of a stack the collection of which is triggered using exactly the same triggers we have for kernel-side stack collection today. It *is* system-wide profiling.<br> <p> <span class="QuotedText">&gt; The perf sample point happens within the kernel, and you can't take faults so everything has to be available at that moment. </span><br> <p> It's not the case that "everything" has to be available "at that moment". At the moment the perf event fires, we have to capture only the *kernel stack*. The *user stack* of any thread running in kernel mode is frozen until that thread leaves kernel mode and re-enters user mode, so we can defer the user mode stack collection until the thread returns to user space without loss of information!<br> <p> <span class="QuotedText">&gt; Again, I don't see how this would work. If you provide me some basic understanding of how this should work, I can give you some examples of how I can break it.</span><br> <p> When a perf event fires, we want "the" stack corresponding to that event. That stack always has two parts: the kernel stack and the user stack. (We always have a kernel stack because perf events always fire in the kernel, even if the kernel is just running an ISR.) 
I'm suggesting that we capture the kernel and user stacks separately and that we combine them later, in post-processing. User-space collects user stacks and can use whatever stack traversal approach it wants, e.g. an asynchronous DWARF unwinder, frame-pointer walking, or various runtime-specific approaches.<br> <p> Why would user space be any slower than the kernel at executing the same stack-capturing procedure the kernel uses? User space can traverse frame pointers just as well as the kernel. LDR, B.LE, and MOV don't execute any faster in kernel mode than they do in user mode.<br> <p> <span class="QuotedText">&gt; Because it's not available in most configurations.</span><br> <p> But when it is, it's exactly the right thing. That's why unwinding should be modular and implemented in user-space: flexibility.<br> <p> <span class="QuotedText">&gt; The things that often trip up unwinding to the point of uselessness is JIT/FFI.</span><br> <p> That's why we need managed runtime environments to participate explicitly in the unwinding process.<br> <p> <p> </div> Mon, 16 Jan 2023 23:52:35 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920184/ https://lwn.net/Articles/920184/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; I've proposed something very similar to a number of people recently</span><br> <p> Great! Any links you can share?<br> <p> <span class="QuotedText">&gt; So until a scheme like is described here can be implemented, propagated to language runtimes and tooling, we need something else as a stop gap (e.g. frame-pointers).</span><br> <p> I'm worried about the moral hazard.<br> </div> Mon, 16 Jan 2023 23:39:47 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920181/ https://lwn.net/Articles/920181/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; That is not something that is likely to ship in the next 6 months to a year if you ask me.</span><br> <p> Where is the urgency coming from? We've lived with the current situation this long. We can spend a year fixing it the right way.<br> </div> Mon, 16 Jan 2023 23:23:39 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920180/ https://lwn.net/Articles/920180/ quotemstr <div class="FormattedComment"> <span class="QuotedText">&gt; One issue with this is that it does not address whole system profiling. This is all well and good when you want to profile one specific bit of code that can opt into profiling, but this is not really a luxury one has in many situations.</span><br> <p> Sure it does. If you put the unwind code in libc, every program that links against libc can unwind in userspace. Programs that haven't opted into userspace unwinding could presumably be detected and unwound like they are today. <br> <p> <span class="QuotedText">&gt; This sort of thing has to be universal for it to work, and that's not really the case if you introduce something applications must especially add and opt into today.</span><br> <p> It's libc that would opt into the mechanism on behalf of each application using that libc. Individual application authors wouldn't do anything.<br> </div> Mon, 16 Jan 2023 23:22:44 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920176/ https://lwn.net/Articles/920176/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; Why not let userspace do its own unwinding? 
Doing so moves concerns about containers, static binaries, eBPF program installation and so on from the kernel to userspace.</span><br> <p> Read your post below and replied there.<br> <p> In short, yes, what we all would want to do is to unwind a single time in user-space during scheduler transition. That is not something that is likely to ship in the next 6 months to a year if you ask me. And we very much need things working up until someone shows up to write that code and can:<br> <p> 1. Prove it works<br> 2. Get new syscalls into the kernel<br> 3. Get language runtimes, JIT, FFI trampolines, etc. to buy in<br> 4. Get profiler tooling to consume the different event types and cook the callgraph results<br> </div> Mon, 16 Jan 2023 23:07:01 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920175/ https://lwn.net/Articles/920175/ vwduder <div class="FormattedComment"> I've proposed something very similar to a number of people recently, but as you write, it is going to require a significant amount of work to plumb through the entire stack.<br> <p> Additionally, in my blog post, I mentioned that none of us actually _want_ frame-pointers; what we want is _works out of the box_.<br> <p> So until a scheme like the one described here can be implemented, propagated to language runtimes and tooling, we need something else as a stopgap (e.g. frame-pointers).<br> </div> Mon, 16 Jan 2023 23:03:11 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920173/ https://lwn.net/Articles/920173/ atnot <div class="FormattedComment"> One issue with this is that it does not address whole-system profiling. This is all well and good when you want to profile one specific bit of code that can opt into profiling, but this is not really a luxury one has in many situations.<br> <p> For example, a common scenario might involve noticing that you are running out of disk bandwidth on some device. You attach a probe to look for disk events taking a certain amount of time and get their stack traces. The events look normal, but the deeper frames of the stack trace help you discover they are actually called from a background job. This helps you discover that this new software that was recently set up has a poorly written batch job that repeatedly rereads and rewrites the same file, and that you sometimes hit a pathological case on.<br> <p> This isn't really a thing that can be discovered by instrumenting a single application. This sort of thing has to be universal for it to work, and that's not really the case if you introduce something applications must explicitly add and opt into today.<br> </div> Mon, 16 Jan 2023 23:00:35 +0000 Fedora's tempest in a stack frame https://lwn.net/Articles/920169/ https://lwn.net/Articles/920169/ vwduder <div class="FormattedComment"> <span class="QuotedText">&gt; Why not let userspace do its own unwinding? Doing so moves concerns about containers, static binaries, eBPF program installation and so on from the kernel to userspace.</span><br> <p> Can you explain how you would expect that to work?<br> <p> If you mean to snapshot the stack and to unwind in userspace using something like libunwind, then it's simply too slow. If you mean to create something that just uses SIGPROF+unwind in-process, well that isn't system-wide profiling.<br> <p> The perf sample point happens within the kernel, and you can't take faults so everything has to be available at that moment.
If you don't have enough information to unwind there, you can't get stacks that cross kernel and user-space or enough information to symbolize after the fact.<br> <p> Most of what Sysprof does, at least, is from user-space after the samples are captured.<br> <p> <span class="QuotedText">&gt; Does the kernel have magical performance superpowers? Whatever userspace unwinding strategy you choose, ISTM userspace can implement that strategy at least as efficiently as the kernel. Performance doesn't seem like a good reason to keep userspace unwinding out of userspace.</span><br> <p> Again, I don't see how this would work. If you provide me some basic understanding of how this should work, I can give you some examples of how I can break it.<br> <p> <span class="QuotedText">&gt; Speaking of performance: why is nobody talking about profiling by snapshotting the shadow call stack? For the case of native code, the SCS is *exactly* the right data structure: it's a dense list of frame pointers!</span><br> <p> Because it's not available in most configurations.<br> <p> <span class="QuotedText">&gt; The right way to deal with this issue is to log all mmap()s in the perf/ftrace ring buffer along with the build-ID tags of all mapped binaries. This way, postprocessing tools can query debug servers (by build-ID) and obtain the right PC-&gt;source mapping even if some system package has been updated or inode assignments changed.</span><br> <p> Sysprof does in fact log all of these so we can symbolize properly. However, not everything has a build-id. The things that often trip up unwinding to the point of uselessness are JIT/FFI.<br> <p> If you can't even handle unwinding libffi properly, the design is probably broken.<br> <p> <span class="QuotedText">&gt; Aren't you assuming that the dump of the kernel perf/ftrace ring buffer would have to be the same as the data file shared outside the machine? ISTM that a post-processing step could remove arbitrary sensitive information from traces before sharing.</span><br> <p> Sysprof already has options to tack on symbol decode at the end of a trace file so it can be shared across machines. But its symbolizer is an interface which has multiple implementations, so you could symbolize using something like debuginfod or a directory of binaries, etc.<br> </div> Mon, 16 Jan 2023 22:44:50 +0000 Placement of saved registers in a stack frame https://lwn.net/Articles/920165/ https://lwn.net/Articles/920165/ jreiser One of the issues identified is (in many cases) the use of <tt>+d8(%rsp)</tt> addressing without a frame pointer, versus <tt>-d32(%rbp)</tt> addressing with a frame pointer. <tt>+d8(%rsp)</tt> costs one byte for the 8-bit displacement plus one byte for <tt>s-i-b</tt> addressing mode to allow the stack pointer <tt>%rsp</tt> as a base register. <tt>-d32(%rbp)</tt> costs 4 bytes for the 32-bit displacement. <p>Why would using a frame pointer cause so many more 32-bit displacements? Because of the placement of saved registers in the stack frame. On x86_64, traditional entry to a subroutine which saves all 6 saved registers (with a frame pointer) looks like <pre>push %rbp; mov %rsp,%rbp push %r15; push %r14; push %r13; push %r12; push %rbx</pre> in which 40 bytes of the 128-byte range for 8-bit displacement beneath the frame pointer <tt>%rbp</tt> are consumed by saved registers, leaving only 88 bytes for programmer-defined values. This contrasts with the 128 bytes available for eight-bit positive displacements from <tt>%rsp</tt>.
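<p>For concreteness, a minimal, compiler-dependent sketch (assuming GCC or Clang on x86_64, built with <tt>-O2 -fno-omit-frame-pointer</tt>; the exact frame layout varies by compiler and flags) that prints where a block of locals lands relative to the frame pointer: <pre>
/* Prints the offsets of a local buffer relative to %rbp.
 * With the traditional prologue shown above, callee-saved registers
 * occupy the bytes just below %rbp, so locals placed further down can
 * fall outside the 8-bit displacement window and need 32-bit offsets.
 * Placement is entirely up to the compiler; this only illustrates it. */
#include &lt;stdio.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

__attribute__((noinline)) static void show_offsets(void)
{
    char locals[160];                                     /* big enough to spill past an imm8 window */
    uintptr_t fp = (uintptr_t)__builtin_frame_address(0); /* equals %rbp when frame pointers are kept */
    ptrdiff_t lo = (ptrdiff_t)((uintptr_t)locals - fp);
    ptrdiff_t hi = lo + (ptrdiff_t)sizeof locals - 1;
    printf("locals span %%rbp%+td .. %%rbp%+td\n", lo, hi);
}

int main(void) { show_offsets(); return 0; }
</pre>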
<p>Changing the entry sequence to <pre>push %r15; push %r14; push %r13; push %r12; push %rbx push %rbp; mov %rsp,%rbp</pre> would move the saved registers to the other side of the frame pointer <tt>%rbp</tt>, which would recoup the 40 bytes as long as there were at most 80 bytes of incoming on-stack actual arguments (10 or fewer pointers, etc.). The return address has a different position relative to the frame pointer, but it can be found by disassembling the entry code, looking for <tt>push %rbp</tt>. <p>As observed in the blog, nothing except intellectual inertia by the compiler <b>prevents</b> the use of <tt>+d8(%rsp)</tt> with a frame pointer, too. Mon, 16 Jan 2023 22:39:27 +0000
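<p>jreiser's caveat about the return address moving matters because frame-pointer unwinders hard-code today's layout: the caller's <tt>%rbp</tt> at offset 0 from the current frame pointer and the return address at offset 8. A minimal illustrative sketch of such a walker (not perf's actual code; it assumes x86_64, the conventional layout, and that every frame on the stack was built with <tt>-fno-omit-frame-pointer</tt>): <pre>
/* Walks the chain of saved frame pointers for the calling thread and
 * prints each return address. Frames compiled without frame pointers,
 * or with a reordered prologue, break the chain. */
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

static void walk_frame_pointers(void)
{
    uintptr_t *fp = __builtin_frame_address(0);   /* current %rbp */

    while (fp != NULL) {
        uintptr_t ret = fp[1];                    /* return address at 8(%rbp) */
        uintptr_t *next = (uintptr_t *)fp[0];     /* caller's saved %rbp at 0(%rbp) */

        if (ret == 0)
            break;
        printf("pc = %#lx\n", (unsigned long)ret);

        if (next &lt;= fp)                        /* stack grows down; caller frames sit higher */
            break;
        fp = next;
    }
}

__attribute__((noinline)) static void inner(void) { walk_frame_pointers(); }
__attribute__((noinline)) static void outer(void) { inner(); }

int main(void) { outer(); return 0; }
</pre> Build with <tt>gcc -O2 -fno-omit-frame-pointer</tt> to keep the chain intact. A real profiler does the equivalent walk over a sampled copy of another thread's registers and stack rather than its own, which is why the whole userspace stack, libraries included, has to be built with frame pointers for the chain to survive.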