SFrame-based stack unwinding for the kernel

By Jonathan Corbet
July 11, 2025

The kernel's perf events subsystem can produce high-quality profiles, with full function-call chains, of resource usage within the kernel itself. Developers, however, often would like to see profiles of the whole system in one integrated report with, for example, call-stack information that crosses the boundary between the kernel and user space. Support for unwinding user-space call stacks in the perf events subsystem is currently inefficient at best. A long-running effort to provide reliable, user-space call-stack unwinding within the kernel, which will improve that situation considerably, appears to be reaching fruition.

A process's call stack (normally) contains all of the information that is needed to recreate the chain of function calls that led to a given point in its execution. Each call pushes a frame onto the stack; that frame contains, among other things, the return address for the call. The problem is that exactly where that information lives on the stack is not always clear. Functions can (and do) put other information there, so there may be an arbitrary distance between the address in the stack pointer register and the base of the current call frame at any given time. That makes it hard for the kernel (or any program) to reliably work through the call chain on the stack.

One solution to the problem is to use frame pointers — dedicating a CPU register to always point to the base of the current call frame. The frame pointer will be saved to the call stack at a known offset for each function call, so frame pointers will reliably make the structure of the call stack clear. Unfortunately, frame pointers hurt performance; they occupy a CPU register, and must be saved and restored on each function call. As a result, building software (in both kernel and user space) without frame pointers is a common practice; indeed, it is the usual case.
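To make the frame-pointer scheme concrete, here is a minimal sketch in C that walks a simulated stack. It assumes the common x86-64-style layout, where the word at the frame pointer holds the caller's saved frame pointer and the next word holds the return address. The name `fp_unwind`, the `FP_END` sentinel, and the use of array indices in place of addresses are all illustrative assumptions, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: stack[fp] holds the caller's saved frame pointer and
 * stack[fp + 1] holds the return address. Indices stand in for addresses;
 * higher indices are closer to the base of the stack. A saved frame pointer
 * of FP_END marks the outermost frame.
 */
#define FP_END SIZE_MAX

/*
 * Walk the frame-pointer chain starting at fp, storing up to max return
 * addresses in out. Returns the number of frames found.
 */
size_t fp_unwind(const uintptr_t *stack, size_t stack_words, size_t fp,
                 uintptr_t *out, size_t max)
{
    size_t n = 0;

    assert(stack != NULL && out != NULL);

    while (fp != FP_END && n < max) {
        size_t next;

        /* Both words of the frame record must lie within the stack. */
        if (fp + 1 >= stack_words)
            break;

        out[n++] = stack[fp + 1];   /* return address */
        next = (size_t)stack[fp];   /* caller's frame pointer */

        /* A sane chain moves strictly toward the stack base. */
        if (next != FP_END && next <= fp)
            break;
        fp = next;
    }
    return n;
}
```

The bounds and direction checks matter because an unwinder cannot trust what it finds on the stack; code built without frame pointers will leave arbitrary values in that register.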

Without frame pointers, the kernel cannot unwind the user-space call stack in the perf events subsystem. The best that can be done is to copy the user-space stack (through the kernel) to the application receiving the performance data, in the hope that the unwinding can be done in user space. Given the resolution at which performance data can be collected, the result is the copying of massive amounts of data, and a corresponding impact on the performance of the system.

Call-stack unwinding is not a new problem, of course. In user space, the solution has often involved debugging data generated in the DWARF format. DWARF data is voluminous and complicated to interpret, though; handling it properly requires implementing a special-purpose virtual machine. Trying to use DWARF in the kernel is a good way to lower one's popularity in that community.

So DWARF is not an answer to this problem. Some years ago, developers working on live patching (which also needs reliable stack unwinding) came up with a new data format called ORC; it is far simpler than DWARF and relatively compact. ORC has been a part of the kernel's build process ever since, but it has never been adapted to user space.

Enter SFrame

At least, ORC hadn't been adapted to user space until the SFrame project (which truly needs an amusing Lord-of-the-Rings-inspired name) was launched. SFrame is an ORC-inspired format that is designed to be as compact as possible. In an executable file that has been built with SFrame data, there is a special ELF section containing two tables that look somewhat like this:

The left-hand table (the "frame descriptor entries") contains one entry for each function; the table is sorted by virtual address. When the need comes to unwind a call stack, the first step is to take the current program counter and, using a binary search, find the frame descriptor entry that includes that address. Each entry contains, beyond the base address for the function, pointers to one or more "frame row entries". Each of those entries contains between one and 15 triplets with start and size fields describing a portion of the function's code, and an associated frame-base offset. The offset of the program counter from the base address is used to find the relevant triplet which, in turn, gives the offset from the current stack pointer to the base of the call frame. From there, the process can be repeated to work up the call stack, one frame at a time.
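The lookup described above can be sketched in C roughly as follows. The structures here are a simplified, hypothetical in-memory model built from the description in the text (a sorted table of per-function entries, each pointing to start/size/offset triplets); they are not the real SFrame on-disk encoding, and the names `find_fde` and `frame_base` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical model; NOT the actual SFrame encoding. */
struct row {
    uint32_t start;       /* offset of the covered code from the function base */
    uint32_t size;        /* number of bytes of code covered */
    int32_t  cfa_offset;  /* offset from the stack pointer to the frame base */
};

struct fde {
    uint64_t base;            /* start address of the function */
    uint32_t size;            /* size of the function in bytes */
    const struct row *rows;   /* triplets describing this function */
    size_t nrows;
};

/* Binary search for the entry covering pc; fdes must be sorted by base. */
const struct fde *find_fde(const struct fde *fdes, size_t n, uint64_t pc)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pc < fdes[mid].base)
            hi = mid;
        else if (pc - fdes[mid].base >= fdes[mid].size)
            lo = mid + 1;
        else
            return &fdes[mid];
    }
    return NULL;
}

/*
 * Compute the frame base for the given program counter and stack pointer.
 * Returns 0 and stores the result in *base_out on success, or -1 if no
 * entry covers pc.
 */
int frame_base(const struct fde *fdes, size_t n, uint64_t pc, uint64_t sp,
               uint64_t *base_out)
{
    const struct fde *f = find_fde(fdes, n, pc);
    uint64_t off;
    size_t i;

    if (!f)
        return -1;

    off = pc - f->base;
    for (i = 0; i < f->nrows; i++) {
        const struct row *r = &f->rows[i];

        if (off >= r->start && off - r->start < r->size) {
            *base_out = sp + (uint64_t)(int64_t)r->cfa_offset;
            return 0;
        }
    }
    return -1;
}
```

A full unwinder would read the return address relative to the computed frame base, then repeat the lookup with that address as the new program counter.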

The GNU toolchain has the ability to build executables with SFrame data, though it seems that some of that support is still stabilizing. SFrame capability is evidently coming to LLVM soon as well. Thus far, though, the kernel is unable to make use of that data to generate user-space call traces.

See this article for more information about SFrame. There is a specification for the v2 SFrame format available; this is the format that the GNU toolchain currently supports. Version 3 of the SFrame format is under development, seemingly with the goal of deployment before SFrame is widely used. The new version adds some flexibility for finding the location of the top frame, eliminates unaligned data fields (which are explicitly used by the current format to reduce its memory requirements), and adds support for signal frames and the s390x architecture, among other things.

Using SFrame in the kernel

Back in January Josh Poimboeuf posted a patch series adding support for SFrame for user-space call-stack unwinding in the perf events subsystem. He has since moved on to other projects; Steve Rostedt has picked up the work and is trying to push it through to completion. The original series is now broken into three core chunks, each of which addresses a part of the problem.

The SFrame data, being stored in an ELF section, can be mapped into a process's address space along with the rest of the executable. Accessing that data from the kernel, though, is subject to all of the constraints that apply to user-space memory. Notably, this data may be mapped into the address space, but it might not be resident in RAM, so the kernel must be prepared to take page faults when accessing it.

That, in turn, means that the kernel can only access SFrame data when it is running in process context. Data from the system's performance monitoring unit, though, tends to be delivered via interrupts. So the code that handles those interrupts, and which generates the kernel portion of the stack trace, cannot do the user-space unwinding. That work has to be deferred until a safe time, which is usually just before the kernel returns to user space.

So the first patch series adds the infrastructure for this deferred unwinding. When the need to unwind a user-space stack is detected, a work item is set up and attached using the task_work mechanism; the kernel will then ensure that this work is done at a time when user-space memory access is safe. Support for deferred, in-kernel stack unwinding on systems where frame pointers are in use is also added as part of this series.
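The deferral pattern can be illustrated with a small user-space simulation in C. This is only a sketch of the idea: `struct task`, `request_user_unwind()`, and `return_to_user()` are hypothetical names, and none of this is the kernel's actual task_work interface.

```c
#include <stdbool.h>

/* Hypothetical per-task state for a simulated deferred unwind. */
struct task {
    bool unwind_pending;    /* a deferred unwind has been queued */
    unsigned requests;      /* samples taken since the last unwind */
    unsigned unwinds_done;  /* number of user-space unwinds performed */
};

/*
 * Called from (simulated) interrupt context, where user memory must not be
 * touched: just note that an unwind is wanted. Repeated requests collapse
 * into a single pending work item.
 */
void request_user_unwind(struct task *t)
{
    t->requests++;
    t->unwind_pending = true;
}

/*
 * Called on the (simulated) return-to-user path, where page faults are
 * acceptable. Performs at most one unwind and returns the number of
 * requests that it satisfied.
 */
unsigned return_to_user(struct task *t)
{
    unsigned served = 0;

    if (t->unwind_pending) {
        /* A real implementation would walk the user-space stack here. */
        served = t->requests;
        t->unwinds_done++;
        t->requests = 0;
        t->unwind_pending = false;
    }
    return served;
}
```

In this model, three samples taken during one trip through the kernel result in a single unwind when `return_to_user()` runs.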

The second series teaches the perf events subsystem to use this new infrastructure. The resulting interface for user space is somewhat interesting. A call into the kernel can generate a large number of performance events, each of which may have a stack trace associated with it. But the user-space portion of that trace can only be created just before the return to user space. So the perf events subsystem will report any number of kernel-side traces, followed by a single user-space trace at the end. Since the user-space side of the call chain does not change while this is happening, a single trace is all that is needed. User space must then put the pieces back together to obtain the full trace. The last patch in the series updates the perf tool to perform this merging and output unified call chains.
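The merging step that user space must perform might look something like the following C sketch. The `struct chain` record, the `MAX_DEPTH` limit, and the function names are assumptions made for illustration; the real perf record format and the perf tool's merging logic are more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEPTH 16  /* arbitrary limit for this sketch */

/* A call chain, innermost frame first. */
struct chain {
    uint64_t ip[MAX_DEPTH];
    size_t depth;
};

/*
 * Build a full chain in *out: the kernel-side frames followed by the
 * user-side frames, truncated at MAX_DEPTH. out may alias kernel.
 * Returns the resulting depth.
 */
size_t stitch(struct chain *out, const struct chain *kernel,
              const struct chain *user)
{
    size_t n = kernel->depth;
    size_t room, take;

    assert(n <= MAX_DEPTH && user->depth <= MAX_DEPTH);

    *out = *kernel;
    room = MAX_DEPTH - n;
    take = user->depth < room ? user->depth : room;
    memcpy(&out->ip[n], user->ip, take * sizeof(out->ip[0]));
    out->depth = n + take;
    return out->depth;
}

/*
 * Complete every buffered kernel-side sample with the single user-side
 * trace that arrived afterward.
 */
void stitch_all(struct chain *samples, size_t nsamples,
                const struct chain *user)
{
    size_t i;

    for (i = 0; i < nsamples; i++)
        stitch(&samples[i], &samples[i], user);
}
```

Because the user-space side of the stack does not change during the system call, one user-side chain can complete any number of buffered kernel-side samples.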

Finally, the third series adds SFrame support to all of this machinery. When an executable is loaded, any SFrame sections are mapped as well. Within the kernel, a maple tree is used to track the SFrame sections associated with each text range in the executable. The kernel does not, however, track the SFrame sections associated with shared libraries, so the series contains a new prctl() operation so that the C library can provide that information explicitly. That part of the API is likely to change (the relevant patch suggests that a separate system call should be added) before the series is merged.

With the SFrame information at hand, using it to generate stack traces is a relatively straightforward job; the kernel just has to use the SFrame tables to work through the frames on the stack. The result should be fast and reliable, with a minimal impact on the performance of the workload being measured.

The sum total of this work is quite a bit of new infrastructure added to the core kernel, including code that runs in some of the most constrained contexts. So the chances are good that there are still some minor problems to work out. The bulk of the hard work would appear to be done, though. It may take another development cycle or three, but the multi-year project to get SFrame support into the kernel would appear to be reaching its conclusion.

SFrame is more accurate too

Posted Jul 12, 2025 18:35 UTC (Sat) by DemiMarie (subscriber, #164188) [Link]

Frame pointers can miss leaf functions. SFrame will not do that.

Co-routines?

Posted Jul 12, 2025 23:04 UTC (Sat) by jreiser (subscriber, #11027) [Link] (3 responses)

I do not see any support for co-routines: a RETURN is replaced by a co-CALL of some other routine in the nest, and is itself the destination of a co-CALL. Incremental producer-consumer bit packing can be expressed as a nest of two co-routines; "the stack" remembers two "leaf" activations, whose co-RETURN (co-CALL) points each vary.

Co-routines?

Posted Jul 14, 2025 7:44 UTC (Mon) by taladar (subscriber, #68407) [Link]

More generally anything using continuation passing style does not really play well with stack traces and won't work with this either without some special considerations.

Co-routines?

Posted Jul 14, 2025 14:22 UTC (Mon) by willy (subscriber, #9762) [Link] (1 responses)

What would you want a "stack trace" to say for coroutines? They may have called each other millions of times by the time we try to "inspect the stack". Surely you wouldn't want to have millions of repetitions of "producer called consumer called producer called ...". Isn't it enough to know "thread is in producer which was called from event loop"?

Is there any prior work in this area we can steal^W benefit from?

Co-routines?

Posted Jul 14, 2025 21:40 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

> Isn't it enough to know "thread is in producer which was called from event loop"?

Not really. In particular, this makes debugging the coroutine-heavy code a pure nightmare. Debugger becomes worse than useless.

What I'd love is to have a weak link to the "caller frame". It's OK for it to be garbage-collected if the chain becomes too long.

Naming

Posted Jul 12, 2025 23:56 UTC (Sat) by SLi (subscriber, #53131) [Link] (7 responses)

> At least, ORC hadn't been adapted to user space until the SFrame project (which truly needs an amusing Lord-of-the-Rings-inspired name) was launched.

Indeed. What can we come up with that doesn't sound completely stupid?

Smeagol? Shelob (since it unwinds tangled stacks)? Framewise Gamgee? Framebeard? BackAgain?

Naming

Posted Jul 13, 2025 5:25 UTC (Sun) by magfr (subscriber, #16052) [Link] (2 responses)

Find a nice backronym from ENT or just bite the bullet and call it MAN.

Naming

Posted Jul 14, 2025 7:45 UTC (Mon) by taladar (subscriber, #68407) [Link] (1 responses)

I guess ENT would not work that well for a format more compact than DWARF.

Naming

Posted Jul 14, 2025 23:32 UTC (Mon) by himi (subscriber, #340) [Link]

Considering how (unintentionally?) ironic the name "DWARF" is, ENT seems perfectly reasonable for something designed to be compact and concise.

Naming

Posted Jul 13, 2025 9:12 UTC (Sun) by excors (subscriber, #95769) [Link] (2 responses)

As this is an attempt to spread ORC into the wider world, I think the obvious answer is the word for "to rule them all" in Black Speech (the common language of Orcs and other servants of Mordor, which was itself designed to solve the problem of mutually-unintelligible dialects between tribes and help them coordinate more efficiently, just as SFrame is doing): DURBATULÛK, or Debugging Userspace Reliably By A Table of Unwinding-Location Ûseful Knowledge.

Naming

Posted Jul 13, 2025 10:27 UTC (Sun) by SLi (subscriber, #53131) [Link]

I approve.

CONFIG_DURBATULUK=y

Naming

Posted Jul 13, 2025 17:16 UTC (Sun) by jemarch (subscriber, #116773) [Link]

It's spelled s-f-r-a-m-e but it's pronounced DURBATULUK

Naming

Posted Jul 13, 2025 21:27 UTC (Sun) by Sesse (subscriber, #53779) [Link]

SFrame obviously stands for Saruman-Frame, so we're already covered.

Some thoughts

Posted Jul 17, 2025 5:19 UTC (Thu) by irogers (subscriber, #121692) [Link]

SFrame trying to advance performance and profiling accuracy is a noble thing. The article is missing LBR stack traces, which have largely solved the issue in areas like AutoFDO.

Some other thoughts:
- Why a new encoding format? Why not have a subset of DWARF that matches the limited stack frame encoding supported by SFrame? DWARF is compressed and there may be potential for reuse of the ELF section in areas like C++ exception delivery.
- When perf is invoked to record system wide for a period of time like `perf record -e cycles -a sleep 10` any system call that hasn't transitioned to user code will only have the kernel side of the stack trace.
- The bpf helper bpf_get_stackid with BPF_F_USER_STACK will just return a placeholder value which may break or cause additional memory usage in BPF programs.
- JITs need to be taught to generate/register/remove SFrames or perhaps some kind of JIT interface like GDB's can be devised.

"Unfortunately, frame pointers hurt performance; they occupy a CPU register, and must be saved and restored on each function call." A register can be freed by holding the frame pointer in thread-local storage. Function call cost is addressed by inlining. x86 lacks any callee-save floating point registers which is likely a greater performance issue and unfortunately not addressed by the upcoming APX changes. A greater performance issue is that code built with -fno-omit-frame-pointer is less well optimized by C compilers, for example, missing tail call optimizations.

Last year the return of frame pointers was heralded:
https://www.brendangregg.com/blog/2024-03-17/the-return-o...

There have been calling conventions like ARM32's APCS that have very dense debug info but also would be reasonably easy to reverse engineer by walking the code forward until the function return is seen. Perhaps the code itself is the best debug format.

SFrame logo

Posted Aug 30, 2025 23:29 UTC (Sat) by ibhagat (subscriber, #133641) [Link]

> At least, ORC hadn't been adapted to user space until the SFrame project (which truly needs an amusing Lord-of-the-Rings-inspired name) was launched.

If not the name now, maybe we can make do with a cool logo for SFrame.

Little-endian format

Posted Sep 8, 2025 1:54 UTC (Mon) by maskray (subscriber, #112608) [Link]

The format includes endianness variations that complicate toolchain support. I think we should use a little-endian format universally, regardless of the target system's native endianness. On the big-endian z/Architecture, this is efficient; the LOAD REVERSED instructions used by the bswap versions in the following program don't even require extra instructions.

/* For each width, define an unaligned integer type and loads with and without a byte swap. */
#define WIDTH(x) \
typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
uint##x load_inc##x(uint##x *p) { return *p+1; } \
uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; } \
uint##x load_eq##x(uint##x *p) { return *p==3; } \
uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }

WIDTH(16)
WIDTH(32)
WIDTH(64)

My blog post had a TODO chapter about CTF Frame. I spent some time today reading the spec and added my analysis here: https://maskray.me/blog/2020-11-08-stack-unwinding#sframe