Dynamically sizing the kernel stack
Every thread has its own kernel stack, he began. The size of each thread's kernel stack was 8KB for a long time, and it is still that way on some architectures, but it was increased to 16KB on x86 systems in 2014. That change, which was driven by stack overflows experienced with subsystems like KVM and FUSE, makes every thread in the system heavier.
Expanding the stack size to 16KB has not solved all of the problems, though; kernel code is using more stack space in many contexts, he said. I/O is becoming more complex, perf events are handled on the stack, compilers are more aggressively inlining code, and so on. Google has stuck with 8KB stacks internally for memory efficiency, but is finding that to be increasingly untenable, and will be moving to 16KB stacks. That, he said, is an expensive change, causing an increase in memory use measured in petabytes. There are applications that create millions of threads, each of which pays the cost for the larger kernel stack, but 99% of those threads never use more than 8KB of that space.
Thus, he proposed making the kernel stack size dynamic; each thread would
start with a 4KB stack, which would be increased in response to a page
fault should that space be exhausted. An initial implementation was posted
in March.
The proposed solution takes advantage of virtually mapped stacks, which make
it relatively easy to catch overflows. The virtual address range for a
full-sized stack is set up in the kernel page tables, but only one 4KB page
is actually mapped at first. The result is a significant memory savings,
since the kernel does not have to find nearly as much memory for kernel
stacks; tests have shown a 70-75% reduction in the memory used for stacks.
That, he said, was from a "simple boot test"; tests with real workloads
would likely show an even larger savings.
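The reserve-the-range, populate-on-demand idea is easy to picture with a user-space analogy. The sketch below is only that, an analogy using mmap() and mprotect() rather than the kernel's page-table machinery, and it is not code from the posted series: a 16KB range is reserved, only the topmost 4KB page is made accessible, and a SIGSEGV handler commits further pages as they are touched.

    /* User-space analogy of a dynamically sized stack: reserve 16KB of
     * address space, commit one 4KB page, and let a fault handler
     * commit the rest on demand. Error checking omitted for brevity. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define STACK_SIZE (16 * 1024)
    #define PAGE 4096

    static char *stack_base;    /* lowest address of the reserved range */

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        char *page = (char *)((unsigned long)si->si_addr & ~(PAGE - 1UL));

        (void)sig; (void)ctx;
        if (page >= stack_base && page < stack_base + STACK_SIZE) {
            /* "Grow" the stack by committing the faulting page. */
            mprotect(page, PAGE, PROT_READ | PROT_WRITE);
            return;
        }
        _exit(1);   /* a genuine overflow: nothing to do but give up */
    }

    int main(void)
    {
        struct sigaction sa = { .sa_sigaction = fault_handler,
                                .sa_flags = SA_SIGINFO };
        sigaction(SIGSEGV, &sa, NULL);

        /* Reserve the full range with no access permissions... */
        stack_base = mmap(NULL, STACK_SIZE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        /* ...then map only the topmost page, like the initial 4KB stack. */
        mprotect(stack_base + STACK_SIZE - PAGE, PAGE,
                 PROT_READ | PROT_WRITE);

        /* Touching the lower pages now faults them in one at a time. */
        memset(stack_base, 0, STACK_SIZE);
        printf("all %d bytes are usable\n", STACK_SIZE);
        return 0;
    }

In the kernel, the same effect comes from reserving the stack's virtual range while populating only one page, with the page-fault handler doing the equivalent of the mprotect() call above.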
There is an interesting challenge associated with page faults for stack access, though: page faults are also handled on the kernel stack, which has just run out of space. When a thread tries to access an unmapped page and causes a page fault, the fault handler will try to save the current processor state onto the kernel stack, which will cause a double fault. The x86 architecture does not allow handling double faults; code is simply supposed to abort and clean up when that happens. If the kernel tries, instead, to handle that fault and expand the stack, it is operating outside of the rules defined by the architecture, and that tends not to lead to good things.
Solutions to that problem seem to be expensive. One idea, suggested by Matthew Wilcox but also already present on Tatashin's slides, is to add an expand_stack() function that would be called by subsystems that know they will need more stack space. It would map the extra space ahead of its use, avoiding the double-fault situation. Michal Hocko responded that this solution seemed like a game of Whac-A-Mole, with developers trying to guess where the stack might overflow. But direct reclaim, which can call filesystem or I/O-related functions with deep stack use, can happen just about anywhere. If that causes an overflow, the system will panic.
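As a rough sketch of what such a hook could look like (the function below is invented for illustration and is not the interface that was discussed), a caller about to enter a stack-hungry path would touch the pages it expects to need while plenty of mapped stack remains, so that the resulting faults are taken in a safe context:

    /* Hypothetical sketch only: pre-fault extra stack pages before a
     * deep call chain, so that growth never has to happen from inside
     * the page-fault handler at the worst possible moment. */
    static void expand_stack(unsigned long extra_bytes)
    {
        unsigned long sp = current_stack_pointer;
        unsigned long bottom = (unsigned long)task_stack_page(current);
        unsigned long addr;

        /* Touch one byte per page between sp and sp - extra_bytes. */
        for (addr = sp - PAGE_SIZE;
             addr + extra_bytes >= sp && addr >= bottom;
             addr -= PAGE_SIZE)
            READ_ONCE(*(volatile char *)addr);
    }

A call like this at the top of direct reclaim is exactly the kind of manual annotation Hocko was skeptical about: somebody has to guess where it is needed and how much space to ask for.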
A second possible solution, Tatashin said, would be to take advantage of some of the kernel-hardening work to automatically grow the stack as needed. Specifically, he would like to use the STACKLEAK mechanism, which uses a GCC plugin to inject stack-size checks into kernel functions as they are compiled. That code could be enhanced to automatically grow the stack when usage passes a threshold. This solution adds almost no overhead to systems where STACKLEAK is already in use — but it is rather more expensive if STACKLEAK is not already enabled.
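Conceptually, the injected check might look something like the following; stack_mapped_bytes() and dynamic_stack_grow() are invented names standing in for whatever the real mechanism would provide, not code from the proposal.

    /* Illustration only: the plugin would insert a call like this at
     * the entry of any function whose frame exceeds the threshold. */
    asmlinkage void dynamic_stack_check(void)
    {
        unsigned long base = (unsigned long)task_stack_page(current);
        unsigned long used = base + THREAD_SIZE - current_stack_pointer;

        /* If usage is within a page of what is currently mapped,
         * populate another page before it is actually needed. */
        if (used + PAGE_SIZE > stack_mapped_bytes(current))
            dynamic_stack_grow(current);
    }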
Finally, a third option would be to limit dynamic stacks to systems that either do not save state to the stack on faults or that can handle double faults. Tatashin suggested that x86 systems with FRED support might qualify, and 64-bit Arm systems as well.
Time for the session was running short as Hocko said that he liked the second solution but wondered what the cost would actually be. Tatashin said that he has been working on reducing it; he has refactored the STACKLEAK code to be more generic, so that it can be used for this purpose without needing to include the hardening features. A stack-frame size can be set at build time, and the plugin will only insert checks for functions that exceed that size. David Hildenbrand said that this scheme could be thwarted by a long chain of calls to small functions; Hocko said that would make the feature less than fully reliable. Tatashin answered that there is probably at least one large function somewhere in any significant call chain, but Hocko said that is not necessarily the case with direct reclaim.
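Hildenbrand's concern is easy to illustrate: instrumentation keyed to frame size never fires if every frame is individually small, even though the frames still add up. A contrived example:

    /* Each call uses a frame well below any plausible threshold, so no
     * check is inserted -- yet a long enough chain of such calls can
     * still walk off the mapped part of the stack. */
    static int small_step(int depth)
    {
        char buf[64];

        buf[0] = (char)depth;
        if (depth <= 0)
            return buf[0];
        return small_step(depth - 1) + buf[0];
    }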
Steve Rostedt said that, perhaps, the frame-size parameter could be set to
zero, causing the check to be made at every function call; Tatashin answered
that, while he has not measured the overhead of the check, it would
certainly add up and be noticeable in that case. The final suggestion of
the session came from Hocko, who said that perhaps the ftrace hooks
could be used instead of the STACKLEAK infrastructure, but Rostedt said
that option would be too expensive to be worth considering.
Index entries for this article
Kernel: Kernel stack
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2024
Posted May 21, 2024 16:16 UTC (Tue) by willy (subscriber, #9762) [Link] (2 responses)
Had I been there, I would have said to Michal Hocko that reclaim seems like the perfect example of a place to expand_stack().
Posted May 21, 2024 17:08 UTC (Tue) by soleen (subscriber, #113769) [Link] (1 responses)
Posted May 21, 2024 17:28 UTC (Tue) by corbet (editor, #1) [Link]
So it seems I misinterpreted my notes, but that the origin of the suggestion was correct in the end? I certainly don't want to misattribute anybody's words!
Posted May 21, 2024 17:08 UTC (Tue) by josh (subscriber, #17465) [Link] (3 responses)
This seems like the ideal solution, for any target supporting it.
Another option would be to get official guidance from x86 vendors about whether they could validate the ability to handle double-faults. If it currently *works*, going from that to "it's guaranteed to work" is not an impossible ask.
Posted May 21, 2024 18:27 UTC (Tue) by epa (subscriber, #39769) [Link] (2 responses)
Posted May 21, 2024 18:49 UTC (Tue) by matthias (subscriber, #94967) [Link]
But it is not necessary to think that complicated. Faulting in a page just from swap takes ages (even with SSDs). Thus you want to yield to another thread. And the other thread might also have a page fault while you are still waiting for the first page. So even for the simplest of cases you really want the fault handler to be re-entrant.
Posted May 21, 2024 20:24 UTC (Tue) by kleptog (subscriber, #1183) [Link]
Posted May 22, 2024 19:55 UTC (Wed) by alkbyby (subscriber, #61687) [Link]
If those "common/simple" idle threads "simply" release their stacks while idle/sleeping, Google should able to afford much bigger stacks (for non-idle kernel tasks) and still save ram and possibly at relatively small complexity cost. I.e. right after task has been switched "out of", we really only need to remember what it was waiting for and place for user-space context (registers), and this is relatively small amount of stuff.
At second place at Google by count are threads sleeping in futex_wait (they tend to be all kind of worker threads waiting for work, with occasional sleepers on mutex), and those could also potentially be treated similarly.
I am writing this without any knowledge of complexities of x64 context switching. So maybe this is harder then alternatives.
Posted May 30, 2024 8:06 UTC (Thu) by dongmk (subscriber, #131668) [Link] (5 responses)
What kinds of applications need to create millions of threads, on a single machine? Can anyone share some more details about the specific scenario?
Posted May 30, 2024 8:29 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)
Pretty much any application trying to solve a big problem with recursion? Okay, recursion is often used when it shouldn't be, but it's a quick-n-dirty fix that's very popular ...
Cheers,
Wol
Posted May 30, 2024 18:30 UTC (Thu) by mrugiero (guest, #153040) [Link]
Posted May 30, 2024 17:26 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)
Posted May 31, 2024 10:54 UTC (Fri) by kleptog (subscriber, #1183) [Link] (1 responses)
But it's running on an interpreter, so can't really be compared with OS level threads in Linux where you're limited by what the hardware can support. If the CPU supported dynamically allocating stack frames when calling a function you'd also be able to get away with much smaller stacks.
Posted May 31, 2024 19:21 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
Golang tried that via segmented stacks. The problem is that switching between segments killed performance. So Go moved to relocatable stacks instead, which is easy for them because (by the language design) pointers can't point to objects on the stack.