Dynamically sizing the kernel stack
Every thread has its own kernel stack, he began. The size of each thread's kernel stack was 8KB for a long time, and it is still that way on some architectures, but it was increased to 16KB on x86 systems in 2014. That change, which was driven by stack overflows experienced with subsystems like KVM and FUSE, makes every thread in the system heavier.
Expanding the stack size to 16KB has not solved all of the problems, though; kernel code is using more stack space in many contexts, he said. I/O is becoming more complex, perf events are handled on the stack, compilers are more aggressively inlining code, and so on. Google has stuck with 8KB stacks internally for memory efficiency, but is finding that to be increasingly untenable, and will be moving to 16KB stacks. That, he said, is an expensive change, causing an increase in memory use measured in petabytes. There are applications that create millions of threads, each of which pays the cost for the larger kernel stack, but 99% of those threads never use more than 8KB of that space.
Thus, he proposed making the kernel stack size dynamic; each thread would
start with a 4KB stack, which would be increased in response to a page
fault should that space be exhausted. An initial implementation was posted
in March.
The proposed solution takes advantage of virtually mapped stacks, which make
it relatively easy to catch overflows. The virtual address range for a
full-sized stack is set up in the kernel page tables, but only one 4KB page
is actually mapped at first. The result is a significant memory savings,
since the kernel does not have to find nearly as much memory for kernel
stacks; tests have shown a 70-75% reduction in the memory used for stacks.
That, he said, was from a "simple boot test"; tests with real workloads
would likely show an even larger savings.
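The reserve-the-range, populate-on-demand idea is easy to picture with a user-space analogy. The sketch below is only that, an analogy using mmap() and mprotect() rather than the kernel's page-table machinery, and it is not code from the posted series: a 16KB range is reserved, only the topmost 4KB page is made accessible, and a SIGSEGV handler commits further pages as they are touched.

    /* User-space analogy of a dynamically sized stack: reserve 16KB of
     * address space, commit one 4KB page, and let a fault handler
     * commit the rest on demand. Error checking omitted for brevity. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define STACK_SIZE (16 * 1024)
    #define PAGE 4096

    static char *stack_base;    /* lowest address of the reserved range */

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        char *page = (char *)((unsigned long)si->si_addr & ~(PAGE - 1UL));

        (void)sig; (void)ctx;
        if (page >= stack_base && page < stack_base + STACK_SIZE) {
            /* "Grow" the stack by committing the faulting page. */
            mprotect(page, PAGE, PROT_READ | PROT_WRITE);
            return;
        }
        _exit(1);   /* a genuine overflow: nothing to do but give up */
    }

    int main(void)
    {
        struct sigaction sa = { .sa_sigaction = fault_handler,
                                .sa_flags = SA_SIGINFO };
        sigaction(SIGSEGV, &sa, NULL);

        /* Reserve the full range with no access permissions... */
        stack_base = mmap(NULL, STACK_SIZE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        /* ...then map only the topmost page, like the initial 4KB stack. */
        mprotect(stack_base + STACK_SIZE - PAGE, PAGE,
                 PROT_READ | PROT_WRITE);

        /* Touching the lower pages now faults them in one at a time. */
        memset(stack_base, 0, STACK_SIZE);
        printf("all %d bytes are usable\n", STACK_SIZE);
        return 0;
    }

In the kernel, the same effect comes from reserving the stack's virtual range while populating only one page, with the page-fault handler doing the equivalent of the mprotect() call above.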
There is an interesting challenge associated with page faults for stack access, though: page faults are also handled on the kernel stack, which has just run out of space. When a thread tries to access an unmapped page and causes a page fault, the fault handler will try to save the current processor state onto the kernel stack, which will cause a double fault. The x86 architecture does not allow handling double faults; code is simply supposed to abort and clean up when that happens. If the kernel tries, instead, to handle that fault and expand the stack, it is operating outside of the rules defined by the architecture, and that tends not to lead to good things.
Solutions to that problem seem to be expensive. One idea, suggested by Matthew Wilcox but also already present on Tatashin's slides, is to add an expand_stack() function that would be called by subsystems that know they will need more stack space. It would map the extra space ahead of its use, avoiding the double-fault situation. Michal Hocko responded that this solution seemed like a game of Whac-A-Mole, with developers trying to guess where the stack might overflow. But direct reclaim, which can call filesystem or I/O-related functions with deep stack use, can happen just about anywhere. If that causes an overflow, the system will panic.
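As a rough sketch of what such a hook could look like (the function below is invented for illustration and is not the interface that was discussed), a caller about to enter a stack-hungry path would touch the pages it expects to need while plenty of mapped stack remains, so that the resulting faults are taken in a safe context:

    /* Hypothetical sketch only: pre-fault extra stack pages before a
     * deep call chain, so that growth never has to happen from inside
     * the page-fault handler at the worst possible moment. */
    static void expand_stack(unsigned long extra_bytes)
    {
        unsigned long sp = current_stack_pointer;
        unsigned long bottom = (unsigned long)task_stack_page(current);
        unsigned long addr;

        /* Touch one byte per page between sp and sp - extra_bytes. */
        for (addr = sp - PAGE_SIZE;
             addr + extra_bytes >= sp && addr >= bottom;
             addr -= PAGE_SIZE)
            READ_ONCE(*(volatile char *)addr);
    }

A call like this at the top of direct reclaim is exactly the kind of manual annotation Hocko was skeptical about: somebody has to guess where it is needed and how much space to ask for.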
A second possible solution, Tatashin said, would be to take advantage of some of the kernel-hardening work to automatically grow the stack as needed. Specifically, he would like to use the STACKLEAK mechanism, which uses a GCC plugin to inject stack-size checks into kernel functions as they are compiled. That code could be enhanced to automatically grow the stack when usage passes a threshold. This solution adds almost no overhead to systems where STACKLEAK is already in use — but it is rather more expensive if STACKLEAK is not already enabled.
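Conceptually, the injected check might look something like the following; stack_mapped_bytes() and dynamic_stack_grow() are invented names standing in for whatever the real mechanism would provide, not code from the proposal.

    /* Illustration only: the plugin would insert a call like this at
     * the entry of any function whose frame exceeds the threshold. */
    asmlinkage void dynamic_stack_check(void)
    {
        unsigned long base = (unsigned long)task_stack_page(current);
        unsigned long used = base + THREAD_SIZE - current_stack_pointer;

        /* If usage is within a page of what is currently mapped,
         * populate another page before it is actually needed. */
        if (used + PAGE_SIZE > stack_mapped_bytes(current))
            dynamic_stack_grow(current);
    }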
Finally, a third option would be to limit dynamic stacks to systems that either do not save state to the stack on faults or that can handle double faults. Tatashin suggested that x86 systems with FRED support might qualify, and 64-bit Arm systems as well.
Time for the session was running short as Hocko said that he liked the second solution but wondered what the cost would actually be. Tatashin said that he has been working on reducing it; he has refactored the STACKLEAK code to be more generic, so that it can be used for this purpose without needing to include the hardening features. A stack-frame size can be set at build time, and the plugin will only insert checks for functions that exceed that size. David Hildenbrand said that this scheme could be thwarted by a long chain of calls to small functions; Hocko said that would make the feature less than fully reliable. Tatashin answered that there is probably at least one large function somewhere in any significant call chain, but Hocko said that is not necessarily the case with direct reclaim.
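Hildenbrand's concern is easy to illustrate: instrumentation keyed to frame size never fires if every frame is individually small, even though the frames still add up. A contrived example:

    /* Each call uses a frame well below any plausible threshold, so no
     * check is inserted -- yet a long enough chain of such calls can
     * still walk off the mapped part of the stack. */
    static int small_step(int depth)
    {
        char buf[64];

        buf[0] = (char)depth;
        if (depth <= 0)
            return buf[0];
        return small_step(depth - 1) + buf[0];
    }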
Steve Rostedt said that, perhaps, the frame-size parameter could be set to
zero, causing the check to be made at every function call; Tatashin answered
that, while he has not measured the overhead of the check, it would
certainly add up and be noticeable in that case. The final suggestion of
the session came from Hocko, who said that perhaps the ftrace hooks
could be used instead of the STACKLEAK infrastructure, but Rostedt said
that option would be too expensive to be worth considering.
Index entries for this article
Kernel: Kernel stack
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2024
Posted May 21, 2024 16:16 UTC (Tue) by willy (subscriber, #9762) [Link] (2 responses)
Had I been there, I would have said to Michal Hocko that reclaim seems like the perfect example of a place to expand_stack().
Posted May 21, 2024 17:08 UTC (Tue) by soleen (subscriber, #113769) [Link] (1 responses)
Posted May 21, 2024 17:28 UTC (Tue) by corbet (editor, #1) [Link]
So it seems I misinterpreted my notes, but that the origin of the suggestion was correct in the end? I certainly don't want to misattribute anybody's words!
Posted May 21, 2024 17:08 UTC (Tue) by josh (subscriber, #17465) [Link] (3 responses)
This seems like the ideal solution, for any target supporting it.
Another option would be to get official guidance from x86 vendors about whether they could validate the ability to handle double-faults. If it currently *works*, going from that to "it's guaranteed to work" is not an impossible ask.
Posted May 21, 2024 18:27 UTC (Tue) by epa (subscriber, #39769) [Link] (2 responses)
Posted May 21, 2024 18:49 UTC (Tue) by matthias (subscriber, #94967) [Link]
But it is not necessary to think that complicated. Faulting in a page just from swap takes ages (even with SSDs). Thus you want to yield to another thread. And the other thread might also have a page fault while you are still waiting for the first page. So even for the simplest of cases you really want the fault handler to be re-entrant.
Posted May 21, 2024 20:24 UTC (Tue) by kleptog (subscriber, #1183) [Link]
Posted May 22, 2024 19:55 UTC (Wed) by alkbyby (subscriber, #61687) [Link]
If those "common/simple" idle threads "simply" release their stacks while idle/sleeping, Google should able to afford much bigger stacks (for non-idle kernel tasks) and still save ram and possibly at relatively small complexity cost. I.e. right after task has been switched "out of", we really only need to remember what it was waiting for and place for user-space context (registers), and this is relatively small amount of stuff.
At second place at Google by count are threads sleeping in futex_wait (they tend to be all kind of worker threads waiting for work, with occasional sleepers on mutex), and those could also potentially be treated similarly.
I am writing this without any knowledge of complexities of x64 context switching. So maybe this is harder then alternatives.
Posted May 30, 2024 8:06 UTC (Thu) by dongmk (subscriber, #131668) [Link] (5 responses)
What kinds of applications need to create millions of threads, on a single machine? Can anyone share some more details about the specific scenario?
Posted May 30, 2024 8:29 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)
Pretty much any application trying to solve a big problem with recursion? Okay, recursion is often used when it shouldn't be, but it's a quick-n-dirty fix that's very popular ...
Cheers,
Wol
Posted May 30, 2024 18:30 UTC (Thu) by mrugiero (guest, #153040) [Link]
Posted May 30, 2024 17:26 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)
Posted May 31, 2024 10:54 UTC (Fri) by kleptog (subscriber, #1183) [Link] (1 responses)
But it's running on an interpreter, so can't really be compared with OS level threads in Linux where you're limited by what the hardware can support. If the CPU supported dynamically allocating stack frames when calling a function you'd also be able to get away with much smaller stacks.
Posted May 31, 2024 19:21 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
Golang tried that via segmented stacks. The problem is that switching between segments killed performance. So Go moved to relocatable stacks instead, which is easy for them because (by the language design) pointers can't point to objects on the stack.