
Expanding the kernel stack

By Jonathan Corbet
May 29, 2014
Every process in the system occupies a certain amount of memory just by existing. Though it may seem small, one of the more important pieces of memory required for each process is a place to put the kernel stack. Since every process could conceivably be running in the kernel at the same time, each must have its own kernel stack area. If there are a lot of processes in the system, the space taken for kernel stacks can add up; the fact that the stack must be physically contiguous can stress the memory management subsystem as well. These concerns have always provided a strong motivation to keep the size of the kernel stack small.

For most of the history of Linux, on most architectures, the kernel stack has been put into an 8KB allocation — two physical pages. As recently as 2008 some developers were trying to shrink the stack to 4KB, but that effort eventually proved to be unrealistic. Modern kernels can end up creating surprisingly deep call chains that just do not fit into a 4KB stack.
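
To put those sizes in concrete terms, here is a small sketch of the arithmetic. It assumes 4KB base pages and that stacks are handed out as power-of-two groups of pages (an "order"); the function names and the figure of 10,000 tasks are made up for illustration and are not kernel interfaces.

```python
PAGE_SIZE = 4096  # assumption: 4KB base pages, as on x86-64


def stack_order(stack_bytes, page_size=PAGE_SIZE):
    """Return the smallest order such that 2**order pages hold the stack."""
    pages = -(-stack_bytes // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


def total_stack_bytes(nr_tasks, stack_bytes):
    """Memory consumed by kernel stacks if every task gets a full stack."""
    return nr_tasks * stack_bytes


if __name__ == "__main__":
    for size in (4096, 8192, 16384):
        mib = total_stack_bytes(10_000, size) / 2**20
        print(f"{size // 1024}KB stack: order {stack_order(size)}, "
              f"{mib:.1f} MiB for 10,000 tasks")
```

Under these assumptions a 4KB stack is a single page (order 0), an 8KB stack needs two contiguous pages (order 1), and a 16KB stack needs four (order 2).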

Increasingly, it seems, those call chains don't even fit into an 8KB stack on x86-64 systems. Recently, Minchan Kim tracked down a crash that turned out to be a stack overflow; he responded by proposing that it was time to double the stack size on x86-64 to 16KB. Such proposals have seen resistance before, and that happened this time around as well; Alan Cox argued that the solution is to be found elsewhere. But he seems to be nearly alone in that point of view.

Dave Chinner frequently has to deal with stack overflow problems, since they tend to show up in the XFS filesystem, which happens to be a bit more stack-hungry than others. He was quite supportive of this change:

8k stacks were never large enough to fit the linux IO architecture on x86-64, but nobody outside filesystem and IO developers has been willing to accept that argument as valid, despite regular stack overruns and filesystem having to add workaround after workaround to prevent stack overruns.

Linus was unconvinced at the outset, and he made it clear that work on reducing the kernel's stack footprint needs to continue. But Linus, too, seems to have come around to the idea that playing "whack-a-stack" is not going to be enough to solve the problem in a reliable way:

[S]o while I am basically planning on applying that patch, I _also_ want to make sure that we fix the problems we do see and not just paper them over. The 8kB stack has been somewhat restrictive and painful for a while, and I'm ok with admitting that it is just getting _too_ damn painful, but I don't want to just give up entirely when we have a known deep stack case.

Linus has also, unsurprisingly, made it clear that he is not interested in changing the stack size in the 3.15 kernel. But the 3.16 merge window can be expected to open in the near future; at that point, we may well see this patch go in as one of the first changes.

Expanding the kernel stack

Posted May 30, 2014 10:51 UTC (Fri) by HIGHGuY (subscriber, #62277) [Link]

Would it be possible to set the newly added 8K of the stack read-only and issue a warning with backtrace when a page fault happens for it?
Seems like a nice middle-ground between not crashing while also not ignoring the need to fix offenders.
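
A toy model of that policy, purely as a sketch: the real mechanism would be a page fault on the protected half of the stack, which is stood in for here by a plain depth check. The names `check_depth`, `SOFT_LIMIT`, and `StackOverflow` are hypothetical.

```python
import warnings

STACK_SIZE = 16 * 1024   # proposed stack size
SOFT_LIMIT = 8 * 1024    # old size; the extra 8KB is the "warn" zone


class StackOverflow(Exception):
    """Raised when usage exceeds the whole stack."""


def check_depth(used_bytes):
    """Classify a stack depth: fine, warn-but-continue, or fatal."""
    if used_bytes > STACK_SIZE:
        raise StackOverflow(f"{used_bytes} bytes exceeds {STACK_SIZE}")
    if used_bytes > SOFT_LIMIT:
        # Stand-in for "fault on the protected half, print a backtrace".
        warnings.warn(f"deep stack: {used_bytes} bytes in use")
        return "warned"
    return "ok"
```

A call chain that reaches 12KB would be reported but allowed to continue, while one that passes 16KB would still be fatal.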

Expanding the kernel stack

Posted May 30, 2014 13:36 UTC (Fri) by richard_weinberger (subscriber, #38938) [Link]

Not needed, we already have CONFIG_DEBUG_STACK_USAGE

Expanding the kernel stack

Posted May 30, 2014 17:11 UTC (Fri) by luto (subscriber, #39314) [Link]

CONFIG_DEBUG_STACK_USAGE isn't entirely reliable -- guard pages would be nicer. The downside is that we couldn't use huge pages.
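
For readers unfamiliar with this kind of accounting, one common approach is to start with a zero-filled stack and later scan from the far end for the first nonzero word. The sketch below models that idea in userspace with a `bytearray`; it is not the kernel's code, and the helper names are invented. It also shows one way such a scan can undercount (a frame that only ever holds zeros), which is offered as an illustration and not necessarily the unreliability meant above.

```python
STACK_SIZE = 8 * 1024


def new_stack(size=STACK_SIZE):
    """A zero-filled stack; index 0 is the lowest address (the far end)."""
    return bytearray(size)


def push_frame(stack, sp, data):
    """Stacks grow downward: move sp down and write the frame there."""
    sp -= len(data)
    if sp < 0:
        raise OverflowError("frame does not fit in the stack")
    stack[sp:sp + len(data)] = data
    return sp


def max_used(stack):
    """Scan up from the low end for the first nonzero byte."""
    for i, byte in enumerate(stack):
        if byte:
            return len(stack) - i
    return 0


if __name__ == "__main__":
    stack = new_stack()
    sp = push_frame(stack, len(stack), b"\x01" * 1000)
    print(max_used(stack))                    # depth seen by the scan
    sp = push_frame(stack, sp, bytes(500))    # a frame containing only zeros
    print(len(stack) - sp, max_used(stack))   # true depth vs. scanned depth
```

A guard page, by contrast, traps on the first access past the end regardless of what value is written.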

Expanding the kernel stack

Posted May 30, 2014 18:41 UTC (Fri) by PaXTeam (guest, #24616) [Link]

i have good news then, the very new GRKERNSEC_KSTACKOVERFLOW feature solves this without breaking up huge pages. throw in our 3-year-old move of thread_info off the kstack and we've got a winner! ;)

Expanding the kernel stack

Posted May 30, 2014 19:55 UTC (Fri) by luto (subscriber, #39314) [Link]

That thing uses vmalloc, right? Then it still prevents hugepages from being used for the kernel stack. This will cost a TLB slot or two. It's probably still a good tradeoff.

Expanding the kernel stack

Posted May 30, 2014 20:49 UTC (Fri) by PaXTeam (guest, #24616) [Link]

correct but note that the lowmem map will keep using 2M/1G pages. as for its TLB impact, it's a tradeoff between one 2MB (or 1GB) TLB entry and 1-4 4k entries. the net performance impact depends on how the TLBs for each page size are organized in a given CPU and the access pattern of the virtual memory mapped by those entries (e.g., if there're separate TLBs for each page size and the access pattern continuously exhausts one but not the other(s) then obviously freeing/taking up extra entries will have a net positive/negative impact). i think in practice it'll come down to how many accesses are made to lowmem vs. the vmalloc ranges in a workload.

another advantage is that vmalloc by its nature handles lowmem fragmentation much better which becomes even more important now that amd64 kstacks have become order-2 allocations. it'd also be easy to implement lazy page allocation for kstacks further reducing their memory consumption (let's face it, many kstacks will never actually make use of the whole 16k yet they'll always have to be fully allocated in the current scheme).
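
a rough sketch of the memory argument for lazy allocation, under stated assumptions: 4k pages, a 16k stack, at least one page always populated, and a made-up list of per-task peak depths. the function names are illustrative only.

```python
PAGE_SIZE = 4096
STACK_SIZE = 16 * 1024


def pages_touched(peak_bytes, page_size=PAGE_SIZE):
    """Pages a stack has actually written to, given its peak depth.

    Assumes peak_bytes <= STACK_SIZE and that one page is always present.
    """
    return max(1, -(-peak_bytes // page_size))


def resident_bytes(peaks, lazy):
    """Total stack memory for tasks with the given peak depths."""
    if not lazy:
        return len(peaks) * STACK_SIZE
    return sum(pages_touched(p) * PAGE_SIZE for p in peaks)


if __name__ == "__main__":
    peaks = [1500, 3000, 6000, 13000]  # hypothetical peak depths in bytes
    print(resident_bytes(peaks, lazy=False), resident_bytes(peaks, lazy=True))
```

with these invented numbers the eager scheme pins 64k while the lazy one pins 32k; the real ratio depends entirely on how deep typical call chains go.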

Expanding the kernel stack

Posted May 30, 2014 20:52 UTC (Fri) by luto (subscriber, #39314) [Link]

It could pay to send the patch upstream. If it's clean, I'll advocate for it.

Expanding the kernel stack

Posted Jun 27, 2016 0:50 UTC (Mon) by luto (subscriber, #39314) [Link]

Lazy page allocation is an interesting idea. I can think of two potential problems, though:

1. What do you do if lazy allocation fails?

2. Hitting a not-present page on the stack is likely to result in a double-fault. Intel's manual advises against trying to recover from a double-fault, and I'd like to know why before messing with it. Even if recovery were guaranteed to work, it could be interesting trying to allocate memory (which can block) in a double-fault handler.

The espfix64 code can double-fault and recover, but we ran that specific abuse of the CPU by some Intel and AMD engineers before doing it.

Expanding the kernel stack

Posted Jun 1, 2014 8:35 UTC (Sun) by richard_weinberger (subscriber, #38938) [Link]

Are you willing to send this feature as stand-alone patch upstream for review?
Would be awesome. :-)

Expanding the kernel stack

Posted Jun 1, 2014 12:38 UTC (Sun) by PaXTeam (guest, #24616) [Link]

it's spender's code, so you'd have to convince him to go through the process.

Expanding the kernel stack

Posted Jun 2, 2014 11:49 UTC (Mon) by dgm (subscriber, #49227) [Link]

Or someone else to adopt the code and champion it.

Expanding the kernel stack

Posted May 30, 2014 17:38 UTC (Fri) by iabervon (subscriber, #722) [Link]

It seems to me that this is another case of "direct reclaim is a problem"; all of the overflows seem to come from doing a reasonable amount of regular work plus a reasonable amount of direct reclaim on the same stack, and it seems to me that, instead of a process actually performing direct reclaim itself, it would be better to have the process just donate its timeslice to a separate direct reclaim thread, so that you don't need the stack space for direct reclaim per-process, but only per-cpu. Sure, priority inheritance is a mess to do in general, but this special case (code that can inherit can't donate) isn't nearly so bad.

Merged for 3.15

Posted May 30, 2014 19:45 UTC (Fri) by corbet (editor, #1) [Link]

Well, I was wrong about one thing...Linus just merged the 16K stack patch for 3.15.

Merged for 3.15

Posted May 30, 2014 19:56 UTC (Fri) by smitty_one_each (subscriber, #28989) [Link]

It happens. BTW, how's your health?

Merged for 3.15

Posted May 30, 2014 20:05 UTC (Fri) by boog (subscriber, #30882) [Link]

"Should anybody wish to help, a good starting place would be to not ask for details on my condition or treatment plans; I do not intend to discuss such things in public spaces."

Expanding the kernel stack

Posted May 30, 2014 21:35 UTC (Fri) by parcs (guest, #71985) [Link]

Why is the kernel stack a fixed size? Does it have to be that way?

Expanding the kernel stack

Posted May 30, 2014 21:49 UTC (Fri) by nevets (subscriber, #11875) [Link]

No, but it is the easiest and currently the most efficient way of implementing a kernel stack.

A dynamic stack would be incredibly complex to implement. What happens when you need more stack? You would need to make sure the task faults when it overflows, and then the fault handler would require a separate stack. Where do you allocate the next page from? Oh, and as the stack must be contiguous, the stack must be mapped into virtual memory. Currently, all kernel memory (except things like modules and stuff allocated with vmalloc) is mapped in huge page tables, and the kernel stack is just a pointer within that mapping.

The kernel stack doesn't have to be a fixed size, but the alternatives are much worse.

Expanding the kernel stack

Posted May 31, 2014 22:28 UTC (Sat) by sdalley (subscriber, #18550) [Link]

Would there be scope for keeping fixed sizes, but just giving the (presumably few) stack-hogging tasks the 16K size and leaving the others at 8K (or even 4K)?

Expanding the kernel stack

Posted May 31, 2014 23:06 UTC (Sat) by dlang (subscriber, #313) [Link]

The thing is that the problem isn't particular tasks. It's how many layers of indirection are involved

you have the task info, then the scheduler puts some info there, then you access memory and that does a page fault, finds that you need to interact with swap, makes a call to access the disk, which may need to go through raid, lvm, union mounts and then the filesystem needs its data.....

In the case that triggered this, you didn't have a complex storage layer, you had a compile option that ate some space and a lot of cases where gcc decided to put variable data on the stack, and did so particularly inefficiently as well.

But the task doing the work that triggered this mess wasn't doing anything special, so there's no way to size the stack per task.

Expanding the kernel stack

Posted Jun 1, 2014 10:55 UTC (Sun) by sdalley (subscriber, #18550) [Link]

Ah, thanks for that explanation.

Indeed, one can't afford to assume low stack usage if one doesn't know ahead of time whether a task might need lots of stack, even if very infrequently.

Expanding the kernel stack

Posted Jun 2, 2014 17:21 UTC (Mon) by jhoblitt (subscriber, #77733) [Link]

A stack doesn't /have/ to be contiguous. GoLang has demonstrated that segmented stacks are viable, if you're willing to break the C calling conventions.

Expanding the kernel stack

Posted Jun 4, 2014 18:57 UTC (Wed) by dtlin (✭ supporter ✭, #36537) [Link]

GCC -fsplit-stack works for C/C++ code in user-space now, at least on i386/x86_64 Linux. I'm not sure how hard it would be to get it working in the kernel, but at a glance it looks non-trivial: the implementation depends on -fuse-ld=gold, which doesn't work with the kernel; the generated code uses the __private_ss slot in the %gs/%fs TCB, which would presumably have to change to access something in task_struct *current instead; and __morestack uses mmap to allocate new stack segments, which won't work (and there isn't one obvious way to safely allocate memory in kernel context).

For Go, split stacks were problematic (performance-wise) and 1.3 will switch to reallocating contiguous stacks. Since it involves moving the stack (and thus changing the addresses of everything that's on the stack), it's probably not doable for C/C++. Well, I suppose you could dynamically allocate anything that escapes (like Go does), but that seems pretty invasive…
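
To make the "moving the stack" problem concrete, here is a toy model (not Go's runtime, and all names are invented). Each stack cell is tagged as either an integer or a pointer, and growing the stack means copying it to a region twice the size and adjusting every pointer that referred to the old region. The tagging is the key assumption: the copy only works because the model knows which cells are pointers.

```python
from dataclasses import dataclass


@dataclass
class Stack:
    base: int    # address of the lowest cell
    cells: list  # cell i lives at address base + i; each is (kind, value)

    @property
    def top(self):
        return self.base + len(self.cells)


def grow(old, new_base):
    """Copy `old` into a stack twice the size, relocating stack pointers."""
    size = len(old.cells)
    # The stack grows down, so live data stays at the high end and every
    # cell keeps its distance from the top of the stack.
    delta = (new_base + 2 * size) - old.top
    moved = []
    for kind, value in old.cells:
        if kind == "ptr" and old.base <= value < old.top:
            value += delta          # pointed into the old stack: relocate
        moved.append((kind, value))
    return Stack(new_base, [("int", 0)] * size + moved)


if __name__ == "__main__":
    old = Stack(1000, [("int", 7), ("ptr", 1000), ("ptr", 50)])
    print(grow(old, 5000))
```

Plain C gives the runtime no such tags, and pointers to stack variables can also be stored outside the stack, which is where the "dynamically allocate anything that escapes" idea comes in.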

Expanding the kernel stack

Posted Jun 4, 2014 20:41 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

FWIW, Rust also switched from split stacks to growing-on-demand stacks with guard pages.

Also for performance reasons.

Expanding the kernel stack

Posted Jun 7, 2014 5:38 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

Hmm. I wonder how the Mill architecture, which does not have contiguous stacks, will fit with this (see the security talk). In fact, it makes digging through stack rubble (all new memory is implicitly zeroed) and return-oriented programming impossible (parent frame pointers are stored in memory not accessible to the process).

Expanding the kernel stack

Posted Jun 7, 2014 10:52 UTC (Sat) by PaXTeam (guest, #24616) [Link]

in practice, the very first gadget usually executes a stack pivot exactly because that's not where the rest of the payload is, so it doesn't matter how the rest of the stack is fragmented.

Expanding the kernel stack

Posted Jun 7, 2014 11:24 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

My understanding was that the only way to modify the stack pointer was either call, return, or resizing your frame. There is no register to write to which contains your stack location, so how would you do a stack pivot?

Expanding the kernel stack

Posted May 31, 2014 9:21 UTC (Sat) by mslusarz (subscriber, #58587) [Link]

I'm wondering, can't known offenders (like direct reclaim code) switch stack on the fly to bigger area? You don't need to allocate this space for every thread...

Expanding the kernel stack

Posted May 31, 2014 11:35 UTC (Sat) by corbet (editor, #1) [Link]

That is essentially what is done in a number of places — work is shifted to a kernel thread, which has the effect of going to a different stack (one that is known not to be almost full already). Doing it any other way involves controlling access to some sort of shared stack infrastructure; that would add a lot of unwelcome complexity, to say the least.

Expanding the kernel stack

Posted Jun 1, 2014 0:51 UTC (Sun) by dgc (subscriber, #6611) [Link]

And the problem with doing this is that each "stack switch" requires two context switches and an unknown amount of scheduler and workqueue latency to execute. We found that the added latency can have devastating effects on performance in XFS, so we have^W had to be very careful where we placed the stack switches.
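
A userspace analogy of the hand-off, as a sketch only: the caller passes the work to a dedicated thread, which runs it on its own stack, and then blocks until the result comes back. The kernel uses its own workqueue machinery for this; the Python thread pool below is just a stand-in, and `deep_work` and `run_on_other_stack` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_worker = ThreadPoolExecutor(max_workers=1)


def deep_work(n):
    """Stand-in for a stack-hungry operation; reports who ran it."""
    return n * 2, threading.get_ident()


def run_on_other_stack(fn, *args):
    """Hand fn to the worker thread and block until it finishes."""
    return _worker.submit(fn, *args).result()


if __name__ == "__main__":
    value, tid = run_on_other_stack(deep_work, 21)
    print(value, tid != threading.get_ident())
```

The caller has to be switched out while the worker runs and switched back in afterwards, which corresponds to the two context switches described above; how long the work sits in the queue before it starts is up to the scheduler.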


Expanding the kernel stack

Posted Jun 1, 2014 10:54 UTC (Sun) by khim (subscriber, #9252) [Link]

Sounds like a bug to me. It looks like it should be possible to do such a switch “for cheap” (i.e.: by changing a few data structures without context switches), but this will require special-casing (effectively this will mean that this special thread will be executed on its own kernel stack with “borrowed” userspace) and tricky additional manipulations.

Still it may be preferable to endlessly growing kernel stack.

Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds