LWN: Comments on "4K stacks by default?" https://lwn.net/Articles/279229/ This is a special feed containing comments posted to the individual LWN article titled "4K stacks by default?". en-us Fri, 10 Oct 2025 14:18:11 +0000 Fri, 10 Oct 2025 14:18:11 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net
4K stacks by default? https://lwn.net/Articles/688312/ https://lwn.net/Articles/688312/ Chakravarthioo7 <div class="FormattedComment"> I was searching the ARM arch code for where the 4K stack is defined. Could someone kindly let me know where it is defined? <br> </div> Sun, 22 May 2016 01:02:56 +0000
Not a big deal for embedded https://lwn.net/Articles/280667/ https://lwn.net/Articles/280667/ klossner <div class="FormattedComment"><pre> I disagree that 4K stacks are significant in the embedded world. We only run a few dozen processes, so the footprint savings don't amount to much, and we don't start thousands of new processes per hour, so the O(1) allocation doesn't matter. On the other hand, it's *really* important to us that corner cases which don't fit in 4K not cause a problem. </pre></div> Thu, 01 May 2008 15:20:47 +0000
4K stacks by default? https://lwn.net/Articles/280270/ https://lwn.net/Articles/280270/ gswoods <div class="FormattedComment"><pre> The main problem I've had with Fedora's 4K stacks involves using modems. Are there any modems out there that can be purchased new, provide full access to the AT command set (so that an answering machine can be implemented using vgetty), and can handle faxing, that DON'T require a proprietary driver and NDISwrapper? </pre></div> Tue, 29 Apr 2008 21:02:16 +0000
4K stacks by default? https://lwn.net/Articles/279786/ https://lwn.net/Articles/279786/ giraffedata <blockquote> Why does the stack have to be in contiguous memory - is it addressed via its physical address? </blockquote> <p> Contiguous <em>virtual</em> memory. That's what I meant by the address space being the scarce resource. We can afford to allocate 4K of virtual addresses when the process is created, but we can't afford to allocate 8K of them, even if the second 4K isn't mapped to physical memory until needed. <p> With a different memory layout, Linux might not have that problem. Some OSes put the kernel stack in a separate address space for each process. But Linux puts all of the kernel memory, including every process' stack, in all the address spaces. So even if the kernel were pageable, there would still be a virtual address allocation problem. Fri, 25 Apr 2008 21:31:21 +0000
4K stacks by default? https://lwn.net/Articles/279788/ https://lwn.net/Articles/279788/ jzbiciak <div class="FormattedComment"><pre> In that last bit of comment, I should say that "the notion of having some number of callee-save registers" is pretty powerful. If a function doesn't use very many registers, it may never have to touch the callee-save registers. If a caller only has a handful of live-across-call variables, it may be able to fit them entirely into callee-save registers. This limits stack traffic in the body of the function dramatically, at the cost of some additional traffic at the edges of the mid-level function to save/restore the callee-save registers. Those save/restore sequences tend to be fairly independent of the rest of the code, too, which works well on dynamically scheduled CPUs. </pre></div> Fri, 25 Apr 2008 21:29:45 +0000
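To make the callee-save idea above concrete, here is a minimal C sketch (the function names are invented for this illustration, and the register behaviour described in the comments is the typical result with an optimizing compiler, not a guarantee). Compiling it with something like gcc -O2 -S and reading the prolog of mid_level() shows that the only stack traffic in the mid-level function is the save/restore of the one callee-save register holding total.
<pre>
/* Minimal illustration of the callee-save point above; names invented. */
#include <stdio.h>

__attribute__((noinline))        /* keep the call out of line so the effect is visible */
static int leaf(int x)
{
    return x * 2 + 1;            /* trivial leaf: may need no stack frame at all */
}

static int mid_level(int n)
{
    int total = 0;               /* live across every call below */
    int i;

    for (i = 0; i < n; i++)
        total += leaf(i);        /* caller-save registers die at this call;
                                    'total' survives in a callee-save register */
    return total;
}

int main(void)
{
    printf("%d\n", mid_level(10));   /* prints 100 */
    return 0;
}
</pre>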
4K stacks by default? https://lwn.net/Articles/279784/ https://lwn.net/Articles/279784/ jzbiciak <div class="FormattedComment"><pre> They should only *need* to get stored if:
1. They're live-across-call and there are no callee-save registers to park the values in.
2. They get spilled due to register pressure.
3. Their address gets taken.
4. Their storage class requires storing to memory (e.g. volatile).
And there could be other reasons where a value *might* end up on the stack, such as:
5. The compiler isn't able to register-allocate the type--this happens most often with aggregates.
6. The compilation / debug model needs it on the stack.
7. The cost model for the architecture suggests register allocation for the variable isn't a win.
#1 above is actually pretty powerful. Texas Instruments' C6400 DSP architecture has 10 registers that are callee-save, and the first 10 arguments of function calls are passed in registers. The CPU has 64 registers total. All these work together to absorb and eliminate quite a bit of stack traffic on that architecture. I'm less familiar with GCC, the x86 and x86-64 ABIs, and how they work, which prompted my original question. </pre></div> Fri, 25 Apr 2008 21:12:50 +0000
4K stacks by default? https://lwn.net/Articles/279781/ https://lwn.net/Articles/279781/ nix <div class="FormattedComment"><pre> If a page fault happens, you might need to swap pages out in order to satisfy the request for an additional page. You might think you could just use a GFP_ATOMIC allocation for this, but the pages have to be contiguous (which might involve memory motion and swapping on its own), and if a lot of processes all need extra stack at once you'll run short on the free (-&gt; normally wasted) memory available for GFP_ATOMIC allocations. </pre></div> Fri, 25 Apr 2008 20:41:11 +0000
4K stacks by default? https://lwn.net/Articles/279780/ https://lwn.net/Articles/279780/ nix <div class="FormattedComment"><pre> Generally, even if locals live in registers they'll get stack slots assigned, because you have to store the locals somewhere across function calls. (Completely trivial leaf functions with almost no variables *might* be able to get away without it, but that's not the common case.) </pre></div> Fri, 25 Apr 2008 20:39:14 +0000
4K stacks by default? https://lwn.net/Articles/279762/ https://lwn.net/Articles/279762/ scarabaeus <div class="FormattedComment"><pre> Thanks for your comments! :-) But I still don't understand... I'm not proposing to swap out kernel stack pages. Instead, I'm wondering why it isn't possible to just allocate additional memory pages for the stack the moment a page fault happens because the currently allocated stack overflows. This assumes that it is possible to just map in additional pages. Why does the stack have to be in contiguous memory - is it addressed via its physical address? If so, is the cost of setting up the virtual page mapping too high? The event that new stack pages have to be allocated would be very rare, so it wouldn't have to be fast... </pre></div> Fri, 25 Apr 2008 18:11:39 +0000
4K stacks by default? https://lwn.net/Articles/279757/ https://lwn.net/Articles/279757/ NAR <div class="FormattedComment"><pre> That's interesting. I thought that the local variables are also stored on the stack, and if you have pointers or integers which are bigger on x86-64, then the storage needed for these variables on the stack is also bigger. Of course, a clever compiler can optimize these variables into registers... </pre></div> Fri, 25 Apr 2008 17:41:48 +0000
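As a small follow-up to jzbiciak's list above, here is a hedged C sketch of reasons 3 and 4 (the names are invented for the illustration): an ordinary scratch variable can normally live entirely in registers, while a local whose address is taken, or one declared volatile, has to be given a real stack slot. Comparing the assembly from gcc -O2 -S for the three variables makes the difference visible.
<pre>
/* Sketch of when a local is forced onto the stack; names are invented. */
#include <stdio.h>

__attribute__((noinline))
static int consume(int *p)
{
    return *p + 1;                /* out of line, so &addressed must point at real memory */
}

static int example(int a, int b)
{
    int scratch = a + b;          /* normally register-only: no stack slot needed      */
    int addressed = a - b;        /* reason 3: its address is taken below              */
    volatile int vol = a * b;     /* reason 4: volatile forces loads/stores to memory  */
    int r;

    r = consume(&addressed);      /* passing &addressed pins it to a stack slot        */
    return scratch + addressed + vol + r;
}

int main(void)
{
    printf("%d\n", example(7, 3));   /* prints 40 */
    return 0;
}
</pre>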
4K stacks by default? https://lwn.net/Articles/279679/ https://lwn.net/Articles/279679/ jzbiciak <P>Hardly. </P><P> If parameters are passed on the stack, the argument frame basically exists on the stack for the entire duration of the function. If those same arguments are passed in registers, the arguments exist only as long as they're needed. If they're unused, consumed before a function call, or passed down the call chain, they don't need to go to the stack. </P><P> The only things that need to go on the stack as you go down the call chain are values that are live across the call and don't have other storage--compiler temps and arguments that are used after the call. </P><P> I haven't looked at the document linked above, but I wouldn't be surprised if the x86-64 calling convention also splits the GPRs between caller-save and callee-save, thereby also reducing the number of slots reserved for values that are live across calls. </P><P> Separate from compiler temps and live-across-call values are spill values. In my experience, modern compilers allocate a stack frame once at the start of a function and maintain it through the life of the function (alloca() being a notable exception, allocating beyond the static frame). If a function has a lot of spilled values, these too get statically allocated. x86 has less than half as many general-purpose registers as x86-64, resulting in greater numbers of spilled variables as well. </P><P> Make sense? </P><P> How about an example? Here's the function prolog from<TT> ay8910_write </TT> in my <A HREF="http://spatula-city.org/~im14u2c/intv/jzintv-1.0-beta3/src/ay8910/ay8910.c">Intellivision emulator</A>, compiled for x86:</P> <PRE>
ay8910_write:
        subl    $60, %esp       #,
</PRE><P>The function allocates a 60-byte stack frame for itself, in addition to 12 bytes for arguments 2 through 4. (Only the first argument gets passed in a register, as I recall.) That's 72 bytes. Here's the same function prolog on x86-64:</P> <PRE>
ay8910_write:
        movq    %r13, -24(%rsp) #,
        movq    %r14, -16(%rsp) #,
        movq    %rdi, %r13      # bus, bus
        movq    %r15, -8(%rsp)  #,
        movq    %rbx, -48(%rsp) #,
        movl    %edx, %r15d     # addr, addr
        movq    %rbp, -40(%rsp) #,
        movq    %r12, -32(%rsp) #,
        subq    $56, %rsp       #,
</PRE><P>This version allocated 56 bytes, and had all of its arguments passed in registers. That's 16 bytes <I>smaller.</I></P><P>I picked this function not because it's some extraordinary function, but rather because it's moderately sized with a moderate number of arguments, and it sits smack dab in the middle of a call chain. And it's in production code.</P> Fri, 25 Apr 2008 04:22:05 +0000
4K stacks by default? https://lwn.net/Articles/279673/ https://lwn.net/Articles/279673/ giraffedata The stack doesn't overflow on x86-64 because it passes parameters in registers instead of on the stack? <p> Doesn't that just mean there are more registers that have to be saved on the stack? <P> There's the same total amount of state in the call chain either way; it has to be stored <em>somewhere</em>. Fri, 25 Apr 2008 03:05:59 +0000
4K stacks by default? https://lwn.net/Articles/279672/ https://lwn.net/Articles/279672/ giraffedata I think a more fundamental problem with paging in extra stack when you need it is that for a stack to work, it has to be in contiguous address space. The addresses past the end of the stack aren't available when you need them. <p> I believe address space is a more scarce resource than physical memory on many systems these days. Fri, 25 Apr 2008 03:02:43 +0000
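For reference, the kernel-side allocation behind this contiguity discussion looks roughly like the sketch below: the whole stack comes from a single higher-order page allocation, so an 8K stack needs an order-1 request (two physically adjacent free pages), while a 4K stack is an ordinary order-0 page. This is only a hedged sketch built around the THREAD_ORDER/THREAD_SIZE macros quoted elsewhere in this thread; the helper names are invented, the header locations vary by architecture, and it is not the literal kernel code.
<pre>
/* Hedged sketch of a 2.6-era thread-stack allocation; helper names invented. */
#include <linux/gfp.h>            /* __get_free_pages(), free_pages(), GFP_KERNEL */
#include <asm/thread_info.h>      /* THREAD_ORDER, THREAD_SIZE (arch-specific)    */

static unsigned long alloc_kernel_stack(void)
{
    /* THREAD_SIZE is PAGE_SIZE << THREAD_ORDER.  With THREAD_ORDER == 1 this
     * is an order-1 allocation, i.e. two physically contiguous pages; it
     * returns 0 when no such adjacent pair of free pages can be found.       */
    return __get_free_pages(GFP_KERNEL, THREAD_ORDER);
}

static void free_kernel_stack(unsigned long stack)
{
    free_pages(stack, THREAD_ORDER);
}
</pre>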
4K stacks by default? https://lwn.net/Articles/279659/ https://lwn.net/Articles/279659/ zlynx <div class="FormattedComment"><pre> I believe the RHEL support engineers were finding systems with mysterious fork/clone failures that were caused by the kernel not being able to find 8K of contiguous memory. It's really easy to allocate 4K, since that's the i386 page size, but finding two free pages next to each other can fail. Big Java programs using a lot of threads would fail to get a new thread. Apache servers would fail to spawn a new child. Etc. However, since then (2.6.16?) the memory system has also been reworked a bunch, and I don't know if it's still such a problem to get an 8K allocation. You *would* think those big programs would now be running on x86_64 systems with the 8K stacks and hitting the same problems, if they still existed. Or maybe they get around it by installing 16 GB of RAM instead. </pre></div> Thu, 24 Apr 2008 23:56:23 +0000
4K stacks by default? https://lwn.net/Articles/279653/ https://lwn.net/Articles/279653/ dvdeug <div class="FormattedComment"><pre> Why does this switch need to be made? If 8k stacks have worked for years, then they should be fine at least until x86 desktops and servers are as rare as Vaxen are now. Why not leave it as an option for those who really need it? </pre></div> Thu, 24 Apr 2008 23:01:39 +0000
4K stacks by default? https://lwn.net/Articles/279644/ https://lwn.net/Articles/279644/ jzbiciak <P>There was a lengthier exchange quoted over on KernelTrap indicating that it wasn't a "4K on x86 vs. 8K on x86-64" situation. That perhaps biased my reading of the quote above, so I didn't read the same thing into it that you did. That exchange was:</P> <PRE>
From: Eric Sandeen &lt;sandeen@...&gt;
Subject: Re: x86: 4kstacks default
Date: Apr 19, 10:36 pm 2008

Arjan van de Ven wrote:
&gt; On the flipside the arguments tend to be
&gt; 1) certain stackings of components still runs the risk of overflowing
&gt; 2) I want to run ndiswrapper
&gt; 3) general, unspecified uneasyness.
&gt;
&gt; For 1), we need to know which they are, and then solve them, because even on x86-64 with 8k stacks
&gt; they can be a problem (just because the stack frames are bigger, although not quite double, there).

Except, apparently, not, at least in my experience.  Ask the xfs guys if
they see stack overflows on x86_64, or on x86.  I've personally never
seen common stack problems with xfs on x86_64, but it's very common on
x86.  I don't have a great answer for why, but that's my anecdotal
evidence.</PRE> <P>I agree that without this additional context it's easy to interpret the shorter quote the way you did. Sorry about that.</P> Thu, 24 Apr 2008 22:44:51 +0000
4K stacks by default? https://lwn.net/Articles/279640/ https://lwn.net/Articles/279640/ proski <div class="FormattedComment"><pre> Please check your logic. That suggests that x86_64 is significantly less likely to run out of 8k than i386 is to run out of 4k. But if you are right about the reduced use of the stack for automatic variables and parameter passing, it means that 4k stacks could be attempted on x86_64. </pre></div> Thu, 24 Apr 2008 21:48:16 +0000
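One way to put numbers on the i386 vs. x86_64 comparison being made here is to compile the same mid-chain function for both targets and compare the prologs, e.g. gcc -m32 -O2 -S frame.c versus gcc -m64 -O2 -S frame.c. This is only a sketch with made-up file and function names: on i386 all six arguments to step() are pushed onto the stack around every call, while on x86-64 they travel in %rdi through %r9, so chain() typically ends up with a noticeably smaller frame.
<pre>
/* frame.c -- toy mid-chain function for comparing x86 and x86-64 stack use.
 * File and function names are made up for this example.                     */
#include <stdio.h>

__attribute__((noinline))
int step(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;
}

int chain(int a, int b, int c, int d, int e, int f)
{
    int x = step(a, b, c, d, e, f);   /* on i386 these six go back onto the stack   */
    int y = step(b, c, d, e, f, a);   /* on x86-64 they stay in argument registers  */
    return x + y;
}

int main(void)
{
    printf("%d\n", chain(1, 2, 3, 4, 5, 6));   /* prints 42 */
    return 0;
}
</pre>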
4K stacks by default? https://lwn.net/Articles/279616/ https://lwn.net/Articles/279616/ sniper <div class="FormattedComment"><pre> From: <a href="http://www.x86-64.org/documentation/abi.pdf">http://www.x86-64.org/documentation/abi.pdf</a>

Registers is the correct answer. Check out the section on passing parameters. Example:

typedef struct { int a, b; double d; } structparm;
structparm s;
int e, f, g, h, i, j, k;
long double ld;
double m, n;

extern void func (int e, int f, structparm s, int g, int h,
                  long double ld, double m, double n,
                  int i, int j, int k);

func (e, f, s, g, h, ld, m, n, i, j, k);

General Purpose    Floating Point    Stack Frame Offset
%rdi: e            %xmm0: s.d        0:  ld
%rsi: f            %xmm1: m          16: j
%rdx: s.a,s.b      %xmm2: n          24: k
%rcx: g
%r8:  h
%r9:  i
</pre></div> Thu, 24 Apr 2008 18:46:27 +0000
4K stacks by default? https://lwn.net/Articles/279614/ https://lwn.net/Articles/279614/ jzbiciak <P>Currently both x86 and x86-64 have 8K stacks by default, as I recall. That wasn't what I was talking about. I was referring to this comment in the original article:</P> <BLOCKQUOTE>We see them regularly enough on x86 to know that the first question to any strange crash is "are you using 4k stacks?". In comparison, I have never heard of a single stack overflow on x86_64....</BLOCKQUOTE> <P>That's just a general statement that suggests x86-64 places less demand on the stack than x86.</P> Thu, 24 Apr 2008 18:28:02 +0000
4K stacks by default? https://lwn.net/Articles/279609/ https://lwn.net/Articles/279609/ proski From Linux 2.6.25, file include/asm-x86/page_64.h: <pre>
#define THREAD_ORDER 1
#define THREAD_SIZE (PAGE_SIZE << THREAD_ORDER)
</pre> This looks like 8k to my untrained eye. Thu, 24 Apr 2008 17:44:06 +0000
4K stacks by default? https://lwn.net/Articles/279584/ https://lwn.net/Articles/279584/ jzbiciak <div class="FormattedComment"><pre> I do find it interesting (though not terribly surprising) that x86-64 treads more lightly on the stack than x86. My initial inclination is that there are two factors at play: x86-64 should spill a whole heck of a lot less, and x86-64 passes more arguments in registers. Anyone here have any thoughts? </pre></div> Thu, 24 Apr 2008 15:51:31 +0000
4K stacks by default? https://lwn.net/Articles/279538/ https://lwn.net/Articles/279538/ pr1268 <p style="padding-left: 1.5em; padding-right: 1.5em;"><font class="QuotedText">Is four years of talking about 4K stacks enough? You can prepare and prepare for decades and still not be certain you've caught everything. There comes a time when you just flip the switch and fix anything that breaks. I hope that time is soon.</font></p> <p>Having read Jake's article, I was under the impression that the bigger issue wasn't the impact of defaulting to 4K stacks, but rather how the patch submission showed a total disregard for following the established procedure. IMO this smacks of some ulterior motive of trying to &quot;sneak&quot; it past the senior kernel developers, especially given the controversy around the patch. But then again, I often entertain harebrained conspiracy theories, so don't mind me. ;-)</p> <p>I do agree with your comment, though.</p> <p>Here's a different, more constructive conspiracy theory: perhaps the submitter knew full well that this patch would be caught despite the clandestine technique used, and he/she <i>wanted</i> to stimulate a discussion on defaulting to 4K stacks--after all, four years is a long time to keep this patch in mainline only to have it disabled by default. Of course, this wouldn't explain why the submitter didn't just announce the change and ask for comments on the LKML...</p> <p>I've run 4K stacks on vanilla kernels for several years now without any issues, even with proprietary NVIDIA graphics drivers (I know, I know!).
I do remember having to keep 8K stacks on my laptop prior to 2.6.17 with a Broadcom Wi-Fi card and NDISWrapper (need I say more?).</p> Thu, 24 Apr 2008 12:42:27 +0000
4K stacks by default? https://lwn.net/Articles/279536/ https://lwn.net/Articles/279536/ nix <blockquote> have yet to see a single crash due to 4K stacks </blockquote> I guess no Fedora users use LVM and CD-RW burners at the same time, then (no need to put the CD-RW under LVM; just using it is enough), as until 2.6.25 that combination was blowing the 4K stack. <p> There are still plenty of stack blowers out there :/ Thu, 24 Apr 2008 12:02:55 +0000
4K stacks by default? https://lwn.net/Articles/279524/ https://lwn.net/Articles/279524/ MathFox <div class="FormattedComment"><pre> The page fault code needs stack too... IIRC the Linux developers made the explicit decision that kernel code and data will always be in RAM; a page fault from kernel code is a reason to panic. If you want to make kernel code and data "demand pageable", you must take care that all code (and data) needed for paging the swapped-out kernel pages back in is locked in RAM, or you end up with "the data I need to load this page is only available in the swap." Linux systems can swap to a local file system or over the network, so a lot of code (and data) would have to be locked to keep the system running. The kernel gurus decided that keeping 100% of the kernel in RAM was far easier to manage. </pre></div> Thu, 24 Apr 2008 10:50:04 +0000
4K stacks by default? https://lwn.net/Articles/279523/ https://lwn.net/Articles/279523/ scarabaeus <div class="FormattedComment"><pre> I've always wondered: would it be that difficult to fault in additional stack pages on demand, so the stack can grow as needed? Apparently there are good reasons why this is not possible - can someone explain them? </pre></div> Thu, 24 Apr 2008 10:28:36 +0000
4K stacks by default? https://lwn.net/Articles/279471/ https://lwn.net/Articles/279471/ bronson <div class="FormattedComment"><pre> Is four years of talking about 4K stacks enough? You can prepare and prepare for decades and still not be certain you've caught everything. There comes a time when you just flip the switch and fix anything that breaks. I hope that time is soon. Is there some way to allocate a 12K chunk and use it as the stack when ndiswrapper calls into Windows code? Seems easy enough to me, but I come from a day when kernels and memory architectures were a LOT simpler. :) </pre></div> Thu, 24 Apr 2008 04:17:58 +0000
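For what it's worth, the "borrow a bigger stack" idea above can at least be illustrated in userspace with the POSIX ucontext API: run the deep call on a separately allocated 12K block and switch back when it returns. This is only a sketch of the concept with invented names; it is not how ndiswrapper or the kernel actually does it (the in-kernel analogue is the separate interrupt-stack switching that the 4K-stacks option relies on, which has its own constraints).
<pre>
/* Userspace sketch of running one call on a private 12K stack via the
 * POSIX ucontext API.  Illustration only; not what ndiswrapper does.    */
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define PRIVATE_STACK_SIZE (12 * 1024)     /* the "12K chunk" */

static ucontext_t main_ctx, big_ctx;

static void big_frame_func(void)
{
    char scratch[6 * 1024];                /* would never fit in a 4K stack */
    scratch[0] = 1;
    printf("ran with a %zu-byte local on the private stack\n", sizeof scratch);
}

int main(void)
{
    void *stack = malloc(PRIVATE_STACK_SIZE);
    if (stack == NULL)
        return 1;

    getcontext(&big_ctx);
    big_ctx.uc_stack.ss_sp   = stack;
    big_ctx.uc_stack.ss_size = PRIVATE_STACK_SIZE;
    big_ctx.uc_link          = &main_ctx;  /* come back here when it returns */
    makecontext(&big_ctx, big_frame_func, 0);

    swapcontext(&main_ctx, &big_ctx);      /* switch stacks, run, switch back */

    free(stack);
    return 0;
}
</pre>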