4K stacks by default?

Posted Apr 24, 2008 15:51 UTC (Thu) by jzbiciak (subscriber, #5246)
Parent article: 4K stacks by default?

I do find it interesting (though not terribly surprising) that x86-64 treads more lightly on
the stack than x86.  My initial inclination is that there are two factors at play:  x86-64
should spill a whole heck of a lot less, and x86-64 passes more arguments in registers.

Anyone here have any thoughts?


4K stacks by default?

Posted Apr 24, 2008 17:44 UTC (Thu) by proski (subscriber, #104)

From Linux 2.6.25, file include/asm-x86/page_64.h:
#define THREAD_ORDER    1
#define THREAD_SIZE  (PAGE_SIZE << THREAD_ORDER)
This looks like 8k to my untrained eye.
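
As a quick sanity check of that arithmetic, here is a minimal C sketch. It assumes 4 KiB pages (PAGE_SIZE is hard-coded here rather than taken from the kernel headers), and thread_size_for_order is a made-up helper, not a kernel function.

```c
/* Assumption: 4 KiB base pages, hard-coded for illustration only. */
#define PAGE_SIZE     4096UL
#define THREAD_ORDER  1
#define THREAD_SIZE   (PAGE_SIZE << THREAD_ORDER)

/* Hypothetical helper: stack size in bytes for a given allocation order. */
unsigned long thread_size_for_order(unsigned int order)
{
    return PAGE_SIZE << order;
}
```

Under that assumption, THREAD_ORDER 1 works out to 8192 bytes, and an order of 0 would give the 4096-byte stack the article is about.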

4K stacks by default?

Posted Apr 24, 2008 18:28 UTC (Thu) by jzbiciak (subscriber, #5246)

Currently both x86 and x86-64 have 8K stacks by default as I recall. That wasn't what I was talking about. I was referring to this comment in the original article:

We see them regularly enough on x86 to know that the first question to any strange crash is "are you using 4k stacks?". In comparison, I have never heard of a single stack overflow on x86_64....

That's just a general statement that suggests x86-64 places less demand on the stack than x86.

4K stacks by default?

Posted Apr 24, 2008 21:48 UTC (Thu) by proski (subscriber, #104)

Please check your logic.  That suggests that x86_64 is significantly less likely to run out of
8k than i386 out of 4k.

But if you are right about reduced usage of stack for automatic variables and parameter
passing, it means that 4k stacks could be attempted on x86_64.

4K stacks by default?

Posted Apr 24, 2008 22:44 UTC (Thu) by jzbiciak (subscriber, #5246)

There was a lengthier exchange, quoted over on KernelTrap, indicating that this wasn't a "4K on x86 vs. 8K on x86-64" comparison. That probably biased my reading of the quote above, so I didn't read it the way you did. The exchange was:

From: Eric Sandeen <sandeen@...>
Subject: Re: x86: 4kstacks default
Date: Apr 19, 10:36 pm 2008

Arjan van de Ven wrote:

> On the flipside the arguments tend to be
> 1) certain stackings of components still runs the risk of overflowing
> 2) I want to run ndiswrapper
> 3) general, unspecified uneasyness.
> 
> For 1), we need to know which they are, and then solve them, because even on x86-64 with 8k stacks
> they can be a problem (just because the stack frames are bigger, although not quite double, there).

Except, apparently, not, at least in my experience.

Ask the xfs guys if they see stack overflows on x86_64, or on x86.

I've personally never seen common stack problems with xfs on x86_64, but
it's very common on x86.  I don't have a great answer for why, but
that's my anecdotal evidence.

I agree that without this additional context it's easy to interpret the shorter quote the way you did. Sorry about that.

4K stacks by default?

Posted Apr 24, 2008 18:46 UTC (Thu) by sniper (guest, #13219)

From: http://www.x86-64.org/documentation/abi.pdf

Registers is the correct answer. Check out the section on passing parameters.

Example:

typedef struct {
  int a, b;
  double d;
} structparm;
structparm s;
int e, f, g, h, i, j, k;
long double ld;
double m, n;
extern void func (int e, int f,
                  structparm s, int g, int h,
                  long double ld, double m,
                  double n, int i, int j, int k);
func (e, f, s, g, h, ld, m, n, i, j, k);


General Purpose  Floating Point    Stack Frame Offset
%rdi: e          %xmm0: s.d        0:  ld
%rsi: f          %xmm1: m          16: j
%rdx: s.a,s.b    %xmm2: n          24: k
%rcx: g
%r8:  h
%r9:  i

4K stacks by default?

Posted Apr 25, 2008 3:05 UTC (Fri) by giraffedata (subscriber, #1954)

The stack doesn't overflow on x86-64 because it passes parameters in registers instead of on the stack?

Doesn't that just mean there are more registers that have to be saved on the stack?

There's the same total amount of state in the call chain either way; it has to be stored somewhere.

4K stacks by default?

Posted Apr 25, 2008 4:22 UTC (Fri) by jzbiciak (subscriber, #5246)

Hardly.

If parameters are passed on the stack, the argument frame basically exists on the stack for the entire duration of the function. If those same arguments are passed in registers, the arguments exist only as long as they're needed. If they're unused, consumed before a function call, or passed down the call chain, they don't need to go to the stack.

The only things that need to go on the stack as you go down the call chain are values that are live across the call and don't have other storage: compiler temps and arguments that are used after the call.
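
The distinction between "consumed before the call" and "live across the call" can be sketched in C. The function names and helper here are made up for illustration, and what a given compiler actually does with them depends on its version and flags.

```c
/* Hypothetical callee standing in for "the next function down the chain". */
int helper(int x)
{
    return x + 1;
}

/* Both arguments are consumed before the call is made, so with
 * register-passed arguments nothing here has to survive the call. */
int consumed_before_call(int a, int b)
{
    int t = a * 2;
    return helper(t + b);
}

/* 'a' is needed after helper() returns, so it is live across the call.
 * It must be kept in a callee-saved register or in a stack slot. */
int live_across_call(int a, int b)
{
    int r = helper(b);
    return r + a;
}
```

In the first function a register-argument convention lets the arguments simply die; in the second, only `a` needs a home across the call, whereas a stack-argument convention keeps both argument slots around for the whole function either way.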

I haven't looked at the document linked above, but I wouldn't be surprised if the x86-64 calling convention also splits the GPRs between caller-saves vs. callee-saves, thereby also reducing the number of slots reserved for values live-across calls.

Separate from compiler temps and live-across-call values are spill values. In my experience, modern compilers allocate a stack frame once at the start of a function and maintain it through the life of the function (alloca() being a notable exception, allocating beyond the static frame). If a function has a lot of spilled values, these too get statically allocated. x86 has less than half as many general purpose registers as x86-64, resulting in greater numbers of spilled variables as well.

Make sense?

How about an example? Here's the function prolog from ay8910_write in my Intellivision emulator, compiled for x86:

ay8910_write:
    subl    $60, %esp   #,

The function allocates a 60-byte stack frame for itself, in addition to 12 bytes for arguments 2 through 4. (Only the first argument gets passed in a register, as I recall.) That's 72 bytes. Here's the same function prolog on x86-64:

ay8910_write:
    movq    %r13, -24(%rsp) #,
    movq    %r14, -16(%rsp) #,
    movq    %rdi, %r13  # bus, bus
    movq    %r15, -8(%rsp)  #,
    movq    %rbx, -48(%rsp) #,
    movl    %edx, %r15d # addr, addr
    movq    %rbp, -40(%rsp) #,
    movq    %r12, -32(%rsp) #,
    subq    $56, %rsp   #,

This version allocates 56 bytes and has all its arguments passed in registers. That's 16 bytes smaller than the 72 bytes used on x86.

I picked this function not because it's some extraordinary function, but rather because it's moderately sized with a moderate number of arguments, and it's smack dab in the middle of a call chain. And it's in production code.
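
For anyone who wants to repeat this kind of comparison on their own code, a sketch along these lines should work. Everything named here (struct bus, leaf, mid_chain) is a made-up stand-in, not the emulator's real code, and the noinline attribute is a GCC/Clang extension used only to keep the call from being inlined away.

```c
/* Hypothetical device state; not the emulator's real structure. */
struct bus {
    unsigned regs[16];
};

/* Hypothetical leaf function at the bottom of the call chain. */
__attribute__((noinline))
unsigned leaf(struct bus *bus, unsigned addr)
{
    return bus->regs[addr & 15];
}

/* Hypothetical mid-chain function with four arguments. 'bus', 'addr'
 * and 'old' are all still needed after a call to leaf(), so they are
 * live across a call. */
unsigned mid_chain(struct bus *bus, unsigned unused, unsigned addr, unsigned data)
{
    unsigned old = leaf(bus, addr);

    (void)unused;
    bus->regs[addr & 15] = data;
    return old + leaf(bus, addr);
}
```

Compiling this once with gcc -O2 -S -m32 and once with gcc -O2 -S -m64 (the 32-bit build needs multilib support installed) and comparing the prologs of mid_chain shows how each ABI handles the arguments. The exact frame sizes will vary with compiler version and flags.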

4K stacks by default?

Posted Apr 25, 2008 17:41 UTC (Fri) by NAR (subscriber, #1313)

That's interesting. I thought that local variables are also stored on the stack, and that if you
have pointers or integers that are bigger on x86-64, then the storage needed for these
variables on the stack is also bigger. Of course, a clever compiler can optimize these
variables into registers...

4K stacks by default?

Posted Apr 25, 2008 20:39 UTC (Fri) by nix (subscriber, #2304)

Generally, even if locals live in registers they'll get stack slots 
assigned, because you have to store the locals somewhere across function 
calls. (Completely trivial leaf functions with almost no variables *might* 
be able to get away without it, but that's not the common case.)

4K stacks by default?

Posted Apr 25, 2008 21:12 UTC (Fri) by jzbiciak (subscriber, #5246)

They should only *need* to get stored if

1. They're live-across-call and there are no callee-save registers to park the values in.
2. They get spilled due to register pressure.
3. Their address gets taken.
4. Their storage class requires storing to memory (e.g. volatile).

And there could be other reasons why they *might* end up on the stack, such as:

5. The compiler isn't able to register allocate the type--this happens most often with
aggregates.
6. Compilation / debug model needs it on the stack.
7. Cost model for the architecture suggests register allocation for the variable isn't a win.

#1 above is actually pretty powerful.  Texas Instruments' C6400 DSP architecture has 10
registers that are callee-save and the first 10 arguments of function calls are passed in
registers.  The CPU has 64 registers total.  All these work together to absorb and eliminate
quite a bit of stack traffic on that architecture.  

I'm less familiar w/ GCC, the x86 and x86-64 ABIs and how they work, which prompted my
original question.
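
Cases 3 and 4 from the list above, next to a local with no such constraint, can be sketched in C. All names here are invented for illustration, and an optimizing compiler that inlines fill() may well avoid the memory slot in the first case.

```c
/* Hypothetical callee that writes through a pointer. */
void fill(int *out)
{
    *out = 42;
}

/* Case 3: the address of 'x' is taken, so 'x' generally needs a
 * memory location (unless fill() is inlined and the store is removed). */
int address_taken(void)
{
    int x = 0;
    fill(&x);
    return x;
}

/* Case 4: 'v' is volatile, so every access must go to memory. */
int volatile_local(int n)
{
    volatile int v = n;
    v += 1;
    return v;
}

/* None of the cases apply: 't' can live entirely in a register. */
int plain_local(int n)
{
    int t = n + 1;
    return t * 2;
}
```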

4K stacks by default?

Posted Apr 25, 2008 21:29 UTC (Fri) by jzbiciak (subscriber, #5246)

In that last bit of comment, I should say "the notion of having some number of callee-save
registers" is pretty powerful.  If a function doesn't use very many registers, it may never
have to touch the callee-save registers.  If a caller only has a handful of live-across-call
variables, it may be able to fit them entirely into callee-save registers.  

This limits stack traffic in the body of the function dramatically, at the cost of some additional
traffic at the edges of the mid-level function to save/restore the callee-save registers.
Those save/restore sequences tend to be fairly independent of the rest of the code, too, which
works well on dynamically scheduled CPUs.

