In that last bit of my comment, I should add that the notion of having some number of
callee-save registers is pretty powerful. If a function doesn't use very many registers, it may never
have to touch the callee-save registers. If a caller only has a handful of live-across-call
variables, it may be able to fit them entirely into callee-save registers.
This dramatically limits stack traffic in the body of the function, at the cost of some additional
traffic at the edges of the mid-level function to save/restore the callee-save registers it uses.
Those save/restore sequences tend to be fairly independent of the rest of the code, too, which
works well on dynamically scheduled CPUs.
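
To make that concrete, here is a minimal sketch in C. The names (scale, sum_mapped) are made up for illustration, and which registers actually get used depends on the compiler, the optimization level, and the ABI. On x86-64 System V the callee-save set is rbx, rbp, and r12-r15; on AArch64 it is x19-x28. The call goes through a function pointer only so that, in a standalone build of sum_mapped, the compiler has to emit a real call.

```c
#include <assert.h>
#include <stddef.h>

/* Leaf function: it needs only a couple of registers, so a compiler can
   usually keep everything in caller-save (scratch) registers and never
   touch the callee-save set at all. */
int scale(int x, int factor)
{
    return x * factor;
}

/* Mid-level function: xs, n, f, factor, i and total are all live across
   the indirect call to f(). An optimizing compiler will typically keep
   them in callee-save registers, pushing those registers once in the
   prologue and popping them once in the epilogue, instead of spilling
   and reloading every variable around every call in the loop body. */
int sum_mapped(const int *xs, size_t n, int (*f)(int, int), int factor)
{
    assert(xs != NULL || n == 0);

    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += f(xs[i], factor);  /* the live-across-call point */
    }
    return total;
}
```

Compiling this with something like gcc -O2 -S and reading the assembly for sum_mapped should show the pattern: a short run of saves at the top, a matching run of restores at the bottom, and little or no stack traffic inside the loop. The exact output will vary by compiler and target.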