
4K stacks for everyone?

The 2.6.6 kernel contained, among many other things, a patch implementing single-page (4K) kernel stacks on the x86 architecture. Cutting the kernel stack size in half reduces the kernel's per-process overhead and eliminates a major consumer of multi-page allocations. So running with the smaller stack size is good for kernel performance and robustness. The only problem has been certain code paths in the kernel which require more stack space than that. Overrunning the kernel stack will corrupt kernel memory and lead to unfortunate behavior in a hurry.

Over time, however, most of these problems have been taken care of, to the point that Adrian Bunk recently asked: is it time to eliminate the 8K stack option entirely for x86? Some distributors (e.g. Fedora) have been shipping kernels with 4K stacks for some time without ill effect. What problems might result, Adrian asked, if 4K stacks became the only option for everyone?

It turns out that there are a few problems still. For example, the reiser4 filesystem still cannot work with 4K stacks. There is, however, a patch in the works which should take care of that particular problem.

A more complicated issue comes up in certain complex storage configurations. If a system administrator builds a fancy set of RAID volumes involving the device mapper, network filesystems, etc., the path between the decision to write a block and the actual issuance of I/O can get quite long. This situation can lead to stack overflows at strange and unpredictable times.

What happens here is that a filesystem will decide to write a block, which ends up creating a call to the relevant block driver's make_request() function (or the block subsystem's generic version of it). For stacked block devices, such as a RAID volume, that I/O request will be transformed into a new request for a different device, resulting in a new, recursive make_request() call. Once a few layers have been accumulated, the call path gets deep, and the stack eventually runs out. Neil Brown has posted a patch to resolve this problem by serializing recursive make_request() calls. With this patch, the kernel keeps an explicit stack of bio structures needing submission, and only processes one at a time in any given task. This patch will truncate the deep call paths, and should resolve the problem.

That leaves one other problem outstanding: NDISwrapper. This code is a glue layer which allows Windows network drivers to be loaded into a Linux kernel; it is used by people who have network cards which are not otherwise supported by Linux. NDIS drivers, it seems, require larger stacks. Since they are closed-source drivers written for an operating system which makes larger stacks available, there is little chance of fixing them. So a few options have been discussed:

  • Ignoring the problem. Since NDISwrapper is a means for loading proprietary drivers into the kernel - and Windows drivers at that - many kernel developers will happily refuse to support it at all. The fact is, however, that disallowing 8K stacks would break (formerly) working systems for many users, and there are kernel developers who do not want to do that.

  • Hack NDISwrapper to maintain its own special stack, and to switch to that stack before calling into the Windows driver. This solution seems possible, but it is a nontrivial bit of hacking to make it work right.

  • Move NDISwrapper into user space with some sort of mechanism for interrupt delivery and such. These mechanisms exist, so this solution should be entirely possible.

No consensus solution seems to have emerged as of this writing. There is time, anyway; removing the 8K stack option is not a particularly urgent task, and certainly will not be considered for 2.6.14.



4K stacks for everyone?

Posted Sep 8, 2005 5:43 UTC (Thu) by wtogami (subscriber, #32325) [Link]

NDISWrapper actually requires 16K stacks for reliable operation. 8K isn't always enough.

NDISWrapper and 4K stacks

Posted Sep 9, 2005 18:57 UTC (Fri) by Duncan (guest, #6647) [Link]

NDISWrapper moving to user space seems the clear best choice here. Not only does it solve the problem with 4k kernel stacks, it also moves proprietary (and MSWormOS, at that) drivers out of kernel space into more sanely protected userspace. This sounds like exactly the sort of solution the kernel devs would prefer, because it gets those "black-box binary-only things" out of the kernel, greatly simplifying and securing things.

The cost is of course speed... Mode transfers between user space and kernel space take time, so proprietary-only MSWormOS-based drivers will be slower. Somehow, I don't see that as a big issue either, since the view will be that it encourages transferring to more Linux-friendly hardware, always seen as a good thing.

Duncan

NDISWrapper and 4K stacks

Posted Sep 15, 2005 15:36 UTC (Thu) by Luyseyal (guest, #15693) [Link]

Agreed. Most ndiswrapper users seem to be wireless card users who won't be expecting stellar performance in any case.

-l

4K stacks for everyone?

Posted Sep 8, 2005 7:02 UTC (Thu) by jwb (guest, #15467) [Link]

A friend relayed this excellent suggestion. Instead of causing great pain among users of ndiswrapper, raid, cryptoloop, xfs, nfs, lustre, and a great many other kernel features, why not accelerate the move to 16KiB soft pages on x86? Then the stack could be kept in a single softpage, with the last 4KiB hardware backing page unallocated. That leaves 12KiB in the stack, and a reliable means of determining when the stack overflows. In addition you get all the other efficiency benefits of larger pages.

The current proposal is sheer madness. The developers have NO IDEA what the maximum kernel stack usage is, and no way of determining it. They who are proposing mandatory 4KiB stacks are just crossing their fingers and saying "fuckit, it seems to run on my laptop." That's not a very modern method of software development, especially when the only beneficiaries are a couple of large [elided] customers with over-threaded Java apps.

Conservative Automatic Stack Size Check

Posted Sep 8, 2005 8:27 UTC (Thu) by pkolloch (subscriber, #21709) [Link]

> The current proposal is sheer madness. The developers have NO IDEA what the maximum kernel stack usage is, and no way of determining it.

Then the current state is desperate as well: they don't have a clue whether the current stack size limit is sufficient. Your dynamic stack size check would be a step in the right direction, but:

Most stack allocation should be easily statically determinable (with only small conservative overapproximations). Things like alloca (if there is a kernel equivalent) or any other means of changing the stack size by a dynamically computed amount are trickier. However, these should be avoided anyway if stack conservation has such a priority.

At least conceptually, computing a call graph with conservative stack usage annotations should be fairly easy (using existing code in GCC). In the absence of recursion, one could easily determine the largest stack size in use. And again, if you value the stack size so much, you should not use recursion. (Well, there might be valid use cases with a known maximal recursion depth of 3 or so which might be hard to check statically by machine, and even if that is the case, you will need something slightly smarter than plain call graphs.)

Without such an automatic check, I pretty much agree with you.

[Disclaimer: I have basically no clue about the kernel source except of what I read occasionally on this page.]

Conservative Automatic Stack Size Check

Posted Sep 8, 2005 9:12 UTC (Thu) by nix (subscriber, #2304) [Link]

> Most stack allocation should be easily statically determinable
Some static determination is possible but not easy and not reliable (nor can it ever be reliable in the general case), and the error bars are large. See Olivier Hainque's paper in the GCC 2005 Summit proceedings for a pile of info on this.

TBH I'd expect that kernel developers' own hunches would be as reliable.

Conservative Automatic Stack Size Check

Posted Sep 8, 2005 9:53 UTC (Thu) by pkolloch (subscriber, #21709) [Link]

After a moderate amount of web searching, I could find the abstract of the
presentation, but not the paper itself. Any pointers?

BTW I did not say that it is "easy" for the general case, but for the kernel without dynamic stack allocations and recursion. And OK, I was probably naive and will agree that it is probably also difficult for this special case ;) But it is both feasible and desirable. I hope Olivier Hainque will be successful in his quest and his work will be applied to the kernel.

> TBH I'd expect that kernel developers' own hunches would be as reliable.

And predict which variables are being stored in registers and which on the stack, and consider all call paths? No, I think humans would miss a lot of special cases on that one. Additionally, hardly anyone would actually endeavor to do this for anything but some core functions. Am I wrong?

Conservative Automatic Stack Size Check

Posted Sep 8, 2005 11:42 UTC (Thu) by farnz (guest, #17727) [Link]

The paper starts on page 99 of the proceedings PDF. I've not found it split separately, and the PDF file is quite large (around 1.7MB).

Conservative Automatic Stack Size Check

Posted Sep 9, 2005 1:58 UTC (Fri) by sethml (subscriber, #8471) [Link]

Clever idea, but you missed a case that's hard to deal with: calling through function pointers. The kernel uses function pointers extensively, especially for device drivers. I suspect the case mentioned involving RAID involves calling through quite a few levels of function pointers. Figuring out the maximum possible call stack depth, even very conservatively, is probably pretty difficult, and the conservative answer is probably "infinite" because there are pathways you could construct that would recurse, even if that never happens in practice.

Conservative Automatic Stack Size Check

Posted Sep 9, 2005 9:29 UTC (Fri) by pkolloch (subscriber, #21709) [Link]

Hmmm, you are right; I knew I had been naive, but I couldn't see what I missed.

From what I saw of the VFS, it's a shame that it is not expressed in an object-oriented fashion. That could at least limit the number of candidates. Maybe one could provide some annotations?

But I can well imagine that concepts such as unionfs, which wrap other file systems, could in principle be wrapped around each other infinitely. You would have to make up some clever notation to tell the stack analyzer that this really isn't possible. (If there is even such a check ;) ) Or is it done in some clever fashion where the wrapped and the wrapper are not called in a nested fashion, but in some kind of chaining way, for exactly the purpose of saving stack space?

[Disclaimer: Again, I have no real clue about the kernel source, so I hope my assumptions are not totally off the beat.]

Conservative Automatic Stack Size Check

Posted Sep 10, 2005 21:03 UTC (Sat) by joern (subscriber, #22392) [Link]

FYI: Function pointers are not that hard to follow. See
http://wh.fh-wedel.de/~joern/quality.pdf

4K stacks for everyone?

Posted Nov 16, 2005 10:08 UTC (Wed) by DiegoCG (subscriber, #9198) [Link]

I strongly disagree - this patch has been in Fedora Core for 2 years. Developers are not pushing this out of their ass. The total stack space is actually bigger due to interrupt stacks. Also, it improves scalability.

Apparently the main issues (XFS, etc.) have been fixed, so I'd say that 4KB stacks have actually improved code quality....

Larger pages are interesting, but they also bring more fragmentation; also, lots of not-so-old x86 CPUs can't handle pages > 4KB very well, AFAIK.

Binary graphics drivers?

Posted Sep 8, 2005 10:57 UTC (Thu) by NAR (subscriber, #1313) [Link]

I seem to remember that there was a warning in the kernel configuration option about breaking binary-only modules such as NVidia and ATI drivers if the stack is only 4k. Was this problem fixed?

Bye,NAR

Binary graphics drivers?

Posted Sep 8, 2005 13:00 UTC (Thu) by sbergman27 (guest, #10767) [Link]

Although I am currently running x86_64, which uses 8k stacks, I have run the same machine on i386 Fedora (4k stacks) with the NVidia drivers with no problem.

Binary graphics drivers?

Posted Sep 8, 2005 14:27 UTC (Thu) by alspnost (guest, #2763) [Link]

Yes - NVidia, at least, fixed their driver to work with 4k stacks back at the time. I used it for many months on my old x86 system, before switching to AMD64.

Binary graphics drivers?

Posted Sep 8, 2005 18:25 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

As I recall, the nVidia graphics drivers were fixed to work with 4k stacks at around the time 2.6.7 came out. Either .7 or .8. It's been awhile, though. I have been running with 4k stacks and the nVidia driver for several kernel versions with no incident.

Conexant drivers from Linuxant also requires >4K

Posted Sep 8, 2005 17:56 UTC (Thu) by astrand (guest, #4908) [Link]

On my laptop with the ICH6 chipset, which is running FC4, I'm using the HSF softmodem driver from Linuxant. After many kernel panics, I found out that this driver doesn't work with 4K stacks. I replaced the kernel with a version from Linuxant and now things work, but some drawbacks remain:

* I won't receive any kernel updates via "yum"

* I cannot use the NTFS drivers from Livna.

To sum up, this problem has been boring and frustrating. I don't care if the kernel is a few percent slower, as long as things work...

Conexant drivers from Linuxant also requires >4K

Posted Sep 8, 2005 18:58 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link]

Well, it's not really a question of speed. Many kernel updates fix security issues. Custom ones will have to be rebuilt every time. For filesystems, FUSE has recently been merged into the upstream kernel and will be included in 2.6.14. So it might be possible to use a pure user space solution which wouldn't break with every new kernel update.

Rahul

Conexant drivers from Linuxant also requires >4K

Posted Sep 15, 2005 5:25 UTC (Thu) by thleemhuis (guest, #32469) [Link]

> * I cannot use the NTFS drivers from Livna.

You can easily rebuild the kernel-module-ntfs.srpm from livna. See:
http://rpm.livna.org/kernel-modules.html

4K stacks for everyone?

Posted Sep 9, 2005 1:06 UTC (Fri) by mcelrath (guest, #8094) [Link]

I noticed the other day while configuring my 2.6.13 kernel that there is now an experimental option to use register arguments for function calls. I imagine this should seriously reduce stack usage. Perhaps a default 4k stack should require register arguments.

But this just begs the question...does the kernel really have no means of detecting or handling stack overflows? That just seems like bad design. Can't the stack be set up so that if it is over-written it will trigger a page fault, and the kernel could handle it? gcc/libc can allocate more stack pages for userspace programs if needed, but why not the kernel?

VM for device drivers?

Posted Sep 9, 2005 4:36 UTC (Fri) by xoddam (subscriber, #2322) [Link]

> gcc/libc can allocate more stack pages for userspace programs
> if needed, but why not the kernel?

A defining characteristic of kernel-space programming is that
you don't get the benefits of implicit memory protection.
Everything has to be done explicitly by the kernel itself.

It's possible in principle to give kernel-space tasks virtual
memory support, but it would open a big can of worms. If
you want deep recursion, do it in userspace.

As things stand the kernel-space page map is never changed
implicitly, and rarely explicitly. The prospect of giving
kernel tasks their own vm maps with holes to fit new pages
which are to be faulted in (from where?) when the stack
overflows is nightmarish! Performance and maintainability
are much, much more important than a growable stack.

If stack usage in kernel space isn't demonstrably finite
then the code is broken. The best solution is explicit
management of the resources (eg. using a queue) so that
the stack size ceases to be an issue.

VM for device drivers?

Posted Sep 11, 2005 1:14 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

The kernel uses virtual memory address translation today, and that's all that's required to do this kind of stack overflow protection. Make the stack one page and the page immediately after it invalid. A process tries to overflow the stack, and it gets oopsed.

However, I believe the point of 4K stacks is that there is a dirth of kernel virtual memory address space, so having 4K of usable addresses plus 4K of unusable ones obviates the 4K stack change.

To be robust like other OSes in this area, we'd have to go to some complex system with multiple kernel address spaces, and that probably would bring with it a pageable kernel.

VM for device drivers?

Posted Sep 13, 2005 1:17 UTC (Tue) by mcelrath (guest, #8094) [Link]

Dirth isn't a word. The word you wanted is dearth. (back at cha)

Anyway, one really needs an oops or panic if the kernel stack is overflowed. A previous poster said page faults in kernel space aren't detectable. You are proposing a 4k page at the end of the stack to check if the stack has overflowed. If there are no page faults in kernel space, then one has to check the stack-overflow page on every process switch? That seems expensive.

Then on the other hand this overflow page can probably be only one physical page, shared among all processes (and an oops or panic if ANY process writes it), and if a page fault isn't possible then the task switch could just do an

if(stack_overflow[0] != STACK_OVERFLOW_PATTERN) { oops }

e.g. just check the first byte. So overall it costs 1 cmp per task switch and 4k. Seems much better than silent stack overflows, and the possible security flaws that might come from them too...

VM for device drivers?

Posted Sep 13, 2005 16:07 UTC (Tue) by giraffedata (subscriber, #1954) [Link]

The kernel can and does detect page faults in kernel space. When the kernel tries to dereference a null pointer, the oops you see is due to the page fault. The same thing would work with the invalid page after the end of the stack (that's called a "guard page").

The earlier comment really meant that the kernel is not set up to handle a page fault in a virtual memory fashion -- i.e. do a pagein and continue as if nothing had happened.

But, unfortunately, the guard page has the same problem as 8K stacks -- requires an extra 4K per thread of kernel virtual memory address space and requires 2 contiguous virtual pages. There was a time when virtual address space was in abundant supply and we just worried about real memory, but today the reverse is often true.

VM for device drivers?

Posted Sep 20, 2005 20:19 UTC (Tue) by renox (subscriber, #23785) [Link]

>There was a time when virtual address space was in abundant supply and we just worried about real memory, but today the reverse is often true.

Well, only on 32 bit CPUs. The suggestion of adding guard page seems very valid to me, even if only for 64 bit CPUs: less crash or at least 'controlled crash' are always better.

4K stacks for everyone?

Posted Sep 9, 2005 17:09 UTC (Fri) by Ross (guest, #4065) [Link]

Actually the kernel is what expands the userspace stack, not the C library. If you think about it, it is expanded even in a program which doesn't call any functions in the C library.

How exactly would you handle a stack overflow in kernel space?

Posted Sep 10, 2005 10:15 UTC (Sat) by pkolloch (subscriber, #21709) [Link]

I think it is not that easy to gracefully deal with stack overflows, even if they get detected. What do you do if the process scheduler triggers a stack overflow? Disable it?

Even for device drivers the general case gets tricky... [besides the fact that at least the device in question would have to stop working]

How exactly would you handle a stack overflow in kernel space?

Posted Sep 11, 2005 1:04 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

In most cases, an oops is easy, and significantly more graceful than what we have now. In some cases (e.g. the process scheduler), an oops isn't possible, but a panic is still more graceful than what we have.

4K stacks for everyone?

Posted Sep 11, 2005 1:15 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

> But this just begs the question...does the kernel really have no means of detecting or handling stack overflows?

It doesn't beg any question. It just raises one. "Beg" means "evade."

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds