4K stacks for everyone?
The 2.6.6 kernel contained, among many other things, a patch implementing
single-page (4K) kernel stacks on the x86 architecture. Cutting the kernel
stack size in half reduces the kernel's per-process overhead and eliminates
a major consumer of multi-page allocations. So running with the smaller
stack size is good for kernel performance and robustness. The only problem
has been certain code paths in the kernel which require more stack space
than that. Overrunning the kernel stack will corrupt kernel memory and
lead to unfortunate behavior in a hurry.
Over time, however, most of these problems have been taken care of, to the point that Adrian Bunk recently asked: is it time to eliminate the 8K stack option entirely for x86? Some distributors (e.g. Fedora) have been shipping kernels with 4K stacks for some time without ill effect. What problems might result, Adrian asked, if 4K stacks became the only option for everyone?
It turns out that there are a few problems still. For example, the reiser4 filesystem still cannot work with 4K stacks. There is, however, a patch in the works which should take care of that particular problem.
A more complicated issue comes up in certain complex storage configurations. If a system administrator builds a fancy set of RAID volumes involving the device mapper, network filesystems, etc., the path between the decision to write a block and the actual issuance of I/O can get quite long. This situation can lead to stack overflows at strange and unpredictable times.
What happens here is that a filesystem will decide to write a block, which ends up creating a call to the relevant block driver's make_request() function (or the block subsystem's generic version of it). For stacked block devices, such as a RAID volume, that I/O request will be transformed into a new request for a different device, resulting in a new, recursive make_request() call. Once a few layers have been accumulated, the call path gets deep, and the stack eventually runs out.
Neil Brown has posted a patch to resolve this problem by serializing recursive make_request() calls. With this patch, the kernel keeps an explicit stack of bio structures needing submission, and only processes one at a time in any given task. This patch will truncate the deep call paths, and should resolve the problem.
That leaves one other problem outstanding: NDISwrapper. This code is a glue layer which allows Windows network drivers to be loaded into a Linux kernel; it is used by people who have network cards which are not otherwise supported by Linux. NDIS drivers, it seems, require larger stacks. Since they are closed-source drivers written for an operating system which makes larger stacks available, there is little chance of fixing them. So a few options have been discussed:
No consensus solution seems to have emerged as of this writing. There is time, anyway; removing the 8K stack option is not a particularly urgent task, and certainly will not be considered for 2.6.14.
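The serialization of recursive make_request() calls described in the article can be illustrated with a small user-space sketch. This is not Neil Brown's patch; `struct bio` and `struct device` here are simplified, hypothetical stand-ins, `submit()` is an invented name for the generic submission entry point, and a single global pending stack stands in for what would be per-task state. The `depth` counters exist only to show that nesting stays flat.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's structures. */
struct device;

struct bio {
    struct device *dev;     /* device this request is currently aimed at */
    struct bio *next;       /* link in the pending stack */
};

struct device {
    struct device *lower;   /* next layer down, or NULL for a real disk */
    int requests_seen;
};

static struct bio *pending;     /* explicit stack of bios awaiting submission */
static int submit_active;       /* nonzero while submit() is draining */
static int depth, max_depth;    /* instrumentation for the example only */

void submit(struct bio *bio);

/* A stacked device remaps the bio to the layer below and resubmits it. */
void make_request(struct device *dev, struct bio *bio)
{
    if (++depth > max_depth)
        max_depth = depth;
    dev->requests_seen++;
    if (dev->lower) {
        bio->dev = dev->lower;
        submit(bio);        /* would nest one level deeper if not serialized */
    }
    depth--;
}

void submit(struct bio *bio)
{
    assert(bio != NULL && bio->dev != NULL);

    if (submit_active) {
        /* Called from inside make_request(): push the bio and return. */
        bio->next = pending;
        pending = bio;
        return;
    }

    submit_active = 1;
    while (bio) {
        make_request(bio->dev, bio);
        bio = pending;              /* pop the next pending bio, if any */
        if (bio)
            pending = bio->next;
    }
    submit_active = 0;
}
```

With four stacked devices, a single `submit()` call visits every layer once, yet `max_depth` stays at 1: each layer's make_request() returns before the next one runs, so the call path no longer grows with the number of layers.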
4K stacks for everyone? Posted Sep 8, 2005 5:43 UTC (Thu) by wtogami (subscriber, #32325) [Link] NDISWrapper actually requires 16K stacks for reliable operation. 8K isn't always enough.
NDISWrapper and 4K stacks Posted Sep 9, 2005 18:57 UTC (Fri) by Duncan (guest, #6647) [Link] NDISWrapper moving to user-space seems the clear-case best choice, here. Not only does it solve its problem with 4k kernel stacks, it also moves not only proprietary, but proprietary MSWormOS even, drivers out of kernel space into more sanely protected userspace. This sounds like exactly the sort of solution the kernel devs would prefer, because it gets those "black-box binary-only things" out of the kernel, greatly simplifying and securing things. The cost is of course speed... Mode transfers between user space and kernel space take time, so proprietary-only MSWormOS based drivers will be slower. Somehow, I don't see that as being a big issue either, since the view will be that it encourages transferring to more Linux-friendly hardware, always seen as a good thing. Duncan
NDISWrapper and 4K stacks Posted Sep 15, 2005 15:36 UTC (Thu) by Luyseyal (guest, #15693) [Link] Agreed. Most ndiswrapper users seem to be wireless card users who won't be expecting stellar performance in any case.
-l
4K stacks for everyone? Posted Sep 8, 2005 7:02 UTC (Thu) by jwb (subscriber, #15467) [Link] A friend relayed this excellent suggestion. Instead of causing great pain among users of ndiswrapper, raid, cryptoloop, xfs, nfs, lustre, and a great many other kernel features, why not accelerate the move to 16KiB soft pages on x86? Then the stack could be kept in a single softpage, with the last 4KiB hardware backing page unallocated. That leaves 12KiB in the stack, and a reliable means of determining when the stack overflows. In addition you get all the other efficiency benefits of larger pages.
The current proposal is sheer madness. The developers have NO IDEA what the maximum kernel stack usage is, and no way of determining it. They who are proposing mandatory 4KiB stacks are just crossing their fingers and saying "fuckit, it seems to run on my laptop." That's not a very modern method of software development, especially when the only beneficiaries are a couple of large [elided] customers with over-threaded Java apps.
Conservative Automatic Stack Size Check Posted Sep 8, 2005 8:27 UTC (Thu) by pkolloch (subscriber, #21709) [Link] > The current proposal is sheer madness. The developers have NO IDEA what the maximum kernel stack usage is, and no way of determining it.
Then the current state is desperate as well: They don't have a clue if the current stack size limit is sufficient. Your dynamic stack size check would be a step in the right direction, but:
Most stack allocation should be easily statically determinable (with only small conservative overapproximations). Things like alloca (if there is a kernel equivalent) or any other means which change the stack size by a dynamically computed amount are more tricky. However, these should be avoided anyways if stack conservation has such a priority.
At least conceptually, computing a call graph with conservative stack usage annotations should be fairly easy (using existing code in GCC). In the absence of recursion, one could easily determine the largest stack size in use. And again, if you value the stack size so much, you should not use recursion. (well, there might be valid use cases with a known maximal recursion depth of 3 or so which might be hard to check statically for machines and even if that is the case, you will need something slightly smarter than plain call graphs.)
Without such an automatic check, I pretty much agree with you.
[Disclaimer: I have basically no clue about the kernel source except of what I read occasionally on this page.]
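The call-graph idea in this comment can be made concrete with a toy example. The sketch below assumes a hypothetical, hand-built graph in which each node carries its own frame size; it does not use GCC's data structures or any real kernel tooling. It walks the graph depth-first, returns the largest sum of frame sizes along any call path, and reports `UNBOUNDED` as soon as it meets recursion.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4
#define UNBOUNDED   (-1L)

/* Hypothetical call-graph node: one function and its own frame size. */
struct func {
    const char *name;
    long frame_bytes;                    /* stack used by this function alone */
    struct func *callees[MAX_CALLEES];   /* NULL-terminated unless full */
    int on_path;                         /* set while the walk is inside this node */
};

/*
 * Worst-case stack bytes from entry into f down to the deepest leaf.
 * Returns UNBOUNDED if a cycle (recursion) is reachable from f.
 */
long worst_case_stack(struct func *f)
{
    long deepest = 0;

    assert(f != NULL);
    if (f->on_path)
        return UNBOUNDED;                /* recursion: no static bound */

    f->on_path = 1;
    for (size_t i = 0; i < MAX_CALLEES && f->callees[i]; i++) {
        long below = worst_case_stack(f->callees[i]);

        if (below == UNBOUNDED) {
            f->on_path = 0;
            return UNBOUNDED;
        }
        if (below > deepest)
            deepest = below;
    }
    f->on_path = 0;

    return f->frame_bytes + deepest;
}
```

The sketch revisits shared callees on every path, so a real tool would cache per-function results. It also takes the frame sizes and the edges as given, which is the hard part in practice: dynamically sized frames and indirect calls are not covered here.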
Conservative Automatic Stack Size Check Posted Sep 8, 2005 9:12 UTC (Thu) by nix (subscriber, #2304) [Link] > Most stack allocation should be easily statically determinable
Some static determination is possible but not easy and not reliable (nor can it ever be reliable in the general case), and the error bars are large. See Olivier Hainque's paper in the GCC 2005 Summit proceedings for a pile of info on this. TBH I'd expect that kernel developers' own hunches would be as reliable.
Conservative Automatic Stack Size Check Posted Sep 8, 2005 9:53 UTC (Thu) by pkolloch (subscriber, #21709) [Link] After a moderate amount of web searching, I could find the abstract of the presentation, but not the paper itself. Any pointers?
BTW I did not say that it is "easy" for the general case, but for the kernel without dynamic stack allocations and recursion. And OK, I was probably naive and will agree that it is probably also difficult for this special case ;) But both feasible and desirable. I hope Olivier Hainque will be successful in his quest and his work will be applied to the kernel.
> TBH I'd expect that kernel developers' own hunches would be as reliable.
And predict which variables are being stored in registers and which on the stack and considering all call paths? No, I think humans would miss a lot of special cases on that one. Additionally, not anyone would actually endeavor to do this for anything but some core functions. Am I wrong?
Conservative Automatic Stack Size Check Posted Sep 8, 2005 11:42 UTC (Thu) by farnz (subscriber, #17727) [Link] The paper starts on page 99 of the proceedings PDF. I've not found it split separately, and the PDF file is quite large (around 1.7MB).
Conservative Automatic Stack Size Check Posted Sep 9, 2005 1:58 UTC (Fri) by sethml (subscriber, #8471) [Link] Clever idea, but you missed a case that's hard to deal with: calling through function pointers. The kernel uses function pointers extensively, especially for device drivers. I suspect the case mentioned involving RAID involves calling through quite a few levels of function pointers. Figuring out the maximum possible call stack depth, even very conservatively, is probably pretty difficult, and the conservative answer is probably "infinite" because there are pathways you could construct that would recurse, even if that never happens in practice.
Conservative Automatic Stack Size Check Posted Sep 9, 2005 9:29 UTC (Fri) by pkolloch (subscriber, #21709) [Link] hmmm, you are right, I knew I had been naive, but I couldn't see what I missed.
From what I saw of the VFS, it's a shame that it is not expressed in an object-oriented fashion. That could at least limit the number of candidates. Maybe one could provide some annotations?
But I can well imagine that especially concepts such as unionfs which wrap other file systems could in principle be wrapped around each other infinitely. You would have to make up some clever notation to tell the stack analyzer that this really isn't possible. (If there is even such a check ;) ) Or is it done in some clever fashion that the wrapped and the wrapper are not called in a nested fashion, but in some kind of chaining way for exactly the purpose of saving stack space?
[Disclaimer: Again, I have no real clue about the kernel source, so I hope my assumptions are not totally off the beat.]
Conservative Automatic Stack Size Check Posted Sep 10, 2005 21:03 UTC (Sat) by joern (subscriber, #22392) [Link] FYI: Function pointers are not that hard to follow. See http://wh.fh-wedel.de/~joern/quality.pdf
4K stacks for everyone? Posted Nov 16, 2005 10:08 UTC (Wed) by DiegoCG (guest, #9198) [Link] I strongly disagree - this patch has been in fedora core for 2 years. Developers are not pushing this out of their ass. The total stack space is actually bigger due to interrupt stacks. Also, it improves scalability.
Apparently the main issues (Xfs, etc) have been fixed so I'd say that 4KB stacks have actually improved code quality....
Larger pages are interesting but they also bring more fragmentation, also lots of not-so-old x86 cpus can't handle pages > 4kb very well AFAIK
Binary graphics drivers? Posted Sep 8, 2005 10:57 UTC (Thu) by NAR (subscriber, #1313) [Link] I seem to remember that there was a warning in the kernel configuration option about breaking binary-only modules such as NVidia and ATI drivers if the stack is only 4k. Was this problem fixed?
Binary graphics drivers? Posted Sep 8, 2005 13:00 UTC (Thu) by sbergman27 (subscriber, #10767) [Link] Although I am currently running x86_64, which uses 8k stacks, I have run the same machine on i386 Fedora (4k stacks) with the NVidia drivers with no problem.
Binary graphics drivers? Posted Sep 8, 2005 14:27 UTC (Thu) by alspnost (subscriber, #2763) [Link] Yes - NVidia, at least, fixed their driver to work with 4k stacks back at the time. I used it for many months on my old x86 system, before switching to AMD64.
Binary graphics drivers? Posted Sep 8, 2005 18:25 UTC (Thu) by flewellyn (subscriber, #5047) [Link] As I recall, the nVidia graphics drivers were fixed to work with 4k stacks at around the time 2.6.7 came out. Either .7 or .8. It's been awhile, though. I have been running with 4k stacks and the nVidia driver for several kernel versions with no incident.
Conexant drivers from Linuxant also requires >4K Posted Sep 8, 2005 17:56 UTC (Thu) by astrand (subscriber, #4908) [Link] On my laptop with the ICH6 chipset, which is running FC4, I'm using the HSF softmodem driver from Linuxant. After many kernel panics, I found out that this driver doesn't work with 4K stacks. I replaced the kernel with a version from Linuxant and now things work, but some drawbacks remain:
* I won't receive any kernel updates via "yum"
* I cannot use the NTFS drivers from Livna.
To sum up, this problem has been boring and frustrating. I don't care if the kernel is a few percent slower, as long as things work...
Conexant drivers from Linuxant also requires >4K Posted Sep 8, 2005 18:58 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link]
Well, it's not really a question of speed. Many kernel updates fix security issues. Custom kernels will have to be rebuilt every time. For filesystems, FUSE has recently been merged in the upstream kernel and will be included in the 2.6.14 version. So it might be possible to use a pure user space solution which wouldn't potentially break with every new kernel update.
Rahul
Conexant drivers from Linuxant also requires >4K Posted Sep 15, 2005 5:25 UTC (Thu) by thleemhuis (guest, #32469) [Link] > * I cannot use the NTFS drivers from Livna.
You can easily rebuild the kernel-module-ntfs.srpm from livna. See:
4K stacks for everyone? Posted Sep 9, 2005 1:06 UTC (Fri) by mcelrath (guest, #8094) [Link] I noticed the other day while configuring my 2.6.13 kernel that there is now an experimental option to use register arguments for function calls. I imagine this should seriously reduce stack usage. Perhaps a default 4k stack should require register arguments.
But this just begs the question...does the kernel really have no means of detecting or handling stack overflows? That just seems like bad design. Can't the stack be set up so that if it is over-written it will trigger a page fault, and the kernel could handle it? gcc/libc can allocate more stack pages for userspace programs if needed, but why not the kernel?
VM for device drivers? Posted Sep 9, 2005 4:36 UTC (Fri) by xoddam (subscriber, #2322) [Link] > gcc/libc can allocate more stack pages for userspace programs if needed, but why not the kernel?
A defining characteristic of kernel-space programming is that you don't get the benefits of implicit memory protection. Everything has to be done explicitly by the kernel itself.
It's possible in principle to give kernel-space tasks virtual memory support, but it would open a big can of worms. If you want deep recursion, do it in userspace.
As things stand the kernel-space page map is never changed implicitly, and rarely explicitly. The prospect of giving kernel tasks their own vm maps with holes to fit new pages which are to be faulted in (from where?) when the stack overflows is nightmarish!
Performance and maintainability are much, much more important than a growable stack. If stack usage in kernel space isn't demonstrably finite then the code is broken. The best solution is explicit management of the resources (eg. using a queue) so that the stack size ceases to be an issue.
VM for device drivers? Posted Sep 11, 2005 1:14 UTC (Sun) by giraffedata (subscriber, #1954) [Link] The kernel uses virtual memory address translation today, and that's all that's required to do this kind of stack overflow protection. Make the stack one page and the page immediately after it invalid. A process tries to overflow the stack, and it gets oopsed.
However, I believe the point of 4K stacks is that there is a dirth of kernel virtual memory address space, so having 4K of usable addresses plus 4K of unusable obviates the 4K stack change. To be robust like other OSes in this area, we'd have to go to some complex system with multiple kernel address spaces, and that probably would bring with it a pageable kernel.
VM for device drivers? Posted Sep 13, 2005 1:17 UTC (Tue) by mcelrath (guest, #8094) [Link] Dirth isn't a word. The word you wanted is dearth. (back at cha)
Anyway, one really needs an oops or panic if the kernel stack is overflowed. A previous poster said page faults in kernel space aren't detectable. You are proposing a 4k page at the end of the stack to check if the stack has overflowed. If there are no page faults in kernel space, then one has to check the stack-overflow page on every process switch? That seems expensive.
Then on the other hand this overflow page can probably be only one physical page, shared among all processes (and an oops or panic if ANY process writes it), and if a page fault isn't possible then the task switch could just do an
if(stack_overflow[0] != STACK_OVERFLOW_PATTERN) { oops }
e.g. just check the first byte. So overall it costs 1 cmp per task switch and 4k. Seems much better than silent stack overflows, and the possible security flaws that might come from them too...
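A minimal sketch of the pattern check proposed above, written as ordinary user-space C. It differs from the comment in one respect, which is an assumption made here for simplicity: instead of a shared overflow page, each task carries a canary word at the lowest address of its own stack area (x86 stacks grow toward lower addresses). `struct task`, `task_init()`, `stack_overflowed()` and the pattern value are all hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES            4096
#define STACK_OVERFLOW_PATTERN 0x5AC0FFEEu   /* arbitrary illustrative value */

/*
 * Hypothetical task with its stack area. The canary overlays the lowest
 * word of the stack, which is the last word a growing stack would reach.
 */
struct task {
    union {
        uint32_t canary;
        unsigned char bytes[STACK_BYTES];
    } stack;
};

void task_init(struct task *t)
{
    assert(t != NULL);
    memset(t->stack.bytes, 0, sizeof t->stack.bytes);
    t->stack.canary = STACK_OVERFLOW_PATTERN;
}

/* Meant to run on every task switch: one load and one compare. */
int stack_overflowed(const struct task *t)
{
    return t->stack.canary != STACK_OVERFLOW_PATTERN;
}
```

A check like this only notices an overflow after the fact, at the next switch, and only if the overrun actually wrote over the canary word. Whatever lies below the stack may already be corrupted by then, so it turns a silent failure into a reported one rather than preventing it.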
VM for device drivers? Posted Sep 13, 2005 16:07 UTC (Tue) by giraffedata (subscriber, #1954) [Link] The kernel can and does detect page faults in kernel space. When the kernel tries to dereference a null pointer, the oops you see is due to the page fault. The same thing would work with the invalid page after the end of the stack (that's called a "guard page").
The earlier comment really meant that the kernel is not set up to handle a page fault in a virtual memory fashion -- i.e. do a pagein and continue as if nothing had happened. But, unfortunately, the guard page has the same problem as 8K stacks -- requires an extra 4K per thread of kernel virtual memory address space and requires 2 contiguous virtual pages. There was a time when virtual address space was in abundant supply and we just worried about real memory, but today the reverse is often true.
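The guard-page layout can be tried out in user space with the POSIX `mmap()` and `mprotect()` calls. The sketch below is only an analogy for the kernel case discussed here: it maps two contiguous pages and makes the lower one inaccessible, so that a stack growing downward out of the upper page would fault instead of silently overwriting memory. `struct guarded_stack` and the two functions are hypothetical names for this example.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for a one-page stack with a guard page below it. */
struct guarded_stack {
    void  *base;          /* start of the mapping, i.e. the guard page */
    void  *usable;        /* lowest usable address, just above the guard */
    size_t usable_bytes;  /* size of the usable region (one page) */
    size_t total_bytes;   /* size of the whole mapping (two pages) */
};

/* Returns 0 on success, -1 on failure. */
int guarded_stack_alloc(struct guarded_stack *gs)
{
    long page = sysconf(_SC_PAGESIZE);
    void *base;

    assert(gs != NULL);
    if (page <= 0)
        return -1;

    base = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Any access to the lowest page now faults instead of corrupting memory. */
    if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
        munmap(base, 2 * (size_t)page);
        return -1;
    }

    gs->base = base;
    gs->usable = (char *)base + page;
    gs->usable_bytes = (size_t)page;
    gs->total_bytes = 2 * (size_t)page;
    return 0;
}

void guarded_stack_free(struct guarded_stack *gs)
{
    munmap(gs->base, gs->total_bytes);
}
```

The cost pointed out in the comment is visible in the numbers: `total_bytes` is twice `usable_bytes`, so every stack consumes two contiguous pages of address space even though only one is ever backed by usable memory.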
VM for device drivers? Posted Sep 20, 2005 20:19 UTC (Tue) by renox (guest, #23785) [Link] >There was a time when virtual address space was in abundant supply and we just worried about real memory, but today the reverse is often true.
Well, only on 32 bit CPUs. The suggestion of adding a guard page seems very valid to me, even if only for 64 bit CPUs: fewer crashes, or at least 'controlled' crashes, are always better.
4K stacks for everyone? Posted Sep 9, 2005 17:09 UTC (Fri) by Ross (subscriber, #4065) [Link] Actually the kernel is what expands the userspace stack, not the C library. If you think about it, it is expanded even in a program which doesn't call any functions in the C library.
How exactly would you handle a stack overflow in kernel space? Posted Sep 10, 2005 10:15 UTC (Sat) by pkolloch (subscriber, #21709) [Link] I think it is not that easy to gracefully deal with stack overflows, even if they get detected. What do you do, if the process scheduler triggers a stack overflow? Disable it?
Even for device drivers the general case gets tricky... [besides the fact that at least the device in question would have to stop working]
How exactly would you handle a stack overflow in kernel space? Posted Sep 11, 2005 1:04 UTC (Sun) by giraffedata (subscriber, #1954) [Link] In most cases, an oops is easy, and significantly more graceful than what we have now. In some cases (e.g. process scheduler), oops isn't possible, but panic is still more graceful than what we have.
4K stacks for everyone? Posted Sep 11, 2005 1:15 UTC (Sun) by giraffedata (subscriber, #1954) [Link] > But this just begs the question...does the kernel really have no means of detecting or handling stack overflows?
It doesn't beg any question. It just raises one. "Beg" means "evade."
Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds