Virtually mapped stacks 2: thread_info strikes back
Larger stacks
Posted Jun 30, 2016 8:22 UTC (Thu) by rwmj (subscriber, #5474)
Posted Jun 30, 2016 14:06 UTC (Thu) by corbet (editor, #1)
I've seen no talk of increasing the stack size, but allocating them from the vmalloc() area would certainly remove the biggest impediment to doing so.
Posted Jul 4, 2016 13:12 UTC (Mon) by nix (subscriber, #2304)
Posted Jul 4, 2016 17:35 UTC (Mon) by luto (guest, #39314)
Also, what happens if allocation fails?
Posted Jul 5, 2016 16:07 UTC (Tue) by nix (subscriber, #2304)
I do suspect that recovery from double faults is, at best, very lightly tested, if at all, so even if it works on one CPU it might well fail on another. Ah well :(
(On allocation failure, you obviously kill the process just like you would on a detected stack overflow in the soon-to-be-current world. That's no different from userspace.)
Posted Jul 5, 2016 21:55 UTC (Tue) by nybble41 (subscriber, #55106)
Userspace has the kernel to clean up things like locks that were held at the time the process was killed. Who cleans up when a kernel thread dies unexpectedly due to an asynchronous out-of-memory condition? You can't just terminate a kernel thread at any arbitrary point, but you can't resume the thread either without extending the stack.
Stack overflow is a programming error; you know how much stack you allocated and—barring unbounded recursion, variable-length arrays, or alloca()—can statically calculate how much you need (in the worst case) to complete a given function call. Stack overflow in a kernel thread could thus be treated as a bug with the potential to halt the system. An out-of-memory condition resulting from delayed allocation of stack pages is an entirely different matter. That can occur at any time, depending on the amount of memory available.
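To make that distinction concrete, here is a minimal userspace-style C sketch (illustrative only, not kernel code): the first function's worst-case stack usage is a compile-time constant, while the second's depends on a runtime value, so no static bound exists.

    #include <stddef.h>

    /* Worst-case stack usage is fixed: one 256-byte buffer plus frame
       overhead, all known at compile time. */
    void bounded(void)
    {
            char buf[256];
            (void)buf;      /* ... use buf ... */
    }

    /* A variable-length array defeats static analysis: the frame size
       depends on n, which is known only at run time. */
    void unbounded(size_t n)
    {
            char buf[n];
            (void)buf;
    }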
Posted Jul 6, 2016 20:18 UTC (Wed) by nix (subscriber, #2304)
I was hoping we could *get*, not unbounded, but arbitrarily deep recursion: in complex situations like stacks of filesystems it can be very hard to compute in advance how much space might be needed, and possibly impractical to allocate it for every task just in case. What you're saying here is that system administrators can cause kernel bugs by stacking filesystems. That's the situation we have now, but it's very far from ideal: there is no conceptual reason why you shouldn't be able to stack the things fifty or five million deep (though obviously performance would start to suck!).
Posted Jul 6, 2016 21:30 UTC (Wed) by nybble41 (subscriber, #55106)
There are other ways to achieve the same result. One approach would be to employ "trampoline" functions, though this only works if one filesystem can hand off to another without any further involvement: rather than call the other filesystem directly, you return a transformed version of the request, unwinding the stack, and a top-level iterative function passes the transformed request on to the other filesystem. When applicable, this approach can handle any number of nested filesystems in constant stack space.
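A minimal sketch of the trampoline idea, with every name hypothetical rather than a real kernel interface: each layer's handler either completes the request or retargets it at the next layer and returns, and a top-level loop redispatches it, so stack usage stays constant regardless of nesting depth.

    #include <stdbool.h>

    struct fs_request;

    /* Hypothetical filesystem layer: its handler either completes the
       request or rewrites it for the next layer down and returns. */
    struct filesystem {
            void (*handle)(struct fs_request *req);
    };

    struct fs_request {
            struct filesystem *target;   /* layer to handle it next */
            int status;                  /* result, valid once done is set */
            bool done;                   /* no further handoff needed */
    };

    /* Top-level iterative dispatcher: each handler's frame is unwound
       before the next layer runs, so nesting never consumes extra stack. */
    int fs_dispatch(struct fs_request *req)
    {
            while (!req->done)
                    req->target->handle(req);
            return req->status;
    }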
Another approach, which is more compatible with existing code, would be to explicitly extend the stack before calling into the other filesystem. Allocation failure at this point could be handled safely, like any other out-of-memory condition:
    if (!ensure_minimum_stack(8192))
            return -ENOMEM;
    nested_filesystem_call();
The problem was in extending the stack implicitly through page faults, where allocation cannot be allowed to fail, not the basic concept of having an extendable stack.
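Purely as an illustration of what such a helper might look like (ensure_minimum_stack() is the hypothetical name used above, and try_extend_stack() plus the two accessor functions are invented here; none of these is a real kernel API):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical accessors: the current stack pointer and the lowest
       currently mapped address of this thread's stack. */
    extern uintptr_t current_stack_pointer_sketch(void);
    extern uintptr_t current_stack_base_sketch(void);

    /* Hypothetical: map more pages below the stack; may fail. */
    extern bool try_extend_stack(size_t bytes);

    /* Succeed if enough room remains below the stack pointer (stacks
       grow downward); otherwise try to extend the mapping. Failure is
       an ordinary error the caller can turn into -ENOMEM, rather than
       an unrecoverable page fault. */
    static bool ensure_minimum_stack(size_t needed)
    {
            uintptr_t sp   = current_stack_pointer_sketch();
            uintptr_t base = current_stack_base_sketch();

            if (sp - base >= needed)
                    return true;
            return try_extend_stack(needed - (sp - base));
    }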