Preventing stack guard-page hopping
Posted Jun 19, 2017 19:55 UTC (Mon) by tux3 (subscriber, #101245)Parent article: Preventing stack guard-page hopping
If the only full fix is currently recompiling the world with something expensive like -fstack-check or the various sanitizers, that is awfully worrying.
I wouldn't be surprised to learn that there is a whole lot of software out there, at various levels of openness, that will happily allocate a handful of MBs on demand and that will probably never be recompiled with those options.
Posted Jun 19, 2017 20:13 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (17 responses)
Posted Jun 19, 2017 20:26 UTC (Mon)
by cpitrat (subscriber, #116459)
[Link] (15 responses)
I'm surprised a 900-line patch is only about increasing the size of the guard page. Isn't there more to it?
Posted Jun 19, 2017 21:09 UTC (Mon)
by roc (subscriber, #30627)
[Link] (2 responses)
The local privilege escalation threat assumes that the high-privilege C code is trusted, and then exploits it.
If the attacker can write high-privilege C code, you've already lost.
Posted Jun 20, 2017 9:43 UTC (Tue)
by moltonel (guest, #45207)
[Link] (1 responses)
Posted Jun 20, 2017 10:13 UTC (Tue)
by matthias (subscriber, #94967)
[Link]
If the attacker has the ability to run his own code with privileges, everything is already lost. No need for an exploit.
Posted Jun 20, 2017 6:54 UTC (Tue)
by vbabka (subscriber, #91706)
[Link]
Well, it's 900 lines of .patch file text, but the diffstat is around 300 added+deleted, so not that much.
It's large because, as explained in the commit log, simply extending the old single-guard-page code to N pages exposed many accounting issues, since the guard page(s) were part of the VMA's [start, end] addresses. The patch deletes that approach and replaces it so that the gap always lies between VMA boundaries. That means adjusting the code that checks allowed VMA placement and enlargement so that it maintains the gap if the next or previous VMA is a stack.
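To make the "gap between VMA boundaries" idea concrete, here is a minimal user-space sketch. It is not the kernel's code: the `Vma` struct, the function names, and the 256-page gap value are all assumptions chosen for illustration. It only shows the two checks the comment describes: a stack may grow down only if the gap to the previous mapping survives, and a new mapping may be placed below a stack only if it leaves the gap free.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for a kernel VMA: the range [start, end).
struct Vma {
    std::uint64_t start;
    std::uint64_t end;
    bool is_stack;  // true for a downward-growing stack mapping
};

constexpr std::uint64_t kPageSize = 4096;             // assumed page size
constexpr std::uint64_t kGuardGap = 256 * kPageSize;  // assumed gap size

// May a downward-growing stack extend its start to new_start, given the
// mapping directly below it (prev may be null if there is none)?
bool stack_can_grow_to(const Vma& stack, std::uint64_t new_start, const Vma* prev) {
    assert(stack.is_stack && new_start <= stack.start);
    if (prev == nullptr) {
        return true;
    }
    // The gap lies outside both VMAs, between prev->end and the stack start.
    return new_start >= prev->end && new_start - prev->end >= kGuardGap;
}

// May a new non-stack mapping ending at new_end be placed below next?
bool mapping_fits_below(std::uint64_t new_end, const Vma& next) {
    if (new_end > next.start) {
        return false;  // overlaps the next mapping outright
    }
    if (!next.is_stack) {
        return true;   // no gap is required in front of ordinary mappings
    }
    return next.start - new_end >= kGuardGap;
}
```

With the assumed values, a stack starting at 0x200000 above a mapping that ends at 0x20000 could grow down to 0x120000 but no further.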
Posted Jun 20, 2017 9:55 UTC (Tue)
by moltonel (guest, #45207)
[Link] (9 responses)
That's going to mess with the performance profile (allocating pages earlier than expected) and decrease total performance in case the app wasn't going to touch those pages at all.
> This would protect against remote attacks but wouldn't prevent an attacker from writing his own stack allocation for local privilege escalation.
Assuming we accept the performance hit, can we use the same technique in the kernel? Disable overcommit? Or is the kernel not aware of what the app considers its stack space?
Posted Jun 20, 2017 10:39 UTC (Tue)
by nix (subscriber, #2304)
[Link] (7 responses)
It's... not common for applications to allocate page-size structures on the heap that are not optimized out and then not use them for anything. I suppose there are functions that have big local variables and then exit early based only on their parameters, but in that case the compiler can adapt by adjusting the stack only after the early exits, if this is really significant (which I very much doubt).
Posted Jun 20, 2017 15:15 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link] (6 responses)
In one project I found an innocuous-looking state structure that turned out to have ~5MB of unused bytes in the middle, buried under a pyramid of macro expansion, arrays, nested members, and unreadable coding style. The code did use all the other members in the struct, on both sides of the hole.
Also it's fairly common in userland to do IO to a buffer on the stack, where the buffer is huge and the IO is tiny.
Posted Jun 20, 2017 16:30 UTC (Tue)
by gutschke (subscriber, #27910)
[Link] (5 responses)
Do you really commonly see programs allocate many hundreds of kilobytes, if not many megabytes, on the stack? That's not a pattern that I have encountered frequently. Buffers this large are more commonly allocated on the heap.
I am not saying it doesn't happen. Anything stupid that you can think of, somebody else probably thought of before. But common? Hopefully not.
Posted Jun 21, 2017 11:14 UTC (Wed)
by PaXTeam (guest, #24616)
[Link]
Posted Jun 21, 2017 11:24 UTC (Wed)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Jun 21, 2017 14:57 UTC (Wed)
by zblaxell (subscriber, #26385)
[Link] (1 responses)
On the other hand, if a function is being called in a loop then the probes keep happening over and over even though the page faults don't, so the probing gets expensive.
For programs that handle toxic data there might not be a quick and easy solution--they might just have to suck up the cost of doing probes all the time, or use other techniques (e.g. constant-stack algorithm proofs, coding standards forbidding alloca() and sparse structures, etc.) to make sure stack overflows don't happen.
Since changes to alloca require recompiling the program, it's up to individual applications to make the performance/security tradeoff anyway. Isn't there already a compiler option to do this?
Posted Jun 22, 2017 22:37 UTC (Thu)
by mikemol (guest, #83507)
[Link]
LTO will need to be careful to let these considerations bubble up to the final binary, however.
Posted Oct 3, 2019 13:18 UTC (Thu)
by ychevali (guest, #134753)
[Link]
Posted Jun 26, 2017 9:25 UTC (Mon)
by anton (subscriber, #25547)
[Link]
Posted Jun 26, 2017 9:09 UTC (Mon)
by anton (subscriber, #25547)
[Link]
Posted Jun 20, 2017 14:50 UTC (Tue)
by BenHutchings (subscriber, #37955)
[Link]
alloca() can't be implemented as a real function, so it's only "in" glibc in the sense that the definition is in a glibc header. Further, that definition just defers to the compiler's pseudo-function __builtin_alloca(). So even rebuilding against an updated glibc isn't enough to fix this. glibc has been updated to make its own use of alloca() safer, though.
Posted Jun 19, 2017 21:01 UTC (Mon)
by roc (subscriber, #30627)
[Link] (5 responses)
Posted Jun 19, 2017 21:02 UTC (Mon)
by roc (subscriber, #30627)
[Link]
Posted Jun 19, 2017 21:45 UTC (Mon)
by roc (subscriber, #30627)
[Link] (3 responses)
Posted Jun 19, 2017 23:31 UTC (Mon)
by nix (subscriber, #2304)
[Link]
Posted Jun 20, 2017 18:54 UTC (Tue)
by dd9jn (✭ supporter ✭, #4459)
[Link] (1 responses)
Posted Jun 22, 2017 22:02 UTC (Thu)
by cesarb (subscriber, #6266)
[Link]
Posted Jun 19, 2017 22:31 UTC (Mon)
by zblaxell (subscriber, #26385)
[Link] (9 responses)
In userland, if alloca() wants more than a page, it can run a heavier stack-smashing check, like probing each page of the allocated area in stack-growth order, or checking some data in the heap about the current thread's stack limits. Not doing that in the kernel is perhaps understandable due to the cost, but the capability should be there for those who need it.
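A rough sketch of what "probing each page in stack-growth order" could look like follows. The function name and the idea of passing the region explicitly are assumptions for illustration. A real implementation would live in the compiler's alloca expansion and operate on the stack pointer itself, whereas this version can be pointed at any buffer. It reads one byte per page-sized stride from the highest address down, so that on a downward-growing stack no two consecutive touches are further apart than one page and a guard page cannot be stepped over silently.

```cpp
#include <cassert>
#include <cstddef>

// Touch one byte per page-sized stride of [base, base + len), starting just
// below the top of the region and walking toward lower addresses.
// Returns the number of probes performed.
std::size_t probe_pages_downward(const volatile char* base, std::size_t len,
                                 std::size_t page_size) {
    assert(page_size != 0);
    std::size_t probes = 0;
    std::size_t offset = len;
    while (offset > 0) {
        // Step down by one page, clamping at the start of the region.
        offset = (offset > page_size) ? offset - page_size : 0;
        char sink = base[offset];  // volatile read: faults if this is a guard page
        (void)sink;
        ++probes;
    }
    return probes;
}
```

A read is used rather than a write so that the probe does not clobber the buffer. That relies on the guard region being unreadable, which is an assumption about how the guard is mapped.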
I've occasionally wondered what would happen if stacks were not accessible to other threads in the same process (assuming the VM context thrashing involved was magically zero cost, which probably pushes this paragraph into the realm of wishful thinking). Obviously it would break some existing programs, but it smells like bad practice in general (I see student programmers pass pointers to ephemeral variables from the caller's stack to threads all the time, with immediately disastrous results). There might be some simple heuristic (e.g. if thread A creates or joins thread B, let thread B access thread A's stack in case thread B has been given a pointer to a result A needs to store there) that's good enough for current defensible program behavior.
Posted Jun 19, 2017 23:26 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Jun 20, 2017 1:40 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
That's pretty much how C++11 async functions work, and should be covered by the heuristic exception for "thread A creates thread B".
It wouldn't work if there was a persistent worker thread pool (i.e. the functions are executed by previously existing threads that continue to exist after the result is computed, so there is no creator/created or join relationship). It might be possible to infer data dependencies from mutex locks or higher-level objects (promise/future pairs), but maybe there are too many false positives. Or one could mark worker pool threads differently (e.g. with some new pthread_attr) with respect to access to other threads' stacks.
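The create/join heuristic can be modeled as a small bookkeeping class. Everything here is hypothetical: the class name, the integer thread ids, and the methods are invented for illustration, and nothing like this exists in the kernel or in pthreads. The sketch only encodes the rule as stated in this thread: if thread A creates or joins thread B, then B may access A's stack.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical model of the heuristic; thread ids are plain ints here.
class StackAccessPolicy {
public:
    // Thread `creator` started thread `created`: the child may touch the
    // creator's stack (e.g. to store a result there).
    void note_create(int creator, int created) { allowed_.emplace(created, creator); }

    // Thread `joiner` joins thread `joined`: same permission direction.
    void note_join(int joiner, int joined) { allowed_.emplace(joined, joiner); }

    // May `accessor` touch the stack owned by `owner`?
    bool may_access(int accessor, int owner) const {
        return accessor == owner ||
               allowed_.count(std::make_pair(accessor, owner)) != 0;
    }

private:
    std::set<std::pair<int, int>> allowed_;  // (accessor, stack owner) pairs
};
```

A pre-existing pool worker has no entry for the submitting thread, so `may_access` denies it. That is exactly the thread-pool case described above as not covered by the heuristic.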
Posted Jun 19, 2017 23:32 UTC (Mon)
by excors (subscriber, #95769)
[Link]
I think that would break reasonable code like:
std::atomic_int n;
which passes a pointer to n (on the current thread's stack) to a bunch of worker threads (that probably weren't created by this thread).
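For reference, a self-contained version of that pattern might look like the sketch below. `run_in_worker_threads_and_wait_for_them_all` is the hypothetical helper named in the snippet, not a real library function, and `count_with_stack_counter` is an invented wrapper. Note one simplification: this helper spawns fresh threads, whereas the comment is concerned with pool workers that were not created by the calling thread. The point it shares with the comment is that every worker writes to `n`, which lives on the caller's stack.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: run fn once on each of `iters` threads and wait for all.
// A real pool would hand fn to already-running workers instead.
template <typename Fn>
void run_in_worker_threads_and_wait_for_them_all(int iters, Fn fn) {
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(iters));
    for (int i = 0; i < iters; ++i) {
        workers.emplace_back(fn);
    }
    for (auto& w : workers) {
        w.join();
    }
}

int count_with_stack_counter(int iters) {
    std::atomic_int n{0};  // lives on the calling thread's stack
    // Each worker increments n through a reference into this stack frame.
    run_in_worker_threads_and_wait_for_them_all(iters, [&n] { n++; });
    return n.load();
}
```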
Posted Jun 19, 2017 23:36 UTC (Mon)
by nix (subscriber, #2304)
[Link] (3 responses)
I alternate between thinking this scheme is wonderful and should be widely emulated, and thinking it is insane and its authors should be punished by being forced to debug programs written this way (but then, they already have been).
Posted Jun 20, 2017 15:08 UTC (Tue)
by nybble41 (subscriber, #55106)
[Link] (2 responses)
That is... diabolical. Genius, but diabolical. A similar concept employed by Chicken Scheme is to start out the same way, using CPS and allocating on the C stack, but then after copying the live data to the heap just perform a longjmp() to unwind back to a trampoline function at the top of the original stack. That seems slightly saner than abusing alloca() to set the stack pointer.
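A toy model of the longjmp()-to-a-trampoline idea is sketched below. It is not Chicken Scheme's implementation: the depth counter stands in for "the C stack is nearly full", the `g_saved` struct stands in for live data copied to the heap, and all names are invented. Each step calls onward and never returns; when the depth limit is hit, the state is saved and longjmp() discards the whole stack at once.

```cpp
#include <cassert>
#include <csetjmp>

namespace {

constexpr int kMaxDepth = 1000;  // stand-in for "the C stack is nearly full"

struct Saved {
    long acc;
    long remaining;
};

std::jmp_buf g_trampoline;  // re-entry point near the top of the stack
Saved g_saved{0, 0};        // stand-in for live data evacuated to the heap
int g_unwinds = 0;          // how many times the stack was thrown away

// Continuation-passing step: it never returns to its caller. It either
// calls onward or longjmp()s back to the trampoline.
[[noreturn]] void step(long acc, long remaining, int depth) {
    if (remaining == 0) {
        g_saved = {acc, 0};
        std::longjmp(g_trampoline, 2);  // computation finished
    }
    if (depth >= kMaxDepth) {
        g_saved = {acc, remaining};     // "copy the live data out"
        ++g_unwinds;
        std::longjmp(g_trampoline, 1);  // unwind every frame in one jump
    }
    step(acc + remaining, remaining - 1, depth + 1);
}

}  // namespace

// Sums 1..n using the trampoline; the stack depth stays bounded by kMaxDepth.
long sum_to(long n) {
    assert(n >= 0);
    g_saved = {0, n};
    g_unwinds = 0;
    if (setjmp(g_trampoline) == 2) {
        return g_saved.acc;
    }
    // First entry, or re-entry after an unwind: continue from the saved state.
    step(g_saved.acc, g_saved.remaining, 0);
}
```

The state is kept in globals rather than locals because automatic variables modified between setjmp() and longjmp() have indeterminate values unless declared volatile.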
Posted Jun 21, 2017 11:26 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Jun 21, 2017 14:41 UTC (Wed)
by zblaxell (subscriber, #26385)
[Link]
...like some eager tools maintainer implementing alloca() parameter sanity checks, perhaps? ;)
Posted Jun 20, 2017 13:10 UTC (Tue)
by niner (subscriber, #26151)
[Link] (1 responses)
Posted Jun 20, 2017 16:20 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
It seems to me there's more fundamental problems to be solved before this one. How does a garbage collecting thread handle ordinary race conditions when accessing data on other thread stacks? Invasive locking? Indirect references through forwarding objects?
I'm not sure I like the idea of solving that case, largely because the difference between "frees approximately the right memory" and "frees exactly the right memory" can be pretty huge when there are adversaries throwing pointy things into your stack and heap.
I don't think that that's a significant issue, but anyway: You just need to read the byte (the guard page is not readable, is it?). So all the not-yet-used stack pages can be the same page containing zeroes (which also means that the same cache line will be used for all these reads in a physically-tagged (i.e., normal these days) cache). Only when it is used for real, a physical page is allocated.
A better change is to modify alloca() in libc to touch at least one byte on each allocated page.
> That's going to mess with the performance profile (allocating pages earlier than expected) and decrease total performance in case the app wasn't going to touch those pages at all.
> This would protect against remote attacks but wouldn't prevent an attacker from writing his own stack allocation for local privilege escalation.
I don't think that preventing this attack scenario prevents any halfway-competent attack. If the attacker can write his own stack allocation, he can write it to jump over guard regions of any size; actually, he can put the memory writes to the area below the stack directly in his otherwise-regular stack-allocation code. In other words: if you allow the attacker to execute his code in a setting that can escalate privileges, you are already owned, guard page or not.
A fairly common practice is to allocate some data, launch several worker threads to compute its parts, and then join all the threads to get the final result. It's not uncommon for that data, or parts of it, to be allocated on the stack.
run_in_worker_threads_and_wait_for_them_all(iters, [&n] { n++; });
One obvious implication of having threads in a common address space and (naive) alloca() at the same time is that you can guide one thread's stack into another thread's address space no matter how far apart they are in memory. I learned this the hard way in 1998 as I was debugging a Linux program that was doing this accidentally across almost 2MB-wide stack gaps.
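The arithmetic behind that kind of hop is simple enough to sketch. The function below is illustrative only: the names, the enum, and the model of a naive alloca() as "subtract the size from the stack pointer and touch nothing in between" are assumptions. It classifies where the stack pointer ends up relative to the bottom of the current stack and a gap of a given size below it.

```cpp
#include <cassert>
#include <cstdint>

enum class Landing {
    SameStack,  // still inside the current stack mapping
    GuardGap,   // inside the gap: the first access there faults
    BeyondGap   // past the gap: possibly inside someone else's memory
};

// Where does the stack pointer land after a naive "sp -= size"?
// stack_low is the lowest mapped address of the current stack;
// gap is the size of the unmapped region directly below it.
Landing classify_alloca(std::uint64_t sp, std::uint64_t size,
                        std::uint64_t stack_low, std::uint64_t gap) {
    assert(sp >= stack_low && stack_low >= gap);
    const std::uint64_t room = sp - stack_low;  // bytes left in this stack
    if (size <= room) {
        return Landing::SameStack;
    }
    if (size - room <= gap) {
        return Landing::GuardGap;
    }
    return Landing::BeyondGap;
}
```

With a single 4096-byte guard page, any request more than one page larger than the remaining room lands beyond the guard. A larger gap raises that threshold but does not remove it, which is why a gap of nearly 2MB can still be crossed by a big enough allocation.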
Indeed. The "Cheney on the MTA" paper describes a remarkable way of using this sort of alloca() abuse to implement a copying garbage collector using only the C stack: you write your C program in continuation-passing style, with GCed data in functions that never return but only call on to others that do the same, and then when you want to do a GC your collector copies the relevant data into a new "stack" on the heap and alloca()s to it (finding the right alloca() value via trivial pointer arithmetic from a variable on the local stack frame), then free()s the old stack.