Rethinking the Stack Clash fix

By Jonathan Corbet
July 13, 2017

It has been nearly one month since the Stack Clash vulnerability was disclosed and some hardening measures were rushed into the 4.12 kernel release. Since then, a fair amount of work has gone into fixing problems caused by those measures and porting the result back to stable kernel releases. Now, it seems, the kernel developers are considering taking a different approach entirely.

Stack Clash is generally seen as a user-space problem: a combination of large on-stack allocations and the lack of stack probing opens up opportunities for an attacker to jump over the guard page at the end of the stack. Fixing those problems (and deploying the fixes widely) will take some time; meanwhile, it was thought, systems can be protected by replacing the kernel-provided guard page with a 1MB (or larger) guard region that, hopefully, cannot be jumped over.

That guard region should be invisible to most programs, but it has created a surprising number of problems for some applications. A number of those issues have been worked around, but one has proved difficult to fix; unfortunately, that one is LibreOffice, which can crash on 32-bit systems when running Java components. The issue is a bit complicated but, in short, Java wants to enable code execution in memory located immediately below the stack, which runs afoul of the guard region. This, as Linus Torvalds noted, is a problem:

We really can't be breaking libreoffice. That's like a big classic no-no - it affects normal users that simply cannot be expected to work around it. For them, it's a "my office application no longer works" situation, and they just think the system is flaky.

There is also the lingering fear that other, not-yet-discovered regressions lurk in user-space applications, regressions that might not surface until somebody does a kernel upgrade months or years from now. So, perhaps, adding a large guard region to work around a user-space problem is not the best answer.

On most systems, the resource limit (rlimit) mechanism restricts the stack to a maximum of 8MB of memory. The hard limit tends to be much larger, though, so any unprivileged process can raise the effective stack limit to a larger value — as high as "infinite". The posted exploits for Stack Clash work by raising the stack rlimit, then running a setuid program with a vast array of arguments to fill that huge stack. If that program can be prodded into overflowing the stack without hitting the guard page, it may be possible to overwrite heap data beyond the stack and, from there, take over the program and make use of its privileges.
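
The first step described above, raising the soft stack limit up to the hard limit, needs no privileges. The following is a minimal sketch using the standard getrlimit() and setrlimit() calls; it is not taken from the posted exploits, and the helper name raise_stack_soft_limit() is hypothetical.

```c
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_STACK to the hard limit.
 * Returns 0 on success, -1 on error.
 * (Hypothetical helper name, for illustration only.)
 */
int raise_stack_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;

    /* The hard limit is often RLIM_INFINITY on default setups. */
    rl.rlim_cur = rl.rlim_max;

    /* Raising the soft limit up to the hard limit needs no privileges. */
    return setrlimit(RLIMIT_STACK, &rl);
}
```

The new limit is inherited across exec(), which is why a setuid program started afterward would see the enlarged stack limit.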

There may be other ways to get a setuid program to exhaust its stack, but the command-line arguments method is easy and readily under the control of an attacker. (Note: the overflow is relatively easy; a successful exploit is harder.) That suggests that the bulk of the Stack Clash exploits can be headed off by preventing the execution of setuid programs with a stack that's both (1) close to the heap area and (2) nearly full at the outset.

Kees Cook took a stab at that problem with this patch attacking the first of those two points. The idea was that, when a process is about to run a setuid program, the stack limit should be reset to a reasonable value; the value Cook chose was whatever the init process has. This patch would not prevent stack exhaustion (indeed, it might cause it if there are setuid programs needing a huge stack), but it would keep the stack from growing large enough to impinge on a heap area.

That patch didn't get far, though, since Torvalds disliked it. One of his complaints was that special-casing setuid programs would be likely to lead to bugs or inadequate protection, since the relevant code would see relatively little use. So Cook's next attempt took a different tack: it places an upper bound on the amount of stack memory that can be occupied by a program's arguments at exec() time. In particular, that limit is 75% of the default stack limit, or 6MB, regardless of the current stack limit. This patch has been merged for 4.13; it's not clear whether this change will find its way into the stable updates to earlier kernel releases.
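
The arithmetic of that bound can be sketched as follows. This is an illustration rather than the kernel's code: it assumes an 8MB default stack limit and assumes that the existing rule allowing arguments to use up to one quarter of the current stack rlimit still applies alongside the new cap. The name arg_stack_limit() is hypothetical.

```c
#include <stdint.h>

#define DEFAULT_STACK_LIMIT (8ULL * 1024 * 1024)  /* assumed 8MB default */

/*
 * Bytes of stack that arguments may occupy at exec() time, given the
 * current stack rlimit in bytes. (Hypothetical helper, illustration only.)
 */
uint64_t arg_stack_limit(uint64_t stack_rlimit)
{
    /* New cap: 75% of the default stack limit, i.e. 6MB. */
    uint64_t cap = DEFAULT_STACK_LIMIT / 4 * 3;

    /* Assumed existing rule: one quarter of the current stack rlimit. */
    uint64_t quarter = stack_rlimit / 4;

    return quarter < cap ? quarter : cap;
}
```

With the default 8MB stack limit the result is unchanged at 2MB; the 6MB cap only takes effect once the stack rlimit has been raised above 24MB.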

Limiting stack use by arguments should suffice to block a lot of attacks but, as it turns out, there's still a desire to enforce a limit on the size of the stack for setuid programs. One reason for that might be fears of some sort of pathological behavior that could be exploited to force a setuid program to overflow even a huge and mostly empty stack. But it also turns out that, if the stack rlimit is set to "infinity", the kernel will change the layout of a program's memory areas. A large stack limit suggests that the stack itself is likely to be large, so the kernel maps other memory areas low in the address space to preserve room for the stack to grow into. If, instead, the stack is limited, the kernel will map those areas at higher locations. As a result, the stack rlimit gives an attacker a bit of control over how the target program's memory is laid out, which is not a desirable property.

Thus this patch series, which applies a maximum 8MB stack limit on setuid programs. These patches, posted on July 10, have not yet been merged; applying this limit required a surprising number of changes to the core exec*() system-call code, so more than the usual amount of review is indicated. There would appear to be general agreement on the goal, though, so this change seems likely to find its way into the mainline eventually. There has been some talk of allowing a larger stack via an annotation in the binary file, but that has not been implemented and may never be unless a need for it is demonstrated.

At this point, nobody has said whether these changes will be enough to allow the removal of the larger guard region from the stack. Returning to the previous layout semantics would ease a lot of worries about regressions that might turn up months or years in the future, though, so it's not hard to see why the idea has appeal. It would seem that at least some of the kernel's internal memory-layout policies have become a part of the user-space ABI, so they need to be preserved if possible.


Rethinking the Stack Clash fix

Posted Jul 13, 2017 21:49 UTC (Thu) by nix (subscriber, #2304) [Link] (11 responses)

As Linus noted, a maximum limit on arg size (or stack size!) for non-setuid programs is seriously problematic. I routinely use >32MiB arg lists (wildcards with lots and lots of files in them) and stacks exceeding 20MiB (deep recursion) and would be most unhappy to find that blocked on 64-bit systems in which you could easily have gigabytes of both without coming close to exhausting even physical RAM, let alone address space.

Arbitrary limits for security reasons are one thing. Tiny ones on huge systems are another.

Rethinking the Stack Clash fix

Posted Jul 13, 2017 23:28 UTC (Thu) by foom (subscriber, #14868) [Link] (10 responses)

Um...how come nobody has noted that this new patch totally breaks the userspace API promises?

There is an API for getting the max argument size, sysconf(_SC_ARG_MAX). Documented in man sysconf:

       ARG_MAX - _SC_ARG_MAX
              The maximum length of the arguments to the exec(3) family of functions.  Must not be less than _POSIX_ARG_MAX (4096).

Implemented in glibc as:

    case _SC_ARG_MAX:
      /* Use getrlimit to get the stack limit.  */
      if (__getrlimit (RLIMIT_STACK, &rlimit) == 0)
        return MAX (legacy_ARG_MAX, rlimit.rlim_cur / 4);

      return legacy_ARG_MAX;

With this kernel patch, that is now a complete lie, and any program depending on it in order to decide whether the arguments can be passed on the command line (vs., say, in a response file) is now broken. The clang compiler is one such program I know of, but I'm sure there are others...

WTF?

Rethinking the Stack Clash fix

Posted Jul 14, 2017 4:49 UTC (Fri) by areilly (subscriber, #87829) [Link] (1 responses)

Pretty sure that clang is not usually installed setuid. My reading of the article suggested that those are the only ones getting the proposed limit.
I'm not sure how well an suid limit like this would interact with something like xargs, that presumably knows similar things about allowable argument sizes. It would seem to require xargs to be aware of the setuid-nature of its client program.

Rethinking the Stack Clash fix

Posted Jul 14, 2017 11:20 UTC (Fri) by foom (subscriber, #14868) [Link]

The article discusses and links to (see the part that says "This patch has been merged") a patch that applies a 6MB ARG_MAX maximal limit /regardless/ of suid or not. That results in sysconf lying if the stack rlimit has been set to 24MB or higher.
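
One defensive option for a program that sizes its command lines from sysconf() is to clamp the reported value itself. The sketch below is only an illustration under that assumption: the 6MB constant is taken from the patch description above, not from any kernel or glibc interface, and both helper names are hypothetical.

```c
#include <unistd.h>

/* Assumed cap from the patch discussed above; not exported by any API. */
#define ASSUMED_KERNEL_ARG_CAP (6L * 1024 * 1024)

/* Clamp a reported ARG_MAX value to the assumed kernel cap. */
long clamp_arg_max(long reported)
{
    if (reported < 0)
        return -1;  /* sysconf() could not determine a limit */
    return reported < ASSUMED_KERNEL_ARG_CAP ? reported
                                             : ASSUMED_KERNEL_ARG_CAP;
}

/* Conservative estimate of the argument space actually usable. */
long effective_arg_max(void)
{
    return clamp_arg_max(sysconf(_SC_ARG_MAX));
}
```

A caller would still need to leave headroom for the environment, since that shares the same space as the arguments.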

Rethinking the Stack Clash fix

Posted Jul 14, 2017 12:23 UTC (Fri) by joey (guest, #328) [Link] (6 responses)

The canonical example of using ARG_MAX is xargs. This is going to cause massive breakage if it slips in.

Rethinking the Stack Clash fix

Posted Jul 14, 2017 12:39 UTC (Fri) by jchaxby (subscriber, #63942) [Link] (2 responses)

And find ... -exec {} +

And I'm pretty sure there are other perfectly legitimate uses, as nix suggests, of very large arg sizes.

"Argument list too long"

Posted Jul 27, 2017 16:46 UTC (Thu) by geuder (subscriber, #62854) [Link] (1 responses)

In an up-to-date OpenSUSE 42.2 I got "Argument list too long" when using "xargs -0" today for the first time after I don't know how many years. So I remembered this discussion.

I have not tweaked ulimit in any way, the stack size is 8192.

Luckily "find ... -exec ... {} +" worked without the error. Was just a bad old habit to use xargs in this case, where exec + was possible, too. But there might be some more legitimate use cases of "xargs -0"

Unluckily that means that I have not dug any deeper into whether it was really a stack clash fix that introduced the "argument list too long" problem. It would be quite a weird coincidence if it weren't.

"Argument list too long"

Posted Jul 27, 2017 18:11 UTC (Thu) by mbunkus (subscriber, #87248) [Link]

> But there might be some more legitimate use cases of "xargs -0"

Parallel execution, for example: `… | xargs -0 --max-args=1 --max-procs=$(getconf _NPROCESSORS_ONLN) the-program-to-parallelize`

Rethinking the Stack Clash fix

Posted Jul 14, 2017 15:06 UTC (Fri) by corbet (editor, #1) [Link] (2 responses)

Perhaps, but remember that, unless you've tweaked the stack limit yourself, the limit was (and remains) 2MB. Are there really "massive" numbers of users who have changed the stack limit for the purposes of enabling a much larger args array?

Rethinking the Stack Clash fix

Posted Jul 15, 2017 0:46 UTC (Sat) by foom (subscriber, #14868) [Link]

I suspect it's not a massive number of people with large or unlimited stack size, unless some distro has done so by default.

But, once you have increased it for *any* purpose, this kernel change will break programs attempting to choose the right number of args to pass on the command line via calling sysconf. You don't need to be actively requiring or even desiring a large args array...

Rethinking the Stack Clash fix

Posted Jul 17, 2017 14:21 UTC (Mon) by NightMonkey (subscriber, #23051) [Link]

Maybe those same users who pushed for the ext4 "largefile" limit increase to 2 billion files? ;)

Rethinking the Stack Clash fix

Posted Jul 14, 2017 18:03 UTC (Fri) by Frogging101 (guest, #113180) [Link]

Methinks someone should reply and point out that this is broken.

I would do it but I don't want to butt in without really knowing what I'm talking about.

Rethinking the Stack Clash fix

Posted Jul 13, 2017 21:58 UTC (Thu) by josh (subscriber, #17465) [Link]

I really wish the kernel didn't have to be involved in this at *all*. In an ideal world, the stack (and heap, and memory layout in general) would be set up entirely by userspace in its loader (e.g. ld-linux), and the kernel would just say "here's some address space".

Rethinking the Stack Clash fix

Posted Jul 14, 2017 0:45 UTC (Fri) by eru (subscriber, #2753) [Link] (4 responses)

How about doing it like in the bad old days of MS-DOS programming: have the compiler insert a stack overflow check at each function entry and into specials like alloca(), at least when compiling sensitive code. MS-DOS compilers did this because there was no other way to attempt memory protection, but the article gives the impression Linux actually is not so much better in this respect!

Rethinking the Stack Clash fix

Posted Jul 14, 2017 1:06 UTC (Fri) by josh (subscriber, #17465) [Link]

The compiler has code to do exactly that, though it turns out that wasn't sufficient (see the various discussions about the compiler needing to probe each page).

But you also can't count on all of userspace using a reasonable compiler.

Rethinking the Stack Clash fix

Posted Jul 14, 2017 7:59 UTC (Fri) by mjw (subscriber, #16740) [Link] (2 responses)

See Jeff Law's GCC RFC patch set: https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00556.html

This series introduces -fstack-check=clash which is a variant of
-fstack-check designed to prevent "jumping the stack" as seen in the
stack-clash exploits.

Rethinking the Stack Clash fix

Posted Jul 14, 2017 21:51 UTC (Fri) by eru (subscriber, #2753) [Link] (1 responses)

Sounds complicated. The old MS-DOS compilers just compared SP - required size against the stack limit before allocation. What is wrong with that?
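
The kind of check described here can be sketched in C as follows. It is an illustration only, assuming a downward-growing stack and a known lowest usable address; it is not what any particular MS-DOS compiler or GCC emits, and the name stack_has_room() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return nonzero if `size` more bytes fit between the current stack
 * pointer `sp` and `stack_limit`, the lowest usable stack address.
 * Assumes the stack grows downward. (Hypothetical helper.)
 */
int stack_has_room(uintptr_t sp, uintptr_t stack_limit, size_t size)
{
    if (sp < stack_limit)
        return 0;  /* already past the limit */

    /* Compare the remaining room to size; computing sp - size could wrap. */
    return (sp - stack_limit) >= size;
}
```

A compiler would place the equivalent comparison in each function prologue and before each alloca(), and call an error handler when it fails.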

Rethinking the Stack Clash fix

Posted Jul 14, 2017 22:16 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Additional comparison - it's not free.