Per-system-call kernel-stack offset randomization

By Jonathan Corbet
March 27, 2020

In recent years, the kernel has (finally) upped its game when it comes to hardening. It is rather harder to compromise a running kernel than it used to be. But "rather harder" is relative: attackers still manage to find ways to exploit kernel bugs. One piece of information that can be helpful to attackers is the location of the kernel stack; this patch set from Kees Cook and Elena Reshetova may soon make that information harder to come by and nearly useless in any case.

The kernel stack will always be an attractive target. It typically contains no end of useful information that can be used, for example, to find the location of other kernel data structures. If it can be written to, it can be used for return-oriented programming attacks. Many exploits seen in the wild (Cook mentioned this video4linux exploit as an example) depend on locating the kernel stack as part of the sequence of steps to take over a running system.

In current kernels, the kernel stack is allocated from the vmalloc() area at process creation time. Among other things, this approach makes the location of any given process's kernel stack hard to guess, since it depends on the state of the memory allocator at the time of its creation. Once the stack has been allocated, though, its location remains fixed for as long as the process runs. So if an attacker can figure out where the kernel stack for a target process is, that information can be used for as long as that process lives.

As it turns out, there are a number of ways for an attacker to do that. Despite extensive cleanup work, there are still numerous kernel messages that will expose addresses of data structures, including the stack, in the kernel log. There are also attacks using ptrace() and cache timing that can be used to locate the stack. So the protection offered by an uncertain stack location is not as strong as one might like it to be.

Cook and Reshetova's patch set (which is inspired by the PaX RANDKSTACK feature, though the implementation is different) addresses this problem by changing a process's kernel stack offset every time that process makes a system call. Specifically, it modifies the system-call entry code so that the following sequence of events happens:

1. The pt_regs structure, containing the state of the processor registers, is pushed onto the base of the stack, just like it is done in current kernels.
2. A call to alloca() is made with a random value. This has the effect of "allocating" a random amount of memory on the stack, which is really just a matter of moving the stack pointer down by that amount.
3. The system call proceeds with its stack pointer in the now randomized location.
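
The sequence above can be modelled in a few lines of C. This is a user-space sketch of the address arithmetic only, not the patch's entry code; struct entry_layout, syscall_entry_layout(), and the step parameter (the granularity of the offset) are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of where things land on one kernel stack. */
struct entry_layout {
    uintptr_t pt_regs;  /* address of the saved registers */
    uintptr_t sp;       /* stack pointer seen by the system call */
};

/*
 * stack_top:  highest address of the stack (stacks grow downward)
 * regs_size:  size of the saved-register structure
 * random:     raw random value for this system call
 * bits:       how many low-order bits of it to use
 * step:       granularity of the offset in bytes (an assumption here)
 */
struct entry_layout syscall_entry_layout(uintptr_t stack_top,
                                         size_t regs_size,
                                         uint32_t random,
                                         unsigned bits,
                                         size_t step)
{
    struct entry_layout l;

    assert(bits < 32);
    /* Step 1: pt_regs goes at the base of the stack, before any offset. */
    l.pt_regs = stack_top - regs_size;
    /* Step 2: "alloca()" a random amount by moving the stack pointer down. */
    l.sp = l.pt_regs - (size_t)(random & ((1u << bits) - 1)) * step;
    /* Step 3: the system call then runs from l.sp. */
    return l;
}
```

Calling this twice with the same stack_top but different random values yields the same pt_regs address and two different sp values, which is the property the patch set is after.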

In other words, the kernel stack itself doesn't move, but the actual stack contents shift around and are located differently for every system call. That makes any attack that depends on placing data at a specific location in the stack likely to fail; even if the attacker succeeds in figuring out where the stack is to be found, they won't know exactly where any given system call will place its data on that stack.

Pushing the pt_regs structure before applying the randomization is important. The ptrace() attack mentioned above can be used to locate this structure (and thus the kernel stack); if it were located after the offset is applied, such attacks would thus reveal the offset.

Currently, the randomization amount is obtained by reading some low-order bits from the CPU's time-stamp counter. Cook notes that other, more robust sources of entropy can be added in the future, but he doesn't think that needs to be figured out before the current patches can be considered. There are currently five bits of entropy applied to the stack offset on 64-bit systems, and six bits on 32-bit systems. That is not a huge amount of entropy, but it is enough that any attack that depends on precise kernel-stack locations will probably fail (and generate a kernel oops) on the first few tries. More entropy can be added, at the cost of wasting more stack space.
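
To make the entropy figures concrete, here is a small sketch of taking low-order bits from a counter value. The function names are hypothetical, and the counter reading is passed in as a plain integer so that the example does not depend on any particular CPU instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the `bits` lowest bits of a counter reading. */
uint32_t low_order_bits(uint64_t counter, unsigned bits)
{
    assert(bits > 0 && bits < 32);
    return (uint32_t)(counter & ((UINT64_C(1) << bits) - 1));
}

/* Number of distinct stack positions that `bits` of entropy can select. */
uint32_t distinct_positions(unsigned bits)
{
    assert(bits < 32);
    return UINT32_C(1) << bits;
}
```

With five bits there are 32 possible positions and with six bits there are 64, so, assuming the bits are uniformly distributed, a single blind guess would be right about one time in 32 or one time in 64 respectively.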

With this feature in use, Cook measured the overhead as being about 0.9% on a no-op system call; it would clearly be less on any system call that does real work. But for people who don't want to pay even that cost, there is a static label to turn the randomization off.

The end result is a relatively simple mechanism to further harden the kernel against attack. Cook noted that it's not perfect, adding that "most things can't be given the existing kernel design trade-offs". If other developers agree, per-system-call stack offset randomization is likely to find its way into the mainline kernel's arsenal of hardening techniques.


Posted Mar 28, 2020 23:53 UTC (Sat) by marcH (subscriber, #57642) [Link]

Is there any chance this may help reveal "Programming by coincidence" bugs? Crashes due to memory corruption come and go depending on which way the wind blows. This looks like a bit more wind so I'm wondering.

Posted Mar 29, 2020 19:10 UTC (Sun) by gutschke (subscriber, #27910) [Link] (5 responses)

In the discussion of how many bits of randomness get introduced, there is a statement that randomness comes at the cost of increased memory usage. At first sight, this seems puzzling, as even 6 bits would only yield 64 bytes of added stack usage. But what the article failed to mention is the fact that the x86 ABIs require a 16-byte stack alignment. So, those 6 bits turn into 1kB of extra memory usage. That's quite significant with notoriously small kernel stacks (8kB or 16kB depending on architecture).

Posted Mar 30, 2020 2:25 UTC (Mon) by Paf (subscriber, #91811) [Link] (1 responses)

I'm a bit confused by this - If this is indeed the case, why not use the rest of the space rather than leave it unused? Can you explain the calculation that gets you to 1K in more detail?

Posted Mar 30, 2020 3:05 UTC (Mon) by gutschke (subscriber, #27910) [Link]

The point of this patch is that instead of having a single base address for the system call stack, there should be 32 or 64 distinct addresses (depending on architecture). This makes it impossible for an attacker to reliably guess addresses on the stack. And hopefully, that will cause (some) attacks to fail with some sort of kernel crash. Of course, if there is no crash, an attacker can just keep guessing and eventually they'll get lucky.

Randomization happens by the virtue of random amounts of data being allocated on the stack. This happens right at the point of the transition from user space to kernel space.

But alloca() knows about the x86 ABI. And the ABI requires that stack frames are aligned in 16-byte increments. That's needed because some CPU instructions want aligned data (I believe this mostly affects SSE instructions). The compiler assumes that stacks are always aligned when the program starts (or in this case, when the system call starts executing in the kernel) and then makes sure the necessary padding is added whenever a function call is made.

There really isn't any unused memory that is readily available for other purposes.

Posted Mar 30, 2020 5:35 UTC (Mon) by jorgegv (subscriber, #60484) [Link] (2 responses)

Sorry, I don't understand your reasoning. If up to 64 bytes are wasted by randomization and up to 16 bytes are wasted due to alignment, that makes a maximum of 80 bytes wasted, not 1kB.

I assume the randomization is done once per syscall invocation, right?

Posted Mar 30, 2020 5:45 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

No, randomization can happen only at 16-byte intervals. So you're wasting 64*16 bytes.

Posted Mar 30, 2020 6:03 UTC (Mon) by gutschke (subscriber, #27910) [Link]

Picture open-coding alloca() using pseudo-code like this:

new_stack = old_stack - (rand() & 0x3F);

That would do your stack address randomization, and it would give you 6 bits of randomness, as (0x3F + 1 == 1 << 6). But now you need to follow up with the alignment that the x86 ABI requires. Let's mask out any low-order bits that violate the ABI:

new_stack = (old_stack - (rand() & 0x3F)) & ~0xF;

But that's roughly equivalent to writing:

new_stack = old_stack - (rand() & 0x30);

Actually, if you look really closely, the transformation isn't exactly correct and sometimes results in a value that is off by 0x10; but let's not worry about that for now. Fixing that would just make the code needlessly complicated and not contribute anything useful to this discussion.

In any case, as you can see, we now only have two bits of randomization. That sounds barely worth the effort. If we want to regain all six bits of stack address randomization, we need to instead do something like:

new_stack = old_stack - (rand() & 0x3F0);

And that's the 1kB (aka 0x400 bytes) of potentially wasted space. And again, we have lost 0x10 bytes because of my sloppy math a little earlier. Please forgive me, but it makes things easier to read.
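
The mask arithmetic in this comment can be checked with a short C sketch. The helpers distinct_offsets() and align_down_16() are hypothetical names used for illustration only, not anything taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Count how many distinct values (r & mask) can take: 2^(number of set bits). */
unsigned distinct_offsets(uint32_t mask)
{
    unsigned set_bits = 0;

    for (; mask != 0; mask >>= 1)
        set_bits += mask & 1u;
    assert(set_bits < 32);
    return 1u << set_bits;
}

/* Round an address down to a 16-byte boundary, as the alignment step does. */
uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)0xF;
}
```

A mask of 0x3F allows 64 distinct offsets, 0x30 allows only 4 (two bits), and 0x3F0 allows 64 again, with the largest offset being 0x3F0 = 1008 bytes, which is 0x10 short of 1kB.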

Posted Mar 30, 2020 5:41 UTC (Mon) by geuder (subscriber, #62854) [Link] (14 responses)

> there are still numerous kernel messages that will expose addresses of data structures, including the stack, in the kernel log.

The kernel log? Where is the kernel log readable to non-root in any current system that tries to be somewhat security aware? Once the attacker is root there are probably worse problems.

Even without buying that argument I don't say that the approach is useless. Smart crackers will find ways nobody has thought of (or at least not prepared for), so defense in depth should not harm.

Posted Mar 30, 2020 6:12 UTC (Mon) by gutschke (subscriber, #27910) [Link] (9 responses)

I just checked on four completely different Linux systems that I could quickly log into. All four of them run a different distribution. Some are old-ish, others are bleeding edge. All of them allow non-privileged users to invoke "dmesg". I don't doubt that this can be restricted. But I suspect that it has unexpected side effects and that's why distributions don't do so by default.

If 90%+ of all userland doesn't restrict access to kernel messages, then maybe it is a good idea for the kernel to assume that this type of data is available to an attacker.

Posted Mar 30, 2020 10:53 UTC (Mon) by geuder (subscriber, #62854) [Link] (6 responses)

You are correct, dmesg still seems to be open everywhere. I must have mixed that up with the experience that journalctl does not show the system journal unless you add yourself to the appropriate group, or that /var/log/* files are increasingly protected.

So what would break if we protect /dev/kmsg? Reading random text messages doesn't look like a desirable design for any purpose. Except for systemd-journald of course, but that runs as root already.

Posted Mar 30, 2020 17:09 UTC (Mon) by jimi (guest, #6655) [Link] (5 responses)

I would guess that little would break. I base this guess on the fact that Slackware does not allow non-root access to dmesg and has not for a long time. If Slackware can find a way to restrict dmesg without breaking things, surely others can as well.

jimi@black:~> dmesg
dmesg: read kernel buffer failed: Operation not permitted

Posted Mar 30, 2020 17:14 UTC (Mon) by zdzichu (subscriber, #17118) [Link] (4 responses)

It's an upstream kernel option:

config SECURITY_DMESG_RESTRICT
	bool "Restrict unprivileged access to the kernel syslog"
	default n
	help
	  This enforces restrictions on unprivileged users reading the kernel
	  syslog via dmesg(8).

It has been there for over 9 years.

Posted Mar 30, 2020 17:58 UTC (Mon) by jimi (guest, #6655) [Link] (2 responses)

Ah - thank you for enlightening me.

So I'm left wondering, why not set the default to y? At least one distro runs with this restricted with no ill effects. What are the reasons to not restrict?

Posted Mar 30, 2020 19:07 UTC (Mon) by madscientist (subscriber, #16861) [Link] (1 responses)

I expect there will be repercussions. For example, we have a daemon that runs on systems that can be asked to retrieve diagnostic information about a system, and dmesg output is often a critical aspect of that (for example, determining if processes were killed due to OOM, or hardware issues, etc.) Of course, we do not want such a daemon to have to run as root.

Restricting access to important system information to root will just provide incentive to give root access to more things, which seems like an anti-pattern to me.

If dmesg output is really a security issue then of course something needs to be done, but some careful thought is appropriate.

Posted Mar 30, 2020 21:32 UTC (Mon) by simcop2387 (subscriber, #101710) [Link]

You don't have to give it root; just give it CAP_SYSLOG, which a tool that gathers diagnostic information would probably need anyway.

Posted Apr 6, 2020 16:42 UTC (Mon) by zdzichu (subscriber, #17118) [Link]

Actually, there's even a sysctl file: /proc/sys/kernel/dmesg_restrict. It can be toggled at any time.
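
For anyone wanting to check that setting from a program, a minimal sketch might look like the following. The helper name read_int_setting() is hypothetical; the only thing taken from the comment above is the path of the sysctl file.

```c
#include <stdio.h>

/*
 * Read a single integer from a sysctl-style file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
int read_int_setting(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling read_int_setting("/proc/sys/kernel/dmesg_restrict") should return 0 on a system where unprivileged users may read the kernel log and 1 where that access is restricted; a result of -1 means the file could not be read at all.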

Posted Apr 6, 2020 13:27 UTC (Mon) by tao (subscriber, #17563) [Link] (1 responses)

Debian, at the very least, restricts dmesg by default. Ubuntu, on the other hand, doesn't (at least not 18.04 and 19.04; I don't have any newer systems to test on).

Posted Apr 15, 2020 12:34 UTC (Wed) by ghane (guest, #1805) [Link]

Works on Ubuntu 20.04 too

sanjeev@T450s-disco:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu Focal Fossa (development branch)
Release:	20.04
Codename:	focal
sanjeev@T450s-disco:~$ groups
sanjeev adm disk lp dialout cdrom sudo dip plugdev lpadmin sambashare docker
sanjeev@T450s-disco:~$ dmesg |tail -5
[ 7010.888825] wlp3s0: authenticated
[ 7010.890961] wlp3s0: associate with 38:d5:47:80:24:c4 (try 1/3)
[ 7010.891989] wlp3s0: RX AssocResp from 38:d5:47:80:24:c4 (capab=0x11 status=0 aid=4)
[ 7010.893546] wlp3s0: associated
[ 7010.913151] IPv6: ADDRCONF(NETDEV_CHANGE): wlp3s0: link becomes ready
sanjeev@T450s-disco:~$

Posted Mar 30, 2020 7:44 UTC (Mon) by mjg59 (subscriber, #23239) [Link] (3 responses)

There's a bunch of cases where you don't want root to have arbitrary code execution in the kernel, so there's still a benefit in preventing root from knowing this.

Posted Mar 30, 2020 11:17 UTC (Mon) by geuder (subscriber, #62854) [Link] (2 responses)

> a bunch of cases

Would you mind refreshing my memory on this? I once listened to a kernel lockdown presentation by you, but not working on those questions on a regular basis must have led to too low a refresh rate, I am sorry.

The second paragraph of https://mjg59.dreamwidth.org/50577.html talks about keeping secrets secret from root. Is there an example of what such a secret would be and where it would come from, such that root cannot already access it without executing code?

TPMs are one way to keep private keys secret from root, even the kernel doesn't have them. Of course they are not applicable everywhere, so I don't intend to doubt that there are more use cases.

Posted Mar 30, 2020 20:01 UTC (Mon) by mjg59 (subscriber, #23239) [Link] (1 responses)

Sure. As an example - the credentials required for a user to be able to access production resources from a development workstation may be held in a daemon that generates an audit record for every access. If that secret can be extracted, it can be used without generating that audit trail and could potentially be copied to a different machine. LSM policy can be written to prevent root from being able to interact with that daemon, but that's not helpful if it's relatively straightforward for root to get code execution in the kernel.

Posted Mar 31, 2020 5:44 UTC (Tue) by geuder (subscriber, #62854) [Link]

Thanks for your reply, this is indeed a very interesting use case. A strict interpretation of regulation would require us to use that at my work every day.

So you are saying developers have root on their workstation, the daemon is running on their workstation, but still the developer cannot prevent that audit record from being written to the correct, persistent and unmodifiable log for every usage of the credentials?

In practice we would need to solve much more fundamental problems in user space than preventing root from getting kernel stack addresses to prevent them from copying and modifying the daemon. Or having the audit records written to a wrong location where an auditor will not find them. Do you have a pointer to the overall design of such a system?