Per-system-call kernel-stack offset randomization
The kernel stack will always be an attractive target. It typically contains no end of useful information that can be used, for example, to find the location of other kernel data structures. If it can be written to, it can be used for return-oriented programming attacks. Many exploits seen in the wild (Cook mentioned a video4linux exploit as an example) depend on locating the kernel stack as part of the sequence of steps to take over a running system.
In current kernels, the kernel stack is allocated from the vmalloc() area at process creation time. Among other things, this approach makes the location of any given process's kernel stack hard to guess, since it depends on the state of the memory allocator at the time of its creation. Once the stack has been allocated, though, its location remains fixed for as long as the process runs. So if an attacker can figure out where the kernel stack for a target process is, that information can be used for as long as that process lives.
As it turns out, there are a number of ways for an attacker to do that. Despite extensive cleanup work, there are still numerous kernel messages that will expose addresses of data structures, including the stack, in the kernel log. There are also attacks using ptrace() and cache timing that can be used to locate the stack. So the protection offered by an uncertain stack location is not as strong as one might like it to be.
Cook and Reshetova's patch set (which is inspired by the PaX RANDKSTACK feature, though the implementation is different) addresses this problem by changing a process's kernel stack offset every time that process makes a system call. Specifically, it modifies the system-call entry code so that the following sequence of events happens:
- The pt_regs structure, containing the state of the processor registers, is pushed onto the base of the stack, just as is done in current kernels.
- A call to alloca() is made with a random value. This has the effect of "allocating" a random amount of memory on the stack, which is really just a matter of moving the stack pointer down by that amount.
- The system call proceeds with its stack pointer in the now randomized location.
In other words, the kernel stack itself doesn't move, but the actual stack contents shift around and are located differently for every system call. That makes any attack that depends on placing data at a specific location in the stack likely to fail; even if the attacker succeeds in figuring out where the stack is to be found, they won't know exactly where any given system call will place its data on that stack.
Pushing the pt_regs structure before applying the randomization is important. The ptrace() attack mentioned above can be used to locate this structure (and thus the kernel stack); if it were located after the offset is applied, such attacks would thus reveal the offset.
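The entry sequence described above can be modelled with plain address arithmetic. The sketch below is a toy model, not the patch set's code: `STACK_TOP`, `PT_REGS_SIZE`, and `OFFSET_MASK` are made-up values chosen only for illustration, the 16-byte granularity is an assumption, and Python's `random` module stands in for whatever entropy source the kernel uses.

```python
import random

# Hypothetical values for illustration only; none come from the patch set.
STACK_TOP = 0xFFFFC90000004000   # highest address of one process's kernel stack
PT_REGS_SIZE = 168               # assumed size of the saved-register frame
OFFSET_MASK = 0x1F0              # assumed mask: 5 random bits, 16-byte steps


def syscall_entry(rng):
    """Return (pt_regs_addr, stack_pointer) for one simulated system call."""
    # Step 1: pt_regs is pushed first, so it always lands at the same address.
    pt_regs_addr = STACK_TOP - PT_REGS_SIZE
    # Step 2: "allocate" a random amount by moving the stack pointer down.
    offset = rng.getrandbits(16) & OFFSET_MASK
    stack_pointer = pt_regs_addr - offset
    # Step 3: the system call would now run with this stack pointer.
    return pt_regs_addr, stack_pointer


rng = random.Random(0)
calls = [syscall_entry(rng) for _ in range(1000)]
pt_regs_addrs = {p for p, _ in calls}
stack_pointers = {sp for _, sp in calls}
print(len(pt_regs_addrs), len(stack_pointers))
```

In this model the pt_regs address never changes from call to call, so locating it tells an observer nothing about the offset, while the stack pointer seen by the system call takes one of up to 32 values.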
Currently, the randomization amount is obtained by reading some low-order bits from the CPU's time-stamp counter. Cook notes that other, more robust sources of entropy can be added in the future, but he doesn't think that needs to be figured out before the current patches can be considered. There are currently five bits of entropy applied to the stack offset on 64-bit systems, and six bits on 32-bit systems. That is not a huge amount of entropy, but it is enough that any attack that depends on precise kernel-stack locations will probably fail (and generate a kernel oops) on the first few tries. More entropy can be added, at the cost of wasting more stack space.
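To put five and six bits in perspective, the arithmetic can be worked through directly. The sketch below assumes an attacker who must guess one specific offset and that every offset is equally likely on each attempt; both are simplifications, and the function names are made up for this example.

```python
def positions(bits):
    """Number of distinct stack offsets selectable with `bits` bits of entropy."""
    return 1 << bits


def chance_all_fail(bits, attempts):
    """Probability that `attempts` independent guesses all miss the offset."""
    miss = 1 - 1 / positions(bits)
    return miss ** attempts


for bits in (5, 6):
    print(bits, positions(bits), round(chance_all_fail(bits, 3), 3))
```

Under these assumptions, five bits give 32 possible offsets, and a blind attacker misses on all of the first three tries roughly 91% of the time; six bits raise that to about 95%.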
With this feature in use, Cook measured the overhead as being about 0.9% on a no-op system call; it would clearly be less on any system call that does real work. But for people who don't want to pay even that cost, there is a static label to turn the randomization off.
The end result is a relatively simple mechanism to further harden the kernel against attack. Cook noted that it's not perfect, adding that "most things can't be given the existing kernel design trade-offs". If other developers agree, per-system-call stack offset randomization is likely to find its way into the mainline kernel's arsenal of hardening techniques.
Posted Mar 28, 2020 23:53 UTC (Sat)
by marcH (subscriber, #57642)
[Link]
Posted Mar 29, 2020 19:10 UTC (Sun)
by gutschke (subscriber, #27910)
[Link] (5 responses)
Posted Mar 30, 2020 2:25 UTC (Mon)
by Paf (subscriber, #91811)
[Link] (1 responses)
Posted Mar 30, 2020 3:05 UTC (Mon)
by gutschke (subscriber, #27910)
[Link]
Randomization happens by virtue of random amounts of data being allocated on the stack. This happens right at the point of the transition from user space to kernel space.
But alloca() knows about the x86 ABI, and the ABI requires that stack frames are aligned in 16-byte increments. That's needed because some CPU instructions want aligned data (I believe this mostly affects SSE instructions). The compiler assumes that stacks are always aligned when the program starts (or in this case, when the system call starts executing in the kernel) and then makes sure the necessary padding is added whenever a function call is made.
There really isn't any unused memory that is readily available for other purposes.
Posted Mar 30, 2020 5:35 UTC (Mon)
by jorgegv (subscriber, #60484)
[Link] (2 responses)
I assume the randomization is done once per syscall invocation, right?
Posted Mar 30, 2020 5:45 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Mar 30, 2020 6:03 UTC (Mon)
by gutschke (subscriber, #27910)
[Link]
Picture open-coding alloca() using pseudo-code like this:
    new_stack = old_stack - (rand() & 0x3F);
That would do your stack address randomization, and it would give you 6 bits of randomness, as (0x3F + 1 = 1 << 6). But now you need to follow up with the alignment that the x86 ABI requires. Let's mask out any LSB that violate the ABI:
    new_stack = (old_stack - (rand() & 0x3F)) & ~0xF;
But that's roughly equivalent to writing:
    new_stack = old_stack - (rand() & 0x30);
Actually, if you look really closely, the transformation isn't exactly correct and sometimes results in a value that is off by 0x10; but let's not worry about that for now. Fixing that would just make the code needlessly complicated and not contribute anything useful to this discussion. In any case, as you can see we now only have two bits of randomization. That sounds barely worth the effort. If we want to regain all six bits of stack address randomization, we need to instead do something like:
    new_stack = old_stack - (rand() & 0x3F0);
And that's the 1kB (aka 0x400 bytes) of potentially wasted space. And again, we have lost 0x10 bytes because of my sloppy math a little earlier. Please forgive me, but it makes things easier to read.
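The alignment argument in this comment can be checked by brute force. The sketch below is an illustration only: it enumerates every address produced by a six-bit random mask followed by 16-byte alignment, by the "roughly equivalent" two-bit mask 0x30, and by the wider mask 0x3F0. `OLD_STACK` is an arbitrary aligned address, and a simple loop over all values stands in for `rand()`.

```python
ALIGN_MASK = ~0xF            # clear the low four bits: 16-byte alignment
OLD_STACK = 0x10000          # arbitrary, already-aligned starting address


def aligned_after(mask):
    """Randomize with `mask`, then force 16-byte alignment."""
    return {(OLD_STACK - (r & mask)) & ALIGN_MASK for r in range(0x400)}


def direct(mask):
    """Randomize with a mask that only selects already-aligned offsets."""
    return {OLD_STACK - (r & mask) for r in range(0x400)}


naive = aligned_after(0x3F)      # six random bits, then aligned
approx = direct(0x30)            # the "roughly equivalent" two-bit form
wide = direct(0x3F0)             # six random bits above the alignment bits

print(len(naive), len(approx), len(wide))
print(hex(OLD_STACK - min(wide)))
```

The aligned six-bit version yields five distinct addresses where the two-bit form yields four, which is the off-by-0x10 discrepancy the comment mentions. The 0x3F0 mask yields all 64 aligned addresses, at the cost of a spread of up to 0x3F0 bytes.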
Posted Mar 30, 2020 5:41 UTC (Mon)
by geuder (subscriber, #62854)
[Link] (14 responses)
The kernel log? Where is the kernel log readable to non-root in any current system that tries to be somewhat security aware? Once the attacker is root there are probably worse problems.
Even without buying that argument, I wouldn't say that the approach is useless. Smart crackers will find ways nobody has thought of (or at least not prepared for), so defense in depth should not harm.
Posted Mar 30, 2020 6:12 UTC (Mon)
by gutschke (subscriber, #27910)
[Link] (9 responses)
If 90%+ of all userland doesn't restrict access to kernel messages, then maybe it is a good idea for the kernel to assume that this type of data is available to an attacker.
Posted Mar 30, 2020 10:53 UTC (Mon)
by geuder (subscriber, #62854)
[Link] (6 responses)
So what would break if we protect /dev/kmsg? Reading random text messages doesn't look like a desirable design for any purpose. Except for systemd-journald of course, but that runs as root already.
Posted Mar 30, 2020 17:09 UTC (Mon)
by jimi (guest, #6655)
[Link] (5 responses)
jimi@black:~> dmesg
Posted Mar 30, 2020 17:14 UTC (Mon)
by zdzichu (subscriber, #17118)
[Link] (4 responses)
config SECURITY_DMESG_RESTRICT
It's been there for over 9 years.
Posted Mar 30, 2020 17:58 UTC (Mon)
by jimi (guest, #6655)
[Link] (2 responses)
So I'm left wondering, why not set the default to y? At least one distro runs with this restricted with no ill effects. What are the reasons to not restrict?
Posted Mar 30, 2020 19:07 UTC (Mon)
by madscientist (subscriber, #16861)
[Link] (1 responses)
Restricting access to important system information to root will just provide incentive to give root access to more things, which seems like an anti-pattern to me.
If dmesg output is really a security issue then of course something needs to be done, but some careful thought is appropriate.
Posted Mar 30, 2020 21:32 UTC (Mon)
by simcop2387 (subscriber, #101710)
[Link]
Posted Apr 6, 2020 16:42 UTC (Mon)
by zdzichu (subscriber, #17118)
[Link]
Posted Apr 6, 2020 13:27 UTC (Mon)
by tao (subscriber, #17563)
[Link] (1 responses)
Posted Apr 15, 2020 12:34 UTC (Wed)
by ghane (guest, #1805)
[Link]
Posted Mar 30, 2020 7:44 UTC (Mon)
by mjg59 (subscriber, #23239)
[Link] (3 responses)
Posted Mar 30, 2020 11:17 UTC (Mon)
by geuder (subscriber, #62854)
[Link] (2 responses)
Would you mind refreshing my memory on this? I once listened to a kernel lockdown presentation by you, but not working on those questions on a regular basis must have led to too low a refresh rate, I am sorry.
The second paragraph of https://mjg59.dreamwidth.org/50577.html talks about keeping secrets secret from root. Do you have an example of what such a secret would be, and where it would come from, such that root cannot already access it without executing code?
TPMs are one way to keep private keys secret from root, even the kernel doesn't have them. Of course they are not applicable everywhere, so I don't intend to doubt that there are more use cases.
Posted Mar 30, 2020 20:01 UTC (Mon)
by mjg59 (subscriber, #23239)
[Link] (1 responses)
Posted Mar 31, 2020 5:44 UTC (Tue)
by geuder (subscriber, #62854)
[Link]
So you are saying developers have root on their workstation, the daemon is running on their workstation, but still the developer cannot prevent that auditing record from being written to the correct, persistent and unmodifiable log for every usage of the credentials?
In practice we would need to solve much more fundamental problems in user space than preventing root from getting kernel stack addresses to prevent them from copying and modifying the daemon. Or having the audit records written to a wrong location where an auditor will not find them. Do you have a pointer to the overall design of such a system?
alloca() using pseudo-code like this:
    new_stack = old_stack - (rand() & 0x3F);
(0x3F + 1 = 1 << 6). But now you need to follow up with the alignment that the x86 ABI requires. Let's mask out any LSB that violate the ABI:
    new_stack = (old_stack - (rand() & 0x3F)) & ~0xF;
    new_stack = old_stack - (rand() & 0x30);
    new_stack = old_stack - (rand() & 0x3F0);
dmesg: read kernel buffer failed: Operation not permitted
bool "Restrict unprivileged access to the kernel syslog"
default n
help
This enforces restrictions on unprivileged users reading the kernel
syslog via dmesg(8).
Actually there's even a sysctl file: /proc/sys/kernel/dmesg_restrict. It can be toggled at any time.
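A minimal sketch of checking that setting from a script, assuming a Linux system where /proc/sys/kernel/dmesg_restrict exists. The helper names are made up for this example, and the code only reads the file; changing the value requires privileges.

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/kernel/dmesg_restrict")


def parse_dmesg_restrict(text):
    """Interpret the sysctl's contents: True means unprivileged reads are blocked."""
    return int(text.strip()) != 0


def dmesg_restricted(path=SYSCTL):
    """Return True/False for the current setting, or None if it can't be read."""
    try:
        return parse_dmesg_restrict(path.read_text())
    except (OSError, ValueError):
        return None


print(dmesg_restricted())
```

Returning `None` when the file is missing or unreadable keeps the check usable on systems without this sysctl.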
Works on Ubuntu 20.04 too
sanjeev@T450s-disco:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu Focal Fossa (development branch)
Release: 20.04
Codename: focal
sanjeev@T450s-disco:~$ groups
sanjeev adm disk lp dialout cdrom sudo dip plugdev lpadmin sambashare docker
sanjeev@T450s-disco:~$ dmesg |tail -5
[ 7010.888825] wlp3s0: authenticated
[ 7010.890961] wlp3s0: associate with 38:d5:47:80:24:c4 (try 1/3)
[ 7010.891989] wlp3s0: RX AssocResp from 38:d5:47:80:24:c4 (capab=0x11 status=0 aid=4)
[ 7010.893546] wlp3s0: associated
[ 7010.913151] IPv6: ADDRCONF(NETDEV_CHANGE): wlp3s0: link becomes ready
sanjeev@T450s-disco:~$