LWN: Comments on "Per-system-call kernel-stack offset randomization"

Per-system-call kernel-stack offset randomization

ghane — Wed, 15 Apr 2020 12:34:09 +0000

Works on Ubuntu 20.04 too

sanjeev@T450s-disco:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu Focal Fossa (development branch)
Release:	20.04
Codename:	focal
sanjeev@T450s-disco:~$ groups
sanjeev adm disk lp dialout cdrom sudo dip plugdev lpadmin sambashare docker
sanjeev@T450s-disco:~$ dmesg |tail -5
[ 7010.888825] wlp3s0: authenticated
[ 7010.890961] wlp3s0: associate with 38:d5:47:80:24:c4 (try 1/3)
[ 7010.891989] wlp3s0: RX AssocResp from 38:d5:47:80:24:c4 (capab=0x11 status=0 aid=4)
[ 7010.893546] wlp3s0: associated
[ 7010.913151] IPv6: ADDRCONF(NETDEV_CHANGE): wlp3s0: link becomes ready
sanjeev@T450s-disco:~$

Per-system-call kernel-stack offset randomization

zdzichu — Mon, 06 Apr 2020 16:42:25 +0000

Actually there's even a sysctl file: /proc/sys/kernel/dmesg_restrict. It can be toggled at any time.
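
For anyone who wants to check the current setting from a script, here is a minimal sketch. The helper names (`parse_restrict`, `dmesg_restricted`) are made up for illustration; only the /proc path comes from the comment above. It only reads the value, since changing it needs privileges.

```python
from pathlib import Path

# Path taken from the comment above; everything else here is illustrative.
DMESG_RESTRICT = Path("/proc/sys/kernel/dmesg_restrict")


def parse_restrict(text):
    """Interpret the sysctl file's contents: non-zero means restricted."""
    return int(text.strip()) != 0


def dmesg_restricted(path=DMESG_RESTRICT):
    """Return True/False for the current setting, or None if it cannot be read."""
    try:
        return parse_restrict(path.read_text())
    except (OSError, ValueError):
        # File missing (non-Linux system, old kernel) or unexpected contents.
        return None


if __name__ == "__main__":
    print("dmesg restricted:", dmesg_restricted())
```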

Per-system-call kernel-stack offset randomization

tao — Mon, 06 Apr 2020 13:27:13 +0000

Debian, at the very least, restricts dmesg by default. Ubuntu, on the other hand, doesn't (at least not 18.04 and 19.04; I don't have any newer systems to test on).

Per-system-call kernel-stack offset randomization

geuder — Tue, 31 Mar 2020 05:44:13 +0000

Thanks for your reply, this is indeed a very interesting use case. A strict interpretation of regulation would require us to use that at my work every day.

So you are saying that developers have root on their workstation and the daemon is running on that workstation, yet the developer still cannot prevent the audit record from being written to the correct, persistent and unmodifiable log for every use of the credentials?

In practice we would need to solve much more fundamental problems in user space than preventing root from getting kernel stack addresses to prevent them from copying and modifying the daemon. Or having the audit records written to a wrong location where an auditor will not find them. Do you have a pointer to the overall design of such a system?

Per-system-call kernel-stack offset randomization

simcop2387 — Mon, 30 Mar 2020 21:32:01 +0000

You don't have to give it root, just give it CAP_SYSLOG which if it's a tool to gather diagnostic information would probably be needed anyway.
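
A rough way for such a tool to check whether it actually holds the capability is to look at the effective capability mask in /proc/self/status. This is only a sketch: the function names are invented for illustration, and the bit number 34 for CAP_SYSLOG is taken from <linux/capability.h>.

```python
CAP_SYSLOG = 34  # bit number of CAP_SYSLOG in <linux/capability.h>


def has_capability(cap_eff_hex, cap=CAP_SYSLOG):
    """Test one bit in a hexadecimal capability mask such as the CapEff field."""
    return bool(int(cap_eff_hex, 16) & (1 << cap))


def effective_caps(status_path="/proc/self/status"):
    """Return the CapEff hex string for this process, or None if unavailable."""
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("CapEff:"):
                    return line.split()[1]
    except OSError:
        pass  # not on Linux, or /proc is not mounted
    return None


if __name__ == "__main__":
    caps = effective_caps()
    if caps is None:
        print("could not read the effective capability mask")
    else:
        print("CAP_SYSLOG held:", has_capability(caps))
```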

Per-system-call kernel-stack offset randomization

mjg59 — Mon, 30 Mar 2020 20:01:15 +0000

Sure. As an example - the credentials required for a user to be able to access production resources from a development workstation may be held in a daemon that generates an audit record for every access. If that secret can be extracted, it can be used without generating that audit trail and could potentially be copied to a different machine. LSM policy can be written to prevent root from being able to interact with that daemon, but that's not helpful if it's relatively straightforward for root to get code execution in the kernel.

Per-system-call kernel-stack offset randomization

madscientist — Mon, 30 Mar 2020 19:07:44 +0000

I expect there will be repercussions. For example, we have a daemon that runs on systems that can be asked to retrieve diagnostic information about a system, and dmesg output is often a critical aspect of that (for example, determining if processes were killed due to OOM, or hardware issues, etc.) Of course, we do not want such a daemon to have to run as root.

Restricting access to important system information to root will just provide incentive to give root access to more things, which seems like an anti-pattern to me.

If dmesg output is really a security issue then of course something needs to be done, but some careful thought is appropriate.

Per-system-call kernel-stack offset randomization

jimi — Mon, 30 Mar 2020 17:58:08 +0000

Ah - thank you for enlightening me.

So I'm left wondering, why not set the default to y? At least one distro runs with this restricted with no ill effects. What are the reasons to not restrict?

Per-system-call kernel-stack offset randomization

zdzichu — Mon, 30 Mar 2020 17:14:38 +0000

It's an upstream kernel option:

config SECURITY_DMESG_RESTRICT
        bool "Restrict unprivileged access to the kernel syslog"
        default n
        help
          This enforces restrictions on unprivileged users reading the kernel
          syslog via dmesg(8).

It has been there for over 9 years.

Per-system-call kernel-stack offset randomization

jimi — Mon, 30 Mar 2020 17:09:29 +0000

I would guess that little would break. I base this guess on the fact that Slackware does not allow non-root access to dmesg and has not for a long time. If Slackware can find a way to restrict dmesg without breaking things, surely others can as well.

jimi@black:~> dmesg
dmesg: read kernel buffer failed: Operation not permitted

Per-system-call kernel-stack offset randomization

geuder — Mon, 30 Mar 2020 11:17:02 +0000

> a bunch of cases

Would you mind refreshing my memory on this? I once listened to a kernel lockdown presentation by you, but not working on those questions on a regular basis must have led to too low a refresh rate, I am sorry.

The second paragraph of https://mjg59.dreamwidth.org/50577.html talks about keeping secrets secret from root. Do you have an example of what such a secret would be, and where it would come from, such that root cannot already access it without executing code?

TPMs are one way to keep private keys secret from root, even the kernel doesn't have them. Of course they are not applicable everywhere, so I don't intend to doubt that there are more use cases.

Per-system-call kernel-stack offset randomization

geuder — Mon, 30 Mar 2020 10:53:06 +0000

You are correct, dmesg still seems to be open everywhere. I must have mixed that up with the experience that journalctl does not show the system journal unless you add yourself to the appropriate group, or that /var/log/* files are increasingly protected.

So what would break if we protect /dev/kmsg? Reading random text messages doesn't look like a desirable design for any purpose. Except for systemd-journald of course, but that runs as root already.

Per-system-call kernel-stack offset randomization

mjg59 — Mon, 30 Mar 2020 07:44:40 +0000

There's a bunch of cases where you don't want root to have arbitrary code execution in the kernel, so there's still a benefit in preventing root from knowing this.

Per-system-call kernel-stack offset randomization

gutschke — Mon, 30 Mar 2020 06:12:57 +0000

I just checked on four completely different Linux systems that I could quickly log into. All four of them run a different distribution. Some are old-ish, others are bleeding edge. All of them allow non-privileged users to invoke "dmesg". I don't doubt that this can be restricted. But I suspect that it has unexpected side effects and that's why distributions don't do so by default.

If 90%+ of all userland doesn't restrict access to kernel messages, then maybe it is a good idea for the kernel to assume that this type of data is available to an attacker.

Per-system-call kernel-stack offset randomization

gutschke — Mon, 30 Mar 2020 06:03:42 +0000

Picture open-coding alloca() using pseudo-code like this:

new_stack = old_stack - (rand() & 0x3F);

That would do your stack address randomization, and it would give you 6 bits of randomness, as (0x3F + 1 = 1 << 6). But now you need to follow up with the alignment that the x86 ABI requires. Let's mask out any LSBs that violate the ABI:

new_stack = (old_stack - (rand() & 0x3F)) & ~0xF;

But that's roughly equivalent to writing:

new_stack = old_stack - (rand() & 0x30);

Actually, if you look really closely, the transformation isn't exactly correct and sometimes results in a value that is off by 0x10; but let's not worry about that for now. Fixing that would just make the code needlessly complicated and not contribute anything useful to this discussion.

In any case, as you can see we now only have two bits of randomization. That sounds barely worth the effort. If we want to regain all six bits of stack address randomization, we need to instead do something like:

new_stack = old_stack - (rand() & 0x3F0);

And that's the 1kB (aka 0x400 bytes) of potentially wasted space. And again, we have lost 0x10 bytes because of my sloppy math a little earlier. Please forgive me, but it makes things easier to read.
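
The masking arithmetic above can be checked by brute force. The sketch below is illustrative only: the function names and the starting address are made up, and none of it is taken from the actual patch. It enumerates every offset each variant can produce, with the low four bits cleared for alignment, and it also shows the off-by-0x10 case mentioned earlier.

```python
ALIGN = 16  # stack alignment assumed from the x86 ABI discussion above


def offsets_after_alignment(mask, old_stack=0x10000):
    """Offsets produced by (old_stack - (r & mask)) with the low 4 bits cleared."""
    seen = set()
    for r in range(0x400):  # covers every bit used by the masks below
        new_stack = (old_stack - (r & mask)) & ~(ALIGN - 1)
        seen.add(old_stack - new_stack)
    return sorted(seen)


def offsets_from_mask(mask):
    """Offsets produced by old_stack - (r & mask), with no extra rounding."""
    return sorted({r & mask for r in range(0x400)})


if __name__ == "__main__":
    # Subtract 6 random bits, then align: rounding down adds a fifth position.
    print(offsets_after_alignment(0x3F))   # [0, 16, 32, 48, 64]
    # The "roughly equivalent" form: only two useful bits survive.
    print(offsets_from_mask(0x30))         # [0, 16, 32, 48]
    # Six useful bits need a mask ten bits wide.
    wide = offsets_from_mask(0x3F0)
    print(len(wide), max(wide))            # 64 1008
```

The largest offset is 0x3F0 = 1008 bytes, which is the "1kB minus 0x10" from the sloppy math.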

Per-system-call kernel-stack offset randomization

Cyberax — Mon, 30 Mar 2020 05:45:35 +0000

No, randomization can happen only at 16-byte intervals. So you're wasting 64*16 bytes.

Per-system-call kernel-stack offset randomization

geuder — Mon, 30 Mar 2020 05:41:47 +0000

> there are still numerous kernel messages that will expose addresses of data structures, including the stack, in the kernel log.

The kernel log? Where is the kernel log readable to non-root in any current system that tries to be somewhat security aware? Once the attacker is root there are probably worse problems.

Even without buying that argument I don't say that the approach is useless. Smart crackers will find ways nobody has thought of (or at least not prepared for), so defense in depth should not harm.

Per-system-call kernel-stack offset randomization

jorgegv — Mon, 30 Mar 2020 05:35:08 +0000

Sorry, I don't understand your reasoning. If up to 64 bytes are wasted by randomization and up to 16 bytes are wasted due to alignment, that makes a maximum of 80 bytes wasted, not 1kB.

I assume the randomization is done once per syscall invocation, right?

Per-system-call kernel-stack offset randomization

gutschke — Mon, 30 Mar 2020 03:05:03 +0000

The point of this patch is that instead of having a single base address for the system call stack, there should be 32 or 64 distinct addresses (depending on architecture). This makes it impossible for an attacker to reliably guess addresses on the stack. And hopefully, that will cause (some) attacks to fail with some sort of kernel crash. Of course, if there is no crash, an attacker can just keep guessing and eventually they'll get lucky.

Randomization happens by virtue of random amounts of data being allocated on the stack. This happens right at the point of the transition from user space to kernel space.

But alloca() knows about the x86 ABI. And the ABI requires that stack frames are aligned in 16 byte increments. That's needed, because some CPU instructions want aligned data (I believe this mostly affects SSE instructions). The compiler assumes that stacks are always aligned when the program starts (or in this case, when the system call starts executing in the kernel) and then makes sure the necessary padding is added whenever a function call is made.

There really isn't any unused memory that is readily available for other purposes.
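
To put rough numbers on the guessing argument at the start of this comment, here is a small sketch. It assumes things that are not stated above: every position is equally likely, a fresh position is picked for each system call, and a wrong guess does not crash the kernel. The function name is made up.

```python
def chance_of_hit(positions, attempts):
    """Probability that at least one of `attempts` independent guesses is right.

    Assumes a uniformly random position, re-drawn for every attempt.
    """
    return 1 - (1 - 1 / positions) ** attempts


if __name__ == "__main__":
    for positions in (32, 64):
        for attempts in (1, 10, 100):
            p = chance_of_hit(positions, attempts)
            print(f"{positions} positions, {attempts} attempts: {p:.1%}")
```

Under those assumptions an attacker needs about as many attempts, on average, as there are positions, which is why the crashes matter more than the raw number of bits.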

Per-system-call kernel-stack offset randomization

Paf — Mon, 30 Mar 2020 02:25:04 +0000

I'm a bit confused by this - If this is indeed the case, why not use the rest of the space rather than leave it unused? Can you explain the calculation that gets you to 1K in more detail?

Per-system-call kernel-stack offset randomization

gutschke — Sun, 29 Mar 2020 19:10:38 +0000

In the discussion of how many bits of randomness get introduced, there is a statement that randomness comes at the cost of increased memory usage. At first sight, this seems puzzling, as even 6 bits would only yield 64 bytes of added stack usage. But what the article failed to mention is the fact that the x86 ABIs require a 16-byte stack alignment. So, those 6 bits turn into 1kB of extra memory usage. That's quite significant with notoriously small kernel stacks (8kB or 16kB depending on architecture).
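
As a back-of-the-envelope check of that 1kB figure, here is a small sketch. The function name is made up, and the stack sizes are the 8kB and 16kB mentioned above.

```python
def worst_case_offset_bytes(random_bits, alignment=16):
    """Size of the region spanned by 2**random_bits aligned stack positions."""
    return (1 << random_bits) * alignment


if __name__ == "__main__":
    cost = worst_case_offset_bytes(6)
    for stack_size in (8 * 1024, 16 * 1024):
        share = 100 * cost / stack_size
        print(f"{cost} bytes is {share:.2f}% of a {stack_size}-byte stack")
```

With 6 bits that works out to 1024 bytes, or an eighth of an 8kB stack.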

Per-system-call kernel-stack offset randomization

marcH — Sat, 28 Mar 2020 23:53:15 +0000

Is there any chance this may help reveal "Programming by coincidence" bugs? Crashes due to memory corruption come and go depending on which way the wind blows. This looks like a bit more wind so I'm wondering.