Two topics in user-space access
User-space and kernel addresses
The kernel provides a whole set of functions that allow kernel-space code to access user data. Naturally, these functions have to handle all of the possible things that might happen, including data that has been paged out to disk or addresses that don't point to any valid data at all. In the latter case, functions like copy_from_user() will return -EFAULT, which is usually then passed back to user space. The faulty application, which is certainly checking for error returns from system calls, can then do the right thing.
Unpleasant things can happen, though, if the address passed in from user space points to kernel data. If the kernel actually dereferences that address, it could allow an attacker to get at data that should be protected. The access_ok() function exists to prevent this from happening, but it can't work if kernel developers forget to call it before passing an address to low-level user-space access functions like __copy_from_user() (the higher-level functions call access_ok() internally). This particular omission has led to some severe vulnerabilities in the past.
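The kind of check that access_ok() performs can be pictured with a small user-space model. What follows is only a sketch, not kernel code: it assumes a 32-bit layout in which user space ends at 0xc0000000, and the names model_access_ok() and MODEL_TASK_SIZE are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed boundary: user space owns every address below this point. */
#define MODEL_TASK_SIZE UINT32_C(0xc0000000)

/*
 * Model of an access_ok()-style range check. Returns true only when the
 * whole range [addr, addr + size) lies below MODEL_TASK_SIZE. The
 * comparison is arranged so that addr + size is never computed, which
 * keeps a huge size from wrapping around and slipping past the check.
 */
static bool model_access_ok(uint32_t addr, uint32_t size)
{
    if (size > MODEL_TASK_SIZE)
        return false;
    return addr <= MODEL_TASK_SIZE - size;
}
```

A low-level accessor that skips a check like this will happily operate on an address such as 0xc0100000, which is exactly the omission described above.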
This problem was, until recently, aggravated by the fact that, if an attacker tried to exploit a missing-access_ok() vulnerability using a kernel-space address that turned out to be invalid, the kernel would helpfully return -EFAULT. That would allow attackers to probe the kernel's address space at leisure until the target data structures had been found. Back in August 2018, Jann Horn added a check to catch this case and cause a kernel oops when it happens; attackers with access to a missing-access_ok() vulnerability were deprived of the ability to quietly dig around in kernel space, but there were some other, unexpected consequences as well.
As reported by Changbin Du, kernel probes ("kprobes") can be configured to access strings at any address — in either user or kernel space. The chances of such probes seeing invalid addresses are relatively high and, after Horn's patch, they would cause a kernel oops. Linus Torvalds pulled the suggested fix, but objected to the idea that a single function in kprobes (or anywhere else in the kernel) could accept both user-space and kernel addresses and manage to tell them apart.
On most architectures supported by Linux, it is usually relatively easy to distinguish user-space addresses from kernel-space addresses; that is because the two are confined to different parts of the overall address space. On 32-bit x86 systems, the default was for user space to own addresses below 0xc0000000, with the kernel owning everything above that point. Among other things, this layout improves performance by avoiding the need to change page tables when switching between user and kernel mode. But there is nothing that requires the address space to be laid out that way. A classic example is the "4G:4G" mode for x86, which gave the entire 32-bit address space to user space, then switched page tables on entry into the kernel so that the kernel, too, had the full address space.
When something like 4G:4G is in effect, the same address can be meaningful in both user and kernel space, but will point to different data. There is, at that point, no way to reliably distinguish the two types of addresses just by looking at them. There are other environments where the address spaces can overlap in this way, and defensive technologies like kernel page-table isolation are pushing even plain x86 systems in that direction. As a result, any attempt to handle both user-space and kernel addresses without knowing which they are is going to end in grief sooner or later. That explains why Torvalds became so unhappy at any attempt to do so.
The solution for kprobes will be to require accesses to specify whether they are meant for user space or kernel space. To that end, Masami Hiramatsu has been working on a patch set to add a new set of accessors for user-space data. Once those are added, and after some time has passed, it's likely that the current accessors will be changed to work with kernel-space data only.
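The idea behind separate accessors can be illustrated with a toy model in which two address spaces overlap, so that the same numeric address holds different data in each. Everything here is an assumption made for illustration: the arrays stand in for page tables, and model_read_user() and model_read_kernel() are hypothetical names, not the accessors from the patch set.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SPACE_SIZE 16

/* Two overlapping address spaces: address N is valid in both. */
static const char model_user_mem[MODEL_SPACE_SIZE]   = "user-data";
static const char model_kernel_mem[MODEL_SPACE_SIZE] = "kernel-data";

/* Shared helper: bounds-check the range, then copy from the chosen space. */
static int model_read(const char *space, size_t addr, char *dst, size_t len)
{
    if (addr > MODEL_SPACE_SIZE || len > MODEL_SPACE_SIZE - addr)
        return -EFAULT;
    memcpy(dst, space + addr, len);
    return 0;
}

/* The caller states which address space it means; nothing is guessed. */
static int model_read_user(size_t addr, char *dst, size_t len)
{
    return model_read(model_user_mem, addr, dst, len);
}

static int model_read_kernel(size_t addr, char *dst, size_t len)
{
    return model_read(model_kernel_mem, addr, dst, len);
}
```

Reading four bytes at address 0 yields "user" through one accessor and "kern" through the other; a single function given only the address would have no basis for choosing between them.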
Kprobes are not the only place where addresses have been mixed up in this way; it turns out that BPF programs will call bpf_probe_read() with either type of address and expect it to work. Changing that, Alexei Starovoitov said, could break existing user code. Torvalds responded, though, that: "It appears that people haven't understood that kernel and user addresses are distinct, and may have written programs that are fundamentally buggy". He would like to see such uses start to fail on the x86 architecture sometime soon so that users will fix their code before something more unpleasant happens.
The solution here will be similar to what is being done with kprobes. Two new functions (with names like bpf_user_read() and bpf_kernel_read()) will be introduced, and developers will be strongly encouraged to convert their code over to them. Eventually, bpf_probe_read() will go away entirely. But, as Torvalds noted, that will not be happening in the immediate future: "It's really a 'long-term we really need to fix this', where the only question is how soon 'long-term' is".
Keeping user space walled off
While the kernel must often access user space, unpleasant things can happen when the kernel does so accidentally. Many types of attacks depend on getting the kernel to read data (or execute code) that is located in user space and under the attacker's control. To prevent such things from happening, processor vendors have implemented features to prevent the kernel from accessing user-space pages from random places. Intel's supervisor-mode access prevention (SMAP) and Arm's privileged access never (PAN) mechanisms are examples of this type of feature; when this protection is available, the kernel tries to make use of it.
This protection must, of course, be removed whenever the kernel legitimately needs to get at user-space memory. For the most part, this is handled within the user-space access functions themselves, but there are cases where higher-level code may need to disable user-space access protection. If nothing else, the instructions to enable and disable protection are expensive, so code that performs a series of accesses can be sped up by just disabling protection once for the entire series. This is managed with calls to functions like user_access_begin() and user_access_end().
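The batching pattern can be sketched in user space with a flag standing in for the processor's protection state. This is a model under stated assumptions rather than kernel code: the model_ prefix marks every name as hypothetical, and a failed access simply returns false where real hardware would fault.

```c
#include <stdbool.h>
#include <stddef.h>

/* false means protection is in force; true means user access is allowed. */
static bool model_uaccess_enabled;

/* Counts the (expensive) switches of the protection state. */
static unsigned int model_toggle_count;

static void model_user_access_begin(void)
{
    model_uaccess_enabled = true;
    model_toggle_count++;
}

static void model_user_access_end(void)
{
    model_uaccess_enabled = false;
    model_toggle_count++;
}

/* A single access; it fails if attempted outside an open window. */
static bool model_unsafe_get(const int *src, int *dst)
{
    if (!model_uaccess_enabled)
        return false;
    *dst = *src;
    return true;
}

/* Open the window once, perform the whole series of accesses, close it. */
static bool model_copy_batch(const int *src, int *dst, size_t n)
{
    bool ok = true;

    model_user_access_begin();
    for (size_t i = 0; i < n && ok; i++)
        ok = model_unsafe_get(&src[i], &dst[i]);
    model_user_access_end();
    return ok;
}
```

In this model, copying n values costs two toggles instead of 2n, and the only function called inside the window is the access helper itself.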
The code that runs with user-space access protection disabled should be as short as possible. The more code that runs, the bigger the chance that it could contain an exploitable bug. But there is another hazard to be aware of: a call to schedule() could result in another process taking over the processor — with user-space access protection still disabled. Once that happens, there is no knowing when protection could be enabled again or how much buggy code might be executed in the meantime.
The desire to prevent this situation is why user_access_begin() comes with a special rule: users should call no other functions while user-space access prevention is disabled. But, as Peter Zijlstra noted, this rule is "currently unenforced and (therefore obviously) violated". That seems likely to change, though, as a result of his patch set enhancing the objtool utility with the ability to identify (and complain about) function calls in these sections of code. Functions known to be safe to call can be specially marked; the functions that perform user-space access are about the only obvious candidates for this annotation.
Both of these cases show that user-space access is trickier and less well understood than many developers expect. A couple of long-time kernel developers (at least) were surprised to learn that any particular address can be valid (but mapped differently) in both kernel and user space. It seems, though, that at least some of these problems can be addressed with better APIs and better tools.
