Exclusive page-frame ownership
An attacker wanting to get the kernel to run arbitrary code faces a problem: where can that code be put so that the kernel might run it? If the kernel can be convinced to run code found in user space, that problem becomes much easier to solve, since placing code in user-space memory is something that anybody can do. Since user-space memory remains mapped while the processor is running in kernel mode, all that needs to be done is to convince the kernel to jump to a user-space address. Years ago, it was possible to simply map the page at address 0 and find a bug that would cause the kernel to jump to a null pointer. Such simple attacks have been headed off, but more complex exploits are still common.
Obviously, the best solution is to ensure that the kernel will never try to jump to a user-space address. If one accepts that there will always be bugs, though, it makes sense to add other defenses, such as simply preventing the execution of user-space memory by the kernel. The PaX KERNEXEC and UDEREF mechanisms are designed to prevent this kind of user-space access. More recently, the processor manufacturers have gotten into the game as well; Intel now has supervisor mode access prevention and supervisor mode execute protection, while ARM has added privileged execute-never. On systems where these mechanisms are fully implemented, it should be impossible for the kernel to execute code found in user-space memory.
Except, as this paper from Vasileios P. Kemerlis et al. [PDF] points out, there's a loophole. User-space memory is accessed via a process's page tables, and the various access-prevention mechanisms work to block kernel access via those page tables. But the kernel also maintains a linear mapping of the entire range of physical memory (on 64-bit systems; the situation on 32-bit systems is a bit more complicated). This mapping has many uses within the kernel, with page-level memory management being near the top of the list. It provides a separate address for every physical page in the system. Importantly, it's a kernel-space address and, on some systems (x86 before 3.9 and all ARM), this memory range is executable by the kernel.
If an attacker can cause the kernel to jump into the direct mapping, none of the user-space access-prevention mechanisms will apply, even if the target address corresponds to a user-space page. So the direct mapping offers a convenient way to bypass these protections, with only one little catch: an attacker must be able to determine the physical address of the page containing the exploit code. As the paper points out, the pagemap files under /proc will provide that information, and, while these files can be disabled, distributions tend not to do that. So, on most systems, everything is in place to enable an attacker to exploit a bug that can cause a jump to an arbitrary address and the existing access-prevention mechanisms are powerless to stop it.
(Life gets a little harder on current x86 kernels, where it is no longer possible to directly execute code via the direct mapping. In such cases, the attacker must resort to return-oriented programming instead — not a huge obstacle for many attackers.)
The solution, as described in the paper and implemented in the exclusive page frame ownership (XPFO) patch set posted by Juerg Haefliger, is to take away the back-door access to user-space pages via the direct mapping. The mechanism is fairly simple in concept. Whenever a page is allocated for user-space use (something the kernel already indicates with the GFP flags in the allocation request), the direct mapping for that page is removed. Thus, even if an attacker can generate the directly mapped address for the page and get the kernel to jump there, the kernel will fault due to lack of access permissions to that page. When user space frees a page, it will be zeroed (to prevent attacks via hostile code left in the page) and returned to the direct map.
There are times when the kernel must access user-space memory, of course; the copy_to_user() and copy_from_user() functions are obvious examples. In such cases, the direct mapping is restored for the duration of the operation.
Naturally, there is a performance cost to this. The mapping and unmapping of pages in the kernel's address space will slow things down somewhat, as will the zeroing of returned user-space pages. Perhaps more significant, though, is a change in how the direct mapping is implemented. Normally, the kernel creates this mapping with huge pages; that, among other things, greatly reduces the pressure on the processor's translation lookaside buffer (TLB) when the direct mapping is accessed. But use of huge pages is incompatible with adding and removing mappings for individual (small) pages in that range, so, with XPFO, the huge-page mappings have to go. There is also some increased memory overhead resulting from the need to store more per-page information. All told, enabling XPFO has a performance cost up to about 3% in the worst case, though most of the benchmarks reported in the paper suffered much less than that.
The patch set needs some completion work before it can be seriously
considered for merging into the mainline. Once that point comes, one can
assume that the conversation will hinge on how effective it is at
preventing exploits and whether it is worth the performance cost. The fact
that the slowdown for kernel builds is 2.5% could prove to be a bit of an
obstacle in this discussion. A performance hit on that scale is a hard
thing to swallow, but so are successful exploits. Which pill will prove to
be the bitterest will have to be seen as the patch set progresses.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Address-space isolation |
| Kernel | Security/Kernel hardening |
