
Two address-space-isolation patches get closer

By Jonathan Corbet
October 27, 2020
Address-space isolation is the technique of removing a range of memory from one or more address spaces as a way of preventing accidental or malicious access to that memory. Since the disclosure of the Meltdown and Spectre vulnerabilities, the kernel has used one form of address-space isolation to make kernel memory completely inaccessible to user-space processes, for example. There has been a steady level of interest in using similar techniques to protect memory in other contexts; two patches implementing new isolation mechanisms are getting closer to being ready for merging into the mainline kernel.

memfd_secret()

The first of these is the memfd_secret() patch set from Mike Rapoport, which has been covered here before, so this overview will be relatively brief; see that article for more background. The purpose of this work is to allow a user-space process to create a "secret" memory area that is as inaccessible as possible outside of the process. Intended users include cryptographic libraries, which can use a secret area to hold cryptographic keys and keep them safe from prying eyes.

This functionality has, in recent revisions of the patch set, been moved into a separate system call:

    int memfd_secret(unsigned long flags);

The return value will be a file descriptor that can then be passed to mmap() to map an actual range of memory. For the most part, that memory will look (to the mapping process) like any other memory area, but there will be a couple of differences:

  • Pages of memory in this range will be removed from the kernel's direct map — the portion of the address space that lets the kernel access (almost) any physical page in the system. This makes it much harder for the kernel to access this memory, either intentionally or by way of an exploit.
  • If flags includes SECRETMEM_UNCACHED, then the memory will be mapped uncached if the underlying architecture supports it. Uncached memory will be far slower to access, but it is also immune to disclosure via many speculative-execution vulnerabilities.

Memory in a secret area is locked into RAM and cannot be swapped out; as such, it is counted against the owning process's locked-memory limit.

One ongoing problem with features like this is that removal of pages from the kernel's direct map is an expensive operation. The direct map uses huge pages, minimizing its impact on the system's translation lookaside buffer (TLB). Removing random pages from the map breaks up those huge pages, significantly increasing TLB pressure. In order to minimize this impact, the memfd_secret() patch set maintains a separate cache of physically contiguous pages to use for this purpose.

The rate of change for this patch set has been slowing for some time, so it may be close to being ready for inclusion. One never knows for sure with memory-management patches, though, until the patches are actually applied.

KVM protected memory

While memory-disclosure vulnerabilities are unwanted on any system, the stakes are often higher on systems that are running virtualized guests. Such machines may be running workloads from unrelated groups that are unwilling to share their secrets with each other in ordinary circumstances; the possibility of sharing a physical system with a guest that is under the control of an attacker makes memory protection an even more urgent problem. As a way of hardening these systems, CPU vendors have been adding memory-encryption mechanisms that make guest memory inaccessible to the kernel and to other guests. These features have their own cost, though, and support in hardware is far from universal at this point.

Kirill Shutemov has drawn an interesting conclusion from these technologies, though: the fact that systems using them still work means that access to that memory from the kernel or the hypervisor is not actually needed most of the time. So he has put together a patch set that takes a fully software-based approach. Rather than encrypt guest memory, systems running this code just unmap it. Using this feature requires support on the part of both the kernel and the guest.

Specifically, a KVM hypercall is added that allows guests to request that their memory be made inaccessible. The host kernel will respond by removing any memory allocated to the guest from the direct map, taking away its own ability to access that memory. In user space the approach is a bit different: any memory belonging to the guest remains mapped but is marked with PROT_NONE protections, again making it inaccessible. This will affect processes like the QEMU emulator, which will lose direct access to guest memory. The lack of mappings will naturally impede attacks coming from other guests as well. Within the guest, the guest kernel controls memory permissions as usual.

The resulting isolation protects guest memory from unwanted access by way of vulnerabilities in components like the kernel or QEMU. It is not a complete protection, though; if the host kernel is compromised to the level of arbitrary code execution, it can remap the pages and pillage them at leisure. But for the wide range of vulnerabilities that depend on getting the kernel to access a stray pointer, as well as for speculative-execution vulnerabilities, this unmapping should significantly raise the bar for any exploit attempt.

Of course, there are times when the kernel must access memory within guests to perform normal kernel functions. A second hypercall has been added for guests to indicate which memory they need to open up to the host kernel; those ranges will be mapped back into the host kernel's address space. DMA buffers for virtualized devices are one example of the type of memory that a guest would want to share with the host kernel in this way.

This work looks interesting, but there are a number of loose ends that need to be tied down before it can be considered ready. Unlike memfd_secret(), this work has no mechanism for avoiding direct-map fragmentation as pages are removed; since the amount of memory involved is rather larger in this case, the fragmentation problems are likely to be that much more severe. Unmapped guest memory cannot be migrated, which defeats the kernel's mechanisms for defragmenting memory. That is likely to cause all sorts of problems over time; Shutemov has acknowledged that this problem will need to be fixed before the patches can be merged. It is also currently not possible to reboot a guest with protected memory; Shutemov has suggested that this case could just be declared "unsupported", an idea that has already drawn complaints in the discussion.

The length of this list of issues implies that the KVM protected memory work is not something that will be seen in the mainline kernel in the near future. Both of these patch sets are a likely indicator of the direction things are going, though. Sharing as much as possible may improve performance, but it seems increasingly clear that the associated security problems are anything but easy to address. Separating address spaces as much as possible looks like a relatively straightforward way to sidestep many of those problems.

Index entries for this article
Kernel: Memory management/Address-space isolation
Kernel: Memory management/Virtualization
Kernel: System calls/memfd_secret()



Two address-space-isolation patches get closer

Posted Oct 27, 2020 23:41 UTC (Tue) by dullfire (guest, #111432) [Link] (1 responses)

why is memfd_secret another syscall? doesn't it make much more sense as a flag to memfd_create?

and if you really really want, the 'secret' flag could add a 'secret_flags' argument.

unless memfd_create does not error on unknown flags I don't see a reason not to have done that.

Two address-space-isolation patches get closer

Posted Nov 2, 2020 16:12 UTC (Mon) by rppt (subscriber, #125478) [Link]

I hesitated a lot and decided in favour of a new syscall after I started to draft the man page.
The description would be quite different and I thought it would be confusing.

Two address-space-isolation patches get closer

Posted Oct 28, 2020 14:29 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (2 responses)

How will this affect debuggers and core dumping? Will these be inaccessible through ptrace? What's the interaction with criu saving and restoring? If VM rebooting is being given up, I'd expect criu is also out of luck here.

Two address-space-isolation patches get closer

Posted Nov 2, 2020 17:49 UTC (Mon) by rppt (subscriber, #125478) [Link] (1 responses)

Secretmem prevents ptrace access so debuggers and core dumping won't be able to read these pages. As for criu, in theory it could read secretmem mappings, but this would reduce the security benefits of using secretmem.

Two address-space-isolation patches get closer

Posted Nov 2, 2020 22:14 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

> Secretmem prevents ptrace access…

*All* ptrace access, or just PTRACE_PEEKDATA? If it's the latter then ptrace could still be used to access the "secret" memory by first injecting code into the process to copy the data elsewhere.

I can't say I'm all that comfortable with the idea of handing processes rootkit-like tools to hide the contents of their memory from the system administrator, though I suppose the enforcement aspects could be patched out of the kernel easily enough without affecting the userspace ABI. This seems like something that could benefit malware (including, but not limited to, DRM) at least as much as security software.

Two address-space-isolation patches get closer

Posted Oct 28, 2020 14:51 UTC (Wed) by mss (subscriber, #138799) [Link]

In terms of address space isolation there is also KVM Address Space Isolation (ASI).

There will be talk about it later today at the KVM Forum:
https://kvmforum2020.sched.com/event/eE2A/kvm-address-spa...

Two address-space-isolation patches get closer

Posted May 26, 2021 2:36 UTC (Wed) by zengtm (guest, #74989) [Link]

A few years ago, linux-arm-msm + QHEE (Qualcomm Hypervisor Execution Environment) also posted patches to take memory away from the host kernel via a new hypercall "hyp_assign_phys()", with this proposed patch:

https://patchwork.kernel.org/project/linux-arm-msm/patch/...

> +	ret = hyp_assign_phys(qproc->dev, addr, size,
> +			      qproc->vmid_details.srcVM, src_count,
> +			      qproc->vmid_details.destVM,
> +			      qproc->vmid_details.destVMperm, dest_count);

Agree it is a trade-off between fragmentation and protection.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds