
More efficient removal of pages from the direct map

By Jonathan Corbet
March 25, 2026
The kernel's direct map provides code running in kernel mode with direct access to all physical memory installed in the system — on 64-bit systems, at least. It obviously makes life easier for kernel developers, but the direct map also brings some problems of its own, most of which are security-related. Interest in removing at least some pages from the direct map has been simmering for years; a couple of patch sets under discussion show some use cases for memory that has been removed from the direct map, and how such memory might be efficiently managed.

The good thing about the direct map is that it gives the kernel easy access to all of the system's installed memory. That is also the bad thing about the direct map, of course. When all of memory is accessible, it becomes a target for attackers. A stray pointer might be pressed into service to corrupt data anywhere in the system (though technologies like supervisor mode access prevention can help). Directly mapped memory is also susceptible to speculative-execution attacks, which can be employed to exfiltrate information from the kernel or from an unrelated process or virtual machine.

Many of these attacks can be thwarted by removing memory from the direct map; if memory is not reachable, the kernel cannot access it and, as a result, cannot disclose or modify its contents. The memfd_secret() system call will remove memory from the direct map for this reason, but wider use of direct-map removal has been slow to come. Memory that is not in the direct map is harder to manage, and there are a number of performance problems that can be caused by removing memory from the direct map. So, while various patches have been in circulation for a while, they have not generally cleared the bar for inclusion in the mainline kernel.
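For readers who have not used it, the following is a minimal user-space sketch of memfd_secret(). It assumes a kernel with secret-memory support enabled, and it invokes the system call by number in case the C library provides no wrapper. The fallback number 447 is an assumption that holds for x86-64 and the generic system-call table. The function name, return convention, and SECRET_LEN constant are invented for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: libc headers that predate memfd_secret() lack its number;
 * 447 is the value on x86-64 and in the generic syscall table. */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

enum { SECRET_LEN = 4096 };   /* illustrative size, not an API constant */

/*
 * Copy `len` bytes into secret memory and read them back.
 * Returns 0 on a successful round trip, 1 if secret memory could not be
 * set up here (old kernel, feature disabled, resource limits), and -1 if
 * the data read back does not match.
 */
int secret_roundtrip(const void *data, size_t len)
{
    assert(data != NULL && len > 0 && len <= SECRET_LEN);

    int fd = (int)syscall(SYS_memfd_secret, 0u);
    if (fd < 0)
        return 1;

    int result = 1;
    if (ftruncate(fd, SECRET_LEN) == 0) {
        /* Secret memory can only be mapped MAP_SHARED. */
        char *p = mmap(NULL, SECRET_LEN, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            /* The owning process uses the pages like any other memory. */
            memcpy(p, data, len);
            result = (memcmp(p, data, len) == 0) ? 0 : -1;
            munmap(p, SECRET_LEN);
        }
    }
    close(fd);
    return result;
}
```

A call like secret_roundtrip("key material", 13) exercises the whole path. From the caller's point of view nothing looks unusual, since the removal from the direct map happens entirely inside the kernel.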

guest_memfd

A common use case for large Linux systems is to run virtualized guests, often hosting multiple unrelated — and possibly hostile — users. It should come as no surprise that there are attackers out there who are interested in targeting some of those virtual machines from others on the same host. There are a number of efforts being made to thwart such attacks, at both the hardware and software levels; one of those is this patch set implementing direct-map removal of guest_memfd pages, posted by Nikita Kalyazin, built on work initially posted by Patrick Roy.

A guest_memfd is a form of memfd (a block of memory attached to a file descriptor) intended for use by virtual machines. This memory has a number of special characteristics, including the fact that it cannot normally be mapped into user space on the host system. That makes attacks from the host a bit harder, but there is more that can be done.

On systems with the right sort of hardware support, memory in a guest_memfd can be encrypted, making access from outside the virtual machine impossible. Not only is the host unable to decrypt the contents of the memory; any attempt to access it will generate a machine-check exception. That makes encrypted memory into a sort of land mine that would be best removed from the host kernel's address space entirely. Beyond that, though, encrypted memory is far from universally available. On systems where guest memory is not encrypted, removing it from the direct map will make it more resistant to attacks from the host, and far less susceptible to speculative-execution attacks from hostile guests.

So this patch set adds a new flag, GUEST_MEMFD_FLAG_NO_DIRECT_MAP, to the KVM_CREATE_GUEST_MEMFD ioctl() call provided by the KVM hypervisor. When that flag is present, the memory assigned to the newly created memfd will be removed from the host kernel's direct map. Internally, the series creates a new address-space flag, AS_NO_DIRECT_MAP, to mark an address space that is not directly mapped. When the memfd is freed, the underlying memory will be restored to the kernel's direct map.

Direct-map removal creates an interesting problem: how does KVM itself, running on the host, access the memory within the guest_memfd? There are a number of operations, many having to do with emulated I/O devices, that need that sort of access. The problem is solved by mapping the guest_memfd memory into the user-space address space (on the host) of the KVM process that is running the guest; KVM can then access that memory by way of functions like copy_from_user(). The end result is that the mapping of the memory has been shifted from the kernel's address space to a specific user space. That is sufficient to protect it from speculative-execution attacks on the kernel from a different guest.

This patch series has been circulating since July 2024, and has yet to clear the bar for merging into the mainline kernel. There are a few concerns holding this kind of work back. One is the performance cost of fragmenting the kernel's direct map; the direct map uses huge mappings whenever it can, which reduces pressure on the system's translation lookaside buffer (TLB), and removing individual pages forces those huge mappings to be split into smaller ones. Work done by Mike Rapoport a few years ago showed that the performance implications are not as bad as some had feared, but this fragmentation is still best avoided if possible. Meanwhile, flushing the TLB globally to reflect direct-map changes is expensive. Another roadblock, though, is that KVM can be built as a loadable module, and the memory-management developers are reluctant to export the ability to manipulate the direct map to modules. Given that there are other potential reasons to remove memory from the direct map, perhaps a better way of doing that is indicated here.
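To put a rough number on the fragmentation concern, here is a back-of-the-envelope sketch. It assumes x86-64 paging, with 4KB base pages, 2MB and 1GB huge mappings, and 512 entries per page table. The function name is invented for illustration; it counts how many leaf entries are needed to cover what used to be a single huge mapping once one 4KB page has been removed from it.

```c
#include <assert.h>
#include <stddef.h>

/* x86-64 page-table geometry: each table holds 512 entries. */
enum { ENTRIES_PER_TABLE = 512 };

/*
 * Number of present leaf entries (and thus potential TLB entries) needed
 * to map what was one huge mapping after a single 4KB page is removed.
 * levels_above_4k is 1 for a 2MB mapping and 2 for a 1GB mapping.
 */
size_t leaf_entries_after_hole(unsigned int levels_above_4k)
{
    assert(levels_above_4k >= 1 && levels_above_4k <= 2);
    /* Every level that must be split leaves 511 intact siblings behind;
     * at the lowest level, the 512th entry is the hole itself. */
    return (size_t)levels_above_4k * (ENTRIES_PER_TABLE - 1);
}
```

Under those assumptions, a 2MB mapping with one page removed needs 511 entries where one sufficed before, and a 1GB mapping needs 1022 (511 2MB entries plus 511 4KB entries).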

Enter the mermap

Brendan Jackman has been working on a more general address-space isolation (ASI) patch series for a while now. As a general rule, memory that does not appear within a given address space cannot be attacked by way of speculative-execution gadgets or more straightforward vulnerabilities. The kernel can already isolate its address space from user space to defend against Meltdown attacks, but this technique could be taken much farther. For example, many system calls could be at least partially implemented without access to most of the kernel's address space, with wider access only being granted for the code that strictly needs it.

Needless to say, any sort of practical address-space isolation will require removing memory from the direct map. As part of an effort to push this work forward, Jackman has posted a patch series to make the management of unmapped memory easier and more efficient.

Specifically, this series adds a new GFP flag, __GFP_UNMAPPED, that can be used to request unmapped memory from the page allocator. This allows the page allocator to manage these pages in a relatively efficient manner; they can be grouped together in a separate memory block, allowing them to be allocated and freed without changing the direct map every time (or fragmenting the direct map), and without the need for global TLB flushes. Allocating unmapped pages becomes a lot like allocating any other sort of page.
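The benefit of grouping can be seen in a toy user-space model. This is not Jackman's implementation; the structure, the function names, and the 512-page block size are all assumptions made for illustration. The model only counts how often the direct map would have to change when unmapped pages are handed out from blocks that were unmapped in one go, compared with unmapping and remapping each page individually.

```c
#include <assert.h>
#include <stddef.h>

enum { PAGES_PER_BLOCK = 512 };   /* assumption: 2MB blocks of 4KB pages */

struct unmapped_pool {
    size_t free_pages;          /* unmapped pages ready to hand out */
    size_t direct_map_updates;  /* times the direct map had to change */
};

/* Allocate one unmapped page; only an empty pool touches the direct map. */
void pool_alloc(struct unmapped_pool *pool)
{
    assert(pool != NULL);
    if (pool->free_pages == 0) {
        /* Unmap a whole block at once: one update, one TLB flush. */
        pool->direct_map_updates++;
        pool->free_pages = PAGES_PER_BLOCK;
    }
    pool->free_pages--;
}

/* Freed pages stay unmapped and simply return to the pool. */
void pool_free(struct unmapped_pool *pool)
{
    assert(pool != NULL);
    pool->free_pages++;
}

/* Run `cycles` alloc/free pairs; return the direct-map updates incurred. */
size_t grouped_updates(size_t cycles)
{
    struct unmapped_pool pool = { 0, 0 };
    for (size_t i = 0; i < cycles; i++) {
        pool_alloc(&pool);
        pool_free(&pool);
    }
    return pool.direct_map_updates;
}

/* Baseline: unmap on every allocation, remap on every free. */
size_t per_page_updates(size_t cycles)
{
    return 2 * cycles;
}
```

In this model, a thousand allocate-and-free cycles cost a single direct-map update when pages are grouped, against two thousand when each page is unmapped and remapped on its own. The real allocator has many more concerns (block reclaim, fallback between types, and so on) that the sketch ignores.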

Except, of course, there are some complications. For example, another flag implemented by the page allocator is __GFP_ZERO, which requests that the pages be zero-filled. How can the kernel perform that zeroing without access to the memory involved? The answer is something that Jackman calls the "mermap", which is evidently a shortening of "ephemeral mapping". The pages in question are temporarily mapped into the kernel's address space, but only for the local CPU; they can then be zeroed, and there is no need for the global TLB flush that a wider mapping would require. An implication of this implementation, though, is that holding references to ephemerally mapped pages blocks migration for the running task, since it would lose access to those pages on any other CPU.
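The map, zero, unmap sequence has a loose user-space analogy, sketched here with a memfd whose pages are not mapped anywhere in the process until a short-lived window is opened onto them. This is only an analogy: it says nothing about the CPU-local nature of the mermap or its TLB behavior, and the function names and WINDOW_LEN are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { WINDOW_LEN = 4096 };   /* illustrative size */

/*
 * Zero `len` bytes at page-aligned `offset` in `fd` through a short-lived
 * mapping. Returns 0 on success, -1 on failure.
 */
int zero_via_ephemeral_mapping(int fd, off_t offset, size_t len)
{
    assert(len > 0);
    void *window = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, offset);
    if (window == MAP_FAILED)
        return -1;
    memset(window, 0, len);       /* reachable only inside this window */
    return munmap(window, len);   /* tear the mapping down again */
}

/*
 * Dirty a region with write(), zero it through the window, and check it
 * with pread(). Returns 0 if the region reads back as zeroes, 1 if the
 * memfd could not be created, and -1 on any other failure.
 */
int ephemeral_zero_demo(void)
{
    unsigned char buf[WINDOW_LEN];
    int result = -1;

    /* Called by number so that no libc wrapper is required. */
    int fd = (int)syscall(SYS_memfd_create, "unmapped-pages", 0u);
    if (fd < 0)
        return 1;

    memset(buf, 0xAA, sizeof(buf));
    if (write(fd, buf, sizeof(buf)) == WINDOW_LEN &&
        zero_via_ephemeral_mapping(fd, 0, WINDOW_LEN) == 0 &&
        pread(fd, buf, sizeof(buf), 0) == WINDOW_LEN) {
        result = 0;
        for (size_t i = 0; i < sizeof(buf); i++)
            if (buf[i] != 0)
                result = -1;
    }
    close(fd);
    return result;
}
```

The point of the pattern is that the memory is addressable only between the mmap() and munmap() calls. The kernel-side mermap aims for the same property while also avoiding a global TLB flush.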

The page allocator needs to be able to track unmapped pages to be able to efficiently manage them. As Jackman points out in the series cover letter, the page allocator already has a mechanism for grouping pages by an attribute: the migration type, which is used to separate, for example, allocations of movable memory from those for unmovable memory. The migration type could thus be used to describe unmapped memory but, in current kernels, it can only track one attribute. A page might be both unmapped and unmovable, for example; the attributes are orthogonal to each other, and should be tracked separately. Migration types, as implemented, cannot support orthogonal attributes.

To address that shortcoming, Jackman's series adds the concept of a "freetype", described as "just a migratetype plus some flags". The current uses of migration types are, themselves, migrated to freetypes in a relatively large patch; work later in the series then adds the unmapped attribute and enables the page allocator to work with it, culminating in the implementation of __GFP_UNMAPPED.
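As a rough illustration of "a migratetype plus some flags", consider the following sketch. The migration-type names are a subset of those found in the kernel's include/linux/mmzone.h; everything else (the structure layout, the flag name, the count constants, and the indexing scheme) is hypothetical and may look nothing like the code in the series.

```c
#include <assert.h>
#include <stdbool.h>

/* A subset of the kernel's migration types. */
enum migratetype {
    MIGRATE_UNMOVABLE,
    MIGRATE_MOVABLE,
    MIGRATE_RECLAIMABLE,
    NR_MIGRATETYPES      /* hypothetical name for the count */
};

/* Hypothetical flag bits; the series may name and number them differently. */
enum { FREETYPE_UNMAPPED = 1 << 0, NR_FREETYPE_FLAG_BITS = 1 };

/* "Just a migratetype plus some flags". */
struct freetype {
    enum migratetype migratetype;
    unsigned int flags;
};

struct freetype make_freetype(enum migratetype mt, unsigned int flags)
{
    struct freetype ft = { mt, flags };
    return ft;
}

/* Index of the free list holding pages of this freetype: every
 * combination of migratetype and flags gets its own list. */
unsigned int freetype_index(struct freetype ft)
{
    assert((unsigned int)ft.migratetype < NR_MIGRATETYPES);
    assert(ft.flags < (1u << NR_FREETYPE_FLAG_BITS));
    return ft.flags * NR_MIGRATETYPES + (unsigned int)ft.migratetype;
}

bool freetype_is_unmapped(struct freetype ft)
{
    return (ft.flags & FREETYPE_UNMAPPED) != 0;
}
```

Because the flags are kept apart from the migration type, "unmapped and unmovable" and "unmapped and movable" land on different free lists (indexes 3 and 4 in this sketch), which a single migration-type value cannot express.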

This mechanism, once in place, would allow the guest_memfd machinery to more efficiently work with unmapped pages. It also "serves as a Trojan horse to get the page allocator into a state where adding ASI's features 'Should Be Easy'". None of that work appears in this series, though. What does appear is an update to memfd_secret() to use __GFP_UNMAPPED pages. This implementation is described as "hacky", though, since Jackman feels it could be optimized further.

Overall, __GFP_UNMAPPED is only in its second revision; that is early days for this sort of core page-allocator change. It is likely to require some work yet before it can be considered for the mainline. This series will, though, certainly serve as fodder for discussion at the upcoming Linux Storage, Filesystem, Memory Management, and BPF Summit, to be held in early May, as will the guest_memfd work that it supports. Stay tuned.

Index entries for this article
Kernel: Memory management/Address-space isolation
Kernel: Memory management/Direct map



Fun details for TLB flushing enthusiasts

Posted Mar 25, 2026 17:13 UTC (Wed) by bjackman (subscriber, #109548)

Thanks for the coverage, I'm ever thankful that the community gets to have articles like this :)

> The pages in question are temporarily mapped into the kernel's address space, but only for the local CPU;

The mappings are _logically_ CPU-local (it's a software bug to access them from another CPU, very similar to kmap_local_page()) but _physically_ they are actually shared across the process. This has corollaries for the next bit:

> and there is no need for the global TLB flush that a wider mapping would require

At least, on x86. On arm64 they will need a flush on all CPUs in the process, because of the break-before-make rule. Fortunately there is HW support for this, hopefully it's much less of a perf issue than it would be on x86. (Newer AMD CPUs have that support too, not sure of the Intel status).

Also worth noting that even on x86, you still need a full shootdown _at some point_ to stop CPU attacks leaking the data that "logically" belongs to other CPUs' mermap region. When exactly that flush is needed depends on the data's lifecycle. E.g. if you mermap a file page you need to make sure you flush the TLB before the process loses logical access to the file.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds