Page allocation for address-space isolation

By Jonathan Corbet
April 3, 2025

Address-space isolation may well be, as Brendan Jackman said at the beginning of his memory-management-track session at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, "some security bullshit". But it also holds the potential to protect the kernel from a wide range of vulnerabilities, both known and unknown, while reducing the impact of existing mitigations. Implementing address-space isolation with reasonable performance, though, is going to require some significant changes. Jackman was there to get feedback from the memory-management community on how those changes should be implemented.

The core idea behind address-space isolation (last covered here in March), he began, is to run as much kernel code as possible in an address space where sensitive data is unmapped, and thus invisible to speculative-execution vulnerabilities. It is like the kernel page-table isolation that was introduced in response to the Meltdown hardware vulnerability, but with a higher degree of protection. Kernel page-table isolation created a new address space with most of the kernel removed; the new work adds a restricted address space that has holes in it where only the sensitive data has been removed.

The address-space isolation patches are deployed on a significant subset of Google's fleet, he said. Their current (public) form can be seen in this patch set posted in January. This version adds protection from bare-metal attackers, while previous versions had only protected the kernel from virtual machines. There are still two blockers that need to be addressed, though. One is a better design for page allocation (the intended topic for this session), while the other is a 70% degradation in file-I/O performance.

In order for the kernel to unmap memory containing sensitive data, it needs to know where that memory is, so kernel code must indicate sensitivity at allocation time by way of a new __GFP_SENSITIVE flag. There are some performance considerations here; for example, mapping pages may require first zeroing them, since they may have previously contained sensitive data. He would also like to avoid fragmenting the kernel's direct map if possible. Mike Rapoport, who has analyzed the cost of direct-map fragmentation, said that it is best avoided if possible, but is not that critical.
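As a rough illustration of marking sensitivity at allocation time, here is a toy model in Python. It is a sketch only: the `Gfp` bit values and the `ToyAllocator` class are invented for this example and do not correspond to kernel code; `Gfp.SENSITIVE` merely stands in for `__GFP_SENSITIVE`.

```python
from enum import IntFlag


class Gfp(IntFlag):
    """Toy allocation flags; the bit values are invented for this sketch."""
    KERNEL = 0x1
    SENSITIVE = 0x2  # stands in for __GFP_SENSITIVE


class ToyAllocator:
    """Hypothetical allocator that tracks which pages the restricted
    address space is allowed to see."""

    def __init__(self):
        self.restricted_map = set()  # page numbers mapped in the restricted space
        self.next_page = 0

    def alloc_page(self, flags):
        page = self.next_page
        self.next_page += 1
        # Only pages that were not marked sensitive stay visible
        # in the restricted address space.
        if not flags & Gfp.SENSITIVE:
            self.restricted_map.add(page)
        return page

    def visible_when_restricted(self, page):
        return page in self.restricted_map
```

The point of the model is that the allocator, not the caller, decides what gets mapped; the caller only declares sensitivity through the flag.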

Avoiding fragmentation, Jackman continued, requires grouping nonsensitive pages together in physical memory. He also preallocates page tables for restricted data down to the PMD (2MB) level, and maps or unmaps entire 2MB page blocks at one time. That helps to minimize fragmentation and translation lookaside buffer (TLB) invalidations.
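The arithmetic behind working at page-block granularity can be sketched as follows. This assumes 4KB base pages (as on x86-64); the helper names are made up for illustration.

```python
PAGE_SIZE = 4 * 1024          # assumed 4KB base pages
PMD_SIZE = 2 * 1024 * 1024    # 2MB, the PMD level mentioned in the talk
PAGES_PER_BLOCK = PMD_SIZE // PAGE_SIZE


def block_of(pfn):
    """Index of the 2MB page block containing page-frame number pfn."""
    return pfn // PAGES_PER_BLOCK


def block_range(pfn):
    """Half-open range [start, end) of page-frame numbers in pfn's block."""
    start = block_of(pfn) * PAGES_PER_BLOCK
    return start, start + PAGES_PER_BLOCK
```

Under these assumptions each block covers 512 pages, so changing the sensitivity of a block touches one 2MB region rather than 512 separately tracked 4KB pages.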

The patch set adds two new migration types to distinguish sensitive and non-sensitive data, and a new constraint that prevents an allocation of one sensitivity type from being satisfied with pages belonging to the other.
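A minimal model of that constraint might look like the following. The `BlockType` and `PageBlock` names are hypothetical and are not the identifiers used in the patch set; the sketch only shows that a request is never satisfied by borrowing a page from a block of the other sensitivity type.

```python
from enum import Enum


class BlockType(Enum):
    NONSENSITIVE = "nonsensitive"
    SENSITIVE = "sensitive"


class PageBlock:
    def __init__(self, block_type, free_pages):
        self.block_type = block_type
        self.free_pages = free_pages


def alloc_from(blocks, want):
    """Take one page from a block of the wanted type.

    Returns the index of the block used, or None if no block of that
    type has a free page. A block of the other type is never used,
    however many free pages it has.
    """
    for i, block in enumerate(blocks):
        if block.block_type is want and block.free_pages > 0:
            block.free_pages -= 1
            return i
    return None
```

When `alloc_from()` returns `None`, the only way forward in this model is to convert an entire block to the other type.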

There are some challenges that come with unmapping page blocks while allocating memory. The unmapping requires a TLB invalidation, but that cannot be done while the zone lock (needed to allocate the page block) is held. The invalidation must be done, though, before other CPUs are allowed to see the block as being sensitive. So the current code allocates the entire page block, even if only one page is needed, releases the zone lock, performs the invalidation, then reacquires the zone lock. After that, the memory can be marked sensitive and any unneeded pages can be freed.
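The sequence described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the kernel's code: the `Zone` class, its fields, and `flush_tlb()` are all invented, and a `threading.Lock` stands in for the zone lock.

```python
import threading

PAGES_PER_BLOCK = 512  # one 2MB block of assumed 4KB pages


class Zone:
    """Hypothetical stand-in for a memory zone."""

    def __init__(self, free_blocks):
        self.lock = threading.Lock()                  # plays the zone lock
        self.free_nonsensitive_blocks = list(free_blocks)
        self.sensitive_blocks = set()
        self.sensitive_free_pages = []                # (block, index) pairs
        self.tlb_flushes = 0

    def flush_tlb(self, block):
        # Stand-in for a cross-CPU TLB invalidation; the caller must
        # not hold the zone lock when calling this.
        self.tlb_flushes += 1


def alloc_sensitive_page(zone):
    with zone.lock:
        if zone.sensitive_free_pages:
            return zone.sensitive_free_pages.pop()
        if not zone.free_nonsensitive_blocks:
            return None
        # Take the whole block, even though only one page is needed.
        block = zone.free_nonsensitive_blocks.pop()

    # Lock released: unmap the block and invalidate before any other
    # CPU can see it as sensitive.
    zone.flush_tlb(block)

    with zone.lock:
        # Only now is the block marked sensitive; unneeded pages are freed.
        zone.sensitive_blocks.add(block)
        pages = [(block, i) for i in range(PAGES_PER_BLOCK)]
        page = pages.pop()
        zone.sensitive_free_pages.extend(pages)
        return page
```

In this model only the first allocation pays for an invalidation; subsequent requests are served from the pages left over in the converted block.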

That technique works, but leads to a potentially worrisome scenario. If all CPUs on the system decide to allocate a sensitive page at the same time, they will all end up doing the above dance, and they will all hammer each other with TLB invalidations. Jackman said that he is not sure that this case is worth optimizing for, but Matthew Wilcox said that database workloads could possibly act in just that way.

Mapping a block while allocating is easier, Jackman said; it is just a matter of populating the page tables and changing the migration type of the affected memory. It is essentially a normal case of migration-type fallback. But pages that might have held sensitive data have to be zeroed to prevent the possible exposure of that data; he wondered if the allocator should just zero pages unconditionally. The cost of doing so, he said, is not that bad. Jason Gunthorpe said, though, that zeroing can indeed be painful on systems with slower memory, and Suren Baghdasaryan said that there had once been a maple-tree performance regression caused by page zeroing.
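The mapping direction, with zeroing done up front, could be sketched like this. The `Block` class and `convert_to_nonsensitive()` are hypothetical; pages are modeled as `bytearray` objects and a 4KB page size is assumed.

```python
PAGE_SIZE = 4096        # assumed base page size
PAGES_PER_BLOCK = 512   # one 2MB block


class Block:
    """Hypothetical page block that starts out sensitive and unmapped."""

    def __init__(self):
        self.sensitive = True
        self.mapped = False
        self.pages = [bytearray(PAGE_SIZE) for _ in range(PAGES_PER_BLOCK)]


def convert_to_nonsensitive(block):
    # Zero every page first, so that stale sensitive data cannot become
    # visible once the block is mapped into the restricted address space.
    for page in block.pages:
        page[:] = bytes(PAGE_SIZE)
    # Then populate the mapping and change the block's type.
    block.mapped = True
    block.sensitive = False
```

The loop over all 512 pages is the cost that participants were concerned about on systems with slower memory.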

If the zeroing overhead is too much, Jackman continued, then the allocator will have to repeat the unmapping dance described above, or handle zeroing one page at a time, using a page flag. Somebody asked what the performance of the allocator was at Google; Jackman said that it worked well, but the version of address-space isolation running there does things differently than the version that has been posted for upstream consideration.
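The per-page alternative might be modeled as below. The `needs_zero` flag and both helper functions are invented for illustration; the session did not describe how such a flag would interact with the mapping itself, so the sketch only shows the zeroing cost being deferred until a page is actually handed out.

```python
PAGE_SIZE = 4096  # assumed base page size


class Page:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.needs_zero = False  # hypothetical per-page flag


def mark_block_for_lazy_zeroing(pages):
    """Cheap step at block-conversion time: set flags, touch no contents."""
    for page in pages:
        page.needs_zero = True


def hand_out(page):
    """Zero a page on first use after conversion.

    Returns True if the page had to be zeroed, False otherwise.
    """
    if not page.needs_zero:
        return False
    page.data[:] = bytes(PAGE_SIZE)
    page.needs_zero = False
    return True
```

With this approach, pages in a converted block that are never allocated are never zeroed at all.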

Wilcox asked if the kernel's Spectre mitigations can be safely disabled once address-space isolation is in use. For now, Jackman said, the isolation only protects user-space pages; the task of marking kernel allocations for sensitivity is far from complete. Once that has been done, it should be possible to turn off the mitigations, and to never need another one again. The mitigations are off at Google, and the patch yields a performance gain overall.

Since he had some time at the end of the session, Jackman launched into the problem of the file-I/O performance regression caused by address-space isolation. The problem is that all file pages are marked sensitive within the kernel, so every read causes a fault and an address-space transition. It is pointless to protect pages that a process is about to read anyway, but the page cache as a whole cannot be marked non-sensitive since it likely contains data that any given process cannot access. Earlier versions of the patch set had a separate "local nonsensitive" marker for data that processes could leak to themselves but, even with that, the kernel does not know at allocation time where file pages should be mapped.

Thus, he said, the kernel needs a process-local mapping for file pages. One solution would be to map entire files when a process opens them; that is relatively easy, but it is harder to know when to unmap file data. A process can lose access to a file in a number of ways; a security module might change its mind, or fanotify permission events can revoke access, for example. There must also be action taken when file pages are removed from the cache, either through reclaim or because a file is truncated.

The alternative, he said, is ephemeral, per-CPU mappings that are in place only as long as the operation is ongoing. Once the operation completes, the page tables would be torn down right away, but the TLB flush could be deferred to minimize the performance impact.
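A toy version of the ephemeral-mapping idea follows. Everything here is an assumption made for the sketch: the `CpuLocalMapper` class, the dictionary used as a page cache, and in particular the policy of flushing once a fixed number of stale entries has accumulated, which is just one possible way to defer the flush.

```python
class CpuLocalMapper:
    """Hypothetical per-CPU mapper with batched, deferred TLB flushes."""

    def __init__(self, flush_threshold=4):
        self.mapped = set()   # pages currently mapped for an operation
        self.stale = set()    # unmapped, but possibly still cached in the TLB
        self.flush_threshold = flush_threshold
        self.flushes = 0

    def map(self, page):
        self.mapped.add(page)

    def unmap(self, page):
        # The page-table entry goes away immediately ...
        self.mapped.discard(page)
        # ... but the TLB flush is deferred until enough entries pile up.
        self.stale.add(page)
        if len(self.stale) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.stale.clear()
        self.flushes += 1


def read_file_page(mapper, cache, page):
    """Map a page only for the duration of a single read."""
    mapper.map(page)
    try:
        return cache[page]
    finally:
        mapper.unmap(page)
```

Batching trades a short window in which stale translations may linger for fewer flushes overall.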

At that point, the session was truly out of time and the discussion ended with no conclusions on the file-I/O problem.

Jackman has posted the slides from this session.

Index entries for this article
Kernel	Memory management/Address-space isolation
Conference	Storage, Filesystem, Memory-Management and BPF Summit/2025

ASI vs ASR

Posted Apr 3, 2025 18:09 UTC (Thu) by ballombe (subscriber, #9523)

If, additionally, address-space isolation allowed us to get rid of address-space randomization, it would be a huge win.

ASI vs ASR

Posted Apr 3, 2025 21:59 UTC (Thu) by geofft (subscriber, #59789)

These seem unrelated to me, if I understand correctly? ASI is to protect against speculation revealing the contents of memory holding sensitive data. ASLR (userspace and KASLR) seems mostly useful for protecting the addresses of code, when you already know what code is there (the source code and compiled versions are public) and you want to arrange your return address in the stack or some function pointer to point to particular known code that acts as a useful gadget. So you're not interested in hiding the actual contents of memory, just the assignment of which address holds those contents.

ASI vs ASR

Posted Apr 4, 2025 14:13 UTC (Fri) by bjackman (subscriber, #109548)

Yeah they're kinda orthogonal tools for security.

BTW, if KASLR gets meaningfully in your way, IMO you should just switch it off. It's a road-bump to slow attackers down, it doesn't actually create any new security boundary or have any fundamental theoretical benefit. It's a kinda economic lever - it makes life Y% more difficult for the attacker and X% more difficult for the defender. If Y isn't significantly higher than X for your usecase, I'd say you should drop it.

ASI vs ASR

Posted Apr 6, 2025 18:05 UTC (Sun) by naesten (subscriber, #71199)

Not totally: with ASLR, pages mapped from insensitive files become sensitive if they have dynamic relocations applied to them; without ASLR, maybe they (conceptually) don't.

In practice, I assume that the copy-on-write triggered by ld.so's first write to such a page would mark the private copy as sensitive, and I guess the safest way to counter that is using a tool like prelink(8) to prevent the page getting modified (unless there's an address conflict)? That way, it still gets marked sensitive if it's not what most other processes see.

(All of this is assuming some kind of "insensitive file" support exists, of course.)

sensitive

Posted Apr 4, 2025 1:28 UTC (Fri) by mb (subscriber, #50428)

How do I know when to mark an allocation as "sensitive"?
Is that even a global and universally well defined property without contradicting views?

sensitive

Posted Apr 4, 2025 4:05 UTC (Fri) by NYKevin (subscriber, #129325)

Based on the article, they appear to be including all file blocks. That seems like a good starting point, but there are a number of other things that are "obviously" at least as sensitive as file blocks:

* All components of the RNG state (especially the entropy pool).
* memfds, pipe buffers, etc., basically anything where you're holding userspace's data for a little while.
* Drivers?

OTOH, if something is routinely displayed in e.g. top(1) without anyone getting upset about it being visible (PIDs, command lines, resource metrics, etc.), then it's probably not sensitive. I imagine that makes most if not all of the scheduler non-sensitive by extension (but I could be mistaken).

sensitive

Posted Apr 4, 2025 12:00 UTC (Fri) by bjackman (subscriber, #109548)

Yeah, it's tricky.

All the code I've posted so far just says everything allocated as GFP_USER is sensitive. So, not just file pages but also all anonymous user pages are sensitive. This already goes a pretty long way (it certainly adds a huge amount of extra engineering work for an attacker starting from a pre-ASI exploit) but as you've pointed out there are obvious things that it doesn't include that need to be protected. The other classic example to my mind is stuff copied into the kernel stack from userspace/VM guests.

In principle we should be able to flip this question on its head and instead make the question "what _isn't_ sensitive", i.e. instead of marking stuff as __GFP_SENSITIVE with the default being unprotected (we call this "denylist"), we could protect everything by default and mark exceptions as __GFP_NONSENSITIVE (we call this "allowlist"). So far the general feeling has been that it's more practical to start from something that people can actually deploy and evaluate without worrying about an unpredictable performance disaster. But we could certainly switch to an allowlist model later down the line, it would make good sense to me.

Minor correction

Posted Apr 4, 2025 11:33 UTC (Fri) by bjackman (subscriber, #109548)

> The mitigations are off at Google

I don't think I said this, but if I did I mis-spoke, sorry! I try to be transparent about the Google deployment where I can but the exact set of mitigations we use internally isn't something I'd discuss publicly.