Memory passthrough for virtual machines
The translation of a virtual address to a physical address is a more complex affair than it seems. The lookup operation must work through as many as five levels of page tables. At each level, a memory load must be performed and TLB misses are possible, meaning that the lookup operation can be slow. It gets worse when this happens in the guest, though; guest "physical" addresses are virtual addresses in the host space; as a result, the lookup at each level of the guest page-table hierarchy can require walking through the full hierarchy in the host. The worst-case lookup, when both the host and the guest are running with five-level page tables, could require 35 loads, which can hurt.
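Where that worst-case number comes from can be sketched with a little arithmetic; this is back-of-the-envelope accounting rather than anything presented in the session. Every guest-physical address touched during the guest walk must itself be translated through the full host hierarchy:

    /*
     * Illustrative arithmetic only: worst-case memory loads for a nested
     * page-table walk with g guest levels and h host levels.  The guest walk
     * touches g+1 guest-physical addresses (the guest's root table, each
     * lower-level table, and the final data page), each of which needs a full
     * h-load walk of the host tables; the g guest page-table entries must be
     * loaded as well.
     */
    #include <stdio.h>

    static int nested_walk_loads(int g, int h)
    {
            return (g + 1) * h + g;
    }

    int main(void)
    {
            printf("4-level guest on 4-level host: %d loads\n",
                   nested_walk_loads(4, 4));      /* 24 */
            printf("5-level guest on 5-level host: %d loads\n",
                   nested_walk_loads(5, 5));      /* 35 */
            return 0;
    }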
Optimizing this situation, Pasha Tatashin said, starts from a recognition that work is being duplicated in the virtualized environment. He was not referring just to page-table lookups; memory is also zeroed twice when virtualization is in use. The solution he has in mind is to push as much of the work as possible to the host system, in ways that are not transparent to the guest.
Specifically, he has implemented a driver for a "memctl" device that is
present on the guest side; this device provides many of the
memory-management operations that are already available through the guest's
system-call interface: mmap(), mlock(),
madvise(), and so on. The difference is that, for the most part,
these operations are passed through to the host for execution there rather
than being handled by the guest. Additionally, the memctl device does not
zero memory on the guest side; it counts on the host to take care of that when
needed.
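The session did not describe what a memctl request looks like on the wire, so the sketch below is purely hypothetical: the operation names and structure layout are invented to illustrate the idea of the guest forwarding memory-management requests to the host instead of acting on its own memory locally.

    /*
     * Hypothetical sketch, NOT the ABI from the actual memctl patches.  The
     * guest driver fills in a request and hands it to the virtual device; the
     * host performs the operation on the memory backing that range, so the
     * guest never has to touch (or zero) the pages itself.
     */
    #include <linux/types.h>

    enum memctl_op {
            MEMCTL_MMAP,            /* populate backing for a range */
            MEMCTL_MUNMAP,          /* drop the backing on the host */
            MEMCTL_MLOCK,           /* pin the backing memory */
            MEMCTL_MADVISE,         /* forward an madvise() hint */
    };

    struct memctl_request {
            __u32 op;               /* one of enum memctl_op */
            __u32 arg;              /* e.g. an MADV_* value for MEMCTL_MADVISE */
            __u64 addr;             /* guest address of the range */
            __u64 len;              /* length of the range in bytes */
    };

A virtio queue, an ioctl(), or a hypercall could all carry requests of this kind; the article only says that the operations are passed through to the host for execution there.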
The other piece of the puzzle is that memctl would allocate pages in the guest's physical address space at the 1GB huge-page size. On the host side, though, these pages are mapped at a smaller size — as either 2MB huge pages or 4KB base pages. The use of 1GB pages on the guest shorts out most of the address-translation overhead at that level, speeding access considerably. The smaller pages on the host side avoid fragmentation issues; guest memory can be managed in smaller units. This only works, though, if all operations on that memory are done by the host, which is why the memctl device must provide equivalents for all of the relevant system calls.
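A minimal host-side sketch of that size asymmetry, under assumptions of mine rather than anything shown in the session: the virtual-machine monitor reserves guest "physical" memory as ordinary anonymous memory and merely hints that 2MB transparent huge pages would be welcome, leaving the guest free to map the same range with 1GB entries.

    /*
     * Illustrative only: back a 1GB slab of guest "physical" memory with
     * anonymous memory that the host may populate with 2MB transparent huge
     * pages.  The host keeps managing the backing in 2MB (or 4KB) units and
     * can reclaim or migrate it without disturbing the guest's 1GB mapping
     * of the corresponding guest-physical range.
     */
    #include <stddef.h>
    #include <sys/mman.h>

    #define GUEST_SLAB_SIZE (1UL << 30)     /* 1GB of guest-physical space */

    static void *alloc_guest_slab(void)
    {
            void *mem = mmap(NULL, GUEST_SLAB_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

            if (mem == MAP_FAILED)
                    return NULL;

            /* Hint that 2MB huge pages are welcome; a real VMM would also
             * take care of aligning the region suitably. */
            madvise(mem, GUEST_SLAB_SIZE, MADV_HUGEPAGE);
            return mem;
    }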
David Hildenbrand suggested that the real optimization in this setup comes from avoiding the need to zero pages on the guest side and, perhaps, from not having to allocate all of the guest's memory immediately on the host. He thought that some of these optimizations could be done within the balloon driver as well, but probably not all of them. The virtio-balloon is "the dumping ground" for a lot of similar code, he said.
Tatashin continued, wondering whether and how it might be possible to upstream this code. Andrew Morton asked where the changes live; the answer is that almost all of the work is in the new memctl device, which is a separate driver, so there would be little impact on the core memory-management code. But Tatashin worried about maintaining the ABI after the code goes upstream and wanted to be sure that he was not adding any security problems. He was advised to copy the patches widely; the community would figure it out somehow.
As the session ran out of time, an attendee asked whether this mechanism
required changing functions like malloc(). Since
memory-management operations have to send commands to the memctl device,
the answer was "yes"; code like allocators would have to change. Perhaps
someday it would be possible to do a lot of the basic memctl operations
from within the kernel, but more specialized applications would have to do
their own memctl calls.
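As a purely invented illustration of what "allocators would have to change" might mean in practice (the device, operation code, and request layout below are all hypothetical, not from the patches), an allocator releasing memory might send a command to the memctl device rather than calling madvise() itself:

    /*
     * Hypothetical sketch of an allocator change: when returning memory, tell
     * the memctl device so the host can reclaim the backing pages; fall back
     * to a normal madvise() call on bare metal.
     */
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct memctl_req {                     /* invented request format */
            uint32_t op;                    /* invented operation code */
            uint32_t advice;                /* e.g. MADV_DONTNEED */
            uint64_t addr;
            uint64_t len;
    };

    #define MEMCTL_OP_MADVISE 3             /* invented value */

    static int memctl_fd = -1;              /* opened at allocator start-up */

    static void allocator_release(void *addr, size_t len)
    {
            if (memctl_fd >= 0) {
                    /* Forward the operation to the host via the memctl device. */
                    struct memctl_req req = {
                            .op = MEMCTL_OP_MADVISE,
                            .advice = MADV_DONTNEED,
                            .addr = (uintptr_t)addr,
                            .len = len,
                    };
                    write(memctl_fd, &req, sizeof(req));
            } else {
                    /* On bare metal, do what allocators do today. */
                    madvise(addr, len, MADV_DONTNEED);
            }
    }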
Index entries for this article
Kernel: Memory management/Virtualization
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2023
512GB pages?
Posted May 20, 2023 6:27 UTC (Sat) by WolfWings (subscriber, #56790) [Link] (4 responses)

There's no x86-based CPU, at least, that supports mappings larger than 1GB. And an often-forgotten issue is that the CPU TLB has limited slots for higher-order mappings. Often only 16 (or fewer!) 1GB pages can be cached in the TLB, in fact; you can check with the cpuid -1 command and by manually examining the output. For example, on the i7-6700K running my NAS:

    CPU:
    0x63: data TLB: 2M/4M pages, 4-way, 32 entries
          data TLB: 1G pages, 4-way, 4 entries
    0x03: data TLB: 4K pages, 4-way, 64 entries
    0x76: instruction TLB: 2M/4M pages, fully, 8 entries
    0xff: cache data is in CPUID leaf 4
    0xb5: instruction TLB: 4K, 8-way, 64 entries
    0xf0: 64 byte prefetching
    0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries
    cache and TLB information (2):
    0x7d: L2 cache: 2M, 8-way, 64 byte lines
    0x30: L1 cache: 32K, 8-way, 64 byte lines
    0x2c: L1 data cache: 32K, 8-way, 64 byte lines
    ...

Even large-scale server CPUs have similar limits; on some DigitalOcean AMD VMs I have, for example:

    L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
       instruction # entries = 0xff (255)
       instruction associativity = 0x1 (1)
       data # entries = 0xff (255)
       data associativity = 0x1 (1)
    L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
       instruction # entries = 0xff (255)
       instruction associativity = 0x1 (1)
       data # entries = 0xff (255)
       data associativity = 0x1 (1)
    ...
    L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
       instruction # entries = 0x0 (0)
       instruction associativity = L2 off (0)
       data # entries = 0x0 (0)
       data associativity = L2 off (0)
    L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
       instruction # entries = 0x200 (512)
       instruction associativity = 4-way (4)
       data # entries = 0x200 (512)
       data associativity = 4-way (4)
    ...
    L1 TLB information: 1G pages (0x80000019/eax):
       instruction # entries = 0x0 (0)
       instruction associativity = L2 off (0)
       data # entries = 0x0 (0)
       data associativity = L2 off (0)
    L2 TLB information: 1G pages (0x80000019/ebx):
       instruction # entries = 0x0 (0)
       instruction associativity = L2 off (0)
       data # entries = 0x0 (0)
       data associativity = L2 off (0)

Zero 1GB TLB cache sizes. And if you track down the physical CPUs used, it's accurate.

This is also why the host side sticks to 2MB constructs; those often have full 1:1 parity with 4KB pages already. 1GB minimizes useless page-table allocations on the guest, but using actual 2MB mappings optimizes for the TLB properly.

512GB pages?
Posted May 21, 2023 10:47 UTC (Sun) by farnz (subscriber, #17727) [Link] (1 responses)

Of course, it's worth remembering that a bigger page size means fewer TLB entries to cover the same area of RAM; a 2M TLB entry covers 512 4k TLB entries worth of RAM, and a 1G TLB entry (if your CPU has them - my i9-12900 does for data but not code) covers 512 2M TLB entries, or 262,144 4k entries. Thus, you need many fewer TLB entries to cover the same amount of RAM - and access patterns start to matter more for whether the TLB is big enough or not.

TLB info
Posted May 24, 2023 17:42 UTC (Wed) by willy (subscriber, #9762) [Link]

    L1 Instruction TLB: 4KB pages, 8-way associative, 128 entries
    L1 Instruction TLB: 4MB/2MB pages, 8-way associative, 16 entries
    L1 Store Only TLB: 1GB/4MB/2MB/4KB pages, fully associative, 16 entries
    L1 Load Only TLB: 4KB pages, 4-way associative, 64 entries
    L1 Load Only TLB: 4MB/2MB pages, 4-way associative, 32 entries
    L1 Load Only TLB: 1GB pages, fully associative, 8 entries
    L2 Unified TLB: 4MB/2MB/4KB pages, 8-way associative, 1024 entries
    L2 Unified TLB: 1GB/4KB pages, 8-way associative, 1024 entries

If we did construct a 1GB executable, figure out a way to get it into a 1GB page, and use a PUD entry to map it, that 1GB translation would go into the L2 Unified TLB. The CPU would then create 2MB TLB entries for the L1 cache to use for whichever 2MB chunks of that 1GB executable are actually in use.

If that proved to be a performance problem, I'm pretty sure Intel would figure out they were leaving performance on the table and support 1GB L1 I$ TLB entries.

512GB pages?
Posted May 24, 2023 22:29 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

FWIW the same information is visible in dmesg, in a somewhat terse format:

    [ +0.000002] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 512
    [ +0.000002] Last level dTLB entries: 4KB 2048, 2MB 2048, 4MB 1024, 1GB 0

This one's a desktop CPU from this decade, so planning for about 10GB of in-use memory seems reasonable enough. The one on my actual NAS is incredibly anemic by comparison (the TLB can map ~64MB *total*!)

512GB pages?
Posted Jun 15, 2023 3:45 UTC (Thu) by ghane (guest, #1805) [Link]

Mine has:

    Jun 14 14:40:02 P14sUbuntu kernel: process: using mwait in idle threads
    Jun 14 14:40:02 P14sUbuntu kernel: Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
    Jun 14 14:40:02 P14sUbuntu kernel: Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Mitigation: Enhanced IBRS
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
    Jun 14 14:40:02 P14sUbuntu kernel: Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
    Jun 14 14:40:02 P14sUbuntu kernel: Freeing SMP alternatives memory: 44K
    Jun 14 14:40:02 P14sUbuntu kernel: smpboot: CPU0: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (family: 0x6, model: 0x8c, stepping: 0x1)

No L1/L2/L3 at all?

Memory passthrough for virtual machines
Posted May 22, 2023 9:10 UTC (Mon) by matthias (subscriber, #94967) [Link]

This would remove all flexibility from the host side. If the guest frees up some memory, it would be stuck in the 1GB mapping on the host side and could not be reused for other virtual machines or for the host itself.

Having 1GB mappings on the client side - at least in theory - does not remove any flexibility. Whichever mapping the client wants to modify, it can always talk to the host and ask it to modify the mapping instead. The client can still use the huge pages as if they were a collection of small pages; every manipulation would just have to be made by the host instead. And not all of the 1GB mapping needs to be backed by physical memory. The host is free to unmap parts of the address space if it is told by the guest that that part of memory is not currently needed.

Of course, this needs to be supported on the client. I must admit, I have not looked into the details, i.e., whether this is targeted at the guest kernel at all or only at memory-hungry processes inside the guest. Saying that malloc() needs to be changed sounds a lot like it is targeted at processes. And it sounds like this is - at least for now - only for very specific use cases.

Memory passthrough for virtual machines
Posted May 24, 2023 17:34 UTC (Wed) by willy (subscriber, #9762) [Link]

https://lwn.net/Articles/931794/ (2023)

Memory passthrough for virtual machines
Posted May 22, 2023 10:29 UTC (Mon) by Karellen (subscriber, #67644) [Link]

    As the session ran out of time, an attendee asked whether this mechanism required changing functions like malloc(). Since memory-management operations have to send commands to the memctl device, the answer was "yes"; code like allocators would have to change.

Wait, why would malloc() need to change to take advantage of this? Why not implement the functionality in {,s}brk() and mmap() to make it transparent to the guest's userspace?

Memory passthrough for virtual machines
Posted Jun 23, 2023 12:04 UTC (Fri) by angelsl (subscriber, #144646) [Link]

The levels (pmd, pud, p4d, etc.) above the last level all specify page frame numbers, so there should not be any TLB lookups here.