LWN: Comments on "Memory passthrough for virtual machines"
https://lwn.net/Articles/931933/
This is a special feed containing comments posted to the individual LWN article titled "Memory passthrough for virtual machines".
en-us
Tue, 16 Sep 2025 11:28:18 +0000
https://www.rssboard.org/rss-specification
lwn@lwn.net

Memory passthrough for virtual machines
https://lwn.net/Articles/935997/
angelsl
<div class="FormattedComment">
<span class="QuotedText">&gt; At each level, a memory load must be performed and TLB misses are possible, </span><br>
<p>
The levels (pmd, pud, p4d, etc) above the last level all specify page frame numbers so there should not be any TLB lookups here.<br>
</div>
Fri, 23 Jun 2023 12:04:34 +0000

512GB pages?
https://lwn.net/Articles/934705/
ghane
Mine has:
<pre>
Jun 14 14:40:02 P14sUbuntu kernel: process: using mwait in idle threads
Jun 14 14:40:02 P14sUbuntu kernel: Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
Jun 14 14:40:02 P14sUbuntu kernel: Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Mitigation: Enhanced IBRS
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
Jun 14 14:40:02 P14sUbuntu kernel: Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
Jun 14 14:40:02 P14sUbuntu kernel: Freeing SMP alternatives memory: 44K
Jun 14 14:40:02 P14sUbuntu kernel: smpboot: CPU0: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (family: 0x6, model: 0x8c, stepping: 0x1)
</pre>
No L1/L2/L3 at all?
Thu, 15 Jun 2023 03:45:54 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/933102/
mathstuf
<div class="FormattedComment">
This seems to pass a lot of trust onto the host when there's a lot of other work going on to *not* trust the host. I presume this is intended for trusted cloud/local cluster deployments (read: not AWS or the like)?<br>
</div>
Sat, 27 May 2023 12:08:15 +0000

512GB pages?
https://lwn.net/Articles/932902/
flussence
FWIW the same information is visible in dmesg, in a somewhat terse format:
<pre>
[ +0.000002] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 512
[ +0.000002] Last level dTLB entries: 4KB 2048, 2MB 2048, 4MB 1024, 1GB 0
</pre>
This one's a desktop CPU from this decade, so planning for about 10GB of in-use memory seems reasonable enough. The one on my actual NAS is incredibly anemic by comparison (the TLB can map ~64MB *total*!)
Wed, 24 May 2023 22:29:53 +0000

512GB pages?
https://lwn.net/Articles/932873/
willy
<div class="FormattedComment">
I'm not surprised you don't have I-cache 1GB TLB entries -- do you have any programs with 1GB text segments?
Here's my i7-1165G7's TLB configuration:<br>
<p>
TLB info<br>
L1 Instruction TLB: 4KB pages, 8-way associative, 128 entries<br>
L1 Instruction TLB: 4MB/2MB pages, 8-way associative, 16 entries<br>
L1 Store Only TLB: 1GB/4MB/2MB/4KB pages, fully associative, 16 entries<br>
L1 Load Only TLB: 4KB pages, 4-way associative, 64 entries<br>
L1 Load Only TLB: 4MB/2MB pages, 4-way associative, 32 entries<br>
L1 Load Only TLB: 1GB pages, fully associative, 8 entries<br>
L2 Unified TLB: 4MB/2MB/4KB pages, 8-way associative, 1024 entries<br>
L2 Unified TLB: 1GB/4KB pages, 8-way associative, 1024 entries<br>
<p>
If we did construct a 1GB executable, figure out a way to get it into a 1GB page, and use a PUD entry to map it, that 1GB translation would go into the L2 Unified TLB. The CPU would then create 2MB TLB entries for the L1 cache to use for whichever 2MB chunks of that 1GB executable are actually in use.<br>
<p>
If that proved to be a performance problem, I'm pretty sure Intel would figure out they were leaving performance on the table and support 1GB L1 I$ TLB entries.<br>
</div>
Wed, 24 May 2023 17:42:36 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932869/
willy
<div class="FormattedComment">
Linux absolutely does use variable size folios (including all the way up to PMD size) in the filesystem cache. You might want to refer to my State Of The Page talk, or my Folios talk from last year.<br>
<p>
<a href="https://lwn.net/Articles/931794/">https://lwn.net/Articles/931794/</a> (2023)<br>
<a href="https://lwn.net/Articles/893512/">https://lwn.net/Articles/893512/</a> (2022)<br>
<p>
</div>
Wed, 24 May 2023 17:34:51 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932668/
lbt
<div class="FormattedComment">
What happens with nested guests?<br>
</div>
Tue, 23 May 2023 06:36:43 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932566/
Karellen
<blockquote>As the session ran out of time, an attendee asked whether this mechanism required changing functions like <tt>malloc()</tt>. Since memory-management operations have to send commands to the memctl device, the answer was "yes", code like allocators would have to change.</blockquote>
<p>Wait, why would <tt>malloc()</tt> need to change to take advantage of this? Why not implement the functionality in <tt>{,s}brk()</tt> and <tt>mmap()</tt> to make it transparent to the guest's userspace?</p>
Mon, 22 May 2023 10:29:59 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932565/
matthias
<div class="FormattedComment">
<span class="QuotedText">&gt; Wouldn't the most flexible thing be to swap the two around?</span><br>
<p>
This would remove all flexibility from the host side. If the guest frees up some memory, it would be stuck in the 1GB mapping on the host side and could not be reused for other virtual machines or the host itself.<br>
<p>
Having 1GB mappings on the guest side - at least in theory - does not remove any flexibility. Whichever mapping the guest wants to modify, it can always ask the host to modify the mapping instead. The guest can still use the huge pages as if they were a collection of small pages; every manipulation would just have to be made by the host instead. And not all of the 1GB mapping needs to be backed by physical memory.
The host is free to unmap parts of the address space if it is told by the guest that that part of memory is not currently needed.<br>
<p>
Of course, this needs to be supported in the guest. I must admit, I have not looked into the details, i.e., whether this is targeted at the guest kernel at all or only at memory-hungry processes inside the guest. Saying that malloc() needs to be changed sounds a lot like it is targeted at processes. And it sounds like this is - at least for now - only for very specific use cases.<br>
</div>
Mon, 22 May 2023 09:10:42 +0000

512GB pages?
https://lwn.net/Articles/932536/
farnz
<p>Of course, it's worth remembering that a bigger page size means fewer TLB entries to cover the same area of RAM; a 2M TLB entry covers 512 4k TLB entries worth of RAM, and a 1G TLB entry (if your CPU has them - my i9-12900 does for data but not code) covers 512 2M TLB entries, or 262,144 4k entries. Thus, you need many fewer TLB entries to cover the same amount of RAM - and access patterns start to matter more for whether the TLB is big enough or not.</p>
Sun, 21 May 2023 10:47:01 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932518/
Sesse
<div class="FormattedComment">
Is there a reason why the _host_ has 4 kB pages and the _guest_ has 1 GB pages? Wouldn't the most flexible thing be to swap the two around? (E.g., Linux cannot yet use huge pages for the filesystem cache.)<br>
</div>
Sat, 20 May 2023 13:52:12 +0000

512GB pages?
https://lwn.net/Articles/932510/
WolfWings
<p>There's no x86-based CPU, at least, that supports mappings larger than 1GB.</p>
<p>And an often-forgotten issue is that the CPU TLB has limited slots for higher-order mappings.</p>
<p>Oftentimes only 16 (or fewer!) 1GB pages can be cached in the TLB, in fact. You can check with the <tt>cpuid -1</tt> command by manually examining the output; for example, on the i7-6700K running my NAS:</p>
<pre>
CPU:
   0x63: data TLB: 2M/4M pages, 4-way, 32 entries
         data TLB: 1G pages, 4-way, 4 entries
   0x03: data TLB: 4K pages, 4-way, 64 entries
   0x76: instruction TLB: 2M/4M pages, fully, 8 entries
   0xff: cache data is in CPUID leaf 4
   0xb5: instruction TLB: 4K, 8-way, 64 entries
   0xf0: 64 byte prefetching
   0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries
</pre>
<p>Even large-scale server CPUs have similar limits, on some DigitalOcean AMD VMs I have for example:</p>
<pre>
cache and TLB information (2):
   0x7d: L2 cache: 2M, 8-way, 64 byte lines
   0x30: L1 cache: 32K, 8-way, 64 byte lines
   0x2c: L1 data cache: 32K, 8-way, 64 byte lines
   ...
L1 TLB/cache information: 2M/4M pages &amp; L1 TLB (0x80000005/eax):
   instruction # entries = 0xff (255)
   instruction associativity = 0x1 (1)
   data # entries = 0xff (255)
   data associativity = 0x1 (1)
L1 TLB/cache information: 4K pages &amp; L1 TLB (0x80000005/ebx):
   instruction # entries = 0xff (255)
   instruction associativity = 0x1 (1)
   data # entries = 0xff (255)
   data associativity = 0x1 (1)
...
L2 TLB/cache information: 2M/4M pages &amp; L2 TLB (0x80000006/eax):
   instruction # entries = 0x0 (0)
   instruction associativity = L2 off (0)
   data # entries = 0x0 (0)
   data associativity = L2 off (0)
L2 TLB/cache information: 4K pages &amp; L2 TLB (0x80000006/ebx):
   instruction # entries = 0x200 (512)
   instruction associativity = 4-way (4)
   data # entries = 0x200 (512)
   data associativity = 4-way (4)
...
L1 TLB information: 1G pages (0x80000019/eax):
   instruction # entries = 0x0 (0)
   instruction associativity = L2 off (0)
   data # entries = 0x0 (0)
   data associativity = L2 off (0)
L2 TLB information: 1G pages (0x80000019/ebx):
   instruction # entries = 0x0 (0)
   instruction associativity = L2 off (0)
   data # entries = 0x0 (0)
   data associativity = L2 off (0)
</pre>
<p>Zero 1GB TLB entries. And if you track down the physical CPUs used, it's accurate.</p>
<p>This is also why the host side sticks to 2MB constructs; those often have full 1:1 parity with 4K pages already. 1GB minimizes useless page table allocations on the guest, but using actual 2MB mappings optimizes for the TLB properly.</p>
Sat, 20 May 2023 06:27:19 +0000

512GB pages?
https://lwn.net/Articles/932476/
kilobyte
<div class="FormattedComment">
What about going the whole hog and using 512GB pages? Even the fattest of today's boxen would be unlikely to run into TLB pressure there. The kernel has widespread 32-bit assumptions all around, but for a limited scope like KVM it could work ok.<br>
</div>
Fri, 19 May 2023 17:02:18 +0000
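
<p>To make the 1GB-mapping discussion above a bit more concrete, here is a minimal userspace sketch (not taken from the article or from any of the comments) that asks the kernel for a single 1GB huge page with <tt>mmap()</tt>. It assumes an x86-64 CPU with the pdpe1gb feature and that at least one 1GB hugetlb page has been reserved, e.g. by booting with <tt>hugepagesz=1G hugepages=1</tt> or by writing to /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages.</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/mman.h&gt;

/* Older libc headers may lack these; the kernel values are fixed. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 &lt;&lt; MAP_HUGE_SHIFT)   /* log2(1GB) encoded in the mmap flags */
#endif

int main(void)
{
        size_t len = 1UL &lt;&lt; 30;         /* exactly one 1GB page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap(MAP_HUGE_1GB)");   /* typically ENOMEM if no 1GB pages are reserved */
                return 1;
        }

        memset(p, 0, len);              /* touch the mapping so it is really backed */
        printf("1GB huge page mapped at %p\n", p);
        munmap(p, len);
        return 0;
}
</pre>
<p>Whether something along these lines could be hidden behind <tt>{,s}brk()</tt>/<tt>mmap()</tt> in the guest, as suggested above, or would instead require allocator changes as the article reports, is exactly the open question in this thread.</p>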