LWN: Comments on "Memory passthrough for virtual machines"
https://lwn.net/Articles/931933/
This is a special feed containing comments posted to the individual LWN article titled "Memory passthrough for virtual machines".
en-us
Tue, 16 Sep 2025 11:28:18 +0000
https://www.rssboard.org/rss-specification
lwn@lwn.net

Memory passthrough for virtual machines
https://lwn.net/Articles/935997/
angelsl
<div class="FormattedComment">
<span class="QuotedText">&gt; At each level, a memory load must be performed and TLB misses are possible, </span><br>
<p>
The levels (pmd, pud, p4d, etc) above the last level all specify page frame numbers so there should not be any TLB lookups here.<br>
</div>
Fri, 23 Jun 2023 12:04:34 +0000

512GB pages?
https://lwn.net/Articles/934705/
ghane
Mine has:
<pre>
Jun 14 14:40:02 P14sUbuntu kernel: process: using mwait in idle threads
Jun 14 14:40:02 P14sUbuntu kernel: Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
Jun 14 14:40:02 P14sUbuntu kernel: Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Mitigation: Enhanced IBRS
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
Jun 14 14:40:02 P14sUbuntu kernel: Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
Jun 14 14:40:02 P14sUbuntu kernel: Freeing SMP alternatives memory: 44K
Jun 14 14:40:02 P14sUbuntu kernel: smpboot: CPU0: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (family: 0x6, model: 0x8c, stepping: 0x1)
</pre>
No L1/L2/L3 at all?
Thu, 15 Jun 2023 03:45:54 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/933102/
mathstuf
<div class="FormattedComment">
This seems to pass a lot of trust onto the host when there's a lot of other work going on to *not* trust the host. I presume this is intended for trusted cloud/local cluster deployments (read: not AWS or the like)?<br>
</div>
Sat, 27 May 2023 12:08:15 +0000

512GB pages?
https://lwn.net/Articles/932902/
flussence
FWIW the same information is visible in dmesg, in a somewhat terse format:
<pre>
[ +0.000002] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 512
[ +0.000002] Last level dTLB entries: 4KB 2048, 2MB 2048, 4MB 1024, 1GB 0
</pre>
This one's a desktop CPU from this decade, so planning for about 10GB of in-use memory seems reasonable enough. The one on my actual NAS is incredibly anemic by comparison (the TLB can map ~64MB *total*!)
Wed, 24 May 2023 22:29:53 +0000

512GB pages?
https://lwn.net/Articles/932873/
willy
<div class="FormattedComment">
I'm not surprised you don't have I-cache 1GB TLB entries -- do you have any programs with 1GB text segments?
Here's my i7-1165G7's TLB configuration:<br>
<p>
TLB info<br>
L1 Instruction TLB: 4KB pages, 8-way associative, 128 entries<br>
L1 Instruction TLB: 4MB/2MB pages, 8-way associative, 16 entries<br>
L1 Store Only TLB: 1GB/4MB/2MB/4KB pages, fully associative, 16 entries<br>
L1 Load Only TLB: 4KB pages, 4-way associative, 64 entries<br>
L1 Load Only TLB: 4MB/2MB pages, 4-way associative, 32 entries<br>
L1 Load Only TLB: 1GB pages, fully associative, 8 entries<br>
L2 Unified TLB: 4MB/2MB/4KB pages, 8-way associative, 1024 entries<br>
L2 Unified TLB: 1GB/4KB pages, 8-way associative, 1024 entries<br>
<p>
If we did construct a 1GB executable, figure out a way to get it into a 1GB page, and use a PUD entry to map it, that 1GB translation would go into the L2 Unified TLB. The CPU would then create 2MB TLB entries for the L1 cache to use for whichever 2MB chunks of that 1GB executable are actually in use.<br>
<p>
If that proved to be a performance problem, I'm pretty sure Intel would figure out they were leaving performance on the table and support 1GB L1 I$ TLB entries.<br>
</div>
Wed, 24 May 2023 17:42:36 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932869/
willy
<div class="FormattedComment">
Linux absolutely does use variable size folios (including all the way up to PMD size) in the filesystem cache. You might want to refer to my State Of The Page talk, or my Folios talk from last year.<br>
<p>
<a href="https://lwn.net/Articles/931794/">https://lwn.net/Articles/931794/</a> (2023)<br>
<a href="https://lwn.net/Articles/893512/">https://lwn.net/Articles/893512/</a> (2022)<br>
<p>
</div>
Wed, 24 May 2023 17:34:51 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932668/
lbt
<div class="FormattedComment">
What happens with nested guests?<br>
</div>
Tue, 23 May 2023 06:36:43 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932566/
Karellen
<blockquote>As the session ran out of time, an attendee asked whether this mechanism required changing functions like <tt>malloc()</tt>. Since memory-management operations have to send commands to the memctl device, the answer was "yes", code like allocators would have to change.</blockquote>
<p>Wait, why would <tt>malloc()</tt> need to change to take advantage of this? Why not implement the functionality in <tt>{,s}brk()</tt> and <tt>mmap()</tt> to make it transparent to the guest's userspace?</p>
Mon, 22 May 2023 10:29:59 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932565/
matthias
<div class="FormattedComment">
<span class="QuotedText">&gt; Wouldn't the most flexible thing be to swap the two around?</span><br>
<p>
This would remove all flexibility from the host side. If the guest frees up some memory, it would be stuck in the 1GB mapping on the host side and could not be reused for other virtual machines or the host itself.<br>
<p>
Having 1GB mappings on the guest side - at least in theory - does not remove any flexibility. Whichever mapping the guest wants to modify, it can always ask the host to modify the mapping instead. The guest can still use the huge pages as if they were a collection of small pages; every manipulation would just have to be made by the host instead. And not all of the 1GB mapping needs to be backed by physical memory.
The host is free to unmap parts of the address space if it is told by the guest that that part of memory is not currently needed.<br>
<p>
Of course, this needs to be supported in the guest. I must admit, I have not looked into the details, i.e., whether this is targeted at the guest kernel at all or only at memory-hungry processes inside the guest. Saying that malloc() needs to be changed sounds a lot like it is targeted at processes. And it sounds like this is - at least for now - only for very specific use cases.<br>
</div>
Mon, 22 May 2023 09:10:42 +0000

512GB pages?
https://lwn.net/Articles/932536/
farnz
<p>Of course, it's worth remembering that a bigger page size means fewer TLB entries to cover the same area of RAM; a 2M TLB entry covers 512 4k TLB entries worth of RAM, and a 1G TLB entry (if your CPU has them - my i9-12900 does for data but not code) covers 512 2M TLB entries, or 262,144 4k entries. Thus, you need many fewer TLB entries to cover the same amount of RAM - and access patterns start to matter more for whether the TLB is big enough or not.</p>
Sun, 21 May 2023 10:47:01 +0000

Memory passthrough for virtual machines
https://lwn.net/Articles/932518/
Sesse
<div class="FormattedComment">
Is there a reason why the _host_ has 4 kB pages and the _guest_ has 1 GB pages? Wouldn't the most flexible thing be to swap the two around? (E.g., Linux cannot yet use huge pages for the filesystem cache.)<br>
</div>
Sat, 20 May 2023 13:52:12 +0000

512GB pages?
https://lwn.net/Articles/932510/
WolfWings
<p>There's no x86-based CPU, at least, that supports mappings larger than 1GB.</p>
<p>And an often-forgotten issue is that the CPU TLB has limited slots for higher-order mappings.</p>
<p>Oftentimes only 16 (or fewer!) 1GB pages can be cached in the TLB, in fact. You can check with the <tt>cpuid -1</tt> command by manually examining the output; for example, on the i7-6700K running my NAS:</p>
<pre>
CPU:
   0x63: data TLB: 2M/4M pages, 4-way, 32 entries
         data TLB: 1G pages, 4-way, 4 entries
   0x03: data TLB: 4K pages, 4-way, 64 entries
   0x76: instruction TLB: 2M/4M pages, fully, 8 entries
   0xff: cache data is in CPUID leaf 4
   0xb5: instruction TLB: 4K, 8-way, 64 entries
   0xf0: 64 byte prefetching
   0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries
</pre>
<p>Even large-scale server CPUs have similar limits, on some DigitalOcean AMD VMs I have for example:</p>
<pre>
cache and TLB information (2):
   0x7d: L2 cache: 2M, 8-way, 64 byte lines
   0x30: L1 cache: 32K, 8-way, 64 byte lines
   0x2c: L1 data cache: 32K, 8-way, 64 byte lines
   ...
L1 TLB/cache information: 2M/4M pages &amp; L1 TLB (0x80000005/eax):
   instruction # entries = 0xff (255)
   instruction associativity = 0x1 (1)
   data # entries = 0xff (255)
   data associativity = 0x1 (1)
L1 TLB/cache information: 4K pages &amp; L1 TLB (0x80000005/ebx):
   instruction # entries = 0xff (255)
   instruction associativity = 0x1 (1)
   data # entries = 0xff (255)
   data associativity = 0x1 (1)
...
L2 TLB/cache information: 2M/4M pages &amp; L2 TLB (0x80000006/eax):
   instruction # entries = 0x0 (0)
   instruction associativity = L2 off (0)
   data # entries = 0x0 (0)
   data associativity = L2 off (0)
L2 TLB/cache information: 4K pages &amp; L2 TLB (0x80000006/ebx):
   instruction # entries = 0x200 (512)
   instruction associativity = 4-way (4)
   data # entries = 0x200 (512)
   data associativity = 4-way (4)
...
L1 TLB information: 1G pages (0x80000019/eax):
   instruction # entries = 0x0 (0)
   instruction associativity = L2 off (0)
   data # entries = 0x0 (0)
   data associativity = L2 off (0)
L2 TLB information: 1G pages (0x80000019/ebx):
   instruction # entries = 0x0 (0)
   instruction associativity = L2 off (0)
   data # entries = 0x0 (0)
   data associativity = L2 off (0)
</pre>
<p>Zero 1GB TLB entries. And if you track down the physical CPUs used, it's accurate.</p>
<p>This is also why the host side sticks to 2MB constructs; those often have full 1:1 parity with 4K pages already. 1GB minimizes useless page table allocations on the guest, but using actual 2MB mappings optimizes for the TLB properly.</p>
Sat, 20 May 2023 06:27:19 +0000

512GB pages?
https://lwn.net/Articles/932476/
kilobyte
<div class="FormattedComment">
What about going the whole hog and using 512GB pages? Even the fattest of today's boxen would be unlikely to run into TLB pressure there. The kernel has widespread 32-bit assumptions all around, but for a limited scope like KVM it could work ok.<br>
</div>
Fri, 19 May 2023 17:02:18 +0000
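
<p>To make the 1GB-mapping discussion above a bit more concrete, here is a minimal userspace sketch (not taken from the article or from any of the comments) that asks the kernel for a single 1GB huge page with <tt>mmap()</tt>. It assumes an x86-64 CPU with the pdpe1gb feature and that at least one 1GB hugetlb page has been reserved, e.g. by booting with <tt>hugepagesz=1G hugepages=1</tt> or by writing to /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages.</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/mman.h&gt;

/* Older libc headers may lack these; the kernel values are fixed. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 &lt;&lt; MAP_HUGE_SHIFT)   /* log2(1GB) encoded in the mmap flags */
#endif

int main(void)
{
        size_t len = 1UL &lt;&lt; 30;         /* exactly one 1GB page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap(MAP_HUGE_1GB)");   /* typically ENOMEM if no 1GB pages are reserved */
                return 1;
        }

        memset(p, 0, len);              /* touch the mapping so it is really backed */
        printf("1GB huge page mapped at %p\n", p);
        munmap(p, len);
        return 0;
}
</pre>
<p>Whether something along these lines could be hidden behind <tt>{,s}brk()</tt>/<tt>mmap()</tt> in the guest, as suggested above, or would instead require allocator changes as the article reports, is exactly the open question in this thread.</p>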