Memory passthrough for virtual machines
The translation of a virtual address to a physical address is a more complex affair than it seems. The lookup operation must work through as many as five levels of page tables. At each level, a memory load must be performed and TLB misses are possible, meaning that the lookup operation can be slow. It gets worse when this happens in the guest, though; guest "physical" addresses are virtual addresses in the host space; as a result, the lookup at each level of the guest page-table hierarchy can require walking through the full hierarchy in the host. The worst-case lookup, when both the host and the guest are running with five-level page tables, could require 35 loads, which can hurt.
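Where that worst-case number comes from can be sketched with a little arithmetic; this is back-of-the-envelope accounting rather than anything presented in the session. Every guest-physical address touched during the guest walk must itself be translated through the full host hierarchy:

    /*
     * Illustrative arithmetic only: worst-case memory loads for a nested
     * page-table walk with g guest levels and h host levels.  The guest walk
     * touches g+1 guest-physical addresses (the guest's root table, each
     * lower-level table, and the final data page), each of which needs a full
     * h-load walk of the host tables; the g guest page-table entries must be
     * loaded as well.
     */
    #include <stdio.h>

    static int nested_walk_loads(int g, int h)
    {
            return (g + 1) * h + g;
    }

    int main(void)
    {
            printf("4-level guest on 4-level host: %d loads\n",
                   nested_walk_loads(4, 4));      /* 24 */
            printf("5-level guest on 5-level host: %d loads\n",
                   nested_walk_loads(5, 5));      /* 35 */
            return 0;
    }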
Optimizing this situation, Pasha Tatashin said, starts from a recognition that work is being duplicated in the virtualized environment. He was not referring just to page-table lookups; memory is also zeroed twice when virtualization is in use. The solution he has in mind is to push as much of the work as possible to the host system, in ways that are not transparent to the guest.
Specifically, he has implemented a driver for a "memctl" device that is
present on the guest side; this device provides many of the
memory-management operations that are already available through the guest's
system-call interface: mmap(), mlock(),
madvise(), and so on. The difference is that, for the most part,
these operations are passed through to the host for execution there rather
than being handled by the guest. Additionally, the memctl device does not
zero memory on the guest side; it counts on the host to take care of that when
needed.
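The session did not describe what a memctl request looks like on the wire, so the sketch below is purely hypothetical: the operation names and structure layout are invented to illustrate the idea of the guest forwarding memory-management requests to the host instead of acting on its own memory locally.

    /*
     * Hypothetical sketch, NOT the ABI from the actual memctl patches.  The
     * guest driver fills in a request and hands it to the virtual device; the
     * host performs the operation on the memory backing that range, so the
     * guest never has to touch (or zero) the pages itself.
     */
    #include <linux/types.h>

    enum memctl_op {
            MEMCTL_MMAP,            /* populate backing for a range */
            MEMCTL_MUNMAP,          /* drop the backing on the host */
            MEMCTL_MLOCK,           /* pin the backing memory */
            MEMCTL_MADVISE,         /* forward an madvise() hint */
    };

    struct memctl_request {
            __u32 op;               /* one of enum memctl_op */
            __u32 arg;              /* e.g. an MADV_* value for MEMCTL_MADVISE */
            __u64 addr;             /* guest address of the range */
            __u64 len;              /* length of the range in bytes */
    };

A virtio queue, an ioctl(), or a hypercall could all carry requests of this kind; the article only says that the operations are passed through to the host for execution there.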
The other piece of the puzzle is that memctl would allocate pages in the guest's physical address space at the 1GB huge-page size. On the host side, though, these pages are mapped at a smaller size — as either 2MB huge pages or 4KB base pages. The use of 1GB pages on the guest shorts out most of the address-translation overhead at that level, speeding access considerably. The smaller pages on the host side avoid fragmentation issues; guest memory can be managed in smaller units. This only works, though, if all operations on that memory are done by the host, which is why the memctl device must provide equivalents for all of the relevant system calls.
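A minimal host-side sketch of that size asymmetry, under assumptions of mine rather than anything shown in the session: the virtual-machine monitor reserves guest "physical" memory as ordinary anonymous memory and merely hints that 2MB transparent huge pages would be welcome, leaving the guest free to map the same range with 1GB entries.

    /*
     * Illustrative only: back a 1GB slab of guest "physical" memory with
     * anonymous memory that the host may populate with 2MB transparent huge
     * pages.  The host keeps managing the backing in 2MB (or 4KB) units and
     * can reclaim or migrate it without disturbing the guest's 1GB mapping
     * of the corresponding guest-physical range.
     */
    #include <stddef.h>
    #include <sys/mman.h>

    #define GUEST_SLAB_SIZE (1UL << 30)     /* 1GB of guest-physical space */

    static void *alloc_guest_slab(void)
    {
            void *mem = mmap(NULL, GUEST_SLAB_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

            if (mem == MAP_FAILED)
                    return NULL;

            /* Hint that 2MB huge pages are welcome; a real VMM would also
             * take care of aligning the region suitably. */
            madvise(mem, GUEST_SLAB_SIZE, MADV_HUGEPAGE);
            return mem;
    }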
David Hildenbrand suggested that the real optimization in this setup comes from avoiding the need to zero pages on the guest side and, perhaps, from not having to allocate all of the guest's memory immediately on the host. He thought that some of these optimizations could be done within the balloon driver as well, but probably not all of them. The virtio-balloon is "the dumping ground" for a lot of similar code, he said.
Tatashin continued, wondering whether and how it might be possible to upstream this code. Andrew Morton asked where the changes live; the answer is that almost all of the work is in the new memctl device, which is a separate driver, so there would be little impact on the core memory-management code. But Tatashin worried about maintaining the ABI after the code goes upstream and wanted to be sure that he was not adding any security problems. He was advised to copy the patches widely; the community would figure it out somehow.
As the session ran out of time, an attendee asked whether this mechanism
required changing functions like malloc(). Since
memory-management operations have to send commands to the memctl device,
the answer was "yes"; code like allocators would have to change. Perhaps
someday it would be possible to do a lot of the basic memctl operations
from within the kernel, but more specialized applications would have to do
their own memctl calls.
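As a purely invented illustration of what "allocators would have to change" might mean in practice (the device, operation code, and request layout below are all hypothetical, not from the patches), an allocator releasing memory might send a command to the memctl device rather than calling madvise() itself:

    /*
     * Hypothetical sketch of an allocator change: when returning memory, tell
     * the memctl device so the host can reclaim the backing pages; fall back
     * to a normal madvise() call on bare metal.
     */
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct memctl_req {                     /* invented request format */
            uint32_t op;                    /* invented operation code */
            uint32_t advice;                /* e.g. MADV_DONTNEED */
            uint64_t addr;
            uint64_t len;
    };

    #define MEMCTL_OP_MADVISE 3             /* invented value */

    static int memctl_fd = -1;              /* opened at allocator start-up */

    static void allocator_release(void *addr, size_t len)
    {
            if (memctl_fd >= 0) {
                    /* Forward the operation to the host via the memctl device. */
                    struct memctl_req req = {
                            .op = MEMCTL_OP_MADVISE,
                            .advice = MADV_DONTNEED,
                            .addr = (uintptr_t)addr,
                            .len = len,
                    };
                    write(memctl_fd, &req, sizeof(req));
            } else {
                    /* On bare metal, do what allocators do today. */
                    madvise(addr, len, MADV_DONTNEED);
            }
    }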
Index entries for this article
Kernel: Memory management/Virtualization
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2023
512GB pages?
Posted May 20, 2023 6:27 UTC (Sat) by WolfWings (subscriber, #56790) [Link] (4 responses)

There's no x86-based CPU, at least, that supports mappings larger than 1GB. And an often-forgotten issue is that the CPU TLB has limited slots for higher-order mappings. Often only 16 (or fewer!) 1GB pages can be cached in the TLB, in fact; you can check with the cpuid -1 command and by manually examining the output. For example, on the i7-6700K running my NAS:

    CPU:
    0x63: data TLB: 2M/4M pages, 4-way, 32 entries
          data TLB: 1G pages, 4-way, 4 entries
    0x03: data TLB: 4K pages, 4-way, 64 entries
    0x76: instruction TLB: 2M/4M pages, fully, 8 entries
    0xff: cache data is in CPUID leaf 4
    0xb5: instruction TLB: 4K, 8-way, 64 entries
    0xf0: 64 byte prefetching
    0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries
    cache and TLB information (2):
    0x7d: L2 cache: 2M, 8-way, 64 byte lines
    0x30: L1 cache: 32K, 8-way, 64 byte lines
    0x2c: L1 data cache: 32K, 8-way, 64 byte lines
    ...

Even large-scale server CPUs have similar limits; on some DigitalOcean AMD VMs I have, for example:

    L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
       instruction # entries = 0xff (255)
       instruction associativity = 0x1 (1)
       data # entries = 0xff (255)
       data associativity = 0x1 (1)
    L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
       instruction # entries = 0xff (255)
       instruction associativity = 0x1 (1)
       data # entries = 0xff (255)
       data associativity = 0x1 (1)
    ...
    L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
       instruction # entries = 0x0 (0)
       instruction associativity = L2 off (0)
       data # entries = 0x0 (0)
       data associativity = L2 off (0)
    L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
       instruction # entries = 0x200 (512)
       instruction associativity = 4-way (4)
       data # entries = 0x200 (512)
       data associativity = 4-way (4)
    ...
    L1 TLB information: 1G pages (0x80000019/eax):
       instruction # entries = 0x0 (0)
       instruction associativity = L2 off (0)
       data # entries = 0x0 (0)
       data associativity = L2 off (0)
    L2 TLB information: 1G pages (0x80000019/ebx):
       instruction # entries = 0x0 (0)
       instruction associativity = L2 off (0)
       data # entries = 0x0 (0)
       data associativity = L2 off (0)

Zero 1GB TLB cache sizes. And if you track down the physical CPUs used, it's accurate.

This is also why the host side sticks to 2MB constructs; those often have full 1:1 parity with 4KB pages already. 1GB minimizes useless page-table allocations on the guest, but using actual 2MB mappings optimizes for the TLB properly.

512GB pages?
Posted May 21, 2023 10:47 UTC (Sun) by farnz (subscriber, #17727) [Link] (1 responses)

Of course, it's worth remembering that a bigger page size means fewer TLB entries to cover the same area of RAM; a 2M TLB entry covers 512 4k TLB entries worth of RAM, and a 1G TLB entry (if your CPU has them - my i9-12900 does for data but not code) covers 512 2M TLB entries, or 262,144 4k entries. Thus, you need many fewer TLB entries to cover the same amount of RAM - and access patterns start to matter more for whether the TLB is big enough or not.

TLB info
Posted May 24, 2023 17:42 UTC (Wed) by willy (subscriber, #9762) [Link]

    L1 Instruction TLB: 4KB pages, 8-way associative, 128 entries
    L1 Instruction TLB: 4MB/2MB pages, 8-way associative, 16 entries
    L1 Store Only TLB: 1GB/4MB/2MB/4KB pages, fully associative, 16 entries
    L1 Load Only TLB: 4KB pages, 4-way associative, 64 entries
    L1 Load Only TLB: 4MB/2MB pages, 4-way associative, 32 entries
    L1 Load Only TLB: 1GB pages, fully associative, 8 entries
    L2 Unified TLB: 4MB/2MB/4KB pages, 8-way associative, 1024 entries
    L2 Unified TLB: 1GB/4KB pages, 8-way associative, 1024 entries

If we did construct a 1GB executable, figure out a way to get it into a 1GB page, and use a PUD entry to map it, that 1GB translation would go into the L2 Unified TLB. The CPU would then create 2MB TLB entries for the L1 cache to use for whichever 2MB chunks of that 1GB executable are actually in use.

If that proved to be a performance problem, I'm pretty sure Intel would figure out they were leaving performance on the table and support 1GB L1 I$ TLB entries.

512GB pages?
Posted May 24, 2023 22:29 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

FWIW the same information is visible in dmesg, in a somewhat terse format:

    [ +0.000002] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 512
    [ +0.000002] Last level dTLB entries: 4KB 2048, 2MB 2048, 4MB 1024, 1GB 0

This one's a desktop CPU from this decade, so planning for about 10GB of in-use memory seems reasonable enough. The one on my actual NAS is incredibly anemic by comparison (the TLB can map ~64MB *total*!)

512GB pages?
Posted Jun 15, 2023 3:45 UTC (Thu) by ghane (guest, #1805) [Link]

Mine has:

    Jun 14 14:40:02 P14sUbuntu kernel: process: using mwait in idle threads
    Jun 14 14:40:02 P14sUbuntu kernel: Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
    Jun 14 14:40:02 P14sUbuntu kernel: Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Mitigation: Enhanced IBRS
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
    Jun 14 14:40:02 P14sUbuntu kernel: Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
    Jun 14 14:40:02 P14sUbuntu kernel: Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
    Jun 14 14:40:02 P14sUbuntu kernel: Freeing SMP alternatives memory: 44K
    Jun 14 14:40:02 P14sUbuntu kernel: smpboot: CPU0: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (family: 0x6, model: 0x8c, stepping: 0x1)

No L1/L2/L3 at all?

Memory passthrough for virtual machines
Posted May 22, 2023 9:10 UTC (Mon) by matthias (subscriber, #94967) [Link]

This would remove all flexibility from the host side. If the guest frees up some memory, it would be stuck in the 1GB mapping on the host side and could not be reused for other virtual machines or for the host itself.

Having 1GB mappings on the client side - at least in theory - does not remove any flexibility. Whichever mapping the client wants to modify, it can always talk to the host and ask it to modify the mapping instead. The client can still use the huge pages as if they were a collection of small pages; every manipulation would just have to be made by the host instead. And not all of the 1GB mapping needs to be backed by physical memory. The host is free to unmap parts of the address space if it is told by the guest that that part of memory is not currently needed.

Of course, this needs to be supported on the client. I must admit, I have not looked into the details, i.e., whether this is targeted at the guest kernel at all or only at memory-hungry processes inside the guest. Saying that malloc() needs to be changed sounds a lot like it is targeted at processes. And it sounds like this is - at least for now - only for very specific use cases.

Memory passthrough for virtual machines
Posted May 24, 2023 17:34 UTC (Wed) by willy (subscriber, #9762) [Link]

https://lwn.net/Articles/931794/ (2023)

Memory passthrough for virtual machines
Posted May 22, 2023 10:29 UTC (Mon) by Karellen (subscriber, #67644) [Link]

    As the session ran out of time, an attendee asked whether this mechanism required changing functions like malloc(). Since memory-management operations have to send commands to the memctl device, the answer was "yes"; code like allocators would have to change.

Wait, why would malloc() need to change to take advantage of this? Why not implement the functionality in {,s}brk() and mmap() to make it transparent to the guest's userspace?

Memory passthrough for virtual machines
Posted Jun 23, 2023 12:04 UTC (Fri) by angelsl (subscriber, #144646) [Link]

The levels (pmd, pud, p4d, etc.) above the last level all specify page frame numbers, so there should not be any TLB lookups here.