Five-level page tables
A page table, of course, maps a virtual memory address to the physical address where the data is actually stored. It is conceptually a linear array indexed by the virtual address (or, at least, by the page-frame-number portion of that address) and yielding the page-frame number of the associated physical page. Such an array would be large, though, and it would be hugely wasteful. Most processes don't use the full available virtual address space even on 32-bit systems, and they don't use even a tiny fraction of it on 64-bit systems, so the address space tends to be sparsely populated and, as a result, much of that array would go unused.
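To put a rough number on "large", here is a back-of-the-envelope sketch. It assumes 4KB pages and 8-byte table entries (typical x86-64 values, but assumptions here); `flat_table_bytes` is a name made up for this illustration.

```python
PAGE_SHIFT = 12   # assumption: 4KB pages, so the low 12 bits are the page offset
ENTRY_SIZE = 8    # assumption: 8 bytes per page-table entry

def flat_table_bytes(va_bits):
    """Size in bytes of one linear page table covering va_bits of address space."""
    entries = 1 << (va_bits - PAGE_SHIFT)   # one entry per virtual page
    return entries * ENTRY_SIZE

print(flat_table_bytes(32) // 2**20, "MiB for a 32-bit address space")
print(flat_table_bytes(48) // 2**30, "GiB for a 48-bit address space")
```

Under these assumptions, every process on a 48-bit system would need a table far larger than the memory most machines have, almost all of it empty.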
The solution to this problem, as implemented in the hardware for decades, is to turn the linear array into a sparse tree representing the address space. The result is something that looks like this:
The row of boxes across the top represents the bits of a 64-bit virtual address. To translate that address, the hardware splits the address into several bit fields. Note that, in the scheme shown here (corresponding to how the x86-64 architecture uses addresses), the uppermost 16 bits are discarded; only the lower 48 bits of the virtual address are used. Of the bits that are used, the nine most significant (bits 39-47) are used to index into the page global directory (PGD), a single page for each address space. The value read there is the address of the page upper directory (PUD); bits 30-38 of the virtual address are used to index into the indicated PUD page to get the address of the page middle directory (PMD). With bits 21-29, the PMD can be indexed to get the lowest-level page table, just called the PTE. Finally, bits 12-20 of the virtual address will, when used to index into the PTE, yield the physical address of the actual page containing the data. The lowest twelve bits of the virtual address are the offset into the page itself.
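The bit-field split described above can be sketched in a few lines. This only illustrates the arithmetic, not how the hardware or the kernel implements it; `split_vaddr` is a hypothetical helper name.

```python
def split_vaddr(vaddr):
    """Split a virtual address into four 9-bit table indices and a 12-bit offset.

    Bits 48-63 are ignored, matching the four-level scheme described in the text.
    """
    return {
        "pgd": (vaddr >> 39) & 0x1FF,   # bits 39-47
        "pud": (vaddr >> 30) & 0x1FF,   # bits 30-38
        "pmd": (vaddr >> 21) & 0x1FF,   # bits 21-29
        "pte": (vaddr >> 12) & 0x1FF,   # bits 12-20
        "offset": vaddr & 0xFFF,        # bits 0-11
    }

# Build an address from known indices, then take it apart again.
addr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_vaddr(addr))
```

Each index is nine bits wide because a 4KB table page holds 512 eight-byte entries.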
At any level of the page table, the pointer to the next level can be null, indicating that there are no valid virtual addresses in that range. This scheme thus allows large subtrees to be missing, corresponding to ranges of the address space that have no mapping. The middle levels can also have special entries indicating that they point directly to a (large) physical page rather than to a lower-level page table; that is how huge pages are implemented. A 2MB huge page would be found directly at the PMD level, with no intervening PTE page, for example.
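A toy model may make the sparse tree easier to picture. In this sketch nested dictionaries stand in for table pages, a missing key plays the role of a null pointer, and a `("huge", base)` tuple is an invented marker for an entry that maps a large page directly. None of this mirrors the real entry format.

```python
HUGE = "huge"   # hypothetical marker for an entry that maps a large page directly

def walk(pgd, vaddr):
    """Toy four-level walk; returns a physical address, or None if unmapped."""
    table = pgd
    for shift in (39, 30, 21, 12):           # PGD, PUD, PMD, PTE
        entry = table.get((vaddr >> shift) & 0x1FF)
        if entry is None:
            return None                       # the whole subtree is missing
        if isinstance(entry, tuple) and entry[0] == HUGE:
            # Large page: all remaining low bits form the offset into it.
            return entry[1] + (vaddr & ((1 << shift) - 1))
        if shift == 12:
            return entry + (vaddr & 0xFFF)    # entry is the page's physical base
        table = entry                         # descend to the next level

pgd = {
    0: {0: {                      # PGD[0] -> PUD[0] -> PMD page
        0: {5: 0x40000},          # PMD[0] -> PTE page; PTE[5] -> page at 0x40000
        1: (HUGE, 0x800000),      # PMD[1] maps a 2MB huge page directly
    }}
}
print(hex(walk(pgd, 0x5123)))     # resolved through all four levels
print(hex(walk(pgd, 0x212345)))   # resolved at the PMD level
print(walk(pgd, 1 << 39))         # PGD[1] is absent, so there is no mapping
```

The huge-page lookup stops one level early, which is the saving the following paragraph refers to.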
One can quickly see that the process of translating a virtual address is going to be expensive, requiring several fetches from main memory. That is why the translation lookaside buffer (TLB) is so important for the performance of the system as a whole, and why huge pages, which require fewer lookups, also help.
It is worth noting that not all systems run with four levels of page tables. 32-bit systems use three or even two levels, for example. The memory-management code is written as if all four levels were always present; some careful coding ensures that, in kernels configured to use fewer levels, the code managing the unused levels is transparently left out.
Back when four-level page tables were merged, your editor wrote: "Now x86-64 users can have a virtual address space covering 128TB of memory, which really should last them for a little while." The value of "a little while" can now be quantified: it would appear to be about 12 years. Though, in truth, the real constraint appears to be the 64TB of physical memory that current x86-64 processors can address; as Kirill Shutemov noted in the x86 five-level page-table patches, there are already vendors shipping systems with that much memory installed.
As is so often the case in this field, the solution is to add another level of indirection in the form of a fifth level of page tables. The new level, called the "P4D", is inserted between the PGD and the PUD. The patches adding this level were merged for 4.11-rc2, even though there is, at this point, no support for five-level paging on any hardware. While the addition of four-level page tables caused a bit of nervousness, the five-level patches merged were described as "low risk". At this point, the memory-management developers have a pretty good handle on the changes that need to be made to add another level.
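One way to see what the extra level does to the address layout is to compute the bit range each level indexes. The sketch assumes 4KB pages and nine index bits per level, as in the four-level description earlier; `level_bit_ranges` is a name invented for this example.

```python
PAGE_SHIFT = 12      # assumption: 4KB pages
BITS_PER_LEVEL = 9   # assumption: 512 entries per table page

def level_bit_ranges(names):
    """Map each level name (top level first) to the (low, high) bits that index it."""
    ranges = {}
    low = PAGE_SHIFT
    for name in reversed(names):              # assign bits from the lowest level up
        ranges[name] = (low, low + BITS_PER_LEVEL - 1)
        low += BITS_PER_LEVEL
    return ranges

four = level_bit_ranges(["pgd", "pud", "pmd", "pte"])
five = level_bit_ranges(["pgd", "p4d", "pud", "pmd", "pte"])
print(four["pgd"])                # bits indexing the PGD with four levels
print(five["p4d"], five["pgd"])   # the P4D takes over that range; the PGD moves up
```

With five levels the P4D is indexed by the bits the PGD used before, and the PGD is indexed by nine bits that were previously unused.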
The patches adding five-level support for upcoming Intel processors are currently slated for 4.12. Systems running with five-level paging will support 52-bit physical addresses and 57-bit virtual addresses. Or, as Shutemov put it: "It bumps the limits to 128 PiB of virtual address space and 4 PiB of physical address space. This 'ought to be enough for anybody'." The new level also allows the creation of 512GB huge pages.
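The limits quoted in this article follow directly from the address widths. A quick check, using binary units throughout; the helper name is made up for the example:

```python
TIB, PIB = 1 << 40, 1 << 50

def address_space_bytes(bits):
    """Number of bytes addressable with the given number of address bits."""
    return 1 << bits

# Four-level paging: 48 virtual bits in total. The 128TB figure quoted earlier
# matches a 47-bit (half-sized) range.
print(address_space_bytes(47) // TIB, "TiB")
# Physical limit cited for current processors: 64TB corresponds to 46 bits.
print(address_space_bytes(46) // TIB, "TiB")
# Five-level paging: 57 virtual bits and 52 physical bits.
print(address_space_bytes(57) // PIB, "PiB virtual")
print(address_space_bytes(52) // PIB, "PiB physical")
# Range covered by one entry at the level indexed by bits 39-47.
print(address_space_bytes(39) // (1 << 30), "GiB per entry")
```

The last figure is where a 512GB page size would come from: a single entry at that level spans 2^39 bytes.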
The current patches have a couple of loose ends to take care of. One of those is that Xen will not work on systems with five-level page tables enabled; it will continue to work on four-level systems. There is also a need for a boot-time flag to allow switching between four-level and five-level paging so that distributors don't have to ship two different kernel binaries.
Another interesting problem is described at the end of the patch series. It would appear that there are programs out there that "know" that only the bottom 48 bits of a virtual address are valid. They take advantage of that knowledge by encoding other information in the uppermost bits. Those programs will clearly break if those bits suddenly become part of the address itself. To avoid such problems, the x86 patches in their current form will not allocate memory in the new address space by default. An application that needs that much memory, and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible.
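The breakage is easy to reproduce in miniature. The following sketch models a program that keeps a 16-bit tag in bits 48-63 of a pointer-sized word. The layout and the names `pack` and `unpack` are illustrative assumptions, not taken from any particular program.

```python
TAG_SHIFT = 48                       # assumption: program treats bits 48-63 as free
ADDR_MASK = (1 << TAG_SHIFT) - 1

def pack(ptr, tag):
    """Store a 16-bit tag in the upper bits of a 64-bit pointer value."""
    return (tag << TAG_SHIFT) | (ptr & ADDR_MASK)

def unpack(word):
    """Recover (pointer, tag), assuming the pointer fits in 48 bits."""
    return word & ADDR_MASK, word >> TAG_SHIFT

low_ptr = 0x00007F0000001000         # fits below the 48-bit boundary
high_ptr = 1 << 55                   # only possible with 57-bit virtual addresses

print(unpack(pack(low_ptr, 0xAB)) == (low_ptr, 0xAB))   # round trip succeeds
print(unpack(pack(high_ptr, 0xAB))[0] == high_ptr)      # address bits were lost
```

Keeping allocations below the old boundary unless the application asks otherwise means the second case cannot arise by accident.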
Anybody wanting to play with the new mode can do so now with QEMU, which understands five-level page tables. Otherwise it will be a matter of waiting for the processors to come out, and for the funds to buy a machine with that much memory in it. When the hardware is available, the kernel should be ready for it.
Index entries for this article | |
---|---|
Kernel | Memory management/Five-level page tables |
Posted Mar 16, 2017 6:16 UTC (Thu)
by ibukanov (subscriber, #3942)
[Link]
Posted Mar 16, 2017 9:51 UTC (Thu)
by kiryl (subscriber, #41516)
[Link]
No, x86 5-level paging doesn't add new huge page size.
And it's not very related to 5-level paging in general: we could have 512GB pages with 4-level paging too -- on pgd level.
Posted Mar 16, 2017 11:41 UTC (Thu)
by mips (guest, #105013)
[Link] (6 responses)
- #define PM_LEVELS 4
+ #define PM_LEVELS 5
Posted Mar 16, 2017 19:05 UTC (Thu)
by khim (subscriber, #9252)
[Link] (5 responses)
Only Java programmers could ever invent something that crazy. Handling four levels with a loop is just extremely wasteful, especially since they are all slightly different (a middle level can be used for huge pages, the lowest level points not to further tables but to the pages themselves, etc.).
Posted Mar 20, 2017 20:46 UTC (Mon)
by ssmith32 (subscriber, #72404)
[Link] (4 responses)
I think the most inefficient thing I've ever coded was actually in Prolog.
And read the article: it's the damn C programmers who thought they/we were so clever in shoving stuff into the upper bits of pointers, which means the rest of us have to jump through hoops to get access to more memory. And shoving stuff into pointers *is* the kind of stupidity that seems highly correlated with the language of choice.
Posted Mar 21, 2017 18:20 UTC (Tue)
by anton (subscriber, #25547)
[Link] (3 responses)
Posted Mar 23, 2017 11:03 UTC (Thu)
by oldtomas (guest, #72579)
[Link]
Posted Mar 23, 2017 23:10 UTC (Thu)
by ssmith32 (subscriber, #72404)
[Link] (1 responses)
Btw, I said "we" do dumb things like shove stuff in pointers. Never said I was smarter ;)
Posted Mar 23, 2017 23:28 UTC (Thu)
by ssmith32 (subscriber, #72404)
[Link]
Posted Mar 16, 2017 12:08 UTC (Thu)
by PaXTeam (guest, #24616)
[Link] (2 responses)
that math and the picture don't seem to add up: 18+48=66 != 64 :)
> bits 30-38 of the virtual address are used to index into the indicated PUD page to get the address of the page middle directory (PMD).
vs.
> With bits 21-29, the PUD can be indexed to get the lowest level page table, just called the PTE.
the PUD is indexed by only one set of bits, this latter wants to be the PMD i guess.
Posted Apr 16, 2017 14:58 UTC (Sun)
by glaubitz (subscriber, #96452)
[Link]
Finally. Currently, SPARC is the only architecture with 52 VA bits and many Javascript implementations are therefore broken on SPARC for this reason. If x86_64 uses more VA bits as well, it means there is finally some pressure on the Javascript engine developers to address this issue.
> An application that needs that much memory, and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible.
The problem here is that this hint is honored for mmap() only provided that the memory area pointed to by "hint" is still free. If it is already allocated, mmap() will again return an address from the top of the VA space and hence exceed the 48-bit limit [1]. On NetBSD, for example, mmap() will allocate memory in the vicinity of "hint" instead of ignoring it completely. Because of this, Firefox has an ugly workaround for this issue on Linux/arm64 which employs a manual allocator for the Javascript heap [2].
So, if they are suggesting that people use "hint" for mmap() to avoid the problems with applications that assume 48-bit address spaces, they will need to restore the original behavior of mmap(), which is described in [1].
[1] http://lkml.iu.edu/hypermail/linux/kernel/0305.2/0828.html
Yes, making maximum use of the bits in a pointer is something you only do in a low-level language such as C. This technique has certain benefits (efficiency) and certain drawbacks (as evidenced here). Since you are smarter than these programmers, you can demonstrate how to achieve equivalent time and memory efficiency in a Javascript implementation without "shoving stuff into pointers".
https://wiki.openjdk.java.net/display/HotSpot/CompressedOops
[2] https://hg.mozilla.org/integration/mozilla-inbound/rev/df...