
Five-level page tables

By Jonathan Corbet
March 15, 2017
Near the beginning of 2005, the merging of the four-level page tables patches for 2.6.10 was an early test of the (then) new kernel development model. It demonstrated that the community could indeed merge fundamental changes and get them out quickly to users — a far cry from the multi-year release cycles that prevailed before the 2.6.0 release. The merging of five-level page tables (outside of the merge window) for 4.11-rc2, instead, barely raised an eyebrow. It is, however, a significant change that is indicative of where the computing industry is going.

A page table, of course, maps a virtual memory address to the physical address where the data is actually stored. It is conceptually a linear array indexed by the virtual address (or, at least, by the page-frame-number portion of that address) and yielding the page-frame number of the associated physical page. Such an array would be large, though, and it would be hugely wasteful. Most processes don't use the full available virtual address space even on 32-bit systems, and they don't use even a tiny fraction of it on 64-bit systems, so the address space tends to be sparsely populated and, as a result, much of that array would go unused.

The solution to this problem, as implemented in the hardware for decades, is to turn the linear array into a sparse tree representing the address space. The result is something that looks like this:

[Four-level page tables]

The row of boxes across the top represents the bits of a 64-bit virtual address. To translate that address, the hardware splits the address into several bit fields. Note that, in the scheme shown here (corresponding to how the x86-64 architecture uses addresses), the uppermost 16 bits are discarded; only the lower 48 bits of the virtual address are used. Of the bits that are used, the nine most significant (bits 39-47) are used to index into the page global directory (PGD); a single page for each address space. The value read there is the address of the page upper directory (PUD); bits 30-38 of the virtual address are used to index into the indicated PUD page to get the address of the page middle directory (PMD). With bits 21-29, the PMD can be indexed to get the lowest level page table, just called the PTE. Finally, bits 12-20 of the virtual address will, when used to index into the PTE, yield the physical address of the actual page containing the data. The lowest twelve bits of the virtual address are the offset into the page itself.

At any level of the page table, the pointer to the next level can be null, indicating that there are no valid virtual addresses in that range. This scheme thus allows large subtrees to be missing, corresponding to ranges of the address space that have no mapping. The middle levels can also have special entries indicating that they point directly to a (large) physical page rather than to a lower-level page table; that is how huge pages are implemented. A 2MB huge page would be found directly at the PMD level, with no intervening PTE page, for example.

One can quickly see that the process of translating a virtual address is going to be expensive, requiring several fetches from main memory. That is why the translation lookaside buffer (TLB) is so important for the performance of the system as a whole, and why huge pages, which require fewer lookups, also help.

It is worth noting that not all systems run with four levels of page tables. 32-bit systems use three or even two levels, for example. The memory-management code is written as if all four levels were always present; some careful coding ensures that, in kernels configured to use fewer levels, the code managing the unused levels is transparently left out.

Back when four-level page tables were merged, your editor wrote: "Now x86-64 users can have a virtual address space covering 128TB of memory, which really should last them for a little while." The value of "a little while" can now be quantified: it would appear to be about 12 years. Though, in truth, the real constraint appears to be the 64TB of physical memory that current x86-64 processors can address; as Kirill Shutemov noted in the x86 five-level page-table patches, there are already vendors shipping systems with that much memory installed.

As is so often the case in this field, the solution is to add another level of indirection in the form of a fifth level of page tables. The new level, called the "P4D", is inserted between the PGD and the PUD. The patches adding this level were merged for 4.11-rc2, even though there is, at this point, no support for five-level paging on any hardware. While the addition of four-level page tables caused a bit of nervousness, the five-level patches merged were described as "low risk". At this point, the memory-management developers have a pretty good handle on the changes that need to be made to add another level.

The patches adding five-level support for upcoming Intel processors are currently slated for 4.12. Systems running with five-level paging will support 52-bit physical addresses and 57-bit virtual addresses. Or, as Shutemov put it: "It bumps the limits to 128 PiB of virtual address space and 4 PiB of physical address space. This 'ought to be enough for anybody'." The new level also allows the creation of 512GB huge pages.

The current patches have a couple of loose ends to take care of. One of those is that Xen will not work on systems with five-level page tables enabled; it will continue to work on four-level systems. There is also a need for a boot-time flag to allow switching between four-level and five-level paging so that distributors don't have to ship two different kernel binaries.

Another interesting problem is described at the end of the patch series. It would appear that there are programs out there that "know" that only the bottom 48 bits of a virtual address are valid. They take advantage of that knowledge by encoding other information in the uppermost bits. Those programs will clearly break if those bits suddenly become part of the address itself. To avoid such problems, the x86 patches in their current form will not allocate memory in the new address space by default. An application that needs that much memory, and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible.

Anybody wanting to play with the new mode can do so now with QEMU, which understands five-level page tables. Otherwise it will be a matter of waiting for the processors to come out — and the funds to buy a machine with that much memory in it. When the hardware is available, the kernel should be ready for it.

Index entries for this article
Kernel: Memory management/Five-level page tables



Five-level page tables

Posted Mar 16, 2017 6:16 UTC (Thu) by ibukanov (subscriber, #3942) [Link]

Modern JavaScript interpreters assume 48-bit pointers. They use that to encode pointers and other values inside NaN doubles: https://wingolog.org/archives/2011/05/18/value-representa...

Five-level page tables

Posted Mar 16, 2017 9:51 UTC (Thu) by kiryl (subscriber, #41516) [Link]

> The new level also allows the creation of 512GB huge pages.

No, x86 5-level paging doesn't add a new huge page size.

And it's not very related to 5-level paging in general: we could have 512GB pages with 4-level paging too -- at the PGD level.

Five-level page tables

Posted Mar 16, 2017 11:41 UTC (Thu) by mips (guest, #105013) [Link] (6 responses)

As a programmer but not a kernel programmer, I am wondering: is the patch any more than just something like

- #define PM_LEVELS 4

+ #define PM_LEVELS 5

Five-level page tables

Posted Mar 16, 2017 19:05 UTC (Thu) by khim (subscriber, #9252) [Link] (5 responses)

Only Java programmers could ever invent something that crazy. Handling four levels with a loop would be extremely wasteful, especially since the levels are all slightly different (the middle levels can be used for huge pages, the lowest level specifies pages themselves rather than further tables, etc.).

Five-level page tables

Posted Mar 20, 2017 20:46 UTC (Mon) by ssmith32 (subscriber, #72404) [Link] (4 responses)

No, not only Java programmers like me, sadly. I've seen much stupider things in a variety of languages, from C to Java to Ruby to Javascript to Python to Prolog. And I have even coded a few stupid things myself in each language.

I think the most inefficient thing I've ever coded was actually in Prolog.

And read the article - it's the damn C programmers who thought they/we were so clever shoving stuff into the upper bits of pointers, which means the rest of us have to jump through hoops to get access to more memory. And shoving stuff into pointers *is* the kind of stupidity that seems highly correlated with the language of choice.

Five-level page tables

Posted Mar 21, 2017 18:20 UTC (Tue) by anton (subscriber, #25547) [Link] (3 responses)

Yes, making maximum use of the bits in a pointer is something you only do in a low-level language such as C. This technique has certain benefits (efficiency) and certain drawbacks (as evidenced here). Since you are smarter than these programmers, you can demonstrate how to achieve equivalent time and memory efficiency in a Javascript implementation without "shoving stuff into pointers".

Five-level page tables

Posted Mar 23, 2017 11:03 UTC (Thu) by oldtomas (guest, #72579) [Link]

Lisps have been doing that since times immemorial ("tagged pointers"). But they learnt the lesson and most of them switched to using the lower bits: if you know the structures you point to are aligned, you get two to three bits "down there". Of course you don't have the luxury of a full 16 bits, but hey.

Five-level page tables

Posted Mar 23, 2017 23:10 UTC (Thu) by ssmith32 (subscriber, #72404) [Link] (1 responses)

Easy. Since your "trade-off" is to risk violating and corrupting memory to achieve this, by making invalid assumptions about how the memory subsystem works, simply save yourself the overhead of malloc, and blindly read and write to memory without allocating it ahead of time :P.

Btw, I said "we" do dumb things like shove stuff in pointers. Never said I was smarter ;)

Five-level page tables

Posted Mar 23, 2017 23:28 UTC (Thu) by ssmith32 (subscriber, #72404) [Link]

But if you do want to know how smart people (not me) intelligently handle stuff like this, look to Java programmers, as in the people who program Java, as in program the JVM:
https://wiki.openjdk.java.net/display/HotSpot/CompressedOops

Five-level page tables

Posted Mar 16, 2017 12:08 UTC (Thu) by PaXTeam (guest, #24616) [Link] (2 responses)

> the uppermost 18 bits are discarded; only the lower 48 bits of the virtual address are used.

that math and the picture don't seem to add up: 18+48=66 != 64 :)

> bits 30-38 of the virtual address are used to index into the indicated PUD page to get the address of the page middle directory (PMD).

vs.

> With bits 21-29, the PUD can be indexed to get the lowest level page table, just called the PTE.

the PUD is indexed by only one set of bits, this latter wants to be the PMD i guess.

Fixed

Posted Mar 16, 2017 13:26 UTC (Thu) by corbet (editor, #1) [Link] (1 responses)

Those were a couple of silly slips of the fingers, yes; fixed now.

Fixed

Posted Mar 16, 2017 14:14 UTC (Thu) by PaXTeam (guest, #24616) [Link]

now if only the picture also showed 16 non-translated bits... ;)

Five-level page tables

Posted Apr 16, 2017 14:58 UTC (Sun) by glaubitz (subscriber, #96452) [Link]

> The patches adding five-level support for upcoming Intel processors are currently slated for 4.12. Systems running with five-level paging will support 52-bit physical addresses and 57-bit virtual addresses. Or, as Shutemov put it: "It bumps the limits to 128 PiB of virtual address space and 4 PiB of physical address space. This 'ought to be enough for anybody'." The new level also allows the creation of 512GB huge pages.

Finally. Currently, SPARC is the only architecture with 52 VA bits, and many Javascript implementations are broken on SPARC for this reason. If x86_64 uses more VA bits as well, there will finally be some pressure on the Javascript engine developers to address this issue.

> An application that needs that much memory, and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible.

The problem here is that this hint is honored for mmap() only provided that the memory area pointed to by "hint" is still free. If it is already allocated, mmap() will again return an address from the top of the VA space and hence exceed the 48-bit limit [1]. On NetBSD, for example, mmap() will allocate memory in the vicinity of "hint" instead of ignoring it completely. Because of this, Firefox has an ugly workaround for this issue on Linux/arm64 which employs a manual allocator for the Javascript heap [2].

So, if they are suggesting that people use "hint" with mmap() to avoid the problems with applications that assume 48-bit address spaces, they will need to restore the original behavior of mmap() described in [1].

> [1] http://lkml.iu.edu/hypermail/linux/kernel/0305.2/0828.html
> [2] https://hg.mozilla.org/integration/mozilla-inbound/rev/df...


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds