Reworking page-table traversal

By Jonathan Corbet
May 4, 2018

A system's page tables are organized into a tree that is as many as five levels deep. In many ways those levels are all similar, but the kernel treats them all as being different, with the result that page-table manipulations include a fair amount of repetitive code. During the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Kirill Shutemov proposed reworking how page tables are maintained. The idea was popular, but the implementation is likely to be tricky.

On a system with five-level page tables (which few of us have at this point, since Shutemov just added the fifth level), a traversal of the tree starts at the page global directory (PGD). From there, it proceeds to the P4D, the page upper directory (PUD), the page middle directory (PMD), and finally to the PTE level that contains information about individual 4KB pages. If the kernel wants to unmap a range of page-table entries, it may have to make changes at multiple levels. In the code, that means that a call to unmap_page_range() will start in the PGD, then call zap_p4d_range() to do the work at the P4D level. The calls trickle down through zap_pud_range() and zap_pmd_range() before ending up in zap_pte_range(). All of the levels in this traversal (except the final one) look quite similar, but each is coded separately. There is a similar cascade of functions for most common page-table operations. Some clever coding ensures that the unneeded layers are compiled out when the kernel is built for a system with shallower page tables.
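To make the five levels concrete, here is a minimal sketch of how a virtual address splits into per-level table indices. It assumes the x86-64 five-level layout (a 12-bit page offset and 9 index bits per level); the `pt_level` enum and the `pt_index()` helper are hypothetical names used for illustration only, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 12 offset bits plus five 9-bit indices (x86-64, 5-level). */
#define PAGE_SHIFT      12
#define BITS_PER_LEVEL  9

/* Hypothetical level numbering: 0 is the PTE level, 4 is the PGD. */
enum pt_level { LVL_PTE, LVL_PMD, LVL_PUD, LVL_P4D, LVL_PGD };

/* Hypothetical helper: index into the table at `level` for address `va`. */
static unsigned int pt_index(uint64_t va, int level)
{
    assert(level >= LVL_PTE && level <= LVL_PGD);

    unsigned int shift = PAGE_SHIFT + BITS_PER_LEVEL * (unsigned int)level;

    /* Keep only the 9 bits that select an entry at this level. */
    return (unsigned int)((va >> shift) & ((1u << BITS_PER_LEVEL) - 1));
}
```

A traversal would call `pt_index()` once per level, from `LVL_PGD` down to `LVL_PTE`, to pick the entry to follow at each step.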

Shutemov would like to replace this boilerplate with something a bit more compact. He proposes representing a pointer into the page tables (at any level) with a structure like:

    struct pt_ptr {
        unsigned long *ptr;
        int lvl;
    };

Using this structure, page-table manipulations would be handled by a single function that would call itself recursively to work down the levels. Recursion is generally frowned upon in the kernel because it can eat up stack space, but in this case it is strictly bounded by the depth of the page tables. That one function would replace the five that exist now, but it would naturally become somewhat more complex.
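As a rough illustration of the idea (not Shutemov's actual code, which was not shown), a single recursive walker over a toy page table might look like the sketch below. It is a user-space model under several assumptions: every table has 512 entries, a nonzero entry above level 0 is simply the address of the next-lower table, `unsigned long` is wide enough to hold a pointer (as it is in the kernel), and `pt_zap()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES_PER_TABLE 512   /* assumption: 9 index bits per level */

struct pt_ptr {
    unsigned long *ptr;   /* the table at this level */
    int lvl;              /* 0 = PTE level; higher values are nearer the PGD */
};

/*
 * Hypothetical recursive walker: clear every populated leaf entry below
 * `pt` and return how many were cleared. One function handles all levels;
 * only the lvl == 0 case does leaf-specific work.
 */
static size_t pt_zap(struct pt_ptr pt)
{
    size_t zapped = 0;

    assert(pt.ptr != NULL && pt.lvl >= 0);

    for (size_t i = 0; i < ENTRIES_PER_TABLE; i++) {
        unsigned long entry = pt.ptr[i];

        if (entry == 0)
            continue;                 /* nothing mapped under this entry */

        if (pt.lvl == 0) {
            pt.ptr[i] = 0;            /* leaf: drop the mapping */
            zapped++;
        } else {
            struct pt_ptr child = {
                .ptr = (unsigned long *)(uintptr_t)entry,
                .lvl = pt.lvl - 1,
            };
            /* Recursion depth is bounded by the starting pt.lvl. */
            zapped += pt_zap(child);
        }
    }
    return zapped;
}
```

The sketch leaves out everything that makes the real job hard (entry flags, huge pages, locking, TLB flushing, freeing the intermediate tables); it only shows how the `lvl` field lets one function stand in for the per-level cascade.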

He asked: would this change be worth it? Michal Hocko asked just how many years of work would be required to get this change done. Among other things, it would have to touch every architecture in the system. If it proves impossible to create some sort of a compatibility layer that would let architectures opt into the new scheme, an all-architecture flag day would be required. Given that, Hocko said that he wasn't sure it would be worth the trouble.

Laura Abbott asked what problems would be solved by the new mechanism. One is that it would deal more gracefully with pages of different sizes. Some architectures (POWER, for example) can support multiple page sizes simultaneously; this scheme would make that feature easier to use and manage. Current code has to deal with a number of special cases involving the top-level table; those would mostly go away in the new scheme. And, presumably, the resulting code would be cleaner.

It was also said in jest that this mechanism would simplify the work when processors using six-level page tables show up. The subsequent discussion suggested that this is no joking matter; it seems that such designs are already under consideration. When such hardware does appear, Shutemov said, there will be no time to radically rework page-table manipulations to support it, so there will be no alternative to adding a sixth layer of functions instead. In an effort to avoid that, he is going to try to push this work forward on the x86 architecture and see how it goes.

Index entries for this article
Kernel: Memory management/Internal API
Conference: Storage, Filesystem, and Memory-Management Summit/2018

Posted May 5, 2018 1:20 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (4 responses)

As far as I know, s390 hardware already supports six-level page tables...

Posted May 6, 2018 17:37 UTC (Sun) by willy (subscriber, #9762) [Link] (3 responses)

No, it's 5 levels on z/Arch: 11+11+11+11+8+12 = 64 bits.
x86 uses 9+9+9+9+9+12 = 57 bits.

Source: first result on
https://www.google.ca/search?q=zseries+page+table+site%3A...
(Hard to get a direct link to a PDF on Android)

If x86 were willing to switch to an 8k page size, 5-level paging would get them 63 bits.
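The bit counts for the uniform x86-style layouts can be checked with a short sketch. It assumes every table occupies exactly one page and all entries are the same size, which does not hold for z/Arch (whose last level supplies only 8 bits); `log2_pow2()` and `va_bits()` are hypothetical helper names.

```c
#include <assert.h>

/* Hypothetical helper: base-2 logarithm of a power of two. */
static unsigned int log2_pow2(unsigned long x)
{
    unsigned int n = 0;

    assert(x != 0 && (x & (x - 1)) == 0);   /* must be a power of two */
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/*
 * Virtual-address bits covered when each table fills one page:
 * offset bits + levels * log2(entries per table).
 */
static unsigned int va_bits(unsigned long page_size,
                            unsigned long entry_size,
                            unsigned int levels)
{
    assert(entry_size <= page_size);

    return log2_pow2(page_size) + levels * log2_pow2(page_size / entry_size);
}
```

With 8-byte entries, `va_bits(4096, 8, 5)` evaluates to 12 + 5*9 = 57 and `va_bits(8192, 8, 5)` to 13 + 5*10 = 63.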

Posted May 7, 2018 12:40 UTC (Mon) by cborni (subscriber, #12949) [Link]

Yes, we have 5. The tricky thing is that the format of the page tables differs between levels, so a PMD entry has a different format than a PTE. I'm not sure the proposed scheme would be better for such cases, where we have to use different accessors depending on the level.

Posted May 8, 2018 23:53 UTC (Tue) by luto (guest, #39314) [Link] (1 responses)

There’s no need for an 8k page. All that would be needed is bigger page table chunks. 8k (i.e. two-page) directories with 8-byte entries give 10 bits of VA per level: 12+5*10 = 62 bits. Now give separate roots for the top and bottom halves of the address space and 63 of the 64 possible bits are covered.

Intel, if you ever revamp the page table format again, here are some feature requests:

- Separate R, W, and X bits.

- Separate page table roots for the top and bottom halves of the address space. Or, even better, separate user-mode and kernel-mode page tables.

- At least one extra address space accessible with prefixes or special instructions. For example, having the stack live in a separate address space accessible only with PUSH, POP, and prefixed memory operands would be awesome.

(Hmm. There are already mostly-useless CS, SS, DS, ES, FS, and GS prefixes. Make each one refer to a separate address space. Make ES, FS, and GS be usable only at CPL0, and make CPL0 default to the ES space and CPL > 0 default to CS.)

Posted May 10, 2018 20:54 UTC (Thu) by kiryl (subscriber, #41516) [Link]

8-byte entries are not enough. We have already run out of bits in the entries, and every new feature wants to claim one (or 15 with MKTME :P).

Why are separate roots for kernel and user useful? Is it only to protect against Meltdown-like stuff?