Reworking page-table traversal
On a system with five-level page tables (which few of us have at this
point, since Kirill Shutemov just added the fifth level), a traversal of the tree
starts at the page global directory (PGD). From there, it proceeds to the
P4D, the page upper directory (PUD), the page middle directory (PMD), and
finally to the PTE level that contains information about individual 4KB
pages. If the kernel wants to unmap a range of page-table entries, it may
have to make changes at multiple levels. In the code, that means that a
call to unmap_page_range() will start in the PGD, then call
zap_p4d_range() to do the work at the P4D level. The calls trickle down
through zap_pud_range() and zap_pmd_range() before ending up in
zap_pte_range().
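
To make the five-level walk a little more concrete, here is a minimal sketch of how a virtual address could be split into one table index per level. It assumes x86-64-style geometry (nine index bits per level plus a 12-bit offset for 4KB pages); `level_index()` and `va_bits()` are hypothetical helpers invented for this illustration, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 4KB pages, 512 entries per table, five levels. */
#define PAGE_SHIFT      12
#define BITS_PER_LEVEL   9
#define NR_LEVELS        5   /* 0 = PTE, 1 = PMD, 2 = PUD, 3 = P4D, 4 = PGD */

/* Hypothetical helper: index into the table at level lvl for addr. */
static unsigned int level_index(uint64_t addr, int lvl)
{
    assert(lvl >= 0 && lvl < NR_LEVELS);
    return (unsigned int)((addr >> (PAGE_SHIFT + lvl * BITS_PER_LEVEL)) &
                          ((1u << BITS_PER_LEVEL) - 1));
}

/* Hypothetical helper: number of virtual-address bits the tree can translate. */
static unsigned int va_bits(void)
{
    return PAGE_SHIFT + NR_LEVELS * BITS_PER_LEVEL;
}
```

With these assumptions, a walk consults `level_index(addr, 4)` in the PGD first and works down to `level_index(addr, 0)` in the PTE table.
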
All of the levels in this traversal (except the final one) look quite
similar, but each is coded separately. There is a similar cascade of
functions for most common page-table operations. Some clever coding
ensures that the unneeded layers are compiled out when the kernel is built
for a system with shallower page tables.
Shutemov would like to replace this boilerplate with something a bit more compact. He proposes representing a pointer into the page tables (at any level) with a structure like:
    struct pt_ptr {
        unsigned long *ptr;
        int lvl;
    };
Using this structure, page-table manipulations would be handled by a single function that would call itself recursively to work down the levels. Recursion is generally frowned upon in the kernel because it can eat up stack space, but in this case it is strictly bounded by the depth of the page tables. That one function would replace the five that exist now, but it would naturally become somewhat more complex.
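
As a rough illustration of the idea, the sketch below runs one recursive function over a toy, user-space page-table tree. Only `struct pt_ptr` comes from the proposal; everything else is an assumption made for the example: the names `toy_map()` and `toy_zap()`, the nine-bits-per-level geometry, and the encoding of entries (zero means empty, non-leaf entries hold a pointer to the next table). It is not the code Shutemov proposed, and unlike the kernel it never frees emptied tables.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT      12
#define BITS_PER_LEVEL   9
#define ENTRIES         (1UL << BITS_PER_LEVEL)
#define TOP_LEVEL        4   /* 0 = PTE ... 4 = PGD */

/* Toy assumption: a table entry is wide enough to hold a pointer. */
_Static_assert(sizeof(unsigned long) >= sizeof(void *), "entry must hold a pointer");

struct pt_ptr { unsigned long *ptr; int lvl; };   /* from the proposal */

/* Bytes of address space covered by one entry at the given level. */
static uint64_t entry_span(int lvl)
{
    return 1ULL << (PAGE_SHIFT + lvl * BITS_PER_LEVEL);
}

/* Hypothetical helper: mark one page as mapped, allocating tables on the way down. */
static int toy_map(struct pt_ptr pt, uint64_t addr)
{
    unsigned long *e = &pt.ptr[(addr / entry_span(pt.lvl)) & (ENTRIES - 1)];

    if (pt.lvl == 0) {
        *e = 1;                       /* toy "present" marker */
        return 0;
    }
    if (!*e) {
        void *table = calloc(ENTRIES, sizeof(unsigned long));
        if (!table)
            return -1;
        *e = (unsigned long)table;
    }
    return toy_map((struct pt_ptr){ (unsigned long *)*e, pt.lvl - 1 }, addr);
}

/*
 * One function for every level: clear all leaf entries in [start, end).
 * base is the address covered by the first entry of this table.
 * Recursion depth is bounded by TOP_LEVEL + 1. Returns pages cleared.
 */
static unsigned long toy_zap(struct pt_ptr pt, uint64_t base,
                             uint64_t start, uint64_t end)
{
    unsigned long zapped = 0;
    uint64_t span = entry_span(pt.lvl);

    for (unsigned long i = 0; i < ENTRIES; i++) {
        uint64_t lo = base + i * span, hi = lo + span;

        if (hi <= start || lo >= end || !pt.ptr[i])
            continue;                 /* outside the range, or nothing there */
        if (pt.lvl == 0) {
            pt.ptr[i] = 0;
            zapped++;
        } else {
            struct pt_ptr next = { (unsigned long *)pt.ptr[i], pt.lvl - 1 };
            zapped += toy_zap(next, lo, start, end);
        }
    }
    return zapped;
}
```

A caller would start at the top with something like `toy_zap((struct pt_ptr){ pgd, TOP_LEVEL }, 0, start, end)`. Supporting a sixth level in this sketch would mean changing `TOP_LEVEL` rather than writing another function.
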
He asked: would this change be worth it? Michal Hocko asked just how many years of work would be required to get this change done. Among other things, it would have to touch every architecture in the system. If it proves impossible to create some sort of a compatibility layer that would let architectures opt into the new scheme, an all-architecture flag day would be required. Given that, Hocko said that he wasn't sure it would be worth the trouble.
Laura Abbott asked what problems would be solved by the new mechanism. One is that it would deal more gracefully with pages of different sizes. Some architectures (POWER, for example) can support multiple page sizes simultaneously; this scheme would make that feature easier to use and manage. Current code has to deal with a number of special cases involving the top-level table; those would mostly go away in the new scheme. And, presumably, the resulting code would be cleaner.
It was also said in jest that this mechanism would simplify the work when
processors using six-level page tables show up. The subsequent discussion
suggested that this is no joking matter; it seems that such designs are
already under consideration. When such hardware does appear, Shutemov
said, there will be no time to radically rework page-table manipulations to
support it, so there will be no alternative to adding a sixth layer of
functions instead. In an effort to avoid that, he is going to
try to push this work forward on the x86 architecture and see how it goes.
Index entries for this article | |
---|---|
Kernel | Memory management/Internal API |
Conference | Storage, Filesystem, and Memory-Management Summit/2018 |
Posted May 5, 2018 1:20 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (4 responses)
Posted May 6, 2018 17:37 UTC (Sun) by willy (subscriber, #9762) [Link] (3 responses)
x86 uses 9+9+9+9+9+12 = 57 bits
Source: first result on https://www.google.ca/search?q=zseries+page+table+site%3A...
(Hard to get a direct link to a PDF on Android)
If x86 were willing to switch to an 8k page size, 5 level paging would get them 63 bits.
Posted May 7, 2018 12:40 UTC (Mon) by cborni (subscriber, #12949) [Link]
Posted May 8, 2018 23:53 UTC (Tue) by luto (guest, #39314) [Link] (1 responses)
Intel, if you ever revamp the page table format again, here are some feature requests:
- Separate R, W, and X bits.
- Separate page table roots for the top and bottom halves of the address space. Or, even better, separate user-mode and kernel-mode page tables.
- At least one extra address space accessible with prefixes or special instructions. For example, having the stack live in a separate address space accessible only with PUSH, POP, and prefixed memory operands would be awesome.
(Hmm. There are already mostly-useless CS, SS, DS, ES, FS, and GS prefixes. Make each one refer to a separate address space. Make ES, FS, and GS be usable only at CPL0, and make CPL0 default to the ES space and CPL > 0 default to CS.)
Posted May 10, 2018 20:54 UTC (Thu) by kiryl (subscriber, #41516) [Link]
Why is separate roots from kernel/user useful? Is it only to protect against Meltdown-alike stuff?