A couple of Kernel Summit sessions were devoted to the problem of making
larger pages from smaller ones. The first, led by Dave Mosberger, was on
the concept of transparent superpages. "Transparent" means that the
application is not involved; the idea behind transparent superpages is
that a process can be working with large pages (and the performance
benefits they can bring) without needing to do anything to bring that
about.
The main goal behind superpages is to improve use of the processor's
translation lookaside buffer (TLB). On modern systems where small pages
(i.e. the 4K pages used by Linux on most architectures) are in use, the TLB
might not cover even enough memory to fill the cache. The "hugetlb"
feature (already in 2.6) can make things better for specific applications,
but it does not really solve the problem. Hugetlb pages are not
transparent (the application must set them up explicitly) and they are a
scarce resource. If the application does not nail down its huge pages soon
after boot, system memory is likely to fragment to the point that no such
pages are available.
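The TLB-coverage problem is easy to quantify. As a rough sketch (the entry count and cache size below are typical figures for x86 hardware of the era, assumed here rather than taken from the article):

```c
/* Back-of-the-envelope TLB coverage. A TLB with a fixed number of
 * entries can map only entries * page_size bytes at once; with 4K
 * pages that reach is often smaller than the processor's cache. */
static long tlb_coverage(long entries, long page_size)
{
        return entries * page_size;
}

/* Assuming a 64-entry data TLB (a hypothetical but typical figure):
 *   4K pages:   64 * 4096    = 256KB  -- less than a 1MB L2 cache
 *   64K pages:  64 * 65536   = 4MB    -- several times the cache  */
```

With 4K pages, a 64-entry TLB reaches only a quarter of a 1MB cache; 64KB superpages would cover it many times over, which is the whole motivation for the work described here.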
The transparent superpage scheme works in a different way. When a process
requests a page of memory, the kernel allocates a larger, superpage frame,
but only maps the small page needed by the process. If the TLB starts to
fill up, the kernel can automatically "promote" the pages in that frame to
a superpage. If, instead, the system is suffering from memory pressure, the
superpage can be demoted, and the component pages swapped out. This scheme
requires some extra housekeeping information (to track promotion and
demotion states), and it requires the system to allocate most memory in
superpage-sized chunks to avoid fragmentation. For these reasons, the
maximum size of superpages is limited to 64KB or so - much smaller than
what can be achieved with hugetlb.
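The "extra housekeeping" the scheme requires can be pictured as a small amount of per-frame state. The sketch below is illustrative only (the names, the 64KB frame size, and the bitmap representation are assumptions, not the actual implementation):

```c
#include <stdbool.h>
#include <stdint.h>

#define SUBPAGES_PER_FRAME 16   /* e.g. a 64KB frame of 4KB pages */

/* Per-frame bookkeeping: which component pages are populated, and
 * whether the frame is currently mapped by one superpage TLB entry. */
struct superframe {
        uint16_t populated;     /* one bit per component page */
        bool promoted;
};

/* Fault in one small page of the frame. */
static void map_subpage(struct superframe *f, int idx)
{
        f->populated |= (uint16_t)(1u << idx);
}

/* Promotion is possible only once every component page is resident. */
static bool try_promote(struct superframe *f)
{
        if (f->populated == (uint16_t)((1u << SUBPAGES_PER_FRAME) - 1))
                f->promoted = true;
        return f->promoted;
}

/* Under memory pressure, demote the frame so that individual
 * component pages can be reclaimed and swapped out. */
static void demote(struct superframe *f, int evict_idx)
{
        f->promoted = false;
        f->populated &= (uint16_t)~(1u << evict_idx);
}
```

The point of the sketch is simply that promotion and demotion are cheap state transitions on the frame, while the hard part (as noted above) is keeping whole superpage-sized frames free in the first place.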
A fair amount of transparent superpage work has already been done.
J. Navarro at Rice University has a FreeBSD implementation. There is a
Linux implementation by Naohiko Shimizu, but it only works for anonymous
memory (memory which is not backed by a file somewhere). William Irwin and
Hubertus Franke are doing some work at IBM, and Lucy Chubb at UNSW is also
working on the problem.
William Irwin got up to talk about some of the implementation details.
These include a "speculative reservation" mechanism to allow a process to
tentatively grab a superpage; the reservation can be broken in some
situations. Page replacement becomes a hard problem once memory gets
fragmented. The page table API would need to be enhanced to be able to
work with superpage concepts. There is also the need for some sort of page
scanning algorithm - and/or a fancy tree data structure - to manage
promotion of pages.
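The "speculative reservation" idea can be sketched in a few lines; the names and interface below are hypothetical, not those of any actual patch:

```c
#include <stdbool.h>

/* A tentative claim on a whole superpage-sized frame, made when a
 * process first faults into it. */
struct reservation {
        int subpages_reserved;  /* set aside but not yet in use */
        bool broken;
};

static void reserve_speculatively(struct reservation *r, int subpages)
{
        r->subpages_reserved = subpages;
        r->broken = false;
}

/* Under memory pressure the kernel may reclaim the unused portion;
 * a broken reservation can no longer be promoted to a superpage.
 * Returns the number of subpages given back to the allocator. */
static int break_reservation(struct reservation *r, int subpages_in_use)
{
        int reclaimed = r->subpages_reserved - subpages_in_use;
        r->subpages_reserved = subpages_in_use;
        r->broken = true;
        return reclaimed;
}
```

The trade-off is the one raised in the session: reservations keep superpage-sized frames intact for later promotion, but they must be breakable or they would simply waste memory.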
At this point, Linus broke in to say that he is not entirely thrilled with
the superpage concept. He would rather see the entire system switch over
to larger pages, with "sub-pages" used when needed. This approach may seem
similar, but a larger page size has its own advantages - in particular, a
reduction in the size of the system memory map. Shrinking the memory map
is especially helpful for 32-bit systems, which are increasingly
constrained by the amount of low memory available. Larger pages would thus
be more helpful to the x86 architecture, which is the one Linus really
cares about.
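The memory-map saving Linus points to is easy to estimate. The 40-byte size for struct page below is an assumption (roughly its size on 2.6-era 32-bit x86), not a figure from the article:

```c
/* The kernel keeps one struct page for every physical page in the
 * system; on 32-bit x86 that array (mem_map) lives in low memory.
 * Quadrupling the page size cuts it to a quarter of its size. */
static long long mem_map_bytes(long long ram_bytes, long long page_size)
{
        long long nr_pages = ram_bytes / page_size;
        return nr_pages * 40;   /* assumed sizeof(struct page) */
}

/* For a hypothetical 4GB machine:
 *   4K pages:  1M struct pages, about 40MB of low memory
 *   16K pages: 256K struct pages, about 10MB                 */
```

On a 32-bit system with under 1GB of directly-mapped low memory, reclaiming tens of megabytes of mem_map is a substantial win.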
There was a discussion of how big pages could really get with such a
scheme; 16K was seen as the limit. The page size could, however, become a
configuration option, or even a decision made at boot time.
William Irwin then talked about page clustering - his patch was briefly covered here last February. Page
clustering differs from superpages in that the operating system creates larger
pages in software by logically grouping the system's smaller physical
pages. Page clustering is mainly intended to shrink the memory map and
other system data structures.
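Since the grouping is purely a software convention, the address arithmetic is simple: a logical page number is the hardware frame number shifted down by the cluster order. The macros below are illustrative, not the patch's actual names:

```c
#define CLUSTER_ORDER 2                 /* 4 hardware pages per logical page */
#define HW_PAGE_SHIFT 12                /* 4KB hardware pages */
#define SW_PAGE_SHIFT (HW_PAGE_SHIFT + CLUSTER_ORDER)

/* Which software (clustered) page a physical address falls in. */
static unsigned long logical_pfn(unsigned long phys_addr)
{
        return phys_addr >> SW_PAGE_SHIFT;
}

/* Which hardware page within that cluster. */
static unsigned long subpage_index(unsigned long phys_addr)
{
        return (phys_addr >> HW_PAGE_SHIFT) & ((1ul << CLUSTER_ORDER) - 1);
}
```

With a cluster order of two, the kernel's data structures track one quarter as many pages, while the hardware page tables continue to use 4K mappings underneath.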
William talked briefly about the changes forced by page clustering; they
mostly have to do with code which makes assumptions about what
PAGE_SIZE means. He has a patch which works and passes "light
functional tests," but it still does not perform all that well. William
apparently knows how to fix most of the problems and will be doing so in
the near future.
(This article has been updated to fix a couple of misspelled names).