
Large page support in the Linux kernel

Most modern processors have the ability to work with "large pages" - single page table entries which cover large (up to multiple megabyte) ranges of contiguous physical memory. With one exception, this feature is not used in the Linux kernel, which works with a 4K or 8K page size (depending on architecture) in all situations. Smaller pages reduce internal fragmentation, are quick to swap in and out, don't require the virtual memory system to maintain large, contiguous chunks of memory, and help to ensure that exactly the virtual memory that is in use now is resident in physical memory. Small pages are the best choice for most situations. Due to the complication of supporting multiple page sizes in the Linux VM implementation, no such support has been merged so far.

There are advantages to working with large pages, however. Mapping 4MB of memory with 4KB pages requires 1024 page table entries (PTEs) - that is a lot of memory devoted to overhead, and significant processor time is spent setting up, tearing down, and maintaining those PTEs. This overhead is multiplied when shared memory segments are in use, since Linux is currently unable to share page tables. But the real savings with large pages come from the processor's translation lookaside buffer (TLB) - a small cache which remembers the results of virtual-to-physical address translations. An address lookup that hits in the TLB is quick; one that has to actually go to the page table is slow. Large pages greatly extend the range of memory the TLB can cover, and simply make applications run faster; performance improvements of 30% have been claimed at times.
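The arithmetic behind those claims can be sketched in a few lines of Python. The page sizes come from the text above; the TLB entry count is purely an assumption for illustration, since real processors vary widely, and the function names are made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB

SMALL_PAGE = 4 * KIB   # regular page size from the text
LARGE_PAGE = 4 * MIB   # large page size used in the text's example
TLB_ENTRIES = 64       # assumption: illustrative size, not any specific CPU


def ptes_needed(region_bytes, page_bytes):
    """Number of page table entries needed to map a region (rounded up)."""
    return -(-region_bytes // page_bytes)


def tlb_reach(entries, page_bytes):
    """Bytes of address space the TLB can translate without a table walk."""
    return entries * page_bytes


if __name__ == "__main__":
    print(ptes_needed(4 * MIB, SMALL_PAGE))                  # 1024
    print(ptes_needed(4 * MIB, LARGE_PAGE))                  # 1
    print(tlb_reach(TLB_ENTRIES, SMALL_PAGE) // KIB, "KiB")  # 256 KiB
    print(tlb_reach(TLB_ENTRIES, LARGE_PAGE) // MIB, "MiB")  # 256 MiB
```

Under that assumed entry count, the same translation buffer covers a thousand times more memory when every entry maps a large page.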

The fact that Oracle uses lots of large, shared memory regions and would like to see large page support in the kernel is also helping to drive development in this area.

The most recent large page patch is this one by Rohit Seth. It allows processes to explicitly request a chunk of large page memory with a new get_large_pages system call; there is also a share_large_pages call for creating shared memory regions. The patch avoids much of the complexity of supporting large pages in the VM by, well, avoiding it. Large pages are handled completely outside of the normal memory management mechanisms. When the system boots, a percentage of memory (25%, by default) is simply set aside to satisfy large page requests. These pages are handed out when requested (as long as they last) and are not swapped.
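The set-aside approach can be modeled roughly as follows. This is a toy sketch of the idea described above, not the code from the patch: the class, its method signatures, and the 4MB large page size are all assumptions made for illustration; only the 25% default and the hand-out-until-exhausted behavior come from the description.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

LARGE_PAGE = 4 * MIB  # assumption: large page size chosen for illustration


class LargePagePool:
    """Hypothetical model of a boot-time reserve of large pages."""

    def __init__(self, total_bytes, reserve_percent=25):
        # Set aside a fixed share of memory at "boot"; it never grows
        # and is never handed back to the regular page allocator.
        reserved = total_bytes * reserve_percent // 100
        self.free_pages = reserved // LARGE_PAGE

    def get_large_pages(self, nbytes):
        """Grant enough large pages to cover nbytes, or fail outright."""
        needed = -(-nbytes // LARGE_PAGE)  # round up to whole large pages
        if needed > self.free_pages:
            # No fallback and no swapping: the pool simply runs dry.
            raise MemoryError("large page pool exhausted")
        self.free_pages -= needed
        return needed

    def release(self, pages):
        """Return previously granted pages to the pool."""
        self.free_pages += pages


if __name__ == "__main__":
    pool = LargePagePool(1 * GIB)
    print(pool.free_pages)                   # 64
    print(pool.get_large_pages(100 * MIB))   # 25
    print(pool.free_pages)                   # 39
```

The simplicity is the point: because the pool sits outside the normal memory management paths, nothing else in the VM needs to know that large pages exist.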

This patch is thus (relatively) simple. It gets the job done in certain situations - imagine a large box whose job is to run a relational database system; nailing down a quarter of memory to improve database performance is a reasonable thing to do. But this patch (intentionally) does not address the larger problem. In fact, as Linus points out, this isn't really a "large page" patch at all:

The current largepage patch is really nothing but an interface to the TLB. Please view it as that - a direct TLB interface that has zero impact on the VFS or VM layers, and that is meant _purely_ as a way to expose hw capabilities to the few applications that really really want them

So what might a real large page patch provide? Wishes that have been expressed include:

  • Support for large page file I/O. Performing I/O operations in 4K chunks is increasingly a bandwidth bottleneck; filesystems could gain some performance benefits by working with larger chunks. So the size of filesystem pages - as seen in the page cache - will someday become variable.

  • No need for separate system calls. The most common suggestion has been that the mmap system call needs a new flag to request large page allocations.

  • David Miller asks: why have system calls or even mmap flags? Instead, applications should be given large pages any time they request enough memory and the system is able to do it. Then the performance benefits would be available without the need to recode applications (in a nonportable way) to use large pages.
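The last suggestion amounts to a policy decision made inside the kernel at allocation time. A rough sketch of such a policy, under stated assumptions, might look like the following; the function name, the 4KB/4MB page sizes, and the rule of using large pages only for fully aligned stretches are all hypothetical choices for this example, not anything proposed on the list.

```python
KIB = 1024
MIB = 1024 * KIB

SMALL_PAGE = 4 * KIB  # assumption: regular page size
LARGE_PAGE = 4 * MIB  # assumption: large page size


def plan_mapping(start, length, free_large_pages):
    """Decide how many large and small pages would back a request.

    Large pages are used only for the naturally aligned middle of the
    range and only while free large pages remain; everything else falls
    back to small pages. Returns (large_pages, small_pages).
    """
    end = start + length
    first_aligned = -(-start // LARGE_PAGE) * LARGE_PAGE  # round start up
    last_aligned = (end // LARGE_PAGE) * LARGE_PAGE       # round end down
    if last_aligned > first_aligned:
        candidates = (last_aligned - first_aligned) // LARGE_PAGE
    else:
        candidates = 0
    large = min(candidates, free_large_pages)
    small = -(-(length - large * LARGE_PAGE) // SMALL_PAGE)
    return large, small


if __name__ == "__main__":
    print(plan_mapping(0, 10 * MIB, free_large_pages=8))       # (2, 512)
    print(plan_mapping(0, 10 * MIB, free_large_pages=0))       # (0, 2560)
    print(plan_mapping(1 * MIB, 4 * MIB, free_large_pages=8))  # (0, 1024)
```

The application sees the same memory in every case; only the number of page table entries behind it changes, which is why no recoding would be needed.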

The automatic use of large pages would be helped by another suggestion from David: if it becomes necessary to swap out a large page, simply split it back into a long list of regular pages and proceed as usual. Then most of the swap complexity would go away.
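Splitting is conceptually simple, since a large page covers a contiguous physical range. The sketch below models a mapping as a (virtual, physical) address pair and expands one large mapping into the equivalent small ones; the function name, the representation, and the page sizes are assumptions for illustration only.

```python
KIB = 1024
MIB = 1024 * KIB

SMALL_PAGE = 4 * KIB  # assumption: regular page size
LARGE_PAGE = 4 * MIB  # assumption: large page size


def split_large_page(virt_base, phys_base):
    """Expand one large-page mapping into equivalent small-page mappings.

    Each result is a (virtual_address, physical_address) pair; together
    they cover exactly the range the single large mapping covered.
    """
    count = LARGE_PAGE // SMALL_PAGE
    return [
        (virt_base + i * SMALL_PAGE, phys_base + i * SMALL_PAGE)
        for i in range(count)
    ]


if __name__ == "__main__":
    small = split_large_page(0x40000000, 0x10000000)
    print(len(small))                       # 1024
    print([hex(a) for a in small[1]])       # ['0x40001000', '0x10001000']
```

Once the mapping has been broken up this way, each small page can be handed to the existing swap code individually.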

Of course, the October feature freeze deadline is getting closer. So all of these ideas are almost certainly destined to wait until after the next stable series. But one of the variants of the simpler "TLB interface" patches may yet get in this time around and make the database vendors (and others) happy.

(What, you may ask, is the "one exception" where the kernel uses large pages now? The mapping of the kernel image itself - a single, large chunk of non-swappable memory - is handled with a large page PTE.)



Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds