Large page support in the Linux kernel
[Posted August 7, 2002 by corbet]
Most modern processors have the ability to work with "large pages" - single
page table entries which cover large (up to multiple megabyte) ranges of
contiguous physical memory. With one exception, this feature is not used
in the Linux kernel, which works with a 4K or 8K page size (depending on
architecture) in all situations. Smaller pages reduce internal
fragmentation, are quick to swap in and out, don't require the virtual
memory system to maintain large, contiguous chunks of memory, and help to
ensure that exactly the virtual memory that is in use now is resident in
physical memory. Small pages are the best choice for most situations.
Because supporting multiple page sizes would complicate the
Linux VM implementation, large page support has not been merged so far.
There are advantages to working with large pages, however. 4MB of memory
in 4KB pages requires 1024 page table entries (PTEs) - that is a lot of
memory devoted to overhead, and significant processor time to set up, tear
down, and maintain those PTEs. This overhead is multiplied when shared
memory segments are in use, since Linux is currently unable to share page
tables. But the real savings with large pages come from the
processor's translation buffer (the TLB) - a small cache which remembers the result
of virtual-to-physical address translations. An address lookup through the
translation buffer is quick; one that has to actually go to the page table
is slow. Large pages greatly extend the range of the translation buffer,
and simply make applications run faster; performance improvements of 30%
have been claimed at times.
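
The arithmetic behind these claims can be sketched in a few lines of Python. The function names are made up for illustration, and the 64-entry translation buffer is an assumed figure, not one taken from any particular processor.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumption for illustration only: a translation buffer with 64 entries.
ASSUMED_TLB_ENTRIES = 64


def pte_count(region_bytes, page_bytes):
    """Number of page table entries needed to map a region, rounded up."""
    return -(-region_bytes // page_bytes)


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_bytes


# 4MB of memory mapped with 4KB pages, then with a single 4MB page.
print(pte_count(4 * MIB, 4 * KIB))  # 1024
print(pte_count(4 * MIB, 4 * MIB))  # 1

# Reach of the assumed translation buffer with each page size.
print(tlb_reach(ASSUMED_TLB_ENTRIES, 4 * KIB) // KIB, "KB")  # 256 KB
print(tlb_reach(ASSUMED_TLB_ENTRIES, 4 * MIB) // MIB, "MB")  # 256 MB
```

With the same number of entries, the larger page size multiplies the memory that can be addressed without a page table walk by a factor of 1024.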
The fact that Oracle uses lots of large, shared memory regions and would
like to see large page support in the kernel is also helping to drive
development in this area.
The most recent large page patch was posted
by Rohit Seth. It allows processes to explicitly request a chunk of large
page memory with a new get_large_pages system call; there is also
a share_large_pages call for creating shared memory regions. The
patch avoids much of the complexity of supporting large pages in the VM by,
well, avoiding it. Large pages are handled completely outside of the
normal memory management mechanisms. When the system boots, a percentage
of memory (25%, by default) is simply set aside to satisfy large page
requests. These pages are handed out when requested (as long as they last)
and are not swapped.
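
The reservation scheme described above can be modeled roughly as follows. This is a toy sketch, not the patch's code: the class, its method names, and the 4MB large page size are all assumptions made for illustration, and the real system call interface is not reproduced here.

```python
MIB = 1024 * 1024


class LargePagePool:
    """Toy model of a pool of large pages set aside at boot time."""

    def __init__(self, total_bytes, large_page_bytes, percent=25):
        # Reserve a fixed percentage of memory, in whole large pages.
        reserved = total_bytes * percent // 100
        self.large_page_bytes = large_page_bytes
        self.free_pages = reserved // large_page_bytes

    def request(self, count):
        """Hand out `count` large pages; fail once the pool is exhausted."""
        if count <= 0:
            raise ValueError("count must be positive")
        if count > self.free_pages:
            return False  # no fallback: reserved pages are all there is
        self.free_pages -= count
        return True

    def release(self, count):
        """Return pages to the pool; they are never swapped in between."""
        if count <= 0:
            raise ValueError("count must be positive")
        self.free_pages += count


# A hypothetical machine with 1GB of memory and 4MB large pages.
pool = LargePagePool(1024 * MIB, 4 * MIB)
print(pool.free_pages)   # 64
print(pool.request(60))  # True
print(pool.request(5))   # False: only 4 pages remain
```

The point of the model is how little it does: there is no reclaim, no swapping, and no interaction with ordinary page allocation, which is where the patch's simplicity comes from.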
This patch is thus (relatively) simple. It gets the job done in certain
situations - imagine a large box whose job is to run a relational database
system; nailing down a quarter of memory to improve database performance is
a reasonable thing to do. But this patch (intentionally) does not address
the larger problem. In fact, as Linus points
out, this isn't really a "large page" patch at all:
    The current largepage patch is really nothing but an interface to
    the TLB. Please view it as that - a direct TLB interface that has
    zero impact on the VFS or VM layers, and that is meant _purely_ as
    a way to expose hw capabilities to the few applications that really
    really want them
So what might a real large page patch provide? Wishes that have
been expressed include:
- Support for large page file I/O. Performing I/O operations in 4K
chunks is increasingly a bandwidth bottleneck; filesystems could gain
some performance benefits by working with larger chunks. So the size
of filesystem pages - as seen in the page cache - will someday become
variable.
- No need for separate system calls. The most common suggestion has
been that the mmap system call needs a new flag to request
large page allocations.
- David Miller asks: why have system calls
or even mmap flags? Instead, applications should be given
large pages any time they request enough memory and the system is able
to do it. Then the performance benefits would be available without
the need to recode applications (in a nonportable way) to use large
pages.
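
The last suggestion amounts to a policy decision made inside the kernel at allocation time. A minimal sketch of such a policy, assuming 4KB regular pages and 4MB large pages and ignoring alignment entirely, might look like this (the function and its parameters are hypothetical):

```python
SMALL_PAGE = 4 * 1024
LARGE_PAGE = 4 * 1024 * 1024


def plan_mapping(length, free_large_pages,
                 small=SMALL_PAGE, large=LARGE_PAGE):
    """Return (large_pages, small_pages) used to back `length` bytes.

    Large pages are used whenever the request is big enough and some
    are available; the remainder falls back to regular pages.
    """
    wanted = length // large
    n_large = min(wanted, free_large_pages)
    remaining = length - n_large * large
    n_small = -(-remaining // small)  # round up to whole small pages
    return n_large, n_small


print(plan_mapping(10 * 1024 * 1024, 8))  # (2, 512)
print(plan_mapping(10 * 1024 * 1024, 0))  # (0, 2560)
print(plan_mapping(64 * 1024, 8))         # (0, 16)
```

The application sees the same request either way; only the backing changes, which is why no recoding would be needed.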
The automatic use of large pages would be helped by another suggestion from
David: if it becomes necessary to swap out a large page, simply split it
back into a long list of regular pages and proceed as usual. Then most of
the swap complexity would go away.
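
The splitting step can be illustrated with a small sketch. It only computes the addresses of the regular pages that would replace one large page; the page sizes are assumptions, and real code would also have to rewrite page table entries and flush the translation buffer.

```python
SMALL_PAGE = 4 * 1024
LARGE_PAGE = 4 * 1024 * 1024


def split_large_page(base, large=LARGE_PAGE, small=SMALL_PAGE):
    """Return base addresses of the small pages covering one large page."""
    if base % large:
        raise ValueError("large page must be naturally aligned")
    return [base + offset for offset in range(0, large, small)]


pages = split_large_page(0x400000)
print(len(pages))      # 1024
print(hex(pages[0]))   # 0x400000
print(hex(pages[-1]))  # 0x7ff000
```

Once split, each of the resulting pages can be handed to the existing swap code individually, with no large page awareness required there.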
Of course, the October feature freeze deadline is getting closer. So all of these ideas
are almost certainly destined to wait until after the next stable series.
But one of the variants of the simpler "TLB interface" patches may yet get
in this time around and make the database vendors (and others) happy.
(What, you may ask, is the "one exception" where the kernel uses large
pages now? The mapping of the kernel image itself - a single, large chunk
of non-swappable memory - is handled with a large page PTE.)