Introduction: The Kernel Hacker's Bookshelf
A lot of great operating systems research goes on, but relatively
little of it makes the leap into production operating systems, or from
one operating system to another. The ideas that do trickle down into
implementation are often delayed by years. Usually an idea gets
ignored because it looked good in the research lab but turned out not
to be practical in a production environment. But every so often, a
practical idea goes unnoticed for years simply because none of the
actual coders has the time to sit down and parse fifteen pages of dry
academic prose. You're too busy writing code, can't someone make it
easier to figure out which books and papers are worth reading?
Welcome to The Kernel Hacker's Bookshelf. The goal of this series is
to bring good research and good kernel hackers together through
reviews focusing on the practical aspects of research, written in
plain (possibly even entertaining) language. We hope you enjoy
reading these articles - and writing code inspired by them!
Transparent operating systems support for large pages
While Moore's Law tramped inexorably on during the last few decades,
increasing memory size and disk space along with transistor density,
it left some elements of computer architecture in the dust. One of
these stragglers is TLB coverage. The TLB (or Translation Look-aside
Buffer) caches translations between virtual and physical memory
addresses; usually every memory access requires a translation.
Performance is best when all the needed translations can fit in the
TLB and translations "hit" in the TLB instead of missing. The amount
of memory translated by the entries in the TLB is called the TLB
coverage. TLB coverage has been dropping as a fraction of total
memory (and, more importantly, as a fraction of the total size of
Netscape - er, Mozilla - er, Firefox), and TLB misses are often a
serious drag on system performance.
Since translations are done on a per-page basis, one solution is to
increase the size of the system pages. We could increase the base
page size - the smallest page available - but that would typically
waste a lot of memory, cause more page-outs, trigger unexpected
application bugs (ask me about the one with the JVM default stack size
and 64KB pages some time), and make the system slower overall.
Instead, many processors now offer multiple page sizes, beginning with
a base page size of 4KB or 8KB and ranging up to a large page size of
2MB or occasionally a truly monstrous page size of 256MB or larger.
Large pages increase TLB coverage and can reduce TLB misses
significantly, often improving the performance of applications with
large working sets by 10-15%. On the other hand, large pages can
reduce performance by increasing the cost of paging memory in and out
and adding the overhead of tracking several different page sizes.
Implementing automatic, transparent OS-level support for large pages
while simultaneously improving overall performance is not easy. It's
also what Linux users are clamoring for - and some of them are
switching to operating systems that already have automatic large page
support (cough, cough, Solaris).
A Solution: The Rice Paper
Transparent Operating System Support for Superpages
Navarro, et al., describes a sophisticated and elegant implementation
of transparent large pages. The authors implemented their system on
FreeBSD on the Alpha processor, using 4 page sizes: 8KB, 64KB, 512KB,
and 4MB. The paper was published in 2002, otherwise they might have
picked a less ill-omened architecture than Alpha; fortunately the
design is reasonably generic. Overall, this paper is one of the best
I've ever read.
The basic design is reservation-based; that is, enough pages to make a
large page are reserved in advance and later promoted to a large page
when justified. Memory fragmentation is reined in via careful
grouping of page types and a smarter page replacement policy. Almost
all applications tested saw at least some speed-up, and absolute worst
case performance degradation varied from 2-9%. Most amazing of all,
the implementation only required about 3500 lines of code - about half
of an ext2. How exactly did they accomplish all this? Buckle up for
some nitty-gritty details.
First, a run of contiguous pages suitable for a large page is reserved
whenever an application page fault occurs (outside an existing
reservation, of course). The size of the reservation is picked based
on the size and type of the memory object, with slight variations
depending on whether the object is fixed in size (e.g., text) or might
grow (e.g., stack). For example, an application with 700KB of text
would have a 512KB page reserved the first time a page in the text was
faulted into memory. Once a large page of any size has been fully
populated (all of its pages referenced at least once), it is promoted
into a large page. In our example, once a contiguous 64KB region
anywhere in the program text has been faulted in, it will be promoted
to a 64KB page. Promotion of a partially populated page is possible,
but the trade off is that it may increase the application's total
memory usage, unintentionally creating a memory hog.
In the rough and tumble world of scarce memory, promotion is not a
one-way street. Demotion of large pages into smaller pages is also
useful. An application may start out using all pages in a large page
but then stop referencing most of the pages. The only way to tell is
to demote the large page and check the referenced bits on the smaller
pages a little while later. A page is demoted when it is first
written, when one of its base pages is evicted, and periodically when
the system is under memory pressure.
When an application wants more memory and no free space is available,
unused parts of a reservation are preempted. "Use it or lose it" is
the name of the game here. The reservation which loses is the one
whose most recent allocation occurred least recently - LRU order,
basically - since most applications touch most of their working set
soon after starting up, and so it's unlikely the original owner of the
reservation will need the space. Unused reservations live on
different lists depending on the size of the allocation that can be
made by preempting the reservation. A population map, implemented as
a radix tree, keeps track of which pages are allocated inside each
large page-sized extent for easy look up. This radix tree is a key
data structure; it makes allocation, reservation, and promotion
decisions fast and simple.
The final key elements are the page replacement policy and the way
pages of various types are grouped together. There are several
different kinds of pages in the system. Some pages can't be moved or
freed (pinned), some pages are in use but can be moved (active), and
some pages are not currently used by anyone but may be used in the
future (cached and/or inactive). If these pages are mixed together
indiscriminately, pinned and active pages end up scattered everywhere,
without any contiguous runs of free (or free-able) pages that can be
converted into hotly pursued large pages. Fragmentation needs to be
both prevented and repaired - without hurting performance by moving
around pages too much.
Pinned pages are the most difficult problem, since once allocated they
cannot be moved and may never be freed. The system tries to allocate
these pages in clusters, so they break up as few potential large pages
as possible. Similarly, cache pages are allocated in clusters with
free pages, since cached pages can be easily freed to allow the
creation of a large page. Reservations can include cache pages, and
cached pages contained inside a reservation continue to be active
until the application actually needs to kick that page out.
The page replacement daemon was changed to run not only when free
memory runs low, but also when contiguity runs low. An "innocent
until proven guilty" algorithm works here - we assume we don't need
more contiguity until a large page reservation fails for lack of
contiguity. When woken for this reason, the daemon runs just long
enough to recover enough contiguous space to satisfy the allocations
that failed. The page aging algorithm was changed slightly from the
FreeBSD default; cached pages for a file are marked inactive on the
last close, trading off the chance of the file being reopened against
the opportunity for more contiguity.
Evaluating the System
The authors tested their system against a truly startling variety of
applications, everything from gzip to web server trace replays to fast
Fourier transforms, as well as a section exploring worst case
situations. Personally, I'm not sure I've ever seen a better
evaluation in a research paper; it's quite a treat to read.
In the best case, with low fragmentation, 33 out of 35 applications
showed some improvement (one was unchanged, and the other was about 2%
slower). Several had significant improvements. For example, rotating
an image using ImageMagick was about 20% faster; linking the FreeBSD
kernel was about 30% faster; bzip2 was 14% faster. In the fragmented
case, performance was not as good, but usually to picked up again
after a few runs as the page replacement daemon moved things around.
In the worst-case department, the performance was degraded by about 9%
for an application that only touched one byte per large page before
freeing it, and by about 2% for a test case in which large page
promotion was turned off. It makes for a pretty convincing case that
large pages are an overall win for many systems.
Implications for Linux
What does this paper tell us? It is possible to implement transparent
large page support in such a way that most applications get at least
some benefit, and some applications get a lot of benefit. The
algorithms used are relatively simply to understand and implement, and
hold up well in worst case behavior. Finally, transparent large pages
can be implemented elegantly and cleanly - only 3500 lines of code!
Best of all, this paper includes a plethora of implementation details
and smart algorithms, just begging to be reused. All of the above
earns this paper a hallowed place on the Kernel Hacker's Bookshelf.
Over the past few years, several Linux developers have been working on
various forms of transparent large page support. Some of that recent
work, spearheaded by Mel Gorman, has been reviewed earlier in LWN:
Current work on large pages in Linux is summarized on the
I look forward to more work in this fascinating and fertile area of
operating systems implementation.
[Do you have a favorite textbook or systems paper? Of course you do.
Send your suggestions to:
val dot henson at gmail dot com
Valerie Henson is a Linux kernel
developer working for Intel. Her interests include file systems,
networking, women in computing, and walking up and down large
mountains. She is always looking for good systems programmers, so
send her some email and introduce yourself.]
to post comments)