KHB: Transparent support for large pages
Introduction: The Kernel Hacker's Bookshelf
A lot of great operating systems research goes on, but relatively little of it makes the leap into production operating systems, or from one operating system to another. The ideas that do trickle down into implementation are often delayed by years. Usually an idea gets ignored because it looked good in the research lab but turned out not to be practical in a production environment. But every so often, a practical idea goes unnoticed for years simply because none of the actual coders has the time to sit down and parse fifteen pages of dry academic prose. You're too busy writing code, can't someone make it easier to figure out which books and papers are worth reading?Welcome to The Kernel Hacker's Bookshelf. The goal of this series is to bring good research and good kernel hackers together through reviews focusing on the practical aspects of research, written in plain (possibly even entertaining) language. We hope you enjoy reading these articles - and writing code inspired by them!
Transparent operating systems support for large pages
While Moore's Law tramped inexorably on during the last few decades, increasing memory size and disk space along with transistor density, it left some elements of computer architecture in the dust. One of these stragglers is TLB coverage. The TLB (or Translation Look-aside Buffer) caches translations between virtual and physical memory addresses; usually every memory access requires a translation. Performance is best when all the needed translations can fit in the TLB and translations "hit" in the TLB instead of missing. The amount of memory translated by the entries in the TLB is called the TLB coverage. TLB coverage has been dropping as a fraction of total memory (and, more importantly, as a fraction of the total size of Netscape - er, Mozilla - er, Firefox), and TLB misses are often a serious drag on system performance.
Since translations are done on a per-page basis, one solution is to increase the size of the system pages. We could increase the base page size - the smallest page available - but that would typically waste a lot of memory, cause more page-outs, trigger unexpected application bugs (ask me about the one with the JVM default stack size and 64KB pages some time), and make the system slower overall. Instead, many processors now offer multiple page sizes, beginning with a base page size of 4KB or 8KB and ranging up to a large page size of 2MB or occasionally a truly monstrous page size of 256MB or larger. Large pages increase TLB coverage and can reduce TLB misses significantly, often improving the performance of applications with large working sets by 10-15%. On the other hand, large pages can reduce performance by increasing the cost of paging memory in and out and adding the overhead of tracking several different page sizes. Implementing automatic, transparent OS-level support for large pages while simultaneously improving overall performance is not easy. It's also what Linux users are clamoring for - and some of them are switching to operating systems that already have automatic large page support (cough, cough, Solaris).
A Solution: The Rice Paper
Practical, Transparent Operating System Support for Superpages by Juan Navarro, et al., describes a sophisticated and elegant implementation of transparent large pages. The authors implemented their system on FreeBSD on the Alpha processor, using 4 page sizes: 8KB, 64KB, 512KB, and 4MB. The paper was published in 2002, otherwise they might have picked a less ill-omened architecture than Alpha; fortunately the design is reasonably generic. Overall, this paper is one of the best I've ever read.The basic design is reservation-based; that is, enough pages to make a large page are reserved in advance and later promoted to a large page when justified. Memory fragmentation is reined in via careful grouping of page types and a smarter page replacement policy. Almost all applications tested saw at least some speed-up, and absolute worst case performance degradation varied from 2-9%. Most amazing of all, the implementation only required about 3500 lines of code - about half of an ext2. How exactly did they accomplish all this? Buckle up for some nitty-gritty details.
First, a run of contiguous pages suitable for a large page is reserved whenever an application page fault occurs (outside an existing reservation, of course). The size of the reservation is picked based on the size and type of the memory object, with slight variations depending on whether the object is fixed in size (e.g., text) or might grow (e.g., stack). For example, an application with 700KB of text would have a 512KB page reserved the first time a page in the text was faulted into memory. Once a large page of any size has been fully populated (all of its pages referenced at least once), it is promoted into a large page. In our example, once a contiguous 64KB region anywhere in the program text has been faulted in, it will be promoted to a 64KB page. Promotion of a partially populated page is possible, but the trade off is that it may increase the application's total memory usage, unintentionally creating a memory hog.
In the rough and tumble world of scarce memory, promotion is not a one-way street. Demotion of large pages into smaller pages is also useful. An application may start out using all pages in a large page but then stop referencing most of the pages. The only way to tell is to demote the large page and check the referenced bits on the smaller pages a little while later. A page is demoted when it is first written, when one of its base pages is evicted, and periodically when the system is under memory pressure.
When an application wants more memory and no free space is available, unused parts of a reservation are preempted. "Use it or lose it" is the name of the game here. The reservation which loses is the one whose most recent allocation occurred least recently - LRU order, basically - since most applications touch most of their working set soon after starting up, and so it's unlikely the original owner of the reservation will need the space. Unused reservations live on different lists depending on the size of the allocation that can be made by preempting the reservation. A population map, implemented as a radix tree, keeps track of which pages are allocated inside each large page-sized extent for easy look up. This radix tree is a key data structure; it makes allocation, reservation, and promotion decisions fast and simple.
The final key elements are the page replacement policy and the way pages of various types are grouped together. There are several different kinds of pages in the system. Some pages can't be moved or freed (pinned), some pages are in use but can be moved (active), and some pages are not currently used by anyone but may be used in the future (cached and/or inactive). If these pages are mixed together indiscriminately, pinned and active pages end up scattered everywhere, without any contiguous runs of free (or free-able) pages that can be converted into hotly pursued large pages. Fragmentation needs to be both prevented and repaired - without hurting performance by moving around pages too much.
Pinned pages are the most difficult problem, since once allocated they cannot be moved and may never be freed. The system tries to allocate these pages in clusters, so they break up as few potential large pages as possible. Similarly, cache pages are allocated in clusters with free pages, since cached pages can be easily freed to allow the creation of a large page. Reservations can include cache pages, and cached pages contained inside a reservation continue to be active until the application actually needs to kick that page out.
The page replacement daemon was changed to run not only when free memory runs low, but also when contiguity runs low. An "innocent until proven guilty" algorithm works here - we assume we don't need more contiguity until a large page reservation fails for lack of contiguity. When woken for this reason, the daemon runs just long enough to recover enough contiguous space to satisfy the allocations that failed. The page aging algorithm was changed slightly from the FreeBSD default; cached pages for a file are marked inactive on the last close, trading off the chance of the file being reopened against the opportunity for more contiguity.
Evaluating the System
The authors tested their system against a truly startling variety of applications, everything from gzip to web server trace replays to fast Fourier transforms, as well as a section exploring worst case situations. Personally, I'm not sure I've ever seen a better evaluation in a research paper; it's quite a treat to read.In the best case, with low fragmentation, 33 out of 35 applications showed some improvement (one was unchanged, and the other was about 2% slower). Several had significant improvements. For example, rotating an image using ImageMagick was about 20% faster; linking the FreeBSD kernel was about 30% faster; bzip2 was 14% faster. In the fragmented case, performance was not as good, but usually to picked up again after a few runs as the page replacement daemon moved things around. In the worst-case department, the performance was degraded by about 9% for an application that only touched one byte per large page before freeing it, and by about 2% for a test case in which large page promotion was turned off. It makes for a pretty convincing case that large pages are an overall win for many systems.
Implications for Linux
What does this paper tell us? It is possible to implement transparent large page support in such a way that most applications get at least some benefit, and some applications get a lot of benefit. The algorithms used are relatively simply to understand and implement, and hold up well in worst case behavior. Finally, transparent large pages can be implemented elegantly and cleanly - only 3500 lines of code! Best of all, this paper includes a plethora of implementation details and smart algorithms, just begging to be reused. All of the above earns this paper a hallowed place on the Kernel Hacker's Bookshelf.Over the past few years, several Linux developers have been working on various forms of transparent large page support. Some of that recent work, spearheaded by Mel Gorman, has been reviewed earlier in LWN:
Current work on large pages in Linux is summarized on the linux-mm wiki.
I look forward to more work in this fascinating and fertile area of operating systems implementation.
[Do you have a favorite textbook or systems paper? Of course you do. Send your suggestions to:
val dot henson at gmail dot com
Valerie Henson is a Linux kernel developer working for Intel. Her interests include file systems, networking, women in computing, and walking up and down large mountains. She is always looking for good systems programmers, so send her some email and introduce yourself.]
| Index entries for this article | |
|---|---|
| GuestArticles | Aurora (Henson), Valerie |
