
Toward improved page replacement

When memory gets tight (a situation which usually comes about shortly after starting an application like tomboy), the kernel must find a way to free up some pages. To an extent, the kernel can free memory by cleaning up its own internal data structures - reducing the size of the inode and dentry caches, for example. But, on most systems, the bulk of memory will be occupied by user pages - that is what the system is there for in the first place, after all. So the kernel, in order to accommodate current demands for user pages, must find some existing pages to toss out.

To help in the choice of pages to remove, the kernel maintains two big linked lists for each memory zone. The "active" list contains pages which have been recently accessed, while the "inactive" list has those which have not been used in the recent past. When the kernel looks for pages to evict, it will scan through the inactive list, on the theory that the pages least likely to be needed soon are to be found there.
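
As a rough illustration, the per-zone bookkeeping in 2.6-era kernels looks something like the simplified sketch below; the exact fields, names, and locking in any given kernel version may differ.

    #include <linux/list.h>
    #include <linux/spinlock.h>

    /*
     * Simplified sketch of the per-zone LRU bookkeeping described above,
     * roughly as found in 2.6-era kernels; not the exact source.
     */
    struct zone {
        /* ... watermarks, free lists, and many other fields ... */
        spinlock_t       lru_lock;       /* protects both lists */
        struct list_head active_list;    /* recently referenced pages */
        struct list_head inactive_list;  /* reclaim candidates */
        unsigned long    nr_active;      /* pages on active_list */
        unsigned long    nr_inactive;    /* pages on inactive_list */
    };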

There is an additional complication, though: there are two fundamental types of pages to be found on these lists. "Anonymous" pages are those which are not associated with any files on disk; they are process memory pages. "Page cache" pages, instead, are an in-memory representation of (portions of) files on the disks. A proper balance between anonymous and page cache pages must be maintained, or the system will not perform well. If either type of page is allowed to predominate at the expense of the other, thrashing will result.
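
For a concrete, user-space illustration of the two kinds of pages, the hypothetical snippet below creates some of each: the MAP_ANONYMOUS mapping is backed by nothing but swap, while reading a file pulls its contents into the page cache. The file name and sizes are arbitrary.

    /* Illustrative only: allocate anonymous pages, then populate page
     * cache pages by reading a file. Error handling kept minimal. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Anonymous pages: process memory with no backing file. */
        size_t len = 1 << 20;
        char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (anon == MAP_FAILED)
            return 1;
        memset(anon, 0x42, len);    /* touching the pages instantiates them */

        /* Page cache pages: an in-memory copy of part of a file on disk. */
        int fd = open("/etc/services", O_RDONLY);
        if (fd >= 0) {
            char buf[4096];
            read(fd, buf, sizeof(buf));    /* data lands in the page cache */
            close(fd);
        }

        munmap(anon, len);
        return 0;
    }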

The kernel offers a knob called swappiness which controls how this balance is struck. If the system administrator sets a higher value of swappiness, the kernel will allow the page cache to occupy a larger portion of memory. Setting swappiness to a very low value is a way to tell the kernel to keep anonymous pages around at the expense of the page cache. In general, the system can be expected to perform better if page cache pages are reclaimed first; they can often be reclaimed without needing to be written back to disk, and their layout on the disk can make recovery faster should they be needed again. For this reason, the default value for swappiness favors the eviction of page cache pages; anonymous pages will only be targeted when memory pressure becomes relatively severe.
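
The knob itself lives in /proc/sys/vm/swappiness (also reachable as the vm.swappiness sysctl). A minimal sketch of reading and lowering it from C might look like this; the value 10 is just an example, and writing requires root.

    /* Minimal sketch: report the current vm.swappiness, then lower it. */
    #include <stdio.h>

    int main(void)
    {
        int swappiness;
        FILE *f = fopen("/proc/sys/vm/swappiness", "r");

        if (!f || fscanf(f, "%d", &swappiness) != 1) {
            perror("/proc/sys/vm/swappiness");
            return 1;
        }
        fclose(f);
        printf("current vm.swappiness = %d\n", swappiness);

        /* A low value tells the kernel to reclaim page cache in
         * preference to swapping anonymous pages out. */
        f = fopen("/proc/sys/vm/swappiness", "w");
        if (!f) {
            perror("write (root required)");
            return 1;
        }
        fprintf(f, "10\n");
        fclose(f);
        return 0;
    }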

Swappiness clearly affects how the process of scanning pages for eviction candidates is done. If swappiness is low, anonymous pages will simply be passed over. As it turns out, this behavior can lead to performance problems; there may be a lot of anonymous pages which must be scanned over before the kernel finds any page cache pages, which are the ones it was looking for in the first place. It would be nice to avoid all of that extra work, especially since it comes at a time when the system is already under stress.

Rik van Riel has posted a patch which tries to improve this situation. The approach taken is quite simple: the active and inactive lists are each split in two, yielding one pair (active and inactive) for anonymous pages and another pair for page cache pages. With separate lists for the page cache, the kernel can go after those pages without having to iterate over a bunch of uninteresting anonymous pages on the way. The result should be better scalability on larger systems.
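
In rough terms, the split turns the two per-zone lists into four, something like the sketch below; the names are illustrative and the actual patch may organize things differently.

    /*
     * Hedged sketch of the split: one active/inactive pair for anonymous
     * pages and one for page cache ("file") pages.
     */
    struct zone {
        /* ... other fields ... */
        spinlock_t       lru_lock;
        struct list_head active_anon;    /* recently used anonymous pages */
        struct list_head inactive_anon;  /* anonymous reclaim candidates */
        struct list_head active_file;    /* recently used page cache pages */
        struct list_head inactive_file;  /* page cache reclaim candidates */
    };

    /* Reclaim can now walk only the file lists when swappiness says to
     * leave anonymous memory alone, instead of skipping over anonymous
     * pages one by one. */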

The idea is simple, but the patch is reasonably large. Any code which puts pages onto one of the lists must be changed to specify which list is to be used; that requires a number of small changes throughout the memory management and filesystem code. Beyond that, the current patch does not really change how the page reclamation code works, though Rik does note:

For now the swappiness parameter can be used to tweak swap aggressiveness up and down as desired, but in the long run we may want to simply measure IO cost of page cache and anonymous memory and auto-adjust.

There tends to be a lot of sympathy for changes which remove tuning knobs in favor of automatic adaptation within the kernel itself. So if this approach could be made to work, it might well be adopted. Getting system tuning right is hard; it's often better if the computer can figure it out by itself.

Meanwhile, the list-splitting patch, so far, lacks widespread testing or benchmarking. So, at this point, it is difficult to say when (or in what form) this patch will find its way into the mainline.



Toward improved page replacement

Posted Mar 22, 2007 11:45 UTC (Thu) by nix (subscriber, #2304) [Link]

It should be beneficial even on smaller systems: it can smash the L2 dcache less than existing code. (The L1 dcache is basically a loss whenever you enter the kernel, but the L2 is worth trying to preserve, especially when only one task is runnable...)

Toward improved page replacement

Posted Mar 29, 2007 7:19 UTC (Thu) by slamb (guest, #1070) [Link]

How do you measure that? I frequently see people talking about nuances of caching, and while I can
make wild guesses with the best of them, I just don't know how you gather more information than
"this goes x% faster". Are there some statistics somewhere I've overlooked?

Toward improved page replacement

Posted Mar 31, 2007 14:06 UTC (Sat) by nix (subscriber, #2304) [Link]

On x86, `cachegrind' works well (of course it's monitoring a virtual
machine but it's still a good estimate).

I'll admit that on other platforms I often prove that cache smashing is at
fault for the sloth of some algorithm by rewriting the algorithm; if the
new one has the same formal time complexity yet is much faster merely
because the access patterns are different, it was a cache problem :)

A really crude approach is to turn off the L2 cache entirely (which many
motherboard/CPU combinations allow you to do) and see if your suspect
thingy gets vastly slower. :)
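
As a toy illustration of that rewrite-and-compare approach (a hypothetical example, not part of the original comment): the two loops below do identical arithmetic, but the column-order walk touches memory with a large stride and tends to run far slower once the array outgrows the cache. Running each version under valgrind --tool=cachegrind makes the miss counts explicit.

    /* Same asymptotic work, different access pattern: if the second walk
     * is much slower, the difference is the cache, not the algorithm. */
    #include <stdio.h>
    #include <time.h>

    #define N 2048

    static double grid[N][N];   /* 32 MB, larger than a typical L2 cache */

    static double walk(int row_major)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += row_major ? grid[i][j] : grid[j][i];
        return sum;
    }

    int main(void)
    {
        clock_t t0 = clock();
        double s1 = walk(1);
        clock_t t1 = clock();
        double s2 = walk(0);
        clock_t t2 = clock();

        printf("row-major:    %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-major: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return s1 != s2;    /* keep the sums live so the loops survive optimization */
    }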

Toward improved page replacement

Posted Apr 6, 2007 0:50 UTC (Fri) by slamb (guest, #1070) [Link]

I've used cachegrind a few times (nice tool, easy to use), but it is only good for userspace so it wouldn't help here.

A coworker mentioned to me a while ago that our hardware (a rather obscure processor) provides counters of these events, which is why I asked about that specifically. In fact, he said such features were common.

I googled a bit and found this document describing event counters on AMD Opteron processors. I'm not sure if there's existing code in the Linux kernel to take advantage of this instrumentation or not.

Toward improved page replacement

Posted Apr 21, 2007 23:27 UTC (Sat) by nix (subscriber, #2304) [Link]

I know the PowerPC provides counters for this sort of thing as well. I
imagine most CPUs do these days: cache misses are so
performance-significant...

Toward improved page replacement

Posted Apr 24, 2007 7:42 UTC (Tue) by njs (guest, #40338) [Link]

>I'm not sure if there's existing code in the Linux kernel to take advantage of this instrumentation or not.

Indeed, there is, and it is Good:

http://oprofile.sourceforge.net

Swap: an idea whose time has come, and gone.

Posted Mar 22, 2007 17:17 UTC (Thu) by zooko (subscriber, #2589) [Link]

It's been a couple of years since I turned off swap, and I've been happy ever since. The behavior of the box is consistent and smooth under normal conditions, and under extreme conditions when I run out of RAM (I have 2 GiB physical RAM here), I get a fast failure in the form of the Linux OOM killer. This quick death of a process is always a more pleasant experience than the bad old days when that state would lead to untold hours of being unable to use my machine while it swap-thrashed, followed by the eventual death of a process and a return to usability.

Of course, your workloads may be different, so you might have the kind of load where your physical RAM isn't enough to finish the work, but your physical RAM plus swap is enough to finish the work within a reasonable time frame.

But none of my workloads are like that.

Regards,

Zooko

Swap: an idea whose time has come, and gone.

Posted Mar 22, 2007 17:35 UTC (Thu) by i3839 (guest, #31386) [Link]

Just curious, but how do you get out of RAM with 2 gig of the stuff? Video editing or something?

Swap: an idea whose time has come, and gone.

Posted Mar 22, 2007 18:35 UTC (Thu) by nix (subscriber, #2304) [Link]

Loading huge images might well do it too. (I blew my machine out of memory yesterday loading a 5000-square-pixel image into ImageMagick display(1).)

Swap: an idea whose time has come, and gone.

Posted Mar 22, 2007 20:16 UTC (Thu) by k8to (subscriber, #15413) [Link]

Compiling something (zsnes I think?) with gcc 4.1.1 caused it to consume 2.8 gigabytes of heap for one .c file. Granted, gcc had some kind of memory-consuming logic error, but if I wanted to compile that file, that's the RAM I needed to give it.

My 2GiB of RAM machine was not enough, even with the 512MiB of swap it normally has. I enabled some additional temporary swap to get past the hump.

For the record gcc 4.1.2 resolved the issue, but it was not available at the time.

Swap: an idea whose time has come, and gone.

Posted Mar 23, 2007 8:47 UTC (Fri) by jschrod (subscriber, #1646) [Link]

With VMware I often run out of my 2 GB. That's why I have an additional 2 GB swap. (I don't run all the VMware instances at once; they are for testing.)

Swap: an idea whose time has come, and gone.

Posted Mar 22, 2007 21:33 UTC (Thu) by cventers (guest, #31465) [Link]

To each their own I suppose, but you're just forcing the kernel to evict
page cache more frequently. A lot of the anonymous data you are pinning
will hardly ever be touched, but just think of the program text that has
to be freed and then later re-read as you move from application to
application!

It may behave fine with fewer processes, but once you start trying to run
more things at once, a system that swaps normally will allow you to, say,
switch between tasks in X faster than your configuration.

Swap isn't there just to extend your available virtual memory. There are
performance reasons to swap out anonymous pages.

Swap: an idea whose time has come, and gone.

Posted Mar 23, 2007 0:21 UTC (Fri) by mgedmin (subscriber, #34497) [Link]

In my experience it is less painful to recover from memory hogs filling up all the RAM when you have swap than when you don't. When there's no swap the machine often thrashes while being completely nonresponsive for tens of seconds until the OOM killer kicks in.

My experience may be somewhat outdated, though. I haven't tried running a swapless machine for quite a while now.

Swap: a useful system feature which could be improved.

Posted Mar 23, 2007 14:49 UTC (Fri) by k8to (subscriber, #15413) [Link]

You can get into some amount of thrashing when you have active processes and a lot of swap. This is by design of course, you want to keep them running, and you want to be able to run more than RAM strictly allows.

Despite this, even when deep into "thrash" recently (approximately 180% of ram in active use) my system remained functional. I could have logged in as root and killed off the offending process in 20 seconds or so. From an existing paged-out terminal I could have wiped it out in around 5-8 seconds. This is a lot better than it used to be. A lot better than some other operating systems I could name. But still it does slow the system down a lot.

This underscores an interesting issue. Many users do (or should) want swap to keep more memory available to programs. But desktop users don't really ever want to thrash. I wonder if there is any possible way to separate these cases. I would allocate 4 gigs or more of swap if I thought it wasn't an invitation to thrashing. Perhaps some kind of governor application which notices the pattern and gives the user explicit choices? I.e., prioritize the compile over all other programs (including my RSS reader and desktop search agent, which think they should run periodically), or perhaps "force app1 to suspend until app2 completes its run"...

Such a beast is probably writable, and it would help _me_, a desktop user who knows about system programming dorkery. I think it would not be very useful for the majority of Linux users, though, who run it on servers or may not have the sophistication to make explicit decisions of this sort. Any thoughts?

Swap: an idea whose time has come, and gone.

Posted Mar 23, 2007 2:44 UTC (Fri) by intgr (subscriber, #39733) [Link]

I think the inherent problem there is *not* that the kernel has some swap to use, but the fact that it penalizes all processes equally, instead of the misbehaving or memory-hogging one.

I would very gladly give my kernel a gigabyte of swap if it didn't disable the whole user space whenever a single process goes astray. Penalizing just the memory hogs would make much more sense to me.

Swap: an idea whose time has come, and gone.

Posted Mar 23, 2007 8:38 UTC (Fri) by njs (guest, #40338) [Link]

We've had this argument before, but...

It may be that in your situation, with current kernel memory management and workloads and hardware etc., running without swap improves your experience. It's a big honking jump to go from that to saying that swap is an idea whose "time has gone". If you want to make an argument that universal, you can't really just cite your particular set of factors and entirely ignore the theoretical reasons why swap is such a good idea...

Those being, of course, that swap lets the kernel toss out any pages of program memory that are rarely accessed in favor of any pages of data that are frequently accessed. Or more concretely, from a glance at top right now, it looks like my swap contains ~100M of firefox, ~1325M of evince (!! probably large rendered image caches?), 350M of matlab, 190M of openoffice, 40M of abiword... notice that the workload so far here is just "reading papers on data analysis" :-). Oh, and 350M of amarok... looks like a memory leak. (These numbers are probably not entirely accurate, since I'm just subtracting RES from VIRT, but you get the idea.)

So swap means that I have well over a gig of extra memory to cache things -- that's enough for, say, the entire source+build tree for a project I compile -- even if I didn't have that crazy evince stuff there. And it also means that I can happily run a dozen instances of evince to look at different high-res images, and not run out of memory in the process :-).

So, another data point for readers.

(Hey, I know -- next let's talk about swap-to-ramdisk!)

Toward improved page replacement

Posted Mar 23, 2007 2:37 UTC (Fri) by intgr (subscriber, #39733) [Link]

"It would be nice to avoid all of that extra work, especially since it comes at a time when the system is already under stress."

Except that when the kernel starts swapping, it will inherently be I/O bound anyway -- the extra work the CPU has to do is probably not even noticeable. Though the patch has secondary improvements as well, which can make it worthwhile.

Toward improved page replacement

Posted Mar 27, 2007 15:04 UTC (Tue) by liljencrantz (subscriber, #28458) [Link]

Most of the pages in the page cache are clean, i.e., once you've found those pages, evicting them is virtually free. That is the whole reason why you want to evict stuff from the page cache and not anonymous pages.

Another reason to distinguish anonymous vs page cache

Posted Mar 23, 2007 20:15 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

The article mentions one reason to favor page cache pages over anonymous ones: that a page cache page is more likely to exist somewhere else already so stealing the page frame doesn't require writing out the page presently in it.

There's another reason that I think is even more important: programs have traditionally been written to assume that I/O is slow and memory access is fast. It's an idea that is perhaps outdated in the face of modern virtual memory and file caching, not to mention mmap, but still a program that reads something from a file and thinks it will access it again soon is likely to keep it in an anonymous virtual memory buffer. Sophisticated programs even "page" between virtual memory and a file based on the program's knowledge of what will be needed soonest.

So giving priority to anonymous pages, which back virtual memory, is a way to take advantage of extra information the program has about what data is most important.

Another reason to distinguish anonymous vs page cache

Posted Mar 28, 2007 17:17 UTC (Wed) by liljencrantz (subscriber, #28458) [Link]

Sure, some programs choose to implement their own file caches internally. That is a _bad_ thing, as the operating system can share the page cache between all running software, meaning that all programs that access the same file will benefit from the cache, and that if multiple instances of the program are run, they will not create multiple copies of the same cache.

Also, you would be surprised how little it turns out that applications know about their own memory usage. The kernel is usually _much_ better at guessing what pages will be used next based on a simple LRU rule. That is why modern processors have a really fast L1 cache instead of a tiny on-processor memory that the programmer has manual control of.

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds