
Memory compaction

By Jonathan Corbet
January 6, 2010
The longstanding memory fragmentation problem has been covered many times in these pages. In short: as the system runs, pages tend to be scattered between users, making it hard to find groups of physically-contiguous pages when they are needed. Much work has gone into avoiding the need for higher-order (multi-page) memory allocations whenever possible, with the result that most kernel functionality is not hurt by page fragmentation. But there are still situations where higher-order allocations are needed; code which needs such allocations can fail on a fragmented system.

It's worth noting that, in one way, this problem is actually getting worse. Contemporary processors are not limited to 4K pages; they can work with much larger pages ("huge pages") in portions of a process's address space. There can be real performance advantages to using huge pages, mostly as a result of reduced pressure on the processor's translation lookaside buffer. But the use of huge pages requires that the system be able to find physically-contiguous areas of memory which are not only big enough, but which are properly aligned as well. Finding that kind of space can be quite challenging on systems which have been running for any period of time.
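
A quick back-of-the-envelope illustration of the TLB point (the 4KB and 2MB sizes are the usual x86 ones; the little program below is just arithmetic, not anything from the kernel):

    #include <stdio.h>

    /* How many translations does it take to keep a 1GB working set mapped
     * with 4KB base pages versus 2MB huge pages?  The ratio is the whole
     * point: far fewer TLB entries are needed when huge pages are used. */
    int main(void)
    {
        unsigned long working_set = 1UL << 30;   /* 1 GB */
        unsigned long base_page   = 4UL << 10;   /* 4 KB */
        unsigned long huge_page   = 2UL << 20;   /* 2 MB */

        printf("4KB pages: %lu translations\n", working_set / base_page);
        printf("2MB pages: %lu translations\n", working_set / huge_page);
        return 0;
    }

With 4KB pages, the 1GB working set needs 262144 translations; with 2MB huge pages, just 512.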

Over the years, the kernel developers have made various attempts to mitigate this problem; techniques like ZONE_MOVABLE and lumpy reclaim have been the result. There is still more that can be done, though, especially in the area of fixing fragmentation to recover larger chunks of memory. After taking a break from this area, Mel Gorman has recently returned with a new patch set implementing memory compaction. Here we'll take a quick look at how this patch works.

Imagine a very small memory zone which looks like this:

[Memory zone]

Here, the white pages are free, while those in red are allocated to some use. As can be seen, the zone is quite fragmented, with no contiguous blocks of larger than two pages available; any attempt to allocate, for example, a four-page block from this zone will fail. Indeed, even two-page allocations will fail, since none of the free pairs of pages are properly aligned.
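
The alignment constraint can be made concrete with a small user-space sketch (purely illustrative, not kernel code): for the buddy allocator, an order-n allocation needs 2^n consecutive free pages starting at an offset that is itself a multiple of 2^n. Running it on a free/allocated map like the one pictured shows single free pages, but no aligned pair and nothing larger:

    #include <stdbool.h>
    #include <stdio.h>

    /* 1 = free, 0 = allocated; a toy 16-page zone with scattered free pages. */
    static const int zone_free[16] = { 0,1, 1,0, 0,1, 1,0, 1,0, 0,1, 0,1, 1,0 };

    /* Is there an order-'order' block (2^order pages), starting at an aligned
     * offset, whose pages are all free?  This mirrors the buddy allocator's
     * requirement that blocks be naturally aligned. */
    static bool aligned_block_free(int order)
    {
        int nr = 1 << order;

        for (int start = 0; start + nr <= 16; start += nr) { /* aligned starts only */
            bool ok = true;
            for (int i = 0; i < nr; i++)
                if (!zone_free[start + i])
                    ok = false;
            if (ok)
                return true;
        }
        return false;
    }

    int main(void)
    {
        for (int order = 0; order <= 2; order++)
            printf("order-%d block available: %s\n",
                   order, aligned_block_free(order) ? "yes" : "no");
        return 0;
    }

On this map, order-0 succeeds while order-1 and order-2 fail - exactly the situation described above.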

It's time to call in the compaction code. This code runs as two separate algorithms; the first of them starts at the bottom of the zone and builds a list of allocated pages which could be moved:

[Movable pages]

Meanwhile, at the top of the zone, the other half of the algorithm is creating a list of free pages which could be used as the target of page migration:

[Movable and free pages]

Eventually the two algorithms will meet somewhere toward the middle of the zone. At that point, it's mostly just a matter of invoking the page migration code (which is not just for NUMA systems anymore) to shift the used pages to the free space at the top of the zone, yielding a pretty picture like this:

[Happy ever after]

We now have a nice, eight-page, contiguous span of free space which can be used to satisfy higher-order allocations if need be.
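
Here is a heavily simplified, user-space rendering of that process, in the spirit of the patch set rather than its actual code: the migration scanner walks up from the bottom of a toy zone looking for in-use pages, the free scanner walks down from the top looking for free pages, and each used page found low in the zone is "migrated" into a free page found high up, until the two scanners meet. The real code batches pages onto lists and hands them to the page migration code; this sketch just swaps array entries:

    #include <stdio.h>

    #define ZONE_PAGES 16

    /* Toy zone: 1 = allocated (and movable), 0 = free. */
    static int zone[ZONE_PAGES] = { 1,0,0,1, 1,0,0,1, 0,1,1,0, 1,0,0,1 };

    static void print_zone(const char *when)
    {
        printf("%s: ", when);
        for (int i = 0; i < ZONE_PAGES; i++)
            putchar(zone[i] ? 'X' : '.');
        putchar('\n');
    }

    /* Simplified compaction: the migration scanner (bottom up) finds used
     * pages, the free scanner (top down) finds free pages, and used pages
     * are moved into the free slots until the two scanners meet. */
    static void compact_zone(void)
    {
        int migrate_pfn = 0;
        int free_pfn = ZONE_PAGES - 1;

        while (migrate_pfn < free_pfn) {
            while (migrate_pfn < free_pfn && zone[migrate_pfn] == 0)
                migrate_pfn++;          /* skip pages that are already free */
            while (migrate_pfn < free_pfn && zone[free_pfn] == 1)
                free_pfn--;             /* skip pages that are already used */
            if (migrate_pfn < free_pfn) {
                zone[free_pfn] = 1;     /* "migrate" the page upward...     */
                zone[migrate_pfn] = 0;  /* ...leaving a free page behind    */
            }
        }
    }

    int main(void)
    {
        print_zone("before");
        compact_zone();
        print_zone("after ");
        return 0;
    }

The output goes from a scattered pattern to eight used pages packed at one end of the zone and an eight-page contiguous free span at the other - the "happy ever after" picture in miniature.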

Of course, the picture given here has been simplified considerably from what happens on a real system. To begin with, the memory zones will be much larger; that means there's more work to do, but the resulting free areas may be much larger as well.

But all this only works if the pages in question can actually be moved. Not all pages can be moved at will; only those which are addressed through a layer of indirection and which are not otherwise pinned down are movable. So most user-space pages - which are accessed through user virtual addresses - can be moved; all that is needed is to tweak the relevant page table entries accordingly. Most memory used by the kernel directly cannot be moved - though some of it is reclaimable, meaning that it can be freed entirely on demand. It only takes one non-movable page to ruin a contiguous segment of memory. The good news here is that the kernel already takes care to separate movable and non-movable pages, so, in reality, non-movable pages should be a smaller problem than one might think.
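
The movable/reclaimable/unmovable distinction, and the way a single pinned page spoils a block, can be sketched like this (the enum names echo the kernel's grouping of pages by mobility, but the rest is a toy model, not kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    enum migrate_type {
        MT_FREE,         /* not allocated at all                                 */
        MT_MOVABLE,      /* e.g. user pages reached via page tables; relocatable */
        MT_RECLAIMABLE,  /* kernel pages that can be freed on demand, not moved  */
        MT_UNMOVABLE,    /* kernel pages addressed directly; pinned in place     */
    };

    /* A single pinned page spoils a candidate block: compaction can only
     * empty a range whose every in-use page is movable (reclaimable pages
     * would need reclaim rather than migration). */
    static bool block_is_compactable(const enum migrate_type *block, int nr_pages)
    {
        for (int i = 0; i < nr_pages; i++)
            if (block[i] == MT_UNMOVABLE || block[i] == MT_RECLAIMABLE)
                return false;
        return true;
    }

    int main(void)
    {
        enum migrate_type good[4] = { MT_MOVABLE, MT_FREE, MT_MOVABLE, MT_FREE };
        enum migrate_type bad[4]  = { MT_MOVABLE, MT_UNMOVABLE, MT_FREE, MT_FREE };

        printf("all-movable block compactable: %s\n",
               block_is_compactable(good, 4) ? "yes" : "no");
        printf("block with one pinned page:    %s\n",
               block_is_compactable(bad, 4) ? "yes" : "no");
        return 0;
    }

Because the kernel tries to group allocations of similar mobility together, whole blocks tend to be either compactable or not, rather than being spoiled one stray page at a time.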

The running of the compaction algorithm can be triggered in either of two ways. One is to write a node number to /proc/sys/vm/compact_node, causing compaction to happen on the indicated NUMA node. The other is for the system to fail in an attempt to allocate a higher-order page; in this case, compaction will run as a preferable alternative to freeing pages through direct reclaim. In the absence of an explicit trigger, the compaction algorithm will stay idle; there is a cost to moving pages around which is best avoided if it is not needed.
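
For the explicit trigger, the following is roughly what "writing a node number" amounts to from user space; the compact_node file is the one described in the patch set above, so on a kernel without the patch the open will simply fail:

    #include <stdio.h>

    /* Ask the (patched) kernel to compact memory on one NUMA node by
     * writing the node number to the trigger file.  Run as root. */
    int main(void)
    {
        int node = 0;  /* compact node 0 */
        FILE *f = fopen("/proc/sys/vm/compact_node", "w");

        if (!f) {
            perror("compact_node");  /* kernel lacks the compaction patch? */
            return 1;
        }
        fprintf(f, "%d\n", node);
        fclose(f);
        return 0;
    }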

Mel ran some simple tests showing that, with compaction enabled, he was able to allocate over 90% of the system's memory as huge pages while simultaneously decreasing the amount of reclaim activity needed. So it looks like a useful bit of work. It is memory management code, though, so the amount of time required to get into the mainline is never easy to predict in advance.



Memory compaction

Posted Jan 7, 2010 4:27 UTC (Thu) by eparis123 (guest, #59739)

By Jonathan Corbet - January 6, 2010

I wonder how our author managed his time to write this article, given that the patch set was also posted on the same day, the sixth of Jan!

That's literally up-to-the-minute highly informative reporting.

Time management

Posted Jan 7, 2010 14:08 UTC (Thu) by corbet (editor, #1)

It's called "your editor still hadn't figured out what the last kernel article was going to be as the deadline loomed closer, and getting his head around the speculative page fault discussion is proving to take longer than expected...oh, here's a nice shiny topic!" Or something like that.

Memory compaction

Posted Jan 7, 2010 4:48 UTC (Thu) by eparis123 (guest, #59739)

That might be a naive question, but I wonder if "reclaiming" a page is synonymous with swapping it to a cache partition, or just simply freeing it if it's already in a clean state in the file system. From the patchset:

When a process fails to allocate a high-order page, it may compact memory in an attempt to satisfy the allocation instead of entering direct reclaim.

And from the article:

Most memory used by the kernel directly cannot be moved - though some of it is reclaimable, meaning that it can be freed entirely on demand

The patch set's author also makes fewer reclaimed pages a selling point of the patch set, so I hope I'm not misunderstanding:

The vanilla kernel had reclaimed 105132 pages at that point. The kernel with compaction had reclaimed 59071, less than half of what the vanilla kernel reclaimed

Thank you

Memory compaction

Posted Jan 7, 2010 7:13 UTC (Thu) by johill (subscriber, #25196)

I can't really answer the question, but I don't think it matters for the "selling argument" at hand?

Both swapping out a page and freeing something that was recently used have performance impacts, so if you can get away with doing less of that while simultaneously reducing memory fragmentation, that seems like a good thing.

As I understand it, what Jon is saying is that if the code finds a page it would like to move but cannot, yet can reclaim, it will reclaim it instead of moving it, thereby also freeing it up for the contiguous memory block it is creating.

"less than half" ?

Posted Jan 7, 2010 17:25 UTC (Thu) by tialaramex (subscriber, #21167)

59071 is not "less than half" of 105132 although it is "only slightly more than half" and so this isn't important, just a nitpick (or perhaps a misunderstanding on my part)
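(For the record: half of 105132 is 52566, and 59071 is about 56% of 105132, so the compaction kernel reclaimed slightly more than half as many pages - a reduction of roughly 44%.)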

Memory compaction

Posted May 16, 2012 10:28 UTC (Wed) by gaoqiangscut (guest, #70812)

Then under what conditions can user process pages not be moved?

Memory compaction

Posted Jul 19, 2012 4:51 UTC (Thu) by vinayakm (subscriber, #83282)

mlocked user-space pages can't be moved

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds