Idle and stale page tracking

By Jonathan Corbet
October 4, 2011

Google's requirements for systems running in its cluster have been discussed in public a number of times at this point; the recent Linux Plumbers Conference session on control groups is an example. The company does everything it can to pack as much work onto each system as possible to ensure that its hardware is fully utilized. One aspect of this packing is the need to make the best use possible of system memory. Michel Lespinasse's recently posted idle page tracking patch set is one piece of Google's solution to this problem.

The "fake NUMA" mechanism is currently used to control memory use within a single system, but Google is trying to move to the control-group memory controller instead. The memory controller can put limits on how much memory each group of processes can use, but it is unable to automatically vary those limits in response to the actual need shown by those groups. So some control groups may have a lot of idle memory sitting around while others are starved. Google would like to get a better handle on how much memory each group actually needs so that the limits can be adjusted on the fly - responding to changes in load - and more jobs can be crammed onto each box.

Determining a process's true memory needs can be hard, but one fairly clear clue is the existence of pages in the process's working set that have not been touched in some time. If there are a lot of idle pages around, it is probably safe to say that the process is not starved for memory; this idea is based, of course, on the notion that the kernel's page replacement algorithm is working reasonably well. It follows that, if you would like to know how memory usage limits can be tweaked to optimize the use of memory, it makes sense to track the number of idle pages in each control group. The kernel does not currently provide that information - a gap that Michel's patch set tries to fill.

The memory management code has a function (page_referenced() and a number of variants) that can be used to determine whether a given page has been referenced since the last time it was checked. It is used in a number of memory management decisions, such as the quick aging of pagecache pages that are only referenced once. Michel's patch makes use of this mechanism to find idle pages, but this use has some slightly different needs: Michel needs to know more about the pages in question, and he needs to not interfere with other users of page_referenced(). To meet these needs, Michel has to make some changes to the core memory management code.

For the first problem, the page_referenced() interface is changed to take a new structure (struct page_referenced_info) where the additional information can be recorded. Avoiding interference with existing users of page_referenced(), instead, requires adding a couple of new page flags. Since page flags are in short supply on 32-bit architectures, using more of them is strongly discouraged. This patch set gets around that problem by disabling the feature altogether on 32-bit machines; anybody wanting idle page tracking will need to run in 64-bit mode.

Systems where idle page tracking is in use will have a new kernel thread running under the name kstaled. Its job is to scan through all of memory (once every two minutes by default) and count the number of pages that have not been referenced since the previous scan. Such pages are deemed to be idle; each one is traced back to its owning control group and that group's statistics are adjusted. The patch adds a new "page age" data structure - an array containing one byte for every page in the system - to track how long each page has been idle, up to 255 scan cycles. The results are boiled down to counters showing how many pages have been idle for 1, 2, 5, 15, 30, 60, 120, and 240 cycles. Idle pages are further broken down into a few categories: clean, dirty and file-backed, and dirty anonymous pages. These counters, which are only updated at the end of each scan, can be found in the memory controller's control directory for each group.

Since the statistics are only updated at the end of each scan, and since the scans are two minutes apart, the resulting numbers are likely to lag reality by some time. Imagine that a given page is scanned toward the beginning of a cycle and seen to be in use; clearly it will not be counted as idle. If it is referenced one last time just after the scan, it will still appear to be in use at the next scan, nearly two minutes later, when the "referenced" bit will be reset. It is only after another two minutes that kstaled will decide that the page is unused - nearly four minutes after its last reference. That is not necessarily a problem; a decision to shrink a group of processes because they are not using all of their memory probably should not be made in haste.

There are times when more current information is useful, though. In particular, Google's management code would like to know when a group of processes suddenly start making heavier use of their memory so that their limits can be expanded before they begin to thrash. To handle this case, the patch introduces the notion of "stale" pages: a page is stale if it is clean and if it has been idle for more than a given (administrator-defined) number of scan cycles. The presence of stale pages indicates that a control group is not under serious memory pressure. If that control group's memory needs suddenly increase, though, the kernel will start reclaiming those stale pages. So a sudden drop in the number of stale pages is a good indication that something has changed.

When kstaled determines that a given page is stale, one of the new page flags (PG_stale) will be used to mark it. Tests have been sprinkled throughout the memory management code to notice when a stale page is dirtied, referenced, locked, or reclaimed; when that happens, the owning control group's count of stale pages will be decremented on the spot. Stale pages are not detected any more quickly than idle pages, but a reduction in the number of stale pages can be noticed immediately. That provides an early-warning system that can flag control groups whose memory use is on the increase.

The patch has been through a couple of iterations; there have been comments pointing out things to fix but no fundamental opposition to the idea. That said, memory management patches are not known for their speed getting into the mainline; if and when we'll see this feature in mainline kernels remains to be seen.

Index entries for this article
Kernel	Memory management/Control groups

This could be quite interesting for me.

Posted Oct 6, 2011 14:55 UTC (Thu) by ejr (subscriber, #51652) [Link]

I work on massive graph analysis. We assume the graph is in memory (for multiple reasons, and it's appropriate for our target) and ideally in the multi-TiB size. Many of our graph analysis kernels skip around the graph in an unpredictable manner and run for hours. Assuming that pages are allocated per socket on ye typical NUMA machines, the idle page data could give us a general way to measure aggregate load balance for memory operations or help us determine which old graph edges to drop when we need to add newer edges. Sure, we could collect this data ourselves, but it's always nice when someone does it for you...

Thank you for pointing this out!

Idle and stale page tracking

Posted Oct 7, 2011 20:45 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

Why does Google partition the memory? If they're just going to move memory from partition to partition to make the utilization even, that seems to take back one of the typical benefits of partitioning memory.

Idle and stale page tracking

Posted Oct 9, 2011 21:22 UTC (Sun) by sethml (guest, #8471) [Link]

Because they have another option: move a process to another machine. Imagine you start a batch job that requires 10,000 processes. Some of your processes may get dumped on machines running web search. Now imagine one of those search processes gets a bunch of requests and its working set increases - your process may get killed to avoid making web search thrash. The controller for your job can then move that process's work somewhere else. This lets Google take advantage of underused resources on production clusters without disrupting production services.