Google's requirements for systems running in its cluster have been
discussed in public a number of times at this point; the recent Linux Plumbers Conference session on control
groups is an
example. The company does everything it can to pack as much work onto each
system as possible to ensure that its hardware is fully utilized. One
aspect of this packing is the need to make the best use possible of system
memory. Michel Lespinasse's recently posted idle page tracking patch set is one piece of
Google's solution to this problem.
The "fake NUMA" mechanism is currently used to control memory use within a
single system, but Google is trying to move to the control-group memory
controller instead. The memory controller can put limits on how much
memory each group of processes can use, but it is unable to automatically
limits in response to the actual need shown by those groups. So some
control groups may have a lot of idle memory sitting around while others
are starved. Google would like to get a better handle on how much memory
each group actually needs so that the limits can be adjusted on the fly -
responding to changes in load - and more jobs can be crammed onto each box.
Determining a process's true memory needs can be hard, but one fairly clear
clue is the existence of pages in the process's working set that have not
been touched in some time. If there are a lot of idle pages around, it is
to say that the process is not starved for memory; this idea is based, of
course, on the notion that the kernel's page replacement algorithm is
working reasonably well. It follows that, if you would like to know how
memory usage limits can be tweaked to optimize the use of memory, it makes
sense to track the number of idle pages in each control group. The kernel
does not currently provide that information - a gap that Michel's patch set
tries to fill.
The memory management code has a function (page_referenced() and a
number of variants) that can be used to determine whether a given page has
been referenced since the last time it was checked. It is used in a number
of memory management decisions, such as the quick aging of pagecache pages
that are only referenced once. Michel's patch makes use of this mechanism
to find idle pages, but this use has some slightly different needs: Michel
needs to know more about the pages in question, and he needs to not
interfere with other users of page_referenced(). To meet these
needs, Michel has to make some changes to the core memory management code.
For the first
problem, the page_referenced() interface is changed to take a new
page_referenced_info) where the additional information can be
Avoiding interference with existing users of page_referenced(),
instead, requires adding a couple of new page flags. Since page flags are
in short supply on 32-bit architectures,
using more of them is strongly discouraged. This patch set gets around
that problem by disabling the feature altogether on 32-bit machines;
anybody wanting idle page tracking will need to run in 64-bit mode.
Systems where idle page tracking is in use will have a new kernel thread
running under the name kstaled. Its job is to scan through all of
memory (once every two minutes by default) and count the number of pages
that have not been referenced since the previous scan. Such pages are
deemed to be idle; each one is traced back to its owning control group and
that group's statistics are adjusted. The patch adds a new "page age"
data structure - an array containing one byte for every page in the system
- to track how long each page has been idle, up to 255 scan cycles. The
results are boiled down to counters showing how many pages have been idle
for 1, 2, 5, 15, 30, 60, 120, and 240 cycles. Idle pages are further
broken down into a few categories: clean, dirty and file-backed, and dirty
anonymous pages. These counters, which are
only updated at the end of each scan, can be found in the memory
controller's control directory for each group.
Since the statistics are only updated at the end of each scan, and since
the scans are two minutes apart, the resulting numbers are likely to lag
reality by some time. Imagine that a given page is scanned toward the
beginning of a cycle and seen to be in use; clearly it will not be counted
as idle. If it is referenced one last time just after the scan, it will
still appear to be in use at the next scan, nearly two minutes later, when
the "referenced" bit will be reset. It is only after another two minutes
that kstaled will decide that the page is unused - nearly four minutes
after its last reference. That is not necessarily a problem; a decision to
shrink a group of processes because they are not using all of their memory
probably should not be made in haste.
There are times when more current information is useful, though. In
particular, Google's management code would like to know when a group of
processes suddenly start making heavier use of their memory so that their
limits can be expanded before they begin to thrash. To handle this case,
the patch introduces the notion of "stale" pages: a page is stale if it is
clean and if it has been idle for more than a given (administrator-defined)
number of scan cycles. The presence of stale pages indicates that a
control group is not under serious memory pressure. If that control
group's memory needs suddenly increase, though, the kernel will start
reclaiming those stale pages. So a sudden drop in the number of stale
pages is a good indication that something has changed.
When kstaled determines that a given page is stale, one of the new
page flags (PG_stale) will be used to mark it. Tests have been
sprinkled throughout the memory management code to notice when a stale page
is dirtied, referenced, locked, or reclaimed; when that happens, the owning
control group's count of stale pages will be decremented on the spot.
Stale pages are not detected any more quickly than idle
pages, but a reduction in the number of stale pages can be noticed
immediately. That provides an early-warning system that can flag control
groups whose memory use is on the increase.
The patch has been through a couple of iterations; there have been comments
pointing out things to fix but no fundamental opposition to the idea. That
said, memory management patches are not known for their speed getting into
the mainline; if and when we'll see this feature in mainline kernels
remains to be seen.
to post comments)