Efforts to reduce power consumption on Linux systems have often focused on
the CPU; that emphasis does make sense, since the CPU draws a significant
portion of the power used on most systems. After years of effort to
improve the kernel's power behavior, add instrumentation to track wakeups,
and fix misbehaving applications, Linux does quite well when it comes to
CPU power. So now attention is moving to other parts of the system in
search of power savings to be had; one of those places is main memory.
Contemporary DRAM memory requires power for its self-refresh cycles even if
it is not being used; might there be a way to reduce its consumption?
One technology which is finding its way into some systems is called
"partial array self refresh" or PASR. On a PASR-enabled system, memory is
divided into banks, each of which can be powered down independently. If
(say) half of memory is not needed, that memory (and its self-refresh
mechanism) can be turned off; the result is a reduction in power use, but
also the loss of any data stored in the affected banks. The amount of
power actually saved is a bit unclear; estimates seem to run in the range
of 5-15% of the total power used by the memory subsystem.
The key to powering down a bank of memory, naturally, is to be sure that
there is no important data stored therein first. That means that the
system must either evacuate a bank to be powered down, or it must take care
not to allocate memory there in the first place. So the memory management
subsystem will have to become aware of the power topology of main memory
and take that information into account when satisfying allocation
requests. It will also have to understand the desired power management
policy and make decisions to power banks up or down depending on the
current level of memory pressure. This is going to be fun: memory
management is already a complicated set of heuristics which attempt to
provide reasonable results for any workload; adding power management into
the mix can only complicate things further.
A recent patch set from Ankita Garg does
not attempt to solve the whole problem; instead, it creates an initial
infrastructure which can be used for future power management decisions.
Before looking at that patch, though, a bit of background will be helpful.
The memory management subsystem already splits available memory at two
different levels. On non-uniform memory access (NUMA) systems, memory
which is local to a specific processor will be faster to access than memory
on a different processor. The kernel's memory management code takes NUMA nodes
into account to implement specific allocation policies. In many cases, the
system will try to keep a process and all of its memory on the same NUMA
node in the hope of maximizing the number of local accesses; other times,
it is better to spread allocations evenly across the system. The point is
that the NUMA node must be taken into account for all allocation and
The other important concept is that of a "zone"; zones are present on all
systems. The primary use of zones is to categorize memory by
accessibility; 32-bit systems, for example, will have "low memory" and
"high memory" zones to contain memory which can and cannot (respectively)
be directly accessed by the kernel. Systems may have a zone
for memory accessible with a 32-bit address; many devices can only perform
DMA to such addresses. Zones are also used to separate memory which can
readily be relocated (user-space pages accessed through page tables, for
example) from memory which is hard to move (kernel memory for which there
may be an arbitrary number of pointers). Every NUMA node has a full set of
PASR has been on the horizon for a little while, so a few people have been
thinking about how to support it; one of the early works would appear to be
this paper by Henrik
Kjellberg, though that work didn't result in code submitted upstream.
Henrik pointed out that the kernel already has a couple of
mechanisms which could be used to support PASR.
One of those is memory hotplug, wherein memory can be physically removed
from the system. Turning off a bank of memory can be thought of as being
something close to removing that memory, so it makes sense to consider
hotplug. Hotplug is a heavyweight operation, though; it is not well suited
to power management, where decisions to power banks of memory up or down
may be made fairly often.
Another approach would be to use zones; the
system could set up a separate zone for each memory bank which could be
powered down independently. Powering down a bank would then be a matter of
moving needed data out of the associated zone and marking that zone so that
no further allocations would be made from it. The problem with this
approach is a number of important memory management operations happen at
the zone level; in particular, each zone has a set of limits on how many
free pages must exist. Adding more zones would increase memory management
overhead and create balancing problems which don't need to exist.
That is the approach that Ankita has taken, though; the patch adds another
level of description called "regions" interposed between nodes and zones,
essentially creating not just one new zone for each bank of memory, but a
complete set of zones for each. The page
allocator will always try to obtain pages from the lowest-numbered region
it can in the hope that the higher regions will remain vacant. Over time,
of course, this simple approach will not work and it will become necessary
to migrate pages out of regions before they can be powered down. The
initial patch does not address that issue, though - or any of the
associated policy issues that come up.
Your editor is not a memory management hacker, but ignorance has never kept
him from having an opinion on things. To a naive point of view, it would
almost seem like this design has been done backward - that regions should
really be contained within zones. That would avoid multiplying the number
of zones in the system and the associated balancing costs. Also,
importantly, it would allow regions to be controlled by the policy of a
single enclosing zone. In particular, regions inside a zone used for
movable allocations would be vacated with relative ease, allowing them to
be powered down when memory pressure is light. Placing multiple zones
within each region, instead, would make clearing a region harder.
The patch set has not gotten a lot of review attention; the people who know
what they are talking about in this area have mostly kept silent.
There are numerous
memory management patches circulating at the moment, so time for review is
probably scarce. Andrew Morton did ask
about the overhead of this work on machines which lack the PASR capability
and about how much power might actually be saved; answers to those
questions don't seem to be available at the moment. So one might conclude
that this patch set, while demonstrating an approach to memory power
management, will not be ready for mainline inclusion in the near future.
But, then, adding power management to such a tricky subsystem was never
going to be done in a hurry.
to post comments)