The new NUMA scheduler
[Posted January 22, 2003 by corbet]
The O(1) scheduler was integrated relatively early in the 2.5 development
cycle with great results. So it could be a bit surprising to see a new set
of scheduler changes going in at this late, feature-frozen date. The
inclusion of a new NUMA scheduler in 2.5.59, however, is a relatively safe
move which will help Linux perform well on high-end systems.
NUMA (non-uniform memory access) systems, of course, are distinguished by
an architecture which makes some memory "closer" to certain processors than
others. Each "node" in a NUMA system contains one or more processors,
along with an array of local memory. Processors can access memory belonging
to other nodes, but that access will be relatively slow. To get top (or
even reasonable) performance on NUMA systems, the kernel must keep each
process - and its memory - within a single node whenever possible.
The memory allocation side has been in place for some time; the Linux
kernel memory allocator sets up one or more zones for each node, and
allocates new pages from the current node's zones whenever possible. But
the scheduler, as found in 2.5.58, will happily move processes between
nodes in its efforts to keep all processors busy. There has been a NUMA
scheduler patch floating around for a while, but it has not been merged,
perhaps because it made too many changes to the scheduler for non-NUMA
systems.
More recently, the NUMA scheduler patch has been reworked (by Martin Bligh,
Erich Focht, Michael Hohnbaum, and others) around a simple observation:
most of the NUMA problems can be solved by simply restricting the current
scheduler's balancing code to processors within a single node. If the
rebalancer - which moves processes across CPUs in order to keep them all
busy - only balances inside a node, the worst processor imbalances will be
addressed without moving processes into a foreign-node slow zone.
A simple (three-line) patch which did nothing but add the within-node
restriction yielded most of the benefits of the full NUMA scheduler;
indeed, it performed better on some benchmarks. Real-world loads, however,
will require a scheduler which can distribute processes evenly across
nodes. Occasionally it is necessary, even, to move processes to a slower
node; a lot of CPU time on a lightly-loaded node will give better
performance than waiting in the run queue on a heavily-loaded node. So a
bit of complexity had to be added back into the new scheduler to complete
the job.
The 2.5.59 scheduler distributes processes across NUMA nodes in two
places. The first is in the exec() system call. A process which
calls exec() is very simple to move, since almost all of its
context, including memory, is being thrown away. For many loads, proper
balancing at exec() time is enough to get good performance.
Some loads, however, will tend to pile up processes within a single node.
Any process which forks many times, for example, will find itself competing
with all of its children for the same node's resources (unless, of course,
those children call exec() and are moved to a new node). To
address this problem, the new NUMA scheduler will occasionally look for a
large load imbalance between nodes, and, if one is found, move processes to
balance things out. This rebalancing happens once for every ten or hundred
intra-node rebalancings, depending on the architecture.
The scheduler has seen continued tweaking since 2.5.59 came out. The most
significant change, perhaps, is to move the explicit load balancing out of
the main scheduler code (where it could get called many times per second on
an idle processor) and to restrict it to the scheduler's "timer tick"
routine. That change allows more exact control over when the rebalancings
happen. A recent patch from Ingo Molnar performs fairly frequent
rebalancings (intra-node every 1ms, and globally every 2ms) when the
current processor is idle; if the processor is busy the rebalancings only
happen every 200 (local) and 400ms (global).
Linus raised an interesting point when he
merged the NUMA scheduler: can this scheduler handle hyperthreading as
well? Hyperthreaded processors implement two (or more) virtual CPUs on the
same physical processor; one processor can be running while the other waits
for memory access. Hyperthreading can certainly be seen as a sort of NUMA
system, since the sibling processors share a cache and thus have faster
access to memory that either one has accessed recently. So the same
algorithm should really work in this case.
Treating hyperthreaded systems as NUMA systems has a a certain conceptual
elegance, but it may not be the way the Linux kernel goes in the end. The
most recent hyperthreading patch from Ingo
Molnar takes a different approach: rather than mess with "rebalancing"
processes across the same physical processor, why not just use the same run
queue for both? Sibling processes on a hyperthreaded core are truly
equivalent; it does not matter which process runs on which virtual
processor as long as they are all busy. So NUMA and hyperthreading may
stay as distinct cases for now.
(
Log in to post comments)