The new NUMA scheduler
NUMA (non-uniform memory access) systems, of course, are distinguished by an architecture which makes some memory "closer" to certain processors than others. Each "node" in a NUMA system contains one or more processors, along with an array of local memory. Processors can access memory belonging to other nodes, but that access will be relatively slow. To get top (or even reasonable) performance on NUMA systems, the kernel must keep each process - and its memory - within a single node whenever possible.
The memory allocation side has been in place for some time; the Linux kernel memory allocator sets up one or more zones for each node, and allocates new pages from the current node's zones whenever possible. But the scheduler, as found in 2.5.58, will happily move processes between nodes in its efforts to keep all processors busy. There has been a NUMA scheduler patch floating around for a while, but it has not been merged, perhaps because it made too many changes to the scheduler for non-NUMA systems.
More recently, the NUMA scheduler patch has been reworked (by Martin Bligh, Erich Focht, Michael Hohnbaum, and others) around a simple observation: most of the NUMA problems can be solved by simply restricting the current scheduler's balancing code to processors within a single node. If the rebalancer - which moves processes across CPUs in order to keep them all busy - only balances inside a node, the worst processor imbalances will be addressed without moving processes into a foreign-node slow zone.
A simple (three-line) patch which did nothing but add the within-node restriction yielded most of the benefits of the full NUMA scheduler; indeed, it performed better on some benchmarks. Real-world loads, however, will require a scheduler which can distribute processes evenly across nodes. Occasionally it is necessary, even, to move processes to a slower node; a lot of CPU time on a lightly-loaded node will give better performance than waiting in the run queue on a heavily-loaded node. So a bit of complexity had to be added back into the new scheduler to complete the job.
The 2.5.59 scheduler distributes processes across NUMA nodes in two places. The first is in the exec() system call. A process which calls exec() is very simple to move, since almost all of its context, including memory, is being thrown away. For many loads, proper balancing at exec() time is enough to get good performance.
Some loads, however, will tend to pile up processes within a single node. Any process which forks many times, for example, will find itself competing with all of its children for the same node's resources (unless, of course, those children call exec() and are moved to a new node). To address this problem, the new NUMA scheduler will occasionally look for a large load imbalance between nodes, and, if one is found, move processes to balance things out. This rebalancing happens once for every ten or hundred intra-node rebalancings, depending on the architecture.
The scheduler has seen continued tweaking since 2.5.59 came out. The most significant change, perhaps, is to move the explicit load balancing out of the main scheduler code (where it could get called many times per second on an idle processor) and to restrict it to the scheduler's "timer tick" routine. That change allows more exact control over when the rebalancings happen. A recent patch from Ingo Molnar performs fairly frequent rebalancings (intra-node every 1ms, and globally every 2ms) when the current processor is idle; if the processor is busy the rebalancings only happen every 200 (local) and 400ms (global).
Linus raised an interesting point when he merged the NUMA scheduler: can this scheduler handle hyperthreading as well? Hyperthreaded processors implement two (or more) virtual CPUs on the same physical processor; one processor can be running while the other waits for memory access. Hyperthreading can certainly be seen as a sort of NUMA system, since the sibling processors share a cache and thus have faster access to memory that either one has accessed recently. So the same algorithm should really work in this case.
Treating hyperthreaded systems as NUMA systems has a a certain conceptual
elegance, but it may not be the way the Linux kernel goes in the end. The
most recent hyperthreading patch from Ingo
Molnar takes a different approach: rather than mess with "rebalancing"
processes across the same physical processor, why not just use the same run
queue for both? Sibling processes on a hyperthreaded core are truly
equivalent; it does not matter which process runs on which virtual
processor as long as they are all busy. So NUMA and hyperthreading may
stay as distinct cases for now.
