The new NUMA scheduler
NUMA (non-uniform memory access) systems, of course, are distinguished by an architecture which makes some memory "closer" to certain processors than others. Each "node" in a NUMA system contains one or more processors, along with an array of local memory. Processors can access memory belonging to other nodes, but that access will be relatively slow. To get top (or even reasonable) performance on NUMA systems, the kernel must keep each process - and its memory - within a single node whenever possible.
The memory allocation side has been in place for some time; the Linux kernel memory allocator sets up one or more zones for each node, and allocates new pages from the current node's zones whenever possible. But the scheduler, as found in 2.5.58, will happily move processes between nodes in its efforts to keep all processors busy. There has been a NUMA scheduler patch floating around for a while, but it has not been merged, perhaps because it made too many changes to the scheduler for non-NUMA systems.
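As a rough illustration (and nothing more) of that allocation policy, here is a toy model in C of per-node page pools with node-local allocation and fallback to other nodes; the structures, helper names, and numbers are all invented:

    #include <stdio.h>

    /* Toy model of node-local page allocation with fallback.  Two nodes,
     * each with its own pool of free pages; a request made on a node is
     * satisfied locally when possible and spills over to other nodes
     * only when the local pool is empty. */
    #define NR_NODES 2

    static int free_pages[NR_NODES] = { 1, 3 };    /* pages left per node */

    static int alloc_page_on_node(int node)
    {
        /* Try the requesting node first, then the others in order. */
        for (int n = 0; n < NR_NODES; n++) {
            int candidate = (node + n) % NR_NODES;
            if (free_pages[candidate] > 0) {
                free_pages[candidate]--;
                return candidate;
            }
        }
        return -1;    /* out of memory everywhere */
    }

    int main(void)
    {
        /* The second request from node 0 must fall back to node 1. */
        printf("page came from node %d\n", alloc_page_on_node(0));
        printf("page came from node %d\n", alloc_page_on_node(0));
        return 0;
    }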
More recently, the NUMA scheduler patch has been reworked (by Martin Bligh, Erich Focht, Michael Hohnbaum, and others) around a simple observation: most of the NUMA problems can be solved by simply restricting the current scheduler's balancing code to processors within a single node. If the rebalancer - which moves processes across CPUs in order to keep them all busy - only balances inside a node, the worst processor imbalances will be addressed without moving processes into a foreign-node slow zone.
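The effect of that restriction is easy to picture with a toy model. The sketch below is only an illustration, not Martin Bligh's actual patch: it limits the search for the busiest CPU to a bitmask of node-local CPUs, using made-up topology helpers and load figures.

    #include <stdio.h>

    /* Toy topology: 8 CPUs, 2 per node; cpu_to_node() and
     * node_to_cpumask() are invented stand-ins for topology helpers. */
    #define NR_CPUS       8
    #define CPUS_PER_NODE 2

    static int cpu_to_node(int cpu) { return cpu / CPUS_PER_NODE; }

    static unsigned long node_to_cpumask(int node)
    {
        /* One bit per CPU belonging to the node. */
        return ((1UL << CPUS_PER_NODE) - 1) << (node * CPUS_PER_NODE);
    }

    /* Find the most loaded CPU, considering only those whose bit is set
     * in cpumask; passing a node-local mask is the whole trick. */
    static int find_busiest_cpu(const int *nr_running, unsigned long cpumask)
    {
        int busiest = -1, max = 0;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            if (!(cpumask & (1UL << cpu)))
                continue;
            if (nr_running[cpu] > max) {
                max = nr_running[cpu];
                busiest = cpu;
            }
        }
        return busiest;
    }

    int main(void)
    {
        int load[NR_CPUS] = { 1, 4, 0, 2, 9, 3, 1, 1 };
        int this_cpu = 0;
        unsigned long local = node_to_cpumask(cpu_to_node(this_cpu));

        /* CPU 4 is the busiest overall, but it lives on another node;
         * with the node-local mask, only CPU 1 is eligible. */
        printf("busiest anywhere:   CPU %d\n", find_busiest_cpu(load, ~0UL));
        printf("busiest in my node: CPU %d\n", find_busiest_cpu(load, local));
        return 0;
    }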
A simple (three-line) patch that did nothing but add the within-node restriction yielded most of the benefits of the full NUMA scheduler; indeed, it performed better on some benchmarks. Real-world loads, however, will require a scheduler that can distribute processes evenly across nodes. It is occasionally even necessary to move processes to a slower node: getting plenty of CPU time on a lightly-loaded node can give better performance than waiting in the run queue on a heavily-loaded one. So a bit of complexity had to be added back into the new scheduler to complete the job.
The 2.5.59 scheduler distributes processes across NUMA nodes in two places. The first is in the exec() system call. A process which calls exec() is very simple to move, since almost all of its context, including memory, is being thrown away. For many loads, proper balancing at exec() time is enough to get good performance.
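A minimal sketch of the exec()-time placement decision, assuming an invented per-node load metric (this is the shape of the idea, not the 2.5.59 code):

    #include <stdio.h>

    #define NR_NODES 4

    /* Pick the node with the fewest runnable processes; an exec()ing
     * process would simply be started there. */
    static int least_loaded_node(const int nr_running[NR_NODES])
    {
        int best = 0;

        for (int node = 1; node < NR_NODES; node++)
            if (nr_running[node] < nr_running[best])
                best = node;
        return best;
    }

    int main(void)
    {
        int load[NR_NODES] = { 12, 3, 7, 5 };

        /* The freshly exec()ed image would land on node 1. */
        printf("exec() lands on node %d\n", least_loaded_node(load));
        return 0;
    }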
Some loads, however, will tend to pile up processes within a single node. Any process which forks many times, for example, will find itself competing with all of its children for the same node's resources (unless, of course, those children call exec() and are moved to a new node). To address this problem, the new NUMA scheduler will occasionally look for a large load imbalance between nodes and, if one is found, move processes to balance things out. This rebalancing happens once for every ten or one hundred intra-node rebalancings, depending on the architecture.
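The ratio between the two kinds of balancing can be pictured with a simple counter, roughly as in the sketch below; the function names and the value of the ratio are assumptions, not the kernel's:

    #include <stdio.h>

    /* In reality the ratio is architecture-dependent; 10 is just a
     * plausible example value. */
    #define NODE_REBALANCE_RATIO 10

    /* Stubs standing in for the two balancing passes. */
    static void balance_within_node(int cpu)  { printf("cpu %d: intra-node balance\n", cpu); }
    static void balance_across_nodes(int cpu) { printf("cpu %d: cross-node balance\n", cpu); }

    /* The cheap intra-node pass runs on every rebalance; the expensive
     * cross-node pass only once every NODE_REBALANCE_RATIO calls. */
    static void rebalance(int cpu)
    {
        static unsigned int count;

        if (++count % NODE_REBALANCE_RATIO == 0)
            balance_across_nodes(cpu);
        balance_within_node(cpu);
    }

    int main(void)
    {
        for (int i = 0; i < 25; i++)
            rebalance(0);    /* the cross-node pass fires twice */
        return 0;
    }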
The scheduler has seen continued tweaking since 2.5.59 came out. The most significant change, perhaps, is to move the explicit load balancing out of the main scheduler code (where it could get called many times per second on an idle processor) and restrict it to the scheduler's "timer tick" routine. That change allows more exact control over when rebalancing happens. A recent patch from Ingo Molnar performs fairly frequent rebalancing (intra-node every 1ms, global every 2ms) when the current processor is idle; if the processor is busy, rebalancing only happens every 200ms (local) and 400ms (global).
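A toy model of that interval selection might look like the following; the millisecond values are the ones quoted above, while the function names and the one-tick-per-millisecond assumption are invented for illustration:

    #include <stdbool.h>
    #include <stdio.h>

    /* Intervals, in milliseconds, matching the figures quoted above. */
    #define IDLE_LOCAL_MS    1
    #define IDLE_GLOBAL_MS   2
    #define BUSY_LOCAL_MS  200
    #define BUSY_GLOBAL_MS 400

    /* Called once per (assumed 1ms) timer tick: decide whether a local
     * or global rebalance is due, using much shorter intervals when the
     * CPU is idle. */
    static void scheduler_tick(unsigned long ms, bool cpu_idle)
    {
        unsigned long local  = cpu_idle ? IDLE_LOCAL_MS  : BUSY_LOCAL_MS;
        unsigned long global = cpu_idle ? IDLE_GLOBAL_MS : BUSY_GLOBAL_MS;

        if (ms % global == 0)
            printf("%4lums: cross-node rebalance\n", ms);
        else if (ms % local == 0)
            printf("%4lums: intra-node rebalance\n", ms);
    }

    int main(void)
    {
        /* An idle CPU rebalances every tick or two... */
        for (unsigned long ms = 1; ms <= 6; ms++)
            scheduler_tick(ms, true);
        /* ...while a busy one does so only a few times per second. */
        for (unsigned long ms = 1; ms <= 1000; ms++)
            scheduler_tick(ms, false);
        return 0;
    }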
Linus raised an interesting point when he merged the NUMA scheduler: can this scheduler handle hyperthreading as well? Hyperthreaded processors implement two (or more) virtual CPUs on the same physical processor; one processor can be running while the other waits for memory access. Hyperthreading can certainly be seen as a sort of NUMA system, since the sibling processors share a cache and thus have faster access to memory that either one has accessed recently. So the same algorithm should really work in this case.
Treating hyperthreaded systems as NUMA systems has a certain conceptual elegance, but it may not be the way the Linux kernel goes in the end. The most recent hyperthreading patch from Ingo Molnar takes a different approach: rather than mess with "rebalancing" processes across the same physical processor, why not just use the same run queue for both? Sibling processors on a hyperthreaded core are truly equivalent; it does not matter which process runs on which virtual processor as long as they are all busy. So NUMA and hyperthreading may stay as distinct cases for now.
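A toy model of the shared-runqueue idea, assuming two siblings per physical package (the names and topology are invented; this is not Ingo Molnar's actual patch):

    #include <stdio.h>

    /* Toy topology: 8 virtual CPUs, 2 siblings per physical package. */
    #define NR_VIRTUAL_CPUS 8
    #define SIBLINGS        2

    struct runqueue { int nr_running; };

    static struct runqueue runqueues[NR_VIRTUAL_CPUS / SIBLINGS];

    /* Both siblings of a package map to the same runqueue, so there is
     * nothing to "rebalance" between them. */
    static struct runqueue *cpu_rq(int vcpu)
    {
        return &runqueues[vcpu / SIBLINGS];
    }

    int main(void)
    {
        printf("vcpu 2 and vcpu 3 share a runqueue: %s\n",
               cpu_rq(2) == cpu_rq(3) ? "yes" : "no");
        printf("vcpu 1 and vcpu 2 share a runqueue: %s\n",
               cpu_rq(1) == cpu_rq(2) ? "yes" : "no");
        return 0;
    }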
Comments

three-line patch?
Posted Jan 23, 2003 2:54 UTC (Thu) by cpeterso (guest, #305)

I am a little confused about the origin of the "three-line patch". Was that patch taken from the ongoing NUMA scheduler work? Or was it written by an "outsider" who demonstrated the same benefits as the "big" NUMA scheduler, but with far fewer lines of code?

three-line patch?
Posted Jan 23, 2003 3:02 UTC (Thu) by corbet (editor, #1)

Sorry, I should have been more explicit there. The patch was by Martin Bligh, who has been working with the NUMA scheduler for a while. Here's a copy with his explanation of what's going on.

three-line patch?
Posted Jan 23, 2003 21:27 UTC (Thu) by mbligh (subscriber, #7720)

I wrote the "3-line" patch to show how non-invasive a NUMA scheduler could be. It also changes the design of the NUMA scheduler to become a set of "concentric circles" as far as load balancing is concerned - I think this is a better design.

IMO, in the post-freeze environment, the only way to proceed is small, well contained patches that obviously don't break anyone else ... M.

The new NUMA scheduler
Posted Jan 29, 2003 17:12 UTC (Wed) by hbaum (guest, #7717)

The NUMA scheduler patches that have been around since October consisted of two pieces: balance-on-exec and NUMA-aware load balancing. The three-line patch that Martin wrote replaced the NUMA-aware load balancing, but still relied on the balance-on-exec patch being in place. So it is not accurate to say that a simple three-line patch provided performance equivalent to the original NUMA scheduler patch; rather, one half of the original patch was replaced by a three-line patch, and the balance-on-exec portion was still required to get equivalent performance.

Regarding something like this going in after the feature freeze: with Martin's contribution, the NUMA scheduler changes have no impact on non-NUMA systems. The balance-on-exec portion of the patch was always a NUMA-only piece of code, and Martin's contribution removed the need for an invasive change to the load balancer.
