NUMA scheduling progress
NUMA balancing was a topic of fierce debate through much of 2012; that discussion culminated with the merging of Mel Gorman's NUMA balancing infrastructure patch set into the 3.8 kernel. Those patches provided the basic structure upon which a NUMA balancing solution could be built, but did not attempt to solve the problem in a comprehensive way. Since then, one might be forgiven for thinking that the developers involved have lost interest; not much NUMA-related code has found its way into the mainline. But, as can be seen in Mel's basic scheduler support for NUMA balancing patch set, which weighs in at 63 individual changesets, quite a bit of work has been happening in this area.
The NUMA balancing problem is one of overcoming a fundamental characteristic of NUMA hardware: accesses to memory local to any given NUMA node will be much faster than accesses to remote memory. If processes and their memory live on the same NUMA node, things will run quickly; otherwise performance will be significantly worse. NUMA performance on Linux has long been deemed to be worse than it should be, so there is clearly some room for improvement in this area.
The kernel has, for years, attempted to allocate memory on the same node that the allocating process is running on, but that turns out to be inadequate. Over time, it seems to be inevitable that processes and memory will move around the system, leading to poor performance. Recent efforts to improve the situation have thus been focused on cleaning things up as the system runs, moving memory to the nodes that are actually using it and moving processes to be closer to their memory. This particular patch set, as suggested by the "scheduler support" name, is mainly concerned with the latter task — ensuring that the scheduler runs processes on the nodes that host the bulk of their memory.
Locating a process's memory
If the scheduler is to keep processes near their memory, it must, clearly, have a sense for where that memory is placed. This information is not as straightforward to come by as one might expect, especially if one is concerned mostly with the memory that is actually being used, as opposed to memory which has been allocated but is not in active use. The kernel needs to continually observe a process's memory access patterns to be able to optimize its placement in the system.
One of the core changes in the current NUMA patch set is to enable this observation, using the virtual memory mechanism. The scheduler will periodically scan through each process's address space, revoking all access permissions to the pages that are currently resident in RAM. The next time the affected process tries to access that memory, a page fault will result. The scheduler will trap that fault and restore access to the page in question; it will also increment an access counter in a per-process array indexed by the NUMA node number. Over time, these counts (which are maintained as a sort of running average) will provide a picture of where the process's memory accesses are going. At that point, it is a simple matter of looping through the array to find the node with the most accesses; that becomes the "preferred node" for the process.
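To make that concrete, here is a minimal user-space sketch of how such per-node fault statistics could be kept; the function and variable names are invented for the example and do not correspond to the patch set's actual code.

```c
/*
 * Sketch only: each hinting fault bumps a counter for the node holding
 * the page, the counters are periodically halved so they behave like a
 * running average, and the node with the highest count becomes the
 * preferred node.
 */
#include <stdio.h>

#define MAX_NUMA_NODES 4

static unsigned long numa_faults[MAX_NUMA_NODES];

/* Called from the (simulated) hinting-fault handler. */
static void record_numa_fault(int node)
{
	numa_faults[node]++;
}

/* Periodic decay: keeps the counts a running average, not a lifetime total. */
static void decay_numa_faults(void)
{
	for (int n = 0; n < MAX_NUMA_NODES; n++)
		numa_faults[n] /= 2;
}

/* Pick the node that has seen the most recent faults. */
static int preferred_node(void)
{
	int best = 0;

	for (int n = 1; n < MAX_NUMA_NODES; n++)
		if (numa_faults[n] > numa_faults[best])
			best = n;
	return best;
}

int main(void)
{
	/* Pretend this process faulted mostly on pages resident on node 2. */
	for (int i = 0; i < 100; i++)
		record_numa_fault(2);
	for (int i = 0; i < 30; i++)
		record_numa_fault(0);

	decay_numa_faults();
	printf("preferred node: %d\n", preferred_node());
	return 0;
}
```

The periodic halving is what turns raw counts into a running average, so old access patterns fade away rather than dominating the placement decision forever.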
If the process is running on a node other than the preferred one, the scheduler will attempt to migrate it to the node where its memory lives. That should result in more node-local memory accesses and, thus, better performance. Of course, the picture is a bit more complicated than this, starting with the fact that the migration of the process could fail for any of several reasons. So one of the patches in the series will periodically retry the attempt to migrate a process to its preferred node if things didn't work out the first time. Even if the initial migration attempt fails, the process should eventually end up on the preferred node.
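A rough sketch of the retry idea, again with hypothetical names rather than the patch set's real functions, could look like this:

```c
/*
 * Sketch: if an earlier attempt to move the task to its preferred node
 * failed, a later scheduler tick simply tries again rather than giving up.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {
	int cur_node;        /* node the task is running on      */
	int preferred_node;  /* node holding most of its memory  */
	int retry_ticks;     /* countdown until the next attempt */
};

/* Stand-in for the real migration path; it may fail, e.g. if the
 * destination node is busy. Here the first attempt always fails. */
static bool try_migrate(struct task *t, int node)
{
	static int attempts;

	if (attempts++ == 0)
		return false;
	t->cur_node = node;
	return true;
}

/* Called from a (simulated) periodic scheduler tick. */
static void numa_tick(struct task *t)
{
	if (t->cur_node == t->preferred_node)
		return;
	if (--t->retry_ticks > 0)
		return;
	t->retry_ticks = 10;     /* back off between attempts */
	try_migrate(t, t->preferred_node);
}

int main(void)
{
	struct task t = { .cur_node = 0, .preferred_node = 2, .retry_ticks = 1 };

	for (int tick = 0; tick < 30; tick++)
		numa_tick(&t);
	printf("task ended up on node %d\n", t.cur_node);
	return 0;
}
```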
This process-follows-memory migration can also run counter to another of the scheduler's core goals: distributing the load across the system to make the best use of the available CPU time. If a process that has been migrated closer to its memory cannot actually run because the destination NUMA node is overloaded, the goal of improved performance is unlikely to be achieved. So it may be necessary to migrate other processes off the overloaded node; that complicates the picture somewhat. The patch set does not try to address the full problem, but it does take on one specific subproblem: cases where swapping two processes between two CPUs will make both processes run more efficiently. Even that task is fraught with hazards: moving two processes at once is more complicated than migrating a single process. To make it work with a minimum of fuss, the patch set adds a variant of stop_machine() that effectively halts work on the two processors involved while the exchange is made.
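The decision of when such a swap pays off can be sketched as below; this only illustrates the heuristic described above, with invented names, and leaves out the stop_machine()-style mechanics of actually performing the exchange.

```c
/*
 * Sketch: two tasks are worth swapping if each has recorded more
 * faults on the node the other currently occupies than on its own.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_NUMA_NODES 4

struct task {
	int node;                              /* node the task runs on */
	unsigned long faults[MAX_NUMA_NODES];  /* per-node fault counts */
};

static bool swap_improves_locality(const struct task *a, const struct task *b)
{
	return a->faults[b->node] > a->faults[a->node] &&
	       b->faults[a->node] > b->faults[b->node];
}

int main(void)
{
	struct task a = { .node = 0, .faults = { [0] = 10, [1] = 90 } };
	struct task b = { .node = 1, .faults = { [0] = 80, [1] =  5 } };

	if (swap_improves_locality(&a, &b))
		printf("swapping the two tasks should help both\n");
	return 0;
}
```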
It also became necessary to avoid tracking NUMA access information for shared executable pages. Including those pages will tend to pull processes toward the nodes where pages holding shared library code can be found. But there is relatively little value in migrating processes near their executable pages, as it turns out, because much of the needed data is in the CPU caches much of the time anyway. So patch 35 in the series avoids tracking faults from those pages.
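The kind of filter involved can be sketched as follows. The VM_* names mirror the kernel's but are defined locally here, and the exact condition used by the patch is not reproduced; read-only, executable, file-backed mappings stand in for "shared library text".

```c
#include <stdbool.h>
#include <stdio.h>

#define VM_READ  0x1UL
#define VM_WRITE 0x2UL
#define VM_EXEC  0x4UL

struct vma {
	unsigned long flags;
	bool file_backed;
};

/* Skip mappings whose faults should not feed the NUMA statistics. */
static bool skip_for_numa_scan(const struct vma *vma)
{
	return vma->file_backed &&
	       (vma->flags & (VM_READ | VM_WRITE | VM_EXEC)) ==
	       (VM_READ | VM_EXEC);
}

int main(void)
{
	struct vma lib_text = { .flags = VM_READ | VM_EXEC,  .file_backed = true  };
	struct vma heap     = { .flags = VM_READ | VM_WRITE, .file_backed = false };

	printf("library text skipped: %d\n", skip_for_numa_scan(&lib_text));
	printf("anonymous heap skipped: %d\n", skip_for_numa_scan(&heap));
	return 0;
}
```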
NUMA groups
As described thus far, the patch set adds a basic mechanism for putting a process near its (exclusively owned) pages. But the concept of "a process's memory" is not necessarily simple either; processes often share pages with each other. That is especially true of threaded applications where all of the threads run in the same address space and have access to the same memory. Threaded applications written with NUMA performance in mind will partition their memory accesses, but they can still end up sharing pages, especially if a feature like transparent huge pages is in use. In other cases, entirely separate processes may still share a large memory area; neither "owns" it, but both can benefit by being placed close to it. Either way, there can be performance benefits from placing such processes on the same node whenever possible.
To improve performance in the shared memory case, the patch set adds a mechanism that tries to detect when processes are sharing memory with each other — and, importantly, when they are accessing the same pages. Doing so requires tracking which processes are accessing each page. Andrea Arcangeli's AutoNUMA patch set, one of the contenders in the 2012 discussions, added an extensive tracking infrastructure for this purpose, but the resulting memory overhead was deemed to be too high. So, as is so often the case, this patch set shoehorns the needed information into struct page instead.
In particular, the data is put into the flags field, which is already rather more complex than a simple set of flags. The ID of the NUMA node containing each page is stored there in current kernels; this patch set changes that to the CPU ID, and adds the process ID of the last process to access the page as well. Needless to say, the full process ID does not fit into a subfield of flags, so some compromises need to be made. In this case, only the bottom eight bits of the process ID are used, with the understanding that some collisions will be unavoidable. Whenever the system handles a page fault, this "cpupid" number is stored in the relevant page structure.
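The packing itself is straightforward bit manipulation; a sketch, with arbitrary field widths (the real layout depends on the kernel configuration, and the combined value lives in page->flags), might look like:

```c
/*
 * Sketch of packing a "cpupid": the CPU that last touched the page
 * plus the low eight bits of the last accessor's PID.
 */
#include <stdio.h>

#define LASTPID_BITS 8
#define LASTPID_MASK ((1u << LASTPID_BITS) - 1)
#define CPU_BITS     12
#define CPU_MASK     ((1u << CPU_BITS) - 1)

static unsigned int make_cpupid(unsigned int cpu, unsigned int pid)
{
	return ((cpu & CPU_MASK) << LASTPID_BITS) | (pid & LASTPID_MASK);
}

static unsigned int cpupid_to_cpu(unsigned int cpupid)
{
	return (cpupid >> LASTPID_BITS) & CPU_MASK;
}

static unsigned int cpupid_to_pid8(unsigned int cpupid)
{
	return cpupid & LASTPID_MASK;
}

int main(void)
{
	unsigned int cpupid = make_cpupid(17, 4242);

	/* Prints "cpu 17, pid bits 146": only the low byte of 4242 survives. */
	printf("cpu %u, pid bits %u\n",
	       cpupid_to_cpu(cpupid), cpupid_to_pid8(cpupid));
	return 0;
}
```

Because only the low byte of the PID is kept, two processes whose PIDs share that byte are indistinguishable here, which is exactly the collision the article mentions.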
With that infrastructure in place, the scheduler can respond to a NUMA scanning fault by checking whether the process currently accessing the page is the same as the last process to do so. If not, the scheduler will consider putting the two processes into a "NUMA group." But first, it must find the other process that accessed the page — not a trivial matter, given that the full process ID for that process is not available. This problem is handled by looking at whichever process is currently running on the CPU that last accessed the page; if the relevant part of the process ID matches, the scheduler assumes (after applying a few sanity tests) that it is the same process as the one that accessed the page. If so, the two processes are placed into a NUMA group to indicate that they are sharing pages; if both processes are already in groups, those two groups will be coalesced into one larger group.
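A much-simplified sketch of that decision, with invented types and none of the kernel's locking or statistics, might look like this; the case of merging two existing groups is omitted.

```c
/*
 * Sketch: on a hinting fault, compare the low PID bits stored with the
 * page against the task now running on the recorded CPU; if they
 * match, the faulting task joins that task's group.
 */
#include <stdio.h>

#define LASTPID_MASK 0xff

struct numa_group { int id; };

struct task {
	int pid;
	struct numa_group *group;
};

static struct numa_group groups[16];
static int next_group;

/* Give a task a group of its own if it does not have one yet. */
static struct numa_group *get_group(struct task *t)
{
	if (!t->group) {
		t->group = &groups[next_group];
		t->group->id = next_group++;
	}
	return t->group;
}

/*
 * "curr" just faulted on a page whose stored PID bits are "last_pid8";
 * "last" is whatever task is running on the CPU recorded in the page.
 */
static void numa_group_fault(struct task *curr, struct task *last,
			     int last_pid8)
{
	if ((last->pid & LASTPID_MASK) != last_pid8)
		return;                  /* stale entry or PID collision */
	curr->group = get_group(last);   /* share the last accessor's group */
}

int main(void)
{
	struct task t1 = { .pid = 1234 };
	struct task t2 = { .pid = 5678 };

	numa_group_fault(&t2, &t1, t1.pid & LASTPID_MASK);
	printf("same group: %d\n", t1.group == t2.group);
	return 0;
}
```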
Once again, shared library pages threw a wrench into the works. Since every process shares access to, for example, the C library, all processes in the system tended to get pulled together into a single NUMA group. Deeming that to be less than fully useful, Peter Zijlstra tossed in a patch avoiding group tracking for read-only pages.
Once a process is part of a NUMA group, page faults for shared pages will be tracked at the group level. That gives the scheduler two somewhat independent sets of statistics from which to make decisions: accesses to private pages, and accesses to shared pages. If both sets of numbers agree on the preferred node, the decision is easy. If, instead, there is a disagreement, the preferred node will be chosen by iterating through the arrays and picking the node where the sum of private and shared faults is the highest. That sum is distorted, though, by weighting shared faults a bit more heavily in an attempt to push the system toward decisions that put processes with shared pages on the same node.
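The final selection then reduces to a weighted sum over the two arrays; a sketch, with a made-up weighting factor rather than the patch set's actual tuning, is below.

```c
/* Sketch of choosing a preferred node from private and shared faults. */
#include <stdio.h>

#define MAX_NUMA_NODES    4
#define SHARED_WEIGHT_NUM 5   /* shared faults count a bit more ...   */
#define SHARED_WEIGHT_DEN 4   /* ... than private ones (illustrative) */

static int pick_preferred_node(const unsigned long priv[MAX_NUMA_NODES],
			       const unsigned long shared[MAX_NUMA_NODES])
{
	int best = 0;
	unsigned long best_score = 0;

	for (int n = 0; n < MAX_NUMA_NODES; n++) {
		unsigned long score = priv[n] +
			shared[n] * SHARED_WEIGHT_NUM / SHARED_WEIGHT_DEN;

		if (score > best_score) {
			best_score = score;
			best = n;
		}
	}
	return best;
}

int main(void)
{
	unsigned long priv[MAX_NUMA_NODES]   = { 40, 10, 0, 0 };
	unsigned long shared[MAX_NUMA_NODES] = {  0, 35, 0, 0 };

	/* Node 1 wins once shared faults are weighted up: 10 + 43 > 40. */
	printf("preferred node: %d\n", pick_preferred_node(priv, shared));
	return 0;
}
```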
There is a lot more than this to the patch set, including a number of algorithmic tweaks and fixes, but the above text describes the most significant themes. Mel has included quite a bit of benchmark data with the patch set; there is a lot of information there, but the bottom line, as he described it, was: "some reports indicate that the performance is getting close to manual bindings for some workloads but your mileage will vary." A number of details are yet to be addressed; in particular, support for CPU hotplugging is not yet there. Even so, Mel thinks that the code is getting ready for movement into the "tip" tree, and, thus, linux-next. That suggests it could be ready for merging in the 3.13 or (perhaps more likely) 3.14 development cycle.
Comments

Posted Oct 1, 2013 23:27 UTC (Tue) by Karellen (subscriber, #67644)

For example:

If a process calls clone() with CLONE_VM (e.g. starts a new thread), then the new process/thread is initially put in the same NUMA group as its parent.

If a process calls clone() without CLONE_VM (e.g. fork()s), then NUMA grouping becomes a "maybe" and could change in the future based on the number of copy-on-write page faults it takes - which were going to happen anyway.

If a process calls exec(), it is taken out of its current NUMA group and starts a new one.

If a process calls mmap() with MAP_SHARED, check for other mappings of the same memory block and, based on the size of the block mapped and the amount of other memory already allocated by the process, possibly migrate the process at that point to an existing NUMA group containing other processes with the same memory mapped. (And consider removing a process from a NUMA group on munmap().)

Why wouldn't/doesn't that work? (Or does it just not work *well enough*?)

Posted Oct 2, 2013 6:27 UTC (Wed) by dlang (guest, #313)

This means that even when everything is perfectly allocated to start with, over time it will wander.

Posted Oct 2, 2013 7:23 UTC (Wed) by Karellen (subscriber, #67644)

I'm in the dark about the current innards of the NUMA subsystem, so when the article said "To improve performance in the shared memory case, the patch set adds a mechanism that tries to detect when processes are sharing memory with each other — and, importantly, when they are accessing the same pages.", and then went on to talk about NUMA groups, I thought that meant that NUMA groups were the new mechanism and that the heuristics described were the only ones for assigning processes to groups.

That clears things up though. Thanks!

Posted Oct 2, 2013 10:40 UTC (Wed) by HIGHGuY (subscriber, #62277)

So this makes me wonder: how much of that is implemented and, if so, how does one decide which strategy to choose for the current process? To my best recollection, some of the earlier patch sets tried to migrate memory rather than processes...

Posted Oct 2, 2013 13:38 UTC (Wed) by dlang (guest, #313)

If it was overly aggressive in either aspect, it would hurt performance significantly (there is overhead in moving memory or processes), but it looks like taking both approaches with a less aggressive stance is working well in their tests.

Posted Oct 2, 2013 23:28 UTC (Wed) by ssmith32 (subscriber, #72404)

AFAIK, this can still lead to swapping:

http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-t...

Not just in MySQL, but in MongoDB. Granted, it would be dumb to hurt the normal case to make this case work, since anyone setting up a server will likely be going over and tweaking things like numactl anyway; but if I'm at home on my desktop, I don't want to bother with that. If it could accommodate both, that would be great!
