
AutoNUMA: the other approach to NUMA scheduling

By Jonathan Corbet
March 27, 2012
Last week's Kernel Page included an article on Peter Zijlstra's NUMA scheduling patch set. As it happens, Peter is not the only developer working in this area; Andrea Arcangeli has posted a NUMA scheduling patch set of his own called AutoNUMA. Andrea's goal is the same - keep processes and their memory together on the same NUMA node - but the approach taken to get there is quite different. These two patch sets embody a disagreement on how the problem should be solved that could take some time to work out.

Peter's patch set works by assigning a "home node" to each process, then trying to concentrate the process and its memory on that node. Andrea's patch lacks the concept of home nodes; he thinks it is an idea that will not work well for programs that don't fit into a single node unless developers add code to use Peter's new system calls. Instead, Andrea would like NUMA scheduling to "just work" in the same way that transparent huge pages do. So his patch set seems to assume that resources will be spread out across the system; it then focuses on cleaning things up afterward. The key to the cleanup task is a bunch of statistics and a couple of new kernel threads.

The first of these threads is called knuma_scand. Its primary job is to scan through each process's address space, marking its in-RAM anonymous pages with a special set of bits that makes the pages look, to the hardware, like they are not present. If the process tries to access such a page, a page fault will result; the kernel will respond by marking the page "present" again so that the process can go about its business. But the kernel also notes the node that the page lives on and the node the accessing process was running on, recording any mismatches. For each process, the kernel maintains an array of counters tracking the nodes on which its recently-accessed pages were located. For pages, the information tracked is necessarily coarser; the kernel simply remembers the last node to access each page.
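The bookkeeping can be modeled in a few lines of C. This is a simplified sketch, not the patch's actual code; the names here (struct task_numa_stats, numa_hinting_fault, and so on) are invented for illustration:

    #define MAX_NUMNODES 8    /* illustrative */

    /* hypothetical per-process fault statistics */
    struct task_numa_stats {
        unsigned long faults_per_node[MAX_NUMNODES];
    };

    /*
     * Model of the accounting on a hinting fault: the process records
     * which node the faulting page lives on (fine-grained), while the
     * page itself only remembers the last node to touch it (coarse).
     */
    static void numa_hinting_fault(struct task_numa_stats *stats,
                                   int page_nid, int accessing_nid,
                                   int *page_last_nid)
    {
        /* the real handler would first mark the page present again */
        stats->faults_per_node[page_nid]++;
        *page_last_nid = accessing_nid;
    }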

When the time comes for the scheduler to make a decision, it passes over the per-process statistics to determine whether the target process would be better off if it were moved to another node. If the process seems to be accessing most of its pages remotely, and it is better suited to the remote node than the processes already running there, it will be migrated over. This code drew a strenuous objection from Peter, who does not like the idea of putting a big for-each-CPU loop into the middle of the scheduler's hot path. After some light resistance, Andrea agreed that this logic eventually needs to find a different home where it would run less often. For testing, though, he likes things the way they are, since it causes the scheduler to converge more quickly on its chosen solution.
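In rough terms, the per-process side of that decision might look like the sketch below, which simply picks the node holding the most recently-faulted pages; the function name is hypothetical, and the real code did considerably more work than this single pass (it looped over CPUs, which is precisely what drew the hot-path objection):

    /*
     * Pick the node where this process has taken the most hinting
     * faults; a result other than the current node suggests moving
     * the process there, subject to a comparison against whatever
     * is already running on the target node.
     */
    static int task_preferred_node(const unsigned long faults_per_node[],
                                   int nr_nodes, int current_nid)
    {
        int nid, best = current_nid;

        for (nid = 0; nid < nr_nodes; nid++)
            if (faults_per_node[nid] > faults_per_node[best])
                best = nid;
        return best;
    }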

Moving processes around will only help so much, though, if their memory is spread across multiple NUMA nodes. Getting the best performance out of the system clearly requires a mechanism to gather pages of memory onto the same node as well. In the AutoNUMA patch, the first non-local fault (in response to the special page marking described above) will cause that page's "last node ID" value to be set to the accessing node; the page will also be queued to be migrated to that node. A subsequent fault from a different node will cancel that migration, though; after the first fault, two faults in a row from the same node are required to cause the page to be queued for migration.
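That heuristic is compact enough to sketch directly; this hypothetical helper refines the per-page side of the earlier sketch, with invented names and a -1 sentinel standing in for "no value":

    /* per-page state; -1 means "none" */
    struct page_numa_info {
        int last_nid;     /* last node to fault on this page */
        int queue_nid;    /* node the page is queued to migrate to */
    };

    static void numa_migrate_hint(struct page_numa_info *p, int fault_nid)
    {
        if (p->last_nid == -1) {
            /* first hinting fault: queue the page toward this node */
            p->queue_nid = fault_nid;
        } else if (fault_nid != p->last_nid) {
            /* a fault from a different node cancels any pending
               migration... */
            p->queue_nid = -1;
        } else if (p->queue_nid == -1) {
            /* ...and only two consecutive faults from the same node
               queue it again */
            p->queue_nid = fault_nid;
        }
        p->last_nid = fault_nid;
    }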

Every NUMA node gets a new kernel thread (knuma_migrated) that is charged with passing over the lists of pages queued for migration and actually moving them to the target node. Migration is not unconditional - it depends, for example, on there being sufficient memory available on the destination node. But, most of the time, these migration threads should manage to pull pages toward the nodes where they are actually used.
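A rough model of one such thread's main loop appears below. Every function it calls is a stub invented for this sketch, standing in for real kernel machinery (queue handling, free-memory checks, and the migration itself):

    #include <stdbool.h>
    #include <stddef.h>

    /* stubs standing in for real kernel machinery */
    struct queued_page { int dummy; };
    static struct queued_page *dequeue_for_node(int node)
        { (void)node; return NULL; }
    static bool node_has_free_memory(int node)
        { (void)node; return true; }
    static void migrate_to_node(struct queued_page *p, int node)
        { (void)p; (void)node; }
    static void wait_for_queued_pages(void) { }

    /* rough model of one knuma_migrated thread's main loop */
    static void knuma_migrated_loop(int node)
    {
        for (;;) {
            struct queued_page *page = dequeue_for_node(node);

            if (!page) {
                wait_for_queued_pages();    /* queue drained */
                continue;
            }
            /* migration is conditional; free memory on the
               destination node is one of the checks */
            if (node_has_free_memory(node))
                migrate_to_node(page, node);
        }
    }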

Beyond the above-mentioned complaint about putting heavy computation into schedule(), Peter has found a number of things to dislike about this patch set. He doesn't like the worker threads, to begin with:

The problem I have with farming work out to other entities is that it's thereafter terribly hard to account it back to whomever caused the actual work. Suppose your kworker thread consumes a lot of cpu time -- this time is then obviously not available to your application -- but how do you find out what/who is causing this and cure it?

Andrea responds that the cost of these threads is small to the point that it cannot really be measured. It is a little harder to shrug off Peter's other complaint, though: that this patch set consumes a large amount of memory. The kernel maintains one struct page for every page of memory in the system. Since a typical system can have millions of pages, this structure must be kept as small as possible. But the AutoNUMA patch adds a list_head structure (for the migration queue) and two counters to each page structure. The end result can be a lot of memory lost to the AutoNUMA machinery.
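Some back-of-the-envelope arithmetic shows why this matters. The 24-byte figure used below is an assumption - a list_head is two 8-byte pointers on a 64-bit system, and the two counters are guessed at four bytes each - rather than a number taken from the patch:

    #include <stdio.h>

    int main(void)
    {
        /* assumed: 64-bit kernel, 4KiB pages, and 24 bytes per page
           of AutoNUMA data (a two-pointer list_head plus two
           four-byte counters) */
        const unsigned long page_size = 4096;
        const unsigned long per_page  = 2 * 8 + 2 * 4;
        const unsigned long ram       = 256UL << 30;    /* 256GiB box */
        const unsigned long nr_pages  = ram / page_size;

        printf("overhead: %lu MiB (%.2f%% of RAM)\n",
               nr_pages * per_page >> 20,
               100.0 * per_page / page_size);
        return 0;
    }

On those assumptions, the overhead works out to 1536MiB - about 0.6% of memory - on a 256GiB machine.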

The plan is to eventually move this information out of struct page; then, among other things, the kernel can avoid allocating it at all if AutoNUMA is not actually in use. But, for the NUMA case, that memory will still be consumed regardless of its location, and some users are unlikely to be happy even if others, as Andrea asserts, will be happy to give up a big chunk of memory if they get a 20% performance improvement in return. This looks like an argument that will not be settled in the near future, and, chances are, the memory impact of AutoNUMA will need to be reduced somehow. Perhaps, your editor naively suggests, knuma_migrated and its per-page list_head structure could be replaced by the "lazy migration" scheme used in Peter's patch.

NUMA scheduling is hard and doing it right requires significant expertise in both scheduling and memory management. So it seems like a good thing that the problem is receiving attention from some of the community's top scheduler and memory management developers. It may be that one or both of their solutions will be shown to be unworkable for some workloads to the point that it simply needs to be dropped. What may be more likely, though, is that these developers will eventually stop poking holes in each other's patches and, together, figure out how to combine the best aspects of each into a working solution that all can live with. What seems certain is that getting to that point will probably take some time.


AutoNUMA: the other approach to NUMA scheduling

Posted Mar 28, 2012 11:36 UTC (Wed) by Ben_P (guest, #74247) [Link] (5 responses)

Has anyone posted benchmarks of either proposed solution? Having read a small number of academic papers on NUMA scheduling, it looks like for every seemingly good solution there exist fairly common workloads which decimate performance.

Is it even possible for NUMA scheduling to do the right thing without introducing new system calls? As the programmer, I will forever know more about the locality requirements of my code than the scheduler does.

AutoNUMA: the other approach to NUMA scheduling

Posted Mar 28, 2012 12:56 UTC (Wed) by slashdot (guest, #22014) [Link] (1 responses)

If the workload is "static", then the kernel can in principle learn which threads read and write which pages with what throughput, along with their CPU behavior, and simply optimize.

Whether this learning and optimization can be done cheaply is an open question though.

If the workload is not static, the kernel cannot predict the future, so it can't optimize things automatically.

Thus, it will probably be necessary to have both syscalls (esp. to express thread memory affinity) and an automatic system.

AutoNUMA: the other approach to NUMA scheduling

Posted Mar 28, 2012 13:26 UTC (Wed) by Ben_P (guest, #74247) [Link]

From my limited understanding, even for "static" workloads most NUMA schedulers do better on some workloads and significantly worse on others. Thus the default naive behavior tends to win overall.

Benchmarks

Posted Mar 28, 2012 13:52 UTC (Wed) by corbet (editor, #1) [Link]

Andrea and Peter have both posted some benchmark results, but the testing so far is recognized by everybody involved as being insufficient.

AutoNUMA: the other approach to NUMA scheduling

Posted Mar 29, 2012 21:03 UTC (Thu) by riel (subscriber, #3142) [Link] (1 responses)

It depends entirely on what you are programming. If you are building an HPC application, you know what you will be accessing.

However, if you write a JVM, you have no idea what the application running inside the JVM will be accessing. It is entirely possible that the application will generate data (once) with one thread, and then access it hundreds of times with another thread.

For one situation, it looks obvious that Peter's solution has less overhead. For the other situation, it is not clear at all what the way to go would be. Maybe Andrea's code will automatically figure it out...

NUMA scheduling is a hard problem. Not because the solutions are difficult, but because nobody even knows exactly what all the problems look like.

approachES to NUMA scheduling?

Posted Apr 11, 2012 19:50 UTC (Wed) by gvy (guest, #11981) [Link]

> It depends entirely on what you are programming.
Exactly, and thus there might be just no real point in trying to get *the* approach implemented when there might be at least two of them feasible and readily available, either as separate kernels or (preferably, but probably less realistically) as a runtime knob.

As you wrote, those who are there for performance would rather invest some more time in libraries and apps, which tend to be custom or customizable, while those who won't or can't could at least pay with their RAM and cycles for some generic service.

AutoNUMA: the other approach to NUMA scheduling

Posted Mar 28, 2012 15:04 UTC (Wed) by hamjudo (guest, #363) [Link]

If it really leads to a 20% performance improvement, then on any chip with more than 6 cores we still come out ahead if we dedicate a whole core in each package to scheduling and tracking memory. Quite a win on the 8- and 12-core packages used in high-end servers.

AutoNUMA: the other approach to NUMA scheduling

Posted Mar 29, 2012 13:33 UTC (Thu) by Spudd86 (guest, #51683) [Link]

This sounds to me like it's less likely to cause a page to ping-pong from node to node, but I'm far from an expert in any of this :P (also, whether the difference is significant and worth the memory cost is not at all obvious, even if it exists)

AutoNUMA: the other approach to NUMA scheduling

Posted Apr 8, 2012 12:09 UTC (Sun) by kevinm (guest, #69913) [Link]

I wonder if some loads would benefit from burning additional memory in order to have some read-only pages present on multiple NUMA nodes at once.

AutoNUMA: the other approach to NUMA scheduling

Posted May 17, 2012 18:05 UTC (Thu) by Perf_k (guest, #84681) [Link]

I'm curious about the _real_ location of data being touched. Surely a small but significant portion of the working dataset is contained within some part of the CPU caches. If you have multiple tasks cooperating/competing for data within the same page, or better yet the same cachelines (think locks), it seems to make more sense to get the cooperating/competing (you choose the point of view) tasks running on the same CPU, or at least within the same node. For many real-world workloads the cache-to-cache latency between CPUs in different nodes is a bigger hit than bringing in the data from main memory. Of course this is workload dependent. Perhaps scheduler logic could identify cooperating tasks and gradually colocate them. A combination of memory migration and task colocation could provide the percentage increase in performance needed to justify a small bit of bloat in the main scheduling code path.

