Dueling memory-management performance regressions
The behavior in question relates to the intersection of transparent huge pages and NUMA policy. Ever since this commit from Aneesh Kumar in 2015, the kernel will, for memory areas where madvise(MADV_HUGEPAGE) has been called, attempt to allocate huge pages exclusively on the current NUMA node. It turns out that the kernel will try so hard that it will go into aggressive reclaim and compaction on that node, forcing out other pages, even if free memory exists on other nodes in the system. In essence, enabling transparent huge pages for a range of memory has become equivalent to binding that memory to a single NUMA node. The result, as observed by many, can be severe swap storms and a dramatic loss of performance.
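For reference, the call at the center of the dispute is an ordinary madvise() invocation on an anonymous mapping; a minimal sketch (with an arbitrary 1GB region) looks like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 30;	/* 1GB; size chosen arbitrarily */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }
        /* Ask for huge pages; on the kernels discussed here, this also
           has the side effect of binding the range to the local node. */
        if (madvise(buf, len, MADV_HUGEPAGE))
            perror("madvise(MADV_HUGEPAGE)");
        return EXIT_SUCCESS;
    }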
In an attempt to fix this problem, Andrea Arcangeli applied a patch in November 2018 that loosened the tight binding to the current node. But, as it turned out, some workloads want that binding behavior. Local huge pages will perform better than huge pages on a remote node; even local small pages tend to be better than remote huge pages. For some tasks, the performance penalty for using remote pages is high enough that it is worth going to great lengths — even enduring a swap storm at application startup — to avoid it. No such workload has been publicly posted, but the patch was reverted by David Rientjes in December after a lengthy discussion.
The problem is that far more users appear to be affected by the swap storms than by non-local huge pages; if nothing else, the former problem is far easier to notice than the latter. So a number of distributions have reverted the revert, causing their kernels to behave significantly differently from mainline kernels. The feeling that reverting Arcangeli's patch was a mistake appears to have grown over time, leading to the current attempt to reapply the patch and prioritize swap-storm avoidance over huge-page locality.
While most developers appear to support this change, not all do; Rientjes, in particular, is strongly in favor of retaining the current behavior.
Rientjes argued that the kernel does not currently provide an API that is adequate for all workloads to specify the behavior they need. He would rather see the addition of a prctl() call that would let an application say explicitly that its working set will not fit into a single node; after this call has been made, the kernel would allocate huge pages from remote nodes if need be. Various other calls could also be added to give applications (and administrators) more control over NUMA allocation policy for huge pages in particular.
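No such prctl() operation exists today, so any example is necessarily speculative; the PR_SET_THP_REMOTE name and value below are invented for illustration, but the shape of the call would presumably be along these lines:

    #include <stdio.h>
    #include <sys/prctl.h>

    /* HYPOTHETICAL: this option does not exist in any kernel; the name
       and value are placeholders for the interface Rientjes described. */
    #define PR_SET_THP_REMOTE 1000

    int main(void)
    {
        /* Declare that the working set will not fit on a single NUMA
           node, so remote huge pages are acceptable. */
        if (prctl(PR_SET_THP_REMOTE, 1, 0, 0, 0))
            perror("prctl(PR_SET_THP_REMOTE)");
        return 0;
    }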
There is little disagreement over whether the API should be improved, at least in principle. But, as Andrew Morton pointed out: "Implementing a new API doesn't help existing userspace which is hurting from the problem which this patch addresses". He, too, seems to believe that it would be better to address the swap-storm issue now, and then work later to address the needs of applications that absolutely cannot live with remote huge-page allocations.
Rientjes is adamant that the current semantics should be preserved, though. He has some thoughts on how the allocator could be changed to improve its behavior, mostly focused on avoiding aggressive reclaim in situations where it is unlikely to help. Michal Hocko worried that the proposed changes would make transparent huge-page allocation less effective in general, and pointed out that tweaking the allocator leaves the core problem unaddressed.
The discussion has also not been helped by the fact that Rientjes has not posted an example of a workload that suffers with Arcangeli's patch applied. The closest he came ("induce node local fragmentation (easiest to do by injecting a kernel module), do MADV_HUGEPAGE over a large range, fault, and measure random access latency") proved to be somewhat unsatisfying.
Without a workload to test against, other developers cannot know whether their changes make things better or worse, or even how severe the problem actually is. Hocko, for example, believes that the NUMA balancing built into the kernel now should straighten out workloads that may have initially had huge pages allocated on remote nodes, but nobody can demonstrate whether that is true or not. This state of affairs has drawn complaints from Mel Gorman.
Gorman proposed an additional bit for the zone_reclaim_mode sysctl knob that might provide behavior closer to what Rientjes appears to need; that mode would be off by default but could be enabled for specific workloads. Rientjes has not responded to this proposal.
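The existing knob lives in /proc/sys/vm/zone_reclaim_mode, with bits 1, 2, and 4 defined; what value a new bit would take is unknown, so the bit used below is purely an assumption. A minimal sketch of flipping such a bit (root required) might be:

    #include <stdio.h>
    #include <stdlib.h>

    #define ZONE_RECLAIM_ON        1  /* existing: reclaim on allocation miss */
    #define ZONE_RECLAIM_WRITE     2  /* existing: write out dirty pages */
    #define ZONE_RECLAIM_SWAP      4  /* existing: swap pages during reclaim */
    #define ZONE_RECLAIM_LOCAL_THP 8  /* HYPOTHETICAL: stand-in for Gorman's
                                         proposed local-huge-page bit */

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");
        if (!f) {
            perror("zone_reclaim_mode");	/* requires root */
            return EXIT_FAILURE;
        }
        fprintf(f, "%d\n", ZONE_RECLAIM_LOCAL_THP);
        fclose(f);
        return EXIT_SUCCESS;
    }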
This discussion has, as numerous participants pointed out, gone around in circles for a while now. Other than Rientjes, no memory-management developers stepped up to defend the status quo; the consensus for applying Arcangeli's patch seems nearly complete. But there remains a real possibility that, should this patch make it to the mainline, complaints about performance regressions will cause it to be reverted — again — as a strict reading of the kernel's no-regressions policy would seem to require.
Everybody involved would prefer to avoid that course of events, thus the attempts to understand the problem and find a solution that works for everybody. But memory management is always a balancing act, even when full information about the workloads involved is available. In the absence of that information, developers are left groping for solutions in the dark, and achieving that balance becomes that much harder. In such situations, it may indeed make sense to apply a patch that fixes a known problem and is already carried by multiple distributions.
Posted Jun 17, 2019 11:21 UTC (Mon) by nivedita76 (subscriber, #121790)
This also shows the limitations of a strict no-regression policy: the original change obviously caused severe regressions, but because it sneaked into the kernel without anyone noticing at the time, Linus now thinks we should just be stuck with it unless we can fix David’s non-public use case.