
Proactive compaction for the kernel

April 21, 2020

This article was contributed by Nitin Gupta

Many applications benefit significantly from the use of huge pages. However, huge-page allocations often incur a high latency or even fail under fragmented memory conditions. Proactive compaction may provide an effective solution to these problems by doing memory compaction in the background. With my proposed proactive compaction implementation, typical huge-page allocation latencies are reduced by a factor of 70-80 while incurring minimal CPU overhead.

Memory compaction is a feature of the Linux kernel that makes larger, physically contiguous blocks of free memory available. Currently, the kernel uses an on-demand compaction scheme. Whenever a per-node kcompactd thread is woken up, it compacts just enough memory to make available a single page of the needed size. Once a page of that size is made available, the thread goes back to sleep. This pattern of compaction often causes a high latency for higher-order allocations and hurts performance for workloads that need to burst-allocate a large number of huge pages.

Experiments in which compaction was manually triggered on a system with fragmented memory show that a 32GB system could be brought to a fairly compacted memory state within one second. Such data suggests that a proactive compaction scheme in the kernel could allow allocating a significant fraction of memory as huge pages while keeping allocation latencies low.

My recent patch provides an implementation of the proactive compaction approach. It exposes a single tunable: /sys/kernel/mm/compaction/proactiveness, which accepts values in the range [0, 100], with a default value of 20. This tunable determines how aggressively the kernel should compact memory in the background. The patch reuses the existing, per-NUMA-node kcompactd threads to do background compaction. Each of these threads periodically calculates a per-node fragmentation score (an indicator of the memory fragmentation) and compares it against a threshold, which is derived from the tunable.
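As a usage example, the tunable can be adjusted simply by writing a value to the file the patch exposes (for instance, echo 40 > /sys/kernel/mm/compaction/proactiveness as root). A minimal C equivalent, shown only for illustration, would be:

    /* Illustrative only: set the proactiveness tunable (as exposed by the
     * patch) to 40; must be run with sufficient privileges. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/kernel/mm/compaction/proactiveness", "w");

        if (!f) {
            perror("proactiveness");
            return 1;
        }
        fprintf(f, "40\n");
        return fclose(f) ? 1 : 0;
    }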

The per-node proactive (background) compaction process is started by its corresponding kcompactd thread when the node's fragmentation score exceeds the high threshold. The compaction process remains active until the node's score falls below the low threshold, or one of the back-off conditions (described below) is met.

Memory compaction involves bringing together "movable" pages at a zone's end to create larger, physically contiguous, free regions at a zone's beginning. If there are "unmovable" pages, like kernel allocations, spread across the physical address space, this process is less effective. Compaction has a non-trivial system-wide impact as pages belonging to different processes are moved around, which could also lead to latency spikes in unsuspecting applications. Thus, the kernel must not overdo compaction and should have a good back-off strategy.

The patch implements back-offs in the following scenarios:

  • When the current node's kswapd thread is active (to avoid interfering with the reclaim process).
  • When there is contention on the per-node lru_lock or per-zone lock (to avoid hurting non-background, latency-sensitive contexts).
  • When there is no progress (reduction in node's fragmentation score value) after a round of compaction.

If any of these back-off conditions is true, the proactive compaction process is deferred for a specific time interval.
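To make that control flow concrete, here is a minimal userspace sketch of one wakeup of a per-node background thread. The names, the fake score model, and the hard-coded thresholds (which correspond to the default proactiveness of 20, as derived in the next section) are illustrative only, not the patch's actual code:

    /*
     * Minimal userspace sketch of one wakeup of a per-node background
     * compaction thread; names, the score model, and the hard-coded
     * thresholds are purely illustrative, not the patch's code.
     */
    #include <stdio.h>

    #define LOW_THRESHOLD   80    /* Tl for proactiveness = 20 (see below) */
    #define HIGH_THRESHOLD  90    /* Th = min(Tl + 10, 100)                */

    static int node_score = 95;   /* pretend the node starts out fragmented */

    static int fragmentation_score(void)  { return node_score; }
    static void compact_one_round(void)   { node_score -= 4; /* fake progress */ }
    static int backoff_needed(void)       { return 0; /* kswapd active, lock contention, ... */ }

    int main(void)
    {
        if (fragmentation_score() <= HIGH_THRESHOLD)
            return 0;                    /* nothing to do; go back to sleep */

        while (fragmentation_score() > LOW_THRESHOLD) {
            int before = fragmentation_score();

            if (backoff_needed()) {
                puts("backing off: defer proactive compaction");
                break;
            }
            compact_one_round();
            if (fragmentation_score() >= before) {
                puts("no progress: defer proactive compaction");
                break;
            }
            printf("score now %d\n", fragmentation_score());
        }
        return 0;
    }

In the real patch the equivalent checks are, of course, made against live per-node statistics, and a back-off defers further proactive work for a time interval rather than simply ending the loop.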

Per-node fragmentation score and threshold calculation

As noted earlier, this proactive compaction scheme is controlled by a single tunable called proactiveness. All required values like per-node fragmentation score and thresholds are derived from this tunable.

A node's score is in the range [0, 100] and is defined as the sum of all the node's zone scores, where a zone's score (Sz) is defined as:

    Sz = (Nz / Nn) * extfrag(zone, HUGETLB_PAGE_ORDER)
where:
  • Nz is the total number of pages in the zone.
  • Nn is the total number of pages in the zone's parent node.
  • extfrag(zone, HUGETLB_PAGE_ORDER) is the external fragmentation with respect to the huge-page order in this zone.

In general, external fragmentation with respect to any order is calculated as:

    extfrag(zone, order) =  ((Fz - Hz) / Fz) * 100
where:
  • Fz is the total number of free pages in the zone.
  • Hz is the number of free pages available in blocks of size >= 2^order.

This per-zone score value is in the range [0, 100], and is defined as 0 when Fz = 0. The reason for calculating the per-zone score this way is to avoid wasting time trying to compact special zones like ZONE_DMA32 and focus on zones like ZONE_NORMAL, which manage most of the memory (Nz ≈ Nn). For smaller zones (Nz << Nn), the score tends to 0, and thus can never cross the low threshold value (defined below).
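To make the arithmetic concrete, the score calculation can be written out as a small standalone C program; the helper names, zone sizes, and free-page counts below are all made up for illustration:

    /* Userspace sketch of the per-zone and per-node score arithmetic;
     * all names and numbers are illustrative. */
    #include <stdio.h>

    /* extfrag(zone, order): Fz = free pages in the zone, Hz = free pages
     * sitting in blocks of at least 2^order pages; defined as 0 when Fz == 0. */
    static long extfrag(long Fz, long Hz)
    {
        return Fz ? ((Fz - Hz) * 100) / Fz : 0;
    }

    /* A zone's contribution to its node's score: extfrag weighted by the
     * zone's share of the node's pages (Nz / Nn). */
    static long zone_score(long Nz, long Nn, long Fz, long Hz)
    {
        return (Nz * extfrag(Fz, Hz)) / Nn;
    }

    int main(void)
    {
        long Nn = 16L * 1024 * 1024;   /* a 64GB node, counted in 4KB pages */

        /* A small, badly fragmented ZONE_DMA32 barely matters... */
        long dma32  = zone_score(1L * 1024 * 1024,  Nn, 200000,  10000);
        /* ...while a large ZONE_NORMAL dominates the node's score. */
        long normal = zone_score(15L * 1024 * 1024, Nn, 3000000, 600000);

        printf("zone scores: %ld + %ld = node score %ld\n",
               dma32, normal, dma32 + normal);
        return 0;
    }

With these made-up numbers, the small but heavily fragmented ZONE_DMA32 contributes only 5 points while ZONE_NORMAL contributes 75, giving a node score of 80; that weighting toward the large zones is exactly what the Nz/Nn factor is there for.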

The low (Tl) and high (Th) thresholds, against which these scores are compared, are defined as follows:

    Tl = 100 - proactiveness
    Th = min(10 + Tl, 100)

These thresholds are in the range [0, 100]. Once a zone's score exceeds Th, proactive compaction will be done until the score drops below Tl.
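For example, with the default proactiveness value of 20 (the setting used in the tests below), the thresholds work out to:

    Tl = 100 - 20 = 80
    Th = min(10 + 80, 100) = 90

So proactive compaction kicks in when a score climbs above 90 and continues until the score falls back below 80.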

Performance evaluation

For a true test of memory compaction efficacy, the first step is to fragment the system memory such that no huge pages are directly available for allocation (i.e., an initial fragmentation score of 100 for all NUMA nodes). With the system in such a fragmented memory state, any run of a huge-page-heavy workload would highlight the effects of the kernel's compaction policy. With on-demand compaction, a majority of huge-page allocations hit the direct-compaction path, leading to high allocation latencies. With proactive compaction, the expectation is to avoid these latencies except when the compaction process is unable to catch up with the huge-page allocation rate.

For evaluating proactive compaction, a high level of external fragmentation (as described above) was triggered by a user-space program that allocated almost all memory as huge pages and then freed 75% of the pages from each huge-page-aligned chunk. All the tests were done on an x86_64 system with 1TB of RAM and two NUMA nodes, using kernel version 5.6.0-rc4. The first huge-page-heavy workload was a test kernel driver that allocated as many huge pages as possible, measuring the latency of each allocation, until it hit an allocation failure. The driver was able to allocate almost 98% of free memory as huge pages with or without the patch. However, with the vanilla kernel (without the proactive compaction patch), the 95th-percentile latency was 33799µs, while with the patch (and the proactiveness tunable set to 20) it was 429µs, a reduction by a factor of 78.

To further measure the performance, a Java heap allocation test was used, which allocated 700GB of heap space, after fragmenting the memory as described above. This workload shows the effect of reduced allocation latencies in the run time of huge-page-heavy workloads. On the vanilla kernel, the run time was ~27 minutes, while with the patch (proactiveness=20), the run time came down to roughly four minutes. Tests with higher proactiveness values did not show any further speedups or slowdowns.

Some questions were raised by Vlastimil Babka, who has worked on proactive compaction along the way and helpfully reviewed some of these patches:

By your description it seems to be a one-time fragmentation event followed by a one-time stream of allocations? So kcompactd probably did the proactive work just once? That works as a smoke test, but what I guess will be more important is behavior under more complex workloads, where we should also check the vmstat compact_daemon* stats and possibly also kcompactd kthreads CPU utilizations.

The comment led to further investigation into the proactive kcompactd behavior. For the Java heap test, per-node fragmentation scores were recorded during the program's runtime, together with kcompactd threads' CPU usage. The data clearly shows that proactive compaction is active throughout the runtime of the test program and that a kcompactd thread takes 100% of one of the CPU cores while it is active.
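The compact_daemon* counters that Babka refers to are exported in /proc/vmstat; one simple way to watch them (shown here as a minimal C program, though any way of reading that file works) is something like:

    /* Print the kcompactd-related counters from /proc/vmstat; purely an
     * observation aid, not part of the patch. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f) {
            perror("/proc/vmstat");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "compact_", 8) == 0)   /* compact_daemon_* and friends */
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }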

[Proactive compaction performance graph]

The above plot shows changes in per-node fragmentation scores as a Java process tries to allocate 700GB of heap area on a two-node system with 512GB of memory per node. The proactiveness tunable is set to 20, so the low and high thresholds are 80 and 90, respectively. Before the test program was run, the system memory was fragmented such that no huge pages were available for direct allocation. The heap allocation started on Node-0, where the score rose as huge pages were used. Whenever the score got above 90, proactive compaction was triggered on the node, bringing the score back down to 80. Eventually, all memory on Node-0 was exhausted (around 90 seconds into the run), at which point the allocation started from Node-1, where the same compaction pattern repeated.

As described earlier, compaction is an expensive operation, and we don't want to pay the cost unless it is able to reduce the external fragmentation. To evaluate the back-off behavior, another test was created where unmovable pages were scattered throughout the physical address space. With the system in such a memory state, compaction cannot do much apart from warming the CPU. The back-off mechanism implemented in this patch correctly identifies such a situation by checking that a node's score does not go down after a round of compaction. When that happens, further rounds of proactive compaction are deferred for some time.

Future work

The patch has already been through some review cycles, which have helped refine many of its aspects. Some kernel developers recognize the need for more proactive compaction and, given these encouraging numbers, the patch will hopefully be accepted, probably after a few more iterations.

As a future direction, I am focusing on refining the per-node fragmentation score and threshold calculations, which drive the background proactive compaction process. Currently, the score calculation does not take into account per-node characteristics like differing TLB latencies, which would be important in heterogeneous systems. Future patches will likely add scaling factors to both score and threshold calculations to account for these per-node characteristics.



Proactive compaction for the kernel

Posted Apr 21, 2020 17:17 UTC (Tue) by josh (subscriber, #17465)

To what extent could the kernel attempt to keep its unmovable pages together, and separate from movable allocations?

Segregating movable pages

Posted Apr 21, 2020 17:59 UTC (Tue) by corbet (editor, #1)

The kernel already does that, it's what ZONE_MOVABLE is for. Of course, sometimes things that were thought to be movable (or short-lived, which is almost the same thing) turn out not to be...

Proactive compaction for the kernel

Posted Apr 22, 2020 0:11 UTC (Wed) by djbw (subscriber, #78104)

How much of this effect is from proactive operations vs arranging for more CPU to be deployed on the problem? For example, if you controlled for the CPU utilization for compaction being identical between proactive vs direct would the Java heap test still take 27 minutes in the direct case? Basically how to determine if the result is an argument to do proactive compaction vs more aggressive direct compaction?

Proactive compaction for the kernel

Posted Apr 22, 2020 10:47 UTC (Wed) by jtaylor (subscriber, #91739)

From the description and the current patch it seems the backoff is absolute only if no progress is made.

This would mean it is always compacting with the same frequency even if only very little progress is made because some process constantly fragments some pages.

Wouldn't it be better to have the backoff time depend on how much progress was made last time (or some exponential decaying value of the last iterations progresses)?
So if there is much to compact with the current workload it runs more often, but if there is not much to do it runs less often.

One does have to be careful that this does not negatively affect performance, so that distributions actually enable it.
Back when redhat backported anonymous transparent huge pages to their rhel7(?) a few years back, the compaction was terrible for performance; as a consequence of that, and of the many blog posts which said to turn off thp (instead of only the compaction), it seems many distributions still have thp turned off by default today.

Proactive compaction for the kernel

Posted Apr 22, 2020 12:00 UTC (Wed) by scientes (guest, #83068)

My hunch would be that compaction should never be done on-demand (but the background process should be influenced by demand). Perhaps if the VM addresses were assigned contiguously it would be possible to convert the allocations later to a huge page in a background process.

Proactive compaction for the kernel

Posted Apr 22, 2020 19:00 UTC (Wed) by jccleaver (guest, #127418)

> From the description and the current patch it seems the backoff is absolute only if no progress is made.
> This would mean it is always compacting with the same frequency even if only very little progress is made because some process constantly fragments some pages.
> Wouldn't it be better to have the backoff time depend on how much progress was made last time (or some exponential decaying value of the last iterations progresses)?
> So if there is much to compact with the current workload it runs more often, but if there is not much to do it runs less often.

Agreed. This definitely calls out for some sort of efficiency barrier where, if there was less than a 2% improvement (arbitrary number), the back-off is increased in most cases.

There should be a visible setting (say, back-off values >95) that causes all but the tiniest improvements to be considered worthwhile, though, especially if the system truly is mostly idle.

Unrelated side-note: Does compaction happen at hibernation time, or before dropping to deep sleep states? Seems like a good opportunity.

Priority of the kcompactd thread

Posted May 4, 2020 11:07 UTC (Mon) by polyp (guest, #53146)

How is the priority of the kcompactd thread handled? It seems to me that depending on what mode it is used in (on-demand or proactive) it would need to run using different thread priorities? Maybe same prio as the process waiting for it during on-demand compaction and low prio when doing background/proactive compaction?

Proactive compaction for the kernel

Posted Dec 12, 2023 9:25 UTC (Tue) by WuYang (guest, #166224)

> The patch implements back-offs in the following scenarios:
> When the current node's kswapd thread is active (to avoid interfering with the reclaim process).
> When there is contention on the per-node lru_lock or per-zone lock (to avoid hurting non-background, latency-sensitive contexts).
> When there is no progress (reduction in node's fragmentation score value) after a round of compaction.

Hello, regarding the back-off mechanism described above, can you share what you achieved with the patch?

