
The transparent huge page shrinker

By Jonathan Corbet
September 8, 2022
Huge pages are a mechanism implemented by the CPU that allows the management of memory in larger chunks. Use of huge pages can increase performance significantly, which is why the kernel has a "transparent huge page" mechanism to try to create them when possible. But a huge page will only be helpful if most of the memory contained within it is actually in use; otherwise it is just an expensive waste of memory. This patch set from Alexander Zhu implements a mechanism to detect underutilized huge pages and recover that wasted memory for other uses.

The base page size on most systems running Linux is 4,096 bytes, a number which has remained unchanged for many years even as the amount of memory installed in those systems has grown. By grouping (typically) 512 physically contiguous base pages into a huge page, it is possible to reduce the overhead of managing those pages. More importantly, though, huge pages take far fewer of the processor's scarce translation lookaside buffer (TLB) slots, which cache the results of virtual-to-physical address translations. TLB misses can be quite expensive, so expanding the amount of memory that can be covered by the TLB (as huge pages do) can improve performance significantly.

The downside of huge pages (as with larger page sizes in general) is internal fragmentation. If only part of a huge page is actually being used, the rest is wasted memory that cannot be used for any other purpose. Since such a page contains little useful memory, the hoped-for TLB-related performance improvements will not be realized. In the worst cases, it would clearly make sense to break a poorly utilized huge page back into base pages and only keep those that are clearly in use. The kernel's memory-management subsystem can break up huge pages to, among other things, facilitate reclaim, but it is not equipped to focus its attention specifically on underutilized huge pages.

Zhu's patch set aims to fill that gap in a few steps, the first being figuring out which of the huge pages in the system are being fully utilized — and which are not. To that end, a scanning function is run every second from a kernel workqueue; each run will look at up to 256 huge pages to determine how fully each is utilized. Only anonymous huge pages are scanned; this work doesn't address file-backed huge pages. The results can be read out of /sys/kernel/debug/thp_utilization in the form of a table like this:

    Utilized[0-50]: 1331 680884
    Utilized[51-101]: 9 3983
    Utilized[102-152]: 3 1187
    Utilized[153-203]: 0 0
    Utilized[204-255]: 2 539
    Utilized[256-306]: 5 1135
    Utilized[307-357]: 1 192
    Utilized[358-408]: 0 0
    Utilized[409-459]: 1 57
    Utilized[460-512]: 400 13
    Last Scan Time: 223.98
    Last Scan Duration: 70.65

This output (taken from the cover letter) is a histogram showing the number of huge pages containing a given number of utilized base pages. The first line, for example, shows the number of huge pages for which no more than 50 base pages are in active use. There are 1,331 of those pages, containing 680,884 unused base pages. There is a clear shape to the results: nearly all pages fall into one of the two extremes. As a general rule, a huge page is either fully utilized or almost entirely unused.

An important question to answer when interpreting this data is: how does the code determine which base pages within a huge page are actually used? The CPU and memory-management unit do not provide much help in this task; if the memory is mapped as a huge page, there is no per-base-page "accessed" bit to look at. Instead, Zhu's patch scans through the memory itself to see what is stored there. Any base pages that contain only zeroes are deemed to be unused, while those containing non-zero data are counted as being used. It is clearly not a perfect heuristic; a program could initialize pages with non-zero data then never touch them again. But it may be difficult to design a better one that doesn't involve actively breaking apart huge pages into base pages.

The results of this scanning identify a clear subset of the huge pages in a system that should perhaps be broken apart. In current kernels, though, splitting a zero-filled huge page will result in the creation of a lot of zero-filled base pages — and no immediate recovery of the unused memory. Zhu's patch set changes the splitting of huge pages so that it simply drops zero-filled base pages rather than remapping them into the process's address space. Since these are anonymous pages, the kernel will quickly substitute a new, zero-filled page should the process eventually reference one of the dropped pages.

The final step is to actively break up underutilized huge pages when the kernel is looking for memory to reclaim. To that end, the scanner will add the least-utilized pages (those in the 0-50 bucket shown above) to a new linked list so that they can be found quickly. A new shrinker is registered with the memory-management subsystem that can be called when memory is tight. When invoked, that shrinker will pull some entries from the list of underutilized huge pages and split them, resulting in the return of all zero-filled base pages found there to the system.

Most of the comments on this patch set have been focused on details rather than the overall premise. David Hildenbrand expressed a concern that unmapping zero-filled pages in an area managed by a userfaultfd() handler could create confusion if that handler subsequently receives page faults it was not expecting. Zhu answered that, if this is a concern, zero-filled base pages in userfaultfd()-managed areas could be mapped to the kernel's shared zero page instead.

The kernel has supported transparent huge pages since the feature was merged into the 2.6.38 kernel in 2011, but it is still not enabled for all processes. One of the reasons for holding back is the internal-fragmentation problem, which can outweigh the benefits that transparent huge pages provide. Zhu's explicit goal is to make that problem go away, allowing the enabling of transparent huge pages by default. If this work is successful, it could represent an important step for a longstanding kernel feature that, arguably, has never quite lived up to its potential.

Index entries for this article
Kernel/Huge pages
Kernel/Memory management/Huge pages



The transparent huge page shrinker

Posted Sep 9, 2022 7:30 UTC (Fri) by linoliumz (guest, #134676) [Link] (5 responses)

In the README file of one of my HPC projects, I have a section that describes how to enable transparent huge pages on Linux in order to get the best performance. In my experience, enabling transparent huge pages can provide a huge speed-up on workloads that use a lot of RAM (> 50 GiB); for example, one of my application's algorithms runs twice as fast if transparent huge pages are enabled. Any patch that brings us closer to enabling transparent huge pages by default is highly appreciated.

The transparent huge page shrinker

Posted Sep 10, 2022 8:00 UTC (Sat) by WolfWings (subscriber, #56790) [Link] (4 responses)

Serious question: is there a reason your code doesn't check at compile time whether it's running on Linux and, if so, just use madvise() to explicitly request THP?

Because all distros I've encountered for a few years now leave it set to madvise, rather than disabling it entirely at least. And at this point madvise() with MADV_HUGEPAGE is, by orders of magnitude, the most reliable and least troublesome way to use huge pages on Linux, since everything else requires mucking about with lots of individual knobs by comparison.

The transparent huge page shrinker

Posted Sep 10, 2022 9:35 UTC (Sat) by linoliumz (guest, #134676) [Link] (3 responses)

My application supports many different operating systems, not only Linux. I try to keep my code base portable; I am willing to add workarounds for specific operating systems to my application if they provide a significant benefit to my users and the workaround is relatively small. However, in this case I don't think it is worth the effort. In my opinion, the current Linux support for transparent huge pages is a mess (to be fair, on Windows it is even worse), and I prefer to simply wait until the situation improves and transparent huge pages are hopefully enabled by default.

The transparent huge page shrinker

Posted Sep 10, 2022 18:23 UTC (Sat) by WolfWings (subscriber, #56790) [Link] (2 responses)

I guess that's my point: It is enabled by default and has been for almost a decade now, just behind madvise where the interface to use it becomes trivial.
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h> // Also used for various posix functions, cross platform
const int alignment = 1 << 21;
const int size = 16 << 21;
int main( void ) {
    void *x = aligned_alloc( alignment, size );

#ifdef MADV_HUGEPAGE
    if ( x != NULL ) {
        madvise( x, size, MADV_HUGEPAGE  );
    }
#endif

    printf( "Go run: grep -i hugepage /proc/meminfo\nPausing for 60 seconds.\n" );
    sleep( 60 );

    return 0;
}

Beware fragmentation

Posted Sep 11, 2022 0:09 UTC (Sun) by jreiser (subscriber, #11027) [Link] (1 responses)

Often I get fewer huge pages than requested, even with 32GB of RAM on a (now) lightly-loaded system that has been up for a week or so:
    system("echo \"requested 32MB of anon huge pages:\";\
            grep -i hugepage /proc/meminfo");
    //printf( "Go run: grep -i hugepage /proc/meminfo\nPausing for 60 seconds.\n" );
    //sleep( 60 );

requested 32MB of anon huge pages:
AnonHugePages:      6144 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Also remember: #include <stdio.h> if printf remains in the source.

Beware fragmentation

Posted Sep 12, 2022 8:25 UTC (Mon) by WolfWings (subscriber, #56790) [Link]

Sorry, minor oversight on my part, since you do need to fault the RAM in before it exists. The program was literally typed from memory as an example, sorry if I missed an include. :P
#ifdef MADV_HUGEPAGE
    if ( x != NULL ) {
        madvise( x, size, MADV_HUGEPAGE );
        for ( int i = 0; i < size; i += alignment ) {
            ((char *)x)[i] = 0;
        }
    }
#endif
If you fault the pages in before the madvise, you have to wait for the THP scanner to circle back around and find things; but if you madvise before you fault the pages in, each fault is generated as a THP directly. You also need far fewer page faults to actually commit all the RAM you allocated, since Linux is (too) aggressive about overcommit until pages are actually accessed.

The transparent huge page shrinker

Posted Sep 10, 2022 11:48 UTC (Sat) by Flow (subscriber, #82408) [Link] (2 responses)

A similar, I think, mechanism was presented in

Panwar, Ashish, Sorav Bansal, and K. Gopinath. “HawkEye: Efficient Fine-Grained OS Support for Huge Pages”. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS ’19. Providence, RI, USA: Association for Computing Machinery, 2019, pp. 347–360. doi:10.1145/3297858.3304064. https://dl.acm.org/doi/10.1145/3297858.3304064

BTW, shouldn't this be called "transparent huge page *splitter*"? :)

The transparent huge page shrinker

Posted Sep 10, 2022 18:25 UTC (Sat) by WolfWings (subscriber, #56790) [Link]

There already is a (very well tested) splitter though. :)

The transparent huge page shrinker

Posted Sep 10, 2022 18:26 UTC (Sat) by calumapplepie (guest, #143655) [Link]

I think "shrinker" is a term for "program the kernel uses to reduce memory consumption when low on memory".

The transparent huge page shrinker

Posted Sep 11, 2022 14:30 UTC (Sun) by tamara_schmitz (guest, #141258) [Link] (2 responses)

SUSE has set thp enable to always and defrag to madvise.
I wonder what's the history there.

The transparent huge page shrinker

Posted Sep 12, 2022 12:51 UTC (Mon) by jtaylor (guest, #91739) [Link] (1 responses)

From my memory, the main issues encountered with THP were mostly poor scalability of the defragmentation on large machines.

We first saw it on, I think, RHEL 6, where THP had been immediately backported to some very old kernel version and enabled with defrag by default. On machines with many CPU cores and certain workloads this caused basically 100% system load from defragmenting.

The fix back then for us was disabling the automatic defragmentation and only running it when needed. Many opted to just flat-out disable it globally, though, and since then THP has had a pretty bad reputation, despite the issues mostly all being fixed today.

The transparent huge page shrinker

Posted Sep 13, 2022 16:17 UTC (Tue) by WolfWings (subscriber, #56790) [Link]

The other long-standing issue has been software using jemalloc, which most of the time doesn't actually free memory when it's done with it, but instead uses MADV_DONTNEED and MADV_FREE. That ends up actively fighting with khugepaged, which re-faults those 'all zero' pages to merge them together again, since they're still tied to the process.

So, in effect, software using jemalloc with THP enabled and lots of small allocations and frees would almost immediately hit the worst-case "lots of wasted RAM" situation.

It's also why so much jemalloc-using software so often has a much higher resident size: all of those "I'm done with these... but I won't pay the memory-fault costs to actually free them (BENCHMARKS, AHOY!) so I'll use the lazy-free madvise call instead" situations.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds