The final step for huge-page swapping

By Jonathan Corbet
July 2, 2018

For many years, Linux system administrators have gone out of their way to avoid swapping. The advent of nonvolatile memory is changing the equation, though, and swapping is starting to look interesting again — if it can perform well enough. That is not the case in current kernels, but a longstanding project to allow the swapping of transparent huge pages promises to improve that situation considerably. That work is reaching its final stage and might just enter the mainline soon.

The use of huge pages can improve the performance of the system significantly, so the kernel works hard to make them available. The transparent huge pages mechanism collects application data into huge pages behind the scenes, and the memory-management subsystem as a whole works hard to ensure that appropriately sized pages are available. When it comes time to swap out a process's pages, though, all of that work is discarded, and a huge page is split back into hundreds of normal pages to be written out. When swapping was slow and generally avoided, that didn't matter much, but it is a bigger problem if one wants to swap to a fast device and maintain performance.

The work so far, which has been underway since 2016, has focused on keeping huge pages together for as long as possible in the swapout process. Before this work began, the splitting of huge pages was one of the first things that was done in the swap-out process. The first step (merged in 4.13) was to delay splitting huge pages until after the swap entries had been allocated. That work alone improved performance considerably, mostly by reducing the number of times the associated locks had to be acquired and released. Step two, merged in 4.14, further delayed the splitting until the huge page had actually been written to the swap device, again improving performance through better batching and by writing the entire huge page as a unit. Progress slowed down for a while after those pieces went upstream.

Things have picked up again with the final installment of 21 patches, posted by Ying Huang. Swapping out an entire huge page as a unit has already been mostly solved by the previous work, so it requires little effort here. What is a bit trickier, though, is keeping track of swapped huge pages. A whole set of swap entries is required to track both the huge page and its component pages, and they must all be kept in sync. Any event that might result in the splitting of a resident huge page, such as unmapping part of the page, an madvise() call, etc., must be caught so that the corresponding swap entries can be updated accordingly. Memory-use accounting for control groups must be updated to take huge-page swapping into account. The bulk of the patch set is dedicated to taking care of this kind of bookkeeping.

Once this work is done, the other side of the problem is relatively easy to solve. The page-fault handler can recognize a fault in a swapped huge page and try to swap the entire huge page back in as a unit. Such attempts can always fail if a free huge page is not available, of course, in which case the page will be split before being swapped back in. When it works, the operation will again benefit from the batching involved; the total overhead of bringing the entire huge page back into memory will be significantly reduced.

It turns out, though, that this may not be the biggest performance benefit from this work. As noted above, the kernel works hard to maximize the use of huge pages in the current workload; that includes coalescing individual pages into huge pages whenever possible. The current swap system undoes that work; if a huge page is swapped out, it will be swapped back in as individual pages. At that point, the kernel will have to restart the process of joining them into a huge page from the beginning. That is a fair amount of extra work for the kernel to do. More to the point, though, there is a limit to the rate at which pages can be coalesced in this way, and the operation may not always succeed. So, often, those small pages will remain separate and system performance will suffer accordingly.

If, instead, huge pages are swapped back in as huge pages, that work need not be done and the total number of huge pages in the running workload can be expected to increase significantly. Actually, "significantly" understates the impact of this work. In benchmark results posted with the patch, Huang notes that a system with an unmodified kernel ran the test with only 4.5% of the anonymous data being kept in huge pages by the end; with the patch set applied, that number rose to 96.6%. Inter-processor interrupts fell by 96%, and spinlock wait time dropped from 13.9% to 0.1%. The I/O throughput of swapping itself increased over 1000%. Kernel developers will often work long and hard for a 1% performance increase; improvements on this scale are nearly unheard of.

Given that, one might conclude that merging this patch set would be worthwhile. But getting memory-management changes merged is always hard, especially when the patch set is large, as this one is. As Andrew Morton remarked: "It's a tremendously good performance improvement. It's also a tremendously large patchset". Morton plans to put it into the -mm tree as soon as some conflicts with the XArray patches can be worked out. But what is really needed is some extensive review by other memory-management developers. Until that happens, the world will be stuck with slow huge-page swapping.

Index entries for this article
Kernel	Memory management/Swapping

The final step for huge-page swapping

Posted Jul 2, 2018 17:56 UTC (Mon) by Sesse (subscriber, #53779) [Link] (1 responses)

So if huge pages are soon swappable, what's the status of huge pages for the page cache? (Both for data and executable code, i guess.)

The final step for huge-page swapping

Posted Jul 3, 2018 22:35 UTC (Tue) by hansendc (subscriber, #7363) [Link]

We have huge page cache support, technically. tmpfs works, for instance. What we don't have is support for real filesystems (ext4, xfs, etc...). Kirill Shutemov had some patches for that, two or so years ago that need to get dusted off and pushed again.

The final step for huge-page swapping

Posted Jul 3, 2018 18:47 UTC (Tue) by pj (subscriber, #4506) [Link] (2 responses)

Given the speed differences of swap on nvram vs on ssd vs on spinning rust, does the kernel do some kind of automatic media characterization when swap space is added?

The final step for huge-page swapping

Posted Jul 5, 2018 13:09 UTC (Thu) by caritas (subscriber, #50896) [Link] (1 responses)

The HDD and other media are distinguished. THP swap optimization is only enabled for non-HDD. Mainly because it depends on the data structure (swap cluster) that is only available for non-HDD.

The final step for huge-page swapping

Posted Jul 8, 2018 16:57 UTC (Sun) by anton (subscriber, #25547) [Link]

I would expect significant performance advantages from huge-page swapping on HDDs. With sequential bandwidths of ~200MB/s on fast HDDs, a 2MB page can be transferred in 10ms. The random access latency on a HDD is also about 10ms. So a randomly located 2MB page can be swapped in or out in about the same time as two 4KB pages that are randomly located.

The final step for huge-page swapping

Posted Jul 4, 2018 5:36 UTC (Wed) by ncm (guest, #165) [Link] (3 responses)

Last I heard, everybody turned off transparent huge pages because it was a performance pig. When did that get fixed?

The final step for huge-page swapping

Posted Jul 4, 2018 9:42 UTC (Wed) by vbabka (subscriber, #91706) [Link] (1 responses)

I'd say gradually, but probably a noticeable change was commit e4a49efe4e7e ("mm: thp: set THP defrag by default to madvise and add a stall-free defrag option") in 4.6. Since then page faults do not try to allocate THP's aggressively.

The final step for huge-page swapping

Posted Jul 5, 2018 9:28 UTC (Thu) by kirr (guest, #14329) [Link]

Vlastimil probably meant 444eb2a449.

The final step for huge-page swapping

Posted Jul 5, 2018 19:19 UTC (Thu) by Sesse (subscriber, #53779) [Link]

For some very narrow definition of “everybody” :-)

The final step for huge-page swapping

Posted Jan 13, 2020 15:29 UTC (Mon) by mmeehan (subscriber, #72524) [Link] (3 responses)

Is there a good way to track when a particular patch set gets merged to mainline? I wonder if this ever made it? It's hard to tell aside from Andrew's encouraging reply about adding it to -mm for a while.

The final step for huge-page swapping

Posted Jan 13, 2020 15:33 UTC (Mon) by mmeehan (subscriber, #72524) [Link] (1 responses)

Looks like it made it in to 5.1, minor regression fixed: https://lkml.org/lkml/2019/7/2/324

The final step for huge-page swapping

Posted Jun 2, 2023 8:49 UTC (Fri) by lluki (guest, #165426) [Link]

This patch (which is about a OOM bug in THP swap-out) seems unrelated to the patch set mentioned in the article (which is about THP swap-in).

The final step for huge-page swapping

Posted Jun 2, 2023 8:49 UTC (Fri) by lluki (guest, #165426) [Link]

As far as I can tell, this patchset has not been merged.