The final step for huge-page swapping
The use of huge pages can improve the performance of the system significantly, so the kernel works hard to make them available. The transparent huge pages mechanism collects application data into huge pages behind the scenes, and the memory-management subsystem as a whole works hard to ensure that appropriately sized pages are available. When it comes time to swap out a process's pages, though, all of that work is discarded, and a huge page is split back into hundreds of normal pages to be written out. When swapping was slow and generally avoided, that didn't matter much, but it is a bigger problem if one wants to swap to a fast device and maintain performance.
The work so far, which has been underway since 2016, has focused on keeping huge pages together for as long as possible in the swapout process. Before this work began, the splitting of huge pages was one of the first things that was done in the swap-out process. The first step (merged in 4.13) was to delay splitting huge pages until after the swap entries had been allocated. That work alone improved performance considerably, mostly by reducing the number of times the associated locks had to be acquired and released. Step two, merged in 4.14, further delayed the splitting until the huge page had actually been written to the swap device, again improving performance through better batching and by writing the entire huge page as a unit. Progress slowed down for a while after those pieces went upstream.
Things have picked up again with the final installment of 21 patches, posted by Ying Huang. Swapping out an entire huge page as a unit has already been mostly solved by the previous work, so it requires little effort here. What is a bit trickier, though, is keeping track of swapped huge pages. A whole set of swap entries is required to track both the huge page and its component pages, and they must all be kept in sync. Any event that might result in the splitting of a resident huge page, such as unmapping part of the page, an madvise() call, etc., must be caught so that the corresponding swap entries can be updated accordingly. Memory-use accounting for control groups must be updated to take huge-page swapping into account. The bulk of the patch set is dedicated to taking care of this kind of bookkeeping.
Once this work is done, the other side of the problem is relatively easy to solve. The page-fault handler can recognize a fault in a swapped huge page and try to swap the entire huge page back in as a unit. Such attempts can always fail if a free huge page is not available, of course, in which case the page will be split before being swapped back in. When it works, the operation will again benefit from the batching involved; the total overhead of bringing the entire huge page back into memory will be significantly reduced.
It turns out, though, that this may not be the biggest performance benefit from this work. As noted above, the kernel works hard to maximize the use of huge pages in the current workload; that includes coalescing individual pages into huge pages whenever possible. The current swap system undoes that work; if a huge page is swapped out, it will be swapped back in as individual pages. At that point, the kernel will have to restart the process of joining them into a huge page from the beginning. That is a fair amount of extra work for the kernel to do. More to the point, though, there is a limit to the rate at which pages can be coalesced in this way, and the operation may not always succeed. So, often, those small pages will remain separate and system performance will suffer accordingly.
If, instead, huge pages are swapped back in as huge pages, that work need not be done and the total number of huge pages in the running workload can be expected to increase significantly. Actually, "significantly" understates the impact of this work. In benchmark results posted with the patch, Huang notes that a system with an unmodified kernel ran the test with only 4.5% of the anonymous data being kept in huge pages by the end; with the patch set applied, that number rose to 96.6%. Inter-processor interrupts fell by 96%, and spinlock wait time dropped from 13.9% to 0.1%. The I/O throughput of swapping itself increased over 1000%. Kernel developers will often work long and hard for a 1% performance increase; improvements on this scale are nearly unheard of.
Given that, one might conclude that merging this patch set would be
worthwhile. But getting memory-management changes merged is always hard,
especially when the patch set is large, as this one is. As Andrew Morton
remarked: "It's a tremendously good
performance improvement. It's also a tremendously large patchset
".
Morton plans to put it into the -mm tree as soon as some conflicts with the
XArray patches can be worked out. But what is
really needed is some extensive review by other memory-management
developers. Until that happens, the world will be stuck with slow
huge-page swapping.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Swapping |
Posted Jul 2, 2018 17:56 UTC (Mon)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Jul 3, 2018 22:35 UTC (Tue)
by hansendc (subscriber, #7363)
[Link]
Posted Jul 3, 2018 18:47 UTC (Tue)
by pj (subscriber, #4506)
[Link] (2 responses)
Posted Jul 5, 2018 13:09 UTC (Thu)
by caritas (subscriber, #50896)
[Link] (1 responses)
Posted Jul 8, 2018 16:57 UTC (Sun)
by anton (subscriber, #25547)
[Link]
Posted Jul 4, 2018 5:36 UTC (Wed)
by ncm (guest, #165)
[Link] (3 responses)
Posted Jul 4, 2018 9:42 UTC (Wed)
by vbabka (subscriber, #91706)
[Link] (1 responses)
Posted Jul 5, 2018 9:28 UTC (Thu)
by kirr (guest, #14329)
[Link]
Vlastimil probably meant 444eb2a449.
Posted Jul 5, 2018 19:19 UTC (Thu)
by Sesse (subscriber, #53779)
[Link]
Posted Jan 13, 2020 15:29 UTC (Mon)
by mmeehan (subscriber, #72524)
[Link] (3 responses)
Posted Jan 13, 2020 15:33 UTC (Mon)
by mmeehan (subscriber, #72524)
[Link] (1 responses)
Posted Jun 2, 2023 8:49 UTC (Fri)
by lluki (guest, #165426)
[Link]
Posted Jun 2, 2023 8:49 UTC (Fri)
by lluki (guest, #165426)
[Link]
The final step for huge-page swapping
The final step for huge-page swapping
The final step for huge-page swapping
The final step for huge-page swapping
I would expect significant performance advantages from huge-page swapping on HDDs. With sequential bandwidths of ~200MB/s on fast HDDs, a 2MB page can be transferred in 10ms. The random access latency on a HDD is also about 10ms. So a randomly located 2MB page can be swapped in or out in about the same time as two 4KB pages that are randomly located.
The final step for huge-page swapping
The final step for huge-page swapping
The final step for huge-page swapping
The final step for huge-page swapping
The final step for huge-page swapping
The final step for huge-page swapping
The final step for huge-page swapping
The final step for huge-page swapping
The final step for huge-page swapping
