Reconsidering the direct-map fragmentation problem

By Jonathan Corbet
May 11, 2023

Mike Rapoport has put a considerable amount of effort into solving the problem of direct-map fragmentation over the years; this has resulted in proposals like __GFP_UNMAPPED and a session at the 2022 Linux Storage, Filesystem, Memory-Management, and BPF Summit. Rapoport returned at the 2023 Summit to revisit this issue, but he started with a somewhat surprising spoiler.

The kernel's direct map makes all of the system's physical memory accessible in the kernel's virtual address space (on 64-bit systems, at least). As a performance optimization, huge pages are used to construct this mapping; by keeping the kernel's use of the translation lookaside buffer (TLB) down, using huge pages for the direct map will speed memory access in general. When the permissions for specific pages in the direct map must be changed (to hide memory from the kernel or to remove write permission from executable code, for example), though, those huge pages must be split into smaller "base" pages, hurting system performance. This type of direct-map fragmentation is thus worth working hard to avoid.

Or, at least, that is what everybody has assumed for years. Rapoport started his 2023 session with the statement that he is no longer convinced that the kernel needs to make any changes to its memory allocators to avoid direct-map fragmentation. It is, he said, no longer an issue that the kernel community needs to be concerned about. "End of the talk".

The talk didn't actually end there, though. Instead, he reviewed the causes of direct-map fragmentation, including the allocation of memory for executable code (such as a loadable module or a BPF program) or for secret-memory features. Proposed future changes, such as a version of vmalloc() with memory permissions or using protection keys supervisor for page tables, may also require splitting huge pages and, as a result, create fragmentation in the direct map.

The __GFP_UNMAPPED patches tried to reduce this problem by setting aside an area for allocations that should be removed from the direct map. Once those were working, he went looking for numbers to justify the change. He ran a whole series of benchmarks to show how the reduced fragmentation made the system run faster, but got results that he described as "peculiar". The results (which he has made available for the curious) showed improvements that were, at best, far smaller than the noise in the measurements. There was little sign that any of the changes would translate into performance improvements for real workloads.

Vlastimil Babka pointed out that all of the performance tests were done on AMD CPUs and wondered whether Intel systems would behave differently; Rapoport answered that he has run the tests on Intel as well with similar results. Michal Hocko asked whether Rapoport had run the tests using only base pages for the direct map — the fully fragmented case, in other words; that test, too, has been run and showed a "single-digit" degradation in performance.

The conclusion from all of this, Rapoport continued, was that direct-map fragmentation just does not matter — for data access, at least. Using huge-page mappings does still appear to make a difference for memory containing the kernel code, so allocator changes should focus on code allocations — improving the layout of allocations for loadable modules, for example, or allowing vmalloc() to allocate huge pages for code. But, for kernel-data allocations, direct-map fragmentation simply appears to not be worth worrying about.

James Bottomley said that these results might show that, on current CPUs, the TLB doesn't work the way developers think it does. Perhaps improved speculative execution is reducing the cost of TLB misses; much of the memory-management subsystem might be built for a world that no longer exists. Rapoport answered that most of the TLB is occupied by user space in any case, so the kernel will almost certainly incur a TLB miss on each entry regardless of the state of the direct map. Trying to prevent that miss with large-page mappings may not be doing any good.

Direct-map fragmentation concerns have led to Rapoport's secret-memory features to be disabled by default. Having concluded that those concerns are not actually concerning, he is now proposing to enable the feature in all kernels, making memfd_secret() available by default. As the session ran out of time, Babka worried that fragmentation could still increase the kernel's memory usage by requiring more page tables, but Rapoport answered that the cost was not enough to worry about. Secret memory is controlled by the system resource limits, so there is only so much damage that a malicious user can do.

The proof is likely to be when the configuration change is proposed; if users can demonstrate real performance regressions, direct-map fragmentation may have to be reconsidered yet again.

Index entries for this article
Kernel	Memory management/Direct map
Conference	Storage, Filesystem, Memory-Management and BPF Summit/2023

Reconsidering the direct-map fragmentation problem

Posted May 12, 2023 2:56 UTC (Fri) by mtaht (subscriber, #11087) [Link]

I am, in general, always delighted by honest talk about negative results. In a world too full of ra-ra! change! for the sake of change, this was refreshing.

https://conferences.sigcomm.org/sigcomm/2014/doc/slides/1...

Reconsidering the direct-map fragmentation problem

Posted May 12, 2023 10:30 UTC (Fri) by ardbiesheuvel (subscriber, #89747) [Link]

On arm64, we started mapping the kernel direct map down to pages by default around 5 years ago, and I don't remember seeing any complaints. This was necessary as otherwise, the kernel direct map will have writable aliases of the pages that back module text and rodata regions, which clearly defeats the purpose somewhat of mapping code read-only in the first place.

Reconsidering the direct-map fragmentation problem

Posted May 30, 2023 8:38 UTC (Tue) by oak (guest, #2786) [Link]

While huge pages is a noticeable performance improvement in general, it may also slightly decrease performance for some 100% bandwidth bound use-cases, at least on older platforms. Here's some data about that from enabling THP for Intel iGPUs to compensate performance loss from enabling IOMMU:
https://gitlab.freedesktop.org/drm/intel/-/issues/430#not...

Many years ago I've also seen up to 10% changes in CPU side memory bandwidth test results from just rebooting the machine, which I think have been due to process memory getting allocated at a different address. Huge pages being split can mean the resulting addresses being less aligned. Maybe that could could improve e.g. cache page coloring (on older platforms)?