|
|
Log in / Subscribe / Register

Zone-lock and mmap_sem scalability

By Jonathan Corbet
May 3, 2018

LSFMM
The memory-management subsystem is a central point that handles all of the system's memory, so it is naturally subject to scalability problems as systems grow larger. Two sessions during the memory-management track of the 2018 Linux Storage, Filesystem, and Memory-Management Summit looked at specific contention points: the zone locks and the mmap_sem semaphore.

Zone-lock optimizations

Dave Hansen ran a brief session about optimizations for the zone lock as a follow-on to the LRU lock session held on a previous day. The management of memory at the page level is handled by the zone mechanism, which maintains a set of per-CPU lists of pages that can be used to satisfy allocation requests; the zone lock serializes access to those lists when the need arises. In some workloads, the zone lock can create a significant amount of contention.

When one of the per-CPU lists is exhausted, the memory-management code moves a new batch of pages into it from the global list. The question that Hansen wanted to discuss was the number of pages that are pulled when this happens; that number has been set to 31 for a long time. Hardware has evolved considerably since that value was arrived at; perhaps it's time for a change?

He had the results of some tests run by Aaron Lu on a couple of relatively large x86 machines. Increasing the batch size to 53 yielded an 18% microbenchmark performance increase on a four-socket system; the increase on a two-socket system was about half that. Making the batch size larger yielded progressively smaller improvements, and by about 300 there was no improvement at all. So there does not appear to be a case for a huge increase, but perhaps a modest increase makes sense?

Andrew Morton asked whether there were other workloads that would be hurt by this change. Hansen replied that the worst-case latency might increase, but throughput would as well. Michal Hocko suggested asking the networking developers for their opinion, since they are highly sensitive to latency in memory-allocation functions. Hansen said that the latency could conceivably improve due to the reduced contention on the zone lock.

If the default value is going to be changed, a new value must be picked. There was some talk of trying to tune it automatically, but the tests did not show a whole lot of variation between the systems, so autotuning is probably not worth the effort. Rik van Riel suggested writing an LWN article describing the problem and asking users to test various batch sizes. The session concluded with the idea that the batch size should probably be approximately doubled, but that more tests need to be run before the change goes upstream.

mmap_sem scalability with munmap()

Yang Shi returned to the front of the room to discuss a specific problem with the often-criticized mmap_sem semaphore. When a process calls munmap() to unmap a range of memory, mmap_sem is held for the duration of the entire operation. That can be a long time for big mappings; he measured 18 seconds when undoing a 320GB mapping. Any other threads needing mmap_sem (to handle a page fault, for example) will hang while this is happening.

As a way of dealing with this problem, Shi developed a patch series changing the way munmap() operates. Rather than unmap the entire range at once, it splits the range into a number of pieces, and unmaps each piece separately, dropping and reacquiring mmap_sem after each. That change increased page-fault performance by 6-8%. The improvement is not seen for all workloads, but performance does not appear to degrade for any. That patch did not get much discussion in the room, though; instead, the developers wanted to consider alternative solutions.

Jérôme Glisse suggested that only the top-level page tables (the page upper directory — PUD — in particular) need to be unmapped while holding mmap_sem; after that, unmapping could drop the lock and do the rest of the work without. That only works for ranges covering a full PUD entry, though. Hugh Dickins, instead, suggested marking the virtual memory areas (VMAs) covering the range as being deleted, then dropping mmap_sem to clean up the page-table mappings contained in those VMAs.

Hocko had a different variation on the two-phase idea. An unmap operation could acquire mmap_sem for read access, then call madvise() with the MADV_DONTNEED to release the pages associated with the mapping. mmap_sem could then be upgraded to write access to finish the rest of the cleanup. There are some practical difficulties with this approach, including the fact that there is no way to upgrade mmap_sem in that way, and it would be hard to create one since a thread can hold multiple read locks simultaneously. One solution there might be to just drop the lock entirely and retake it for write access.

One possible trouble point with this approach is that an application accessing the pages in a range that is being unmapped would see a behavior change if this two-phase model were implemented. It was generally agreed, though, that this application, if it exists, is already playing with undefined behavior in a buggy way, so there shouldn't be any real trouble there. Things wound down with Hocko suggesting that this change should be done first, since it is a relatively simple approach to the problem; more complex changes can be done if the easy optimization is not enough.

Index entries for this article
KernelMemory management/mmap_sem
ConferenceStorage, Filesystem, and Memory-Management Summit/2018


to post comments


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds