
Mitigating vmap lock contention

By Jonathan Corbet
May 26, 2023

LSFMM+BPF
The "vmap area" is a range of kernel address space used when the kernel needs to virtually map a range of memory; among other things, memory allocations obtained from vmalloc() and loadable modules are placed there. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Uladzislau Rezki, presenting remotely, explained a performance problem related to the vmap area and discussed possible solutions.

The problem, he said, is that the vmap area is protected by three global spinlocks. The global free-space area is covered by free_vmap_area_lock, the tracking of mapped areas by vmap_area_lock, and the list of lazily freed areas by purge_vmap_area_lock. These locks, he said, can turn into a significant bottleneck on systems with a large number of CPUs. The vmap_area_lock controls access to a red-black tree that can be used to find an allocated area using an address within it. These areas can be seen by looking at /proc/vmallocinfo. The free_vmap_area_lock, instead, regulates access to free space and can experience high lock contention.

The allocation path has to acquire both free_vmap_area_lock (to find a free range) and vmap_area_lock (to mark that range as busy). The freeing path, instead, needs vmap_area_lock and purge_vmap_area_lock. This pattern means that the three data structures cannot be accessed concurrently. Running some tests on a "super-powerful computer", Rezki measured a basic vmalloc() call as taking about 2µs when a single thread was running. With 32 threads calling vmalloc() simultaneously, that time grew to 50µs, 25 times greater. That slowdown is the result of contention on the vmap-area locks.

The biggest problem, he said, is vmap_area_lock. This is partly due to a fair amount of fragmentation in the allocated areas, he said; the free and purge lists have fewer, larger areas and, as a result, less contention. Rezki proposed addressing this problem by adding a per-CPU cache; each CPU would pre-fetch some address space into its cache, then allocate pieces of that space to satisfy requests.

An attendee pointed out that the problem of allocating vmap-area space looks similar to allocating user-space address space and asked whether the same infrastructure could be used for both. Rezki answered that user-space allocation is a bigger problem, so the solution is heavier, and optimized implementations are still in development. The real problem with the vmap area is the serialization of requests across CPUs, which is amenable to a simpler solution.

Liam Howlett said that the vmap_area_lock is used for both allocation and freeing operations; if it could be avoided in one of the two paths, that could reduce contention. Rezki said that is true in theory, but that the bookkeeping has to be done somehow regardless. Howlett repeated that the problem is similar to the allocation of virtual-memory areas for user space. Memory-management developers should learn from each other, he said, rather than going off and doing their own things.

Rezki moved on to the management of free space in the vmap area. When a range in that area is freed, his approach would be to use the freed address to find the appropriate per-CPU zone, lock that zone, and remove the allocation from it. Then the zone's lazy-free list could be locked, and the newly freed area added there. A separate context would occasionally drain that lazy list; in his patch set, it is drained to the global area for now.

He concluded by asking what his next steps should be; the answer was to post patches and follow the usual process. He was asked for performance numbers, but had none available. When asked where this contention has been observed, he said it shows up on Android systems during video playback. The session ended with Michal Hocko suggesting that Rezki join his work with the efforts to improve user-space address allocation if possible.

Index entries for this article
Kernel: Memory management/Scalability
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2023



Mitigating vmap lock contention

Posted May 29, 2023 14:56 UTC (Mon) by wens (subscriber, #115438) [Link]

We hit this bottleneck hard. Our service involves a large number of incoming SSH connections. The bottleneck happens in two places: a) per-process kernel stack (if VMAP_STACK is enabled) allocation and b) pty (always allocated through vmalloc) allocation. In the end we rewrote a suitable SSH backend, instead of using OpenSSH.

Mitigating vmap lock contention

Posted May 30, 2023 10:13 UTC (Tue) by sima (subscriber, #160698) [Link] (1 responses)

So yeah if this is for video playback only, that's a userspace/driver issue, not a vmap issue. Of all gpu workloads video codec really should be the most predictable, allocate all buffers you need upfront and then recycle. If there's enough reallocations during playback to matter something is really busted.

Mitigating vmap lock contention

Posted May 30, 2023 12:58 UTC (Tue) by kazer (subscriber, #134462) [Link]

To me it seems like video playback is an example, not the only case at all.

Regarding video playback, can you tell what happens during streaming when bitrate changes and so forth? To me there is plenty of variation that needs to be accounted for.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds