
Memory-management scalability

By Jonathan Corbet
March 13, 2015

LSFMM 2015
One of the drivers of memory-management development is scalability — performing well on ever-larger systems. So it is not surprising that scalability is a perennial discussion topic at kernel development gatherings; the 2015 Linux Storage, Filesystem, and Memory Management Summit was no exception. Andi Kleen and Peter Zijlstra led the first of two sessions on virtual memory scalability during the memory-management track at that event.

Andi started by pointing out that systems were growing, not only in the number of CPU cores available, but also in the amount of attached memory. The number of cores per NUMA node is on the rise, which is bringing out some new scalability problems.

One of the well-used scalability tactics found in the kernel is per-CPU variables; when each CPU has its own data, there can be no contention between them. But, Andi asserted, as the number of CPUs grows, it no longer makes sense to do things on a per-CPU basis. It just adds a lot of work whenever it becomes necessary to touch every CPU's version of a variable. Instead, data should be made local to groups of N cores (where N was not specified).
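As a rough illustration of the per-group idea, here is a minimal user-space sketch in C11 of a counter sharded by groups of CPUs rather than by individual CPU. It is not kernel code and not anything proposed in the session; NR_CPUS, CPUS_PER_GROUP, and the function names are all hypothetical, and the group size of 8 is arbitrary since N was left unspecified.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS        64   /* hypothetical system size */
#define CPUS_PER_GROUP 8    /* the "N" from the discussion; value is arbitrary */
#define NR_GROUPS      (NR_CPUS / CPUS_PER_GROUP)

/* One counter per group of CPUs, aligned so each sits in its own cache line. */
struct group_counter {
    _Alignas(64) atomic_long value;
};

static struct group_counter counters[NR_GROUPS];

/* Add to the counter shared by the group that the given CPU belongs to. */
static void group_counter_add(unsigned int cpu, long delta)
{
    assert(cpu < NR_CPUS);
    atomic_fetch_add_explicit(&counters[cpu / CPUS_PER_GROUP].value,
                              delta, memory_order_relaxed);
}

/* Reading the total visits NR_GROUPS slots instead of NR_CPUS slots. */
static long group_counter_sum(void)
{
    long sum = 0;

    for (size_t i = 0; i < NR_GROUPS; i++)
        sum += atomic_load_explicit(&counters[i].value, memory_order_relaxed);
    return sum;
}
```

The tradeoff is that CPUs within a group now contend on a shared atomic, in exchange for a walk over every slot that is N times shorter.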

Christoph Lameter said that a lot of these scaling problems can be addressed by limiting subsystems to specific cores. Andi replied that this approach works great at installations where there is an experienced person configuring the system. In the absence of that person, it does not work quite so well.

Mel Gorman asked the group what other scalability problems are being experienced now. Christoph complained about I/O bandwidth; in particular, he said, he is unable to push more than about 2GB/second to a filesystem. The problems come down to locking and the handling of 4KB pages in the XFS filesystem. Writeback tends to slow things down, since a lot of CPU time is spent making it happen.

That led to a discussion of batching operations — another tried-and-true scalability technique. It was noted that the reverse-mapping code, which maintains data structures to enable the kernel to tell which processes have references to a given physical page, takes its locks on a per-page basis. Fixing that, evidently, is not hard, but it will require some reorganization of the code.

The current least-recently-used (LRU) lists track memory in units of 4KB pages. That is considered at this point to be overly fine-grained; there is no need for LRU accuracy at that level. There was talk of implementing a "bucket LRU" that would track larger groups of pages.
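No design for a bucket LRU was presented, so the following is only a sketch of what the idea might look like: recency is tracked per bucket of pages, and touching any page moves its whole bucket to the hot end of the list. The bucket size, the structure layout, and every name here are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_BUCKET 512   /* hypothetical: 2MB worth of 4KB pages */
#define NR_BUCKETS       8     /* hypothetical; tiny for illustration */

struct bucket {
    struct bucket *prev, *next;
};

struct bucket_lru {
    struct bucket head;                  /* sentinel; head.next is hottest */
    struct bucket buckets[NR_BUCKETS];
};

static void bucket_unlink(struct bucket *b)
{
    b->prev->next = b->next;
    b->next->prev = b->prev;
}

static void bucket_push_front(struct bucket_lru *lru, struct bucket *b)
{
    b->next = lru->head.next;
    b->prev = &lru->head;
    lru->head.next->prev = b;
    lru->head.next = b;
}

/* Buckets are added in index order, so bucket 0 starts out coldest. */
static void bucket_lru_init(struct bucket_lru *lru)
{
    lru->head.prev = lru->head.next = &lru->head;
    for (size_t i = 0; i < NR_BUCKETS; i++)
        bucket_push_front(lru, &lru->buckets[i]);
}

/* A reference to any page marks its entire bucket as recently used. */
static void bucket_lru_touch(struct bucket_lru *lru, size_t pfn)
{
    struct bucket *b;

    assert(pfn / PAGES_PER_BUCKET < NR_BUCKETS);
    b = &lru->buckets[pfn / PAGES_PER_BUCKET];
    bucket_unlink(b);
    bucket_push_front(lru, b);
}

/* Index of the least recently used bucket: the reclaim candidate. */
static size_t bucket_lru_coldest(const struct bucket_lru *lru)
{
    return (size_t)(lru->head.prev - lru->buckets);
}
```

With 512 pages per bucket, the list is 512 times shorter than a per-page list, at the cost of no longer knowing which pages within a bucket are actually cold.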

Inter-processor interrupts (IPIs) for translation lookaside buffer flushes have long been seen as a potential scalability problem. But, it seems that, while people worry about IPIs, it is hard to find a workload where they create a bottleneck. Usually the much-maligned mmap_sem semaphore gets in the way first.

There was some vague talk of other scalability issues; memory compaction was mentioned as a problem on large systems. If compaction tries to migrate a lot of pages, that can lead to large latencies in process execution. Mel Gorman said that compaction shouldn't be doing that, though, so it is not clear where the problem is.

The session wound down without coming to any real conclusions. The scalability topic returned on the second day, though, when Davidlohr Bueso led a session focused on mmap_sem in particular. This semaphore controls access to a process's page tables, along with a number of other, not always well-defined things; it has been on the list of things to fix for some time now. Davidlohr stated a wish to walk out with some tangible action items for improving the situation.

He started by looking back at past action items, especially those that came out of the LSFMM 2014 locking session. One of the concerns then was use of mmap_sem in drivers and other code outside of the memory-management subsystem. Jan Kara has been working on getting drivers to use the gup_fast() variant of get_user_pages() in order to eliminate dependencies on mmap_sem; the biggest obstacle he is facing at the moment is a deadlock in the media subsystem.

Jan would also like to get mmap_sem out of the filesystem code. Al Viro wondered, though, about how virtual memory area (VMA) structures would be protected in its absence. Peter said he has a patch that shifts the protection of VMAs to sleepable RCU if anybody wanted to push that work forward. Meanwhile, Jan hopes to get his driver patches submitted soon.

Davidlohr said that his focus is moving stuff out from under mmap_sem entirely and, eventually, breaking up the lock into something finer-grained. The problem with that, as Peter pointed out, is that what's protected by the lock now is not entirely clear. The way to start, he said, would be to document what's protected by mmap_sem; after that, one can start thinking about better locking schemes.

One problem with mmap_sem is that it protects a process's entire address space. Concurrency could be increased by locking only portions of that space instead. The concept of "range locks" is thus of interest here. Michal Hocko suggested that developers could start by replacing mmap_sem with a range lock that still covers the entire address space; the locking could then be made more precise in an incremental manner.
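To make the range-lock concept concrete, here is a small user-space sketch in C using POSIX threads. It is not the kernel range-lock code discussed in the session; the fixed-size table, the exclusive-only semantics, and all of the names are assumptions chosen to keep the example short. Two callers conflict only if their inclusive [start, end] ranges overlap.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HELD 16   /* hypothetical cap on simultaneously held ranges */

struct range {
    uint64_t start, end;   /* inclusive bounds */
    bool used;
};

struct range_lock {
    pthread_mutex_t mutex;      /* protects the held[] table */
    pthread_cond_t  released;   /* signalled whenever a range is dropped */
    struct range    held[MAX_HELD];
};

static void range_lock_init(struct range_lock *rl)
{
    pthread_mutex_init(&rl->mutex, NULL);
    pthread_cond_init(&rl->released, NULL);
    for (int i = 0; i < MAX_HELD; i++)
        rl->held[i].used = false;
}

/* Caller holds rl->mutex. Returns a slot index, or -1 if the range
 * overlaps a held range or the table is full. */
static int claim_slot(struct range_lock *rl, uint64_t start, uint64_t end)
{
    int free_slot = -1;

    for (int i = 0; i < MAX_HELD; i++) {
        if (!rl->held[i].used)
            free_slot = i;
        else if (rl->held[i].start <= end && start <= rl->held[i].end)
            return -1;
    }
    if (free_slot >= 0)
        rl->held[free_slot] = (struct range){ start, end, true };
    return free_slot;
}

/* Non-blocking attempt; returns a slot to pass to range_unlock(), or -1. */
static int range_trylock(struct range_lock *rl, uint64_t start, uint64_t end)
{
    int slot;

    assert(start <= end);
    pthread_mutex_lock(&rl->mutex);
    slot = claim_slot(rl, start, end);
    pthread_mutex_unlock(&rl->mutex);
    return slot;
}

/* Blocks until the range can be claimed. */
static int range_lock(struct range_lock *rl, uint64_t start, uint64_t end)
{
    int slot;

    assert(start <= end);
    pthread_mutex_lock(&rl->mutex);
    while ((slot = claim_slot(rl, start, end)) < 0)
        pthread_cond_wait(&rl->released, &rl->mutex);
    pthread_mutex_unlock(&rl->mutex);
    return slot;
}

static void range_unlock(struct range_lock *rl, int slot)
{
    pthread_mutex_lock(&rl->mutex);
    rl->held[slot].used = false;
    pthread_cond_broadcast(&rl->released);
    pthread_mutex_unlock(&rl->mutex);
}
```

Michal's incremental path corresponds to having every caller start with range_lock(&rl, 0, UINT64_MAX), which behaves like a single exclusive lock; individual call sites can then be narrowed to the addresses they actually touch. A linear scan under one mutex would not scale, so a real implementation would need a better structure for finding overlaps.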

Hugh Dickins, though, wondered if that was the right approach and what problems, exactly, were being solved with range locks. His impression was that the top priority was to get page-fault handling out from under mmap_sem entirely. The answer was that there are, in fact, two different issues to be addressed regarding mmap_sem: it protects too much, and the hold times are too long. Range locks are one attempt to address the first part of the problem. Peter added that, among other things, range locking would allow concurrent mmap() calls to proceed, which is important for some threaded workloads.

There was some concern about surprises that can pop up when it turns out that an unexpected corner of the code was relying on mmap_sem. In extreme cases, Hugh said, user-space code may even rely on it. He described a complaint from a user about a change in mlock() semantics. Changes in the kernel increased mlock() concurrency and, in the process, exposed a lack of locking on the user-space side. Sympathy for the affected user was relatively low in this case, but, Hugh said, it would be wise to be prepared for nasty surprises.

In the end, Davidlohr's desire for tangible action items went mostly unfulfilled. About the only firm conclusion was that the range-lock code will be cleaned up and posted in the near future.

Index entries for this article
Kernel: Memory management/mmap_sem
Kernel: Memory management/Scalability
Kernel: Scalability
Conference: Storage, Filesystem, and Memory-Management Summit/2015



Memory-management scalability

Posted Mar 17, 2015 21:54 UTC (Tue) by mtanski (guest, #56423) [Link]

Unfortunately I missed the second day of this topic (mmap_sem) since I was in the filesystem track for a session about epoll.

I'd love to see something done around mmap_sem. We had one workload with a fairly heavyweight process that churned through columnar database data for aggregation. It used mmap for the columnar files (for zero copy); with 16 threads, mmap_sem started showing up as a non-negligible part of the profile when we spent some time using perf to look for more performance.

Happy to hear that there's work ongoing on this.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds