KS2012: memcg/mm: Memory-management performance topics
The session of the 2012 Kernel Summit memcg/mm minisummit that was dedicated to improving memory cgroup performance veered into a more general discussion of topics relating to memory-management performance. Those topics were as follows.
Christoph Lameter talked about latency increases in the core kernel and the problems this is causing. Consequently, there is a trend in some High Performance Computing (HPC) circles for applications to bypass the kernel entirely (i.e., avoiding system calls, and so on). One initial bypass was to allow network data to be written directly to user buffers. Now the same bypass is being used for accessing solid-state drives. He noted that one of his wishes these days is to have the ability to restrict kernel execution to a subset of cores "so that we can run user-space code on the other cores".
Christoph had no specific examples of particularly bad performance cases; rather, it's a case of death by a thousand cuts. For example, he made the point that application performance depends on cache footprint: performance is high if the application's data fits in the L1 cache, but as the kernel's cache footprint grows, it pushes the application's data out and performance drops rapidly. Each new kernel feature increases that footprint, and Christoph feels that kernel developers are not being mindful enough of this problem. He noted that even when cgroups are disabled at run time, they make the code larger, with consequences for cache efficiency; thus cgroups are always configured out on the systems he runs.
Andi Kleen talked about scalability problems with large-memory machines, pointing out that 1TB machines are now available at a reasonable price. The first problem he mentioned is that the conversion of some spinlocks to mutexes that occurred a while back hurt performance quite badly for some unspecified workloads that hit the reclaim path. In some cases, the mutexes became heavily contended with large latencies. Peter Zijlstra felt that part of the problem might be that mutexes are inherently unfair, while ticket spinlocks are fair, but there was insufficient data to verify that.
Andi announced that a suite of benchmarks that target memory scalability has been developed within Intel and he hopes that the suite will be released soon. The benchmarks will be used to demonstrate a range of problems with memory scalability, so that the problems can be tackled one at a time. For the most part, the problems are in areas where the kernel does not batch operations properly. For example, during reclaim the unmapping takes place one page at a time with no attempt to batch any of the locking.
Andi noted that the lack of a zero page for transparent huge pages (THP) increases the memory footprint and causes sufficient problems that THP is getting disabled for some workloads. He noted that Kirill Shutemov has some patches to remedy this, but they have not received a lot of review so far. (More recently, Kirill has posted an updated version of these patches.) There is currently no support for shared memory using transparent huge pages; this is an area that may need to be tackled soon for large-memory machines. The aforementioned problems and a number of others mean that memory scalability for large machines is likely to be a recurring topic for the foreseeable future.
