
Optimizing VMA caching


By Jonathan Corbet
March 5, 2014
The kernel divides each process's address space into virtual memory areas (VMAs), each of which describes where the associated range of addresses has its backing store, its protections, and more. A mapping created by mmap(), for example, will be represented by a single VMA, while mapping an executable file into memory may require several VMAs; the list of VMAs for any process can be seen by looking at /proc/PID/maps. Finding the VMA associated with a specific virtual address is a common operation in the memory management subsystem; it must be done for every page fault, for example. It is thus not surprising that this mapping is highly optimized; what may be surprising is the fact that it can be optimized further.
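
A VMA can be pictured, in greatly simplified form, as a structure along the following lines; the real struct vm_area_struct carries many more fields, and the selection shown here is only illustrative:

    /* Greatly simplified sketch of a VMA; the real struct vm_area_struct
     * has many more fields (anon_vma chains, vm_ops, and so on). */
    struct vm_area_struct {
        unsigned long vm_start;     /* first address covered by this VMA */
        unsigned long vm_end;       /* first address beyond the end */
        struct mm_struct *vm_mm;    /* the address space it belongs to */
        unsigned long vm_flags;     /* protections and other attributes */
        struct file *vm_file;       /* backing file, NULL if anonymous */
        unsigned long vm_pgoff;     /* offset into that file, in pages */
        struct rb_node vm_rb;       /* linkage into the per-mm red-black tree */
    };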

The VMAs for each address space are stored in a red-black tree, which enables a specific VMA to be looked up in logarithmic time. These trees scale well, which is important; some processes can have hundreds of VMAs (or more) to sort through. But it still takes time to walk down to a leaf in a red-black tree; it would be nice to avoid that work at least occasionally if it were possible. Current kernels work toward that goal by caching the results of the last VMA lookup in each address space. For workloads with any sort of locality, this simple cache can be quite effective, with hit rates of 50% or more.
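
In rough outline, the existing single-entry cache works as shown below; this is simplified from the actual mm/mmap.c logic, locking is omitted, and rb_tree_walk() is just a stand-in for the real tree descent:

    /* Simplified outline of find_vma() with the existing single-entry cache. */
    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
        struct vm_area_struct *vma = mm->mmap_cache;

        /* Fast path: the VMA found by the last lookup in this address space. */
        if (vma && vma->vm_start <= addr && addr < vma->vm_end)
            return vma;

        /* Slow path: walk the red-black tree rooted at mm->mm_rb. */
        vma = rb_tree_walk(mm, addr);   /* stand-in for the real tree descent */
        if (vma)
            mm->mmap_cache = vma;       /* remember it for next time */
        return vma;
    }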

But Davidlohr Bueso thought it should be possible to do better. Last November, he posted a patch adding a second cache holding a pointer to the largest VMA in each address space. The logic was that the VMA with the most addresses would see the most lookups, and his results seemed to bear that out; with the largest-VMA cache in place, hit rates went to over 60% for some workloads. It was a good improvement, but the patch did not make it into the mainline. Looking at the discussion, one can quickly come up with a useful tip for aspiring kernel developers: if Linus responds by saying "This patch makes me angry," the chances of it being merged are relatively small.

Linus's complaint was that caching the largest VMA seemed "way too ad-hoc" and wouldn't be suitable for a lot of workloads. He suggested caching a small number of recently used VMAs instead. Additionally, he noted that maintaining a single cache per address space, as current kernels do, might not be a good idea. In situations where multiple threads are running in the same address space, it is likely that each thread will be working with a different set of VMAs. So making the cache per-thread, he said, might yield much better results.

A few iterations later, Davidlohr has posted a VMA-caching patch set that appears to be about ready to go upstream. Following Linus's suggestion, the single-VMA cache (mmap_cache in struct mm_struct) has been replaced by a small array called vmacache in struct task_struct, making it per-thread. On systems with a memory management unit (almost all systems), that array holds four entries. There are also new sequence numbers stored in both struct mm_struct (one per address space) and in struct task_struct (one per thread).
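
In outline, the new fields look something like the following; the names follow the patch set's description above, but the details should be read as a sketch rather than the final merged code:

    #define VMACACHE_BITS   2
    #define VMACACHE_SIZE   (1U << VMACACHE_BITS)   /* four entries */

    struct mm_struct {
        /* ... */
        u32 vmacache_seqnum;        /* bumped whenever the address space changes */
    };

    struct task_struct {
        /* ... */
        u32 vmacache_seqnum;                             /* mm's value at last use */
        struct vm_area_struct *vmacache[VMACACHE_SIZE];  /* per-thread VMA cache */
    };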

The purpose of the sequence numbers is to ensure that the cache does not return stale results. Any change to the address space (the addition or removal of a VMA, for example) causes the per-address-space sequence number to be incremented. Every attempt to look up an address in the per-thread cache first checks the sequence numbers; if they do not match, the cache is deemed to be invalid and will be reset. Address-space changes are relatively rare in most workloads, so the invalidation of the cache should not happen too often.
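
The check amounts to something like this sketch, a simplified approximation of the patch set's vmacache_valid() helper:

    /* Approximate logic of the per-thread cache validity check. */
    static bool vmacache_valid(struct mm_struct *mm)
    {
        struct task_struct *curr = current;

        if (mm != curr->mm)
            return false;

        if (mm->vmacache_seqnum != curr->vmacache_seqnum) {
            /* The address space changed since this thread last looked;
             * drop everything and resynchronize the sequence number. */
            curr->vmacache_seqnum = mm->vmacache_seqnum;
            memset(curr->vmacache, 0, sizeof(curr->vmacache));
            return false;
        }
        return true;
    }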

Every call to find_vma() (the function that locates the VMA for a virtual address) first does a linear search through the cache to see if the needed VMA is there. Should the VMA be found, the work is done; otherwise, a traversal of the red-black tree will be required. In this case, the result of the lookup will be stored back into the cache. That is done by overwriting the entry indexed by the lowest bits of the page-frame number associated with the original virtual address. It is, thus, a random replacement policy for all practical purposes. The caching mechanism is meant to be fast so there would probably be no benefit from trying to implement a more elaborate replacement policy.
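
Put together, the lookup and replacement paths reduce to something like the following sketch, simplified from the patch set's vmacache_find() and vmacache_update(); the VMACACHE_HASH() macro shown here is an approximation of the slot-selection scheme described above:

    #define VMACACHE_HASH(addr)  (((addr) >> PAGE_SHIFT) & (VMACACHE_SIZE - 1))

    /* Linear scan of the four-entry per-thread cache. */
    struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
    {
        int i;

        if (!vmacache_valid(mm))
            return NULL;

        for (i = 0; i < VMACACHE_SIZE; i++) {
            struct vm_area_struct *vma = current->vmacache[i];

            if (vma && vma->vm_start <= addr && addr < vma->vm_end)
                return vma;
        }
        return NULL;    /* miss: the caller falls back to the red-black tree */
    }

    /* Store a tree-lookup result back into the cache; the slot is chosen by
     * the low bits of the page-frame number, i.e. effectively at random. */
    void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
    {
        current->vmacache[VMACACHE_HASH(addr)] = newvma;
    }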

How well does the new scheme work? It depends on the workload, of course. For system boot, where almost everything running is single-threaded, Davidlohr reports that the cache hit rate went from 51% to 73%. Kernel builds, unsurprisingly, already work quite well with the current scheme with a hit rate of 75%, but, even in this case, improvement is possible: that rate goes to 88% with Davidlohr's patch applied. The real benefit, though, can be seen with benchmarks like ebizzy, which is designed to simulate a multithreaded web server workload. Current kernels find a cached VMA in a mere 1% of lookup attempts; patched kernels, instead, show a 99.97% hit rate.

With numbers like that, it is hard to find arguments for keeping this patch out of the mainline. At this point, the stream of suggestions and comments has come to a halt. Barring surprises, a new VMA lookup caching mechanism seems likely to find its way into the 3.15 kernel.



Optimizing VMA caching

Posted Mar 6, 2014 2:27 UTC (Thu) by luto (subscriber, #39314) [Link]

This reminds me:

Why is a red-black tree a good data structure for VMAs? Wouldn't something like a crit-bit tree or other trie be better for this kind of use case? It gets rid of rebalancing.

Red-black trees considered almost, but not quite, harmful

Posted Mar 6, 2014 9:59 UTC (Thu) by ncm (subscriber, #165) [Link]

A red-black tree is rarely a good data structure, for any use. However, it is often the easiest thing to chuck in, and good enough. Then, it is usually just barely good enough to keep anybody from doing the work to replace it with a data structure that actually is good.

This means that if you find something spending a displeasing amount of time on lookups in a red-black tree, you have been handed a golden opportunity to replace a red-black tree with something actually, y'know, good. Or, you can stick a wee cache in front of it and move along, secure in the knowledge that while that red-black tree was not even barely good enough to keep anybody from doing anything about it -- it is now.

Red-black trees considered almost, but not quite, harmful

Posted Mar 8, 2014 5:24 UTC (Sat) by flewellyn (subscriber, #5047) [Link]

I wasn't aware red-black trees were considered harmful. Can you elaborate on why?

Red-black trees considered almost, but not quite, harmful

Posted Mar 8, 2014 8:06 UTC (Sat) by ncm (subscriber, #165) [Link]

Not harmful, just almost harmful. See my other comment below. CS professors' harping on the merits of "n log n" behavior while ignoring memory-hierarchy nonlinearities has doomed us all to suboptimal SaaS service.

Red-black trees considered almost, but not quite, harmful

Posted Mar 10, 2014 4:36 UTC (Mon) by flewellyn (subscriber, #5047) [Link]

That's interesting. So, the failures of red-black trees are to do with cache coherence and locality? That makes sense.

I looked at the "crit-bit tree" idea from DJB. Looks very interesting, but it looks like it's tuned to be very very fast as long as the tree stays within a cache line. What happens, I wonder, when the tree spills out of cache?

Optimizing VMA caching

Posted Mar 6, 2014 12:09 UTC (Thu) by matthias (subscriber, #94967) [Link]

Red-black trees might have the advantage that the depth is strictly limited by log(n). In crit-bit trees, the depth limit is the bit-length of the string (i.e. 64 for 64-bit pointers)*.

Rebalancing is probably not an issue, because the data structure is mostly static (i.e. very few updates).

* I am aware that today's architectures do not really use all 64 bits and that the lowest 12 bits are zero if we address 4k pages.

Optimizing VMA caching

Posted Mar 6, 2014 12:55 UTC (Thu) by ncm (subscriber, #165) [Link]

Binary trees, howsoever balanced, if they are big enough that it matters, are hell on cache locality. Unless you need to do sorted-order traversal, you're usually much better off hashing. If you do need sorted traversal, some variation on a B-tree (with blocks tuned to a cache line) is likely to be better.

That said, if it matters, measure alternatives with actual data. The real world of caches and pipelines is full of surprises.

Optimizing VMA caching

Posted Mar 7, 2014 2:36 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

Wouldn't a splay tree be ideal for this application? It effectively caches automatically.

Optimizing VMA caching

Posted Mar 7, 2014 3:14 UTC (Fri) by luto (subscriber, #39314) [Link]

These trees are read much more often than they are written. Splay trees need heavyweight synchronization for readers, which will kill multithreaded performance.

Optimizing VMA caching

Posted Mar 7, 2014 15:43 UTC (Fri) by jzbiciak (subscriber, #5246) [Link]

The hit rate improvements are definitely interesting. Are there any measurements of how they translate to performance improvements?

Optimizing VMA caching

Posted Mar 8, 2014 6:07 UTC (Sat) by jzbiciak (subscriber, #5246) [Link]

I guess I should have clicked on the "patch set" link above. It does have benchmarks of a sort, reporting the number of billions of cycles spent in find_vma() before and after.

The improvements are definitely visible. A billion cycles on an 80-core x86-64 may only be a fraction of a second, but a billion here, a billion there, pretty soon you're talking real cycles! ;-)

Optimizing VMA caching

Posted Mar 13, 2014 22:01 UTC (Thu) by jcownie (guest, #3374) [Link]

Please quote miss-rates, not hit-rates. It's the misses that matter, so a change from 2% miss rate to 1% miss rate is a 2x improvement and looks like it, whereas if you express the same thing as a change from a 98% hit rate to 99% it looks insignificant.

It's the misses that cost, so they're what you want to talk about!

RE: why rbtrees in the kernel?!

Posted Mar 17, 2014 3:53 UTC (Mon) by mmorrow (guest, #83845) [Link]

As I see it, the reason red-black trees are used in many places in the
kernel is that, when the rbtree node data is embedded directly in a
given struct, they require no dynamic allocation.

This is not the case with (e.g.) radix trees, which are much more
cache-friendly (they are not binary, so there is less pointer-chasing).

To replace rbtrees in the kernel (without introducing dynamic allocation
in places where it is not currently present), you would need to provide an
n-ary tree implementation *THAT DOES NOT ONLY HAVE VALUES AT THE LEAVES*
(because only then can you play the same game as is done with rbtrees
and node-preallocation-as-part-of-structs-of-rbtree-participants).

Does such an n-ary tree exist?

RE: why rbtrees in the kernel?!

Posted Mar 17, 2014 4:37 UTC (Mon) by neilbrown (subscriber, #359) [Link]

The other advantage of rb-trees, over say AVL trees, is that the rebalance operation has a fixed cost (the number of rotations is O(1) for insert).

I think it would be extremely ... challenging to maintain any sort of balance for a multi-way tree which contained data in internal nodes. I don't say it is impossible, but I think it would be a very heavy cost in the worst case.

My favourite search structure is the skip-list. This can be implemented as a sorted linked list using embedded "struct list_head", with extra dynamically allocated structures which accelerate the lookup. It has the property that it still works if an allocation occasionally fails: that leaves you with fewer lookup accelerators, so searches will be a bit slower, but the structure remains completely functional.
i.e. you do pay the cost of dynamic allocation (which is a highly optimised path in the kernel), but it can be a GFP_ATOMIC allocation, which never blocks and occasionally might fail.
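
(Purely as an illustration of that point, and not code from any posted patch, an embedded skip-list node with an optional accelerator might look roughly like this; all names here are made up:)

    /* Illustrative sketch only; none of these names come from real code. */
    struct skiplist_node {
        struct list_head list;          /* level-0 sorted list, always present */
        struct skiplist_links *up;      /* optional accelerator; may be NULL */
    };

    struct skiplist_links {
        int levels;
        struct skiplist_node *next[];   /* one forward pointer per level */
    };

    static void skiplist_add_accel(struct skiplist_node *node, int levels)
    {
        /* GFP_ATOMIC never blocks and may fail; on failure the node is still
         * reachable through the level-0 list, so lookups keep working,
         * just a little more slowly. */
        node->up = kmalloc(sizeof(*node->up) +
                           levels * sizeof(struct skiplist_node *), GFP_ATOMIC);
        if (node->up)
            node->up->levels = levels;
    }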

I'm still waiting for skiplist to appear in the kernel :-)

https://lwn.net/Articles/551896/

