LWN: Comments on "The multi-generational LRU"
https://lwn.net/Articles/851184/
This is a special feed containing comments posted to the individual LWN article titled "The multi-generational LRU".

The multi-generational LRU
https://lwn.net/Articles/854451/
JohnDyson:

The multi-generational LRU seems very close to the scheme that I developed (along with David Greenman) back in the early 1990s. My scheme was not very scalable, but it allowed running X Windows in about 4MB on FreeBSD (actually usefully at 8MB). The scalability was good enough into the hundreds-of-MB range, but beyond that size it would have needed some algorithmic help. It sometimes made a substantial difference, sometimes not much at all. I forget what we called it, but it had both an active/inactive split and accumulating usage counts on a per-page basis. Pages would cycle through the queues and increment (or decrement) their usage counts as appropriate. Pages could become fairly sticky or be made eligible for quick reuse. I truly do not know whether FreeBSD still uses the scheme.

Mon, 26 Apr 2021 14:40:35 +0000
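As an illustration of the queue-cycling scheme JohnDyson describes above, here is a minimal, self-contained C sketch: pages move between an active and an inactive queue while a per-page usage count is incremented on reference and decayed when idle, so pages can become "sticky" or fall out quickly. Every name and parameter in it is hypothetical; this is not the historical FreeBSD code, just the idea in miniature.

/*
 * Toy model of the active/inactive scheme with per-page usage counts.
 * All names and parameters are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

#define NPAGES    8
#define MAX_USAGE 5   /* cap on how "sticky" a page can become */

struct page {
    int  usage;       /* accumulated usage count */
    bool referenced;  /* stand-in for the hardware reference bit */
    bool active;      /* which queue the page is on */
};

/* One pass over a queue: bump counts for referenced pages, decay the
 * rest, and demote pages whose count has reached zero. */
static void scan(struct page *pages, int n, bool active_queue)
{
    for (int i = 0; i < n; i++) {
        struct page *p = &pages[i];

        if (p->active != active_queue)
            continue;
        if (p->referenced) {
            if (p->usage < MAX_USAGE)
                p->usage++;       /* page grows stickier */
            p->referenced = false;
            p->active = true;     /* promote on use */
        } else if (p->usage > 0) {
            p->usage--;           /* decay while idle */
        } else {
            p->active = false;    /* eligible for quick reuse */
        }
    }
}

int main(void)
{
    struct page pages[NPAGES] = { 0 };

    /* Pages 0-2 stay hot; the others are touched only in round 0. */
    for (int round = 0; round < 6; round++) {
        for (int i = 0; i < NPAGES; i++)
            pages[i].referenced = (i < 3) || (round == 0);
        scan(pages, NPAGES, true);
        scan(pages, NPAGES, false);
    }
    for (int i = 0; i < NPAGES; i++)
        printf("page %d: usage=%d, %s\n", i, pages[i].usage,
               pages[i].active ? "active" : "inactive (reclaimable)");
    return 0;
}

After a few rounds, the hot pages sit on the active queue with saturated counts while the once-touched pages have decayed to zero and become reclaimable, which is the "sticky or quick reuse" behavior the comment describes.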
The multi-generational LRU
https://lwn.net/Articles/852459/
yuzhao@google.com:

Sorry for the late reply.

Your assumption is correct. But our "wildly different workloads" are only a fraction of all real-world workloads. AFAIK, we don't use PostgreSQL, and we do plan to reach out to the PostgreSQL community to see if they could help us with benchmarking.

And we don't use io_uring (yet). Jens Axboe, the io_uring maintainer, helped us identify a regression in buffered I/O and verify the fix. Now there is an improvement compared with the mainline: https://lore.kernel.org/linux-mm/20210413065633.2782273-1-yuzhao@google.com/

By design, the multigenerational LRU shouldn't make any workload worse: why should it make worse decisions after it has gathered more information? But that's just the theory :)

The bottom line is that we'll do our best to fix any regressions the community reports to us. Again, the email address is page-reclaim@google.com.

Tue, 13 Apr 2021 09:09:38 +0000

The multi-generational LRU
https://lwn.net/Articles/851931/
smurf:

These memory overheads seem rather minor, not to say insignificant, compared to the benefits.

Did you identify any workload that fares worse under your algorithm? One may assume that Google has quite a few wildly different workloads to experiment with …

Thu, 08 Apr 2021 14:01:10 +0000

The multi-generational LRU
https://lwn.net/Articles/851680/
ms-tg:

Thank you for this clarification of the trade-offs, and for the invitation for the community to ask further questions!

Mon, 05 Apr 2021 19:33:30 +0000

The multi-generational LRU and ELRU
https://lwn.net/Articles/851608/
hsu:

I presented a very similar algorithm, with further improvements, in my master's thesis in 1995, and did some analysis and performance evaluation of it through simulation models. I called the algorithm ELRU, for Extended Least Recently Used. The letter E also conveniently depicts the data structure in abstract form.

Please check my thesis for more. Some quick comments:

- I put new pages on an "insert list", not on the top list (the youngest generation). If new pages were put on the youngest-generation list, a large sequential access would essentially reduce the algorithm to plain LRU on smaller memories anyway. The insert list is dynamically defined, and I simulated different configurations (see the ELRU threshold). The simulation graphs on page 60 show why this is important, at least for my simulated workloads.

- Further, pages that are known to be of a specific type can be inserted into lists representing different priorities. If the kernel makes a wrong prediction, it is likely corrected by accesses bumping the pages to an upper list. I think I did not run simulations for this, though.

- I moved accessed pages one list higher, not straight to the youngest generation. If the access pattern is very hierarchical, which is often the case with filesystems, this gives better results; see the simulation graphs. I think putting active pages straight back on the youngest list would correspond to ELRU(n, 0), which has substantially lower performance.

- See priorities, section 3.3 in the thesis, which may be relevant to the page cache. I did not simulate them during my work; further testing might be useful. They would be simple to implement and would give high-priority data a better chance of staying in memory.

- I think I used a fixed set of lists for the simulations, but it can be dynamic, with new lists created when needed. I do not remember whether we did dynamic list allocation in the real implementation.

We also implemented the algorithm as part of the Shadows Database Project at Helsinki University of Technology (HUT), now part of Aalto University in Helsinki. Shadow paging (a concept similar to what is now called "copy on write") creates strongly hierarchical workloads, because page tables and metadata are accessed very frequently. The same might be true of other copy-on-write filesystems, such as ZFS, depending on the implementation. We considered implementing a filesystem using the code base created during the Shadows project, but it did not materialize because other work took all our time. The code can be released under a public source license, though it is database and userland code. Shadows had some useful ideas that I have not seen anywhere else, such as a RAID model that works with disks of different sizes and that works well with flash without wear levelling, since it naturally does large sequential block writes through garbage collection.

I think the master's thesis was published in a CS department publication, so it should count as prior art in case of IPR issues. I did not file patents at the time, so it is possibly safe to use.

I also did some literature research during the project, so there are references in the master's thesis as well; caching algorithms were already a pretty old field 25 years ago. The oldest material might not be googleable; I spent a lot of time in dusty libraries back then. These references can be a further source of prior art.

The full master's thesis is at

http://www.suonsivu.net/masters-thesis-Heikki-Suonsivu-1995.pdf

BibTeX and URN can be found at

https://aaltodoc.aalto.fi/handle/123456789/83375

Mon, 05 Apr 2021 07:15:34 +0000
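The structural core of the ELRU scheme described in the comment above, a dedicated insert list below the top plus one-level promotion on access, fits in a few lines. The sketch below is a reading aid under assumed names and parameters (NGENS, INSERT, scan), not the thesis implementation.

/*
 * Minimal sketch of the ELRU structure: NGENS generation lists plus a
 * dedicated insert list, with accessed pages promoted one level per
 * scan instead of jumping straight to the youngest generation.
 * All names and parameters are invented for the example.
 */
#include <stdio.h>

#define NGENS  4   /* generation lists; 0 is the oldest */
#define INSERT 1   /* the "insert list": where new pages enter */

struct page {
    int gen;         /* which generation list the page is on */
    int referenced;  /* stand-in for the hardware reference bit */
};

/* New pages enter below the top, so one large sequential sweep cannot
 * flush the hot generations (the degenerate-to-LRU case above). */
static void page_insert(struct page *p)
{
    p->gen = INSERT;
    p->referenced = 0;
}

/* One aging pass: promote referenced pages a single level, age the rest. */
static void scan(struct page *pages, int n)
{
    for (int i = 0; i < n; i++) {
        if (pages[i].referenced && pages[i].gen < NGENS - 1)
            pages[i].gen++;
        else if (!pages[i].referenced && pages[i].gen > 0)
            pages[i].gen--;
        pages[i].referenced = 0;
    }
}

int main(void)
{
    struct page pages[6];

    for (int i = 0; i < 6; i++)
        page_insert(&pages[i]);

    /* Pages 0 and 1 are hot; the rest were touched only once. */
    for (int round = 0; round < 3; round++) {
        pages[0].referenced = 1;
        pages[1].referenced = 1;
        scan(pages, 6);
    }
    for (int i = 0; i < 6; i++)
        printf("page %d: generation %d\n", i, pages[i].gen);
    return 0;
}

Because promotion is one level at a time, a page must prove itself across several scans before it reaches the youngest generation, which is exactly why this behaves better than jumping straight to the top on hierarchical access patterns.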
The multi-generational LRU
https://lwn.net/Articles/851605/
yuzhao@google.com:

Hi,

I'm one of the engineers working on this project. I want to clarify that the multigenerational LRU does NOT use more CPU. In fact, for all the workloads we could benchmark, it uses less, much less.

But then what's the catch? It retains more LRU information for each page, in spare space in page->flags. It also stores extra information for each memcg, node, and process. Specifically, it has ~500 bytes of per-memcg and per-node memory overhead and ~50 bytes of per-process memory overhead.

Please feel free to send your questions, suggestions, and concerns to page-reclaim@google.com.

Mon, 05 Apr 2021 04:47:25 +0000
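To illustrate what keeping LRU state "in spare space in page->flags" can look like: a small generation number is packed into otherwise-unused flag bits, so no extra per-page field is needed. The bit position and width below (LRU_GEN_SHIFT, LRU_GEN_WIDTH) are invented for this example and do not reflect the actual kernel layout.

/*
 * Illustration of packing a generation number into spare page-flag
 * bits. The field position and width are made up for the example.
 */
#include <stdio.h>

#define LRU_GEN_SHIFT 8
#define LRU_GEN_WIDTH 3                      /* up to 8 generations */
#define LRU_GEN_MASK  (((1UL << LRU_GEN_WIDTH) - 1) << LRU_GEN_SHIFT)

static unsigned long flags_set_gen(unsigned long flags, unsigned long gen)
{
    /* Clear the old field, then store the new generation in it. */
    return (flags & ~LRU_GEN_MASK) | ((gen << LRU_GEN_SHIFT) & LRU_GEN_MASK);
}

static unsigned long flags_get_gen(unsigned long flags)
{
    return (flags & LRU_GEN_MASK) >> LRU_GEN_SHIFT;
}

int main(void)
{
    unsigned long flags = 0;

    flags = flags_set_gen(flags, 5);
    printf("generation = %lu\n", flags_get_gen(flags));  /* prints 5 */
    return 0;
}

Since the bits were spare anyway, the per-page cost of this extra state is zero; the measurable overheads are the per-memcg, per-node, and per-process structures the comment lists.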
Multi-generational approach is better optimised for CPU.
https://lwn.net/Articles/851595/
xoddam:

The multi-generational version uses *less* CPU time, i.e. it "demonstrates 51% less CPU usage from kswapd" under one workload. This is possible because the existing code has suboptimal algorithms: it scans the global page array and must do costly reverse lookups for each page into the per-process page tables, touching tables that are not necessarily in cache for any other reason.

The new code improves on this by doing new-generation scans on active processes directly, improving cache locality on the "fast path", avoiding the reverse lookups altogether, making better decisions on which pages get evicted and when, and seemingly even reducing the amount of work that must be done when an eviction is necessary.

Sun, 04 Apr 2021 10:00:48 +0000

The multi-generational LRU
https://lwn.net/Articles/851594/
rsidd:

Anything that reclaims memory helps the OOM killer, right? Or rather, it helps avoid resorting to the OOM killer. One doesn't want to call the OOM killer unless things are desperate.

Sun, 04 Apr 2021 07:50:08 +0000

The multi-generational LRU
https://lwn.net/Articles/851593/
ibukanov:

This essentially tries to use more CPU time to predict the future by doing more work when managing pages. I guess with all those cores on modern CPUs, that can be afforded.

What is not clear is why the patch helps with the OOM killer. Is it because it helps to identify the apps that are expected not to be used?

Sun, 04 Apr 2021 07:13:20 +0000

The multi-generational LRU
https://lwn.net/Articles/851570/
matthias:

In the case of memory pressure, the time scales might be much closer to each other. It is entirely possible that a page is evicted almost immediately after use. But I would think that the nature of the storage device plays a much bigger role than the I/O scheduler. This is just a guess, though.

In the end, these are all heuristics, and it is impossible to test all possible combinations. Optimizing one system on its own is hard enough; the more systems you bring into the equation, the more combinations you get, and the combinatorial explosion prohibits testing them all. This is why it is important that such patches are tested on real-life workloads and real-life configurations. That way, at least the most important combinations of subsystem configurations are covered.

Sat, 03 Apr 2021 11:59:23 +0000

The multi-generational LRU
https://lwn.net/Articles/851569/
Sesse:

I would assume it doesn't matter, since they work on entirely different timescales. I/O schedulers work on times from microseconds up to milliseconds (or perhaps seconds on rotating rust); evicting pages works on times of seconds to hours. So what one does is unlikely to impact the other much.

Sat, 03 Apr 2021 11:39:36 +0000

The multi-generational LRU
https://lwn.net/Articles/851568/
linusw:

This is exactly the kind of content I want to see on LWN. Excellent article.

My first thought is that, as engineers, we are good at looking at one thing in isolation and optimizing it, as the economists say, "ceteris paribus" (all other things being equal).

What I'm asking myself is how other parts of the memory hierarchy affect and play with the LRU mechanism.

Especially the block layer. How do the previous code and this code perform depending on whether we use none, deadline, BFQ, or kyber as the block scheduler?

Sat, 03 Apr 2021 11:27:03 +0000
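Returning to the comparison xoddam draws above: the old approach iterates a global page array and performs a costly reverse lookup per page, while the multigenerational approach walks each active process's page tables directly. The sketch below shows only the shape of the direct walk; struct proc, struct pte, and scan_process are stand-ins invented for the example, not kernel types.

/*
 * Toy contrast of the two scan strategies. Old style (schematic, not
 * shown as code): for each page in the global array, find its mappings
 * via a reverse lookup that touches page tables scattered across
 * memory. New style (below): one cache-friendly pass over each active
 * process's page table, with no per-page reverse lookups.
 */
#include <stdbool.h>
#include <stdio.h>

#define NPROCS         2
#define PAGES_PER_PROC 4

struct pte { bool accessed; int gen; };

struct proc {
    struct pte table[PAGES_PER_PROC];  /* stand-in for real page tables */
};

/* Walk one process's table directly, promoting accessed pages to the
 * youngest generation and clearing their accessed bits. */
static void scan_process(struct proc *p, int max_gen)
{
    for (int i = 0; i < PAGES_PER_PROC; i++) {
        if (p->table[i].accessed) {
            p->table[i].gen = max_gen;
            p->table[i].accessed = false;
        }
    }
}

int main(void)
{
    struct proc procs[NPROCS] = { 0 };

    procs[0].table[1].accessed = true;   /* pretend one page was touched */
    for (int i = 0; i < NPROCS; i++)
        scan_process(&procs[i], 3);
    printf("proc 0, page 1: generation %d\n", procs[0].table[1].gen);
    return 0;
}

The cache-locality benefit comes from the loop touching one contiguous table per process instead of chasing per-page mappings scattered across memory, which is the "fast path" improvement xoddam's comment describes.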