
The multi-generational LRU

By Jonathan Corbet
April 2, 2021
One of the key tasks assigned to the memory-management subsystem is to optimize the system's use of the available memory; that means pushing out pages containing unused data so that they can be put to better use elsewhere. Predicting which pages will be accessed in the near future is a tricky task, and the kernel has evolved a number of mechanisms designed to improve its chances of guessing right. But the kernel not only often gets it wrong; it can also expend a lot of CPU time making the incorrect choice. The multi-generational LRU patch set posted by Yu Zhao is an attempt to improve that situation.

In general, the kernel cannot know which pages will be accessed in the near future, so it must rely on the next-best indicator: the set of pages that have been used recently. Chances are that pages that have been accessed in the recent past will be useful again in the future, but there are exceptions. Consider, for example, an application that is reading sequentially through a file. Each page of the file will be put into the page cache as it is read, but the application will never need it again; in this case, recent access is not a sign that the page will be used again soon.

The kernel tracks pages using a pair of least-recently-used (LRU) lists. Pages that have been recently accessed are kept on the "active" list, with just-accessed pages put at the head of the list. Pages are taken off the tail of the list if they have not been accessed recently and placed at the head of the "inactive" list. That list is a sort of purgatory; if some process accesses a page on the inactive list, it will be promoted back to the active list. Some pages, like those from the sequentially read file described above, start life on the inactive list, meaning they will be reclaimed relatively quickly if there is no further need for them.
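The two-list scheme can be sketched in a few lines of Python. The list names come from the article; the class, its methods, and the deque-based layout are invented purely for illustration and are not the kernel's actual data structures:

```python
from collections import deque

# A toy model of the active/inactive two-list scheme described above.
# Illustrative sketch only; not the kernel's real implementation.
class TwoListLRU:
    def __init__(self):
        self.active = deque()    # recently-used pages; head is most recent
        self.inactive = deque()  # reclaim candidates ("purgatory")

    def add_page(self, page):
        # New pages (e.g. from a sequential file read) start on the
        # inactive list, so one-shot data is reclaimed quickly.
        self.inactive.appendleft(page)

    def access(self, page):
        # Touching an inactive page promotes it back to the active list;
        # touching an active page moves it to the head of that list.
        if page in self.inactive:
            self.inactive.remove(page)
            self.active.appendleft(page)
        elif page in self.active:
            self.active.remove(page)
            self.active.appendleft(page)

    def age_one(self):
        # Demote the coldest active page to the head of the inactive list.
        if self.active:
            self.inactive.appendleft(self.active.pop())

    def reclaim(self):
        # Reclaim happens from the tail of the inactive list.
        return self.inactive.pop() if self.inactive else None
```

A page read once and never touched again thus sits on the inactive list until it falls off the tail, while an accessed page gets a second chance on the active list.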

There are more details, of course. It's worth noting that there are actually two pairs of lists, one for anonymous pages and one for file-backed pages. If memory control groups are in use, there is a whole set of LRU lists for each active group.

Zhao's patch set identifies a number of problems with the current state of affairs. The active/inactive sorting is too coarse for accurate decision making, and pages often end up on the wrong lists anyway. The use of independent lists in control groups makes it hard for the kernel to compare the relative age of pages across groups. The kernel has a longstanding bias toward evicting file-backed pages for a number of reasons, which can cause useful file-backed pages to be tossed while idle anonymous pages remain in memory. This problem has gotten worse in cloud-computing environments, where clients have relatively little local storage and, thus, relatively few file-backed pages in the first place. Meanwhile, the scanning of anonymous pages is expensive, partly because it uses a complex reverse-mapping mechanism that does not perform well when a lot of scanning must be done.

Closing the generation gap

The multi-generational LRU patches try to address these problems with two fundamental changes:

  • Add more LRU lists to cover a range of page ages between the current active and inactive lists; these lists are called "generations".
  • Change the way page scanning is done to reduce its overhead.

Newly activated pages are assigned to the youngest generation (though there are some exceptions described below). Over time, the memory-management subsystem will scan over a process's pages to determine whether each has been used since the last scan; any that have remained idle are moved to the next older generation. Pages of any generation that show activity are moved back to the youngest generation.
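That aging pass can be sketched as follows. This is a simplified model assuming a fixed generation count; the dictionary layout and names are invented for illustration:

```python
# A sketch of one generational aging pass. MAX_GEN and the dict layout
# are invented for illustration; generation 0 is the youngest.
MAX_GEN = 4

def aging_pass(pages):
    """pages maps page -> (generation, accessed-since-last-scan bit)."""
    for page, (gen, accessed) in pages.items():
        if accessed:
            # Any activity sends the page back to the youngest generation.
            pages[page] = (0, False)
        else:
            # Idle pages drift one generation older, saturating at the oldest.
            pages[page] = (min(gen + 1, MAX_GEN - 1), False)
```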

The result of this work is a spectrum of page ages, from those quite recently accessed to those that have not been used in some time. The number of generations is a kernel configuration option; it appears to range from as few as four on phones to several times that on cloud-based servers.

When the time comes to reclaim pages, only the oldest generation need be considered. The "oldest generation" can be different for anonymous and file-backed pages; anonymous pages can be harder to reclaim in general (they must always be written to swap) and the new code retains some of the bias toward reclaiming file-backed pages more aggressively. So file-backed pages may not escape reclaim for as many generations as anonymous pages do. The current patch only allows reclaim of file-backed pages to get one generation ahead of that for anonymous pages, though.
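In outline, that one-generation skew between the two page types might look like this; the function and data layout are invented, and only the rule itself comes from the article:

```python
# A sketch of the file/anonymous reclaim skew: file-backed pages become
# reclaim candidates one generation earlier than anonymous pages.
MAX_GEN = 4  # index 0 is the youngest generation

def reclaim_candidates(anon_gens, file_gens):
    """Each argument is a list of per-generation page lists."""
    candidates = list(anon_gens[MAX_GEN - 1])   # oldest anonymous pages only
    candidates += file_gens[MAX_GEN - 1]        # oldest file-backed pages
    candidates += file_gens[MAX_GEN - 2]        # plus one generation ahead
    return candidates
```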

The multi-generational mechanism, it is claimed, is more accurate than the current two-list approach; by the time a page makes it to the oldest generation, its chances of being unneeded are rather higher than they are for pages on the inactive list. That, in turn, means that these pages can be reclaimed more aggressively, making more memory available for tasks that will actually make use of it. This mechanism allows for ready comparison of the ages of anonymous and file-backed pages, and, by tracking the creation time of each generation, of the ages of pages in different control groups; this information is lost in current kernels. That, in turn, makes it easier to identify and reclaim idle anonymous pages.
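Tracking a creation timestamp per generation is what makes the cross-group comparison possible. In sketch form, with the tuple layout invented for illustration:

```python
# A sketch of cross-control-group age comparison: if every generation
# records when it was created, the groups holding the stalest pages can
# be ordered directly. The (name, timestamp) layout is invented.
def groups_by_staleness(groups):
    """groups: list of (name, oldest_generation_birth_time) tuples."""
    return sorted(groups, key=lambda g: g[1])
```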

The other claimed advantage is in the change to how pages are scanned. Pages are accessed via the page-table entries (PTEs) in every process that has them mapped; the "recently accessed" bit lives in those page-table entries. Current kernels, though, scan through the pages themselves, and must use reverse-mapping to find and check the associated PTEs; that is expensive. The multi-generational LRU code, instead, scans over PTEs directly, an approach with better locality. A hook in the scheduler helps to track processes that have actually run since the last scan, so idle processes can be skipped.
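The shape of the new scan, including the scheduler hook that lets idle processes be skipped, might be sketched like this; all of the data structures here are invented for illustration:

```python
# A sketch of the PTE-oriented scan: walk each recently-run process's
# page table in order, instead of reverse-mapping from each page to its
# PTEs. The list-of-tuples "page table" is invented for illustration.
def pte_scan(page_tables, ran_since_last_scan):
    """page_tables: per-process lists of (page, accessed_bit) entries;
    ran_since_last_scan: indices of processes the scheduler hook saw run."""
    accessed = set()
    for proc in ran_since_last_scan:          # idle processes are skipped
        for page, bit in page_tables[proc]:   # sequential, cache-friendly walk
            if bit:
                accessed.add(page)
    return accessed
```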

The multi-generational LRU also benefits from skipping many of the heuristics that are used in current kernels to decide which pages should be reclaimed. There are still a few, though. For example, when a page is first established, its generation is picked with these rules:

  • Pages that are being faulted in are assigned to the youngest generation, as one would expect.
  • Pages that are activated while unmapped (resident in memory but with no PTEs pointing to them; these can include pages chosen for reclaim but not actually reclaimed before being referenced again) are added to the second-youngest generation. This is seemingly done to avoid making the youngest generation look too big, which might delay further page scanning until the next generation can be created.
  • Pages that are being reclaimed, but which must persist while their contents are written to backing store, are added to the second-oldest generation. That prevents another attempt to reclaim them while the writeback is underway.
  • Pages that are being deactivated go into the oldest generation. That is also the fate of pages that were brought in by the readahead mechanism; reading those pages is a speculative act on the kernel's part in the first place, with no guarantee that they will ever be useful.
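Condensed into a single decision function, the placement rules look something like this; the state names and constants are invented for illustration, and only the rules themselves come from the list above:

```python
# The initial-placement heuristics as a single decision function.
# Generation 0 is the youngest; MAX_GEN - 1 is the oldest.
MAX_GEN = 4

def initial_generation(state):
    if state == "faulted_in":
        return 0               # youngest: freshly faulted pages are hot
    if state == "activated_unmapped":
        return 1               # avoid inflating the youngest generation
    if state == "under_writeback":
        return MAX_GEN - 2     # don't retry reclaim during writeback
    if state in ("deactivated", "readahead"):
        return MAX_GEN - 1     # oldest: speculative or demoted pages
    raise ValueError(f"unknown page state: {state}")
```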

There are a few knobs exported to user space to control this mechanism, including the ability to turn the multi-generational code off entirely; see this documentation patch for more information.

Generational change

The end result of all this work, it is claimed, is that page reclaim is much more efficient and better targeted than before. Systems like Android, when using this code, record fewer low-memory kills (when an app process is killed due to memory pressure), Chrome OS shows fewer out-of-memory kills, and server systems are better able to use available memory. It looks like an improvement all around.

Given that, one might wonder why the multi-generational algorithm is kept separate from the rest of the memory-management code and is made optional. It is, in essence, an independent approach to page aging and reclaim that exists alongside the current LRU lists. The answer, presumably, is that there are a lot of workloads out there, and some of them may not benefit from the multi-generational approach. There will need to be a lot more testing done to learn where the multi-generational LRU falls down and what might need to be done to keep that from happening.

The multi-generational LRU might eventually win over the memory-management developers, most of whom have not yet commented on this patch set. It does seem likely, though, that it will need to demonstrate better performance (or at least a lack of performance regressions) across the bulk of the workloads out there, to the point that it could be considered as a replacement for the current LRU rather than an addition to it. The idea of maintaining two separate LRU schemes is going to be a hard sell in the kernel community; it would be far better to just switch over completely to the multi-generational LRU if it is truly better.

Answering that question is certain to be a long process. Even relatively small memory-management changes can take a while to merge; it is just too easy to introduce performance penalties for some users. This change is not "relatively small", so the bar for inclusion will be high. But if the multi-generational LRU lives up to its claims, it may just be able to clear that bar — eventually.

Index entries for this article
Kernel: Memory management/Page replacement algorithms



The multi-generational LRU

Posted Apr 3, 2021 11:27 UTC (Sat) by linusw (subscriber, #40300) [Link] (2 responses)

This is exactly the kind of content I want to see in LWN. Excellent article.

My first thought is that, as engineers, we are good at looking at one thing in isolation and optimizing it; as the economists say, "ceteris paribus" (all other things being equal).

What I'm asking myself is how other parts of the memory hierarchy affect and play with the LRU mechanism.

Especially the block layer. How does the previous code and this code perform depending on whether we use none, deadline, BFQ or kyber as block scheduler?

The multi-generational LRU

Posted Apr 3, 2021 11:39 UTC (Sat) by Sesse (subscriber, #53779) [Link] (1 responses)

I would assume it doesn't matter, since they work on entirely different timescales. I/O schedulers work on times from microseconds up to milliseconds (or perhaps seconds on rotating rust), evicting pages works on times of seconds to hours. So what one does is unlikely to impact the other one much.

The multi-generational LRU

Posted Apr 3, 2021 11:59 UTC (Sat) by matthias (subscriber, #94967) [Link]

In the case of memory pressure, the time scales might be much closer to each other. It is entirely possible that a page is evicted almost immediately after usage. But I would think that the nature of the storage device would play a much bigger role than the I/O scheduler. But this is just a guess.

In the end, these are all heuristics and it is impossible to test all possible combinations. Optimizing one system on its own is hard enough. The more systems you bring into the equation, the more combinations you get, and the combinatorial explosion prohibits testing them all. This is why it is important that such patches are tested on real-life workloads and real-life configurations. That way, at least the most important combinations of subsystem configurations are covered.

The multi-generational LRU

Posted Apr 4, 2021 7:13 UTC (Sun) by ibukanov (subscriber, #3942) [Link] (6 responses)

This essentially tries to use more CPU time to predict the future by doing more work to manage the pages. I guess with all those cores on modern CPUs that can be afforded.

What is not clear is why the patch helps with the OOM killer. Is it because it helps to identify the apps that are expected not to be used?

The multi-generational LRU

Posted Apr 4, 2021 7:50 UTC (Sun) by rsidd (subscriber, #2582) [Link]

Anything that reclaims memory helps the OOM killer, right? Or rather, helps avoid resorting to the OOM killer. One doesn't want to call the OOM killer unless it's desperate.

Multi-generational approach is better optimised for CPU.

Posted Apr 4, 2021 10:00 UTC (Sun) by xoddam (guest, #2322) [Link]

The multi-generational version uses *less* CPU time, i.e. it "demonstrates 51% less CPU usage from kswapd" under one workload. This is possible because the existing code has suboptimal algorithms that scan the global page array and must do costly reverse lookups for each page into the per-process page tables, touching tables that are not necessarily in cache for any other reason.

The new code improves on this by doing new-generation scans on active processes directly, improving cache locality on the "fast path", avoiding the reverse lookups altogether, making better decisions on what pages get evicted and when, and seems even to reduce the amount of work that must be done when an eviction is necessary.

The multi-generational LRU

Posted Apr 5, 2021 4:47 UTC (Mon) by yuzhao@google.com (guest, #132005) [Link] (3 responses)

Hi,

I'm one of the engineers working on this project. I want to clarify that multigenerational lru does NOT use more CPU. In fact, for all the workloads we could benchmark, it uses less, much less.

But then what's the catch? It retains more (lru) information for each page, in spare space in page->flags. It also stores extra information for each memcg, node, and process. Specifically, it has ~500 bytes per-memcg and per-node and ~50 bytes per-process memory overhead.

Please feel free to send your questions, suggestions, and concerns to page-reclaim@google.com.

The multi-generational LRU

Posted Apr 5, 2021 19:33 UTC (Mon) by ms-tg (subscriber, #89231) [Link]

Thank you for this clarification of the trade-offs and invitation for the community to ask further questions!

The multi-generational LRU

Posted Apr 8, 2021 14:01 UTC (Thu) by smurf (subscriber, #17840) [Link] (1 responses)

These memory overheads seem rather minor, not to say insignificant, compared to the benefits.

Did you identify any workload that fares worse under your algorithm? One may assume that Google has quite a few wildly different workloads to experiment with …

The multi-generational LRU

Posted Apr 13, 2021 9:09 UTC (Tue) by yuzhao@google.com (guest, #132005) [Link]

Sorry for the late reply.

Your assumption is correct. But our "wildly different workloads" are only a fraction of all real-world workloads. AFAIK, we don't use Postgresql, and we do plan to reach out to the Postgresql community and see if they could help us with benchmarking.

And we don't use io_uring (yet). Jens Axboe, the io_uring maintainer, helped us identify a regression in buffered I/O and verify the fix. Now there is an improvement compared with the mainline: https://lore.kernel.org/linux-mm/20210413065633.2782273-1...

By design, the multigenerational LRU shouldn't make any workload worse -- why should it make worse decisions after it gathered more information? But that's just the theory :)

The bottomline is we'll do our best to fix any regressions the community reports to us. Again, the email address is page-reclaim@google.com.

The multi-generational LRU and ELRU

Posted Apr 5, 2021 7:15 UTC (Mon) by hsu (guest, #151475) [Link]

I presented a very similar algorithm, with further improvements, in my master's thesis in 1995, and did some analysis and performance evaluation of it through simulation models. I called the algorithm ELRU, for Extended Least Recently Used algorithm. The letter E also conveniently depicts the data structure in abstract form.

Please check my thesis for more stuff. Some quick comments:

- I put the new pages on an "insert list", not the top list (youngest generation). If new pages are put on the youngest-generation list, smaller memories would essentially reduce the algorithm to the LRU case anyway when a large sequential access happens. The insert list would be dynamically defined, and I simulated different configurations (see the ELRU threshold). The simulation graphics on page 60 show why this is important, at least for my simulated workloads.

- Further, pages which are known to be of a specific type can be inserted into lists representing different priorities. If the kernel makes a wrong prediction, it likely gets corrected by accesses bumping the pages to an upper list. I think I did not do simulations for this, though.

- I moved accessed pages one list higher, not straight to the youngest generation. If the access pattern is very hierarchical, which is often the case with filesystems, you get better results this way. See the simulation graphs. I think putting active pages straight back on the youngest list would end up corresponding to ELRU(n, 0), which has substantially lower performance.

- See priorities, section 3.3 in the thesis, which may be relevant for the page cache. I did not simulate it during my work; further testing might be useful. It would be simple to implement, to give high-priority material a better chance to stay in memory.

- I think for the simulations I used a fixed set of lists, but it can be dynamic, with new lists created when needed. I do not remember if we did dynamic list allocation in the real implementation.

We also implemented the algorithm as part of the Shadows Database Project at Helsinki University of Technology (HUT), now part of Aalto University in Helsinki. Shadow paging (a concept similar to what is now called "copy on write") creates strongly hierarchical workloads, because page tables and metadata are accessed very frequently. The same might be true for other copy-on-write filesystems as well, such as ZFS, depending on the implementation. We considered implementing a filesystem using the code base created during the Shadows project, but it did not materialize due to other things taking all the time. The code can be released under a public source license, though it is database and userland code. Shadows had some useful ideas which I have not seen anywhere else, such as a RAID model that works with disks of different sizes and works well with flash without wear leveling, as it naturally does large sequential block writes through garbage collection.

I think the master's thesis was published in the CS department's publication series, so it should count as prior art in case of IPR issues. I did not file patents at the time, so it is possibly safe to use.

I also did some literature research during the project, so there are some references in the thesis as well; caching algorithms were already a pretty old field 25 years ago. The oldest material might not be googleable; I spent a lot of time in dusty libraries back then. This can be a further source of prior art.

The full Master's thesis is at

http://www.suonsivu.net/masters-thesis-Heikki-Suonsivu-19...

BibTex and URN can be found at

https://aaltodoc.aalto.fi/handle/123456789/83375

The multi-generational LRU

Posted Apr 26, 2021 14:40 UTC (Mon) by JohnDyson (guest, #151896) [Link]

The multi-generational LRU seems very close to a scheme that I developed (along with David Greenman) back in the early 1990s. My scheme was not very scalable, but allowed running X Windows in about 4MB on FreeBSD (actually usable at 8MB). The scalability was good enough into the 100s-of-MB range, but beyond that size it would need some algorithmic help. It sometimes made a substantial difference, sometimes not much at all. I forget what we called it, but it had both an active/inactive pair along with accumulating usage counts on a per-page basis. The pages would cycle through the queues and increment (or decrement) usage counts as appropriate. Pages could become fairly sticky or be made eligible for quick reuse. I truly do not know if FreeBSD still uses the scheme.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds