The multi-generational LRU
In general, the kernel cannot know which pages will be accessed in the near future, so it must rely on the next-best indicator: the set of pages that have been used recently. Chances are that pages that have been accessed in the recent past will be useful again in the future, but there are exceptions. Consider, for example, an application that is reading sequentially through a file. Each page of the file will be put into the page cache as it is read, but the application will never need it again; in this case, recent access is not a sign that the page will be used again soon.
The kernel tracks pages using a pair of least-recently-used (LRU) lists. Pages that have been recently accessed are kept on the "active" list, with just-accessed pages put at the head of the list. Pages are taken off the tail of the list if they have not been accessed recently and placed at the head of the "inactive" list. That list is a sort of purgatory; if some process accesses a page on the inactive list, it will be promoted back to the active list. Some pages, like those from the sequentially read file described above, start life on the inactive list, meaning they will be reclaimed relatively quickly if there is no further need for them.
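As a rough, userspace-only illustration of that two-list scheme (not code from the kernel), consider the following sketch; the structure and function names are invented for the example, and the policy is deliberately simplified:

```c
/*
 * A minimal, userspace-only sketch of the two-list (active/inactive)
 * scheme described above; names and policy details are simplified and
 * this is not kernel code.
 */
#include <stdio.h>

#define NPAGES 6

struct toy_page { int id; struct toy_page *next; };

static struct toy_page *active, *inactive;

static void push(struct toy_page **head, struct toy_page *p)
{
    p->next = *head;
    *head = p;
}

/* Unlink and return the tail (the least-recently-added page). */
static struct toy_page *pop_tail(struct toy_page **head)
{
    struct toy_page **pp = head;

    if (!*pp)
        return NULL;
    while ((*pp)->next)
        pp = &(*pp)->next;
    struct toy_page *tail = *pp;
    *pp = NULL;
    return tail;
}

int main(void)
{
    static struct toy_page pages[NPAGES];

    /* New pages enter the inactive list in this toy model. */
    for (int i = 0; i < NPAGES; i++) {
        pages[i].id = i;
        push(&inactive, &pages[i]);
    }

    /* An access while on the inactive list promotes the page. */
    push(&active, pop_tail(&inactive));

    /* Under pressure, the tail of the inactive list is reclaimed first. */
    struct toy_page *victim = pop_tail(&inactive);
    printf("promoted page %d, reclaiming page %d\n",
           active->id, victim->id);
    return 0;
}
```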
There are more details, of course. It's worth noting that there are actually two pairs of lists, one for anonymous pages and one for file-backed pages. If memory control groups are in use, there is a whole set of LRU lists for each active group.
Zhao's patch set identifies a number of problems with the current state of affairs. The active/inactive sorting is too coarse for accurate decision making, and pages often end up on the wrong lists anyway. The use of independent lists in control groups makes it hard for the kernel to compare the relative age of pages across groups. The kernel has a longstanding bias toward evicting file-backed pages for a number of reasons, which can cause useful file-backed pages to be tossed while idle anonymous pages remain in memory. This problem has gotten worse in cloud-computing environments, where clients have relatively little local storage and, thus, relatively few file-backed pages in the first place. Meanwhile, the scanning of anonymous pages is expensive, partly because it uses a complex reverse-mapping mechanism that does not perform well when a lot of scanning must be done.
Closing the generation gap
The multi-generational LRU patches try to address these problems with two fundamental changes:
- Add more LRU lists to cover a range of page ages between the current active and inactive lists; these lists are called "generations".
- Change the way page scanning is done to reduce its overhead.
Newly activated pages are assigned to the youngest generation (though there are some exceptions described below). Over time, the memory-management subsystem will scan over a process's pages to determine whether each has been used since the last scan; any that have remained idle are moved to the next older generation. Pages of any generation that show activity are moved back to the youngest generation.
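The aging idea can be sketched in the same hedged way; the code below models the description above rather than the patch set's actual data structures, and every name in it is invented:

```c
/*
 * Toy model of the aging pass: each page carries a generation index, a
 * scan moves idle pages one generation older and referenced pages back
 * to the youngest generation.  Not the patch's real implementation.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_GENS 4          /* e.g. four generations, as on phones */
#define NPAGES  5

struct toy_page {
    int id;
    int gen;                /* 0 = youngest, NR_GENS - 1 = oldest */
    bool referenced;        /* stands in for the PTE accessed bit */
};

/* One aging pass over a "process's" pages. */
static void age_pages(struct toy_page *pages, int n)
{
    for (int i = 0; i < n; i++) {
        if (pages[i].referenced) {
            pages[i].gen = 0;              /* back to the youngest */
            pages[i].referenced = false;   /* clear for the next scan */
        } else if (pages[i].gen < NR_GENS - 1) {
            pages[i].gen++;                /* one generation older */
        }
    }
}

int main(void)
{
    struct toy_page pages[NPAGES] = {
        { .id = 0 }, { .id = 1 }, { .id = 2 }, { .id = 3 }, { .id = 4 },
    };

    /* Run a few scans; only page 2 keeps getting touched. */
    for (int scan = 0; scan < 3; scan++) {
        pages[2].referenced = true;
        age_pages(pages, NPAGES);
    }

    for (int i = 0; i < NPAGES; i++)
        printf("page %d is in generation %d\n", i, pages[i].gen);
    return 0;
}
```

After a few scans, the idle pages have drifted into the oldest generation while the frequently touched page stays in the youngest, which is the spectrum of ages described next.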
The result of this work is a spectrum of page ages, from those quite recently accessed to those that have not been used in some time. The number of generations can be configured into the kernel; that number seems to range from as few as four for phones to several times that for cloud-based servers.
When the time comes to reclaim pages, only the oldest generation need be considered. The "oldest generation" can be different for anonymous and file-backed pages; anonymous pages can be harder to reclaim in general (they must always be written to swap) and the new code retains some of the bias toward reclaiming file-backed pages more aggressively. So file-backed pages may not escape reclaim for as many generations as anonymous pages do. The current patch only allows reclaim of file-backed pages to get one generation ahead of that for anonymous pages, though.
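Continuing the toy model above, that bias could be expressed as a small eligibility check; the constant and the logic here are illustrative only and are not taken from the patch:

```c
#include <stdbool.h>
#include <stdio.h>

/* Oldest generation index in the toy model from the previous sketch. */
#define OLDEST_GEN 3

/*
 * Is a page in generation 'gen' a candidate for eviction?  In this toy,
 * file-backed pages may be reclaimed one generation earlier than
 * anonymous ones, mirroring the bias described above.
 */
static bool evictable(int gen, bool file_backed)
{
    return gen >= (file_backed ? OLDEST_GEN - 1 : OLDEST_GEN);
}

int main(void)
{
    printf("file page in gen 2: %d\n", evictable(2, true));   /* 1 */
    printf("anon page in gen 2: %d\n", evictable(2, false));  /* 0 */
    return 0;
}
```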
The multi-generational mechanism, it is claimed, is more accurate than the current two-list approach; by the time a page makes it to the oldest generation, its chances of being unneeded are rather higher than they are for pages on the inactive list. That, in turn, means that these pages can be reclaimed more aggressively, making more memory available for tasks that will actually make use of it. This mechanism allows for ready comparison of the ages of anonymous and file-backed pages, and, by tracking the creation time of each generation, of the ages of pages in different control groups; this information is lost in current kernels. That, in turn, makes it easier to identify and reclaim idle anonymous pages.
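One way to picture the cross-group comparison is to imagine each generation carrying a creation timestamp; the structures below are invented for illustration and are not the patch's actual bookkeeping:

```c
/*
 * Toy sketch: if each generation records when it was created, the
 * oldest pages of two control groups can be compared by wall-clock age
 * even though their LRU lists are separate.
 */
#include <stdio.h>
#include <time.h>

#define NR_GENS 4

struct toy_lruvec {                 /* one per control group in this toy */
    time_t gen_birth[NR_GENS];      /* when each generation was created */
    int oldest;                     /* index of the oldest generation */
};

/* An older birth time means older pages, regardless of the group. */
static int older_group(const struct toy_lruvec *a, const struct toy_lruvec *b)
{
    return a->gen_birth[a->oldest] <= b->gen_birth[b->oldest] ? 0 : 1;
}

int main(void)
{
    time_t now = time(NULL);
    struct toy_lruvec g0 = {
        .gen_birth = { now - 40, now - 30, now - 20, now - 10 },
        .oldest = 0,
    };
    struct toy_lruvec g1 = {
        .gen_birth = { now - 25, now - 15, now - 5, now },
        .oldest = 0,
    };

    printf("group %d holds the older pages\n", older_group(&g0, &g1));
    return 0;
}
```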
The other claimed advantage is in the change to how pages are scanned. Pages are accessed via the page-table entries (PTEs) in every process that has them mapped; the "recently accessed" bit lives in those page-table entries. Current kernels, though, scan through the pages themselves, and must use reverse-mapping to find and check the associated PTEs; that is expensive. The multi-generational LRU code, instead, scans over PTEs directly, an approach with better locality. A hook in the scheduler helps to track processes that have actually run since the last scan, so idle processes can be skipped.
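The change in scan direction can also be sketched in user space; the structures, the scheduler flag, and the scan function below are all invented for the example:

```c
/*
 * Toy illustration of a forward scan over a process's PTEs, as opposed
 * to working backward from pages via reverse mapping.  Idle processes
 * are skipped, as the scheduler hook described above allows.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_page { int id; int gen; };

struct toy_pte {
    bool accessed;              /* hardware-set "recently accessed" bit */
    struct toy_page *page;
};

struct toy_mm {
    bool ran_since_last_scan;   /* stands in for the scheduler hook */
    struct toy_pte ptes[4];
};

/* Forward scan: walk the page tables of processes that actually ran. */
static void scan_mm(struct toy_mm *mm)
{
    if (!mm->ran_since_last_scan)
        return;                          /* idle process: skip entirely */

    for (int i = 0; i < 4; i++) {
        struct toy_pte *pte = &mm->ptes[i];

        if (pte->page && pte->accessed) {
            pte->page->gen = 0;          /* seen recently: youngest gen */
            pte->accessed = false;       /* clear for the next pass */
        }
    }
    mm->ran_since_last_scan = false;
}

int main(void)
{
    struct toy_page a = { .id = 0, .gen = 2 }, b = { .id = 1, .gen = 2 };
    struct toy_mm mm = {
        .ran_since_last_scan = true,
        .ptes = { { .accessed = true, .page = &a }, { .page = &b } },
    };

    scan_mm(&mm);
    printf("page %d gen %d, page %d gen %d\n", a.id, a.gen, b.id, b.gen);
    return 0;
}
```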
The multi-generational LRU also benefits from skipping many of the heuristics that are used in current kernels to decide which pages should be reclaimed. There are still a few, though. For example, when a page is first established, its generation is picked with these rules (condensed into a sketch after the list):
- Pages that are being faulted in are assigned to the youngest generation, as one would expect.
- Pages that are activated while unmapped (pages resident in memory but with no PTEs pointing to them; these can include pages chosen for reclaim but not actually reclaimed before being referenced again) are added to the second-youngest generation. This is seemingly done to avoid making the youngest generation look too big, which might delay further page scanning until the next generation can be created.
- Pages that are being reclaimed, but which must persist while their contents are written to backing store, are added to the second-oldest generation. That prevents another attempt to reclaim them while the writeback is underway.
- Pages that are being deactivated go into the oldest generation. That is also the fate of pages that were brought in by the readahead mechanism; reading those pages is a speculative act on the kernel's part in the first place, with no guarantee that they will ever be useful.
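Those placement rules, condensed into a single toy function; the enumeration and the generation indices are invented for illustration, and the real patch expresses this differently:

```c
#include <stdio.h>

#define NR_GENS 4
#define YOUNGEST        0
#define SECOND_YOUNGEST 1
#define SECOND_OLDEST   (NR_GENS - 2)
#define OLDEST          (NR_GENS - 1)

enum page_event {
    PAGE_FAULTED_IN,         /* brought in by a page fault */
    PAGE_ACTIVATED_UNMAPPED, /* referenced again while unmapped */
    PAGE_UNDER_WRITEBACK,    /* being written out during reclaim */
    PAGE_DEACTIVATED,        /* explicitly deactivated */
    PAGE_READAHEAD,          /* speculatively read ahead */
};

/* Starting generation for a new page, per the rules listed above. */
static int starting_generation(enum page_event ev)
{
    switch (ev) {
    case PAGE_FAULTED_IN:
        return YOUNGEST;
    case PAGE_ACTIVATED_UNMAPPED:
        return SECOND_YOUNGEST;
    case PAGE_UNDER_WRITEBACK:
        return SECOND_OLDEST;
    case PAGE_DEACTIVATED:
    case PAGE_READAHEAD:
    default:
        return OLDEST;
    }
}

int main(void)
{
    printf("readahead page starts in generation %d\n",
           starting_generation(PAGE_READAHEAD));
    return 0;
}
```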
There are a few knobs exported to user space to control this mechanism, including the ability to turn the multi-generational code off entirely; see this documentation patch for more information.
Generational change
The end result of all this work, it is claimed, is that page reclaim is much more efficient and better targeted than before. Systems like Android, when using this code, record fewer low-memory kills (when an app process is killed due to memory pressure), Chrome OS shows fewer out-of-memory kills, and server systems are better able to use available memory. It looks like an improvement all around.
Given that, one might wonder why the multi-generational algorithm is kept separate from the rest of the memory-management code and is made optional. It is, in essence, an independent approach to page aging and reclaim that exists alongside the current LRU lists. The answer, presumably, is that there are a lot of workloads out there, and some of them may not benefit from the multi-generational approach. There will need to be a lot more testing done to learn where the multi-generational LRU falls down and what might need to be done to keep that from happening.
The multi-generational LRU might eventually win over the memory-management developers, most of whom have not yet commented on this patch set. It does seem likely, though, that it will need to demonstrate better performance (or at least a lack of performance regressions) across the bulk of the workloads out there, to the point that it could be considered as a replacement for the current LRU rather than an addition to it. The idea of maintaining two separate LRU schemes is going to be a hard sell in the kernel community; it would be far better to just switch over completely to the multi-generational LRU if it is truly better.
Answering that question is certain to be a long process. Even relatively small memory-management changes can take a while to merge; it is just too easy to introduce performance penalties for some users. This change is not "relatively small", so the bar for inclusion will be high. But if the multi-generational LRU lives up to its claims, it may just be able to clear that bar — eventually.
Index entries for this article
Kernel: Memory management/Page replacement algorithms
Posted Apr 3, 2021 11:27 UTC (Sat) by linusw (subscriber, #40300)
My first thought is that, as engineers, we are good at looking at one thing in isolation and optimizing it; as the economists say, "ceteris paribus" (all other things being equal).
What I'm asking myself is how other parts of the memory hierarchy affect and interact with the LRU mechanism.
Especially the block layer: how do the previous code and this code perform depending on whether we use none, deadline, BFQ or kyber as the block scheduler?
Posted Apr 3, 2021 11:39 UTC (Sat) by Sesse (subscriber, #53779)
Posted Apr 3, 2021 11:59 UTC (Sat) by matthias (subscriber, #94967)
In the end, these are all heuristics, and it is impossible to test every possible combination. Optimizing one system on its own is hard enough; the more systems you bring into the equation, the more combinations you get, and the combinatorial explosion prohibits testing them all. This is why it is important that such patches are tested on real-life workloads and real-life configurations. That way, at least the most important combinations of subsystem configurations are covered.
Posted Apr 4, 2021 7:13 UTC (Sun) by ibukanov (subscriber, #3942)
What is not clear is why the patch helps with the OOM killer. Is it because it helps to identify the apps that are expected not to be used?
Posted Apr 4, 2021 7:50 UTC (Sun) by rsidd (subscriber, #2582)
Posted Apr 4, 2021 10:00 UTC (Sun) by xoddam (guest, #2322)
The new code improves on this by doing new-generation scans on active processes directly, improving cache locality on the "fast path", avoiding the reverse lookups altogether, and making better decisions about which pages get evicted and when; it even seems to reduce the amount of work that must be done when an eviction is necessary.
Posted Apr 5, 2021 4:47 UTC (Mon) by yuzhao@google.com (guest, #132005)
I'm one of the engineers working on this project. I want to clarify that multigenerational lru does NOT use more CPU. In fact, for all the workloads we could benchmark, it uses less, much less.
But then what's the catch? It retains more (lru) information for each page, in spare space in page->flags. It also stores extra information for each memcg, node, and process. Specifically, it has ~500 bytes per-memcg and per-node and ~50 bytes per-process memory overhead.
Please feel free to send your questions, suggestions, and concerns to page-reclaim@google.com.
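As an aside, the "spare space in page->flags" point can be illustrated with a small bit-packing sketch; the field width, shift, and helper names below are invented for the example and do not reflect the actual patch:

```c
/*
 * Toy illustration: with, say, four generations, the generation number
 * needs only two bits, which can be packed into an existing flags word.
 */
#include <stdio.h>

#define GEN_SHIFT 20u              /* hypothetical free bit range */
#define GEN_MASK  (0x3ul << GEN_SHIFT)

static unsigned long set_gen(unsigned long flags, unsigned int gen)
{
    return (flags & ~GEN_MASK) | ((unsigned long)gen << GEN_SHIFT);
}

static unsigned int get_gen(unsigned long flags)
{
    return (flags & GEN_MASK) >> GEN_SHIFT;
}

int main(void)
{
    unsigned long flags = 0x1;     /* some unrelated flag already set */

    flags = set_gen(flags, 3);
    printf("generation %u, other flags preserved: %#lx\n",
           get_gen(flags), flags & ~GEN_MASK);
    return 0;
}
```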
Posted Apr 5, 2021 19:33 UTC (Mon) by ms-tg (subscriber, #89231)
Posted Apr 8, 2021 14:01 UTC (Thu) by smurf (subscriber, #17840)
Did you identify any workload that fares worse under your algorithm? One may assume that Google has quite a few wildly different workloads to experiment with …
Posted Apr 13, 2021 9:09 UTC (Tue) by yuzhao@google.com (guest, #132005)
Your assumption is correct. But our "wildly different workloads" are only a fraction of all real-world workloads. AFAIK, we don't use Postgresql, and we do plan to reach out to the Postgresql community and see if they could help us with benchmarking.
And we don't use io_uring (yet). Jens Axboe, the io_uring maintainer, helped us identify a regression in buffered I/O and verify the fix. Now there is an improvement compared with the mainline: https://lore.kernel.org/linux-mm/20210413065633.2782273-1...
By design, the multigenerational LRU shouldn't make any workload worse -- why should it make worse decisions after it has gathered more information? But that's just the theory :)
The bottom line is that we'll do our best to fix any regressions the community reports to us. Again, the email address is page-reclaim@google.com.
Posted Apr 5, 2021 7:15 UTC (Mon) by hsu (guest, #151475)
Please check my thesis for more stuff. Some quick comments:
- I put new pages on an "insert list", not the top list (youngest generation). If new pages are put on the youngest-generation list, then with smaller memories the algorithm would essentially reduce to the LRU case anyway when a large sequential access happens. The insert list would be dynamically defined, and I simulated different configurations (see ELRU Threshold). The simulation graphics on page 60 show why this is important, at least for my simulated workloads.
- Further, pages which are known to be of a specific type can be inserted into lists representing different priorities. If the kernel makes a wrong prediction, it likely gets corrected by accesses bumping the pages to an upper list. I think I did not do simulations for this, though.
- I moved accessed pages one list higher, not straight to the youngest generation. If the access pattern is very hierarchical, which is often the case with filesystems, you get better results with this. See the simulation graphs. I think putting active pages back on the youngest list would end up corresponding to ELRU(n, 0), which has substantially lower performance.
- See priorities, section 3.3 in the thesis, which may be relevant for the page cache. I did not simulate it during my work; further testing might be useful. It would be simple to implement, to allow high-priority stuff a better chance of staying in memory.
- I think for the simulations I used a fixed set of lists, but it can be dynamic, with new lists created when needed. I do not remember if we did dynamic list allocation in the real implementation.
We also implemented the algorithm as part of the Shadows Database Project at Helsinki University of Technology (HUT), now part of Aalto University in Helsinki. Shadow paging (a similar concept to what is now called "copy on write") creates strongly hierarchical workloads because page tables and metadata are accessed very frequently. The same might be true for other copy-on-write filesystems as well, such as ZFS, depending on the implementation. We considered implementing a filesystem using the code base created during the Shadows project, but it did not materialize due to other stuff taking all the time. The code can be released under a public source license, though it is database and userland stuff. Shadows had some useful ideas which I have not seen anywhere else, such as a RAID model that works with disks of different sizes and works well with flash without wear levelling, as it naturally does large sequential block writes through garbage collection.
I think the Master's thesis was published in a CS department publication, so it should count as prior art in case of IPR issues. I did not file patents at the time, so it is possibly safe to use.
I also did some literature research during the project, so there are some references in the Master's thesis as well; caching algorithms were a pretty old field already 25 years ago. The oldest stuff might not be googleable; I spent a lot of time in dusty libraries back then. This can be a further source of prior art.
The full Master's thesis is at
http://www.suonsivu.net/masters-thesis-Heikki-Suonsivu-19...
BibTex and URN can be found at
https://aaltodoc.aalto.fi/handle/123456789/83375
Posted Apr 26, 2021 14:40 UTC (Mon) by JohnDyson (guest, #151896)