Multi-generational LRU: the next generation
As a quick refresher: current kernels maintain two least-recently-used (LRU) lists to track pages of memory, called the "active" and "inactive" lists. The former contains pages thought to be in active use, while the latter holds pages that are thought to be unused and available to be reclaimed for other uses; a fair amount of effort goes into deciding when to move pages between the two lists. The multi-generational LRU generalizes that concept into multiple generations, allowing pages to be in a state between "likely to be active" and "likely to be unused". Pages move from older to newer generations when they are accessed; when memory is needed pages are reclaimed from the oldest generation. Generations age over time, with new generations being created as the oldest ones are fully reclaimed.
That summary oversimplifies a lot of things; see the above-linked article for a more detailed description.
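As a purely illustrative toy model of that idea (my own sketch, not the kernel's data structures), the following C fragment keeps a generation number per page, moves a page to the youngest generation when it is touched, makes everything older as new generations are created, and reclaims from the oldest generation first:

    /*
     * Toy model only: one generation number per page, where 0 is the
     * youngest generation and NR_GENS - 1 the oldest.
     */
    #include <stdio.h>

    #define NR_GENS  4
    #define NR_PAGES 8

    static int generation[NR_PAGES];

    /* An accessed page moves to the youngest generation. */
    static void touch(int page)
    {
        generation[page] = 0;
    }

    /* Create a new youngest generation: everything else gets older. */
    static void age(void)
    {
        for (int i = 0; i < NR_PAGES; i++)
            if (generation[i] < NR_GENS - 1)
                generation[i]++;
    }

    /* Reclaim takes a page from the oldest populated generation. */
    static int reclaim(void)
    {
        for (int gen = NR_GENS - 1; gen >= 0; gen--)
            for (int i = 0; i < NR_PAGES; i++)
                if (generation[i] == gen)
                    return i;
        return -1;
    }

    int main(void)
    {
        age();
        age();              /* all pages are now two generations old */
        touch(3);           /* page 3 was just used */
        printf("victim: page %d\n", reclaim());  /* picks a page other than 3 */
        return 0;
    }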
Multi-tier, multi-generational LRU
Perhaps the largest change since the first posting of this work is the concept of "tiers", which subdivide the generations of pages and, in turn, facilitate better decisions about which pages to reclaim, especially on systems where a lot of buffered I/O is taking place. Specifically, tiers are a way of sorting the pages in a generation by the frequency of accesses — but only accesses made by way of file descriptors. When a page first enters a generation, it normally goes into tier 0. If some process accesses that page via a file descriptor, the page's usage count goes up and it will move to tier 1. Further accesses will push the page into higher tiers; the actual tier number is the base-2 log of the usage count.
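One way to picture that numbering (a sketch of the scheme as described here, not the patch's actual code) is a helper that maps a page's count of file-descriptor accesses to a tier: no accesses leaves the page in tier 0, the first access moves it to tier 1, and the tier then grows roughly as the base-2 logarithm of the count:

    #include <stdio.h>

    /* Illustrative only: tier = ceil(log2(fd_accesses + 1)), so that
     * 0 accesses -> tier 0, 1 -> tier 1, 2-3 -> tier 2, 4-7 -> tier 3... */
    static unsigned int tier_of(unsigned int fd_accesses)
    {
        unsigned int tier = 0;

        while ((1u << tier) < fd_accesses + 1)
            tier++;
        return tier;
    }

    int main(void)
    {
        for (unsigned int n = 0; n <= 8; n++)
            printf("%u file-descriptor accesses -> tier %u\n", n, tier_of(n));
        return 0;
    }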
Before looking at how these tiers are used, it is worth asking why they are managed this way — why are only file-descriptor-based accesses counted? One possible reason is never mentioned in the patch set or discussion, but seems plausible: accesses via a file descriptor happen as the result of a system call and are relatively easy and cheap to count. Direct accesses to memory by the CPU are more costly to track and cannot reasonably be monitored with the same resolution.
The other reason, though, is that this mechanism enables some changes to how the aging of pages brought in via I/O is done. In current kernels, a page that is brought into memory as the result of, say, a read() call will initially be added to the inactive list. This makes sense because that page will often never be used again. Should there be another access to the page, though, it will be made active and the kernel will try to avoid reclaiming it. This mechanism works better than its predecessors, but it is still possible for processes doing a lot of I/O to flush useful pages out of the active list, hurting the performance of the system.
Doing better involves making use of the existing shadow-page tracking in the kernel. When pages are reclaimed for another use, the kernel remembers, for a while, what those pages contained and when the old contents were pushed out. If one of those pages is accessed again in the near future, requiring it to be brought back in from secondary storage, the kernel will notice this "refault", which is a signal that actively used pages are being reclaimed. As a general rule, refaults indicate thrashing, which is not a good thing. The kernel can respond to excessive refaulting by, for example, making the active list larger.
The multi-generational LRU work tweaks the shadow entries to record which tier a page was in when it was reclaimed. If the page is refaulted, it can be restored to its prior tier, but the refault can also be counted in association with that tier. That allows the computation of the refault rate for each tier — what percentage of pages being reclaimed from that tier are being subsequently refaulted back into memory? It seems evident that refaults on pages in higher tiers — those which are being accessed more frequently — would be worth avoiding in general.
This refault information is used by comparing the refault rates of the higher tiers against that of tier 0, which contains, remember, pages that are accessed directly by the CPU and pages that have not been accessed at all. If the higher tiers have a refault rate that is higher than the tier 0 rate, then pages in those tiers are moved to a younger generation and thus protected (for a while) from reclaim. That has the effect of focusing reclaim on the types of pages that are seeing fewer refaults.
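A simplified sketch of that comparison (again, not the kernel's implementation; the structure and numbers below are made up for illustration) keeps per-tier counts of evicted and refaulted pages and flags any tier whose refault rate exceeds tier 0's for protection:

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_TIERS 4

    struct tier_stats {
        unsigned long evicted;    /* pages reclaimed from this tier */
        unsigned long refaulted;  /* of those, pages faulted back in */
    };

    static struct tier_stats tiers[NR_TIERS];

    /*
     * Is this tier refaulting at a higher rate than tier 0?  The two
     * ratios are compared by cross-multiplying to avoid division.
     */
    static bool tier_should_be_protected(unsigned int tier)
    {
        struct tier_stats *t = &tiers[tier], *base = &tiers[0];

        if (tier == 0 || t->evicted == 0)
            return false;
        return t->refaulted * base->evicted > base->refaulted * t->evicted;
    }

    int main(void)
    {
        tiers[0] = (struct tier_stats){ .evicted = 1000, .refaulted = 50 }; /*  5% */
        tiers[2] = (struct tier_stats){ .evicted = 200,  .refaulted = 40 }; /* 20% */

        printf("protect tier 2? %s\n",
               tier_should_be_protected(2) ? "yes" : "no");  /* prints "yes" */
        return 0;
    }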
The other piece of the puzzle is that the memory-management code no longer automatically promotes pages on the second file-descriptor-based access, as is done in current kernels. Instead, pages resulting from I/O stay in the oldest generation unless they have been moved, as the result of usage, into a tier that is refaulting at a higher rate than directly accessed pages. That, as Zhao explained in this lengthy message, has the effect of preventing these pages from forcing out directly accessed pages that are more heavily used. That should give better performance on systems doing a lot of buffered I/O; this remark from Jens Axboe suggests that it does indeed help.
Another change from the first version is the addition of a user-space knob to force the eviction of one or more generations. The purpose of this feature appears to be to allow job controllers to make some space available for incoming work; this documentation patch contains a little more information.
Multiple patch generations
The multi-generational LRU work remains promising, and it has garnered a fair amount of interest. Its path into the mainline kernel still looks long and difficult, though. Johannes Weiner raised a concern that was mentioned in the first article as well: the multi-generational LRU, as implemented now, sits alongside the existing memory-management code as a separate option, essentially giving the kernel two page-reclaim mechanisms. That will always be a hard sell for reasons described by Weiner:
It would be impossible to maintain both, focus development and testing resources, and provide a reasonably stable experience with both systems tugging at a complicated shared code base.
So the new code would have to replace the existing system, which is a tall order. It would have to be shown to perform better (or, at least, not worse) for just about any workload, at a level of confidence that would motivate the replacement of code that has "billions of hours of production testing and tuning". The only way to do this is to merge the changes as a series of small, evolutionary steps. So the multi-generational LRU patch set would have to be broken up into a series of changes, none of which is so big that the memory-management developers feel it cannot be safely merged.
Over the years, the kernel has absorbed massive changes this way, but it is not a fast or easy process. Weiner suggested a couple of areas that could be focused on as a way of beginning the task of getting parts of this work upstream and making the rest easier to consider. If this advice is followed, some progress toward merging the multi-generational LRU could be made in the relatively near future. But this functionality as a whole is likely to have to go through numerous generations of revisions before it all makes it into a mainline kernel.
Index entries for this article
Kernel: Memory management/Page replacement algorithms
Posted May 24, 2021 18:00 UTC (Mon) by dancol (guest, #142293) [Link] (7 responses)
[Twitch] Please, God, no! An access to a page should be an access to a page regardless of whether that access came from a file descriptor operation, from a page fault, from a kernel file server, or from any other source. The kernel should not be making policy decisions based on how an application chooses to spell its resource access!
Posted May 24, 2021 18:58 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (3 responses)
Frankly, I'm not sure I understand the change you are proposing.
Posted May 25, 2021 7:32 UTC (Tue) by epa (subscriber, #39769) [Link] (1 responses)
Posted May 25, 2021 16:52 UTC (Tue) by mwsealey (subscriber, #71282) [Link]
Posted May 26, 2021 6:56 UTC (Wed) by garloff (subscriber, #319) [Link]
Lazily scanning pages for this bit, counting and clearing it again would seem like a way to approximate non-fd page access.
Maybe the kernel already does this?
Posted May 26, 2021 3:40 UTC (Wed) by yuzhao@google.com (guest, #132005) [Link]
From: https://lwn.net/ml/linux-kernel/CAOUHufbz_f4EjtDsMkmEBbQp...
Remark 1: a refault, *a page fault* or a buffered read is exactly one memory reference. A page table reference as how we count it, i.e., the accessed bit is set, could be one or a thousand memory references. So the accessed bit for a mapped page and PageReferenced() for an unmapped page may carry different weights.
Posted May 26, 2021 18:48 UTC (Wed) by iabervon (subscriber, #722) [Link] (1 responses)
How you perform your accesses kind of has to matter in order to make sense, regardless: is a 4096-byte read from a page the same number of accesses as 4096 1-byte reads from the page, or as 1 1-byte read from the page?
Posted May 26, 2021 21:43 UTC (Wed) by nybble41 (subscriber, #55106) [Link]
> … is a 4096-byte read from a page the same number of accesses as 4096 1-byte reads from the page, or as 1 1-byte read from the page?
In my opinion: both. A single 4096-byte read, 4096 separate 1-byte reads, and a single 1-byte read (all within a single sample interval) should all be weighted the same for determining whether to keep the page in RAM. Of course the final decision should be based on multiple sample intervals, not just one sample. A better metric might be how long the page has gone without any access vs. how many times the data has been faulted back into RAM after being discarded.
Posted May 25, 2021 1:56 UTC (Tue) by dxin (guest, #136611) [Link] (7 responses)
This is like the opposite of I/O scheduler situation, where users with very fast NVMEs drives it.
Posted May 25, 2021 3:50 UTC (Tue) by comicfans (subscriber, #117233) [Link] (5 responses)
while using windows, I've never seen system hang due to high memory usage, but such problem can easily makes linux "hang": gui/terminal not respond to any key in reasonable time/ ssh timeout ,you can't kill bad app, anything you're doing just makes it slower. you may wait it for hours (or forever) to restore, or just hard reset. I've hit such problems multi times.
taken from patch mail:
...8G RAM + zswap + swap... lots of apps opened ... LSP/chats/browsers... gets quickly to a point of SWAP-storms... system lags heavily and is barely usable.
... migrated from 5.11.15 kernel to 5.12 + the LRU patchset... Till now I had *not a single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never getting to the point of 3G in SWAP before without a single SWAP-storm.
Posted May 25, 2021 12:05 UTC (Tue) by k3ninho (subscriber, #50375) [Link] (4 responses)
Windows XP and spinning rust used to grind to a halt on heavy memory use, and Win10 puts a device out of action for hours upgrading and patching itself if you still have a rotating hard disk. That's more about the latency of storage access. What is the current advice about RAM and swap? I ask because, since having NAND-based SSD's with block overwriting concerns, I've disabled swap and bought large amounts of RAM so that apps can't get stroppy about memory pressure. Only recently have I configured zswap but it's not noticeably changed my experience.
I think that this is something that should involve running user-experience items at elevated nice levels and using the alt-sysrq keys to safely OOM-kill and then unmount filesystems if you can't recover the device.
Are we still advocating for swap in 2021?
K3n.
Posted May 25, 2021 14:41 UTC (Tue) by comicfans (subscriber, #117233) [Link]
maybe I'm lucky, I haven't hit XP halt even once with celeron1.7 + 845G + 256RAM + HDD . IIRC, I've run many apps which definitely exceeds 256MB, it got slower, but never stop responding.
>and Win10 puts a device out of action for hours upgrading and patching itself if you still have a rotating hard disk.
we all know windows update isn't good, but we're talking about high memory pressure, not a bad system update.
>I've disabled swap and bought large amounts of RAM so that apps can't get stroppy about memory pressure. Only recently have I configured zswap but it's not noticeably changed my experience.
it doesn't apply if laptop have none-replacable ram. and memory leak (maybe by accident) can eat as much ram as you have. of course better hardware can resolve many software problems, but should it stop kernel improve experience on old hardware ? while slower-bigger storage is still cheaper than faster-smaller storage, swap is always needed (that's why bcache/lvm-cache exist)
Posted May 25, 2021 17:30 UTC (Tue) by MattBBaker (guest, #28651) [Link]
My recommendation is to have no swap if the backing device is a HDD. Why? If the system needs so much RAM that it starts swapping, it becomes unusable anyway.
Posted May 28, 2021 18:17 UTC (Fri) by anton (subscriber, #25547) [Link] (1 responses)
With an SSD, swap may be more viable. And if you swap rarely, don't worry about the limited number of SSD writes possible. If you swap often, buy more RAM.
Posted Jun 10, 2021 11:46 UTC (Thu) by Hi-Angel (guest, #110915) [Link]
First, for HDD: I'm the guy who posted the testimonial that the v3 of the patches refers to, and I am using HDD. So, just in case anyone's wondering about behavior on HDD specifically, here it is.
Second, for SSD: I see here an attitude that once you create a SWAP on SSD, all problems are gone. So here's my experience: this is not true.
My gf has a Macbook 2013 with SSD, 4G RAM, ZSWAP. She always had swap-partition on the SSD. Before I went out to try the patches on her laptop, she have also had frequent SWAP-storms, her overall experience was pretty bad.
After I configured her system to use the multi-LRU patches (v2), her experience improved a lot. Now the only moment when lags start appearing is when her SWAP usage goes to around 7-8G (Idk why exactly that size).
So, for anyone out there thinking creating a SWAP on SSD will magically solve any need in the memory reclaim rework — that ain't true.
Posted May 26, 2021 4:01 UTC (Wed) by yuzhao@google.com (guest, #132005) [Link]
Posted May 26, 2021 12:36 UTC (Wed) by intelfx (subscriber, #130118) [Link] (1 responses)
Posted May 27, 2021 7:29 UTC (Thu) by yuzhao@google.com (guest, #132005) [Link]
The fundamental arguments are different: the multigenerational lru argues that pages that have been used only once (if we are sure) are always the best choice, no matter how recent they were used, because 1) even some of them are not, they will be protected upon the second access; 2) the cost to ascertain whether the rest are better or not is higher (to do so we probably have to scan more than half a million pages on a 4GB laptop under moderate pressure, and there is no guarantee we'd make better choices after we have done it).
Essentially, ZFS ARC is a *cache* replacement implementation, which discovers page accesses reactively by hooks. For *page* replacement, we have to scan page tables proactive, which is far more difficult and expensive in terms of discovering page accesses.