
Reconsidering the multi-generational LRU

By Jonathan Corbet
March 5, 2026
The multi-generational LRU (MGLRU) is an alternative memory-management algorithm that was merged for the 6.1 kernel in late 2022. It brought a promise of much-improved performance and simplified code. Since then, though, progress on MGLRU has stalled, and it still is not enabled on many systems. As the 2026 Linux Storage, Filesystem, Memory-Management and BPF Summit (LSFMM+BPF) approaches, several memory-management developers have indicated a desire to talk about the future of MGLRU. While some developers are looking for ways to improve the subsystem, another has called for it to be removed entirely.

An MGLRU refresher

One of the core memory-management tasks a kernel must handle is to determine which pages belong in RAM and which should be pushed out to slower storage (or "reclaimed"). As a general rule, it is best to retain the pages that will be used the most in the near future, while reclaiming pages that will not be used again. Given the challenges involved in predicting the future, the kernel must rely heavily on information about how pages were used recently as a guide for what will happen going forward. The least-recently-used (LRU) lists are a key component of that solution.

The classic (and still default) solution in the kernel relies on two LRU lists (more correctly, numerous pairs of such lists) called the "active" and "inactive" lists. Pages that are thought to be in current use should be on the active list, while those that are seemingly unused go onto the inactive list. When the time comes to reclaim pages for other use, the inactive list will be consulted for a list of potential victims. Much of the complexity (and many of the heuristics) in this solution are focused on properly sizing the two lists and deciding when to move pages from one to the other.
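To make the mechanics concrete, here is a deliberately simplified Python model of the two-list scheme. It is an illustration only, not kernel code; the class and method names are invented for this sketch, and the real implementation in mm/vmscan.c maintains per-node, per-memcg list pairs and many more heuristics:

```python
from collections import OrderedDict

class TwoListLRU:
    """Toy model of the classic active/inactive reclaim scheme.

    Illustrative only: the kernel's real implementation is far more
    elaborate, with separate anon/file list pairs per control group
    and node, plus the balancing heuristics discussed above.
    """
    def __init__(self):
        # OrderedDict preserves insertion order; the most recently
        # touched page sits at the end of each list.
        self.active = OrderedDict()
        self.inactive = OrderedDict()

    def fault(self, page):
        # A newly faulted-in file page starts on the inactive list.
        self.inactive[page] = True
        self.inactive.move_to_end(page)

    def access(self, page):
        if page in self.inactive:
            # A second access promotes the page to the active list.
            del self.inactive[page]
            self.active[page] = True
        elif page in self.active:
            self.active.move_to_end(page)
        else:
            self.fault(page)

    def reclaim(self):
        # Victims come from the cold end of the inactive list;
        # only under pressure does the active list get raided
        # (the kernel demotes first -- simplified away here).
        if self.inactive:
            return self.inactive.popitem(last=False)[0]
        if self.active:
            return self.active.popitem(last=False)[0]
        return None
```

A page that is faulted in and touched again survives reclaim longer than a page that was faulted in once and never reused, which is the whole point of the two-list split.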

The MGLRU extends that approach to multiple lists, deemed "generations". At one end, the youngest generation contains pages that are known (or at least thought) to have been used within the recent past. Each older generation tracks pages that have been idle for longer than those in the preceding generations. Various sorts of accesses will move a page from an older generation to a younger one; the oldest generation is pillaged by the kernel when the need for more free memory arises. The MGLRU is claimed to more accurately identify the truly cold pages and to use less CPU time while doing that work. See this 2021 article for more information about the design of MGLRU.
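The generational idea can be sketched in the same toy style. The model below (with an invented MGLRUModel class) captures only the broad strokes: accessed pages move to the youngest generation, aging opens a new youngest generation, and reclaim pillages the oldest one; the kernel tracks generations with sliding sequence numbers rather than per-page renumbering:

```python
class MGLRUModel:
    """Toy sketch of generation-based aging; not the kernel algorithm.

    Each page carries a generation number. Accesses move a page to
    the youngest generation; aging advances the youngest sequence
    number, so untouched pages become relatively older over time.
    """
    def __init__(self, nr_gens=4):
        self.max_seq = nr_gens - 1   # current youngest generation
        self.gen = {}                # page -> generation number

    def fault(self, page):
        self.gen[page] = self.max_seq

    def access(self, page):
        # Any access promotes the page to the youngest generation.
        self.gen[page] = self.max_seq

    def age(self):
        # Open a new youngest generation; pages that are not touched
        # again fall further behind max_seq, i.e. grow colder.
        self.max_seq += 1

    def reclaim(self):
        # Evict from the oldest populated generation.
        if not self.gen:
            return None
        victim = min(self.gen, key=self.gen.get)
        del self.gen[victim]
        return victim
```

With more than two buckets, the model can tell "idle for one aging cycle" apart from "idle for five", which is exactly the cold-page discrimination the MGLRU claims to improve.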

The trouble with MGLRU

Recent discussions have made it clear that MGLRU is not seen as living up to all of its promises. It all started in mid-February, when Zicheng Wang posted a request for an LSFMM+BPF discussion about MGLRU and, specifically, how it works with Android. Even though MGLRU has been in the kernel for some years, Wang said, many vendors of Android systems do not enable it. There are a number of problems that play into that decision.

One complaint (which was later echoed by others) is that MGLRU does not properly balance reclaim between anonymous and file-backed pages. The traditional LRU maintains a separate pair of lists for each of those page types (thus the comment above about "numerous pairs" of LRU lists — there are other complications as well). Reclaim from the two sets of lists is normally biased somewhat toward file-backed pages, since they do not normally need to be written back to persistent storage; the longstanding "swappiness" sysctl knob can be used to adjust how aggressively the kernel attacks each list.
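As a rough illustration of how swappiness divides reclaim pressure between the two list sets, consider this sketch, loosely modeled on the kernel's 0-200 swappiness range. It is a caricature: the kernel's actual get_scan_count() logic also weighs recent reclaim cost and rotation statistics, so treat the function below as an invented simplification:

```python
def scan_weights(swappiness, max_swappiness=200):
    """Illustrative split of reclaim pressure between anonymous and
    file-backed LRU lists.

    Loosely modeled on how swappiness biases scanning in the classic
    LRU: higher swappiness shifts pressure toward anonymous pages,
    lower values protect them and lean on the page cache instead.
    The real calculation is considerably more involved.
    """
    anon = swappiness
    file = max_swappiness - swappiness
    total = anon + file
    return anon / total, file / total
```

At the default swappiness of 60, for example, this toy split puts roughly 30% of the scan pressure on anonymous pages and 70% on file-backed ones, which matches the file-biased behavior described above.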

With MGLRU, Wang said, anonymous pages tend to stay within the youngest two generations, causing them to never be reclaimed (and file-backed pages to be reclaimed overly aggressively). Adjusting the swappiness knob does not fix the problem. Wang's employer (an Android OEM called "Honor") addresses this problem by explicitly using memory control groups to force reclaim of anonymous pages from non-foreground apps, but there is no general solution in the mainline kernel. Kairui Song, who proposed an MGLRU session as well, also mentioned problems with the reclaim of anonymous pages.

Wang had a number of other problems to discuss. MGLRU can reclaim too aggressively from any given control group, freeing memory beyond the required amount. It's too expensive on low-end devices, especially in situations where there are not a lot of reclaimable pages. There is also a disconnect between Android's notion of hot and cold apps (designed to prioritize the app the user is interacting with at any given time) and MGLRU, which (like the rest of the kernel) lacks that distinction. Some of these problems have been addressed with vendor-specific hacks; there is, for example, a vendor hook that exempts the current foreground task from reclaim. Wang would like to discuss which of these vendor changes, if any, should find their way into the mainline kernel.

Barry Song added a separate complaint: when the system performs readahead (speculatively reading data that it thinks user space may soon request), it places all of the resulting pages into the youngest generation, even though there is no guarantee that those pages will ever be used at all. That may cause pages actually in use to be reclaimed while leaving the readahead pages in RAM. The traditional LRU, instead, puts those pages onto the inactive list, where they will be reclaimed relatively quickly if they are not referenced again. This problem, at least, should be amenable to a relatively simple solution.

Kairui Song's list of problems had a different focus, starting with the fact that MGLRU uses three page flags. These flags are in perennial short supply; patches that try to allocate even one of them tend to run into stiff resistance. The desire to increase the number of generations managed by MGLRU also implies using even more page flags. He had a proposal for shifting those flags elsewhere, making systems with up to 63 generations possible while freeing the three page flags currently used by MGLRU.
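The idea of keeping the generation number in a small bitfield somewhere other than the page flags can be sketched as follows. The field width, shift, and helper names here are invented for illustration; presumably one field value stays reserved to mean "not on any generation list", which would be how a 6-bit field yields 63 usable generations rather than 64:

```python
GEN_BITS = 6                     # 6 bits -> values 0..63
GEN_MASK = (1 << GEN_BITS) - 1
GEN_SHIFT = 8                    # hypothetical position in the word

def set_gen(word, gen):
    """Store a generation number in a bitfield of a per-page word.

    Purely illustrative: the actual proposal moves the counter into
    a different per-page structure; the constants above are made up
    for this sketch.
    """
    assert 0 <= gen <= GEN_MASK
    word &= ~(GEN_MASK << GEN_SHIFT)   # clear the old field
    word |= gen << GEN_SHIFT           # install the new value
    return word

def get_gen(word):
    return (word >> GEN_SHIFT) & GEN_MASK
```

The point of the exercise is that a multi-bit counter costs the same storage whether it encodes 4 generations or 63, whereas the current scheme burns individually scarce page flags.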

Another problem is performance regressions for some workloads (while others do better). Kairui Song thinks that these problems result from the control loop that manages reclaim in MGLRU, and that they could be addressed by better tracking the usage history of file-backed pages. Doing that, though, would require three page flags, presumably those that had just been freed by shifting the generation number elsewhere.

The metrics provided by MGLRU differ from those out of the traditional LRU in a number of ways, he said, making it harder for other parts of the system to understand the memory-management state of any given page. He has a proposal for changing how the state of pages is tracked to improve that situation. Kalesh Singh also described problems with metrics, saying that they differ significantly between the two LRU implementations, and that makes life difficult for components like the Android user-space out-of-memory daemon.

In passing, Kairui Song also mentioned the idea of adding a BPF hook that would allow the customization of generation-placement decisions.

There were other problems mentioned as well but, perhaps surprisingly, most of the participants skipped over one other relevant issue: the fact that there are two competing LRU implementations in the kernel in the first place. Kairui Song did note that the problems he described were among the "many reasons MGLRU is still not the only LRU implementation in the kernel". David Rientjes added that the discussion should cover "what needs to be addressed so that MGLRU can be on a path to becoming the default implementation and we can eliminate two separate implementations". That will be a challenging thing to do; there will certainly always be workloads that do better with one implementation than the other, so removing one will cause some workloads to regress. Getting those regressions down to a tolerable level will require some work yet.

Just remove it?

A persistent fear among kernel developers (and developers in many projects, in truth) is that a developer will add a pile of complex code, then not be around to maintain it. Matthew Wilcox asserted that this is exactly what has happened with MGLRU:

To my mind, the biggest problem with MGLRU is that Google dumped it on us and ran away. Commit 44958000bada claimed that it was now maintained and added three people as maintainers. In the six months since that commit, none of those three people have any commits in mm/! This is a shameful state of affairs.

I say rip it out.

The original developer of MGLRU, Yu Zhao, has, as is noted in the above-mentioned commit, "moved on to other projects these days". As can be seen in the (subscriber-only) KSDB page for Zhao, he has occasionally made improvements to MGLRU, but the last such was a handful of commits in the 6.14 kernel. While other developers are said to be working on this code, none of those who were added to the MAINTAINERS file have made any changes to MGLRU since.

So it is true, to an extent, that MGLRU was contributed to the kernel and abandoned shortly thereafter. Axel Rasmussen, one of the named maintainers of MGLRU, seemed to agree with this assessment, but said that the situation would soon change:

I acknowledge this is a big problem. We have let the community down here, and we plan to correct this starting in April, e.g. by working together with Kairui and others to address outstanding issues.

The lack of ongoing developer attention certainly has not helped MGLRU to overcome the problems that many potential users have encountered with it. Even so, there was little support expressed for the idea of removing it. Barry Song asked to keep it around so that those problems could be addressed:

It just needs more work. MGLRU has many strong design aspects, including using more generations to differentiate cold from hot, the look-around mechanism to reduce scanning overhead by leveraging cache locality, and data structure designs that minimize lock holding.

Gregory Price listed a number of perceived problems with MGLRU. He did, however, stop short of calling for it to be taken out of the kernel entirely.

So the MGLRU discussion at LSFMM+BPF in May seems unlikely to spend much time on the idea of removing it entirely. But there will be a lot of interest in understanding the work that needs to be done to bring MGLRU up to the needed level of performance and, perhaps someday, be the only LRU implementation in the kernel. If some developers are willing to commit to doing that work, MGLRU may finally make the progress that has been missing for the last few years. It seems likely to be an interesting session.

Index entries for this article
Kernel: Memory management/Page replacement algorithms



ARC, anyone?

Posted Mar 5, 2026 16:06 UTC (Thu) by intelfx (subscriber, #130118) [Link] (3 responses)

So, it would appear we still do not have anything remotely competitive with ZFS' ARC. Sigh.

ARC, anyone?

Posted Mar 6, 2026 13:42 UTC (Fri) by PeeWee (subscriber, #175777) [Link] (2 responses)

IIUC, we sort of already do have, or had(?), an approximation of that with "classic" LRU. The active/inactive lists track frequency/recency respectively, kind of. And there is also refault distance tracking, though it seems to be limited to file pages, which should make no difference for a comparison to ARC.

Thus Linux already has ARC-but-not-ARC. File pages start life on the inactive list, after being initially faulted in. That's almost identical to what happens in ZFS ARC [PDF] (p. 13f); total misses - the page wasn't even on one of the ghost lists - are put at the head of the LRU list, Linux's inactive list being the equivalent. Only after being accessed again will they be promoted to the active list, hence they get promoted due to frequent access, otherwise they'd already be evicted again, possibly still in some shadow entry. But a promotion to the active list implies a minor/soft page fault - inactive pages get unmapped but not evicted -, so it's not exactly the same as in ARC, where pages are simply promoted to the LFU list, no fault having happened, so slight advantage for ARC, maybe, depending on the LFU list overhead, see below. Arguably, that makes Linux's active list an LFU list in disguise, sans the access counter. The only way to eviction is by being less frequently accessed than other active pages; and to get on the active list, file pages need to be accessed a second time. And that's where the equivalency breaks, because in ARC it would be evicted but placed at the head of the ghost LFU list, whereas in Linux it gets unmapped and placed at the head of the inactive list for being the most recent visitor in purgatory.

Reiterating, in Linux mm terms, file pages age out of the active list because they are not accessed frequently (enough) onto the inactive list and then onto the shadow "list" (entries in the radix tree, according to the above article), after being evicted for not being accessed recently. If they get refaulted in and there is still a shadow entry, one page gets demoted from the active list right next to the refaulted one, thus shortening the active list and lengthening the inactive list by one, respectively, if the refault distance (how many page faults were handled in its absence) is less than the length of the active list. There is no mention of what happens when the refault distance is equal to or greater than the length of the active list, so I am assuming no additional action takes place in that case. Also, since all of the above is in the context of file-backed pages, I just want to add that anon pages only differ in that they are immediately placed on the active list after faulting them in, which might have implications regarding their exact refault behavior and which list to adjust, but I don't even know if it even applies there; couldn't find anything on anon refault distance tracking.

There may also be other reasons for not having an ARC clone in Linux. As that presentation above states, LFU lists are of O(log(n)) complexity. Maybe that implies it only makes sense for file pages, the overhead being small enough compared to the cost of a major fault hitting spinning-rust I/O. But it might be too much for tracking in-memory pages for frequency, if all that buys us is avoiding a minor fault - still considering the active list as LFU. Just assuming that active pages are frequently accessed might be the cheaper option, since aging out does not mean eviction, only demotion to a lower "tier", with tolerable (soft) refault cost.

It's also hard to tell if it is ("remotely") competitive with ARC, at least for me. ZFS (ARC) has great tools and stats, but I wouldn't know where to find their equivalents in Linux, e.g. hit and miss rates on said active/inactive/shadow lists. Plus, ZFS ARC can optionally store compressed blocks, which need to be decompressed on access, just to name one more variable; I am almost sure that the compressed version stays in ARC and the decompressed one lives in the page cache, at least in ZFS on Linux (ZoL). That, if true, may have interesting effects: for really hot read-only pages, i.e. always clean (not dirtied, which would invalidate the ARC cache entry) and never to be evicted from the page cache, their compressed siblings may get evicted from ARC, because, for all it knows, they are cold; I'd call that a temperature paradox, for lack of a better term, if there isn't already a name for it. But maybe I am totally misunderstanding how ZoL integrates with vanilla Linux.

One more question just popped into my head; if ARC is so superior, then why was it only implemented for ZFS and not also in the swap path of Solaris? Did Sun Microsystems simply not think of that, or the Illumos developers, for that matter? Maybe that does speak to the relative cost compared to simpler solutions.

ARC, anyone?

Posted Mar 6, 2026 17:39 UTC (Fri) by hnaz (subscriber, #67104) [Link] (1 responses)

> And there is also refault distance tracking, though it seems to be limited to file pages

There actually is shadow tracking for swap-backed pages. On classic LRU, swapbacked memory starts out on inactive list, just like file. It provides the same workingset protection against bulk accesses with low locality (e.g. a scan through a large, cold, mostly swapped out anon segment).

You can observe the shadow list hits through /proc/vmstat::workingset_*.
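Those counters are plain "name value" lines in /proc/vmstat, so pulling them out is straightforward; this small helper (an invented name, just for illustration) does exactly that:

```python
def workingset_counters(vmstat_text):
    """Extract the workingset_* counters from /proc/vmstat content.

    The file is a sequence of 'name value' lines; this simply
    filters for the working-set refault/activation statistics.
    """
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name.startswith("workingset_") and value.strip().isdigit():
            counters[name] = int(value)
    return counters

# On a live system:
#   with open("/proc/vmstat") as f:
#       print(workingset_counters(f.read()))
```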

ARC, anyone?

Posted Mar 6, 2026 20:29 UTC (Fri) by PeeWee (subscriber, #175777) [Link]

> > And there is also refault distance tracking, though it seems to be limited to file pages

> There actually is shadow tracking for swap-backed pages. On classic LRU, swapbacked memory starts out on inactive list, just like file. It provides the same workingset protection against bulk accesses with low locality (e.g. a scan through a large, cold, mostly swapped out anon segment).
Thanks for pointing that out. Thinking about it some more, given this info, it makes perfect sense, because they are just file-backed pages as well, right? In the meantime, I've also had a glance at workingset.c, which is exemplarily well documented by comments. Not that I read the actual code, or would be able to understand it, but I think it's just the same as with the classic active/inactive lists, the only difference being that there is no "inactive" list, only gen 0, the oldest-generation LRU list, on which file-backed pages are placed.
> You can observe the shadow list hits through /proc/vmstat::workingset_*.
TIL where to find the stats. Thanks! And it looks like it's just the same with MGLRU:
workingset_nodes 4530
workingset_refault_anon 4
workingset_refault_file 13627
workingset_activate_anon 4
workingset_activate_file 9981
workingset_restore_anon 4
workingset_restore_file 4655
workingset_nodereclaim 5396
There hasn't been much memory churn yet since last boot. But I do have two 16K mTHPs, as it seems (pool_total_size=32768), already in zswap, while still at 8.8 GiB committed, with ~7 GiB page cache. That's with vm.swappiness=180 and 14.5 GiB RAM total.

It sure does work to help memory overcommitted servers

Posted Mar 5, 2026 18:05 UTC (Thu) by aviallon (subscriber, #157205) [Link]

I experienced much better (less global slowness) behavior on my servers by enabling MGLRU.
Memory is hugely overcommitted (5x total memory is allocated), so that may explain why the global experience is so much better.
Services response times are much more predictable, and SSHing into the server actually works.

I think MGLRU isn't ready for a lot of uses unless it keeps the paired lists

Posted Mar 5, 2026 19:23 UTC (Thu) by jthill (subscriber, #56558) [Link] (3 responses)

I think disabling the swappiness knob is *clearly* a mistake. Straight LRU doesn't understand differing reload costs and doesn't understand other usage patterns. If a set of file-backed pages is expensive to reload and reused at workload-scale short(ish) but cpu-scale long intervals, LRU fails due to the horizon effect. swappiness is engineered to correct for both bulk-reuse-at-intervals and the sometimes huge disparities in reload costs; it's there because detecting that automatically is a can of worms and tuning the tradeoff is an administrative matter anyway.

I think MGLRU isn't ready for a lot of uses unless it keeps the paired lists

Posted Mar 6, 2026 14:10 UTC (Fri) by PeeWee (subscriber, #175777) [Link] (2 responses)

I don't think the swappiness knob was actually disabled. It's just rendered useless, allegedly, because anon pages tend to hijack the younger generations and are thus never considered in the reclaim path. But the question is whether the complainants may have discovered that their workload actually does lean more heavily towards anon pages. That might be a genuine improvement, but they didn't expect it because experience with previous kernel behavior may have clouded or biased their judgement.

I've just checked and MGLRU is enabled on my Ubuntu 24.04 laptop with Linux 6.17. Since I am using zswap, I have also set an aggressive swappiness=180. I have ~14.5 GiB of total RAM (16 - iGPU RAM) and I sometimes push the limits with tmpfs usage. But way before I hit 12 GiB committed, (z)swapping starts, with plenty of cache available for supposedly cheap eviction. If the mentioned complaint pointed to a real problem, wouldn't I also see that all page cache is evicted first? And there is also, as I've just written above, the refault distance tracking of file pages, which can correct for overly eager cache eviction by increasing the length of the inactive list. Or does that also work differently, or not at all, with MGLRU?

I think MGLRU isn't ready for a lot of uses unless it keeps the paired lists

Posted Mar 6, 2026 16:33 UTC (Fri) by jthill (subscriber, #56558) [Link] (1 responses)

I meant disabled in the rendered-much-less-effective sense; sorry for leaving the more literal sense open as such a reasonable reading.

I don't think tmpfs pages are considered file-backed, so your swappiness setting wouldn't be trying to protect them anyway.

I got into swappiness tuning when I only had an hdd and was pushing limits. I found I disliked deferring slowdowns to workload-switching time: if I needed to do, say, ten minutes of image editing and it was a little slower, say madeupnumber30 seconds extra, then went back to what I was doing before, the needed files being all still cached and ready is very gratifying. Like, when *I'm* done with the interruption, my computer's got my back and the real work is all still up to speed, instead of making me sit and wait for 10 seconds while it rereads all the stuff it evicted to save me those 30 I didn't notice and might have spent anyway.

tl;dr: whether the computer's ready for my next trick a little before or well before I'm ready to perform it makes no difference to me, but if it's not ready when I am, that matters.

So yeah, it's going to be workload-specific, and my impression from the article was people with specific workloads didn't like mglru's observed behavior. So if the idea is to settle on just one reclaim algorithm I think pick the one that's easily tunable to avoid frustrating waiting humans.

I think MGLRU isn't ready for a lot of uses unless it keeps the paired lists

Posted Mar 6, 2026 18:30 UTC (Fri) by PeeWee (subscriber, #175777) [Link]

> I don't think tmpfs pages are considered file-backed, so your swappiness setting wouldn't be trying to protect them anyway.
No, that's not why I mentioned my liberal use of tmpfs, at all; it's mounted noswap, forgot to say. It was meant to convey the memory pressure it induces. Otherwise I'd never see any swapping, ever, because I keep my system on a diet. I should have also mentioned why swappiness=180. That's because I am using LUKS full disk encryption, which makes file page refault significantly more expensive, even from the fast-ish NVMe SSD. On top of that I have BTRFS with transparent compression combined with some FUSE-based mergerfs mounts on top, which adds even more cost, in case files are read from there, i.e. all my media files. Page cache is almost exclusively clean most of the time, so evicting from there is as cheap as it gets. But that's only half the story, when one considers the possibility of refaults. Zswap, while it does incur a penalty on reclaim, is practically free, in terms of refault cost, compared to LUKS->BTRFS->mergerfs. As long as I don't hit the zswap pool limit, that is, of course, but that (almost) never happens.

> I got in to swappiness tuning when I only had an hdd and was pushing limits, I found I disliked deferring slowdowns to workload-switching time, if I needed to do like ten minutes of image editng and it was little slower but say madeupnumber30 seconds extra, then go back to what I was doing before, the needed files being all still cached and ready is very gratifying, like, when *I'm* done with the interruption, my computer's got my back and the real work is all still up to speed instead of making me sit and wait for 10 seconds while it rereads all the stuff it evicted to save me those 30 I didn't notice and might have spent anyway.

> tl;dr: whether the computer's ready for my next trick a little before or well before I'm ready to perform it makes no difference to me, but if it's not ready when I am, that matters.

> So yeah, it's going to be workload-specific, and my impression from the article was people with specific workloads didn't like mglru's observed behavior. So if the idea is to settle on just one reclaim algorithm I think pick the one that's easily tunable to avoid frustrating waiting humans.
According to this lengthy article, which was referenced here, swappiness fits even better into a more precise picture:
> > For phones and laptops, executable pages are frequently evicted
> > despite the fact that there are many less recently used anon pages.
> > Major faults on executable pages cause "janks" (slow UI renderings)
> > and negatively impact user experience.
>
> This is not because of the inactive/active scheme but rather because
> of the anon/file split, which has evolved over the years to just not
> swap onto iop-anemic rotational drives.
>
> We ran into the same issue at FB too, where even with painfully
> obvious anon candidates and a fast paging backend the kernel would
> happily thrash on the page cache instead.
>
> There has been significant work in this area recently to address this
> (see commit 5df741963d52506a985b14c4bcd9a25beb9d1981). We've added
> extensive testing and production time onto these patches since and
> have not found the kernel to be thrashing executables or be reluctant
> to go after anonymous pages anymore.
>
> I wonder if your observation takes these recent changes into account?

Again, I agree with all you said above. And I can confirm your series
has generally fixed the problem for the following test case.

When our most common 4GB Chromebook model is zram-ing under memory
pressure, the size of the file lru is
  ~80MB without that series
  ~120MB with that series
  ~140MB with this series

User experience is acceptable as long as the size is above 100MB. For
optimal user experience, the size is 200MB. But we do not expect the
optimal user experience under memory pressure.
I hope there is enough context. The gist here is, they wanted to mitigate executable page thrashing, i.e. refaulting program code, on Android and Chrome OS, which was rooted in the kernel's PTSD-induced reluctance to swap to spinning rust. Then Facebook fixed that and MGLRU further improves on it. That seems to contradict the claim that swappiness is rendered useless. Further up in that message Zhao explains that MGLRU actually sets file and anon pages on an equal footing in terms of age-based eviction. Reclaim will look at the oldest generation of both types, treating them as ordinary pages. And only then will swappiness factor into the selection from the candidates found. And I think that MGLRU seemingly evicting file pages overly aggressively just shows that most file pages are cold. Oh, and file pages keep getting put on the oldest generation. They also need to "earn" their place on the next-younger generation LRU list by climbing above tier 0 inside gen 0. But refaults are also still tracked by shadow entries, so they will refault to the tier they were evicted from. And only when their tier has a higher refault rate than tier 0 will they move to the younger generation. I hope that's roughly correct.

My heart skipped a bit

Posted Mar 23, 2026 5:58 UTC (Mon) by Hi-Angel (guest, #110915) [Link]

> another has called for it to be removed entirely.

My heart skipped a bit when I read it 😅 I was the one whose testimony was included in v3 or v4 of the MGLRU patches (which I sent in turn because I was afraid the patchset would linger for the next ½-decade on the MLs), and despite the deficiencies mentioned in the article, MGLRU immensely improves the experience on low-memory systems (and besides, nowadays, the definition of "low-memory systems" is more vague than one would think).

> and [MGLRU] still is not enabled on many systems.

This was probably meant to say "not enabled on many Android systems"…?

Because as far as desktop Linux goes, MGLRU AFAIR is enabled in most major distros. Arch, Fedora, Ubuntu — just off the top of my head.
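For anyone wanting to check their own system: MGLRU exposes its state through the lru_gen sysfs interface. The snippet below assumes the documented /sys/kernel/mm/lru_gen/enabled file, which holds a hex bitmask; my understanding is that bit 0 is the main switch for the MGLRU core, so treat the interpretation here as a sketch rather than gospel:

```python
from pathlib import Path

def parse_lru_gen_enabled(text):
    """Interpret the contents of /sys/kernel/mm/lru_gen/enabled.

    The file contains a hex bitmask of enabled MGLRU features
    (e.g. "0x0007"); bit 0 is, as far as I can tell, the main
    switch for the multi-gen LRU core.
    """
    return int(text.strip(), 16) & 0x1 != 0

def mglru_enabled(path="/sys/kernel/mm/lru_gen/enabled"):
    """Return True/False on kernels built with CONFIG_LRU_GEN,
    or None when the sysfs file does not exist at all."""
    p = Path(path)
    if not p.exists():
        return None
    return parse_lru_gen_enabled(p.read_text())
```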


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds