Multi-generational LRU: the next generation
As a quick refresher: current kernels maintain two least-recently-used (LRU) lists to track pages of memory, called the "active" and "inactive" lists. The former contains pages thought to be in active use, while the latter holds pages that are thought to be unused and available to be reclaimed for other uses; a fair amount of effort goes into deciding when to move pages between the two lists. The multi-generational LRU generalizes that concept into multiple generations, allowing pages to be in a state between "likely to be active" and "likely to be unused". Pages move from older to newer generations when they are accessed; when memory is needed pages are reclaimed from the oldest generation. Generations age over time, with new generations being created as the oldest ones are fully reclaimed.
That summary oversimplifies a lot of things; see the above-linked article for a more detailed description.
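As a purely illustrative toy model of that idea (my own sketch, not the kernel's data structures), the following C fragment keeps a generation number per page, moves a page to the youngest generation when it is touched, makes everything older as new generations are created, and reclaims from the oldest generation first:

    /*
     * Toy model only: one generation number per page, where 0 is the
     * youngest generation and NR_GENS - 1 the oldest.
     */
    #include <stdio.h>

    #define NR_GENS  4
    #define NR_PAGES 8

    static int generation[NR_PAGES];

    /* An accessed page moves to the youngest generation. */
    static void touch(int page)
    {
        generation[page] = 0;
    }

    /* Create a new youngest generation: everything else gets older. */
    static void age(void)
    {
        for (int i = 0; i < NR_PAGES; i++)
            if (generation[i] < NR_GENS - 1)
                generation[i]++;
    }

    /* Reclaim takes a page from the oldest populated generation. */
    static int reclaim(void)
    {
        for (int gen = NR_GENS - 1; gen >= 0; gen--)
            for (int i = 0; i < NR_PAGES; i++)
                if (generation[i] == gen)
                    return i;
        return -1;
    }

    int main(void)
    {
        age();
        age();              /* all pages are now two generations old */
        touch(3);           /* page 3 was just used */
        printf("victim: page %d\n", reclaim());  /* picks a page other than 3 */
        return 0;
    }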
Multi-tier, multi-generational LRU
Perhaps the largest change since the first posting of this work is the concept of "tiers", which subdivide the generations of pages and, in turn, facilitate better decisions about which pages to reclaim, especially on systems where a lot of buffered I/O is taking place. Specifically, tiers are a way of sorting the pages in a generation by the frequency of accesses — but only accesses made by way of file descriptors. When a page first enters a generation, it normally goes into tier 0. If some process accesses that page via a file descriptor, the page's usage count goes up and it will move to tier 1. Further accesses will push the page into higher tiers; the actual tier number is the base-2 log of the usage count.
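One way to picture that numbering (a sketch of the scheme as described here, not the patch's actual code) is a helper that maps a page's count of file-descriptor accesses to a tier: no accesses leaves the page in tier 0, the first access moves it to tier 1, and the tier then grows roughly as the base-2 logarithm of the count:

    #include <stdio.h>

    /* Illustrative only: tier = ceil(log2(fd_accesses + 1)), so that
     * 0 accesses -> tier 0, 1 -> tier 1, 2-3 -> tier 2, 4-7 -> tier 3... */
    static unsigned int tier_of(unsigned int fd_accesses)
    {
        unsigned int tier = 0;

        while ((1u << tier) < fd_accesses + 1)
            tier++;
        return tier;
    }

    int main(void)
    {
        for (unsigned int n = 0; n <= 8; n++)
            printf("%u file-descriptor accesses -> tier %u\n", n, tier_of(n));
        return 0;
    }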
Before looking at how these tiers are used, it is worth asking why they are managed this way — why are only file-descriptor-based accesses counted? One possible reason is never mentioned in the patch set or discussion, but seems plausible: accesses via a file descriptor happen as the result of a system call and are relatively easy and cheap to count. Direct accesses to memory by the CPU are more costly to track and cannot reasonably be monitored with the same resolution.
The other reason, though, is that this mechanism enables some changes to how the aging of pages brought in via I/O is done. In current kernels, a page that is brought into memory as the result of, say, a read() call will initially be added to the inactive list. This makes sense because that page will often never be used again. Should there be another access to the page, though, it will be made active and the kernel will try to avoid reclaiming it. This mechanism works better than its predecessors, but it is still possible for processes doing a lot of I/O to flush useful pages out of the active list, hurting the performance of the system.
Doing better involves making use of the existing shadow-page tracking in the kernel. When pages are reclaimed for another use, the kernel remembers, for a while, what those pages contained and when the old contents were pushed out. If one of those pages is accessed again in the near future, requiring it to be brought back in from secondary storage, the kernel will notice this "refault", which is a signal that actively used pages are being reclaimed. As a general rule, refaults indicate thrashing, which is not a good thing. The kernel can respond to excessive refaulting by, for example, making the active list larger.
The multi-generational LRU work tweaks the shadow entries to record which tier a page was in when it was reclaimed. If the page is refaulted, it can be restored to its prior tier, but the refault can also be counted in association with that tier. That allows the computation of the refault rate for each tier — what percentage of pages being reclaimed from that tier are being subsequently refaulted back into memory? It seems evident that refaults on pages in higher tiers — those which are being accessed more frequently — would be worth avoiding in general.
This refault information is used by comparing the refault rates of the higher tiers against that of tier 0, which contains, remember, pages that are accessed directly by the CPU and pages that have not been accessed at all. If the higher tiers have a refault rate that is higher than the tier 0 rate, then pages in those tiers are moved to a younger generation and thus protected (for a while) from reclaim. That has the effect of focusing reclaim on the types of pages that are seeing fewer refaults.
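A simplified sketch of that comparison (again, not the kernel's implementation; the structure and numbers below are made up for illustration) keeps per-tier counts of evicted and refaulted pages and flags any tier whose refault rate exceeds tier 0's for protection:

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_TIERS 4

    struct tier_stats {
        unsigned long evicted;    /* pages reclaimed from this tier */
        unsigned long refaulted;  /* of those, pages faulted back in */
    };

    static struct tier_stats tiers[NR_TIERS];

    /*
     * Is this tier refaulting at a higher rate than tier 0?  The two
     * ratios are compared by cross-multiplying to avoid division.
     */
    static bool tier_should_be_protected(unsigned int tier)
    {
        struct tier_stats *t = &tiers[tier], *base = &tiers[0];

        if (tier == 0 || t->evicted == 0)
            return false;
        return t->refaulted * base->evicted > base->refaulted * t->evicted;
    }

    int main(void)
    {
        tiers[0] = (struct tier_stats){ .evicted = 1000, .refaulted = 50 }; /*  5% */
        tiers[2] = (struct tier_stats){ .evicted = 200,  .refaulted = 40 }; /* 20% */

        printf("protect tier 2? %s\n",
               tier_should_be_protected(2) ? "yes" : "no");  /* prints "yes" */
        return 0;
    }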
The other piece of the puzzle is that the memory-management code no longer automatically promotes pages on the second file-descriptor-based access, as is done in current kernels. Instead, pages resulting from I/O stay in the oldest generation unless they have been moved, as the result of usage, into a tier that is refaulting at a higher rate than directly accessed pages. That, as Zhao explained in this lengthy message, has the effect of preventing these pages from forcing out directly accessed pages that are more heavily used. That should give better performance on systems doing a lot of buffered I/O; this remark from Jens Axboe suggests that it does indeed help.
Another change from the first version is the addition of a user-space knob to force the eviction of one or more generations. The purpose of this feature appears to be to allow job controllers to make some space available for incoming work; this documentation patch contains a little more information.
Multiple patch generations
The multi-generational LRU work remains promising, and it has garnered a fair amount of interest. Its path into the mainline kernel still looks long and difficult, though. Johannes Weiner raised a concern that was mentioned in the first article as well: the multi-generational LRU, as implemented now, sits alongside the existing memory-management code as a separate option, essentially giving the kernel two page-reclaim mechanisms. That will always be a hard sell for reasons described by Weiner:
It would be impossible to maintain both, focus development and testing resources, and provide a reasonably stable experience with both systems tugging at a complicated shared code base.
So the new code would have to replace the existing system, which is a tall order. It would have to be shown to perform better (or, at least, not worse) for just about any workload, at a level of confidence that would motivate the replacement of code that has "billions of hours of production testing and tuning". The only way to do this is to merge the changes as a series of small, evolutionary steps. So the multi-generational LRU patch set would have to be broken up into a series of changes, none of which is so big that the memory-management developers feel it cannot be safely merged.
Over the years, the kernel has absorbed massive changes this way, but it is not a fast or easy process. Weiner suggested a couple of areas that could be focused on as a way of beginning the task of getting parts of this work upstream and making the rest easier to consider. If this advice is followed, some progress toward merging the multi-generational LRU could be made in the relatively near future. But this functionality as a whole is likely to have to go through numerous generations of revisions before it all makes it into a mainline kernel.
Index entries for this article
Kernel: Memory management/Page replacement algorithms
Posted May 24, 2021 18:00 UTC (Mon) by dancol (guest, #142293) [Link] (7 responses)
[Twitch] Please, God, no! An access to a page should be an access to a page regardless of whether that access came from a file descriptor operation, from a page fault, from a kernel file server, or from any other source. The kernel should not be making policy decisions based on how an application chooses to spell its resource access!
Posted May 24, 2021 18:58 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (3 responses)
Frankly, I'm not sure I understand the change you are proposing.
Posted May 25, 2021 7:32 UTC (Tue) by epa (subscriber, #39769) [Link] (1 responses)
Posted May 25, 2021 16:52 UTC (Tue) by mwsealey (subscriber, #71282) [Link]
Posted May 26, 2021 6:56 UTC (Wed) by garloff (subscriber, #319) [Link]
Lazily scanning pages for this bit, counting and clearing it again would seem like a way to approximate non-fd page access.
Maybe the kernel already does this?
Posted May 26, 2021 3:40 UTC (Wed) by yuzhao@google.com (guest, #132005) [Link]
From: https://lwn.net/ml/linux-kernel/CAOUHufbz_f4EjtDsMkmEBbQp...
Remark 1: a refault, *a page fault* or a buffered read is exactly one memory reference. A page table reference as how we count it, i.e., the accessed bit is set, could be one or a thousand memory references. So the accessed bit for a mapped page and PageReferenced() for an unmapped page may carry different weights.
Posted May 26, 2021 18:48 UTC (Wed) by iabervon (subscriber, #722) [Link] (1 responses)
How you perform your accesses kind of has to matter in order to make sense, regardless: is a 4096-byte read from a page the same number of accesses as 4096 1-byte reads from the page, or as 1 1-byte read from the page?
Posted May 26, 2021 21:43 UTC (Wed) by nybble41 (subscriber, #55106) [Link]
> … is a 4096-byte read from a page the same number of accesses as 4096 1-byte reads from the page, or as 1 1-byte read from the page?
In my opinion: both. A single 4096-byte read, 4096 separate 1-byte reads, and a single 1-byte read (all within a single sample interval) should all be weighted the same for determining whether to keep the page in RAM. Of course the final decision should be based on multiple sample intervals, not just one sample. A better metric might be how long the page has gone without any access vs. how many times the data has been faulted back into RAM after being discarded.
Posted May 25, 2021 1:56 UTC (Tue) by dxin (guest, #136611) [Link] (7 responses)
This is like the opposite of I/O scheduler situation, where users with very fast NVMEs drives it.
Posted May 25, 2021 3:50 UTC (Tue) by comicfans (subscriber, #117233) [Link] (5 responses)
while using windows, I've never seen system hang due to high memory usage, but such problem can easily makes linux "hang": gui/terminal not respond to any key in reasonable time/ ssh timeout ,you can't kill bad app, anything you're doing just makes it slower. you may wait it for hours (or forever) to restore, or just hard reset. I've hit such problems multi times.
taken from patch mail:
...8G RAM + zswap + swap... lots of apps opened ... LSP/chats/browsers... gets quickly to a point of SWAP-storms... system lags heavily and is barely usable.
... migrated from 5.11.15 kernel to 5.12 + the LRU patchset... Till now I had *not a single SWAP-storm*, and mind you I got 3.4G in SWAP. I was never getting to the point of 3G in SWAP before without a single SWAP-storm.
Posted May 25, 2021 12:05 UTC (Tue) by k3ninho (subscriber, #50375) [Link] (4 responses)
Windows XP and spinning rust used to grind to a halt on heavy memory use, and Win10 puts a device out of action for hours upgrading and patching itself if you still have a rotating hard disk. That's more about the latency of storage access. What is the current advice about RAM and swap? I ask because, since having NAND-based SSD's with block overwriting concerns, I've disabled swap and bought large amounts of RAM so that apps can't get stroppy about memory pressure. Only recently have I configured zswap but it's not noticeably changed my experience.
I think that this is something that should involve running user-experience items at elevated nice levels and using the alt-sysrq keys to safely OOM-kill and then unmount filesystems if you can't recover the device.
Are we still advocating for swap in 2021?
K3n.
Posted May 25, 2021 14:41 UTC (Tue) by comicfans (subscriber, #117233) [Link]
maybe I'm lucky, I haven't hit XP halt even once with celeron1.7 + 845G + 256RAM + HDD . IIRC, I've run many apps which definitely exceeds 256MB, it got slower, but never stop responding.
>and Win10 puts a device out of action for hours upgrading and patching itself if you still have a rotating hard disk.
we all know windows update isn't good, but we're talking about high memory pressure, not a bad system update.
>I've disabled swap and bought large amounts of RAM so that apps can't get stroppy about memory pressure. Only recently have I configured zswap but it's not noticeably changed my experience.
it doesn't apply if laptop have none-replacable ram. and memory leak (maybe by accident) can eat as much ram as you have. of course better hardware can resolve many software problems, but should it stop kernel improve experience on old hardware ? while slower-bigger storage is still cheaper than faster-smaller storage, swap is always needed (that's why bcache/lvm-cache exist)
Posted May 25, 2021 17:30 UTC (Tue) by MattBBaker (guest, #28651) [Link]
My recommendation is to have no swap if the backing device is a HDD. Why? If the system needs so much RAM that it starts swapping, it becomes unusable anyway.
Posted May 28, 2021 18:17 UTC (Fri) by anton (subscriber, #25547) [Link] (1 responses)
With an SSD, swap may be more viable. And if you swap rarely, don't worry about the limited number of SSD writes possible. If you swap often, buy more RAM.
Posted Jun 10, 2021 11:46 UTC (Thu) by Hi-Angel (guest, #110915) [Link]
First, for HDD: I'm the guy who posted the testimonial that the v3 of the patches refers to, and I am using HDD. So, just in case anyone's wondering about behavior on HDD specifically, here it is.
Second, for SSD: I see here an attitude that once you create a SWAP on SSD, all problems are gone. So here's my experience: this is not true.
My gf has a Macbook 2013 with SSD, 4G RAM, ZSWAP. She always had swap-partition on the SSD. Before I went out to try the patches on her laptop, she have also had frequent SWAP-storms, her overall experience was pretty bad.
After I configured her system to use the multi-LRU patches (v2), her experience improved a lot. Now the only moment when lags start appearing is when her SWAP usage goes to around 7-8G (Idk why exactly that size).
So, for anyone out there thinking creating a SWAP on SSD will magically solve any need in the memory reclaim rework — that ain't true.
Posted May 26, 2021 4:01 UTC (Wed) by yuzhao@google.com (guest, #132005) [Link]
Posted May 26, 2021 12:36 UTC (Wed) by intelfx (subscriber, #130118) [Link] (1 responses)
Posted May 27, 2021 7:29 UTC (Thu) by yuzhao@google.com (guest, #132005) [Link]
The fundamental arguments are different: the multigenerational lru argues that pages that have been used only once (if we are sure) are always the best choice, no matter how recent they were used, because 1) even some of them are not, they will be protected upon the second access; 2) the cost to ascertain whether the rest are better or not is higher (to do so we probably have to scan more than half a million pages on a 4GB laptop under moderate pressure, and there is no guarantee we'd make better choices after we have done it).
Essentially, ZFS ARC is a *cache* replacement implementation, which discovers page accesses reactively by hooks. For *page* replacement, we have to scan page tables proactive, which is far more difficult and expensive in terms of discovering page accesses.