
Toward a swap abstraction layer


Posted May 23, 2023 9:08 UTC (Tue) by atnot (subscriber, #124910)
Parent article: Toward a swap abstraction layer

> The kernel's swapping code tends to not get much love. Users try to avoid it[...]

I feel like a lot of this is due to the unfortunate general mis(understanding|information) that swap is only of use in emergencies, when a device "runs out" of memory. For anyone unfamiliar: swap's role is actually much more important during normal operation, because it puts anonymous and file-backed memory on an equal footing. Without swap, if a program allocates some memory at startup that it never touches again, those pages must stay resident forever, even when that memory would be more useful as page cache or for some other purpose.

I think part of this may be due to latency: especially with old hard drives, the stutter as pages were swapped back in was immediately noticeable, while the disk accesses that stutter saved were not. But with modern SSDs, and especially RAM compression, those costs have largely disappeared and it's basically all upside.

Large server operators have understood this for a while, but fortunately more distros like Fedora are shipping swap-on-zram and other solutions these days too. So this perception does seem to be slowly changing and I'm very happy to see more love in this area!
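For the curious, Fedora's swap-on-zram is set up by zram-generator, which reads a small config file; a sketch in the spirit of the distro default follows (the exact values and compression algorithm here are illustrative, not necessarily Fedora's shipped settings):

```ini
# /etc/systemd/zram-generator.conf -- illustrative sketch
[zram0]
# compressed swap device sized at half of RAM, capped at 4 GiB
zram-size = min(ram / 2, 4096)
compression-algorithm = zstd
```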



Toward a swap abstraction layer

Posted May 23, 2023 11:07 UTC (Tue) by kleptog (subscriber, #1183) (2 responses)

> I feel like a lot of this is due to the unfortunate general mis(understanding|information) that swap is only of use in emergencies when a device "runs out" of memory.

The problem is that the worst case was so terrible. With an HDD that can do 80MB/s, a 1GB swap file is about 12 seconds of 100% I/O utilisation *in the best case*, which generally isn't what happens; for some reason, swapping is pretty bad at I/O utilisation. So some process goes haywire, the kernel spends time swapping your working set out to disk, then finally invokes the OOM killer to kill the offending process, and then spends more time swapping your working set back in. In the meantime your system is totally unresponsive.
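That back-of-the-envelope figure is easy to check; a tiny sketch, using the same numbers as above:

```python
def drain_seconds(swap_bytes: int, throughput_bytes_per_s: int) -> float:
    """Best-case time to stream an entire swap file at sequential speed."""
    return swap_bytes / throughput_bytes_per_s

# 1 GB of swap at 80 MB/s of sequential HDD throughput
print(drain_seconds(1000**3, 80 * 1000**2))  # 12.5 seconds, best case
```

Real swap traffic is mostly random 4 KiB I/O, so the actual stall on a spinning disk is far worse than this sequential best case.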

Add in the recommendation to make your swap file at least as large as your main memory and you could be waiting a very long time. Having no swap avoids that problem entirely, but you pay the penalty in a different way. Once bitten, twice shy.

IMO, when a process goes haywire, either the OOM killer should be invoked before your working set is swapped out, or the swapping should be limited to the process causing the memory pressure.

Of course, SSDs changed the equation considerably. In my experience, 1GB swap is of a size that haywire processes (usually firefox, occasionally other things) only lead to about ~30s unresponsiveness. It would be great if an exploding firefox tab led to only firefox being unresponsive rather than the whole system.

Toward a swap abstraction layer

Posted May 23, 2023 13:07 UTC (Tue) by atnot (subscriber, #124910)

> In my experience, 1GB swap is of a size that haywire processes (usually firefox, occasionally other things) only lead to about ~30s unresponsiveness. It would be great if an exploding firefox tab led to only firefox being unresponsive rather than the whole system.

This is why you generally want a userspace OOM killer, like oomd or the one provided by the systemd project. The kernel OOM killer is only invoked when an allocation request actually fails, at which point it is both too late and too deep within the kernel to make good decisions. For example, you mention Firefox, but there is a good chance that Firefox isn't the cause at all; a modern browser just happens to have a lot of processes that map a lot of memory, which makes it a juicy first target for the OOM killer regardless of whether it is actually the problem.

A userspace OOM killer can take advantage of cgroups and statistics like PSI to act much earlier and smarter instead. It can, for example, look at memory pressure to determine that some cgroup is causing a lot of system time to be spent on memory reclaim (shrinking caches, swapping, etc.) and interfering with system performance long before it has exhausted all memory. In my experience when I fat-finger an array size that usually means a brief hitch of a second at most.
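PSI statistics are exposed as plain text in /proc/pressure/memory (and per cgroup in memory.pressure). A minimal parser for that format, the kind of input a hypothetical userspace killer could trigger on, might look like this (the sample string and the 1% threshold are illustrative):

```python
def parse_psi(text: str) -> dict:
    """Parse PSI lines like 'some avg10=0.12 avg60=... total=123' into a dict."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result

sample = ("some avg10=1.23 avg60=0.40 avg300=0.10 total=123456\n"
          "full avg10=0.50 avg60=0.10 avg300=0.00 total=45678\n")
psi = parse_psi(sample)
# e.g. act if more than 1% of the last 10s was spent fully stalled on memory
if psi["full"]["avg10"] > 1.0:
    print("memory pressure high")
```

The "full" line is the interesting one for an OOM daemon: it measures time during which every non-idle task was stalled on memory at once.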

Toward a swap abstraction layer

Posted May 23, 2023 13:11 UTC (Tue) by gmgod (guest, #143864)

> Of course, SSDs changed the equation considerably.

In case of true OOM, you get the exact same "system grinding to a halt" problem you had on HDDs though. And usually not just 30s of it (in my experience)...

Toward a swap abstraction layer

Posted May 23, 2023 11:41 UTC (Tue) by xecycle (subscriber, #140261) (19 responses)

> it puts anonymous and file-backed memory on equal footing

Yes, we have all read that "in defence of swap" article, and we agree that swap has its value in memory management. But IMHO declaring anon pages and file-backed ones "equal" is way too optimistic:
- file-backed pages that are not dirty do not need a swap-out step;
- applications usually anticipate that disk file access may take some time, and implement many strategies to keep file access from hanging the UI; but no app can anticipate that an arbitrary anon page may take a long time to access, so swapping out such pages has a higher chance of giving the user a stutter.

But, well, don't get me wrong, I'm not against enabling swap. My opinion is that saying "we should enable swap" and advertising "enable swap and your system will be more responsive" cannot be accurate. A lot of configuration is needed to get an always-responsive system; swap is one part of the config, but all the parts must come together to make it work. MGLRU, DAMON, cgroups and their managers (e.g. the system76 scheduler) are going to help us on the way. However, there is always another choice: if your RAM is enough most of the time, just don't enable swap and shut down every day; if you get bitten some day, just reboot. I'm curious whether those strategies can play equally well as this.

Again, don't get me wrong, I don't like rebooting everyday, and I do have swap enabled on some of my systems; although when it goes too far I still have to resort to rebooting.

Toward a swap abstraction layer

Posted May 23, 2023 12:38 UTC (Tue) by farnz (subscriber, #17727) (5 responses)

I don't get your second point:

- applications usually anticipate disk file access may take some time, and they implement many strategies to avoid file access hanging UI, but no app can anticipate an arbitrary anon page may take more time to access; swapping out such pages may have a higher chance to give the user a stutter.

If the kernel has no access to swap, then it's, by definition, going to evict file-backed pages to make room for the application's data use. On my system, the vast majority of file-backed pages are executable code; what applications implement strategies to avoid hanging the UI if their code is paged out and has to be reloaded from mass storage?

Toward a swap abstraction layer

Posted May 23, 2023 14:30 UTC (Tue) by xecycle (subscriber, #140261) (4 responses)

I think the page cache counts as a kind of file-backed memory; did I get that wrong? After all, I'm told by some swap supporters that they choose to swap out anon pages to make room for the page cache.

Toward a swap abstraction layer

Posted May 24, 2023 0:01 UTC (Wed) by Paf (subscriber, #91811) (3 responses)

Is there any *other* kind of file backed data?

Toward a swap abstraction layer

Posted May 24, 2023 0:55 UTC (Wed) by xecycle (subscriber, #140261) (2 responses)

Ugh, you are right, my wording was bad. I mean that, at least on my machine, most of the page cache is not mapped into any userspace mapping; and that's what swap proponents keep telling me: that this cache is also good for performance, and that I should swap out the other, unused parts, the ones mapped by userspace, to make room for that part of the page cache. They never told me a good strategy for deciding which ones are "unused", however.

Toward a swap abstraction layer

Posted May 24, 2023 10:38 UTC (Wed) by farnz (subscriber, #17727)

Assuming you have a sensibly sized swap (not so large that you can enter swap thrash, and not too small), the kernel will get it right. It tracks access frequency for pages, it can track whether it's doing repeated I/O to reread pages, and it will (with help from the vm.swappiness sysctl) decide whether to swap out an unused page of anonymous memory or drop some page cache.

And bear in mind that pages mapped by executables are part of the page cache. The page cache is all data that the kernel can recover from the filesystem trivially, and thus can drop at any time - most of the code segments of the running executables are just page cache pages.
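For reference, that knob lives at /proc/sys/vm/swappiness and can be set persistently through sysctl.d; a sketch follows (the value shown is the long-standing kernel default, not a recommendation):

```ini
# /etc/sysctl.d/99-swappiness.conf -- example only
# Lower values bias the kernel toward dropping page cache;
# higher values bias it toward swapping out anonymous pages.
vm.swappiness = 60
```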

Toward a swap abstraction layer

Posted May 30, 2023 19:15 UTC (Tue) by Paf (subscriber, #91811)

Ok, thanks for the clarification!

Toward a swap abstraction layer

Posted May 24, 2023 1:26 UTC (Wed) by xecycle (subscriber, #140261) (12 responses)

Sorry, it seems I may have upset some people, so allow me to rephrase:

I think that in-defense of swap article says "having a swap is good for memory management in general", and I agree.

However, I'm bothered that part of our community seems to take that as "enable swap, and your desktop will be more responsive at all times". This kind of flame war happens very often when someone says "I disabled swap and it's faster". I don't know whether that claim is implied in the article (I think not), but I have to say I disagree with it. There are too many factors in the responsiveness problem, and swap on/off can make it either better or worse. I have heard of a user (in a Telegram group chat) who went and implemented a daemon that memlocks all X/GTK libraries on boot, so he gets a hard guarantee that critical UI library code never gets evicted. Having heard that, I think the whole picture may be: swap gives eviction more choices, but users need protection in addition to swap to actually get good performance; those protection mechanisms are much more complex, and the solutions are still evolving. As such, I think there's no point shouting at a user who disables swap; he may have done careful testing on his own and decided that no-swap is actually better for him.

Toward a swap abstraction layer

Posted May 24, 2023 10:44 UTC (Wed) by farnz (subscriber, #17727) (11 responses)

As a rule of thumb, turning off swap completely is a route to trouble, and if turning swap off improves your system, you've probably got the wrong value for vm.swappiness, such that the kernel is too keen to swap pages instead of dropping page cache, or vice-versa.

In the case of memlocking critical UI code, that's something that will only help if your system is set up to avoid swapping - most of the code is in the page cache, and mapped into userspace, and if you have too little swap (or a kernel tuned to prefer evicting page cache to swapping anonymous pages), the kernel is forced to drop those page cache pages in order to avoid OOM.

This is a complicated topic, and there's a lot of misunderstanding out there about what works and what does not work - e.g. people not realising that the "page cache" includes all the code pages from your executables, and thus by removing swap as an option, you force the kernel to evict code you're actively using in order to avoid OOM.

Toward a swap abstraction layer

Posted May 24, 2023 12:30 UTC (Wed) by xecycle (subscriber, #140261) (1 responses)

> by removing swap as an option, you force the kernel to evict code

This can be right or wrong, depending on other factors. E.g. on single-purpose systems I can go to the extreme: memlock all code sections, together with cgroup-limiting all of userspace to somewhat less memory than the total physical memory. Removing swap basically means locking all anon pages; I can still manually lock the file-backed pages.

This, IMO, can be a valid protection scheme, and it can work without swap.

Btw, "in order to avoid OOM" is what the kernel is programmed to do, which makes sense: as a general-purpose kernel it must try to let programs proceed in the absence of configured limits. But that may not be the top priority of an end user. In the desktop use case (where the word "responsiveness" makes sense), I would prefer something that can quickly identify a misbehaving program and kill it, ideally without other programs' code being dropped or their anon pages swapped out. Let's imagine this protection scheme:
- pick some apps, declare them important, and run them in a cgroup with memory.min protection and all code sections locked;
- run all other apps in another cgroup that is limited to (total RAM, minus some margin for the kernel, minus the amount dedicated above).
This way, with or without swap, and with or without an early-OOM daemon, the important apps get equal protection. "Unimportant" apps can still benefit from swap, yes, but I can simply declare them "unimportant".
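With systemd, a two-cgroup scheme along those lines can be sketched with its resource-control directives; the slice names and sizes below are hypothetical, purely for illustration:

```ini
# /etc/systemd/system/important.slice -- protected apps (hypothetical)
[Slice]
# guarantee this much memory is never reclaimed from the slice
MemoryMin=4G

# /etc/systemd/system/other.slice -- everything else (hypothetical)
[Slice]
# cap well below physical RAM, leaving headroom for the kernel
# and for important.slice; these pages may still use swap
MemoryMax=24G
```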

Of course I don't think there are many people doing this on their desktops, but I think similar practice is already used elsewhere, e.g. pod QoS in k8s.

Again, I'm arguing that users actually want a good protection scheme, and a good scheme for some use case may work well without swap at all. After all, what to do with diskless systems?

Toward a swap abstraction layer

Posted May 24, 2023 13:56 UTC (Wed) by farnz (subscriber, #17727)

Sure, in special-purpose systems you can memlock all important code so that it can't be evicted from the page cache, and you can limit userspace so that it can't overcommit. This works fine with or without swap.

But the underlying problem we have is that fork either requires a huge amount of wasted swap/memory, or overcommit. If you allow overcommit, then you have to have some way of handling what happens when you get it wrong, and run out of memory.

Your scheme isn't something that's commonly used, precisely because of the overcommit issue - you've just declared that important apps can't overcommit, and thus can't reliably use fork. What is common is setting up OOM controls so that if there's a shortage of memory, unimportant apps get killed first, and unimportant apps get paged out in preference to important apps.

But on a desktop, pretty much everything you're running is an important application - you don't bother running something if you don't want it to work. So you need a system that works well for a case where nothing is unimportant - and that's the case where you have a small amount of swap, and an early OOM daemon that kills programs if they start entering swap thrash.

And it's not useful to keep all code sections in memory - my desktop is currently perfectly responsive, and yet significant chunks of running applications are not paged in - start up code, error handling code for errors I'm never going to hit, POP3 code in my e-mail client, X11 support in my GUI libraries for Wayland-native programs, Wayland support in GUI libraries for programs that still use X11 and more. None of that needs to be loaded into memory, since it's never going to run; and by locking it into memory, you stop my applications from using that memory for something more useful to me.

The goal the kernel has is to page out only the parts of file-backed data that won't be accessed again soon, and to page out anonymous data that won't be accessed again soon. It needs a bit of guidance (the swappiness parameter) to help it out in deciding between the two, but it won't page something out unless it's rarely accessed - and by definition, that's the stuff that you don't need for responsiveness, since if you did, it'd be frequently accessed.

Toward a swap abstraction layer

Posted May 24, 2023 16:30 UTC (Wed) by SLi (subscriber, #53131) (8 responses)

I would love to know what that magic value for swappiness is. It is my experience too that with basically any amount of swap and swappiness>0, there's a major risk of a system becoming completely unresponsive for half an hour, even with the swap space being on a fast SSD.

If this is easy to get right, is there any (say, desktop-oriented) distribution that gets it right, or even modifies the kernel defaults, out of the box?

My workstation has 512 GiB of RAM. If I add a comparatively very meager amount of 10 GiB of swap, a runaway process makes it completely unusable for a long time.

I think in this game, it's cheaper to just buy 10% more RAM and waste it. Still, it's sad. Why is this so hard?

Toward a swap abstraction layer

Posted May 24, 2023 16:44 UTC (Wed) by farnz (subscriber, #17727) (7 responses)

10 GiB of swap is a huge amount, unless you're spreading it across multiple Optane SSDs; to avoid the unresponsiveness issue, you want the OS to be able to random-write the entire swap area in page-sized chunks in under half a second. On my laptop, with a fast SSD (1.4M IOPS), that limits swap to 6 GiB at most. I actually use 128 MiB (with 64 GiB RAM), which makes for a good balance: when things go wrong, it doesn't thrash, but I do see anonymous pages swapped out when free memory gets low (noting that the swap often goes unused, since I don't always fill RAM with page cache and anonymous pages in a single session, and I shut down the laptop overnight).
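The sizing rule above can be written as a simple formula: swap should be no larger than what the device can rewrite, page by page, within your responsiveness budget. A sketch, with a hypothetical one-second budget and the IOPS figure from the comment (note the result is a rough order-of-magnitude bound, not an exact match for any particular device):

```python
PAGE_SIZE = 4096  # bytes per page

def max_swap_bytes(random_write_iops: int, budget_seconds: float,
                   page_size: int = PAGE_SIZE) -> int:
    """Largest swap area the device can fully rewrite in page-sized
    random I/O within the given responsiveness budget."""
    return int(random_write_iops * page_size * budget_seconds)

# 1.4M random-write IOPS with a 1-second budget -> roughly 5.3 GiB
print(max_swap_bytes(1_400_000, 1.0) / 2**30)
```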

The rules about swap being sized proportionally to RAM come from systems without overcommit. In that situation, you do need a lot of swap, but most of it is never used; Linux has overcommit, and thus you only need a very small amount of swap to ensure that unused anonymous pages can be swapped in preference to paging out in-use code.

Toward a swap abstraction layer

Posted May 24, 2023 16:49 UTC (Wed) by SLi (subscriber, #53131) (1 responses)

But then that also sounds like a very much diminishing benefit. In the best case, your computer performs as if it had 0.2% more RAM. In the worst case, you'll run into OOM sluggishness. Is it worth it?

Toward a swap abstraction layer

Posted May 24, 2023 16:52 UTC (Wed) by farnz (subscriber, #17727)

Yes, it really is. It's the difference between the machine going sluggish when I run close to OOM, because it's constantly paging in and out the code I'm actually using, and the machine using swap to page out anonymous data, giving systemd-oomd time to react and kill off whatever is eating all my RAM.

Without it, I find my system enters thrash a lot more easily than it does with a tiny amount of swap.

Toward a swap abstraction layer

Posted May 28, 2023 23:43 UTC (Sun) by pturmel (guest, #95781) (1 responses)

This is a very useful tip. Thanks, @farnz. I will now give my laptop 128MB of swap and see how it goes (also 64GB of RAM, nVME storage).

Toward a swap abstraction layer

Posted May 30, 2023 12:42 UTC (Tue) by farnz (subscriber, #17727)

I should note that I suspect that a significant fraction of my gains from this are because I have systemd-oomd running, and it takes action when swap is significantly used. My working theory is that if I'm hitting 90% full swap, I'm at a point where I've exceeded my system's capabilities, and systemd-oomd should kick in and kill things before I start paging out the executable code I'm actively using.

Toward a swap abstraction layer

Posted May 29, 2023 9:10 UTC (Mon) by kleptog (subscriber, #1183) (2 responses)

> The rules about swap being sized proportionally to RAM come from systems without overcommit.

And the fact that, if you want to support hibernation, your memory has to fit in swap. IIRC that was the argument some distributions used for defaulting to such a large swap. I think most distributions don't do that anymore, and hibernation has basically disappeared as an option in most setups.

I think it would be nice to be able to reserve some disk space for "hibernation-but-not-swap", but that's not possible AFAIK.

Toward a swap abstraction layer

Posted May 29, 2023 12:20 UTC (Mon) by mb (subscriber, #50428) (1 responses)

> I think it would be nice to be able to reserve some disk space for "hibernation-but-not-swap", but that's not possible AFAIK.

You can do that by only enabling an additional swap partition just before hibernation and disabling it right after resume.

Toward a swap abstraction layer

Posted May 30, 2023 13:38 UTC (Tue) by gioele (subscriber, #61675)

> > I think it would be nice to be able to reserve some disk space for "hibernation-but-not-swap", but that's not possible AFAIK.
>
> You can do that by only enabling an additional swap partition just before hibernation and disabling it right after resume.

That's more or less the behavior of the `resume=` kernel argument, isn't it?

From https://man7.org/linux/man-pages/man7/kernel-command-line...

> resume=, resumeflags=
>
> Enables resume from hibernation using the specified device and mount options. All fstab(5)-like paths are supported. For details, see systemd-hibernate-resume-generator(8).


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds