Toward a swap abstraction layer

Posted May 24, 2023 10:44 UTC (Wed) by farnz (subscriber, #17727)
In reply to: Toward a swap abstraction layer by xecycle
Parent article: Toward a swap abstraction layer

As a rule of thumb, turning off swap completely is a route to trouble. If turning swap off improves your system, you've probably got the wrong value for vm.swappiness, so that the kernel is too keen to swap pages instead of dropping page cache, or vice versa.

Memlocking critical UI code only helps if your system is set up to avoid swapping: most code lives in the page cache and is mapped into userspace, so if you have too little swap (or a kernel tuned to prefer evicting page cache over swapping anonymous pages), the kernel is forced to drop those page cache pages in order to avoid OOM.

This is a complicated topic, and there's a lot of misunderstanding out there about what works and what does not work - e.g. people not realising that the "page cache" includes all the code pages from your executables, and thus by removing swap as an option, you force the kernel to evict code you're actively using in order to avoid OOM.

Posted May 24, 2023 12:30 UTC (Wed) by xecycle (subscriber, #140261)

> by removing swap as an option, you force the kernel to evict code

This can be right or wrong, depending on other factors. E.g. on single-purpose systems I can go to extremes: memlock all code sections, and at the same time use cgroups to limit all of userspace to somewhat less memory than the total physical memory. Removing swap effectively locks all anon pages in RAM; I can still manually lock file-backed pages.

This, IMO, can be a valid protection scheme, and it can work without swap.

Btw "in order to avoid OOM" is what the kernel is programmed to do, which makes sense: as a general-purpose kernel it must try to allow programs to proceed in the absence of configured limits. But this may not be the top priority of an end user. In the desktop use case (where the word "responsiveness" makes sense), I would prefer something that can quickly identify a misbehaving program and kill it, ideally without other programs' code being dropped or their anon pages swapped out. Let's imagine this protection scheme:
- I pick some apps, declare them as important, run them in a cgroup with memory.min protections, all code sections locked;
- for all other apps, run them in another cgroup that is limited to total RAM, minus some margin for the kernel, minus the amount dedicated to the important apps above.
This way, with or without swap, with or without an early OOM daemon, those important apps should get equal protection after all. "Unimportant" apps can benefit from swap, yes, but I can simply declare them "unimportant".
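The arithmetic in the second bullet can be sketched as below. This is a minimal illustration, not a tested configuration: the function name `plan_cgroup_limits`, the cgroup names `important` and `other`, and the example sizes are all hypothetical; only `memory.min` and `memory.max` are real cgroup v2 control files.

```python
GIB = 1024 ** 3

def plan_cgroup_limits(total_ram, kernel_margin, important_reserve):
    """Return the values to write into cgroup v2 control files.

    All arguments are in bytes. The cgroup names "important" and "other"
    are hypothetical; memory.min and memory.max are cgroup v2 files.
    """
    other_max = total_ram - kernel_margin - important_reserve
    if other_max <= 0:
        raise ValueError("margin and reserve leave no memory for other apps")
    return {
        "important/memory.min": important_reserve,  # protected from reclaim
        "other/memory.max": other_max,               # hard cap for the rest
    }

if __name__ == "__main__":
    # Example sizes are assumptions: 64 GiB RAM, 2 GiB margin, 8 GiB reserved.
    limits = plan_cgroup_limits(64 * GIB, 2 * GIB, 8 * GIB)
    for path, value in limits.items():
        print(f"{path} = {value}")
```

The sketch only computes the numbers; actually applying them would mean writing each value into the matching file under the cgroup v2 mount point.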

Of course I don't think there are many people doing this on their desktops, but I think a similar practice is already in use elsewhere, e.g. pod QoS in k8s.

Again, I'm arguing that users actually want a good protection scheme, and a good scheme for some use case may work well without swap at all. After all, what to do with diskless systems?

Posted May 24, 2023 13:56 UTC (Wed) by farnz (subscriber, #17727)

Sure, in special-purpose systems you can memlock all important code so that it can't be evicted from the page cache, and you can limit userspace so that it can't overcommit. This works fine with or without swap.

But the underlying problem we have is that fork either requires a huge amount of wasted swap/memory, or overcommit. If you allow overcommit, then you have to have some way of handling what happens when you get it wrong, and run out of memory.

Your scheme isn't something that's commonly used, precisely because of the overcommit issue - you've just declared that important apps can't overcommit, and thus can't reliably use fork. What is common is setting up OOM controls so that if there's a shortage of memory, unimportant apps get killed first, and unimportant apps get paged out in preference to important apps.

But on a desktop, pretty much everything you're running is an important application - you don't bother running something if you don't want it to work. So you need a system that works well for a case where nothing is unimportant - and that's the case where you have a small amount of swap, and an early OOM daemon that kills programs if they start entering swap thrash.

And it's not useful to keep all code sections in memory - my desktop is currently perfectly responsive, and yet significant chunks of running applications are not paged in: startup code, error handling code for errors I'm never going to hit, POP3 code in my e-mail client, X11 support in my GUI libraries for Wayland-native programs, Wayland support in GUI libraries for programs that still use X11, and more. None of that needs to be loaded into memory, since it's never going to run; and by locking it into memory, you stop my applications from using that memory for something more useful to me.

The goal the kernel has is to page out only the parts of file-backed data that won't be accessed again soon, and to page out anonymous data that won't be accessed again soon. It needs a bit of guidance (the swappiness parameter) to help it out in deciding between the two, but it won't page something out unless it's rarely accessed - and by definition, that's the stuff that you don't need for responsiveness, since if you did, it'd be frequently accessed.

Posted May 24, 2023 16:30 UTC (Wed) by SLi (subscriber, #53131)

I would love to know what that magic value for swappiness is. It is my experience too that with basically any amount of swap and swappiness>0, there's a major risk of a system becoming completely unresponsive for half an hour, even with the swap space being on a fast SSD.

If this is easy to get right, is there any (say, desktop-oriented) distribution that gets it right, or even modifies the kernel defaults, out of the box?

My workstation has 512 GiB of RAM. If I add a comparatively very meager amount of 10 GiB of swap, a runaway process makes it completely unusable for a long time.

I think in this game, it's cheaper to just buy 10% more RAM and waste it. Still, it's sad. Why is this so hard?

Posted May 24, 2023 16:44 UTC (Wed) by farnz (subscriber, #17727)

10 GiB of swap is a huge amount, unless you're spreading it across multiple Optane SSDs. To avoid the unresponsiveness issue, you want the OS to be able to random-write the swap area in page-sized chunks in under half a second. On my laptop, with a fast SSD (1.4M IOPS), that limits swap to 6 GiB at most, and I actually use 128 MiB (with 64 GiB RAM), which makes for a good balance: when things go wrong, it doesn't thrash, but I see anonymous pages swapped out when my free memory gets low. (The swap is often unused, since I frequently don't fill RAM with page cache and anonymous pages in a single session, and I shut down the laptop overnight.)
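The sizing rule above can be sketched as a small calculation. This is only an illustration under stated assumptions: the function name `max_swap_bytes` is hypothetical, and the 4 KiB page size is assumed rather than taken from the comment. With those inputs it gives roughly 2.7 GiB, which is under the 6 GiB ceiling quoted above; the result scales linearly with the assumed page size and time budget.

```python
def max_swap_bytes(random_write_iops, page_size=4096, budget_seconds=0.5):
    """Largest swap area that can be rewritten page by page within the budget.

    page_size=4096 is an assumption; budget_seconds=0.5 follows the
    "under half a second" rule of thumb.
    """
    pages = int(random_write_iops * budget_seconds)
    return pages * page_size

if __name__ == "__main__":
    limit = max_swap_bytes(1_400_000)  # 1.4M IOPS, as in the comment
    print(f"{limit / 1024**3:.2f} GiB")
```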

The rules about swap being sized proportionally to RAM come from systems without overcommit. In that situation, you do need a lot of swap, but most of it is never used; Linux has overcommit, and thus you only need a very small amount of swap to ensure that unused anonymous pages can be swapped in preference to paging out in-use code.

Posted May 24, 2023 16:49 UTC (Wed) by SLi (subscriber, #53131)

But then that also sounds like a very much diminishing benefit. In the best case, your computer performs as if it had 0.2% more RAM. In the worst case, you'll run into OOM sluggishness. Is it worth it?

Posted May 24, 2023 16:52 UTC (Wed) by farnz (subscriber, #17727)

Yes, it really is. It's the difference between two outcomes: the machine going sluggish when I run close to OOM, because it's constantly paging in and out the code I'm actually using; or the machine using swap to page out anonymous data, giving systemd-oomd time to react and kill off the thing that's eating all my RAM.

Without it, I find my system enters thrash a lot more easily than it does with a tiny amount of swap.

Posted May 28, 2023 23:43 UTC (Sun) by pturmel (guest, #95781)

This is a very useful tip. Thanks, @farnz. I will now give my laptop 128MB of swap and see how it goes (also 64GB of RAM, NVMe storage).

Posted May 30, 2023 12:42 UTC (Tue) by farnz (subscriber, #17727)

I should note that I suspect that a significant fraction of my gains from this are because I have systemd-oomd running, and it takes action when swap is significantly used. My working theory is that if I'm hitting 90% full swap, I'm at a point where I've exceeded my system's capabilities, and systemd-oomd should kick in and kill things before I start paging out the executable code I'm actively using.
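The "90% full swap" condition can be checked by hand with a sketch like the one below. It is not how systemd-oomd is implemented; the function names are hypothetical, and the only interface relied on is the `SwapTotal` and `SwapFree` fields of `/proc/meminfo`.

```python
def swap_used_fraction(meminfo_text):
    """Return used swap as a fraction of total, or 0.0 if no swap is configured."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # first token is the number
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return (total - free) / total if total else 0.0

def over_threshold(meminfo_text, threshold=0.9):
    """True when swap usage has reached the given fraction (0.9 = 90%)."""
    return swap_used_fraction(meminfo_text) >= threshold

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        text = f.read()
    print(f"swap used: {swap_used_fraction(text):.0%}")
```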

Posted May 29, 2023 9:10 UTC (Mon) by kleptog (subscriber, #1183)

> The rules about swap being sized proportionally to RAM come from systems without overcommit.

And from the fact that if you want to support hibernation, your memory has to fit in swap. IIRC that was the argument some distributions had for sizing swap so large by default. I think most distributions don't do that anymore, and hibernation has basically disappeared as an option in most setups.

I think it would be nice to be able to reserve some disk space for "hibernation-but-not-swap", but that's not possible AFAIK.

Posted May 29, 2023 12:20 UTC (Mon) by mb (subscriber, #50428)

> I think it would be nice to be able to reserve some disk space for "hibernation-but-not-swap", but that's not possible AFAIK.

You can do that by only enabling an additional swap partition just before hibernation and disabling it right after resume.
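The sequence described above could look roughly like the sketch below. It only builds the command list and does not run anything. The device path is hypothetical, and the sketch assumes that the kernel has separately been told where to find the hibernation image on resume, and that the final command runs only after the machine has resumed.

```python
def hibernate_commands(swap_device):
    """Return the commands to run, in order, as argument lists."""
    return [
        ["swapon", swap_device],     # make the hibernation area available
        ["systemctl", "hibernate"],  # suspend to disk
        ["swapoff", swap_device],    # after resume, stop using it as swap
    ]

if __name__ == "__main__":
    # "/dev/disk/by-label/hibernate" is a made-up example path.
    for cmd in hibernate_commands("/dev/disk/by-label/hibernate"):
        print(" ".join(cmd))
```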

Posted May 30, 2023 13:38 UTC (Tue) by gioele (subscriber, #61675)

> > I think it would be nice to be able to reserve some disk space for "hibernation-but-not-swap", but that's not possible AFAIK.
>
> You can do that by only enabling an additional swap partition just before hibernation and disabling it right after resume.

That's more or less the behavior of the `resume=` kernel argument, isn't it?

From https://man7.org/linux/man-pages/man7/kernel-command-line...

> resume=, resumeflags=
>
> Enables resume from hibernation using the specified device and mount options. All fstab(5)-like paths are supported. For details, see systemd-hibernate-resume-generator(8).