
Three ways to rework the swap subsystem

By Jonathan Corbet
April 7, 2025

LSFMM+BPF
The kernel's swap subsystem is complex and highly optimized — though not always optimized for today's workloads. In three adjacent sessions during the memory-management track of the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, Kairui Song, Nhat Pham, and Usama Arif all talked about some of the problems that they are trying to solve in the Linux swap subsystem. In the first two cases, the solutions take the form of an additional layer of indirection in the kernel's swap map; the third, which enables swap-in of large folios, may or may not be worthwhile in the end.

Simplifying the swap subsystem

There are some good things about the kernel's swapping code, Song began. Since swapping is done to make memory available to the kernel, it must use as little memory as possible itself; the swap code manages to get by with a single byte of overhead per page of swap space. Most swapping operations are fast and lightweight; swap entries point directly to physical locations. The per-CPU swap-cluster design manages to avoid contention in almost all cases.

On the other hand, the swap system is complex. To show just how complex, Song put up this slide:

[Swap system diagram]

There are a lot of components with complex interactions, he said. New features are hard to add to this subsystem; the long (and ongoing) effort to add swapping for multi-size transparent huge pages (mTHPs) is one case in point. The whole thing is built around the one-byte-per-page design for the swap map, meaning that there is no space to store anything beyond a reference count and a pair of flags. Various optimizations have been bolted on over the years, increasing the complexity of the system.

[Kairui Song] As an example, he mentioned the SWP_SYNCHRONOUS_IO optimization, which was added in 4.15 by Minchan Kim. When the kernel is swapping from a fast, memory-based device like zram, any extra latency hurts; in such cases, the kernel can simply bypass most of the swapping machinery and copy the data directly, preserving larger folios as well. Song said that this is a nice optimization, but there are now four different ways to bring in a folio from swap. He has tried to unify them all, but that work failed due to performance regressions.

The distinction between the swap map and the swap cache adds complexity as well. The swap map is the one-byte-per-slot array that tracks the usage of swap slots in a swap device. It is a global resource, requiring locking for access, so it can be a contention point on systems that are doing a lot of swapping. The swap cache, instead, is a per-CPU data structure that holds a set of swap slots allocated in bulk from the swap map; it allows many swap-related actions to be done locklessly, and enables batching of swap-map changes when a lock must be taken. When a swap-map entry is in some CPU's cache, the SWAP_HAS_CACHE bit is set in the swap map to indicate that some CPU owns the entry. But, Song said, that bit has acquired other meanings over time, again making it harder to make changes to the swap machinery.
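As a rough illustration of that batching scheme, here is a minimal Python sketch; the class and constant names (GlobalSwapMap, PerCpuCache, BATCH) are invented for this example, and the real kernel code is considerably more involved:

```python
import threading

class GlobalSwapMap:
    """Stand-in for the global, locked swap map."""
    def __init__(self, nr_slots):
        self.lock = threading.Lock()
        self.free = list(range(nr_slots))

    def alloc_batch(self, n):
        # The one contended operation: take the lock once per batch,
        # not once per slot.
        with self.lock:
            batch, self.free = self.free[:n], self.free[n:]
            return batch

class PerCpuCache:
    """Stand-in for a per-CPU cache of preallocated swap slots."""
    BATCH = 64

    def __init__(self, global_map):
        self.global_map = global_map
        self.slots = []

    def alloc_slot(self):
        # Fast path: no lock is needed while the local cache has slots.
        if not self.slots:
            self.slots = self.global_map.alloc_batch(self.BATCH)
        return self.slots.pop() if self.slots else None
```

The point of the design is visible in the sketch: the global lock is taken once per BATCH allocations rather than once per allocation, so CPUs swapping heavily mostly work out of their own caches.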

Any redesign of this system, he said, is destined to use more memory; the one-byte design of the swap map just does not allow for much flexibility. There is, however, memory that could be used for this purpose. The swap cache currently uses eight bytes per slot, and control groups can add another two (duplicating some data in the process), so the actual memory consumption for swapping can be eleven bytes per slot. If that memory were to be repurposed, the swap map could be transformed into a "swap table" with eight bytes per entry, which would be enough for everything that he has in mind. Swap entries would still be managed in clusters, he said, and the total memory use of the swap subsystem should drop as some of the existing complexity is removed.

Song's proposed layout for this swap table (taken from his proposal for the session) looks like this:

    | -----------    0    ------------| - Empty slot
    | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
    | SWAP_COUNT |------ PFN -----|X10| - Cached
    |----------- Pointer ---------|100| - Reserved / Unused yet

There would be a table for each swap cluster, spreading out the accesses and mostly eliminating locking contention; that, in turn, should allow the elimination of the separate swap cache. The eight-bit SWAP_COUNT is the same reference count that is kept in the current swap map, but it no longer needs to dedicate a couple of bits to flags like SWAP_HAS_CACHE. This design, he said, resolves many of the problems with the current swapping subsystem, and performs better as well, yielding a 10-20% improvement in the all-important kernel-build benchmark. There is only one swap-in path, and it never bypasses the table. Memory usage is lower, he said, and this design allows for the removal of a lot of complexity from the swap subsystem.
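To make the layout above concrete, here is a hedged Python sketch of how such tagged eight-byte entries could be packed and unpacked; the field widths and function names are assumptions for illustration only, not taken from the actual patches:

```python
ENTRY_BITS = 64   # one eight-byte table entry
COUNT_BITS = 8    # the SWAP_COUNT field in the high byte
TAG_BITS = 3      # low-order type bits, as in the layout above

PAYLOAD_BITS = ENTRY_BITS - COUNT_BITS - TAG_BITS

def pack_swapped_out(count, shadow):
    # | SWAP_COUNT | shadow | XX1 |
    assert 0 <= count < (1 << COUNT_BITS)
    assert 0 <= shadow < (1 << PAYLOAD_BITS)
    return (count << (ENTRY_BITS - COUNT_BITS)) | (shadow << TAG_BITS) | 0b001

def pack_cached(count, pfn):
    # | SWAP_COUNT | PFN | X10 |
    assert 0 <= count < (1 << COUNT_BITS)
    assert 0 <= pfn < (1 << PAYLOAD_BITS)
    return (count << (ENTRY_BITS - COUNT_BITS)) | (pfn << TAG_BITS) | 0b010

def unpack(entry):
    if entry == 0:
        return ('empty', None, None)
    count = entry >> (ENTRY_BITS - COUNT_BITS)
    payload = (entry >> TAG_BITS) & ((1 << PAYLOAD_BITS) - 1)
    if entry & 0b001:
        return ('swapped_out', count, payload)  # payload is the shadow value
    if entry & 0b010:
        return ('cached', count, payload)       # payload is the PFN
    return ('pointer', None, entry & ~0b111)    # reserved: a raw pointer
```

The low-order tag bits distinguish the entry types, much as the kernel distinguishes value entries from pointers elsewhere; because kernel pointers are aligned, the low bits of a real pointer are guaranteed to be zero.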

A participant asked how the new design could be faster in the absence of the bypassing optimization currently used for zram. The answer was that the unification of the swap map and the swap cache means that there is no need to check or maintain both, making the swap subsystem as a whole faster.

Future steps, Song said, include the addition of a virtual swap device that has no storage associated with it. Instead, it contains only entries pointing to slots in other swap devices. This new layer of indirection is intended to facilitate the intact swapping of larger folios, which currently becomes difficult when the swap devices become fragmented. The virtual device could be much larger than the physical swap space, making it resistant to fragmentation.

Time was running out, so Song concluded with an idea to consider further in the future: swap migration. The list of swap clusters already works as a sort of least-recently-used (LRU) list, he said, so it could be used as a way of detecting folios that have been swapped out for a long time so that they could be moved to cheaper storage. Perhaps compaction could be performed at the same time. There was no time for the discussion of this idea, though.

Virtual swap devices

The concept of a virtual swap device returned in the next session, though, as the topic that Pham wanted to talk about. His original motivation, he said, was to separate the zswap compressed swap cache from the rest of the swap subsystem. There is heavy use of zswap at his employer (Meta), which is good, but the current design of the swap layer requires that there be a slot assigned in an on-disk swap device for every page that is stored in zswap. That disk space will never be used and is thus entirely wasted; he has seen hosts running with an entirely unused, 100GB swap file. In an environment where hosts can have terabytes of installed RAM, it just is not possible to attach (and waste) sufficiently large drives for swap files.

[Nhat Pham] Once a swap area has been added to the system, its size is fixed; the only way to increase swap space is to add another swap area. That is a slow operation, though, that a heavily loaded production system cannot afford, so Meta has to provision a suitably sized swap file ahead of time for each host type. There have been ongoing problems with machines running out of memory just because the unused swap device is "full". Pham appeared to be of the opinion that this was not an optimal way to run things.

The problem, he said, is the tight coupling between swapped pages and the backing store behind them. The page-table entry for a swapped page points to the physical location for its data. It is, he said, a design oriented toward the sort of two-tier swapping system that was common some time ago. Beyond capacity problems, this design leads to other challenges; if, for example, zswap rejects a page, its page-table entry has already been changed and recovery is difficult.

Solving this problem, he said, requires decoupling the various swap backends. A page stored in zswap should not take space in the other backends — unless that has been dictated by policy, as can happen with write-through caching. The system needs to be able to support multi-tier swapping; that would also help with the addition of new features, such as discontiguous swap-space allocation for large folios, or swapping in folios at different sizes than they were at swap-out time.

Thus, he is proposing the implementation of a virtualized swap subsystem, providing swap space that is independent of whichever backend any given page is stored to. Each swapped-out page is assigned a virtual slot; that is what is stored in the page-table entry. Virtual slots can then be resolved to a specific backing store as needed, where that backing store could be zswap, a disk drive, a cache like zram, or something else. Such a system would eliminate the wasted space problem and allow pages to be moved between backends without having to change all of the references to them. That would make it easy for zswap to write pages back to another device, for example; it would also make removing a swap device much easier than it is now.

He has a working prototype now, he said, that adds two new swap maps. There is a forward map that turns a virtual swap slot into a swap descriptor describing the actual placement of a page; it uses the XArray data structure, so lookups are lockless. The reverse map turns a physical swap slot into a virtual slot; that is useful to support cluster readahead or the movement of pages between backends. The metadata for a swapped page is placed in a dynamically allocated swap descriptor that is stored in the forward map.
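A toy model of those two maps might look like the following Python sketch, with plain dictionaries standing in for the kernel's XArray and all of the names invented for illustration:

```python
import itertools

class VirtualSwap:
    def __init__(self):
        self._next = itertools.count(1)
        self.forward = {}   # virtual slot -> swap descriptor (XArray stand-in)
        self.reverse = {}   # (backend, physical slot) -> virtual slot

    def swap_out(self, backend, phys_slot):
        # The page-table entry stores only the virtual slot.
        vslot = next(self._next)
        self.forward[vslot] = {'backend': backend, 'slot': phys_slot}
        self.reverse[(backend, phys_slot)] = vslot
        return vslot

    def migrate(self, vslot, new_backend, new_slot):
        # Move a page between backends; the virtual slot (and thus
        # every page-table entry referencing it) never changes.
        desc = self.forward[vslot]
        del self.reverse[(desc['backend'], desc['slot'])]
        desc['backend'], desc['slot'] = new_backend, new_slot
        self.reverse[(new_backend, new_slot)] = vslot

    def resolve(self, vslot):
        # Fault path: turn a virtual slot into a real backing location.
        desc = self.forward[vslot]
        return desc['backend'], desc['slot']
```

The migrate() operation is the point of the exercise: writing a page back from zswap to disk becomes an update to these two maps rather than a walk of every page table that references the page.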

The prototype is getting close to the point where he can post it, he said. It is a big change, though, and he is worried about how he will be able to land it. Johannes Weiner suggested that it could perhaps operate in parallel with the existing swap subsystem until the performance is shown to be at least as good.

At the end, a participant asked whether this system would be able to swap in a single page from a larger folio that has been swapped out; Pham said that he has considered that use in the design. Matthew Wilcox asked whether the virtual swap space would be used sparsely or densely; Pham, like Song, said that a large, sparse virtual space would be better for fragmentation avoidance.

Pham has posted the slides from this session.

Large-folio swap-in

The final episode of the swapping trilogy began with Arif reminding the crowd of the advantages of using large folios. They allow for better translation lookaside buffer (TLB) usage, reduce the number of page faults the system must handle, shorten LRU lists, and allow page-table entries to be manipulated in batches. Large folios often do not survive their encounter with the swap subsystem, though; they end up being split into base pages. Arif was there to talk about how the swap subsystem might be improved to better handle larger folios.

[Usama Arif] He mentioned work done by Ryan Roberts around a year ago to enable swapping out mTHPs without splitting them. That helped to avoid the cost of splitting these folios and avoid the fragmentation of memory. Work has been done to store large folios to zswap, and to be able to bring in large folios from zram. Compression of large folios in zram (which yields better compression) is being worked on, but has not been merged yet. One problem with compressing large folios, though, is that swapping in a single base page from that folio becomes difficult — the entire folio must be decompressed to make that base page accessible.

Arif's large-folio swap-in work builds on these previous efforts. Specifically, at swap-in time, it checks to see whether the swap count is one (meaning there is a single reference to the page) and whether the page is in zswap. If so, the swap cache is skipped entirely, and zswap will be checked to see if it holds a larger folio containing the page in question. If so, the folio will be swapped in one page at a time and assembled into a proper folio at the end.
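The decision described above can be summarized in a small sketch; the function name and return labels here are invented, and the real patch's logic is more involved:

```python
def choose_swapin_path(swap_count, in_zswap, zswap_folio_order):
    """Pick a swap-in strategy (illustrative only)."""
    # The fast path requires an exclusively owned entry held in zswap.
    if swap_count == 1 and in_zswap:
        if zswap_folio_order > 0:
            # Skip the swap cache; decompress the large folio one
            # base page at a time and assemble it at the end.
            return 'large-folio'
        # Skip the swap cache and bring in the single page directly.
        return 'single-page'
    # Shared or non-zswap entries take the ordinary swap-cache path.
    return 'swap-cache'
```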

This patch speeds 1MB folio swap-in by 36%, but also slows kernel builds. It resulted in an overall increase in zswap activity, with a lot of thrashing and folios being swapped in and out repeatedly. Thus, he concluded, perhaps large folios are not good for all workloads; would the group be happy with a change that yielded such different results for different workloads? Some workloads benefit; Android, for example, swaps out background tasks entirely and does not see this performance regression. Perhaps a control knob could be provided to tell the system whether to swap in large folios from zswap, but most users never change these knobs and would not see the benefit. Perhaps this behavior could be switched off automatically if the refault rate is seen as being too high.

Another alternative would be to combine large-folio swapping with large-folio compression; that might offset the regression with kernel builds, he said. But the inability to swap in base pages out of large folios could get in the way here.

As the session ran down, he wondered if there was a need for large-folio swap-in at all. Perhaps the system should continue to swap in base pages and let the khugepaged thread reassemble larger folios afterward. Wilcox said that there is a need to gather more statistics to understand what is really going on here. At this point, the topic was swapped out for something entirely different.

Index entries for this article
Kernel: Memory management/Swapping
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2025



New swap architecture

Posted Apr 7, 2025 15:14 UTC (Mon) by willy (subscriber, #9762) [Link] (2 responses)

I don't suppose you have a photo of Kairui's new architecture slide? IIRC it went from 13 boxes to 5. I almost fell out of my chair when I saw that. Let's get that merged soon!

New swap architecture

Posted Apr 7, 2025 15:18 UTC (Mon) by brauner (subscriber, #109349) [Link]

Is that slide here in the article meant to be the visual counterpart to memory-model.rst if the children cannot yet read?

"After" photo

Posted Apr 7, 2025 15:27 UTC (Mon) by corbet (editor, #1) [Link]

I did take a photo, yes, but obviously didn't work it into the article; perhaps I should have. In any case, it's now on the photo page for the curious.

Swap across multiple storage hardware

Posted Apr 7, 2025 18:09 UTC (Mon) by aeden (subscriber, #116597) [Link] (3 responses)

Recently I posed the following question: suppose a system has three distinct NVME SSDs. Should I create a swap partition on each device and `swapon` all of them (i.e. /dev/nvme), or should I create an mdadm RAID0 across those three swap devices and then `swapon` _that_ (i.e. /dev/md0)? It seems the current best strategy is the latter, since it looks like the swap devices are handled in a dumb way.

Swap across multiple storage hardware

Posted Apr 10, 2025 14:51 UTC (Thu) by Wol (subscriber, #4433) [Link] (2 responses)

To the best of my knowledge, (by default) swap is allocated a priority based on its order in fstab. So you should list devices in order of speed.

Or you can explicitly assign a priority to each, and the kernel will stripe (aka create a raid-0) across all devices of equal priority. So no, actually creating your own raid-0 is not a good idea.

The other thing is there's two types of raid-0 - there's the striping you're thinking of, and there's sequential (which iirc is what swap does by default) where raid will fill one device before moving on to the next.

The only time raid'ing swap makes sense is if you do a raid-1, so the system is hardened against swap-device-failure.

Cheers,
Wol

Swap across multiple storage hardware

Posted Apr 22, 2025 7:00 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

> To the best of my knowledge, (by default) swap is allocated a priority based on its order in fstab. So you should list devices in order of speed.

Is this just "mount order"? How else are manually-mounted swap devices or systemd `.mount` swap units prioritized? It feels strange (to me) that the kernel would read `/etc/fstab` and make such policy decisions based on it.

Swap across multiple storage hardware

Posted Apr 22, 2025 18:52 UTC (Tue) by edgewood (subscriber, #1123) [Link]

"swapon" on the command line accepts a priority option, and passes that priority to the kernel. "pri=" in fstab does the same thing, and systemd presumably also does the same. So the kernel tracks the priority, no matter where it came from originally.

The swapon(2) man page describes what the kernel does with priorities if there is more than one swap device.

CONFIG_SWAP_SMALL with refcount max 8?

Posted Apr 7, 2025 18:48 UTC (Mon) by k3ninho (subscriber, #50375) [Link] (1 responses)

>The whole thing is built around the one-byte-per-page design for the swap map, meaning that there is no space to store anything beyond a reference count and a pair of flags.

6 bits for the reference count? Did they collect stats to show they need all 6? Can we get by with 3 bits for the table, 2 flags and 3 bits for the refcount?

K3n.

CONFIG_SWAP_SMALL with refcount max 8?

Posted Apr 7, 2025 19:08 UTC (Mon) by willy (subscriber, #9762) [Link]

If there are more than 63 references to a swap page, another page full of bytes is allocated to store the high bits. After 4095, a third page is allocated, and so on.

Yes, this is stupid. No, it rarely happens.

Does it actually work?

Posted Apr 7, 2025 21:39 UTC (Mon) by Sesse (subscriber, #53779) [Link] (40 responses)

I've seen these “let's make swap faster” posts on LWN for many years now, but… it doesn't actually match the reality I'm seeing? On any kind of system, be it laptop or desktop or server, blazing-fast NVMe or HDD, once anything really starts swapping, the machine is unusable for me; I have to spend minutes frantically trying to get a killall through. (Thankfully the OOM killer is much better at finding the right target than it used to be!) It's perfectly fine for “slow” swapping (basically of initialized but not really used memory in a daemon or something), but for anything involving significant churn, I feel very disconnected from the reality in all these presentations.

Is it something that's going to be radically better in, say, five years?

Does it actually work?

Posted Apr 8, 2025 0:03 UTC (Tue) by pabs (subscriber, #43278) [Link] (1 responses)

That situation happens for me too, even with no swap, only zram. Presumably it would happen even without any kind of swap too, because the issue is probably due to constantly paging in binaries from disk as they run?

Probably there isn't any easy way to prevent this, but it could be mitigated if systemd could lock into RAM itself and everything needed for a rescue login console: the files needed for a VT switch, login, a root shell session, and the command-line and interactive tools needed to find and kill processes, so that when it happens I can instantly kill the offending processes using all my RAM. Debian has a memlockd package for this, but it doesn't seem to help any more unfortunately. Perhaps the systemd folks can come up with a better design and/or maintain it better.

Another thing that could help is being able to automatically shut down more systemd user/system services when memory pressure begins to build; IIRC this is implemented, but probably not widespread within service configs.

Does it actually work?

Posted Apr 8, 2025 10:02 UTC (Tue) by paulj (subscriber, #341) [Link]

> lock into RAM itself and everything needed for a rescue login console, the files needed for a VT switch, login, a root shell session

This is, I suspect, more or less what Solaris had decades ago to protect emergency access in the case of a swap storm. I don't know the details unfortunately - but it had some kind of mechanism.

Does it actually work?

Posted Apr 8, 2025 7:37 UTC (Tue) by intelfx (subscriber, #130118) [Link] (37 responses)

> It's perfectly fine for “slow” swapping (basically of initialized but not really used memory in a daemon or something), but for anything involving significant churn, I feel very disconnected from the reality in all these presentations.

A machine in a state of memory thrashing will never feel or be "usable". That's simply a non-goal.

Does it actually work?

Posted Apr 8, 2025 12:24 UTC (Tue) by kleptog (subscriber, #1183) [Link] (7 responses)

The current situation is that once something starts thrashing, everything becomes unusable until (hopefully) the OOM killer kicks in.

Is there a way to improve that situation other than simply disabling swap altogether? I mean, what's the point of swap if it doesn't solve the actual problem of allowing you to use your memory more effectively by swapping unused stuff to disk?

I dunno, maybe kicking in the OOM killer if it notices that stuff being swapped is less than a few seconds old, or something?

Does it actually work?

Posted Apr 8, 2025 13:59 UTC (Tue) by jhe (subscriber, #164815) [Link] (6 responses)

Disabling swap will force the thrashing onto the page cache and will not prevent the problem from occurring.

You could write a program that hooks into the pressure stall information and kills a pre-defined process (Firefox, probably) when memory pressure goes above 30 or 50. You are right that the kernel OOM killer should do that on its own.

I think the underlying cause is that some applications (Firefox, Electron) are too memory hungry to run on 4 or 8 GB.

Does it actually work?

Posted Apr 10, 2025 12:39 UTC (Thu) by khim (subscriber, #9252) [Link] (5 responses)

> I think the underlying cause is that some applications (Firefox, Electron) are too memory hungry to run on 4 or 8 GB.

Yet, somehow, it works fine with macOS or Windows. That's what makes things really weird: Linux claims that it has a very efficient and quick swap subsystem yet, in practice, it works like crap.

I can open 100 windows of Chrome and VSCode (or XCode) and a pile of other apps that use 50GB or 100GB of swap on a Windows or macOS system with 8GiB of RAM – and it would work. Yes, it wouldn't be flying, not with this kind of memory pressure… but the system would be usable.

Yet on Linux, with much smaller memory pressure, the system is totally unresponsive.

I always assumed that it was because no one cared or used swap on Linux, and that means it's useless… but that article claims that certain refactorings are not done because there would be regressions… regressions compared to what? To the non-usable swap of today? Does it even matter? Who uses swap on Linux, why and, most importantly, how?

Does it actually work?

Posted Apr 10, 2025 13:13 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

aiui, swapping and paging are two different beasts. Do MacOs and Windows use paging? Certainly I thought Windows had a page file.

That could quite possibly be it ...

Cheers,
Wol

Does it actually work?

Posted Apr 10, 2025 13:32 UTC (Thu) by khim (subscriber, #9252) [Link]

> Do MacOs and Windows use paging?

Windows and MacOS have essentially the same API as Linux. MacOS is POSIX and Windows can run Linux binaries (WSL1, WSL2 runs full Linux kernel).

So no, that's not it.

I'm pretty sure they both include tons of tweaks and hacks to ensure that even if the system is heavily thrashing it still stays responsive, but the end result is: if the system is heavily overloaded it definitely becomes “sluggish”, but nothing like Linux, where a switch from the graphical to the text console may take hours (literally) and then you couldn't log in on the text console because of timeouts (measured in minutes).

And if keeping system usable when it's swapping is explicit non-goal then I wonder why does anyone care to benchmark things that are not supposed to be used, anyway.

As I have said: I always assumed that Linux simply keeps swap in some vestigial form for nostalgia reasons (and no one cares to do anything to it) – and this is done to keep it working great in the “normal” situation (when swap is not used).

This may even be a sane stance if you recall that most Linux systems don't really use swap (but Android and ChromeOS use the swap code to implement zram… which would explain the additions to that code that the article discusses).

But when I read about some regressions and other such things… hey, that means that someone, somewhere still uses swap on Linux, for something.

The whole system has looked ever more mysterious the more I read the article: it certainly reads as if it comes from some parallel universe where something other than “zero swap but some zram” is used…

But… how and why? Why do they care about speed… and what kind of speed do they care about? Because for me swap on Linux always had one and only one speed: unusable. Whether it was 10x the “unusable” threshold or 100x the “unusable” threshold… I don't know: once things that should happen in seconds start taking hours, the system is no longer usable, and measuring the “speed of swap” doesn't make much sense after that point.

Of course for me the “speed of swap” is “the slowdown compared to the situation when there is enough memory and swap is not used” and, maybe, there are some other ways to measure the speed of swap, but… again: who measures that, how, and why?

That meta-mystery remained uncovered in the article…

Does it actually work?

Posted Apr 11, 2025 4:58 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> Yet, somehow, it works fine with macOS or Windows. That's what makes things really weird: Linux claims that it has very efficient and quick swap subsystem yet, in practice, it works like a crap.

Windows doesn't do overcommit (unless you REALLY try with MEM_RESERVE flag for VirtualAlloc). If a program allocates RAM, then there is a page in memory or in the swap file to back it. This naturally limits the amount of memory that has to be materialized out of thin air if there's a "bank run" on uncommitted RAM.

Does it actually work?

Posted Apr 11, 2025 8:21 UTC (Fri) by khim (subscriber, #9252) [Link] (1 responses)

That's an entirely different kettle of fish. You can disable overcommit on Linux, add 1TiB of swap to a 4GiB desktop… and it would still become absolutely unresponsive if you were to run two rustc processes that each try to use 20GiB.

And yes, miracles are impossible: if you would try to run some process that needs 1TiB of RAM while physically there are only 1GiB then even on macOS or Windows you would have to wait, most likely, till the heat death of universe.

But using 2x, 3x, 10x more RAM than your machine physically has? With a dozen apps? That's OK: the system becomes more and more sluggish, but it stays usable. On Linux, using even 2x more is normally a prelude to hitting “reset”.

Does it actually work?

Posted Apr 14, 2025 8:37 UTC (Mon) by taladar (subscriber, #68407) [Link]

It is really less of a factor how much you overcommit and more how much of the overcommitted memory is in active use. Once that goes beyond your physical memory you have a problem no matter what you do with swap.

Does it actually work?

Posted Apr 9, 2025 1:12 UTC (Wed) by pm215 (subscriber, #98099) [Link] (25 responses)

Is it possible to avoid the machine getting into that thrashing state in the first place, though? On a desktop system what I typically find is that all is fine almost all of the time; but occasionally some process goes rogue (most recently it was clangd running away with all the memory). When that happens everything goes unusably noninteractive until the OOM killer kicks in and kills (hopefully) the offending process. What I would prefer is that, instead of giving the offender so much of the available RAM that the whole system is unusable, the kernel dynamically restricts the resources the rogue can have *before* that point, so the system stays interactively usable, if perhaps a bit sluggish (and it's fine if the rogue makes no forward progress at all), until the OOM killer can do its thing...

Does it actually work?

Posted Apr 9, 2025 2:12 UTC (Wed) by intelfx (subscriber, #130118) [Link] (24 responses)

> Is it possible to avoid the machine getting into that thrashing state in the first place, though?

There are several solutions to this problem in Linux that fit various definitions/stages of "possible". We have cgroup memory limits (which can be set manually), PSI metrics and userspace OOM killers (that can be more aggressive and/or proactive than the in-kernel one, having a chance to terminate the offending processes before the system becomes unusable), and MGLRU working set protection (albeit it's not certain if it actually causes more good than harm, cf. https://github.com/zen-kernel/zen-kernel/commit/856c3874). Perhaps there are other things also.

However, none of it is related to *making swap faster*. The point I was making is that once you get to the state of thrashing, no amount of "making swap faster" is going to make a system usable while it's in that state.

Does it actually work?

Posted Apr 10, 2025 12:50 UTC (Thu) by khim (subscriber, #9252) [Link] (23 responses)

> The point I was making is that once you get to the state of thrashing, no amount of "making swap faster" is going to make a system usable while it's in that state.

But why? What makes that impossible precisely and exactly on Linux while perfectly possible on other OSes?

I have one friend who is unfortunate enough to be developing a Rust app on a MacBook Air. And when rustc starts using 20GiB of RAM (two processes, 10GiB each) on his 8GiB MBA… compilation takes ages. What takes 5 minutes for me takes 50 minutes for him. So… a 10x slowdown, sure.

Yet… Mac remains perfectly responsive (if sluggish). Chrome works, VSCode works, you can run Word and edit something in it.

Try the same thing on Linux… and everything freezes entirely.

Do you seriously want to imply that rustc has that special code somewhere in it that makes it freeze Linux and specifically and exclusively Linux only? While all other OSes chug along just fine?

Does it actually work?

Posted Apr 10, 2025 13:39 UTC (Thu) by intelfx (subscriber, #130118) [Link] (20 responses)

> But why? What makes that impossible precisely and exactly on Linux while perfectly possible on other OSes?

Because you are (either accidentally or deliberately) conflating swap performance and memory management strategy.

> Do you seriously want to imply that rustc has that special code somewhere in it that makes it freeze Linux and specifically and exclusively Linux only?

Why do *you* want to imply that? I never said anything that could be interpreted as what you just said. Please stop putting words in my mouth.

Does it actually work?

Posted Apr 10, 2025 13:51 UTC (Thu) by khim (subscriber, #9252) [Link] (12 responses)

> Because you are (either accidentally or deliberately) conflating swap performance and memory management strategy.

I guess. For me “swap performance” is a simple ratio: without swap, XXX takes N seconds; with swap, XXX takes K * N seconds. And K is the swap performance. There's the question of how to measure XXX more precisely, obviously, but K looked like an unambiguous measure…

Sure, when you want to improve that ratio you would need some kind of more nuanced measures and maybe even introduce many other, lower-level, more tractable and less integral, numbers… but if we talk about “swap performance”… what else can you ever mean?

Virtual memory is a “poor man's imitation of real memory” and swap is supposed to make it behave like the real thing… what other kind of measure could we be talking about, in the absence of any clarification?

I even explicitly wrote that I want to know what other natural way there is to measure “swap performance” – preferably with an explanation of who may want to know these numbers and why.

Does it actually work?

Posted Apr 10, 2025 13:54 UTC (Thu) by pizza (subscriber, #46) [Link] (11 responses)

> I guess. For me “swap performance” is a simple ratio: without swap XXX takes N seconds, with a swap XXX takes K * N seconds.

You're missing the point where "without swap" something in the system (possibly even what you're using) will be force-killed due to insufficient memory.

If you have enough memory for your workload, nothing will ever get paged out.

Does it actually work?

Posted Apr 10, 2025 14:11 UTC (Thu) by khim (subscriber, #9252) [Link]

> If you have enough memory for your workload, nothing will ever get paged out.

Bingo! That means that swap is, essentially, a cheap replacement for expensive extra RAM. And it has a theoretical, never-achievable goal: be as fast as said extra RAM (it's not really possible to make it faster, I suspect).

> You're missing the point where "without swap" something in the system (possibly even what you're using) will be force-killed due to insufficient memory.

Not if you buy extra RAM. Swap is, ultimately, a replacement for extra RAM, thus it's natural to compare its speed to the speed of a computer system with said extra RAM, isn't it?

After that you can look at the list of workloads that you use, attach some $$ measure to the time that you are losing, and then decide whether the purchase of extra RAM is worth it.

What other “swap speed” may you want to talk about and why? What would you do with these numbers and why?

Does it actually work?

Posted Apr 11, 2025 7:46 UTC (Fri) by taladar (subscriber, #68407) [Link] (9 responses)

I think you are missing the point that something being killed is fundamentally a lot better than the system becoming so slow that it will never finish its task before someone force reboots it to get a usable system back.

Does it actually work?

Posted Apr 11, 2025 8:05 UTC (Fri) by khim (subscriber, #9252) [Link] (8 responses)

In a cloud, for server applications, it's often better: fault-tolerance requirements mean that you have some system to cope with the death of any process (if only because you have to be ready to cope with sudden death of the hardware) – and restarting work on another computer is faster than using swap.

On a desktop, becoming sluggish and 10x slower is much better than killing any processes; you only want to kill them when they are truly rogue processes that wouldn't ever produce any useful results.

Satisfying both requirements is hard, but that doesn't give us any useful insight into what kind of “swap speed” one may need or want, if the end result is that the slowdown from the use of swap is so big that the only advice about tuning swap that one may ever get is “ensure that swap is never used”.

Does it actually work?

Posted Apr 11, 2025 14:20 UTC (Fri) by kleptog (subscriber, #1183) [Link] (7 responses)

> On a desktop becoming sluggish and 10x slow is much better then killing any processes, you only want to kill them when these are truly rogue processes that wouldn't ever produce any useful results.

Only 10x slower would be ok. In practice, something that's normally almost instant takes upwards of 20 minutes, and your mouse pointer doesn't even move anymore. At that point killing something like Firefox (which will restore its state anyway) is the right solution. Firefox is actually a good choice, because it has many subprocesses you can probably kill without the user even noticing.

I'd settle for: if moving the mouse pointer has no visual effect within 10 seconds, trigger the OOM killer. I just have no idea how to go about implementing that. Actually, I'd like it to happen as soon as mouse movement becomes choppy, because at that point all is lost already. At that point I'm only desperately trying to get the mouse to the close icon in the top-right, hoping that clicking it gets through before the system hangs entirely.
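Userspace OOM killers such as earlyoom and systemd-oomd implement roughly this idea, using the kernel's pressure-stall information (PSI) rather than mouse movement as the responsiveness signal. A crude sketch of that approach (the 40% threshold and 5-second poll interval are arbitrary choices, and it must run as root):

```shell
#!/bin/sh
# Poll the kernel's memory pressure-stall information; if the system
# spent too much of the last 10 seconds fully stalled on memory,
# ask the kernel to run its OOM killer (SysRq 'f').
while sleep 5; do
    stalled=$(awk '/^full/ { sub("avg10=", "", $2); print int($2) }' \
        /proc/pressure/memory)
    if [ "$stalled" -ge 40 ]; then    # hypothetical threshold: 40%
        echo f > /proc/sysrq-trigger
    fi
done
```

PSI has the advantage that it measures stalls directly, so it fires before the desktop is already unrecoverable.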

Does it actually work?

Posted Apr 14, 2025 8:34 UTC (Mon) by taladar (subscriber, #68407) [Link] (6 responses)

Another good indicator might be if loading files during login takes more than a second or two. I am not talking about a whole desktop's worth of applications starting up, just a terminal login. At that point, my only purpose after completing the login is almost certainly to kill the slow application.

Does it actually work?

Posted Apr 14, 2025 10:39 UTC (Mon) by khim (subscriber, #9252) [Link] (5 responses)

Login? What login? It wasn't uncommon for me to be unable to log in on a text console: after I typed my username and hit enter, I never got to the Password prompt, because getty killed that session for “inactivity” before that could happen.

I have no idea how much work /bin/login does, but I can't imagine why it would take 5 minutes or more on any system where swap works properly.

This stopped, of course, after I gave up and switched to using Linux in a container, with the browser outside of it.

Does it actually work?

Posted Apr 14, 2025 22:14 UTC (Mon) by mbunkus (subscriber, #87248) [Link] (3 responses)

I can confirm khim's description. It's spot on & not hyperbole in the least. I've found myself in the exact same situation several times over the last several years. When this happens, no login method works at all (local console, ssh, display manager if it happens to be running on that server) due to how incredibly long each. and. every. key. stroke. takes. Debugging in such a situation is completely impossible, let alone fixing anything. Even if I manage to log in, getting a process list takes multiple tens of minutes. The only sane thing to do is to hard reboot & eat the file system damage & loss of data.

And that sucks so much.

BTW, all of those cases were servers, not desktop systems.

Does it actually work?

Posted Apr 18, 2025 23:32 UTC (Fri) by Jandar (subscriber, #85683) [Link] (2 responses)

> The only sane thing to do is to hard reboot & eat the file system damage & loss of data.

I use Alt+SysRq+{s,u,s,b} in that order, with a few seconds between each. That's my way out of the semi-regular freezes apparently caused by swapping. Every time I have to do this I cross my fingers and pray to the filesystem gods to spare my home directory ;-)

Does it actually work?

Posted Apr 24, 2025 16:00 UTC (Thu) by paulj (subscriber, #341) [Link] (1 responses)

SysRq is disabled by default now on some common distros. The point where you remember this is the point where you need it but can no longer log in to change it, of course.

Does it actually work?

Posted Apr 24, 2025 16:45 UTC (Thu) by farnz (subscriber, #17727) [Link]

Note, too, that SysRq's functions can be individually configured. On Fedora, the default setting is to allow sync but nothing else; this is safe, since syncing is something the kernel will eventually do in the background anyway, but not very powerful.
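For reference, that per-function configuration is the kernel.sysrq bitmask; a minimal sketch of inspecting and loosening it (exact defaults vary by distribution):

```shell
# Show the current bitmask: 0 disables SysRq, 1 enables everything,
# larger values are an OR of per-function bits (16 allows sync).
cat /proc/sys/kernel/sysrq

# Enable all functions for this boot (requires root):
sysctl -w kernel.sysrq=1

# Make the setting persistent across reboots:
echo 'kernel.sysrq = 1' | tee /etc/sysctl.d/90-sysrq.conf
```

With the bitmask widened ahead of time, the Alt+SysRq+{s,u,b} escape hatch mentioned above still works when no login is possible.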

Does it actually work?

Posted Apr 24, 2025 16:08 UTC (Thu) by paulj (subscriber, #341) [Link]

Speaking of containers and browsers.... I used to run a system with a slow disk - plenty of RAM, but the disk was slow enough that it was easy to trigger Linux's suicidal, "off a cliff edge" swap behaviour if I ignored the initial signs of slow-down from the bloated browser. I took to running the browser using 'systemd-run --user --slice=browser.slice -p MemoryMax=xG <browser>'. So it runs in a cgroup that will limit its memory use, 'systemctl --user status' can show you exactly how much RAM the entire group of browser processes is using, and it's also easy to kill the entire group - without needing to first run ps to find the processes.
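A sketch of that setup, with a hypothetical slice name and a 4G limit (any browser command works in place of firefox):

```shell
# Start the browser in its own slice with a hard 4G memory cap.
systemd-run --user --slice=browser.slice -p MemoryMax=4G firefox

# Inspect memory use of the whole group of browser processes.
systemctl --user status browser.slice

# Kill the entire group in one go, no ps needed.
systemctl --user stop browser.slice
```

Once the cap is hit, the browser's own pages get reclaimed or it gets OOM-killed inside its cgroup, rather than dragging the whole machine into thrashing.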

Does it actually work?

Posted Apr 11, 2025 10:16 UTC (Fri) by Wol (subscriber, #4433) [Link] (6 responses)

> > But why? What makes that impossible precisely and exactly on Linux while perfectly possible on other OSes?

> Because you are (either accidentally or deliberately) conflating swap performance and memory management strategy.

And I may well be out of date, but people seem to be conflating swapping and paging. Am I right in that a *swap* system *swaps processes* in and out of memory, while a *paging* system *swaps pages* in and out of memory? Which is why systems that page don't become as slow as systems that swap.

So a swapping system, faced with a large executable, will pull in the entire executable in order to run it. While a paging system will map the executable, pull in the first page to execute it, and only fault in further pages as required.

Same with working memory - a paging system will file-back the memory, and presumably flush dirty pages, and flag hot pages. While a swapping system will dump the entire memory, and read it all back, as required.

So a swapping system will be (relatively) slow because it's forever flushing and retrieving entire processes. A paging system on the other hand, will be dropping clean cold pages if required, so any program that is actively running will normally keep its hot paths in RAM without being dropped.

Cheers,
Wol

Does it actually work?

Posted Apr 11, 2025 11:48 UTC (Fri) by daroc (editor, #160859) [Link] (1 responses)

I'm not intimately familiar with Linux's swap system, but I think the distinction you're making is not one the swap system makes. On my laptop right now I have ~500MB in swap, which appears to be small sections of memory from a variety of different processes.

Does it actually work?

Posted Apr 11, 2025 12:51 UTC (Fri) by khim (subscriber, #9252) [Link]

If I remember correctly, μClinux can use “a swapping system” (just like ancient Unix versions such as Xenix for the 8086 had to do, because they worked on systems without any MMU, even strange and primitive ones like the 80286's).

Mainstream Linux never had support for “a swapping system”, because it never supported any hardware where that even makes sense.

MacOS (Classic) and Windows developed very elaborate and clever systems with handles and movable pages, and also never did that.

That's why the distinction that Wol talks about is more-or-less forgotten in the world of modern software.

Does it actually work?

Posted Apr 11, 2025 12:44 UTC (Fri) by khim (subscriber, #9252) [Link] (3 responses)

> And I may well be out of date, but people seem to be conflating swapping and paging.

Well… there's a reason for that. Indeed, you are a bit out of date. About 30-40 years out of date, give or take.

> Am I right, in that a *swap* system *swaps processes* in and out of memory. While a *paging* system *swaps pages* in and out of memory. Which is why systems that page don't become as slow as systems that swap.

That's true, but the problem here is that the last system swapping out whole processes that someone may have had a chance to use was probably the DOS Shell that was included in MS-DOS 4 (that's year 1988), moved to the supplement disk in MS-DOS 6 (that's year 1993), and excluded from Windows 95 (and later versions). DR-DOS also had a similar thing (it was born out of Concurrent CP/M-86, which used swapping, after all), but I don't remember the dates.

Linux, Windows, and macOS never used this strategy, and thus it's very hard for anyone to even recall that it ever existed.

> So a swapping system will be (relatively) slow because it's forever flushing and retrieving entire processes. A paging system on the other hand, will be dropping clean cold pages if required, so any program that is actively running will normally keep its hot paths in RAM without being dropped.

You are absolutely correct, but in a world where “a swapping system” is “something that briefly existed a very long time ago and for a relatively short time”… most people wouldn't even understand what you are talking about if you started discussing that difference. Paging needs support in hardware to be efficient, but after Intel made said support mandatory with the introduction of the 80386 in 1985, no one was even thinking about using “a swapping system” for anything anywhere.

From what I understand, swapping was popular before the introduction of the IBM/370 in 1970, but I don't know the history of computers well enough to say how exactly “swapping systems” and “paging systems” competed 50 years ago. “Swapping systems” were already something you read about in a history book rather than used for work when Linus started work on Linux; thus, of course, in the context of Linux they don't matter at all.

Does it actually work?

Posted Apr 11, 2025 14:51 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> From what I understand swapping was popular before introduction of IBM/370 in 1970, but I don't know the history of computers well enough to say how exactly “a swapping systems” and “a paging system” competed 50 years ago. “A swapping systems” were already a stuff that you read in a history book and not use for work, when Linus started work on Linux, thus, of course, in the context of Linux they don't matter at all.

I think you might be surprised then!

The original linux "swapping system" is pretty much an exact copy of the original Unix swapping system, which probably does date from your aforementioned 1970!

Do you remember back when Linus made a rather controversial change to Linux swap, 2.4.10 or 2.6.10 I think - I remember it was roughly a 10 - that gave people running vanilla Linux a massive shock? Do you remember the old saw that "swap must be at least twice RAM", which most people - even back then - thought was an old wives' tale?

Because Linus ripped out all the optimisation code, and that requirement came back! If you had a swap file, and the system tried to touch just ONE BYTE of swap, and it didn't have at least twice RAM available, it locked up. HARD. (Most people didn't notice, because the distros put the code back ...)

But that was sparked by Andrea Arcangeli and ?Rik van Riel? squabbling over a new approach to memory management. I don't know the end of that story, other than that the "swap system" was replaced with something much better, and I don't know what the new system does.

I thought all this was documented by LWN, and I've tried to find it on several occasions without success. It could have been Zak's Kernel News, which was quite big back then ...

Cheers,
Wol

Does it actually work?

Posted Apr 11, 2025 15:27 UTC (Fri) by khim (subscriber, #9252) [Link] (1 responses)

> The original linux "swapping system" is pretty much an exact copy of the original Unix swapping system, which probably does date from your aforementioned 1970!

Maybe in some very early versions, I'm not familiar with what was there before Linux 1.2.something

> Do you remember back when Linus made a rather controversial change to linux swap, 2.4.10 or 2.6.10 I think - I remember it was roughly a 10 - that gave people running Vanilla Linux a massive shock?

Definitely not 2.4.10 and not 2.6.10. Maybe 1.1.10? Very unlikely even back then, actually.

I know that with Linux 2.2 it was already possible to build MySQL by adding temporary swap on my puny machine with 16MiB RAM. I.e. Linux behaved exactly like macOS behaves today, and how tutorials normally sell the “virtual memory” idea: if your system doesn't have enough RAM… just add more swap and get more virtual RAM!

That's fundamentally impossible with the “swapping system” approach: if you are swapping out the whole process and there is not enough RAM to even put a single copy in memory, then that's it, you can't continue…

I think what you remember is a different quirk that early versions of Linux had: in early versions of Linux each page had a static place in swap (but could be loaded into memory in different places), and that meant that the total amount of virtual memory was equal to the size of swap, not to the size of swap+RAM. That was somewhat irritating, sure, but I don't remember any version of Linux (except μClinux) that swapped out whole processes rather than individual pages.

And yes, this is exactly what puzzles me, too: I know that it was possible to use Linux with swap as a “poor man's RAM extension” a quarter century ago, and it's essentially not possible to use it like that today. But what changed, and when? Was the change in Linux or in something in userspace? I actually suspect it's a change in userspace: the old twm/fvwm/mwm setup, designed to support a dozen hardware X terminals over a slow network, was very frugal in its copying of data, and thus worked even on Linux with a pretty inefficient swap implementation. The “new way”, with Compiz and its descendants, now copies megabytes (or maybe gigabytes?) of data around every time it needs to redraw a single pixel… and without the pile of hacks that macOS or Windows employ, that means that using a modern Linux UI without enough actual, real, physical RAM is, pretty much, impossible.

> I thought all this was documented by LWN, and I've tried to find it on several occasions without success. It could have been Zak's Kernel News, which was quite big back then ...

It would be great to find out, because I don't remember any version of Linux that had “a swapping system”, but I sure remember the time when Linux used swap inefficiently… yet it was paging, just not very efficient paging… you were able to run an app that needed more memory than the system had physically available on all versions of Linux that I ever used.

Does it actually work?

Posted Apr 11, 2025 15:49 UTC (Fri) by corbet (editor, #1) [Link]

Wol is thinking about the big MM switch in 2.4.10 - see this page.

Linux has never had full-process swapping, though. What is called "swap" is paging for anonymous memory.

Does it actually work?

Posted Apr 14, 2025 10:31 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

IME, Windows does better at not paging out part of a process's working set in order to page in something else that the process needs, and as a consequence it is better at getting useful work from a process when it starts to thrash. It then combines this with longer timeslices, so that once you're doing useful work, you get longer to finish it before you're swapped out.

Does it actually work?

Posted Apr 14, 2025 10:52 UTC (Mon) by khim (subscriber, #9252) [Link]

It definitely plays tricks with pushing processes to swap and back.

One way to experience the “last straw breaks the camel's back” performance issue that plagues Linux on Windows is to overfill both memory and swap, too (you can only do that with a “permanent”, fixed-size swap… which is not the default Windows setup).

When Windows can no longer “push out” some “victim” processes to swap, then it starts behaving like Linux and, eventually, freezes too.

But I have found that adding more swap to Linux doesn't help: it may take 20 minutes to switch to a text console, and then you cannot log in… even if there is plenty of swap left unused.

Does it actually work?

Posted Apr 10, 2025 12:28 UTC (Thu) by khim (subscriber, #9252) [Link] (2 responses)

Why is it a non-goal, BTW? I know that Yandex used FreeBSD for years because it could save money that way: with two sets of daemons on the same machine and a 99%/1% split, everything behaved perfectly – the 99% side was perfectly responsive and could handle the traffic without being swapped out, while the 1% side would get a small percentage of that same traffic while responding slowly, because it would be thrashing like hell (but the responses would go to a log and never to a human, thus it was OK).

Linux could never pull tricks like these… but I have no idea why. Because they were declared “non-goals” by someone?

P.S. Eventually they had to adopt a different way of doing experiments and switched to Linux… but mostly because it was becoming harder and harder to keep FreeBSD going when most hardware favors Linux.

Does it actually work?

Posted Apr 10, 2025 13:32 UTC (Thu) by intelfx (subscriber, #130118) [Link] (1 responses)

It's a non-goal for the swap subsystem (in the sense that no amount of making swap faster will make it as fast as RAM).

It may or may not be a goal for the overall memory-management strategy; limiting thrashing to some parts of the system should be somewhat doable with a suitable application of cgroups, or perhaps other, more auto-magic technologies (like the aforementioned MGLRU working-set protection). But ultimately, all that does is constrain the unusability to those parts of the system that *are* thrashing (and likely degrade the overall throughput). Making *those* parts of the system usable is still a non-goal, because it is fundamentally unachievable.
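A sketch of that cgroup approach with the v2 interface (group names and limits here are arbitrary): memory.low shields an interactive group from reclaim, while memory.high throttles a batch group so its thrashing stays contained:

```shell
# Assumes a cgroup v2 hierarchy at /sys/fs/cgroup with the memory
# controller enabled in the root's cgroup.subtree_control.
mkdir /sys/fs/cgroup/interactive /sys/fs/cgroup/batch

# Best-effort protection: avoid reclaiming the first 4G of this group.
echo 4G > /sys/fs/cgroup/interactive/memory.low

# Soft cap: above 2G the batch group is throttled and reclaimed first.
echo 2G > /sys/fs/cgroup/batch/memory.high

# Move the current shell (and its children) into the protected group.
echo $$ > /sys/fs/cgroup/interactive/cgroup.procs
```

This roughly matches the FreeBSD 99%/1% scenario khim describes: the protected group stays responsive while the capped group pays the thrashing cost.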

Does it actually work?

Posted Apr 10, 2025 13:39 UTC (Thu) by khim (subscriber, #9252) [Link]

> Making *those* parts of the system usable is still a non-goal because it is fundamentally unachievable.

Why not? MacOS and Windows achieve that… somehow.

Sure, when it takes 50 minutes to compile something that can be compiled in 5 minutes without thrashing… it's not good, absolutely – but it's still usable.

When the same thing takes days (not sure how many days, though, I stopped the experiment after two)… it's not usable.

You may say that it's not worth it (Apple did all that work, apparently, to claim that an MBA with just 8GiB of RAM is as good as any other laptop with 16GiB of RAM), but that's something radically different from “fundamentally unachievable”.

Might it be possible

Posted Apr 10, 2025 15:51 UTC (Thu) by bferrell (subscriber, #624) [Link]

The desired workloads are conceptually broken?

It might be that the wrong problem is being solved.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds