The next steps for swap
Chen has been working on swapping performance for a while; the first set of swap scalability patches has already been merged. His next priority is improving swap readahead performance. This mechanism, which tries to read pages from swap ahead of an anticipated need for them, currently reads pages back in the order in which they were swapped out. This, he noted, is not necessarily the best order and, with mixed access patterns, performance can be poor.
The recently submitted VMA-based swap readahead patches try to improve readahead performance by watching the swap-in behavior of each virtual memory area (VMA). If it appears that memory is being accessed in a serial fashion, the readahead window is increased in the hope of bringing in more pages before they are needed. For random patterns, instead, readahead has little value, so the window is reduced.
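The general shape of that heuristic can be sketched in a few lines of user-space C. The structure, the doubling policy, and the window cap below are assumptions made for illustration; they are not the code from the patches.

    /*
     * Minimal sketch of the per-VMA readahead-window heuristic described
     * above. All names here (vma_readahead_state, swapin_window) are
     * hypothetical and do not correspond to the actual kernel patches.
     */
    #include <stdio.h>

    #define PAGE_SHIFT   12
    #define MAX_WINDOW   64        /* assumed cap on the readahead window */

    struct vma_readahead_state {
        unsigned long last_pfn;    /* page of the previous swap-in fault */
        unsigned int  window;      /* current readahead window, in pages */
    };

    /* Return how many pages to read ahead for a fault at address addr. */
    static unsigned int swapin_window(struct vma_readahead_state *st,
                                      unsigned long addr)
    {
        unsigned long pfn = addr >> PAGE_SHIFT;

        if (pfn == st->last_pfn + 1) {
            /* Looks sequential: grow the window, up to a cap. */
            if (st->window < MAX_WINDOW)
                st->window *= 2;
        } else {
            /* Looks random: readahead buys little, so shrink back. */
            st->window = 1;
        }
        st->last_pfn = pfn;
        return st->window;
    }

    int main(void)
    {
        struct vma_readahead_state st = { .last_pfn = 0, .window = 1 };
        unsigned long faults[] = { 0x1000, 0x2000, 0x3000, 0x9000, 0xa000 };

        for (unsigned i = 0; i < sizeof(faults) / sizeof(faults[0]); i++)
            printf("fault at %#lx -> window %u\n",
                   faults[i], swapin_window(&st, faults[i]));
        return 0;
    }

In the kernel, of course, this state would live with the VMA, and the window would determine how many swap slots are read in around the faulting page.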
Rik van Riel noted that the current readahead algorithm was designed for rotational media and asked how well the VMA-based mechanism works on such devices. Chen, with visible embarrassment, said that this hasn't been tried. Van Riel added that, with rotational devices, a group of adjacent blocks can be read as quickly as a single block can, so it makes sense to speculatively read extra data. The same is not true for solid-state storage. So, he suggested, it might make sense for the readahead code to see which type of device is hosting the swap space and change its behavior accordingly.
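One way to act on that suggestion, sketched here in user-space C, is to consult the block layer's per-device "rotational" flag (exported in sysfs) and cap the readahead window accordingly. The window sizes below are assumptions for illustration only.

    /*
     * Pick a swap-readahead cap based on whether the backing device is
     * rotational, as read from /sys/block/<dev>/queue/rotational.
     * The window sizes are illustrative, not values from any patch.
     */
    #include <stdio.h>

    #define WINDOW_ROTATIONAL 32   /* adjacent blocks come along almost for free */
    #define WINDOW_SSD         4   /* speculation buys much less on flash */

    /* Returns 1 for rotational, 0 for non-rotational, -1 on error. */
    static int device_is_rotational(const char *dev)
    {
        char path[256];
        int rot = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", dev);
        f = fopen(path, "r");
        if (!f)
            return -1;
        if (fscanf(f, "%d", &rot) != 1)
            rot = -1;
        fclose(f);
        return rot;
    }

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "sda";
        int rot = device_is_rotational(dev);

        if (rot < 0) {
            fprintf(stderr, "cannot read rotational flag for %s\n", dev);
            return 1;
        }
        printf("%s: %s, suggested max readahead window: %d pages\n",
               dev, rot ? "rotational" : "non-rotational",
               rot ? WINDOW_ROTATIONAL : WINDOW_SSD);
        return 0;
    }

Inside the kernel, the same information is available directly from the request queue of the device backing the swap area, so no sysfs round trip would be needed.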
Matthew Wilcox, instead, said that the real problem might be at swap-out time. Pages are swapped based on their position on the least-recently-used (LRU) lists, which may not reflect the order in which they will be needed again. He said that, perhaps, writes to swap could be buffered; swapped pages would go into a "victim cache" and sorted before being written to storage. The value of this approach wasn't clear to everybody in the room, though, given that access patterns can change over time.
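A toy sketch of that idea in C: buffer outgoing pages, sort the batch by some key (virtual address is used here purely as an assumed stand-in), and only then issue the writes, so that pages likely to be wanted together end up adjacent in the swap area.

    /*
     * Toy model of the "victim cache" suggestion: batch pages selected for
     * swap-out, sort the batch, then write it. The batch size and the sort
     * key are assumptions for illustration only.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define BATCH 8

    static unsigned long victim_cache[BATCH];
    static int nr_victims;

    static int cmp_addr(const void *a, const void *b)
    {
        unsigned long x = *(const unsigned long *)a;
        unsigned long y = *(const unsigned long *)b;
        return (x > y) - (x < y);
    }

    /* Flush the batch to swap in sorted order (here: just print it). */
    static void flush_victims(void)
    {
        qsort(victim_cache, nr_victims, sizeof(victim_cache[0]), cmp_addr);
        for (int i = 0; i < nr_victims; i++)
            printf("writing page at %#lx to swap\n", victim_cache[i]);
        nr_victims = 0;
    }

    /* Called instead of writing the page immediately at swap-out time. */
    static void queue_victim(unsigned long addr)
    {
        victim_cache[nr_victims++] = addr;
        if (nr_victims == BATCH)
            flush_victims();
    }

    int main(void)
    {
        /* LRU order rarely matches address order; the cache re-sorts it. */
        unsigned long lru_order[] = { 0x7000, 0x2000, 0x5000, 0x1000,
                                      0x8000, 0x3000, 0x6000, 0x4000 };
        for (int i = 0; i < BATCH; i++)
            queue_victim(lru_order[i]);
        return 0;
    }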
The next subject was the swapping of transparent huge pages. Currently, the first step is to split those pages into their component single pages, then to write those to swap individually — not the most efficient way to go about things. Chen and company would like to improve this behavior in a few steps, the first of which is to delay the splitting of the page until space has been allocated in the swap area. That should result in the allocation of a single cluster of pages for the entire huge page, at which point the whole thing can be written in a single operation. Patches implementing this change have been submitted; they result in a 14% swap-out performance increase.
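A toy model of that first step is shown below, with a trivially simple slot allocator standing in for the real swap allocator; the names, sizes, and fallback policy are assumptions for illustration.

    /*
     * Toy model: try to grab a contiguous cluster of swap slots for a whole
     * huge page before splitting it; fall back to splitting if that fails.
     */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define SWAP_SLOTS 1024
    #define HPAGE_NR   512          /* 2MB huge page / 4KB base pages */

    static bool slot_used[SWAP_SLOTS];

    /* Find n contiguous free slots; return the first index or -1. */
    static int alloc_cluster(int n)
    {
        for (int start = 0; start + n <= SWAP_SLOTS; start++) {
            int len = 0;
            while (len < n && !slot_used[start + len])
                len++;
            if (len == n) {
                memset(&slot_used[start], true, n);
                return start;
            }
            start += len;           /* skip past the used slot we hit */
        }
        return -1;
    }

    static void swap_out_huge_page(void)
    {
        int slot = alloc_cluster(HPAGE_NR);

        if (slot >= 0) {
            /* One contiguous write for the whole huge page. */
            printf("huge page -> slots %d..%d in a single I/O\n",
                   slot, slot + HPAGE_NR - 1);
        } else {
            /* Fall back: split and swap the base pages out one by one. */
            printf("no cluster available: splitting and swapping %d pages\n",
                   HPAGE_NR);
        }
    }

    int main(void)
    {
        swap_out_huge_page();       /* first huge page gets a cluster */
        swap_out_huge_page();       /* so does the second */
        swap_out_huge_page();       /* the toy pool is now full: splits */
        return 0;
    }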
The next step is to delay the splitting of huge pages further, until the swap-out operation is finished. Those patches are in development; benchmarking shows that they result in a 37% improvement in swap-out performance.
Finally, it would be nice to be able to swap huge pages directly back in. This idea needs more thought, he said. It is not always a performance win; if the application only needs a couple of small pages of data, there is no point in bringing in the whole huge page. One possible heuristic could be to only swap in huge pages for memory regions marked with MADV_HUGEPAGE or which have a large readahead window.
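Such a heuristic is easy enough to sketch; the flag name, the threshold, and the structure below are placeholders for illustration rather than anything from a real patch.

    /*
     * Sketch of the possible huge-page swap-in heuristic mentioned above.
     * VM_HUGEPAGE stands in for the per-VMA effect of madvise(MADV_HUGEPAGE);
     * the window threshold is an arbitrary assumed value.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define VM_HUGEPAGE        0x01   /* region was marked MADV_HUGEPAGE */
    #define HUGE_SWAPIN_WINDOW 64     /* "large" readahead window, assumed */

    struct region {
        unsigned int flags;            /* per-VMA flags */
        unsigned int readahead_window; /* current swap readahead window */
    };

    static bool swap_in_as_huge_page(const struct region *r)
    {
        if (r->flags & VM_HUGEPAGE)
            return true;               /* the application asked for THP */
        /* Otherwise only bother if access looks sequential enough. */
        return r->readahead_window >= HUGE_SWAPIN_WINDOW;
    }

    int main(void)
    {
        struct region marked    = { .flags = VM_HUGEPAGE, .readahead_window = 4 };
        struct region streaming = { .flags = 0, .readahead_window = 128 };
        struct region random    = { .flags = 0, .readahead_window = 2 };

        printf("marked:    %s\n", swap_in_as_huge_page(&marked) ? "huge" : "small");
        printf("streaming: %s\n", swap_in_as_huge_page(&streaming) ? "huge" : "small");
        printf("random:    %s\n", swap_in_as_huge_page(&random) ? "huge" : "small");
        return 0;
    }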
There was a bit of discussion on how to justify the inclusion of these patches once they are ready. The best motivator is good benchmark results. It was suggested that Linus Torvalds is less likely to block the patches if they do not slow down kernel builds. Michal Hocko said that the patches were interesting, but that they were optimizing a rare event; the current code assumes that we don't ever want to swap. But Johannes Weiner said that the swap-out changes, at least, make a lot of sense; batching operations by keeping huge pages together will speed things up.
The next topic was the use of the DAX direct-access mechanism with swapped data. If swapping is done to a persistent memory array, the data can still be accessed directly without the need to read it back into RAM. There is "an almost-working prototype" that does this, Chen said. The hard part is deciding when it makes sense to bring pages back into RAM; memory that will be frequently accessed, especially if the accesses are writes, is better read back in.
Wilcox said that the decision really depends on the performance difference between dynamic RAM and persistent memory on the system in question; in some cases, the right answer might be "never". Sometimes, for example, the "persistent-memory array" is actually dynamic RAM hosted in a hypervisor. There was some talk of using the system's performance-monitoring unit (PMU) to track page accesses, but that idea didn't get far. Developers prefer that the kernel not take over the PMU, the runtime cost is high, and the results are not always all that useful.
After some discussion, the conclusion reached was that the kernel should just bring a random set of pages back into RAM occasionally. With luck, the frequently used pages will stay there, while the rest will age back out to swap.
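A sketch of what that conclusion might look like in practice follows; the sampling rate and the helper below are entirely hypothetical.

    /*
     * Hypothetical sketch of random promotion: every so often, pick a few
     * pages currently accessed directly from persistent memory and copy them
     * back into RAM. Hot pages will tend to stay resident; the rest will age
     * back out to swap through the normal reclaim path.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define PM_PAGES       1024   /* pages currently mapped straight from pmem */
    #define PROMOTE_SAMPLE 16     /* pages promoted per pass, an assumption */

    /* Hypothetical stand-in for copying one page from pmem back into RAM. */
    static void promote_page(unsigned long idx)
    {
        printf("promoting pmem page %lu into RAM\n", idx);
    }

    static void promote_random_sample(void)
    {
        for (int i = 0; i < PROMOTE_SAMPLE; i++)
            promote_page((unsigned long)(rand() % PM_PAGES));
    }

    int main(void)
    {
        srand(42);                /* deterministic sample for the example */
        promote_random_sample();  /* would be run periodically in a real system */
        return 0;
    }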
Finally, there was a brief discussion of further optimizing swap-device locking, which still sees significant contention even after the recent scalability improvements; there is some interest in using lock elision to reduce that contention.
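For reference, lock elision in this context generally means running the critical section as a hardware transaction and only falling back to the real lock on conflict. The sketch below shows the usual pattern with Intel's RTM intrinsics (compile with -mrtm, run on TSX-capable hardware); it illustrates the technique only and has nothing to do with the actual swap-locking code.

    /*
     * Minimal lock-elision sketch over a spinlock-style lock using Intel RTM.
     * Illustration of the general technique, not kernel code.
     */
    #include <immintrin.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int lock;        /* 0 = free, 1 = held */
    static long protected_counter;

    static void elided_lock_section(void)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (atomic_load(&lock) != 0)
                _xabort(0xff);     /* lock really held: abort the transaction */
            protected_counter++;   /* critical section runs transactionally */
            _xend();
            return;
        }
        /* Transaction aborted or unsupported: fall back to the real lock. */
        while (atomic_exchange(&lock, 1))
            ;                      /* spin */
        protected_counter++;
        atomic_store(&lock, 0);
    }

    int main(void)
    {
        for (int i = 0; i < 1000; i++)
            elided_lock_section();
        printf("counter = %ld\n", protected_counter);
        return 0;
    }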
Index entries for this article:
Kernel: Memory management/Swapping
Conference: Storage, Filesystem, and Memory-Management Summit/2017
Posted Mar 30, 2017 15:56 UTC (Thu) by Wol (subscriber, #4433)
Okay, I might be a bit unusual but I'm sure there are plenty of systems like mine ...
I can't add any memory (all 4 slots have 4GB chips - the largest the system will take), and I have masses of swap configured because I occasionally need it. I make extensive use of tmpfs, and when that floods ("emerge world") I usually spill into swap.
Cheers,
Wol
Posted Mar 30, 2017 17:45 UTC (Thu) by excors (subscriber, #95769)
If the problem is that tmpfs has better performance in practice, why can't the filesystem cache be improved to match that performance?
Posted Mar 31, 2017 10:44 UTC (Fri) by lpremoli (guest, #94065)
IMHO the point is that tmpfs is available from the initial boot phase and, as such, it is a full FS lying on a RAM device (which is available from the very earliest boot phase). The opposite, i.e. an FS lying on disk and cached in RAM, would be very difficult to implement and would require a disk which is not yet available during early boot.
Posted Mar 31, 2017 16:43 UTC (Fri) by Wol (subscriber, #4433)
Basically, I'm using tmpfs because the data is exactly that - temporary - and a reboot will just dump it. As a desktop system, reboots are common :-)
(And with a disk-based filesystem, the chance of data being flushed to disk and then deleted is not negligible, which is clearly wasteful :-)
Cheers,
Wol