
Large folios, swap, and FS-Cache

By Jake Edge
July 24, 2024

LSFMM+BPF

David Howells wanted to discuss swap handling in light of multi-page folios in a combined storage, filesystem, and memory-management session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. Swapping has always been done with a one-to-one mapping of memory pages to swap slots, he said, but swapping multi-page folios breaks that assumption. He wondered if it would make sense to use filesystem techniques to track swapped-out folios.

Traditional swap is divided into page-sized segments, Howells began, and, up until recently, memory was divided the same way. When the kernel wanted to swap something out, it found a free segment and put the contents of the page there; the reverse of that was done when it needed to swap the page back in. But, with folios, there might be two pages, 2MB of pages, or even 2GB of pages joined together that need to be swapped out as a unit.
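A concrete way to picture the one-to-one model is how a swap entry packs a device identifier and a slot offset into a single word, much as the kernel's swp_entry_t does. The standalone C sketch below is an illustration with invented field widths, not the kernel's actual code:

    /*
     * Simplified model of the traditional one-page-per-slot scheme: a
     * swap entry encodes which swap device ("type") and which slot
     * ("offset") holds the page. The field widths are invented for
     * this example.
     */
    #include <assert.h>
    #include <stdint.h>

    #define SWP_TYPE_BITS 5  /* bits reserved for the device number */

    typedef struct { uint64_t val; } swp_entry_t;

    static swp_entry_t swp_entry(unsigned int type, uint64_t offset)
    {
        swp_entry_t e = {
            .val = ((uint64_t)type << (64 - SWP_TYPE_BITS)) | offset,
        };
        return e;
    }

    static unsigned int swp_type(swp_entry_t e)
    {
        return e.val >> (64 - SWP_TYPE_BITS);
    }

    static uint64_t swp_offset(swp_entry_t e)
    {
        return e.val & ((1ULL << (64 - SWP_TYPE_BITS)) - 1);
    }

    int main(void)
    {
        /* One page, one slot; a 2MB folio has no single home here. */
        swp_entry_t e = swp_entry(1, 42);
        assert(swp_type(e) == 1 && swp_offset(e) == 42);
        return 0;
    }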


The locations where those pages get stored on disk need to be recorded; there may not be a contiguous region in the swap space to hold the full folio due to fragmentation. To him, it looks like reinventing a filesystem; "we don't want to do that, we've already got a whole bunch of filesystems". He wondered if there could be a filesystem mechanism added where a block of data, such as a 2MB folio, can be stored with a specific key. That key can be used to retrieve the block or to cancel it, if it is no longer of interest.
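As a purely hypothetical C sketch, such a keyed interface might have the following shape; none of these names exist in the kernel, they only show the store/retrieve/cancel operations Howells described:

    /*
     * Hypothetical keyed block-store API: store a folio-sized blob
     * under a key, read it back, or cancel it. The backing filesystem
     * is free to scatter the data across whatever extents it has,
     * hiding any fragmentation from the caller.
     */
    #include <stddef.h>

    struct blkstore_key {
        unsigned long cookie;  /* identifies the owning object */
        unsigned long index;   /* folio offset within that object */
    };

    /* Store @len bytes (a whole folio) under @key. */
    int blkstore_write(const struct blkstore_key *key,
                       const void *data, size_t len);

    /* Retrieve a previously stored block (swap-in, or a cache hit). */
    int blkstore_read(const struct blkstore_key *key,
                      void *data, size_t len);

    /* Cancel a block that is no longer of interest. */
    int blkstore_cancel(const struct blkstore_key *key);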

He maintains FS-Cache, a local disk cache for network filesystems, which has similar needs; it stores blocks from remote files in a local on-disk cache. So there are two separate caches with many of the same properties; does it make sense to combine them? They could share the same pool of reserved disk space, and FS-Cache could fill up the entire space if it needs to; swap can always get the space that it needs because FS-Cache is a true cache, so any of its entries can simply be removed.

Howells noted that Chris Li's session on a swap abstraction, which would have a different take on changes to the swap layer, was coming up in the next summit slot. Howells asked for attendees' thoughts on the idea of merging FS-Cache and swap. Jeff Layton was concerned that the current FS-Cache uses cachefilesd, which may allocate memory; using that mechanism for swapping could lead to a deadlock. Howells said that if the two were combined, cachefilesd would no longer be used. That functionality would move into the kernel so that it could track both swap and FS-Cache entries; the new code would not allocate memory to do so.

James Bottomley said that filesystems could already provide what Howells was looking for with a generic extent map and some form of pre-allocation. He suggested that the block layer did not need to be involved at all; "if that's the solution, can we go on to the next session?" As might be guessed, there was more to work out than that; Howells said that an interface was still needed, but "I don't know if anyone has any particular insight" on it. He noted that Kent Overstreet had thoughts of providing it from bcachefs.
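The building blocks Bottomley referred to are already visible from user space: fallocate() preallocates space in a file, and the FIEMAP ioctl exposes the extent map the filesystem built for it. The short program below (minimally error-checked; the path is just an example) demonstrates both:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(void)
    {
        int fd = open("/tmp/swapfile.example", O_CREAT | O_RDWR, 0600);

        /* Preallocate 2MB; the filesystem picks the extents. */
        if (fd < 0 || fallocate(fd, 0, 0, 2 << 20) < 0) {
            perror("fallocate");
            return 1;
        }

        /* Ask the filesystem for the extents backing those 2MB. */
        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   16 * sizeof(struct fiemap_extent));
        if (!fm)
            return 1;
        fm->fm_length = 2 << 20;
        fm->fm_extent_count = 16;
        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
            for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
                printf("extent %u: logical %llu physical %llu len %llu\n",
                       i,
                       (unsigned long long)fm->fm_extents[i].fe_logical,
                       (unsigned long long)fm->fm_extents[i].fe_physical,
                       (unsigned long long)fm->fm_extents[i].fe_length);
        free(fm);
        close(fd);
        return 0;
    }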

What had been described is a kind of object storage, Jan Kara said. He agreed that it was basically a filesystem question; it is nearly trivial to expose an API like what is needed, but there are a lot of constraints to be worked out. The FS-Cache piece is easy, but fitting swap into that is much harder; he remembers how difficult it was to get swap-over-NFS working, so making this combined cache work reliably in close-to-out-of-memory conditions "needs some serious thought". For that, patches are more important than discussion, because of all of the little details that need to be handled.

Li asked about using zram as a backing device and thought that a separate kind of swap-handling would be needed in that case. Kara said that the point of zram is to compress the data, but the filesystem can do that compression instead. But Li said that filesystem compression is done on a per-file basis, so the whole file would have to be decompressed to access some of its blocks. Overstreet said that "the right way to do compression is per extent".

The information about swapped blocks is stored in memory, Dave Chinner said, so it simply goes away if the system crashes. There is no need for persistence of that information, but he wondered if that was also true for FS-Cache. Howells agreed that the data did not need to persist if the system crashed. That means the existing swap-file support mostly takes care of what is needed, Chinner said; all that needs to be added is support for a variable-length block size to deal with various folio sizes. The metadata can be stored in memory, since it does not need to persist, and can use the B-tree code from, say, XFS or bcachefs to manage the extents; it is efficient and easy to do, he said.
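A minimal sketch of that in-memory, crash-discardable metadata follows; a real implementation would keep the records in a B-tree as Chinner suggested, rather than the flat array used here, and all of the names are invented for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* One record per swapped-out folio: a variable-length extent on
     * the swap device. Kept only in memory; after a crash the swapped
     * data is useless anyway, so nothing needs to reach the disk. */
    struct swap_extent {
        uint64_t key;      /* identifies the swapped-out folio */
        uint64_t start;    /* first block on the swap device */
        uint32_t nblocks;  /* folio size in blocks: 1, 512, ... */
    };

    static struct swap_extent map[] = {
        { .key = 1, .start = 0, .nblocks =   1 },  /* 4KB folio */
        { .key = 2, .start = 8, .nblocks = 512 },  /* 2MB folio */
    };

    static struct swap_extent *swap_lookup(uint64_t key)
    {
        /* Linear scan for clarity; a B-tree would be O(log n). */
        for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++)
            if (map[i].key == key)
                return &map[i];
        return NULL;
    }

    int main(void)
    {
        struct swap_extent *e = swap_lookup(2);
        if (e)
            printf("folio %llu: %u blocks at %llu\n",
                   (unsigned long long)e->key, e->nblocks,
                   (unsigned long long)e->start);
        return 0;
    }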

But Howells said that the FS-Cache data does need to persist, though not across crashes; when the system is rebooted normally, the contents of FS-Cache should still be there. That invalidates the idea of using swap files as the basis of the feature, Chinner said; "you need an actual filesystem at that point". Bottomley wondered what was stored in FS-Cache that can be destroyed in a crash, but must be preserved across a reboot. Howells said that the FS-Cache data is valuable, so he would prefer not to get rid of it; it is a real cache, though, so the files could be recreated if needed, at a fairly high cost.

So the question then becomes "how valuable is FS-Cache persistence?", Bottomley said. Howells said that it is "very valuable" and that there are many customers that use it. Layton added that some of those customers are doing big rendering jobs with huge files; "the last thing they want to do when it comes up is throw away their cache because then it will slam their server" to retrieve the files.

The key that would be used for the FS-Cache files is based on several pieces of data that identify the remote file being stored, but Howells had not mentioned anything about the key for swap data, Matthew Wilcox said. In today's swap code in the memory-management layer, much of the pain comes from "these sort-of address spaces, not exactly address spaces" that are used to allocate space in; that is done with a data structure that is not good for allocation, using locks that don't really scale, he said. "It's all a bit ... awful."

Wilcox wondered if Howells had a proposal for what the key for the swap items would be or if he was looking to the memory-management community to come up with one. Howells said that he had assumed it would be based on the host inode number and a page offset within it, but realized that would not cover anonymous memory. Wilcox thought that maybe the VMA structure could be used, but Kara noted that those can be split and merged, which would invalidate the key.

Li said that the swap offset, which is an index into the array of swap slots, is what is used at swap-out time and what gets stored in the swap entry, so it should be used as the key. But he was concerned about the amount of overhead that would be consumed if, for example, an inode was allocated for each swap slot. Wilcox said that it was important to move away from swap offsets and slots, since that is what causes the "real scalability problems in the swap subsystem".

Some kind of mapping layer will be needed to map the page-based swap offset (the least-recently-used (LRU) list is maintained on a per-page basis) to some internal file, inode, or something else. Wilcox said that he is thinking that the struct anon_vma used for anonymous memory areas could be used; it "looks and feels to me an awful lot like an inode", though it is not an exact fit. When you add reflinks, which are copy-on-write links to files, into the mix, an anon_vma is even more like an inode, he said.
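A kernel-flavored sketch of that idea, with the heavy caveat that no such field exists in the real struct anon_vma; the stand-in types below only show where an inode-like folio index might live:

    struct folio;  /* opaque stand-in for the kernel's folio type */

    struct anon_vma_sketch {
        /* The real anon_vma's lock, refcount, and interval tree of
         * VMAs are elided here. */

        /* Hypothetical addition: an index of the folios belonging to
         * this anonymous region, keyed by page offset, just as an
         * inode's i_mapping indexes file folios (the "looks and feels
         * like an inode" part). In the kernel this would likely be an
         * xarray, as in struct address_space. */
        struct folio **folios;   /* stand-in for that index */
        unsigned long nr_folios;
    };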

Adding something that is "akin to the page cache in each anon_vma structure", if done properly, could also solve "all of the problems that we have around caching reflinked files". Many sessions at LSFMM over the last ten years have had complaints about the lack of page-sharing in the page cache between files that are reflinked, Wilcox said; "if we make both of these problems the same problem, maybe that will be the final impetus we need to get this problem fixed".

Li and Wilcox went back and forth some about how that would look, what would need to be stored in the swap entry, and where the swap entry would be stored. No real conclusions were reached and Li said that he would like to see a more concrete proposal. To that end, Howells wondered if it would make sense to convert tmpfs to use FS-Cache instead of the swap system; Wilcox thought that would be a big improvement for tmpfs. At that point, the session ran out of time; we will have to wait and see what, if any, patches emerge.

A YouTube video of the session is available.


Index entries for this article
Kernel: FS-Cache
Kernel: Memory management/Swapping
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2024



XFS' real-time support

Posted Jul 24, 2024 22:49 UTC (Wed) by Tobu (subscriber, #24111)

For storing folios as extents, XFS real-time devices seem well suited: the metadata structures are preallocated, and space is tracked for all power-of-two sizes. (See the real-time chapters of XFS Algorithms & Data Structures.)


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds