Leading items
Welcome to the LWN.net Weekly Edition for May 25, 2023
This edition is dedicated to ongoing coverage from the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit:
- Sunsetting buffer heads: nobody likes the venerable buffer-head data structure, but getting rid of it is not an easy task.
- A development-process discussion: trying to come to grips, again, with the problems of maintainer overload and frustrated contributors.
- FUSE and io_uring: an effort to use io_uring in FUSE to improve performance.
- Fanotify and hierarchical storage management: fanotify can be used for HSM, but some additions would make it better; meanwhile, there are some deadlocks and races that need to be addressed.
- Monitoring mount operations: discussion on the needs for monitoring mount and unmount activity in the system.
- Page aging with hardware counters: can the kernel's understanding of memory access patterns be usefully improved with hardware assistance?
- The intersection of lazy RCU and memory reclaim: a discussion of the memory-management implications of the lazy RCU mechanism.
- Memory passthrough for virtual machines: an effort to improve the efficiency of memory management in virtualized workloads.
- Phyr: a potential scatterlist replacement: working toward a better representation of DMA operations in both the CPU and device spaces.
- Fighting the zombie-memcg invasion: memory control groups have long presented memory-management problems of their own. This session explored the current difficulties and how they might be fixed.
- Toward a swap abstraction layer: a proposal to bring some much-needed structure to the swap subsystem.
- A slab allocator (removal) update: what is being done to reduce the number of slab allocators in the kernel.
- Reliable user-space stack traces with SFrame: a derivative of the kernel's ORC format may provide a more efficient way to reliably produce user-space stack traces.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Sunsetting buffer heads
The buffer head is a kernel data structure that dates back to the first Linux release; for much of the time since then, kernel developers have been hoping to get rid of it. Hannes Reinecke started a plenary session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit by saying that everybody agrees that buffer heads are a bad idea, but there is less agreement on how to take them out of the kernel. The core functionality they provide — facilitating sector-size I/O operations to a block device underlying a filesystem — must be provided somehow.

The key question, he said, was whether the existing buffer-head API should be reimplemented on top of folios or whether buffer heads should simply be replaced with folios directly. One problem with the latter approach is that folios only support page-size I/O; pages are usually larger than sectors, so page-size I/O operations will necessarily transfer multiple sectors. It is not clear to him that this matters much, though. Filesystems work hard to pack data on disk, and chances are good that the adjacent sectors will also be accessed soon. With current hardware, he said, I/O size is no longer as important for performance, so filesystem developers should not hesitate to use page-sized I/O operations.
An advantage that would come from using folios is that they can use the
page cache directly. The buffer cache duplicates a fair amount of
page-cache functionality; that code would be good to drop. Changing
filesystems to read page-sized chunks is relatively easy, he said,
but the write side is less so. Filesystems often assume that a write will
only affect one sector, but that is not the case with folios. There is a
patch series from Ritesh Harjani adding support for sub-page dirty
tracking that should help with the conversion process.
Luis Chamberlain said that one possible sticking point is the block-device cache, which uses buffer heads for tasks like partition scanning. It's not clear that the benefits of converting this code are worth the risk. Reinecke answered that a full conversion away from buffer heads may never be necessary. Chamberlain said that a number of filesystems use buffer heads for metadata I/O and asked whether they wanted to move away from that; the answer appeared to be "yes". Jan Kara said that moving away from buffer heads was important, but that there is still a need for some sort of intermediate layer; Reinecke agreed that a drop-in replacement for buffer heads is important.
Kara pointed out that filesystems use the buffer-head layer to maintain the association between sectors and the inode of the file containing them. It is, he said, "one of the darker corners" of the buffer-head code. Reinecke questioned whether that functionality was needed, since the dirtying of data happens at the folio level in any case.
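For reference, this is roughly what the structure under discussion looks like. The listing below is an abridged rendering of struct buffer_head from recent kernels' include/linux/buffer_head.h; the field names are real, but several fields are omitted and the exact layout varies by version. The b_assoc_buffers and b_assoc_map fields implement the sector-to-inode association that Kara described as one of the darker corners of the code.

```c
/* Abridged struct buffer_head (recent kernels; several fields omitted). */
struct buffer_head {
	unsigned long		b_state;	/* buffer state bitmap */
	struct buffer_head	*b_this_page;	/* circular list of the folio's buffers */
	struct folio		*b_folio;	/* the folio this buffer is mapped to */
	sector_t		b_blocknr;	/* start block number */
	size_t			b_size;		/* size of the mapping */
	char			*b_data;	/* pointer to the data within the folio */
	struct block_device	*b_bdev;	/* device this buffer belongs to */
	bh_end_io_t		*b_end_io;	/* I/O completion callback */
	void			*b_private;	/* reserved for b_end_io */
	struct list_head	b_assoc_buffers; /* buffers associated with a mapping */
	struct address_space	*b_assoc_map;	/* the mapping (inode) this buffer is associated with */
	atomic_t		b_count;	/* reference count */
};
```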
Ted Ts'o said that the buffer-head code hides a lot of functionality, and that each filesystem uses a different subset of that functionality. Sub-page tracking only replaces one piece of that puzzle. The ext4 filesystem still supports 1KB block sizes, so sub-page tracking can't be ignored; it needs that piece. Things get more complicated when small block sizes have to be supported on CPUs with 64KB base page sizes.
Some filesystems also use the buffer cache to associate dirty buffers with inodes, Ts'o continued. Filesystem developers are slowly converting over to the iomap support layer, which makes that problem go away, but finishing the conversion is "a big lift". Some other buffer-cache functionality is only used by the journaled block device (JBD2) layer, and might be best just moved into that code.
Yet another problem is that ext4 has to be able to support utilities that open the underlying block device while a filesystem is mounted; keeping everything coherent in that environment is a challenge, but the buffer cache makes it happen for free. He concluded by saying that there is value in replacing the ancient buffer-head code, even if it still provides functionality needed by filesystems, but it will be an incremental task. It will not be possible to just switch to folios for metadata; some sort of intermediate layer will still be needed.
Josef Bacik agreed that "we all hate buffer heads", but said that every filesystem manages metadata in its own special way. Btrfs uses extent buffers that sit on top of struct page and will be converted to folios soon. XFS has its own structure for metadata. And so on. Just dropping buffer heads will not be the big win that everybody seems to think it will be, he said. A common replacement layer for filesystems is not the solution, since every filesystem is different. It would take some coercion to get him to switch Btrfs to a common layer. Reinecke answered that he is trying to outline a path by which filesystems can be converted away from buffer heads and is not trying to force anybody.
Ts'o said that the buffer-head work won't matter for filesystems that do not use buffer heads now, but there are still quite a few that do use them, and the buffer-head layer provides an important support system. Bacik said that any conversion work should just be done within the buffer-head layer so that filesystems wouldn't notice the difference. Matthew Wilcox asked whether there was a desire for large folio support within the buffer-head layer; that would be difficult to add transparently. Ts'o advised keeping things simple and sticking to single-page folios for a replacement buffer-head layer. Filesystems that want multi-page support can switch to iomap.
Bacik brought the session to a close by stating the apparent conclusion from the session: the buffer-head layer will be converted to use folios internally while minimizing changes visible to the filesystems using it. Only single-page folios will be used within this new buffer-head layer. Any other desires, he said, can be addressed later after this problem has been solved.
An LSFMM development-process discussion
At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Hannes Reinecke led a plenary session ostensibly dedicated to the "limits of development". The actual discussion focused on the frustrations of the kernel development process as experienced by both developers and maintainers. It is probably fair to say that no problems were solved here, but perhaps the nature of some of the challenges is a bit more clear.

Reinecke started with the topic of patch sets that are proposed by their author, but do not end up being either applied or rejected. Or perhaps a patch set will be rejected after the developer has put the work into posting numerous revisions. Meanwhile, other patches that are seemingly just as intrusive are accepted without problems. This unpredictability makes it hard for companies — even those that are relatively deeply involved with the kernel community — to know what will happen when they start on a development project.
Nobody wants to spend a lot of time on a project that will end up being rejected. He hasn't been able to come up with guidelines for these developers other than to say that, if they approach the maintainer at the wrong time, their work will not get in. This seems somewhat inadequate, so he would like to find a way to apply the brakes early for work that has fundamental problems and will not find acceptance.
Josef Bacik said that the Btrfs community has put some effort into addressing this problem. As the patch counts have gone up, the community has gotten better at figuring out how and when patches go in; this took a lot of effort from the maintainers and technical leaders. An effort has been made to provide early indications of whether a project is worth doing and, if it is, to put in the work to shepherd the project along.
A lot of kernel maintainers, he continued, see their subsystem as their own fiefdom and don't care much about work from other developers. Meanwhile, drive-by authors don't know who they should be paying attention to; this is a fundamental problem in our community. Maintainer behavior can be erratic and inconsistent. We should be able to work better together, he said. Getting there requires a more clear definition of what the maintainer's role is. It is not just accepting patches. A maintainer should provide clear indications to newcomers, provide feedback on ideas, and ensure that patches get reviews. The Btrfs developers get together regularly to talk about their work, but there is no equivalent forum for the wider community.
Steve Rostedt mentioned the kernel's communication-style documentation, which is currently being (slowly) rethought by the Linux Foundation Technical Advisory Board. Our documentation on how maintainers talk to developers can stand improvement, but there is also a place for a document on how contributors should talk to maintainers. Reinecke answered that all of this implies that people actually talk to each other.
Bacik said that communication style matters. Anybody who has been in his presence for long knows that he tends toward the use of profanity when he speaks. But he has made a point of not doing that on the mailing lists; he never knows how it will be perceived, especially by people who do not know him. It could cause contributors to decide to stay away from the community. We should all make a similar effort to be more welcoming, he said. He also expressed unhappiness about people who continue to fight over things that have been settled; he mentioned BPF and systemd as examples. We are all moving in the same direction, he said, and should pretend that we like each other.
Reinecke asked if the problem is just one of communication. A pattern he has often seen is where the relevant parties enter a discussion late, then dig in their heels. Bacik said that it's the maintainer's job to steer a project; that includes looking at random patch sets and giving feedback. A flat refusal to accept a patch is not helpful; the author had a use case in mind and didn't write the code just for fun. They are trying to accomplish something, and the maintainer should make the effort to understand why.
Ted Ts'o said that running the project has to be a team effort; if the onus is put just on the maintainers, they will burn out. Maintainers simply cannot review all of the patches that hit the lists; they are dependent on others to help with that work. One cannot demand a service-level agreement from a volunteer maintainer. Few maintainers, he said, do that work as a full-time job.
Kent Overstreet said that developers want quick feedback. Reviewers often think that every review has to be highly detailed; that is demanding and slows down the process. Christian Brauner answered that some patches really do require deep review, and that just doesn't scale. Sometimes, he said, we just have to accept code that is not perfect.
Brauner said that he would like to see more patches with Reviewed-by tags. In his experience, potential reviewers worry that their tag will make them look bad if a bug is found in the patch later on. But bugs happen, he said, and they don't mean that a reviewer hasn't done their job. Dan Williams said that we are all managing our reputations in the community, and that can cause people to hesitate; maintainers need to engage with contributors, but the contributors also have to give the maintainers some space. Maintainers, he said, should talk to each other more and not be afraid to ask for help. There is also value in having regular phone calls. Most of a project's communication channels are public and archived; having a more private channel can put people at ease.
James Bottomley asked whether the community is sufficiently encouraging of reviews. In the past there have been problems with people giving bad advice, leading the community to clamp down somewhat. But perhaps the correction has gone a little too far? Brauner said that a maintainer will only get reviews if they encourage reviewers; if you see somebody who is doing good work, write to them and ask them to continue. We are good at saying something is wrong, he said, but not so good at saying when something is right.
As the session ran out of time, Bottomley asked how a potential reviewer might get the "R" tag in the MAINTAINERS file that will cause them to be copied on patches. In most subsystems, it seems that the reviewer has to explicitly ask for it.
At that point, the session came to a close. It is not clear that a whole lot was achieved in the discussion, but it did at least give maintainers a chance to talk about their frustrations for a bit.
FUSE and io_uring
Bernd Schubert led a session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit on the intersection of FUSE and io_uring. He works for DDN Storage, which is using FUSE for two network-storage products; he has found FUSE to be a bottleneck for those filesystems. That could perhaps be improved by using io_uring, which is something he has been working on and wanted to discuss.
![Bernd Schubert [Bernd Schubert]](https://static.lwn.net/images/2023/lsfmb-schubert-sm.png)
He noted that Boaz Harrosh had developed the zero-copy user-mode filesystem (ZUFS) in 2018, but it did not go upstream in part due to concerns about its overlap with FUSE. Meanwhile, Miklos Szeredi started working on FUSE2, but that has languished as well. Schubert briefly looked at both, but the FUSE2 Git tree was hard to review since it is a single big patch rather than being broken into reviewable pieces.
Last year, he was working on atomic open operations and noticed some problems; Szeredi asked for some benchmarks, which turned out to be confusing. Multiple threads were reading from /dev/fuse, which caused the confusing results; switching to a single thread made the results consistent. He also realized that performing a polling loop before adding an operation to the wait queue greatly improved the filesystem performance.
He was also looking at an NVMe driver that his company uses, wondering why it was able to avoid the bottlenecks that he was seeing; it used io_uring, but in the "wrong" direction, from user space to the kernel, while FUSE needed to go the other way. Around that time, the IORING_OP_URING_CMD support for io_uring was added, which is being used in the ublk user-space block driver in the right direction, from the kernel to user space. That provides a model for doing something similar in FUSE, which is what Schubert has been working on.
He explained the inner workings of the work he has done to make FUSE use io_uring. There is one thread per core, each with its own ring buffer; there is a shared memory buffer with the FUSE queue ID used as the offset for user space to mmap() its region. Libfuse initiates operations with an IORING_OP_URING_CMD, which is the core idea taken from ublk. For debugging purposes, there is a mode with a single thread and ring buffer.
Amir Goldstein asked whether user space really needed to be aware of the underlying implementation. Schubert replied that his goal was to make it transparent, so that existing filesystem implementations could gain the performance benefits without having to change their code.
He is unhappy with using the queue ID to identify the user-space buffer (via the mmap() offset), but was unable to find the corresponding buffer in the kernel without it. There was some discussion of ways to get the kernel and user space in sync on the buffer location directly. Jan Kara suggested looking in the VMA associated with the user-space virtual address to find where in kernel memory the buffer was located; he said that he could help Schubert find the right calls to make for that.
Schubert showed some performance benchmarks, but noted that he needs to find a way to keep the scheduler from migrating the application processes to other CPUs. For example, direct I/O reads showed moderate improvements for io_uring over regular FUSE, but much larger improvements when migration was disabled. Kara cautioned that CPU migration can be a problem at the start of a test like this, but may not actually be problematic over the long term; meanwhile, there are other workloads that may benefit from the migration.
But Schubert said that FUSE is particularly affected by this because most of the I/O work is being handed off to another process; once it completes, it is best if the application process is still running on that same CPU. Kara said that it is not a simple problem and that the scheduler may lack the information it needs to make the right decision. The scheduler developers are aware that there may be problems in scheduling for io_uring and are working on some solutions; he recommended that Schubert work with those developers.
Fanotify and hierarchical storage management
In the filesystem track of the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Amir Goldstein led a session on using fanotify for hierarchical storage management (HSM). Linux had some support for HSM in the XFS filesystem's implementation of the data management API (DMAPI), but that code was removed back in 2010. Goldstein has done some work on using fanotify for HSM features, but he has run into some problems with deadlocks that he wanted to discuss with attendees.
He began by pointing to a wiki page he created to describe HSM and his goals for using fanotify to support it. His employer is CTERA Networks, which builds "cloud gateway solutions", where files appear to be available on the local system even though they may be cached on a local network-attached storage (NAS) device or stored somewhere else in the cloud. The NAS might not have space to accommodate all of the data, but it functions as a (more) local cache.
![Amir Goldstein [Amir Goldstein]](https://static.lwn.net/images/2023/lsfmb-goldstein-sm.png)
Windows has an API for HSM, so files have a status that reflects their location; users can decide if they want to access a file if, for example, it will require a lengthy copy from the cloud. This HSM support is based on "reparse points" in NTFS; when those are encountered, another filesystem driver is called to provide the file data. There is nothing like that in Linux, so those who provide that functionality have to implement their own scheme; CTERA uses FUSE.
The FUSE solution comes with various kinds of problems and he hopes that some of the alternatives being discussed at LSFMM+BPF will help alleviate them. DMAPI is an old API, which is insufficient for today's HSM needs, though the code from XFS still exists if there is anything useful in it; remnants of it are still present in Linux, as the "punch hole" interface was added for DMAPI. When the DMAPI hooks were removed, there was a comment suggesting that "at least the namespace events can be done much saner in the VFS", which is what Goldstein is trying to do now.
He showed that a simple HSM can be implemented using the existing upstream fanotify API, using sparse files to represent the data that is not local. It works by first getting an exclusive write lease on the file with fcntl(fd, F_SETLEASE, F_WRLCK), migrating the content elsewhere, and then punching a hole in the file by passing FALLOC_FL_PUNCH_HOLE to fallocate(). The HSM service can subscribe to various types of fanotify events in order to be notified when the content, permissions, or directory entry of the file is changed; the cloud version can then be updated as needed. "It is very naive, but it works."
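As a concrete illustration of that sequence, here is a minimal user-space sketch of the migrate-then-punch step. It is not code from the session; upload_range_to_cloud() is a hypothetical placeholder for whatever moves the data to the remote tier, and error handling is reduced to early bail-out.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copy [offset, offset + len) to the remote tier. */
extern int upload_range_to_cloud(int fd, off_t offset, off_t len);

static int migrate_and_punch(int fd, off_t offset, off_t len)
{
	/* Take an exclusive write lease so that other opens of the file are
	   held off (or at least signaled) while the content is migrated. */
	if (fcntl(fd, F_SETLEASE, F_WRLCK) == -1)
		return -1;

	if (upload_range_to_cloud(fd, offset, len) == -1)
		goto fail;

	/* Replace the migrated range with a hole; FALLOC_FL_KEEP_SIZE keeps
	   the file size intact, leaving a sparse placeholder behind. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len) == -1)
		goto fail;

	return fcntl(fd, F_SETLEASE, F_UNLCK);	/* drop the lease */
fail:
	fcntl(fd, F_SETLEASE, F_UNLCK);
	return -1;
}
```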
However, it is not practical for today's use. For example, users have to download their entire movie, say, before starting to watch it. He has a patch set to add some features to fanotify that would make it more usable as an HSM, which he posted (as a pointer to his Git tree) in an email back in September 2022. The resulting thread eventually led to the session at the summit.
The changes are small, he said, simply adding a few more fanotify event types (or additional information to existing events), which would facilitate the HSM use case. They are described further in a section of the wiki page and would allow features like populating directories on demand, streaming downloads of large files, and crash-safe change tracking. He has been working on change tracking for a number of years now in various guises; he has an internal solution, but would like to get something into the mainline.
He described a demo that he did not have time to actually perform, which can be seen on slide 6 of his slides; it was based on the HTTPDirFS FUSE filesystem, which allows read-only mounts of a directory accessed via HTTP. Goldstein modified it to use fanotify on a kernel with his patches. It would allow him to mount the kernel.org /pub directory locally, then access a file deep in the directory hierarchy. The filesystem lazily populates the needed directories into the local directory where it is mounted. The mount point is no longer a FUSE mount in that mode, but is a bind mount instead, with fanotify events being monitored. He showed an example command that would display the first few lines of a tar table of contents for a large file. Only the first 1MB of the file would be transferred before the command completed, rather than waiting for the entire contents.
He had two more slides after the "demo" slide, which were increasingly complex, he said. They were an attempt to explain some problems that he has found, "in order to try to sell the solution". At one time, there was a problem with the original fanotify API where an operation caused a FAN_ACCESS_PERM event, which might require the fanotify service to access the file; that results in a second (blocking) FAN_ACCESS_PERM event which leads to a user-space deadlock. That was solved by adding a special file descriptor that can be used by the service to perform actions without triggering another fanotify event.
But now there is another deadlock that can happen with the existing API; it is perhaps rare, but it can happen and he is surprised that it has not been reported. It involves a clone file range operation, which takes the superblock freeze lock, but it may cause the HSM (or other fanotify-based) service to also need to freeze-lock the superblock. If the files are on the same filesystem (thus share the same superblock), a deadlock will result.
This deadlock is perhaps more common in his HSM service than in other types of fanotify-based scanners (e.g. virus scanners). He has solved it by using a new event flag (FAN_PRE_VFS) that gets added to FAN_ACCESS_PERM events if the freeze lock has not been taken. He then went through and added that flag in the places where it was true, which involved calling the notify hook in some new places. That gives the service an opportunity to fill the file before the clone file range operation freezes the superblock. That was his solution, which was not hard to do, Goldstein said.
He moved on to the second even-more-complicated slide, which covered a similar kind of deadlock, but it could also result in a race condition that would cause his HSM to miss filesystem changes at times. The scenario was well beyond my ability to follow it, but a video of the session should be available before long. His solution to the problem, which was suggested by Jan Kara, was to use sleepable RCU, which would avoid the race at the cost of an occasional false-positive change notification.
Once attendees seemed to get up to speed on the problem (and proposed solution), the session ran out of time, though discussion spilled over into the next slot. Josef Bacik said that he did not hate the solution that had been chosen, though he did not love it either. Kara explained why sleepable RCU was chosen, and Goldstein thought that the general idea could be applied to other filesystem-related ordering problems (such as when an inode's i_version field gets incremented).
Monitoring mount operations
Amir Goldstein kicked off a session on monitoring mounts at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. In particular, there are problems when trying to efficiently monitor "a very large number of mounts in a mount namespace"; some user-space programs need an accurate view of the mount tree without having to constantly parse /proc/mounts or the like. There are a number of questions to be answered, including what the API should look like and what entity should be watched in order to get notifications of new mount operations.
It is trivial, he said, to add a notification for unmount events, but the corresponding event for a new mount is trickier, since it is not clear where, exactly, the watch for that event should be placed. It could be placed on the user or mount namespace of interest; another idea would be to choose a directory and monitor all of the mounts that happen on it and any of its subdirectories recursively. David Howells said that he has implemented something for getting mount notifications; the watch is placed on the mount namespace. Miklos Szeredi said that each namespace has its own mount tree and each mount has a 32-bit ID that gets assigned to it, but those cannot reliably be used to uniquely identify a particular mount because they can be reused during a given boot of the system. Howells said that he added a 64-bit counter that could be used for that purpose, though it will "eventually get reused" as well.
Howells was asked about patches, which he said he had posted a while back. Szeredi pointed out that those patches were not for fanotify support, but were instead for the watch queue; it is the same general concept, however, he said. Christian Brauner thought that the notification piece should be separated from the fsinfo() effort.
The problem, Howells said, is that the notification queue can overflow, which means that events, such as mount and unmount operations, would get lost. Howells mentioned that currently tools have to parse (and poll) /proc/mounts in order to find out the status of mounts and unmounts, which is not particularly efficient. Brauner noted that he had invited Lennart Poettering to the talk, since systemd would be one of the eventual users of any new feature of this sort, so he asked Poettering about systemd's needs in this area.
Right now, systemd parses /proc/self/mountinfo, "which, of course, is terrible", Poettering said. He is not particularly concerned if events get dropped, as long as there is a way to figure out what has happened; some kind of unambiguous indication that events have been dropped coupled with an API for systemd to get the current status when it needs to do so would be ideal. He would like a facility that provides the immediate child mounts for a given mount along with mount-related events for those children. Howells said that Ian Kent had created a patch set implementing mount watching for systemd using fsinfo() and the watch queue notifications.
Brauner asked if the feature needed to be added to fanotify for systemd's use, but Poettering said that he did not care. His main concern is in getting notified when events are lost, so that systemd can take some action to update its state; it would be great if the lost-event notification narrowed down where in the mount tree the lost event(s) came from. For systemd's use case, it would be better to get events for a particular subtree, rather than the whole system, because it normally is only concerned with a subset of the full mount tree.
Jeff Layton asked about the systemd use case for this information. Poettering responded that there are many systemd services that need to wait for mount activity of some form (e.g. at boot time, MySQL needs to wait for the filesystem where its files reside). Much of systemd's dependency processing for services depends on an accurate picture of the state of the system, including mounts.
Goldstein said that he was unsure how to report the occurrence of a tucked mount, which is a mechanism aimed at cleanly replacing an overlay mount. Brauner said that he was no longer "allowed to call it that"; there is another interpretation of that term, which he was unaware of until "friendly people on social media" pointed it out to him. They suggested using "beneath" to describe the type of mount. There is also, of course, the danger of mistyping the previous term, he said.
There was some discussion of a way to retrieve the immediate child mounts, as Poettering requested, but that will require a unique mount ID, Brauner said. After some roundabout discussion about mount-related APIs and the concerns that would need to be kept in mind, worries about a mount-ID overflow were raised. Layton pointed out that a 64-bit counter that gets incremented every nanosecond will take more than 500 years to overflow, so "we're never going to overflow at 64 bits".
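Layton's figure is easy to verify with a bit of back-of-the-envelope arithmetic (mine, not something presented in the session):

$$2^{64}\,\mathrm{ns} \approx 1.8\times10^{19}\,\mathrm{ns} \approx 1.8\times10^{10}\,\mathrm{s} \approx 584\ \mathrm{years}$$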
There may be problems with exposing those 64-bit values to user-space programs that expect only a 32-bit mount ID, however. In fact, Poettering checked the systemd code and it "knows" that mount IDs are 32 bits in size. Howells said that the existing mount ID is "recycled, too small, people assume it is too small", so something new that is defined to be 64 bits wide is needed. Poettering suggested using UUIDs "and the problem goes away", he said, to chuckles around the room. As time expired, things kind of trailed off; it is clear that there is more work needed before anything is likely to go upstream.
Page aging with hardware counters
The memory-management subsystem has the unenviable task of trying to predict which pages of memory will be needed in the near future. Since predictions tend to be difficult, the code relies heavily on the heuristic that memory used in the recent past is likely to be used again in the near future. However, even knowing which memory has been recently used can be a challenge. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Aneesh Kumar and Wei Xu, both presenting remotely, discussed some ways to use the increasingly capable hardware counters that are provided by current and upcoming CPUs.
Kumar started by talking about how these counters, which can count page
accesses, might be used to improve the multi-generational LRU; he has recently posted
a
patch series implementing this functionality. It uses page-access
counters to help with the sorting of pages into generations. Counters
help, but do not entirely solve the problem, he said; most architectures
can produce access counts at this point, but those are absolute numbers.
Page sorting requires a sense of relative activity — which pages are
seeing more activity than others now? The code works by looking at the
counts for some pages in the oldest and youngest generations, using them to
estimate what the minimum and maximum numbers would be. Those values are
then used to classify other pages. He said that it might be feasible to
use k-means
clustering to classify pages instead.
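A hypothetical sketch of that classification step (my illustration, not taken from the posted patch series) might look like the following: estimate the count range from the oldest and youngest generations, then place other pages by linear interpolation within that range.

```c
/*
 * Hypothetical helper: map a page's hardware access count onto a
 * multi-generational-LRU generation, given estimates of the counts seen
 * in the youngest (hottest) and oldest (coldest) generations.
 * Generation 0 is the youngest; nr_gens - 1 is the oldest.
 */
static unsigned int classify_by_count(unsigned long count,
				      unsigned long est_min,
				      unsigned long est_max,
				      unsigned int nr_gens)
{
	unsigned long span = est_max - est_min;

	if (count >= est_max)
		return 0;			/* as hot as the youngest generation */
	if (count <= est_min || span == 0)
		return nr_gens - 1;		/* as cold as the oldest generation */

	/* Interpolate: a larger count maps to a younger generation. */
	return (nr_gens - 1) - ((count - est_min) * (nr_gens - 1)) / span;
}
```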
The advantage of using the counters is that the kernel can skip the work of walking through a process's page tables to find the pages that have been accessed recently. It also eliminates the need to store generation information in the page flags, which are a tightly limited resource. The results from benchmarking this code were somewhat inconclusive, though; some workloads regressed slightly, while others improved a little. Optimizing the multi-generational LRU, he said, is a tricky task. Profiling the kernel showed that the use of hardware counters reduced the time spent checking per-page "accessed" bits, but added time spent querying the counters. In summary, he said, it is still not entirely clear whether this feature is worth adding or not.
An entirely different use case is measuring the utilization of transparent huge pages. These pages, which are often made up of 512 4KB "base" pages, have a single bit tracking references to the whole thing. If a single base page within a huge page is active, it makes the whole huge page look busy, even though the other 511 base pages might be unused. Hardware counters exist for each base page, though, so they should be able to identify hot and cold pages hidden within a huge page.
The approach he is working on changes the khugepaged thread, which "collapses" base pages into huge pages behind the scenes, with the result that it only creates a huge page if all of the base pages that it contains have approximately equal access frequencies. The reclaim process can look at access patterns for the base pages within a huge page and break that huge page apart if it is sparsely used. Benchmarking this work was tricky, since it is hard to find workloads that show this type of access pattern. With a special-purpose microbenchmark, though, Kumar was able to demonstrate that sparsely-used huge pages can be split, freeing little-used memory for other uses.
Another potential use for hardware counters is in page promotion, which relies on being able to detect heavily used pages that should be promoted to faster storage. Using hardware counters, the kernel can compare the relative hotness of pages across NUMA nodes, which is not easily done now. Kumar has been unable to test this idea, though, since he lacks the hardware that it would apply to.
Assuming that there is a place for hardware counters, developers would need to find the best way to integrate them. One option, he said, was to add support to DAMON, but that approach is hard to evaluate. One benchmark he ran showed a 12% performance improvement, but it also showed a lot of variance.
Xu took over at this point to return to the question of page promotion,
which he described as a key challenge for memory tiering. There are
currently a number of ways of identifying hot pages (which are candidates
for promotion) in the kernel, he said, including page faults, accessed-bit
scans, hardware counters, and more. It would be good to somehow unify the
kernel's approach to hardware-assisted page promotion.
The best approach there, he said, would be to create an abstraction layer
around page promotion that could use a number of back-end modules to
acquire information on page usage.
He has implemented a user-space promotion daemon that used a combination of access bits and precise event-based sampling (PEBS) data, with events streamed to user space by way of a BPF interface. In combination with a custom system call to request the migration of physical pages, he said, it worked "well enough".
He wondered how this kind of functionality might be brought into the kernel. One possibility would be to extend the autonuma mechanism to use hardware counters but, he said, that is not a great fit. Autonuma is based on virtual memory areas, while the counters are tied to page-frame numbers. A better idea, he thought, was to extend the multi-generational LRU to make use of this information. It could be augmented with a per-node page-promotion thread that works like kswapd, but in the opposite direction.
Kumar asked the gathered developers how they thought the incorporation of hardware counters should proceed. Dan Williams said that user space might also want to use the system's performance counters; if the kernel grabs them, those applications could break. Kumar answered that this would not be a problem on architectures, like PowerPC, that have dedicated counters for this purpose. Williams suggested implementing this functionality for such hardware first. Xu added that he had used PEBS for his work because it was the only thing available, but that dedicated counters are a better solution and he hoped other vendors would start to provide them.
DAMON developer SeongJae Park expressed his thanks for the DAMON integration, which is something that he had been wanting to do himself; he encouraged the sending of patches. Kumar said that the patches would be sent, but remarked that DAMON is difficult to use for generic workloads; Park answered that he is working on automatic tuning to address that problem.
The session closed with a suggestion from Williams that proving the value of hardware counters in DAMON would be a good first step. After that, if it seems worthwhile, support for these counters could be moved into the core memory-management code.
The intersection of lazy RCU and memory reclaim
Joel Fernandes introduced himself to the memory-management track at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit as a co-maintainer of the read-copy-update (RCU) subsystem and an implementer of the "lazy RCU" functionality. Lazy RCU can improve performance, especially on systems that are not heavily utilized, but it also has some implications for memory management that he wanted to discuss with the group.

The core idea behind lazy RCU is that, when the system is idle, it may not need to invoke RCU callbacks right away. These callbacks trickle in constantly, even on a lightly loaded system, and waking a CPU to call them can disturb an otherwise idle system, confusing the power-management code. This behavior can be seen in workloads like video playback on Chrome OS systems and Android logging.
RCU, he said, maintains a per-CPU "bypass list" to reduce contention.
Normally, callbacks can be queued on one CPU but run on another, which can
lead to lock contention and reduced performance. If the main callback list
gets too long, the RCU code will start shunting callbacks over to the bypass
list instead, avoiding the need to acquire a lock. Eventually the bypass
list is flushed back onto the main list, either as the result of a timer
firing or the bypass list getting too long. Lazy RCU is based on the idea
that callbacks marked as non-urgent can go straight to the bypass list and
be processed at some future time. This technique, he said, can reduce a
system's power usage by 10-20%.
One of the main uses of RCU callbacks is to release memory once it's safe to do so. Accumulating callbacks indefinitely thus has the potential to run the system short of memory over time. To avoid this problem, RCU implements a simple shrinker that flushes bypass-list callbacks into the main list, where they will be processed. There are some problems with this approach, though, starting with the fact that RCU has no way to know how much memory any given callback will release to the system. Shrinkers are for caches, but the callback list is not really a cache; it is, instead, a deferred garbage-collection mechanism. So a call to the shrinker might free more memory than is needed, but it doesn't do that immediately; instead, the shrinker has to trigger an RCU grace period, which can take some time.
Fernandes was looking for input on how the handling of this list could be improved from a memory-management point of view. Michal Hocko said that the shrinker is probably not the right approach; the kernel's proactive-reclaim mechanisms can cause shrinkers to be called even when the system is not short of memory. That could cause callbacks to be flushed unnecessarily, defeating the purpose of lazy RCU. A better idea would be to hook into the allocator directly, perhaps in a function like should_reclaim_retry(). When that call does happen, he said, RCU should just flush everything. Fernandes said this approach might help.
Another attendee suggested that, since callbacks are being flushed in the hope that they will free memory, the right thing to do might be to specially mark the callbacks that will actually do so. Matthew Wilcox said that "99% of RCU callbacks" free memory, but Fernandes disagreed, saying that they handle numerous other types of tasks as well. Still, he allowed that "most" callbacks do, indeed, free memory. Perhaps, he said, a better approach would be to create an API for callbacks that don't return memory to the system.
Fernandes asked whether it might make sense to get information from the memory-management subsystem on whether any given callback actually freed memory. That might help in cases where a specific amount of memory is targeted for freeing. He also wondered if the RCU shrinker should return zero (indicating no memory was actually freed), since it will have only started a grace period and will not have actually freed any memory yet. The answer was that RCU should just drop the shrinker and implement something better.
There was a side discussion about kfree_rcu(), which exists for the sole purpose of freeing memory after an RCU grace period. This implementation, rather than maintaining a linked list of callbacks, just fills a page with pointers to the objects to be freed; the whole set can then be returned with a call to kfree_bulk(). This approach has a number of advantages, including increased cache locality and the ability to use the more efficient kfree_bulk() method. There is a significant disadvantage, though, in that kfree_rcu() may have to allocate memory while freeing. Having to allocate memory in this situation is something kernel developers go out of their way to avoid, since that allocation might be impossible at exactly the time when it is most needed.
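For readers unfamiliar with the API being discussed, here is a minimal, generic illustration of kfree_rcu() usage (not code from the session): the object embeds a struct rcu_head, and kfree_rcu() arranges for the slab free to happen only after a grace period, without the caller supplying a callback function.

```c
#include <linux/slab.h>
#include <linux/rcupdate.h>

struct foo {
	int data;
	struct rcu_head rcu;	/* used by RCU to defer the free */
};

static void release_foo(struct foo *p)
{
	/*
	 * The memory goes back to the slab allocator only after all
	 * pre-existing RCU read-side critical sections have completed.
	 */
	kfree_rcu(p, rcu);
}
```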
Fernandes would like to build a deferred-freeing mechanism directly into the slab allocator but, he confessed, he was "living in a fantasy world" when he was researching the idea. It is harder to do than he thought it would be. The interaction of grace periods with the slab allocator is tricky and, when the need to free memory arises, the grace period might have already passed, meaning that the RCU stage can be skipped entirely.
His thought was to mark objects specifically in the slab allocators as not being ready to be freed quite yet. The allocators could maintain such objects in their free list, but not hand them out to new users until that marking goes away. That would eliminate the need to allocate memory in kfree_rcu(), and could eliminate the need for a separate shrinker as well. Unfortunately, the SLUB allocator maintains its free lists by storing pointers in the objects themselves — which is not advisable if the objects are still in use. Slab maintainer Vlastimil Babka said that he would think about the problem.
Fernandes closed the session by saying that the benefits of this scheme may not justify the addition of more complexity to the slab allocator. For now, at least, hooking into the reclaim path as Hocko suggested is the direction this work seems likely to go.
Memory passthrough for virtual machines
Memory management is tricky enough on its own, but virtualization adds another twist: now there are two kernels (host and guest) managing the same memory. This duplicated effort can be wasteful if not implemented carefully, so it is not surprising that a lot of effort, from both hardware and software developers, has gone into this problem. As Pasha Tatashin pointed out during a memory-management-track session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, though, there are still ways in which these systems run less efficiently than they could. He has put some effort into improving that situation.

The translation of a virtual address to a physical address is a more complex affair than it seems. The lookup operation must work through as many as five levels of page tables. At each level, a memory load must be performed and TLB misses are possible, meaning that the lookup operation can be slow. It gets worse when this happens in the guest, though; guest "physical" addresses are virtual addresses in the host space, so the lookup at each level of the guest page-table hierarchy can require walking through the full hierarchy in the host. The worst-case lookup, when both the host and the guest are running with five-level page tables, could require 35 loads, which can hurt.
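That worst-case figure follows from the usual two-dimensional page-walk arithmetic (a back-of-the-envelope derivation, not something taken from the talk): with n guest page-table levels and m host levels, each of the n guest page-table pointers, plus the final guest-physical data address, must itself be translated by an m-level host walk before it can be loaded, giving

$$n(m + 1) + m \;=\; (n + 1)(m + 1) - 1 \;=\; (5 + 1)(5 + 1) - 1 \;=\; 35\ \text{loads}$$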
Optimizing this situation, he said, starts from a recognition that work is being duplicated in the virtualized environment. He was not just referring to page-table lookups; memory is also zeroed twice when virtualization is in use. The solution Tatashin has in mind is to push as much of the work as possible to the host system in ways that are not transparent to the guest.
Specifically, he has implemented a driver for a "memctl" device that is
present on the guest side; this device provides many of the
memory-management operations that are already available through the guest's
system-call interface: mmap(), mlock(),
madvise(), and so on. The difference is that, for the most part,
these operations are passed through to the host for execution there rather
than being handled by the guest. Additionally, the memctl device does not
zero memory on the guest side; it counts on the host to take care of that when
needed.
The other piece of the puzzle is that memctl would allocate pages in the guest's physical address space at the 1GB huge-page size. On the host side, though, these pages are mapped at a smaller size — as either 2MB huge pages or 4KB base pages. The use of 1GB pages on the guest shorts out most of the address-translation overhead at that level, speeding access considerably. The smaller pages on the host side avoid fragmentation issues; guest memory can be managed in smaller units. This only works, though, if all operations on that memory are done by the host, which is why the memctl device must provide equivalents for all of the relevant system calls.
David Hildenbrand suggested that the real optimization in this setup is avoiding the need to zero pages on the guest side and, perhaps, from not having to allocate all of the guest's memory immediately on the host. He thought that some of these optimizations could be done within the balloon driver as well, but probably not all of them. The virtio-balloon is "the dumping ground" for a lot of similar code, he said.
Tatashin continued, wondering whether and how it might be possible to upstream this code. Andrew Morton asked where the changes live; the answer is that almost all of the work is in the new memctl device, which is a separate driver. So there would be little impact on the core memory-management code. But Tatashin worried about maintaining the ABI after the code goes upstream and wanted to be sure that he is not adding any security problems. He was advised to copy the patches widely, and the community would figure it out somehow.
As the session ran out of time, an attendee asked whether this mechanism required changing functions like malloc(). Since memory-management operations have to send commands to the memctl device, the answer was "yes", code like allocators would have to change. Perhaps someday it would be possible to do a lot of the basic memctl operations from within the kernel, but more specialized applications would have to do their own memctl calls.
Phyr: a potential scatterlist replacement
The "scatterlist" is a core-kernel data structure used to describe DMA I/O operations from the point of view of both the CPU and the peripheral device. Over the years, the shortcomings of scatterlists have become more apparent, but there has not been a viable replacement on the horizon. During a memory-management session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Jason Gunthorpe described a possible alternative, known alternatively as "phyr", "physr", or "rlist", that might improve on scatterlists for at least some use cases.The buffer for an I/O operation is usually described by an address and a length. In the virtual address space where the operation is requested, that buffer is usually seen as being contiguous. Things may look quite different from a peripheral device's point of view; that seemingly contiguous buffer may be scattered randomly in the physical address space. The virtual address used to locate that buffer has no meaning to the device — and the CPU's physical addresses might not either, especially if there is an I/O memory-management unit (IOMMU) sitting in the middle. Instead, the device works with DMA addresses that may exist in their own space.
These differing points of view mean that I/O operations must be described in two different ways; that is the role of struct scatterlist. It contains an (address, offset, length) tuple as seen by the CPU, where the address is actually a pointer to the page structure for the page holding the buffer; a scatterlist is an array of these structures. The DMA-mapping layer can use that information to augment that array with addresses and lengths visible to the target device. If an IOMMU is used to coalesce a scattered set of pages and make them look contiguous to the device, the set of tuples seen by the device may be shorter than that provided by the CPU.
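For reference, here is an abridged version of the structure being discussed, as it appears in include/linux/scatterlist.h (a few configuration-dependent fields are left out). The first three fields hold the CPU-side description; the DMA-mapping layer fills in the device-side view.

```c
struct scatterlist {
	unsigned long	page_link;	/* struct page pointer plus chain/end flag bits */
	unsigned int	offset;		/* offset of the data within that page */
	unsigned int	length;		/* length of the segment, in bytes */
	dma_addr_t	dma_address;	/* the address as seen by the device */
#ifdef CONFIG_NEED_SG_DMA_LENGTH
	unsigned int	dma_length;	/* device-side length, when it can differ */
#endif
};
```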
Gunthorpe started his session by listing a number of problems experienced
by developers when using the kernel's DMA-mapping layer, many of which are
tied to the scatterlist representation. It would be useful if functions
like pin_user_pages() could return folios, but the current API
works on individual pages, so I/O operations can end up splitting up huge
pages. Since scatterlists work with page structures, they cannot
represent memory that lacks such structures, as is often the case with
memory installed on the devices themselves; this is a problem for P2PDMA operations, among others. The block
layer would be faster, he said, if I/O requests stored in BIO structures
did not need to be converted to scatterlists on their way through. RDMA
users want to be able to pin large amounts of memory (he said 100GB) and
perform I/O on it "forever"; storing such allocations in a scatterlist is a
useless waste of memory.
Matthew Wilcox added another reason not to like them: simple cleanliness. Gunthorpe agreed that everybody hated scatterlists; they are found everywhere in the kernel, and "abused and misused everywhere". The structure is hopelessly tied to struct page. There is no hope, he said, of doing something better with it.
Gunthorpe's approach is to improve the DMA API to provide better operations; an initial implementation can be found in his GitHub repository. It involves creating a "range CPU" iterator that would operate over intervals of (physical) CPU memory; it could be used to create a DMA buffer that would serve as a handle for peer-to-peer memory and which could be stuffed into an IOMMU. There would be an equivalent "range DMA" iterator to iterate over DMA addresses, and various options to map between the two. A new pinning API would take a range CPU iterator as an argument. There would be a number of storage options for the iterator, including scatterlists, BIOs, page structures, and more. Users could then iterate over these ranges without worrying about how they are represented.
He started into the project thinking that "this doesn't sound too bad", but got a quick education, he said. There are 23 separate sets of DMA operations in the kernel, he said, many of which are for "weird old IOMMUs" like GARTs. He really doesn't want to touch that code. So, instead, he is working on a performant version of a new set of DMA operations for current architectures without trying to support the older ones.
Then, he said, there is the perennial issue of the get_user_pages() family of functions, which are used in many performance-critical places in the kernel. Getting these functions to return data beyond the page structures they handle now will be costly; he wondered if there would be any appetite for slowing down get_user_pages() for this improvement. Dan Williams asked what kind of added output was being discussed here; Gunthorpe said that the functions would return a set of folios.
Wilcox said that there are two types of users of these functions, some of which are performance critical and some of which are not. The former users can continue to use get_user_pages(), while the others could call something like get_user_range() instead and get the extra information. Gunthorpe said that would involve duplicating much of the get_user_pages() code, when there are already a couple of implementations in the kernel. John Hubbard suggested creating the new version of the interface, then opportunistically factoring pieces out as it makes sense.
The session began to wind down with a seeming consensus that this work is on the right track. Williams said that, if it turns out to be useful, it would eventually be necessary to rewrite all of the existing scatterlist users, but that idea received some pushback. Gunthorpe said it would be great if everybody used the new API, but getting there would be painful work that is not likely to happen. Wilcox agreed that existing scatterlist users should mostly be left alone; they can be converted at leisure later. Gunthorpe, though, repeated that a complete conversion was not likely to ever happen.
Fighting the zombie-memcg invasion
Memory control groups (or "memcgs") allow an administrator to manage the memory resources given to the processes running on a system. Often, though, memcgs seem to have memory-use problems of their own, and that has made them into a recurring Linux Storage, Filesystem, and Memory-Management Summit topic since at least 2019. The topic returned at the 2023 event with a focus on the handling of shared, anonymous memory. The quirks associated with this memory type, it seems, can subject systems to an unpleasant sort of zombie invasion; a session in the memory-management track led by T.J. Mercier, Yosry Ahmed, and Chris Li discussed possible solutions.
Mercier started the session by describing how the zombie problem comes
about. It all starts when a process running within a memcg allocates some
anonymous memory; that memory is charged to the group and all is well. The
process then allocates some shared memory, which is also duly charged. If
that memory is subsequently shared with processes in a different memcg, though,
things start to become a little strange; only the original owner will be
charged for all that memory. The other groups can use it for free. If all
of the processes in that first group go away, the memcg itself would
normally be deleted, but that can't happen; it is still responsible for
that shared memory, even though all of the users of that memory are outside
of the group.
That memcg has become a zombie, destined to haunt the system for,
potentially, a long time. In some settings, thousands of them can
accumulate, creating a true zombie horde that consumes a significant amount
of kernel ~~brains~~ memory. It also slows down operations,
including memory reclaim, that iterate over the memcgs in the system. This
is, in other words, a problem worth fixing.
Mercier started by going through some "non-fixes" that have been shown not to work. Forcing manual reclaim with the memory.reclaim memcg knob does not work if the memory is not actually reclaimable, which is the case with shared memory. Instead, it can push pages out to swap, making the problem even worse; there will be no way to get rid of the zombie memcg without first swapping any swapped-out pages back in.
Another non-fix is to try reparenting the charged memory to the zombie memcg's parent group. But that group does not actually own the memory, so this action just has the effect of hiding the zombie memory there. Reparenting can cause memory to end up in the root control group and, in general, complicates memory management.
The fundamental problem, he concluded, was the fact that any given page structure can only have one memcg owner. That leaves no way to account for shared memory and leads to the zombie problem. A potential fix might be to move the charge for that memory to one of the other control groups using it; that would lead to better accounting overall, but finding that other control group is harder than it seems. A longer-term fix, Mercier said, would be to develop a first-class way to associate shared memory with multiple groups. Matthew Wilcox suggested charging the first group to access the shared memory once the owning group turns into a zombie, which led naturally to the next part of the talk.
Ahmed then took over to talk about the option of re-charging a zombie
memcg's shared-memory pages to one of the other users. The task, he said,
involves iterating through the pages charged to the zombie group and
looking at the type of each. Kernel pages will already have been
reparented, he said, and do not need further attention. There are pages on
the memcg's least-recently-used (LRU) list that may or may not be mapped,
and page-cache pages that are not mapped. There are a few options for
dealing with these pages.
For example, these pages can simply be evicted from memory entirely; this will work for page-cache pages, which will eventually be faulted back in and charged to the new user. This action is intrusive, though, and will slow access to heavily used pages; it also does not work with pinned pages.
Alternatively, they can be re-charged to another group that has the page mapped; this should identify the right owner. It has the potential to charge a memcg for memory it used "hours ago", though, and could push the group into an out-of-memory situation. If multiple groups have the page mapped, the kernel would have to choose one somehow.
Finally, a two-step "deferred re-charge" approach could be taken, where (as Wilcox had suggested) the page is charged to the group that accesses it next. The pages could be removed from the zombie group's balance sheet, perhaps charged to the parent until the right group is found. This approach would be complicated to implement and would add extra work to some hot paths.
Ahmed sketched out an algorithm that might work, executed from a worker thread launched when the memcg enters the zombie state. This worker would iterate through the group's LRU list and take the appropriate action with each page. Unmapped, file-backed pages would simply be evicted, while unmapped anonymous pages would go through the deferred re-charge process. Pages that are mapped, instead, would be charged to a mapping group, either directly or using the deferred method. Pages that are swapped are a separate problem; they, too, will keep references to the zombie group. The answer there would be to walk through the swap cache and reparent the pages as needed.
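In rough pseudocode, that worker might look something like the following; only folio_mapped() and folio_test_anon() are real kernel functions, while the iterator and re-charge helpers are invented names for the steps Ahmed described:

    /* Sketch only; the helpers below do not exist in the kernel. */
    static void zombie_memcg_cleanup(struct mem_cgroup *zombie)
    {
            struct folio *folio;

            for_each_zombie_lru_folio(zombie, folio) {      /* invented iterator */
                    if (!folio_mapped(folio)) {
                            if (folio_test_anon(folio))
                                    defer_recharge(folio);  /* charge the next user */
                            else
                                    evict_folio(folio);     /* unmapped file pages */
                    } else {
                            /* charge a memcg that maps it, or defer */
                            recharge_to_mapper(folio);
                    }
            }

            /* swapped-out pages also pin the zombie group */
            reparent_swap_cache_folios(zombie);
    }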
That, he said, might be an effective short-term solution to the problem. Michal Hocko responded that the kernel used to re-charge pages in this situation, but had to back away from that approach. It is heavily based on the idea that memcgs do not change over time, he said. The sharing of memory across memcgs is not a great idea in the first place; is there any way that it could be just avoided? Another attendee said that the kernel used to have re-charging, but that reparenting is different and could perhaps be more easily implemented. Telling users to not share memory between memcgs is not a good answer, he said; users are trying to save memory with more sharing, not less. Rather than avoid the problem, it would be better to just try solutions that can be implemented easily.
Ahmed added that there are a lot of ways to create sharing, some of which
can be surprising. Writing a byte to a tmpfs file, for example, will nail
down a page that will be stuck until the file is removed or truncated.
Hocko said that maybe the kernel should just refuse to remove a memcg if pages remain charged to it, but Ahmed said that would just create more trouble for users; new memcgs would simply be created until the machine fills up. It is difficult to see which shared resources are holding a memcg in place, so users do not have an easy way to fix the problem.
John Hubbard said that what was needed was a separate parent for shared resources that would track all of the groups using those resources. That was Li's cue to talk briefly (there was little time left at this point) about a possible long-term solution in the form of an approach for tracking those sharing relationships. The idea, which was "not fully hashed out", involved creating a separate shared-memory controller that would own memory that is shared between memcgs. Through a complicated mechanism, it would track that shared use; the result would be no movement of charges over time, and no zombie control groups.
There was no time to discuss this idea. Chances are that some time will be
found next year when this unkillable topic shows up at the 2024 conference.
Toward a swap abstraction layer
The kernel's swapping code tends to not get much love. Users try to avoid it, and developers often find better things to do with their time than trying to improve it. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, though, Yosry Ahmed dedicated a memory-management-track session to the problem of the swap layer and what might be done to make it better.
Ahmed started with the subject of zswap, which "swaps" pages by compressing them and storing them more efficiently in RAM. This action happens just before actual swapping would occur, and the rest of the memory-management system doesn't know anything about it. One of the results of that is that a system using zswap must provide a swap file, even if no pages are ever stored there. Every page kept in zswap must also have a slot reserved in the swap file, which wastes space in that file and takes extra CPU time to manage. The reclaim code is unaware of zswap and has no idea where the pages it evicts will go.
So zswap is not perfectly integrated with the rest of the system, but it is still heavily used, including on Google's production systems. He suspects that Chrome OS may be using it, and that a number of other companies running large fleets also have a place for it. So it would be good if zswap were to work more efficiently. Google has hacked together a short-term fix in the form of a simple indirection layer that indexes both the swap file and the zswap area. The rest of the memory-management subsystem just sees a single swap index and doesn't know whether it corresponds to an actual swap entry or a zswap entry. He is "not fond of it", but it works.
A better solution for the medium-term, he said, could be a proper indirection layer. There would be an operations vector (a structure full of function pointers) for each swap implementation, and an XArray to index the whole thing. As well as addressing the duplication problem, this implementation would make it possible to optimize the swapoff() operation.
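A sketch of what that indirection layer might look like; the operations structure and its callbacks are invented for illustration, and only the XArray machinery is real kernel infrastructure:

    #include <linux/xarray.h>

    /* Hypothetical per-backend operations vector. */
    struct swap_ops {
            int  (*store)(struct folio *folio, pgoff_t *slot);
            int  (*load)(pgoff_t slot, struct folio *folio);
            void (*invalidate)(pgoff_t slot);
    };

    /* A single XArray would map the swap index seen by the rest of the
     * memory-management code to whichever backend (zswap or a real swap
     * device) actually holds the page. */
    static DEFINE_XARRAY(swap_indirection);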
An even better, long-term idea, is something he described as "swap descriptors" that would track swapping information independently from the backend used to store any given swapped page. This would be a much cleaner abstraction, he said, though it would add a bit of memory overhead for the additional tracking. There would also have to be another index added to support cluster readahead (which reads pages in larger chunks for performance) properly. This would be a bigger job overall.
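The swap descriptor would go one step further, giving every swapped-out page a small object of roughly this shape (again purely illustrative, with invented field names):

    /* Hypothetical swap descriptor: per-page swap state kept independent
     * of any particular backend; this per-page object is the source of
     * the extra memory overhead discussed below. */
    struct swap_desc {
            void *backend;            /* zswap pool, swap device, ... */
            pgoff_t slot;             /* location within that backend */
            unsigned int swap_count;  /* outstanding references */
    };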
Matthew Wilcox said that cluster readahead is an optimization that was necessary for spinning media, but that it has no value for solid-state storage. Rather than worrying about cluster readahead, the memory-management subsystem should just be swapping entire folios and keeping the data together that way. Michal Hocko agreed, saying that users with slow disks have learned to avoid swap at all costs; cluster readahead is an optimization for a non-existent set of users.
Ahmed said he had worried that keeping cluster readahead would be declared a requirement, but nobody in the room seemed to feel that way. His next question was whether a memory overhead of 0.6-0.8% for swap descriptors would be too much; Wilcox responded that it was affordable, but said that care had to be taken to avoid allocating memory in the reclaim path.
Chris Li briefly presented an alternative, which he called a "virtual filesystem interface for swap files", that would abstract out all of the operations for each swap type. Each swap file would have its own swap count and slot management. Swapping, when implemented this way, could be organized in layers without an additional indirection layer, and thus without the memory overhead such a layer would bring. The Android system, he said, aims to keep about 20% of its pages in swap; its developers might not be pleased with the extra overhead.
David Hildenbrand, at the end of the session, said that people have been complaining about the swap layer for years. He would like to see that code rewritten to work more like the virtual filesystem layer does, so he has a strong preference for Li's alternative if it takes the kernel in that direction. Dan Williams advised Li that "if you come to LSFMM and show people a pony, they will want the pony". Hildenbrand agreed and, with a laugh, said that Li's proposal was accepted.
Swapping in Linux may thus finally be on a path toward improvement. It's worth keeping in mind, though, that the code implementing this idea does not exist yet. So the kernel will probably not be swapping out its swap interface in the near future.
A slab allocator (removal) update
The kernel developers try hard to avoid duplicating functionality in the kernel, which is enough of a challenge to maintain as it is. So it has often seemed out of character for the kernel to support three different slab allocators (called SLAB, SLOB, and SLUB), all of which handle the management of small memory allocations in similar ways. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, slab maintainer Vlastimil Babka updated the group on progress toward the goal of reducing the number of slab allocators in the kernel and gave an overview of what to expect in that area.
Babka started by saying that his original proposal for the session mentioned the SLOB allocator in the title. This allocator, which was optimized for memory-limited systems, has been on the chopping block for a while now. That removal, he announced to applause, happened during the 6.4 merge window. There is a set of configuration options that can be selected to make the SLUB allocator more suitable for small-memory systems. It is now possible to call kfree() on all slab-allocated objects — something that SLOB never supported.
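As a minimal illustration of that last point (a hypothetical module-initialization function, with error handling kept to a minimum):

    /* Freeing a cache-allocated object with kfree(), something SLOB never
     * supported; with SLOB gone, this works with all remaining allocators. */
    struct foo {
            int value;
    };

    static int __init foo_demo_init(void)
    {
            struct kmem_cache *cache;
            struct foo *obj;

            cache = kmem_cache_create("foo_demo", sizeof(struct foo), 0, 0, NULL);
            if (!cache)
                    return -ENOMEM;

            obj = kmem_cache_alloc(cache, GFP_KERNEL);
            if (obj)
                    kfree(obj);     /* not kmem_cache_free() */

            kmem_cache_destroy(cache);
            return 0;
    }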
The next step, he said, might be to remove SLAB. That would solve one of his biggest problems: he never figured out how to pronounce SLAB and SLUB so that others could hear the difference. SLAB contains 4,000 lines of code, he said, not all of which is regularly or well tested. He has found parts of the SLAB allocator that have been broken for years. Keeping SLAB around means maintaining a common-code layer used also by SLUB, which complicates maintenance. It also requires reimplementing features; both allocators have implementations of memory control groups, for example, while realtime preemption is only supported by SLUB.
This is not the first time somebody has suggested removing SLAB; he found
at least three other times that the idea has come up. Each time the idea is
raised, somebody complains about performance regressions when using SLUB.
He wanted to know if the same objections would be raised this time.
David Rientjes is one of the developers who has objected to the removal of SLAB in the past. Speaking from a Google perspective, he said that things have come a long way since then. Per-cpu partial slabs help a lot. He has been looking at the benchmark results, and has concluded that, at this point, they can go either way depending on the workload. He did complain that SLUB can have a higher memory overhead; partial slabs make it better, and further progress can be made on that front. At this point, he concluded, he would not object to removing SLAB.
Michal Hocko said that SUSE has been using SLUB for some time; it works better in some cases, and worse in others. The biggest reason to make the change, he said, was that SLUB makes debugging problems easier; he suggested just removing SLAB and fixing any remaining problems afterward. Matthew Wilcox said that, in the past, SLUB performed worse with certain database benchmarks, but that problem has since gone away.
Another attendee asked about SLUB's extra memory overhead; is it something structural, or something that can be chipped away at? Babka answered that he was surprised to hear objections about memory overhead. SLUB, it seems, uses about 30% more memory than SLAB to keep track of the memory it manages; he asked whether that translated to a significant amount of memory when viewed as an absolute number.
Much of SLUB's additional overhead, he said, could be seen as a structural problem; SLUB gets its performance by using a lot of per-CPU caches. When Christoph Lameter introduced SLUB in 2007, one of his justifications for the addition of another allocator was that SLAB used too much memory for caches. But, Babka said, things have shifted over time. Addressing this memory use would require coming up with another way to get similar performance.
Pasha Tatashin asked whether per-CPU caching still makes sense in systems with hundreds of cores. Babka answered that some per-CPU caching is needed for scalability, but that there might be ways to make it more effective.
Concerns about memory usage notwithstanding, the conclusion from the session was that nobody objects to the removal of the SLAB allocator at this point; Babka plans to post a proposal to the mailing lists and see what kind of reaction it gets. Anybody who objects, he said, should be prepared to show a use case or benchmark that regresses with SLUB so that any remaining problems can be addressed. But this removal should not be held back for the sake of a microbenchmark; if there are concrete problems, the community can discuss how to fix them.
Once that task is complete, he said, it's time to think about what is next. API improvements will become easier once there is only one allocator to change. One idea he had was opt-in, per-CPU array caching of objects, which, he said, could improve performance while simultaneously reducing overhead. The ability to allocate in non-maskable interrupt (NMI) context using a per-CPU cache was another idea; there would still be no guarantee that an allocation would succeed, though. That would allow the removal of a BPF-specific allocator.
Perhaps, he said, the allocator could offer guaranteed allocations with some sort of upper bound, much like mempools do now. That could be useful for tasks like the allocation of maple-tree nodes. More generally, he concluded, he would like to find ways to end the reinvention of memory-management functionality outside of the memory-management layer. There are a lot of things being done now that would be better handled in the core memory-management code.
Wilcox raised one problem he would like to see solved, which he called "dcache poisoning". On a system with a lot of memory and little memory pressure, the directory entry (dentry) cache can grow without bound. This can be an especially big problem with workloads creating a lot of negative dentries. The kernel will only run the shrinker when there is memory pressure; by the time that happens, cleaning out the dentry cache can take a long time. Andrew Morton described this as a "dentry cache policy decision", but Babka said that the allocator might be a useful part of a solution to this problem.
Babka closed the session by thanking the attendees and asking them to wish him luck as he proceeds with the SLAB removal.
Reliable user-space stack traces with SFrame
A complete stack trace is needed for a number of debugging and optimization tasks, but getting such traces reliably can be surprisingly challenging. At the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Steve Rostedt and Indu Bhagat described a mechanism called SFrame that enables the creation of reliable user-space stack traces in the kernel without the memory and run-time overhead of some other solutions.
Rostedt began by saying that obtaining a full stack trace of a user-space
process is useful for a number of purposes. It is needed for accurate
profiling, so both perf and ftrace make use of stack traces. BPF programs,
too, can benefit from a picture of the state of the call stack.
The traditional way to reliably obtain stack frames is to build the program in question with frame pointers. The frame pointer is simply a CPU register that is dedicated to containing the base address of the current stack frame. That frame will include a saved copy of the previous frame pointer, indicating where the previous frame began. The kernel (or any other program) can thus follow the chain of frame pointers to locate each frame on the stack. If frame pointers are not present, the kernel's perf subsystem must instead copy a large chunk of the stack at each event for later postprocessing with the DWARF unwinder; that is a costly thing to do.
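As a user-space illustration of how such a frame-pointer walk works (x86-64, built with -O0 -fno-omit-frame-pointer; a real unwinder must validate every pointer before following it):

    #include <stdio.h>
    #include <stdint.h>

    /* Each frame's base pointer points at a two-word record on the stack:
     * the saved previous base pointer, then the return address. */
    static void backtrace_fp(void)
    {
            uintptr_t *fp = __builtin_frame_address(0);

            while (fp) {
                    uintptr_t *prev = (uintptr_t *)fp[0]; /* saved frame pointer */
                    uintptr_t ret = fp[1];                /* return address */

                    printf("return address: %#lx\n", (unsigned long)ret);
                    if (prev <= fp)         /* stacks grow down; stop on nonsense */
                            break;
                    fp = prev;
            }
    }

    static void __attribute__((noinline)) leaf(void)   { backtrace_fp(); }
    static void __attribute__((noinline)) middle(void) { leaf(); }

    int main(void)
    {
            middle();
            return 0;
    }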
But frame pointers are not free either. Managing the frame pointer requires some setup code to run at the entry to every function, and dedicating a register to it makes a scarce CPU register unavailable for other uses, slowing program execution. As LWN has previously covered, building user space with frame pointers can lead to measurable performance regressions, which can make their use controversial.
The kernel, Rostedt continued, has a stack unwinder called ORC that is much simpler than DWARF. It was added in the 4.14 release to support live patching — another application that needs reliable stack traces. The kernel's objtool utility creates the ORC data at build time and adds two tables to a section in the kernel executable: orc_unwind to hold stack-frame information, and orc_unwind_ip to map instruction pointer values to the appropriate unwind entry.
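For reference, an x86 ORC entry is a compact record along these lines (simplified; the field widths have varied across kernel versions):

    /* Simplified form of the kernel's struct orc_entry: given an
     * instruction pointer, the unwinder searches orc_unwind_ip for the
     * matching entry, then uses these offsets to find the previous
     * frame's stack pointer and return address. */
    struct orc_entry {
            s16 sp_offset;          /* where the previous stack pointer lives */
            s16 bp_offset;          /* where the previous base pointer lives */
            unsigned int sp_reg:4;  /* register sp_offset is relative to */
            unsigned int bp_reg:4;
            unsigned int type:3;    /* frame type */
    } __packed;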
SFrame is based on ORC; it provides the same mechanism, but for user space rather than the kernel. When an executable is built with SFrame data, the kernel can create full stack traces without the need for frame pointers. There is always a cost, of course; in this case, developers are sacrificing a bit of disk space (to hold the SFrame tables) for speed. This data is read, if needed, in the kernel's ptrace() path, so it doesn't affect execution when it is not needed. Some additional effort was required to handle user-space complications; for example, since binaries are relocatable, there must be a mechanism to apply the correct offsets to the SFrame data.
Rostedt provided an overview of how SFrame support would work in the kernel. The generation of a perf event starts with a non-maskable interrupt (NMI), which ends up in the perf code. If a stack trace is called for, then the kernel will make an attempt to read the call stack; if that encounters a page fault, then there will be no stack trace for this event. He would like to change that code to look for the SFrame data instead. The NMI handler would set a flag indicating that there is work to be done before returning to user space; the ptrace() path would see that flag and reconstruct the stack trace in user context. Among other things, that would make it possible to recover the stack even if page faults occur while reading it.
This approach would require some changes to the user-space perf tool as well. The initial perf event, generated at NMI time, will not include the call stack (which will not be obtained until later), so it will, instead, have a bit set saying "a stack trace is coming". There may be several intervening events generated before that stack trace finally shows up in the ring buffer. Joel Fernandes asked whether the kernel could just reserve space in the ring buffer at NMI time, then fill it in later. Rostedt answered that the ring may end up with multiple events all with the same stack trace; reserving that space for each would end up wasting space.
Rostedt concluded his part by saying that the stack is unlikely to be
swapped out, so generating the trace will not normally create I/O to fault
pages back in. That said, generating the trace will need to bring in some
other data, since the SFrame tables are stored in the executable on disk.
The SFrame data should only be mapped when it is actually used, so the
first use within a process will cause a brief stall while that mapping
takes place.
Bhagat (who has done much of the work to implement this functionality) said that there could perhaps be a problem with code in parts of the kernel that are written in assembly. The non-standard stack usage in that code may well confuse the unwinder. It remains to be seen whether unwinding through those parts of the kernel is important, she said.
Another potential issue is that the SFrame data is stored unaligned on the disk; that can lead to unaligned memory accesses in the kernel. Avoiding that requires a certain amount of copying of data, "weird casts", and such. The alternative, forcing the data to be aligned, would bloat the format, though. There seemed to be agreement that storing the data unaligned is the best solution, and that there was no need to change it.
Other outstanding problems include the need to handle dlopen(), which maps executable text from another file into a range of the calling process's memory. This issue could perhaps be addressed by adding a system call to tell the kernel where the SFrame data for a given executable mapping can be found. Just-in-time compiled code is also a problem; when there is no backing file for a mapping, there is no SFrame data either.
As the session concluded, the sentiment in the room seemed to be that
SFrame would be a nice tool to have and that this work should continue.