LWN.net Weekly Edition for May 12, 2022
Welcome to the LWN.net Weekly Edition for May 12, 2022
This edition contains wall-to-wall coverage from the 2022 Linux Storage, Filesystem, Memory-Management, and BPF Summit, including:
- Dealing with negative dentries: how the kernel can manage the cache of file-name lookup failures better.
- Recent RCU changes: a quick overview of read-copy-update (RCU) followed by descriptions of some of the bigger changes that have gone into it over the last few years.
- How to cope with hardware-poisoned page-cache pages: What should the kernel do when memory errors corrupt a page in the page cache?
- Page pinning and filesystems: this year's discussion on how to solve the problems with get_user_pages().
- Ways to reclaim unused page-table pages: a lot of the memory used to hold page tables is not needed in that role and could be put to better uses.
- The ongoing search for mmap_lock scalability: the 2022 version of the perennial page-fault scalability topic, perhaps with some solutions on the horizon this time.
- Improving memory-management documentation: how to capture some of the "tribal knowledge" behind memory management and make it available to more developers.
- The state of memory-management development: Andrew Morton describes some changes to how memory-management patches are handled.
- Seeking an API for supervisor protection keys: protection keys can help to harden the kernel, but the best way of managing them is still unclear.
- Better tools for out-of-memory debugging: understanding out-of-memory problems is not easy; what can be done to improve the situation?
- Changing filesystem resize patterns: filesystems that are created small, but continually resized to be much larger, are causing some headaches.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Dealing with negative dentries
The problem of negative dentries accumulating in the dentry cache in an unbounded manner, as we looked at back in April, came up at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). Negative dentries reflect failed file-name lookups, which are then cached, saving an expensive operation if the file name in question is looked up again. There is no mechanism to proactively prune back those cache entries, however, so the cache keeps growing until memory pressure finally causes the system to forcibly evict some of them, which can make the system unresponsive for a long time or even cause a soft lockup.
The problem
Stephen Brennan led the session; he had posted a patch set, with a new approach to the problem, during the discussion in March. The problem that he is seeing is on big servers with lots of memory, where part of the workload is looking up unique IDs in some directory a few times per second—which goes on for months or years. Each lookup creates a negative dentry in the cache, resulting in a cache full of these entries that have never been used after they were created.
Since there is no memory pressure, because the system has lots more memory than is needed by the workload, there is no cleanup. That can lead to soft lockups when iterating through the children of a dentry because of the amount of time it takes to do so. It also leads to slab fragmentation; if the system has 500 million negative dentries and then a directory containing some of them is deleted, there will be an enormous number of partially filled slab pages.
His goal is to have some kind of generic system for managing all of the various least-recently-used (LRU) lists in the page-cache and filesystem code. Currently, negative dentries are not moved to the head of the LRU when they are referenced; instead, they are simply marked and left in place until the shrinker runs. Shrinkers are the mechanism that the memory-management subsystem uses to request cache entries be freed. They only run when there is memory pressure, but at that point a dentry might have been marked as referenced a year ago, so that dentry is not useful anymore—if it ever was. When that happens, the shrinkers have to do a lot of work just to move entries to the end of the list before they can even be reclaimed.
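The referenced-and-left-in-place behavior described above is essentially a second-chance scheme: a shrinker pass gives recently referenced entries a reprieve and reclaims the rest. As a rough illustration, here is a user-space C toy model of a single shrinker pass; the `struct entry` and `shrink_scan()` names are invented for this sketch and this is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an LRU-list entry with a "referenced" bit. */
struct entry {
    bool referenced;
    bool evicted;
};

/*
 * One shrinker pass over the list: entries that were referenced since
 * the last pass get a second chance (the bit is cleared and the entry
 * is kept); the rest are reclaimed.  Returns the number evicted.
 * Note that every entry must be visited, which is the cost Brennan
 * describes when the list holds millions of used-once entries.
 */
static int shrink_scan(struct entry *list, int n)
{
    int freed = 0;

    for (int i = 0; i < n; i++) {
        if (list[i].evicted)
            continue;
        if (list[i].referenced) {
            list[i].referenced = false;   /* second chance */
        } else {
            list[i].evicted = true;       /* reclaim */
            freed++;
        }
    }
    return freed;
}
```

A list dominated by never-referenced entries reclaims almost everything on the first pass, but only after scanning the whole list.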
In the mailing list discussion, Dave Chinner wanted to see a generic solution for various caches that all use the list_lru mechanism. Brennan (and others) have been thinking about that. For example, every time something gets added to these caches, the list could be rotated to move those with fewer references to the back of the list. At that point, perhaps some aging rules could be applied to negative dentries. Another thing that could be done is to track the number of entries in these caches, and their types, so that decisions could be made based on the numbers of dentries, negative dentries, or some calculation using those numbers.
James Bottomley said that the problem is that the number of negative dentries is completely unbounded, so he wondered if the focus of the work should be on dealing with that. The number of positive dentries is bounded by the number of files in a directory, but there is a nearly infinite number of file names that are not present. Brennan agreed, noting that on the systems he has looked at, 99% of the cached dentries are of the negative variety, so that is the source of the underlying problem. But he thinks the fix can be made using the existing lru_list and shrinker frameworks.
While Oracle's main concern is with negative dentries, Matthew Wilcox said, there are other caches that have similar problems. For example, under certain workloads, the inode cache can similarly end up with many entries that will never be used again. That cache is bounded by the number of files in the filesystem, but it is still a "used once" problem as with negative dentries in the problem cases.
Not really LRU
This is a "classic LRU problem", Kent Overstreet said. Brennan agreed with that, noting that these LRUs are not really being treated as "least recently used". There are, instead, used-once entries scattered throughout the list. Under memory pressure, those entries are the ones that should be cleaned up; if they were organized better, so that all of those that were not recently used were together, the cost of scanning the whole list for them could be reduced. But there is still the problem of workloads that create "stupid amounts" of negative dentries that will never be used again. There is no good reason to cache those at all.
Josef Bacik said that he had solved a similar problem around five years ago by not marking entries as referenced until they are used for the second time. Prior to that, the list was being scanned to clear the referenced bits for entries that had been referenced once, but that scan took a long time. He chose to wait for the second use, instead of changing the LRU to be a real LRU, because he found that constantly shuffling things to the back of the list was "not excellent" for the workload he was looking at.
Brennan said that it is not excellent for a lot of workloads because it leads to a lot of contention on the spinlocks. He loves the idea of waiting until the entry is used again, but it leads to another problem: "how much is too much?" Memory pressure provides a good signal to indicate that entries need to be pruned, but the soft lockups he mentioned can occur starting around 100 million negative dentries in a single directory. He would rather not use some kind of "magic threshold", but there needs to be some mechanism to start shrinking the one-use items, at least.
Other possibilities
Bottomley suggested that the bounded positive dentry cache size for a directory be the limit on how large the negative-dentry cache could grow. Brennan thought that was an interesting idea. Ted Ts'o pointed out that negative dentries have no references to other data structures in the system, so they are easy to get rid of, or simply to move to another page. He wondered if simply getting rid of the negative dentries blocking the freeing of a page might be a reasonable tradeoff, even if those entries might actually be used again. There may be good reasons to give negative dentries special treatment, he said.
Bacik said that the negative-dentry problem can be solved in a fairly straightforward way, but that if Brennan wanted to solve the more general problem, there was a lot of work that would be needed. There was some discussion of using a mechanism like the page cache "refault" tracking, that was added by Johannes Weiner quite a ways back; it would allow the system to know that something it had evicted returned to the cache, so it should probably stick around longer.
Overstreet wondered if there was a way to detect workloads that are scanning and creating lots of negative dentries, and then stop the creation of those entries. That could be tracked per process ID, and limits could be placed on how many negative dentries could be created. Michal Hocko said it sounded like a similar problem to that of throttling the creation of dirty pages: when a process dirties pages too quickly, it is throttled to slow the rate down.
Brennan said that the problem is not necessarily that the entries are created at a high rate of hundreds or thousands per second; it could be a slow trickle of them over a long period of time. It can add up to hundreds of millions over, say, a year's time. There are some workloads that simply do lookups, which create a negative dentry when the file is not found, but others may be creating lots of temporary files and then deleting them, which also leaves negative dentries behind.
John Hubbard said that part of the problem is that Linux wants to use all of the available memory if possible, because it generally leads to better performance, but there is a cost to getting that memory back when it is needed elsewhere. Making any kind of improvement on the reclaim side would really help, Brennan said.
As time for the session wound down, Bacik asked Brennan what it is he would like the filesystems and memory-management developers to do to help. Brennan said that there were a lot of good ideas raised in the session and he hopes that those who care about the specific problem for negative dentries and the more general problem of finding better ways to reclaim memory from these caches would provide more eyeballs on the patches he would be posting.
Weiner asked if a way to sidestep the problem would be to put the negative dentries on a separate shrinker list. There are some shrinkers that are called more aggressively and there is no need to protect specific entries; a more wholesale freeing would probably work just fine. Brennan agreed that might be a way forward. The possibility of making changes that were specific to negative dentries was his favorite part of the discussion, he said.
Recent RCU changes
In a combined filesystem and memory-management session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Paul McKenney gave an update on the changes to the read-copy-update (RCU) subsystem that had been made over the last several years. He started with a quick overview of what RCU is and why it exists at all. He did not go into any real depth, though, since many of the topics could take a 90-minute session of their own, he said, but he did provide some descriptions of the work that has gone into RCU recently.
RCU is still under active development, McKenney said, which came as a big surprise to an academic person he was talking to at a conference a few years back. He did not have the heart to tell him that locking and atomic operations were also under active development, he said to laughter. "But here we are."
RCU review
The overall problem is that global agreement in an operating system like Linux is expensive. While it would have shocked his 50-years-ago self, it turns out that the speed of light is too slow and atoms are too big; those things are causing massive problems in concurrent software these days. The way that is handled by RCU is to synchronize both in time and in space, he said. The core RCU API is split into two sets of temporal and spatial calls. RCU is a way to allow readers of a data structure to continue working with it while an updater makes a change; RCU is most often used with linked lists of various sorts.
He showed a slide with four quadrants that described how RCU works on the temporal side. The basic idea is that readers call rcu_read_lock() before they read the data and rcu_read_unlock() after they are done. That allows an updater to remove the locked data, so long as it calls synchronize_rcu() (or receives the callback it set up with call_rcu()) before it actually frees it. The old data may still be in use by a reader, but RCU guarantees that all readers have unlocked it before it will return from synchronize_rcu(). The four quadrants show the possible orderings of a reader's lock/unlock and an updater's removal of the data, synchronize_rcu(), and the freeing of the old data. The one that could lead to serious misbehavior, where the removal is done after the lock and the freeing of the old memory is done before the unlock, is prevented by RCU. If that actually happens, it is a bug in RCU, he said.
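The temporal API he described is typically used along these lines; this is an illustrative kernel-style sketch (the `struct foo`, `gp`, `read_foo()`, and `update_foo()` names are invented, and the code will not build outside the kernel):

```c
struct foo {
	int data;
	struct rcu_head rcu;	/* needed only if call_rcu() is used */
};

struct foo __rcu *gp;		/* RCU-protected global pointer */

/* Reader: may run concurrently with updates; sees old or new data. */
int read_foo(void)
{
	int val;

	rcu_read_lock();
	val = rcu_dereference(gp)->data;
	rcu_read_unlock();
	return val;
}

/* Updater: publish the new version, wait out readers, free the old. */
void update_foo(struct foo *newp)
{
	struct foo *old;

	old = rcu_replace_pointer(gp, newp, true);
	synchronize_rcu();	/* all pre-existing readers have unlocked */
	kfree(old);
}
```

The ordering guarantee from the four-quadrant slide is embodied in the `synchronize_rcu()` call: `kfree()` cannot run until every reader that might have seen the old pointer has called `rcu_read_unlock()`.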
He also looked at the global agreement cost, contrasting the use of a reader-writer lock with that of RCU. When an updater wants to change the data using a reader-writer lock, there is a period of time before that lock has propagated to each of the readers, then the update can be done, but there is another lag waiting for that update to reach all of the readers. During all of that time, the readers are stalled waiting to be able to read again. Using RCU, that same time period is no longer wasted. Readers can continue, possibly with an outdated version of the data, while updaters just need to wait until the end of the grace period to ensure that all subsequent readers will see the new data.
He showed some graphs to explain why the complexity of RCU is tolerated. RCU performs better than a reader-writer lock and it scales a lot better as the number of threads goes up. In addition, the shorter the critical section and the more CPUs there are, the better RCU looks, he said. RCU can also prevent some deadlocks.
More information about RCU can be found in the LWN kernel index entry for RCU. A good starting point is "The RCU API, 2019 edition".
Changes
That ended his "full-speed review of RCU". He put up a list of eight changes to RCU that he wanted to talk about; he would guess at which of those were the most important to the audience and dig into them a bit deeper. He began with flavor consolidation.
One day back in 2018, he got an email from Linus Torvalds, with a CC to security@kernel.org, that described an exploit in a use case of RCU. The problem was that the readers were using the *_sched() flavor for locking, but the updater was using synchronize_rcu() (instead of synchronize_sched()). In Linux 4.19 and earlier, the flavors needed to be matched up. The resulting bug was somehow used to cause a use-after-free, leading to the exploit.
Torvalds asked if there was a way to deal with this problem; the way to do so is to make synchronize_rcu() work for all of the flavors. McKenney said that it took "about a year of my life to make it happen", but it solves the problem except for one little thing: if you need to backport something to 4.19 or earlier. Amir Goldstein asked: "does that mean we have a booby trap now?" McKenney agreed that it was, but that there is a "'get out of jail free' card" with synchronize_rcu_mult(); it can be called in those earlier kernels and gets passed the various flavors of call_rcu() that are being used. It will chain calls to each of those and wait for all of them before returning, which emulates the more recent version of synchronize_rcu() at the cost of some additional latency.
In 5.4, Joel Fernandez added lockdep support to list_for_each_entry_rcu() (and the hlist_* variant). Those calls can take an optional lockdep expression, which removed the need for more variants of those calls in the API.
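The lockdep expression lets a caller that holds a lock, rather than being in an RCU read-side critical section, document that fact inline; in this kernel-style sketch the `pos`, `my_list`, `node`, `my_lock`, and `process()` names are invented:

```c
/*
 * The caller holds my_lock instead of rcu_read_lock(); the optional
 * fourth argument tells lockdep that this traversal is still safe.
 */
list_for_each_entry_rcu(pos, &my_list, node, lockdep_is_held(&my_lock))
	process(pos);
```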
Uladzislau Rezki added a single-argument version of kvfree_rcu() (kfree_rcu() is another name for it) in 5.9. Previously, kvfree_rcu() required two arguments: the object to be freed and the name of a field within the object containing an rcu_head structure. Now, adding the rcu_head to the object is optional; the name of the field is not passed in that case. If the object structures are small, the rcu_head may add more overhead than is desired. The two-argument version is still supported and, as always, never sleeps, but the new version can sleep if the system is out of memory. It is a tradeoff: you can use smaller structures, but it can sleep, he said.
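The tradeoff between the two forms can be sketched as follows (kernel-style code assuming the 5.9-era API as described; the structure and pointer names are hypothetical):

```c
struct big {
	long payload[64];
	struct rcu_head rcu;	/* embedded head for the two-argument form */
};

struct small {
	int value;		/* no rcu_head needed for one-argument form */
};

void free_them(struct big *big_ptr, struct small *small_ptr)
{
	/* Two-argument form: never sleeps, costs an rcu_head per object. */
	kvfree_rcu(big_ptr, rcu);

	/* Single-argument form: smaller objects, but may sleep when the
	 * system is out of memory, so it cannot be used in atomic context. */
	kvfree_rcu(small_ptr);
}
```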
There are new variants of RCU that are for specialized use cases. He did not go into much detail, but wanted people to be aware of RCU Tasks Rude and RCU Tasks Trace since they may appear in tracebacks. They are mostly used for tracing of trampolines and he suggested that those who think they should use them contact him or one of the other users before doing so. RCU Tasks has been around since 3.18, but Rude and Trace were added in 5.8 with the help of Neeraj Upadhyay.
Polling for the end of the grace period, instead of calling synchronize_rcu() and waiting for it, goes back to 3.14. Polling is done by getting a cookie, then eventually passing the cookie to cond_synchronize_rcu(). This method works, but cannot be used in contexts where sleeping is not allowed. In addition, getting the cookie does not imply that the grace period has actually started, which can be problematic in some use cases. In 5.12, some functions were added to the API, start_poll_synchronize_rcu() and poll_state_synchronize_rcu(), along with *_srcu() variants for sleepable RCU, in order to support those use cases. There are some caveats to be aware of in using them, however.
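The cookie semantics can be modeled with a simple counter; this user-space C toy captures the idea behind the polling interfaces (it is not the kernel's implementation, and the `get_state()`, `gp_complete()`, and `poll_state()` names are invented):

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long gp_seq;	/* number of completed grace periods */

/* Like get_state_synchronize_rcu(): the returned cookie identifies a
 * grace period that has not yet completed. */
static unsigned long get_state(void)
{
    return gp_seq + 1;
}

/* A grace period elapses (in the kernel, all readers have finished). */
static void gp_complete(void)
{
    gp_seq++;
}

/* Like poll_state_synchronize_rcu(): true once the grace period the
 * cookie refers to has completed; never sleeps. */
static bool poll_state(unsigned long cookie)
{
    return gp_seq >= cookie;
}
```

A caller grabs a cookie, goes off to do other work, and periodically polls; once `poll_state()` returns true, it is safe to free the memory the cookie was protecting.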
A feature that is mostly of interest to the realtime and HPC communities is the run-time callback offloading (and deoffloading) support that was added by Frédéric Weisbecker in 5.12. Normally, the RCU callbacks are executed on the CPU where they were queued, but that can interfere with the workload running on the CPU. So there is a way to offload those callbacks to kernel threads (kthreads) and then assign those kthreads elsewhere.
Traditionally, those assignments were done at boot time by choosing which CPUs would be used for the callbacks and that could not be changed without rebooting. Weisbecker added the infrastructure that allows changing those assignments at run time; a CPU gets marked as "possibly offloaded" at boot time, then it can be switched to offloaded and back to deoffloaded at any time. Currently there is an internal kernel function to do so, but McKenney thinks it will be wired up with a user-space interface at some point.
Another feature is a "memory diet" for the sleepable RCU (SRCU) code. Previously, it would allocate an array based on NR_CPUS, which is the maximum number of CPUs that the kernel can handle. That number is sometimes set to 4096 by distributions even though the vast majority of the systems where they run will have far fewer CPUs. So, instead of allocating the array at build time, it now gets allocated at run time based on the number of CPUs actually present. That is due in 5.19.
Another feature slated for 5.19 is realtime expedited grace periods contributed by Rezki. McKenney gave a brief history of the length of RCU CPU-stall timeouts. In the 1990s, Dynix/PTX used 1.5s; in the 2000s, Linux used 60s, which was somewhat disappointing to him. In the 2010s that dropped to 21s for Linux; now a patch has proposed 20ms. On Android systems, the expedited grace period CPU-stall timeout would be 20ms, while it would stay 21s on other systems.
In order for that to work, some additional patches from Kalesh Singh are being added. Normally expedited grace periods are driven by workqueues and run with the SCHED_OTHER scheduling class, like normal user-space processes. The patches will add a new kind of expedited kthread in the SCHED_FIFO scheduling class, which is "strong medicine", he said. It is limited to systems with fewer than 32 CPUs, no realtime, and with priority boosting enabled. The test results were impressive, he said, with latencies reduced by three orders of magnitude, down to roughly 2ms. It is a kind of realtime system with the expedited grace period on the fast path; "If you told me that last year, I would have laughed in your face."
Future
He said that he had turned 100 this year, or perhaps 40, but in base 10, of course, that's 64. He expects to be around for a while, noting that his father and grandfathers worked until they were 90 or so, but "mother nature's retirement program" awaits us all, so it is good to be prepared. He put up a list of some things that might be worked on in the future, but pointed out that the things that he can't see coming complicate that picture. There need to be people with a good understanding of RCU to handle those when they arise.
He looked back at the commits to RCU over a two-year period that ended in April 2017, so five years ago. That showed 46 contributors, most of whom contributed a single patch, while McKenney contributed the vast majority of patches (288 patches or 74%). Looking at the two years leading up to April 2022 shows 79 contributors, with McKenney's percentage of the patches committed dropping to 63% (503 patches). One reason that the overall patch numbers have increased is that, since he started at Facebook, he has concentrated on adding more distributed testing for RCU.
In general, the trend is going in the right direction. There are developers who have been doing significant work deep inside RCU recently, which is great. There is still a lot of work to do, however, he said. One thing that he has noted over the years is that once a developer shows they can work on RCU, some company pays them a lot of money to work on something else. That is good, because people with some RCU knowledge are spread around the community. More recently he has noticed developers sticking with RCU itself, which is even better.
The knowledge and understanding of RCU needs to be better propagated throughout the community, he said. He has recommended two presentations that he did as good starting points for that (here and here) but more is needed. There is also a more general problem of how to choose the right synchronization tool for a given problem—RCU is not always the right choice—which is another area that needs better understanding and propagation within the kernel community.
How to cope with hardware-poisoned page-cache pages
"Hardware poisoning" is a mechanism for detecting and handling memory errors in a running system. When a particular range of memory ceases to remember correctly, it is "poisoned" and further accesses to it will generate errors. The kernel has had support for hardware poisoning for over a decade, but that doesn't mean it can't be improved. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit, Yang Shi discussed the challenges of dealing with hardware poisoning when it affects memory used for the page cache.

The page cache, of course, holds copies of pages from files in secondary storage. A page-cache page that is generating errors will no longer accurately reflect the data that is (or should be) in the file, and thus should not be used. If that page has not been modified since having been read from the backing store, the solution is easy: discard it and read the data again into memory that actually works. If the page is dirty (having been written to by the CPU), though, the situation is harder to deal with. Currently, Shi said, the page is dropped from the page cache and any data that was in it is lost. Processes will not be notified unless they have the affected page mapped into their address space.
This behavior, Shi said, leads to silent data loss. Subsequent accesses to the page will yield incorrect data, with no indication to the user that there is a problem. That leads to problems that can be difficult to debug.
To solve the problem, he continued, the kernel should keep the poisoned page in the page cache rather than dropping it. The filesystem that owns the page will need to be informed of the problem and must not try to write the page back to secondary store. Some operations, such as truncation or hole-punching, can be allowed to work normally since the end result will be correct. But if the page is accessed in other ways, an error must be returned.
There are a few ways in which this behavior could be implemented. One would be to check the hardware-poison flag on every path that accesses a page-cache page; that would require a lot of code changes. An alternative would be to return NULL when looking up the page in the cache. The advantage here is that callers already have to be able to handle NULL return values, so there should be few surprises — except that the error returned to user space will be ENOMEM, which may be surprising or misleading. Finally, page-cache lookups could, instead, return EIO, which better indicates the nature of the real problem. That would be much more invasive, though, since callers will not be prepared for that return status.
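The second alternative, returning NULL from the lookup, can be sketched in a few lines; this is a user-space C toy, not the kernel's find_get_page(), and the `PG_hwpoison` constant here merely stands in for the kernel's page flag:

```c
#include <stddef.h>

#define PG_hwpoison 0x1u	/* stand-in for the kernel's poison flag */

struct page {
    unsigned int flags;
};

/*
 * A lookup that pretends a poisoned page is simply not cached.
 * Callers already handle NULL, so little code changes -- but the
 * error they report will typically be ENOMEM rather than EIO.
 */
static struct page *cache_lookup(struct page *p)
{
    if (p && (p->flags & PG_hwpoison))
        return NULL;
    return p;
}
```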
Matthew Wilcox jumped in to say that only the first alternative was actually viable. Poisoning is tracked per page, but the higher-level interfaces are being converted to folios, which can contain multiple pages. The uncorrupted parts of a folio should still be accessible, so page-cache lookups need to still work. Dan Williams said that, in the DAX code (which implements direct access to files in persistent memory), the approach taken is to inform the filesystem of the error and still remove the page from the page cache. That makes it possible to return errors to the user, he said; this might also be a good approach for errors in regular memory as well.
Ted Ts'o expressed his agreement with Williams, saying that if the information about a corrupt page exists only in memory, a crash will erase any knowledge of the problem; that, too, leads to silent data corruption. The proposed solution does a lot of work, he said, to return EIO only until the next reboot happens. Asking the filesystem to maintain this information is more work, but may be the better approach in the end. One way to make it easier, he said, would be to not worry about tracking corrupted pages individually; instead, the file could just be marked as having been corrupted somewhere.
Shi argued that memory failures are not particularly rare in large data-center environments, and that any of his approaches would be better than doing nothing. Also, he said, users may well care about which page in a file has been damaged, so just marking the file as a whole may not be sufficient.
Kent Overstreet said that, beyond returning an indication of the problem to the user, the point of this work is to avoid writing garbage back to the disk. Then, if the system crashes, "the slate is wiped clean" and the corrupted memory no longer exists. A crash, he said, might be seen as the best case. Wilcox replied that this "best case" still involves data loss.
Josef Bacik said that storing corruption information made the most sense to him; the implementation could mostly go into the generic filesystem code. When notified of problems, the filesystem code should mark the affected pages, refuse to return data from them, and take care to avoid writing them to backing store. But he suggested that a per-file flag might suffice; developers — in both user and kernel space — are not good at dealing with error cases, so this mechanism should be kept simple, especially at the beginning. Developers can "try to be fancy" later if it seems warranted.
David Hildenbrand objected that a per-file flag could get in the way of virtual machines running from images stored in a single file. A single error would prevent the whole thing from being used, essentially killing the virtual machine. Tracking individual pages is probably better for that use case. But Bacik reiterated that the community was destined to make mistakes, so the simple case should be done first.
As time ran out, Wilcox pointed out that filesystems could handle the case of writing to a corrupted page — if the entire page is being overwritten. In that case, the damaged data is gone and the file is, once again, in the state that the user intended. Goldwyn Rodrigues pointed out, though, that the situation is not so simple for copy-on-write filesystems, which may still have the damaged pages sitting around. Bacik said this case is exactly why he opposes fancy solutions.
Page pinning and filesystems
It would have been surprising indeed if the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM) did not include a session working toward solutions to the longstanding problems with get_user_pages(), an internal function that locks user-space pages in memory for access by the kernel. The issue has, after all, come up numerous times over the years. This year's event duly contained a session in the joint filesystem and memory-management track, led by John Hubbard, with a focus on page pinning and how it interacts with filesystems.

File-backed pages, naturally, have a filesystem behind them that manages the movement of data to and from persistent storage. The root of the problem with page pinning is that the kernel uses it to operate on the contents of pages outside of the filesystem's purview; that can lead to unpleasant surprises when those contents change at times when the filesystem is not expecting it. If filesystems were aware of pinned pages then they could at least attempt to take evasive action, but pinning is generally invisible to filesystems.
The approach that has been taken is to try to make pinning explicit and visible; to that end, the new pin_user_pages() API was added. The effect is about the same as with get_user_pages(), but this interface attempts to mark pages to show that they have been pinned.

There is still one little problem, though: there are no spare bits in struct page to track the number of times a given page has been pinned, so the developers had to hack things by using a bias value (1024) in the page reference count. As a result, if a page has at least that many references, it will appear to be pinned even if it is not. For that reason, the function to query whether a page is pinned is called page_maybe_dma_pinned(). Hubbard complained about the "maybe" in the name; it seems inadequate, but it is the best that the development community has been able to achieve so far.
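The bias trick, and why it can only answer "maybe", can be shown with a small user-space C toy; this models the reference-count arithmetic only, not the kernel's struct page, and `struct toy_page` and its helpers are invented names:

```c
#include <stdbool.h>

#define PIN_BIAS 1024	/* pinning adds this bias to the refcount */

struct toy_page {
    int refcount;
};

static void get_page(struct toy_page *p) { p->refcount++; }
static void pin_page(struct toy_page *p) { p->refcount += PIN_BIAS; }

/*
 * Like page_maybe_dma_pinned(): with no dedicated pin counter, all
 * that can be checked is whether the refcount is at least PIN_BIAS.
 */
static bool maybe_dma_pinned(struct toy_page *p)
{
    return p->refcount >= PIN_BIAS;
}
```

A page with 1024 ordinary references looks exactly like a pinned page, which is Wilcox's point about heavily mapped pages such as those holding the C library.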
Matthew Wilcox said that the folio work might be able to provide a few extra bits for a pin counter, but it is probably not enough. There was some talk of putting this count into a side structure, moving it out of struct page entirely. David Howells noted that the sorts of accesses that pin pages (DMA and direct I/O, primarily) are not that common, so the side-structure idea might be the best approach. David Hildenbrand wasn't sure of the scope of the problem; since the bias is 1024, it takes a lot of references for a page to appear to be pinned. Wilcox pointed out that frequently mapped pages, such as those containing the C library, will have high reference counts.
Hubbard returned to the status of the pinning work, saying that developers need to think about why they need access to the pages in question. If they are doing something that will touch a page's contents, then pin_user_pages() should be used. In other cases, often where the intent is to make changes to the underlying page structures, get_user_pages() is the right function to call. The process of converting filesystems to deal with pinning is ongoing; it is not a small job. There are also a lot of cases where kernel code uses set_page_dirty() to mark a page as having been modified, then unpins the page. He made a helper for that case, but it feels wrong to him; each one is a place where pages are being marked "dirty" outside of the filesystem that is responsible for them, but which is still not actively involved in (or aware of) the operation. At least the evil is concentrated now, he said.
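The dirty-then-unpin pattern he described looks roughly like this in kernel-style code (not runnable outside the kernel; `user_addr`, `MAX_PAGES`, and `do_dma_to()` are hypothetical, while pin_user_pages_fast() and unpin_user_pages_dirty_lock() are the interfaces under discussion):

```c
struct page *pages[MAX_PAGES];	/* MAX_PAGES: hypothetical bound */
int npinned;

/* Pin with intent to touch page contents; FOLL_WRITE: device writes. */
npinned = pin_user_pages_fast(user_addr, MAX_PAGES, FOLL_WRITE, pages);
if (npinned < 0)
	return npinned;

do_dma_to(pages, npinned);	/* hypothetical device I/O */

/* The helper: mark each page dirty, then unpin it, in one call --
 * still behind the owning filesystem's back, as Hubbard noted. */
unpin_user_pages_dirty_lock(pages, npinned, true);
```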
At this point, Hubbard noted that he had 12 minutes left in his session to come up with a finished design that fully solves the pinning problem. Even after everything has been converted to the new pinning API, he said, the problems that drove all this work in the first place will remain unfixed. His suggestion was to add an API allowing kernel code to take a lease on a range within a file, and to require leases to be taken before pinning file pages. There is a significant advantage to this approach: it is a correct solution that would connect the filesystem and memory-management code. There has to be some way of communicating information about changes to file-backed pages to filesystems, and this is the only proposal he has seen so far.
Ira Weiny objected that leases are hard to get right; there are lots of roadblocks that come up. He agreed, though, that some way of making filesystems aware of pins is needed. But filesystems, he said, don't like letting go of pages they are responsible for. Ext4 maintainer Ted Ts'o answered that the prospect doesn't worry him. A lease (or equivalent mechanism) would be telling the filesystem that the pages in the affected range may be marked dirty at some point in the future; that implies that if the indicated pages don't have blocks allocated in the filesystem, that must be rectified immediately. A copy-on-write filesystem might have to copy the whole range, even if the pages are never dirtied in the end. If the pages are dirtied, all of this is fine, and perhaps even better, since allocating for the whole range at once might lead to better layout.
Howells said that he did an implementation of file leases for network filesystems, but he ended up abandoning it. There were just too many problems with truncate(), and with direct I/O; it is hard to get it right. Hubbard suggested just ignoring truncate(), to general laughter.
Chris Mason said that, for Btrfs, marking the page dirty is not the only problem. Btrfs has to lock pages before doing I/O so that their contents can't change; among other things, a checksum of the page's contents must be taken, and that checksum must continue to match those contents. Josef Bacik said, in his classic way, that the situation was bad in general, and that this problem is the biggest barrier to the sharing of pages in the page cache. The memory-management subsystem going behind a filesystem's back is a huge problem, he said, that has to be fixed. He is not thrilled with the lease idea, though; it would conflict with the way that Btrfs manages ranges of dirty pages.
Kent Overstreet answered, though, that user space has always been able to modify pages at inopportune times. Direct I/O, for example, bypasses the page cache, and a buffer used for direct I/O can be mapped into a file as well. This can actually lead to deadlocks in some situations, and could be an attack vector. Bacik said that Btrfs has a special path for just this case.
Hubbard, having not really gotten the 12-minute design he was after, closed the session by noting that fixing these problems may cause performance regressions. There may be objections but, he said, the higher performance was an "illusion" and the system was not correct. There was some brief discussion of ways to mitigate any future performance loss, but developers may still find themselves having to explain to users why their I/O should never have been as fast as it was before the system was made to work correctly.
Ways to reclaim unused page-table pages
One of the memory-management subsystem's most important jobs is reclaiming unused (or little-used) memory so that it can be put to better use. When it comes to one of the core memory-management data structures — page tables — though, this subsystem often falls down on the job. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), David Hildenbrand led a session on the problems posed by the lack of page-table reclaim and explored options for improving the situation.

Page tables, of course, contain the mapping between virtual and physical addresses; every virtual address used by a process must be translated by the hardware using a page-table entry. These tables are arranged in a hierarchy, three to five levels deep, and can take up a fair amount of memory in their own right. On an x86 system, a 2TB mapping will require 4GB of bottom-level page tables, 8MB of the next-level (PMD) table, and smaller amounts for the higher-level tables. These tables are not swappable or movable, and sometimes get in the way of operations like hot-unplugging memory.
Page tables are reclaimed by the memory-management subsystem in some cases
now. For example, removing a range of address space with munmap()
will clean up the associated page tables as well. But there are situations
that create empty page tables that are not reclaimed. Memory allocators
tend to map large ranges, for example, and to only use parts of the range
at any given time. When a given sub-range is no longer needed, the
allocator will use madvise()
to release the pages, but the associated
page tables will not be freed. Large mapped files can also lead to empty
page tables. These are all legitimate use cases, and they can be
problematic enough; it is also easy to write a
malicious program that fills memory with empty page tables and brings the
system to a halt, which is rather less legitimate.
In all of these cases, many of the resulting page tables are empty and of no real use; they could be reclaimed without harm. One step in that direction, Hildenbrand said, is this patch set from Qi Zheng. It adds a reference count (pte_ref) to each page table page; when a given page's reference count drops to zero, that page can be reclaimed.
Another source of empty page-table pages is mappings to the shared zero page. This is a page maintained by the kernel, containing only zeroes, which is used to initialize anonymous-memory mappings. The mapping is copy-on-write, so user space will get a new page should it ever write to a page mapped in this way. It is easy for user space to create large numbers of page-table entries pointing to the zero page, once again filling memory with unreclaimable allocations. This can be done, for example, by simply reading one byte out of every 2MB of address space mapped as anonymous memory. These page-table pages could also be easily reclaimed, though, if the kernel were able to detect them; there is no difference (as far as user space can tell) between a page-table page filled with zero-page mappings and a missing page-table page. At worst, the page-table page would have to be recreated should an address within it be referenced again from user space.
There are other tricks user space can use to create large numbers of page-table pages. Some of them, like so many good exploits, involve the userfaultfd() system call.
So what is to be done? Reclaiming empty page-table pages is "low-hanging fruit", and the patches to do so already exist. The kernel could also delete mappings to the zero page while scanning memory; that would eventually empty out page-table pages containing only zero-page mappings, and allow those to be reclaimed as well. Ordinary reclaim unmaps file-backed pages, which can create empty page-table pages, but that is not an explicit objective now. It might be possible to get reclaim to focus on pages that are close to each other in the hopes of emptying out page-table pages, which could then be reclaimed as well.
There are some other questions to be answered as well, though. Some page-table pages may be actively used, even if they just map the zero page; reclaiming them could hurt performance if they just have to be quickly recreated again. There are also questions about whether it makes sense to reclaim higher-level page tables. Reclaiming empty PMD pages might be worthwhile, Hildenbrand said, but there is unlikely to be value in trying to go higher.
When should this reclaim happen? With regard to empty page-table pages, the answer is easy — as soon as they become empty. For zero-page mappings, the problem is a little harder. They can be found by scanning, but that scanning has its own cost. So perhaps this scanning should only happen when the kernel determines that there is suspicious zero-page activity happening. But defining "suspicious" could be tricky. Virtual machines tend to create a lot of zero-page mappings naturally, for example, so that is not suspicious on its own; it may be necessary to rely on more subjective heuristics. One such might be to detect when a process has a high ratio of page-table pages to its overall resident-set size.
Even when a removable page-table page has been identified, he said, removing it may not be trivial. There is a strong desire to avoid acquiring the mmap_lock, which could create contention issues; that limits when these removals can be done. Removing higher-level page-table pages is harder, since ordinary scanning can hold a reference to them for a long time. Waiting for the reference to go away could stall the reclaim process indefinitely.
With regard to what can be done to defeat a malicious user-space process, Hildenbrand said that the allocation of page-table pages must be throttled somehow. The best way might be to enforce reclaim of page-table pages against processes that are behaving in a suspicious way. One way to find such processes could be to just set a threshold on the amount of memory used for page-table pages, but that could perhaps be circumvented simply by spawning a lot of processes.
At this point, the session wound down. Yang Shi suggested at the end that there might be some help to be found in the multi-generational LRU work, if and when that is merged. It has to scan page tables anyway, so adding the detection of empty page-table pages might fit in relatively easily.
The ongoing search for mmap_lock scalability
There are certain themes that recur regularly at the Linux Storage, Filesystem, Memory-Management, and BPF Summit; among the most reliable are the scalability problems posed by the mmap_lock (formerly mmap_sem). This topic has come up in (at least) 2013, 2018 (twice), and 2019. The 2022 event was no exception, with three consecutive sessions led by Liam Howlett, Michel Lespinasse, and Suren Baghdasaryan dedicated to the topic. There are improvements on the horizon, but the problem is far from solved.
Lespinasse started with an overview of the problem.
The mmap_lock is used to serialize changes to a process's address
space; in that role, it can cause contention in a number of ways. A
multi-threaded process that is generating a lot of page faults, for
example, will end up bouncing the lock's cache line around, even though
page faults only require a reader lock and can thus be handled in parallel.
If the threads are performing other types of memory operations, such as
mapping or protection changes,
they may contend for mmap_lock and block each other more directly,
even if they are operating on different
parts of the address space and should be able to work concurrently. There are
also problems when one process accesses another's address space; running
ps is a simple example. Monitoring tools often run at a
relatively low priority; if one acquires the mmap_lock and is then
scheduled out, it can end up blocking the application it is trying to
watch.
Lespinasse had been working on range locking for some time, with the overall goal of splitting the mmap_lock into two parts. There would be one lock covering the entire address space that would be held only for short periods, while range locks would be used for longer operations. He couldn't get it working well, though, mostly because it made page faults more expensive. So he has since turned his attention to the longstanding speculative page-fault handling patches, which should at least help with the cache-line bouncing problem, since they eliminate the need to acquire mmap_lock much of the time.
Liam Howlett then briefly presented the maple tree work; this is a new data structure that was described in detail in this 2021 article. It is a parallel approach to the mmap_lock scalability problem (and more). The maple tree patches successfully replace the current red-black tree for virtual memory areas with about the same performance; in the future, maple trees should more naturally support more scalable access to the process's address space. Meanwhile, the new data structure should be able to take much of the complexity out of the memory-management subsystem.
Matthew Wilcox joined to note that, in the first phase, the maple tree does not yet bring much in the way of scalability benefits. The mmap_lock is still being used in the same places, and the planned ability to use read-copy-update (RCU) with maple trees is not yet there. But it is still a win even at this stage simply due to the reduction in complexity.
The problem with page-fault handling
Wilcox continued by putting up an overview of how the page-fault handling code works to demonstrate where the problems arise:
do_user_addr_fault();
    mmap_read_lock();
    find_vma();
    handle_mm_fault();
        __handle_mm_fault();
            p4d_alloc();
            pud_alloc();
            pmd_alloc();
            handle_pte_fault();
                do_fault();
                    do_read_fault();
                        do_fault_around();
                            filemap_map_pages();
                                /* Stuff under RCU read lock */
    mmap_read_unlock();
The point is all of the work done after the mmap_read_lock() call, which acquires the mmap_lock. The subsequent call to find_vma() will find the virtual memory area (VMA) containing the faulting address, and the code requires that VMA to remain stable thereafter — thus the acquisition of the lock. Some of the calls thereafter (the p?d_alloc() calls) may perform GFP_KERNEL allocations, meaning that they may sleep; that rules out using RCU rather than taking the lock. Once those functions have done their work, though, RCU can be safely used.
Paul McKenney asked whether perhaps sleepable
RCU (SRCU) could be used here, especially if a variant could be made that
omitted the use of read-side barriers. Wilcox said that SRCU has been
tried in the past and shown some performance problems, but things might be
better now. McKenney said those problems are still present, but he might
be able to "contort the grace periods" to make things better.
Wilcox said that, instead, the p?d_alloc() functions could gain a GFP-flags argument to tell them not to perform GFP_KERNEL allocations; Michal Hocko replied that it might be necessary to dig into a lot of architecture-specific code to make that work. It would have been helpful to have that argument from the beginning, he said, but it will be harder to add now. Johannes Weiner asked whether the flags argument was really needed; perhaps the fault handler could just drop into a slow path if the upper-level page directories are not present.
The conversation then shifted to the need to hold the lock to prevent changes to the VMA in the first place. The problem, in particular, is that the VMA for a faulting address must continue to contain that address while the fault is being handled; it cannot be allowed to shrink in a way that would leave the address outside. One way to avoid this problem would be to stop changing the VMA in place; instead, it could be replaced using RCU. Lespinasse asked what would happen if there were multiple changes contending for access to the same VMA; it was suggested that a flag indicating that changes are pending could be added.
The approach taken in the speculative page-fault patches is a little different; rather than using RCU, the code takes a seqlock. If concurrent changes happen while the fault is being handled, the code would simply start over before committing the previous attempt. This can, Wilcox said, be thought of as spreading out changes in time, rather than spreading them out in space, as is done by the RCU approach.
Howlett, meanwhile, concluded that his maple tree work doesn't conflict with either approach to page-fault handling. A remaining problem, though, is the need to preallocate memory — perhaps a fair amount of it — before going into a non-sleepable mode. Much of that memory may ultimately not end up being used, but it must be available; all of this (often useless) allocation and freeing is expensive. It would be nice, Wilcox said, to have a slab call to reserve some number of objects, with the ability to release the unused ones later on. "I'll just get Vlastimil [Babka] to do it", he said.
Meanwhile, Lespinasse said, this topic has been under discussion for many years; the speculative page fault patches go back to at least 2009. It is about time to start getting some of this code into the mainline.
Wilcox started to sketch out another variant on the approach. The fault code would, as a first attempt, avoid taking the mmap_lock and use RCU for its concurrent updates. If this fails, as it would if memory had to be allocated, for example, then the code would just start over and retry the old-fashioned way. It would not be a huge change, he said, at least for faults on file-backed pages. There was some discussion over whether it made sense to attempt to allocate memory (using GFP_NOWAIT) before giving up on the fast path; Wilcox thinks there would be a benefit to making the attempt.
Hocko asked whether this approach eliminated the need to do range locking, especially once the maple tree is in place too. Howlett answered that range locking adds a lot of complexity, and it may well not be worthwhile.
Wilcox then suggested that a reader/writer semaphore could be put into the VMA itself; that would have the effect of using the VMA as a sort of range lock. There would still be contention at the VMA level, but it would be an improvement. Developers could then iterate toward a better solution for as long as the community is willing to put up with all the changes. Mel Gorman thought that the contention at the VMA level would be far less severe than with the address-space-wide mmap_lock, but Wilcox worried about applications that make terabyte-sized VMAs. Gorman said that applications creating that sort of VMA are almost certainly doing their own memory management anyway.
The session concluded with the thought that some variant of the above approach might make sense. The proof is always in the code, though.
Speculative page faults for Android
The discussion on speculative page-fault handling was not done, though; Baghdasaryan joined Lespinasse to shift the focus slightly. Baghdasaryan pointed out that Android has a specific interest in speculative page-fault handling because it improves the launch time for applications — something that Android users care a lot about. It is important enough that vendors have been shipping it and have asked for it to be included in the Android common kernel. Multi-threaded applications also benefit from it, he said.
Why are speculative page faults so helpful in these situations? When Android runs an app, it does so by spawning a new thread which typically starts by mapping a number of VMAs. As the app runs, those VMAs start generating page faults, which create mmap_lock contention. Eliminating that contention makes things run much faster.
The latest speculative page-fault patches were posted in
January, he continued. One of the things holding them back is that few
benchmarks show the benefit of this work, and hackbench shows a 4%
regression. That leads to pushback; few people outside of Android see the
value of this work. Indeed, many do not know about it at all, so they
don't try it out, so there are no reports from use cases that it helps.
But everybody can see the added complexity and cost.
Beyond that, not everybody cares deeply about startup time. Many server applications run for a long time; the time it takes for them to get going is insignificant in comparison. Meanwhile, people who are affected by mmap_lock contention have worked around the worst problems long ago. The end result of all this is that the speculative page-fault patches are still being carried in the Android tree, with no clear prospect of going upstream. The Android developers would like to get this work merged, though; the days when Android happily carried a big pile of out-of-tree patches are past. So, he asked, what is the best path forward?
Lespinasse said that he has seen a lot of people having mmap_lock issues; they almost always find some sort of workaround and move on to the next problem, but the issues are still there. The problem is obviously not impossible to deal with, but it is a source of frustration for users. The kernel should not impose this frustration; there are solutions out there, and he wishes that something could go forward.
Hocko said that getting to a solution will still probably take some time. Different approaches, including speculative page-fault handling and the maple tree, will need to be evaluated; he is not sure that both are required. Lespinasse didn't see a conflict between the two, though; the maple tree is a more efficient red-black tree, he said, but does not give all of the benefits by itself. McKenney suggested that it might make sense to delay the lockless part of the maple-tree patches, giving the speculative page-fault work a chance to show the benefits that it provides.
The session concluded with no firm outcomes. Chances are, though, that there will be a renewed round of patches showing up soon, and an increased desire to get something upstream. As Baghdasaryan noted, some of this work has been ported forward and backward over many years and been widely tested on a lot of devices. This is not immature work; it should be possible to get at least some of it merged.
Improving memory-management documentation
Like much of the kernel, the memory-management subsystem is under-documented, and much of the documentation that does exist is less than fully current. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Mike Rapoport ran a session on memory-management documentation and what can be done to improve it. The result was a reinvigorated interest in documentation, but only time will tell what actual improvements will come from that interest.

Rapoport started by noting that, a couple of years ago, he took a hard look at the current state of memory-management documentation. What he found was summarized as "Mel's book and some text files". The book in question, Mel Gorman's Understanding the Linux Virtual Memory Manager, is reminiscent of many old Unix books: the basic concepts still apply to a great extent, but the details are all out of date and, thus, wrong. There are many important memory-management features, such as transparent huge pages, that are not mentioned at all.
With regard to the text files in the kernel's documentation directory,
Rapoport made the effort to convert them over to restructured text and
integrate them into the kernel's documentation system, adding a bit of
much-needed organization in the process. He added some coverage of
internal APIs, but there is a lot that is still in need of improvement.
So, he asked, what can be done to improve the documentation and encourage
the writing of more documentation?
One idea, he continued, was for reviewers to make a point of reviewing the associated documentation when looking at memory-management patches. He has been making an effort in that direction, but has not seen other reviewers following suit. Matthew Wilcox jumped in to note that the maple tree patches are well documented. Rapoport agreed, but said that doesn't change the fact that there is a lot of "tribal wisdom" in the memory-management community that does not exist in written form.
Another developer noted that the documentation can be found in two distinct places: in the code, and under the kernel's documentation directory (Documentation/). The latter documentation, he said, is not as good as it could be. There are some sections written in a clear narrative style, but they are mixed in with "noise and horrible stuff". The rendered documentation, which incorporates kerneldoc comments from the code into the separate documents, can jumble everything together and can be hard to work with. Andrew Morton said that Documentation/ is good for user-facing material, but otherwise the right place for documentation is in the code itself.
As the maintainer for the documentation directory, I felt the need to jump in at this point; I had to disagree with Morton's assertion that separate documentation is only good for end users. There is a lot of information that is relevant to developers, but which doesn't fit readily into kerneldoc comments, and it is hard to tell a coherent story in the code that way. The idea that comments in the code will be better maintained than separate documentation is a poor match to reality at best.
With regard to organization, it is possible to put introductory and contextual information into kerneldoc comments and produce a coherent manual from them, but extra effort must be made toward that end, and the end result only appears in the documents after being rendered by the build system — not in the code. The DRM documentation is a good example of what can be done when developers put effort into it.
That said, organization has been an issue all along; when I became the documentation maintainer, the kernel's documentation directory was a seemingly random collection of independent files. Over the years, developers working on the documentation have been trying to organize that material with a focus on who the readers are; thus the Core API manual, the User-space API guide, the Maintainer handbook, and several others. The net effect has been to create a set of smaller piles of unorganized and often outdated material, but it's a start. But people rarely find time to try to improve those manuals or to turn each into a coherent document rather than a collection of related files.
Wilcox mentioned Neil Brown's readahead
documentation as an example of another type of problem. The new
documentation is "90% right"; Wilcox should have reviewed it, but he was
not copied on it and was unable to find the time. Brown, he said, did not
use the documentation that was
already present in the code when doing his work, and that is frustrating.
A recurring theme was that there are not enough people with the time and expertise to work on documentation; developers were encouraged to lobby their employers to support that work. Michal Hocko said that you can't bring in a "random tech writer" to work on memory-management documentation, though; a lot of knowledge is needed to write useful documentation, so experienced people need to write it. Brown's approach was excellent, since he is an expert user of the interface and can see it through those eyes. Meanwhile, Hocko said, he generally avoids looking through the code when in search of documentation and digs through the LWN archives instead.
I agreed with Hocko but had to add that writing documentation is a good way to gain the needed expertise. I learned much of what I know when working on Linux Device Drivers; it's fair to say that I was not well qualified when I began that project.
Davidlohr Bueso claimed that the best document in the kernel, the one that others should emulate, is the infamous memory-barriers.txt. It is written by developers with a high level of expertise, is clear, and actively maintained; even a newcomer can get something out of it. Johannes Weiner said that one of the strengths of memory-barriers.txt is that the document has had an excellent structure from the beginning; that made it relatively easy for others to come along and add to it. The memory-management subsystem needs somebody to come along and create a similar sort of documentation structure.
Dan Williams asked what the near-term focus for memory-management documentation should be; did Rapoport have specific APIs in mind? Rapoport answered that his goal was to make the documentation better in general so that others could understand how Linux memory management works. Williams said that was "a good mission statement", but he was looking for actionable tasks. Rapoport suggested speculative page faults or the multi-generational LRU (both of which are still out of tree) as examples.
Kent Overstreet said that developers are not bringing up documentation during code review, and that the subsystem does not have a person who has a coherent view of what the documentation should look like. Liam Howlett said that, as a new memory-management developer, he has encountered many functions that he did not understand. When he changes code, he tries to improve its documentation. He mentioned find_vma() specifically as a function whose behavior doesn't really match its name or documentation.
David Hildenbrand asked what documents were wanted for memory management in general. Rapoport answered that he would like to see more material in the admin guide first, preferably a high-level overview of how it all works. Improving the kerneldoc comments is rather lower on his list, but it is also easier to do. There was some discussion around whether there was a greater need for internal or user-oriented documentation; it was suggested that perhaps developers over-document some internal APIs, causing users to use them when they really should not. find_vma() was mentioned again as an example of this sort of problem.
At the conclusion of the session, Wilcox suggested that a good first step would be to create a new memory-management document using Gorman's book as a guide, and volunteered to take a stab at it. That book had a structure that clearly worked; starting with that would solve the organizational problem and make it easy for developers to improve things. A reStructuredText file could be created along those lines, and the existing documentation could be slotted into it as appropriate. There was a general agreement that this was a good thing to do — no doubt helped by the existence of a developer who was willing to take the initial steps. Wilcox has since posted an initial version of the new documentation structure for review.
The state of memory-management development
The 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM) was the first chance for Linux memory-management developers to gather in three years. In a session at the end of the first day led by maintainer Andrew Morton, those developers discussed the memory-management development process. While the overall governance will remain the same, there are nonetheless some significant changes in store for this subsystem.

Morton started by saying that he was finally moving part of his process to Git — a change he has resisted for many years. There will be three trees kept on kernel.org for patches currently in development. The "hot fixes" tree is just what it sounds like; it will have branches for various releases with important fixes to upstream code. The "stable" tree holds material destined for the mainline, probably in the next merge window, while the "unstable" queue holds less mature material.
All three of these trees will be routinely rolled together to make the
-mm tree and fed into
linux-next. Morton made it clear that he will still use Quilt as much as
he can for his own workflow; he finds it to be much more flexible when work is still in flux.
He did say, though, that he will be willing to do Git pulls from developers
"if I have to", but he still doesn't think that model works. There has
never been a significant patch set, he said, that has gotten through the
memory-management tree unscathed, so Git's immutable model is a poor fit.
His management policy will be, he said,
to "stabilize forever" until patches seem ready for the stable tree, at
which point they will be merged into a branch while waiting to go upstream.
The unstable tree exists only as a Quilt queue for now, though it will eventually be available in Git as well. When that happens, probably in the 5.19 cycle, he does not want developers to use the result as a base for development. Instead, all patches should be against an upstream release, preferably the previous -rc3. Each series will be stored in a separate branch, even if there is only one patch involved. The result will obviously be many branches with long names; they will start with mm-unstable- and contain the patch subject. When developers send follow-on patches, he said, they should include the target branch name.
The stable tree is meant to be immutable, meaning it will never be rebased. That is a nice idea, he said, but it only works if there is material to put into it. He has been looking for this material, but nothing is ready at this point; everything is waiting for reviews and/or updates. If this process is going to work developers need to nail things down more quickly. He plans to become more active in nagging developers, often in private, to help push things forward.
Stability and review
Michal Hocko said that it seemed like a lot of patches are entering the -mm tree too soon; that causes developers' attention to move on to the next shiny thing. Morton answered that he has begun skipping the first version of a new patch series, partly for that reason. But, he added, anybody who thinks that a merge into -mm means that their work is done is showing a lack of experience. He tries to guide such folks when he can.
David Hildenbrand suggested that Morton should push for more review of patches before accepting them into -mm; a lot of patches that break things are getting through. In general, the community seems to have far more ability to generate code than to review it, he said. Morton answered that he doesn't want to block patches from -mm due to a lack of review; he would rather get the work out there and tested. He will not let work proceed to the mainline, though, without proper review.
Dan Williams suggested that perhaps more transparency would help; if developers could see the state of the queue and which patches are waiting on review, they might feel motivated to help with that. Morton answered that this information is available now. But, he said, he does make the decision to move patches into the stable tree by himself, and he would like to change that. This decision, he said, should be made more transparently and sooner. For now, his plan is to publicly post his plans and see if there are objections; he does not intend to wait for explicit acks before moving work to stable, though.
He would like to see more focus on the transition to stable, which he described as "a big deal". Patches do not need to be perfect to be promoted; after all, there are generally still a few months (the remainder of the development cycle and the next full cycle after a patch is merged) to get things up to production quality. So his criteria are whether Linux wants the patch overall, and whether it is in good enough condition to proceed.
Morton raised some doubts as to whether he should be publishing the stable tree at all. He does not want developers to base their patches on it, but that will surely be a temptation if it's out there.
There were some questions about documenting the process (and especially the criteria) for moving patches into the stable tree, but Mel Gorman advised against that. If a set of rules is posted, developers will inevitably try to game them. He also said that running the stable tree in a non-rebasing mode wasn't particularly important. Memory-management patches tend not to conflict often, so rebasing is unlikely to create problems for developers, especially early in the development cycle.
Changing subject, Morton said that he actually plans to create a fourth Git tree that will contain kernel-wide patches. It will always be generated from a Quilt queue, and he doesn't want to have any memory-management material there. Morton still handles a lot of patches to unrelated subsystems, so the reasoning behind this tree is easy to understand.
Submaintainers
Hocko brought up the question of submaintainers in the memory-management subsystem. There are a few areas that are handled by another developer's Git tree and are often pushed directly into the mainline, which he doesn't like. It ends up balkanizing the memory-management subsystem, he said, and makes it hard to get a coherent picture of where things stand. Having submaintainers may make sense, he said, but their trees should be pulled into -mm. Morton said that might work; he would probably pull those trees into the unstable branch.
Vlastimil Babka asked whether his tree, which contains changes to the slab allocators, should be going through -mm. Morton replied that the current process is working and Babka should keep sending pull requests directly to Linus Torvalds. He would not add any value to the process, he said. He pointed out, though, that slab patches tend to be independent of everything else; trying to decouple core memory-management changes, instead, would be a nightmare.
Hocko noted, though, that conflicts between areas are rare, even in the core memory-management subsystem. Perhaps it would be worthwhile to bring in more submaintainers to take on specific areas. Morton expressed a willingness to try and see if it works better than the current process, but he said he would like to get the current changes stabilized first. Hildenbrand added that new developers often don't know where to send memory-management patches, and the get_maintainer.pl script doesn't normally help. Defining submaintainers might help in this regard; he volunteered to take on that role for memory hotplug.
At this point, the day was approaching its end and the participants were getting tired. There was a bit of residual conversation on some details, but the session came quickly to a close.
Seeking an API for protection keys supervisor
Memory protection keys are a CPU feature that allows additional access restrictions to be imposed on regions of memory and changed in a fast and efficient way. Support for protection keys in user space has been in the kernel for some time, but kernel-side protection (often called "protection keys supervisor" or PKS) remains unsupported — on x86, at least. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Ira Weiny provided an update on the state of PKS and led a discussion on what the proper in-kernel API for PKS should be.
Weiny began by saying that version
10 of his PKS patch set had been posted in April. It adds additional
protections for kernel-space mappings on x86 systems. On that
architecture, memory protection keys control read and write access, but cannot affect execute
access. The permissions set by PKS apply only on the local CPU; that means
that they can be changed quickly, with no need for expensive translation
lookaside buffer (TLB) flushes. PKS protections can apply to persistent
memory as well as normal RAM; the initial goal of Weiny's patch series is
to use PKS to protect persistent memory
against stray writes from the kernel. There is also a
patch series from Rick Edgecombe that uses PKS to protect page tables
from corruption.
Protecting memory from unwanted changes by the kernel is a good thing, but this protection cannot get in the way of legitimate changes. So PKS protections must be lifted when such changes are being made. Trying to find every site in the kernel where PKS-protected memory is being accessed would be futile, and the result would be unmaintainable, so Weiny has, instead, "abused" the kmap() interface for this purpose. But kmap() is not the best tool for the job, for a couple of reasons.
The kmap() API was initially introduced to enable the kernel to manage (relatively) large amounts of memory on 32-bit processors. On such machines, there are not enough available address bits to directly map more than (usually) about 1GB of memory into the kernel's address space. The memory that can be mapped this way was called "low memory", while all of the memory that could not be directly mapped was "high memory". When the kernel needs access to a page in high memory, it must first make a temporary mapping in its page tables; this is done with kmap(). In practice, this means that the kernel must call kmap() (or one of its variants) before accessing any page in memory, and call kunmap() when that access is complete. In cases where the target page is in low memory (that would be all pages on 64-bit systems), those calls do nothing, but they must still be present.
Thus, kmap() would seem to be an ideal interface for adjusting PKS restrictions. When the kernel needs to access a page, it will call kmap(), which can suspend any PKS protections for the page in question on the local CPU; the following kunmap() call can then restore those protections. The only problem is that high memory is going away sooner or later, and the plan is for kmap() to be removed at the same time. We live in a 64-bit world now, Weiny said. There are still some Arm CPUs that need high memory now, he added, but the writing is on the wall for high memory in the longer term.
It would thus be good to find an alternative to using kmap(). One apparent option is the page_address() macro, but that will not work due to the lack of an unmap operation. The problem needs to be solved, Weiny said; PKS is not the last protection scheme of this type that will come along. The kernel project needs to establish the rule that code cannot just access the direct map without making prior arrangements; he suggests simply redefining kmap() to fill this role. The new meaning of a kmap() call would be "give me a kernel-accessible address for this page".
An alternative would be to improve vmap(), which creates a new kernel-space mapping for a page, for this purpose, though that would require changing a lot of kmap() calls. Matthew Wilcox said that it should be possible to make vmap() more efficient; there just has never been a driving need to optimize it so far. Weiny said that could make long-term mappings, for which kmap() is not intended, work better. Or, he said, the kernel could just eliminate the direct map entirely and always require memory to be mapped explicitly, but that approach probably would not perform well.
Wilcox raised the issue of the "Capability Hardware Enhanced RISC Instructions" (CHERI) architecture, which applies capabilities to all of memory. It has a 128-bit address type that provides a lot of space for access keys and more. Only FreeBSD supports this architecture currently, but it is, he said, something that the Linux community should be thinking about; CHERI-like mechanisms seem likely to show up in other processors over time. Supporting PKS can be seen as a small step in preparing for that world.
Josef Bacik said that the Btrfs code is currently a heavy user of kmap(), but he does not really care about the API to access kernel memory. "Just tell me what to use", he added. Chris Mason said that Btrfs developers have a debugging patch that makes pages read-only so that they can look at the resulting crashes and see who is modifying pages when they shouldn't be. PKS would be a useful way to implement this functionality.
Wilcox suggested that there should be an interface that can change the protections on multiple pages. Some sort of kmap_local_range() function would be useful. Bacik agreed, saying that Btrfs often has to map 16KB metadata blocks. It would be easy, he said, to change over to a new API that did the job better. At that point time ran out and the session came to a close.
Better tools for out-of-memory debugging
Out-of-memory (OOM) situations are dreaded by users, system administrators, and kernel developers alike. Usually, all that is known is that a lot of memory is being used somewhere and the system has run out, but the kernel provides little help to anybody trying to figure out where the memory has gone. In a memory-management session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Kent Overstreet asked what could be done to improve OOM reports and reduce the pain for all involved.
The kernel writes a report to the system log when an OOM problem occurs, he
began, but those reports often do not include information on memory managed
by the slab allocator. Other times, there can be hundreds of those
reports, which is not necessarily much more helpful. With a new reporting
system he has been working on (described briefly in this article), the report only includes the
ten slabs with the most allocated memory, which often is what he wants to
see. Even more important for
debugging OOM problems is information on the kernel's shrinkers, which are
responsible for reclaiming memory from caches when it is needed elsewhere.
There is currently no information available on what the shrinkers are doing,
but Overstreet's new report can include what was requested of each shrinker
and how much the shrinker was actually able to free.
This information is useful, he said, but it's only a start. The OOM-report code hasn't changed much since 2006, and it is showing its age; there is a lot of room for improvement. Johannes Weiner, who has done much of the work on OOM reporting, added that even the 2006 work was "mostly a refactoring". It was generally agreed that developers need better tools to address these out-of-memory situations.
A part of the problem, Overstreet said, is that outputting information from the kernel in a human-readable way is not easy. There are various "pretty printers" available, mostly in the form of special printk() format specifiers. But these are all hidden away in the printk() code and are hard to find; pretty printers are better placed with the code that manages the data they are printing. With the right infrastructure, he said, the kernel can have "thousands of pretty printers" and its output will get much better.
With regard to improving the OOM reports specifically, he suggested adding rate limiting as a first step. Once, say, a dozen reports have been printed, there is not likely to be much value in creating more. Reorganizing the reports to separate information about kernel-space and user-space memory would help. Information on fragmentation in the page allocator is needed and, as mentioned above, more information about shrinkers.
Weiner said that the report used to just print the top memory-consuming tasks rather than all of them, but that got removed for some reason. Michal Hocko responded that dumping out the top tasks is not easy; it requires locking the task list, which is expensive. Beyond that, partial information on the state of the system can be misleading and make it hard to get a complete picture; the top consumers may not be the problem if the sheer number of tasks overwhelms the system. What should be in the report depends on the situation. If, for example, a GFP_NOWAIT allocation request is failing, then the shrinkers (which will not be invoked in that situation) are probably not relevant. That is also true for high-order allocations; in that case, compaction is failing and developers need to know why. What's in the report now, he said, is a compromise — the information that is usually useful.
Overstreet said that the contents of the report will always be a compromise, but it is possible to create better reports than what the kernel has now. Hocko said that he has to process a lot of OOM reports, and his feeling is that there are simply too many numbers in them. What he often wants to do is to check the proportion of memory used by the slab allocator relative to that on the least-recently-used (LRU) lists. If the LRU memory dominates, then the problem is almost certainly in user space; calling out that situation explicitly would help.
Weiner suggested starting with the most useful summary information; the rest can come afterward. Verbosity in these reports is not necessarily a problem, especially if they are rate-limited. Hocko countered that, while rate limiting is nice in theory, it doesn't actually work. The problem is that printk() can be slow, especially when serial consoles are being used; just dumping all of the information in a report can bog down the machine, and it takes too long to trigger the limit.
Ira Weiny asked how it is that memory can be allocated without the kernel knowing where it went. The problem is that the tracking infrastructure just isn't there. Overstreet said that he has a mechanism to track memory usage by call site that is efficient enough to use on production systems; it employs the dynamic debugging mechanism. The pr_debug() macro is changed to create a static structure at the call site that is placed in its own ELF section. This mechanism can then be used to wrap kmalloc() calls and remember where they came from.
Hocko asked why the existing tracing mechanism couldn't be used for this purpose; Overstreet answered that he wants something that is always on, so he can look at the allocation numbers at any time. Paul McKenney suggested using a BPF program to store call-site information in a map. Weiner answered, though, that he had tried that once and it was trickier than it seems. There are cases, such as freeing memory in an interrupt handler, that are hard to handle.
The session concluded without much in the way of firm conclusions. Overstreet closed by saying that he keeps "finding stupid stuff" in the kernel, and that developers are not looking at memory allocations the way they should be. With luck, better tools will improve that situation in the future.
Changing filesystem resize patterns
In a filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Ted Ts'o brought up the subject of filesystems that get resized frequently and whether the default parameters for filesystem creation should change as a result. It stems from a conversation that he had with XFS developer Darrick Wong, who is experiencing some of the same challenges as ext4 in this area. He outlined the problem and how it comes about, then led the discussion on ways to perhaps address it.
Problems
Linux filesystems were generally designed to support being resized, but the expectation is that they start as a fairly large filesystem and then big chunks are added periodically. Filesystem data structures are sized and created based on how big the filesystem being created will be. He gave the example of a RAID array using md-raid that gets a new disk; the filesystem is then resized to take advantage of it. That new disk may be a substantial fraction of the size of the existing filesystem, but the filesystem is already rather large.
Another use case is with a few network-attached storage (NAS) projects that wanted to take a 100MB filesystem and install it using dd, then expand it to, say, 10TB. There were just a few projects like that, so the Linux filesystem developers were able to "pound on them until they did the right thing" and created larger filesystems to match the intended size, Ts'o said. But that same basic scenario is playing out in the cloud today.
Cloud providers typically have some minimum virtual block-device size, say 10GB, where filesystems get created, then they are resized from there. For example, that 10GB filesystem might be expanded to tens of terabytes. The difference from the NAS projects is that there are many more of the cloud providers and many are fairly naive, he said. They use the mkfs default parameters, which are geared toward a USB thumb-drive and may not work well on a giant filesystem.
In addition, many cloud providers charge their customers based on the size of the virtual block device being used, which means that customers have good reason to wait until the filesystem is nearly full before expanding it. A common pattern is that once a filesystem gets, say, 99% full, another 1GB or 5GB are added; that pattern repeats over and over for the filesystem. "That tends to result in worst-case filesystem fragmentation." Most filesystems are not designed to work well when running nearly full, he said.
Possible solutions
While the problems have been identified, solutions have not been; he was hoping that attendees had some interesting ideas. One thing that would be useful, Ts'o said, is to have a standard format for large filesystem images that could be inflated onto block devices to the full size of the filesystem. The ext4 developers have been experimenting with using the qcow format; there is a utility in e2fsprogs called e2image that can create these images. They only contain the blocks that are actually used by the filesystem, so they are substantially smaller than the filesystem they will create. The XFS developers have also been looking at xfsdump, since it has some similar capabilities, but for XFS filesystems.
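The e2image flow can be tried without a real block device, since mkfs.ext4 will operate on a regular file; the file names below are illustrative:

```shell
# Create a small ext4 filesystem in a regular file, then capture a
# metadata-only QCOW2 image of it with e2image's -Q option.
truncate -s 64M fs.img
mkfs.ext4 -q -F fs.img
e2image -Q fs.img fs.qcow2
ls -l fs.img fs.qcow2
```

The resulting fs.qcow2 contains only the blocks the filesystem actually uses, which is why it is so much smaller than the device it can be inflated back onto.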
When he and Wong were talking, they agreed that some single standard format would be useful. One possibility is qcow, but it is not well-specified and the QEMU developers, who created the qcow format, discourage its use as an interchange format, he said. Perhaps there are others that could be considered, but getting agreement between the various filesystems is important. That would help move away from the idea that dd is the state-of-the-art tool for transferring filesystems.
Another possibility would be to change mkfs so that it "sometimes or always" created filesystems suitable for being expanded into huge sizes. That could be done in a simpleminded way by changing the defaults, so that it always creates a huge journal even on USB thumb-drives, for example. Amir Goldstein said that doing so would not work well, since some large percentage of these small devices would be consumed by a journal that is not useful for them. Ts'o said that it might not be the optimal solution, but it was something that could be done.
Perhaps the block device could give some kind of hint to mkfs that would indicate it is one where the filesystem might grow in the future, Ts'o said. Those hints could say that a device is not resizable at all, such as the thumb-drive case, or that it is being installed in some virtual block device in the cloud, where expansion is fairly likely. That way, the USB thumb-drive defaults could be maintained, but the parameters could be switched to more suitable ones for the cloud case. There could be a set of heuristics based on the name of the device to try to figure out whether it is likely to grow. If the block device provides the hints, that pushes the problem down to the device drivers, but that code is in a position to know more about the underlying storage. None of these is a perfect solution, however.
He also noted that providing hints does not solve the problem of customers who are trying to minimize their storage costs, by adding small amounts of storage whenever the filesystem fills. The performance of such filesystems is seriously degraded, he said, to the point where the customer is probably paying more for the extra computation needed than they are saving on storage. There is no solution there that he knows of, other than customer education, and the number of customers is vastly more than the number of filesystem developers, so that does not work well either.
Goldstein asked if those creating filesystems that may expand would be using the mkfs option to put the resulting filesystem in a file; that could perhaps provide a hint to use a different set of parameters. Ts'o said that the NAS developers asked about this problem along the way, so the filesystem developers were able to give them a set of parameters that would work well for them. It does not work for the cloud case, however, since virtual block devices look like normal SCSI devices. Using heuristics based on the name of the device, or perhaps on some "magical SCSI attribute" that no one is setting right now, could be a way to recognize the cloud case.
Josef Bacik said that blkid did not have any useful information but that the model name of the disk could perhaps be parsed to help determine if it is a cloud device. He would guess that cloud providers use a consistent set of those names within a given cloud. Ts'o agreed with that, and thought that the results should be encoded in blkid so that it does not have to be recreated each time. Bacik thought that made sense, so that filesystem developers could move to a single "source of truth" about the type of device.
Chris Mason wondered if there were any statistics on how often people create small filesystems for the cloud and keep them small forever. He is concerned that making the decision solely on whether the device is a cloud type would punish users of small cloud filesystems. Ts'o said he did not have any statistics on that; he normally hears from customers who continually add 1GB chunks to their filesystems and then complain about the performance of them.
Part of the problem is that the filesystems do not get information about the maximum size of the device; each cloud provider has its own minimum, maximum, and increment sizes, Ts'o said. Arguments could be added to mkfs, to allow the user to specify the maximum size, for example, but most users will not know about the options. Users generally use the distribution installer or the default cloud image, so there is no opportunity for them to provide that information.
Goldstein asked: is it possible to resize the underlying parts of the filesystem, such as the journal, so that the problem is lessened? Ts'o said that it depends on the filesystem; ext4 can resize its journal because it allows the journal not to be contiguous, but he thinks that XFS needs a contiguous journal. Unfortunately, none of the XFS developers were able to attend LSFMM in person, but Eric Sandeen used the Zoom chat to point out that there is more than just the journal that needs to be adjusted. Mason also noted that there are a lot of data structures in a filesystem that would need to change as it gets larger.
Bacik said that everyone in the room is well aware that "creating more options for users just creates more ways for things to go horribly wrong". The best path is to make the tools intelligent by default; they can gather information from blkid and elsewhere to make the best choice that they can. They will not always make the right decision, and perhaps options can be added for power users. The tradeoffs are likely to be filesystem-specific, Ts'o said, so giving the filesystems hints on what the likely use case is will allow them to do the right thing.
There was some discussion of things that could be done to make filesystems more resilient to size increases. Ts'o listed a few things that ext4 could perhaps do, like moving to 4KB block sizes by default and auto-resizing the journal; Sandeen had mentioned that XFS should perhaps make filesystems with a large number of allocation groups more efficient as a partial workaround. Those things are filesystem-specific and Ts'o said that he was hoping to find ways to address the problems in a more general way. Time ran out on the session without any real solution of that nature, however.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: "rustdecimal"; Fedora 36; RHEL 9; GCC 12.1; GNOME strategy; Python Language Summit; Quotes; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.
