LWN.net Weekly Edition for May 9, 2019
Welcome to the LWN.net Weekly Edition for May 9, 2019
This edition contains the following feature content:
- Inheritance versus composition: a PyCon talk comparing two common design patterns.
- A massive amount of coverage from the 2019 Linux Storage, Filesystem, and Memory-Management Summit, including:
- Issues around discard
- Taking ZUFS upstream
- Improving fget() performance
- The end of the DAX experiment
- Write-protect for userfaultfd()
- The search for available page flags
- Minimizing the use of tail pages
- Presenting heterogeneous memory to user space
- NUMA nodes for persistent-memory management
- Transparent huge pages, NUMA locality, and performance regressions
- Proactively reclaiming idle memory
- Cleaning up after dying control groups
- Remote memory control-group charging
- get_user_pages(), pinned pages, and DAX
- How to get rid of mmap_sem
- The memory-management subsystem development process
- Alignment guarantees for kmalloc()
- Improving access to physically contiguous memory
- Memory management for 400Gb/s interfaces
There is more to come; see the LWN LSFMM 2019 page for the current state of our coverage of the summit.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Inheritance versus composition
The idea of "inheritance" is something that most students learn about early on when they are studying object-oriented programming (OOP). But one of the seminal books about OOP recommends favoring "composition" over inheritance. Ariel Ortiz came to PyCon in Cleveland, Ohio to describe the composition pattern and to explain the tradeoffs between using it and inheritance.
Ortiz is a full-time faculty member at Tecnológico de Monterrey, which is a private university in Mexico. He noted that the title of his talk, "The Perils of Inheritance", sounded like "clickbait"; he jokingly suggested that perhaps he should have named it: "4 dangers of inheritance; you won't believe number 3!". That elicited a good laugh, but he said that clickbait was not his intent.
He has been teaching computer science for more than 30 years, using many languages, including Python. He likes Python and uses it for several courses, including data structures, web development, and compiler construction. He started with Python 2.0 in 2001 or so.
Definitions
In order to proceed, Ortiz said, he needed to define the two concepts at hand. "Inheritance" in the OOP sense starts with one class, known as the base class or superclass, and another class that is related to it. That second class is known as the derived class or subclass. The derived class has various attributes, such as variables and methods, that have come directly from the base class. Those attributes can be overridden in the derived class; new attributes can be added as well.
"Composition", on the other hand, has the composite or wrapper class and a component that is being wrapped, sometimes known as the "wrapee". Inheritance embodies the idea of an "is a" relationship, while composition creates a "has a" relationship.
In Python terms, you might have a Vehicle class that is used as the base for a derived Bicycle class, which would be created as follows:
    class Bicycle(Vehicle):
        pass

So, a Bicycle "is a" Vehicle. An example of composition might be a class Engine that is used by another class:
    class Car:
        def __init__(self):
            self.engine = Engine()

So, a Car "has a" (an) Engine.
![Ariel Ortiz](https://static.lwn.net/images/2019/pycon-ortiz-sm.jpg)
The famous Design Patterns: Elements of Reusable Object-Oriented Software book has suggested favoring composition over inheritance. At the time it was published, over 20 years ago, most OO programmers were favoring inheritance in languages like C++ and Java. But, even all these years later, inheritance is the first and main tool that programmers reach for as part of their code reuse strategy.
Inheritance and composition are two mechanisms for code reuse. Inheritance is "white-box" (or even better, "crystal-box", Ortiz said) reuse. The derived class can see (too) many of the details of the base class. He prefers "black-box" reuse, which is what composition provides. The implementation of the component object is kept as a secret from the composite class. That secrecy is not because the object developer does not want people to know how it does its job, but because knowing those details is irrelevant. The component object can simply be used through its interface without needing to know anything further about its guts.
There are some advantages to inheritance, however. It is the easiest way to reuse code, he said; in Python you just list the base classes that you want to inherit from in the derived class definition. It is also easy to change the inherited implementation and to add to it.
Inheritance has its disadvantages as well, of course. For one, the relationship between the base and derived classes is statically fixed; technically, that is not true for Python because of its dynamic nature, but it is for C++ and Java. Inheritance supports only weak encapsulation, so derived classes can use parts of the base class in ways that the designer did not intend; name collisions can occur as well.
Beyond that, derived classes get everything from the base class, even things that they don't want. As the creator of Erlang, Joe Armstrong, put it: "You wanted a banana but what you got was a gorilla holding the banana and the entire jungle." And, finally, any changes in the base class impact all of the classes deriving from it. So, for example, adding a new method to the base class can cause name collisions all over the hierarchy.
Example
In order to show some of the differences between inheritance and composition, he presented two implementations of a simple last-in-first-out (LIFO) linked list that can be seen in a GitHub repository (and in his slides or the YouTube video of the talk). The code implements LinkedList as one class along with an InsertCounter class that is meant to count all of the insertions done to the list (it ignores removals from the list so the value is not the same as the length).
The example is a bit contrived, perhaps, but it does show that changing the implementation of the insert_all() method (which takes an iterable and adds each element to the list) in the base class actually affects the derived class. It leads to the count being incorrect in the InsertCounter object.
The composition version passes the LinkedList into the initialization of the InsertCounter:
    counter = InsertCounter(LinkedList())

After that, the counter object can be used in code that looks exactly the same as it was for the inheritance version:
    counter.insert(42)
    counter.insert_all([1, 2, 3])
    counter.remove()
    print(f'len = {len(counter)}, counter = {counter.counter}')

The final print statement would yield:

    len = 3, counter = 4
But, as Ortiz pointed out, there is lots that InsertCounter gets for free when it uses inheritance that needs to be explicitly added to the class when it uses composition. Many methods (e.g. remove(), clear(), __len__(), etc.) that came with the LinkedList via inheritance needed to be implemented for the class when it uses composition. On the other hand, the composed InsertCounter was not affected by the changes to LinkedList. But, on the gripping hand, the implementation of insert_all() was, presumably deliberately, set up to be susceptible to this problem; a more realistic example would be hard to fit into the 30-minute slot for the talk.
Ortiz pointed to the forwardable module as a way to avoid having to implement all of that extra code for the composite object (InsertCounter). The module makes it easy to delegate functionality from the composite object to the component. His last change to his example code used forwardable in the InsertCounter.
There are several advantages to composition. The implementation of the component can be chosen at runtime because it can be passed in as part of the initialization of the composite object. Interface changes have a limited ripple effect. If the component interface changes, only the code in the composite object needs to change; users of the composite object are unaffected. With composition, one can create a class that has relationships with multiple components; this is a design pattern known as the façade pattern.
There are, of course, disadvantages to composition. It requires more code than inheritance, as we saw, Ortiz said. It is often more difficult to read the code using composition than it is to read code using inheritance.
His talk was not meant to demonize inheritance; there are good reasons to use it sometimes. In particular, when the base and derived classes are in the same package and are under the control of the same developers, inheritance may well be the right choice. "Inheritance is not wrong", he said; there are many circumstances where it makes sense, especially when there truly is an "is a" relationship.
He ended with a Spanish question that was evidently made popular in a US advertisement: "¿Por qué no los dos?", which means "Why not both?". Inheritance and composition are both useful tools; developers should understand when to use each.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Cleveland for PyCon.]
Issues around discard
In a combined filesystem and storage session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Dennis Zhou wanted to talk about discard, which is the process of sending commands (e.g. TRIM) to block devices to indicate blocks that are no longer in use. Discard is a "serious black box", he said; it is a third way to interact with a drive, but Linux developers have no real insight into what its actual effects will be. That can lead to performance and other problems.
Zhou works for Facebook, where discard is enabled, but the results are not great; there is a fair amount of latency observed as the flash translation layer (FTL) shuffles blocks around for wear leveling and/or garbage collection. Facebook runs a periodic fstrim to discard unused blocks; that is something that the Cassandra database recommends for its users. The company also has an internal delete scheduler that slowly deletes files, but the FTL can take an exorbitant amount of time when gigabytes of files are deleted; read and write performance can be affected. It is "kind of insane" that applications need to recognize that they can't issue a bunch of discards at one time; he wondered if there is a better way forward.
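The fstrim run mentioned above is a thin wrapper around the FITRIM ioctl(); a minimal sketch of that call, with a placeholder mount point and abbreviated error handling, looks like this:

    /* Minimal sketch of what fstrim does: ask a mounted filesystem to
     * discard its unused blocks via the FITRIM ioctl().  "/data" is a
     * placeholder mount point; error handling is abbreviated. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>            /* FITRIM, struct fstrim_range */

    int main(void)
    {
        int fd = open("/data", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Trim the whole filesystem: start at zero, no length limit, no
         * minimum extent size.  On return, len holds the bytes trimmed. */
        struct fstrim_range range = {
            .start  = 0,
            .len    = ~0ULL,
            .minlen = 0,
        };

        if (ioctl(fd, FITRIM, &range) < 0)
            perror("ioctl(FITRIM)");
        else
            printf("trimmed %llu bytes\n", (unsigned long long)range.len);

        close(fd);
        return 0;
    }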
![Dennis Zhou](https://static.lwn.net/images/2019/lsf-zhou-sm.jpg)
An approach that Facebook is trying with Btrfs is to simply preferentially reuse logical-block addresses (LBAs), rather than discarding them. That requires cooperation between the filesystem and block layer, since the block layer knows when it can afford to do discard operations. What the FTL is doing is unknown, however, which could have implications on the lifetime of the device. He is looking for something that can be run in production.
Erik Riedel asked what is special about a 1GB discard versus a 1GB read or write. You are asking the device to do a large operation, so it may take some time. But James Bottomley noted that discard had been sold as a "fast" way to clean up blocks that are no longer needed. Riedel suggested that vendors need to be held accountable for their devices; if TRIM has unacceptable characteristics, those should be fixed in the devices.
Ric Wheeler said that there should be a tool that can show the problems with specific devices; naming and shaming vendors is one way to get them to change their behavior. Chris Mason noted that the kernel team at Facebook did not choose the hardware; it would likely choose differently. This is an attempt to do the best the team can with the hardware that it gets.
Ted Ts'o said that he worries whenever the filesystem tries to second-guess the hardware. Different devices will have different properties; there is at least one SSD where the TRIM operation is handled in memory, so it is quite fast. Trying to encode heuristics for different devices at the filesystem layer would just be hyper-optimizing for today's devices; other devices will roll out and invalidate that work.
The current predicament came about because the kernel developers gave the device makers an "out" with the discard mount flag, Martin Petersen said. If performance was bad for a device, the maker could recommend mounting without enabling discard; if the kernel developers had simply consistently enabled discard, vendors would have fixed their devices by now. Beyond that, some vendors have devices that advertise the availability of the TRIM command, but do nothing with it; using it simply burns a queue slot for no good reason.
Riedel suggested that "name and shame" is the right way to handle this problem. If a device claims to have a feature that is not working, or not working well, that should be reported to provide the right incentives to vendors to fix things.
Bottomley wondered if filesystems could just revert to their old behavior and not do discards at all. Wheeler said that could not work since discard is the only way to tell the block device that a block is not in use anymore. The bigger picture is that these drives exist and Linux needs to support them, Mason said.
Part of the problem is that filesystems are "exceptionally inconsistent" in how they do discard, Mason continued. XFS does discard asynchronously, while ext4 and Btrfs do it synchronously. That means Linux does not have a consistent story to give to the vendors; name and shame requires that developers characterize what, exactly, filesystems need.
The qualification that is done on the devices is often cursory, Wheeler said. The device is run with some load for 15 minutes or something like that before it is given a "thumbs up"; in addition, new, empty devices are typically tested. Petersen said that customers provide much better data on these devices; even though his employer, Oracle, has an extensive qualification cycle, field-deployed drives provide way more information.
Zhou said that tens or hundreds of gigabytes can be queued up for discard, but there is no need to issue it all at once. Mason noted that filesystems have various types of rate limiting, but not for discard. Ts'o said it would be possible to do that, but it should be done in a single place; each filesystem should not have to implement its own. There was some discussion of whether these queued discards could be "called back" by the filesystem, but attendees thought the complexity of that was high for not much gain. However, if the queue is not going to empty frequently, removing entries might be required.
In the end, Fred Knight said, the drive has to do the same amount of work if a block is discarded or just overwritten. Writing twice to the same LBA will not go to the same place in the flash. The FTL will erase the current location of the LBA's data, then write a new location for the block. All discard does is allow the erasure to happen earlier, thus saving time when the data is written.
The problem is that kernel developers do not know what a given FTL will do, Ts'o said. For some vendors, writing the same LBA will be problematic, especially for the "cheap crappy" devices. In some cases, reusing an LBA is preferable to a discard, but the kernel would not know that is the case; it would simply be making assumptions.
To a certain degree, Zhou said, "discard doesn't matter until it does"; until the device gets past a certain use level, writing new blocks without discarding the old ones doesn't impact performance. But then there is a wall at, say, 80% full, where the drive goes past the "point of forgiveness" and starts doing garbage collection on every write. There is a balance that needs to be struck for discard so that there are "enough" erasure blocks available to keep it out of that mode.
That is hard for SSD vendors, however, because they cannot reproduce the problems that are seen, so they cannot fix them, an attendee said. The kernel developers need to work more closely with the device firmware developers and provide workloads, traces, and the like. Wheeler said that reporting workloads and traces is the "oldest problem in the book"; we have to do better than we are doing now, he said, which at the moment means providing nothing at all to the vendors.
Bottomley pushed back on the idea that preferentially reusing LBAs was the right path. If there are blocks available to be written, reusing the LBA is worse, as it will fragment extent-based filesystems. If there is an erase block available, no erase need be done on a write to a new LBA, and if there isn't, it is the same as reusing the LBA; so rewriting LBAs actually compounds the problem.
The exact behavior is specific to the filesystem and workload, however. The sizes of erase blocks are not known to the kernel, or even by kernel developers, because the drive vendors have decided to keep them secret, Bottomley said. So every decision the kernel makes is based on an assumption of one kind or another.
Riedel still believes this is a qualification problem at its core. The right solution is for the drives to do a good job with the expected workloads. But Petersen reiterated that by giving the vendors an out with the discard flag, nothing will ever change. Fixes will "never happen".
The core of the problem is that reads and writes need to happen immediately, Zhou said, while discards can wait a bit without causing much of a problem. But there are different viewpoints on the problem itself, Ts'o said; desktop distributions will be different from servers or mobile devices. Mason noted that if you asked ten people in the room how discard should work, you would get something like 14 answers.
The session wound down without much in the way of resolution. There was talk of some kind of tool that could be used to reproduce the problems and gather traces. There was also talk of rate-limiting discards, but no one wants to do a "massive roto-tilling" of the block layer given all of the unknowns, Ts'o said; he suggested any change be done in an optional module between the filesystems and the block layer, which could be revisited in a few years.
Taking ZUFS upstream
At the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Boaz Harrosh introduced the ZUFS filesystem. At this year's event, he was back to talk about what it would take to merge ZUFS into the mainline. ZUFS, which Harrosh pronounced as both "zoo-eff-ess" and "zoofs", has been running in production for his employer's (NetApp's) customers for some time now, so he wondered if it was something that could go upstream.
ZUFS is the "zero-copy user-mode filesystem". When developing it, NetApp set out to do the impossible: create a high-performance filesystem that runs in user space. It needs to run in user space because there are components, libraries, filesystems, and so on, that may be used but are licensed in ways that are not compatible with the GPLv2 used by the kernel.
![Boaz Harrosh](https://static.lwn.net/images/2019/lsf-harrosh-sm.jpg)
NetApp has shipped ZUFS as a product and customers are happy with it, Harrosh said. It uses persistent memory for performance on the front end; data from there is moved to slower storage over time. That way, customers get the speed of persistent memory, but can store as much data as they want. For optimal performance, the company recommends having 8% of the total storage as persistent memory. A lot of QA testing has been done on the filesystem and customers trust it with their data.
The kernel part of ZUFS is released under the GPL, but the user-space side is broken into two parts. One is a "systemd server" that is open source, released under the FreeBSD (or 2-clause BSD) license, so that it can be used on operating systems such as Windows or FreeBSD. The other piece is a plugin mechanism that allows vendors to register their code with the user-space server. These plugins will implement the filesystems; the plugins can be released under any license, including a proprietary license.
So, he asked, do we want the kernel open-source project part of ZUFS in the mainline kernel? He and his colleagues think it is "very very stable". Harrosh has been developing filesystems for many years, he said; in the past, whenever the filesystem crashed, you would have to reboot the virtual machine it was running in. But for ZUFS, that is all different; if the user-space server crashes, you can just restart it and remount the filesystem.
In ZUFS, the kernel piece is just a broker that provides a fast communication path between the application and the server. A round trip on that path takes 4µs for a simple read or write. For a filesystem made with the Filesystem in Userspace (FUSE) interface, that same round trip takes ten times as long, Harrosh said.
If the project is simply going to be a NetApp pet project, the company will continue to maintain it, he said. But if it is interesting to others, it could go upstream. Jan Kara suggested applying the communication techniques used by ZUFS to FUSE as a way for others to get access, but Harrosh does not think that is possible. ZUFS is a completely different idea that is unlike FUSE or anything else.
Ted Ts'o said that most FUSE filesystems he knows about would not benefit from the ZUFS communication scheme because they are not performant enough. He thought that if ZUFS lived in its own directory, and did not make big changes outside of it, that it could perhaps be merged. That would allow others to experiment with it and for FUSE to perhaps incorporate parts of it. Ric Wheeler pointed out that there are some FUSE filesystems, Gluster and CephFS, for example, that do care about performance.
Harrosh said that the main novelty of ZUFS is that all of the communication is synchronous and is all done on a single CPU. Everything in ZUFS is done on a single CPU; the application grabs a CPU channel and the server registers threads on that CPU, so the server runs on it. It is all completely lockless and there are no copies made of the data, which is directly read from or written to the application buffers—or to/from DMA and RDMA devices.
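As a rough illustration of that design (and only an illustration; none of these names appear in the actual ZUFS code), a per-CPU channel might be modeled as:

    /* Invented illustration of a per-CPU, synchronous ZUFS-style channel;
     * none of these names come from the real ZUFS code. */
    struct zufs_channel_sketch {
        int     cpu;          /* the one CPU this channel is bound to     */
        int     server_tid;   /* server thread registered on that CPU     */
        void   *app_pages;    /* application buffers mapped into the
                               * server; nothing is copied                */
        size_t  len;
        int     opcode;       /* read, write, getattr, ...                */
        int     result;       /* filled in synchronously by the server    */
    };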
Trond Myklebust wondered what made it impossible to incorporate the ZUFS ideas in FUSE and Amir Goldstein asked how a filesystem using libfuse could use libzufs (or its equivalent). Harrosh said that ZUFS is "very incompatible" with libfuse, but is actually compatible with the filesystem code. Harrosh said that making a ZUFS filesystem was much easier than making a FUSE filesystem. He looked into FUSE and hit the performance wall right away. So he did this work and would like everyone to be able to use it.
ZUFS is, of course, much newer than FUSE; he wonders, for example, if there are additional steps he needs to take to ensure that ZUFS is not leaving behind security holes. There are uses for ZUFS beyond simply filesystems, he said, any kind of server can use it; it could be used for SQL server communication over the lockless CPU channels, for example. It is "so fast and so different" that Harrosh thinks getting it in the kernel will cause an appetite to develop.
Steven Whitehouse thought that separating the fast communication mechanism from the filesystem pieces might make for an easier path, at least for the first part, into the kernel. But Harrosh said that he is not sure how to separate the VFS plugin aspect from that of the communication channel. ZUFS "intimately sits" under the VFS and acts as a POSIX filesystem; he doesn't know how to split those things apart.
The general consensus was that use cases will be needed before the communication channel stuff can go in the mainline; Linus Torvalds and others will ask about them. If the code can be used by FUSE that will only help smooth the path, so working with the FUSE developers to see how the two can cooperate would make sense, Harrosh said in summary. In addition, more users beyond just the NetApp filesystem are probably needed.
He asked for examples of the bigger FUSE users; attendees responded that Gluster and the Oracle database filesystem were two. Harrosh said that NetApp has customers using Oracle; using ZUFS allows them to get a 10x performance increase without seeing any difference. He would like to have a workshop or similar at an upcoming conference to show how to write ZUFS servers to encourage others to learn and experiment with the technology. At this point, it may require some marketing to get it upstream, he thought.
Improving fget() performance
The performance of the fget() function in the kernel was the topic of a discussion led by Dave Watson at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). fget() is used to take a reference to a file (i.e. bump a reference count), based on its file descriptor, and to return the struct file pointer for it; references are dropped with fput(). Some recent profiling at Watson's employer, Facebook, found the function to be taking a sizable portion of the CPU time for some applications, so he wanted to talk about some of the things he has tried to make that situation better.
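Conceptually, fget() does very little work: it looks the descriptor up in the task's file-descriptor table and bumps the reference count on the struct file it finds. A simplified, kernel-context sketch of that logic (the real __fget() handles flags and corner cases omitted here, so treat this as an outline rather than the actual code) looks roughly like:

    /* Simplified, kernel-context sketch of the fget() logic; the real
     * __fget() handles O_PATH and other corner cases omitted here. */
    struct file *fget_sketch(unsigned int fd)
    {
        struct files_struct *files = current->files;
        struct fdtable *fdt;
        struct file *file = NULL;

        rcu_read_lock();
        fdt = files_fdtable(files);
        if (fd < fdt->max_fds) {
            file = rcu_dereference(fdt->fd[fd]);
            /* The atomic reference-count bump that shows up in the
             * profiles; fput() drops it again later. */
            if (file && !get_file_rcu(file))
                file = NULL;
        }
        rcu_read_unlock();
        return file;
    }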
Watson found that fget() and fput() were taking up to nearly 3% of the CPU for some services that Facebook runs. The worst was 2.9% for an internal service, but the Memcached distributed key-value store was at 2% and mcrouter, which is used to scale Memcached, was at 1.7%. Various other services ranged from 0.3% to 0.9%.
His first thought was that taking up so much CPU simply to manage the reference count is excessive. But he noticed that the services that seemed most affected were networking services, so he guessed that something in the network stack might be causing this. He focused on Memcached and found that 72.5% of the CPU used by fget() and fput() was coming from the recvfrom() system call. The other two system calls that used significant amounts of CPU time were epoll_pwait() at 11.5% and sendmsg() at 11%.
![Dave Watson](https://static.lwn.net/images/2019/lsf-watson-sm.jpg)
He noted that Memcached is a call-response service; it receives a request for a key's value and sends it back. So sendmsg() is being called as often as recvfrom() but is contributing much less to the problem. His suspicion is that the receive path is taking a bunch of cache misses, but that the send comes relatively soon after it, so the cache will have fewer misses.
He then annotated fget() and found that cache misses on the file-descriptor table and on dereferencing the struct file pointer were taking up much of the CPU time, as did the atomic increment for the reference count. So he tried two different ways to reduce that overhead.
The first was to delay the reclamation of the struct file pointer by not incrementing the reference count in fget() (and not decrementing it in fput()) for some system calls that use a single file descriptor (e.g. recvfrom(), sendmsg(), various epoll*() calls, etc.). The calls are not likely to block, but if they do, the behavior reverts to the existing path. He worked up a proof of concept for this idea, but the results were underwhelming, so he does not recommend going down that path.
His second attempt tried to get at the cache misses that were causing much of the CPU use by creating a cache of struct file pointers on a per-task basis. Multiple file pointers in the file-descriptor table share the same cache line; when any of them is changed, even by an unrelated thread, there is cache-line bouncing. For Facebook, the file descriptors stay with the same thread once the "accepter" thread hands them off to a processing thread, so the cache eliminates the processor cache misses and the performance loss that went along with them. It is not a fancy cache as, once again, he just wanted to see if it helped. There is a complication, however: it is not clear how to flush the cache, he said, but if close() is to work correctly, the cache must be flushed. He ended up adding a field in struct file that pointed at the cache entry so that close() could do the right thing.
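To make the idea concrete, a per-task cache along those lines might look something like the sketch below; the structure, sizes, and function are invented for illustration and are not Watson's actual proof-of-concept code.

    /* Invented sketch of a small per-task cache of struct file pointers,
     * along the lines of the proof of concept described above; this is
     * not the actual patch. */
    #define FILE_CACHE_SLOTS 16

    struct task_file_cache {
        unsigned int fds[FILE_CACHE_SLOTS];     /* cached descriptors */
        struct file *files[FILE_CACHE_SLOTS];   /* cached pointers    */
    };

    static struct file *fget_cached(struct task_file_cache *cache,
                                    unsigned int fd)
    {
        unsigned int slot = fd % FILE_CACHE_SLOTS;

        /* Fast path: this task has already looked the descriptor up, so
         * the shared file-descriptor table is not touched at all. */
        if (cache->files[slot] && cache->fds[slot] == fd)
            return cache->files[slot];

        /* Slow path: do the normal lookup and remember the result.
         * close() must invalidate this slot; the proof of concept added a
         * back-pointer in struct file for exactly that purpose. */
        cache->fds[slot] = fd;
        cache->files[slot] = fget(fd);
        return cache->files[slot];
    }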
Overall, his proof-of-concept seems to work well. Most of the overhead from cache misses for the file-descriptor table is gone; that gives a roughly 2x performance increase. He has not thought about handling the cache misses for the struct file pointer dereferencing, but that could be next.
Jan Kara wondered if the file-descriptor table bouncing could be handled by allocating file descriptors in a way that causes separate threads to use different parts of the table. Some applications may depend on sequential file-descriptor allocation, however, which would not mesh well with that approach. It would potentially stop the cache-line bouncing of the table, though. Matthew Wilcox suggested that the scheme could be prototyped using dup2().
Wilcox also suggested that moving to a process-based, rather than thread-based, model for these services would be another way to avoid some of the problems that Facebook is experiencing. Watson said that would essentially be impossible. The idea of a per-thread file-descriptor range is worth experimenting with, however.
The end of the DAX experiment
Since its inception, the DAX mechanism (which provides for direct access to files stored on persistent memory) has been seen as somewhat experimental and incomplete. At the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Dan Williams ran a session where he said that perhaps the time has come to end that experiment. Some of the unimplemented DAX features may never actually need to be implemented, and it might just be possible to declare DAX finished. But first there are a few more details to take care of.
He started with the general question of what DAX actually means; he defined it as the mechanism that is used to bypass the page cache and map persistent memory directly into user space. There are obvious performance advantages to doing this, but it causes problems as well: the virtual address given to a page of data is also the address of the permanent storage for that data. Many parts of the system were not designed with that in mind, so various features have to be turned off when DAX is in use, and others must be bypassed. It gives the whole subsystem the feel of a permanent experiment, and that makes people not want to use it.
Within the kernel, there are a few ways to know that DAX is involved with any given memory mapping. A call to vma_is_dax() is a way of asking whether a given page is located in persistent storage. Then, there is vma_is_fsdax(), which is similar except it says that some other agent could break that association at any time. It is used to prevent certain long-term mappings to DAX memory. The PTE_DEVMAP page-table flag notes that a given page's lifetime is tied to that of an underlying device; users have to take extra references to that device to keep it from going away. Outside of the kernel, though, there is only one sure indication that DAX is in use: the MAP_SYNC flag to mmap(). If a mapping operation with that flag fails, persistent memory is not present.
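That MAP_SYNC check can be made from user space with a mapping attempt along the lines of the sketch below; the file path is a placeholder, and the fallback flag definitions are only needed with older C libraries:

    /* Probe for DAX by attempting a synchronous mapping.  The path is a
     * placeholder; MAP_SHARED_VALIDATE makes the kernel reject flags it
     * does not understand rather than silently ignoring them. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MAP_SHARED_VALIDATE
    #define MAP_SHARED_VALIDATE 0x03
    #endif
    #ifndef MAP_SYNC
    #define MAP_SYNC 0x80000
    #endif

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/pmem/testfile";
        int fd = open(path, O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (addr == MAP_FAILED)
            printf("MAP_SYNC failed: no DAX mapping available here\n");
        else
            printf("MAP_SYNC succeeded: DAX is in use for this file\n");

        close(fd);
        return 0;
    }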
What do we need to finish the experiment? There are still some semantics to figure out, Williams said. There are a lot of applications that are not interested in persistent memory at the level of performing individual cache-flush operations, but they do want to know whether any given mapping involves the page cache. Jan Kara added that such applications are trying to optimize the amount of memory they use; they tend to be programs like a database manager that knows it has the system to itself. If a given file is mapped through ordinary memory, the application must allow for the page cache and reduce its own memory use. If DAX is available, the application can use more memory for its own purposes. Dave Hansen suggested that what is really needed is a way to ask directly about the performance characteristics of a memory mapping.
Williams continued by asking whether there might be a need for a MAP_DIRECT option that asks the kernel to minimize the use of the page cache. Filesystems offer both buffered and direct APIs, with the latter avoiding the page cache; perhaps memory management should do the same. The problem is that these semantics might break with upcoming filesystems like NOVA, which do not use the page cache but also do not offer direct mappings. Applications are often not interested in the "direct" part, but they do care about page-cache usage.
Michal Hocko jumped in with a different point of view: the real problem, he said, is that the page cache simply sucks. It is an implementation detail in current kernels, but might well prove not to be a long-term concern. Rather than adding new APIs to enable applications to get around the page cache, we should just make caching work properly. That would help to eliminate a lot of nasty application code. This suggestion, while interesting, does not solve the short-term problem and so was set aside.
Getting back to DAX, Williams noted that there are a lot of features in the kernel that are not implemented for the DAX case; many places in the code read something like:
    if (dax)
        goto fail;
These include long-term mappings with get_user_pages(), which can pin down DAX pages indefinitely. There are some problems with reflink() functionality, a problem that was set aside for the filesystem track.
There are also problems with private mappings and DAX; pages created via copy-on-write will end up in ordinary RAM rather than persistent memory, which may not be what users want. Some may prefer that these copies remain in cheaper memory. Hansen suggested that this problem could be solved with NUMA policies. Williams said that the right solution might be to call back into the underlying filesystem to inform it of a copy-on-write fault and make it allocate a new page to handle it; the filesystem would then have to track those pages specially until they are released. Kara said that he didn't really see a use case for this feature, though. Mel Gorman described it as "difficult semantics and non-obvious behavior for a non-existent use case".
Time for the session ran out about then. Williams closed by saying that, perhaps, it is not necessary to implement all of those APIs for the DAX case until users scream about their absence. That would allow the DAX experiment to be declared done, more or less, for now.
Write-protect for userfaultfd()
The userfaultfd() system call allows one process to handle page faults for another — in user space. Its original use case was to support transparent container migration, but other uses have developed over the years. At the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Andrea Arcangeli described a scheme to add write-protection support to userfaultfd(). After a year of lost time fighting speculative-execution problems, Arcangeli is about ready to move this feature into the mainline.

The core idea behind userfaultfd() is that, when the process of interest incurs a page fault due to a non-present page, the monitoring process receives a notification and is given the opportunity to provide a page to satisfy that fault. Once pages are present, though, the monitoring process is no longer involved. The new proposal changes that by allowing the monitoring process to mark pages as being write-protected, even though the owning process has write permission to them. That allows writes to be intercepted, and some extra processing performed.
There are a number of use cases for this feature, Arcangeli said. They include:
- Live snapshotting of running processes. QEMU would like this feature so that it can be sure of copying pages in a stable state.
- The Redis database system performs checkpointing by forking, then writing out its memory in the child process, taking advantage of the kernel's copy-on-write behavior to keep the memory stable while it is being written. Among other things, this technique requires disabling transparent huge pages in order to work, but that has a performance impact. It also potentially doubles memory use, since the parent could write to all of its data pages while the child is running. The write-protect feature could eliminate the need to fork to get a stable set of pages to write.
- There are schemes for implementing distributed shared memory that could use it to detect (and distribute) changes. This can, in theory, be done now by using mprotect() and handling segmentation-violation signals, but it's slow and creates vast numbers of virtual memory areas in the kernel.
- The soft dirty feature, used by the checkpoint-restore in user space (CRIU) mechanism, could be replaced by this write-protect mechanism. It would be far more efficient, since it would eliminate the need to scan through the page tables.
- Similarly, it could be used to replace the dirty bits used by language runtime systems, such as the Java virtual machine, to catch changes.
Many other cases that use mprotect() now would likely run faster with userfaultfd(), which never needs to acquire the highly contended mmap_sem lock for writes.
Processes wanting to use this feature must call userfaultfd() to obtain a file descriptor, then perform an UFFDIO_REGISTER ioctl() call with the UFFDIO_REGISTER_MODE_WP option. Then UFFDIO_WRITEPROTECT operations, which supply a start address and a size, can be used to change the write-protect bit on one or more pages. Events will show up with the UFFD_PAGEFAULT_FLAG_WP flag set. For now, write protection only works with anonymous pages.
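In code, that sequence looks roughly like the sketch below; the structure and constant names follow the proposed patches and were not yet part of mainline headers at the time, so treat this as illustrative:

    /* Sketch of the write-protect flow described above.  The constants and
     * structures follow the proposed patches and were not yet in mainline
     * headers at the time; error handling is omitted. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    static int write_protect_region(void *addr, size_t len)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        /* Register the region in write-protect mode. */
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)addr, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* Write-protect the pages; subsequent writes show up as events on
         * uffd with UFFD_PAGEFAULT_FLAG_WP set, and a later call with the
         * mode bit cleared removes the protection to resolve the fault. */
        struct uffdio_writeprotect wp = {
            .range = { .start = (unsigned long)addr, .len = len },
            .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

        return uffd;
    }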
There are some loose ends that still need to be worked out. For example, the write-protect feature requires a new bit in the page-table entries; it has taken the last available bit for this purpose, which may not be entirely welcome. Arcangeli said that it may be possible to find an alternative to using that bit later on. Dave Hansen said that there might just be a couple of other bits available if one looks hard enough, but Mel Gorman warned that they might not be available on all architectures. Some of the apparently unused bits have been claimed by CPU vendors for uses that have not yet been made public, it seems.
Another issue is that page-fault handlers can fail with a VM_FAULT_RETRY return code, indicating that the memory-management subsystem should restart the process of handling the fault from the beginning. But that can only happen twice for any given fault before the whole thing fails, sending a (probably fatal) signal to the faulting process. This is a problem for the write-protect feature, which can generate more retries than that. Ideally, an unlimited number of retries would be allowed, Arcangeli said.
Currently privilege is required to use userfaultfd(), but the desire is to make it available for unprivileged use as well; a sysctl knob would be used to control whether privilege is needed or not, allowing "paranoid" system administrators to restrict its use. Jérôme Glisse suggested that perhaps the seccomp mechanism would be a better way to control access to this feature.
At the end of the session, Arcangeli asked whether the patches need more review before being merged. "Always" was the inevitable answer from memory-management subsystem maintainer Andrew Morton.
The search for available page flags
Among the many other things crammed into the page structure that is used to represent a page of memory in the kernel is a set of flags to track the state of the page. These flags have been in short supply for some time; LWN looked at the problem nearly ten years ago. Jérôme Glisse ran a session during the memory-management track of the 2019 Linux Storage, Filesystem, and Memory-Management Summit to explore ways of making some flags available for new uses. While there may be some easily available bits in the field that holds the page flags, obtaining a significant number of them may be tricky.

Glisse is looking for a way to add a new, generic page-protection bit in order to implement a feature like kernel same-page merging (KSM) for file-backed pages. There is a flag for KSM now, but it's already overloaded with the PageMovable flag, which is relevant for file-backed pages. So, he asked, where might he be able to locate another flag that could be used for this purpose?
His attention was drawn to the general area of memory reclaim, which uses many of the available flags. The PageActive, PageIsolated, PageLRU, PageMovable, PageReclaim, PageReferenced, PageUnevictable, and PageWorkingset flags are all tied to reclaim in one way or another. Often, when that many flags are associated with an activity, one can discern patterns of flags that are either always set together or which never appear together; that can allow the replacement of flags with a more efficient representation. In this case, though, he found that almost all combinations of those flags are possible and valid, so there is no real opportunity to merge any of them. To further complicate matters, there is no single lock that protects access to all of those flags, so combining them would require difficult (or impossible) locking changes.
Having failed to reclaim any of the reclaim-oriented flags, Glisse turned his attention to the node ID and zone number, both of which are stored in the same word as the page flags. The node ID is frequently used in hot paths, so trying to move it out of the page structure is likely to create performance issues. The zone number is not quite as hot and might be a candidate to be relocated. Indeed, in some configurations this information already gets pushed out of the page structure, so there might be some promise here.
Some low-hanging fruit might also be found in the width of the node ID field; it is sized to hold the maximum number of NUMA nodes that the kernel is configured to support — typically 1024. But there are not all that many 1024-node systems out there; that sizing was described as a holdover from the days of SGI's huge systems. So it may be possible to gain a bit or two there, though probably not more than that. It is also possible to calculate the node ID from the zone number, eliminating that field altogether, but that would add overhead to some of the hottest page-allocator paths, which would be unwelcome.
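The packing being discussed can be seen in how those values are extracted from the flags word; the following is a simplified, kernel-context illustration in which the *_sketch names stand in for the real page_to_nid() and page_zonenum() helpers, and the shift and mask values depend on the kernel configuration:

    /* Simplified, kernel-context illustration of how the node ID and zone
     * number are packed into the flags word; the real helpers are
     * page_to_nid() and page_zonenum(), and the shift and mask values
     * depend on the kernel configuration. */
    static inline int page_to_nid_sketch(const struct page *page)
    {
        return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
    }

    static inline enum zone_type page_zonenum_sketch(const struct page *page)
    {
        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
    }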
There was some talk of getting rid of either PageIsolated or PageMovable, both of which are optimizations for information that can be had elsewhere. PageIsolated is there to keep two threads running compaction from interfering with each other, a situation that, according to Hugh Dickins, cannot happen anyway.
Looking at some of the other bits, Matthew Wilcox observed that the PageError, PageSlab, and PageHWPoison flags are incompatible with each other and could perhaps be unified in some way. Indeed, it seems that there is no need for a flag for PageHWPoison (which marks a page that has been taken out of service due to hardware errors) at all. To the extent that the kernel needs to reference that state, it does not need to happen quickly, but most of the time such a page should be out of the kernel's view entirely.
Dave Hansen suggested creating a concept of "fast" and "slow" page flags, where the slow ones are stored outside of the page structure. A single fast bit could be set when any of the slow ones are, hopefully eliminating the need to actually check the slow bits most of the time. Mel Gorman, instead, suggested that some kernel features could be made dependent on whether sufficient page flags are available; if the kernel is configured to support a large number of NUMA nodes, perhaps it would be unable to support KSM. But, as Hansen reminded the group, there are not many bits to be had via that path.
Another option is to push more data (flags or something else) out to the page_ext structure, which already exists to hold information that won't fit into struct page. Hansen worried, though, that this would make it too easy to bloat the amount of per-page data stored, which already occupies a significant fraction of the system's memory. Without a hard constraint, that data could easily get out of hand.
The final suggestion aired in this session was to create an XArray to hold pages in relatively rare states (such as PageIsolated). They could be stored using their page-frame number and searched for when the need arises. Whether any of these suggestions will be implemented remains to be seen, though.
Minimizing the use of tail pages
Compound pages are created by the kernel as a way of combining a number of small pages into a single, larger unit. Such pages are implemented as a single "head page" at the beginning, followed by a number of "tail pages". Matthew Wilcox has concluded that it would be beneficial to minimize the use of tail pages in the kernel; he ran a session during the memory-management track at the 2019 Linux Storage, Filesystem, and Memory-Management Summit to explore how that could be done. The discussion ranged widely, veering into the representation of DMA I/O operations, but few hard conclusions were reached.

The most common use for compound pages is to represent huge pages, as created by the transparent huge pages or hugetlbfs features. The slab allocators can also create compound pages to satisfy larger allocations. Since compound pages can be thought of as simply being larger pages, they can usually be managed using just the head page. There are times, though, when the tail pages come into play. Most often, something (a page fault, perhaps) will create a reference within a larger page; the tail page is then used to locate the head page, which is where most of the relevant information is stored.
To make things as transparent as possible, many places in the kernel use the compound_head() function to ensure that they are looking at a head page, even when the page in question may not be a compound page. That is overhead that, perhaps, does not need to be incurred. There is a READ_ONCE() memory barrier in it, Wilcox pointed out; even if the overhead is small, it is greater than zero and would be nice to get rid of.
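For reference, compound_head() itself is small; a tail page stores a pointer to its head page with the low bit set as a tag, so the function is little more than that READ_ONCE() and a test (shown here lightly adapted from the kernel's version):

    /* Lightly adapted from the kernel's implementation: a tail page keeps
     * a pointer to its head page in compound_head, tagged by setting
     * bit 0. */
    static inline struct page *compound_head(struct page *page)
    {
        unsigned long head = READ_ONCE(page->compound_head);

        if (unlikely(head & 1))
            return (struct page *)(head - 1);
        return page;
    }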
There are a number of kernel functions that can yield a reference to a tail page, including virt_to_page(), phys_to_page(), and find_get_entry(). If those functions were to, instead, give a pointer to the head page, the calls to compound_head() could be eliminated. In the case of virt_to_page(), one can simply call virt_to_head_page() instead when one knows that the result could otherwise be a tail page — problem solved. There is no phys_to_head_page(), but perhaps that should be added.
A trickier problem is the page cache; if a transparent huge page is stored there, all of the tail pages are stored along with the head page. Wilcox has a patch to store only the head page, making things far more efficient in general, but there are a lot of related functions that can return tail pages; these include find_get_page(), find_get_pages_range(), and others. There is thus a need for replacements that only return head pages, followed by the usual effort to track down and fix all of the callers.
Even trickier is the problem of dealing with page tables; this is what Wilcox described as the "controversial bit". He would like to introduce a replacement for get_user_pages() called get_user_head_pages(); it would only return the head pages for the region of memory mapped into the kernel. Callers would not be able to make assumptions about the size of the pages, and would have to be more careful in how they work with those pages. This function would, as a result, be harder to use than the function it replaces (which is already not entirely easy), it is more complex, and there are a lot of callers to fix. He is not sure when or even if such a transition would ever be completed. Additionally, as Kirill Shutemov pointed out, callers may really want the tail page sometimes, so it may not be possible to make this change at all.
A second option is thus worth considering. Wilcox asserted that most callers of get_user_pages() have, as their real goal, the creation of a scatter/gather list for DMA I/O operations — an assertion that Christoph Hellwig immediately disputed. Presenting a list of files containing get_user_pages() calls, Wilcox asked an attendee to pick one at random; digging into that file, he showed that the results of the call were, indeed, used to create a scatter/gather list. Hellwig then went into a characteristic exposition on why the current struct scatterlist data structure is the wrong solution to the problem, since it mixes DMA information with the host representation of the memory and creates confusion when memory areas are coalesced on the DMA side.
A replacement for the current scatter/gather list representation may be on the horizon, but that is somewhat orthogonal to the idea that Wilcox was getting at here. Rather than having kernel code call get_user_pages() and using the result to create a scatter/gather list, it might be better to implement a get_user_sg() function that would do the full job. That function could then hide any dealings with tail pages while simultaneously simplifying callers.
If struct scatterlist is to be eliminated, get_user_sg() could still be implemented in some form. The leading contender for a replacement structure appears to be struct bio_vec. This structure has its origins in the block layer, but is widely used in the rest of the kernel at this point; the networking code was described as the biggest remaining holdout. A bio_vec can describe the host side of an I/O operation; the device side would need a different structure that, perhaps, is more specific to the hardware involved.
In any case, almost everything discussed in this session is theoretical until patches appear on the mailing lists. Until then, the kernel will continue to deal with tail pages (and scatter/gather lists) the way it has for years.
Presenting heterogeneous memory to user space
Computer memory architecture is growing more complex over time, with different types of memory attached to a CPU via a number of paths. The kernel development community is duly working to make this memory available to user space in an equally diverse set of ways. Two sessions at the 2019 Linux Storage, Filesystem, and Memory-Management Summit presented possible mechanisms and APIs to allow programs to work with the types of memory they need.
The heterogeneous memory mapper
One upcoming development is the "heterogeneous memory mapper" (hmmap), which was presented by Adam Manzanares in a combined storage and memory-management session. Hmmap is implemented as a character device that can map different kinds of memory into a process's address space. It is intended to provide both direct and cached access, possibly at the same time (though via different address ranges). Cache management is flexible, with pluggable strategies; the page cache is good for most workloads, Manzanares said, but some may prefer alternatives. The actual cache management to use can be specified by the user. Internally, hmmap is implemented in two layers, one handling caching strategy and one for low-level access to the underlying media.
Why might one want this? It provides flexibility for different device types, allowing users to mix and match technologies like DAX, RDMA, and basic block devices. It provides the pluggable caching strategies, though he allowed that he's not fully convinced that this feature is needed. It is an alternative to supporting DAX via NUMA nodes that does not require hardware support. It also provides a clear path to persistent storage, something that is more implicit with the NUMA approach.
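From user space, the hmmap model would presumably look something like the following sketch; the device node name and mapping size are invented for illustration, since the interface is still only a proposal:

    /* Hypothetical user-space view of hmmap: open the character device and
     * map part of it.  The device node name and size are invented; the
     * interface is still only a proposal. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/hmmap0", O_RDWR);      /* invented device node */
        if (fd < 0)
            return 1;

        /* Map 1GB of the backing media; whether accesses go through a
         * cache or hit the media directly depends on the configured
         * strategy. */
        size_t len = 1UL << 30;
        void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED)
            return 1;

        /* ... use mem like ordinary memory ... */
        munmap(mem, len);
        close(fd);
        return 0;
    }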
Manzanares was quickly accused of reimplementing the page cache, a criticism that he partially accepted. Perhaps, he suggested, some of the features provided by hmmap could be put into the page cache itself instead. James Bottomley noted that one advantage of hmmap is that it can be shrunk more quickly than the normal page cache can be; perhaps a policy could be developed to exploit that capability.
Mel Gorman said that he was having a hard time seeing the use case for hmmap. Existing kernel functionality, like dm-cache or bcache, can be used already for many storage-acceleration applications. One potentially interesting use case could be a device with directly accessible memory that could be put to use as additional RAM on the system; data could be moved to and from ordinary RAM on demand. This device would differ from others in that the data would not be stored in any persistent media.
With regard to dm-cache in particular, he said, it seems that it can handle part of this use case; it works like a page cache, moving pages between faster and slower devices. It is inefficient for some kinds of workloads, though, where it ends up touching large amounts of memory. Persistent memory is better for moving smaller amounts of memory efficiently; it could benefit from a software layer that can take advantage of that.
Other potential uses are harder to lay out. Gorman suggested that Manzanares should enumerate the use cases for this feature and explain why currently available solutions are not good enough. He suggested that hmmap is an implementation of dm-cache with a different API. As the session ended, Manzanares said that he would look more deeply into dm-cache and outline any changes that would be needed to make it more widely applicable.
hbind()
The other proposal was described by Jérôme Glisse in a memory-management track session on the final day. Glisse has been working with heterogeneous memory for some time; his focus at the moment is on memory that is not necessarily directly accessible by the CPU and which may not be cache coherent. Device drivers tend to want complete control over such memory; pinning of pages by the CPU cannot generally be supported.
There has been a lot of talk about supporting memory types using the NUMA abstraction, Glisse said, but that approach has some limitations. Applications might start using device memory without understanding its limitations, leading to data corruption or other unfortunate consequences. His solution is to make the use of this memory something that applications opt into explicitly, specifying the type of memory they are looking for.
Applications would do this with the new hbind() system call, which was posted for review several months ago. This call exists to unify access to different types of device memory; that access is generally available now, but it requires a different ioctl() call for each device type. Rather than forcing applications to support a range of calls, Glisse would like to provide a single call that works for everything. hbind() would be like mbind(), but it would work with device memory as well as ordinary, CPU-attached memory.
There were a lot of questions about just what the semantics of this new system call would be; Glisse described it as a request to migrate content to device memory while keeping it accessible. Michal Hocko said that he would rather have a better description of the semantics than the implementation details Glisse was giving. He asked what would happen if multiple users request more memory than a device can provide; how can one guarantee access for the most important process? Glisse replied that the answer would be device-specific. Hocko complained that the interface is insufficiently defined in general; if he has to maintain that API forever, he said, he wants specifics.
Dave Hansen said that it looked like Glisse is creating a parallel interface to the system's NUMA functionality; is NUMA really not good enough to solve this problem? Aneesh Kumar said, though, that it wasn't possible to allow the kernel to manage this memory, since things that the kernel wants to do (page pinning, for example) cannot be supported. In the end, though, Hansen replied, users want to allocate this memory with malloc(), which ends up involving the kernel anyway.
Glisse talked briefly about how applications would discover which device memory is available on a system. There would be a new directory called /sys/devices/system/hmemory, with one entry per resource. Each entry would give the size of the memory region, a link to the device, and describe any special properties that the memory has.
In the last minutes of the session, Gorman observed that hbind() looks like a combination of the existing mmap() and set_mempolicy() system calls; perhaps applications should just use those instead. Hansen added that there will be NUMA nodes for devices providing memory to the system, on x86 systems at least. The new interface essentially makes those NUMA nodes go away, which is suboptimal. Glisse responded that he wants to provide access to memory that is noncoherent or not directly accessible to the CPU; NUMA can't handle either of those.
Gorman then suggested that move_pages() could be used to place data in the more exotic types of memory; Glisse said that there is no NUMA node to use for that purpose, but the developers pointed out that one could always be added. Or, perhaps, move_pages() could grow a new flag to indicate that a device ID is being specified rather than a NUMA node. The final conclusion seemed to be that the move_pages() option should be investigated further.
NUMA nodes for persistent-memory management
While persistent memory is normally valued for its persistence, there is also a subcurrent of interest in using it in settings where persistence is not important. In particular, the fact that this memory is relatively inexpensive makes it appealing to use instead of ordinary RAM in budget-conscious settings. At the 2019 Linux Storage, Filesystem, and Memory-Management Summit, two sessions in the memory-management track looked at how the kernel's NUMA mechanism could be pressed into service to manage non-persistent uses of persistent memory.
Persistent memory as RAM
Yang Shi led the first session, which was focused on a scheme to use persistent memory configured into CPUless NUMA nodes as RAM for low-cost virtual machines. The idea seems to have some merit, but there are a number of details to be worked out yet.
The motivation behind this work is simple enough: offer cheaper virtual
machines for customers who are willing to put up with slower memory. It is
an idea that has been making the rounds for
some months already. Ideally, it would be possible to give each virtual
machine a certain amount of RAM for its hottest data, while filling in the
rest with persistent memory, maintaining a set ratio of the amounts of each
type of memory. Virtual machines would preferentially use their available
RAM up to their quota, after which data would spill over into persistent
memory; the kernel would migrate data back and forth in an attempt to keep
the most frequently used memory in RAM.
Needless to say, there are some challenges involved in implementing a scheme like this. Workloads can have random and frequently changing access patterns that are hard to keep up with. Maintaining the right ratio between the two types of memory will require continuous scanning of the virtual machine's address space. It's even hard to define a service-level agreement so that customers know what sort of performance to expect.
Michal Hocko asked what sort of API was envisioned to control this memory-type ratio. There is no control planned for customers; administrators would have a /proc file where the ratio could be set. Customers would be allowed some influence over where data is placed by using the madvise() system call to indicate which pages should be in fast memory. Dave Hansen asked about that aspect of things; given that the memory types will be split into different NUMA nodes, the existing NUMA mechanisms could be used to control placement. Mel Gorman agreed, saying that, if the kernel's NUMA awareness could be augmented to allow the specification of a percentage-based allocation strategy, the needed flexibility would be there.
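As an illustration of the existing NUMA mechanisms that Hansen and Gorman referred to, a process can already express a preference for the fast (DRAM) node with set_mempolicy(); this is only a sketch, and the node numbering is invented for the example.
    /* Sketch: prefer the DRAM node for allocations, spilling over to other
     * nodes (such as a CPUless persistent-memory node) under pressure.
     * Node 0 as "the fast node" is an assumption; link with -lnuma.
     */
    #include <numaif.h>

    int prefer_fast_memory(void)
    {
        unsigned long fast_nodes = 1UL << 0;    /* assume node 0 is DRAM */

        /* MPOL_PREFERRED falls back to other nodes when the preferred one
         * is full, which is roughly the spill-over behavior described. */
        return set_mempolicy(MPOL_PREFERRED, &fast_nodes,
                             sizeof(fast_nodes) * 8);
    }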
Hocko then wondered about explicit placement of pages, something that user space can request now. If the kernel then tries to shuffle things around, the situation could get messy. He would like to avoid reproducing the sort of problems we see now with CPU sets conflicting with explicit NUMA placement, which he described as a "giant mess".
Gorman suggested that perhaps the memory controller could be augmented with per-node counters, though it would be a non-trivial problem to get right. It's not clear how memory reclaim could be handled. But it is important to realize, he said, that there are two different problems to be solved here: implementing the memory ratio and migrating pages between memory types. The first step is to get the accounting right so that the ratio can be implemented; after that, it will make sense to worry about locality. Proper accounting will require tracking memory usage on a per-node, per-process (or, more likely, per-control-group) basis, which the kernel doesn't do now. When this is implemented, he said, it's important to not make it specific to persistent memory; this mechanism could be used to express policies for different classes of memory in general.
The discussion wandered around the accounting issue for some time without making any real progress, but it was agreed that getting the accounting working was an important first step. Control groups already have some of the needed support; finishing the job should be feasible. The data-migration task might prove harder, though; Hocko said that it would probably have to be implemented in user space. Gorman added that, once the accounting is available, the kernel provides everything that is needed to implement a brute-force migration solution in user space, though he described it with terms like "brutal" and "horrible".
That still leaves the problem of kernel memory, though, which is not accounted in the same way and which cannot generally be migrated at all, much less from user space. Where, Hansen asked, would the dentry cache live on such a system? Putting it into slow memory would be painful. Hansen argued for supporting placement in a general way, for now. He also noted that the memory situation is becoming more complicated; slow memory may have fast caches built into it, for example. Christoph Lameter mentioned the possibility of high-end memory of the type currently found on GPUs coming to CPUs in the near future. The kernel will need as much flexibility as possible to be able to handle the more complex memory architectures on the horizon.
The discussion returned to the ratio-enforcement problem, with Gorman repeating that control groups offer some of the needed statistics now. Hocko agreed, saying that this support is not complete, but it's a place to start. The charging infrastructure can be made more complete, and kernel-memory accounting could be added. If the result turns out not to be usable for this task, at least it would be a starting point from which developers could figure out what the real solution should look like. Gorman cautioned against premature optimism, noting that previous ideas along these lines have not succeeded.
This kind of policy would bring some new challenges to the accounting code, Hocko said. The current memory controller works by first allocating memory in response to a request, then attempting to charge the memory to the appropriate control group. If the charge fails (the group is above its limit), the newly allocated memory is freed rather than handed over and the allocation request fails. In a ratio-based scheme, a charging failure would have to lead to an allocation being retried with a different policy instead.
Gorman said that might not necessarily be the case; instead, if a group has moved away from its configured ratio, memory could be reclaimed from that group to bring things back into balance. There would be problems with such a solution, but it would be a place to start. Dan Williams, instead, suggested a scheme where allocations come from a random node, with the choice being biased toward the desired ratio. The session ran out of time at this point, ending with no conclusions but with a number of ideas for developers to try.
Persistent memory in NUMA nodes
At this point, leadership of the discussion shifted over to Hocko and the topic moved to the use of the NUMA infrastructure for the management of persistent memory in general. There are schemes floating around to do things like configure persistent memory as a "distant" NUMA node and implement proactive reclaim algorithms that would automatically move cold pages over to that "node", which would be hidden from the memory-management subsystem otherwise. Many of these proposals, Hocko said, make the mistake of treating persistent memory as something special — an approach that he has been pushing back on. They risk creating a solution that works for today's technology but not tomorrow's.
Instead, he would really rather just think in terms of NUMA workloads; the
system is already built to handle memory with different performance
characteristics, after all. Perhaps all that is really needed is to create a
new memory-reclaim mode that migrates memory rather than reclaiming its
pages; the NUMA balancing code could then handle the task of ensuring that
pages are in the right places.
Hansen wondered how the ordering of nodes would be handled. Currently, NUMA balancing works in terms of distance — how long it takes to access memory on any given node. But that might not be optimal for the sorts of things people want to do with persistent memory. Perhaps the kernel needs to create a fallback list of places a page could be put, using the node distance as the metric initially. Gorman, though, said that was too complicated; memory placement should fall back to exactly one node at the outset. That would avoid traps like page-migration cycles, where pages would shift around the system but never find a permanent home.
Hocko asked if the migration-based reclaim mode seemed like a good idea; it could be implemented in a simple form, without big changes, to see how well it works. All that would be needed, he said, would be a hook into one place in the reclaim code. Gorman thought it made sense, but he repeated that there should only be one node to which pages could be migrated. If all nodes in the system were on a fallback list, a page would have to move through every possible option — each RAM-based node and each persistent-memory node — before actually being reclaimed. It would be necessary to maintain the history of where each page has been, and would be likely to disrupt other workloads on the system.
Hansen thought that perhaps the problem could be addressed by creating directional paths between nodes, so that pages would only migrate in one direction. Hocko said that there was no real need for that, since the kernel would not move pages from a CPUless persistent-memory node to a RAM node, but Gorman argued that this was an unfortunate application of special-casing. One should not, he said, assume that persistent-memory nodes will never have CPUs in them.
Once migration is solved, there is still the problem of moving pages in the other direction: how will frequently used pages be promoted to fast memory? Gorman said that the NUMA-balancing code does that now, so things should just work, but Rik van Riel argued that NUMA balancing would not work in this case; when pages are stored on a CPUless node, all accesses will be remote, so NUMA balancing will naturally pull every page off of that node. Gorman conceded the point, but said that this behavior might be reasonable; pages that are being used should move to faster memory. Van Riel responded that there is no way to distinguish between occasional and frequent uses.
Hocko was less worried about this scenario. A page that has found its way to a persistent-memory node has been demoted there by the memory-management code, meaning that it fell off the far end of the least-recently-used list. That means it hasn't been used for a while and is unlikely to be accessed again unless the access pattern changes. But Van Riel was still worried that the system could thrash through the demoted memory, and that some sort of rate limiting might be needed. Gorman said that there may be a need to alter the two-reference rule that is currently used to move pages between NUMA nodes, but that is not a radical change; the idea of using NUMA balancing is still sound, he said.
The session wound down with a general agreement that the migrate-on-reclaim mode was an idea worth pursuing, and that relying on NUMA balancing to promote pages back to RAM should be tried. Once the simplest solution has been implemented it can be observed with real workloads; developers can go back and improve things if this approach behaves poorly.
Transparent huge pages, NUMA locality, and performance regressions
Sometimes, the kernel's no-regression rule may not have the desired result. Andrea Arcangeli led a session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit to make the point that the recent reversion of a fix after a performance regression was reported has led to worse performance overall — with, as is his wont, a lot of technical information to back up that point. With a wider understanding of what is at stake here, he hopes, the reversion can itself be reverted.
It started with a performance problem reported in the 4.2 kernel, when Alex
Williamson encountered pathological behavior around transparent huge pages
(THP). When using THP in large virtual memory areas under
VFIO, a
large slowdown would result. After some work, the problem was
figured out and a fix was applied to the RHEL kernel in August 2018. A
little later, the same
problem was reported in an SLES 4.12 kernel; the same fix solved it there.
At that point, the fix was committed upstream
as being of general use. But that fix ended up being reverted in December
after automated testing revealed a slowdown in one benchmark; while the fix
had greatly improved fairness, it did alter performance in this case.
Large amounts of email traffic ensued; Arcangeli hopes that it will soon be
possible to apply the fix again.
The underlying problem is that, if the system is configured to always enable memory compaction and a huge page allocation is requested, the page allocator will refuse to allocate pages on remote nodes. It behaves as if the program had been explicitly bound to the current node, which was never the intended result. The reasoning that led to this behavior is that it is better to allocate local 4KB pages than remote huge pages — an idea that Arcangeli is not fully in agreement with. But the kernel goes beyond that in this situation, refusing to allocate any pages on remote nodes and potentially forcing the local node deeply into swap. The (since reverted) fix avoids this situation by allowing the allocation of remote pages (both huge and small) when local pages are not available. In a quick demo, Arcangeli showed how that can yield better performance and less local-node swapping.
Even so, he said, the allocation heuristic is not what it should be: it tries to allocate huge pages across the system first before falling back to 4KB pages locally. The correct order would be to first attempt to allocate a huge page locally, then to try a 4KB page locally. Failing that, the next thing to try should be a huge page at the closest remote node, followed by a 4KB page at that node, then a huge page at a more distant node, and so on. In other words, the allocator should follow something like the current zone-list heuristic, while attempting allocations at multiple sizes at each step.
Doing that properly may require changing the alloc_pages() interface to make it smarter; he suggested something like alloc_pages_multi_order() that would accept a bitmask of the acceptable allocation sizes. It would try the largest first, but then fall back to the smaller size(s) if need be. Naturally, it would return the size of the actual allocation to the caller.
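A rough sketch of what such an interface might look like appears below; the name comes from the session, but the signature and fallback logic are assumptions rather than merged kernel code.
    /* Hypothetical sketch of the proposed allocator entry point. It tries
     * the largest acceptable order first, then falls back to smaller ones,
     * reporting the order actually obtained.
     */
    #include <linux/gfp.h>
    #include <linux/bitops.h>

    static struct page *alloc_pages_multi_order(gfp_t gfp,
                                                unsigned long order_mask,
                                                unsigned int *order_out)
    {
        int order;

        for (order = fls_long(order_mask) - 1; order >= 0; order--) {
            struct page *page;

            if (!(order_mask & (1UL << order)))
                continue;
            page = alloc_pages(gfp, order);
            if (page) {
                *order_out = order;
                return page;
            }
        }
        return NULL;
    }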
There was a bit of discussion on how such an interface might really work. Michal Hocko said, though, that the real problem is that the compaction setting has other, surprising implications: it can result in a failure to allocate huge pages in situations where they would help. Perhaps the solution is just to be a bit more clever in how this setting is handled, in which case it should just be done.
Meanwhile, Arcangeli's intent is to get the original fix reapplied; he has posted a patch set doing so with a bunch of backup information. That seems like the right thing to do; two enterprise distributions are currently carrying out-of-tree patches to address this issue. Once that little problem has been resolved, work can begin in earnest on coming up with the best way to allocate huge pages in NUMA environments.
Proactively reclaiming idle memory
Shakeel Butt started his 2019 Linux Storage, Filesystem, and Memory-Management Summit session by noting that memory makes up a big part of the total cost of equipping a data center. As a result, data-center operators try to make the best use of memory they can, generally overcommitting it significantly. In this session, Butt described a scheme in use at Google to try to improve memory utilization; while the need for the described functionality was generally agreed upon, the developers in the room were not entirely happy with the solution presented.
Overcommitting memory increases total utilization, but it comes at a cost: systems experience memory pressure and end up having to reclaim pages. Direct reclaim (where a process that is allocating memory has to do some of the work of freeing up memory used by others) is particularly problematic since it can introduce surprising delays; it reduces the isolation between users. The solution to this problem, he said, is to seek out and reclaim idle pages before memory gets tight.
To this end, systems in the data center have been supplemented with slower
(cheaper) memory, which can take any of a number of forms, including
persistent memory, in-memory compression, or a real swap device. The
system manages this memory in a way that is transparent to users. Then, idle
memory (pages that have not been accessed for some time) is located and
pushed out to this slower memory. Butt said that, across the Google data
center, about 32% of memory can be deemed to be idle at any given time. If
that memory is reclaimed after two minutes of idle time, he said, about 14%
of it will be refaulted back in; the rest is better used by somebody else.
At this point, one might wonder why Google doesn't just use the kernel's existing reclaim mechanism, as implemented by the kswapd kernel thread. It "kind of works", he said, but is based on watermarks (keeping a certain percentage of memory free) rather than on idleness. kswapd also tries to balance memory usage across NUMA nodes, which is not useful in this setting. Finally, Butt said, kswapd is built on a large set of complicated heuristics, and he doesn't want to try to change them.
So, instead, Google has put resources into a mechanism for tracking idle pages. There is a system in place now that is based on a user-space process that reads a bitmap stored in sysfs, but it has a high CPU and memory overhead, so a new approach is being tried. It is built on a new kernel thread called kstaled; it tracks idle pages using page flags, so it no longer adds memory overhead to the system, but it still requires a relatively large amount of CPU time. The new kreclaimd thread then scans through memory, reclaiming pages that have been idle for too long.
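The bitmap-based interface mentioned above is presumably the kernel's idle-page-tracking file at /sys/kernel/mm/page_idle/bitmap; the sketch below shows roughly how a user-space scanner samples it (root-only, error handling omitted).
    /* Sketch of sampling the idle-page bitmap: each 8-byte word covers 64
     * page frames. Writing 1 bits marks the frames idle; a bit that is
     * still set on a later read means the page was not accessed in the
     * meantime.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define PAGE_IDLE_BITMAP "/sys/kernel/mm/page_idle/bitmap"

    /* Mark 64 page frames starting at pfn (a multiple of 64) as idle,
     * wait, then return the mask of frames that are still idle. */
    uint64_t sample_idle(uint64_t pfn, unsigned int delay_secs)
    {
        int fd = open(PAGE_IDLE_BITMAP, O_RDWR);
        uint64_t mask = ~0ULL;
        off_t off = (pfn / 64) * 8;

        pwrite(fd, &mask, sizeof(mask), off);   /* mark frames idle */
        sleep(delay_secs);
        pread(fd, &mask, sizeof(mask), off);    /* bits still set => idle */
        close(fd);
        return mask;
    }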
The CPU cost is not trivial; it increases linearly with the amount of memory that must be tracked and with the scan frequency. On a system with 512GB of installed memory, one full CPU must be dedicated to this task. Most of this time is spent walking through the reverse-map entries to find page mappings. This has been improved by getting rid of the reverse-map walk and creating a linked list of mid-level (PMD) page tables; that reduced CPU usage by a factor of 3.5. Removing the scanning from kreclaimd in favor of a queue of pages passed in from kstaled gave another significant reduction.
Butt said that he would like to upstream this work; it is not something that can be handled in user space. Rik van Riel noted that, even with the performance improvements that have been made, this system has scalability problems. Johannes Weiner asked why Google was reimplementing the tracking that is already done by the memory-management subsystem's least-recently-used (LRU) lists. Like the LRUs, this new mechanism is trying to predict the future use of pages; it might be nice to have, he said, but it is "crazy expensive". Butt replied that Google was willing to pay that cost, which was less than having the system go into direct reclaim.
Weiner continued, saying that Facebook has faced the same issue. There, every workload must be containerized, and users are required to declare how much memory they will need. But nobody actually knows how much memory their task will require, so they all ask for too much, leading to overcommitment and the resulting reclaim problems. The solution being tried there is to use pressure-stall information to learn when memory is starting to get tight, then chopping the oldest pages off the LRU list. If the refault rate goes up, pages are reclaimed less aggressively. This solution, he said, yields reasonable results at a much smaller CPU cost.
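The pressure-stall information Weiner mentioned is exported in files like /proc/pressure/memory; a reclaimer built on it might poll those numbers along these lines (the threshold here is invented for illustration).
    /* Sketch: use memory pressure-stall data to modulate how aggressively
     * cold pages are trimmed from the LRU. The 0.1% threshold is made up.
     */
    #include <stdio.h>

    /* Return the "some" avg10 value from /proc/pressure/memory, or -1. */
    static double memory_pressure_avg10(void)
    {
        double avg10 = -1.0;
        FILE *f = fopen("/proc/pressure/memory", "r");

        if (f) {
            fscanf(f, "some avg10=%lf", &avg10);
            fclose(f);
        }
        return avg10;
    }

    int should_trim_lru(void)
    {
        double p = memory_pressure_avg10();

        /* Trim only while stalls are negligible; back off as soon as the
         * system shows real memory pressure. */
        return p >= 0.0 && p < 0.1;
    }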
Discussion continued for a bit, but the general consensus was that, while this sort of proactive reclaim would be useful for a number of users, the cost of this particular solution was too high to consider merging it upstream.
Cleaning up after dying control groups
Control groups are a useful mechanism for managing resource usage in the system, but what happens when the control groups themselves become a resource problem? In a plenary session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Roman Gushchin described problems he has been facing with deleted control groups that take their time before actually going away. Some of these problems have been fixed, but the issue has not been truly resolved.
Control groups are managed using a virtual filesystem; a particular group can be deleted just by removing the directory that represents it. But the truth of the matter, Gushchin said, is that while removing a control group's directory hides the group from user space, that group continues to exist in the kernel until all references to it go away. While it persists, it continues to consume resources.
The problem is especially acute for memory control groups, where every page
that is charged to the group holds a reference to it. So a given control
group cannot be truly deleted until every page that was charged to it is
reclaimed, which can take a long time; if some of those pages are still
actively used, they may avoid reclaim indefinitely. In addition, various
bugs have also had the effect of keeping deleted groups around. It all
adds up to deleted control groups hanging around and haunting the system;
he found 1,500 of them after a week of operation.
The consequences of this problem are not huge, but still "not nice", he said. Each control group consumes about 200KB of memory while it exists, which begins to add up when thousands of them are waiting to die. All of those groups serve to increase the complexity (and the cost) of traversing the control-group hierarchy in the kernel. That memory use can also throw off memory-management accounting.
Some of the reasons for the persistence of removed control groups are easier to deal with than others. There was, for example, a rounding error in the handling of user pages that caused the final page not to be reclaimed. This bug showed up in both versions of the control-group subsystem; it has since been fixed. Another issue had to do with the accounting of kernel stacks; it was introduced in the switch to virtually mapped stacks in 2016. These stacks were charged to the process (and its group) that first allocated them; when a stack was reused for a new process, the charging was not updated. This problem, too, has been fixed.
A problem that is not yet fixed has to do with kernel memory obtained from the slab allocators. Many cached objects, such as dentry structures, are obtained from the slab allocator and charged to the appropriate control group; they, too, must be cleaned up before that group can be truly deleted. But when there is not a lot of memory pressure, the shrinkers do not run aggressively and those objects can persist for a long time. Gushchin tried a patch to apply some extra pressure, but it caused performance regressions in the XFS filesystem and was subsequently reverted. So now he is working on a different approach: reparenting slab caches on control-group removal. There is a patch set in review, so hopefully this problem will be resolved in the near future.
With those fixes, the problems that he has observed have been addressed, but there are other potential problems out there. Pages obtained with vmalloc() and per-CPU pages are possible trouble areas. In general, though, he said that it is easy to create hidden references to control groups that can impede their removal; this is an area where regressions are likely to happen.
At the end of the session, Michal Hocko said that part of the problem is simply the size of the structure used to represent a memory control group. Perhaps things could be made a little better by splitting that structure in two and only keeping the core parts when the group is removed. But Johannes Weiner replied that memory pressure is the only thing that is trimming back these deleted groups now; if they are made smaller, they will just pile up more deeply. So, while some manifestations of this problem have been dealt with, the issue of dying control groups will, like the groups themselves, be with us for some time yet.
Remote memory control-group charging
Memory control groups exist to track and limit the amount of memory used by sets of processes. Normally, one would not expect that memory used by one group would be charged to another but, as Shakeel Butt described in a memory-management track session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit, that does happen in a number of different situations. It's often a problem, but occasionally it's also a useful feature.
One of the most common examples of this sort of "remote charging" happens when a page of memory shared by more than one group is swapped out. When that page is swapped back in, it will be charged to the group that originally allocated it, even if that group has long since stopped using it. That can lead to surprising situations like a page fault in one group triggering reclaim or even an out-of-memory (OOM) response in another group. Remote charging can happen when userfaultfd() is in use; the fault-handling process can be charged for memory used by the controlled process. It is seen when processes are traced with ptrace() and can even happen explicitly with a call to get_user_pages_remote(). At times, memory allocated in kernel space can also be remotely charged.
Michal Hocko asked about userfaultfd() in particular, wondering how often the control process runs in a different group than the processes it is handling faults for. Butt responded that he didn't know, but that the remote charging is documented. Hocko suggested that running in different groups was probably not a good idea in general, and said that he would be reluctant to complicate the code for a theoretical use case that may not happen in the real world.
One of the in-kernel cases mentioned by Butt was the allocation of buffer heads, which can happen when writeback is triggered on a page. That can happen in the global reclaim process, as well as elsewhere. Those buffer heads will be charged to the control group that is being charged for the page being written back, even if that group had nothing to do with the activity that made writeback necessary.
One interesting quirk of remote charging is that it bypasses the memory.high control knob, which defines when an allocating process should be throttled. That throttling will not happen on allocations that are remotely charged. Rik van Riel asked how often this problem comes about; Butt didn't have a precise answer but said that he had seen it. Hocko said that if 1% of allocations are remotely charged in this way it really doesn't matter; the processes within the control group will eventually be charged directly for a page and the throttling will happen. In more unbalanced cases it could be a problem, though.
In some cases there might actually be a use case for remote charging, though. Some users want to separate virtualized workloads, with the virtual-machine monitor running in a different group than the virtual machines themselves. This is evidently done mostly as a way of knowing what the monitor overhead is. The way this is done currently is to create a tmpfs filesystem and mount it in its own control group; all pages allocated will then be charged to that group. There's just one little problem: if the virtual machine dies (from the OOM killer, for example) the tmpfs filesystem will remain, consuming memory. To deal with that, the kernel has been hacked to send a special notification on OOM kills so that somebody can go and clean up the tmpfs files.
Butt described a new idea that he has been working on: the tmpfs files could be attached to a special dummy process that is subject to the out-of-memory killer's attention. When the process is killed, the files are truncated; it's essentially an OOM-killable tmpfs file. Johannes Weiner said that such a mechanism was far too specialized to go upstream. Van Riel, though, suggested that it might be possible to upstream a patch implementing a new mount flag for tmpfs. If the control group attached to a given mount is killed, the filesystem would automatically drop its contents.
Hugh Dickins said that problems with OOM kills and tmpfs are common, so it would be good to have some sort of a solution there. Weiner said that perhaps the best place to implement it would be in the oomd user-space OOM daemon. That kind of policy is hard to put into the kernel's OOM killer, which exists primarily to protect the machine as a whole. The implementation of specific resource-control policies, perhaps, is better placed in a user-space process. At that point, the session wound down.
get_user_pages(), pinned pages, and DAX
The problems associated with the kernel's internal get_user_pages() function have been a topic of discussion at the Linux Storage, Filesystem, and Memory-Management Summit for a few years. At the 2019 event, Jan Kara began a plenary session by saying that it would be "like last year's session". It turned out rather differently, though, perhaps due to the plenary setting; this discussion (along with the related session that followed) turned out to be one of the most heated at the entire conference.
Pinned pages
Kara said that he didn't really want to waste a lot of time explaining the problem yet again, so he went over the background in a hurry; readers who are not familiar with the problem can learn more in this article. In the end, it comes down to filesystems that are unprepared for pages of data to be modified behind their back, leading to problems like data corruption and kernel crashes.
He has developed a list of frequently asked questions over the years of
working on this problem. Could get_user_pages() simply be
disabled for file-backed pages? No, because real applications use it.
They map shared buffers for direct I/O, for example; these applications
were written before the addition of the splice()
system call, but they are still being used. Could the PageLocked
flag be used to lock the pages entirely? That would lead to pages being
locked for long periods of time, and could deadlock the system in a number
of scenarios. How about using MMU notifiers to get callers to drop their
references to pages on demand? "Good luck converting over 300 call sites"
to the new scheme, Kara said; that would also increase the overhead for
short-lived mappings that are used for tasks like direct I/O.
So something else must be done. The plan, Kara said, is to create a new type of page reference for get_user_pages() called a "pin". It would be obtained by calling get_pin(), which would increase the target page's reference count by a large bias value. Then, any page with a reference count greater than the bias value will be known to have pin references to it. There are some details to deal with, including the possible overflow of the reference count, though it shouldn't be a problem for real use cases. A large number of references could occasionally cause false positives, but that would not create problems either.
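A minimal sketch of the bias idea follows; the constant and helper names are assumptions based on the session, not merged kernel interfaces.
    /* Hypothetical sketch of bias-based pinning: a pin takes many ordinary
     * references at once, so pinned pages stand out from pages with a few
     * transient references. Names and the constant are illustrative.
     */
    #include <linux/mm.h>
    #include <linux/page_ref.h>

    #define GUP_PIN_BIAS 1024

    static void get_pin(struct page *page)
    {
        page_ref_add(page, GUP_PIN_BIAS);
    }

    static void put_pin(struct page *page)
    {
        page_ref_sub(page, GUP_PIN_BIAS);
    }

    static bool page_maybe_pinned(struct page *page)
    {
        /* A count at or above the bias almost certainly means a pin is
         * held; a page with very many normal references can produce a
         * false positive, which, as noted above, is acceptable. */
        return page_ref_count(page) >= GUP_PIN_BIAS;
    }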
Kirill Shutemov suggested just making the PagePinned flag reliable, but Kara responded that doing so would require adding another reference count for each page. Space is not readily available for such a count. He had actually looked at schemes involving taking pinned pages out of the least-recently-used (LRU) lists, at which point the list pointers could be used, but it was not an easy task and conflicted with what the heterogeneous memory management code is doing.
The next step is to convert get_user_pages() users to release their pages with put_user_page(), which has already been merged into the mainline. There are a lot of call sites, so it will take a while to get this job done.
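The conversion itself is mostly mechanical; a typical caller might change roughly as follows, using put_user_page() in place of put_page().
    /* Rough before/after of the conversion: pages obtained with
     * get_user_pages() must now be released through put_user_page() so
     * that pinned pages can be identified later.
     */
    #include <linux/mm.h>

    static void release_old(struct page **pages, unsigned long npages)
    {
        unsigned long i;

        for (i = 0; i < npages; i++)
            put_page(pages[i]);             /* old style */
    }

    static void release_new(struct page **pages, unsigned long npages)
    {
        unsigned long i;

        for (i = 0; i < npages; i++)
            put_user_page(pages[i]);        /* new helper */
    }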
Christoph Hellwig jumped in to say that there are other problems with get_user_pages(), and that this might be a good time to do a general spring cleaning. Many of the users, he said, are "bogus"; many of the callers never actually look at the pages they have mapped. For the cases where a scatter/gather list is needed for DMA I/O, a helper could be provided. But Jérôme Glisse said that it would not, in the end, be possible to remove that many callers.
Once the system can tell which pages are pinned, it's just a matter of deciding what to do with that information. Kara talked mostly about the writeback code; in some cases, it will simply skip pages that are pinned. But there are cases where those pages must be written out — "somebody has called fsync(), and they expect something to be saved". In this case, pinned pages will be written, but they will not be marked clean at the end of the operation; they will still be write-protected in the page tables while writeback is underway, though. In situations where stable pages are required (when a checksum of the data must be written with the data, for example), bounce buffers will be used. There are over 40 places to fix up, not all of which are trivial.
DAX and long-term pins
At this point, Ira Weiny took over the leadership of the discussion to talk about the problems related to long-lasting page pins and persistent memory in particular. The DAX interface allows user space to operate directly on the memory in which files are stored (when the filesystem is on a persistent-memory device), which brings some interesting challenges. Writeback is not normally a problem with DAX, he said, but other operations, such as truncation and hole-punching, are. When pages are removed from a file, they are normally taken out of the page cache as well, but with DAX those pages were never in the page cache to begin with. Instead, users are dealing directly with the storage media.
As a result, a truncate operation may have removed pages from a file while
some of the pages in that file are pinned in memory. In this case, the
filesystem has to leave the pages in place on the device. When pages are
pinned for a long time (as RDMA will do, for example), those pages can be
left there indefinitely.
Weiny said that what is needed to solve this problem is some sort of a lease mechanism so that pinned pages can be unpinned on demand. He has a prototype implementation working now; it implements leases with get_user_pages(), with individual call sites being able to indicate whether they support this mechanism. If a user has a file mapped and pages are removed from it, that user will get a SIGBUS signal to indicate that this has happened. Weiny asked whether the group thought this approach was reasonable.
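Whatever form the lease API takes, applications holding long-term pins would need to handle that signal; the sketch below shows only the generic SIGBUS plumbing, not any actual DAX lease interface.
    /* Generic SIGBUS handling sketch: not tied to any real lease API, just
     * the standard signal machinery an application would need in order to
     * learn that part of its mapping went away.
     */
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void on_sigbus(int sig, siginfo_t *info, void *ctx)
    {
        /* info->si_addr identifies the faulting address; a real
         * application would tear down or re-establish its mapping. */
        static const char msg[] = "mapping revoked\n";

        (void)sig; (void)info; (void)ctx;
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
        _exit(1);
    }

    void install_sigbus_handler(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_sigbus;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);
    }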
One attendee noted that NFS seems to handle this case now; it can lose a file delegation on a truncate event. Perhaps the kernel should use the NFS (or the SMB direct) mechanism? There are challenges to implementing that, Weiny said, and in any case it's not the approach that people have been looking at. DAX is fundamentally different, he said, in that it uses a filesystem to control memory semantics.
Boaz Harrosh said that the approach was wrong, and that the truncate or hole-punch operation should simply fail when pinned pages are present. Others agreed that it wasn't right to just randomly kill processes because somebody truncated a file somewhere. Whether that is truly "random" was a matter of debate, but that was a secondary issue.
The rest of the session was an increasingly loud argument between those who favored sending SIGBUS and those who thought that the file operation should fail. At one point Hellwig suggested that people who were yelling really didn't need to use the microphone as well. Your editor will confess to having simply stopped taking notes partway through; it was one of those disagreements where opinions are strong and nobody is prepared to change their mind. About the only point of agreement was that the current behavior in this situation, wherein a call to truncate() will simply hang, is not good.
Toward the end, Ted Ts'o said that it would probably prove necessary to implement both options sooner or later. The session ended, rather later than scheduled, with no agreement as to what the right solution is. This topic seems likely to light up the mailing lists in the future.
How to get rid of mmap_sem
The mmap_sem lock used in the memory-management subsystem has been a known scalability problem for years, but it has proved difficult to remove. No less than three sessions were held during the memory-management track of the 2019 Linux Storage, Filesystem, and Memory-Management Summit to talk about mmap_sem and how it might be eliminated. Many ideas were discussed, but the shape of the solution remains vague at best.
Maple trees
The first session, run by Laurent Dufour and Matthew Wilcox, discussed a possible solution: replacing the red-black tree currently used to track virtual memory areas (VMAs) with a new data structure called a "maple tree".
VMAs represent the regions of memory that make up a process's address
space. They are kept in an augmented red-black
tree that exists to answer two kinds of queries: quickly finding the
VMA associated with a given address, or finding a gap in the address space
that is large enough to hold a new VMA. There is also a separate linked
list that contains the VMAs in an address space, making it possible to walk
through the entire space. Either way, protection of those data structures
is done with mmap_sem. Wilcox noted that doubly linked lists have
come under a lot of criticism in recent years; they are "the devil's
structure" with poor performance characteristics. The current structure
for VMAs, he said, is a "quintuply linked list" with even worse
performance.
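The "quintuply linked" structure Wilcox was referring to can be seen in an abbreviated view of the VMA structure as it looked in kernels of that era: three pointers in the red-black-tree node plus the forward and backward list pointers.
    /* Abbreviated from struct vm_area_struct (kernels of roughly that
     * era); many fields are omitted.
     */
    struct vm_area_struct {
        unsigned long vm_start;        /* first address of the region */
        unsigned long vm_end;          /* first address past the region */

        /* The sorted, doubly linked list of VMAs in the address space. */
        struct vm_area_struct *vm_next, *vm_prev;

        /* Node in the mm's red-black tree (three more pointers). */
        struct rb_node vm_rb;

        /* Largest free gap in this subtree, used to find space for new
         * mappings without walking the whole tree. */
        unsigned long rb_subtree_gap;

        /* ... */
    };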
Naturally (since he is the creator of the XArray structure), Wilcox wanted to replace everything with an XArray. It is, he said, "absolutely superb" for the page cache, with all the properties one might want for looking up pages with a 2^n-byte alignment. But the XArray lacks good range support, which is what is needed here; there is no way to search for gaps in the address space. It is, thus, not a good structure for this task.
So, instead, he suggests using maple trees, a data structure he has been working on with Liam Howlett. It is a form of B-tree that has been optimized for storing ranges that lack significant gaps, which is just what the VMA list is. VMAs are mostly adjacent, he said, with a relatively small number of gaps between them. Hugh Dickins predicted that somebody was sure to post a patch in the near future that spread out VMAs as a hardening measure; Wilcox responded that the structure could be tweaked accordingly if that ever happens.
One of the key features of the new structure is lockless access to VMAs, a claim that brought an immediate protest from Mel Gorman, who worried that it could never be safe. One of the core aspects of the VMA, he said, is that it is actually two data structures in disguise. One of them is the VMA itself; the other is the address_space structure describing what that range of memory actually maps to. The two structures have their own life cycles and locking schemes, and lockless access will be hazardous at best. It might be necessary to add some sort of reference counter to the address space to keep it from going away while the VMA is being worked on.
That, Wilcox said, is why this discussion was happening; it would allow the
developers to get a sense for what the end state would look like.
Meanwhile, while the maple tree conversion is happening, code will continue
to use mmap_sem to access the data structure. There are two
phases to this work: replacing the data structure with something more
efficient, and separating out the locking.
Andrea Arcangeli said that it might be better to do things in the opposite order, and find ways to fix the locking first. For example, he said there should be no need to use mmap_sem with VM_GROWSDOWN VMAs (those representing stack areas, normally). It can also be avoided for calls to get_user_pages_fast(). The complexity of the code may be increased by removing the locking first, but he said that it was pointless to try to change the two things together. It is better to change the locking first; otherwise you can't really even benchmark the results of the other changes.
Michal Hocko was unsure about changing the locking first, but he liked the idea of getting rid of the doubly linked list and doing lockless searches. This work could also help, he said, with range locking, which is a commonly suggested way of reducing contention on mmap_sem. Rather than using mmap_sem to lock the entire range, taking a range lock on just the part of the address space being worked on would allow more to happen in parallel. He asked: how much work is needed to make this happen?
Wilcox replied that the patch, so far, is about 2,000 lines of code, but it's addressing more than just the VMA problem. There are three users of radix trees for ranges; all are horrible, he said, and should be replaced. There are other places using red-black trees that could be improved as well. In the end, this work will give him the opportunity to delete a lot of code. The maple tree structure will be there anyway, regardless of whether it is used for VMAs. Hocko said that was a good argument for using it with VMAs too.
At the end of the session, the next steps were laid out. Some sort of per-VMA locking will be set up, probably based on a range lock. Hocko suggested that the range locking should happen first, since adding range locks is a huge step. There are many places where mmap_sem is abused to protect other data structures; it will take a long time just to figure out where they all are. Gorman suggested that all mmap_sem users should be changed to acquire a lock on the entire address-space range; after that, each use can be evaluated to see if it can be narrowed. That will allow some sort of incremental progress. This task is, he said, "the big kernel lock all over again".
Range locks
Wilcox led a session on the following day dedicated to range locks in particular. Range locking is, he said, "a very seductive idea". There is a highly contended lock getting in the way, so it is natural to want to split it up. After all, the kernel doesn't use one big lock for access to inodes; instead, each inode has its own lock. Something similar has obvious appeal for the memory-management subsystem.
Davidlohr Bueso has done a range-lock implementation for the kernel that stores ranges in a red-black tree. But, Wilcox said, he is not a big fan of red-black trees; he has a replacement to propose (presumably the maple tree) that should perform better, but it is not working yet. That said, he is also not a fan of range locks, which he finds inelegant. They use one complicated data structure to protect another; in this case, one red-black tree (the range locks) is being used to protect another red-black tree (containing the VMAs). It would be better, he said, to build the locking into the tree itself.
There is some data to back this up. Dave Chinner tried to use range locks to improve XFS, Wilcox said, and ended up with a significant slowdown. Mel Gorman protested, though, that Chinner's experiment used an XFS-specific range lock that managed extents; it is not an equivalent situation, he said.
Wilcox then admitted that he did not have a whole lot else to say. Gorman said that it would be great to have a lock in each VMA, but that is a lot harder to do for holes in the address space. No object exists there yet, so there is nothing to put a lock into. But evidently the DAX subsystem has a solution to that problem now, inserting locked entries into holes in the radix tree. Wilcox is planning something similar for VMAs, but there is no code yet so it is too early to talk about now.
At this point, this relatively short session came to a close.
mmap_sem again
At the very end of the conference, Dufour ran one more session with the desultory title of "mmap_sem again". It was ostensibly about the speculative page faults work, though it did not focus much on that work specifically.
One purpose of mmap_sem is to protect VMAs against concurrent changes. That could perhaps be replaced in a twofold manner: a lock to protect the list of VMAs (perhaps using read-copy-update), and a lock embedded in each individual VMA. The latter lock would have to be a sleeping lock, since it must be taken while handling page faults. That leads immediately to the first challenge: how does one go from the non-sleeping lock protecting the list to the sleeping lock for the VMA? Acquiring the first lock will normally put the caller into a context where sleeping is no longer an option.
There was some talk of using reference counts for this purpose; Wilcox described it as "open-coding a semaphore". Gorman said there would have to be a wait queue for anybody who needs to wait for the reference count to go to zero. Starvation could become an issue in some settings. He also advised against using RCU, which is only useful when getting a copy of an object is sufficient; that is never true for VMAs. There would need to be a convincing explanation of how all this is actually better than mmap_sem, he said.
Jérôme Glisse suggested adding counts to the VMA so that all faults could be handled speculatively. But Gorman argued against building on the speculative page-fault code, which has yet to produce any performance gain in any of his tests. Glisse said that he was only thinking about taking parts of it to check for concurrent changes to a VMA while other work is going on. He would like to avoid range locking, he said; instead, the page-table locks can function as a sort of natural range lock. Hocko disagreed with the idea that range locks should be avoided. An address space is a collection of ranges that the memory-management subsystem operates on, so a range lock is a natural solution to the problem.
Wilcox turned the discussion back to mmap_sem, noting that it is highly contended and wondering how it could be split up. Part of the problem, he said, is that mmap_sem "covers many sins" beyond the protection of VMAs. Once that is dealt with, though, everybody agrees that work needs to be separated by range; that is not the point of contention. Instead, the dispute centers around whether Bueso's range locks, in particular, should be used.
For a next step, Glisse brought back the idea of replacing mmap_sem with a wrapper that would lock the entire range; Wilcox responded that this approach was "crazy". His own next step is moving unrelated data out from under mmap_sem to simplify the problem. Meanwhile, he is continuing to work on the maple tree concept as an independent effort. He also said that working on speculative page faults is valuable, even if it yields no immediate performance benefits; it helps the developers discover the perils that come with splitting up mmap_sem in general. Dufour agreed that he had already learned far more than he had ever wanted while working on that code; he is not sure about the future of speculative page faults but hopes that it can help to get to a more scalable memory-management subsystem.
Dave Hansen complained that mmap_sem is a heavyweight lock, even when just acquired for reading. It bounces its reference count around the system, creating a lot of cache misses. Adding a reference count to the VMA structure would, he said, just move the problem. Dufour said that the current practice of merging VMAs whenever possible might be making things worse by increasing contention on the locks; perhaps merging should not be done so aggressively.
The session was coming to an end, and the participants were clearly tired after three days of this kind of discussion. Hocko said that the group should at least come up with an action item. The goal is to replace mmap_sem; if developers don't like range locks, then what would they suggest using instead? It needs to be something that, like range locks, can turn the mmap_sem replacement into an incremental problem so that the transition is manageable. The focus should be on the new API, he said; once that is sane, developers can work on the implementation.
The problem, Wilcox countered, is that the range-lock API doesn't work for this problem. Code often does not know the size of the range it needs when it takes the lock, so it ends up locking the entire region, then downgrading to something smaller. But locking everything for every fault sounds a lot like mmap_sem, and downgrading will be an extra expense on top of that. Hocko asked what the alternative was, if there is not an API to start with, but no definitive answer was at hand. As the session concluded, Gorman said that one action item would be to get Bueso to repost the work he has done so far; perhaps there are some lessons to be learned from it.
The memory-management subsystem development process
One fixture of the memory-management track at the Linux Storage, Filesystem, and Memory-Management Summit is a discussion with subsystem maintainer Andrew Morton on how the development process is going. The 2019 version indicated that the memory-management developers are mostly happy with how the process is working, but there are still things that they would like to see changed. While some of the issues are old and intractable, others may be amenable to short-term improvement.
Morton started by reminding the group that the 2018 development-process session had established the goal that all memory-management patches would be reviewed before they go upstream. That goal, he said, has been "easily achieved". That said, while many of the reviews are good, some of them are less so. But, he asked, did the group find the reviews as a whole useful? The answer was a clear "yes". Aneesh Kumar said that insisting on reviews puts pressure on developers and helps to make patch review actually happen; Morton answered with a hope that, as the process becomes better understood, managers will allow their developers more time for patch review.
At this point, Mel Gorman went into a long soliloquy about patch review
which, he said, is not always achieving its objectives in the
memory-management subsystem. For example, developers will post complex
patch sets without any justification and have that pointed out in review; then
the next version turns out to be even more complex. Memory management is
not a black-and-white field; there are consequences to everything, and often
little supporting data is provided to describe the consequences of specific
patches.
At the same time, Gorman continued, patches that are seen as important fixes sometimes are rejected, usually because of a single objector blocking progress. In other subsystems, the maintainer just makes a call on patches like that but, in memory management, they can get stuck indefinitely. He mentioned the removal of the legacy block layer, which took a few rounds. There were objections, which were addressed; eventually the maintainer just merged it as the right thing to do.
Things seem to happen differently in memory-management. For example, Gorman said, the reversion of Andrea Arcangeli's patch fixing performance problems with transparent huge page allocation should have been overridden. (That episode was discussed earlier in the day). The fix was reverted as the result of an objection from "a single developer with an unreproducible workload"; sometimes, he said, you just have to push work through. If you wait for the controversy to end, you'll never make progress, he concluded.
Michal Hocko raised a perennial problem in memory management (and beyond): there are still more people writing patches than reading them. Johannes Weiner asked if there were quality problems as a result; Hocko replied in the affirmative, saying that some parts of the subsystem are more problematic than others. Gorman said that memory management is far from alone in suffering from this problem; the group should concede that there are quality issues, he said, but nobody should think that the problem is unique to this subsystem. Morton asked the developers to bring his attention to problems when they come up; Gorman answered that the times when Morton appears on the list, quickly fixes something, and then disappears are a good sign that something has gone wrong.
In the traditional kernel maintainer model, Morton said, the maintainer tends to get involved late in the development of a patch set, when it is nearly completed. The maintainer's job at that point is to make sure that it gets upstream; things tend not to change after a patch is applied to the subsystem maintainer's repository. Morton doesn't work that way, though; he gets involved earlier and tries to help developers get their work into shape. His ‑mm tree is a part of the development process, not its culmination. Is this good or bad?
Vlastimil Babka said that developers often don't understand that ‑mm works that way; they think that once something appears there, the work is done. Morton answered that he has gotten "less promiscuous" lately, passing over the first version of many patch sets. Hocko said that, as a reviewer, he can lose the big picture when followup fixes are added to ‑mm (a topic that came up in 2018 as well). That is, he said, a huge problem that has led to incorrect merges in the past. That is a direct result of the early merging strategy; he would rather see developers just send a new series when things need to be fixed.
Gorman said that the early merging approach has both merits and limits. It makes it hard for developers to work against linux-next, which is made more volatile by the inclusion of unready code. Gorman's own patches tend to come with a lot of supporting data, but all of those results are measured with mainline kernels; the numbers can change significantly in ‑mm, forcing him to reverify things. It would be nice, he said, if ‑mm had a proper set of topic branches for developers to work against; that sentiment was echoed by others in the room as well. Hocko said that different parts of the memory-management subsystem need different base trees for development; ‑mm is a single point of failure now. Other subsystems are trying to add redundancy to the maintenance process; he would like to see something similar happen for memory management.
Morton said that, when developers work against the mainline tree, things usually just work, and he is comfortable with that. Gorman replied, though, that it is hard to know which patches in ‑mm will go upstream anytime soon. That could be improved with a set of standardized branches in ‑mm, so that developers can have a better idea of what the next merge window will look like.
As the session ran out of time and dinner beckoned, Babka noted that most maintainers put detailed summaries into their pull requests; that doesn't happen with ‑mm. When asked, I could only agree; a pull request that reads "most of ‑mm" is not particularly helpful. Morton closed the session by saying that he would try to do better.
Alignment guarantees for kmalloc()
kmalloc() is one of the kernel's fundamental memory-allocation primitives for relatively small objects. Most of the time, developers don't worry about the alignment of memory returned from kmalloc(), and things generally just work. But, Vlastimil Babka said during a plenary session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit, every now and then kmalloc() will do something surprising. He proposed tightening the guarantees around object alignment in the hope of generating fewer surprises in the future.
In particular, Babka wanted to discuss when kmalloc() should return objects with their "natural alignment". What that term means was not actually defined until near the end of the session; presumably nobody will object to a slight bit of reordering at this point. Natural alignment for an object means that its beginning address is a multiple of each of up to three different quantities: ARCH_KMALLOC_MINALIGN (eight bytes, by default), the cache-line size (for larger objects), and the size of the object itself when that size is a power of two. The actual required alignment will generally be the least common multiple of the three.
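As a rough sketch of what that definition implies (my reading, not code from Babka's patches; the "larger than a cache line" test and the use of the cache_line_size() helper in this form are assumptions on my part):

    #include <linux/cache.h>    /* cache_line_size() */
    #include <linux/lcm.h>      /* lcm() */
    #include <linux/log2.h>     /* is_power_of_2() */
    #include <linux/slab.h>     /* ARCH_KMALLOC_MINALIGN */

    /*
     * Illustrative reading of "natural alignment" as described above; the
     * treatment of "larger objects" is an assumption, not Babka's code.
     */
    static unsigned long natural_alignment(size_t size)
    {
        unsigned long align = ARCH_KMALLOC_MINALIGN;    /* eight bytes by default */

        if (size > cache_line_size())       /* larger objects: cache-line alignment */
            align = lcm(align, cache_line_size());
        if (is_power_of_2(size))            /* power-of-two sizes: the size itself */
            align = lcm(align, size);
        return align;
    }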
Most of the time, Babka said, kmalloc() already returns naturally aligned objects for a power-of-two allocation size; that result falls out of how the slab pages are laid out. But there are exceptions: when SLUB debugging is enabled or when the SLOB allocator is used. Few people worry much about SLOB, but the SLUB debugging exception can lead to problems for unsuspecting developers.
That is because code can come to depend on receiving objects with natural alignment without the developer realizing; then, one day, they turn on SLUB debugging and things break. Babka has found a number of Stack Overflow questions about the alignment guarantees for kmalloc(), often with incorrect answers. XFS, in particular, is subject to breaking due to the block-size alignment requirement it has for some objects. There are some workarounds in place, but even then, XFS is depending on implementation details rather than explicit guarantees. These workarounds also increase memory use unnecessarily.
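The sort of implicit assumption that bites people might look like this made-up example (not actual XFS code): it holds with the normal SLUB layout, but the warning can fire once debugging metadata is placed around the object.

    #include <linux/bug.h>      /* WARN_ON() */
    #include <linux/kernel.h>   /* IS_ALIGNED() */
    #include <linux/slab.h>

    /* Hypothetical caller that silently relies on natural alignment. */
    static void *alloc_block_buffer(size_t block_size) /* power of two */
    {
        void *buf = kmalloc(block_size, GFP_KERNEL);

        /* Nothing in the current API actually promises this: */
        WARN_ON(buf && !IS_ALIGNED((unsigned long)buf, block_size));
        return buf;
    }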
James Bottomley asked why XFS has this alignment requirement; Christoph Hellwig responded that XFS is "just the messenger" with regard to this problem. Another source of trouble, he said, is any of a number of devices with strange alignment requirements for I/O buffers. Matthew Wilcox said, somewhat sarcastically, that it would be great if the block layer could do bounce buffering for such devices; Hellwig responded seriously that developers are trying to reduce bounce buffering, not increase it. Besides, there are other places using memory from kmalloc() for I/O; any of them could show the problem.
One possible solution, for code with known alignment requirements, would be to set up a dedicated slab cache with an explicit alignment and allocate from it with kmem_cache_alloc(). But that requires creating caches ahead of time for each possible allocation size. Dave Hansen said that, given that the default is to provide good behavior, perhaps caches created with explicit alignment could be coalesced back into the default caches, thus reducing the overhead. Babka said that would be messy to implement; the group discussed the feasibility of this idea for a while without coming to any real conclusions.
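That workaround is available now: a subsystem that knows it needs, say, 512-byte-aligned, 512-byte buffers can create its own cache with kmem_cache_create(), whose third argument is the guaranteed alignment. The cache name and sizes below are purely illustrative.

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/slab.h>

    static struct kmem_cache *sector_cache;

    static int __init sector_cache_init(void)
    {
        /* name, object size, alignment, flags, constructor */
        sector_cache = kmem_cache_create("demo_sector_buf", 512, 512, 0, NULL);
        return sector_cache ? 0 : -ENOMEM;
    }

    /* Later: buf = kmem_cache_alloc(sector_cache, GFP_KERNEL); */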
Another possibility, Babka said, would be to create a new kmalloc_aligned() function that would take an explicit alignment parameter. That would be useful in situations where the required alignment is larger than the natural size. Developers would have to know about it (and about their alignment requirements) to use it, though, suggesting that it might not be used in places where it is needed.
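No such function exists today; as described, the proposal would presumably amount to something like the following purely hypothetical prototype:

    /* Hypothetical interface only discussed in the session: */
    void *kmalloc_aligned(size_t size, size_t align, gfp_t flags);

    /* Usage might then read: buf = kmalloc_aligned(1024, 4096, GFP_KERNEL); */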
So, Babka said, kmalloc() should simply be defined to return naturally aligned objects when the allocation size is a power of two. No changes would be required to implement that guarantee with SLAB or with SLUB without debugging enabled. In the SLUB-with-debugging case, there would be a cost in the form of more wasted memory. Implementing the alignment guarantee with SLOB would end up fragmenting the heap more than it already is.
Wilcox said that there wasn't really a need for SLUB red-zone debugging (which traps accesses outside of the allocated object) now that KASAN works; perhaps it could just be deleted? It seems, though, that other developers still find this feature useful. KASAN has a high overhead and must be built into the kernel; red-zones can be enabled at run time. Ted Ts'o said that, while he is willing to turn on red zones, he finds KASAN too painful to enable most of the time. Hellwig said he turns on red zones whenever somebody comes up with "a crazy new data structure".
While providing an alignment guarantee would make life easier for kmalloc() users, Babka said, there are some costs as well. Kernels with SLUB debugging enabled would be less efficient and SLOB would be less efficient overall. The guarantee could also restrict possible improvements to the slab allocators in the future. Christoph Lameter said that it could make storing metadata with objects harder.
Bottomley suggested a variant: natural alignment for objects up to 512 bytes, but a maximum of 512-byte alignment for larger objects up to the size of a page. Anything at least as large as one page would have page alignment. Babka said that this kind of special rule would only serve to complicate the implementation. Ts'o said that it is not worth trying too hard to avoid every potential cost of an alignment guarantee; spending a little more memory for more predictable behavior is worth it.
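One plausible reading of that rule for power-of-two sizes is sketched below; this is only an illustration of the proposal, not an implementation, and the fallback for other sizes is an assumption.

    #include <linux/kernel.h>   /* min_t() */
    #include <linux/log2.h>     /* is_power_of_2() */
    #include <linux/mm.h>       /* PAGE_SIZE */
    #include <linux/slab.h>     /* ARCH_KMALLOC_MINALIGN */

    /* Sketch of the capped-alignment variant suggested in the session. */
    static unsigned long capped_alignment(size_t size)
    {
        if (size >= PAGE_SIZE)
            return PAGE_SIZE;                   /* page-sized and up: page alignment */
        if (is_power_of_2(size))
            return min_t(size_t, size, 512);    /* natural alignment, capped at 512 */
        return ARCH_KMALLOC_MINALIGN;           /* assumed fallback for other sizes */
    }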
Hugh Dickins worried that SLOB users, who tend to be running on memory-constrained devices, might suffer from this change. Their memory overhead will get higher, and the memory-management developers will not hear about it for a long time. Babka replied, though, that the situation is not that bad; there will be more holes in the heap, but they will end up being filled by smaller objects. Wilcox noted, though, that SLOB uses the four bytes ahead of each object to store its size; doing that while aligning user data will be "obnoxious". Babka closed the session by saying that he would look more closely at the actual overhead with SLOB, saying that it would be unfortunate if the little-used SLOB allocator were to block this project.
Improving access to physically contiguous memory
For years, kernel developers have been told to avoid allocating large chunks of physically contiguous memory; as the system runs and memory becomes fragmented, satisfying such allocations becomes increasingly difficult. But, as Zi Yan pointed out in a memory-management track session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit, there are times when contiguous memory is useful. In this session, the memory-management developers discussed ways to make such allocations more likely to succeed.
There are a lot of uses for physically contiguous memory, Yan said. The use of huge (2MB) or gigantic (1GB) pages improves performance by reducing the demands on the CPU's translation lookaside buffer (TLB). But it turns out that many high-bandwidth peripheral devices also prefer physically contiguous memory; those devices, too, have TLBs that can run out of space. A device must respond to a TLB miss by walking through its internal page tables, which tends to be rather slower than when the CPU does it.
There are three ways of allocating large physically contiguous chunks now. One is the hugetlbfs virtual filesystem, but it requires that memory be set aside for that purpose at boot time. Users may well get the size wrong, and there is no interface to allocate memory from hugetlbfs inside the kernel. The second method, transparent huge pages, suffers from fragmentation and must occasionally do compaction of system memory. It also depends on the kernel's buddy allocator, meaning that it is limited to buffers of size MAX_ORDER, which is normally set to eleven, meaning a maximum size of 2048 4096-byte pages. It thus cannot even come close to providing gigantic pages. Finally, there is alloc_contig_range(), which is used with the CMA memory allocator. This functionality is not available to user space, though.
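For reference, the first two of those mechanisms appear to user space as hugetlbfs-backed mappings and madvise(MADV_HUGEPAGE) hints; a minimal sketch follows (sizes are illustrative, and the MAP_HUGETLB mapping will fail unless the administrator has reserved huge pages):

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SZ_2M (2UL << 20)

    int main(void)
    {
        /* hugetlbfs pool: succeeds only if huge pages have been reserved */
        void *a = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        /* transparent huge pages: a hint; the kernel may or may not comply */
        void *b = aligned_alloc(SZ_2M, SZ_2M);
        if (b)
            madvise(b, SZ_2M, MADV_HUGEPAGE);

        return (a == MAP_FAILED || !b) ? 1 : 0;
    }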
Dave Hansen pointed out that hugetlbfs can now be resized at run time, eliminating one of the concerns there. Andrea Arcangeli thought it might be useful to be able to allocate transparent huge pages from hugetlbfs, and that it might not be that hard to implement. It seems that the real problem isn't necessarily the ability to allocate large chunks, but concerns about the time required to do so, since compaction may be required to make such a chunk available. There was some general discussion on quick allocation of higher-order chunks of memory, but it reached no conclusions.
Mel Gorman said that allocation latency for large chunks was an old problem. The kernel used to run into multi-second stalls when faced with such requests, but that was "a long, long time ago". Until somebody has observed and nailed down problems with current kernels, he sees those problems as being mostly hypothetical. Transparent huge pages have been through a few implementations since the 3.0 kernel; there is a lot of old information out there on the net that has not caught up with the current state of affairs. So anybody who is having trouble with large allocations should observe their systems, enable tracepoints, and report the results — all with a current, upstream kernel. "Don't handwave this one", he said, and he cautioned that he would not lose sleep over reports about problems on enterprise kernels.
Yan was not finished, though: he had a proposal for a new mechanism to defragment virtual memory areas (VMAs) after memory has been allocated. This would be done by finding pairs of pages that could be exchanged in a way that would improve the situation. Unlike the kernel's khugepaged thread, it would defragment in place rather than allocating huge pages up front and moving data into them. The page-exchange idea caused a few raised eyebrows in the room; it seemed overly complex for the problem at hand. The advantage of this approach, Yan said, is that it doesn't require any pages to be allocated; data is exchanged between pages by copying word-by-word through the CPU's registers.
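Stripped of the page-table updates, locking, and error handling that real patches would need, the core of that exchange is just a word-by-word swap of two pages' contents through a register rather than through a third page; a simplified sketch (not Yan's actual code) follows.

    #include <linux/highmem.h>  /* kmap()/kunmap() */
    #include <linux/mm.h>       /* struct page, PAGE_SIZE */

    /* Simplified sketch: swap the contents of two pages in place. */
    static void exchange_page_contents(struct page *a, struct page *b)
    {
        unsigned long *pa = kmap(a), *pb = kmap(b);
        unsigned long tmp;
        int i;

        for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
            tmp = pa[i];        /* through a register, not a temporary page */
            pa[i] = pb[i];
            pb[i] = tmp;
        }
        kunmap(b);
        kunmap(a);
    }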
Gorman asked why anybody should bother avoiding allocation of temporary pages; Yan said it might be worthwhile for larger pages, but the group was unconvinced. When Gorman asked about performance measurements, Yan replied that this exchange was faster than simply migrating two pages, but it was not clear why. When Hansen asked what the overall problem was that was being solved, Yan said that it is a way to obtain 1GB gigantic pages without needing to change the kernel's MAX_ORDER parameter. The group was not convinced that this would be beneficial, though; current CPUs have tiny TLBs for gigantic pages, and nobody has been able to measure any real performance gain from using them.
There was a bit of back-and-forth between Yan (who works for NVIDIA) and another NVIDIA developer in the room. It seems that there might, maybe, somehow, be interest within the company in using 1GB pages for better performance in some not-yet-developed product far in the future. Maybe. But naturally nobody could actually talk about any such products, and kernel developers have little interest in trying to support them.
The session came to an end with Gorman saying that there was no reason to add new infrastructure for this purpose; khugepaged is there already. If the kernel's page-migration logic is too slow, the right thing to do is to make it faster rather than circumventing it. For example, he said, there is no effort made to batch migration work currently; it is a "stupid" implementation. There is a lot of low-hanging fruit that should be fixed before thinking about adding a whole new set of machinery.
Memory management for 400Gb/s interfaces
Christoph Lameter has spent years improving Linux for high-performance computing tasks. During the memory-management track of the 2019 Linux Storage, Filesystem, and Memory-Management Summit, he talked about the problem of keeping up with a 400Gb/s network interface. At that speed, there simply is no time for the system to get its work done. Some ways of improving the situation are in sight, but it's a hard problem overall and, despite some progress, the situation is getting worse.
The problem is that, at those data rates, the kernel's page cache is overwhelmed and simply cannot keep up. That is not entirely the kernel's fault; there is an increasing mismatch between interface speeds and memory speeds. As a result, sites have stopped upgrading their InfiniBand fabrics; there is no point in making the fabric go any faster. A PCIe 3 bus can manage 1GB/s in each lane; x86 systems have 44 lanes, all of which must be used together to keep up with a 400Gb/s interface. So extra capacity on the fabric side is not useful.
PCIe 4 offers a bit of relief in the form of a doubled transfer rate but, Lameter said, that effort is currently stalled. Meanwhile, latencies are high. The whole Intel computing paradigm is in trouble, he said; it is no longer suitable for high-performance computing. The OpenCAPI architecture is somewhat faster than PCIe, but it is only available on POWER9 systems. The fastest interconnect available currently is NVIDIA's NVLink, which can attain 300GB/s; that too is only available on POWER9.
In the area of memory bandwidth, processor vendors are adding memory channels; Intel has six of them now, while AMD has eight. But that adds more pins and complicates routing. These systems can move 20GB/s in each channel, which puts an upper bound on what any individual thread can do; a single thread cannot keep up with even a 100Gb/s network interface. So multiple cores are needed to get the job done. There is some potential in GDDR and HBM memory; those, combined with NVLink, show that it is possible to do better than current systems do.
Jesper Brouer has done a lot of work improving the performance of the kernel's network stack; he was able to get up to a rate of 10Gb/s. But when the data rate is raised to 100Gb/s, there are only about 120ns available to process each packet (roughly the arrival interval of a full-sized 1,500-byte frame at that rate); the system cannot take even a single cache miss and keep up. So that kind of network processing must be done in hardware. The development of the express data path (XDP) mechanism is another sign that you just cannot use the network stack at those rates. Moving some functions, such as checksums and timestamps, to the interfaces can help somewhat.
Then, there are problems with direct I/O in the kernel; it works with arrays of pointers to 4KB pages, meaning there is little opportunity for batching. A 1GB transfer described that way requires 262,144 separate page pointers, so such transfers are relatively slow. The 5.1 kernel has improved the situation by allowing for larger chunks of data to be managed; that results in lower cache use, fewer memory allocations, less out-of-band data to communicate to devices and, thus, higher performance. But this is a new feature that will not make its way into the major distributions for some time.
The kernel's page cache, Lameter said, simply does not scale. The fact that it can't work with large pages makes things worse; users have to use direct I/O or bypass the kernel entirely, which should not be necessary. That said, there has been some progress. The XArray data structure enables handling multiple page sizes in the page cache. The slab movable objects work can help to address fragmentation. Work is being done to avoid acquiring the mmap_sem lock while handling page faults, and support for huge pages is being added to filesystems. One option that has not been pursued, he said, is to create a kernel that uses 2MB as its base page size, or to increase the base page size to an intermediate value by grouping 4KB pages.
There is some value in persistent memory, which is attached to the memory channels and is thus fast. The DAX mechanism can be used to avoid the page cache altogether. This storage is currently limited in size, though, and cannot be used with RDMA due to the well-discussed problems with get_user_pages().
In the future, he said, kernel developers need to be thinking about terabit streams. There is 3D hologram streaming on the horizon, he said. We increasingly need to move massive amounts of data, but everybody is busy trying to avoid the kernel's limitations to get this work done. Part of the solution, eventually, will be new hardware architectures for high-performance computing.
It would be nice, he concluded, if the memory-management subsystem had a road map showing how it plans to meet these challenges. In the brief moment before the session ended, Matthew Wilcox said that not having a road map is not necessarily a bad thing. The development community is indeed working on these problems; each developer has taken on one piece of it. Coordinating all of this work is what LSFMM is all about; he now knows what others need from the subsystem.