Leading items
Welcome to the LWN.net Weekly Edition for November 30, 2023
This edition contains the following feature content:
- An overview of kernel samepage merging (KSM): KSM has seen a number of improvements recently; Stefan Roesch described that work and gave an overview of how to tune KSM for a specific workload.
- Using drgn on production kernels: brainstorming on how to make drgn work safely on production systems.
- Preventing atomic-context violations in Rust code with klint: a tool for ensuring that Rust kernel code handles atomic context properly.
- The real realtime preemption end game: it looks like the realtime preemption code will soon make it into the mainline — for real this time.
- The 2023 Kernel Maintainers Summit: a report from the 2023 gathering of top-level kernel maintainers, with discussions on:
- Trust in and maintenance of filesystems. The kernel supports a wide range of filesystems, some of which have seen little maintenance (or use) for years. How can the kernel communicate the maintenance status of individual filesystems to users, and how can the community remove the worst of them?
- Committing to Rust for kernel code: the use of Rust in the kernel is considered an experiment that can come to an end if Rust does not work out. As more Rust code is merged, though, it is starting to look like the experimental phase is coming to an end. Is the kernel community ready to commit to Rust for the long term?
- Reducing kernel-maintainer burnout: kernel maintainers are complaining about stress and overwork; what can be done to reduce the strain?
- A discussion on kernel-maintainer pain points: Linus Torvalds talks with the maintainers about how the development community could be improved.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
An overview of kernel samepage merging (KSM)
In the Kernel Summit track at the 2023 Linux Plumbers Conference (LPC), Stefan Roesch led a session on kernel samepage merging (KSM). He gave an overview of the feature and described some recent changes to KSM. He showed how an application can enable KSM to deduplicate its memory and how the feature can be evaluated to determine whether it is a good fit for new workloads. In addition, he provided some real-world data of the benefits from his workplace at Meta.
KSM basics
The high-level summary of KSM is "very simple": it is simply a scheme to deduplicate anonymous pages by sharing a single copy. It was added to the kernel in 2009, so it is not a new feature, but there has been increased interest in it over the last two years. The original use case was for deduplicating the memory of virtual machines (VMs), but there are other use cases as well.
![Stefan Roesch](https://static.lwn.net/images/2023/lpc-roesch-sm.png)
In order to do its job, KSM has a kernel thread, ksmd, that scans anonymous pages in virtual memory areas (VMAs) that have KSM enabled, which Roesch calls the "candidate pages". It operates in three major phases, using a hash of each page's contents to quickly compare it against the hashes of other pages to determine whether the page is duplicated (or whether its contents have changed). An rmap_item is created for each candidate to track its hash; if a candidate's hash changes frequently, it is not a good choice for deduplication.
In the second phase, any candidates that have not changed get added to an "unstable" tree; if the candidate is already found to be on the unstable tree, though, it gets moved to the "stable" tree. At that point, other pages with the same contents are switched to use a single page on the stable tree. A copy-on-write (CoW) mechanism is used to ensure that writes to any of the copies are handled correctly.
There are two ways to add an anonymous page to the candidate set. The "old way" is to use the madvise() system call, while the new one uses the prctl() system call; the latter was developed by Roesch. Not all memory regions are suitable for KSM, so there are exclusions for regions using DAX, hugetlb, and shared VMAs, he said.
The madvise() mechanism uses a flag, MADV_MERGEABLE, to indicate memory regions for KSM to operate on; if it is a compatible region, its pages are added to the candidates. The problem with that approach is that developers had to guess which memory regions would benefit, because there was no feedback on how well (or poorly) the deduplication was doing for a given region.
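Python's standard mmap module exposes madvise() directly, so the old-style opt-in can be sketched without writing C. This is a minimal illustration, assuming a Linux system; the helper name is made up here, and the call simply fails cleanly where the kernel was built without KSM support.

```python
import mmap

def mark_mergeable(length: int) -> bool:
    """Map anonymous memory and advise the kernel that KSM may merge it.

    Returns True if the advice was accepted, False if KSM is not
    available (CONFIG_KSM disabled, or a platform without the constant).
    """
    if not hasattr(mmap, "MADV_MERGEABLE"):
        return False               # not Linux, or Python predates 3.8
    buf = mmap.mmap(-1, length)    # anonymous private mapping
    try:
        buf.madvise(mmap.MADV_MERGEABLE)
        return True
    except OSError:
        return False               # kernel built without KSM
    finally:
        # A real application would keep the mapping alive; this is only
        # a demonstration, so the mapping is torn down again.
        buf.close()
```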
The new prctl()-based method was added in the 6.4 kernel; the PR_SET_MEMORY_MERGE flag can be used to enable KSM for all compatible VMAs in a process. That setting is also inherited when the process forks, so KSM will be enabled for compatible VMAs in any children as well. The PR_GET_MEMORY_MERGE flag can be used to query whether KSM is enabled for the process.
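Python does not wrap these prctl() flags, so a sketch has to call libc through ctypes. The constant values below are the ones defined in <linux/prctl.h> as of 6.4; the helper names are illustrative, and both calls simply return failure on older kernels or ones without KSM.

```python
import ctypes

# Values from <linux/prctl.h>, added in the 6.4 kernel.
PR_SET_MEMORY_MERGE = 67
PR_GET_MEMORY_MERGE = 68

_libc = ctypes.CDLL(None, use_errno=True)

def set_memory_merge(enable: bool) -> bool:
    """Enable or disable KSM for all compatible VMAs of this process.

    Returns True on success; False if the kernel lacks support.
    """
    return _libc.prctl(PR_SET_MEMORY_MERGE, int(enable), 0, 0, 0) == 0

def get_memory_merge() -> int:
    """Return 1 if KSM is enabled for this process, 0 if not, -1 on error."""
    return _libc.prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0)
```

Because the setting is inherited across fork(), enabling it in a controller process before spawning workers covers the whole tree, which is exactly the systemd use case described below.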
System-wide configuration of KSM is done through the /sys/kernel/mm/ksm sysfs interface; there are multiple files in that directory, both for monitoring and configuring the feature. The run file is used to enable or disable the feature on the system, pages_to_scan determines how many pages are scanned each time ksmd wakes up, and sleep_millisecs sets how frequently the scans are done. Those latter two govern how aggressively KSM operates.
For monitoring, there are a few files in the sysfs directory, as well as in the /proc/PID directory. In particular, the /proc/PID/ksm_stat file has some information about KSM for the process, while some extra KSM information was added to the smaps and smaps_rollups files for the 6.6 kernel. That information can be used to see which VMAs are benefiting from KSM.
The monitoring files in /sys/kernel/mm/ksm include system-wide measurements of KSM, such as pages_shared for the number of pages shared via KSM, pages_sharing for the number of references to KSM-shared pages (thus how many pages are being deduplicated), pages_unshared for the number of non-changing pages that are unique, thus unshared, and pages_volatile for the count of pages that change too rapidly to be merged. The pages_scanned file was added for 6.6 to count the total pages scanned; it can be combined with full_scans, the count of completed scans, to determine how much work is being done in the scan phase.
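These counters can be combined into a rough net-savings estimate. The arithmetic below follows the spirit of the general_profit calculation described in the kernel's KSM documentation (pages saved minus KSM's own bookkeeping); the per-rmap_item size used here is an assumption for illustration, and real deployments should simply read general_profit from sysfs.

```python
PAGE_SIZE = 4096        # typical x86 page size
RMAP_ITEM_SIZE = 64     # rough per-candidate bookkeeping cost (assumption)

def ksm_profit(pages_sharing: int, all_rmap_items: int,
               page_size: int = PAGE_SIZE,
               rmap_item_size: int = RMAP_ITEM_SIZE) -> int:
    """Approximate net memory saved by KSM, in bytes.

    Every reference to a shared page (pages_sharing) is a page that did
    not need its own copy; subtract the cost of the rmap_item tracking
    structure that KSM keeps for each candidate page.
    """
    return pages_sharing * page_size - all_rmap_items * rmap_item_size
```

Plugging in the roughly 2.1 million pages_sharing from Roesch's example workload gives a gross figure in the same ballpark as the ~6GB savings he reported, once overhead is subtracted.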
One challenge is that, prior to the 6.4 kernel, it was not possible to figure out how long the scans were taking. He added some tracepoints to KSM that allow measuring the scan time; ksm_start_scan and ksm_stop_scan are the two most important tracepoints, but there are a handful of others that are useful for more-specialized investigation.
At Meta
He then turned to how Meta is using KSM. The Instagram web application was suffering from both memory and CPU pressure on older server systems. The workload is characterized by a single controller process and 32 or more worker processes; the number of workers scales based on the size of the system. The workers load their interpreter into memory when they start up and they also share a lot of other data structures that get loaded on demand.
The Meta engineers thought that KSM would work well for that workload because there is a lot of memory that can potentially be shared. At the time, the only way to enable KSM was via the madvise() call. The workers are run in control groups (cgroups) that are started by systemd, so the idea of process-level KSM enabling came up, along with the idea of inheriting that state across fork().
That is where the prctl() flag, which was added for 6.4, came from. At the same time, systemd was modified to add the MemoryKSM parameter to enable KSM for a systemd service. The advantage of this approach is that the application code does not need to change at all to take advantage of KSM.
When he first started testing KSM on the workload, the "results were very disappointing to say the least", Roesch said; there was no real sharing of memory happening. He realized that the default pages_to_scan value was set to 100, which is "way too low"; later he noticed that the documentation says that the default is only useful for demo purposes. There were no tracepoints available at the time, either, which made it more difficult to track the problem down.
It turns out that 4000-5000 is a good compromise value for pages_to_scan on the Instagram workload. Other workloads that he has tested require 2000-3000 for that parameter. It is important that people know that the value needs to be changed; looking at the memory savings and the amount of time it takes to do a full scan are good hints for determining the best value. If it is taking 20 minutes to do a full scan, that is an indication that pages_to_scan is too low; Meta tries to keep the scan time at around two to three minutes, he said.
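That rule of thumb can be turned into a simple first-order adjustment. This is a hypothetical helper, not anything from the kernel or Meta's tooling; it assumes full-scan time is inversely proportional to pages_to_scan (ksmd scans a fixed batch per wakeup), and uses a 150-second default target to match the two-to-three-minute goal mentioned above.

```python
def suggest_pages_to_scan(current: int, scan_secs: float,
                          target_secs: float = 150.0) -> int:
    """Scale pages_to_scan so that a full scan takes about target_secs.

    If the measured scan took longer than the target, scanning must be
    proportionally more aggressive, and vice versa.
    """
    if current <= 0 or scan_secs <= 0 or target_secs <= 0:
        raise ValueError("inputs must be positive")
    return max(1, round(current * scan_secs / target_secs))
```

For example, with the default pages_to_scan of 100 and a measured 20-minute (1200-second) scan, the helper suggests 800; real workloads in the talk ended up in the 2000-5000 range.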
He showed some numbers for a typical workload (which can be seen in his slides or the YouTube video of the talk). There were around 73,000 pages_shared with 2.1 million references to them (i.e. pages_sharing). That means a savings of around 6GB of memory on a 64GB machine, "which is, for us, a huge saving". If you consider the fleet of systems at Meta, that savings multiplies greatly, Roesch said.
Optimizations
Once Meta started looking more closely at the scanning, it was clear that KSM was scanning a huge number of pages, especially during the initial ramp-up as the workers are being started. Even after it reaches something of a steady state, there are lots of pages being repeatedly scanned, but they are unique so they never get shared. That led to the idea of skipping pages as an optimization to reduce CPU usage.
The "smart scan" optimization, which has been merged for the 6.7 kernel, stores a skip count with each rmap_item that governs whether the page is skipped in the processing. Each time a page's skip period expires and the page is once again found to be unique, the skip count increases, up to a maximum of eight skipped scan cycles. Smart scan is enabled by default and it reduces the number of pages scanned per cycle by 10-20%.
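The skip-count idea can be illustrated with a toy model. This is a sketch of the back-off behavior as described in the talk, not the kernel's exact algorithm; the class and method names are invented for the illustration.

```python
MAX_SKIP = 8  # maximum number of scan cycles a page will be skipped

class Candidate:
    """Toy model of smart scan's per-rmap_item skip counter."""

    def __init__(self):
        self.skip = 0       # cycles still to be skipped
        self.skip_len = 0   # length of the next skip period

    def should_scan(self) -> bool:
        """Called once per scan cycle; consume a skip if one is pending."""
        if self.skip > 0:
            self.skip -= 1
            return False
        return True

    def found_unique(self):
        """Page was scanned and is still unmerged: back off further."""
        self.skip_len = min(MAX_SKIP, self.skip_len + 1)
        self.skip = self.skip_len
```

A page that keeps turning out to be unique is thus scanned less and less often, which is where the 10-20% reduction in per-cycle scanning work comes from.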
An optimization that is being discussed would help tune the number of pages to scan. Right now, that value needs to be set based on the ramp-up time, when more than twice as many pages need to be scanned per cycle; once the steady state is reached, the pages_to_scan value could be reduced. Other workloads have shown similar behavior, so the "auto-tune" optimization could manage how aggressive the page scans are. The idea would be to identify a target for how long it should take to scan all of the candidate pages, which is what auto-tune would try to optimize. There would also be minimum and maximum CPU-usage percentages that would limit the scans as well.
The results from auto-tune so far are promising. At startup, the pages_to_scan gets set to 5000-6000, but that gets reduced to 2500 or even less once the system reaches the steady state. That results in a CPU usage savings of 20-30% for ksmd. Configuration using a target scan time and CPU usage limits is more meaningful to administrators, as well, he said.
Evaluating new workloads
The easiest way to enable KSM for an application is by using the prctl() flag for the process. That can be done by changing the application itself, using the systemd parameter, or by running the program with an LD_PRELOAD library with a function that gets called at program load time. The last option works, but the first two are preferred, he said.
The next step is to run the program on a representative workload. The /sys/kernel/mm/ksm/general_profit file can be consulted to see how much memory is being saved; that measure subtracts out the memory used by KSM itself. The /proc files can be consulted for further per-process information as well.
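Collecting those sysfs counters for a test run is straightforward. The helper below is an illustrative sketch; it takes the sysfs directory as a parameter purely so it can be exercised against a fake directory, and on a real system the default path would be used.

```python
from pathlib import Path

def read_ksm_stats(ksm_dir: str = "/sys/kernel/mm/ksm") -> dict:
    """Read the numeric KSM counters from sysfs into a dictionary.

    Files that cannot be parsed as integers are silently skipped.
    """
    stats = {}
    for entry in Path(ksm_dir).iterdir():
        try:
            stats[entry.name] = int(entry.read_text())
        except (ValueError, OSError):
            pass
    return stats
```

Snapshotting the dictionary before and after a representative run, with different pages_to_scan values, gives the comparison data the evaluation needs.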
To get meaningful data, though, it makes sense to rerun the test with different pages_to_scan values. How aggressive the page scan should be depends on the workload, so it is important to run the tests long enough to get the full picture. He reiterated that the default value for pages_to_scan is not at all adequate, so it will need to be adjusted.
Often, it is the case that an application has certain VMAs that benefit from KSM and others that do not. The /proc/PID/smaps file now has entries for KSM that will help show which VMAs are seeing the most benefit. Once that is known, the prctl() call can be removed and separate madvise() calls can be made for just those VMAs. One general piece of advice that he had is that smaller page sizes work better with KSM because there is more likelihood of sharing.
Today, evaluating a new workload for KSM requires running experiments with KSM enabled, but there may be situations where KSM cannot be enabled or these kinds of experiments cannot be run. He has some ideas on ways to evaluate workloads and was looking for feedback on them. One is an in-kernel approach and the other uses the drgn kernel debugger.
He has just hacked something together for drgn at this point, which he has not yet released, but the idea is to go through all the VMAs and collect the hashes for the pages, storing them in Python dictionaries. That information can be processed to see how much sharing can be done. It is fairly simple, but is also rather slow; if only a few processes are examined, it is "probably OK", but if the whole system is to be analyzed, "we need to do something else".
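The core of that analysis, stripped of the drgn machinery, fits in a few lines of plain Python. This is a simplified userspace illustration of the idea, not Roesch's unreleased script: hash every page, bucket the hashes in a dictionary, and count the redundant copies.

```python
import hashlib
from collections import Counter

def estimate_sharing(pages):
    """Estimate how many of the given pages KSM could deduplicate.

    `pages` is a list of page-sized byte strings.  Returns
    (unique_pages, deduplicable_pages): every copy beyond the first in a
    hash bucket is a page KSM could reclaim.
    """
    counts = Counter(hashlib.sha1(p).digest() for p in pages)
    unique = len(counts)
    deduplicable = sum(n - 1 for n in counts.values())
    return unique, deduplicable
```

Note the caveat from the talk applies here too: walking every page of every process this way is slow, which is why an in-kernel variant is being considered.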
An in-kernel alternative would provide a means to calculate the hashes for the pages so that the sharing could be evaluated. A more advanced scheme would actually maintain the unstable and stable trees but do no merging; that would provide more accurate information about how much sharing can be done, but would be more expensive. These are some ideas he is considering because Meta has other workloads that might benefit from KSM, but running experiments to figure out which would benefit is rather time-consuming.
There are some security issues to consider with regard to KSM, though "if you control your workload then this is less of a worry". There are known side-channel attacks against KSM, however—he linked to two papers in his slides—so that should be factored into the decision about using KSM. In addition, KSM does not make sense for all workloads; in particular, latency-sensitive workloads are not good candidates for KSM.
He wrapped up by recounting the KSM changes that entered the kernel in 6.1, 6.4, 6.6, and the upcoming 6.7, with a nod to the auto-tune feature that will likely come before long. He also credited several of his colleagues for work on the feature and the systemd developers for helping him on that piece of the puzzle.
Omar Sandoval asked whether auto-tune was being done in the kernel or if it was driven by user space. Roesch said that it was all done in the kernel based on the three parameters (target scan time, CPU min/max). There are default values for those that should be fine for most workloads, but may need tweaking based on the number of pages and the CPU availability.
Another question was about the CPU and memory overhead for enabling KSM. Roesch said there is a formula in the documentation to calculate the memory overhead, but that it is not much; there are the rmap_item entries, which include the unstable tree that is overlaid on them, plus the stable tree. The CPU overhead depends on how aggressively the scans are done; on a typical Instagram Skylake system during startup "we see up to 60% CPU usage for the ksmd kernel background thread", which drops to around 30% in the steady state.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for assistance with my travel to Richmond for LPC.]
Using drgn on production kernels
The drgn Python-based kernel debugger was developed by Omar Sandoval for use in his job on the kernel team at Meta. He now spends most of his time working on drgn, both in developing new features for the tool and in using it to debug production problems at Meta, which gives him a view of both ends of that feedback loop. At the 2023 Linux Plumbers Conference (LPC), he led a session on drgn in the kernel debugging microconference, where he wanted to brainstorm on how to add some new features to the debugger and, in particular, how to allow them to work on production kernels.
Quick intro
Sandoval gave a presentation on drgn (which is pronounced "dragon") in 2019 that covered some of the basics of the tool, which has evolved since then. He has given other in-depth talks on drgn, he said, so he would only be doing a quick introduction to the tool at the LPC session. After that, he wanted to focus on two features, writing to memory and setting breakpoints, and to justify why his team wants to be able to do those things on kernels running in production ("as crazy as it sounds"). He hoped that the brainstorming could come up with both a mechanism for supporting the features and an API that is "friendly enough, but also not so dangerous in the sense that you won't accidentally do something that you didn't mean to do".
![Omar Sandoval](https://static.lwn.net/images/2023/lpc-sandoval-sm.png)
Drgn is a "programmable debugger"; rather than having built-in commands, it provides building blocks, representations of kernel objects, types, stack traces, and more, that can be used to create the tool needed for the job at hand. There are, for example, many kernel-specific helper functions that provide access to various internal data structures, such as to find task structures or to walk various slab caches. Those can be used in an interactive session and then turned into scripts that can be saved (or shared with others) for the next time a similar problem arises, he said.
He did a brief demo of drgn on a virtual machine on his laptop; the YouTube video of the presentation from the conference livestream is available for the curious. In the Python read-eval-print loop (REPL), he had a handful of import statements pre-typed and then proceeded to demonstrate some of the capabilities, such as looking up the idle task using its variable name (init_task) with one of the kernel helpers.
He also showed a loop using the for_each_task() kernel helper that found the task structure for a cat he had running in a shell; he could then print the stack trace for that task, which had all of the symbol information, filenames, and line numbers. He used the stack-frame number to index into an array to further investigate a particular stack frame, including things like its local variables and their values. There are also a large number of contributed scripts that consist of "people's debugging sessions" that can be examined in addition to all of the helpers that come with drgn.
All of what he had shown is read-only, however; you can read any memory in the live kernel or in a kernel dump. But users have been asking for read-write features for the live kernel for some time; that would allow overwriting memory and setting breakpoints in the running kernel. That makes sense for development workflows, Sandoval said; drgn could attach to the kernel running in QEMU using its GDB stub. That functionality is something that developers are used to when debugging, so he would like to support it in drgn.
He has a proposed memory-writing API that, at its most basic level, just takes a target address and buffer of bytes to write there, which makes the user responsible for figuring out the right address and how to place the values into the buffer correctly based on the kernel type. On top of that would be a more user-friendly interface that would mirror the read side to a certain extent; objects can be looked up, then their fields can be used as Python attributes, with drgn ensuring that the write is done correctly. It could potentially also take a Python dictionary with structure fields as keys to write a structure with those values. The API is still up for debate as he has not implemented anything yet.
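Since Sandoval has not implemented anything yet, there is no real API to show; the sketch below only illustrates the two-layer shape he described, with invented names and a fake in-process "memory" standing in for whatever kernel-side mechanism is eventually chosen.

```python
class FakeKernelMemory:
    """Stand-in backend so the proposed interface can be exercised
    without touching a kernel."""

    def __init__(self):
        self.mem = {}

    def write(self, address: int, data: bytes):
        # The basic level of the proposed API: raw bytes at an address.
        for i, b in enumerate(data):
            self.mem[address + i] = b

    def read(self, address: int, size: int) -> bytes:
        return bytes(self.mem.get(address + i, 0) for i in range(size))

def write_object_field(mem, base: int, field_offset: int,
                       value: int, size: int):
    """Friendlier layer: compute the field's address and encode the
    value with the correct width and byte order, so the caller does not
    hand-build buffers from type information."""
    mem.write(base + field_offset, value.to_bytes(size, "little"))
```

In the real debugger, the base address, field offset, and size would come from drgn's type information rather than being passed by hand.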
"Breakpoints are a little more complicated, but not too much." There are a few different ways a user might want to set a breakpoint: by address, function name, function name and offset, or filename and line number. Then handling any breakpoints might be done with a synchronous event loop, where the events indicate which thread hit the breakpoint, allow access to things like the stack trace and local variables for the stack frames, and provide a way to resume the thread after the processing is done.
Once again, Sandoval said that he was interested in hearing about simpler alternatives or use cases that still needed to be covered. Chris Mason said that he wanted to be able to see when a frequently called function is being called from some other specific function; Sandoval said that could be done with his API just by looking at the stack frame in the breakpoint and resuming unless it is being called by the function of interest. Another attendee suggested watchpoints for memory, which Sandoval thought could be added to the API in a way similar to the set_breakpoint() call he was proposing.
Because drgn is programmable, many of the different use cases can be handled with programs of different sorts, he said. If some of the use cases need a performance boost, perhaps BPF could be used to do things like pre-filter breakpoints. Another attendee suggested using drgn for doing error injection in testing, which Sandoval thought could fit right in, though there may be a need for a way to overwrite registers as part of the API.
Production
Those features are obviously useful in development, but his team at Meta has run into a few scenarios where it would be helpful to be able use them on production systems. For example, there have been cases where being able to overwrite some part of memory in the kernel would be enough to work around an emergency that has gotten people out of bed. It could be used to fix reference counts that are not getting decremented correctly, reset overflows or underflows of accounting information, change invalid states, and more.
A more concrete example is "an embarrassing bug in Btrfs" (fixed by this commit) where enabling asynchronous discard was not handled quite correctly. It manifested in reports of disks starting to run slowly, which was eventually tracked down to discards (i.e. telling the device that certain blocks are no longer used so that it can do garbage collection on them) not being issued at all. After some "heroic debugging", the problem was tracked down and promptly fixed in the tree, but it would have been convenient to be able to run a drgn script on affected machines to change the single bit that would actually enable discard for the Btrfs mounts.
There are "a lot of caveats about doing this in production", though. You have to be careful about what you are overwriting—and when—"race conditions are definitely a thing". It is not meant as a replacement for a live patch or an updated kernel, but is instead a test of the fix and a stopgap measure. He hoped that explained the why, so he wanted to turn to "all the crazy ways we might be able to do this".
For development, the solution is easy, Sandoval said, simply use the GDB stub that is provided by QEMU. He listed some possibilities for the production use case, starting with bringing back /dev/kmem, which is "almost a joke" of a suggestion. He mentioned an LWN article that celebrated its removal and he noted that the /dev/kmem interface to read and write the kernel's memory was a "beautiful thing for rootkits". Drgn is not a rootkit, but debuggers do share some elements with rootkits, so /dev/kmem would be the "most straightforward way to support this, but I don't think anyone is going to accept that patch".
An alternative might be a custom kernel module that is effectively /dev/kmem, which is not that much better, he said. But, in order to enable the feature, there will need to be some way to write values to an arbitrary address, so the key will be in getting the access controls right. BPF could perhaps be used, but that "kind of goes against everything that BPF believes in, which is that your program should be safe".
Another possible approach would be to interface to kgdb, though Meta does not enable it in production kernels "but it's not the worst thing we could do". Kgdb already supports both memory writes and breakpoints, but, as far as he can tell, it was never intended to be used on a live, running system. For example, hitting a breakpoint stops every CPU, so drgn cannot still be running; perhaps it could be modified to only stop certain CPUs and leave one running for drgn.
An attendee asked why users would want other CPUs running when kgdb hits a breakpoint; the whole point of stopping them is that other CPUs cannot then interfere with the state of the kernel at the time of the breakpoint. Sandoval said that works fine when there is another system driving the debugging, but he wants to be able to log into a broken machine and run drgn there. Another audience member said that if there is a breakpoint set, drgn itself could cause it to be hit, leading to a deadlock.
Sandoval acknowledged that problem as a difficult one. His idea is to have a watchdog that would raise a non-maskable interrupt (NMI) to cancel the breakpoint in the deadlock situation. Kprobes were identified as another way to do breakpoints, which Sandoval thought might be workable. There would still need to be a kernel module that alerted drgn that the kprobe/breakpoint had been hit, as well as a watchdog for deadlock prevention, he said.
The kernel lockdown mode was brought up as a potential problem area by a participant; it is meant to restrict any mechanisms that might alter the running kernel, and may well be enabled on many production kernels. So had Sandoval thought about how drgn might work with—or around—lockdown? It probably makes sense to just disable drgn support on locked-down kernels, Sandoval said.
When considering access control, the features that he wants to add to drgn are things that already could be done from a custom kernel module, thus CAP_SYS_MODULE and CAP_SYS_ADMIN could perhaps control access to whatever underlying mechanism is decided upon. There is the caveat that some organizations require signed kernel modules beyond just having the capabilities. That might mean that the drgn mechanism needs to validate the user based on keys on the kernel keyring in some fashion.
Stephen Brennan pointed out that Python itself loads lots of code from various locations on the system that needs to be somehow protected so that running drgn does not become a compromise vector. Sandoval said that he "kind of copped out and made it a per-user authentication thing", so that the user has to be careful about those kinds of things, but that type of access control has not worked out so well over the years, he said, pointing to setuid binaries in particular.
Instead of having full breakpoints in drgn, Mason said, there could be a limited set of things that can be done when the code is reached. That could then be turned into BPF or a kprobe, which would then need to be inserted into the kernel; it would not change the security picture at all, but would simplify the problem of stopping all the CPUs and prevent the deadlocks. Sandoval said that one of the things in that defined set would need to be writing memory, however, so some solution for that part of the problem would still be required.
As time ran out, he wrapped up by saying that he still had "more questions than answers", but encouraged attendees to find him later to discuss "more bad ideas"—or so that he could show them "cool drgn stuff", he said with a chuckle.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for assistance with my travel costs to Richmond for LPC.]
Preventing atomic-context violations in Rust code with klint
One of the core constraints when programming in the kernel is the need to avoid sleeping when running in atomic context. For the most part, the responsibility for adherence to this rule is placed on the developer's shoulders; Rust developers, though, want the compiler to ensure that code is safe whenever possible. At the 2023 Linux Plumbers Conference, Gary Guo presented (via a remote link) the klint tool, which can find and flag many atomic-context violations before they turn into user-affecting bugs.

Rust is built on the idea that safe Rust code, as verified by the compiler, cannot cause undefined behavior. Undefined behavior comes in a lot of forms, including dereferencing dangling or null pointers, buffer overruns, data races, or violations of the aliasing rules; code that is "safe" will not do those things. The Rust-for-Linux project is trying to create an environment where much kernel functionality can be implemented with safe code. On the other hand, some surprising behavior, including memory leaks, deadlocks, panics, and aborts, is considered "safe". Such behavior is defined, thus "safe" (though still, obviously, bad).
Atomic context in the kernel raises some interesting safety questions. If code, for example, executes a sequence like:
```c
spin_lock(&lock);
/* ... */
mutex_lock(&mutex);  /* can schedule */
/* ... */
spin_unlock(&lock);
```
the result could be a deadlock if another thread attempts to take the same spinlock on the same CPU. That is "safe" (but "bad") code. But what about code like the following?
```c
rcu_read_lock();
schedule();
rcu_read_unlock();
```
In this case, the safety of this code, even in the Rust sense, is not so clear. RCU assumes that there will be no context switches in code that is running within an RCU critical section; calling into the scheduler breaks that assumption. In this case, the atomic-context violation can indeed be a safety issue, creating use-after-free bugs, data races and worse. This is "fine" for C code, where the distinction between "safety" and "correctness" is not so well defined. Rust developers, though, try to live by different rules; consequently, they cannot design a safe API that allows sleeping in atomic context.
Avoiding that situation is not easy, though. One possible solution would
be to make all blocking operations unsafe. That, Guo acknowledged, is
likely to be widely seen as a bad idea. Another approach is token types,
which are commonly used in Rust to represent capabilities; functions that
might sleep can require a token asserting the right to do so. That
leads to complex and unwieldy APIs, though. It is possible to do runtime
checking, using the preemption count maintained in some kernel
configurations now. That adds runtime overhead, though, and the preempt
count is not available in all kernel configurations.
The last option would be to simply ignore the problem and trust developers to get things right, perhaps using the kernel's lockdep locking checker to find some problems on development systems. That approach, though, is unsound and not the Rust way of doing things.
The root of the problem, Guo said, is the need to optimize three objectives (soundness, an ergonomic API, and minimal run-time overhead) in a "choose any two" situation. Token types optimize soundness and overhead at the expense of an ergonomic API, for example, while run-time checking improves the API but sacrifices the goal of avoiding run-time overhead. Solutions that optimize all three quantities are hard to come by; the kernel's needs simply do not fit nicely into the Rust safety model.
The answer, Guo said, is to adapt the Rust compiler to this use case; that has been done in the form of a tool called "klint", which will verify at compile time the absence of atomic-context violations to the maximum extent possible. For the cases that cannot be verified, an escape hatch, in the form of a run-time check or use of unsafe, will be provided to developers so that their code can be built.
This tool was built with a number of goals in mind. It should be easy to explain and understand, of course, and provide useful diagnostics. There needs to be an escape hatch so that it does not get in the way of getting real work done. Its defaults, he said, should be sane, and there should be little need for additional annotations in the kernel. Finally, the tool needs to be fast so that it can be run every time the code is built.
Klint gives every function two properties, the first of which is an "adjustment" describing the change it makes (if any) to the preemption count (which, when non-zero, indicates that the current thread cannot be preempted). The second is the expected value of the preemption count when the call is made; this value can be a range. The klint tool tracks the possible state of the preemption count at each location, looking for situations where a function's expected preemption count is violated.
Thus, for example, rcu_read_lock() increments the preemption count by one, and can be called with any value. In Rust code, that would be annotated as:
    #[klint::preempt_count(adjust = 1, expect = 0.., unchecked)]
    pub fn rcu_read_lock() -> RcuReadGuard {
        /* ... */
    }
As klint passes over the code, it tracks the possible values of the preemption count and flags an error if an expected condition is not met. For example, schedule() would be annotated as expecting the preemption count to be zero; if klint sees a call to schedule() after a call to rcu_read_lock() it will complain — unless, of course, there is a call to rcu_read_unlock() that happens first.
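The bookkeeping involved can be illustrated with a toy model. Klint itself works on the compiler's intermediate representation, but the idea is the same: track the possible preemption count through a sequence of calls and check each callee's expectation. The structure and function names here are illustrative, not part of klint:

```rust
// Each function declares how it adjusts the preemption count and what
// range of values it expects on entry (inclusive lower bound, optional
// inclusive upper bound).
struct FnInfo {
    name: &'static str,
    adjust: i32,
    expect_lo: i32,
    expect_hi: Option<i32>,
}

// Walk a straight-line call sequence, tracking the preemption count
// and flagging any call whose expectation is violated.
fn check(seq: &[FnInfo]) -> Result<(), String> {
    let mut count = 0i32;
    for f in seq {
        let hi_ok = f.expect_hi.map_or(true, |hi| count <= hi);
        if count < f.expect_lo || !hi_ok {
            return Err(format!("{} called with preemption count {}",
                               f.name, count));
        }
        count += f.adjust;
    }
    Ok(())
}

fn main() {
    // rcu_read_lock(): adjust = 1, expect = 0..
    let lock = || FnInfo {
        name: "rcu_read_lock", adjust: 1, expect_lo: 0, expect_hi: None };
    // rcu_read_unlock(): adjust = -1, must hold the lock already
    let unlock = || FnInfo {
        name: "rcu_read_unlock", adjust: -1, expect_lo: 1, expect_hi: None };
    // schedule(): expects the preemption count to be exactly zero
    let schedule = || FnInfo {
        name: "schedule", adjust: 0, expect_lo: 0, expect_hi: Some(0) };

    // Unlocking before scheduling is fine...
    assert!(check(&[lock(), unlock(), schedule()]).is_ok());
    // ...but schedule() inside an RCU read-side critical section is flagged.
    assert!(check(&[lock(), schedule()]).is_err());
}
```

Real code is not straight-line, of course; klint must merge the possible count ranges at control-flow joins, which is where the range-valued "expect" annotations come in.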
The compiler's type inference makes explicit annotation unnecessary much of the time. There are exceptions, naturally, including at the foreign-function interface boundary, with recursive functions, and with indirect function calls. Other limitations exist as well; there is, for example, no way to annotate functions like spin_trylock(), where the effect on the preemption count is not known in advance. Perhaps, in the future, that shortcoming could be addressed by adding some sort of match expression to the annotations, he said.
Data-dependent acquisition, where, for example, a function only takes a lock if a boolean parameter instructs it to, is also not handled by klint at this point. Finally, there are cases where the compiler injects code into a function that confuses klint, leading to incorrect reports. This problem is currently blocking the wider use of klint, and is thus urgent to solve. Meanwhile, he said, klint imposes a negligible compile-time overhead.
Guo concluded by saying that klint is available on GitHub for folks who want to play with it. More information can also be found in the slides from the talk.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our
travel to this event.]
The real realtime preemption end game
The addition of realtime support to Linux is a long story; it first shows up in LWN in 2004. For much of that time, it has seemed like only a little more work was needed to get across the finish line; thus we ran headlines like the realtime preemption endgame — in 2009. At the 2023 Linux Plumbers Conference, Thomas Gleixner informed the group that, now, the end truly is near. There is really only one big problem left to be solved before all of that work can land in the mainline.
The point of realtime preemption is to ensure that the highest-priority
process will always be able to run with a minimum (and predictable) delay.
To that end, it makes the kernel preemptible in as many situations as
possible, with the exceptions being tightly limited in scope. The basic
mechanics of how that works have been established for a long time, but
there have been a lot of details to resolve along the way. The realtime
preemption work has resulted in the rewriting of much of the core kernel
over the years, with benefits that extend far beyond the realtime use case.
Gleixner started by noting that, while the realtime preemption project has been underway for nearly 20 years, it is actually closer to 25 years for him — he started working on realtime support for Linux in 1999. Once it's done, he said, there will be "a big party". Is that point at hand? The answer, he said, is "yes — kind of". There is one last holdout to be dealt with: printk().
Whenever code in the kernel needs to send something to the system consoles and logs, it calls printk() or one of the numerous functions built on top of it. One might not think that printing a message would be a challenging task, but it is. A call to printk() can come from any context, including in non-maskable-interrupt handlers or other printk() calls. The information being printed may be crucial, especially in the case of a system crash, so printk() calls have to work regardless of the context. As a result, there are a lot of concurrency and locking issues, and lots of driver-related complications.
printk(), Gleixner said, is fully synchronous in current kernels; a call will not return until the message has been sent to all of the configured destinations. That is "stupid"; much of what is printed is simply noise, especially during the boot process, and there is no point to waiting for it all to go out. Beyond being pointless, that waiting introduces latency, which runs counter to the goals of the realtime work, so the realtime developers have long since moved printk() output into separate threads, making it asynchronous. That code is a bunch of hacks rather than a real solution, though. A better job must be done to make this work useful for the rest of the kernel.
The printk() problem has been worked on seriously since 2018, resulting in about 300 patches that have either gone upstream or are waiting in linux-next; this work has been covered here at times. There are, he said, three final patch sets currently in the works to finish the job. A few tricky details are still being worked on. One of those is the handover mechanism; if the kernel has an emergency message to put out (it's crashing, for example), it may need to grab control of a console that is currently printing a lower-priority message. Doing that safely from any context is not an easy thing to do.
Another ongoing task is marking console drivers that are not safe to use in some contexts; if, for example, outputting a message during a non-maskable interrupt requires doing video-mode setting, it's just not going to work.
Gleixner finished the prepared part of his talk by saying that, even though it's getting close, nobody should ask him when the work will be done. printk() is unpredictable, and he is no longer willing to even try. Even so, he expressed hopes that the rest of the realtime preemption code would be in mainline before the 20th anniversary comes late in 2024.
An audience member asked whether there had been any interesting changes in
the printk() code over the last year; Gleixner answered that there
have been no fundamental conceptual changes. John Ogness, who has done
much of the printk() work, said that the handover code has been
reduced somewhat, but that some work remains; there are 76 console drivers
in the kernel that need to be fixed, and it may take a while until they are
all done. The handover code has been changed to allow drivers to be
updated one at a time rather than requiring that this work all be done at
once. (See this article for more
discussion on the recent printk() work).
Masami Hiramatsu asked which kernel messages need to be printed synchronously; Gleixner answered that almost everything should be made asynchronous. Beyond reducing latency associated with printk() calls, asynchronous output allows the creation of a separate kernel thread for each console, letting the faster consoles go at full speed rather than waiting for the slowest one. He also said that the code has been changed to ensure that important messages are fully copied into the message buffer before the first line is output, just in case a faulty console driver brings the whole system down in flames. Further safety is obtained by writing to the known-safe consoles first. If, for example, there is a persistent-memory store available, messages are put there before being sent to physical devices, once again preserving the output even if a faulty driver kills the system.
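The per-console-thread design described above can be sketched in ordinary Rust: the printk() stand-in only enqueues a message for each console, and a dedicated thread drains each queue at that console's own speed, so the caller never waits for the slowest device. The names and timings are illustrative only:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Fan messages out to one queue per console; one thread per console
// drains its queue and reports the lines it "printed".
fn run_consoles(messages: &[&str],
                consoles: &[(&'static str, u64)]) -> Vec<String> {
    let (done_tx, done_rx) = mpsc::channel::<String>();
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for &(name, delay_ms) in consoles {
        let (tx, rx) = mpsc::channel::<String>();
        let done = done_tx.clone();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            for msg in rx {
                // Simulate this console's write speed; a fast console
                // is never held back by a slow one.
                thread::sleep(Duration::from_millis(delay_ms));
                done.send(format!("[{name}] {msg}")).unwrap();
            }
        }));
    }
    drop(done_tx);

    // The printk() stand-in: just an enqueue, no waiting.
    for msg in messages {
        for tx in &senders {
            tx.send(msg.to_string()).unwrap();
        }
    }
    drop(senders); // close the queues so the console threads exit

    for h in handles {
        h.join().unwrap();
    }
    done_rx.into_iter().collect()
}

fn main() {
    let lines = run_consoles(&["booting", "mounted /"],
                             &[("fast-serial", 1), ("slow-serial", 10)]);
    // Every console eventually sees every message.
    assert_eq!(lines.len(), 4);
}
```

The real kernel design also has to handle the priority-handover and unsafe-context cases discussed above, which this sketch ignores entirely.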
As the session closed, Clark Williams asked whether, once the printk() patches go upstream, Gleixner would try to push the rest of the realtime code (which wasn't discussed in this session) in the same merge window. The answer was a qualified "yes"; he might try if all of the code is staged in linux-next and seems ready to go.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our
travel to this event.]
The 2023 Kernel Maintainers Summit
The Kernel Maintainers Summit is an annual, invitation-only gathering of a subset of the kernel's top maintainers. The 2023 meeting took place on November 16 in Richmond, Virginia, after the Linux Plumbers Conference. A full day of discussion covered filesystem maturity, Rust, maintainer burnout, and more.

LWN's reporting from this gathering is now complete; the discussions held were:
- Trust in and maintenance of filesystems. The kernel supports a wide range of filesystems, some of which have seen little maintenance (or use) for years. How can the kernel communicate the maintenance status of individual filesystems to users, and how can the community remove the worst of them?
- Committing to Rust for kernel code: the use of Rust in the kernel is considered an experiment that can come to an end if Rust does not work out. As more Rust code is merged, though, it is starting to look like the experimental phase is coming to an end. Is the kernel community ready to commit to Rust for the long term?
- Reducing kernel-maintainer burnout: kernel maintainers are complaining about stress and overwork; what can be done to reduce the strain?
- A discussion on kernel-maintainer pain points: Linus Torvalds talks with the maintainers about how the development community could be improved.
Group photo
Acknowledgment
Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our travel to this event.
Trust in and maintenance of filesystems
The Linux kernel supports a wide variety of filesystems, many of which are no longer in heavy use — or, perhaps, any use at all. The kernel code implementing the less-popular filesystems tends to be relatively unpopular as well, receiving little in the way of maintenance. Keeping old filesystems alive does place a burden on kernel developers, though, so it is not surprising that there is pressure to remove the least popular ones. At the 2023 Kernel Maintainers Summit, the developers talked about these filesystems and what can be done about them.

Christoph Hellwig started the discussion by saying that it is hard for developers to know how mature — how trustworthy and maintained — any given filesystem is; that can often only be determined by talking to its users. This information gap can be a bad thing, he said. User space (in the form of desktop environments in particular) has a strong urge to automatically mount filesystems, even those that are unmaintained, insecure, and untrustworthy. This automounting exposes the system to security threats and is always a bad idea, but it's a fact of life; maybe there needs to be a way for the kernel to indicate to user space that some filesystems are not suitable for mounting in this way.
Another problem, he said, is fuzz testing. He appreciates all of the work
that is going into fuzz-testing of filesystems, but it is not helpful if it
is being directed at filesystems that are not going to be fixed. That is a
waste of resources; the fuzzer should, instead, be directed at filesystems
that will be fixed in response to problem reports.
The broader question, he continued, is how the kernel can do better at getting rid of old and unmaintained code. The process for doing so is always "very ad hoc"; typically, some maintainer gets angry and sends a removal patch, which is sometimes accepted. What comes next is typically a round of "whining in the Linux press". Other projects have "reasonable deprecation schedules", where features are annotated as being set for removal in a couple of releases unless somebody who cares puts in the time to maintain them properly. The kernel, perhaps, could benefit from something similar.
He closed by mentioning the prospect of the European Cyber Resilience Act, which could put vendors of products containing the kernel at risk.
Steve Rostedt said that, since most users run the kernels provided by distributors, the right thing to do might be to educate those distributors about which filesystems are trustworthy. Hellwig responded that, if the community needs to educate in that way, it is doing something wrong; there should be a better way to communicate this information. Ted Ts'o said that there are hundreds of distributions out there; it would be necessary for the kernel community to decide which ones it cares about. It could be said, he suggested, that distributors that do not contribute back to the kernel do not matter.
Linus Torvalds responded that he does not want anybody who would make that argument as a developer on his kernel — a position that has not changed in decades, he said. Any approach that says "users don't matter" is wrong, he said. Hellwig answered that users need to know when a filesystem is broken. Dave Chinner suggested that the right approach is to talk to the people who are setting the automounting policies — and the GNOME developers in particular.
Communications
Kees Cook said that the maturity information for filesystems could be stored as a field in the MAINTAINERS file; noises in the room made it clear that this idea was not universally loved. Hellwig said that file is not the place to send users; perhaps, instead, the automounter could be taught to only mount filesystems at a given maturity level or above. Chinner noted that the XFS developers have shipped a udev rule saying "don't automount XFS filesystems" for some time; perhaps that policy should be centralized. Torvalds said that everybody agrees that the current "automount everything" policy is wrong, but it should not be blocked in the kernel; this is a problem for the desktop environments.
Ts'o said that the problem comes down to communications; end users do not read kernel configurations or documentation folders. There is also no agreement on the acceptable level of risk; kernel developers worry about unmaintained filesystems, but GNOME developers think that everything should be automounted. That was a rule made by a product manager, he said, "and you don't argue with product managers". That, in turn, makes his life miserable as he is inundated with syzbot reports for crashes resulting from mounting malicious filesystems. The kernel, he concluded, should suggest a default policy that makes sense, but user space will make the decision.
If we had some sort of filesystem maturity model, Hellwig said, distributors would be able to use it to set an automounting policy. Josef Bacik said that it was just necessary to figure out a way to communicate this information to user space. Once that is in place, it's just a matter of waiting a couple of years for the tooling to be updated. He added that the more interesting problem was that of deprecation; there is no policy for doing that now. He would like a way to tag kernel features as being deprecated and slated for future removal. Greg Kroah-Hartman pointed out that this idea has been tried in the past (it was abandoned in 2012); Bacik said that it was time to try again.
Kroah-Hartman said that the kernel drops drivers all the time, and that perhaps the process should be formalized. Torvalds answered that he will always require a reason to deprecate code; the lack of reasons has annoyed him in the filesystem discussion, he said. He raised the "sysv" filesystem (used on Unix version 7 on PDP-11 machines and on some early proprietary x86 Unix systems) as an example; it is simple and places no burden on developers, so there is no reason to deprecate it. Bacik responded that there is no way to make changes to sysv and be sure of not breaking it; when Torvalds asked if anybody had encountered a problem with sysv, the answer from the room was "we don't know". That, Torvalds said, proved the lack of a cost; nobody cares whether sysv remains in the tree.
There was a side discussion on the differences between filesystems and drivers that started when Torvalds pointed out that there are many drivers in the kernel that receive little or no testing. Hellwig asserted that there are many drivers that have not worked for years. Kroah-Hartman said that there is no difference between drivers and filesystems, but Arnd Bergmann said that, for drivers, it's relatively clear when the associated hardware no longer exists. That is not the case for filesystems. Thomas Gleixner added that, if a driver stops working, it simply stops; a filesystem, instead, can silently corrupt data. Chinner agreed, pointing out that one cannot fix filesystem corruption with a reboot.
Rostedt claimed that it is possible to test all of the filesystems in the tree, since there is no special hardware needed, but the filesystem developers disagreed. Chinner said that, without a mkfs tool, filesystem testing cannot be done; additionally, there is a need for a filesystem checker and integration with the fstests suite. Ts'o singled out the ntfs filesystem; the tools are proprietary, but the kernel developers took the kernel code without insisting that the tools be open.
According to Chinner, the quality of the kernel's filesystem implementations has improved greatly over the last decade or so. There are something like 2,000 tests in the fstests suite; it even has 200 Btrfs-specific tests, where recently there were none. The reliability of the filesystems covered by fstests has gone way up; developers know that the filesystems are good and can tell users so. For the other filesystems, nobody really knows.
Torvalds pointed out that one of those other filesystems, reiserfs, is in fact deprecated and on its way out. It is possible to remove code that causes problems, but only if there is a good reason to do so. Reiserfs may still have a few users, given that SUSE defaulted to it for years, but he is happy to remove it — as long as there are no complaints, which might cause him to reconsider. Christian Brauner asked for a proper path toward that removal, so that maintainers can effect removals without getting yelled at.
There was some discussion of how to communicate filesystem maturity information to user space. Ideas included an ELF section in module binaries, a kernel-configuration option, an interface for automounters, tainting the kernel when a low-quality filesystem is mounted, or requiring a special "I know it's deprecated" mount option. Dave Airlie suggested working with the udisks developers; Chinner said that he had tried that and "hit a brick wall" before just adding a udev rule instead. A few developers expressed frustration with a perceived inability to get a response from user-space developers on topics like this.
"Print a warning"
Torvalds said that there is no way to communicate this information to existing user space, since that code is not prepared to receive that message. Instead, he said, just outputting a warning with printk() can be effective; users see the errors and complain to their distributors. Suitable warnings could be added to the more questionable filesystems.
That, Kroah-Hartman said, requires coming up with a list of good and bad filesystems. Hellwig said that there would need to be at least three levels: "no trust", "generally maintained but don't mount untrusted images", and "well maintained". Torvalds said that this information could be given to the kernel when a filesystem is registered, and a warning printed if an untrusted filesystem is mounted. "Enough arguing", he said, it was time to just write a patch and try it.
The arguing was not done quite yet, though; Gleixner complained about architectures that do not keep up with low-level core-kernel changes. That leaves him having to figure out how to fix "25 PowerPC subarchitectures", an impossible task for a developer who lacks the hardware and is not an expert in that architecture. Might there be a way to tell architecture maintainers that they need to move forward to current APIs or the support will be removed?
Hellwig concurred with the problem, saying that there is an implicit assumption in the community that this sort of API cleanup is a low-priority task; as a result it is easily blocked. There needs to be a way to force developers to move to newer APIs. Bergmann mentioned the desire to remove high memory, but 32-bit Arm still needs it, so it cannot be removed from anywhere, imposing significant costs on the kernel as a whole. Chinner complained that increasing amounts of work are being placed on maintainers to keep old code working; maintainers are at the limit of what they can do now, and this path is unsustainable, he said.
Torvalds repeated that it is possible to deprecate old code when there is a good reason to do so. When Gleixner again said that he cannot get architecture maintainers to move to new APIs, Torvalds added that kernel developers often add new APIs alongside old ones to avoid having to fix everything at the outset. Perhaps, he said, developers should try to avoid that approach and, instead, just fix everything right away. Gleixner said that, for some changes, every subarchitecture must be fixed manually, which is a huge job.
Christian Brauner said that he had similar problems with the adoption of the new mount API; it took years to convert a majority of the important filesystems, and he had to do a lot of the work himself. A lot of his patches were rejected, creating a frustrating situation. Hellwig added that, during this conversion, two more filesystems using the old API were merged. Torvalds suggested that, in many of these cases, adding a warning might be all that is needed to put pressure on maintainers to move forward.
The above discussion was supposed to fit into a 30-minute discussion slot; readers who have gotten this far will be unsurprised to learn that it ran significantly over. At this point, the developers were in need of a break, so this topic was put aside so that the rest of the agenda could be addressed.
Committing to Rust for kernel code
Rust has been a prominent topic at the Kernel Maintainers Summit for the last couple of years, and the 2023 meeting continued that tradition. As Rust-for-Linux developer Miguel Ojeda noted at the beginning of the session dedicated to the topic, the level of interest in using Rust for kernel development has increased significantly over the last year. But Rust was explicitly added to Linux as an experiment; is the kernel community now ready to say that the experiment has succeeded?

The Rust-for-Linux project has added a full-time engineer in the last year, Ojeda said, and a student developer as well. Various companies have joined in to support this work. There is also work underway to get the Coccinelle tool working with Rust code. A priority at the moment is bringing in more reviewers for the code that is being posted.
On the toolchain front, work on gccrs, the GCC-based Rust compiler, has slowed significantly. The GCC code generator for rustc is showing better progress; it can compile kernel code now and has been merged into the compiler. This GCC-based backend will enable the expansion of Rust support to architectures that are not supported by the LLVM-based rustc. Meanwhile, the Rust project itself is increasing its involvement in this work; this is good, since the kernel has some unique requirements and will need guarantees that language changes won't break kernel code in the future.
Within the kernel, work is proceeding in a number of subsystems. The Rust
implementation of Android's binder is working well and its performance is
on a par with the C implementation. The amount of unsafe code that was
needed to get there was pleasingly small. Filesystem bindings are the
subject of work by Wedson Almeida Filho, who is targeting read-only support
for now. The object there is to make it possible to implement a filesystem
in 100% safe Rust.
In general, he is finding an increasing number of maintainers who are open to the idea of using Rust. That leads to an issue the Rust developers have run up against, though. It would be good to have some reference drivers in the kernel as an example of how drivers can be written and to make it possible to compare Rust and C drivers. The best way to do that often seems to be to merge a Rust driver that duplicates the functionality of an existing C driver — but that sort of duplicate functionality is not welcomed by maintainers. Perhaps, he said, it would be good to allow a few duplicate drivers that are not meant for actual use, but only as examples for other developers to use.
There are some other challenges; upstreaming the block-layer abstractions has run into some resistance. Virtual filesystem layer maintainer Christian Brauner said that he is not opposed to merging those abstractions, but he would rather not do that and see filesystems built on it right away. He would prefer to see an implementation of something relatively simple, along the lines of the binder driver, to show that things work as expected.
A driver soon?
Dave Airlie, the maintainer of the DRM (graphics) subsystem, said that, if he has his way, there will be a Rust DRM driver merged within the next couple of releases. Christoph Hellwig shot back that Airlie was willing to "make everybody's life hell" so that he could play with his favorite toy. Merging Rust, Hellwig said, would force others to deal with a second language, new toolchains, and "wrappers with weird semantics". Dan Williams said that the current situation "is what success looks like", and that the kernel community was already committed to Rust.
Airlie continued that a lot of the Rust work is currently blocked in a sort of chicken-and-egg problem. Abstractions cannot be merged until there is a user for them, but the code needing those abstractions is blocked waiting for code to land in multiple subsystems. As a result, developers working on Rust are dragging around large stacks of patches that they need to get their code to work. Breaking that roadblock will require letting in some abstractions without immediate users. Ojeda agreed that this problem has been slowing progress, but said he has tried not to put pressure on maintainers to merge code quickly. In the case of networking, ironically, the Rust developers had to ask the networking maintainers to slow down merging Rust code.
The conversation took several directions from there. Greg Kroah-Hartman said that merging the binder driver would be a good next step; it is self-contained, has a single user that is committed to its maintenance, and doesn't touch the rest of the kernel. Kees Cook disputed the description of Rust as a "toy", saying that there is a lot of pressure to not use C for new code; Hellwig responded that the developers would have to rewrite everything in Rust, otherwise the resulting dual-language code base would be worse than what exists now.
Dave Chinner worried that maintainers lack the expertise to properly review the abstractions that are being merged. Airlie replied that maintainers merge a lot of C APIs now without really understanding how they work. A lot of mistakes have been made in the process, but "we're still here". When things turn out to be broken, they can be fixed, and that will happen more quickly if the code goes upstream.
Ted Ts'o expressed concern about the burden that adding Rust code will place on maintainers. The Rust developers are setting higher standards than have been set in the past, he said. Getting good abstractions merged is one thing, but who is responsible for reviewing drivers, and how will tree-wide changes be handled? The Rust effort, he said, is getting to a point where it is impacting a growing part of the community.
Williams pointed out that the previous session had discussed how hard it is to get kernel subsystems to move to new APIs; now, he said, there is talk of moving to a whole new language. Hellwig said that the real problem is that the Rust bindings tend to work differently than the C APIs they provide abstractions for; the new APIs may well be better, but they are still completely new APIs. What should be done, he said, is to first fix the C APIs so that they are directly usable by Rust code. He proposed that, for each subsystem that is considering bringing in Rust code, a year or two should first be spent on cleaning up its APIs along those lines. Ojeda said that this kind of API improvement has already happened in some subsystems.
Linus Torvalds said that he was seeing a divide between the filesystem and driver maintainers. Developers on the filesystem side tend to be more conservative, while the driver world "is the wild west". Driver authors tend not to understand concurrency, he said, and a lot of the code there is broken and unfixable. So it is unsurprising that there is interest in bringing in a language that better supports the writing of correct and safe code.
Brauner said that Rust can help with a lot of problems, since the compiler can keep a lot of bugs from making it into the kernel. But he worried about whether there would be maintainer and development support for it a few years from now. Airlie again mentioned developers with out-of-tree code needed by Rust code; Cook answered that the people shepherding that code are maintainers, and that bringing it in would bring the maintainers with it. Airlie added that those maintainers are the sort of younger developers that the kernel community would like to attract.
Chinner said that he would like to see a reimplementation of the ext2 filesystem in Rust. It is a complete filesystem that makes wide use of the kernel's APIs, but it is still small enough to read and understand. If the Rust APIs can support an ext2 implementation, they will be enough to implement others as well. Meanwhile, the ext2 implementation would be a good reference for maintainers, who could compare it to the C version.
Confidence
Ts'o asked when the community would feel enough confidence that it could have modules where the only implementation is in Rust. Binder could be a good start, he said, perhaps followed by a driver that sees wider use. Airlie said that he is considering a virtual graphics driver that reimplements an existing C driver. There is also the driver for Apple M1 GPUs. He is feeling a fair amount of pressure to get it upstream and is wondering if there is any reason why he should keep it out. After that, he would love to see a rewrite of the Nouveau driver for NVIDIA GPUs.
Arnd Bergmann said those drivers could be OK, but that it will be quite a bit longer before something like a keyboard driver could be merged; the toolchain just isn't ready, he said, for a driver that would be widely used. That led to a question about the frequent compiler-version upgrades being seen in the kernel, which moved to Rust 1.73.0 for 6.7. Ojeda said that the upgrade process will eventually stop, and a minimum Rust version will be set, once the important features that the kernel depends on have stabilized. He has been working to get the kernel code into the Rust continuous-integration tests to help ensure that it continues working as the compiler and language evolve.
Bergmann said that he didn't plan to look seriously at the language until it could be compiled with GCC. Torvalds answered that, while he used to find problems in the LLVM Clang compiler, now he's more likely to find problems with GCC instead; he now builds with Clang. Ojeda said that he is working on finding developer resources for gccrs; the project is currently sitting on over 800 out-of-tree patches and still has a lot of work to do on top of that. GCC support will be a while, he said.
Ts'o complained that the language still isn't entirely stable. This could be a particular problem for the confidential-computing community; they are concerned about security and, as a consequence, about backports to long-term-support kernels. But if those kernels are on different Rust versions, those backports will be problematic. Ojeda said that, while it is a "crazy idea", backporting the version upgrades could be considered. He doesn't think that the change rate will be high enough to be a problem, though.
At the conclusion, Torvalds pointed out that there have been problems over the years with GCC changes breaking the kernel; the same will surely happen with Rust, but it will be dealt with in the same way. The session, well over time, was brought to a halt at this point. Whether the kernel community has truly concluded that it is committed to Rust remains to be seen; there will almost certainly be pull requests adding significant Rust code in the near future.
Reducing kernel-maintainer burnout
Overstressed maintainers are a constant topic of conversation throughout the open-source community. Kernel maintainers have been complaining more loudly than usual recently about overwork and stress. The problems that maintainers are facing are clear; what to do about them is rather less so. A session at the 2023 Maintainers Summit took up the topic yet again with the hope of finding some solutions; there may be answers, perhaps even within the kernel community, but a general solution still seems distant.

Ted Ts'o started off the session by saying that kernel maintainers end up having to do all of the tasks that nobody else working on a given subsystem wants to take on. These can include patch review, release engineering, testing, and responding to security reports. The expectations placed on maintainers have gone up over time, and kernel maintainers are feeling the pressure as a result.
Darrick Wong, Ts'o continued, broke
down the aspects of the maintainer job nicely when he stepped
down as the XFS maintainer. Ts'o is uncertain, though, about how well
that has worked to get others to step up and tackle some of those jobs.
Martin Petersen asserted that the real problem is that people are sending too many patches, but Dave Airlie strongly disagreed. As the DRM subsystem maintainer, he is "processing more changes than anybody" without having to touch a single patch. The way to handle this problem is to build up a structure with people who are able to take on the various tasks needed. The filesystem layer, he said, is more important than graphics; why doesn't it have more people than the DRM layer does?
Josef Bacik answered that building up that structure is hard; he has been trying with various people for the last three years. One developer simply couldn't do the work; another was unable to bring things to a conclusion. The filesystem problem space is complicated, he said, and finding people who can work in that area is hard.
Steve Rostedt said that part of the problem may be documentation; when he runs into bugs, he can't find documents describing how things work. Christoph Hellwig suggested writing down problems as they are encountered. Christian Brauner said that there is extensive documentation about the filesystem layers, but it tends to be hard to understand.
Airlie said that the trick is for maintainers to leave voids for others to fill. Thomas Gleixner, though, brought up the example of the generic interrupt subsystem. There is currently one person maintaining it, even though "if it breaks, the whole world breaks". There are a lot of people sending patches, but nobody showing any desire to maintain it. Airlie said that, if there are 100 people sending patches, there may be five who can be convinced to help maintain the subsystem; Gleixner answered that he sees a lot of "random drive-by names" that clearly have no intention of sticking around.
The need for reviewer help in particular came up; Linus Torvalds jumped in to say that reviewing is boring, so it is unsurprising that people don't want to do it. He keeps seeing huge patch sets being sent out once each week with little seeming to happen with them. A few tweaks, perhaps, before the next version is sent, but no real resolution. What is really needed, he said, is to find ways to get away from the email patch model, which is not really working anymore. He feels that way now, even though he is "an old-school email person".
Bacik said that the Btrfs developers are using GitHub to track things; it is good at showing the work that is outstanding, so he can see what has been languishing. That has improved the subsystem's throughput significantly. There are, he said, tools out there that "make the tasks we hate easier", and that every other project uses to get their work done. Gleixner, though, expressed skepticism that adopting another tool would solve the problem. He has a patch-tracking system that works well enough, he said; the real solution is to teach managers that, with proper engineering, work gets done sooner (and life for maintainers is easier).
Ts'o said that he never knows how long a patch submitter will be around, so it's never clear whether time spent to educate them will be worthwhile. He also said that, while asking submitters to fix existing technical debt in order to get their work merged is asking too much, maintainers can take a stand against adding more debt. Hellwig said that a developer trying to contribute code is often the best opportunity to get some cleanup done. Bacik said that the Btrfs community has been guiding developers that way for a long time and has learned how to do it well; he admitted that he should be writing their approach down. Maintainers should, he said, take a bigger role in teaching others.
Gleixner said that a lot of useful information for contributors has indeed been written down. Dave Chinner, though, worried that pointing contributors to that documentation can come across as an impersonal brush-off. That is why he often takes the time to write a long response to contributors needing guidance. Rostedt said that today's developers have different needs. He asked how many in the room had started kernel work because their employers had told them to; no hands were raised. That is not the case for many of today's new developers, he said.
"Being a maintainer is a part of our identity", Airlie said; it is likely how we got our current job and is not something that we readily let go of. Brauner added that people tend to hold onto power for dear life. Wolfram Sang said that he likes reviewing, but can get no support for doing that work; Dan Williams said that developers tend not to understand just how much social capital they can get from doing good reviews. Bacik said that the group was taking an overly simple view of the reviewing task; many developers hesitate to do reviews because they don't want to be seen as having missed something if a bug turns up. It was widely agreed that nobody should feel that way, since no one can be expected to catch everything. How to communicate that to the community as a whole is unclear, though.
Torvalds said that a Reviewed-by tag mainly means that the reviewer will be copied on any bug reports; developers should add those tags liberally, he said. Gleixner added that maintainers make fools of themselves every other day. Hellwig said that he has been trying to review code outside of his comfort area; it takes a couple of times before he feels that he understands well enough to offer a Reviewed-by tag. Rostedt, though, raised the issue of bare Reviewed-by tags offered without any discussion, which can be a sign of somebody trying to game the system and get into the statistics. Bacik said that, if the maintainer does not know the reviewer, their Reviewed-by tag means nothing.
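For readers unfamiliar with the convention under discussion: a Reviewed-by tag is simply a trailer line added at the end of a patch's changelog, alongside the Signed-off-by line, as described in the kernel's submitting-patches documentation. A typical trailer block (the names and addresses here are made up) looks like:

```
Fix error handling in the frobnication path

[... patch description ...]

Signed-off-by: Ann Author <ann@example.com>
Reviewed-by: Rex Reviewer <rex@example.com>
```

Tools like git and b4 collect these trailers when patches are applied, which is how reviewers end up both in the statistics and on the recipient list for subsequent bug reports.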
Torvalds said that some subsystems are setting their requirements for contributors too high, making it hard for new developers to come in. Chinner added that the kernel's culture can be off-putting and not inclusive, making people fight to get their changes in. Bacik agreed, saying that there is no arbiter in the community; he said that Torvalds wants developers to figure things out for themselves, so disagreements over changes often end up as big battles. He would like to move to a system that is more encouraging of efforts to find solutions.
Torvalds said that, while the community gets a lot of new contributors, it doesn't tend to get many new maintainers. The contributor and maintainer roles should be separated, he said. Chinner said that becoming a maintainer is often seen as a promotion for developers who do good work; Torvalds answered that people often see maintainers as some sort of "super developer", but they are really just managers. He took a moment to thank Konstantin Ryabitsev for the b4 tool, which has made life much easier for maintainers; the attendees responded with applause.
As this part of the session came to a close, Williams said that part of the pay for reviewing work is autonomy within a subsystem, but that the community doesn't actually provide that autonomy. Instead, maintainers hold onto all of the decision power. Airlie answered that the DRM subsystem has done well with distributing that power among a number of developers.
A support group
A related session was run by Rostedt, who started by saying that he has
heard a lot of maintainers and developers complaining about burnout. There
are many things that could be done about this problem, but often all that a
tired developer really needs is somebody to talk to. He is proposing the
creation of a list of developers who are willing to lend an ear when the
need arises. These developers would have no power, they would just be
there to provide support and advice when a problem arises.
Torvalds answered that if he wanted to talk to somebody, he wouldn't go to a kernel developer. Bacik, though, said that he is willing to do some basic support work. He can talk well with developers who are at the same level in the community, but his ability to get others to listen to him is not great. He suggested that Torvalds should take less of a laissez-faire approach to the development community and help solve problems more often.
Chinner asked what problem Rostedt was trying to solve; Rostedt answered that many developers feel isolated and that they could benefit from a support group, but they don't know who to talk to. Bacik said that, with the developers he works with, he knows that problems can be worked out. But perhaps developers who lack that assurance could use some support.
Williams asked whether developers' inability to see each other in person for a couple of years had contributed to problems; many people seemed to think that it did.
At the end of this session, Torvalds said that about half of the emails he receives are private, rather than copied to the mailing lists. Developers are always welcome to send him a note when they are having problems; he has often had long discussions with developers about conflicts. Ts'o said that individual subsystems often have a decision maker who can bring conflicts to an end, resolving disputes by decree if they have to. The community lacks that resource for cross-subsystem issues, though.
The next step will be for Rostedt to propose an addition to the kernel's process documentation describing this support group.
A discussion on kernel-maintainer pain points
A regular feature of the Kernel Maintainers Summit is a session where Linus Torvalds discusses the problems that he has been encountering. In recent years, though, there have been relatively few of those problems, so this year he turned things around a bit by asking the community what problems it was seeing instead. He then addressed them at the Summit in a session covering aspects of the development community, including feedback to maintainers, diversity (or the lack thereof), and more.

The first question he mentioned was a suggestion that, because he does test builds after acting on pull requests, he is showing a distrust of his maintainers. These builds slowed the process down during the 6.7 merge window, when Torvalds was traveling and doing the builds on a laptop. He answered that he normally does a build after each pull just as a part of his normal workflow. It is not a matter of not trusting maintainers — though he does also like to verify that everything is OK. He also does builds to confirm any conflict resolutions he may have had to do.
Another question had to do with Torvalds's tendency to give mostly negative
feedback. Maintainers quickly learn that, if Torvalds responds to a
pull-request email, it is usually bad news. He acknowledged that he tends
to operate that way, and said that he is not proud of it. In general, he
tries to avoid answering email if he doesn't have to. So if a pull happens
without problems, he is happy and wants "to say 'I love you'", but he
doesn't act on that. As a result, most of his emails are about problems.
He did not say that he would try to change that pattern.
The session was interrupted by a break at this point; on return, Torvalds said that he is quite happy in general. He was in Hawaii for the first week of the merge window, which might ordinarily make things harder. But he got a lot of pull requests early and, despite the 6.7 merge window bringing in the most commits of any merge window in the project's history, it was "pain-free". There was not a single case of a change breaking his machine, which is rare.
Another question to Torvalds mentioned that the maintainer tree is quite flat, meaning that he pulls directly from a large number of trees rather than from intermediate maintainers who coalesce pulls from multiple subsystems. He agreed that the tree is quite flat. Sometimes that is by his request; there have been cases where having code go through intermediate maintainers has made things more complicated. But the flatness also, he said, suggests that some maintainers don't have the support that they should and are solely responsible for getting work from contributors into the mainline. Some top-level maintainers do far too much work, he said; they should find ways to delegate some of that work to others.
There was, he said, a complaint that Thorsten Leemhuis, who has taken on the task of tracking regressions, is pushing maintainers to get those regressions fixed. Torvalds found that surprising; he loves having somebody staying on top of problems that way. He can see that it can be annoying to some maintainers, he said, but if Leemhuis weren't doing that job, he would be doing it himself instead.
Another question had to do with the "bus factor" of having Torvalds in charge of the whole community; what would happen if he were suddenly unable or unwilling to do that work? Torvalds said that things are working so well in the kernel community that his disappearance would be "a momentary distraction"; there would be "some infighting" as the new order was worked out, then work would continue as always. He said that maintainers should talk to him, though, if they think he should be doing his work differently.
He noted that the kernel community as a whole has hit a plateau in the last five years; the patch volume and number of developers are not growing as they once did. That may suggest, he said, that the community has hit the limits of how far it can scale. Adding more layers of maintainers would help, he said, but it also would not solve all of the issues that are impeding further growth.
Diversity
The last question to be addressed had to do with the gender imbalance in the community and at the Maintainers Summit (which was 100% male) specifically. Torvalds agreed that the situation was not good and "not going in the right direction", then quickly moved on. Dan Williams returned the discussion to this topic a bit later, noting that he had recently had a discussion with the head of the OpenJS Foundation. When she joined, she was the only female member of the board; now it is much more balanced. She told Williams that the change was effected through direct outreach to potential members; hoping that the problem would get better on its own was not good enough.
Williams noted that the 2023 election for members of the Linux Foundation Technical Advisory Board was uncontested; only the incumbents ran to keep their seats. That suggests, he said, that the community is missing chances to reach out to people. Torvalds said that Greg Kroah-Hartman used to do that sort of outreach; Kroah-Hartman, in turn, said that Shuah Khan at the Linux Foundation is doing a good job of bringing in interns. The problem is that, after learning how to do kernel development, they disappear into companies and are never seen again.
Williams said that he went to a recent Black Is Tech conference, which featured a slate of all Black developers. This would have been a good event for outreach, but nobody was there to recruit for Linux. Torvalds said that the maintainers in the room were not the best people to be doing outreach. Kroah-Hartman again mentioned programs like Outreachy and the Google Summer of Code, which do well at reaching out to potential developers, but which mostly end up providing employees that disappear into companies. Dave Airlie said that he has been able to get a couple of Outreachy interns working on graphics into community-oriented jobs, and they are still contributing.
Steve Rostedt said that one problem has to do with the demands on women who are successful in the kernel community; they are quickly "asked to join everything" and it burns them out. Ted Ts'o suggested thinking about the different points in the pipeline; people invited to join panels tend to be mid-level developers, but people are dropping out at all levels, suggesting that there is a wider problem. Developers at different levels have different needs, he said. Josef Bacik agreed that the community relies too heavily on the few women that it has; he cited one developer who got burned out and now prefers to just focus on one area and avoids the community.
Thomas Gleixner pointed to one end of the pipeline by saying that there are few women in computer-science programs; Airlie said that is true of undergraduate programs, but there are more women doing postgraduate work. Christoph Hellwig said that he sees more women at academic events than at community events.
Bacik said that he does a lot of recruiting for his employer; he goes to events like the Grace Hopper Celebration as a way of finding good candidates. The Linux Foundation, he said, could send people to events like this to let developers know that the Linux community exists.
Konstantin Ryabitsev said that there are good reasons why developers disappear into companies. It is often the only path available; the Linux Foundation is unable to hire them (it employs few developers in general). Not everybody is able to sacrifice their evenings and weekends to do community work, he said. Hellwig suggested looking harder outside of the US and Europe; there are far more women in engineering elsewhere. Discussion on this topic ended with a suggestion from Ts'o to survey Outreachy interns a couple of years after they complete the program and see if they are still working in tech. If not, it would be good to know why; for now, he said, we are only guessing.
With that, the session (and the Maintainers Summit as a whole) came to an end; the attendees filed off for the obligatory group photo before taking some much-needed rest.
Page editor: Jonathan Corbet