Reports from OSPM 2023, part 3
Proxy execution
Author: John Stultz (video)The last day of OSPM began with a talk and discussion on proxy execution, a generalized form of priority inheritance. Proxy execution is an idea that has been worked on for a number of years by several core kernel contributors without getting much traction.
Android doesn't use realtime scheduling for applications, but it still needs to prioritize foreground applications over background applications. This is usually done via control-group mechanisms to restrict background tasks, allowing foreground applications to get more CPU time. Unfortunately, as is commonly seen with realtime priorities, Android devices frequently experience priority inversion. Low-priority background tasks may occasionally grab a mutex that an important foreground task requires, but the background task is restricted from running to release the lock.
The classic solution to priority inversion is priority inheritance via rt_mutexes, but this doesn't help, because the completely fair scheduler (CFS) doesn't choose which task to run based on strict priority order, as is done with realtime scheduling; the same problem appears in SCHED_DEADLINE as well, where deadlines instead of priorities would need to be inherited. So a more general form of priority inheritance is needed; that is what proxy execution provides.
The simple idea for proxy execution is that one can treat the scheduler's task-selection algorithm as a black box, and let it select the most important task at any time to run. But instead of just choosing from the currently runnable tasks, we want it to also consider mutex-blocked tasks when selecting a task to run. Then, should the task it selects be blocked on a mutex, we instead select the mutex owner to run as the selected task's "proxy", but run it with the entire scheduler context of the selected task.
This elegant and simple idea, unfortunately, has a number of edge cases that needed to be resolved for it to work. As the complexity grows, this idea of simply running a lock-holding task on behalf of a blocked task starts to break down. It begins to make sense that implementing this idea has stymied previous developers, and one might wonder if it's worth it.
However, the reason the Android team is continuing to push this effort is because the early results seen with it are favorable, as it avoids the long-tail outlier latencies caused by priority inversion, giving more deterministic behavior for foreground applications.
To spur discussion, I covered a number of risks and issues, some recently addressed and others still to be resolved, as well as some half-formed ideas for avoiding some of the complex migrations.
Peter Zijlstra mentioned that work on split reservations in SCHED_DEADLINE, as well as on CFS deadline servers parallels the splitting of the kernel's task data into separate scheduler and execution contexts for proxy execution, so there is a potential for sharing some logic and cleanup there. Similarly, around the issues of pick_next_task() having side effects, the task returned is what is run, so side effects don't matter. Zijlstra noted similar changes are needed for core scheduling to separate pick_next_task() and set_next_task().
Joel Fernandes and Steven Rostedt also contributed ideas to the discussion around how we might avoid migrations as well as how we might amortize task-tree migrations so it isn't a big cost all at once.
Additionally, for the concern about optimistic spinning being disabled and impacting performance, Zijlstra suggested that we might be able to avoid the complexity of blocked-task migration in the case where the lock holder is actively running on a CPU, restoring optimistic spinning in that case. Zijlstra also pushed to get actual numbers on how costly the additional migration and selection retries are so we have a real sense of the concern rather than just a theoretical one. This seemed to align with some of the SCHED_DEADLINE bandwidth-inheritance efforts as well. Dario Faggioli pointed out that the spinning was actually necessary for some of the EDF research algorithms, causing the CPU time of the waiting task to be consumed so that later deadlines aren't affected. The implications of this were unclear to me, so follow up discussions were planned for a later time.
(See also: this article on proxy execution.)
Sched_ext: Pluggable scheduling with BPF
Author: David Vernet (video)Sched_ext is a new extensible scheduling class that allows system-wide scheduling policies to be implemented in BPF. The talk began by introducing sched_ext at a high level, and enumerating the reasons that it was built. These reasons include rapid and safe experimentation, bespoke scheduling policies for individual applications, enabling quick rollouts of Spectre mitigations (such as for L1TF, where the CPU scheduler can help), and enabling some complexity to be moved to user space. For Meta, sched_ext has allowed experiments with different features, some of which will soon be sent upstream for inclusion in CFS. Additionally, it has allowed Meta to build bespoke schedulers that have resulted in multi-percent improvements in throughput and p99 latency for its main web workloads.
At this point, Zijlstra pointed out that most Spectre mitigations had nothing to do with the CPU scheduler, and questioned the value of the "quick rollout" mechanism. It was clarified that the intention was that it could be beneficial if and when any vulnerabilities are discovered in the future that can be mitigated with scheduling. Another attendee asked why Meta was interested in upstreaming to CFS if it could just run its own schedulers. The response was that Meta is an upstream-first company, with few internal patches used in its deployed kernel. Additionally, many of the kernel engineers at Meta (including Chris Mason, who was in attendance) are long-time maintainers or contributors in core subsystems, and believe strongly in the general benefits of open source.
The talk continued with an overview of the interface of sched_ext, describing the struct sched_ext_ops callback mechanism and the dispatch queue (DSQ) abstraction, both of which are described in the LWN article linked above.
Zijlstra said that sched_ext was just a worse version of the debugfs scheduler knobs, in that vendors such as Oracle would do nonstandard things such as shipping their own scheduler rather than using CFS. Mason responded that, if a scheduler had a feature which provided compelling performance, and if that feature was missing from CFS, then we would simply have to figure out how to enable it in CFS so that it could be used without sacrificing performance. Steven Rostedt pointed out that sched_ext is a different proposal than the debugfs knobs because sched_ext sits below CFS, and therefore wouldn't impose any maintainership burden on Zijlstra.
As the presentation continued, there were several other concerns raised by the audience. Thomas Gleixner claimed that using BPF could expose the scheduler to user-space-API stability constraints due to the BPF scheduler program exposing interfaces to its user-space counterpart. It was pointed out in response that sched_ext uses the struct_ops interface, which is purely internal to the kernel and provides no ABI guarantees. This point echoes the sentiment shared by Linus Torvalds at the 2022 Kernel Maintainers Summit.
Another concern was that sched_ext would discourage upstream contributions to CFS. Sched_ext intends to try and mitigate this possibility by encouraging users to upstream their schedulers in similar fashion to kernel modules. Schedulers that are out-of-tree may break any time if the scheduler's struct_ops interface changes, whereas in-tree schedulers will be updated to avoid build and performance regressions. The bar for upstreaming sched_ext schedulers can also be lower than for CFS changes. Upstreaming to CFS requires carefully integrating a feature into a complex code base and ensuring that it doesn't regress any of the existing features. With sched_ext, developers can upstream their BPF schedulers, and then add their features to CFS at a later time once they are proven to be worth the complexity and maintainership cost. Finally, it was pointed out that all sched_ext programs must be GPLv2, and will be rejected by the verifier otherwise.
The talk concluded with everyone going out to enjoy an espresso.
A push/pull model for timers
Author: Anna-Maria Behnsen (video)When a timer expires, it is better to run its handler function on an active CPU rather than waking an idle CPU for this purpose. The current approach to prevent handling of timer-wheel timers on idle CPUs uses heuristics at enqueue time. The heuristic calculates the best CPU, which hopefully will not be idle when the timer expires. It is the so-called "push model" — push the timer to another CPU at enqueue time. This model has two problems: heuristics are not reliable, and most timer-wheel timers are canceled or rearmed before they expire. This means wasting cycles during the enqueue operation to generate a possibly incorrect solution that has a high probability of not being used anyway.
The future state would be a pull model. To support this model, a hierarchy is created at boot time as CPUs are brought online. When a CPU goes idle, all timers that are not pinned to that CPU (and are not deferrable) will be enqueued into the hierarchy. The hierarchy is used to build groups where a single CPU acts as migrator and ensures that enqueued timers are expired in time and are "pulled" onto a busy CPU.
Testing is an important step to validate this approach. Several people volunteered to integrate the sixth version (now seventh) of the implementation into their test environment.
A discussion about scheduler behavior was introduced by the audience as well as by the speaker; when a timer expires on a remote CPU, the CPU where the timer was started originally gets woken up. This is caused by the completely fair scheduler, as it assumes that the context is still cache-hot and, for performance reasons, the CPU is woken up again instead of reloading the context on the actually busy CPU. The result during the discussion was that this behavior could be changed and patch proposals are welcome.
There is still an open point in the new approach about handling the deferrable timers, most of which could be mapped onto non-pinned timers, but those that also contain the "pinned" flag beside the "deferrable" flag, could not simply be mapped. Some people in the audience offered support to get rid of those five last users, which are mainly users of deferrable work.
Dynamic Energy Model
Author: Lukasz Luba (video)This talk was about four new features that are going to be part of the energy model (EM) framework. These features form a set and have been called the "dynamic energy model". The first feature allows users to modify the model values at run time to better reflect the current hardware; an example would be to reflect increased static power (leakage) due to higher temperature of the SoC in gaming workloads. The patch set is available on the mailing list. The feature was presented at the 2022 Linux Plumbers Conference, but this time some plots have been added showing how the CPU power can increase over time.
The second feature aims to improve the simulation performed in the energy-aware scheduler (EAS) while doing the task placement during wakeup. The hardware dependency issue described in the talk creates a situation where the energy used for performing the computation for two tiny tasks can be far higher than the simulation in the kernel. It is due to shared voltage and frequency domain between the little cores in a big.LITTLE system and the dynamic shared unit (DSU), which contains the L3 cache. They both can perform dynamic voltage and frequency scaling (DVFS) to save power, but L3 cache is an important factor in three-gear SoCs — big.LITTLE systems with a third performance level provided by an even bigger "big" CPU.
The performance (latency and throughput) of L3 cache for big CPUs is crucial. To maintain that performance, the system may increase the frequency (and voltage) for the domain containing the little CPUs and the DSU, even when the little CPUs don't require that. This behavior of the hardware is not modeled in the current energy model, but the new feature is going to change that. The results from a simple experiment showed power saving for two tiny tasks of ~50% in that specific scenario, which was in total ~35mW. According to little's energy model, ~40mW means an average power of one CPU core running at 1GHz, fully loaded, so it shouldn't be ignored.
The third feature was about modeling CPU wakeup cost from deep idle state. Currently, the EAS doesn't take into account if the CPU is in a deeper idle state and if it is worth waking it up for a small task. This wakeup energy cost might not be that small for the big CPU and it might be better to find another CPU that could handle the small task.
The last feature was about introducing a new field in the energy model that reflects the performance of the CPU for each frequency. The performance of the CPU might not be linear with the frequency. It can also vary for different workloads (e.g. integer vs. floating-point heavy). That doesn't fit well with the current software model. Luba also showed slides with power variation for a CPU core running at fixed frequency. The power varied for different benchmarks.
Those two pieces of information create a variety of power vs. performance curves for the CPU. This has been called the "power and performance profile", and there can be many of those. There was also a reference to Morten Rasmussen's talk at the 2022 Linux Plumbers Conference, when he showed four energy models with those different characteristics. With the run-time modifications to the energy model, it would be possible to plug in a new power and performance profile for a long-running application (like video conferencing) to improve decisions in EAS, better utilize the hardware, and save the battery.
Eco-friendly Linux kernel development: minimizing energy consumption during CI/CD
Author: Andrea Righi (video)In this talk, Righi presented KernelCraft (since renamed to virtme-ng), a tool that allows users to quickly and efficiently build and run tests with custom kernels.
Back in the old days, Righi was testing kernels on the same PC he was using for development, which resulted in multiple filesystem corruptions and lost work. With virtualization, he was able to test kernels more efficiently, but redeploying and reconfiguring a corrupted VM was still not ideal. Righi then discovered virtme (a tool written by Andy Lutomirski), that allows one to run a kernel virtualizing the entire filesystem of the host (exported read-only using the 9p filesystem). In this way, it is possible to avoid the hassle of re-deploying the testing VM if something goes bad. Unfortunately, virtme is unmaintained at the moment, so Righi decided to fork the project and create KernelCraft.
This tool generates a minimal kernel configuration that includes only the essential support required to run the kernel inside a QEMU instance. The tool then builds the kernel and uses virtme to create a live copy-on-write snapshot of the host system (using overlayfs with tmpfs to handle writes inside the guest).
Zijlstra suggested taking a look at the kvm_guest.config make target in the kernel, while Steven Rostedt suggested taking a look at ktest.pl in the kernel source tree for generating a minimal .config.
According to the tests, Righi is able to generate a kernel from scratch, run a simple uname -r inside it, and collect the result on the host in only 82 seconds, compared to over an hour using a typical compile-deploy-test workflow with an Ubuntu kernel; the amount of energy saved in the process is around a factor of 10x.
In conclusion, this tool has proved to be useful in a CI/CD
scenario or for conducting kernel bisects, as it allows significant time and
energy savings. Moreover, with this talk, Righi aimed to share the tools and
environment that he is using in his kernel development and debugging activity.
Sharing our respective workflows can be extremely useful, as even seemingly
minor details can help or inspire others to become more involved in kernel
development.
| Index entries for this article | |
|---|---|
| Conference | OS-Directed Power-Management Summit/2023 |
