Reports from OSPM 2023, part 1
Reports from day 2 are also available.
Improving system pressure on CPU-capacity feedback to the scheduler
Author: Vincent Guittot (video)
The factors that can impact the compute capacity of CPUs can be split into two types: the first impacts the performance of the CPU by acting on its frequency, whereas the second reduces the number of cycles available to tasks. The session was about the first type of pressure, for which we can distinguish four kinds of impact: hardware mitigation, which can act at kHz frequencies; firmware/kernel mitigation, which can change with a period of 10 to 100ms; capping of the power budget provided to the CPUs; and action by user space, which happens at an even lower frequency, with a multi-second period.
The current implementation of thermal pressure in the kernel uses one input that is then used for PELT filtering or to estimate the actual maximum performance. Although we have only two users of this input (the Qualcomm limits management hardware and the CPU-frequency cooling driver), other parts also cap or boost the maximum frequency. In the end we have one interface for three different behaviors: instantaneous, average, and permanent pressure ("permanent" meaning that it applies for one second or more, which can be seen as permanent from the scheduler's point of view).
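As a rough illustration (a simplified sketch with a made-up helper name, not the actual kernel code), the single input boils down to turning a capped frequency into an amount of lost capacity:

    /*
     * Simplified sketch: the capped frequency is scaled into a capacity,
     * and the pressure reported to the scheduler is the capacity lost
     * relative to the CPU's original maximum.
     */
    static unsigned long pressure_from_cap(unsigned long max_capacity,
                                           unsigned long capped_freq,
                                           unsigned long max_freq)
    {
        unsigned long capped_capacity = max_capacity * capped_freq / max_freq;

        return max_capacity - capped_capacity;
    }

That one value is then either used as-is (instantaneous), run through PELT filtering (average), or left in place for a second or more (permanent), which is where the one-interface-for-three-behaviors problem comes from.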
The high-frequency pressure should keep the current interface. The medium-pace pressure needs a new interface able to handle several inputs submitted by firmware and/or the kernel. The user-space pressure, such as changing the power mode of a laptop when it is plugged into wall power, should trigger an update of the original CPU capacity in order to align it with the new maximum performance.
Keeping a gap indefinitely between the actual maximum capacity of a CPU and the value stored by the scheduler can impact the behavior of load balancing and task placement. It was asked whether such behavior could be related to the KVM patches that have been submitted to scale the frequency of the host CPUs with guest performance requests; this seems to be a different problem, as those patches want to apply the guest's request to vCPU threads.
While studying the current thermal-pressure implementation, I found strange behaviors in the energy model (EM) that need consolidation. The implementation uses the current maximum frequency and CPU-capacity pair to compute the energy cost, and they can change at run time (although not that often). The energy model only needs to save a coherent pair of frequency and capacity values that were used when building the model, but they don't have to be the current maximum ones.
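Conceptually, the estimate looks something like the following (a simplified sketch, not the kernel's fixed-point implementation):

    /*
     * Simplified view of the energy estimate: each performance state
     * gets a pre-computed cost derived from its power and a reference
     * maximum frequency, and the energy for a given utilization scales
     * that cost by how busy the CPU is relative to a reference capacity.
     */
    static unsigned long opp_cost(unsigned long power,
                                  unsigned long ref_max_freq,
                                  unsigned long freq)
    {
        return power * ref_max_freq / freq;
    }

    static unsigned long energy_estimate(unsigned long cost,
                                         unsigned long util,
                                         unsigned long ref_capacity)
    {
        return cost * util / ref_capacity;
    }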
The maximum compute capacity can then be changed without impacting or rebuilding the energy model. During some tests on an Arm system with a boost frequency, it was noticed that there can be a mismatch between the energy model and the CPU-frequency governor. It was suggested that energy-aware scheduling (EAS) should probably be disabled when boost frequencies are selected, because the kernel reverts to a performance-oriented mode when CPU utilization reaches 80%, but the tests showed that this case can still occur, and others suggested that the CPU might not yet be overutilized when tasks wake up.
The discussion then moved on to why the energy model is not used with Intel processors. Using the energy model there has already been suggested, and there seems to be no real blocker other than the lack of fixed operating performance points (OPPs), which has probably kept developers from pursuing it. There are patches adding support for an artificial energy model, which should address that concern.
The discussion moved back to the system-pressure proposal, which can create a situation where no CPU in the system appears to have 100% capacity. Although that should not be a major problem, the detailed impact of this new situation must be carefully studied. This new interface will replace the current thermal-pressure mechanism in the scheduler, which includes load balancing and EAS. It was asked whether the compute capacity of each OPP could be saved, since capacity can be non-linear with frequency, which violates the current assumption. The discussion ended with some suggestions about disabling EAS and the energy model at boot until everything has initialized, as is done for RCU, but the system doesn't really know when the CPU-frequency driver will be loaded and everything will be ready.
Utilization boosting
Author: Dietmar Eggemann (video)
Per-entity load tracking (PELT) is the Linux kernel's task-size tracker, where "size" can be measured as load (based on runnable time, both waiting and running, and on weight), utilization (based on running time), or runnable (based on runnable time). Utilization drives CPU-frequency selection and energy-aware scheduling (EAS) run-queue selection. PELT is still considered too slow to react to utilization changes while a task ramps up. With util_est (utilization estimation), which caches a task's maximum utilization before it sleeps for use at wakeup, and Uclamp (utilization clamping), which allows per-entity utilization tuning from user space, there are already mainline kernel features available to improve the situation.
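As a rough illustration of why ramp-up is slow (a toy model in floating point, not the kernel's fixed-point code), the PELT signal is a geometric average updated every millisecond, with a decay factor chosen so that the default half-life is 32ms:

    /*
     * Toy model of PELT's geometric averaging. y is chosen so that
     * y^32 = 0.5, giving the default 32ms half-life; a smaller
     * half-life makes the signal ramp up and decay faster.
     */
    static const double y = 0.97857206;     /* 2^(-1/32) */

    double pelt_update(double signal, unsigned int elapsed_ms, int running)
    {
        for (unsigned int i = 0; i < elapsed_ms; i++)
            signal = signal * y + (running ? 1.0 - y : 0.0);

        return signal;      /* approaches 1.0 for an always-running task */
    }

A newly busy task therefore needs several half-lives before its utilization approaches its real demand, which is the responsiveness problem the rest of the talk addressed.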
During the talk, additional ideas to improve the utilization ramp-up, which had appeared on the linux-kernel mailing list over the last eight months, were discussed. It started with the proposal to be able to change the PELT half-life at run time. Some Android systems use this to be able to run gaming workloads with a smaller PELT half-life value of 16ms or 8ms instead of the default 32ms. They claim better frames-per-second (FPS) rates with only moderate energy increase due to faster PELT signal rising and decaying.
The design was rejected because it isn't clear which specific problem it solves. Two mechanisms, util_est_boost and util_est_faster, were reviewed that improve the responsiveness of a single completely fair scheduler (CFS) task without boosting the whole system by changing the PELT half-life. The idea of using runnable alongside utilization came out of the util_est_faster discussion on the mailing list; a situation in which runnable is larger than utilization indicates run-queue contention, and using the greater of the two values helps boost CPU frequency in these scenarios. The UI benchmark JankbenchX on a Pixel 6 device has shown that this can significantly reduce jank frames (frames that don't meet the required rendering time).
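A minimal sketch of that idea (a hypothetical helper, not the actual patch) is simply to hand the larger of the two signals to the CPU-frequency governor:

    /*
     * Hypothetical helper: runnable exceeding utilization means tasks
     * are waiting on the run queue, so base the frequency request on
     * the larger of the two signals.
     */
    static unsigned long boosted_cpu_util(unsigned long util,
                                          unsigned long runnable)
    {
        return util > runnable ? util : runnable;
    }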
With API level 31, Android started to provide the Android Dynamic Performance Framework (ADPF) CPU-performance hints feature which allows per-task boosting using Uclamp.
As a stand-alone mainline kernel feature, boosting CPU utilization by runnable for faster CPU-frequency ramp-up can work alongside Android's CPU performance-hinting user-space boosting mechanism. It is unlikely that a run-time PELT half-life modifier will get into the mainline kernel, even though Android would like to have it as a tuning parameter for CPU performance and power management.
Results from using SCHED_DEADLINE for energy-aware optimization of RT DAGs on heterogeneous hardware
Author: Tommaso Cucinotta (video)
In this talk, some results from the AMPERE EU project were presented, including prototype modifications our research group made to SCHED_DEADLINE to accommodate the requirements of application use cases from the automotive and railway domains. In the project, we considered several reference platforms, including a Xilinx UltraScale+ board with FPGA acceleration and an NVIDIA Xavier AGX board with GPU acceleration. Throughout the three-year project, running from 2020 to date, we implemented a number of components, most of which were released under an open-source license.
We implemented PARTSim, a power-aware and thermal-aware realtime-systems simulator for big.LITTLE platforms, in which we modeled the variability in execution time and power consumption of a range of boards (ODROID-XU4, Raspberry Pi 4, and Xilinx UltraScale+ ZCU102), so that we can simulate the temporal behavior of applications on these boards under a variety of scheduler configurations. We also prototyped, in this simulator, big.LITTLE CBS, a SCHED_DEADLINE-like realtime scheduler that, on every task wakeup, is able to place the task on the core (among the one it woke up on, a core from the same island, or a core from another island) that is most convenient from a power-consumption perspective.
We prototyped APEDF, a variant of SCHED_DEADLINE providing adaptive, partitioned, earliest-deadline-first (EDF) scheduling of realtime tasks that requires no user-space API changes. APEDF can automatically partition SCHED_DEADLINE tasks among the CPUs using simple heuristics, consistently enforcing the EDF schedulability condition on every core and thus guaranteeing that no deadlines are missed when proper admission conditions are met. More importantly, APEDF can provide guarantees in the presence of DVFS, providing a sound and consistent foundation for the power-management logic in schedutil in the Linux kernel, which unfortunately is not very effective in the presence of tasks using the default global-EDF behavior of SCHED_DEADLINE. See the separate talk about "SCHED_DEADLINE meets DVFS" for more details and experimental results.
We developed some theoretical results about the schedulability of SCHED_DEADLINE tasks on big.LITTLE platforms, including enhanced admission tests that guarantee the ability to host the admitted workload under an APEDF scheduler, even in the presence of the dynamic task migrations that APEDF performs with a worst-fit placement, or under a big.LITTLE constant-bandwidth server (BL-CBS).
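As an illustration of the worst-fit placement mentioned above (assumed logic for illustration, not the actual APEDF patches), a waking task is placed on the CPU with the most spare bandwidth that can still accommodate it:

    /*
     * Assumed worst-fit heuristic: bandwidths are expressed in
     * per-mille of one CPU (values <= 1000); the task goes to the CPU
     * with the most free bandwidth, provided it fits there at all.
     */
    static int worst_fit_cpu(const unsigned int *allocated_bw, int nr_cpus,
                             unsigned int task_bw)
    {
        int best = -1;
        unsigned int best_free = 0;

        for (int cpu = 0; cpu < nr_cpus; cpu++) {
            unsigned int free = 1000 - allocated_bw[cpu];

            if (free >= task_bw && free > best_free) {
                best = cpu;
                best_free = free;
            }
        }

        return best;    /* -1 if the task fits nowhere */
    }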
We implemented ReTiF, a framework for declarative deployment of realtime tasks on multiple cores. It consists of an open-source daemon and an associated client library; clients ask to "go realtime" while providing diverse and heterogeneous information about their timing. For example, some tasks might request specific priorities, others might declare only their periods, and others might also declare their run times. ReTiF partitions the tasks among the available CPUs using different scheduling strategies as needed (e.g., POSIX fixed-priority, rate-monotonic, or SCHED_DEADLINE).
We also tackled the cumbersome problem of minimum-power configuration for embedded platforms with multi-core, DVFS-enabled, multi-capacity-core islands (e.g., Arm big.LITTLE), GPU acceleration, and/or FPGA acceleration, to deploy a set of complex realtime applications. The applications have been modeled as realtime directed acyclic graphs (DAGs) of computations, where some of the functions could be realized either in software running on a CPU, or as a kernel running on the GPU, or as a hardware IP deployed in FPGA slots, if available.
A few experimental results were shown, in which we optimized the deployment of randomly generated realtime DAGs with end-to-end deadlines and, optionally, FPGA-accelerated functions, minimizing the expected power consumption on a big.LITTLE ODROID-XU4 and a Xilinx UltraScale+ board. We used a mixed-integer quadratic constraint programming (MIQCP) optimization approach, showing how the software tasks were scheduled with SCHED_DEADLINE using periods automatically computed by the optimizer. We also showed that, by adding the secondary objective of maximizing the minimum relative slack, we could make the schedule more robust with the same power configuration, removing the deadline misses that occurred in a few runs.
We also reported that, during the experimentation, we needed to swap the order of deadline.c and rt.c among the scheduling classes in the Linux kernel, thus giving POSIX realtime tasks priority over deadline tasks. This was needed because, in order to use FPGA-accelerated functions, we needed an RPC-like interaction with a daemon, which in turn could trigger on-the-fly reprogramming of one of the available FPGA slots, making use of a reconfiguration kernel thread. Both these threads have low processing requirements, yet when they are activated, they need to complete as soon as possible, or the whole DAG gets delayed.
Modeling these interactions in the DAG topology would have been cumbersome, and scheduling these threads with SCHED_DEADLINE didn't seem viable due to the small implied run times and short deadlines (thus high worst-case utilization). Instead, it was more convenient to model them similarly to how we deal with interrupt drivers in this context, i.e. by considering a run-time overprovisioning factor that needs to be there anyway to account for external interference with the deployed realtime application DAGs. Interestingly, the possibility of swapping rt.c and deadline.c in the kernel, or even making that a tunable sysfs option, was discussed for other reasons in other talks throughout OSPM.
The overcommit bandwidth scheduler
Author: Steven Rostedt (video)
Rostedt started out by explaining the system layout of ChromeOS. Chromebooks are used for video games, video conferences, and many other use cases; there are millions of Chromebooks out in the world. The focus of ChromeOS is the Chrome browser, which is where most applications run, and there are thousands of threads handling all of the browser's services. The threads that are the focus of this talk are the renderers and compositors, as well as the threads that handle the user interface (UI). There are other threads of concern that handle things like garbage collection.
ChromeOS uses control groups (cgroups) to contain these threads and handle their priorities. There is a render cgroup with two child cgroups, foreground and background, and there is also a UI cgroup. The render foreground and UI cgroups are the high-priority ones. Although the high-priority tasks live in these cgroups, they may share them with other, related threads that are not of high priority themselves.
There's an issue with cgroups, though: the priorities of one cgroup are not acknowledged by another. Cgroups are like tasks in that even a low-priority cgroup will get a share of the CPU; if a low-priority cgroup is running and a high-priority task in another cgroup wakes up, that task has to wait for the other cgroup to finish. The Linux kernel's SCHED_OTHER scheduler (the completely fair scheduler, or CFS) is, according to Rostedt, "too fair". Rostedt created a test running spinners that each update a counter to see how much they ran. With one thread at nice value -20 (high priority) and one thread at the normal nice value of 0, the high-priority thread ran 87% of the time. When he added ten threads with a nice value of 0, the high-priority thread ran only 40% of the time; with 1,000 threads at the normal nice value, it ran less than 1% of the time.
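A minimal sketch of that kind of test (an assumed structure, not Rostedt's actual test program) pits one nice -20 spinner against a configurable number of nice 0 spinners and reports the share of work the high-priority thread got:

    /* Assumed test structure, for illustration only; build with -pthread.
     * Lowering nice to -20 needs CAP_SYS_NICE or a suitable RLIMIT_NICE. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>
    #include <unistd.h>

    static atomic_ulong counters[2];    /* [0]: nice -20, [1]: nice 0 */

    static void *spinner(void *arg)
    {
        int high = arg != NULL;

        /* on Linux, this sets the calling thread's nice value */
        setpriority(PRIO_PROCESS, 0, high ? -20 : 0);
        for (;;)
            atomic_fetch_add(&counters[high ? 0 : 1], 1);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nlow = argc > 1 ? atoi(argv[1]) : 10;
        pthread_t t;

        pthread_create(&t, NULL, spinner, (void *)1L);      /* nice -20 */
        for (int i = 0; i < nlow; i++)
            pthread_create(&t, NULL, spinner, NULL);        /* nice 0 */

        sleep(10);
        printf("high-priority share: %.1f%%\n",
               100.0 * counters[0] / (counters[0] + counters[1]));
        return 0;
    }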
Another issue that Rostedt brought up was the migration logic, which picks the next task to run rather than the highest-priority task. This means that the load balancer may keep the highest-priority tasks on the same CPU while a low-priority task gets another CPU to itself.
There's a new scheduler that people are talking about, called "Earliest Eligible Virtual Deadline First" (EEVDF), though it comes from a paper written in the 1990s. It adds a "lag" value that tracks how much a task has actually run compared to how much it was supposed to run. If the lag is positive, the task has not run as much as it should have and is eligible to run; if the lag is negative, it is not eligible, as it has executed on the CPU more than it was supposed to (for example, because disabling preemption kept it from leaving the CPU). Rostedt said that this helps a little with latency, but the higher-priority task's share is still diluted.
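A simplified sketch of that eligibility rule (illustration only, not the kernel's EEVDF code):

    /*
     * Illustration of EEVDF eligibility: lag is the service a task
     * ideally deserved minus what it actually received; a task that
     * has received more than its share has negative lag and is not
     * eligible. Among eligible tasks, the one with the earliest
     * virtual deadline is picked.
     */
    struct entity {
        long long ideal_service;        /* weighted fair share so far */
        long long received_service;     /* time actually spent running */
        long long virtual_deadline;
    };

    static int eligible(const struct entity *e)
    {
        return e->ideal_service - e->received_service >= 0;  /* lag >= 0 */
    }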
Rostedt then brought up the realtime scheduler, which does help the situation, but it becomes an issue when there's more than one high-priority task to run on the CPU. The FIFO scheduling class will run a task until either a higher-priority task preempts it or it goes to sleep. Round-robin, instead, will rotate among tasks of the same priority, but acts like the FIFO class toward tasks of different priorities. Neither of these is good for untrusted tasks, as such tasks can take over the CPU: videos watched on the Internet are run by the renderers, but can we trust running those with a realtime policy? Any changes that are made must still be good for hard realtime and POSIX-compliant (no regressions).
Next, Rostedt brought up the MuQSS scheduler (Multiple Queue Skiplist Scheduler) by Con Kolivas, whose earlier work influenced the creation of the CFS scheduler. Kolivas came back with other schedulers, of which MuQSS is the latest. It uses a trylock operation for access to the other CPUs' run queues; if it doesn't get the lock, it simply does nothing. It also introduces a skiplist, a structure similar to a red-black tree that is not as strictly balanced but is faster for adding and removing items. The MuQSS scheduler is similar to the EEVDF scheduler. The tests comparing the schedulers were run on the 4.14 kernel, as that was the one that was easiest to port to ChromeOS; tests against CFS were also done on the 4.14 kernel for comparison.
Rostedt brought up a new proof-of-concept scheduling class called SCHED_HIGH, for which he made up the name but not the concept. The idea is that this scheduling class would sit between realtime and SCHED_OTHER. He based the class on CFS, but said that it was difficult to use and that he should probably have used EEVDF. He stated that, if a new scheduling class is created, it can have a new API and be developed without having to deal with the limitations of regressions, since everything is new. He listed two requirements: tasks must be able to get different priorities within the class while still getting time slices, and the class must not starve SCHED_OTHER tasks.
Next up was Youssef Esmat, who presented the results of the tests. He explained that two Chromebooks were used, one with two cores and one with four cores. The first test used Google Meet with 16 participants and measured key-press timings; it also measured the time between the event and when the graphics appear on the screen. The second test was with 49 participants and measured the time between the mouse moving over a window and an icon being displayed.
The critical tasks had a nice value of -8. The realtime tests set those tasks to priority one. The EEVDF tests used a latency_nice value of -20 (the fastest). For the SCHED_HIGH tests, the critical tasks ran in that class. MuQSS used a nice value of -10 instead of -8, since the MuQSS documentation states that this gives a 16ms cycle.
Youssef then showed the results of the tests, starting with the baseline of the current configuration. During this work, he found that one of the critical tasks was not set to a high priority; fixing that made a significant improvement, dropping the key-press latency on the two-core system by more than half and on the four-core system by around 20%. Moving the critical tasks out of the cgroups also made the numbers much better, dropping mouse latency on the two-core system from 121ms to 71ms. The other tests, including those on the four-core system, also improved across the board.
Moving the critical tasks to the realtime class improved things even more. Switching from CFS to EEVDF actually increased latency on every test on the two-core machine, and the four-core numbers went up slightly; dropped frames increased as well with EEVDF. Someone in the audience asked whether latency_nice was updated for CFS; Youssef affirmed that it was, but he wasn't able to finish those tests in time for the talk. The SCHED_HIGH prototype performed on par with the realtime class. Overall, MuQSS, SCHED_HIGH, and realtime performed the best, followed by simply moving everything out of the cgroups (MuQSS, SCHED_HIGH, and realtime also did not use cgroups for the critical tasks). MuQSS did the best on dropped frames, with the fewest of them.
It was asked if the tests that were used are publicly available. Rostedt and Youssef said that they are in the Chromium repository.
Hierarchical scheduling
Author: Steven Rostedt (video)
The realtime and deadline schedulers are specific to "hard realtime" applications. This talk, instead, was about "soft realtime" schedulers. The FIFO realtime scheduler, which is priority driven, running the highest-priority task until it is preempted by a higher-priority task or it sleeps, is not easy to map to applications, as applications may have dynamic priorities depending on when their deadlines are.
The round-robin (RR) scheduler gives time slices to tasks of the same priority, but the length of the slice is not easily changed, and POSIX does not define what that time slice should be.
The deadline scheduler is for periodic realtime tasks. Rate-monotonic has static priorities and is easier to implement but cannot guarantee 100% utilization of the CPU. The EDF scheduler can get 100% CPU utilization, but also requires calculations to know which task can be scheduled next. But that is only for a single CPU. Global EDF allows for more than one CPU but with restrictions.
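For reference, the classic Liu-and-Layland bounds behind that statement can be computed directly (a small illustration, not tied to any kernel code): rate-monotonic scheduling guarantees a periodic task set on one CPU only up to n(2^(1/n) - 1) total utilization, which tends toward roughly 69%, while EDF can guarantee up to 100%:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* n(2^(1/n) - 1): 100% for one task, ~82.8% for two, ~69.3% for many */
        for (int n = 1; n <= 4; n++)
            printf("rate-monotonic bound, %d tasks: %5.1f%%\n",
                   n, 100.0 * n * (pow(2.0, 1.0 / n) - 1.0));
        printf("EDF bound: 100.0%%\n");
        return 0;
    }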
Basically, "realtime is hard!". Rostedt argued that the Linux kernel is hard realtime where it has the utilities to handle hard realtime tasks. Some argue that Linux is soft realtime, but if a deadline is missed then the system fails, which is hard realtime and not soft realtime. The issue that some have is that Linux is not mathematically provable. But Rostedt said that is "quality of service" (QoS) and not realtime. Rostedt came to the conclusion that Linux has a "hard realtime design", where the OS is designed to be realtime but it is not guaranteed to be so (though it's a bug if it is not).
Rostedt then introduced an "over-commit scheduler", which could be implemented using EEVDF. The system allows tasks to ask for more than 100% of the CPU; when that happens, the tasks will all miss their deadlines, but they will miss them "fairly": all will miss their deadline by the same percentage.
Rostedt argued for a new scheduling class to give this characteristic of over-committing the CPU. The idea is to keep it separate from the other hard realtime schedulers in order to not "dilute" them, or as Daniel Bristot de Oliveira shouted out, "duct-tape development". This scheduling class should act as hard realtime when it does not use more resources than the CPU can supply, but not fail if it does exceed them. Instead, the QoS would just suffer.
Rostedt stated that there is a need for a scheduling class that gives a general idea about realtime so that applications do not need to know the details of how realtime works. An application should just state that a thread wants a percentage of the CPU, but could possibly have multiple threads that ask for a percentage that may add up to more than 100%. This should be acceptable with the caveat that they will lose their QoS.
He added that there should be an easy way for applications to tell the kernel which tasks are important and which are not; currently the kernel uses heuristics to determine this. Dhaval Giani asked what the user-space tasks would ask of the kernel. Rostedt mentioned that there's a scheduler that switches a task from deadline to SCHED_OTHER when its declared run time is exhausted. Juri Lelli stated that SCHED_DEADLINE originally did this, and added that everything that Rostedt had mentioned so far has been done in the past (but not accepted) and is still doable. Bristot mentioned that problems happen when tasks run in small bursts: when a task runs a little, sleeps, and runs again, it can build up run time and use it later to run more than it should, harming other tasks.
Peter Zijlstra chimed in, stating that the SCHED_HIGH tasks need to know about hard-realtime tasks that interrupt them, otherwise their accounting will go wrong; it's best to use the hard-realtime scheduler as a server for the other class. He continued by saying that static priorities are a nightmare and should be removed. Others stated that they are still being used, and Zijlstra said that was due to POSIX, which made a mistake that we have to live with.
Lelli said that SCHED_DEADLINE could be used, but Rostedt said that the tasks get throttled when their run time expires; Lelli said it is possible to have them continue to run.
Zijlstra brought up that any bandwidth-oriented scheduler below the realtime class would be broken because they could run indefinitely. Rostedt wondered whether rate-monotonic scheduling would work, as it does not need the calculations that deadline schedulers have; it just picks the task with the smallest period and lets it run. Rostedt then asked if the issue is that the realtime classes would cause problems for lower classes like CFS, which would not get to run and would then have to decide what to do when it runs again and needs to make up for the large hole that the hard-realtime schedulers left it. Zijlstra said that was part of it, but it also has to do with all of the cgroup code that needs to account for this as well.
Zijlstra ended by stating that we can get what we want by adding a patch to downgrade these tasks to SCHED_OTHER when they run out of their time slice. There's also the issue of allowing unprivileged tasks to use SCHED_DEADLINE.
The discussion ended with the idea of possibly trying out SCHED_DEADLINE with sched reclaim enabled to allow tasks to run when their run time is exhausted.
Rewriting RT throttling
Author: Joel Fernandes (video)
Realtime throttling issues in ChromeOS have been a concern and a barrier to the use of realtime scheduling there. One of the issues is that realtime tasks can be bursty; if a realtime task runs for a significant amount of time, it will undergo realtime throttling, and even before that happens, it will starve CFS tasks until the throttling kicks in. The default settings let realtime tasks run for 0.95 seconds before throttling.
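Those defaults come from two sysctl knobs (shown here with their stock values): realtime tasks may consume at most sched_rt_runtime_us out of every sched_rt_period_us, system-wide.

    # accounting period (one second) and the realtime share of it (0.95s)
    kernel.sched_rt_period_us = 1000000
    kernel.sched_rt_runtime_us = 950000

Setting sched_rt_runtime_us to -1 disables the throttling entirely.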
At the conference, there was a notable pushback against the idea of improving the realtime scheduler in ChromeOS to fix throttling issues. While no specific technical reasons were cited, attendees generally suggested that the effect of realtime scheduler throttling should be addressed through alternative means. One recommendation was to revisit an older set of patches by Zijlstra and Lelli that involves running CFS tasks from a SCHED_DEADLINE reservation. The SCHED_DEADLINE server infrastructure would replace the existing realtime throttling mechanism.
Fernandes pointed out that this approach still faces the issue of what happens once throttling kicks in. He suggested that this problem could be resolved either by using a smaller period for the CFS deadline task or by implementing demotion of realtime to CFS, with the former being easier to accomplish. Lelli expressed confidence that, with some tweaks and tuning, the approach could be made to work effectively and offered to collaborate on a call to further discuss the matter. Thomas Gleixner proposed describing the flow of scheduling events during an input event through an API or a similar solution to facilitate better scheduling.
(See this article and this one for recent coverage of realtime throttling).
Saving power by reducing timer interrupts
Author: Joel Fernandes (video)
When profiling the ChromeOS video playback use case, it was found that high-resolution (HR) timers can cause high power usage due to reduced batching. The conference attendees showed a general interest in these findings, discussing the history and interactions of HR timers with CPU idle states, as well as possible solutions to address the problem. Experts like Gleixner and Daniel Lezcano provided valuable insights into the issue and shared their experience in understanding the intricacies of HR timers.
Gleixner suggested a potential reason why dynamically toggling HR timers based on use cases might not be working: the need to stop the tick when transitioning from high resolution to low resolution. He recommended collecting more traces to better understand the issue. Although Gleixner was open to the idea of turning off HR timers dynamically, he also emphasized the importance of addressing the problem in user space. He shared a past example of the Alarm user-space applet, which experienced a bug when HR timers were introduced, as it would wake up too frequently.