
Notes from the 2nd Operating-System-Directed Power-Management Summit

By Jonathan Corbet
May 25, 2018
The second Operating-System-Directed Power-Management (OSPM18) Summit took place at the ReTiS Lab of the Scuola Superiore Sant'Anna in Pisa between April 16 and April 18, 2018. Like last year, the summit was organized as a collection of collaborative sessions focused on trying to improve how operating-system-directed power management and the kernel's task scheduler work together to achieve the goal of reducing energy consumption while still meeting performance and latency requirements.

This extensive summary of the event was written by Juri Lelli, Rafael Wysocki, Dietmar Eggemann, Vincent Guittot, Ulf Hansson, Daniel Lezcano, Dhaval Giani, Georgi Djakov, and Joel Fernandes. Editing by Jonathan Corbet.
Attendance almost doubled from last year, nearly maxing out the venue's capacity. The summit again brought together people from academia, the open-source community, and industry. The relaxed atmosphere of the venue and the manageable number of attendees allowed for intense debate and face-to-face discussion of current problems. Several proposals and proof-of-concept solutions were presented during the sessions.

The following sections, grouped by topic area, summarize the discussions in the individual sessions. Videos of the talks, along with the slides used, can all be found on the OSPM Summit web site.

[Group photo]

Software Frameworks

HERCULES: a realtime architecture for low-power embedded systems. Tomasz Kloda presented an overview of the HERCULES project, which UNIMORE is coordinating (the other partners are CTU Prague, ETH Zurich, Evidence Srl, AIRBUS, PITOM, and Magneti Marelli). The project is studying how to use commercial, off-the-shelf (COTS), multi-core platforms for realtime embedded systems (such as medical devices or autonomous driving). The biggest issue the project faces is that COTS systems are designed with little or no attention to worst-case behavior, which makes guaranteeing timeliness difficult, if not unachievable. Tomasz's group is looking, in particular, at how to avoid memory contention, which can cause unexpected latencies. To verify the proposed solution, a Jetson TX1 board was chosen as the reference COTS platform and the ERIKA Enterprise RTOS was ported to it; to partition the hardware resources, the team decided to use the Jailhouse hypervisor (which has been ported as well). The proposed solution is an implementation of Pellizzoni's Predictable Execution Model [PDF]; it combines cache coloring with preventive cache-invalidation techniques to achieve the desired determinism.

SCHED_DEADLINE: The Android Audio Pipeline Case Study. Alessio Balsini presented the work he has been doing as an intern at ARM. The question he tried to answer is whether the deadline scheduler (SCHED_DEADLINE) might help improve the performance of the Android audio pipeline. He started his talk with an introduction to the pipeline, describing how sound gets from an app to the speaker in Android. His work focused on Fast Mixer, which services applications with strict requirements:

  • Power efficiency: always running at maximum frequency is not good for the battery, so the lowest sufficient clock frequency has to be selected.
  • Low latency: smaller buffer sizes give lower latencies, but might introduce glitches.
  • Reactivity to workload changes: virtual instruments, for example, where a variable number of notes might need to be synthesized at any point in time.

He reviewed the current and alternative solutions, then stated that using SCHED_DEADLINE with an adaptive bandwidth mechanism, implemented by the user-space runtime system, showed the best results. While the periodicity and deadlines of the tasks composing the pipeline are easy to get right, correctly dimensioning the requested CPU time can be tricky. For this reason, an adaptive bandwidth mechanism was implemented that varies the runtime using information from running statistics and application hints (e.g., the number of notes the virtual instrument will have to synthesize in the next cycle).
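
For illustration, a minimal user-space sketch of this scheme might look as follows. The struct sched_attr layout and the raw sched_setattr() system call are the mainline SCHED_DEADLINE interface; estimate_runtime_ns() is a hypothetical stand-in for the adaptive-bandwidth logic, and the 4ms period is just an example:

    /* Sketch only: periodically re-dimension a SCHED_DEADLINE reservation
       from application hints; not Alessio's actual implementation. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    struct sched_attr {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime;   /* ns of CPU time per period */
            uint64_t sched_deadline;  /* ns */
            uint64_t sched_period;    /* ns */
    };
    #define SCHED_DEADLINE 6

    /* Hypothetical: derive the budget from running statistics and hints
       such as the number of notes to synthesize in the next cycle. */
    extern uint64_t estimate_runtime_ns(int notes_next_cycle);

    static void update_audio_budget(int notes_next_cycle)
    {
            struct sched_attr attr = {
                    .size           = sizeof(attr),
                    .sched_policy   = SCHED_DEADLINE,
                    .sched_runtime  = estimate_runtime_ns(notes_next_cycle),
                    .sched_deadline = 4000000,   /* 4ms audio period (example) */
                    .sched_period   = 4000000,
            };

            syscall(SYS_sched_setattr, 0, &attr, 0);  /* 0 == current thread */
    }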

Workload consolidation by dynamic adaptation. Rohit Jain and Dhaval Giani started the discussion by introducing the issues that the Oracle Database multi-tenancy option faces in trying to partition hardware resources: consolidation of "virtual databases" on common hardware suffers from performance interference. To solve this problem, the presenters stated that the use of both (work-conserving) control-group regulation and exclusive cpusets (to ensure cache affinity) would be needed. It would be best to move toward a more dynamic solution incorporating the good properties of the two mechanisms: share CPUs opportunistically and "re-home" tasks to assigned CPUs when the load increases.

The performance results of a solution based on modifying sched_setaffinity() to accept a "soft affinity" flag were then presented; they showed a promising trend, but were not ideal. The discussion focused on alternative ways to fix the issues; the group suggested that Rohit and Dhaval use (and possibly improve) automatic NUMA balancing which, in conjunction with mbind(), should already do the job. Rohit and Dhaval will go back and rerun their experiments to see whether this suggestion helps their case.
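
As a rough sketch of that suggested direction (not the presented soft-affinity patch), a tenant's memory could be homed to a node with mbind(), letting automatic NUMA balancing pull the tenant's threads toward that node instead of using hard CPU affinity:

    #define _GNU_SOURCE
    #include <numaif.h>   /* mbind(); link with -lnuma */

    /* Bind a tenant's buffer (addr must be page-aligned) to its "home"
       NUMA node; automatic NUMA balancing can then migrate the tenant's
       threads toward the node holding their memory. */
    static int home_memory_on_node(void *addr, unsigned long len, int node)
    {
            unsigned long nodemask = 1UL << node;

            return mbind(addr, len, MPOL_BIND, &nodemask,
                         sizeof(nodemask) * 8, 0);
    }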

Why Android phones don't run mainline Linux (yet). Todd Kjos kindly agreed to cover Chris Redpath's discussion slot about why Android phones can't run mainline Linux kernels. Using a set of slides from Google Bootcamp 2018, Todd started the session by highlighting the fact that, since last year, the Android Common Kernel has been a properly maintained (upstream merges, CI testing, etc.) downstream of the long-term support (LTS) kernels. There are, however, three classes of out-of-tree patches that have to be maintained in Android Common:

  • Android-specific features, such as the interactive cpufreq governor, which is not actually used anymore since Android switched to schedutil.
  • Features that were proposed for upstream, but were rejected (they are going to be reevaluated later this year).
  • Features that are ready for Android, but still under discussion for mainline inclusion (the energy-aware scheduler — EAS — for example) and are expected to be merged relatively soon.
A diffstat-like slide comparing the Android Common Kernel against the LTS kernels was then presented; it showed that the biggest downstream feature is EAS, though the hope is that this will change by next year as more and more EAS pieces are merged. Networking comes next, followed by filesystem and audio-related changes, and several additional bits and pieces (architecture support, drivers, debugging, and emulators). But these out-of-tree patches aren't the main obstacle to running mainline Linux on a phone.

The branching strategy was then presented. At the top of the hierarchy is the LTS kernel, which merges down into the Android Common Kernel (this year's generation of phones runs 4.9). Silicon vendors create their kernels from the Android Common Kernel when starting a development cycle for a new device, putting all SoC-specific code (plus debugging and instrumentation) on top. This process takes about one year; the result is then used by partners and OEMs, who contribute additional changes before the resulting kernel ships on phones another six months later. Considering that the normal support period for an LTS kernel is two years, only six months of support remain after the first phones based on an SoC hit the market. There has been an effort to evangelize the idea that SoC and OEM patches must be easily rebasable on top of a newer LTS kernel; there is still a lot of work to be done to accomplish this goal, but the general situation has improved a lot in the last couple of years with efforts like Project Treble.

Energy-aware realtime scheduling: Parallel or sequential? From analysis to implementation. Houssam-Eddine Zahaf presented the work he has been doing (while at the Université de Lille) on energy-awareness tradeoffs in realtime systems. He introduced the topic by giving the audience two examples of common pitfalls (and possible solutions) when running realtime tasks on SMP platforms or on platforms that implement voltage and frequency scaling. The former can leverage task migrations to achieve higher system utilization (while still meeting deadlines); the latter can save energy (on certain hardware) by executing workloads at a lower clock frequency.

He then stated that the very first thing to do when trying to intelligently save energy is to derive an energy model for the platform of interest. This model can easily become quite complex, however, as it generally depends on the specific application, memory usage, the type of operations performed, and task composition. He concluded by showing details of a possible formalization of the problem that takes task parameters, the desired parallelization level, and the energy model as inputs, and produces an allocation of threads to the different cores and frequency-selection hints as output.

FAIR and RT Scheduler

What is still missing in load tracking? Vincent Guittot presented the evolution of the load-tracking mechanism in the Linux scheduler and what the next steps should be. The session was split into three parts. The first part showed the improvements made to scheduler load tracking since the last OSPM summit and listed the features that have already been merged. The audience agreed that the new load tracking is far more accurate, stable, and helpful for scheduler load balancing.

Vincent then described what still remains to be fixed, such as the case of realtime tasks preempting ordinary tasks. There is also a desire to remove the current rt_avg mechanism and replace it with the new load-tracking information. From this use case, the discussion extended to the definition of CPU utilization and what is needed to get a complete view of it. Ordinary task utilization is already tracked and, as the previous use case showed, tracking realtime utilization is beneficial as well. The audience agreed that tracking should be extended to account for interrupt pressure and SCHED_DEADLINE usage in order to get a complete view of utilization.

Then, we discussed the load-tracking mechanism itself; the current implementation is simple and efficient but has some drawbacks, including capping the value to the current capacity of the CPU, which makes the utilization not fully invariant, as shown in some examples. After describing which kind of load-tracking behavior we would like, Vincent raised the question of what we really want to track. It is not really the running time (even after the latter has been scaled); instead, we are more interested in the amount of work that has been executed (how many instructions). Intel developers said that they have an implementation of scale invariance based on APERF/MPERF, but it interacts poorly with the current scheme: the resulting arch_scale_freq factor is usually lower than 1024 (the maximum value), which caps the current utilization and decreases the targeted frequency; the lowered utilization then feeds back into an ever-decreasing frequency. Tracking the work done, and removing the capping effect, should help fix this kind of problem. ARM developers mentioned that they would be interested too, because they have some new performance counters that could be used to track the utilization of CPUs.
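
To make the capping problem concrete, here is a simplified sketch (not the kernel's actual code) of APERF/MPERF-style frequency scaling and where the cap bites:

    #include <stdint.h>

    #define SCHED_CAPACITY_SHIFT 10
    #define SCHED_CAPACITY_SCALE (1 << SCHED_CAPACITY_SHIFT)   /* 1024 */

    /* Delivered vs. reference cycles, in the style of APERF/MPERF. */
    static uint64_t arch_scale_freq(uint64_t aperf_delta, uint64_t mperf_delta)
    {
            uint64_t scale = (aperf_delta << SCHED_CAPACITY_SHIFT) / mperf_delta;

            /* Below maximum frequency this factor is < 1024; the scaled
               utilization shrinks, schedutil requests a lower frequency,
               and utilization shrinks again: the feedback loop described
               above. */
            return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
    }

    /* PELT accumulates frequency-scaled running time, not wall time. */
    static uint64_t scaled_running_time(uint64_t delta_ns, uint64_t scale)
    {
            return (delta_ns * scale) >> SCHED_CAPACITY_SHIFT;
    }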

The last part of the session discussed the performance impact of load tracking. A quick-and-dirty test showed a 5% impact on the sched pipe benchmark result. The audience agreed that 5% is not negligible, but that this overhead buys better load balancing in return. The result was presented to start a discussion on whether this cost is a real problem and whether optimizing PELT is worth looking into.

Status update on Utilization Clamping. Patrick Bellasi gave a status update on his proposal for implementing utilization clamping. The proposal is not new, having been extensively discussed at last year's OSPM Summit and at the Linux Plumbers Conference, but the implementation has changed considerably since Patrick's first attempt.

The discussion started by recalling that there exist systems that are managed by informed runtimes, Android and ChromeOS in particular, that have a lot of information about which applications are running and how to properly tune the system to best suit their needs by, for example, trading off performance against energy consumption. Transient optimizations are also possible, depending on the state an application is in: whether it is running in the foreground or background on Android systems, for example.

Task utilization already drives clock-frequency selection in mainline kernels via the schedutil cpufreq governor, and EAS further extends its use to drive load-balancing decisions; the ability to clamp this signal from user space or middleware is thus seen as a powerful means of performing fine-grained dynamic optimizations.

The latest util_clamp implementation provides both a per-task and a control-group interface. The per-task API extends the sched_attr structure with a couple of parameters (util_min and util_max) and can be used via the existing sched_setattr() syscall. Scheduler maintainer Peter Zijlstra was positive about the viability of such an extension.
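
A sketch of how that per-task interface might be used follows; the util_min and util_max field names come from the posted patches and could change before merging, so this is illustrative only:

    #include <stdint.h>

    /* Hypothetical extension of struct sched_attr, per the util_clamp
       proposal; not a mainline interface at the time of writing. */
    struct sched_attr_clamped {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime;
            uint64_t sched_deadline;
            uint64_t sched_period;
            uint32_t util_min;    /* proposed: 0..1024 */
            uint32_t util_max;    /* proposed: 0..1024 */
    };

    /* Example: a foreground task that should always be granted at least
       ~25% of CPU capacity, but never drive the CPU above ~75%. */
    struct sched_attr_clamped attr = {
            .size     = sizeof(attr),
            .util_min = 256,
            .util_max = 768,
    };
    /* ...passed to the kernel via the sched_setattr() syscall... */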

The control-group interface is still an extension of the CPU controller, as in the previous proposals, but it is now considered a secondary interface, as requested by Tejun Heo. Since Tejun was not fully convinced by the names proposed for the new attributes, it was suggested that Patrick document all of the possible usages and link the proposal to possible future uses, such as better support for task placement. A brief discussion also covered how to properly aggregate clamped utilization values coming from different scheduling classes; the realtime scheduling class might benefit from such a solution, especially on battery-constrained devices. The general consensus was that aggregation should at least be consistent with how the load balancer will use the values.

arm64 topology representation. Morten Rasmussen started his session by saying that the discussion he wanted to have was driven by challenges and issues he has had to deal with on ARM64 platforms. Before getting into the details, however, he gave a gentle introduction to the current Linux topology ("scheduling domains") setup code and described the hierarchical nature of this topology representation. Each topology level has flags associated with it, and this information influences scheduler decisions. This all works well for ARM64 mobile systems.

Problems arise with ARM64 servers, though. Such systems have lots of clusters in a single package and, without changes to the topology setup, task wakeups will be confined to single clusters instead of being potentially spread to the package. The good news is that packages have a shared L3 cache, so changing how flags are attributed to the domains might easily fix the problems. Even for systems that don’t come with a physically shared L3 cache, it might still be worthwhile to balance across the whole package, as the cache-coherence interconnect makes data sharing fast. So, setting or not setting the desired flags seems to be up to where one wants it to "hurt" (as commented by Peter Zijlstra). In any case, Peter was adamant in stating that "if there is an L3, then the multi-core scheduling-domain level must span the L3".

A discussion followed about how load balancing is using the flags at the moment and what kind of topology setup is used to build the hierarchy. At this point Morten noted that there is one addition for the ARM64 world based on PPTT ACPI tables (currently driven towards mainline adoption by Jeremy Linton): a tree representation of caches and associated flags. The end goal in this case would be to end up with the same topology if one describes the same platform either using PPTT or a device tree.

Whichever form the final solution takes, it was noticed that being backward compatible might be important. Having some sort of flag for deciding whether to go "the old way" or to adopt the new solution(s) might be worthwhile to avoid any sort of backward-compatibility problems.

EAS in Mainline: where we are. Quentin Perret and Dietmar Eggemann presented the latest energy-aware scheduling patch set that has been posted on the linux-kernel list. A previous version, sent out a couple of weeks earlier, was already covered by LWN. There was agreement that starting with the simplest energy model, representing only the active power of CPUs at all available P-states, is the right thing to do.

The design decision to only enable energy-aware scheduling on systems with asymmetric CPU capacity seems to be correct. Another important design factor is that the energy model can assume that all CPUs in a frequency domain have the same microarchitecture. Rafael Wysocki and Peter Zijlstra insisted that the energy model be an independent framework that can be used on all architectures and by multiple users (the scheduler, cpufreq, or a thermal cooling device, for example). Therefore, the current coupling of the energy model with the PM_OPP library has to be abandoned, since some platforms do not expose their P-states.

The fact that the energy-aware scheduler iterates over every CPU in the system was deemed acceptable as long as the implementation warns or bails out when the number of CPUs and frequency domains exceeds a certain threshold (probably eight or 16).
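
To make the model concrete, the estimate performed for each frequency domain might look like the following sketch, simplified from the posted patch set (not the actual kernel code; P-states assumed sorted by increasing capacity):

    struct cap_state {
            unsigned long cap;     /* compute capacity at this P-state */
            unsigned long power;   /* active power at this P-state */
    };

    static unsigned long est_domain_energy(const struct cap_state *cs,
                                           int nr_states,
                                           unsigned long sum_util)
    {
            int i;

            /* Lowest P-state with enough capacity for the domain's
               aggregate utilization (the one schedutil would pick). */
            for (i = 0; i < nr_states - 1; i++)
                    if (cs[i].cap >= sum_util)
                            break;

            /* Energy ~ busy-time share at that P-state times its power. */
            return cs[i].power * sum_util / cs[i].cap;
    }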

The idea that energy-aware scheduling only makes sense when the system is not over-utilized, along with the corresponding implementation, found agreement, although it is not clear whether the current per-scheduling-domain approach is really beneficial or whether the simpler system-wide implementation should be preferred.

Power-aware and capacity-aware migrations for realtime tasks. Luca Abeni made use of the time allocated for his slot to review and discuss the main issues he found while trying to make SCHED_DEADLINE aware of CPU compute capacities, something similar to what the mainline completely fair scheduling (CFS) load balancer already does. After giving the audience a brief introduction to SMP scheduling on realtime systems, he stated that SMP global invariants might be wrong when CPUs don’t have the same computational power (capacity). This issue can come up on big.LITTLE systems or traditional SMP systems with CPUs running at different clock frequencies.

He aims to modify the SCHED_DEADLINE code that controls task migrations to make it aware of both CPU-capacity and operating-frequency differences. Existing theory is of little help here, however: the global earliest-deadline-first scheduling algorithm doesn't take CPU utilization into account, and theoretical algorithms may lead to an unacceptable number of task migrations. He thus decided to follow a more practical approach, leaving theory as a second step. The idea is to modify the cpudl_find() function (responsible for finding a potential destination CPU for a migrating task) to reject CPUs without enough spare capacity. He has already implemented a further simplification that, for the time being, only considers completely free CPUs with enough capacity (when running at maximum frequency) to accommodate the task's needs.
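
The shape of that capacity check might be something like the following; the names are simplified and illustrative, not Luca's actual patch:

    #include <stdint.h>

    /* Illustrative helpers, assumed provided elsewhere. */
    extern unsigned long arch_scale_cpu_capacity(int cpu);  /* 0..1024 */
    extern unsigned long cpu_reserved_dl_bw(int cpu);       /* same scale */

    /* Reject CPUs without enough spare capacity for a migrating task. */
    static int cpu_fits_dl_task(int cpu, uint64_t runtime_ns, uint64_t period_ns)
    {
            /* Static task bandwidth (runtime/period), scaled to 0..1024. */
            uint64_t task_bw = (runtime_ns << 10) / period_ns;

            return arch_scale_cpu_capacity(cpu) - cpu_reserved_dl_bw(cpu)
                    >= task_bw;
    }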

At this point an interesting discussion started, focusing on which parameters should be considered when evaluating a task's needs. One option is static utilization (given at admission-control time), which never changes but might be pessimistic; the alternative is dynamic utilization, which takes a task's execution history into account before the migration decision happens. A decision has not been made, and discussion was postponed until actual patches hit the list, but one point worth noting from the discussion is that rq_clock() can be considered to be always in sync across CPUs and thus could be used to implement the latter (dynamic) solution.

Towards a constant-time search for the wakeup fast path. Dhaval Giani and Rohit Jain discussed an optimization to make the task-wakeup fast path more scalable. The goal is to make the time needed to find a CPU for a newly woken task independent of the system size (ideally O(1)). The approach explored so far counts the threads running on each CPU core in order to pick the least-loaded one; alternatively, a limit on the search space could be introduced, making the search constant-time. The recommendation from the audience was to post the alternative patch for review and run tests comparing the two solutions.
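
In sketch form, the counting approach might look like this (names hypothetical); bounding nr_cores to a fixed-size search window is what would make the search constant-time regardless of system size:

    /* Pick the core with the fewest runnable threads. */
    static int select_least_loaded_core(const int *nr_running, int nr_cores)
    {
            int best = 0;

            for (int c = 1; c < nr_cores; c++)
                    if (nr_running[c] < nr_running[best])
                            best = c;

            return best;   /* an idle core shows up with a count of 0 */
    }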

System Frameworks

Towards a tighter integration of the system-wide and runtime PM of devices. Rafael Wysocki started his talk by giving the audience an introduction to device power management in Linux. In order of complexity, he reviewed working versus sleeping system states, device runtime power management, and the system suspend, hibernation, and restore control flows. He then looked at how the power-management code is organized; it consists of several layers of software, including devices (with drivers), bus types, and power-management domains. This complexity is needed to make devices work across different bus types and power-management-domain topologies, implementing power management on different platforms while avoiding code duplication.

While most cases are already handled correctly, some questions still remain open:

  • Can devices in runtime suspend be left suspended during system suspend (or hibernation)?
  • Can devices be left in suspend during system resume and then be managed by runtime power management in the working state?
  • Can runtime power-management callbacks be used for system suspend and resume and hibernation?

After presenting the existing solutions, previous attempts, and new ideas for these problems, he asked the audience for further ideas. Peter Zijlstra proposed to implement a "unified" state machine to reduce complexity, which might be considered as a solution even if it looks like a difficult one to get right. A need to support new bus standards (e.g., mesh topologies) was also mentioned and discussed.

Integration of system-wide and runtime device power management: what are the requirements for a common solution? This session, led by Ulf Hansson, was a continuation of Rafael’s session; the first part focused on the issues related to device wakeup handling. Generally, device wakeup is about powering up the device to handle external events when the device is in a low-power state, the entire system is in a sleep state, or both (in the majority of cases, devices are in low-power states if the system is in a sleep state). That can be done in a few different ways: through in-band device interrupts, via a special signal line or GPIO pin (out of band), or through a standard wakeup signaling means like PME# in PCI (also regarded as in-band). There are devices where two or more methods are available and each of them may require a different configuration. Moreover, the set of configurations in which a device can signal wakeup may be limited (for example, in-band wakeup signaling may not be available in all of the possible low-power states of the device) and that may depend on the physical design of the board (for example, it may depend on whether or not the device's wakeup GPIO line is present).

The runtime power-management framework assumes that device wakeup will be set up for all devices in runtime suspend. The device driver and middle-layer code (e.g. bus type or power-management domain) involved are then expected to figure out the most suitable configuration to put the device into on runtime suspend. For system sleep states, there is another bit of information to take into account because user space may disallow devices from waking up the entire system, via sysfs. Still, all of that doesn't tell the driver and the middle layer which wakeup signaling method is to be used and which methods are actually viable.

On systems with more complex platform firmware, like ACPI or PSCI, this mostly is handled by that firmware, or at least the firmware can tell the kernel what to do. However, on systems relying on a device tree only, there is no common way to describe the possible device wakeup settings. The response to that from the session audience was that this seems to be a device-tree issue and it should be addressed by creating appropriate bindings for device wakeup. Still, though, the kernel is currently missing a common framework to use the firmware-provided information on device wakeup in a generic way, independent of what the platform firmware is. Whether or not having such a framework would be useful and to what extent is an open question at this time.

The second part of Ulf's session focused on the generic power domains (genpd) framework and the situations in which it would be useful to put devices into multiple domains at the same time. There is a design limitation in the driver core by which a device can only be part of one power-management domain; it is related to the way device callbacks are invoked by the power-management core. The design of genpd assumes that the more complex cases will be covered by domain hierarchies. That does not seem to be sufficient for some use cases, however, and some ideas on addressing them are floating around.

Notifying devices of performance-state changes in their master PM domain. Vincent Guittot led this session on behalf of Viresh Kumar, who was not able to attend the summit. The goal was to discuss adding a notification mechanism for when the performance state of a genpd changes, so that a device can optimize its resource configuration (its clock frequency, for example). Vincent presented the use case that raised the need for this mechanism: a DynamIQ system where the voltage is shared between the DynamIQ shared unit and some cores. He also mentioned that DynamIQ is not the only configuration that can take advantage of the mechanism, since shared voltage rails are common on embedded SoCs. He wanted to know where the best place to implement it would be: directly in genpd, or in a core framework that would extend the mechanism to other subsystems. The feature was well received by the audience. The best place for it was discussed, and some people mentioned other platforms where the GPU and the cores share a voltage domain. Nevertheless, it was decided that an implementation in genpd would be the best starting point; extension to other frameworks can be considered later, once a real usage arises.
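
No such API existed at the time of the discussion; a purely hypothetical consumer might look like the following sketch, which just follows the usual kernel notifier conventions (the registration function name is invented):

    #include <linux/notifier.h>

    /* Hypothetical: called when the master domain's performance state
       changes, so the device can re-evaluate its own clock frequency. */
    static int perf_state_cb(struct notifier_block *nb,
                             unsigned long new_state, void *data)
    {
            /* ...adjust the device's resource configuration here... */
            return NOTIFY_OK;
    }

    static struct notifier_block perf_nb = {
            .notifier_call = perf_state_cb,
    };

    /* Hypothetical registration, e.g. from a driver's probe():
       dev_pm_genpd_register_perf_notifier(dev, &perf_nb); */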

CPU Idle Loop Ordering Problem. Rafael Wysocki and Thomas Ilsche discussed problems related to the ordering of the CPU idle loop in Linux and the solutions that were merged into Linux 4.17-rc1. Rafael started the session with a high-level introduction to CPU idle states and the CPU idle-time management code in the kernel. That code is executed by logical CPUs when there are no tasks to run; it is responsible for putting the CPUs into an idle state in which they draw much less power. If the CPUs support idle states, they will invoke the CPU idle governor to select an appropriate idle state and the CPU idle driver to make the CPU actually enter that state. The CPU will stay in the idle state until it is woken up by an interrupt. This sounds straightforward enough, but the idle-state selection in the cpuidle governor is based on predicting the time the CPU will be idle (idle duration) and that is not deterministic, which leads to problems.

Next, Thomas presented some measurement results from his laboratory clearly showing that CPUs might be put into insufficiently deep idle states by the CPU idle-time management code and might stay in those states for too long; this was related to the non-deterministic nature of the governor's idle-duration prediction. The problem could be triggered by specific task-activity patterns that cause the governor to mispredict the idle duration, so it happened in practice and could be reproduced on demand.

Rafael took over at this point and went into some details of how the CPU idle loop works. Before the changes made in Linux 4.17-rc1, the idle loop would try to stop the scheduler tick before invoking the cpuidle governor. That made the governor's job simpler, because the time of the next timer event was known at that point and the governor needs that time to predict the idle duration; it was problematic, however, if the predicted idle duration turned out to be shorter than the scheduler-tick period. In that case, either the tick did not need to be stopped at all (if the prediction was accurate), so the overhead of stopping it was unnecessary, or the idle state selected for the CPU was too shallow (if the duration was mispredicted) and the CPU would stay in it for too long.

The ordering of the idle loop was changed in 4.17-rc1 so that the decision on whether or not to stop the scheduler tick is made after invoking the idle governor to select the idle state for the CPU. This way, if the idle duration predicted by the governor is shorter than the scheduler tick period, the tick need not be stopped and the overhead related to stopping it is avoided. Moreover, if the predicted idle duration turns out to be too short, in which case the CPU might have gone into a deeper idle state, the CPU will be woken up by the scheduler tick timer, so the time it can spend in an insufficiently deep idle state is now limited by the length of the tick period.
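
In simplified pseudo-C, the reordered loop looks roughly like this; the real code, in kernel/sched/idle.c and the governors, is considerably more involved, and the _sketch names are placeholders (tick_nohz_idle_stop_tick() is the real kernel function):

    static void do_idle_sketch(void)
    {
            bool stop_tick;
            int state;

            /* Ask the governor first; it returns an idle state plus a
               hint on whether stopping the tick is worthwhile. */
            state = cpuidle_select_sketch(&stop_tick);

            /* Stop the tick only when the predicted idle duration exceeds
               the tick period; otherwise it remains running as a safety
               net that bounds the time spent in a too-shallow state after
               a misprediction. */
            if (stop_tick)
                    tick_nohz_idle_stop_tick();

            cpuidle_enter_sketch(state);
    }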

Of course, there are a few complications with this approach. For example, the idle governor generally needs to know the time of the next timer event to predict the idle duration, so it needs to be told when the next timer event will happen in two different cases, depending on whether or not the scheduler tick is stopped. It must take that information into account and, in addition to selecting the idle state, give its caller (the idle loop) a hint on whether or not to stop the scheduler tick. Also, when collecting the idle-duration statistics used for future predictions, the governor needs to treat the cases in which the CPU was woken up by the scheduler tick specially, because in at least some of those cases the CPU should have been sleeping much longer than it actually did (due to the tick wakeup). Fortunately, all of those complications could be taken care of.

The session concluded with a presentation of some test data to illustrate the impact of the idle loop changes. First, Rafael showed a number of graphs from the Intel Open Source Technology Center Server Power Lab demonstrating significant improvements in idle power consumption on several different server systems with Intel CPUs. Graphs produced by Doug Smythies during his testing of the kernel changes were also shown, illustrating significant reduction of power draw with no performance impact in some non-idle workloads. Next, Thomas showed his results demonstrating that he was not able to reproduce the original problem described previously with the modified kernel.

Finally, Rafael said that in his opinion the work on the CPU idle loop was a good example of effective collaboration between the code developer (himself), reviewers (Peter Zijlstra and Frederic Weisbecker) and testers (Thomas Ilsche and Doug Smythies) allowing a sensitive part of the kernel to be significantly improved in a relatively short time (several weeks).

Do more with less. Daniel Lezcano led a session on the increasingly fast evolution of the SoC market. With vendors developing ever more powerful hardware, thermal constraints, with large temperature changes and fast transitions, become challenging. The presentation introduced a new passive cooling device, complementing the existing one, that injects idle cycles of a fixed duration but with a variable period. By synchronizing threads to force idle time, based on the play_idle() API, the kernel can force the CPUs belonging to the same cluster, as well as the cluster itself, to power down. Even though that adds latency to the system, it has the benefit of reducing static leakage.

The presentation demonstrated the computation of the run duration relative to the idle duration, then showed experiments confirming the theory. The third part of the presentation showed an improvement to idle injection that combines the existing cooling device, which changes operating performance points (OPPs), with idle-injection cycles until the capacity equivalent of the next lower OPP is reached.
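
A sketch of both pieces follows, assuming the play_idle() API mentioned above (which is real and also used by intel_powerclamp) and simplified arithmetic; this is not the posted code. Since running for a fraction run/(run+idle) of the time at freq_high yields the average capacity of freq_low when run/(run+idle) = freq_low/freq_high, a fixed idle pulse gives the run duration below:

    #include <linux/kthread.h>
    #include <linux/cpu.h>    /* play_idle() */

    static unsigned int idle_ms = 10;    /* fixed idle-pulse duration */

    /* Run duration that makes {run at freq_high, idle} equivalent in
       capacity to running continuously at freq_low:
       run = idle * freq_low / (freq_high - freq_low). */
    static unsigned int run_ms(unsigned int freq_low, unsigned int freq_high)
    {
            return idle_ms * freq_low / (freq_high - freq_low);
    }

    static int idle_inject_fn(void *arg)
    {
            while (!kthread_should_stop()) {
                    play_idle(idle_ms);   /* force the CPU into idle */
                    /* ...then behave as a normal task for run_ms()
                       before injecting the next pulse... */
            }
            return 0;
    }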

The feedback from the audience was:

  • The existing CPU cooling device was initially planned to be a combination cooling device, involving both the CPU-frequency governor and cpuidle. There is some skepticism about the combination approach; it can be interesting when there is a small number of OPPs with big gaps between them but, nowadays, that is no longer the common case.
  • Concerning idle injection, there is no reluctance to merge it; however, the code must be organized differently. Intel has a power-clamping driver doing roughly the same thing, so Rafael Wysocki (the kernel's power-management maintainer) would like to see an idle-injection framework in the drivers/powercap directory with an API used by both the intel_powerclamp driver and the CPU idle cooling device.
  • Eduardo Valentin (maintainer of the SoC thermal framework) and Javi Merino (co-maintainer of the CPU cooling device) think that idle injection makes sense in any case: if the existing CPU cooling device is unable to mitigate the temperature, there is no alternative way to stop the temperature increase. The more options we have, the better.

Next steps in CPU cluster idling. Ulf Hansson presented the latest updates on the CPU cluster-idling patch series. The cpuidle subsystem handles each CPU individually when deciding which state to select for an idle CPU; it takes into account neither the behavior of a cluster (a group of CPUs sharing resources such as caches) nor the power management of other shared resources (interrupt controllers, floating-point units, CoreSight, etc.). In principle, these are not problems on SoCs where the cluster policy is decided solely by firmware, outside of Linux's knowledge. For many ARM systems (legacy and new), however, particularly those targeted at embedded, battery-powered devices, these problems do exist.

In the session, Ulf provided a brief overview of the significantly reworked version of the series, covering how the CPU topology is parsed from the device tree and modeled as power-management domains via the genpd infrastructure, as well as how the PSCI firmware driver comes into play during idle. Some concerns were raised about the new genpd CPU governor code introduced as part of the series: currently it means that, in parallel, the cpuidle governor decides about the CPU while the genpd CPU governor decides about the cluster. More importantly, these decisions are made without any information being shared between the two during the idle-state selection process, which could lead to problems in the long run, for example in cases where a CPU supports more than one idle state besides WFI.

Discussion moved along to more general thoughts about the future of cpuidle governors. In particular, interest was expressed in continuing to explore the option of adding a new governor based on interrupt-prediction heuristics, thus providing, as a first step, something that people could play with and report their experiences.

Advanced Architectures

Security and performance trade offs in power interfaces. Charles García-Tobin used his slot to talk about the latest design choices regarding power interfaces in the ARM world, with the intent of gathering feedback from kernel developers and possibly steering the course of future decisions. He introduced the topic by stating that ARM systems have an embedded legacy that has led to designs built on several assumptions: the kernel has full and exclusive control over the power of the platform, interfaces can be as low-level as needed, and only the operating system is "running". Over time, however, it has become clear that low-level interfaces can be dangerous (e.g., CLKscrew) and that it is very hard to create a kernel that controls every kind of system and solves every power problem (especially considering that the kernel might be too slow to respond when power delivery or thermal capping needs quick modulation). Therefore, ARM has come to the realization that more abstract interfaces are needed (ACPI is one method of abstraction, but it is not suitable everywhere).

ACPI provides a feature called "collaborative processor performance control" (CPPC). This is an abstract interface through which firmware and the kernel can collaborate on performance control; it is abstract enough that it can be used on almost any system with a power controller, which makes it relatively straightforward to adopt in ARM-based enterprise systems. However, it needs feedback counters to give the kernel an idea of the delivered per-CPU performance versus what was requested. For the embedded world, ARM is also providing the System Control and Management Interface (SCMI), an extensible, standard platform-controller interface covering power and system-management functions. SCMI will be driven directly from the kernel through SCMI drivers; ACPI-based kernels can also drive SCMI indirectly through ACPI-interpreted code, and ACPI CPPC and SCMI implementations can coexist in the same power controller. As mentioned above, CPPC requires feedback signals to be used properly by the kernel; ARM is solving this problem by providing activity monitors (introduced in ARM v8.4), which are constantly running, read-only event counters that can be used, among other possible applications, to monitor delivered performance.

Charles concluded his session by asking the audience whether ARM has come up with the right kind of feedback signals and what the kernel might do with this information. It was suggested that the signals might still be useful even if the kernel does not consume them directly, for debugging, for example. They might also help to implement proactive policies when the CPU is in danger of hitting thermal-capping situations (though it was noted that this might not be easy to achieve, because the system might still be too slow to react). Charles also wanted to know whether people had experience with similar mechanisms; the audience mentioned RAPL as an always-useful tool and noted that being able to query delivered performance for remote CPUs might be handy (when migrating tasks around, for example).

Scaling Interconnect bus. Georgi Djakov and Vincent Guittot led a session on addressing use cases with quality-of-service (QoS) constraints on the performance of the interconnect between different components of a system (or SoC). In the presence of such constraints, the interconnect's clock rate has to be high enough for those constraints to be met, and there needs to be a way to make that happen when necessary. This helps the system choose between high performance and low power consumption, keeping the optimal power profile for the workload. That is currently done in the vendor kernels shipping with Android devices, for example, with essentially every vendor using a different custom approach; a generalized mechanism is needed in order to support this in the mainline kernel.

The idea is to add a new framework for representing interconnects, along the lines of the clock and regulator frameworks, following the consumer-provider model: developers would implement platform-specific provider drivers that understand the topology, and device drivers would be taught to be their consumers. Various approaches were discussed and existing kernel infrastructure was mentioned (devfreq, power-management QoS, the common clock framework, genpd, etc.), but none of these seem suitable for configuring complex, multi-tiered bus topologies and aggregating constraints provided by drivers. People agreed that having another framework in the kernel is fine if the use case is distinct enough, and encouraged other vendors to join the effort.
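
On the consumer side, usage might look like the following sketch; the function names are merely illustrative of the consumer-provider model described above, not a settled API:

    #include <linux/device.h>
    #include <linux/err.h>

    /* Illustrative consumer: request enough interconnect bandwidth
       between two endpoints for a device's transfers. */
    static int request_bw(struct device *dev, u32 avg_kbps, u32 peak_kbps)
    {
            struct icc_path *path;

            path = icc_get(dev, "cpu", "ddr");   /* endpoints of the path */
            if (IS_ERR(path))
                    return PTR_ERR(path);

            /* Requests from all consumers are aggregated by the provider. */
            return icc_set(path, avg_kbps, peak_kbps);
    }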

An important point was raised about the cases when platform firmware takes care of tuning the interconnects. It was argued that the firmware has to be involved in the interconnect handling for the same reasons that were brought up during the "Security and performance trade offs in power interfaces" session. In such cases, in order to not clash with the firmware, the interconnect driver implementation could interact with the firmware and just provide hints about the current bandwidth needs, instead of controlling the hardware directly.

Another scenario that was discussed involved multiple CPUs or DSPs in an SoC using shared interconnects. It is not clear exactly how to represent this topology in the kernel, as each CPU or DSP may have different constraints depending on whether it is active or sleeping. The current proposal is simply to duplicate the topology for both states, collect constraints from drivers for both, and switch the configuration when the state changes. This is still an open question, as more feedback is needed.

Tools

eBPF super powers on arm64. The BPF compiler collection (BCC) is a suite of kernel tracing tools that allow systems engineers to efficiently and safely gain a deep understanding of the inner workings of a Linux system. Because they can't crash the kernel, they are safer than kernel modules and can be used in production.

In his talk, Joel Fernandes went through solutions, such as BPFd, for getting BCC working on embedded systems. He then went through several demonstrations, showing other tools for detecting performance issues. In the last part of the talk, a new tool being written to detect lock contention by monitoring the futex() system call was discussed. One of the issues is that it is difficult to distinguish between futexes used for locking and those used for other purposes; because of this, one may get false positives. Solutions to this problem were discussed; Peter Zijlstra suggested using FUTEX_LOCK_PI from user space, which is a locking-specific futex() command. Other than this, there seems to be no easy way to solve the problem.

The audience also discussed ideas for how eBPF can be useful for power testing. One idea is to use the energy model available on some platforms, together with cpufreq residency information, to calculate approximate power numbers; this would enable the writing of tools like powertop. Another idea that was brought up pertained to the use of eBPF to monitor scheduler behaviors and verify that certain behaviors hold in the scheduler. Lastly, the crowd talked about using eBPF for workload characterization and discussed an existing tool that does this, called sched-time and developed by Josef Bacik. Joel mentioned, however, that sched-time might need more work, as it wasn't working properly in his tests. It was agreed that it would be nice to fix it and use it to characterize workloads.

Scheduler unit tests. Dhaval Giani brought up the topic of using rtapp traces for regression testing, which was originally raised at the Linux Plumbers Conference last year. Patrick Bellasi pointed out that he uses rtapp for functional testing as opposed to performance testing. Peter Zijlstra mentioned that he was never able to get Facebook's rtapp traces running on a system that was quite similar to the Facebook setup; he remained skeptical about running traces on systems different from those where they were generated. Many in the audience were also unsure whether rtapp could be used to model memory, interrupts, and other conditions affecting the workload. It was mentioned that rtapp can already do some of that and could be extended further.

The discussion moved toward making the rtapp test cases currently in use more widely available, and toward writing more functional test cases. Rafael suggested having test cases that each test only one functionality, which Patrick confirmed is what their test cases do. Charles García-Tobin mentioned that he would like to describe the scheduler as a set of invariants and use rtapp to test that those invariants are not broken. Most attendees agreed on the idea of having a set of core invariants, with each user describing their own requirements as test cases. The discussion ended with lunch approaching and Dhaval agreeing to look into the idea of invariants.
