
Kernel development

Brief items

Kernel release status

The current development kernel is 3.3-rc4, released on February 18, a couple of days later than might have been expected. "This time the reason for the delay is that we spent several days chasing down a nasty floating point state corruption that happens on 32-bit x86 - but only if you have a modern CPU (why are you using 32-bit kernels?) that supports the AES-NI instructions. And then you have to enable support for them *and* use a wireless driver that uses it. The most likely reason for that is using the mac80211 infrastructure with WPA with AES encryption (ie usually WPA2)." There are lots of other fixes as well, of course; the short-form changelog can be found in the announcement.

Stable updates: the 3.0.22 and 3.2.7 stable updates were released on February 20; they contain the usual list of important fixes.


Quote of the week

When you write:

if (ret) {
        one_line_statement();
}

somewhere, a puppy dies. And the DRM guys just took out an entire kennel.

-- David Miller


A sys_poll() ABI tweak

By Jonathan Corbet
February 22, 2012
The poll() system call has three parameters, one of which is a timeout value specifying an upper bound (in milliseconds) for how long the process will wait. The manual page indicates that the type of this value is int. For reasons lost in history, though, the kernel's internal implementation of poll() has always expected the timeout value to be a long integer. And that has created a source of occasional bugs.

Most of the time, things just work. The int and long types tend to be the same on most architectures, and, in cases where they are different, glibc sign-extends the timeout value appropriately. Things go wrong, though, when a 32-bit process is running on an x86-64 system. In that case, the 32-bit sys_poll() function just passes the timeout value directly to the native kernel version, without sign extension. So if the timeout value is negative (an indication that poll() should wait forever if need be), the kernel will eventually see a large, positive timeout instead.
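The effect of the missing sign extension can be reproduced with plain integer arithmetic. The following is only an illustrative sketch: the helper names are hypothetical, and it models register widths rather than calling into the kernel.

```python
# Model how a 32-bit int timeout looks once widened to a 64-bit long.
# Helper names are hypothetical; nothing here calls poll() itself.

MASK32 = 0xFFFFFFFF


def zero_extend_32(value):
    """Widen a 32-bit value by filling the upper 32 bits with zeros."""
    return value & MASK32


def sign_extend_32(value):
    """Widen a 32-bit value by copying its sign bit into the upper bits."""
    low = value & MASK32
    return low - (1 << 32) if low & 0x80000000 else low


infinite_timeout = -1  # poll() convention: a negative timeout means "wait forever"

print(zero_extend_32(infinite_timeout))  # 4294967295: a huge positive timeout
print(sign_extend_32(infinite_timeout))  # -1: still "wait forever"
```

A timeout of 4294967295 milliseconds is a bit under 50 days, which is why the bug looks like a very long wait rather than an infinite one.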

There are various ways this problem could be fixed. What Linus has chosen to do, though, is to just change the type of the timeout parameter to int inside the kernel. Since the timeout is now a 32-bit quantity on all systems, that particular source of confusion is removed. There is a small risk to this approach, though: it is possible that some program somewhere was actually making use of 64-bit timeouts. Doing so would require replacing or bypassing glibc (because its sign extension makes 64-bit timeouts unusable), so it's unlikely that anybody has bothered, but one never knows. If this change were to break a real application, it would have to be reverted in favor of a more complicated solution.

Linus's patch was merged for 3.3-rc5, so anybody who objects has a few weeks to make their concerns known.


dma-buf and binary-only modules

By Jonathan Corbet
February 22, 2012
The DMA buffer sharing mechanism has been merged for the 3.3 kernel; it is a way for DMA buffers to be shared between otherwise independent device drivers under user-space control. The dma-buf patches, as merged for 3.3, include a number of functions used by drivers to access buffers; those functions are all exported in the GPL-only mode. That drew a complaint from Robert Morell of NVIDIA, who, unsurprisingly, didn't like the fact that this interface would be unavailable to his company's proprietary driver.

It will be unsurprising to most readers that the response to Robert's complaint was not 100% sympathetic. After a while, the discussion died down without any real resolution. Recently, though, Rob Clark has reported on a discussion held at the Embedded Linux Conference:

Following the discussion, I agree that dma-buf infrastructure is intended as an interface between driver subsystems. And because (for now) all the other arm SoC gl(es) stacks unfortunately involve a closed src userspace, and since I consider userspace and kernel as tightly coupled in the realm of graphics stacks, I don't think we can really claim the moral high-ground here. So I can't object to use of EXPORT_SYMBOL() instead of EXPORT_SYMBOL_GPL().

Since then, there has been no discussion at all; there has also been no move to change the symbol exports in the mainline kernel. But the shift in tone suggests that positions may be softening, and that the buffer-sharing API may eventually be made available to proprietary modules.


Oracle offering DTrace for Linux

Oracle has announced the availability of a beta release of the DTrace tracing framework, ported to its "Unbreakable Enterprise Kernel." There is not a lot of information currently about how the port works or how to use it; the DTrace on Linux forum contains only a "welcome" message. There is a usage example in this weblog post by Wim Coekaerts.


Kernel development news

Short sleeps suffering from slack

By Jonathan Corbet
February 17, 2012
As a general rule, kernel developers will go out of their way to avoid breaking user-space code, even when that code is seen as being wrong and already broken. But there are exceptions; a recent discussion regarding timer behavior may prove to be an example of how such exceptions can come about.

The C-library sleep() function is defined to put the calling process to sleep for at least the number of seconds specified. One might think that calling sleep() with an argument of zero seconds would make relatively little sense; why put a process to sleep for no time? It turns out, though, that some developers put such calls in as a way to relinquish the CPU for a short period of time. The idea is to be nice and allow other processes to run briefly before continuing execution. Applications that perform polling or are otherwise prone to consuming too much CPU are often "fixed" with a zero-second sleep.

Once upon a time in Linux, sleep(0) would always put the calling process to sleep for at least one clock tick. When high-resolution timers were added to the kernel, the behavior changed: if a process asked to sleep on an already-expired timer (which is the case for a zero-second sleep), the call simply returned directly back to the calling process. Then came the addition of timer slack, which can extend sleep periods to force multiple processes to wake at the same time. This behavior will cause timers to run a little longer than requested, but the result is fewer processor wakeups and, thus, a savings of power. In the case of a zero-second sleep, the addition of timer slack turns an expired timer into one that is not expired, so the calling process will, once again, be put to sleep.

The default timer slack, at 50µs, is unlikely to cause visible changes to the behavior of most applications. But it seems that, on some systems, the timer slack value is set quite high - on the order of seconds - to get the best power behavior possible. That can extend the length of a zero-second sleep accordingly, leading to misbehaving applications.
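To see why a large slack value matters for a zero-second sleep, consider a deliberately simplified model (not the kernel's actual timer code): each timer may fire anywhere between its requested expiry and that expiry plus the slack, and wakeups are delayed as long as allowed so that overlapping windows share one wakeup. The function name and the sample expiry times below are assumptions made for illustration.

```python
# Simplified model of timer slack: every timer may fire anywhere in
# [expiry, expiry + slack]. Wakeups are placed as late as allowed so that
# timers with overlapping windows share a single wakeup.
# This is an illustration only, not how the kernel implements it.


def coalesce_wakeups(expiries, slack):
    """Return the wakeup times needed to serve all timers (times in seconds)."""
    windows = sorted((t + slack, t) for t in expiries)  # (latest, earliest)
    wakeups = []
    for latest, earliest in windows:
        if wakeups and earliest <= wakeups[-1]:
            continue  # the previous wakeup already falls inside this window
        wakeups.append(latest)
    return wakeups


# A zero-second sleep (expiry 0.0) next to two other timers.
timers = [0.0, 0.4, 0.9]

print(len(coalesce_wakeups(timers, 50e-6)))  # default-sized slack: no sharing
print(coalesce_wakeups(timers, 1.0))         # slack of a second: one late wakeup
```

With the small slack, the three timers need three wakeups and the zero-second sleeper is barely delayed. With one second of slack, all three are served by a single wakeup at the one-second mark, so the "zero-second" sleep lasts a full second in this model.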

Matthew Garrett, working under the notion that breaking applications is bad, submitted a patch adding a special case for zero-second sleeps. The idea is simple: if the requested sleep time is zero, timer slack will not be added and the process will not be delayed indefinitely. The problem with this approach is that the process will still not get the desired result: rather than yielding the processor, it will have simply performed a useless system call and gone right back to whatever it was doing before. Without timer slack, a request to sleep on an expired timer will return directly to user space without going through the scheduler at all.

An alternative would be to transform sleep(0) into a call to sched_yield(). But that idea is not hugely popular with the scheduler developers, who think that calls to sched_yield() are almost always a bad idea. It is better, they say, to fix the applications to stop polling or doing whatever else it is that they do that causes developers to think that explicitly yielding the CPU is the right thing to do.

According to Matthew, the number of affected applications is not tiny:

Checking through an exploded Fedora kernel tree suggests around 125 packages out of 11000 or so, so around 1% of userspace seems to use sleep(0) under certain circumstances. We can probably fix everything in the distribution, but that suggests that there's also going to be a significant amount of code in the outside world that's also broken.

Normal practice in kernel development would be to try to avoid breaking those applications if possible. Even in cases where applications are relying on undefined and undocumented behavior - certainly the case here - it is better if a kernel upgrade doesn't turn working code into broken code. Some participants have suggested that the same approach should be taken in this case.

The situation with sleep(0) is a little different from others, though. Application developers cannot claim a long history of working behavior in this case, since the kernel's response to a zero-second sleep has already changed a few times over the course of the last decade. And, according to Thomas Gleixner, it is hard to know when the special case applies or what should be done:

Dammit, we cannot come up with a reasonable definition for special casing that stuff simply because you cannot draw a clear boundary what to special case and what not. And there is no sensible definition for what to do - return right away or go through schedule() or what ever.

Thomas worries that there may be calls for special cases for similar calls - single-nanosecond calls to nanosleep(), for example - and that the result will be an accumulation of cruft in the core timer code. So, rather than try to define these cases and maintain the result indefinitely, he thinks it is better just to let the affected code break in cases where the timer slack has been set to a large value. And that is where the discussion faded away, suggesting that nothing will be done in the kernel to reduce the effect of timer slack on zero-second sleeps.


The Linaro Connect scheduler minisummit

February 22, 2012

This article was contributed by Paul McKenney

I had the privilege of acting as moderator/secretary for the recent Scheduler Mini-Summit at Linaro Connect, which was attended by Peter Zijlstra (Red Hat), Paul Turner (Google), Suresh Siddha (Intel), Venki Pallipadi (Google), Robin Randhawa (ARM), Rob Lee (Freescale assigned to Linaro), Vincent Guittot (ST-Ericsson assigned to Linaro), Kevin Hilman (TI), Mike Turquette (TI assigned to Linaro), Peter De Schrijver (Nvidia), Paul Brett (Intel), Steve Muckle (Qualcomm), Sven-Thorsten Dietrich (Huawei), and was ably organized by Amit Kucheria (Linaro). Rough notes from the session can be found here.

The main goals of the mini-summit were as follows:

  1. Take a first step toward planning any Linux-kernel scheduler changes that might be needed for ARM's upcoming big.LITTLE [PDF] systems to work well (see also Nicolas Pitre's LWN article).

  2. Create a power-aware infrastructure for scheduling and related Linux kernel subsystems. For example, integrate dyntick-idle, cpufreq, cpuidle, sched_mc, timers, thermal framework, pm_qos, and the scheduler.

  3. Provide a usable mechanism that reliably allows all work (present and future) to be moved off of a CPU so that said CPU can be powered off and back on under user-application control. CPU hotplug is used for this today, but has some serious side effects, so it would be good to either fix CPU hotplug or come up with a better mechanism—or, in the best Linux-kernel tradition, both. Such a mechanism might also be useful to the real-time people, who also need to clear all non-real-time activity from a given CPU.

How well did we meet these goals? Read on and decide for yourself! To that end, the remainder of this article is organized as follows:

  1. Overview of ARM big.LITTLE Systems
  2. Major Issues Considered
  3. Future Work and Prospects
  4. Conclusions

Following this is the inevitable Answers to Quick Quizzes.

Overview of ARM big.LITTLE Systems

ARM's big.LITTLE systems combine the Cortex-A7 and Cortex-A15 processors. Both processors are implementations of the ARMv7 architecture and they execute the same code. ARM stated the little Cortex-A7 design was focused on energy efficiency at the expense of performance. The bigger Cortex-A15 design was, instead, focused on performance at some cost to energy efficiency. In practice this means the little core will be somewhat quicker and a lot more power efficient than today's Cortex-A8: a multi-core configuration of these little cores could run today's smartphones. The big core will significantly outperform Cortex-A9 within a similar power budget.

Quick Quiz 1: But what if there is a different number of Cortex-A7s than of Cortex-A15s? Answer

One way to use a big.LITTLE system is to have equal numbers of Cortex-A7 and Cortex-A15 CPUs paired up, so that only one CPU of a given pair is running at a time. This pairing is “a continuation of dynamic voltage/frequency scaling by other means”. To see this, imagine the Cortex-A15 initially running at maximum clock frequency, with the voltage and frequency decreasing until the performance is barely greater than that of the Cortex-A7 CPU. At this point, the firmware switches the software context from the Cortex-A15 to the Cortex-A7, with the Cortex-A7 initially running at its maximum clock frequency, but at lower power than the Cortex-A15.

Quick Quiz 2: Why scale down? Isn't it always better to run full out in order to race to idle? Answer

The voltage and frequency of the Cortex-A7 can then be further decreased, in turn further decreasing the power consumption.

For some implementations, thermal limitations would require that the Cortex-A15 CPUs be used only for short bursts at maximum frequency, as was discussed at length at the summit. However, I have since learned that many other implementations are expected to be fully capable of running the Cortex-A15 CPUs at maximum frequency indefinitely.

The switch between the Cortex-A7 and Cortex-A15 CPUs is implemented in firmware, but Grant Likely, Nicolas Pitre, and Dave Martin are moving this functionality into the Linux kernel.

In many big.LITTLE designs, it is also possible to run both the Cortex-A7 and Cortex-A15 CPUs concurrently in a shared-memory configuration. However, this means that the Linux kernel sees the big.LITTLE architecture, which in turn raises the issues discussed in the next section.

Major Issues Considered

Those of you who know the personalities in attendance will not be surprised to hear that the discussions were both spirited and wide-ranging. However, most of the discussion centered around the following four major issues:

  1. Benchmarks and Synthetic Workloads
  2. Parallel Hardware/Software Development
  3. What Do You Do With a LITTLE CPU?
  4. CPU Hotplug: Kill It or Cure It?

Each of these issues is covered in one of the following sections.

Benchmarks and Synthetic Workloads

The biggest and most pressing issue facing SMP-style big.LITTLE systems is the lack of packaged Linux-kernel-developer-friendly benchmarks and synthetic workloads. C programs and sh, perl, and python scripts can be friendly to Linux-kernel developers, while benchmarks requiring (for example) an Android SDK or a specific device will likely be actively ignored.

It is critically important for benchmarks to provide a useful “figure of merit”, which should encompass both user experience and estimated power consumption. For example, a synthetic workload that models a user browsing the web on a smartphone might have a smaller-is-better estimate of average power consumption, but also have the constraint that the system respond to emulated browser actions within (say) 500ms. If the response time is within the 500ms constraint, then the figure of merit is the estimated average power consumption, but if that constraint is exceeded, the figure of merit is a very large number. The exact computation of the figure of merit will vary from benchmark to benchmark.
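The web-browsing example above can be sketched in a few lines. This is a minimal illustration, not a proposed benchmark: the function name, the milliwatt unit, and the penalty value standing in for "a very large number" are all assumptions.

```python
# Figure of merit for the hypothetical web-browsing workload described above:
# smaller is better, and a missed response-time constraint dominates everything.

PENALTY = 1e9  # assumed stand-in for "a very large number"


def figure_of_merit(avg_power_mw, response_times_ms, limit_ms=500.0):
    """Return average power if every response met the limit, else PENALTY."""
    if any(t > limit_ms for t in response_times_ms):
        return PENALTY
    return avg_power_mw


print(figure_of_merit(320.0, [120.0, 480.0, 95.0]))   # constraint met
print(figure_of_merit(150.0, [120.0, 730.0, 95.0]))   # one response too slow
```

The second run draws less power but still scores far worse, which is the point: a configuration cannot win by saving energy at the expense of the user experience.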

Currently, some rough-and-ready workloads are in use. For example, Vincent Guittot used cyclictest in his work. While this did get the job done for Vincent, something better adapted to embedded/mobile workloads than to real-time computing would be quite welcome. Zach Pfeffer of Linaro will be doing some workload creation in his group; however, given the wide variety of mobile and embedded workloads, additional contributions would also be welcome.

Finally, the scheduler maintains a great number of statistics and tracepoints. A “schedtop”-style tool that provides a mobile/embedded view of this information would be very valuable.

Parallel Hardware/Software Development

Even if you don't know exactly when a given piece of hardware will be available, it is a good bet that it will become available too late to get the needed software running on it. It is therefore critically important to have some way to develop the needed software before the hardware is available. Thankfully, there are a number of ways to test big.LITTLE scheduler features before big.LITTLE hardware becomes available.

One crude but portable method is to create a SCHED_FIFO thread on each LITTLE-designated CPU, and to have this thread spin, burning CPU, for (say) one millisecond out of every two milliseconds. This approach perturbs the scheduler's preemption points, particularly the wake-up preemptions. Nevertheless, this approach is likely to be quite useful.
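A rough user-space sketch of such a spinner follows. It assumes Linux and the Python standard library's os scheduling calls; the function names and the default one-millisecond-in-two duty cycle are illustrative, and a real experiment would more likely use a small C program to keep interpreter overhead out of the picture.

```python
import os
import time


def try_sched_fifo(priority=1):
    """Request SCHED_FIFO for the calling process; this needs privileges."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except (AttributeError, OSError):
        return False  # not Linux, or not permitted


def burn_duty_cycle(periods, busy_s=0.001, period_s=0.002, cpu=None):
    """Spin for busy_s out of every period_s; return the periods completed."""
    if cpu is not None:
        os.sched_setaffinity(0, {cpu})  # pin to the LITTLE-designated CPU
    completed = 0
    next_start = time.perf_counter()
    for _ in range(periods):
        busy_until = next_start + busy_s
        while time.perf_counter() < busy_until:
            pass  # burn CPU so that other tasks see a slower processor
        next_start += period_s
        remaining = next_start - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)  # leave the rest of the period to others
        completed += 1
    return completed
```

One such process would be started per LITTLE-designated CPU, calling try_sched_fifo() first and then burn_duty_cycle() with the cpu argument set. Without the real-time priority the spinner merely competes with other tasks instead of preempting them, so the emulation is much weaker.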

A less portable but more accurate approach is to constrain the clock frequency of the CPUs so that the big-designated CPUs have a lower bound on their frequency and the LITTLE-designated CPUs have an upper bound on their frequency. The way to do this is via the sysfs files in the /sys/devices/system/cpu/cpu*/cpufreq directories, the most pertinent of which are described below.

Quick Quiz 3: I typed the following commands:
  cd /sys/devices/system/cpu/cpu1/cpufreq
  sudo echo 800000 > scaling_max_freq
Despite the sudo, I got “Permission denied”. Why doesn't sudo give me sufficient permissions? Answer

Echoing a number into the scaling_max_freq file will require that the corresponding CPU's frequency be no greater than the specified number in kHz. Echoing a number into the scaling_min_freq file will require that the corresponding CPU's frequency be at least the specified number in kHz. Reading the scaling_available_frequencies file will list out the frequencies (again in kHz) that the corresponding CPU is capable of running at. For example, the laptop I am typing on gives the following list:
    2534000 2533000 1600000 800000
Reading the affected_cpus file lists the CPUs whose core clock frequencies must move in lockstep with the corresponding CPU. On my laptop, each CPU's frequency may be varied independently, but it is not unusual for a given “clock domain” to contain multiple CPUs, which then must all run at the same frequency, for example, on systems with hardware threads. Reading the scaling_cur_freq file gives you the kernel's opinion on what frequency the corresponding CPU is running at. Reading the cpuinfo_cur_freq file, instead, gives you the hardware's opinion on what frequency the corresponding CPU is running at, which might or might not match the kernel's opinion, so you should most definitely experiment to make sure that all of this is doing what you want on your particular hardware and kernel.
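The sysfs files described above can also be driven from a script. The sketch below is a minimal illustration that assumes the cpufreq layout described in this section; the helper names are hypothetical, the root parameter exists only so the code can be tried against a scratch directory, and writing to the real files requires root privileges.

```python
from pathlib import Path

CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def cpufreq_file(cpu, name, root=CPUFREQ_ROOT):
    """Path of one cpufreq attribute, e.g. cpu1/cpufreq/scaling_max_freq."""
    return Path(root) / f"cpu{cpu}" / "cpufreq" / name


def read_khz_list(cpu, name, root=CPUFREQ_ROOT):
    """Read a whitespace-separated list of frequencies in kHz."""
    return [int(field) for field in cpufreq_file(cpu, name, root).read_text().split()]


def write_khz(cpu, name, khz, root=CPUFREQ_ROOT):
    """Write one frequency in kHz to a cpufreq attribute."""
    cpufreq_file(cpu, name, root).write_text(f"{khz}\n")


def emulate_big_little(big_cpus, little_cpus, big_min_khz, little_max_khz,
                       root=CPUFREQ_ROOT):
    """Put a floor under the big-designated CPUs and a ceiling on the LITTLE ones."""
    for cpu in big_cpus:
        write_khz(cpu, "scaling_min_freq", big_min_khz, root)
    for cpu in little_cpus:
        write_khz(cpu, "scaling_max_freq", little_max_khz, root)
```

The values passed in should come from scaling_available_frequencies, and CPUs listed together in affected_cpus need to be designated big or LITTLE as a group, since they cannot be clocked independently.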

For more information, see Documentation/cpu-freq in the Linux kernel source directory.

There was also some discussion of ways that the linsched user-mode scheduler simulator might help with prototyping.

Finally, it is possible to use T-states on Intel platforms to emulate a big.LITTLE system. According to Paul Brett:

Intel Architecture processors provide a clock modulation control exposed as the MSR_IA32_THERM_CONTROL MSR. This MSR can be used to reduce the effective clock frequency for each core independently in 12.5% increments from 100% down to 12.5%. Under normal conditions, the least significant 5 bits of the MSR are cleared to indicate 100% performance. To enable clock modulation, set bit 4 of this MSR to 1 and write a value from 1-7 in bits 1-3 (where 7 is 87.5% equivalent performance and 1 is 12.5% equivalent performance). More information on clock modulation can be found in volume 3 of the Intel IA64/IA32 Software Developers Manual, under Thermal Monitoring and Protection. Please note that the effect of clock modulation approximates running the CPU at a lower frequency - in benchmarks we have noted up to a 5% variance in performance between clock modulation and running the same core at the equivalent frequency.
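The bit layout in Paul's description can be turned into a small encoder. This sketch only computes the value for the low five bits of the MSR, following the quoted text; it does not read or write any MSR, and the function names are hypothetical.

```python
# Encode the clock-modulation field as described in the quote above:
# bit 4 enables modulation, bits 1-3 hold a level from 1 to 7, and each
# level is worth 12.5% of full performance. All five bits clear means 100%.


def clock_modulation_bits(level):
    """Return the low five bits of the MSR for a duty level of 1 through 7."""
    if not 1 <= level <= 7:
        raise ValueError("level must be between 1 and 7")
    return (1 << 4) | (level << 1)


def equivalent_performance_percent(level):
    """Approximate performance for a given level, per the quoted description."""
    return level * 12.5


FULL_SPEED_BITS = 0  # modulation disabled

for level in (7, 4, 1):
    print(level, hex(clock_modulation_bits(level)),
          equivalent_performance_percent(level))
```

Level 4, for example, encodes as 0x18 and corresponds to roughly half speed, which makes it a plausible starting point for a LITTLE-designated core in an emulation.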

Quick Quiz 4: Why would anyone use an Intel system to test out an ARM capability? Answer

Although none of these approaches can be considered a perfect substitute for running on the actual big.LITTLE hardware, they are all likely to be very useful during the time until such hardware is actually available.

What Do You Do With a LITTLE CPU?

If you have both big and LITTLE CPUs, how do you decide what tasks will be banished to the slower LITTLE CPUs? Similarly, if your workload is currently running all on LITTLE CPUs, how do you decide when to take the step of starting up one of the power-hungry big CPUs?

Right now for SMP-configured big.LITTLE systems, “you” is the application developer, who can use facilities such as CPU hotplug, affinity, cpusets, sched_mc, and so on to manually direct the available work to the desired subsets of the CPUs. These facilities constrain the scheduler in order to ensure that nothing runs on CPUs that are to be powered down.

Decisions on what CPUs to use should include a number of considerations. First, if a LITTLE CPU is able to provide sufficient performance, it provides better energy efficiency, at least in cases where race to idle is inappropriate. Second, because mobile platforms have no fans and are sometimes sealed, some devices might not be able to run all the big CPUs at maximum clock rate for very long without overheating. Of course, such devices might also need to limit the heat produced by analog electronics and GPUs as well (see Carroll's and Heiser's 2010 USENIX paper [PDF] and presentation [PDF] for a power-consumption analysis of a ca. 2008 smartphone). Third, some workloads can adapt themselves to lower performance. For example, some media applications can reduce performance requirements by dropping frames and reducing resolution. Fourth, there is more to performance than CPU clock speed: For example, it is possible that a workload with high cache-miss rates can run just as fast on a LITTLE CPU as it can on a big CPU. Finally, many workloads will have preferred ways of using the CPUs, for example, some mobile workloads might use the LITTLE CPUs most of the time, but bring the big CPUs online for short bursts of intense processing.

Keeping track of all this can be challenging, which is one big reason for thinking in terms of automated assistance from the scheduler. Some of the proposed work towards this end is listed in the Future Work and Prospects section. But first, let's take a closer look at CPU hotplug and its potential replacements.

CPU Hotplug: Kill It or Cure It?

Although CPU hotplug has a checkered reputation in many circles, it is what almost all current multicore devices actually use to evacuate work from a given CPU. This is a bit surprising given that CPU hotplug was intended for infrequently removing failing CPUs, not for quickly bringing perfectly good CPUs into and out of service. It is therefore well worth asking what CPU hotplug is providing that users cannot get from other mechanisms:

  • Migrating timers off of a given CPU. This can likely be fixed, but a synchronous fix that prevents any further timers from being set may be more challenging.

  • Shutting off a CPU with a single simple action. This can likely be fixed, but requires coordinating interrupts, the scheduler, timers, kthreads, and so on.

  • Preventing all possible wakeup events from causing that CPU to power back on until the user explicitly permits it to power back on. (Some platforms may have wakeup events that cannot be shut off.)

  • Synchronous action, so that userspace can treat it atomically.

  • Coordinating user applications based on hotplug events. (However, there is no known embedded or mobile use of this feature, so if you need it, please let us know. Otherwise it will likely go away.)

These CPU-hotplug features are valuable outside of the mobile/embedded space; for example, some real-time applications will take a CPU offline and then immediately bring it back online to make it fully available for the application, in particular to clear timers off of the CPU. Furthermore, people really do make use of CPU hotplug to offline failing CPUs.

But this brings up another question. Given that CPU hotplug does all these useful things, what is not to like? First, CPU-hotplug operations can take several seconds, as shown here. An ideal power-management mechanism would have latencies in the 5ms range. It might be possible to make CPU hotplug run much faster. Second, CPU-hotplug offline operations use the stop_machine() facility, which interrupts each and every CPU for an extended time period. This sort of behavior is not acceptable when certain types of real-time or high-performance-computing (HPC) applications are running. It might be possible to wean CPU hotplug from stop_machine().

Third, a given CPU's workqueues can contain large numbers of pending items of work, and migrating all of these can be quite time consuming, as can re-initializing all the workqueue kernel threads when a given CPU comes online. Other CPU-hotplug notifiers have similar problems, which can hopefully be addressed by coming up with a good low-overhead way to “park” and “unpark” kernel threads that are associated with an offline CPU.

Quick Quiz 5: Why not just use SIGSTOP to park per-CPU kthreads? Answer

Such a parking mechanism faces the following challenges:

  • Many per-CPU kernel threads are (quite naturally) coded with the assumption that they will always run on the corresponding CPU.

  • If a kthread that has an affinity to a given CPU is awakened while that CPU is offline, the scheduler prints an error message and removes the affinity, so that the kthread will now be able to run on any CPU.

  • Wakeups can be delayed so that they do not arrive at the kthread until after the corresponding CPU has gone offline.

  • All kernel threads parked for a given offline CPU must sleep interruptibly, because otherwise the kernel will emit soft-lockup messages.

  • When a given CPU goes offline, any work pending for that CPU must either be completed immediately (thus delaying the offline operation), migrated to some other CPU (thus increasing complexity), or deferred until the CPU comes back online (which might be never).

Quick Quiz 6: What other mechanisms could be used to park per-CPU kthreads? Answer

There is some reason to believe that any mechanism that evacuates all work from a CPU faces these same challenges.

Finally, CPU-hotplug operations can destroy cpuset configuration, so that cpusets need to be repaired when CPUs are brought back online. This topic is currently the subject of spirited discussions.

Perhaps these CPU-hotplug shortcomings can be repaired. But suppose that they cannot. What should be done instead in order to evacuate all work from a given CPU?

Tasks can be moved off of a given CPU by use of explicit per-task affinity, cgroups, or cpusets, although interactions with other uses of these mechanisms need more thought. In addition, interactions among all of these mechanisms can have unexpected results because of a strong desire that the scheduler generally consume less CPU than the workload being scheduled.
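For the per-task affinity case, a minimal sketch using the Python standard library's wrappers around the Linux affinity system calls might look like the following. The function names are hypothetical, and the code handles only one task at a time; evacuating a whole CPU would mean repeating this for every task, which is where cgroups and cpusets become more practical.

```python
import os


def affinity_without(current_cpus, cpu):
    """Return the allowed-CPU set with one CPU removed."""
    remaining = set(current_cpus) - {cpu}
    if not remaining:
        raise ValueError("cannot remove the only CPU the task may run on")
    return remaining


def move_task_off(pid, cpu):
    """Restrict the task so that it may no longer run on the given CPU."""
    new_cpus = affinity_without(os.sched_getaffinity(pid), cpu)
    os.sched_setaffinity(pid, new_cpus)
    return new_cpus


print(sorted(affinity_without({0, 1, 2, 3}, 2)))  # CPUs left after removing CPU 2
```

Note that this silently overrides whatever affinity the application itself had chosen, which is one of the interactions that needs more thought.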

However, interrupts can still happen on a given CPU even after all tasks have been evacuated. Interrupts must be redirected separately using the /proc/irq directory. This directory in turn contains one directory for each IRQ, and each IRQ directory contains a smp_affinity file to which you can write a hexadecimal mask to restrict delivery of the corresponding interrupts. You can then examine the /proc/interrupts file to verify that interrupts really are no longer being delivered to the CPUs in question. See Documentation/IRQ-affinity.txt in the kernel sources for more information. One caution: that document notes that some irq controllers do not support affinity, and for such controllers it is not possible to direct irq delivery away from a given CPU.
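Computing the hexadecimal mask by hand is error-prone, so a small helper can be useful. The sketch below assumes fewer than 32 CPUs, so that a single hexadecimal word is enough; the function names are hypothetical, the proc_root parameter exists only so the code can be exercised against a scratch directory, and writing to the real files requires root privileges.

```python
from pathlib import Path


def cpus_to_hex_mask(cpus):
    """Build the hexadecimal bitmask with one bit set per listed CPU."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch assumes CPU numbers below 32")
        mask |= 1 << cpu
    return format(mask, "x")


def mask_excluding(all_cpus, excluded):
    """Mask covering every CPU in all_cpus except those in excluded."""
    return cpus_to_hex_mask(cpu for cpu in all_cpus if cpu not in excluded)


def steer_irq_away(irq, all_cpus, excluded, proc_root="/proc/irq"):
    """Write the reduced mask to the IRQ's smp_affinity file."""
    mask = mask_excluding(all_cpus, excluded)
    (Path(proc_root) / str(irq) / "smp_affinity").write_text(mask + "\n")
    return mask


print(mask_excluding(range(4), {2}))  # CPUs 0, 1, and 3 on a four-CPU system
```

On a four-CPU system, excluding CPU 2 yields the mask "b". As the text cautions, the write may be rejected for interrupt controllers that do not support affinity, so checking /proc/interrupts afterward remains necessary.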

Finally, evacuating tasks and interrupts from a given CPU can still leave timers running on that CPU. As noted earlier, there is currently no mechanism other than CPU hotplug to migrate timers off of a given CPU, but it should be possible to create such a mechanism. An asynchronous mechanism (which would allow each timer one final ride on the outgoing CPU) is straightforward. A synchronous mechanism would be more complex, but should be doable.

So, what should be done? In the near term, the only sane approach is to attack on both fronts: (1) attempt to cure CPU hotplug of its worst ills (especially given that it will likely continue to be needed for removing failing CPUs), and (2) attempt to improve the alternative mechanisms so that they can do more of the work that can currently only be done by CPU hotplug—hopefully avoiding at least some of the complexity currently inherent to CPU hotplug.

Future Work and Prospects

In the short term, the following actions need to be taken:

  • Create an email list for the attendees and other interested parties. This is now available here, courtesy of Amit Kucheria and Loic Minier.

  • Document best practices for using existing Linux kernel facilities (including CPU hotplug, cgroups, cpusets, affinity, and so on) to manage big.LITTLE systems in an SMP configuration. This documentation should include measurements of latencies (keeping in mind the 5ms goal for evacuating work from a CPU and for restarting it) and power consumption. Vincent Guittot's presentation and git tree are a good start in this direction.

  • Create software to emulate big.LITTLE systems on current hardware, for example, using one or more of the approaches described in the Parallel Hardware/Software Development section.

  • Produce Linux-kernel-developer-friendly synthetic workloads and benchmarks for mobile applications and use cases, as discussed in the Benchmarks and Synthetic Workloads section. There will be some Linaro work in this direction, but additional workloads and benchmarks are welcome from any and all.

In the medium term, the following additional actions are needed:

  • Experiment with improving cpusets and cgroups as discussed in the CPU Hotplug: Kill It or Cure It? section.

  • Experiment with curing CPU hotplug, also discussed in the CPU Hotplug: Kill It or Cure It? section.

  • Accumulate a list of the attributes that system-on-a-chip (SoC) vendors believe to be important to scheduling and managing big.LITTLE systems. An initial list was accumulated during the scheduler summit:

    • Power-domain and clock-domain constraints. For example, many ARM SoCs require that all CPUs in a cluster run at the same clock rate.

    • Thermal tradeoffs. For example, some SoCs might impose a tradeoff between the number of CPUs running at a given time and the frequency at which they are running if they are to avoid thermal throttling.

    • Thermal feedback, e.g., temperature sensors.

    • Process type, where the amount of leakage current can affect the optimal strategies for power-efficient operation.

    • Relative benefit of reducing frequency of several CPUs as opposed to consolidating workload on a small number of CPUs.

    • Instruction-per-clock (IPC) measurements and correlation between clock rate and useful forward progress.

  • Remove sched_mc.

In the longer term, the following additional actions would be quite helpful:

  • Port Frederic Weisbecker's OS-jitter-reduction patchset to ARM. Geoff Levand of Huawei is leading an effort along these lines.

  • Contact gaming companies (e.g., Epic) to see if their 3D gaming engines (which run on both iPhone and Android) can make good use of big.LITTLE, even in the presence of thermal throttling.

  • Investigate alternative scheduler disciplines. For example, the prototype SCHED_EDF patchset would allow tasks to specify deadlines, which would allow the scheduler to better decide between race-to-idle and run-at-low-frequency. Other related scheduler disciplines such as EVDF might be useful, and there may be other real-time technologies that can be commandeered for energy-efficiency purposes.

    If SCHED_EDF looks useful to mobile/embedded, someone needs to forward-port the patch and fix a number of issues in it. This is not a trivial project. (Paul Turner is looking into bringing some Google resources to bear, and Juri Lelli has been doing some recent deadline-scheduler work.)

    One common mobile/embedded requirement is to consolidate the workload down to the minimum number of CPUs that can support acceptable user experience, then spread the load across that minimal set of CPUs.

  • Investigate modal scheduling. Paul Turner gave the following list of modes as an example:

    • Low load of interactive, low-utilization tasks might favor race to idle.

    • Moderate load of periodic media-feeding tasks might lower frequency to the smallest value that allows the task to keep up with its hardware.

    • High load of CPU-bound tasks in the absence of thermal limitations might increase frequency.

    Some hysteresis will be required. It is usually OK to delay the decisions a bit, especially given that ARM provides relatively fast transitions between power states. Paul Turner posted a first step in this direction, with a patch series that improves the scheduler's ability to estimate the effect of migration on each CPU's load. A more up-to-date series is maintained here.
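The modal-scheduling idea and its hysteresis requirement can be sketched as a small state machine. This is only an illustration under stated assumptions: the mode names, the `HYSTERESIS_SAMPLES` value, and the function names are hypothetical and were not proposed at the summit. The sketch switches modes only after several consecutive observations agree, so a brief burst of load does not cause a mode flip.

```c
/* Hypothetical modes, loosely following the three example modes above. */
enum sched_mode {
    MODE_RACE_TO_IDLE,   /* low load of interactive tasks */
    MODE_LOWEST_FREQ,    /* moderate load of periodic media-feeding tasks */
    MODE_HIGH_FREQ       /* high load of CPU-bound tasks, no thermal limit */
};

/* Assumed value: consecutive agreeing samples needed before switching. */
#define HYSTERESIS_SAMPLES 3

struct mode_state {
    enum sched_mode current;    /* mode currently in effect */
    enum sched_mode candidate;  /* mode suggested by recent samples */
    int agree;                  /* consecutive samples suggesting candidate */
};

static void mode_init(struct mode_state *s, enum sched_mode initial)
{
    s->current = initial;
    s->candidate = initial;
    s->agree = 0;
}

/* Feed one observation; returns the mode in effect after hysteresis. */
static enum sched_mode mode_update(struct mode_state *s,
                                   enum sched_mode observed)
{
    if (observed == s->current) {
        /* Observation matches the current mode: drop any pending switch. */
        s->candidate = s->current;
        s->agree = 0;
    } else if (observed == s->candidate) {
        /* Same candidate again: switch once enough samples agree. */
        if (++s->agree >= HYSTERESIS_SAMPLES) {
            s->current = observed;
            s->agree = 0;
        }
    } else {
        /* A new candidate: start counting from one. */
        s->candidate = observed;
        s->agree = 1;
    }
    return s->current;
}
```

A real implementation would also have to classify the load in the first place, which is the hard part and is not attempted here.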

Conclusions

The scheduler mini-summit at Linaro Connect was quite productive, with work already in progress to implement some of the recommendations. For example, some code and patches are in flight to reduce RCU's dependence on stop_machine(), which is a first step towards weaning CPU hotplug from stop_machine(). For another example, Srivatsa Bhat is doing some good work on curing CPU hotplug of some of its ills.

So how did we do against the goals? Let's check:

  1. Take the first step towards planning any Linux-kernel scheduler changes that might be needed to make ARM's upcoming big.LITTLE systems work well.

    The most important actions toward this goal are the emulation of big.LITTLE systems, the mobile/embedded synthetic benchmarks/workloads, and the list of SoC attributes. This information will help work out which of the longer-term actions are most important.

  2. Create a power-aware infrastructure for scheduling and related Linux kernel subsystems. For example, integrate dyntick-idle, cpufreq, cpuidle, sched_mc, timers, thermal framework, pm_qos, and the scheduler.

    The mobile/embedded synthetic benchmarks/workloads are the most important first step in this direction, as is the list of SoC attributes. The removal of sched_mc is a first implementation step, on the theory that one must tear down before one can build.

  3. Provide a usable mechanism that reliably allows all work (present and future) to be moved off of a CPU so that that CPU can be powered off and back on under user-application control. CPU hotplug is used for this today, but has some serious side effects, so it would be good to either fix CPU hotplug or come up with a better mechanism—or, in the best Linux-kernel tradition, both. Such a mechanism might also be useful to the real-time people, who also need to clear all non-real-time activity from a given CPU.

    This goal received the most discussion, which produced the medium-term actions for curing CPU hotplug on the one hand and improving the alternatives to CPU hotplug on the other.

Work on these three goals has only just begun, but with continued effort, we can make the Linux kernel work better for big.LITTLE in particular and for mobile/embedded workloads on asymmetric systems in general.

Acknowledgments

I am grateful to the scheduler mini-summit attendees for many useful and enlightening discussions, and especially to Amit Kucheria for organizing the mini-summit. We all owe thanks to Zach Pfeffer, Peter Zijlstra, Amit Kucheria, Robin Randhawa, Jason Parker, Rusty Russell, Vincent Guittot, and Dave Rusling for helping make this article human readable. I owe thanks to Dave Rusling and Jim Wasko for their support of this effort.

Answers to Quick Quizzes

Quick Quiz 1: But what if there is a different number of Cortex-A7s than of Cortex-A15s?

Answer: In that case, it is necessary to remove the excess CPUs from service, for example, using CPU hotplug, before carrying out the switch.

Back to Quick Quiz 1.

Quick Quiz 2: Why scale down? Isn't it always better to run full out in order to race to idle?

Answer: Although it often is best to race to idle, there are some important exceptions to this rule in certain mobile/embedded workloads. For one example, imagine a codec that required the CPU to occasionally do some work to provide the codec with data. Because CPU power consumption often rises as the square of the core clock frequency, you typically get the best battery life by running the CPU at the lowest frequency that gets the work done in time. As always, use the right tool for the job!
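As a rough worked example of that reasoning, assume (as the answer does) that power rises as the square of the clock frequency, and further assume that the work takes a fixed number of cycles. Run time then scales as 1/f, so the energy for a fixed amount of work scales as f² × (1/f) = f. The sketch below encodes just that arithmetic; it ignores leakage and idle power, which are exactly the effects that can tip the balance back towards race to idle, and the function names are illustrative only.

```c
/* Power relative to full speed, assuming power ~ f^2 (a simplification). */
static double relative_power(double freq_ratio)
{
    return freq_ratio * freq_ratio;
}

/* Time to finish a fixed number of cycles, relative to full speed. */
static double relative_time(double freq_ratio)
{
    return 1.0 / freq_ratio;
}

/* Energy = power * time for the same work, relative to full speed. */
static double relative_energy(double freq_ratio)
{
    return relative_power(freq_ratio) * relative_time(freq_ratio);
}
```

Under these assumptions, running at half the clock rate takes twice as long but uses about half the energy, provided that the codec still gets its data in time.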

Back to Quick Quiz 2.

Quick Quiz 3: I typed the following command:

    sudo echo 800000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq

Despite the sudo, I got “Permission denied”. Why doesn't sudo give me sufficient permissions?

Answer: Although that command does give echo sufficient permissions, the actual redirection is carried out by the parent shell process, which evidently does not have sufficient permissions to open the file for writing. One way to work around this is to run sudo bash and then the echo from the resulting root shell; another is to do something like:

    sudo sh -c 'echo 800000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq'

Another approach is to use tee, for example:

    echo 800000 | sudo tee /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq

Yet another approach uses dd as follows:

    echo 800000 | sudo dd of=/sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq

Back to Quick Quiz 3.

Quick Quiz 4: Why would anyone use an Intel system to test out an ARM capability?

Answer: The scheduler is core code, and for the most part does not care about which instruction-set architecture is running. The important thing is not the ISA, but rather the performance characteristics.

Back to Quick Quiz 4.

Quick Quiz 5: Why not just use SIGSTOP to park per-CPU kthreads?

Answer: It might well work, at least given appropriate adjustments. Please try it and let us know how it goes.

Back to Quick Quiz 5.

Quick Quiz 6: What other mechanisms could be used to park per-CPU kthreads?

Answer: Here are some possibilities:

  • Kill the kthreads at CPU-offline time and recreate them at CPU-online time. This is used today, and is quite slow.

  • Kill the kthreads at CPU-offline time and recreate them at CPU-online time, as above, but create a separate CLONE_ flag to prevent the parent from waiting until the child runs. This waiting behavior exists to work around an old bash bug, and is not needed for in-kernel kthreads.

  • Kill the kthreads at CPU-offline time, but don't recreate them until they are actually needed, perhaps using a separate high-priority kthread to allow the creation to be initiated from environments where blocking is prohibited. This might work well in some situations, but does increase the state space significantly.

  • Have the kthreads block while their CPU is offline. This approach faces some complications:

    • Wake-ups can be delayed, so that a delayed wakeup might arrive at a kthread after the corresponding CPU has gone offline.

    • If a task that has an affinity to a given CPU awakens while that CPU is offline, the scheduler prints a warning message and breaks affinity. This breaking of affinity can cause failures in kthreads written to assume that they only run on their CPU.

    • Sleeping uninterruptibly while a CPU is offline can result in spurious soft-lockup warnings.

  • Use an explicit rendezvous to park each kthread, setting its affinity mask to cover all CPUs and informing it that it needs to remain quiescent. This operation would be reversed when the CPU comes back online. This works, but is often surprisingly difficult to get right, particularly on busy systems where wakeups can be delayed.

  • Your idea here.

It is quite likely that different approaches will be required in different situations.

Back to Quick Quiz 6.


Subtle interactions in the embedded world - what bugs can teach us

February 22, 2012

This article was contributed by Neil Brown

I must admit that I love bugs. I also hate bugs, which leads to a very conflicted relationship with the blighters, but for now I just want to focus on the positive - I love bugs. The bugs I am talking about are, of course, the incorrect behavior of software systems due to incorrect code, and the reason that I love them is that they are so educational. There are a number of reasons for this, the significant one for the present being that they provide motivation and focus to read and study new code.

One of the freedoms that free and open source software provides is the freedom to study the source code. However, having that freedom doesn't mean it's easy to use it. When faced with a large body of code such as the Linux kernel, it can be hard to know where to start, or when to move on. There is no story-line, no curriculum, no plot to guide your study. This is where a bug comes in: it can provide a story line. This may be a particular sequence of events or a particular combination of features, or just a starting point to spiral out from. But by providing a clear focus and a clear finishing point, it makes study easier and more rewarding.

More than a phone

I have recently been spending some time trying to make the Linux kernel run well on the GTA04 "Phoenux" replacement motherboard for the Openmoko "Freerunner" and "Neo 1973" mobile phones. This is an attempt to revitalize the Openmoko effort to provide an open (or at least "as open as possible") platform for a mobile computing device. It fits in the same case, uses the same display and speakers, but provides more memory, a faster processor, and faster networking plus a few other extras. During this (ongoing) effort I have discovered a number of bugs and missing features, most of which only came with smaller learning experiences. One, though, stands out for leading me on a somewhat longer path of discovery than most, and so I would like to share that path and those discoveries so that others might benefit. I should note first that the path is still open. I have not actually fixed the bug, so others could follow the path themselves. Anyone keen to do that should stop after the next paragraph, get a GTA04, and start hunting themselves. Others can read on.

The symptom of the bug is easy to describe. Like most computer systems the GTA04 has an RTC (real time clock) which has an alarm function. When the time in the clock matches the time in the alarm, it generates an interrupt. The problem is that, though this interrupt works perfectly while the system is awake, it does not reliably wake the device when it is asleep (i.e. in suspend-to-RAM state). Other interrupts configured much the same way (for example data received on the serial console) wake the system with 100% reliability. However the RTC alarm almost always fails. Understanding this bug leads on a path through the interrupt management code, the suspend management code, and even the USB code. But it must start with a brief description of the hardware.

In the GTA04 there are two particular integrated circuits (ICs) that are relevant. One is the OMAP3 SoC (system-on-a-chip) from TI. The other is the PMIC (power management IC) which is referred to as "twl4030" in the various drivers - various because, like the OMAP3, it is a multi-function chip combining diverse functions such as battery charging, keyboard control, and the USB electrical interface. There are a number of connections between these chips but only two are relevant for now. These are a single interrupt line and an I2C bus.

The interrupt line signals the interrupt controller (INTC) in the OMAP3 that something has happened. The I2C (inter-integrated-circuit) bus is a simple 2-wire bus that allows the CPU to read and write registers in the PMIC. In particular it can be used to find out which of the several functions generated the interrupt, and why. It can also be used to clear the source of the interrupt.

Having one interrupt line that indicates any of a number of possible events like this is not uncommon. For example the UARTs (Universal Asynchronous Receiver Transmitter - a serial port) in the OMAP3 each have a single interrupt, but it can be generated by a character arriving, a character transmit having completed, or an error condition. As with most such devices there is a register that can be inspected which has one bit for each possible interrupt source.
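Decoding such a status register is a matter of testing one bit per source. The following sketch shows the idea; the bit positions and names are hypothetical, since a real UART's data sheet defines its own layout.

```c
#include <stdint.h>

/* Hypothetical bit assignments; a real UART defines its own. */
#define UART_IRQ_RX_READY  (1u << 0)  /* a character has arrived */
#define UART_IRQ_TX_DONE   (1u << 1)  /* a character transmit has completed */
#define UART_IRQ_ERROR     (1u << 2)  /* an error condition was detected */

struct uart_events {
    int rx;
    int tx;
    int err;
};

/* Decode one read of the status register into per-source flags. */
static struct uart_events uart_decode_status(uint32_t status)
{
    struct uart_events ev;

    ev.rx  = (status & UART_IRQ_RX_READY) != 0;
    ev.tx  = (status & UART_IRQ_TX_DONE) != 0;
    ev.err = (status & UART_IRQ_ERROR) != 0;
    return ev;
}
```

A single driver can act on all three flags in one interrupt handler, which is why this case is easy to manage.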

When all the interrupt sources are related to a single device and handled by a single driver this is all quite easy to manage. However when the different interrupt sources are related to logically separate devices, having to manage all these behind a single interrupt could get clumsy.

So many interruptions

To address this situation, Linux has an abstraction called an "irq_chip". An irq_chip represents a set of interrupt sources each of which is assigned an interrupt number (as listed in /proc/interrupts). It provides functions to enable or disable each interrupt, to allow the interrupt to wake the device from suspend, to set the trigger type (edge or level), and various other functions. It also arranges that the interrupt handler for each interrupt will be called when appropriate.

[Figure: irq_chip layout (irq_chips with sources)]

Using this, the driver for the INTC in the OMAP3 defines an irq_chip for all the interrupts that it knows about directly, and the driver for the PMIC can define a separate irq_chip which describes the various interrupt sources in the PMIC. Each irq_chip will provide very different functions to configure and control the interrupt sources, but will provide a uniform interface to all device drivers. When the single PMIC interrupt arrives at the INTC, it will call an interrupt handler which will examine the PMIC registers and then call the appropriate handler that was registered with the second irq_chip. This way the drivers for the various parts of the PMIC don't need to know too much about each other - the irq_chip mediates between them.

In our PMIC there is a "Primary Interrupt Status Register" which has one bit for each module that can generate interrupts. Each of these modules has its own "Secondary Interrupt Status Register" to report what actually caused the interrupt, much like the UART described earlier. There is normally no need for a tertiary irq_chip to represent these registers as the one driver manages the whole module and all of its interrupts.

There are two exceptions, only one of which interests us. There is a "power interrupts" module in the PMIC which combines interrupts from a diverse range of sources: press of the power button, insertion of USB cable, over-temperature warning, and the RTC alarm. These are sufficiently varied that a separate irq_chip makes sense here. The twl4030-irq driver manages all this quite elegantly, providing a secondary irq_chip for the device as a whole, and supporting tertiary irq_chips, including for the Power Interrupts module.

So when the RTC alarm triggers, the primary interrupt is handled by the INTC. This calls an interrupt handler which inspects the PMIC, determines that a power interrupt is active and calls the relevant handler. It in turn examines the PMIC again and finds that the RTC alarm has fired, and so calls the handler that was registered for that particular interrupt. This all works perfectly. But it doesn't seem to wake the device from suspend. At least, not always. It seems that something happens on the way to suspend that interferes with this.
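The chain of handlers just described can be modeled in user space. The sketch below is not the twl4030-irq driver: the structure, the function names, and the `MODULE_POWER` and `SOURCE_RTC` indices are all invented for illustration. It shows only the shape of the dispatch, in which one handler reads the primary status to pick a module, and the next reads that module's status to pick the final handler.

```c
#include <stdint.h>

#define NR_MODULES 4
#define NR_SOURCES 8

/* Hypothetical indices; the real register layout is in the data sheet. */
#define MODULE_POWER 2   /* the "power interrupts" module */
#define SOURCE_RTC   3   /* RTC alarm bit within that module */

typedef void (*irq_handler)(void *data);

struct pmic_model {
    uint8_t primary;                    /* one bit per module */
    uint8_t secondary[NR_MODULES];      /* one bit per source */
    irq_handler handler[NR_MODULES][NR_SOURCES];
    void *data[NR_MODULES][NR_SOURCES];
};

/* Hardware side: a source inside a module raises its interrupt. */
static void pmic_raise(struct pmic_model *p, int module, int source)
{
    p->secondary[module] |= (uint8_t)(1u << source);
    p->primary |= (uint8_t)(1u << module);
}

/* Module level: read and clear the status, call per-source handlers. */
static int pmic_handle_module(struct pmic_model *p, int module)
{
    uint8_t status = p->secondary[module];
    int handled = 0;

    p->secondary[module] = 0;   /* reading clears the source in this model */
    for (int s = 0; s < NR_SOURCES; s++) {
        if ((status & (1u << s)) && p->handler[module][s]) {
            p->handler[module][s](p->data[module][s]);
            handled++;
        }
    }
    return handled;
}

/* PMIC level: called when the single PMIC line fires at the INTC. */
static int pmic_handle_irq(struct pmic_model *p)
{
    uint8_t status = p->primary;
    int handled = 0;

    p->primary = 0;
    for (int m = 0; m < NR_MODULES; m++)
        if (status & (1u << m))
            handled += pmic_handle_module(p, m);
    return handled;
}

/* Example handler: counts how many times the RTC alarm was delivered. */
static void rtc_alarm_handler(void *data)
{
    (*(int *)data)++;
}
```

Registering `rtc_alarm_handler` for the RTC source, raising that source, and calling `pmic_handle_irq()` runs the handler exactly once, which corresponds to the "this all works perfectly" case while the system is awake.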

A funny thing happened on the way to suspend

One of the many things that happen on the way to suspend is that each individual interrupt gets disabled - unless it is flagged as IRQF_NO_SUSPEND, in which case it is left alone. However this doesn't mean exactly what it sounds like it means. Being "disabled" just means that the handler routine will not be run; it doesn't mean that the interrupt will not be generated. We have a different word for that, which is "masking". When an interrupt is masked the originating source of the interrupt is told to never post that interrupt.

Linux uses a lazy scheme for disabling interrupts. When the disable request is made, the fact is recorded in internal data structures, but that is all. If the interrupt is subsequently delivered, only then might the interrupt be masked. This can be a useful optimization as masking an interrupt can take a lot longer than just setting a flag in memory.

So, on the way to suspend, interrupts are disabled but not masked. If the interrupt does actually arrive before we reach full suspend, the fact is recorded. If it was an interrupt that should wake from suspend, this is detected in check_wakeup_irqs() and suspend aborts. If the interrupt doesn't arrive before full suspend, then it is still unmasked and will successfully wake up the device, which will resume and then handle the interrupt. This might all seem a bit complex, but once it is fully understood it is actually quite neat and it works well ... except for my RTC alarm.
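The lazy scheme and the wakeup check can be captured in a few lines of user-space C. This is a simplified model and not the kernel's data structures: only check_wakeup_irqs() is a name from the kernel, and everything else here is invented for illustration.

```c
#include <stdbool.h>

struct irq_model {
    bool disabled;   /* software flag set by a disable request */
    bool masked;     /* hardware state: source told not to post */
    bool pending;    /* arrived while disabled */
    bool wakeup;     /* allowed to wake the system from suspend */
    int  handled;    /* times the handler actually ran */
};

/* Lazy disable: only record the request; do not touch the hardware. */
static void irq_disable(struct irq_model *irq)
{
    irq->disabled = true;
}

/* The source fires. Returns true if the interrupt reached the CPU. */
static bool irq_arrive(struct irq_model *irq)
{
    if (irq->masked)
        return false;          /* a masked source never posts */
    if (irq->disabled) {
        irq->masked = true;    /* only now pay the cost of masking */
        irq->pending = true;   /* remember that it happened */
        return true;
    }
    irq->handled++;
    return true;
}

/*
 * Plays the role that the text gives to check_wakeup_irqs(): suspend
 * aborts if an interrupt that may wake the system is already pending.
 */
static bool suspend_must_abort(const struct irq_model *irqs, int n)
{
    for (int i = 0; i < n; i++)
        if (irqs[i].wakeup && irqs[i].pending)
            return true;
    return false;
}
```

Note that a disabled interrupt stays unmasked until it actually fires, which is what allows it to wake the system later if it has not fired before suspend completes.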

Who turned out the lights?

As promised we will need to glimpse at the USB code as well and now is the time to bring that in. Like most modern phones, the GTA04 can be attached to a computer or a charger via a USB port. When the USB interface detects 5 volts on the power line, it will route this to the battery charger (BCI) which will use it to charge the battery and power the device.

This is relevant because - when tracing is added to the twl4030-irq driver - we see that during suspend we get an interrupt from the PMIC which turns out to be due to the battery charger losing power and saying "Hey, just thought you should know that I'm running on battery again now". What is happening is that when the USB port is told to go to sleep it turns off its various power supply regulators. One of these (USB3V1) also powers the voltage comparator which determines if 5V is present. As soon as that presence is no longer signaled, the current stops being routed to the BCI, and the BCI justifiably complains.

This results in the interrupt line from the PMIC being asserted which causes the interrupt management code to run. This notices that the interrupt was disabled, and so masks it. When we get to check_wakeup_irqs(), the PMIC IRQ is pending, but as it has not been set for wakeup it is just ignored. The net result here is that all interrupts from the PMIC are ignored during suspend, so the RTC alarm doesn't work, the power button doesn't wake the device up, and plugging or unplugging the USB cable has no effect either.
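The failure can be reproduced in a toy model of the single shared PMIC line. The names below are hypothetical, and the model leaves out the wakeup flag entirely, since the line is not set for wakeup in the scenario described and its pending state is therefore ignored.

```c
#include <stdbool.h>

struct pmic_line {
    bool disabled;  /* disabled on the way to suspend */
    bool masked;    /* masked at the INTC */
    bool pending;   /* fired while disabled */
};

/*
 * Any PMIC source (BCI, RTC alarm, power button, ...) asserts the one
 * shared line. Returns true if the assertion reaches the CPU.
 */
static bool pmic_line_assert(struct pmic_line *line)
{
    if (line->masked)
        return false;
    if (line->disabled) {
        line->masked = true;    /* lazy masking kicks in */
        line->pending = true;
    }
    return true;
}

/* Would an RTC alarm arriving after suspend wake the system? */
static bool rtc_alarm_wakes(bool bci_fires_during_suspend)
{
    struct pmic_line line = { .disabled = true };  /* state while suspending */

    if (bci_fires_during_suspend)
        pmic_line_assert(&line);   /* charger reports losing power */

    /* The pending state is ignored here, so suspend proceeds. */
    return pmic_line_assert(&line);   /* the later RTC alarm */
}
```

In this model `rtc_alarm_wakes(false)` is true and `rtc_alarm_wakes(true)` is false: one early interrupt from an unrelated module is enough to silence every source behind the shared line.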

So now that we understand the problem ...

So, what to do? We could just unplug the USB cable when testing the RTC alarm. Then the BCI would not notice the current disappearing and so wouldn't generate an interrupt. This certainly works - once. It doesn't seem to work a second time, but we'll leave that issue for the moment.

We could tell the USB controller not to turn off USB3V1 when the BCI is active and power is available on the USB bus. This is certainly a good idea and would mean that the battery can charge during suspend. It also brings up the interesting issue of how these separate drivers should communicate, but it doesn't really solve the general problem, just this specific one.

It could be that an interrupt occurs that we really do want to ignore during suspend, that we have even explicitly disabled (but due to lazy-masking has not been masked). So we want to be able to mask those interrupts without masking the main interrupt. A few moments' reflection on what we have learned so far suggests that setting IRQF_NO_SUSPEND on the intermediate interrupts should help. That way they will not be disabled during suspend so the interrupt management will trace through the chain of irq_chips far enough to find the originating interrupt source and will mask just that. The other interrupt sources will still be available.

This may sound convincing, but reality has a way of getting in the way: testing shows that just setting IRQF_NO_SUSPEND is not enough. Diving back into the code and the data sheets reveals an interesting and important fact. The PMIC allows each individual interrupt source to be masked, but it does not allow a whole module as a set of interrupt sources to be masked. So when an interrupt arrives from the BCI (one of the modules) the IRQ manager notices that the interrupt is disabled and so the secondary irq_chip is asked to mask it. But as the irq_chip cannot mask it (the hardware doesn't support that), it simply ignores the request. You can imagine what happens next - an unending stream of attempts to mask the BCI interrupt, none of which are effective and the interrupt keeps re-firing.
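A toy model makes the re-firing loop concrete. It assumes a level-triggered source that stays asserted until something clears or masks it; the structure and function names are invented, and the loop is bounded by `max_tries` only so that the demonstration terminates.

```c
#include <stdbool.h>

struct irq_source {
    bool asserted;    /* level-triggered: stays high until cleared */
    bool masked;
    bool can_mask;    /* false models a module the hardware cannot mask */
};

/* The irq_chip's mask callback; silently does nothing if unsupported. */
static void chip_mask(struct irq_source *src)
{
    if (src->can_mask)
        src->masked = true;
}

/*
 * Deliver a disabled, asserted interrupt until it stops firing or
 * max_tries is reached. Returns the number of deliveries.
 */
static int deliver_while_disabled(struct irq_source *src, int max_tries)
{
    int deliveries = 0;

    while (src->asserted && !src->masked && deliveries < max_tries) {
        deliveries++;
        chip_mask(src);   /* the handler is disabled, so masking is all we try */
    }
    return deliveries;
}
```

With `can_mask` set, the interrupt is delivered once and then stays quiet; with it clear, the count runs all the way to `max_tries`, which stands in for the unending stream seen on the real hardware.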

The root problem seems to be that while the TWL4030 PMIC appears to have an interrupt structure that lends itself to being represented as a small tree of irq_chips, this isn't actually the case. The details of the behavior of suspend make it essential that a working "mask" function be provided, and that cannot be provided for modules in the twl4030, only for individual interrupts. So it seems that a complete fix requires some change to the structure of the irq_chips in the twl4030 driver. Possibly we could make the structure simpler (no tree), or possibly we could make the implementation more clever in some way. The important point is that each interrupt that is visible to the Linux IRQ management code must support being masked in the hardware in some way.

Another funny thing on the way to resume

As mentioned, after one successful wake-from-alarm, the RTC alarm in the GTA04 doesn't work a second time. In fact it stops working completely, even when not suspended. It seems that something has gone wrong on the way to resume.

During resume all the interrupts that were disabled by suspend are re-enabled. The three interrupts which relate to the RTC - the base PMIC interrupt, the interrupt for the "power interrupts" module, and the actual RTC interrupt - are enabled in that order, with the PMIC first. As this interrupt is still asserted (as it is what triggered resume) the PMIC interrupt handler runs as soon as it is enabled. It reads the interrupt status register to see which module was responsible, and triggers the relevant interrupt for that module.

This secondary interrupt is still disabled so nothing happens. In particular the interrupt source is not cleared, nor is it masked, so we again get a repeated sequence of attempts to handle an interrupt which cannot be handled. This one isn't unending as the loop that enables all the interrupts is still running and it eventually gets to the one for the target module (the power interrupts module). This is triggered again and this time it actually does something. It reads the final status register (thus clearing the interrupt) and calls the interrupt handler for the RTC alarm. Unfortunately that is still disabled so nothing else happens.

Eventually the RTC interrupt is enabled but by now it is too late. In some cases an interrupt that fires while disabled gets the IRQS_PENDING flag set so that when it is re-enabled, the interrupt is resent. But that is not the case when the twl4030-irq interrupt controller tries to handle a tertiary IRQ. Whether this is a bug or a misunderstood feature is as yet unclear. Another journey of discovery would be needed to fully understand that issue. Meanwhile, because the interrupt is delivered but the interrupt service routine isn't run, the handshaking between the driver and the device gets confused and no other interrupt is ever seen from the RTC.

We can work around this problem by a simple expedient. When re-enabling the interrupts during resume, do it in the reverse order. This ensures that the RTC interrupt is enabled before it is called, and everything works. This even feels like the "right" thing to do - it is balanced for "enable" to happen in the reverse order of "disable". But even if it is "right", it is not really a fix. There would still be bugs lurking and ready to spring out again.
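The effect of the enable order can be checked with a small simulation of the three-level chain. This is a sketch of the behavior as described above and not of the kernel code: the names are invented, and the model assumes that reading the module's status register clears the source and that no pending flag is recorded when the final handler is still disabled.

```c
#include <stdbool.h>

enum { IRQ_PMIC, IRQ_POWER_MODULE, IRQ_RTC, NR_LEVELS };

struct chain {
    bool enabled[NR_LEVELS];
    bool status_latched;   /* the PMIC still records the alarm */
    int  rtc_handled;      /* times the RTC handler actually ran */
};

/* Walk down the chain as far as the enabled handlers allow. */
static void chain_run(struct chain *c)
{
    if (!c->status_latched || !c->enabled[IRQ_PMIC])
        return;
    /* The PMIC handler ran and triggered the module's interrupt. */
    if (!c->enabled[IRQ_POWER_MODULE])
        return;                 /* nothing cleared; it will fire again */
    /* The module handler reads its status, which clears the source. */
    c->status_latched = false;
    if (c->enabled[IRQ_RTC])
        c->rtc_handled++;
    /* Otherwise the event is lost: no pending flag is set on this path. */
}

/* Re-enable the three interrupts in the given order, as resume does. */
static int resume_enable_irqs(const int order[NR_LEVELS])
{
    struct chain c = { .status_latched = true };

    for (int i = 0; i < NR_LEVELS; i++) {
        c.enabled[order[i]] = true;
        chain_run(&c);          /* a still-asserted line fires at once */
    }
    return c.rtc_handled;
}
```

Enabling in the order PMIC, power module, RTC leaves the RTC handler uncalled, while the reverse order runs it exactly once. The model also shows why this is a workaround and not a fix: the event is still dropped whenever the final handler happens to be disabled.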

Journey's end

The point of this little tour is not to point out bugs in the code or to shame the relevant developers. In fact the code is in general quite elegant, well structured, largely very effective, and the developers are only to be congratulated. Rather, the point is to highlight the complexity of the task being tackled by developers working in the embedded space:

  • There are complex interactions between distinct hardware. Here the USB interface and the battery charger have important interdependencies that need to be reflected in the drivers. Some interdependencies already are but there are subtleties that are easy to miss.

  • The requirements of an "irq_chip" are not really documented anywhere, and given the current rate of development there is a good chance that such documentation would be out of date. Extracting the requirements from the code is far from trivial. In this case a seemingly correct and obvious implementation was only found to be deeply wrong by a fairly obscure test case.

  • The IRQF_NO_SUSPEND flag is clearly important, but not easy to understand. It is fairly obvious what it does, but much less obvious why you would want to do that. I imagine that as a developer my approach would be to not set it until I found a bug which could be fixed by setting it. Then I would set it and hope it didn't break anything. This is not really a robust way to do development.

With complex hardware interactions and complex software dependencies it is not surprising that we seem to see a new class of problems in the embedded space. As we have observed, there are often ways to work around the bug rather than fix it, and that is usually much simpler and quicker. They may be tempting, but for this developer at least they are nowhere near as satisfying. Because if you follow the bug all the way to the root cause, and if you understand all that is happening and find the right fix, then you have not only helped other future developers, but have learned a whole lot in the process. And that is why I love bugs.


Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds