Brief items
The current development kernel is 3.3-rc4,
released on February 18, a couple of days
later than might have been expected.
"This time the reason for the
delay is that we spent several days chasing down a nasty floating point
state corruption that happens on 32-bit x86 - but only if you have a modern
CPU (why are you using 32-bit kernels?) that supports the AES-NI
instructions. And then you have to enable support for them *and* use a
wireless driver that uses it. The most likely reason for that is using the
mac80211 infrastructure with WPA with AES encryption (ie usually
WPA2)." There are lots of other fixes as well, of course; the
short-form changelog can be found in the announcement.
Stable updates: the 3.0.22 and 3.2.7 stable updates were released on
February 20; they contain the usual list of important fixes.
When you write:
    if (ret) {
        one_line_statement();
    }
somewhere, a puppy dies. And the DRM guys just took out an entire kennel.
--
David Miller
By Jonathan Corbet
February 22, 2012
The poll() system call has three parameters, one of which is a
timeout value specifying an upper bound (in milliseconds) for how long the
process will wait. The manual page indicates that the type of this value
is int. For reasons lost in history, though, the kernel's
internal implementation of poll() has always expected the timeout
value to be a long integer. And that has created a source of
occasional bugs.
Most of the time, things just work. The int and long
types tend to be the same on most architectures, and, in cases where they
are different, glibc sign-extends the timeout value appropriately. Things
go wrong, though, when a 32-bit process is running on an x86-64 system. In
that case, the 32-bit sys_poll() function just passes the timeout
value directly to the native kernel version, without sign extension. So if
the timeout value is negative (an indication that poll() should
wait forever if need be), the kernel will eventually see a large, positive
timeout instead.
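The effect of the missing sign extension can be reproduced with plain integer arithmetic. The sketch below is only a model: the helper names are invented for illustration, and Python integers stand in for machine registers. It shows what a 64-bit kernel would see when a 32-bit -1 arrives zero-extended rather than sign-extended.

```python
def zero_extend_32(value):
    """Model a 32-bit value landing in a 64-bit register with the upper bits cleared."""
    return value & 0xFFFFFFFF


def sign_extend_32(value):
    """Model the correct widening of a signed 32-bit value to 64 bits."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


timeout_ms = -1  # "wait forever", as passed by the 32-bit caller

seen_without_extension = zero_extend_32(timeout_ms)
seen_with_extension = sign_extend_32(timeout_ms)

print(seen_without_extension)               # 4294967295
print(seen_with_extension)                  # -1
print(seen_without_extension / 86_400_000)  # a little under 50 (days)
```

In this model the "wait forever" request turns into a finite timeout of about seven weeks, which matches the "large, positive timeout" described above.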
There are various ways this problem could be fixed. What Linus has chosen
to do, though, is to just change the type of the timeout parameter to
int inside the kernel. Since the timeout is now a 32-bit quantity
on all systems, that particular source of confusion is removed. There is a
small risk to this approach, though: it is possible that some program
somewhere was actually making use of 64-bit timeouts.
Doing so would require replacing or bypassing glibc (because its sign
extension makes 64-bit timeouts unusable), so it's unlikely that anybody
has bothered, but one never knows. If this change were to break a real
application, it would have to be reverted in favor of a more complicated
solution.
Linus's patch was merged for
3.3-rc5, so anybody who objects has a few weeks to make their concerns known.
By Jonathan Corbet
February 22, 2012
The DMA buffer sharing mechanism has been
merged for the 3.3 kernel; it is a way for DMA buffers to be shared between
otherwise independent device drivers under user-space control. The dma-buf
patches, as merged for 3.3, include a number of functions used by drivers
to access buffers; those functions are all exported in the GPL-only mode.
That drew a complaint from Robert Morell of
NVIDIA, who, unsurprisingly, didn't like the fact that this interface would
be unavailable to his company's proprietary driver.
It will be unsurprising to most readers that the response to Robert's
complaint was not 100% sympathetic. After a while, the discussion died
down without any real resolution. Recently, though, Rob Clark has reported
on a discussion held at the Embedded Linux Conference:
Following the discussion, I agree that dma-buf infrastructure is
intended as an interface between driver subsystems. And because
(for now) all the other arm SoC gl(es) stacks unfortunately involve
a closed src userspace, and since I consider userspace and kernel
as tightly coupled in the realm of graphics stacks, I don't think
we can really claim the moral high-ground here. So I can't object
to use of EXPORT_SYMBOL() instead of EXPORT_SYMBOL_GPL().
Since then, there has been no discussion at all; there has also been no
move to change the symbol exports in the mainline kernel. But the shift in
tone suggests that positions may be softening, and that the buffer-sharing
API may eventually be made available to proprietary modules.
Oracle has announced
the availability of a beta release of the DTrace tracing framework, ported
to its "Unbreakable Enterprise Kernel." There is not a lot of information
currently about how the port works or how to use it; the DTrace on
Linux forum contains only a "welcome" message. There is a usage
example in this weblog post by Wim Coekaerts.
Kernel development news
By Jonathan Corbet
February 17, 2012
As a general rule, kernel developers will go out of their way to avoid
breaking user-space code, even when that code is seen as being wrong and
already broken. But
there are exceptions; a recent discussion regarding timer behavior may
prove to be an example of how such exceptions can come about.
The C-library sleep() function is defined to put the calling
process to sleep for at least the number of seconds specified. One might
think that calling sleep() with an argument of zero seconds would
make relatively little sense; why put a process to sleep for no time? It
turns out, though, that some developers put such calls in as a way to
relinquish the CPU for a short period of time. The idea is to be nice and
allow other processes to run briefly before continuing execution.
Applications that perform polling or are otherwise prone to consuming too
much CPU are often "fixed" with a zero-second sleep.
Once upon a time in Linux, sleep(0) would always put the calling
process to sleep for at least one clock tick. When high-resolution timers
were added to the kernel, the behavior changed: if a process asked to sleep
on an already-expired timer (which is the case for a zero-second sleep),
the call simply returned directly back to the calling process. Then came
the addition of timer slack, which can extend sleep periods to force
multiple processes to wake at the same time. This behavior will cause
timers to run a little longer than requested, but the result is fewer
processor wakeups and, thus, a savings of power. In the case of a
zero-second sleep, the addition of timer slack turns an expired timer into
one that is not expired, so the calling process will, once again, be put to
sleep.
The default timer slack, at 50µs, is unlikely to cause visible
changes to the behavior of most applications. But it seems that, on some
systems, the timer slack value is set quite high -
on the order of seconds - to get the best power behavior possible. That
can extend the length of a zero-second sleep accordingly, leading to
misbehaving applications.
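The interaction between a zero-second sleep and timer slack can be pictured with a toy model. This is a sketch rather than kernel code: it assumes that a timer may fire anywhere between the requested expiry and that expiry plus the slack, and the function name and the two-second "aggressive" value are invented for illustration.

```python
def worst_case_sleep_ns(requested_ns, slack_ns):
    """Latest point, relative to now, at which the sleeper may be woken.

    Toy model: with no slack an already-expired timer returns at once;
    with slack the wakeup may be deferred by up to slack_ns so that it
    can be batched with other timers.
    """
    if requested_ns < 0 or slack_ns < 0:
        raise ValueError("times must be non-negative")
    return requested_ns + slack_ns


DEFAULT_SLACK_NS = 50_000             # the 50µs default
AGGRESSIVE_SLACK_NS = 2_000_000_000   # assumed "order of seconds" setting

print(worst_case_sleep_ns(0, 0))                    # 0
print(worst_case_sleep_ns(0, DEFAULT_SLACK_NS))     # 50000
print(worst_case_sleep_ns(0, AGGRESSIVE_SLACK_NS))  # 2000000000
```

Under this model a sleep(0) that once returned immediately can stall for the full slack period, which is harmless at 50µs and very visible at two seconds.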
Matthew Garrett, working under the notion that breaking applications is
bad, submitted a patch making a
special-case for zero-second sleeps. The idea is simple: if the requested
sleep time is zero, timer slack will not be added and the process will not
be delayed indefinitely. The problem with this approach is that the
process will still not get the desired result: rather than yielding the
processor, it will have simply performed a useless system call and gone
right back to whatever it was doing before. Without timer slack, a request
to sleep on an expired timer will return directly to user space without
going through the scheduler at all.
An alternative would be to transform sleep(0) into a call to
sched_yield(). But that idea is not hugely popular with the
scheduler developers, who think that calls to sched_yield() are
almost always a bad idea. It is better, they say, to fix the applications
to stop polling or doing whatever else it is that they do that causes
developers to think that explicitly yielding the CPU is the right thing to
do.
According to Matthew, the number of
affected applications is not tiny:
Checking through an exploded Fedora kernel tree suggests around 125
packages out of 11000 or so, so around 1% of userspace seems to use
sleep(0) under certain circumstances. We can probably fix
everything in the distribution, but that suggests that there's also
going to be a significant amount of code in the outside world
that's also broken.
Normal practice in kernel development would be to try to avoid breaking
those applications if possible. Even in cases where applications are
relying on undefined and undocumented behavior - certainly the case here -
it is better if a kernel upgrade doesn't turn working code into broken
code. Some participants have suggested that the same approach should be
taken in this case.
The situation with sleep(0) is a little different from others,
though. Application developers cannot claim a long history of working
behavior in this case, since the kernel's response to a zero-second sleep
has already changed a few times over the course of the last decade. And,
according to Thomas Gleixner, it is hard to
know when the special case applies or what should be done:
Dammit, we cannot come up with a reasonable definition for special
casing that stuff simply because you cannot draw a clear boundary
what to special case and what not. And there is no sensible
definition for what to do - return right away or go through
schedule() or what ever.
Thomas worries that there may be calls for special cases for similar calls
- single-nanosecond calls to nanosleep(), for example - and that
the result will be an accumulation of cruft in the core timer code. So,
rather than try to define these cases and maintain the result indefinitely,
he thinks it is better just to let the affected code break in cases where
the timer slack has been set to a large value. And that is where the
discussion faded away, suggesting that nothing will be done in the kernel
to reduce the effect of timer slack on zero-second sleeps.
February 22, 2012
This article was contributed by Paul McKenney
I had the privilege of acting as moderator/secretary for the recent
Scheduler Mini-Summit at Linaro Connect, which was attended by
Peter Zijlstra (Red Hat),
Paul Turner (Google),
Suresh Siddha (Intel),
Venki Pallipadi (Google),
Robin Randhawa (ARM),
Rob Lee (Freescale assigned to Linaro),
Vincent Guittot (ST-Ericsson assigned to Linaro),
Kevin Hilman (TI),
Mike Turquette (TI assigned to Linaro),
Peter De Schrijver (Nvidia),
Paul Brett (Intel),
Steve Muckle (Qualcomm),
Sven-Thorsten Dietrich (Huawei),
and was ably organized by Amit Kucheria (Linaro).
Rough notes from the session can be found
here.
The main goals of the mini-summit were as follows:
- Take a first step towards planning any Linux-kernel scheduler
changes that might be needed for ARM's upcoming big.LITTLE [PDF]
systems to work well (see also Nicolas Pitre's LWN article).
- Create a power-aware infrastructure for scheduling and related
Linux kernel subsystems.
For example, integrate dyntick-idle,
cpufreq, cpuidle, sched_mc, timers,
thermal framework, pm_qos,
and the scheduler.
- Provide a usable mechanism that reliably allows all work (present
and future) to be moved off of a CPU so that said CPU can
be powered off and back on under user-application control.
CPU hotplug is used for this today, but has some serious side
effects, so it would be good to either fix CPU hotplug or come
up with a better mechanism—or, in the best Linux-kernel
tradition, both.
Such a mechanism might also be useful to the real-time people,
who also need to clear all non-real-time activity from
a given CPU.
How well did we meet these goals?
Read on and decide for yourself!
To that end, the remainder of this article is organized as follows:
- Overview of ARM big.LITTLE Systems
- Major Issues Considered
- Future Work and Prospects
- Conclusions
Following this is the inevitable Answers to Quick Quizzes.
Overview of ARM big.LITTLE Systems
ARM's big.LITTLE systems combine the Cortex-A7 and Cortex-A15 processors.
Both processors are implementations of the ARMv7
architecture and they execute the same code.
ARM stated the
little Cortex-A7 design was focused on energy efficiency at the expense of
performance.
The bigger Cortex-A15 design was, instead, focused
on performance at some cost to energy efficiency.
In practice
this means the little core will be somewhat quicker and a lot
more power efficient than today's Cortex-A8: a multi-core
configuration of these little cores could run today's smartphones.
The big core will
significantly outperform Cortex-A9 within a similar power budget.
Quick Quiz 1:
But what if there is a different number of Cortex-A7s than of Cortex-A15s?
Answer
One way to use a big.LITTLE system is to have equal numbers of
Cortex-A7 and Cortex-A15 CPUs paired up, so that only one CPU
of a given pair is running at a time.
This pairing is “a
continuation of dynamic voltage/frequency scaling by other
means”.
To see this, imagine the Cortex-A15 initially running
at maximum clock frequency, with the voltage and frequency
decreasing until the performance is barely greater than that of
the Cortex-A7 CPU.
At this point, the firmware switches the
software context from the Cortex-A15 to the Cortex-A7, with the
Cortex-A7 initially running at its maximum clock frequency, but
at lower power than the Cortex-A15.
Quick Quiz 2:
Why scale down?
Isn't it always better to run full out in order to race to idle?
Answer
The voltage and frequency of
the Cortex-A7 can then be further decreased, in turn further
decreasing the power consumption.
For some implementations,
thermal limitations would require that the Cortex-A15 CPUs be
used only for short bursts at maximum frequency, as was discussed at
length at the summit.
However, I have since learned that many
other implementations are expected to be fully capable of running
the Cortex-A15 CPUs at maximum frequency indefinitely.
The switch between the Cortex-A7 and Cortex-A15 CPUs is implemented
in firmware, but Grant Likely, Nicolas Pitre, and Dave Martin are
moving this functionality into the Linux kernel.
In many big.LITTLE designs, it is also possible to run both the
Cortex-A7 and Cortex-A15 CPUs concurrently in a shared-memory configuration.
However, this means that the Linux kernel sees the big.LITTLE
architecture, which in turn raises the issues discussed in the
next section.
Major Issues Considered
Those of you who know the personalities in attendance will not be
surprised to hear that the discussions were both spirited and wide-ranging.
However, most of the discussion centered around the following four
major issues:
- Benchmarks and Synthetic Workloads
- Parallel Hardware/Software Development
- What Do You Do With a LITTLE CPU?
- CPU Hotplug: Kill It or Cure It?
Each of these issues is covered in one of the following sections.
Benchmarks and Synthetic Workloads
The biggest and most pressing issue facing SMP-style big.LITTLE
systems is the lack of packaged Linux-kernel-developer-friendly
benchmarks and synthetic workloads.
C programs and sh, perl, and python
scripts can be friendly to Linux-kernel developers, while benchmarks
requiring (for example) an Android SDK or a specific device
will likely be actively ignored.
It is critically important for benchmarks to provide a useful
“figure of merit”, which should encompass both
user experience and estimated power consumption.
For example, a synthetic workload that models a user browsing the
web on a smartphone might have a smaller-is-better estimate of
average power consumption, but also have the constraint that
the system respond to emulated browser actions within (say)
500ms.
If the response time is within the 500ms constraint,
then the figure of merit is the estimated average power consumption,
but if that constraint is exceeded, the figure of merit is a
very large number.
The exact computation of the figure of merit will vary from
benchmark to benchmark.
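As one concrete reading of the browsing example, a figure of merit could be computed along the following lines. This is a hypothetical sketch: the function name, the milliwatt unit, and the penalty value are assumptions made for illustration, not part of any existing benchmark.

```python
PENALTY = 1e9  # arbitrary "very large number" returned when a deadline is missed


def figure_of_merit(avg_power_mw, response_times_ms, limit_ms=500.0):
    """Smaller is better: the average power if every response met the
    limit, otherwise a large penalty value."""
    if any(t > limit_ms for t in response_times_ms):
        return PENALTY
    return avg_power_mw


print(figure_of_merit(320.0, [120, 340, 499]))  # 320.0
print(figure_of_merit(180.0, [120, 650]))       # 1000000000.0
```

The second run draws less power but misses the 500ms constraint once, so it scores far worse than the first; that ordering is the point of folding user experience into the number.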
Currently, some rough and ready workloads are in use.
For example, Vincent Guittot used cyclictest in
his work.
While this did get the job done for Vincent, something more adapted
to embedded/mobile workloads instead of real-time computing would
be quite welcome.
Zach Pfeffer of Linaro will be doing some workload creation in his group;
however, given the wide variety of mobile and embedded workloads, additional
contributions would also be welcome.
Finally, the scheduler maintains a great number of statistics and
tracepoints.
A “schedtop”-style tool that provides a mobile/embedded
view of this information would be very valuable.
Parallel Hardware/Software Development
Even if you don't know exactly when a given piece of hardware will
be available, it is a good bet that it will become available too late
to get the needed software running on it.
It is therefore critically important to have some way to develop the
needed software before the hardware is available.
Thankfully, there are a number of ways to test big.LITTLE scheduler
features before big.LITTLE hardware becomes available.
One crude but portable method is to create a SCHED_FIFO
thread on each LITTLE-designated CPU, and to have this thread spin,
burning CPU, for (say) one millisecond out of every two milliseconds.
This approach perturbs the scheduler's preemption points, particularly the
wake-up preemptions.
Nevertheless, this approach is likely to be quite useful.
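A minimal sketch of such a CPU-burning thread follows, assuming a Linux system and a Python interpreter that exposes the os.sched_* calls. The function names are invented for illustration, and switching to SCHED_FIFO normally requires root (or CAP_SYS_NICE), so the setup helper reports failure instead of raising.

```python
import os
import time


def pin_and_make_fifo(cpu, priority=1):
    """Pin the calling process to `cpu` and switch it to SCHED_FIFO.

    Returns True on success, False if the platform lacks these calls
    or the caller is not privileged enough.
    """
    try:
        os.sched_setaffinity(0, {cpu})
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except (AttributeError, OSError):
        return False


def burn(busy_s=0.001, period_s=0.002, cycles=1000):
    """Spin for busy_s out of every period_s; return the total time spent spinning."""
    burned = 0.0
    next_period = time.monotonic()
    for _ in range(cycles):
        start = time.monotonic()
        while time.monotonic() - start < busy_s:
            pass                      # burn CPU
        burned += time.monotonic() - start
        next_period += period_s
        delay = next_period - time.monotonic()
        if delay > 0:
            time.sleep(delay)         # leave the rest of the period to other tasks
    return burned
```

One such process would be started per LITTLE-designated CPU, calling pin_and_make_fifo(cpu) and then burn() in a loop; with the defaults it steals roughly half of that CPU's time.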
A less portable but more accurate approach is to constrain the
clock frequency of the CPUs so that the big-designated CPUs have a lower
bound on their frequency and the LITTLE-designated CPUs have an upper
bound on their frequency.
The way to do this is via the sysfs files in the
/sys/devices/system/cpu/cpu*/cpufreq directories,
the most pertinent of which are described below.
Quick Quiz 3:
I typed the following commands:
cd /sys/devices/system/cpu/cpu1/cpufreq
sudo echo 800000 > scaling_max_freq
Despite the sudo, I got “Permission denied”.
Why doesn't sudo give me sufficient permissions?
Answer
Echoing a number into the scaling_max_freq file
will require that the corresponding CPU's frequency
be limited to the specified number in kHz.
Echoing a number into the scaling_min_freq file
will require that the corresponding CPU's frequency
be at least the specified number in kHz.
Reading the scaling_available_frequencies file
will list out the frequencies (again in kHz) that the
corresponding CPU is capable of running at.
For example, the laptop I am typing on gives the following
list:
2534000 2533000 1600000 800000
Reading the affected_cpus file lists the CPUs whose
core clock frequencies must move in lockstep with the
corresponding CPU.
On my laptop, each CPU's frequency may be varied independently,
but it is not unusual for a given “clock domain”
to contain multiple CPUs, which then must all run at the
same frequency, for example, on systems with hardware threads.
Reading the scaling_cur_freq file gives you the kernel's
opinion on what frequency the corresponding CPU is running at.
Reading the cpuinfo_cur_freq file, instead, gives you the hardware's
opinion on what frequency the corresponding CPU is
running at,
which might or might not match the kernel's opinion, so
you should most definitely experiment to make sure that
all of this is doing what you want on your particular hardware
and kernel.
For more information, see Documentation/cpu-freq
in the Linux kernel source directory.
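The sysfs files described above can be driven from a small script. The following is a sketch under stated assumptions: the helper names are invented, the choice of "half the maximum frequency" as the LITTLE cap is arbitrary, and emulate_little() must run as root on a system whose cpufreq driver provides these files.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def cpufreq_file(cpu, name, root=CPU_ROOT):
    """Path of one cpufreq attribute for the given CPU number."""
    return Path(root) / f"cpu{cpu}" / "cpufreq" / name


def parse_frequencies(text):
    """Turn the contents of scaling_available_frequencies into sorted kHz values."""
    return sorted(int(field) for field in text.split())


def little_cap_khz(available_khz, fraction=0.5):
    """Highest available frequency not above `fraction` of the maximum,
    falling back to the lowest frequency if none qualifies."""
    limit = max(available_khz) * fraction
    candidates = [f for f in available_khz if f <= limit]
    return max(candidates) if candidates else min(available_khz)


def emulate_little(cpu, fraction=0.5, root=CPU_ROOT):
    """Cap one CPU's frequency by writing scaling_max_freq (needs root)."""
    text = cpufreq_file(cpu, "scaling_available_frequencies", root).read_text()
    cap = little_cap_khz(parse_frequencies(text), fraction)
    cpufreq_file(cpu, "scaling_max_freq", root).write_text(f"{cap}\n")
    return cap
```

With the laptop's frequency list shown above, little_cap_khz() would pick 800000 kHz for a LITTLE-designated CPU; the big-designated CPUs would get the mirror-image treatment through scaling_min_freq.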
There was also some discussion of ways that the linsched
user-mode scheduler simulator might help with prototyping.
Finally, it is possible to use T-states on Intel platforms to emulate
a big.LITTLE system.
According to Paul Brett:
Intel Architecture processors provide a clock modulation
control exposed as the MSR_IA32_THERM_CONTROL MSR.
This MSR can be used to reduce the effective clock
frequency for each core independently in 12.5% increments
from 100% down to 12.5%. Under normal conditions,
the least significant 5 bits of the MSR are cleared to
indicate 100% performance. To enable clock modulation,
set bit 4 of this MSR to 1 and write a value from 1-7 in
bits 1-3 (where 7 is 87.5% equivalent performance and 1
is 12.5% equivalent performance). More information on
clock modulation can be found in volume 3 of the Intel
IA64/IA32 Software Developers Manual, under Thermal
Monitoring and Protection. Please note that the effect
of clock modulation approximates running the CPU at a
lower frequency - in benchmarks we have noted up to a 5%
variance in performance between clock modulation and
running the same core at the equivalent frequency.
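The bit layout described in that quote can be captured in a few lines. This sketch only computes the value to be written; it does not touch any MSR, and the function and constant names are made up for illustration.

```python
ENABLE_BIT = 1 << 4   # bit 4: enable clock modulation
LEVEL_SHIFT = 1       # the duty level occupies bits 1-3
FULL_SPEED = 0        # low five bits cleared: 100% performance


def clock_modulation_value(level):
    """Low bits of the MSR for a duty level from 1 (12.5%) to 7 (87.5%)."""
    if not 1 <= level <= 7:
        raise ValueError("level must be in the range 1..7")
    return ENABLE_BIT | (level << LEVEL_SHIFT)


def duty_percent(level):
    """Equivalent performance, in percent, for a given duty level."""
    return level * 12.5


for level in (1, 4, 7):
    print(level, hex(clock_modulation_value(level)), duty_percent(level))
# 1 0x12 12.5
# 4 0x18 50.0
# 7 0x1e 87.5
```

A LITTLE-designated core might then be given level 4 (half speed) while the big-designated cores stay at FULL_SPEED, keeping in mind the caveat above that modulation only approximates a lower clock frequency.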
Although none of these approaches can be considered a perfect substitute
for running on the actual big.LITTLE hardware, they are all likely to be
very useful during the time until such hardware is actually available.
What Do You Do With a LITTLE CPU?
If you have both big and LITTLE CPUs, how do you decide what tasks
will be banished to the slower LITTLE CPUs?
Similarly, if your workload is currently running all on LITTLE CPUs,
how do you decide when to take the step of starting up one of the
power-hungry big CPUs?
Right now for SMP-configured big.LITTLE systems, “you”
is the application developer, who can use
facilities such as CPU hotplug, affinity, cpusets, sched_mc, and so on
to manually direct the available work to the desired subsets of the CPUs.
These facilities constrain the scheduler in order to ensure that nothing
runs on CPUs that are to be powered down.
Decisions on what CPUs to use should include a number of considerations.
First, if a LITTLE CPU is able to provide sufficient performance,
it provides better energy efficiency, at least in cases where
race to idle is inappropriate.
Second, because mobile platforms have no fans and are sometimes sealed,
some devices might not be able to run all the big CPUs at maximum
clock rate for very long without overheating.
Of course, such devices might also need to limit the heat produced
by analog electronics and GPUs as well (see Carroll's and Heiser's
2010
USENIX
paper [PDF]
and
presentation
[PDF]
for a power-consumption analysis of a ca. 2008 smartphone).
Third, some workloads can adapt themselves to lower performance.
For example, some media applications can reduce performance
requirements by dropping frames and reducing resolution.
Fourth, there is more to performance than CPU clock speed: For example,
it is possible that a workload with high cache-miss rates can
run just as fast on a LITTLE CPU as it can on a big CPU.
Finally, many workloads will have preferred ways of using the CPUs;
for example, some mobile workloads might use the LITTLE CPUs
most of the time, but bring the big CPUs online for short bursts
of intense processing.
Keeping track of all this can be challenging, which is one big reason
for thinking in terms of automated assistance from the scheduler.
Some of the proposed work towards this end is listed in the
Future Work and Prospects
section.
But first, let's take a closer look at CPU hotplug and its potential
replacements.
CPU Hotplug: Kill It or Cure It?
Although CPU hotplug has a checkered reputation in many circles, it
is what almost all current multicore devices actually use to evacuate
work from a given CPU.
This is a bit surprising given that CPU hotplug was intended for
infrequently removing failing CPUs, not for quickly bringing perfectly
good CPUs into and out of service.
It is therefore well worth asking what CPU hotplug is providing that users
cannot get from other mechanisms:
- Migrating timers off of a given CPU. This can likely be fixed,
but a synchronous fix that prevents any further timers from being
set may be more challenging.
- Shutting off a CPU with a single simple action.
This can likely be fixed, but requires coordinating interrupts,
the scheduler, timers, kthreads, and so on.
- Preventing all possible wakeup events from causing that
CPU to power back on until the user explicitly permits
it to power back on.
(Some platforms may have wakeup events that cannot be shut off.)
- Synchronous action, so that userspace can treat it atomically.
- Coordinating user applications based on
hotplug events.
(However, there is no known embedded or mobile use of this
feature, so if you need it, please let us know.
Otherwise it will likely go away.)
These CPU-hotplug features are valuable outside of the mobile/embedded
space; for example, some real-time applications will take a CPU offline
and then immediately bring it back online to make it fully available
for the application, in particular to clear timers off of the CPU.
Furthermore, people really do make use of CPU hotplug to offline failing
CPUs.
But this brings up another question.
Given that CPU hotplug does all these useful things, what is not to like?
First, CPU-hotplug operations can take several seconds, as shown
here.
An ideal power-management mechanism would have latencies in the
5ms range.
It might be possible to make CPU hotplug run much faster.
Second, CPU-hotplug offline operations use the stop_machine() facility,
which interrupts each and every CPU for an extended time period.
This sort of behavior is not acceptable when certain types
of real-time or high-performance-computing (HPC) applications are running.
It might be possible to wean CPU hotplug from stop_machine().
Third, a given CPU's workqueues can contain large numbers of pending
items of work, and migrating all of these can be quite time
consuming, as can re-initializing all the workqueue kernel threads
when a given CPU comes online.
Other CPU-hotplug notifiers have similar problems, which can
hopefully be addressed by coming up with a good low-overhead
way to “park” and “unpark” kernel threads that are
associated with an offline CPU.
Such a parking mechanism faces the following challenges:
- Many per-CPU kernel threads are (quite naturally) coded with
the assumption
that they will always run on the corresponding CPU.
- If a kthread that has an affinity to a given CPU
is awakened while that CPU is offline, the scheduler
prints an error message and removes the affinity,
so that the kthread will now be able to run on any CPU.
- Wakeups can be delayed so that they do not arrive
at the kthread until after the corresponding CPU
has gone offline.
- All kernel threads parked for a given offline CPU must sleep
interruptibly, because otherwise the kernel will
emit soft-lockup messages.
- When a given CPU goes offline, any work pending for that
CPU must either be completed immediately (thus delaying
the offline operation), migrated to some other CPU
(thus increasing complexity), or deferred until the
CPU comes back online (which might be never).
There is some reason to believe that
any mechanism that
evacuates all work from a CPU faces these same challenges.
Finally, CPU-hotplug operations can destroy cpuset configuration,
so that cpusets need to be repaired when CPUs are brought
back online.
This topic is currently the subject of spirited discussions.
Perhaps these CPU-hotplug shortcomings can be repaired.
But suppose that they cannot.
What should be done instead in order to evacuate all work from a given CPU?
Tasks can be moved off of a given CPU by use of
explicit per-task affinity, cgroups, or cpusets, although
interactions with other uses of these mechanisms need more
thought.
In addition, interactions among all of these mechanisms can have
unexpected results because of a strong desire that the scheduler
generally consume less CPU than the workload being scheduled.
However, interrupts can still happen on a given CPU even after all
tasks have been evacuated.
Interrupts must be redirected separately using the
/proc/irq directory.
This directory in turn contains one directory
for each IRQ, and each IRQ directory contains a
smp_affinity file to which you can write a
hexadecimal mask to restrict delivery of the corresponding
interrupts.
You can then examine the /proc/interrupts file
to verify that interrupts really are no longer being
delivered to the CPUs in question.
See Documentation/IRQ-affinity.txt in the
kernel sources for more information.
One caution: that document notes that some irq controllers
do not support affinity, and for such controllers it is
not possible to direct irq delivery away from a given CPU.
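Computing the hexadecimal mask by hand is error-prone, so a helper along these lines may be useful. It is a sketch with an invented name; the comma-separated 32-bit grouping for wide masks follows the way the kernel itself prints these files, to the best of my understanding.

```python
def affinity_mask(ncpus, excluded):
    """Hex mask allowing every CPU in range(ncpus) except those in `excluded`.

    Masks wider than 32 bits are split into comma-separated groups of
    eight hex digits, most significant group first.
    """
    mask = 0
    for cpu in range(ncpus):
        if cpu not in excluded:
            mask |= 1 << cpu
    if mask == 0:
        raise ValueError("at least one CPU must remain in the mask")
    digits = format(mask, "x")
    groups = []
    while digits:
        groups.insert(0, digits[-8:])
        digits = digits[:-8]
    return ",".join(groups)


print(affinity_mask(4, {1}))     # d
print(affinity_mask(8, {6, 7}))  # 3f
```

Writing the resulting string to an IRQ's smp_affinity file (as root) steers that interrupt away from the excluded CPUs, subject to the controller caveat above; /proc/interrupts can then confirm the effect.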
Finally, evacuating tasks and interrupts from a given CPU can still
leave timers running on that CPU.
As noted earlier, there is currently no mechanism other than
CPU hotplug to migrate timers off of a given CPU, but it should
be possible to create such a mechanism.
An asynchronous mechanism (which would allow each timer one final
ride on the outgoing CPU) is straightforward.
A synchronous mechanism would be more complex, but should be
doable.
So, what should be done?
In the near term, the only sane approach is to attack on both fronts:
(1) attempt to cure CPU hotplug of its worst ills (especially
given that it will likely continue to be needed for removing failing
CPUs), and (2) attempt to improve the alternative mechanisms so
that they can do more of the work that can currently only be done
by CPU hotplug—hopefully avoiding at least some of the complexity
currently inherent to CPU hotplug.
Future Work and Prospects
In the short term, the following actions need to be taken:
- Create an email list for the attendees and other interested parties.
This is now available
here,
courtesy of Amit Kucheria and Loic Minier.
- Document best practices for using existing Linux kernel
facilities (including CPU hotplug, cgroups, cpusets,
affinity, and so on) to manage big.LITTLE systems in
an SMP configuration.
This documentation should include measurements of latencies
(keeping in mind the 5ms goal for evacuating work
from a CPU and for restarting it) and power consumption.
Vincent Guittot's
presentation
and git tree
are a good start in this direction.
- Create software to emulate big.LITTLE systems on current hardware,
for example, using one or more of the approaches described in the
Parallel Hardware/Software Development section.
- Produce Linux-kernel-developer-friendly synthetic workloads
and benchmarks for mobile applications and use cases,
as discussed in the
Benchmarks and Synthetic Workloads
section.
There will be some Linaro work in this direction, but additional
workloads and benchmarks are welcome from any and all.
In the medium term, the following additional actions are needed:
- Experiment with improving cpusets and cgroups as discussed
in the
CPU Hotplug: Kill It or Cure It?
section.
- Experiment with curing CPU hotplug, also discussed in the
CPU Hotplug: Kill It or Cure It?
section.
- Accumulate a list of the attributes that system-on-a-chip
(SoC) vendors believe to be important to scheduling and
managing big.LITTLE systems.
An initial list was accumulated during the scheduler summit:
- Power-domain and clock-domain constraints. For example,
many ARM SoCs require that all CPUs in a cluster
run at the same clock rate.
- Thermal tradeoffs. For example, some SoCs might impose
a tradeoff between the number of CPUs running at a
given time and the frequency at which they are running
if they are to avoid thermal throttling.
- Thermal feedback, e.g., temperature sensors.
- Process type, where the amount of leakage current can affect
the optimal strategies for power-efficient operation.
- Relative benefit of reducing frequency of several CPUs
as opposed to consolidating workload on a small number
of CPUs.
- Instruction-per-clock (IPC) measurements and correlation
between clock rate and useful forward progress.
- Remove sched_mc.
In the longer term, the following additional actions would be quite
helpful:
- Port Frederic Weisbecker's OS-jitter-reduction patchset to ARM.
Geoff Levand of Huawei is leading an effort along these lines.
- Contact gaming companies (e.g., Epic) to see if their 3D gaming
engines (which run on
both iPhone and Android) can make good use of big.LITTLE,
even in the presence of thermal throttling.
- Investigate alternative scheduler disciplines. For example,
the prototype SCHED_EDF patchset would allow tasks to specify
deadlines, which would allow the scheduler to better decide
between race-to-idle and run-at-low-frequency. Other related
scheduler disciplines such as EVDF might be useful—there
may be other real-time technologies that can be commandeered
to energy-efficiency purposes.
If SCHED_EDF looks useful to mobile/embedded, someone needs
to forward-port the patch and fix a number of issues in it.
This is not a trivial project. (Paul Turner is looking into
bringing some Google resources to bear, and Juri Lelli has
been doing some recent
deadline-scheduler work.)
One common mobile/embedded requirement is to consolidate the
workload down to the minimum number of CPUs that can support
acceptable user experience, then spread the load across that
minimal set of CPUs.
- Investigate modal scheduling. Paul Turner gave the following
list of modes as an example:
- Low load of interactive, low-utilization tasks might
favor race to idle.
- Moderate load of periodic media-feeding tasks might
lower frequency to the smallest value that allows
the task to keep up with its hardware.
- High load of CPU-bound tasks in the absence of thermal
limitations might increase frequency.
Some hysteresis will be required. It is usually OK to delay
the decisions a bit, especially given that ARM provides relatively
fast transitions between power states.
Paul Turner posted a first step in this direction, with a
patch series
that improves the scheduler's ability to estimate
the effect of migration on each CPU's load.
A more up-to-date series is maintained
here.
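The modal-scheduling idea, including the hysteresis mentioned above, can be sketched as a small state machine in C. This is only an illustration under loud assumptions: the mode names, the toy_next_mode() function, and the percentage thresholds are all hypothetical, and the sketch looks only at load, ignoring task type and thermal limits, both of which the modes listed above depend on.

```c
#include <assert.h>

/* Hypothetical mode names, loosely following the three example modes. */
enum toy_mode {
    TOY_RACE_TO_IDLE,            /* low load of interactive tasks */
    TOY_LOWEST_SUFFICIENT_FREQ,  /* moderate load of periodic tasks */
    TOY_MAX_FREQ                 /* high load of CPU-bound tasks */
};

/* Hypothetical thresholds in percent load; real values would need tuning. */
#define TOY_LOW_UP    30  /* leave the low-load mode above this */
#define TOY_LOW_DOWN  20  /* re-enter the low-load mode below this */
#define TOY_HIGH_UP   80  /* enter the high-load mode above this */
#define TOY_HIGH_DOWN 70  /* leave the high-load mode below this */

/*
 * Pick the next mode from the current mode and the observed load.
 * The gap between the UP and DOWN thresholds is the hysteresis band:
 * inside it, the current mode is kept.
 */
enum toy_mode toy_next_mode(enum toy_mode cur, int load_pct)
{
    assert(load_pct >= 0 && load_pct <= 100);

    switch (cur) {
    case TOY_RACE_TO_IDLE:
        if (load_pct > TOY_HIGH_UP)
            return TOY_MAX_FREQ;
        if (load_pct > TOY_LOW_UP)
            return TOY_LOWEST_SUFFICIENT_FREQ;
        return cur;
    case TOY_LOWEST_SUFFICIENT_FREQ:
        if (load_pct > TOY_HIGH_UP)
            return TOY_MAX_FREQ;
        if (load_pct < TOY_LOW_DOWN)
            return TOY_RACE_TO_IDLE;
        return cur;
    case TOY_MAX_FREQ:
        if (load_pct < TOY_LOW_DOWN)
            return TOY_RACE_TO_IDLE;
        if (load_pct < TOY_HIGH_DOWN)
            return TOY_LOWEST_SUFFICIENT_FREQ;
        return cur;
    }
    return cur;
}
```

With these made-up thresholds, a load that hovers around 25% leaves the system in whichever mode it was already in, which is the point of the hysteresis.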
The scheduler mini-summit at Linaro Connect was quite productive,
with work already in progress to implement some of the recommendations.
For example, some code and patches are in flight to reduce RCU's dependence on
stop_machine(), which is a first step towards weaning CPU hotplug from
stop_machine().
For another example, Srivatsa Bhat is doing some good work on curing
CPU hotplug of some of its ills.
So how did we do against the goals?
Let's check:
- Take first step towards planning any Linux-kernel scheduler
changes that might be needed to make ARM's
upcoming big.LITTLE systems work well.
The most important actions toward this goal are the
emulation of big.LITTLE systems, the mobile/embedded
synthetic benchmarks/workloads, and the list of SoC attributes.
This information will help work out which of the longer-term
actions are most important.
- Create a power-aware infrastructure for scheduling and related
Linux kernel subsystems.
For example, integrate dyntick-idle,
cpufreq, cpuidle, sched_mc, timers, thermal framework, pm_qos,
and the scheduler.
The mobile/embedded synthetic benchmarks/workloads are the
most important first step in this direction, as is the
list of SoC attributes.
The removal of sched_mc is a first implementation
step, on the theory that one must tear down before one can
build.
- Provide a usable mechanism that reliably allows all work (present
and future) to be moved off of a CPU so that that CPU can
be powered off and back on under user-application control.
CPU hotplug is used for this today, but has some serious side
effects, so it would be good to either fix CPU hotplug or come
up with a better mechanism—or, in the best Linux-kernel
tradition, both.
Such a mechanism might also be useful to the real-time people,
who also need to clear all non-real-time activity from
a given CPU.
This goal received the most discussion, which produced the medium-term
actions for curing CPU hotplug on the one hand and improving
the alternatives to CPU hotplug on the other.
Work on these three goals has only just begun, but
with continued effort, we can make the Linux kernel work better for big.LITTLE
in particular and for mobile/embedded workloads on asymmetric
systems in general.
I am grateful to the scheduler mini-summit attendees for many
useful and enlightening discussions, and especially to Amit Kucheria
for organizing the mini-summit.
We all owe thanks to Zach Pfeffer, Peter Zijlstra, Amit Kucheria,
Robin Randhawa, Jason Parker,
Rusty Russell, Vincent Guittot,
and Dave Rusling for helping make this article human readable.
I owe thanks to Dave Rusling and Jim Wasko for their support of this effort.
Quick Quiz 1:
But what if there is a different number of Cortex-A7s than of Cortex-A15s?
Answer:
In that case, it is necessary to remove the excess CPUs from service,
for example, using CPU hotplug, before carrying out the switch.
Back to Quick Quiz 1.
Quick Quiz 2:
Why scale down?
Isn't it always better to run full out in order to race to idle?
Answer:
Although it often is best to race to idle, there are some important
exceptions to this rule in certain mobile/embedded workloads.
For one example, imagine a codec that required the CPU to
occasionally do some work to provide the codec with data.
Because CPU power consumption often rises as the square of the
core clock frequency, you typically get the best battery life
by running the CPU at the lowest frequency that gets the work
done in time.
As always, use the right tool for the job!
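To put rough numbers on this answer, here is a minimal sketch in C that takes the square law above at face value. The function names, the constant k, and the cycle counts are hypothetical, and the model deliberately ignores static and idle power, which is what makes race to idle attractive in other cases.

```c
#include <assert.h>

/*
 * Toy model, not a measurement: power is assumed to be k * f^2 (the
 * square law mentioned above), and the job needs a fixed number of
 * CPU cycles regardless of frequency.
 */
double job_energy(double k, double cycles, double freq_hz)
{
    assert(freq_hz > 0.0);

    double power = k * freq_hz * freq_hz;  /* watts, under the assumed law */
    double seconds = cycles / freq_hz;     /* time needed to finish the job */

    return power * seconds;                /* joules: k * cycles * freq_hz */
}

/* Returns nonzero if the job finishes within its deadline at this frequency. */
int meets_deadline(double cycles, double freq_hz, double deadline_s)
{
    return cycles / freq_hz <= deadline_s;
}
```

Under these assumptions the energy for a fixed job works out to k * cycles * freq_hz, so halving the frequency halves the energy for as long as meets_deadline() still holds.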
Back to Quick Quiz 2.
Quick Quiz 3:
I typed the following command:
sudo echo 800000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
Despite the sudo, I got “Permission denied”.
Why doesn't sudo give me sufficient permissions?
Answer:
Although that command does give echo sufficient
permissions, the actual redirection is carried out by the
parent shell process, which evidently does not have sufficient
permissions to open the file for writing.
One way to work around this is to run sudo bash followed
by the echo, or to do something like:
sudo sh -c 'echo 800000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq'
Another approach is to use tee, for example:
echo 800000 | sudo tee /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
Yet another approach uses dd as follows:
echo 800000 | sudo dd of=/sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
Back to Quick Quiz 3.
Quick Quiz 4:
Why would anyone use an Intel system to test out an ARM capability?
Answer:
The scheduler is core code, and for the most part does not care
about which instruction-set architecture is running.
The important thing is not the ISA, but rather the performance
characteristics.
Back to Quick Quiz 4.
Quick Quiz 5:
Why not just use SIGSTOP to park per-CPU kthreads?
Answer:
It might well work, at least given appropriate adjustments.
Please try it and let us know how it goes.
Back to Quick Quiz 5.
Quick Quiz 6:
What other mechanisms could be used to park per-CPU kthreads?
Answer: Here are some possibilities:
- Kill the kthreads at CPU-offline time and recreate them at
CPU-online time.
This is used today, and is quite slow.
- Kill the kthreads at CPU-offline time and recreate them at
CPU-online time, as above, but create a separate
CLONE_
flag to prevent the parent from waiting until the child runs.
This waiting behavior exists to work around an old bash
bug, and is not needed for in-kernel kthreads.
- Kill the kthreads at CPU-offline time, but don't recreate them
until they are actually needed, perhaps using a separate high-priority
kthread to allow the creation to be initiated from environments
where blocking is prohibited.
This might work well in some situations, but does increase the
state space significantly.
- Have the kthreads block while their CPU is offline.
This approach faces some complications:
- Wake-ups can be delayed, so that a delayed wakeup might
arrive at a kthread after the corresponding CPU has
gone offline.
- If a task that has an affinity to a given CPU awakens while that
CPU is offline, the scheduler prints a warning message
and breaks affinity.
This breaking of affinity can cause failures in kthreads
written to assume that they only run on their CPU.
- Sleeping uninterruptibly while a CPU is offline can
result in spurious soft-lockup warnings.
- Use an explicit rendezvous to park each kthread, setting its
affinity mask to cover all CPUs and informing it that it needs
to remain quiescent.
This operation would be reversed when the CPU comes back online.
This works, but is often surprisingly difficult to get right,
particularly on busy systems where wakeups can be delayed.
- Your idea here.
It is quite likely that different approaches will be required in
different situations.
Back to Quick Quiz 6.
February 22, 2012
This article was contributed by Neil Brown
I must admit that I love bugs. I also hate bugs, which leads to a very
conflicted relationship with the blighters, but for now I just want to
focus on the positive - I love bugs.
The bugs I am talking about are, of course, the incorrect behavior of
software systems due to incorrect code, and the reason that I love them
is that they are so educational. There are a number of reasons for
this, the significant one for the present being that they provide
motivation and focus to read and study new code.
One of the freedoms that free and open source software provides is the
freedom to study the source code. However having that freedom doesn't
mean it's easy to use it. When faced with a large body of code such as the
Linux kernel it can be hard to know where to start, or when to move
on. There is no story-line, no curriculum, no plot to guide your
study. This is where a bug comes in: it can provide a story line.
This may be a particular sequence of events or a particular
combination of features, or just a starting point to spiral out from.
But by providing a clear focus and a clear finishing point, it makes
study easier and more rewarding.
More than a phone
I have recently been spending some time trying to make the Linux
kernel run well on the
GTA04
"Phoenux" replacement motherboard for the
Openmoko "Freerunner" and "Neo 1973" mobile phones. This is an
attempt to revitalize the Openmoko effort to provide an open (or at
least "as open as possible") platform for a mobile computing device.
It fits in the same case, uses the same display and speakers, but
provides more memory, a faster processor, and faster networking plus a few
other extras. During this (ongoing) effort I have discovered a number
of bugs and missing features, most of which only came with smaller
learning experiences. One, though, stands out for leading me on a
somewhat longer path of discovery than most, and so I would like to
share that path and those discoveries so that others might benefit. I
should note first that the path is still open. I have not actually
fixed the bug, so others could follow the path themselves. Anyone keen
to do that should stop after the next paragraph, get a GTA04,
and start hunting themselves. Others can read on.
The symptom of the bug is easy to describe. Like most computer
systems the GTA04 has an RTC (real time clock) which has an alarm
function. When the time in the clock matches the time in the alarm,
it generates an interrupt. The problem is that, though this interrupt
works perfectly while the system is awake, it does not reliably wake the
device when it is asleep (i.e. in suspend-to-RAM state). Other
interrupts configured much the same way (for example data received on
the serial console) wake the system with 100% reliability. However
the RTC alarm almost always fails. Understanding this bug leads on a
path through the interrupt management code, the suspend management
code, and even the USB code. But it must start with a brief
description of the hardware.
In the GTA04 there are two particular integrated circuits (ICs) that
are relevant. One is the OMAP3 SoC (system-on-a-chip) from TI. The
other is the PMIC (power management IC) which is referred to as
"twl4030" in the various drivers - various because, like the OMAP3,
it is a multi-function chip combining diverse functions such as battery
charging, keyboard control, and the USB electrical interface. There are a
number of connections between these chips but only two are relevant
for now. These are a single interrupt line and an I2C bus.
The interrupt line signals the interrupt controller (INTC) in the
OMAP3 that something has happened. The I2C (inter-integrated-circuit)
bus is a simple 2-wire bus that allows the CPU to read and write
registers in the PMIC. In particular it can be used to find out which
of the several functions generated the interrupt, and why. It can
also be used to clear the source of the interrupt.
Having one interrupt line that indicates any of a number of possible
events like this is not uncommon. For example the UARTs (Universal
Asynchronous Receiver Transmitter - a serial port) in the OMAP3 each
have a single interrupt, but it can be generated by a character
arriving, a character transmit having completed, or an error
condition. As with most such devices there is a register that can be
inspected which has one bit for each possible interrupt source.
When all the interrupt sources are related to a single device and
handled by a single driver this is all quite easy to manage. However
when the different interrupt sources are related to logically separate
devices, having to manage all these behind a single interrupt could
get clumsy.
So many interruptions
To address this situation, Linux has an abstraction called an
"irq_chip". An irq_chip represents a set of
interrupt sources each of which is assigned an interrupt number (as
listed in /proc/interrupts). It provides functions to enable or
disable each interrupt, to allow the interrupt to wake the device from
suspend, to set the trigger type (edge or level), and various other
functions. It also arranges that the interrupt handler for each
interrupt will be called when appropriate.
[Figure: irq_chips with sources]
Using this, the driver for the INTC in the OMAP3 defines an
irq_chip for all the interrupts that it knows about directly,
and the driver for the PMIC can define a separate irq_chip
which describes the various interrupt sources in the PMIC. Each
irq_chip will provide very different functions to configure
and control the interrupt sources, but will provide a uniform
interface to all device drivers. When the single PMIC interrupt
arrives at the INTC, it will call an interrupt handler which will
examine the PMIC registers and then call the appropriate handler
that was registered with the second irq_chip. This way the
drivers for the various parts of the PMIC don't need to know too much
about each other - the irq_chip mediates between them.
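The way one interrupt line fans out to several handlers can be sketched in user-space C. This is a toy model and not kernel code: the real struct irq_chip has callbacks such as irq_mask, irq_unmask, irq_set_type and irq_set_wake with different signatures, and every toy_ name below is invented for illustration. The model also simplifies masking by assuming that a masked source is simply never posted.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_SOURCES 8

typedef void (*toy_handler_t)(int source);

/* Toy stand-in for a chained interrupt controller such as the PMIC. */
struct toy_irq_chip {
    uint8_t status;                         /* one bit per pending source */
    uint8_t masked;                         /* one bit per masked source */
    toy_handler_t handler[TOY_NR_SOURCES];  /* handlers registered by drivers */
};

/* Example handler that just counts how often each source fired. */
int toy_hits[TOY_NR_SOURCES];

void toy_count_handler(int source)
{
    toy_hits[source]++;
}

void toy_mask(struct toy_irq_chip *chip, int source)
{
    assert(source >= 0 && source < TOY_NR_SOURCES);
    chip->masked |= (uint8_t)(1u << source);
}

void toy_unmask(struct toy_irq_chip *chip, int source)
{
    assert(source >= 0 && source < TOY_NR_SOURCES);
    chip->masked &= (uint8_t)~(1u << source);
}

/* The "hardware" raises a source; a masked source is never posted. */
void toy_raise(struct toy_irq_chip *chip, int source)
{
    assert(source >= 0 && source < TOY_NR_SOURCES);
    if (!(chip->masked & (1u << source)))
        chip->status |= (uint8_t)(1u << source);
}

/*
 * Runs when the single parent interrupt line fires: read the status
 * register, clear it, and call the handler for each pending source.
 * Returns the number of handlers called.
 */
int toy_demux(struct toy_irq_chip *chip)
{
    uint8_t pending = chip->status;
    int called = 0;

    chip->status = 0;
    for (int s = 0; s < TOY_NR_SOURCES; s++) {
        if ((pending & (1u << s)) && chip->handler[s]) {
            chip->handler[s](s);
            called++;
        }
    }
    return called;
}
```

A driver for one part of the chip would only ever install its own entry in handler[] and never look at the status register itself, which is the separation that the irq_chip buys.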
In our PMIC there is a "Primary Interrupt Status Register" which has
one bit for each module that can generate interrupts. Each of these
modules has its own "Secondary Interrupt Status Register" to report
what actually caused the interrupt, much like the UART described
earlier. There is normally no need for a tertiary irq_chip
to represent these registers as the one driver manages the whole
module and all of its interrupts.
There are two exceptions, only one of which interests us. There is a
"power interrupts" module in the PMIC which combines interrupts from a
diverse range of sources: press of the power button, insertion of USB
cable, over-temperature warning, and the RTC alarm. These are
sufficiently varied that a separate irq_chip makes sense here.
The twl4030-irq driver manages all this quite elegantly, providing a
secondary irq_chip for the device as a whole, and supporting
tertiary irq_chips, including for the Power Interrupts module.
So when the RTC alarm triggers, the primary interrupt is handled by
the INTC. This calls an interrupt handler which inspects the PMIC,
determines that a power interrupt is active and calls the relevant
handler. It in turn examines the PMIC again and finds that the RTC
alarm has fired, and so calls the handler that was registered for that
particular interrupt. This all works perfectly. But it doesn't seem
to wake the device from suspend. At least, not always. It seems that
something happens on the way to suspend that interferes with this.
A funny thing happened on the way to suspend
One of the many things that happens on the way to suspend is that
each individual interrupt gets disabled - unless it is flagged as
IRQF_NO_SUSPEND in which case it is left alone. However this doesn't
mean exactly what it sounds like it means. Being "disabled" just
means that the handler routine will not be run; it doesn't mean that
the interrupt will not be generated. We have a different word for
that, which is "masking". When an interrupt is masked the originating
source of the interrupt is told to never post that interrupt.
Linux uses a lazy scheme for disabling interrupts. When the disable
request is made, the fact is recorded in internal data structures, but
that is all. If the interrupt is subsequently delivered, only then
might the interrupt be masked. This can be a useful optimization as
masking an interrupt can take a lot longer than just setting a flag in
memory.
So, on the way to suspend, interrupts are disabled but not masked. If
the interrupt does actually arrive before we reach full suspend, the
fact is recorded. If it was an interrupt that should wake from
suspend, this is detected in
check_wakeup_irqs()
and suspend aborts.
If the interrupt doesn't arrive before full suspend, then it is still
unmasked and will successfully wake up the device, which will resume
and then handle the interrupt. This might all seem a bit complex, but
once it is fully understood it is actually quite neat and it works
well ... except for my RTC alarm.
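For readers who prefer code to prose, the lazy-disable scheme and the wakeup check can be modeled in a few lines of user-space C. This is my own simplification of the behavior described above, not the kernel's implementation; the struct and all the toy_ function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_irq {
    bool disabled;  /* software flag set by the disable request */
    bool masked;    /* state of the (simulated) hardware mask */
    bool pending;   /* set if the interrupt arrived while disabled */
    bool wakeup;    /* configured as a wakeup source */
};

/* Lazy disable: only record the request; the hardware stays unmasked. */
void toy_disable(struct toy_irq *irq)
{
    irq->disabled = true;
}

/*
 * Called when the hardware delivers the interrupt.
 * Returns true if the handler would run.
 */
bool toy_deliver(struct toy_irq *irq)
{
    if (irq->masked)
        return false;        /* a masked source is never posted */
    if (irq->disabled) {
        irq->pending = true; /* remember that it happened */
        irq->masked = true;  /* mask only now, on first delivery */
        return false;
    }
    return true;
}

/* Rough analogue of the late-suspend check: abort if a wakeup irq is pending. */
bool toy_suspend_should_abort(const struct toy_irq *irqs, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        if (irqs[i].wakeup && irqs[i].pending)
            return true;
    return false;
}

/* Once fully suspended, only an interrupt that is still unmasked can wake us. */
bool toy_can_wake(const struct toy_irq *irq)
{
    return !irq->masked;
}
```

The three interesting cases fall out directly: a wakeup interrupt that arrives early aborts the suspend, one that never arrives stays unmasked and can wake the device, and a non-wakeup interrupt that arrives early ends up masked and silent.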
Who turned out the lights?
As promised, we will need to glance at the USB code as well, and now is
the time to bring that in. Like most modern phones, the GTA04 can be
attached to a computer or a charger via a USB port. When the USB
interface detects 5 volts on the power line, it will route this to the
battery charger (BCI) which will use it to charge the battery and
power the device.
This is relevant because - when tracing is added to the twl4030-irq
driver - we see that during suspend we get an interrupt from the PMIC
which turns out to be due to the battery charger losing power and saying
"Hey, just thought you should know that I'm running on battery again
now".
What is happening is that when the USB port is told to go to sleep it
turns off its various power supply regulators. One of these (USB3V1) also
powers the voltage comparator which determines if 5V is present.
As soon as that presence is no longer signaled, the current stops
being routed to the BCI, and the BCI justifiably complains.
This results in the interrupt line from the PMIC being asserted
which causes the interrupt management code to run. This notices that
the interrupt was disabled, and so masks it. When we get to
check_wakeup_irqs(), the PMIC IRQ is pending, but as it has not been
set for wakeup it is just ignored.
The net result here is that all interrupts from the PMIC are ignored
during suspend, so the RTC alarm doesn't work, the power button
doesn't wake the device up, and plugging or unplugging the USB cable
has no effect either.
So now that we understand the problem ...
So, what to do? We could just unplug the USB cable when testing the
RTC alarm. Then the BCI would not notice the current disappearing and
so wouldn't generate an interrupt. This certainly works - once. It
doesn't seem to work a second time, but we'll leave that issue for the
moment.
We could tell the USB controller not to turn off USB3V1 when the BCI
is active and power is available on the USB bus. This is certainly a good
idea and would mean that the battery can charge during suspend. It
also brings up the interesting issue of how these separate drivers
should communicate, but it doesn't really solve the general problem,
just this specific one.
It could be that an interrupt occurs that we really do want to ignore
during suspend, that we have even explicitly disabled (but due to
lazy-masking has not been masked). So we want to be able to mask
those interrupts without masking the main interrupt. A few moments'
reflection on what we have learned so far suggests that setting
IRQF_NO_SUSPEND on the intermediate interrupts should help. That way
they will not be disabled during suspend so the interrupt management
will trace through the chain of irq_chips far enough to find
the originating interrupt source and will mask just that. The other
interrupt source will still be available.
This may sound convincing, but reality has a way of getting in the
way: testing shows that just setting IRQF_NO_SUSPEND is not enough. Diving
back into the code and the data sheets reveals an interesting and important
fact. The PMIC allows each individual interrupt source to be masked, but it
does not allow a whole module as a set of interrupt sources to be masked. So
when an interrupt arrives from the BCI (one of the modules) the IRQ manager
notices that the interrupt is disabled and so the secondary irq_chip is
asked to mask it. But as the irq_chip cannot mask it (the hardware
doesn't support that), it simply ignores the request. You can imagine what
happens next - an unending stream of attempts to mask the BCI interrupt, none
of which are effective and the interrupt keeps re-firing.
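The effect of a mask function that silently does nothing can be shown with a small user-space model in C. It is a sketch under my own assumptions (a level-triggered line that stays asserted, and a retry limit so that the loop terminates); the names are hypothetical and this is not how the kernel's IRQ core is structured.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_line {
    bool asserted;  /* level-triggered: stays high until dealt with */
    bool can_mask;  /* false models a module the hardware cannot mask */
    bool masked;
};

/* A mask request is silently ignored when the hardware cannot honor it. */
void toy_mask_line(struct toy_line *line)
{
    if (line->can_mask)
        line->masked = true;
}

/*
 * Deliver a disabled, level-triggered interrupt until it stops firing
 * or the retry limit is reached. Returns the number of deliveries.
 */
int toy_deliver_disabled(struct toy_line *line, int limit)
{
    int deliveries = 0;

    assert(limit > 0);
    while (line->asserted && !line->masked && deliveries < limit) {
        deliveries++;
        toy_mask_line(line);  /* disabled, so all the core can do is mask */
    }
    return deliveries;
}
```

A line that can be masked is delivered once and then goes quiet; a line that cannot be masked uses up the whole retry limit, which in the real system has no limit at all.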
The root problem seems to be that while the TWL4030 PMIC appears to
have an interrupt structure that lends itself to being represented as
a small tree of irq_chips, this isn't actually the case. The
details of the behavior of suspend make it essential that a working
"mask" function be provided, and that cannot be provided for modules
in the twl4030, only for individual interrupts.
So it seems that a complete fix requires some change to the structure
of the irq_chips in the twl4030 driver. Possibly we could
make the structure simpler (no tree), or possibly we could make the
implementation more clever in some way. The important point is that
each interrupt that is visible to the Linux IRQ management code must
support being masked in the hardware in some way.
Another funny thing on the way to resume
As mentioned, after one successful wake-from-alarm, the RTC alarm in
the GTA04 doesn't work a second time. In fact it stops working
completely, even when not suspended. It seems that something has gone
wrong on the way to resume.
During resume all the interrupts that were disabled by suspend are re-enabled.
The three interrupts which relate to the RTC - the base PMIC interrupt, the
interrupt for the "power interrupts" module, and the actual RTC interrupt - are
enabled in that order with PMIC first. As this interrupt is still asserted (as
it is what triggered resume) the PMIC interrupt handler runs as soon as it
is enabled. It reads the interrupt status register to see which module
was responsible, and triggers the relevant interrupt for that module.
This secondary interrupt is still disabled so nothing happens. In
particular the interrupt source is not cleared, nor is it masked,
so we again get a repeated sequence of attempts to handle an interrupt
which cannot be handled. This one isn't unending as the loop that
enables all the interrupts is still running and it eventually gets to
the one for the target module (the power interrupts module). This is
triggered again and this time it actually does something. It reads
the final status register (thus clearing the interrupt) and calls the
interrupt handler for the RTC alarm. Unfortunately that is still
disabled so nothing else happens.
Eventually the RTC interrupt is enabled but by now it is too late.
In some cases an interrupt that fires while disabled gets the
IRQS_PENDING flag set so that when it is re-enabled, the interrupt is
resent. But that is not the case when the twl4030-irq interrupt
controller tries to handle a tertiary IRQ. Whether this is a bug or a
misunderstood feature is as yet unclear. Another journey of discovery
would be needed to fully understand that issue.
Meanwhile, because the interrupt is delivered but the interrupt service routine
isn't run, the handshaking between the driver and the device gets
confused and no other interrupt is ever seen from the RTC.
We can work around this problem by a simple expedient. When
re-enabling the interrupts during resume, do it in the reverse order.
This ensures that the RTC interrupt is enabled before it is called,
and everything works. This even feels like the "right" thing to do -
it is balanced for "enable" to happen in the reverse order of
"disable". But even if it is "right", it is not really a fix. There
would still be bugs lurking and ready to spring out again.
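The difference that the enable order makes can be reproduced in a toy model in C. It encodes only what is described above: the middle handler clears the interrupt source when it reads the status register, and an event delivered to a disabled final interrupt is lost. The three-level chain and all names are my own invention for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DEPTH 3  /* 0: PMIC line, 1: power-interrupts module, 2: RTC alarm */

struct toy_chain {
    bool enabled[TOY_DEPTH];
    bool source_asserted;  /* RTC alarm still latched in the PMIC */
    bool rtc_handled;      /* did the RTC handler actually run? */
};

/* One walk down the chain, as happens while the top-level line is asserted. */
static void toy_walk(struct toy_chain *c)
{
    if (!c->source_asserted || !c->enabled[0])
        return;
    if (!c->enabled[1])
        return;                  /* nothing cleared; it will fire again */
    c->source_asserted = false;  /* reading the status register clears it */
    if (c->enabled[2])
        c->rtc_handled = true;   /* otherwise the event is lost */
}

/*
 * Enable the three levels in the given order, walking the chain after
 * each step. Returns true if the RTC handler ran.
 */
bool toy_resume(const int order[TOY_DEPTH])
{
    struct toy_chain c = { .source_asserted = true };

    for (int i = 0; i < TOY_DEPTH; i++) {
        assert(order[i] >= 0 && order[i] < TOY_DEPTH);
        c.enabled[order[i]] = true;
        toy_walk(&c);
    }
    return c.rtc_handled;
}
```

Enabling in the order 0, 1, 2 clears the source at the second step while the RTC interrupt is still disabled, so the handler never runs; enabling in the order 2, 1, 0 delivers the alarm as soon as the top-level line is enabled.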
Journey's end
The point of this little tour is not to point out bugs in the code or
to shame the relevant developers. In fact the code is in general
quite elegant, well structured, largely very effective, and the
developers are only to be congratulated. Rather, the point is to
highlight the complexity of the task being tackled by developers
working in the embedded space:
- There are complex interactions between distinct hardware. Here the
USB interface and the battery charger have important interdependencies
that need to be reflected in the drivers. Some interdependencies already
are but there are subtleties that are easy to miss.
- The requirements of an "irq_chip" are not really documented
anywhere, and given the current rate of development there is a good
chance that such documentation would be out of date. Extracting the
requirements from the code is far from trivial. In this case a seemingly
correct and obvious implementation was only found to be deeply wrong
by a fairly obscure test case.
- The IRQF_NO_SUSPEND flag is clearly important, but not easy to
understand. It is fairly obvious what it does, but much less obvious
why you would want to do that. I imagine that as a developer my
approach would be to not set it until I found a bug which could be
fixed by setting it. Then I would set it and hope it didn't break
anything. This is not really a robust way to do development.
With complex hardware interactions and complex software dependencies
it is not surprising that we seem to see a new class of problems in
the embedded space. As we have observed, there are often ways to work
around the bug rather than fix it, and that is usually much simpler
and quicker. They may be tempting, but for this developer at least
they are nowhere near as satisfying.
Because if you follow the bug all the way to the root cause, and if
you understand all that is happening and find the right fix, then you
have not only helped other future developers, but have learned a whole
lot in the process. And that is why I love bugs.
Page editor: Jonathan Corbet