Brief items
The current development kernel is 3.5-rc2, released by Linus on June 8. "I
think -rc2 is in fairly good shape, and I've been trying to fairly
aggressively revert things that caused problems (and sometimes things that
were only *suspected* to be the cause of problems)." Since that
release he has been wandering around the world winning high-profile prizes,
so the flow of changes into the mainline has been relatively slow.
Stable updates:
3.0.34 and
3.4.2 were released on June 9;
3.2.20 came out on June 11.
Comments (none posted)
Not to mention that I would love to make all users that have x86_64
machines running i386 kernels, for something other than testing
i386, thrown into a large boiling pot of clue stew. It really
pisses me off when people run these i386 machines with 16 gigs of
memory and complain about performance. highmem should be disabled
for any x86_64 box with more than 1G of RAM running an i386 kernel
with a nasty message on boot up:
"Idiot! Why the f*ck are you running i386 on your nice shiny new
x86_64 machine, with Umpteen Gigs of RAM. Boot a x86_64 kernel for
Christ sake and get your full potential. Your i386 userspace will
still work just fine on it. You're like one of those 75 year old
retirees that can finally buy a Porsche just to drive it 10 miles
per hour below the speed limit with a line of cars behind them!"
—
Steven Rostedt
This is just more proof that it's absolutely pointless to make any
changes at all to the old IDE layer. Nobody really cares, and the
risk %99.999 of the time is purely to introduce regressions rather
than make forward progress.
—
David Miller
Comments (52 posted)
Btrfs developer Chris Mason has announced that he is leaving Oracle and
joining the growing crowd of kernel hackers at Fusion-IO. "
From a Btrfs point of view, very little will change. I'll still
maintain Btrfs and will continue all of my Btrfs development in the
open. Oracle will still use Btrfs in their Oracle Linux products, and
I'll work with all of the distros using Btrfs in production."
Full Story (comments: 49)
Some quotes just can't wait for the next week's edition to come about. In
a discussion on ACPI, Alan Cox claimed "
I spent the holiday avoiding
the English queen's party and turning from a republican into a raving
republican."
Linus's
response can only be read in its entirety.
Comments (80 posted)
Kernel development news
By Jonathan Corbet
June 12, 2012
While the kernel is a crucial part of almost any job one might want to do with
a computer, it is rarely the kernel itself that gets the interesting work
done. More often, it appears as overhead that takes system resources away
from the applications the user actually wants to run. So it makes sense to
optimize kernel operations to the greatest extent possible, especially when
those operations are carried out frequently on performance-critical paths.
The "word at a time" interface, optimized and made generic for the 3.5
release, is a good example of how far those optimization efforts can go.
The kernel does a lot of string processing, especially (but not
exclusively) when working with file path names. It is often necessary to
know the length of a name or path component. When confronted with such a
task, a C programmer would typically code a loop iterating through the
string one character at a time. But, given enough strings, the
per-character loop overhead starts to get expensive. It turns out that,
with enough bit-level trickery, much of that overhead can be dealt with by
working through string data one 32-bit or 64-bit word at a time. The "word
at a time" API makes that sort of processing possible—but with a certain
complexity cost.
The API
Code wanting to use this interface should include
<asm/word-at-a-time.h>. A few functions are defined
therein, the first being has_zero():
unsigned long has_zero(unsigned long value, unsigned long *bits,
const struct word_at_a_time *constants);
From one point of view, has_zero() is a simple boolean function
that returns true if value contains a zero byte. But what are the
other two parameters? Let's start with the constants value, which
must simply be set to a value defined in the header file:
const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
As will be described in more detail below, this structure contains
some useful constant values. The structure is small and the contents are
architecture-dependent, so it was evidently deemed unnecessary to create a
single, globally-accessible copy.
The bits parameter, instead, is a place where has_zero()
can stash temporary data that will be useful to the remaining functions in
the API. Those functions are:
unsigned long prep_zero_mask(unsigned long value, unsigned long bits,
const struct word_at_a_time *constants);
unsigned long create_zero_mask(unsigned long bits);
unsigned long find_zero(unsigned long mask);
Once has_zero() has identified a word containing a zero byte, all
three of these functions must be used to determine which byte
contains the zero value. The usual calling sequence looks something like
this:
if (has_zero(value, &bits, &constants)) {
bits = prep_zero_mask(value, bits, &constants);
bits = create_zero_mask(bits);
zero_byte = find_zero(bits);
/* ... */
In other words, prep_zero_mask() and create_zero_mask()
both take the bits value first created by has_zero() and
rework it to the point that find_zero() can produce the offset of
the first zero byte in the word.
This may seem like a lot of work to do, but there is a reason for it. The
functionality split allows different architectures to provide optimized
functions for each part of the job. But there is another interesting bit
of subtlety here: it is possible to perform a logical OR of two different
bits values from two calls to prep_zero_mask(). The
function hash_name() in fs/namei.c uses this feature to
search for either a zero byte or one containing a slash—the string it is
looking at ends either with a null byte or the beginning of the next
component in the path name. The kernel spends a lot of time processing
path names, so this operation is worth optimizing in this way.
There is one other little detail to be kept in mind: the string might not
start at the beginning of a word. Managing unaligned strings adds a bit
more complexity to the task; the curious can look at
lib/strnlen_user.c for one example of how these strings are
handled. All told, using this interface adds enough complexity that it is
probably almost never worthwhile. In that rare case where a per-character
loop is too expensive, though, word-at-a-time access can help.
How it works
The x86 version of this API can be found in
arch/x86/include/asm/word-at-a-time.h; one might be forgiven for
thinking that parts of it came from the obfuscated C contest. It starts by
defining the above-mentioned constants:
struct word_at_a_time {
const unsigned long one_bits, high_bits;
};
#define WORD_AT_A_TIME_CONSTANTS { REPEAT_BYTE(0x01), REPEAT_BYTE(0x80) }
REPEAT_BYTE() is a macro (defined in
<linux/kernel.h>) that fills a word with copies of the given
byte value. So, on a 32-bit machine, one_bits will be initialized
to 0x01010101, and high_bits will be 0x80808080;
64-bit machines will get the same pattern, but twice as long.
After that,
has_zero() is defined as:
static inline unsigned long has_zero(unsigned long a, unsigned long *bits,
const struct word_at_a_time *c)
{
unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
*bits = mask;
return mask;
}
In English, the code subtracts one from every byte, masks out all of the
bits that were set in the original value, then masks everything but
the highest bit in every byte. If one thinks of each byte as an
independent value, the high bit can be thought of as the sign bit.
Subtracting one from a value will only cause the sign bit to change from zero to one
if the bytes's value was zero before. So this series of operations will cause the
highest bit to be set in any byte whose value was zero before. (In truth,
the bytes are not independent, and borrowing will cause different results
after the first zero byte, but only the first one is of interest so that is
unimportant).
In the x86 implementation, prep_zero_mask() does nothing and will
be optimized out by the compiler. That is not true of
create_zero_mask(), though:
static inline unsigned long create_zero_mask(unsigned long bits)
{
bits = (bits - 1) & ~bits;
return bits >> 7;
}
The subtraction will cause all bits up to the first set bit to be set to
one; all of the other bits are then masked out and the result is
right-shifted. Thereafter, all bytes prior to the first zero byte (in the
original value) will be set to 0xff. All that's left, now, is to
figure out how many of those fully-populated bytes there are. The code
that does this is not entirely straightforward; it is the result of a
request Linus posted on Google+ in March. For 32-bit machines,
find_zero() comes down to this code:
long a = (0x0ff0001+mask) >> 23;
/* Fix the 1 for 00 case */
return a & mask;
On 64-bit systems, instead, it looks like:
return mask*0x0001020304050608ul >> 56;
Either way, the effect is to produce a number that is the byte offset of
the first zero.
This API is relatively new, having been first added (for x86 only) in the
3.4 development cycle. In 3.5, it was substantially reworked and made more
widely applicable. There are specific implementations for x86 and powerpc (the
latter uses a "count leading zeroes" instruction to speed things up); there
is also a "generic" version that really only works properly for big-endian
architectures. That is enough, though, for a number of architectures to
make use of the capability. The resulting microsecond-sized time savings
may not seem like much, but, multiplied across all of the string operations
the kernel does, it adds up to a significant improvement.
Comments (17 posted)
By Jake Edge
June 13, 2012
An elusive, but highly sought after goal for Linux on ARM is to have a
"single" kernel image that can boot on all ARM devices. The full goal is
likely not attainable, but reducing the number of different ARM kernels is,
and progress is being made. Linaro kernel tech lead Deepak Saxena reported
on the motivations behind the consolidation effort as well as progress
toward the goal at LinuxCon Japan 2012.
The holy grail
Saxena started with a problem statement. He noted that those who have
worked with ARM for a while may not really see the problem, but it
is a problem that every ARM processor needs its own kernel. For
example, a laptop vendor that makes a few small hardware changes will almost
certainly need a separate kernel for each revision.
The "holy grail" is a single kernel binary that will boot on any ARM
device, he said.
Requiring separate kernels "creates a lot of extra work" for developers.
If they are building a driver for multiple platforms—or even three
revisions of the same platform—they will probably need to build and
test a kernel for each. Sometimes there is a register that can be read to
determine the hardware revision, thus reducing the number of different
kernels, but that often is not the case.
For distributions, supporting ARM is difficult right now. Debian, for
example, has different kernels for the BeagleBoard, PandaBoard, and
others. In the consumer electronics and mobile space, the problem is
similar, and those companies want to reduce the amount of testing that
needs to be done. Right now, we think of cell phones more than anything
else, but the Saxena pointed to the video projector and camera as devices
that are also likely ARM-powered. Beyond that, there will soon be ARM servers.
The model in the server/enterprise world is very different than mobile.
Typically, a mobile device comes with a full software stack, and updates come the
same way, but that is not the case for servers. You may or may not get the
distribution that you want on servers and, in fact, may get Windows on
those systems. That means that distributions need to have installation
media that can work on a wide variety of server platforms.
"The distros have spoken", Saxena said, and they need one kernel image that
can be built, booted, and tested on all of the different platforms. There
are thousands of different x86-based laptops today, and you don't need a
different kernel for each. The Linux ARM community wants to get to that
same model. Beyond that, cloud and hyperscale computing also need to be
able to deploy on lots of different platforms without requiring separate
kernels.
"How did we get here?", he asked; you can boot a single Ubuntu or Fedora
install disk on all x86 systems, but that is not true for ARM. Part of the
problem is the diversity in the ARM world. There is also a lot of
fragmentation in how code has been written to support Linux on ARM. There
are multiple implementations for the same IP block, for example. Functionality has
been duplicated at least partly because of a closed-door policy by the
various ARM vendors.
In addition, Linux ARM maintainer Russell King got overloaded, which led
ARM platform and system-on-chip (SoC) developers to start pushing their
code to Linus Torvalds directly. Saxena said that he may have been the
first to do that, for ixp4xx, but now he sees the problems that
caused and apologized for doing so. It led to an inability for anyone to
track the "big picture" in Linux ARM support, he said.
Fixes needed
It is now a multi-faceted problem that requires several different avenues
of attack to clean up. Saxena identified four areas that are being
pursued. Cleaning up and consolidating the header files within the various
ARM machine and platform trees is one, while consolidating ARM drivers is
another. In addition, device tree will provide a way to specify the
differences between ARM SoCs at runtime. Finally, doing active maintenance
of the ARM tree, keeping in mind the big picture, will also help. No one
of those fixes the problem, but all of them together get us closer to the
holy grail, he said.
To start with, there are various header file collisions in the
ARM tree. There are a bunch of arm/arch/mach-* directories, one
for each of the machine types. Each of those has an include/mach
directory that maps to the top-level include/mach directory at
build time. In order to build for multiple machine types, those header files
need to be consolidated in one place so that the remapping doesn't need to
happen.
In the 3.0 kernel tree, which was around the time the consolidation effort
started, there were 64 different machine types in the tree. Some of those
machine types are similar and could be consolidated. For the others, there
are lots of overlapping symbols that need to be dealt with. The goal is to
get rid of as many of those as possible either by making them generic for
all ARMs or by moving platform-specific symbols to non-generic header files.
There were also 577 occurrences
of #include <mach/*> in the drivers
directory. Unfortunately, ARM has a lot of driver-specific definitions in
the architecture headers, which means that drivers depend on
arch/arm header files. Basically, it is prevalent for ARM drivers
to require definitions from both the driver directories and the
architecture headers, which makes it difficult to build multi-platform
kernels.
Linaro and the ARM community started working on these problems last year.
They met in August and thought they could have a solution in relatively
short order, but that proved not to be the case. Some changes require
coordination between multiple maintainers and others are awaiting agreement
between maintainers on how to proceed. The problem "may seem trivial at
first", Saxena said, but it actually fairly complicated. The hope is to
have a single zImage for multiple systems by the end of 2012.
Beyond the header file issue, there is a need to cleanup and consolidate
the drivers in the tree. The problem is not directly related to creating a
single kernel, but fixing it will help to get there, he said. There are lots
of implementations of drivers for the same hardware in the tree, sometimes
with a different API. That can cause problems with overlapping symbols and
code bloat.
The clock management code is the "epitome of code duplication and
fragmentation" in the ARM kernels, Saxena said. The clk.h file,
which declares a struct clk, was introduced in 2.6.16 back in
2006. Since then, 27 different struct clk implementations
have been added to the tree, each with its own API and semantics. Over the
last two years or more, work has been done to create a common definition
and API, though the job is not done yet, as there is ongoing discussion on
certain parts.
Pinmux is another example of duplication. It is a subsystem to manage the
pins on SoCs and there were multiple implementations of that
functionality. The problem is not as bad as struct clk, he
said, but there was still a need to consolidate. After six months of work,
a common pinmux implementation is now
upstream, though there are still
discussions about certain parts of the implementation and API.
Another piece of the solution is device tree. Before device tree, very
small, simple changes to the hardware would require a kernel rebuild
because the information about IRQ lines, memory maps, and so forth were
statically defined. Device tree makes it so that much of this information
can be probed at boot time.
Using device tree means creating a source file using a "simple markup
language" that defines the hardware. It can specify where the registers
are or what the interrupts are for a particular device. That means that
the same kernel can be used on a new SoC once a device tree file has been
created for that SoC. Since the kernel will not have to change, it makes
hardware revisions and testing multiple devices that much easier.
Status and plans
Currently, a single kernel can be built for the four Linaro member
platforms (imx, omap2, ux500 and vexpress), though only the Versatile
Express board boots at this point.
The goal is to have all four booting by the end of the year. Linaro is
focused on its member platforms, but would like to get other platforms
supported as well. That is even more reason for SoC vendors to get their
code upstream, Saxena said, as it will be more difficult to participate in
the multi-platform kernel effort with code that is out of the mainline.
The ARM SoC tree has been used as the base, with Arnd Bergmann maintaining
a branch for the multi-platform work.
There are, of course, things still left to do. USB drivers need to be
consolidated as there are some problems building multiple host controllers into
one kernel at this point. Finishing the device tree conversion is another
piece of the puzzle; the infrastructure is there, but there are lots of
drivers and board files to convert. At the moment there is something of a
hack around the "driver #include madness", which needs to be
cleaned up for real.
While the holy grail has not been reached, things will be better than they
are today, Saxena said. Due to micro-architecture and page table
differences, four kernels are envisioned: ARM v6/7 with large physical
address extensions (lpae), ARM v6/7 non-lpae, ARM v4/5, and, eventually,
ARM v8. It still means multiple kernel binaries, but that could be treated in
the same way that the distributions handle various CPU extensions in the
x86 world. The idea of "one zImage to rule them all" turns out to not be
practical, but we will end up with far fewer ARM kernel binaries in the
near future.
[ The author would like to thank the Linux Foundation for assisting with
travel to Yokohama for LinuxCon Japan 2012. ]
Comments (2 posted)
June 12, 2012
This article was contributed by Paul McKenney
ARM's big.LITTLE architecture is an example of asymmetric multiprocessing
where all CPUs are instruction-set compatible, but where different CPUs
have very different performance and energy-efficiency characteristics.
In the case of big.LITTLE, the big CPUs are
Cortex-A15
CPUs with deep pipelines and numerous functional
units, providing maximal performance.
In contrast, the LITTLE CPUs are
Cortex-A7
with short pipelines and few functional units,
which optimizes for energy efficiency.
Linaro is working on two methods of
supporting big.LITTLE systems.
The first method pairs up each big CPU with a
LITTLE CPU, so that user applications are aware of only one CPU corresponding
to each such pair.
A given application thread can then be transparently migrated between
the big and LITTLE CPUs making up the pair, essentially providing
extreme dynamic voltage/frequency scaling.
Instead of merely varying the voltage and frequency, at some point the
application thread is migrated between the big and LITTLE CPUs making
up the corresponding pair,
as was described in an
earlier LWN article by
Nicolas Pitre.
This approach is termed “big.LITTLE Switcher.”
The other way to support big.LITTLE systems is to have all CPUs,
both big and LITTLE, visible in a multiprocessor configuration.
This approach offers greater flexibility, but also poses special
challenges for the Linux kernel.
For example, the scheduler assumes that CPUs are interchangeable,
which is anything but the case on big.LITTLE systems.
These big.LITTLE systems were therefore the subject of a
scheduler minisummit at last February's
Linaro Connect in the Bay Area which was
reported on at LWN.
This article summarizes progress since then, both on the relevant
mailing lists and as reported during the Linaro Connect sessions in
Hong Kong, and is divided into the following topics:
-
Emulate asymmetric computation on commodity systems
-
Document system configuration for big.LITTLE work
-
Define mobile workload and create simulator
-
Address kthread performance issues
-
Wean CPU hotplug from stopping all CPUs
-
Add minimal support to scheduler for asymmetric cores
-
Conclusions
Following this are the inevitable
Answers to Quick Quizzes.
Emulating asymmetric computation on commodity systems
The need to develop software concurrently with the corresponding hardware
has led to heavy reliance on various forms of emulation, and ARM's
asymmetric big.LITTLE systems are no exception.
Unfortunately, full emulation is quite slow, especially given that core
kernel code and user-level code really only needs a reasonable approximation
to big.LITTLE's performance characteristics.
What we need instead is some way to cause commodity systems to roughly
mimic big.LITTLE, for example, by artificially slowing down the CPUs
designated as LITTLE CPUs.
There are a number of ways of slowing down CPUs, the first three of which
were discussed at the February
scheduler minisummit:
- Assign a high-priority
SCHED_FIFO cycle-stealing
thread for each designated LITTLE CPU, which consumes a
predetermined fraction of that CPU's bandwidth.
- Constrain clock frequencies of the LITTLE CPUs.
- Make use of Intel's T-state capability.
- Use perf
to place a load on the designated-LITTLE CPUs.
Rob Lee presented the results of experiments comparing the
cycle-stealing and clock-frequency approaches.
Morten Rasmussen proposed the perf-based approach,
in which he configured perf to interrupt the designated LITTLE CPUs
every few thousand cycles.
Each of these approaches has advantages and disadvantages, as
laid out in the following table:
| Type |
Transparent to Scheduler? |
Portability? |
Calibration? |
Scope? |
| Cycle stealing |
SCHED_FIFO threads visible |
Portable |
Requires calibration for duty cycles less than about 10 milliseconds |
Process-level only |
| Clock Frequency |
Transparent |
Requires per-CPU clock domains |
Auto-calibrating, but limited number of settings |
All execution |
| T-State |
Transparent |
Intel only |
Auto-calibrating |
All execution |
| perf |
Transparent |
Requires fine-grained perf |
Requires calibration |
All code subject to perf exceptions |
Quick Quiz 1:
How can you tell whether multiple CPUs on your system are in the same
clock domain?
Answer
As can be seen from the table, none of these mechanisms is perfect,
for example, many embedded systems-on-a-chip (SoCs) have multiple CPUs
(often all of them) in a given clock domain, which limits the applicability
of the clock-frequency approach.
Rob has placed scripts for the cycle-stealing and clock-frequency
approaches in a git tree
located at git://git.linaro.org/people/rob_lee/asymm_cpu_emu.git;
he plans to add Morten's perf-based approach as well.
Quick Quiz 2:
Given that these emulation approaches are never going to provide a
perfect emulation of big.LITTLE systems, why bother?
Wouldn't it be simpler to just wait until real hardware is available?
Answer
At this point, it seems likely that more than one of these will be
required in order to cover the full range of development hardware and
workloads.
Documenting system configuration for big.LITTLE work
The Linux kernel as it is today can handle big.LITTLE systems,
give or take hardware bring-up issues.
However, the current kernel does need some help to get
good results from a big.LITTLE system.
Chris Redpath's presentation is a first step towards determining the
best division of labor between current kernels and the application.
Chris described an experiment running an Android workload on emulated
big.LITTLE hardware.
He made use of Android's bg_non_interactive cpuset,
which holds low-priority threads (ANDROID_PRIORITY_BACKGROUND)
and is limited to 10% of CPU usage.
Chris further constrained this cpuset to run only on CPU 0, which on his
system is a LITTLE CPU.
Chris then created two new cpusets, default and
fg_boost.
The default cpuset is constrained to the LITTLE CPUs,
and contains all non-background tasks that are not SCHED_BATCH.
The SCHED_BATCH tasks remain in the bg_non_interactive
cpuset called out above.
The new fg_boost cpuset contains tasks that are
SCHED_NORMAL and that have a priority higher than
ANDROID_PRIORITY_NORMAL.
Chris used a combination of sysbench and cyclictest as a rough-and-ready
mobile-like workload, where sysbench mimics high-CPU-consumption mobile
tasks (e.g., rendering a complex web page) and cyclictest mimics interactive
workloads (e.g., user input and communications).
Chris's configuration uses four sysbench compute threads and eight
cyclictest threads.
The sysbench threads all run for 15 seconds and then block for 10
seconds, and repeat this on a 25-second cycle.
The eight cyclictest threads run throughout the full benchmark run.
Without Chris's additional cgroups, the scheduler scattered the threads
across the system, slowing execution of the sysbench threads and wasting
power.
With the additional cgroups, the sysbench threads were confined to the
big CPUs, and the cyclictest threads to a single LITTLE CPU.
In short, use of cpusets to constrain whether given threads run on big
or LITTLE CPUs works quite well, at least as long as we know a priori
which threads should run where.
Chris also tested sched_mc, which can be set up to keep
short-lived tasks off of the big CPUs.
Although sched_mc was able to keep an additional CPU free
of work, it was unable to help the remaining CPUs to reach deeper sleep
states, and was nowhere near as effective as use of cpusets.
These big.LITTLE results support the decision taken at the February Linaro
Connect to remove sched_mc from the kernel.
Finally, Chris tested userspace tools that dynamically migrate tasks
among the
CPUs in response to Android's priority boosting of foreground user-interface
(UI) threads.
In contrast, Chris's first approach statically assigned
the threads.
This approach required minor changes to the Android software stack.
Although this approach was effective in getting high-consumption UI
threads running on big CPUs, the migration latency rivaled the
bursts of UI CPU-bound activity, which in many cases defeats the purpose.
The migration latency would need to be decreased by about a factor of two
for this approach to be worthwhile, though your mileage may vary on
real hardware.
These results underscore the importance of planned scheduler changes
for dynamic general-purpose workloads.
Several other potential approaches were discussed:
- Use a modified
cpufreq governor to switch between
the big and LITTLE CPUs.
One potential issue with this approach is that the governor
currently runs at full frequency most of the time, in accordance
with its race-to-idle strategy.
Some tweaking would therefore be required for workloads for which
race-to-idle is inappropriate.
- Treat specific applications specially, for example, make big.LITTLE
scheduling decisions based on the
->comm name.
This is likely to work well in some cases, but is unable to
optimally handle applications that have different threads
that want to run on different CPUs.
- Rather than migrating threads from one cpuset to another, add and
remove CPUs to given cpusets.
This should reduce the sysfs overhead in some cases.
In short, it is possible to get decent performance out of big.LITTLE
systems on current Linux kernels for at least some workloads.
As more capabilities are added to the scheduler, we can expect
more workloads will run well on big.LITTLE systems, and also that
it will become easier to tune workloads that can already be made
to run well on big.LITTLE.
Define mobile workload and create simulator
An easy-to-use mobile-workload simulator is needed to allow people
without access to the relevant hardware, emulators, and SDKs to evaluate
the effects of changes on mobile workloads.
The good news is that the work presented at Linaro Connect used
a number of synthetic workloads, but the not-so-good news is that
the setups are not yet generally useful.
Shortcomings include: (1) Much work is required to interpret the
output of the workloads, (2) The figures of merit are not constant,
but instead different figures of merit are chosen in different circumstances,
and of course (3) They are not (yet) nicely packaged.
Hopefully we will see additional progress going forward.
Address kthread performance issues
At the February Linaro Connect, Vincent Guittot presented
work
showing that the bulk of CPU-hotplug latency was due to creation,
teardown, and migration of the per-CPU kthreads that carry out housekeeping
tasks for various kernel subsystems.
Further discussion indicated that any mechanism that successfully
clears all current and future work from a given CPU will likely
incur similar latencies.
In an ideal world, these per-CPU kthreads would automatically quiesce
when the corresponding CPU went offline, and then automatically spring back
into action when the CPU came back online, but without the overheads of
kthread creation, teardown, or migration.
Unfortunately, the scheduler does not take kindly to runnable tasks
whose affinity masks only allow them to run on CPUs that are currently
offline.
However, Thomas Gleixner noticed that there is one exception to this rule,
namely a set of per-CPU tasks that have exactly this property:
the idle tasks.
This key insight led Thomas to propose that the per-CPU kthreads
should have this same property, which should completely eliminate the
overhead of per-kthread creation, teardown, and migration.
The idle tasks are currently created by each individual architecture,
so Thomas posted a
patchset that
moves idle-task creation to common code.
This patchset has been well received thus far, and went into mainline
during the 3.5 merge window.
Thomas has since followed up with a new
patch
that allows kthreads to be parked and unparked.
The new kthread_create_on_cpu(),
kthread_should_park(),
kthread_park(), and
kthread_unpark() APIs can be applied to the per-CPU
kthreads that are now created and destroyed on each CPU-hotplug cycle.
Quick Quiz 3:
This sounds way simpler than CPU hotplug, so why not just
get rid of CPU hotplug and replace it with this mechanism?
Answer
At the Q2 Linaro Connect in Hong Kong, Nicolas Pitre suggested
a different way to leverage the properties of idle tasks, namely
to provide a set of high-priority per-CPU idle tasks that would normally
be blocked.
When it was time to quiesce a given CPU, its high-priority idle task would
become runnable, preempting all of the rest of that CPU's tasks.
The high-priority idle task could then power down the CPU.
This approach can be thought of as a variant of idle-cycle injection.
There are some limitations to this approach, but it might work quite
well in situations where the CPU is to be quiesced for short periods
of time.
In the meantime, Vincent Guittot has been experimenting with various
userspace approaches to reduce the per-kthread overhead.
One promising approach is to use cgroups to ensure that kthreadd (the
parent of all other kernel threads)
is guaranteed the CPU bandwidth needed during the hotplug operation.
Preliminary results indicate large improvements in latency.
Although this approach cannot achieve a five-millisecond CPU-hotplug
operation (a target adopted at the February meeting), it is nevertheless
likely to be useful for projects
that are required to use current kernels with the existing slow
CPU-hotplug code.
Wean CPU hotplug from stopping all CPUs
Another source of CPU-hotplug slowness—and of real-time
latency degradation—is its use of __stop_machine().
The problem is that __stop_machine() function causes
each CPU to run a special per-CPU __stop_machine()
kthread, which brings all processing on the system to a grinding
halt for the duration.
This system-wide grinding halt is bad for performance and fatal to real-time
response.
Given that other systems have managed to remove CPUs from service
without requiring such a grinding halt, it should be possible to
wean Linux's CPU-hotplug process from its __stop_machine()
habit.
Quick Quiz 4:
Why bother to fix CPU hotplug?
Wouldn't it be better to just rip it out and replace it with something better?
Answer
This is a large change, and will take considerable time and effort.
However, it is likely to provide good improvements in both performance
and robustness of CPU hotplug.
Add minimal support to scheduler for asymmetric cores
There has been great progress in a number of areas.
First, Paul Turner posted a new version of his
entity load-tracking patchset.
This patchset should allow the scheduler to make better
(and more power-aware) task-placement and load-balancing decisions.
Second, Morten Rasmussen ran some experiments (including experimental
patches) on top of Paul Turner's patchset.
See below for more information.
Third, Peter Zijlstra posted a
pair of
patches
removing sched_mc
and also posted an
RFC email
proposing increased scheduler awareness of hardware topology.
This should allow the scheduler to better handle asymmetric
architectures such as ARM's big.LITTLE architecture.
Finally, Juri Lelli posted an
updated version
of a prototype
SCHED_DEADLINE patchset,
which Juri has taken over from Dario Faggioli.
Please see below for a discussion of how
SCHED_DEADLINE might be used for energy efficiency on
big.LITTLE systems.
Morten Rasmussen presented prototype enhancements to the Linux scheduler
that better accommodate big.LITTLE systems.
As noted above, these enhancements are based on Paul Turner's entity
load-tracking patchset.
The key point behind Morten's enhancements is that work should be
distributed unevenly on big.LITTLE systems in order to maximize
power efficiency with minimal performance degradation.
Unlike SMP systems, on big.LITTLE systems it matters deeply where
a given task is run, not just when a task is run.
Morten therefore created a pair of cgroups, one for the big CPUs
and another for the LITTLE CPUs, but with no load balancing between
them.
The select_task_rq_fair() function checks task load history
in order to make the correct big.LITTLE selection when a given task
starts running.
High-priority tasks that have tended to be CPU bound are placed on big CPUs,
and all other tasks are placed on LITTLE CPUs.
Of course, a high-priority task that has historically been I/O bound
might enter a CPU-bound phase.
Therefore, Morten's patchset also periodically migrates tasks via
load balancing.
There are of course issues with global load balancing, and addressing
these issues is important future work.
Nevertheless, when running the Bbench browser benchmark on an
emulated big.LITTLE Android system, Morten's patches provided
response time very nearly that of an SMP system with all big CPUs,
both when running on SMP hardware emulating big.LITTLE and
when running on Paul Turner's LinSched.
This is clearly an impressive achievement.
In contrast, big.LITTLE response time on a vanilla kernel is 10-20% slower
with much worse variability in response times.
Obviously, considerably more performance evaluation is required,
but preliminary results are quite promising.
In addition, more work will be required to handle thermal issues
and also to do load balancing of tasks back onto LITTLE CPUs when
all the big CPUs are saturated.
Juri Lelli's SCHED_DEADLINE is quite important for some
types of real-time computing, but some thought is required to see possible
application to non-real-time big.LITTLE systems.
The key thing about SCHED_DEADLINE is that it provides the
scheduler with information about the likely future needs of the various
applications running on the system.
This information should allow the scheduler to make better big.LITTLE
task placement decisions, and also potentially better frequency-selection
decisions.
Of course, this begs the question of where the deadlines and CPU
requirements come from.
In fact, one reason that deadline schedulers remain the exception
rather than the rule even for real-time applications is that it is
difficult to generate the needed information for a given real-time
application, especially given
the non-deterministic
nature of modern hardware
and variations in rendering time for different media—or even
for different segments of a given item being played back.
One way to solve this problem for some mobile workloads might be to
use feedback-directed profiling techniques, where the application's
CPU requirements are measured rather than derived.
Regardless of how things turn out for the use of deadline
scheduling for mobile workloads, there is clearly some important
and innovative work ahead for power-aware scheduling of mobile
workloads.
Finally, Vincent Guittot led a discussion on additional inputs
to the scheduler, which centered mainly around accurate load tracking
(e.g., Paul Turner's patchset), which CPUs share a clock (and thus must
all run at the same frequency), which CPUs share idle state, which
CPUs share each level of cache, and thermal constraints.
There was also a call to keep all this simple, which is clearly extremely
important—and also will likely be extremely difficult.
But if keeping things simple was easy, where would be the challenge?
Conclusions
Although there is still a great deal of work remaining before
mobile workloads fully exploit the power and power efficiency of
big.LITTLE, the past few months have seen some impressive progress
in that direction.
With appropriate user-mode help, some important workloads can make
good use of big.LITTLE systems even given the existing scheduler,
and some modest modifications (in conjunction with Paul Turner's
entity load-tracking patchset) greatly improves the scheduler's
ability to support big.LITTLE with minimal user-mode help.
A promising design for streamlining CPU hotplug has been identified,
and work in that direction has started.
This work gives some hope that the future will be bright for asymmetric
multiprocessing in general and ARM's big.LITTLE architecture in
particular.
Acknowledgments
We all owe a debt of gratitude to Srivatsa Bhat, Juri Lelli, Daniel Lezcano,
Nicolas Pitre, Christian Daudt, Roger Teague, and George Grey
for helping to make this article human readable.
I owe thanks to Dave Rusling and Jim Wasko for their support of this effort.
Answers to quick quizzes
Quick Quiz 1:
How can you tell whether multiple CPUs on your system are in the same
clock domain?
Answer:
Look at the contents of this sysfs file:
/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus
This file lists all of the CPUs that are in the same clock domain
as CPU 0.
Similar files are present for the rest of the CPUs.
Back to Quick Quiz 1.
Quick Quiz 2:
Given that these emulation approaches are never going to provide a
perfect emulation of big.LITTLE systems, why bother?
Wouldn't it be simpler to just wait until real hardware is available?
Answer:
Indeed, none of these emulation methods will perfectly emulate big.LITTLE
systems.
Then again, the various big.LITTLE implementations will not be exactly
the same as each other anyway, so significant emulation error is
quite acceptable.
More importantly, these emulation approaches allow big.LITTLE software
work to proceed long before real big.LITTLE hardware is available.
Back to Quick Quiz 2.
Quick Quiz 3:
This sounds way simpler than CPU hotplug, so why not just
get rid of CPU hotplug and replace it with this mechanism?
Answer:
First, CPU hotplug is still needed to isolate failing hardware,
which some expect to become more prevalent as transistors continue
to shrink.
Second, there are a number of limitations of Nicolas's approach
compared to CPU hotplug:
- Interrupts must be directed away from the CPU.
- Timers must be migrated off of the CPU.
- If a CPU is left in this state for more than two minutes,
the softlockup subsystem will start reporting errors
if there are runnable tasks affinitied to this CPU.
One workaround is to migrate tasks away from this
CPU, but that approach starts looking a lot like CPU
hotplug.
- If one of the preempted threads is hard-affinitied to the
CPU and is holding a mutex, then that mutex cannot be
acquired on any other CPU.
If the CPU remains in this state very long, a system hang
could result.
- The per-CPU kthreads would need to be informed of this
transition if it persists for very long.
In other words, as noted above, Nicolas's approach is appropriate
only for cases where the CPU is to be quiesced for a short time.
Back to Quick Quiz 3.
Quick Quiz 4:
Why bother to fix CPU hotplug?
Wouldn't it be better to just rip it out and replace it with something better?
Answer:
Although there is always a strong urge to rip and replace, the fact is
that any mechanism that successfully removes all current and future work
from a CPU will really need to remove all current and future work from
that CPU.
And it is the removal of all such work that makes CPU hotplug difficult,
not the actual powering down of the electronics.
Of course, this suggests the possibility of removing all work without
actually powering the CPU down, a possibility that is likely to be
supported by Thomas's ongoing CPU-hotplug rework.
Back to Quick Quiz 4.
Comments (13 posted)
Patches and updates
Kernel trees
- Con Kolivas: 3.4-ck2 .
(June 11, 2012)
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Architecture-specific
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>