
Kernel development

Brief items

Kernel release status

The current development kernel is 3.5-rc2, released by Linus on June 8. "I think -rc2 is in fairly good shape, and I've been trying to fairly aggressively revert things that caused problems (and sometimes things that were only *suspected* to be the cause of problems)." Since that release he has been wandering around the world winning high-profile prizes, so the flow of changes into the mainline has been relatively slow.

Stable updates: 3.0.34 and 3.4.2 were released on June 9; 3.2.20 came out on June 11.

Comments (none posted)

Quotes of the week

Not to mention that I would love to make all users that have x86_64 machines running i386 kernels, for something other than testing i386, thrown into a large boiling pot of clue stew. It really pisses me off when people run these i386 machines with 16 gigs of memory and complain about performance. highmem should be disabled for any x86_64 box with more than 1G of RAM running an i386 kernel with a nasty message on boot up:

"Idiot! Why the f*ck are you running i386 on your nice shiny new x86_64 machine, with Umpteen Gigs of RAM. Boot a x86_64 kernel for Christ sake and get your full potential. Your i386 userspace will still work just fine on it. You're like one of those 75 year old retirees that can finally buy a Porsche just to drive it 10 miles per hour below the speed limit with a line of cars behind them!"

Steven Rostedt

This is just more proof that it's absolutely pointless to make any changes at all to the old IDE layer. Nobody really cares, and the risk %99.999 of the time is purely to introduce regressions rather than make forward progress.
David Miller

Comments (52 posted)

Chris Mason leaving Oracle

Btrfs developer Chris Mason has announced that he is leaving Oracle and joining the growing crowd of kernel hackers at Fusion-IO. "From a Btrfs point of view, very little will change. I'll still maintain Btrfs and will continue all of my Btrfs development in the open. Oracle will still use Btrfs in their Oracle Linux products, and I'll work with all of the distros using Btrfs in production."

Full Story (comments: 49)

Alan Cox celebrates the queen

Some quotes just can't wait for the next week's edition to come about. In a discussion on ACPI, Alan Cox claimed "I spent the holiday avoiding the English queen's party and turning from a republican into a raving republican". Linus's response can only be read in its entirety.

Comments (80 posted)

Kernel development news

The word-at-a-time interface

By Jonathan Corbet
June 12, 2012
While the kernel is a crucial part of almost any job one might want to do with a computer, it is rarely the kernel itself that gets the interesting work done. More often, it appears as overhead that takes system resources away from the applications the user actually wants to run. So it makes sense to optimize kernel operations to the greatest extent possible, especially when those operations are carried out frequently on performance-critical paths. The "word at a time" interface, optimized and made generic for the 3.5 release, is a good example of how far those optimization efforts can go.

The kernel does a lot of string processing, especially (but not exclusively) when working with file path names. It is often necessary to know the length of a name or path component. When confronted with such a task, a C programmer would typically code a loop iterating through the string one character at a time. But, given enough strings, the per-character loop overhead starts to get expensive. It turns out that, with enough bit-level trickery, much of that overhead can be dealt with by working through string data one 32-bit or 64-bit word at a time. The "word at a time" API makes that sort of processing possible—but with a certain complexity cost.
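For contrast with the word-at-a-time approach developed below, here is the obvious per-character loop in a small user-space sketch (not kernel code):

    #include <stddef.h>

    /* The conventional byte-at-a-time scan that word-at-a-time tries to beat. */
    static size_t bytewise_strlen(const char *s)
    {
        size_t len = 0;

        while (s[len] != '\0')
            len++;
        return len;
    }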

The API

Code wanting to use this interface should include <asm/word-at-a-time.h>. A few functions are defined therein, the first being has_zero():

    unsigned long has_zero(unsigned long value, unsigned long *bits, 
			   const struct word_at_a_time *constants);

From one point of view, has_zero() is a simple boolean function that returns true if value contains a zero byte. But what are the other two parameters? Let's start with the constants value, which must simply be set to a value defined in the header file:

    const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;

As will be described in more detail below, this structure contains some useful constant values. The structure is small and the contents are architecture-dependent, so it was evidently deemed unnecessary to create a single, globally-accessible copy.

The bits parameter, instead, is a place where has_zero() can stash temporary data that will be useful to the remaining functions in the API. Those functions are:

    unsigned long prep_zero_mask(unsigned long value, unsigned long bits,
    				 const struct word_at_a_time *constants);
    unsigned long create_zero_mask(unsigned long bits);
    unsigned long find_zero(unsigned long mask);

Once has_zero() has identified a word containing a zero byte, all three of these functions must be used to determine which byte contains the zero value. The usual calling sequence looks something like this:

    if (has_zero(value, &bits, &constants)) {
        bits = prep_zero_mask(value, bits, &constants);
        bits = create_zero_mask(bits);
        zero_byte = find_zero(bits);
        /* ... */

In other words, prep_zero_mask() and create_zero_mask() both take the bits value first created by has_zero() and rework it to the point that find_zero() can produce the offset of the first zero byte in the word.
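Putting the pieces together, a function that measures the length of a word-aligned, NUL-terminated string might look like the following. This is a simplified sketch rather than code from the kernel; real users such as lib/strnlen_user.c must also cope with unaligned starts and faulting accesses:

    #include <asm/word-at-a-time.h>

    /* Sketch only: length of a NUL-terminated string starting on a word boundary. */
    static inline long wordwise_strlen(const unsigned long *src)
    {
        const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
        unsigned long value, bits;
        long len = 0;

        for (;;) {
            value = *src++;
            if (has_zero(value, &bits, &constants)) {
                bits = prep_zero_mask(value, bits, &constants);
                bits = create_zero_mask(bits);
                return len + find_zero(bits);
            }
            len += sizeof(unsigned long);
        }
    }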

This may seem like a lot of work to do, but there is a reason for it. The functionality split allows different architectures to provide optimized functions for each part of the job. But there is another interesting bit of subtlety here: it is possible to perform a logical OR of two different bits values from two calls to prep_zero_mask(). The function hash_name() in fs/namei.c uses this feature to search for either a zero byte or one containing a slash—the string it is looking at ends either with a null byte or the beginning of the next component in the path name. The kernel spends a lot of time processing path names, so this operation is worth optimizing in this way.

There is one other little detail to be kept in mind: the string might not start at the beginning of a word. Managing unaligned strings adds a bit more complexity to the task; the curious can look at lib/strnlen_user.c for one example of how these strings are handled. All told, using this interface adds enough complexity that it is probably almost never worthwhile. In that rare case where a per-character loop is too expensive, though, word-at-a-time access can help.

How it works

The x86 version of this API can be found in arch/x86/include/asm/word-at-a-time.h; one might be forgiven for thinking that parts of it came from the obfuscated C contest. It starts by defining the above-mentioned constants:

    struct word_at_a_time {
	const unsigned long one_bits, high_bits;
    };

    #define WORD_AT_A_TIME_CONSTANTS { REPEAT_BYTE(0x01), REPEAT_BYTE(0x80) }

REPEAT_BYTE() is a macro (defined in <linux/kernel.h>) that fills a word with copies of the given byte value. So, on a 32-bit machine, one_bits will be initialized to 0x01010101, and high_bits will be 0x80808080; 64-bit machines will get the same pattern, but twice as long.
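The macro itself needs no loop; in <linux/kernel.h> it is written as a simple arithmetic expression along these lines:

    /*
     * ~0ul/0xff is 0x0101...01 (a one in every byte), so multiplying it by x
     * replicates the byte value x into every byte position of the word.
     */
    #define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))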

After that, has_zero() is defined as:

    static inline unsigned long has_zero(unsigned long a, unsigned long *bits, 
    					 const struct word_at_a_time *c)
    {
	unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
	*bits = mask;
	return mask;
    }

In English, the code subtracts one from every byte, masks out all of the bits that were set in the original value, then masks everything but the highest bit in every byte. If one thinks of each byte as an independent value, the high bit can be thought of as the sign bit. Subtracting one from a value will only cause the sign bit to change from zero to one if the byte's value was zero before. So this series of operations will cause the highest bit to be set in any byte whose value was zero before. (In truth, the bytes are not independent, and borrowing will cause different results after the first zero byte, but only the first one is of interest so that is unimportant).

In the x86 implementation, prep_zero_mask() does nothing and will be optimized out by the compiler. That is not true of create_zero_mask(), though:

    static inline unsigned long create_zero_mask(unsigned long bits)
    {
	bits = (bits - 1) & ~bits;
	return bits >> 7;
    }

The subtraction will cause all bits up to the first set bit to be set to one; all of the other bits are then masked out and the result is right-shifted. Thereafter, all bytes prior to the first zero byte (in the original value) will be set to 0xff. All that's left, now, is to figure out how many of those fully-populated bytes there are. The code that does this is not entirely straightforward; it is the result of a request Linus posted on Google+ in March. For 32-bit machines, find_zero() comes down to this code:

    long a = (0x0ff0001+mask) >> 23;
    /* Fix the 1 for 00 case */
    return a & mask;

On 64-bit systems, instead, it looks like:

    return mask*0x0001020304050608ul >> 56;

Either way, the effect is to produce a number that is the byte offset of the first zero.
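To see all of the pieces working together, consider the hypothetical 32-bit word 0x61006263. On a little-endian machine its bytes in memory order are 0x63, 0x62, 0x00, 0x61, so the zero byte sits at offset 2. This small user-space program (not kernel code) traces the x86 steps shown above and prints that offset:

    #include <stdio.h>

    int main(void)
    {
        unsigned int value = 0x61006263;    /* stands in for a 32-bit unsigned long */
        unsigned int one_bits = 0x01010101, high_bits = 0x80808080;
        unsigned int bits;
        long zero_byte;

        /* has_zero(): 0x5fff6162 & 0x9eff9d9c & 0x80808080 = 0x00800000 */
        bits = ((value - one_bits) & ~value) & high_bits;

        /* create_zero_mask(): (0x007fffff & 0xff7fffff) >> 7 = 0x0000ffff */
        bits = (bits - 1) & ~bits;
        bits >>= 7;

        /* find_zero(), 32-bit variant: ((0x0ff0001 + 0xffff) >> 23) & 0xffff = 2 */
        zero_byte = (long)(((0x0ff0001 + bits) >> 23) & bits);

        printf("first zero byte at offset %ld\n", zero_byte);
        return 0;
    }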

This API is relatively new, having been first added (for x86 only) in the 3.4 development cycle. In 3.5, it was substantially reworked and made more widely applicable. There are specific implementations for x86 and powerpc (the latter uses a "count leading zeroes" instruction to speed things up); there is also a "generic" version that really only works properly for big-endian architectures. That is enough, though, for a number of architectures to make use of the capability. The resulting microsecond-sized time savings may not seem like much, but, multiplied across all of the string operations the kernel does, it adds up to a significant improvement.

Comments (17 posted)

LinuxCon Japan: One zImage to rule them all

By Jake Edge
June 13, 2012

An elusive, but highly sought after goal for Linux on ARM is to have a "single" kernel image that can boot on all ARM devices. The full goal is likely not attainable, but reducing the number of different ARM kernels is, and progress is being made. Linaro kernel tech lead Deepak Saxena reported on the motivations behind the consolidation effort as well as progress toward the goal at LinuxCon Japan 2012.

The holy grail

Saxena started with a problem statement. He noted that those who have worked with ARM for a while may not really see the problem, but it is a problem that every ARM processor needs its own kernel. For example, a laptop vendor that makes a few small hardware changes will almost certainly need a separate kernel for each revision. The "holy grail" is a single kernel binary that will boot on any ARM device, he said.

[Deepak Saxena]

Requiring separate kernels "creates a lot of extra work" for developers. If they are building a driver for multiple platforms—or even three revisions of the same platform—they will probably need to build and test a kernel for each. Sometimes there is a register that can be read to determine the hardware revision, thus reducing the number of different kernels, but that often is not the case.

For distributions, supporting ARM is difficult right now. Debian, for example, has different kernels for the BeagleBoard, PandaBoard, and others. In the consumer electronics and mobile space, the problem is similar, and those companies want to reduce the amount of testing that needs to be done. Right now, we think of cell phones more than anything else, but Saxena pointed to video projectors and cameras as devices that are also likely ARM-powered. Beyond that, there will soon be ARM servers.

The model in the server/enterprise world is very different from the mobile world. A mobile device typically ships with a full software stack, and updates come the same way, but that is not the case for servers. A server may not come with the distribution that you want; in fact, it may come with Windows. That means that distributions need to have installation media that can work on a wide variety of server platforms.

"The distros have spoken", Saxena said, and they need one kernel image that can be built, booted, and tested on all of the different platforms. There are thousands of different x86-based laptops today, and you don't need a different kernel for each. The Linux ARM community wants to get to that same model. Beyond that, cloud and hyperscale computing also need to be able to deploy on lots of different platforms without requiring separate kernels.

"How did we get here?", he asked; you can boot a single Ubuntu or Fedora install disk on all x86 systems, but that is not true for ARM. Part of the problem is the diversity in the ARM world. There is also a lot of fragmentation in how code has been written to support Linux on ARM. There are multiple implementations for the same IP block, for example. Functionality has been duplicated at least partly because of a closed-door policy by the various ARM vendors.

In addition, Linux ARM maintainer Russell King got overloaded, which led ARM platform and system-on-chip (SoC) developers to start pushing their code to Linus Torvalds directly. Saxena said that he may have been the first to do that, for ixp4xx, but now he sees the problems that caused and apologized for doing so. It led to an inability for anyone to track the "big picture" in Linux ARM support, he said.

Fixes needed

It is now a multi-faceted problem that requires several different avenues of attack to clean up. Saxena identified four areas that are being pursued. Cleaning up and consolidating the header files within the various ARM machine and platform trees is one, while consolidating ARM drivers is another. In addition, device tree will provide a way to specify the differences between ARM SoCs at runtime. Finally, doing active maintenance of the ARM tree, keeping in mind the big picture, will also help. No one of those fixes the problem, but all of them together get us closer to the holy grail, he said.

To start with, there are various header file collisions in the ARM tree. There are a number of arch/arm/mach-* directories, one for each of the machine types. Each of those has an include/mach directory that maps to the top-level include/mach directory at build time. In order to build for multiple machine types, those header files need to be consolidated in one place so that the remapping doesn't need to happen.

In the 3.0 kernel tree, which was around the time the consolidation effort started, there were 64 different machine types in the tree. Some of those machine types are similar and could be consolidated. For the others, there are lots of overlapping symbols that need to be dealt with. The goal is to get rid of as many of those as possible either by making them generic for all ARMs or by moving platform-specific symbols to non-generic header files.

There were also 577 occurrences of #include <mach/*> in the drivers directory. Unfortunately, ARM has a lot of driver-specific definitions in the architecture headers, which means that drivers depend on arch/arm header files. Basically, it is prevalent for ARM drivers to require definitions from both the driver directories and the architecture headers, which makes it difficult to build multi-platform kernels.

Linaro and the ARM community started working on these problems last year. They met in August and thought they could have a solution in relatively short order, but that proved not to be the case. Some changes require coordination between multiple maintainers and others are awaiting agreement between maintainers on how to proceed. The problem "may seem trivial at first", Saxena said, but it is actually fairly complicated. The hope is to have a single zImage for multiple systems by the end of 2012.

Beyond the header file issue, there is a need to clean up and consolidate the drivers in the tree. The problem is not directly related to creating a single kernel, but fixing it will help to get there, he said. There are lots of implementations of drivers for the same hardware in the tree, sometimes with a different API. That can cause problems with overlapping symbols and code bloat.

The clock management code is the "epitome of code duplication and fragmentation" in the ARM kernels, Saxena said. The clk.h file, which declares a struct clk, was introduced in 2.6.16 back in 2006. Since then, 27 different struct clk implementations have been added to the tree, each with its own API and semantics. Over the last two years or more, work has been done to create a common definition and API, though the job is not done yet, as there is ongoing discussion on certain parts.

Pinmux is another example of duplication. It is a subsystem to manage the pins on SoCs and there were multiple implementations of that functionality. The problem is not as bad as struct clk, he said, but there was still a need to consolidate. After six months of work, a common pinmux implementation is now upstream, though there are still discussions about certain parts of the implementation and API.

Another piece of the solution is device tree. Before device tree, very small, simple changes to the hardware would require a kernel rebuild because the information about IRQ lines, memory maps, and so forth was statically defined. Device tree makes it so that much of this information can be probed at boot time.

Using device tree means creating a source file using a "simple markup language" that defines the hardware. It can specify where the registers are or what the interrupts are for a particular device. That means that the same kernel can be used on a new SoC once a device tree file has been created for that SoC. Since the kernel will not have to change, it makes hardware revisions and testing multiple devices that much easier.

Status and plans

Currently, a single kernel can be built for the four Linaro member platforms (imx, omap2, ux500 and vexpress), though only the Versatile Express board boots at this point. The goal is to have all four booting by the end of the year. Linaro is focused on its member platforms, but would like to get other platforms supported as well. That is even more reason for SoC vendors to get their code upstream, Saxena said, as it will be more difficult to participate in the multi-platform kernel effort with code that is out of the mainline. The ARM SoC tree has been used as the base, with Arnd Bergmann maintaining a branch for the multi-platform work.

There are, of course, things still left to do. USB drivers need to be consolidated as there are some problems building multiple host controllers into one kernel at this point. Finishing the device tree conversion is another piece of the puzzle; the infrastructure is there, but there are lots of drivers and board files to convert. At the moment there is something of a hack around the "driver #include madness", which needs to be cleaned up for real.

While the holy grail has not been reached, things will be better than they are today, Saxena said. Due to micro-architecture and page table differences, four kernels are envisioned: ARM v6/7 with large physical address extensions (lpae), ARM v6/7 non-lpae, ARM v4/5, and, eventually, ARM v8. It still means multiple kernel binaries, but that could be treated in the same way that the distributions handle various CPU extensions in the x86 world. The idea of "one zImage to rule them all" turns out to not be practical, but we will end up with far fewer ARM kernel binaries in the near future.

[ The author would like to thank the Linux Foundation for assisting with travel to Yokohama for LinuxCon Japan 2012. ]

Comments (2 posted)

A big.LITTLE scheduler update

June 12, 2012

This article was contributed by Paul McKenney

ARM's big.LITTLE architecture is an example of asymmetric multiprocessing where all CPUs are instruction-set compatible, but where different CPUs have very different performance and energy-efficiency characteristics. In the case of big.LITTLE, the big CPUs are Cortex-A15 CPUs with deep pipelines and numerous functional units, providing maximal performance. In contrast, the LITTLE CPUs are Cortex-A7 with short pipelines and few functional units, which optimizes for energy efficiency. Linaro is working on two methods of supporting big.LITTLE systems.

The first method pairs up each big CPU with a LITTLE CPU, so that user applications are aware of only one CPU corresponding to each such pair. A given application thread can then be transparently migrated between the big and LITTLE CPUs making up the pair, essentially providing extreme dynamic voltage/frequency scaling. Instead of merely varying the voltage and frequency, at some point the application thread is migrated between the big and LITTLE CPUs making up the corresponding pair, as was described in an earlier LWN article by Nicolas Pitre. This approach is termed “big.LITTLE Switcher.”

The other way to support big.LITTLE systems is to have all CPUs, both big and LITTLE, visible in a multiprocessor configuration. This approach offers greater flexibility, but also poses special challenges for the Linux kernel. For example, the scheduler assumes that CPUs are interchangeable, which is anything but the case on big.LITTLE systems. These big.LITTLE systems were therefore the subject of a scheduler minisummit at last February's Linaro Connect in the Bay Area which was reported on at LWN.

This article summarizes progress since then, both on the relevant mailing lists and as reported during the Linaro Connect sessions in Hong Kong, and is divided into the following topics:

  1. Emulate asymmetric computation on commodity systems
  2. Document system configuration for big.LITTLE work
  3. Define mobile workload and create simulator
  4. Address kthread performance issues
  5. Wean CPU hotplug from stopping all CPUs
  6. Add minimal support to scheduler for asymmetric cores
  7. Conclusions

Following this are the inevitable Answers to Quick Quizzes.

Emulating asymmetric computation on commodity systems

The need to develop software concurrently with the corresponding hardware has led to heavy reliance on various forms of emulation, and ARM's asymmetric big.LITTLE systems are no exception. Unfortunately, full emulation is quite slow, especially given that core kernel code and user-level code really only needs a reasonable approximation to big.LITTLE's performance characteristics. What we need instead is some way to cause commodity systems to roughly mimic big.LITTLE, for example, by artificially slowing down the CPUs designated as LITTLE CPUs.

There are a number of ways of slowing down CPUs, the first three of which were discussed at the February scheduler minisummit:

  1. Assign a high-priority SCHED_FIFO cycle-stealing thread for each designated LITTLE CPU, which consumes a predetermined fraction of that CPU's bandwidth (a sketch of this approach appears after this list).

  2. Constrain clock frequencies of the LITTLE CPUs.

  3. Make use of Intel's T-state capability.

  4. Use perf to place a load on the designated-LITTLE CPUs.
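
To make the first item concrete, here is a minimal user-space sketch of a cycle-stealing thread. It is not Rob Lee's actual script; the priority value and parameters are illustrative, root privileges are assumed, and period_us is assumed to be under one second:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <time.h>

    /*
     * Pin a SCHED_FIFO thread to the designated LITTLE CPU and burn busy_us
     * out of every period_us microseconds, so the rest of the system sees a
     * correspondingly slower CPU.
     */
    static void steal_cycles(int cpu, long busy_us, long period_us)
    {
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = 50 };
        struct timespec rest = { 0, (period_us - busy_us) * 1000L };

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);    /* pin to the LITTLE CPU */
        sched_setscheduler(0, SCHED_FIFO, &sp);     /* preempt SCHED_NORMAL tasks */

        for (;;) {
            struct timespec start, now;

            clock_gettime(CLOCK_MONOTONIC, &start);
            do {                                    /* busy-spin for busy_us */
                clock_gettime(CLOCK_MONOTONIC, &now);
            } while ((now.tv_sec - start.tv_sec) * 1000000L +
                     (now.tv_nsec - start.tv_nsec) / 1000L < busy_us);

            nanosleep(&rest, NULL);                 /* give the CPU back */
        }
    }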

Rob Lee presented the results of experiments comparing the cycle-stealing and clock-frequency approaches. Morten Rasmussen proposed the perf-based approach, in which he configured perf to interrupt the designated LITTLE CPUs every few thousand cycles. Each of these approaches has advantages and disadvantages, as laid out in the following table:

  Cycle stealing: visible to the scheduler as SCHED_FIFO threads; portable; requires calibration for duty cycles of less than about 10 milliseconds; slows process-level execution only.

  Clock frequency: transparent to the scheduler; requires per-CPU clock domains; auto-calibrating, but with a limited number of settings; slows all execution.

  T-state: transparent to the scheduler; Intel only; auto-calibrating; slows all execution.

  perf: transparent to the scheduler; requires fine-grained perf; requires calibration; slows all code subject to perf exceptions.

Quick Quiz 1: How can you tell whether multiple CPUs on your system are in the same clock domain? Answer

As can be seen from the table, none of these mechanisms is perfect, for example, many embedded systems-on-a-chip (SoCs) have multiple CPUs (often all of them) in a given clock domain, which limits the applicability of the clock-frequency approach. Rob has placed scripts for the cycle-stealing and clock-frequency approaches in a git tree located at git://git.linaro.org/people/rob_lee/asymm_cpu_emu.git; he plans to add Morten's perf-based approach as well.

Quick Quiz 2: Given that these emulation approaches are never going to provide a perfect emulation of big.LITTLE systems, why bother? Wouldn't it be simpler to just wait until real hardware is available? Answer

At this point, it seems likely that more than one of these will be required in order to cover the full range of development hardware and workloads.

Documenting system configuration for big.LITTLE work

The Linux kernel as it is today can handle big.LITTLE systems, give or take hardware bring-up issues. However, the current kernel does need some help to get good results from a big.LITTLE system. Chris Redpath's presentation is a first step towards determining the best division of labor between current kernels and the application.

Chris described an experiment running an Android workload on emulated big.LITTLE hardware. He made use of Android's bg_non_interactive cpuset, which holds low-priority threads (ANDROID_PRIORITY_BACKGROUND) and is limited to 10% of CPU usage. Chris further constrained this cpuset to run only on CPU 0, which on his system is a LITTLE CPU. Chris then created two new cpusets, default and fg_boost. The default cpuset is constrained to the LITTLE CPUs, and contains all non-background tasks that are not SCHED_BATCH. The SCHED_BATCH tasks remain in the bg_non_interactive cpuset called out above. The new fg_boost cpuset contains tasks that are SCHED_NORMAL and that have a priority higher than ANDROID_PRIORITY_NORMAL.

Chris used a combination of sysbench and cyclictest as a rough-and-ready mobile-like workload, where sysbench mimics high-CPU-consumption mobile tasks (e.g., rendering a complex web page) and cyclictest mimics interactive workloads (e.g., user input and communications). Chris's configuration uses four sysbench compute threads and eight cyclictest threads. The sysbench threads all run for 15 seconds and then block for 10 seconds, and repeat this on a 25-second cycle. The eight cyclictest threads run throughout the full benchmark run. Without Chris's additional cgroups, the scheduler scattered the threads across the system, slowing execution of the sysbench threads and wasting power. With the additional cgroups, the sysbench threads were confined to the big CPUs, and the cyclictest threads to a single LITTLE CPU.

In short, use of cpusets to constrain whether given threads run on big or LITTLE CPUs works quite well, at least as long as we know a priori which threads should run where.

Chris also tested sched_mc, which can be set up to keep short-lived tasks off of the big CPUs. Although sched_mc was able to keep an additional CPU free of work, it was unable to help the remaining CPUs to reach deeper sleep states, and was nowhere near as effective as use of cpusets. These big.LITTLE results support the decision taken at the February Linaro Connect to remove sched_mc from the kernel.

Finally, Chris tested userspace tools that dynamically migrate tasks among the CPUs in response to Android's priority boosting of foreground user-interface (UI) threads. In contrast, Chris's first approach statically assigned the threads. This approach required minor changes to the Android software stack. Although this approach was effective in getting high-consumption UI threads running on big CPUs, the migration latency rivaled the bursts of UI CPU-bound activity, which in many cases defeats the purpose. The migration latency would need to be decreased by about a factor of two for this approach to be worthwhile, though your mileage may vary on real hardware. These results underscore the importance of planned scheduler changes for dynamic general-purpose workloads.

Several other potential approaches were discussed:

  1. Use a modified cpufreq governor to switch between the big and LITTLE CPUs. One potential issue with this approach is that the governor currently runs at full frequency most of the time, in accordance with its race-to-idle strategy. Some tweaking would therefore be required for workloads for which race-to-idle is inappropriate.

  2. Treat specific applications specially, for example, make big.LITTLE scheduling decisions based on the ->comm name. This is likely to work well in some cases, but is unable to optimally handle applications that have different threads that want to run on different CPUs.

  3. Rather than migrating threads from one cpuset to another, add and remove CPUs to given cpusets. This should reduce the sysfs overhead in some cases.

In short, it is possible to get decent performance out of big.LITTLE systems on current Linux kernels for at least some workloads. As more capabilities are added to the scheduler, we can expect more workloads will run well on big.LITTLE systems, and also that it will become easier to tune workloads that can already be made to run well on big.LITTLE.

Define mobile workload and create simulator

An easy-to-use mobile-workload simulator is needed to allow people without access to the relevant hardware, emulators, and SDKs to evaluate the effects of changes on mobile workloads. The good news is that the work presented at Linaro Connect used a number of synthetic workloads, but the not-so-good news is that the setups are not yet generally useful. Shortcomings include: (1) Much work is required to interpret the output of the workloads, (2) The figures of merit are not constant, but instead different figures of merit are chosen in different circumstances, and of course (3) They are not (yet) nicely packaged. Hopefully we will see additional progress going forward.

Address kthread performance issues

At the February Linaro Connect, Vincent Guittot presented work showing that the bulk of CPU-hotplug latency was due to creation, teardown, and migration of the per-CPU kthreads that carry out housekeeping tasks for various kernel subsystems. Further discussion indicated that any mechanism that successfully clears all current and future work from a given CPU will likely incur similar latencies.

In an ideal world, these per-CPU kthreads would automatically quiesce when the corresponding CPU went offline, and then automatically spring back into action when the CPU came back online, but without the overheads of kthread creation, teardown, or migration. Unfortunately, the scheduler does not take kindly to runnable tasks whose affinity masks only allow them to run on CPUs that are currently offline. However, Thomas Gleixner noticed that there is one exception to this rule, namely a set of per-CPU tasks that have exactly this property: the idle tasks. This key insight led Thomas to propose that the per-CPU kthreads should have this same property, which should completely eliminate the overhead of per-kthread creation, teardown, and migration.

The idle tasks are currently created by each individual architecture, so Thomas posted a patchset that moves idle-task creation to common code. This patchset has been well received thus far, and went into mainline during the 3.5 merge window. Thomas has since followed up with a new patch that allows kthreads to be parked and unparked. The new kthread_create_on_cpu(), kthread_should_park(), kthread_park(), and kthread_unpark() APIs can be applied to the per-CPU kthreads that are now created and destroyed on each CPU-hotplug cycle.
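The natural structure for such a kthread is sketched below. This is not from Thomas's patches; do_percpu_work() is a hypothetical placeholder, and kthread_parkme() is assumed to be the self-park entry point used together with kthread_park() and kthread_unpark(). Instead of being torn down when its CPU goes offline, the thread parks itself when asked and sleeps until the CPU returns:

    #include <linux/kthread.h>
    #include <linux/sched.h>

    /* Sketch of a per-CPU housekeeping kthread using the parking API. */
    static int my_percpu_kthread(void *data)
    {
        while (!kthread_should_stop()) {
            if (kthread_should_park()) {
                /* CPU is going down: wait here instead of exiting */
                kthread_parkme();
                continue;
            }
            do_percpu_work(data);               /* hypothetical housekeeping */
            schedule_timeout_interruptible(HZ);
        }
        return 0;
    }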

Quick Quiz 3: This sounds way simpler than CPU hotplug, so why not just get rid of CPU hotplug and replace it with this mechanism? Answer

At the Q2 Linaro Connect in Hong Kong, Nicolas Pitre suggested a different way to leverage the properties of idle tasks, namely to provide a set of high-priority per-CPU idle tasks that would normally be blocked. When it was time to quiesce a given CPU, its high-priority idle task would become runnable, preempting all of the rest of that CPU's tasks. The high-priority idle task could then power down the CPU. This approach can be thought of as a variant of idle-cycle injection. There are some limitations to this approach, but it might work quite well in situations where the CPU is to be quiesced for short periods of time.

In the meantime, Vincent Guittot has been experimenting with various userspace approaches to reduce the per-kthread overhead. One promising approach is to use cgroups to ensure that kthreadd (the parent of all other kernel threads) is guaranteed the CPU bandwidth needed during the hotplug operation. Preliminary results indicate large improvements in latency. Although this approach cannot achieve a five-millisecond CPU-hotplug operation (a target adopted at the February meeting), it is nevertheless likely to be useful for projects that are required to use current kernels with the existing slow CPU-hotplug code.

Wean CPU hotplug from stopping all CPUs

Another source of CPU-hotplug slowness—and of real-time latency degradation—is its use of __stop_machine(). The problem is that the __stop_machine() function causes each CPU to run a special per-CPU __stop_machine() kthread, which brings all processing on the system to a grinding halt for the duration. This system-wide halt is bad for performance and fatal to real-time response. Given that other systems have managed to remove CPUs from service without requiring such a halt, it should be possible to wean Linux's CPU-hotplug process from its __stop_machine() habit.

Quick Quiz 4: Why bother to fix CPU hotplug? Wouldn't it be better to just rip it out and replace it with something better? Answer

This is a large change, and will take considerable time and effort. However, it is likely to provide good improvements in both performance and robustness of CPU hotplug.

Add minimal support to scheduler for asymmetric cores

There has been great progress in a number of areas. First, Paul Turner posted a new version of his entity load-tracking patchset. This patchset should allow the scheduler to make better (and more power-aware) task-placement and load-balancing decisions. Second, Morten Rasmussen ran some experiments (including experimental patches) on top of Paul Turner's patchset. See below for more information. Third, Peter Zijlstra posted a pair of patches removing sched_mc and also posted an RFC email proposing increased scheduler awareness of hardware topology. This should allow the scheduler to better handle asymmetric architectures such as ARM's big.LITTLE architecture. Finally, Juri Lelli posted an updated version of a prototype SCHED_DEADLINE patchset, which Juri has taken over from Dario Faggioli. Please see below for a discussion of how SCHED_DEADLINE might be used for energy efficiency on big.LITTLE systems.

Morten Rasmussen presented prototype enhancements to the Linux scheduler that better accommodate big.LITTLE systems. As noted above, these enhancements are based on Paul Turner's entity load-tracking patchset. The key point behind Morten's enhancements is that work should be distributed unevenly on big.LITTLE systems in order to maximize power efficiency with minimal performance degradation. Unlike SMP systems, on big.LITTLE systems it matters deeply where a given task is run, not just when a task is run. Morten therefore created a pair of cgroups, one for the big CPUs and another for the LITTLE CPUs, but with no load balancing between them. The select_task_rq_fair() function checks task load history in order to make the correct big.LITTLE selection when a given task starts running. High-priority tasks that have tended to be CPU bound are placed on big CPUs, and all other tasks are placed on LITTLE CPUs.
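One way to picture that placement decision is the toy sketch below. It is emphatically not Morten's actual patch: task_load_avg(), BIG_TASK_LOAD_THRESHOLD, and the two cpumasks are hypothetical names, and the priority test simply uses the kernel's DEFAULT_PRIO (lower prio values mean higher priority):

    #include <linux/sched.h>
    #include <linux/cpumask.h>

    /* Toy illustration: choose the big or LITTLE group for a waking task. */
    static const struct cpumask *pick_big_or_little(struct task_struct *p)
    {
        if (p->prio < DEFAULT_PRIO &&
            task_load_avg(p) > BIG_TASK_LOAD_THRESHOLD)
            return &big_cpu_mask;       /* high priority and CPU bound */
        return &little_cpu_mask;        /* everything else stays LITTLE */
    }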

Of course, a high-priority task that has historically been I/O bound might enter a CPU-bound phase. Therefore, Morten's patchset also periodically migrates tasks via load balancing. There are of course issues with global load balancing, and addressing these issues is important future work. Nevertheless, when running the Bbench browser benchmark on an emulated big.LITTLE Android system, Morten's patches provided response time very nearly that of an SMP system with all big CPUs, both when running on SMP hardware emulating big.LITTLE and when running on Paul Turner's LinSched. This is clearly an impressive achievement. In contrast, big.LITTLE response time on a vanilla kernel is 10-20% slower with much worse variability in response times.

Obviously, considerably more performance evaluation is required, but preliminary results are quite promising. In addition, more work will be required to handle thermal issues and also to do load balancing of tasks back onto LITTLE CPUs when all the big CPUs are saturated.

Juri Lelli's SCHED_DEADLINE is quite important for some types of real-time computing, but some thought is required to see possible application to non-real-time big.LITTLE systems. The key thing about SCHED_DEADLINE is that it provides the scheduler with information about the likely future needs of the various applications running on the system. This information should allow the scheduler to make better big.LITTLE task placement decisions, and also potentially better frequency-selection decisions.

Of course, this raises the question of where the deadlines and CPU requirements come from. In fact, one reason that deadline schedulers remain the exception rather than the rule even for real-time applications is that it is difficult to generate the needed information for a given real-time application, especially given the non-deterministic nature of modern hardware and variations in rendering time for different media—or even for different segments of a given item being played back. One way to solve this problem for some mobile workloads might be to use feedback-directed profiling techniques, where the application's CPU requirements are measured rather than derived.

Regardless of how things turn out for the use of deadline scheduling for mobile workloads, there is clearly some important and innovative work ahead for power-aware scheduling of mobile workloads.

Finally, Vincent Guittot led a discussion on additional inputs to the scheduler, which centered mainly around accurate load tracking (e.g., Paul Turner's patchset), which CPUs share a clock (and thus must all run at the same frequency), which CPUs share idle state, which CPUs share each level of cache, and thermal constraints. There was also a call to keep all this simple, which is clearly extremely important—and also will likely be extremely difficult. But if keeping things simple was easy, where would be the challenge?

Conclusions

Although there is still a great deal of work remaining before mobile workloads fully exploit the power and power efficiency of big.LITTLE, the past few months have seen some impressive progress in that direction. With appropriate user-mode help, some important workloads can make good use of big.LITTLE systems even given the existing scheduler, and some modest modifications (in conjunction with Paul Turner's entity load-tracking patchset) greatly improve the scheduler's ability to support big.LITTLE with minimal user-mode help. A promising design for streamlining CPU hotplug has been identified, and work in that direction has started.

This work gives some hope that the future will be bright for asymmetric multiprocessing in general and ARM's big.LITTLE architecture in particular.

Acknowledgments

We all owe a debt of gratitude to Srivatsa Bhat, Juri Lelli, Daniel Lezcano, Nicolas Pitre, Christian Daudt, Roger Teague, and George Grey for helping to make this article human readable. I owe thanks to Dave Rusling and Jim Wasko for their support of this effort.

Answers to quick quizzes

Quick Quiz 1: How can you tell whether multiple CPUs on your system are in the same clock domain?

Answer: Look at the contents of this sysfs file:

    /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus

This file lists all of the CPUs that are in the same clock domain as CPU 0. Similar files are present for the rest of the CPUs.

Back to Quick Quiz 1.

Quick Quiz 2: Given that these emulation approaches are never going to provide a perfect emulation of big.LITTLE systems, why bother? Wouldn't it be simpler to just wait until real hardware is available?

Answer: Indeed, none of these emulation methods will perfectly emulate big.LITTLE systems. Then again, the various big.LITTLE implementations will not be exactly the same as each other anyway, so significant emulation error is quite acceptable. More importantly, these emulation approaches allow big.LITTLE software work to proceed long before real big.LITTLE hardware is available.

Back to Quick Quiz 2.

Quick Quiz 3: This sounds way simpler than CPU hotplug, so why not just get rid of CPU hotplug and replace it with this mechanism?

Answer: First, CPU hotplug is still needed to isolate failing hardware, which some expect to become more prevalent as transistors continue to shrink. Second, there are a number of limitations of Nicolas's approach compared to CPU hotplug:

  1. Interrupts must be directed away from the CPU.

  2. Timers must be migrated off of the CPU.

  3. If a CPU is left in this state for more than two minutes, the softlockup subsystem will start reporting errors if there are runnable tasks affinitied to this CPU. One workaround is to migrate tasks away from this CPU, but that approach starts looking a lot like CPU hotplug.

  4. If one of the preempted threads is hard-affinitied to the CPU and is holding a mutex, then that mutex cannot be acquired on any other CPU. If the CPU remains in this state very long, a system hang could result.

  5. The per-CPU kthreads would need to be informed of this transition if it persists for very long.

In other words, as noted above, Nicolas's approach is appropriate only for cases where the CPU is to be quiesced for a short time.

Back to Quick Quiz 3.

Quick Quiz 4: Why bother to fix CPU hotplug? Wouldn't it be better to just rip it out and replace it with something better?

Answer: Although there is always a strong urge to rip and replace, the fact is that any mechanism that successfully removes all current and future work from a CPU will really need to remove all current and future work from that CPU. And it is the removal of all such work that makes CPU hotplug difficult, not the actual powering down of the electronics. Of course, this suggests the possibility of removing all work without actually powering the CPU down, a possibility that is likely to be supported by Thomas's ongoing CPU-hotplug rework.

Back to Quick Quiz 4.

Comments (13 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux 3.5-rc2
Greg KH: Linux 3.4.2
Steven Rostedt: 3.4.2-rt10
Steven Rostedt: 3.4.1-rt9
Con Kolivas: 3.4-ck2
Ben Hutchings: Linux 3.2.20
Steven Rostedt: 3.2.20-rt32
Steven Rostedt: 3.2.19-rt31
Greg KH: Linux 3.0.34
Steven Rostedt: 3.0.33-rt53
Steven Rostedt: 3.0.33-rt54

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Security-related

Virtualization and containers

Andrew Stiegmann (stieg): VMCI for Linux

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds