
LC-Asia: A big LITTLE MP update

By Jonathan Corbet
March 6, 2013
The ARM "big.LITTLE" architecture pairs two types of CPU — fast, power-hungry processors and slow, efficient processors — into a single package. The result is a system that can efficiently run a wide variety of workloads, but there is one little problem: the Linux kernel currently lacks a scheduler that is able to properly spread a workload across multiple types of processors. Two approaches to a solution to that problem are being pursued; a session at the 2013 Linaro Connect Asia event reviewed the current status of the more ambitious of the two.

LWN recently looked at the big.LITTLE switcher, which pairs fast and slow processors and uses the CPU frequency subsystem to switch between them. The switcher approach has the advantage of being relatively straightforward to get working, but it also has a disadvantage: only half of the CPUs in the system can be doing useful work at any given time. It also is not yet posted for review or merging into the mainline, though this posting is said to be planned for the near future, after products using this code begin to ship.

The alternative approach has gone by the name "big LITTLE MP". Rather than play CPU frequency governor games, big LITTLE MP aims to solve the problem directly by teaching the scheduler about the differences between processor types and how to distribute tasks between them. The big.LITTLE switcher patch touches almost no files outside of the ARM architecture subtree; the big LITTLE MP patch set, instead, is focused almost entirely on the core scheduler code. At Linaro Connect Asia, developers Vincent Guittot and Morten Rasmussen described the current state of the patch set and the plans for getting it merged in the (hopefully) not-too-distant future.

The big LITTLE MP patch set has recently seen a major refactoring effort. The first version was strongly focused on the heterogeneous multiprocessing (HMP) problem but, among other things, it is hard to get developers for the rest of the kernel interested in HMP. So the new patch set aims to improve scheduling results on all systems, even traditional SMP systems where all CPUs are the same. There is a patch set that is in internal review and available on the Linaro git server. Some parts have been publicly posted recently; soon the rest should be more widely circulated as well.

[Morten and Vincent]

The new patches are working well; for almost all workloads, their performance is similar to that achieved with the old patch set. The patches were developed with a view toward simplicity: they affect a critical kernel path, so they must be both simple and fast. Some of the patches, fixes for the existing scheduler, have already been posted to the mailing lists. The rest try to augment the kernel's scheduler with three simple rules:

  • Small tasks (those that only use small amounts of CPU time for brief periods) are not worth the trouble to schedule in any sophisticated way. Instead, they should just be packed onto a single, slow core whenever they wake up, and kept there if at all possible.

  • Load balancing should be concerned with the disposition of long-running tasks only; it should simply pass over the small tasks.

  • Long-running tasks are best placed on the faster cores.
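The three rules above can be sketched as a simple wakeup-placement policy. This is purely an illustrative model, not the patch set's actual code: the threshold, the CPU lists, and the `place_task()` helper are all invented names, and the real implementation lives inside the kernel's CFS scheduler.

```python
# Illustrative model of the big LITTLE MP placement rules.  All names
# and numbers here are invented for illustration only.

SMALL_TASK_LOAD = 10  # arbitrary cutoff below which a task is "small"

def place_task(task_load, little_cpus, big_cpus, cpu_load):
    """Pick a CPU for a waking task.

    task_load:   tracked load of the task (per-entity load tracking)
    little_cpus: list of slow-core CPU ids
    big_cpus:    list of fast-core CPU ids
    cpu_load:    dict mapping CPU id -> current load
    """
    if task_load < SMALL_TASK_LOAD:
        # Rule 1: pack small tasks onto a single, slow core.
        return little_cpus[0]
    # Rule 3: long-running tasks go to the least-loaded fast core.
    return min(big_cpus, key=lambda c: cpu_load[c])
```

Rule 2 does not appear here because it is a property of the periodic load balancer rather than of wakeup placement.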

Implementing these policies requires a set of a half-dozen patches. One of them is the "small-task packing" patch that was covered here in October, 2012. Another works to expand the use of per-entity load tracking (which is currently only enabled when control groups and the CPU controller are being used) so that the per-task load values are always available to the scheduler. A further patch ensures that the "LB_MIN" scheduler feature is turned on; LB_MIN (which defaults to "off" in mainline kernels) causes the load balancer to pass over small tasks when working to redistribute the computing load on the system, essentially implementing the second policy objective above.
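The effect of LB_MIN-style filtering can be sketched as follows; the threshold and data structures are invented for illustration, while the real feature is a flag checked deep inside the kernel's load balancer.

```python
# Sketch of LB_MIN-style filtering during load balancing: when pulling
# tasks from a busy CPU, skip those below a minimum load so that only
# long-running tasks are migrated.  Threshold and structure are
# illustrative, not the kernel's actual implementation.

MIN_MIGRATE_LOAD = 16  # invented threshold

def pick_tasks_to_pull(task_loads, load_to_move):
    """Select task loads to migrate, passing over small tasks."""
    pulled, moved = [], 0
    for load in sorted(task_loads, reverse=True):
        if moved >= load_to_move:
            break
        if load < MIN_MIGRATE_LOAD:
            continue  # small task: leave it where it is
        pulled.append(load)
        moved += load
    return pulled
```

In mainline kernels, scheduler features like this one can be toggled at run time by writing `LB_MIN` or `NO_LB_MIN` to `/sys/kernel/debug/sched_features`.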

After that, the patch set augments the scheduler with the concept of the "capacity" of each CPU; the unloaded capacity is essentially the clock speed of the processor. The load balancer is tweaked to migrate processes to the CPU with the largest available capacity. This task is complicated by the fact that a CPU's capacity may not be a constant value; realtime scheduling, in particular, can "steal" capacity away from a CPU to give to realtime-priority tasks. Scheduler domains also need to be tuned for the big.LITTLE environment with an eye toward reducing the periodic load balancing work that needs to be done.
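A rough sketch of capacity-based target selection, under the assumptions described above: each CPU has a base capacity reflecting its speed, realtime tasks can steal some of it, and the balancer prefers the CPU with the most spare headroom. The numbers and function names are invented.

```python
# Illustrative notion of CPU "capacity": a base value reflecting the
# core's speed, reduced by whatever realtime tasks have claimed, and
# compared against current load to find spare headroom.

def available_capacity(base, rt_stolen, load):
    """Spare capacity of a CPU after RT tasks and current load."""
    return max(0, base - rt_stolen - load)

def best_target(cpus):
    """Pick the CPU with the most spare capacity.

    cpus: dict of cpu id -> (base_capacity, rt_stolen, load)
    """
    return max(cpus, key=lambda c: available_capacity(*cpus[c]))
```

Note how a fast core that is busy with realtime work can lose out to a nominally slower core with more headroom.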

The final piece is not yet complete; it is called "scheduling invariance." Currently, the "load" put on the system by a process is a function of the amount of time that process spends running on the CPU. But if some CPUs are faster than others, the same process could end up with radically different load values depending on which CPU it is actually running on. That is suboptimal; the actual amount of work the process needs to do is the same in either case, and varying load values can cause the scheduler to make poor decisions. For now, the problem is likely to be solved by scaling the scheduler's load calculations by a constant value associated with each processor. Processes running on a CPU that is ten times faster than another will accumulate load ten times more quickly.
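The proposed scaling can be shown in a couple of lines; the speed factors here are made up, but the property they illustrate is the one described above: the same amount of work accumulates the same load no matter which core ran it.

```python
# Sketch of scale-invariant load accumulation: weight the time a task
# runs by a per-CPU speed factor, so a fast core accumulates load
# proportionally faster per second of runtime.  Factors are invented.

SPEED = {"little": 1.0, "big": 10.0}  # relative clock-speed scale

def accumulated_load(runtime_seconds, cpu_type):
    """Load contribution of running for `runtime_seconds` on cpu_type."""
    return runtime_seconds * SPEED[cpu_type]
```

A job with a fixed amount of work runs ten times longer on the little core, so runtime times speed factor comes out the same either way.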

Even then, the load calculations are not perfect for the HMP scheduling problem because they are scaled by the process's priority. A high-priority task that runs briefly can look like it is generating as much load as a low-priority task that runs for long periods, but the scheduler may want to place those processes in different ways. The best solution to this problem is not yet clear.
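A worked example of the ambiguity: per-entity load is scaled by the task's priority weight, so a brief high-priority task and a continuously running low-priority task can report nearly the same load. The weights below loosely follow the kernel's nice-level weight table (nice -10 maps to 9548, nice 0 to 1024), but the runtime fractions are invented.

```python
# Priority-weighted load, roughly as the scheduler tracks it: the
# fraction of time the task is runnable, scaled by its priority weight.

def weighted_load(runnable_fraction, weight):
    return runnable_fraction * weight

high_prio_brief = weighted_load(0.10, 9548)  # nice -10, runs 10% of the time
low_prio_long = weighted_load(1.00, 1024)    # nice 0, runs continuously
```

The two values land within a few percent of each other, even though an HMP scheduler would probably want the long-running task, not the brief one, on a big core.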

A question from the audience had to do with testing: how were the developers testing their scheduling decisions? In particular, was the Linsched testing framework being used? The answer is that no, Linsched is not being used. It has not seen much development work since it was posted for the 3.3 kernel, so it does not work with current kernels. Perhaps more importantly, its task representation is relatively simple; it is hard to present it with something resembling a real-world Android workload. It is easier, in the end, to simply monitor a real kernel with an actual Android workload and see how well it performs.

The plan seems to be to post a new set of big LITTLE MP patches in the near future with an eye toward getting them upstream. The developers are a little concerned about that; getting reviewer attention for these patches has proved to be difficult thus far. Perhaps persistence and a more general focus will help them to get over that obstruction, clearing the way for proper scheduling on heterogeneous multiprocessor systems in the not-too-distant future.

[Your editor would like to thank Linaro for travel assistance to attend this event.]



LC-Asia: A big LITTLE MP update

Posted Mar 7, 2013 14:14 UTC (Thu) by pj (subscriber, #4506) [Link]

HMP reminds me of the load balancing problem in SSI clustering. IIRC, the scheduling/load distribution solution that eventually worked best was based on a kind of market/economy model.

LC-Asia: A big LITTLE MP update

Posted Mar 8, 2013 13:38 UTC (Fri) by k3ninho (subscriber, #50375) [Link]

>[T]he problem is likely to be solved by scaling the scheduler's load calculations by a constant value associated with each processor. Processes running on a CPU that is ten times faster than another will accumulate load ten times more quickly.

Faster isn't the goal here. Efficiency is the goal, so a scaling based on the relative efficiency of the 'big' or 'LITTLE' core should fit better with improving battery life.

For example, a core which runs at 600MHz and consumes 0.3 watts with a bogoMIPS of 1 might have a measurement of 200 mega-ops-per-joule and you'd compare it to a core that runs at 1600MHz and consumes 12 watts with a bogoMIPS of 4 (so has 533 1/3 mega-ops per joule). In a race-to-idle scenario, it's obvious where to schedule the work, but in a long-running work scenario there may be certain workloads which you wouldn't leave on the LITTLE core because they break into blocks that can win the race-to-idle on the big core. Calculating where that line is will depend on the cost of moving the work between cores, but for a first approximation: how much time do you get on the big core before you've used a second's worth of energy in the LITTLE core? (Assuming it's no cost to start and stop cores, I think that's 1/40 second, so work that fits in intervals of < 1/40 second is unexpectedly better off on the big core.)
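[The 1/40-second figure checks out against the wattages assumed in the comment, as a few lines of arithmetic show:]

```python
# Checking the break-even figure above: how long can the big core run
# on the energy the LITTLE core would burn in one second?  The power
# figures are the ones assumed in the comment, not measured values.

little_power = 0.3  # watts, LITTLE core
big_power = 12.0    # watts, big core

little_energy_per_second = little_power * 1.0    # joules in one second
big_time = little_energy_per_second / big_power  # seconds on the big core
# big_time works out to 0.025, i.e. 1/40 second
```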

Note: if these numbers aren't quite right, please do tell. :-)

K3n.

LC-Asia: A big LITTLE MP update

Posted Mar 10, 2013 18:25 UTC (Sun) by iabervon (subscriber, #722) [Link]

For efficiency, you want to direct the tasks that take the most cycles to the most efficient processors. Since you control which processor the task will run on in the future, your model should estimate how much energy each scheduling choice will take: guess the future cycle count for the task and multiply by the energy cost of each processor. If you instead look at the past energy usage of the task, you're assuming that your past scheduling decisions are a good predictor of something in the future, but the point of the model is to make better decisions than your past ones were.
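[The model described here, predicted future cycles times per-processor energy cost, can be sketched in a few lines; the per-cycle costs below are invented for illustration:]

```python
# Sketch of the predictive energy model: the cost of a scheduling
# choice is the task's predicted future cycle count multiplied by that
# processor's energy cost per cycle.  All values are illustrative.

ENERGY_PER_CYCLE = {"little": 0.5, "big": 3.0}  # nJ/cycle, invented

def predicted_energy(predicted_cycles, cpu_type):
    """Estimated energy (nJ) of running the task's next slice."""
    return predicted_cycles * ENERGY_PER_CYCLE[cpu_type]
```

The point is that the prediction feeds forward: the scheduler compares `predicted_energy(n, "little")` against `predicted_energy(n, "big")` for the same future workload, rather than extrapolating from energy already spent under past placement decisions.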

Doesn't the frequency governor make ordinary SMP systems heterogeneous?

Posted Mar 11, 2013 19:18 UTC (Mon) by droundy (subscriber, #4559) [Link]

It seems like much of their work could be applied very generally to ordinary multicore and multiprocessor systems, once different power states and frequencies are involved, since it is quite possible that you'll want some cores to be running in different states than others (or be powered down). But perhaps you have to keep the frequency the same for all cores in a processor?

Doesn't the frequency governor make ordinary SMP systems heterogeneous?

Posted Mar 18, 2013 11:57 UTC (Mon) by amit.kucheria (subscriber, #59246) [Link]

You are right about this. Small-task packing, for example, makes sense even on SMP hardware, if throughput is not your own consideration.

Doesn't the frequency governor make ordinary SMP systems heterogeneous?

Posted Mar 18, 2013 17:57 UTC (Mon) by amit.kucheria (subscriber, #59246) [Link]

s/own/only

LC-Asia: A big LITTLE MP update

Posted Mar 12, 2013 2:26 UTC (Tue) by jcm (subscriber, #18262) [Link]

I'll work on getting the RH folks access to TC2 (Test Chip 2) hardware.

LC-Asia: A big LITTLE MP update

Posted Mar 18, 2013 17:58 UTC (Mon) by amit.kucheria (subscriber, #59246) [Link]

Getting more review from the maintainers and stakeholders is much appreciated, Jon.

LC-Asia: A big LITTLE MP update

Posted Mar 14, 2013 3:58 UTC (Thu) by heijo (guest, #88363) [Link]

Is it already known which approach the code shipping on the first Samsung Galaxy S4 units is going to use?

LC-Asia: A big LITTLE MP update

Posted Mar 20, 2013 9:25 UTC (Wed) by Duncan (guest, #6647) [Link]

From what I've read on LWN (no mobile here, and I don't follow them closely enough to know specifics about individual models), the switching code will be shipping on the first units, as it's more mature: it's simpler, and there was less tweaking of the existing code (cpufreq in that case) needed to get something shippable.

The HMP code is far more complex, and as the article states, the last pieces of the puzzle have yet to be fully hashed out, let alone written (the base cpufreq/switching code is, I believe, all written; they're at the last stage now), let alone corner-cased and fully debugged, with tunables tweaked.

So we're talking weeks to ship switching, months, another mobile hardware generation at least, to ship HMP.

LC-Asia: A big LITTLE MP update

Posted Mar 23, 2013 21:52 UTC (Sat) by JanC_ (guest, #34940) [Link]

I wonder if it would make sense to add additional scheduling classes (or options) for this sort of heterogeneous processing environment?

For example, a long-running application that wants to run in the background while the system is idle might still want to run on a fast core (e.g. a background encoding task, which I want to finish ASAP while not disturbing or slowing down anything else), while other processes probably care less about finishing ASAP and could run on a slow core...

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds