
Linux support for ARM big.LITTLE

February 15, 2012

This article was contributed by Nicolas Pitre

ARM Ltd recently announced the big.LITTLE architecture, a twist on the SMP systems that we've all grown accustomed to. Instead of putting a bunch of identical CPU cores together in a system, the big.LITTLE architecture pushes the concept further by pulling two different SMP systems together: one a set of "big" and fast processors, the other a set of "little" and power-efficient processors.

In practice this means having a cluster of Cortex-A15 cores and a cluster of Cortex-A7 cores, with cache coherency maintained between them. The advantage of such an arrangement is that it allows for significant power savings when processes that don't require the full performance of the Cortex-A15 are executed on the Cortex-A7 instead. This way, non-interactive background operations or streaming multimedia decoding can be run on the A7 cluster for power efficiency, while sudden screen refreshes and similar bursty operations can be run on the A15 cluster to improve responsiveness and interactivity.

So, how should this be supported in Linux? It is not as trivial as it may initially seem. Let's suppose we have a system comprising a cluster of four A15 cores and a cluster of four A7 cores. The naive approach would be to make all eight cores visible to the kernel and let the scheduler do its job just like on any other SMP system. But here's the catch: SMP means Symmetric Multi-Processing, and in the big.LITTLE case the cores aren't symmetric between clusters.

The Linux scheduler expects all available CPUs to have the same performance characteristics. For example, there are provisions in the scheduler to deal with things like hyperthreading, but that is still an attribute which is normally shared by all CPUs in a given system. Here we're deliberately putting CPUs with significantly different performance/power characteristics together in the same system, and we expect the kernel to make optimal use of them at all times, given that we want the best user experience together with the lowest possible battery consumption.

So, what should be done? Many questions come to mind:

  • Is it OK to reserve the A15 cluster just for interactive tasks and the A7 cluster for background tasks?
  • What if the interactive tasks are sufficiently light to be processed by the small cores at all times?
  • What about those background tasks that the user interface is actually waiting for?
  • How to determine if a task using 100% CPU on a small core should be migrated to a fast core instead, or left on the small core because it is not critical enough to justify the increased power usage?
  • Should the scheduler auto-tune its behavior, or should user-space policies influence it?
  • If the latter, what would the interface look like to be useful and sufficiently future-proof?

Linaro started an initiative during the most recent Linaro Connect to investigate this problem. It will require a high degree of collaboration with the upstream scheduler maintainers and a good amount of discussion. And given past history, we know that scheduler changes cannot happen overnight... unless your name is Ingo, that is. Therefore, it is safe to assume that this will take a significant amount of time.

Silicon vendors and portable device makers are not going to wait though. Chips implementing the big.LITTLE architecture will appear on the market in one form or another, way before a full heterogeneous multi-processor aware scheduler is available. An interim solution is therefore needed soon. So let's put aside the scheduler for the time being.

ARM Ltd has produced a prototype software solution consisting of a small hypervisor using the virtualization extensions of the Cortex-A15 and Cortex-A7 to make both clusters appear to the underlying operating system as if there were only one Cortex-A15 cluster. Because the cores within a given cluster are still symmetric, all the assumptions built into the current scheduler still hold. With a single call, the hypervisor can atomically suspend execution of the whole system, migrate the CPU states from one cluster to the other, and resume system execution on the other cluster without the underlying operating system being aware of the change; just as if nothing had happened.

Taking the example above, Linux would see only four Cortex-A15 CPUs at all times. When a switch is initiated, the registers for each of the four CPUs in cluster A are transferred to the corresponding CPUs in cluster B, interrupts are rerouted to the CPUs in cluster B, then the CPUs in cluster B are resumed exactly where cluster A was interrupted and, finally, the CPUs in cluster A are powered off. And vice versa for switching back to the original cluster. Therefore, if there are eight CPU cores in the system, only four of them are visible to the operating system at any given time. The only visible difference is the observable execution speed, and of course the corresponding change in power consumption when a cluster switch occurs. Some latency is implied by the actual switch, of course, but it should be small and imperceptible to the user.
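In rough pseudocode, the switch sequence looks something like this (a sketch only; every name below is an illustrative placeholder, not the actual interface of the reference code):

    /* Everything needed to resume a core: general registers, CP15 state, etc. */
    struct cpu_context { unsigned long regs[64]; };

    static struct cpu_context saved[4];  /* one slot per CPU in the outgoing cluster */

    void switch_cluster(int from, int to)
    {
        int cpu;

        disable_interrupt_delivery();             /* stop the world */

        for (cpu = 0; cpu < 4; cpu++) {
            save_cpu_state(from, cpu, &saved[cpu]);   /* capture outbound state */
            power_up_cpu(to, cpu);
            restore_cpu_state(to, cpu, &saved[cpu]);  /* load it into the twin */
        }

        reroute_irqs_to_cluster(to);   /* point the interrupt controller at 'to' */
        resume_cluster(to);            /* continue exactly where 'from' stopped */
        power_down_cluster(from);      /* and only now cut power to 'from' */
    }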

This solution has advantages, such as providing a mechanism which should work for any operating system targeting a Cortex-A15 without modifications to that operating system. It is therefore OS-independent and easy to integrate. However, it brings a certain level of complexity, such as the need to virtualize all the differences between the A15 and the A7. While those CPU cores are functionally equivalent, they may differ in implementation details such as cache topology. That would force every cache maintenance operation to be trapped by the hypervisor and translated into equivalent operations on the actual CPU core when the running core is not the one the operating system thinks it is running on.

Another disadvantage is the overhead of saving and restoring the full CPU state because, by virtue of being OS-independent, the hypervisor code may not know which parts of the CPU state are actually in active use by the OS. The hypervisor could trap everything in order to know what is being touched, allowing partial context transfers, but that would be yet more complexity for a dubious gain. After all, the kernel already knows what is being used in the CPU, and it can deal with differing cache topologies natively, etc. So why not implement this switcher support directly in the kernel, given that we can modify Linux and do better?

In fact, that's exactly what we are doing: taking the ARM Ltd BSD-licensed switcher code and using it as a reference to put the switcher functionality directly into the kernel. This way, we can get away with much less support from the hypervisor code and improve switching performance by not having to trap any cache maintenance instructions, by limiting the CPU context transfer to the minimum set of active registers, and by sharing the same address space as the kernel.

We can implement this switcher by modeling its functionality as a CPU speed change and exposing it via a cpufreq driver. This way, unlike the reference code from ARM Ltd, which is limited to whole-cluster switches, we can easily pair each of the A15 cores with one of the A7 cores and have each of those CPU pairs appear as a single pseudo CPU able to change its performance level via cpufreq. And because the cpufreq governors are already available and understood by existing distributions, including Android, we have a straightforward solution with a fast time-to-market for the big.LITTLE architecture that shouldn't cause any controversy.
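To make that concrete, here is a minimal sketch of what such a cpufreq driver could look like. The operating points, their mapping to core types, and the bl_switch_to() helper are assumptions made for illustration, not the actual driver:

    #include <linux/cpufreq.h>

    #define CLUSTER_A15 0
    #define CLUSTER_A7  1

    /* Each logical CPU is an A15/A7 pair; the "frequency" picked by the
       governor implicitly selects which core type actually runs. */
    static struct cpufreq_frequency_table bl_freqs[] = {
        { .index = CLUSTER_A7,  .frequency = 350000 },   /* kHz, on the A7 */
        { .index = CLUSTER_A7,  .frequency = 700000 },
        { .index = CLUSTER_A15, .frequency = 1000000 },  /* kHz, on the A15 */
        { .index = CLUSTER_A15, .frequency = 1500000 },
        { .index = 0,           .frequency = CPUFREQ_TABLE_END },
    };

    static int bl_set_target(struct cpufreq_policy *policy,
                             unsigned int target_freq, unsigned int relation)
    {
        unsigned int i;

        if (cpufreq_frequency_table_target(policy, bl_freqs, target_freq,
                                           relation, &i))
            return -EINVAL;

        /* Hypothetical helper: migrate this pseudo CPU's context to the
           core type owning the chosen operating point, then set the clock. */
        return bl_switch_to(policy->cpu, bl_freqs[i].index,
                            bl_freqs[i].frequency);
    }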

Obviously the "switcher", as we call it, is no replacement for the ultimate goal of exposing all the cores to the kernel and letting the scheduler make the right decisions. But it is nevertheless a nice, self-contained interim solution that will allow pretty good usage of the big.LITTLE architecture while removing the pressure to come up with scheduler changes quickly.


Linux support for ARM big.LITTLE

Posted Feb 15, 2012 14:54 UTC (Wed) by SEJeff (guest, #51588) [Link] (36 responses)

Doesn't Nvidia's Tegra 3 "Kal-El" processor already do this?

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 15:02 UTC (Wed) by tajyrink (subscriber, #2750) [Link] (15 responses)

I think it's NVIDIA's own approach to the same thing.

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 15:08 UTC (Wed) by SEJeff (guest, #51588) [Link] (14 responses)

Right. Nvidia had the first shipping implementation vs ARM. I wonder if they got the idea from ARM, or ARM got the idea from the Tegra 3.

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 15:21 UTC (Wed) by gioele (subscriber, #61675) [Link]

> I wonder if they got the idea from ARM, or ARM got the idea from the Tegra 3.

The idea _per se_ is not new at all; see the asymmetric multi-CPU systems [1] of the '70s or the Cell architecture [2] of the 2000s.

[1] https://en.wikipedia.org/wiki/Asymmetric_multiprocessing
[2] https://en.wikipedia.org/wiki/Cell_%28microprocessor%29

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 15:33 UTC (Wed) by hamjudo (guest, #363) [Link]

Anybody who programmed a CDC 6600 thought: wouldn't it be great if we could make a system where the little processors ran the same instruction set as the big processors?

Wikipedia says the 6600 came out in 1964. There are many earlier examples.

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 16:50 UTC (Wed) by drag (guest, #31333) [Link] (10 responses)

It's very common for ARM systems to have disparate processors. My GP2X handheld is a dual-core system. It has a regular ARM processor and then a second processor for accelerating some types of multimedia functions.

Then, of course, there are modern x86-64 systems, where the GPU is now a coprocessor that can be used for anything, rather than something dedicated just to graphics.

I don't know how common it is to have disparate general purpose processors, however. With my GP2X the application had to be optimized to take advantage of the second processor.

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 17:04 UTC (Wed) by jzbiciak (guest, #5246) [Link] (8 responses)

Right, but IIRC, the OS runs as a single-processor OS, and the second CPU is treated more like a peripheral. You write your video/graphics processing code for the second ARM and then it becomes an application-specific accelerator, not much different from dedicated hardware, but tuned for a particular app.

Heck, even desktop PCs have been asymmetric multiprocessors since their introduction (an 8048 in the keyboard, an 8088 running the apps), but the OS really only thinks about the main CPU.

The A7/A15 split is rather different: This wants to run SMP Linux across both types of cores, dynamically moving tasks between the A7 and A15 side of the world seamlessly. All of the processor cores are considered generic resources, just with different performance/power envelopes. That's rather different from the GP2X model.

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 18:16 UTC (Wed) by jmorris42 (guest, #2203) [Link] (7 responses)

> Right, but IIRC, the OS runs as a single processor OS, and the
> second CPU is treated as more like a peripheral.

True, but that is because of reasons above the silicon. In a typical bottom-of-the-line Android phone you already have multiple processing cores. Mine has two ARM cores and two DSPs, plus the GPU. One ARM and one DSP are dedicated to the radio to keep that side a silo, but the processors can talk to each other and share one block of RAM. So there isn't a technical reason an OS couldn't be written for that old hardware that unified all five processing elements and dispatched executable code between the ARM cores as needed. It just wouldn't be a phone anymore.

The 'innovation' here is deciding to encourage the rediscovery of asymmetrical multiprocessing and to relearn and update the ways of dealing with the problems it brings. There was a reason everyone moved to SMP: it made the software a lot easier. Now power consumption drives everything and software is already very complex, so the balance shifts.

Then they will stick in an even slower CPU for the radio in chips destined for phone use and wall it off in the bootloader just like now. It is the only way to keep the lawyers (and the FCC) happy.

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 20:51 UTC (Wed) by phip (guest, #1715) [Link] (6 responses)

You need more than just shared DRAM to enable an MP OS. The CPUs need to be cache-coherent.

In a low-end multicore SOC with noncoherent CPUs, the DRAM is usually statically partitioned between the CPUs, which run independent OS images. Any interprocessor communication is done through shared buffers that are uncached or else carefully flushed at the right times (similar to DMA).

-Phil
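
A minimal sketch of that pattern, with cache_clean_range() and cache_invalidate_range() standing in for whatever maintenance primitives the platform actually provides (assumed names, not a real API):

    #include <string.h>

    extern void cache_clean_range(const void *addr, unsigned long len);
    extern void cache_invalidate_range(const void *addr, unsigned long len);

    struct ipc_msg {
        volatile unsigned int ready;   /* polled by the consumer */
        unsigned int len;
        char payload[248];
    };

    /* One message slot in the shared block of DRAM (address is made up). */
    static struct ipc_msg *slot = (struct ipc_msg *)0x80000000;

    void send_msg(const void *data, unsigned int len)
    {
        memcpy(slot->payload, data, len);
        slot->len = len;
        cache_clean_range(slot, sizeof(*slot));  /* push the data to DRAM... */
        slot->ready = 1;                         /* ...then raise the flag */
        cache_clean_range(slot, sizeof(slot->ready));
    }

    void recv_msg(void *buf)
    {
        do {   /* discard stale cached copies until the flag appears */
            cache_invalidate_range(slot, sizeof(*slot));
        } while (!slot->ready);
        memcpy(buf, slot->payload, slot->len);
    }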

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 21:55 UTC (Wed) by zlynx (guest, #2285) [Link] (5 responses)

Nah. There's no real reason the hardware has to be cache coherent. It is just a lot easier if it is.

For example, the OS could force a cache flush on both CPUs when migrating a task.

Threaded applications would have a tough time, but even that has been handled in the past. For example, the compiler and/or threading library could either track memory writes between locks so as to make sure those cache addresses are flushed, or it could use the big hammer of a full cache flush before unlock.

Cache coherency is really just a crutch. Lots of embedded programmers and Cell programmers (no cache protocol on the SPEs) know how to work without it. :-)

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 22:31 UTC (Wed) by phip (guest, #1715) [Link] (2 responses)

Strictly speaking that's true, but it is cumbersome enough in practice
that nobody does it for any mainstream OS.

It's not just migrating processes & threads - any global operating
system data structures need to be synchronized with cache flushes,
memory ordering barriers, mutexes, etc. before and after each access.

If you want to use multiple noncoherent cores to run a general-purpose OS,
the best approach is to treat it as a cluster (with each CPU running
its own OS image).

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 22:35 UTC (Wed) by phip (guest, #1715) [Link] (1 responses)

I don't know of anyone running a general-purpose OS on the Cell
Synergistic Processors (or on a GPU for that matter).

Having a different instruction set on the different processor
cores moves the complexity to another level above noncoherence.

The usual programming model is to run a general-purpose OS on the
PowerPC processor(s), and treat the SPEs as a type of coprocessor
or peripheral device.

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 23:02 UTC (Wed) by zlynx (guest, #2285) [Link]

As for the Cell SPEs or GPU shaders: they probably aren't capable enough to bother with having the OS run them directly. I only brought them up because there's no automatic cache coherence with them. They read the data into cache (the local 256 KB). They write the data out. Both read and write are done explicitly.

No one runs an MP system with different instruction sets, true. It isn't impossible, though. The OS would need to be built twice, once for each instruction set, and it could share the data structures.

I wonder if Intel ever played around with this for an Itanium server? I seem to remember that Itanium once shared the same socket layout as a Xeon, so this would have been possible at the hardware level.

Now, if you got very tricky and decided to require that all software be in an intermediate language, the OS could use LLVM, Java, .NET or whatever to compile the binary to whichever CPU was going to execute the process. That would really make core switching expensive! And you'd need some way to mark where task switches were allowed to happen, and maybe run the task up to the next switch point so you could change over to the same equivalent point in the other architecture.

A bit more realistic would be cores with the same base instruction set, but specialized functions on different cores. That could work really well. When the program hit an "Illegal Instruction" fault, the OS could inspect the instruction and find one of the system's cores that supports it, then migrate the task there or emulate the instruction. Or launch a user-space helper to download a new core design off the internet and program the FPGA! That would let programmers use specialized vector, GPU, audio, or regular expression instructions without worrying about which cores to use.
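
A sketch of how that trap-and-migrate path might look, in kernel-flavored C; everything here (the feature masks and all four helpers) is hypothetical, since no such mechanism exists:

    #include <linux/sched.h>
    #include <linux/cpumask.h>

    /* decode_features(), cpu_feature_mask(), move_task_to_cpu() and
       emulate_insn() are all made up for illustration. */
    static void undef_instruction_fault(struct task_struct *task,
                                        unsigned int insn)
    {
        unsigned long needed = decode_features(insn);  /* what the insn requires */
        int cpu;

        for_each_online_cpu(cpu) {
            if ((cpu_feature_mask(cpu) & needed) == needed) {
                /* Pin the task to a capable core and retry the instruction. */
                move_task_to_cpu(task, cpu);
                return;
            }
        }
        /* No capable core in the system: fall back to emulation. */
        emulate_insn(task, insn);
    }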

Linux support for ARM big.LITTLE

Posted Feb 19, 2012 18:06 UTC (Sun) by alison (subscriber, #63752) [Link] (1 responses)

zlynx comments:

"it could use the big hammer of a full cache flush before unlock.
Cache coherency is really just a crutch. Lots of embedded programmers and Cell programmers (no cache protocol on the SPEs) know how to work without it. :-)"

Full cache flush == embedded "Big Kernel Lock" equivalent?

-- Alison Chaiken, alchaiken@gmail.com

Linux support for ARM big.LITTLE

Posted Feb 19, 2012 21:51 UTC (Sun) by phip (guest, #1715) [Link]

Big Kernel Lock...

Hmm, that brings up another point I should have thought of earlier:
Non-coherent multi-CPU SOCs are also likely not to implement
atomic memory access primitives (e.g. Compare/Exchange, Test-and-Set,
Load-Linked/Store-Conditional, etc.).

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 17:27 UTC (Wed) by nix (subscriber, #2304) [Link]

It's very common for ARM systems to have disparate processors. My GP2X handheld is a dual-core system. It has a regular ARM processor and then a second processor for accelerating some types of multimedia functions.
This goes right back to the ARM's prehistory. The BBC microcomputer's 'Tube' coprocessor interface springs to mind, allowing you to plug in coprocessors with arbitrary buses, interfacing them to the host machine via a set of FIFOs. People tended to call the coprocessor 'the Tube' as well, which was a bit confusing given how variable the CPUs were that you could plug in there.

Linux support for ARM big.LITTLE

Posted Feb 23, 2012 19:41 UTC (Thu) by wmf (guest, #33791) [Link]

I think the canonical reference here is "Single-ISA Heterogeneous Multi-Core Architectures" from MICRO 2003: http://www.microarch.org/micro36/html/pdf/kumar-SingleISA...

ARM big.LITTLE vs. nvidia Tegra 3 Kal-El

Posted Feb 15, 2012 15:50 UTC (Wed) by Felix.Braun (guest, #3032) [Link] (1 responses)

However, as I understand it, Tegra 3 doesn't expose all processors to the OS. Instead it decides in hardware when to use the smaller processor (there's only one small processor in Kal-El's case IIRC) and when to switch on the big processors.

A software approach seems more flexible to me, which is a good thing considering the complexity of the issue. This seems impossible to get right on the first try.

ARM big.LITTLE vs. nvidia Tegra 3 Kal-El

Posted Feb 15, 2012 15:55 UTC (Wed) by Felix.Braun (guest, #3032) [Link]

Here's the link to the White-Paper describing the Kal-El architecture: http://www.nvidia.com/content/PDF/tegra_white_papers/Vari...

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 14:46 UTC (Thu) by yaap (subscriber, #71398) [Link] (17 responses)

Some overall idea, but different implementations.

In Nvidia's approach, you have one slow Cortex-A9 optimized for low power, and four fast Cortex-A9s optimized for performance. The slow one will run at 500 MHz while the fast ones could be at 1 GHz or more. They use a mixed 40nm LP/G process, so I would assume that the slow A9 is LP (low power) while the others are G (generic/high performance), but that's just a guess. Anyway, there are other ways to optimize for speed vs. low power.
The switch is between the single "slow" A9 and the others. If the slow one is not enough, the load is switched to a fast A9. Then, when on the fast side (with the slow one disabled), one or several cores may be enabled depending on the workload.

With big.LITTLE, for now you have as many A7 as A15 cores. The current approach switches between all A7s or all A15s. But "all" is misleading, in that not all cores may be needed, of course.
ARM's approach may also later support a finer-grained mode where you have pairs of A7/A15 cores, and each pair can switch (or be disabled) independently. Nicer, of course. But still as many A7 as A15 cores.

The Nvidia approach applied to big.LITTLE would be a single A7 and four A15s. But it doesn't seem to be in the plans yet. And I guess that as ARM gets more money the more cores are used, they don't have much incentive to get there ;) Still, as far as I know, nothing would prevent a licensee from doing that on its own.

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 19:51 UTC (Thu) by jzbiciak (guest, #5246) [Link] (16 responses)

As I recall, the individual CPUs in an A15 cluster can be powered on/off independently of the others. I would be surprised if A7 didn't also allow that.

So, you could set up "migration pairs", pairing each A15 CPU with an A7 CPU, and only keep one on at a time, migrating back and forth as needed. If all CPUs in a cluster are off, then you can shut off the cluster as a whole.

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 19:57 UTC (Thu) by dlang (guest, #313) [Link] (15 responses)

What's the reason to pair up CPUs like this rather than just using the processing power of all of the CPUs (assuming they are all powered on)?

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 21:21 UTC (Thu) by jzbiciak (guest, #5246) [Link] (14 responses)

The last three paragraphs of the article really summarize why.

The current scheduler does not really understand heterogeneous CPU mixes. But, you can still do better than ARM's hypervisor-based switcher by pushing the logic into the kernel, and doing this pair-wise switcheroo. Then, the existing Linux infrastructure just models it as a frequency change within its existing governor framework.

Longer term, of course, you do want to teach Linux how to load balance properly between disparate cores, so at the limit, you could have all 2*N CPUs active on a machine with an N-CPU A15 cluster next to an N-CPU A7 cluster. But in the short run, pulling ARM's BSD-licensed switching logic to do the pair-wise switch at least gets something running that works better than leaving it to the hypervisor. The short run use model really does appear to be to have N CPUs powered on when you have an N CPU A15 cluster next to an N CPU A7 cluster.

Another thought occurs to me: Even if Linux were able to schedule for all 8 CPUs, the possibility exists that the environment around it doesn't really support that. It might not have the heat dissipation or current carrying capacity to have all 2*N CPUs active. So, that's another reason the pair-wise switch is interesting to consider.

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 21:23 UTC (Thu) by jzbiciak (guest, #5246) [Link]

Keeping my abstraction straight, I should have said "Even if Linux were able to schedule for all 2*N CPUs." And carrying the thought further, I mean "schedule efficiently and appropriately for the CPU mix."

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 21:35 UTC (Thu) by dlang (guest, #313) [Link] (11 responses)

Ok, that may be a short-term hack, but such a hack should probably not go into the mainline scheduler.

My point is that we already have these 'problems' on mainstream systems.

On Intel and AMD x86 CPUs we already have cases where turning off some cores will let you run other cores at a higher clock speed (thermal/current limitations), and where you can run some cores at lower speeds than others.

These may not be as big a variation as the ARM case has, but it's a matter of degree, not a matter of being a completely different kind of problem.

I agree that the current scheduler doesn't have explicit code to deal with this today, but I think that for the most part the existing code will 'just work' without modification. The rebalancing code pulls work off of a core if the core is too heavily loaded. A slower core will be more heavily loaded for the same work than a faster core would be, so work will naturally be pulled off of a heavily loaded slow core.

The two points where the current scheduler doesn't do the right thing are both in the rebalancing code, when considering moving processes between cores of different speeds (but only if you have CPU-hog processes that will max out a core). As I note above, fixing this doesn't seem like that big a change, definitely less intrusive than the NUMA-aware parts.

Let userspace figure out all the nuances of what combinations of clock speeds on the different cores will work in the current environment (if it's limited by thermal factors, then different cooling, ambient temperatures, etc. will change these limits; you really don't want that code in the kernel).

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 16:40 UTC (Fri) by BenHutchings (subscriber, #37955) [Link] (10 responses)

On Intel and AMD x86 CPUs we already have cases where turning off some cores will let you run other cores at a higher clock speed (thermal/current limitations), and where you can run some cores at lower speeds than others.

The Linux scheduler already generically supports the grouping of CPU threads that share resources, and tends to spread runnable tasks out across groups (although it can also be configured to concentrate them in order to save power). I think that this should result in enabling the 'turbo' mode where possible.

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 19:05 UTC (Fri) by dlang (guest, #313) [Link] (9 responses)

My point is that deciding to try to enable 'turbo' mode, or to slow down some cores to speed up other cores, etc is an area that is so complex and machine specific that trying to do it in the kernel is wrong.

Part of the decision process will also need to consider what programs are running and how likely these programs are to need significantly more CPU than they are currently using (because switching modes takes a significant amount of time). This involves a lot of policy and a lot of usage profiling, exactly the sorts of things that do not belong in the kernel.

With the exception of the case where there is a single thread using all of a core, I think the existing kernel scheduler will 'just work' on a system with different speed cores.

Where I expect the current scheduler to have problems is in the case where a single thread will max out a CPU; I don't think that the scheduler will be able to realize that it would max out one CPU but not another.

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 19:24 UTC (Fri) by BenHutchings (subscriber, #37955) [Link] (8 responses)

While I understand that policy decisions should be set by user-space, the kernel generally and the scheduler in particular cannot make synchronous calls to ask userland what to do. Also, upcalls are relatively expensive. So the kernel has to be given the information to get on and implement the policy without asking too many questions.

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 19:38 UTC (Fri) by dlang (guest, #313) [Link] (7 responses)

I'm in no way suggesting that the kernel make calls to userspace (synchronous or otherwise)

I view this as a three tier system

At the first tier you have the scheduler on each core making decisions about which of the processes assigned to that core should run next.

At the second tier you have a periodic rebalancing algorithm that considers moving jobs from one core to another. preferably run by a core that has idle time (that core 'pulls' work from other cores)

These two tiers will handle cores of different speeds without a problem as-is, as long as no thread maxes out the slowest core.

I am saying that the third tier would be the userspace power management daemon, which operates completely asynchronously to the kernel, it watches the overall system and makes decisions on when to change the CPU configuration. When it decides to do so, it would send a message to the kernel to make the change (change the speed of any core, including power it off or on)

Until userspace issues the order to make the change, the kernel scheduler just works with what it has; no interaction with userspace is needed.
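
Most of the plumbing for that third tier exists today; as a trivial sketch (error handling omitted), a daemon can already drive an individual core through the cpufreq "userspace" governor and CPU hotplug via sysfs:

    #include <stdio.h>

    static void write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (f) { fputs(val, f); fclose(f); }
    }

    int main(void)
    {
        /* Take direct control of CPU1's clock... */
        write_str("/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor",
                  "userspace");
        /* ...pin it to a low operating point (in kHz)... */
        write_str("/sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed",
                  "800000");
        /* ...and take another core offline entirely. */
        write_str("/sys/devices/system/cpu/cpu2/online", "0");
        return 0;
    }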

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 20:26 UTC (Fri) by BenHutchings (subscriber, #37955) [Link] (6 responses)

Sorry, what you're suggesting is still likely to be too slow. Look how successful the 'userspace' cpufreq governor isn't.

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 22:34 UTC (Fri) by dlang (guest, #313) [Link] (5 responses)

I don't see what's so performance-critical about this; you shouldn't be making significant power-configuration changes to your system hundreds of times a second.

Doing the analysis every second (or even less frequently) should be pretty good in most cases.

The kernel can make some fairly trivial choices, but they are limited to something along the lines of:

here is a list of power modes; if you think you are idle too much, move down the list; if you think you are not idle enough, move up the list.

But anything more complicated than this will quickly get out of control.

For example, for the sake of argument, say that 'turbo mode' is defined as:

turn off half your cores and run the other half 50% faster, using the same amount of power (losing 25% of the total processing power, probably more due to memory pipeline stalls).

How would the kernel ever decide when it's appropriate to sacrifice so much of its processing power for no power savings?

I could say that I would want to do so if a single thread is using 100% of a cpu in a non-turbo mode.

But what if making that switch would result in all the 'turbo mode' cores being maxed out? It still may be faster to run overloaded for a short time to finish the CPU-hog task sooner.

I don't see any way that this sort of logic can possibly belong in the kernel. And it's also stuff that's not very timing sensitive (if delaying a second to make the decision results in the process finishing first, it was probably not that important a decision to make, for example)

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 23:25 UTC (Fri) by khim (subscriber, #9252) [Link] (4 responses)

I don't see what's so performance-critical about this; you shouldn't be making significant power-configuration changes to your system hundreds of times a second.

Why do you think so?

And it's also stuff that's not very timing sensitive (if delaying a second to make the decision results in the process finishing first, it was probably not that important a decision to make, for example)

What are you talking about? It seems like this whole discussion comes from a different universe. Perhaps it's the well-discussed phenomenon where a requirement that is not at all obvious to one party is so obvious to the other that they didn't think to state it.

We are discussing all that in the context of big.LITTLE processing, right? Which is used by things like tablets and mobile phones, right?

Well, the big question here is: do I need to unfreeze and use the hot and powerful Cortex-A15 core to perform some kind of UI task, or will the slim and cool Cortex-A7 be enough to finish it? And the cut-off is dictated by physiology: the task should be finished in less than 70-100ms if it's a reaction to user input, or in 16ms if it's part of an animation. This means that the decision to wake up the Cortex-A15 or not must be taken in 1-2ms, tops. Better to do it in about 300-500µs. Any solution which alters the power configuration once per second is so, so, SO beyond the event horizon it's not even funny.

I could say that I would want to do so if a single thread is using 100% of a cpu in a non-turbo mode.

Wrong criterion. If the Cortex-A7 core can calculate the next frame in 10ms then there is no need to wake up the Cortex-A15 core, even if for these 10ms the Cortex-A7 is 100% busy.

The problems here are numerous and indeed quite time-critical. The only model which makes sense is an in-kernel daemon which actually does the work quickly and efficiently, but which uses information collected by a userspace daemon.

Linux support for ARM big.LITTLE

Posted Feb 18, 2012 0:47 UTC (Sat) by dlang (guest, #313) [Link] (3 responses)

I'm trying to address the general problem, not just a tablet specific problem.

Waking from some sleep modes may take 10ms, so if you have deadlines like that you had better not put the processor to sleep in the first place.

I also think that a delay at the start of an app is forgivable, so if the system needs the faster cores to render things, it should find out quickly, start up those cores, and continue.

I agree that if you can specify an ordered list of configurations and hand that to the kernel you may be able to have the kernel use that.

on the other hand, the example that you give:

> Wrong criterion. If the Cortex-A7 core can calculate the next frame in 10ms then there is no need to wake up the Cortex-A15 core, even if for these 10ms the Cortex-A7 is 100% busy.

That sort of proves my point: how can the kernel know that the application has completed its work if it's busy 100% of the time? (Especially if you have an algorithm that will adapt to not having quite enough processor and will auto-scale its quality.)

This sort of thing requires knowledge that the kernel does not have.

Also, consider the example of the 'turbo mode' where you can run some cores faster, but at the expense of not having the thermal headroom to run as many cores. In every case I am aware of, 'turbo mode' actually reduces the clock cycles available overall (and makes the CPU:memory speed ratio worse, costing more performance), but if you have a single-threaded process that will finish faster in turbo mode, it may be the right thing to switch to that mode.

It doesn't matter if you are a 12-core Intel x86 monster or a much smaller ARM chip.

Linux support for ARM big.LITTLE

Posted Feb 18, 2012 11:09 UTC (Sat) by khim (subscriber, #9252) [Link] (2 responses)

It doesn't matter if you are a 12-core Intel x86 monster or a much smaller ARM chip.

Well, sure. But the differences between interactive tasks and batch-processing modes are acute. With batch processing you are optimizing the time for a [relatively] long process. With interactive tasks you optimize work within your tiny 16ms timeslice. It makes no sense to produce the result in 5ms (and pay for it), but if you spend 20ms then you are screwed.

Today the difference is not so acute because the most power-hungry part of a smartphone or tablet is the LCD/OLED display. But if/when technologies like Mirasol are adopted, these decisions will suddenly start producing huge differences in battery life.

Linux support for ARM big.LITTLE

Posted Feb 18, 2012 11:58 UTC (Sat) by dlang (guest, #313) [Link] (1 responses)

Agreed. I'd love to have a passive-LCD-display netbook that was able to transparently sleep between keystrokes, but realistically we are a long way from that in terms of hardware capabilities, and even further away in terms of being able to predict future workloads correctly.

I don't think we are ever going to get away from having to make the choice between keeping things powered up to be responsive, and powering things down aggressively to save power.

Linux support for ARM big.LITTLE

Posted Feb 20, 2012 12:17 UTC (Mon) by khim (subscriber, #9252) [Link]

Hardware is in labs already (and should reach the market in a few years); it's time to think about the software side.

If we are talking about small tweaks then such hardware is not yet on the radar, but if we plan to create a whole new subsystem (a task which will itself need two or three years to mature) then it must be considered.

Linux support for ARM big.LITTLE

Posted Feb 23, 2012 1:35 UTC (Thu) by scientes (guest, #83068) [Link]

The big.LITTLE whitepaper here: http://www.arm.com/files/downloads/big_LITTLE_Final_Final...
implies that the "big.LITTLE MP Use Model", or running all the cores at once,
is supported. Of course this doesn't mean that implementers that are only testing for the "Task Migration Use Model" will implement it adequately for a use model they are not using.

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 17:03 UTC (Wed) by epa (subscriber, #39769) [Link] (2 responses)

For a moment I thought ARM had introduced some funky new instructions to switch rapidly between bigendian and littleendian data, perhaps converting between uppercase and lowercase at the same time...

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 23:59 UTC (Wed) by cmccabe (guest, #60281) [Link]

Or a system where one of the CPUs was big-endian, and the other was little-endian. Just for fun

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 13:03 UTC (Thu) by ssvb (guest, #60637) [Link]

FWIW, the funky "SETEND BE"/"SETEND LE" instructions for runtime switching of the default endianness of load/store operations have been supported since ARMv6 :)

This is needed even for bigger systems.

Posted Feb 15, 2012 19:02 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

There was an issue within the last few weeks related to the scheduler and the debate over what should be in the scheduler vs. what should be in a userspace daemon (IIRC it was power/sleep related, but I don't remember exactly what).

It seems to me that there are two answers that jump out:

1. Policy belongs in userspace, so the question of when to power up the fast cores, when to power them down, etc. belongs in userspace.

2. The kernel scheduler needs the ability to handle CPUs with different performance characteristics, be it the big.LITTLE approach or just a many-core x86 box with some cores running at a reduced clock speed.

For this latter problem, it seems to me that the system shouldn't care about what the speed of an available CPU is, but should instead be balancing on how close to being maxed out it is. If none of the cores are maxed out, then (except for power management, which we've deferred to userspace) it doesn't matter how fast any of the cores are.

The one exception to "the scheduler doesn't need to know the core speeds" is when a core _is_ maxed out: the scheduler then needs to know the relative speeds of the different cores to decide whether it should move the process to a "better" core.

However, the speed of the core isn't the only possible reason to move a job to a different core; in a NUMA system, you may want to move a job to get it nearer to the memory that it accesses.

The good news (at least as it seems to me) is that this is not something that needs to be in the scheduler hot path; it can live in the periodic rebalancing routine, probably as an abstraction of the NUMA-aware pulling, tinkering with the definition of the "optimal" CPU for a job.

It's definitely not correct to try to schedule interactive tasks on one type of CPU and non-interactive tasks on a different type.

In terms of what the API to the userspace daemon needs to be: I can't define the details, but to kick off the conversation, I think it needs to allow the following:

1. The userspace daemon needs to be able to tell the kernel to alter the configuration of a particular core: power it up/down, change its speed, engage "turbo" mode (this would include turning off some cores so that others can run at a higher clock speed), etc.

2. For some systems this should probably be something close to an atomic change, so the API should probably allow passing a data structure to the kernel, not just individual setting changes.

3. The userspace daemon needs to be able to see what the existing settings are.

4. The userspace daemon needs to be able to gather information about per-core performance. I think this information is already available today, although there may be reasons to improve the efficiency of gathering it (which would help other performance analysis tools as well).

The devil is in the details as always, but this doesn't look like a situation where the broad-brush design options are that difficult.
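
As a straw man, the whole list above could be expressed as a single structure handed to the kernel in one call. This is purely hypothetical; nothing like it exists:

    /* Straw-man interface only -- entirely hypothetical. */
    struct core_config {
        int cpu;                /* which core this entry describes */
        int online;             /* 1 = power it up, 0 = power it down */
        unsigned int freq_khz;  /* requested speed, 0 = leave it alone */
        int turbo;              /* allow boosting at the other cores' expense */
    };

    struct power_config {
        unsigned int nr_cores;
        struct core_config core[];  /* one entry per core, applied as a unit */
    };

    /* Apply the whole configuration atomically, or none of it (point 2),
       and read the current settings back (point 3). */
    extern int set_power_config(const struct power_config *cfg);
    extern int get_power_config(struct power_config *cfg, unsigned int max);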

This is needed even for bigger systems.

Posted Nov 26, 2013 23:14 UTC (Tue) by plugwash (subscriber, #29694) [Link]

"It's definantly not correct to try and schedule interactive tasks on one type of CPU and non-interactive tasks on a different type."
In many systems most threads wake up from time to time, do some chunk of work and go back to sleep. The problem is the kernel has no idea either how long the work will take or how important it is that the work is completed quickly. Without that information it is not possible to determine if the job should be run on a cheap (in terms of power per work done) but slow core or an expensive but fast core.

Assuming that user-interaction tasks are time-important (that is, the user will be pissed off if they don't complete quickly) while background tasks are not (that is, the user doesn't care how long they take to complete) is not perfect, but it's probably a better approximation than not having any information at all.

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 22:51 UTC (Wed) by jengelh (guest, #33263) [Link] (1 responses)

>The Linux scheduler expects all available CPUs to have the same performance characteristics.

Does it? Even today, cores can be set to different speeds and therefore different performance levels. (It seems the i7-920 has a will of its own, but certainly HT threads factor in here as well in this oddball printout.)

analyzing CPU 0:
  The governor "powersave" may decide which speed to use
  current CPU frequency is 2.67 GHz (asserted by call to hardware).
analyzing CPU 1:
  The governor "performance" may decide which speed to use
  current CPU frequency is 1.60 GHz (asserted by call to hardware).
analyzing CPU 2:
  The governor "performance" may decide which speed to use
  current CPU frequency is 1.60 GHz (asserted by call to hardware).
analyzing CPU 3:
  The governor "performance" may decide which speed to use
  current CPU frequency is 1.60 GHz (asserted by call to hardware).
analyzing CPU 4:
  The governor "performance" may decide which speed to use
  current CPU frequency is 2.67 GHz (asserted by call to hardware).
analyzing CPU 5:
  The governor "performance" may decide which speed to use
  current CPU frequency is 1.60 GHz (asserted by call to hardware).
analyzing CPU 6:
  The governor "performance" may decide which speed to use
  current CPU frequency is 2.67 GHz (asserted by call to hardware).
analyzing CPU 7:
  The governor "performance" may decide which speed to use
  current CPU frequency is 2.67 GHz (asserted by call to hardware).

Linux support for ARM big.LITTLE

Posted Feb 15, 2012 23:24 UTC (Wed) by dlang (guest, #313) [Link]

I think both statements are correct

1. The Linux scheduler expects all available CPUs to have the same performance characteristics

and

2. Even on commodity systems this isn't the case already.

ARM big.LITTLE is just a more extreme case of the existing problem.

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 0:56 UTC (Thu) by rusty (guest, #26) [Link] (1 responses)

It's important to realize that the A15 cores are likely to be heat-limited in a mobile device. To paraphrase a conversation with Paul McKenney, the big CPUs are like a sports car you can rent for three seconds at a time :)

Cheers,
Rusty.

Short-term rental of sports cars

Posted Feb 16, 2012 7:01 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Perhaps a good analogy would be with the Bugatti Veyron, which can generate more than 1,000 horsepower and reach speeds in excess of 400 kilometers per hour, but which when operated at full throttle will empty its fuel tank in about twelve minutes.

Excessively high temperatures on the one hand, limited fuel tank on the other, but either way you go really fast for a very short time. ;–)

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 4:01 UTC (Thu) by ringerc (subscriber, #3071) [Link] (2 responses)

This also complicates the "race to idle" issue even more than usual. Now, not only do we have the choice of variably clocking one CPU or set of CPUs, but a choice of an entirely different set to use!

Determining where the [time x watts-per-instruction] optimum lies is even harder when you have two *different* (instructions, power) x clock-speed curves, one for the A7s and one for the A15s. Which will use more power for a given task: an A7 at full power, an A7 at half power, an A15 at full power, or an A15 at half power? Is the task going to be short-lived or long-running? Is it time-critical or background/non-interactive? What is the user's preference about responsiveness vs. power use?

I can't help but think that apps are going to have to start giving the OS a lot more knowledge of what they're doing, using hints to the scheduler. "This thread is doing periodic non-time-critical background work," "This thread is currently running a short job that must complete quickly," etc.

How to make that accessible to high level app developers in a way they won't screw up is another thing entirely, too.

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 7:06 UTC (Thu) by fredrik (subscriber, #232) [Link] (1 responses)

In Android you already have an API that allows the application to indicate how important a background task is. You can tell the system to schedule background activities only when it is otherwise awake. And you can tell it to schedule activities inexactly, coordinated with being awake for other tasks.

See the documentation for the AlarmManager and compare the flags RTC_WAKEUP vs RTC and methods setRepeating vs setInexactRepeating.

http://developer.android.com/reference/android/app/AlarmM...

Now I don't know if these options are communicated all the way down to the kernel scheduler today, though you could easily imagine a scheduler executing the inexact non-wakeup tasks on a low-power CPU, while tasks registered with the exact method and the wakeup flag could be scheduled on a high-performance CPU.

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 7:58 UTC (Thu) by dlang (guest, #313) [Link]

The Linux kernel already has the ability to relax timing for wakeups so that they can be combined with other events.

As for scheduling different types of tasks on different cores: unless the slow core is not fast enough to keep up with the running application, there's no reason to use the fast core at all.

I can guarantee you that you cannot trust the application to tell you what its needs really are. The only sane way is to watch the application, and if it's trying to do more than the slow core can keep up with (given the other tasks that are also running), then and only then should you migrate it to a faster core.

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 13:39 UTC (Thu) by const-g (guest, #5006) [Link] (6 responses)

The KISS principle should apply here.

Let us consider:
* ALL CPUs are visible
  • we have two cgroups (that set affinity among other things) -- one for the cluster of "big" CPUs, one for the cluster of "LITTLE" CPUs
* user-space policy (dynamic or static) sets which task/thread belongs to which group.
  • drivers decide statically where their interrupts run (on big or LITTLE CPUs) -- via affinity masks. This can be adjusted (possibly even set to start with) from a user-space policy.

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 19:09 UTC (Thu) by dlang (guest, #313) [Link] (5 responses)

I think you are still adding unneeded complexity.

If you have all the CPUs running, why do you care which CPU is running which process? The only reason to care would be if a process can't get its work done on the slow CPU, and the answer to that is to have the scheduler consider this when rebalancing jobs between CPUs.

Similar statements can be made about interrupt handling. If both types of CPU can handle the interrupt, why do you care which one does it?

Userspace can then control what CPUs are running, and if needed, can set affinity masks for interrupts and processes that it determines "need" to avoid the slow processors, but I really think that a slightly smarter rebalancing algorithm is the right answer in just about all cases.

The rebalancing algorithm should already be looking at more than just the number of processes; it should be looking at how much time those processes are using, and it should be taking NUMA into consideration as well. If you have two cores, one slow and one fast, both running the same load, the slow one will have a much higher utilization than the fast one, and so the fast one should end up pulling work from the slow one with the existing algorithm.

There is one factor (which shows up in two aspects) that the current scheduler doesn't consider, which is the relative speed of the cores:

When pulling work to the slow CPU, it assumes that if the job can run on the current CPU it can run on the new CPU; this needs a scaling check.

If a process is using 100% of the current CPU, it's not considered for being pulled to a new CPU, on the assumption that it won't gain anything; this needs a scaling check to see if the new CPU is faster.

These are all slow-path checks in the scheduler rebalancing code, so it shouldn't be too bad.

And as noted elsewhere, this is needed for current x86 multi-core systems because they can run different cores at different speeds.
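
The scaling check itself can be simple. A sketch, where cpu_speed() is an assumed helper returning a core's relative speed (say, 1024 for an A15 and a fraction of that for an A7):

    /* cpu_speed() is assumed, not an existing kernel function. */
    static int pull_makes_sense(int src, int dst, unsigned long load_pct)
    {
        unsigned long src_speed = cpu_speed(src);
        unsigned long dst_speed = cpu_speed(dst);

        if (dst_speed >= src_speed)
            return 1;   /* moving to an equal or faster core is always safe */

        /* Moving to a slower core: only if the job still fits once its
           observed load is scaled by the relative speed of the two cores. */
        return load_pct * src_speed / dst_speed < 100;
    }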

Linux support for ARM big.LITTLE

Posted Feb 16, 2012 21:45 UTC (Thu) by dlang (guest, #313) [Link]

I guess I'm saying that it doesn't seem as if the long-term scheduler changes are very significant, and as such I don't see a lot of value in producing such a short-term hack to deal with the issue.

Especially as the long-term fix is needed to deal with shipping Intel/AMD x86 systems.

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 5:06 UTC (Fri) by hamish (guest, #6282) [Link] (3 responses)

This (very appealing) simplification doesn't take into account the race to completion - you might end up using more total power staying on the slow CPU than would have been used by bouncing the process to the fast CPU.

I suppose a large part of this depends on the power-usage profiles of the two types of CPU: if the power used by the slow CPU running at full speed is the same as or less than the power used by the fast CPU (when the fast CPU is at a frequency that produces an equivalent number of MIPS), then this pattern could be used.

There would still need to be some way to work out that a process could benefit from lower latency - and to allow it to use the higher speeds available on the fast CPU.

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 5:17 UTC (Fri) by dlang (guest, #313) [Link] (2 responses)

Right, but this sort of decision is too system-specific to put into the kernel. Instead you want to have a userspace daemon decide which processors to use.

If the load is heavy enough, you want to use every processing cycle you have available. If the load is a little lighter, you may turn off some of the slow cores to save power (if your race-to-idle logic applies to the use case and you really will go idle); let the load drop some more and you will want to turn on the slow cores and turn off a fast one, etc.

The possible combinations are staggering, and include user policy trade-offs that can get really ugly.

Race-to-idle is not an absolute; it's an observation that, at the current time, the power efficiency of CPUs is such that for a given core you are better off racing to idle than reducing your clock speed. But when you compare separate cores with different design frequencies, the situation may be very different.

Remember that you can run a LOT of low-power cores for the same power as a single high-power core; enough that, if your application can be parallelized well, you are far better off with 100 100MHz cores than one 1GHz core, both in the amount of power used and the amount of processing you can do.

Race-to-idle also assumes that as soon as you finish the job you can shut down. This is very true in batch-job situations, but most of the time you don't really go into your low-power mode immediately; instead you run idle for a little bit, then shift to a lower power mode, wait some more, etc. (and for much of this you keep the screen powered, etc.). All these things need to be taken into consideration for a particular gadget when you are deciding what your best power strategy is going to be.

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 6:29 UTC (Fri) by hamish (guest, #6282) [Link] (1 responses)

It would be good to be able to have some generic logic available in the kernel - as a default, with knobs to take control from userspace if there is a policy daemon available.

Perhaps something like a blending of the scheduler with a cpufreq governor:
A busy CPU is interpreted as a signal to increase the frequency, but if the frequency cannot be increased (due to big.LITTLE, a package power-usage limitation, or something else I cannot envisage) and there are other CPUs with faster speeds still available, then one (or more) processes can be selected to migrate CPUs (but not necessarily all the processes from the slow CPU). (And I have not thought about how to decide to migrate back to the slower CPU either...)

Obviously, this would not suit all workloads, but would provide a good starting point and seems like it could be of use in other asymmetric scenarios.

Sure, race-to-idle is possibly only applicable to current hardware (and maybe only batch jobs), but I also think that similar principles are likely to apply to user-perceived latencies during interactive tasks - especially if you can power down what is likely to be a power-hungry core and keep the residual activity running on the slow core.

Linux support for ARM big.LITTLE

Posted Feb 17, 2012 19:09 UTC (Fri) by dlang (guest, #313) [Link]

In the existing scheduler, a busy core is a signal for other cores to pull load from it. Unless you have individual threads that max out the slow CPU, this should just work. This includes moving load to the slower cores.

The userspace power-management daemon should look at the load and make decisions about which cores to power up/down or speed up/slow down.

Part of powering down a core is telling the system to pull all of the load from that core.

Linux support for ARM big.LITTLE

Posted Feb 23, 2012 9:44 UTC (Thu) by slashdot (guest, #22014) [Link] (3 responses)

How about using the cpufreq algorithms, but applying them to the union of all CPUs in the system?

Linux support for ARM big.LITTLE

Posted Feb 23, 2012 9:47 UTC (Thu) by slashdot (guest, #22014) [Link] (2 responses)

(where obviously the "power states" would be A7-only, A15-only and A7+A15).

Linux support for ARM big.LITTLE

Posted Feb 23, 2012 9:55 UTC (Thu) by slashdot (guest, #22014) [Link] (1 responses)

And if you can really power up/down each core independently with power savings, then have (n + 1) * (n + 1) power states consisting of all #A7 and #A15 combinations, sorted either by increasing total energy consumption or increasing total bogomips.

Linux support for ARM big.LITTLE

Posted Feb 23, 2012 19:49 UTC (Thu) by dlang (guest, #313) [Link]

There are corner cases that need to be figured out even with this model:

1. If you have a CPU hog, how well or poorly does it work when one core has 2x (or 10x) the power of another? Can your CPU hog get stuck on a slow core?

2. How do you deal with software that will adapt to not having enough processor power?

3. What about software that's constrained by memory bandwidth, not CPU cycles? It may be just as fast on a slow core as on a fast one, but use 100% CPU in both cases (since CPU utilisation doesn't reflect time spent waiting for memory access).

Linux support for ARM big.LITTLE

Posted Jun 13, 2012 7:38 UTC (Wed) by revmischa (guest, #74786) [Link]

Don't the CPUs have (slightly) different instruction sets and optimizations? Ideally, wouldn't software need to be compiled and optimized for both architectures?


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds