Linux support for ARM big.LITTLE
Posted Feb 15, 2012 14:54 UTC (Wed) by SEJeff (guest, #51588)
Parent article: Linux support for ARM big.LITTLE
Posted Feb 15, 2012 15:02 UTC (Wed)
by tajyrink (subscriber, #2750)
[Link] (15 responses)
Posted Feb 15, 2012 15:08 UTC (Wed)
by SEJeff (guest, #51588)
[Link] (14 responses)
Posted Feb 15, 2012 15:21 UTC (Wed)
by gioele (subscriber, #61675)
[Link]
The idea _per se_ is not new at all, see the asymmetric multi-CPUs [1] in the '70s or the Cell architecture in 2000 [2].
[1] https://en.wikipedia.org/wiki/Asymmetric_multiprocessing
Posted Feb 15, 2012 15:33 UTC (Wed)
by hamjudo (guest, #363)
[Link]
Wikipedia says the Cyber came out in 1964. There are many earlier examples.
Posted Feb 15, 2012 16:50 UTC (Wed)
by drag (guest, #31333)
[Link] (10 responses)
Then, of course, there are modern x86-64 systems where the GPU is now a coprocessor that can be used for anything, rather than something dedicated just to graphics.
I don't know how common it is to have disparate general purpose processors, however. With my GP2X the application had to be optimized to take advantage of the second processor.
Posted Feb 15, 2012 17:04 UTC (Wed)
by jzbiciak (guest, #5246)
[Link] (8 responses)
Heck, even desktop PCs have been asymmetric multiprocessor since their introduction (8048 in the keyboard, 8088 running the apps), but the OS really only thinks about the main CPU.
The A7/A15 split is rather different: This wants to run SMP Linux across both types of cores, dynamically moving tasks between the A7 and A15 side of the world seamlessly. All of the processor cores are considered generic resources, just with different performance/power envelopes. That's rather different from the GP2X model.
Posted Feb 15, 2012 18:16 UTC (Wed)
by jmorris42 (guest, #2203)
[Link] (7 responses)
True but that is because of reasons above the silicon. In a typical bottom of the line Android phone you already have multiple processing cores. Mine has two ARM cores, two DSPs plus the GPU. One ARM and one DSP are dedicated to the radio to keep that side a silo but the processors can talk to each other and share one block of RAM. So there isn't a technical reason an OS couldn't be written for that old hardware that unified all five processing elements and dispatched executable code between the ARM cores as needed. It just wouldn't be a phone anymore.
The 'innovation' here is deciding to encourage the rediscovery of asymmetrical multiprocessing and to relearn and update the ways to deal with the problems it brings. There was a reason everyone moved to SMP: it made the software a lot easier. Now power consumption drives everything and software is already very complex, so the balance shifts.
Then they will stick in an even slower CPU for the radio in chips destined for phone use and wall it off in the bootloader just like now. It is the only way to keep the lawyers (and the FCC) happy.
Posted Feb 15, 2012 20:51 UTC (Wed)
by phip (guest, #1715)
[Link] (6 responses)
In a low-end multicore SOC with noncoherent CPUs, the DRAM is usually statically partitioned between the CPUs, which run independent OS images. Any interprocessor communication is done through shared buffers that are uncached or else carefully flushed at the right times (similar to DMA).
-Phil
Posted Feb 15, 2012 21:55 UTC (Wed)
by zlynx (guest, #2285)
[Link] (5 responses)
For example, the OS could force a cache flush on both CPUs when migrating a task.
Threaded applications would have a tough time, but even that has been handled in the past. For example, the compiler and/or threading library could either track memory writes between locks so as to make sure those cache addresses are flushed, or it could use the big hammer of a full cache flush before unlock.
Cache coherency is really just a crutch. Lots of embedded programmers and Cell programmers (no cache protocol on the SPE's) know how to work without it. :-)
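The "big hammer" described above (flush before unlock) can be sketched as a toy model. This is only an illustration under stated assumptions: the `Cpu` and `SoftLock` classes are hypothetical, DRAM and caches are plain dictionaries, and the lock itself is an ordinary Python object standing in for whatever uncached or hardware mechanism a real non-coherent system would need.

```python
class Cpu:
    """One core with a private write-back cache and no hardware coherency."""

    def __init__(self, memory):
        self.memory = memory   # shared DRAM, modelled as a dict (assumption)
        self.cache = {}        # address -> cached value
        self.dirty = set()     # addresses written but not yet written back

    def read(self, addr):
        # A miss fills the cache from DRAM; a hit may return stale data.
        if addr not in self.cache:
            self.cache[addr] = self.memory.get(addr, 0)
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def flush(self):
        """Write every dirty line back to DRAM (done before unlock)."""
        for addr in self.dirty:
            self.memory[addr] = self.cache[addr]
        self.dirty.clear()

    def invalidate(self):
        """Drop clean lines so later reads refetch from DRAM (done after lock)."""
        self.cache = {a: v for a, v in self.cache.items() if a in self.dirty}


class SoftLock:
    """Hypothetical lock whose acquire/release also manage the caches."""

    def __init__(self):
        self.owner = None

    def acquire(self, cpu):
        assert self.owner is None, "lock already held"
        self.owner = cpu
        cpu.invalidate()

    def release(self, cpu):
        assert self.owner is cpu, "released by non-owner"
        cpu.flush()
        self.owner = None


memory = {}
a, b = Cpu(memory), Cpu(memory)
lock = SoftLock()

b.read(0x10)                 # b now holds a cached copy of the old value

lock.acquire(a)
a.write(0x10, 42)
lock.release(a)              # flush makes the write visible in DRAM

stale = b.read(0x10)         # read outside the lock: still the old cached copy

lock.acquire(b)
fresh = b.read(0x10)         # invalidate on acquire forces a refetch
lock.release(b)
```

The read outside the lock shows why every shared access has to go through the lock in this scheme: nothing else ever refreshes the other core's cache.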
Posted Feb 15, 2012 22:31 UTC (Wed)
by phip (guest, #1715)
[Link] (2 responses)
It's not just migrating processes & threads - any global operating
If you want to use multiple noncoherent cores to run a general-purpose,
Posted Feb 15, 2012 22:35 UTC (Wed)
by phip (guest, #1715)
[Link] (1 responses)
Having a different instruction set on the different processor
The usual programming model is to run a general-purpose OS on the
Posted Feb 15, 2012 23:02 UTC (Wed)
by zlynx (guest, #2285)
[Link]
No one runs a MP system of different instruction sets, true. It isn't impossible though. The OS would need to be built twice, once for each instruction set. It could share the data structures.
I wonder if Intel ever played around with this for an Itanium server? It seems that I remember Itanium once shared the same socket layout as a Xeon so this would have been possible on a hardware level.
Now, if you got very tricky and decided to require that all software be in an intermediate language, the OS could use LLVM, Java, .NET or whatever to compile the binary to whichever CPU was going to execute the process. That would really make core switching expensive! And you'd need some way to mark where task switches were allowed to happen, and maybe run the task up to the next switch point so you could change over to the same equivalent point in the other architecture.
A bit more realistic would be cores with the same base instruction set, but specialized functions on different cores. That could work really well. When the program hit an "Illegal Instruction" fault, the OS could inspect the instruction and find one of the system cores that supports it, then migrate the task there or emulate the instruction. Or launch a user-space helper to download a new core design off the internet and program the FPGA! That would let programmers use specialized vector, GPU, audio or regular expression instructions without worrying about what cores to use.
Posted Feb 19, 2012 18:06 UTC (Sun)
by alison (subscriber, #63752)
[Link] (1 responses)
"it could use the big hammer of a full cache flush before unlock.
Full cache flush == embedded "Big Kernel Lock" equivalent?
-- Alison Chaiken, alchaiken@gmail.com
Posted Feb 19, 2012 21:51 UTC (Sun)
by phip (guest, #1715)
[Link]
Hmm, that brings up another point I should have thought of earlier:
Posted Feb 15, 2012 17:27 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Feb 23, 2012 19:41 UTC (Thu)
by wmf (guest, #33791)
[Link]
Posted Feb 15, 2012 15:50 UTC (Wed)
by Felix.Braun (guest, #3032)
[Link] (1 responses)
A software approach seems more flexible to me, which is a good thing considering the complexity of the issue. This seems impossible to get right on the first try.
Posted Feb 15, 2012 15:55 UTC (Wed)
by Felix.Braun (guest, #3032)
[Link]
Posted Feb 16, 2012 14:46 UTC (Thu)
by yaap (subscriber, #71398)
[Link] (17 responses)
In Nvidia's approach, you have 1 slow Cortex A9 optimized for low power, and 4 fast Cortex A9 optimized for performance. The slow one will run at 500 MHz while the fast ones could be 1 GHz or more. They use a mixed 40nm LP/G process, so I would assume that the slow A9 is LP (low power) while the others are in G (generic/high performance), but that's just a guess. Anyway, there are other ways to optimize for speed vs. low power.
With big.LITTLE, for now you have as many A7 as A15 cores. The current approach switches between all A7 or all A15. But "all" is misleading, in that not all cores may be needed of course.
The Nvidia approach applied to big.LITTLE would be a single A7 and 4 A15. But it doesn't seem to be in the plans yet. And I guess that as ARM gets more money the more cores are used, they don't have much incentive to get there ;) Still, as far as I know, nothing would prevent a licensee from doing that on its own.
Posted Feb 16, 2012 19:51 UTC (Thu)
by jzbiciak (guest, #5246)
[Link] (16 responses)
So, you could set up "migration pairs", pairing each A15 CPU with an A7 CPU, and only keep one on at a time, migrating back and forth as needed. If all CPUs in a cluster are off, then you can shut off the cluster as a whole.
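The "migration pairs" idea above can be modelled in a few lines. This is a rough sketch, not ARM's switcher: `MigrationPair` and `cluster_powered` are hypothetical names, and the state save/restore that a real switch would need is reduced to a comment.

```python
from dataclasses import dataclass


@dataclass
class MigrationPair:
    """One logical CPU backed by an A7 and an A15, with at most one powered."""
    active: str = "off"   # "off", "A7" or "A15"

    def switch(self, target):
        if target not in ("off", "A7", "A15"):
            raise ValueError(f"unknown target: {target}")
        # A real switcher would save state on the outgoing core and
        # restore it on the incoming one before powering the old core down.
        self.active = target


def cluster_powered(pairs, kind):
    """A cluster must stay on only while at least one of its cores is active."""
    return any(pair.active == kind for pair in pairs)


pairs = [MigrationPair() for _ in range(4)]
pairs[0].switch("A7")     # light load on logical CPU 0
pairs[1].switch("A15")    # burst of work on logical CPU 1
pairs[1].switch("A7")     # burst over, drop back to the small core

a7_cluster_on = cluster_powered(pairs, "A7")
a15_cluster_on = cluster_powered(pairs, "A15")
```

Once the last pair leaves its A15, `cluster_powered(pairs, "A15")` is false and the whole A15 cluster could be shut off.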
Posted Feb 16, 2012 19:57 UTC (Thu)
by dlang (guest, #313)
[Link] (15 responses)
Posted Feb 16, 2012 21:21 UTC (Thu)
by jzbiciak (guest, #5246)
[Link] (14 responses)
The current scheduler does not really understand heterogeneous CPU mixes. But, you can still do better than ARM's hypervisor-based switcher by pushing the logic into the kernel, and doing this pair-wise switcheroo. Then, the existing Linux infrastructure just models it as a frequency change within its existing governor framework.
Longer term, of course, you do want to teach Linux how to load balance properly between disparate cores, so at the limit, you could have all 2*N CPUs active on a machine with an N-CPU A15 cluster next to an N-CPU A7 cluster. But in the short run, pulling ARM's BSD-licensed switching logic to do the pair-wise switch at least gets something running that works better than leaving it to the hypervisor. The short run use model really does appear to be to have N CPUs powered on when you have an N CPU A15 cluster next to an N CPU A7 cluster.
Another thought occurs to me: Even if Linux were able to schedule for all 8 CPUs, the possibility exists that the environment around it doesn't really support that. It might not have the heat dissipation or current carrying capacity to have all 2*N CPUs active. So, that's another reason the pair-wise switch is interesting to consider.
Posted Feb 16, 2012 21:23 UTC (Thu)
by jzbiciak (guest, #5246)
[Link]
Posted Feb 16, 2012 21:35 UTC (Thu)
by dlang (guest, #313)
[Link] (11 responses)
My point is that we already have these 'problems' on mainstream systems.
On Intel and AMD x86 CPUs we already have cases where turning off some cores will let you run other cores at a higher clock speed (thermal/current limitations), and where you can run some cores at lower speeds than others
These may not be as big a variation as the ARM case has, but it's a matter of degree, not a matter of being a completely different kind of problem.
I agree that the current scheduler doesn't have explicit code to deal with this today, but I think that for the most part the existing code will 'just work' without modification. The rebalancing code pulls work off of a core if the core is too heavily loaded. a slower core will be more heavily loaded for the same work than a faster core would be, so work will naturally be pulled off of a heavily loaded slow core.
The two points where the current scheduler doesn't do the right thing are in the rebalancing code when considering moving processes between cores of different speeds (but only if you have CPU hog processes that will max out a core). As I note above, fixing this doesn't seem like that big a change, definitely less intrusive than the NUMA aware parts.
let userspace figure out all the nuances of what combinations of clock speeds on the different cores will work in the current environment (if it's limited by thermal factors, then different cooling, ambient temperatures, etc will change these limits, you really don't want that code in the kernel)
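The claim above, that rebalancing naturally pulls work off a heavily loaded slow core, can be illustrated with a toy balancer. This is not the Linux scheduler's algorithm; the `capacity` numbers, the task demands and the `rebalance` function are all made up for the sketch.

```python
def utilisation(core):
    """Fraction of the core's capacity consumed by its tasks."""
    return sum(core["tasks"]) / core["capacity"]


def rebalance(cores, max_rounds=100):
    """Repeatedly let the least loaded core pull one task from the busiest."""
    for _ in range(max_rounds):
        busiest = max(cores, key=utilisation)
        idlest = min(cores, key=utilisation)
        if busiest is idlest or not busiest["tasks"]:
            break
        task = min(busiest["tasks"])
        # Stop if the move would leave the destination at least as loaded
        # as the source currently is; moving would not help.
        load_after = (sum(idlest["tasks"]) + task) / idlest["capacity"]
        if load_after >= utilisation(busiest):
            break
        busiest["tasks"].remove(task)
        idlest["tasks"].append(task)
    return cores


# Same three tasks, expressed as demand relative to the slow core's capacity.
slow = {"name": "slow", "capacity": 1.0, "tasks": [0.3, 0.3, 0.3]}
fast = {"name": "fast", "capacity": 2.5, "tasks": []}

rebalance([slow, fast])
```

With these numbers the fast core ends up with two of the three tasks without the balancer knowing anything about core types. The case the comment flags as a problem, a single task whose demand exceeds the slow core's capacity, is exactly what this utilisation-only view cannot see.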
Posted Feb 17, 2012 16:40 UTC (Fri)
by BenHutchings (subscriber, #37955)
[Link] (10 responses)
The Linux scheduler already generically supports the grouping of CPU threads that share resources, and tends to spread runnable tasks out across groups (although it can also be configured to concentrate them in order to save power). I think that this should result in enabling the 'turbo' mode where possible.
Posted Feb 17, 2012 19:05 UTC (Fri)
by dlang (guest, #313)
[Link] (9 responses)
Part of the decision process will also need to be to consider what programs are running, and how likely these programs are to need significantly more CPU than they are currently using (because switching modes takes a significant amount of time). This involves a lot of policy, and a lot of usage profiling, exactly the types of things that do not belong in the kernel.
With the exception of the case where there is a single thread using all of a core, I think the existing kernel scheduler will 'just work' on a system with different speed cores.
Where I expect the current scheduler to have problems is in the case where a single thread will max out a CPU, I don't think that the scheduler will be able to realize that it would max one CPU, but not max another one
Posted Feb 17, 2012 19:24 UTC (Fri)
by BenHutchings (subscriber, #37955)
[Link] (8 responses)
Posted Feb 17, 2012 19:38 UTC (Fri)
by dlang (guest, #313)
[Link] (7 responses)
I view this as a three tier system
At the first tier you have the scheduler on each core making decisions about what process that is assigned to that core should run next
At the second tier you have a periodic rebalancing algorithm that considers moving jobs from one core to another. preferably run by a core that has idle time (that core 'pulls' work from other cores)
These two tiers will handle cores of different speeds without a problem as-is, as long as no thread maxes out the slowest core.
I am saying that the third tier would be the userspace power management daemon, which operates completely asynchronously to the kernel, it watches the overall system and makes decisions on when to change the CPU configuration. When it decides to do so, it would send a message to the kernel to make the change (change the speed of any core, including power it off or on)
until the userspace issues the order to make the change, the kernel scheduler just works with what it has, no interaction with userspace needed.
Posted Feb 17, 2012 20:26 UTC (Fri)
by BenHutchings (subscriber, #37955)
[Link] (6 responses)
Posted Feb 17, 2012 22:34 UTC (Fri)
by dlang (guest, #313)
[Link] (5 responses)
doing the analysis every second (or even less frequently) should be pretty good in most cases
the kernel can do some fairly trivial choices, but they are limited to something along the lines of
here is a list of power modes, if you think you are being idle too much, move down the list, if you think you are not being idle enough move up the list
but anything more complicated than this will quickly get out of control
for example,
for the sake of argument, say that 'turbo mode' is defined as:
turn off half your cores and run the other half 50% faster, using the same amount of power. (losing 25% of its processing power, probably more due to memory pipeline stalls)
how would the kernel ever decide when it's appropriate to sacrifice so much of its processing power for no power savings?
I could say that I would want to do so if a single thread is using 100% of a cpu in a non-turbo mode.
but what if making that switch would result in all the 'turbo mode' cores being maxed out? it still may be faster to run overloaded for a short time to finish the cpuhog task faster.
I don't see any way that this sort of logic can possibly belong in the kernel. And it's also stuff that's not very timing sensitive (if delaying a second to make the decision results in the process finishing first, it was probably not that important a decision to make, for example)
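The "fairly trivial" in-kernel choice described above, an ordered list of power modes that is walked up or down based on idleness, might look like the following sketch. The mode names and the two thresholds are invented for illustration; nothing here reflects an actual cpufreq governor.

```python
# Hypothetical ordered list, from lowest to highest power.
POWER_MODES = ["A7 low clock", "A7 high clock", "A15 low clock", "A15 high clock"]


def next_mode(index, idle_fraction, too_idle=0.6, too_busy=0.1):
    """Return the index of the mode to use for the next interval.

    idle_fraction is the share of the last interval spent idle (0.0 to 1.0).
    The thresholds are assumptions chosen only to make the example concrete.
    """
    if idle_fraction > too_idle and index > 0:
        return index - 1          # idle too much: move down the list
    if idle_fraction < too_busy and index < len(POWER_MODES) - 1:
        return index + 1          # not idle enough: move up the list
    return index                  # otherwise stay put


mode = 1
mode = next_mode(mode, idle_fraction=0.05)   # busy interval
busy_choice = POWER_MODES[mode]
mode = next_mode(mode, idle_fraction=0.9)    # quiet interval
quiet_choice = POWER_MODES[mode]
```

Anything like the turbo-mode question raised above does not fit this shape, because it needs to know what the workload is trying to achieve rather than just how idle the cores were.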
Posted Feb 17, 2012 23:25 UTC (Fri)
by khim (subscriber, #9252)
[Link] (4 responses)
Why do you think so?
What are you talking about? This whole discussion seems to come from a different universe. Perhaps it's the well-discussed phenomenon where an important requirement that was not at all obvious to one party is so obvious to the other that they didn't think to state it.
We are discussing all this in the context of big.LITTLE processing, right? Which is used by things like tablets and mobile phones, right? Well, the big question here is: do I need to unfreeze and use the hot and powerful Cortex-A15 core to perform some UI task, or will the slim and cool Cortex-A7 be enough to finish it? The cut-off is dictated by physiology: the task should be finished in less than 70-100ms if it's a reaction to user input, or in 16ms if it's part of an animation. This means that the decision whether to wake up the Cortex-A15 must be taken in 1-2ms, tops. Better to do it in about 300-500µs. Any solution which alters the power config once per second is so, so, SO beyond the event horizon it's not even funny.
Wrong criteria. If the Cortex-A7 core can calculate the next frame in 10ms then there is no need to wake up the Cortex-A15 core, even if the Cortex-A7 is 100% busy for those 10ms.
The problems here are numerous and indeed quite time-critical. The only model which makes sense is an in-kernel daemon which actually does the work quickly and efficiently, but uses information collected by a userspace daemon.
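The deadline criterion argued for above can be written down directly. This is a minimal sketch: `pick_core` is a hypothetical helper, and the estimate of how long the work takes on the A7 is assumed to come from somewhere else (in the model above, from information collected by a userspace daemon).

```python
FRAME_DEADLINE_MS = 16.0   # animation frame budget quoted in the comment


def pick_core(estimated_ms_on_a7, deadline_ms=FRAME_DEADLINE_MS):
    """Choose a core by deadline, not by how busy the small core would be."""
    if estimated_ms_on_a7 <= deadline_ms:
        return "A7"    # finishes in time, even if 100% busy while doing so
    return "A15"       # would miss the deadline on the small core


frame_in_10ms = pick_core(10.0)
frame_in_20ms = pick_core(20.0)
```

A utilisation-based rule would wake the A15 in the first case as well, which is the "wrong criteria" objection.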
Posted Feb 18, 2012 0:47 UTC (Sat)
by dlang (guest, #313)
[Link] (3 responses)
waking from some sleep modes may take 10ms, so if you have deadlines like that you better not put the processor to sleep in the first place.
I also think that a delay at the start of an app is forgivable, so if the system needs the faster cores to render things, it should find out quickly, start up those cores, and continue.
I agree that if you can specify an ordered list of configurations and hand that to the kernel you may be able to have the kernel use that.
on the other hand, the example that you give:
> Wrong criteria. If the Cortex-A7 core can calculate the next frame in 10ms then there is no need to wake up the Cortex-A15 core, even if the Cortex-A7 is 100% busy for those 10ms.
sort of proves my point. how can the kernel know that the application completed its work if it's busy 100% of the time? (especially if you have an algorithm that will adapt to not having quite enough processor and will auto-scale its quality)
this sort of thing requires knowledge that the kernel does not have.
Also, the example of the 'turbo mode' where you can run some cores faster, but at the expense of not having the thermal headroom to run as many cores. In every case I am aware of, 'turbo mode' actually reduces the clock cycles available overall (and makes the cpu:memory speed ratio worse, costing more performance), but if you have a single threaded process that will finish faster in turbo mode, it may be the right thing to do to switch to that mode.
it doesn't matter if you are a 12 core Intel x86 monster, or a much smaller ARM chip.
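The turbo-mode trade-off can be checked with the numbers used earlier in the thread "for the sake of argument" (half the cores off, the rest 50% faster). The function below is only that arithmetic; it ignores the memory-stall penalty mentioned above, and its name and parameters are made up.

```python
def turbo_tradeoff(cores, turbo_fraction=0.5, turbo_speedup=1.5):
    """Compare turbo mode with normal mode, in units of one normal core.

    Returns (relative total throughput, relative single-thread run time).
    """
    normal_throughput = cores * 1.0
    turbo_throughput = cores * turbo_fraction * turbo_speedup
    single_thread_time = 1.0 / turbo_speedup
    return turbo_throughput / normal_throughput, single_thread_time


throughput_ratio, thread_time = turbo_tradeoff(cores=8)
```

For 8 cores this gives 75% of the aggregate throughput, while a single CPU-bound thread finishes in about two thirds of the time, which is why neither mode is better in general.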
Posted Feb 18, 2012 11:09 UTC (Sat)
by khim (subscriber, #9252)
[Link] (2 responses)
Well, sure. But the differences between interactive tasks and batch processing modes are acute. With batch processing you are optimizing time for the [relatively] long process. With interactive tasks you optimize work in your tiny 16ms timeslice. It makes no sense to produce the result in 5ms (and pay for it) but if you spend 20ms then you are screwed. Today the difference is not so acute because the most power-hungry part of the smartphone or tablet is the LCD/OLED display. But if/when technologies like Mirasol are adopted these decisions will suddenly start producing huge differences in battery life.
Posted Feb 18, 2012 11:58 UTC (Sat)
by dlang (guest, #313)
[Link] (1 responses)
i don't think we are ever going to get away from having to make the choice between keeping things powered up to be able to be responsive, and powering things down aggressively to save power.
Posted Feb 20, 2012 12:17 UTC (Mon)
by khim (subscriber, #9252)
[Link]
Hardware is in labs already (and should reach the market in a few years), so it's time to think about the software side. If we are talking about small tweaks then such hardware is not yet on the radar, but if we plan to create a whole new subsystem (a task which will itself need two or three years to mature) then it must be considered.
Posted Feb 23, 2012 1:35 UTC (Thu)
by scientes (guest, #83068)
[Link]
[2] https://en.wikipedia.org/wiki/Cell_%28microprocessor%29
Anybody who programmed a CDC Cyber 6600 thought, wouldn't it be great if we could make a system where the little processors ran the same instruction set as the big processors?
> second CPU is treated as more like a peripheral.
that nobody does it for any mainstream OS.
system data structures need to be synchronized with cache flushes,
memory ordering barriers, mutexes, etc. before and after each access.
the best approach is to treat it as a cluster (with each CPU running
its own OS image).
Synergistic Processors (or on a GPU for that matter).
cores moves the complexity to another level above noncoherence.
PowerPC processor(s), and treat the SPEs as a type of coprocessor
or peripheral device.
Cache coherency is really just a crutch. Lots of embedded programmers and Cell programmers (no cache protocol on the SPE's) know how to work without it. :-)"
Non-coherent multi-CPU SOCs are also likely to not implement
atomic memory access primitives (i.e. Compare/Exchange, Test and Set,
Load-Linked/Store-Conditional, etc.)
It's very common for ARM systems to have disparate processors. My GP2X handheld is a dual-core system. It has a regular ARM processor and then a second processor for accelerating some types of multimedia functions.
This goes right back to the ARM's prehistory. The BBC microcomputer's 'Tube' coprocessor interface springs to mind, allowing you to plug in coprocessors with arbitrary buses, interfacing them to the host machine via a set of FIFOs. People tended to call the coprocessor 'the Tube' as well, which was a bit confusing given how variable the CPUs were that you could plug in there.
ARM big.LITTLE vs. nvidia Tegra 3 Kal-El
The switch is between the single "slow" A9 and the other(s). If the slow one is not enough, the load is switched to a fast A9. Then, when on the fast side (with the slow one disabled), one or several cores may be enabled depending on the workload.
ARM's approach may later also support a finer-grained scheme where you have pairs of A7/A15, and each pair can switch (or be disabled) independently. Nicer of course. But still as many A7 as A15 cores.
On Intel and AMD x86 CPUs we already have cases where turning off some cores will let you run other cores at a higher clock speed (thermal/current limitations), and where you can run some cores at lower speeds than others.
I don't see what's so performance critical about this, you shouldn't be making significant power config changes to your system hundreds of times a second,
And it's also stuff that's not very timing sensitive (if delaying a second to make the decision results in the process finishing first, it was probably not that important a decision to make, for example)
I could say that I would want to do so if a single thread is using 100% of a cpu in a non-turbo mode.
it doesn't matter if you are a 12 core Intel x86 monster, or a much smaller ARM chip.
implies that the "big.LITTLE MP Use Model", or running all the cores at once,
is supported. Of course this doesn't mean that implementers who are only testing for the "Task Migration Use Model" will implement it adequately for a use model they are not using.