Thanks for the great article. I was at the summit too, my first experience of this sort of thing as I am new to this world, and it was a great experience. I'd like to pick up on a few points though:
>>”First, if a LITTLE CPU is able to provide sufficient performance, it provides better energy efficiency, at least in cases where race to idle is inappropriate.”
I found this slightly misleading. Static leakage is also lower in the small cluster than in the big one, so even in race-to-idle cases, if the LITTLE CPUs can supply adequate performance you’d in theory be better off using them. The balance here depends on the states of the other devices that are active during the use case, and on the power states that can be entered. If the bigs are so much faster that racing to idle lets you turn lots of stuff off for longer than you could on the LITTLEs (where your idle time would be shorter), then indeed you’d be better off on the big cores. But in situations where you are going to idle for significant periods of time anyway (so the LITTLEs can still reach aggressive retention modes), the lower static leakage means you would always be better off on the LITTLEs.
>>”Coordinating user applications based on hotplug events. (However, there is no known embedded or mobile use of this feature, so if you need it, please let us know. Otherwise it will likely go away.)”
Actually I raised this point at the summit. The reason I did is that I know of at least one system (Qt, and in particular QtConcurrent) which creates thread pools as a multiple (in this case 1-1) of the number of logical CPUs available. I believe Qt uses sysconf(_SC_NPROCESSORS_ONLN) when creating the thread pool, and that this returns the number of processors (or hyperthread sets) currently online. I imagine there are other libraries which provide threadpool abstractions again based on the number of cores, but admit I have no real knowledge here.
So what I was actually worrying about was: what happens? And what should happen? I can imagine people in future coming up with requirements around wanting to know when the number of CPUs changes to scale up/down their thread pools. I think in theory it’s not really justifiable to have too much of a communication when scaling down. If performance monitoring code decides we are under using the CPU resources and to scale down, then clearly the threadpool is not using the cores too much so who cares. However if you were bringing more CPUs online and the threadpool is a big contributor to load, and has a 1-1 matching with number of CPUs, then growing the pool could make sense. Anyways it’s food for thought.
>>”Wakeups can be delayed so that they do not arrive at the kthread until after the corresponding CPU has gone offline.”
Did I misunderstand or should that have read “CPU has come online”? I have probably misunderstood, but I thought that the idea was to postpone the wakeups until the CPUs for the targeted kthreads are available once again. I profess no knowledge here...
There are some other notions of race to idle vs DVFS that are very focussed on the schedule. But in reality device usage is going to be a key parameter in making that decision, I think more so than information that can be deduced from the schedule. You allude to this in Quiz1. The classic example is mp3 playback. You have a use case where CPU utilisation is low, periodic and bursty and, assuming good-sized buffers, gives you good idle time between bursts. However, because you are using the audio hw/codec hw, you cannot actually go into a very deep low power mode. So you are better off using DVFS rather than race to idle. A system that can represent the available modes, given current device usage and latency constraints, would be very useful here.
Posted Feb 24, 2012 18:47 UTC (Fri) by heechul (subscriber, #79852)
Just like Qt, Android's Dalvik also uses sysconf(_SC_NPROCESSORS_ONLN) to report the number of available CPUs to applications, which then create their thread pools based on that information. Obviously, applications can suffer a significant performance hit from dynamic hotplugging.
The right solution should be using sysconf(_SC_NPROCESSORS_CONF) which, in theory, should return the total available cpus instead of online ones.
The problem is that libc implementations, at least the Android libc I used, do not distinguish between the two and simply treat them as the same, i.e., they report the number of online CPUs even when _SC_NPROCESSORS_CONF is requested. They should be fixed.
Then the question would be whether the kernel has a standard interface to userspace for reporting the number of online CPUs and the number of available CPUs, so that libc can properly implement both _SC_NPROCESSORS_ONLN and _SC_NPROCESSORS_CONF.
The Linaro Connect scheduler minisummit
Posted Feb 29, 2012 19:45 UTC (Wed) by BenHutchings (subscriber, #37955)
Last time I looked, glibc was using /proc/cpuinfo. I think it should be using /sys/devices/system/cpu/online and /sys/devices/system/cpu/possible, with a fallback to /proc/cpuinfo.
Posted Feb 24, 2012 19:11 UTC (Fri) by PaulMcKenney (subscriber, #9624)
Hello, Charles!
Good point on the choice between big and LITTLE being far more nuanced than I could hope to fully capture in a one-sentence rule of thumb. I do agree that the characteristics of a given device will often need to be taken into account.
Please accept my apologies for losing your comment about thread pools. I agree that having applications base their thread-pool sizes on the number of CPUs physically configured on the device will usually be a good place to start.
I really did mean that wakeups can be delayed until a CPU has gone offline! ;-)
Here is what can happen (or at least did happen to me as of about a year ago): (1) A kthread bound to CPU 0 is awakened. (2) Before the kthread can run, CPU 0 goes offline. (3) The kthread actually tries to start running, and as a result has its binding to the now-offline CPU broken. It is possible to handle this by careful use of preemption disabling and checks to see what CPU the kthread is actually running on.
Posted Mar 1, 2012 15:46 UTC (Thu) by tvld (subscriber, #59052)
> I imagine there are other libraries which provide threadpool abstractions again based on the number of cores, but admit I have no real knowledge here.
Yes, I think that's fairly common. (And not perfect.)
> So what I was actually worrying about was: what happens? And what should happen? I can imagine people in future coming up with requirements around wanting to know when the number of CPUs changes to scale up/down their thread pools.
Thread pools could indeed benefit from adjustments at runtime. The underlying problem is that userspace is making resource-usage decisions. In particular, parallelized programs do that to a larger extent than single-threaded ones because they have to decide how many threads to use. For userspace, it's not clear whether none/few/many threads would yield optimal performance because the lower-level trade-offs (e.g., all that's discussed in the article) aren't known; likewise, the kernel can't find the optimum either because it doesn't know about the program's utility function and because the userspace code that is run for single-thread vs. multi-thread is often different (e.g., due to differences in synchronization) and has different performance characteristics.
> I think in theory it’s not really justifiable to have too much of a communication when scaling down. If performance monitoring code decides we are under using the CPU resources and to scale down, then clearly the threadpool is not using the cores too much so who cares. However if you were bringing more CPUs online and the threadpool is a big contributor to load, and has a 1-1 matching with number of CPUs, then growing the pool could make sense.
I agree that we shouldn't worry about having unused threads lying around. However, if the kernel knows that there will only be one big core not occupied by other processes (and thus available to our program), for example, then the program can benefit from this because it knows it doesn't have to try hard to run stuff in parallel. The parallelization overheads of course vary depending on the parallelization approach that userspace follows and the actual application/workload (e.g., task-based parallelism together with work-stealing are more robust than other approaches in cases of load imbalance).
Overall, I think we should work on getting userspace and kernel to try to optimize this jointly, because IMO this is an end-to-end optimization problem (program utility function down to HW performance characteristics) and because I doubt that we'll find a catch-all tuning policy that's feasible to add to the kernel.