Thanks for the great article. I was at the summit too, my first event of this sort as I am new to this world, and it was a great experience. I'd like to pick up on a few points though:
>>”First, if a LITTLE CPU is able to provide sufficient performance, it provides better energy efficiency, at least in cases where race to idle is inappropriate.”
I found this slightly misleading. Static leakage is also lower in the small cluster than in the big one, so even in race-to-idle cases, if the LITTLE CPUs can supply adequate performance you'd in theory be better off using them. The balance here depends on the states of the other devices active during the use case, and on the power states that can be entered. If the extra speed of the bigs means that racing to idle lets you turn lots of other things off for longer than you could using the LITTLEs (where your idle time would be shorter), then indeed you'd be better off on the big cores. But in situations where you are going to idle for significant periods of time either way (so the LITTLEs can still reach aggressive retention modes), the lower static leakage means you would always be better off on the LITTLEs.
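To make the trade-off concrete, here is a toy energy model. Every power figure and the workload size are invented for illustration, not measurements of any real SoC; the point is only that lower static leakage during idle can let the LITTLE win even when the big core idles for longer:

```python
# Toy model: race to idle on a big core vs. running the same work
# more slowly on a LITTLE core. All numbers are hypothetical.

def energy(active_power, active_time, idle_power, idle_time):
    """Total energy (joules) over one workload period."""
    return active_power * active_time + idle_power * idle_time

PERIOD = 1.0         # seconds per workload period
WORK = 0.2           # seconds of work at the LITTLE core's speed
SPEEDUP = 3.0        # the big core finishes the same work 3x faster

# Assumed platform figures (watts):
BIG_ACTIVE = 2.0     # big core running
LITTLE_ACTIVE = 0.4  # LITTLE core running
BIG_IDLE = 0.05      # static leakage with the big cluster in retention
LITTLE_IDLE = 0.01   # static leakage with only the LITTLE cluster up

big_energy = energy(BIG_ACTIVE, WORK / SPEEDUP,
                    BIG_IDLE, PERIOD - WORK / SPEEDUP)
little_energy = energy(LITTLE_ACTIVE, WORK,
                       LITTLE_IDLE, PERIOD - WORK)

# With these numbers the LITTLE wins despite the big core's longer
# idle period, because both its active and static power are lower.
print(f"big, race to idle:  {big_energy:.3f} J")
print(f"LITTLE, run slower: {little_energy:.3f} J")
```

Of course, if racing to idle on the big lets the rest of the platform power down sooner, the extra savings would have to be added to the big side of the comparison.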
>>”Coordinating user applications based on hotplug events. (However, there is no known embedded or mobile use of this feature, so if you need it, please let us know. Otherwise it will likely go away.)”
Actually I raised this point at the summit. The reason is that I know of at least one system (Qt, and in particular QtConcurrent) which creates thread pools sized as a multiple (in this case 1:1) of the number of logical CPUs available. I believe Qt uses sysconf(_SC_NPROCESSORS_ONLN) when creating the thread pool, which would return the number of processors (or hyperthread sets) currently online. I imagine there are other libraries providing thread-pool abstractions sized from the number of cores, though I admit I have no real knowledge here.
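For anyone wanting to poke at this, the same sysconf values are reachable from Python; _SC_NPROCESSORS_ONLN counts CPUs currently online, while _SC_NPROCESSORS_CONF counts all configured CPUs, so the two diverge once cores are hotplugged out:

```python
import os

# CPUs configured in the system vs. currently online. A thread pool
# sized from the online count at startup goes stale if CPUs are
# later hotplugged in or out.
configured = os.sysconf("SC_NPROCESSORS_CONF")
online = os.sysconf("SC_NPROCESSORS_ONLN")

pool_size = online  # the 1:1 sizing policy described above

print(f"configured={configured} online={online} pool_size={pool_size}")
```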
So what I was actually worrying about was: what happens? And what should happen? I can imagine people in the future coming up with requirements around knowing when the number of CPUs changes so they can scale their thread pools up or down. In theory there is not much justification for notifying applications when scaling down: if performance-monitoring code decides we are under-using the CPU resources and takes CPUs offline, then clearly the thread pool is not stressing the cores much, so who cares. However, if you were bringing more CPUs online, and the thread pool is a big contributor to load and has a 1:1 matching with the number of CPUs, then growing the pool could make sense. Anyway, it's food for thought.
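In the absence of any notification mechanism, an application could only poll. A hypothetical sketch of the grow-only policy argued for above (the class and its resize check are my invention, not an API from Qt or the kernel):

```python
import os
from concurrent.futures import ThreadPoolExecutor

class RescalingPool:
    """Hypothetical sketch: grow the worker pool when more CPUs come
    online, keeping a 1:1 worker-to-CPU ratio. Scaling down is
    deliberately not handled, per the argument that it matters less."""

    def __init__(self):
        self._size = os.sysconf("SC_NPROCESSORS_ONLN")
        self._executor = ThreadPoolExecutor(max_workers=self._size)

    def submit(self, fn, *args):
        online = os.sysconf("SC_NPROCESSORS_ONLN")
        if online > self._size:
            # More CPUs came online: recreate the executor at 1:1.
            old = self._executor
            self._executor = ThreadPoolExecutor(max_workers=online)
            self._size = online
            old.shutdown(wait=False)
        return self._executor.submit(fn, *args)

pool = RescalingPool()
print(pool.submit(lambda x: x * 2, 21).result())  # → 42
```

Polling on every submit is crude; a real mechanism would want an event from the kernel, which is exactly the feature being discussed.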
>>”Wakeups can be delayed so that they do not arrive at the kthread until after the corresponding CPU has gone offline.”
Did I misunderstand, or should that have read "CPU has come online"? I have probably misunderstood, but I thought the idea was to postpone the wakeups until the CPUs for the targeted kthreads are available once again. I profess no knowledge here...
There are some other framings of race to idle vs. DVFS that are very focused on the scheduler. But in reality device usage is going to be a key parameter in making that decision, I think more so than information that can be deduced from the scheduler; you allude to this in Quiz 1. The classic example is MP3 playback. You have a use case where CPU utilisation is low, periodic, and bursty, and which, assuming good-sized buffers, gives you good idle time between bursts; however, because you are using the audio hw/codec hw, you cannot actually enter a very deep low-power mode. So you are better off using DVFS rather than racing to idle. A system that can represent the available modes, given current device usage and latency constraints, would be very useful here.
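The MP3 case can be put into another toy model. All numbers below are invented; the slow operating point is assumed to have slightly worse energy per cycle than the fast one (as happens when voltage cannot scale down much further), which is what makes race to idle attractive when deep idle is reachable, and DVFS attractive when the audio hardware holds the platform at a power floor:

```python
# Toy comparison: race to idle vs. DVFS for periodic decode bursts.
# All figures are hypothetical.

PERIOD = 0.1        # seconds between audio buffer refills
WORK_CYCLES = 2e7   # decode cycles needed per period (assumed)

F_HIGH, P_HIGH = 1.0e9, 1.50  # Hz, W at the fast operating point
F_LOW,  P_LOW  = 2.5e8, 0.45  # Hz, W at a slow point (assumed worse
                              # energy/cycle: voltage can't drop much)

DEEP_IDLE = 0.01    # W if the whole platform can power down
AUDIO_FLOOR = 0.30  # W floor while active audio hw blocks deep idle

def period_energy(freq, active_power, idle_power):
    """Energy over one period: run the burst, then idle until refill."""
    busy = WORK_CYCLES / freq
    return active_power * busy + idle_power * (PERIOD - busy)

# Deep idle available: racing to idle wins with these numbers.
print(f"race, deep idle: {period_energy(F_HIGH, P_HIGH, DEEP_IDLE):.4f} J")
print(f"DVFS, deep idle: {period_energy(F_LOW,  P_LOW,  DEEP_IDLE):.4f} J")

# Audio hw active, deep idle blocked: DVFS wins.
print(f"race, floor:     {period_energy(F_HIGH, P_HIGH, AUDIO_FLOOR):.4f} J")
print(f"DVFS, floor:     {period_energy(F_LOW,  P_LOW,  AUDIO_FLOOR):.4f} J")
```

The crossover is entirely driven by the idle-state power the device usage permits, which is the point: the scheduler alone cannot see it.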