There was an issue within the last few weeks related to the scheduler and the debate over what should be in the scheduler vs. what should be in a userspace daemon (IIRC it was power/sleep related, but I don't remember exactly what).
It seems to me that there are two answers that jump out:
1. policy belongs in userspace, so the question of when to power up the fast cores, when to power them down, etc. belongs in userspace.
2. the kernel scheduler needs the ability to handle CPUs with different performance, be it the big.LITTLE approach, or just a many-core x86 box with some cores running at a reduced clock speed.
For this latter problem, it seems to me that the system shouldn't care about what the speed of an available CPU is, but should instead be balancing on how close to being maxed out each core is. If none of the cores are maxed out, then (except for power management, which we've deferred to userspace) it doesn't matter how fast any of the cores are.
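To make "how close to being maxed out" concrete, here is a rough sketch that derives per-core utilization from two snapshots in the format of /proc/stat, which already exposes per-CPU tick counters on Linux. The function names are my own for illustration, and it assumes each per-core line carries at least the first eight counters (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_cpu_times(stat_text):
    """Map 'cpuN' -> (busy_ticks, total_ticks) from /proc/stat-style text."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep per-core lines such as "cpu0 ..."; skip the aggregate "cpu" line
        # and unrelated lines such as "intr" or "ctxt".
        if not fields or fields[0] == "cpu" or not fields[0].startswith("cpu"):
            continue
        # Assumed layout: user, nice, system, idle, iowait, irq, softirq, steal.
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + values[4]  # count iowait as idle time
        total = sum(values)
        times[fields[0]] = (total - idle, total)
    return times


def utilization(before, after):
    """Fraction of time each core was busy between two snapshots."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before[cpu]
        elapsed = total1 - total0
        result[cpu] = (busy1 - busy0) / elapsed if elapsed else 0.0
    return result
```

On a real system the two snapshots would come from reading /proc/stat twice, some interval apart; nothing in this calculation needs to know how fast any core is.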
The one exception to the "scheduler doesn't need to know the core speeds" rule is when a core _is_ maxed out: then the scheduler needs to know the relative speeds of the different cores to decide whether it should move the process to a "better" core.
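As a sketch of that rule (not how any real scheduler does it), the decision could look something like the following. The Core fields, the 0.95 threshold, and the tie-breaking by spare capacity are all assumptions of mine.

```python
from dataclasses import dataclass


@dataclass
class Core:
    core_id: int
    relative_speed: float  # hypothetical scale: 1.0 = fastest core in the system
    utilization: float     # fraction of time busy, 0.0 to 1.0


MAXED_OUT = 0.95  # assumed threshold, not taken from any real scheduler


def pick_migration_target(cores, source_id):
    """Return the id of a better core for a task on source_id, or None."""
    source = {c.core_id: c for c in cores}[source_id]
    # If the source core is not maxed out, core speed is irrelevant.
    if source.utilization < MAXED_OUT:
        return None
    # Only now do relative speeds matter: look for a faster core with headroom.
    candidates = [
        c for c in cores
        if c.relative_speed > source.relative_speed and c.utilization < MAXED_OUT
    ]
    if not candidates:
        return None
    # Prefer the candidate with the most spare capacity.
    best = max(candidates, key=lambda c: c.relative_speed * (1.0 - c.utilization))
    return best.core_id
```

Note that the speed comparison is only reached after the utilization check, which is the whole point: an idle slow core and an idle fast core are treated the same until something is actually saturated.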
However, the speed of the core isn't the only possible reason to move a process to a different core. In a NUMA system, you may want to move a job to a different core to get it nearer to the memory that it accesses.
The good news (at least as it seems to me) is that this is not something that needs to be in the scheduler hot path. It can live in the periodic rebalancing routine, probably as an extension of NUMA-aware pulling that tinkers with the definition of the "optimal" cpu for a job.
It's definitely not correct to try to schedule interactive tasks on one type of CPU and non-interactive tasks on a different type.
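One way to picture a single definition of the "optimal" cpu that covers both core speed and memory locality is a score that the periodic rebalancer evaluates per candidate. This is purely illustrative: the scoring formula, the tuple layout, and the function names are my own assumptions, and the distance values (10 local, 20 remote) are just example numbers.

```python
def cpu_score(relative_speed, utilization, numa_distance):
    """Higher is better: spare compute capacity, discounted by memory distance."""
    headroom = relative_speed * (1.0 - utilization)
    return headroom / numa_distance


def optimal_cpu(cpus, task_node, distance):
    """Pick the best cpu for a task whose memory lives on task_node.

    cpus:     list of (cpu_id, node, relative_speed, utilization) tuples
    distance: distance[a][b] is the (hypothetical) cost of node a reaching node b
    """
    best_id, best_score = None, -1.0
    for cpu_id, node, speed, util in cpus:
        score = cpu_score(speed, util, distance[node][task_node])
        if score > best_score:
            best_id, best_score = cpu_id, score
    return best_id
```

Because this only runs at rebalance time, it can afford to look at every candidate; the hot path never has to evaluate it.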
In terms of what the API to the userspace daemon needs to be, I can't define details, but to kick off the conversation, I think it needs to allow the following:
1. the userspace daemon needs to be able to tell the kernel to alter the configuration of a particular core (power up/down, change its speed, engage "turbo" mode, etc.). This would include turning off some cores so that others can run at a higher clock speed.
2. For some systems this should probably be something close to an atomic change, so the API probably should allow passing a data structure to the kernel, not just individual setting changes.
3. the userspace daemon needs to be able to see what the existing settings are
4. the userspace daemon needs to be able to gather information about the per-core performance. I think this information is already available today, although there may be reasons to improve the efficiency of gathering it (which would help other performance analysis tools as well).
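To kick the conversation along a bit further, here is a toy model of points 1 through 3: the daemon hands over a whole set of per-core settings and the change is applied all at once or not at all. Nothing here corresponds to an existing kernel interface; CoreSetting, CoreControl, and the turbo_max_online constraint are names and rules I made up to show why a single data structure is handy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSetting:
    online: bool
    freq_khz: int
    turbo: bool = False


class CoreControl:
    """Stand-in for the kernel side of a hypothetical configuration interface."""

    def __init__(self, settings, turbo_max_online):
        self._settings = dict(settings)
        # Assumed hardware rule: turbo is only allowed when at most this many
        # cores are online.
        self._turbo_max_online = turbo_max_online

    def get_settings(self):
        """Point 3: let the daemon see the existing settings (as a copy)."""
        return dict(self._settings)

    def apply(self, changes):
        """Points 1 and 2: validate the whole change set, then commit it at once."""
        unknown = set(changes) - set(self._settings)
        if unknown:
            raise KeyError(f"unknown cores: {sorted(unknown)}")
        proposed = {**self._settings, **changes}
        online = sum(1 for s in proposed.values() if s.online)
        if online == 0:
            raise ValueError("at least one core must stay online")
        wants_turbo = any(s.online and s.turbo for s in proposed.values())
        if wants_turbo and online > self._turbo_max_online:
            raise ValueError("too many cores online for turbo")
        # Nothing is changed unless every check above passed.
        self._settings = proposed
```

With individual setting changes, "turn off core 1" and "put core 0 in turbo" would have to be ordered carefully and could leave the system in an odd intermediate state; passing both in one apply() call sidesteps that.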
The devil is in the details as always, but this doesn't look like a situation where the broad-brush design options are that difficult.