Reworking CFS load balancing

July 11, 2019

The Linux scheduler is made of the main types of scheduling which are the Completely Fair Scheduler (CFS), the realtime (RT), and the more recent deadline scheduler. The CFS class is the default and most commonly used one, which aims at sharing the running time of CPUs between tasks according to their priority. It was introduced in 2007 and has seen several major changes since. One of these major changes was the introduction of per-entity load tracking (PELT), which gives more details about the utilization of CPUs by tasks.

The load-balancing algorithm of the scheduler has the key responsibility of placing tasks on CPUs to optimize the overall throughput of the system. It periodically monitors the system and decides when tasks have to migrate to ensure a fair distribution of compute capacity and an optimal use of resources. But that hasn't really changed to take full advantage of these new metrics and it is still only using the load as the unit to migrate tasks, even when the root cause of an imbalance is not linked to load but to the available compute capacity of CPUs, for example.

In order to quantify how imbalanced the system is, load balancing uses some virtual and somewhat meaningless values like the average load per task. This is often used as a fallback when the load is not the right metric to express the imbalance but "should" be enough to fix the problem. In other cases, it can even bias the statistics to make a group look overloaded when it is not to force task migration between groups. But these hooks are not always enough and some use cases still show some sub-optimal task placement.

As an example, we can have a look at how load balancing handles asymmetries between groups of CPUs and how it sometimes fails to ensure that there is one task per CPU. This asymmetry can come from CPU microarchitecture, the time stolen by other scheduling classes, or because of thermal mitigation that can temporarily decrease the maximum compute capacity of a CPU. Let's take the case of a big.LITTLE platform made of four LITTLE cores and four big cores. If there are only eight tasks running on the system, the scheduler should place one task on each CPU, but it puts five tasks on the cluster made of big cores instead.

If you only look at the load, the decision makes sense because there is much more compute capacity in this group and the average load is balanced between tasks, but it also leaves one LITTLE core idle while two tasks have to share a core. This is one typical case where comparing load doesn't make sense and using the number of idle cores is more appropriate. A few other use cases have been described during the talks to highlight that some information is missing to make better decisions.

Those examples demonstrated that it's probably time for the load-balancing algorithm to be cleaned up and reworked to use new metrics, remove the bypasses that have been inserted over time, and to remove the old heuristics, which have lost their meaning.

With all metrics available in the scheduler, a group of CPUs can be better classified and its imbalance better identified. That's why the first step is to classify the groups of CPUs into categories after collecting statistics:

Groups that have spare capacity to be used by some other tasks.
Groups that are fully busy and all cores are fully used.
Groups that are overloaded and have to share the running time between tasks.

In addition to these generic states, some special cases also have to be identified:

The need to move any task, but at least one to unblock a pinned related imbalance
Move a minimum level of utilization
Move a specific task

Once classified, load balancing can easily select the group it should pull some imbalance to and also what needs to be moved:

Is it load to ensure fair time sharing between tasks when CPUs are overloaded ?
Is it a number of tasks to fill idle CPUs in other groups ?
Is it a dedicated task that doesn't fit into its local CPUs ?

That's what a rework of the load balancing will have to address.

Index entries for this article
Conference	OS-Directed Power-Management Summit/2019