The burstable CFS bandwidth controller
The bandwidth controller only applies to "completely fair scheduling" (CFS) tasks (otherwise known as "normal processes"); the CPU usage of realtime tasks is handled by other means. This controller provides two parameters to manage the limits applied to any given control group:
- cpu.cfs_quota_us is the amount of CPU time (in microseconds) available to the group during each accounting period.
- cpu.cfs_period_us is the length of the accounting period, also in microseconds.
Thus, for example, setting cpu.cfs_quota_us to 50000 and cpu.cfs_period_us to 100000 will enable the group to consume 50ms of CPU time in every 100ms period. Halving those values (setting cpu.cfs_quota_us to 25000 and cpu.cfs_period_us to 50000) allows 25ms of CPU time every 50ms. In both cases, the group has been empowered to consume 50% of one CPU, but in the latter case that time will come more frequently, in smaller chunks.
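To make that concrete, here is a minimal sketch in C of setting those limits by writing to the controller files; the /sys/fs/cgroup/cpu mount point and the group name "example" are assumptions, and the group must already exist:

```c
#include <stdio.h>
#include <stdlib.h>

/* Write one value to a control file of a hypothetical "example" group,
 * assuming a cgroup-v1 "cpu" controller mounted at /sys/fs/cgroup/cpu. */
static int write_cgroup_value(const char *file, long value)
{
    char path[256];
    FILE *fp;

    snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/example/%s", file);
    fp = fopen(path, "w");
    if (!fp)
        return -1;
    fprintf(fp, "%ld\n", value);
    return fclose(fp);
}

int main(void)
{
    /* 50ms of CPU time in every 100ms period: 50% of one CPU. */
    if (write_cgroup_value("cpu.cfs_period_us", 100000) ||
        write_cgroup_value("cpu.cfs_quota_us", 50000)) {
        perror("writing cgroup files");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
```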
The distinction between those two cases is important here. Imagine a control group containing a single process that needs to run for 30ms. In the first case, 30ms is less than the allowed 50ms, so the process will be able to complete its task without being throttled. In the second case, the process will be cut off after running for 25ms; it will then have to wait for the next 50ms period to start before it can finish its job. If the workload is sensitive to latency, the bandwidth-controller parameters need to be set with care.
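That arithmetic can be made explicit. The sketch below (with a hypothetical completion_ms() helper, and assuming the task becomes runnable at the start of a period and nothing else consumes the quota) computes when the 30ms job finishes under each configuration:

```c
#include <stdio.h>

/* Completion time (in ms) of a job needing 'job' ms of CPU time,
 * given 'quota' ms of runtime per 'period' ms, assuming the job
 * starts at a period boundary and is the quota's only consumer. */
static long completion_ms(long job, long quota, long period)
{
    long full = job / quota;   /* periods whose quota is fully used */
    long rem = job % quota;    /* work left over after those periods */

    if (rem == 0)
        return (full - 1) * period + quota;
    return full * period + rem;
}

int main(void)
{
    /* 30ms job: unthrottled with the 50ms/100ms setting... */
    printf("50ms/100ms: %ld ms\n", completion_ms(30, 50, 100)); /* 30 */
    /* ...but throttled once with the 25ms/50ms setting. */
    printf("25ms/50ms:  %ld ms\n", completion_ms(30, 25, 50));  /* 55 */
    return 0;
}
```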
This mechanism works reasonably well for workloads that consistently require a specific amount of CPU time. It can be a bit more awkward, though, for bursty workloads. A given process may use far less than its quota during most periods, but occasionally a burst of work may come along that requires more CPU time than the quota allows. In cases where latency doesn't matter, making that process wait for the next period to finish its work may not be a problem; if latency does matter, though, this delay can be a real concern.
There are ways to try to work around this issue. One, of course, is to just give the process in question a quota that is large enough to handle the workload bursts, but doing that will enable the process to consume more CPU time overall. System administrators may not like that result, especially if there is money involved and only so much time is actually being paid for. The alternative would be to increase both the quota and the period, but that, too, can increase latency if the process ends up waiting for the next period anyway.
Chang's patch set enables a different approach: allow control groups to carry over some of their unused quota from one period to the next. A new parameter, cpu.cfs_burst_us, sets the maximum amount of time that can be accumulated that way. As an example, let's return to the group with a quota of 25ms and a period of 50ms. If cpu.cfs_burst_us is set to 40000 (40ms), then processes in that group can run for up to 40ms in a given period, but only if they have carried over the 15ms beyond their normal quota from previous periods. This allows the group to respond to a burst of work while still keeping it within the quota in the longer term.
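Configuring the group in that example might then look like the following sketch. It assumes the same hypothetical cgroup-v1 mount point and group as before; cpu.cfs_burst_us only exists with the patch set applied, so the interface could still change before merging:

```c
#include <stdio.h>

/* Hypothetical helper: write one value to a control file of the
 * "example" group under a cgroup-v1 "cpu" controller. */
static void set_limit(const char *file, long us)
{
    char path[256];
    FILE *fp;

    snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/example/%s", file);
    fp = fopen(path, "w");
    if (fp) {
        fprintf(fp, "%ld\n", us);
        fclose(fp);
    }
}

int main(void)
{
    set_limit("cpu.cfs_period_us", 50000); /* 50ms accounting period     */
    set_limit("cpu.cfs_quota_us",  25000); /* 25ms quota per period      */
    set_limit("cpu.cfs_burst_us",  40000); /* accumulation capped at 40ms */
    return 0;
}
```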
Another way of looking at this situation is that, when cpu.cfs_burst_us is in use, the quota is interpreted differently than before. Rather than being an absolute limit, the quota is an amount of CPU time that is deposited into the group's CPU-time account every period, with the burst value capping the value of that account. Bursty groups can save up a limited amount of CPU time in that account for when they need it.
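That accounting behaves like a token bucket, and the arithmetic can be sketched in a few lines. The simulation below is not the kernel's code, just a model of the behavior described above, using the 25ms-quota, 50ms-period, 40ms-burst example and a made-up demand pattern:

```c
#include <stdio.h>

#define QUOTA 25 /* ms deposited into the account each period */
#define BURST 40 /* ms cap on the accumulated account */

int main(void)
{
    long account = QUOTA;
    /* Hypothetical CPU demand, in ms, for consecutive 50ms periods. */
    long demand[] = { 5, 5, 40, 25, 30, 10 };
    int i, n = sizeof(demand) / sizeof(demand[0]);

    for (i = 0; i < n; i++) {
        long run = demand[i] < account ? demand[i] : account;

        printf("period %d: demand %2ldms, ran %2ldms%s\n",
               i, demand[i], run,
               run < demand[i] ? " (throttled)" : "");
        /* Deposit the next period's quota, capped at the burst value. */
        account = account - run + QUOTA;
        if (account > BURST)
            account = BURST;
    }
    return 0;
}
```

Running this shows the 40ms spike in the third period being covered out of time saved in the earlier, quiet periods; the later 30ms demand, arriving with an empty account, is cut off at the 25ms quota.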
By default, cpu.cfs_burst_us is zero, which disables the burst mechanism and preserves the previous behavior. There is a sysctl knob that can be used to disable burst usage across the entire system. Another knob (sysctl_sched_cfs_bw_burst_onset_percent) causes the controller to give each group a given percentage of their burst quota at the beginning of each period, regardless of whether that time has been accumulated in previous periods.
The patch set comes with some benchmark results showing order-of-magnitude reductions in worst-case latencies when the burstable controller is in use. This idea has been seen on the lists a few times at this point, both in its current form and as separate implementations by Cong Wang and Konstantin Khlebnikov. It looks as if the biggest roadblocks have been overcome, so this change could find its way into the mainline as soon as the 5.13 merge window.
Have you looked at how network QoS does it?
Posted Feb 9, 2021 5:25 UTC (Tue) by ras (subscriber, #33059) [Link]
CBQ and HTB are queuing disciplines that attempt to solve the problem of sharing a single link among protocols needing guaranteed low latency but generating little traffic (eg, VoIP and interactive uses like ssh), tasks that need a responsive link with uneven loads (eg, http), and those that just need heaps of "low grade" bandwidth (eg, email). They use ad hoc techniques like the ones described here, and mostly work on a good day - but sometimes don't. And they are computationally expensive.
HFSC came later and solves the same problem. It has a rigorous mathematical analysis behind it, delivers perfect results, and is computationally inexpensive. The key turns out to be how you pose the problem. Doing that in a way that allows you to come up with a robust solution is non-obvious, or at least I found it non-obvious. Interestingly, like the proposed solution here, HFSC must also take into account what bandwidth was used and what went unused in the past to determine what can be used in the future.
Unfortunately CPU scheduling and QoS are only similar, not identical. QoS has the luxury of the application breaking the work it presents to the QoS scheduler into bite-sized pieces - packets. The QoS problem reduces to deciding which of these packets to send next, and when. In CPU scheduling you have a number of tasks lining up to use the CPU that will run for an unknown amount of time. The problem reduces to "how long can I let this task run before I interrupt it?". Nonetheless, I suspect they have one thing in common: in order to do their jobs well, the mathematical model behind them must be perfect.