The io.weight I/O-bandwidth controller
There are a number of challenges facing an I/O-bandwidth controller. Some processes may need a guarantee that they will get at least a minimum amount of the available bandwidth to a given device. More commonly in recent times, though, the focus has shifted to latency: a process should be able to count on completing an I/O request within a bounded period of time. The controller should be able to provide those guarantees while still driving the underlying device at something close to its maximum rate. And, of course, hardware varies widely, so the controller must be able to adapt its operation to each specific device.
The earliest I/O-bandwidth controller allows the administrator to set maximum bandwidth limits for each control group. That controller, though, will throttle I/O even if the device is otherwise idle, causing the loss of I/O bandwidth. The more recent io.latency controller is focused on I/O latency, but as Tejun Heo, the author of the new controller, notes in the patch series, this controller really only protects the lowest-latency group, penalizing all others if need be to meet that group's requirements. He set out to create a mechanism that would allow more control over how I/O bandwidth is allocated to groups.
io.weight
The new controller works by assigning a "weight" value to each control group. Consider, for example, a simple hierarchy containing two groups, A and B. If group A is given a weight of 100 for a specific block device and group B has a weight of 300, then B will be allowed to use 75% of the available bandwidth. Absolute weights do not matter; each group's actual portion of the available bandwidth is determined by its weight relative to the sum of all weights at that level in the hierarchy.
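To make the arithmetic concrete, here is a minimal sketch of how weight-proportional shares work out; the structure and names are invented for illustration, not taken from the kernel source:

```c
#include <stdio.h>

/* Illustrative only: weight-proportional bandwidth shares. */
struct iogroup {
	const char *name;
	unsigned int weight;
};

int main(void)
{
	struct iogroup groups[] = { { "A", 100 }, { "B", 300 } };
	int n = sizeof(groups) / sizeof(groups[0]);
	unsigned int total = 0;

	for (int i = 0; i < n; i++)
		total += groups[i].weight;

	/* Each group's share is its weight relative to the sum of all
	 * weights at this level of the hierarchy. */
	for (int i = 0; i < n; i++)
		printf("group %s: %u%% of device bandwidth\n",
		       groups[i].name, 100 * groups[i].weight / total);
	return 0;
}
```

Running this prints 25% for A and 75% for B, matching the example above.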
That leaves open the question of just how the controller determines how much of the device's capacity each group is using. Simply counting I/O operations or total bandwidth turns out to be inadequate, since some requests can be quite a bit more expensive than others. So the new controller uses a "cost model" that tries to better estimate how much of a device's time will be required to satisfy any given request. This model is relatively simple. First, it determines whether a request is sequential or random; in the former case, the operation will complete much more quickly (especially on rotating drives) than in the latter. The operation is given a fixed cost based on this determination, plus an incremental cost for each page to be transferred. The resulting total cost is an estimate of how long it will take for the request to be executed.
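A sketch of that kind of linear model might look like the following; the parameter names, units, and the exact split between sequential and random base costs are assumptions for illustration, not the controller's actual code:

```c
/* Illustrative linear cost model: a base cost determined by whether the
 * request is sequential or random, plus an increment per page moved.
 * Costs here are notional nanoseconds of estimated device time. */
struct cost_params {
	unsigned long long seq_base;	/* base cost, sequential request */
	unsigned long long rand_base;	/* base cost, random request */
	unsigned long long page_cost;	/* additional cost per page */
};

static unsigned long long request_cost(const struct cost_params *p,
				       int is_sequential,
				       unsigned int nr_pages)
{
	unsigned long long cost = is_sequential ? p->seq_base : p->rand_base;

	/* The result estimates how long the device will take to
	 * execute this request. */
	return cost + (unsigned long long)nr_pages * p->page_cost;
}
```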
By default, the controller will observe the actual behavior of each device to work out what the cost parameters should be. The administrator can override this behavior by writing some commands to the io.weight.cost_model file in the root-level control group. For each drive, the maximum throughput, along with the maximum number of sequential and random operations that can be performed per second, can be specified. Different costs can be used for read and write operations if appropriate.
The default cost model apparently works pretty well. But, should somebody encounter a situation where that model falls apart, there is, inevitably, a hook to run a BPF program that can calculate the cost in whatever way makes sense.
vtime
The controller works by establishing a virtual clock (called vtime) for each device; that clock normally advances at the usual rate of one second per second. Each control group also has a vtime clock that determines when it can submit another I/O operation. Once the cost of an operation has been determined, it is added to the group's vtime; the operation can only be sent to the device once that device's vtime is ahead of the group's vtime. The weights assigned to each group are implemented by scaling the cost of each operation proportionally to that group's share of the total bandwidth. If group A, above, has 25% of the available bandwidth, the cost of its operations will be multiplied by four. In a sense, control groups live in a relativistic universe where lower-weight groups have slower-moving clocks.
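In sketch form, with invented names and a group's share expressed as a simple percentage, the vtime mechanism amounts to something like this (a rough illustration, not the controller's implementation):

```c
/* Illustrative sketch of vtime-based throttling. */
struct dev_vtime {
	unsigned long long vtime;	/* normally advances 1 sec/sec */
};

struct group_vtime {
	unsigned long long vtime;	/* when the group may submit again */
	unsigned int share_pct;		/* group's bandwidth share, 1-100 */
};

/* An operation may be dispatched only once the device's clock has
 * caught up with the group's clock. */
static int may_dispatch(const struct dev_vtime *dev,
			const struct group_vtime *grp)
{
	return dev->vtime >= grp->vtime;
}

/* Charge an operation to the group, scaling the cost inversely to the
 * group's share: a group with 25% of the bandwidth pays 4x. */
static void charge(struct group_vtime *grp, unsigned long long cost)
{
	grp->vtime += cost * 100 / grp->share_pct;
}
```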
To avoid situations where a device sits idle when there are operations pending, the controller will take note when any given group is not using the full bandwidth available to it and temporarily lower that group's weight to match its actual usage, in effect "lending" the unused bandwidth to other groups that are performing I/O. There is a mechanism to allow a group to quickly grab back its lent bandwidth should it start to need it.
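One way to picture the lending mechanism, purely as an illustration (the controller's real bookkeeping is more involved):

```c
/* Illustrative: shrink an under-using group's effective weight toward
 * its observed usage, donating the surplus to busier groups. */
static unsigned int effective_weight(unsigned int weight,
				     unsigned int used_pct /* 0-100 */)
{
	unsigned int active = weight * used_pct / 100;

	/* Keep a nonzero floor so the group can reclaim its full
	 * weight quickly when it becomes busy again. */
	return active ? active : 1;
}
```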
There is one little remaining problem: the vtime mechanism is designed to issue requests at the speed that the device can handle them. But the cost model is unlikely to be perfect, and the performance of any given device can vary over time. If the cost model is off, the controller could dispatch too many requests (increasing latencies) or not enough requests (leaving some bandwidth unused). That, naturally, is a situation worth avoiding.
Should the controller notice that request-completion times are increasing, it takes that as a signal that too many requests are being sent. That situation is addressed by slowing down the vtime associated with the overloaded device, so that requests will be dispatched at a slower rate. Similarly, if the device is not completely busy, its vtime will be advanced more quickly so that more requests will go out.
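Conceptually, this feedback loop scales the rate at which a device's vtime advances; the thresholds and step size below are invented for illustration:

```c
/* Illustrative vtime-rate feedback; limits and step size are made up. */
#define VRATE_MIN	 50	/* percent of wall-clock rate */
#define VRATE_MAX	150

static unsigned int adjust_vrate(unsigned int vrate_pct,
				 int completions_slowing,
				 int device_underused)
{
	if (completions_slowing && vrate_pct > VRATE_MIN)
		vrate_pct -= 5;		/* dispatch requests more slowly */
	else if (device_underused && vrate_pct < VRATE_MAX)
		vrate_pct += 5;		/* dispatch requests more quickly */
	return vrate_pct;
}
```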
The controller will try to tune this scaling automatically, but that may not be adequate for some situations. Write operations, in particular, can be queued within the device itself and completed in an order chosen by the device, meaning that the controller loses some control over the latency of any given request. In cases where that is a problem, it may be desirable to slow down request dispatch more aggressively to reduce the latency of request completion, even at the possible cost of leaving some bandwidth unused. There is another control knob in the root group, called io.weight.qos, that can be used to specify the desired latency ranges and how much the device's vtime can be adjusted to achieve them.
See the comments at the top of this patch for more details on the various control knobs and how they work.
Heo notes that the controller does a reasonable job of enforcing each group's weight using the default parameters — for read requests, at least. When there are a lot of writes involved, some playing with the parameters may be needed to get the best results. Tools and documentation to help administrators tune this controller are promised. Meanwhile, there has not been a huge amount of feedback on this controller since it was posted on June 13. Expecting it for 5.3 seems optimistic, but it may well be ready for a merge window shortly thereafter.
Posted Jun 29, 2019 17:36 UTC (Sat) by taggart (subscriber, #13801)
But the device firmware has been written with particular goals in mind, which might not be the goals of the administrator. Probably it's been optimized for Windows access patterns, to get good numbers on particular benchmarks, etc. So by moving the control point to the system io controller, the administrator might gain back control for the things they want to optimize for, but maybe at a cost of increased flash wear, increased write latency, etc.
If the system io controller knew which types of things it could dispatch to the device controller and let it deal with, that might help. Or if there were hints it could pass. Maybe TRIM/discard is sort of an example of this.
One trick I have done in the past with consumer grade SSDs is to deliberately partition only 80% of the drive, thus 20% will never get allocated and can be used by the device firmware as additional spare area to help with write latency, wear leveling, etc. (this was back in the pre-TRIM/discard days, maybe unnecessary with those now).
Posted Jun 30, 2019 7:23 UTC (Sun) by epa (subscriber, #39769)
If you just mean that you leave some of the capacity unused, wouldn’t a large file full of zero bytes do that as effectively?
Posted Jun 30, 2019 16:25 UTC (Sun) by Jonno (subscriber, #49613)
The idea is to keep a large consecutive range of LBAs to which you have _never_* written _anything_. Writing zeroes to the range you want to reserve is counterproductive, as it forces the firmware to store those zeroes, tying up a large amount of flash it could otherwise have used as additional spare space (unless the drive firmware inspects every write to check whether the whole write is just zeroes and treats it as a TRIM if it is, but I doubt any consumer-grade SSD has gone to the trouble of optimizing for that case).
* On modern SSDs that support TRIM, substitute "since last consecutive TRIM covering the whole range"...
Posted Jun 30, 2019 2:58 UTC (Sun) by marcH (subscriber, #57642)
Bounded latency for every stream/process while driving the underlying device close to its maximum rate: that's practically word for word the objective people fixing bufferbloat set themselves. Now I realize there are some differences. The main one is probably that packet loss is not just allowed in networking: it's the main signal. Yet I suspect there's a fair amount of overlap in the approaches. Are these two crowds connected to each other? Networking was not mentioned once in this article.
BTW maybe there would be more networking people reading this article if the keyword "latency" had been in the title. Or even better: in the name of the scheduler itself. Again on the "marketing" topic, why call it a "controller"? Straight from some legacy name in the kernel code maybe?
Posted Jun 30, 2019 22:35 UTC (Sun) by mtaht (subscriber, #11087)
There was even an attempt once at applying a fq_codel-like technique to queue up commands for a graphics card - which worked really well, except that one of the possible commands included resetting the pipeline.
Anyway, given deep buffers on a SSD device itself, something along the lines of BQL and utilizing a completion interrupt to keep those from getting too deep might be good. For spinning rust, instead of SSDs, you'd have to weight the seek somehow, and in that case you actually do want any seeks "along the way" to get inserted into the on-device queue... aggh, I'm rooting for y'all to sort it out.
Posted Nov 25, 2019 7:26 UTC (Mon) by riteshsarraf (subscriber, #11138)
Thank you for this, as always, well written article. I hope this new I/O controller finally mitigates the Linux desktop latency problem that has plagued it for years. Eagerly waiting for my distribution to package and provide Linux 5.4.

Posted Jan 21, 2020 9:33 UTC (Tue) by deven_zhu (guest, #132086)
So, is this a new I/O controller that replaces blk-throttle, or will it coexist with it?
