The BFQ I/O scheduler
Basic BFQ
BFQ, which has been developed and used out of tree for some years, is, in many ways, modeled after the "completely fair queuing" (CFQ) I/O scheduler currently found in the mainline kernel. CFQ separates each process's I/O requests into a separate queue, then rotates through the queues trying to divide the available bandwidth as fairly as it can. CFQ does a reasonably good job and is normally the I/O scheduler of choice for rotating drives, but it is not without its problems. The code has gotten more complex over the years as attempts have been made to improve its performance, but, despite the added heuristics, it can still create I/O latencies that are longer than desired.
The BFQ I/O scheduler also maintains per-process queues of I/O requests, but it does away with the round-robin approach used by CFQ. Instead, it assigns an "I/O budget" to each process. This budget is expressed as the number of sectors that the process is allowed to transfer when it is next scheduled for access to the drive. The calculation of the budget is complicated (more on this below), but, in the end, it is based on each process's "I/O weight" and observations of the process's past behavior. The I/O weight functions like a priority value; it is set by the administrator (or by default) and is normally constant. Processes with the same weight should all get the same allocation of I/O bandwidth. Different processes may get different budgets, but BFQ tries to preserve fairness overall, so a process getting a smaller budget now will get another turn at the drive sooner than a process that was given a large budget.
When it comes time to figure out whose requests should be serviced, BFQ examines the assigned budgets and, to simplify a bit, it chooses the process whose I/O budget would, on an otherwise idle disk, be exhausted first. So processes with small I/O budgets tend not to wait as long as those with large budgets. Once a process is selected, it has exclusive access to the storage device until it has transferred its budgeted number of sectors, with a couple of exceptions. Those are:
- Normally, a process's access to the device ends if it has no more
requests to be serviced. If, however, the last request was
synchronous (a read request, essentially), BFQ will let the drive sit
idle for a short period to give the process a chance to generate
another I/O request. Since the process was probably waiting for the
read to complete before generating more I/O traffic, that request
comes quite often, and it tends to
be contiguous with the last request (or close to it) and, thus, fast
to service. It may be counter-intuitive, but idling a drive briefly
after synchronous requests tends to increase throughput overall.
- There is a maximum period of time allowed for a process to complete its requests. If its I/O load is slow to complete (mostly likely because it consists of random I/O patterns requiring a lot of seeking by the drive), it will lose access to the drive before it has transferred its full budget. In this case, the process will be charged for the full budget anyway to reflect its overall effect on the drive's I/O throughput.
There is still the question of how each process's budget is assigned. In its simplest form, the algorithm is this: each process's budget is set to the number of sectors it transferred the last time it was scheduled, subject to a systemwide maximum. So processes that tend to do small transfers then stop for a while will get small budgets, while I/O-intensive processes will get larger budgets. The processes with the smaller budgets, which often tend to be more sensitive to latency, will be scheduled more often, leading to a more responsive system. The processes doing a lot of I/O may wait a bit longer, but they will get an extended time slice with the storage device, allowing the transfer of a large amount of data and, hopefully, good throughput.
Bring on the heuristics
Some experience with BFQ has evidently shown that the above-described algorithm can yield good results, but that there is room for improvement in a number of areas. The current posting of the code has, in response, added a set of heuristics intended to push the behavior of the system in the desired direction. These include:
- Newly started processes get a medium-sized budget and an increased
weight; this allows them to do a fair amount of I/O with minimal delay
from the outset. The idea here is to improve application startup time
by giving new processes some extra I/O bandwidth to fault their code
into memory. The increased weight decays linearly as the process
runs.
- BFQ's budget calculations, including the maximum allowed budget, are
dependent on an estimate of the underlying device's peak I/O rate.
The peak I/O rate can vary considerably, though, depending on where
the data
is located on the disk and how much caching is going on inside the
drive. A number of tweaks to the peak-rate calculator try to account
for these effects. For example, the observed I/O rate for
processes that run out of time without exhausting their budgets is now
factored in, even though a timeout of this nature usually indicates
that the I/O pattern is random and the drive is not operating at its
peak rate. The reasoning is that a timeout can also indicate that the
maximum budget value is too large. There is also a low-pass filter
used to exclude especially high rate calculations, since those are
more likely to be measuring drive caching than actual I/O rates.
- The budget calculations themselves have been tweaked. If a process
runs out of requests before exhausting its budget, the old response
was to lower the budget to the number of requests issued. In the
current code, instead, the scheduler will look to see if any of the
process's I/O requests are still outstanding; if so, the rate will be
doubled on the theory that more requests will be forthcoming
when those outstanding requests complete. In the case of a timeout,
the budget is, once again, doubled; this tweak is meant to help
processes working from slow parts of a drive and to cause processes
with truly random I/O patterns to be serviced less frequently.
Finally, if a process
still has requests outstanding after using its entire budget, it's
likely to be an I/O-intensive process; in this case the budget is
quadrupled.
- Write operations are more costly than reads because disk drives
tend to cache the data and signal completion immediately; the actual
write to media is done at some later time. That can cause starvation
of read requests later on. BFQ tries to account for this cost by
counting writes more heavily against the budget; indeed, one write is
charged like ten reads.
- On drives that can queue multiple commands internally, idling (as
described above) can cause the internal queue to empty out, reducing
throughput. So BFQ will disable idling on solid-state devices with
command queuing. Idling will also be disabled on rotational devices,
but only when servicing processes with random I/O patterns.
- When multiple processes are operating on the same portion of the disk,
it can be better to keep their queues together rather than servicing
them separately. Evidently QEMU, which divides I/O among a number of
worker threads, is a good example of this type of workload. BFQ
includes an algorithm called "early queue merge" that attempts to
detect such processes and join their queues together.
- BFQ attempts to detect "soft realtime" applications — media players, for example — and boost their weight to help ensure that they experience low latencies. This detection works by looking for a pattern of issuing a set of I/O requests, then going idle (from a disk I/O point of view) for a period of time. Processes that exhibit that pattern will have their weight increased.
The list of heuristics is longer than this, but one should get the idea: tuning the I/O patterns of a system to optimize for a wide range of workloads is a complex task. From the results posted by BFQ developer Paolo Valente, it seems that a fair amount of success has been achieved. The task of getting this code into the mainline may be just a little bit harder, though.
The path toward merging
If BFQ does have a slow path into the mainline, it will not be because the kernel developers dislike it; indeed, almost all of the comments have been quite positive. The results speak for themselves, but there was also a lot of happiness about how the scheduler has been studied and all of the heuristics have been extensively described and tested. The CFQ I/O scheduler also contains a lot of heuristics, but few people understand what they are or how they work. BFQ appears to be a cleaner and much better documented alternative.
What the kernel developers do not want to see, though, is the merging of another complex I/O scheduler that tries to fill the same niche as CFQ. Instead, they would like to see a set of patches that evolves CFQ into BFQ, leaving the kernel with a single, improved I/O scheduler. As Tejun Heo put it:
Changing CFQ in an evolutionary way would also help when the inevitable performance regressions turn up. Finding the source of regressions in BFQ could be challenging; bisecting a series of changes to CFQ would, instead, point directly to the offending change.
The BFQ scheduler has been around for a while, and has seen a fair amount of use. Distributions like Sabayon and OpenMandriva ship it, as does CyanogenMod. It seems to be a well-proven technology. All that's needed is some time put into packaging it properly for inclusion into the mainline. Once that has been done, more extensive performance testing can be done. After any issues found there are resolved, this scheduler could replace CFQ (or, more properly, become the future CFQ) in the kernel relatively quickly.
(See this
paper [PDF] for a lot more information on how BFQ works).
| Index entries for this article | |
|---|---|
| Kernel | Block layer/I/O scheduling |
| Kernel | Elevator |
| Kernel | I/O scheduler |
