Two new block I/O schedulers for 4.12
The lack of an I/O scheduler might seem like a fatal flaw in the multiqueue code, but the truth is that the need for a scheduler was not clearly evident at the outset. High-end drives are generally solid-state devices lacking rotational delay problems; they are thus not as sensitive to the ordering of operations. But it turns out that there is value in I/O scheduling even for the fastest devices; a scheduler can coalesce adjacent requests, reducing the overall operation count, and it can prioritize some operations over others. So the desire to add I/O scheduling for multiqueue devices has been there for a while, but the code has been lacking.
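To make the coalescing idea concrete, here is a minimal userspace sketch (not the block layer's actual code; the structure and function names are invented for illustration): two requests that touch adjacent sectors can be folded into a single, larger operation before being sent to the device.

    #include <stdbool.h>
    #include <stdio.h>

    struct request {
        unsigned long long sector;   /* starting sector */
        unsigned int nr_sectors;     /* request length in sectors */
    };

    /* If request b starts exactly where a ends, fold b into a. */
    static bool try_back_merge(struct request *a, const struct request *b)
    {
        if (a->sector + a->nr_sectors != b->sector)
            return false;
        a->nr_sectors += b->nr_sectors;
        return true;
    }

    int main(void)
    {
        struct request a = { .sector = 2048, .nr_sectors = 8 };
        struct request b = { .sector = 2056, .nr_sectors = 8 };

        /* Two 8-sector requests become one 16-sector request. */
        if (try_back_merge(&a, &b))
            printf("one %u-sector request starting at sector %llu\n",
                   a.nr_sectors, a.sector);
        return 0;
    }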
Things got closer in the 4.11 merge window, when the block layer gained support for I/O scheduling for multiqueue devices. The deadline I/O scheduler was ported to this mechanism as a proof of concept, but it was never seen as the real solution going forward.
When I/O scheduling support was added, the first intended user was the Budget Fair Queuing (BFQ) scheduler. BFQ has been in the works for years; it assigns an I/O budget to each process that, when combined with a bunch of heuristics, is said to produce much better I/O response, especially on slower devices. Users with rotational storage benefit from BFQ, but there are also benefits when using slower solid-state devices; as a result, there is a fair amount of interest in using BFQ on devices like mobile handsets, for example.
The idea that BFQ is an improvement over the CFQ scheduler found in mainline kernels is fairly uncontroversial, but getting BFQ merged was still a lengthy process. One of the final stumbling blocks was that it was a traditional I/O scheduler rather than a multiqueue scheduler. The block subsystem developers have a long-term goal of moving all drivers to the multiqueue model, and merging a non-multiqueue I/O scheduler did not seem like a step in that direction.
Over the last several months, BFQ developer Paolo Valente has worked to port the code to the multiqueue interface. The known problems with the port have been resolved, and block subsystem maintainer Jens Axboe has agreed to merge it for 4.12. If all goes to plan, the long wait for the BFQ I/O scheduler will finally be at an end.
The interesting thing is that BFQ will not be the only multiqueue I/O scheduler entering the mainline in 4.12; there will be another one, developed over a much shorter time period, that should also be merged then. One might well wonder why a second scheduler is needed, especially since kernel developers place a premium on general solutions that can address a wide variety of use cases. But it seems that there is, indeed, a reasonable case to be made for merging a second multiqueue I/O scheduler.
BFQ is a complex scheduler that is designed to provide good interactive response, especially on those slower devices. It has a relatively high per-operation overhead, which is justified when the I/O operations themselves are slow and expensive. This complexity may not make sense, though, in situations where I/O operations are cheap and throughput is a primary concern. When running a server workload using solid-state devices, it may be better to run a much simpler scheduler that allows for request merging and perhaps some simple policies, but which mostly stays out of the way.
The Kyber I/O scheduler, posted by Omar Sandoval, would appear to be such a beast. Kyber is intended for fast multiqueue devices and lacks much of the complexity found in BFQ; it is less than 1,000 lines of code. Its policies, while simple, would appear to be an interesting echo of the bufferbloat work done in the networking stack.
I/O requests passing through Kyber are split into two primary queues, one for synchronous requests and one for asynchronous requests — or, in other words, one for reads and one for writes. A process issuing a read request is typically unable to proceed until that request completes and the data is available, so such requests are seen as synchronous. A write operation, instead, can complete at some later time; the initiating process doesn't usually care when that write actually happens. So it is common to prioritize reads over writes, but not to the point that writes are starved.
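That policy — prefer reads, but never let writes wait forever — can be illustrated with a tiny userspace sketch. The 16:1 dispatch ratio and the queue contents below are invented numbers for illustration, not values taken from Kyber.

    #include <stdio.h>

    #define READS_PER_WRITE 16   /* hypothetical anti-starvation ratio */

    int main(void)
    {
        int reads = 40, writes = 10;     /* pending requests in each queue */
        int reads_in_a_row = 0;

        while (reads || writes) {
            /* Dispatch a write only if no reads remain or writes
             * have been held back for too long. */
            int pick_write = writes &&
                (!reads || reads_in_a_row >= READS_PER_WRITE);

            if (pick_write) {
                putchar('W');
                writes--;
                reads_in_a_row = 0;
            } else {
                putchar('R');
                reads--;
                reads_in_a_row++;
            }
        }
        putchar('\n');
        return 0;
    }

Running it prints a dispatch pattern in which reads dominate but a write is interleaved regularly, so the asynchronous queue keeps making progress.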
The key to the Kyber algorithm is that the number of operations (both reads and writes) sent to the dispatch queues (the queues that feed operations directly to the device) is strictly limited, keeping those queues relatively short. If the dispatch queues are short, then the amount of time that passes while any given request waits in the queues (the per-request latency) will be relatively small. That ensures a quick completion time for higher-priority requests.
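The reasoning is essentially queueing arithmetic: a newly admitted request must wait behind everything already sitting in the dispatch queue, so its worst-case wait is roughly the queue depth multiplied by the device's per-request service time. A small sketch with invented numbers shows how quickly that wait shrinks as the allowed depth is reduced.

    #include <stdio.h>

    int main(void)
    {
        unsigned int service_us = 100;            /* ~100us per request   */
        unsigned int depths[] = { 256, 64, 16 };  /* allowed queue depths */

        for (unsigned int i = 0; i < sizeof(depths) / sizeof(depths[0]); i++)
            printf("depth %3u -> worst-case queue wait ~%u us\n",
                   depths[i], depths[i] * service_us);
        return 0;
    }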
The scheduler tunes the actual number of requests allowed into the dispatch queues by measuring the completion time of each request and adjusting the limits to achieve the desired latencies. There are two tuning knobs available to the system administrator for setting the latency targets; they default to 2ms for read requests and 10ms for writes. As always, there will be tradeoffs in setting these values; setting them too low will ensure low latencies, but at the cost of reducing the opportunities for the merging of requests, hurting throughput.
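The article does not spell out Kyber's adjustment algorithm, but the general shape of such a latency-target feedback loop can be sketched in a few lines of userspace C. The back-off and probing policy and all of the constants below are invented for illustration; they are not taken from the actual implementation.

    #include <stdio.h>

    #define TARGET_US     2000   /* e.g. the 2ms read-latency target */
    #define MIN_DEPTH        1
    #define MAX_DEPTH      256

    /* Shrink the allowed dispatch depth when observed completion
     * latencies exceed the target; probe upward slowly when they are
     * comfortably below it. */
    static unsigned int adjust_depth(unsigned int depth, unsigned int observed_us)
    {
        if (observed_us > TARGET_US && depth > MIN_DEPTH)
            return depth / 2;                 /* back off quickly */
        if (observed_us < TARGET_US / 2 && depth < MAX_DEPTH)
            return depth + 1;                 /* probe upward slowly */
        return depth;
    }

    int main(void)
    {
        unsigned int depth = 64;
        unsigned int samples[] = { 3500, 2600, 1800, 700, 650, 900 };

        for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
            depth = adjust_depth(depth, samples[i]);
            printf("observed %4u us -> depth now %u\n", samples[i], depth);
        }
        return 0;
    }

As with other I/O schedulers, one would expect the latency-target knobs themselves to show up as per-device tunables under the device's queue/iosched/ directory in sysfs.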
Kyber, too, has been accepted for the 4.12 merge window. So, if all goes according to plan, the 4.12 kernel will have two new options for I/O scheduling on multiqueue devices. Users concerned with interactive response and possibly using slower devices will likely opt for BFQ, while throughput-sensitive server loads are more likely to run with Kyber. Either way, an important gap in the kernel's multiqueue block I/O subsystem has been filled, clearing the path for an eventual transition of all of the kernel's block drivers to the multiqueue API.
Comments

Two new block I/O schedulers for 4.12
Posted Apr 25, 2017 0:38 UTC (Tue) by liam (guest, #84133)

Latency falls from 8ms to 1ms (the target set by kyber).

Interaction with btrfs
Posted Apr 25, 2017 9:41 UTC (Tue) by zdzichu (subscriber, #17118)

This btrfs behaviour makes observing I/O traffic troublesome already. I fear it will wreak havoc with fairness.

Interaction with btrfs
Posted Apr 25, 2017 13:15 UTC (Tue) by DG (subscriber, #16978)

I used to see spikes in latency from BTRFS on an internal backup server as it performed rsync --inplace based backups and deleted old snapshots ... which were bad enough to make me try BFQ.

Random details: vanilla 4.9.x kernel + BFQ. Two WD 3TiB Red disks in a BTRFS raid 1 filesystem. 4 core i5 processor with 32gb of RAM. FS contains 150 readonly snapshots. Currently using about 1.79TiB.

difference between deadline & kyber
Posted Apr 29, 2017 3:58 UTC (Sat) by dud225 (subscriber, #114210)

Is that correct ? Are there other differences ?

Two new block I/O schedulers for 4.12
Posted May 8, 2017 10:32 UTC (Mon) by zdzichu (subscriber, #17118)

$ find /sys -name scheduler -exec grep . {} +
/sys/devices/pci0000:00/0000:00:0b.0/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler: noop [deadline] cfq
/sys/devices/pci0000:00/0000:00:09.0/virtio3/host8/target8:0:0/8:0:0:2/block/sdc/queue/scheduler: noop deadline [cfq]
/sys/devices/pci0000:00/0000:00:0c.0/virtio5/block/vda/queue/scheduler: mq-deadline kyber [bfq] none

Two new block I/O schedulers for 4.12
Posted May 10, 2017 21:58 UTC (Wed) by micka (subscriber, #38720)

For reference, there is documentation at /usr/src/linux/Documentation/block/switching-sched.txt.

Two new block I/O schedulers for 4.12
Posted Dec 5, 2022 14:30 UTC (Mon) by leodag (guest, #162537)

$ ls -l /sys/class/block
total 0
lrwxrwxrwx 1 root root 0 dez 5 11:22 dm-0 -> ../../devices/virtual/block/dm-0/
lrwxrwxrwx 1 root root 0 dez 5 11:22 nvme0n1 -> ../../devices/pci0000:00/0000:00:01.2/0000:02:00.0/nvme/nvme0/nvme0n1/
lrwxrwxrwx 1 root root 0 dez 5 11:22 nvme0n1p1 -> ../../devices/pci0000:00/0000:00:01.2/0000:02:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/
lrwxrwxrwx 1 root root 0 dez 5 11:22 nvme0n1p2 -> ../../devices/pci0000:00/0000:00:01.2/0000:02:00.0/nvme/nvme0/nvme0n1/nvme0n1p2/