Two new block I/O schedulers for 4.12
The lack of an I/O scheduler might seem like a fatal flaw in the multiqueue code, but the truth is that the need for a scheduler was not clearly evident at the outset. High-end drives are generally solid-state devices lacking rotational delay problems; they are thus not as sensitive to the ordering of operations. But it turns out that there is value in I/O scheduling even for the fastest devices; a scheduler can coalesce adjacent requests, reducing the overall operation count, and it can prioritize some operations over others. So the desire to add I/O scheduling for multiqueue devices has been there for a while, but the code has been lacking.
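To make the coalescing idea concrete, here is a minimal userspace sketch (not the block layer's actual code; the structure and function names are invented for illustration): two requests that touch adjacent sectors can be folded into a single, larger operation before being sent to the device.

    #include <stdbool.h>
    #include <stdio.h>

    struct request {
        unsigned long long sector;   /* starting sector */
        unsigned int nr_sectors;     /* request length in sectors */
    };

    /* If request b starts exactly where a ends, fold b into a. */
    static bool try_back_merge(struct request *a, const struct request *b)
    {
        if (a->sector + a->nr_sectors != b->sector)
            return false;
        a->nr_sectors += b->nr_sectors;
        return true;
    }

    int main(void)
    {
        struct request a = { .sector = 2048, .nr_sectors = 8 };
        struct request b = { .sector = 2056, .nr_sectors = 8 };

        /* Two 8-sector requests become one 16-sector request. */
        if (try_back_merge(&a, &b))
            printf("one %u-sector request starting at sector %llu\n",
                   a.nr_sectors, a.sector);
        return 0;
    }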
Things got closer in the 4.11 merge window, when the block layer gained support for I/O scheduling for multiqueue devices. The deadline I/O scheduler was ported to this mechanism as a proof of concept, but it was never seen as the real solution going forward.
When I/O scheduling support was added, the first intended user was the Budget Fair Queuing (BFQ) scheduler. BFQ has been in the works for years; it assigns an I/O budget to each process that, when combined with a bunch of heuristics, is said to produce much better I/O response, especially on slower devices. Users with rotational storage benefit from BFQ, but there are also benefits when using slower solid-state devices; as a result, there is a fair amount of interest in using BFQ on devices like mobile handsets, for example.
The idea that BFQ is an improvement over the CFQ scheduler found in mainline kernels is fairly uncontroversial, but getting BFQ merged was still a lengthy process. One of the final stumbling blocks was that it was a traditional I/O scheduler rather than a multiqueue scheduler. The block subsystem developers have a long-term goal of moving all drivers to the multiqueue model, and merging a non-multiqueue I/O scheduler did not seem like a step in that direction.
Over the last several months, BFQ developer Paolo Valente has worked to port the code to the multiqueue interface. The known problems with the port have been resolved, and block subsystem maintainer Jens Axboe has agreed to merge it for 4.12. If all goes to plan, the long wait for the BFQ I/O scheduler will finally be at an end.
The interesting thing is that BFQ will not be the only multiqueue I/O scheduler entering the mainline in 4.12; there will be another one, developed over a much shorter time period, that should also be merged then. One might well wonder why a second scheduler is needed, especially since kernel developers place a premium on general solutions that can address a wide variety of use cases. But it seems that there is, indeed, a reasonable case to be made for merging a second multiqueue I/O scheduler.
BFQ is a complex scheduler that is designed to provide good interactive response, especially on those slower devices. It has a relatively high per-operation overhead, which is justified when the I/O operations themselves are slow and expensive. This complexity may not make sense, though, in situations where I/O operations are cheap and throughput is a primary concern. When running a server workload using solid-state devices, it may be better to run a much simpler scheduler that allows for request merging and perhaps some simple policies, but which mostly stays out of the way.
The Kyber I/O scheduler, posted by Omar Sandoval, would appear to be such a beast. Kyber is intended for fast multiqueue devices and lacks much of the complexity found in BFQ; it is less than 1,000 lines of code. Its policies, while simple, would appear to be an interesting echo of the bufferbloat work done in the networking stack.
I/O requests passing through Kyber are split into two primary queues, one for synchronous requests and one for asynchronous requests — or, in other words, one for reads and one for writes. A process issuing a read request is typically unable to proceed until that request completes and the data is available, so such requests are seen as synchronous. A write operation, instead, can complete at some later time; the initiating process doesn't usually care when that write actually happens. So it is common to prioritize reads over writes, but not to the point that writes are starved.
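That policy — prefer reads, but never let writes wait forever — can be illustrated with a tiny userspace sketch. The 16:1 dispatch ratio and the queue contents below are invented numbers for illustration, not values taken from Kyber.

    #include <stdio.h>

    #define READS_PER_WRITE 16   /* hypothetical anti-starvation ratio */

    int main(void)
    {
        int reads = 40, writes = 10;     /* pending requests in each queue */
        int reads_in_a_row = 0;

        while (reads || writes) {
            /* Dispatch a write only if no reads remain or writes
             * have been held back for too long. */
            int pick_write = writes &&
                (!reads || reads_in_a_row >= READS_PER_WRITE);

            if (pick_write) {
                putchar('W');
                writes--;
                reads_in_a_row = 0;
            } else {
                putchar('R');
                reads--;
                reads_in_a_row++;
            }
        }
        putchar('\n');
        return 0;
    }

Running it prints a dispatch pattern in which reads dominate but a write is interleaved regularly, so the asynchronous queue keeps making progress.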
The key to the Kyber algorithm is that the number of operations (both reads and writes) sent to the dispatch queues (the queues that feed operations directly to the device) is strictly limited, keeping those queues relatively short. If the dispatch queues are short, then the amount of time that passes while any given request waits in the queues (the per-request latency) will be relatively small. That ensures a quick completion time for higher-priority requests.
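The reasoning is essentially queueing arithmetic: a newly admitted request must wait behind everything already sitting in the dispatch queue, so its worst-case wait is roughly the queue depth multiplied by the device's per-request service time. A small sketch with invented numbers shows how quickly that wait shrinks as the allowed depth is reduced.

    #include <stdio.h>

    int main(void)
    {
        unsigned int service_us = 100;            /* ~100us per request   */
        unsigned int depths[] = { 256, 64, 16 };  /* allowed queue depths */

        for (unsigned int i = 0; i < sizeof(depths) / sizeof(depths[0]); i++)
            printf("depth %3u -> worst-case queue wait ~%u us\n",
                   depths[i], depths[i] * service_us);
        return 0;
    }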
The scheduler tunes the actual number of requests allowed into the dispatch queues by measuring the completion time of each request and adjusting the limits to achieve the desired latencies. There are two tuning knobs available to the system administrator for setting the latency targets; they default to 2ms for read requests and 10ms for writes. As always, there will be tradeoffs in setting these values; setting them too low will ensure low latencies, but at the cost of reducing the opportunities for the merging of requests, hurting throughput.
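The article does not spell out Kyber's adjustment algorithm, but the general shape of such a latency-target feedback loop can be sketched in a few lines of userspace C. The back-off and probing policy and all of the constants below are invented for illustration; they are not taken from the actual implementation.

    #include <stdio.h>

    #define TARGET_US     2000   /* e.g. the 2ms read-latency target */
    #define MIN_DEPTH        1
    #define MAX_DEPTH      256

    /* Shrink the allowed dispatch depth when observed completion
     * latencies exceed the target; probe upward slowly when they are
     * comfortably below it. */
    static unsigned int adjust_depth(unsigned int depth, unsigned int observed_us)
    {
        if (observed_us > TARGET_US && depth > MIN_DEPTH)
            return depth / 2;                 /* back off quickly */
        if (observed_us < TARGET_US / 2 && depth < MAX_DEPTH)
            return depth + 1;                 /* probe upward slowly */
        return depth;
    }

    int main(void)
    {
        unsigned int depth = 64;
        unsigned int samples[] = { 3500, 2600, 1800, 700, 650, 900 };

        for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
            depth = adjust_depth(depth, samples[i]);
            printf("observed %4u us -> depth now %u\n", samples[i], depth);
        }
        return 0;
    }

As with other I/O schedulers, one would expect the latency-target knobs themselves to show up as per-device tunables under the device's queue/iosched/ directory in sysfs.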
Kyber, too, has been accepted for the 4.12 merge window. So, if all goes according to plan, the 4.12 kernel will have two new options for I/O scheduling on multiqueue devices. Users concerned with interactive response and possibly using slower devices will likely opt for BFQ, while throughput-sensitive server loads are more likely to run with Kyber. Either way, an important gap in the kernel's multiqueue block I/O subsystem has been filled, clearing the path for an eventual transition of all of the kernel's block drivers to the multiqueue API.
Comments

Two new block I/O schedulers for 4.12
Posted Apr 25, 2017 0:38 UTC (Tue) by liam (guest, #84133)

Latency falls from 8ms to 1ms (the target set by kyber).

Interaction with btrfs
Posted Apr 25, 2017 9:41 UTC (Tue) by zdzichu (subscriber, #17118)

This btrfs behaviour makes observing I/O traffic troublesome already. I fear it will wreak havoc with fairness.

Interaction with btrfs
Posted Apr 25, 2017 13:15 UTC (Tue) by DG (subscriber, #16978)

I used to see spikes in latency from BTRFS on an internal backup server as it performed rsync --inplace based backups and deleted old snapshots ... which were bad enough to make me try BFQ.

Random details: vanilla 4.9.x kernel + BFQ. Two WD 3TiB Red disks in a BTRFS raid 1 filesystem. 4 core i5 processor with 32gb of RAM. FS contains 150 readonly snapshots. Currently using about 1.79TiB.

difference between deadline & kyber
Posted Apr 29, 2017 3:58 UTC (Sat) by dud225 (subscriber, #114210)

Is that correct ? Are there other differences ?

Two new block I/O schedulers for 4.12
Posted May 8, 2017 10:32 UTC (Mon) by zdzichu (subscriber, #17118)

$ find /sys -name scheduler -exec grep . {} +
/sys/devices/pci0000:00/0000:00:0b.0/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler: noop [deadline] cfq
/sys/devices/pci0000:00/0000:00:09.0/virtio3/host8/target8:0:0/8:0:0:2/block/sdc/queue/scheduler: noop deadline [cfq]
/sys/devices/pci0000:00/0000:00:0c.0/virtio5/block/vda/queue/scheduler: mq-deadline kyber [bfq] none

Two new block I/O schedulers for 4.12
Posted May 10, 2017 21:58 UTC (Wed) by micka (subscriber, #38720)

For reference, there is documentation at /usr/src/linux/Documentation/block/switching-sched.txt.

Two new block I/O schedulers for 4.12
Posted Dec 5, 2022 14:30 UTC (Mon) by leodag (guest, #162537)

$ ls -l /sys/class/block
total 0
lrwxrwxrwx 1 root root 0 dez 5 11:22 dm-0 -> ../../devices/virtual/block/dm-0/
lrwxrwxrwx 1 root root 0 dez 5 11:22 nvme0n1 -> ../../devices/pci0000:00/0000:00:01.2/0000:02:00.0/nvme/nvme0/nvme0n1/
lrwxrwxrwx 1 root root 0 dez 5 11:22 nvme0n1p1 -> ../../devices/pci0000:00/0000:00:01.2/0000:02:00.0/nvme/nvme0/nvme0n1/nvme0n1p1/
lrwxrwxrwx 1 root root 0 dez 5 11:22 nvme0n1p2 -> ../../devices/pci0000:00/0000:00:01.2/0000:02:00.0/nvme/nvme0/nvme0n1/nvme0n1p2/