
I/O scheduling for single-queue devices

By Jonathan Corbet
October 12, 2018
Block I/O performance can be one of the determining factors for the performance of a system as a whole, especially on systems with slower drives. The need to optimize I/O patterns has led to the development of a long series of I/O schedulers over the years; one of the most recent of those is BFQ, which was merged during the 4.12 development cycle. BFQ incorporates an impressive set of heuristics designed to improve interactive performance, but it has, thus far, seen relatively little uptake in deployed systems. An attempt to make BFQ the default I/O scheduler for some types of storage devices has raised some interesting questions, though, on how such decisions should be made.

A bit of review for those who haven't been following the block layer closely may be in order. There are two generations of the internal API used between the block layer and the underlying device drivers, which we can call "legacy" and "multiqueue". Unsurprisingly, the legacy API is older, while the multiqueue API was first merged in 3.13. The conversion of block drivers to the multiqueue API has been ongoing since then, with the SCSI subsystem only switching over, after a false start, in the upcoming 4.19 release. Most of the remaining holdout legacy drivers will be converted to multiqueue in the near future, at which point the legacy API can be expected to go away.

Several I/O schedulers exist for the legacy interface but, in practice, only two are in common use: cfq for slower drives and none for fast, solid-state devices. The multiqueue interface was aimed at fast devices from the outset; it was not able to support an I/O scheduler at all initially. That capability was added later, along with the mq-deadline scheduler, which was essentially a forward port of one of the simpler legacy schedulers (deadline). BFQ, which came later, is also a multiqueue-API scheduler.
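
As an aside for readers who want to experiment: the scheduler in use for a multiqueue device can be inspected and changed through sysfs at run time. The device name and the list of schedulers shown below are purely illustrative; what actually appears depends on the hardware and on which schedulers were built into the kernel.

$ cat /sys/block/mmcblk0/queue/scheduler
[mq-deadline] kyber bfq none
$ echo bfq | sudo tee /sys/block/mmcblk0/queue/scheduler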

In early October Linus Walleij posted a patch making BFQ the default I/O scheduler for single-queue devices driven by way of the multiqueue API. The idea of a single-queue multiqueue device may seem a bit contradictory at a first encounter, but one should remember that "multiqueue" refers to the API which, unlike the legacy API, is capable of handling block devices that implement more than one queue in hardware (but does not require multiple queues). As more drivers move to this API, more single-queue devices will be driven using it. In this particular case, Walleij is concerned with SD cards and similar devices, the kind often found in mobile systems. The expectation is that devices with a single hardware queue can be expected to be relatively slow, and that BFQ will extract better performance from those devices.

The initial response from block subsystem maintainer Jens Axboe was not entirely positive: "I think this should just be done with udev rules, and I'd prefer if the distros would lead the way on this". This approach is not inconsistent with how the kernel tries to do things in general, leaving policy decisions to user space. But, of course, current kernels, by selecting mq-deadline for such devices, are already implementing a specific policy.
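
For illustration only, a distribution-provided policy of the sort Axboe suggests might take the form of a udev rule along these lines; the file name and the device match are hypothetical, and the rule assumes that BFQ is available in the running kernel.

# /etc/udev/rules.d/60-block-scheduler.rules (hypothetical example)
# Select BFQ for SD/MMC cards and SCSI disks as they are added.
ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]", ATTR{queue/scheduler}="bfq"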

There were a few objections made to Axboe's position. Paolo Valente, the creator of BFQ, asserted that almost nobody understands I/O schedulers or how to choose one, so almost everybody will stick with whatever default the system gives them. And mq-deadline, as a default, is far worse than BFQ for such devices, he said. Walleij added that there are quite a few systems out there that do not use udev at all, so a rule-based approach will not work for them. On embedded systems where initramfs is not in use, it's currently not possible to mount the root filesystem using BFQ at all. As an additional practical difficulty, the number of hardware queues provided by a device is currently not made available to udev, so it could not effect this particular policy choice in any case (though that would be relatively straightforward to fix).
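
For the curious, the number of hardware queues a blk-mq device exposes can be seen in sysfs, where each queue gets its own directory; the device names and counts below are only an illustration of what a single-queue SD card and a multiqueue NVMe drive might show.

$ ls /sys/block/mmcblk0/mq/
0
$ ls /sys/block/nvme0n1/mq/
0  1  2  3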

Oleksandr Natalenko was not impressed by the embedded-systems argument; he said that the people building such systems know which I/O scheduler they should use and can build their systems accordingly. Mark Brown took issue with that view of things, though:

That's not an entirely realistic assessment of a lot of practical embedded development - while people *can* go in and tweak things to their heart's content and some will have the time to do that there's a lot of small teams pulling together entire systems who rely fairly heavily on defaults, focusing most of their effort on the bits of code they directly wrote.

Walleij echoed that view, and added that there have been many times in kernel history where the decision was made to try to do the right thing automatically when possible, without requiring intervention from user space.

Bart Van Assche, instead, questioned the superiority of the BFQ scheduler. He initially claimed that it would slow down kernel builds (a sure way to prevent your code from being merged), but Valente challenged that assessment. Van Assche's other concern, though, had to do with fast solid-state SATA drives. Once SCSI switches over to the multiqueue API, those drives will show up with a single queue, and will thus be affected by this change. He questioned whether BFQ could be as fast as mq-deadline in that situation, but did not present any test results.

One other potential problem, as pointed out by Damien Le Moal, is shingled magnetic recording (SMR) disks, which often require that write operations arrive in a specific order. BFQ does not provide the same ordering guarantees that mq-deadline does, so attempts to use it with SMR drives are unlikely to lead to a high level of user satisfaction. Valente has a plan for how to support those drives in BFQ, but he acknowledged that they will not work correctly now.
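
Whether a given drive is a zoned device can be checked through the queue/zoned attribute in sysfs, which reports "none" for conventional disks and "host-aware" or "host-managed" for SMR devices; the device name below is, as before, just an example.

$ cat /sys/block/sdb/queue/zoned
host-managed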

The discussion wound down without reaching any sort of clear conclusion. It would appear that, before being merged, a patch of this nature would need to gain some additional checks to ensure, at a minimum, that BFQ is not selected for hardware that it cannot schedule properly. No such revision has been posted as of this writing. The proponents of BFQ seem unlikely to give up in the near future, though, so this topic seems like one that can be expected to arise again.

Index entries for this article
Kernel: Block layer/I/O scheduling



I/O scheduling for single-queue devices

Posted Oct 12, 2018 17:50 UTC (Fri) by post-factum (subscriber, #53836) [Link]

> Oleksandr Natalenko was not impressed by the embedded-systems argument

Just for the record, that Oleksandr Natalenko guy (aka me) has been using BFQ since the very early days of its existence (2009, I believe? BFQ was definitely merged into pf-kernel for v2.6.31, at least) and has nothing against its proven superiority. It is just the automated choice for everyone that is questionable.

I/O scheduling for single-queue devices

Posted Oct 13, 2018 17:46 UTC (Sat) by josh (subscriber, #17465) [Link] (4 responses)

For a device that does have multiple queues, does the kernel use "none" or "mq-deadline" by default? I have a recent NVMe drive, and the kernel seems to default to "none"; I'm wondering if that's the right default.

I/O scheduling for single-queue devices

Posted Oct 15, 2018 9:26 UTC (Mon) by jan.kara (subscriber, #59161) [Link] (3 responses)

What does your /sys/block/<dev>/queue/scheduler show?

I/O scheduling for single-queue devices

Posted Oct 15, 2018 12:30 UTC (Mon) by timokokk (subscriber, #52029) [Link] (1 responses)

For me it shows as:

$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq

Still running 4.18.1.

I/O scheduling for single-queue devices

Posted Oct 16, 2018 12:27 UTC (Tue) by timokokk (subscriber, #52029) [Link]

It seems that there is only one list for I/O schedulers; both the legacy and multiqueue schedulers are put on the same list. You can only choose the default for the legacy schedulers, not for the MQ ones. So apparently the default for the MQ schedulers ends up being "none", as there is no way to set it at compile time.
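
For reference, the compile-time default in question is presumably the legacy-only CONFIG_DEFAULT_IOSCHED choice; a 4.18 kernel configuration shows something like the following, with no corresponding option for the multiqueue schedulers (the value shown is just an example):

$ grep CONFIG_DEFAULT_IOSCHED .config
CONFIG_DEFAULT_IOSCHED="cfq"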

I/O scheduling for single-queue devices

Posted Oct 15, 2018 15:11 UTC (Mon) by josh (subscriber, #17465) [Link]

$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline

I/O scheduling for single-queue devices

Posted Oct 13, 2018 19:33 UTC (Sat) by marcH (subscriber, #57642) [Link]

> That's not an entirely realistic assessment of a lot of practical embedded development - while people *can* go in and tweak things to their heart's content and some will have the time to do that there's a lot of small teams pulling together entire systems who rely fairly heavily on defaults, focusing most of their effort on the bits of code they directly wrote.

OK, but do we know at least how the small, "early adopters" minority configure their systems and why?

Changing a default setting without *any* prior, "from the field" experience would be wrong. Sorry if I missed something.

I/O scheduling for single-queue devices

Posted Oct 13, 2018 19:47 UTC (Sat) by linusw (subscriber, #40300) [Link]

I am sorry that I (Linus Walleij) have not posted a v2 yet, but it's on my laptop and it enforces mq-deadline on SMR devices while still letting single queue devices use BFQ by default. I will post it next week or so, just waiting for the debate to conclude. I am anyways going to maintain that patch for myself, so who knows, I might go and lobby it into a few distributions where it makes sense, if that is what it takes. Blessed are the meek, for they shall inherit the earth.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds