LWN: Comments on "The multiqueue block layer"
https://lwn.net/Articles/552904/

This is a special feed containing comments posted to the individual LWN article titled "The multiqueue block layer".

The multiqueue block layer
https://lwn.net/Articles/810249/
Posted Wed, 22 Jan 2020 14:58:08 +0000 by Bobby999

Hi Neil,

I have a question regarding multi-queue (MQ) in the SCSI layer. I have read articles and blogs on multi-queue in the Linux block layer, including your brilliant article. According to my understanding, since Linux kernel 3.13 (2014) the block layer has been multi-queue, a.k.a. blk-mq. After that, the SCSI I/O submission path had to be updated as well, and the resulting SCSI multi-queue (scsi-mq) work has been functional since kernel 3.17.

My question is: how is multi-queuing actually achieved in SCSI? AFAIK, the SCSI mid-level layer traditionally submitted commands through queuecommand(). Now that multi-queuing is implemented in SCSI, does it mean creating multiple queuecommand() instances? I am struggling to understand the multi-queue mechanism in the context of SCSI. Where can one actually see the multiple queues in the SCSI code?

Please help me understand. Thanks :-)
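(For what it's worth, scsi-mq does not multiply queuecommand(). blk-mq provides the multiple submission queues, and its dispatch handler for SCSI, scsi_queue_rq() in drivers/scsi/scsi_lib.c, still funnels every request into the host driver's single queuecommand() callback. A heavily simplified sketch of that path follows; the sketch_* names are hypothetical, the return codes are the modern blk_status_t form, and all command setup and error handling is omitted.)

#include <linux/blk-mq.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/* Sketch of the scsi-mq submission path; not the actual kernel code. */
static blk_status_t sketch_scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
                                         const struct blk_mq_queue_data *bd)
{
        /* The SCSI command lives in the request's per-driver payload */
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(bd->rq);
        struct Scsi_Host *shost = cmd->device->host;

        /* (translation of the request into a SCSI CDB happens here) */

        /* Still exactly one queuecommand() per host, as in the old path */
        if (shost->hostt->queuecommand(shost, cmd))
                return BLK_STS_RESOURCE;
        return BLK_STS_OK;
}

/* blk-mq invokes .queue_rq for whichever hardware queue a CPU maps to;
 * the multiple queues live in blk-mq, not in the SCSI midlayer. */
static const struct blk_mq_ops sketch_scsi_mq_ops = {
        .queue_rq = sketch_scsi_queue_rq,
};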
The multiqueue block layer
https://lwn.net/Articles/747655/
Posted Wed, 21 Feb 2018 01:22:43 +0000 by neilbrown

> Hi, can I ask some questions?

You are always free to ask :-)

I suggest that you read https://lwn.net/Articles/736534/ and https://lwn.net/Articles/738449/, as they go into quite a bit more detail. If something is not clear after reading those, do ask again, either here or in the comments of the relevant article.

The multiqueue block layer
https://lwn.net/Articles/747649/
Posted Wed, 21 Feb 2018 00:10:47 +0000 by dusol

Hi, can I ask some questions? I'm stuck trying to understand the multi-queue block layer's I/O scheduling. When my task wants to submit bios, it uses generic_make_request(bio). I understand that generic_make_request(bio) submits bios to the submitting CPU's own software staging queue (one software staging queue per core). The function gets the block device driver's queue (bdev->bd_disk->queue) through bdev_get_queue(bio->bi_bdev) and then adds bios through a recursive call to generic_make_request(). This article says 'the request queue is split into a number of separate queues'. Are that 'request queue' and bdev->bd_disk->queue the same thing? I am using kernel version 4.8.17.
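(For reference, the answer to that last question is yes: in the 4.8-era include/linux/blkdev.h, bdev_get_queue() simply returns bdev->bd_disk->queue, and that same request_queue is the structure blk-mq splits into per-CPU software queues and hardware dispatch queues. Below is an abridged excerpt of those definitions, with comments added here; consult the actual header for the full structure.)

static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
        return bdev->bd_disk->queue;
}

struct request_queue {
        /* ... many unrelated fields omitted ... */
        struct blk_mq_ctx __percpu  *queue_ctx;     /* per-CPU software staging queues */
        struct blk_mq_hw_ctx       **queue_hw_ctx;  /* hardware dispatch queues */
        unsigned int                 nr_hw_queues;  /* how many hardware queues */
        /* ... */
};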
Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/554305/
Posted Fri, 14 Jun 2013 06:39:47 +0000 by kleptog

Actually, I'd consider write combining something to be done by the disk itself. These days, with storage arrays having gigabytes of battery-backed cache, there's no real reason for the kernel to worry about things like write combining. Maybe if it affected the communication overhead, but that can be dealt with in other ways (such as parallel submission).

Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/553323/
Posted Fri, 07 Jun 2013 18:18:22 +0000 by dlang

You are processing slices of a range, but are you really writing the adjacent slices at almost exactly the same time from multiple processes? That's what it would take to give the OS the chance to combine the output from the different processes into one write to disk.

Also, in HPC, aren't you dealing with many different systems accessing your data over the network, rather than multiple processes on one machine?

What we are talking about here is the chance that things running on different CPUs in a single system are generating disk writes that the OS could combine into a single write before sending it to the drives.

For reads this isn't as big a deal; readahead should go a long way toward making the issue moot.

Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/553268/
Posted Fri, 07 Jun 2013 10:06:21 +0000 by ejr

In my world (HPC), having multiple coordinated processes accessing slices of a range *is* the common case. We have to fight for anything else to have even reasonable performance support. See Lustre.

But this case often is better handled outside the OS. There could be a different interface where clients post their read buffers and the OS fills them in some hardware-optimal order, but that has been considered for more than 20 years and still has no accepted solution.

Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/553255/
Posted Fri, 07 Jun 2013 02:27:24 +0000 by dlang

When you have things like databases where many applications are accessing one file, how common is it for the different applications to be making adjacent writes at the same time? It may happen, but it's not going to be the common case.

Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/553134/
Posted Thu, 06 Jun 2013 14:38:18 +0000 by alankila

The algorithms used for processing data on its way to modern disks are quite fast; for instance, LZO compression has bandwidth approaching 1 GB/s per core.

Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/553133/
Posted Thu, 06 Jun 2013 14:22:33 +0000 by axboe

If you write programs like that and expect IO performance to be stellar, you have a problem. It's already the case that Linux does not merge IO from independent threads, unless it either happens to detect the pattern or is explicitly asked to by the threads sharing an IO context.

So in general it's not a huge problem. For "legacy" devices that benefit a lot from merging, we can help them out a little bit. They will typically be single-queue anyway, and merging at dispatch time on that queue would be trivial to do.
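(Sharing an IO context, as mentioned above, is something a program can request explicitly with the CLONE_IO flag to clone(2). A minimal user-space sketch, with the actual IO submission and most error handling omitted:)

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[64 * 1024];

static int io_worker(void *arg)
{
        /* ... submit IO here; it is accounted to the shared context ... */
        return 0;
}

int main(void)
{
        /* CLONE_IO makes the child share the parent's io_context, so
         * the block layer treats both tasks as a single submitter for
         * merging and scheduling purposes (most relevant to CFQ-style
         * schedulers). */
        pid_t pid = clone(io_worker, child_stack + sizeof(child_stack),
                          CLONE_IO | SIGCHLD, NULL);
        if (pid == -1) {
                perror("clone");
                return 1;
        }
        waitpid(pid, NULL, 0);
        /* ... the parent would submit its own IO here ... */
        return 0;
}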
Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/553113/
Posted Thu, 06 Jun 2013 12:06:35 +0000 by ejr

See uses of the HDF5 and NetCDF file formats. There are many software systems that store everything in a single, large file, emulating a file system more aligned with the higher-level semantics. Also, think of databases. Counting a case as rare vs. common requires an application area.

But... parallel HDF5 and similar interfaces handle some coalescing in the "middleware" layer. They have found that relying on the OS leads to, um, sufficient performance variation across different systems and configurations. Yet another standard parallel I/O layer in user space could help more than trying to jam the solution into the OS.

But relying on user space won't help the case multiqueue is attacking.

Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/553088/
Posted Thu, 06 Jun 2013 07:28:10 +0000 by dlang

A comment I posted elsewhere is also relevant to this discussion; I'll post a link rather than reposting the longer comment. In https://lwn.net/Articles/553086/ I talk about how I/O coalescing should only be done when the device is busy (among other things).

Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/553084/
Posted Thu, 06 Jun 2013 07:08:41 +0000 by dlang

But how many people are running such parallel processing of single files?

And of those who are, how much do they care whether their files get processed one at a time using every CPU for that one file, or many files at a time with each CPU processing a different file (and therefore not needing the combined I/O logic)?

Yes, there are going to be some, but is it really worth crippling the more common cases to help this rare case?

Hypothetical programs might desire cross-CPU coalescing of IO
https://lwn.net/Articles/553076/
Posted Thu, 06 Jun 2013 05:38:38 +0000 by faramir

Our worthy editor suggests that not coalescing I/O requests across CPUs is probably not a big problem. If we restrict ourselves to the original model of UNIX computation (a single process with a private memory space), I would agree.

If we consider multiple processes with synchronization (perhaps via shared memory), or multi-threaded programs, I'm not so sure. Imagine some kind of processing of file data which can be done a block at a time (say, a block-based cipher or non-adaptive compression). A multi-threaded or multi-process version of such a program may in fact be running code on multiple CPUs while reading from and writing to the same files (and therefore generating coalescible I/O requests). Reads from the input file could come from any of the threads or processes engaged in the task.

In the case of compression, due to variable-length output chunks, the writer side will have to be coalesced into a single stream in the program itself in order to put the output in the correct order. That might be done by having a management thread simply inform each compression thread when to write, though, so the actual write calls might still come from different CPUs.

A block-based cipher program could probably use lseek() on multiple fds opened to the same output file to maintain correct ordering from each thread (see the sketch after this comment).

In either case, it would appear that coalescing across CPUs would be useful, at least if the actual processing required is negligible relative to I/O time. It may be that CPUs aren't fast enough to do this for anything beyond ROT13 encryption or simple RLE compression, so it might not matter for now. But it would seem to be at least a theoretical issue.
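(As a rough illustration of faramir's cipher scenario, here is a minimal pthread sketch. It uses positioned pread(2)/pwrite(2) calls, which give each thread its own file offset and so achieve what the lseek()-on-multiple-fds scheme describes without the extra descriptors. cipher_block() is a hypothetical placeholder, the file names are made up, and error handling is trimmed.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define BLKSZ    4096
#define NTHREADS 4
#define NBLOCKS  1024

static int infd, outfd;

/* Placeholder for a real block cipher; any per-block transform works */
static void cipher_block(unsigned char *buf, ssize_t len)
{
        for (ssize_t i = 0; i < len; i++)
                buf[i] ^= 0x2a;
}

static void *worker(void *arg)
{
        long id = (long)arg;
        unsigned char buf[BLKSZ];

        /* Threads stride across the file. Because every read and write
         * carries its own offset, adjacent blocks may be submitted from
         * different CPUs at nearly the same time; that is exactly the
         * pattern cross-CPU coalescing would (or would not) merge. */
        for (long i = id; i < NBLOCKS; i += NTHREADS) {
                ssize_t n = pread(infd, buf, BLKSZ, (off_t)i * BLKSZ);
                if (n <= 0)
                        break;
                cipher_block(buf, n);
                pwrite(outfd, buf, n, (off_t)i * BLKSZ);
        }
        return NULL;
}

int main(void)
{
        pthread_t tids[NTHREADS];

        infd = open("input.bin", O_RDONLY);
        outfd = open("output.bin", O_WRONLY | O_CREAT, 0644);
        if (infd == -1 || outfd == -1)
                return 1;

        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tids[i], NULL);
        return 0;
}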