LSFMM: Reducing SCSI latency

By Jonathan Corbet
April 25, 2013
LSFMM Summit 2013
As was noted often during the 2013 LSFMM Summit, the speed of storage devices is increasing rapidly, with the result that the Linux storage stack is having a hard time driving those devices at their full speed. For much of that hardware, one of the more significant parts of the stack is the SCSI layer. A session led by Bart van Assche examined ways in which the SCSI code could be made to perform better with fast hardware.

The discussion quickly homed in on the issue of the SCSI queue depth parameter, which limits the number of outstanding I/O operations. Bart complained that the queue depth should really be a per-LUN (per-device) parameter, rather than per-host; that would allow more outstanding requests and, hopefully, better performance. It could also reduce lock contention since queue depth counter updates could be split across multiple counters. James Bottomley objected that the queue depth limit is already a per-LUN parameter, but that host adapters tend to have a per-host limit as well. In the end, SCSI commands must go through the host adapter regardless of the target LUN, so there will be locking at the host adapter level regardless.
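To illustrate the contention argument, here is a minimal sketch with invented structure and field names (the real midlayer keeps comparable counts in scsi_device and Scsi_Host); per-LUN counters mean that submissions to different LUNs no longer bounce a single host-wide count between CPUs:

    /* Sketch only: per-LUN depth accounting with invented names. */
    #include <linux/atomic.h>
    #include <linux/types.h>

    struct example_lun {
            atomic_t queued;        /* commands outstanding on this LUN */
            int      queue_depth;   /* per-LUN limit */
    };

    struct example_host {
            atomic_t queued;        /* the adapter still has a global cap */
            int      can_queue;     /* per-host limit */
    };

    /* Returns true if one more command may be sent to this LUN. */
    static bool example_may_queue(struct example_host *h, struct example_lun *l)
    {
            if (atomic_inc_return(&l->queued) > l->queue_depth)
                    goto lun_busy;
            if (atomic_inc_return(&h->queued) > h->can_queue)
                    goto host_busy;
            return true;

    host_busy:
            atomic_dec(&h->queued);
    lun_busy:
            atomic_dec(&l->queued);
            return false;
    }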

James went on to say that the queue depth was really the wrong problem to be worried about. Speeding up the SCSI layer requires removing lock contention, and that is best done by going to a multiple-queue architecture. There was talk of setting up one queue per LUN, but James stated that per-LUN queues are the wrong model. The right way to do multiqueue I/O is to have per-CPU queues, because that's the level at which locks can be eliminated — besides, with LUN numbers being 64-bit quantities, one could need a lot of queues. So per-CPU queuing is the plan once Jens Axboe's multiqueue block layer implementation is ready. That code will make it possible to split the SCSI stack up on a per-CPU basis and minimize the interactions between the CPUs.
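The shape of that plan can be sketched in a few lines; this is illustrative only, not the blk-mq or SCSI code, and list initialization is omitted. The point is that the submission path only ever takes a CPU-local lock, which is almost never contended:

    /* Sketch: one submission queue per CPU. */
    #include <linux/percpu.h>
    #include <linux/spinlock.h>
    #include <linux/list.h>

    struct cpu_queue {
            spinlock_t       lock;      /* only taken from its own CPU */
            struct list_head pending;   /* commands queued from this CPU */
    };

    static DEFINE_PER_CPU(struct cpu_queue, example_cpu_queue);

    static void example_queue_command(struct list_head *cmd_entry)
    {
            /* get_cpu_var() disables preemption and returns this
             * CPU's instance of the queue. */
            struct cpu_queue *q = &get_cpu_var(example_cpu_queue);

            spin_lock(&q->lock);
            list_add_tail(cmd_entry, &q->pending);
            spin_unlock(&q->lock);
            put_cpu_var(example_cpu_queue);
    }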

Even with a single queue, Jens added, there is a lot that can be done to minimize contention. Much of that work seems to have to do with clever tagging of SCSI commands so that they can be dispatched quickly to the appropriate CPU. True multiqueue hardware will have per-queue tagging, which will make things even easier.

Bart's final question was: should the SCSI layer move to being a make_request_fn() driver? Block drivers that specify their own "make request" function accept I/O requests almost directly from the rest of the kernel, shorting out much of the block layer's functionality. Taking that approach can look like a way to reduce overhead but, as Jens said, it is a model that the block developers are trying to get away from. Using make_request_fn() means taking on a lot of the tasks that are otherwise handled in the block layer, leading to duplicated solutions to the same problem. Even if the SCSI layer were to be made more block-like (by using BIO structures throughout the midlayer, for example), there would still be a lot of infrastructure that would need to be reimplemented.
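For reference, a "make request" driver hooks in roughly as in the sketch below, which uses the 3.9-era API; everything the block layer would normally do for the driver becomes the driver's own problem:

    /* Sketch of the make_request_fn model: the driver accepts bios
     * directly and bypasses the request queue. */
    #include <linux/blkdev.h>
    #include <linux/bio.h>
    #include <linux/gfp.h>

    static void example_make_request(struct request_queue *q, struct bio *bio)
    {
            /* Merging, tagging, and accounting normally handled by
             * the block layer must now be done here (or done without). */
            /* ... hand the bio to the hardware ... */
            bio_endio(bio, 0);      /* complete immediately in this sketch */
    }

    static struct request_queue *example_init_queue(void)
    {
            struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

            if (q)
                    blk_queue_make_request(q, example_make_request);
            return q;
    }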

There was some talk of NUMA systems where I/O devices, too, are local to a specific CPU. In such cases, it obviously makes sense to move the I/O processing work to the right processor. A more NUMA-aware scheduler will help in this regard, but there were concerns that the scheduler still won't know about the system's I/O topology. The system's tendency to move processes toward the CPU where wakeup events occur should help to get the I/O threads in the right place. There might still be value in setting explicit thread CPU affinities on complex systems, though.
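Setting such an affinity from user space is straightforward; in this sketch, which CPU to pin to is an assumption the administrator must supply, typically wherever the adapter's interrupts are delivered:

    /* Userspace sketch: pin an I/O thread to a chosen CPU. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int pin_to_cpu(pthread_t thread, int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            return pthread_setaffinity_np(thread, sizeof(set), &set);
    }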

The last part of the session returned to the tagging of SCSI requests in a multiqueue environment. Since tags are a part of the request-completion notification from the device, it would be nice if the value of the tag, itself, could direct processing immediately to the correct queue. Tags are currently 16 bits wide, so including the queue was said to be "vaguely possible"; the T10 committee (which writes the SCSI standard) is considering increasing the width of tags to make the inclusion of queue pointers easier.

But wider tags may not really be needed. Real-world devices, it seems, do not generally operate with a queue depth greater than 255, so eight bits of the tag value are sufficient to track the requests to any specific device. That leaves eight bits that can be used to encode a queue number. James expressed some relief that upcoming devices did not appear to need queues larger than that; having to deal with massive queues, he said, would be bad for latency, and not having to plan for that case will make life a little easier.
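That split is simple to express; a sketch (the helper names are invented) of a 16-bit tag carrying the queue number in its high byte:

    /* Sketch: pack an 8-bit queue number and an 8-bit per-device
     * command index into one 16-bit tag, so a completion's tag alone
     * identifies the queue that should process it. */
    #include <linux/types.h>

    static inline u16 example_make_tag(u8 queue, u8 index)
    {
            return ((u16)queue << 8) | index;
    }

    static inline u8 example_tag_to_queue(u16 tag)
    {
            return tag >> 8;
    }

    static inline u8 example_tag_to_index(u16 tag)
    {
            return tag & 0xff;
    }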

[Thanks to Elena Zannoni, whose extensive notes made this writeup possible.]


LSFMM: Reducing SCSI latency

Posted Apr 26, 2013 13:52 UTC (Fri) by abatters (✭ supporter ✭, #6932)

Two points:

1) Couldn't using per-CPU queues result in command reordering? So if userspace is submitting I/Os sequentially but the scheduler moves the process to another CPU, then the SCSI layer could send the commands to the device in a different order? That could kill performance, especially with some SAS disks that I have seen. (My specific application uses /dev/sg* to access disks in the raw, and I have done a lot of tuning to get maximum performance.)

2) My profiling of the pm8001 SAS HBA driver has shown that most of the per-command overhead comes from using the generic libsas/libata libraries in the kernel. They are great from a software design perspective, but crap for performance. A few ugly hacks (for testing, not production) to bypass all those software layers resulted in a significant improvement in IOPS. If I had more time to devote to it, I would look into adding a low-overhead fastpath for read/write commands and let the high-overhead generic layers handle other misc commands, even though it would mean having to duplicate a little bit of generic functionality in the low-level driver.
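A hypothetical sketch of that kind of split; the example_* helpers are invented, and nothing like this exists in the pm8001 driver:

    /* Steer READ/WRITE opcodes down a thin driver fastpath and leave
     * everything else to the generic libsas/libata layers. */
    #include <scsi/scsi.h>
    #include <scsi/scsi_cmnd.h>

    static int example_fast_rw(struct scsi_cmnd *cmd);      /* invented */
    static int example_generic_path(struct scsi_cmnd *cmd); /* invented */

    static int example_queuecommand(struct scsi_cmnd *cmd)
    {
            switch (cmd->cmnd[0]) {     /* first CDB byte is the opcode */
            case READ_10:
            case WRITE_10:
            case READ_16:
            case WRITE_16:
                    return example_fast_rw(cmd);
            default:
                    return example_generic_path(cmd);
            }
    }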

LSFMM: Reducing SCSI latency

Posted Apr 26, 2013 19:46 UTC (Fri) by RobertCElliott (guest, #90524)

1. Multiple queues in SCSI Express do indeed result in ordering issues, just like multiple connections between SAS wide ports, or multiple connections per session in iSCSI.

Most SCSI commands are sent with SIMPLE task attributes allowing the device to reorder them any way it wants, so reordering from multiple queues doesn't really change the result. To preserve the best command performance, we will probably adopt the SAS approach of "fencing" for task management functions that do care (e.g., avoid an ABORT TASK racing ahead of the command it is trying to abort), rather than the iSCSI approach of adding ordering numbers across the queues.

Commands with (rarely used) ORDERED task attributes need to be funneled down the same queue.

SSD latency is short enough that SSDs cannot significantly accelerate a command with HEAD OF QUEUE task attribute over the others; trying to honor it would slow everything down. An HDD with a SCSI Express interface might support it, but not run the commands in the same order as if only one queue were being used.

2. One aspect of the feature sets is distinguishing between high-performance I/O commands and normal commands. Optimizing software stacks for the I/O commands is a good idea.

LSFMM: Reducing SCSI latency

Posted May 1, 2013 14:12 UTC (Wed) by dougg (subscriber, #1894)

If you are not already aware, injecting SCSI commands via /dev/sg* nodes is LIFO through the block layer. That is also the default with /dev/bsg/* nodes, but there you have the BSG_FLAG_Q_AT_TAIL flag to defeat that default.
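A sketch of using that flag through the sg_io_v4 interface (error handling and filling in the rest of the header omitted):

    /* Ask bsg to queue the request at the tail rather than the
     * default head-of-queue insertion described above. */
    #include <sys/ioctl.h>
    #include <scsi/sg.h>        /* SG_IO */
    #include <linux/bsg.h>      /* struct sg_io_v4, BSG_FLAG_Q_AT_TAIL */

    static int send_at_tail(int bsg_fd, struct sg_io_v4 *hdr)
    {
            hdr->flags |= BSG_FLAG_Q_AT_TAIL;
            return ioctl(bsg_fd, SG_IO, hdr);
    }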
