Improvements in the block layer
Jens Axboe is the maintainer of the block layer of the kernel. In this capacity, he spoke at Kernel Recipes 2017 on what's new in the storage world for Linux, with a particular focus on the new block-multiqueue subsystem: the degree to which it's been adopted, a number of optimizations that have recently been made, and a bit of speculation about how it will further improve in the future.
Back in 2011, Intel published a Linux driver for NVM Express (NVMe, the interface also known as the Non-Volatile Memory Host Controller Interface), which was its new interface for accessing solid-state storage devices (SSDs). This driver was incorporated into the mainline kernel in 2012, first appearing in 3.3. It allowed new, fast SSDs to be run at speed, but that gave no improvement if the block subsystem continued to treat them as pedestrian hard drives. So a new, scalable block layer known as blk-mq (for block-multiqueue) was developed to take better advantage of these fast devices; it was merged for 3.13 in 2014. It was introduced with the understanding that all of the old drivers would be ported to blk-mq over time; that work continues, even though most of the mainstream block storage drivers have by now been successfully ported. Axboe's first focus was a status update on this process.
![Jens Axboe](https://static.lwn.net/images/2017/kr-axboe-sm.jpg)
Some old, outstanding flash drivers have recently been converted, as has NBD, the network block device driver. The scsi-mq mechanism, which allows SCSI drivers to use blk-mq, has been in the kernel as an option for some time, but recently Axboe made it the default. This change had to be backed out because of some performance and scalability issues that arose for rotating storage. Axboe feels those issues have now been pretty much dealt with, and hopes that scsi-mq will be back to being the default soon.
The old cciss drivers have now moved to SCSI, which will please anyone who's had to work with HP Smart Array controllers; it also pleases Axboe, since that means that cciss will be implicitly converted to blk-mq when scsi-mq becomes the default. That leaves about 15 drivers that have to be converted; work will continue until the job is finished, a state Axboe speculated (to some amusement) will be reached with the conversion of floppy.c, for which he is willing to award a small prize.
The main new features added recently relate to I/O scheduling, because its absence was one of the main sources of resistance to converting drivers to the new multiqueue framework. Various design decisions made earlier for blk-mq have made this harder than it might have been; despite this, blk-mq-sched was added in 4.11, with (initially) two scheduling disciplines: none and mq-deadline. The former makes no change to the default behavior, which Axboe describes as "roughly FIFO", and the latter is a re-implementation of the old deadline scheduler. In 4.12 two more were added: Budget Fair Queuing (BFQ), a scheduler based on CFQ that has been around for years but had never been integrated into the kernel, and Kyber, a fully multiqueue-aware scheduler that supports such things as reader-versus-writer fairness.
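For readers who want to see which of these schedulers a given device is using, the kernel lists them in sysfs, with the active one in square brackets. The following is a minimal sketch rather than anything shown in the talk; the function names are made up for illustration, and the sample line is an assumption about what a 4.12-era system might report.

```python
from pathlib import Path


def parse_scheduler_line(line: str) -> tuple[str, list[str]]:
    """Split a sysfs scheduler line into (active, available).

    The kernel marks the active scheduler with square brackets,
    e.g. "[mq-deadline] kyber bfq none".
    """
    active = ""
    available = []
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_scheduler(device: str) -> tuple[str, list[str]]:
    """Read /sys/block/<device>/queue/scheduler (Linux only)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())
```

On a Linux system, something like `read_scheduler("nvme0n1")` would return the active scheduler and the list of alternatives; the device name is, of course, only an example.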
Another new feature is writeback throttling. This is an attempt to better deal with the kernel's periodic desire to write out dirty pages to disk (or whatever flavor of storage backs them), which currently causes heavy load on the I/O backend that Axboe compared to Homer Simpson consuming a continuous but non-uniform stream of donuts. With writeback throttling, blk-mq attempts to get maximum performance without excessive I/O latency using a strategy borrowed from the CoDel network scheduler. CoDel tracks the observed minimum latency of network packets and, if that exceeds a threshold value, it starts dropping packets. Dropping writes is frowned upon in the I/O subsystem, but a similar strategy is followed in that the kernel monitors the minimum latency of both reads and writes and, if that exceeds a threshold value, it starts to turn down the amount of background writeback that's being done. This behavior was added in 4.10; Axboe said that pretty good results have been seen.
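The feedback loop described above can be sketched in a few lines. This is an illustration of the idea only, not the kernel's actual writeback-throttling code: the class name, the halving and increment-by-one policy, and the default depth of 32 are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class WritebackThrottle:
    """Toy CoDel-style controller for background writeback depth."""

    target_latency_ms: float          # hypothetical latency threshold
    max_depth: int = 32               # hypothetical cap on in-flight background writes
    depth: int = 32                   # current allowed background writeback depth
    _window: list = field(default_factory=list)

    def record(self, latency_ms: float) -> None:
        """Note the completion latency of one read or write."""
        self._window.append(latency_ms)

    def end_window(self) -> int:
        """Close the monitoring window and adjust the writeback depth."""
        if self._window:
            if min(self._window) > self.target_latency_ms:
                # Even the fastest request was slow: back off sharply.
                self.depth = max(1, self.depth // 2)
            else:
                # Latency is acceptable: cautiously allow more writeback.
                self.depth = min(self.max_depth, self.depth + 1)
            self._window.clear()
        return self.depth
```

The key point is that the decision is based on the minimum latency seen in a window: if even the best-served request exceeded the target, the device is saturated and background writeback is turned down instead of any I/O being dropped.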
The work came out of complaints from Facebook developers who saw high latencies when images were being updated. To quantify this, Axboe created his writeback challenge: two RPMs that can be installed on a system, one of which creates many small files, the other of which creates a small number of large files, the idea being that you monitor the performance of your services while installing these RPMs. In the context of a small but predictable service (a test application called io.go) Axboe saw (on both SSD and rotating storage) maximum request times that decreased by around a factor of ten with writeback throttling; worst-case on rotating storage with throttling was an I/O request that took 478ms to complete, while without throttling the worst case took 6.5 seconds.
Yet another improvement is I/O polling. Traditionally, when an application wishes to perform I/O, execution passes into kernel space and down to the device driver; while it's waiting for that request to complete, it either gets on with something else or goes to sleep. Completion is made known to it by an interrupt; the receipt of this completion notification wakes the application to get on with its job. During this sleep phase, on a non-overloaded system, the CPU on which the sleeping application was running is likely to go to sleep itself, and the overhead of going to sleep and then waking up becomes a significant part of the completion time. So, as a strategy to avoid this overhead, the kernel code may engage in continuous polling: it passes the I/O request to the hardware, then immediately starts repeatedly asking it whether it's done yet, a behavior that will be familiar to any parent of small children on car journeys. This minimizes the sleeping and waking-up overhead, but wastes CPU in the process of polling; there are also power implications to such behavior. A middle ground is desirable.
Axboe and others working on this subsystem came up with a solution called hybrid polling, which relies on the fact that fast devices tend also to be deterministic. With hybrid polling, the kernel tracks the completion times of I/O requests as a function of their size. Then, when the kernel sends any given I/O request down to the hardware, it sets a timer for half the mean completion time of comparably sized requests. When that timer fires, the request has probably not yet completed; the application (running within the kernel) wakes and switches to continuous polling, so that the completion of the request is detected as promptly as possible. The hope is that about half of the cost of continuous polling is avoided, while most of the cost of waking up is paid before the I/O request actually completes, so that latency is not increased.
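A small simulation may make the trade-off concrete. The sketch below is not kernel code; it models time as plain numbers (microseconds), and the power-of-two size bucketing, the one-microsecond poll interval, and all of the names are assumptions made for illustration.

```python
from collections import defaultdict


class CompletionStats:
    """Running mean of completion time per request-size bucket."""

    def __init__(self):
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    @staticmethod
    def bucket(size_bytes: int) -> int:
        # Hypothetical bucketing: group request sizes by power of two.
        return max(size_bytes, 1).bit_length()

    def record(self, size_bytes: int, elapsed_us: float) -> None:
        b = self.bucket(size_bytes)
        self._total[b] += elapsed_us
        self._count[b] += 1

    def mean(self, size_bytes: int) -> float:
        b = self.bucket(size_bytes)
        if not self._count[b]:
            return 0.0  # no history yet: behaves like pure polling
        return self._total[b] / self._count[b]


def hybrid_poll(stats, size_bytes, actual_us, poll_interval_us=1.0):
    """Simulate one request; return (sleep_us, polls, detected_at_us)."""
    # Sleep for half the mean completion time of similar requests...
    sleep_us = stats.mean(size_bytes) / 2
    now = sleep_us
    polls = 0
    # ...then poll continuously until the device reports completion.
    while True:
        polls += 1
        if now >= actual_us:
            break
        now += poll_interval_us
    stats.record(size_bytes, actual_us)
    return sleep_us, polls, now
```

With a history of 4KB requests averaging 110µs, a request that takes 110µs is detected at the same moment as under pure polling, but with roughly half as many polls. If a request finishes before the sleep ends, the simulation shows the latency penalty that a badly chosen delay can impose.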
In Axboe's tests, this strategy produced latencies that were indistinguishable from continuous polling. There are new system calls, pwritev2() and preadv2(), for those who wish to enable this behavior now (the RWF_HIPRI flag must be set on the request). There are also associated sysfs controls: io_poll, which enables or disables polling, and io_poll_delay, which defaults to -1, meaning classic continuous polling with no initial sleep. If the latter is set to zero, hybrid polling is used as described. If it's set to a positive value, the kernel sleeps for that fixed delay (in microseconds) before it starts polling. Enthusiastic knob-twiddlers should be aware that Axboe's tests show that it's hard to beat the hybrid strategy, and easy to do badly: not just worse than hybrid, but worse even than traditional interrupt-based performance.
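From user space, a polled read might look something like the following sketch. Python's os.preadv() wraps preadv2() on Linux and accepts os.RWF_HIPRI as a flag; the helper name and the fallback to an ordinary read are assumptions for illustration. As far as I understand it, polling only takes effect for direct I/O on a device whose io_poll control is enabled; elsewhere the flag is accepted but has no useful effect, so this example demonstrates the call itself and does not show any speedup.

```python
import os


def read_hipri(path: str, length: int, offset: int = 0) -> bytes:
    """Read up to `length` bytes, requesting polled completion where supported."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(length)
        # RWF_HIPRI is Linux-only; fall back to no flags elsewhere.
        flags = getattr(os, "RWF_HIPRI", 0)
        try:
            n = os.preadv(fd, [buf], offset, flags)
            return bytes(buf[:n])
        except (AttributeError, OSError, NotImplementedError):
            # Hypothetical fallback: preadv2() or the flag is unavailable,
            # so do a plain positional read instead.
            return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

A real latency-sensitive application would also open the file with O_DIRECT and use suitably aligned buffers, which is omitted here to keep the sketch short.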
Improved direct I/O (i.e. O_DIRECT) handling, which treats large and small requests differently, and corresponding improvements in fs/iomap.c, have shaved a further 6% from I/O times. This improvement is in 4.10. The I/O accounting subsystem was observed to be using 1-2% of CPU, which, for a subsystem that just tracks performance, is a high overhead. Changes merged in 4.14, which were easy as a result of the design of blk-mq, have noticeably reduced that cost.
Support has also been added for write lifetime hints, a feature introduced in hardware in NVMe 1.3. This allows the flash device controller to be given knowledge of the expected lifetime of data that's been queued for writing. Flash devices group writes into structures known as erase blocks, which can be multiple gigabytes in size in modern devices. If a write is later invalidated by an overwrite sent down from the application, the erase block it was in has to be copied to a new, modified one internally, and this is expensive. If a device controller knows the expected lifetime of data, then it can improve its own performance by constructing erase blocks of writes that are all expected to have comparable lifetimes and thus might all be invalidated together. Current kernel support allows lifetime hints of short, medium, long, and extreme to be provided, though these quantities don't have absolute values, as it is acknowledged that they will differ from application to application. Nevertheless, with these changes, reductions of 25-30% in physical writes to the storage device have been achieved in database workloads, a result that Axboe rightly describes as "huge" and that is accompanied by corresponding improvements in latency.
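An application passes these hints to the kernel with fcntl(). The sketch below shows roughly how that might be done; the numeric values are copied from my reading of the kernel's include/uapi/linux/fcntl.h (4.13 and later) and should be checked against your own headers, and the helper names are invented for this example.

```python
import fcntl
import struct

# Assumed values from include/uapi/linux/fcntl.h; verify before relying on them.
F_SET_RW_HINT = 1024 + 12  # F_LINUX_SPECIFIC_BASE + 12
WRITE_LIFE = {"short": 2, "medium": 3, "long": 4, "extreme": 5}


def pack_hint(lifetime: str) -> bytes:
    """Encode a lifetime hint as the 64-bit integer that fcntl() expects."""
    return struct.pack("=Q", WRITE_LIFE[lifetime])


def set_write_hint(fd: int, lifetime: str) -> bool:
    """Tag an open file with an expected data lifetime.

    Returns False if the kernel or filesystem rejects the request.
    """
    try:
        fcntl.fcntl(fd, F_SET_RW_HINT, pack_hint(lifetime))
    except OSError:
        return False
    return True
```

A database might, for instance, mark its write-ahead log as "short" and its bulk data files as "long", leaving it to the device to place the two kinds of data in different erase blocks.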
Axboe concluded his talk by showing his list of desirable improvements from 2015, and noting (to applause) that every single one had been achieved. His list for 2017 is therefore much shorter: I/O determinism and efficiency improvements. The former is a way to guarantee I/O latency for a given application and thus avoid the "noisy neighbor" problem, where two applications use the same back-end storage and one's I/O unduly reduces the other's performance. The latter is a safe bet because it's a wide umbrella; history suggests he'll find something to put beneath it next year.
Unless you're running a computer that never remembers anything it does, you, personally, have an interest in the I/O subsystem. This sort of news, then, is good news for all of us.
[We would like to thank LWN's travel sponsor, The Linux Foundation, for assistance with travel funding for Kernel Recipes.]
| Index entries for this article | |
|---|---|
| GuestArticles | Yates, Tom | 
| Conference | Kernel Recipes/2017 | 
Posted Oct 6, 2017 15:20 UTC (Fri) by willy (subscriber, #9762)

I don't know if Jens is spreading false information or if this was a small mistake in transcription, but I heard this inaccurate history given in a talk at LinuxCon NA as well, so I'd like to squelch it. The version of the nvme driver I released absolutely ran at full speed. We got over a million IOPS out of it running against a simulated device. It did this by completely avoiding the queueing (request) layer and taking BIOs at the top of the block layer. Jens wanted to support queueing, so he developed blk-mq. Given the two-year delay in getting blk-mq merged, I'm certain that developing the nvme driver as a bio-based driver was the right decision. I'm still not sure that the request based version of the nvme driver is actually an improvement. It's certainly more complicated!

Posted Oct 8, 2017 5:51 UTC (Sun) by madhatter (subscriber, #4665)

It was not something Jens said at all, and therefore it was also not a mistake in transcription: it was me adding context from background research. Talks at Linux Recipes were very short on context, because the audience was assumed to be up to speed on all kernel things; that makes for very efficient talks, but without some background they can be a bit difficult for a more general audience to approach. So any such error is entirely mine, and I thank you for the clarification.

Posted Oct 9, 2017 16:35 UTC (Mon) by axboe (subscriber, #904)

> I'm still not sure that the request based version of the nvme driver is actually an improvement. It's certainly more complicated!

IMHO, that statement is obviously false. If you look at the initial conversion, it removed far more lines than it added. It's utilizing shared code in blk-mq that other drivers get to use as well. These days the driver is obviously more complex than the initial version, but that's due to all of the new features that have been added. Years later at this point, I'd say that was definitely the right decision and an improvement. The simpler we can make drivers, the better off we are. Core code is much more heavily scrutinized than driver code, and shared code means we only have to fix bugs once.

Improvements in the block layer - More Simpsons

Posted Oct 8, 2017 13:59 UTC (Sun) by pr1268 (guest, #24648)

> [...] starts repeatedly asking it whether it's done yet, a behavior that will be familiar to any parent of small children on car journeys.

This, and the Homer Simpson eating donuts reference come together with:

for (;;)
{
    Bart & Lisa:    "Are we there yet?"
    Homer:          "NO!"
}

:-)

Posted Oct 9, 2017 16:38 UTC (Mon) by axboe (subscriber, #904)

That comment is perfect for the discussion of IO polling, since that's basically how that works. Missed opportunity on my part. :-)