LSFMM: Error handling and firmware updates

By Jonathan Corbet
April 25, 2013
LSFMM Summit 2013
The LSFMM 2013 storage-only track for Thursday, April 18, included sessions on the challenges of SCSI error handling and SATA firmware updates. The following writeup appears courtesy of Elena Zannoni and Martin Petersen.


Error handling

Several issues in the error handling code were discussed in a session led by Hannes Reinecke, James Smart, Roland Dreier, and Mike Christie.

First of all, aborting commands may take a long time on Fibre Channel, and there is no guarantee that the target device will actually do anything with the abort command. Among other things, that makes it hard to find a suitable timeout for the abort; there is no way to know how long it should realistically take. There may be a lot of commands outstanding to a failing device, each of which needs to be aborted separately; the stack then waits for all of the aborts to fail before deciding what to do next, so the total wait can be long. James Bottomley recommended implementing a strategy handler akin to the one in libsas. Some pieces could potentially be made generic and shared among all transport classes.

The other topic was that, if one device does not respond to error handling, the SCSI subsystem will end up escalating to an HBA (host bus adapter) reset. This HBA reset is done blindly, disrupting I/O to the other devices on the HBA. Often the HBA reset cannot be avoided, but the kernel should be smarter about HBA resets and check whether I/O is successfully completing on other targets on that HBA. If it is, dropping the failing target instead of resetting the HBA may be the answer.
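
The check described above might look roughly like the following sketch. It is a toy model in Python rather than kernel code; TargetStats, Action, and choose_escalation are hypothetical names invented for illustration, and the "recently completed" counter is an assumed statistic, not something the session specified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Action(Enum):
    DROP_TARGET = "drop_target"
    RESET_HBA = "reset_hba"


@dataclass
class TargetStats:
    """Hypothetical per-target counters sampled during error handling."""
    name: str
    completed_recently: int  # I/Os completed inside the sampling window
    outstanding: int         # I/Os still in flight


def choose_escalation(failing: str, targets: Sequence[TargetStats]) -> Action:
    """Pick the final escalation step for a target that ignores error handling.

    If any other target on the same HBA is still completing I/O, the HBA
    is presumably healthy, so only the failing target is dropped.
    Otherwise (including when there are no other targets) reset the HBA.
    """
    others = [t for t in targets if t.name != failing]
    if any(t.completed_recently > 0 for t in others):
        return Action.DROP_TARGET
    return Action.RESET_HBA
```

With one stalled target and one busy neighbor, the function returns Action.DROP_TARGET; if every target on the HBA has stopped completing I/O, it falls back to Action.RESET_HBA.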

A new error handling method was briefly discussed, namely using the enclosure management subsystem to try power cycling a non-responsive drive.

Firmware updates

Martin Petersen's storage-track session on firmware updates was meant to address a simple problem: the need to do offline updates on SATA devices leads to downtime that is not universally welcomed by customers. It would be nice to add some sort of online upgrade capability for this class of hardware.

Firmware upgrades are not a problem for SCSI devices because the "WRITE BUFFER" command allows the firmware to be downloaded in one piece. For SATA drives things are more complicated because there are several command variants, some of which require the firmware image to be split up into suitably sized chunks. The chunks need to be sent in order with no other I/O in between for the firmware upgrade to work. This is problematic on system drives that are never idle.
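
To make the chunking requirement concrete, here is a minimal Python sketch of splitting an image into ordered pieces. It assumes offsets and lengths are counted in 512-byte blocks; split_firmware and the blocks_per_chunk parameter are hypothetical, and the chunk size a real drive accepts is device-specific and not taken from the session.

```python
from typing import Iterator, Tuple

BLOCK_SIZE = 512  # assumed unit for offsets and transfer lengths


def split_firmware(image: bytes, blocks_per_chunk: int) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (offset_in_blocks, block_count, payload) for each chunk, in order."""
    if blocks_per_chunk <= 0:
        raise ValueError("blocks_per_chunk must be positive")
    if len(image) % BLOCK_SIZE != 0:
        raise ValueError("image length must be a multiple of 512 bytes")
    chunk_bytes = blocks_per_chunk * BLOCK_SIZE
    for start in range(0, len(image), chunk_bytes):
        payload = image[start:start + chunk_bytes]
        # The last chunk may be shorter than the others.
        yield start // BLOCK_SIZE, len(payload) // BLOCK_SIZE, payload
```

Each tuple would correspond to one download command; the sender has to issue them strictly in the order produced, which is where the "no other I/O in between" constraint becomes a problem on a busy drive.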

Various approaches were discussed; the consensus was to implement a "WRITE BUFFER" translation in libata that would use the SCSI layer's quiesce feature to ensure that pieces will be written without interference from other I/O requests.
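
The idea behind the consensus can be illustrated with a toy Python model; it is not libata or SCSI-layer code. ToyDevice, its methods, and update_firmware are hypothetical names: normal I/O that arrives while the device is quiesced is held back and replayed only after the last firmware chunk has gone out.

```python
from collections import deque
from contextlib import contextmanager


class ToyDevice:
    """Hypothetical stand-in for a drive and its request queue (not a kernel API)."""

    def __init__(self):
        self.quiesced = False
        self.deferred = deque()  # normal I/O held back while quiesced
        self.log = []            # order in which the drive saw commands

    def submit_io(self, request):
        """Normal I/O path: dispatch immediately unless the device is quiesced."""
        if self.quiesced:
            self.deferred.append(request)
        else:
            self.log.append(request)

    def send_chunk(self, index):
        """In this model, firmware chunks are allowed through the quiesce gate."""
        self.log.append(f"fw-chunk-{index}")

    @contextmanager
    def quiesce(self):
        """Hold back normal I/O for the duration of the block, then replay it."""
        self.quiesced = True
        try:
            yield self
        finally:
            self.quiesced = False
            while self.deferred:
                self.log.append(self.deferred.popleft())


def update_firmware(dev, n_chunks, interleaved_io=()):
    """Send n_chunks in order; interleaved_io maps a chunk index to I/O arriving after it."""
    arrivals = dict(interleaved_io)
    with dev.quiesce():
        for i in range(n_chunks):
            dev.send_chunk(i)
            if i in arrivals:
                dev.submit_io(arrivals[i])
```

Calling update_firmware(dev, 3, {0: "write-B", 1: "read-C"}) leaves the three chunks adjacent in dev.log, with the two deferred requests appended after them.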



Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds