LSFMM: Error returns

By Jake Edge
May 1, 2013

Filesystem handling of errors returned from the SCSI subsystem was the topic of the final combined session for the Storage and Filesystem tracks at the 2013 LSFMM Summit. It is a follow-on to an earlier session that got a little side-tracked. The discussion was led by Hannes Reinecke and James Bottomley.

Reinecke displayed the code from __scsi_error_from_host_byte() to show attendees the existing errors that can be returned from the SCSI layer. He asked Dave Chinner, who had earlier expressed interest in seeing errors other than EIO be returned, how he would like to see those change. Chinner responded that it was more than just what the error code is; filesystem developers needed to understand what the actual failure is. Most important, from his perspective, is whether it could be a temporary failure or not.

Chinner gave the example of a Fibre Channel link failure, which can take up to 240 seconds to come back. After 30 seconds, though, the filesystem has received an EIO and has shut down. If the Fibre Channel comes back in a few minutes, the filesystem has done the wrong thing from the user's perspective. In that case, the filesystem could have tried again in a few minutes and only shut down after getting 3-4 failures in a row. Overall, filesystems are looking for "classes of errors" that will help the higher levels determine whether to retry or give up, he said.
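The retry behavior Chinner described can be sketched as a small consecutive-failure counter. This is only an illustration under stated assumptions: `struct retry_policy`, `retry_policy_record()`, and the verdict names are hypothetical and do not correspond to any existing kernel interface, and the threshold of three failures is simply taken from the "3-4 failures in a row" figure above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-filesystem retry state; not an actual kernel structure. */
struct retry_policy {
    unsigned int max_consecutive_failures; /* e.g. 3 or 4, per the discussion */
    unsigned int consecutive_failures;     /* failures seen since last success */
};

enum io_verdict { IO_OK, IO_RETRY_LATER, IO_SHUT_DOWN };

void retry_policy_init(struct retry_policy *p, unsigned int max_failures)
{
    assert(max_failures > 0);
    p->max_consecutive_failures = max_failures;
    p->consecutive_failures = 0;
}

/*
 * Record the result of one I/O attempt (0 on success, negative errno on
 * failure) and decide what the filesystem should do next.
 */
enum io_verdict retry_policy_record(struct retry_policy *p, int err)
{
    if (err == 0) {
        /* Any success resets the run of failures. */
        p->consecutive_failures = 0;
        return IO_OK;
    }

    p->consecutive_failures++;
    if (p->consecutive_failures >= p->max_consecutive_failures)
        return IO_SHUT_DOWN;   /* give up only after repeated failures */

    return IO_RETRY_LATER;     /* caller waits (say, a few minutes) and retries */
}
```

With a policy initialized to three failures, a link that recovers after two failed attempts resets the counter, so the filesystem stays up rather than shutting down on the first EIO.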

But, as Martin Petersen pointed out, the storage does not know if the error is temporary. Chinner noted that there are certain types of errors, such as read failures, that are often temporary. Alasdair Kergon was concerned about which level should make that kind of decision. If the kernel developers are not careful, every layer and filesystem will repeat the same kind of logic to make those decisions.

Five or six distinct error codes, one for each class of error, is what Chinner would like to see. He suggested sticking to POSIX error codes (rather than a new set established between storage and filesystem developers) in case they end up leaking to user space. He reiterated that the filesystem developers don't particularly care what the error code is, but they want to know if the error is transient. He suggested something like "transient read error", "permanent read error", and the same for writes, but Petersen noted, again, that no one really knows if the errors are transient.
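One way to picture the "classes of errors" idea is a helper that folds a POSIX error code and the I/O direction into one of the classes Chinner named. The sketch below is hypothetical throughout: the enum and function names are invented for illustration, and the choice of `ENOLINK` and `ETIMEDOUT` as "transient" is an assumption made for the example, not something agreed on in the session. As Petersen observed, the storage layer may not actually know whether an error is transient.

```c
#include <errno.h>
#include <stdbool.h>

enum io_dir { IO_DIR_READ, IO_DIR_WRITE };

/* The classes suggested in the discussion, plus "no error". */
enum io_error_class {
    IO_CLASS_NONE,
    IO_CLASS_TRANSIENT_READ,
    IO_CLASS_PERMANENT_READ,
    IO_CLASS_TRANSIENT_WRITE,
    IO_CLASS_PERMANENT_WRITE,
};

/*
 * Assumption: which errno values count as transient is purely illustrative.
 * Anything unrecognized, including plain EIO, is treated as permanent,
 * which mirrors the shut-down-on-EIO behavior described in the article.
 */
bool errno_looks_transient(int err)
{
    switch (err < 0 ? -err : err) {
    case ENOLINK:    /* e.g. a link that may come back */
    case ETIMEDOUT:  /* e.g. a request that timed out */
        return true;
    default:
        return false;
    }
}

/* err is 0 on success or a negative errno, kernel style. */
enum io_error_class classify_io_error(enum io_dir dir, int err)
{
    if (err == 0)
        return IO_CLASS_NONE;

    if (dir == IO_DIR_READ)
        return errno_looks_transient(err) ? IO_CLASS_TRANSIENT_READ
                                          : IO_CLASS_PERMANENT_READ;

    return errno_looks_transient(err) ? IO_CLASS_TRANSIENT_WRITE
                                      : IO_CLASS_PERMANENT_WRITE;
}
```

Keeping the classification in one shared helper, rather than in each filesystem, is one possible answer to Kergon's concern about every layer repeating the same logic.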

There are things a filesystem can do if it knows that there was some kind of media error on a write, Ric Wheeler said. The data could be written to another disk, or written in place elsewhere on the disk. But that can only happen if the filesystem gets some indication other than EIO.

If the errors returned are to change, Chinner asked, what can be done to document those changes? Documentation in the kernel tree is fine, but there needs to be more widespread notification. In the end, it was suggested that posting the information to the fs-devel mailing list (and also to LWN) is the right way to go about it. While some changes were agreed upon in the meeting, it's not clear how and when (or even if) those exact error return values will be used—though some changes are likely.

[ Thanks to Elena Zannoni, whose detailed notes were a nice supplement to my own. ]


LSFMM: Error returns

Posted May 4, 2013 21:08 UTC (Sat) by giraffedata (subscriber, #1954)

Error codes are not very forward-thinking. What would be more flexible is to return the actual facts desired. For example, an indication of "how transient is this error?" with values saying "transient", "not transient", and "don't know". Separately, an indication of "is the problem specific to this storage address?", etc.
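As an illustration of the suggestion in this comment (not code from the commenter or from the kernel), the "facts" could be carried in a small structure alongside the POSIX error code. Every name here is hypothetical, and the two decision helpers are just one possible reading of how a filesystem might use such facts.

```c
#include <errno.h>
#include <stdbool.h>

/* Three-valued answer: the lower layer may simply not know. */
enum fact { FACT_NO, FACT_YES, FACT_UNKNOWN };

/* Hypothetical error description returned with a failed I/O. */
struct io_error_facts {
    int posix_errno;            /* conventional code, e.g. EIO */
    enum fact transient;        /* "how transient is this error?" */
    enum fact address_specific; /* "is the problem specific to this storage address?" */
};

/* Retry the same block only when the error is known to be transient. */
bool should_retry_in_place(const struct io_error_facts *f)
{
    return f->transient == FACT_YES;
}

/*
 * Consider writing the data elsewhere when the error is tied to this
 * address and is not known to be transient.
 */
bool should_relocate(const struct io_error_facts *f)
{
    return f->transient != FACT_YES && f->address_specific == FACT_YES;
}
```

A caller that gets "don't know" for both facts falls through both helpers and is left with today's behavior: handle the plain error code.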

LSFMM: Error returns

Posted May 5, 2013 12:42 UTC (Sun) by istenrot (subscriber, #69564)

How about returning SCSI ASCQ equivalent codes in parallel with POSIX error codes? File systems might be interested in SCSI layer warning messages like "WARNING - NON-VOLATILE CACHE NOW VOLATILE" or "WARNING - POWER LOSS EXPECTED".

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds