Incrementally improving the SCSI subsystem
[Posted July 3, 2002 by corbet]
James Bottomley gave a talk at OLS on the plans for improving the SCSI
subsystem. It went into more detail than the Kernel Summit presentation,
and included the outcomes from the Summit discussion. Places where work
will be done include:
- Elimination of the SCSI exception table
- Generic tagged command queueing
- Implementation of write barriers
- Reworking the error handler
- Multipath device support
- Getting rid of the midlayer
The SCSI exception table is an in-kernel list of about 90 (in 2.4.18) SCSI
devices which are known to be poorly behaved; this list only continues to
grow as manufacturers make more and more stupid devices. Many of these
devices misbehave if you try to access a logical unit number other than
zero; others demonstrate more creative sorts of problems. In any case,
this sort of constantly growing blacklist is not the kind of data structure
you want to have taking up more and more kernel space.
The answer here, of course, is to move this table (and its associated
processing) into user space. Rather than handle SCSI device scanning in
the kernel, the SCSI subsystem will just use the /sbin/hotplug
mechanism and let a user space program handle the details. James likes
this solution because it cleans up the SCSI code, and supporting the
hotplug code "is Greg KH's problem." Greg's enthusiasm was rather more
restrained.
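To make the idea concrete, here is a minimal sketch of the kind of lookup a user-space scanning helper might perform. Everything in it is an assumption made for illustration: the flag names, the table layout, the function names, and the invented "ACME" entries do not come from the kernel's actual exception table or from the talk.

```python
# Hypothetical quirk flags; the names are illustrative, not the kernel's.
NO_MULTI_LUN = 0x01   # probing LUNs other than 0 upsets the device
SPARSE_LUN = 0x02     # LUN numbering may have gaps

# (vendor, model prefix, flags). These entries are invented for the sketch.
QUIRKS = [
    ("ACME", "CDR-1000", NO_MULTI_LUN),
    ("ACME", "ARRAY", SPARSE_LUN),
]


def lookup_quirks(vendor, model):
    """Return the quirk flags for a device, or 0 if it is not listed."""
    vendor = vendor.strip().upper()
    model = model.strip().upper()
    for q_vendor, q_prefix, flags in QUIRKS:
        if vendor == q_vendor and model.startswith(q_prefix):
            return flags
    return 0


def luns_to_probe(vendor, model, max_lun=8):
    """Decide which LUNs a user-space scanner should try on this device."""
    if lookup_quirks(vendor, model) & NO_MULTI_LUN:
        return [0]          # only LUN 0 is safe to touch
    return list(range(max_lun))
```

With the table in user space, adding a newly discovered misbehaving device becomes a matter of editing data rather than shipping a new kernel.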
Tagged command queueing (TCQ) changes were discussed at the Kernel Summit
as well. Each SCSI adaptor driver has its own TCQ implementation, which is
not the right way to do it. So TCQ support will be done in the generic
block layer code instead (James once again notes, with satisfaction, that
in the block layer it's somebody else's problem).
One big remaining problem is "tag starvation," where a disk ignores a
request for a long time while dealing with (newer) requests that it can
satisfy more quickly. Options for fixing this problem include using
ordered tags (which force
the completion of all previous tagged operations) or just shutting down the
request queue until the neglected request gets handled. Either approach
could work; the request queue throttling technique is thought to be less
hard on the overall performance of the system.
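The throttling option can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name, the `starvation_limit` parameter, and the policy of counting how many newer commands have overtaken an older one are all hypothetical, not a description of what the block layer does.

```python
class ThrottledQueue:
    """Toy model: stop issuing commands while an old one is being starved."""

    def __init__(self, starvation_limit=4):
        self.starvation_limit = starvation_limit
        self.overtaken = {}     # tag -> count of newer commands completed first
        self.issue_order = []   # outstanding tags, oldest first
        self.throttled = False

    def can_issue(self):
        return not self.throttled

    def issue(self, tag):
        if self.throttled:
            raise RuntimeError("queue is throttled")
        self.overtaken[tag] = 0
        self.issue_order.append(tag)

    def complete(self, tag):
        position = self.issue_order.index(tag)
        # Every older command still outstanding has just been overtaken.
        for older in self.issue_order[:position]:
            self.overtaken[older] += 1
        self.issue_order.remove(tag)
        del self.overtaken[tag]
        # Stay throttled while any outstanding command is over the limit.
        self.throttled = any(
            count >= self.starvation_limit
            for count in self.overtaken.values()
        )
```

Once the queue is throttled, the drive eventually runs out of newer work and has to service the neglected request, at which point `complete()` lifts the throttle again.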
Write barriers are needed for journaling filesystem support; they can be
implemented with ordered tags. The real problem here, as it turns out, is
error handling. If a write barrier operation fails, subsequent operations
could be executed out of order. Another issue is the "queue full" problem:
the drive rejects the barrier operation because its command queue is full,
but then accepts a command issued after the barrier. This is a sort of
race condition which is difficult, if not impossible, to produce on real
systems, but it is a problem which can occur.
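The "queue full" race is easier to see in a small simulation. The sketch below is a toy model with hypothetical names (`Drive`, `send`, `hold_after_reject`); the mitigation it shows, holding back every later command once one has been rejected, is an assumption added for illustration and not something the talk proposed.

```python
class Drive:
    """Toy drive with a fixed-depth command queue."""

    def __init__(self, depth):
        self.depth = depth
        self.queue = []
        self.accepted = []   # order in which commands were accepted

    def submit(self, cmd):
        if len(self.queue) >= self.depth:
            return False     # models a "queue full" rejection
        self.queue.append(cmd)
        self.accepted.append(cmd)
        return True

    def finish_one(self):
        if self.queue:
            self.queue.pop(0)


def send(drive, commands, hold_after_reject):
    """Send commands in order and return the order the drive accepted them."""
    retry = []
    for cmd in commands:
        if retry and hold_after_reject:
            retry.append(cmd)    # stay behind the rejected command
            continue
        if not drive.submit(cmd):
            retry.append(cmd)
            drive.finish_one()   # the race: a slot frees right after the reject
    for cmd in retry:
        while not drive.submit(cmd):
            drive.finish_one()
    return drive.accepted
```

With a queue depth of two and the sequence `w1, w2, barrier, w3`, calling `send()` without holding lets `w3` slip in ahead of the barrier; with holding, the barrier keeps its place.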
The current SCSI error handler is a "pluggable" mechanism which allows the
provision of operations for a set of predefined situations. The
"pluggable" interface has never been used - everybody uses the default error
handlers, which are seen as being heavy-handed and insufficiently smart.
Any replacement must also handle things like command cancellation -
a feature required by asynchronous I/O.
The new error handler should, instead, be message-oriented, allowing
greater flexibility in what sorts of situations can be dealt with. It
should also be stackable and available to higher levels. Volume managers
and RAID, for example, want a detailed picture of exactly what sort of
errors are happening so that they can respond intelligently; "bad block"
requires a different response than "drive on fire," but there is currently
no way for higher levels to tell the difference.
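As a rough sketch of what "message-oriented" and "stackable" could mean in practice, consider the following. The message fields, the error kinds, and the handler names are all hypothetical; nothing here reflects an interface described in the talk.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ErrorKind(Enum):
    TRANSIENT = auto()      # e.g. a bus reset; retrying is enough
    MEDIUM_ERROR = auto()   # e.g. a bad block
    DEVICE_GONE = auto()    # the whole drive has failed


@dataclass
class ErrorMessage:
    device: str
    kind: ErrorKind
    sector: Optional[int] = None


class HandlerStack:
    """Offer each error to the handlers in turn, lowest layer first."""

    def __init__(self):
        self.handlers = []

    def push(self, handler):
        self.handlers.append(handler)

    def dispatch(self, msg):
        for handler in self.handlers:
            action = handler(msg)
            if action is not None:
                return action
        return "fail request"   # nobody claimed the error


def driver_handler(msg):
    # The low-level layer deals only with errors it can fix by itself.
    if msg.kind is ErrorKind.TRANSIENT:
        return "retry command"
    return None


def raid_handler(msg):
    # A RAID layer can tell "bad block" from "drive on fire" and act on it.
    if msg.kind is ErrorKind.MEDIUM_ERROR:
        return f"rewrite sector {msg.sector} from mirror"
    if msg.kind is ErrorKind.DEVICE_GONE:
        return f"fail {msg.device} out of the array"
    return None
```

The point of the structure is that the message carries enough detail for each layer to decide whether the error is its business, instead of every failure collapsing into a single generic I/O error.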
In the end, much of the error handling code needs to move into, of course,
the block layer. IDE drives also have errors, and higher-level code should
not have to know the difference. So, happily (for James), much of it
becomes somebody else's problem.
Support for multipath devices, too, should be implemented in the block
layer - and thus be somebody else's problem. One big issue with multipath
devices is the preservation of write barriers. A command which is meant to
execute after a write barrier could be sent via a different path and
overtake the barrier operation.
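One conceivable way to keep barriers intact across paths is to drain all paths before and after each barrier. The sketch below models that policy only; the `schedule` function and its "rounds" representation are assumptions for illustration, and the approach trades away parallelism around every barrier.

```python
def schedule(commands, paths):
    """Group commands into rounds that must complete one after another.

    Commands within a round are spread across paths and may finish in any
    order. A barrier gets a round of its own, so nothing sent down another
    path can overtake it.
    """
    rounds = []
    current = []
    next_path = 0
    for cmd in commands:
        if cmd == "barrier":
            if current:
                rounds.append(current)   # drain everything issued so far
            rounds.append([(paths[0], cmd)])
            current = []
            continue
        current.append((paths[next_path], cmd))
        next_path = (next_path + 1) % len(paths)
    if current:
        rounds.append(current)
    return rounds
```

For the sequence `w1, w2, barrier, w3` over two paths, the two writes before the barrier share a round, the barrier runs alone, and `w3` is not issued on any path until the barrier's round has finished.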
The death of the midlayer is expected to be "a slow process via
starvation." The internal SCSI request structure may be replaced by the
generic block level version, and much of the current SCSI functionality
will migrate up to the higher levels. The end result will be a vastly
thinner SCSI midlayer. This work, of course, will allow more common code to be
shared across disk subsystems. It also means that, for example, the
ide-scsi driver can be eliminated. Under the new system, it will be a
straightforward task to connect the high-level SCSI code with the low-level
IDE transport.
This is all a big job, of course; it is not expected to be done by
the 2.5 feature freeze.