Incrementally improving the SCSI subsystem
- Elimination of the SCSI exception table
- Generic tagged command queueing
- Implementation of write barriers
- Reworking the error handler
- Multipath device support
- Getting rid of the midlayer
The SCSI exception table is an in-kernel list of about 90 (in 2.4.18) SCSI devices which are known to be poorly behaved; this list only continues to grow as manufacturers make more and more stupid devices. Many of these devices misbehave if you try to access a logical unit number other than zero; others demonstrate more creative sorts of problems. In any case, this sort of constantly growing blacklist is not the kind of data structure you want to have taking up more and more kernel space.
The answer here, of course, is to move this table (and its associated processing) into user space. Rather than handle SCSI device scanning in the kernel, the SCSI subsystem will just use the /sbin/hotplug mechanism and let a user space program handle the details. James likes this solution because it cleans up the SCSI code, and because hotplug support "is Greg KH's problem." Greg's enthusiasm was rather more restrained.
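As a rough illustration of what such a user space helper might do, the sketch below looks up a device's identification strings in an exception table and decides which logical units to scan. Everything in it is an assumption made for illustration: the flag names, the table entries (placeholders, not real devices), and the functions lookup_quirks() and luns_to_scan() are hypothetical and are not taken from the kernel or from any hotplug script.

```python
from typing import NamedTuple

# Hypothetical quirk flags; the names are illustrative only.
NO_MULTI_LUN = 0x01    # do not probe logical units other than zero
SINGLE_LUN_IO = 0x02   # only one logical unit may be active at a time


class Quirk(NamedTuple):
    vendor: str
    model_prefix: str
    flags: int


# Placeholder entries; a real table would hold actual identification strings.
EXCEPTION_TABLE = [
    Quirk("ACME", "CD-ROM 100", NO_MULTI_LUN),
    Quirk("EXAMPLE", "TAPE", NO_MULTI_LUN | SINGLE_LUN_IO),
]


def lookup_quirks(vendor: str, model: str) -> int:
    """Return the quirk flags for a device, or 0 if it is not listed."""
    vendor = vendor.strip()   # identification strings are often space padded
    model = model.strip()
    for quirk in EXCEPTION_TABLE:
        if vendor == quirk.vendor and model.startswith(quirk.model_prefix):
            return quirk.flags
    return 0


def luns_to_scan(vendor: str, model: str, max_lun: int = 8) -> list:
    """Pick the logical units a scanner should probe for this device."""
    if lookup_quirks(vendor, model) & NO_MULTI_LUN:
        return [0]
    return list(range(max_lun))
```

With the table in user space, adding a newly discovered misbehaving device becomes a data file update rather than a kernel change.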
Tagged command queueing (TCQ) changes were discussed at the Kernel Summit as well. Each SCSI adaptor driver has its own TCQ implementation, which is not the right way to do it. So TCQ support will be done in the generic block layer code instead (James once again notes, with satisfaction, that in the block layer it's somebody else's problem).
One big remaining problem is "tag starvation," where a disk ignores a request for a long time while dealing with (newer) requests that it can satisfy more quickly. Options for fixing this problem include using ordered tags (which force the completion of all previous tagged operations) or just shutting down the request queue until the neglected request gets handled. Either approach could work; the request queue throttling technique is thought to be less hard on the overall performance of the system.
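The queue throttling idea can be sketched with a toy simulation. The model below is an assumption made for illustration, not the actual block layer design: the simulated drive always completes the cheapest queued command first, and the host optionally stops issuing new commands once the oldest outstanding command has been passed over max_bypass times. The function name and parameters are hypothetical.

```python
def completion_order(requests, queue_depth, max_bypass=None):
    """Return the order in which a toy drive completes `requests`.

    requests   : list of (name, cost) tuples in submission order
    queue_depth: number of commands the drive can hold at once
    max_bypass : if set, the host throttles the queue once the oldest
                 outstanding command has been bypassed this many times
    """
    pending = list(requests)   # not yet issued to the drive
    queued = []                # issued commands: [name, cost, times_bypassed]
    done = []
    while pending or queued:
        # Host side: top up the drive queue unless throttled.
        while pending and len(queued) < queue_depth:
            if max_bypass is not None and queued and queued[0][2] >= max_bypass:
                break          # throttle: let the starved command drain
            name, cost = pending.pop(0)
            queued.append([name, cost, 0])
        # Drive side: complete the cheapest queued command.
        pick = min(range(len(queued)), key=lambda i: queued[i][1])
        for older in queued[:pick]:
            older[2] += 1      # every older command was passed over again
        done.append(queued.pop(pick)[0])
    return done


reqs = [("slow", 100)] + [("fast%d" % i, 1) for i in range(6)]
unthrottled = completion_order(reqs, queue_depth=2)
throttled = completion_order(reqs, queue_depth=2, max_bypass=2)
```

In this toy model the expensive request is completed last when the queue is never throttled, while the throttled run services it as soon as the host stops feeding the drive newer work.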
Write barriers are needed for journaling filesystem support; they can be implemented with ordered tags. The real problem here, as it turns out, is error handling. If a write barrier operation fails, subsequent operations could be executed out of order. Another issue is the "queue full" problem: the drive rejects the barrier operation because its command queue is full, but then accepts a command issued after the barrier. This sort of race condition is difficult, if not impossible, to provoke deliberately on real systems, but it can occur.
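The "queue full" race can be reproduced in a small model. The sketch below is a toy under stated assumptions: the Drive class, the issue() function, and the hold_after_barrier option are all hypothetical, and holding back later commands is just one conceivable mitigation, not something the talk is reported to have settled on.

```python
class Drive:
    """Toy drive with a fixed number of command queue slots."""

    def __init__(self, depth):
        self.depth = depth
        self.queue = []
        self.accepted = []     # order in which commands were accepted

    def submit(self, cmd):
        if len(self.queue) >= self.depth:
            return False       # models a "queue full" rejection
        self.queue.append(cmd)
        self.accepted.append(cmd)
        return True

    def complete_one(self):
        if self.queue:
            self.queue.pop(0)


def issue(cmds, drive, hold_after_barrier):
    """Issue cmds in order and return the order the drive accepted them."""
    retry = []
    for cmd in cmds:
        if hold_after_barrier and "BARRIER" in retry:
            retry.append(cmd)  # nothing may pass a rejected barrier
            continue
        if not drive.submit(cmd):
            retry.append(cmd)
            drive.complete_one()   # a slot frees up just after the rejection
    # Retry the held commands in order until the drive has taken them all.
    while retry:
        if drive.submit(retry[0]):
            retry.pop(0)
        else:
            drive.complete_one()
    return drive.accepted


cmds = ["W1", "BARRIER", "W2"]
naive = issue(cmds, Drive(depth=1), hold_after_barrier=False)
held = issue(cmds, Drive(depth=1), hold_after_barrier=True)
```

In the naive run the write issued after the barrier is accepted before the barrier itself; in the second run the host keeps it back until the barrier has been accepted.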
The current SCSI error handler is a "pluggable" mechanism which allows the provision of operations for a set of predefined situations. The "pluggable" interface has never been used - everybody uses the default error handlers, which are seen as being heavy-handed and insufficiently smart. The new error handler should also handle things like command cancellation - a feature required by asynchronous I/O.
The new error handler should, instead, be message-oriented, allowing greater flexibility in what sorts of situations can be dealt with. It should also be stackable and available to higher levels. Volume managers and RAID, for example, want a detailed picture of exactly what sort of errors are happening so that they can respond intelligently; "bad block" requires a different response than "drive on fire," but there is currently no way for higher levels to tell the difference.
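One way to picture a message-oriented, stackable handler is a chain of layers that each get to inspect an error message and either act on it or pass it upward. The sketch below is purely illustrative: the ErrorKind values, the two handler classes, and the action strings are assumptions, not a description of any proposed kernel interface.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ErrorKind(Enum):
    BAD_BLOCK = auto()
    TIMEOUT = auto()
    DRIVE_ON_FIRE = auto()


@dataclass
class ErrorMessage:
    kind: ErrorKind
    device: str
    sector: Optional[int] = None


class TransportHandler:
    """Lowest layer: only knows how to retry transient failures."""

    def handle(self, msg):
        if msg.kind is ErrorKind.TIMEOUT:
            return "retry command"
        return None            # not ours; pass the message up the stack


class RaidHandler:
    """Higher layer: can respond differently to each kind of failure."""

    def handle(self, msg):
        if msg.kind is ErrorKind.BAD_BLOCK:
            return f"rewrite sector {msg.sector} from mirror"
        if msg.kind is ErrorKind.DRIVE_ON_FIRE:
            return f"drop {msg.device} from array"
        return None


def dispatch(stack, msg):
    """Offer msg to each layer, lowest first, until one handles it."""
    for handler in stack:
        action = handler.handle(msg)
        if action is not None:
            return action
    return "return I/O error to caller"


stack = [TransportHandler(), RaidHandler()]
```

Because the message carries the kind of failure rather than a bare error code, the RAID layer in this sketch can tell a bad block from a dead drive and choose a different response for each.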
In the end, much of the error handling code needs to move, of course, into the block layer. IDE drives also have errors, and higher-level code should not have to know the difference. So, happily (for James), much of it becomes somebody else's problem.
Support for multipath devices, too, should be implemented in the block layer - and thus be somebody else's problem. One big issue with multipath devices is the preservation of write barriers. A command which is meant to execute after a write barrier could be sent via a different path and overtake the barrier operation.
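The overtaking problem can be shown with a toy timing model. The sketch below assumes two paths with fixed, different latencies and round-robin path selection; the function name, the latencies, and the drain_on_barrier option (wait for in-flight commands before sending a barrier, and for the barrier before sending anything else) are hypothetical choices made for illustration only.

```python
def arrival_order(cmds, latencies, drain_on_barrier):
    """Return the order in which cmds reach the device.

    cmds             : commands in submission order
    latencies        : per-path delivery delay; paths are used round-robin
    drain_on_barrier : if True, a barrier waits for everything in flight,
                       and later commands wait for the barrier to arrive
    """
    now = 0                    # time at which the next command is sent
    last_arrival = 0           # latest arrival time of anything sent so far
    arrivals = []
    for i, cmd in enumerate(cmds):
        path = i % len(latencies)
        is_barrier = cmd == "BARRIER"
        if drain_on_barrier and is_barrier:
            now = max(now, last_arrival)   # let earlier commands land first
        arrive = now + latencies[path]
        arrivals.append((arrive, i, cmd))
        last_arrival = max(last_arrival, arrive)
        if drain_on_barrier and is_barrier:
            now = arrive       # hold later commands until the barrier lands
        else:
            now += 1
    return [cmd for _, _, cmd in sorted(arrivals)]


cmds = ["W1", "BARRIER", "W2"]
racy = arrival_order(cmds, latencies=[1, 5], drain_on_barrier=False)
safe = arrival_order(cmds, latencies=[1, 5], drain_on_barrier=True)
```

Without draining, the write sent down the fast path arrives before the barrier that was sent down the slow one; draining preserves the ordering at the cost of idling both paths around each barrier.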
The death of the midlayer is expected to be "a slow process via starvation." The internal SCSI request structure may be replaced by the generic block-level version, and much of the current SCSI functionality will migrate up to the higher layers, leaving a vastly thinner SCSI midlayer. This work, of course, will allow more common code to be shared across disk subsystems. It also means that, for example, the ide-scsi driver can be eliminated. Under the new system, it will be a straightforward task to connect the high-level SCSI code with the low-level IDE transport.
This is all a big job, of course; it is not expected to be done by the 2.5 feature freeze.
