The current development kernel remains 2.5.24. Linus has not
released any kernels - or surfaced on the linux-kernel mailing list - since
before OLS and the Kernel Summit. Some patches are beginning to show up in
his BitKeeper tree, however; they include some SCSI updates, an NTFS
update, and, interestingly, a change of the internal x86 clock frequency to
The current stable kernel release is still 2.4.18. No new 2.4.19
release candidates have been announced in the last week.
The latest 2.5 kernel status summary from
Guillaume Boissiere came out on July 3.
Kernel development news
A longstanding kernel feature request is a SCHED_IDLE
class. Tasks running as SCHED_IDLE would only run when the processor would
otherwise be idle. The "niceness" scheme in the current scheduler does not
provide this behavior: even the lowest-priority processes will run
sometimes. Users who want to search out encryption keys,
model proteins, or search for
extraterrestrial life on their systems generally want that work to not take
any time from other tasks running on the system. Thus the request for
In principle, SCHED_IDLE is not that hard to implement. The
problem, of course, is the classic priority inversion trap. If a
SCHED_IDLE process acquires an important shared resource, such as
an internal filesystem semaphore, there is no way to know how long the
process may have to wait before it can run long enough to release that
resource. A SCHED_IDLE process can be preempted at any time by a
higher-priority process; it could then keep needed resources unavailable
indefinitely. Priority inversion problems can come up by themselves; this
situation could also be brought about intentionally as a denial of service
So far, no solution to this problem has been implemented, so no
SCHED_IDLE patch has ever been merged into the kernel. It is
easier to simply ensure that every process makes a little progress
occasionally so that priority inversion problems resolve themselves.
Now Ingo Molnar has posted a patch which, he
claims, implements SCHED_IDLE (which he calls
SCHED_BATCH) in a safe way. Those who are curious are encouraged
to read his posting, which describes the work in far more detail than you
will find here.
The fundamental observation behind Ingo's approach is that processes only
hold important kernel resources, such as semaphores, when they are running
in kernel mode. If a SCHED_BATCH process is preempted when
running in user mode, it is safe to set that process aside indefinitely.
If, instead, it is running in kernel mode, it must be allowed to finish its
work within a reasonable period of time.
So Ingo's patch splits the schedule() call into two variants.
schedule_userspace() is called when the preempted process is
running in user mode; it implements the full SCHED_BATCH
semantics. schedule(), instead, is invoked when the process is in
kernel mode; it will handle a SCHED_BATCH process like any other,
normal process. Thus SCHED_BATCH processes essentially have their
priorities raised while running in kernel mode.
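The decision described above can be sketched in a few lines of user-space C. This is only an illustration of the rule, not code from Ingo's patch: the type names, the enum values, and the function may_defer_until_idle() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; the real patch has its own definitions. */
enum policy { POLICY_NORMAL, POLICY_BATCH };
enum cpu_mode { MODE_USER, MODE_KERNEL };

struct task {
    enum policy policy;
    int nice;               /* ordinary priority value, unused by the rule below */
};

/*
 * May a preempted task be set aside until the processor is otherwise idle?
 * Only a batch task that was interrupted in user mode qualifies. In kernel
 * mode it may hold a semaphore, so it has to compete like a normal task.
 */
bool may_defer_until_idle(const struct task *t, enum cpu_mode preempted_in)
{
    assert(t != NULL);
    return t->policy == POLICY_BATCH && preempted_in == MODE_USER;
}
```

In this sketch, the MODE_USER case corresponds to the schedule_userspace() path and the MODE_KERNEL case to the ordinary schedule() path.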
Raising the priority of processes that hold critical resources is a classic
response to priority inversion problems. Ingo's patch takes a slightly
simpler approach by treating the entire kernel as such a resource. This
patch will raise the priority of SCHED_BATCH processes a bit more
than is strictly necessary; the approach should be robust, however, and the
difference in scheduling behavior would be difficult to measure.
A number of people have complained about the removal of the IDE taskfile
operations from the 2.5 version of the driver. For anybody wondering why
people might want this obscure capability, consider this posting
from Scott Tillman. Scott is
working on the "port Linux to the XBox" effort. It turns out that the XBox
IDE drive will not allow access to its sectors until a special,
vendor-specific "password" command has been run. Taskfile access is needed
to be able to issue that password.
Of course, providing taskfile access so that this command can be issued
could, with a broad reading, be seen as a violation of the DMCA's
anticircumvention measures. It is a bit of a stretch, and depends on
whether the special command is just seen as vendor-specific initialization,
or whether it is really a "technological measure" for copyright
protection. Unfortunately, a broad reading of the DMCA seems to be in
vogue in the U.S. these days.
The XBox team, meanwhile, has a bunch of code it has written for dealing
with the XBox partition scheme and filesystem. They will port it to 2.5 if
it appears that it might actually get merged. That may well happen;
the fun of running Linux on Microsoft-subsidized hardware could be
James Bottomley gave a talk at OLS on the plans for improving the SCSI
subsystem. It went into more detail than the Kernel Summit presentation,
and included the outcomes from the Summit discussion. Places where work
will be done include:
- Elimination of the SCSI exception table
- Generic tagged command queueing
- Implementation of write barriers
- Reworking the error handler
- Multipath device support
- Getting rid of the midlayer
The SCSI exception table is an in-kernel list of about 90 (in 2.4.18) SCSI
devices which are known to be poorly behaved; this list only continues to
grow as manufacturers make more and more stupid devices. Many of these
devices misbehave if you try to access a logical unit number other than
zero; others demonstrate more creative sorts of problems. In any case,
this sort of constantly growing blacklist is not the kind of data structure
you want to have taking up more and more kernel space.
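For readers who have not looked at such a table, here is a minimal sketch of what a device blacklist lookup might look like. It is not the kernel's actual table: the flag names, the structure layout, and both entries are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical quirk flags, not the kernel's own. */
#define QUIRK_NONE       0u
#define QUIRK_LUN0_ONLY  1u   /* device misbehaves if LUNs above 0 are probed */
#define QUIRK_NO_TAGS    2u   /* device cannot handle tagged commands */

struct quirk_entry {
    const char *vendor;
    const char *model;
    unsigned int flags;
};

/* Made-up entries; the real list names real devices. */
static const struct quirk_entry quirk_table[] = {
    { "EXAMPLE", "CD-ROM 1000", QUIRK_LUN0_ONLY },
    { "ACME",    "TAPE-9",      QUIRK_LUN0_ONLY | QUIRK_NO_TAGS },
};

/* Return the quirk flags for a device, or QUIRK_NONE if it is not listed. */
unsigned int lookup_quirks(const char *vendor, const char *model)
{
    assert(vendor != NULL && model != NULL);
    for (size_t i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
        if (strcmp(quirk_table[i].vendor, vendor) == 0 &&
            strcmp(quirk_table[i].model, model) == 0)
            return quirk_table[i].flags;
    }
    return QUIRK_NONE;
}

/* LUN 0 is always probed; higher LUNs only if the device is not flagged. */
int should_probe_lun(const char *vendor, const char *model, int lun)
{
    if (lun == 0)
        return 1;
    return (lookup_quirks(vendor, model) & QUIRK_LUN0_ONLY) == 0;
}
```

Every newly discovered bad device means another entry in a table like this one, which is why it only grows.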
The answer here, of course, is to move this table (and its associated
processing) into user space. Rather than handle SCSI device scanning in
the kernel, the SCSI subsystem will just use the /sbin/hotplug
mechanism and let a user space program handle the details. James likes
this solution because it cleans up the SCSI code, and the hotplug code
support "is Greg KH's problem." Greg's enthusiasm was rather more
Tagged command queueing (TCQ) changes were discussed at the Kernel Summit
as well. Each SCSI adaptor driver has its own TCQ implementation, which is
not the right way to do it. So TCQ support will be done in the generic
block layer code instead (James once again notes, with satisfaction, that
in the block layer it's somebody else's problem).
One big remaining
problem is "tag starvation," where a disk ignores a request for a long time
while dealing with (newer) requests that it can satisfy more quickly.
Options for fixing this problem include using ordered tags (which force
the completion of all previous tagged operations) or just shutting down the
request queue until the neglected request gets handled. Either approach
could work; the request queue throttling technique is thought to be less
hard on the overall performance of the system.
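The queue throttling idea can be sketched as follows. This is an assumption about how such a scheme might be structured, not a description of any planned implementation: the structure, the function names, and the STARVATION_LIMIT threshold are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TAGS          8
#define STARVATION_LIMIT  4   /* hypothetical: completions a tag may be passed over */

struct tagged_queue {
    bool in_flight[MAX_TAGS];
    unsigned int passed_over[MAX_TAGS];  /* completions seen while this tag waited */
};

/* Throttle: no new submissions while any outstanding tag looks starved. */
bool may_submit(const struct tagged_queue *q)
{
    assert(q != NULL);
    for (size_t i = 0; i < MAX_TAGS; i++)
        if (q->in_flight[i] && q->passed_over[i] >= STARVATION_LIMIT)
            return false;
    return true;
}

/* Claim a free tag; returns its index, or -1 if throttled or full. */
int submit(struct tagged_queue *q)
{
    if (!may_submit(q))
        return -1;
    for (size_t i = 0; i < MAX_TAGS; i++) {
        if (!q->in_flight[i]) {
            q->in_flight[i] = true;
            q->passed_over[i] = 0;
            return (int)i;
        }
    }
    return -1;
}

/* The drive finished tag `done`; every other outstanding tag was passed over. */
void tag_completed(struct tagged_queue *q, size_t done)
{
    assert(q != NULL && done < MAX_TAGS && q->in_flight[done]);
    q->in_flight[done] = false;
    q->passed_over[done] = 0;
    for (size_t i = 0; i < MAX_TAGS; i++)
        if (q->in_flight[i])
            q->passed_over[i]++;
}
```

Once the neglected tag completes, may_submit() returns true again and the queue resumes. The drive is never forced to drain everything, which is roughly why throttling is expected to cost less than ordered tags.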
Write barriers are needed for journaling filesystem support; they can be
implemented with ordered tags. The real problem here, as it turns out, is
error handling. If a write barrier operation fails, subsequent operations
could be executed out of order. Another issue is the "queue full" problem:
the drive rejects the barrier operation because its command queue is full,
but then accepts a command issued after the barrier. This is a sort of
race condition which is difficult, if not impossible, to produce on real
systems, but it is a problem which can occur.
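One conceivable way to close the "queue full" window is to hold back later commands until a rejected barrier has been resent and accepted. The sketch below shows only that idea; it was not presented in the talk, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cmd_kind { CMD_WRITE, CMD_BARRIER };
enum drive_reply { REPLY_ACCEPTED, REPLY_QUEUE_FULL };

struct barrier_gate {
    bool barrier_waiting;   /* a barrier was rejected and must be resent first */
};

/* May a command of this kind be sent to the drive right now? */
bool gate_may_send(const struct barrier_gate *g, enum cmd_kind kind)
{
    assert(g != NULL);
    /* While a rejected barrier is waiting, only the barrier itself may go. */
    return !g->barrier_waiting || kind == CMD_BARRIER;
}

/* Record the drive's reply to a command that was sent. */
void gate_note_reply(struct barrier_gate *g, enum cmd_kind kind,
                     enum drive_reply reply)
{
    assert(g != NULL);
    if (kind == CMD_BARRIER)
        g->barrier_waiting = (reply == REPLY_QUEUE_FULL);
}
```

The gate only helps if no later command has already been handed to the drive when the rejection arrives, so real code would also have to serialize submission around the barrier.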
The current SCSI error handler is a "pluggable" mechanism which allows the
provision of operations for a set of predefined situations. The
"pluggable" interface has never been used - everybody uses the default error
handlers, which are seen as being heavy-handed and insufficiently smart.
The new error handler should also handle things like command cancellation -
a feature required by asynchronous I/O.
The new error handler should, instead, be message-oriented, allowing
greater flexibility in what sorts of situations can be dealt with. It
should also be stackable and available to higher levels. Volume managers
and RAID, for example, want a detailed picture of exactly what sort of
errors are happening so that they can respond intelligently; "bad block"
requires a different response than "drive on fire," but there is currently
no way for higher levels to tell the difference.
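To make the distinction concrete, here is a small sketch of how a higher layer might react differently once errors arrive as typed messages. The error classes and the RAID responses are invented for illustration; no such interface exists yet.

```c
#include <assert.h>

/* Hypothetical error messages a lower layer might pass upward. */
enum io_error {
    IO_ERR_BAD_BLOCK,     /* one sector is unreadable */
    IO_ERR_TRANSPORT,     /* cable or bus trouble; the device may be fine */
    IO_ERR_DEVICE_GONE    /* "drive on fire" */
};

/* Hypothetical responses a RAID layer might choose. */
enum raid_action {
    RAID_REMAP_BLOCK,     /* rebuild the block from redundancy and rewrite it */
    RAID_RETRY,           /* try the request again */
    RAID_FAIL_DEVICE      /* drop the device from the array */
};

enum raid_action raid_respond(enum io_error err)
{
    switch (err) {
    case IO_ERR_BAD_BLOCK:
        return RAID_REMAP_BLOCK;
    case IO_ERR_TRANSPORT:
        return RAID_RETRY;
    case IO_ERR_DEVICE_GONE:
        return RAID_FAIL_DEVICE;
    }
    assert(!"unknown io_error value");
    return RAID_FAIL_DEVICE;   /* conservative fallback if asserts are disabled */
}
```

With today's handler, all three cases would look the same to the RAID code.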
In the end, much of the error handling code needs to move into, of course,
the block layer. IDE drives also have errors, and higher-level code should
not have to know the difference. So, happily (for James), much of it
becomes somebody else's problem.
Support for multipath devices, too, should be implemented in the block
layer - and thus be somebody else's problem. One big issue with multipath
devices is the preservation of write barriers. A command which is meant to
execute after a write barrier could be sent via a different path and
overtake the barrier operation.
The death of the midlayer is expected to be "a slow process via
starvation." The internal SCSI request structure may be replaced by the
generic block level version, and much of the current SCSI functionality
will migrate up to the higher levels. The end result will be a vastly
thinner SCSI midlayer. This work, of course, will allow more common code to be
shared across disk subsystems. It also means that, for example, the
ide-scsi driver can be eliminated. Under the new system, it will be a
straightforward task to connect the high-level SCSI code with the low-level
This is all a big job, of course; it is not expected to be done by
the 2.5 feature freeze.
Patches and updates
Core kernel code
- Robert Love: 2.5: fair scheduler hints. "Scheduler hints are a way for a program to give
a "hint" to the scheduler about its present behavior in the hopes of the
scheduler consequently making better scheduling decisions."
(July 3, 2002)
Filesystems and block I/O
- Alasdair Kergon: device-mapper for 2.4. "Device-mapper is a light-weight driver designed to support
volume managers generically."
(June 27, 2002)
Page editor: Jonathan Corbet