
Kernel development

Brief items

Current release status

The current development kernel remains 2.5.24. Linus has not released any kernels - or surfaced on the linux-kernel mailing list - since before OLS and the Kernel Summit. Some patches are beginning to show up in his BitKeeper tree, however; they include some SCSI updates, an NTFS update, and, interestingly, a change of the internal x86 clock frequency to 1000 Hz.

The current stable kernel release is still 2.4.18. No new 2.4.19 release candidates have been announced in the last week.

The latest 2.5 kernel status summary from Guillaume Boissiere came out on July 3.


Kernel development news

A safe SCHED_IDLE implementation

A longstanding kernel feature request is a SCHED_IDLE scheduler class. Tasks running as SCHED_IDLE would only run when the processor would otherwise be idle. The "niceness" scheme in the current scheduler does not provide this behavior: even the lowest-priority processes will run sometimes. Users who want to search out encryption keys, model proteins, or search for extraterrestrial life on their systems generally want that work to not take any time from other tasks running on the system. Thus the request for SCHED_IDLE.

In principle, SCHED_IDLE is not that hard to implement. The problem, of course, is the classic priority inversion trap. If a SCHED_IDLE process acquires an important shared resource, such as an internal filesystem semaphore, there is no way to know how long the process may have to wait before it can run long enough to release that resource. A SCHED_IDLE process can be preempted at any time by a higher-priority process; it could then keep needed resources unavailable indefinitely. Priority inversion problems can come up by themselves; this situation could also be brought about intentionally as a denial of service attack.

So far, no solution to this problem has been implemented, so no SCHED_IDLE patch has ever been merged into the kernel. It is easier to simply ensure that every process makes a little progress occasionally so that priority inversion problems resolve themselves.

Now Ingo Molnar has posted a patch which, he claims, implements SCHED_IDLE (which he calls SCHED_BATCH) in a safe way. Those who are curious are encouraged to read his posting, which describes the work in far more detail than you will find here.

The fundamental observation behind Ingo's approach is that processes only hold important kernel resources, such as semaphores, when they are running in kernel mode. If a SCHED_BATCH process is preempted when running in user mode, it is safe to set that process aside indefinitely. If, instead, it is running in kernel mode, it must be allowed to finish its work within a reasonable period of time.

So Ingo's patch splits the schedule() call into two variants. schedule_userspace() is called when the preempted process is running in user mode; it implements the full SCHED_BATCH semantics. schedule(), instead, is invoked when the process is in kernel mode; it will handle a SCHED_BATCH process like any other, normal process. Thus SCHED_BATCH processes essentially have their priorities raised while running in kernel mode.

Raising the priority of processes that hold critical resources is a classic response to priority inversion problems. Ingo's patch takes a slightly simpler approach by treating the entire kernel as such a resource. This patch will raise the priority of SCHED_BATCH processes a bit more than is strictly necessary; the approach should be robust, however, and the difference in scheduling behavior would be difficult to measure.
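The user-mode/kernel-mode distinction can be illustrated with a small toy model. This is only a sketch of the idea, not code from Ingo's patch: the struct, the enum names, and pick_next() are all invented for illustration, and a real scheduler works with run queues and priorities rather than a linear scan.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for this sketch; not the names used in the patch. */
enum policy { POLICY_NORMAL, POLICY_BATCH };
enum cpu_mode { MODE_USER, MODE_KERNEL };

struct task {
    const char *name;
    enum policy policy;
    enum cpu_mode mode;   /* mode the task was in when it was preempted */
    int runnable;
};

/*
 * A batch task is only treated as "idle class" while it sits in user mode.
 * In kernel mode it may hold a semaphore, so it competes like a normal task.
 */
static int effectively_batch(const struct task *t)
{
    return t->policy == POLICY_BATCH && t->mode == MODE_USER;
}

/*
 * Toy selection: return the first runnable task that is not effectively
 * batch; fall back to a batch task only when nothing else can run.
 * Returns NULL when no task is runnable at all.
 */
const struct task *pick_next(const struct task *tasks, size_t n)
{
    const struct task *fallback = NULL;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].runnable)
            continue;
        if (!effectively_batch(&tasks[i]))
            return &tasks[i];
        if (fallback == NULL)
            fallback = &tasks[i];
    }
    return fallback;
}
```

In this model a batch task preempted in user mode never displaces a runnable normal task, while the same task preempted in kernel mode is picked like any other, which is the "priority raised while in the kernel" effect described above.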


A use for IDE taskfile access

A number of people have complained about the removal of the IDE taskfile operations from the 2.5 version of the driver. For anybody wondering why people might want this obscure capability, consider this posting from Scott Tillman. Scott is working on the "port Linux to the XBox" effort. It turns out that the XBox IDE drive will not allow access to its sectors until a special, vendor-specific "password" command has been run. Taskfile access is needed to be able to issue that password.

Of course, providing taskfile access so that this command can be issued could, with a broad reading, be seen as a violation of the DMCA's anticircumvention measures. It is a bit of a stretch, and depends on whether the special command is just seen as vendor-specific initialization, or whether it is really a "technological measure" for copyright protection. Unfortunately, a broad reading of the DMCA seems to be in vogue in the U.S. these days.

The XBox team, meanwhile, has a bunch of code it has written for dealing with the XBox partition scheme and filesystem. They will port it to 2.5 if it appears that it might actually get merged. That may well happen; the fun of running Linux on Microsoft-subsidized hardware could be irresistible.


Incrementally improving the SCSI subsystem

James Bottomley gave a talk at OLS on the plans for improving the SCSI subsystem. It went into more detail than the Kernel Summit presentation, and included the outcomes from the Summit discussion. Places where work will be done include:
  • Elimination of the SCSI exception table
  • Generic tagged command queueing
  • Implementation of write barriers
  • Reworking the error handler
  • Multipath device support
  • Getting rid of the midlayer

The SCSI exception table is an in-kernel list of about 90 (in 2.4.18) SCSI devices which are known to be poorly behaved; this list only continues to grow as manufacturers make more and more stupid devices. Many of these devices misbehave if you try to access a logical unit number other than zero; others demonstrate more creative sorts of problems. In any case, this sort of constantly growing blacklist is not the kind of data structure you want to have taking up more and more kernel space.

The answer here, of course, is to move this table (and its associated processing) into user space. Rather than handle SCSI device scanning in the kernel, the SCSI subsystem will just use the /sbin/hotplug mechanism and let a user space program handle the details. James likes this solution because it cleans up the SCSI code, and the hotplug support "is Greg KH's problem." Greg's enthusiasm was rather more restrained.
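As a rough picture of what a user-space scanning helper might do with such a table, here is a minimal lookup sketch. Everything in it is assumed for illustration: the flag names, the table entries (which are made-up devices), and lookup_quirks() do not come from the kernel or from any hotplug script, and a real helper would more likely read the list from a data file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical quirk flags for this sketch. */
#define QUIRK_NONE        0u
#define QUIRK_SINGLE_LUN  (1u << 0)  /* only probe logical unit 0 */
#define QUIRK_SLOW_SPINUP (1u << 1)  /* allow extra time before giving up */

struct quirk_entry {
    const char *vendor;  /* must match exactly */
    const char *model;   /* matched as a prefix of the reported model */
    unsigned flags;
};

/* Made-up entries; a real list would come from a file kept in user space. */
static const struct quirk_entry quirk_table[] = {
    { "ACME",    "CD-9000", QUIRK_SINGLE_LUN },
    { "EXAMPLE", "ZIPPY",   QUIRK_SINGLE_LUN | QUIRK_SLOW_SPINUP },
};

/* Return the quirk flags for a device, or QUIRK_NONE if it is not listed. */
unsigned lookup_quirks(const char *vendor, const char *model)
{
    assert(vendor != NULL && model != NULL);
    for (size_t i = 0; i < sizeof quirk_table / sizeof quirk_table[0]; i++) {
        const struct quirk_entry *e = &quirk_table[i];

        if (strcmp(vendor, e->vendor) == 0 &&
            strncmp(model, e->model, strlen(e->model)) == 0)
            return e->flags;
    }
    return QUIRK_NONE;
}
```

A helper built this way would consult the table once per discovered device and, for example, skip probing logical units above zero when QUIRK_SINGLE_LUN is set.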

Tagged command queueing (TCQ) changes were discussed at the Kernel Summit as well. Each SCSI adaptor driver has its own TCQ implementation, which is not the right way to do it. So TCQ support will be done in the generic block layer code instead (James once again notes, with satisfaction, that in the block layer it's somebody else's problem).

One big remaining problem is "tag starvation," where a disk ignores a request for a long time while dealing with (newer) requests that it can satisfy more quickly. Options for fixing this problem include using ordered tags (which force the completion of all previous tagged operations) or just shutting down the request queue until the neglected request gets handled. Either approach could work; the request queue throttling technique is thought to be less hard on the overall performance of the system.
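The queue throttling option can be sketched as follows. This is a toy model under stated assumptions, not block layer code: the structures, the tick-based clock, the STARVATION_LIMIT value, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAGS 8
#define STARVATION_LIMIT 100UL   /* hypothetical threshold, in clock ticks */

struct tagged_cmd {
    int in_use;
    unsigned long issued_at;     /* tick at which the command was sent */
};

struct tag_queue {
    struct tagged_cmd slot[MAX_TAGS];
};

/* Return 1 if any outstanding command has waited longer than the limit. */
static int has_starved_cmd(const struct tag_queue *q, unsigned long now)
{
    for (size_t i = 0; i < MAX_TAGS; i++)
        if (q->slot[i].in_use &&
            now - q->slot[i].issued_at > STARVATION_LIMIT)
            return 1;
    return 0;
}

/*
 * Try to send one more command to the drive. Returns the tag used, or -1
 * if the queue is throttled (an old command is starving) or no tag is free.
 */
int issue_cmd(struct tag_queue *q, unsigned long now)
{
    assert(q != NULL);
    if (has_starved_cmd(q, now))
        return -1;               /* stop feeding the drive newer work */
    for (int i = 0; i < MAX_TAGS; i++) {
        if (!q->slot[i].in_use) {
            q->slot[i].in_use = 1;
            q->slot[i].issued_at = now;
            return i;
        }
    }
    return -1;
}

/* Called when the drive reports completion of a tagged command. */
void complete_cmd(struct tag_queue *q, int tag)
{
    assert(q != NULL && tag >= 0 && tag < MAX_TAGS);
    q->slot[tag].in_use = 0;
}
```

With no new commands arriving, the drive eventually has nothing left to do but the neglected request; once that request completes, issue_cmd() starts handing out tags again.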

Write barriers are needed for journaling filesystem support; they can be implemented with ordered tags. The real problem here, as it turns out, is error handling. If a write barrier operation fails, subsequent operations could be executed out of order. Another issue is the "queue full" problem: the drive rejects the barrier operation because its command queue is full, but then accepts a command issued after the barrier. This is a sort of race condition which is difficult, if not impossible, to produce on real systems, but it is a problem which can occur.

The current SCSI error handler is a "pluggable" mechanism which allows the provision of operations for a set of predefined situations. The "pluggable" interface has never been used - everybody uses the default error handlers, which are seen as being heavy-handed and insufficiently smart. The new error handler should also handle things like command cancellation - a feature required by asynchronous I/O.

The new error handler should, instead, be message-oriented, allowing greater flexibility in what sorts of situations can be dealt with. It should also be stackable and available to higher levels. Volume managers and RAID, for example, want a detailed picture of exactly what sort of errors are happening so that they can respond intelligently; "bad block" requires a different response than "drive on fire," but there is currently no way for higher levels to tell the difference.
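One way to picture a message-oriented, stackable handler is a chain of functions that each receive an error message and either choose an action or pass the message up to the next level. The sketch below is purely illustrative: the error kinds, actions, and handler names are invented here and do not reflect any design James presented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error kinds and responses for this sketch. */
enum err_kind { ERR_BAD_BLOCK, ERR_TIMEOUT, ERR_DEVICE_GONE };
enum err_action { ACT_PASS_UP, ACT_RETRY, ACT_REMAP_BLOCK, ACT_FAIL_DEVICE };

struct err_msg {
    enum err_kind kind;
    unsigned long sector;        /* where the error happened, if relevant */
};

typedef enum err_action (*err_handler_fn)(const struct err_msg *msg);

/* Transport level: knows how to retry a timed-out command, nothing more. */
enum err_action transport_handler(const struct err_msg *msg)
{
    return msg->kind == ERR_TIMEOUT ? ACT_RETRY : ACT_PASS_UP;
}

/* RAID level: a bad block and a dead drive call for different responses. */
enum err_action raid_handler(const struct err_msg *msg)
{
    switch (msg->kind) {
    case ERR_BAD_BLOCK:
        return ACT_REMAP_BLOCK;
    case ERR_DEVICE_GONE:
        return ACT_FAIL_DEVICE;
    default:
        return ACT_PASS_UP;
    }
}

/*
 * Offer the message to each handler in turn, lowest level first. The first
 * handler that picks an action wins; ACT_PASS_UP means nobody handled it.
 */
enum err_action dispatch_error(const err_handler_fn *stack, size_t depth,
                               const struct err_msg *msg)
{
    assert(stack != NULL && msg != NULL);
    for (size_t i = 0; i < depth; i++) {
        enum err_action act = stack[i](msg);

        if (act != ACT_PASS_UP)
            return act;
    }
    return ACT_PASS_UP;
}
```

Stacking transport_handler below raid_handler lets the lower level absorb timeouts while the RAID layer sees enough detail to tell a bad block from a vanished drive.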

In the end, much of the error handling code needs to move into, of course, the block layer. IDE drives also have errors, and higher-level code should not have to know the difference. So, happily (for James), much of it becomes somebody else's problem.

Support for multipath devices, too, should be implemented in the block layer - and thus be somebody else's problem. One big issue with multipath devices is the preservation of write barriers. A command which is meant to execute after a write barrier could be sent via a different path and overtake the barrier operation.

The death of the midlayer is expected to be "a slow process via starvation." The internal SCSI request structure may be replaced by the generic block level version, and much of the current SCSI functionality will migrate up to the higher levels, leaving a vastly thinner SCSI midlayer. This work, of course, will allow more common code to be shared across disk subsystems. It also means that, for example, the ide-scsi driver can be eliminated. Under the new system, it will be a straightforward task to connect the high-level SCSI code with the low-level IDE transport.

This is all a big job, of course; it is not expected to be done by the 2.5 feature freeze.


Patches and updates

Kernel trees

Andrea Arcangeli: 2.4.19rc1aa1
J.A. Magallon: Linux 2.4.19-pre10-jam3

Architecture-specific

Christer Weinigel: SCx200 patches part 1/3 -- Watchdog driver. Adds support for the National Semiconductor SCx200 processor.

Build system

Kai Germaschewski: kbuild fixes and more

Core kernel code

Ingo Molnar: batch/idle priority scheduling, SCHED_BATCH. The long-sought safe SCHED_IDLE implementation.
Robert Love: 2.5: fair scheduler hints. "Scheduler hints are a way for a program to give a 'hint' to the scheduler about its present behavior in the hopes of the scheduler consequently making better scheduling decisions."

Device drivers

Bartlomiej Zolnierkiewicz: 2.5.24 IDE 95
Bartlomiej Zolnierkiewicz: 2.5.24 IDE 96
Bartlomiej Zolnierkiewicz: 2.5.24 IDE 97
Jaroslav Kysela: ALSA 0.9.0rc2 release notes

Documentation

Patrick Mochel: Device Model Docs
Denis Vlasenko: lk maintainers

Filesystems and block I/O

Alasdair Kergon: device-mapper for 2.4. "Device-mapper is a light-weight driver designed to support volume managers generically."
Paul Menage: Filter /proc/mounts based on process root dir. Makes /proc/mounts consider namespaces.

Janitorial

Memory management

Andrea Arcangeli: vm fixes for 2.4.19rc1

Networking

Tobias Ringstrom: ipsec_tunnel-0.2.2 released

Miscellaneous

Willy TARREAU: CMOV emulation for 2.4.19-rc1. Provides x86 instruction emulation on older processors.
Rik van Riel: #kernelnewbies moves

Page editor: Jonathan Corbet


Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds