Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.9-rc2; Linus has released no prepatches since September 13.

Linus's BitKeeper repository contains more __iomem annotations (see last week's Kernel Page) and new sparse annotations intended to flush out byte endianness errors, an NTFS update, ethtool support in the loopback driver, m32r architecture support, the "string" I/O memory access functions, support for more than eight partitions on BSD-labeled disks, some User-mode Linux cleanups, a tunable "max sectors" limit for block I/O requests (a latency reduction feature), a new prctl() option allowing programs to change their name, some shared memory scalability improvements, and a change in TCP ICMP source quench behavior (such messages are simply ignored now).

The current tree from Andrew Morton is 2.6.9-rc2-mm1. Recent changes to -mm include the inclusion of a number of Ingo Molnar's latency reduction patches, a rework of tty locking, a number of User-mode Linux updates, and various fixes.

The current 2.4 prepatch is still 2.4.28-pre3; Marcelo has released no prepatches since September 11.

Kernel development news

Modular, switchable I/O schedulers

The I/O scheduler ("elevator") has a challenging job: it must arrange for disk I/O operations to be executed in the optimal order. "Optimal" means maximizing the I/O bandwidth to the disk while, simultaneously, ensuring that all requests are satisfied in a timely manner, no process suffers excessive latency, and, for desktop systems, that the interactive "feel" of the system is responsive. Some schedulers take on additional tasks, such as dividing the available bandwidth equally between processes (or users) contending for each disk.

Given that set of demands, it is not surprising that there are multiple I/O schedulers in the Linux kernel. The deadline scheduler works by enforcing a maximum latency for all requests. The anticipatory scheduler briefly stalls I/O after a read request completes with the idea that another, nearby read is likely to come in quickly. The completely fair queueing scheduler (recently updated by Jens Axboe) applies a bandwidth allocation policy. And there is a simple "noop" scheduler for devices, such as RAM disks, which do not benefit from fancy scheduling schemes (though such devices usually short out the request queue entirely).

The kernel has a nice, modular scheme for defining and using I/O schedulers. What it lacks, however, is any flexible way of letting a system administrator choose a scheduler. I/O schedulers are built into the kernel code, and exactly one of them can be selected - for all disks in the system - at boot time with the elevator= parameter. There is no way to use different schedulers for different drives, or to change schedulers once the system boots. The chosen scheduler is used, and any others configured into the system simply sit there and consume memory.

Jens Axboe has recently posted a patch which improves on this situation. With this patch in place, I/O schedulers can be built as loadable modules (though, as Jens cautions, at least one scheduler must be linked directly into the kernel or the system will have a hard time booting). A new scheduler attribute in each drive's sysfs tree lists the available schedulers, noting which one is active at any given time. Changing schedulers is simply a matter of writing the name of the new scheduler into that attribute.
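As a rough illustration of how a user-space tool might read that attribute, the sketch below picks the active scheduler out of one line of its contents. It assumes that the attribute lists scheduler names separated by spaces, with the active one in square brackets (for example "noop anticipatory deadline [cfq]"); the function name active_scheduler() is hypothetical and is not part of the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) scheduler name from "line" into "out".
 * Returns 0 on success, -1 if no complete [name] is found or if the
 * name does not fit in "out".
 * Assumed input format: "noop anticipatory deadline [cfq]".
 */
int active_scheduler(const char *line, char *out, size_t outlen)
{
	const char *start = strchr(line, '[');
	const char *end;
	size_t len;

	if (start == NULL)
		return -1;
	end = strchr(start + 1, ']');
	if (end == NULL)
		return -1;
	len = (size_t)(end - start - 1);
	if (len + 1 > outlen)	/* need room for the trailing NUL */
		return -1;
	memcpy(out, start + 1, len);
	out[len] = '\0';
	return 0;
}
```

Switching schedulers would then be the reverse operation: writing a name such as "deadline" back into the same attribute.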

The patch is long, but the amount of work required to support switchable I/O schedulers wasn't all that great. The internal structures describing elevators have been split apart to reflect the more dynamic nature of things; struct elevator_ops contains the scheduler methods, while struct elevator_type holds the metadata which describes an I/O scheduler to the kernel. The new elevator_queue structure glues an instance of an I/O scheduler to a specific request queue. Updating the mainline schedulers to work with the new structures required a fair number of relatively straightforward code changes. Each scheduler now also has module initialization and cleanup functions which have been separated from the code needed to set up or destroy an elevator for a specific queue.

One interesting question is: what should be done with the currently queued block requests when an I/O scheduler change is requested? One could imagine requeueing all of those requests with the new scheduler in order to let it have its say immediately. The simpler approach, which was chosen for this patch, is to block the creation of new requests and wait for the queue to empty out. Once all outstanding I/O has been finished up, the old scheduler can be shut down and moved out of the way.

There have been no (public) objections to the patch; chances are it will find its way into the mainline sometime after 2.6.9 comes out.

Goodbye, old code

In the Good Old Days, loadable modules had to manage their own reference counts with the MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT macros. This mechanism was always subject to race conditions; since the count was manipulated inside the module itself, there was no way to avoid situations where the kernel was executing inside the module, but the use count was zero. And that was for correctly written modules; distributing responsibility for the reference count in this way also provided lots of opportunities for module writers to get things wrong.

So, for 2.6, reference count management was moved up into the code which calls into modules, and the MOD_*_USE_COUNT macros were deprecated. The kernel janitors have been busy in recent times, with the result that there are no more users of those macros in the mainline kernel. So Christoph Hellwig has posted a patch removing them altogether. That patch has not been merged as of this writing, but the writing is clearly on the wall. Any external modules which are still using these macros should probably be fixed up in a hurry.

Christoph has also sent out a patch marking the lightly-used inter_module functions as deprecated. These functions, which perform a sort of run-time linking between modules, have never been seen as elegant or safe to use.

Rusty Russell, meanwhile, has added a warning to the kernel informing users that the ipchains and ipfwadm interfaces to netfilter will be going away soon. They have been obsolete since 2.4, but the kernel developers have kept them around because they are a user-space interface which is still very much in use. Once a site administrator gets a set of firewall rules that works, he or she is rarely amused by the idea of rewriting everything for a new interface.

Supporting these interfaces requires the maintenance of an intermediate compatibility layer in the netfilter code, however, and that makes maintenance and development of the code hard. In the interests of carrying the code forward, the netfilter developers want to get rid of the older cruft. For now, they are just adding a warning; no time frame has been given for (1) firmer warnings, or (2) actual removal of the code.

There are a couple of obstacles to actually taking this code out:

  • The users of the old interfaces. For those trying to convert to iptables, William Stearns has posted a script which converts ipchains rules to iptables.

  • 32-bit emulation. The binary interface used by iptables is exceedingly difficult to implement for 32-bit user-space programs in a 64-bit kernel - with the result that it has not been done. For this reason, x86-64 maintainer Andi Kleen has requested that ipchains not be removed at this time. Fixing that problem will not be a straightforward task, however.

In the longer term, it seems clear that the older interfaces have to go. The alternative is a steady accumulation of compatibility cruft which, eventually, causes the kernel to collapse under its own weight.

I/O space write barriers

Some platforms, it seems, have an interesting property: writes to I/O memory space from multiple processors may be reordered before reaching the device. Even if the device registers are protected by a lock (pretty much necessary to keep multiple processors from writing simultaneously and confusing the device), writes issued by one CPU can arrive before those from another, even if the second CPU had held the lock and issued its writes first. The Itanium architecture in particular behaves this way, though others may as well.

The answer, according to Jesse Barnes, is the addition of a new type of memory barrier to force the ordering of writes to the device. Jesse's patch adds a new function, mmiowb(), which implements this barrier. He has also updated the qla1280 driver to make use of it.

Authors of PCI drivers are accustomed to coding a different sort of barrier: reading from a device register to ensure that all writes have actually been posted to the device. mmiowb() is a different, lighter-weight mechanism. After a call to mmiowb(), writes might still not have reached the device. Writes are not forced out; they just have their ordering with respect to subsequent writes guaranteed. In many situations, that sort of guarantee is all that is needed.
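The intended usage pattern can be sketched as follows. This is a user-space mock-up, not driver code: the kernel's spin_lock(), spin_unlock(), writel(), and mmiowb() are replaced by stubs which merely record the order in which they are called, and the register offsets and the send_command() function are invented for illustration. The point is only the placement of the barrier: after the last write to the device and before the lock is released.

```c
#include <stddef.h>

/* Event log standing in for real hardware and locking (an assumption). */
enum event { EV_LOCK, EV_WRITE, EV_MMIOWB, EV_UNLOCK };

#define MAX_EVENTS 16
static enum event event_log[MAX_EVENTS];
static size_t event_count;

static void record(enum event ev)
{
	if (event_count < MAX_EVENTS)
		event_log[event_count++] = ev;
}

/* Stand-ins for the kernel primitives; they only log the call. */
static void stub_spin_lock(void)   { record(EV_LOCK); }
static void stub_spin_unlock(void) { record(EV_UNLOCK); }
static void stub_mmiowb(void)      { record(EV_MMIOWB); }
static void stub_writel(unsigned int value, unsigned int reg)
{
	(void)value;
	(void)reg;
	record(EV_WRITE);
}

/* Hypothetical register offsets for an imaginary device. */
#define REG_ADDR 0x00
#define REG_GO   0x04

/* Program the imaginary device while holding its lock. */
void send_command(unsigned int addr)
{
	stub_spin_lock();
	stub_writel(addr, REG_ADDR);
	stub_writel(1, REG_GO);
	stub_mmiowb();	/* order these writes ahead of the next lock holder's */
	stub_spin_unlock();
}
```

In a real driver the stubs would be the kernel functions of the same names, and a read from a device register would still be needed wherever the writes must actually have reached the hardware.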

Configuration of pluggable network adaptors

Li Shaohua ran into a problem when repeatedly plugging and unplugging an e1000 network adaptor. After 32 times, the adaptor would no longer work. It seems that the driver (like many others in the 2.6 kernel) was designed to discover at most 32 devices at boot time, and it has space for configuration parameters for just that many devices. Each new hotplug event looked like a new device, so the driver quickly ran out of parameter storage. In fact, the e1000 driver can handle many more devices than that; it just lacks space in its boot-time arrays to hold default configuration information.

Mr. Li's diagnosis was that the problem lies with the e1000 driver's inability to reuse board numbers internally. So he wrote up a patch to keep track of existing boards, and to reuse their numbers when they are removed. After some discussion, this patch was reworked into a general mechanism using the "idr" facility (described in the next article) - since the e1000 is not the only driver which behaves this way, it makes sense to fix the problem once for everybody.
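The idea of reusing board numbers can be sketched with a simple bitmap allocator. This is a user-space illustration only; it is neither Mr. Li's patch nor the idr implementation, and the names board_alloc(), board_free(), and MAX_BOARDS are made up. MAX_BOARDS is set to 32 to match the limit described above.

```c
#include <stdint.h>

#define MAX_BOARDS 32	/* matches the 32-device limit described above */

static uint32_t boards_in_use;	/* bit n set => board number n is taken */

/* Return the lowest free board number, or -1 if all are in use. */
int board_alloc(void)
{
	int n;

	for (n = 0; n < MAX_BOARDS; n++) {
		if (!(boards_in_use & (UINT32_C(1) << n))) {
			boards_in_use |= UINT32_C(1) << n;
			return n;
		}
	}
	return -1;
}

/* Release a board number so that a later hotplug event can reuse it. */
void board_free(int n)
{
	if (n >= 0 && n < MAX_BOARDS)
		boards_in_use &= ~(UINT32_C(1) << n);
}
```

With numbers handed back on removal, an adaptor which is plugged and unplugged repeatedly keeps getting the same slot rather than marching toward the end of the parameter arrays.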

Not everybody agrees that this is the right approach, however. Boot-time configuration parameters can be useful for many (if not most) systems where the network interfaces are screwed down and are unlikely to be replaced while the system is up. But do they really make sense for hotpluggable devices? There is a whole system in place for the configuration of hotpluggable devices; perhaps that should be used rather than adding complexity to the network drivers. Given that the conversation came to a hard stop after this view was posted, it seems likely to carry the day.

idr - integer ID management

There have been a fair number of patches in recent times which convert one part or other of the kernel over to the "idr" facility. Idr is a set of library functions for the management of small integer ID numbers. In essence, an idr object can be thought of as a sparse array mapping integer IDs onto arbitrary pointers, with a "get me an available entry" function as well. This code was first added in February, 2003 as part of the POSIX clocks patch, and has seen various tweaks since.

Working with idr requires including <linux/idr.h>. Creating a new idr object is simply a matter of allocating a struct idr and passing it to:

    void idr_init(struct idr *idp);

The interface for allocating new IDs is somewhat unintuitive and interesting. The authors decided to separate out the parts of the ID allocation process which may require getting memory from the system; the idea was that the memory allocation could be done with no locks held, while the actual generation of an ID number could be done in a locked state. Thus, before allocating a new ID, one must call:

    int idr_pre_get(struct idr *idp, unsigned int gfp_mask);

This function prepares for the allocation of a new ID number, allocating memory (with the given gfp_mask) if necessary. Contrary to the usual conventions, the return value will be zero if something goes wrong, nonzero otherwise.

Once that is done, a new ID can be allocated with either of:

    int idr_get_new(struct idr *idp, void *ptr, int *id);
    int idr_get_new_above(struct idr *idp, void *ptr, int start_id, int *id);

The first form gets the next available ID number, stores it in id, and associates it with the given ptr internally. If you wish to specify a minimum value for the new ID, use idr_get_new_above() instead. If all goes well, the return value will be zero; if no more IDs can be allocated, -ENOSPC will be returned.

Imagine a situation where two processors are both looking to allocate a new ID. Both call idr_pre_get(), guaranteeing that enough memory exists to allocate at least one more ID. Then one processor swoops in and grabs that ID, leaving no memory for the other. In that case, idr_get_new() will not attempt to allocate more memory; it will, instead, return -EAGAIN. At that point, the code should emit a heavy sigh, release its locks, and go back to the idr_pre_get() stage. Thus, ID allocation code can look something like this:

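    again:
	if (idr_pre_get(&my_idr, GFP_KERNEL) == 0) {
		/* No memory, give up entirely */
		return -ENOMEM;
	}
	spin_lock(&my_lock);	/* my_lock: whatever lock protects my_idr */
	result = idr_get_new(&my_idr, &target, &id);
	spin_unlock(&my_lock);
	if (result == -EAGAIN)
		goto again;	/* reservation was taken; preallocate again */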

It should be noted that calls to idr_get_new() (and most other idr functions) must be serialized by some sort of lock, or unpleasant things could happen. idr_pre_get() can sleep, however, so it should not be called while holding a spinlock.
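The -EAGAIN case can be reproduced deterministically with a toy user-space model. The code below is not the kernel's idr; it only mimics the two-phase protocol with a counter of preallocated entries, and every name in it (toy_idr, toy_pre_get(), toy_get_new()) is invented. Locking is omitted because the model is single-threaded, and stored pointers are assumed to be non-NULL.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_IDS 8	/* arbitrary small capacity for the model */

struct toy_idr {
	int reserved;			/* preallocated, not yet consumed */
	void *slots[TOY_MAX_IDS];	/* ID -> pointer; NULL means free */
};

/*
 * Phase 1: make sure at least one entry is reserved.  Returns nonzero
 * on success, mirroring the inverted convention of idr_pre_get().
 */
int toy_pre_get(struct toy_idr *idp)
{
	if (idp->reserved < 1)
		idp->reserved = 1;
	return 1;
}

/* Phase 2: consume a reserved entry and hand out the lowest free ID. */
int toy_get_new(struct toy_idr *idp, void *ptr, int *id)
{
	int i;

	if (idp->reserved == 0)
		return -EAGAIN;	/* somebody else used the reservation */
	for (i = 0; i < TOY_MAX_IDS; i++) {
		if (idp->slots[i] == NULL) {
			idp->slots[i] = ptr;
			idp->reserved--;
			*id = i;
			return 0;
		}
	}
	return -ENOSPC;
}
```

If two callers each invoke toy_pre_get() before either calls toy_get_new(), only one entry is reserved; the first toy_get_new() succeeds and the second returns -EAGAIN until toy_pre_get() is called again.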

Looking up an existing ID is much simpler:

    void *idr_find(struct idr *idp, int id);

The return value will be the pointer associated with the given id, or NULL if no such ID has been allocated.

To deallocate an ID, use:

    void idr_remove(struct idr *idp, int id);

With these functions, kernel code can generate ID numbers to use as minor device numbers, inode numbers, or in any other place where small integer IDs are useful.

There is one more interesting twist to the idr code: it does (almost) nothing to help users detect reused ID numbers. When an object is destroyed, it may not be possible to tell whether anybody still has its ID number around or not. When some part of the kernel comes along with an ID number, it would be nice to know that it refers to a currently-existing object, rather than being left over from some previous time.

The idr code makes it possible for callers to perform this check by ignoring the high-order bits in the ID number. Here, "high-order" is defined as "all the bits which are not needed to represent the largest allocated ID." By putting some sort of unique information in the upper part of the ID (and by limiting the maximum ID number which can be used), idr users can turn the small ID numbers into unique identifiers. The POSIX timer and SCTP code use idr in this way; most of the other in-kernel users treat idr as a sort of unique number generation service and do not perform this sort of check.
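One way to picture this scheme is a handle which packs a generation counter above the ID bits. The sketch below is a user-space illustration with invented names and an assumed split of 16 low-order bits for the ID; it does not reproduce the POSIX timer or SCTP code.

```c
#include <stdint.h>

#define ID_BITS 16	/* assumed: IDs are limited to 0..65535 */
#define ID_MASK ((UINT32_C(1) << ID_BITS) - 1)

/* Combine a small ID with a generation count kept in the high bits. */
uint32_t make_handle(uint32_t id, uint32_t generation)
{
	return (generation << ID_BITS) | (id & ID_MASK);
}

/* The part a lookup would use; the high-order bits are ignored. */
uint32_t handle_id(uint32_t handle)
{
	return handle & ID_MASK;
}

/*
 * A handle is stale if the object now stored under that ID was created
 * with a different generation, so that the full handle values differ.
 */
int handle_is_current(uint32_t handle, uint32_t stored_handle)
{
	return handle == stored_handle;
}
```

A caller would keep the full handle in each object, look the object up by handle_id(), and then compare the stored handle against the one presented; a mismatch means the ID has been reused since the handle was issued.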

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds