Kernel development
Brief items
Kernel release status
The current development kernel is 2.5.60, which was released by Linus on February 10. This release was mostly an exercise in catching up with the pile of patches that accumulated while Linus was traveling; it includes ia32 lost timer tick detection and compensation, self-unplugging block I/O request queues, an ACPI update, various architecture updates, a SCSI command queue rework, Linux Security Module networking hooks, a big user-mode Linux update, a number of kbuild changes, 64-bit jiffies support, and a great many other fixes and updates. The long-format changelog has the details.

Linus's (pre-2.5.61) BitKeeper tree includes a big x86-64 update, some fixups for signal problems in 2.5.60, some kbuild work, and another set of AGP patches.
Dave Jones has released 2.5.60-dj2, which adds some driver fixes and a number of 2.4 fixes to the 2.5.60 kernel.
The current stable kernel is 2.4.20; Marcelo has not released any 2.4.21 prepatches since January 29.
The current patch from Alan Cox is 2.4.21-pre4-ac4. It contains another set of IDE fixes and a few other repairs.
Kernel development news
The continuing development of I/O scheduling
The 2.5 development series has seen a great deal of work aimed at improving the performance of the block I/O subsystem. Recently there has been a resurgence of interest in I/O scheduling - deciding which disk I/O requests to process in which order. Optimal scheduling can keep the disks running at full speed and users happy, but the optimal solution can be hard to find. That doesn't stop the kernel hackers from trying, however. The anticipatory I/O scheduler work was covered here a couple of weeks ago; now a new approach is being tried which may improve I/O performance even more.

The technique being looked at is "stochastic fair queueing," and it is intended to bring greater fairness to I/O scheduling decisions. In a fair situation, all users of a particular drive would be able to execute about the same number of I/O requests over a given time. This approach to fairness gets rid of starvation problems, and ensures that all processes can get some work done. The hope would be, for example, that a streaming media application would be able to move its data without outages, even in the presence of other, disk-intensive applications.
The stochastic fair queueing approach was first developed in the networking world by Paul E. McKenney; his paper on the subject can be found on this page. In the networking context, stochastic fair queueing tries to divide the available bandwidth equally among all users. Ideally, a separate queue would be used for each ongoing connection, but high-performance routers lack the resources to do things that way. So a smaller number of queues is used, with each connection being assigned to a queue via a hash function. Packets are then taken from each queue in turn, dividing the bandwidth between them. If two high-bandwidth connections happen to land on the same queue, they will be penalized relative to the other queues; to address this problem, the hash function is periodically changed to redistribute connections among the queues. The algorithm works reasonably well and is easy to make fast; the Linux networking code has had a stochastic queueing module available for some time.
In the disk I/O context, the aim is to divide the available disk bandwidth fairly between processes. The initial implementation by Jens Axboe creates 64 subqueues for each block I/O request queue, and distributes requests among the subqueues based on the process ID of the requestor. (Actually, it uses the process ID of the currently running process, which could, in some situations, not be the originator of the request). When the time comes to dispatch requests, one is taken from each subqueue, and the whole set is ordered before being sent to the drive for execution.
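The core of the stochastic approach - hash requests into a fixed set of subqueues by process ID, then dispatch round-robin so no single process can monopolize the drive - can be sketched in ordinary userspace C. This is a simplified illustration, not Jens's actual code; the hash function, queue representation, and PID values are all invented for the example:

#include <stdio.h>

#define NUM_QUEUES 64

/* Illustrative hash: map a process ID onto one of the subqueues. */
static int sfq_hash(int pid)
{
    return pid % NUM_QUEUES;
}

int main(void)
{
    /* Each subqueue is reduced to a count of pending requests here. */
    int queue_len[NUM_QUEUES] = { 0 };
    int pids[] = { 100, 100, 100, 100, 231, 231, 977 };
    int i, dispatched = 0;

    /* Distribute incoming requests among subqueues by requestor PID. */
    for (i = 0; i < 7; i++)
        queue_len[sfq_hash(pids[i])]++;

    /* Dispatch: take one request from each non-empty subqueue in turn.
     * Process 100 has four requests queued, but gets only one slot per
     * round - the same share as the lighter processes. */
    for (i = 0; i < NUM_QUEUES; i++) {
        if (queue_len[i] > 0) {
            queue_len[i]--;
            dispatched++;
        }
    }
    printf("dispatched %d requests in one round\n", dispatched);
    return 0;
}

The real scheduler, of course, dispatches actual struct request entries and sorts the resulting set before sending it to the drive; the point here is only the fairness of the round-robin pass over hashed subqueues.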
Taking things even further, Jens has also posted a complete fair queueing scheduler, which does away with the hash function used in the stochastic approach. Each process has its own queue, and requests are taken equally from all queues. It is hard to get fairer than that. Of course, as Jens points out, once you have this infrastructure in place, it is relatively easy to make things less fair again by adding, say, I/O priorities to processes.
Where this all appears to be heading (though probably not in the 2.5 series) is toward a configurable I/O scheduler with several possible algorithms which can be mixed and matched according to a site's local policy. In other words, it looks a lot like the traffic control code which has existed in the networking subsystem for a few years. As with networking, most sites will probably not need to tweak their disk scheduling regimes. Users with special needs, however, will be glad for the ability to fine-tune things to their specifications.
Porting drivers to 2.5
Last week's Kernel Page included the first articles in a series on porting device drivers (and other kernel code) to the 2.5 kernel. These articles are an offshoot of the work to update the Linux Device Drivers sample code (and then, of course, the book itself). Three more articles have been added to the series; one of them, which fills in more information on porting to the new module loader, appears below. The other two (on miscellaneous changes and the seq_file interface) can be read separately.

These articles will be collected at lwn.net/Articles/driver-porting as the series continues to develop. With luck, they will become a useful resource for the kernel development community. Stay tuned...
Driver porting: more module changes
This article is part of the LWN Porting Drivers to 2.6 series.
Module parameters
The old MODULE_PARM macro, which used to specify parameters which can be passed to the module at load time, is no more. The new parameter declaration scheme adds type safety and new functionality, but at the cost of breaking compatibility with older modules.

Modules with parameters should now include <linux/moduleparam.h> explicitly. Parameters are then declared with module_param:
module_param(name, type, perm);

Where name is the name of the parameter (and of the variable holding its value), type is its type, and perm is the permissions to be applied to that parameter's sysfs entry. The type parameter can be one of byte, short, ushort, int, uint, long, ulong, charp, bool or invbool. That type will be verified during compilation, so it is no longer possible to create confusion by declaring module parameters with mismatched types. The plan is for module parameters to appear automatically in sysfs, but that feature had not been implemented as of 2.6.0-test9; for now, the safest alternative is to set perm to zero, which means "no sysfs entry."
If the name of the parameter as seen outside the module differs from the name of the variable used to hold the parameter's value, a variant on module_param may be used:
module_param_named(name, value, type, perm);

Where name is the externally-visible name and value is the internal variable.
String parameters will normally be declared with the charp type; the associated variable is a char pointer which will be set to the parameter's value. If you need to have a string value copied directly into a char array, declare it as:
module_param_string(name, string, len, perm);

Usually, len is best specified as sizeof(string).
Finally, array parameters (supplied at module load time as a comma-separated list) may be declared with:
module_param_array(name, type, num, perm);
The one parameter not found in module_param() (num) is an output parameter; if a value for name is supplied when the module is loaded, num will be set to the number of values given. This macro uses the declared length of the array to ensure that it is not overrun if too many values are provided.
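The comma-separated parsing and value counting that the module loader performs for array parameters can be sketched in userspace. This is not the kernel's implementation - the function name, buffer size, and use of strtok_r() are all inventions for the example - but it shows the two behaviors described above: num reports how many values were supplied, and an over-long list is rejected rather than overrunning the array:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a comma-separated list into 'arr' (capacity 'max'), storing
 * the number of values seen in '*num'.  Returns -1 if too many values
 * are supplied, mirroring how module_param_array() refuses to overrun
 * the declared array. */
static int parse_array_param(const char *val, int *arr, int max, int *num)
{
    char buf[128];
    char *tok, *save;
    int n = 0;

    strncpy(buf, val, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (tok = strtok_r(buf, ",", &save); tok;
         tok = strtok_r(NULL, ",", &save)) {
        if (n >= max)
            return -1;  /* too many values: reject the whole list */
        arr[n++] = atoi(tok);
    }
    *num = n;
    return 0;
}

int main(void)
{
    int values[4], count;

    if (parse_array_param("10,20,30", values, 4, &count) == 0)
        printf("got %d values, first is %d\n", count, values[0]);
    return 0;
}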
As an example of how the new module parameter code works, here is a parameterized version of the "hello world" module shown previously:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/moduleparam.h>

MODULE_LICENSE("Dual BSD/GPL");

/*
 * A couple of parameters that can be passed in: how many times we say
 * hello, and to whom.
 */
static char *whom = "world";
module_param(whom, charp, 0);

static int howmany = 1;
module_param(howmany, int, 0);

static int hello_init(void)
{
    int i;
    for (i = 0; i < howmany; i++)
        printk(KERN_ALERT "(%d) Hello, %s\n", i, whom);
    return 0;
}

static void hello_exit(void)
{
    printk(KERN_ALERT "Goodbye, cruel %s\n", whom);
}

module_init(hello_init);
module_exit(hello_exit);

Inserting this module with a command like:
insmod ./hellop.ko howmany=2 whom=universe

causes the message "Hello, universe" to show up twice in the system logfile.
Module aliases
A module alias is an alternative name by which a loadable module can be known. These aliases are typically defined in /etc/modules.conf, but many of them are really a feature of the module itself. In 2.6, module aliases can be embedded within a module's source. Simply add a line like:
MODULE_ALIAS("alias-name");
The module use count
In 2.4 and prior kernels, modules maintained their "use count" with macros like MOD_INC_USE_COUNT. The use count, of course, is intended to prevent modules from being unloaded while they are being used. This method was always somewhat error prone, especially when the use count was manipulated inside the module itself. In the 2.6 kernel, reference counting is handled differently.

The only safe way to manipulate the count of references to a module is outside of the module's code. Otherwise, there will always be times when the kernel is executing within the module, but the reference count is zero. So this work has been moved outside of the modules, and life is generally easier for module authors.
Any code which wishes to call into a module (or use some other module resource) must first attempt to increment that module's reference count:
int try_module_get(struct module *module);

It is also necessary to look at the return value from try_module_get(); a zero return means that the attempt failed, and the module should not be used. Failure can happen, for example, when the module is in the process of being unloaded.
A reference to a module can be released with module_put().
Again, modules will not normally have to manage their own reference counts. The only exception may be if a module provides a reference to an internal data structure or function that is not accounted for otherwise. In that (rare) case, a module could conceivably call try_module_get() on itself.
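The semantics of try_module_get() and module_put() can be illustrated with a toy, single-threaded userspace model. The struct and function names below are inventions for the example (the real kernel code uses atomic operations on struct module, which this sketch deliberately omits); the point is only that a "get" must be able to fail once unloading has begun:

#include <stdio.h>

/* A toy stand-in for struct module: a reference count plus a flag
 * saying whether the module is still live (not being unloaded). */
struct toy_module {
    int refcount;
    int live;
};

/* Like try_module_get(): succeed (nonzero) only while the module is
 * live; a module that has begun unloading must not gain new users. */
static int toy_try_get(struct toy_module *m)
{
    if (!m->live)
        return 0;
    m->refcount++;
    return 1;
}

/* Like module_put(): drop one reference. */
static void toy_put(struct toy_module *m)
{
    m->refcount--;
}

int main(void)
{
    struct toy_module m = { 0, 1 };

    if (toy_try_get(&m)) {
        /* safe to call into the module's code here */
        toy_put(&m);
    }

    m.live = 0;  /* unloading has begun */
    printf("get after unload begins: %d\n", toy_try_get(&m));
    return 0;
}

Note the pattern: the caller checks the return value before using the module, and releases the reference when done, exactly as described above for the real interface.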
As of this writing, modules are considered "live" during initialization, meaning that a try_module_get() will succeed at that time. There is still talk of changing things, however, so that modules are not accessible until they have completed their initialization process. That change will help prevent a whole set of race conditions that come about when a module fails initialization, but it also creates difficulties for modules which have to be available early on. For example, block drivers should be available to read partition tables off of disks when those disks are registered, which usually happens when the module is initializing itself. If the policy changes and modules go back off-limits during initialization, a call to a function like make_module_live() may be required for those modules which must be available sooner. (Update 2.6.0-test9: this change has not happened and seems highly unlikely at this point).
Finally, it is not entirely uncommon for driver authors to put in a special ioctl() function which sets the module use count to zero. Sometimes, during module development, errors can leave the module reference count in a state where it will never reach zero, and previously there was no other way to get the kernel to unload the module. The new module code supports forced unloading of modules which appear to have outstanding references - if the CONFIG_MODULE_FORCE_UNLOAD option has been set. Needless to say, this option should only be used on development systems, and, even then, with great caution.
Exporting symbols
For the most part, the exporting of symbols to the rest of the kernel has not changed in 2.6 - except, of course, for the fact that any user of those symbols should be using try_module_get() first. In older kernels, however, a module which did not arrange things otherwise would implicitly export all of its symbols. In 2.6, things no longer work that way; only symbols which have explicitly been exported are visible to the rest of the kernel.

Chances are that change will cause few problems. When you get a chance, however, you can remove EXPORT_NO_SYMBOLS lines from your module source. Exporting no symbols is now the default, so EXPORT_NO_SYMBOLS is a no-op.
The 2.4 inter_module_ functions have been deprecated as unsafe. The symbol_get() function exists for the cases when normal symbol linking does not work well enough. Its use requires setting up weak references at compile time, and is beyond the scope of this document; there are no users of symbol_get() in the 2.6.0-test9 kernel source.
Kernel version checking
2.4 and prior kernels would include, in each module, a string containing the version of the kernel that the module was compiled against. Normally, modules would not be loaded if the compile version failed to match the running kernel.

In 2.5, things still work mostly that way. The kernel version is loaded into a separate, "link-once" ELF section, however, rather than being a visible variable within the module itself. As a result, multi-file modules no longer need to define __NO_VERSION__ before including <linux/module.h>.
The new "version magic" scheme also records other information, including the compiler version, SMP status, and preempt status; it is thus able to catch more incompatible situations than the old scheme did.
Module symbol versioning ("modversions") has been completely reworked for the 2.6 kernel. Module authors who use the makefiles shipped with the kernel (and that is about the only way to work now) will find that dealing with modversions has gotten easier than before. The #define hack which tacked checksums onto kernel symbols has gone away in favor of a scheme which stores checksum information in a separate ELF section.
Getting at the BitKeeper repositories without BitKeeper
Andrea Arcangeli, with a statement that he prefers coding to participating in flame wars, recently released a script which can pull code from a BitKeeper repository without the need to actually run BitKeeper. The script makes use of the web interface to the repository running on bkbits.net. It looks like a great way for developers who do not want to run proprietary software to get access to Linus's current tree. There is only one problem, however: the BitMover folks are very concerned about the amount of bandwidth that could be burned by extensive use of this script, and have promised to shut down the web interface if the bandwidth bill gets too high.

The issue of access to the BitKeeper repositories via free software will not go away, however; there is a determined subset of the kernel hacker community that simply does not want to use proprietary code. Fortunately, there appears to be an answer on the horizon: BitMover has promised to make Linus's repository available as an automatically updated CVS repository. That repository, presumably, will be hosted at kernel.org. At that point, a lot of minds should be eased about access to the repository - and about long-term preservation of the kernel's revision history in an open format (not that the BitKeeper format, which is based on SCCS, is particularly closed).
Incidentally, it has been just over one year since Linus let the world know he was trying out BitKeeper in the 2.5.4-pre1 announcement.
Patches and updates
Kernel trees, architecture-specific, core kernel code, development tools, device drivers, documentation, filesystems and block I/O, janitorial, memory management, benchmarks and bugs, miscellaneous.
Page editor: Jonathan Corbet