Brief items
The current development kernel is 2.5.60, which was
released by Linus on February 10. This
release was mostly an exercise in catching up with the pile of patches that
accumulated while Linus was traveling; it includes ia32 lost
timer tick detection and compensation, self-unplugging block I/O request
queues, an ACPI update, various architecture updates, a SCSI command queue
rework, Linux Security Module networking hooks, a big user-mode Linux
update, a number of kbuild changes, 64-bit jiffies support, and a great
many other fixes and updates. The
long-format
changelog has the details.
Linus's (pre-2.5.61) BitKeeper tree includes a big x86-64 update, some
fixups for signal problems in 2.5.60, some kbuild work, and another set of
AGP patches.
Dave Jones has released 2.5.60-dj2, which
adds some driver fixes and a number of 2.4 fixes to the 2.5.60 kernel.
The current stable kernel is 2.4.20; Marcelo has not released any
2.4.21 prepatches since January 29.
The current patch from Alan Cox is 2.4.21-pre4-ac4. It contains another set of
IDE fixes and a few other repairs.
Kernel development news
The 2.5 development series has seen a great deal of work aimed at improving
the performance of the block I/O subsystem. Recently there has been a
resurgence of interest in I/O scheduling - deciding which disk I/O requests to
process in which order. Optimal scheduling can keep the disks running at
full speed and users happy, but the optimal solution can be hard to find.
That doesn't stop the kernel hackers from trying, however. The
anticipatory I/O scheduler work was covered here
a couple of weeks ago; now a new approach is
being tried which may improve I/O performance even more.
The technique being looked at is "stochastic fair queueing," and it is
intended to bring greater fairness to I/O scheduling decisions. In a fair
situation, all users of a particular drive would be able to execute about
the same number of I/O requests over a given time. This approach to
fairness gets rid of starvation problems, and ensures that all processes
can get some work done. The hope would be, for example, that a streaming
media application would be able to move its data without outages, even in
the presence of other, disk-intensive applications.
The stochastic fair queueing approach was first developed in the networking
world by Paul E. McKenney; his paper on the subject can be found on this page. In the
networking context, stochastic fair queueing tries to divide the available
bandwidth equally among all users. Ideally, a separate queue would be used
for each ongoing connection, but high-performance routers lack the
resources to do things that way. So a smaller number of queues is used,
with each connection being assigned to a queue via a hash function.
Packets are then taken from each queue in turn, dividing the bandwidth
between them. If two high-bandwidth connections happen to land on the same
queue, they will be penalized relative to the other queues; to address this
problem, the hash function is periodically changed to redistribute
connections among the queues. The algorithm works reasonably well and is
easy to make fast; the Linux networking code has had a stochastic queueing
module available for some time.
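The hashing step described above can be sketched in a few lines of C. This is only an illustration: the flow structure, the queue count, and the mixing function are assumptions made for the example, and it is not the code used by the Linux networking module. The point is that the perturbation value takes part in the hash, so changing it periodically reshuffles which connections share a queue.

```c
#include <stdint.h>

#define SFQ_QUEUES 16	/* assumed: far fewer queues than possible flows */

/* Hypothetical flow identifier: just enough to tell connections apart. */
struct flow {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Small integer mixer; illustrative only, not taken from any real qdisc. */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x45d9f3bU;
	x ^= x >> 16;
	x *= 0x45d9f3bU;
	x ^= x >> 16;
	return x;
}

/*
 * Map a flow to a queue index in [0, SFQ_QUEUES).  The same flow always
 * lands in the same queue until the caller picks a new perturbation.
 */
static unsigned int sfq_bucket(const struct flow *f, uint32_t perturbation)
{
	uint32_t h = mix32(f->saddr ^ perturbation);

	h = mix32(h ^ f->daddr);
	h = mix32(h ^ (((uint32_t)f->sport << 16) | f->dport));
	return h % SFQ_QUEUES;
}
```

A caller would keep one perturbation value, replace it from a timer every so often, and serve the queues in round-robin order.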
In the disk I/O context, the aim is to divide the available disk bandwidth
fairly between processes. The initial
implementation by Jens Axboe creates 64 subqueues for each block I/O
request queue, and distributes requests among the subqueues based on the
process ID of the requestor. (Actually, it uses the process ID of the
currently running process, which could, in some situations, not be the
originator of the request). When the time comes to dispatch requests, one
is taken from each subqueue, and the whole set is ordered before being sent
to the drive for execution.
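As a rough user-space model of that dispatch step (not Jens's patch; the structure names, the fixed subqueue depth, and the use of a plain modulus in place of a real hash are all assumptions made for the example), the logic might look like this:

```c
#include <stddef.h>
#include <stdlib.h>

#define NR_SUBQUEUES   64	/* the count given in the text */
#define SUBQUEUE_DEPTH 32	/* assumed fixed capacity, for simplicity */

struct io_request {
	int pid;		/* process that queued the request */
	unsigned long sector;	/* starting sector on the drive */
};

struct subqueue {
	struct io_request reqs[SUBQUEUE_DEPTH];
	size_t head, count;	/* simple ring buffer */
};

struct fair_queue {
	struct subqueue sq[NR_SUBQUEUES];
};

/* Pick a subqueue from the PID; a modulus stands in for the real hash. */
static unsigned int pid_to_subqueue(int pid)
{
	return (unsigned int)pid % NR_SUBQUEUES;
}

/* Queue a request; returns 0 on success, -1 if that subqueue is full. */
static int fq_add(struct fair_queue *fq, struct io_request rq)
{
	struct subqueue *sq = &fq->sq[pid_to_subqueue(rq.pid)];

	if (sq->count == SUBQUEUE_DEPTH)
		return -1;
	sq->reqs[(sq->head + sq->count) % SUBQUEUE_DEPTH] = rq;
	sq->count++;
	return 0;
}

/* qsort() comparison: ascending starting sector. */
static int by_sector(const void *a, const void *b)
{
	const struct io_request *x = a, *y = b;

	return (x->sector > y->sector) - (x->sector < y->sector);
}

/*
 * Take at most one request from each subqueue, sort the batch by sector,
 * and return how many requests were placed in out[], which must have
 * room for NR_SUBQUEUES entries.
 */
static size_t fq_dispatch(struct fair_queue *fq, struct io_request *out)
{
	size_t n = 0;

	for (unsigned int i = 0; i < NR_SUBQUEUES; i++) {
		struct subqueue *sq = &fq->sq[i];

		if (sq->count == 0)
			continue;
		out[n++] = sq->reqs[sq->head];
		sq->head = (sq->head + 1) % SUBQUEUE_DEPTH;
		sq->count--;
	}
	qsort(out, n, sizeof(*out), by_sector);
	return n;
}
```

In this model, a process that queues many requests still contributes only one to each dispatched batch, which is where the fairness comes from; two busy processes whose PIDs map to the same subqueue end up sharing a single slot.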
Taking things even further, Jens has also posted a complete fair queueing scheduler, which does
away with the hash function used in the stochastic approach. Each process
has its own queue, and requests are taken equally from all queues. It is
hard to get fairer than that. Of course, as Jens points out, once you have
this infrastructure in place, it is relatively easy to make things less
fair again by adding, say, I/O priorities to processes.
Where this all appears to be heading (though probably not in the 2.5
series) is toward a configurable I/O scheduler with several possible
algorithms which can be mixed and matched according to a site's local
policy. In other words, it looks a lot like the traffic control code which
has existed in the networking subsystem for a few years. As with
networking, most sites will probably not need to tweak their disk
scheduling regimes. Users with special needs, however, will be glad for
the ability to fine-tune things to their specifications.
Last week's Kernel Page included the first
articles in a series on porting
device drivers (and other kernel code) to the 2.5 kernel. These articles
are an offshoot of the work to update the
Linux Device
Drivers sample code (and then, of course, the book itself). Three
more articles have been added to the
series; one of them, which fills in more information on porting to the new
module loader, appears below. The other two (on
miscellaneous changes and
the seq_file interface) can be read
separately.
These articles will be collected at lwn.net/Articles/driver-porting as the
series continues to develop. With
luck, they will become a useful resource for the kernel development
community. Stay tuned...
The first article in this series noted a
couple of changes that result from the new, kernel-based module loader. In
particular, explicit
module_init() and
module_exit()
declarations are now necessary. Quite a few other things have changed as
well, however; this article will summarize the most important of those
changes.
Module parameters
The old
MODULE_PARM macro, which used to specify parameters which can be
passed to the module at load time, is no more. The new parameter
declaration scheme adds type safety and new functionality, but at the cost
of breaking compatibility with older modules.
Modules with parameters should now include <linux/moduleparam.h>
explicitly. Parameters are then declared with module_param:
module_param(name, type, perm);
Where
name is the name of the parameter (and of the variable
holding its value),
type is its type, and
perm is the
permissions to be applied to that parameter's sysfs entry. The
type parameter can be one of
byte,
short,
ushort,
int,
uint,
long,
ulong,
charp,
bool or
invbool. That
type will be verified during compilation, so it is no longer possible to
create confusion by declaring module parameters with mismatched types. The
plan is for module parameters to appear automatically in sysfs, but that
feature had not been implemented as of 2.6.0-test9; for now, the safest
alternative is to set
perm to zero, which means "no sysfs entry."
If the name of the parameter as seen outside the module differs from the
name of the variable used to hold the parameter's value, a variant on
module_param may be used:
module_param_named(name, value, type, perm);
Where
name is the externally-visible name and
value is
the internal variable.
String parameters will normally be declared with the charp type;
the associated variable is a char pointer which will be set to the
parameter's value. If you need to have a string value copied directly into
a char array, declare it as:
module_param_string(name, string, len, perm);
Usually,
len is best specified as
sizeof(string).
Finally, array parameters (supplied at module load time as a
comma-separated list) may be declared with:
module_param_array(name, type, num, perm);
The one parameter not found in module_param() (num) is
an output parameter; if a value for name is supplied when the
module is loaded, num will be set to the number of values given.
This macro uses the declared length of the array to ensure that it is not
overrun if too many values are provided.
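The three variants described above might be combined as in the following fragment. It is a sketch only: the parameter names (debug, label, ports) and the variables behind them are invented for illustration, and the macros are used in the form this article describes. In particular, num_ports is passed by name here and is assumed to be an unsigned int; later 2.6 kernels changed module_param_array() to take a pointer to the count instead, so check <linux/moduleparam.h> in the tree you are building against.

```c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/moduleparam.h>

MODULE_LICENSE("Dual BSD/GPL");

/* Seen as "debug" from outside the module; stored in debug_level. */
static int debug_level = 0;
module_param_named(debug, debug_level, int, 0);

/* A string copied into a fixed-size buffer rather than referenced by pointer. */
static char label[16] = "none";
module_param_string(label, label, sizeof(label), 0);

/* Up to four integers; num_ports receives the number actually supplied. */
static int ports[4];
static unsigned int num_ports;	/* type is an assumption; see above */
module_param_array(ports, int, num_ports, 0);

static int paramdemo_init(void)
{
	unsigned int i;

	printk(KERN_INFO "debug=%d label=%s\n", debug_level, label);
	for (i = 0; i < num_ports; i++)
		printk(KERN_INFO "port %u: %d\n", i, ports[i]);
	return 0;
}

static void paramdemo_exit(void)
{
}

module_init(paramdemo_init);
module_exit(paramdemo_exit);
```

Loading this hypothetical module with "debug=2 label=test ports=1,2,3" on the insmod command line would be expected to set debug_level to 2, copy "test" into label, and leave num_ports at 3.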
As an example of how the new module parameter code works, here is a
parameterized version of the "hello world" module shown previously:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/moduleparam.h>

MODULE_LICENSE("Dual BSD/GPL");

/*
 * A couple of parameters that can be passed in: how many times we say
 * hello, and to whom.
 */
static char *whom = "world";
module_param(whom, charp, 0);
static int howmany = 1;
module_param(howmany, int, 0);

static int hello_init(void)
{
    int i;

    for (i = 0; i < howmany; i++)
        printk(KERN_ALERT "(%d) Hello, %s\n", i, whom);
    return 0;
}

static void hello_exit(void)
{
    printk(KERN_ALERT "Goodbye, cruel %s\n", whom);
}

module_init(hello_init);
module_exit(hello_exit);
Inserting this module with a command like:
insmod ./hellop.ko howmany=2 whom=universe
causes the message "Hello, universe" to show up twice in the system
logfile, prefixed with "(0)" and "(1)" respectively.
Module aliases
A module alias is an alternative name by which a loadable module can be
known. These aliases are typically defined in
/etc/modules.conf,
but many of them are really a feature of the module itself. In 2.6, module
aliases can be embedded within a module's source. Simply add a line like:
MODULE_ALIAS("alias-name");
The module use count
In 2.4 and prior kernels, modules maintained their "use count" with macros
like
MOD_INC_USE_COUNT. The use count, of course, is intended to
prevent modules from being unloaded while they are being used. This method
was always somewhat error prone, especially when the use count was
manipulated inside the module itself. In the 2.6 kernel, reference
counting is handled differently.
The only safe way to manipulate the count of references to a module is
outside of the module's code. Otherwise, there will always be times when
the kernel is executing within the module, but the reference count is
zero. So this work has been moved outside of the modules, and life is
generally easier for module authors.
Any code which wishes to call into a module (or use some other module
resource) must first attempt to increment
that module's reference count:
int try_module_get(struct module *module);
It is also necessary to look at the return value from
try_module_get(); a zero return means that the try failed, and the
module should not be used. Failure can happen, for example, when the
module is in the process of being unloaded.
A reference to a module can be released with module_put().
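The usual shape of this pattern, for code that calls into another module through a table of function pointers, is sketched below. The widget_ops structure and widget_start() are hypothetical names made up for the example; only try_module_get(), module_put(), and the convention of an owner field set to THIS_MODULE come from the kernel itself.

```c
#include <linux/errno.h>
#include <linux/module.h>

/* Hypothetical operations table registered by some lower-level module. */
struct widget_ops {
	struct module *owner;	/* the provider sets this to THIS_MODULE */
	int (*start)(void);
};

/* Call into the providing module, holding a reference for the duration. */
static int widget_start(struct widget_ops *ops)
{
	int ret;

	/* A zero return means the module may be going away: do not use it. */
	if (!try_module_get(ops->owner))
		return -ENODEV;

	ret = ops->start();

	/* Drop the reference taken above. */
	module_put(ops->owner);
	return ret;
}
```

The reference is taken and released by the caller, outside of the module being protected, which is the arrangement the text above describes as the only safe one.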
Again, modules will not normally have to manage their own reference
counts. The only exception may be if a module provides a reference to an
internal data structure or function that is not accounted for otherwise.
In that (rare) case, a module could conceivably call
try_module_get() on itself.
As of this writing, modules are considered "live" during initialization,
meaning that a try_module_get() will succeed at that time. There
is still talk
of changing things, however, so that modules are not accessible until they
have completed their initialization process. That change will help prevent
a whole set of race conditions that come about when a module fails
initialization, but it also creates difficulties for modules which have to
be available early on. For example, block drivers should be available to read
partition tables off of disks when those disks are registered, which
usually happens when the module is initializing itself. If the policy
changes and modules go back off-limits during initialization, a call to a
function like make_module_live() may be required for those modules
which must be available sooner. (Update 2.6.0-test9: this change
has not happened and seems highly unlikely at this point).
Finally, it has not been entirely uncommon for driver authors to put in a special
ioctl() function which sets the module use count to zero.
Sometimes, during module development, errors would leave the module reference
count in a state where it could never reach zero, and there was no other way
to get the kernel to unload the module. The new module code supports
forced unloading of modules which appear to have outstanding references -
if the CONFIG_MODULE_FORCE_UNLOAD option has been set.
Needless to say, this option should only be used on development systems,
and, even then, with great caution.
Exporting symbols
For the most part, the exporting of symbols to the rest of the kernel has
not changed in 2.6 - except, of course, for the fact that any user of those
symbols should be using
try_module_get() first. In older kernels,
however, a module which did not arrange things otherwise would implicitly export all
of its symbols. In 2.6, things no longer work that way; only symbols which
have explicitly been exported are visible to the rest of the kernel.
Chances are that change will cause few problems. When you get a chance,
however, you can remove EXPORT_NO_SYMBOLS lines from your module
source. Exporting no symbols is now the default, so
EXPORT_NO_SYMBOLS is a no-op.
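A minimal sketch of what this means in practice follows; the function and variable names are invented for the example. Only hello_get_count() would be visible to other modules, because it is the only symbol passed to EXPORT_SYMBOL().

```c
#include <linux/module.h>

MODULE_LICENSE("Dual BSD/GPL");

/* Not exported: invisible to other modules, even though it is not static. */
int hello_private_count;

/* Exported explicitly, so other modules can link against it. */
int hello_get_count(void)
{
	return hello_private_count;
}
EXPORT_SYMBOL(hello_get_count);
```

Under 2.4, both symbols would have been exported by default unless the module arranged otherwise.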
The 2.4 inter_module_ functions have been deprecated as unsafe.
The symbol_get() function exists for the cases when normal symbol
linking does not work well enough. Its use requires setting up weak
references at compile time, and is beyond the scope of this document; there
are no users of symbol_get() in the 2.6.0-test9 kernel
source.
Kernel version checking
2.4 and prior kernels would include, in each module, a string containing
the version of the kernel that the module was compiled against. Normally,
modules would not be loaded if the compile version failed to match the
running kernel.
In 2.5, things still work mostly that way. The kernel version is loaded
into a separate, "link-once" ELF section, however, rather than being a
visible variable within the module itself. As a result, multi-file modules
no longer need to define __NO_VERSION__ before including
<linux/module.h>.
The new "version magic" scheme also records other information, including
the compiler version, SMP status, and preempt status; it is thus able to
catch more incompatible situations than the old scheme did.
Module symbol versioning ("modversions") has been completely reworked for
the 2.6 kernel. Module authors who use the makefiles shipped with the kernel
(and that is about the only way to work now) will find that dealing with
modversions has gotten easier than before. The #define hack which
tacked checksums onto kernel symbols has gone away in favor of a scheme
which stores checksum information in a separate ELF section.
Andrea Arcangeli, with a statement that he prefers coding to participating
in flame wars, recently
released a script
which can pull code from a BitKeeper repository without the need to
actually run BitKeeper. The script makes use of
the web interface to the
repository running on bkbits.net. It looks like a great way for
developers who do not want to run proprietary software to get access to
Linus's current tree. There is only one problem, however: the BitMover
folks are very concerned about the amount of bandwidth that could be burned
by extensive use of this script, and have promised to shut down the web
interface if the bandwidth bill gets too high.
The issue of access to the BitKeeper repositories via free software will
not go away, however; there is a determined subset of the kernel hacker
community that simply does not want to use proprietary code. Fortunately,
there appears to be an answer on the horizon: BitMover has promised to make Linus's repository available
as an automatically updated CVS repository. That repository, presumably,
will be hosted at kernel.org. At that point, a lot of minds should be
eased about access to the repository - and about long-term preservation of
the
kernel's revision history in an open format (not that the BitKeeper format,
which is based on SCCS, is particularly closed).
Incidentally, it has been just over one year since Linus let the world
know he was trying out BitKeeper in the 2.5.4-pre1 announcement.
Page editor: Jonathan Corbet