The current development kernel is 2.6.0-test5; Linus has released no
kernels since September 8. It has been a relatively slow period for
kernel development in general.
Patches in Linus's BitKeeper repository include a Coda filesystem update,
some initramfs tweaks, improvements in random driver locking, the
removal of some ext3 debugging hooks, direct I/O support for reiserfs, some
CPU frequency work, an Intel SpeedStep-SMI driver, a substantial amount of
janitorial work, and various fixes.
The current stable kernel is 2.4.22. Marcelo released 2.4.23-pre4 on September 12; it includes
some VM improvements (including the removal of the much-maligned
out-of-memory killer), an ia-64 update, some NFS work, a wireless update,
and various other fixes.
Kernel development news
The out-of-memory (OOM) killer is a longstanding source of controversy in
Linux development circles. The killer comes into play if the kernel
encounters a memory shortage so severe that the ongoing functioning of the
system is endangered. Rather than panic or lock up, the kernel brings in
the OOM killer, which goes looking for processes to kill. The killer has a
complicated set of heuristics built into it in an attempt to have it target
the processes that are least likely to be missed. Anybody who has seen the
OOM killer in action, however, knows that it can still make unfortunate
choices. Choosing the process which (1) is among the least valuable
on the system, and (2) is a significant part of the memory problem is
a difficult task.
As a result of discomfort with this grim reaper lurking within the kernel,
and of recently merged VM improvements, the OOM killer has been removed
from the 2.4.23 prepatch series.
For 2.6, Rusty Lynch has just posted a different
answer that should, perhaps, have been obvious from the beginning.
Rather than trying to come up with a set of OOM killer heuristics that
works for everybody, Rusty's patch sets up a notifier-based mechanism that
allows for pluggable OOM killer modules. With this patch, anybody who
wants to set up a different response to memory shortages need only write a
module implementing that technique.
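The general idea of a pluggable, notifier-style policy hook can be sketched in plain userspace C. This is not Rusty's patch and uses none of its code: every name here (sketch_register_oom_handler(), sketch_out_of_memory(), the two example policies) is hypothetical, and stopping at the first handler that claims the event is simply a choice made for this sketch.

```c
#include <stddef.h>

#define SKETCH_MAX_HANDLERS 8

/* A policy returns nonzero if it dealt with the shortage, zero to pass. */
typedef int (*sketch_oom_handler)(unsigned long pages_needed);

static sketch_oom_handler sketch_handlers[SKETCH_MAX_HANDLERS];
static size_t sketch_nr_handlers;
static unsigned long sketch_last_request;

/* Add a policy to the end of the chain; returns 0 on success, -1 if full. */
static int sketch_register_oom_handler(sketch_oom_handler h)
{
    if (sketch_nr_handlers == SKETCH_MAX_HANDLERS)
        return -1;
    sketch_handlers[sketch_nr_handlers++] = h;
    return 0;
}

/* Walk the chain in registration order; returns 1 if some policy acted. */
static int sketch_out_of_memory(unsigned long pages_needed)
{
    for (size_t i = 0; i < sketch_nr_handlers; i++)
        if (sketch_handlers[i](pages_needed))
            return 1;
    return 0;
}

/* Example policy that always declines to act. */
static int sketch_policy_decline(unsigned long pages_needed)
{
    (void)pages_needed;
    return 0;
}

/* Example policy that records the request and claims to have handled it. */
static int sketch_policy_record(unsigned long pages_needed)
{
    sketch_last_request = pages_needed;
    return 1;
}
```

A site wanting a different response would register its own handler function rather than modify the code that detects the shortage.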
The patch includes the standard OOM killer, along with an example
alternative which simply panics the system. But there is already talk of
creating OOM killer modules implementing different policies. One, which
has been posted already, targets processes if they are seen to be forking
children which fall victim to the OOM killer; it works on the assumption that the
parent is the real source of the problem. A "blame Mozilla" module has
been suggested. And Alan Cox has suggested involving the security module
code so that a site's security policies can be part of the OOM reaction.
It's unclear how far this process will go. But pluggable OOM killers are a
clear way of ending the long discussion over what the right policy should
be. Linux is, after all, about choice, even when the choices are
The OpenBIOS project has announced the
release of a Forth kernel, known as "BeginAgain." Most users, who are
strangely uninterested in typing Forth code at something close to bare
hardware, will probably not rush out to install this release. But it is a
step forward for the OpenBIOS project and for everybody wanting to run
their systems with free software all the way down to the bare metal. The
BeginAgain platform is mostly useful for testing at this point, but when a
few more pieces are added (a device interface and the client layer which
will allow the system to boot operating systems) OpenBIOS should start to
get interesting for a wider group of users.
Andrew Grover has announced that he is no
longer the ACPI maintainer; his duties have been passed on to Len Brown.
ACPI is still not popular among all developers and users, but the simple
fact is that good ACPI support is now required to get many systems to
function properly. Andrew and his team have put massive amounts of work
into the Linux ACPI implementation over the last few years, with the result
that Linux does, indeed, have good ACPI support. Thanks, Andrew; we're
looking forward to your next project, whatever it may be.
Much 2.5 kernel development work went toward increasing the size of the
device number type. That work has necessarily forced some
changes in how device drivers work with the rest of the kernel. This
article describes the changes as seen from the point of view of char
drivers. It is current as of the 2.6.0-test9 kernel. Note that the
interfaces described here are still volatile and could change
significantly before 2.6.0-final is released.
Major and minor numbers
With the expanded dev_t, it is no longer possible to assume
that major and minor numbers fit within eight bits. To the greatest extent
possible, the relevant interfaces have been changed in ways that will not
break existing drivers. In particular, a driver which uses the
register_chrdev() function to register a char device
will never see minor device numbers greater than 255. Attempts to open a
device node with a larger minor number will simply fail with a "no such
One change that is visible to all drivers, however, is the elimination of the
kdev_t type. Device numbers are now a simple dev_t
throughout the kernel. The place where this change is most apparent for
most will be the change in the type of the inode i_rdev field.
Drivers which need to get major or minor numbers from inodes should use the
two new helper functions:
unsigned iminor(struct inode *inode);
unsigned imajor(struct inode *inode);
Use of these functions will help keep a driver working in the future, even
if the representation within inodes changes again.
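To make the idea concrete, here is a minimal userspace model of what such helpers do: split a packed device number into its two parts. It is a sketch, not the kernel's code. The 20-bit minor field is an assumption not stated in this article, and all model_ names are hypothetical stand-ins for the real kernel types and functions.

```c
#include <stdint.h>

/* Userspace model only; real drivers use the kernel's own definitions. */
typedef uint32_t model_dev_t;            /* hypothetical stand-in for dev_t */

#define MODEL_MINORBITS 20               /* assumption: 20-bit minor field */
#define MODEL_MINORMASK ((1U << MODEL_MINORBITS) - 1)

/* Pack a major and minor number into one device number. */
#define MODEL_MKDEV(ma, mi) \
    ((((model_dev_t)(ma)) << MODEL_MINORBITS) | (model_dev_t)(mi))

struct model_inode {                     /* stand-in for struct inode */
    model_dev_t i_rdev;                  /* device number of the node */
};

/* Extract the major number from the inode's device number. */
static unsigned model_imajor(const struct model_inode *inode)
{
    return inode->i_rdev >> MODEL_MINORBITS;
}

/* Extract the minor number from the inode's device number. */
static unsigned model_iminor(const struct model_inode *inode)
{
    return inode->i_rdev & MODEL_MINORMASK;
}
```

A driver that only ever goes through the helpers does not care how many bits each field occupies, which is the point of the advice above.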
The new way
register_chrdev() continues to work as it always did, and drivers
which use that function need not be changed. Unchanged drivers, however,
will not be able to use the expanded device number range, or take advantage
of the other features provided
by the new code. Sooner or later, it is worthwhile to get to know the new
The new way to register a char device range is with:
int register_chrdev_region(dev_t from, unsigned count, char *name);
Here, from is the device number of the first device in the range,
count is the number of device numbers to register, and
name is the base name of the device (it appears in
/proc/devices). The return value is zero if all goes well, and a
negative error number otherwise.
Note that from is a device number, not a major number. This
interface allows the registration of an arbitrary range of device numbers,
starting from anywhere. So the from argument specifies both the
beginning major and minor number. If the count argument exceeds
the number of minor numbers available, the allocation will continue on into
the next major number; this is a design feature.
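The spill into the next major number is plain arithmetic on the packed device number. The following userspace sketch shows it under the assumption of a 20-bit minor field (not stated in this article); the sketch_ names are hypothetical and none of this is kernel code.

```c
#include <stdint.h>

#define SKETCH_MINORBITS 20              /* assumption: 20-bit minor field */
#define SKETCH_MINORMASK ((1U << SKETCH_MINORBITS) - 1)

typedef uint32_t sketch_dev_t;           /* hypothetical stand-in for dev_t */

/* Pack a major and minor number into one device number. */
static sketch_dev_t sketch_mkdev(unsigned major, unsigned minor)
{
    return ((sketch_dev_t)major << SKETCH_MINORBITS) | minor;
}

static unsigned sketch_major(sketch_dev_t dev)
{
    return dev >> SKETCH_MINORBITS;
}

static unsigned sketch_minor(sketch_dev_t dev)
{
    return dev & SKETCH_MINORMASK;
}

/* Last device number in a range of count numbers (count >= 1). */
static sketch_dev_t sketch_range_last(sketch_dev_t from, unsigned count)
{
    return from + count - 1;
}
```

For instance, a range of four numbers that starts two below the top of major 240's minor space ends at minor 1 of major 241, because the carry out of the minor field increments the major field.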
register_chrdev_region() works if you know which major device
number you wish to use. If, instead, your driver expects to work with
dynamic major number allocation, it should use:
int alloc_chrdev_region(dev_t *dev, unsigned baseminor,
unsigned count, char *name);
In this case, dev is an output-only parameter which will be set to
the first device number of the allocated range. The input parameters are
baseminor, the first minor number to use (usually zero);
count, the number of device numbers to allocate; and
name, the base name of the device. Once again, the return value
is zero or a negative error code.
Connecting up devices
Some readers may have noticed that the above functions, unlike
register_chrdev(), do not have a file_operations
argument. Registering a device number range sets those numbers aside for
your use, but it does not actually make any device operations available to
user space. There is now a separate object (struct cdev) which
represents char devices, and which must be set up by your driver to
actually make a device available.
To work with struct cdev, your code should include
<linux/cdev.h>. Then, the usual way of getting one of these
structures is with:
struct cdev *cdev_alloc(void);
If all goes well, the return value will be a pointer to a newly allocated,
initialized cdev structure. Check that value, though; there is a
memory allocation involved, and things can always fail.
It is also possible to declare a static cdev structure, or to
embed one within another structure. In this
case, you should pass it to:
void cdev_init(struct cdev *cdev, struct file_operations *fops);
before doing anything else with it.
Your driver will need to set a couple of fields in the cdev
structure before adding it to the system. The owner field should
be set to the owning module, usually THIS_MODULE. The device's
file_operations structure should be pointed to by the ops
field. And, to get a directory in sysfs, you should also set the
name field in the embedded kobject, with something like:
struct cdev *my_cdev = cdev_alloc();
kobject_set_name(&my_cdev->kobj, "my_cdev%d", devnum);
Note that kobject_set_name() takes a printf()-like format
string and associated arguments.
Once you have the structure set up, it's time to add it to the system:
int cdev_add(struct cdev *cdev, dev_t dev, unsigned count);
cdev is, of course, a pointer to the cdev structure;
dev is the first device number handled by this structure, and
count is the number of devices it implements. Thus, one
cdev structure can stand in for several physical devices, though
you will usually not want to do things that way.
There are two important things to bear in mind when calling
cdev_add(). The first is that this call can fail. If the return
value is nonzero, the device has not been added and is not visible to user
space. If, instead, the call succeeds, the device becomes immediately
live. You should not call cdev_add() until your driver is
completely ready to handle calls to the device's methods.
Adding a device also creates a directory entry under /sys/cdev,
using the name stored in the kobj.name field. As of this writing,
that directory is empty, but one assumes that all sorts of good things (the
associated device numbers, if nothing else) will eventually show up there.
If you need to get rid of a cdev structure, the usual way of doing
things is to call:
void cdev_del(struct cdev *cdev);
This function should only be called, however, on a cdev structure
which has been successfully added to the system with cdev_add().
If you need to destroy a structure which has not been added in this way
(perhaps cdev_add() failed), you must, instead, manually decrement
the reference count in the structure's kobject with a call like:
Calling cdev_del() on a device which is still active (if, say, a
user-space process still has an open file reference to it) will cause the
device to become inaccessible, but it will not actually delete the
structure at that time. The reference count in the structure will keep it
around until all the references have gone away. That means that your
driver's methods could be called after you have deleted your cdev
object - a possibility you should be aware of.
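The lifetime rule described above can be modeled in a few lines of userspace C. This is only an illustration of the reference-counting behavior, with hypothetical model_ names; it does not reproduce the kernel's cdev or kobject implementation.

```c
#include <stdlib.h>

struct model_cdev {
    int refcount;   /* stands in for the embedded kobject's count */
    int live;       /* nonzero while new opens can find the device */
};

/* Allocate a structure holding one initial reference. */
static struct model_cdev *model_cdev_alloc(void)
{
    struct model_cdev *c = malloc(sizeof(*c));

    if (c) {
        c->refcount = 1;
        c->live = 0;
    }
    return c;
}

/* Make the device visible. */
static void model_cdev_add(struct model_cdev *c)
{
    c->live = 1;
}

/* Take a reference, as an open file would. */
static void model_cdev_get(struct model_cdev *c)
{
    c->refcount++;
}

/* Drop a reference; returns 1 if the structure was freed. */
static int model_cdev_put(struct model_cdev *c)
{
    if (--c->refcount == 0) {
        free(c);
        return 1;
    }
    return 0;
}

/* Hide the device and drop the initial reference. */
static int model_cdev_del(struct model_cdev *c)
{
    c->live = 0;
    return model_cdev_put(c);
}
```

In this model, deleting a device that still has an open reference hides it but does not free it; the memory goes away only when the last holder drops its reference.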
The reference count of a cdev structure can be manipulated with:
struct kobject *cdev_get(struct cdev *cdev);
void cdev_put(struct cdev *cdev);
Note that these functions change two reference counts: that of the
cdev structure, and that of the module which owns it. It will be
rare for drivers to call these functions, however.
Finding your device in file operations
Most of the methods provided by the driver in the file_operations
structure take a struct inode (or a struct file, which can
be used to find the associated inode) as an argument. Traditionally, Linux
drivers have looked at the device number stored in the inode's
i_rdev field to determine which device is being operated upon.
That technique still works, but, in many cases, there is a better way. In
2.6, struct inode contains a field called i_cdev, which
contains a pointer to the associated cdev structure. If you have
embedded one of those structures within your own, device-specific
structure, you can use the container_of() macro (described in the
kobject article) to obtain a pointer to that structure.
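The embedding technique can be tried out in userspace. The sketch below uses a simplified container_of-style macro built on offsetof(); struct sketch_cdev, struct sketch_inode, and struct my_device are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* Simplified userspace equivalent of the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_cdev {                 /* stand-in for struct cdev */
    int placeholder;
};

struct my_device {                   /* hypothetical driver structure */
    int unit;
    struct sketch_cdev cdev;         /* embedded structure, not a pointer */
};

struct sketch_inode {                /* stand-in for struct inode */
    struct sketch_cdev *i_cdev;
};

/* Recover the enclosing device structure from the inode's i_cdev. */
static struct my_device *my_device_from_inode(struct sketch_inode *inode)
{
    return sketch_container_of(inode->i_cdev, struct my_device, cdev);
}
```

This only works when the cdev is embedded in the larger structure; a separately allocated cdev has no enclosing structure to recover.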
Why things were done this way
The new interface may seem rather more complex to many. Before, a single
call to register_chrdev() was all that was necessary; now a driver
has to deal with the additional hassle of managing cdev
structures. This approach provides a great deal of flexibility, however,
in how the device number space can be managed. Each device gets exactly
the number range it needs, and its operations will never be invoked for
device numbers outside that range. In the past, it has been noted that
many drivers had incorrect range checks on minor numbers; with the new
scheme, all those range checks can go away altogether.
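As an illustration of why per-driver range checks become unnecessary, consider a lookup that maps a device number to the registered range containing it. This is a userspace sketch with hypothetical names; it is not how the kernel implements the lookup, only a model of the guarantee described above.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_range {
    uint32_t first;      /* first device number in the range */
    unsigned count;      /* how many device numbers it covers */
    const char *name;    /* owner of the range */
};

/* Return the range containing dev, or NULL if no range claims it. */
static const struct sketch_range *
sketch_lookup(const struct sketch_range *table, size_t n, uint32_t dev)
{
    for (size_t i = 0; i < n; i++) {
        if (dev >= table[i].first &&
            dev - table[i].first < table[i].count)
            return &table[i];
    }
    return NULL;
}
```

Because a number outside every registered range never resolves to an owner, the owner's methods never see it, so the check is done once in a central place rather than in every driver.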
The new method also makes it easy for each device to have its own
file_operations structure without the need for big switch
statements in the open() method. Separate cdev
structures can also have separate entries in /sys/cdev.
In general, char devices have
become proper objects within the kernel, with all the advantages that come
with that status. A little bit of extra object management is a small price
to pay.
Patches and updates
Core kernel code
- Con Kolivas: O20.2int.
(September 16, 2003)
- Con Kolivas: O20.3int.
(September 16, 2003)
Filesystems and block I/O
Benchmarks and bugs
Page editor: Jonathan Corbet