Brief items
The current development kernel is 2.5.69,
released by Linus on May 4. This large
patch includes a FireWire update, some IDE cleanups, more devfs cleanups, a
rework of the driver core class code, some new libfs helpers which make it
easier to create in-kernel virtual filesystems, a big tty layer cleanup, a
change to the interrupt handler prototype (see
the April 24 LWN Kernel Page), runtime barrier
instruction patching (which allows optimal performance on different
processors without the need to ship multiple kernels), more preparation for
an expanded
dev_t type, some swapoff improvements, a new set of
memory allocation flags, and numerous other fixes and updates. Details can
be found in
the long-format changelog.
Linus's BitKeeper tree contains some I2C improvements, some netfilter
tweaks, and a small number of other fixes.
2.5.69-ac1 is available from Alan Cox; it
adds some IDE fixes, a number of janitorial fixes, and various other fixes
and updates.
The current stable kernel remains 2.4.20; there have been no 2.4.21
prepatches since 2.4.21-rc1 on
April 21.
Alan Cox's 2.4.21-rc1-ac4 adds a vesafb fix,
some NFS improvements, audio and serial ATA support for the Intel ICH5
controller, a new AMI Megaraid driver, and various other fixes.
Kernel development news
The kernel list still sees occasional complaints about the interactive
response of recent development kernels. Many of these complaints, it turns
out, relate to OpenOffice. The specific problem in this case has been
found: a combination of a change in
sched_yield() semantics and,
one might say, suboptimal programming in OpenOffice.
The purpose of sched_yield() is to temporarily give up the
processor to other processes. The process calling sched_yield()
remains in the runnable state, and can normally expect to run again in
short order. The 2.5 development series has made a subtle change in the
way sched_yield() works, however. This call used to simply move
the process to the end of the run queue; now it moves the process to the
"expired" queue, effectively cancelling the rest of the process's time
slice. So a process calling sched_yield() now must wait until all
other runnable processes in the system have used up their time slices
before it will get the processor again.
The new semantics arguably make more sense; a process calling
sched_yield() should truly give up the processor. Some
threaded applications, however, implement busy-wait loops with
sched_yield(). OpenOffice is one such application; LinuxThreads
also, apparently, uses this technique. This kind of application performs
poorly with the new yield semantics; being moved to the "expired" queue
makes the loop far less responsive.
There has been talk of ways of changing sched_yield() so that
OpenOffice and other applications are not so badly penalized. One
approach, for example, preserves the application's time slice, but drops
its priority slightly. The consensus, however, seems to be that
applications that loop on sched_yield() are simply broken and
should be fixed. In the case of OpenOffice, this fix has apparently
already been made.
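As a rough illustration of what "fixed" can mean here, the sketch below replaces a sched_yield() polling loop with a POSIX condition variable, so the waiter sleeps until it is signalled rather than competing for the processor. This is not the change made in OpenOffice or LinuxThreads; the names struct event, event_wait(), event_signal() and event_demo() are invented for this example.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* A one-shot event: waiters sleep until another thread signals it. */
struct event {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool ready;
};

void event_init(struct event *ev)
{
    pthread_mutex_init(&ev->lock, NULL);
    pthread_cond_init(&ev->cond, NULL);
    ev->ready = false;
}

/*
 * The fragile pattern would be:
 *
 *     while (!ev->ready)
 *         sched_yield();
 *
 * Here the waiter blocks instead, so its wakeup latency does not
 * depend on where sched_yield() happens to place it in the run queue.
 */
void event_wait(struct event *ev)
{
    pthread_mutex_lock(&ev->lock);
    while (!ev->ready)                      /* guards against spurious wakeups */
        pthread_cond_wait(&ev->cond, &ev->lock);
    pthread_mutex_unlock(&ev->lock);
}

void event_signal(struct event *ev)
{
    pthread_mutex_lock(&ev->lock);
    ev->ready = true;
    pthread_cond_broadcast(&ev->cond);      /* wake every sleeping waiter */
    pthread_mutex_unlock(&ev->lock);
}

/* Thread body for the demo: signal the event passed in arg. */
static void *signaller(void *arg)
{
    event_signal(arg);
    return NULL;
}

/* Returns 1 if the waiter saw the event, 0 if the thread could not start. */
int event_demo(void)
{
    struct event ev;
    pthread_t tid;

    event_init(&ev);
    if (pthread_create(&tid, NULL, signaller, &ev) != 0)
        return 0;
    event_wait(&ev);
    pthread_join(tid, NULL);
    return ev.ready ? 1 : 0;
}
```

Calling event_demo() from a program built with -pthread exercises both sides; the waiting thread consumes no processor time while it sleeps.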
In the 2.2 (and prior) kernels, loadable modules were charged with the task
of maintaining a count of references to the module. When the reference
counting was done correctly, the kernel knew when it was safe to unload a
module. Unfortunately, the maintenance of the reference counts often was
not done right, and, in any case, in-module reference counting was subject
to certain small but unavoidable race conditions. Simply put, there was no
way to completely avoid situations where the kernel was executing module
code while the module's reference count was zero.
Starting in 2.3, the reference counting task moved slowly outside of the
modules themselves. For example, the file_operations structure
exported by char drivers got an owner field which points to the
module. Before the kernel will, say, call a module's open()
method, it increments the reference count. This mechanism puts the
reference count in one place (rather than in hundreds of module
open() methods) and eliminates race conditions. In 2.5, this
mechanism was extended further, and attempts to increment reference counts
were allowed to fail (for example, when the module still exists, but is
being unloaded).
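The "increment can fail" idea can be modelled in a few lines of user-space C. This is only a sketch of the logic, not kernel code: struct module_model and the model_*() functions are hypothetical names, and the sketch ignores the locking that a real implementation needs.

```c
#include <stdbool.h>

/* Toy model of a module's reference-counting state. */
struct module_model {
    int refcount;   /* number of outstanding references */
    bool going;     /* true once unloading has begun */
};

/* Take a reference; fails if the module is already being unloaded. */
bool model_try_get(struct module_model *m)
{
    if (m->going)
        return false;
    m->refcount++;
    return true;
}

/* Drop a reference taken with model_try_get(). */
void model_put(struct module_model *m)
{
    m->refcount--;
}

/* Begin unloading; refused while any reference is outstanding. */
bool model_start_unload(struct module_model *m)
{
    if (m->refcount > 0)
        return false;
    m->going = true;
    return true;
}
```

In this model, code that wants to call into the module first calls model_try_get() and must be prepared for a false return; that extra failure path is the cost discussed in the paragraphs that follow.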
This mechanism works reasonably well for most device drivers; the interface
between the kernel and the module is narrow, and references to the module
are limited to a few types of objects (open files and memory mappings).
Life gets harder, however, when you get into other parts of the kernel. A
recent discussion on the netdev list, started by the discovery of a situation where the
networking subsystem can call into a module which has been unloaded, shows
how hard it can be.
The networking code keeps track of a vast array of objects, each of which
can reference others, and each of which must be reference counted. A
networking module can only be unloaded when all of those objects are no
longer referenced and have been cleaned up. The immediate problem has to
do with network devices (exported by network device drivers); numerous
parts of the kernel can reference such a device. So the device itself
contains a reference count. In some situations, however, the kernel can
remove a driver module even though a particular device's reference count
has not dropped to zero. One solution to the problem, as proposed by Rusty
Russell, is to increment the module's reference count every time the
reference count of one of its network devices goes up. The problem with
this approach, according to David Miller, is that devices are just the
beginning:
So you propose to add this kind of thing for every ARP entry, every
route cache entry, every IPSEC policy, every socket, every struct
sock, every networking dynamic object ever created? When we add
SKB recycling, will we need to do a module get/put on every SKB
alloc/free/clone/copy? I think this way lies insanity :)
The insanity comes from the fact that attempts to increment module use
counts can fail. Trying to add an unbelievable number of failure paths to
the networking code to deal with this case does indeed seem like a one-way
ticket to the funny farm. All this extra reference counting also adds
significant overhead to the networking code's hand-crafted fast paths; that
is a penalty that the networking hackers are not prepared to accept.
The solution that some of the networking developers are asking for is to go
back to having modules maintain their own reference counts - sort of. At
the least, modules need some way of saying whether they can or cannot be
unloaded at a particular time. Usually that decision is just a matter of
looking at the internal objects they have to maintain anyway. So the
addition of a simple can_unload() function to many modules would
solve the immediate problem.
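A can_unload() check of this kind might look roughly like the following user-space sketch. The structure and its fields are invented for illustration; a real protocol module would inspect whatever bookkeeping it already maintains, and would need locking so that the answer stays valid until the unload completes.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping that a protocol module keeps anyway. */
struct proto_state {
    unsigned int open_sockets;       /* sockets still held by users */
    unsigned int time_wait_sockets;  /* sockets waiting out TIME_WAIT */
    unsigned int route_entries;      /* cached routes pointing at us */
};

/* The module may go away only when nothing it created is still alive. */
bool proto_can_unload(const struct proto_state *st)
{
    return st->open_sockets == 0 &&
           st->time_wait_sockets == 0 &&
           st->route_entries == 0;
}
```

The point of the design is that no per-object get/put calls appear on the fast paths; the counts are consulted only when somebody actually asks to remove the module.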
There is still another problem, though: actually getting a complex module
to a state where it can be unloaded can be a tricky task. Removing a
network protocol, for example, requires shutting down the protocol and
waiting for all objects to be freed. Little details (like sockets which
must, according to the protocol specification, sit in a 60-second
TIME_WAIT state before going away) complicate the picture and can
make the unload process take a long time. Users tend to worry when an
rmmod command appears to just hang. Handling all the details of
that case (especially if you want to allow users to interrupt the
rmmod operation) gets to be tricky indeed. Possible solutions are
being discussed, but no implementations are currently on the horizon.
Of course, one could always go back to Rusty's suggestion from the 2002
Kernel Summit: simply do not allow modules to be removed from the kernel.
Driver porting
One of the more significant changes in the 2.5 development series is the
creation of the integrated device model. The device model was originally
intended to make power management tasks easier through the maintenance of a
representation of the host system's hardware structure. A certain amount
of mission creep has occurred, however, and the device model is now closely
tied into a number of device management tasks - and other kernel functions
as well.
The device model presents a bit of a steep learning curve when first
encountered.
But the underlying concepts are not that hard to understand, and
driver programmers will benefit from a grasp of what's going on.
The fundamental task of the driver model is to maintain a set of internal
data structures which reflect the architecture and state of the underlying
system. Among other things, the driver model tracks:
- Which devices exist in the system, what power state they are in, what
bus they are attached to, and which driver is responsible for them.
- The bus structure of the system: which buses are connected to which
others (e.g. a USB controller can be plugged into a PCI bus),
which devices each bus can potentially support (along with associated
drivers), and which devices actually exist.
- The device drivers known to the system, which devices they can
support, and which bus type they know about.
- What kinds of devices ("classes") exist, and which real devices
of each class are connected. The driver model can thus answer
questions like "where is the mouse (or mice) on this system?" without
the need to worry about how the mouse might be physically connected.
- And many other things.
Underneath it all, the driver model works by tracking system configuration
changes (hardware and software) and maintaining a complex "web woven by a
spider on drugs" data structure to represent it all.
Some device model terms
The device model brings with it a whole new vocabulary to describe its data
structures. A quick overview of some driver model terms appears below;
much of this stuff will be looked at in detail later on.
- device
- A physical or virtual object which attaches to a (possibly virtual) bus.
- driver
- A software entity which may probe for and be bound to devices, and
which can perform certain management functions.
- bus
- A device which serves as an attachment point for other devices.
- class
- A particular type of device which can be expected to perform in
certain ways. Classes might include disks, partitions, serial ports,
etc.
- subsystem
- A top-level view of the system's structure. Subsystems used in the
kernel include devices (a hierarchical view of all devices
on the system), bus (a bus-oriented view), class
(devices by class), net (the
networking subsystem), and others. The best way to think of a
subsystem, perhaps, is as a particular view into the device model data
structure rather than a physical component of the system. The same
objects (devices, usually) show up in most
subsystems, but they are organized differently.
Other terms will be defined as we come to them.
sysfs
Sysfs is a virtual filesystem which provides a userspace-visible
representation of the device model. The device model and sysfs are
sometimes confused with each other, but they are distinct entities. The
device model functions just fine without sysfs (but the reverse is not
true).
The sysfs filesystem is usually mounted on /sys; for readers
without a 2.6 system at hand, an example
/sys hierarchy from a simple system is available. The top-level
directories there correspond to the known subsystems in the model. The
full device model data structure can be seen by looking at the entries and
links within each subsystem. Thus, for example, the first IDE disk on a
particular system, being a device, would appear as:
/sys/devices/pci0/00:11.1/ide0/0.0
But that device appears (in symbolic link form) under other subsystems as:
/sys/block/hda/device
/sys/bus/ide/devices/0.0
And, additionally, the IDE controller can be found as:
/sys/bus/pci/devices/00:11.1
/sys/bus/pci/drivers/VIA IDE/00:11.1
Within the disk's own sysfs directory (under /devices), the link
block points back at /sys/block/hda. As was said before, it
is a complicated data structure.
Driver writers generally need not worry about sysfs; it is magically
created and implemented by the driver model and bus driver code. The one
exception comes about when it comes to exporting attributes via
sysfs. These attributes represent some aspect of how the device and/or its
driver operate; they may or may not be writeable from user space. Sysfs is
now the preferred way (over /proc or ioctl()) to export
these variables to user space. The next article
in the series looks at how to manage attributes.
Kobjects
Even though most driver writers will never have to manipulate a kobject
directly, it is hard to dig very deeply into the driver model without
encountering them. A
kobject
is a simple representation of data relevant
to any object found in the system; in a true object-oriented language, this
would be the class that most others inherit from. Kobjects contain the
attributes that, it is expected, most objects in the system will need: a
name, reference count, parent, and type. Almost any object related to the
device model will have a kobject buried deeply inside it somewhere.
A kset is a container for a set of kobjects of identical type.
Ksets belong to a subsystem (but a subsystem can hold more than one kset).
Among other things, ksets control how the system responds to hotplug events
- the addition (or removal) of an entry to (or from) the set.
Together, kobjects and ksets make up much of the glue that holds the driver
model structure together. A separate
article in this series covers kobjects and ksets in detail.
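To make the four common attributes concrete, here is a small user-space model of a kobject-like structure. It is not the kernel's struct kobject: the demo_kobject names are invented, a bare release callback stands in for the "type", and the rule that a child holds a reference on its parent is an assumption of this sketch.

```c
#include <stddef.h>

/* Minimal stand-in for the data most device model objects share. */
struct demo_kobject {
    const char *name;
    int refcount;
    struct demo_kobject *parent;
    void (*release)(struct demo_kobject *kobj);  /* stands in for the type */
};

/* Take a reference and return the object for convenient chaining. */
struct demo_kobject *demo_kobject_get(struct demo_kobject *k)
{
    k->refcount++;
    return k;
}

/* Drop a reference; the last put releases the object, then the parent ref. */
void demo_kobject_put(struct demo_kobject *k)
{
    if (--k->refcount == 0) {
        struct demo_kobject *parent = k->parent;

        if (k->release)
            k->release(k);
        if (parent)
            demo_kobject_put(parent);
    }
}

/* Initialize with one reference; the child pins its parent in place. */
void demo_kobject_init(struct demo_kobject *k, const char *name,
                       struct demo_kobject *parent,
                       void (*release)(struct demo_kobject *))
{
    k->name = name;
    k->refcount = 1;
    k->parent = parent;
    k->release = release;
    if (parent)
        demo_kobject_get(parent);
}
```

Because the structure is meant to be embedded in larger objects, the release callback is where the containing object would be freed.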
When driver authors want to work with the kernel device model, they
probably want to (1) ensure that their devices are represented in
the system hierarchy, or (2) set up some custom attributes in sysfs.
To meet those needs, this article will look at device and attribute
registration. These tasks represent only part of what the device model is
about, but they are a good starting place.
Some time ago, this series looked at a simple
block driver. That driver will now be augmented with simple driver
model and sysfs support. The relevant bits of code will be shown below;
the full source is available here.
The device structure
Once upon a time,
struct device referred to a network interface.
That structure has long since been renamed
net_device, and, in
2.5,
struct device became the "base class" for representing all
devices in the system. The full structure can be seen in
<linux/device.h>; for most drivers, however, there are only
a few fields that are worth worrying about:
char name[DEVICE_NAME_SIZE];
char bus_id[BUS_ID_SIZE];
void *driver_data;
struct device_driver *driver;
The name field is a descriptive name (not something found in
/dev; it can be something like "Barmatic VX773 Frobnicator").
bus_id describes where the device can be found on the bus; for PCI
devices it is a string like "00:09.0". The driver can put
anything it wants into the driver_data field. And driver
describes the driver for this device; we'll get there shortly.
As a general rule, your driver will want to remember more about a device
than can be represented in struct device; this structure just
exists to hold the data common to all devices in the system. So drivers do
not normally deal with bare device structures; instead, this
structure is embedded within something larger. Thus, if you look at the
definition of, say, struct pci_dev or struct usb_device,
you'll find a struct device lurking within.
A general rule has been adopted that struct device is not
the first field in any other structure which contains it. The idea is that
programmers should think carefully about which structure they are dealing
with at any given time, and not just cast pointers back and forth. Going
from the larger structure to struct device is just a matter of
referencing the appropriate field. To go the other way, the
container_of() macro should be used. Thus, for example, a USB
driver could turn a struct device pointer (called, say,
devp) into a struct usb_device pointer with:
container_of(devp, struct usb_device, dev)
Here, "dev" is the name of the struct usb_device field
containing the device structure. Normally, a bus subsystem will
define a macro for doing this conversion; in the USB case, it is
to_usb_device().
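The pointer arithmetic behind container_of() can be tried out in user space. The version below is a simplified, portable definition (the kernel's macro adds a type check), and the demo_device and demo_usb_device structures are made up for the example; note that the embedded structure is deliberately not the first field.

```c
#include <stddef.h>

/* Simplified container_of(): step back from a member to its container. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for the generic structure shared by all devices. */
struct demo_device {
    char name[32];
};

/* Stand-in for a bus-specific structure embedding the generic one. */
struct demo_usb_device {
    int devnum;
    struct demo_device dev;   /* not the first field, per the convention */
};

/* Bus-level helper of the kind a subsystem would normally provide. */
static inline struct demo_usb_device *to_demo_usb_device(struct demo_device *d)
{
    return container_of(d, struct demo_usb_device, dev);
}
```

Given a pointer to the dev member of some struct demo_usb_device, to_demo_usb_device() returns the address of the enclosing structure; a plain cast would give the wrong answer because dev sits at a nonzero offset.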
The example simple block device (SBD) does not attach to a physical bus, of
course. For this type of device, the kernel exports a "system" bus which
may be used for virtual attachments. Usually, the system bus contains
devices like the processor, interrupt controller, and timer. But we can
attach virtual disks there too.
System bus devices are represented by struct sys_device, defined
as:
struct sys_device {
char *name;
u32 id;
struct device dev;
};
(A couple of fields have been omitted.)
Here, the name should be a /dev-style name - it will be
used (along with the id value) to create the device's entry in
sysfs under devices/sys.
The simple block device will be represented as a sys_device.
Before we get to that, however, there is another structure that deserves a
look.
The device driver structure
Device drivers, too, are represented in the device model, and in sysfs.
The relevant structure (again, with a few fields omitted) looks like:
struct device_driver {
char *name;
struct bus_type *bus;
int (*probe) (struct device * dev);
int (*remove) (struct device * dev);
void (*shutdown) (struct device * dev);
int (*suspend) (struct device * dev, u32 state, u32 level);
int (*resume) (struct device * dev, u32 level);
};
The name this time around is the name of the driver, of course.
The bus field is normally filled in by the bus-layer logic;
drivers need not worry about initializing it. The various methods provided
in the structure are for the handling of device discovery and power
management tasks. Usually, again, these methods are provided at the bus
level, with bus-specific calls down into the driver itself.
Time to look at some code. The SBD driver sets up its structure as:
static struct device_driver sbd_driver = {
.name = "sbd",
};
A driver for a real device may, at the least, want to add methods for
suspend and resume events. There is nothing in particular that SBD needs
to do in response to such events, however, so no such methods have been
provided.
In the SBD module init code, the driver structure is registered with:
driver_register(&sbd_driver);
It is also necessary to call driver_unregister() in the shutdown
code, of course.
This call is sufficient to create the directory bus/system/sbd in
the sysfs hierarchy. As SBD devices are registered, they will appear as
symbolic links in that directory. There will be no other data there, at
least not yet.
Driver attributes
Suppose we wanted to put something else in the driver sysfs directory?
That can be done through the creation of
driver attributes. The SBD driver
adds a file called
version which contains the version of the
driver code; user-space scripts could query that file to get a sense for
what capabilities might or might not be available.
Each driver attribute requires a name, file permissions, and functions to
format and set the value of the attribute. Generally the functions are
defined first. The SBD function to display the version is:
static ssize_t show_version(struct device_driver *drv, char *buf)
{
strcpy(buf, Version);
return strlen(Version) + 1;
}
The buffer passed into the show function is one full page, so there's
plenty of room. In general, however, values for sysfs attributes should be
short. The convention is that an attribute contains a single value - not
pages of information as can be found in some /proc files.
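The one-value-per-file convention and the page-sized buffer can be illustrated with a user-space model of a show function. Everything here is an assumption made for the sketch: DEMO_PAGE_SIZE stands in for the kernel's page size, demo_version is a made-up string, and the function returns the number of bytes written without counting the terminating NUL.

```c
#include <stdio.h>
#include <sys/types.h>

#define DEMO_PAGE_SIZE 4096              /* stand-in for one page */

static const char demo_version[] = "1.3";   /* hypothetical version string */

/* Format a single short value into buf, never writing past one page. */
ssize_t demo_show_version(char *buf)
{
    int len = snprintf(buf, DEMO_PAGE_SIZE, "%s\n", demo_version);

    if (len < 0)
        return -1;                       /* formatting error */
    if (len >= DEMO_PAGE_SIZE)
        len = DEMO_PAGE_SIZE - 1;        /* output was truncated to fit */
    return len;
}
```

Bounding the write with snprintf() costs nothing for a short value and keeps the function safe if the value ever grows.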
The store function has the prototype:
ssize_t (*store)(struct device_driver *drv,
const char *buf,
size_t count);
The return value is the number of bytes consumed by the operation (usually
count); it can also be one of the usual error codes. Since there
is little point in changing the driver's version string from user space,
SBD provides no store function.
Creating attributes requires filling in a driver_attribute
structure. This is usually done with the DRIVER_ATTR macro:
DRIVER_ATTR(name, mode, show, store);
In the case of the SBD driver, the relevant declaration is:
static DRIVER_ATTR(version, S_IRUGO, show_version, NULL);
This line creates a structure called driver_attr_version; it will
ultimately create a file called version in the driver's sysfs
directory. That file will have read-only permissions, and will call
show_version() when read.
Actually creating the file, however, requires one more step. This line
appears in the module initialization code, immediately after the call to
driver_register():
driver_create_file(&sbd_driver, &driver_attr_version);
There is a driver_remove_file(), but normally it is unnecessary to
call it - the files will be removed automatically when the driver is
unregistered.
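From user space, an attribute such as version is just a small file. A minimal reader, sketched here with standard C I/O, might look like the following; read_attr() is a hypothetical helper, and the path to read is left to the caller because it depends on where the driver's directory lands in sysfs.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read a single-value attribute file into buf and strip the trailing
 * newline, if any.  Returns 0 on success, -1 on any failure.
 */
int read_attr(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (len == 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {      /* empty file or read error */
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\r\n")] = '\0';    /* cut at the first line ending */
    return 0;
}
```

A script or daemon could call read_attr() with the path of the driver's version file and compare the result against the versions it knows how to handle.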
Device registration
Now that we are done looking at driver registration, we can get around to
creating our device. SBD is a "system bus" device; the bus-specific device
structure is created as:
static struct sys_device sbd_sys_device = {
.name = "sbd",
.dev = { /* struct device stuff */
.name = "Simple block device",
.driver = &sbd_driver
},
};
The id field defaults to zero, so this device will eventually be
sbd0. Note the assignment of the dev.driver field, which
connects the device with the driver that handles it.
At initialization time, the device is registered with:
sys_device_register(&sbd_sys_device);
sys_device_register() is a wrapper around
device_register() which handles "system bus" details. Once this
call has been made, the sysfs directory devices/sys/sbd0 is
created. Two attributes exist there: name contains "Simple block
device", and power contains the device's current power state.
Most importantly, however, the device exists within the device model data
structure, where it can respond to hotplug and power management events.
Devices, too, can have custom attributes. For SBD, an attribute
device contains the device number assigned to the virtual disk;
this value could be used, for example, to create a /dev entry
automatically in user space. The implementation is very similar to the
driver attribute we set up before:
static ssize_t show_devnum(struct device *dev, char *buf)
{
return sprintf(buf, "%02x00", major_num);
}
DEVICE_ATTR(device, S_IRUGO, show_devnum, NULL);
...
device_create_file(&sbd_sys_device.dev, &dev_attr_device);
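On the user-space side, the four hex digits produced by show_devnum() would have to be split back into major and minor numbers before a /dev entry could be created. The sketch below assumes the 16-bit device number layout (8-bit major, 8-bit minor) implied by the "%02x00" format; parse_devnum() is a hypothetical helper, not part of any existing tool.

```c
#include <stdlib.h>

/*
 * Split a hex device number such as "fe00" into major and minor parts.
 * Assumes an 8-bit major in the high byte and an 8-bit minor in the low
 * byte.  Returns 0 on success, -1 if the text is not a usable number.
 */
int parse_devnum(const char *text, unsigned int *major, unsigned int *minor)
{
    char *end;
    unsigned long value = strtoul(text, &end, 16);

    if (end == text)                 /* no hex digits at all */
        return -1;
    if (*end != '\0' && *end != '\n')
        return -1;                   /* trailing garbage after the number */
    if (value > 0xffff)              /* does not fit the assumed layout */
        return -1;

    *major = (unsigned int)(value >> 8) & 0xff;
    *minor = (unsigned int)value & 0xff;
    return 0;
}
```

The resulting pair could then be handed to mknod; an expanded dev_t would require a different split.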
One final step, specific to block devices, is taken in SBD. Before the
virtual disk's gendisk structure is registered with
add_disk(), a pointer to the device structure is stored:
Device.gd->driverfs_dev = &sbd_sys_device.dev;
This assignment causes a couple of extra symbolic links to be created in
sysfs; devices/sys/sbd0/block points to block/sbd0, and
block/sbd0/device points back to devices/sys/sbd0. In
this way, the relationship between the various entries is made explicit.
Going further
This article barely touches on the device model interface. Many details
have necessarily been omitted; many of them will be topics for future
articles. The next article in the series, which will appear soon (promise),
will look at the class interface. Power management also deserves a look,
but that interface remains in flux as of this writing. Expect an article
when the dust settles a bit.
Page editor: Jonathan Corbet