Kernel development
Brief items
Kernel release status
The current development kernel is 2.5.69, released by Linus on May 4. This large patch includes a FireWire update, some IDE cleanups, more devfs cleanups, a rework of the driver core class code, some new libfs helpers which make it easier to create in-kernel virtual filesystems, a big tty layer cleanup, a change to the interrupt handler prototype (see the April 24 LWN Kernel Page), runtime barrier instruction patching (which allows optimal performance on different processors without the need to ship multiple kernels), more preparation for an expanded dev_t type, some swapoff improvements, a new set of memory allocation flags, and numerous other fixes and updates. Details can be found in the long-format changelog.
Linus's BitKeeper tree contains some I2C improvements, some netfilter tweaks, and a small number of other fixes.
2.5.69-ac1 is available from Alan Cox; it adds some IDE fixes, a number of janitorial fixes, and various other fixes and updates.
The current stable kernel remains 2.4.20; there have been no 2.4.21 prepatches since 2.4.21-rc1 on April 21.
Alan Cox's 2.4.21-rc1-ac4 adds a vesafb fix, some NFS improvements, audio and serial ATA support for the Intel ICH5 controller, a new AMI Megaraid driver, and various other fixes.
Kernel development news
The right way to yield
The kernel list still sees occasional complaints about the interactive response of recent development kernels. Many of these complaints, it turns out, relate to OpenOffice. The specific problem in this case has been found: a combination of a change in sched_yield() semantics and, one might say, suboptimal programming in OpenOffice.
The purpose of sched_yield() is to temporarily give up the processor to other processes. The process calling sched_yield() remains in the runnable state, and can normally expect to run again in short order. The 2.5 development series has made a subtle change in the way sched_yield() works, however. This call used to simply move the process to the end of the run queue; now it moves the process to the "expired" queue, effectively cancelling the rest of the process's time slice. So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again.
The new semantics arguably make more sense; a process calling sched_yield() should truly give up the processor. Some threaded applications, however, implement busy-wait loops with sched_yield(). OpenOffice is one such application; LinuxThreads also, apparently, uses this technique. This kind of application performs poorly with the new yield semantics; being moved to the "expired" queue makes the loop far less responsive.
There has been talk of ways of changing sched_yield() so that OpenOffice and other applications are not so badly penalized. One approach, for example, preserves the application's time slice, but drops its priority slightly. The consensus, however, seems to be that applications that loop on sched_yield() are simply broken and should be fixed. In the case of OpenOffice, this fix has already apparently been made.
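To make the distinction concrete, here is a minimal user-space sketch of the two waiting styles. It is not taken from OpenOffice or LinuxThreads; the "gate" structure and the gate_*() names are invented for this example. The first wait function polls a flag and calls sched_yield() between checks, which is the pattern that suffers under the new semantics. The second sleeps on a POSIX condition variable, so the waiting thread leaves the run queue entirely until it is woken.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical "gate": one thread opens it, others wait for it. */
struct gate {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    atomic_bool     open;
};

#define GATE_INITIALIZER \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/* Discouraged: stays runnable and polls, yielding between checks. */
static void gate_wait_yield(struct gate *g)
{
    while (!atomic_load(&g->open))
        sched_yield();
}

/* Preferred: blocks on a condition variable until the gate opens. */
static void gate_wait(struct gate *g)
{
    pthread_mutex_lock(&g->lock);
    while (!atomic_load(&g->open))
        pthread_cond_wait(&g->cond, &g->lock);
    assert(atomic_load(&g->open));
    pthread_mutex_unlock(&g->lock);
}

/* Open the gate and wake every thread sleeping in gate_wait(). */
static void gate_open(struct gate *g)
{
    pthread_mutex_lock(&g->lock);
    atomic_store(&g->open, true);
    pthread_cond_broadcast(&g->cond);
    pthread_mutex_unlock(&g->lock);
}
```

In a real program another thread would call gate_open(). With the condition-variable version, how quickly the waiter resumes no longer depends on how the scheduler treats yielding processes.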
Module reference counts - back to the future?
In the 2.2 (and prior) kernels, loadable modules were charged with the task of maintaining a count of references to the module. When the reference counting was done correctly, the kernel knew when it was safe to unload a module. Unfortunately, the maintenance of the reference counts often was not done right, and, in any case, in-module reference counting was subject to certain small but unavoidable race conditions. Simply put, there was no way to completely avoid situations where the kernel was executing module code while the module's reference count was zero.
Starting in 2.3, the reference counting task moved slowly outside of the modules themselves. For example, the file_operations structure exported by char drivers got an owner field which points to the module. Before the kernel will, say, call a module's open() method, it increments the reference count. This mechanism puts the reference count in one place (rather than in hundreds of module open() methods) and eliminates race conditions. In 2.5, this mechanism was extended further, and attempts to increment reference counts were allowed to fail (for example, when the module still exists, but is being unloaded).
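The owner-based scheme can be modeled roughly in user space. The sketch below is not kernel code: every model_* name is invented for this example, locking is ignored entirely, and the real kernel interfaces differ in detail. It shows only the two ideas from the text: the caller takes the reference before entering module code, and taking a reference can fail once the module has started to go away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_LIVE, MODEL_GOING };

struct model_module {
    enum model_state state;
    int refcount;
};

/* Stand-in for an operations structure carrying an owner pointer. */
struct model_file_ops {
    struct model_module *owner;
    int (*open)(void);
};

/* Try to take a reference; fails if the module is being unloaded. */
static bool model_try_get(struct model_module *m)
{
    if (m == NULL)
        return true;            /* model assumption: no owner, nothing to pin */
    if (m->state != MODEL_LIVE)
        return false;
    m->refcount++;
    return true;
}

static void model_put(struct model_module *m)
{
    if (m != NULL) {
        assert(m->refcount > 0);
        m->refcount--;
    }
}

/* The caller pins the module before calling into it. */
static int model_call_open(const struct model_file_ops *ops)
{
    int ret;

    if (!model_try_get(ops->owner))
        return -1;              /* module is on its way out */
    ret = ops->open();
    if (ret != 0)
        model_put(ops->owner);  /* a failed open holds no reference */
    return ret;
}

/* Unloading may begin only when nothing references the module. */
static bool model_begin_unload(struct model_module *m)
{
    if (m->refcount != 0)
        return false;
    m->state = MODEL_GOING;
    return true;
}

/* A trivial open() method for the model module. */
static int sample_open(void)
{
    return 0;
}
```

The extra failure return in model_call_open() is exactly the kind of path that every caller must be prepared to handle once reference acquisition is allowed to fail.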
This mechanism works reasonably well for most device drivers; the interface between the kernel and the module is narrow, and references to the module are limited to a few types of objects (open files and memory mappings). Life gets harder, however, when you get into other parts of the kernel. A recent discussion on the netdev list, started by the discovery of a situation where the networking subsystem can call into a module which has been unloaded, shows how hard it can be.
The networking code keeps track of a vast array of objects, each of which can reference others, and each of which must be reference counted. A networking module can only be unloaded when all of those objects are no longer referenced and have been cleaned up. The immediate problem has to do with network devices (exported by network device drivers); numerous parts of the kernel can reference such a device. So the device itself contains a reference count. In some situations, however, the kernel can remove a driver module even though a particular device's reference count had not dropped to zero. One solution to the problem, as proposed by Rusty Russell, is to increment the module's reference count every time one of its network device's count goes up. The problem with this approach, according to David Miller, is that devices are just the beginning.
The insanity comes from the fact that attempts to increment module use counts can fail. Trying to add an unbelievable number of failure paths to the networking code to deal with this case does indeed seem like a one-way ticket to the funny farm. All this extra reference counting also adds significant overhead to the networking code's hand-crafted fast paths; that is a penalty that the networking hackers are not prepared to accept.
The solution that some of the networking developers are asking for is to go back to having modules maintain their own reference counts - sort of. At the least, modules need some way of saying whether they can or cannot be unloaded at a particular time. Usually that decision is just a matter of looking at the internal objects they have to maintain anyway. So the addition of a simple can_unload() function to many modules would solve the immediate problem.
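A user-space sketch of what such a hook might look like follows. Only the can_unload() name comes from the discussion; the structure, the netdrv_* names, and the return values are assumptions made for this illustration, and no particular kernel interface is implied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical module descriptor with an optional unload check. */
struct unloadable_module {
    const char *name;
    bool (*can_unload)(void);   /* model assumption: NULL means "always" */
};

/* State the hypothetical network driver has to track anyway. */
static int netdrv_live_devices;

/* The module answers from its own bookkeeping. */
static bool netdrv_can_unload(void)
{
    return netdrv_live_devices == 0;
}

/* Returns 0 if the unload may proceed, -1 if the module is busy. */
static int model_try_unload(const struct unloadable_module *m)
{
    assert(m != NULL);
    if (m->can_unload != NULL && !m->can_unload())
        return -1;
    return 0;
}
```

The appeal of this approach is that no per-object module reference is taken on the fast path; the cost is paid only when somebody actually asks for the module to be removed.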
There is still another problem, though: actually getting a complex module to a state where it can be unloaded can be a tricky task. Removing a network protocol, for example, requires shutting down the protocol and waiting for all objects to be freed. Little details (like sockets which must, according to the protocol specification, sit in a 60-second TIME_WAIT state before going away) complicate the picture and can make the unload process take a long time. Users tend to worry when an rmmod command appears to just hang. Handling all the details of that case (especially if you want to allow users to interrupt the rmmod operation) gets to be tricky indeed. Possible solutions are being discussed, but no implementations are currently on the horizon.
Of course, one could always go back to Rusty's suggestion from the 2002 Kernel Summit: simply do not allow modules to be removed from the kernel.
Driver porting
Driver porting: Device model overview
This article is part of the LWN Porting Drivers to 2.6 series.
The device model presents a bit of a steep learning curve when first encountered. But the underlying concepts are not that hard to understand, and driver programmers will benefit from a grasp of what's going on.
The fundamental task of the driver model is to maintain a set of internal data structures which reflect the architecture and state of the underlying system. Among other things, the driver model tracks:
- Which devices exist in the system, what power state they are in, what
bus they are attached to, and which driver is responsible for them.
- The bus structure of the system; which buses are connected to which
others (e.g. a USB controller can be plugged into a PCI bus),
which devices each bus can potentially support (along with associated
drivers), and which devices actually exist.
- The device drivers known to the system, which devices they can
support, and which bus type they know about.
- What kinds of devices ("classes") exist, and which real devices
of each class are connected. The driver model can thus answer
questions like "where is the mouse (or mice) on this system?" without
the need to worry about how the mouse might be physically connected.
- And many other things.
Underneath it all, the driver model works by tracking system configuration changes (hardware and software) and maintaining a complex "web woven by a spider on drugs" data structure to represent it all.
Some device model terms
The device model brings with it a whole new vocabulary to describe its data structures. A quick overview of some driver model terms appears below; much of this stuff will be looked at in detail later on.
Other terms will be defined as we come to them.
- device
- A physical or virtual object which attaches to a (possibly virtual) bus.
- driver
- A software entity which may probe for and be bound to devices, and which can perform certain management functions.
- bus
- A device which serves as an attachment point for other devices.
- class
- A particular type of device which can be expected to perform in certain ways. Classes might include disks, partitions, serial ports, etc.
- subsystem
- A top-level view of the system's structure. Subsystems used in the kernel include devices (a hierarchical view of all devices on the system), bus (a bus-oriented view), class (devices by class), net (the networking subsystem), and others. The best way to think of a subsystem, perhaps, is as a particular view into the device model data structure rather than a physical component of the system. The same objects (devices, usually) show up in most subsystems, but they are organized differently.
sysfs
Sysfs is a virtual filesystem which provides a userspace-visible representation of the device model. The device model and sysfs are sometimes confused with each other, but they are distinct entities. The device model functions just fine without sysfs (but the reverse is not true).
The sysfs filesystem is usually mounted on /sys; for readers without a 2.6 system at hand, an example /sys hierarchy from a simple system is available. The top-level directories there correspond to the known subsystems in the model. The full device model data structure can be seen by looking at the entries and links within each subsystem. Thus, for example, the first IDE disk on a particular system, being a device, would appear as:
/sys/devices/pci0/00:11.1/ide0/0.0
But that device appears (in symbolic link form) under other subsystems as:
/sys/block/hda/device
/sys/bus/ide/devices/0.0
And, additionally, the IDE controller can be found as:
/sys/bus/pci/devices/00:11.1
/sys/bus/pci/drivers/VIA IDE/00:11.1
Within the disk's own sysfs directory (under /devices), the link block points back at /sys/block/hda. As was said before, it is a complicated data structure.
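Since these cross-references are ordinary symbolic links, user space can follow them with readlink(). The helpers below are a small sketch of that; the function names are invented for this example, and the paths a caller would pass (such as /sys/block/hda/device) are simply the ones from the listing above, which will differ from system to system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the target of the symbolic link at path into buf, adding the
 * terminating NUL that readlink() itself does not supply.
 * Returns the target length, or -1 on error.
 */
static ssize_t read_link_target(const char *path, char *buf, size_t size)
{
    ssize_t len;

    assert(path != NULL && buf != NULL);
    if (size == 0)
        return -1;
    len = readlink(path, buf, size - 1);
    if (len < 0)
        return -1;
    buf[len] = '\0';
    return len;
}

/* Return true if the link at path points at exactly the expected target. */
static bool link_points_to(const char *path, const char *expected)
{
    char target[256];

    if (read_link_target(path, target, sizeof(target)) < 0)
        return false;
    return strcmp(target, expected) == 0;
}
```

Note that link targets may be stored as relative paths, in which case they must be resolved against the directory containing the link.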
Driver writers generally need not worry about sysfs; it is magically created and implemented by the driver model and bus driver code. The one exception is the export of attributes via sysfs. These attributes represent some aspect of how the device and/or its driver operate; they may or may not be writeable from user space. Sysfs is now the preferred way (over /proc or ioctl()) to export these variables to user space. The next article in the series looks at how to manage attributes.
Kobjects
Even though most driver writers will never have to manipulate a kobject directly, it is hard to dig very deeply into the driver model without encountering them. A kobject is a simple representation of data relevant to any object found in the system; in a true object-oriented language, this would be the class that most others inherit from. Kobjects contain the attributes that, it is expected, most objects in the system will need: a name, reference count, parent, and type. Almost any object related to the device model will have a kobject buried deeply inside it somewhere.
A kset is a container for a set of kobjects of identical type. Ksets belong to a subsystem (but a subsystem can hold more than one kset). Among other things, ksets control how the system responds to hotplug events - the addition (or removal) of an entry to (or from) the set.
Together, kobjects and ksets make up much of the glue that holds the driver model structure together. A separate article in this series covers kobjects and ksets in detail.
Driver porting: Devices and attributes
This article is part of the LWN Porting Drivers to 2.5 series.
Some time ago, this series looked at a simple block driver. That driver will now be augmented with simple driver model and sysfs support. The relevant bits of code will be shown below; the full source is available here.
The device structure
Once upon a time, struct device referred to a network interface. That structure has long since been renamed net_device, and, in 2.5, struct device became the "base class" for representing all devices in the system. The full structure can be seen in <linux/device.h>; for most drivers, however, there are only a few fields that are worth worrying about:
char name[DEVICE_NAME_SIZE];
char bus_id[BUS_ID_SIZE];
void *driver_data;
struct device_driver *driver;
The name field is a descriptive name (not something found in /dev; it can be something like "Barmatic VX773 Frobnicator"). bus_id describes where the device can be found on the bus; for PCI devices it is a string like "00:09.0". The driver can put anything it wants into the driver_data field. And driver describes the driver for this device; we'll get there shortly.
As a general rule, your driver will want to remember more about a device than can be represented in struct device; this structure just exists to hold the data common to all devices in the system. So drivers do not normally deal with bare device structures; instead, this structure is embedded within something larger. Thus, if you look at the definition of, say, struct pci_dev or struct usb_device, you'll find a struct device lurking within.
A general rule has been adopted that struct device is not the first field in any other structure which contains it. The idea is that programmers should think carefully about which structure they are dealing with at any given time, and not just cast pointers back and forth. Going from the larger structure to struct device is just a matter of referencing the appropriate field. To go the other way, the container_of() macro should be used. Thus, for example, a USB driver could turn a struct device pointer (called, say, devp) into a struct usb_device pointer with:
container_of(devp, struct usb_device, dev)
Here, "dev" is the name of the struct usb_device field containing the device structure. Normally, a bus subsystem will define a macro for doing this conversion; in the USB case, it is to_usb_device().
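The pointer arithmetic involved can be tried out in user space. The following sketch uses a simplified container_of() built on offsetof() (the kernel's own macro adds type checking) and cut-down stand-ins for struct device and struct usb_device; the devnum field and the field sizes are assumptions made only to give the structures some content.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified version: step back from the member to the start of the container. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down stand-in for the generic device structure. */
struct device {
    char bus_id[20];
    void *driver_data;
};

/* Cut-down stand-in; note that dev is deliberately not the first field. */
struct usb_device {
    int devnum;
    struct device dev;
};

/* The sort of conversion helper a bus subsystem would provide. */
static struct usb_device *to_usb_device(struct device *devp)
{
    assert(devp != NULL);
    return container_of(devp, struct usb_device, dev);
}
```

Because dev is not the first field, a plain cast from struct device * to struct usb_device * would yield the wrong address; the offsetof() adjustment is what makes the conversion correct wherever the embedded structure sits.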
The example simple block device (SBD) does not attach to a physical bus, of course. For this type of device, the kernel exports a "system" bus which may be used for virtual attachments. Usually, the system bus contains devices like the processor, interrupt controller, and timer. But we can attach virtual disks there too.
System bus devices are represented by struct sys_device, defined as:
struct sys_device {
    char *name;
    u32 id;
    struct device dev;
};
(A couple of fields have been omitted). Here, the name should be a /dev-style name - it will be used (along with the id value) to create the device's entry in sysfs under devices/sys.
The simple block device will be represented as a sys_device. Before we get to that, however, there is another structure that deserves a look.
The device driver structure
Device drivers, too, are represented in the device model, and in sysfs. The relevant structure (again, with a few fields omitted) looks like:
struct device_driver {
    char *name;
    struct bus_type *bus;
    int (*probe) (struct device *dev);
    int (*remove) (struct device *dev);
    void (*shutdown) (struct device *dev);
    int (*suspend) (struct device *dev, u32 state, u32 level);
    int (*resume) (struct device *dev, u32 level);
};
The name this time around is the name of the driver, of course. The bus field is normally filled in by the bus-layer logic; drivers need not worry about initializing it. The various methods provided in the structure are for the handling of device discovery and power management tasks. Usually, again, these methods are provided at the bus level, with bus-specific calls down into the driver itself.
Time to look at some code. The SBD driver sets up its structure as:
static struct device_driver sbd_driver = {
    .name = "sbd",
};
A driver for a real device may, at the least, want to add methods for suspend and resume events. There is nothing in particular that SBD needs to do in response to such events, however, so no such methods have been provided.
In the SBD module init code, the driver structure is registered with:
driver_register(&sbd_driver);
It is also necessary to call driver_unregister() in the shutdown code, of course.
The driver_register() call is sufficient to create the directory bus/system/sbd in the sysfs hierarchy. As SBD devices are registered, they will appear as symbolic links in that directory. There will be no other data there, at least not yet.
Driver attributes
Suppose we wanted to put something else in the driver sysfs directory? That can be done through the creation of driver attributes. The SBD driver adds a file called version which contains the version of the driver code; user-space scripts could query that file to get a sense for what capabilities might or might not be available.
Each driver attribute requires a name, file permissions, and functions to format and set the value of the attribute. Generally the functions are defined first. The SBD function to display the version is:
static ssize_t show_version(struct device_driver *drv, char *buf)
{
    strcpy(buf, Version);
    /* Return the number of bytes of data, not counting the trailing NUL */
    return strlen(Version);
}
The buffer passed into the show function is one full page, so there's plenty of room. In general, however, values for sysfs attributes should be short. The convention is that an attribute contains a single value - not pages of information as can be found in some /proc files.
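The convention can be exercised outside the kernel with a small model. In the sketch below, MODEL_PAGE_SIZE, the placeholder version string, and the trailing newline are assumptions for illustration, and struct device_driver is left as an opaque type. The function formats exactly one value into the page-sized buffer and returns the number of bytes written.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

#define MODEL_PAGE_SIZE 4096        /* assumed page size for this model */

struct device_driver;               /* opaque here; the model never looks inside */

static const char Version[] = "1.0";    /* placeholder version string */

/* One attribute, one short value; returns the number of bytes placed in buf. */
static ssize_t show_version(struct device_driver *drv, char *buf)
{
    (void)drv;
    assert(buf != NULL);
    return snprintf(buf, MODEL_PAGE_SIZE, "%s\n", Version);
}
```

Bounding the write by the buffer size costs nothing and guards against the day when a value turns out to be longer than expected.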
The store function has the prototype:
ssize_t (*store)(struct device_driver *drv,
                 const char *buf,
                 size_t count);
The return value is the number of bytes consumed by the operation (usually count); it can also be one of the usual error codes. Since there is little point in changing the driver's version string from user space, SBD provides no store function.
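SBD has no store function, but a hypothetical one might look like the user-space sketch below. The debug_level attribute and the parsing policy (one decimal integer with an optional trailing newline) are invented for this example. It follows the convention just described: the function returns count on success and a negative error code otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct device_driver;       /* opaque here; the model never looks inside */

static long debug_level;    /* hypothetical value behind the attribute */

/* Parse one decimal integer from buf; consume all count bytes or fail. */
static ssize_t store_debug_level(struct device_driver *drv,
                                 const char *buf, size_t count)
{
    char tmp[32];
    char *end;
    long val;

    (void)drv;
    assert(buf != NULL);
    if (count == 0 || count >= sizeof(tmp))
        return -EINVAL;

    /* The incoming data is not guaranteed to be NUL-terminated. */
    memcpy(tmp, buf, count);
    tmp[count] = '\0';

    errno = 0;
    val = strtol(tmp, &end, 10);
    if (end == tmp || errno != 0)
        return -EINVAL;
    if (*end == '\n')
        end++;                  /* tolerate the newline that echo adds */
    if (*end != '\0')
        return -EINVAL;

    debug_level = val;
    return (ssize_t)count;
}
```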
Creating attributes requires filling in a driver_attribute structure. This is usually done with the DRIVER_ATTR macro:
DRIVER_ATTR(name, mode, show, store);
In the case of the SBD driver, the relevant declaration is:
static DRIVER_ATTR(version, S_IRUGO, show_version, NULL);
This line creates a structure called driver_attr_version; it will ultimately create a file called version in the driver's sysfs directory. That file will have read-only permissions, and will call show_version() when read.
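The naming trick is plain preprocessor token pasting. The model below is a simplified user-space imitation, not the kernel's definition: the structure layout, MODEL_DRIVER_ATTR, and MODEL_S_IRUGO are assumptions for this example. It shows how a single macro invocation can yield both the driver_attr_version variable and the "version" file name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

struct device_driver;       /* opaque here; the model never looks inside */

/* Simplified stand-in for an attribute description. */
struct model_driver_attribute {
    const char *name;
    mode_t mode;
    ssize_t (*show)(struct device_driver *drv, char *buf);
    ssize_t (*store)(struct device_driver *drv, const char *buf, size_t count);
};

/* Read permission for user, group, and others (octal 0444). */
#define MODEL_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/* Pastes the name into the variable and stringifies it for the file name. */
#define MODEL_DRIVER_ATTR(_name, _mode, _show, _store) \
    struct model_driver_attribute driver_attr_##_name = \
        { #_name, (_mode), (_show), (_store) }

static ssize_t show_version(struct device_driver *drv, char *buf)
{
    (void)drv;
    strcpy(buf, "1.0\n");       /* placeholder version string */
    return 4;
}

static MODEL_DRIVER_ATTR(version, MODEL_S_IRUGO, show_version, NULL);
```

Leaving the store pointer NULL while granting only read permission keeps the two consistent: there is no way for user space to write the attribute, and no function that would handle such a write.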
Actually creating the file, however, requires one more step. This line appears in the module initialization code, immediately after the call to driver_register():
driver_create_file(&sbd_driver, &driver_attr_version);
There is a driver_remove_file(), but normally it is unnecessary to call it - the files will be removed automatically when the driver is unregistered.
Device registration
Now that we are done looking at driver registration, we can get around to creating our device. SBD is a "system bus" device; the bus-specific device structure is created as:
static struct sys_device sbd_sys_device = {
    .name = "sbd",
    .dev = { /* struct device stuff */
        .name = "Simple block device",
        .driver = &sbd_driver
    },
};
The id field defaults to zero, so this device will eventually be sbd0. Note the assignment of the dev.driver field, which connects the device with the driver that handles it.
At initialization time, the device is registered with:
sys_device_register(&sbd_sys_device);
sys_device_register() is a wrapper around device_register() which handles "system bus" details. Once this call has been made, the sysfs directory devices/sys/sbd0 is created. Two attributes exist there: name contains "Simple block device", and power contains the device's current power state. Most importantly, however, the device exists within the device model data structure, where it can respond to hotplug and power management events.
Devices, too, can have custom attributes. For SBD, an attribute device contains the device number assigned to the virtual disk; this value could be used, for example, to create a /dev entry automatically in user space. The implementation is very similar to the driver attribute we set up before:
static ssize_t show_devnum(struct device *dev, char *buf)
{
    return sprintf(buf, "%02x00", major_num);
}
DEVICE_ATTR(device, S_IRUGO, show_devnum, NULL);
...
device_create_file(&sbd_sys_device.dev, &dev_attr_device);
One final step, specific to block devices, is taken in SBD. Before the virtual disk's gendisk structure is registered with add_disk(), a pointer to the device structure is stored:
Device.gd->driverfs_dev = &sbd_sys_device.dev;
This assignment causes a couple of extra symbolic links to be created in sysfs; devices/sys/sbd0/block points to block/sbd0, and block/sbd0/device points back to devices/sys/sbd0. In this way, the relationship between the various entries is made explicit.
Going further
This article barely touches on the device model interface. Many details have necessarily been omitted; many of them will be topics for future articles. The next article in the series, which will appear soon (promise), will look at the class interface. Power management also deserves a look, but that interface remains in flux as of this writing. Expect an article when the dust settles a bit.
Page editor: Jonathan Corbet
