The current development kernel is 2.5.32, released by Linus on August 27. It
includes, of course, the IDE code replacement (see last week's LWN Front
page). In this (large) patch you'll also find the asynchronous I/O core
(covered in the August 1 LWN Kernel
Page), a bunch more device model work, IA-64 and PPC64 updates, the
beginning of the NFSv4 merge, a bunch of input layer changes, Ingo Molnar's
thread performance work, and an incredible number of other fixes and
updates. The long-format changelog
is also available.
Linus's current BitKeeper tree, which will become 2.5.33, contains a number
of memory management performance fixes from Andrew Morton, some partition
and IDE work by Alexander Viro, a set of network driver improvements, and a
big pile of typo and designated initializer fixes.
The current 2.5 status summary from Guillaume
Boissiere is dated August 28.
The current stable kernel is 2.4.19; Marcelo has released no
2.4.20 prepatches over the last week.
The current prepatch from Alan Cox is 2.4.20-pre4-ac2. The -ac series is now the
staging area for ongoing IDE work which, by most accounts, is going well.
Comments (2 posted)
Kernel development news
A constant feature of development kernel summaries is "device model work."
Perhaps it's time to take a look at what the device model actually is, and
where it's going.
The device model effort has its roots in the 2001 Kernel
Summit. It had become clear, at that point, that support of advanced
power management would require a more structured approach to the management
of devices in the Linux kernel. There has traditionally been no
centralized registry of devices in the kernel - no way to just ask the
system what devices were connected to it. Power management needs not only
the answer to that question, but also some idea of how all the devices are
plugged together. It doesn't do to shut down a SCSI controller before
stopping all of the peripherals connected to that controller, for example.
So the device model work, done mainly by Patrick Mochel, started by
adapting the existing PCI device scheme to represent a full system. At the
center of the scheme is struct device, which, of course,
represents a single device in the system. This structure contains quite a
few fields, including no less than six different list heads; some of these
fields will be examined shortly.
One type of device, of course, is a bus. There is a device
structure for each bus, along with a bus_type structure for each
type of bus. Almost every device on a system is reached via (at least) one
bus, and the device model topology reflects that. Each bus device
maintains, via the children list in its device structure,
a list of all devices plugged into that bus. By looking at the
bus_list field of any device in the system, the kernel can find
all other devices attached to the same bus.
Each device structure also maintains a parent pointer (to
another struct device, of course), and an entry into another
list (called simply node) of all its siblings under the same
parent. This hierarchy may look a lot like the bus lists already
mentioned, but that is not the case. A device may be on a USB bus, but its
parent may be the USB hub to which it is connected. Similarly, a SCSI tape
drive may be reached through a PCI bus, but its parent is the SCSI host
adapter to which it is attached.
Thus, it is the parent and node lists that model the true
hierarchy of the devices in the system. One could suspend a computer by
starting at the top-level devices and doing a depth-first traversal of the
device hierarchy via each device's children list. In fact, the
device model makes this sort of traversal easy by maintaining a separate
"global device list" which contains every device on the system, in the
As an example, your editor's system is represented in the driver model with
a hierarchy like the following:
sys
    PCI host bridge
        Lexar SmartMedia reader
        SCSI bus 0
            Target 0 (disk drive)
            Target 1 (DAT tape)
            Target 4 (CDRW)
    8253 Interval timer
Each entry in the hierarchy above is one device structure in the
model; each device's children list holds each indented entry below
that device. The global device list, instead, contains the full hierarchy
shown above, in order from top to bottom. ("sys" is a virtual bus
for devices not otherwise connected to a system bus).
The model, as described so far, shows the hierarchy of the system, but does
not allow the kernel to actually do much with those devices. The
next step involves a new generic structure:
struct device_driver, which is registered for each driver in
the system. This structure tells the system what type of bus the driver
expects to work with, and provides a set of useful functions. One of those
functions is probe; when a new device is discovered on the system
the base code calls the probe function of every likely-looking
driver for the relevant bus until a driver agrees to manage the device.
The system then sets the driver pointer in the device
structure, and knows how to find the right driver for the device from then on.
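In rough form (the exact prototypes have been shifting over the course of
the 2.5 series, and several members are omitted here), the driver structure
looks something like:

    /*
     * Simplified sketch of struct device_driver; not an exact copy of
     * the 2.5.32 header.
     */
    struct device_driver {
        char *name;                   /* driver name, as it appears in driverfs */
        struct bus_type *bus;         /* the type of bus this driver can handle */

        int (*probe)(struct device *dev);   /* "do you want this device?" */
        int (*remove)(struct device *dev);  /* device unplugged or driver removed */

        int (*suspend)(struct device *dev, u32 state, u32 level);
        int (*resume)(struct device *dev, u32 level);
        /* ... module pointer, reference count, ... */
    };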
This driver pointer is not used for normal, user-space accesses to
the device - that is still handled through the device arrays (indexed by
the device's major number). What that pointer can be used for,
however, is power management and hotplug events. If the kernel has been
told to suspend the system, for example, it now need only pass through the
global device list, calling the suspend function found in the
device driver structure for each device. Similarly, if the user unplugs a
device, the kernel can call that device's remove function to let
the driver know.
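As an illustration (and only that - this is not the actual device model
code), a system-wide suspend pass over the global list could look vaguely
like the following; global_device_list and the g_list field are stand-ins
for whatever the real symbols are in a given release:

    /*
     * Purely illustrative: walk the global device list from the bottom
     * up, so that children are suspended before their parents, and let
     * each device's driver power the device down.
     */
    static void suspend_everything(u32 state)
    {
        struct list_head *entry;

        list_for_each_prev(entry, &global_device_list) {
            struct device *dev = list_entry(entry, struct device, g_list);

            if (dev->driver && dev->driver->suspend)
                dev->driver->suspend(dev, state, 0 /* level */);
        }
    }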
The above is sufficient to handle the basic functions needed by power
management and to support hotpluggable devices. It also unifies much of
the device probing and accounting logic in the kernel, allowing the removal
of a great deal of duplicated code. The device model work
has not stopped there, however. One recent (2.5.32) addition is the notion
of device classes and interfaces. The "class" of a device is the basic
function that it performs - it could be an "input" or "storage" device, for
example. Not much is done with the class information currently, but the
structure is there for class-level drivers to affect how the device is managed.
"Interfaces" are paths to the device from user space - normally entries in
/dev. Devices which implement a given interface can be expected
to respond in certain, well-defined ways. As with classes, about all that
is done with interfaces, for now, is to remember them. But that could change.
This discussion, so far, has left out an important subsystem which, while
technically not part of the device model, is intimately tied in with it.
"driverfs" is a virtual filesystem which provides a userspace
representation of the driver model data structure. This filesystem,
normally mounted at /devices, contains (currently) three top-level directories:
- root contains the entire device tree in the usual
hierarchical form. By digging around in /devices/root, users
(or code) can get a handle on how the system is put together.
Driverfs also makes it easy for devices to export tunable parameters
(much like those found in /proc/sys) which can be found - and
tweaked - in the device tree.
- class contains an entry for each device class
registered in the system. Further down, an entry for every device
which implements that class can be found (it's a symbolic link to the
entry in the /devices/root tree). There are also entries for
each interface registered with a class, and, again, a symbolic link
for every device implementing the interface.
- bus lists each bus type (not each physical bus) on
the system and the devices managed by each.
(See this example /devices listing
which corresponds to the system hierarchy shown above, to see how it all fits together.)
Some readers may be noting a certain similarity between driverfs and
devfs. They do resemble each other in that they are both kernel-generated
virtual filesystems which contain entries for the devices in the system.
They differ, however, in that driverfs is intended to be a physical
representation of the system, while devfs is intended to provide user-space
access to the devices themselves. A devfs user can mount
/dev/discs/disc0; somebody perusing driverfs can, with sufficient
typing pain, find the directory
/devices/root/pci0/00:0e.0/scsi0/0:0:0:0/0:0:0:0:p1, but there's
nothing there to mount. Instead, a bunch of information - including the
device's major and minor numbers - is available.
So devfs and driverfs serve different purposes, but driverfs (with
/sbin/hotplug) could conceivably
supplant devfs in future kernels. While driverfs is not intended to be the
way users access devices, all the information needed to create
/dev nodes is (or can be) there. In the future, the /sbin/hotplug
script may be used to configure all devices as they are discovered in the
system; there is no reason why that script can not use the driverfs
information (including class and interface information) to create
/dev nodes implementing whatever policy the system administrator
likes. The result would be a flexible device naming and administration
scheme which removes policy from the kernel code.
That all remains in the future, however; the device model and driverfs are
still works in progress. Most driver code does not yet interface with the
device model; thus far, there has been little need to change the drivers
themselves, since the PCI code has done the necessary device registration.
Full implementation of classes and interfaces, however, is likely to
require digging into the driver code, and that could take a little while.
It could yet happen for 2.6, however.
Comments (13 posted)
Hyperthreading is a hardware technique where a single CPU behaves as if it
were multiple (usually two) virtual processors. When one virtual processor
stalls (on a cache miss, for example), the other runs. Hyperthreading can
yield significant performance improvements (numbers of around 30% have been
floated) for a very small silicon investment. And the software side is
free: a hyperthreaded processor is almost indistinguishable from a pair of
real, physical processors, and the current Linux (or whatever) SMP code just works.
However, a scheduler which handles SMP, but which is unaware of
hyperthreading, will not obtain optimal performance. If you have two
processes running on two virtual processors on the same physical CPU, they
will be contending with each other in a way that processes on separate CPUs
will not. A naive scheduler, such as the one currently found in the Linux
kernel, does not understand the difference between the two situations, and
will thus make wrong decisions.
Ingo Molnar has posted some scenarios where
the current scheduler gets things wrong, along with, of course, a patch
that makes everything right. Consider a system with two physical CPUs,
each of which provides two virtual processors. If there are two running
tasks, the current scheduler would happily let them both run on a single
physical processor, even though far better performance would result from
migrating one process to the other physical CPU. The scheduler also
doesn't understand that migrating a process from one virtual processor to
its sibling is cheaper (due to cache loading) than migrating it across physical processors.
The solution is to change the way the run queues work. The 2.5 scheduler
maintains one run queue per processor, and attempts to avoid moving tasks
between queues. The change is to have one run queue per physical
processor which is able to feed tasks into all of the virtual processors.
Throw in a smarter sense of what makes an idle CPU (all virtual processors
must be idle), and the resulting code "magically fulfills" the needs of
scheduling on a hyperthreading system. The actual patch involves a bunch
of tricky details, of course, but the end result is that a relatively
simple idea yields a 10% or greater performance improvement.
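The core of the idea fits in a few lines. The following is only a sketch
of the concept, not Ingo's actual patch; cpu_to_package(), cpu_is_idle(),
and NR_PACKAGES are hypothetical helpers standing in for the real code:

    /*
     * Conceptual sketch only: rather than one runqueue per logical
     * (virtual) CPU, keep one runqueue per physical package and let
     * all sibling CPUs in that package share it.
     */
    static struct runqueue runqueues[NR_PACKAGES];

    static inline struct runqueue *cpu_rq(int cpu)
    {
        /* cpu_to_package() maps a logical CPU to its physical package */
        return &runqueues[cpu_to_package(cpu)];
    }

    /* A physical package counts as idle only if all of its siblings are idle. */
    static int package_idle(int package)
    {
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            if (cpu_to_package(cpu) == package && !cpu_is_idle(cpu))
                return 0;
        return 1;
    }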
Comments (none posted)
Larry McVoy recently posted a note to the Linux kernel list regarding changes in
BitKeeper licensing. The big change
is that the new license gives BitMover the right, if you are using the free
(beer) version of BitKeeper, to require you to make your repository
available under a free license. The point is that the free version of
BitKeeper is meant to help free software development; it's not meant for proprietary development.
Larry also states that BitMover may be about to make a sale which can be tied
to the kernel developers' use of BitKeeper; should that happen, he'll set
aside $25K in BitKeeper developer time. Linus can use that time to cause
the implementation of features he wants, regardless of whether that's
something BitMover otherwise would have done.
Full Story (comments: none)
We'll now take a brief moment for editorial self-indulgence... The Linux
Journal's 2002 Editors' Choice awards have been announced. The selection
for "best technical book" was one Linux Device Drivers, 2nd Edition
by Alessandro Rubini and Jonathan Corbet.
Comments (2 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
- Luca Barbieri: i386 dynamic fixup/self modifying code. "This patch implements a system that modifies the kernel code at runtime depending on CPU features and SMPness. In fact, I'm not really sure whether it's a good idea to do something..." (August 28, 2002)
Page editor: Jonathan Corbet