
Kernel development

Brief items

Kernel release status

The current development kernel is 2.5.32, released by Linus on August 27. It includes, of course, the IDE code replacement (see last week's LWN Front and Kernel pages). In this (large) patch you'll also find the asynchronous I/O core (covered in the August 1 LWN Kernel page), a bunch more device model work, IA-64 and PPC64 updates, the beginning of the NFSv4 merge, a bunch of input layer changes, Ingo Molnar's thread performance work, and an incredible number of other fixes and updates. The long-format changelog is also available.

Linus's current BitKeeper tree, which will become 2.5.33, contains a number of memory management performance fixes from Andrew Morton, some partition and IDE work by Alexander Viro, a set of network driver improvements, and a big pile of typo and designated initializer fixes.

The current 2.5 status summary from Guillaume Boissiere is dated August 28.

The current stable kernel is 2.4.19; Marcelo has released no 2.4.20 prepatches over the last week.

The current prepatch from Alan Cox is 2.4.20-pre4-ac2. The -ac series is now the staging area for ongoing IDE work which, by most accounts, is going well.


Kernel development news

The 2.5 device model

A constant feature of development kernel summaries is "device model work." Perhaps it's time to take a look at what the device model actually is, and where it's going.

The device model effort has its roots in the 2001 Kernel Summit. It had become clear, at that point, that support of advanced power management would require a more structured approach to the management of devices in the Linux kernel. There has traditionally been no centralized registry of devices in the kernel - no way to just ask the system what devices are connected to it. Power management needs not only the answer to that question, but also some idea of how all the devices are plugged together. It doesn't do to shut down a SCSI controller before stopping all of the peripherals connected to that controller, for example.

So the device model work, done mainly by Patrick Mochel, started by adapting the existing PCI device scheme to represent a full system. At the center of the scheme is struct device, which, of course, represents a single device in the system. This structure contains quite a few fields, including no less than six different list heads; some of these fields will be examined shortly.
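
In condensed form, the structure looks something like the following sketch; the field names follow the 2.5 source, but this is an approximation with most members omitted:

    struct device {
        struct list_head g_list;       /* entry in the global device list */
        struct list_head node;         /* entry in the parent's children list */
        struct list_head bus_list;     /* entry in the bus's list of devices */
        struct list_head driver_list;  /* entry in the driver's device list */
        struct list_head intf_list;    /* interfaces this device implements */
        struct list_head children;     /* head of this device's own children */

        struct device *parent;         /* the device this one is attached to */
        char name[DEVICE_NAME_SIZE];   /* descriptive ASCII name */
        char bus_id[BUS_ID_SIZE];      /* position on the parent bus */

        struct bus_type *bus;          /* what type of bus it sits on */
        struct device_driver *driver;  /* the driver which has claimed it */
        void *driver_data;             /* private data for that driver */
    };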

One type of device, of course, is a bus. There is a device structure for each bus, along with a bus_type structure for each type of bus. Almost every device on a system is reached via (at least) one bus, and the device model topology reflects that. Each bus device maintains, via the children list in its device structure, a list of all devices plugged into that bus. By looking at the bus_list field of any device in the system, the kernel can find all other devices attached to the same bus.
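
A similarly condensed (and approximate) sketch of struct bus_type shows how the per-bus lists hang together; the match() callback is how a bus pairs new devices with drivers, a process returned to below:

    struct bus_type {
        char *name;                 /* "pci", "usb", ... */
        struct list_head node;      /* entry in the global list of bus types */
        struct list_head devices;   /* all devices on buses of this type */
        struct list_head drivers;   /* all drivers bound to this bus type */

        /* can this driver handle this device? */
        int (*match)(struct device *dev, struct device_driver *drv);
    };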

Each device structure also maintains a parent pointer (to another struct device, of course), and an entry into another list (called simply node) of all its siblings under the same parent. This hierarchy may look a lot like the bus lists already mentioned, but that is not the case. A device may be on a USB bus, but its parent may be the USB hub to which it is connected. Similarly, a SCSI tape drive may be reached through a PCI bus, but its parent is the SCSI host adaptor.

Thus, it is the parent pointers and node lists that model the true hierarchy of the devices in the system. One could suspend a computer by starting at the top-level devices and doing a depth-first traversal of the device hierarchy via each device's children list. In fact, the device model makes this sort of traversal easy by maintaining a separate "global device list" which contains every device on the system, in depth-first order.
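
A minimal sketch of such a traversal, assuming a hypothetical device_walk() helper (this is not a kernel API), might look like:

    static void device_walk(struct device *dev, void (*fn)(struct device *))
    {
        struct list_head *entry;

        fn(dev);    /* visit this device first... */
        list_for_each(entry, &dev->children) {
            /* ...then recurse into each child; a child's node field
               is its link in the parent's children list. */
            struct device *child = list_entry(entry, struct device, node);
            device_walk(child, fn);
        }
    }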

As an example, your editor's system is represented in the driver model with a hierarchy like the following:

root
  pci0
    PCI host bridge
    ISA bridge
    IDE interface
    USB controller
      USB bus
        Lexar SmartMedia reader
    ACPI bridge
    SCSI adaptor
      SCSI bus 0
        Target 0 (disk drive)
          Partition 1
          Partition 2
        Target 1 (DAT tape)
          st0
          nst0
          ...
        Target 4 (CDRW)
    Audio controller
    MIDI port
    Ethernet controller
    Graphics card
  sys
    Interrupt controller
    8253 Interval timer
    floppy controller

Each entry in the hierarchy above is one device structure in the model; each device's children list holds the entries indented one level below it. The global device list, by contrast, contains the full hierarchy shown above, in top-to-bottom order. ("sys" is a virtual bus for devices not otherwise connected to a system bus.)

The model, as described so far, shows the hierarchy of the system, but does not allow the kernel to actually do much with those devices. The next step involves a new generic structure: struct device_driver, which is registered for each driver in the system. This structure tells the system what type of bus the driver expects to work with, and provides a set of useful functions. One of those functions is probe; when a new device is discovered on the system the base code calls the probe function of every likely-looking driver for the relevant bus until a driver agrees to manage the device. The system then sets the driver pointer in the device structure, and knows how to find the right driver for the device from then on.
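
In condensed and approximate form (the method signatures here follow the 2.5 source only loosely):

    struct device_driver {
        char *name;                 /* driver name */
        struct bus_type *bus;       /* the type of bus it drives */
        struct list_head devices;   /* devices currently bound to it */

        int (*probe)(struct device *dev);   /* take charge of this device? */
        int (*remove)(struct device *dev);  /* device unplugged or going away */
        int (*suspend)(struct device *dev, u32 state, u32 level);
        int (*resume)(struct device *dev, u32 level);
    };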

This driver pointer is not used for normal, user-space accesses to the device - that is still handled through the device arrays (indexed by the device's major number). What that pointer can be used for, however, is power management and hotplug events. If the kernel has been told to suspend the system, for example, it now need only pass through the global device list, calling the suspend function found in the device driver structure for each device. Similarly, if the user unplugs a device, the kernel can call that device's remove function to let the driver know.
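
A hedged sketch of that suspend pass might look like the following; the global_device_list name, the (omitted) locking, and the suspend() arguments are all assumptions. The walk runs backwards so that, given the depth-first ordering described earlier, devices are suspended before their parents:

    static void suspend_all_devices(u32 state, u32 level)
    {
        struct list_head *entry;

        /* Walk the global list in reverse: children come after their
           parents in depth-first order, so they get suspended first. */
        list_for_each_prev(entry, &global_device_list) {
            struct device *dev = list_entry(entry, struct device, g_list);

            if (dev->driver && dev->driver->suspend)
                dev->driver->suspend(dev, state, level);
        }
    }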

The above is sufficient to handle the basic functions needed by power management and to support hotpluggable devices. It also unifies much of the device probing and accounting logic in the kernel, allowing the removal of a great deal of duplicated code. The device model work has not stopped there, however. One recent (2.5.32) addition is the notion of device classes and interfaces. The "class" of a device is the basic function that it performs - it could be an "input" or "storage" device, for example. Not much is done with the class information currently, but the structure is there for class-level drivers to affect how the device is managed.

"Interfaces" are paths to the device from user space - normally entries in /dev. Devices which implement a given interface can be expected to respond in certain, well-defined ways. As with classes, about all that is done with interfaces, for now, is to remember them. But that could change.

This discussion, so far, has left out an important subsystem which, while technically not part of the device model, is intimately tied in with it. "driverfs" is a virtual filesystem which provides a userspace representation of the driver model data structure. This filesystem, normally mounted at /devices, contains (currently) three top-level directories:

  • root contains the entire device tree in the usual hierarchical form. By digging around in /devices/root, users (or code) can get a handle on how the system is put together. Driverfs also makes it easy for devices to export tunable parameters (much like those found in /proc/sys) which can be found - and tweaked - in the device tree; a sketch of such an attribute appears below.

  • class contains an entry for each device class registered in the system. Further down, an entry for every device which implements that class can be found (it's a symbolic link to the entry in the /devices/root tree). There are also entries for each interface registered with a class, and, again, a symbolic link for every device implementing the interface.

  • bus lists each bus type (not each physical bus) on the system and the devices managed by each.

(See this example /devices listing, which corresponds to the system hierarchy shown above, for a view of how it all fits together.)
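
Here, as promised, is a hedged sketch of how a driver might export such a tunable through driverfs. The driver_file_entry structure and device_create_file() call are modeled on the 2.5 driverfs interface, but every name and signature below should be treated as an approximation:

    static int fifo_depth = 16;     /* a hypothetical tunable */

    static ssize_t fifo_depth_show(struct device *dev, char *buf,
                                   size_t count, loff_t off)
    {
        return off ? 0 : snprintf(buf, count, "%d\n", fifo_depth);
    }

    static ssize_t fifo_depth_store(struct device *dev, const char *buf,
                                    size_t count, loff_t off)
    {
        fifo_depth = simple_strtol(buf, NULL, 10);
        return count;
    }

    static struct driver_file_entry fifo_depth_entry = {
        .name  = "fifo_depth",
        .mode  = S_IRUGO | S_IWUSR,
        .show  = fifo_depth_show,
        .store = fifo_depth_store,
    };

    /* In the driver's probe() function:
           device_create_file(dev, &fifo_depth_entry);  */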

Some readers may be noting a certain similarity between driverfs and devfs. They do resemble each other in that they are both kernel-generated virtual filesystems which contain entries for the devices in the system. They differ, however, in that driverfs is intended to be a physical representation of the system, while devfs is intended to provide user-space access to the devices themselves. A devfs user can mount /dev/discs/disc0; somebody perusing driverfs can, with sufficient typing pain, find the directory /devices/root/pci0/00:0e.0/scsi0/0:0:0:0/0:0:0:0:p1, but there's nothing there to mount. Instead, a bunch of information - including the device's major and minor numbers - is available.

So devfs and driverfs serve different purposes, but driverfs (with /sbin/hotplug) could conceivably supplant devfs in future kernels. While driverfs is not intended to be the way users access devices, all the information needed to create /dev nodes is (or can be) there. In the future, the /sbin/hotplug script may be used to configure all devices as they are discovered in the system; there is no reason why that script cannot use the driverfs information (including class and interface information) to create /dev nodes implementing whatever policy the system administrator likes. The result would be a flexible device naming and administration scheme which removes policy from the kernel code.
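
To make that concrete, here is a user-space sketch of the heart of such a helper; the per-device "dev" file and its major:minor format are assumptions made for illustration:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    /* Read a device's numbers from its driverfs directory and
       create a corresponding block device node under /dev. */
    static int make_dev_node(const char *driverfs_dir, const char *name)
    {
        char path[256];
        unsigned int major, minor;
        FILE *f;

        snprintf(path, sizeof(path), "%s/dev", driverfs_dir);
        f = fopen(path, "r");
        if (!f)
            return -1;
        if (fscanf(f, "%u:%u", &major, &minor) != 2) {
            fclose(f);
            return -1;
        }
        fclose(f);

        snprintf(path, sizeof(path), "/dev/%s", name);
        return mknod(path, S_IFBLK | 0660, makedev(major, minor));
    }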

That all remains in the future, however; the device model and driverfs are still works in progress. Most driver code does not yet interface with the device model; thus far, there has been little need to change the drivers themselves, since the PCI code has done the necessary device registration. Full implementation of classes and interfaces, however, is likely to require digging into the driver code, and that could take a little while. It could yet happen for 2.6, however.


The scheduler and hyperthreading

Hyperthreading is a hardware technique where a single CPU behaves as if it were multiple (usually two) virtual processors. When one virtual processor stalls (on a cache miss, for example), the other runs. Hyperthreading can yield significant performance improvements (numbers of around 30% have been floated) for a very small silicon investment. And the software side is free: a hyperthreaded processor is almost indistinguishable from a pair of real, physical processors, and the current Linux (or whatever) SMP code works.

However, a scheduler which handles SMP, but which is unaware of hyperthreading, will not obtain optimal performance. If you have two processes running on two virtual processors on the same physical CPU, they will be contending with each other in a way that processes on separate CPUs will not. A naive scheduler, such as the one currently found in the Linux kernel, does not understand the difference between the two situations, and will thus make wrong decisions.

Ingo Molnar has posted some scenarios where the current scheduler gets things wrong, along with, of course, a patch that makes everything right. Consider a system with two physical CPUs, each of which provides two virtual processors. If there are two running tasks, the current scheduler would happily let them both run on a single physical processor, even though far better performance would result from migrating one process to the other physical CPU. The scheduler also doesn't understand that migrating a process from one virtual processor to its sibling is cheaper (due to cache loading) than migrating it across physical processors.

The solution is to change the way the run queues work. The 2.5 scheduler maintains one run queue per processor, and attempts to avoid moving tasks between queues. The change is to have one run queue per physical processor which is able to feed tasks into all of the virtual processors. Throw in a smarter sense of what makes an idle CPU (all virtual processors must be idle), and the resulting code "magically fulfills" the needs of scheduling on a hyperthreading system. The actual patch involves a bunch of tricky details, of course, but the end result is that a relatively simple idea yields a 10% or greater performance improvement.
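
In outline (and only in outline - the run queue here is much condensed, and NR_PHYSICAL_CPUS, cpu_to_package, and idle_cpu() are assumed names rather than anything from the actual patch):

    /* One run queue per physical package; sibling virtual CPUs share it. */
    #define NR_PHYSICAL_CPUS (NR_CPUS / 2)   /* assuming two siblings each */

    struct runqueue {
        spinlock_t lock;            /* protects this queue */
        unsigned long nr_running;   /* runnable tasks on this package */
        struct list_head queue;     /* the runnable tasks (much condensed) */
    };

    static struct runqueue runqueues[NR_PHYSICAL_CPUS];
    static int cpu_to_package[NR_CPUS];  /* virtual CPU -> physical package */

    /* Sibling virtual processors resolve to the same run queue. */
    static inline struct runqueue *cpu_rq(int cpu)
    {
        return &runqueues[cpu_to_package[cpu]];
    }

    /* A physical package is idle only if every sibling on it is idle. */
    static int package_idle(int package)
    {
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            if (cpu_to_package[cpu] == package && !idle_cpu(cpu))
                return 0;
        return 1;
    }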


A change in the BitKeeper license

Larry McVoy recently posted a note to the Linux kernel list regarding changes in BitKeeper licensing. The big change is that the new license gives BitMover the right, if you are using the free (beer) version of BitKeeper, to require you to make your repository available under a free license. The point is that the free version of BitKeeper is meant to help free software development; it's not meant for proprietary work.

Larry also states that BitMover may be about to make a sale which can be tied to the kernel developers' use of BitKeeper; should that happen, he'll set aside $25K worth of BitKeeper developer time. Linus can use that time to direct the implementation of features he wants, regardless of whether BitMover would otherwise have done that work.


Linux Journal 2002 Editors' Choice

We'll now take a brief moment for editorial self-indulgence... The Linux Journal's 2002 Editors' Choice awards have been announced. The selection for "best technical book" was one Linux Device Drivers, 2nd Edition by Alessandro Rubini and Jonathan Corbet.


Patches and updates

Architecture-specific

Robert Love per-arch load balancing
Luca Barbieri i386 dynamic fixup/self modifying code. "This patch implements a system that modifies the kernel code at runtime depending on CPU features and SMPness. In fact, I'm not really sure whether it's a good idea to do something like this."

Build system

Roman Zippel linux kernel conf 0.3

Core kernel code

Development tools

Rusty Russell (re-xmit) Kprobes

Device drivers

Filesystems and block I/O

Memory management

Networking

Security-related

Oliver Xymoron (1/7) entropy, take 2 - log2
Oliver Xymoron (2/7) entropy batching
Oliver Xymoron (3/7) xfer cleanup
Oliver Xymoron (4/7) trust_pct
Oliver Xymoron (6/7) core accounting
Oliver Xymoron (7/7) update entropy users

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2002, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds