Kernel development

Release status

Kernel release status

The current 2.6 prepatch is 2.6.13-rc3, released by Linus on July 12. Changes this time around include a new DES (crypto) implementation with better performance, multi-block operation support in the crypto layer, "almost-skas" mode support for user-mode Linux, a big memory technology device (MTD) update, user-space I/O initiation for InfiniBand, and the long-awaited inotify patch. "There's a bit more changes here than I would like, but I'm putting my foot down now. Not only are a lot of people going to be gone next week for LKS and OLS, but we've gotten enough stuff for 2.6.13, and we need to calm down." See the long-format changelog for the details.

Linus's git repository contains a small number of fixes added after the -rc3 release.

The current -mm tree is 2.6.13-rc2-mm2. Recent changes to -mm include a set of swapper fixes, a big InfiniBand update, and lots of fixes. The class-based kernel resource management patches have since been added for (presumably) 2.6.13-rc3-mm1.


Kernel development news

Some 2.6.13 API changes

The flood of patches going into the mainline 2.6.13 brings with it the usual assortment of changes to the internal kernel API. Here's a subset of those changes.

The configurable HZ patch has been merged. If there is, somehow, code which has survived this far with assumptions about the value of HZ, it should probably be fixed sometime soon.

There is a new timer function:

    int try_to_del_timer_sync(struct timer_list *timer);

This function will make a best effort to delete the timer. Should the timer function actually be running at the time, however, it will not wait for that function to complete as del_timer_sync() does; instead, it will return -1 immediately. It can thus be used in interrupt handlers and other contexts where waiting for a timer function to finish is not an option.

The block_device_operations structure has a new member:

    long (*unlocked_ioctl) (struct file *filp, unsigned cmd, 
                            unsigned long arg);

If an unlocked_ioctl() method exists, it will be called (in preference to ioctl()), and the big kernel lock will not be held. Drivers which perform their own locking (which should be all of them, really) can use the new method to avoid the overhead of the BKL.

The netif_rx() function, used by network drivers (when not in NAPI mode) to feed packets into the kernel, has traditionally returned one of several values indicating how congested the system was. The idea was that drivers could use this information to reduce load on the kernel as congestion increases. No drivers do this, however; instead, NAPI is used for high-traffic situations. So netif_rx() now will return one of two values: NETIF_RX_SUCCESS if all is well, or NETIF_RX_DROP if the packet was dropped.

It's also worth noting that the sk_buff structure has changed again, leading to the usual troubles with binary-only drivers.

Authors of PCI drivers who want to squeeze out every bit of DMA performance from their hardware can use a new function to determine the optimal DMA burst size:

    void pci_dma_burst_advice(struct pci_dev *pdev,
                              enum pci_dma_burst_strategy *strat,
                              unsigned long *param);

On return, strat will tell which strategy works best on the current platform. PCI_DMA_BURST_INFINITY says that bursts should simply be made as large as possible; in this case, param contains no information. PCI_DMA_BURST_BOUNDARY tells the driver to not burst across memory boundaries which are a multiple of the value returned in param. And PCI_DMA_BURST_MULTIPLE sets a maximum size (returned in param) on each individual burst.
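As a rough illustration of how a driver might act on that advice, here is a user-space sketch which computes the length of the next burst for each strategy as described above. The enum is redeclared locally so that the example builds outside the kernel, and next_burst_len() is a hypothetical helper, not part of the PCI API.

```c
#include <assert.h>

/* Local stand-in for the kernel enum, so the sketch builds in user space. */
enum pci_dma_burst_strategy {
    PCI_DMA_BURST_INFINITY,
    PCI_DMA_BURST_BOUNDARY,
    PCI_DMA_BURST_MULTIPLE,
};

/*
 * Hypothetical helper: how many bytes the next burst may cover when it
 * starts at bus address addr with `remaining` bytes still to transfer.
 */
static unsigned long next_burst_len(enum pci_dma_burst_strategy strat,
                                    unsigned long param,
                                    unsigned long addr,
                                    unsigned long remaining)
{
    unsigned long limit;

    switch (strat) {
    case PCI_DMA_BURST_BOUNDARY:
        assert(param != 0);
        /* Stop at the next address which is a multiple of param. */
        limit = param - (addr % param);
        break;
    case PCI_DMA_BURST_MULTIPLE:
        assert(param != 0);
        /* Cap each individual burst at param bytes. */
        limit = param;
        break;
    case PCI_DMA_BURST_INFINITY:
    default:
        /* param carries no information; burst as far as possible. */
        limit = remaining;
        break;
    }
    return remaining < limit ? remaining : limit;
}
```

A real driver would more likely use the advice once, at initialization time, to program a burst-size register on the device, rather than calculating a length for every transfer.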

Thomas Graf has contributed a generic text searching mechanism for the kernel. It can handle searching through non-contiguous data, and is designed to work with pluggable searching algorithms. A couple of search modules have been provided: a straight Knuth/Morris/Pratt string matcher and a finite state machine version which provides a limited regular expression mechanism. The initial application for this library is for flexible packet classification in the networking traffic control code, but other uses are possible.

Performing a search requires first setting up a configuration:

    struct ts_config *textsearch_prepare(const char *algorithm,
                                         const void *pattern,
                                         unsigned int patlen,
                                         int gfp_mask, int flags);

Here, algorithm is the searching algorithm to use; "kmp" will get Knuth/Morris/Pratt. pattern is the actual pattern to search for; patlen is its length. The usual memory allocation flags are provided in gfp_mask, and flags is for search-specific flags. Currently, the only flag is TS_AUTOLOAD, which allows the kernel to load a module implementing the desired search algorithm, if necessary. The return value is a pointer to a configuration structure to be used with the other functions, or an error value (as determined by IS_ERR()).

A ts_config structure, once initialized, can be reused as many times as desired. It contains no per-search state, so it can be used in parallel searches as well. When the structure is no longer needed, it should be returned with a call to textsearch_destroy().

If the data to be searched is a single, contiguous block, then searching is a matter of calling:

    unsigned int textsearch_find_continuous(struct ts_config *config,
                                            struct ts_state *state,
                                            const void *data,
                                            unsigned int datalen);
    unsigned int textsearch_next(struct ts_config *config,
                                 struct ts_state *state);

For these calls, config is a configuration returned from textsearch_prepare(), and state is a local state variable. A call to textsearch_find_continuous() must come first; it will initialize state for a search through the given data array. Both functions will return the offset of the beginning of the match, or UINT_MAX if no (further) match is found.

If the data to be searched is not contiguous in memory, things get a little more complicated. The caller must provide a method which can obtain a pointer to a block of data:

    unsigned int (*get_next_block)(unsigned int consumed,
                                   const u8 **dst,
                                   struct ts_config *config,
                                   struct ts_state *state);

This function will be called by the textsearch code when it needs more data to look through. It should locate the first byte beyond consumed and store its address in *dst. The config pointer will not normally be used; state->cb is a 40-byte "control buffer" which can be used to store data between calls to get_next_block(). The return value is the length of the block, or zero if there is no more data.

Another method:

    void (*finish)(struct ts_config *config, struct ts_state *state);

will be called after each search completes. Note that there can be several get_next_block() calls for each call to finish().

Both of these methods are stored in the ts_config structure; they should be set there after the call to textsearch_prepare(). The first search is performed with:

    unsigned int textsearch_find(struct ts_config *config,
                                 struct ts_state *state);

Subsequent searches can be performed with textsearch_next().
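To show why the get_next_block() contract is enough for a matcher to work across fragment boundaries, here is a small user-space model; it is not the kernel code. The frag and search_state types and the kmp_find() function are invented for this sketch, and the callback omits the config argument. The only state carried from one block to the next is the number of pattern bytes matched so far.

```c
#include <assert.h>
#include <limits.h>

/* One fragment of non-contiguous data (hypothetical type). */
struct frag {
    const unsigned char *data;
    unsigned int len;
};

/* Stand-in for the per-search state kept in state->cb. */
struct search_state {
    const struct frag *frags;
    unsigned int nfrags;
};

/*
 * Modeled on get_next_block(): point *dst at the first byte beyond
 * `consumed` and return the length of that block, or zero at the end.
 */
static unsigned int get_next_block(unsigned int consumed,
                                   const unsigned char **dst,
                                   struct search_state *state)
{
    unsigned int i, start = 0;

    for (i = 0; i < state->nfrags; i++) {
        unsigned int len = state->frags[i].len;

        if (consumed < start + len) {
            *dst = state->frags[i].data + (consumed - start);
            return len - (consumed - start);
        }
        start += len;
    }
    return 0;
}

#define MAX_PAT 64

/* Knuth/Morris/Pratt: offset of the first match, or UINT_MAX if none. */
static unsigned int kmp_find(const unsigned char *pat, unsigned int patlen,
                             struct search_state *state)
{
    unsigned int fail[MAX_PAT];
    unsigned int i, k = 0, consumed = 0, len;
    const unsigned char *block;

    assert(patlen > 0 && patlen <= MAX_PAT);

    /* fail[i]: longest proper prefix of pat[0..i] which is also a suffix. */
    fail[0] = 0;
    for (i = 1; i < patlen; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }

    /* k counts matched pattern bytes and survives block boundaries. */
    k = 0;
    while ((len = get_next_block(consumed, &block, state)) != 0) {
        for (i = 0; i < len; i++) {
            while (k > 0 && block[i] != pat[k])
                k = fail[k - 1];
            if (block[i] == pat[k])
                k++;
            if (k == patlen)
                return consumed + i + 1 - patlen;
        }
        consumed += len;
    }
    return UINT_MAX;
}
```

With the string "hello world" split into the fragments "hel", "lo wo", and "rld", a search for "lo w" in this model returns offset 3 even though the match straddles a fragment boundary.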


PCI error recovery

The PCI bus is the interconnect of choice for the bulk of the architectures supported by Linux. Most peripherals on such systems - including disk, network, and USB controllers - communicate with the CPU via this bus. Linux device drivers (regardless of the bus used) must be written with the idea that the device being controlled can fail. Most drivers, however, assume that the bus used to communicate with the device will work flawlessly. This assumption exists because (1) it tends to be true, and (2) the Linux kernel has never provided an infrastructure which enables drivers to detect (and respond to) PCI errors. Work is under way to provide that infrastructure, however; there are currently two entirely different interfaces being proposed for this role.

The first approach, posted by Linas Vepstas, works by way of callbacks. It enhances the pci_driver structure by adding a new set of methods:

    struct pci_error_handlers
    {
        enum pci_channel_state error_state;
        int (*error_detected)(struct pci_dev *dev,
                              enum pci_channel_state error);
        int (*mmio_enabled)(struct pci_dev *dev);
        int (*link_reset)(struct pci_dev *dev);
        int (*slot_reset)(struct pci_dev *dev);
        void (*resume)(struct pci_dev *dev);
    };

A PCI driver is not required to supply any of these callbacks. Any driver which will perform PCI error recovery must provide at least error_detected(), however. That method will be called sometime after the PCI subsystem detects an error on the bus; the error parameter will be set to one of these values:

    enum pci_channel_state {
        pci_channel_io_normal = 0, /* I/O channel is in normal state */
        pci_channel_io_frozen = 1, /* I/O to channel is blocked */
        pci_channel_io_perm_failure, /* pci card is dead */
    };

The error_detected() method should shut down any ongoing I/O operations, but should not attempt to communicate with the adapter itself. This method can take locks and sleep; it is called from process context. The return value tells the error recovery subsystem how to proceed; it can be PCIERR_RESULT_CAN_RECOVER (the driver thinks it will be able to recover just by talking to the adapter), PCIERR_RESULT_NEED_RESET (a hard reset of the adapter will be required), or PCIERR_RESULT_DISCONNECT (the situation is hopeless, and the adapter should be considered permanently dead).

If all drivers on an affected PCI segment think they can recover from the problem, the next step is to turn memory-mapped I/O back on and let the drivers try. To this end, each driver's mmio_enabled() callback will be invoked. This callback should do whatever port banging is required to get the adapter back into a reasonable state, then return one of PCIERR_RESULT_RECOVERED (it worked), PCIERR_RESULT_NEED_RESET (it failed, try resetting), or PCIERR_RESULT_DISCONNECT (it failed, abandon all hope). Regardless of the outcome, the driver should not restart I/O from this callback.

The link_reset() method is similar to mmio_enabled(), but it is only applicable for PCI-Express adapters which might be fixable via a link reset operation. The return codes are the same as for mmio_enabled().

If a reset is called for, the PCI subsystem will perform the reset, then call slot_reset() to let the driver know. The driver should attempt to bring the adapter back to a working state, re-download firmware, etc., then return a status code indicating whether things worked or not. If reinitialization fails, it is possible that slot_reset() could be called more than once as the PCI subsystem employs an increasingly large hammer.

Finally, if all seems to be well, the driver's resume() callback will be called; this is the point where I/O operations can be restarted.
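The order in which these callbacks fire can be modeled in user space. The sketch below is a simplification under several assumptions: the types are local stand-ins, there is only one driver on the segment, link_reset() is left out, slot_reset() is assumed to return PCIERR_RESULT_RECOVERED on success, and recover_one() is a hypothetical function which only approximates what the recovery core would do.

```c
#include <assert.h>
#include <stddef.h>

struct pci_dev {            /* dummy device for the simulation */
    int io_running;
    int resets;
};

enum pci_channel_state {
    pci_channel_io_normal = 0,
    pci_channel_io_frozen = 1,
    pci_channel_io_perm_failure,
};

enum pcierr_result {        /* names from the patch, values invented */
    PCIERR_RESULT_CAN_RECOVER,
    PCIERR_RESULT_NEED_RESET,
    PCIERR_RESULT_DISCONNECT,
    PCIERR_RESULT_RECOVERED,
};

struct pci_error_handlers {
    int (*error_detected)(struct pci_dev *dev, enum pci_channel_state error);
    int (*mmio_enabled)(struct pci_dev *dev);
    int (*slot_reset)(struct pci_dev *dev);
    void (*resume)(struct pci_dev *dev);
};

/* Hypothetical model of the recovery sequence for a single driver. */
static int recover_one(struct pci_dev *dev,
                       const struct pci_error_handlers *h,
                       enum pci_channel_state error)
{
    int res;

    assert(dev != NULL);
    if (!h || !h->error_detected)
        return PCIERR_RESULT_DISCONNECT;  /* driver cannot recover */

    res = h->error_detected(dev, error);
    if (res == PCIERR_RESULT_CAN_RECOVER && h->mmio_enabled)
        res = h->mmio_enabled(dev);       /* MMIO is back on; try talking */
    if (res == PCIERR_RESULT_NEED_RESET && h->slot_reset)
        res = h->slot_reset(dev);         /* real core resets the slot first */
    if (res != PCIERR_RESULT_RECOVERED)
        return PCIERR_RESULT_DISCONNECT;
    if (h->resume)
        h->resume(dev);                   /* I/O may restart only here */
    return PCIERR_RESULT_RECOVERED;
}

/* A sample driver: soft recovery fails, a slot reset succeeds. */
static int demo_error_detected(struct pci_dev *dev, enum pci_channel_state error)
{
    dev->io_running = 0;                  /* shut down ongoing I/O */
    return error == pci_channel_io_perm_failure ?
        PCIERR_RESULT_DISCONNECT : PCIERR_RESULT_CAN_RECOVER;
}

static int demo_mmio_enabled(struct pci_dev *dev)
{
    (void)dev;
    return PCIERR_RESULT_NEED_RESET;
}

static int demo_slot_reset(struct pci_dev *dev)
{
    dev->resets++;
    return PCIERR_RESULT_RECOVERED;
}

static void demo_resume(struct pci_dev *dev)
{
    dev->io_running = 1;
}

static const struct pci_error_handlers demo_handlers = {
    demo_error_detected, demo_mmio_enabled, demo_slot_reset, demo_resume,
};
```

In this model, a frozen channel ends with one reset and I/O running again, while a permanent failure leaves the device stopped and disconnected.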

A very different approach is taken by the IOCHK interface posted by Hidetoshi Seto. This patch expects drivers to perform more of their own error checking, but gives more control over the timing of recovery operations.

The IOCHK patch works by defining a new opaque type called iocookie. A driver which is about to engage in a conversation with one of its devices would initialize one of these cookies with:

    void iochk_clear(iocookie *cookie, struct pci_dev *dev);

The driver then performs its device operations, reading and writing memory-mapped I/O registers as necessary. At any point, the driver can check to see whether an error has occurred with:

    int iochk_read(iocookie *cookie);

A non-zero return indicates trouble; should that happen, the driver can respond by resetting the device, disconnecting it, or going into hysterics. There is no core support for operations like resetting adapters.
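The resulting pattern, which brackets a set of register accesses with iochk_clear() and iochk_read(), might look something like the following. This is a user-space mock: the two iochk functions here are trivial stand-ins with the prototypes given above, the layout of iocookie is invented (it is opaque in the real patch), and fake_readl() and read_status() are hypothetical names used to simulate a failing read.

```c
#include <assert.h>
#include <stddef.h>

struct pci_dev { int bus_error; };            /* simulated error flag */
typedef struct { struct pci_dev *dev; } iocookie;  /* invented layout */

/* Mock: start a checked region with a clean error state. */
static void iochk_clear(iocookie *cookie, struct pci_dev *dev)
{
    cookie->dev = dev;
    dev->bus_error = 0;
}

/* Mock: non-zero if an error was seen since iochk_clear(). */
static int iochk_read(iocookie *cookie)
{
    return cookie->dev->bus_error;
}

/* Simulated MMIO read; a failed read flags an error and returns all ones. */
static unsigned int fake_readl(struct pci_dev *dev, unsigned int value, int fail)
{
    if (fail) {
        dev->bus_error = 1;
        return 0xffffffffu;
    }
    return value;
}

/* Hypothetical driver routine: read a register, trust it only if clean. */
static int read_status(struct pci_dev *dev, int inject_fault,
                       unsigned int *status)
{
    iocookie cookie;
    unsigned int val;

    assert(dev != NULL && status != NULL);
    iochk_clear(&cookie, dev);
    val = fake_readl(dev, 0x2a, inject_fault);
    if (iochk_read(&cookie))
        return -1;          /* the value read cannot be trusted */
    *status = val;
    return 0;
}
```

The point of the synchronous style is visible in read_status(): the driver learns about the error before it acts on bad data, and it decides for itself what to do next.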

The obvious question which has been raised is why two interfaces are needed. It seems that some situations are better handled by an asynchronous notification mechanism (such as implemented by Linas's patch), while others are better suited to a synchronous approach. So it may well be that, at some point in the future, the kernel will go from no PCI error handling interfaces to two of them. Before that happens, however, one assumes that some work will be done to unify the underlying support code and to make the two interfaces appear more like parts of a single API.


Manual driver binding and unbinding

July 12, 2005

This article was contributed by Greg Kroah-Hartman.

One new feature in the 2.6.13-rc3 kernel release is the ability to bind and unbind drivers from devices manually from user space. Previously, the only way to disconnect a driver from a device was usually to unload the whole driver from memory, using rmmod.

In the sysfs tree, every driver now has bind and unbind files associated with it:

    $ tree /sys/bus/usb/drivers/ub/
    /sys/bus/usb/drivers/ub/
    |-- 1-1:1.0 -> ../../../../devices/pci0000:00/0000:00:1d.7/usb1/1-1/1-1:1.0
    |-- bind
    |-- module -> ../../../../module/ub
    `-- unbind

In order to unbind a device from a driver, simply write the bus id of the device to the unbind file:

    echo -n "1-1:1.0" > /sys/bus/usb/drivers/ub/unbind

and the device will no longer be bound to the driver:

    $ tree /sys/bus/usb/drivers/ub/
    /sys/bus/usb/drivers/ub/
    |-- bind
    |-- module -> ../../../../module/ub
    `-- unbind

A device can only be bound to a driver if no other driver currently controls it. To check, look for a "driver" symlink in the device directory; an unbound device will not have one:

    $ tree /sys/bus/usb/devices/1-1:1.0
    /sys/bus/usb/devices/1-1:1.0
    |-- bAlternateSetting
    |-- bInterfaceClass
    |-- bInterfaceNumber
    |-- bInterfaceProtocol
    |-- bInterfaceSubClass
    |-- bNumEndpoints
    |-- bus -> ../../../../../../bus/usb
    |-- modalias
    `-- power
        `-- state

Then, simply write the bus id of the device you wish to bind into the bind file for that driver:

    echo -n "1-1:1.0" > /sys/bus/usb/drivers/usb-storage/bind

And check that the binding was successful:

    $ tree /sys/bus/usb/devices/1-1:1.0
    /sys/bus/usb/devices/1-1:1.0
    |-- bAlternateSetting
    |-- bInterfaceClass
    |-- bInterfaceNumber
    |-- bInterfaceProtocol
    |-- bInterfaceSubClass
    |-- bNumEndpoints
    |-- bus -> ../../../../../../bus/usb
    |-- driver -> ../../../../../../bus/usb/drivers/usb-storage
    |-- host2
    |   `-- power
    |       `-- state
    |-- modalias
    `-- power
        `-- state

As the example above shows, this capability is very useful for switching devices between drivers which handle the same type of device (both the ub and usb-storage drivers handle USB mass storage devices, like flash drives).
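A configuration tool would presumably perform the same writes from a program rather than from the shell. Here is a minimal sketch in C; write_bus_id() and rebind_device() are hypothetical helper names, and the only behavior taken from the article is that the bus id is written, without a trailing newline, to the unbind and bind files.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/*
 * Write bus_id (no trailing newline, like "echo -n") to the sysfs
 * attribute at path. Returns 0 on success or a negative errno value.
 */
static int write_bus_id(const char *path, const char *bus_id)
{
    size_t len;
    ssize_t n;
    int fd, err = 0;

    assert(path != NULL && bus_id != NULL);
    len = strlen(bus_id);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;
    n = write(fd, bus_id, len);
    if (n < 0)
        err = -errno;
    else if ((size_t)n != len)
        err = -EIO;         /* short write: treat as a failure */
    close(fd);
    return err;
}

/* Move a device between drivers: unbind from the old, bind to the new. */
static int rebind_device(const char *unbind_path, const char *bind_path,
                         const char *bus_id)
{
    int err = write_bus_id(unbind_path, bus_id);

    return err ? err : write_bus_id(bind_path, bus_id);
}
```

Switching the flash drive from the example would then be a call like rebind_device("/sys/bus/usb/drivers/ub/unbind", "/sys/bus/usb/drivers/usb-storage/bind", "1-1:1.0"), run with sufficient privileges.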

A number of "enterprise" Linux distributions offer multiple drivers of different version levels in their kernel packages. This manual binding feature will allow configuration tools to pick and choose which devices should be bound to which drivers, allowing users to upgrade only specific devices if they wish to.

In order for a device to bind successfully with a driver, that driver must already support that device. This is why you cannot just arbitrarily bind any device to any driver. To help with the issue of adding support for new devices to drivers after they are built, the PCI system offers a dynamic_id file in sysfs so that user space can write in new device ids that the driver should bind to. In the future, this ability to add new driver IDs to a running kernel will be moved into the driver core to make it available for all buses.


CFQ v3

Jens Axboe's completely fair queueing (CFQ) I/O scheduler has been regarded by many as the best available in the 2.6 kernel for a while. Said scheduler has just been through another major upgrade which should implement a higher degree of fairness while providing "excellent" throughput for the system as a whole.

One of the big additions this time around is time sharing: processes now get time slices during which they are able to dispatch I/O requests. The scheduler will allow a drive to go idle - briefly - during a process's time slice to give that process an opportunity to generate more I/O requests. In this way, it behaves similarly to the anticipatory scheduler; it allows the process to get the most out of its slice while, hopefully, taking advantage of the locality of that process's requests. If, however, a process's requests end up causing too much seeking, that process will temporarily lose its right to hold the disk idle.

Tied in with the time sharing implementation is the notion of I/O priorities. Each process has its own I/O priority, which, by default, is derived from its CPU priority. Processes with higher priorities will preempt lower-priority processes, while sharing the drive in a round-robin fashion with equal-priority processes. There is also a realtime priority level which does not do round-robin sharing, and an "idle" level which is only allowed to dispatch requests when the drive has been idle for a sufficiently long period.

There is a temporary priority boosting mechanism designed to avoid priority inversion problems when a low-priority process holds important resources.

Two new system calls have been added for working with I/O priorities:

    int ioprio_set(int which, int who, int priority);
    int ioprio_get(int which, int who);

Here, which controls whether the call applies to a single process, process group, or user, and who is the appropriate ID (usually the process ID). A call to ioprio_set() will apply the new priority (subject to the usual permissions checks) while ioprio_get() returns the current value.
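Since the C library is unlikely to have wrappers for brand-new system calls, a program wanting to use them would go through syscall(). The sketch below assumes a system whose headers define SYS_ioprio_set and SYS_ioprio_get. The constants and the packing of a class and a level into the priority argument are not described above; they follow the values found in the kernel's ioprio header and should be treated as assumptions here, as should the ioprio_value() helper name.

```c
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values, mirroring the kernel's ioprio definitions. */
#define IOPRIO_CLASS_SHIFT 13
enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };

/* Hypothetical helper: pack a scheduling class and a level (0 to 7). */
static int ioprio_value(int class, int level)
{
    assert(level >= 0 && level < 8);
    return (class << IOPRIO_CLASS_SHIFT) | level;
}

/* Thin wrappers with the prototypes shown above. */
static int ioprio_set(int which, int who, int priority)
{
    return (int)syscall(SYS_ioprio_set, which, who, priority);
}

static int ioprio_get(int which, int who)
{
    return (int)syscall(SYS_ioprio_get, which, who);
}
```

A process could then put itself into the idle class with ioprio_set(IOPRIO_WHO_PROCESS, getpid(), ioprio_value(IOPRIO_CLASS_IDLE, 0)) and read the result back with ioprio_get().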


Patches and updates

Development tools

  • Marco Costalba: qgit-0.7. (July 12, 2005)

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds