Brief items
The current 2.6 prepatch is 2.6.13-rc3, released by Linus on July 12.
Changes this time around include a new DES (crypto) implementation with
better performance, multi-block operation support in the crypto layer,
"almost-skas" mode support for user-mode Linux, a big memory technology
device (MTD) update, user-space I/O initiation for InfiniBand, and the
long-awaited inotify patch. "There's a bit more changes here than I would
like, but I'm putting my foot down now. Not only are a lot of people going
to be gone next week for LKS and OLS, but we've gotten enough stuff for
2.6.13, and we need to calm down." See the long-format changelog for the
details.
Linus's git repository contains a small number of fixes added after the
-rc3 release.
The current -mm tree is 2.6.13-rc2-mm2. Recent changes
to -mm include a set of swapper fixes, a big InfiniBand update, and lots of
fixes. The class-based kernel
resource management patches have since been added for (presumably)
2.6.13-rc3-mm1.
Kernel development news
The flood of patches going into the mainline 2.6.13 brings with it the
usual assortment of changes to the internal kernel API. Here's a subset of
those changes.
The configurable HZ patch has been merged. If there is, somehow,
code which has survived this far with assumptions about the value of
HZ, it should probably be fixed sometime soon.
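To illustrate the kind of assumption that is now at risk, here is a small user-space sketch (not kernel code) of converting a millisecond delay into timer ticks. The function name ms_to_ticks() and its explicit hz parameter are invented for this example; hz simply stands in for the kernel's HZ. In-kernel code would normally call msecs_to_jiffies() rather than open-coding the arithmetic.

```c
#include <assert.h>

/*
 * User-space model: convert milliseconds to timer ticks for a given
 * tick rate, rounding up so that a delay is never shorter than asked.
 * "hz" stands in for the kernel's HZ; ms_to_ticks() is a made-up name.
 */
unsigned long ms_to_ticks(unsigned int ms, unsigned int hz)
{
    assert(hz > 0);
    return ((unsigned long)ms * hz + 999) / 1000;
}
```

Code which hard-wires "one tick is 10ms" is only right when the tick rate is 100; at a rate of 250 the same 10ms is two and a half ticks, which this sketch rounds up to three.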
There is a new timer function:
int try_to_del_timer_sync(struct timer_list *timer);
This function will make a best effort to delete the timer. Should the
timer function actually be running at the time, however, this version will
not wait for it to complete; it will return -1 immediately. It
can thus be used in interrupt handlers and other contexts where waiting for
a timer function to finish is not an option.
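The return-value contract can be modeled in user space as follows. This is only a toy: struct model_timer, its two fields, and model_try_to_del_timer_sync() are all invented for illustration. The 0 and 1 return values are an assumption borrowed from the long-standing del_timer() convention; the only value described above is -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct timer_list; both fields are invented. */
struct model_timer {
    bool pending;   /* queued and waiting to fire */
    bool running;   /* handler is executing on another CPU */
};

/*
 * Model of the non-waiting delete:
 *   -1  the handler is running, nothing was done (caller may retry)
 *    1  a pending timer was removed
 *    0  the timer was not pending in the first place
 */
int model_try_to_del_timer_sync(struct model_timer *t)
{
    assert(t != NULL);
    if (t->running)
        return -1;
    if (t->pending) {
        t->pending = false;
        return 1;
    }
    return 0;
}
```

A caller in atomic context would treat -1 as "try again later" rather than spinning until the handler finishes.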
The block_device_operations structure has a new member:
long (*unlocked_ioctl) (struct file *filp, unsigned cmd,
                        unsigned long arg);
If an unlocked_ioctl() method exists, it will be called (in
preference to ioctl()), and the big kernel lock will not be held.
Drivers which perform their own locking (which should be all of them,
really) can use the new method to avoid the overhead of the BKL.
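The dispatch rule can be sketched in user space like this. The cut-down operations structure, the lock counter, and the sample handlers are all hypothetical; the file pointer argument is dropped for brevity, and -ENOTTY is assumed as the result when no handler exists.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Counts how often the (simulated) big kernel lock was taken. */
static int bkl_acquisitions;

/* Cut-down, hypothetical version of the operations table. */
struct model_blk_ops {
    int  (*ioctl)(unsigned cmd, unsigned long arg);
    long (*unlocked_ioctl)(unsigned cmd, unsigned long arg);
};

/* Prefer unlocked_ioctl(); fall back to ioctl() under the BKL. */
long model_blkdev_ioctl(const struct model_blk_ops *ops,
                        unsigned cmd, unsigned long arg)
{
    long ret;

    assert(ops != NULL);
    if (ops->unlocked_ioctl)
        return ops->unlocked_ioctl(cmd, arg);   /* no BKL here */
    if (!ops->ioctl)
        return -ENOTTY;                         /* no handler at all */

    bkl_acquisitions++;          /* lock_kernel() in the real thing */
    ret = ops->ioctl(cmd, arg);
    /* unlock_kernel() would go here */
    return ret;
}

/* Sample handlers used to exercise the dispatcher. */
int sample_locked(unsigned cmd, unsigned long arg)
{
    (void)arg;
    return (int)cmd + 1;
}

long sample_unlocked(unsigned cmd, unsigned long arg)
{
    (void)arg;
    return (long)cmd + 2;
}
```

A driver supplying both methods will only ever see the unlocked one called, so the old ioctl() method can stay in place during a transition.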
The netif_rx() function, used by network drivers (when not in NAPI
mode) to feed packets into the kernel, has traditionally returned one of
several values indicating how congested the system was. The idea was that
drivers could use this information to reduce load on the kernel as
congestion increases. No drivers do this, however; instead, NAPI is used
for high-traffic situations. So netif_rx() will now return one of
two values: NET_RX_SUCCESS if all is well, or
NET_RX_DROP if the packet was dropped.
It's also worth noting that the sk_buff structure has changed
again, leading to the usual troubles with binary-only drivers.
Authors of PCI drivers who want to squeeze out every bit of DMA performance
from their hardware can use a new function to determine the optimal DMA
burst size:
void pci_dma_burst_advice(struct pci_dev *pdev,
                          enum pci_dma_burst_strategy *strat,
                          unsigned long *param);
On return, strat will tell which strategy works best on the
current platform. PCI_DMA_BURST_INFINITY says that bursts should
simply be made as large as possible; in this case, param contains
no information. PCI_DMA_BURST_BOUNDARY tells the driver to not
burst across memory boundaries which are a multiple of the value returned
in param. And PCI_DMA_BURST_MULTIPLE sets a maximum size
(returned in param) on each individual burst.
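How a driver might act on that advice can be sketched in user space. The enum below is a local copy of the three strategy names so that the example builds outside the kernel, and next_burst_len() is a hypothetical helper, not part of the PCI API. It follows the interpretation of the three strategies given above.

```c
#include <assert.h>

/* Local copies of the strategy names so the sketch builds in user space. */
enum pci_dma_burst_strategy {
    PCI_DMA_BURST_INFINITY,
    PCI_DMA_BURST_BOUNDARY,
    PCI_DMA_BURST_MULTIPLE,
};

/*
 * Hypothetical helper: length of the next burst starting at bus address
 * "addr" with "remaining" bytes still to transfer, given the advice in
 * strat and param.
 */
unsigned long next_burst_len(enum pci_dma_burst_strategy strat,
                             unsigned long param,
                             unsigned long addr,
                             unsigned long remaining)
{
    unsigned long limit = remaining;

    switch (strat) {
    case PCI_DMA_BURST_INFINITY:
        break;                          /* param carries no information */
    case PCI_DMA_BURST_BOUNDARY:
        assert(param > 0);
        limit = param - (addr % param); /* stop at the next boundary */
        break;
    case PCI_DMA_BURST_MULTIPLE:
        assert(param > 0);
        limit = param;                  /* cap on each individual burst */
        break;
    }
    return remaining < limit ? remaining : limit;
}
```

With a boundary of 1024 bytes, for example, a transfer starting at 0x1300 gets a first burst of 256 bytes, after which every burst can be a full 1024.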
Thomas Graf has contributed a generic text searching mechanism for the
kernel. It can handle searching through non-contiguous data, and is
designed to work with pluggable searching algorithms. A couple of search
modules have been provided: a straight Knuth/Morris/Pratt string matcher
and a finite state machine version which provides a limited regular
expression mechanism. The initial application for this library is for
flexible packet classification in the networking traffic control code, but
other uses are possible.
Performing a search requires first setting up a configuration:
struct ts_config *textsearch_prepare(const char *algorithm,
                                     const void *pattern,
                                     unsigned int patlen,
                                     int gfp_mask, int flags);
Here, algorithm is the searching algorithm to use;
"kmp" will get Knuth/Morris/Pratt. pattern is the actual
pattern to search for; patlen is its length. The usual memory
allocation flags are provided in gfp_mask, and flags is
for search-specific flags. Currently, the only flag is
TS_AUTOLOAD, which allows the kernel to load a module implementing
the desired search algorithm, if necessary. The return value is a
pointer to a configuration structure to be used with the other functions,
or an error value (as determined by IS_ERR()).
A ts_config structure, once initialized, can be reused as many times as
desired. It contains no per-search state, so it can be used in parallel
searches as well. When the structure is no longer needed, it should be
returned with a call to textsearch_destroy().
If the data to be searched is a single, contiguous block, then searching is
a matter of calling:
unsigned int textsearch_find_continuous(struct ts_config *config,
                                        struct ts_state *state,
                                        const void *data,
                                        unsigned int datalen);
unsigned int textsearch_next(struct ts_config *config,
                             struct ts_state *state);
For these calls, config is a configuration returned from
textsearch_prepare(), and state is a local state
variable. A call to textsearch_find_continuous() must come first;
it will initialize state for a search through the given
data array. Both functions will return the offset of the
beginning of the match, or UINT_MAX if no (further) match is
found.
If the data to be searched is not contiguous in memory, things get a little
more complicated. The caller must provide a method which can obtain a
pointer to a block of data:
unsigned int (*get_next_block)(unsigned int consumed,
                               const u8 **dst,
                               struct ts_config *config,
                               struct ts_state *state);
This function will be called by the textsearch code when it needs more data
to look through. It should locate the first byte beyond consumed
and store its address in *dst. The config pointer will
not normally be used; state->cb is a 40-byte "control buffer"
which can be used to store data between calls to
get_next_block(). The return value is the length of the block, or
zero if there is no more data.
Another method:
void (*finish)(struct ts_config *config, struct ts_state *state);
will be called after each search completes. Note that there can be several
get_next_block() calls for each call to finish().
Both of these methods are stored in the ts_config structure; they
should be set there after the call to textsearch_prepare(). The
first search is performed with:
unsigned int textsearch_find(struct ts_config *config,
                             struct ts_state *state);
Subsequent searches can be performed with textsearch_next().
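The textsearch library is only available inside the kernel, so what follows is a user-space sketch of the underlying idea rather than a use of the real API: a Knuth/Morris/Pratt matcher which carries its match state across block boundaries, so that a pattern straddling two blocks is still found. Every name here (frag_search(), kmp_prefix(), MAX_PAT) is hypothetical, and an array of strings stands in for the blocks a get_next_block() method would hand over one at a time.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAT 32      /* arbitrary pattern length limit for the sketch */

/* Build the KMP failure table for pat[0..len). */
static void kmp_prefix(const char *pat, unsigned int len, unsigned int *fail)
{
    unsigned int k = 0, i;

    fail[0] = 0;
    for (i = 1; i < len; i++) {
        while (k > 0 && pat[k] != pat[i])
            k = fail[k - 1];
        if (pat[k] == pat[i])
            k++;
        fail[i] = k;
    }
}

/*
 * Search the logical concatenation of frags[0..nfrags) for pattern.
 * Returns the offset of the first match, or UINT_MAX if there is none.
 */
unsigned int frag_search(const char *const *frags, size_t nfrags,
                         const char *pattern)
{
    unsigned int fail[MAX_PAT];
    unsigned int len = (unsigned int)strlen(pattern);
    unsigned int matched = 0;   /* pattern bytes matched so far */
    unsigned int consumed = 0;  /* data bytes already looked at */
    size_t f;

    assert(len > 0 && len <= MAX_PAT);
    kmp_prefix(pattern, len, fail);

    for (f = 0; f < nfrags; f++) {      /* one "next block" per pass */
        const char *p;

        for (p = frags[f]; *p; p++, consumed++) {
            while (matched > 0 && pattern[matched] != *p)
                matched = fail[matched - 1];
            if (pattern[matched] == *p)
                matched++;
            if (matched == len)
                return consumed + 1 - len;  /* start of the match */
        }
    }
    return UINT_MAX;
}
```

Because "matched" survives the move from one fragment to the next, searching the fragments "hel", "lo wo", and "rld" for "world" reports offset 6 even though no single fragment contains the pattern.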
The PCI bus is the interconnect of choice for the bulk of the architectures
supported by Linux. Most peripherals on such systems - including disk,
network, and USB controllers - communicate with the CPU via this bus.
Linux device drivers (regardless of the bus used) must be written with the
idea that the device being controlled can fail. Most drivers, however,
assume that the bus used to communicate with the device will work
flawlessly. This assumption exists because (1) it tends to be true,
and (2) the Linux kernel has never provided an infrastructure which
enables drivers to detect (and respond to) PCI errors. Work is under way
to provide that infrastructure, however; there are currently two entirely
different interfaces being proposed for this role.
The first approach, posted by Linas
Vepstas, works by way of callbacks. It enhances the pci_driver
structure by adding a new set of methods:
struct pci_error_handlers
{
    enum pci_channel_state error_state;
    int (*error_detected)(struct pci_dev *dev,
                          enum pci_channel_state error);
    int (*mmio_enabled)(struct pci_dev *dev);
    int (*link_reset)(struct pci_dev *dev);
    int (*slot_reset)(struct pci_dev *dev);
    void (*resume)(struct pci_dev *dev);
};
A PCI driver is not required to supply any of these callbacks. Any driver
which will perform PCI error recovery must provide at least
error_detected(), however. That method will be called sometime after the
PCI subsystem detects an error on the bus; the error parameter
will be set to one of these values:
enum pci_channel_state {
    pci_channel_io_normal = 0,   /* I/O channel is in normal state */
    pci_channel_io_frozen = 1,   /* I/O to channel is blocked */
    pci_channel_io_perm_failure, /* pci card is dead */
};
The error_detected() method should shut down any ongoing I/O
operations, but should not attempt to communicate with the adapter itself.
This method can take locks and sleep; it is called from process
context. The return value tells the error recovery subsystem how to
proceed; it can be PCIERR_RESULT_CAN_RECOVER (the driver thinks it
will be able to recover just by talking to the adapter),
PCIERR_RESULT_NEED_RESET (a hard reset of the adapter will be
required), or PCIERR_RESULT_DISCONNECT (the situation is hopeless,
and the adapter should be considered permanently dead).
If all drivers on an affected PCI segment think they can recover from the
problem, the next step is to turn memory-mapped I/O back on and let the
drivers try. To this end, each driver's mmio_enabled() callback
will be invoked. This callback should do whatever port banging is required
to get the adapter back into a reasonable state, then return one of
PCIERR_RESULT_RECOVERED (it worked),
PCIERR_RESULT_NEED_RESET (it failed, try resetting), or
PCIERR_RESULT_DISCONNECT (it failed, abandon all hope).
Regardless of the outcome, the driver should not restart I/O from this
callback.
The link_reset() method is similar to mmio_enabled(), but
it is only applicable for PCI-Express adapters which might be fixable via a
link reset operation. The return codes are the same as for
mmio_enabled().
If a reset is called for, the PCI subsystem will perform the reset, then
call slot_reset() to let the driver know. The driver should
attempt to bring the adapter back to a working state, re-download firmware,
etc., then return a status code indicating whether things worked or not.
If reinitialization fails, it is possible that slot_reset() could
be called more than once as the PCI subsystem employs an increasingly large
hammer.
Finally, if all seems to be well, the driver's resume() callback
will be called; this is the point where I/O operations can be restarted.
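The order of the callbacks can be summarized with a small user-space model. It is a deliberate simplification: it covers a single driver, leaves out link_reset(), drops the pci_dev arguments, and treats any path that does not end in PCIERR_RESULT_RECOVERED as a dead adapter. The result names are taken from the description above; their numeric values, model_recover(), and the demo_* callbacks are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Result codes named as in the patch; the numeric values are arbitrary. */
enum pcierr_result {
    PCIERR_RESULT_CAN_RECOVER,
    PCIERR_RESULT_NEED_RESET,
    PCIERR_RESULT_DISCONNECT,
    PCIERR_RESULT_RECOVERED,
};

/* Device-less version of the callback set, for illustration only. */
struct model_err_handlers {
    enum pcierr_result (*error_detected)(void);
    enum pcierr_result (*mmio_enabled)(void);
    enum pcierr_result (*slot_reset)(void);
    void (*resume)(void);
};

/* Walk one driver through recovery; returns 1 if I/O was resumed. */
int model_recover(const struct model_err_handlers *h)
{
    enum pcierr_result r;

    assert(h != NULL);
    if (!h->error_detected)
        return 0;                   /* driver does not do recovery */

    r = h->error_detected();
    if (r == PCIERR_RESULT_CAN_RECOVER && h->mmio_enabled)
        r = h->mmio_enabled();      /* MMIO is back on; poke the device */
    if (r == PCIERR_RESULT_NEED_RESET && h->slot_reset)
        r = h->slot_reset();        /* called after the slot was reset */

    if (r != PCIERR_RESULT_RECOVERED)
        return 0;                   /* give up on the adapter */
    if (h->resume)
        h->resume();                /* I/O may restart from here */
    return 1;
}

/* Sample driver: MMIO-only recovery fails, but a slot reset works. */
static int resumed;
enum pcierr_result demo_detected(void) { return PCIERR_RESULT_CAN_RECOVER; }
enum pcierr_result demo_mmio(void)     { return PCIERR_RESULT_NEED_RESET; }
enum pcierr_result demo_reset(void)    { return PCIERR_RESULT_RECOVERED; }
void demo_resume(void)                 { resumed = 1; }
```

The sample driver passes through all four stages in turn, which is the longest path through the recovery sequence described above.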
A very different approach is taken by the IOCHK interface posted by
Hidetoshi Seto. This patch expects drivers to perform more of their own
error checking, but gives more control over the timing of recovery
operations.
The IOCHK patch works by defining a new opaque type called
iocookie. A driver which is about to engage in a conversation
with one of its devices would initialize one of these cookies with:
void iochk_clear(iocookie *cookie, struct pci_dev *dev);
The driver then performs its device operations, reading and writing
memory-mapped I/O registers as necessary. At any point, the driver can
check to see whether an error has occurred with:
int iochk_read(iocookie *cookie);
A non-zero return indicates trouble; should that happen, the driver can
respond by resetting the device, disconnecting it, or going into
hysterics. There is no core support for operations like resetting
adapters.
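The resulting driver pattern, clear the cookie, do the I/O, then check, can be modeled in user space as shown here. Nothing in this sketch is the real interface: the cookie layout is invented (the real type is opaque), the pci_dev argument is omitted, and model_readl() is a fake register read which can be told to fail once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated bus: set nonzero to make the next register read fail. */
static int bus_fault_pending;

/* Invented cookie layout; the real type is opaque. */
typedef struct { int error; } model_iocookie;

void model_iochk_clear(model_iocookie *c)      { c->error = 0; }
int  model_iochk_read(const model_iocookie *c) { return c->error; }

/* Fake MMIO read which records a bus error in the cookie. */
uint32_t model_readl(model_iocookie *c, uint32_t reg_value)
{
    if (bus_fault_pending) {
        bus_fault_pending = 0;
        c->error = 1;
        return 0xffffffffu;  /* failed PCI reads usually yield all ones */
    }
    return reg_value;
}

/* Read two registers as one checked transaction; 0 on success, -1 on error. */
int read_status_pair(uint32_t a, uint32_t b, uint32_t *sum)
{
    model_iocookie cookie;
    uint32_t x, y;

    assert(sum != NULL);
    model_iochk_clear(&cookie);
    x = model_readl(&cookie, a);
    y = model_readl(&cookie, b);
    *sum = x + y;
    return model_iochk_read(&cookie) ? -1 : 0;
}
```

The point of the design is that the check happens once per transaction, at a time of the driver's choosing, rather than after every register access.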
The obvious question which has been raised is why two interfaces are
needed. It seems that some situations are better handled by an
asynchronous notification mechanism (such as implemented by Linas's patch),
while others are better suited to a synchronous approach. So it may well
be that, at some point in the future, the kernel will go from no PCI error
handling interfaces to two of them. Before that happens, however, one
assumes that some work will be done to unify the underlying support code
and to make the two interfaces appear more like parts of a single API.
One new feature in the 2.6.13-rc3 kernel release is the ability to bind
and unbind drivers from devices manually from user space. Previously,
the only way to disconnect a driver from a device was, in most cases, to
unload the whole driver from memory using rmmod.
In the sysfs tree, every driver now has bind and unbind files
associated with it:
$ tree /sys/bus/usb/drivers/ub/
/sys/bus/usb/drivers/ub/
|-- 1-1:1.0 -> ../../../../devices/pci0000:00/0000:00:1d.7/usb1/1-1/1-1:1.0
|-- bind
|-- module -> ../../../../module/ub
`-- unbind
In order to unbind a device from a driver, simply write the bus id of
the device to the unbind file:
echo -n "1-1:1.0" > /sys/bus/usb/drivers/ub/unbind
and the device will no longer be bound to the driver:
$ tree /sys/bus/usb/drivers/ub/
/sys/bus/usb/drivers/ub/
|-- bind
|-- module -> ../../../../module/ub
`-- unbind
To bind a device to a driver, the device must first not be controlled by
any other driver. To ensure this, look for the "driver" symlink in the
device directory:
$ tree /sys/bus/usb/devices/1-1:1.0
/sys/bus/usb/devices/1-1:1.0
|-- bAlternateSetting
|-- bInterfaceClass
|-- bInterfaceNumber
|-- bInterfaceProtocol
|-- bInterfaceSubClass
|-- bNumEndpoints
|-- bus -> ../../../../../../bus/usb
|-- modalias
`-- power
    `-- state
Then, simply write the bus id of the device you wish to bind into the
bind file for that driver:
echo -n "1-1:1.0" > /sys/bus/usb/drivers/usb-storage/bind
And check that the binding was successful:
$ tree /sys/bus/usb/devices/1-1:1.0
/sys/bus/usb/devices/1-1:1.0
|-- bAlternateSetting
|-- bInterfaceClass
|-- bInterfaceNumber
|-- bInterfaceProtocol
|-- bInterfaceSubClass
|-- bNumEndpoints
|-- bus -> ../../../../../../bus/usb
|-- driver -> ../../../../../../bus/usb/drivers/usb-storage
|-- host2
|   `-- power
|       `-- state
|-- modalias
`-- power
    `-- state
As the example above shows, this capability is very useful for switching
devices between drivers which handle the same type of device (both the
ub and usb-storage drivers handle USB mass storage devices, such as flash
drives).
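A configuration tool written in C could do the same thing as the echo commands shown earlier. The following is a minimal sketch under that assumption; write_attr() and rebind() are hypothetical helper names, and the only kernel interface used is the pair of bind and unbind files.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "value" (with no trailing newline, like "echo -n") to the file
 * at "path". Returns 0 on success, -1 on failure.
 */
int write_attr(const char *path, const char *value)
{
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)     /* buffered write errors show up here */
        ok = 0;
    return ok ? 0 : -1;
}

/* Move device "dev" on bus "bus" from driver "from" to driver "to". */
int rebind(const char *bus, const char *from, const char *to,
           const char *dev)
{
    char path[256];

    snprintf(path, sizeof(path), "/sys/bus/%s/drivers/%s/unbind", bus, from);
    if (write_attr(path, dev) != 0)
        return -1;
    snprintf(path, sizeof(path), "/sys/bus/%s/drivers/%s/bind", bus, to);
    return write_attr(path, dev);
}
```

A call such as rebind("usb", "ub", "usb-storage", "1-1:1.0") would reproduce the sequence above; it needs sufficient privileges to write to sysfs, and a failed write indicates that the driver refused the device.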
A number of "enterprise" Linux distributions offer multiple drivers of
different version levels in their kernel packages. This manual binding
feature will allow configuration tools to pick and choose which devices
should be bound to which drivers, allowing users to upgrade only
specific devices if they wish to.
In order for a device to bind successfully with a driver, that driver
must already support that device. This is why you cannot just
arbitrarily bind any device to any driver. To help with the issue of
adding new device support to drivers after they are built, the PCI
system offers a new_id file (its dynamic ID interface) in sysfs so that
user space can write in new device IDs that the driver should bind to. In
the future, this ability to add new device IDs to a running kernel will be
moved into the driver core to make it available for all buses.
Jens Axboe's completely fair queueing (CFQ) I/O scheduler has been regarded
by many as the best available in the 2.6 kernel for a while. Said
scheduler has just been through another major upgrade which should
implement a higher degree of fairness while providing "excellent"
throughput for the system as a whole.
One of the big additions this time around is time sharing: processes now
get time slices during which they are able to dispatch I/O requests. The
scheduler will allow a drive to go idle - briefly - during a process's time
slice to give that process an opportunity to generate more I/O requests. In this
way, it behaves similarly to the anticipatory scheduler; it allows the
process to get the most out of its slice while, hopefully, taking advantage
of the locality of that process's requests. If, however, a process's
requests end up causing too much seeking, that process will temporarily
lose its right to hold the disk idle.
Tied in with the time sharing implementation is the notion of I/O
priorities. Each process has its own I/O priority, which, by default, is
derived from its CPU priority. Processes with higher priorities will
preempt lower-priority processes, while sharing the drive in a round-robin
fashion with equal-priority processes. There is also a realtime priority
level which does not do round-robin sharing, and an "idle" level which is
only allowed to dispatch requests when the drive has been idle for a
sufficiently long period.
There is a temporary priority boosting mechanism designed to avoid priority
inversion problems when a low-priority process holds important resources.
Two new system calls have been added for working with I/O priorities:
int ioprio_set(int which, int who, int priority);
int ioprio_get(int which, int who);
Here, which controls whether the call applies to a single process,
process group, or user, and who is the appropriate ID (usually the
process ID). A call to ioprio_set() will apply the new
priority (subject to the usual permissions checks) while
ioprio_get() returns the current value.
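Since the C library provides no wrappers for these calls, a program has to go through syscall(). The sketch below assumes a Linux system whose headers define SYS_ioprio_set and SYS_ioprio_get. The constants and the class/level encoding are not described above; they are copied from the kernel's ioprio header as an assumption, and ioprio_value() is a hypothetical helper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Encoding as found in the kernel's linux/ioprio.h (assumed here). */
#define IOPRIO_CLASS_SHIFT 13
enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };

/* Pack a class and a level (0 = highest, 7 = lowest) into one value. */
int ioprio_value(int class, int level)
{
    assert(level >= 0 && level < 8);
    return (class << IOPRIO_CLASS_SHIFT) | level;
}

/* Thin wrappers around the raw system calls. */
int ioprio_set(int which, int who, int priority)
{
    return (int)syscall(SYS_ioprio_set, which, who, priority);
}

int ioprio_get(int which, int who)
{
    return (int)syscall(SYS_ioprio_get, which, who);
}
```

A process could then drop itself to the idle class with ioprio_set(IOPRIO_WHO_PROCESS, getpid(), ioprio_value(IOPRIO_CLASS_IDLE, 0)); whether that call succeeds depends on the running kernel and on the caller's permissions.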
Patches and updates
Development tools
- Marco Costalba: qgit-0.7.
(July 12, 2005)
Page editor: Jonathan Corbet