Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.7; the first 2.6.8 prepatch has not yet been released as of this writing. There is, however, a large pile of patches in Linus's BitKeeper tree, including support for new Apple PowerBooks, more sparse annotations, some netfilter improvements, some kbuild work, a new wait_event_interruptible_exclusive() macro, support for the O_NOATIME flag in the open() call, sysfs knobs for tuning the CFQ I/O scheduler, mirroring and snapshot targets for the device mapper, the removal of the PC9800 subarchitecture, reiserfs data=journal support, preemptible kernel support for the PPC64 architecture, and many fixes and updates.

The current prepatch from Andrew Morton is 2.6.7-mm1; recent additions to -mm include a new knob for controlling how aggressively the system reclaims VFS caches when memory gets tight, a memory allocation tweak to improve DMA segment merging (see below), and various fixes.

The current 2.4 prepatch is 2.4.27-rc1, which was released by Marcelo on June 19. Only a small number of fixes have gone in since the last prepatch. Now is the time for those interested in a stable 2.4.27 release to do some testing.

Kernel development news

A handful of DMA topics

The generic DMA layer provides a way for device drivers to allocate and work with direct memory access regions without regard for how the underlying hardware does things. This interface works well, for the most part, but, as with the rest of the kernel, occasional issues come up. Here are a few that were discussed over the last week.

Many devices can perform full 64-bit DMA operations. This capability is nice on large-memory systems, but working with larger addresses can also bring a performance penalty. As a way of helping drivers pick the optimal size for DMA address descriptors, James Bottomley has proposed the creation of a new function called dma_get_required_mask().

The current API already has dma_set_mask(), which tells the kernel about the range of DMA addresses the device can access. The new function would be called after an invocation of dma_set_mask(); it would return a new bitmask describing what the platform sees as the optimal set of DMA addresses, taking the device's original DMA mask into account. If the specific hardware situation does not require the use of larger addresses, the platform can suggest using the faster, 32-bit mode even when the device can handle larger addresses. The driver can then use that advice to set a new mask describing what it will actually use.
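
The negotiation described above can be modelled in user space. The sketch below is not kernel code and does not reproduce the proposed implementation: platform_required_mask() and pick_dma_mask() are hypothetical stand-ins, and the model assumes that the platform's advice depends only on the highest physical address in use.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_MASK_32 0xffffffffULL
#define DMA_MASK_64 0xffffffffffffffffULL

/* Hypothetical platform hook: the smallest all-ones mask that covers
 * the highest physical address a DMA buffer could end up at. */
uint64_t platform_required_mask(uint64_t highest_addr)
{
    uint64_t mask = 0;

    while (mask < highest_addr)
        mask = (mask << 1) | 1;
    return mask;
}

/* Hypothetical driver-side helper: given what the device can address
 * and what the platform needs, pick the mask the driver should use. */
uint64_t pick_dma_mask(uint64_t device_mask, uint64_t highest_addr)
{
    uint64_t required;

    assert(device_mask != 0);
    required = platform_required_mask(highest_addr) & device_mask;

    /* If 32 bits are enough, prefer the cheaper 32-bit descriptors. */
    if (required <= DMA_MASK_32)
        return DMA_MASK_32 & device_mask;
    return device_mask;
}
```

In this model a 64-bit-capable device on a machine with less than 4GB of memory ends up with a 32-bit mask, while the same device on a larger machine keeps its full mask.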

The "scatterlist" mechanism is another part of the DMA subsystem; it allows drivers to set up scatter/gather I/O, where the buffer to be transferred is split into multiple, distinct chunks of memory. Scatter/gather is useful in a number of situations, including network packets (which are assembled from multiple chunks), the readv() and writev() system calls, and I/O directly to or from user-space buffers, which can be spread out in physical memory. The mapping functions for scatter/gather I/O will coalesce pieces of the buffer which turn out to be physically adjacent in memory. In practice, that has turned out not to happen very often; one recent report showed that, out of approximately 32,000 segments, all of 40 had been merged in this manner.

It turns out, however, that the Linux memory allocator is not helping the situation. When the allocator breaks up a large block of pages to satisfy smaller requests (a frequent occurrence), it returns the highest page in the block. A series of allocations will, thus, obtain pages in descending order. If those pages are assembled into an I/O buffer, each page will need to be a separate segment in a scatter/gather operation, since the reverse-allocated pages cannot be merged.

William Lee Irwin put together a patch which causes the allocator to hand out pages from the bottom of a block instead of the top. With this patch applied, the merge rate in this particular test went up to over 55%. Larger segments lead to faster I/O setup and execution, which is a good thing. Sometimes a tiny patch can make a big difference, once you know where the problem is.
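
The effect of allocation order on merging can be seen in a small user-space model. This is a sketch only: count_segments() and alloc_pages_model() are hypothetical names, pages are represented by bare page frame numbers, and real scatter/gather mapping is subject to further limits (segment size, IOMMU behavior) that are ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* Count the segments needed for n pages, merging a page into the
 * previous segment only when it immediately follows that page in
 * physical memory. */
size_t count_segments(const unsigned long *pfn, size_t n)
{
    size_t segments = 0;

    assert(pfn != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || pfn[i] != pfn[i - 1] + 1)
            segments++;
    }
    return segments;
}

/* Hypothetical allocator model: hand out n single pages from a block
 * starting at base, either from the top down (the old behavior) or
 * from the bottom up (the behavior after the patch). */
void alloc_pages_model(unsigned long *pfn, size_t n,
                       unsigned long base, int bottom_up)
{
    for (size_t i = 0; i < n; i++)
        pfn[i] = bottom_up ? base + i : base + (n - 1 - i);
}
```

With eight pages taken from one block, the top-down order yields eight separate segments, while the bottom-up order collapses into a single segment.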

Meanwhile, Ian Molton turned up a different sort of problem. Some types of interfaces have their own onboard memory. This memory is often accessible to the CPU, and it can be used by devices attached to the interface for DMA operations. But that memory is not part of the regular system RAM, and it typically does not show up in the system memory map. As a result, the generic DMA functions will not make use of this memory when allocating DMA buffers.

It would be nice to be able to make use of this memory, however. It is there, and it can be used to offload some DMA buffers from main memory. On some systems, it may be the only memory which is usable for DMA operations to certain devices. The DMA API has even been set up with this sort of memory in mind; it can handle cases where, for example, the memory in question has a different address from the device's point of view than it does for the processor. It would seem that the addition of an architecture-specific module to the DMA API could enable such memory to be allocated on platforms which have it, when the DMA target is a device which can make use of it.

The biggest problem would appear to be that this sort of remote memory is not part of the system's memory map, and, thus, there is no struct page which describes it. The lack of a page structure makes certain macros fail. It also completely breaks any driver which tries to map the buffer into user space via the nopage() VMA operation. And, it turns out, drivers really do that; the ALSA subsystem, for example, maps buffers to user space in this manner.

Once a problem is identified, it can usually be fixed. The right approach in this case would appear to be a combination of two things. The first is to simply fix any bad assumptions in drivers with regard to how they can treat DMA buffers. If the driver expects that a page structure exists for a DMA buffer, it is broken and simply needs to be fixed. The second part is to provide an architecture-independent way for device drivers to map DMA buffers into user space.

To that end, Russell King has proposed yet another DMA API function:

    int dma_map_coherent(struct device *dev, 
                         struct vm_area_struct *vma,
                         void *cpu_addr,
                         dma_addr_t handle,
                         size_t size);

This function would take the given mapped DMA buffer (as described by cpu_addr and handle) and map it into the requested VMA. Device drivers could use this function to make a buffer available to user space, and would be able to discard their existing nopage() methods. The new interface would thus simplify things, though it does still leave a reference counting problem on the driver side of things: freeing the DMA buffer before user space has unmapped it would be a big mistake.
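
The reference counting problem can be illustrated with a user-space model. Everything here is hypothetical (struct dma_buf_model and the buf_*() helpers are not part of any proposed API), malloc() stands in for the coherent allocation, and the counter is not thread-safe. A real driver would presumably take and drop references from the VMA open() and close() operations.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a DMA buffer shared with user space. */
struct dma_buf_model {
    void *cpu_addr;
    int refs;   /* one for the driver, plus one per user-space mapping */
};

struct dma_buf_model *buf_create(size_t size)
{
    struct dma_buf_model *b = malloc(sizeof(*b));

    if (b == NULL)
        return NULL;
    b->cpu_addr = malloc(size);   /* stands in for the DMA allocation */
    if (b->cpu_addr == NULL) {
        free(b);
        return NULL;
    }
    b->refs = 1;                  /* the driver's own reference */
    return b;
}

/* Called when user space maps the buffer. */
void buf_get(struct dma_buf_model *b)
{
    b->refs++;
}

/* Called on unmap and on driver teardown; returns 1 if the buffer
 * was actually freed, 0 if somebody still holds a reference. */
int buf_put(struct dma_buf_model *b)
{
    assert(b->refs > 0);
    if (--b->refs > 0)
        return 0;
    free(b->cpu_addr);
    free(b);
    return 1;
}
```

The point of the model is that the driver dropping its own reference does not release the memory while a mapping is still outstanding; the last buf_put() does.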

Separating kernel source and object files

The build process in recent 2.6 kernels allows for the separation of source and object trees. If a kernel build is started with the O= option, the resulting object files (and other built files) will go into the directory specified, rather than being mixed in with the source. Some developers find this way of doing things easier to manage, especially if the same source tree is being used to build kernels for multiple architectures or with multiple sets of configuration options.

One distributor (SUSE) has begun shipping kernels which have been built in this manner. The difference has gone unnoticed by almost all users, but one vendor of proprietary modules recently posted a strong message accusing SUSE of forking the kernel. The specific issue is that this vendor's modules would no longer build with SUSE's kernels, and that problem turned out to be a result of the separated source and object trees.

When a kernel's modules are installed under /lib, a symbolic link called build is made pointing to the source tree. This link is used by the external module build process to find kernel headers, configuration files, and needed object files. When SUSE adopted the separate object directory, it redirected the build link to point to that directory, rather than to the original source. That is, after all, where many of the necessary files will be found. Unfortunately for this particular vendor, their modules needed some other files which are only found in the source tree. When the build link was directed elsewhere, those modules would no longer compile.

The fix was relatively straightforward, but this situation forced a new discussion on how the build system should work when separate object directories are in use. The result is a new patch from Sam Ravnborg which nails down how these links should work. With this patch (not merged as of this writing), the build link would always point to the object directory. Doing things this way allows most external modules to continue to build without changes. A new link (source) would be added to point to the source directory when needed. And a small, special-purpose makefile would be placed in the object directory; its job is to bridge the gap between the two trees and make most external module builds work with no changes required.

Reworking the wireless extensions

Two weeks ago this page covered the launch of a new wireless networking effort. The scope of this effort now seems to be expanding to a redesign of the "wireless extensions" portion of the network stack. This code handles wireless network interfaces, and, in particular, provides a set of functions to user space for the control of those interfaces. Scott Feldman has posted an initial set of objectives for a wireless extensions rework.

Much of what is being proposed is uncontroversial. There has been some disagreement, however, over proposed changes to the "iw_handler" interface. This interface is the mechanism by which wireless adapter drivers respond to ioctl() calls from user space. Each driver registers a set of functions, one for each of the command codes supported by the wireless extensions. The mechanism used is different from what is seen in other parts of the kernel, however; a wireless interface driver fills in a simple array of function pointers and passes that to the core. The array is indexed by the ioctl() command code, and the proper function is called.

The problem with this interface is that it defeats the compiler's normal type checking. All wireless extension handler functions must have the same prototype, and there is no real way to tell if the right one is being called. As a way of improving the code base, Jeff Garzik would like to replace the iw_handler array with a structure full of specific, named function pointers - the same mechanism which is used in the rest of the kernel. Initially, all of these functions would keep the current iw_handler prototype, but, over time, each function would be migrated over to taking exactly the arguments it needs.
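
The difference between the two styles can be shown with a simplified user-space sketch. None of the names below come from the wireless extensions: struct radio, the CMD_* codes, and both sets of handlers are hypothetical, and the generic prototype is much simpler than the real iw_handler one.

```c
#include <assert.h>
#include <stddef.h>

struct radio { int freq; };

/* Style 1: one generic prototype, handlers stored in an array
 * indexed by command code. */
typedef int (*generic_handler)(void *dev, void *arg);

enum { CMD_SET_FREQ, CMD_GET_FREQ, CMD_MAX };

int set_freq_generic(void *dev, void *arg)
{
    ((struct radio *)dev)->freq = *(int *)arg;
    return 0;
}

int get_freq_generic(void *dev, void *arg)
{
    *(int *)arg = ((struct radio *)dev)->freq;
    return 0;
}

/* Swapping these two entries would compile without any warning. */
const generic_handler handler_table[CMD_MAX] = {
    [CMD_SET_FREQ] = set_freq_generic,
    [CMD_GET_FREQ] = get_freq_generic,
};

int dispatch(struct radio *r, unsigned int cmd, void *arg)
{
    if (cmd >= (unsigned int)CMD_MAX || handler_table[cmd] == NULL)
        return -1;
    return handler_table[cmd](r, arg);
}

/* Style 2: named function pointers, each with its own prototype. */
struct radio_ops {
    int (*set_freq)(struct radio *r, int freq);
    int (*get_freq)(const struct radio *r, int *freq);
};

int set_freq_typed(struct radio *r, int freq)
{
    r->freq = freq;
    return 0;
}

int get_freq_typed(const struct radio *r, int *freq)
{
    *freq = r->freq;
    return 0;
}

/* Assigning set_freq_typed to .get_freq here would draw a compiler
 * diagnostic, since the two pointer types are incompatible. */
const struct radio_ops typed_ops = {
    .set_freq = set_freq_typed,
    .get_freq = get_freq_typed,
};
```

In the first style every mistake is deferred to run time; in the second, the compiler can catch a handler placed in the wrong slot or called with the wrong arguments.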

Nobody disputes that the new interface would be cleaner. Jean Tourrilhes, the designer of the wireless extensions, has an objection, however: changing this interface would break backward compatibility. Jean does not like this idea:

The wireless extension has remained backward compatible over almost 8 years, while tremendously improving and adding new features. And I believe that moving forward, the price of keeping backward compatibility is small, as you can see from my patch.

It's possible. It's not difficult. Breaking backward compatibility is not a design goal.

Jean proposes, instead, to create a wrapper layer around the existing interface, thus avoiding breaking any out-of-tree drivers. Jeff, however, would rather get rid of the old interface entirely, since he sees it as dangerous.

We want to design driver interfaces that make it tough for the driver writer to screw up. Excluding yourself, myself, and others on this list, I think we all know that driver writers can't code their way out of a paper bag. A properly designed interface lets the compiler flag incorrect code at the first possible opportunity.

The other relevant point is that Jeff, like most kernel developers, does not see backward compatibility of internal interfaces as an important goal. Interfaces need to be able to change, and the developers can't be held back by the prospect of breaking out-of-tree drivers. As a result, the wireless extensions changes are quite likely to happen - though, perhaps, not until 2.7.

Debugging kernel modules

Linus is famously against the use of interactive debuggers on the kernel, but many developers use them anyway. Debugging a running kernel is a little harder than working with a typical application, but it can be done in a couple of ways. It is relatively easy to query kernel data structures in the current running kernel by running gdb with /proc/kcore as the "core" file. More extensive debugging, allowing the use of breakpoints and such, can be done by using gdb on a remote machine and controlling the target via a serial line or a network interface. The -mm tree contains the necessary patches for using gdb in this mode for a few architectures.

One limitation with using gdb this way is that it does not work well with loadable modules. The debugger can query the memory used by loadable modules, set breakpoints there, etc. The problem is that it does not know what addresses get assigned to functions and variables when a module is loaded. Those addresses, obviously, are not in the core kernel executable, and there is no real way to find them at run time. The developer can thus work by typing in hex addresses directly, but that gets tiresome fairly quickly.

Your editor was recently finishing out the debugging chapter for Linux Device Drivers, Third Edition (which is getting closer to ready - honest) when he ran up against the loadable module problem. The kernel knows where all of the symbols go when it loads a module; it really seemed like it should be possible to communicate that information to a debugger. A bit of digging revealed that, in fact, the relevant information gets dropped once the module gets loaded. So it was time for a fix.

Like any other ELF executable, a loadable module is divided up into several sections. The section called .text contains (most of) the module code itself; .data and .bss contain most of the variables. The module loader looks at all of the sections and lays them out sequentially in (vmalloc) memory; after relocating symbols it forgets about where the sections went. If the positions of the sections could be recovered, however, they could be passed to gdb in the same add-symbol-file command which tells the debugger about the module code. The section offsets are all that gdb needs to figure out where the module's variables live.

Your editor, rather than tell LDD3 readers that symbolic debugging of kernel modules was impossible, chose to do a little hacking. The result was this patch, which hangs a new kobject onto each loadable module and populates it with a set of attributes containing the section offsets. Those attributes will show up under /sys/module. Thus, for example, after module foo is loaded, /sys/module/foo/sections/.data will contain the starting address of the .data section. The foo developer can then fire up gdb and, after connecting to the target kernel, use the section offset information to issue a command like:

    # The first address is that of the .text section
    add-symbol-file /path/to/module 0xd081d000 \
            -s .data 0xd08232c0 \
            -s .bss  0xd0823e20

Thereafter, debugging the module is just like debugging the rest of the kernel. There is a little script (included with the patch) which generates the add-symbol-file command, reducing the operation to a simple cut-and-paste.
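
The script shipped with the patch is not reproduced here. As a rough illustration of the same idea, the sketch below assembles the command from three section addresses which have already been obtained; format_add_symbol_file() is a hypothetical helper, and how the addresses are read from /sys/module is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write a one-line add-symbol-file command into
 * buf. Returns the command length, or -1 if the module path is empty
 * or the command does not fit in len bytes. */
int format_add_symbol_file(char *buf, size_t len, const char *module_path,
                           unsigned long text, unsigned long data,
                           unsigned long bss)
{
    int n;

    if (module_path == NULL || strlen(module_path) == 0)
        return -1;
    n = snprintf(buf, len,
                 "add-symbol-file %s 0x%lx -s .data 0x%lx -s .bss 0x%lx",
                 module_path, text, data, bss);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

Called with the addresses from the example above, the helper produces the same command on a single line, ready to be pasted into gdb.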

The patch has been merged into Linus's BitKeeper tree, and will be part of 2.6.8.

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds