Brief items
The current 2.6 kernel is 2.6.7; the first 2.6.8 prepatch has not
yet been released as of this writing. There is, however, a large pile of
patches in Linus's BitKeeper tree, including support for new Apple
PowerBooks, more sparse annotations, some netfilter improvements, some
kbuild work, a new wait_event_interruptible_exclusive() macro, support for
the O_NOATIME flag in the open() call, sysfs
knobs for tuning the CFQ I/O scheduler, mirroring and snapshot targets for
the device mapper, the removal of the PC9800 subarchitecture, reiserfs
data=journal support, preemptible kernel support for the PPC64
architecture, and many fixes and updates.
The current prepatch from Andrew Morton is 2.6.7-mm1; recent additions to -mm include a
new knob for controlling how aggressively the system reclaims VFS caches
when memory gets tight, a memory allocation tweak to improve DMA segment
merging (see below), and various fixes.
The current 2.4 prepatch is 2.4.27-rc1, which was released by Marcelo on June 19. Only a
small number of fixes have gone in since the last prepatch. Now is the time
for those interested in a stable 2.4.27 release to do some testing.
Kernel development news
The generic DMA layer provides a way for
device drivers to allocate and work with direct memory access regions
without regard for how the underlying hardware does things. This interface
works well, for the most part, but, as with the rest of the kernel,
occasional issues come up. Here are a few that were discussed over the last
week.
Many devices can perform full 64-bit DMA operations. This capability is
nice on large-memory systems, but working with larger addresses can also
bring a performance penalty. As a way of helping drivers pick the optimal
size for DMA address descriptors, James Bottomley has proposed the creation of a new function called
dma_get_required_mask().
The current API already has dma_set_mask(), which tells the kernel
about the range of DMA addresses the device can access. The new function
would be called after an invocation of dma_set_mask(); it would
return a new bitmask describing what the platform sees as the optimal set
of DMA addresses, taking the device's original DMA mask into account. If
the specific hardware situation does not require the
use of larger addresses, the platform can suggest using the faster, 32-bit
mode even when the device can handle larger addresses. The driver can then
use that advice to set a new mask describing what it will actually use.
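The negotiation can be pictured with a small user-space model. This is only a sketch under stated assumptions: every sketch_ name is invented for illustration, the proposed dma_get_required_mask() may well end up with a different signature, and the "platform advice" here is taken to be nothing more than the smallest mask covering the highest RAM address.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest mask of the form 2^n - 1 that covers addr. */
static uint64_t sketch_mask_covering(uint64_t addr)
{
    uint64_t mask = 0;

    while (mask < addr)
        mask = (mask << 1) | 1;
    return mask;
}

/*
 * Hypothetical stand-in for the platform side of the proposal: given the
 * highest RAM address and the mask the device declared with
 * dma_set_mask(), suggest the mask the driver actually needs.
 */
static uint64_t sketch_required_mask(uint64_t highest_ram_addr,
                                     uint64_t device_mask)
{
    assert(device_mask != 0);   /* a device with an empty mask cannot do DMA */
    return sketch_mask_covering(highest_ram_addr) & device_mask;
}

/* Driver side: pay for 64-bit descriptors only when the advice demands it. */
static int sketch_needs_64bit_descriptors(uint64_t required_mask)
{
    return required_mask > UINT32_MAX;
}
```

In this model, a 64-bit-capable device on a machine with 1GB of RAM is told that 32-bit descriptors suffice, while the same device on a machine with memory above the 4GB line is told to use the larger ones.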
The "scatterlist" mechanism is another part of the DMA subsystem; it allows
drivers to set up scatter/gather I/O, where the buffer to be transferred is
split into multiple, distinct chunks of memory. Scatter/gather is useful
in a number of situations, including network packets (which are assembled
from multiple chunks), the readv() and writev() system
calls, and for I/O directly to or from user-space buffers, which can be
spread out in physical memory. The mapping functions for scatter/gather
I/O will coalesce pieces of the buffer which turn out to be physically
adjacent in memory. In practice, that has turned out not to happen very
often; one recent report showed that, out of
approximately 32,000 segments, all of 40 had been merged in this manner.
It turns out, however, that the Linux memory allocator is not helping the
situation. When the allocator breaks up a large block of pages to satisfy
smaller requests (a frequent occurrence), it returns the highest page in
the block. A series of allocations will, thus, obtain pages in descending
order. If those pages are assembled into an I/O buffer, each page will
need to be a separate segment in a scatter/gather operation, since the
reverse-allocated pages cannot be merged.
William Lee Irwin put together a patch which
causes the allocator to hand out pages from the bottom of a block instead
of the top. With this patch applied, the merge rate in this particular test
went up to over
55%. Larger segments lead to faster I/O setup and execution, which is a
good thing. Sometimes a tiny patch can make a big difference, once you
know where the problem is.
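The effect of allocation order on merging is easy to see in a toy model. The function below is a simplified sketch, not the kernel's scatterlist code: it assumes that two pages can share a segment only when the second immediately follows the first in physical memory, and it ignores segment size limits and IOMMU remapping. The name sketch_count_segments() is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the segments needed for a buffer assembled from the given page
 * frame numbers, listed in buffer order. A page joins the previous
 * segment only when it sits directly after the previous page in
 * physical memory.
 */
static size_t sketch_count_segments(const unsigned long *pfns, size_t n)
{
    size_t segments = 0;
    size_t i;

    assert(pfns != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (i == 0 || pfns[i] != pfns[i - 1] + 1)
            segments++;
    }
    return segments;
}
```

Four pages handed out in ascending order (100, 101, 102, 103) collapse into a single segment; the same four pages handed out in descending order need four.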
Meanwhile, Ian Molton turned up a different
sort of problem. Some types of interfaces have their own onboard memory.
This memory is, often, accessible to the CPU, and it can be used by devices
attached to the interface for DMA operations. But that memory is not part
of the regular system RAM, and it typically does not show up in the system
memory map. As a result, the generic DMA functions will not make use of
this memory when allocating DMA buffers.
It would be nice to be able to make use of this memory, however. It is
there, and it can be used to offload some DMA buffers from main memory. On
some systems, it may be the only memory which is usable for DMA operations
to certain devices. The DMA API has even been set up with this sort of
memory in mind; it can handle cases where, for example, the memory in
question has a different address from the device's point of view than it
does for the processor. It would seem that the addition of an
architecture-specific module to the DMA API could enable such memory to be
allocated on platforms which have it, when the DMA target is a device
which can make use of it.
The biggest problem would appear to be that this sort of remote memory is
not part of the system's memory map, and, thus, there is no struct
page structure which describes it. The lack of a page
structure makes certain macros fail. It also completely breaks any driver
which tries to map the buffer into user space via the nopage() VMA
operation. And, it turns out, drivers really do that; the ALSA subsystem,
for example, maps buffers to user space in this manner.
Once a problem is identified, it can usually be fixed. The right approach in this
case would appear to be a combination of two things. The first is to
simply fix any bad assumptions in drivers with regard to how they can treat
DMA buffers. If the driver expects that a page structure exists
for a DMA buffer, it is broken and simply needs to be fixed. The second
part is to provide an architecture-independent
way for device drivers to map DMA buffers into user space.
To that end, Russell King has proposed yet
another DMA API function:
    int dma_map_coherent(struct device *dev,
                         struct vm_area_struct *vma,
                         void *cpu_addr,
                         dma_addr_t handle,
                         size_t size);
This function would take the given mapped DMA buffer (as described by
cpu_addr and handle) and map it into the requested VMA.
Device drivers could use this function to make a buffer available to user
space, and would be able to discard their existing nopage()
methods. The new interface would thus simplify things, though it does
still leave a reference counting problem on the driver side of things:
freeing the DMA buffer before user space has unmapped it would be a big
mistake.
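One common way to handle that lifetime problem is to count references to the buffer and free it only when the last one goes away. What follows is a user-space sketch of the idea, not anything taken from the proposal: the sketch_ names are hypothetical, malloc() stands in for the DMA allocator, and a plain int stands in for the atomic counter that real kernel code would need.

```c
#include <assert.h>
#include <stdlib.h>

/* Number of buffers actually released; only here to make the sketch observable. */
static int sketch_buffers_freed;

struct sketch_dma_buffer {
    int refs;        /* one for the driver, plus one per user-space mapping */
    void *cpu_addr;  /* stands in for the coherent DMA memory */
};

static struct sketch_dma_buffer *sketch_buffer_alloc(size_t size)
{
    struct sketch_dma_buffer *buf = malloc(sizeof(*buf));

    if (!buf)
        return NULL;
    buf->cpu_addr = malloc(size);
    if (!buf->cpu_addr) {
        free(buf);
        return NULL;
    }
    buf->refs = 1;   /* the driver's own reference */
    return buf;
}

/* Called when the buffer is mapped into a process. */
static void sketch_buffer_get(struct sketch_dma_buffer *buf)
{
    assert(buf->refs > 0);
    buf->refs++;
}

/* Called on unmap and on driver release; returns 1 if the buffer was freed. */
static int sketch_buffer_put(struct sketch_dma_buffer *buf)
{
    assert(buf->refs > 0);
    if (--buf->refs > 0)
        return 0;
    free(buf->cpu_addr);
    free(buf);
    sketch_buffers_freed++;
    return 1;
}
```

With this arrangement, a driver which drops its reference while a mapping still exists does not free the memory; the final unmap does.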
The build process in recent 2.6 kernels allows for the separation of source
and object trees. If a kernel build is started with the O= option, the
resulting object files (and other built files) will go into the
directory specified, rather than being mixed in with the source. Some
developers find this way of doing things easier to manage, especially if
the same source tree is being used to build kernels for multiple
architectures or with multiple sets of configuration options.
One distributor (SUSE) has begun shipping kernels which have been built in
this manner. The difference has gone unnoticed by almost all users, but
one vendor of proprietary modules recently posted a
strong message accusing SUSE of forking the kernel. The specific
issue is that this vendor's modules would no longer build with SUSE's
kernels, and that problem turned out to be a result of the separated source
and object trees.
When a kernel's modules are installed under /lib, a symbolic link
called build is made pointing to the source tree. This link is
used by the external module build process to find kernel headers,
configuration files, and needed object files. When SUSE adopted the
separate object directory, it redirected the build link to point
to that directory, rather than to the original source. That is, after all,
where many of the necessary files will be found. Unfortunately for this particular
vendor, their modules needed some other files which are only found in the
source tree. When the build link was directed elsewhere, those
modules would no longer compile.
The fix was relatively straightforward, but this situation forced a new
discussion on how the build system should work when separate object
directories are in use. The result is a new
patch from Sam Ravnborg which nails down how these links should work.
With this patch (not merged as of this writing), the build link
would always point to the object directory. Doing things this way allows
most external modules to continue to build without changes. A new link
(source) will be added to point to the source directory when
needed. And a small, special-purpose makefile is placed in the object
directory; its job is to bridge the gap between the two trees and make
most external module builds work with no changes required.
Two weeks ago this page covered the launch
of a new wireless networking effort. The scope of this effort now seems to
be expanding to a redesign of the "wireless extensions" portion of the
network stack. This code handles wireless network interfaces, and, in
particular, provides a set of functions to user space for the control of
those interfaces. Scott Feldman has posted an
initial set of objectives for a wireless extensions rework.
Much of what is being proposed is uncontroversial. There has been some
disagreement, however, over proposed changes to the "iw_handler"
interface. This interface is the mechanism by which wireless adapter
drivers respond to ioctl() calls from user space. Each driver
registers a set of functions, one for each of the command codes supported
by the wireless extensions. The mechanism used is different from what is
seen in other parts of the kernel, however; a wireless interface driver
fills in a simple array of function pointers and passes that to the core.
The array is indexed by the ioctl() command code, and the proper
function is called.
The problem with this interface is that it defeats the compiler's normal
type checking. All wireless extension handler functions must have the same
prototype, and there is no real way to tell if the right one is being
called. As a way of improving the code base, Jeff Garzik would like to
replace the iw_handler array with a structure full of specific,
named function pointers - the same mechanism which is used in the rest of the kernel.
Initially, all of these functions would keep the current
iw_handler prototype, but, over time, each function would be
migrated over to taking exactly the arguments it needs.
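The difference between the two styles can be sketched in a few lines of C. Nothing here is the real wireless extensions code: the sketch_ types, the two commands, and the simplified prototypes are all invented for illustration, and the actual iw_handler prototype takes different arguments.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_radio { int freq_mhz; };

/* Style 1: every handler shares one generic prototype; a table is indexed by command. */
enum sketch_cmd { SKETCH_SET_FREQ, SKETCH_GET_FREQ, SKETCH_CMD_COUNT };
typedef int (*sketch_handler)(void *dev, void *arg);

static int sketch_set_freq_generic(void *dev, void *arg)
{
    ((struct sketch_radio *)dev)->freq_mhz = *(int *)arg;
    return 0;
}

static int sketch_get_freq_generic(void *dev, void *arg)
{
    *(int *)arg = ((struct sketch_radio *)dev)->freq_mhz;
    return 0;
}

/* Swapping these two entries would compile cleanly and misbehave at run time. */
static const sketch_handler sketch_table[SKETCH_CMD_COUNT] = {
    [SKETCH_SET_FREQ] = sketch_set_freq_generic,
    [SKETCH_GET_FREQ] = sketch_get_freq_generic,
};

static int sketch_dispatch(struct sketch_radio *dev, unsigned int cmd, void *arg)
{
    assert(dev != NULL);
    if (cmd >= SKETCH_CMD_COUNT || !sketch_table[cmd])
        return -1;
    return sketch_table[cmd](dev, arg);
}

/* Style 2: named members, each with exactly the arguments it needs. */
struct sketch_radio_ops {
    int (*set_freq)(struct sketch_radio *dev, int mhz);
    int (*get_freq)(const struct sketch_radio *dev, int *mhz);
};

static int sketch_set_freq(struct sketch_radio *dev, int mhz)
{
    dev->freq_mhz = mhz;
    return 0;
}

static int sketch_get_freq(const struct sketch_radio *dev, int *mhz)
{
    *mhz = dev->freq_mhz;
    return 0;
}

/* Assigning sketch_get_freq to .set_freq here would draw a compiler diagnostic. */
static const struct sketch_radio_ops sketch_ops = {
    .set_freq = sketch_set_freq,
    .get_freq = sketch_get_freq,
};
```

In the first style the compiler sees only void pointers and cannot tell one handler from another; in the second, a handler placed in the wrong slot or called with the wrong arguments is flagged at build time.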
Nobody disputes that the new interface would be cleaner. Jean Tourrilhes,
the designer of the wireless extensions, has an objection, however:
changing this interface would break backward compatibility. Jean does not like this idea:
    The wireless extension has remained backward compatible over almost
    8 years, while tremendously improving and adding new features. And
    I believe that moving forward, the price of keeping backward
    compatibility is small, as you can see from my patch.

    It's possible. It's not difficult. Breaking backward
    compatibility is not a design goal.
Jean proposes, instead, to create a wrapper layer around the existing
interface, thus avoiding breaking any out-of-tree drivers. Jeff, however,
would rather get rid of the old interface
entirely, since he sees it as dangerous:
    We want to design driver interfaces that make it tough for the
    driver writer to screw up. Excluding yourself, myself, and others
    on this list, I think we all know that driver writers can't code
    their way out of a paper bag. A properly designed interface lets
    the compiler flag incorrect code at the first possible opportunity.
The other relevant point is that Jeff, like most kernel developers, does
not see backward compatibility of internal interfaces as an important
goal. Interfaces need to be able to change, and the developers can't be
held back by the prospect of breaking out-of-tree drivers. As a result,
the wireless extensions changes are quite likely to happen - though,
perhaps, not until 2.7.
Linus is famously against the use of interactive debuggers on the kernel,
but many developers use them anyway. Debugging a running kernel is a
little harder than working with a typical application, but it can be done
in a couple of ways. It is relatively easy to query kernel data
structures in the current running kernel by running gdb with
/proc/kcore as the "core" file. More extensive debugging,
allowing the use of breakpoints and such, can be done by using gdb
on a remote machine and controlling the target via a serial line or a
network interface. The -mm tree contains the necessary patches for using
gdb in this mode for a few architectures.
One limitation of using gdb this way is that it cannot work symbolically
with loadable modules. The debugger can still query the memory used by
loadable modules, set breakpoints there, etc.; the problem is that it does
not know what addresses get assigned to functions and variables when a
module is loaded. Those addresses, obviously, are not in the core kernel
executable, and there is no real way to find them at run time. The
developer can thus work by typing in hex addresses directly, but that gets
tiresome fairly quickly.
Your editor was recently finishing out the debugging chapter for Linux
Device Drivers, Third Edition (which is getting closer to ready -
honest) when he ran up against the loadable module problem. The kernel
knows where all of the symbols go when it loads a module; it really seemed
like it should be possible to communicate that information to a debugger.
A bit of digging revealed that, in fact, the relevant information gets
dropped once the module gets loaded. So it was time for a fix.
Like any other ELF executable, a loadable module is divided up into several
sections. The section called .text contains (most of) the module
code itself; .data and .bss contain most of the
variables. The module loader looks at all of the sections and lays them
out sequentially in (vmalloc) memory; after relocating symbols it forgets
about where the sections went.
If the positions of the sections could be recovered, however, they could be
passed to gdb in the same add-symbol-file command which
tells the debugger about the module code. The section offsets are all that
gdb needs to figure out where the module's variables live.
Your editor, rather than tell LDD3 readers that symbolic debugging of
kernel modules was impossible, chose to do a little hacking. The result
was this patch, which hangs a new kobject
onto each loadable module and populates it with a set of attributes
containing the section offsets. Those attributes will show up under
/sys/module. Thus, for example, after module foo is
loaded, /sys/module/foo/sections/.data will contain the beginning of
the .data section. The foo developer can then fire up
gdb and, after connecting to the target kernel, use the section
offset information to issue a command like the following, where the first
address is the start of the .text section:
    add-symbol-file /path/to/module 0xd081d000 \
        -s .data 0xd08232c0 \
        -s .bss 0xd0823e20
Thereafter, debugging the module is just like debugging the rest of the
kernel. There is a little script (included with the patch) which generates
the add-symbol-file command, reducing the operation to a simple
cut-and-paste.
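The script shipped with the patch is not reproduced here, but the job it does is simple enough to sketch. The helper below is a hypothetical illustration, not part of the patch: it assumes the three section addresses have already been read from the files under /sys/module/foo/sections/ and just assembles the gdb command on one line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build an add-symbol-file command from a module path and the load
 * addresses of its .text, .data and .bss sections. Returns 0 on
 * success, or -1 if the path is unusable or the command does not fit.
 */
static int sketch_format_add_symbol_file(char *out, size_t out_size,
                                         const char *module_path,
                                         unsigned long long text,
                                         unsigned long long data,
                                         unsigned long long bss)
{
    int n;

    assert(out != NULL && module_path != NULL);
    /* A path containing a space would need quoting; this sketch just refuses it. */
    if (strchr(module_path, ' ') != NULL)
        return -1;
    n = snprintf(out, out_size,
                 "add-symbol-file %s 0x%llx -s .data 0x%llx -s .bss 0x%llx",
                 module_path, text, data, bss);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return 0;
}
```

Given the path and addresses from the example above, the helper produces the same command on a single line, ready to paste into gdb.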
The patch has been merged into Linus's BitKeeper tree, and will be part of
2.6.8.
Page editor: Jonathan Corbet