Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.15-rc5, released by Linus on December 3. It consists mostly of fixes, but also includes some changes for drivers which map memory into user space (see below). The long-format changelog has the details.

2.6.15-rc4 was released on November 30; details in the long-format changelog.

The current -mm tree is 2.6.15-rc5-mm1. Recent changes to -mm include some memory management tweaks, a special test which taints the kernel when ndiswrapper or driverloader is loaded, a new set of ktimer patches, and various architecture updates.

Kernel development news

Linux in a binary world... a doomsday scenario

Arjan van de Ven has contributed to the debate on proprietary kernel modules by putting together a scenario based on one crucial event: "On December 6th, 2005 the kernel developers en mass decide that binary modules are legally fine and also essential for the progress of linux, and are as such a desirable thing."

Xen 3.0 released

Version 3.0 of the Xen hypervisor - a virtualization system - has been released. Xen 3.0 includes support for Intel's hardware virtualization mechanism, SMP guest systems (with hot-pluggable virtual CPUs), large memory support, trusted platform module support, ports to the ia-64 and (soon) PowerPC architectures, and more.

The first stable OpenVZ release

The OpenVZ project has announced its existence and its first stable release. OpenVZ is yet another virtualization approach for Linux, based on SWsoft's "Virtuozzo" product. The OpenVZ approach differs from others, however, in that it creates its virtualized environments within a single kernel; the result, it is claimed, is better performance. Unfortunately, the released patch is for the ancient 2.6.8 kernel.

The evolution of driver page remapping

Two weeks ago, this page looked at the new VM_UNPAGED flag, introduced in 2.6.15-rc2 to mark virtual memory areas (VMAs) which are not made up of "normal" pages. These areas are usually created by device drivers which map special memory areas (which may or may not be device I/O memory) into user space. Your editor now humbly suggests that readers ignore that article; things have changed significantly since then.

As it turns out, Linus didn't like the VM_UNPAGED idea, so he rewrote the code for 2.6.15-rc4. The VM_UNPAGED VMA flag is gone, replaced by VM_PFNMAP. The new flag has a very similar meaning: it marks the VMA as containing special page table entries which should not be touched by the VM subsystem. In particular, it states that there is no page structure associated with any page in that VMA, so the VM subsystem should not go looking for one. Even in cases where that structure does exist (such as remappings of real memory), the VM code will pretend that it does not.

The advantage of the reworked code is that it takes out a number of special cases; the VM_PFNMAP VMAs can be treated just like normal VMAs in more places. Things quickly got a bit more complicated, however. The initial VM_PFNMAP code assumed that a linear range of addresses was being mapped into user space. In fact, some drivers piece together memory in more complicated ways.
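To make "linear" concrete: in a linear mapping, the page frame behind any user-space address follows from that address's offset into the VMA and nothing else. What follows is a user-space model rather than kernel code; struct model_vma, linear_pfn(), and is_linear() are hypothetical names invented for illustration, and a 4KB page size is assumed.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12  /* assumption: 4KB pages */

/* Hypothetical stand-in for the few VMA fields that matter here. */
struct model_vma {
    unsigned long vm_start;   /* first user-space address of the area */
    unsigned long vm_end;     /* one past the last address */
    unsigned long first_pfn;  /* page frame mapped at vm_start */
};

/* In a linear mapping, the frame for addr is fixed by its offset alone. */
unsigned long linear_pfn(const struct model_vma *vma, unsigned long addr)
{
    return vma->first_pfn + ((addr - vma->vm_start) >> MODEL_PAGE_SHIFT);
}

/*
 * Check whether a driver's per-page frame list describes a linear mapping.
 * pfns[i] is the frame the driver wants at vm_start plus i pages.
 */
bool is_linear(const struct model_vma *vma, const unsigned long *pfns, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned long addr =
            vma->vm_start + ((unsigned long)i << MODEL_PAGE_SHIFT);
        if (pfns[i] != linear_pfn(vma, addr))
            return false;
    }
    return true;
}
```

A driver which assembles its buffer from scattered pages would fail the is_linear() test in this model; that is the case the initial code did not anticipate.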

So a subsequent patch added explicit support for "incomplete" VMAs, marked with yet another flag: VM_INCOMPLETE. When the kernel detects that a driver is creating something other than a straightforward, linear mapping, it sets that flag and emits a warning. It also requires, in this case, that the pages being remapped carry the PG_reserved flag - even though this flag is being phased out. Remapping RAM in this way always required that flag in the past, so this requirement is not a change as far as drivers are concerned.

The patch adding VM_INCOMPLETE notes that "In the long run we almost certainly want to export a totally different interface for that, though." In this case, "in the long run" meant about one day, when yet another patch was merged adding a new function:

    int vm_insert_page(struct vm_area_struct *vma, 
                       unsigned long address,
                       struct page *page);

This function inserts the given page into vma, mapped at the given address. It does not put out warnings, and does not require that PG_reserved be set. What it does require is that the page be an order-zero allocation obtained for this purpose; it is not possible to remap arbitrary RAM pages with vm_insert_page(). Since a page structure is required, the new function is also unsuitable for remapping I/O memory. But it is useful for drivers which wish to map a set of pages into a user-space address range.
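A driver using this function would presumably walk its set of pages, inserting each at the next page-sized step through the VMA. The following is a user-space sketch of that loop, not kernel code: every sketch_ name is a hypothetical stand-in, the stub only models the rules described above (address inside the VMA, order-zero page), and the specific error codes it returns are assumptions.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumption: 4KB pages */
#define SKETCH_MAX_PAGES 16      /* size of the modeled page table */

/* Hypothetical stand-ins for the kernel's struct page and VMA. */
struct sketch_page { int order; };  /* allocation order; 0 = single page */
struct sketch_vma {
    unsigned long vm_start, vm_end;
    const struct sketch_page *slots[SKETCH_MAX_PAGES];  /* models page table */
};

/* Stub modeling the rules given above; error codes are assumptions. */
int sketch_vm_insert_page(struct sketch_vma *vma, unsigned long address,
                          const struct sketch_page *page)
{
    unsigned long index;

    if (address < vma->vm_start || address >= vma->vm_end)
        return -EFAULT;              /* address lies outside the VMA */
    if (page == NULL || page->order != 0)
        return -EINVAL;              /* only order-zero pages are accepted */
    index = (address - vma->vm_start) / SKETCH_PAGE_SIZE;
    if (index >= SKETCH_MAX_PAGES)
        return -EFAULT;              /* beyond what this model can hold */
    vma->slots[index] = page;
    return 0;
}

/* Driver-side loop: map each page of a (possibly scattered) set in turn. */
int sketch_map_pages(struct sketch_vma *vma,
                     struct sketch_page *const *pages, size_t n)
{
    unsigned long addr = vma->vm_start;

    for (size_t i = 0; i < n; i++, addr += SKETCH_PAGE_SIZE) {
        int err = sketch_vm_insert_page(vma, addr, pages[i]);
        if (err)
            return err;              /* stop at the first rejected page */
    }
    return 0;
}
```

Since each page is inserted individually, nothing in the loop depends on the pages being physically contiguous.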

Just which driver might want to do something like that became clear when another patch was merged for 2.6.15-rc5. It removed the GPL-only export for vm_insert_page() and included this commit message:

Make vm_insert_page() available to NVidia module. It used to use remap_pfn_range(), which wasn't GPL-only either, and the new interface is actually simpler and does more checking, so we shouldn't unnecessarily discourage people from switching over.

Some developers objected to this change, seeing it as an explicit endorsement of the proprietary NVidia drivers. Others, however, saw it as a simple attempt to avoid breaking drivers without a good reason. The kernel developers may well be working toward taking a stronger stand against proprietary modules, but this particular interface will not be the place where that battle is fought.

bcm43xx and the 802.11 stack

The Broadcom 43xx family is yet another wireless network chipset without free driver support. There is, however, a proprietary Linux driver available; for example, the Linksys WRT54G router has a Broadcom module. A reverse engineering team has been busily looking at that driver with the idea of writing a document describing how this chipset works; the resulting free bcm43xx specification is in a reasonably complete state.

Independently, the bcm43xx driver team has been writing a driver from this specification. The authors have never worked with the original, proprietary driver, so they should be unable to infringe any copyrights which cover that driver. This project has been moving along quietly for a while, but the quiet period is over: the free bcm43xx driver is now working. It is not for the faint of heart at this point, but it is able to transmit and receive packets. Adventurous souls with suitable hardware are encouraged to start testing the new driver.

While almost everybody is happy to see a free driver for this hardware, there have been some complaints about it. In particular, some developers are unhappy about the "softmac" layer used by the bcm43xx driver. This layer handles many media access tasks - scanning, management frames, etc. - for the driver. This functionality is not currently a part of the Linux 802.11 stack because the chipset for which that stack was initially developed - Intel's ipw chips - performs those tasks in hardware. Most other chipsets rely on the host for this functionality, so some sort of "software MAC" must be provided.

The problem is not that there is no softmac implementation for Linux; instead, there are too many of them. The softmac layer used by the bcm43xx driver, which is meant to integrate with the current kernel 802.11 stack, is one. The MadWifi project includes its own 802.11 stack, including a software MAC implementation. There is also a complete 802.11 stack from Devicescape available. Both the MadWifi and Devicescape stacks are said - by their supporters - to be more capable than the in-kernel stack, with or without the softmac layer. So why, they ask, should yet another software MAC be written using the in-tree 802.11 stack when better alternatives exist?

Your editor will not attempt to draw any conclusions about which implementation is the best. The simple fact, however, is that the in-tree 802.11 code is what developers have to work with now. Efforts to work with and improve that code will be better received by the networking maintainers than pointing at out-of-tree parallel implementations. So the softmac code used by the bcm43xx driver would appear to have an advantage going forward: it builds on the existing, in-tree code, and makes new capabilities available for all drivers.

Meanwhile, those who are interested in playing with the bcm43xx driver may want to avail themselves of the daily snapshots posted by the project.

Memory copies in hardware

Upcoming versions of Intel processors will include a feature called an "asynchronous DMA engine." Essentially, it is a hardware peripheral which can be used to quickly copy data from one memory location to another. The "I/OAT" ("I/O acceleration technology") is expected to improve performance by offloading copy operations, enabling quick in-memory scatter/gather operations, and keeping copy operations from pushing useful data out of the processor's cache.

Hardware with an I/OAT is not yet available, but a patch for I/OAT support has recently been posted. It lacks the hardware-level interface, but does demonstrate the API that the folks at Intel have come up with for this sort of device.

Code which wishes to make use of the I/OAT must first register itself as a "DMA client." The registration interface looks like:

    #include <linux/dmaengine.h>

    typedef void (*dma_event_callback)(struct dma_client *client,
                                       struct dma_chan *chan,
                                       enum dma_event_t event);

    struct dma_client *dma_async_client_register(dma_event_callback event_callback);
    void dma_async_client_unregister(struct dma_client *client);

The client must provide a callback function which will be invoked when DMA channels come and go. If all goes well, registration results in a dma_client structure which can be used with subsequent operations.

Before anything can be done, the client must request one or more "channels." Every channel on the I/OAT can be used for one copy operation at a time; all channels can be operating simultaneously. The function to request channels is:

    dma_async_client_chan_request(struct dma_client *client, 
                                  unsigned int number);

The client's callback function will be called once for each allocated channel. The number of channels actually allocated may be less than what has been requested. There is no real guidance on the optimal number of channels to ask for; the example patch for the networking subsystem requests one channel for each processor on the system. The number of channels can be changed later on if need be.

There are three functions for actually starting a copy operation:

    dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
                                             void *dest, void *src,
                                             size_t len);
    dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
                                            struct page *page,
                                            unsigned int offset,
                                            void *kdata, size_t len);
    dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
                                           struct page *dest_pg,
                                           unsigned int dest_off,
                                           struct page *src_pg,
                                           unsigned int src_off,
                                           size_t len);

All three functions do the same thing: they request an asynchronous copy operation from one memory location to another. The only difference is whether kernel addresses or page structures are used to specify the locations. For some reason, it appears to be necessary to issue a call to:

    void dma_async_memcpy_issue_pending(struct dma_chan *chan);

before the operation will actually happen.

Since copy operations are asynchronous, they may not have completed when the request functions return, so the caller should not mess with the affected buffers in the meantime. There are two functions for querying and waiting for completion:

    dma_async_memcpy_complete(struct dma_chan *chan, dma_cookie_t cookie,
                              dma_cookie_t *last, dma_cookie_t *used);
    dma_async_wait_for_completion(struct dma_chan *chan, 
                                  dma_cookie_t cookie);

dma_async_memcpy_complete() will return one of DMA_SUCCESS, DMA_IN_PROGRESS, or DMA_ERROR, depending on the status of the copy operation indicated by cookie (the last and used arguments can be passed as NULL; their purpose is not entirely clear to your slow editor). A call to dma_async_wait_for_completion() will wait until the given operation finishes. In the current implementation, that wait is accomplished via a busy loop calling schedule(). There is no function for canceling an outstanding operation.
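The submit, issue, and poll sequence can be illustrated with a small user-space model. This is not the Intel API: all of the model_ names are hypothetical, the "hardware" is an ordinary memcpy() which runs synchronously when the queue is kicked, and the notion that requests simply sit in a queue until issued is an assumption, since the patch does not explain why the extra call is needed.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_QUEUE_LEN 8

typedef int model_cookie_t;
enum model_status { MODEL_DMA_SUCCESS, MODEL_DMA_IN_PROGRESS, MODEL_DMA_ERROR };

struct model_copy { void *dest; const void *src; size_t len; };

/* Hypothetical channel: copies are queued, then run when issued. */
struct model_chan {
    struct model_copy queue[MODEL_QUEUE_LEN];
    model_cookie_t submitted;   /* cookie of the newest queued copy */
    model_cookie_t completed;   /* cookie of the newest finished copy */
};

/* Queue a copy and hand back a cookie naming it; no data moves yet. */
model_cookie_t model_memcpy_buf_to_buf(struct model_chan *chan, void *dest,
                                       const void *src, size_t len)
{
    if (chan->submitted - chan->completed >= MODEL_QUEUE_LEN)
        return -1;                          /* queue is full */
    chan->submitted++;
    chan->queue[chan->submitted % MODEL_QUEUE_LEN] =
        (struct model_copy){ dest, src, len };
    return chan->submitted;
}

/* Kick the "hardware": here, simply perform every queued copy in order. */
void model_issue_pending(struct model_chan *chan)
{
    while (chan->completed < chan->submitted) {
        const struct model_copy *c =
            &chan->queue[(chan->completed + 1) % MODEL_QUEUE_LEN];
        memcpy(c->dest, c->src, c->len);
        chan->completed++;
    }
}

/* Report the state of the copy named by cookie. */
enum model_status model_memcpy_complete(const struct model_chan *chan,
                                        model_cookie_t cookie)
{
    if (cookie <= 0 || cookie > chan->submitted)
        return MODEL_DMA_ERROR;             /* no such request */
    return cookie <= chan->completed ? MODEL_DMA_SUCCESS
                                     : MODEL_DMA_IN_PROGRESS;
}
```

In this model a caller which polls before kicking the queue sees MODEL_DMA_IN_PROGRESS indefinitely, and the destination buffer stays untouched until model_issue_pending() runs. Because cookies are handed out in increasing order, one comparison against the most recently completed cookie is enough to answer a status query.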

The initial reaction to the patch was cautiously positive. There is some concern that invoking an external device to perform copies may be sufficiently expensive that it will only be worthwhile for very large operations. There were also some requests to extend the interface to include a transformation to be performed on the data as it is copied. The current hardware does not look like it will support anything beyond a direct copy (though, since the hardware is not yet available, it is hard to be sure), but it would be nice to be able to make use of any such capabilities as they arrive. Transformations could be simple (zeroing a buffer, say) or complex (cryptographic operations). But they will only be available if the interface supports them.

The hardware is due in "early 2006," so more information will become available then. Until that time, there probably will not be any serious discussion of merging the I/OAT interface.

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds