Kernel development

Kernel release status

The current 2.6 prepatch is 2.6.9-rc2, announced by Linus on September 13. There is a lot of new stuff in this release, including some infrastructure for catching illegal use of I/O memory addresses (see below), the NETIF_F_LLTX interface feature flag (discussed in last week's Kernel Page), the removal of the ancient, unused "busmouse" driver, infrastructure for cluster-wide file locking, a number of DRM subsystem cleanups, the out-of-line spinlock patch, AMD dual-core support, more filesystem conversions to the new symbolic link resolution code (which will eventually allow an increase in the maximum link depth), a new waitid() system call implementing the POSIX call by the same name, a "fake NUMA" mode for x86-64 testing, a small-footprint tmpfs implementation, the base KProbes patch, a set of IDE updates, support for scheduler profiling (seeing where context switches come from), automatic TCP window scaling calculation, a kobject change (it uses kref now), a USB gadget interface update with "On The Go" support, a big ALSA update, the removal of the Philips webcam driver, numerous network driver updates, some random number generator fixes, a fix for the audio CD writing memory leak, some VFS interface improvements, executable support in hugetlb mappings, the Whirlpool digest algorithm, some virtual memory tweaks, a number of asynchronous I/O fixes and improvements, a User-mode Linux update, the "flex mmap" user-space memory layout (covered here last June), a number of scheduler tweaks, the removal of the very last suser() call, and lots of fixes. See the long-format changelog for the details.

Linus's BitKeeper repository contains the "string" I/O memory access functions, support for more than eight partitions on BSD-labeled disks, some User-mode Linux cleanups, a tunable "max sectors" limit for block I/O requests (a latency reduction feature), a new prctl() option allowing programs to change their name, some shared memory scalability improvements, and a change in TCP ICMP source quench behavior (such messages are simply ignored now).

The current prepatch from Andrew Morton is 2.6.9-rc1-mm5. Recent additions to -mm include some software suspend improvements, the return of a functioning lockmeter patch, some ext3 reservation improvements, some scheduler tweaks, a completely reworked "completely fair queueing" I/O scheduler, and implementations of atomic_inc_return() for various architectures.

The current 2.4 prepatch is 2.4.28-pre3, which was released by Marcelo on September 11. This patch is mainly "a bunch of scattered fixes"; there is also the Whirlpool digest algorithm, and an XFS update.

Kernel development news

Quotes of the week

What makes you think kernel developers have a deep understanding of the value of connectivity in the OS? They don't. The average kernel developer is not particularly bright.

-- Hans Reiser.

But hey, the fact that I have better taste than anybody else in the universe is just something I have to live with. It's not easy being me.

-- Linus Torvalds.

Announcing the Kernel Page index

We managed to pull together a bit of time to hack on the LWN site code over the last week. The result is the LWN Kernel Page index, which can be used to find LWN's kernel-oriented articles for a given topic. This mechanism will probably be extended to other parts of LWN's content in the future.

As of this writing, all articles published in 2004 have been indexed; earlier articles will be added as time permits. We'll also fix the case-sensitive sorting when we get a chance. Even without that, however, we hope that the new index will be helpful.

A new I/O memory access mechanism

Most reasonably current cards for the PCI bus (and others) provide one or more I/O memory regions to the bus. By accessing those regions, the processor can communicate with the peripheral and make things happen. A look at /proc/iomem will show the I/O memory regions which have been registered on a given system.

To work with an I/O memory region, a driver is supposed to map that region with a call to ioremap(). The return value from ioremap() is a magic cookie which can be passed to a set of accessor functions (with names like readb() or writel()) to actually move data to or from the I/O memory. On some architectures (notably x86), I/O memory is truly mapped into the kernel's memory space, so those accessor functions turn into a straightforward pointer dereference. Other architectures require more complicated operations.

There have been some longstanding problems with this scheme. Drivers written for the x86 architecture have often been known to simply dereference I/O memory addresses directly, rather than using the accessor functions. That approach works on the x86, but breaks on other architectures. Other drivers, knowing that I/O memory addresses are not real pointers, store them in integer variables; that works until they encounter a system with a physical address space which doesn't fit into 32 bits. And, in any case, readb() and friends perform no type checking, and thus fail to catch errors which could be found at compile time.

The 2.6.9 kernel will contain a series of changes designed to improve how the kernel works with I/O memory. The first of these is a new __iomem annotation used to mark pointers to I/O memory. These annotations work much like the __user markers, except that they reference a different address space. As with __user, the __iomem marker serves a documentation role in the kernel code; it is ignored by the compiler. When checking the code with sparse, however, developers will see a whole new set of warnings caused by code which mixes normal pointers with __iomem pointers, or which dereferences those pointers.
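As a rough illustration of how such an annotation can vanish for the compiler while remaining visible to sparse, here is a minimal user-space sketch. The two macro definitions follow the usual sparse idiom (sparse defines __CHECKER__ when it runs); fake_regs, fake_map(), read_reg32(), and second_register() are hypothetical stand-ins invented for this example and are not kernel interfaces.

```c
#include <stdint.h>

/* Under sparse, mark the pointer as belonging to a separate address space
 * and forbid dereferencing it. An ordinary compiler sees empty macros. */
#ifdef __CHECKER__
# define __iomem __attribute__((noderef, address_space(2)))
# define __force __attribute__((force))
#else
# define __iomem
# define __force
#endif

/* Hypothetical stand-in for a device's register window. */
static uint32_t fake_regs[4] = { 0x11, 0x22, 0x33, 0x44 };

/* Hypothetical mapping helper: hands the window out as an __iomem cookie. */
static void __iomem *fake_map(void)
{
    return (void __iomem __force *)fake_regs;
}

/* Accessor: the one place where the cookie becomes a real pointer again. */
static uint32_t read_reg32(const void __iomem *addr)
{
    return *(const volatile uint32_t __force *)addr;
}

static uint32_t second_register(void)
{
    uint32_t __iomem *base = fake_map();

    /* "return base[1];" would compile with gcc, but it is exactly the
     * kind of direct dereference that sparse is meant to flag. */
    return read_reg32(base + 1);
}
```

With an ordinary compiler the annotations cost nothing; the extra checking only happens when the same source is run through sparse.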

The next step is the addition of a new set of accessor functions which explicitly require a pointer argument. These functions are:

    unsigned int ioread8(void __iomem *addr);
    unsigned int ioread16(void __iomem *addr);
    unsigned int ioread32(void __iomem *addr);
    void iowrite8(u8 value, void __iomem *addr);
    void iowrite16(u16 value, void __iomem *addr);
    void iowrite32(u32 value, void __iomem *addr);

By default, these functions are simply wrappers around readb() and friends. The explicit pointer type for the argument will generate warnings, however, if a driver passes in an integer type.
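The difference in type checking can be sketched in user space. In the example below, mock_readb() imitates an old-style accessor written as a cast-everything macro, and mock_ioread8() is a thin typed wrapper around it. Both names, along with demo_type_checking(), are made up for illustration; the real kernel definitions differ by architecture.

```c
#include <stdint.h>

#define __iomem /* empty outside of sparse */

/* Old-style accessor modeled as a macro which casts whatever it is given,
 * so integers and pointers are accepted alike. */
#define mock_readb(addr) (*(volatile uint8_t *)(uintptr_t)(addr))

/* New-style accessor: a thin wrapper whose prototype demands a pointer. */
static inline unsigned int mock_ioread8(void __iomem *addr)
{
    return mock_readb(addr);
}

static unsigned int demo_type_checking(void)
{
    static uint8_t fake_reg = 0x5a;
    uintptr_t as_integer = (uintptr_t)&fake_reg; /* the old habit */

    unsigned int old_style = mock_readb(as_integer); /* accepted silently */
    unsigned int new_style = mock_ioread8(&fake_reg); /* a pointer: fine */

    /* mock_ioread8(as_integer) would draw an integer-to-pointer
     * diagnostic from the compiler, which is the point of the wrapper. */
    return old_style + new_style;
}
```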

There are "string" versions of these operations:

    extern void ioread8_rep(void __iomem *port, void *buf, 
                            unsigned long count);

The 16-bit and 32-bit read variants are defined as well, as are the corresponding iowrite8_rep(), iowrite16_rep(), and iowrite32_rep() functions.
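The semantics of a "string" read, as this sketch assumes them, resemble those of insb(): count values are read from the same location, one after another, and stored into consecutive slots of buf. The following user-space model shows that behavior against a pretend FIFO register. struct fake_fifo, fake_ioread8(), and fake_ioread8_rep() are hypothetical names used only here.

```c
#include <stddef.h>
#include <stdint.h>

#define __iomem /* empty outside of sparse */

/* Hypothetical device model: one data register backed by a FIFO. */
struct fake_fifo {
    const uint8_t *data;
    size_t len;
    size_t pos;
};

/* Stand-in for ioread8(): every read of the register pops the next byte,
 * or yields 0xff once the FIFO has run dry. */
static unsigned int fake_ioread8(void __iomem *addr)
{
    struct fake_fifo *fifo = addr;

    return fifo->pos < fifo->len ? fifo->data[fifo->pos++] : 0xff;
}

/* Stand-in for ioread8_rep(): the destination advances, the port does not. */
static void fake_ioread8_rep(void __iomem *port, void *buf,
                             unsigned long count)
{
    uint8_t *dst = buf;

    while (count--)
        *dst++ = (uint8_t)fake_ioread8(port);
}
```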

There is actually one other twist to these functions. Some drivers have to be able to use either I/O memory or I/O ports, depending on the architecture and the device. Some such drivers have gone to considerable lengths to try to avoid duplicating code in those two cases. With the new accessors, a driver which finds it needs to work with x86-style ports can call:

    void __iomem *ioport_map(unsigned long port, unsigned int count);

The return value will be a cookie which allows the mapped ports to be treated as if they were I/O memory; functions like ioread8() will automatically do the right thing. For PCI devices, there is a new function:

    void __iomem *pci_iomap(struct pci_dev *dev, int base, 
                            unsigned long maxlen);

For this function, base is the number of the PCI base address register (BAR) to be mapped. The region described by that BAR can be either I/O ports or I/O memory, and the right thing will be done in both cases.
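One plausible way for a single cookie type to cover both ports and memory is to reserve a low range of cookie values for port numbers and to have each accessor test which range it was handed. The user-space sketch below shows that dispatch. The two constants, the fake port array, and every function name are assumptions made for illustration, and the sketch relies on real data never living at such low addresses.

```c
#include <stddef.h>
#include <stdint.h>

#define __iomem /* empty outside of sparse */

#define FAKE_PIO_OFFSET   0x10000UL /* assumed bias added to port numbers */
#define FAKE_PIO_RESERVED 0x40000UL /* assumed: cookies below this are ports */

/* Stand-in for the x86 port space which inb() would reach. */
static uint8_t fake_port_space[0x100];

/* Hypothetical ioport_map(): encode the port number as a low-valued cookie. */
static void __iomem *fake_ioport_map(unsigned long port, unsigned int count)
{
    (void)count; /* a real implementation might range-check with this */
    if (port >= sizeof(fake_port_space))
        return NULL;
    return (void __iomem *)(uintptr_t)(port + FAKE_PIO_OFFSET);
}

/* Hypothetical ioread8(): pick the port or memory path from the cookie. */
static unsigned int fake_ioread8(void __iomem *addr)
{
    uintptr_t cookie = (uintptr_t)addr;

    if (cookie >= FAKE_PIO_OFFSET && cookie < FAKE_PIO_RESERVED) {
        uintptr_t port = cookie - FAKE_PIO_OFFSET;

        /* The "inb()" path; ports outside the model read as 0xff. */
        return port < sizeof(fake_port_space) ? fake_port_space[port] : 0xff;
    }
    /* The "readb()" path: the cookie is an ordinary mapped address. */
    return *(volatile uint8_t *)addr;
}
```

The driver code calling fake_ioread8() never needs to know which kind of cookie it holds, which is the duplication the new accessors are meant to remove.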

As of 2.6.9-rc2, there are no in-tree users of the new interface. That can be expected to change soon as patches get merged and the kernel janitors get to work. For more information on the new I/O memory interface and the motivation behind it, see this explanation from Linus.

The Philips webcam driver returns

The removal of the Philips webcam driver from the kernel set off a long and sometimes inflammatory discussion. Its return has, instead, been greeted with almost total silence. Once people take a look, however, they might see something worth yelling about.

The new maintainer is Luc Saillard. He has posted a patch which restores the PWC driver to the kernel, but without the problematic hook for the proprietary compression module. As an added bonus, the driver can deal with compressed streams from some cameras (those using chipsets 2 or 3), in some modes. Work still needs to be done for chipset 1 and the Bayer mode.

The final result is yet to be seen, but it would appear that the whole PWC episode is heading toward a best-case conclusion: a 100% free driver. It would be hard to see that outcome as anything but a good thing.

The Big Kernel Semaphore?

Much of the latency reduction work spearheaded by Ingo Molnar is reaching a state of completion; a lengthy set of patches has been posted which breaks up long lock hold times and adds "voluntary preemption" points at strategic places. With these patches in place, most of the worst latency problems in the 2.6 kernel have been addressed, even when kernel preemption is not enabled. That is good news for multimedia users and others who feel that their needs have been passed over in the 2.5/2.6 development period.

One issue remains, however: there are some old parts of the kernel which still rely on the Big Kernel Lock (BKL) for mutual exclusion. Code which uses the BKL is not performance critical itself (all such uses have been fixed for a while). But the BKL is a lock, and code which holds the BKL will not be preempted. That can mean long latencies if a code path holds the BKL for a long time - and there are a few such paths.

Interest in eradicating use of the BKL has waned in the last year or two, for a few reasons. Any code whose performance was seriously impacted by the BKL has been fixed. And, perhaps more to the point, much of the remaining code is ancient, crufty, and brittle. Finally, as Alan Cox (who holds the dubious fame of having created the BKL) points out, the BKL is not a traditional lock:

The BKL turns on old style unix non-pre-emptive semantics between all code that is within lock_kernel sections, that is it. That also makes it hard to clean up because lock_kernel is delimiting code properties (it's essentially almost a function attribute) and spin_lock/down/up and friends are real locks and lock data.

Fixing the remaining code is not an exercise for the timid. In most cases, the prudent course has been to simply leave things alone. The latency problem may just force this issue, however; by increasing latency, BKL-protected code is harming the higher-performance parts of the kernel.

The BKL has one very interesting property which distinguishes it from an ordinary spinlock: code holding the BKL can call schedule() at any time. When that happens, the kernel drops the lock on the thread's behalf and reacquires it before the thread runs again. If code holding the lock can schedule, it ought to be preemptible as well - at least under some circumstances.

Ingo Molnar has decided to mitigate the BKL problem by turning it into the Big Kernel Semaphore. As seen in his patch, the BKS is a special sort of semaphore; it is recursive (as is the BKL), and it is released when the thread holding it voluntarily schedules. The key difference from the BKL, however, is that a process holding the BKS can be preempted - but the semaphore is not released in that case. So code which uses lock_kernel() is still protected against other such code, just like it is now. But that code can be preempted (as long as it does not take any spinlocks). That change should be sufficient to address the latency problems caused by long BKL hold times.
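The bookkeeping described above can be modeled in a few lines of single-threaded user-space C. This is only a sketch of the rules as stated in the text, not Ingo's patch: struct fake_task, bks_held, and all of the fake_*() functions are invented for illustration, and a plain flag stands in for a semaphore which would really put contending threads to sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state. */
struct fake_task {
    int lock_depth; /* -1: not held; >= 0: nesting level */
    bool dropped;   /* semaphore was released across a voluntary switch */
};

static bool bks_held; /* stand-in for the single global semaphore */

/* Recursive acquire: only the outermost call takes the semaphore. */
static void fake_lock_kernel(struct fake_task *t)
{
    if (++t->lock_depth == 0) {
        assert(!bks_held); /* a real semaphore would sleep here instead */
        bks_held = true;
    }
}

/* Recursive release: only the outermost call gives the semaphore back. */
static void fake_unlock_kernel(struct fake_task *t)
{
    assert(t->lock_depth >= 0);
    if (--t->lock_depth < 0)
        bks_held = false;
}

/* Leaving the CPU: a voluntary schedule() releases the semaphore, while
 * an involuntary preemption leaves it held. */
static void fake_switch_out(struct fake_task *t, bool voluntary)
{
    if (voluntary && t->lock_depth >= 0) {
        bks_held = false;
        t->dropped = true;
    }
}

/* Returning to the CPU: retake the semaphore if it was dropped above. */
static void fake_switch_in(struct fake_task *t)
{
    if (t->dropped) {
        assert(!bks_held);
        bks_held = true;
        t->dropped = false;
    }
}
```

Note that lock_depth is untouched by either kind of context switch; only ownership of the semaphore changes, and only in the voluntary case.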

Whether this patch will be accepted remains to be seen. Linus doesn't like it, but Ingo has reasonable responses to his objections. Including Ingo's patch would mitigate the current problems caused by the BKL, which may have an undesirable effect: once again, there will be little motivation to truly fix users of the BKL. Some developers may prefer to simply bite the bullet and eliminate those final BKL users for real.

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds