
Kernel development

Brief items

Kernel release status

The stable kernel release is in review as of this writing; it should be released sometime around November 25. It contains 23 patches with important fixes, most of which are in the networking subsystem.

The current 2.6 prepatch is 2.6.15-rc2, released by Linus on November 19. It is mostly made up of a large pile of fixes, but there is also a big x86-64 update (including the DMA32 memory zone) which got missed for -rc1. The long-format changelog has the details.

Linus's git repository contains 100 or so fixes merged since -rc2. Among them is the new VM_UNPAGED VMA feature, described below.

The current -mm tree is 2.6.15-rc1-mm2. Recent changes to -mm include various memory management and memory hotplug patches, a relayfs update, a number of kernel shrinking patches from the -tiny tree, a reiser4 update, some software suspend improvements, a kdump update, and lots of fixes.


Kernel development news

Dynamic USB device IDs

The market for USB devices is certainly dynamic; new gadgets are released at a high rate. Unfortunately, Linux kernels and their associated drivers are not always updated quite as quickly. The result can be that the kernel fails to recognize and drive a new gadget, even though existing drivers may be entirely capable of doing the job. The driver simply does not know that the device is one it can handle, so the kernel does not bind the two together.

Greg Kroah-Hartman has posted a simple patch which should help fix this situation. With the patch in place, each USB driver gets a new sysfs attribute (new_id). If a system administrator writes two values (the vendor and product ID numbers reported by the device) to that attribute, those numbers form a new device ID associated with the driver. Immediately after the write, the driver will recognize the device, and everybody will be happy. No changes to the drivers themselves are necessary. Of course, one could create confusion by associating a device with an inappropriate driver, but a bit of attention should suffice to avoid that problem.
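The administrator's side of this can be sketched in a few lines of C. This is only an illustration under stated assumptions: the attribute path passed in by the caller (something like /sys/bus/usb/drivers/DRIVER/new_id), the "two hexadecimal values separated by a space" format, and the helper names format_new_id() and write_new_id() are all hypothetical, since the text says only that two values are written to the attribute.

```c
#include <stdio.h>

/*
 * Format a vendor/product pair as "vvvv pppp\n" (hexadecimal).
 * Returns the number of characters written, or -1 if buf is too small.
 * The exact format expected by the attribute is an assumption.
 */
int format_new_id(char *buf, size_t len, unsigned vendor, unsigned product)
{
    int n = snprintf(buf, len, "%04x %04x\n",
                     vendor & 0xffffu, product & 0xffffu);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Write the pair to the attribute file at "path".
 * Returns 0 on success, -1 on any failure.
 */
int write_new_id(const char *path, unsigned vendor, unsigned product)
{
    char buf[32];
    FILE *f;
    int ok;

    if (format_new_id(buf, sizeof(buf), vendor, product) < 0)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fputs(buf, f) >= 0;
    if (fclose(f) != 0)     /* a failed flush surfaces here */
        ok = 0;

    return ok ? 0 : -1;
}
```

A caller would pass the attribute path for the driver in question along with the IDs reported by the device; writing to sysfs normally requires root privileges.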

This patch came out a bit late for 2.6.15, so it is more likely to show up in 2.6.16 or thereafter.


Making notifiers safe

The kernel contains a mechanism, called "notifiers" or "notifier chains," which allows kernel code to ask to be told when something interesting happens. A number of notifier chains are currently in use in the kernel; chains exist for memory hotplug events, CPU frequency policy changes, USB hotplug events, module loading and unloading, system reboots, network device changes, and more. Notifiers are a simple and easy way to get the word out, so they are increasingly being used throughout the kernel.

The interface to notifiers is simple. There is one structure type:

    struct notifier_block {
        int (*notifier_call)(struct notifier_block *self, 
                             unsigned long event, void *data);
        struct notifier_block *next;
        int priority;
    };

A notifier chain is thus a simple, singly-linked list with no separate head. A kernel subsystem which wishes to be notified of specific events fills out a notifier_block structure and passes it to:

    int notifier_chain_register(struct notifier_block **chain, 
                                  struct notifier_block *notifier);

The chain is kept sorted in increasing priority order. Sending out an event is a matter of calling:

    int notifier_call_chain(struct notifier_block **chain, 
                            unsigned long event, void *data);

Notifiers registered in the chain will be called, in increasing priority order, with the given event and data values. Any notifier can return a value with the NOTIFY_STOP_MASK bit set, with the result that no further notifiers will be called. The return value from the last notifier called is returned from notifier_call_chain(). In some cases, the combination of NOTIFY_STOP_MASK and the return value is used to allow notifiers to veto proposed actions.
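To make the calling convention concrete, here is a small user-space model of the interface just described. It is a sketch rather than the kernel's code: there is no locking at all, the list is ordered the way the text above describes, and the NOTIFY_* values are assumed to mirror those in the kernel's notifier.h. The count_event() and veto_event() callbacks are hypothetical.

```c
#include <stddef.h>

/* Assumed to mirror the kernel's values; treat them as illustrative. */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_BAD       (NOTIFY_STOP_MASK | 0x0002)

struct notifier_block {
    int (*notifier_call)(struct notifier_block *self,
                         unsigned long event, void *data);
    struct notifier_block *next;
    int priority;
};

/* Insert n so that the list stays in the order the text describes. */
int notifier_chain_register(struct notifier_block **chain,
                            struct notifier_block *n)
{
    while (*chain && (*chain)->priority <= n->priority)
        chain = &(*chain)->next;
    n->next = *chain;
    *chain = n;
    return 0;
}

/* Call each notifier in turn; stop early if one sets NOTIFY_STOP_MASK. */
int notifier_call_chain(struct notifier_block **chain,
                        unsigned long event, void *data)
{
    struct notifier_block *nb;
    int ret = NOTIFY_DONE;

    for (nb = *chain; nb; nb = nb->next) {
        ret = nb->notifier_call(nb, event, data);
        if (ret & NOTIFY_STOP_MASK)
            break;
    }
    return ret;
}

/* Hypothetical notifier: counts the call and lets the chain continue. */
int count_event(struct notifier_block *self, unsigned long event, void *data)
{
    (void)self;
    (void)event;
    ++*(int *)data;
    return NOTIFY_OK;
}

/* Hypothetical notifier: counts the call, then vetoes the action. */
int veto_event(struct notifier_block *self, unsigned long event, void *data)
{
    (void)self;
    (void)event;
    ++*(int *)data;
    return NOTIFY_BAD;
}
```

With three blocks registered at priorities 0, 5, and 10, where the middle one uses veto_event(), a call to notifier_call_chain() invokes only the first two and hands NOTIFY_BAD back to the caller.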

The current notifier implementation is quite simple, not much more than one page of code. Alan Stern recently noticed a little problem, however: notifier_call_chain() goes through the list without any sort of locking. Changes to the notifier list are protected by a global notifier lock, but that lock is ignored when notifiers are called. Thus, if notifier_call_chain() is called while some other part of the kernel is adding or removing notifiers, a mess could result.

One might be tempted to fix the problem by simply acquiring the lock in notifier_call_chain(), but life is not so simple. The current lock for notifiers is a spinlock, but, as it turns out, some notifier functions can sleep. So holding the lock while calling notifiers is not possible. Switching the lock to a semaphore is also out for similar reasons: some notifier chains can be called from atomic contexts. So a more complicated fix is called for.

That fix has been posted by Chandra Seetharaman. It appears that notifier chains have to be split into two types: those which can sleep, and those which are entirely atomic. A new notifier_type enum has been created with two values: ATOMIC_NOTIFIER and BLOCKING_NOTIFIER. There is also now an explicit type (struct notifier_head) for the head of a notifier chain. Chains are now declared with something like:

    NOTIFIER_HEAD(name, type);

Some new rules have been adopted for notifiers as well; one of those is that notifiers are only added or removed in non-atomic context. With that rule in place, each notifier_head structure can contain a semaphore (an rwsem, actually) which protects access to the chain. The new registration function is:

    int notifier_chain_register(struct notifier_head *chain,
                                struct notifier_block *notifier);

Addition of a notifier is relatively easy to do in a safe manner. The "next" pointer in the new entry is set first, followed by the "next" pointer in the appropriate place in the list. By throwing in some memory barriers, the patch ensures that the chain is always in a consistent state.
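The two-step insertion can be modeled in user space with C11 atomics. This is a sketch of the publication idiom only, not the patch itself: the release store stands in for the kernel's memory barriers, writers are assumed to be serialized by the caller (the rwsem in the patch), and struct node, chain_insert(), and chain_length() are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int priority;
    struct node *_Atomic next;
};

/*
 * Insert n into the list starting at *link. The caller is assumed to
 * hold whatever lock serializes writers; readers take no lock at all.
 */
void chain_insert(struct node *_Atomic *link, struct node *n)
{
    struct node *cur;

    /* Find the link after which n belongs. */
    while ((cur = atomic_load_explicit(link, memory_order_relaxed)) != NULL
           && cur->priority <= n->priority)
        link = &cur->next;

    /* Step 1: point the new entry at the rest of the list. Nobody can
     * see n yet, so a relaxed store is enough. */
    atomic_store_explicit(&n->next, cur, memory_order_relaxed);

    /* Step 2: publish n. The release ordering guarantees that a reader
     * which observes n also observes its initialized next pointer. */
    atomic_store_explicit(link, n, memory_order_release);
}

/* Lock-free reader: walks the list using acquire loads. */
int chain_length(struct node *_Atomic *head)
{
    struct node *cur = atomic_load_explicit(head, memory_order_acquire);
    int count = 0;

    while (cur) {
        count++;
        cur = atomic_load_explicit(&cur->next, memory_order_acquire);
    }
    return count;
}
```

A reader racing with chain_insert() sees either the old list or the list with the new entry fully linked in, never a half-initialized entry.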

The new form of notifier_call_chain() is:

    int notifier_call_chain(struct notifier_head *chain,
                            unsigned long event, void *data);

If the chain is of the BLOCKING_NOTIFIER variety, notifier_call_chain() can simply acquire the chain semaphore and call the notifiers safely. Acquiring the semaphore is not possible for ATOMIC_NOTIFIER chains, however, so, in that case, the code simply calls rcu_read_lock() to ensure that it will not be preempted while calling the notifiers.

The new prototype for the unregistration function is:

    int notifier_chain_unregister(struct notifier_head *chain,
                                  struct notifier_block *notifier);

For blocking chains, removal of notifiers is straightforward; the code can simply acquire the semaphore and do its work knowing that nobody else will be traversing the chain. For atomic notifiers, however, notifier_call_chain() does not acquire the semaphore, so the possibility of races is real. Removing the notifier from the chain is still straightforward: a single pointer assignment takes the notifier out in an atomic manner. But code in another processor may have stumbled across that notifier before it was removed from the chain; in that case, it may still have a reference to it. So the destruction of the removed notifier must wait until the kernel can be sure that no references remain.

This is just the sort of situation that the read-copy-update (RCU) mechanism was created for. In many applications, the way to destroy this structure would be to set up an rcu_head structure, pass it to call_rcu(), and wait for a callback to finish the job. In this case, however, callers to notifier_chain_unregister() are not expecting callbacks later on, and, in any case, notifier removal is not a performance-critical operation. So the unregister code simply calls synchronize_rcu() to block until all current RCU read locks have been released. Once synchronize_rcu() has returned, the unregistration code can safely return as well, knowing that no references to the removed notifier exist.

The new design adds one other new constraint: notifiers cannot remove themselves from the chain. Both the use of the semaphore and the use of RCU would lead to deadlocks in that situation, resulting in developer notifications by way of bugzilla and annoyed email.



The page structure, used to describe the memory in the system, includes a set of flags; one of those flags is PG_reserved. For a long time, this bit has marked pages which are not part of the regular memory management regime; pages so marked include the kernel text (which really should not be swapped out) and the I/O memory in the legacy ISA hole at 640K. Occasionally, device drivers have explicitly set the reserved bit on ordinary memory so that it could be mapped into user space with remap_pfn_range(). This technique has been discouraged for years, but still persists in spots.

The 2.6.15 kernel removes, for all practical purposes, the reserved bit. Space for page flags is tight, and it was figured that, in 2.6, this bit was no longer needed. The page reclaim code no longer cycles through the system memory map, so it does not need this bit to know which pages to avoid. For the other uses, the VM_RESERVED bit in the vm_area structure could be used instead. So, in 2.6.15-rc2, the PG_reserved bit is (almost) ignored, and the kernel respects VM_RESERVED by not freeing pages found in areas with that bit set.

Unfortunately, it seems a number of drivers set VM_RESERVED for all VMAs which are mapped into user space. Some of these areas are actually normal memory pages, which the driver maps into the process's address space one-by-one when its nopage() function is called. Hugh Dickins noticed that, in this case, those pages will never be returned to the system, since the VM_RESERVED flag prevents them from being freed. The right fix for the problem is probably to get rid of VM_RESERVED altogether; its use is mostly a legacy from the 2.4 days. But going into a bunch of drivers and tweaking their memory management code when this kernel is already at a -rc2 release looks like a certain way to introduce obscure bugs. So Hugh decided to go in and make fundamental changes to the low-level memory management code instead.

The result is a new VMA flag, VM_UNPAGED. This flag says, explicitly, that the pages in this VMA are not to be managed, and in particular, should not be freed. It essentially takes over the meaning previously held by VM_RESERVED, but in an arguably better-defined manner. Calls to remap_pfn_range() will cause the VM_UNPAGED flag to be set. But areas of RAM managed by a driver nopage() function will not have VM_UNPAGED set, so their memory will be managed normally.
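The distinction between the two flags can be shown with a toy model. Everything here is hypothetical: the TOY_ bit values, the structure, and the function names are invented for illustration and do not correspond to the kernel's definitions. The model captures only the rule stated above, namely that VM_UNPAGED, not VM_RESERVED, decides whether pages are left alone.

```c
#include <stdbool.h>

/* Hypothetical bit values for this toy model; not the kernel's. */
#define TOY_VM_RESERVED 0x1UL
#define TOY_VM_UNPAGED  0x2UL

struct toy_vma {
    unsigned long flags;
};

/* Models remap_pfn_range() marking the area as unmanaged. */
void toy_remap_pfn_range(struct toy_vma *vma)
{
    vma->flags |= TOY_VM_UNPAGED;
}

/* Models a driver which sets VM_RESERVED on its VMA and then supplies
 * ordinary RAM pages one at a time from its nopage() function. */
void toy_driver_mmap_nopage(struct toy_vma *vma)
{
    vma->flags |= TOY_VM_RESERVED;
}

/* In this model, pages are freed on unmap unless the area is unpaged;
 * VM_RESERVED alone no longer pins them. */
bool toy_pages_freed_on_unmap(const struct toy_vma *vma)
{
    return !(vma->flags & TOY_VM_UNPAGED);
}
```

In this model a VMA set up through toy_driver_mmap_nopage() has its pages freed on unmap, while one set up through toy_remap_pfn_range() does not.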

Various other subtleties, such as what happens when a process with VM_UNPAGED VMAs forks, had to be dealt with. But the end result of all this work should be that things function again, with no driver changes. At some point, the use of VM_RESERVED in drivers may be taken out, but that's a post-2.6.15 thing.

Meanwhile, one other interesting result of the PG_reserved removal is that remap_pfn_range() can now be used to remap any set of addresses, not just those marked reserved.


Patches and updates

Kernel trees


Build system

Core kernel code

Device drivers

  • Bartlomiej Zolnierkiewicz: ide update. (November 19, 2005)


Filesystems and block I/O


Memory management


  • Stephen Hemminger: TCP CUBIC. (November 18, 2005)


Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds