Kernel development
Brief items
Kernel release status
The current stable 2.6 kernel is 2.6.16.19, released on May 30. It contains a single fix for an information leak in the netfilter code.

The current 2.6 prepatch is 2.6.17-rc5, released by Linus on May 24. With luck, this will be the last prepatch before the final 2.6.17 release. It consists of a fair number of fixes; see the long-format changelog for the details.
Several dozen patches (all fixes) have found their way into the mainline after the -rc5 release.
The current -mm tree is 2.6.17-rc5-mm1. Recent changes to -mm include the generic IRQ layer, an updated version of reiser4, the lock validator (see below), the adaptive readahead patch set, a new infrastructure for maintaining kernel statistics, and a new kernel API for inotify.
Kernel development news
A summary of 2.6.17 API changes
The final 2.6.17 kernel release is getting close. Further internal API changes in this cycle are (one hopes) highly unlikely, so the following list should be definitive for this time around.
- Support for the SPARC "Niagara" processor has been added.
- EXPORT_SYMBOL_GPL_FUTURE() has been merged.
- The safe notifier patch has been merged, creating a new API for all notifier users.
- The SLAB_NO_REAP slab cache option, which ostensibly kept a cache
from being cleaned up when the system is under memory pressure, has
been removed. The kmem_cache_t typedef is also being phased out in
favor of struct kmem_cache.
- The "softmac" 802.11 subsystem has been merged. This code may
eventually be phased out, however, in favor of the Devicescape code.
- There is a new real-time clock subsystem, providing generalized RTC
support and a well-defined driver interface.
- A new utility function has been added:
int execute_in_process_context(void (*fn)(void *data), void *data,
                               struct execute_work *work);

This function will arrange for fn() to be called in process context (where it can sleep). Depending on when execute_in_process_context() is called, fn() could be invoked immediately or delayed by way of a work queue.
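Outside the kernel, the dispatch logic can be illustrated with a small userspace model; the flag, queue, and helper names below are hypothetical stand-ins for in_interrupt() and the kernel's work queue machinery:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model: a flag stands in for in_interrupt(), and a tiny
 * fixed-size queue stands in for the kernel's work queue. */
static int in_atomic_context;

struct execute_work { void (*fn)(void *); void *data; };

static struct execute_work pending[8];
static int npending;

/* Run fn(data) immediately if we are in process context; otherwise
 * queue it for later execution by a worker. */
static int execute_in_process_context(void (*fn)(void *), void *data,
                                      struct execute_work *work)
{
    if (!in_atomic_context) {
        fn(data);                    /* safe to run (and sleep) here */
        return 0;
    }
    work->fn = fn;
    work->data = data;
    pending[npending++] = *work;     /* defer to "process context" */
    return 1;
}

/* The stand-in worker: drain the queue in process context. */
static void run_pending_work(void)
{
    for (int i = 0; i < npending; i++)
        pending[i].fn(pending[i].data);
    npending = 0;
}
```

As in the kernel, the caller cannot know whether fn() has already run when the function returns, so fn() must be written to work either way.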
- The SMP alternatives patch has been merged.
- The relayfs API has been reworked, though the sysfs interface has
been left out for now.
- There is a new tracing mechanism for developers debugging block
subsystem code.
- There is a new internal flag (FMODE_EXEC) used to indicate
that a file has been opened for execution.
- The obsolete MODULE_PARM() macro is gone forevermore.
- A new function, flush_anon_page(), can be used in conjunction
with get_user_pages() to safely perform DMA to anonymous
pages in user space.
- Zero-filled memory can now be allocated from slab caches with
kmem_cache_zalloc(). There is also a new slab debugging
option to produce a /proc/slab_allocators file with detailed
allocation information.
- There are four new ways of creating mempools:
mempool_t *mempool_create_page_pool(int min_nr, int order);
mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size);
mempool_t *mempool_create_kzalloc_pool(int min_nr, size_t size);
mempool_t *mempool_create_slab_pool(int min_nr, struct kmem_cache *cache);

The first creates a pool which allocates whole pages (with the allocation size determined by order), while the second and third create pools backed by kmalloc() and kzalloc(), respectively. The fourth is a shorthand form for creating slab-backed pools.
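As a rough illustration of the mempool idea (not the kernel implementation), here is a userspace model in which min_nr elements are preallocated so an allocation can still succeed when the normal allocator fails; all names below are invented for the sketch:

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model of a mempool: min_nr elements are preallocated so
 * that an allocation can still succeed when malloc() fails. */
struct mempool {
    size_t size;
    int nr;             /* elements currently in the reserve */
    void *reserve[16];  /* fixed cap keeps the model simple */
};

static struct mempool *mempool_create(int min_nr, size_t size)
{
    struct mempool *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    p->size = size;
    p->nr = 0;
    for (int i = 0; i < min_nr; i++)
        p->reserve[p->nr++] = malloc(size);
    return p;
}

/* 'fail' simulates the normal allocator running out of memory. */
static void *mempool_alloc(struct mempool *p, int fail)
{
    if (!fail)
        return malloc(p->size);
    return p->nr > 0 ? p->reserve[--p->nr] : NULL;
}

static void mempool_free(struct mempool *p, void *obj)
{
    if (p->nr < 16)
        p->reserve[p->nr++] = obj;   /* refill the reserve first */
    else
        free(obj);
}
```

The key property - freed objects go back into the reserve before being returned to the system - is what lets code on critical I/O paths make progress under memory pressure.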
- The prototype for hrtimer_forward() has changed:
unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t now,
                              ktime_t interval);

The new now argument is expected to be the current time. This change allows some calls to be optimized. The data field has also been removed from the hrtimer structure.
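The forwarding arithmetic itself is simple; a userspace model (plain integers standing in for ktime_t, with a hypothetical timer_forward() helper) might look like:

```c
#include <assert.h>

/* Userspace model of hrtimer forwarding: advance the timer's expiry
 * time past 'now' in whole intervals and return how many intervals
 * (overruns) were skipped.  Times are plain integers here; the
 * kernel uses ktime_t. */
static unsigned long timer_forward(long *expires, long now, long interval)
{
    unsigned long overruns = 0;

    while (*expires <= now) {
        *expires += interval;
        overruns++;
    }
    return overruns;
}
```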
- A whole set of generic bit operations (find first set, count set bits,
etc.) has been added, helping to unify this code across architectures
and subsystems.
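On many architectures these operations reduce to single instructions; as an illustration only, the find-first-set and Hamming-weight semantics can be sketched in userspace with GCC builtins:

```c
#include <assert.h>

/* Userspace illustration of two generic bit operations, using GCC
 * builtins; the kernel versions pick an architecture-optimized
 * implementation or fall back to generic C. */

/* Position of the first (least significant) set bit, 1-based;
 * returns 0 if no bit is set - the ffs() convention. */
static int find_first_set(unsigned long word)
{
    return __builtin_ffsl((long)word);
}

/* Number of set bits ("Hamming weight"). */
static int count_set_bits(unsigned long word)
{
    return __builtin_popcountl(word);
}
```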
- The f_op pointer in struct file - which refers to the
file_operations structure for the open file - has been marked
const. Quite a bit of code which used to change that
structure has been changed to compensate. Similar changes have been
made in many filesystems. "The goal is both to increase correctness
(harder to accidentally write to shared datastructures) and reducing
the false sharing of cachelines with things that get dirty in .data
(while .rodata is nicely read only and thus cache clean)."
- local_t is now a signed type.
- Attributes in sysfs can now be polled.
- A class_device can now have attribute groups created at
registration time; to take advantage of this capability, store the
desired groups in the new groups field.
- The splice(), vmsplice(), and tee() system calls have been merged. Supporting those calls requires implementing two new file_operations methods. See this article for the final form of the splice_read() and splice_write() functions.
As always, look at the LWN 2.6 kernel API changes page for a list of changes over time.
Notifiers, 2.6.17 style
While plowing through the flood of patches early in the 2.6.17 cycle, your editor missed a significant API change: the new notifier interface. Notifiers are an internal kernel mechanism allowing code to register to be told about events of interest. There are notifiers for memory hotplug events, CPU frequency policy changes, USB hotplug events, module loading and unloading, system reboots, network device changes, and more.

Back in November 2005, this page looked at a proposed notifier API change motivated by the lack of locking on the notifier chains themselves. That proposal received a lukewarm reception. Many low-level data structures in the kernel explicitly avoid performing any locking, on the assumption that the higher layers will have to be concerned with their own locking in any case. So, it was asked, why should notifiers be any different? The answer seems to be that, unlike many other data structures, notifiers tend to be used across relatively wide parts of the kernel, making it hard to use any locking regime except one designed for the notifiers themselves. In any case, a version of the notifier patch was merged for 2.6.17-rc1.
The current form of the API defines three different types of notifiers:
- Blocking notifiers are always called from process context. The
notifier code - along with the notification routines it calls - is
allowed to sleep.
- Atomic notifiers can be called from atomic context, no sleeping
allowed.
- Raw notifiers have no internal locking and no associated rules; they are simply the older form of the notifier API, preserved as a historical relic.
For 2.6.17, all notifier chains have been converted to the blocking or atomic types; there are no users of the raw interface in the mainline kernel. The notifier patch includes no threatening noises about removing the raw interface, but, sooner or later, somebody is likely to come along and want to clean it up. So avoiding raw notifiers is probably a good idea; this article will concentrate on the other two types.
A blocking notifier is essentially a raw notifier with an rwsem added for mutual exclusion. Any operation on a blocking notifier may, well, block on that rwsem. These notifiers can be created in the usual two ways:
#include <linux/notifier.h>

/* Static declaration: */
BLOCKING_NOTIFIER_HEAD(my_notifier);

/* Or dynamic initialization: */
struct blocking_notifier_head my_notifier;
BLOCKING_INIT_NOTIFIER_HEAD(my_notifier);
Code which wishes to hook into a blocking notifier should first fill in a notifier_block structure:
struct notifier_block {
int (*notifier_call)(struct notifier_block *block,
unsigned long event,
void *data);
int priority;
/* ... */
};
The notifier_call field should point to the function to be called when something interesting happens; the event and data parameters will be provided by the code generating the event. Notifiers are called in order of increasing priority; the return value from the final notifier called will be passed back to the code signalling the event. Normally, the final notifier is the one with the highest priority value, but any notifier can halt further processing by returning a value with the bit indicated by NOTIFIER_STOP_MASK set. Other than that one bit (currently 0x8000), the return values are arbitrary (as far as the notification code is concerned), but the convenience values NOTIFY_OK ("so far so good"), NOTIFY_STOP ("all is well, but don't call any more notifiers") and NOTIFY_BAD ("stop calling notifiers and veto the proposed action") are available.
Once the code has a notifier_block ready, it should register it with:
int blocking_notifier_chain_register(struct blocking_notifier_head *chain,
struct notifier_block *nb);
The return value is apparently intended to allow an error status to be returned if the registration fails, but the 2.6.17 version of the code cannot fail.
A blocking notifier can be unregistered with:
int blocking_notifier_chain_unregister(struct blocking_notifier_head *chain,
struct notifier_block *nb);
This call will return -ENOENT if the given notifier was not actually registered.
Code which wishes to use a blocking notifier chain to signal an event can do so with:
int blocking_notifier_call_chain(struct blocking_notifier_head *chain,
unsigned long event,
void *data);
This function will call all notifiers in chain (unless one of them stops the process partway through), returning the value from the last notifier called.
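Ignoring locking entirely, the chain walk described above can be sketched as a userspace model; the structure layout and helper names here are simplified stand-ins for the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of a notifier chain, ignoring all locking.  The
 * semantics follow the text above: notifiers run in order of
 * increasing priority, and a return value with NOTIFY_STOP_MASK set
 * halts the walk. */
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP      (NOTIFY_OK | NOTIFY_STOP_MASK)

struct notifier_block {
    int (*notifier_call)(struct notifier_block *nb,
                         unsigned long event, void *data);
    struct notifier_block *next;
    int priority;
};

static void chain_register(struct notifier_block **chain,
                           struct notifier_block *nb)
{
    /* Insert so the list stays sorted by increasing priority. */
    while (*chain && (*chain)->priority <= nb->priority)
        chain = &(*chain)->next;
    nb->next = *chain;
    *chain = nb;
}

static int chain_call(struct notifier_block *chain,
                      unsigned long event, void *data)
{
    int ret = NOTIFY_OK;

    for (; chain; chain = chain->next) {
        ret = chain->notifier_call(chain, event, data);
        if (ret & NOTIFY_STOP_MASK)
            break;
    }
    return ret;
}
```

The model makes the two behaviors described above concrete: registration order does not matter (priority does), and a single NOTIFY_STOP keeps later notifiers from ever seeing the event.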
Atomic notifiers replace the rwsem with a spinlock; the API is very similar:
ATOMIC_NOTIFIER_HEAD(my_notifier);
struct atomic_notifier_head my_notifier;
ATOMIC_INIT_NOTIFIER_HEAD(my_notifier);
int atomic_notifier_chain_register(struct atomic_notifier_head *chain,
struct notifier_block *nb);
int atomic_notifier_chain_unregister(struct atomic_notifier_head *chain,
struct notifier_block *nb);
int atomic_notifier_call_chain(struct atomic_notifier_head *chain,
unsigned long event,
void *data);
Note that atomic notifiers use the same notifier_block structure as the blocking variety. Nothing in the atomic notifier code will ever sleep, however, and notifier functions called from an atomic chain are not allowed to sleep either.
As noted above, all notifier chains in the kernel have been changed to one of the above types; any out-of-tree code which uses a kernel chain will have to be updated accordingly. See the explanatory text for the notifier patch for a summary of what type was assigned to each existing chain in the mainline kernel.
The kernel lock validator
Locking is a necessary evil in operating systems; without a solid locking regime, different parts of the system will collide when trying to access the same resources, leading to data corruption and general chaos. But locking has hazards of its own; carelessly implemented locking can cause system deadlocks. As a simple example, consider two locks L1 and L2. Any code which requires both locks must take care to acquire the locks in the right order. If one function acquires L1 before L2, but another function acquires them in the opposite order, eventually the system will find itself in a situation where each function has acquired one lock and is blocked waiting for the other - a deadlock.

A race condition like the one described above may be a one-in-a-million possibility, but, with computers, it does not take too long to exercise a code path a million times. Sooner or later, a system containing this sort of bug will lock up, leaving its users wondering what is going on. To avoid this sort of situation, kernel developers try to define rules for the order in which locks should be acquired. But, in a system with many thousands of locks, defining a comprehensive set of rules is challenging at best, and enforcing them is even harder. So locking bugs creep into the kernel, lurk until some truly inconvenient time, and eventually surprise some unsuspecting user.
Over time, the kernel developers have made increasing use of automated code analysis tools as they become available. The latest such tool is the first version of the lock validator patch, posted by Ingo Molnar. This patch (a 61-part set, actually) adds a complex infrastructure to the kernel which can then be used to prove that none of the locking patterns observed in a running system could ever deadlock the kernel.
To that end, the lock validator must track real locking patterns in the kernel. There is no point, however, in tracking every individual lock - there are thousands of them, but many of them are treated in exactly the same way by the kernel. For example, every inode structure contains a spinlock, as does every file structure. Once the kernel has seen how locking is handled for one inode structure, it knows how it will be handled for every inode structure. So, somehow, the lock validator needs to be able to recognize that all spinlocks contained within (for example) the inode structure are essentially the same.
To this end, every lock in the system (including rwlocks and mutexes, now) is assigned a specific key. For locks which are declared statically (for example, files_lock, which protects the list of open files), the address of the lock is used as the key. Locks which are allocated dynamically (as most locks embedded within structures are) cannot be tracked that way, however; there may be vast numbers of addresses involved, and, in any case, all locks associated with a specific structure field should be mapped to a single key. This is done by recognizing that these locks are initialized at run time, so, for example, spin_lock_init() is redefined as:
# define spin_lock_init(lock) \
do { \
static struct lockdep_type_key __key; \
\
__spin_lock_init((lock), #lock, &__key); \
} while (0)
Thus, for each lock initialization, this code creates a static variable (__key) and uses its address as the key identifying the type of the lock. Since any particular type of lock tends to be initialized in a single place, this trick associates the same key with every lock of the same type.
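The same trick works in ordinary userspace C; in this hypothetical demonstration, two distinct initialization sites yield distinct keys, while repeated initializations from one site share a key:

```c
#include <assert.h>

/* Userspace demonstration of the key trick: a static variable
 * declared inside a macro body gives each textual use of the macro
 * its own unique address, which can serve as a per-call-site key. */
struct fake_lock { void *key; };

#define FAKE_LOCK_INIT(lock)                    \
    do {                                        \
        static int __key;                       \
        (lock)->key = &__key;                   \
    } while (0)

/* Two different "types" of lock, each initialized at its own site. */
static void init_as_inode_lock(struct fake_lock *l) { FAKE_LOCK_INIT(l); }
static void init_as_file_lock(struct fake_lock *l)  { FAKE_LOCK_INIT(l); }
```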
Next, the validator code intercepts every locking operation and performs a number of tests:
- The code looks at all other locks which are already held when a new
lock is taken. For all of those locks, the validator looks for a past
occurrence where any of them were taken after the new lock. If
any such are found, it indicates a violation of locking order rules,
and an eventual deadlock.
- A stack of currently-held locks is maintained, so any lock being
released should be at the top of the stack; anything else means that
something strange is going on.
- Any spinlock which is acquired by a hardware interrupt handler can
never be held when interrupts are enabled. Consider what happens when
this rule is broken. A kernel function, running in process context,
acquires a specific lock. An interrupt arrives, and the associated
interrupt handler runs on the same CPU; that handler then attempts to
acquire the same lock. Since the lock is unavailable, the handler
will spin, waiting for the lock to become free. But the handler has
preempted the only code which will ever free that lock, so it will
spin forever, deadlocking that processor.
To catch problems of this type, the validator records two bits of information for every lock it knows about: (1) whether the lock has ever been acquired in hardware interrupt context, and (2) whether the lock is ever held by code which runs with hardware interrupts enabled. If both bits are set, the lock is being used erroneously and an error is signaled.
- Similar tests are made for software interrupts, which present the same problems.
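The interrupt checks can be modeled with exactly the two bits described above; this is a hypothetical userspace sketch, not the validator's actual data structures:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the interrupt-safety check: track, per lock
 * key, whether the lock was ever taken in hardirq context and
 * whether it was ever held with interrupts enabled.  Both bits set
 * means a potential interrupt deadlock. */
struct lock_stats {
    bool ever_in_hardirq;
    bool ever_irqs_enabled;
};

/* Returns true if this acquisition reveals an unsafe usage pattern. */
static bool record_acquire(struct lock_stats *s,
                           bool in_hardirq, bool irqs_enabled)
{
    if (in_hardirq)
        s->ever_in_hardirq = true;
    if (irqs_enabled)
        s->ever_irqs_enabled = true;
    return s->ever_in_hardirq && s->ever_irqs_enabled;
}
```

Note that the two acquisitions which together prove the bug need not happen anywhere near each other in time; the sticky bits are what let the validator connect them.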
The interrupt tests are relatively straightforward, requiring just four bits of information for each lock (though the situation is a little more complicated for rwlocks). But the ordering tests require a bit more work. For every known lock key, the validator maintains two lists. One of them contains all locks which have ever been held when the lock of interest (call it L) is acquired; it thus contains the keys of all locks which might be acquired before L. The other list (the "after" list) holds all locks acquired while L is held. These two lists thus encapsulate the proper ordering of how those other locks should be acquired relative to L.
Whenever L is acquired, the validator checks whether any lock on the "after" list associated with L is already held. It should not find any, since all locks on the "after" list should only be acquired after acquiring L. Should it find a lock which should not be held, an error is signaled. The validator code also takes the "after" list of L, connects it with the "before" lists of the currently-held locks, and convinces itself that there are no ordering or interrupt violations anywhere within that chain. If all the tests pass, the validator updates the various "before" and "after" lists and the kernel continues on its way.
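As an illustration of the ordering check (greatly simplified relative to the real validator, which uses hashed lookups and per-key lists), a userspace sketch might record every held-before/acquired-after key pair and flag the reverse:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of the ordering check: remember every
 * (held-before, acquired-after) pair of lock keys ever observed, and
 * flag an acquisition if the reverse pair is already on record. */
#define MAX_PAIRS 64
#define MAX_HELD  8

static int before_key[MAX_PAIRS], after_key[MAX_PAIRS];
static int npairs;

static int held[MAX_HELD];   /* stack of currently-held lock keys */
static int nheld;

static bool pair_seen(int b, int a)
{
    for (int i = 0; i < npairs; i++)
        if (before_key[i] == b && after_key[i] == a)
            return true;
    return false;
}

/* Returns true if taking 'key' now inverts a previously seen order. */
static bool validator_acquire(int key)
{
    bool violation = false;

    for (int i = 0; i < nheld; i++) {
        if (pair_seen(key, held[i]))   /* reverse order on record */
            violation = true;
        if (!pair_seen(held[i], key) && npairs < MAX_PAIRS) {
            before_key[npairs] = held[i];
            after_key[npairs] = key;
            npairs++;
        }
    }
    held[nheld++] = key;
    return violation;
}

static void validator_release(int key)
{
    (void)key;
    nheld--;   /* model assumes strictly nested release order */
}
```

Even this toy version shows the essential property: the inversion is detected on the very first out-of-order acquisition, without the two conflicting paths ever having to run concurrently.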
Needless to say, all this checking imposes a certain amount of overhead; it is not something which one will want to enable on production kernels. It is not quite as bad as one might expect, however. As the kernel does its thing, the lock validator maintains its stack of currently-held locks. It also generates a 64-bit hash value from that series of locks. Whenever a particular combination of locks is validated, the associated hash value is stored in a table. The next time that lock sequence is encountered, the code can find the associated hash value in the table and know that the checks have already been performed. This hashing speeds the process considerably.
Of course, there are plenty of exceptions to the locking rules as understood by the validator. As a result, a significant portion of the validator patch set is aimed at getting rid of false error reports. For example, the validator normally complains if more than one lock with the same key is held at the same time - doing so is asking for deadlocks. There are situations, however, where this pattern is legitimate. For example, the block subsystem will often lock a block device, then lock a partition within that device. Since the partition also looks like a block device, the validator signals an error. To keep that from happening, the validator implements the notion of lock "subtypes." In this case, locks on partition devices can be marked with a different subtype, allowing their usage to be validated properly. This marking is done by using new versions of the locking functions (spin_lock_nested(), for example) which take a subtype parameter.
The lock validator was added to 2.6.17-rc5-mm1, so interested people can play with it. Waiting for another -mm release might not be a bad idea, however; there has since been a fairly long series of validator fixes posted.
The key point behind all of this is that deadlock situations can be found without having to actually make the kernel lock up. By watching the sequences in which locks are acquired, the validator can extrapolate a much larger set of possible sequences. So, even though a particular deadlock might only happen as the result of unfortunate timing caused by a specific combination of strange hardware, a rare set of configuration options, 220V power, a slightly flaky video controller, Mars transiting through Leo, an old version of gcc, an application which severely stresses the system (yum, say), and an especially bad Darl McBride hair day, the validator has a good chance of catching it. So this code should result in a whole class of bugs being eliminated from the kernel code base; that can only be a good thing.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet
