Brief items
The current stable 2.6 kernel is 2.6.16.19,
released on May 30. It
contains a single fix for an information leak in the netfilter code.
The current 2.6 prepatch is 2.6.17-rc5, released by Linus on
May 24. With luck, this will be the final prepatch before the final
2.6.17 release. It consists of a fair number of fixes; see the long-format changelog for
the details.
Several dozen patches (all fixes) have found their way into the mainline
after the -rc5 release.
The current -mm tree is 2.6.17-rc5-mm1. Recent changes
to -mm include the generic IRQ
layer, an updated version of reiser4, the lock validator (see below),
the adaptive readahead patch
set, a new infrastructure for maintaining kernel statistics, and a new
kernel API for inotify.
Comments (none posted)
Kernel development news
The final 2.6.17 kernel release is getting close. Further internal API
changes in this cycle are (one hopes) highly unlikely, so the following
list should be definitive for this time around.
- Support for the SPARC "Niagara" architecture.
- EXPORT_SYMBOL_GPL_FUTURE()
has been merged.
- The safe notifier patch has been
merged, creating a new API for all notifier users.
- The SLAB_NO_REAP slab cache option, which ostensibly caused
the slab not to be cleaned up when the system is under memory
pressure, has been removed. The kmem_cache_t typedef is also
being phased out in favor of struct kmem_cache.
- The "softmac" 802.11 subsystem has been merged. This code may
eventually be phased out, however, in favor of the Devicescape code.
- There is a new real-time clock subsystem, providing generalized RTC
support and a well-defined driver interface.
- A new utility function has been added:
int execute_in_process_context(void (*fn)(void *data),
void *data,
struct execute_work *work);
This function will arrange for fn() to be called in process
context (where it can sleep). Depending on when
execute_in_process_context() is called, fn() could
be invoked immediately or delayed by way of a work queue.
- The SMP alternatives
patch has been merged.
- A rework of the relayfs API - but the sysfs interface has been left
out for now.
- There is a new tracing mechanism for developers debugging block
subsystem code.
- There is a new internal flag (FMODE_EXEC) used to indicate
that a file has been opened for execution.
- The obsolete MODULE_PARM() macro is gone forevermore.
- A new function, flush_anon_page(), can be used in conjunction
with get_user_pages() to safely perform DMA to anonymous
pages in user space.
- Zero-filled memory can now be allocated from slab caches with
kmem_cache_zalloc(). There is also a new slab debugging
option to produce a /proc/slab_allocators file with detailed
allocation information.
- There are four new ways of creating mempools:
mempool_t *mempool_create_page_pool(int min_nr, int order);
mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size);
mempool_t *mempool_create_kzalloc_pool(int min_nr, size_t size);
mempool_t *mempool_create_slab_pool(int min_nr,
struct kmem_cache *cache);
The first creates a pool which allocates whole pages (the number of
which is determined by order), while the second and third create a
pool backed by kmalloc() and kzalloc(),
respectively. The fourth is a shorthand form of creating slab-backed
pools.
- The prototype for hrtimer_forward() has changed:
unsigned long hrtimer_forward(struct hrtimer *timer,
ktime_t now, ktime_t interval);
The new now argument is expected to be the current time.
This change allows some calls to be optimized. The data
field has also been removed from the hrtimer structure.
- A whole set of generic bit operations (find first set, count set bits,
etc.) has been added, helping to unify this code across architectures
and subsystems.
- The inode f_ops pointer - which refers to the
file_operations structure for the open file - has been marked
const. Quite a bit of code, which used to change that
structure, has been changed to compensate. Similar changes have been
made in many filesystems. "The goal is both to increase
correctness (harder to accidentally write to shared datastructures)
and reducing the false sharing of cachelines with things that get
dirty in .data (while .rodata is nicely read only and thus cache
clean)."
- local_t is now a signed type.
- Attributes in sysfs can be
pollable.
- A class_device can now have attribute groups created at
registration time; to take advantage of this capability, store the
desired groups in the new groups field.
- The splice(), vmsplice(), and tee() system
calls have been merged. Supporting those calls requires implementing
two new file_operations methods. See this article for the final
form of the splice_read() and splice_write()
functions.
As always, look at the LWN 2.6 kernel API changes page
for a list of changes over time.
Comments (none posted)
While plowing through the flood of patches early in the 2.6.17 cycle, your
editor missed a significant API change: the new notifier interface.
Notifiers are an internal kernel mechanism allowing code to register to be
told about events of interest. There are notifiers for memory hotplug
events, CPU frequency policy changes, USB hotplug events, module loading
and unloading, system reboots, network device changes, and more.
Back in November, 2005, this page looked at a proposed notifier API
change motivated by the lack of locking on the notifier chains
themselves. That proposal received a lukewarm reception. Many low-level
data structures in the kernel explicitly avoid performing any locking, on
the assumption that the higher layers will have to be concerned with their
own locking in any case. So, it was asked, why should notifiers be any
different? The answer seems to be that, unlike many other data structures,
notifiers tend to be used across relatively wide parts of the kernel,
making it hard to use any locking regime except one designed for the
notifiers themselves. In any case, a version of the notifier patch was
merged for 2.6.17-rc1.
The current form of the API defines three different types of notifiers:
- Blocking notifiers are always called from process context. The
notifier code - along with the notification routines it calls - is
allowed to sleep.
- Atomic notifiers can be called from atomic context, no sleeping
allowed.
- Raw notifiers have no internal locking and no associated rules; they
are simply the older form of the notifier API, preserved as a
historical relic.
For 2.6.17, all notifier chains have been converted to the blocking or
atomic types; there are no users of the raw interface in the mainline
kernel. The notifier patch includes no threatening noises about removing
the raw interface, but, sooner or later, somebody is likely to come along
and want to clean it up. So avoiding raw notifiers is probably a good
idea; this article will concentrate on the other two types.
Blocking notifiers are essentially a raw notifier with an rwsem added for
mutual exclusion. Any operation on a blocking notifier may, well, block on
that rwsem. These notifiers can be created in the usual two ways:
#include <linux/notifier.h>
BLOCKING_NOTIFIER_HEAD(my_notifier);
struct blocking_notifier_head my_notifier;
BLOCKING_INIT_NOTIFIER_HEAD(my_notifier);
Code which wishes to hook into a blocking notifier should first fill in a
notifier_block structure:
struct notifier_block {
int (*notifier_call)(struct notifier_block *block,
unsigned long event,
void *data);
int priority;
/* ... */
};
The notifier_call field should point to the function to be called
when something interesting happens; the event and data
parameters will be provided by the code generating the event. Notifiers
are called in order of increasing priority; the return value from
the final notifier called will be passed back to the code signalling the
event. Normally, the final notifier is the one with the highest
priority value, but any notifier can halt further processing by
returning a value with the bit indicated by NOTIFIER_STOP_MASK
set. Other than that one bit (currently 0x8000), the return
values are arbitrary (as far as the notification code is concerned), but
the convenience values NOTIFY_OK ("so far so good"),
NOTIFY_STOP ("all is well, but don't call any more notifiers") and
NOTIFY_BAD ("stop calling notifiers and veto the proposed action")
are available.
Once the code has a notifier_block ready, it should register it
with:
int blocking_notifier_chain_register(struct blocking_notifier_head *chain,
struct notifier_block *nb);
The return value is apparently intended to allow an error status to be
returned if the registration fails, but the 2.6.17 version of the code
cannot fail.
A blocking notifier can be unregistered with:
int blocking_notifier_chain_unregister(struct blocking_notifier_head *chain,
struct notifier_block *nb);
This call will return -ENOENT if the given notifier was not
actually registered.
Code which wishes to use a blocking notifier chain to signal an event can
do so with:
int blocking_notifier_call_chain(struct blocking_notifier_head *chain,
unsigned long event,
void *data);
This function will call all notifiers in chain (unless one of them
stops the process partway through), returning the value from the last
notifier called.
Atomic notifiers replace the rwsem with a spinlock; the API
is very similar:
ATOMIC_NOTIFIER_HEAD(my_notifier);
struct atomic_notifier_head my_notifier;
ATOMIC_INIT_NOTIFIER_HEAD(my_notifier);
int atomic_notifier_chain_register(struct atomic_notifier_head *chain,
struct notifier_block *nb);
int atomic_notifier_chain_unregister(struct atomic_notifier_head *chain,
struct notifier_block *nb);
int atomic_notifier_call_chain(struct atomic_notifier_head *chain,
unsigned long event,
void *data);
Note that atomic notifiers use the same notifier_block structure
as the blocking variety does. Nothing will ever sleep in the atomic
notifier code, however, and notifier functions called from an atomic chain
are not allowed to sleep either.
As noted above, all notifier chains in the kernel have been changed to
one of the above types; any out-of-tree code which uses a kernel chain will
have to be updated accordingly. See the explanatory text for the
notifier patch for a summary of what type was assigned to each existing
chain in the mainline kernel.
Comments (none posted)
Locking is a necessary evil in operating systems; without a solid locking
regime, different parts of the system will collide when trying to access
the same resources, leading to data corruption and general chaos. But
locking has hazards of its own; carelessly implemented locking can cause
system deadlocks. As a simple example, consider two locks
L1 and
L2. Any code which requires
both locks must take care to acquire the locks in the right order. If one
function acquires
L1 before
L2, but
another function acquires them in the opposite order, eventually the system will
find itself in a situation where each function has acquired one lock and is
blocked waiting for the other - a deadlock.
A race condition like the one described above may be a one-in-a-million
possibility, but, with computers, it does not take too long to exercise a
code path a million times. Sooner or later, a system containing this sort
of bug will lock up, leaving its users wondering what is going on. To
avoid this sort of situation, kernel developers try to define rules for the
order in which locks should be acquired. But, in a system with many
thousands of locks, defining a comprehensive set of rules is challenging at
best, and enforcing them is even harder. So locking bugs creep into the
kernel, lurk until some truly inconvenient time, and eventually surprise
some unsuspecting user.
Over time, the kernel developers have made increasing use of automated code
analysis tools as those tools become available. The latest such is the first version of the lock
validator patch, posted by Ingo Molnar. This patch (a 61-part set,
actually) adds a complex infrastructure to the kernel which can then be
used to prove that none of the locking patterns observed in a running
system could ever deadlock the kernel.
To that end, the lock validator must track real locking patterns in the
kernel. There is no point, however, in tracking every individual lock -
there are thousands of them, but many of them are treated in exactly the
same way by the kernel. For example, every inode structure
contains a spinlock, as does every file structure. Once the
kernel has seen how locking is handled for one inode structure, it
knows how it will be handled for every inode structure. So,
somehow, the lock validator needs to be able to recognize that all
spinlocks contained within (for example) the inode structure are
essentially the same.
To this end, every lock in the system (including rwlocks and mutexes, now)
is assigned a specific key. For locks which are declared statically (for
example, files_lock, which protects the list of open files), the
address of the lock is used as the key. Locks which are allocated
dynamically (as most locks embedded within structures are) cannot be
tracked that way, however; there may be vast numbers of addresses involved,
and, in any case, all locks associated with a specific structure field
should be mapped to a single key. This is done by recognizing that these
locks are initialized at run time, so, for example,
spin_lock_init() is redefined as:
# define spin_lock_init(lock) \
do { \
static struct lockdep_type_key __key; \
\
__spin_lock_init((lock), #lock, &__key); \
} while (0)
Thus, for each lock initialization, this code creates a static variable
(__key) and uses its address as the key identifying the type of
the lock. Since any particular type of lock tends to be initialized in a
single place, this trick associates the same key with every lock of the
same type.
Next, the validator code intercepts every locking operation and performs a
number of tests:
- The code looks at all other locks which are already held when a new
lock is taken. For all of those locks, the validator looks for a past
occurrence where any of them were taken after the new lock. If
any such are found, it indicates a violation of locking order rules,
and an eventual deadlock.
- A stack of currently-held locks is maintained, so any lock being
released should be at the top of the stack; anything else means that
something strange is going on.
- Any spinlock which is acquired by a hardware interrupt handler can
never be held when interrupts are enabled. Consider what happens when
this rule is broken. A kernel function, running in process context,
acquires a specific lock. An interrupt arrives, and the associated
interrupt handler runs on the same CPU; that handler then attempts to
acquire the same lock. Since the lock is unavailable, the handler
will spin, waiting for the lock to become free. But the handler has
preempted the only code which will ever free that lock, so it will
spin forever, deadlocking that processor.
To catch problems of this type, the validator records two bits of
information for every lock it knows about: (1) whether the lock
has ever been acquired in hardware interrupt context, and
(2) whether the lock is ever held by code which runs with
hardware interrupts enabled. If both bits are set, the lock is being used
erroneously and an error is signaled.
- Similar tests are made for software interrupts, which present the same
problems.
The interrupt tests are relatively straightforward, requiring just four
bits of information for each lock (though the situation is a little more
complicated for rwlocks). But the ordering tests require a bit more work.
For every known lock key, the validator maintains two lists. One of them
contains all locks which have ever been held when the lock of interest
(call it L) is
acquired; it thus contains the keys of all locks which might be acquired
before L. The other list (the "after" list)
holds all locks acquired while the L is held. These two lists thus
encapsulate the proper ordering of how those other locks should be acquired
relative to L.
Whenever L is
acquired, the validator checks whether any lock on the "after" list
associated with L is already held. It should not find any, since
all locks on the "after" list should only be acquired after acquiring
L. Should it find a lock which should not be held, an error is
signaled. The validator code also takes the "after" list of L, connects it
with the "before" lists of the currently-held locks, and convinces itself
that there are no ordering or interrupt violations anywhere within that chain.
If all the tests pass, the validator updates the various "before" and
"after" lists and the kernel continues on its way.
Needless to say, all this checking imposes a certain amount of overhead; it
is not something which one will want to enable on production kernels. It
is not quite as bad as one might expect, however. As the kernel does its
thing, the lock validator maintains its stack of currently-held locks. It
also generates a 64-bit hash value from that series of locks. Whenever a
particular combination of locks is validated, the associated hash value is
stored in a table. The next time that lock sequence is encountered, the
code can find the associated hash value in the table and know that the
checks have already been performed. This hashing speeds the process
considerably.
Of course, there are plenty of exceptions to the locking rules as
understood by the validator. As a result, a significant portion of the
validator patch set is aimed at getting rid of false error reports. For
example, the validator normally complains if more than one lock with the
same key is held at the same time - doing so is asking for deadlocks.
There are situations, however, where this pattern is legitimate. For
example, the block subsystem will often lock a block device, then lock a
partition within that device. Since the partition also looks like a block
device, the validator signals an error. To keep that from happening, the
validator implements the notion of lock "subtypes." In this case, locks on
partition devices can be marked with a different subtype, allowing their
usage to be validated properly. This marking is done by using new versions
of the locking functions (spin_lock_nested(), for example) which
take a subtype parameter.
The lock validator was added to 2.6.17-rc5-mm1, so interested
people can play with it. Waiting for another -mm release might not be a
bad idea, however; there has since been a fairly long series of validator
fixes posted.
The key point behind all of this is that deadlock situations can be found
without having to actually make the kernel lock up. By watching the
sequences in which locks are acquired, the validator can extrapolate a much
larger set of possible sequences. So, even though a particular deadlock
might only happen as the result of unfortunate timing caused by a specific
combination of strange hardware, a rare set of configuration options, 220V
power, a slightly flaky video controller, Mars transiting through Leo, an
old version of gcc, an application which severely stresses the
system (yum, say), and an especially bad Darl McBride hair day,
the validator has a good chance of catching it. So this code should result
in a whole class of bugs being eliminated from the kernel code base; that
can only be a good thing.
Comments (36 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>