The current 2.6 prepatch is 2.6.16-rc6
on March 11.
Linus notes: "Ok, we're getting closer, although the 2.6.16 release
certainly seems to drag out more than it should have.
" As one would
expect, this patch is dominated by fixes; see the long-format changelog
for the details.
The mainline git repository contains a few dozen patches merged since -rc6;
many of them are fixes for bugs found in the Coverity scan. There is also
a patch which disables the sysfs interface for the error detection and
correction (EDAC) subsystem; that interface "needs more thought" and so
will be hidden until the issues get worked out.
The current -mm tree is 2.6.16-rc6-mm1. Recent changes
to -mm include a new set of NFS superblock sharing patches (which are
creating NFS problems for some testers) and a bunch of fixes.
Comments (none posted)
Kernel development news
Oh, and women don't fall for the "I hack kernel stuff" line. I was
-- Mariusz Mazur gives up on linux-libc-headers
Comments (3 posted)
2.6.16 should be sufficiently stable, at this point, that it is safe to
make a list of API changes. As usual, this list will be folded into the LWN 2.6 API changes page
- The mutex code has been
merged. The use of semaphores for mutual exclusion is now deprecated,
and the current semaphore API may go away altogether.
- The high-resolution kernel
timer code has been merged. The new API allows for greater
precision in timer values, though the underlying implementation is
still limited by the timer interrupt resolution.
- A new list function, list_for_each_entry_safe_reverse(), does
just what one would expect.
- A 64-bit atomic type, atomic_long_t, has been added.
Supported functions are:
- long atomic_long_read(atomic_long_t *l);
- void atomic_long_set(atomic_long_t *l, long i);
- void atomic_long_inc(atomic_long_t *l);
- void atomic_long_dec(atomic_long_t *l);
- void atomic_long_add(long i, atomic_long_t *l);
- void atomic_long_sub(long i, atomic_long_t *l);
- The "SLOB" memory allocator has been merged. SLOB is a drop-in
replacement for the slab allocator, intended for very low-memory
- The dentry structure has been changed: the d_child
and d_rcu fields are now overlaid in a union. This change
shrinks this heavily-used structure and improves its cache behavior.
- The usb_driver structure has a new field
(no_dynamic_id) which lets a driver disable the addition of
dynamic device IDs. The owner field has also been removed
from this structure.
- The device probe() and remove() methods have been moved
from struct device_driver to struct bus_type. The
bus-level methods will override any remaining driver methods.
- Some significant changes to the SCSI subsystem aimed at eliminating
the use of the old scsi_request structure. The SCSI software
IRQ is no longer used; postprocessing happens via the generic block
software IRQ instead.
- Much of the core device model code has been reeducated to use the term
"uevent" instead of "hotplug." Some changes which are visible outside
of the core code include:
- kobject_hotplug() becomes kobject_uevent()
- struct kset_hotplug_ops becomes struct
kset_uevent_ops, and its hotplug() member is now
- add_hotplug_env_var() becomes add_uevent_var()
- The block I/O barrier code has been rewritten. This
patch changes the barrier API and also adds a new parameter to
- The block_device_operations structure has a new method
getgeo(); its job is to fill in an hd_geometry
structure with information about the drive. With this operation in
place, many block drivers will not need an ioctl() function
- Linas Vepstas's PCI error
recovery patch has been merged.
- Compilers prior to gcc 3.2 can no longer be used to build
- The venerable "make bzImage" command no longer works; just type "make"
- When the kernel is configured to be optimized for size, gcc (if it's
version 4.x) is given the freedom to decide whether inline
functions should really be inlined. The __always_inline
attribute now truly forces inlining in all cases. This is an outcome
from the discussion on
inline functions held at the beginning of the year.
Comments (3 posted)
Nobody could ever claim that there is a shortage of Linux virtualization
technologies to choose from. There are numerous approaches, from
lightweight "container" techniques which simply create walls between parts
of the system, to full virtualization approaches which implement a complete
virtual hardware platform capable of running a number of (unmodified)
operating systems. Between the two are "paravirtualization" approaches
which require a certain amount of awareness in the guest kernel. To many,
paravirtualization seems like the best approach, in that it promises to
combine a relatively high level of performance with strong isolation of
guest systems. Xen is currently the highest-profile paravirtualization
system out there, but there are others.
Each paravirtualization approach places its own demands on the guest
system. Before a particular system can run under a given hypervisor, it
must be modified to work with that hypervisor's interface. This
requirement can add to the work of creating a virtual system in the first
place, and it increases the maintenance burden going forward, especially if
both the hypervisor and the guest kernel are under heavy development.
In an attempt to make life easier for virtualization hackers, Zachary
Amsden (of VMware) has put forward a complex proposal for a common
virtual machine interface (VMI) layer with some interesting properties.
The VMI layer defines a set of calls for performing machine-specific
functions - the sorts of things that generally require hypervisor
intervention. These calls are very low-level -
operations like changing page protections, enabling interrupts, writing
model-specific registers, changing specific control registers, dealing with
timer events, etc. As a result, the VMI interface currently only works
with i386-architecture systems, though an x86-64 port is in the works.
When a virtualized kernel boots, one of the first things it does is search
for a "VMI ROM" provided by the hypervisor. That ROM provides the
information needed for the low-level VMI calls to interact with the
hypervisor. Using information found in the ROM, the just-booted kernel
modifies its own code to use the hypervisor's functions
without table lookups or indirect function calls.
As a result, hypervisor operations are fast.
There are a couple of interesting implications of this approach. One is
that a VMI-equipped kernel can run under any VMI hypervisor without
modification - or even recompilation. It simply grabs the ROM provided by
whatever hypervisor is present and gets on with life. Just as interesting
is the fact that such a kernel can run on the bare hardware with no
hypervisor at all, as the host kernel. The VMI developers state that the
performance impact of running with the VMI calls is essentially zero. That
leads to this claim:
VMI Linux has negligible overheads on native machines, so much so,
that we are confident that VMI Linux can, in the long run, be the
default Linux for i386.
The actual code is packaged as a 24-part patch. It involves significant
amounts of low-level tweaking and assembly language trickery. That may
have something to do with why there have been few comments on the code
itself. The discussion which has been seen seems somewhat favorable, if
reserved. Among other things, there will need to be an open source
hypervisor which uses this interface before it will be seriously considered
for merging. In the mean time, anybody interested in the details can learn
more from the documentation file.
Comments (6 posted)
The kernel includes a number of library routines for the implementation of
useful data structures. Among those are two types of trees: radix trees
and red-black trees. This article will have a look at the radix tree API,
with red-black trees to follow in the future.
Wikipedia has a radix
tree article, but Linux radix trees are not well described by that
article. A Linux radix tree is a mechanism by which a (pointer) value can
be associated with a (long) integer key. It is reasonably efficient in terms of
storage, and is quite quick on lookups. Additionally, radix trees in the
Linux kernel have some features driven by kernel-specific needs, including
the ability to associate tags with specific entries.
The cheesy diagram on the right shows a leaf node from a Linux radix tree.
The node contains a number of slots, each of which can contain a pointer to
something of interest to the creator of the tree. Empty slots contain a
NULL pointer. These trees are quite broad - in the 2.6.16-rc
kernels, there are 64 slots in each radix tree node. Slots are indexed by
a portion of the (long) integer key. If the highest key value is less than
64, the entire tree can be represented with a single node.
Normally, however, a rather larger set of keys is in use - otherwise, a
simple array could have been used. So a larger tree might look something
This tree is three levels deep. When the kernel goes to look up a specific
key, the most significant six bits will be used to find the appropriate
slot in the root node. The next six bits then index the slot in the middle
node, and the least significant six bits will indicate the slot containing a
pointer to the actual value. Nodes which have no children are not present
in the tree, so a radix tree can provide efficient storage for sparse
Radix trees have a few users in the mainline kernel tree. The PowerPC
architecture uses a tree to map between real and virtual IRQ numbers. The
NFS code attaches a tree to relevant inode structures to keep
track of outstanding requests. The most widespread use of radix trees,
however, is in the memory management code. The address_space
structure used to keep track of backing store contains a radix tree which
tracks in-core pages tied to that mapping. Among other things, this tree
allows the memory management code to quickly find pages which are dirty or
As is typical with kernel data structures, there are two modes for
declaring and initializing radix trees:
RADIX_TREE(name, gfp_mask); /* Declare and initialize */
struct radix_tree_root my_tree;
The first form declares and initializes a radix tree with the given
name; the second form performs the initialization at run time. In
either case, a gfp_mask must be provided to tell the code how
memory allocations are to be performed. If radix tree operations
(insertions, in particular) are to be performed in atomic context, the
given mask should be GFP_ATOMIC.
The functions for adding and removing entries are straightforward:
int radix_tree_insert(struct radix_tree_root *tree, unsigned long key,
void *radix_tree_delete(struct radix_tree_root *tree, unsigned long key);
A call to radix_tree_insert() will cause the given item
to be inserted (associated with key) in the given tree. This
operation may require memory allocations; should an allocation fail, the
insertion will fail and the return value will be -ENOMEM. The
code will refuse to overwrite an existing entry; if key already
exists in the tree, radix_tree_insert() will return
-EEXIST. On success, the return value is zero.
radix_tree_delete() removes the item associated with key
from tree, returning a pointer to that item if it was present.
There are situations where failure to insert an item into a radix tree can
be a significant problem. To help avoid such situations, a pair of specialized
functions are provided:
int radix_tree_preload(gfp_t gfp_mask);
This function will attempt to allocate sufficient memory (using the given
gfp_mask) to guarantee that the next radix tree insertion cannot
fail. The allocated structures are stored in a per-CPU variable, meaning
that the calling function must perform the insertion before it can schedule
or be moved to a different processor. To that end,
radix_tree_preload() will, when successful, return with preemption
disabled; the caller must eventually ensure that preemption is enabled
again by calling radix_tree_preload_end(). On failure,
-ENOMEM is returned and preemption is not disabled.
Radix tree lookups can be done in a few ways:
void *radix_tree_lookup(struct radix_tree_root *tree, unsigned long key);
void **radix_tree_lookup_slot(struct radix_tree_root *tree, unsigned long key);
unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
unsigned long first_index,
unsigned int max_items);
The simplest form, radix_tree_lookup(), looks for key in
the tree and returns the associated item (or NULL on
failure). radix_tree_lookup_slot() will, instead, return a
pointer to the slot holding the pointer to the item. The caller can, then,
change the pointer to associate a new item with the key. If the
item does not exist, however, radix_tree_lookup_slot() will not
create a slot for it, so this function cannot be used in place of
Finally, a call to radix_tree_gang_lookup() will return up to
max_items items in results, with ascending key values
starting at first_index. The number of items returned may be less
than requested, but a short return (other than zero) does not imply that
there are no more values in the tree.
One should note that none of the radix tree
functions perform any sort of locking internally. It is up to the caller
to ensure that multiple threads do not corrupt the tree or get into other
sorts of unpleasant race conditions. Nick Piggin currently has a patch
circulating which would use read-copy-update to free tree nodes; this patch
would allow lookup operations to be performed without locking as long as
(1) the resulting pointer is only used in atomic context, and
(2) the calling code avoids creating race conditions of its own. It
is not clear when that patch might be merged, however.
The radix tree code supports a feature called "tags," wherein specific bits
can be set on items in the tree. Tags are used, for example, to mark
memory pages which are dirty or under writeback. The API for working with
void *radix_tree_tag_set(struct radix_tree_root *tree,
unsigned long key, int tag);
void *radix_tree_tag_clear(struct radix_tree_root *tree,
unsigned long key, int tag);
int radix_tree_tag_get(struct radix_tree_root *tree,
unsigned long key, int tag);
radix_tree_tag_set() will set the given tag on the item
indexed by key; it is an error to attempt to set a tag on a
nonexistent key. The return value will be a pointer to the tagged item.
While tag looks like an arbitrary integer, the
code as currently written allows for a maximum of two tags. Use of any tag
value other than zero or one will silently corrupt memory in some
undesirable place; consider yourself warned.
Tags can be removed with radix_tree_tag_clear(); once again, the
return value is a pointer to the (un)tagged item. The function
radix_tree_tag_get() will check whether the item indexed by
key has the given tag set; the return value is zero if
key is not present, -1 if key is present but tag
is not set, and +1 otherwise. This function is currently commented out in
the source, however, since no in-tree code uses it.
There are two other functions for querying tags:
int radix_tree_tagged(struct radix_tree_root *tree, int tag);
unsigned int radix_tree_gang_lookup_tag(struct radix_tree_root *tree,
unsigned long first_index,
unsigned int max_items,
radix_tree_tagged() returns a non-zero value if any item in the
tree bears the given tag. A list of items with a given tag can be
obtained with radix_tree_gang_lookup_tag().
In concluding, we can note one other interesting aspect of the radix tree
API: there is no function for destroying a radix tree. It is, evidently,
assumed that radix trees will last forever. In practice, deleting all
items from a radix tree will free all memory associated with it other than
the root node, which can then be disposed of normally.
Comments (5 posted)
tutorial on creating /proc files
from loadable kernel modules.
"Here's a [module] that supports both reading and writing. This simple
application provides a fortune cookie dispenser. After the module is
loaded, the user can load text fortunes into it using the echo command and
then read them back out individually using the cat command.
don't try to get it merged into the mainline.
Comments (3 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Page editor: Jonathan Corbet
Next page: Distributions>>