The current 2.6 prepatch remains 2.6.15-rc5; Linus, it seems, has
been too busy stirring up desktop flamewars to get -rc6 out the door.
A slow stream of patches continues to accumulate in the mainline git
repository. These consist mostly of fixes, but there is also the removal
of the "incomplete mapping" support discussed here last week (it was deemed
unnecessary), a new rcu_barrier() primitive to wait until all
queued RCU callbacks have run, and a build system change making the
"optimize for size" option available for all configurations.
The current -mm tree is 2.6.15-rc5-mm2. Recent changes
to -mm include a couple of new inotify flags controlling which files are to
be watched, a Sony laptop ACPI driver, basic PCI domain support, a
schedule_on_each_cpu() function to run code on every processor, a
new high-resolution timers implementation, and a "batch" scheduling policy.
Comments (none posted)
Kernel development news
The Linux kernel contains a full counting semaphore implementation. Given
a semaphore, a call to down() will sleep until the semaphore
contains a positive value, decrement that value, and return. Calling
up() increments the semaphore's value and wakes up a process
waiting for the semaphore, if one exists. If the initial value of the
semaphore is ten, then ten different threads can call down()
without blocking.
Most users of semaphores do not use the counting feature, however.
Instead, they initialize the semaphore to a value of one, allowing a single
thread to hold the semaphore at any given time. This mode of use turns a
semaphore into a "mutex," a mutual exclusion primitive which can be used to
implement critical sections. Using a semaphore in this way is entirely
legitimate.
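The counting behavior described above can be sketched in user space. What follows is an illustration only, built on POSIX threads rather than on kernel primitives; the csem_* names are invented for this example and are not part of any kernel API. Initializing the count to one gives the "mutex" mode of use.

```c
#include <pthread.h>
#include <stddef.h>

/* User-space illustration only; the csem_* names are hypothetical. */
struct csem {
    pthread_mutex_t lock;     /* protects count */
    pthread_cond_t  nonzero;  /* signalled when count is incremented */
    int             count;
};

static void csem_init(struct csem *s, int initial)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->nonzero, NULL);
    s->count = initial;
}

/* Like down(): sleep until the count is positive, then decrement it. */
static void csem_down(struct csem *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count <= 0)
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

/* Non-blocking variant: returns 0 on success, 1 if it would have slept. */
static int csem_down_trylock(struct csem *s)
{
    int failed = 1;

    pthread_mutex_lock(&s->lock);
    if (s->count > 0) {
        s->count--;
        failed = 0;
    }
    pthread_mutex_unlock(&s->lock);
    return failed;
}

/* Like up(): increment the count and wake one waiter, if any. */
static void csem_up(struct csem *s)
{
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_cond_signal(&s->nonzero);
    pthread_mutex_unlock(&s->lock);
}
```

With csem_init(&s, 10), ten callers get through csem_down() before anybody sleeps; with csem_init(&s, 1), only one caller at a time can be between csem_down() and csem_up().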
There is one little issue, however: a simple binary mutex can often be
implemented more cheaply than a full counting semaphore. If a semaphore is
used in the mutex mode, the extra cost of the counting capability is simply
wasted. Linux semaphores also suffer from highly architecture-dependent
implementations, to the point that any changes to the semaphore API are
very difficult to make. So cleaning up semaphores has been one of those
items on the "to do" list for some time.
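To see why a binary mutex can be cheaper, consider this rough user-space sketch using C11 atomics. It is not the posted patch; the toy_mutex names are made up for illustration, and the lock path spins where a real kernel mutex would put the caller to sleep. The point is only that the entire state is a single flag, and the uncontended path is one atomic operation with no counter to maintain.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space sketch; this is not the posted kernel patch. */
struct toy_mutex {
    atomic_flag held;   /* set while some thread owns the mutex */
};

#define TOY_MUTEX_INIT { ATOMIC_FLAG_INIT }

/* Returns true if the mutex was acquired, false if it was already held. */
static bool toy_mutex_trylock(struct toy_mutex *m)
{
    /* test-and-set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set_explicit(&m->held,
                                              memory_order_acquire);
}

static void toy_mutex_lock(struct toy_mutex *m)
{
    /*
     * A real sleeping mutex would queue the caller and schedule here;
     * this sketch simply spins to stay self-contained.
     */
    while (!toy_mutex_trylock(m))
        ;
}

static void toy_mutex_unlock(struct toy_mutex *m)
{
    atomic_flag_clear_explicit(&m->held, memory_order_release);
}
```

A mutex declared with struct toy_mutex m = TOY_MUTEX_INIT; starts out unlocked.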
David Howells went ahead and did it. His patch adds a new, binary mutex type to the kernel. Since
almost all of the semaphores currently in use are, in reality, mutexes,
David changed the prototypes of most of the semaphore functions
(down() and variants, up(), init_MUTEX(),
DECLARE_MUTEX()) to take a mutex rather than a semaphore. To make
things work again, most semaphore declarations have been changed to
struct mutex, but, beyond the declaration change, code using
mutexes need not be modified.
For code which truly needs a semaphore, a new set of functions has been
provided:

    void down_sem(struct semaphore *sem);
    void up_sem(struct semaphore *sem);
    int down_sem_trylock(struct semaphore *sem);

Kernel code which was actually using the counting capability of semaphores
has been changed to use the new functions.
This patch makes fundamental changes to the kernel's mutual exclusion
mechanisms, creates a flag day which breaks all out-of-tree code, and is
generally quite large. But there is surprisingly little resistance to the
patch in general. Some developers are concerned that some counting
semaphores may have been converted to mutexes erroneously - it is hard to
audit that much code and be absolutely sure of how every semaphore is
used. It has also been noted that the posted mutex implementation may
actually be slower than the semaphores it replaces, but that is something
which, it is assumed, can be fixed. In general, however,
almost nobody objects to making this sort of change.
There are some disagreements over just how the change should be done,
however. Some developers do not want to see the old down() and
up() functions switched to a different type which has no counter
to bump "down" or "up." The alternative would be to create a completely
new API for mutexes; Alan Cox has suggested
names like sleep_lock() and sleep_unlock(). A completely
new API would make it clear what is really going on; it would also make it possible to
change over users gradually as they are audited.
Other developers would rather see a big flag day than a
year-long series of patches slowly converting semaphore users over to
mutexes. For them, the mutex changeover is a chance to get the API right,
and they would rather see everything changed over at once. Gradual
changeovers, it is argued, never seem to come to a conclusion; examples
include the continued existence of the big kernel lock and the
long-deprecated sleep_on() functions. Rather than live with a
deprecated API for years, it may be better to just take the pain all at
once and be done with it.
It has also been pointed out that there is another mutex patch in
circulation: the real-time preemption tree has had mutexes for the last
year. So far, there has been no real debate on whether the -rt
implementation is better; Ingo Molnar does not seem to be pushing it, even
though this might be a good opportunity to merge a significant chunk of the
-rt tree into the mainline.
In the end, it looks like some sort of mutex patch is likely to be merged
into a future mainline kernel - though it almost certainly will not be
ready when the 2.6.16 window opens. The form of that patch could change
significantly, however; stay tuned.
Comments (9 posted)
For years, otherwise useful kernel patches have been rejected because they
use language features which are not supported by version 2.95 of the gcc
compiler. The developers have been reluctant to remove support for this
ancient version of gcc (released in 1999) because some not-so-old
distributions used it, and because a couple of architectures required it.
More importantly, however: gcc 2.95 simply runs faster than later
versions. For a kernel hacker waiting for a build to complete, compilation
speed can be far more important than additional language features or more
highly optimized code generation.
In the middle of the mutex conversation, however, it was pointed out that
some of the alternatives under consideration would not work with 2.95. In
response, Andrew Morton, the biggest defender of 2.95 compatibility, threw in the towel. It seems that quite a few
things in the kernel already fail to work with 2.95, and the situation is
not getting better. So, says Andrew:
It's time to give up on it and just drink more coffee or play more
tetris or something, I'm afraid.
He followed up with a patch officially
removing gcc 2.95 compatibility from the kernel. A suggestion to drop gcc 3.0 quickly
followed; the 3.0 release was never widely used, and it lacks some features
that the kernel developers would like to use. Moving directly to 3.1 as
the oldest supported gcc would make life easier without a whole lot
of additional pain.
Nothing has been merged into the mainline yet - and may not be until 2.6.16
opens. But the writing is clearly on the wall: anybody still trying to use
these older compilers with current kernels will have to upgrade soon.
Comments (11 posted)
The i386 processor family poses a challenge for kernel builders. These
processors have maintained instruction set compatibility for many years;
code built for early Pentium processors will likely still run on current
hardware. The problem is that code built for these older processors will
fail to take advantage of features added later on. The "least common
denominator" approach can thus lead to sub-optimal use of current CPUs.
The kernel has a number of ways of dealing with this challenge. In some
cases it can make decisions at run time, using processor features only if
they are found to be present. Other features are only available by way of
build-time configuration options; selecting these will result in a kernel
which will not run on older systems. Yet another mechanism is the
"alternatives" feature, which allows the kernel to optimize itself at boot
time. Consider this example of alternatives use (from
#define mb() alternative("lock; addl $0,0(%%esp)", \
This macro places a memory barrier in the code, ensuring that all memory
reads and writes initiated before the barrier complete before execution
continues. The default implementation is essentially a bus-locked no-op;
it will work anywhere. On newer systems, however, the more efficient
mfence instruction is available, and it would be nice to use it.
The alternative() macro compiles in the default code, but also
makes a note of its location (and alternative implementation) in a special
ELF section. Early in the boot process, the kernel calls
apply_alternatives(), which makes a pass through that special
section. Every alternative instruction which is supported by the running
processor is patched directly into the loaded kernel image; it will be
filled with no-op instructions if need be. Once
apply_alternatives() has finished its work, the kernel behaves as
if it had been compiled for the processor it is actually running on. This
technique allows distributors to ship generic kernels which can optimize
themselves at boot time.
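The patching pass described above can be modeled in a few lines of ordinary C. This is a simplified user-space simulation, not the kernel's code: the struct alt_entry layout, the apply_alts() name, and the use of a plain bitmask for CPU features are all assumptions made for the sake of the example.

```c
#include <stddef.h>
#include <string.h>

#define NOP 0x90   /* single-byte x86 no-op */

/* Hypothetical record layout; the kernel's real structure may differ. */
struct alt_entry {
    size_t               offset;    /* where the default code sits */
    size_t               len;       /* length of the default code */
    const unsigned char *repl;      /* replacement instruction bytes */
    size_t               repl_len;  /* must not exceed len */
    unsigned             feature;   /* CPU feature bit the replacement needs */
};

/* Patch every entry whose feature the CPU has; returns the number applied. */
static int apply_alts(unsigned char *image, const struct alt_entry *tab,
                      size_t n, unsigned cpu_features)
{
    int applied = 0;

    for (size_t i = 0; i < n; i++) {
        const struct alt_entry *a = &tab[i];

        /* Skip unsupported features and replacements that do not fit. */
        if (!(cpu_features & a->feature) || a->repl_len > a->len)
            continue;

        /* Copy in the replacement, then pad the remainder with no-ops. */
        memcpy(image + a->offset, a->repl, a->repl_len);
        memset(image + a->offset + a->repl_len, NOP, a->len - a->repl_len);
        applied++;
    }
    return applied;
}
```

The padding step matters because the replacement may be shorter than the default code; the bytes left over must still decode as valid instructions.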
The 2.6 mainline uses alternatives sparingly: for barriers, prefetch hints,
and saving the floating point unit state. Gerd Knorr, however, believes
that the use of alternatives could be expanded to further reduce the range
of kernels which distributors need to ship - and to improve runtime
flexibility as well. In particular, he thinks that kernels can be
optimized for single- or multiprocessor systems on the fly.
Gerd's SMP alternatives patch
is an implementation of this concept. It creates a new macro
(alternative_smp()) which can be used to specify optimal
implementations of an operation on both uniprocessor and SMP systems; the
proper version will then be selected at runtime. The main use of SMP
alternatives in his patch is with spinlock operations; spinlocks can be
patched in or edited out, as dictated by the configuration of the system.
There are a couple of interesting features in Gerd's patch. One is in the
handling of the i386 architecture's lock prefix. This prefix,
when applied to specific instructions, causes the instruction to run in a
bus-locked, atomic manner. It is used for operations which must be seen
coherently across a multiprocessor system; these include semaphore
operations and the atomic_t implementation. Use of the
lock prefix on uniprocessor systems imposes a runtime cost with no
benefit; it would be nice to edit those out. The SMP alternatives patch
takes a shortcut here; it simply remembers each location where a
lock prefix appears. If the kernel boots on a uniprocessor
system, all of those prefixes can be quickly overwritten with no-ops.
A more interesting - and more controversial - feature of this patch is
that, when the kernel is converted between the SMP and uniprocessor mode,
the overwritten instructions are remembered. At some point in the future,
then, the alternatives code can reverse the change, switching the kernel
back to the full SMP implementation. The code is then run whenever a CPU
hotplug event happens, optimizing the kernel for the system's new
configuration. A system can be initially booted with a single processor,
and the alternatives code will edit out all of the SMP-related
instructions. If another processor is added later on, the kernel will be
automatically converted back into a fully SMP-capable mode. If processors
are removed, the SMP code can be taken out too. All within a running
system, with no need to reboot.
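The save-and-restore idea can be sketched as follows. Again, this is a user-space model under stated assumptions: struct smp_site, MAX_INSN, and the switch_to_up() and switch_to_smp() names are invented here and are not taken from Gerd's patch. A lock prefix is simply a site of length one whose uniprocessor replacement is a single no-op byte.

```c
#include <stddef.h>
#include <string.h>

#define MAX_INSN 16   /* assumed upper bound on a patched site, in bytes */

/* Hypothetical bookkeeping for one patchable location. */
struct smp_site {
    size_t        offset;            /* location within the image */
    size_t        len;               /* bytes to patch, <= MAX_INSN */
    unsigned char up[MAX_INSN];      /* uniprocessor replacement */
    unsigned char saved[MAX_INSN];   /* original SMP bytes */
    int           patched;           /* nonzero while the UP version is in place */
};

/* Convert to uniprocessor mode, remembering what was overwritten. */
static void switch_to_up(unsigned char *image, struct smp_site *sites, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct smp_site *s = &sites[i];

        if (s->patched || s->len > MAX_INSN)
            continue;
        memcpy(s->saved, image + s->offset, s->len);
        memcpy(image + s->offset, s->up, s->len);
        s->patched = 1;
    }
}

/* Convert back to SMP mode by restoring the remembered bytes. */
static void switch_to_smp(unsigned char *image, struct smp_site *sites, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct smp_site *s = &sites[i];

        if (!s->patched)
            continue;
        memcpy(image + s->offset, s->saved, s->len);
        s->patched = 0;
    }
}
```

The sketch ignores the hard part: a running kernel must ensure that no processor is executing the code being rewritten, which is one reason why this sort of patching makes people nervous.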
This feature may seem useful to a rather small minority of users - and it
is. But that minority may be bigger than one thinks. Virtualization
systems (and Xen in particular) are implementing the ability to configure
the number of (virtual) CPUs in each running instance on the fly, in
response to the load on each. So it may really be that a busy, virtualized
server will have CPUs hot-plugged into it, and that those processors will
go away when the load drops. Enabling the kernel to reconfigure itself on
the fly when this happens will allow each Xen instance to run a kernel
which is optimized for its current situation.
The CPU hotplug feature may be a hard sell - self-modifying code in a running
kernel tends to make people nervous. The rest of the SMP alternatives
patch seems likely to find a place in the mainline, eventually.
Comments (29 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Page editor: Jonathan Corbet