A slow stream of patches continues to accumulate in the mainline git repository. These consist mostly of fixes, but there is also the removal of the "incomplete mapping" support discussed here last week (it was deemed unnecessary), a new rcu_barrier() primitive to wait until all queued RCU callbacks have run, and a build system change making the "optimize for size" option available for all configurations.
The current -mm tree is 2.6.15-rc5-mm2. Recent changes to -mm include a couple of new inotify flags controlling which files are to be watched, a Sony laptop ACPI driver, basic PCI domain support, a schedule_on_each_cpu() function to run code on every processor, a new high-resolution timers implementation, and a "batch" scheduling policy.
Kernel development news
Most users of semaphores do not use the counting feature, however. Instead, they initialize the semaphore to a value of one, allowing a single thread to hold the semaphore at any given time. This mode of use turns a semaphore into a "mutex," a mutual exclusion primitive which can be used to implement critical sections. Using a semaphore in this way is entirely valid.
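The difference between the two modes of use can be seen in a small userspace model. This is only a sketch: struct toy_sem and its functions are hypothetical names invented for this illustration, not the kernel's semaphore API, and the model only offers a non-blocking "try" operation rather than sleeping as a real semaphore would.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy type for illustration; not the kernel's struct semaphore. */
struct toy_sem {
    atomic_int count;   /* how many more holders may still get in */
};

static void toy_sem_init(struct toy_sem *sem, int value)
{
    assert(value >= 0);
    atomic_init(&sem->count, value);
}

/* Non-blocking acquire: true if a slot was taken, false if none was free. */
static bool toy_sem_try_down(struct toy_sem *sem)
{
    int old = atomic_load(&sem->count);

    while (old > 0) {
        /* On failure, 'old' is refreshed with the current count. */
        if (atomic_compare_exchange_weak(&sem->count, &old, old - 1))
            return true;
    }
    return false;
}

/* Release one slot. */
static void toy_sem_up(struct toy_sem *sem)
{
    atomic_fetch_add(&sem->count, 1);
}

/* Count how many acquisitions succeed before the first failure. */
static int toy_sem_drain(struct toy_sem *sem)
{
    int taken = 0;

    while (toy_sem_try_down(sem))
        taken++;
    return taken;
}
```

Initialized with a value of one, the toy semaphore admits a single holder and behaves as a mutex; initialized with three, it admits three holders before refusing a fourth, which is the counting behavior that most users never need.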
There is one little issue, however: a simple binary mutex can often be implemented more cheaply than a full counting semaphore. If a semaphore is used in the mutex mode, the extra cost of the counting capability is simply wasted. Linux semaphores also suffer from highly architecture-dependent implementations, to the point that any changes to the semaphore API are very difficult to make. So cleaning up semaphores has been one of those items on the "to do" list for some time.
David Howells went ahead and did it. His patch adds a new, binary mutex type to the kernel. Since almost all of the semaphores currently in use are, in reality, mutexes, David changed the prototypes of most of the semaphore functions (down() and variants, up(), init_MUTEX(), DECLARE_MUTEX()) to take a mutex rather than a semaphore. To make things work again, most semaphore declarations have been changed to struct mutex, but, beyond the declaration change, code using mutexes need not be modified.
For code which truly needs a semaphore, a new set of functions has been provided:
    void down_sem(struct semaphore *sem);
    void up_sem(struct semaphore *sem);
    int down_sem_trylock(struct semaphore *sem);
    ...
Kernel code which was actually using the counting capability of semaphores has been changed to use the new functions.
This patch makes fundamental changes to the kernel's mutual exclusion mechanisms, creates a flag day which breaks all out-of-tree code, and is generally quite large. But there is surprisingly little resistance to the patch in general. Some developers are concerned that some counting semaphores may have been converted to mutexes erroneously - it is hard to audit that much code and be absolutely sure of how every semaphore is used. It has also been noted that the posted mutex implementation may actually be slower than the semaphores it replaces, but that is something which, it is assumed, can be fixed. In general, however, almost nobody objects to making this sort of change.
There are some disagreements over just how the change should be done, however. Some developers do not want to see the old down() and up() functions switched to a different type which has no counter to bump "down" or "up." The alternative would be to create a completely new API for mutexes; Alan Cox has suggested names like sleep_lock() and sleep_unlock(). A completely new API would make it clear what is really going on; it would also make it possible to change over users gradually as they are audited.
Some developers would rather see a big flag day than a year-long series of patches slowly converting semaphore users over to mutexes. For them, the mutex changeover is a chance to get the API right, and they would rather see everything changed over at once. Gradual changeovers, it is argued, never seem to come to a conclusion; examples include the continued existence of the big kernel lock and the long-deprecated sleep_on() functions. Rather than live with a deprecated API for years, it may be better to just take the pain all at once and be done with it.
It has also been pointed out that there is another mutex patch in circulation: the real-time preemption tree has had mutexes for the last year. So far, there has been no real debate on whether the -rt implementation is better; Ingo Molnar does not seem to be pushing it, even though this might be a good opportunity to merge a significant chunk of the -rt tree into the mainline.
In the end, it looks like some sort of mutex patch is likely to be merged into a future mainline kernel - though it almost certainly will not be ready when the 2.6.16 window opens. The form of that patch could change significantly, however; stay tuned.
In the middle of the mutex conversation, however, it was pointed out that some of the alternatives under consideration would not work with 2.95. In response, Andrew Morton, the biggest defender of 2.95 compatibility, threw in the towel. It seems that quite a few things in the kernel already fail to work with 2.95, and the situation is not getting better. So, says Andrew:
He followed up with a patch officially removing gcc 2.95 compatibility from the kernel. A suggestion to drop gcc 3.0 quickly followed; the 3.0 release was never widely used, and it lacks some features that the kernel developers would like to use. Moving directly to 3.1 as the oldest supported gcc would make life easier without a whole lot of additional pain.
Nothing has been merged into the mainline yet - and may not be until 2.6.16 opens. But the writing is clearly on the wall: anybody still trying to use these older compilers with current kernels will have to upgrade soon.
The kernel has a number of ways of dealing with this challenge. In some cases it can make decisions at run time, using processor features only if they are found to be present. Other features are only available by way of build-time configuration options; selecting these will result in a kernel which will not run on older systems. Yet another mechanism is the "alternatives" feature, which allows the kernel to optimize itself at boot time. Consider this example of alternatives use (from include/asm-i386/system.h):
    #define mb() alternative("lock; addl $0,0(%%esp)", \
                             "mfence", \
                             X86_FEATURE_XMM2)
This macro places a memory barrier in the code, ensuring that all memory reads and writes initiated before the barrier complete before execution continues. The default implementation is essentially a bus-locked no-op; it will work anywhere. On newer systems, however, the more efficient mfence instruction is available, and it would be nice to use it.
The alternative() macro compiles in the default code, but also makes a note of its location (and alternative implementation) in a special ELF section. Early in the boot process, the kernel calls apply_alternatives(), which makes a pass through that special section. Every alternative instruction which is supported by the running processor is patched directly into the loaded kernel image; if the replacement is shorter than the default code, the remaining space is filled with no-op instructions. Once apply_alternatives() has finished its work, the kernel behaves as if it had been compiled for the processor it is actually running on. This mechanism allows distributors to ship generic kernels which can optimize themselves at boot time.
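The patching pass described above can be modeled in userspace on a plain byte array. What follows is a minimal sketch under stated assumptions, not the kernel's code: struct alt_entry, apply_alternatives_sketch(), and SKETCH_FEATURE_SSE2 are hypothetical names, and the "image" is just a buffer standing in for loaded kernel text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOP_BYTE            0x90   /* one-byte x86 no-op */
#define SKETCH_FEATURE_SSE2 0x1u   /* hypothetical feature bit for this model */

/* Hypothetical record of one patch site; the kernel's structure differs. */
struct alt_entry {
    size_t offset;              /* where the default code starts in the image */
    size_t orig_len;            /* length of the default code */
    const unsigned char *repl;  /* replacement instruction bytes */
    size_t repl_len;            /* must not exceed orig_len */
    unsigned feature;           /* feature bit the replacement requires */
};

/*
 * Walk the table; for every entry whose feature the "CPU" has, copy the
 * replacement over the default code and pad the leftover bytes with no-ops.
 */
static void apply_alternatives_sketch(unsigned char *image,
                                      const struct alt_entry *tab, size_t n,
                                      unsigned cpu_features)
{
    for (size_t i = 0; i < n; i++) {
        const struct alt_entry *a = &tab[i];

        if (!(cpu_features & a->feature))
            continue;           /* keep the default code */

        assert(a->repl_len <= a->orig_len);
        memcpy(image + a->offset, a->repl, a->repl_len);
        memset(image + a->offset + a->repl_len, NOP_BYTE,
               a->orig_len - a->repl_len);
    }
}
```

As a usage example, take the five bytes F0 83 04 24 00 (used here as the encoding of the locked addl) as the default code and the three bytes 0F AE F0 (mfence) as the replacement. On a processor with the feature bit set, the site ends up holding the three mfence bytes followed by two no-ops; on one without it, the default bytes are left untouched.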
The 2.6 mainline uses alternatives sparingly: for barriers, prefetch hints, and saving the floating point unit state. Gerd Knorr, however, believes that the use of alternatives could be expanded to further reduce the range of kernels which distributors need to ship - and to improve runtime flexibility as well. In particular, he thinks that kernels can be optimized for single- or multiprocessor systems on the fly.
Gerd's SMP alternatives patch is an implementation of this concept. It creates a new macro (alternative_smp()) which can be used to specify optimal implementations of an operation on both uniprocessor and SMP systems; the proper version will then be selected at runtime. The main use of SMP alternatives in his patch is with spinlock operations; spinlocks can be patched in or edited out, as dictated by the configuration of the system at boot time.
There are a couple of interesting features in Gerd's patch. One is in the handling of the i386 architecture's lock prefix. This prefix, when applied to specific instructions, causes the instruction to run in a bus-locked, atomic manner. It is used for operations which must be seen coherently across a multiprocessor system; these include semaphore operations and the atomic_t implementation. Use of the lock prefix on uniprocessor systems imposes a runtime cost with no benefit; it would be nice to edit those out. The SMP alternatives patch takes a shortcut here; it simply remembers each location where a lock prefix appears. If the kernel boots on a uniprocessor system, all of those prefixes can be quickly overwritten with no-ops.
A more interesting - and more controversial - feature of this patch is that, when the kernel is converted between the SMP and uniprocessor modes, the overwritten instructions are remembered. At some point in the future, then, the alternatives code can reverse the change, switching the kernel back to the full SMP implementation. The code is then run whenever a CPU hotplug event happens, optimizing the kernel for the system's new configuration. A system can be initially booted with a single processor, and the alternatives code will edit out all of the SMP-related instructions. If another processor is added later on, the kernel will be automatically converted back into a fully SMP-capable mode. If processors are removed, the SMP code can be taken out too. All within a running system, with no need to reboot.
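The lock-prefix handling and its reversal can be modeled in the same userspace style. This is a sketch only: the function names and the site table are hypothetical, the byte buffer stands in for kernel text, and the use of the one-byte no-op (0x90) as the overwrite value simply follows the description above. Because the only byte ever overwritten at a recorded site is the lock prefix (0xF0), this model does not need to save the old contents in order to restore them.

```c
#include <assert.h>
#include <stddef.h>

#define LOCK_PREFIX_BYTE 0xF0   /* x86 lock prefix */
#define NOP_BYTE         0x90   /* one-byte x86 no-op */

/*
 * Uniprocessor mode: overwrite every recorded lock prefix with a no-op.
 * 'lock_sites' holds the offsets noted when the code was built.
 */
static void lock_sites_to_up(unsigned char *image,
                             const size_t *lock_sites, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char *p = &image[lock_sites[i]];

        /* A recorded site must hold either the prefix or an earlier no-op. */
        assert(*p == LOCK_PREFIX_BYTE || *p == NOP_BYTE);
        *p = NOP_BYTE;
    }
}

/* SMP mode: put the lock prefix back at every recorded site. */
static void lock_sites_to_smp(unsigned char *image,
                              const size_t *lock_sites, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char *p = &image[lock_sites[i]];

        assert(*p == LOCK_PREFIX_BYTE || *p == NOP_BYTE);
        *p = LOCK_PREFIX_BYTE;
    }
}
```

In this model, a hotplug handler would call lock_sites_to_smp() when a second processor comes online and lock_sites_to_up() when the system drops back to one. Both operations are idempotent, so running either twice is harmless. The sketch leaves out the hard part that makes people nervous: a real kernel must ensure that no processor is executing the bytes being rewritten.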
This feature may seem useful to a rather small minority of users - and it is. But that minority may be bigger than one thinks. Virtualization systems (and Xen in particular) are implementing the ability to configure the number of (virtual) CPUs in each running instance on the fly, in response to the load on each. So it may really be that a busy, virtualized server will have CPUs hot-plugged into it, and that those processors will go away when the load drops. Enabling the kernel to reconfigure itself on the fly when this happens will allow each Xen instance to run a kernel which is optimized for its current situation.
The CPU hotplug support may be a hard sell - self-modifying code in a running kernel tends to make people nervous. The rest of the SMP alternatives patch seems likely to find a place in the mainline, eventually.
Page editor: Jonathan Corbet
Copyright © 2005, Eklektix, Inc.