There is still no 2.6 prepatch
as the merge window remains open.
Patches continue to pour into the mainline git repository; see the article
below for details.
The current -mm tree is 2.6.21-mm2. Recent changes to
-mm include the removal of the adaptive readahead patches (in anticipation
of a newer, simpler version), a kernel-based mechanism for unprivileged
user mounts, and the removal of the staircase deadline scheduler. Mostly
-mm is shedding weight as patches move into the mainline.
For older kernels: 18.104.22.168 was released on
May 4, followed by 22.214.171.124 on May 9. Each
update contains a handful of important fixes.
Kernel development news
Really, we are likely be better off by risking the merge of _bad_
code (which in the swap-prefetch case is the exact opposite of the
truth), than to let code stagnate. People are clearly unhappy about
certain desktop aspects of swapping, and the only way out of that
is to let more people hack that code. Merging code involves more
people. It will cause 'noise' and could cause regressions, but at
least in this case the only impact is 'performance' and the feature
is trivial to disable.
-- Ingo Molnar
pushes for swap prefetch
Open source is about release early, release often. Not "hide code
in a dark corner until Christoph thinks it is perfect." We have
high standards for upstream merged code, but that standard is not
perfection. Perfect is the enemy of good.
-- Jeff Garzik
If your mission to another star *depends* on every single piece of
complex equipment staying up with zero reboots for 200+ years, you
have some serious technology problems.
-- Linus Torvalds
As of this writing, the 2.6.22 merge window remains open, with quite a bit
of code still expected to be merged. User-visible changes which have gone
in include:
- The mac80211 (formerly Devicescape) wireless networking stack has
finally found its way into the mainline. As of this writing there are
no drivers which actually use that stack, but drivers are said to be
in the works.
- The sysfs representation of i2c devices has changed in ways which
could break older tools. In particular, versions of lm_sensors prior
to 2.10.3 will have problems.
- A number of old USB touchscreen drivers (itmtouch,
mtouchusb, and touchkitusb) have been removed in
favor of the new usbtouchscreen driver.
- The x86_64 architecture has gained relocatable kernel support, a
necessary feature for those wanting to use the kexec-based crash dump
mechanism.
- Patching of low-level paravirtualization hooks can be inhibited at
boot time with the new noreplace-paravirt boot flag.
- The REORDER configuration option, which would rearrange
functions in the kernel binary for optimal performance, has been
removed from the x86_64 architecture.
- The CIFS filesystem supports IPv6 addresses. There is a new mount
option to allow user and group IDs to be overridden. A number of
performance improvements for CIFS were also merged.
- The kernel virtual machine (KVM) API has seen significant changes. If
earlier plans still hold, this should be the last set of incompatible
changes.
- There is now a framework for supporting the "RF kill" switches (which
disable the transmitter) found on many mobile devices.
- Support for filesystem "subtypes" has been added. The target here is
FUSE-based filesystems, which currently all look the same to the
kernel and are hard to specify in fstab. Now a FUSE
ssh-based filesystem can have the type "fuse.sshfs".
- Entries in /proc now exist to provide position and flags
information for all open file descriptors.
- There is a new system call:
long utimensat(int dirfd, char *filename, struct timespec *times,
               int flags);
This call allows an application to set the access and modification
times for the given filename with nanosecond precision.
- The device mapper has a new "delay" target which can delay I/O
operations; this may seem like a feature of dubious value but it's
intended for testing only.
- Motorola sysv68 disk partition tables are now supported.
- There is a new private futex mechanism which improves scalability by
avoiding the shared global namespace.
- The PowerPC architecture supports the concept of "slices" - special
areas of memory which can have different page sizes. The feature is
similar to hugetlbfs, but with more page size flexibility.
- New hardware supported includes Picotux 200 ARM boards, ADS7846
touchscreen devices, D-Link DSM-G600 boards, MIPS RM9122 integrated
serial ports, PMC-Sierra MSP71xx serial devices, MS7712SE01 boards,
L-BOX RE2 router boards, SH7780 and SH7722 Solution Engine boards, Sun
XVR-500 and XVR-2500 framebuffers, SUN4U PCI-E controllers, Apple
system management controllers, Ricoh RS5C313 clock chips, Maxim DS1WM
one-wire ASIC cores, Alchemy au1500 programmable serial controllers,
Intel LE80578-based framebuffers, PowerPC 750 "Holly" platforms,
PowerPC 440GP "Ebony" reference boards, Maxim MAX6650 and MAX6651 fan
controllers, Analog Devices AD741x monitoring chips, Intel Core
temperature sensors, PA Semi PA6T-1682M random number generators,
VIA VT8623 framebuffers,
and various drivers for the new "Blackfin" architecture.
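The utimensat() call mentioned in the list above can be exercised from user space along the following lines. This is only a sketch: it assumes the wrapper found in current C libraries (which returns int and takes a two-element times array), and stamp_temp_file() is a hypothetical helper invented for the illustration. The filesystem may store timestamps with a coarser granularity than one nanosecond, so the value read back can be rounded down.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD */
#include <stdlib.h>     /* mkstemp() */
#include <sys/stat.h>   /* utimensat(), stat() */
#include <time.h>       /* struct timespec */
#include <unistd.h>     /* close(), unlink() */

/*
 * Hypothetical helper: create a temporary file, set its access and
 * modification times to sec/nsec with utimensat(), and read the
 * modification time back with stat().
 * Returns 0 on success and -1 on any failure.
 */
int stamp_temp_file(time_t sec, long nsec, struct timespec *mtime_out)
{
    char path[] = "/tmp/utimensat-sketch-XXXXXX";
    struct timespec times[2];
    struct stat st;
    int fd, ret = -1;

    assert(nsec >= 0 && nsec < 1000000000L);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    close(fd);

    times[0].tv_sec = sec;      /* access time */
    times[0].tv_nsec = nsec;
    times[1] = times[0];        /* modification time */

    /* AT_FDCWD: a relative path would be resolved against the cwd */
    if (utimensat(AT_FDCWD, path, times, 0) == 0 && stat(path, &st) == 0) {
        *mtime_out = st.st_mtim;
        ret = 0;
    }

    unlink(path);
    return ret;
}
```

A caller would pass the desired seconds and nanoseconds and then inspect the tv_sec and tv_nsec fields of the returned structure.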
Changes visible to kernel developers include:
- The i2c layer has seen significant new changes meant to make i2c
drivers look more like drivers for other buses. There are, for
example, new probe() and remove() methods for
notifying devices when i2c peripherals come and go. Since i2c is not
a self-describing bus, the support code still needs help to know where
i2c devices might be; for many classes of device, this information can
be had from the system BIOS.
- The crypto API has a new set of functions for use with asynchronous
block ciphers. There is also a new cryptd kernel thread
which can run any synchronous cipher in an asynchronous mode.
- The subsystem structure has been removed from the Linux
device model; there never really was any need for it. Most code which
was expecting a struct subsystem argument has been changed to
use the relevant kset instead.
- There is a new version of the in-kernel rpcbind (portmapper) client
which supports versions 2-4 of the rpcbind protocol. The portmapper
API has changed as a result.
- Numerous changes to the paravirt_ops methods have been made.
Additionally, paravirt_ops is no longer a GPL-only export.
- There is a new memory function:
void *krealloc(const void *p, size_t new_size, gfp_t flags);
As one would expect, it changes the size of the allocated memory, moving it
if need be.
- The SLUB allocator has
been merged as an experimental (for now) alternative to the slab code.
- A new macro has been added to make the creation of slab caches easier:
struct kmem_cache *KMEM_CACHE(struct_type, flags);
The result is the creation of a cache holding objects of the given
struct_type, named after that type, and with the additional
slab flags (if any).
- The SLAB_DEBUG_INITIAL flag has been removed, along with the
associated SLAB_CTOR_VERIFY flag passed to constructors. The
result is a set of changes which ripples through quite a few source
files. The unused SLAB_CTOR_ATOMIC flag is also gone.
- The "quicklist" mechanism has been merged. Quicklists are a simple
lookaside cache for page table pages which optimize the allocation and
initialization of those pages.
- The SuperH architecture has working kgdb support again.
- The ia64 architecture has a new tool which will inject machine check
errors into a running system. Not recommended for production systems.
- The deferrable timers
patch has been merged. There is also a new macro for initializing
workqueue entries (INIT_DELAYED_WORK_DEFERRABLE()) which
causes the job to be queued in a deferrable manner.
- The old SA_* interrupt flags have not been removed as
originally scheduled, but their use will now generate warnings at
compile time.
- There is a new list_first_entry() macro which, surprisingly,
gets the first entry from a list.
- The atomic64_t and local_t types are now fully
supported on a wider set of architectures.
- The "hibernation" (suspend to disk) code has been separated from the
"suspend" (to RAM) code as part of a larger effort to distinguish
between those two very different operations.
- Workqueues have been reworked again. There is a new function:
void cancel_work_sync(struct work_struct *work);
This function tries to cancel a single workqueue entry, be it on the
shared (keventd) or a private workqueue.
Meanwhile run_scheduled_work() has been removed.
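For readers who have not worked with the kernel's linked lists, the idea behind the list_first_entry() macro mentioned in the list above can be sketched in user space. Everything below is a simplified stand-in written for this illustration, not the kernel's own header; struct job, init_list_head(), and first_job_id() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */

/* Simplified user-space stand-in for the kernel's circular list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the containing structure from a pointer to its member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* The first entry is whatever follows the list head. */
#define list_first_entry(head, type, member) \
    list_entry((head)->next, type, member)

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Hypothetical structure carrying an embedded list link. */
struct job {
    int id;
    struct list_head link;
};

int first_job_id(struct list_head *queue)
{
    assert(queue->next != queue);   /* the list must not be empty */
    return list_first_entry(queue, struct job, link)->id;
}
```

As in the kernel, the macro assumes a non-empty list; applied to an empty one it would hand back a pointer computed from the list head itself.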
The merging process is not yet done, so expect another big set of patches
to go into 2.6.22 before the window closes.
The last time this page looked at the kevent interface, it seemed
to have reached the end of its run. The eventfd patches had
stolen the thunder, providing a way for applications to wait on many types
of events using the standard polling interfaces. The kevent developer has
shelved the work on the assumption that it would not get in. That
assumption appeared to be justified, given that Andrew Morton, in his
2.6.22 merge plans document, said
that the eventfd patches would be included.
As was mentioned last week, one obstacle came up in the form of pollfs, an implementation of a
very similar idea. There were a couple of relatively harsh reviews of the
pollfs code, and its profile appears to have lowered considerably. It is
possible that a new, improved version of pollfs could show up in the near
future, but it would have to be a lot better to grab a significant amount
of attention. The pollfs code has probably shown up too late to the game.
There's another late arrival who will have to be listened to, however:
glibc maintainer Ulrich Drepper. Having sat out the discussion of eventfd,
he is now back and opposing its inclusion
into the mainline:
It's Linus decision whether he wants to add yet more code, yet more
possible problems, yet more maintenance overhead/nightmare for an
interim solution which isn't necessary, which cannot solve all the
problems, and which is not as scalable as other proposed methods.
I can only say that I would be trickly [sic] against it. It makes
just no sense.
Ulrich has a number of complaints about the eventfd approach:
- The eventfd code, by relying on poll() and variants, does not
provide a way for applications to obtain events without entering the
kernel. For high-bandwidth applications - big network servers, for
example - eliminating system calls is one of the keys to adequate
performance. The kevent code, with its user-space event ring,
provides that sort of mechanism while eventfd does not.
- The use of poll() also makes it hard for the kernel to pass
information back to the application - the communication channel only
includes a few bits. The kevent interface allows for a fair amount of
information to be packaged with each event. Eventfd gets around this
problem by allowing applications to read more event information from
the relevant file descriptors - but that requires another system call.
- Ulrich argues that the poll()
interface poses unsolvable issues with regard to threads and
cancellation processing. This argument is not universally accepted, however.
- The current eventfd code does not let applications wait on futexes,
and Davide Libenzi, the eventfd developer, is disinclined to add that support. The
pollfs patches do support futex waits, though Ulrich had some issues
with the implementation. In general, Ulrich would like to see a
single system call where applications can wait for anything, so
leaving out primitives like futexes will leave him unsatisfied.
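The "another system call" complaint in the list above can be seen in a small user-space sketch. It assumes the eventfd() wrapper as it exists in current C libraries, which takes an extra flags argument; signal_and_collect() is a hypothetical name. Note that poll() reports only readiness bits, so the counter value must be fetched with a separate read().

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: add n to an eventfd counter, wait for the
 * descriptor with poll(), then read the counter back.
 * Returns the value read, or 0 on failure.
 */
uint64_t signal_and_collect(uint64_t n)
{
    uint64_t value = 0;
    struct pollfd pfd;
    int efd;

    assert(n > 0 && n < UINT64_MAX);

    efd = eventfd(0, 0);
    if (efd < 0)
        return 0;

    /* Post the event: the kernel adds n to the counter. */
    if (write(efd, &n, sizeof(n)) != (ssize_t)sizeof(n))
        goto out;

    pfd.fd = efd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* System call one: poll() only says "readable", nothing more. */
    if (poll(&pfd, 1, 1000) != 1 || !(pfd.revents & POLLIN))
        goto out;

    /* System call two: fetch the actual event information. */
    if (read(efd, &value, sizeof(value)) != (ssize_t)sizeof(value))
        value = 0;
out:
    close(efd);
    return value;
}
```

A user-space event ring of the sort kevent provides would, by contrast, let the application pick up both the notification and its payload without entering the kernel.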
The end result of this is that Ulrich opposes the merging of eventfd; he
would rather see the effort go into making kevent (or a replacement with
similar functionality) ready for the mainline. A kevent-like interface, he
says, will eventually become necessary in any case:
I think we ultimately have to have something like kevent and then
all this *fd() work is unnecessary and just adds code to the kernel
which has to be kept around and which might hinder further work in
How this issue will be resolved is entirely unclear. There's not been a
flood of developers lining up to support Ulrich's position - but they are
not opposing him either. Nobody has dusted off the kevent patches for
another round of discussion - yet. But one thing that does seem likely is
that this whole discussion may delay the merging of eventfd past the 2.6.22
merge window. User-space interfaces are important and, once they are added
to the kernel, they are almost impossible to remove. Waiting another
development cycle seems like a small price to pay if it helps the
developers to get this decision right.
Update: the eventfd code was merged into the mainline on May 11.
Your editor's copy of The C Programming Language, Second Edition
(copyright 1988, still known as "the new C book") has the following to say
about the volatile keyword:
The purpose of volatile is to force an implementation to suppress
optimization that could otherwise occur. For example, for a
machine with memory-mapped input/output, a pointer to a device
register might be declared as a pointer to volatile, in
order to prevent the compiler from removing apparently redundant
references through the pointer.
C programmers have often taken volatile to mean that the variable
could be changed outside of the current thread of execution; as a result,
they are sometimes tempted to use it in kernel code when shared data
structures are being used. Andrew Morton recently called out use of volatile in a
submitted patch, saying:
The volatiles are a worry - volatile is said to be
basically-always-wrong in-kernel, although we've never managed to
document why, and i386 cheerfully uses it in readb() and friends.
In response, Randy Dunlap pulled together some
email from Linus on the topic and suggested to your editor that he
could maybe help "document why." Here is the result.
The point that Linus often makes with regard to volatile is that
its purpose is to suppress optimization, which is almost never what one
really wants to do. In the kernel, one must protect accesses to data
against race conditions, which is very much a different task.
Like volatile, the kernel primitives which make concurrent access
to data safe (spinlocks, mutexes, memory barriers, etc.) are designed to
prevent unwanted optimization. If they are
being used properly, there will be no need to use volatile as
well. If volatile is still necessary, there is almost
certainly a bug in
the code somewhere. In properly-written kernel code, volatile can
only serve to slow things down.
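Consider a typical block of kernel code: it acquires the_lock with
spin_lock(), operates on shared_data, and releases the lock with
spin_unlock().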
If all the code follows the locking rules, the value of
shared_data cannot change unexpectedly while the_lock is
held. Any other code which might want to play with that data will be
waiting on the lock. The spinlock primitives act as memory barriers - they
are explicitly written to do so -
meaning that data accesses will not be optimized across them. So the
compiler might think it knows what will be in shared_data, but the
spin_lock() call will force it to forget anything it knows. There
will be no optimization problems with accesses to that data.
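The same reasoning can be tried out in user space. The sketch below is an analogy only: it substitutes a POSIX mutex for the kernel's spinlock (the kernel primitives are not available outside the kernel), and run_workers() is a hypothetical helper. The point is that shared_data is an ordinary, non-volatile variable, yet the count comes out right because every access happens with the lock held.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 16

static pthread_mutex_t the_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_data;        /* deliberately NOT volatile */

static void *worker(void *arg)
{
    long i, rounds = *(long *)arg;

    for (i = 0; i < rounds; i++) {
        pthread_mutex_lock(&the_lock);
        shared_data++;          /* only touched with the lock held */
        pthread_mutex_unlock(&the_lock);
    }
    return NULL;
}

/*
 * Hypothetical helper: run nthreads workers, each incrementing
 * shared_data `rounds` times. Returns the final count, or -1 if a
 * thread could not be created.
 */
long run_workers(int nthreads, long rounds)
{
    pthread_t tid[MAX_WORKERS];
    int i, started = 0;

    assert(nthreads > 0 && nthreads <= MAX_WORKERS);
    shared_data = 0;

    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &rounds) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    /* pthread_join() orders the workers' writes before this read. */
    return started == nthreads ? shared_data : -1;
}
```

Declaring shared_data volatile here would not remove the need for the mutex; it would only constrain how the compiler may handle accesses inside the critical section.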
If shared_data were declared volatile, the locking would
still be necessary. But the compiler would also be prevented from
optimizing access to shared_data within the critical section,
when we know that nobody else can be working with it. While the lock is
held, shared_data is not volatile. This is why Linus says:
Also, more importantly, "volatile" is on the wrong _part_ of the
whole system. In C, it's "data" that is volatile, but that is
insane. Data isn't volatile - _accesses_ are volatile. So it may
make sense to say "make this particular _access_ be careful", but
not "make all accesses to this data use some random strategy".
When dealing with shared data, proper locking makes volatile
unnecessary - and potentially harmful.
The volatile type qualifier was originally meant for memory-mapped
I/O registers. Within the kernel, register accesses, too, should be
protected by locks, but one also does not want the compiler "optimizing"
register accesses within a critical section. But, within the kernel, I/O
memory accesses are always done through accessor functions; accessing I/O
memory directly through pointers is frowned upon and does not work on all
architectures. Those accessors are written to prevent unwanted
optimization, so, once again, volatile is unnecessary.
Another situation where one might be tempted to use volatile is
when the processor is busy-waiting on the value of a variable. The right
way to perform a busy wait is:
while (my_variable != what_i_want)
    cpu_relax();
The cpu_relax() call can lower CPU power consumption or yield to a
hyperthreaded twin processor; it also happens to serve as a memory
barrier, so, once again, volatile is unnecessary. Of course,
busy-waiting is generally an anti-social act to begin with.
There are still a few rare situations where volatile makes sense
in the kernel:
- The above-mentioned accessor functions might use volatile on
architectures where direct I/O memory access does work. Essentially,
each accessor call becomes a little critical section on its own and
ensures that the access happens as expected by the programmer.
- Inline assembly code which changes memory, but which has no other
visible side effects, risks being deleted by GCC. Adding the
volatile keyword to asm statements will prevent this removal.
- The jiffies variable is special in that it can have a
different value every time it is referenced, but it can be read
without any special locking. So jiffies can be
volatile, but the addition of other variables of this type is
frowned upon. Jiffies is considered to be a "stupid legacy" issue
in this regard.
For most code, none of the above justifications for volatile
apply. As a result, the use of volatile is likely to be seen as a
bug and will bring additional scrutiny to the code. Developers who are
tempted to use volatile should take a step back and think about
what they are truly trying to accomplish.
(Thanks to Randy Dunlap for getting things started and researching the
issue, and to Satyam Sharma and Johannes Stezenbach for comments on the
first draft of this article).
Page editor: Jonathan Corbet