Kernel development
Brief items
Kernel release status
There is still no 2.6.22 prepatch as the merge window remains open. Patches continue to pour into the mainline git repository; see the article below for details.
The current -mm tree is 2.6.21-mm2. Recent changes to -mm include the removal of the adaptive readahead patches (in anticipation of a newer, simpler version), a kernel-based mechanism for unprivileged user mounts, and the removal of the staircase deadline scheduler. Mostly, -mm is shedding weight as patches move into the mainline.
For older kernels: 2.6.16.50 was released on May 4, followed by 2.6.16.51 on May 9. Each update contains a handful of important fixes.
Kernel development news
Quotes of the week
More stuff for 2.6.22
As of this writing, the 2.6.22 merge window remains open, with quite a bit of code still expected to be merged. User-visible changes which have gone in include:
- The mac80211 (formerly Devicescape) wireless networking stack has
finally found its way into the mainline. As of this writing there are
no drivers which actually use that stack, but drivers are said to be
in the works.
- The sysfs representation of i2c devices has changed in ways which
could break older tools. In particular, versions of lm_sensors prior
to 2.10.3 will have problems.
- A number of old USB touchscreen drivers (itmtouch,
mtouchusb, and touchkitusb) have been removed in
favor of the new usbtouchscreen driver.
- The x86_64 architecture has gained relocatable kernel support, a
necessary feature for those wanting to use the kexec-based crash dump
mechanism.
- Patching of low-level paravirtualization hooks can be inhibited at
boot time with the new noreplace-paravirt boot flag.
- The REORDER configuration option, which would rearrange
functions in the kernel binary for optimal performance, has been
removed from the x86_64 architecture.
- The CIFS filesystem supports IPv6 addresses. There is a new mount
option to allow user and group IDs to be overridden. A number of
performance improvements for CIFS were also merged.
- The kernel virtual machine (KVM) API has seen significant changes. If
earlier plans still hold, this should be the last set of incompatible
KVM changes.
- There is now a framework for supporting the "RF kill" switches (which
disable the transmitter) found on many mobile devices.
- Support for filesystem "subtypes" has been added. The target here is
FUSE-based filesystems, which currently all look the same to the
kernel and are hard to specify in fstab. Now a FUSE
ssh-based filesystem can have the type "fuse.sshfs".
- Entries in /proc now exist to provide position and flags
information for all open file descriptors.
- There is a new system call:
    long utimensat(int dirfd, char *filename, struct timespec *times, int flags);
This call allows an application to set the access and modification times for the given filename with nanosecond precision.
- The device mapper has a new "delay" target which can delay I/O
operations; this may seem like a feature of dubious value but it's
intended for testing only.
- Motorola sysv68 disk partition tables are now supported.
- There is a new private futex mechanism which improves scalability by
avoiding the shared global namespace.
- The PowerPC architecture supports the concept of "slices" - special
areas of memory which can have different page sizes. The feature is
similar to hugetlbfs, but with more page size flexibility.
- New hardware supported includes Picotux 200 ARM boards, ADS7846 touchscreen devices, D-Link DSM-G600 boards, MIPS RM9122 integrated serial ports, PMC-Sierra MSP71xx serial devices, MS7712SE01 boards, L-BOX RE2 router boards, SH7780 and SH7722 Solution Engine boards, Sun XVR-500 and XVR-2500 framebuffers, SUN4U PCI-E controllers, Apple system management controllers, Ricoh RS5C313 clock chips, Maxim DS1WM one-wire ASIC cores, Alchemy au1500 programmable serial controllers, Intel LE80578-based framebuffers, PowerPC 750 "Holly" platforms, PowerPC 440GP "Ebony" reference boards, Maxim MAX6650 and MAX6651 fan controllers, Analog Devices AD741x monitoring chips, Intel Core temperature sensors, PA Semi PA6T-1682M random number generators, VIA VT8623 framebuffers, and various drivers for the new "Blackfin" architecture.
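The utimensat() call listed above can be exercised from user space. What follows is a minimal sketch, assuming the C library wrapper found on current Linux systems (which returns int and takes a const pathname, unlike the kernel-side prototype shown earlier). The helper names set_file_times_ns() and get_file_mtime() are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD, open() flags */
#include <sys/stat.h>   /* utimensat(), stat() */
#include <time.h>       /* struct timespec */

/*
 * Hypothetical helper: set both the access and modification times of
 * `path` to the same second/nanosecond pair. Relative paths are
 * resolved against the current directory (AT_FDCWD).
 * Returns 0 on success, -1 on error with errno set by utimensat().
 */
int set_file_times_ns(const char *path, time_t sec, long nsec)
{
    struct timespec times[2];

    assert(nsec >= 0 && nsec < 1000000000L);
    times[0].tv_sec = sec;      /* times[0] is the access time */
    times[0].tv_nsec = nsec;
    times[1] = times[0];        /* times[1] is the modification time */

    return utimensat(AT_FDCWD, path, times, 0);
}

/* Hypothetical helper: read back the modification time of `path`. */
int get_file_mtime(const char *path, struct timespec *out)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    *out = st.st_mtim;          /* nanosecond-resolution timestamp field */
    return 0;
}
```

Whether the nanosecond part survives a round trip depends on the filesystem; some store timestamps with coarser granularity, so only the seconds field can be relied upon everywhere.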
Changes visible to kernel developers include:
- The i2c layer has seen significant new changes meant to make i2c
drivers look more like drivers for other buses. There are, for
example, new probe() and remove() methods for
notifying devices when i2c peripherals come and go. Since i2c is not
a self-describing bus, the support code still needs help to know where
i2c devices might be; for many classes of device, this information can
be had from the system BIOS.
- The crypto API has a new set of functions for use with asynchronous
block ciphers. There is also a new cryptd kernel thread
which can run any synchronous cipher in an asynchronous mode.
- The subsystem structure has been removed from the Linux
device model; there never really was any need for it. Most code which
was expecting a struct subsystem argument has been changed to
use the relevant kset instead.
- There is a new version of the in-kernel rpcbind (portmapper) client
which supports versions 2-4 of the rpcbind protocol. The portmapper
API has changed as a result.
- Numerous changes to the paravirt_ops methods have been made.
Additionally, paravirt_ops is no longer a GPL-only export.
- There is a new memory function:
    void *krealloc(const void *p, size_t new_size, gfp_t flags);
As one would expect, it changes the size of the allocated memory, moving it if need be.
- The SLUB allocator has
been merged as an experimental (for now) alternative to the slab code.
- A new macro has been added to make the creation of slab caches easier:
    struct kmem_cache *KMEM_CACHE(struct_type, flags);
The result is the creation of a cache holding objects of the given struct_type, named after that type, and with the additional slab flags (if any).
- The SLAB_DEBUG_INITIAL flag has been removed, along with the
associated SLAB_CTOR_VERIFY flag passed to constructors. The
result is a set of changes which ripples through quite a few source
files. The unused SLAB_CTOR_ATOMIC flag is also gone.
- The "quicklist" mechanism has been merged. Quicklists are a simple
lookaside cache for page table pages which optimize the allocation and
initialization of those pages.
- The SuperH architecture has working kgdb support again.
- The ia64 architecture has a new tool which will inject machine check
errors into a running system. Not recommended for production
machines.
- The deferrable timers
patch has been merged. There is also a new macro for initializing
workqueue entries (INIT_DELAYED_WORK_DEFERRABLE()) which
causes the job to be queued in a deferrable manner.
- The old SA_* interrupt flags have not been removed as
originally scheduled, but their use will now generate warnings at
compile time.
- There is a new list_first_entry() macro which, surprisingly,
gets the first entry from a list.
- The atomic64_t and local_t types are now fully
supported on a wider set of architectures.
- The "hibernation" (suspend to disk) code has been separated from the
"suspend" (to RAM) code as part of a larger effort to distinguish
between those two very different operations.
- Workqueues have been reworked again. There is a new
function:
    void cancel_work_sync(struct work_struct *work);
This function tries to cancel a single workqueue entry, be it on the shared (keventd) or a private workqueue. Meanwhile, run_scheduled_work() has been removed.
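For readers who have not worked with the kernel's linked lists, the list_first_entry() macro mentioned above can be illustrated outside the kernel. This is a user-space sketch, not the kernel header itself: struct list_head and the macros are minimal stand-ins written for this example, and struct job, list_init(), and first_job_id() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */

/* Minimal user-space stand-in for the kernel's list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded node. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* The first entry is simply whatever the list head's next pointer embeds. */
#define list_first_entry(ptr, type, member) \
    list_entry((ptr)->next, type, member)

void list_init(struct list_head *head)
{
    head->next = head->prev = head;   /* an empty list points at itself */
}

void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Hypothetical payload type, used only for this illustration. */
struct job {
    int id;
    struct list_head link;
};

/* Return the id of the first queued job; the list must not be empty. */
int first_job_id(struct list_head *queue)
{
    /* On an empty list the macro would yield a bogus pointer. */
    assert(queue->next != queue);
    return list_first_entry(queue, struct job, link)->id;
}
```

The macro does no emptiness check of its own, which is why the sketch asserts before using it.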
The merging process is not yet done, so expect another big set of patches to go into 2.6.22 before the window closes.
The return of kevent?
The last time this page looked at the kevent interface, it seemed to have reached the end of its run. The eventfd patches had stolen the thunder, providing a way for applications to wait on many types of events using the standard polling interfaces. The kevent developer has shelved the work on the assumption that it would not get in. That assumption appeared to be justified, given that Andrew Morton, in his 2.6.22 merge plans document, said that the eventfd patches would be included.
As was mentioned last week, one obstacle came up in the form of pollfs, an implementation of a very similar idea. There were a couple of relatively harsh reviews of the pollfs code, and its profile appears to have lowered considerably. It is possible that a new, improved version of pollfs could show up in the near future, but it would have to be a lot better to grab a significant amount of attention. The pollfs code has probably shown up too late to the game.
There's another late arrival who will have to be listened to, however: glibc maintainer Ulrich Drepper. Having sat out the discussion of eventfd, he is now back and opposing its inclusion into the mainline:
I can only say that I would be trickly [sic] against it. It makes just no sense.
Ulrich has a number of complaints about the eventfd approach:
- The eventfd code, by relying on poll() and variants, does not
provide a way for applications to obtain events without entering the
kernel. For high-bandwidth applications - big network servers, for
example - eliminating system calls is one of the keys to adequate
performance. The kevent code, with its user-space event ring,
provides that sort of mechanism while eventfd does not.
- The use of poll() also makes it hard for the kernel to pass
information back to the application - the communication channel only
includes a few bits. The kevent interface allows for a fair amount of
information to be packaged with each event. Eventfd gets around this
problem by allowing applications to read more event information from
the relevant file descriptors - but that requires another system call.
- Ulrich argues that the poll()
interface poses unsolvable issues with regard to threads and
cancellation processing. This argument is not universally accepted, however.
- The current eventfd code does not let applications wait on futexes, and Davide Libenzi, the eventfd developer, is disinclined to add that support. The pollfs patches do support futex waits, though Ulrich had some issues with the implementation. In general, Ulrich would like to see a single system call where applications can wait for anything, so leaving out primitives like futexes will leave him unsatisfied.
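The "extra system call" complaint in the list above is easier to see in code. The following is a sketch only, assuming the eventfd() interface as exposed by current C libraries through <sys/eventfd.h>, which may differ in detail from the patches under discussion at the time. The helpers post_events() and wait_events() are hypothetical names.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Add `n` to the eventfd counter; returns 0 on success, -1 on error. */
int post_events(int efd, uint64_t n)
{
    return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Wait up to timeout_ms for the eventfd to become readable, then fetch
 * the accumulated count. Returns 1 with *count filled in when events
 * were pending, 0 on timeout, and -1 on error.
 */
int wait_events(int efd, int timeout_ms, uint64_t *count)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    int ready;

    assert(count != NULL);

    /* First system call: poll() can only say "readable", nothing more. */
    ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0)
        return ready;
    if (!(pfd.revents & POLLIN))
        return -1;

    /* Second system call: read() returns the counter and resets it to zero. */
    if (read(efd, count, sizeof(*count)) != (ssize_t)sizeof(*count))
        return -1;
    return 1;
}
```

Each delivered event thus costs two trips into the kernel, which is the overhead a user-space event ring is meant to avoid.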
The end result of this is that Ulrich opposes the merging of eventfd; he would rather see the effort go into making kevent (or a replacement with similar functionality) ready for the mainline. A kevent-like interface, he says, will eventually become necessary in any case:
How this issue will be resolved is entirely unclear. There's not been a flood of developers lining up to support Ulrich's position - but they are not opposing him either. Nobody has dusted off the kevent patches for another round of discussion - yet. But one thing that does seem likely is that this whole discussion may delay the merging of eventfd past the 2.6.22 merge window. User-space interfaces are important and, once they are added to the kernel, they are almost impossible to remove. Waiting another development cycle seems like a small price to pay if it helps the developers to get this decision right.
Update: the eventfd code was merged into the mainline on May 11.
The trouble with volatile
Your editor's copy of The C Programming Language, Second Edition (copyright 1988, still known as "the new C book") has the following to say about the volatile keyword:
C programmers have often taken volatile to mean that the variable could be changed outside of the current thread of execution; as a result, they are sometimes tempted to use it in kernel code when shared data structures are being used. Andrew Morton recently called out use of volatile in a submitted patch, saying:
In response, Randy Dunlap pulled together some email from Linus on the topic and suggested to your editor that he could maybe help "document why." Here is the result.
The point that Linus often makes with regard to volatile is that its purpose is to suppress optimization, which is almost never what one really wants to do. In the kernel, one must protect accesses to data against race conditions, which is very much a different task.
Like volatile, the kernel primitives which make concurrent access to data safe (spinlocks, mutexes, memory barriers, etc.) are designed to prevent unwanted optimization. If they are being used properly, there will be no need to use volatile as well. If volatile is still necessary, there is almost certainly a bug in the code somewhere. In properly-written kernel code, volatile can only serve to slow things down.
Consider a typical block of kernel code:
    spin_lock(&the_lock);
    do_something_on(&shared_data);
    do_something_else_with(&shared_data);
    spin_unlock(&the_lock);
If all the code follows the locking rules, the value of shared_data cannot change unexpectedly while the_lock is held. Any other code which might want to play with that data will be waiting on the lock. The spinlock primitives act as memory barriers - they are explicitly written to do so - meaning that data accesses will not be optimized across them. So the compiler might think it knows what will be in shared_data, but the spin_lock() call will force it to forget anything it knows. There will be no optimization problems with accesses to that data.
If shared_data were declared volatile, the locking would still be necessary. But the compiler would also be prevented from optimizing access to shared_data within the critical section, when we know that nobody else can be working with it. While the lock is held, shared_data is not volatile. This is why Linus says:
When dealing with shared data, proper locking makes volatile unnecessary - and potentially harmful.
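The same principle can be tried out in user space. The sketch below is an analogy rather than kernel code: it substitutes a POSIX threads mutex for the spinlock, and run_locked_counter() and worker() are hypothetical names. The shared counter is deliberately not declared volatile; the lock and unlock calls are what keep the accesses correct.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 8

/* Shared state. Note that shared_data is NOT volatile. */
static pthread_mutex_t the_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_data;

static void *worker(void *arg)
{
    long iterations = *(const long *)arg;

    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&the_lock);
        shared_data++;   /* nobody else can touch it while the lock is held */
        pthread_mutex_unlock(&the_lock);
    }
    return NULL;
}

/*
 * Start nthreads workers, each incrementing the counter `iterations`
 * times. Returns the final total, or -1 if a thread could not be created.
 */
long run_locked_counter(int nthreads, long iterations)
{
    pthread_t tids[MAX_WORKERS];
    int started = 0;

    assert(nthreads > 0 && nthreads <= MAX_WORKERS);
    shared_data = 0;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, worker, &iterations) != 0)
            break;
        started++;
    }
    /* Joining also orders the workers' writes before the final read. */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return started == nthreads ? shared_data : -1;
}
```

On older C libraries this needs to be built with -pthread. With four workers doing 100,000 increments each, the total should come out at exactly 400,000, with no volatile in sight.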
The volatile storage class was originally meant for memory-mapped I/O registers. Within the kernel, register accesses, too, should be protected by locks, but one also does not want the compiler "optimizing" register accesses within a critical section. But, within the kernel, I/O memory accesses are always done through accessor functions; accessing I/O memory directly through pointers is frowned upon and does not work on all architectures. Those accessors are written to prevent unwanted optimization, so, once again, volatile is unnecessary.
Another situation where one might be tempted to use volatile is when the processor is busy-waiting on the value of a variable. The right way to perform a busy wait is:
    while (my_variable != what_i_want)
        cpu_relax();
The cpu_relax() call can lower CPU power consumption or yield to a hyperthreaded twin processor; it also happens to serve as a memory barrier, so, once again, volatile is unnecessary. Of course, busy-waiting is generally an anti-social act to begin with.
There are still a few rare situations where volatile makes sense in the kernel:
- The above-mentioned accessor functions might use volatile on
architectures where direct I/O memory access does work. Essentially,
each accessor call becomes a little critical section on its own and
ensures that the access happens as expected by the programmer.
- Inline assembly code which changes memory, but which has no other
visible side effects, risks being deleted by GCC. Adding the
volatile keyword to asm statements will prevent this
removal.
- The jiffies variable is special in that it can have a different value every time it is referenced, but it can be read without any special locking. So jiffies can be volatile, but the addition of other variables of this type is frowned upon. Jiffies is considered to be a "stupid legacy" issue in this regard.
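The first case in the list above, accessor functions which confine volatile to a single access, might look something like the following. This is a user-space sketch under stated assumptions: the sketch_readl(), sketch_writel(), and sketch_wait_for_bits() names are hypothetical, they are not the kernel's accessors, and the "register" in any test is just ordinary memory.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Read a 32-bit "register". The volatile cast applies to this one
 * access only, so the compiler must perform it exactly as written
 * while the rest of the program remains free to be optimized.
 */
uint32_t sketch_readl(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}

/* Write a 32-bit "register"; the address must be naturally aligned. */
void sketch_writel(uint32_t value, volatile void *addr)
{
    assert(((uintptr_t)addr % sizeof(uint32_t)) == 0);
    *(volatile uint32_t *)addr = value;
}

/*
 * Poll a status register until every bit in `mask` is set, giving up
 * after `tries` reads. Returns 1 if the bits appeared, 0 otherwise.
 * Each pass re-reads the register because the read goes through the
 * accessor, not because any variable was declared volatile.
 */
int sketch_wait_for_bits(const volatile void *reg, uint32_t mask, int tries)
{
    while (tries-- > 0) {
        if ((sketch_readl(reg) & mask) == mask)
            return 1;
    }
    return 0;
}
```

The data structure holding the register address stays free of volatile; only the accessor carries it, which keeps the cost confined to the accesses that actually need it.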
For most code, none of the above justifications for volatile apply. As a result, the use of volatile is likely to be seen as a bug and will bring additional scrutiny to the code. Developers who are tempted to use volatile should take a step back and think about what they are truly trying to accomplish.
(Thanks to Randy Dunlap for getting things started and researching the issue, and to Satyam Sharma and Johannes Stezenbach for comments on the first draft of this article.)
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
