
Kernel development

Brief items

Kernel release status

There is still no 2.6.22 prepatch, as the merge window remains open. Patches continue to pour into the mainline git repository; see the article below for details.

The current -mm tree is 2.6.21-mm2. Recent changes to -mm include the removal of the adaptive readahead patches (in anticipation of a newer, simpler version), a kernel-based mechanism for unprivileged user mounts, and the removal of the staircase deadline scheduler. Mostly -mm is shedding weight as patches move into the mainline.

For older kernels: 2.6.16.50 was released on May 4, followed by 2.6.16.51 on May 9. Each update contains a handful of important fixes.

Kernel development news

Quotes of the week

Really, we are likely be better off by risking the merge of _bad_ code (which in the swap-prefetch case is the exact opposite of the truth), than to let code stagnate. People are clearly unhappy about certain desktop aspects of swapping, and the only way out of that is to let more people hack that code. Merging code involves more people. It will cause 'noise' and could cause regressions, but at least in this case the only impact is 'performance' and the feature is trivial to disable.
-- Ingo Molnar pushes for swap prefetch

Open source is about release early, release often. Not "hide code in a dark corner until Christoph thinks it is perfect." We have high standards for upstream merged code, but that standard is not perfection. Perfect is the enemy of good.
-- Jeff Garzik for Libertas

If your mission to another star *depends* on every single piece of complex equipment staying up with zero reboots for 200+ years, you have some serious technology problems.
-- Linus Torvalds

More stuff for 2.6.22

As of this writing, the 2.6.22 merge window remains open, with quite a bit of code still expected to be merged. User-visible changes which have gone in include:

  • The mac80211 (formerly Devicescape) wireless networking stack has finally found its way into the mainline. As of this writing there are no drivers which actually use that stack, but drivers are said to be in the works.

  • The sysfs representation of i2c devices has changed in ways which could break older tools. In particular, versions of lm_sensors prior to 2.10.3 will have problems.

  • A number of old USB touchscreen drivers (itmtouch, mtouchusb, and touchkitusb) have been removed in favor of the new usbtouchscreen driver.

  • The x86_64 architecture has gained relocatable kernel support, a necessary feature for those wanting to use the kexec-based crash dump mechanism.

  • Patching of low-level paravirtualization hooks can be inhibited at boot time with the new noreplace-paravirt boot flag.

  • The REORDER configuration option, which would rearrange functions in the kernel binary for optimal performance, has been removed from the x86_64 architecture.

  • The CIFS filesystem supports IPv6 addresses. There is a new mount option to allow user and group IDs to be overridden. A number of performance improvements for CIFS were also merged.

  • The kernel virtual machine (KVM) API has seen significant changes. If earlier plans still hold, this should be the last set of incompatible KVM changes.

  • There is now a framework for supporting the "RF kill" switches (which disable the transmitter) found on many mobile devices.

  • Support for filesystem "subtypes" has been added. The target here is FUSE-based filesystems, which currently all look the same to the kernel and are hard to specify in fstab. Now a FUSE ssh-based filesystem can have the type "fuse.sshfs".

  • Entries in /proc now exist to provide position and flags information for all open file descriptors.

  • There is a new system call:

        long utimensat(int dirfd, char *filename, struct timespec *times,
                       int flags);
    

    This call allows an application to set the access and modification times for the given filename with nanosecond precision.

  • The device mapper has a new "delay" target which can delay I/O operations; this may seem like a feature of dubious value but it's intended for testing only.

  • Motorola sysv68 disk partition tables are now supported.

  • There is a new private futex mechanism which improves scalability by avoiding the shared global namespace.

  • The PowerPC architecture supports the concept of "slices" - special areas of memory which can have different page sizes. The feature is similar to hugetlbfs, but with more page size flexibility.

  • New hardware supported includes Picotux 200 ARM boards, ADS7846 touchscreen devices, D-Link DSM-G600 boards, MIPS RM9122 integrated serial ports, PMC-Sierra MSP71xx serial devices, MS7712SE01 boards, L-BOX RE2 router boards, SH7780 and SH7722 Solution Engine boards, Sun XVR-500 and XVR-2500 framebuffers, SUN4U PCI-E controllers, Apple system management controllers, Ricoh RS5C313 clock chips, Maxim DS1WM one-wire ASIC cores, Alchemy au1500 programmable serial controllers, Intel LE80578-based framebuffers, PowerPC 750 "Holly" platforms, PowerPC 440GP "Ebony" reference boards, Maxim MAX6650 and MAX6651 fan controllers, Analog Devices AD741x monitoring chips, Intel Core temperature sensors, PA Semi PA6T-1682M random number generators, VIA VT8623 framebuffers, and various drivers for the new "Blackfin" architecture.
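As an illustration of the utimensat() call listed above, here is a minimal user-space sketch. It assumes the wrapper that current C libraries export (declared in <sys/stat.h>, taking a two-element timespec array); the helper names set_file_times_ns() and demo_utimensat() are invented for this example, and the scratch file path is arbitrary.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD, open() flags */
#include <sys/stat.h>   /* utimensat(), stat() */
#include <time.h>       /* struct timespec */
#include <unistd.h>     /* close(), unlink() */

/*
 * Hypothetical helper: set both timestamps of a file.
 * times[0] is the access time, times[1] the modification time.
 * Returns 0 on success, -1 on error (errno is set by utimensat()).
 */
static int set_file_times_ns(const char *path,
                             struct timespec atime,
                             struct timespec mtime)
{
    struct timespec times[2] = { atime, mtime };

    /* AT_FDCWD: a relative path is resolved against the current directory. */
    return utimensat(AT_FDCWD, path, times, 0);
}

/*
 * Demo: create a scratch file, stamp it, and read the stamp back.
 * Returns 0 if the whole round trip worked, -1 otherwise.
 */
static int demo_utimensat(const char *path)
{
    /* Arbitrary timestamps chosen for the example. */
    struct timespec atime = { .tv_sec = 1000000000, .tv_nsec = 123456789 };
    struct timespec mtime = { .tv_sec = 1178841600, .tv_nsec = 987654321 };
    struct stat st;
    int ret = -1;
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    close(fd);

    if (set_file_times_ns(path, atime, mtime) == 0 && stat(path, &st) == 0) {
        /*
         * Only the seconds are checked: the filesystem may store
         * timestamps with a coarser granularity than one nanosecond.
         */
        assert(st.st_mtim.tv_sec == mtime.tv_sec);
        ret = 0;
    }
    unlink(path);
    return ret;
}
```

Calling demo_utimensat("/tmp/utimensat_demo.tmp") should return zero on a system where /tmp is writable. How much of the nanosecond field survives depends on the underlying filesystem.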

Changes visible to kernel developers include:

  • The i2c layer has seen significant changes meant to make i2c drivers look more like drivers for other buses. There are, for example, new probe() and remove() methods for notifying drivers when i2c peripherals come and go. Since i2c is not a self-describing bus, the support code still needs help to know where i2c devices might be; for many classes of device, this information can be had from the system BIOS.

  • The crypto API has a new set of functions for use with asynchronous block ciphers. There is also a new cryptd kernel thread which can run any synchronous cipher in an asynchronous mode.

  • The subsystem structure has been removed from the Linux device model; there never really was any need for it. Most code which was expecting a struct subsystem argument has been changed to use the relevant kset instead.

  • There is a new version of the in-kernel rpcbind (portmapper) client which supports versions 2-4 of the rpcbind protocol. The portmapper API has changed as a result.

  • Numerous changes to the paravirt_ops methods have been made. Additionally, paravirt_ops is no longer a GPL-only export.

  • There is a new memory function:

        void *krealloc(const void *p, size_t new_size, gfp_t flags);
    

    As one would expect, it changes the size of the allocated memory, moving it if need be.

  • The SLUB allocator has been merged as an experimental (for now) alternative to the slab code.

  • A new macro has been added to make the creation of slab caches easier:

        struct kmem_cache *KMEM_CACHE(struct_type, flags);
    
    The macro creates a cache holding objects of type struct struct_type, named after that type, with the given slab flags (if any); it evaluates to a pointer to the new cache.

  • The SLAB_DEBUG_INITIAL flag has been removed, along with the associated SLAB_CTOR_VERIFY flag passed to constructors. The result is a set of changes which ripples through quite a few source files. The unused SLAB_CTOR_ATOMIC flag is also gone.

  • The "quicklist" mechanism has been merged. Quicklists are a simple lookaside cache for page table pages which optimize the allocation and initialization of those pages.

  • The SuperH architecture has working kgdb support again.

  • The ia64 architecture has a new tool which will inject machine check errors into a running system. Not recommended for production machines.

  • The deferrable timers patch has been merged. There is also a new macro for initializing workqueue entries (INIT_DELAYED_WORK_DEFERRABLE()) which causes the job to be queued in a deferrable manner.

  • The old SA_* interrupt flags have not been removed as originally scheduled, but their use will now generate warnings at compile time.

  • There is a new list_first_entry() macro which, surprisingly, gets the first entry from a list.

  • The atomic64_t and local_t types are now fully supported on a wider set of architectures.

  • The "hibernation" (suspend to disk) code has been separated from the "suspend" (to RAM) code as part of a larger effort to distinguish between those two very different operations.

  • Workqueues have been reworked again. There is a new function:

        void cancel_work_sync(struct work_struct *work);
    

    This function tries to cancel a single workqueue entry, be it on the shared (keventd) or a private workqueue. Meanwhile run_scheduled_work() has been removed.
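For readers who have not worked with the kernel's linked lists, the list_first_entry() macro mentioned above can be sketched in user space. What follows is a simplified stand-in for the relevant parts of <linux/list.h>, not the kernel header itself; struct job, init_list_head(), and first_job_id() are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof() */

/* Minimal user-space stand-in for the kernel's list head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* The first entry is whatever follows the head; the list must not be empty. */
#define list_first_entry(ptr, type, member) \
    list_entry((ptr)->next, type, member)

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
    new->prev = head->prev;
    new->next = head;
    head->prev->next = new;
    head->prev = new;
}

/* Hypothetical structure with an embedded list link. */
struct job {
    int id;
    struct list_head link;
};

/* Queue two jobs and return the id of the one at the front. */
static int first_job_id(void)
{
    struct list_head queue;
    struct job a = { .id = 7 }, b = { .id = 9 };

    init_list_head(&queue);
    list_add_tail(&a.link, &queue);
    list_add_tail(&b.link, &queue);

    assert(queue.next != &queue);   /* list_first_entry() needs a non-empty list */
    return list_first_entry(&queue, struct job, link)->id;
}
```

Here first_job_id() returns the id of job a, since it was queued first. The macro saves callers from writing list_entry(head->next, ...) by hand, which is all it does.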

The merging process is not yet done, so expect another big set of patches to go into 2.6.22 before the window closes.

The return of kevent?

The last time this page looked at the kevent interface, it seemed to have reached the end of its run. The eventfd patches had stolen the thunder, providing a way for applications to wait on many types of events using the standard polling interfaces. The kevent developer had shelved the work on the assumption that it would not get in. That assumption appeared to be justified, given that Andrew Morton, in his 2.6.22 merge plans document, said that the eventfd patches would be included.

As was mentioned last week, one obstacle came up in the form of pollfs, an implementation of a very similar idea. There were a couple of relatively harsh reviews of the pollfs code, and its profile appears to have lowered considerably. It is possible that a new, improved version of pollfs could show up in the near future, but it would have to be a lot better to grab a significant amount of attention. The pollfs code has probably shown up too late to the game.

There's another late arrival who will have to be listened to, however: glibc maintainer Ulrich Drepper. Having sat out the discussion of eventfd, he is now back and opposing its inclusion into the mainline:

It's Linus decision whether he wants to add yet more code, yet more possible problems, yet more maintenance overhead/nightmare for an interim solution which isn't necessary, which cannot solve all the problems, and which is not as scalable as other proposed methods.

I can only say that I would be trickly [sic] against it. It makes just no sense.

Ulrich has a number of complaints about the eventfd approach:

  • The eventfd code, by relying on poll() and variants, does not provide a way for applications to obtain events without entering the kernel. For high-bandwidth applications - big network servers, for example - eliminating system calls is one of the keys to adequate performance. The kevent code, with its user-space event ring, provides that sort of mechanism while eventfd does not.

  • The use of poll() also makes it hard for the kernel to pass information back to the application - the communication channel only includes a few bits. The kevent interface allows for a fair amount of information to be packaged with each event. Eventfd gets around this problem by allowing applications to read more event information from the relevant file descriptors - but that requires another system call.

  • Ulrich argues that the poll() interface poses unsolvable issues with regard to threads and cancellation processing. This argument is not universally accepted, however.

  • The current eventfd code does not let applications wait on futexes, and Davide Libenzi, the eventfd developer, is disinclined to add that support. The pollfs patches do support futex waits, though Ulrich had some issues with the implementation. In general, Ulrich would like to see a single system call where applications can wait for anything, so leaving out primitives like futexes will leave him unsatisfied.
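The "another system call" complaint is easy to see in code. The sketch below uses the eventfd() wrapper as exported by current C libraries (with a flags argument), which may differ from the interface under discussion at the time of this writing; eventfd_round_trip() is a name invented for the example.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Post two events to an eventfd, then collect them the way a
 * poll()-based event loop would. Returns the counter value that
 * was read, or 0 on any failure.
 */
static uint64_t eventfd_round_trip(void)
{
    uint64_t one = 1, two = 2, value = 0;
    struct pollfd pfd;
    int efd = eventfd(0, 0);        /* counter starts at zero */

    assert(sizeof(value) == 8);     /* eventfd transfers exactly 8 bytes */
    if (efd < 0)
        return 0;

    /* Each 8-byte write adds its value to the kernel-side counter. */
    if (write(efd, &one, sizeof(one)) != sizeof(one) ||
        write(efd, &two, sizeof(two)) != sizeof(two))
        goto out;

    pfd.fd = efd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* System call 1: poll() can only report that the descriptor is readable. */
    if (poll(&pfd, 1, 1000) == 1 && (pfd.revents & POLLIN)) {
        /* System call 2: read() fetches the accumulated count and resets it. */
        if (read(efd, &value, sizeof(value)) != sizeof(value))
            value = 0;
    }
out:
    close(efd);
    return value;
}
```

The function returns 3, the sum of the two writes. The point of interest is the pair of kernel entries needed to learn that: poll() delivers a few readiness bits, and the payload requires a separate read(). A user-space event ring of the kind kevent provides is meant to avoid exactly this.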

The end result of this is that Ulrich opposes the merging of eventfd; he would rather see the effort go into making kevent (or a replacement with similar functionality) ready for the mainline. A kevent-like interface, he says, will eventually become necessary in any case:

I think we ultimately have to have something like kevent and then all this *fd() work is unnecessary and just adds code to the kernel which has to be kept around and which might hinder further work in this area.

How this issue will be resolved is entirely unclear. There has not been a flood of developers lining up to support Ulrich's position - but they are not opposing him either. Nobody has dusted off the kevent patches for another round of discussion - yet. But it does seem likely that this whole discussion will delay the merging of eventfd past the 2.6.22 merge window. User-space interfaces are important and, once they are added to the kernel, they are almost impossible to remove. Waiting another development cycle seems like a small price to pay if it helps the developers to get this decision right.

Update: the eventfd code was merged into the mainline on May 11.

The trouble with volatile

Your editor's copy of The C Programming Language, Second Edition (copyright 1988, still known as "the new C book") has the following to say about the volatile keyword:

The purpose of volatile is to force an implementation to suppress optimization that could otherwise occur. For example, for a machine with memory-mapped input/output, a pointer to a device register might be declared as a pointer to volatile, in order to prevent the compiler from removing apparently redundant references through the pointer.

C programmers have often taken volatile to mean that the variable could be changed outside of the current thread of execution; as a result, they are sometimes tempted to use it in kernel code when shared data structures are being used. Andrew Morton recently called out use of volatile in a submitted patch, saying:

The volatiles are a worry - volatile is said to be basically-always-wrong in-kernel, although we've never managed to document why, and i386 cheerfully uses it in readb() and friends.

In response, Randy Dunlap pulled together some email from Linus on the topic and suggested to your editor that he could maybe help "document why." Here is the result.

The point that Linus often makes with regard to volatile is that its purpose is to suppress optimization, which is almost never what one really wants to do. In the kernel, one must protect accesses to data against race conditions, which is very much a different task.

Like volatile, the kernel primitives which make concurrent access to data safe (spinlocks, mutexes, memory barriers, etc.) are designed to prevent unwanted optimization. If they are being used properly, there will be no need to use volatile as well. If volatile is still necessary, there is almost certainly a bug in the code somewhere. In properly-written kernel code, volatile can only serve to slow things down.

Consider a typical block of kernel code:

    spin_lock(&the_lock);
    do_something_on(&shared_data);
    do_something_else_with(&shared_data);
    spin_unlock(&the_lock);

If all the code follows the locking rules, the value of shared_data cannot change unexpectedly while the_lock is held. Any other code which might want to play with that data will be waiting on the lock. The spinlock primitives act as memory barriers - they are explicitly written to do so - meaning that data accesses will not be optimized across them. So the compiler might think it knows what will be in shared_data, but the spin_lock() call will force it to forget anything it knows. There will be no optimization problems with accesses to that data.
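The same reasoning can be tried out in user space. The sketch below is an analogue only: toy_spin_lock() and toy_spin_unlock() are hypothetical stand-ins built on C11 atomic_flag, not the kernel's spinlock implementation, and the example is exercised from a single thread. The acquire and release orderings play the role that the barriers inside the real primitives play.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space stand-in for a kernel spinlock. */
struct toy_spinlock {
    atomic_flag held;
};

static void toy_spin_lock(struct toy_spinlock *lock)
{
    /* Acquire ordering: accesses after the lock cannot be moved above it. */
    while (atomic_flag_test_and_set_explicit(&lock->held,
                                             memory_order_acquire))
        ;   /* spin until the current holder releases the lock */
}

static void toy_spin_unlock(struct toy_spinlock *lock)
{
    /* Release ordering: accesses before the unlock cannot be moved below it. */
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

static struct toy_spinlock the_lock = { ATOMIC_FLAG_INIT };
static long shared_data;    /* deliberately NOT volatile */

/* Add delta to shared_data twice under the lock; return the new value. */
static long add_to_shared(long delta)
{
    long before, result;

    toy_spin_lock(&the_lock);
    before = shared_data;
    shared_data += delta;   /* the compiler is free to keep this in a register... */
    shared_data += delta;   /* ...and to combine the two additions */
    /* Nobody else can touch shared_data while the lock is held. */
    assert(shared_data == before + 2 * delta);
    result = shared_data;
    toy_spin_unlock(&the_lock);

    return result;
}
```

Starting from zero, add_to_shared(2) returns 4 and a following add_to_shared(3) returns 10. Nothing in the critical section needs volatile; the lock operations are the only places where the compiler must stop assuming it knows the contents of memory.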

If shared_data were declared volatile, the locking would still be necessary. But the compiler would also be prevented from optimizing access to shared_data within the critical section, when we know that nobody else can be working with it. While the lock is held, shared_data is not volatile. This is why Linus says:

Also, more importantly, "volatile" is on the wrong _part_ of the whole system. In C, it's "data" that is volatile, but that is insane. Data isn't volatile - _accesses_ are volatile. So it may make sense to say "make this particular _access_ be careful", but not "make all accesses to this data use some random strategy".

When dealing with shared data, proper locking makes volatile unnecessary - and potentially harmful.
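What "make this particular access be careful" might look like can be sketched with a cast through a volatile-qualified pointer. The VOLATILE_ACCESS() macro below is a hypothetical name introduced for this illustration, and it relies on the __typeof__ extension provided by GCC and compatible compilers; it is not offered as an established kernel interface.

```c
#include <assert.h>

/*
 * Hypothetical macro: perform ONE access to x through a
 * volatile-qualified lvalue, leaving the variable itself ordinary.
 */
#define VOLATILE_ACCESS(x) (*(volatile __typeof__(x) *)&(x))

static int ready_flag;      /* plain data, not declared volatile */

/* This store must be emitted exactly as written. */
static void set_flag_carefully(int value)
{
    VOLATILE_ACCESS(ready_flag) = value;
}

/* This load must be emitted exactly as written. */
static int read_flag_carefully(void)
{
    return VOLATILE_ACCESS(ready_flag);
}

/* Everywhere else, the compiler may optimize accesses as usual. */
static int read_flag_normally(void)
{
    return ready_flag;
}

/* Demo: both kinds of read see the same value. */
static int demo_volatile_access(void)
{
    set_flag_carefully(1);
    assert(read_flag_carefully() == read_flag_normally());
    return read_flag_normally();
}
```

The qualifier is attached to the individual access rather than to the data, so code that holds a lock can still use the unadorned variable and get fully optimized accesses. Note that a cast like this says nothing about ordering between processors; it is not a substitute for locking or memory barriers.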

The volatile type qualifier was originally meant for memory-mapped I/O registers. Within the kernel, register accesses, too, should be protected by locks, but one also does not want the compiler "optimizing" register accesses within a critical section. Within the kernel, however, I/O memory accesses are always done through accessor functions; accessing I/O memory directly through pointers is frowned upon and does not work on all architectures. Those accessors are written to prevent unwanted optimization, so, once again, volatile is unnecessary.

Another situation where one might be tempted to use volatile is when the processor is busy-waiting on the value of a variable. The right way to perform a busy wait is:

    while (my_variable != what_i_want)
        cpu_relax();

The cpu_relax() call can lower CPU power consumption or yield to a hyperthreaded twin processor; it also happens to serve as a compiler barrier, so, once again, volatile is unnecessary. Of course, busy-waiting is generally an anti-social act to begin with.

There are still a few rare situations where volatile makes sense in the kernel:

  • The above-mentioned accessor functions might use volatile on architectures where direct I/O memory access does work. Essentially, each accessor call becomes a little critical section on its own and ensures that the access happens as expected by the programmer.

  • Inline assembly code which changes memory, but which has no other visible side effects, risks being deleted by GCC. Adding the volatile keyword to asm statements will prevent this removal.

  • The jiffies variable is special in that it can have a different value every time it is referenced, but it can be read without any special locking. So jiffies can be volatile, but the addition of other variables of this type is frowned upon. Jiffies is considered to be a "stupid legacy" issue in this regard.

For most code, none of the above justifications for volatile apply. As a result, the use of volatile is likely to be seen as a bug and will bring additional scrutiny to the code. Developers who are tempted to use volatile should take a step back and think about what they are truly trying to accomplish.

(Thanks to Randy Dunlap for getting things started and researching the issue, and to Satyam Sharma and Johannes Stezenbach for comments on the first draft of this article.)

Patches and updates

Kernel trees

Andrew Morton 2.6.21-mm1
Andrew Morton 2.6.21-mm2
Con Kolivas 2.6.21-ck1
Adrian Bunk Linux 2.6.16.51
Adrian Bunk Linux 2.6.16.51-rc1
Adrian Bunk Linux 2.6.16.50

Architecture-specific

Chandramouli Narayanan x86_64: EFI64 support
Thomas Gleixner x86-64 highres/dyntick support
Nick Piggin lock bitops

Filesystems and block I/O

Joern Engel LogFS take two

Page editor: Jonathan Corbet


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds