
Kernel development

Kernel release status

The current 2.6 kernel is 2.6.9, released, at last, on October 18. Very few fixes were merged since 2.6.9-final, which, in turn, contained only a small number of changes since 2.6.9-rc4. The -final naming scheme drew a few complaints, to which Linus responded "I'm a retard." One assumes he will not do that again.

For those just tuning in, 2.6.9 includes a lot of NTFS updates, block I/O barrier support, a patch allowing unprivileged processes to lock small amounts of memory in RAM, a new USB storage driver, cluster-wide file locking infrastructure, completely out-of-line spinlocks, AMD dual-core support, support for the POSIX waitid() system call, KProbes, USB "on the go" support, the "flex mmap" user-space memory layout, m32r architecture support, a bunch of latency-reduction work, and lots of fixes. See the (lengthy) changelog for a full list of changes since 2.6.8.

There have been no 2.6.10 prepatches released yet, but the floodgates have certainly opened; several hundred changesets have found their way into Linus's BitKeeper repository. These include a set of SCSI updates, a big rework of the IRQ subsystem (pulling lots of duplicated code into a single, generic core - no functional changes), some software suspend fixes, a number of scheduler tweaks, CDRW packet writing support, switchable and loadable I/O schedulers, a new version of the completely fair queueing (CFQ) I/O scheduler, the removal of the (unused) wake_up_all_sync() function, a simple generic circular buffer implementation, a big USB update, version 17 of the wireless extensions API, the kernel events notification mechanism, a patch changing the core device model function exports to GPL-only, a PCI subsystem update, the BSD "secure levels" security module, and lots of fixes.

Andrew Morton has not released any -mm patches over the last week.

The current 2.4 prepatch is still 2.4.28-pre4; Marcelo has not released any prepatches since October 8.


Kernel development news

Quotes of the week

On a side note, the GPL buyout previously offered has been modified. We will be contacting individual contributors and negotiating with each copyright holder for the code we wish to convert on a case by case basis....

SCO has contacted us and identifed [sic] with precise detail and factual documentation the code and intellectual property in Linux they claim was taken from Unix. We have reviewed their claims and they appear to create enough uncertianty [sic] to warrant removal of the infringing portions.

-- Jeff Merkey, of course.

Yes, I can reveal them. All of XFS, All of JFS, and All of the SMP Support in Linux. I have no idea what the hell RCU is and when I find it, I'll remove it from the code.

-- Yes, him again.

Sorry, couldn't resist; we'll stop now.


Coming in 2.6.10

A large number of patches have already been merged and will show up in the first 2.6.10 prepatch. Some of those have been covered on this page before, but others have not. As a way of catching up with current events, we'll take a quick look at a few of these patches.

CFQ v2

The completely fair queueing (CFQ) I/O scheduler endeavors to get good performance from block devices while dividing the available bandwidth equally between the processes contending for each device. 2.6.10 will contain a major rework of the CFQ scheduler, called "CFQ v2." Some of the changes in this version are:

  • Process I/O context information is maintained for the lifetime of each process, rather than just for the periods when the process has outstanding I/O. This change fixes some starvation scenarios which came up with CFQ v1.

  • Grouping of processes can be done by user ID, group ID, thread group, or process group; the policy in force can be changed at runtime.

  • Request ordering is more strictly enforced as a way of limiting the maximum latency experienced by any given request.

  • Small backward seeks are occasionally allowed if they look like they will improve responsiveness.

The code is also more heavily commented; author Jens Axboe says that was done to increase its AAF - "akpm acceptance factor." AKPM is Andrew Morton, who has been known to complain about insufficiently commented kernel submissions.
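The basic idea of dividing bandwidth equally between contending processes can be illustrated with a small user-space sketch. This is not the CFQ code: the names (fair_sched, fair_add(), fair_dispatch()) and the fixed limits are hypothetical, and the sketch leaves out request sorting, the grouping policies, and the latency limits described above. It only shows per-process queues being served in round-robin order, so that one busy process cannot monopolize the device.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUES  8   /* hypothetical limit on contending processes */
#define MAX_PENDING 16  /* hypothetical per-process queue depth */

/* FIFO of pending request ids for one process. */
struct proc_queue {
    int pending[MAX_PENDING];
    size_t head;   /* index of the oldest request */
    size_t count;  /* number of queued requests */
};

struct fair_sched {
    struct proc_queue q[MAX_QUEUES];
    size_t next;   /* queue that gets the next turn */
};

/* Queue request 'req' (a non-negative id) for process slot 'owner'.
 * Returns 0 on success, -1 if the slot is invalid or its queue is full. */
int fair_add(struct fair_sched *s, size_t owner, int req)
{
    struct proc_queue *pq;

    assert(s != NULL);
    if (owner >= MAX_QUEUES)
        return -1;
    pq = &s->q[owner];
    if (pq->count == MAX_PENDING)
        return -1;
    pq->pending[(pq->head + pq->count) % MAX_PENDING] = req;
    pq->count++;
    return 0;
}

/* Take one request from the next non-empty queue, then move the turn
 * to the queue after it. Returns the request id, or -1 if idle. */
int fair_dispatch(struct fair_sched *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < MAX_QUEUES; i++) {
        size_t idx = (s->next + i) % MAX_QUEUES;
        struct proc_queue *pq = &s->q[idx];

        if (pq->count > 0) {
            int req = pq->pending[pq->head];

            pq->head = (pq->head + 1) % MAX_PENDING;
            pq->count--;
            s->next = (idx + 1) % MAX_QUEUES;
            return req;
        }
    }
    return -1;
}
```

With three requests queued for one process and a single request queued for another, the second process is served after the first request rather than after all three.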

Simple circular buffers

Circular buffers are a common data structure in the kernel, but there has never been a generic implementation available for use. Stelian Pop decided to change that; he was almost certainly surprised, however, by the large number of iterations it took to respond to all the comments he got. In the end, this effort showed the value of having a single, generic implementation in the kernel. Even a data structure as simple as a circular buffer can be tricky to implement correctly; it makes no sense for every developer to go through that process each time a new one is needed. With a single, well-reviewed implementation, the chances of it being truly correct are much better.

A circular buffer is represented by struct kfifo, defined in <linux/kfifo.h>. A statically allocated buffer can be initialized with kfifo_init(), or allocation and initialization can be performed together with kfifo_alloc():

struct kfifo *kfifo_init(unsigned char *buffer, unsigned int size,
                         int gfp_mask, spinlock_t *lock);
struct kfifo *kfifo_alloc(unsigned int size, int gfp_mask,
                          spinlock_t *lock);

Either way, size is the desired size of the buffer (in bytes, must be a power of two), gfp_mask is a set of GFP_ flags controlling how memory allocations will be performed, and lock is a spinlock which will be used to serialize access to the data structure.

The functions for moving data into and out of the buffer are:

unsigned int kfifo_put(struct kfifo *fifo, unsigned char *buffer, 
                       unsigned int len);
unsigned int kfifo_get(struct kfifo *fifo, unsigned char *buffer, 
                       unsigned int len);

These functions move at most len bytes between the structure and buffer; the actual number of bytes transferred is returned. The number of bytes currently stored in a circular buffer can be obtained by passing it to kfifo_len(), and a buffer may be flushed by passing it to kfifo_reset(). A dynamically allocated buffer may be returned to the system with kfifo_free(); there does not seem to be a way to free memory from statically allocated buffers.
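To see why the size must be a power of two and why the put and get operations return a byte count, consider the following user-space sketch. It is not the kernel's implementation; struct ring and the ring_*() functions are hypothetical names, and the spinlock has been left out, so the sketch is only safe for a single thread. The read and write positions are free-running counters, and masking them with size - 1 yields the offset into the storage, which only works when size is a power of two.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical user-space ring buffer; not the kernel's struct kfifo. */
struct ring {
    unsigned char *data;  /* caller-provided storage */
    unsigned int size;    /* capacity in bytes, a power of two */
    unsigned int in;      /* free-running count of bytes written */
    unsigned int out;     /* free-running count of bytes read */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Returns 0 on success, -1 if size is not a power of two. */
int ring_init(struct ring *r, unsigned char *storage, unsigned int size)
{
    if (size == 0 || (size & (size - 1)) != 0)
        return -1;
    r->data = storage;
    r->size = size;
    r->in = r->out = 0;
    return 0;
}

/* Number of bytes currently stored. */
unsigned int ring_len(const struct ring *r)
{
    return r->in - r->out;
}

/* Copy at most len bytes in; returns the number actually stored. */
unsigned int ring_put(struct ring *r, const unsigned char *src,
                      unsigned int len)
{
    unsigned int off, first;

    assert(ring_len(r) <= r->size);
    len = min_u(len, r->size - ring_len(r));  /* limit to free space */
    off = r->in & (r->size - 1);              /* write offset */
    first = min_u(len, r->size - off);        /* bytes before the wrap */
    memcpy(r->data + off, src, first);
    memcpy(r->data, src + first, len - first);
    r->in += len;
    return len;
}

/* Copy at most len bytes out; returns the number actually read. */
unsigned int ring_get(struct ring *r, unsigned char *dst, unsigned int len)
{
    unsigned int off, first;

    len = min_u(len, ring_len(r));            /* limit to stored data */
    off = r->out & (r->size - 1);             /* read offset */
    first = min_u(len, r->size - off);        /* bytes before the wrap */
    memcpy(dst, r->data + off, first);
    memcpy(dst + first, r->data, len - first);
    r->out += len;
    return len;
}
```

A caller that asks to store more than the free space allows simply gets a short count back, which matches the "at most len bytes" behavior described above.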

Kernel events

The kernel events notification mechanism has been covered here a couple of times. This code provides a way for user-space processes to learn about important events by way of a netlink socket. The final form of the event generation interface (for now) is:

    int kobject_uevent(struct kobject *kobj, enum kobject_action action,
                       struct attribute *attr);

The kobject describes where the interesting event happened. For the one explicit use currently in the kernel (filesystem mount and unmount events), the kobject corresponds to the disk partition involved. action identifies the event; it is currently one of KOBJ_ADD, KOBJ_REMOVE, KOBJ_CHANGE, KOBJ_MOUNT, and KOBJ_UMOUNT. The "add" and "remove" actions are generated along with hotplug events; "change" describes attribute value changes, and "mount" and "unmount" are for filesystem events. The final parameter (attr) is an optional attribute of the given kobject which provides further information.

The merged patches also modify how hotplug events are handled; such events are now reported in two ways: via the new events mechanism and through an invocation of /sbin/hotplug.


Realtime preemption, part 2

In last week's episode, we saw the release of a number of patches intended to bring (something closer to) realtime response to the standard Linux kernel. The level of activity in this area remains high; here is what has been happening over the last week.

Bill Huey of LynuxWorks surfaced to announce that he, too, has been working on realtime preemption; his patches are available at mmlinux.sourceforge.net. Mr. Huey seemed a bit annoyed at the posting from MontaVista which started the current discussion; his version, it seems, has been working for some months. But, by his own admission, he had been sitting on the patches for some time as a result of the "commercial development attitude" at his employer. "Release early" is the kernel developers' mantra for a reason.

The mmlinux patch resembles the others, in that it turns all spinlocks into semaphores and makes most critical sections preemptible. It includes a threaded interrupt handler patch from TimeSys, and uses standard Linux semaphores, without priority inheritance. See the mmlinux release announcement for more information.

The folks at MontaVista must be feeling a bit like their own vehicle has taken off and left them behind. Even so, Daniel Walker announced a new MontaVista realtime patch, based on Ingo Molnar's work. It includes an architecture-independent mutex implementation (but still different from regular Linux kernel semaphores), and some latency tracing code.

The real work, however, continues to be done by Ingo Molnar; he has been releasing patches at such a rate that some developers working on slower systems may have trouble simply compiling them before the next one comes out. Ingo's focus has been the elimination of the (numerous) remaining spinlocks, especially those outside of the core kernel. The current situation, as he put it, is "an opt-in model to correctness which is bad from a maintenance and upstream acceptance point of view". With his current patches (the latest is RT-2.6.9-rc4-mm1-U8 as of this writing, but that is likely to have changed by the time anybody reads this), over 90% of the raw spinlock calls have been removed, and most non-core subsystems are entirely free of spinlocks. At least, that is the case when realtime preemption is configured into the kernel; without that option, the situation is mostly unchanged.

To get to that point, Ingo had to make changes to a number of Linux mutual exclusion primitives which got in the way. One of those is per-CPU variables, which are based around the idea that, as long as each processor only works with its own copy of a variable, no locking is required to make that work safe. That assumption only holds, however, if threads are not preempted while manipulating per-CPU variables. So using a per-CPU variable requires disabling preemption, which runs counter to the whole "make everything preemptible" idea. To address this problem, Ingo introduced a new "locked" per-CPU variable type:

    DEFINE_PER_CPU_LOCKED(type, name);

    get_cpu_var_locked(var, cpu);
    put_cpu_var_locked(var, cpu);

Threads which use the "locked" type of per-CPU variable can be preempted while working with that variable - they can even be shifted to a different processor while sleeping. The result could be a thread updating the "wrong" processor's version of the variable. The lock will prevent race conditions, however, so, as Ingo puts it, "'statistically' the variable is still per-CPU and update correctness is fully preserved."
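The idea can be sketched in user space. The code below is not Ingo's implementation; locked_counter, get_slot_locked(), put_slot_locked(), and the NR_SLOTS value are hypothetical, and a C11 atomic_flag spins where the realtime kernel would use a sleeping mutex. What it shows is that each per-CPU slot carries its own lock, so a thread that picked slot N and was then moved to another processor can still update slot N safely.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_SLOTS 4  /* hypothetical number of processors */

/* One counter per processor, each guarded by its own lock. */
struct locked_counter {
    atomic_flag lock;
    long value;
};

struct locked_counter counters[NR_SLOTS];

void counters_init(void)
{
    int i;

    for (i = 0; i < NR_SLOTS; i++) {
        atomic_flag_clear(&counters[i].lock);
        counters[i].value = 0;
    }
}

/* Lock and return the slot belonging to 'cpu'. The lock, not the
 * processor the caller happens to be running on, protects the data. */
struct locked_counter *get_slot_locked(int cpu)
{
    struct locked_counter *c;

    assert(cpu >= 0 && cpu < NR_SLOTS);
    c = &counters[cpu];
    while (atomic_flag_test_and_set_explicit(&c->lock,
                                             memory_order_acquire))
        ;  /* busy-wait; this sketch does not sleep */
    return c;
}

void put_slot_locked(struct locked_counter *c)
{
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

/* Sum all slots, taking each lock in turn. */
long counters_sum(void)
{
    long sum = 0;
    int i;

    for (i = 0; i < NR_SLOTS; i++) {
        struct locked_counter *c = get_slot_locked(i);

        sum += c->value;
        put_slot_locked(c);
    }
    return sum;
}
```

An update made to the "wrong" slot after a migration still lands in exactly one slot under that slot's lock, so a total computed over all slots remains correct.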

Then, there is the issue of read-copy-update, which also depends on threads not being preempted while they hold a reference to RCU-protected data. Ingo's approach here was, essentially, to dump RCU in the realtime case and just go back to regular locking. This change is hard to do in any sort of automatic way, however, because the RCU read locking primitive (rcu_read_lock(), which, normally, just disables preemption) does not identify which data is being protected. So converting RCU code requires picking out a spinlock or semaphore which can be used to prevent races with writers, and changing the rcu_read_lock() calls to one of the many new variants:

    rcu_read_lock_sem(struct semaphore *sem);
    rcu_read_lock_down_read(struct rwsem *sem);
    rcu_read_lock_spin(spinlock_t *lock);
    ...

This API, Ingo notes, is still in flux. There does not seem to have been any benchmarking done yet to determine what effect these changes have on the scalability issues RCU was created to address.

Atomic kmaps were another problem. An atomic kmap is a mechanism used to quickly map a high memory page into the kernel's address space. It is, for all practical purposes, an implementation of per-CPU page table entries, and it has the same preemption issues. The solution here was the addition of a new function (kmap_atomic_rt()) which turns into a regular, non-atomic kmap when realtime preemption is enabled. In this case (as with many of the others) the low-latency imperative brings a small overall performance cost.

As a sort of side project, many users of semaphores in the kernel were changed over to the completion mechanism. Some new completion functions have been added to help with that process:

    int wait_for_completion_interruptible(struct completion *c);
    unsigned long wait_for_completion_timeout(struct completion *c,
                                              unsigned long timeout);
    unsigned long wait_for_completion_interruptible_timeout(struct completion *c,
                                              unsigned long timeout);

Quite a few other changes have gone in, but the idea should be clear by now: a vast number of changes are being made to the kernel's fundamental assumptions about locking and the execution environment. Few readers will be surprised to learn that the brave souls testing these patches have been encountering significant numbers of bugs. Those bugs are being squashed in a hurry, though, to the point that Ingo can say:

...this is i believe the first correct conversion of the Linux kernel to a fully preemptible (fully mutex-based) preemption model, while still keeping all locking properties of Linux.

I also think that this feature can and should be integrated into the upstream kernel sometime in the future. It will need improvements and fixes and lots of testing, but i believe the basic concept is sound and inclusion is manageable and desirable.

The interesting thing is that nobody has come forward to challenge that statement. As the realtime preemption patches become more stable, and the pressure for their inclusion starts to build, that situation may well change. It is hard to imagine a patch this intrusive going in without some sort of fight - especially when many developers are far from convinced about the goal of supporting realtime applications in Linux to begin with.


MODULE_PARM deprecated

It's hard to turn down an opportunity to give Rusty Russell some grief, so let's take a moment to review a comment he posted on LWN in 2003:

Regarding module_param(): MODULE_PARM() will certainly stay throughout the 2.6 series, so no need to change existing code just yet.

Those who held off on changing their out-of-tree modules may want to do so now. Rusty has sent out a patch marking MODULE_PARM() obsolete in preparation for its removal from the kernel. A set of companion patches deals with many of the remaining MODULE_PARM() uses in the mainline tree.

MODULE_PARM() declares parameters for loadable modules; these parameters can be changed when the module is loaded to affect its operation. One of the many changes that came with the new module loader in the 2.5 series was a new mechanism (module_param()) for declaring module parameters. The new scheme has a number of advantages over the old one: it is type safe, it allows module parameters to be represented (and changed) in sysfs, and it provides a flexible mechanism for new types of parameters. But, since the older way continued to work, many modules were never updated.
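In a module, the new declaration takes the form module_param(name, type, perm). The type safety and extensibility come from tying each parameter to a function that knows how to parse its type. The user-space sketch below illustrates that idea only; it is not the kernel's code, and struct param, set_int(), set_bool(), and param_apply() are hypothetical names. Each table entry pairs a parameter name with a setter, so a malformed value is rejected instead of being stored, and a new parameter type only requires a new setter.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* A setter parses 'val' and stores it through 'arg'; 0 on success. */
typedef int (*param_set_fn)(const char *val, void *arg);

struct param {
    const char *name;  /* parameter name as given by the user */
    param_set_fn set;  /* type-specific parser */
    void *arg;         /* variable that receives the value */
};

/* Accept only a complete integer that fits in an int. */
int set_int(const char *val, void *arg)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(val, &end, 0);
    if (errno != 0 || end == val || *end != '\0' ||
        v < INT_MIN || v > INT_MAX)
        return -1;
    *(int *)arg = (int)v;
    return 0;
}

/* Accept "1"/"y" as true and "0"/"n" as false. */
int set_bool(const char *val, void *arg)
{
    if (strcmp(val, "1") == 0 || strcmp(val, "y") == 0) {
        *(int *)arg = 1;
        return 0;
    }
    if (strcmp(val, "0") == 0 || strcmp(val, "n") == 0) {
        *(int *)arg = 0;
        return 0;
    }
    return -1;
}

/* Apply one "name=value" assignment using the table.
 * Returns 0 on success, -1 for unknown names or bad values. */
int param_apply(const struct param *tbl, size_t n, const char *assignment)
{
    const char *eq;
    size_t namelen, i;

    assert(tbl != NULL && assignment != NULL);
    eq = strchr(assignment, '=');
    if (eq == NULL)
        return -1;
    namelen = (size_t)(eq - assignment);
    for (i = 0; i < n; i++) {
        if (strlen(tbl[i].name) == namelen &&
            strncmp(tbl[i].name, assignment, namelen) == 0)
            return tbl[i].set(eq + 1, tbl[i].arg);
    }
    return -1;
}
```

A value that fails to parse leaves the target variable untouched, which is the behavior one would want from a type-safe parameter mechanism.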

Under the old development model, things probably would have gone as Rusty suggested: MODULE_PARM() would have remained through the 2.6 series in order to avoid breaking things. The new development model lacks the same sort of obvious demarcation point where compatibility can be broken, so those changes end up going into the regular patch stream. This is especially true of internal API changes, where there never has been a guarantee of any sort of continuity, even in an old-style stable series. So some of these changes are coming more quickly than some developers might have expected.

With regard to MODULE_PARM(), the current patches in circulation suggest that the time to update to module_param() is running out. Consider yourself warned.


Patches and updates

Kernel trees

Linus Torvalds Linux v2.6.9...
Con Kolivas 2.6.9-ck1
viro@parcelfarce.linux.theplanet.co.uk 2.6.9-rc4-bird1 patchset
viro@parcelfarce.linux.theplanet.co.uk 2.6.9-rc4-bird2 patchset

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

John McCutchan inotify 0.14

Memory management

Miscellaneous

Greg KH udev 038 release
Greg KH udev 039 release
Patrick Mansfield scsi_id 0.7 available
Stephen Hemminger iproute2 2.6.9-041019
christophe.varoqui@free.fr multipath-tools-0.3.3

Page editor: Jonathan Corbet


Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds