Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch is 2.6.11-rc2, released by Linus on January 21. Changes this time around include some networking updates (including a "fix" for a NETIF_F_LLTX race condition which was subsequently withdrawn), an ALSA update (to version 1.0.8), some enhancements to the "circular pipe buffers" code introduced in -rc1, the ioctl() method rework, in-inode extended attributes for ext3, some additions to the completion API, some spinlock changes, and fixes for the latest round of security problems. The long-format changelog has the details.

The flow of patches into Linus's BitKeeper repository has slowed as things begin to stabilize for the 2.6.11 release. Changes merged since -rc2 include some architecture updates, the removal of bcopy(), a fix for writable module parameters in sysfs (it never actually worked before), and various fixes.

The current -mm tree is 2.6.11-rc2-mm1. Recent changes to -mm include some random driver reworking, the POSIX high-resolution timers patch set, ACL support for the NFS client, the isochronous CPU scheduler (see below), and some crypto API work.

The current 2.4 kernel is 2.4.29; no 2.4.30 prepatches have been released.


Low latency for audio applications

Two weeks ago, this page looked at the realtime security module, an addition requested by Linux users who need to be able to ensure that certain applications are able to respond quickly to external events. Musicians working with Linux, in particular, want a system which can keep up with audio streams - a task which requires sub-millisecond response in many cases. Unpatched Linux kernels have generally not been able to provide latencies that low in any sort of reliable way.

The idea of merging the realtime module appears to have been dropped for now; the opposition was too strong. There are a couple of other approaches which are being worked on, however, to meet the audio developers' needs. In particular, Con Kolivas and Ingo Molnar have been creating patches, and audio hacker Jack O'Quin has been tirelessly testing them out. Two approaches which look like they could solve the problem have emerged from this work.

The approach taken by Con Kolivas is the isochronous scheduler patch. This patch, in its current form, creates two new scheduling classes: SCHED_ISO_RR and SCHED_ISO_FIFO. These classes function much like the realtime scheduling classes in that they provide a higher scheduling priority than any SCHED_NORMAL process enjoys. They differ from the true realtime classes, however, in a couple of ways. No privilege is required to enter one of the isochronous classes, so audio applications need not run as root. The scheduler will also automatically select an isochronous class if an unprivileged application attempts to enter a true realtime class, with the result that many audio applications can use the new classes without modification.

The isochronous classes give high-priority access to the CPU, but only to a point. If isochronous processes use more than an administrator-defined percentage of the processor (70% by default), they get dropped back to the SCHED_NORMAL class for a while. This cap prevents high-priority, unprivileged tasks from taking over the system entirely, and it matters: the lack of any such protection was the reason for many of the objections to the realtime security module.

Ingo Molnar's approach, instead, is the creation of a new resource limit (initially called RLIMIT_RT_CPU_RATIO, later changed to RLIMIT_RT_CPU). This limit controls what percentage of the processor's time may be taken by all unprivileged realtime processes. If the limit is in effect, the patch also allows any process to enter the realtime scheduling classes. So the end result is similar to that obtained with Con's patch: unprivileged tasks can get realtime access to the processor, but they are prevented from taking over entirely. The difference is that Ingo's patch is somewhat smaller and simpler, and does not require the introduction of new scheduling classes.

The rlimit-based patch is also interesting in that it allows each process to have a different maximum CPU utilization limit. Imagine a system running a set of audio applications where some have their limit set at 60%, and others at 80%. If 70% of the available processor time is actually being used by realtime tasks, processes with the 60% limit will lose their realtime access, but the 80% processes will not. This scheme, thus, allows a smart supervisor (such as the jack server) to arrange for a (relatively) graceful degradation as contention for the CPU increases.
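The 60%/80% example above can be written out as a small user-space sketch. This is not code from Ingo's patch: the function name toy_keeps_realtime() is hypothetical, and the behavior when usage exactly equals the limit is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of a per-process realtime CPU limit (hypothetical helper,
 * not taken from the RLIMIT_RT_CPU patch).
 *
 * limit_pct:    this process's limit, as a percentage of CPU time
 * rt_usage_pct: CPU time currently used by all realtime tasks
 *
 * Returns true if the process keeps its realtime access. Treating
 * "usage == limit" as still allowed is an assumption of this sketch.
 */
bool toy_keeps_realtime(unsigned int limit_pct, unsigned int rt_usage_pct)
{
    assert(limit_pct <= 100 && rt_usage_pct <= 100);
    return rt_usage_pct <= limit_pct;
}
```

With realtime usage at 70%, toy_keeps_realtime(60, 70) is false while toy_keeps_realtime(80, 70) is true, so the 60% processes fall back first and the 80% processes keep running in the realtime class.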

Jack O'Quin's benchmarking suggests that either patch, in its most recent form, has the potential to solve the problem (though the realtime preemption work may also be required for a complete solution). He appears to favor Ingo's version, and its relative simplicity could well argue for taking that path. No decision appears to have been made, however; it may be that nothing is merged until the 2.6.12 process starts. Either way, life looks set to get a little easier for Linux audio users, which is a good thing. It can be worthwhile to be noisy about your needs, especially if you are willing to put time into helping in the development of the solution.


A new core time subsystem

Keeping track of the current time is one of the kernel's many jobs. In the Linux kernel, this task is handled in a very architecture-dependent way. Each architecture has its own sources of high-resolution time, and each performs its own calculations. This system works, but it results in quite a bit of code being duplicated across architectures, and it can be brittle. Patches which change time-related code often do not manage to correctly update all architectures.

John Stultz has been working for some months on a cleaner alternative. The result is a new time subsystem which, he hopes, will improve the situation.

Much of the patch can be seen as a refactoring of the time code. Common calculations are now performed in the timeofday core, rather than in architecture-specific code. The code for implementing the network time protocol (NTP), an interesting exercise in complexity itself, has been separated from the rest of the time code and hidden in its own file. Most of the core time code has been reworked to deal with time in nanoseconds, a format which gives adequate time resolution but which, in a 64-bit variable, is still good for centuries. The timeofday code no longer depends on the jiffies variable, meaning that it can work independently of the timer interrupt, which may be disabled in some situations. The overall result is kernel timing code which is much easier to read and understand.

In the end, however, the timing code must go to the hardware to actually get high-resolution time values. John made a couple of observations here. One is that, while time sources are architecture-dependent, many architectures share the same types of timing hardware. The other is that the code which deals with a time source is really just another device driver. So he isolated the time source information into its own structure:

	struct timesource_t {
		char* name;
		int priority;
		enum {
			TIMESOURCE_FUNCTION,
			TIMESOURCE_CYCLES,
			TIMESOURCE_MMIO_32,
			TIMESOURCE_MMIO_64
		} type;
		cycle_t (*read_fnct)(void);
		void __iomem* mmio_ptr;
		cycle_t mask;
		u32 mult;
		u32 shift;
		void (*update_callback)(void);
	};

Here, name is just a name for the source, and priority is used to choose between multiple available sources. The type field tells how this source can be read. If type is TIMESOURCE_FUNCTION, the read_fnct() will be called to read the source. The two _MMIO_ variants are for hardware which can be read directly from I/O memory; in that case, the time code can just obtain a value from the location indicated by mmio_ptr with no need to call any outside functions. TIMESOURCE_CYCLES indicates that the processor's time stamp counter (TSC) is being used, so get_cycles() is called to get the actual value. In any of the above cases, the value returned by the time source is assumed to be some sort of counter. The mask, mult, and shift values are applied to turn a delta between two such values into a number of nanoseconds for the rest of the timekeeping code.
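The role of mask, mult, and shift can be illustrated with a user-space sketch. The exact arithmetic used by the patch is not given here, so the formula below (mask the delta, multiply, then shift right) is an assumption, and all of the toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cycle_t;

/* Only the conversion-related fields of the structure are modeled. */
struct toy_timesource {
    cycle_t  mask;   /* width of the hardware counter */
    uint32_t mult;   /* scaled nanoseconds per counter tick */
    uint32_t shift;  /* scaling exponent applied to mult */
};

/*
 * Convert the difference between two counter reads into nanoseconds.
 * Masking the delta handles a counter that wrapped between the reads.
 * The multiplication can overflow for very large deltas; this sketch
 * does not guard against that.
 */
uint64_t toy_cycles_to_ns(const struct toy_timesource *ts,
                          cycle_t last, cycle_t now)
{
    assert(ts->shift < 64);
    cycle_t delta = (now - last) & ts->mask;
    return (delta * ts->mult) >> ts->shift;
}

/* A made-up 32-bit counter ticking at 1 MHz (1000 ns per tick). */
struct toy_timesource toy_1mhz_source(void)
{
    struct toy_timesource ts = {
        .mask  = 0xffffffffu,
        .mult  = 1000u << 10,
        .shift = 10,
    };
    return ts;
}
```

For this made-up source, a delta of 250 ticks converts to 250000 ns, and a read pair that straddles the 32-bit wrap (0xfffffff0 followed by 0x10, a true delta of 32 ticks) still converts to 32000 ns because of the mask.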

With this structure in place, architecture-specific code need only fill in a timesource_t structure (possibly implementing a read function in the process) and pass it to register_timesource(). All the rest is then handled in the common code. John has provided a set of time source drivers for a few architectures which demonstrates how they can be written.
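The driver-like registration pattern might look roughly like the following user-space mock. It is only a sketch: the real register_timesource() is kernel code whose behavior is not shown here, the toy_ names are hypothetical, and the rule that a numerically higher priority wins is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cycle_t;

/* Cut-down stand-in for the timesource structure. */
struct toy_timesource {
    const char *name;
    int priority;
    cycle_t (*read_fnct)(void);
};

/* The source the mock "core" is currently using. */
static struct toy_timesource *toy_current;

/*
 * Mock registration: remember the source if it beats the current one.
 * Assumption: a larger priority value means a preferred source.
 * Returns 0 on success, -1 if the source is unusable.
 */
int toy_register_timesource(struct toy_timesource *ts)
{
    if (ts == NULL || ts->read_fnct == NULL)
        return -1;
    if (toy_current == NULL || ts->priority > toy_current->priority)
        toy_current = ts;
    return 0;
}

const struct toy_timesource *toy_current_timesource(void)
{
    return toy_current;
}

/* Common code reads whichever source won, without knowing its details. */
cycle_t toy_read_current(void)
{
    assert(toy_current != NULL);
    return toy_current->read_fnct();
}

/* Two pretend "drivers": each only supplies a read function and a struct. */
static cycle_t toy_read_slow(void) { return 1; }
static cycle_t toy_read_fast(void) { return 2; }

struct toy_timesource toy_slow = { "slow", 10, toy_read_slow };
struct toy_timesource toy_fast = { "fast", 100, toy_read_fast };
```

Registering toy_slow and then toy_fast leaves toy_fast as the selected source, so toy_read_current() calls toy_read_fast(). The point of the design is that each "driver" contributes only a filled-in structure, while selection and reading live in one shared place.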

The discussion of the patches suggests that, while developers like the general intent, there are some remaining concerns - especially among the architecture maintainers. In some architectures, the gettimeofday() system call can be handled entirely in user space, but the current patches do not yet support that. The current NTP implementation is also seen as being too expensive. Finding a way to cut the cost of NTP while maintaining accuracy could be a bit of a challenge, but John is working at it. Expect to see some more iterations on this one.


Some 2.6.11 API changes

A few small internal API changes have been merged for 2.6.11. For the record, here's what they are.

The completion mechanism allows a thread in the kernel to block until a specific event happens. Three new functions, some of which appear to be aiming for the "longest name in the kernel" prize, have been added:

    int wait_for_completion_interruptible(struct completion *c);
    unsigned long wait_for_completion_timeout(struct completion *c,
                                              unsigned long timeout);
    unsigned long wait_for_completion_interruptible_timeout(struct completion *c,
                                                            unsigned long timeout);

Each of these functions should be relatively straightforward to understand: they add interruptible and timeout variants to the basic wait_for_completion() function. They were added to make it easier to convert more semaphore users over to the completion API, which is more appropriate for cases where a one-shot operation is being waited for. This change is another small bit of fallout from the realtime preemption work.

The kernel has long had an implementation of bcopy():

    void bcopy(const char *src, char *dest, int size);

Arjan van de Ven and Adrian Bunk recently noticed a couple of things: (1) nothing in the kernel was actually using bcopy(), and (2) the implementation was broken. bcopy() is supposed to be able to handle overlapping source and destination areas, but, for a number of architectures, the kernel implementation would not do the right thing with such areas. So a patch was merged which removes bcopy(). No other in-kernel changes were needed, but out-of-tree modules which use bcopy() will need to be changed.
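For code that still calls bcopy(), one plausible conversion is to memmove(), which is defined to handle overlapping regions. The user-space sketch below shows the idea; the wrapper name compat_bcopy() is hypothetical, and the thing to watch is that memmove() takes its destination first while bcopy() takes its source first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a removed bcopy(): same argument order
 * (source first), implemented with memmove() so that overlapping
 * source and destination areas are copied correctly.
 */
void compat_bcopy(const void *src, void *dest, size_t size)
{
    assert(size == 0 || (src != NULL && dest != NULL));
    memmove(dest, src, size);   /* note: destination comes first here */
}
```

As a usage example, copying four bytes from the start of the buffer "abcdef" to offset 2 of the same buffer yields "ababcd", even though the two areas overlap.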

Chip Salzenberg (and others) noticed that a couple of networking functions - skb_copy_datagram() and sock_alloc_send_pskb() - are no longer exported to modules in the 2.6.11 prepatches. This change breaks the out-of-tree VMware modules. Fixes for VMware have already been merged.

On the PCI front, a patch from Pavel Machek which changes the prototype of the suspend() method in struct pci_driver was merged. The new prototype is:

    int (*suspend)(struct pci_dev *dev, pm_message_t state);

By changing the type of the state parameter, the patch allows the removal of some translation code and lets PCI drivers know what is really going on at the higher power management levels. Pavel is looking for help in fixing PCI drivers to use the new interface.

A few spinlock primitives have seen changes. For starters, the macro rwlock_is_locked() has been removed. It was never clear whether the macro referred to read or write locking, so Linus dealt with the confusion by just taking it out altogether. Then a new set of primitives was added:

    int read_can_lock(rwlock_t *rw);
    int write_can_lock(rwlock_t *rw);

These test whether an attempt to obtain a read or write lock at that time would have succeeded. In addition, there is a version for regular spinlocks:

    int spin_can_lock(spinlock_t *lock);

This function returns a nonzero value if an attempt to obtain lock would have succeeded, but does not actually modify the lock.

Finally, the internal lock field in the spinlock structure was renamed to slock. This change was made to force the compiler to complain when rwlock primitives are used on a regular spinlock (and vice versa). This sort of type safety could also have been achieved by using inline functions, rather than macros, but some performance problems with gcc prevented that approach from being used.
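The field-renaming trick and the "can lock" tests can both be illustrated with a toy, single-threaded model. None of this is the kernel's implementation: the toy_ types and macros are hypothetical, and the convention that zero means "unlocked" is an assumption made only for this sketch.

```c
#include <assert.h>

/* Distinct field names: "slock" for spinlocks, "lock" for rwlocks. */
typedef struct { unsigned int slock; } toy_spinlock_t;
typedef struct { unsigned int lock;  } toy_rwlock_t;

/* Tests only: report whether locking would succeed, without locking. */
#define toy_spin_can_lock(x)   ((x)->slock == 0)
#define toy_write_can_lock(x)  ((x)->lock == 0)

/* Toy lock and unlock; no atomicity, single-threaded use only. */
#define toy_spin_lock(x) \
    do { assert((x)->slock == 0); (x)->slock = 1; } while (0)
#define toy_spin_unlock(x) \
    do { assert((x)->slock != 0); (x)->slock = 0; } while (0)

/*
 * Because the macros name the field directly, mixing the types is a
 * compile-time error rather than a silent bug. For example, given
 * "toy_rwlock_t rw;", the expression toy_spin_can_lock(&rw) does not
 * compile, since toy_rwlock_t has no member named slock.
 */
```

A freshly zeroed toy_spinlock_t passes toy_spin_can_lock(), fails it after toy_spin_lock(), and passes again after toy_spin_unlock(); in each case the test itself leaves the lock unchanged.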


Patches and updates

Linus Torvalds: Linux 2.6.11-rc2
Andrew Morton: 2.6.11-rc2-mm1
Domen Puncer: 2.6.11-rc2-kj
Andrew Morton: 2.6.11-rc1-mm2
john stultz: new timeofday arch specific hooks (v. A2)
john stultz: new timeofday arch specific timesources (v. A2)
Con Kolivas: sched: Isochronous class v2 for unprivileged soft rt scheduling
Con Kolivas: sched - Implement priority and fifo support for SCHED_ISO
Peter Williams: plugsched-2.0 patches ...
Limin Gu: Job - inescapable job containers
Ingo Molnar: spinlock fix #1
Ingo Molnar: spinlock fix #2: generalize [spin|rw]lock yielding
Ingo Molnar: spinlock fix #3: type-checking spinlock primitives, x86
Ingo Molnar: Real-Time Preemption, -RT-2.6.11-rc2-V0.7.36-00
Ingo Molnar: sched: /proc/sys/kernel/rt_cpu_limit tunable
Ingo Molnar: sched: RLIMIT_RT_CPU_RATIO feature
john stultz: new timeofday core subsystem (v. A2)
Karim Yaghmour: relayfs redux for 2.6.10: lean and mean
Zach Brown: tracepipe -- event streams, debugfs, and pipe_buffers
Bartlomiej Zolnierkiewicz: ide-dev-2.6 update
Pavel Machek: driver model: pass pm_message_t down to pci drivers
Pavel Machek: driver model: more pm_message_t conversion
James Bottomley: SCSI updates for 2.6.11-rc2
Stephen Hemminger: skge driver (0.5)
Robert Love: inotify for 2.6.11-rc1-mm2
John McCutchan: inotify 0.18
Andreas Gruenbacher: NFSACL protocol extension for NFSv3
Andreas Gruenbacher: Infrastructure and server side of nfsacl
Andreas Gruenbacher: Client side of nfsacl
raven@themaw.net: autofs 4.1.4 beta1 release
Arjan van de Ven: removing bcopy... because it's half broken
Mel Gorman: Avoiding fragmentation through different allocator
Christoph Lameter: alloc_zeroed_user_highpage to fix the clear_user_highpage issue
Christoph Lameter: Extend clear_page by an order parameter
Christoph Lameter: A scrub daemon (prezeroing)
Andrea Arcangeli: OOM fixes 1/5
Andrea Arcangeli: OOM fixes 2/5
Andrea Arcangeli: OOM fixes 3/5
Andrea Arcangeli: OOM fixes 4/5
Andrea Arcangeli: OOM fixes 5/5
Andrea Arcangeli: writeback-highmem
Christoph Lameter: [PATCH] Soft introduction of atomic pte operations to avoid the clearing of ptes
Andrea Arcangeli: seccomp for 2.6.11-rc1-bk8
Fruhwirth Clemens: Adding cipher mode context information to crypto_tfm
Fruhwirth Clemens: Adding a generic scatterlist eater: generic scatterwalk
Fruhwirth Clemens: Add tweakable cipher interface
Fruhwirth Clemens: Add LRW
Serge Hallyn: fs-only bsdjail
christophe varoqui: multipath-tools-0.4.2
Stephen Hemminger: iproute2 (050124) release
Matt Mackall: lib/qsort
Nigel Cunningham: Software Suspend for 2.4 Final Release
