
Kernel development

Brief items

Kernel release status

The current stable kernel remains 2.6.18. The 2.6.19 merge window has opened and, as of this writing, just over 2000 patches have landed in the mainline git repository - see the separate article, below, for a summary.

The current -mm tree is 2.6.18-mm1. Recent changes to -mm include an xpad dance pad driver, a new version of the SLIM security module, and a number of fixes. The -mm tree is shrinking quickly as patches move into the mainline.

Adrian Bunk has released the first 2.6.16.30 prepatch. Along with the usual fixes, this prepatch adds a few new drivers, which has caused some observers to wonder about criteria for patches in the long-term 2.6.16 tree. It appears that 2.6.16, going into the future, will be a bit more open to new code than the regular -stable tree.


Kernel development news

Quotes of the week

I don't guarantee that I always change my mind, but I _can_ guarantee that if most of the people I trust tell me I'm a dick-head, I'll at least give it a passing thought.

[ Chorus: "You're a dick-head, Linus" ]

-- Linus Torvalds

Ultimately, we need to recognize that Linux is a 15-year-old kernel and that there will be another technical development to supersede it eventually. I can't say what that will be, but I think the best chance of mobilizing individual contribution to it would be to use GPL 3.

-- Bruce Perens


The 2.6.19 process begins

The 2.6.19 merge window has opened, and the flood of patches into the mainline has begun. As of this writing, it has only begun - the 2000 or so patches which have been merged after 2.6.18 are likely to be outnumbered by those that remain. Here's what has found its way in so far, starting with the user-visible changes:

  • The OCFS2 filesystem is no longer marked "experimental."

  • A number of InfiniBand updates, including better RDMA support and drivers for some new adapters.

  • Support for IPv6 policy routing rules - and a mechanism for the creation of multiple IPv6 routing tables to support those rules.

  • The parallel ATA driver patch set.

  • The labeled networking patch set implementing the Commercial IP Security Option.

  • Support for the Atmel AVR32 architecture.

  • Super-H support for Titan, SH7710VoIPGW and I-O DATA Landisk boards.

  • Big updates to the PowerPC and S/390 architectures. Among other things, S/390 has gained KProbes support.

  • New drivers for external flash on ATSTK1000 boards, TI OMAP1/2 i2c busses, ESI Miditerminal 4140 devices, Areca RAID controllers, SuperTrak EX8350/8300/16350/16300 SCSI controllers, QLogic QLA3xxx network interfaces, IBM eHEA Ethernet adapters, and the Ethernet controller found on Cirrus Logic ep93xx boards. The controversial aic94xx driver, originally written by Luben Tuikov and since revised by a number of others, has also been merged.

Changes visible to kernel developers include:

  • The CHECKSUM_HW value has long been used in the networking subsystem to support hardware checksumming. That value has been replaced with CHECKSUM_PARTIAL (intended for outgoing packets where the job must be completed by the hardware) and CHECKSUM_COMPLETE (for incoming packets which have been completely checksummed by the hardware).

  • A number of memory management changes, including tracking of dirty pages in shared memory mappings, making the DMA32 and HIGHMEM zones optional, and an architecture-independent mechanism for tracking memory ranges (and the holes between them).

  • The pud_page() and pgd_page() macros now return a struct page pointer, rather than a kernel virtual address. Code needing the latter should use pud_page_vaddr() or pgd_page_vaddr() instead.

  • A number of driver core changes including parallel device probing and some improvements to the suspend/resume process.

  • There is now a notifier chain for out-of-memory situations; the idea here is to set up functions which might be able to free some memory when things get very tight.

  • The semantics of the kmap() API have been changed a bit: on architectures with complicated memory coherence issues, kmap() and kunmap() are expected to manage coherency for the mapped pages, thus eliminating the need to explicitly flush pages from cache.

  • PCI Express Advanced Error Reporting is now supported in the PCI layer.

  • A number of changes have been made to the inode structure in an effort to make it smaller.

  • The no_pfn() address space operation has been added.

For anybody who is curious about what else is likely to be merged, Andrew Morton's 2.6.19 -mm merge plans document is worth a look. Highlights include another set of memory patches (with ongoing discussion over whether making ZONE_DMA optional makes sense), a rework of the network time protocol code, the vectored AIO patch set (maybe), a long list of NFS improvements, eCryptfs (though there is some opposition here), various device mapper and RAID improvements, and a number of changes to the generic IRQ layer.

Additionally, Andrew plans to merge a couple of container-oriented patches: virtualization for the utsname and IPC namespaces. Says Andrew:

This doesn't really make sense on its own, so there's an act of faith here - it assumes that Linux will eventually have full-on virtualisation of the various namespaces with sufficient coverage to actually be useful to userspace.

Normally I'd just buffer all the functionality into -mm until it's ready to go and is actually useful to userspace. But for this work that would mean just too many patches held for too long. So I'll start moving little pieces like this into mainline.

One thing which is not likely to go in is reiser4, which is still held up on various needed fixes. So this filesystem looks like it will wait for yet another development cycle.


Driver core API changes for 2.6.19

The Linux driver core subsystem continues to evolve at a high rate. The set of patches for 2.6.19 continues this process with a number of improvements - and a number of API changes. This time around, however, the changes appear to be additive, and thus should not break any existing drivers.

Linux boot time is an ongoing sore point - there are few users who wish that their systems would take longer to come up. There are many things which happen during the boot process, and many possible ways of speeding things up. Most of the opportunities for improving boot time lie in user space, but, on the kernel side, probing for devices can take a lot of time. Each device must be located, initialized, and hooked into the system; this process can involve waiting for peripheral processors to boot, firmware loads, and, perhaps, even physical processes like spinning up disks. As a result, much of the kernel time spent bringing up devices is idle time, waiting for the device to do its part.

One obvious idea for improving this process is to probe devices in parallel. That way, when the kernel is waiting for one device to respond, it can be setting up another; the kernel would also be able to make full use of multiprocessor systems. The 2.6.19 device core will, at last, have the ability to operate in this mode. The changes start by adding a flag (multithread_probe) to the device_driver structure. At probe time, if a driver has set that flag, the actual work of setting up the device will be pushed into a separate kernel thread which can run in parallel with all the others. At the end of the initialization process, the kernel waits for all outstanding probe threads to finish before mounting the root filesystem and starting up user space.

On uniprocessor systems, this change leads to a relatively small reduction of bootstrap time. Drivers typically do not yield the processor during the probe process, so there is relatively little opportunity for parallelism, even during times when the kernel has to wait for a bit. On multiprocessor systems, however, the effect can be rather more pronounced - each CPU can be probing devices in parallel with all the others. So this change will be most useful on large systems with lots of attached devices.

At least, it will be useful once it's enabled; this feature is currently marked "experimental" and carries a number of warnings. Even when it is turned on, it only applies to PCI devices. Not all drivers are written with parallel probing in mind, so they may not have the right sort of locking in place. There can be problems with power drain - turning on too many devices simultaneously can cause a high demand for power over a short period of time; if this demand exceeds what the power supply can deliver, the resulting conflagration could slow the boot process considerably. The order of device enumeration is likely to become less deterministic. And so on. Still, this feature, over time, should lead to faster system boots, especially on systems (such as embedded applications) where the mix of hardware is well understood and static.

On a separate front, the API for handling suspend and resume has been filled out somewhat. The class mechanism now has its own hooks, found in struct class:

    int (*suspend)(struct device *dev, pm_message_t state);
    int (*resume)(struct device *dev);

The new suspend() method is called relatively early in the suspend process, and is expected to handle any class-specific tasks. These might include quieting the device and stopping higher-level processing. The resume() method is called toward the end of the resume process and should finish the job of getting devices in the class ready to operate again.

Most of the suspend/resume processing is still handled through the bus subsystem, however. That portion of the API has been filled out with three new struct bus_type methods:

    int (*suspend_prepare)(struct device *dev, pm_message_t state);
    int (*suspend_late)(struct device *dev, pm_message_t state);
    int (*resume_early)(struct device *dev);

All of these methods just add more places for the bus code to hook into the process and do whatever work needs to be done. The suspend_prepare() method is called early on, while the system is still in an operational state. The suspend() method is unchanged from prior kernels: it is called after tasks have been frozen, and is allowed to sleep if need be. The new suspend_late() method, instead, is called very late, with interrupts disabled and only a single processor running. At resume time, resume_early() is called, once again, with interrupts and SMP disabled, and the old resume() method is called later.

The PCI subsystem makes this new functionality available via three new methods in the pci_driver structure:

    int  (*suspend_prepare) (struct pci_dev *dev, pm_message_t state);
    int  (*suspend_late) (struct pci_dev *dev, pm_message_t state);
    int  (*resume_early) (struct pci_dev *dev);

There are no drivers actually using these new methods in the mainline, as of this writing.

Finally, the class subsystem continues to migrate toward the eventual removal of the class_device structure. To that end, struct class has picked up another pair of methods:

    int (*dev_uevent)(struct device *dev, char **envp, int num_envp,
		      char *buffer, int buffer_size);
    void (*dev_release)(struct device *dev);

These methods provide functionality similar to that of the uevent() and release() methods in struct class_device.

Also as part of this migration, a couple of new helper functions have been added:

    int device_create_bin_file(struct device *dev, 
                               struct bin_attribute *attr);
    void device_remove_bin_file(struct device *dev, 
                                struct bin_attribute *attr);

These functions let drivers create binary attributes in sysfs without having to deal with the sysfs code directly.


Read-copy-update for realtime

The developers working on realtime response for Linux have stated their intent to merge many of their remaining changes into 2.6.19. One of those changes is a reworking of the read-copy-update mechanism for lower latencies; this work appears likely to go in regardless of the fate of the rest of the realtime code. So it's worth a look.

RCU, remember, is a mechanism which allows certain types of data structure to be updated without requiring locking between readers and writers. It works by splitting the update process into two steps: (1) replacing a pointer to old data with a pointer to the updated version, and (2) deferring the removal of the old data structure until it is known that no kernel code holds any references to that structure. The part about knowing that no references are held is handled by (1) requiring all code which references RCU-protected data structures to be atomic, and (2) waiting until all processors have scheduled once. Since a processor which schedules is not running atomic code, it cannot hold any references to RCU-protected data structures from before the call to schedule().

This mechanism works well for most systems, but it presents a problem in realtime environments. The requirement that references to RCU-protected data structures be handled by atomic code means that any such code cannot be preempted. That, in turn, increases latencies, which is just what the realtime code is trying to avoid. So another solution had to be found. A couple of ideas have been pursued, one of which is now advanced to the point that it will likely find its way into 2.6.19. Here we'll take a superficial look at how realtime RCU works; anybody interested in the details is advised to have a look at the realtime RCU paper [PDF] from the 2006 Ottawa Linux Symposium.

Fixing the RCU latency problem means ending the requirement that RCU-protected code be non-preemptible. And that, in turn, means that RCU can no longer count on a processor rescheduling meaning that no references to RCU-protected structures exist on that processor. So the accounting must be done in a more explicit manner. The realtime RCU code handles this accounting with two sequence numbers, two per-CPU counters and three linked lists.

The sequence numbers track the specific batches of RCU callbacks to process; for added confusion value, both are named "completed," though they live in two different global structures. The value rcu_ctrlblk.completed is the current batch number, which is accumulating new callbacks to process; rcu_data.completed, instead, is the number of the last batch of callbacks to have been processed.

Within any given RCU batch, one of the per-CPU counters tracks the number of kernel threads which are currently executing within RCU critical sections. During this batch, any RCU callbacks queued (with call_rcu()) will be appended to the first of the linked lists: rcu_data.nextlist. Whenever code calls rcu_read_lock(), the appropriate counter is incremented; a pointer to that counter is also stored so that, should the thread change processors before calling rcu_read_unlock(), the right counter will be decremented.

Another reason for storing a pointer to the counter has to do with the batch "flip" logic. When the RCU code decides that it is time to start a new batch, it increments rcu_ctrlblk.completed; that, in turn, will cause rcu_read_lock() to switch to the second per-CPU counter, which will start out at zero. Any new entries into RCU critical sections will increment the new counter. Meanwhile, any code which was in such a section when the flip happened retains a pointer to the old counter. So, when that code calls rcu_read_unlock(), the older counter will be decremented. When all of the counters from the old batch reach zero, the kernel knows that all references to RCU-protected data from the old batch are gone, and the corresponding RCU callbacks can be called.

Also at flip time, the set of RCU callbacks in rcu_data.nextlist is moved over to rcu_data.waitlist, since those callbacks are now waiting for any possible remaining references to go away. When all of the counters for that batch drop to zero, these callbacks are moved to the third list (rcu_data.donelist) so that they can be invoked whenever the kernel decides to get around to it. That work currently happens in a tasklet, but there is another patch queued for 2.6.19 which moves that work over to a separate software interrupt handler.

With this code in place, code within an RCU critical section can be preempted and it will still be possible to know when all references to protected data structures are gone. RCU critical sections still cannot sleep, of course, or they could delay the batch flip indefinitely. But they can be pushed out of the way temporarily if a higher-priority process needs to run.

The overall overhead of the new mechanism is higher, however, since it must maintain all of those counters. For this reason, it is unlikely to ever be the default RCU on most systems. Instead, the plan is to ship two RCU implementations, "classic" and "preempt," and allow the person configuring the kernel to choose between them.


Patches and updates

Kernel trees

Andrew Morton 2.6.18-mm1
Con Kolivas 2.6.18-ck1
Adrian Bunk Linux 2.6.16.30-pre1

Architecture-specific

Core kernel code

Davide Libenzi epoll_pwait for 2.6.18 ...
Dipankar Sarma RCU: various patches
malahal@us.ibm.com event based work queues

Development tools

Device drivers

Filesystems and block I/O

Steven Whitehouse GFS2 & DLM merge request

Janitorial

Memory management

Networking

Ashwini Kulkarni TCP socket splice
Patrick McHardy Netfilter update for 2.6.19

Security-related

Miscellaneous

Andrew Morton 2.6.19 -mm merge plans

Page editor: Jonathan Corbet


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds