Kernel development [LWN.net]

Kernel release status

The current 2.6 development kernel is 2.6.28-rc8, released on December 10. 2.6.28-rc8 contains another fairly long list of fixes, including some for fairly important regressions.

Linus also notes (in the 2.6.28-rc8 announcement) that he's trying to figure out whether to release 2.6.28 before or after the holidays. He asks for suggestions, of which, one assumes, he will get plenty.

The current stable 2.6 kernel is 2.6.27.8, released on December 5. This is a large update with fixes for a wide variety of problems.

Quotes of the week - all about documentation

For some reason, the act of typing in some kerneldoc makes people's brains turn off. Perhaps it's because "oh, I am supposed to type some documentation here" instead of "gee, I think this code is unclear, let's clarify that".

-- Andrew Morton

I personally don't like kerneldoc at all, because the truth is that people will work on fixing bugs and other useful things before keeping kerneldoc up to date. And that's the basic fact which cannot be denied.

I wish it could work, but it doesn't across the board. So unless we have dedicated monkeys scouring over every single patch that goes into the tree and doing the necessary kerneldoc updates, kerneldoc will be chronically wrong somewhere.

That leads to confusion and lost developer time. Because if the kerneldoc bits are wrong, it's worthless.

-- David Miller

I expect better: You never see me hard with time word making sentence coherent stuff. Ever.

-- Rusty Russell

As usual: You shall never rely on the source code comments, they will only mislead you.

-- Manfred Spraul

Speakers needed for the linux.conf.au Kernel Miniconf

The Kernel Miniconf at linux.conf.au next January is looking for a speaker or two to fill out the schedule. "Presentations do not have to be limited to a slide deck. If you have an idea for a 50-minute session that follows a non-traditional format, it will be considered."

A new realtime tree

By Jonathan Corbet
December 9, 2008

It has been just over four years, now, since the realtime discussion got serious and the realtime preemption patch set got its start. During that time, your editor has heard many predictions for when the bulk of the realtime work would be merged; generally, the guess has been "within about a year." While a lot of realtime work has been merged, some of the core components of the realtime tree remain outside of the mainline. Beyond that, the realtime developers have been relatively quiet over the last year - at least on the realtime front. Having taken on some little side tasks - unifying the x86 architecture and maintaining it going forward, for example - some of those developers have been just a little bit distracted recently.

The realtime patch set has not gone away, though. If nothing else, the fact that a number of distributors are shipping this code is enough to ensure continued interest in its development. So your editor noted with interest the recent announcement of a new -rt tree with an updated set of realtime patches. This tree will be of interest for anybody wanting to look at the realtime work in the context of the 2.6.28 kernel or beyond.

One of the core technologies in the realtime tree is a change to how spinlocks work. Spinlocks in the mainline will busy-wait until the required lock becomes available; they thus occupy the processor to no useful end when acquiring a contended lock. Holding a spinlock will also prevent a thread from being preempted. This behavior is generally best for system throughput; it also makes it easier to write correct code. But anything which prevents a CPU from immediately servicing the highest-priority process runs counter to the chief design goal of a realtime operating system: providing deterministic response times in all situations. So, for the realtime patches, classic spinlocks had to go.

The solution was to turn most spinlocks into a form of mutex with priority inheritance. A process which attempts to acquire a contended "spinlock" will no longer spin; instead, it goes to sleep and waits for the lock to become free, making the processor available to another thread. Code which holds one of these non-spinlocks is no longer immune to preemption; a higher-priority thread can always push it out of the way. By changing spinlocks in this way, the realtime hackers were able to eliminate one of the largest sources of latency in the mainline kernel. Much of that work found its way into the mainline some time ago in the form of the mutex API, but spinlocks themselves have not been changed in the mainline.

To minimize the pain of maintaining the realtime patches, the developers simply redefined the spinlock_t type to be the new mutex type instead. Except that, as it turns out, some spinlocks in low-level parts of the kernel really do need to be spinlocks still. So those were switched to a new raw_spinlock_t type - but without changing the various spin_lock() calls. Instead, some truly frightening macro trickery was introduced to cause the spinlock API to do the right thing when passed either of two entirely different mutual exclusion primitives. This bit of macro magic was always going to be an impediment to mainline inclusion, so the realtime developers never really expected to merge the lock code in that form.
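As a rough illustration of the kind of type-based dispatch involved, the sketch below uses C11's _Generic selection to route a single spin_lock() spelling to one of two implementations, depending on the type of its argument. It is not the code from the realtime tree, which predates C11 and relied on preprocessor tricks of its own; the struct layouts, the raw_lock_impl() and rt_lock_impl() helpers, and the call counters are all assumptions made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two lock types; not the kernel's definitions. */
typedef struct { bool held; } raw_spinlock_t;  /* must stay a real spinlock */
typedef struct { bool held; } spinlock_t;      /* becomes a sleeping lock on -rt */

static int raw_calls;  /* how often the raw path was chosen */
static int rt_calls;   /* how often the sleeping-lock path was chosen */

static void raw_lock_impl(raw_spinlock_t *l)
{
    assert(!l->held);  /* single-threaded model: no contention is simulated */
    l->held = true;
    raw_calls++;
}

static void rt_lock_impl(spinlock_t *l)
{
    assert(!l->held);
    l->held = true;
    rt_calls++;
}

/* One spelling for callers; the compiler picks the implementation by type. */
#define spin_lock(l) _Generic((l),        \
        raw_spinlock_t *: raw_lock_impl,  \
        spinlock_t *:     rt_lock_impl)(l)
```

Callers write spin_lock(&lock) for either type. Since the selection has no default branch, passing a pointer to any other type is rejected at compile time.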

The new realtime tree now shows how the realtime developers think this work might get into the mainline. It involves a more explicit separation of the two types of "spinlocks" - and a lot of code churn. In the realtime tree, most locks of type spinlock_t are changed to a new lock_t type. There is a new set of operations for this type:

    #include <linux/lock.h>

    lock_t lock;

    acquire_lock(&lock);
    release_lock(&lock);

For a normal, non-realtime kernel build, lock_t will be the same as spinlock_t, and things will work as they always have. On realtime kernels, instead, lock_t will be a mutex type. The other variants of the spinlock API will be represented in the new API (there is an acquire_lock_irqsave(), for example), but none of them will actually disable interrupts in a realtime kernel. Meanwhile, spinlock_t will remain a true spinlock type.
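A user-space model may make the build-time choice more concrete. In the sketch below, lock_t, acquire_lock(), and release_lock() follow the names given above, but everything else is an assumption: the MODEL_PREEMPT_RT switch is hypothetical, the lock bodies are simple C11 atomic_flag loops rather than kernel code, and sched_yield() merely stands in for putting the waiter to sleep.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Model of a true spinlock: busy-wait until the flag clears. */
typedef struct { atomic_flag held; } spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* occupy the CPU until the lock is free */
}

static inline void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Model of the realtime variant: give up the CPU instead of spinning. */
typedef struct { atomic_flag held; } sleeping_lock_t;

static inline void sleeping_lock(sleeping_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        sched_yield();  /* stand-in for sleeping until the lock is released */
}

static inline void sleeping_unlock(sleeping_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* The build configuration decides what lock_t really is. */
#ifdef MODEL_PREEMPT_RT
typedef sleeping_lock_t lock_t;
#define acquire_lock(l) sleeping_lock(l)
#define release_lock(l) sleeping_unlock(l)
#else
typedef spinlock_t lock_t;
#define acquire_lock(l) spin_lock(l)
#define release_lock(l) spin_unlock(l)
#endif

#define LOCK_INIT { ATOMIC_FLAG_INIT }

/* Example caller: identical source text for both configurations. */
static lock_t counter_lock = LOCK_INIT;
static int counter;

int bump_counter(void)
{
    int value;

    acquire_lock(&counter_lock);
    value = ++counter;
    release_lock(&counter_lock);
    return value;
}
```

The point of the design is visible in bump_counter(): the caller names only lock_t and its operations, so the same source builds either way, while code that genuinely needs to spin keeps using spinlock_t directly.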

This change gets rid of the tricky macros, but at the cost of changing the declarations of and operations on almost all spinlocks in the kernel. That is a lot of code changes: a quick grep turns up over 20,000 spin_lock*() calls in the upcoming 2.6.28 kernel. That will make for some pain if and when this change is merged. In the meantime, it makes for a lot of pain for the people who have to maintain this patch out of tree. To make their lives a little easier, the realtime developers have created a couple of scripts to do the bulk of the work. First, all spinlocks in a pristine kernel are converted to lock_t, then the few locks which truly must be spinlocks are switched back. This work is kept in a separate branch which is regenerated when needed; in this way, the realtime developers avoid the need to do nasty merges to keep up with current kernels.

Your editor has heard talk of another locking change which does not, yet, appear in this tree. One problem with the realtime patch set is that it requires distributors to create yet another kernel build - something they hate doing - if they want to support realtime operation. In an effort to make life easier for distributors, the realtime developers are working on a scheme whereby a kernel would determine at run time whether it should be running in a realtime mode. If so, spinlocks will be changed to sleeping locks by patching the kernel binary as it boots. Kernels built this way will be able to run efficiently in either mode.

The branches of the realtime tree provide a quick guide to the other parts of the realtime work which remain outside of the mainline. The threaded interrupt handler code is one example; that change could be proposed (again) for merging in the near future. The priority workqueue mechanism sits in another branch, as do patches aimed at Java support, filesystem changes, memory management changes, and more. Then, there's a branch for stuff which will never be merged; for example, there is a patch which gives Java programs direct access to physical memory - not something which strikes most kernel developers as a good idea. All told, there is a great deal of work sitting in the realtime patch set; this work is finally being organized into a proper git tree.

The "upstream first" policy says that vendors should merge their code upstream before shipping it to customers. The 2.6.x development model is built on the idea that no change is too fundamental to be accepted into a regular, three-month development cycle. The realtime patches would appear to be an exception to both rules. It has taken over four years to get to a point where some of the fundamental realtime technologies are close to ready for the mainline, but distributors have been shipping that code for at least three of those years. The realtime tree has, in other words, been one of the biggest forks of the Linux kernel, ever. The plan has always been to join this fork back with the mainline, though; perhaps, finally, that goal is getting closer. With luck, it will happen within about a year.

Tracking down a "runaway loop"

By Jake Edge
December 10, 2008

The Linux boot process, at least as provided by distributions, depends on help from user space, with drivers being loaded as required from the initial filesystem (initramfs/initrd). Loading drivers requires using tools built into the initramfs, and if those tools break, the kernel won't boot. But when a working kernel configuration and initramfs are used with a new kernel, the result is expected to be a kernel that successfully boots. When that doesn't happen, bugs are filed regarding kernel regressions but, as a recent example shows, the actual problem may be elsewhere.

The original report was made in late October, but no progress was made until Evgeniy Polyakov saw it again in early December. The symptom was a kernel that hangs after printing:

    request_module: runaway loop modprobe char-major-5-1

four times on the console. Since nothing in the user space (initramfs) or kernel configuration had changed, it seemed to clearly point to something in the kernel itself.

It turns out that the "runaway loop" message is meant to indicate that the request_module() function has been invoked recursively. So in an effort to load the driver for the character device with major/minor numbers 5/1—which corresponds to /dev/console—request_module() was invoked again. The code in kernel/kmod.c:

        if (atomic_read(&kmod_concurrent) > max_modprobes) {
                /* We may be blaming an innocent here, but unlikely */
                if (kmod_loop_msg++ < 5)
                        printk(KERN_ERR
                               "request_module: runaway loop modprobe %s\n",
                               module_name);
                atomic_dec(&kmod_concurrent);
                return -ENOMEM;
        }

limits how many times that message is printed (the kmod_loop_msg test shown allows at most five), but the invoker should recognize the -ENOMEM return and handle it appropriately.

The root cause was that something in the kernel was trying to access /dev/console before that device was registered in the kernel. This led the kernel to try to load a module to handle /dev/console, which will fail. Because of the failure, something in the user-space modprobe then tries to access /dev/console, presumably to output an error message, which repeats the kernel module loading process. And so on. Once that recursion exceeds the max_modprobes limit, request_module() will produce the runaway loop message and return -ENOMEM, which should put a stop to the whole process.
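The limiter's intended behavior can be modeled in user space. The sketch below is not kernel code: it reuses the kmod_concurrent and kmod_loop_msg names from the excerpt above, but MAX_MODPROBES, request_module_model(), run_helper(), and the messages_printed counter are hypothetical, and the helper simply re-enters the loader once to mimic a failing modprobe touching /dev/console.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

#define MAX_MODPROBES 3  /* assumption: kept small so the model unwinds quickly */

static atomic_int kmod_concurrent;  /* number of nested requests in flight */
static int kmod_loop_msg;           /* how often the limit has been hit */
static int messages_printed;        /* model-only bookkeeping */

static int run_helper(const char *module_name);

int request_module_model(const char *module_name)
{
    int ret;

    assert(module_name != NULL);
    atomic_fetch_add(&kmod_concurrent, 1);
    if (atomic_load(&kmod_concurrent) > MAX_MODPROBES) {
        /* Too deep: complain a limited number of times and refuse. */
        if (kmod_loop_msg++ < 5) {
            fprintf(stderr, "request_module: runaway loop modprobe %s\n",
                    module_name);
            messages_printed++;
        }
        atomic_fetch_sub(&kmod_concurrent, 1);
        return -ENOMEM;
    }

    ret = run_helper(module_name);
    atomic_fetch_sub(&kmod_concurrent, 1);
    return ret;
}

/*
 * Model of the user-space helper: loading fails, the helper tries to report
 * that on the console, and opening the console asks for char-major-5-1.
 * This helper is well behaved and gives up once the nested request fails.
 */
static int run_helper(const char *module_name)
{
    (void)module_name;
    return request_module_model("char-major-5-1");
}
```

With a helper that gives up on failure, a call such as request_module_model("cryptomgr") nests until the limit is reached, prints the message once, and unwinds with -ENOMEM at every level. A helper that keeps retrying after the error would never let that unwinding finish.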

In an acrimonious thread (and kernel bug report), Alan Cox, Kay Sievers, and Polyakov tried to determine where the problem came from and what to do about it. It didn't help matters that they were using initramfs images from different distributions, so they saw different behavior. Polyakov and Sievers were using Debian user space while Cox was using Fedora. Something in the Debian version was continuing to try to open /dev/console even after getting ENOMEM. This leads to an infinite loop, and thus a kernel hang.

Sievers eventually tracked it to the kernel cryptographic API:

It is caused by: "modprobe cryptomgr" called from swapper[1]

This modprobe process does try to log an error, accesses /dev/console, which is not initialized in the kernel at that time, and the kernel module loader tries the load a module to support dev_t 5:1, which again runs modprobe, and ...

Setting CONFIG_CRYPTO_MANAGER=y makes it disappear.

It turns out that the crypto layer attempts to load the cryptomgr module as part of its algorithm testing infrastructure. If cryptomgr fails to load, the algorithm registration code can continue without it. It is optional, but modprobe wants to put out a message when it fails to load it, which leads to the runaway loop. As Herbert Xu points out, though, the problem is not crypto-specific at all:

In any case the loop itself does not involve any crypto components so I don't think making changes in the crypto layer is going to make this go away forever as anyone calling request_module early enough will get into this loop.

It is this potential pitfall that Sievers and Polyakov would like to see removed. In general, user-space programs are not required to be concerned with the availability of /dev/console—except when they are run from early kernel initialization. But Cox points out that user-space helpers must concern themselves with avoiding loops because there are multiple possible ways to cause that to happen. As an example, he notes that if UNIX-domain sockets (AF_UNIX) are in a module and syslog() is called before the module is loaded, a similar loop will occur.

In an effort to "step back" from the arguments that were going back and forth, Ted Ts'o offers his analysis of the problem along with a suggested course of action:

There is a dispute about whether it is looping forever, or whether it should be getting caught by kernel/kmod.c's modprobe recursion detector. Alan has checked the recursion detector and reports that it works just fine; Evgeniy and Kay are claiming that it in fact loops forever, and the recursion detector is not working.

[...] So I would think the best thing to do is to figure out what Debian's initrd is doing that is evading the recursion detection. Fixing that is going to make things much more robust.

Clearly the recursion detector is working to some extent, or the runaway loop messages would not be seen, but on Debian, at least, that detection doesn't stop the problem. Ts'o's theory is that something outside of the directly invoked helper is actually the culprit: "I'm guessing why it isn't working given Debian's initrd setup is that whatever is ultimately opening /dev/console isn't being called until after the helper script has exited." That seems worth tracking down, as Ts'o points out in a later message:

It would be good to make sure we understand what the root causes for while the modprobe recursion detector is apparently not triggering, since it could be that Debian's initrd might cause some other uncaught recursion loop if we don't drive this problem determination to root cause.

The exact cause of the problem and why Debian and Fedora behave differently is still not known. Digging into Debian's initrd to figure that out, as Ts'o suggests, is clearly the right starting point. That answer will likely lead to sensible fixes, either in user space or the kernel—possibly both. Bickering about where and how to fix the problem before it is fully understood seems counter-productive at best.

Dueling performance monitors

By Jonathan Corbet
December 9, 2008

Low-level optimization of performance-critical code can be a challenging task. At this point, one assumes, the potential for algorithmic improvements in the targeted code has been realized; what is left is trying to locate and address problems like cache misses, mis-predicted branches, and so on. Such problems can be impossible to find by just looking at the code; one needs support from the hardware. The good news is that contemporary hardware provides that support; most processors can collect a wide range of performance data for analysis. The bad news is that, despite the fact that processors have been able to collect that data for many years, there has never been support for this kind of performance monitoring in the mainline kernel. That situation may be about to change, but, first, the development community will have to make a choice between a venerable out-of-tree implementation and an unexpected competitor.

The "perfmon" patch set has been under development for some years, but, for a number of reasons, it has never found its way into the mainline kernel. The most recent version of the patch was posted for review by Stéphane Eranian in late November. The perfmon patches show the signs of all those years of development work and usage experience; they offer a wide set of features and extensive user-space support. The full perfmon patch adds twelve system calls to the kernel; the posted version, though, trims that count back to five in the hope that a narrower interface will have a better chance of getting into the mainline. The additional system calls, one assumes, will be proposed for inclusion sometime after the perfmon core is merged. The reduced interface is described in the patch set; briefly, an application hooks into the performance monitoring subsystem with a call to:

    int pfm_create(int flags, pfarg_sinfo_t *regs);

This system call returns a file descriptor to identify the performance monitoring session. The regs parameter is used to return a list of performance monitoring registers available on the current system; flags is currently unused.

Specific performance counter registers can be manipulated with:

    int pfm_write(int fd, int flags, int type, void *d, size_t sz);
    int pfm_read(int fd, int flags, int type, void *d, size_t sz);

These system calls can be used to write values into registers (thus programming the performance monitoring hardware) and to read counter and configuration information from those registers.

Actually doing some performance monitoring requires a couple more calls:

    int pfm_attach(int fd, int flags, int target);
    int pfm_set_state(int fd, int flags, int state);

A call to pfm_attach() specifies which process is to be monitored; pfm_set_state() then turns monitoring on and off.

There are a couple of distinctive aspects to the perfmon interface. One is that it knows almost nothing about the specific performance monitoring registers; that information, instead, is expected to live in user space. As a result, the bare perfmon system call interface is probably not something that most monitoring applications would use; instead, those system calls are hidden behind a user-space library which knows how to program different types of processors for the desired results. Beyond that, perfmon uses the ptrace() mechanism to stop the monitored process while performance counters are being queried; as a result, the monitoring process must have the right to trace the target process.

On December 4, Thomas Gleixner and Ingo Molnar posted a surprise announcement of a new performance counter subsystem. The announcement states:

We are aware of the perfmon3 patchset that has been submitted to lkml recently. Our patchset tries to achieve a similar end result, with a fundamentally different (and we believe, superior :-) design.

This is not the first time that these developers have shown up with an out-of-the-blue reimplementation of somebody else's subsystem; other examples include the CFS scheduler, high-resolution timers, dynamic tick, and realtime preemption. Most of the time, the new code quickly supplants the older version - an occurrence which is not always pleasing to the original developers - but the situation does not seem quite as straightforward this time.

The proposed interface is much simpler, adding a single system call:

    int perf_counter_open(u32 hw_event_type, u32 hw_event_period,
                          u32 record_type, pid_t pid, int cpu);

This call will return a file descriptor corresponding to a single hardware counter. A call to read() will then return the current value of the counter. The hw_event_period can be used to block reads until the counter overflows the given value, allowing, for example, events to be queried in batches of 1000. The pid parameter can be used to target a specific process, and cpu can restrict monitoring to a specific processor.

There are a few advantages claimed for the new implementation. The simplicity of the system call interface is one of those; it is possible to write a very simple application to perform monitoring tasks, with no additional libraries required. The second version of the patch includes a simple "kerneltop" utility which can display a constantly-updated profile of anything the performance counting hardware can monitor. Another advantage is the avoidance of ptrace(); this reduces the amount of privilege needed by the monitoring process and avoids perturbing the monitored process by stopping and restarting it. The management of counters is said to be more flexible, with facilities for sharing counters between processes and reserving them for administrative access. The low-level hardware interface is said to be simpler as well.

Those claimed advantages notwithstanding, a number of complaints have been raised with regard to the new performance monitoring code. Two of those seem to be at the top of the list: the single counter per file descriptor API, and programming the hardware performance monitoring unit inside the kernel. On the API side, the biggest concern is that putting each counter behind its own file descriptor makes it very hard to correlate two or more counters. Reading two counters requires two independent read() system calls; as is always the case, just about anything could happen between those two calls. So it's hard to tell how two different counter values relate to each other. But that sort of correlation is exactly what developers doing performance optimization want to do. Paul Mackerras says:

Your API has as its central abstraction the "counter". I am saying that that is the wrong abstraction. The abstraction really needs to be a set of counters that are all active over precisely the same interval, so that their values can be meaningfully compared and related to each other.

In response, Ingo argues that the loss of precision caused by independent read() calls is small - much smaller than the muddying of the results caused by stopping the target process so that all of the counters can be read at the same time. That argument does not appear to have convinced the detractors, though.

The other complaint is that moving the counter programming task into the kernel requires that the kernel know about the complexities of every possible performance monitoring unit it may encounter. This hardware sits at the core of the most performance-critical CPU subsystems, so its design parameters value non-interference above features or a straightforward programming interface. So programming it can be a complex business, involving sizeable tables describing how various operations interact with each other. The perfmon code keeps those tables in a user-space library, but the alternative implementation won't allow that. Quoting Paul again:

Now, the tables in perfmon's user-land libpfm that describe the mapping from abstract events to event-selector values and the constraints on what events can be counted together come to nearly 29,000 lines of code just for the IBM 64-bit powerpc processors.

Your API condemns us to adding all that bloat to the kernel, plus the code to use those tables.

Paul (and others) argue that this information - which can add up to hundreds of kilobytes - is better kept in user space.

There also seems to be a bit of concern over the fact that Stéphane had clearly never heard about this work before it was posted for review. It must, indeed, be a shock to work on a subsystem for years, then find a proposed replacement sitting in one's mailbox. As David Miller put it:

And also, another part of the backlash is that the poor perfmon3 person was completely blindsided by this new stuff. Which to be honest was pretty unfair. He might have had great ideas about the requirements (even if you don't give a crap about his approach to achieving those requirements) and thus could have helped avoid the past few days of churn.

So, at this point, what will happen with performance monitoring is unclear at best. Perhaps, though, this discussion will have the effect of raising the profile of performance monitoring, which has been without proper kernel support for many years. The merging of either solution - or, perhaps, a combination of both - seems like it has to be an improvement over having no support at all.

Patches and updates

Greg KH Linux 2.6.27.8
Steven Rostedt 2.6.24.7-rt24
Joerg Roedel KVM device passthrough support with AMD IOMMU
K.Prasad Hardware Breakpoint interfaces - v2
Yinghai Lu sparse irq enabling v5
Gregory Haskins sched: track next-highest priority
Oren Laadan Kernel based checkpoint/restart
Jeff Arnold Ksplice: Rebootless kernel updates
Nick Andrew Recursive printk
Steven Rostedt The -rt git tree
Casey Dahlin waitfd: file descriptor to wait on child processes
Thomas Gleixner [Announcement] Performance Counters for Linux
Ingo Molnar Performance Counters for Linux, v2
Mathieu Desnoyers Performance counter in-kernel virtualization API
Andy Whitcroft update checkpatch to version 0.26
Lai Jiangshan ftrace: support binary record
Userspace I/O (UIO): Add support for userspace DMA
Hans J. Koch UIO: Pass information about ioports to userspace (V2)
Mauro Carvalho Chehab edac: driver for i5400 MCH (Seaburg)
Matthew Garrett misc: Add oqo-wmi driver for model 2 OQO backlight and rfkill control
Matthew Garrett misc: Add dell-wmi driver for hotkey control
Alan Cox watchdog: Introduce a midlayer/helper library
Alan Cox watchdog: New library code
Trilok Soni Add TI TSC2005 Touchscreen driver Dec 07
Daniel Silverstone Micrel KS8695 intergrated ethernet driver
chaithrika@ti.com phy: Add LSI ET1011C PHY driver
Steve Glendinning smsc9420: SMSC LAN9420 10/100 PCI ethernet adapter
David Daney libata: Cavium OCTEON SOC Compact Flash driver (v4)
Guennadi Liakhovetski i.MX31: dmaengine and framebuffer drivers
Robert Love Open-FCoE Submission (round 2)
Michael Kerrisk man-pages-3.15 is released
Diego Elio 'Flameeyes' Pettenò Add basic export support to HFS+.
Niraj Kumar add unrestricted_chown option to vfs (mount)
Daniel Phillips Tux3 report: Tux3 by Christmas?
Lee Schermerhorn - support inheritance of mlocks across fork/exec V2
Ying Han page_fault retry with NOPAGE_RETRY
Nick Piggin mm: direct IO starvation improvement
Inaky Perez-Gonzalez merge request for WiMAX kernel stack and i2400m driver v3
Paul Moore Labeled networking patches for 2.6.29
KAMEZAWA Hiroyuki [RFC][PATCH 0/4] cgroup ID and css refcnt change and memcg hierarchy (2008/12/05)
Rusty Russell kvm: use modern cpumask primitives, no cpumask_t on stack
Avi Kivity KVM Updates for 2.6.29 (Part 1 of 3)
Avi Kivity KVM Updates for 2.6.29 (part 2 of 3)
Rafael J. Wysocki 2.6.28-rc7-git2: Reported regressions from 2.6.27
Rafael J. Wysocki 2.6.28-rc8-git5: Reported regressions from 2.6.27
Rafael J. Wysocki 2.6.28-rc8-git5: Reported regressions 2.6.26 -> 2.6.27
Arjan van de Ven Oops/Warning report for the week of December 3rd, 2008
