Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.28-rc8, released on December 10. 2.6.28-rc8 contains another fairly long list of fixes, including some for fairly important regressions.
Linus also notes (in the 2.6.28-rc8 announcement) that he's trying to figure out whether to release 2.6.28 before or after the holidays. He asks for suggestions, of which, one assumes, he will get plenty.
The current stable 2.6 kernel is 2.6.27.8, released on December 5. This is a large update with fixes for a wide variety of problems.
Kernel development news
Quotes of the week - all about documentation
I wish it could work, but it doesn't across the board. So unless we have dedicated monkeys scouring over every single patch that goes into the tree and doing the necessary kerneldoc updates, kerneldoc will be chronically wrong somewhere.
That leads to confusion and lost developer time. Because if the kerneldoc bits are wrong, it's worthless.
Speakers needed for the linux.conf.au Kernel Miniconf
The Kernel Miniconf at linux.conf.au next January is looking for a speaker or two to fill out the schedule. "Presentations do not have to be limited to a slide deck. If you have an idea for a 50-minute session that follows a non-traditional format, it will be considered."
A new realtime tree
It has been just over four years, now, since the realtime discussion got serious and the realtime preemption patch set got its start. During that time, your editor has heard many predictions for when the bulk of the realtime work would be merged; generally, the guess has been "within about a year." While a lot of realtime work has been merged, some of the core components of the realtime tree remain outside of the mainline. Beyond that, the realtime developers have been relatively quiet over the last year - at least on the realtime front. Having taken on some little side tasks - unifying the x86 architecture and maintaining it going forward, for example - some of those developers have been just a little bit distracted recently.
The realtime patch set has not gone away, though. If nothing else, the fact that a number of distributors are shipping this code is enough to ensure continued interest in its development. So your editor noted with interest the recent announcement of a new -rt tree with an updated set of realtime patches. This tree will be of interest for anybody wanting to look at the realtime work in the context of the 2.6.28 kernel or beyond.
One of the core technologies in the realtime tree is a change to how spinlocks work. Spinlocks in the mainline will busy-wait until the required lock becomes available; they thus occupy the processor to no useful end when acquiring a contended lock. Holding a spinlock will also prevent a thread from being preempted. This behavior is generally best for system throughput; it also makes it easier to write correct code. But anything which prevents a CPU from immediately servicing the highest-priority process runs counter to the chief design goal of a realtime operating system: providing deterministic response times in all situations. So, for the realtime patches, classic spinlocks had to go.
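The busy-waiting behavior described above can be illustrated in user space. What follows is only a sketch built on C11 atomics, not the kernel's implementation; the toy_* names are hypothetical, and a real kernel spinlock also disables preemption, which this model does not attempt.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a kernel spinlock. */
typedef struct {
    atomic_flag held;
} toy_spinlock_t;

static void toy_spin_init(toy_spinlock_t *l)
{
    atomic_flag_clear(&l->held);
}

/* One acquisition attempt: true if the lock was free and is now ours. */
static bool toy_spin_trylock(toy_spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Busy-wait: the CPU does nothing useful until the holder lets go. */
static void toy_spin_lock(toy_spinlock_t *l)
{
    while (!toy_spin_trylock(l))
        ;   /* spin */
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The empty loop body in toy_spin_lock() is the point: a contending thread stays on the processor for as long as the lock is held elsewhere, which is the latency source the realtime developers wanted to remove.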
The solution was to turn most spinlocks into a form of mutex with priority inheritance. A process which attempts to acquire a contended "spinlock" will no longer spin; instead, it goes to sleep and waits for the lock to become free, making the processor available to another thread. Code which holds one of these non-spinlocks is no longer immune to preemption; a higher-priority thread can always push it out of the way. By changing spinlocks in this way, the realtime hackers were able to eliminate one of the largest sources of latency in the mainline kernel. Much of that work found its way into the mainline some time ago in the form of the mutex API, but spinlocks themselves have not been changed in the mainline.
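Priority inheritance itself can be shown with a deliberately tiny, single-threaded model. The structures and function names below are hypothetical; the sketch leaves out wait queues, lock handoff, and nested locks, all of which a real implementation must handle. It shows only the boost: when a higher-priority task blocks on a held lock, the holder temporarily runs at the waiter's priority.

```c
#include <stddef.h>

/* Hypothetical task and lock records; a higher number means higher priority. */
struct toy_task {
    int base_prio;       /* priority assigned to the task */
    int effective_prio;  /* priority after any inheritance boost */
};

struct toy_pi_lock {
    struct toy_task *owner;   /* NULL when the lock is free */
    int top_waiter_prio;      /* highest priority among blocked waiters */
};

/* Returns 1 if the lock was acquired, 0 if the caller would have to sleep. */
static int toy_pi_lock_acquire(struct toy_pi_lock *l, struct toy_task *t)
{
    if (l->owner == NULL) {
        l->owner = t;
        l->top_waiter_prio = -1;
        return 1;
    }
    /* Contended: note the waiter and lend its priority to the owner. */
    if (t->effective_prio > l->top_waiter_prio)
        l->top_waiter_prio = t->effective_prio;
    if (l->top_waiter_prio > l->owner->effective_prio)
        l->owner->effective_prio = l->top_waiter_prio;
    return 0;
}

/* On release the owner drops back to its own priority. */
static void toy_pi_lock_release(struct toy_pi_lock *l)
{
    l->owner->effective_prio = l->owner->base_prio;
    l->owner = NULL;
}
```

With a low-priority holder (priority 1) and a high-priority waiter (priority 10), the holder runs at 10 until it releases the lock, so a medium-priority task cannot keep it off the CPU in the meantime.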
To minimize the pain of maintaining the realtime patches, the developers simply redefined the spinlock_t type to be the new mutex type instead. Except that, as it turns out, some spinlocks in low-level parts of the kernel really do need to be spinlocks still. So those were switched to a new raw_spinlock_t type - but without changing the various spin_lock() calls. Instead, some truly frightening macro trickery was introduced to cause the spinlock API to do the right thing when passed either of two entirely different mutual exclusion primitives. This bit of macro magic was always going to be an impediment to mainline inclusion, so the realtime developers never really expected to merge the lock code in that form.
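The article does not reproduce the macros in question. As a rough idea of what "one call, two lock types" involves, here is a sketch which uses C11 _Generic as a stand-in for whatever the realtime tree actually does; every name in it is hypothetical, and the "locks" merely count which implementation was chosen.

```c
/* Hypothetical stand-ins for the two lock types; the counters only record
 * which implementation a call was routed to. */
typedef struct { int spin_calls;  } toy_raw_spinlock_t;
typedef struct { int sleep_calls; } toy_rt_mutex_t;

static void toy_raw_lock(toy_raw_spinlock_t *l) { l->spin_calls++; }
static void toy_rt_lock(toy_rt_mutex_t *l)      { l->sleep_calls++; }

/* One name, two implementations, chosen at compile time by argument type. */
#define toy_spin_lock(l) _Generic((l),            \
        toy_raw_spinlock_t *: toy_raw_lock,       \
        toy_rt_mutex_t *:     toy_rt_lock)(l)
```

A caller writes toy_spin_lock(&lock) in either case, and nothing at the call site reveals whether the code will spin or sleep. That invisibility is convenient for an out-of-tree patch and is, plausibly, much of what makes such a scheme hard to accept upstream.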
The new realtime tree now shows how the realtime developers think this work might get into the mainline. It involves a more explicit separation of the two types of "spinlocks" - and a lot of code churn. In the realtime tree, most locks of type spinlock_t are changed to a new lock_t type. There is a new set of operations for this type:
#include <linux/lock.h>
lock_t lock;
acquire_lock(&lock);
release_lock(&lock);
For a normal, non-realtime kernel build, lock_t will be the same as spinlock_t, and things will work as they always have. On realtime kernels, instead, lock_t will be a mutex type. The other variants of the spinlock API will be represented in the new API (there is an acquire_lock_irqsave(), for example), but none of them will actually disable interrupts in a realtime kernel. Meanwhile, spinlock_t will remain a true spinlock type.
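The build-time switch might be pictured along the following lines. This is a user-space sketch under stated assumptions, not code from the realtime tree: TOY_PREEMPT_RT, toy_lock_t, and the toy_* operations are invented names, a C11 atomic_flag stands in for a true spinlock, and a plain pthread mutex (without priority inheritance) stands in for the sleeping lock.

```c
#include <stdatomic.h>

#ifdef TOY_PREEMPT_RT
/* "Realtime" build: the generic lock is a sleeping mutex. */
#include <pthread.h>
typedef pthread_mutex_t toy_lock_t;
#define TOY_LOCK_INIT        PTHREAD_MUTEX_INITIALIZER
#define toy_acquire_lock(l)  pthread_mutex_lock(l)
#define toy_release_lock(l)  pthread_mutex_unlock(l)
#else
/* Default build: the generic lock is a busy-waiting spinlock. */
typedef atomic_flag toy_lock_t;
#define TOY_LOCK_INIT        ATOMIC_FLAG_INIT
#define toy_acquire_lock(l)  do { } while (atomic_flag_test_and_set(l))
#define toy_release_lock(l)  atomic_flag_clear(l)
#endif

static toy_lock_t toy_counter_lock = TOY_LOCK_INIT;
static int toy_counter;

/* Callers are written once against the generic API. */
static int toy_counter_increment(void)
{
    int value;

    toy_acquire_lock(&toy_counter_lock);
    value = ++toy_counter;
    toy_release_lock(&toy_counter_lock);
    return value;
}
```

The caller, toy_counter_increment(), is identical in both builds; only the type behind toy_lock_t changes. Code which must always spin would use a separate, explicitly spinning type, as spinlock_t does in the scheme described above.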
This change gets rid of the tricky macros, but at the cost of changing the declarations of and operations on almost all spinlocks in the kernel. That is a lot of code changes: a quick grep turns up over 20,000 spin_lock*() calls in the upcoming 2.6.28 kernel. That will make for some pain if and when this change is merged. But in the meantime, it can only make for a lot of pain for the people who have to maintain this patch out of tree. To make their lives a little easier, the realtime developers have created a couple of scripts to do the bulk of the work. First, all spinlocks in a pristine kernel are converted to lock_t, then the few locks which truly must be spinlocks are switched back. This work is kept in a separate branch which is regenerated when needed; in this way, the realtime developers avoid the need to do nasty merges to keep up with current kernels.
Your editor has heard talk of another locking change which does not, yet, appear in this tree. One problem with the realtime patch set is that it requires distributors to create yet another kernel build - something they hate doing - if they want to support realtime operation. In an effort to make life easier for distributors, the realtime developers are working on a scheme whereby a kernel would determine at run time whether it should be running in a realtime mode. If so, spinlocks will be changed to sleeping locks by patching the kernel binary as it boots. Kernels built this way will be able to run efficiently in either mode.
The branches of the realtime tree provide a quick guide to the other parts of the realtime work which remain outside of the mainline. The threaded interrupt handler code is one example; that change could be proposed (again) for merging in the near future. The priority workqueue mechanism sits in another branch, as do patches aimed at Java support, filesystem changes, memory management changes, and more. Then, there's a branch for stuff which will never be merged; for example, there is this patch which gives Java programs direct access to physical memory - not something which strikes most kernel developers as a good idea. All told, there is a great deal of work sitting in the realtime patch set; this work is finally being organized into a proper git tree.
The "upstream first" policy says that vendors should merge their code upstream before shipping it to customers. The 2.6.x development model is built on the idea that no change is too fundamental to be accepted into a regular, 3-month development cycle. The realtime patches would appear to be an exception to both rules. It has taken over four years to get to a point where some of the fundamental realtime technologies are close to ready for the mainline, but distributors have been shipping it for at least three of those years. It has, in other words, been one of the biggest forks of the Linux kernel, ever. The plan has always been to join this fork back with the mainline, though; perhaps, finally, that goal is getting closer. With luck, it will happen within about a year.
Tracking down a "runaway loop"
The Linux boot process, at least as provided by distributions, depends on help from user space, with drivers being loaded as required from the initial filesystem (initramfs/initrd). Loading drivers requires using tools built into initramfs and if those tools break, the kernel won't boot. But when a working kernel configuration and initramfs are used with a new kernel, the result is expected to be a kernel that successfully boots. When that doesn't happen, bugs are filed regarding kernel regressions but, as a recent example shows, the actual problem may be elsewhere.
The original report was made in late October, but no progress was made until Evgeniy Polyakov saw it again in early December. The symptom was a kernel that hangs after printing:
request_module: runaway loop modprobe char-major-5-1
four times on the console. Since nothing in the user space (initramfs) or kernel configuration had changed, it seemed to clearly point to something in the kernel itself.
It turns out that the "runaway loop" message is meant to indicate that the request_module() function has been invoked recursively. So in an effort to load the driver for the character device with major/minor numbers 5/1—which corresponds to /dev/console—request_module() was invoked again. The code in kernel/kmod.c:
if (atomic_read(&kmod_concurrent) > max_modprobes) {
    /* We may be blaming an innocent here, but unlikely */
    if (kmod_loop_msg++ < 5)
        printk(KERN_ERR "request_module: runaway loop modprobe %s\n", module_name);
    atomic_dec(&kmod_concurrent);
    return -ENOMEM;
}
only allows that message to be printed a limited number of times (five, as the test is written), but the invoker should recognize the ENOMEM and handle it appropriately.
The root cause was that something in the kernel was trying to access /dev/console before that device was registered in the kernel. This led the kernel to try and load a module to handle /dev/console, which will fail. Because of the failure, something in the user space modprobe then tries to access /dev/console, presumably to output an error message, which repeats the kernel module loading process. And so on. After that recurses enough to exceed the max_modprobes limit, request_module() will produce the runaway loop message and return ENOMEM which should put a stop to the whole process.
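The recursion and its cutoff can be followed in a small user-space model. This is a sketch only: in the kernel, each level of the loop is a separate modprobe process rather than a nested function call, the limit is computed rather than fixed, and the toy_* names and the limit of 3 are assumptions made for illustration.

```c
#include <errno.h>
#include <stdio.h>

#define TOY_MAX_MODPROBES 3   /* assumed limit; the kernel computes its own */

static int toy_kmod_concurrent;  /* helpers currently in flight */
static int toy_kmod_loop_msg;    /* warnings attempted so far */
static int toy_open_attempts;    /* times the helper touched the console */

static int toy_request_module(const char *name);

/* Stand-in for user-space modprobe: it fails, then tries to log the
 * failure to a console which itself needs a module. */
static int toy_run_modprobe(const char *name)
{
    (void)name;
    toy_open_attempts++;
    return toy_request_module("char-major-5-1");
}

static int toy_request_module(const char *name)
{
    int ret;

    toy_kmod_concurrent++;
    if (toy_kmod_concurrent > TOY_MAX_MODPROBES) {
        if (toy_kmod_loop_msg++ < 5)
            fprintf(stderr, "request_module: runaway loop modprobe %s\n", name);
        toy_kmod_concurrent--;
        return -ENOMEM;
    }
    ret = toy_run_modprobe(name);
    toy_kmod_concurrent--;
    return ret;
}
```

In this model the helper gives up as soon as it sees the error, so a call such as toy_request_module("cryptomgr") unwinds cleanly with -ENOMEM after three console attempts. The hang described in this article corresponds to a helper which, on seeing the failure, simply tries again.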
In an acrimonious thread (and kernel bug report), Alan Cox, Kay Sievers, and Polyakov tried to determine where the problem came from and what to do about it. It didn't help matters that they were using different distributions' initramfs images, so they saw different behavior. Polyakov and Sievers were using a Debian user space while Cox was using Fedora. Something in the Debian version was continuing to try to open /dev/console even after getting ENOMEM. This leads to an infinite loop and thus a kernel hang.
Sievers eventually tracked it to the kernel cryptographic API:
This modprobe process does try to log an error, accesses /dev/console, which is not initialized in the kernel at that time, and the kernel module loader tries the load a module to support dev_t 5:1, which again runs modprobe, and ...
Setting CONFIG_CRYPTO_MANAGER=y makes it disappear.
It turns out that the crypto layer attempts to load the cryptomgr module as part of its algorithm testing infrastructure. If cryptomgr fails to load, the algorithm registration code can continue without it. It is optional, but modprobe wants to put out a message when it fails to load it, which leads to the runaway loop. As Herbert Xu points out, though, the problem is not crypto-specific at all:
It is this potential pitfall that Sievers and Polyakov would like to see removed. In general, user-space programs are not required to be concerned with the availability of /dev/console—except when they are run from early kernel initialization. But Cox points out that user-space helpers must concern themselves with avoiding loops because there are multiple possible ways to cause that to happen. As an example, he notes that if UNIX-domain sockets (AF_UNIX) are in a module and syslog() is called before the module is loaded, a similar loop will occur.
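One defensive pattern consistent with Cox's point is for a helper to make a single attempt at logging and to drop the message quietly if the device cannot be opened. The function below is a hypothetical sketch of that idea, not code taken from modprobe or from any distribution's initramfs.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Try once to write msg to the device at path; on any failure, give up
 * quietly instead of retrying. Returns 0 on success, -1 otherwise. */
static int toy_log_once(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_NOCTTY);
    ssize_t written;

    if (fd < 0)
        return -1;   /* no retry: a retry could re-enter the module loader */
    written = write(fd, msg, strlen(msg));
    close(fd);
    return written == (ssize_t)strlen(msg) ? 0 : -1;
}
```

A helper built this way loses an error message when run too early in the boot process, but it cannot feed the loop: each failed open() ends the attempt rather than starting another one.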
In an effort to "step back" from the arguments that were going back and forth, Ted Ts'o offers his analysis of the problem along with a suggested course of action:
[...] So I would think the best thing to do is to figure out what Debian's initrd is doing that is evading the recursion detection. Fixing that is going to make things much more robust.
Clearly the recursion detector is working to some extent, or the runaway loop messages would not be seen, but on Debian, at least, that detection doesn't stop the problem. Ts'o's theory is that something outside of the directly invoked helper is actually the culprit: "I'm guessing why it isn't working given Debian's initrd setup is that whatever is ultimately opening /dev/console isn't being called until after the helper script has exited." That seems worth tracking down, as Ts'o points out in a later message:
The exact cause of the problem and why Debian and Fedora behave differently is still not known. Digging into Debian's initrd to figure that out, as Ts'o suggests, is clearly the right starting point. That answer will likely lead to sensible fixes, either in user space or the kernel—possibly both. Bickering about where and how to fix the problem before it is fully understood seems counter-productive at best.
Dueling performance monitors
Low-level optimization of performance-critical code can be a challenging task. At this point, one assumes, the potential for algorithmic improvements in the targeted code has been realized; what is left is trying to locate and address problems like cache misses, mis-predicted branches, and so on. Such problems can be impossible to find by just looking at the code; one needs support from the hardware. The good news is that contemporary hardware provides that support; most processors can collect a wide range of performance data for analysis. The bad news is that, despite the fact that processors have been able to collect that data for many years, there has never been support for this kind of performance monitoring in the mainline kernel. That situation may be about to change, but, first, the development community will have to make a choice between a venerable out-of-tree implementation and an unexpected competitor.
The "perfmon" patch set has been under development for some years, but, for a number of reasons, it has never found its way into the mainline kernel. The most recent version of the patch was posted for review by Stéphane Eranian in late November. The perfmon patches show the signs of all those years of development work and usage experience; they offer a wide set of features and extensive user-space support. The full perfmon patch adds twelve system calls to the kernel; the posted version, though, trims that count back to five in the hope that a narrower interface will have a better chance of getting into the mainline. The additional system calls, one assumes, will be proposed for inclusion sometime after the perfmon core is merged. The reduced interface is described in the patch set; briefly, an application hooks into the performance monitoring subsystem with a call to:
int pfm_create(int flags, pfarg_sinfo_t *regs);
This system call returns a file descriptor to identify the performance monitoring session. The regs parameter is used to return a list of performance monitoring registers available on the current system; flags is currently unused.
Specific performance counter registers can be manipulated with:
int pfm_write(int fd, int flags, int type, void *d, size_t sz);
int pfm_read(int fd, int flags, int type, void *d, size_t sz);
These system calls can be used to write values into registers (thus programming the performance monitoring hardware) and to read counter and configuration information from those registers.
Actually doing some performance monitoring requires a couple more calls:
int pfm_attach(int fd, int flags, int target);
int pfm_set_state(int fd, int flags, int state);
A call to pfm_attach() specifies which process is to be monitored; pfm_set_state() then turns monitoring on and off.
There are a couple of distinctive aspects to the perfmon interface. One is that it knows almost nothing about the specific performance monitoring registers; that information, instead, is expected to live in user space. As a result, the bare perfmon system call interface is probably not something that most monitoring applications would use; instead, those system calls are hidden behind a user-space library which knows how to program different types of processors for the desired results. Beyond that, perfmon uses the ptrace() mechanism to stop the monitored process while performance counters are being queried; as a result, the monitoring process must have the right to trace the target process.
On December 4, Thomas Gleixner and Ingo Molnar posted a surprise announcement of a new performance counter subsystem. The announcement states:
This is not the first time that these developers have shown up with an out-of-the-blue reimplementation of somebody else's subsystem; other examples include the CFS scheduler, high-resolution timers, dynamic tick, and realtime preemption. Most of the time, the new code quickly supplants the older version - an occurrence which is not always pleasing to the original developers - but the situation does not seem quite as straightforward this time.
The proposed interface is much simpler, adding a single system call:
int perf_counter_open(u32 hw_event_type, u32 hw_event_period, u32 record_type, pid_t pid, int cpu);
This call will return a file descriptor corresponding to a single hardware counter. A call to read() will then return the current value of the counter. The hw_event_period can be used to block reads until the counter overflows the given value, allowing, for example, events to be queried in batches of 1000. The pid parameter can be used to target a specific process, and cpu can restrict monitoring to a specific processor.
There are a few advantages claimed for the new implementation. The simplicity of the system call interface is one of those; it is possible to write a very simple application to perform monitoring tasks, with no additional libraries required. The second version of the patch includes a simple "kerneltop" utility which can display a constantly-updated profile of anything the performance counting hardware can monitor. Another advantage is the avoidance of ptrace(); this reduces the amount of privilege needed by the monitoring process and avoids perturbing the monitored process by stopping and restarting it. The management of counters is said to be more flexible, with facilities for sharing counters between processes and reserving them for administrative access. The low-level hardware interface is said to be simpler as well.
Those claimed advantages notwithstanding, a number of complaints have been raised with regard to the new performance monitoring code. Two of those seem to be at the top of the list: the single counter per file descriptor API, and programming the hardware performance monitoring unit inside the kernel. On the API side, the biggest concern is that putting each counter behind its own file descriptor makes it very hard to correlate two or more counters. Reading two counters requires two independent read() system calls; as is always the case, just about anything could happen between those two calls. So it's hard to tell how two different counter values relate to each other. But that sort of correlation is exactly what developers doing performance optimization want to do. Paul Mackerras says:
In response, Ingo argues that the loss of precision caused by independent read() calls is small - much smaller than the muddying of the results caused by stopping the target process so that all of the counters can be read at the same time. That argument does not appear to have convinced the detractors, though.
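The correlation concern is easy to demonstrate with a toy model. Everything here is hypothetical, including the "hardware," which retires exactly two instructions per cycle; the sketch shows only how a gap between two independent reads skews a derived ratio, and it says nothing about how large that gap is in practice.

```c
/* Hypothetical hardware model: each tick is 1 cycle and 2 instructions. */
struct toy_pmu {
    long cycles;
    long instructions;
};

static void toy_pmu_tick(struct toy_pmu *p, long ticks)
{
    p->cycles += ticks;
    p->instructions += 2 * ticks;
}

struct toy_sample {
    long cycles;
    long instructions;
};

/* Both counters captured at the same instant. */
static struct toy_sample toy_read_together(const struct toy_pmu *p)
{
    struct toy_sample s = { p->cycles, p->instructions };
    return s;
}

/* One read per counter; the workload keeps running for `gap` ticks
 * between the two reads. */
static struct toy_sample toy_read_separately(struct toy_pmu *p, long gap)
{
    struct toy_sample s;

    s.cycles = p->cycles;
    toy_pmu_tick(p, gap);
    s.instructions = p->instructions;
    return s;
}
```

After 100 ticks, a simultaneous read gives 200 instructions over 100 cycles, the true ratio of 2.0. Reading the counters separately with a 10-tick gap gives 220 instructions over 100 cycles, or 2.2. Whether the real-world gap is small enough to ignore is exactly the point on which the two sides disagree.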
The other complaint is that moving the counter programming task into the kernel requires that the kernel know about the complexities of every possible performance monitoring unit it may encounter. This hardware sits at the core of the most performance-critical CPU subsystems, so its design parameters value non-interference above features or a straightforward programming interface. So programming it can be a complex business, involving sizeable tables describing how various operations interact with each other. The perfmon code keeps those tables in a user-space library, but the alternative implementation won't allow that. Quoting Paul again:
Your API condemns us to adding all that bloat to the kernel, plus the code to use those tables.
Paul (and others) argue that this information - which can add up to hundreds of kilobytes - is better kept in user space.
There also seems to be a bit of concern over the fact that Stéphane had clearly never heard about this work before it was posted for review. It must, indeed, be a shock to work on a subsystem for years, then find a proposed replacement sitting in one's mailbox. As David Miller put it:
So, at this point, what will happen with performance monitoring is unclear at best. Perhaps, though, this discussion will have the effect of raising the profile of performance monitoring, which has been without proper kernel support for many years. The merging of either solution - or, perhaps, a combination of both - seems like it has to be an improvement over having no support at all.
Page editor: Jake Edge