
Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.32-rc8. As of this writing, just over 200 changes have been merged since 2.6.32-rc8, including some significant feature enhancements to the FS-Cache and slow work subsystems. Linus has not told the world whether he thinks that's enough change to justify an -rc9 release or not; stay tuned.


Quotes of the week

GPUs have gotten more and more complex every 6 months for about 8 years now. A current radeonhd 4000 series bears little resemblance to the radeon r100 that was out then. The newer GPUs require a full compiler to be written for an instruction set more complex than x86 in some places. The newer GPUs get more and more varied modesetting combos that all require supporting.

Now I would guess (educated slightly) that the amount of code required to write a full driver stack for a modern GPU has probably gone up 40-50x what used to be required, whereas the number of open source community developers has probably doubled since 2001. Also, newer GPU designs have forced us to redesign the Linux GPU architecture; this had to happen in parallel with all the other stuff, again with a similar number of developers. So yes it sucks, but it should point out why there is no reason why 3D should really be working on all cards.

-- Dave Airlie

The best way to make everything "just work" is to eliminate it.
-- Jon Smirl

I agree that having only one of SLAB/SLUB/SLQB would be nice, but it's going to take a lot of heavy lifting in the form of hacking and benchmarking to have confidence that there's a clear performance winner. Given the multiple dimensions of performance (scalability/throughput/latency for starters), I don't even think there's a good a priori reason to believe that a clear winner CAN exist. SLUB may always have better latency, and SLQB may always have better throughput. If you're NYSE, you might have different performance priorities than if you're Google or CERN or Sony that amount to millions of dollars. Repeatedly saying "but we should have only one allocator" isn't going to change that.
-- Matt Mackall


Fault injection and unexpected requirement injection

By Jonathan Corbet
December 2, 2009
Good developers carefully write their code to handle error conditions which may arise. This code frequently suffers from one problem, though: test coverage is hard. Many of the anticipated errors never come about, so the error-handling code never gets exercised. So when things go wrong for real, recovery does not work as expected. For a few years, the Linux kernel has had a fault injection framework designed to help in the debugging of some types of error-handling code. By forcing specific things (memory allocations in particular) to go wrong, the fault injection framework can help developers ensure that errors are really handled as expected.
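For the curious, the framework is controlled through a set of debugfs knobs. As a rough sketch of how one of its fault classes is enabled (this assumes a kernel built with CONFIG_FAULT_INJECTION_DEBUG_FS and CONFIG_FAILSLAB, and the exact semantics of each knob are described in the kernel's fault-injection documentation):

```shell
# Mount debugfs if it is not already available (requires root)
mount -t debugfs none /sys/kernel/debug

# Inject failures into slab allocations: roughly 1% of
# allocations fail, checked at the given interval
echo 1   > /sys/kernel/debug/failslab/probability
echo 100 > /sys/kernel/debug/failslab/interval
echo -1  > /sys/kernel/debug/failslab/times      # -1: no limit on failures
echo 1   > /sys/kernel/debug/failslab/verbose    # log each injected failure
```

With these settings in place, kmalloc() and friends will start returning NULL occasionally, exercising error paths that might otherwise never run.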

Sripathi Kodi recently posted a patch adding certain types of futex failures to the fault injection framework. Ingo Molnar responded with a potentially surprising request:

Instead of this unacceptably ugly and special-purpose debugfs interface, please extend perf events to allow event injection. Some other places in the kernel (which deal with rare events) want/need this capability too.

This "unacceptably ugly" interface has existed as part of the fault injection framework since 2006, so it is a little surprising to hear, now, that it cannot be used. Ingo is firm about this point, though, and appears unwilling to back down.

Extending perf events for fault injection might be the right long-term solution. But this situation highlights a trap which certainly makes participation in the development process harder: in his travels, your editor has heard complaints from developers who set out to accomplish a specific task, only to be told that they must undertake a much larger cleanup to get their code merged. The topic also came up at the 2009 kernel summit; there, the consensus seemed to be that this kind of request can quickly become unreasonable.

In this case, Sripathi has not been asked to fix the remainder of the fault injection framework code. But adding new functionality to the perf events subsystem still likely goes rather beyond the scope of the original project. Sripathi has not responded to this request, so it's not clear whether we'll see a futex fault injection mechanism reworked to fit the new requirements, or whether this code will just fade away.


Kernel development news

Another mainline push for utrace

By Jake Edge
December 2, 2009

When last we looked in on utrace, back in March, it was being proposed for inclusion into 2.6.30. There were various objections at that time, but the biggest was the lack of a "real" in-kernel user for utrace. It was suggested that providing a real user along with utrace itself would smooth its path into the mainline. Now utrace has returned in the form of a set of patches from Oleg Nesterov (based on Roland McGrath's work), along with a rewrite of the ptrace() system call using the utrace interface. With the 2.6.33 merge window opening soon, the hope is that utrace will, finally, make its way into the mainline.

Utrace provides a means to control user-space threads, which could be used for debugging, tracing, and other tasks like User-mode Linux. SystemTap is one of the biggest current utrace users, as Red Hat and Fedora kernels have had utrace support for several years. Utrace came from a recognition that ptrace() was too limited—and messy—for many of the things folks wanted to use it for. In particular, only allowing one active tracing process for a given thread, as ptrace() requires, was too limiting for various envisioned tracing and control scenarios. Utrace allows multiple tracing "engines" to attach to a thread, list which events they are interested in, and receive callbacks when those events occur.

The interface provided by utrace has not changed enormously since our first look in March 2007. Engines, which are typically implemented as loadable kernel modules, will attach to a given thread by using utrace_attach_task() or utrace_attach_pid() depending on whether they have a struct task_struct or struct pid available. In either case, a struct utrace_engine pointer is returned, which is used to identify the engine in additional calls.

The struct utrace_engine looks like:

    struct utrace_engine {
        const struct utrace_engine_ops *ops;
        void *data;
        unsigned long flags;
    };

with flags containing an event mask and data used for engine-specific private data. The most interesting part is the ops field which points to a set of ten different callback functions. These functions make up the heart of the tracing engine functionality.

The function pointers in struct utrace_engine_ops are described in linux/utrace.h. All of the kerneldoc comments are pulled from the source code files into the DocBook documentation that comes with the patchset. The callbacks are made as the traced thread encounters various events. These include signals being delivered, clone() or exec() being called, other system calls as they are entered or exited, thread exit or death, and more. In each case, the callbacks are made for each interested engine in the order in which the engines were attached.

An engine uses the utrace_set_events() (or utrace_set_events_pid()) call to indicate which of the events it is interested in:

    int utrace_set_events(struct task_struct *target,
                          struct utrace_engine *engine,
                          unsigned long events);

The UTRACE_EVENT() macro is used to turn on the appropriate bits in the events mask. There must be a callback defined in the engine->ops table for any events which are enabled.

Once a callback has been invoked, the engine uses utrace_control() (or utrace_control_pid()) to tell the traced thread to do something:

    int utrace_control(struct task_struct *target,
                       struct utrace_engine *engine,
                       enum utrace_resume_action action);

The action parameter governs what is supposed to happen. Those actions include things like single-stepping, block-stepping, resuming execution, detaching from the thread, and so on.
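Putting the pieces together, a minimal engine might look something like the following kernel-module sketch. It cannot be built against a mainline tree—the API exists only with the utrace patches applied—and the exact report_exec() callback signature is taken from the patchset's kerneldoc, so it may differ between versions; treat this as an illustration of the attach/set-events/callback flow rather than a definitive implementation.

```c
#include <linux/module.h>
#include <linux/utrace.h>	/* only present with the utrace patches applied */

/* Called when the traced thread calls exec(); returning UTRACE_RESUME
 * lets the thread continue normally.  Signature per the utrace kerneldoc. */
static u32 my_report_exec(u32 action, struct utrace_engine *engine,
			  const struct linux_binfmt *fmt,
			  const struct linux_binprm *bprm,
			  struct pt_regs *regs)
{
	printk(KERN_INFO "traced thread called exec()\n");
	return UTRACE_RESUME;
}

static const struct utrace_engine_ops my_ops = {
	.report_exec = my_report_exec,
	/* other report_* callbacks would go here */
};

/* Attach to 'task' and ask for exec events only. */
static int attach_engine(struct task_struct *task)
{
	struct utrace_engine *engine;

	engine = utrace_attach_task(task, UTRACE_ATTACH_CREATE, &my_ops, NULL);
	if (IS_ERR(engine))
		return PTR_ERR(engine);

	return utrace_set_events(task, engine, UTRACE_EVENT(EXEC));
}
```

A second engine could attach to the same task with its own ops table; both would see the exec event, in attachment order.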

In the only real complaint about the patchset seen so far, Christoph Hellwig is unhappy that the ptrace() reimplementation is not supplanting the current ptrace() code: "One thing I really hate about this is that it introduces two ptrace implementations by adding the new one without removing the old one." In the patches, the inclusion of utrace is governed by the CONFIG_UTRACE flag. Since it isn't optional to have a ptrace() system call, that meant the current code needed to stay.

What Hellwig suggests, though, is adding utrace support to the two major architectures that don't have it (arm and mips), then removing the current ptrace(). He clearly believes it is too late to get utrace into 2.6.33; waiting one more cycle would allow time to get utrace support into those—and, hopefully, the minor architectures—before utrace is merged. "If the remaining minor architectures don't manage to get their homework done they're left without ptrace," he said.

That didn't sit well with various other kernel hackers. Pavel Machek said: "I don't think introducing regressions to force people to rewrite code is a good way to go". In addition, Ingo Molnar seems to have warmed up to utrace's inclusion since it was last proposed. Molnar had many complaints about utrace last time, but is much more positive this time. He doesn't think adding more architecture support is the way to go:

Regarding porting it to even more architectures - that's pretty much the worst idea possible. It increases maintenance and testing overhead by exploding the test matrix, while giving little to [the] end result. Plus the worst effect of it is that it becomes even more intrusive and even harder (and riskier) to merge.

Unlike last time, when most of the complaints were aimed not at the code itself but at its timing and the lack of an in-kernel user, this time there is some real code review taking place. Peter Zijlstra, for example, has posted a fairly detailed review of both the code and the documentation. There is a clear sense that utrace is clearing hurdles that may have held it up in the past.

One of the outcomes from the tracing meetings at the Collaboration Summit in April was to come up with an in-kernel user, and ptrace() seemed like a good candidate. Other ideas were mentioned in those meetings, including adding a gdb "stub" in the kernel to allow debugging of user-space programs. A patch to expose a /proc/PID/gdb interface that implements gdb's remote serial protocol was proposed by Srikar Dronamraju.

That patch is running into more serious difficulty than utrace seems to be. Because kgdb already exposes the remote serial interface for gdb, but for the kernel instead, Zijlstra and Molnar think that the two should be combined. It seems unlikely to get merged until that is resolved.

Some parts of the utrace patchset have spent time in the -mm tree, and utrace has been shipped with every Fedora kernel since FC6. But the utrace-ptrace piece has not spent any time in either -mm or -next, which may make it harder to get into the mainline for 2.6.33. Since utrace is optional, though, there are relatively few risks. McGrath is willing to consider removing the current ptrace() implementation, but it's clear that he and Nesterov—maintainers of the current ptrace()—would prefer to get utrace into the mainline now:

We don't want to rush anyone, like uninterested arch maintainers. We don't want to break anyone who doesn't care about our work (we do test for ptrace regressions but of course new code will always have new bugs so some instances of instability in the transition to a new ptrace implementation have to be expected no matter how hard we try). We just don't want to keep working out of tree.

Presumably, we will know within the next few weeks whether utrace makes its way into 2.6.33. But, if that doesn't happen, it would seem that one more kernel development cycle is all that it should take.


Eliminating rwlocks and IRQF_DISABLED

By Jonathan Corbet
December 1, 2009
Reader-writer spinlocks and interrupt-enabled interrupt handlers both have a long history in the Linux kernel. But both may be nearing the end of their story. This article looks at the push for the removal of a pair of legacy techniques for mutual exclusion in the kernel.

Reader-writer spinlocks (rwlocks) behave like ordinary spinlocks, but with some significant exceptions. Any number of readers can hold the lock at any given time; this allows multiple processors to access a shared data structure if none of them are making changes to it. Reader locks are also naturally nestable; a single processor can acquire a given read lock more than once if need be. Writers, instead, require exclusive access; before a write lock can be granted, all read locks must be released, and only one write lock can be held at any given time.

Rwlocks in Linux are inherently unfair in that readers can stall writers for an arbitrary period of time. New read locks are allowed even if a writer is waiting, so a steady stream of readers can block a writer indefinitely. In practice this problem rarely surfaces, but Nick Piggin has reported a case where the right user-space workload can cause an indefinite system livelock. This is a performance problem for specific users, but it is also a potential denial of service attack vector on many systems. In response, Nick started pondering the challenge of implementing fairer rwlocks which do not create performance regressions.

That is not an easy task. The obvious solution - blocking new readers when a writer gets in line - will not work for the most important rwlock (tasklist_lock) because that lock can be acquired by interrupt handlers. If a processor already holding a read lock on tasklist_lock is interrupted, and the interrupt handler, too, needs that lock, forcing the handler to wait will deadlock the processor. So workable solutions require allowing nested reader locks to be acquired even when writers are waiting, or disabling interrupts when tasklist_lock is held. Neither solution is entirely pleasing.

Beyond that, there has been a general sentiment toward the removal of rwlocks for some years. The locking primitives themselves are significantly slower than plain spinlocks, so any performance gain from allowing multiple readers must be large enough to make up for that extra cost. In many cases, that gain does not appear to actually exist. So, over time, kernel developers have been changing rwlocks to normal spinlocks or replacing them with read-copy-update mechanisms. Even so, a few hundred rwlocks remain in the kernel. Perhaps it would be better to focus on removing them instead of putting a lot of work into making them more fair.

Almost all of those rwlocks could be turned into spinlocks tomorrow and nobody would ever notice. But tasklist_lock is a bit of a thorny problem; it is acquired in many places in the core kernel and it's not always clear what this lock is protecting. This lock is also taken in a number of critical kernel fast paths, so any change has to be done carefully to avoid performance regressions. For these reasons, kernel developers have generally avoided messing with tasklist_lock.

Even so, it would appear that, over time, a number of the structures protected by tasklist_lock have been shifted to other protection mechanisms. This lock has also been changed in the realtime preemption tree, though that code has not yet made it to the mainline. Seeing all this, Thomas Gleixner decided to try to get rid of this lock, saying "If nobody beats me I'm going to let sed loose on the kernel, lift the task_struct rcu free code from -rt and figure out what explodes." As of this writing, the results of this exercise have not been posted. But Thomas is still active on the mailing list, so one concludes that any explosions experienced cannot have been fatal.

If tasklist_lock can be converted successfully to an ordinary spinlock, the conversion of the remaining rwlocks is likely to happen quickly. Shortly after that, rwlocks may go away altogether, simplifying the set of mutual exclusion primitives in Linux considerably.


Meanwhile, a different sort of exclusion happens with interrupt handlers. In the early days of Linux, these handlers were divided into "fast" and "slow" varieties. Fast handlers could be run with other interrupts disabled, but slow handlers needed to have other interrupts enabled. Otherwise, a slow handler (perhaps doing a significant amount of work in the handler itself) could block the processing of more important interrupts, impacting the performance of the system.

Over the years, this distinction has slowly faded away, for a number of reasons. The increase in processor speeds means that even an interrupt handler which does a fair amount of work can be "fast." Hardware has gotten smarter, minimizing the amount of work which absolutely must be done immediately on receipt of the interrupt. The kernel has gained improved mechanisms (threaded interrupt handlers, tasklets, and workqueues) for deferred processing. And the quality of drivers has generally improved. So driver authors rarely need to think about whether their handlers run with interrupts enabled or not.

Those authors still need to make that choice when setting up interrupt handlers, though. Unless the handler is established with the IRQF_DISABLED flag set, it will be run with interrupts enabled. For added fun, handlers for shared interrupts (perhaps the majority on most systems) can never be assured of running with interrupts disabled; other handlers running on the same interrupt line might enable them at any time. So many handlers will be running with interrupts enabled, even though that is not needed.

The solution, it would seem, would be to eliminate the IRQF_DISABLED flag and just run all handlers with interrupts disabled. In almost all cases, everything will work just fine. There are just a few situations where interrupt handling still takes too long, or where one interrupt handler depends on interrupts for another device being delivered at any time. Those handlers could be identified and dealt with. "Dealt with" in this case could take a few forms. One would be to equip the driver with a better-written interrupt handler which does not have this problem. Another, related approach would be to move the driver to a threaded handler which, naturally, will run with interrupts enabled. Or, finally, the handler could be set up with a new flag (IRQF_NEEDS_IRQS_ENABLED, perhaps) which would cause it to run with interrupts turned on in the old way.
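For the threaded-handler option, the infrastructure merged in 2.6.30 already provides what is needed. A sketch of a driver split into a quick hard-irq check and a leisurely threaded handler (the device and function names are invented for illustration):

```c
#include <linux/interrupt.h>

/* Hard-irq half: runs with interrupts off, so just quiet the device. */
static irqreturn_t mydev_hardirq(int irq, void *dev_id)
{
	/* acknowledge/mask the interrupt in the hardware here */
	return IRQ_WAKE_THREAD;		/* hand off to the threaded half */
}

/* Threaded half: runs in process context with interrupts enabled,
 * so lengthy processing is harmless here. */
static irqreturn_t mydev_thread_fn(int irq, void *dev_id)
{
	/* do the real work */
	return IRQ_HANDLED;
}

static int mydev_setup_irq(unsigned int irq, void *dev)
{
	return request_threaded_irq(irq, mydev_hardirq, mydev_thread_fn,
				    IRQF_SHARED, "mydev", dev);
}
```

A driver converted this way no longer cares whether hard handlers run with interrupts enabled; the slow work simply never happens in hard-irq context.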

It's not clear when all this might happen, but it could be that, in the near future, all hard interrupt handlers are expected to run - quickly - with interrupts disabled. Few people will even notice, aside from some maintainers of out-of-tree drivers who will need to remove IRQF_DISABLED from their code. But the kernel as a whole should be faster for it.


Kernel support for infrared receivers

By Jonathan Corbet
December 2, 2009
One of the stated goals of the staging tree is to bring widely-used drivers into the mainline kernel tree. This effort has been quite successful; the number of out-of-tree drivers has dropped considerably over the last year or so. There is one high-profile holdout, though: the Linux Infrared Remote Control (LIRC) subsystem. LIRC is used to obtain input events from remote control devices and feed them through to applications; Linux-based digital video recorder systems are heavy LIRC users, but there are others as well. Back in October, Jarod Wilson posted a new version of LIRC for consideration. One month later, the kernel developers have started talking about it; what they lack in punctuality has been more than made up for in volume.

One might think that merging this longstanding, heavily-used project into the mainline would not require a great deal of discussion. The problem is that LIRC brings with it a new ABI. Since user-space interfaces must be supported indefinitely, they tend to come under a higher degree of scrutiny than other parts of the code. LIRC has never had to freeze its ABI during its many years of out-of-tree existence, a freedom which has made life easier for its developers. But LIRC in mainline would not have this freedom, so any incompatible ABI changes need to be made prior to merging. And, as it happens, some developers would like to see significant changes.

One would think that an IR receiver would be a simple device; all it must do is report button press and release events, much like a keyboard. Often, it seems, the simplest devices are the most complex to deal with. Some receivers have decoders built into them, allowing them to pass scan codes to the driver, which can then map them onto key codes to pass to applications. But others are simple indeed - they simply report the timing and length of pulses received from the remote. In this case, the driver must filter out glitches and perform protocol processing to get to the point where it can generate scan codes. For extra fun, there are a number of protocols in use, and some manufacturers have wisely decided that life would be much more interesting if they were to make their own versions of the protocols which differ from everybody else's. So the protocol processing can be painful and unpleasant.
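To make "protocol processing" concrete: in space-encoded protocols in the NEC style, each bit is a fixed-width pulse followed by a short gap for a 0 or a long gap for a 1. A toy decoder might look like the following (the 1000-microsecond threshold is invented for illustration; a real driver must also reject glitches and tolerate timing slop):

```c
/* Decode nbits data bits from a raw pulse/space sequence as reported by a
 * simple IR receiver.  durations[] alternates pulse and space lengths in
 * microseconds; in this toy space-encoded scheme only the gaps carry data:
 * a gap longer than 1000 us is a 1 bit, anything shorter is a 0 bit. */
unsigned ir_decode_spaces(const int *durations, int nbits)
{
	unsigned value = 0;
	int i;

	for (i = 0; i < nbits; i++) {
		int space = durations[2 * i + 1];  /* skip the pulse, read the gap */
		value = (value << 1) | (space > 1000);
	}
	return value;
}
```

Feeding it the pulse train { 560, 1690, 560, 560, 560, 1690, 560, 1690 } yields the bit pattern 1011, i.e. 0xB. Multiply this by a handful of protocols, vendor "improvements," and noisy receivers, and the appeal of doing the work once, in one place, becomes clear.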

LIRC handles this mess by having drivers report "raw" pulse-length information via a special device; a user-space daemon then handles the task of turning that information into something that usefully describes a button-press event. In many cases, the low-level driver runs in user space and does not involve the kernel at all. Distribution of these events is also handled by the LIRC daemon, which can direct specific events to different applications, run programs in response to events, and so on in a flexible, scriptable manner. LIRC works, and some developers would like to see it merged into the mainline more-or-less as it stands now. Others, though, dislike the special-purpose "raw" interface used by LIRC. As Jon Smirl put it:

[W]e used to have device specific user space interfaces for mouse and keyboard. These caused all sort of problems. A lot of work went into unifying them under evdev. It will be years until the old, messed up interfaces can be totally removed.

I'm not in favor of repeating the problems with a device specific user space interface for IR. I believe all new input devices should implement the evdev framework.

In other words, these developers want remote control devices to look like any other input device and generate input events through the same interface. Jon has posted a proposed IR input driver for discussion; it is actually a rework of work first posted one year ago. This code moves all processing into the kernel and provides a flexible mechanism for dealing with multiple remote controls.

As it happens, a number of remote control receivers already work this way, even in the absence of Jon's patch. LIRC is not the sole repository of IR receiver drivers; a fair number of them also live in the mainline kernel already, in the Video4Linux2 subsystem. TV cards often come with a bundled remote control and receiver, so it makes sense to write a driver for the receiver as part of the larger V4L2 driver. These drivers do not use the LIRC interface; instead, they generate input events directly. See the Conexant CX2388x IR driver for an example of what this sort of driver looks like.
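What "generating input events" means in practice: an evdev device delivers a stream of fixed-size struct input_event records (defined in linux/input.h), and applications simply read and interpret them. A sketch of both ends of that contract (make_key_event() stands in for the driver side; in real use the records arrive via read(2) on /dev/input/eventN):

```c
#include <string.h>
#include <linux/input.h>	/* struct input_event, EV_KEY, KEY_VOLUMEUP */

/* The check an application applies to each record read from the device:
 * is this a press (value 1) of the key we care about? */
int is_key_press(const struct input_event *ev, int keycode)
{
	return ev->type == EV_KEY && ev->code == keycode && ev->value == 1;
}

/* Build the record a remote-control driver would queue for a key event. */
struct input_event make_key_event(int keycode, int pressed)
{
	struct input_event ev;

	memset(&ev, 0, sizeof(ev));	/* leaves the timestamp zeroed */
	ev.type = EV_KEY;
	ev.code = keycode;
	ev.value = pressed ? 1 : 0;	/* 1 = press, 0 = release */
	return ev;
}
```

An IR driver that reports KEY_VOLUMEUP this way looks, to every application, exactly like a keyboard with a volume key; that is the "just work" property the evdev proponents are after.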

The discussion covered various approaches to IR receivers without coming to any real resolution. Jon Smirl's attempt to clarify the goals for in-kernel IR support may have brought some focus, but little in the way of solid conclusions. Even so, there are some points of near consensus; these include:

  • There needs to be some sort of API based on the input subsystem, where applications can obtain processed, high-level keycodes for button presses. The goal is to have remote-using applications "just work" whenever possible.

  • There probably needs to be a separate interface where special-purpose applications can get raw timing data from the receiver - at least, for receivers without built-in decoders which can provide this information. This interface can be used to reverse-engineer the sequences sent by new remote control units and to deal with pathologically-bad hardware. There is talk of funneling raw data through the input layer as well, but it's not clear that doing so buys anything; it may be that just adopting the existing LIRC interface for raw data is as good an approach as any.

With regard to the keycode interface, there is still a lot of disagreement over where the keycodes should come from. Some developers want all of the IR drivers to be in the kernel, while others are happy with using the LIRC daemon (or something like it) to generate keycodes and push them back into the kernel from user space. In-kernel drivers have the potential to work with no daemon process and they can use the current module loading mechanism. Kernel-based drivers will also have lower response latency than a user-space daemon, saving precious milliseconds for desperate users who want to change channels and evade that "too much information" pharmaceuticals commercial.

On the other hand, in-kernel drivers are kernel code, with the higher level of risk that always implies. Filtering of input sequences and protocol processing can be a significant amount of work that some would rather see done in user space. It may never be possible to support the more problematic hardware in the kernel. Then, there are the truly wild ideas, such as wiring an IR receiver to a sound card's microphone input - something people actually do, evidently. The fact that some IR protocols may be patent-encumbered also needs to be kept in mind.

Another detail worth bearing in mind: a number of IR receivers are also capable of transmitting information. A solution based solely on the input layer will not be able to handle the output case.

There is one final, simple point: the LIRC drivers have seen many years of development, and they work. If LIRC is merged directly, the kernel will benefit from that work and the associated lessons learned. If LIRC is dropped in favor of fully in-kernel drivers, chances are good that some of those lessons will have to be learned anew. If the kernel were to go with a non-LIRC approach to IR drivers, it would probably, eventually, reach a point where it had a more capable and flexible system with wider device support than is available now. But, between here and there would be a period - perhaps a long period - where in-kernel IR support was not as good as LIRC.

Still, that might just be how things go in the end. The kernel development community, always concerned about what it will have to maintain five or ten years in the future, tends not to be in a hurry to merge something now just because it is seen to work. So, while it is yet possible that LIRC could be merged in something close to its current form, it's also possible that it could lurk on the sidelines while something significantly different is created for the mainline.


Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds