Kernel development

Brief items

Kernel release status

The current stable 2.6 release is 2.6.17.13, released on September 8, several minutes after the rather abortive 2.6.17.12 release. Quite a few important fixes have made it into these releases, though none of them have vulnerability numbers attached.

On the 2.6.16 front, Adrian Bunk has released 2.6.16.29-rc1 and 2.6.16.29-rc2 with another set of fixes.

The current 2.6 prepatch is 2.6.18-rc7, announced by Linus on September 13. "Ok, ok, don't rub it in. I know I thought -rc6 would be the last one, but I just feel more comfy doing an -rc7, even if most of the changes are pretty minor." Expect the final release before too long.

The current -mm tree is 2.6.18-rc6-mm2. Recent changes to -mm include some USB API changes, a big x86-64 patch (including stack protection support), access control lists for tmpfs, and a patch which may reorder PCI device enumeration on some systems. There are currently 1915 patches in -mm, the largest number ever.

Kernel development news

Quotes of the week

The road to 2.6.19-rc1 is going to be rough - there's an unusually large amount of work pending, and there is an unusual (although still small) amount of overlap between the subsystem trees which people will need to sort out. Because of this I expect it will take us more than the nominal two weeks to reach -rc1.

-- Andrew Morton

We are very sorry for the mistakes that happened with the .12 release, and those responsible have been sacked.

-- The -stable team

Memory-mapped I/O barriers

Paul Mackerras recently reported a subtle bug. The tg3 Ethernet driver, like many other network drivers, operates on a set of buffer descriptors stored in the host system's memory. These descriptors describe the buffers which are available for incoming network packets; when a packet arrives, the interface picks the next descriptor on the list, stuffs the data there, then tells the processor that the packet is available. The reported bug works like this: the processor makes some changes to this descriptor data structure, then does a write to a memory-mapped I/O (MMIO) register to tell the device to start I/O. The device, however, receives this MMIO write before the data written to main memory arrives at its final destination, and thus operates on old data. When this happens, correct operation is, to say the least, unlikely.

Bugs resulting from the reordering of memory operations can be some of the most subtle and difficult-to-find problems. A developer can stare at the code for hours without realizing that what is actually happening, deep down within the system's hardware, does not quite match the code as it appears to be written. The incorrect behavior can happen infrequently and be impossible to reproduce in any easy way.

The solution for this kind of problem is usually to add some sort of memory barrier in situations where the ordering of operations matters. The sort of barrier most familiar to device driver writers may well be the classic rule: MMIO writes to I/O memory hosted on a PCI bus cannot be considered to be complete until a read has been done from that memory range. So drivers often have a pattern where many registers are set with values describing an I/O operation, but a read is done before the final write which sets the "go" bit. Without that read, which functions as a sort of MMIO barrier, the device could take off using older values and make a mess of things.

The tg3 bug illustrates a slightly different sort of problem, however: there is no guaranteed ordering between writes to regular memory and writes to a memory-mapped I/O range. So Paul's question was: should an MMIO write be redefined to be strictly ordered with respect to preceding writes to regular memory? On a number of architectures (including the i386), the hardware orders things nicely now, but on others (Paul is working with PowerPC64), there are no such guarantees. Redefining the MMIO write operations (iowrite32(), writel(), etc.) to add the necessary barriers on the relevant architectures could make a number of potential bugs go away.

Linus didn't like the idea, stating that it was too expensive. Memory barriers can stall the processor for long periods of time, so it is nice to leave them out when they are not truly needed. So, Linus says, the preferred approach is to require the programmer to put in an explicit barrier operation when one is needed.

There are some problems with this approach, however. One of those is that the kernel does not currently implement a barrier designed to force ordering between regular and MMIO memory operations. There is mmiowb(), but its real purpose is to enforce ordering between MMIO operations only. So Linus mentioned the possibility of creating new barriers with names like mem_to_io_barrier() to bring about the desired ordering in this situation.
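The ordering problem, and the explicit-barrier remedy Linus prefers, can be modeled in user space. What follows is a sketch, not kernel code: struct rx_desc, post_buffer(), and the doorbell variable are hypothetical stand-ins, and mem_to_io_barrier() (a name floated in the discussion, not an existing kernel interface) is approximated here with a C11 release fence. The fence only shows where the barrier would go; it does not reproduce what a real bus-level barrier does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical receive descriptor living in ordinary memory. */
struct rx_desc {
    uint64_t buf_addr;  /* bus address of the packet buffer */
    uint32_t buf_len;   /* size of that buffer in bytes */
    uint32_t ready;     /* nonzero once the device may use the entry */
};

/* Stand-in for a memory-mapped "go" register on the device. */
static volatile uint32_t doorbell;

/*
 * Assumption: models the proposed mem_to_io_barrier(). A C11 release
 * fence roughly keeps earlier stores from sinking below later ones as
 * seen by other threads; a real kernel barrier would also have to
 * order them as seen by the device.
 */
static inline void mem_to_io_barrier(void)
{
    atomic_thread_fence(memory_order_release);
}

/* Fill in a descriptor, then tell the "device" about it. */
static void post_buffer(struct rx_desc *d, uint64_t addr, uint32_t len,
                        uint32_t index)
{
    assert(len > 0);

    d->buf_addr = addr;      /* 1. update the descriptor in memory */
    d->buf_len = len;
    d->ready = 1;

    mem_to_io_barrier();     /* 2. descriptor stores must land first */

    doorbell = index;        /* 3. only now ring the doorbell */
}
```

Without step 2, nothing in the code forces the descriptor stores to become visible before the doorbell write, which is the failure reported against tg3.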

Alternatively, the MMIO operations could be redefined to contain a barrier before the MMIO access happens. That would fix the tg3 bug without adding any extra cost, but it would come at the cost of removing the barrier that is currently placed after the operation. This is the solution that Paul favors:

I suspect the best thing at this point is to move the sync in writeX() before the store, as you suggest, and add an "eieio" before the load in readX(). That does mean that we are then relying on driver writers putting in the mmiowb() between a writeX() and a spin_unlock, but at least that is documented.

This approach brought out a different objection from David Miller (and others), however:

Driver authors will not get these memory barriers right, you can say they will because it will be "documented" but that does not change reality which is that driver folks will get simple interfaces right but these memory barriers are relatively advanced concepts, which they thus will get wrong half the time

David would rather see things work correctly in the simple scenario, even if the run-time expense is higher. As others have mentioned, one can always implement no-barrier versions of the MMIO primitives for performance-minded developers who (think they) know what they are doing.

The case mentioned by Paul above - putting in a call to mmiowb() between the last MMIO write operation and a spin_unlock() call - would be the biggest concern. Spinlocks are used to keep multiple processors (or, in a preemptive scenario, multiple processes on a single processor) from mixing up operations to the same device. But a spinlock lives in regular memory, so it is possible that the unlock operation could succeed (allowing another process to access the MMIO region) before the previous process's MMIO writes complete. That is why mmiowb() is called for - but it does look like the sort of thing that driver authors will have a hard time remembering.

An alternative suggested by Alan Cox is the creation of a new pair of spinlock operations: spin_lock_io() and spin_unlock_io(). They would be explicitly defined to protect operations on MMIO regions, and would contain the requisite barriers. If device drivers could be trained to use these locking operations (and driver writers often can be trained - just feed them beer when they do something right), they would not have to remember to insert barriers.
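A rough user-space model of that idea might look like the following. Everything here is an assumption made for illustration: struct io_lock is a toy spinlock built on C11 atomic_flag, model_mmiowb() is a placeholder for an mmiowb()-style barrier, and fake_reg stands in for a device register. The point is only that the barrier lives inside the unlock operation, so the driver path in start_io() never mentions it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Minimal user-space spinlock; the kernel's spinlock_t is far richer. */
struct io_lock {
    atomic_flag held;
};

static volatile uint32_t fake_reg;   /* stand-in for an MMIO register */
static unsigned int io_barriers;     /* counts barrier calls, for illustration */

/* Assumption: placeholder for an mmiowb()-style barrier. */
static inline void model_mmiowb(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    io_barriers++;
}

static inline void spin_lock_io(struct io_lock *l)
{
    assert(l != 0);
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* spin until the lock is free */
}

static inline void spin_unlock_io(struct io_lock *l)
{
    model_mmiowb();  /* MMIO writes are ordered before the unlock */
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* A driver path no longer needs to remember the barrier itself. */
static void start_io(struct io_lock *l, uint32_t cmd)
{
    spin_lock_io(l);
    fake_reg = cmd;
    spin_unlock_io(l);
}
```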

There are a couple of problems here too, however. There are already a number of variations on the spin_lock() operation; adding another option will expand the number of locking calls considerably. Code which calls functions while holding locks must already be aware of the called functions' locking needs, and that awareness will be made more complicated as well. So Linus would much rather avoid this approach and just require the use of explicit barriers.

Yet another approach - the one which might just be adopted in the end - is to redefine and expand the set of MMIO accessor functions. In this scenario, as described by Benjamin Herrenschmidt, the existing functions (writel(), etc.) would be made fully ordered - even though that might well slow them down some. All drivers using those functions would continue to work - and some might have rare, subtle bugs fixed in the process.

For most drivers, the above functions will be adequate - memory barriers around MMIO operations will not materially affect performance most of the time. There are exceptions, however. For situations where the barriers are unnecessary and hurtful, a new set of accessors with names like __writel() or __iowrite32() would be defined. These functions would ensure that MMIO operations are seen by the peripheral device in the order issued by the processor, but no other guarantees would be made. When these primitives are used, the programmer is responsible for inserting barriers in cases where ordering between MMIO and regular memory operations is important.

Finally, for developers who truly want to live on the edge, a set of functions with names like __raw_writel() has been proposed. These accessors would provide no ordering guarantees at all and would not concern themselves with issues like byte swapping. They are one small step above issuing I/O operations directly in assembly. Benjamin's proposal also brings back the idea of creating a new set of memory barriers for specific situations. Thus, io_to_io_barrier() would ensure ordering between MMIO operations; it would be useful in conjunction with the "raw" operations described above. Other barriers would deal with ordering between MMIO and regular memory operations in various ways; see Benjamin's post for the full set.
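The three tiers can be pictured as layers, each adding one guarantee to the one below it. The sketch below is a user-space model under stated assumptions: the model_* names are invented for illustration (they are not the proposed kernel functions), the register is an ordinary variable, the barrier is a C11 fence standing in for whatever each architecture would need, and the MMIO-to-MMIO ordering promised by the middle tier is not modeled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Convert a CPU-order value to little-endian (PCI) byte order. */
static inline uint32_t model_cpu_to_le32(uint32_t v)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    if (first == 1)
        return v;                       /* little-endian host: no-op */
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Lowest tier (cf. __raw_writel()): no ordering, no byte swapping. */
static inline void model_raw_writel(uint32_t v, volatile uint32_t *reg)
{
    assert(reg != 0);
    *reg = v;
}

/* Middle tier (cf. __writel()): byte-swapped, no memory barrier. */
static inline void model_relaxed_writel(uint32_t v, volatile uint32_t *reg)
{
    model_raw_writel(model_cpu_to_le32(v), reg);
}

/* Top tier (cf. writel()): also ordered after earlier memory accesses. */
static inline void model_writel(uint32_t v, volatile uint32_t *reg)
{
    atomic_thread_fence(memory_order_seq_cst);  /* stand-in barrier */
    model_relaxed_writel(v, reg);
}
```

A driver using only the top tier pays for a barrier on every write; one that drops to the lower tiers takes on the job of placing barriers itself.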

There have been a number of suggestions for changes to this proposal, but no real opposition to the general idea. So, in the end, that may be just how it works out - though expect this discussion to return in the future. When the topic is one of the trickiest areas of kernel programming on contemporary hardware, easy and final solutions will likely be hard to come by.

A bid to resurrect Linux capabilities

Back in 1998, as the 2.1 kernel went into yet another feature freeze, the capabilities feature was merged. Capabilities split the power of the root account into a set of privileges, each of which can be granted or withheld independently of the others. A process which needs to be able to bind to a privileged port number, for example, could be given that ability without simultaneously enabling it to override file permissions, kill other processes, or exceed resource limits. Proponents of capabilities have long seen a world where the root account no longer exists and all tasks have the minimum level of privilege they need to get their jobs done. A system organized in this way, it is thought, would be more secure.

The world is full of Linux distributions, many of which are oriented toward higher levels of security. But, to your editor's knowledge, nobody has ever put together a successful, capability-based distribution. There are many reasons for this lack of implementations, including the fact that nobody has really figured out a way to administer a system with a couple dozen more security-related bits attached to every executable file. But one should also not overlook the fact that, from the 2.1.x days to now, there has never been a Linux kernel where capabilities actually worked as intended.

Part of the problem is an incomplete implementation: no patch which attaches capability masks to files has ever been merged. But the kernel has also never implemented capability inheritance - what happens to the capability bits when a process executes a new program - in a correct manner. For some time now, in fact, capability inheritance has been disabled completely. Without inheritance, the full capability model cannot work. So the use of capabilities in Linux systems has been limited to a very small number of programs which have been coded to drop the capabilities they do not need.

David Madore has set out to change that state of affairs with a set of patches to fix up capability support. This patch set does a few things, the first of which is to expand the capability set from 32 to 64 bits. Current kernels have 31 capabilities defined, so it is not especially hard to imagine needing more in the future. That need could become pressing if anybody ever gets serious about splitting the catch-all CAP_SYS_ADMIN capability into several smaller privileges.

This patch uses some of those new bits from the outset for a set of "regular capabilities" which all processes are normally expected to have. These capabilities include the ability to use fork() or exec(), the ability to open files and to write to files, the ability to use ptrace(), and the ability to increase privilege by running a setuid program. The idea here is that processes running in security-relevant settings can drop those capabilities if they are not needed, making it harder to exploit any vulnerabilities in those processes.

The core of the patch, however, is the implementation of capability inheritance. Understanding this part requires just a bit of background. As it happens, while one can talk about the capabilities possessed by a process, each process in Linux has three separate capability masks. The permitted set is all of the capabilities that the process is allowed to have. But capabilities cannot be used unless they are set in the effective set, which is a subset of the permitted set. Finally, each process has an inheritable set, listing the capabilities (again, a subset of the permitted set) which can be passed on to any program run with exec(). Processes can adjust the effective and inheritable sets at any time (within the bounds of the permitted set), but the permitted set cannot be expanded.

In a capability-based system, executable files also have a set of three capability masks. Those masks have the same names as the process masks, and their function is almost the same. The file's inherited mask, however, will limit the capabilities which can be inherited from any other process. David's patch set includes a patch (by Serge Hallyn) which adds support for capability masks to the filesystem layer.

When a process runs a new executable, the masks are combined as follows:

  • P′p ← (Pi ∩ Fi) ∪ (Fp ∩ bnd)
  • P′e ← (Pi ∩ Pe ∩ Fi) ∪ (Fp ∩ Fe ∩ bnd)
  • P′i ← P′p

These equations are taken directly from David's "new capabilities" page, which has much more detail on all of this work. What they say, in English, is something like this:

  • The permitted capabilities for the new executable (P′p) are the intersection of the inheritable set from the process before calling exec() (Pi) and the file's inherited set (Fi). The permitted set from the file (Fp) is then added in, but not before being limited by the system-wide capability bounding set.

  • The effective capabilities (P′e) will be the same as the new permitted capabilities, except that inherited capabilities which are not effective in the current process, and file-granted capabilities which are not in the file's effective set, will be masked out.

  • The inheritable capabilities (P′i) will be the same as the permitted capabilities.
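The three rules translate directly into bit operations. The sketch below assumes one bit per capability in a 64-bit mask; the type and function names (cap_mask, struct cap_sets, caps_after_exec()) are invented for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cap_mask;   /* one bit per capability */

struct cap_sets {
    cap_mask permitted;
    cap_mask effective;
    cap_mask inheritable;
};

/*
 * Compute the capability sets of a process after exec(), given the
 * process sets before exec (p), the file's masks (f), and the
 * system-wide bounding set (bnd).
 */
static struct cap_sets caps_after_exec(struct cap_sets p, struct cap_sets f,
                                       cap_mask bnd)
{
    struct cap_sets n;

    /* P'p = (Pi & Fi) | (Fp & bnd) */
    n.permitted = (p.inheritable & f.inheritable) | (f.permitted & bnd);

    /* P'e = (Pi & Pe & Fi) | (Fp & Fe & bnd) */
    n.effective = (p.inheritable & p.effective & f.inheritable) |
                  (f.permitted & f.effective & bnd);

    /* P'i = P'p */
    n.inheritable = n.permitted;

    /* Sanity check: the effective set is a subset of the permitted set. */
    assert((n.effective & ~n.permitted) == 0);
    return n;
}
```

For example, a process with inheritable bits 0x3 and effective bit 0x1 running a file with Fi = 0x1 and Fp = Fe = 0x4 ends up with permitted and effective sets of 0x5, provided the bounding set allows bit 0x4.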

For the most part, these rules match the usual understanding of how capability-based systems are supposed to work. Capabilities, in such a system, are assigned to programs, not to users; the normal permissions bits can then come into play to control which programs specific users can run.

David's patch differs from the usual idea of capability-based systems in one important regard, however: how it handles programs with no capability sets defined. On most systems, that will be almost every executable file there is. By the rules, such programs should be treated as having an empty inherited set, which, by the rules above, would cause them to be run with no capabilities at all. David's patch, instead, causes these programs to be run with the same capabilities the process had before - though the presence of things like setuid bits can obviously change that calculation. This interpretation breaks the classic capability-based model, but it has the advantage of actually working on current systems.

Ted Ts'o, however, complains that this compromise fundamentally weakens the security of the capability-based model. He has suggested that the behavior be configurable, with each filesystem having a flag describing how capabilities should be handled in the absence of a set of per-file masks. A set of default capabilities for new files could be part of this change as well.

The other complaint which has been heard is fairly predictable: why, it is asked, should we bother with capabilities when SELinux can do all of the same things and more? In fact, SELinux does something vaguely similar, but with a level of indirection; it attaches labels to files, then associates capabilities with the labels through the policy mechanism. Anybody who has ever gotten that cheery Fedora "your filesystem must be relabeled, please wait for a very long time" boot message knows that keeping files and labels properly synchronized is a difficult task. There is no real reason to believe that keeping capability masks in a correct state would be any easier. That fact alone may continue to limit the real usage of capabilities well into the future.

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

September 13, 2006

This article was contributed by Valerie Henson

The Problem

Kernel developers have written many wonderful and useful tools for debugging and observing system behavior, such as slab allocation debugging, lock dependency tracking, and scheduler statistics. However, few of these tools can be used in production systems (those are computers used to do actual work as opposed to what I use them for, which is compiling and testing my latest kernel patches) because of the overhead they create, even when disabled. Whenever Dave Jones is trying to track down a memory allocation bug in Rawhide and turns on slab debugging, he's inundated with complaints about sluggish systems until he turns it back off again.

We also lack decent tools to do system-wide analysis - analysis spanning the operating system and all running processes - since most tools are built around either a single process (e.g., strace) or a single kernel subsystem (e.g., SCSI logging). When it comes down to root-causing a performance problem on a production system, our hands are pretty much tied if we can't boot into a kernel compiled with support for debugging and tracing - and often we can't reboot, either due to downtime restrictions or rules about certification of software on production systems.

Today, performance analysis on production Linux systems usually ends up being a jumble of iostat, top, sysrq-t, random /proc entries, and unreliable oprofile results (if we're lucky enough to have oprofile). Recently, one of my friends with extensive Linux experience upgraded his business's production system (a computer used to do actual work) to a more recent Linux kernel and found that performance had suddenly dropped to an unusable level. Once he had figured out that many Apache processes were spending a lot of time in iowait, he had no idea where to go next and had to revert to the old kernel without root-causing the problem. Unfortunately, the problem is only reproducible on a system in production use - and so must be investigated using only tools suitable for a production system. System-wide performance analysis on present-day Linux systems remains a black art.

The Solution

The ideal tracing system would cause zero performance degradation when it is disabled, would be dynamically enabled as needed, could collect data over an entire system, and would be safe to use on a production system. The paper describing DTrace, Dynamic Instrumentation of Production Systems, published in the USENIX 2004 Annual Technical Conference, earns itself a place on the Kernel Hacker's Bookshelf for describing the first system that lives up to this ideal.

DTrace was originally written for Solaris on both SPARC and x86, and has recently been ported to Mac OS X. I used DTrace extensively while I was working on Solaris and got used to being able to answer any question I had about a system with a few minutes of script writing. When I went back to work on Linux and could no longer use DTrace, I felt like I went from wielding a sharp steel katana to fumbling with dull flint tools. The only tool for Linux that comes close is SystemTap, which has improved significantly in the last year, though it still remains out of the mainline kernel.

I'm not the only person who thinks DTrace is ground-breaking. DTrace won the top award in the Wall Street Journal's 2006 Technology Awards. MIT's Technology Review named DTrace's lead engineer, Bryan Cantrill, as one of their 2005 TR35 winners, their list of top innovators under the age of 35. Any company with a half-decent marketing group can generate hype, but DTrace has garnered praise from both industry leaders and the people knuckling down to do the real work.

The Paper

The DTrace paper begins with the motivation for DTrace. For many years, Solaris developers, like Linux developers, focused on writing tools to help them in a kernel development environment. Then they began venturing out into the field to analyze real-world systems - and discovered that much of their toolkit was useless. Besides being impossible to use on production systems, their tools were designed to analyze processes or the kernel in isolation. They began to design a dynamic tracing system intended from its inception for use in production systems. It needed to be completely safe, have zero probe effect, aggregate data over the whole system, lose a minimum of trace data, and allow arbitrary instrumentation of any part of the system.

The architecture they came up with divides up the work of tracing into several modular components. The first is DTrace providers. These are kernel modules that know how to create and enable a particular class of DTrace probes. DTrace providers include things like function boundary tracing and virtual memory info tracing. When enabled, each DTrace probe has one or more series of actions associated with it that are executed by the DTrace framework (another kernel module) each time the probe fires, such as "Record the timestamp" or "Get the user stack of this thread." Actions can have predicates - conditions that must be met for the action to be taken. This is one way to cut down on the amount of data that would otherwise be laboriously copied out of the kernel, only to be thrown away in post-processing. A useful predicate might be "Only if the pid is 7893" or "Only if the first argument is non-zero."
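The relationship between probes, predicates, and actions can be sketched as a small data structure. This is a conceptual model only, written in C rather than D; none of the names below (struct probe, probe_fire(), and so on) come from DTrace itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Context handed to predicates and actions when a probe fires. */
struct probe_ctx {
    int pid;
    uint64_t arg0;
    uint64_t timestamp;
};

struct probe {
    const char *name;
    int enabled;
    int (*predicate)(const struct probe_ctx *);   /* NULL means "always" */
    void (*action)(const struct probe_ctx *, void *state);
    void *state;
};

/* Called at the instrumented site; returns 1 if the action ran. */
static int probe_fire(const struct probe *p, const struct probe_ctx *ctx)
{
    assert(p->action != NULL);

    if (!p->enabled)
        return 0;                 /* disabled probes do no work */
    if (p->predicate != NULL && !p->predicate(ctx))
        return 0;                 /* predicate filtered the event out */
    p->action(ctx, p->state);
    return 1;
}

/* Predicate: "only if the pid is 7893". */
static int pid_is_7893(const struct probe_ctx *ctx)
{
    return ctx->pid == 7893;
}

/* Action: "record the timestamp" into caller-provided storage. */
static void record_timestamp(const struct probe_ctx *ctx, void *state)
{
    *(uint64_t *)state = ctx->timestamp;
}
```

Because the predicate runs before the action, an event that fails the test never produces any trace data at all.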

Probes are enabled by DTrace consumers - processes which tell the DTrace framework what probe points and actions they want to use. Probes can have multiple consumers. Each consumer has its own set of per-CPU buffers for transferring trace data out of the kernel, which is done in such a way that data is never corrupted, and the consumer is notified if data is lost. Many tracing systems silently drop data, which can lead to serious errors in analysis when an event is significantly under-sampled.
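The "never silently drop" property is easy to illustrate. In the toy model below, each consumer owns one fixed-size buffer per CPU, and a record that does not fit is counted rather than forgotten. The sizes, names, and the single-threaded design are all assumptions made for the sketch; the real implementation is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCPUS      4     /* assumed CPU count for the model */
#define BUF_SLOTS  8     /* records each per-CPU buffer can hold */

struct trace_buf {
    uint64_t records[BUF_SLOTS];
    size_t used;
    uint64_t drops;      /* records lost because the buffer was full */
};

struct consumer {
    struct trace_buf cpu[NCPUS];
};

/* Append a record on the given CPU, or count it as dropped. */
static int trace_record(struct consumer *c, unsigned int cpu, uint64_t rec)
{
    struct trace_buf *b;

    assert(cpu < NCPUS);
    b = &c->cpu[cpu];
    if (b->used == BUF_SLOTS) {
        b->drops++;
        return 0;
    }
    b->records[b->used++] = rec;
    return 1;
}

/* Drain one CPU's buffer and report how many records were lost. */
static size_t trace_drain(struct consumer *c, unsigned int cpu,
                          uint64_t *out, uint64_t *dropped)
{
    struct trace_buf *b;
    size_t n;

    assert(cpu < NCPUS);
    b = &c->cpu[cpu];
    n = b->used;
    for (size_t i = 0; i < n; i++)
        out[i] = b->records[i];
    *dropped = b->drops;
    b->used = 0;
    b->drops = 0;
    return n;
}
```

A consumer that sees a nonzero drop count knows its sample is incomplete and can enlarge its buffers or narrow its predicates before drawing conclusions.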

The most interesting and controversial part of DTrace is the scripting language, "D", and its conversion to the D Intermediate Format, DIF. Many developers don't understand why C and native machine code aren't preferable - after all, we already know C, and we have plenty of tools for compiling C into runnable machine code. Why reinvent the wheel? The answer comes in two parts.

First, D was invented to quickly form questions about a running system. A quote from the paper: "Our experience showed that D programs were rapidly developed and edited and often written directly on the dtrace(1M) command line." As such, it lends itself to a script-like language that is friendly to rapid prototyping. It is also intended primarily to gather and process data, and as such an awk- or Python-like structure was more appropriate. The language used to specify probe actions should be specialized for the task at hand, rather than simply reusing a language designed for generic system programming. At the same time, D is very similar to C (the paper describes D as "a companion language to C") and C programmers can quickly learn D.

Second, some level of emulation is needed for safety. Not all program errors can be caught in an initial pass; things like illegal dereferences must be caught and handled on the fly. The in-kernel DIF emulator is vital for the level of safety needed to use DTrace on a production system. When explaining to Linux developers the need to prevent buggy scripts from crashing the system, often the response is, "Well, don't do that." But imagine for a minute that you are debugging with SystemTap on your friend's production Linux server. When they ask you if it could possibly crash their system (which will cost them many thousands of dollars in lost business), you don't want to say, "Well, only if I have a bug in the scripts I am writing... on the fly... without code review... Um, how many thousands of dollars did you say?" A tracing system that can still cause the system to crash in some situations will be limited to kernel developers, students, and other people with the luxury of unscheduled downtime.

Two major components of DTrace remain: aggregations and speculative tracing, two methods of reducing trace data at the source, allowing far greater flexibility of tracing. The traditional method of tracing involves generating vast quantities of data, shoveling it out to user space as fast as possible, and then sifting through the detritus with post-processing scripts. The downsides of this approach are data loss (there is a limit to how quickly data can be copied out of the kernel), limitations on what we can trace (without excessive data loss), and expensive post-processing times. If we instead throw away or coalesce trace data at the source, our tracing is cheaper and more flexible.

One method of data pruning is aggregations, which coalesce a set of data into a useful summary. For example, with only a few lines of D, you can create an aggregation that collects a frequency distribution of the size of mmap function calls across all processes on the system. The alternative is copying out the entire set of trace data for each mmap call on the system, then writing a script to extract the sizes and calculate the distribution - which is slower, more error-prone, and has a much higher probe effect.
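To see why an aggregation is so much cheaper than shipping raw records, consider a power-of-two frequency distribution. The C sketch below is a model of the idea, not DTrace code, and its names (struct pow2_agg, bucket_of(), agg_add()) are invented. However many events arrive, the state kept is a fixed array of 65 counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 65   /* bucket 0 holds zero; bucket k holds [2^(k-1), 2^k) */

struct pow2_agg {
    uint64_t count[NBUCKETS];
};

/* Index of the power-of-two bucket that a value falls into. */
static unsigned int bucket_of(uint64_t v)
{
    unsigned int k = 0;

    while (v != 0) {    /* count the significant bits in v */
        v >>= 1;
        k++;
    }
    return k;
}

/* Fold one observation (e.g. an mmap size) into the summary. */
static void agg_add(struct pow2_agg *a, uint64_t value)
{
    assert(a != NULL);
    a->count[bucket_of(value)]++;
}
```

Only the counters ever need to leave the kernel, so the amount of data copied out does not grow with the number of mmap calls observed.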

Speculative tracing is even more interesting; it allows a script to collect trace data and then decide whether to throw it away or pass it back up to user space. This is vital for collecting data for a common event, of which only a few events are judged "interesting" later on. For example, if you want to trace the entire call path of all system calls that result in a particular error code, you can speculatively trace each system call, but throw away the data for all system calls except the ones with the interesting error code.
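The mechanism amounts to a scratch buffer with two possible fates. The following C model is a sketch under assumed names and sizes (struct speculation, spec_trace(), spec_commit(), spec_discard()); it ignores concurrency and is not how DTrace implements the feature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPEC_SLOTS 16
#define OUT_SLOTS  64

struct speculation {
    uint64_t rec[SPEC_SLOTS];
    size_t used;
};

struct out_buf {
    uint64_t rec[OUT_SLOTS];
    size_t used;
};

/* Tentatively record data; returns 0 if the speculation is full. */
static int spec_trace(struct speculation *s, uint64_t rec)
{
    assert(s != NULL);
    if (s->used == SPEC_SLOTS)
        return 0;
    s->rec[s->used++] = rec;
    return 1;
}

/* The event turned out to be uninteresting: throw the data away. */
static void spec_discard(struct speculation *s)
{
    s->used = 0;
}

/* The event was interesting: move the data to the output buffer. */
static size_t spec_commit(struct speculation *s, struct out_buf *o)
{
    size_t moved = 0;

    while (moved < s->used && o->used < OUT_SLOTS)
        o->rec[o->used++] = s->rec[moved++];
    s->used = 0;
    return moved;
}
```

In the system call example, spec_trace() would run at each step of the call path, and the return probe would call spec_commit() only when the error code matched, discarding everything else.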

If you don't have much time to read the DTrace paper, be sure to at least read Section 9, which describes a session root-causing a mysterious performance problem on a large server with hundreds of users. In the end, 6 instances of a stock ticker applet were putting so much load on the X server that killing them resulted in an increase in system idle time of 15% (!!!). More DTrace examples are available, linked to from the DTrace OpenSolaris web site.

What does this mean for Linux?

Hopefully anyone who saw Dave Jones' Why Userspace Sucks talk at OLS 2006 will already be excited about using SystemTap to track down problems. SystemTap is the current state-of-the-art dynamic tracing system for Linux. It has little or no probe effect - performance degradation when it is disabled - and it can trace events across the system. However, it still has some way to go in the areas of safety, early data processing, and general usability. Understanding the DTrace paper will help people understand why these areas are important. More importantly, understanding the DTrace paper will help people understand how they can use SystemTap to solve interesting problems.

Bored? Lonely? Download SystemTap and start investigating performance problems today! If you're running FC4, you can even install SystemTap using yum.

Patches and updates

Development tools

  • Marco Costalba: qgit-1.5. (September 10, 2006)

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds