The current stable 2.6 release is 2.6.17.13, released on September 8,
several minutes after the rather abortive 2.6.17.12 release. Quite a few
important fixes have made it into these releases, though none of them have
vulnerability numbers attached.
On the 2.6.16 front, Adrian Bunk has released two new -rc patches with
another set of fixes for the next 2.6.16.y release.
The current 2.6 prepatch is 2.6.18-rc7, announced by Linus on September 13. "Ok, ok, don't rub it in. I know I thought -rc6 would be
the last one, but I just feel more comfy doing an -rc7, even if most of the
changes are pretty minor." Expect the final release before too long.
The current -mm tree is 2.6.18-rc6-mm2. Recent changes
to -mm include some USB API changes, a big x86-64 patch (including stack
protection support), access control lists for tmpfs, and a patch which may
reorder PCI device enumeration on some systems. There are currently 1915
patches in -mm, the largest number ever.
Kernel development news
The road to 2.6.19-rc1 is going to be rough - there's an unusually
large amount of work pending, and there is an unusual (although
still small) amount of overlap between the subsystem trees which
people will need to sort out. Because of this I expect it will
take us more than the nominal two weeks to reach -rc1.
-- Andrew Morton
We are very sorry for the mistakes that happened with the .12
release, and those responsible have been sacked.
-- The -stable team
Paul Mackerras recently reported a bug. The tg3 Ethernet driver, like
many other network drivers, operates on
a set of buffer descriptors stored in the host system's memory. These
descriptors describe the buffers which are available for incoming network
packets; when a packet arrives, the interface picks the next descriptor on
the list, stuffs the data there, then tells the processor that the packet
is available. The reported bug works like this: the processor makes some
changes to this descriptor data structure, then does a write to a
memory-mapped I/O (MMIO) register to tell the device to start I/O. The
device, however, receives this MMIO write before the data written to main
memory arrives at its final destination, and thus operates on old data.
When this happens, correct operation is, to say the least, unlikely.
Bugs resulting from the reordering of memory operations can be some of the
most subtle and difficult-to-find problems. A developer can stare at the
code for hours without realizing that what is actually happening, deep down
within the system's hardware, does not quite match the code as it appears
to be written. The incorrect behavior can happen infrequently and be
impossible to reproduce in any easy way.
The solution for this kind of problem is usually to add some sort of
memory barrier in situations where the ordering of operations matters. The
sort of barrier most familiar to device driver writers may well be the
classic rule: MMIO writes to I/O memory hosted on a PCI bus cannot be
considered to be complete until a read has been done from that memory
range. So drivers often have a pattern where many registers are set with
values describing an I/O operation, but a read is done before the final
write which sets the "go" bit. Without that read, which functions as a
sort of MMIO barrier, the device could take
off using older values and make a mess of things.
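This pattern can be sketched in plain C. The register array and the fake_writel()/fake_readl() helpers below are hypothetical user-space stand-ins for the kernel's real MMIO accessors, so this is only a model of the ordering rule, not actual driver code:

```c
#include <assert.h>
#include <stdint.h>

/* User-space model of the classic PCI posted-write pattern.  The
 * register "file" and the accessor stubs are made up for illustration;
 * in a real driver they would be writel()/readl() on an ioremap()ed
 * region. */
enum { REG_DMA_ADDR, REG_DMA_LEN, REG_GO, NUM_REGS };
static volatile uint32_t regs[NUM_REGS];

static void fake_writel(uint32_t val, unsigned int reg) { regs[reg] = val; }
static uint32_t fake_readl(unsigned int reg) { return regs[reg]; }

static void start_dma(uint32_t addr, uint32_t len)
{
    fake_writel(addr, REG_DMA_ADDR);
    fake_writel(len, REG_DMA_LEN);
    /*
     * On real PCI hardware, this read cannot complete until the posted
     * writes above have reached the device, so the "go" write below
     * cannot overtake them.  In this user-space model the read is a
     * no-op; the point is where it sits in the sequence.
     */
    (void) fake_readl(REG_DMA_LEN);
    fake_writel(1, REG_GO);      /* set the "go" bit last */
}
```

The read looks useless to a casual reader, which is exactly why such flushes deserve a comment in real drivers.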
The tg3 bug illustrates a slightly different sort of problem, however:
there is no guaranteed ordering between writes to regular memory and writes
to a memory-mapped I/O range. So Paul's question was: should an MMIO write
be redefined to be strictly ordered with respect to preceding writes to
regular memory? On a number of architectures (including the i386), the
hardware orders things nicely now, but on others (Paul is working with
PowerPC64), there are no such guarantees. Redefining the MMIO write
operations (iowrite32(), writel(), etc.) to add the
necessary barriers on the relevant architectures could make a number of
potential bugs go away.
Linus didn't like the idea, stating that it
was too expensive. Memory barriers can stall the processor for long
periods of time, so it is nice to leave them out when they are not truly
needed. So, Linus says, the preferred approach is to require the
programmer to put in an explicit barrier operation when one is needed.
There are some problems with this approach, however. One of those is that
the kernel does not currently implement a barrier designed to force
ordering between regular and MMIO memory operations. There is
mmiowb(), but its real purpose is to enforce ordering between MMIO
operations only. So Linus mentioned the possibility of creating new
barriers with names like mem_to_io_barrier() to bring about the
desired ordering in this situation.
Alternatively, the MMIO operations could be redefined to contain a barrier
before the MMIO access happens. That would fix the tg3 bug without adding
any extra cost, but it would come at the cost of removing the barrier that
is currently placed after the operation. This is the solution that
Paul prefers:
I suspect the best thing at this point is to move the sync in
writeX() before the store, as you suggest, and add an "eieio"
before the load in readX(). That does mean that we are then
relying on driver writers putting in the mmiowb() between a
writeX() and a spin_unlock, but at least that is documented.
This approach brought out a different
objection from David Miller (and others), however:
Driver authors will not get these memory barriers right, you can
say they will because it will be "documented" but that does not
change reality which is that driver folks will get simple
interfaces right but these memory barriers are relatively advanced
concepts, which they thus will get wrong half the time.
David would rather see things work correctly in the simple scenario, even
if the run-time expense is higher. As others have mentioned, one can
always implement no-barrier versions of the MMIO primitives for
performance-minded developers who (think they) know what they are doing.
The case mentioned by Paul above - putting in a call to mmiowb()
between the last MMIO write operation and a spin_unlock() call -
would be the biggest concern. Spinlocks are used to keep multiple
processors (or, in a preemptive scenario, multiple processes on a single
processor) from mixing up operations to the same device. But a spinlock
lives in regular memory, so it is possible that the unlock operation could
succeed (allowing another process to access the MMIO region) before the
previous process's MMIO writes complete. That is why mmiowb() is
called for - but it does look like the sort of thing that driver authors
will have a hard time remembering.
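The pattern in question can be modeled in user space. The fake_* helpers below are hypothetical stand-ins (a real driver would use spin_lock(), writel(), mmiowb(), and spin_unlock()) that simply log the order in which the operations must be issued:

```c
#include <assert.h>
#include <string.h>

/* A user-space model of the documented pattern: the stubs just record
 * the issue order so it can be checked; the comments describe what the
 * real kernel primitives would do. */
static char issue_log[32];

static void fake_spin_lock(void)   { strcat(issue_log, "L"); }
static void fake_writel(void)      { strcat(issue_log, "W"); }
static void fake_mmiowb(void)      { strcat(issue_log, "B"); }
static void fake_spin_unlock(void) { strcat(issue_log, "U"); }

static void update_device(void)
{
    fake_spin_lock();
    fake_writel();        /* program a device register               */
    fake_mmiowb();        /* keep the MMIO write from drifting past  */
                          /* the unlock on architectures that allow it */
    fake_spin_unlock();   /* only now may another CPU take the lock  */
}
```

Forgetting the barrier here produces no visible error on most hardware, which is what makes this rule so easy to get wrong.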
An alternative suggested by Alan Cox is the
creation of a new pair of spinlock operations: spin_lock_io() and
spin_unlock_io(). They would be explicitly defined to protect
operations on MMIO regions, and would contain the requisite barriers. If
device drivers could be trained to use these locking operations (and driver
writers often can be trained - just feed them beer when they do something
right), they would not have to remember to insert barriers.
There are a couple of problems here too, however. There are already a number
of variations on the spin_lock() operation; adding another option
will expand the number of locking calls considerably. Code which calls
functions while holding locks must already be aware of the called
functions' locking needs, and that awareness will be made more complicated
as well. So Linus would much rather avoid this
approach and just require the use of explicit barriers.
Yet another approach - the one which might just be adopted in the end - is
to redefine and expand the set of MMIO accessor functions. In this
scenario, as described by
Benjamin Herrenschmidt, the existing functions (writel(), etc.)
would be made fully ordered - even though that might well slow them down
some. All drivers using those functions would continue to work - and some
might have rare, subtle bugs fixed in the process.
For most drivers, the above functions will be adequate - memory barriers
around MMIO operations will not materially affect performance most of the
time. There are exceptions, however. For situations where the barriers
are unnecessary and hurtful, a new set of accessors with names like
__writel() or __iowrite32() would be defined. These
functions would ensure that MMIO operations are seen by the peripheral
device in the order issued by the processor, but no other guarantees would
be made. When these primitives are used, the programmer is responsible for
inserting barriers in cases where ordering between MMIO and regular memory
operations is important.
Finally, for developers who truly want to live on the edge, a set of
functions with names like __raw_writel() has been proposed. These
accessors would provide no ordering guarantees at all and would not concern
themselves with issues like byte swapping. They are one small step above
issuing I/O operations directly in assembly.
Benjamin's proposal also brings back the idea of creating a new set of
memory barriers for specific situations. Thus, io_to_io_barrier()
would ensure ordering between MMIO operations; it would be useful in
conjunction with the "raw" operations described above. Other barriers
would deal with ordering between MMIO and regular memory operations in
various ways; see Benjamin's post for the full set.
There have been a number of suggestions for changes to this proposal, but
no real opposition to the general idea. So, in the end, that may be just
how it works out - though expect this discussion to return in the future.
When the topic is one of the trickiest areas of kernel programming on
contemporary hardware, easy and final solutions will likely be hard to come
by.
Back in 1998, as the 2.1 kernel went into yet another feature freeze, the
capabilities feature was merged. Capabilities split the power of the root
account into a set of privileges, each of which can be granted or withheld
independently of the others. A process which needs to be able to bind to a
privileged port number, for example, could be given that ability without
simultaneously enabling it to override file permissions, kill other
processes, or exceed resource limits. Proponents of capabilities have long
envisioned a world where the root account no longer exists and all tasks have the
minimum level of privilege they need to get their jobs done. A system
organized in this way, it is thought, would be more secure.
The world is full of Linux distributions, many of which are oriented toward
higher levels of security. But, to your editor's knowledge, nobody has
ever put together a successful, capability-based distribution. There are
many reasons for this lack of implementations, including the fact that
nobody has really figured out a way to administer a system with a couple
dozen more security-related bits attached to every executable file. But
one should also not overlook the fact that, from the 2.1.x days to
now, there has never been a Linux kernel where capabilities actually worked.
Part of the problem is an incomplete implementation: no patch which
attaches capability masks to files has ever been merged. But the kernel
has also never implemented capability inheritance - what happens to the
capability bits when a process executes a new program - in a correct
manner. For some time now, in fact, capability inheritance has been
disabled completely. Without inheritance, the full capability model cannot
work. So the use of capabilities in Linux systems has been limited to a
very small number of programs which have been coded to drop the
capabilities they do not need.
David Madore has set out to change that state of affairs with a set of patches to fix up
capability support. This patch set does a few things, the first of which
being to expand the capability set from 32 to 64 bits. Current kernels
have 31 capabilities defined, so it is not especially hard to imagine
needing more in the future. That need could become pressing if anybody
ever gets serious about splitting the catch-all CAP_SYS_ADMIN
capability into several smaller privileges.
This patch uses some of those new bits from the outset for a set of
"regular capabilities" which all processes are normally expected to have. These
capabilities include the ability to use fork() or exec(),
the ability to open files and to write to files, the ability to use
ptrace(), and the ability to increase privilege by running a
setuid program. The idea here is that processes running in
security-relevant settings can drop those capabilities if they are not
needed, making it harder to exploit any vulnerabilities in those
processes.
The core of the patch, however, is the implementation of capability
inheritance. Understanding this part requires just a bit of background.
As it happens, while one can talk about the capabilities possessed by a
process, each process in Linux has three separate capability masks. The
permitted set is all of the capabilities that the process is allowed
to have. But capabilities cannot be used unless they are set in the
effective set, which is a subset of the permitted set.
Finally, each process has an inheritable set, listing the
capabilities (again, a subset of the permitted set) which can be passed on
to any program run with exec(). Processes can adjust the
effective and inheritable sets at any time (within the bounds of the
permitted set), but the permitted set cannot be expanded.
In a capability-based system, executable files also have a set of three
capability masks. Those masks have the same names as the process masks,
and their function is almost the same. The file's inherited mask, however,
limits the capabilities which can be inherited from the calling process.
David's patch set includes a patch (by Serge Hallyn) which adds support for
capability masks to the filesystem layer.
When a process runs a new executable, the masks are combined as follows:
- P′p ← (Pi ∩ Fi) ∪ (Fp ∩ bnd)
- P′e ← (Pi ∩ Pe ∩ Fi) ∪ (Fp ∩ Fe ∩ bnd)
- P′i ← P′p
These equations are taken directly from David's "new
capabilities" page, which has much more detail on all of this work.
What they say, in English, is something like this:
- The permitted capabilities for the new executable
  (P′p) are the intersection of the inheritable set from the
  process before calling exec() (Pi) and the
  file's inherited set (Fi). The permitted set from the file
  (Fp) is then added in, but not before being limited by the
  bounding set (bnd).
- The effective capabilities (P′e) are computed the same
  way, except that capabilities which are not in effect in the current
  process or in the file's effective set will be masked out.
- The inheritable capabilities (P′i) will be the same
as the permitted capabilities.
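The rules can be checked with a few lines of C. The mask values in the example below are invented for illustration, but the bit arithmetic follows David's equations directly:

```c
#include <assert.h>
#include <stdint.h>

/* A small model of the capability-combination rules.  The 64-bit masks
 * match the patch's expanded capability sets; "bnd" is the bounding
 * set from the equations. */
typedef uint64_t capset;

struct caps { capset permitted, effective, inheritable; };

/* Compute the new process masks after exec(), given the old process
 * masks (*p), the file masks (fp, fe, fi), and the bounding set. */
static struct caps do_exec(const struct caps *p,
                           capset fp, capset fe, capset fi, capset bnd)
{
    struct caps n;
    n.permitted   = (p->inheritable & fi) | (fp & bnd);
    n.effective   = (p->inheritable & p->effective & fi) | (fp & fe & bnd);
    n.inheritable = n.permitted;
    return n;
}
```

Note how a file with empty masks (fp = fe = fi = 0) yields a process with no capabilities at all, which is exactly the classic-model behavior that David's patch relaxes for unlabeled executables.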
For the most part, these rules match the usual understanding of how
capability-based systems are supposed to work. Capabilities, in such a
system, are assigned to programs, not to users; the normal permissions bits
can then come into play to control which programs specific users can run.
David's patch differs from the usual idea of capability-based systems in
one important regard, however: how it handles programs with no capability
sets defined. On most systems, that will be almost every executable file
there is. By the rules, such programs should be treated as having an empty
inherited set, which, by the rules above, would cause them to be run with
no capabilities at all. David's patch, instead, causes these programs to
be run with the same capabilities the process had before - though the
presence of things like setuid bits can obviously change that calculation.
This interpretation breaks the classic capability-based model, but it has
the advantage of actually working on current systems.
Ted Ts'o, however, complains that this
compromise fundamentally weakens the security of the capability-based
model. He has suggested that the behavior be configurable, with each
filesystem having a flag describing how capabilities should be handled in
the absence of a set of per-file masks. A set of default capabilities for new
files could be part of this change as well.
The other complaint which has been heard is fairly predictable:
why, it is asked, should we bother with capabilities when SELinux can do
all of the same things and more? In fact, SELinux does something vaguely
similar, but with a level of indirection; it attaches labels to files, then
associates capabilities with the labels through the policy mechanism.
Anybody who has ever gotten that cheery Fedora "your filesystem must be
relabeled, please wait for a very long time" boot message knows that
keeping files and labels properly synchronized is a difficult task. There
is no real reason to believe that keeping capability masks in a correct
state would be any easier. That fact alone may continue to limit the real
usage of capabilities well into the future.
Kernel developers have written many wonderful and useful tools for
debugging and observing system behavior, such as slab allocation
debugging, lock dependency
tracking, and scheduler statistics. However, few of these tools
can be used in production systems (those are computers used to do
actual work as opposed to what I use them for, which is
compiling and testing my latest kernel patches) because of the
overhead they create, even when disabled. Whenever Dave Jones is
trying to track down a memory allocation bug in Rawhide and turns on
slab debugging, he's inundated with complaints about sluggish systems
until he turns it back off again.
We also lack decent tools to do system-wide analysis - analysis
spanning the operating system and all running processes - since most
tools are built around either a single process (e.g., strace) or a
single kernel subsystem (e.g., SCSI logging). When it comes down to
root-causing a performance problem on a production system, our hands
are pretty much tied if we can't boot into a kernel compiled with
support for debugging and tracing - and often we can't reboot, either
due to downtime restrictions or rules about certification of software
on production systems.
Today, performance analysis on production Linux systems usually ends
up being a jumble of iostat, top, sysrq-t, random /proc entries, and
unreliable oprofile results (if we're lucky enough to have oprofile).
Recently, one of my friends with extensive Linux experience upgraded
his business's production system (a computer used to do actual
work) to a more recent Linux kernel and found that performance
had suddenly dropped to an unusable level. Once he had figured out
that many Apache processes were spending a lot of time in iowait, he
had no idea where to go next and had to revert to the old kernel
without root-causing the problem. Unfortunately, the problem is only
reproducible on a system in production use - and so must be
investigated using only tools suitable for a production system.
System-wide performance analysis on present-day Linux systems remains
a black art.
The ideal tracing system would cause zero performance degradation when
it is disabled, would be dynamically enabled as needed, could collect
data over an entire system, and would be safe to use on a production
system. The paper describing DTrace,
Dynamic Instrumentation of Production Systems
, published in the USENIX 2004 Annual
, earns itself a place on the Kernel Hacker's
Bookshelf for describing the first system that lives up to this ideal.
DTrace was originally written for Solaris on both SPARC and x86, and
has recently been ported to Mac OS
X. I used DTrace extensively while I was working on Solaris and
got used to being able to answer any question I had about a system
with a few minutes of script writing. When I went back to work on
Linux and could no longer use DTrace, I felt like I went from wielding
a sharp steel katana to fumbling with dull flint tools. The only tool
for Linux that comes close is SystemTap, which has
improved significantly in the last year, though it still remains out
of the mainline kernel.
I'm not the only person who thinks DTrace is ground-breaking. DTrace
won the top award in the
Wall Street Journal's 2006 Technology Awards. MIT's Technology
Review named DTrace's lead engineer, Bryan Cantrill, as one of their 2005 TR35
winners, their list of top innovators under the age of 35. Any
company with a half-decent marketing group can generate hype, but
DTrace has garnered praise from both industry leaders and the
people knuckling down to do the real work.
The paper begins with the motivation for DTrace. For many years,
Solaris developers, like Linux developers, focused on writing tools to
help them in a kernel development environment. Then they began
venturing out into the field to analyze real-world systems - and
discovered that much of their toolkit was useless. Besides being
impossible to use on production systems, their tools were designed to
analyze processes or the kernel in isolation. They began to design a
dynamic tracing system intended from its inception for use in
production systems. It needed to be completely safe, have zero probe
effect, aggregate data over the whole system, lose a minimum of trace
data, and allow arbitrary instrumentation of any part of the system.
The architecture they came up with divides up the work of tracing into
several modular components. The first is DTrace providers. These are
kernel modules that know how to create and enable a particular class
of DTrace probes. DTrace providers include things like function
boundary tracing and virtual memory info tracing. When enabled, each
DTrace probe has one or more actions associated with it that
are executed by the DTrace framework (another kernel module) each time
the probe fires, such as "Record the timestamp" or "Get the user stack
of this thread." Actions can have predicates - conditions that must
be met for the action to be taken. This is one way to cut down on
the amount of data that would otherwise be laboriously copied out of
the kernel, only to be thrown away in post-processing. A useful
predicate might be "Only if the pid is 7893" or "Only if the first
argument is non-zero."
Probes are enabled by DTrace consumers - processes which tell the
DTrace framework what probe points and actions they want to use.
Probes can have multiple consumers. Each consumer has its own set of
per-CPU buffers for transferring trace data out of the kernel, which
is done in such a way that data is never corrupted, and the consumer
is notified if data is lost. Many tracing systems silently drop data,
which can lead to serious errors in analysis when an event is
missed without warning.
The most interesting and controversial part of DTrace is the scripting
language, "D", and its conversion to the D Intermediate Format, DIF.
Many developers don't understand why C and native machine code aren't
preferable - after all, we already know C, and we have plenty of tools
for compiling C into runnable machine code. Why reinvent the wheel?
The answer comes in two parts.
First, D was invented to quickly form questions about a running
system. A quote from the paper: "Our experience showed that D
programs were rapidly developed and edited and often written directly
on the dtrace(1M) command line." As such, it lends itself to a
script-like language that is friendly to rapid prototyping. It is also
intended primarily to gather and process data, and as such an awk or
python-like structure was more appropriate. The language used to
specify probe actions should be specialized for the task at hand,
rather than simply reusing a language designed for generic system
programming. At the same time, D is very similar to C (the paper
describes D as "a companion language to C") and C programmers can
quickly learn D.
Second, some level of emulation is needed for safety. Not all program
errors can be caught in an initial pass; things like illegal
dereferences must be caught and handled on the fly. The in-kernel DIF
emulator is vital for the level of safety needed to use DTrace on a
production system. When explaining to Linux developers the need to
prevent buggy scripts from crashing the system, often the response is,
"Well, don't do that." But imagine for a minute that you are
debugging with SystemTap on your friend's production Linux server.
When they ask you if it could possibly crash their system (which will
cost them many thousands of dollars in lost business), you don't want
to say, "Well, only if I have a bug in the scripts I am writing... on
the fly... without code review... Um, how many thousands of dollars
did you say?" A tracing system that can still cause the system to
crash in some situations will be limited to kernel developers,
students, and other people with the luxury of unscheduled downtime.
Two major components of DTrace remain: aggregations and speculative
tracing, two methods of reducing trace data at the source, allowing
far greater flexibility of tracing. The traditional method of tracing
involves generating vast quantities of data, shoveling it out to user
space as fast as possible, and then sifting through the detritus with
post-processing scripts. The downsides of this approach are data loss
(there is a limit to how quickly data can be copied out of the
kernel), limitations on what we can trace (without excessive data
loss), and expensive post-processing times. If we instead throw away
or coalesce trace data at the source, our tracing is cheaper and more
flexible.
One method of data pruning is aggregations, which coalesce a set of
data into a useful summary. For example, with only a few lines of D,
you can create an aggregation that collects a frequency distribution
of the size of mmap function calls across all processes on the system.
The alternative is copying out the entire set of trace data for each
mmap call on the system, then writing a script to extract the sizes
and calculate the distribution - which is slower, more error-prone,
and has a much higher probe effect.
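As a rough illustration of what an aggregation computes (written in C rather than D, and with no claim to match DTrace's internals), here is a power-of-two frequency distribution in the spirit of DTrace's quantize() aggregating function:

```c
#include <assert.h>
#include <stddef.h>

/* Summarize a stream of sizes into a power-of-two histogram, the way
 * an aggregation would, instead of copying every event to user space.
 * One bucket per power of two covers 32-bit sizes. */
#define NBUCKETS 33

static void quantize(const size_t *sizes, size_t n, unsigned long *buckets)
{
    for (size_t i = 0; i < n; i++) {
        unsigned int b = 0;
        size_t v = sizes[i];
        while (v >>= 1)          /* b = floor(log2(size)) */
            b++;
        buckets[b]++;            /* only the summary is retained */
    }
}
```

Each event costs a few instructions at the source; only NBUCKETS counters, not the full event stream, ever need to leave the kernel.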
Speculative tracing is even more interesting; it allows a script to
collect trace data and then decide whether to throw it away or pass it
back up to user space. This is vital for collecting data for a common
event, of which only a few events are judged "interesting" later on.
For example, if you want to trace the entire call path of all system
calls that result in a particular error code, you can speculatively
trace each system call, but throw away the data for all system calls
except the ones with the interesting error code.
If you don't have much time to read the DTrace paper, be sure to at
least read Section 9, which describes a session root-causing a
mysterious performance problem on a large server with hundreds of
users. In the end, 6 instances of a stock ticker applet were putting
so much load on the X server that killing them resulted in an increase
in system idle time of 15% (!!!). More DTrace
examples are available, linked to from the DTrace
OpenSolaris web site.
What does this mean for Linux?
Hopefully anyone who saw Dave Jones' "Why Userspace Sucks" talk at OLS
2006 will already be excited about using SystemTap to track down
problems. SystemTap is the current state-of-the-art dynamic tracing
system for Linux. It has little or no probe effect - performance
degradation when it is disabled - and it can trace events across the
entire system.
However, it still has some way to go in the areas of safety,
early data processing, and general usability. Reading the
DTrace paper will help people understand why these areas are
important. More importantly, understanding the DTrace paper will help
people understand how they can use SystemTap to solve interesting
problems.
Bored? Lonely? Download SystemTap and start investigating
performance problems today! If you're running FC4, you can even install
SystemTap using yum.
Patches and updates
Core kernel code
- Marco Costalba: qgit-1.5.
(September 10, 2006)
Page editor: Jonathan Corbet