The current 2.6 development kernel is 2.6.29-rc7,
released on March 3. It
contains a long list of fixes, new drivers for Atheros L1C gigabit Ethernet
adapters and FireDTV IEEE1394 adapters, and some out-of-space handling
improvements for the btrfs filesystem. See the
for the details.
There have been no stable 2.6 updates released over the last week.
Kernel development news
HAHAHAHHAAAA!!!! My evil scheme is working! I post some sub-optimal
code, and have others do the nasty work for me!!!!
Oh, did I just say that out loud?
-- Steven Rostedt
Not only that, I will also sue you for my patent on that algorithm.
-- Linus Torvalds
+ * The pifutex has an owner, make sure it's us, if not complain
+ * to userspace.
+ * FIXME_LATER: handle this gracefully
+ pid = curval & FUTEX_TID_MASK;
+ if (pid && pid != task_pid_vnr(current))
+ return -EMORON;
-- Darren "graceful" Hart
(thanks to Bert
Yup, there's lots of crappy code in the tree, and it is regrettable
that maintainers continue to go ahead and merge that crappy code.
There's no easy fix for this - you need to be aware of what is right
and what is wrong, but you cannot look at existing code to determine
-- Andrew Morton
Anybody who travels with a suspended laptop has likely run into the irritating problem of NetworkManager trying to reconnect to the old network - the one which was left behind before getting onto the airplane. It seems that Dan Williams has figured out the problem
and queued a set of patches to fix it. "See, drivers timestamp wifi networks they know about. That way you can figure out if the network was last seen a second ago, 7 seconds ago, or so long ago that it's dead to me. But they all use an kernel counter called jiffies to do that. And jiffies doesn't increment across suspend/resume. See where I'm going with this?" Your editor plans to buy Dan a beer at the next opportunity.
Felipe Balbi recently posted a
driver called twl4030-pwrbutton, which generates input events when
somebody hits a power button connected through a twl4030 i2c controller.
It is, in many ways, a standard driver; Felipe certainly did not expect to
see a long and acrimonious discussion result from its posting. But that's
what ensued. Over the course of this discussion, the participants were
able to outline some problems with how interrupts are handled on Linux
systems, along with a potential solution.
Things started when Andrew Morton questioned the following bit of code,
found in the driver's interrupt handler:
/* WORKAROUND for lockdep forcing IRQF_DISABLED on us, which
* we don't want and can't tolerate. Although it might be
* friendlier not to borrow this thread context...
Workarounds of this variety do tend to catch the attention of diligent
reviewers. Understanding this one requires just a bit of background.
Back in the Good Old Days, the Linux kernel had "fast" and "slow" interrupt
handlers; the main difference between the two is that "fast" handlers ran
with further interrupts disabled, while "slow" handlers were run with
interrupts enabled. Over time, the distinction between the two types has
faded; faster, smarter hardware and greater use of software interrupts and
tasklets have made the execution time of most well-written interrupt handlers
essentially irrelevant. So most driver authors do not even think much
about whether they are writing a "fast" or a "slow" handler, even though
the distinction still exists. Unless a driver passes the
IRQF_DISABLED flag when requesting its interrupt line, its
interrupt handler will be called with interrupts enabled.
"Lockdep" is the kernel lock
validator, which, when enabled, creates a detailed model of how locks
are used in the kernel. This model can be used to find potential deadlocks
and other problems. According to Ingo
Molnar, lockdep has been quite effective:
You might also have noticed that over the past 2-3 years the term
"hard lockup" in regression reports has gone down by about an order
of magnitude - and much of that can be attributed to the lockdep
coverage we have in place.
It turns out, though, that the lockdep developers made one significant,
simplifying assumption: all interrupt handlers were to be invoked with
interrupts disabled. When lockdep is enabled, in fact, the generic
interrupt handling layer forces this condition, regardless of whether any
specific handler was registered with the IRQF_DISABLED flag.
Lockdep has worked this way for some time, and complaints have been
scarce. But, as can be seen from the patch cited above, "scarce" is not
the same as "nonexistent."
Drivers for i2c-connected devices operate under a number of interesting
constraints, mostly forced by the fact that the i2c "bus" is, in reality, a
slow, two-wire serial interface. So even "fast" operations like reading a
device register are, in fact, slow on i2c devices; they are slow enough
that the process involved should sleep while waiting for the result. That
is a bit of a problem for i2c interrupt handlers, since they need to access
device registers, but they cannot sleep.
The result is that a number of i2c drivers have implemented what is, in
effect, a threaded interrupt handler mechanism. The "real" interrupt
handler simply masks the interrupt and wakes up the thread, which then does
the real work of talking to the device. In the case of the twl4030 driver,
this threaded implementation has been done in a relatively formal manner in
which the device interrupt handlers are invoked - from within a
special-purpose kernel thread - by way of the generic IRQ layer itself.
These threaded handlers do not expect to run with interrupts disabled -
indeed, they cannot run that way - but the generic IRQ code will, when
lockdep is enabled, turn off interrupts anyway. That is why this patch
takes pains to turn them back on when lockdep is being used.
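The mask-and-defer split used by these drivers can be modelled in plain C. What follows is a single-threaded, user-space sketch under stated assumptions, not the twl4030 code: `fake_dev`, `hard_handler()`, `thread_step()`, and `slow_bus_read()` are hypothetical names, and the "thread" is simply a function which the caller invokes later.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state; none of these names come from the driver. */
struct fake_dev {
    bool irq_masked;       /* line masked by the hard handler */
    bool thread_woken;     /* work is pending for the handler thread */
    bool in_hardirq;       /* true while the hard handler is running */
    unsigned status_reg;   /* stands in for a register behind the slow bus */
    unsigned events_seen;  /* events reported by the threaded part */
};

/* Stand-in for an i2c register read: it may sleep, so it must never be
 * called from hard interrupt context. */
static unsigned slow_bus_read(struct fake_dev *d)
{
    assert(!d->in_hardirq);
    return d->status_reg;
}

/* The "real" interrupt handler: mask the line, wake the thread, and
 * touch nothing that lives behind the slow bus. */
static void hard_handler(struct fake_dev *d)
{
    d->in_hardirq = true;
    d->irq_masked = true;
    d->thread_woken = true;
    d->in_hardirq = false;
}

/* One pass of the handler thread; returns true if there was work to do. */
static bool thread_step(struct fake_dev *d)
{
    if (!d->thread_woken)
        return false;
    d->thread_woken = false;
    if (slow_bus_read(d) & 1u)
        d->events_seen++;
    d->irq_masked = false;  /* unmask once the device has been serviced */
    return true;
}
```

The only point of the model is the ordering: the slow register access happens after the hard handler has returned, in a context where sleeping (and running with interrupts enabled) is legitimate.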
Peter Zijlstra's response to this discussion was to post a patch forcing
IRQF_DISABLED for all drivers. His position is that no
interrupt handlers should be run with interrupts enabled. Doing so invites
kernel stack overruns if too many nested interrupts come in; it also, he
says, encourages the notion that it's OK for interrupt handlers to be
slow. Additionally, he says, drivers must already be able to run their
handlers with interrupts disabled, since another driver may disable
interrupts on a shared interrupt line. So, he says, it makes no sense to
"fix" lockdep for handlers which want interrupts to be enabled; instead,
the always-disabled assumption built into lockdep should be made part of
the system as a whole.
The response to this patch was somewhat sympathetic, at least in a general
sense. Making IRQF_DISABLED be the default situation makes sense
for most devices. But there really are drivers which need their interrupt
handlers to run with
interrupts enabled; IDE drivers using programmed I/O are one example. If
interrupt handlers are given exclusive control over the system, other
devices will see unacceptable latencies and start to fail operations or
drop data. So any change of this nature must be done carefully, and it
must remain possible to run some handlers with interrupts enabled.
And, of course, forcing IRQF_DISABLED does nothing to fix the
The real solution is to have general support for threaded interrupt
handlers. The realtime preemption tree has supported threaded handlers for
quite some time; more recently, a
variant of the threaded handlers patch was posted for mainline
consideration. There are a lot of advantages to threaded handlers beyond
their applicability to the problems discussed here; threaded handlers can
improve latencies, allow interrupt handlers to be prioritized, and,
someday, perhaps allow the removal of software interrupts altogether. So
it seems like there would be value in getting this code merged.
To that end, Thomas Gleixner has come back with a new version of the threaded
handlers patch. The API looks much like it did in the previous
posting, though it could change in response to some review comments made this time around.
In essence, this infrastructure allows a driver to register a "quick
handler" to acknowledge (and mask) an interrupt; there would also be a
regular handler which could be called in either hard interrupt or process
context, depending on the quick handler's return value. The API allows
drivers to continue to work unmodified, or they can be converted over to
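The dispatch rule described here, where the quick handler's return value decides the context in which the regular handler runs, might be sketched as follows. This is a user-space model only: the enum values, `struct irq_action_model`, `dispatch()`, and `run_thread()` are hypothetical names invented for illustration and should not be read as the API from the posted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return values for a quick handler. */
enum quick_result { QUICK_NOT_MINE, QUICK_RUN_NOW, QUICK_RUN_THREADED };
enum run_context  { CTX_NOT_RUN, CTX_HARDIRQ, CTX_THREAD };

struct irq_action_model {
    enum quick_result (*quick)(void *dev);  /* NULL for an unconverted driver */
    void (*handler)(void *dev, enum run_context ctx);
    void *dev;
    int thread_pending;
};

/* Model of hard-interrupt-time dispatch. Returns the context in which
 * the regular handler ran, or will run. */
static enum run_context dispatch(struct irq_action_model *a)
{
    enum quick_result r;

    assert(a != NULL && a->handler != NULL);
    /* A driver without a quick handler behaves as it always has. */
    r = a->quick ? a->quick(a->dev) : QUICK_RUN_NOW;
    switch (r) {
    case QUICK_RUN_NOW:
        a->handler(a->dev, CTX_HARDIRQ);
        return CTX_HARDIRQ;
    case QUICK_RUN_THREADED:
        a->thread_pending = 1;  /* the thread picks this up later */
        return CTX_THREAD;
    default:
        return CTX_NOT_RUN;
    }
}

/* Model of the per-interrupt thread: runs the handler in process context. */
static void run_thread(struct irq_action_model *a)
{
    if (a->thread_pending) {
        a->thread_pending = 0;
        a->handler(a->dev, CTX_THREAD);
    }
}

/* Example callbacks: always defer, and record where the handler ran. */
static enum quick_result always_defer(void *dev)
{
    (void)dev;
    return QUICK_RUN_THREADED;
}

static void record_ctx(void *dev, enum run_context ctx)
{
    *(enum run_context *)dev = ctx;
}
```

In this model an unconverted driver is just one with no quick handler, which is how the sketch expresses the "continue to work unmodified" property.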
David Brownell, the leading critic of lockdep's behavior and the idea of
disabling interrupts for all handlers, seems to agree that the threaded
interrupt handler infrastructure should be able to solve the i2c problem.
All threaded handlers will, by necessity, run with interrupts enabled, so
the primary difficulty goes away. David would like to see some changes
made to better support the chaining of handlers that is typically needed in
such situations, but it's not clear how many changes are really needed.
In summary, threaded interrupt handlers seem likely to be the next
technology to be merged from the realtime preemption tree. Just when that
might happen remains to be seen, though. The request for some API changes
may well slow things down a bit; there were also requests for example
implementations of threaded handlers with more types of drivers.
Satisfying those requests quickly enough to allow the code to be reviewed
before the 2.6.30 merge window opens could be a bit of a challenge. So
this code might just have to wait for one more development cycle; it would
be surprising if it were to take longer than that, though.
Once upon a time, Xen was the hot virtualization story. The Xen developers
had a working solution for Linux - using free software - well ahead of
anybody else, and Xen looked like the future of virtualization on Linux.
Much venture capital chased after that story, and distributors raced to be
the first to offer Xen-based virtualization. But, along the
way, Xen seemed to get lost. The XenSource developers often showed
little interest in getting their code into the mainline, and attempts by others
to get that job done ran into no end of obstacles. So Xen stayed out of
the mainline for years; the first public Xen release happened in 2003, but
the core Xen code was only merged for 2.6.23 in
In the meantime, KVM showed up and grabbed much of the attention. Its
path into the mainline was almost blindingly fast, and many kernel
developers were less than shy about expressing their preference for the KVM
approach. More recently, Red Hat has made things more formal with its announcement
of a "virtualization agenda" based on KVM. Meanwhile, lguest showed up as an easy
introduction for those who want to play with virtualization code.
The Xen story is a classic example of the reasons behind the "upstream
first" policy, which states that code should be merged into the mainline
before being shipped to customers. Distributors rushed to ship Xen,
then found themselves supporting out-of-tree code which, often, was not
well supported by its creators. In particular, published releases of Xen
often only supported relatively old kernels, creating lots of work for
distributors wanting to ship something more current.
Now at least some of those distributors
are moving on to other solutions, and high-level kernel developers are
questioning whether, at this point, it's worth merging the remaining Xen
code at all.
All told, Xen looks to be on its last legs.
Or, perhaps, the rumors of Xen's demise have been slightly exaggerated.
The code in the mainline implements the Xen "DomU" concept - an
unprivileged domain with no access to the hardware. A full Xen
implementation requires more than that, though; there is the Xen
hypervisor (which is GPL-licensed) and the kernel-based "Dom0" code. Dom0
is the first domain started by the hypervisor; it is typically run with
more privileges than any other Xen guest. The purpose of Dom0 is to
carefully hand out privileges to other Xen domains, providing access to
hardware, network interfaces, etc. as set by administrative policy. Actual
implementations of Xen must include the Dom0 code - currently a large body
of out-of-tree kernel code.
Jeremy Fitzhardinge would like to change that situation. So he has posted
a core Xen Dom0 patch set
with the goal of getting it merged into the 2.6.30 release. Among the
review comments was this question from Andrew Morton:
I hate to be the one to say it, but we should sit down and work out
whether it is justifiable to merge any of this into Linux. I think
it's still the case that the Xen technology is the "old" way and
that the world is moving off in the "new" direction, KVM?
In three years time, will we regret having merged this?
The questions asked by Andrew were, essentially, (1) what code (beyond
the current posting) is required to finish the job, and (2) is there
really any reason to do that? The answer
to the first question was "another 2-3 similarly sized series to get
everything so that you can boot dom0 out of the box." Then there are
various other bits which may not ever make it into the mainline. But, says
Jeremy, getting the core into the mainline would shrink the out-of-tree
patches carried by distributors and generally make life easier for
everybody. For the second question, Jeremy responds:
Despite all the noise made about kvm in kernel circles, Xen has a
large and growing installed base. At the moment its all running on
massive out-of-tree patches, which doesn't make anyone happy. It's
best that it be in the mainline kernel. You know, like we argue
for everything else.
Beyond that, Jeremy is arguing that Xen still has a reason to exist. Its
design differs significantly from that of KVM in a number of ways; see this message for an excellent description of
those differences. As a result, Xen is useful in different situations.
Some of the advantages claimed by Jeremy include:
- Xen's approach to page tables eliminates the need for shadow page
tables or page table nesting in the guests; that, in turn, allows for
significantly better performance for many workloads.
- The Xen hypervisor is lightweight, and can be run standalone; the KVM
hypervisor is, instead, the Linux kernel. It seems that some vendors
(HP and Dell are named) are shipping a Xen hypervisor in the firmware
of many of their systems; that's the code behind the "instant on"
feature, among other things.
- Xen's paravirtualization support allows it to work with hardware which
does not support full virtualization. KVM, instead, needs hardware support.
- The separation between the hypervisor, Dom0, and DomU makes security
validation easier. The separation between domains also allows for
wild configurations with each device being driven by a separate
domain; one might think of this kind of thing as a sort of heavyweight
KVM's advantages, instead, take the form of relative simplicity, ease of
use, full access to contemporary kernel features, etc. By Jeremy's
reasoning, there is a place for both systems in Linux.
The relative silence at the end of the discussion suggests that Jeremy has
made his case fairly well. Mistakes may have been made in Xen's history,
but it is a project which remains alive, and which has clear reasons to
exist. Your editor predicts that the Dom0 code will find little opposition
at the opening of the 2.6.30 merge window.
A kernel patch that reduces memory use, while providing a performance increase
of roughly a factor of three, is generally seen as a good thing. But, when
there is another, more-or-less equivalent (but much faster) way
to perform that action, it
may appear to be an unnecessary optimization. A recent patch to the ftrace_printk() function
has those characteristics, but the ability to get such a speed increase,
even in something that is merely convenient rather than essential, seems likely to
trump the concerns about the necessity.
Lai Jiangshan proposed adding a binary version of ftrace_printk()
last December; Frederic Weisbecker has picked up the patches and
submitted them for inclusion into ftrace. The basic idea is that, rather than
converting the arguments to strings as specified in a format
string, ftrace_bprintk() would defer the actual
conversion until the trace output is read by user space. Instead it would
put the binary values into the ring buffer, along with a pointer to the
format string. When the trace data is read from debugfs, the
format string and binary data are used to construct the output.
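The deferred-formatting idea can be illustrated with a small user-space sketch. It is not the ftrace implementation: `struct bprint_entry`, `struct ring`, `trace_bprint()`, and `render_entry()` are hypothetical names, and the model assumes every format takes exactly one `unsigned long long` argument.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RING_SIZE 8

/* One binary trace record: a pointer to the format plus the raw value. */
struct bprint_entry {
    const char *fmt;         /* must outlive the entry, e.g. a string literal */
    unsigned long long arg;  /* stored as-is; no conversion at trace time */
};

struct ring {
    struct bprint_entry slot[RING_SIZE];
    unsigned head;           /* total number of entries written so far */
};

/* Fast path, run at the tracepoint: two stores and no string handling. */
static void trace_bprint(struct ring *r, const char *fmt,
                         unsigned long long arg)
{
    assert(r != NULL && fmt != NULL);
    struct bprint_entry *e = &r->slot[r->head++ % RING_SIZE];
    e->fmt = fmt;
    e->arg = arg;
}

/* Slow path, run only when a reader asks for text. The stored format
 * must consume exactly one unsigned long long (for example "%llu"). */
static int render_entry(const struct ring *r, unsigned idx,
                        char *out, size_t len)
{
    const struct bprint_entry *e = &r->slot[idx % RING_SIZE];
    return snprintf(out, len, e->fmt, e->arg);
}
```

The cost of `snprintf()` is paid once per read rather than once per event, and the record size no longer depends on how many digits the value happens to have.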
Ingo Molnar liked the idea, but was unhappy
with the implementation that duplicated much of the code in
vsnprintf() into two new functions. He suggested that it should
be possible to pull out the common code: "We should try _much_ harder
at unifying these functions before
giving up and duplicating them." Weisbecker agreed, which
eventually resulted in a patch that breaks
out the format string decoding as a separate function.
Molnar also asked for some performance numbers, which
Weisbecker provided as part of his patch. He reported the memory and time
difference when adding:
ftrace_printk("This is the timer interrupt: %llu", jiffies_64);
to the timer interrupt. The memory used was less than half (16 versus 39
bytes per entry), and the time savings were also significant:
After some time running on low load (no X, no really active processes):
ftrace_printk: duration average: 2044 ns, avg of bytes stored per entry: 39
ftrace_bprintk: duration average: 1426 ns, avg of bytes stored per entry: 16
Higher load (started X and launched a cat running on an X console looping on
ftrace_printk: duration average: 8812 ns
ftrace_bprintk: duration average: 2611 ns
Andrew Morton was a bit puzzled by the
intent of the patch: "Trying to make something which is inherently
slow run slightly faster seems...odd." But Molnar explained why it makes sense:
The _fastest_ way of tracing is obviously to know about the
precise argument layout and having a specific C based tracepoint
stub that directly stuffs that data into the ring buffer. Most
tracepoints are of such nature.
That does not remove the ease of use of ad-hoc printk-alike
tracepoints though, and speeding them up 3-fold is a [worthwhile]
Breaking out the format string handling into its own
format_decode() function was mostly met with approval, except that
the argument list is rather ugly:
int format_decode(const char *fmt, enum format_type *type,
int *flags, int *field_width, int *base,
int *precision, int *qualifier)
Linus Torvalds suggested
using a struct
to contain the various values decoded from the format specifier, passing
a pointer to that into the function.
Weisbecker agreed, and added that into his patches, but he didn't quite go
Torvalds also thought that the various helper functions to handle specific
format types (i.e. number(), pointer(), string(), etc.)
should get passed a struct printf_spec pointer as well. As
he points out: "When
cleaning up, let's just do it properly." Once again, Weisbecker was
quick to agree; he plans to respin the patches addressing these and other
comments in the near future.
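To show what bundling the decoded values buys, here is a simplified user-space sketch. The field names in `struct printf_spec` simply mirror the seven-argument prototype quoted above; the layout, the `FMT_*` and `FLAG_*` constants, and the `decode_spec()` function are assumptions made for illustration, and the decoder handles only a small subset of what vsnprintf() accepts.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum format_type { FMT_INVALID, FMT_CHAR, FMT_STR, FMT_INT, FMT_UINT,
                   FMT_PERCENT };

#define FLAG_ZEROPAD 1
#define FLAG_LEFT    2

/* Fields mirror the long argument list; the exact layout is a guess. */
struct printf_spec {
    enum format_type type;
    int flags;
    int field_width;  /* -1 if not given */
    int base;
    int precision;    /* -1 if not given */
    int qualifier;    /* 0, 'l', or 'L' standing for "ll" */
};

/* Parse a run of decimal digits, advancing *p past them. */
static int read_int(const char **p)
{
    int v = 0;

    while (isdigit((unsigned char)**p))
        v = v * 10 + (*(*p)++ - '0');
    return v;
}

/* Decode one specifier starting at the '%' in fmt; returns the number
 * of characters consumed. Only a subset of printf syntax is handled. */
static int decode_spec(const char *fmt, struct printf_spec *spec)
{
    const char *p = fmt;

    assert(fmt != NULL && spec != NULL && *fmt == '%');
    spec->type = FMT_INVALID;
    spec->flags = 0;
    spec->field_width = -1;
    spec->precision = -1;
    spec->base = 10;
    spec->qualifier = 0;

    for (p++; ; p++) {
        if (*p == '0')
            spec->flags |= FLAG_ZEROPAD;
        else if (*p == '-')
            spec->flags |= FLAG_LEFT;
        else
            break;
    }
    if (isdigit((unsigned char)*p))
        spec->field_width = read_int(&p);
    if (*p == '.') {
        p++;
        spec->precision = read_int(&p);
    }
    if (*p == 'l') {
        spec->qualifier = 'l';
        if (*++p == 'l') {
            spec->qualifier = 'L';
            p++;
        }
    }
    switch (*p) {
    case 'c': spec->type = FMT_CHAR; break;
    case 's': spec->type = FMT_STR; break;
    case 'd': spec->type = FMT_INT; break;
    case 'u': spec->type = FMT_UINT; break;
    case 'x': spec->type = FMT_UINT; spec->base = 16; break;
    case '%': spec->type = FMT_PERCENT; break;
    default:  return (int)(p - fmt);  /* type stays FMT_INVALID */
    }
    return (int)(p + 1 - fmt);
}
```

With everything in one structure, a helper along the lines of number() or string() can take a single spec argument instead of a list of loose integers, which is the shape of the cleanup Torvalds asked for.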
In addition, because ftrace_bprintk() is a drop-in replacement for
ftrace_printk(), Weisbecker proposes eliminating the current code in favor
of the faster version. Molnar, at least, advocates that outcome:
Well, ftrace_bprintk() seems to be a worthy and transparent
replacement for ftrace_printk() to me. I.e. lets just use this
as the new implementation for ftrace_printk().
While it is a minor upgrade to a relatively minor kernel subsystem, it does
provide some impressive performance gains. As a bonus, the review process
has resulted in some clean-up that was probably overdue. While there is
validity to the argument that it is not really required, it is
not very intrusive, nor very large. In the end, that is likely to be
enough to see it eventually end up in the mainline.
As the 2.6.29 kernel development cycle draws toward its eventual close, it
is appropriate to look back at the internal API changes which have been
made. The following list cannot possibly be exhaustive, but, hopefully, it
captures the major points.
- The massive task credentials
patch set has been merged. This code reorganizes the handling of
process credentials (user ID, capabilities, etc.). One of the
immediate implications of this change is direct references to
credential-oriented fields in the task structure need to be changed;
for example, current->user->uid becomes
current_uid(). See Documentation/credentials.txt for a
description of the new API.
- The ftrace code has seen a lot of internal changes. The function
tracing feature has seen a number of improvements, and the developers
have added mechanisms to profile the behavior of if statements,
provide function call graphs,
obtain user-space stack traces, and
follow CPU power-state transitions.
- Most of the callback functions/methods associated with the
net_device structure have been moved out of that structure
and into the new struct net_device_ops. In-tree drivers
have been converted to the new API.
- The priv field has been removed from struct
net_device; drivers should use netdev_priv() instead.
- The generic PHY layer now has power management support. To that end,
two new methods - suspend() and resume() - have been
added to struct phy_driver.
- The networking layer now supports large receive offload (or
"generic receive offload") operation.
- The NAPI API has been cleaned up somewhat; in particular, functions
like netif_rx_schedule(), netif_rx_schedule_prep(),
and netif_rx_complete() have lost the unneeded struct
- The poll() file operation is now allowed to sleep; see this article for more
information on this change.
- The CPU mask mechanism, used to represent sets of processors in the
system, is in the middle of being massively reworked. The problem is
that CPU masks were often put on the stack, but, as the number of
processors grows, the stack lacks room for the mask. The new API is designed to
get these masks off the stack, and to guard against anybody ever
trying to put one back. See this
posting by Rusty Russell for details on this work.
- An infrastructure for
asynchronous function calls has been merged. This code is still a
work in progress, though, and, for 2.6.29, it will not be activated in
the absence of the fastboot command-line parameter.
- The exclusive I/O memory
allocation functions have been merged.
- There is a new synchronous hash interface called "shash." It
simplifies the use of synchronous hash operations while allowing the
same tfm to be used simultaneously in different threads. All in-tree
users have been switched to the new API.
- The hrtimer code has been simplified with the removal of variable
modes for callback functions. All processing is now done in hardirq context.
- A new set of LSM hooks has been added; these support pathname-based
security operations. With the merging of these hooks, one major
obstacle to the inclusion of security modules like AppArmor and TOMOYO
has been removed.
- The kernel will now refuse to build with GCC 4.1.0 or 4.1.1; those
versions have unfortunate bugs which prevent the building of a working
kernel. Versions 3.0 and 3.1 have also been deemed to be too old and
will not be supported in 2.6.29.
- Video4Linux drivers now use a separate v4l2_file_operations
structure to hold their VFS-like callbacks. The prototypes of a
number of these functions have been changed to remove the
- Video4Linux2 has also acquired a new "subdevice" concept, meant to
reflect the fact that video "devices" tend to be, in reality, a set of
cooperating devices. See the new
document for a description of how this mechanism works.
- Two new functions - stop_machine_create() and
stop_machine_destroy() - allow the independent creation of
the threads used by stop_machine(). That, in turn, lets
those threads be created before trying to actually stop the machine,
making that operation more resistant to failure.
- The exports for a number of SUNRPC functions have been changed to
- The internal MTD (memory technology device) API has seen significant
changes aimed at supporting larger devices (those requiring 64-bit
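The reasoning behind the CPU mask rework in the list above can be made concrete with a user-space analogy. The sketch below is not the kernel's cpumask API; `struct cpu_set_model` and its helper functions are hypothetical names. It shows a processor set whose bits live on the heap and are sized at run time, so that only a small handle ever sits on the stack.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical heap-backed CPU set: the handle is small and fixed in
 * size, however many processors the set has to describe. */
struct cpu_set_model {
    unsigned int nr_cpus;
    unsigned long *bits;
};

/* Allocate a zeroed set for nr_cpus processors; false on failure. */
static bool cpu_set_alloc(struct cpu_set_model *s, unsigned int nr_cpus)
{
    size_t words = (nr_cpus + BITS_PER_WORD - 1) / BITS_PER_WORD;

    assert(s != NULL && nr_cpus > 0);
    s->bits = calloc(words, sizeof(unsigned long));
    s->nr_cpus = s->bits ? nr_cpus : 0;
    return s->bits != NULL;
}

static void cpu_set_free(struct cpu_set_model *s)
{
    free(s->bits);
    s->bits = NULL;
    s->nr_cpus = 0;
}

static void cpu_set_add(struct cpu_set_model *s, unsigned int cpu)
{
    assert(cpu < s->nr_cpus);
    s->bits[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

static bool cpu_set_test(const struct cpu_set_model *s, unsigned int cpu)
{
    assert(cpu < s->nr_cpus);
    return (s->bits[cpu / BITS_PER_WORD] >> (cpu % BITS_PER_WORD)) & 1UL;
}
```

A fixed-size bitmap for 4096 processors would occupy 512 bytes of stack in every function that declares one; the handle above stays the size of an integer and a pointer, at the price of an allocation which can fail and must be checked.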
Developers interested in the history of kernel API changes can look at the LWN 2.6 API changes page. After a
period of unfortunate neglect, this page has been made current once again;
your editor promises to be a bit more diligent about maintaining this page
in the future.
Page editor: Jonathan Corbet