Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.29-rc7, released on March 3. It contains a long list of fixes, new drivers for Atheros L1C gigabit Ethernet adapters and FireDTV IEEE1394 adapters, and some out-of-space handling improvements for the btrfs filesystem. See the long-format changelog for the details.
There have been no stable 2.6 updates released over the last week.
Kernel development news
Quotes of the week
Oh, did I just say that out loud?
+       /*
+        * The pifutex has an owner, make sure it's us, if not complain
+        * to userspace.
+        * FIXME_LATER: handle this gracefully
+        */
+       pid = curval & FUTEX_TID_MASK;
+       if (pid && pid != task_pid_vnr(current))
+               return -EMORON;
There's no easy fix for this - you need to be aware of what is right and what is wrong, but you cannot look at existing code to determine this.
Making NetworkManager work with suspend/resume
Anybody who travels with a suspended laptop has likely run into the irritating problem of NetworkManager trying to reconnect to the old network - the one which was left behind before getting onto the airplane. It seems that Dan Williams has figured out the problem and queued a set of patches to fix it. "See, drivers timestamp wifi networks they know about. That way you can figure out if the network was last seen a second ago, 7 seconds ago, or so long ago that it's dead to me. But they all use an kernel counter called jiffies to do that. And jiffies doesn't increment across suspend/resume. See where I'm going with this?" Your editor plans to buy Dan a beer at the next opportunity.
Interrupts, threads, and lockdep
Felipe Balbi recently posted a driver called twl4030-pwrbutton, which generates input events when somebody hits a power button connected through a twl4030 i2c controller. It is, in many ways, a standard driver; Felipe certainly did not expect to see a long and acrimonious discussion result from its posting. But that's what ensued. Over the course of this discussion, the participants were able to outline some problems with how interrupts are handled on Linux systems, along with a potential solution.
Things started when Andrew Morton questioned the following bit of code, found in the driver's interrupt handler:
#ifdef CONFIG_LOCKDEP
/* WORKAROUND for lockdep forcing IRQF_DISABLED on us, which
* we don't want and can't tolerate. Although it might be
* friendlier not to borrow this thread context...
*/
local_irq_enable();
#endif
Workarounds of this variety do tend to catch the attention of diligent reviewers. Understanding this one requires just a bit of background.
Back in the Good Old Days, the Linux kernel had "fast" and "slow" interrupt handlers; the main difference between the two is that "fast" handlers ran with further interrupts disabled, while "slow" handlers were run with interrupts enabled. Over time, the distinction between the two types has faded; faster, smarter hardware and greater use of software interrupts and tasklets have made the execution time of most well-written interrupt handlers essentially irrelevant. So most driver authors do not even think much about whether they are writing a "fast" or a "slow" handler, even though the distinction still exists. Unless a driver passes the IRQF_DISABLED flag when requesting its interrupt line, its interrupt handler will be called with interrupts enabled.
"Lockdep" is the kernel lock validator, which, when enabled, creates a detailed model of how locks are used in the kernel. This model can be used to find potential deadlocks and other problems. According to Ingo Molnar, lockdep has been quite effective:
It turns out, though, that the lockdep developers made one significant, simplifying assumption: all interrupt handlers were to be invoked with interrupts disabled. When lockdep is enabled, in fact, the generic interrupt handling layer forces this condition, regardless of whether any specific handler was registered with the IRQF_DISABLED flag. Lockdep has worked this way for some time, and complaints have been scarce. But, as can be seen from the patch cited above, "scarce" is not the same as "nonexistent."
Drivers for i2c-connected devices operate under a number of interesting constraints, mostly forced by the fact that the i2c "bus" is, in reality, a slow, two-wire serial interface. So even "fast" operations like reading a device register are, in fact, slow on i2c devices; they are slow enough that the process involved should sleep while waiting for the result. That is a bit of a problem for i2c interrupt handlers, since they need to access device registers, but they cannot sleep.
The result is that a number of i2c drivers have implemented what is, in effect, a threaded interrupt handler mechanism. The "real" interrupt handler simply masks the interrupt and wakes up the thread, which then does the real work of talking to the device. In the case of the twl4030 driver, this threaded implementation has been done in a relatively formal manner in which the device interrupt handlers are invoked - from within a special-purpose kernel thread - by way of the generic IRQ layer itself. These threaded handlers do not expect to run with interrupts disabled - indeed, they cannot run that way - but the generic IRQ code will, when lockdep is enabled, turn off interrupts anyway. That is why this patch takes pains to turn them back on when lockdep is being used.
Peter Zijlstra's response to this discussion was to post a patch forcing IRQF_DISABLED for all drivers. His position is that no interrupt handlers should be run with interrupts enabled. Doing so invites kernel stack overruns if too many nested interrupts come in; it also, he says, encourages the notion that it's OK for interrupt handlers to be slow. Additionally, he says, drivers must already be able to run their handlers with interrupts disabled, since another driver may disable interrupts on a shared interrupt line. So, he says, it makes no sense to "fix" lockdep for handlers which want interrupts to be enabled; instead, the always-disabled assumption built into lockdep should be made part of the system as a whole.
The response to this patch was somewhat sympathetic, at least in a general sense. Making IRQF_DISABLED be the default situation makes sense for most devices. But there really are drivers which need their interrupt handlers to run with interrupts enabled; IDE drivers using programmed I/O are one example. If those interrupt handlers are given exclusive control over the system, other devices will see unacceptable latencies and start to fail operations or drop data. So any change of this nature must be done carefully, and it must remain possible to run some handlers with interrupts enabled.
And, of course, forcing IRQF_DISABLED does nothing to fix the twl4030 problem.
The real solution is to have general support for threaded interrupt handlers. The realtime preemption tree has supported threaded handlers for quite some time; more recently, a variant of the threaded handlers patch was posted for mainline consideration. There are a lot of advantages to threaded handlers beyond their applicability to the problems discussed here; threaded handlers can improve latencies, allow interrupt handlers to be prioritized, and, someday, perhaps allow the removal of software interrupts altogether. So it seems like there would be value in getting this code merged.
To that end, Thomas Gleixner has come back with a new version of the threaded handlers patch. The API looks much like it did in the previous posting, though it could change in response to some review comments made this time around. In essence, this infrastructure allows a driver to register a "quick handler" to acknowledge (and mask) an interrupt; there would also be a regular handler which could be called in either hard interrupt or process context, depending on the quick handler's return value. The API allows drivers to continue to work unmodified, or they can be converted over to threaded handlers.
David Brownell, the leading critic of lockdep's behavior and the idea of disabling interrupts for all handlers, seems to agree that the threaded interrupt handler infrastructure should be able to solve the i2c problem. All threaded handlers will, by necessity, run with interrupts enabled, so the primary difficulty goes away. David would like to see some changes made to better support the chaining of handlers that is typically needed in such situations, but it's not clear how many changes are really needed.
In summary, threaded interrupt handlers seem likely to be the next technology to be merged from the realtime preemption tree. Just when that might happen remains to be seen, though. The request for some API changes may well slow things down a bit; there were also requests for example implementations of threaded handlers with more types of drivers. Satisfying those requests quickly enough to allow the code to be reviewed before the 2.6.30 merge window opens could be a bit of a challenge. So this code might just have to wait for one more development cycle; it would be surprising if it were to take longer than that, though.
Xen: finishing the job
Once upon a time, Xen was the hot virtualization story. The Xen developers had a working solution for Linux - using free software - well ahead of anybody else, and Xen looked like the future of virtualization on Linux. Much venture capital chased after that story, and distributors raced to be the first to offer Xen-based virtualization. But, along the way, Xen seemed to get lost. The XenSource developers often showed little interest in getting their code into the mainline, and attempts by others to get that job done ran into no end of obstacles. So Xen stayed out of the mainline for years; the first public Xen release happened in 2003, but the core Xen code was only merged for 2.6.23 in October, 2007.
In the meantime, KVM showed up and grabbed much of the attention. Its path into the mainline was almost blindingly fast, and many kernel developers were less than shy about expressing their preference for the KVM approach. More recently, Red Hat has made things more formal with its announcement of a "virtualization agenda" based on KVM. Meanwhile, lguest showed up as an easy introduction for those who want to play with virtualization code.
The Xen story is a classic example of the reasons behind the "upstream first" policy, which states that code should be merged into the mainline before being shipped to customers. Distributors rushed to ship Xen, then found themselves supporting out-of-tree code which, often, was not well supported by its creators. In particular, published releases of Xen often only supported relatively old kernels, creating lots of work for distributors wanting to ship something more current. Now at least some of those distributors are moving on to other solutions, and high-level kernel developers are questioning whether, at this point, it's worth merging the remaining Xen code at all.
All told, Xen looks to be on its last legs. Or, perhaps, the rumors of Xen's demise have been slightly exaggerated.
The code in the mainline implements the Xen "DomU" concept - an unprivileged domain with no access to the hardware. A full Xen implementation requires more than that, though; there is the hypervisor itself (which is GPL-licensed) and the kernel-based "Dom0" code. Dom0 is the first domain started by the hypervisor; it is typically run with more privileges than any other Xen guest. The purpose of Dom0 is to carefully hand out privileges to other Xen domains, providing access to hardware, network interfaces, etc. as set by administrative policy. Actual implementations of Xen must include the Dom0 code - currently a large body of out-of-tree kernel code.
Jeremy Fitzhardinge would like to change that situation. So he has posted a core Xen Dom0 patch set with the goal of getting it merged into the 2.6.30 release. Among the review comments was this question from Andrew Morton:
In three years time, will we regret having merged this?
The questions asked by Andrew were, essentially, (1) what code (beyond the current posting) is required to finish the job, and (2) is there really any reason to do that? The answer to the first question was "another 2-3 similarly sized series to get everything so that you can boot dom0 out of the box". Then there are various other bits which may not ever make it into the mainline. But, says Jeremy, getting the core into the mainline would shrink the out-of-tree patches carried by distributors and generally make life easier for everybody. For the second question, Jeremy responds:
Beyond that, Jeremy is arguing that Xen still has a reason to exist. Its design differs significantly from that of KVM in a number of ways; see this message for an excellent description of those differences. As a result, Xen is useful in different situations.
Some of the advantages claimed by Jeremy include:
- Xen's approach to page tables eliminates the need for shadow page
tables or page table nesting in the guests; that, in turn, allows for
significantly better performance for many workloads.
- The Xen hypervisor is lightweight, and can be run standalone; the KVM
hypervisor is, instead, the Linux kernel. It seems that some vendors
(HP and Dell are named) are shipping a Xen hypervisor in the firmware
of many of their systems; that's the code behind the "instant on"
feature, among other things.
- Xen's paravirtualization support allows it to work with hardware which
does not support full virtualization. KVM, instead, needs hardware
support.
- The separation between the hypervisor, Dom0, and DomU makes security validation easier. The separation between domains also allows for wild configurations with each device being driven by a separate domain; one might think of this kind of thing as a sort of heavyweight microkernel architecture.
KVM's advantages, instead, take the form of relative simplicity, ease of use, full access to contemporary kernel features, etc. By Jeremy's reasoning, there is a place for both systems in Linux.
The relative silence at the end of the discussion suggests that Jeremy has made his case fairly well. Mistakes may have been made in Xen's history, but it is a project which remains alive, and which has clear reasons to exist. Your editor predicts that the Dom0 code will find little opposition at the opening of the 2.6.30 merge window.
Speeding up ftrace printing
A kernel patch that reduces memory use while providing a performance increase of roughly a factor of three is generally seen as a good thing. But, when there is another, more-or-less equivalent (but much faster) way to perform that action, it may appear to be an unnecessary optimization. A recent patch to the ftrace_printk() function has those characteristics, but the ability to get such a speed increase, even in something that is merely convenient rather than required, may well trump the concerns about its necessity.
Lai Jiangshan proposed adding a binary version of ftrace_printk() last December; Frederic Weisbecker has picked up the patches and submitted them for inclusion into ftrace. The basic idea is that rather than converting the arguments to strings—as specified in a printk()-style format string—ftrace_bprintk() would defer the actual conversion until the trace output is read by user space. Instead it would put the binary values into the ring buffer, along with a pointer to the format string. When the trace data is read from debugfs, the format string and binary data are used to construct the output.
Ingo Molnar liked the idea, but was unhappy with the implementation that duplicated much of the code in vsnprintf() into two new functions. He suggested that it should be possible to pull out the common code: "We should try _much_ harder at unifying these functions before giving up and duplicating them." Weisbecker agreed, which eventually resulted in a patch that breaks out the format string decoding as a separate function.
Molnar also asked for some performance numbers, which Weisbecker provided as part of his patch. He reported the memory and time difference when adding:
ftrace_printk("This is the timer interrupt: %llu", jiffies_64);
to the timer interrupt. The memory used was less than half (16 versus 39 bytes per entry), and the time savings were also significant:
After some time running on low load (no X, no really active processes):
ftrace_printk: duration average: 2044 ns, avg of bytes stored per entry: 39
ftrace_bprintk: duration average: 1426 ns, avg of bytes stored per entry: 16
Higher load (started X and launched a cat running on an X console looping on traces printing):
ftrace_printk: duration average: 8812 ns
ftrace_bprintk: duration average: 2611 ns
Andrew Morton was a bit puzzled by the intent of the patch: "Trying to make something which is inherently slow run slightly faster seems...odd." But Molnar explained why it makes sense:
That does not remove the ease of use of ad-hoc printk-alike tracepoints though, and speeding them up 3-fold is a [worthwhile] goal.
Breaking out the format string handling into its own format_decode() function was mostly met with approval, except that the argument list is rather ugly:
int format_decode(const char *fmt, enum format_type *type,
int *flags, int *field_width, int *base,
int *precision, int *qualifier)
Linus Torvalds suggested using a struct printf_spec to contain the various values decoded from the format specifier, passing a pointer to that into the function. Weisbecker agreed, and added that into his patches, but he didn't quite go far enough. Torvalds also thought that the various helper functions to handle specific formats (i.e. number(), pointer(), string(), etc.) should get passed a struct printf_spec pointer as well. As he points out: "When cleaning up, let's just do it properly." Once again, Weisbecker was quick to agree; he plans to respin the patches addressing these and other comments in the near future.
In addition, because ftrace_bprintk() is a drop-in replacement for ftrace_printk(), Weisbecker proposes eliminating the current code in favor of the faster version. Molnar, at least, advocates that outcome:
While it is a minor upgrade to a relatively minor kernel subsystem, it does provide some impressive performance gains. As a bonus, the review process has resulted in some clean-up that was probably overdue. While there is validity to the argument that it is not really required, it is not very intrusive, nor very large. In the end, that is likely to be enough to see it eventually end up in the mainline.
A summary of 2.6.29 internal API changes
As the 2.6.29 kernel development cycle draws toward its eventual close, it is appropriate to look back at the internal API changes which have been made. The following list cannot possibly be exhaustive, but, hopefully, it captures the major points.
- The massive task credentials
patch set has been merged. This code reorganizes the handling of
process credentials (user ID, capabilities, etc.). One of the
immediate implications of this change is that direct references to
credential-oriented fields in the task structure need to be changed;
for example, current->user->uid becomes
current_uid(). See Documentation/credentials.txt for a
description of the new API.
- The ftrace code has seen a lot of internal changes. The function
tracing feature has seen a number of improvements, and the developers
have added
mechanisms to profile the behavior of if statements,
provide function call graphs,
obtain user-space stack traces, and
follow CPU power-state transitions.
- Most of the callback functions/methods associated with the
net_device structure have been moved out of that structure
and into the new struct net_device_ops. In-tree drivers
have been converted to the new API.
- The priv field has been removed from struct
net_device; drivers should use netdev_priv() instead.
- The generic PHY layer now has power management support. To that end,
two new methods - suspend() and resume() - have been
added to struct phy_driver.
- The networking layer now supports large receive offload (or
"generic receive offload") operation.
- The NAPI API has been cleaned up somewhat; in particular, functions
like netif_rx_schedule(), netif_rx_schedule_prep(),
and netif_rx_complete() have lost the unneeded struct
net_device parameter.
- The poll() file operation is now allowed to sleep; see this article for more
information on this change.
- The CPU mask mechanism, used to represent sets of processors in the
system, is in the middle of being massively reworked. The problem is
that CPU masks were often put on the stack, but, as the number of
processors grows, the stack lacks room for the mask. The new API is designed to
get these masks off the stack, and to guard against anybody ever
trying to put one back. See this
posting by Rusty Russell for details on this work.
- An infrastructure for
asynchronous function calls has been merged. This code is still a
work in progress, though, and, for 2.6.29, it will not be activated in
the absence of the fastboot command-line parameter.
- The exclusive I/O memory
allocation functions have been merged.
- There is a new synchronous hash interface called "shash." It
simplifies the use of synchronous hash operations while allowing the
same tfm to be used simultaneously in different threads. All in-tree
users have been switched to the new API.
- The hrtimer code has been simplified with the removal of variable
modes for callback functions. All processing is now done in hardirq
context.
- A new set of LSM hooks has been added; these support pathname-based
security operations. With the merging of these hooks, one major
obstacle to the inclusion of security modules like AppArmor and TOMOYO
has been removed.
- The kernel will now refuse to build with GCC 4.1.0 or 4.1.1; those
versions have unfortunate bugs which prevent the building of a working
kernel. Versions 3.0 and 3.1 have also been deemed to be too old and
will not be supported in 2.6.29.
- Video4Linux drivers now use a separate v4l2_file_operations
structure to hold their VFS-like callbacks. The prototypes of a
number of these functions have been changed to remove the
inode argument.
- Video4Linux2 has also acquired a new "subdevice" concept, meant to
reflect the fact that video "devices" tend to be, in reality, a set of
cooperating devices. See the new
document for a description of how this mechanism works.
- Two new functions - stop_machine_create() and
stop_machine_destroy() - allow the independent creation of
the threads used by stop_machine(). That, in turn, lets
those threads be created before trying to actually stop the machine,
making that operation more resistant to failure.
- The exports for a number of SUNRPC functions have been changed to
GPL-only.
- The internal MTD (memory technology device) API has seen significant changes aimed at supporting larger devices (those requiring 64-bit sizes).
Developers interested in the history of kernel API changes can look at the LWN 2.6 API changes page. After a period of unfortunate neglect, this page has been made current once again; your editor promises to be a bit more diligent about maintaining this page in the future.
Page editor: Jonathan Corbet