Brief items
The current stable 2.6 kernel is 2.6.27.2, released on October 18. It contains about a dozen important fixes. Previously, 2.6.27.1 was released with a single fix disabling the dynamic function tracing feature.
There are stable updates for the 2.6.25, 2.6.26, and 2.6.27 kernels in the review process as of this writing; chances are they will have been released by the time you read this.
Kernel development news
Changes visible to kernel developers include:
dev_WARN(struct device *dev, char *format, ...);
will output the formatted warning, along with a full stack trace. This will allow the warnings to be collected at kerneloops.org and incorporated into the reports there.
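As an illustration, a driver that detects an unexpected hardware state might use it like this (a hypothetical sketch; the my_device structure and state values are invented for the example):

```c
static int example_check_state(struct my_device *md)
{
	if (md->state != MY_STATE_READY) {
		/* Prints the device name and message, plus a stack trace */
		dev_WARN(md->dev, "unexpected device state %d\n", md->state);
		return -EIO;
	}
	return 0;
}
```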
static inline void *pci_ioremap_bar(struct pci_dev *pdev, int bar);
This function will remap the entire PCI I/O memory region, as selected by the bar argument.
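In a PCI driver's probe() routine, mapping an entire BAR then reduces to a single call; a brief sketch (error handling is abbreviated, and the driver code around it is invented):

```c
static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	void __iomem *regs;

	/* Map all of BAR 0; replaces the usual pci_resource_start() /
	 * pci_resource_len() / ioremap() sequence. */
	regs = pci_ioremap_bar(pdev, 0);
	if (!regs)
		return -ENOMEM;
	/* ... device setup using regs ... */
	return 0;
}
```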
See next week's Kernel Page for a summary of the final days of the 2.6.28 merge window.
Tracking down intermittent problems is hard. When those problems result in the destruction of hardware, finding them is even harder. Even the most dedicated testers tend to balk when faced with the prospect of shipping their systems back to the manufacturer for repairs. So the task of tracking down the bug which was destroying e1000e network adapters fell to Intel; engineers there locked themselves into a lab with a box full of e1000e adapters and set about bisecting the kernel history to identify the patch which caused the problem. Some time (and numerous fried adapters) later, the bisection process turned up an unlikely suspect: the ftrace tracing framework.
Developers working on tracing generally put a lot of effort into minimizing the impact of their code on system performance. Every last bit of runtime overhead is scrutinized and eliminated if at all possible. As a general rule, bricking the hardware is a level of overhead which goes well beyond the acceptable parameters. So the ftrace developers, once informed of the bisection result, put in some significant work of their own to figure out what was going on.
One of the features offered by ftrace is a simple function call tracing operation; ftrace will output a line with the called function (and its caller) every time a function call is made. This tracing is accomplished by using the venerable profiling mechanism built into gcc (and most other Unix-based compilers). When code is compiled with the -pg option, the compiler will place a call to mcount() at the beginning of every function. The version of mcount() provided by ftrace then logs the relevant information on every call.
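On x86, for example, a function compiled with -pg begins with something like the following (an approximate sketch; the exact instruction sequence varies with the compiler version and options):

```
my_function:
	push   %ebp
	mov    %esp, %ebp
	call   mcount          # inserted by gcc -pg
	...                    # the function body proper
```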
As noted above, though, tracing developers are concerned about overhead. On most systems, it is almost certain that, at any given time, nobody will be doing function call tracing. Having all those mcount() calls happening anyway would be a measurable drag on the system. So the ftrace hackers looked for a way to eliminate that overhead when it is not needed. A naive solution to this problem might look something like the following. Rather than put in an unconditional call to mcount(), get gcc to add code like this:
if (function_tracing_active)
mcount();
But the kernel makes a lot of function calls, so even this version will have a noticeable overhead; it will also bloat the size of the kernel with all those tests. So the favored approach tends to be different: run-time patching. When function tracing is not being used, the kernel overwrites all of the mcount() calls with no-op instructions. As it happens, doing nothing is a highly optimized operation in contemporary processors, so the overhead of a few no-ops is nearly zero. Should somebody decide to turn function tracing on, the kernel can go through and patch all of those mcount() calls back in.
Run-time patching can solve the performance problem, but it introduces a new problem of its own. Changing the code underneath a running kernel is a dangerous thing to do; extreme caution is required. Care must be taken to ensure that the kernel is not running in the affected code at the time, processor caches must be invalidated, and so on. To be safe, it is necessary to get all other processors on the system to stop and wait while the patching is taking place. The end result is that patching the code is an expensive thing to do.
As originally coded, ftrace noted each mcount() call point as it was discovered through an actual call to mcount(). But, as noted above, run-time patching is very expensive, especially if it is done a single function at a time. So ftrace did not patch each site immediately; instead, it accumulated a list of mcount() call sites, then fixed up a whole batch of them later on. In that way, the cost of patching out the calls was significantly reduced.
The problem now is that things might have changed between the time when an mcount() call is noticed and when the kernel gets around to patching out the call. It would be very unfortunate if the kernel were to patch out an mcount() call which no longer existed in the expected place. To be absolutely sure that unrelated data was not being corrupted, the ftrace code used the cmpxchg operation to patch in the no-ops. cmpxchg atomically tests the contents of the target memory against the caller's idea of what is supposed to be there; if the two do not match, the target location will be left with its old value at the end of the operation. So the no-ops will only be written to memory if the current contents of that memory are a call to mcount().
This all seems pretty safe, except that it fell down in one obscure, but important case. One obvious place where an mcount() call could go away is in loadable modules. This can happen if the module is unloaded, of course, but there is another important case too: any code marked as initialization code will be removed once initialization is complete. So a module's initialization function (and any other code marked __init) could leave a dangling reference in the "mcount() calls to be patched out" list maintained by ftrace.
The final piece of this puzzle comes from this little fact: on 32-bit architectures, memory returned from vmalloc() and ioremap() share the same address space. Both functions create mappings to memory from the same range of addresses. Space for loadable modules is allocated with vmalloc(), so all module code is found within this shared address space. Meanwhile, the e1000e driver uses ioremap() to map the adapter's I/O memory and NVRAM into the kernel's address space. The end result is this fatal sequence of events:

- A module is loaded; the mcount() calls in its __init code are noted by ftrace and added to the list of call sites to be patched.
- Module initialization completes, and the kernel frees the initialization code, returning its addresses to the shared vmalloc() range.
- The e1000e driver maps its I/O memory and NVRAM with ioremap(), and is given addresses overlapping those formerly occupied by the init code.
- Ftrace, working through its list, patches the stale call-site addresses - which now point into the adapter's hardware.
Remember that the ftrace code was very careful in its patching, using cmpxchg to avoid overwriting anything which is not an mcount() call. But, as Steven Rostedt noted in his summary of the problem, that caution could not prevent the damage here: on x86, the cmpxchg instruction performs a write cycle even when the comparison fails (the old value is written back to the target), so merely executing it against live I/O memory is enough to do harm.
The end result is a write to the wrong bit of I/O memory - and a destroyed device.
In hindsight, this bug is reasonably clear and understandable, but it's not at all surprising that it took a long time to find. One should note that there were, in fact, two different bugs here. One of them is ftrace's attempt to write to a stale pointer. But the other one was just as important: the e1000e driver should never have left its hardware configured in a mode where a single stray write could turn it into a brick. One never knows where things might go wrong; hardware should never be left in such a vulnerable state if it can be helped.
The good news is that both bugs have been fixed. The e1000e hardware was locked down before 2.6.27 was released, and the 2.6.27.1 update disables the dynamic ftrace feature. The ftrace code has been significantly rewritten for 2.6.28; it no longer records mcount() call sites on the fly, no longer uses cmpxchg, and, one hopes, is generally incapable of creating such mayhem again.
Functions like vmalloc() have long been known to be somewhat expensive to use. They have to work with a single shared (and limited) address range, and they require making changes to the kernel's page tables. Page table changes, in turn, require translation lookaside buffer (TLB) flushes, which are a costly, all-CPUs operation. So kernel developers have generally tried to avoid using these functions in performance-critical parts of the kernel.
Nick Piggin has noticed, though, that the performance characteristics of vmalloc() and friends are catching up with us. The vmalloc() address space is kept on a linked list and protected by a global lock, which does not scale very well. But the real cost is in freeing memory regions in this space; the ensuing TLB flush must be done using an inter-processor interrupt to every CPU, each of which must then flush its own TLB. People normally do not buy more CPUs unless they have more work to run on them, so systems with more processors will, as a general rule, be performing more mapping and freeing in the vmalloc() range. As systems grow, there will be more global TLB flushes, each of which disrupts more processors. In other words, the amount of work grows proportional to the square of the number of processors - meaning that everything falls down, eventually.
To make things worse, Nick has a longstanding series of patches which, among other things, do a lot of vmap() calls to support larger block sizes in the filesystem layer and page cache. Merging those patches would add significantly to the amount of time the system spends managing the vmalloc() space, which would not be a good thing. So fixing vmalloc() seems like a good thing to do first. As of 2.6.28, Nick has, in fact, fixed the management of kernel virtual allocations.
The first step is to get rid of the linked list and its corresponding global lock. Instead, a red-black tree is used to track ranges of available address space; finding a suitable region can now be done without having to traverse a long list. The tree is still protected by a global lock, which poses potential scalability problems. To avoid this issue, Nick's patch creates a separate, per-CPU list of small address ranges which can be allocated and freed in a lockless manner. New functions must be called to make use of this facility:
void *vm_map_ram(struct page **pages, unsigned int count,
int node, pgprot_t prot);
void vm_unmap_ram(const void *mem, unsigned int count);
A call to vm_map_ram() will create a virtually-contiguous mapping for the given pages. The associated data structures will be allocated on the given NUMA node; the memory will have the protection specified in prot. With the version of the patch merged for 2.6.28, mappings of up to 64 pages can be made from the per-CPU lists.
Note that these functions do not allocate memory, they just create a virtual mapping for a given set of pages. They are a replacement for vmap() and vunmap(), not vmalloc() and vfree(). It is probably possible to rewrite vmalloc() to use this mechanism, but that has not happened. So vmalloc() calls still require the acquisition of a global lock.
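A caller holding an array of pages might use the new interface as follows (a hypothetical fragment; error handling and the origin of the pages array are omitted):

```c
	/* Map count pages into a contiguous kernel virtual range,
	 * allocating the bookkeeping on the local NUMA node. */
	void *addr = vm_map_ram(pages, count, numa_node_id(), PAGE_KERNEL);
	if (!addr)
		return -ENOMEM;

	/* ... use the mapping ... */

	vm_unmap_ram(addr, count);
```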
There's another trick in this patch set which is used by all of the kernel virtual address management functions. Nick realized that it is not actually necessary to flush TLBs across the system immediately after an address range is freed. Since those addresses are being given back to the system, no code will be making use of them afterward, so it does not matter if a processor's TLB contains a stale mapping for them. All that really matters is that the TLB gets cleaned out before those addresses are used again elsewhere. So unmapped regions can be allowed to accumulate, then all flushed with a single operation. That cuts the number of TLB flushes significantly.
How much faster do things run? Nick's patch contains some benchmark results. With an artificial test aimed at demonstrating the difference, the new code runs 25 times faster. Changing the vmap() code in the XFS filesystem to use vm_map_ram() instead sped some workloads up by a factor of twenty. So it seems to work.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Security-related
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds