Brief items
The 2.6.28 merge window is still open, so there is no 2.6
development kernel at the moment. See the article below for an update on
what has been merged for the 2.6.28 development cycle.
The current stable 2.6 kernel is 2.6.27.2, released on October 18. It
contains about a dozen important fixes. Previously, 2.6.27.1 was released
with a single fix disabling the dynamic function tracing feature.
There are stable updates for the 2.6.25, 2.6.26, and 2.6.27 kernels in the
review process as of this writing; chances are they will have been released
by the time you read this.
Kernel development news
This adds support for OLPC's touchpad. It has lots of neat
features, none of which are enabled because the hardware is too
buggy. Instead, we use it like a normal touchpad, but with a
number of workarounds in place to deal with the frequent hardware
spasms. Humidity changes, sweat, tinfoil underwear, plugging in
AC, drinks, evil felines.. All tend to cause the touchpad to freak
out.
-- Andres Salomon should be a hardware salesman
Yeah, I'm grumpy. The quality control during this merge window has
been absolutely disgusting. I feel like I have to complain about
every other pull I do, because people feel like another new warning
isn't a problem. And every single time, the new warning is due to
some total crap code.
-- Linus Torvalds
I'm afraid that once a barrier discussion comes up and we insert
them, then I become dazedly paranoid and it's very hard to shake
me from seeing a need for barriers everywhere, including a barrier
before and after every barrier ad infinitum to make sure they're
really barriers.
-- Hugh Dickins
By Jonathan Corbet
October 22, 2008
As of this writing, just under 6200 non-merge changesets have been merged
into the mainline kernel since the 2.6.27 release. This merge window
should be drawing to a close around October 24, so we are getting
closer to seeing what 2.6.28 will look like. User-visible changes merged
since last week's update include:
- New drivers have been merged for
Maxim/Dallas DS3234 SPI realtime clock chips,
VIA UniChrome Family graphics chipsets,
Toshiba Mobile IO framebuffers,
C-Media CM109 USB phones,
the touchpad shipped on OLPC XO systems,
Automata Sercos III PCI cards (via UIO),
Delcom USB 7-segment LED displays,
generic USB test-and-measurement devices,
Freescale QE/CPM USB device controllers,
Vernier Software Technologies USB spectrometers,
GPIO-connected NAND flash devices,
Freescale i.MX2 and i.MX3 flash controllers,
OMAP2/OMAP3-connected OneNAND flash devices,
Dialog DA9030/DA9034 multifunction controllers, and
Texas Instruments TWL4030/TPS659x0 multifunction controllers.
- The driver staging tree has been moved into the mainline.
It brings with it a new TAINT_CRAP flag and suitably tainted drivers
for Meilhaus ME-4000 data collection boards,
Go 7007 ("some weird device") video controllers,
Agere ET-1310 Gigabit Ethernet controllers,
Atmel at76c503/at76c505/at76c505a wireless USB cards,
Alacritech SLIC Technology non-accelerated 10Gb Ethernet cards,
Alacritech IS-NIC gigabit Ethernet cards,
Winbond w35und wireless network adapters,
and Prism 2.5 USB wireless network adapters (a driver which includes
its own 802.11 stack). Also added are an echo cancellation module and
a driver which enables the passing of network packets over a USB link.
- A lot of work on the Intel i915 graphics driver has been merged; this
work includes the Graphics
Execution Manager (GEM) GPU memory management subsystem and "IGD
OpRegion" support which enables ACPI backlight control. It looks like
kernel-based mode setting might not make it for 2.6.28, but much of
the rest of the big graphics rework is now merged.
- The way video drivers handle waiting for vertical blank cycles has
been changed to reduce interrupts - and, thus, power consumption.
- Rik van Riel's memory
management scalability patches have, at long last, been merged.
These patches separate the management of anonymous, file-backed, and
completely unevictable pages, eliminating a lot of useless page scanning.
- Another VM improvement causes the system to free a page's swap space
after that page is brought back into RAM; this effectively increases
the amount of swap available on the system.
- Nick Piggin's rewritten vmap
layer should give significant performance
improvements, especially as the number of CPUs on a system grows.
- Huge pages will now be included in core dumps, making the debugging of
applications using those pages easier.
- The container freezer
has been merged. It is now possible for the system to freeze all
processes within a container (control group) as a unit.
- The KVM virtualization code has seen a number of improvements,
including the ability to assign PCI devices to guests and support for
Intel "Tukwila" processors.
- Kprobes are now supported by the SuperH architecture.
- There is a new ext3 mount option (data_err=abort) which
causes filesystem operations to abort when I/O errors are
encountered. In the absence of this option, the old behavior
(continue but complain in the system log) remains.
- In-kernel interrupt balancing for 32-bit x86 systems has been
removed. This feature has been deprecated (in favor of user-space
balancing) for some time.
Changes visible to kernel developers include:
- A number of tracing-related patches have been merged. These include
the tracepoints
mechanism, some instrumentation in the core scheduler code,
improvements to the ftrace function tracing feature,
a new ftrace-based stack tracer,
a new ftrace-based boot (initcall) tracer, and
the low-level trace
buffer code.
- The sysctl strategy() function prototype has changed: the
unused name and nlen parameters have been removed.
- Asynchronous I/O support can now be configured out of the kernel,
saving about 7KB of space on systems where AIO is not needed.
- As planned, device_create_drvdata() has been renamed to
device_create(), with the same parameters.
- There is now a mechanism to enable and disable output from
pr_debug() and dev_dbg() calls on a per-module
basis. Control is through a virtual file in debugfs. There is no
documentation file associated with this change; instructions on how
to use this feature can be found in the
patch changelog.
- The new dev_WARN() function:
    dev_WARN(struct device *dev, char *format, ...);
will output the formatted warning, along with a full stack trace.
This will allow the warnings to be collected at kerneloops.org and incorporated into
the reports there.
- The new %pR formatting directive allows printk() and
friends to output the contents of resource structures.
- There is a new function intended to make life easier for PCI driver
writers:
    static inline void *pci_ioremap_bar(struct pci_dev *pdev, int bar);
This function will remap the entire PCI I/O memory region, as
selected by the bar argument.
See next week's Kernel Page for a summary of the final days of the 2.6.28
merge window.
By Jonathan Corbet
October 21, 2008
When LWN last looked at the e1000e hardware corruption bug, the source of
the problem was, at best, unclear. Problems within the driver itself
seemed like a likely culprit, but it did not take long for those chasing
this problem to realize that they needed to look further afield. For a while, the
X server came under scrutiny, as did a number of other system components.
When the real problem was found, though, it turned out to be a surprise for
everybody involved.
Tracking down intermittent problems is hard. When those problems result in
the destruction of hardware, finding them is even harder. Even the most
dedicated testers tend to balk when faced with the prospect of shipping
their systems back to the manufacturer for repairs. So the task of finding
this issue fell to Intel; engineers there locked themselves into a lab with
a box full of e1000e adapters and set about bisecting the kernel history to
identify the patch which caused the problem. Some time (and numerous fried
adapters) later, the bisection process turned up an unlikely suspect: the
ftrace tracing framework.
Developers working on tracing generally put a lot of effort into minimizing
the impact of their code on system performance. Every last bit of runtime
overhead is scrutinized and eliminated if at all possible. As a general
rule, bricking the hardware is a level of overhead which goes well beyond
the acceptable parameters. So
the ftrace developers, once informed of the bisection result, put in some
significant work of their own to figure out what was going on.
One of the features offered by ftrace is a simple function call tracing
operation; ftrace will output a line with the called function (and
its caller) every time a function call is made. This tracing is
accomplished by using the venerable profiling mechanism built into gcc (and
most other Unix-based compilers). When code is compiled with the
-pg option, the compiler will place a call to mcount() at
the beginning of every function. The version of mcount() provided
by ftrace then logs the relevant information on every call.
As noted above, though, tracing developers are concerned about overhead.
On most systems, it is almost certain that, at any given time, nobody will
be doing function call tracing. Having all those mcount() calls
happening anyway would be a measurable drag on the system. So the ftrace
hackers looked for a way to eliminate that overhead when it is not needed.
A naive solution to this problem might look something like the following.
Rather than put in an unconditional call to mcount(), get gcc to
add code like this:
    if (function_tracing_active)
        mcount();
But the kernel makes a lot of function calls, so even this version
will have a noticeable overhead; it will also bloat the size of the kernel
with all those tests. So the favored approach tends to be different:
run-time patching. When function tracing is not being used, the kernel
overwrites all of the mcount() calls with no-op instructions. As
it happens, doing nothing is a highly optimized operation in contemporary
processors, so the overhead of a few no-ops is nearly zero. Should
somebody decide to turn function tracing on, the kernel can go through and
patch all of those mcount() calls back in.
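
The patching idea can be illustrated with a small user-space sketch. It is only a model: the "text" here is a data buffer which is never executed, and the call-site table and function names are invented for the example. The byte values are the usual x86 encodings (E8 plus a 32-bit displacement for a near call, and a five-byte no-op); the displacement bytes are dummies. The real kernel must also stop the other processors and keep caches coherent, which this sketch ignores entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INSN_LEN 5   /* size of an x86 "call rel32" instruction */

/* x86 encodings: E8 + rel32 is a near call; 0F 1F 44 00 00 is a 5-byte NOP.
 * The displacement bytes below are dummy values. */
static const unsigned char call_insn[INSN_LEN] = { 0xe8, 0x10, 0x20, 0x30, 0x40 };
static const unsigned char nop_insn[INSN_LEN]  = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Toy "kernel text": a data buffer that is inspected but never executed. */
static unsigned char text[64];

/* Hypothetical table of call-site offsets into text[]. */
static const size_t call_sites[] = { 0, 16, 32 };
#define NR_SITES (sizeof(call_sites) / sizeof(call_sites[0]))

/* What the compiler would do: place a call at the start of each function. */
static void emit_calls(void)
{
    memset(text, 0x90, sizeof(text));  /* fill with one-byte NOPs */
    for (size_t i = 0; i < NR_SITES; i++)
        memcpy(&text[call_sites[i]], call_insn, INSN_LEN);
}

/* Tracing off: overwrite every call with a same-sized no-op.
 * Tracing on: put the calls back. Returns the number of sites rewritten. */
static size_t set_tracing(int enable)
{
    const unsigned char *insn = enable ? call_insn : nop_insn;

    for (size_t i = 0; i < NR_SITES; i++) {
        assert(call_sites[i] + INSN_LEN <= sizeof(text));
        memcpy(&text[call_sites[i]], insn, INSN_LEN);
    }
    return NR_SITES;
}

/* Count the sites that currently hold a call instruction. */
static size_t count_calls(void)
{
    size_t n = 0;

    for (size_t i = 0; i < NR_SITES; i++)
        if (memcmp(&text[call_sites[i]], call_insn, INSN_LEN) == 0)
            n++;
    return n;
}
```

Calling emit_calls() followed by set_tracing(0) leaves no call instructions in the buffer; set_tracing(1) restores all of them. Because the replacement is exactly as long as the call, nothing else in the buffer has to move.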
Run-time patching can solve the performance problem, but it introduces a
new problem of its own. Changing the code underneath a running kernel is a
dangerous thing to do; extreme caution is required. Care must be taken to
ensure that the kernel is not running in the affected code at the time,
processor caches must be invalidated, and so on. To be safe, it is
necessary to get all other processors on the system to stop and wait while the
patching is taking place. The end result is that patching the code is an
expensive thing to do.
The way ftrace was coded was to patch out every mcount() call
point as it was discovered through an actual call to mcount().
But, as noted above, run-time patching is very expensive, especially if it
is done a single
function at a time. So ftrace would make a list of mcount() call
sites, then fix up a bunch of them later on. In that way, the cost of
patching out the calls was significantly reduced.
The problem now is that things might have changed between the time when an
mcount() call is noticed and when the kernel gets around to
patching out the call. It would be very unfortunate if the kernel were to
patch out an mcount() call which no longer existed in the expected
place. To be absolutely sure that unrelated data was not being corrupted,
the ftrace code used the cmpxchg operation to patch in the
no-ops. cmpxchg atomically tests the contents of the target
memory against the caller's idea of what is supposed to be there; if the
two do not match, the target location will be left with its old value at
the end of the operation. So the no-ops will only be written to memory if
the current contents of that memory are a call to mcount().
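
The compare-and-exchange semantics described here can be tried out in user space with C11 atomics. This is a sketch of the operation, not the ftrace code: the two 32-bit constants are made-up stand-ins for the real instruction bytes, and patch_if_call() is a name invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CALL_MCOUNT 0xe8aabbccu   /* stand-in for the bytes of "call mcount" */
#define NOP_WORD    0x90909090u   /* stand-in for the no-op replacement */

/*
 * Replace *site with NOP_WORD only if it still holds CALL_MCOUNT.
 * Returns true if the exchange happened; on a mismatch the old
 * value is left in place.
 */
static bool patch_if_call(_Atomic uint32_t *site)
{
    uint32_t expected = CALL_MCOUNT;

    assert(site != NULL);
    return atomic_compare_exchange_strong(site, &expected, NOP_WORD);
}
```

On ordinary RAM, a location which no longer contains the expected call comes through untouched. That guarantee is what the ftrace code relied on.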
This all seems pretty safe, except that it fell down in one obscure, but
important case. One obvious place where an mcount() call could go
away is in loadable modules. This can happen if the module is unloaded, of
course, but there is another important case too: any code marked as
initialization code will be removed once initialization is complete.
So a module's initialization function (and any other code marked
__init) could leave a dangling reference in the "mcount()
calls to be patched out" list maintained by ftrace.
The final piece of this puzzle comes from this little fact: on 32-bit
architectures, memory returned from vmalloc() and
ioremap() shares the same address space. Both functions create
mappings to memory from the same range of addresses. Space for loadable
modules is allocated with vmalloc(), so all module code is found
within this shared address space. Meanwhile, the e1000e driver uses
ioremap() to map the adapter's I/O memory and NVRAM into the kernel's
address space. The end result is this fatal sequence of events:
- A module is loaded into the system. As part of the module's
initialization, a number of mcount() calls are made; these
call sites are noted for later patching.
- Module initialization completes, and the module's __init
functions are removed from memory. The address space they occupied is
freed up for future use.
- The e1000e driver maps its I/O memory and NVRAM into the address range
recently occupied by the above-mentioned initialization code.
- Ftrace gets around to patching out the accumulated list of
mcount() calls. But some of those "calls" are now, actually,
I/O memory belonging to the e1000e device.
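
The four steps above can be replayed in a toy user-space model. Everything in it is invented for illustration: the slot array stands in for the shared vmalloc()/ioremap() range, the constants are placeholders, and the function names are hypothetical. One modeling assumption is worth stating loudly: a compare-and-exchange aimed at device memory is counted as a write cycle even when the comparison fails, which is a guess at what "undefined" means in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum owner { FREE, MODULE_INIT, DEVICE_IO };

struct slot {
    enum owner owner;
    uint32_t   value;
    unsigned   stray_writes;  /* write cycles seen while owned by the device */
};

#define CALL_MCOUNT 0xe8aabbccu  /* placeholder for "call mcount" */
#define NOP_WORD    0x90909090u  /* placeholder for the no-op */
#define NR_SLOTS    4
#define MAX_PENDING 4

static struct slot range[NR_SLOTS];  /* shared vmalloc/ioremap range */
static size_t pending[MAX_PENDING];  /* recorded call sites */
static size_t nr_pending;

/* Step 1: module init code is loaded and runs; its call site is recorded. */
static void load_module_init(size_t idx)
{
    range[idx].owner = MODULE_INIT;
    range[idx].value = CALL_MCOUNT;
    assert(nr_pending < MAX_PENDING);
    pending[nr_pending++] = idx;
}

/* Step 2: init code is discarded; the pending list is NOT updated (the bug). */
static void free_module_init(size_t idx)
{
    range[idx].owner = FREE;
}

/* Step 3: a driver maps device memory into the first free slot. */
static long map_device(uint32_t reg_value)
{
    for (size_t i = 0; i < NR_SLOTS; i++) {
        if (range[i].owner == FREE) {
            range[i].owner = DEVICE_IO;
            range[i].value = reg_value;
            return (long)i;
        }
    }
    return -1;
}

/* Step 4: deferred patching. Modeling assumption: on device memory the
 * compare-and-exchange issues a write cycle even if the compare fails. */
static void patch_pending(void)
{
    for (size_t i = 0; i < nr_pending; i++) {
        struct slot *s = &range[pending[i]];

        if (s->owner == DEVICE_IO)
            s->stray_writes++;
        if (s->value == CALL_MCOUNT)
            s->value = NOP_WORD;
    }
    nr_pending = 0;
}
```

Running load_module_init(0), free_module_init(0), map_device() and patch_pending() in that order leaves the device slot with one stray write recorded against it, even though its contents never matched the expected call.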
Remember that the ftrace code was very careful in its patching, using
cmpxchg to avoid overwriting anything which is not an
mcount() call. But, as Steven Rostedt noted in his summary of the problem:
The cmpxchg could have saved us in most cases (via luck) - but with
ioremap-ed memory that was exactly the wrong thing to do - the
results of cmpxchg on device memory are undefined. (and will
likely result in a write)
The end result is a write to the wrong bit of I/O memory - and a destroyed
device.
In hindsight, this bug is reasonably clear and understandable, but it's not
at all surprising that it took a long time to find. One should note that
there were, in fact, two different bugs here. One of them is ftrace's
attempt to write to a stale pointer. But the other one was just as
important: the e1000e driver should never have left its hardware configured
in a mode where a single stray write could turn it into a brick. One never
knows where things might go wrong; hardware should never be left in such a
vulnerable state if it can be helped.
The good news is that both bugs have been fixed. The e1000e hardware was
locked down before 2.6.27 was released, and the 2.6.27.1 update disables
the dynamic ftrace feature. The ftrace code has been significantly
rewritten for 2.6.28; it no longer records mcount() call sites on
the fly, no longer uses cmpxchg, and, one hopes, is generally
incapable of creating such mayhem again.
By Jonathan Corbet
October 21, 2008
Kernel memory is normally allocated in relatively small chunks - usually
just a single page at a time. As the size of an allocation grows,
satisfying that allocation with physically-contiguous pages gets
progressively harder. So most of the kernel has been written with an eye
toward avoiding the use of large, contiguous allocations. There are times,
though, when a large memory array needs to be virtually contiguous, but not
necessarily physically contiguous. One example is the allocation of space
for loadable modules; any given module should live in a single, contiguous
address range, but nobody cares how it's laid out in physical RAM. For
cases like this, the kernel provides a set of functions like
vmalloc() and
vmap().
Functions like vmalloc() have long been known to be somewhat
expensive to use. They have to work with a single shared (and limited)
address range, and they require making changes to the kernel's page
tables. Page table changes, in turn, require translation lookaside buffer
(TLB) flushes, which are a costly, all-CPUs operation. So kernel
developers have generally tried to avoid using these functions in
performance-critical parts of the kernel.
Nick Piggin has noticed, though, that the performance characteristics of
vmalloc() and friends are catching up with us. The
vmalloc() address space is kept on a linked list and protected by
a global lock, which does not scale very well. But the real cost is in
freeing memory regions in this space; the ensuing TLB flush must be done
using an inter-processor interrupt to every CPU, each of which must then
flush its own TLB. People normally do not buy more CPUs unless they have
more work to run on them, so systems with more processors will, as a
general rule, be performing more mapping and freeing in the
vmalloc() range. As systems grow, there will be more global TLB
flushes, each of which disrupts more processors. In other words, the
amount of work grows proportional to the square of the number of processors
- meaning that everything falls down, eventually.
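
The quadratic growth can be made explicit with a back-of-the-envelope model. The symbols are assumptions introduced here for illustration, not quantities taken from Nick's patch: N is the number of CPUs, r is the rate at which each CPU frees regions in the vmalloc() range (assumed constant), F is the resulting rate of global TLB flushes, and W is the total rate at which flush interrupts are handled across the system, counting roughly N interrupted processors per flush.

```latex
F = rN, \qquad W = F \cdot N = rN^{2}
```

Under this model, going from 8 to 16 processors quadruples W, from 64r to 256r.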
To make things worse, Nick has a longstanding series of patches which,
among other things, do a lot of vmap() calls to support larger
block sizes in the filesystem layer and page cache. Merging those patches would add
significantly to the amount of time the system spends managing the
vmalloc() space, which would not be a good thing. So fixing
vmalloc() seems like a good thing to do first. As of 2.6.28, Nick
has, in fact, fixed the management of kernel virtual allocations.
The first step is to get rid of the linked list and its corresponding
global lock. Instead, a red-black tree is used to track
ranges of available address space; finding a suitable region can now be
done without having to traverse a long list. The tree is still protected
by a global lock, which poses potential scalability problems. To avoid
this issue, Nick's patch creates a separate, per-CPU list of small
address ranges which can be allocated and freed in a lockless manner. New
functions must be called to make use of this facility:
    void *vm_map_ram(struct page **pages, unsigned int count,
                     int node, pgprot_t prot);
    void vm_unmap_ram(const void *mem, unsigned int count);
A call to vm_map_ram() will create a virtually-contiguous mapping
for the given pages. The associated data structures will be
allocated on the given NUMA node; the memory will have the
protection specified in prot. With the version of the patch
merged for 2.6.28, mappings of up to 64 pages can be made from the
per-CPU lists.
Note that these functions do not allocate memory; they just create a
virtual mapping for a given set of pages. They are a replacement for
vmap() and vunmap(), not vmalloc() and
vfree(). It is probably possible to rewrite vmalloc() to
use this mechanism, but that has not happened. So vmalloc() calls
still require the acquisition of a global lock.
There's another trick in this patch set which is used by all of the kernel
virtual address management functions. Nick realized that it is not
actually necessary to flush TLBs across the system immediately after an
address range is freed. Since those addresses are being given back to the
system, no code will be making use of them afterward, so it does not matter
if a processor's TLB contains a stale mapping for them. All that really
matters is that the TLB gets cleaned out before those addresses are used
again elsewhere. So unmapped regions can be allowed to accumulate, then
all flushed with a single operation. That cuts the number of TLB flushes
significantly.
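
A small counting model shows the effect of the batching. It is a sketch under stated assumptions, not the kernel's implementation: the batch size LAZY_MAX is an arbitrary number chosen for the example, and the structure and function names are hypothetical.

```c
#include <assert.h>

#define LAZY_MAX 8   /* hypothetical batch size, not the kernel's threshold */

struct vmap_stats {
    unsigned pending;   /* unmapped ranges awaiting a flush */
    unsigned flushes;   /* global TLB flushes issued */
};

/* One global flush cleans out every pending range at once. */
static void global_tlb_flush(struct vmap_stats *st)
{
    assert(st->pending > 0);  /* flushing with nothing pending is wasted work */
    st->flushes++;
    st->pending = 0;
}

/* Eager: every unmap is followed immediately by a global flush. */
static void unmap_eager(struct vmap_stats *st)
{
    st->pending++;
    global_tlb_flush(st);
}

/* Lazy: queue the range; flush only once LAZY_MAX ranges have accumulated. */
static void unmap_lazy(struct vmap_stats *st)
{
    if (++st->pending >= LAZY_MAX)
        global_tlb_flush(st);
}

/* Freed addresses must not be handed out while stale TLB entries may exist. */
static void before_reuse(struct vmap_stats *st)
{
    if (st->pending)
        global_tlb_flush(st);
}
```

With 20 unmap operations, the eager variant issues 20 global flushes. The lazy variant issues two, plus one more when before_reuse() is called to clean out the four ranges still pending.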
How much faster do things run? Nick's patch (the merged version can be
found here)
contains some benchmark results. With an artificial test aimed at demonstrating
the difference, the new code runs 25 times faster. By changing the
vmap() code in the XFS filesystem to use vm_map_ram()
instead, some workloads were sped up by a factor of twenty. So it seems to
work.
Page editor: Jonathan Corbet