The current 2.6 development kernel is 2.6.25-rc5, released on March 9. Linus
says: "So the size of the -rc patches is finally starting to shrink,
but we still have way too many outstanding regression reports."
See the announcement for the short-form changelog, or the full
changelog for all the details.
The flow of patches into the mainline git repository continues; they are
mostly fixes, but, since the 2.6.25-rc5 release, Linus has also merged
drivers for JMicron jmb38x MemoryStick host controllers, Varitronix
VL-PS-COG-T350MCQB-01 displays, and RouterBoard 500 PATA Compact Flash
controllers and removed the x86 quicklist feature.
The current -mm tree is 2.6.25-rc5-mm1. Recent changes
to -mm include a number of memory policy changes, a rework of the signal
delivery code, the simple tracing
infrastructure, and a lot of cleanup patches.
There have been no stable kernel releases over the last week.
Kernel development news
Things are going much much more smoothly now than they were in
2.6.24-rcX and 2.6.23-rcX. Tree integration problems are
negligible and build errors are far fewer and runtime problems seem
to be less too. Fingers crossed.
-- Andrew Morton
That is not particularly important, because Linux isn't a GNU
package. If it were a GNU package, I would write to its
maintainers to suggest using Bzr.
-- Richard Stallman
As any device driver author knows, hardware can be a pain sometimes. In
the early days of Linux, peripherals attached to the ISA bus inflicted
their particular variety of pain by being unable to use more than
24 bits to access memory. What that meant, in practical terms, was
that ISA devices could not perform DMA operations on memory above 16MB.
The PCI bus lifted that restriction, but, for some time, there were quite a
few "PCI" devices that were minimally modified ISA peripherals; many of
those retained the 16MB limit.
To handle the needs of these devices, Linux has long maintained the DMA
memory zone. Drivers which need to allocate memory from that zone would
specify GFP_DMA in their allocation requests. The memory management code
takes special care to keep memory in that zone available so that DMA
requests can be satisfied. In this way, the system can provide reasonable
assurance that memory will be available to perform DMA in ways which meet
the special needs of this particularly challenged hardware.
The only problem is that there aren't a whole lot of devices out there
which still have the old 24-bit addressing limitation. So the DMA zone
tends to sit idle. Meanwhile, there are devices with other sorts of
limitations. Many peripherals only handle 32-bit addresses, so their DMA
buffers must be allocated in the bottom 4GB of memory. There is a subset,
however, with stranger limitations - 30 or 31-bit addresses, for example.
The kernel's DMA library provides a way for drivers to disclose that sort
of embarrassing limitation, but the memory management code does not really
help the DMA layer make allocations which satisfy those constraints. So
drivers for such devices must use the DMA zone (which may not be present on
all architectures), or hope that normal zone memory fits the bill.
Andi Kleen has set out to clean up this situation with a new DMA memory allocator. His
solution is to take a chunk of memory out of the kernel's buddy allocator
entirely and manage it in an entirely different way, forming a reserve pool
for DMA allocations. The result is a bit
of a departure from normal Linux memory management algorithms, but it may
well be better suited to the task at hand.
The new "mask" allocator grabs a configurable chunk of low memory at boot
time. Allocations from this region are made with a separate set of calls,
with the core API being:
struct page *alloc_pages_mask(gfp_t gfp, unsigned size, u64 mask);
void __free_pages_mask(struct page *page, unsigned size);
void *get_pages_mask(gfp_t gfp, unsigned size, u64 mask);
void free_pages_mask(void *mem, unsigned size);
alloc_pages_mask() looks a lot like the longstanding
alloc_pages() function, but there are some important differences.
The size parameter is the desired size of the allocation, rather
than the "order" value used by alloc_pages(), and mask
describes the range of usable addresses for this allocation. Though
mask looks like a bitmask, it is really better understood as the
address value that the allocated memory should have; "holes" in the mask
would make no sense.
A call to alloc_pages_mask() will first attempt to allocate the
requested memory using the normal Linux memory allocator, on the assumption
that the reserved DMA memory is an especially limited resource. If the
allocation fails, perhaps because there's no physically-contiguous chunk of
sufficient size available, then the allocator will dip into the reserved
DMA pool. If the normal allocation succeeds, though, the allocated memory
must still be tested against the maximum allowable address: the normal
memory allocator, remember, has no support for allocating below an arbitrary
address. So if the returned memory is out of bounds, it must be
immediately freed and the reserved pool will be used instead.
That reserved pool is not managed like the rest of memory. Rather than the
buddy lists maintained by the normal page allocator, the DMA allocator has a
simple bitmap describing which pages are available. It will normally cycle
through the entire memory region, allocating the next available chunk of
sufficient size. If that chunk is above the memory limit, though, the
allocator will move back to the lower end of the reserved pool and allocate
from there instead. Since DMA allocations tend to be short-lived, one
would expect that a suitable block of memory would either be available or
become available in the near future.
One other difference of note is that, unlike the slab allocator, the DMA
allocator does not round memory allocation sizes up to the next power of
two. DMA allocations can be relatively large, so that rounding can result
in significant internal fragmentation and memory waste.
At the next level up, Andi has added a new form of mempool which uses the
mask allocator:
mempool_t *mempool_create_pool_pmask(int min_nr, int size, u64 mask);
This pool will behave like normal mempools, with the exception that all
allocations will be below the limit passed in as mask. These pools are used
in the block layer, where memory allocations for DMA must succeed.
One might object that reserving a big chunk of low memory for this purpose
reduces the total amount of memory available to the system - especially if
the DMA allocator is cherry-picking normal memory whenever it can anyway.
But the cost is not as bad as one might think. These patches do away with
the old DMA zone, which, for all practical purposes, was already managed as
a reserved (and often unused) memory area. Some 64-bit architectures also
set aside a significant chunk (around 64MB) of low memory for the swiotlb -
essentially a set of bounce buffers used for impedance matching between
high memory (>4GB) buffers and devices which cannot handle more than
32-bit addresses. With Andi's patch set, the swiotlb, too, makes
allocations from the DMA area and no longer has its own dedicated memory
pool. So the total amount of memory set aside for I/O will not change very
much; it could, in fact, get smaller.
For most driver authors, there will be little in the way of required
changes if this patch set gets merged. The DMA layer already allows
drivers to specify an address mask with dma_set_mask(); with the
DMA allocator in place, that mask will be better observed. The one change
which might affect a few drivers is further down the line: eventually the
GFP_DMA memory allocation flag will go away. Any driver which
still uses this flag should set a proper mask instead.
So far, there has been little discussion resulting from the posting of
these patches. Silence does not mean assent, of course, but it would
appear that there is little opposition to this set of changes.
We have not yet reached a point where systems - even high-end boxes - come
with a terabyte of installed memory. But products like those from Violin Memory
make it clear that
the day is coming; one can buy a Violin box with 500GB in it now. So it
seems worth asking the question: once one has spent the not inconsiderable
sum to buy a box like that, what does one do with all that memory -
especially now that the Firefox developers have gotten serious about fixing
memory leaks?
Perhaps it's time for some wild ideas. And there is no better source for
such ideas than Daniel Phillips, whose Ramback patch has stirred up a
bit of discussion this week. The core idea behind Ramback is that all of
that memory is turned into a ramdisk, but with a persistent device attached
to it. In normal conditions, all application I/O involves only the
ramdisk, and is, thus, quite fast ("Every little factor of 25
performance increase really helps."). In the background, the kernel takes
care of synchronizing data from the ramdisk onto permanent storage. But the
synchronization process is mostly concerned with I/O performance, rather
than providing guarantees about just when any given block will make it onto
the disk platters.
Ramback thus differs from the normal block I/O caching done by the kernel
in a number of ways. It keeps the entire device in memory, so that, in
steady-state operation, applications need never encounter a disk I/O
delay. Should an application call fsync(), the expected result
(blocking until the data is written to physical media) will not happen.
Filesystems take great care to order operations in a way that minimizes the
risk of data loss in a crash; Ramback ignores all of that and writes data
to physical media in whatever order it decides is best. As Daniel put it, the "most basic principle" of
Ramback's design is:
[T]he backing store is not expected to represent a consistent
filesystem state during normal operation. Only the ramdisk needs
to maintain a consistent state, which I have taken care to ensure.
You just need to believe in your battery, Linux and the hardware it
runs on. Which of these do you mistrust?
Ramback does include an emergency mode which will endeavor to bring the
disk up to date in a hurry should the UPS indicate that power has been
lost. But that does not seem to be enough for everybody.
In the resulting discussion, nobody complained about the sort of
performance benefits that a tool like Ramback could provide. But there was
a lot of concern about data integrity; it seems that many people distrust
their battery, their hardware, and Linux. And that has led to a
sort of impasse, with several developers claiming that Ramback would be too
risky to use and Daniel dismissing their concerns as FUD.
FUD or not, those concerns are likely to be a difficult barrier for Ramback
to overcome. Meanwhile, Daniel is looking for people to help test out the
code, but that presents challenges of its own:
This driver is ready to try for a sufficiently brave developer. It
will deadlock and livelock in various ways and you will have to
reboot to remove it. But it can already be coaxed into running
well enough for benchmarks, and when it solidifies it will be
pretty darn amazing.
So far, reports from suitably courageous testers have been, well, scarce.
Your editor fears that this work could suffer the same fate as many of
Daniel's other patches: they can contain brilliant ideas and great coding
but just don't quite survive the encounter with the real, messy world.
But we need people thinking about how our systems will work in the
coming years; one hopes that Daniel won't stop.
A change to GCC for a recent release coupled with a kernel bug has created
a messy situation, with possible security implications. GCC changed some
assumptions about x86 processor flags, in accordance with the ABI standard,
that can lead to memory corruption for programs built with GCC 4.3.0. No
one has come up with a way to exploit the flaw, at least yet, but it
clearly is a problem that needs to be addressed.
The problem revolves around the x86 direction flag (DF), which governs
whether block memory operations operate forward through memory or
backwards. The main use for the flag is to support overlapping memory
copies, where working backwards through memory may be required so that the data
being copied does not get overwritten as the copy progresses. Debian
hacker Aurélien Jarno reported the problem to
linux-kernel on March 5th; it was discovered when building Steel Bank
Common Lisp (SBCL) with the new compiler.
GCC's most recent
release, 4.3.0, assumes that the direction flag has been cleared
(i.e. memory operations go in a forward direction) at the entry of each
function, as is specified by the ABI (which is, somewhat amusingly, found at
sco.com [PDF]). Unfortunately, this clashes with
Linux signal handlers, which get called, incorrectly, with the flag in
whatever state it was in when the signal occurred. This has the effect of
leaking one bit of state from the user space process that was running when
the signal occurred to the signal handler,
which could be in another process.
That, in itself, is a bug, seemingly with fairly minimal impact. Prior to 4.3, GCC
would emit a cld (clear direction flag) opcode before doing inline
string or memory operations, so those operations would start from a known state.
In 4.3, GCC relies on the ABI mandate that the direction flag is cleared before
entry to a function, which means that the kernel needs to arrange that
before calling a signal handler. It currently doesn't, but a small patch fixes that.
The window of vulnerability is small, but was observed in SBCL. The
sequence of events that would lead to memory corruption is as follows:
- a user space program does an operation (memmove() for example)
that sets DF
- a signal occurs for some process
- the kernel calls the signal handler
- the signal handler does a memmove() in what it thinks is a
forward direction
- the memory is copied in the reverse direction, leading to corruption
It is hard to see how that could be turned into a security breach, but it
would be a mistake to assume that it can't. Other kernel bugs, like the
one that allowed the recent vmsplice() exploit, have looked like simple
memory corruption, but were found to be more than that. The DF issue may
turn out to be harmless from a security standpoint, but that should not
be assumed.
So, now the question is: what to do about it. It is clear that the kernel
should not leak the DF state to signal handlers, regardless of what GCC
does. It is interesting to note that this behavior is the same (DF is not
cleared on entry to a signal handler) on BSD
kernels, leading some to claim that it is the ABI that is incorrect and
that GCC should revert to its old behavior. Solaris kernels do
clear the DF before calling signal handlers. This problem has existed for
15 years; GCC has always emitted code that worked correctly on kernels
that did not follow the ABI, until now.
Part of the problem is that there are an enormous number of installed
kernels that are vulnerable to this problem, but only if GCC 4.3 is
installed. That version of GCC is not, yet, in widespread use, so the
thinking is that GCC should revert its behavior now, before it gets into
distributions. As kernels with the fix become more widespread, the
"proper" behavior could be restored. The GCC folks don't necessarily see
it that way, so it is unclear what will happen.
While it is true that distributors can control what kernel version and GCC
version they ship, those aren't the only ways that either GCC or
GCC-compiled binaries get installed. It is a bit of a ticking time bomb for
random memory corruption at a minimum. Handling those bug reports will be
very difficult and time consuming. While the new behavior of GCC is
correct, and the kernel is broken, it would be very helpful to back out
this change, perhaps providing the new behavior via a command-line argument
for those who are sure their binaries will be running on patched kernels. Some discussion
on the gcc-devel list would indicate that a fixed GCC release, perhaps
4.3.1, may be forthcoming.
Page editor: Jonathan Corbet