LWN Weekly Edition
Kernel development

Kernel release status

The current 2.6 development kernel is 2.6.25-rc5, released on March 9. Linus says: "So the size of the -rc patches is finally starting to shrink, but we still have way too many outstanding regression reports." See the announcement for the short-form changelog, or the long-form changelog for all the details.

The flow of patches into the mainline git repository continues; they are mostly fixes, but, since the 2.6.25-rc5 release, Linus has also merged drivers for JMicron jmb38x MemoryStick host controllers, Varitronix VL-PS-COG-T350MCQB-01 displays, and RouterBoard 500 PATA Compact Flash controllers and removed the x86 quicklist feature.

The current -mm tree is 2.6.25-rc5-mm1. Recent changes to -mm include a number of memory policy changes, a rework of the signal delivery code, the simple tracing infrastructure, and a lot of cleanup patches.

There have been no stable kernel releases over the last week.
Kernel development news

Quotes of the week
Things are going much much more smoothly now than they were in
2.6.24-rcX and 2.6.23-rcX. Tree integration problems are
negligible and build errors are far fewer and runtime problems seem
to be less too. Fingers crossed.
-- Andrew Morton
That is not particularly important, because Linux isn't a GNU
package. If it were a GNU package, I would write to its
maintainers to suggest using Bzr.
-- Richard Stallman
A better DMA memory allocator

As any device driver author knows, hardware can be a pain sometimes. In the early days of Linux, peripherals attached to the ISA bus inflicted their particular variety of pain by being unable to use more than 24 bits to access memory. What that meant, in practical terms, was that ISA devices could not perform DMA operations on memory above 16MB. The PCI bus lifted that restriction, but, for some time, there were quite a few "PCI" devices that were minimally modified ISA peripherals; many of those retained the 16MB limit.

To handle the needs of these devices, Linux has long maintained the DMA memory zone. Drivers which need to allocate memory from that zone would specify GFP_DMA in their allocation requests. The memory management code takes special care to keep memory in that zone available so that DMA requests can be satisfied. In this way, the system can provide reasonable assurance that memory will be available to perform DMA in ways which meet the special needs of this particularly challenged hardware.

The only problem is that there aren't a whole lot of devices out there which still have the old 24-bit addressing limitation. So the DMA zone tends to sit idle. Meanwhile, there are devices with other sorts of limitations. Many peripherals only handle 32-bit addresses, so their DMA buffers must be allocated in the bottom 4GB of memory. There is a subset, however, with stranger limitations - 30 or 31-bit addresses, for example. The kernel's DMA library provides a way for drivers to disclose that sort of embarrassing limitation, but the memory management code does not really help the DMA layer make allocations which satisfy those constraints. So drivers for such devices must use the DMA zone (which may not be present on all architectures), or hope that normal zone memory fits the bill.

Andi Kleen has set out to clean up this situation with a new DMA memory allocator.
His solution is to take a chunk of memory out of the kernel's buddy allocator entirely and manage it in an entirely different way, forming a reserve pool for DMA allocations. The result is a bit of a departure from normal Linux memory management algorithms, but it may well be better suited to the task at hand. The new "mask" allocator grabs a configurable chunk of low memory at boot time. Allocations from this region are made with a separate set of calls, with the core API being:
struct page *alloc_pages_mask(gfp_t gfp, unsigned size, u64 mask);
void __free_pages_mask(struct page *page, unsigned size);
void *get_pages_mask(gfp_t gfp, unsigned size, u64 mask);
void free_pages_mask(void *mem, unsigned size);
alloc_pages_mask() looks a lot like the longstanding alloc_pages() function, but there are some important differences. The size parameter is the desired size of the allocation, rather than the "order" value used by alloc_pages(), and mask describes the range of usable addresses for this allocation. Though mask looks like a bitmask, it is really better understood as the maximum address value that the allocated memory should have; "holes" in the mask would make no sense.

A call to alloc_pages_mask() will first attempt to allocate the requested memory using the normal Linux memory allocator, on the assumption that the reserved DMA memory is an especially limited resource. If the allocation fails, perhaps because there's no physically-contiguous chunk of sufficient size available, then the allocator will dip into the reserved DMA pool. If the normal allocation succeeds, though, the allocated memory must still be tested against the maximum allowable address: the normal memory allocator, remember, has no support for allocating below an arbitrary address. So if the returned memory is out of bounds, it must be immediately freed and the reserved pool will be used instead.

That reserved pool is not managed like the rest of memory. Rather than the buddy lists maintained by the page allocator, the DMA allocator has a simple bitmap describing which pages are available. It will normally cycle through the entire memory region, allocating the next available chunk of sufficient size. If that chunk is above the memory limit, though, the allocator will move back to the lower end of the reserved pool and allocate from there instead. Since DMA allocations tend to be short-lived, one would expect that a suitable block of memory would either be available or become available in the near future.

One other difference of note is that, unlike the page allocator, the DMA allocator does not round memory allocation sizes up to the next power of two.
DMA allocations can be relatively large, so that rounding can result in significant internal fragmentation and memory waste. At the next level up, Andi has added a new form of mempool which uses the DMA allocator:
mempool_t *mempool_create_pool_pmask(int min_nr, int size, u64 mask);
This pool will behave like normal mempools, with the exception that all allocations will be below the limit passed in as mask. These pools are used in the block layer, where memory allocations for DMA must succeed.

One might object that reserving a big chunk of low memory for this purpose reduces the total amount of memory available to the system - especially if the DMA allocator is cherry-picking normal memory whenever it can anyway. But the cost is not as bad as one might think. These patches do away with the old DMA zone, which, for all practical purposes, was already managed as a reserved (and often unused) memory area. Some 64-bit architectures also set aside a significant chunk (around 64MB) of low memory for the swiotlb - essentially a set of bounce buffers used for impedance matching between high memory (>4GB) buffers and devices which cannot handle more than 32-bit addresses. With Andi's patch set, the swiotlb, too, makes allocations from the DMA area and no longer has its own dedicated memory pool. So the total amount of memory set aside for I/O will not change very much; it could, in fact, get smaller.

For most driver authors, there will be little in the way of required changes if this patch set gets merged. The DMA layer already allows drivers to specify an address mask with dma_set_mask(); with the DMA allocator in place, that mask will be better observed. The one change which might affect a few drivers is further down the line: eventually the GFP_DMA memory allocation flag will go away. Any driver which still uses this flag should set a proper mask instead.

So far, there has been little discussion resulting from the posting of these patches. Silence does not mean assent, of course, but it would appear that there is little opposition to this set of changes.
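The reserved-pool behavior described in this article can be pictured with a small user-space model. This is a minimal sketch under stated assumptions, not Andi's implementation: the names mask_pool, pool_alloc() and pool_free(), the 64-page pool and the 4096-byte page size are all invented for illustration, and a plain array of flags stands in for the real bitmap. It shows only the next-fit search, the move back to the bottom of the pool when the mask rules out higher pages, and the absence of power-of-two rounding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_PAGES 64u    /* hypothetical pool size, in pages */
#define PAGE_BYTES 4096u  /* assumed page size */

struct mask_pool {
    uint64_t base;          /* address of the first page in the pool */
    bool used[POOL_PAGES];  /* stand-in for the allocator's bitmap */
    unsigned next;          /* page index where the next search starts */
};

/* Whole pages needed for size bytes; note: no power-of-two rounding. */
static unsigned pages_for(unsigned size)
{
    return size / PAGE_BYTES + (size % PAGE_BYTES != 0);
}

/* Count the pool pages that lie entirely at or below the address in mask. */
static unsigned pages_below(const struct mask_pool *p, uint64_t mask)
{
    if (mask < p->base)
        return 0;
    uint64_t span = mask - p->base;
    uint64_t n = span / PAGE_BYTES + (span % PAGE_BYTES == PAGE_BYTES - 1);
    return n > POOL_PAGES ? POOL_PAGES : (unsigned)n;
}

/* Find npages free contiguous pages within [from, to); -1 if there are none. */
static int find_run(const struct mask_pool *p, unsigned from, unsigned to,
                    unsigned npages)
{
    unsigned run = 0;
    for (unsigned i = from; i < to; i++) {
        run = p->used[i] ? 0 : run + 1;
        if (run == npages)
            return (int)(i + 1 - npages);
    }
    return -1;
}

/* Allocate size bytes whose addresses all fall at or below mask. */
static bool pool_alloc(struct mask_pool *p, unsigned size, uint64_t mask,
                       uint64_t *addr)
{
    assert(size > 0);
    unsigned npages = pages_for(size);
    unsigned limit = pages_below(p, mask);
    int idx = -1;

    if (p->next < limit)            /* carry on from the last allocation */
        idx = find_run(p, p->next, limit, npages);
    if (idx < 0)                    /* otherwise go back to the low end */
        idx = find_run(p, 0, limit, npages);
    if (idx < 0)
        return false;

    for (unsigned i = 0; i < npages; i++)
        p->used[(unsigned)idx + i] = true;
    p->next = (unsigned)idx + npages;
    *addr = p->base + (uint64_t)idx * PAGE_BYTES;
    return true;
}

/* Release an allocation; as with free_pages_mask(), the caller gives the size. */
static void pool_free(struct mask_pool *p, uint64_t addr, unsigned size)
{
    unsigned first = (unsigned)((addr - p->base) / PAGE_BYTES);
    unsigned npages = pages_for(size);
    for (unsigned i = 0; i < npages; i++)
        p->used[first + i] = false;
}
```

In this model, a three-page request followed by a 4097-byte request places the second allocation at page 3, directly after the first; an order-based allocator would have used four pages for the first request and four more for the second.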
How to use a terabyte of RAM

We have not yet reached a point where systems - even high-end boxes - come with a terabyte of installed memory. But products like those from Violin Memory make it clear that the day is coming; one can buy a Violin box with 500GB in it now. So it seems worth asking the question: once one has spent the not inconsiderable sum to buy a box like that, what does one do with all that memory - especially now that the Firefox developers have gotten serious about fixing memory leaks?

Perhaps it's time for some wild ideas. And there is no better source for such ideas than Daniel Phillips, whose Ramback patch has stirred up a bit of discussion this week. The core idea behind Ramback is that all of that memory is turned into a ramdisk, but with a persistent device attached to it. In normal conditions, all application I/O involves only the ramdisk, and is, thus, quite fast ("Every little factor of 25 performance increase really helps."). In the background, the kernel worries about synchronizing data from the ramdisk onto permanent storage. But the synchronization process is mostly concerned with I/O performance, rather than providing guarantees about just when any given block will make it onto the disk platters.

Ramback thus differs from the normal block I/O caching done by the kernel in a number of ways. It keeps the entire device in memory, so that, in steady-state operation, applications need never encounter a disk I/O delay. Should an application call fsync(), the expected result (blocking until the data is written to physical media) will not happen. Filesystems take great care to order operations in a way that minimizes the risk of data loss in a crash; Ramback ignores all of that and writes data to physical media in whatever order it decides is best. As Daniel put it, the "most basic principle" of Ramback's design is:
[T]he backing store is not expected to represent a consistent
filesystem state during normal operation. Only the ramdisk needs
to maintain a consistent state, which I have taken care to ensure.
You just need to believe in your battery, Linux and the hardware it
runs on. Which of these do you mistrust?
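The division of labor behind that principle can be pictured with a toy model. The sketch below is not Ramback code: the ramback_sketch structure, the rb_* function names, and the tiny block counts are assumptions made up for illustration. It shows only the core idea that reads and writes touch memory alone, while a separate pass copies dirty blocks to the backing store in an order unrelated to the order in which they were written.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 8u      /* hypothetical device size, in blocks */
#define BLOCK_SIZE 16u  /* hypothetical block size, in bytes */

struct ramback_sketch {
    unsigned char ram[NBLOCKS][BLOCK_SIZE];      /* the ramdisk; always current */
    unsigned char backing[NBLOCKS][BLOCK_SIZE];  /* stand-in for the disk */
    bool dirty[NBLOCKS];                         /* blocks not yet on "disk" */
};

/* Application write: touches memory only and returns immediately. */
static void rb_write(struct ramback_sketch *d, unsigned blk, const void *data)
{
    assert(blk < NBLOCKS);
    memcpy(d->ram[blk], data, BLOCK_SIZE);
    d->dirty[blk] = true;
}

/* Application read: always served from the ramdisk. */
static const unsigned char *rb_read(const struct ramback_sketch *d, unsigned blk)
{
    assert(blk < NBLOCKS);
    return d->ram[blk];
}

/*
 * Background pass: copy at most budget dirty blocks to the backing store,
 * lowest block number first, with no regard for the order in which the
 * writes arrived. Returns the number of blocks copied.
 */
static unsigned rb_sync_some(struct ramback_sketch *d, unsigned budget)
{
    unsigned done = 0;
    for (unsigned i = 0; i < NBLOCKS && done < budget; i++) {
        if (d->dirty[i]) {
            memcpy(d->backing[i], d->ram[i], BLOCK_SIZE);
            d->dirty[i] = false;
            done++;
        }
    }
    return done;
}

/* Emergency pass (for example, on a power-loss signal): flush everything. */
static unsigned rb_emergency_flush(struct ramback_sketch *d)
{
    return rb_sync_some(d, NBLOCKS);
}
```

If block 5 is written before block 2 and the background pass then runs with a budget of one, block 2 reaches the backing store first. That is the reordering which makes the backing store inconsistent until a full flush has completed.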
Ramback does include an emergency mode which will endeavor to bring the disk up to date in a hurry should the UPS indicate that power has been lost. But that does not seem to be enough for everybody. In the resulting discussion, nobody complained about the sort of performance benefits that a tool like Ramback could provide. But there was a lot of concern about data integrity; it seems that many people distrust their battery, their hardware, and Linux. And that has led to a sort of impasse, with several developers claiming that Ramback would be too risky to use and Daniel dismissing their concerns as FUD. FUD or not, those concerns are likely to be a difficult barrier for Ramback to overcome. Meanwhile, Daniel is looking for people to help test out the code, but that presents challenges of its own:
This driver is ready to try for a sufficiently brave developer. It
will deadlock and livelock in various ways and you will have to
reboot to remove it. But it can already be coaxed into running
well enough for benchmarks, and when it solidifies it will be
pretty darn amazing.
So far, reports from suitably courageous testers have been, well, scarce. Your editor fears that this work could suffer the same fate as many of Daniel's other patches: they can contain brilliant ideas and great coding but just don't quite survive the encounter with the real, messy world. But we need people thinking about how our systems will work in the coming years; one hopes that Daniel won't stop.
GCC 4.3.0 exposes a kernel bug

A change to GCC for a recent release, coupled with a kernel bug, has created a messy situation, with possible security implications. GCC changed some assumptions about x86 processor flags, in accordance with the ABI standard, that can lead to memory corruption for programs built with GCC 4.3.0. No one has come up with a way to exploit the flaw, at least yet, but it clearly is a problem that needs to be addressed.

The problem revolves around the x86 direction flag (DF), which governs whether block memory operations operate forward through memory or backwards. The main use for the flag is to support overlapping memory copies, where working backwards through memory may be required so that the data being copied does not get overwritten as the copy progresses. Debian hacker Aurélien Jarno reported the problem, which was found when building Steel Bank Common Lisp (SBCL) using the new compiler, to linux-kernel on March 5th.
GCC's most recent release, 4.3.0, assumes that the direction flag has been cleared (i.e. memory operations go in a forward direction) at the entry of each function, as is specified by the ABI (which is, somewhat amusingly, found at sco.com [PDF]). Unfortunately, this clashes with Linux signal handlers, which get called, incorrectly, with the flag in whatever state it was in when the signal occurred. This has the effect of leaking one bit of state from the user space process that was running when the signal occurred to the signal handler. That, in itself, is a bug, seemingly with fairly minimal impact.

Prior to 4.3, GCC would emit a cld (clear direction flag) opcode before doing inline string or memory operations, so those operations would start from a known state. In 4.3, GCC relies on the ABI mandate that the direction flag is cleared before entry to a function, which means that the kernel needs to arrange that before calling a signal handler. It currently doesn't, but a small patch fixes that.

The window of vulnerability is small, but was observed in SBCL. The sequence of events that would lead to memory corruption is as follows: a program sets the direction flag for a backward memory operation; a signal arrives before the flag has been cleared again; the kernel calls the signal handler with the flag still set; and the handler, built with GCC 4.3.0, performs an inline string or memory operation which runs backward through memory instead of forward.
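The effect of a stale direction flag on a copy that was meant to run forward can be illustrated with a small software model. This is a sketch, not kernel or GCC code: rep_movsb_model() imitates the stepping of the x86 "rep movsb" instruction on an ordinary array, and df_on_handler_entry() is a hypothetical stand-in for the choice the kernel makes (or fails to make) before calling a signal handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of "rep movsb": copy count bytes from mem[src] to mem[dst]. Both
 * indices step upward when df is clear and downward when df is set.
 */
static void rep_movsb_model(unsigned char *mem, size_t dst, size_t src,
                            size_t count, bool df)
{
    assert(mem != NULL);
    while (count-- > 0) {
        mem[dst] = mem[src];
        if (df) {
            dst--;
            src--;
        } else {
            dst++;
            src++;
        }
    }
}

/*
 * DF as seen by a signal handler: a kernel that clears the flag hands the
 * handler DF=0; one that does not leaks whatever the interrupted code had.
 */
static bool df_on_handler_entry(bool df_at_interrupt, bool kernel_clears_df)
{
    return kernel_clears_df ? false : df_at_interrupt;
}
```

Copying four bytes from offset 8 to offset 4 of the buffer "0123456789AB" yields "012389AB89AB" when the flag is clear. With the flag set, the same call yields "0567856789AB": only the first destination byte is written as intended, and the three bytes below the destination are overwritten.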
So now the question is what to do about it. It is clear that the kernel should not leak the DF state to signal handlers, regardless of what GCC does. It is interesting to note that this behavior is the same (DF is not cleared on entry to a signal handler) on BSD kernels, leading some to claim that it is the ABI that is incorrect and that GCC should revert to its old behavior. Solaris kernels do clear the DF before calling signal handlers.

This problem has existed for 15 years; GCC has always emitted code that worked correctly on kernels that did not follow the ABI, until now. Part of the problem is that there are an enormous number of installed kernels that are vulnerable to this problem, but only if GCC 4.3 is installed. That version of GCC is not, yet, in widespread use, so the thinking is that GCC should revert its behavior now, before it gets into distributions. As kernels with the fix become more widespread, the "proper" behavior could be restored.

The GCC folks don't necessarily see it that way, so it is unclear what will happen. While it is true that distributors can control what kernel version and GCC version they ship, those aren't the only ways that either GCC or GCC-compiled binaries get installed. It is a bit of a ticking time bomb for random memory corruption at a minimum. Handling those bug reports will be very difficult and time-consuming.

While the new behavior of GCC is correct, and the kernel is broken, it would be very helpful to back out this change, perhaps providing the new behavior via a command-line argument for those who are sure their binaries will be running on patched kernels. Some discussion on the gcc-devel list would indicate that a GCC 4.3.0.1 or 4.3.1 may be forthcoming.
Page editor: Jonathan Corbet
Copyright © 2008, Eklektix, Inc.