Brief items
The current 2.6 kernel is still 2.6.7; the first 2.6.8 prepatch has
not yet been released.
Patches continue to accumulate in Linus's BitKeeper tree, however;
they include the new dma_get_required_mask() API (covered here last week), support for 64-bit Super-H
hardware (forward ported from 2.4), x86 no-execute support, asynchronous
I/O support for USB gadgets, a reworked symbolic link lookup
implementation (see below), a new "CPU mask" implementation, some read-copy-update
performance improvements, support for new Apple
PowerBooks, more sparse annotations, some netfilter improvements, some
kbuild work, a new wait_event_interruptible_exclusive() macro,
support for the O_NOATIME flag in the open() call, sysfs
knobs for tuning the CFQ I/O scheduler, mirroring and snapshot targets for
the device mapper, the removal of the PC9800 subarchitecture, reiserfs
data=journal support, preemptible kernel support for the PPC64
architecture, and many fixes and updates.
The current prepatch from Andrew Morton is 2.6.7-mm4. Recent additions to -mm include a
rearrangement of the x86 user-space memory layout (see below), some
preparatory work for software suspend on SMP systems, PCMCIA sysfs support,
and lots of fixes.
The current 2.4 prepatch is 2.4.27-rc2, which was released by Marcelo on June 26.
A relatively large number of patches (for a release candidate) went in;
they include a USB gadget driver update, a number of backported fixes for
potential security problems, an XFS update, a netfilter update, and various
fixes.
Kernel development news
![[memory layout diagram]](/images/ns/kernel/mmap1.png)
The traditional organization of the virtual address space (as seen from
user space, on x86 systems) is as shown in the diagram to the right. The
very bottom part of the address space is unused; it is there to catch NULL
pointers and such. Starting at 0x8000000 is the program text - the
read-only, executable code. The text is followed by the heap region, which is
the memory obtained via the brk() system call. Typically, functions like
malloc() obtain their memory from this area; non-automatic program data is
also stored there.
The heap differs from the first two regions in that it grows in response to
program needs. A program like cat will not make a lot of demands
on the heap (one hopes), while running a yum update can grow the
heap in a truly disturbing way. The heap can expand up to 1GB
(0x40000000), at which point it runs into the mmap area; this is where
shared libraries and other regions created by the mmap() system
call live. The mmap area, too, grows upward to accommodate new mappings.
Meanwhile, the kernel owns the last 1GB of address space, up at
0xc0000000. The kernel is inaccessible to user space, but it occupies that
portion of the address space regardless. Immediately below the kernel is
the stack region, where things like automatic variables live. The stack
grows downward. On a really bad day, the stack and the mmap area can run
into each other, at which point things start to fail.
This organization has worked for some time, but it does have a couple of
disadvantages. It fragments the address space, such that neither the heap
nor the mmap area can make use of the entire space. If one program makes
heavy use of the heap, it could run out of memory, even though a large
chunk of space is available between the mmap area and the stack. Normally,
not even yum can occupy that much heap, but there are other
applications out there which are up to that challenge.
As a way of making life safer for the true memory hogs out there, Ingo
Molnar has posted a patch which rearranges
user space along the lines of the revised diagram on the left. The mmap area has been
moved up to the top of the address space, and it now grows downward toward
the heap. As a result, the bulk of the address space is preserved in a
single, contiguous chunk which can be allocated to either the heap or mmap,
as the application requires.
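
The difference can be made concrete with a little arithmetic. The following is only a sketch of the two layouts as described above, not code from the patch; the function names, and the stack_reserve and mmap_used parameters, are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Boundaries taken from the description of 32-bit x86 above. */
#define MMAP_BASE_OLD 0x40000000u  /* traditional start of the mmap area */
#define KERNEL_BASE   0xc0000000u  /* the kernel owns everything from here up */

/*
 * Room left for heap growth in the traditional layout: the heap stops at
 * the fixed mmap base, however empty the space above the mappings is.
 */
uint32_t heap_room_old(uint32_t heap_end)
{
    assert(heap_end <= MMAP_BASE_OLD);
    return MMAP_BASE_OLD - heap_end;
}

/*
 * Room left in the rearranged layout: everything between the top of the
 * heap and the lowest mapping. The mappings are assumed to start just
 * below a reserved stack area and to occupy mmap_used bytes (both values
 * are hypothetical inputs, not figures from the patch).
 */
uint32_t heap_room_new(uint32_t heap_end, uint32_t stack_reserve,
                       uint32_t mmap_used)
{
    uint32_t mmap_bottom = KERNEL_BASE - stack_reserve - mmap_used;

    assert(heap_end <= mmap_bottom);
    return mmap_bottom - heap_end;
}
```

With a heap ending at 0x10000000, a 128MB stack reservation, and 128MB of mappings, the old layout leaves 0x30000000 bytes for heap growth while the new one leaves 0xa0000000, all of which is equally available to new mappings.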
As an added bonus, this organization reduces the amount of kernel memory
required to hold each process's page tables, since the fragment at
0x40000000 is no longer present.
There are a couple of disadvantages to this approach. One is that the
stack area is rather more confined than it used to be. The actual size of
the stack area is determined by the process's stack size resource limit,
with a sizable cushion added, so problems should be rare. The other
problem is that, apparently, a very small number of applications get
confused by the new layout. Any application which is sensitive to how
virtual memory is laid out is buggy to begin with; according to Arjan van de Ven, the most common
case is applications which store pointers in integer variables and then do
the wrong thing when they see a "negative" value.
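
A minimal sketch of that failure mode follows; the helper names are made up for illustration. Mappings near the old base at 0x40000000 have the top bit clear, while mappings placed just below 0xc0000000 have it set, so the same buggy code starts to see "negative" values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reinterpret a 32-bit user-space address as a signed integer, which is
 * effectively what code storing a pointer in an int does. Converting an
 * out-of-range value to a signed type is implementation-defined in C;
 * common compilers wrap around, and that behavior is assumed here.
 */
int32_t address_as_int(uint32_t addr)
{
    return (int32_t)addr;
}

/* Nonzero if a sign test on the stored "pointer" would see it as negative. */
int looks_negative(uint32_t addr)
{
    assert(addr != 0);  /* a NULL pointer is not an interesting case here */
    return address_as_int(addr) < 0;
}
```

Any address at or above 0x80000000 trips the test, so a check like "if (p < 0) fail();" that never fired under the old layout can fire under the new one.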
The fact is that most users will never notice the change; for a
demonstration, consider that Fedora kernels have been shipping with this
patch for some time. Even a vanilla Fedora Core 1 system has it; a
command like "cat /proc/self/maps" will show the new layout at
work. The patch is currently part of the -mm kernel, and will probably
find its way into the mainline before too long.
Last week's Kernel Page looked at various
DMA-related issues. One of those was the ability to make use of memory
located on I/O controllers for DMA operations. That work has taken a step
forward with this proposal from James Bottomley, which adds a new function
to the DMA API:
    int dma_declare_coherent_memory(struct device *dev,
                                    dma_addr_t bus_addr,
                                    dma_addr_t device_addr,
                                    size_t size, int flags);
This function tells the DMA code about a chunk of memory available on the
device represented by dev. The memory is size bytes
long; it is located at bus_addr from the bus's point of view, and
device_addr from the device's perspective. The flags
argument describes how the memory is to be used: whether it should be
mapped into the kernel's address space, whether children of the device can
use it, and whether it should be the only memory used by the device(s) for
DMA.
The actual patch implementing this API is still in the works. As of this
writing, there have been no real comments on it.
Meanwhile, a different DMA issue has been raised by the folks at nVidia,
who are trying to make their hardware work better on Intel's em64t (AMD64
clone) architecture. It is, it turns out, difficult to reliably use DMA on
devices which cannot handle 64-bit addresses.
Memory on (non-NUMA) Linux systems has traditionally been divided into
three zones. ZONE_DMA is the bottom 16MB; it is the only memory
which is accessible to ancient ISA peripherals and, perhaps, a few old PCI
cards which are simply a repackaging of ISA chipsets. ZONE_NORMAL
is all of the memory, outside of ZONE_DMA, which is directly accessible to
the kernel. On a typical 32-bit Linux system, ZONE_NORMAL extends
up to just under the first 1GB of physical memory. Finally,
ZONE_HIGHMEM is the "high memory" zone - the area which is not
directly accessible to the kernel.
This layout works reasonably well for DMA allocations on 32-bit systems.
Truly limited peripherals use memory taken from ZONE_DMA; most of
the rest work with ZONE_NORMAL memory. In the 64-bit world,
however, things are a little different. There is no need for high memory
on such systems, so ZONE_HIGHMEM simply does not exist, and
ZONE_NORMAL contains everything above ZONE_DMA. Having
(almost) all of main memory contained within ZONE_NORMAL
simplifies a lot of things.
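
The zone boundaries can be summarized in a few lines of user-space C. This is a sketch rather than kernel code: the function name is invented, and the 896MB figure is an assumption standing in for "just under the first 1GB" on 32-bit x86.

```c
#include <assert.h>
#include <stdint.h>

enum zone { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

#define DMA_LIMIT    (16ull << 20)   /* ZONE_DMA: the bottom 16MB */
#define NORMAL_LIMIT (896ull << 20)  /* assumed 32-bit x86 boundary */

/*
 * Return the zone holding a physical address. word_bits is 32 or 64 and
 * selects which of the two layouts described above applies.
 */
enum zone zone_for_phys(uint64_t phys, int word_bits)
{
    assert(word_bits == 32 || word_bits == 64);
    if (phys < DMA_LIMIT)
        return ZONE_DMA;
    if (word_bits == 64 || phys < NORMAL_LIMIT)
        return ZONE_NORMAL;   /* no high memory on 64-bit systems */
    return ZONE_HIGHMEM;
}
```

Note that, on a 64-bit system, an address at the 4GB mark lands in ZONE_NORMAL along with everything else above 16MB; the zone says nothing about whether a 32-bit-only device can reach it.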
Kernel memory allocations specify (implicitly or explicitly) the zone from
which the memory is to be obtained. On 32-bit systems, the DMA code can
simply specify a zone which matches the capabilities of the device and get
the memory it needs. On 64-bit systems, however, the memory zones no
longer align with the limitations of particular devices. So there is no
way for the DMA layer to request memory fitting its needs. The only
exception is ZONE_DMA, which is far more restrictive than
necessary.
On some architectures - notably AMD's x86_64 - an I/O memory management
unit (IOMMU) is provided. This unit remaps addresses between the
peripheral bus and main memory; it can make any region of physical
memory appear to exist in an area accessible by the device. Systems
equipped with an IOMMU thus have no problems allocating DMA memory - any
memory will do. Unfortunately, when Intel created its variant of the
x86_64 architecture, it decided to leave the IOMMU out. So devices running
on "Intel inside" systems work directly with physical memory addresses,
and, as a result, the more limited devices out there cannot access all of
physical memory. And, as we have seen, the kernel has trouble allocating
memory which meets their special needs.
One solution to this problem could be the creation of a new zone,
ZONE_BIGDMA, say, which would represent memory reachable with
32-bit addresses. Nobody much likes this approach, however; it involves
making core memory management changes to deal with the shortcomings of a
few processors. Balancing memory use between zones is a perennial
memory management headache, and adding more zones can only make things
worse. There is one other problem as well: some devices have strange DMA
limitations (a maximum of 29 bits, for example); creating a zone which
would work for all of them would not be easy.
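
To see why one zone cannot serve every device, consider the test each device's limit imposes. The helpers below are a hypothetical user-space sketch (these names are not kernel interfaces): a buffer is usable only if its last byte still fits within the device's address mask.

```c
#include <assert.h>
#include <stdint.h>

/* Build the address mask for a device that can drive `bits` address lines. */
uint64_t dma_mask_for_bits(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? UINT64_MAX : (1ull << bits) - 1;
}

/*
 * Nonzero if the whole buffer [phys, phys + len) is reachable through
 * the mask. The comparison is arranged to avoid overflow in phys + len.
 */
int dma_reachable(uint64_t phys, uint64_t len, uint64_t mask)
{
    assert(len > 0);
    return phys <= mask && len - 1 <= mask - phys;
}
```

A zone sized for 32-bit devices would still hand out buffers that a 29-bit device cannot reach, so a per-device check of this kind remains necessary whatever the zone layout.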
The Itanium architecture took a different approach, known as the "software
I/O translation buffer" or "swiotlb." The swiotlb code simply allocates a
large chunk of low memory early in the bootstrap process; this memory is
then handed out in response to DMA allocation requests. In many cases, use
of swiotlb memory involves the creation of "bounce buffers," where data is
copied between the driver's buffer and the device-accessible swiotlb
space. Memory used for the swiotlb is removed from the normal Linux
memory management mechanism and is, thus, inaccessible for any use other
than DMA buffers. For these reasons, the swiotlb is seen as, at best,
inelegant.
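
The bounce-buffer idea can be sketched in user space as follows. This is a toy model under stated assumptions, not the swiotlb implementation: the static pool stands in for the low memory reserved at boot, and the slot size, slot count, and function names are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE  2048
#define SLOT_COUNT 8

/* Stand-in for the chunk of low memory set aside at boot time. */
static unsigned char pool[SLOT_COUNT][SLOT_SIZE];
static int slot_busy[SLOT_COUNT];

/*
 * Claim a free slot and copy the driver's data into it, so that the
 * device sees the data at a reachable address. Returns the slot index,
 * or -1 if the pool is exhausted.
 */
int bounce_map(const void *buf, size_t len)
{
    assert(len <= SLOT_SIZE);
    for (int i = 0; i < SLOT_COUNT; i++) {
        if (!slot_busy[i]) {
            slot_busy[i] = 1;
            memcpy(pool[i], buf, len);
            return i;
        }
    }
    return -1;
}

/* The device-visible memory behind a claimed slot. */
unsigned char *bounce_addr(int slot)
{
    assert(slot >= 0 && slot < SLOT_COUNT && slot_busy[slot]);
    return pool[slot];
}

/* Copy whatever the device wrote back to the driver's buffer; free the slot. */
void bounce_unmap(int slot, void *buf, size_t len)
{
    assert(slot >= 0 && slot < SLOT_COUNT && slot_busy[slot]);
    assert(len <= SLOT_SIZE);
    memcpy(buf, pool[slot], len);
    slot_busy[slot] = 0;
}
```

Both costs mentioned above are visible here: every transfer pays for an extra copy in each direction, and the pool is unavailable for any other purpose even when no I/O is in flight.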
It is also, however, a solution which happens to work. The swiotlb can
also accommodate devices with strange DMA masks by searching until it finds
memory which fits. So the solution to the problem experienced by nVidia
(and others) is likely to be a simple expansion of the swiotlb space.
Carving a 128MB array out of main memory for full-time use as DMA buffers
may seem like a shocking waste, but, if you have enough memory that you're
having trouble with addresses requiring more than 32 bits, the cost of a
larger swiotlb will be hard to notice.
Linux has long limited filename lookups to a maximum of five chained
symbolic links. The limit is a useful way of dealing with symbolic link
loops, but that is not why it exists. Following symbolic links is an
inherently recursive task; once a link has been resolved, the new
destination can be another link, which starts the whole process from the
beginning. In general, recursion is frowned on in the kernel; the tight
limit on kernel stack space argues against allowing any sort of significant
call depth at all. The five-link limit was set because, if the limit were
higher, the kernel would risk overrunning the kernel stack when following
long chains.
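
The recursive shape of the problem can be illustrated with a toy resolver. This is a user-space sketch only: the link table and the function names are hypothetical, and the real lookup code is considerably more involved. Each link followed costs another level of recursion, so the depth limit is what bounds stack use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINK_DEPTH 5   /* the five-link limit discussed above */

/* A toy namespace: each entry says that `name` is a link to `target`. */
struct link {
    const char *name;
    const char *target;
};

static const struct link *find_link(const struct link *links, size_t n,
                                    const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(links[i].name, name) == 0)
            return &links[i];
    return NULL;
}

/*
 * Follow links until a non-link name is reached. Returns that name, or
 * NULL once the chain exceeds MAX_LINK_DEPTH (the kernel reports ELOOP
 * in that situation). Callers start with depth 0.
 */
const char *resolve(const struct link *links, size_t n,
                    const char *name, int depth)
{
    const struct link *l;

    assert(name != NULL);
    l = find_link(links, n, name);
    if (!l)
        return name;                 /* not a link: resolution is done */
    if (depth >= MAX_LINK_DEPTH)
        return NULL;                 /* chain too long: give up */
    return resolve(links, n, l->target, depth + 1);
}
```

A chain of five links resolves; a sixth causes the lookup to fail, whether or not the chain is actually a loop.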
Users do occasionally run into the five-link limit, and, of course, they
complain. The limit imposed by Linux is lower than that found on a number
of other Unix-like systems. So there has long been some motivation to
raise that limit somewhat.
Alexander Viro has finally done something about it. His approach was to
change the behavior of the filesystem follow_link() method
slightly. This method has traditionally been charged with finding the
target of a symbolic link, then calling back into the virtual filesystem
code (via vfs_follow_link()) to cause the next stage of resolution
to happen.
In the new scheme of things, the follow_link() method is still
free to do the whole job, so unmodified filesystems still work. But the
preferred technique is for the filesystem code to simply store the file
name for the link target in a place where the VFS code can find it and
return. The VFS can then make the vfs_follow_link() call itself.
This seems like a small change, but it has an important effect. The stack
frame of the filesystem's follow_link() method is now gone, since the method
has returned to the core VFS code. And the core code can use an
in-lined version of vfs_follow_link(), rather than calling it (with
its own stack frame) from the outside. As a result, two fewer stack frames
are required for every step in the resolution of the symbolic link.
Al figures that this change will enable raising the maximum link depth to
eight, or even higher (though there is probably little reason to go beyond
eight). That change has not yet happened - all of the filesystems will
need to be updated and the patch proven stable first. But the initial set
of patches has found its way into Linus's BitKeeper tree, so the process
is well underway.
Page editor: Jonathan Corbet