Kernel development
Brief items
Kernel release status
The current 2.6 kernel is still 2.6.7; the first 2.6.8 prepatch has not yet been released. Patches continue to accumulate in Linus's BitKeeper tree, however; they include the new dma_get_required_mask() API (covered here last week), support for 64-bit Super-H hardware (forward ported from 2.4), x86 no-execute support, asynchronous I/O support for USB gadgets, a reworked symbolic link lookup implementation (see below), a new "CPU mask" implementation, some read-copy-update performance improvements, support for new Apple PowerBooks, more sparse annotations, some netfilter improvements, some kbuild work, a new wait_event_interruptible_exclusive() macro, support for the O_NOATIME flag in the open() call, sysfs knobs for tuning the CFQ I/O scheduler, mirroring and snapshot targets for the device mapper, the removal of the PC9800 subarchitecture, reiserfs data=journal support, preemptible kernel support for the PPC64 architecture, and many fixes and updates.
The current prepatch from Andrew Morton is 2.6.7-mm4. Recent additions to -mm include a rearrangement of the x86 user-space memory layout (see below), some preparatory work for software suspend on SMP systems, PCMCIA sysfs support, and lots of fixes.
The current 2.4 prepatch is 2.4.27-rc2, which was released by Marcelo on June 26. A relatively large number of patches (for a release candidate) went in; they include a USB gadget driver update, a number of backported fixes for potential security problems, an XFS update, a netfilter update, and various fixes.
Kernel development news
Reorganizing the address space
![memory layout diagram](https://static.lwn.net/images/ns/kernel/mmap1.png)
In the classic layout (shown in the diagram above), the program's text and data sit near the bottom of the virtual address space, with the heap beginning just above them. The heap differs from those first two regions in that it grows in response to program needs. A program like cat will not make a lot of demands on the heap (one hopes), while running a yum update can grow the heap in a truly disturbing way. The heap can expand up to the 1GB boundary (0x40000000), at which point it runs into the mmap area; this is where shared libraries and other regions created by the mmap() system call live. The mmap area, too, grows upward to accommodate new mappings.
Meanwhile, the kernel owns the last 1GB of address space, up at 0xc0000000. The kernel is inaccessible to user space, but it occupies that portion of the address space regardless. Immediately below the kernel is the stack region, where things like automatic variables live. The stack grows downward. On a really bad day, the stack and the mmap area can run into each other, at which point things start to fail.
This organization has worked for some time, but it does have a couple of disadvantages. It fragments the address space, so that neither the heap nor the mmap area can make use of the entire space. A program which makes heavy use of the heap can exhaust the space below the mmap area, even though a large chunk of address space remains free between the mmap area and the stack. Normally, not even yum can occupy that much heap, but there are other applications out there which are up to that challenge.
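As a concrete (if simplified) way to see these regions, a small test program can print the address of a heap allocation, of an anonymous mmap() region, and of a stack variable. Under the classic layout, the mmap address comes out near 0x40000000 and the stack address just below 0xc0000000. This sketch is purely illustrative; it is not part of the patch discussed below.

```c
/* Illustrative only: print rough locations of the heap, an mmap()
 * region, and the stack.  The addresses shift under the rearranged
 * layout discussed below. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
        int on_stack;
        void *on_heap = malloc(16);
        void *mapped = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (on_heap == NULL || mapped == MAP_FAILED)
                return 1;

        printf("heap  : %p\n", on_heap);
        printf("mmap  : %p\n", mapped);
        printf("stack : %p\n", (void *)&on_stack);

        munmap(mapped, 4096);
        free(on_heap);
        return 0;
}
```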
As a way of making life safer for the true memory hogs out there, Ingo Molnar has posted a patch which rearranges user space: the mmap area is moved up to the top of the address space, just below the stack, and it now grows downward toward the heap. As a result, the bulk of the address space is preserved in a single, contiguous chunk which can be allocated to either the heap or mmap, as the application requires.
As an added bonus, this organization reduces the amount of kernel memory required to hold each process's page tables, since the fragment at 0x40000000 is no longer present.
There are a couple of disadvantages to this approach. One is that the stack area is rather more confined than it used to be. The actual size of the stack area is determined by the process's stack size resource limit, with a sizable cushion added, so problems should be rare. The other problem is that, apparently, a very small number of applications get confused by the new layout. Any application which is sensitive to how virtual memory is laid out is buggy to begin with; according to Arjan van de Ven, the most common case is applications which store pointers in integer variables and then do the wrong thing when they see a "negative" value.
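To see that failure mode, consider (purely as an illustration, not code from any real application) a program which stashes a pointer in a signed integer. Under the old layout, a 32-bit mmap address such as 0x40001000 stays positive; under the new one, it lands near the top of the address space and compares as negative.

```c
/* Illustration of the buggy pattern: storing a pointer in a signed
 * integer.  On a 32-bit system with the new layout, mmap() returns
 * addresses near the top of the address space, so the stored value
 * has its high bit set and tests as "negative". */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        long stored = (long)p;  /* buggy programs use plain int on 32-bit */

        if (p == MAP_FAILED)
                return 1;
        if (stored < 0)
                printf("%p looks negative when treated as a signed integer\n", p);
        else
                printf("%p looks non-negative\n", p);

        munmap(p, 4096);
        return 0;
}
```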
The fact is that most users will never notice the change; for a
demonstration, consider that Fedora kernels have been shipping with this
patch for some time. Even a vanilla Fedora Core 1 system has it; a
command like "cat /proc/self/maps" will show the new layout at
work. The patch is currently part of the -mm kernel, and will probably
find its way into the mainline before too long.
DMA issues, part 2
Last week's Kernel Page looked at various DMA-related issues. One of those was the ability to make use of memory located on I/O controllers for DMA operations. That work has taken a step forward with this proposal from James Bottomley, which adds a new function to the DMA API:
int dma_declare_coherent_memory(struct device *dev, dma_addr_t bus_addr, dma_addr_t device_addr, size_t size, int flags);
This function tells the DMA code about a chunk of memory available on the device represented by dev. The memory is size bytes long; it is located at bus_addr from the bus's point of view, and device_addr from the device's perspective. The flags argument describes how the memory is to be used: whether it should be mapped into the kernel's address space, whether children of the device can use it, and whether it should be the only memory used by the device(s) for DMA.
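As a rough sketch of how a driver with on-board memory might use the call, consider the following. Everything here beyond the signature above is an assumption: the flag names (DMA_MEMORY_MAP, DMA_MEMORY_EXCLUSIVE), the hypothetical "mydev" constants, and the return-value convention.

```c
#include <linux/device.h>
#include <linux/dma-mapping.h>

#define MYDEV_MEM_BUS_ADDR  0xd0000000          /* hypothetical bus address */
#define MYDEV_MEM_DEV_ADDR  0x00000000          /* hypothetical device-local address */
#define MYDEV_MEM_SIZE      (1024 * 1024)       /* 1MB of on-board memory */

static int mydev_declare_memory(struct device *dev)
{
        int ret;

        /* Tell the DMA layer about the memory on the card: map it into
         * kernel space and use it exclusively for this device's DMA. */
        ret = dma_declare_coherent_memory(dev, MYDEV_MEM_BUS_ADDR,
                                          MYDEV_MEM_DEV_ADDR, MYDEV_MEM_SIZE,
                                          DMA_MEMORY_MAP | DMA_MEMORY_EXCLUSIVE);

        /* Assume, for this sketch, that zero indicates failure; the
         * proposal as described does not spell out the return convention. */
        if (!ret)
                return -ENOMEM;

        /* Subsequent coherent DMA allocations for this device would then
         * be satisfied from the declared on-device memory. */
        return 0;
}
```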
The actual patch implementing this API is still in the works. As of this writing, there have been no real comments on it.
Meanwhile, a different DMA issue has been raised by the folks at nVidia, who are trying to make their hardware work better on Intel's em64t (AMD64 clone) architecture. It is, it turns out, difficult to reliably use DMA on devices which cannot handle 64-bit addresses.
Memory on (non-NUMA) Linux systems has traditionally been divided into three zones. ZONE_DMA is the bottom 16MB; it is the only memory which is accessible to ancient ISA peripherals and, perhaps, a few old PCI cards which are simply a repackaging of ISA chipsets. ZONE_NORMAL is all of the memory, outside of ZONE_DMA, which is directly accessible to the kernel. On a typical 32-bit Linux system, ZONE_NORMAL extends up to just under the first 1GB of physical memory. Finally, ZONE_HIGHMEM is the "high memory" zone - the area which is not directly accessible to the kernel.
This layout works reasonably well for DMA allocations on 32-bit systems. Truly limited peripherals use memory taken from ZONE_DMA; most of the rest work with ZONE_NORMAL memory. In the 64-bit world, however, things are a little different. There is no need for high memory on such systems, so ZONE_HIGHMEM simply does not exist, and ZONE_NORMAL contains everything above ZONE_DMA. Having (almost) all of main memory contained within ZONE_NORMAL simplifies a lot of things.
Kernel memory allocations specify (implicitly or explicitly) the zone from which the memory is to be obtained. On 32-bit systems, the DMA code can simply specify a zone which matches the capabilities of the device and get the memory it needs. On 64-bit systems, however, the memory zones no longer align with the limitations of particular devices. So there is no way for the DMA layer to request memory fitting its needs. The only exception is ZONE_DMA, which is far more restrictive than necessary.
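On 32-bit systems, that zone selection is expressed through the allocation flags; here is a minimal sketch using ordinary kernel allocation calls (not code from any patch under discussion):

```c
#include <linux/slab.h>

static void zone_selection_example(void)
{
        /* An ordinary allocation; on a 32-bit system this is satisfied
         * from ZONE_NORMAL. */
        void *buf = kmalloc(4096, GFP_KERNEL);

        /* GFP_DMA restricts the allocator to ZONE_DMA, the bottom 16MB,
         * for devices which cannot address anything higher. */
        void *isa_buf = kmalloc(4096, GFP_KERNEL | GFP_DMA);

        kfree(isa_buf);
        kfree(buf);
}
```

The gap described above is that, on a 64-bit system, no comparable flag means "memory reachable with 32-bit addresses."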
On some architectures - notably AMD's x86_64 - an I/O memory management unit (IOMMU) is provided. This unit remaps addresses between the peripheral bus and main memory; it can make any region of physical memory appear to exist in an area accessible by the device. Systems equipped with an IOMMU thus have no problems allocating DMA memory - any memory will do. Unfortunately, when Intel created its variant of the x86_64 architecture, it decided to leave the IOMMU out. So devices running on "Intel inside" systems work directly with physical memory addresses, and, as a result, the more limited devices out there cannot access all of physical memory. And, as we have seen, the kernel has trouble allocating memory which meets their special needs.
One solution to this problem could be the creation of a new zone, ZONE_BIGDMA, say, which would represent memory reachable with 32-bit addresses. Nobody much likes this approach, however; it involves making core memory management changes to deal with the shortcomings of a few processors. Balancing memory use between zones is a perennial memory management headache, and adding more zones can only make things worse. There is one other problem as well: some devices have strange DMA limitations (a maximum of 29 bits, for example); creating a zone which would work for all of them would not be easy.
The Itanium architecture took a different approach, known as the "software I/O translation buffer" or "swiotlb." The swiotlb code simply allocates a large chunk of low memory early in the bootstrap process; this memory is then handed out in response to DMA allocation requests. In many cases, use of swiotlb memory involves the creation of "bounce buffers," where data is copied between the driver's buffer and the device-accessible swiotlb space. Memory used for the swiotlb is removed from the normal Linux memory management mechanism and is, thus, inaccessible for any use other than DMA buffers. For these reasons, the swiotlb is seen as, at best, inelegant.
It is also, however, a solution which happens to work. The swiotlb can also accommodate devices with strange DMA masks by searching until it finds memory which fits. So the solution to the problem experienced by nVidia (and others) is likely to be a simple expansion of the swiotlb space. Carving a 128MB array out of main memory for full-time use as DMA buffers may seem like a shocking waste, but, if you have enough memory that you're having trouble with addresses requiring more than 32 bits, the cost of a larger swiotlb will be hard to notice.
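From a driver's point of view, all of this is hidden behind the usual DMA mapping interface: the driver declares its addressing limit and maps buffers as it always has, and the kernel decides whether a bounce through the swiotlb is needed. A rough sketch using the standard 2.6 PCI DMA calls (not code from the nVidia discussion; error handling is abbreviated):

```c
#include <linux/pci.h>

static int mydev_send_buffer(struct pci_dev *pdev, void *buf, size_t len,
                             dma_addr_t *handle)
{
        /* Declare that this device can only generate 32-bit addresses. */
        if (pci_set_dma_mask(pdev, 0xffffffffULL))
                return -EIO;

        /* On an IOMMU-less x86_64 system, a buffer living above 4GB will
         * be copied ("bounced") through the swiotlb at this point. */
        *handle = pci_map_single(pdev, buf, len, PCI_DMA_TODEVICE);
        return 0;
}
```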
Supporting deeper symbolic links
Linux has long limited filename lookups to a maximum of five chained symbolic links. The limit is a useful way of dealing with symbolic link loops, but that is not why it exists. Following symbolic links is an inherently recursive task; once a link has been resolved, the new destination can be another link, which starts the whole process from the beginning. In general, recursion is frowned on in the kernel; the tight limit on kernel stack space argues against allowing any sort of significant call depth at all. The five-link limit was set because, if the limit were higher, the kernel would risk overrunning the kernel stack when following long chains.

Users do occasionally run into the five-link limit, and, of course, they complain. The limit imposed by Linux is lower than that found on a number of other Unix-like systems, so there has long been some motivation to raise it somewhat.
Alexander Viro has finally done something about it. His approach was to change the behavior of the filesystem follow_link() method slightly. This method has traditionally been charged with finding the target of a symbolic link, then calling back into the virtual filesystem code (via vfs_follow_link()) to cause the next stage of resolution to happen. In the new scheme of things, the follow_link() method is still free to do the whole job, so unmodified filesystems still work. But the preferred technique is for the filesystem code to simply store the file name for the link target in a place where the VFS code can find it and return. The VFS can then make the vfs_follow_link() call itself.
This seems like a small change, but it has an important effect. The filesystem's follow_link() method's stack frame is now gone, since that method has returned to the core VFS code before the next stage of the lookup begins. And the core code can use an inlined version of vfs_follow_link(), rather than calling it (with its own stack frame) from the outside. As a result, two fewer stack frames are required for every step in the resolution of the symbolic link.
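For a filesystem which stores the target name directly in the inode, the difference looks roughly like the sketch below. The "myfs" filesystem is made up, and the nd_set_link() helper name is an assumption about how the new interface spells "store the name where the VFS can find it"; treat this as an illustration of the idea rather than a quotation from the patches.

```c
#include <linux/fs.h>
#include <linux/namei.h>

/* Traditional style: the filesystem calls back into the VFS itself,
 * adding stack frames for each link in the chain. */
static int myfs_follow_link_old(struct dentry *dentry, struct nameidata *nd)
{
        char *target = (char *)dentry->d_inode->u.generic_ip;

        return vfs_follow_link(nd, target);
}

/* New style: record the target name and return; the VFS then performs
 * the next stage of the lookup itself, without the extra frames. */
static int myfs_follow_link(struct dentry *dentry, struct nameidata *nd)
{
        nd_set_link(nd, (char *)dentry->d_inode->u.generic_ip);
        return 0;
}
```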
Al figures that this change will enable raising the maximum link depth to eight, or even higher (though there is probably little reason to go beyond eight). That change has not yet happened - all of the filesystems will need to be updated and the patch proven stable first. But the initial set of patches has found its way into Linus's BitKeeper tree, so the process is well underway.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Networking
Miscellaneous
Page editor: Jonathan Corbet