Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.24-rc2, released, somewhat belatedly, on November 6. "There was nothing in particular holding this thing up, I just basically just forgot to cut a -rc2 release last week." Patches merged since -rc1 are mostly fixes, but there are also some DCCP improvements, a flag to silence warnings about use of deprecated interfaces, an asynchronous event notification API for SCSI/SATA (it tells the system when a removable disk has been inserted), a Japanese translation of the SubmittingPatches document, an ATA link power management API, and a bit more x86 unification work. See the short-form changelog for a list of patches, or the long-form changelog for the details.

As of this writing, no patches have found their way into the mainline git repository since the -rc2 release.

For older kernels: and came out on November 2 and 5, respectively. These releases contain a number of patches, including one which is security-related for both people running Minix filesystems. Greg Kroah-Hartman has recently said that, contrary to previous indications, the 2.6.22.x series will continue for a while yet. and were released on November 1 and 5, respectively. They contain quite a few fixes, several of which have vulnerability numbers associated with them.

Kernel development news

Quotes of the week

We've already got way too many incomplete concepts and APIs in the kernel. Maybe i'm over-worrying, but i fear we end up like with capabilities or sendfile - code merged too soon and never completed for many years - perhaps never completed at all. VMS and WNT did those things a bit better i think - their API frameworks were/are pervasive and complete, even in the corner cases.
-- Ingo Molnar

[W]hat concerns me is that stringbuf was good, but not great. Yet I always think of the kernel as a bastion of really good C code and practice, carefully honed by thoughtful coders. Here even the unmeasured optimization attempts show a lack of effort on the part of experienced kernel coders.
-- Rusty Russell

What makes the code in the kernel so great is not that it goes in perfect, it's that we whittle all code in the tree down over and over again until it reaches it's perfect form. It is this whittling system that allows us to thrust 25MB of changes into a tree over a week and a half and it all works out.
-- David Miller

Memory management for graphics processors

By Jonathan Corbet
November 6, 2007
The management of video hardware has long been an area of weakness in the Linux system (and free operating systems in general). The X Window System tends to get a lot of the blame for problems in this area, but the truth of the matter is that the problems are more widespread and the kernel has never made it easy for X to do this job properly. Graphics processors (GPUs) have gotten steadily more powerful, to the point that, by some measures, they are the fastest processor on most systems, but kernel support for the programming of GPUs has lagged behind. A lot of work is being done to remedy this situation, though, and an important component of that work has just been put forward for inclusion into the mainline kernel.

Once upon a time, video memory comprised a simple frame buffer from which pixels were sent to the display; it was up to the system's CPU to put useful data into that frame buffer. With contemporary GPUs, the memory situation has gotten more complex; a typical GPU can work with a few different types of memory:

  • Video RAM (VRAM) is high-speed memory installed directly on the video card. It is usually visible on the system's PCI bus, but that need not be the case. There is likely to be a frame buffer in this memory, but many other kinds of data live there as well.

  • In many lower-end systems, the "video RAM" is actually a dedicated section of general-purpose system memory. That RAM is set aside for the use of the GPU and is not available for other purposes. Even adapters with their own VRAM may have a dedicated RAM region as well.

  • Video adapters contain a simple memory management unit (the graphics address remapping table or GART) which can be used to map various pages of system memory into the GPU's address space. The result is that, at any time, an arbitrary (scattered) subset of the system's RAM pages are accessible to the GPU.
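
The GART item above amounts to a small page table. The following is a toy Python model of that idea, not code from any driver: the `Gart` class, its method names, and the 4096-byte page size are all assumptions made for illustration. It shows how scattered system pages can appear contiguous in the GPU's address space.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


class Gart:
    """Toy remapping table: GPU page slot -> system RAM page frame."""

    def __init__(self, num_slots):
        self.table = [None] * num_slots  # None means "slot not mapped"

    def map_pages(self, first_slot, frames):
        # Scattered system frames become contiguous in GPU address space.
        for offset, frame in enumerate(frames):
            self.table[first_slot + offset] = frame

    def translate(self, gpu_addr):
        # Split the GPU address into a table slot and an offset in the page.
        slot, offset = divmod(gpu_addr, PAGE_SIZE)
        frame = self.table[slot]
        if frame is None:
            raise LookupError("GPU address not mapped")
        return frame * PAGE_SIZE + offset


gart = Gart(num_slots=8)
gart.map_pages(2, [907, 13, 455])  # three scattered system pages
```

Here GPU pages 2 through 4 form one contiguous range, even though the backing frames (907, 13, 455) are nowhere near each other in system memory.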

Each type of video memory has different characteristics and constraints. Some are faster to work with (for the CPU or the GPU) than others. Some types of VRAM might not be directly addressable by the CPU. Memory may or may not be cache coherent - a distinction which requires careful programming to avoid data corruption and performance problems. And graphical applications may want to work with much larger amounts of video memory than can be made visible to the GPU at any given time.

All of this presents a memory management problem which, while being similar to the management needs of system RAM, has its own special constraints. So the graphics developers have been saying for years that Linux needs a proper manager for GPU-accessible memory. But, for years, we have done without that memory manager, with the result that this management task has been performed by an ugly combination of code in the X server, the kernel, and, often, in proprietary drivers. Happily, it would appear that those days are coming to an end, thanks to the creation of the translation-table maps (TTM) module written primarily by Thomas Hellstrom, Eric Anholt, and Dave Airlie. The TTM code provides a general-purpose memory manager aimed at the needs of GPUs and graphical clients.

The core object managed by TTM, from the point of view of user space, is the "buffer object." A buffer object is a chunk of memory allocated by an application, and possibly shared among a number of different applications. It contains a region of memory which, at some point, may be operated on by the GPU. A buffer object is guaranteed not to vanish as long as some application maintains a reference to it, but the location of that buffer is subject to change.

Once an application creates a buffer object, it can map that object into its address space. Depending on where the buffer is currently located, this mapping may require relocating the buffer into a type of memory which is addressable by the CPU (more accurately, a page fault when the application tries to access the mapped buffer would force that move). Cache coherency issues must be handled as well, of course.

There will come a time when this buffer must be made available to the GPU for some sort of operation. The TTM layer provides a special "validate" ioctl() to prepare buffers for processing; validating a buffer could, again, involve moving it or setting up a GART mapping for it. The address by which the GPU will access the buffer will not be known until it is validated; after validation, the buffer will not be moved out of the GPU's address space until it is no longer being operated on.

That means that the kernel has to know when processing on a given buffer has completed; applications, too, need to know that. To this end, the TTM layer provides "fence" objects. A fence is a special operation which is placed into the GPU's command FIFO. When the fence is executed, it raises a signal to indicate that all instructions enqueued before the fence have now been executed, and that the GPU will no longer be accessing any associated buffers. How the signaling works is very much dependent on the GPU; it could raise an interrupt or simply write a value to a special memory location. When a fence signals, any associated buffers are marked as no longer being referenced by the GPU, and any interested user-space processes are notified.
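
The fence mechanism can be pictured with a small simulation. This is a sketch only: `ToyGpu`, `Fence`, and their methods are hypothetical names, and the per-buffer reference counting is an assumption about one reasonable way to track which buffers are still in use, not a description of the TTM code.

```python
from collections import Counter, deque


class Fence:
    """Marker placed in the command FIFO; signals when it is reached."""

    def __init__(self, buffers):
        self.buffers = buffers  # buffers used by commands ahead of this fence
        self.signaled = False


class ToyGpu:
    def __init__(self):
        self.fifo = deque()
        self.in_use = Counter()  # buffer name -> pending references
        self._since_fence = []   # buffers referenced since the last fence

    def submit(self, command, buffers):
        for buf in buffers:
            self.in_use[buf] += 1
            self._since_fence.append(buf)
        self.fifo.append(command)

    def emit_fence(self):
        fence = Fence(self._since_fence)
        self._since_fence = []
        self.fifo.append(fence)
        return fence

    def run_one(self):
        item = self.fifo.popleft()
        if isinstance(item, Fence):
            # Everything queued before the fence has now executed.
            for buf in item.buffers:
                self.in_use[buf] -= 1
            item.signaled = True

    def is_busy(self, buf):
        return self.in_use[buf] > 0


gpu = ToyGpu()
gpu.submit("draw", ["vertices", "texture"])
fence = gpu.emit_fence()
gpu.run_one()                        # executes "draw"; fence not reached yet
still_busy = gpu.is_busy("texture")  # buffer is still considered in use
gpu.run_one()                        # executes the fence, which signals
```

Only after the fence itself is processed are the buffers released. That ordering is what lets the memory manager treat a signaled fence as permission to move them.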

A busy system might feature a number of graphical applications, each of which is trying to feed buffers to the GPU at any given time. It is not at all unlikely that the demands for GPU-addressable buffers will exceed the amount of memory which the GPU can actually reach. So the TTM layer will have to move buffers around in response to incoming requests. For GART-mapped buffers, it may be a simple matter of unmapping pages from buffers which are not currently validated for GPU operations. In other cases, the contents of the buffers may have to be explicitly copied to another type of memory, possibly using the GPU's hardware to do so. In such cases, each buffer must first be invalidated in the page tables of any user-space process which has mapped it, to ensure that the buffer will not be written to during the move. In other words, the TTM really does become an extension of the system's memory management code.
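
As a rough illustration of the GART-mapped case, the sketch below models a GPU-addressable region of limited size. The `Aperture` class and its methods are hypothetical, and the eviction order (oldest mapping first) is an assumption; the article does not say how TTM chooses its victims.

```python
class Aperture:
    """Toy GPU-addressable region holding a limited number of pages."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.resident = {}   # buffer name -> size in pages (insertion-ordered)
        self.pinned = set()  # buffers validated for a pending GPU operation

    def used(self):
        return sum(self.resident.values())

    def validate(self, name, pages):
        """Make a buffer GPU-accessible, evicting idle buffers if needed."""
        if name not in self.resident:
            for victim in list(self.resident):
                if self.used() + pages <= self.capacity:
                    break
                if victim not in self.pinned:
                    del self.resident[victim]  # unmap; contents stay in RAM
            if self.used() + pages > self.capacity:
                raise MemoryError("no room: remaining buffers are in use")
            self.resident[name] = pages
        self.pinned.add(name)

    def fence_signaled(self, name):
        self.pinned.discard(name)  # GPU is done; buffer may now be evicted


ap = Aperture(capacity_pages=8)
ap.validate("a", 4)
ap.validate("b", 4)
ap.fence_signaled("a")  # "a" is idle, "b" is still in use
ap.validate("c", 3)     # unmaps "a" to make room
```

Buffers that are still validated are never chosen as victims. If every resident buffer is in use, the request fails in this model, where a real manager would presumably wait for a fence to signal.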

The next question which is bound to come up is: what happens when graphical applications want to use more video memory than the system as a whole can provide? Normal system RAM pages which are used as video memory are locked in place (and unavailable for other uses), so there must be a clear limit on the number of such pages which can be created. The current solution to this problem is to cap the number of such pages at a fraction of the available low memory - up to 1GB on a 4GB, 32-bit system. It would be nice to be able to extend this memory by writing unused pages to swap, but the Linux swap implementation is not meant to work with pages owned by the kernel. The long-term plan would appear to be to let the X server create a large virtual range with mmap(), which would then be swappable. That functionality has not yet been implemented, though.

There is a lot more to the TTM code than has been described here; some more information can be found in this TTM overview [PDF]. For the time being, this code works with a patched version of the Intel i915 driver, with other drivers to be added in the future. TTM has been proposed for inclusion into -mm now and merger into the mainline for 2.6.25. The main issue between now and then will be the evaluation of the user-space API, which will be hard to change once this code is merged. Unfortunately, documentation for this API appears to be scarce at the moment.

Process IDs in a multi-namespace world

By Jonathan Corbet
November 6, 2007
Last week's article on containers discussed process ID namespaces. The purpose of these namespaces is to manage which processes are visible to a process inside a container. The heavy use of PIDs to identify processes has caused this particular patch to go through a long period of development before being merged for 2.6.24. It appears that there are some remaining issues, though, which could prevent this feature from being available in the next kernel release. As is often the case, the biggest problems come down to user-space API issues.

On November 1, Ingo Molnar pointed out that some questions raised by Ulrich Drepper back in early 2006 remained unanswered. These questions all have to do with what happens when the use of a PID escapes the namespace that it belongs to. There are a number of kernel APIs related to interprocess communication and synchronization where this could happen. Realtime signals carry process ID information, as do SYSV message queues. At best, making these interfaces work properly across PID namespaces will require that the kernel perform magic PID translations whenever a PID crosses a namespace boundary.

The biggest sticking point, though, would appear to be the robust futex mechanism, which uses PIDs to track which process owns a specific futex at any given time. One of the key points behind futexes is that the fast acquisition path (when there is no contention for the futex) does not require the kernel's involvement at all. But that acquisition path is also where the PID field is set. So there is no way to let the kernel perform magic PID translation without destroying the performance feature that was the motivation for futexes in the first place.
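
The difficulty is easier to see in a toy model of the uncontended path. The code below is a simulation under stated assumptions, not the real futex interface: `FutexWord`, `try_lock()`, and `owned_by()` are hypothetical names, and the compare-and-swap is an ordinary method standing in for an atomic CPU instruction.

```python
class FutexWord:
    """Toy stand-in for a 32-bit futex word in shared memory."""

    def __init__(self):
        self.value = 0  # 0 means unlocked; otherwise the owner's ID

    def compare_and_swap(self, expected, new):
        # Real code would use an atomic CPU instruction here.
        if self.value == expected:
            self.value = new
            return True
        return False


def try_lock(word, my_id):
    """Uncontended fast path: user space writes its own ID, no system call."""
    return word.compare_and_swap(0, my_id)


def owned_by(word, my_id):
    return word.value == my_id


shared = FutexWord()
# Hypothetical scenario: two processes in different PID namespaces
# both know themselves as ID 42 and share this futex word.
locked = try_lock(shared, 42)    # process in namespace A takes the lock
confused = owned_by(shared, 42)  # process in namespace B sees "its" ID
```

Since the kernel never sees the store in `try_lock()`, it has no opportunity to translate the ID. The process in the second namespace cannot tell that the value 42 refers to somebody else.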

Ingo, Ulrich, and others who are concerned about this problem would like to see the PID namespace feature completely disabled in the 2.6.24 release so that there will be time to come up with a proper solution. But it is not clear what form that solution would take, or if it is even necessary.

The approach seemingly favored by Ulrich is to eliminate some of the fine-grained control that the kernel currently provides over the sharing of namespaces. With the 2.6.24-rc1 interface, a process which calls clone() can request that the child be placed into a new PID namespace, but that other namespaces (filesystems, for example, or networking) be shared. That, says Ulrich, is asking for trouble:

This whole approach to allow switching on and off each of the namespaces is just wrong. Do it all or nothing, at least for the problematic ones like NEWPID. Having access to the same filesystem but using separate PID namespaces is simply not going to work.

Coalescing a number of the namespace options into a single "new container" bit would help with the current shortage of clone bits. But it might well not succeed in solving the API issues. Even processes with different filesystem namespaces might be able to find the same futex via a file visible in both namespaces. The passing of credentials over Unix-domain sockets could throw in an interesting twist. And it would seem that there are other places where PIDs are used that nobody has really thought of yet.

Another possible approach, one which hasn't really featured in the current debate, would be to create globally-unique PIDs which would work across namespaces. The current 32-bit PID value could be split into two fields, with the most significant bits indicating which namespace the PID (contained in the least significant bits) is defined in. Most of the time, only the low-order part of the PID would be needed; it would be interpreted relative to the current PID namespace. But, in places where it makes sense, the full, unique PID could be used. That would enable features like futexes to work across PID namespaces.
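
A minimal sketch of the split follows. The 10/22 division of bits and the function names are assumptions made for illustration; the proposal described above does not specify how the 32 bits would be divided.

```python
NS_BITS = 10                 # assumed width of the namespace field
LOCAL_BITS = 32 - NS_BITS    # remaining bits hold the per-namespace PID
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def make_global_pid(ns_id, local_pid):
    """Pack a namespace ID and a namespace-local PID into one 32-bit value."""
    if not 0 <= ns_id < (1 << NS_BITS):
        raise ValueError("namespace ID out of range")
    if not 0 <= local_pid <= LOCAL_MASK:
        raise ValueError("local PID out of range")
    return (ns_id << LOCAL_BITS) | local_pid


def split_global_pid(global_pid):
    """Recover (namespace ID, local PID) from a packed value."""
    return global_pid >> LOCAL_BITS, global_pid & LOCAL_MASK


pid_in_ns1 = make_global_pid(1, 42)
pid_in_ns2 = make_global_pid(2, 42)
```

Two processes that each see themselves as PID 42 get distinct global values, while masking with `LOCAL_MASK` recovers the familiar local PID. The cost is that both the number of namespaces and the size of each namespace's PID space become bounded by the chosen split.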

There are still problems, of course. The whole point of PID namespaces is to completely hide processes which are outside of the current namespace; the creation and use of globally-unique PIDs pokes holes in that isolation. And there are sure to be some complications in the user-space API which prove to be hard to paper over.

Then, there is the question of whether this problem is truly important or not. Linus thinks not, pointing out that the sharing of PIDs across namespaces is analogous to the use of PIDs in lock files shared across a network. PID-based locking does not work on NFS-mounted files, and PID-based interfaces will not work between PID namespaces. Linus concludes:

I don't understand how you can call this a "PID namespace design bug", when it clearly has nothing what-so-ever to do with pid namespaces, and everything to do with the *futexes* that blithely assume that pid's are unique and that made it part of the user-visible interface.

One could argue that the conflict with PID namespaces was known when the robust futex feature was merged and that something could have been done at that time. But that does not really help anybody now. And, in any case, there are issues beyond futexes.

PID namespaces are a significant complication of the user-space API; they redefine a basic value which has had a well-understood meaning since the early days of Unix. So it is not surprising that some interesting questions have come to light. Getting solid answers to nagging API questions has not always been the strongest point of the Linux development process, but things could always change. With luck and some effort, these questions can be worked through so that PID namespaces, when they become available, will have well-thought-out and well-defined semantics in all cases and will support the functionality that users need.

Page replacement for huge memory systems

By Jake Edge
November 7, 2007

As the amount of RAM installed in systems grows, it would seem that memory pressure should reduce, but, much like salaries or hard disk space, usage grows to fill (or overflow) the available capacity. Operating systems have dealt with this problem for decades by using virtual memory and swapping, but techniques that work well with 4 gigabyte address spaces may not scale well to systems with 1 terabyte. That scalability problem is at the root of several different ideas for changing the kernel, from supporting larger page sizes to avoiding memory fragmentation.

Another approach to scaling up the memory management subsystem was recently posted to linux-kernel by Rik van Riel. His patch is meant to reduce the amount of time the kernel spends looking for a memory page to evict when it needs to load a new page. He lists two main deficiencies of the current page replacement algorithm. The first is that it sometimes evicts the wrong page; this cannot be eliminated, but its frequency might be reduced. The second is the heart of what he is trying to accomplish:

The kernel scans over pages that should not be evicted. On systems with a few GB of RAM, this can result in the VM using an annoying amount of CPU. On systems with >128GB of RAM, this can knock the system out for hours since excess CPU use is compounded with lock contention and other issues.

A system with 1TB of 4K pages has 256 million pages to deal with. Searching through the pages stored on lists in the kernel can take an enormous amount of time. According to van Riel, most of that time is spent searching pages that won't be evicted anyway, so in order to deal with systems of that size, the search needs to focus in on likely candidates.
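
The page count is simple arithmetic: 2^40 bytes divided by 2^12 bytes per page gives 2^28 pages, or 268,435,456 ("256 million" in binary units). The snippet below works it through. The per-page scan cost is a made-up figure used only to give a sense of scale, not a measurement.

```python
TIB = 1 << 40        # one terabyte (binary), in bytes
PAGE_SIZE = 4 << 10  # 4K pages, in bytes

pages = TIB // PAGE_SIZE

# Hypothetical per-page scan cost, chosen only to give a sense of scale.
ASSUMED_NS_PER_PAGE = 100
full_scan_seconds = pages * ASSUMED_NS_PER_PAGE / 1e9

print(pages)
print(round(full_scan_seconds, 1))
```

Under that assumed cost, one pass over every page would take tens of seconds of CPU time, before counting the lock contention mentioned in the quote above.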

Linux tries to optimize its use of physical memory by keeping it full, using any memory not needed by processes to cache file data in the page cache. Determining which pages are not being used by processes and striking a balance between the page cache and process memory is the job of the page replacement algorithm. It is that algorithm that van Riel would eventually like to see replaced.

The current set of patches, though, takes a smaller step. In today's kernel, there are two lists of pages, active and inactive, for each memory zone. Pages move between them based on how recently they were used. When it is time to find a page to evict, the kernel searches the inactive list for candidates. In many cases, it is looking for page cache pages, particularly those that are unmodified and can simply be dropped, but has to wade through an enormous number of process-memory pages to find them.

The solution proposed is to break both lists apart, based on the type of page. Page cache pages (aka file pages) and process-memory pages (aka anonymous pages) will each live on their own active and inactive lists. When the kernel is looking for a specific type, it can choose the proper list to reduce the amount of time spent searching considerably.
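
The saving can be shown with a toy comparison. The list contents here are invented for the example and the function is a simplified stand-in for the kernel's scanning loop, but the effect matches the reasoning above: a search for file pages on a mixed list must step over every anonymous page in the way.

```python
def scan_for_file_pages(inactive, wanted):
    """Walk a list until `wanted` file pages are found.

    Returns (found pages, number of pages examined).
    """
    found, scanned = [], 0
    for page in inactive:
        if len(found) == wanted:
            break
        scanned += 1
        if page[0] == "file":
            found.append(page)
    return found, scanned


# Invented workload: 1000 anonymous pages ahead of 10 file pages.
mixed = [("anon", i) for i in range(1000)] + [("file", i) for i in range(10)]
_, scanned_mixed = scan_for_file_pages(mixed, 5)

# With split lists, the file pages live on a list of their own.
file_only = [page for page in mixed if page[0] == "file"]
_, scanned_split = scan_for_file_pages(file_only, 5)
```

In this contrived case the mixed list requires examining 1005 pages to find five file pages, while the split list requires examining five.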

This patch is an update to an earlier proposal by van Riel, covered here last March. The patch is now broken into ten parts, allowing for easier reviewing. It has also been updated to the latest kernel, modified to work with various features (like lumpy reclaim) that have been added in the interim.

Additional features are planned to be added down the road, as outlined on van Riel's page replacement design web page. Adding a non-reclaimable list for pages that are locked to physical memory with mlock(), or are part of a RAM filesystem and cannot be evicted, is one of the first changes listed. It makes little sense to scan past these pages.

Another feature that van Riel lists is to track recently evicted pages so that, if they get loaded again, the system can reduce the likelihood of another eviction. This should help keep pages in the page cache that get accessed somewhat infrequently, but are not completely unused. There are also various ideas about limiting the sizes of the active and inactive lists to put a bound on worst-case scenarios. van Riel's plans also include making better decisions about when to run the out-of-memory (OOM) killer as well as making it faster to choose its victim.
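
One way to picture the recently-evicted tracking is a bounded history of page identifiers. This is a guess at the general shape of such a mechanism rather than van Riel's design: `EvictionHistory`, its size limit, and the policy of starting a refaulted page on the active list are all assumptions.

```python
from collections import OrderedDict


class EvictionHistory:
    """Bounded record of recently evicted pages (hypothetical sketch)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.recent = OrderedDict()  # page ID -> True, oldest first

    def note_eviction(self, page_id):
        self.recent[page_id] = True
        self.recent.move_to_end(page_id)
        if len(self.recent) > self.max_entries:
            self.recent.popitem(last=False)  # forget the oldest entry

    def list_for_fault(self, page_id):
        """Pick the list a faulted-in page should start on."""
        if self.recent.pop(page_id, None):
            return "active"    # evicted recently: protect it this time
        return "inactive"


history = EvictionHistory(max_entries=2)
for page_id in (1, 2, 3):
    history.note_eviction(page_id)  # page 1 falls out of the history

placement_recent = history.list_for_fault(3)  # still remembered
placement_old = history.list_for_fault(1)     # forgotten long ago
```

Bounding the history keeps its own memory cost fixed, at the price of forgetting pages that were evicted too long ago to matter.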

Overall, it is a big change to how the page replacement code works today, which is why it will be broken up into smaller chunks. By making changes that add incremental improvements, and getting them into the hands of developers and testers, the hope is that the bugs can be shaken out more easily. Before that can happen, though, this set of patches must pass muster with the kernel hackers and be merged. The external user-visible impacts of these particular patches should be small, but they are fairly intrusive, touching a fair amount of code. In addition, memory management patches tend to have a tough path into the kernel.

Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds