Brief items

The 2.6.24-rc2 kernel was released, somewhat belatedly, on November 6. "There was nothing in particular holding this thing up, I just basically just forgot to cut a -rc2 release last week." Patches merged since -rc1 are mostly fixes, but there are also some DCCP improvements, a flag to silence warnings about the use of deprecated interfaces, an asynchronous event notification API for SCSI/SATA (it tells the system when a removable disk has been inserted), a Japanese translation of the SubmittingPatches document, an ATA link power management API, and a bit more x86 unification work. See the short-form changelog for a list of patches, or the long-form changelog for the details.
As of this writing, no patches have found their way into the mainline git repository since the -rc2 release.
For older kernels: 184.108.40.206 and 220.127.116.11 came out on November 2 and 5, respectively. These releases contain a number of patches, including (in both) a security-related fix for people running Minix filesystems. Greg Kroah-Hartman has recently said that, contrary to previous indications, the 2.6.22.x series will continue for a while yet.
Kernel development news
Once upon a time, video memory comprised a simple frame buffer from which pixels were sent to the display; it was up to the system's CPU to put useful data into that frame buffer. With contemporary GPUs, the memory situation has gotten more complex; a typical GPU can work with a few different types of memory: video RAM (VRAM) located on the graphics card itself, system RAM mapped into the GPU's address space through the graphics aperture remapping table (GART), and ordinary system RAM.
Each type of video memory has different characteristics and constraints. Some are faster to work with (for the CPU or the GPU) than others. Some types of VRAM might not be directly addressable by the CPU. Memory may or may not be cache coherent - a distinction which requires careful programming to avoid data corruption and performance problems. And graphical applications may want to work with much larger amounts of video memory than can be made visible to the GPU at any given time.
All of this presents a memory management problem which, while similar to the management needs of system RAM, has its own special constraints. So the graphics developers have been saying for years that Linux needs a proper manager for GPU-accessible memory. But, for years, we have done without that memory manager, with the result that this management task has been performed by an ugly combination of code in the X server, the kernel, and, often, proprietary drivers. Happily, it would appear that those days are coming to an end, thanks to the creation of the translation-table maps (TTM) module written primarily by Thomas Hellstrom, Eric Anholt, and Dave Airlie. The TTM code provides a general-purpose memory manager aimed at the needs of GPUs and graphical clients.
The core object managed by TTM, from the point of view of user space, is the "buffer object." A buffer object is a chunk of memory allocated by an application, and possibly shared among a number of different applications. It contains a region of memory which, at some point, may be operated on by the GPU. A buffer object is guaranteed not to vanish as long as some application maintains a reference to it, but the location of that buffer is subject to change.
Once an application creates a buffer object, it can map that object into its address space. Depending on where the buffer is currently located, this mapping may require relocating the buffer into a type of memory which is addressable by the CPU (more accurately, a page fault when the application tries to access the mapped buffer would force that move). Cache coherency issues must be handled as well, of course.
There will come a time when this buffer must be made available to the GPU for some sort of operation. The TTM layer provides a special "validate" ioctl() to prepare buffers for processing; validating a buffer could, again, involve moving it or setting up a GART mapping for it. The address by which the GPU will access the buffer will not be known until it is validated; after validation, the buffer will not be moved out of the GPU's address space until it is no longer being operated on.
That means that the kernel has to know when processing on a given buffer has completed; applications, too, need to know that. To this end, the TTM layer provides "fence" objects. A fence is a special operation which is placed into the GPU's command FIFO. When the fence is executed, it raises a signal to indicate that all instructions enqueued before the fence have now been executed, and that the GPU will no longer be accessing any associated buffers. How the signaling works is very much dependent on the GPU; it could raise an interrupt or simply write a value to a special memory location. When a fence signals, any associated buffers are marked as no longer being referenced by the GPU, and any interested user-space processes are notified.
A busy system might feature a number of graphical applications, each of which is trying to feed buffers to the GPU at any given time. It is not at all unlikely that the demands for GPU-addressable buffers will exceed the amount of memory which the GPU can actually reach. So the TTM layer will have to move buffers around in response to incoming requests. For GART-mapped buffers, it may be a simple matter of unmapping pages from buffers which are not currently validated for GPU operations. In other cases, the contents of the buffers may have to be explicitly copied to another type of memory, possibly using the GPU's hardware to do so. In such cases, the buffers must first be invalidated in the page tables of any user-space process which has mapped them, to ensure that they will not be written to during the move. In other words, the TTM really does become an extension of the system's memory management code.
The next question which is bound to come up is: what happens when graphical applications want to use more video memory than the system as a whole can provide? Normal system RAM pages which are used as video memory are locked in place (and unavailable for other uses), so there must be a clear limit on the number of such pages which can be created. The current solution to this problem is to cap the number of such pages at a fraction of the available low memory - up to 1GB on a 4GB, 32-bit system. It would be nice to be able to extend this memory by writing unused pages to swap, but the Linux swap implementation is not meant to work with pages owned by the kernel. The long-term plan would appear to be to let the X server create a large virtual range with mmap(), which would then be swappable. That functionality has not yet been implemented, though.
There is a lot more to the TTM code than has been described here; some more information can be found in this TTM overview [PDF]. For the time being, this code works with a patched version of the Intel i915 driver, with other drivers to be added in the future. TTM has been proposed for inclusion into -mm now, with a merge into the mainline for 2.6.25. The main issue between now and then will be the evaluation of the user-space API, which will be hard to change once this code is merged. Unfortunately, documentation for this API appears to be scarce at the moment.

Last week's article on containers discussed process ID namespaces. The purpose of these namespaces is to manage which processes are visible to a process inside a container. The heavy use of PIDs to identify processes caused this particular patch to go through a long period of development before being merged for 2.6.24. It appears that there are some remaining issues, though, which could prevent this feature from being available in the next kernel release. As is often the case, the biggest problems come down to user-space API issues.
On November 1, Ingo Molnar pointed out that some questions raised by Ulrich Drepper back in early 2006 remained unanswered. These questions all have to do with what happens when the use of a PID escapes the namespace that it belongs to. There are a number of kernel APIs related to interprocess communication and synchronization where this could happen. Realtime signals carry process ID information, as do SYSV message queues. At best, making these interfaces work properly across PID namespaces will require that the kernel perform magic PID translations whenever a PID crosses a namespace boundary.
The biggest sticking point, though, would appear to be the robust futex mechanism, which uses PIDs to track which process owns a specific futex at any given time. One of the key points behind futexes is that the fast acquisition path (when there is no contention for the futex) does not require the kernel's involvement at all. But that acquisition path is also where the PID field is set. So there is no way to let the kernel perform magic PID translation without destroying the performance feature that was the motivation for futexes in the first place.
Ingo, Ulrich, and others who are concerned about this problem would like to see the PID namespace feature completely disabled in the 2.6.24 release so that there will be time to come up with a proper solution. But it is not clear what form that solution would take, or if it is even necessary.
The approach seemingly favored by Ulrich is to eliminate some of the fine-grained control that the kernel currently provides over the sharing of namespaces. With the 2.6.24-rc1 interface, a process which calls clone() can request that the child be placed into a new PID namespace, but that other namespaces (filesystems, for example, or networking) be shared. That, says Ulrich, is asking for trouble.
Coalescing a number of the namespace options into a single "new container" bit would help with the current shortage of clone bits. But it might well not succeed in solving the API issues. Even processes with different filesystem namespaces might be able to find the same futex via a file visible in both namespaces. The passing of credentials over Unix-domain sockets could throw in an interesting twist. And it would seem that there are other places where PIDs are used that nobody has really thought of yet.
Another possible approach, one which hasn't really featured in the current debate, would be to create globally-unique PIDs which would work across namespaces. The current 32-bit PID value could be split into two fields, with the most significant bits indicating which namespace the PID (contained in the least significant bits) is defined in. Most of the time, only the low-order part of the PID would be needed; it would be interpreted relative to the current PID namespace. But, in places where it makes sense, the full, unique PID could be used. That would enable features like futexes to work across PID namespaces.
There are still problems, of course. The whole point of PID namespaces is to completely hide processes which are outside of the current namespace; the creation and use of globally-unique PIDs pokes holes in that isolation. And there are sure to be some complications in the user-space API which prove to be hard to paper over.
Then, there is the question of whether this problem is truly important or not. Linus thinks not, pointing out that the sharing of PIDs across namespaces is analogous to the use of PIDs in lock files shared across a network: PID-based locking does not work on NFS-mounted files, and, similarly, PID-based interfaces will not work between PID namespaces.
One could argue that the conflict with PID namespaces was known when the robust futex feature was merged and that something could have been done at that time. But that does not really help anybody now. And, in any case, there are issues beyond futexes.
PID namespaces are a significant complication of the user-space API; they redefine a basic value which has had a well-understood meaning since the early days of Unix. So it is not surprising that some interesting questions have come to light. Getting solid answers to nagging API questions has not always been the strongest point of the Linux development process, but things could always change. With luck and some effort, these questions can be worked through so that PID namespaces, when they become available, will have well-thought-out and well-defined semantics in all cases and will support the functionality that users need.
As the amount of RAM installed in systems grows, it would seem that memory pressure should ease, but, much like salaries or hard disk space, usage grows to fill (or overflow) the available capacity. Operating systems have dealt with this problem for decades by using virtual memory and swapping, but techniques that work well with 4 gigabyte address spaces may not scale well to systems with a terabyte of RAM. That scalability problem is at the root of several different ideas for changing the kernel, from supporting larger page sizes to avoiding memory fragmentation.
Another approach to scaling up the memory management subsystem was recently posted to linux-kernel by Rik van Riel. His patch is meant to reduce the amount of time the kernel spends looking for a memory page to evict when it needs to load a new page. He lists two main deficiencies of the current page replacement algorithm. The first is that it sometimes evicts the wrong page; this cannot be eliminated, but its frequency might be reduced. The second - the sheer amount of time spent scanning pages - is the heart of what he is trying to accomplish.
A system with 1TB of 4K pages has 256 million pages to deal with. Searching through the pages stored on lists in the kernel can take an enormous amount of time. According to van Riel, most of that time is spent searching pages that won't be evicted anyway, so in order to deal with systems of that size, the search needs to focus in on likely candidates.
Linux tries to optimize its use of physical memory by keeping it full, using any memory not needed by processes to cache file data in the page cache. Determining which pages are not being used by processes, and striking a balance between the page cache and process memory, is the job of the page replacement algorithm. It is that algorithm that van Riel would eventually like to see replaced.
The current set of patches, though, takes a smaller step. In today's kernel, there are two lists of pages, active and inactive, for each memory zone. Pages move between them based on how recently they were used. When it is time to find a page to evict, the kernel searches the inactive list for candidates. In many cases, it is looking for page cache pages, particularly those that are unmodified and can simply be dropped, but it has to wade through an enormous number of process-memory pages to find them.
The solution proposed is to break both lists apart, based on the type of page. Page cache pages (aka file pages) and process-memory pages (aka anonymous pages) will each live on their own active and inactive lists. When the kernel is looking for a specific type, it can choose the proper list to reduce the amount of time spent searching considerably.
This patch is an update to an earlier proposal by van Riel, covered here last March. The patch is now broken into ten parts, allowing for easier reviewing. It has also been updated to the latest kernel, modified to work with various features (like lumpy reclaim) that have been added in the interim.
Additional features are planned to be added down the road, as outlined on van Riel's page replacement design web page. Adding a non-reclaimable list for pages that are locked to physical memory with mlock(), or are part of a RAM filesystem and cannot be evicted, is one of the first changes listed. It makes little sense to scan past these pages.
Another feature that van Riel lists is to track recently evicted pages so that, if they get loaded again, the system can reduce the likelihood of another eviction. This should help keep pages in the page cache that get accessed somewhat infrequently, but are not completely unused. There are also various ideas about limiting the sizes of the active and inactive lists to put a bound on worst-case scenarios. van Riel's plans also include making better decisions about when to run the out-of-memory (OOM) killer as well as making it faster to choose its victim.
Overall, it is a big change to how the page replacement code works today, which is why it will be broken up into smaller chunks. By making changes that add incremental improvements, and getting them into the hands of developers and testers, the hope is that the bugs can be shaken out more easily. Before that can happen, though, this set of patches must pass muster with the kernel hackers and be merged. The external user-visible impacts of these particular patches should be small, but they are fairly intrusive, touching a fair amount of code. In addition, memory management patches tend to have a tough path into the kernel.
Page editor: Jonathan Corbet
Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds