Brief items

The current 2.6 prepatch is 2.6.25-rc2, released by Linus on February 15. We are in the stabilization period now, so, as one would expect, most of the changes are fixes. This prepatch also contains some tweaks to the realtime group scheduling interface and the addition of a multiple-probes capability to kernel markers. Says Linus: "I'm optimistic that this release cycle won't be anywhere near the pain of what 24 was, which is why I'm just going to go off for the long weekend and stay at the beach." See the long-format changelog for the details.
As of this writing, just over 300 patches have gone into the mainline repository since 2.6.25-rc2. They are mostly fixes, but there are also some new watchdog drivers, an enhancement to the SMACK security module, and some fairly large Video4Linux driver updates.
The current -mm tree is 2.6.25-rc2-mm1. Recent changes to -mm include ext4 online defragmentation support and read-only bind mount support.
For older kernels: a stable update was released on February 16; it contains a number of low-priority security fixes.
Kernel development news
The kernel source level debugger, kgdb, has been around for a long time, but never in the mainline tree. Linus Torvalds is not much of a fan of debuggers in general and has always resisted the inclusion of kgdb. That looks like it might be changing somewhat, with the inclusion of kgdb in 2.6.26 now a distinct possibility.
Over the years, Torvalds has made various pronouncements about debuggers, particularly kernel debuggers; a long message to linux-kernel in 2000 outlines his objections:
An attempt to sneak kgdb into the mainline via x86 architecture updates failed, but Torvalds did open the door a crack towards merging the kgdb changes: "I won't even consider pulling it unless it's offered as a separate tree, not mixed up with other things. At that point I can give a look." That has spawned the kgdb-light effort, spearheaded by Ingo Molnar.
The original hope of getting it into 2.6.25 has been dashed, but with Molnar rapidly iterating to address kernel hackers' concerns, the number of complaints seems to be decreasing. Molnar is up to version 10 of the kgdb-light patchset, something like three days after the first was posted. The various linux-kernel threads show a number of very hopeful developers waiting with bated breath to see if kgdb can finally make its way into the mainline.
The light version of kgdb still has most of the capabilities of the original kgdb and any additional, possibly more intrusive, features can be added later. Molnar is clearly trying to do things the right way, with a merge of the non-intrusive kgdb functionality that can eventually be used by multiple architectures. He points out that there are already gdb remote stubs in three separate architectures in the mainline, continuing:
But we wanted to do it _right_ and not add an arch/x86/kernel/gdb-stub.c special hack.
Discussions about the patches have been mostly to point out problems or areas that need cleaning up. The philosophical objections have been mostly avoided, quite possibly because Molnar has been scrupulously trying to make a "no impact" set of patches:
To that end, the patch changes 22 files (rather than the 41 touched by the original kgdb submission), removing "_all_ critical path impact" and the low-level serial drivers—as Molnar points out, kgdb should not be in the driver business. In addition, the "kgdb over polled consoles" support has been reworked and cleaned up. Various hacks to get at module symbols have been removed, as a better solution for that problem is needed. So far, no show-stopping problems have been identified, so it really seems to come down to what Torvalds thinks; for that, we may have to wait until the 2.6.26 merge window opens in April or May.
Nouveau is an effort to create a complete open source driver for NVidia graphics cards for X.org. It aims to support 2D and 3D acceleration from the early NV04 cards up to the latest G80 cards, and to work across all supported architectures, such as x86, x86-64, and PPC. The project originated when Stéphane Marchesin set out to de-obfuscate parts of the NVidia-maintained nv driver. However, NVidia had corporate policies in place about the nv driver and had no plans to change them at the time, so it refused Stéphane's patches.
This left Stéphane with the greatest open source choice: "fork it"! At FOSDEM in February 2006, Stéphane unveiled his plans for an open source driver for NVidia hardware called Nouveau. The name was suggested by his IRC client's French autoreplace feature which suggested the word "nouveau" when he typed "nv". People liked it, so the name stuck. The FOSDEM presentation got the project enough publicity to engage the curiosity of other developers.
Ben Skeggs was one of the first developers to sign up. He had worked on reverse engineering the shader components of the R300 (one of ATI's graphics chips) and had written parts of the R300 driver; as a result, he had considerable experience with graphics drivers. He initially showed interest in the NV40 shaders only, but he got caught in the event horizon and has worked on every aspect of the driver for NV40 and later cards.
The project engaged other developers with both short- and long-term interest. It also generated a large amount of attention due to a pledge drive that an independent user started.
However, the project was mainly developed on IRC, and it was quite difficult for newcomers to get any insight into previous development; reading IRC logs is impractical at best. With this in mind, KoalaBR decided to start summarizing development in a series of articles known as the TiNDC (The irregular Nouveau Development Companion). This series of articles proved very useful for attracting developers and testers to the project. TiNDC issues are published every two to four weeks; as of this writing, the current issue is TiNDC #34.
Linux.conf.au 2007 saw the first live demo of Nouveau. Dave Airlie had signed up to give a talk on the subject; he managed to persuade Ben Skeggs that showing a working glxgears demo would be a great finish to the talk. Ben toiled furiously with the other developers to get the init code into shape for his laptop card and the presentation was a great success.
After Nouveau missed out on a Google Summer of Code place, X.org granted the project a Vacation of Code slot instead. This saw Arthur Huillet join the team to complete proper Xv support on Nouveau. Arthur saw the light and continued with the project once the VoC ended. In autumn 2007, Stuart Bennett and Maarten Maathuis vowed to get Nouveau's RandR 1.2 support into better shape. Since then a steady stream of patches has advanced the code greatly.
The project now has nine regular contributors (Stéphane Marchesin, Ben Skeggs, Patrice Mandin, Arthur Huillet, Pekka Paalanen, Maarten Maathuis, Peter Winters, Jeremy Kolb, Stuart Bennett) along with many more part-time contributors, testers, writers, and translators.
This article will use the NVidia GPU technical names as opposed to marketing names.
GPU name     Product name(s)
NV04/05      Riva TNT, TNT2
NV1x         GeForce 256, GeForce 2, GeForce 4 MX
NV2x         GeForce 3, GeForce 4 Ti
NV3x         GeForce 5
NV4x (G7x)   GeForce 6, GeForce 7
NV5x (G8x)   GeForce 8
Where both "N" and "G" names exist, the "N" variant (NV4x, NV5x) will be used. Further information can be found on the Nouveau site.
Before jumping into the Nouveau driver, this section provides a short background on the mess that is the Linux graphics stack. This stack has a long history dating back to Unix X servers and the XFree86 project. This history has led to a situation quite unlike the driver situation for any other device on a Linux system. The graphics drivers existed mainly in user space, provided by the XFree86 project, and little or no kernel interaction was required. The user-space component, known as the DDX (Device-Dependent X), was responsible for initializing the card, setting modes, and providing acceleration for 2D operations.
The kernel also provided framebuffer drivers on certain systems to allow a usable console before X started. The interaction between these drivers and the X.org drivers was very complex and often caused many problems regarding which driver "owned" the hardware.
The DRI project was started to add support for direct rendering of 3D applications on Linux. This meant that an application could talk to the 3D hardware directly, bypassing the X server. OpenGL was the standard 3D API, but it is a complex interface which is definitely too large to implement in-kernel. GPUs also provided completely different low-level interfaces. So, due to the complexity of the higher level interface and nonstandard nature of the hardware APIs, a kernel component (DRM) and a userspace driver (DRI) were required to securely expose the hardware interfaces and provide the OpenGL API.
Shortcomings of the current architecture have been noted over the past few years; the current belief is that GPU initialization, memory management, and mode setting need to migrate to the kernel in order to provide better support for features such as suspend/resume, proper cohabitation of X and framebuffer driver, kernel error reporting, and future graphics card technologies.
The GPU memory manager implemented by Tungsten Graphics is known as TTM. It was originally designed as a general VM memory manager but initially targeted at Intel hardware. On top of this memory manager, a new modesetting architecture for the kernel is being implemented. This is based on the RandR 1.2 work found in the X.org server.
Graphics cards are programmed in numerous ways, but most initialization and mode setting is done via memory-mapped IO. This is just a set of registers accessible to the CPU via its standard memory address space. The registers in this address space are split up into ranges dealing with various features of the graphics card such as mode setup, output control, or clock configuration. A longer explanation can be found on Wikipedia.
Most recent GPUs also provide some sort of command processing ability where tasks can be offloaded from the CPU to be executed on the GPU, reducing the amount of CPU time required to execute graphical operations. This interface is commonly a FIFO implemented as a circular ring buffer into which commands are pushed by the CPU for processing by the GPU. It is located somewhere in a shared memory area (AGP memory, PCIGART, or video RAM). The GPU will also have a set of state information that is used to process these commands, usually known as a context.
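The CPU side of such a command FIFO can be sketched as a simple circular buffer. This is a generic illustration with hypothetical names, not any specific card's interface: the CPU advances a put pointer as it pushes command words, while the GPU (simulated here by nothing at all) would advance the get pointer as it consumes them.

```c
#include <stdint.h>

#define FIFO_SIZE 256   /* entries; kept small for illustration */

/* A circular ring buffer of command words shared between CPU and GPU. */
struct cmd_fifo {
    uint32_t buf[FIFO_SIZE];
    unsigned put;   /* next slot the CPU writes */
    unsigned get;   /* next slot the GPU reads */
};

/* Push one command word; returns -1 if the ring is full and the
 * caller must wait for the GPU to catch up. */
static int fifo_push(struct cmd_fifo *f, uint32_t word)
{
    unsigned next = (f->put + 1) % FIFO_SIZE;
    if (next == f->get)
        return -1;
    f->buf[f->put] = word;
    f->put = next;
    return 0;
}
```

A real driver would mirror the put pointer into a card register so the GPU knows how far it may read; that detail is omitted here.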
Most modern GPUs only contain a single command processing state machine. However NVidia hardware has always contained multiple independent "channels" which consist of a private FIFO (push buffer), a graphics context and a number of context objects. The push buffer contains the commands to be processed by the card. The graphics context stores application specific data such as matrices, texture unit configuration, blending setup, shader information etc. Each channel has 8 subchannels to which graphics objects are bound in order to be addressed by FIFO commands.
Each NVidia card provides between 16 and 128 channels, depending on model; these are assigned to different rendering-related tasks. Each 3D client has an associated channel, while some are reserved for use in the kernel and the X server. Channels are context-switched by software via an interrupt (on older cards) or automatically by the hardware on cards after the NV30.
So what goes into the FIFO? Each NVidia card offers a set of objects, each of which provides a set of methods related to a given task, e.g. DMA memory transfers or rendering. These methods are the ones used by the driver (or, at a higher level, the rendering application). Whenever a client connects, it uses an ioctl() to create a channel. After that, the client creates the objects it needs via an additional ioctl().
Currently there are two possible types of client: X (via the DDX driver) and OpenGL (via DRI/Mesa). An accelerated framebuffer using the new mode setting architecture (nouveaufb) will be a future client as well, avoiding conflicts with nvidiafb.
Let's have a look at a small number of objects:
Object name                  Description                                    Available on
NV_IMAGE_BLIT                2D engine, blits from one image into another   NV03 NV04 NV10 NV20
NV12_IMAGE_BLIT              An enhanced version of the above               NV11 NV20 NV30 NV40
NV_MEMORY_TO_MEMORY_FORMAT   DMA memory transfer                            NV04 NV10 NV20 NV30 NV40 NV50
From this list, you can see that there are object types which are available on all cards (NV_MEMORY_TO_MEMORY_FORMAT) while others are only available on certain cards. For example, each class of card has its own 3D-engine object, such as NV10TCL on NV1x and NV20TCL on NV2x. An object is identified by a unique number: its "class". This ID is 0x5f for NV_IMAGE_BLIT, 0x9f for NV12_IMAGE_BLIT, and 0x39 for NV_MEMORY_TO_MEMORY_FORMAT. If you want to use functionality provided by a given object, you must first bind that object to a subchannel. The card provides a certain number of subchannels, which corresponds to the number of "active" (or "bound") objects.
A command in the FIFO is made of a command header, followed by one or more parameters. The command header usually contains the subchannel number, the method offset to be called, and the number of parameters (a command header can also define a jump in the FIFO but this is outside the scope of this document). Each method the object provides has an offset which has to be set in the command. In order to limit the number of command headers to be written, thereby improving performance, NVidia cards will call several subsequent methods in a row if you provide several parameters.
How do we refer to an object? The data written to the FIFO doesn't hold any information about that. Binding an object to a subchannel is done by writing the object ID as an argument to method number 0. For example: 00044000 5c00000c binds object ID 5c00000c to subchannel 2. This object ID is used as a key in a hash table kept in the card's memory, which is filled when objects are created.
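The binding example can be checked with a small sketch of the header encoding, assuming the classic NV04-style header layout (method offset in bits 0-12, subchannel in bits 13-15, method count in bits 18-28); treat the layout and the function name as illustrative rather than authoritative.

```c
#include <stdint.h>

/* Build a FIFO command header, assuming the classic NV04-style
 * layout: method offset in bits 0-12, subchannel in bits 13-15,
 * and method count in bits 18-28. */
static uint32_t nv_cmd_header(unsigned subchannel, unsigned method,
                              unsigned count)
{
    return ((uint32_t)count << 18) | ((uint32_t)subchannel << 13) | method;
}

/* The binding example above: method 0 on subchannel 2 with one
 * parameter encodes to the header word 0x00044000; the parameter
 * word that follows it in the FIFO is the object ID 0x5c00000c. */
```

With subchannel 2, method 0, and a count of 1, this yields 0x00044000, matching the example in the text.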
The creation of an object relies on special memory areas. RAMIN is "instance memory", an area of memory through which the graphics engines of the card are configured. A RAMIN area is present on all NVIDIA chipsets in some form, but it has evolved quite a bit as newer chipsets have been released. Basically, RAMIN is what contains the objects. An object is usually not big (128 bytes in general, up to a few kilobytes in case of DMA transfer objects).
Card-specific RAMIN areas:

Pre-NV40: An area of dedicated internal memory accessible through the card's MMIO registers.

NV4x: A 16MiB PCI resource is used to access PRAMIN. This resource maps over the last 16MiB of VRAM. The first 1MiB of PRAMIN is also accessible through the (now "legacy") MMIO PRAMIN aperture.

NV5x: A 32MiB PCI resource, which is unusable in the default power-on state of the card. It can be configured in a variety of ways through the NV5x virtual memory. The legacy MMIO aperture can be re-mapped over any 1MiB of VRAM that's desired.
There are also a few specific areas in RAMIN that are worth mentioning:
When I was but a wee computer science student at New Mexico Tech, a graduate student in OS handed me an inch-thick print-out and told me that if I was really interested in operating systems, I had to read this. It was something about a completely lock-free operating system optimized using run-time code generation, written from scratch in assembly running on a homemade two-CPU SMP with a two-word compare-and-swap instruction - you know, nothing fancy. The print-out I was holding was Alexia (formerly Henry) Massalin's PhD thesis, Synthesis: An Efficient Implementation of Fundamental Operating Systems Services (html version here). Dutifully, I read the entire 158 pages. At the end, I realized that I understood not a word of it, right up to and including the cartoon of a koala saying "QUA!" at the end. Okay, I exaggerate - lock-free algorithms had been a hobby of mine for the previous few months - but the main point I came away with was that there was a lot of cool stuff in operating systems that I had yet to learn.
Every year or two after that, I'd pick up my now bedraggled copy of "Synthesis" and reread it, and every time I would understand a little bit more. First came the lock-free algorithms, then the run-time code generation, then quajects. The individual techniques were not always new in and of themselves, but in Synthesis they were developed, elaborated, and implemented throughout a fully functioning UNIX-style operating system. I still don't understand all of Synthesis, but I understand enough now to realize that my grad student friend was right: anyone really interested in operating systems should read this thesis.
The name "Synthesis" comes from run-time code generation - code synthesis - used to optimize and re-optimize kernel routines in response to changing conditions. The concept of optimizing code at run time is by now familiar to many programmers, in part from Transmeta's processor-level code optimization, used to lower power consumption (and many programmers are familiar with Transmeta as the one-time employer of Linus Torvalds).
Run-time code generation in Synthesis begins with some level of compile-time optimization, optimizations that will be efficient regardless of the run-time environment. The result can be thought of as a template for the final code, with "holes" where the run-time data will go. The run-time code generation then takes advantage of data-dependent optimizations. For example, if the code evaluates A * B, and at run-time we discover that B is always 1, then we can generate more efficient code that skips the multiplication step and run that code instead of the original. Fully optimized versions of the code, pre-computed for common data values, can be simply swapped in without any further run-time computation. Another example from the thesis:
Run-time code generation in Synthesis is a fusion of compile-time and run-time optimizations in which useful code templates are created at compile time that can later be optimized simply and cleanly at run time.
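As a toy illustration of the idea - a sketch with made-up names, not Synthesis code - the A * B example can be expressed as swapping in a pre-compiled specialized routine once the run-time value of B is known:

```c
/* Toy illustration of data-dependent specialization: once we learn
 * at run time that b is always 1, swap in a pre-compiled routine
 * that skips the multiply entirely. */
typedef int (*mul_fn)(int a, int b);

static int mul_general(int a, int b) { return a * b; }

/* Specialized version: the multiply has been optimized away. */
static int mul_by_one(int a, int b)  { (void)b; return a; }

/* Pick the most specialized code available for the observed value. */
static mul_fn specialize_mul(int b_observed)
{
    return b_observed == 1 ? mul_by_one : mul_general;
}
```

Synthesis goes much further than this function-pointer trick - it fills holes in instruction templates and can rewrite the code in place - but the payoff is the same: the common case runs code that never performs the skipped computation.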
Understanding run-time code generation is a prerequisite for understanding quajects, the basic unit out of which the Synthesis kernel is constructed. Quajects are almost but not quite entirely unlike objects. Like objects, quajects come in types - queue quaject, thread quaject, buffer quaject - and encapsulate all the data associated with the quaject. Unlike objects, which contain pointers to functions implementing their methods, quajects contain the code implementing their methods directly. That's right - the actual executable instructions are stored inside the data structure of the quaject, with the code nestled up against the data it will operate on. In cases where the code is too large to fit in the quaject, the code jumps out to the rest of the method located elsewhere in memory. The code implementing the methods is created by filling in pre-compiled templates and can be self-modifying as well.
Quajects interact with other quajects via a direct and simple system of cross-quaject calls: callentries, callouts, and callbacks. The user of a quaject invokes callentries in the quaject, which implement that quaject's methods. Usually the callentry returns back to the caller as normal, but in exceptional situations the quaject will invoke a method in the caller's quaject - a callback. Callouts are places where a quaject invokes some other quaject's callentries.
Synthesis implements a basic set of quajects - thread, queue, buffer, clock, etc. - and builds higher-level structures by combining lower-level quajects. For example, a UNIX process is constructed out of a thread quaject, a memory quaject, and some I/O quajects.
As an example, let's look at the queue quaject's interface. A queue has two callentries, queue_put and queue_get. These are invoked by another quaject wanting to add entries to or remove entries from the queue. The queue quaject also has four callbacks into the caller's quaject: queue_full, queue_full-1, queue_empty, and queue_empty-1. When a caller invokes the queue_put method and the queue is full, the queue quaject invokes the queue_full callback in the caller's quaject. From the thesis:
The queue_full-1 method is executed when a queue has transitioned from full to not full, queue_empty when the queue doesn't contain anything, and queue_empty-1 when the queue transitions from empty to not empty. With these six callentries and callbacks, a queue is implemented in a generic, extensible, yet incredibly efficient manner.
Pretty cool stuff, huh? But wait, there's more!
Most modern operating systems use a combination of interrupt disabling and locks to synchronize access to shared data structures and guarantee single-threaded execution of critical sections in general. The most popular synchronization primitive in Linux is the spinlock, implemented with the nearly universal test-and-set-bit atomic operation. When one thread attempts to acquire the spinlock guarding some critical section, it busy-waits, repeatedly trying to acquire the spinlock until it succeeds.
Synchronization based on locks works well enough but it has several problems: contention, deadlock, and priority inversion. Each of these problems can be (and is) worked around by following strict rules: keep the critical section short, always acquire locks in the same order, and implement various more-or-less complex methods of priority inheritance. Defining, implementing, and following these rules is non-trivial and a source of a lot of the pain involved in writing code for modern operating systems.
To address these problems, Maurice Herlihy proposed a system of lock-free synchronization using an atomic compare-and-swap instruction. Compare-and-swap takes the address of a word, the previous value of the word, and the desired new value of the word. It stores the new value into the word if and only if the word still contains the previous value. The bare compare-and-swap instruction allows atomic updates of single pointers. To atomically switch between larger data structures, a new copy of the data structure is created, updated with the changes, and the addresses of the two data structures swapped. If the compare-and-swap fails because some other thread has updated the value, the operation is retried until it succeeds.
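A minimal sketch of such a retry loop, with C11 atomics standing in for the hardware compare-and-swap instruction (the function name is made up for illustration):

```c
#include <stdatomic.h>

/* Herlihy-style lock-free update: take a snapshot, compute the new
 * value from it, and retry if another thread changed the word in
 * the meantime. */
static void lock_free_increment(_Atomic int *word)
{
    int old = atomic_load(word);
    /* The CAS stores old+1 only if *word still equals old; on
     * failure, C11 reloads the current value into old, and we
     * retry the computation with the fresh snapshot. */
    while (!atomic_compare_exchange_weak(word, &old, old + 1))
        ;
}
```

The same shape - snapshot, compute, CAS, retry - underlies the queue, stack, and list updates discussed below; only the "compute" step changes.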
Lock-free synchronization eliminates deadlocks, the need for strict lock ordering rules, and priority inversion (contention on the compare-and-swap instruction itself is still a concern, but rarely observed in the wild). The main drawback of Herlihy's algorithms is that they require a lot of data copying for anything more complex than swapping two addresses, making the total cost of the operation greater than the cost of locking algorithms in many cases. Massalin took advantage of the two-word compare-and-swap instruction available in the Motorola 68030 and expanded on Herlihy's work to implement lock-free and copy-free synchronization of queues, stacks, and linked lists. She then took a novel approach: Rather than choose a general synchronization technique (like spinlocks) and apply it to arbitrary data structures and operations, instead build the operating system out of data structures simple enough to be updated in an efficient lock-free manner.
Synthesis is actually even cooler than lock-free: Given the system of quajects, code synthesis, and callbacks, operations on data structures can be completely synchronization-free in common situations. For example, a single-producer, single-consumer queue can be updated concurrently without any kind of synchronization as long as the queue is non-empty, since each thread operates on only one end of the queue. When the callback for queue empty happens, the code to operate on the queue is switched to use the lock-free synchronization code. When the quaject's queue-not-empty callback is invoked, the quajects switch back to the synchronization-free code. (This specific algorithm is not, to my knowledge, described in detail in the thesis, but was imparted to me some months ago by Dr. Massalin herself at one of those wild late-night kernel programmer parties, so take my description with a grain of salt.)
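The single-producer/single-consumer case can be sketched as follows. This illustrates only the disjoint-ends property - the producer writes only the head index, the consumer only the tail - and is not Synthesis code; Synthesis switches code variants at the empty/full boundaries, whereas this sketch leans on C11 atomics throughout.

```c
#include <stdatomic.h>
#include <stdint.h>

#define QSIZE 8   /* power of two, small for illustration */

/* Single-producer/single-consumer queue: while the queue is neither
 * empty nor full, the two sides never write the same word. */
struct spsc_queue {
    uint32_t buf[QSIZE];
    _Atomic unsigned head;   /* written only by the producer */
    _Atomic unsigned tail;   /* written only by the consumer */
};

static int spsc_put(struct spsc_queue *q, uint32_t v)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QSIZE)
        return -1;                       /* full */
    q->buf[h % QSIZE] = v;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 0;
}

static int spsc_get(struct spsc_queue *q, uint32_t *v)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (h == t)
        return -1;                       /* empty */
    *v = q->buf[t % QSIZE];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 0;
}
```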
The approach to synchronization in Synthesis is summarized in the following quote:
The last two points will sound familiar if you're aware of Paul McKenney's read-copy-update (RCU) algorithm. In Synthesis, thread structures to be deleted or removed from the run queue are marked as such, and then actually deleted or removed by the scheduler thread during normal traversal of the run queue. In RCU, the reference to a list entry is removed from the linked list while holding the list lock, but the removed list entry is not actually freed until it can be guaranteed that no reader is accessing that entry. In both cases, reads are synchronization-free, but deletes are separated into two phases, one that begins the operation in an efficient low-contention manner, and a second, deferred, synchronization-free phase to complete the operation. The two techniques are by no means the same, but share a similar philosophy.
The design principles of Synthesis, while powerful and generic, still have some major drawbacks. The algorithms are difficult to understand and implement for regular human beings (or kernel programmers, for that matter). As Linux has demonstrated, making kernel development simple enough that a wide variety of people can contribute has some significant payoffs. Another drawback is that two-word compare-and-swap is, shall we say, not a common feature of modern processors. Lock-free synchronization can be achieved without this instruction, but it is far less efficient. In my opinion, reading this paper is valuable more for retraining the way your brain thinks about synchronization than for copying the exact algorithms. This thesis is especially valuable reading for people interested in low-latency or real-time response, since one of the explicit goals of Synthesis is support for real-time sound processing.
Finally, I want to note that Synthesis contains many more elegant ideas that I couldn't cover in even the most superficial detail - quaject-based user/kernel interface, per-process exception tables, scheduling based on I/O rates, etc., etc. And while the exact implementation details are fascinating, the thesis is also peppered with delightful koan-like statements about design patterns for operating systems. Any time you're feeling bored with operating systems, sit down and read a chapter of this thesis.
[ Valerie Henson is a Linux file systems consultant and proud recipient of a piggy-back ride from Dr. Alexia Massalin. ]
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jake Edge
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds