The state of Nouveau, part 2

February 26, 2008

This article was contributed by B. Rathmann (KoalaBR)

[Editor's note: this is the second in a two-part series on the state of the Nouveau driver for NVIDIA hardware. The first installment is recommended reading for those who have not yet seen it.]

Sources of information, and reverse engineering tools

As very little information is available on NVidia's hardware design and implementation, the Nouveau project has developed a number of tools to gain a better understanding of the cards' architecture and programming model. These tools, along with some previously available information, are what the driver is built from.

The Haiku/BeOS projects have a driver that came from a software development kit NVidia released for NV03/04 cards, and also gathered some information from an unobfuscated nv driver that appeared briefly in XFree86. This driver has improved mode-setting code compared to nv, and a basic 3D driver using hard-coded objects running in a single context.

More information was available in the nvclock utility, which allows overclocking NVidia GPUs on Linux. Its lead developer Roderick Colenbrander (Thunderbird) has helped out Nouveau in the clock setup, i2c and tv-out areas.

renouveau

The first utility developed was called renouveau. renouveau is mainly concerned with reverse engineering the NVidia binary driver by black-boxing it, feeding it certain inputs and watching what it writes to the hardware. It runs a large batch of OpenGL tests which exercise most of the GPU's capabilities and generates a set of dump files which are sent to the Nouveau developers.

The tool works by mapping the card registers and the FIFO assigned to the current application. It then records the current state of both FIFO and registers, executes small OpenGL tests, and compares the final state against the initial saved state. It then dumps this information, which can be parsed into a human-readable form using an XML register/command database. (Some developers would argue the hex is readable to them.)
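The before/after comparison described above can be illustrated with a small sketch. This is not renouveau's code: the `reg_change` structure and the `diff_snapshots()` and `print_changes()` helpers are hypothetical names invented for this example, and the snapshots are plain arrays rather than mapped card registers.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One changed register: its byte offset plus the old and new values.
 * (Hypothetical structure, for illustration only.) */
struct reg_change {
    uint32_t offset;
    uint32_t before;
    uint32_t after;
};

/*
 * Compare two snapshots of the same register window, each `count` 32-bit
 * words long, and store up to `max_out` differences in `out`.
 * Returns the total number of words that differ, which may exceed max_out.
 */
size_t diff_snapshots(const uint32_t *before, const uint32_t *after,
                      size_t count, struct reg_change *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < count; i++) {
        if (before[i] == after[i])
            continue;
        if (found < max_out) {
            out[found].offset = (uint32_t)(i * sizeof(uint32_t));
            out[found].before = before[i];
            out[found].after  = after[i];
        }
        found++;
    }
    return found;
}

/* Print the stored changes as raw hex, one register per line. */
void print_changes(const struct reg_change *changes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("0x%08" PRIx32 ": 0x%08" PRIx32 " -> 0x%08" PRIx32 "\n",
               changes[i].offset, changes[i].before, changes[i].after);
}
```

A caller would take one snapshot before a test and one after it, then pass both to `diff_snapshots()`. Turning the raw offsets into register names is the job of the XML database mentioned above and is not attempted here.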

The tool has advantages in that it can be run very simply by end users, on various card architectures, without requiring root privileges. It doesn't tamper with the binary driver, and does not require much technical knowledge.

MMioTrace

MMioTrace is a tool for tracing memory-mapped I/O (MMIO) access within the kernel. The NVidia driver contains a kernel module which is responsible for a lot of card initialization and mode setting. This activity cannot easily be traced by user-space tools such as renouveau. MMioTrace uses relayfs and debugfs to relay the tracing data to userspace.

MMioTrace works by replacing the calls to the kernel's ioremap(), ioremap_nocache(), and iounmap() functions made by the driver that is to be probed with wrappers that call into MMioTrace. When the driver module in question calls ioremap() to access the MMIO registers, the pages are mapped as not-present in the kernel address space instead. MMioTrace can be set up to trace only the address ranges that are likely to be touched by the driver you are interested in, reducing the number of irrelevant MMIO accesses that get recorded.

When the module then tries to access the register space, a page fault will occur. In the page fault handler the address is detected and the attempted action recorded. The page is then marked present and the page-faulting code is single-stepped to execute the instruction doing MMIO. After that the page is set to "not present" again so that the cycle can be restarted for the next access to the page.
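The fault, record, single-step, re-arm cycle can be modeled in user space as a toy state machine. The sketch below is not MMioTrace code and touches no real page tables: `traced_page`, `traced_write()` and `traced_read()` are hypothetical names, and the "present" bit is just a boolean. It only shows the order of the steps described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_REGS  64   /* size of the simulated register page, in 32-bit words */
#define TRACE_MAX 16   /* capacity of the toy trace log */

struct trace_entry {
    bool     is_write;
    uint32_t offset;   /* byte offset within the simulated page */
    uint32_t value;
};

/* Toy stand-in for one traced MMIO page (hypothetical, illustration only). */
struct traced_page {
    uint32_t regs[NUM_REGS];
    bool     present;              /* simulated "present" bit of the mapping */
    unsigned faults;               /* number of simulated page faults */
    struct trace_entry log[TRACE_MAX];
    size_t   log_len;
};

/* Zero everything; present == false means the trap is armed. */
void traced_page_init(struct traced_page *p)
{
    memset(p, 0, sizeof(*p));
}

static void record(struct traced_page *p, bool is_write, size_t idx,
                   uint32_t value)
{
    if (p->log_len < TRACE_MAX) {
        struct trace_entry *e = &p->log[p->log_len++];
        e->is_write = is_write;
        e->offset   = (uint32_t)(idx * sizeof(uint32_t));
        e->value    = value;
    }
}

/* Simulated register write issued by the traced driver. */
void traced_write(struct traced_page *p, size_t idx, uint32_t value)
{
    if (idx >= NUM_REGS)
        return;
    p->faults++;                   /* page is not present: the access faults */
    record(p, true, idx, value);   /* fault handler records the attempt */
    p->present = true;             /* handler marks the page present */
    p->regs[idx] = value;          /* faulting instruction is single-stepped */
    p->present = false;            /* trap is re-armed for the next access */
}

/* Simulated register read; the value is only known after the single step. */
uint32_t traced_read(struct traced_page *p, size_t idx)
{
    uint32_t value;

    if (idx >= NUM_REGS)
        return 0;
    p->faults++;
    p->present = true;
    value = p->regs[idx];          /* single-stepped load */
    record(p, false, idx, value);
    p->present = false;
    return value;
}
```

Every access costs one simulated fault, which hints at why tracing slows the driver down and why the logs grow so quickly.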

MMioTrace has some restrictions on tracing into the legacy ISA address range, as marking those pages not present crashes the kernel. A solution to this may be forthcoming but would require patching the kernel.

MMioTrace is usable for all types of drivers running in the kernel, not just graphics drivers. It is not yet shipped with the kernel; up to 2.6.23 it was distributed as a working external module. However, 2.6.24 has seen the removal of certain features, which means that MMioTrace will need to be upstreamed for it to work with 2.6.25 or later kernels.

If you are interested in more details, you should have a look at the MMioTrace page.

valgrind-mmt

Valgrind-mmt is a plugin for the valgrind debugging suite. It traces MMIO accesses from a user-space process (like the X.org server) where the NVidia DDX code is loaded. This was originally written by Dave Airlie for tracing ATI hardware and has since been extended by a number of other developers. It is used in Nouveau in a way similar to renouveau: to dump the contents of a FIFO. Valgrind-mmt allows reliably tracing the X.org FIFO, which is something renouveau cannot do very well. Tracing the X.org FIFO is sometimes required as it is the only way to see how some 2D features are implemented.

Using MMioTrace to implement a new feature

Commands are usually sent to the card by writing to the command FIFO, not by touching registers directly. But initialization of the card (notably including mode setting), as well as some other operations, is done via MMIO operations from within the kernel.

Below is an example of how MMioTrace was used to reverse engineer the YV12 video overlay that is present in some NVidia cards.

Video formats

Videos are usually not encoded in the RGB colorspace. Most video codecs work in the YUV colorspace instead, where Y stands for luminance (a black and white image), and U and V represent the chrominance (i.e. color). Since the eye is more sensitive to luminance than to chrominance, codecs usually drop a fraction of the U and V samples in order to save space. When the card is asked through, e.g., X-Video to display a video frame, it is passed a buffer containing YUV data, usually in YV12 or YUY2 format.

FourCC.org can give you details about those formats, but for the purposes of this article, we will just say that YUY2 is a format that keeps one chrominance sample (U or V alternately) per luminance sample, thus giving "YUYVYUYV" to the card (16 bits per pixel), and YV12 is a format that keeps two chrominance samples (one U, one V) per 2x2 luminance block, which gives an effective 12 bits per pixel of video. YV12 is 25% smaller than YUY2 and is the format used by most popular codecs. Your author has yet to find any movie codec that does not output YV12 (or I420, which conceptually is the same: it just inverts the position of U and V in the buffer).
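The 16 versus 12 bits per pixel figures translate directly into buffer sizes. The sketch below computes both for a given frame; the function names are made up for this example, and it assumes even dimensions and ignores any pitch or alignment padding a real driver would add.

```c
#include <stddef.h>

/* YUY2 is packed: every pixel has its own Y byte and shares U and V with
 * its neighbour, so two pixels take four bytes (2 bytes per pixel). */
size_t yuy2_frame_bytes(size_t width, size_t height)
{
    return width * height * 2;
}

/* YV12 is planar: one full-resolution Y plane followed by two chroma
 * planes (V and U), each subsampled by two in both directions.
 * Assumes width and height are even. */
size_t yv12_frame_bytes(size_t width, size_t height)
{
    size_t luma   = width * height;
    size_t chroma = (width / 2) * (height / 2);

    return luma + 2 * chroma;
}
```

For a 720x576 frame this gives 829440 bytes for YUY2 and 622080 bytes for YV12, which is exactly three quarters of the YUY2 size.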

Some months ago, Nouveau's Xv implementation was inherited from nv. Besides being extremely slow, nv supported only the YUY2 format, and converted YV12 input to YUY2 in software before uploading the data to the card. While working on improving performance, we quickly came to wonder if NVidia cards supported YV12 in hardware. Due to the 25% size reduction, this would naturally decrease the volume of bus transfers, which plays a very important role in Xv throughput especially on PCI cards.

We verified that by running performance tests on the NVidia binary driver, playing YV12 and YUY2 videos (using mplayer's -yuy2 option). Our performance tests consisted simply of mplayer's "benchmark" mode. The results were extremely clear: the operation required just over 20 seconds in YUY2 mode and just over 15 seconds in YV12 mode. No need to reach for your calculator: it is a 25% difference, which matches the reduction in data size exactly. The most obvious explanation is that the data is sent to the hardware in YV12 format.

So the situation was: we had an Xv driver that handled YUY2 video only, we knew (or thought, with a high degree of confidence and hope) that the hardware supported YV12, but no existing driver like rivatv had code for it. Some reverse engineering had to take place.

MMioTrace doesn't enter the arena just now, however. As mentioned above, most of the time, commands are sent to the card by writing to the command FIFO, and not by touching registers. So we first checked the X command FIFO using valgrind-mmt and found some commands related to video.

However, it quickly turned out that those were software methods, that is to say, dummy methods that make the card generate an interrupt asking the kernel to handle it. It's somewhat similar to an ioctl() call into the kernel module, except that it's in sync with the FIFO. First lesson learned: video overlay setup is done by the kernel module.

We then MMioTraced the NVidia binary driver, playing YUY2 and YV12 video (same dimensions, window position, ... - the only thing that differed was the format), and compared the outputs. And among the 150 kilobytes of resulting data, we found (for YUY2 mode):

    NV_PVIDEO.[0].FORMAT <- 0x00110200

While for YV12 mode:

    NV_PVIDEO.[0].FORMAT <- 0x00110101
    NV_PVIDEO+0x800 <- 0x00000000
    NV_PVIDEO+0x808 <- 0x07fcffff
    NV_PVIDEO+0x820 <- 0x07f70000

So here we had a different value being written into FORMAT, and three unknown registers. From a reading of existing documentation and code, it turned out that bit 0 of FORMAT was previously unknown to us.
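Comparing two values written to the same register comes down to an exclusive OR. The helpers below are hypothetical names introduced for this example; they report which bit positions differ and make no claim about what each bit means.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask of the bits that differ between two values of the same register. */
uint32_t changed_mask(uint32_t a, uint32_t b)
{
    return a ^ b;
}

/*
 * Store the positions of the differing bits, lowest first, in `bits`
 * (which must have room for 32 entries) and return how many there are.
 */
size_t changed_bit_positions(uint32_t a, uint32_t b, unsigned bits[32])
{
    uint32_t mask = a ^ b;
    size_t n = 0;

    for (unsigned i = 0; i < 32; i++) {
        if (mask & (UINT32_C(1) << i))
            bits[n++] = i;
    }
    return n;
}
```

Applied to the two FORMAT values from the traces, `changed_mask(0x00110200, 0x00110101)` yields 0x00000301, so bits 0, 8 and 9 differ between the YUY2 and YV12 runs. Of those, bit 0 is the one that was not covered by the documentation and code available to us.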

Next we tried to get the feature to work in our driver. We tried it without touching the three unknown registers, and got no video at all. So it had an effect, but we weren't sure if it really was the "YV12 format" bit. Looking further into the MMioTraces showed that what was written into the three registers was in fact fairly similar to what was done for the image buffer setup, and we were able to make an educated guess at what was supposed to be written here. (It was the setup of the color buffer, while the "main" buffer was used for luminance data.)

In the end, we got YV12 to work in Nouveau's Xv without converting to YUY2, which represented an increase in performance of (about) the expected 25%. MMioTrace enabled us to discover how the card needed to be programmed to do YV12 in hardware, which was apparently known by nobody outside of NVidia before.

This knowledge ended up in nv_video.c in NVPutOverlayImage:

   /* Those are important only for planar formats (NV12) */
   if ( uvoffset )
   {
       nvWriteVIDEO(pNv, NV_PVIDEO_UVPLANE_BASE(buffer), 0); 
       nvWriteVIDEO(pNv, NV_PVIDEO_UVPLANE_OFFSET_BUFF(buffer), uvoffset);
   }

It is interesting to note that MMioTrace simply records all register reads and writes - you can see almost everything that the kernel module does to the card. The downside to "almost everything" is that the saved data set gets large fast. Reducing the trace range and using it only for short periods of time helps a bit but still... after a few minutes of mmiotracing, you will get into the megabyte range for your logs. Sifting through those thousands of lines to find what one is looking for takes some time to get used to.

We used MMioTrace to reverse-engineer YV12 overlay, but we also used it to reverse-engineer a very large part of card initialization code and mode setting - and it will most certainly be useful for many other things that involve a kernel module. It is not limited to Nouveau, and is able to trace MMIO operations from any of your (binary) kernel modules, thereby allowing reverse-engineering of drivers for other hardware.

Current development in Unix graphics and its influence on Nouveau

We'll now take a peek into the future of 3D acceleration on Linux. 2007 saw a number of major changes in how Linux and X11 handle graphics. A lot of improvements are coming into use: EXA for 2D acceleration, TTM for memory management, Gallium3D for 3D, the new DRI2 interface, etc. All this needs driver-side changes, which can take some time to be done.

With the advent of programmable graphics hardware, the old graphics driver model in Mesa became unsuitable. The current Mesa model is designed for cards which are based around OpenGL fixed-function operations. Fixed-function cards have hardware blocks designed for each part of the GL pipeline. The driver model for this requires each new piece of fixed functionality to call into the driver, which can get complex. This also causes a lot of code to be duplicated in each driver.

A new driver model, called Gallium3D, tries to simplify the driver interface and increase the amount of shared code. It is designed to cater for OpenGL 3.0's needs as well as current OpenGL and DirectX APIs. It is also designed to allow portable drivers across all major platforms/OSes. It assumes programmable graphics hardware with, at least, fragment shaders.

Now that we know why the design was changed, let's have a look at the architecture of Gallium3D. Gallium3D splits the DRI driver into three major components: the common "state tracker", the OS-dependent "winsys" layer and the hardware-specific 3D driver. The winsys is in charge of 2D action and most of the housekeeping and OS-specific bits, while the hardware driver does the 3D. Each driver needs to implement a hardware driver and a winsys part. If an existing driver gets ported to another OS, only the winsys part needs to be redone.
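The point of the split can be shown with a deliberately simplified sketch. None of the names below come from Gallium3D: `struct winsys`, `struct hw_driver`, `render_and_present()` and the example implementations are hypothetical, and "frames" are just strings. The sketch only illustrates how shared code can drive a hardware layer and an OS layer through two small interfaces.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical OS-specific layer: gets a finished frame onto the screen. */
struct winsys {
    const char *os_name;
    int (*present)(const char *frame, char *out, size_t out_len);
};

/* Hypothetical hardware-specific layer: renders on one GPU family. */
struct hw_driver {
    const char *chip_name;
    int (*draw)(char *frame, size_t frame_len);
};

/* Shared code, standing in for the state tracker: the same for every
 * OS and every chip. Returns 0 on success, -1 on failure. */
int render_and_present(const struct hw_driver *hw, const struct winsys *ws,
                       char *out, size_t out_len)
{
    char frame[64];

    if (hw->draw(frame, sizeof(frame)) < 0)
        return -1;
    return ws->present(frame, out, out_len);
}

/* Example hardware layer (made up for this sketch). */
int nv4x_draw(char *frame, size_t frame_len)
{
    int n = snprintf(frame, frame_len, "frame drawn by nv4x");

    return (n < 0 || (size_t)n >= frame_len) ? -1 : 0;
}

/* Two example OS layers (made up for this sketch). */
int linux_present(const char *frame, char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "linux: %s", frame);

    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

int other_os_present(const char *frame, char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "other-os: %s", frame);

    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Pairing the same `hw_driver` with either `winsys` changes only the presentation step, which mirrors the claim that a port to another OS needs only a new winsys part.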

There is also a fully working reference software 3D driver called softpipe. It is a software renderer showing the Gallium3D concepts and how to implement them, which also acts as a software fallback driver for things the hardware cannot handle.

Another component of the new graphics subsystem is the TTM based memory manager. TTM is a unified in-kernel manager for all GPU accessible memory. Previous memory management was split between X drivers, mostly using static allocations. TTM was originally designed and implemented for Intel hardware, and had to be adapted to handle NVidia hardware and Nouveau software design. The main feature added to TTM was called fence classing, which was required to support NVidia's multiple hardware contexts.

Current Status

When we shifted work from reverse engineering to driver development last year, we were asked when a driver would be ready. We predicted late 2007, but we only got part of the work done.

Except for NV5x cards, we basically have a good-to-reasonably-well working 2D driver. Releasing an official "2D" driver was considered but, at this point, the kernel interfaces are not considered stable enough to support for the long term. When a DRM kernel module is shipped in Linus's kernel, the interfaces are required to be supported indefinitely. This would be unwise for Nouveau as the interface is evolving to accommodate changes for TTM and mode setting, and supporting old interfaces may place hard-to-support requirements on newer ones.

Currently, Nouveau can claim:

basic 2D rendering on all cards (through EXA)
EXA composite (implementing the XRENDER extension) works via the 3D engine on all cards except NV5x and NV04. In the case of NV04, hardware limitations make a composite implementation difficult if not impossible. NV1x was just recently completed, which was a major feat as these cards only have two fixed function register combiners and no shaders
Xv from NV04 up to NV4x thanks to the work of Arthur Huillet. Depending on the hardware, it uses either the blitter (on NV4->NV4x), the overlay (on NV4->NV30) or a video texture (on NV40). Xv performance is on par with that of the NVidia binary driver on some cards.
PPC support: at least some PPC based systems work. Most endian-based problems are solved thanks to the help of the PS3 RSX project and Ben Herrenschmidt. However, some systems are exhibiting DMA hangs when trying to do uploads to the card. The code is currently being audited and most of the PPC bugs have been fixed.
xrandr 1.2 support is being worked on; basic mode setting should mostly work on NV3x, NV4x and NV5x cards. More sophisticated features, like dual head support, are actively being worked on and progress is fast.
the Nouveau-specific DRM code has some preliminary work done for TTM; for example, we have one FIFO allocated for DRM use only. However, a fair amount of work is left until we have something really useful there.
Ben Skeggs is working on a Gallium3D driver for NV4x and NV5x. This driver does work for NV4x but is neither feature complete nor bug free. NV5x does not work currently.
Stephane is working on supporting shaderless cards with Gallium3D. That would be a generic framework which, in case of NVidia cards, could support shader instructions on cards ≥NV04 <NV30. This framework is not specifically designed for NVidia cards but should help older ATI/Intel cards too.

The weak spot is currently the NV50. On these cards, 2D is working the same as nv but saving and restoring the console / virtual terminal state doesn't work.

All that is nice and somewhat important to have, but I hear you ask: "what about 3D?" The short answer is: we don't have 3D working. The longer answer is: NV5x doesn't work and needs more reverse engineering, as a lot has changed from NV4x. For all other cards the needed information is available, but many pieces of the puzzle still have to be put together to build a final driver.

As a proof of concept, glxgears works on NV1x, NV3x and NV4x but with some glitches. However, work on the Mesa DRI driver has ceased in order to target Gallium3D. A somewhat working Gallium3D driver exists with many bugs and glitches. The NV4x is getting better every day but isn't usable for games yet. Gallium3D itself is still a work in progress and the same holds true for our Gallium3D driver.

Currently, a fair amount of work is going on in the mode setting field, with Maarten Maathuis and Stuart Bennett enhancing this part of the code. This leads to RandR1.2 (dual head) support in Nouveau. Once this is done, we plan to move it into kernel land, following the other drivers. A kernel API has been defined for that purpose. Basically this API looks like a simplified RandR1.2 API, which should make porting easy.

So what is coming next? This is only a rough outlook of what we want to do mid term:

Finish 2D work which includes mode setting and RandR1.2
More reverse engineering for NV5x cards.
Implement TTM support
Implement Gallium3D drivers. This one is obvious for the cards with shaders. However, as Gallium3D expects shaders, older cards are left out in the cold unless Stephane gets his framework working. In case the framework isn't feasible, a DRI driver for older cards may be the only option.

By the way: If you are interested in more details, please have a look at our Wiki and TiNDC ("The Irregular Nouveau Development Companions") or join us in #nouveau on freenode (logs are available).

So, to keep with tradition, let's have some screenshots. Here's a shot of Neverball running under the Nouveau driver:

[Neverball under Nouveau]

And OpenArena with a Nouveau Gallium3D build from January 2008 displays this:

[OpenArena under Nouveau]

It seems the weapon is a bit too dark but otherwise we couldn't find obvious differences.

Further information about Gallium3D can be found on the Tungsten Graphics site.

Conclusion

So that is our current status. Our roadmap shows that the next milestone would be Quake, which is not so far away on NV4x but has some problems to overcome on the other cards. Our first estimate of Autumn / Winter 2007 held up well for the 2D part but, as we detailed earlier, was somewhat delayed due to decisions out of our control like TTM and Gallium. However, the decision was the right one as Nouveau will be one of the most advanced and future-proof drivers available.

And finally: I would like to take this opportunity to thank Arthur Huillet, Ben Skeggs, David Airlie and Stephane Marchesin for their great help on this article. It definitely was a team effort!

Index entries for this article
GuestArticles	Rathmann, Bettina