February 26, 2008
This article was contributed by B. Rathmann (KoalaBR)
[Editor's note: this is the second in a two-part series on the state of
the Nouveau driver for NVIDIA hardware. The first installment is recommended
reading for those who have not yet seen it.]
Sources of information, and reverse engineering tools
As very little information is available on NVidia's hardware design and
implementation, the Nouveau project has developed a number of tools to gain
a better understanding of card architecture and programming model. These
tools, along with some previously available information, are what are used
to create the driver.
The Haiku/BeOS projects have a driver that came from a software development
kit NVidia released
for NV03/04 cards, and also gathered some information from an unobfuscated
nv driver that appeared briefly in XFree86. This driver has improved
mode-setting code compared to nv, and a basic 3D driver using hard-coded
objects running in a single context.
More information was available in the nvclock utility, which allows
overclocking NVidia GPUs on Linux. Its lead developer Roderick Colenbrander
(Thunderbird) has helped out Nouveau in the clock setup, i2c and tv-out
areas.
renouveau
The first utility developed was called renouveau. renouveau is mainly
concerned with reverse engineering the NVidia binary driver by black-boxing
it, feeding it certain inputs and watching what it writes to the
hardware. It runs a large batch of OpenGL tests which exercise most of the
GPU's capabilities and generates a set of dump files which are sent to the
Nouveau developers.
The tool works by mapping the card registers and the FIFO assigned to the
current application. It then records the current state of both FIFO and
registers, executes small OpenGL tests, and compares the final state
against the initial saved state. It then dumps this information, which can be
parsed into a human-readable form using an XML register/command
database. (Some developers would argue that the hex is readable to them.)
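The snapshot-and-compare idea can be illustrated with a minimal sketch. This is not renouveau's code: the window size NUM_REGS, the function names and the output format are all assumptions made for illustration, and a plain array stands in for the mapped registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical size of the register window being watched, in 32-bit words. */
#define NUM_REGS 8

/* Save the current register contents so they can be compared later. */
static void snapshot_regs(const volatile uint32_t *regs, uint32_t *saved)
{
    assert(regs != NULL && saved != NULL);
    for (size_t i = 0; i < NUM_REGS; i++)
        saved[i] = regs[i];
}

/*
 * Print every register whose value differs from the snapshot and
 * return how many registers changed.
 */
static size_t dump_changes(const uint32_t *before, const volatile uint32_t *regs)
{
    size_t changed = 0;

    assert(before != NULL && regs != NULL);
    for (size_t i = 0; i < NUM_REGS; i++) {
        uint32_t now = regs[i];

        if (now != before[i]) {
            printf("reg+0x%04zx: 0x%08x -> 0x%08x\n",
                   i * 4, (unsigned)before[i], (unsigned)now);
            changed++;
        }
    }
    return changed;
}
```

A caller would take a snapshot, run one small test, and then call dump_changes() to see only what that test touched.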
The tool has advantages in that it can be run very simply by end users, on
various card architectures, without requiring root privileges. It doesn't
tamper with the binary driver, and does not require much technical
knowledge.
MMioTrace
MMioTrace is a tool for tracing memory-mapped I/O (MMIO) access within the kernel. The NVidia
driver contains a kernel module which is responsible for a lot of card
initialization and mode setting. This activity cannot easily be traced by user-space
tools such as renouveau. MMioTrace uses relayfs and debugfs to relay the
tracing data to userspace.
MMioTrace works by replacing the ioremap(), ioremap_nocache(),
and iounmap() calls made by the driver being probed with wrappers that
call into MMioTrace. When the driver module in question calls ioremap() to
access the MMIO registers, the pages are mapped as not-present in the kernel
address space instead. MMioTrace can be set up to trace only the address
ranges which are likely to be touched by the driver you are interested in,
thus reducing the number of irrelevant MMIO accesses recorded.
When the module then tries to access the register space, a page fault will
occur. In the page fault handler the address is detected and the attempted
action recorded. The page is then marked present and the page-faulting
code is single-stepped to execute the instruction doing MMIO. After that
the page is set to "not present" again so that the cycle can be restarted
for the next access to the page.
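The fault, record, single-step, re-arm cycle can be modelled in ordinary user-space C. The sketch below is only a simulation of the control flow: there is no real page table or page fault involved, and the structure and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_WORDS 16   /* hypothetical size of the simulated MMIO page */

/* A simulated MMIO page with a "present" flag, as in a page table entry. */
struct traced_page {
    uint32_t words[PAGE_WORDS];
    bool present;
    unsigned long faults;
};

/* Stand-in for the page fault handler: record the access, then allow it. */
static void fault_handler(struct traced_page *p, size_t idx,
                          bool is_write, uint32_t value)
{
    p->faults++;
    if (is_write)
        printf("W +0x%02zx <- 0x%08x\n", idx * 4, (unsigned)value);
    else
        printf("R +0x%02zx\n", idx * 4);
    p->present = true;          /* mark the page present */
}

static void traced_write(struct traced_page *p, size_t idx, uint32_t value)
{
    assert(idx < PAGE_WORDS);
    if (!p->present)            /* the access "faults" */
        fault_handler(p, idx, true, value);
    p->words[idx] = value;      /* the single-stepped instruction */
    p->present = false;         /* re-arm for the next access */
}

static uint32_t traced_read(struct traced_page *p, size_t idx)
{
    uint32_t value;

    assert(idx < PAGE_WORDS);
    if (!p->present)
        fault_handler(p, idx, false, 0);
    value = p->words[idx];
    p->present = false;
    return value;
}
```

Because the page is marked not-present again after every access, each read or write goes through the handler exactly once, which is why the trace sees every access.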
MMioTrace has some restrictions on tracing into the legacy ISA address
range, as marking those pages not present crashes the kernel. A solution to
this may be forthcoming but would require patching the kernel.
MMioTrace is usable for all types of drivers running in the kernel, not
just graphics drivers. It is not yet shipped with the kernel; it worked
as an external module up to 2.6.23. However, 2.6.24 removed certain
features, which means that MMioTrace will need to be
upstreamed for it to work with 2.6.25 or later kernels.
If you are interested in more details, you should have a look at the
MMioTrace page.
valgrind-mmt
Valgrind-mmt is a plugin for the valgrind debugging suite. It traces MMIO
accesses from a user-space process (like the X.org server) where the NVidia
DDX code
is loaded. This was originally written by Dave Airlie for tracing ATI
hardware and has since been extended by a number of other developers. It is used in
Nouveau in a way similar to renouveau: to dump the contents of a FIFO.
Valgrind-mmt allows reliably tracing the X.org FIFO, which is something
renouveau cannot do very well. Tracing the X.org FIFO is sometimes required
as it is the only way to see how some 2D features are implemented.
Using MMioTrace to implement a new feature
Commands are usually sent to the card by writing to the command FIFO, not
by touching registers directly. But initialization of the card (including
notably mode setting), as well as some other operations, are done via MMIO
operations from within the kernel.
Below is an example of how MMioTrace was used to reverse engineer the YV12
video overlay that is present in some NVidia cards.
Video formats
Videos are usually not encoded in the RGB colorspace. Most video codecs
work in the YUV colorspace instead, where Y stands for luminance (black and
white image), and U and V represent the chrominance (i.e. color). Since the
eye is more sensitive to luminance, codecs usually drop a fraction of U and
V samples in order to save space. When the card is asked through
e.g. X-Video to display a video frame, it is passed a buffer containing YUV
data, usually in YV12 or YUY2 format.
FourCC.org can give you details about those formats, but for
the purposes of this article, we will just say that YUY2 is a format that
keeps one chrominance sample (U and V alternately) per luminance sample,
thus giving "YUYVYUYV" to the card (16 bits per pixel), and YV12 is a format that
keeps two chrominance samples (one U, one V) per 2x2 luminance block, which
gives an effective 12 bits per pixel of video. YV12 is 25% smaller than
YUY2 and is the format used by most popular codecs. Your author has yet to
find any movie codec that does not output YV12 (or I420, which is
conceptually the same; it just swaps the positions of U and V in the
buffer).
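The 16 and 12 bits-per-pixel figures can be checked with a little arithmetic. The sketch below assumes tightly packed frames with even dimensions and no row padding; the function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* YUY2: every pixel has one Y byte plus one chrominance byte (U or V). */
static size_t yuy2_frame_bytes(size_t width, size_t height)
{
    return width * height * 2;
}

/*
 * YV12: a full-resolution Y plane, plus one V and one U plane that are
 * each subsampled by two in both directions.
 */
static size_t yv12_frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height + 2 * (width / 2) * (height / 2);
}
```

For a 720x576 frame this gives 829440 bytes in YUY2 and 622080 bytes in YV12, a ratio of exactly 3/4.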
Some months ago, Nouveau's Xv implementation was inherited
from nv. Besides being extremely slow, nv supported only the YUY2 format, and
converted YV12 input to YUY2 in software before uploading the data to the
card. While working on improving performance, we quickly came to wonder if
NVidia cards supported YV12 in hardware. Due to the 25% size reduction,
this would naturally decrease the volume of bus transfers, which plays a
very important role in Xv throughput especially on PCI cards.
We verified that by running performance tests on the NVidia binary driver,
playing YV12 and YUY2 videos (using mplayer's -yuy2 option). Our performance
tests consisted simply of mplayer's "benchmark" mode. The results were
extremely clear: the operation required just over 20 seconds in YUY2 mode, and
just over 15 seconds in YV12 mode.
No need to reach for your calculator: it is a 25% difference, which matches the
difference in data size exactly. The most obvious explanation is that the data is sent to
the hardware in YV12 format.
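For reference, a software YV12-to-YUY2 conversion of the kind nv performed before each upload might look like the sketch below. This is not the nv code: it assumes tightly packed planes with no row padding and even dimensions, and the function name is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert one planar YV12 frame (Y plane, then V plane, then U plane)
 * into packed YUY2 (Y0 U Y1 V for every pair of pixels).
 * dst must hold width * height * 2 bytes.
 */
static void yv12_to_yuy2(const uint8_t *src, uint8_t *dst,
                         size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    const uint8_t *y_plane = src;
    const uint8_t *v_plane = src + width * height;
    const uint8_t *u_plane = v_plane + (width / 2) * (height / 2);

    for (size_t row = 0; row < height; row++) {
        const uint8_t *y = y_plane + row * width;
        /* Two consecutive rows share one row of chrominance samples. */
        const uint8_t *u = u_plane + (row / 2) * (width / 2);
        const uint8_t *v = v_plane + (row / 2) * (width / 2);
        uint8_t *out = dst + row * width * 2;

        for (size_t x = 0; x < width; x += 2) {
            out[0] = y[x];
            out[1] = u[x / 2];
            out[2] = y[x + 1];
            out[3] = v[x / 2];
            out += 4;
        }
    }
}
```

Every byte of the frame is touched on the CPU and the result is a third larger than the input, which is why avoiding this step matters for throughput.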
So the situation was: we had an Xv driver that handled YUY2 video only, we
knew (or thought, with a high degree of confidence and hope) that the
hardware supported YV12, but no existing driver like rivatv had code for
it. Some reverse engineering had to take place.
MMioTrace doesn't enter the arena just now, however. As mentioned above,
most of the time, commands are sent to the card by writing to the command
FIFO, and not by touching registers. So we first checked the X command FIFO
using valgrind-mmt and found some commands related to video.
However, it quickly turned out that those were software methods, that is to
say, dummy methods that make the card generate an interrupt asking for the
kernel to handle it. It's somewhat similar to an ioctl() call into the
kernel module, except that it's in sync with the FIFO. First lesson
learned: video overlay setup is done by the kernel module.
We then MMioTraced the NVidia binary driver, playing YUY2 and YV12 video
(same dimensions, window position, ... - the only thing that differed was
the format), and compared the outputs. And among the 150 kilobytes of resulting data,
we found (for YUY2 mode):
NV_PVIDEO.[0].FORMAT <- 0x00110200
While for YV12 mode:
NV_PVIDEO.[0].FORMAT <- 0x00110101
NV_PVIDEO+0x800 <- 0x00000000
NV_PVIDEO+0x808 <- 0x07fcffff
NV_PVIDEO+0x820 <- 0x07f70000
So here we had a different value being written into FORMAT, and three
unknown registers. From a reading of existing documentation and code, it turned out
that bit 0 of FORMAT was previously unknown to us.
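A quick way to see which bits differ between the two FORMAT values is to XOR them. The helper names below are made up for this example; only the two register values come from the traces.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bits set in the result are the bits that differ between a and b. */
static uint32_t changed_bits(uint32_t a, uint32_t b)
{
    return a ^ b;
}

static int bit_is_set(uint32_t value, unsigned bit)
{
    assert(bit < 32);
    return (int)((value >> bit) & 1u);
}

/* List each differing bit together with its old and new state. */
static void print_changed_bits(uint32_t a, uint32_t b)
{
    uint32_t diff = changed_bits(a, b);

    for (unsigned bit = 0; bit < 32; bit++)
        if (bit_is_set(diff, bit))
            printf("bit %u: %d -> %d\n",
                   bit, bit_is_set(a, bit), bit_is_set(b, bit));
}
```

For 0x00110200 (YUY2) and 0x00110101 (YV12) the XOR is 0x00000301, so bits 0, 8 and 9 differ; bit 0 is the one that was not previously known.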
Next we tried to get the feature to work in our driver. We first set the new
bit without touching the three unknown registers, and got no video at all. So
the bit had an effect, but we weren't sure if it really was the "YV12 format"
bit. Looking further into the MMioTraces showed that what was written into the
three registers was in fact fairly similar to what was done for the image
buffer setup, and we were able to make an educated guess at what was
supposed to be written there. (It was the setup of the color buffer, while
the "main" buffer was used for luminance data.)
In the end, we got YV12 to work in Nouveau's Xv without converting to YUY2,
which represented an increase in performance of (about) the expected 25%.
MMioTrace enabled us to discover how the card needed to be programmed to do
YV12 in hardware, which was apparently known by nobody outside of NVidia
before.
This knowledge ended up in nv_video.c in NVPutOverlayImage:
    /* Those are important only for planar formats (NV12) */
    if ( uvoffset )
    {
        nvWriteVIDEO(pNv, NV_PVIDEO_UVPLANE_BASE(buffer), 0);
        nvWriteVIDEO(pNv, NV_PVIDEO_UVPLANE_OFFSET_BUFF(buffer), uvoffset);
    }
It is interesting to note that MMioTrace simply records all register reads
and writes - you can see almost everything that the kernel module does to
the card. The downside to "almost everything" is that the saved data set
gets large fast. Reducing the trace range and tracing only for short
periods of time helps a bit, but even so, after a few minutes of tracing,
your logs will be in the megabyte range. Sifting through those thousands of
lines to find what one is looking for takes some time to get used to.
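Part of that sifting can be automated. The sketch below assumes the decoded "REGISTER <- value" line form shown earlier in this article, not the raw MMioTrace output, and keeps only writes to registers whose name starts with a given prefix. The structure and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One decoded register write, e.g. "NV_PVIDEO.[0].FORMAT <- 0x00110200". */
struct reg_write {
    char name[64];
    unsigned int value;
};

/* Parse one line; returns false if it is not a register write. */
static bool parse_write(const char *line, struct reg_write *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%63s <- %x", out->name, &out->value) == 2;
}

/* True if the register name starts with the given prefix. */
static bool in_block(const struct reg_write *w, const char *prefix)
{
    return strncmp(w->name, prefix, strlen(prefix)) == 0;
}

/* Copy matching writes from in to out; returns how many were kept. */
static size_t filter_writes(FILE *in, FILE *out, const char *prefix)
{
    char line[256];
    struct reg_write w;
    size_t kept = 0;

    while (fgets(line, sizeof line, in)) {
        if (parse_write(line, &w) && in_block(&w, prefix)) {
            fprintf(out, "%s <- 0x%08x\n", w.name, w.value);
            kept++;
        }
    }
    return kept;
}
```

Running the same filter over two traces that differ in only one respect, and then comparing the filtered output, narrows the search considerably.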
We used MMioTrace to reverse-engineer YV12 overlay, but we also used it to
reverse-engineer a very large part of card initialization code and
mode setting - and it will most certainly be useful for many other things that
involve a kernel module.
It is not limited to Nouveau, and is able to trace MMIO operations from any
of your (binary) kernel modules, thereby allowing reverse-engineering of
drivers for other hardware.
Current development in Unix graphics and its influence on Nouveau
We'll now take a peek into the future of 3D acceleration on Linux. 2007
saw a number of major changes in how Linux and X11 handle
graphics. A lot of improvements are coming into use: EXA for 2D
acceleration, TTM for memory
management, Gallium3D for 3D, the new DRI2
interface, etc. All of this requires driver-side changes, which take time
to implement.
With the advent of programmable graphics hardware, the old graphics driver
model in Mesa became unsuitable. The current Mesa model is designed for
cards which are based around OpenGL fixed-function
operations. Fixed-function cards have hardware blocks designed for each part
of the GL pipeline. The driver model for this requires each new piece of
fixed functionality to call into the driver, which can get complex. This
also causes a lot of code to be duplicated in each driver.
A new driver model, called Gallium3D, tries to simplify the driver
interface and increase the amount of shared code. It is designed to cater
for OpenGL 3.0's needs as well as current OpenGL and DirectX APIs. It is
also designed to allow portable drivers across all major platforms/OSes. It
assumes programmable graphics hardware with, at least, fragment shaders.
Now that we know why the design was changed, let's have a look at the
architecture of Gallium3D. Gallium3D splits the DRI driver into three major
components: the common "state tracker", the OS-dependent "winsys" layer and
the hardware-specific 3D driver.
The winsys is in charge of 2D action and most of the housekeeping and
OS-specific bits, while the hardware driver does the 3D. Each driver needs
to implement a hardware driver and a winsys part. If an existing driver
gets ported to another OS, only the winsys part needs to be redone.
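The benefit of this split can be shown with a deliberately simplified sketch. The structures and functions below are hypothetical and do not match the real Gallium3D headers; they only show a hardware driver reaching the OS through a winsys table, so that a port replaces the table and nothing else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* OS-dependent services (hypothetical names, not the real Gallium3D API). */
struct winsys {
    void *(*buffer_create)(size_t bytes);
    void  (*buffer_destroy)(void *buf);
};

/* Hardware-specific part: it reaches the OS only through the winsys. */
struct hw_driver {
    const struct winsys *ws;
    void *vertex_buf;
};

static int hw_driver_init(struct hw_driver *drv, const struct winsys *ws,
                          size_t vertex_bytes)
{
    assert(drv != NULL && ws != NULL);
    drv->ws = ws;
    drv->vertex_buf = ws->buffer_create(vertex_bytes);
    return drv->vertex_buf ? 0 : -1;
}

static void hw_driver_fini(struct hw_driver *drv)
{
    drv->ws->buffer_destroy(drv->vertex_buf);
    drv->vertex_buf = NULL;
}

/* One possible winsys: a trivial one backed by the C heap. */
static void *heap_buffer_create(size_t bytes) { return malloc(bytes); }
static void heap_buffer_destroy(void *buf)    { free(buf); }

static const struct winsys heap_winsys = {
    heap_buffer_create,
    heap_buffer_destroy,
};
```

Porting to another OS would mean supplying a different struct winsys; hw_driver_init() and hw_driver_fini() would stay unchanged.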
There is also a fully working reference software 3D driver called softpipe.
It is a software renderer showing the Gallium3D concepts and how to implement
them, which also acts as a software fallback driver for things the hardware
cannot handle.
Another component of the new graphics subsystem is the TTM based memory
manager. TTM is a unified in-kernel manager for all GPU accessible memory.
Previous memory management was split between X drivers, mostly using static
allocations. TTM was originally designed and implemented for Intel
hardware, and had to be adapted to handle NVidia hardware and Nouveau
software design. The main feature added to TTM was called fence classing,
which was required to support NVidia's multiple hardware contexts.
Current Status
When we shifted work from reverse engineering to driver development last
year, we were asked when a driver would be ready. We predicted late 2007,
but we only got part of the work done.
Except for NV5x cards, we basically have a good-to-reasonably-well working
2D driver. Releasing an official "2D" driver was considered but, at this
point, the kernel interfaces are not considered stable enough to support
for the long term. When a DRM kernel module is shipped in Linus's kernel, the
interfaces are required to be supported indefinitely. This would be unwise
for Nouveau as the interface is evolving to accommodate changes for TTM and
mode setting, and supporting old interfaces may place hard-to-support
requirements on newer ones.
Currently, Nouveau can claim:
- basic 2D rendering on all cards (through EXA)
- EXA composite (implementing the XRENDER extension) works via the 3D engine on
all cards except NV5x and NV04. In the case of NV04, hardware limitations
make a composite implementation difficult if not impossible.
NV1x was just recently completed, which was a major feat as
these cards only have two fixed function register combiners and no shaders
- Xv from NV04 up to NV4x thanks to the work of Arthur Huillet.
Depending on the hardware, it uses the blitter (on NV04 to NV4x), the overlay (on NV04 to NV30) or a video texture (on NV40).
Xv performance is on par with that of the NVidia binary driver on some cards.
- PPC support:
at least some PPC-based systems work. Most endianness problems
are solved thanks to the help of the PS3 RSX project and Ben Herrenschmidt. However,
some systems are exhibiting DMA hangs when trying to do uploads to the
card. The code is currently being audited and most of the PPC bugs have been fixed.
- RandR 1.2 support is being worked on; basic mode setting should mostly work
on NV3x, NV4x and NV5x cards. More sophisticated features, like dual head
support, are actively being worked on and progress is fast.
- the Nouveau-specific DRM code has some preliminary work done for TTM; for
example, we have one FIFO allocated for DRM use only. However, a fair amount of work
is left until we have something really useful there.
- Ben Skeggs is working on a Gallium3D driver for NV4x and NV5x. This driver does
work for NV4x but is neither feature complete nor bug free. NV5x does not work
currently.
- Stephane is working on supporting shaderless cards with Gallium3D. That would
be a generic framework which, in the case of NVidia cards, could support
shader instructions on cards from NV04 up to, but not including, NV30. This
framework is not specifically designed for NVidia cards but should help older
ATI/Intel cards too.
The weak spot is currently the NV50. On these cards, 2D is working the same
as nv but saving and restoring the console / virtual terminal state doesn't
work.
All that is nice and somewhat important to have, but I hear you ask "what
about 3D"? The short answer is: We don't have 3D working. The longer
answer is: NV5x doesn't work and needs more reverse engineering as a lot
has changed from NV4x. For all other cards the needed information is
available but there are many pieces in the puzzle to build a final driver.
As a proof of concept, glxgears works on NV1x, NV3x and NV4x but with some
glitches. However, work on the Mesa DRI driver has ceased in order to
target Gallium3D.
A somewhat working Gallium3D driver exists with many bugs and glitches.
The NV4x is getting better every day but isn't usable for games
yet. Gallium3D itself is still a work in progress and the same holds true
for our Gallium3D driver.
Currently, a fair amount of work is going on in the mode setting field, with
Maarten Maathuis and Stuart Bennett enhancing this part of the code. This
leads to RandR 1.2 (dual head) support in Nouveau. Once this is done, we
plan to move it into kernel land, following the other drivers. A kernel API
has been defined for that purpose. Basically, this API looks like a
simplified RandR 1.2 API, which should make porting easy.
So what is coming next?
This is only a rough outlook of what we want to do in the medium term:
- Finish 2D work, which includes mode setting and RandR 1.2
- More reverse engineering for NV5x cards
- Implement TTM support
- Implement Gallium3D drivers. This one is obvious for the cards with
shaders. However, as Gallium3D expects shaders, older cards are left in the cold
unless Stephane gets his framework working.
In case the framework isn't feasible, a DRI driver for older cards may be
the only option.
By the way: If you are interested in more details, please have a look at
our Wiki and TiNDC ("The Irregular Nouveau Development Companions") or join
us in #nouveau on freenode (logs are available).
So, to keep with tradition, let's have some screenshots. Here's a shot of Neverball running under the
Nouveau driver:
And OpenArena with a Nouveau Gallium3D build from January 2008
displays this:
It seems the weapon is a bit too dark but otherwise we couldn't find obvious
differences.
Further information about Gallium3D can be found on the
Tungsten Graphics site.
Conclusion
So that is our current status. Our roadmap shows that the next milestone is
Quake, which is not so far away on NV4x, but which has some problems to overcome on
the other cards. Our first estimate of Autumn / Winter 2007 held up well for
the 2D part but, as we detailed earlier, the rest was somewhat delayed by developments
out of our control, like TTM and Gallium3D. However, the decision to target them was the right
one, as Nouveau will be one of the most advanced and future-proof drivers
available.
And finally:
I would like to take this opportunity to thank Arthur Huillet, Ben Skeggs,
David Airlie and Stephane Marchesin for their great help on this article. It
definitely was a team effort!