Kernel development

Kernel release status

The current 2.6 development kernel is 2.6.25-rc3, released by Linus on February 24. The patches applied this time are mostly fixes, but there is also a new libata.force module parameter, a driver for ADT7473 hardware monitoring chips, a new PM_EVENT_HIBERNATE power management state, a driver for Marvell 88SE6440 SAS/SATA controllers, and file capabilities support for the SMACK security module. See the short-form changelog for details, or the long-form changelog for lots of details.

A slow stream of fixes has been trickling into the mainline git repository since the -rc3 release.

The current stable 2.6 kernel is 2.6.24.3, released on February 25 with a fair number of fixes. The 2.6.23.17 and 2.6.22.19 stable updates were released at the same time with a smaller number of fixes; they are probably the last updates in the 2.6.22 and 2.6.23 series.

For older kernels: 2.4.36.2 was released on February 24; it fixes a bug introduced in 2.4.36.1 and adds a fix for a relatively obscure security problem.
Kernel development news

Quote of the week
Machine-generated warnings are a great way of
quickly locating a large amount of questionable code in an otherwise
overwhelming haystack. It doesn't even matter much, which warnings you
look for. Almost all code checkers find the same hotspots.
-- Jörn Engel
But there is a catch. If you have an over-eager warning police that "fixes all the warnings", the warnings may be gone, but the very real problems in near vicinity are not. Not to mention new problems introduced by those claimed "fixes". [...] Note one scary consequence: code checkers in the wrong hands are actively harmful.
Merging drivers early

Drivers tend to be a world unto themselves, with bugs only affecting a subset (often a tiny subset) of kernel users. Until a driver gets merged into the kernel, though, anyone wishing to test it, or help clean it up, has to jump through some hoops. To help reduce those barriers, Linus Torvalds and others have been advocating early merging of drivers: getting them into the kernel and incrementally improving them from there.

This policy of early merging of drivers is not universally embraced, with a recent remote DMA (RDMA) Ethernet driver, which lives in the InfiniBand tree, getting singled out. Based on the problems he observed in the driver, Adrian Bunk asked: "Is it really intended to merge drivers without _any_ kind of review?" This was, perhaps, an overly dramatic question as the driver has undergone review, but not all of the changes have been reflected in the mainline version. There is still work to do, as InfiniBand maintainer Roland Dreier points out:
Just to be clear, this driver was reviewed. Many issues were found,
and many were fixed while others are being worked on.
It's a judgment call when to merge things, but in this case given the good engagement from the vendor, I didn't see anything to be gained by delaying the merge.

It is a sentiment shared by other kernel hackers as well. When there is a developer who is responding to the feedback along with a working driver, getting it into the mainline kernel, where more eyes can scrutinize it, is seen as a positive step. Torvalds is very interested in seeing drivers earlier so that more collaboration can happen:
I'd really rather have the driver merged, and then *other* people can send
patches!
The thing is, that's what merging really means - people can work on it sanely together. Before it's merged, it's a lot harder for people to work on it unless they are really serious about that driver, so before merging, the janitorial kind of things seldom happen.

Other maintainers explained their criteria for accepting drivers that are not quite up to usual kernel standards. The consensus seems to be that drivers with the following characteristics are acceptable:
There is little in the way of a downside to making drivers available earlier. Since they are self-contained, they generally don't cause problems elsewhere in the kernel. As long as reviewers are keeping an eye out for security problems, which could lead to an unsuspecting user's box being compromised, there are not many ways for a driver to negatively impact the kernel as a whole. User-space interfaces via ioctl(), sysfs, or other means also need to be closely examined, as they will have to be maintained as part of the kernel interface.

Along the way, much grumbling was heard about checkpatch, the Perl script that complains about various stylistic problems with a patch. Notably absent from the list above is any kind of requirement that checkpatch errors or warnings be handled. The main complaint against checkpatch is its checks for line length; the resulting "fixes" to kernel source sometimes leave much to be desired. While it is generally agreed that too many overly long lines can result in code that is difficult to read, exactly what constitutes such a line tends to be an aesthetic judgment.

Slavish adherence to a fixed number of characters on a line in order to appease checkpatch is clearly seen as a problem. To some, this makes checkpatch less than useful, bordering on dangerous to readability. Torvalds stated that he has considered removing it from the kernel tree on more than one occasion. Human judgment is required to interpret the warnings from checkpatch, and sometimes it is not being applied. On the other hand, Ingo Molnar gives an impassioned defense of the tool:
Based on this first hand experience, my opinion about checkpatch has
changed, rather radically: i now believe that checkpatch is almost as
important to the long term health of our kernel development process as
BitKeeper/Git turned out to be. If i had to stop using it today, it
would be almost as bad of a step backwards to me as if we had to migrate
the kernel source code control to CVS.
Molnar goes on to outline the pros and cons of checkpatch, all of which stands in stark contrast to some of his earlier complaints about the tool. For most drivers, the path into the kernel has been made a lot easier. This will have the effect of getting working, or mostly working, drivers into the hands of users more quickly. More importantly, it will also get the code into the hands of the Linux kernel community faster. The likely result is a fully working, cleanly coded driver sooner than it might have happened in the past. An already quick turnaround for hardware support in Linux may have just gotten faster.
Tracing memory-mapped I/O operations

Device drivers, in the end, usually do one thing: they communicate with the hardware by way of a set of memory-mapped I/O (MMIO) registers. So when one is trying to figure out what a driver is doing - for debugging purposes, perhaps - it is often interesting to look at the sequence of MMIO operations the driver performs. If one is trying to reverse-engineer a driver which is available only in binary form, watching what is done with MMIO registers may be the only way to figure out how the hardware works. To this end, the developers behind the Nouveau project developed a tool called "mmiotrace" which helps them to watch what is going on with memory-mapped I/O. Now that tool is being fixed up and pushed toward the mainline.

Drivers gain access to MMIO regions with ioremap() (or one of the higher-level functions like pci_iomap()), so that is the logical place to hook in a tracing infrastructure. So the current mmiotrace patch adds some new variants of ioremap():
void __iomem *ioremap_cache_trace(unsigned long offset, unsigned long size);
void __iomem *ioremap_nocache_trace(unsigned long offset, unsigned long size);
void iounmap_trace(volatile void __iomem *addr);
These functions perform like ioremap() and ioremap_nocache(), in that they return an I/O memory pointer which can be used by the driver to get at MMIO space. What goes on internally, though, is quite different. On the x86 architecture (as with most others), I/O memory space is accessed with memory operations through the page tables in the usual way, so ioremap() just returns an address which maps onto the desired physical space. The tracing versions, though, take the extra step of marking the pages within the I/O region as not being present in the system; as a result, whenever code attempts to access that space, a page fault will be generated.

Normally, page faults incurred when running in kernel mode will cause a kernel oops. There are exceptions, though; the functions which copy data between user and kernel space are one example. The mmiotrace patch adds another exception which tests faulting addresses against the MMIO region(s) being traced. Should the address indicate that an MMIO access is being attempted, the mmiotrace code will:
Once all this has happened, the instruction which originally caused the page fault will be rerun, successfully this time. But the setting of the trace bit will cause a new processor trap after that instruction has been executed. At that point, the page is marked unavailable once again, the trace bit is reset (assuming it wasn't set elsewhere), the tracing layer's "post" handler is called, and life continues as normal until the next fault happens.

The tracing layer really only has one task: figure out what the code was trying to do in MMIO space and log the action by way of the relay interface. Figuring things out means learning enough about the instruction which caused the page fault to determine which address was being accessed, whether a read or write was being performed, the size of the data being transferred, and the actual value read or written. So there is a certain amount of architecture-specific instruction-grubbing code involved, which, for the current patch, is only provided for x86 machines.

Since tracing is enabled by calling a special version of ioremap(), it is not possible to trace a driver module without making changes to its source and rebuilding it. That might seem like a strange requirement for a tool meant to help with reverse engineering (among other things). The driver being studied by the Nouveau project uses a GPL-licensed shim to link into the kernel, so making modifications in that case was not a hard thing to do. A more general solution may eventually need to be found, though, for situations where that sort of glue layer is not present.

Beyond that, this patch is likely to go through a number of changes before it finds its way into the mainline. Reviewers have found a number of things which need fixing, and there are a few too many places in the code where the comments say (literally) "if this happens, all hell breaks loose." It also seems likely that mmiotrace will be merged with the recently-posted ftrace tracing mechanism. There is time to get this work done before the 2.6.26 merge window opens, but the mmiotrace hackers will need to keep the work moving forward.
The state of Nouveau, part 2

[Editor's note: this is the second in a two-part series on the state of the Nouveau driver for NVIDIA hardware. The first installment is recommended reading for those who have not yet seen it.]
Sources of information, and reverse engineering tools

As very little information is available on NVidia's hardware design and implementation, the Nouveau project has developed a number of tools to gain a better understanding of card architecture and programming model. These tools, along with some previously available information, are what is used to create the driver.

The Haiku/BeOS projects have a driver that came from a software development kit NVidia released for NV03/04 cards, and also gathered some information from an unobfuscated nv driver that appeared briefly in XFree86. This driver has improved mode-setting code compared to nv, and a basic 3D driver using hard-coded objects running in a single context.

More information was available in the nvclock utility, which allows overclocking NVidia GPUs on Linux. Its lead developer Roderick Colenbrander (Thunderbird) has helped out Nouveau in the clock setup, i2c and tv-out areas.
renouveau

The first utility developed was called renouveau. renouveau is mainly concerned with reverse engineering the NVidia binary driver by black-boxing it, feeding it certain inputs and watching what it writes to the hardware. It runs a large batch of OpenGL tests which exercise most of the GPU's capabilities and generates a set of dump files which are sent to the Nouveau developers.

The tool works by mapping the card registers and the FIFO assigned to the current application. It then records the current state of both FIFO and registers, executes small OpenGL tests, and compares the final state against the initial saved state. It then dumps this information, which can be parsed into a human-readable form using an XML register/command database. (Some developers would argue the hex is readable to them.)

The tool has advantages in that it can be run very simply by end users, on various card architectures, without requiring root privileges. It doesn't tamper with the binary driver, and does not require much technical knowledge.
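The record-run-compare approach described for renouveau boils down to diffing two snapshots of the same state. The following is a minimal sketch of that comparison step only, assuming the state has already been captured into two equally sized arrays of 32-bit words; the structure and function names are hypothetical and are not taken from renouveau itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One difference between the "before" and "after" snapshots. */
struct reg_change {
    size_t index;      /* position of the word within the snapshot */
    uint32_t before;
    uint32_t after;
};

/* Compare two snapshots word by word and collect up to max_out differences.
 * Returns the number of entries written to out. */
static size_t diff_snapshots(const uint32_t *before, const uint32_t *after,
                             size_t nregs, struct reg_change *out,
                             size_t max_out)
{
    size_t n = 0;

    assert(before && after && out);
    for (size_t i = 0; i < nregs && n < max_out; i++) {
        if (before[i] != after[i]) {
            out[n].index = i;
            out[n].before = before[i];
            out[n].after = after[i];
            n++;
        }
    }
    return n;
}
```

Each reported index would then be looked up in a register database to turn the raw difference into something readable.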
MMioTrace

MMioTrace is a tool for tracing memory-mapped I/O (MMIO) access within the kernel. The NVidia driver contains a kernel module which is responsible for a lot of card initialization and mode setting. This activity cannot easily be traced by user-space tools such as renouveau. MMioTrace uses relayfs and debugfs to relay the tracing data to user space.

MMioTrace works by replacing the calls to the kernel's ioremap(), ioremap_nocache(), and iounmap() functions in the driver that is to be probed with wrappers that call into MMioTrace. When the driver module in question calls ioremap() to access the MMIO registers, the pages are mapped as not-present in the kernel address space instead. It can be set up to only trace address ranges which are likely to be touched by the driver you are interested in, thus reducing the number of irrelevant MMIO accesses that are recorded.

When the module then tries to access the register space, a page fault will occur. In the page fault handler the address is detected and the attempted action recorded. The page is then marked present and the page-faulting code is single-stepped to execute the instruction doing MMIO. After that the page is set to "not present" again so that the cycle can be restarted for the next access to the page.

MMioTrace has some restrictions on tracing into the legacy ISA address range, as marking those pages not present crashes the kernel. A solution to this may be forthcoming but would require patching the kernel.

MMioTrace is usable for all types of drivers running in the kernel, not just graphics drivers. It is not shipped with the kernel as of yet; it worked as an external module up to 2.6.23. However, 2.6.24 has seen the removal of certain features, which means that MMioTrace will need to be upstreamed for it to work with 2.6.25 or later kernels. If you are interested in more details, you should have a look at the MMioTrace page.
valgrind-mmt

Valgrind-mmt is a plugin for the valgrind debugging suite. It traces MMIO accesses from a user-space process (like the X.org server) where the NVidia DDX code is loaded. This was originally written by Dave Airlie for tracing ATI hardware and has since been extended by a number of other developers. It is used in Nouveau in a way similar to renouveau: to dump the contents of a FIFO. Valgrind-mmt allows reliably tracing the X.org FIFO, which is something renouveau cannot do very well. Tracing the X.org FIFO is sometimes required as it is the only way to see how some 2D features are implemented.
Using MMioTrace to implement a new feature

Commands are usually sent to the card by writing in the command FIFO, not by touching registers directly. But initialization of the card (including notably mode setting), as well as some other operations, are done via MMIO operations from within the kernel. Below is an example of how MMioTrace was used to reverse engineer the YV12 video overlay that is present in some NVidia cards.
Video formats

Videos are usually not encoded in the RGB colorspace. Most video codecs work in the YUV colorspace instead, where Y stands for luminance (black and white image), and U and V represent the chrominance (i.e. color). Since the eye is more sensitive to luminance than to chrominance, codecs usually drop a fraction of U and V samples in order to save space.

When the card is asked through e.g. X-Video to display a video frame, it is passed a buffer containing YUV data, usually in YV12 or YUY2 format. FourCC.org can give you details about those formats, but for the purposes of this article, we will just say that YUY2 is a format that keeps one chrominance sample (U or V alternately) per luminance sample, thus giving "YUYVYUYV" to the card (16 bits per pixel), and YV12 is a format that keeps two chrominance samples (one U, one V) per 2x2 luminance block, which gives an effective 12 bits per pixel of video. YV12 is 25% smaller than YUY2 and is the format used by most popular codecs. Your author has yet to find any movie codec that does not output YV12 (or I420, which conceptually is the same - it just inverts the position of U and V in the buffer).

Some months ago, Nouveau's Xv implementation was inherited from nv. Besides being extremely slow, nv supported only the YUY2 format, and converted YV12 input to YUY2 in software before uploading the data to the card. While working on improving performance, we quickly came to wonder if NVidia cards supported YV12 in hardware. Due to the 25% size reduction, this would naturally decrease the volume of bus transfers, which plays a very important role in Xv throughput, especially on PCI cards. We verified that by running performance tests on the NVidia binary driver, playing YV12 and YUY2 videos (using mplayer's -yuy2 option). Our performance tests consisted simply of mplayer's "benchmark" mode. The results were extremely clear: the operation required just over 20 seconds in YUY2 mode, and just over 15 seconds in YV12 mode.
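The per-frame sizes behind that comparison follow directly from the sampling described above: 16 bits per pixel for YUY2 and an effective 12 for YV12. A small sketch of the arithmetic follows; the function names are hypothetical, and the 720x576 frame used in the usage note is an illustrative assumption rather than the size used in the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* YUY2: every pixel carries one Y byte plus one chroma byte (U and V
 * alternate), so a frame takes 2 bytes per pixel. */
static size_t yuy2_frame_bytes(size_t width, size_t height)
{
    return width * height * 2;
}

/* YV12: a full-resolution Y plane followed by a V plane and a U plane, each
 * holding one sample per 2x2 block of pixels. Odd dimensions round up. */
static size_t yv12_frame_bytes(size_t width, size_t height)
{
    size_t chroma_w = (width + 1) / 2;
    size_t chroma_h = (height + 1) / 2;

    return width * height + 2 * chroma_w * chroma_h;
}
```

For a 720x576 frame this gives 829,440 bytes in YUY2 and 622,080 bytes in YV12, which is exactly three quarters of the YUY2 size.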
No need to reach for your calculator: it is a 25% difference, which matches the data size exactly. The most obvious explanation is that the data is sent to the hardware in YV12 format.

So the situation was: we had an Xv driver that handled YUY2 video only, we knew (or thought, with a high degree of confidence and hope) that the hardware supported YV12, but no existing driver like rivatv had code for it. Some reverse engineering had to take place.

MMioTrace doesn't enter the arena just now, however. As mentioned above, most of the time, commands are sent to the card by writing to the command FIFO, and not by touching registers. So we first checked the X command FIFO using valgrind-mmt and found some commands related to video. However, it quickly turned out that those were software methods, that is to say, dummy methods that make the card generate an interrupt asking for the kernel to handle it. It's somewhat similar to an ioctl() call into the kernel module, except that it's in sync with the FIFO. First lesson learned: video overlay setup is being done by the kernel module.

We then MMioTraced the NVidia binary driver, playing YUY2 and YV12 video (same dimensions, window position, ... - the only thing that differed was the format), and compared the outputs. And among the 150 kilobytes of resulting data, we found (for YUY2 mode):
NV_PVIDEO.[0].FORMAT <- 0x00110200
While for YV12 mode:
NV_PVIDEO.[0].FORMAT <- 0x00110101
NV_PVIDEO+0x800 <- 0x00000000
NV_PVIDEO+0x808 <- 0x07fcffff
NV_PVIDEO+0x820 <- 0x07f70000
So here we had a different value being written into FORMAT, and three unknown registers. From a reading of existing documentation and code, it turned out that bit 0 of FORMAT was previously unknown to us.

Next we tried to get the feature to work in our driver. We tried it without touching the three unknown registers, and got no video at all. So it had an effect, but we weren't sure if it really was the "YV12 format" bit. Further looking into MMioTraces showed that what was written into the three registers was in fact fairly similar to what was done for the image buffer setup, and we were able to make an educated guess at what was supposed to be written here. (It was the setup of the color buffer, while the "main" buffer was used for luminance data.)

In the end, we got YV12 to work in Nouveau's Xv without converting to YUY2, which represented an increase in performance of (about) the expected 25%. MMioTrace enabled us to discover how the card needed to be programmed to do YV12 in hardware, which was apparently known by nobody outside of NVidia before. This knowledge ended up in nv_video.c in NVPutOverlayImage:
/* Those are important only for planar formats (NV12) */
if (uvoffset) {
        nvWriteVIDEO(pNv, NV_PVIDEO_UVPLANE_BASE(buffer), 0);
        nvWriteVIDEO(pNv, NV_PVIDEO_UVPLANE_OFFSET_BUFF(buffer), uvoffset);
}
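The comparison of the YUY2 and YV12 traces that led to this code can be sketched as a diff keyed by register. The code below is only an illustration of that idea, not the tooling the Nouveau developers used: it assumes each trace has been parsed into an array of (register, value) writes, and all names are hypothetical. Registers are identified by plain numbers; the article does not give a numeric offset for FORMAT, so the usage note refers to it by a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One logged MMIO write: which register, and what was written to it. */
struct mmio_write {
    uint32_t reg;
    uint32_t val;
};

/* Find the last value written to reg in a trace; returns 1 if it was written. */
static int last_write(const struct mmio_write *t, size_t n, uint32_t reg,
                      uint32_t *val)
{
    int found = 0;

    for (size_t i = 0; i < n; i++) {
        if (t[i].reg == reg) {
            *val = t[i].val;
            found = 1;
        }
    }
    return found;
}

/* List every register whose final value in trace b is missing from, or
 * differs from, trace a. Each register is reported once, in the order it
 * first appears in b. Returns the number of registers written to out. */
static size_t diff_traces(const struct mmio_write *a, size_t na,
                          const struct mmio_write *b, size_t nb,
                          uint32_t *out, size_t max_out)
{
    size_t n = 0;

    assert(a && b && out);
    for (size_t i = 0; i < nb && n < max_out; i++) {
        uint32_t va = 0, vb = 0;
        size_t k = 0;

        while (k < n && out[k] != b[i].reg)
            k++;
        if (k < n)
            continue;               /* this register was already reported */

        last_write(b, nb, b[i].reg, &vb);
        if (!last_write(a, na, b[i].reg, &va) || va != vb)
            out[n++] = b[i].reg;
    }
    return n;
}

/* Bits that differ between two values written to the same register. */
static uint32_t changed_bits(uint32_t x, uint32_t y)
{
    return x ^ y;
}
```

Fed with the writes quoted above, the diff reports FORMAT plus the three registers at +0x800, +0x808 and +0x820. XORing the two FORMAT values (0x00110200 and 0x00110101) gives 0x00000301, so bits 0, 8 and 9 differ between the two modes; bit 0 is the one the article identifies as previously unknown.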
It is interesting to note that MMioTrace simply records all register reads and writes - you can see almost everything that the kernel module does to the card. The downside to "almost everything" is that the saved data set gets large fast. Reducing the trace range and using it only for short periods of time helps a bit but still... after a few minutes of mmiotracing, you will get into the megabyte range for your logs. Sifting through those thousands of lines to find what one is looking for takes some time to get used to. We used MMioTrace to reverse-engineer YV12 overlay, but we also used it to reverse-engineer a very large part of card initialization code and mode setting - and it will most certainly be useful for many other things that involve a kernel module. It is not limited to Nouveau, and is able to trace MMIO operations from any of your (binary) kernel modules, thereby allowing reverse-engineering of drivers for other hardware.
Current development in Unix graphics and its influence on Nouveau

We'll now take a peek into the future of 3D acceleration on Linux. 2007 saw a number of major changes in how Linux and X11 handle graphics. A lot of improvements are coming into use: EXA for 2D acceleration, TTM for memory management, Gallium3D for 3D, the new DRI2 interface, etc. All this needs driver-side changes, which can take some time to be done.

With the advent of programmable graphics hardware, the old graphics driver model in Mesa became unsuitable. The current Mesa model is designed for cards which are based around OpenGL fixed-function operations. Fixed-function cards have hardware blocks designed for each part of the GL pipeline. The driver model for this requires each new piece of fixed functionality to call into the driver, which can get complex. This also causes a lot of code to be duplicated in each driver.

A new driver model, called Gallium3D, tries to simplify the driver interface and increase the amount of shared code. It is designed to cater for OpenGL 3.0's needs as well as current OpenGL and DirectX APIs. It is also designed to allow portable drivers across all major platforms/OSes. It assumes programmable graphics hardware with, at least, fragment shaders.

Now that we know why the design was changed, let's have a look at the architecture of Gallium3D. Gallium3D splits the DRI driver into three major components: the common "state tracker", the OS-dependent "winsys" layer, and the hardware-specific 3D driver. The winsys is in charge of 2D action and most of the housekeeping and OS-specific bits, while the hardware driver does the 3D. Each driver needs to implement a hardware driver and a winsys part. If an existing driver gets ported to another OS, only the winsys part needs to be redone. There is also a fully working reference software 3D driver called softpipe. It is a software renderer showing the Gallium3D concepts and how to implement them, which also acts as a software fallback driver for things the hardware cannot handle.

Another component of the new graphics subsystem is the TTM-based memory manager. TTM is a unified in-kernel manager for all GPU-accessible memory. Previous memory management was split between X drivers, mostly using static allocations. TTM was originally designed and implemented for Intel hardware, and had to be adapted to handle NVidia hardware and Nouveau software design. The main feature added to TTM was called fence classing, which was required to support NVidia's multiple hardware contexts.
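The software-fallback role described for softpipe in the Gallium3D discussion above can be pictured as two drivers sitting behind one common interface. The sketch below illustrates only that idea. It does not use the Gallium3D API: every type, function and operation name is hypothetical, and which operations a hardware driver would decline is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations that shared code might ask a driver to perform. */
enum sketch_op { OP_DRAW_TRIANGLES, OP_DRAW_WIDE_LINES };

/* The interface shared by the hardware driver and the software renderer. */
struct sketch_driver {
    const char *name;
    int (*draw)(enum sketch_op op);   /* 0 = handled, -1 = unsupported */
};

/* Assumption for the example: the hardware only handles triangles. */
static int hw_draw(enum sketch_op op)
{
    return op == OP_DRAW_TRIANGLES ? 0 : -1;
}

/* The software renderer accepts everything, just more slowly. */
static int soft_draw(enum sketch_op op)
{
    (void)op;
    return 0;
}

static const struct sketch_driver hw_driver = { "hardware", hw_draw };
static const struct sketch_driver soft_driver = { "software", soft_draw };

/* Shared code tries the hardware first and falls back to software.
 * Returns the driver that ended up handling the operation, or NULL. */
static const struct sketch_driver *dispatch_draw(enum sketch_op op)
{
    if (hw_driver.draw(op) == 0)
        return &hw_driver;
    if (soft_driver.draw(op) == 0)
        return &soft_driver;
    return NULL;
}
```

The point of such a split is that the dispatching code is written once and shared, while each card only has to supply its own draw implementation.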
Current Status

When we shifted work from reverse engineering to driver development last year, we were asked when a driver would be ready. We predicted late 2007, but we only got part of the work done. Except for NV5x cards, we basically have a good-to-reasonably-well working 2D driver. Releasing an official "2D" driver was considered but, at this point, the kernel interfaces are not considered stable enough to support for the long term. When a DRM kernel module is shipped in Linus's kernel, the interfaces are required to be supported indefinitely. This would be unwise for Nouveau as the interface is evolving to accommodate changes for TTM and mode setting, and supporting old interfaces may place hard-to-support requirements on newer ones. Currently, Nouveau can claim:
The weak spot is currently the NV50. On these cards, 2D is working the same as nv but saving and restoring the console / virtual terminal state doesn't work.

All that is nice and somewhat important to have, but I hear you ask "what about 3D"? The short answer is: We don't have 3D working. The longer answer is: NV5x doesn't work and needs more reverse engineering as a lot has changed from NV4x. For all other cards the needed information is available but there are many pieces in the puzzle to build a final driver. As a proof of concept, glxgears works on NV1x, NV3x and NV4x but with some glitches. However, work on the Mesa DRI driver has ceased in order to target Gallium3D. A somewhat working Gallium3D driver exists with many bugs and glitches. The NV4x is getting better every day but isn't usable for games yet. Gallium3D itself is still a work in progress and the same holds true for our Gallium3D driver.

Currently, a fair amount of work is going on in the mode setting field, with Maarten Maathuis and Stuart Bennett enhancing this part of the code. This leads to RandR 1.2 (dual head) support in Nouveau. Once this is done, we plan to move it into kernel land, following the other drivers. A kernel API has been defined for that purpose. Basically this API looks like a simplified RandR 1.2 API, which should make porting easy.

So what is coming next? This is only a rough outlook of what we want to do mid term:
By the way: If you are interested in more details, please have a look at our Wiki and TiNDC ("The Irregular Nouveau Development Companions") or join us in #nouveau on freenode (logs are available).

So, to keep with tradition, let's have some screenshots. Here's a shot of Neverball running under the Nouveau driver:
And OpenArena with a Nouveau Gallium3D build from January 2008 displays this:
It seems the weapon is a bit too dark but otherwise we couldn't find obvious differences. Further information about Gallium3D can be found on the Tungsten Graphics site.
Conclusion

So that is our current status. Our roadmap shows the next milestone would be Quake, which is not so far away on NV4x, but which has some problems to overcome on the other cards. Our first estimate of Autumn / Winter 2007 held up well for the 2D part but, as we detailed earlier, the remainder was somewhat delayed due to decisions out of our control like TTM and Gallium. However, the decision was the right one as Nouveau will be one of the most advanced and future-proof drivers available.

And finally: I would like to take this opportunity to thank Arthur Huillet, Ben Skeggs, David Airlie and Stephane Marchesin for their great help on this article. It definitely was a team effort!
Page editor: Jonathan Corbet
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.