
A new direction for i965

By Jake Edge
October 17, 2018

Graphical applications are always pushing the limits of what the hardware can do and recent developments in the graphics world have caused Intel to rethink its 3D graphics driver. In particular, the lower CPU overhead that the Vulkan driver on Intel hardware can provide is becoming more attractive for OpenGL as well. At the 2018 X.Org Developers Conference Kenneth Graunke talked about an experimental re-architecting of the i965 driver using Gallium3D—a development that came as something of a surprise to many, including him.

Graunke has been working on the Mesa project for eight years or so; most of that time, he has focused on the Intel 3D drivers. There are some "exciting changes" in the Intel world that he wanted to present to the attendees, he said.

CPU overhead has become more of a problem over the last few years. Any time that the driver spends doing its work is time that is taken away from the application. There has been a lot of Vulkan adoption, with its lower CPU overhead, but there are still lots of OpenGL applications out there. So he wondered if the CPU overhead for OpenGL could be reduced.

Another motivation is virtual reality (VR). Presenting VR content is a race against time, so there is no time to waste on driver overhead. In addition, Intel has integrated graphics, where the CPU and GPU share the same power envelope; if the CPU needs more power, the GPU cannot be clocked as high as it could be. Using less CPU leads to more watts available for GPU processing.

For the Intel drivers, profilers show that "draw-time has always been [...] the volcanically hot path" and, in particular, state upload (sending the state of the OpenGL context to the GPU) is the major component of that. There are three different approaches to handling state upload in an OpenGL driver that he wanted to compare, he said. OpenGL is often seen as a "mutable state machine"; it has a context with a "million different settings that you can tweak". He likened it to an audio mixing board, which has lots of different knobs that each do something different. At its heart, an OpenGL program sets these knobs, draws, then sets them and draws again—over and over.
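As a rough illustration of that pattern (the GL calls are real, but the shader and texture names here are invented), a frame's worth of OpenGL often looks something like:

    /* Set a few "knobs" on the context, then draw... */
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glUseProgram(armor_shader);
    glBindTexture(GL_TEXTURE_2D, armor_texture);
    glDrawArrays(GL_TRIANGLES, 0, armor_vertex_count);

    /* ...then re-set the knobs and draw again, over and over. */
    glDisable(GL_BLEND);
    glUseProgram(terrain_shader);
    glBindTexture(GL_TEXTURE_2D, terrain_texture);
    glDrawArrays(GL_TRIANGLES, 0, terrain_vertex_count);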

Handling state

The first way to handle state tracking is with "state streaming": translate the knobs that have been changed and send those down to the GPU. This assumes that it is not worth reusing any state from previous draws; since state is changing all the time and every draw call could be completely new, these drivers just translate as little as possible of the state changes, as efficiently as possible, before sending them to the GPU.

Early on, he asked his mentor about why state was not being reused and was told that cache lookups are too expensive. Essentially the context and state have to be turned into some kind of hash key that gets looked up in the cache. If there is a miss, that state needs to be recalculated and sent to the GPU, so you may as well just have done the translation. This is what i965 does "and it works OK", but it does make draw-time dominate, which leads to efforts to shave microseconds off the draw-time.
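In rough pseudo-C (the names are invented for illustration, not actual i965 code), a state-streaming driver's draw path boils down to checking dirty flags and re-translating whatever changed:

    /* Sketch of a state-streaming draw path; not real i965 code. */
    void draw(struct gl_context *ctx, const struct draw_call *call)
    {
        /* Translate only what changed since the last draw; nothing
         * translated here is remembered for future draws. */
        if (ctx->dirty & DIRTY_BLEND)
            emit_blend_state(ctx);
        if (ctx->dirty & DIRTY_VIEWPORT)
            emit_viewport_state(ctx);
        if (ctx->dirty & DIRTY_TEXTURES)
            emit_texture_state(ctx);
        ctx->dirty = 0;

        emit_draw(ctx, call);
    }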

But he started thinking about this idea of "not reusing state" some. He put up an image from a game he has been playing lately and noted that it is drawn 60 times per second; the scene is fairly static. So the same objects get drawn many, many times. It is not all static, of course, so maybe a character walks into the scene in shiny armor and you need to figure out how to display that in the first frame, but the next 59 frames can reuse that. "This 'I've never seen your state before' idea is kinda bunk", he said.

The second mechanism is the one that Vulkan uses, which is to have "pre-baked pipelines". The idea is to create pipeline objects for each kind of object displayed in the scene. That makes draw-time "dirt cheap" because you just bind a pipeline and draw, over and over again. If the applications are set up to do this, "it is wonderful", but OpenGL applications are not, so it is not really an option.
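In Vulkan terms (these are the real Vulkan entry points, but the pipeline-creation details are elided and the variable names are assumed), the expensive baking happens once up front and draw time is just bind-and-draw:

    /* Bake the entire pipeline state once, ahead of time... */
    VkPipeline pipeline;
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1,
                              &pipeline_info, NULL, &pipeline);

    /* ...so each draw is dirt cheap: bind the pre-baked object, draw. */
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);
    vkCmdDraw(cmd, vertex_count, 1, 0, 0);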

The third option, surprisingly, is Gallium3D (or simply "Gallium"), he said. He has been learning that it is basically a hybrid approach between state streaming and pre-baked pipelines. It uses "constant state objects" (CSOs), which are immutable objects that capture a part of the GPU state and can be cached across multiple draw operations. CSOs are essentially a Vulkan pipeline that has been chopped up into pieces that can be mixed and matched as needed. Things like the blending state, rasterization mode, viewport, and shader would each have their own CSO. The driver can associate the actual GPU commands needed to achieve that state with the CSO.
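Gallium's pipe_context interface reflects this directly: each chunk of state has a create/bind hook pair, where create bakes an immutable CSO from a state template and bind is nearly free. A sketch (the hook and field names are real Gallium; the surrounding code is illustrative):

    /* Bake a blend CSO once from a state template... */
    struct pipe_blend_state blend = {0};
    blend.rt[0].blend_enable = 1;
    blend.rt[0].rgb_func = PIPE_BLEND_ADD;
    void *blend_cso = ctx->create_blend_state(ctx, &blend);

    /* ...then any later draw needing this blend state just binds it. */
    ctx->bind_blend_state(ctx, blend_cso);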

The Gallium state tracker essentially converts the OpenGL mutable API into the immutable CSOs that make up the Gallium world. That means the driver really only has to deal with CSOs, while Gallium handles the messy OpenGL context. The state tracker looks at the OpenGL context, tracks what's dirty, and ideally finds cached CSOs for the new state. The state tracker helps "get rid of a bunch of nonsense from the API", Graunke said. For example, handling the different coordinate systems between the driver, GPU, and the window system is much simplified using Gallium. That simplification can be done before the cache lookup occurs, which may mean more cache hits for CSOs.
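The lookup-or-create pattern at the heart of this can be sketched as follows (the helper names are hypothetical, not Mesa's actual functions; the real machinery lives in Mesa's shared CSO-cache code):

    /* Sketch of the state tracker's CSO cache; names invented. */
    void *lookup_or_create_blend_cso(struct st_context *st)
    {
        /* Canonicalize the mutable GL state into a fixed-size key;
         * normalizing first means more draws hash to the same CSO. */
        struct pipe_blend_state key;
        translate_gl_blend_state(st, &key);

        void *cso = hash_table_search(st->blend_cache, &key);
        if (!cso) {
            /* Miss: pay the translation cost once, then cache it. */
            cso = st->pipe->create_blend_state(st->pipe, &key);
            hash_table_insert(st->blend_cache, &key, cso);
        }
        return cso;  /* hit: draw-time cost is just the hash lookup */
    }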

A consistent critique of Gallium is that it adds an extra layer into the driver path. While that's true, there is a lot of work done in a Classic (or state-streaming) driver to convert the context into GPU commands. There is far less work needed to set things up to be looked up in the cache for a Gallium driver, but if there is a cache miss, there will be additional work needed to create (and cache) the new CSO. But even if the sum of those two steps is larger for Gallium, and it generally is, the second step is skipped much of the time, which means that a Gallium-based driver may well be more efficient.

i965

The i965 driver is one of the only non-Gallium Mesa drivers. Graunke said that the developers knew it could be better than it is in terms of its CPU usage. The code itself is pretty efficient, but the state tracking is too coarse-grained, which means that this efficient code executes too often. Most of the workloads they see are GPU bound, so they spent a lot of time improving the compiler, adding color compression, improving cache utilization, and the like to make the GPU processing more efficient.

But, CPU usage could be improved and "people loved to point that out", he said. It was a source of criticism from various Intel teams internally, but there was also Twitter shaming from Vulkan fans. The last straw for him was data showing that the i965 driver was "obliterated" by the performance of the AMD RadeonSI driver on a microbenchmark. That started him on the path toward seeing what could be done to fix the CPU side of the equation.

A worst case for i965 is when an application binds a new texture or modifies one. The i965 driver does not have good tracking for texture state, so it has to retranslate the state for every other texture and image bound in any shader stage. That is a lot of work for a relatively simple operation. Reusing some state would help a lot, but it is hard to do, for surprising reasons.

Back in the "bad old days of Intel hardware", there was one virtual GPU address space for all processes. The driver told the kernel about its buffers and the kernel allocated addresses for them. But the commands needed to refer to those buffers using pointers when the addresses had not yet been assigned, so the driver gave the kernel a list of pointers in those buffers that needed to be patched up when the buffer was assigned to an address. Intel GPUs save the last known GPU state in a hardware context that could potentially be reused, but it includes pointers to unpatched addresses, so no state that involves pointers can be saved and reused. The texture state has pointers, which leads to the worst case he described.
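The patching was driven by relocation entries that the driver handed to the kernel along with each batch of commands; the i915 UAPI structure (simplified from the kernel's i915_drm.h) shows the mechanism:

    /* Each relocation tells the kernel: "this spot in my buffer holds
     * a pointer to that other buffer; patch it once you have decided
     * where the target actually lives". */
    struct drm_i915_gem_relocation_entry {
            __u32 target_handle;    /* buffer the pointer refers to */
            __u32 delta;            /* offset added to its address */
            __u64 offset;           /* location to patch in this buffer */
            __u64 presumed_offset;  /* driver's guess; the kernel can
                                     * skip patching if it was right */
            __u32 read_domains;
            __u32 write_domain;
    };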

Modern hardware does not suffer from these constraints. More recent Intel GPUs have 256TB of virtual GPU address space per process. The "softpin" kernel feature available since Linux 4.5 allows user space to assign the virtual addresses and never have to worry about them being changed. That allows the state to be pre-baked or inherited even if it has pointers. Other changes also make state reuse much easier.
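With softpin, the driver assigns the address itself and marks the buffer as pinned in the execbuf call (EXEC_OBJECT_PINNED is the real i915 flag; the snippet is a sketch with assumed variable names):

    /* User space picks the GPU virtual address and tells the kernel
     * to honor it rather than relocating the buffer. */
    struct drm_i915_gem_exec_object2 obj = {
            .handle = buffer_handle,
            .offset = driver_chosen_gpu_address,
            .flags  = EXEC_OBJECT_PINNED,
    };
    /* Since addresses never change, commands and saved hardware
     * contexts that contain pointers stay valid across submissions. */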

So that led to the obvious conclusion that state reuse needed to be added to the i965 driver. But in order to do that, an architectural overhaul was needed. It required a fundamental reworking of the state upload code. Trying to do that in the production driver was "kinda miserable". Adding these kinds of changes incrementally is difficult. Enterprise kernels also complicate things, since there are customers using new hardware on way-old kernels—features like softpin are not available on those. Beyond that, the driver needs to support old hardware that still has all of the memory constraints, so he would need to support both old and new memory management and state management in the same driver.

Enter Gallium

He finally realized that Gallium gave him "a way out of all of these problems". In retrospect that all may seem kind of obvious, so why didn't Intel do this switch earlier? In looking back, though, Gallium never seemed to be the answer to the problems the developers were facing along the way. It didn't magically get them from OpenGL 2.1 to 4.5; that required lots of feature work. The shader compiler story was lacking: it was not really viable because it was based on TGSI. It also didn't solve their driver performance problems or enable new hardware. Switching to Gallium would have allowed the driver to support multiple APIs, but that was not really of interest at the time. All in all, it looked like a huge pile of work for no real gain. But things have changed, Graunke said.

Gallium has gotten a lot better due to lots of work by the Mesa community. Threading support has been added. NIR is a viable option instead of TGSI. In addition, the i965 driver has gotten more modular due to Vulkan. It seemed possible to switch to Gallium, but it was still a huge effort, and he wanted to be sure that it would actually pay off to do so.

That led to his big experiment. In November 2017, he started over from scratch using the noop driver as a template. He borrowed ideas from the Vulkan driver and focused on just the latest hardware and kernel versions. He wanted to be free to experiment so he did not publicize his work, though he did put it in his Git repository in January. If it did not work out, he wanted to be able to scrap it without any fanfare.

After ten months, he is ready to declare the experiment a success. The Iris driver is available in his repository (this list of commits is what he pointed to). It is primarily of interest to driver developers at this point and is not ready for regular users. One interesting note is that Iris uses no TGSI; most drivers use some TGSI, rather than translating everything to NIR, but Iris did not take that approach. It only supports Skylake (Gen9+) hardware and kernel 4.16+, though he could make it work for any 4.5+ kernel if needed.

The driver passes 87% of the piglit OpenGL tests. It can run some applications, but others run into bugs. There are several missing features that are being worked on or at least thought about at this point. But enough is there that he can finally get an answer to whether switching to Gallium makes sense; the performance can be measured and the numbers will be in the right ballpark to allow conclusions to be drawn.

Results

He put up performance numbers on the draw overhead using piglit for five different scenarios (which can be seen in his slides [PDF] and there is more description of the benchmark and results in the YouTube video of the talk). It went from roughly two million draw calls per second to over nine million in the simplest case; in the worst case (from the perspective of the i965 driver) it went from half a million to almost nine million. On average, Iris can do 5.45x more draw calls per second than i965. Those are good numbers, but using threaded contexts results in even better numbers (6.48x for the simple case and 20.8x for the worst), though he cautioned that support for threaded contexts in Iris is still not stable, so those numbers should be taken with a grain of salt.

The question "is Iris worthwhile?" can be answered with a "yes". But reducing the CPU overhead when most workloads are GPU bound may not truly reflect reality. The microbenchmark he used is "kind of the ideal case for a Gallium driver" since it uses back-to-back draws that are hitting the CSO cache. That said, going back to his observations about displaying a game, it may well be representative for the 59 of 60 frames per second where little changes. There is a need to measure real programs; one demo he ran on a low-power Atom-based system was 19% faster, but a bunch of others showed no difference with i965 at all. "So, your mileage may vary, but it's still pretty exciting", he said.

Graunke believes that this work has settled the Classic versus Gallium driver debate in favor of the latter. Also, the Gallium interface is much nicer to work with than the Classic interface. He does not regret the path Intel took, but he is excited about the future; Iris is a much better architecture for that future. In addition, he believes that RadeonSI, and now Iris, have basically debunked the myth that Mesa itself is slow. i965 may be slow, but that is not really an indictment of Mesa.

There is a lot of work left to do and lots of bugs to fix. He needs to finish getting the piglit tests passing, and to do the same for the OpenGL conformance test suites (CTS). He has started running the CTS, which is looking good so far. He still needs to test lots of applications and there is work to be done cleaning up some of his Gallium hacks before the driver can go upstream. Beyond that, he wants to look at Iris performance on real applications and compare it to i965 to see if there are places where Iris can be made even faster. He would like to use FrameRetrace on application data as part of that process.

Now that Intel has joined the rest of the community in using Gallium, that is probably a good opportunity for the whole community to think about where Mesa should go from here. All of the drivers will be Gallium-based moving forward, so the community can collaborate with a focus on further Gallium enhancements. Gallium is not the ideal infrastructure (nor is Classic), but by dreaming about what could come in the future, the Mesa community can evolve it into something awesome, he said.

In the Q&A, he was asked about moving more toward a Vulkan-style driver. Graunke noted that there are several efforts to implement OpenGL on Vulkan and that he is interested to see where they go. There is something of an impedance mismatch between the baked-in pipelines of Vulkan and the wildly mutable OpenGL world and it is not clear to him whether that can be resolved reasonably. For Iris, he chose the route that RadeonSI had taken and proven, but if the Vulkan efforts pan out, that could be something to look at down the road.

[I would like to thank the X.Org Foundation and LWN's travel sponsor, the Linux Foundation, for travel assistance to A Coruña for XDC.]

A new direction for i965

Posted Oct 18, 2018 14:08 UTC (Thu) by mgedmin (subscriber, #34497) (5 responses)

> The i965 driver is one of the only non-Gallium Mesa drivers.

"one of" and "the only" seem to contradict each other.

A new direction for i965

Posted Oct 18, 2018 15:19 UTC (Thu) by excors (subscriber, #95769)

>> The i965 driver is one of the only non-Gallium Mesa drivers.
>
> "one of" and "the only" seem to contradict each other.

It's a common English idiom, and idioms are rarely concerned about being illogical. http://throwgrammarfromthetrain.blogspot.com/2014/07/obse... notes that many people dislike the phrase, but that it has been in use for about 250 years anyway.

A new direction for i965

Posted Oct 21, 2018 5:57 UTC (Sun) by roblucid (guest, #48964) (3 responses)

As a native English speaker, I would not call this common in written form at least. "One of the few" or "The only" are far clearer. Perhaps it's an Americanism?

Spoken, people can't help changing ideas having started a sentence, but it would make me grin/chuckle if I heard this from a native speaker in a presentation.

A new direction for i965

Posted Oct 22, 2018 15:04 UTC (Mon) by mathstuf (subscriber, #69389) (2 responses)

> Perhaps it's an Americanism?

Probably. I didn't notice anything odd until folks pointed it out.

A new direction for i965

Posted Oct 26, 2018 5:51 UTC (Fri) by njs (subscriber, #40338) (1 responses)

Huh, apparently it was more common in British English until the mid-eighties, but American usage has been rising rapidly and overtook it:

https://books.google.com/ngrams/graph?content=%28one+of+t...

A new direction for i965

Posted Oct 26, 2018 18:38 UTC (Fri) by Wol (subscriber, #4433)

Who speaks BRITISH English? The English speak English, and the Scots speak Scots, which are two very similar languages (actually, really Scots speak Gaelic, but let's not go there ...).

There is NOT a common language across Britain, and the British language is probably Welsh, anyways ...

But no, I've always understood (and used) "one of the only" as a perfectly normal English idiom.

Cheers,
Wol

A new direction for i965

Posted Oct 19, 2018 11:36 UTC (Fri) by mm7323 (subscriber, #87386) (1 responses)

"pre-baked pipelines". The idea is to create pipeline objects for each kind of object displayed in the scene. That makes draw-time "dirt cheap" because you just bind a pipeline and draw, over and over again. If the applications are set up to do this, "it is wonderful", but OpenGL applications are not, so it is not really an option.

It's a very long time since I've done much graphics work, but isn't this exactly what Display Lists in OpenGL 2.0 are for? Why were they deprecated/removed in 3.0/3.1?

A new direction for i965

Posted Oct 19, 2018 22:59 UTC (Fri) by excors (subscriber, #95769)

I think display lists were mainly useful for specifying vertex data efficiently, compared to doing thousands of glVertex3f() calls per frame. But then OpenGL added vertex buffer objects, where the vertex data is stored directly in VRAM and you only need a handful of API calls to reference it, which is much simpler and more efficient than display lists. That largely eliminated the value of display lists and it was no longer worth the effort of supporting them in the API specification and in drivers.
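For comparison, a minimal VBO setup (standard OpenGL calls; the vertex array and counts are invented) replaces thousands of per-vertex calls with a handful:

    /* Upload the vertex data to a buffer object once... */
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);

    /* ...then each frame only references it, instead of issuing one
     * glVertex3f() per vertex or replaying a display list. */
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void *) 0);
    glDrawArrays(GL_TRIANGLES, 0, vertex_count);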

Now GPUs have gotten really fast at drawing vertexes and pixels, so the bottleneck has moved back towards the API overhead and pipeline state changes, and we need a solution to that. Display lists are sort of the right idea, but aren't a good solution for modern hardware or applications.

Display lists are monolithic - they can affect the entire pipeline state at once, while hardware probably splits the pipeline into several independent chunks that could be changed relatively cheaply. They're incomplete - if any part of the state isn't specified, it gets inherited from the global state at call time (so is unknown when compiling the display list). They're immutable, while applications probably want to draw a lot of objects with identical state except for a few parameters that change frequently. They make the API complicated and inefficient, since pretty much every other API call changes behaviour if you're currently recording a display list. They put more complexity into the driver, which makes it harder for application programmers to understand how to optimise their use of the API.

Vulkan explicitly splits the pipeline state into about a dozen chunks that roughly map to modern hardware pipelines; you can reuse most of the state across all your draw calls and maybe change only a few chunks. It associates every draw call with a complete pipeline state object, so the complete state can be compiled and optimised ahead of time. It lets applications choose to make some state mutable(/dynamic). The API is reasonably orthogonal and doesn't depend on global state and can be used by multiple threads. It lets applications control allocation and caching so they don't have to second-guess the driver. Probably the most important design principle is that the Vulkan API tries to force applications to work in hardware-friendly ways (even if that's tricky for the application), whereas OpenGL tries to let applications work in application-friendly ways and then the driver has to scramble to map it onto the hardware.

