The anatomy of a Vulkan driver

By Jake Edge
September 28, 2016

Jason Ekstrand gave a presentation at the 2016 X.Org Developers Conference (XDC) on a driver that he and others wrote for the new Vulkan 3D graphics API on Intel graphics hardware. Vulkan is significantly different from OpenGL, which led the developers to make some design decisions that departed from those made for OpenGL drivers.

He started with an "obligatory brag slide" (slides [PDF]) that outlined the progress that had been made on the driver in only eight months, with roughly three and a half people. Ekstrand, Kristian Høgsberg, and Chad Versace, with help from a dozen others, got a Vulkan driver working that was released (as open source) on the same day that the Vulkan specification was released in February. Not everything was written from scratch; the driver uses the same internal representation and back-end compiler that Mesa uses. The driver passed the conformance tests on day one as well, which is not something that everyone in the industry can say, Ekstrand said.

Vulkan is a new industry-standard 3D rendering and compute API from Khronos, which is the same group that maintains OpenGL. It is not simply OpenGL++, he said, as it has been redesigned from the ground up. Vulkan is designed for modern GPUs and software. It will run on currently shipping (OpenGL ES 3.1 class) hardware.

A lot has happened since SGI released OpenGL 1.0 in 1992, which is why a new 3D API is needed. In the 24 years since that first release, GPUs have become more powerful and flexible, memory has become much cheaper, and multi-core CPUs have become common. OpenGL has done "amazingly well" over that time, but it is showing its age at this point.

Multi-threaded programs are now commonplace, which makes OpenGL's state machine based on a singleton context kind of obsolete. Off-screen rendering is common as well. Beyond that, GPU hardware has become more standardized, so application developers don't want the API to hide the details of what the GPU is doing as OpenGL does.

Vulkan takes a different approach. It has an object-based API where there is no global state. All state is stored in the command buffer and there can be multiple command buffers. It is more explicit about what the GPU is doing: texture formats, memory management, and synchronization are all client-controlled. Those things are needed to support multi-threading, but also make drivers simpler.
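
The difference from a singleton context can be modeled in a few lines. What follows is only a conceptual sketch in Python, not Vulkan code; `ToyCommandBuffer`, `record()`, and `record_in_parallel()` are names invented for illustration. Each thread records into its own object, so no lock or shared "current state" is needed.

```python
import threading


class ToyCommandBuffer:
    """All recording state lives in the object; there is no global context."""

    def __init__(self):
        self.bound_pipeline = None
        self.commands = []

    def bind_pipeline(self, pipeline):
        self.bound_pipeline = pipeline
        self.commands.append(("bind", pipeline))

    def draw(self, vertex_count):
        # The draw picks up state from this buffer only.
        self.commands.append(("draw", self.bound_pipeline, vertex_count))


def record(cmd, pipeline, draws):
    """Record one bind followed by several draws into a single buffer."""
    cmd.bind_pipeline(pipeline)
    for vertex_count in draws:
        cmd.draw(vertex_count)


def record_in_parallel():
    """Record two buffers on two threads without any shared state."""
    buffers = [ToyCommandBuffer(), ToyCommandBuffer()]
    jobs = [("opaque", [3, 6]), ("transparent", [12])]
    threads = [
        threading.Thread(target=record, args=(cmd, pipeline, draws))
        for cmd, (pipeline, draws) in zip(buffers, jobs)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return buffers
```

Because the bound pipeline is a property of the buffer rather than of the process, the order in which the two threads run does not change what either buffer contains.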

Vulkan drivers do no error checking. There is a set of open-source, vendor-neutral validation layers that do much the same checking as is done in Mesa, but they are meant to be enabled during development and disabled in production. The idea is for application developers to check their Vulkan code during development, so "why burn 10% of my CPU doing validation" when there are no errors in the Vulkan code?
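
The layering idea can be illustrated with a toy model. The sketch below is Python rather than the C interface that real layers use, and every name in it (`ToyDriver`, `ToyValidationLayer`, `create_device()`) is invented for this example. The point it tries to show is that the checking sits in a separate wrapper that is chosen once, so a production run pays nothing for it.

```python
class ToyDriver:
    """Stands in for a driver entry point that trusts its caller completely."""

    def __init__(self):
        self.commands = []

    def cmd_draw(self, vertex_count, first_vertex):
        # No checks: whatever arrives goes straight into the command stream.
        self.commands.append(("draw", vertex_count, first_vertex))


class ToyValidationLayer:
    """Wraps a driver and rejects obviously invalid calls before forwarding."""

    def __init__(self, next_layer):
        self.next_layer = next_layer

    def cmd_draw(self, vertex_count, first_vertex):
        if vertex_count < 0 or first_vertex < 0:
            raise ValueError("cmd_draw: negative vertex_count or first_vertex")
        self.next_layer.cmd_draw(vertex_count, first_vertex)


def create_device(enable_validation):
    """Build the call chain once and return (entry_point, driver).

    With validation disabled the entry point is the driver itself, so
    there is no wrapper left in the call path at all.
    """
    driver = ToyDriver()
    entry = ToyValidationLayer(driver) if enable_validation else driver
    return entry, driver
```

During development one would call `create_device(True)` and see the `ValueError` at the offending call; a shipped build would call `create_device(False)` and the same mistake would pass through unchecked.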

There is a short distance between the API call and the driver in Vulkan, rather than traversing multiple layers as in Mesa. There is also a short distance between the driver function and actually putting data into the command buffer for the GPU. There are "no extra layers", Ekstrand said.

To handle multiple generations of hardware, each with its own packet format and packing scheme, the Vulkan driver has header files that are generated using Python scripts to process an XML representation of the formats. There is a function that uses that header file information to pack the command data into the buffer in the right way. It has debugging support that can assert() for various problems and the code can be run under Valgrind to find other kinds of problems.
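
As a rough illustration of packing driven by an XML description, here is a small Python sketch. It is not the driver's generator: the XML layout, the `TOY_DRAW` packet, its field names, and the `make_packer()` helper are all made up for the example, and the sketch builds the packing function at run time instead of emitting a C header.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical packet description; the field names and bit positions are
# invented for illustration and do not describe real hardware.
PACKET_XML = """
<packet name="TOY_DRAW" dwords="2">
  <field name="opcode"       dword="0" start="24" end="31"/>
  <field name="length"       dword="0" start="0"  end="7"/>
  <field name="vertex_count" dword="1" start="0"  end="31"/>
</packet>
"""


def make_packer(xml_text):
    """Return a function that packs keyword arguments into little-endian dwords."""
    root = ET.fromstring(xml_text)
    num_dwords = int(root.get("dwords"))
    fields = {
        f.get("name"): (int(f.get("dword")), int(f.get("start")), int(f.get("end")))
        for f in root.findall("field")
    }

    def pack(**values):
        dwords = [0] * num_dwords
        for name, value in values.items():
            dword, start, end = fields[name]  # KeyError for an unknown field
            width = end - start + 1
            # Debug-style check, in the spirit of the assert() support
            # described above: the value must fit in its bit range.
            assert 0 <= value < (1 << width), f"{name} does not fit in {width} bits"
            dwords[dword] |= value << start
        return struct.pack(f"<{num_dwords}I", *dwords)

    return pack


pack_toy_draw = make_packer(PACKET_XML)
```

Calling `pack_toy_draw(opcode=0x7B, length=2, vertex_count=3)` yields eight bytes that could be appended to a command buffer; passing a value too wide for its field trips the assertion instead of silently corrupting a neighboring field.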

To handle four separate Intel GPU generations, the code is compiled four times to create one version per generation. That allows the driver to keep up with new hardware more easily. The hardware-generation checks for each command function (as in the Mesa driver) are compiled away and the right thing is done for the generation in use. This is one example of where the team got to rethink things because it is a new, from-scratch driver.
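
The driver gets this effect from the C compiler; the Python sketch below only imitates the idea by building one table of functions per generation ahead of time. The generation numbers, the `emit_address` packet, and the address widths are assumptions chosen for illustration, not statements about Intel hardware.

```python
# Invented generation identifiers (version times ten), for illustration only.
SUPPORTED_GENS = (70, 75, 80, 90)


def build_entry_points(gen_x10):
    """Return the command functions specialized for one hardware generation.

    The generation test below runs once, when the table is built. That
    loosely mirrors a check that a C compiler folds away when the same
    source file is compiled again with a different generation macro.
    """
    if gen_x10 >= 80:
        def emit_address(buf, address):
            # Assumed for this sketch: newer parts take a 48-bit address
            # split across two dwords.
            buf.append(address & 0xFFFFFFFF)
            buf.append((address >> 32) & 0xFFFF)
    else:
        def emit_address(buf, address):
            # Assumed for this sketch: older parts take one 32-bit address.
            buf.append(address & 0xFFFFFFFF)
    return {"emit_address": emit_address}


# One table per generation, analogous to one compiled object per generation.
ENTRY_POINTS = {gen: build_entry_points(gen) for gen in SUPPORTED_GENS}


def entry_points_for(gen_x10):
    """Pick the table once at device creation; later calls do not re-check."""
    return ENTRY_POINTS[gen_x10]
```

Supporting a new generation in this model means adding one entry to `SUPPORTED_GENS` and, where the hardware differs, one more branch in the builder; the per-call code paths for existing generations are untouched.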

One of the challenges faced by the team was in memory allocation. Vulkan provides a collection of heaps where clients can allocate VkDeviceMemory objects. The client can place VkImage or VkBuffer objects at explicit offsets within the VkDeviceMemory object. This doesn't map well to allocation from libdrm, he said, but it does map well to Graphics Execution Manager (GEM) buffer objects. Other objects have small amounts of driver-allocated memory for state that the driver needs to track. The team had to figure out how to manage all those pieces of memory. Complicating matters was that the Intel hardware has different base addresses for different types of allocations (e.g. shaders, surface states), so the state information needs to be stored with others of the same type.
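
The client side of that arrangement is mostly offset arithmetic. In a real application the size and alignment of each resource would come from vkGetBufferMemoryRequirements() or vkGetImageMemoryRequirements(), and the chosen offset would be passed to vkBindBufferMemory() or vkBindImageMemory(). The Python sketch below models only the arithmetic; `align_up()` and `place_resources()` are hypothetical helpers and the sizes used with them are made up.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def place_resources(memory_size, requirements):
    """Assign each (name, size, alignment) an offset inside one allocation.

    Returns a dict of name -> offset, or raises MemoryError if the
    resources do not fit. This models what a client does, not the driver.
    """
    offsets = {}
    cursor = 0
    for name, size, alignment in requirements:
        offset = align_up(cursor, alignment)
        if offset + size > memory_size:
            raise MemoryError(f"{name} does not fit in {memory_size} bytes")
        offsets[name] = offset
        cursor = offset + size
    return offsets
```

For example, placing a 100-byte buffer (16-byte alignment), a 1024-byte image (256-byte alignment), and a 64-byte buffer (64-byte alignment) into a 4096-byte allocation gives offsets 0, 256, and 1280. The driver is not asked where things should go; it is simply told.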

He and Høgsberg came up with a "crazy" memory allocation structure that they are pretty proud of, Ekstrand said. For device memory objects, GEM buffers are used; there is also a pool of GEM buffers that are used for back buffers. For the state objects, there are block pools that are allocated as a buffer object that grows in both directions as needed. The pools are initialized to provide objects of a specific size. Allocating from either end of the pool is required because of some hardware-specific restrictions.
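
A pool that grows in both directions can be pictured as two bump allocators sharing a fixed center point. The following Python model is a sketch under that assumption; `ToyBlockPool` is an invented name, it tracks offsets only, and it says nothing about how the driver actually backs the pool with memory.

```python
class ToyBlockPool:
    """Hands out fixed-size blocks on either side of a fixed center point.

    Offsets are relative to the center: blocks taken from the front get
    0, block_size, 2 * block_size, ...; blocks taken from the back get
    -block_size, -2 * block_size, ... The reserved range stands in for a
    large address-space reservation; nothing is really mapped here.
    """

    def __init__(self, block_size, reserved_each_side):
        self.block_size = block_size
        self.limit = reserved_each_side
        self.front_end = 0  # first unused byte above the center
        self.back_end = 0   # number of bytes in use below the center

    def alloc_front(self):
        if self.front_end + self.block_size > self.limit:
            raise MemoryError("front of pool exhausted")
        offset = self.front_end
        self.front_end += self.block_size
        return offset

    def alloc_back(self):
        if self.back_end + self.block_size > self.limit:
            raise MemoryError("back of pool exhausted")
        self.back_end += self.block_size
        return -self.back_end

    def used_range(self):
        """Span, relative to the center, that a backing buffer must cover."""
        return (-self.back_end, self.front_end)
```

Because the center never moves, offsets that were handed out earlier stay valid when either end grows, which is the property that makes growth in two directions workable.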

The block pools are implemented as a 2GB memfd that gets mmap()-ed into the driver. An address in the middle is then turned into a GEM buffer object. The block pool is used to implement both a traditional "allocate and free" style state pool as well as a pool that is used for state that is associated with a command buffer. The latter pool has no free function; it simply gets reset when the command buffer is thrown away. It is a complicated infrastructure, but it has worked well, he said.
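
The two allocation styles differ mainly in how memory is given back. Here is a minimal Python sketch of each, with invented names (`ToyStatePool`, `ToyStateStream`) and offsets standing in for real memory; it is a model of the two policies, not of the driver's data structures.

```python
class ToyStatePool:
    """Allocate-and-free pool: freed slots are recycled before new ones."""

    def __init__(self, state_size):
        self.state_size = state_size
        self.next_offset = 0
        self.free_list = []

    def alloc(self):
        if self.free_list:
            return self.free_list.pop()
        offset = self.next_offset
        self.next_offset += self.state_size
        return offset

    def free(self, offset):
        self.free_list.append(offset)


class ToyStateStream:
    """Per-command-buffer allocator: bump a cursor, never free individually."""

    def __init__(self):
        self.cursor = 0

    def alloc(self, size, alignment):
        # Round the cursor up to the requested alignment, then advance it.
        offset = (self.cursor + alignment - 1) // alignment * alignment
        self.cursor = offset + size
        return offset

    def reset(self):
        # Everything allocated so far is discarded along with the
        # command buffer that owned it.
        self.cursor = 0
```

The stream variant does no per-object bookkeeping at all, which suits state whose lifetime is exactly that of one command buffer.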

Most hardware has support for compressed surfaces, but not all parts of the GPU understand all of the different formats. So a "resolve" operation is needed to decompress or recompress the surface at different points in the pipeline. Due to the multi-threaded nature of Vulkan, though, there is no real way to track when the resolves are needed on the CPU side. The Vulkan API provides two features ("render passes" and "layout transitions") that can help. Layout transitions are not currently used in the driver, but render passes delineate where resolves may be needed.

It is easier to write a Vulkan driver than one for OpenGL, Ekstrand said. The lack of error checking simplifies things to start with. The SPIR-V shader language is a bit easier to deal with than OpenGL's GLSL. Also, the Vulkan conformance tests consist of 115,000 tests that the driver developer doesn't have to write. It is a good set of tests, but there are still some holes, he said.

Some things are harder to do for Vulkan than for OpenGL. There is no CPU-side object state-tracking, for one thing. In addition, "applications have a lot more power for stupid". If the application is doing something wrong, which results in a bug filed against the driver, there is a good bit of work—without good tools—needed to track down the problem.

As far as sharing code between Vulkan and OpenGL drivers goes, there are a couple of different approaches. The approach taken was a "toolbox" that provides a number of different parts, from which a driver can be created. That approach has also provided better infrastructure for building other drivers in the future. Those looking for more details may want to view the YouTube video of the talk.

[I would like to thank the X.Org Foundation for sponsoring my travel to Helsinki for XDC.]

Posted Sep 29, 2016 4:26 UTC (Thu) by brouhaha (subscriber, #1698)

I know very little about the Vulkan architecture, but it worries me to read that "Vulkan drivers do no error checking". Can a rogue Vulkan application use bad parameter values (especially pointers) to interfere with the graphics of other applications, or worse yet, crash the GPU or the main CPU?

Posted Sep 29, 2016 6:43 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

A "Vulkan driver" is a purely userspace dynamic library that simply uses kernel interfaces to manage command buffers and memory. So malicious code can affect only code running inside the same process.

In theory. In practice the kernel-side buffer validation has been known to have security bugs and is not the most robust code in the world. There's also firmware that runs on GPUs and that interacts with untrusted data.

Posted Sep 29, 2016 10:49 UTC (Thu) by excors (subscriber, #95769)

The specification says errors can only hurt the current application:

> The core layer assumes applications are using the API correctly. Except as documented elsewhere in the Specification, the behavior of the core layer to an application using the API incorrectly is undefined, and may include program termination. However, implementations must ensure that incorrect usage by an application does not affect the integrity of the operating system, the Vulkan implementation, or other Vulkan client applications in the system, and does not allow one application to access data belonging to another application.

If the application is C/C++, that means it's no worse than what the application could do with plain C/C++ code. But if you've got some higher-level 'safe' language and expose the raw Vulkan API to it, that'd allow the application to bypass the language's safeness properties, which may be bad (so don't do that - maybe create your own high-level error-checked 3D engine or API and expose that to the applications instead).

I think in some (most?) Vulkan implementations the kernel driver is exactly the same as is used for OpenGL; the only difference is the userspace part. That means it should make no difference to the system's security - if you can exploit some vulnerability via Vulkan from C, you could equally exploit it without Vulkan by talking directly to the OpenGL kernel driver. (Vulkan just makes it much more convenient.)

The kernel driver may do some validation, but ideally the GPU hardware provides decent security for no performance cost - MMU-based process isolation, correct exception handling for unmapped addresses, watchdogs to cleanly stop infinite loops, etc. Modern GPUs do enough of that that Vulkan was able to rely on it.

In practice there are almost certainly lots of bugs that make the GPU drivers unsafe, which need to be found and fixed. But at least Vulkan makes it much more likely that innocent application developers will stumble across those bugs when they accidentally misuse the API, rather than relying on people intentionally searching the drivers for exploits, so hopefully they'll get reported sooner.

Posted Sep 29, 2016 11:43 UTC (Thu) by Karellen (subscriber, #67644)

"The idea is for application developers to check their Vulkan code during development, so "why burn 10% of my CPU doing validation" when there are no errors in the Vulkan code?"

ROTFLMAO!

Wait, what? They're *serious*? And think "there are no errors in the Vulkan code" is not wildly, unrealistically, mind-bogglingly optimistic?

"Testing shows the presence, not the absence of bugs" -- Edsger Dijkstra, *1969*.

Posted Sep 29, 2016 12:48 UTC (Thu) by DOT (subscriber, #58786)

It's just the question of where you do the validation. Do you check it once at compile time, or every run? Compile time checks are generally much more comfortable to work with.

Posted Sep 29, 2016 15:09 UTC (Thu) by excors (subscriber, #95769)

It's also a question of what you do with the results of the checks. An OpenGL driver spends a lot of time on parameter validation etc, and if it spots an error it sets a flag bit (as required by the GL spec). If the application is a production build on a user's machine, it's probably never going to look at those flags. At best it may call glGetError() once per frame and print "GL_INVALID_VALUE occurred during the previous frame, dunno where" into a log file that nobody will ever read.

In that environment, the time spent checking for errors is wasted. And there's also a strong incentive to keep that wasted time to a minimum by not adding any more extensive error checking into the API, so OpenGL still allows plenty of undetected errors that can corrupt the rendering or crash the application.

It's not that applications won't ever have bugs - just this isn't a worthwhile way to catch those bugs.

When a developer or QA tester runs the application, they *do* want extensive error checking, because they're in a position to do something about those errors - but because OpenGL doesn't distinguish between development and deployment, they're stuck with only the lowish-overhead checking that misses lots of bugs. It's a compromise that isn't ideal for anyone.

Vulkan tries to fix that by having an explicitly layered architecture, where a layer can cleanly intercept and inspect all API calls (which was only possible in OpenGL through ugly hacks), so anybody can write arbitrarily-expensive debug layers to use during development and then disable before running on a user's machine.

Posted Sep 29, 2016 13:28 UTC (Thu) by jem (subscriber, #24231)

> Wait, what? They're *serious*? And think "there are no errors in the Vulkan code" is not wildly, unrealistically, mind-bogglingly optimistic?

Well, "error checking" in the sense "we got input we don't understand, return an error" is more related to the lack of a contract between the user and the implementor of an API. It's not a safeguard against bugs. You can't reliably detect bugs at run time, like "something odd is going on so there must be a bug here, return an error".

Posted Sep 30, 2016 4:41 UTC (Fri) by eru (subscriber, #2753)

> You can't reliably detect bugs at run time, like "something odd is going on so there must be a bug here, return an error".

Assertions make a world of difference when you are trying to troubleshoot the problem afterwards. After decades of programming, I'm a belt-and-suspenders guy. You need both static checking by compilers and analyzers, and run-time checks. Neither is sufficient alone.

Posted Sep 30, 2016 23:25 UTC (Fri) by rahvin (guest, #16953)

Context is important. As mentioned previously, the "errors" we're talking about are rendering errors such as tearing or other graphical glitches. I don't think anyone's going to build a Vulkan driver or application that guides missiles or driver-less vehicles.

A random graphical glitch in some game that might not even get seen is not something you want to be wasting processing power recording, logging and transmitting. That's what they mean: as this API is mostly used for gaming, the important testing is before it's in the user's hands; afterwards you'd be wasting cycles recording errors no one is ever going to look at.

Posted Oct 1, 2016 15:59 UTC (Sat) by excors (subscriber, #95769)

It's a bit more than just graphical glitches - misuse of the API can cause crashes, memory corruption, information leakage, etc. And it's very easy to accidentally misuse the API, so any application whose rendering is influenced by untrusted input (e.g. a multiplayer game) might have bugs that don't show up in normal testing but that an attacker could theoretically trigger and exploit.

The same applies to OpenGL to some extent, so it's not a new problem. Vulkan just gives you many more opportunities for misuse, since memory management and lifetime tracking and synchronisation are now the application's responsibility and not the (hopefully-well-tested) driver's.

> I don't think anyone's going to build a Vulkan driver or application that guides missiles or driver-less vehicles.

https://www.khronos.org/openglsc/ says: "The Khronos Safety Critical working group is developing open graphics and compute acceleration standards for markets, including avionics and automotive displays, which require system safety certification [...] The safety critical working group is working to adapt more recent Khronos standards including the new generation Vulkan API"

Posted Oct 1, 2016 17:47 UTC (Sat) by NAR (subscriber, #1313)

I'd guess that's the entertainment system used by the passengers they are aiming for, not the autopilot.

Posted Oct 2, 2016 1:39 UTC (Sun) by nybble41 (subscriber, #55106)

Not the autopilot, no, but OpenGL is used for various cockpit displays such as Synthetic Vision and 3D maps; it isn't limited to passenger entertainment systems. I presume that they intend to use Vulkan for the same applications. If they were only targeting passenger displays then they wouldn't be called the "Safety Critical working group", as entertainment systems are not deemed safety-critical and often run commodity operating systems, including Linux.

Posted Sep 30, 2016 16:57 UTC (Fri) by hkario (subscriber, #94864)

It also helps with future releases of drivers.

If the old hardware/driver had a bug that was worked around by a lot of applications, new drivers will have to have bug-for-bug compatibility with the old code.

Error checking minimizes that.

Posted Sep 29, 2016 18:01 UTC (Thu) by flussence (guest, #85566)

Do you run all your CPU-bound desktop applications compiled with ubsan/asan and -Og? Then why should the GPU be any different?

Posted Sep 30, 2016 3:57 UTC (Fri) by pabs (subscriber, #43278)

ubsan/asan are only for debugging, not production:

http://seclists.org/oss-sec/2016/q1/363

Posted Oct 2, 2016 2:26 UTC (Sun) by zlynx (guest, #2285)

That is also what Vulkan validation layers are for. Debugging. And like the assert() macro they get disabled in release builds.

Posted Oct 2, 2016 12:21 UTC (Sun) by robert_s (subscriber, #42402)

I think that's exactly his point.