Whither the Apple AGX graphics driver?
The direct rendering manager (DRM) subsystem is a complex beast, into which a driver for a specific GPU must fit. That subsystem is also written in C, of course, meaning that any graphics driver written in Rust will have to depend on a whole stack of abstractions that allow it to interface with the rest of the DRM code. So it is not surprising that the first posting of the Apple GPU driver in March 2023 consisted mostly of DRM-level abstractions, with a preliminary version of the driver posted almost as an afterthought at the end. Perhaps more surprising — and discouraging — is that this first posting was also the last.
A pile of new abstractions is sure to draw comments, and that happened here. In a number of cases, those comments pointed out specific areas in the code that needed improvement and led to useful discussions. But two relatively small changes affecting the DRM scheduler proved to be a significant sticking point.
DRM scheduling and Rust
Graphical applications create an ongoing series of jobs, each of which is a program for the GPU to run. Many of those jobs have dependencies between them; as a simple example, consider that an image frame should not be displayed (one job) until the rendering of the graphics within it (performed, perhaps, by several other jobs) has completed. Dependencies can also extend outside of the graphics subsystem; an application generating images captured by a camera must wait for the camera to deliver each frame before rendering it. The task of a DRM scheduler is to run each job given to it, but only once any dependencies that job may have are satisfied. There can be multiple schedulers associated with any given hardware device.
Fences are the synchronization mechanism used to coordinate all of this. The camera driver will attach a fence to each frame it creates, and signal that fence when the frame is ready. Each job will wait until its dependent fences have been signaled; each job also has a fence of its own that is signaled when the job completes. Other fences are used to signal when the GPU has finished working on a specific task. The result is a complicated mesh of objects in the kernel with interdependent life cycles and a set of rules about how it all works that is not, to put it gently, thoroughly documented.
The
first controversial patch added a new scheduler callback that could be
used to defer the execution of a job if the hardware was not yet prepared
to take it. Christian König, the maintainer of the scheduler code, responded
curtly: "Well complete NAK
", followed by a few words of
explanation. It took a while to get to the real problem and König's
suggested alternative, which was to simply provide another fence to be
signaled when the hardware is ready instead. Asahi Lina agreed
to solve the problem that way instead.
Agreement on the other patch proved more elusive. In current kernels, if a scheduler is torn down, any pending jobs can be left dangling, which can lead to all kinds of unfortunate behavior. Asahi Lina changed the scheduler to explicitly detach any pending jobs and free them independently. König rejected the change, saying that this situation should never come about; a scheduler should never be torn down if jobs still exist. A related patch, posted one month later, fixed a use-after-free bug that could come about after a scheduler is destroyed. Again, König rejected the patch, saying that this is a situation that should never actually happen. If these problems turn up with the Apple GPU driver, he said, the problem is with how the driver is using the DRM subsystem.
König is correct, in that these problems do not (hopefully) happen with existing, in-tree graphics drivers. Asahi Lina's point, however, is that this situation is fragile and incompatible with the Rust way of doing things. The life-cycle rules for schedulers and the objects they interact with are vague and not always documented well; developers have to put a lot of time into reverse-engineering the subsystem to figure out how to use its APIs properly, and nothing in the code enforces that the rules have been followed. Rust developers seek to avoid that kind of interface, and Asahi Lina subscribes to that approach:
"This should not happen in the first place" is how you end up with C APIs that have corner cases that lead to kernel oopses...The idea with Rust abstractions is that it needs to be actually impossible to create memory safety problems for the user of the abstraction, you can't impose arbitrary constraints like "you must wait for all jobs to finish before destroying the scheduler"... it needs to be intrinsically safe. [...]
This is the power of Rust: it forces you to architect your code in a way that you don't have complex high-level dependencies that span the entire driver and are difficult to prove hold.
Without the small changes to the scheduler, she said, it would not be possible to create a suitably safe interface in Rust, at least not without the addition of a lot of other infrastructure that duplicates much of what the DRM scheduler is meant to provide.
This is where the whole discussion appears to fall apart. Unlike somebody
writing a driver in C, Asahi Lina is not just trying to get the hardware working;
she has set out to create an interface to the DRM scheduler that ensures
that all drivers (those written in Rust, anyway) will be free of
memory-safety bugs. König, instead, is worried about upsetting the
structure that keeps the DRM scheduler working, and insists the
the Rust code is simply using it incorrectly: "You just have implemented
the references differently than they are supposed to be
". He never
addressed the objective of making the API intrinsically safe; that
objective appears to not be on his radar.
In the end, Asahi Lina gave up on using the DRM scheduler at all; the plan now is to reimplement that functionality within the Rust code using workqueues instead.
Diverging goals
What has been shown here is a fundamental impedance mismatch between the two languages — and the development styles that they support — that is likely to make itself felt repeatedly as Rust makes inroads into the kernel. The Rust developers do not want to simply reimplement the C interfaces with their associated pitfalls; the whole point of Rust is to have the compiler ensure that certain kinds of bugs simply do not exist in the code. Achieving that goal means creating interfaces that differ from those used on the C side; it also means making changes to the C side to support the API guarantees that the Rust developers would like to have.
From your editor's reading of the discussion, there are no kernel developers who are actively trying to impede the adoption of Rust within the DRM subsystem. Indeed, many of the high-level DRM maintainers are highly supportive of Rust. But many maintainers may not fully understand what the Rust developers are trying to implement and, as a consequence, do not see why they should allow modifications to their subsystems to further those goals. Everybody is working to make the kernel better, but there is a lack of a shared understanding on how to get there.
The Rust-for-Linux developers have done their best to shield maintainers from much of the work of integrating Rust into their subsystems. This is helpful and perhaps necessary work, but it may also have the effect of making it harder for maintainers to fully understand what the Rust developers are trying to achieve. Maintainers may see a relatively straightforward project to create a set of bindings to their subsystem for a new language without understanding the full extent of what those bindings are meant to do. That will get in the way of reviewing the Rust work properly or understanding why some subsystem changes may be needed.
When developers review code against different objectives, misunderstandings may well result. If the above analysis of the situation holds (the truth almost certainly being, as is its way, more complicated), it suggests that the Rust-for-Linux developers need to make a clearer and more complete case for the kind of environment they want to create for kernel developers. Maintainers need to not only understand those goals but embrace and support them, with a willingness to accept changes to their subsystems to make the implementation of safe APIs possible. Everybody will need to work harder to make the Rust-for-Linux project a success. That may be especially true for maintainers who have, thus far, sat on the sidelines; without active support and understanding across the kernel community, it will be much harder to move Rust support beyond the "experimental" stage.
Perhaps most important, though, is getting past the perceptions (most often expressed outside of the kernel project itself) that the Rust developers are trying to force their approach onto the kernel development process or that kernel maintainers are, as a group, trying to block the adoption of Rust. One can find examples of almost anything in a community this large, but the overall story is rather more nuanced than that — and more amenable to solutions.
Meanwhile, the timeline for the merging of the Apple GPU driver is unclear at best, especially since the developer involved is, for now, not actively working toward that goal. It may turn out that the Nova driver, an effort to create a driver for NVIDIA GPUs in Rust that was announced in March (and which has resulted in a small set of abstractions of its own) will get there first — though that driver seemingly has quite a bit of ground to cover before it reaches the same level of functionality. Either way, there appears to be a future for graphics drivers written in Rust, even if many questions about just how they will get there are yet to be answered.
(See also: Asahi
Lina's view on the DRM-scheduling disagreement.)
| Index entries for this article | |
|---|---|
| Kernel | Development tools/Rust |
| Kernel | Device drivers/Graphics |
