A missed invariant

Posted Sep 4, 2024 18:16 UTC (Wed) by ebiederm (subscriber, #35028)
In reply to: Thoughts and clarifications by asahilina
Parent article: Whither the Apple AGX graphics driver?

I read through your description of what is going on and I have the strong feeling you are missing a strong invariant in the code.

The invariants that I see reading the discussion are:

- The driver may not be torn while I the hardware is doing something.

- The schedulable entities must exist while they have state in the gpu to manage.

Those lead to a scheduler can only be torn down if there is nothing to schedule.

My sense is that you have paid attention to the Rust lifetimes, but have lost the real world hardware lifetimes.

I don't understand how in the world it makes any sense to tear down something, when what it is managing still needs to be managed.

That is just my sense looking at this discussion from the outside, as nether a Rust nor a graphics person. So I might be wrong. But I hope this helps move the conversation forward.

A missed invariant

Posted Sep 4, 2024 19:19 UTC (Wed) by asahilina (subscriber, #166071) [Link] (7 responses)

> I don't understand how in the world it makes any sense to tear down something, when what it is managing still needs to be managed.

I suspect nor did the maintainer, because he wasn't paying attention to what I was saying, and he was stuck in his mental model of how other drivers work, and that's why he kept rejecting my ideas.

> The driver may not be torn while I the hardware is doing something

This is obvious. However, the scheduler is not the driver and in my driver there is not one global scheduler as there is in most drivers. In my driver, a scheduler is bound to a graphics queue and gets torn down when the queue is destroyed, for example when you kill the process that owns it. There are many schedulers running in parallel. This is because the actual global scheduler is implemented in GPU firmware and the DRM schedulers are just one layer on top of the firmware's scheduling primitives, used for dependency control and to wait for hardware resources (not to actually schedule across multiple queues since each only handles one). As strange as this might sound at first glance, this is how firmware-scheduled GPUs are supposed to be implemented in DRM, and I had long conversations with DRM people about this before jumping into the implementation, so the concept is sound.

Therefore schedulers are created and destroyed constantly, and it is not acceptable to block scheduler destruction on unnecessary things, since that would create the need for a bunch of extra tracking that would not otherwise be necessary.

> The schedulable entities must exist while they have state in the gpu to manage.

This is not true. It might be true in some drivers, perhaps drivers that have a better match between the drm_sched design and the GPU design, perhaps even the AMD driver (it is not a coincidence that the drm_sched maintainer is an AMD employee, that code came from the AMD driver).

But it is not true in my driver. My driver has to match the firmware API (which Apple defines and in fact isn't even stable and changes across versions and GPU generations) to DRM's model and the UAPI that I defined, which in turn is what the userspace code speaks to. As such, the driver's code consists of lower layers that manage firmware resources such as its idea of command queues, and higher layers that aggregate them into UAPI objects such as the UAPI command queue. In fact, in this driver, a single UAPI command queue maps to up to three firmware command queues (compute, vertex, fragment).

Since it is critical that firmware resources are always kept alive as long as is necessary (otherwise the whole GPU firmware crashes irrecoverably), the underlying firmware job objects are what hold references to the resources they need (and so on for resources that depend on others). These job objects are only cleaned up when the firmware itself signals job completion (successful or not).

This is perhaps very different from the typical C GPU driver design. C encourages you to have big objects and top-down locking and ownership management. That sort-of works but runs into all kinds of issues where everything needs to wait for everything else for cleanup, and you can easily end up with ordering issues.

Rust instead encourages fine-grained reference counting and fine-grained division of responsibility. The outcome of doing things this way is very positive, because it means that I can use Rust lifetimes to track firmware object lifetimes and ensure they meet the expectations of the firmware itself. This is in fact *the* reason why I chose Rust for this driver, because I knew getting all this right in C (and I have to get it right for the driver to not crash the whole system all the time) would be a nightmare.

Put another way, I do not trust drm_sched or any high-level code to be the ultimate decision maker of the lifetime of GPU resources. The underlying resources "track themselves" instead, and that works a lot better.

And so, when a process dies or destroys a GPU queue, I really don't care if there are jobs in flight. I just want to tear down the DRM scheduler layer and its concept of jobs. The underlying firmware resources will continue to run (there is no "cancel" command I can use to abort them, macOS never does that either), and will continue to hold references to any needed resources such as the associated GPU VM page tables, firmware command queues, kernel-allocated temporary GPU buffers, etc. until they complete, and then all the reference counts drop to zero and everything gets freed, possibly over one minute (compute jobs can be long-running) after the DRM scheduler and indeed the whole process that created this GPU job is long gone.

Not coincidentally, this is also how macOS behaves at least at the UAPI layer (you can ^C a compute app running a long job and the process dies, but the compute job continues to run to completion or failure in the firmware and kernel driver), though there are definitely big differences in our implementations. It's how the hardware is intended to be driven.

A missed invariant

Posted Sep 5, 2024 2:56 UTC (Thu) by nksingh (subscriber, #94354) [Link] (5 responses)

I just wanted to restate my understanding of Lina's point in hopes of justifying her point:

External code effectively controls the lifetime of two objects, the fences and the schedulers that process the fences. The fences are tied to some fds that the userspace controls and the scheduler(s) are tied to either a process context in Lina's case or some hardware device.

Both sides could go away at independent times driven by external forces, like a process exiting or a GPU being yanked out. Lina's design change is saying that the scheduler should be able to keep its lifetime independent by actively kicking out the fences it is tracking. This is nice because it's minimal extra complexity to vastly simplify the contract.

Oftentimes the simplified and less interdependent contracts allow components to evolve cleanly in the future and support more scenarios, so regardless of rust this is probably something that would be a righteous. Is there some C only driver code that could be simplified and deleted along with the proposed change?

A missed invariant

Posted Sep 5, 2024 7:43 UTC (Thu) by khim (subscriber, #9252) [Link] (4 responses)

> Is there some C only driver code that could be simplified and deleted along with the proposed change?

No, but as was already shown there exist a C code that hit the same issue that Asahi driver, even if in somewhat exotic case of external eGPU attached to laptop.

And even if there wasn't anything like that then flat out rejection without explanation is not something I like to ever see.

Sometimes even the most trivial and simple looking changes could be wrong because they are revealing deeply hidden problems and yes I've seen this situation that a 2-line patch can end up with 6 months of work to redo lots of things differently, too. That's normal! And saying that your two-line patch breaks our 10'000 line of code that are built on the opposite assumption is not a crime.

What is not normal is rejecting things on your authority without any explanations. That's acceptable when technical arguments are ignored and heck, as I like to say, I started considering Rust seriously when I saw how Rust community applied that to a person who kept rejecting technical arguments, but damn it, can anyone show us how and when any of Rust-in-Linux developers ignored technical arguments?

We saw that many times with C maintainers doing it (accusation that Rust is a religion and rustaceans try to be a missionaries are as non-technical as they come), but what about other side? Where was ignorance of technical issues was demonstrated?

A missed invariant

Posted Sep 5, 2024 9:01 UTC (Thu) by asahilina (subscriber, #166071) [Link] (3 responses)

It's actually possible that the Xe driver could be simplified with those changes (since it is also a firmware-scheduled GPU driver), but I'm not sure. As I understand it, they did architect things around the existing drm_sched requirements at least in some ways. Maybe that could be cleaned up if drm_sched is made more reasonable, or maybe it ties deeply enough into the rest of the driver that it doesn't make a difference at this point.

When valid technical arguments were made about the way I was doing things (the first patch mentioned in the article, about the new callback) I did change how my code worked. That discussion was very frustrating and could have been handled much better, but the end result *was* that the maintainer let me know of a better way of doing things.

There is, however, one more ironic and frustrating event here. This patch:

https://lore.kernel.org/lkml/20231110001638.71750-1-dakr@...

This patch to drm_sched basically implements what I wanted to implement in my first patch. It adds a new callback just the same, it just calls it update_job_credits() which is merely a slightly different way of doing the same thing. Except it does it much more intrusively, changing existing APIs so all the drivers have to be updated.

It also introduced the race condition that I then had to debug in [1], so it wasn't even a correct patch since it introduced a memory safety regression (for all drivers).

So I guess the "technical arguments" against my first patch, which I did agree with, still only apply to me somehow, and not the author of the above patch. I wonder why... maybe it's because people with an @redhat.com email address are inherently given more respect and trust? I can't help but wonder.

[1] https://github.com/AsahiLinux/linux/issues/309#issuecomme...

A missed invariant

Posted Sep 5, 2024 10:41 UTC (Thu) by khim (subscriber, #9252) [Link]

> So I guess the "technical arguments" against my first patch, which I did agree with, still only apply to me somehow, and not the author of the above patch. I wonder why... maybe it's because people with an @redhat.com email address are inherently given more respect and trust? I can't help but wonder.

Nah, more likely because this particular person have earned trust in the past.

I was in a similar position at my $DAYJOB and had to fight for a similar “privilege” of not fighting against nonsense comments for a few years. What worked, in the end, was the way people fight software patents: instead of declaring them invalid they accept their validity on faith but then “cut it to pieces” ¹. Similarly with $TROUBLESOME_MAINTAINER: after I accepted nonsense suggestions what were sure to lead to troubles and submitted them in my name but with explicit and prominent comments “this code handles issues that are too tricky for me to understand, but it was suggested $TROUBLESOME_MAINTAINER thus I assume they are correct” I then redirected all bugs caused by that code back.

After about half-dozen of times of that I convinced $TROUBLESOME_MAINTAINER that I understand that part of code better and now I could, probably submit really nonsense code and it would be accepted. I actually even did that, once, but since it was me who realized that code I submitted was nonsense and $TROUBLESOME_MAINTAINER have never understood the issue till I cooked up the test that broke and also created a fix faith remained.

I'm pretty sure there are something like this in play there.

¹) Bring prior art and instead of saying that this prior art invalidates the patent assert that since patent is still there and prior art is also there you have to interpret claims for the patent very narrowly to make sure they don't clash with the prior art. Repeat dozen of times and now you have extremely crippled patent which covers some very obscure and strange thing… which is no longer even close to what you are doing.

A missed invariant

Posted Sep 5, 2024 14:48 UTC (Thu) by DemiMarie (subscriber, #164188) [Link] (1 responses)

Has this regression been fixed?

A missed invariant

Posted Sep 5, 2024 16:02 UTC (Thu) by asahilina (subscriber, #166071) [Link]

Not as far as I know. I found it recently but at this point I don't really want to spend time fixing it in drm_sched since I'll be moving away from it. I just have a quick hack in our downstream tree to avoid the issue (that just disables part of the functionality that introduced it).

I haven't seen any other reports of the crash happening with other drivers, so I guess they got lucky for whatever reason (it's a UAF so if the freed memory isn't reused yet it would "work"), or maybe they have extra locking that makes the race impossible (if for some reason queuing a job is mutexed with its execution then that would avoid the problem, though it would also likely introduce a violation of fence signaling rules, but that's a whole different can of worms).

For us the problem only started happening after a certain kernel update, plus the GPUVM changes, and with kernel builds configured with certain preemption modes (not the case on the Fedora kernel), and even then it happened quite rarely and was hard to repro. So it's conceivable that the bug is just lurking and not triggering for other drivers.

A missed invariant

Posted Sep 6, 2024 12:59 UTC (Fri) by daenzer (subscriber, #7050) [Link]

> I suspect nor did the maintainer, because he wasn't paying attention to what I was saying, [...]

And you were paying attention to what he was saying perfectly, right?

Has it occurred to you that Christian might feel the same way about your exchange, just in reverse?

In the two contentious threads referenced in the article, I see Christian asking you for clarification, explaining why things can't be done like this, and making suggestions how it could be done differently. Doesn't really look like "he just wouldn't listen" to me.

P.S. AFAICT Christian was mostly explaining issues from the C PoV, I'm not sure he even made any direct statement about his position on Rust for Linux. Comments here trying to put him in a "maintainers resisting Rust" corner seem unjustified.

A missed invariant

Posted Sep 4, 2024 19:32 UTC (Wed) by pizza (subscriber, #46) [Link] (4 responses)

> My sense is that you have paid attention to the Rust lifetimes, but have lost the real world hardware lifetimes.

To paraphrase the Red Queen, hardware has the remarkable ability of triggering six supposedly-impossible situations before breakfast, invariably wrecking your nice clean/consistent/clever abstractions in the process.

(I have no opinion on if that applies to this particular situation...)

A missed invariant

Posted Sep 5, 2024 11:22 UTC (Thu) by taladar (subscriber, #68407) [Link]

That point applies just as much whether the clever abstraction is written down or is just a set of unwritten invariants that only exist in maintainer's heads. If anything changing the mental model as new behaviors are discovered is a lot harder than changing something that is written down since our brain has a way of making us confuse old versions and new versions of similar things that existed over time.

A missed invariant

Posted Sep 5, 2024 18:30 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (2 responses)

Well, either that, or your abstractions are so nasty that they are actually worse than the hardware. The C++ spec says that relaxed atomics may be implemented using (effectively) a time machine, and therefore inherited all manner of temporal paradoxes from every bad sci-fi novel you've ever read (but they insist on using entirely new names for everything, so e.g. the bootstrap paradox is renamed to "out of thin air," the grandfather paradox is "read from unexecuted branch," etc.). Meanwhile, real compilers emit raw load and stores (in the case of simple atomic reads and writes) or real atomic instructions (for read-modify-write), and real CPUs are usually not time machines, and so this is all a purely theoretical problem that never actually happens.

Rust also inherits this model for its atomics, and to my understanding, it takes exactly the same "meh, theory is hard, we'll just assume the implementation is not insane" stance as C++. I like to imagine that some aggressive compiler writer is going to blow past the spec's non-normative "please do not actually do this" note and try to optimize a time machine into C++ and/or Rust, but realistically I think the compiler writers are smart people who understand that this would be a bad idea.

To further explain the motivation: Relaxed atomics are basically the "I don't want the compiler to turn my data race into UB" annotation. They don't actually do anything other than guarantee that the individual operations on that particular variable are atomic and consistent with some sequential ordering, and that the compiler is not allowed to deduce UB from any data races that may result, but other than that, there are no rules. This is intended to allow compilers to just emit loads and stores for simple (read-only or write-only) atomics, since the compiler may reason that any cache incoherency could also be interpreted as some reordering of the parallel loads and stores, and so you don't have to emit a fence. Since the "some" sequential ordering is not required to be consistent with the ordering of any other relaxed atomic variable, you can have two of these variables that interact with one another in such a way that the resulting ordering is permitted to contain a loop or cycle, in which case the results are not well-defined but also are not UB, and now you have rules that permit circular arguments like "x is 37 because x is 37." In C++14, they did add a rule prohibiting some specific cases (including the simple 37-because-37 case), but to the best of my understanding, this is not a comprehensive fix and there are still variations of the behavior that are not forbidden by the spec.

A missed invariant

Posted Sep 5, 2024 19:35 UTC (Thu) by khim (subscriber, #9252) [Link] (1 responses)

> real CPUs are usually not time machines

I would say that since Pentium Pro “real CPUs” are a time machines and thus all these paradoxes they speak about have become real

> so this is all a purely theoretical problem that never actually happens.

Except it does. It even happens on x86, as we all know. And x86 tries harder than most other CPUs to hide it's true nature.

> to the best of my understanding, this is not a comprehensive fix and there are still variations of the behavior that are not forbidden by the spec

The problem here is that creating a memory model that makes some sense and also is compatible with all these optimizations that hardware (not compiler!) is doing… it's not easy.

From what I understand C++ (and thus Rust) model is not well aligned with what hardware is doing but what Linux is doing is not compatible with some more exotic targets (e.g. GPUs) thus we just have to agree that designing sensible memory model for atomics is just hard.

A missed invariant

Posted Sep 7, 2024 23:18 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

That link is not what I mean by "a time machine." I'm referring to real temporal paradoxes (such as both variables nonsensically taking on the value 42, because each supposedly got that value from the other), not just "the CPU executed things in a different order than I thought it would," nor even "the reordering which the CPU performed cannot be reconciled with a single global order of execution." Those are both normal and expected anomalies under relaxed atomics - they're the spec working as designed.