C Job Security
C Job Security
Posted Sep 4, 2024 16:51 UTC (Wed) by asahilina (subscriber, #166071)In reply to: C Job Security by pbonzini
Parent article: Whither the Apple AGX graphics driver?
1) It doesn't.
2) This is the diffstat for what I asked for:
Asahi Lina (3):
drm/scheduler: Add more documentation
drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
drm/scheduler: Clean up jobs when the scheduler is torn down.
drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
drivers/gpu/drm/scheduler/sched_fence.c | 4 +-
drivers/gpu/drm/scheduler/sched_main.c | 90 ++++++++++++++++++++++++++++++--
include/drm/gpu_scheduler.h | 5 ++
4 files changed, 99 insertions(+), 7 deletions(-)
58 of those 90 added lines were added documentation.
Not even 50 lines of actual code delta to make things robust enough to make a Rust abstraction possible, but the maintainer would rather I add a bunch of craziness in the abstraction to work around it. I don't think that is a sensible tradeoff.
Posted Sep 4, 2024 17:02 UTC (Wed)
by asahilina (subscriber, #166071)
[Link]
There are plenty of other lifetime/safety requirements in drm_sched, but for the rest I was able to either directly encode them as Rust lifetimes and Rust types, or enforce them via an extra reference counting layer in the abstraction. It's when those approaches fail that I really have to change the C code, and so far drm_sched has been the only component with an API crazy enough to need that when abstracted in Rust, of all the abstractions I've written.
Posted Sep 4, 2024 17:18 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (6 responses)
Would you be able to link to the relevant commits somewhere? Even just a git URL and a set of commit IDs from which we can recreate that diffstat would be good enough - I basically just want a chance to read your C code and get a sense for myself of what the changes you were asking for look like, and whether they can be justified in the absence of Rust (my gut feeling is "yes, they make sense for a pure C kernel, too", but it'd be nice to be able to point to code to explain that).
Posted Sep 4, 2024 17:24 UTC (Wed)
by asahilina (subscriber, #166071)
[Link] (4 responses)
Posted Sep 4, 2024 17:27 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
Perfect, thank you! It makes what you were doing there nice and easy to follow, and shows that the diffstat is misleading in that virtually all the additions are documentation.
Posted Sep 21, 2024 11:43 UTC (Sat)
by ras (subscriber, #33059)
[Link] (2 responses)
Three comments stood out to me. The first is from Christian, the drm maintainer:
> I have seen at least halve a dozen approach in the last two years where people tried to signal a dma_fence from userspace or similar.
Then this one, from an independent third party, possibly in response to the fact the panfrost had got dma_fence life times badly wrong (apparently a windmill sailed past unnoticed):
> I really think this needs to be documented if nothing else out of this thread.
The last comment in the the thread comes from the maintainer, Christian:
> Ah shit, yeah of course. We use IRQ threads in amdgpu for the second interrupt ring as well.
The significance of "IRQ threads" is the ban against signalling dma_fence's from userspace "or similar". "IRQ threads" fall under "so similar", apparently.
So it seems it's not just the Rust dev's who have a problem with the current API. Everyone does. The difference looks to be that Asahi tolerance for this state of affairs is so low she refuses to use it in it's current state. Perhaps the connection to Rust is it's borrow checker has made sloppy lifetime accounting an anathema to it's adherents.
Posted Sep 22, 2024 16:36 UTC (Sun)
by Rudd-O (guest, #61155)
[Link]
🤣 oh that question was so rich. I'm chuckling, sitting on my couch.
He's been presented with a giant 50kW laser with an automated turret to completely annihilate this class of API defect (the Rust type system), and when presented with this epic gift, he says NYAAAH.
Cue scroll of truth meme.
Asahi's "sin" was pointing out bad API design that a few noticed before, and a few are too cringely attached to. I understand that no driver needs to be completely rewritten if this problem is fixed. But if this problem needs to be fixed and that required the rewriting of a bunch of drivers, then that is exactly why it should happen. Over the years of existence of the DRM system, I have seen quite a few upses and other problems caused by it. And I have little doubt this is the sort of code quality issue that I've been bitten by.
W.r.t. the lifetime/ownership problem, rightfully raised by Asahi:
So, printk() needs to dereference the scheduler to print what, exactly, of importance? Easy solutions all around! Either cut out that printk() function call altogether, or make it not deref the scheduler (e.g. print a string referring to the scheduler name), or just use a weak reference which is easily modelable in any reasonable type system and there's a sane behavioral fallback in case the scheduler has died in between.
I completely understand why in some circumstances you may need to have circularity in ownership. But this specific case is exactly one of those cases, one of those many, many cases, where circularity is completely unjustified.
Years ago when I first heard of Rust, I was turned off to it by a few unsavory personalities in that community. I have been coding for two straight years in Rust now, and I have to say: it is absolutely wonderful; all the times before where I reached for C++ or Go or Python? Oh, how badly do I wish I had actually just used Rust back then.
If I was professionally engaged in writing a kernel driver for video cards or any other such device, I would probably just rip out the DRM, rewrite it in Rust, and then provide a C interface / API that is somewhat or mostly compatible with the current API for existing drivers, so I don't have to rewrite these other drivers, at least not totally. This is possible, and (I believe) Federico Mena Quintero has basically rewritten librsvg to be safe using these techniques (the GNOME planet carried his posts detailing these efforts). But, even if that was not doable, Asahi's call to just ignore the DRM subsystem and write her own workqueue-based implementation is the right call. Just route around the brain damage and keep going forward. Hopefully future drivers have based their own development on that work, rather than continuing to rely on a C API that is technically fixable, but (for human reasons) probably won't be anytime soon.
Posted Sep 22, 2024 16:39 UTC (Sun)
by Rudd-O (guest, #61155)
[Link]
I would wager this is exactly it, because that's exactly what has happened to my own brain.
Even now, when I must use something like Python, I follow the advice of a fascinating article (I'm not sure if it was echoed in this publication here) called Write Python Like It's Rust. That article has some solid advice of Rust patterns that you can use when you're writing a Python application, which definitely lead to improving the quality of your Python code.
Heck, just using types in Python, accompanied with a suitable IDE built-in type analyzer, has already helped me dodge hundreds of potential bugs.
Posted Sep 4, 2024 17:26 UTC (Wed)
by corbet (editor, #1)
[Link]
Posted Sep 4, 2024 17:32 UTC (Wed)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
That said, I was referring to the quote from the article ("making the API intrinsically safe [...] appears not to be on his radar") which seemed to be about more than your patch—especially considering the words of Dave Airlie, it seems to be something that requires a lot of study and planning. ¯\_(ツ)_/¯
Posted Sep 22, 2024 16:40 UTC (Sun)
by Rudd-O (guest, #61155)
[Link]
Not those 5, in my opinion.
C Job Security
Changes to drm_sched to make it safe
Changes to drm_sched to make it safe
Changes to drm_sched to make it safe
Changes to drm_sched to make it safe
>
> Fortunately it was mostly prototyping and I could jump in early enough to stop that, but basically this is a fight against windmills.
>
> Clearly nobody is going to get it right and hidden here in this thread, this info isn't useful.
>
> Ok, nail that coffin. Any other ideas how we could enforce this?
Changes to drm_sched to make it safe
Changes to drm_sched to make it safe
Most of those were linked in the article:
Changes to drm_sched to make it safe
C Job Security
C Job Security