Not-a-GPU accelerator drivers cross the line

By Jonathan Corbet
August 26, 2021

As a general rule, the kernel community is happy to merge working device drivers without much concern for the availability of any associated user-space code. What happens in user space is beyond the kernel's concern and unaffected by the kernel's license. There is an exception, though, in the form of drivers for graphical processors (GPUs), which cannot be merged in the absence of a working, freely-licensed user-space component. The question of which drivers are subject to that rule has come up a few times in recent years; that discussion has now come to a decision point with an effort to block some Habana Labs driver updates from entry into the 5.15 kernel.

The GPU-driver rule is the result of a "line in the sand" drawn by direct-rendering (DRM) maintainer Dave Airlie in 2010. The kernel side of most GPU drivers is a simple conduit between user space and the device; it implements something similar to a network connection. The real complexity of these drivers is in the user-space component, which uses the kernel-provided channel to control the GPU via a (usually) proprietary protocol. The DRM maintainers have long taken the position that, without a working user-space implementation, they are unable to judge, maintain, or test the kernel portion of the driver. They have held firm for over a decade now, and feel that this policy is an important part of the progress that this subsystem has made over that time.

At its core, a GPU is an accelerator that is optimized to perform certain types of processing much more quickly than even the fastest CPU can. Graphics was the first domain in which these accelerators found widespread use, but it is certainly not the last. More recently, there has been a developing market in accelerators intended to perform machine-learning tasks; one of those, the Habana Gaudi, is supported by the Linux kernel.

The merging of the Gaudi driver has raised a number of questions about how non-GPU accelerators should be handled. This driver did not go through the DRM tree and was not held to that subsystem's rules; it went into the mainline kernel while lacking the accompanying user-space piece. That was later rectified (mostly — see below), but the DRM developers were unhappy about a process that, they felt, bypassed the rules they had spent years defending. Just over one year ago, the arrival of a couple of other accelerator drivers spurred a discussion on whether those drivers should be treated like GPUs or not; no clear conclusions resulted.

The Habana driver has been the source of a few similar discussions over the last few months, with bursts in late June and early July. The problem now is an expansion of that driver's capabilities that requires using the kernel's DMA-BUF and P2PDMA subsystems to move data between devices. These subsystems were developed to work with GPU drivers and are clearly seen by some DRM developers as being part of the kernel's GPU API; drivers using them should, by this reasoning, be subject to the GPU subsystem's merging rules. Or, as Airlie phrased it in his objection to merging the Gaudi changes:

NAK for adding dma-buf or p2p support to this driver in the upstream kernel. There needs to be a hard line between "I-can't-believe-its-not-a-drm-driver" drivers which bypass our userspace requirements, and I consider this the line.
This driver was merged into misc on the grounds it wasn't really a drm/gpu driver and so didn't have to accept our userspace rules.
Adding dma-buf/p2p support to this driver is showing it really fits the gpu driver model and should be under the drivers/gpu rules since what are most GPUs except accelerators.

The interesting twist here, as acknowledged by DRM developer Daniel Vetter, is that there is, indeed, a free user-space implementation of the Gaudi driver. What is still not available is the compiler used to generate the instruction streams that actually drive this device. Without the compiler, Vetter said, the available code is "still useless if you want to actually hack on the driver stack". He elaborated further:

Can I use the hw how it's intended to be used without it?
If the answer is no, then essentially what you're doing with your upstream driver is getting all the benefits of an upstream driver, while upstream gets nothing. We can't use your stack, not as-is. Sure we can use the queue, but we can't actually submit anything interesting.

Over the course of the discussions, the DRM developers have tried to make it clear that they want a working, free implementation of the user-space side. It does not have to be the code that is shipped to customers, as long as it is sufficient to understand how the driver as a whole works. To some, though, the compiler requirement stretches things a bit far. Habana developer Oded Gabbay has described the DRM subsystem's requirements this way:

I do think the dri-devel merge criteria is very extreme, and effectively drives-out many AI accelerator companies that want to contribute to the kernel but can't/won't open their software IP and patents.
I think the expectation from AI startups (who are 90% of the deep learning field) to cooperate outside of company boundaries is not realistic, especially on the user-side, where the real IP of the company resides.

Cooperating outside of company boundaries is, of course, at the core of the Linux kernel development process. The DRM subsystem is not alone in making such requirements; Vetter responded by pointing out, among other things, that the kernel community will not accept a new CPU architecture without a working, free compiler.

Over the years, there has been no shortage of problems with vendors that want their hardware to work with Linux while keeping their "intellectual property" to themselves. This barrier has been overcome many times, resulting in wider and better hardware support in the kernel we all use. Getting there has required at least two things: demand from customers for free drivers and a strong position in the development community against proprietary drivers. The demand side must develop on its own (and often does), but the kernel community has worked hard to maintain and communicate a unified position on driver code; consider, for example, the position statement published in 2008. As a result, there is a consensus in the community covering a number of areas relevant to proprietary drivers; one will have to work hard to find voices in favor of exporting symbols to benefit such drivers, for example.

This ongoing series of discussions makes it clear that the kernel community has not yet reached a consensus when it comes to the requirements for drivers for accelerator devices. That creates a situation where code that is subject to one set of rules if merged via the DRM subsystem can avoid those rules by taking another path into the kernel. That, of course, will make it hard for the rules to stand at all. Concern about this prospect extends beyond the DRM community; media developer Laurent Pinchart wrote:

I can't emphasize strongly enough how much effort it took to start getting vendors on board, and the situation is still fragile at best. If we now send a message that all of this can be bypassed by merging code that ignores all rules in drivers/misc/, it would be ten years of completely wasted work.

Avoiding that outcome will require getting kernel developers and (especially) subsystem maintainers to come to some sort of agreement — always a challenging task.

In the case of the Gaudi driver, Greg Kroah-Hartman replied that he had pulled the controversial code into his tree. In response to the subsequent objections, he dropped that work and promised to "write more" once time allows. Dropping the patches for now helps to calm the situation, but it has not resolved the underlying disagreement. At some point, the kernel community will have to reach some sort of conclusion regarding its rules for accelerator drivers. Failing that, we are likely to see a steady stream of not-a-GPU drivers finding their way into the kernel — and a lot of unhappiness in their wake.

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 16:39 UTC (Thu) by q_q_p_p (guest, #131113) [Link] (31 responses)

Aren't DMA-BUF and P2PDMA interfaces completely generic ? I don't see why drivers using them, should be considered GPUs. Yeah, they were made with GPU use cases in mind, but it doesn't mean they can't be used for other things too. Imagine a floppy drive driver using DMA-BUF and being called a GPU driver ;)

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 17:00 UTC (Thu) by farnz (subscriber, #17727) [Link] (29 responses)

The thing that makes these things GPU-like is that a GPU is a programmable accelerator that takes in memory buffers (including those provided by DMA-BUF and P2PDMA interfaces) and outputs new memory buffers with processing applied based on the contents of one of the input memory buffers (which contains a program for the accelerator to run). These "not-a-GPU" things are programmable accelerators that take in memory buffers (including those provided by DMA-BUF and P2PDMA interfaces) and output new memory buffers with processing applied based on the contents of one of the input memory buffers (which contains a program for the accelerator to run).

A floppy drive driver would not have the programmable accelerator bit.

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 17:45 UTC (Thu) by q_q_p_p (guest, #131113) [Link] (28 responses)

For me GPU is something that utilizes DRM or KMS. Your description of gpu and not-a-gpu is what almost all computing devices do - take input (memory buffer), compute (execute program) and produce output (memory buffer) - on this level there's no difference between any of these devices, but it would be ridiculous to call every such device a GPU (e.g. FPGAs, programmable DSPs).

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 18:29 UTC (Thu) by ab (subscriber, #788) [Link]

A difference is that the stream of instructions for transforming an input buffer into an output buffer is completely opaque to the kernel. If you are not in a privileged position to be able to generate the stream, there's no way you'd know how to drive a particular floppy drive, so to say, if there is no common standard for that communication. Accelerators like these are black boxes without that knowledge. Vendors like to keep them as such and this is way different from traditional DRM or KMS use.

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 19:30 UTC (Thu) by airlied (subscriber, #9104) [Link] (9 responses)

"At its core, a GPU is an accelerator "

So for you that might be something you hold true to, but in the real world, DRM (without KMS) is a subsystem that exposes accelerators to userspace. It just happens to be 3D accelerators.

We don't call these devices GPUs; we call them accelerators. Does that make it easier to understand? And yes, DSPs and FPGAs should be in the same boat; we have a bunch of pointless upstream FPGA drivers that are pretty much unmaintainable because they are tied to closed userspaces. Xilinx even submitted a drm driver in the past.

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 20:03 UTC (Thu) by brigdh (subscriber, #137272) [Link] (8 responses)

Perhaps a problem is that DRM stands for Direct Rendering Manager. "Rendering" evokes "graphics". Sure, it's a decades-old name, but if it's intended to be a general accelerator interface, perhaps it should be renamed to account for that?
Essentially your "marketing" doesn't match your intent.

what does "drm" mean

Posted Aug 26, 2021 20:49 UTC (Thu) by nwnk (guest, #52271) [Link] (6 responses)

This feels a bit like chasing goalposts, but if that's the concern I suggest retconning it to "Doesn't Really Matter" and then explaining that it's the accelerator subsystem in the Kconfig text.

what does "drm" mean

Posted Aug 26, 2021 21:14 UTC (Thu) by brigdh (subscriber, #137272) [Link] (5 responses)

The naming isn't strictly important, but it probably does hurt the conversation. What may be more important is having someone be the "poster boy" for a "non-gpu" in the drivers/gpu area to show everyone else what that looks like, and to be an example to copy. Some of the past discussion seemed to be the GPU folks claiming their interface will do it all, but not providing any examples, and expecting someone else to figure out what the userspace stack is going to look like. Is it MESA? Some other framework? Chances are someone is going to get OneAPI talking to DRM, post the patches, and the GPU folks are going to tell them they are an idiot for not refactoring MESA. This is not to say the GPU folks need to do all the work, but some sort of a "blueprint" for what they want to see or what they think will work might go a long way.

what does "drm" mean

Posted Aug 26, 2021 21:22 UTC (Thu) by blackwood (guest, #44174) [Link]

Xilinx submitted a drm driver with their own opencl userspace stack a few years ago. That looked pretty reasonable as far as these things go (we have plenty of non-mesa userspace for the not-so-3d use-cases), aside from the lack of a compiler. It was some TPU/fpga fusion core aimed at AI use-cases.

what does "drm" mean

Posted Aug 27, 2021 20:39 UTC (Fri) by linusw (subscriber, #40300) [Link] (3 responses)

I have recently been under the impression that Apache TVM has the ambition to be the "MESA of learning/inference engines". Unlike other AI userspace, this project seems very alive, with major companies contributing to it. Their stated ambition is "An End to End Machine Learning Compiler Framework for CPUs, GPUs and accelerators".

what does "drm" mean

Posted Aug 27, 2021 20:53 UTC (Fri) by blackwood (guest, #44174) [Link] (2 responses)

From a quick look it's another model optimizer framework, of which there's an enormous amount all over for different combinations of use-cases. This one seems more aimed at embedded since they talk a lot about zephyr, if I got that right.

That's all important to squeeze actual good real world performance out of your hw, but not something we care about for a driver stack. The actual compiler backend is absent for real hw. They have llvm (for cpus), cuda (we know that one is blobby) and bring-your-own-codegen (definitely going to be blobby too).

With these fairly pure math accelerators with a lot less fixed function than 3d the only somewhat interesting thing in the hw is the actual compute block and corresponding compiler. And that one is always the sticking point, all the pieces around are generally fairly easy to get after a bit of talking.

what does "drm" mean

Posted Aug 28, 2021 15:07 UTC (Sat) by linusw (subscriber, #40300) [Link] (1 responses)

Actually Arm has submitted a backend for the Ethos NPU, and it is pretty open:
https://discuss.tvm.apache.org/t/rfc-ethosn-arm-ethos-n-i...

what does "drm" mean

Posted Aug 30, 2021 18:26 UTC (Mon) by blackwood (guest, #44174) [Link]

(I forgot to hit publish, why does lwn make you ponder your comment first)

On a quick look this seems fairly interesting. I think there's a few other NN stacks for inference engines at least which are similarly open. Nothing yet that achieved actual cross hw-vendor status, but good thing to watch for sure.

This one also seems to just redirect to the Arm Ethos stack, and not actually merge the backend fully into the upstream project. Only when you do that do the benefits of a real upstream project kick in.

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 23:25 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> DRM stands for Direct Rendering Manager
Maybe change that to Device Resource Manager? Because it's kinda what it is these days.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 9:07 UTC (Fri) by farnz (subscriber, #17727) [Link] (16 responses)

KMS is an interface for display controllers, not GPUs; GPUs are simply compute accelerators attached to the main CPU, which take a program and some input buffers, and produce output, and that's what DRM is intended to provide a good interface for. It happens, for historical reasons, that the first major use of programmable compute accelerators in Linux was accelerators attached to display controllers, but that's simply history.

This is why other compute accelerators belong with DRM; after more than a decade of getting the interfaces for compute acceleration right, and fighting the fight for maintainable kernel drivers, the DRM team (Dave Airlie et al) know what they're doing. One of the (many) things they've learnt over the years is that the driver maintainers need to be able to understand the program bytes for security reasons - if you don't know what the program bytes do, you can't determine what memory accesses the hardware is intended to attempt, and which ones are security bugs, nor is it even possible to filter out programs that are going to ask the hardware to make memory accesses the kernel does not want it to.
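The filtering idea above can be illustrated with a minimal sketch. It assumes the memory accesses a submitted program will make have already been decoded into (address, length) pairs, which is exactly the step that requires knowing what the program bytes mean. The `Access` type, the `out_of_bounds` helper, and the buffer layout are hypothetical and stand in for whatever a real driver would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One decoded memory access requested by an accelerator program (hypothetical)."""
    addr: int
    length: int


def out_of_bounds(accesses, buffers):
    """Return the accesses that do not fit entirely inside one allowed buffer.

    `buffers` is an iterable of (base, size) pairs the submitter is allowed to touch.
    """
    allowed = list(buffers)

    def inside(access):
        return any(
            base <= access.addr and access.addr + access.length <= base + size
            for base, size in allowed
        )

    return [a for a in accesses if a.length <= 0 or not inside(a)]


# One 256-byte buffer at 0x1000; the second access runs past its end and the
# third is outside it entirely, so both would be rejected.
program = [Access(0x1000, 0x10), Access(0x10F8, 0x10), Access(0x2000, 4)]
rejected = out_of_bounds(program, [(0x1000, 0x100)])
print(rejected)
```

The check itself is trivial; producing the list of accesses from an opaque, vendor-specific instruction stream is the part that cannot be done without documentation or an open compiler.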

There's even an option for teams that think that their software secret sauce is what makes the difference; do what AMD has done, and have a complete open source stack atop the kernel driver (the Mesa stack in the case of AMD OpenCL compute drivers), and have a separate proprietary stack (the AMDGPU-PRO stack in the case of AMD OpenCL compute drivers) that includes the secret sauce.

Also, when we're trying to determine what devices should utilize DRM, "utilizes DRM" is not a good dividing line because it's circular :-)

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 10:35 UTC (Fri) by daniels (subscriber, #16193) [Link] (15 responses)

> This is why other compute accelerators belong with DRM; after more than a decade of getting the interfaces for compute acceleration right, and fighting the fight for maintainable kernel drivers, the DRM team (Dave Airlie et al) know what they're doing. One of the (many) things they've learnt over the years is that the driver maintainers need to be able to understand the program bytes for security reasons - if you don't know what the program bytes do, you can't determine what memory accesses the hardware is intended to attempt, and which ones are security bugs, nor is it even possible to filter out programs that are going to ask the hardware to make memory accesses the kernel does not want it to.

Not just security reasons, but also because dma-buf and thus dma-fence have deep ties into memory management.

The analogy to things like floppy drives is a total red herring. For storage and network devices, you have well-known interfaces on both sides: industry standards like SCSI/NVMe/RDMA for hardware, and standardised kernel/userspace API through POSIX or Linux-specific development. Accelerator development is much more wild-west in that regard, so the kernel is being asked to be a dumb shovel between huge unknowns on both sides. That was much easier to handle in the past, because the driver had to bash registers and IRQs directly, so you could at least derive from that how the hardware works and thus what the implications should be. But now kernel drivers for modern accelerators just map command/event queues into userspace, so it's a total black box.

That's why there's a line in the sand: because the deep integration into all kinds of complex subsystems means that we're tying the behaviour of huge amounts of kernel behaviour into totally unknown vendor-specific black boxes. Having an open userspace - something meaningful, not just submitting an empty 'ping' command into the ring - is the only thing that lets us reason about what this actually means.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 17:50 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (13 responses)

I'm not very familiar with the DRM stack, so perhaps this is a stupid question, but...

What is stopping these manufacturers from releasing product A, with FOSS drivers/userspace, getting the drivers upstreamed, and then releasing product B, with proprietary userspace, but which just so happens to have the same firmware/hardware-level interface as product A, so that you can use the same drivers at the kernel level (but if you try to use product A's userspace with product B's hardware, it doesn't work)?

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 19:59 UTC (Fri) by dezgeg (subscriber, #92243) [Link]

This actually has happened (assuming I understood correctly what you were asking): NVIDIA proprietary driver running on top of nouveau kernel driver: https://www.phoronix.com/scan.php?page=news_item&px=N...

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 20:24 UTC (Fri) by blackwood (guest, #44174) [Link]

Generally these come with new pciids or new device-tree id, and that requires a kernel patch to add. You could just bind the old driver through sysfs to the new device to bypass that. But no one is that good at creating hardware that doesn't need at least minor adjustments to the kernel driver setup code, so in practice this doesn't happen. Except when it actually really is the exact same chip with just different branding.
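For readers unfamiliar with the sysfs binding mentioned here, the following is a minimal sketch of what such a request looks like. It relies on the kernel's per-driver `new_id` file under `/sys/bus/pci/drivers/`, which accepts a hexadecimal vendor and device ID. The driver name `example_accel` and the IDs are hypothetical, and the helper only builds the path and payload; it does not write anything.

```python
from pathlib import PurePosixPath


def new_id_request(driver, vendor, device):
    """Build the (sysfs path, payload) pair for adding a PCI ID to a driver.

    Writing the payload to the path (as root) asks the named driver to also
    claim devices with this vendor/device ID.
    """
    if not (0 <= vendor <= 0xFFFF and 0 <= device <= 0xFFFF):
        raise ValueError("PCI vendor and device IDs are 16-bit values")
    path = PurePosixPath("/sys/bus/pci/drivers") / driver / "new_id"
    payload = f"{vendor:04x} {device:04x}\n"
    return str(path), payload


# Hypothetical driver name and IDs, for illustration only.
path, payload = new_id_request("example_accel", 0x1234, 0xABCD)
print(path)
print(payload, end="")
```

This only gets the old driver to attach; as noted above, it does nothing about any setup-code differences the new hardware needs.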

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 21:43 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (10 responses)

> What is stopping these manufacturers from releasing product A, with FOSS drivers/userspace, getting the drivers upstreamed, and then releasing product B, with proprietary userspace, but which just so happens to have the same firmware/hardware-level interface as product

That's the situation now with the Radeon drivers. There's a reasonable open source stack, and then there's AMDGPU Pro closed-source driver that uses the same kernel interface as the open source driver, but provides more optimizations and functionality.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 23:21 UTC (Fri) by daniels (subscriber, #16193) [Link] (9 responses)

I don’t think that was the question. What you’re saying is that you can write closed userspace that happens to be better optimised against the same stack, and that’s totally fine, as long as the full surface area of the hardware and the API is understood in open stacks.

The question though was: what if someone makes hardware that’s similar enough to existing open hardware + userspace stacks to fool the kernel, but is only usable with new proprietary userspace, presumably with new extensions as well? And honestly, it’s a good question, and would be a great problem to have. But like blackwood said, no-one has yet done it and thanks to the designs open stacks enforce, it’s not on the cards either.

Not-a-GPU accelerator drivers cross the line

Posted Aug 28, 2021 16:08 UTC (Sat) by alyssa (guest, #130775) [Link]

In an alternate universe where the Midgard stack was upstream FOSS and Bifrost DDK-on-mainline was productized, this could have happened close to home...

Not-a-GPU accelerator drivers cross the line

Posted Aug 29, 2021 18:37 UTC (Sun) by marcH (subscriber, #57642) [Link] (7 responses)

> The question though was: what if someone makes hardware that’s similar enough to existing open hardware + userspace stacks to fool the kernel, but is only usable with new proprietary userspace, presumably with new extensions as well? And honestly, it’s a good question, and would be a great problem to have

What would be the kernel maintenance problem in that case?

Not-a-GPU accelerator drivers cross the line

Posted Aug 30, 2021 15:48 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (6 responses)

5+ years after the FOSS userspace is no longer usable, the kernel suddenly suffers a regression which can only be reproduced using the proprietary userspace. The regression is eventually narrowed down to some random subsystem which the driver briefly calls into, and the easiest way to fix it is to roll back a security fix. Now what?

Not-a-GPU accelerator drivers cross the line

Posted Aug 30, 2021 19:41 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Remove the driver from the mainline. Duh. Actually, remove it immediately (move to staging, mark BROKEN, remove) after the userspace goes proprietary.

Not-a-GPU accelerator drivers cross the line

Posted Aug 31, 2021 4:22 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

Well, I suppose technically the FOSS userspace does *work.* It's just that it only works with some 10+ year old toaster that 4 users actually own. Unfortunately, those 4 users know how to post to LKML and are not afraid of doing so.

Not-a-GPU accelerator drivers cross the line

Posted Aug 31, 2021 4:30 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Then limit it to several PCI IDs and blacklist further additions. Or make sure it works with certain firmware versions. It's not like the kernel developers can't do anything once the driver is in the kernel.
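As a rough sketch of that kind of restriction, assuming the driver consults a fixed allowlist before binding: the IDs, firmware versions, and the `driver_may_bind` helper below are all hypothetical. In the kernel itself this would be expressed in C through the driver's device ID table rather than in Python.

```python
# Hypothetical (vendor, device) pairs known to work with the open userspace.
ALLOWED_IDS = frozenset({(0x1234, 0x0001), (0x1234, 0x0002)})

# Hypothetical firmware versions that have been tested with the open stack.
ALLOWED_FIRMWARE = frozenset({"1.0", "1.1"})


def driver_may_bind(vendor, device, firmware):
    """Bind only to allowlisted hardware running allowlisted firmware."""
    return (vendor, device) in ALLOWED_IDS and firmware in ALLOWED_FIRMWARE


print(driver_may_bind(0x1234, 0x0001, "1.1"))
print(driver_may_bind(0x1234, 0x0003, "1.1"))
```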

Not-a-GPU accelerator drivers cross the line

Posted Aug 31, 2021 10:16 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

If there's a regression with the proprietary userspace only, then that's what's at fault, automatically - after all, the FOSS userspace works just fine.

Not-a-GPU accelerator drivers cross the line

Posted Aug 31, 2021 14:41 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

Since when has the source-availability been a release valve for ignoring kernel regressions? I haven't heard that that was suitable now.

The *right* thing to do is to ask for a way to reproduce the behavior with the FOSS userspace stack as part of the regression's test suite :) . Because how else are kernel developers going to know that they broke it?

Not-a-GPU accelerator drivers cross the line

Posted Aug 31, 2021 15:27 UTC (Tue) by farnz (subscriber, #17727) [Link]

In this case, because the driver is split into two parts - the userspace component, and the kernelspace component. Without source for both, how do you know that the kernelspace component is causing the breakage by doing the wrong thing? It might, for example, be the userspace component expecting that the kernel doesn't enable IOMMU facilities for this device, and being caught out by the change.

Not-a-GPU accelerator drivers cross the line

Posted Sep 7, 2021 2:17 UTC (Tue) by florianfainelli (subscriber, #61952) [Link]

> Accelerator development is much more wild-west in that regard, so the kernel is being asked to be a dumb shovel between huge unknowns on both sides.

So maybe the approach should be to push back, wait for these 100 or so AI startups to go out of business or be acquired or merged, and then start working on something that resembles a standard. The same is true of any silicon these days: the differentiation is in support, pricing, and intrinsic hardware design; software should just be a commodity, because it will be down the road.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 17:01 UTC (Fri) by dvdeug (guest, #10998) [Link]

Imagine a drive whose developers want to install a DMA interface in the Linux kernel and keep all the details of loading data from and to the disk in a proprietary system outside kernel space. I can't imagine that kernel developers would be too amused.

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 17:27 UTC (Thu) by flussence (guest, #85566) [Link] (12 responses)

There's prior art outside the DRM tree for how to deal with shovelware. If the manufacturer insists on making these devices a miserable experience to use and debug, like Conexant did with its winmodems and Broadcom with its everything, we should respect their wishes loud enough that nobody is fooled into buying them again.

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 19:01 UTC (Thu) by airlied (subscriber, #9104) [Link] (11 responses)

The problem is these are not consumer devices, they are mainly enterprise but also show up on cloud vendors like AWS.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 9:08 UTC (Fri) by shiftee (subscriber, #110711) [Link]

Unlike the consumer market, Linux has a lot of clout in the enterprise market.
Broadcom drivers are an area where we need to compromise for the sake of regular users.
This seems like a bad deal where we provide kernel infrastructure and maintenance for very little return.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 15:15 UTC (Fri) by flussence (guest, #85566) [Link] (9 responses)

In that case, I humbly suggest that they should never have merged it at all. For all intents and purposes this thing is about as relevant to mainline as a big-endian x86.

There's plenty of weird drivers already in the kernel but all of them have the major distinction that a normal person could realistically obtain the hardware and use it under Linux without breaking any laws (cf. the PS4 R600 support patches - rejecting those was the right thing to do even though it annoyed a few people). This hardware on the other hand sounds effectively time-bomb drm'ed — it'll be obsolete before it shows up on the grey market, and it's non-functional without the secret sauce component. Aside from the technical reasons: there's very few honest uses for these responsibility laundering accelerators to begin with.

A quick skim over the EULA on their site suggests that the company keeps tight legal control over who has access to the userspace part, so once they lose interest in it it becomes a paperweight. And it's an Intel subsidiary, throwing things over the wall and discontinuing support on a timeframe of months is all but guaranteed.

Not-a-GPU accelerator drivers cross the line

Posted Aug 29, 2021 18:39 UTC (Sun) by marcH (subscriber, #57642) [Link] (8 responses)

> so once they lose interest in it it becomes a paperweight. And it's an Intel subsidiary, throwing things over the wall and discontinuing support on a timeframe of months is all but guaranteed.

Unlike... smartphones?

Not-a-GPU accelerator drivers cross the line

Posted Sep 3, 2021 6:06 UTC (Fri) by flussence (guest, #85566) [Link] (7 responses)

Not really comparable (unless you're talking Windows Mobile). My phone's 10 years old and although the manufacturer sucks, it still gets aftermarket OS updates thanks to having so many eyes on it; I've even done some myself to get better fonts.

These ML chips will never get that kind of attention, because they're neither interesting nor accessible to hackers. If the company's lucky they may become a curiosity in a hardware museum someday, but they probably won't be plugged in.

Not-a-GPU accelerator drivers cross the line

Posted Sep 3, 2021 19:13 UTC (Fri) by marcH (subscriber, #57642) [Link] (5 responses)

> Not really comparable...

... to other Intel products either.

Not-a-GPU accelerator drivers cross the line

Posted Sep 21, 2021 4:09 UTC (Tue) by flussence (guest, #85566) [Link] (4 responses)

From recent memory: GMA500 (infamously never had a functional driver beyond software framebuffer, PowerVR junk but Intel branding and responsibility); the entire Moorestown x86 SoC platform (all purged from the kernel earlier this year to prevent people using them, if anyone ever bought them); old CPUs getting excluded from microcode fixes for logo-and-website architectural vulnerabilities. I've heard sour opinions of their recent WiFi 6 stuff from the bufferbloat people too.

And that's just the problems concerning Linux. The entire USB3/4/C/TB mess they've foisted on the world is a miserable experience no matter which OS you use, to the point where I've actively avoided buying any devices that need it.

Not-a-GPU accelerator drivers cross the line

Posted Sep 21, 2021 5:47 UTC (Tue) by marcH (subscriber, #57642) [Link] (3 responses)

Dunno about the USB mess. All other examples look like exceptions rather than the rule.

Not-a-GPU accelerator drivers cross the line

Posted Sep 21, 2021 17:10 UTC (Tue) by flussence (guest, #85566) [Link] (2 responses)

The one area where they seem to be doing well is their own GPUs, I'll give them that. If Intel was an option that didn't require an entire new PC I'd consider it over upgrading my ageing Radeon, because they've always been on the ball with driver support and the latter seems to be perpetually 5 years behind on video codecs.

It'd be irresponsible to go there right now though, what with the tulip mania apocalypse going on.

Not-a-GPU accelerator drivers cross the line

Posted Sep 22, 2021 6:37 UTC (Wed) by zdzichu (subscriber, #17118) [Link] (1 responses)

You only need a PCIe slot (Intel does stand-alone GPUs, again):
https://www.tomshardware.com/news/intel-iris-xe-dg1-budge...

Not-a-GPU accelerator drivers cross the line

Posted Sep 22, 2021 8:39 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

I don't think you can buy it just now, though. It should be available in the near future.

Not-a-GPU accelerator drivers cross the line

Posted Sep 7, 2021 10:42 UTC (Tue) by immibis (subscriber, #105511) [Link]

> These ML chips will never get that kind of attention, because they're neither interesting nor accessible to hackers. If the company's lucky they may become a curiosity in a hardware museum someday, but they probably won't be plugged in.

I have been looking at Tenstorrent, which promises to sell you a PCIe card for $750 (but it's not ready yet). Some other companies will sell you a powerful server for $X,XXX (or maybe $1X,XXX) which seems to still be in the range of (perhaps small groups of) really dedicated ML hackers. And, yes, some will only sell thousands of servers to huge cloud enterprises, but we can't assume that as a rule.

---

Not to mention that in 5-10 years, all these now-outdated accelerators made by now-defunct startups will be discarded and will have to go somewhere. At that point you may be able to pick one up on eBay for $100-$1000.

I found a set of "obsolete" Infiniband hardware that way - where "obsolete" = "only does 40Gbps when the modern hardware does 400Gbps". So it does happen. Now I can hack on Infiniband if I get the time and inclination. Because there are open source drivers for these NICs.

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 19:30 UTC (Thu) by bjacob (guest, #58566) [Link] (5 responses)

It would be understandable if Linux maintainers felt particularly conservative now, given the fact that so many of these new hardware accelerators will never actually be used by anyone in practice. There are probably more than 100 artificial-intelligence hardware accelerator startups, the majority funded by investors, not actual consumers. Even when one gets acquired by a division in a corporation that really thinks it is going to ship it, the odds of that actually happening are still long.

Maybe what Linux needs is some middle-ground between fully supported in-tree drivers and "NAK"...

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 19:43 UTC (Thu) by olof (subscriber, #11729) [Link]

I tried to get that going a couple of years ago, but was met with significant resistance -- and worse, after talking through and reaching some sort of middle ground, it just continued with passive-aggressiveness and subtweeting and snipes whenever the topic came up.

As a result, there's better things to spend my time on than dealing with difficult or toxic people, and I've stopped paying attention.

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 19:51 UTC (Thu) by airlied (subscriber, #9104) [Link] (3 responses)

The middle ground is just open source out of tree drivers I suppose.

Piling crap into the kernel isn't maintainable, especially crap with its own bespoke ioctl interfaces and user APIs. A bad network driver you can remove later, but a bad accelerator driver with a uAPI that userspace has started using is nightmare fuel. There are still a lot of kernel maintainers who think all drivers are like network or scsi and assume userspace isn't important to them. Really, kernel maintainers need to take more responsibility for the messes they create by merging stuff.

There is no track record of meeting these companies halfway actually working; some people believe that if we let them merge stuff it'll magically create a community with no effort, or that they will suddenly see the light.

My experience over 10-15 years of this is that companies will only do what they have to do to satisfy customer/market pressures. The only reason most of these companies want upstream drivers is to get them into RHEL in the end, and I don't get why we'd want to create upstream as a dumping ground for that.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 2:58 UTC (Fri) by developer122 (guest, #152928) [Link] (2 responses)

Makes me think they should be striking deals with RHEL to carry them for their customers.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 6:57 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (1 responses)

RHEL doesn't include out-of-tree drivers, so that would be a non starter.

It doesn't for exactly the same reasons that Dave mentioned. It is in Red Hat's interest to have (at least in the future) a common upstream subsystem instead of locking in Red Hat's customers to the SDKs of individual vendors.

Not-a-GPU accelerator drivers cross the line

Posted Aug 28, 2021 13:29 UTC (Sat) by lacos (guest, #70616) [Link]

> RHEL doesn't include out-of-tree drivers, so that would be a non starter.

That's how it should be, but at least VDO is an exception, to my understanding. (See e.g. the "kmod-kvdo" package from "rhel-7-workstation-rpms".)

https://www.redhat.com/en/blog/look-vdo-new-linux-compres...

> The code is available in the source RPMs, and upstream projects are getting established.

https://man7.org/linux/man-pages/man7/lvmvdo.7.html

> To use VDO with lvm(8), you must install the standard VDO user-space tools vdoformat(8) and the currently non-standard kernel VDO module "kvdo".

https://github.com/dm-vdo/kvdo#status

> there are a number of issues that must still be addressed to be ready for upstream

The file list of "linux-5.13.13.tar.xz" does not seem to contain relevant "uds" or "vdo" hits. Even on Fedora, the kernel module(s) seem only available from personal COPRs -- two different (competing?) ones, at that.

https://copr.fedorainfracloud.org/coprs/rhawalsh/dm-vdo/
https://copr.fedorainfracloud.org/coprs/tmm/VDO/

I fully agree that RHEL *should* not include out-of-tree drivers, but at least VDO is a precedent to the contrary.

IIUC, VDO is now (C) Red Hat, and Red Hat actually open-sourced originally proprietary code, which is fantastic -- but for this discussion, the driver is still out-of-tree. (I abandoned evaluating VDO on my RHEL installs when I realized I wouldn't be able to mount the same volumes under Fedora Live DVDs; e.g. for system recovery.)

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 21:54 UTC (Thu) by marcH (subscriber, #57642) [Link] (2 responses)

> Without the compiler, Vetter said, the available code is "still useless if you want to actually hack on the driver stack"

I think this should be the "line in the sand" (except "to hack" sounds vague and/or "too big" in this context).

I have the corresponding hardware and I want to make a dead simple one-line change in this driver. How easily can I run my one-line change before submitting? If I cannot run a simple one-line change in that driver then it is not really open-source. This should not require the manufacturer to expose advanced features or other intellectual property.

Discussing what is a GPU and what is not Does not Really Matter.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 3:10 UTC (Fri) by developer122 (guest, #152928) [Link] (1 responses)

I think it was described more clearly in terms of "can I use this thing the way it's meant to be used?"

To maintain something you need to be able to exercise it. The open-source implementation doesn't need to be high-performance, or even very good. However, there needs to be enough there that a maintainer can come along and exercise the kernel driver and hardware in all of the ways that are important to what the device is intended to do. For GPUs, a maintainer can run a video game over Mesa (or run a pytorch ML model over ROCm).

The inevitable alternative is to have broken drivers in the tree because either they couldn't be tested to ensure continued functionality, or the provided precompiled command language or anemic open source userspace components didn't exercise it comprehensively enough to notice the break.*

Either customers _eventually_ discover the break years later when they are forced to move to a new kernel/version of RHEL, or there's a heated debate years later over whether users still exist for a decrepit driver and if it can finally be dropped.

*(or worse, there's not enough information available to fix it, meaning reversions to not-break-userspace)

Either way, reverts, complaining users, and calls for users of decrepit drivers are no fun for kernel maintainers to deal with.

Not-a-GPU accelerator drivers cross the line

Posted Aug 29, 2021 19:17 UTC (Sun) by marcH (subscriber, #57642) [Link]

> I think it was described more clearly in terms of "can I use this thing the way it's meant to be used?"
> all of the ways that are important to what the device is intended to do.

That sounds more ambiguous to me than "run a one-line change".

> To maintain something you need to be able to exercise it.

Agreed, the devil is in defining "to exercise", especially in today's binary world of yes/no answers.

A number/metric can be attached to "running a one-line change" to make it completely unambiguous, for example: ability to run a one-line change in 80% of the driver code (without any NDA).

This all sounds like a test coverage metric because it basically is, as you noted yourself:

> For GPUs, a maintainer can run [TEST!] a video game over Mesa (or run [TEST!] a pytorch ML model over ROCm). The inevitable alternative is to have broken drivers in the tree because either they couldn't be tested

I think a surprising number of kernel maintenance issues come down to a simple lack of consideration for test code. For instance, in this case, if there were a requirement to provide some minimal test code and/or instructions as part of the "line in the sand", then this entire discussion would probably not have happened and this article would not have been written.
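
The "80% of the driver code" figure above can be turned into a concrete check. What follows is only a minimal sketch of that idea, not anything the kernel community actually uses: the `SourceFile` record, the file names, the line counts, and the notion of "exercisable lines" (lines that someone with the hardware, the open user-space code, and no NDA can actually reach) are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceFile:
    """Hypothetical per-file summary of a driver (not a real kernel structure)."""
    path: str
    total_lines: int
    exercisable_lines: int  # lines reachable with open user space and no NDA


def exercisable_fraction(files):
    """Return the share of driver lines an outsider could actually run."""
    total = sum(f.total_lines for f in files)
    if total == 0:
        return 0.0  # an empty driver exercises nothing
    reachable = sum(f.exercisable_lines for f in files)
    return reachable / total


def passes_line_in_the_sand(files, threshold=0.80):
    """Apply the (assumed) 80% threshold from the comment above."""
    return exercisable_fraction(files) >= threshold


# Made-up numbers for an imaginary accelerator driver.
driver = [
    SourceFile("hypothetical/device.c", 1000, 900),
    SourceFile("hypothetical/command_submission.c", 600, 300),
    SourceFile("hypothetical/memory.c", 400, 400),
]

if __name__ == "__main__":
    print(f"exercisable: {exercisable_fraction(driver):.0%}")
    print("passes" if passes_line_in_the_sand(driver) else "fails")
```

The hard part in practice would be producing the `exercisable_lines` numbers at all, for instance from coverage data gathered while running vendor-supplied test code; the arithmetic itself is trivial, which is rather the point of attaching a metric.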

Not-a-GPU accelerator drivers cross the line

Posted Aug 26, 2021 22:07 UTC (Thu) by alyssa (guest, #130775) [Link] (1 responses)

danvet and airlied are 100% right as usual -- upstream = FOSS. There is no such thing (rather, there shouldn't be) as an upstream closed driver stack, and that is precisely what's under discussion here.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 4:04 UTC (Fri) by marcH (subscriber, #57642) [Link]

Right, and there are plenty of downstream git repos everywhere for anything that does not pass that bar.

Spoilt for choice.

Not-a-GPU accelerator drivers cross the line

Posted Aug 27, 2021 8:33 UTC (Fri) by pabs (subscriber, #43278) [Link] (1 responses)

Seems like the DRM rule should apply to all the drivers in Linux that expose hardware-specific interfaces to userspace. Are there any where that doesn't make sense?

Not-a-GPU accelerator drivers cross the line

Posted Aug 28, 2021 2:17 UTC (Sat) by josh (subscriber, #17465) [Link]

None whatsoever; this should be the rule for every single driver.

Not-a-GPU accelerator drivers cross the line

Posted Aug 28, 2021 9:46 UTC (Sat) by jezuch (subscriber, #52988) [Link]

For me it is a tale of following the letter of the law while ignoring its spirit. All these companies are conveniently forgetting the fact that FOSS is a social contract in the first place. When you push shovelware (or any software which appears to be giving while giving nothing), even if you do everything according to the "rules", you still make the FOSS community just an accessory. You're not a member of that community, you're treating it instrumentally. And it doesn't matter one whit if it's using one interface or the other or if it's a GPU or any other type of device.

Not-a-GPU accelerator drivers cross the line

Posted Aug 28, 2021 14:23 UTC (Sat) by moorray (guest, #54145) [Link]

The mission of staging was pretty clear, and code-centric. Attracting vendors to stable trees was primarily a corporate success IMHO. Feels like what's happening in the char-misc tree is a continuation of the downward “let's make corporations happy” trend. Did Greg ever publish a clear explanation? Maybe he should write it up as an LWN article; I've discussed it over email with him in the past, but we all write 100 emails a day, and that's not conducive to solid philosophical disputes.

This is not a discussion between peers, he’s #2 in the project.

Not-a-GPU accelerator drivers cross the line

Posted Aug 29, 2021 7:55 UTC (Sun) by tchernobog (guest, #73595) [Link] (1 responses)

From the article:

> I do think the dri-devel merge criteria is very extreme, and effectively drives-out many AI accelerator companies that want to contribute to the kernel but can't/won't open their software IP and patents.

Then the few remaining will have the advantage of working with Linux, which would make them succeed on the rather fundamental enterprise market. Frankly, I fail to see how this is not benefiting the community at large in the long run.

I have yet to see these startup companies contribute large pieces of the Linux kernel. Mostly it has been hacks and tweaks to work around GPL restrictions, which are more harmful than not. The bulk of the work (outside the driver itself, which is company-specific) is still being done by others.

Not-a-GPU accelerator drivers cross the line

Posted Aug 31, 2021 15:00 UTC (Tue) by Wol (subscriber, #4433) [Link]

> but can't/won't open their software IP and patents.

If it's going in the kernel, then it *shouldn't* have got past the patent office.

We need to bang that drum loud and hard ...

Cheers,
Wol

"Startup companies" argument

Posted Aug 29, 2021 8:55 UTC (Sun) by zdzichu (subscriber, #17118) [Link] (2 responses)

Oded Gabbay seems to imply that AI startup companies should be treated differently. I really hope no-one will buy this.
Besides, Habana Labs is not a startup anymore, it's a part of Intel.
Intel people should know better how Linux development works.

"Startup companies" argument

Posted Aug 29, 2021 18:54 UTC (Sun) by marcH (subscriber, #57642) [Link] (1 responses)

From https://lwn.net/Articles/817671/

> There is a tendency in these discussions to anthropomorphize large corporations. Which is dangerous, because it is grossly inaccurate to reality. A large corporation is difficult to steer from the top and impossible to steer from the bottom. ...

Search keywords: https://www.google.com/search?q=site%3Alwn.net+anthropomo...

> Besides, Habana Labs is not a startup anymore, it's a part of Intel.

Another oversimplification. I don't know about Habana specifically but in general: https://www.google.com/search?q=%22hands-off%22+merger

"Startup companies" argument

Posted Aug 30, 2021 14:07 UTC (Mon) by joib (subscriber, #8541) [Link]

In particular, you shouldn't anthropomorphize Oracle: https://www.youtube.com/watch?v=-zRN7XLCRhc

Not-a-GPU accelerator drivers cross the line

Posted Aug 30, 2021 8:06 UTC (Mon) by tamara_schmitz (guest, #141258) [Link]

I'm crossing the line!
And I'm done holding back
So look out, clear the track
It's my turn!
I'm taking what's mine!
Every drop, every smidge
If I'm burning a bridge
Let it burn
But I'm crossing the line!

And for us, if we're over
That's fine
I'm crossing the line.

Not-a-GPU accelerator drivers cross the line

Posted Sep 10, 2021 9:21 UTC (Fri) by scientes (guest, #83068) [Link]

It is pretty clear that if you can't run the software (without loading some binary blob), there is no reason it can or should be maintained, and anyone that accepts it is a fool.

Not-a-GPU accelerator drivers cross the line

Posted Sep 10, 2021 15:31 UTC (Fri) by geert (subscriber, #98403) [Link]

Habanalabs Open-Source TPC LLVM compiler and SynapseAI Core library
https://lore.kernel.org/lkml/CAFCwf119s7iXk+qpwoVPnRtOGcx...

Thanks to LaF0rge and Thorsten Leemhuis for the reporting trail.