The intersection of mlx5, netdev, and lockdown
As described in the cover letter to the patch series, the hardware in question presents a sort of remote-procedure-call interface to the system; that interface is used to control a complex set of operations provided by the device. As developer Saeed Mahameed later said, the device "is able to expose millions of objects and device states interactively"; that results in "a complex debugging environment" where it can be hard to figure out why something is not working correctly. That presents a challenge for both customers and Mellanox support personnel, all of whom need a way to get at information about the state of the device.
In the past, according to Mahameed, this information was obtained by communicating directly with the device's bus registers using interfaces under /sys/bus/pci. Any program that can access the system at that level, of course, is able to compromise the system entirely, so that access would naturally be restricted to privileged users. That, however, is deemed insufficient on systems where lockdown policies are in place; PCI access could be used to circumvent lockdown and make persistent changes to the underlying system — exactly the scenario that lockdown is meant to prevent. So, on systems where lockdown is enabled, this approach to obtaining information from mlx5 devices is blocked.
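The sysfs-based access pattern described above can be sketched in a few lines: userspace opens a PCI "resource" file, maps it, and reads device registers directly. This is only an illustration of the technique, not Mellanox's actual tooling; a real invocation would mmap something like /sys/bus/pci/devices/0000:03:00.0/resource0 (root only, and refused under lockdown), so a temporary file stands in for the device BAR here, and the register offset and value are made up.

```python
# Sketch of the /sys/bus/pci "resource file" technique: mmap a PCI BAR
# exposed by sysfs and read device registers directly.  A real read
# would target e.g. /sys/bus/pci/devices/0000:03:00.0/resource0 (root
# only, blocked when lockdown is enabled); here a temporary file plays
# the role of the BAR so the sketch runs without hardware.
import mmap
import os
import struct
import tempfile

PAGE = 4096  # map a single page of the "BAR"

def read_reg32(path: str, offset: int) -> int:
    """Read a little-endian 32-bit register at `offset` in a mapped region."""
    fd = os.open(path, os.O_RDONLY)
    try:
        region = mmap.mmap(fd, PAGE, access=mmap.ACCESS_READ)
        try:
            return struct.unpack_from("<I", region, offset)[0]
        finally:
            region.close()
    finally:
        os.close(fd)

# Build the stand-in BAR: one page with a made-up register at offset 0x10.
bar = bytearray(PAGE)
struct.pack_into("<I", bar, 0x10, 0xDEADBEEF)  # hypothetical status register
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bar)
    bar_path = f.name

status = read_reg32(bar_path, 0x10)
print(hex(status))  # 0xdeadbeef
```

Any program with this level of access can just as easily write to a BAR and reprogram the device arbitrarily, which is exactly why lockdown closes the door on it.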
Lockdown is often seen as a way for consumer-electronics vendors to prevent their customers from controlling the devices that they think they own, and it can certainly be used that way. But it has also found a home in large data centers, where there is a strong motivation to ensure that every machine is running the software that it is supposed to. Those are the same machines that are likely to be running mlx5 hardware, though, leading to a problem for users. How can the necessary information be extracted from a locked-down production system to figure out why something went wrong?
The proposed answer is the mlx5ctl driver, which provides a restricted interface for the acquisition of debugging information. It also provides the ability to tweak configuration parameters (about 600 of them, it seems) within the device. With this driver installed, it is possible to talk to the device at a debugging level without direct access to its bus registers, and without running afoul of lockdown protections.
This driver has run into some concerted opposition, primarily from networking maintainer Jakub Kicinski, who is concerned about the addition of device-specific APIs to the mainline kernel. Modern interfaces have a large number of tunable parameters, and the networking ("netdev") developers have put a lot of effort into creating common interfaces for those parameters whenever possible. That allows the same tools to be used for hardware from multiple vendors, making life easier for both developers and users. Since there is no common API for mlx5-specific parameters, none of those tools will work with mlx5ctl. Kicinski, not liking this state of affairs, has blocked the driver, saying: "I don't see how netdev can agree to this driver as long as there is potential for configuring random networking things thru it".
Much of the ensuing discussion has circled around the question of what would be an acceptable interface for this data, if mlx5ctl is not it. One possibility that is repeatedly mentioned is the devlink API, which can be used to access and configure a vast number of parameters on network interfaces. It seems that devlink would work for mlx5, but that there is opposition to using it that way. The networking community does not want to see a proliferation of device-specific devlink parameters, especially if those parameters can be used to configure the operation of an interface. As a result, review of those parameters is required, and getting 600 mlx5-specific parameters past review seems challenging at best.
If the mlx5 information is not welcome in devlink, then where should it go? Kicinski mentioned debugfs at one point in the discussion, and Greg Kroah-Hartman has suggested that approach as well. There are a couple of problems with that idea, though: debugfs is not meant to scale to the amount of data a device can make available, and it is not enabled on locked-down systems. Kicinski has also suggested just shipping mlx5ctl to customers as an out-of-tree module, which runs counter to the usual advice given by kernel developers. Out-of-tree modules also run afoul of lockdown restrictions, of course; Kicinski described that restriction as a problem in its own right.
This discussion appears to be at an impasse; a somewhat frustrated Jason Gunthorpe described the situation this way:
Users want an in-tree solution that is compatible with lockdown. A solution that works for all the mlx5 deployment modes (including Infiniband native without netdev) and covers most of the functionality they previously enjoyed with the /sys/../resource based tooling. This series delivers that.
Nobody has offered an alternative vision that achieves the same thing.
The mlx5ctl developers have raised a related question as well: why do the networking developers have a say over this driver at all? From the mlx5 point of view, it is not a network device.
Also I would like to repeat, this is not touching netdev, netdev's policies do not apply to the greater kernel or RDMA, and we have use cases with pure-infiniband/DPU/FPGA cards that have no netdev at all, or other cases with pure virtio instances, and much more.
This question has not been directly answered other than pointing out that some of these devices do, indeed, have a network interface on them. The fact that mlx5ctl could be used to influence network devices on some hardware is seen as being sufficient to require the approval of the networking developers.
The question of jurisdiction, for lack of a better word, has come up before in the kernel community. As a recent example, consider AI accelerators, which look a lot like graphics coprocessors without a display controller. The graphics community has spent years developing and enforcing requirements, including the availability of a free user-space implementation of each device's functionality, on GPU drivers. AI accelerators were being upstreamed via a separate path where such requirements were not enforced, leading to protests from the graphics community. In the end, after extensive discussions, an accommodation was reached that brought AI accelerators under a similar set of rules.
This case looks similar; networking developers see an interface to network interfaces that does not follow that subsystem's rules, and they worry that their hard work to prevent a proliferation of device-specific configuration APIs will be undermined. The mlx5ctl developers, instead, feel that they are being prevented from merging a proper upstream implementation of needed functionality by developers from an only marginally related subsystem. The end result is a fair amount of frustration and little apparent progress toward a solution.
Eventually, it would seem, some sort of understanding will need to be
reached here. What that will look like is not clear at this point; getting
there will require the people who are closest to the problem to find a
way to work together toward a solution that addresses the concerns of both
sides. The kernel community tends to find such a solution eventually, but
the road to that destination can be bumpy at times.
Index entries for this article:
Kernel: Development model/Driver merging
Kernel: Device drivers
Posted Dec 18, 2023 16:01 UTC (Mon) by sima (subscriber, #160698)

More on topic I think in drm we'll go with devlink for these management/debug/observability features too, it looks like the place where we can tap into the most experience. Maybe except for specific things where there's already other solutions, like devcoredump. And going with devlink should help for some of the data-center gpus, which are glued together with infiniband/networking and a lot of other interesting devices.

Posted Dec 18, 2023 21:16 UTC (Mon) by npws (subscriber, #168248)

Posted Dec 18, 2023 17:37 UTC (Mon) by Nahor (subscriber, #51583)

Posted Dec 19, 2023 6:46 UTC (Tue) by marcH (subscriber, #57642)

Is it not possible for the mlx5 kernel code to offer BOTH interfaces? So:
1. all the common and usual networking features work using the usual configuration tools
2. unusual mlx5 features require a custom driver and unusual APIs
3. overlap between 1. and 2. does not matter as long as 1. is fully implemented, compliant etc.

I didn't follow the links sorry.

Posted Dec 19, 2023 8:51 UTC (Tue) by atnot (subscriber, #124910)

1. The ctl driver gets added mostly for debugging
2. Some new super complex networking feature gets added, at first to the ctl driver only because it's easier
3. It's a complex feature so Nvidia Mellanox releases a library to work with it. It's possibly a big userspace blob. It contains at least one fork of LLVM.
4. The kernel never gains any native feature for it because it's not in the vendors interest anymore

And voila you've recreated the CUDA situation for networking

Posted Dec 20, 2023 16:29 UTC (Wed) by jgg (subscriber, #55211)

There are compilers too, in the p4 space, and someone is pushing exactly a closed p4 compiler solution for netdev using tc for some of the operations (unrelated to mlx5).

Mlx5ctl is not setup to be able to do datapath operations at all.

Posted Dec 19, 2023 13:08 UTC (Tue) by yaneti (subscriber, #641)

https://lore.kernel.org/netdev/20211129101315.16372-381-n...
where the OOB DASH block is separate and different enough from a nic to have its own firmware and needs to be talked to by the OS to fully operate.

Not helped by Realtek not publishing any public datasheets about it

Lockdown and data center deployments
Posted Dec 20, 2023 1:19 UTC (Wed) by geofft (subscriber, #59789)

The purpose of lockdown / secure boot is simple: to prevent unauthorized access to ring 0, with a particular eye towards preventing an attacker who gains root access from using that access to persistently compromise the machine across reboots or reinstalls, e.g. by adding malware to the boot sector or to the firmware itself. The kernel is necessarily able to write to the boot sector, install firmware updates, etc.; userspace does not need to do so, and can be locked down. Once you set this as the goal, there are some obvious implications: you have to protect the integrity of all kernel memory, or else an attacker can get code execution in ring 0. You need to protect it not just from userspace but also from a DMA-capable device that's been duped into writing to the wrong place, so you need to restrict arbitrary reconfiguration of devices. Both loading modules and resuming from hibernation involve writing to kernel memory, so you have to come up with some cryptographic scheme to prevent an attacker from writing a custom module or modifying a hibernation image. For the hibernation case, since the kernel is writing the data, it must be in possession of a secret key, so now you need to not only ensure integrity of kernel memory (no unauthorized writes) but also confidentiality (no unauthorized reads).

For the consumer-device use case, despite our editor's (warranted!) worries about taking control away from end users, this is a perfectly reasonable threat model. Most people run their personal devices with one account that has admin access in some fashion. Most people run a variety of software from a variety of sources on their personal devices. There are a variety of meaningful solutions to try to limit malware from getting onto the machine in the first place, but lockdown is defense in depth if something gets through. Once you're on a single-user device, you can, with patience, become root within userspace. We'd like to ensure that being infected with malware is a recoverable problem, that is, that rebooting or in the worst case reinstalling the computer (via normal reinstall mechanisms that normal people would use) gets it back to a known good state. Those of us heading towards impromptu tech support jobs over the holidays probably appreciate a world where we don't have to tell our family members, sorry you got malware, I didn't bring my EEPROM programmer with me so I can't fix it.

For large-company deployments, I'm not so sure. Most large companies, I'd think, should only be running software in their data centers that they intended to run in their data centers and that they have some amount of control over: even if they don't audit every line and build from source, they probably still have a more robust supply chain than the one backing random downloadable indie games on itch.io. This means that you can turn the problem from "How do I add defense-in-depth against inevitable malware?" to "How do I make sure my systems are running the code I intended them to run?" For the cloud computing use case, in particular, your physical machines should basically only be running qemu-kvm (or firecracker or whatever) and some minimal management code; the majority of even your own source code is going to run inside a VM, and the need for lockdown / secure boot on the inside of a VM is basically gone since blowing away the VM from the outside is quite easy.

(And on a side note, large companies can meaningfully manage their own keys in a way single-user devices can't. If I put my laptop's module signing private key or MOK or whatever in my home directory, any malware that wants to attack that laptop has the keys it needs right there! It makes things harder for automated attacks, sure, but it's still the cryptographic equivalent of putting a spare key under the doormat. So even though it genuinely takes away user control for the OS vendor to hold the key, there isn't a clear alternative. For large-scale deployments, you can have a couple of machines somewhere with access to the private key and with increased paranoia, and malware that gets onto the rest of your fleet has no access to this key.)

So I think the real answer here is that large companies are better served by something other than kernel lockdown. If you use the platform's secure-boot mechanism to load a custom bootloader that verifies a read-only image for an entire OS, not just its kernel and initrd, and the image is appropriately minimal, you've already accomplished the goal of lockdown - unauthorized people don't have access to ring 0 because they don't have access to anything on your machine at all. (Compare with Chromium OS, which did basically this on the consumer side well before the Linux kernel had lockdown support and was still secure.) If you can use something like IMA to make sure the only executables that get loaded in normal userspace are signed by your build infrastructure, the attacker's binaries can't run in the first place. If you restrict shell access to production in some fashion, you get the ability to defend against a compromised workstation. And so forth.

So, for the topic at hand: don't turn on lockdown, let people directly mess with the PCI registers, and put a lot of auditing and logging and access control around getting a shell that gets someone access to /sys/bus/pci in the first place. Let people build custom kernel modules and load them by pushing source code to a temporary branch, having a second engineer hit approve, and taking a signed build from your normal build infrastructure. No need to build a special custom interface that tries to figure out what changes to device configuration can cause X which can cause Y which can cause Z which can do DMA to arbitrary locations, and what can't.

If you do want lockdown for defense in depth, as it happens, there's support for IMA. As the manpage notes, under lockdown, "Only validly signed modules may be loaded (waived if the module file being loaded is vouched for by IMA appraisal)." You could imagine building a similar mechanism where you can sign shell scripts that can access resources that lockdown otherwise blocks. You could even use this mechanism to get a secured interactive shell (e.g., you type a command line, a coworker clicks "seems fine," the infrastructure turns it into a signed script/binary, runs it on the machine, and shares the output with the two of you). Extending the "unless IMA" special case in lockdown seems more productive than building complex driver interfaces that are compliant with the strict lockdown rules.

(For context, I've spent the last almost decade of my professional life working on large-company deployments that use cards from either Mellanox or their competitors and make good use of their fancy features, but I can't speak to deploying either secure boot or IMA at scale, and I think that's because IMA has very little mindshare but secure boot genuinely is the wrong fit.)

Posted Dec 21, 2023 9:08 UTC (Thu) by amarao (guest, #87073)