LWN: Comments on "The intersection of mlx5, netdev, and lockdown"

The intersection of mlx5, netdev, and lockdown

amarao — Thu, 21 Dec 2023 09:08:53 +0000

Existing hardware RAID controllers have much the same problem: there is a vast configuration interface (create RAIDs and hot spares, change RAID parameters, control the battery, etc.), which is entirely outside of the block device interface. They have their own character device, through which they can do whatever they want with a custom utility (like megacli). I don't see much difference between mlx5 and a megaraid.

The intersection of mlx5, netdev, and lockdown

jgg — Wed, 20 Dec 2023 16:29:42 +0000

We already have rdma for that. Ucx, libfabric, nccl, dpdk, spdk and more implement wildly complex network datapath stuff in user space over there.

There are compilers too, in the p4 space, and someone is pushing exactly a closed p4 compiler solution for netdev using tc for some of the operations (unrelated to mlx5).

Mlx5ctl is not set up to be able to do datapath operations at all.

Lockdown and data center deployments

geofft — Wed, 20 Dec 2023 01:19:46 +0000

The purpose of lockdown / secure boot is simple: to prevent unauthorized access to ring 0, with a particular eye towards preventing an attacker who gains root access from using that access to persistently compromise the machine across reboots or reinstalls, e.g. by adding malware to the boot sector or to the firmware itself. The kernel is necessarily able to write to the boot sector, install firmware updates, etc.; userspace does not need to do so, and can be locked down. Once you set this as the goal, there are some obvious implications: you have to protect the integrity of all kernel memory, or else an attacker can get code execution in ring 0. You need to protect it not just from userspace but also from a DMA-capable device that's been duped into writing to the wrong place, so you need to restrict arbitrary reconfiguration of devices. Both loading modules and resuming from hibernation involve writing to kernel memory, so you have to come up with some cryptographic scheme to prevent an attacker from writing a custom module or modifying a hibernation image. For the hibernation case, since the kernel is writing the data, it must be in possession of a secret key, so now you need to not only ensure integrity of kernel memory (no unauthorized writes) but also confidentiality (no unauthorized reads).
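To make the integrity/confidentiality distinction concrete: the kernel reports the active lockdown mode through securityfs, normally at /sys/kernel/security/lockdown, as a line such as "none [integrity] confidentiality" with the active mode in brackets. A minimal Python sketch of reading it follows; the function names are invented for illustration, and the path assumes securityfs is mounted in the usual place.

```python
import re
from pathlib import Path

# Usual location; assumes securityfs is mounted at /sys/kernel/security.
LOCKDOWN_PATH = Path("/sys/kernel/security/lockdown")


def parse_lockdown(text: str) -> str:
    """Return the bracketed (active) mode from the lockdown file's contents."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


def current_lockdown_mode(path: Path = LOCKDOWN_PATH) -> str:
    """Report the active mode, or 'unavailable' if the file cannot be read."""
    try:
        return parse_lockdown(path.read_text())
    except OSError:
        # Missing file (no lockdown LSM, securityfs not mounted) or no permission.
        return "unavailable"


if __name__ == "__main__":
    print(current_lockdown_mode())
```

A result of "integrity" corresponds to the no-unauthorized-writes goal, and "confidentiality" adds the no-unauthorized-reads goal on top of it.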

For the consumer-device use case, despite our editor's (warranted!) worries about taking control away from end users, this is a perfectly reasonable threat model. Most people run their personal devices with one account that has admin access in some fashion. Most people run a variety of software from a variety of sources on their personal devices. There are a variety of meaningful solutions to try to limit malware from getting onto the machine in the first place, but lockdown is defense in depth if something gets through. Once you're on a single-user device, you can, with patience, become root within userspace. We'd like to ensure that being infected with malware is a recoverable problem, that is, that rebooting or in the worst case reinstalling the computer (via normal reinstall mechanisms that normal people would use) gets it back to a known good state. Those of us heading towards impromptu tech support jobs over the holidays probably appreciate a world where we don't have to tell our family members, sorry you got malware, I didn't bring my EEPROM programmer with me so I can't fix it.

For large-company deployments, I'm not so sure. Most large companies, I'd think, should only be running software in their data centers that they intended to run in their data centers and that they have some amount of control over: even if they don't audit every line and build from source, they probably still have a more robust supply chain than the one backing random downloadable indie games on itch.io. This means that you can turn the problem from "How do I add defense-in-depth against inevitable malware?" to "How do I make sure my systems are running the code I intended them to run?" For the cloud computing use case, in particular, your physical machines should basically only be running qemu-kvm (or firecracker or whatever) and some minimal management code; the majority of even your own source code is going to run inside a VM, and the need for lockdown / secure boot on the inside of a VM is basically gone since blowing away the VM from the outside is quite easy.

(And on a side note, large companies can meaningfully manage their own keys in a way single-user devices can't. If I put my laptop's module signing private key or MOK or whatever in my home directory, any malware that wants to attack that laptop has the keys it needs right there! It makes things harder for automated attacks, sure, but it's still the cryptographic equivalent of putting a spare key under the doormat. So even though it genuinely takes away user control for the OS vendor to hold the key, there isn't a clear alternative. For large-scale deployments, you can have a couple of machines somewhere with access to the private key and with increased paranoia, and malware that gets onto the rest of your fleet has no access to this key.)

So I think the real answer here is that large companies are better served by something other than kernel lockdown. If you use the platform's secure-boot mechanism to load a custom bootloader that verifies a read-only image for an entire OS, not just its kernel and initrd, and the image is appropriately minimal, you've already accomplished the goal of lockdown - unauthorized people don't have access to ring 0 because they don't have access to anything on your machine at all. (Compare with Chromium OS, which did basically this on the consumer side well before the Linux kernel had lockdown support and was still secure.) If you can use something like IMA to make sure the only executables that get loaded in normal userspace are signed by your build infrastructure, the attacker's binaries can't run in the first place. If you restrict shell access to production in some fashion, you get the ability to defend against a compromised workstation. And so forth.

So, for the topic at hand: don't turn on lockdown, let people directly mess with the PCI registers, and put a lot of auditing and logging and access control around getting a shell that gets someone access to /sys/bus/pci in the first place. Let people build custom kernel modules and load them by pushing source code to a temporary branch, having a second engineer hit approve, and taking a signed build from your normal build infrastructure. No need to build a special custom interface that tries to figure out what changes to device configuration can cause X which can cause Y which can cause Z which can do DMA to arbitrary locations, and what can't.
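For readers who haven't poked at it, "directly messing with the PCI registers" starts from the binary config file that each device directory under /sys/bus/pci/devices exposes. The Python sketch below only reads and decodes the first four standard header fields (vendor ID, device ID, command, status, each 16-bit little-endian); the function names and the example device address are illustrative assumptions, not anything from the mlx5 discussion.

```python
import struct
from pathlib import Path


def parse_pci_header(config: bytes) -> dict:
    """Decode the first four 16-bit little-endian fields of PCI config space."""
    if len(config) < 8:
        raise ValueError("need at least 8 bytes of config space")
    vendor, device, command, status = struct.unpack_from("<HHHH", config, 0)
    return {"vendor": vendor, "device": device, "command": command, "status": status}


def read_pci_header(address: str) -> dict:
    """Read the header of one device, e.g. '0000:03:00.0' (a made-up address)."""
    path = Path("/sys/bus/pci/devices") / address / "config"
    # Only the start of the file is needed for these four fields.
    return parse_pci_header(path.read_bytes()[:64])
```

Reading like this is harmless; it is writes through the same kind of interface that the access control and auditing described above would need to cover.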

If you do want lockdown for defense in depth, as it happens, there's support for IMA. As the manpage notes, under lockdown, "Only validly signed modules may be loaded (waived if the module file being loaded is vouched for by IMA appraisal)." You could imagine building a similar mechanism where you can sign shell scripts that can access resources that lockdown otherwise blocks. You could even use this mechanism to get a secured interactive shell (e.g., you type a command line, a coworker clicks "seems fine," the infrastructure turns it into a signed script/binary, runs it on the machine, and sends the output to the two of you). Extending the "unless IMA" special case in lockdown seems more productive than building complex driver interfaces that are compliant with the strict lockdown rules.
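The "coworker clicks seems fine" flow can be sketched in a few lines of Python. This is purely a hypothetical illustration: the function names are invented, nothing here touches IMA or lockdown, and HMAC-SHA256 stands in for a real signature only to keep the example dependency-free. A real deployment would want asymmetric signatures so that production machines hold just a public key, in line with the key-management point earlier.

```python
import hashlib
import hmac


def _message(command: str, author: str, approver: str) -> bytes:
    """Bind the command text to the two people involved."""
    return "\0".join([author, approver, command]).encode()


def sign_command(key: bytes, command: str, author: str, approver: str) -> str:
    """Infrastructure side: sign only if a second person approved."""
    if author == approver:
        raise ValueError("approver must differ from author")
    return hmac.new(key, _message(command, author, approver), hashlib.sha256).hexdigest()


def verify_command(key: bytes, command: str, author: str, approver: str, tag: str) -> bool:
    """Target side: accept only the exact command that was approved."""
    if author == approver:
        return False
    expected = hmac.new(key, _message(command, author, approver), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the tag matched.
    return hmac.compare_digest(expected, tag)
```

Because the tag covers the exact command text, any edit made after approval fails verification on the target machine.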

(For context, I've spent the last almost decade of my professional life working on large-company deployments that use cards from either Mellanox or their competitors and make good use of their fancy features, but I can't speak to deploying either secure boot or IMA at scale, and I think that's because IMA has very little mindshare but secure boot genuinely is the wrong fit.)

The intersection of mlx5, netdev, and lockdown

yaneti — Tue, 19 Dec 2023 13:08:41 +0000

Somewhat similar conundrum with the DASH support for Realtek NICs, where the OOB DASH block is separate and different enough from a NIC to have its own firmware, and needs to be talked to by the OS to fully operate.

https://lore.kernel.org/netdev/20211129101315.16372-381-n...

Not helped by Realtek not publishing any datasheets for it.

The intersection of mlx5, netdev, and lockdown

atnot — Tue, 19 Dec 2023 08:51:14 +0000

I think the worry is that the following happens:

1. The ctl driver gets added mostly for debugging
2. Some new super complex networking feature gets added, at first to the ctl driver only because it's easier
3. It's a complex feature so Nvidia Mellanox releases a library to work with it. It's possibly a big userspace blob. It contains at least one fork of LLVM.
4. The kernel never gains any native feature for it because it's not in the vendor's interest anymore

And voilà, you've recreated the CUDA situation for networking.

The intersection of mlx5, netdev, and lockdown

marcH — Tue, 19 Dec 2023 06:46:00 +0000

> That allows the same tools to be used for hardware from multiple vendors, making life easier for both developers and users. Since there is no common API for mlx5-specific parameters, none of those tools will work with mlx5ctl.

Is it not possible for the mlx5 kernel code to offer BOTH interfaces? So:

1. all the common and usual networking features work using the usual configuration tools
2. unusual mlx5 features require a custom driver and unusual APIs
3. overlap between 1. and 2. does not matter as long as 1. is fully implemented, compliant etc.

I didn't follow the links, sorry.

The intersection of mlx5, netdev, and lockdown

npws — Mon, 18 Dec 2023 21:16:38 +0000

Reading through the entire discussion, it does appear that the mlx5 guys have a point: the device supports a vast number of different subsystems, and the interfaces they want to introduce are meant to serve all of these subsystems, not just networking. While I definitely agree that common *networking* functions should use proper APIs and not vendor-specific add-ons, Jakub comes across as rather unreasonable and arrogant by insisting that all other subsystems get no say as long as it *also* affects networking.

The intersection of mlx5, netdev, and lockdown

Nahor — Mon, 18 Dec 2023 17:37:58 +0000

Easy solution: io-uring eBPF! The new 42.

The intersection of mlx5, netdev, and lockdown

sima — Mon, 18 Dec 2023 16:01:11 +0000

My experience from both the drivers/accel discussion, and another discussion that's still mostly private, is that the collaboration benefits everyone. The more private discussion is about someone proposing to merge a new type of device into drm, a move I expect will greatly annoy people from another subsystem, and so the other way around from drivers/accel. And most of what I do is talk with people from that other subsystem to figure out if and under what conditions they'd ack this, because I do not want to miss out on that expertise and experience, even if I don't (yet) understand the reasons behind it.

More on topic, I think in drm we'll go with devlink for these management/debug/observability features too; it looks like the place where we can tap into the most experience. Maybe except for specific things where there are already other solutions, like devcoredump. And going with devlink should help for some of the datacenter GPUs, which are glued together with infiniband/networking and a lot of other interesting devices.