
The first Linaro Forum for Arm Linux kernel topics

April 9, 2024

This article was contributed by Tom Gall, Bill Fletcher, and Arnd Bergmann

On February 20, Linaro held the initial get-together for what is intended to be a regular Linux Kernel Forum for the Arm-focused kernel community. The plan is to convene a few weeks before each merge window opens, and thus before the release of the kernel version currently under development. Topics covered in the first gathering included preparing 64-bit Arm kernels for low-end embedded systems, memory errors and Compute Express Link (CXL), plans for fw_devlink, and system pressure on the scheduler.

The forum generally follows a "show and tell" format for people to share their plans, bring up issues of the day, and advance ideas for discussion for upcoming versions of the kernel. Ideally, this will help advance coordination, find developers with common interests, and encourage participation. The meetings are public and they are recorded; these notes are meant to give a sense of what occurred.

The meeting agenda, the link to the Zoom meeting, and links to the recording can be found on this page.

The meeting is focused on helping the Arm kernel community solve Linux upstreaming issues for the ecosystem. It's understood that this needs to be done with strong community buy-in. Curation and having the right people in the room are important — specifically to avoid any hint of "Ask Maintainers Anything" sessions. The goal is not to ask for a patch set to be tagged or queued, but to share with others and the community what people want to focus on for the next cycle.

Getting Arm64 kernels ready for low-end embedded

Led by Arnd Bergmann

Bergmann asked which kernel changes will be needed now that Arm64 is starting to displace 32-bit Armv7 CPUs in low-end, embedded systems. These are typically mass-market devices, such as camera SoCs, that generally have less than 128MB of RAM and have previously been the realm of Armv7. With 32-bit systems, there was little need to optimize the kernel at all; now the immediate need is to shrink the kernel and eliminate overhead. A separate need for footprint optimization is being driven by embedded systems that run multiple virtual machines; in that case, the memory consumption of each individual system matters a lot more. How small can the kernel and system be made?

Bergmann covered a few options, along with initial results, starting with turning off features while still being able to boot a Debian image and have a normal filesystem and network connectivity; that got down to 7MB for a 64-bit kernel. A 40MB complete system seems achievable. The focus then shifts to shrinking user space. A 32-bit user space saves much more memory than anything that could be done in the kernel, but the factor-of-two overhead for kernel code, data, and heap remains compared to a 32-bit kernel.

Potential overhead reductions that were explored included:

  • Disabling CONFIG_SMP, which saves 1-3MB of RAM if only a single core is enabled.
  • Execute-In-Place (XIP) can save RAM by putting all the text and read-only data into flash (saving a substantial 5MB in an example case).
  • XIP has an issue with run-time patching. Modern CPU features are detected (and the loaded kernel patched accordingly) at run time, but we've reached a point where we should reconsider this policy in order to build a smaller kernel that only runs on newer CPUs.
  • A patch for dead-code elimination (CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) for 32-bit CPUs didn't make a huge difference, but there could be more potential there, and there are very few downsides. A configuration fragment combining these options is sketched below.
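Taken together, these options suggest a configuration fragment along the following lines. This is a hedged sketch rather than a tested configuration: CONFIG_XIP_KERNEL and CONFIG_XIP_PHYS_ADDR exist for 32-bit Arm today, and extending XIP to Arm64 was part of the discussion rather than an existing feature.

    # Untested sketch of a size-focused configuration fragment
    # CONFIG_SMP is not set
    CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y
    # XIP options exist for 32-bit Arm; Arm64 support is the open question
    CONFIG_XIP_KERNEL=y
    CONFIG_XIP_PHYS_ADDR=0x08000000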

Memory errors and Compute Express Link (CXL)

Led by Jonathan Cameron

ACPI/APEI: Sending the correct SIGBUS si_code: Cameron raised an issue with memory-error reporting. A memory error generates an ACPI event; if the reported error is in a user-space process, the expectation is that the affected process will be killed. Unfortunately, that doesn't currently work, because the ACPI event for memory errors reports the wrong type of error; as it stands, a detected memory failure in a user-space process does not result in that process being killed. This needs to be cleaned up.
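For context, user space sees such errors as a SIGBUS signal whose si_code distinguishes an "action required" fault in the consuming process (BUS_MCEERR_AR) from an "action optional" advance notification (BUS_MCEERR_AO); both constants are Linux-specific. A minimal sketch of a handler that makes the distinction:

    #define _GNU_SOURCE     /* for BUS_MCEERR_AR / BUS_MCEERR_AO */
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* Minimal sketch: report which kind of memory-error SIGBUS arrived.
     * BUS_MCEERR_AR: the faulting access cannot be retried;
     * BUS_MCEERR_AO: advance notification of a corrupted page. */
    static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
    {
            if (info->si_code == BUS_MCEERR_AR)
                    write(2, "memory error, action required\n", 30);
            else if (info->si_code == BUS_MCEERR_AO)
                    write(2, "memory error, action optional\n", 30);
            _exit(1);
    }

    int main(void)
    {
            struct sigaction sa;

            memset(&sa, 0, sizeof(sa));
            sa.sa_sigaction = sigbus_handler;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGBUS, &sa, NULL);
            pause();        /* wait here until a signal arrives */
            return 0;
    }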

Memory scrub control for differentiated reliability: Also related to memory-error reporting, memory scrubbing is the periodic checking of ECC memory for errors; its purpose is to keep ECC-correctable errors from turning into uncorrectable ones. Scrubbing is generally performed autonomously by the memory controller and may even be performed by the DIMM itself (Error Check and Scrub (ECS) is implemented in DDR5). Errors, correctable or not, are reported.

Cameron raised the need for user-space control of scrubbing in the CXL case, since there is no way for firmware to configure it for hot-plugged devices. An RFC, currently at v6, proposes a scrub-control system with an ABI. Open questions remain: who else cares? How general can the interface be made, and which market segments beyond servers might care? Is the proposed ABI the right way to go?
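Purely as an illustration of what user-space scrub control could look like, here is a sketch that writes a scrub-cycle target to a sysfs attribute; the path and attribute name are invented for this example and are not the ABI proposed in the RFC.

    #include <stdio.h>

    int main(void)
    {
            /* hypothetical attribute: desired scrub-cycle length, in hours */
            FILE *f = fopen("/sys/class/ras/scrub0/cycle_hours", "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            fprintf(f, "24\n");     /* one full scrub pass per day */
            fclose(f);
            return 0;
    }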

NUMA domains post boot: There is an assumption on Arm64 today that all memory hotplugging happens in known NUMA nodes. x86 makes no such assumption, but it has an architecture-specific solution for associating memory-hotplug events with NUMA nodes. CXL memory is hot-pluggable, and there is an underlying assumption that a NUMA node is declared for each CXL Fixed Memory Window Structure (CFMWS) memory window; the kernel needs to know which of those NUMA nodes is associated with a given hotplug event. Currently, nobody wants to expend the (significant) effort that would be required to support dynamic NUMA-node creation. As a work-around, the kernel could re-parse the CXL Early Discovery Table (CEDT) and do a lookup. Is this too much of a hack?
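A rough sketch of that work-around, assuming the CEDT has been re-parsed into a walkable list: for_each_cfmws() is a hypothetical iterator, while struct acpi_cedt_cfmws and phys_to_target_node() exist in the kernel today.

    #include <linux/acpi.h>
    #include <linux/numa.h>

    /* Hedged sketch: map a hot-added physical address to the NUMA node
     * created at boot for the enclosing CFMWS window. */
    static int cfmws_addr_to_nid(u64 addr)
    {
            struct acpi_cedt_cfmws *cfmws;

            for_each_cfmws(cfmws) { /* hypothetical CEDT iterator */
                    if (addr >= cfmws->base_hpa &&
                        addr < cfmws->base_hpa + cfmws->window_size)
                            return phys_to_target_node(cfmws->base_hpa);
            }
            return NUMA_NO_NODE;
    }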

Config options with a big blast radius: CXL has kernel features with user-space interfaces that, if misused, could potentially take out the rack or even a data center by, for example, removing everyone's access to memory. There will be security measures in place to mitigate this but a security bug could still risk exposing a way of taking down a rack. There has therefore been pushback from the community on accepting these features.

Where these interfaces are required, they should only be wired to a baseboard management controller (BMC) or fabric manager. The issue also applies to the Management Component Transport Protocol (MCTP) stack. How can people be prevented from shooting themselves in the foot with the option to configure these features on general-purpose systems? Options include gating all of the features behind a CONFIG_(I_AM_A_)BMC option, or kernel tainting, which is already used elsewhere in CXL. Does anyone else have any ideas?
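The tainting precedent is concrete: the CXL raw mailbox-command path already taints the kernel so that subsequent bug reports show that an unvalidated interface was used. A sketch of the same pattern applied to another dangerous command path; the surrounding structure and the cmd_is_dangerous() predicate are invented for illustration.

    /* Sketch: taint the kernel when a potentially rack-affecting
     * interface is exercised, as the CXL raw-command path does today. */
    if (cmd_is_dangerous(cmd)) {    /* hypothetical predicate */
            dev_warn_once(dev, "dangerous fabric-management command used\n");
            add_taint(TAINT_RAW_PASSTHROUGH, LOCKDEP_STILL_OK);
    }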

Current plans for fw_devlink

Led by Saravana Kannan

There are a bunch of TODO items that Kannan is planning to work on, including adding a post-init-suppliers property to the devicetree schema. A device link (devlink) should guarantee correct suspend/resume and shutdown ordering between a "supplier" device and its "consumer" devices: the consumer devices are not probed before the supplier is bound to a driver, and they are unbound before the supplier is unbound.
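These guarantees are built on ordinary driver-core device links, which fw_devlink creates automatically from firmware-described dependencies; a driver can also create one by hand with device_link_add(). A brief sketch, with the wrapper function invented for illustration:

    #include <linux/device.h>

    /* Sketch: order probe, suspend/resume, and shutdown between a
     * supplier and a consumer; fw_devlink creates equivalent links
     * automatically from devicetree properties. */
    static int link_to_supplier(struct device *consumer,
                                struct device *supplier)
    {
            struct device_link *link;

            link = device_link_add(consumer, supplier,
                                   DL_FLAG_PM_RUNTIME |
                                   DL_FLAG_AUTOREMOVE_CONSUMER);
            return link ? 0 : -ENODEV;
    }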

There are, unfortunately, cyclic dependencies between suppliers and consumers in the devicetree; the post-init-suppliers property provides a way to encode additional information in the firmware so that fw_devlink (or any kernel consuming the devicetree) can break the cycle. The aim is deterministic probe and suspend/resume ordering, with an end goal of better stability and reliability and improved run-time power management. Also, if a supplier is forcefully unbound, its consumers don't necessarily get cleaned up correctly unless the consumer is a bus device with a driver; corner cases like this will also get fixed during this work.

Another objective is adding devlink support for "class" devices. Devlink is a driver-core feature whereby devices can say "don't probe me until my supplier finishes probing". Currently, class devices are not probed at all; they are simply added. Yet the framework allows class devices to be registered as suppliers, which leaves scope for potentially weird probing behavior. Some of the nuances relate to, for example, what it means for a class device to be "ready".

Finally, there is clock-framework sync-state support. This was talked about at the 2023 Linux Plumbers Conference. There was some agreement to clean up Kannan's patch series from more than a year ago and address the gaps that were pointed out. He expects to send it out as an RFC or updated patch set.
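For reference, sync_state is an existing driver-core hook: the core calls a supplier driver's callback once all of that supplier's known consumers have probed, at which point boot-time state (clocks left running by firmware, for example) can safely be cleaned up. A minimal sketch of a driver wiring it up; the foo_ names are placeholders.

    #include <linux/platform_device.h>

    /* Sketch: called by the driver core once all consumers have probed;
     * a clock provider could disable unused boot-time clocks here. */
    static void foo_sync_state(struct device *dev)
    {
            /* release resources kept only for not-yet-probed consumers */
    }

    static struct platform_driver foo_driver = {
            .driver = {
                    .name = "foo",
                    .sync_state = foo_sync_state,
            },
    };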

System pressure on the scheduler

Led by Vincent Guittot

This work has been presented at OSPM and LPC. It aims to consolidate the scheduler's view of each CPU's compute capacity and how that capacity maps to actual CPU frequencies.

The arch_scale_*() functions are per-architecture hooks that let architecture-specific code report particular behavior to the scheduler. As an example, arch_scale_cpu_capacity() reports the maximum compute capacity of a CPU. The default function returns the same value (1024) for all CPUs, which is fine for SMP systems, but an architecture can provide its own version when CPUs have different compute capacities, as in big.LITTLE or other heterogeneous systems. Similarly, arch_scale_freq_capacity() reports the current compute capacity of a CPU according to its current frequency. With arch_scale_cpu_capacity() and arch_scale_freq_capacity(), the scheduler knows the compute capacity of a CPU and can compare it with that of others when selecting a CPU for a task. Nevertheless, inconsistencies can appear in some configurations when the maximum frequency of a core changes during boot or at run time.
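The 1024 default comes from the generic fallback; this is approximately the definition found in include/linux/sched/topology.h when an architecture does not provide its own:

    /* Approximate form of the generic fallback: every CPU reports the
     * full capacity scale (SCHED_CAPACITY_SCALE == 1024) unless the
     * architecture overrides arch_scale_cpu_capacity(). */
    #ifndef arch_scale_cpu_capacity
    static __always_inline
    unsigned long arch_scale_cpu_capacity(int cpu)
    {
            return SCHED_CAPACITY_SCALE;
    }
    #endif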

The work of consolidating all of the arch_scale_*() functions has been split into three parts. The first part, covering arch_scale_cpu_capacity() and arch_scale_freq_ref(), was merged for 6.8; it triggered a few regressions, but those have all been fixed. The second part introduces arch_scale_hw_pressure() and cpufreq_get_pressure(); it is under discussion on the mailing list, with a new version (v6) posted.

The next task will be the new arch_scale_cpu_capped() function, targeted at 6.10. When user space has capped the maximum frequency of a CPU, the scheduler should take that as the CPU's new maximum capacity instead of relying on estimates or best-effort decisions. Ultimately, the aim is for the scheduler to properly take user-space capping into account. Guittot would be interested in any feedback on this.

Participating in future sessions

The next forum will be on April 30th, just prior to the 6.10 kernel merge window opening. If you'd like an email notification, contact tom.gall@linaro.org. A calendar invite is available, along with a reminder a week prior. If you have something for the agenda, please add it to the shared document.

