Leading items
Welcome to the LWN.net Weekly Edition for July 3, 2025
This edition contains the following feature content:
- Accessing new kernel features from Python: how to get at the latest kernel goodness, even if the surrounding software has not yet caught up.
- Fedora's i686 support gets a reprieve: a plan to remove support for 32-bit x86 code on 64-bit systems is pushed back in the Fedora community.
- Supporting kernel development with large language models: an Open Source Summit presentation on how large language models can be put to work to help kernel developers get their job done.
- How to write Rust in the kernel: part 2: a comparison between the C and Rust forms of a device driver.
- Improved load balancing with machine learning: enabling the scheduler to learn how to optimally schedule diverse workloads.
- Yet another way to configure transparent huge pages: the elusive search for a consensus on the best way to control the creation and use of transparent huge pages.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Accessing new kernel features from Python
Every release of the Linux kernel has lots of new features, many of which are accessible from user space. Usually, though, the GNU C Library (glibc) and tools that access the Linux user-space API lag behind the kernel releases. Geoffrey Thomas showed how Python programs can access these new kernel features as soon as the kernel is released in his "What's New in the Linux Kernel... from Python" talk at PyCon US 2025. While he had two examples of accessing new kernel features, the real goal of the talk was to demonstrate how to go about connecting Python to the Linux kernel.
He began by noting that the kernel and its interfaces are written in C, so there would be a "tiny bit" of C in the talk. He would be explaining any of that, so "as long as you can read Python, you'll be okay". In addition, all of the code, his slides, and more are available from his GitHub repository for the talk. Since the presentation, the YouTube video of it has been released as well.
Creating files
So, he asked, "what are these new kernel features and why would you even
want to use them?
" Opening a file in Python looks something like this:
open("file.txt", "w")That will open a file named "file.txt" for writing; if the file exists, it will get overwritten, but if it does not, it will be created. It turns out that the systemd developers needed to figure out which of those two things happened, so they added a way to find out to the kernel. He pointed to the first 6.12 merge window article here at LWN as the place where he found out about the change.
[Geoffrey Thomas]
The commit spells out the change, which added a new fcntl() command called F_CREATED_QUERY that programs can use to find out if the file was created when it was opened or not. fcntl() is a Linux system call, which is a way for an unprivileged user-space program to request something from the kernel; it is, effectively, a function call, he said, but also kind of like making an API call to a server. As with network-server APIs, languages like Python often have wrappers around system calls. The fcntl module in the Python standard library serves that purpose for fcntl().
But, when looking at the Python documentation, there is no mention of the new command. The biggest reason for that is that the command is new, but it also has not really been documented anywhere other than in the kernel source. He sympathizes with the kernel developers, since he also skips documenting new features at times, but it does mean that those who want to use the feature need to dig for information on it.
"Being able to search source code and figure out just enough of a large
project to track down the info you need and not get lost is a valuable
skill
". He listed a few code-search options, starting with GitHub code
search, which searches its public repositories, though you need to be logged into the site to access it. Sourcegraph search searches
GitHub, GitLab, and a few other sites. Debian code search will look in
anything that is part of the Debian distribution, including projects that
are not in Git at all. And, of course, the "git grep" command is
invaluable if you are searching in a local Git clone. One that he did not
mention, but that is quite useful for searching the kernel (and its various
versions) is Elixir.
Since he has a clone of the Linux kernel repository, he used "git grep" to look for F_CREATED_QUERY. It appears in two locations, fs/fcntl.c, which looks like the implementation, and include/uapi/linux/fcntl.h, which is a definition in the user-space API (uapi) of the kernel. The latter file contains:
    #define F_CREATED_QUERY (F_LINUX_SPECIFIC_BASE + 4)

Another "git grep" shows that F_LINUX_SPECIFIC_BASE is 1024, so the value of the constant for the new command is 1028. Thomas did note that the new command shows up in a few spots in the kernel's tools directory, which can be ignored for the purposes of this exercise, though that can sometimes be useful for examples and tests using a feature.
He then turned to the do_fcntl() function in fcntl.c. That function matches fairly closely with the fcntl.fcntl() described in the Python documentation, so he tried using it in the Python REPL:
    >>> F_CREATED_QUERY = 1028
    >>> import fcntl
    >>> a = open("foo.txt", "w")
    >>> fcntl.fcntl(a, F_CREATED_QUERY)
    1
    >>> a.close()
    >>> b = open("foo.txt", "w")
    >>> fcntl.fcntl(b, F_CREATED_QUERY)
    0

As can be seen, the file was created with the first open, and not on the second. The code could then be wrapped up into a tiny library as follows:
    import fcntl

    F_CREATED_QUERY = 1028

    def was_created(file):
        return bool(fcntl.fcntl(file, F_CREATED_QUERY))

Even though it is only four lines long, that library does something that nothing else in the Python API can do. It might even be worth uploading to the Python Package Index (PyPI), he said; beyond its immediate utility, it could perhaps serve as an example for someone trying to support some other new fcntl() command in the future.
Memory maps
A longstanding feature of Unix-like operating systems is to try to treat everything as a file; the tools and interfaces for working with files already exist, which makes it easy to access new features that way. Linux has long had the /proc filesystem, which has files and directories but is not anything that is stored on disk. Instead, it provides textual information about processes and other kernel objects, which Python programmers can access using the standard mechanisms for reading files. Every process in the system has a directory in /proc with its process ID as its name; various files in that directory give information about the process. For example:
    $ ls -l /proc/4067/exe
    [...] /proc/4067/exe -> /usr/bin/bash

Various utilities, such as ps and lsof, use information from /proc.
The /proc/self directory refers to the directory for the current process, so one can look at the memory mappings for their running Python with the following:
>>> print(open("/proc/self/maps").read()) 00400000-009d8000 r-xp 00000000 00:20 3872 \ /usr/bin/python3.13 [...] f08ceae80000-f08ceaeba000 r-xp 00000000 00:20 8872 \ /usr/lib/aarch64-linux-gnu/libncursesw.so.6.5 [...]That produces voluminous output about all of the memory mappings, which includes things like the Python executable and all of the libraries used by it. One way to (mostly) just look at the filenames that appear at the end of the line would be:
    >>> libs = {l.strip().split()[-1] for l in open("/proc/self/maps")}
    >>> print("\n".join(libs))

That splits each line into fields and puts the last one into a set using a set comprehension; it then prints each set element on its own line. Thomas said he uses some variant of that frequently in order to check library versions. Its output might look something like this:
    [...]
    /usr/lib/aarch64-linux-gnu/libexpat.so.1.10.0
    /usr/lib/aarch64-linux-gnu/libz.so.1.3.1
    [heap]
    /usr/bin/python3.13
    /usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.34
    [...]
But that code is a bit of a hack; it shows things that are not files (e.g. "[heap]") and will not work correctly if there are spaces in the filename, for example. In general, it seems somewhat circuitous for the kernel to have to turn its internal data structures into textual output that the Python code then needs to parse in some fashion to extract what it wants.
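For illustration, here is one way to cope with embedded spaces; this is a minimal sketch of my own, not code from the talk. The pathname is the sixth whitespace-separated field of each line, so splitting at most five times keeps any spaces the pathname contains, and anonymous mappings (which have no sixth field) are skipped. Pseudo-entries like "[heap]" still get through, which is part of why parsing the text remains a hack:

    # Sketch: parse /proc/self/maps a bit more carefully. The pathname
    # is the sixth field and may itself contain spaces, so split at
    # most five times; lines without a sixth field are anonymous
    # mappings and are skipped.
    libs = set()
    for line in open("/proc/self/maps"):
        fields = line.rstrip("\n").split(maxsplit=5)
        if len(fields) == 6:
            libs.add(fields[5])
    print("\n".join(sorted(libs)))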
It turns out that some high-performance tracing programs for Linux needed a way to get that information directly, so a new interface was added for the 6.11 kernel; it uses a new PROCMAP_QUERY ioctl() command on a file descriptor for the /proc/PID/maps file. It provides mechanisms to filter the output in various ways and get the information in a binary format. "So the motivation was not Python users, but that doesn't mean we can't use it from Python."
In a bit of an aside, he noted that C is sort of the "lingua franca" of computing, but that Linux (and other operating systems) could have been written in other languages. At PyCon 2018 in Cleveland, he and Alex Gaynor wondered how hard it would be to get Rust code working with the Linux kernel, so they began that effort at the sprints after the conference. That is where Rust in the kernel got its start. He strongly recommended attending the sprints, even for non-Python projects like the one they worked on.
Complications
The new PROCMAP_QUERY is a "much more complicated interface" than his other example; it starts with the struct procmap_query that is used with the new ioctl() command. He showed the lengthy structure on a slide without all of its comments so that it fit; he noted that C structures are akin to Python data classes.
One of the reasons C is used for system-level programs, though, is that its data types are simple, Thomas said. It does not have classes, dictionaries, inheritance, and so on like Python does. "C has very few data types", just numbers, integer and floating point, pointers (addresses, which are ultimately numbers), structures, and not much more. Its arrays are fixed length and its strings are simply arrays of bytes. For the most part, C completely specifies the format of its data, down to the level of bits and bytes; a __u32, for example, is a 32-bit unsigned integer, and it will take up exactly four bytes in the structure.
Turning that into a data class is fairly straightforward, but all of the types will simply be int, because Python does not care about the number of bits in a value of that sort. There will need to be a way to convert between the two representations, but the difference makes it clear that the data-class representation "doesn't tell you anything about what the bits and bytes are".
    @dataclasses.dataclass
    class ProcmapQuery:
        size: int
        query_flags: int
        query_addr: int
        vma_start: int
        vma_end: int
        vma_flags: int
        [...]
The Python representation is great for doing calculations and not having to worry about the number of bytes involved, but it does not work so well when you need to store the data in an interoperable way. That's where C's rigid specification helps. The Python struct module can convert between Python types and the bits and bytes needed by C. So, using struct, he changed the ProcmapQuery class to automatically generate the size field:
    @dataclasses.dataclass
    class ProcmapQuery:
        _STRUCT = struct.Struct("@9L4I2L")
        size: int = _STRUCT.size
        [...]
The format string ("@9L4I2L") specifies nine 64-bit values, then four 32-bit values, ending with two 64-bit fields, all of which are encoded in the native byte order ("@"). The size field is used to allow the kernel to extend the API by adding more entries to the structure, thus increasing its size.
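As a quick sanity check (an aside of mine, not from the talk), the struct module can report the total size that the format string describes; on a 64-bit Linux system it comes to 104 bytes, matching the C structure:

    import struct

    # Nine 8-byte fields, four 4-byte fields, and two more 8-byte
    # fields: 72 + 16 + 16 = 104 bytes, with no padding needed.
    print(struct.Struct("@9L4I2L").size)   # 104 on 64-bit Linux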
The structure for the query contains parameters that are passed into the call, others that are returned from the call, and two fields that have both roles. The idea is that the programmer fills in the input fields and the kernel responds by filling in the output fields. Given that he wants to extract the names of the files from the query, he focused on two pairs of fields in struct procmap_query:
    __u32 vma_name_size;  /* in/out */
    __u32 build_id_size;  /* in/out */
    __u64 vma_name_addr;  /* in */
    __u64 build_id_addr;  /* in */

Those fields highlight some big differences between C and Python.
In both cases, for the virtual memory area (VMA) name and the build ID (which he did not use), the kernel is being passed the address of a buffer in the *_addr field and the length of the buffer in *_size. The size fields are both input and output parameters because they tell the kernel how many bytes are available in the buffer and the kernel returns how many of those bytes it filled in. If the returned size is the same as the passed-in size, Thomas said, the code should probably retry the query with a larger buffer, but he planned to leave that as an enhancement that could be added.
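A minimal sketch of that retry enhancement (my illustration, not code from the talk) could look like this; it assumes the ProcmapQuery.ioctl() helper and the open proc_maps file descriptor that appear in the full program below:

    import ctypes, os

    # Hypothetical retry loop: if the kernel filled the whole buffer,
    # the name may have been truncated, so double the buffer and retry.
    size = 1024
    while True:
        buf = ctypes.create_string_buffer(size)
        query.vma_name_addr = ctypes.addressof(buf)
        query.vma_name_size = size
        response = query.ioctl(proc_maps)
        if response.vma_name_size < size:
            break       # the whole name fit
        size *= 2       # possibly truncated; try again
    name = os.fsdecode(buf[:response.vma_name_size])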
The query_flags field uses values from the enum procmap_query_flags, most of which are uninteresting for getting a list of the file mappings. But two are of use: PROCMAP_QUERY_FILE_BACKED_VMA asks for VMAs that are backed by a file, while PROCMAP_QUERY_COVERING_OR_NEXT_VMA allows code to step through all of the VMAs in a process by asking for the next VMA starting at address zero.
He then showed a few member functions added to the ProcmapQuery class:
    def pack(self) -> bytes:
        return self._STRUCT.pack(*dataclasses.astuple(self))

    @classmethod
    def unpack(cls, packed: bytes) -> Self:
        return cls(*cls._STRUCT.unpack(packed))

    def ioctl(self, fd: int) -> Self:
        return self.unpack(fcntl.ioctl(fd, PROCMAP_QUERY, self.pack()))

The first two simply convert the data class from its Python form to the binary form expected by C—or the reverse. For pack(), which converts to the C representation, the pack() member function of the struct.Struct class expects a tuple, which dataclasses.astuple() creates. unpack() turns the C structure into a ProcmapQuery object by using the @classmethod decorator and the struct.Struct unpack() method. The ioctl() function ties it all together by packing up the data class, passing it to the ioctl() system call, and unpacking the result.
PROCMAP_QUERY
One important thing that is needed, though, is a value for the PROCMAP_QUERY constant, which is more difficult to determine than F_CREATED_QUERY was, he said. It is a #define in fs.h to:
    _IOWR(PROCFS_IOCTL_MAGIC, 17, struct procmap_query)

That definition creates a constant value that encodes the type of the file the command should operate on along with other information that the ioctl() machinery uses to try to fend off invalid calls. It is not at all straightforward to work all of that out and turn it into an integer, especially within the time constraints of his talk, so he simply took the path of least resistance and wrote a little C program to print it out. With the appropriate C header files, a simple printf() of PROCMAP_QUERY gives the magic integer needed: 3228067345. That number may be architecture-specific, he cautioned, so production code would need a better mechanism to obtain it, possibly following the lead of his procmapquery-cffi example.
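The encoding can also be reproduced in Python; this is a sketch of my own, not from the talk, and it hard-codes the _IOC bit layout used on x86-64 and arm64, so it carries the same architecture-specific caveat. PROCFS_IOCTL_MAGIC is the character 'f', the command number is 17, and the argument size is the 104 bytes computed above:

    import struct

    # _IOC bit layout (x86-64/arm64; some architectures differ):
    # bits 0-7 command number, 8-15 magic type, 16-29 argument size,
    # 30-31 direction (read|write for _IOWR).
    _IOC_WRITE, _IOC_READ = 1, 2

    def _IOWR(magic, nr, size):
        return ((_IOC_READ | _IOC_WRITE) << 30) | (size << 16) \
            | (ord(magic) << 8) | nr

    PROCMAP_QUERY = _IOWR('f', 17, struct.Struct("@9L4I2L").size)
    print(PROCMAP_QUERY)   # 3228067345, matching the C program's output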
The last thing needed before this can all be put together into a program, Thomas said, is to figure out how to get a buffer to pass in. There are a number of different ways to get a buffer in Python, but some of them do not provide a mechanism to get its address; an address is generally not useful in Python, but one is needed here. For this, he is using the ctypes module for its create_string_buffer() and addressof() functions:
    >>> buf = ctypes.create_string_buffer(1024)
    >>> ctypes.addressof(buf)
    4806710272

With all of that in hand, he put up the 15 lines of Python code that constitute the guts of his "what libraries are being used?" tool:
    libraries = set()
    query = ProcmapQuery(
        query_flags=PROCMAP_QUERY_COVERING_OR_NEXT_VMA
            | PROCMAP_QUERY_FILE_BACKED_VMA,
        query_addr=0,
        vma_name_addr=ctypes.addressof(buf),
        vma_name_size=1024,
    )
    while True:
        try:
            response = query.ioctl(proc_maps)
        except FileNotFoundError:
            break
        libraries.add(os.fsdecode(buf[:response.vma_name_size]))
        query.query_addr = response.vma_end

It creates a set to hold the libraries, sets up the query object, and then loops over the addresses, adding filenames from the response to the set (using os.fsdecode() for the filename), and then using the vma_end address returned as the next address to query. The standard FileNotFoundError exception is what fcntl.ioctl() raises when there are no more VMAs to process, which he found by trial and error. The procmapquery directory in his repository contains the code, which has a get_libraries() function to return the set of libraries for a file descriptor or that can be run from the command line to print the libraries from /proc/self/maps.
Some wrap-up
There were two web sites that he recommended using to find out about new kernel features. As readers presumably already know, LWN.net is a great place to discover new kernel features in the merge-window summaries, feature articles, and so on. While Thomas is "a happy subscriber", he noted that the paywall only lasts a week or so; all of the content he mentioned in the talk was freely available to anyone.
Another site he pointed to was KernelNewbies, which has "really detailed changelogs of just about everything that goes into the kernel" (example). That summary will have links to articles and blog posts about those features, as well as links to the commits.
He also had suggestions on some areas to explore for those who are interested in digging in further, including investigating the ctypes library he used and the C Foreign Function Interface (CFFI) module that he mentioned for retrieving the PROCMAP_QUERY constant. Both of those are useful for connecting Python to C programs, including the kernel. Cython is a Python-like language that compiles to C, which is another avenue for exploration. Writing Python extensions in C will give even more power to Python programs; various modules that he used, such as fcntl and os, were written that way. "If you need that much control and you don't want to write C", he suggested investigating PyO3, which allows writing Python extensions in Rust.
In the brief Q&A, an audience member wondered about how programmers could be sure that the Python version of the C procmap_query structure was correct. Thomas said that could definitely be a problem, which might be hard to track down. His longer example using the CFFI module would actually properly ensure that, but it is more involved to use; CFFI is a third-party module and requires a C compiler to use. If he was working on a production version of the query, though, he said he would likely base it on the CFFI version.
[Thanks to the Linux Foundation for its travel sponsorship that allowed me to travel to Pittsburgh for PyCon US.]
Fedora's i686 support gets a reprieve
A change proposal to end support for 32-bit x86 (i686) applications on the x86_64 architecture with the Fedora 44 release has been withdrawn after significant pushback. As proposed, the change could have had a significant impact on gamers, compiler development, and the Bazzite project, which uses Fedora as a base for a gaming-focused distribution. While i686 gets a reprieve for now, the question still lingers: who is going to keep the necessary i686 packages in working order when few upstream maintainers or volunteer packagers care about the architecture?
Stop me if you think you've heard this one before
Some readers may have a sense of déjà vu when learning that Fedora is discussing dropping i686 support in 2025; didn't the project drop i686 support years ago? The answer to that is "yes, but not entirely".
Fedora kernel maintainer Justin Forbes proposed dropping i686 kernels for Fedora in 2017. That would have scuttled i686 kernels for the Fedora 27 release, but an x86 special-interest group (SIG) was formed to care for the architecture and given an opportunity to ensure its upkeep. After a small flurry of activity, that SIG went dormant; the i686-removal proposal was revived and approved for the Fedora 31 release in 2019. From that point, Fedora stopped producing installation images, kernels, and providing package repositories for i686.
However, the project has continued to build packages for i686, which are available in the x86_64 repositories for "multilib" support. This meant that users could no longer install Fedora on 32-bit x86 systems but could continue to run 32-bit applications on x86_64 systems. This is necessary for applications like proprietary games or Windows software running under Wine, where recompiling applications with 64-bit support is not an option.
With Fedora 37, the project allowed (and encouraged) package maintainers to stop building "leaf" packages for i686 without having to announce the decision or file a tracking bug. Leaf packages are those that no other packages depend on. If a maintainer owned an i686 package that was a dependency for other packages, they would still need to follow the procedures for breaking changes. The change proposal had a list of more than 230 packages that were still needed as runtime dependencies for some common multilib use cases—such as installing the Steam client for gaming or using Wine on x86_64. Note that it is not an exhaustive list of all i686 packages that might be necessary, just a baseline of packages that are runtime dependencies and are definitely not leaf packages.
Current proposal
On June 24, Aoife Moloney announced the change proposal to drop 32-bit multilib support and stop building packages for i686 on the fedora-devel mailing list. Note that the initial announcement to the list was mistakenly described as a change for Fedora 43, when it is actually planned for Fedora 44 or later. All of the owners of the proposal—Kevin Fenzi, Fabio Alessandro Locati, and Fabio Valentini—are members of the Fedora Engineering Steering Committee (FESCo). That committee, of course, is the body that will decide whether it is approved or not.
The goal of the proposal is to reduce the burden on package maintainers. It notes that the upstream projects for many packages have dropped support for building or running on 32-bit architectures, which requires additional work by maintainers to ensure those packages continue to build for i686.
The i686 package builds are also a burden for Fedora's release engineering team and infrastructure; dropping 32-bit libraries would let the team get rid of "brittle heuristics and rules" required to accommodate i686, as well as reduce the load on x86_64 build systems that have to cross-compile approximately 10,000 i686 packages.
Users will also see a small benefit to dropping i686. The x86_64 repository metadata will be smaller without the packages; in turn, that will speed up downloads and DNF dependency resolution.
There are three steps to the proposal; the first, and easiest to revert if needed, is to stop including i686 packages in the x86_64 repository. The second step, no longer building the packages for i686, is described as basically irreversible: "reverting the changes would require re-bootstrapping the architecture, which would be difficult to justify". Fedora releases are typically built by the previous release—new architectures have to be bootstrapped by cross-compiling enough packages to compile a full release for the platform. Finally, there will need to be a mechanism to remove i686 packages from Fedora systems on upgrade.
Reactions
Jakub Jelinek said that disabling i686 would be a problem for packaging compilers, such as GCC, which requires some i686 libraries to support generating code for 32-bit mode (-m32). The complete absence of i686 packages would also be felt by developers working on other compilers and toolchains; it could lead to users migrating away from Fedora. He suggested reducing the number of i686 packages being built, and doing away with documentation for i686 packages, so that it would no longer be necessary to build 32-bit TeX and other tools used to generate documentation. But, he said, "shrinking the set to zero will not serve the distro well".
Daniel P. Berrangé observed that Fedora only shipped a handful of the host architectures that are supported by GCC, GNU Binutils, and others, so this could not be a unique problem for i686. Wouldn't it be sufficient to ship a cross-compiler build for i686, as Fedora already does for other architectures it does not support?
Developing GCC for unsupported platforms on Fedora is "a lot of pain", Jelinek replied, and would certainly not be sufficient for i686-linux as one of the few primary architectures for GCC:
Several of us run daily bootstraps/regtests not just on x86_64-linux but also on i686-linux (currently Fedora with 64-bit kernel but i686.rpms around) to make sure the code is 32 vs. 64-bit clean, etc.). This just wouldn't be possible anymore or would be much harder otherwise.
Adam Williamson prodded Jelinek to explain why 32-bit development really mattered at this point. "I'm kinda expecting to be told 'yes, GCC still cares about 32-bit for <insert good reasons here>'. I just wanted to have that written down." Jelinek asserted 32-bit's importance, but Stephen Smoogen replied with a concrete example from his recent work with embedded hardware:
One thing that the last 4 years outside of Fedora has reinforced to me is how much of the computer world is NOT 64 bit. Most of the hardware in everything from the 900 mini computers in a standard car to the 10 or so in a washing machine are 16 bit and 32 bit ones. The average home computer is at least 4 32 bit cpus and a couple of 16 bit one's working behind the scenes of the 64 bit processor and GPU you do your work in. Intel is still used a lot in this 32 bit world and the compiler that is used is primarily gcc.
Since many Red Hat developers use Fedora to support that ecosystem, it means that build-chain problems on Fedora get more attention than other operating systems, he said. If those developers have to move to another operating system, it would mean less attention paid to Fedora's problems.
Most of the people participating in the discussion seemed to agree that the current i686 situation was untenable, but were not quite ready to fully rid the distribution of its i686 support. David Airlie replied that he would prefer to start with a list of packages that have i686 dependencies and drop the rest. He suggested that the packages that provide Mesa 3D-graphics support, and their dependencies, would be a good start. In a follow-up, he identified more than 150 packages that would need to be built in order to support Mesa.
Steam and Bazzite
A large chunk of the discussion focused on running Steam, which is not an open-source application, and Valve, the company that provides it, does not package it for Fedora. It can be installed from the RPM Fusion repository, however.
Berrangé said that whether Steam runs is not Fedora's concern. Its needs do not align with the project's mission and foundations, he said, and it should not have significant influence on the decision whether to drop i686 packages or not. Michal Schorm, for one, disagreed:
Fedora values ethical choices over easy ones. But that does not mean we shouldn't care and say "that's not our problem".
It is our problem. If we leave a significant portion of users without an upgrade path for their favourite software without a good justification, we will force them to leave and they will never come back.
He suggested discussing the situation with Valve and seeing what they might be able to do to keep Fedora's game-playing base alive. That may not be successful, he said, but it was a better justification than "we don't care".
When Canonical planned to drop i686 packages in 2019, it received enough community pushback that it revised its plan and builds "selected 32-bit i386 packages" for Ubuntu releases, as described on the Ubuntu wiki.
Valentini said that one overlooked option for running Steam on Fedora was to use the Flatpak, which already includes the necessary 32-bit libraries. He said that he had been using the Steam Flatpak for gaming for years and had not hit any Flatpak-specific issues. No matter what, though, he said that support for i686 would have to come to an end at some point:
Yes, some things will stop working. But I hope that we can provide solutions and / or workarounds for most use cases.
And it's better to start planning for the removal of i686 packages now than when (insert foundational package here - for example, CPython) stops supporting 32-bit architectures and we need to scramble to adapt.
Noel Miller said that the proposal would impact downstream projects like Bazzite, which is a Universal Blue project that builds image-based operating systems from Fedora packages for specific use cases. Bluefin, which LWN covered in 2023, is also a Universal Blue project.
Bazzite requires a native installation of Steam rather than a Flatpak so that the Gamescope Wayland compositor works properly. Ashe Hutchins explained that this is because many of Bazzite's users run the distribution on handheld gaming computers, such as the Steam Deck, and want to emulate Steam's gaming mode. This has been tried with Flatpak and Podman container solutions that could have replaced the native installation, but those attempts were met with insurmountable issues:
Gamescope is a microcompositor. The details are in the weeds but basically, it can serve as 3 types of compositor:

- Nested compositor - running on top of your current desktop environment.
- Embedded compositor - embedded in a specific application.
- Session compositor - running like a desktop environment, essentially.

The flatpak for gamescope can serve the first two. The last one, which is what console-like experiences like bazzite-deck provide, cannot.
I emailed Bazzite's lead developer, Kyle Gospodnetich, and asked if the Bazzite team had been approached before the proposal was announced, and what the project's plans would be if it were approved. They had not been approached ahead of time, he said, so "this moment now is our chance to communicate our needs and use case". He noted that some Fedora contributors had not realized that Flatpak would not work for Bazzite's use case.
Gospodnetich agreed that there is no reason for Fedora to be building as many 32-bit packages as it is, but there was still a need for some 32-bit packages for legacy compatibility. "There's 30 years of PC games that are never going to get updates and the features Bazzite offers using Steam don't work in a container". There is no contingency plan for Bazzite if Fedora decides to dump 32-bit packages entirely, he said:
Bazzite is not a distro in the traditional sense and our build pipeline wouldn't be able to handle this, nor would we have the manpower to build this. Packages needed for Steam & Wine would need to be kept in lockstep with Fedora and likely need changes applied to continue to build in 32-bit long term, which means versions will drift and we'll risk putting out broken builds unless we can also pause/prevent builds if there's a mismatch.
It would be easier and better to sunset the Bazzite project if the 32-bit packages it depends on go away. However, he thought that some of the options being discussed around building a subset of 32-bit packages were excellent and that "a middle ground that works better for everyone can be reached". Gospodnetich added that he had faith in Fedora as a project, "and none of us would be here without them".
i686 is people
The discussion is only a few days old, and there is time for proposals that might stave off the end of i686 for some use cases on Fedora. However, the bottom line is that keeping i686 alive as it is done in Fedora now is an unfunded mandate for volunteer packagers. Berrangé said:
This is one of the periodic unusual Fedora change proposals where not adopting it, is de facto making a conscious decision to force volunteer maintainers to continue to work on something that many consider to be undesirable and a technological dead end.
Pushing the problem down the road for a few more releases will not solve the problem, and he held little hope that Valve will "suddenly decide to do something different" now. He said, ideally, those who still care about i686 would outline a strategy for supporting it without the cross-distribution and build infrastructure burden that it currently imposes. A full solution need not appear by Fedora 44, but "we need to at least make some step forward towards a solution that is more sustainable than the status-quo".
Newly minted Fedora Project Leader Jef Spaleta said that he viewed the current proposal as a necessary incentive for "the right people to put an actionable plan together that is less disruptive". In a later comment, he said Fedora must find a way to separate out i686 work "in a way that lets people lay down that burden so other people who care about it can pick it up". If that did not happen, then the project may have to accept dropping i686 altogether.
Some commenters were unhappy that the proposal seemed to be a way to force others to step forward to take over i686 maintenance. Claire Robsahm said:
I don't agree with using this proposal as a "Sword of Damocles" to motivate other proposals to come forward. That means outsiders – gamers and devs like me – will see this as the "default" outcome, which will cause uncertainty in the long-term future of Fedora as a reliable gaming and developer platform.
Miller agreed, and said that Fedora had been gaining a lot of momentum as a distribution for playing and developing games; the proposal had "already created brand damage and loss of user confidence in Fedora (and by extension Bazzite)".
Withdrawn
After much discussion—including 400 comments on Fedora's Discourse forum—Valentini withdrew the proposal on June 28 and said he was looking forward to counter-proposals when they are ready.
For now, those who depend on multilib support on Fedora can breathe easier, but the problem remains. The burden of maintaining Fedora's i686 packages is largely being borne by people who are not interested in or benefiting from the effort. The users, and commercial entities, that do benefit from the work have (thus far) not come forward to shoulder the burden. That situation is not sustainable or likely to be supported indefinitely.
Supporting kernel development with large language models
Kernel development and machine learning seem like vastly different areas of endeavor; there are not, yet, stories circulating about the vibe-coding of new memory-management algorithms. There may well be places where machine learning (and large language models — LLMs — in particular) prove to be helpful on the edges of the kernel project, though. At the 2025 North-American edition of the Open Source Summit, Sasha Levin presented some of the work he has done putting LLMs to work to make the kernel better.
An LLM, he began, is really just a pattern-matching engine with a large number of parameters; it is a massive state machine. Unlike the sort of state machine typically seen in the kernel, though, LLMs perform state transitions in a probabilistic, rather than deterministic, manner. Given a series of words, the LLM will produce a possible next word in the sequence. Given "the Linux kernel is written in...", the LLM will almost certainly respond "C". There is a much lower probability, though, that it might say "Rust" or "Python" instead.
An LLM works with a "context window", which is the user-supplied text it can remember while answering questions. A system like Claude has a context window of about 200,000 tokens, which is enough for an entire kernel subsystem.
Levin does not believe that LLMs will replace humans in tasks like kernel development. Instead, an LLM should be viewed as the next generation of fancy compiler. Once upon a time, developers worked in assembly; then higher-level languages came along. Some sneered at this new technology, saying that "real developers" did their own register allocation. But, in time, developers adopted better programming languages and became more productive. An LLM is just another step in that direction; it is not a perfect tool, but it is good enough to improve productivity.
LLM-generated code in the kernel
As an example, he pointed to a patch credited to him that was merged for the 6.15 release. That patch was entirely written by an LLM, changelog included. Levin reviewed and tested it, but did not write the code. This fix, he said, is a good example of what LLMs can do well; they excel at small, well-defined tasks, but cannot be asked to write a new device driver. LLMs also help with writing the commit message, which is often more difficult than writing the patch itself, especially for developers whose native language is not English.
He pointed out a couple of things about the patch itself, excerpted here:
    -/* must be a power of 2 */
    -#define EVENT_HASHSIZE    128
    +/* 2^7 = 128 */
    +#define EVENT_HASH_BITS   7
The switch from one hash API to another required specifying the size as a number of bits rather than as a power of two; the LLM took that into account and made the appropriate change. It also realized, later in the patch, that a masking operation was not needed, so it took that operation out. The LLM, he said, generated code that was both correct and efficient.
Another example is the git-resolve script that was merged for 6.16. This script, which came out of a late 2024 discussion on ambiguous commit IDs, will resolve an ambiguous (or even incorrect) ID into a full commit. It, too, was generated with an LLM. Not only does it work, but it includes a full set of self tests, something he noted (with understatement) is unusual for code found in the kernel's scripts directory. LLMs, he said, "won't give you a frowny face" when asked to generate tests. The script includes documentation (also unusual for that directory), and is being used on a daily basis in the kernel community.
Moving on, he introduced the concept of "embeddings", which are a way of representing text within an LLM. They can be thought of as an equivalent to a compiler's internal representation of a program. Embeddings turn human language into vectors that can be processed mathematically. They preserve the semantic meaning of the text, meaning that phrases with similar meanings will "compile" to similar embeddings. That, in turn, allows meaning-based searching. In the kernel context, embeddings can help in searching for either commits or bugs that are similar to a given example.
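To make the idea of meaning-based search concrete, here is a toy sketch of my own devising; a real system would use a trained embedding model rather than the bag-of-words stand-in here, which only captures shared words, not meaning:

    import math
    from collections import Counter

    def embed(text):
        # Toy stand-in for an embedding model: a sparse bag-of-words
        # vector. Real embeddings are dense and capture semantics.
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb)

    fix = embed("fix use-after-free in event teardown")
    print(cosine(fix, embed("fix use-after-free in timer teardown")))  # similar
    print(cosine(fix, embed("document the build system")))             # dissimilar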
Another useful LLM technology is "retrieval augmented generation" (RAG). LLMs, he said, have an unfortunate tendency to make things up when they do not know the answer to a question; an LLM will only rarely admit that it does not know something. That can be "really annoying" for generated code; an LLM will make up kernel functions that do not exist, for example. RAG works to ground an LLM in actual knowledge, enabling the model to look up information as needed, much like how humans use documentation. It is also useful to update an LLM with knowledge that came about after its training was done.
For the kernel in particular, RAG can ground the model and teach it about kernel-specific patterns. It also adds explainability, where the model can cite specific examples to justify the decisions it makes. Among other things, RAG allows the model to connect to a Git repository, giving it access to the kernel's development history.
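Reduced to its skeleton, a RAG step retrieves relevant context first and only then asks the model; this sketch is my own illustration, reusing the toy embed() and cosine() helpers from the example above, with ask_llm() left as a stand-in for a real model query:

    # Skeleton of a RAG step: rank the corpus (e.g., past commits) by
    # similarity to the question and stuff the best matches into the
    # prompt before asking the model.
    def answer_with_rag(question, corpus, ask_llm):
        q = embed(question)
        best = sorted(corpus, key=lambda d: cosine(q, embed(d)),
                      reverse=True)[:5]
        prompt = "Context:\n" + "\n".join(best) + "\n\nQuestion: " + question
        return ask_llm(prompt)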
Updates and CVEs
The stable kernels include a massive number of patches that have been backported from the mainline; the 5.10 series, for example, has incorporated over 31,000 commits since the initial 5.10 release was made. Maintaining these stable updates requires reviewing around 100 patches per day — every day, with no breaks. Of those, maybe five or ten are suitable for backporting. It is a tedious and frustrating process that does not scale; as a result, important fixes are sure to fall through the cracks.
The "AUTOSEL" tool has been around for some years; it tries to select the
mainline commits that should be considered for backporting. The initial
version was primitive; it would just look for specific keywords in the
changelog. Switching AUTOSEL to an LLM causes it to act like "another
stable-kernel maintainer
", albeit a special one who remembers every
backporting decision that has ever been made. It works by creating an
embedding for every commit in the history, then finding similarities with
new commits that may be solving the same kind of problem.
AUTOSEL, he noted, is not replacing the stable maintainers, but it does narrow down the set of commits that they must consider. It is able to process hundreds of commits quickly, catching fixes that humans will miss. It also explains its reasoning in each email that is sent to the list (random example) proposing a patch for backporting. When asked to consider a specific commit, he said, AUTOSEL can also recommend similar commits for consideration.
People ask which LLM is being used for AUTOSEL; the answer is "all of them". Each model has its own strengths and weaknesses, so AUTOSEL asks several of them, then allows each to vote on the conclusion. If enough models vote in favor of a backport, it is referred to the humans for consideration.
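That voting step is simple to picture; here is a hypothetical sketch (mine, not AUTOSEL's actual code), with ask_model() standing in for a query to one backend model:

    MODELS = ["model-a", "model-b", "model-c"]   # hypothetical backends
    THRESHOLD = 2    # "yes" votes needed to refer a commit to humans

    def ask_model(model, commit_text):
        # Stand-in for a real query to one LLM; a trivial keyword
        # check is used here just so that the sketch runs.
        return "fix" in commit_text.lower()

    def should_consider(commit_text):
        votes = sum(ask_model(m, commit_text) for m in MODELS)
        return votes >= THRESHOLD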
In early 2024, the kernel project took on the responsibility for assigning its own CVE numbers. The tooling to support this work started as a collection of "bash hacks" that quickly became unmaintainable. So the CVE team decided to convert them to Rust, since "that's what the cool kids do these days". The only problem is that the CVE team members are all kernel developers who are not that proficient in Rust. LLMs are proficient in the language, though, and were able to quickly rewrite the scripts, adding documentation and tests in the process. The new scripts are more maintainable and vastly more efficient.
The CVE process itself is a challenge similar to that of backporting; commits must be reviewed for security relevance, which is another tedious task. It is hard to find people with the requisite expertise to do this work; the people with the needed skills can easily find more rewarding work to do. A purely human-based process thus runs behind and misses important vulnerabilities, while occasionally flagging bugs that are not, in fact, vulnerabilities.
This is, in other words, another job for a machine. The CVE selection is able to share much of the infrastructure used by AUTOSEL, but this time the LLM is being asked to look for commits that somehow resemble previous vulnerability fixes.
He concluded by saying that, using LLMs, the kernel community now has a system that can make use of multiple models, directly access Git repositories, and make use of historical data to answer various types of questions about kernel patches. He provided URLs for AUTOSEL and the commit classifier.
Tim Bird asked whether there is a risk of humans trusting the output from the LLMs too much, allowing errors to creep in. Levin agreed that LLMs can be wrong, but he said that humans can be wrong too, and they often are. Another participant asked about the licensing for code that is emitted by an LLM; Levin said that he has not really thought about the problem, and assumes that, if an LLM produces code, he is free to make use of it.
The last question was whether this infrastructure could be used to examine patches prior to merging in the hope of catching bugs earlier. This is an area that Levin has explored in the past, but that is not a focus currently. He agreed that LLMs could do that work, but it would be a huge job, and LLMs are still too expensive to use in that way. Perhaps in the future, he said, when the price has fallen, that sort of analysis will be possible.
[Thanks to the Linux Foundation for supporting our travel to this event.]
How to write Rust in the kernel: part 2
In 2023, Fujita Tomonori wrote a Rust version of the existing driver for the Asix AX88796B embedded Ethernet controller. At slightly more than 100 lines, it's about as simple as a driver can be, and therefore is a useful touchstone for the differences between writing Rust and C in the kernel. Looking at the Rust syntax, types, and APIs used by the driver and contrasting them with the C version will help illustrate those differences.
Readers who are already conversant with Rust may find this article retreads some basics, but it is my hope that it can still serve as a useful reference for implementing simple drivers in Rust. The C version and the Rust version of the AX88796B driver are remarkably similar, but there are still some important differences that could trip up a developer performing a naive rewrite from one to the other.
The setup
The least-different thing between the two versions is the legalities. The Rust driver starts with an SPDX comment asserting that the file is covered by the GPL, as many files in the kernel do. Below that is a documentation comment:
    //! Rust Asix PHYs driver
    //!
    //! C version of this driver: [`drivers/net/phy/ax88796b.c`](./ax88796b.c)
As mentioned in the previous article, comments starting with //! contain documentation that applies to the entire file. The next few lines are a use statement, the Rust analogue of #include:
    use kernel::{
        c_str,
        net::phy::{self, reg::C22, DeviceId, Driver},
        prelude::*,
        uapi,
    };
Like C, Rust modules are located starting from a search path and then continuing down a directory tree. Unlike C, a use statement can selectively import only some items defined in a module. For example, DeviceId is not a separate module, but rather a specific item inside the kernel::net::phy module. By importing both kernel::net::phy::DeviceId and kernel::net::phy as a whole, the Rust module can refer to DeviceId directly, and anything else from the PHY module as phy::name. These items can always be referred to by their full paths; a use statement just introduces a shorter local alias. If a name would be ambiguous, the compiler will complain.
All of these imported items come from the kernel crate (Rust library), which contains the bindings between the main kernel and Rust code. In a user-space Rust project, a program would usually also have some imports from std, Rust's standard library, but that isn't possible in the kernel, since the kernel needs more precise control over allocation and other details that the standard library abstracts away. Kernel C developers can't use functions from libc in the kernel for much the same reason. The kernel::prelude module contains kernel replacements for many common standard-library functions; the remainder can be found in core, the subset of std that doesn't allocate.
In the C version of the driver, the next step is to define some constants representing the three different, but related, devices this driver supports: the AX88772A, the AX88772C, and the AX88796B. In Rust, items do not have to be declared before use — the entire file is considered at once. Therefore, Fujita chose to reorder things slightly to keep the code for each board in its own section; the types for each board (PhyAX88772A and so on) are defined later. The next part of the Rust driver is a macro invocation that sets up the necessary symbols for a PHY driver:
    kernel::module_phy_driver! {
        drivers: [PhyAX88772A, PhyAX88772C, PhyAX88796B],
        device_table: [
            DeviceId::new_with_driver::<PhyAX88772A>(),
            DeviceId::new_with_driver::<PhyAX88772C>(),
            DeviceId::new_with_driver::<PhyAX88796B>()
        ],
        name: "rust_asix_phy",
        authors: ["FUJITA Tomonori <fujita.tomonori@gmail.com>"],
        description: "Rust Asix PHYs driver",
        license: "GPL",
    }
Rust macros come in two general kinds: attribute macros, which are written #[macro_name] and modify the item that they appear before, and normal macros, which are written macro_name!(). There is also a less common variant of attribute macros written #![macro_name] which applies to the definition that they appear within. Normal macros can use any matching set of braces to enclose their arguments, but can always be recognized by the mandatory exclamation mark between the name and the braces. The convention is to use parentheses for macros that return a value and braces for macros that are invoked to define a structure (as is the case here), but that is not actually required. Invoking the macro with parentheses would have the same result, but it would make it less obvious to other Rust programmers what is happening.
The drivers argument to the macro contains the names of the three board types this driver covers. Each driver has to be associated with information such as the name of the device and the PHY device ID that it should be active for. In the C version of the driver, this is handled by a separate table:
static struct phy_driver asix_driver[] = { ... };
In the Rust code, this information is stored in the code for each board (see below), since all PHY drivers need to provide it. Overall, the kernel::module_phy_driver!{} macro serves the same role as the module_phy_driver() macro in C.
Next, the Rust driver defines two constants that the code uses later:
    const BMCR_SPEED100: u16 = uapi::BMCR_SPEED100 as u16;
    const BMCR_FULLDPLX: u16 = uapi::BMCR_FULLDPLX as u16;
Every declaration of a value (as opposed to a data structure) in Rust starts with either const or let. The former are compile-time constants — like a simple #define in C. Types are mandatory for const definitions, but optional for let ones. In either case, the type always appears separated from the name by a colon. So, in this case, both constants are u16 values, Rust's unsigned 16-bit integer type. The as u16 part at the end is a cast, since the original uapi::BMCR_* constants being referenced are defined in C and assumed to be 32 or 64 bits by default, depending on the platform.
An actual function
The final piece of code before the actual drivers is a shared function for performing a soft reset on Asix PHYs:
    // Performs a software PHY reset using the standard
    // BMCR_RESET bit and poll for the reset bit to be cleared.
    // Toggle BMCR_RESET bit off to accommodate broken AX8796B
    // PHY implementation such as used on the Individual
    // Computers' X-Surf 100 Zorro card.
    fn asix_soft_reset(dev: &mut phy::Device) -> Result {
        dev.write(C22::BMCR, 0)?;
        dev.genphy_soft_reset()
    }
There are a few things to notice about this function. First of all, the comment above it is not a documentation comment. This isn't a problem because this function is also private — since it was declared with fn instead of pub fn, it's not visible outside this one module. The C equivalent would be a static function. In Rust, the default is the opposite way around, with functions being private (static) unless declared otherwise.
The argument to the function is an &mut phy::Device called dev. References (written with an &) are in many ways Rust's most prominent feature; they are like pointers, but with compile-time guarantees that certain classes of bugs (such as concurrent mutable access without synchronization) can't happen. In this case, asix_soft_reset() takes a mutable reference (&mut). The compiler guarantees that no other function can have a reference to the same phy::Device at the same time. This means that the body of the function can clear the BMCR register and trigger a soft reset without worrying about concurrent interference.
The last part of the function to understand is the return type, Result, and the "try" operator, ?. In C, a function that could fail often indicates this by returning a special sentinel value, typically a negative number. In Rust, the same thing is true, but the sentinel value is called Err instead, and is one possible value of the Result enumeration. The other value is Ok, which indicates success. Both Err and Ok can carry additional information, but the default in the kernel is for Err to carry an error number, and for Ok to have no additional information.
The pattern of checking for an error and then immediately propagating it to a function's caller is so common that Rust introduced the try operator as a shortcut. Consider the same function from the C version of the driver:
    static int asix_soft_reset(struct phy_device *phydev)
    {
        int ret;

        /* Asix PHY won't reset unless reset bit toggles */
        ret = phy_write(phydev, MII_BMCR, 0);
        if (ret < 0)
            return ret;

        return genphy_soft_reset(phydev);
    }
It performs the same two potentially fallible library function calls, but needs an extra statement to propagate the potential error. In the Rust version, if the first call returns an Err, the try operator automatically returns it. For the second call, note how the line does not end with a semicolon — this means the value of the function call is also the return value of the function as a whole, and therefore any errors will also be returned to the caller. The missing semicolon is not easy to forget, however, because adding it in will make the compiler complain that the function does not return a Result.
The main driver
The actual driver code differs slightly for the three different boards. The simplest is the AX88796B, the implementation of which starts on line 124:
struct PhyAX88796B;
This is an empty structure. An actual instance of this type has no storage associated with it — it doesn't take up space in other structures, size_of() reports 0, and it has no padding — but there can still be global data for the type as a whole (such as debugging information). In this case, an empty structure is used to implement the Driver abstraction, in order to bundle all of the needed data and functions for a PHY driver together. When the compiler is asked to produce functions that apply to a PhyAX88796B (which the module_phy_driver!{} macro does), it will use this definition:
    #[vtable]
    impl Driver for PhyAX88796B {
        const NAME: &'static CStr = c_str!("Asix Electronics AX88796B");
        const PHY_DEVICE_ID: DeviceId = DeviceId::new_with_model_mask(0x003b1841);

        fn soft_reset(dev: &mut phy::Device) -> Result {
            asix_soft_reset(dev)
        }
    }
The constant and function definitions work in the same way as above. The type of NAME uses a static reference ("&'static CStr"), which is a reference that is valid for the entire lifetime of the program. The C equivalent is a const pointer to the data section of the executable: it is never allocated, freed, or modified, and is therefore fine to dereference anywhere in the program.
The new Rust feature in this part of the driver is the impl block, which is used to implement a trait. Often, a program will have multiple different parts that conform to the same interface. For example, all PHY drivers need to provide a name, associated device ID, and some functions implementing driver operations. In Rust, this kind of common interface is represented by a trait, which lets the compiler perform static type dispatch to select the right implementation based on how the trait functions are called.
C, of course, does not work like this (although _Generic can sometimes be used to implement type dispatch manually). In the kernel's C code, PHY drivers are represented by a structure that contains data and function pointers. The #[vtable] macro converts a Rust trait into a singular C structure full of function pointers. Up above, in the call to module_phy_driver!{}, the reference to the PhyAX88796B type lets the compiler find the right Driver implementation, and from there produce the correct C structure to integrate with the C PHY driver infrastructure.
There are obviously more functions involved in implementing a complete PHY driver. Luckily, these functions are often the same between different devices, because there is a standard interface for PHY devices. The C PHY driver code will fall back to a generic implementation if a more specific function isn't present in the driver's definition, so the AX88796B code can leave them out. The other two devices supported in this driver specify more custom functions to work around hardware quirks, but those functions are not much more complicated than what has already been shown.
Summary
Steps to implement a PHY driver ...

| ... in C: | ... in Rust: |
| --- | --- |
| Write module boilerplate (licensing and authorship information, #include statements, etc.). | Write module boilerplate (licensing and authorship information, use statements, a call to module_phy_driver!{}). |
| Implement the needed functions for the driver, skipping functions that can use the generic PHY code. | Implement the needed functions for the driver, skipping functions that can use the generic PHY code. |
| Bundle the functions along with a name, optional flags, and PHY device ID into a struct phy_driver and register it with the PHY subsystem. | Bundle the functions along with a name, optional flags, and PHY device ID into a trait; the #[vtable] macro converts it into the right form for the PHY subsystem. |
Of course, many drivers have specific hardware concerns or other complications; kernel software is distinguished by its complexity and concern with low-level details. The next article in this series will look at the design of the interface between the C and Rust code in the kernel, as well as the process of adding new bindings when necessary.
Improved load balancing with machine learning
The extensible scheduler class ("sched_ext") allows the loading of a custom CPU scheduler into the kernel as a set of BPF functions; it was merged for the 6.12 kernel release. Since then, sched_ext has enabled a wide range of experimentation with scheduling algorithms. At the 2025 Open Source Summit North America, Ching-Chun ("Jim") Huang presented work that has been done to apply (local) machine learning to the problem of scheduling processes on complex systems.

Huang started with a timeline of Linux scheduler development, beginning with the adoption of the completely fair scheduler (CFS) in 2007. Various efforts were made to write alternatives to CFS for specific use cases, notably the 2009 submission of BFS and the 2016 MuQSS submission, both from Con Kolivas. In 2023, the EEVDF scheduler showed up as an enhancement to, and eventual replacement for, CFS. The following year, finally, saw the merging of sched_ext, after some extensive discussion.
In other words, he said, it took 17 years from the beginning of the CFS era to get to the point where an extensible scheduler was added to Linux. That period reflects a long-held opinion that one scheduler could be optimal for all situations. This position was clearly expressed by Linus Torvalds in 2007:
The arguments that "servers" have a different profile than "desktop" is pure and utter garbage, and is perpetuated by people who don't know what they are talking about. The whole notion of "server" and "desktop" scheduling being different is nothing but crap.
The reality of the situation, Huang said, has changed since then. In 2007, machines typically had a maximum of four CPUs, those CPUs were all equivalent to each other, and the workloads were relatively simple. In 2025, instead, systems can have over 48 cores with heterogeneous CPUs and complex requirements for throughput, latency, and energy consumption. The heuristics used by CFS (and EEVDF) were designed for the simpler times, and are no longer optimal.
The complexity of modern systems comes in numerous forms. NUMA systems can perform badly if workloads are scheduled far from their memory. The CFS scheduler often makes bad placement choices; as a result, administrators are being forced to pin processes to specific CPUs or to partition their systems to regain performance. Heterogeneous systems have multiple CPU types with different performance and energy-use characteristics, and even different instruction sets. Putting a task on the wrong type of CPU can affect performance, waste energy, or, in some cases, even cause a crash due to an instruction-set mismatch.
The types of workloads being seen now add complications of their own. A gaming workload, for example, often features a combination of rendering and streaming tasks. The rendering is latency-sensitive and should run on high-performance cores, while the streaming can run on the more efficient cores. A scheduler that treats both task types equally will end up causing dropped frames. The sort of network processing involved with 5G networking involves a combination of tight latency constraints and CPU-intensive work. Even a modern development environment involves challenges, with a combination of CPU-intensive tasks (compilation, for example) and interactive tasks. Bad scheduler decisions can lead to lots of context switches and an unresponsive user interface.
The end result of all this, Huang said, is that any scheduler using a single, fixed algorithm is fundamentally broken. All of the traditional schedulers do exactly that. They have brought a simple system view into a world where a typical computer has billions of possible states, and their limitations are showing.
The sched_ext framework offers a potential solution, an environment where schedulers can evolve to meet contemporary challenges. Huang took as a case study the Free5GC project, which is creating an open-source solution for 5G network processing. Its data-plane processing, in particular, is subject to a number of difficult constraints. It has a number of CPU-bound tasks, but also has some strict latency constraints. The CPU scheduler must be able to balance these constraints; CFS often fails to do so optimally.
The project experimented with a sched_ext scheduler called "scx_packet". It used a relatively simple algorithm: half the CPUs in the system were reserved for latency-sensitive network-processing tasks, while the other half were given over to CPU-bound general processing. But this scheduler treated all network traffic equally — voice calls, web browsing, and streaming all went to the same CPUs. That could cause voice data to be blocked behind download traffic, and emergency calls had the same priority as social-media activity. This approach also led to some CPUs being overloaded, while others were idle, as the workload shifted. Finally, some packets require much more processing than others; the processing of the more CPU-intensive packets should be scheduled separately.
This experience led the Free5GC developers to look into machine learning. Scheduling on such systems has many dimensions of input to consider; it is, he said, "the perfect problem domain" for machine learning. Among other things, the scheduler must consider the priority of each task, its CPU requirements, its virtual run time so far, and its recent CPU-usage patterns. The load on each CPU must be taken into account, as must NUMA distance, cache sharing, and operating frequency. Then, of course, there are the workload-specific factors.
A new sched_ext scheduler (based on scx_rusty) was developed to try to take all of these parameters into account and decide when a task should be moved from one CPU to another. It initially runs in a data-collection mode, looking at migration decisions and the results from them; these decisions are then used to train a model (in user space) that is subsequently stored in a BPF map. The scheduler can then use this model inside the kernel to make load-balancing decisions. The outcome of these decisions is continually measured and reported back to user space, which updates the model over time.
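As a rough illustration of that user-space-to-kernel handoff, a libbpf-style scheduler might declare an array map to hold the trained weights; the map name and dimensions below are invented for illustration and are not taken from the scheduler's source:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define NR_WEIGHTS 256   /* illustrative size, not the real model's */

    /* User space trains the model, then writes the (fixed-point) weights
       into this map with bpf_map_update_elem(); the BPF scheduler reads
       them back when making load-balancing decisions. */
    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, NR_WEIGHTS);
        __type(key, __u32);
        __type(value, __s32);
    } model_weights SEC(".maps");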
Implementing this scheduler required overcoming an obstacle unique to the kernel environment. Neural-network processing involves a fair amount of floating-point arithmetic, but use of floating-point instructions is not allowed in kernel code (saving the floating-point-unit state on entry to the kernel would have a heavy performance cost, so the kernel does not do that). A form of fixed-point arithmetic was adopted for the neural-network processing instead.
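To make the fixed-point idea concrete, here is a minimal user-space sketch of a Q16.16 representation and a single neuron evaluated with it; the format and helper names are assumptions for illustration, not the scheduler's actual code:

    #include <stdint.h>
    #include <stdio.h>

    /* Q16.16 fixed point: 16 integer bits, 16 fractional bits.  This is
       an illustrative format; the scheduler's real representation may
       differ. */
    #define FIX_SHIFT 16
    #define FIX_ONE   (1 << FIX_SHIFT)

    typedef int32_t fix_t;

    static fix_t fix_from_int(int x) { return (fix_t)(x << FIX_SHIFT); }

    static fix_t fix_mul(fix_t a, fix_t b)
    {
        /* Widen to 64 bits so the intermediate product cannot overflow. */
        return (fix_t)(((int64_t)a * b) >> FIX_SHIFT);
    }

    /* One neuron: dot product of inputs and weights plus a bias, then a
       ReLU activation, all in integer arithmetic. */
    static fix_t neuron(const fix_t *in, const fix_t *w, int n, fix_t bias)
    {
        fix_t acc = bias;
        for (int i = 0; i < n; i++)
            acc += fix_mul(in[i], w[i]);
        return acc > 0 ? acc : 0;
    }

    int main(void)
    {
        fix_t in[2] = { fix_from_int(3), FIX_ONE / 2 };   /* 3.0, 0.5  */
        fix_t w[2]  = { FIX_ONE / 4, fix_from_int(2) };   /* 0.25, 2.0 */
        fix_t out = neuron(in, w, 2, 0);

        /* Converting back to double only for display: prints 1.75. */
        printf("output = %f\n", out / (double)FIX_ONE);
        return 0;
    }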
In a test using the all-important kernel-compilation benchmark, this scheduler produced a 10% improvement in compilation time over the EEVDF scheduler. The number of task migrations was reduced by 77%.
Huang concluded with a summary of why machine learning works in this context. Scheduling in this complex environment is, he said, a pattern-recognition problem, and neural networks are good at that task. The scheduler takes 15 separate parameters into account for each migration decision, balances competing goals, adjusts its model based on the observed results, and automatically retrains itself for new architectures and workloads.
The slides from Huang's talk are available for interested readers. The source for the machine-learning-based sched_ext scheduler can be found on GitHub.
[Thanks to the Linux Foundation for supporting my travel to this event.]
Yet another way to configure transparent huge pages
Transparent huge pages (THPs) are, theoretically, supposed to allow processes to benefit from larger page sizes without changes to their code. This does work, but the performance impacts from THPs are not always a benefit, so system administrators with specific knowledge of their workloads may want the ability to fine-tune THPs to the application. On May 15, Usama Arif shared a patch set that would add a prctl() option for setting THP defaults for a process; that patch set has sparked discussion about whether such a setting is a good fit for prctl(), and what alternative designs may work instead.
The patch set added three new prctl() flags. Two of them would globally enable or disable THPs for a process, while the third would restore the system default. The chosen setting would persist across calls to fork() and exec(). Being able to set separate policies for each process — and to have those policies configured by the parent process, such as a system manager — would help with systems where multiple different types of workloads are present on the same machine, Arif explained.
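Since the patch set had not been merged, the exact constants are subject to change; the sketch below uses invented names (PR_THP_POLICY_SKETCH and friends) purely to show the shape of the proposed interface:

    #include <sys/prctl.h>
    #include <stdio.h>

    /* Invented placeholders; the real option and argument values would
       come from the (unmerged) patch set. */
    #define PR_THP_POLICY_SKETCH      1000
    #define THP_POLICY_ALWAYS         1
    #define THP_POLICY_NEVER          2
    #define THP_POLICY_SYSTEM_DEFAULT 3

    int main(void)
    {
        /* Disable THPs for this process and everything it subsequently
           forks or execs. */
        if (prctl(PR_THP_POLICY_SKETCH, THP_POLICY_NEVER, 0, 0, 0))
            perror("prctl");   /* EINVAL on a kernel without the patch */
        return 0;
    }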
THPs can group many smaller pages of memory together; this eliminates the overhead of tracking as many pages, and removes indirections from the process of resolving a virtual memory address to a location in physical memory. This comes at the cost of greater potential for internal fragmentation, and therefore higher memory use. Whether that tradeoff is appropriate for a program depends on the amount of available memory, its memory-access patterns, and what other jobs are competing for memory on the system.
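For comparison, the per-mapping mechanism that exists today lets a process request THPs for a single region with madvise(); this small example uses only current, stable kernel API:

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 16UL << 20;   /* a 16MB anonymous mapping */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        /* Ask the kernel to back this range with transparent huge pages;
           the call fails if THP is disabled system-wide. */
        if (madvise(buf, len, MADV_HUGEPAGE))
            perror("madvise");

        munmap(buf, len);
        return 0;
    }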
Lorenzo Stoakes was concerned that Arif's change would effectively make the current mechanism for controlling system-wide THP policy meaningless. On current kernels, there is a sysfs file that can be used to choose whether THPs are used for all processes, for no processes, or only for processes that request it via a call to madvise(). Arif attempted to reassure him that the existing mechanism would remain in place; the prctl() proposal would effectively take the place of madvise(). Stoakes wasn't reassured:
prctl() feels like it's literally never, ever the right choice.
It feels like we shove all the dark stuff we want to put under the rug there.
Reading the man page is genuinely frightening. [T]here's stuff about VMAs _I wasn't aware of_.
After an extensive back-and-forth about the merits of Arif's proposal with David Hildenbrand, Stoakes eventually proposed creating an entirely new system call that subsumes both the current madvise() interface and Arif's need for a configurable process-wide default. His proposed interface would be a "struct-configured function" that takes a new structure describing the version of the API in use, and one of several possible operations. An enumeration would be used to select between applying advice to a single range, multiple ranges, or an entire address space. In the latter case, it could be set as the default. An optional pidfd could apply the operation to a separate process, and a set of boolean flags would configure how the call reacts to errors or gaps in the requested range.
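Stoakes did not post a concrete header, so the following is a hypothetical rendering of what such a struct-configured call might look like; every name in it is invented for illustration:

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Invented names throughout; Stoakes only sketched the design on
       the list. */
    enum advice_scope {
        ADVICE_ONE_RANGE,       /* a single address range */
        ADVICE_MANY_RANGES,     /* a vector of ranges */
        ADVICE_WHOLE_PROCESS,   /* the entire address space */
    };

    struct advice_range { void *addr; size_t len; };

    struct advice_args {
        uint32_t size;          /* sizeof(struct advice_args): versioning */
        uint32_t scope;         /* enum advice_scope */
        int32_t  pidfd;         /* -1 to act on the calling process */
        int32_t  advice;        /* e.g. MADV_HUGEPAGE */
        uint32_t flags;         /* skip-errors, ignore-gaps, ... */
        uint32_t set_default;   /* with ADVICE_WHOLE_PROCESS: persist */
        const struct advice_range *ranges;
        uint32_t nr_ranges;
    };

    int main(void)
    {
        /* Apply huge-page advice process-wide and make it the default. */
        struct advice_args args = {
            .size        = sizeof(args),
            .scope       = ADVICE_WHOLE_PROCESS,
            .pidfd       = -1,
            .advice      = MADV_HUGEPAGE,
            .set_default = 1,
        };
        printf("would pass a %u-byte argument struct\n",
               (unsigned)args.size);
        return 0;
    }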
Arif thought Stoakes's proposal was unnecessary, and would "introduce a lot of code to solve something that can be done in a very very simple way". In Arif's view, the three use cases (modifying single ranges, modifying multiple ranges, and making whole-process changes) are already served by madvise(), process_madvise(), and prctl(), respectively.
Stoakes didn't agree, pointing out several problems with the current interfaces and calling his own proposal a "fixed madvise()". One complaint was that the current madvise() interface stops on the first error, with no way for the user to discover how far it progressed or what the details of the error were beyond the error code.
He also said that the process_madvise() interface was limited to eight separate ranges. SeongJae Park pointed out that this does not seem to be true, but Stoakes did not change his mind, in light of his other concerns.
Despite their disagreement, Stoakes provided substantial review comments on Arif's patch set, which were happily accepted. Arif published a new version of the patch set on May 19. That same day, Stoakes came back with an alternative patch set extending the existing madvise() interface to support setting process defaults.
Stoakes's patch set introduces four new madvise() flags. PMADV_SKIP_ERRORS allows a call to madvise() to continue applying advice to subsequent ranges even after encountering an error. PMADV_NO_ERROR_ON_UNMAPPED is a less strict version; with this flag, madvise() stops on errors, but does not treat an unmapped range in the middle of an area as an error. PMADV_ENTIRE_ADDRESS_SPACE applies the supplied advice to a process's entire address space, and should ordinarily be combined with PMADV_SKIP_ERRORS, given that not every mapping in the address space will necessarily support a given piece of advice. Finally, PMADV_SET_FORK_EXEC_DEFAULT specifies that a given set of advice should become the default for the process, persisting across fork() and exec().
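The flag names above come from the patch set, but the calling convention is still in flux; the stub below simply shows how the flags might combine, with the pmadvise_sketch() wrapper and the flag values being assumptions rather than the proposed ABI:

    #include <stdio.h>
    #include <stddef.h>

    /* The patch set's flag names; the values are placeholders, since
       the interface is not merged. */
    #define PMADV_SKIP_ERRORS            (1U << 0)
    #define PMADV_NO_ERROR_ON_UNMAPPED   (1U << 1)
    #define PMADV_ENTIRE_ADDRESS_SPACE   (1U << 2)
    #define PMADV_SET_FORK_EXEC_DEFAULT  (1U << 3)

    #define MADV_HUGEPAGE_SKETCH 14      /* stand-in for MADV_HUGEPAGE */

    /* Stub standing in for the proposed kernel interface. */
    static int pmadvise_sketch(void *addr, size_t len, int advice,
                               unsigned int flags)
    {
        (void)addr; (void)len;
        printf("advice %d, flags %#x\n", advice, flags);
        return 0;
    }

    int main(void)
    {
        /* Apply huge-page advice process-wide, tolerate mappings that
           reject it, and make it the default across fork()/exec(). */
        return pmadvise_sketch(NULL, 0, MADV_HUGEPAGE_SKETCH,
                               PMADV_ENTIRE_ADDRESS_SPACE |
                               PMADV_SKIP_ERRORS |
                               PMADV_SET_FORK_EXEC_DEFAULT);
    }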
Hildenbrand found Stoakes's approach interesting, but thought that the patch should introduce the minimum number of needed flags. Specifically, he questioned whether PMADV_NO_ERROR_ON_UNMAPPED was really needed as a separate flag from PMADV_ENTIRE_ADDRESS_SPACE. Stoakes agreed to make the latter flag imply the former flag, but thought the former flag could still be useful in some circumstances.
Shakeel Butt didn't like the interface for a different reason; he pointed out that the eventual goal of the THP effort is to reach a state where they can be enabled by default for all workloads. In a world where that goal is reached, what is the use for these madvise() flags? He would rather see a temporary change to prctl(), lasting just long enough to finish the THP work.
That vision of the future wasn't one that Stoakes agreed with, though. He pointed out that MADV_HUGEPAGE and MADV_NOHUGEPAGE already existed and would need to be maintained in the madvise() interface for compatibility reasons. Butt clarified that he wasn't disputing that, he just thought that this option would likely go away in the future, and so it shouldn't be embedded in the madvise() interface.
Stoakes brought up an off-list suggestion from Liam Howlett: creating a new "beautifully named" mmadvise() system call for process-wide options (of which there are already a few), to remove them from the madvise() interface. Hildenbrand thought the idea was worth experimenting with. Johannes Weiner agreed, but was worried about introducing an over-broad solution if there was actually only one thing the new interface would be used for.
Arif wanted to reach some consensus on a design before people wrote more code, and thought that introducing another option into the conversation would just make things more complicated. Eventually, Arif and Stoakes both agreed to produce new versions of their patch sets for comparison.
While those patch sets have not yet been forthcoming, the discussion on the mailing list makes it clear that there is a use case for setting process-specific THP policies. The exact form of the API to enable this is still up in the air, but it's likely that the capability will soon be added to the kernel one way or the other. The nature of temporary solutions suggests that it might remain with us for some time.
Page editor: Jonathan Corbet