
Leading items

Welcome to the LWN.net Weekly Edition for August 21, 2025

This edition contains the following feature content:

  • Lucky 13: a look at Debian trixie: the project's latest stable release.
  • Python, tail calls, and performance: Ken Jin's EuroPython talk on CPython's new tail-calling interpreter.
  • Simpler management of the huge zero folio: removing some long-standing complexity from the memory-management subsystem.
  • Kexec handover and the live update orchestrator: rebooting the kernel without disturbing the workload running on top of it.
  • Finding a successor to the FHS: the UAPI Group's file-hierarchy specification as a possible replacement for the aging standard.
  • The Koka programming language: an experimental functional language built around an effect system.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Lucky 13: a look at Debian trixie

By Joe Brockmeier
August 20, 2025

After more than two years of development, the Debian Project has released its new stable version, Debian 13 ("trixie"). The release comes with the usual bounty of upgraded packages and more than 14,000 new packages; it also debuts Advanced Package Tool (APT) 3.0 as the default package manager and makes 64-bit RISC-V a supported architecture. There are few surprises with trixie, which is exactly what many Linux users are hoping for—a free operating system that just works as expected.

Debian's stable releases are aptly named; the project prioritizes stability over shipping the latest software. The freeze schedule for trixie called for a soft freeze in April, which meant that (for example) the KDE Plasma 6.4 release in June was too late to make the cut—even though trixie was not released until August. Users who prefer to live on the edge will want to run another distribution or follow Debian development by running the testing release that previews the next stable version—Debian 14 ("forky"). Truly adventurous users may take their chances with the unstable ("sid") release.

That said, trixie is up-to-date enough for many folks; it includes GNOME 48, KDE Plasma 6.3, Xfce 4.20, GNU Emacs 30.1, GnuPG 2.4.7, LibreOffice 25.2, and more. Under the hood, it includes the most recent Linux LTS kernel (6.12.41), GNU Compiler Collection (GCC) 14.2, GNU C Library (glibc) 2.41, LLVM/Clang 19, Python 3.13, Rust 1.85, and systemd 257. The release notes have a section for well-known software that compares the version in Debian 12 against Debian 13. While some of the versions lag a bit behind the upstream, they are not woefully outdated.

The project now supports six major hardware architectures: x86-64/amd64, 32-bit Arm with a hardware FPU (armhf), 64-bit Arm (arm64), IBM POWER8 or newer (ppc64el), IBM S/390 (s390x), and 64-bit RISC-V. The i386 architecture is not supported for trixie, though the project continues to build some i386 packages to run on 64-bit systems; users with i386 systems cannot upgrade to trixie. The MIPS architectures (mipsel and mips64el) have also been removed in trixie.

The Arm EABI (armel) port that targets older 32-bit Arm devices prior to Arm v7 is still supported with trixie, but this release is the end of the line. There is no installation media for armel systems, but users who have bookworm installed can upgrade to trixie if they have supported hardware: the Raspberry Pi 1, Zero, and Zero W are the only devices mentioned in the release notes.

Upgrades from bookworm are supported, of course. The release notes suggest that users convert APT source files to the DEB822 format before the upgrade. APT 3.0 includes an "apt modernize-sources" command to convert APT data source files to DEB822, but that is not available in bookworm. Users are also expected to remove all third-party packages prior to running the upgrade. I tested the upgrade on one of my servers, after taking a snapshot to roll back to if needed, and all went smoothly. Users who are considering an upgrade should read the release notes carefully before forging ahead; in particular, users should be aware that it's possible (but not certain) for network interface names to change on upgrade.
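
For reference, a one-line APT entry and its DEB822 equivalent look roughly like this; a hedged sketch, since the exact suites, components, and signing options on a given system will differ:

    # Old one-line format (/etc/apt/sources.list):
    deb http://deb.debian.org/debian trixie main

    # DEB822 format (/etc/apt/sources.list.d/debian.sources):
    Types: deb
    URIs: http://deb.debian.org/debian
    Suites: trixie
    Components: main
    Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg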

Installation

For users who want to start fresh, Debian offers a variety of installer images and download methods; users can choose a 64MB minimal ISO image with the netboot installer, all the way up to a set of Blu-ray images. The project recommends using BitTorrent or Jigsaw Download (jigdo) for the largest images. BitTorrent probably needs no introduction, but jigdo is not as well-known. Jigdo is a method of downloading all of the individual packages for an image from multiple mirrors and then assembling them into an ISO image on the user's machine. It was a bit fiddly to use jigdo to download an image, but not overly so—and the speed of the whole process was comparable to simply downloading an ISO of the same size.
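
The process boils down to installing the jigdo-file package and pointing jigdo-lite at the .jigdo file for the desired image; the URL below is a placeholder, not a real image path:

    apt install jigdo-file
    jigdo-lite https://cdimage.debian.org/path/to/debian-13-amd64-DVD-1.jigdo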

Debian's network install ("netinst") image is probably the best option for server installations and for experienced Linux users; it includes the packages required for a base install and then fetches the remaining software from Debian mirrors. Unlike the tiny netboot image, it includes the option of using either the graphical installer or the text-based installer.

The installer is a bit of a throwback to an earlier era when users were expected to know a lot more about the workings of a Linux system. Users who have only worked with distributions like Fedora and Ubuntu will notice that installing Debian requires many more steps than other popular distributions. For example, many desktop distributions have eliminated the step of setting a password for the root user—instead, it is generally assumed that the primary user will also be the system administrator, so the default is to give the primary user sudo privileges. Debian does not take that approach; in fact, there is no way to give a user sudo privileges during installation. Setting up sudo has to be done manually after the installation is completed. Update: Users can skip creation of a root account and the installer will then set up the regular user as an administrator with sudo permissions. Apologies for the error.
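
For installations that did create a root account, granting a user sudo access afterward takes just a few commands; a minimal sketch, where "alice" is a placeholder user name:

    su -                      # become root with the password set during installation
    apt install sudo          # in case the package is not already installed
    usermod -aG sudo alice    # members of the "sudo" group may use sudo on Debian
    # log out and back in for the new group membership to take effect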

For some folks, installing Debian will be a bit of a chore and may even be confusing for users who are new to Linux. For example, the text-mode installer requires users to specify the device for GRUB boot loader installation without providing a default. If one chooses an invalid partition, the installer reports that the operation has failed and drops back to a menu listing all of the installation steps. Presumably, if one picks the wrong (but valid) partition, it will happily install GRUB there and render the system unbootable. None of this is insurmountable for experienced Linux users, but it would no doubt be a hurdle for newcomers.

More experienced Linux users are likely to appreciate the amount of control offered by the installer. For example, Fedora's recent web-based installer makes it difficult to even find the option to perform custom partitioning. Debian has a guided partitioning option for those who do not want to fuss with it, but the option to create custom partitions is not hidden from the user.

Debian has a better installation option for newer Linux users, though it is easy to miss: the live install images, which use the Calamares installer. Its workflow is more akin to the installation process one finds with Fedora and Ubuntu; it also sets up the primary user with sudo privileges rather than creating a root password. Unfortunately, the live images are not listed on the main page for installer images—though they are mentioned, briefly, in the release notes.

[Debian Calamares installer]

The Debian installer also has the option of using a Braille display and/or speech synthesizer voice for the installation. I have not tried these options, but they are available for users who need them.

X.org

Many distributions are in the process of phasing out X.org support for GNOME and KDE as the upstream projects have started doing so. For example, Fedora will remove X.org session support for GNOME in Fedora 43, and the plan is for Ubuntu to do the same in its upcoming 25.10 release. GNOME will be completely removing X.org support in GNOME 49, which is planned for September.

Much has already been said about this, of course, and there is likely little new left to be said or that needs to be said. However, for users who still need or want X.org support, Debian 13 includes X.org sessions for GNOME and KDE. In testing trixie, I've spent some time in the GNOME and KDE X.org sessions as well as the Wayland sessions; if there are any gotchas or horrible bugs, I haven't encountered them (yet). This might be a compelling reason for some folks to switch to (or stick with) Debian.

Trying trixie

I use Debian for my personal web site and blogs, but it has been quite some time since I used it as my primary desktop operating system. Debian (and Ubuntu) derivatives, such as Linux Mint and Pop!_OS, yes—but it's been several years since I've used vanilla Debian on the desktop for more than casual tinkering.

The Debian release announcement boasts about the number of packages included in trixie: 64,419 packages total, with 14,100 added and more than 6,000 removed as obsolete since bookworm. That is quite a few packages, but falls short of some other distributions. For example, "dnf repoquery --repo=fedora --available" shows more than 76,000 packages available for Fedora 42.

After installing Debian, I went to install some of my preferred software, such as aerc, Ghostty, niri, and Speech Note. The aerc packages in trixie are current, but Ghostty and niri are not packaged for Debian at all. Ghostty is written in Zig, which is not packaged either, so users who want Ghostty will need to install Zig separately and then build it from source. Speech Note is packaged as a Flatpak, but Debian does not enable Flatpak or Flathub in GNOME Software by default. Users who want Flatpaks on Debian via Flathub will need to install the flatpak package and manually add the Flathub repository:

    flatpak remote-add --if-not-exists flathub \
      https://dl.flathub.org/repo/flathub.flatpakrepo

Users will need to add the gnome-software-plugin-flatpak package for Flatpak support in GNOME Software, and plasma-discover-backend-flatpak to add it to KDE Discover.
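
Putting it together, enabling Flatpak support looks something like this; only one of the two plugin packages is needed, depending on the desktop in use:

    apt install flatpak gnome-software-plugin-flatpak    # GNOME Software
    apt install flatpak plasma-discover-backend-flatpak  # KDE Discover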

Trixie ships with the Firefox extended-support release (ESR) by default: Firefox 128, which was released in July 2024. Happily, Mozilla offers a Debian repository for those who want to run more current versions. Even better, there is a little-advertised utility called extrepo that has a curated list of external repositories users might want to enable for Debian. To enable the Mozilla repository, for example, a user only needs to install extrepo, run "extrepo enable mozilla" as root (or with sudo), update the package cache, and look for the regular Firefox package. In all, extrepo includes more than 160 external repositories for applications like Docker CE, Signal, and Syncthing. Unfortunately, the extrepo utility does not have a separate "list" command to show the available repositories, though running "extrepo search" with no search parameter will return all of its DEB822-formatted repository entries. Some of the software is in an external repository due to a non-free license; other software (like Firefox) simply has a development cycle that outpaces Debian's.
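
In practice, the whole sequence for Firefox is short (run as root or via sudo):

    apt install extrepo
    extrepo enable mozilla
    apt update
    apt install firefox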

As one might expect, the Debian desktop experience is not dramatically different from other distributions; GNOME 48 on Debian is little different than GNOME 48 on Fedora, and the same is true for KDE, Xfce, etc. The primary difference is that users can expect more or less the same desktop experience running Debian stable in two years that they have today, which is not necessarily true for other distributions.

Miscellaneous

One of the features in Debian 13 is something that most users won't notice or appreciate at all: a transition to 64-bit time_t on 32-bit architectures, to avoid the Year 2038 problem. The short version is that a signed 32-bit integer cannot hold a Unix epoch timestamp for dates after January 19, 2038. That may seem like a distant concern, even irrelevant for Debian trixie; after all, Debian 13 is only supported by the project until 2030. However, the project expects that some 32-bit embedded systems will still be running trixie in 2038, so Debian developers did the heavy lifting to complete the transition to 64-bit time_t now. LWN covered the early planning for this in 2023.

By now, most users have retired their DSA SSH keys; if not, now is the time to do so. DSA keys were disabled by default with OpenSSH in 2015, and they are entirely disabled now with the openssh-client and openssh-server packages in trixie. If there is a device that can, for some reason, only be connected to with DSA, users can install the openssh-client-ssh1 package and use ssh1 to make the connection.
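
Replacing an old DSA key is a matter of generating a modern one (Ed25519 here) and distributing the new public key to the hosts that need it; a minimal sketch, with "user@host" as a placeholder:

    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
    ssh-copy-id -i ~/.ssh/id_ed25519.pub user@host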

As we covered in June 2024, Debian 13 has switched to using a tmpfs filesystem for the /tmp directory. By default, Debian allocates up to 50% of memory to /tmp, but this can be changed by following the instructions in the release notes. Note that this also applies to systems that are upgraded to trixie from bookworm.
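
For example, capping /tmp at a fixed size can be done with a drop-in override for the tmp.mount unit. This is a hedged sketch of the general systemd mechanism; the release notes describe the recommended procedure, and the 2G value here is only a placeholder:

    # systemctl edit tmp.mount   (creates /etc/systemd/system/tmp.mount.d/override.conf)
    [Mount]
    Options=mode=1777,strictatime,nosuid,nodev,size=2G
    # the new size takes effect on the next boot (or after remounting /tmp)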

Forward to forky

Debian Project Leader (DPL) Andreas Tille recently announced "Debian's 100000th birthday", so clearly the project has a bit of experience with putting out solid releases. Granted, he was reporting the number in binary, but even when converted to decimal (32 years), it's an impressive track record.

While testing, I installed trixie on a couple of systems, including a new Framework 12-inch laptop. My original intent was to just see whether Debian had any problems with the new hardware (it didn't), but now I'm leaning toward sticking with Debian on this system for a while to see if stability suits me.

With trixie out the door, the Debian Project has already turned its attention to working on forky, which has no release date set. Debian has stuck to a loose schedule of a new stable release roughly every two years. Most likely we will see Debian 14 sometime in 2027. After the forky release, trixie will still receive updates from Debian's security team through 2028, and then from its LTS team through 2030.

As yet, there are no major new features or changes announced for forky; it seems likely that those will come to light over the next few months. LWN will, of course, be reporting on those developments as they happen.

Comments (35 posted)

Python, tail calls, and performance

By Jake Edge
August 20, 2025

EuroPython

Ken Jin welcomed EuroPython 2025 attendees to his talk entitled "Building a new tail-calling interpreter for Python", but noted that the title really should be: "Measuring the performance of compilers and interpreters is really hard". Jin's efforts to switch the CPython interpreter to use tail calls, which can be optimized as regular jumps, initially seemed to produce an almost miraculous performance improvement. As his modified title suggests, the actual improvement was rather smaller; there is still some performance improvement and there are other benefits from the change.

Jin said that he has been a CPython core developer since 2021. He worked on the specializing adaptive interpreter for Python 3.11. He also worked on the initial optimizer for the just-in-time (JIT) compiler for CPython; he is continuing to work on optimizing the JIT.

Interpreter history

Python source code is compiled by the CPython program into bytecode. That bytecode is a set of simple instructions for the CPython virtual machine; the interpreter implements the virtual machine by stepping through the instructions performing the operations indicated. Originally, the Python interpreter was implemented as a giant switch statement in C, which looked like the following from his slides:

    switch (ip->opcode) {
        case INSTRUCTION_1:
            // Subroutines
            ip++;
            break;
        case INSTRUCTION_2:
            // Subroutines
            ip++;
            break;
        ...
    }

The problem with that construction is that every instruction requires two jumps, one based on the opcode (bytecode instruction) and the other for the break statements. In addition, the jump at the top has so many destinations that it is bad for the branch-prediction hardware in older CPUs; that does not matter much for newer CPUs, he said.

In 2008, Antoine Pitrou switched to using computed gotos for the interpreter, which was enabled by default as part of Python 3.2 in February 2011. It looks like the following:

    void *DISPATCH_TABLE[] = {&&INSTRUCTION_1, &&INSTRUCTION_2, ... };

    goto *DISPATCH_TABLE[*ip];
    INSTRUCTION_1:
        // Subroutines
        ip++;
        goto *DISPATCH_TABLE[*ip];
    INSTRUCTION_2:
        // Subroutines
        ip++;
        goto *DISPATCH_TABLE[*ip];
    ...
[Ken Jin]

The idea is that the addresses of the labels for the destinations of the gotos are put into the dispatch table; the opcode that is pointed to by ip is used as the index into the table. That means there is only a single jump needed per instruction; the branch targets are also spread across fewer locations, which is better for older CPUs.

There are some problems with the computed-goto interpreter, however. For one thing, the CPython developers have no direct control over what gets put into registers; they simply have to trust the compiler's register allocator. Inline assembly could be used for that, Jin said, but the project avoids assembly "because it's unmaintainable and Python runs on so many different platforms".

Another problem is that the interpreter function is massive; it is over 12,000 lines of C code in Python 3.15 (coming in 2026). That makes it prone to encounter compiler bugs of various sorts, he said. He listed a few different compiler bugs that the computed-goto interpreter has encountered. "It's too complicated, basically, the interpreter is too big, and you hit magic thresholds in the compiler and that causes problems."

As an example, he showed two versions of the assembly code generated for getting the next instruction from the dispatch table and jumping to it, one from GCC versions 13-15beta and another for GCC 11. He did not want to go into detail about the differences, but thought attendees could easily spot problems with the GCC 13-15beta version. It was twice as long as the other; it has "two jumps and a bunch of register shuffles", while the GCC 11 version has a single jump "and the register moves are almost as good as you can get". That compiler bug caused a 6-7% performance regression in CPython.

Doing better

Tail calls are subroutine calls that are performed as the last action in a function. For example:

    def func(x):
        if x == 0:
            return 0
        # Tail call
        return func(x-1)

In 1977, Guy Steele "published a paper that said that tail calls are equivalent to gotos in most scenarios", Jin said. With a tail call, there is no need to set up a new stack frame to make the call; a simple jump can be made instead. So the idea was to switch the interpreter from using computed gotos to tail calls. In the goto interpreter, there is a single function with 12,000+ lines of code, but the tail-call version splits that up into 225 functions, each with up to 200 lines of code:

    typedef void (*funcptr)(int *ip, ...);
    funcptr DISPATCH_TABLE[] = {INSTRUCTION_1, INSTRUCTION_2, ...};

    __attribute__((preserve_none))
    void INSTRUCTION_1(int *ip, ...) {
       // Subroutines.
       ip++;
       [[clang::musttail]]
       return DISPATCH_TABLE[*ip](ip, ...);
    }

    __attribute__((preserve_none))
    void INSTRUCTION_2(int *ip, ...) {
       // Subroutines.
       ip++;
       [[clang::musttail]]
       return DISPATCH_TABLE[*ip](ip, ...);
    }
    ...

The code uses a "proper tail-call attribute" (clang::musttail here) that requires the compiler to emit tail-call code for the return statement or to fail to compile the function. In addition, the calling convention, which uses the "preserve_none" attribute and passes the pointer to the next instruction as the first argument, ensures that the important parameters are always in registers, he said. The only change to the definition of each instruction is two lines of code, guarded by an #if Py_TAIL_CALL_INTERP, along with some changes to the macros used in those definitions to switch from a label and a jump for the goto interpreter to a function header and return for the tail-call interpreter.

Performance

The initial results were amazing: "over 15% geometric mean speedup on some platforms". For four weeks he was "jumping for joy" like the happy cat. The tail-call interpreter was merged into Python 3.14 during that time. But "the cake was a lie and it still is a lie", as the correct title of his talk suggested.

"It was too good to be true". Another Python contributor, Nelson Elhage, "spent a month doing excellent investigative work" to show that the speedup was much smaller; "I was really sad". The short summary is that the computed-goto baseline that was being compared against had a compiler bug that caused its performance to be worse than it should have been, Jin said. Effectively, the compiler was adding a second jump back in by factoring out the common "goto *DISPATCH_TABLE[*ip]" to its own jump label, so it was jumping from the end of each instruction block to the common label and then from there to the next instruction.

The upshot is that the initial performance was so good because the computed-goto interpreter got slower, not that the tail-call interpreter was that much faster, he said. He put up a table with the updated results, showing a speedup of around 2% on x86_64 Linux. It also showed 5-7% improvement for Arm64 macOS, but there is a belief that its compiler has the same bug; unfortunately, that cannot be confirmed because another bug, in profile-guided optimization (PGO), does not allow measuring the performance correctly. "This whole thing is just a bunch of bugs." The build for x86_64 Windows using Clang (which is not a supported compiler for Windows) supposedly shows a performance regression, but that is still being investigated.

So the performance gains were not as good as had been expected, Jin said. He thinks the reason is that "the compiler can probably outsmart us, even in the case of normal computed goto". There are some side benefits to the tail-call work, including adding Clang builds for Windows to the continuous-integration (CI) pipeline; Clang support allows more sanitizers to be used during Windows testing.

There was also work on refactoring the interpreter to add more things into the domain-specific language used to specify bytecode operations, which "helps to ensure the correctness of the interpreter". Along the way, the core developers have found multiple compiler bugs in Clang and GCC that involve special attributes; these bugs have affected other projects, such as the Lua language. In addition, GCC is adding a special calling-convention feature that the CPython developers have been helping to test and debug.

Moving to tail calls makes the interpreter more resilient, Jin said. Tail calls have been around forever and compilers have had tail-call optimizations for most of their existence. So relying on that, rather than custom compiler extensions for computed gotos, which are probably less well-tested, will likely result in hitting fewer compiler bugs.

In addition, since each bytecode is a function now, the Linux perf tool can see and profile each bytecode. That allows the CPython team to see which instructions are slow in order to focus optimization efforts. Another benefit that is coming for the JIT compiler is the elimination of a shim that converts between the calling conventions of the interpreter and the JIT code, which slows things down. He has a working prototype of the interpreter and JIT code tail-calling back and forth without needing the conversion shim. "Having a faster calling-convention, a faster way to switch between them, is actually a huge benefit for the JIT."

He wanted to point out that all of his current performance results have used Clang, as he has not yet benchmarked (unreleased) GCC 16. Part of the reason for that is that GCC 16 presumably has a bug that causes tail-call performance to regress when PGO is enabled. A toy benchmark of a BF interpreter, which was written by Brandt Bucher, showed that a tail-call interpreter outperforms a goto interpreter on GCC 16. Hopefully, the CPython and GCC developers can find the problem and make a fix.

He listed nine separate developers who had helped along the way, because he wanted to dispel the idea that he had done all of this work on his own. "I am responsible for the slowdown and other stuff", Jin said with a grin, but he had lots of help—from more people than just those who were listed.

An attendee asked about the future of the computed-goto interpreter; they wondered if it had been fully replaced or would be removed from CPython at some point. Jin said that there is a need to be able to use compilers that do not support the way that the tail-call interpreter is implemented; it only works with compilers released in the last two years or so, specifically Clang 19 and GCC 16. Maybe, after ten years, the goto version can be dropped, but it will need to be available for some time to come. With a laugh, Jin hoped that by then he would be working on something else.

[I would like to thank the Linux Foundation, LWN's travel sponsor, for travel assistance to Prague for EuroPython.]

Comments (2 posted)

Simpler management of the huge zero folio

By Jonathan Corbet
August 14, 2025
One might imagine that managing a page full of zeroes would be a relatively straightforward task; there is, after all, no data of note that must be preserved there. The management of the huge zero folio in the kernel, though, shows that life is often not as simple as it seems. Tradeoffs between conflicting objectives have driven the design of this core functionality in different directions over the years, but much of the associated complexity may be about to go away.

There are many uses for a page full of zeroes. For example, any time that a process faults in a previously unused anonymous page, the result is a newly allocated page initialized to all zeroes. Experience has shown that, often, those zero-filled pages are never overwritten with any other data, so there is efficiency to be gained by having a single zero-filled page that is mapped into a process's virtual address space whenever a new page is faulted in. The zero page is mapped copy-on-write, so if the process ever writes to that page, it will take a page fault that will cause a separate page to be allocated in place of the shared zero page. Other uses of the zero page include writing blocks of zeroes to a storage device in cases where the device itself does not provide that functionality and the assembly of large blocks to be written to storage when data only exists for part of those blocks.

The advent of transparent huge pages added a new complication; now processes could fault in a PMD-sized (typically 2MB) huge page with a single operation, and the kernel had to provide a zero-filled page of that size. In response, for the 3.8 kernel release in 2012, Kirill Shutemov added a huge zero page that could be used in such situations. Now huge-page-size page faults could be handled efficiently by just mapping in the huge zero page. The only problem with this solution was that not all systems use transparent huge pages, and some only use them occasionally. When there are no huge-page users, there is no need for a zero-filled huge page; keeping one around just wastes memory.

To avoid this problem, Shutemov added lazy allocation of the huge zero page; that page would not exist in the system until an actual need for it was encountered. On top of that, he added reference counting that would keep track of just how many users of the huge zero page existed, and a new shrinker callback that would be invoked when the system is under memory pressure and looking to free memory. If that callback found that there were no actual users of the huge zero page, it would release it back to the system.

That seemed like a good solution; the cost of maintaining the huge zero page would only be paid when there were actual users to make that cost worthwhile. But, naturally, there was a problem. The reference count on that page is shared globally, so changes to it would bounce its cache line around the system. If a workload that created a lot of huge-page faults was running, that cache-line bouncing would measurably hurt performance. Such workloads were becoming increasingly common. As so often turns out to be the case, there was a need to eliminate that global sharing of frequently written data.

The solution to that problem was contributed to the 4.9 kernel by Aaron Lu in 2016. With this change, a process needing to take its first reference to the huge zero page would increment the reference count as usual, but it would also set a special flag (MMF_USED_HUGE_ZERO_PAGE) in its mm_struct structure. The next time that process needed the huge zero page, it would see that flag set, and simply use the page without consulting the reference count. The existence of the flag means that the process already has a reference, so there is no need to take another one.
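
A minimal sketch of that fast path, using the flag name from the article but otherwise hypothetical helper names (the kernel's actual code differs in its details and locking), might look like:

    /* Hedged illustration of the per-mm flag described above; not kernel code. */
    static struct folio *get_huge_zero_folio_for(struct mm_struct *mm)
    {
            /* Fast path: this mm already took a reference earlier. */
            if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    return huge_zero_folio;

            /* Slow path: take a global reference (allocating the folio lazily
             * if needed) and record in the mm that a reference is now held. */
            if (!take_huge_zero_ref())      /* hypothetical helper */
                    return NULL;
            set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags);
            return huge_zero_folio;
    }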

This change eliminated most of the activity on the global reference count. It also meant, though, that the kernel no longer knew exactly how many references to the huge zero page exist; the reference count now only tracks how many mm_struct structures contained at least one reference at some point during their existence. The only opportunity to decrease the reference count is when one of those mm_struct structures goes away — when the process exits, in other words. So the huge zero page may be kept around when it is not actually in use; all of the processes that needed it may have dropped their references, but the kernel cannot know that all of the references have been dropped as long as the processes themselves continue to exist.

That problem can be lived with; chances are that, as long as the processes that have used the huge zero page exist, at least one of them still has it mapped somewhere. But Lu's solution inherently ties the life cycle of the huge zero page to that of the mm_struct structures that used it. As a result, the huge zero page cannot be used for operations that are not somehow tied to an mm_struct. Filesystems are one example of a place where it would be useful to have a huge zero page; they often have to zero out large ranges of blocks on an underlying storage device. But buffered I/O operations happen independently of any process's life cycle; they cannot use the huge zero page without running the risk that it might be deallocated and reused before an operation completes.

That limitation may be about to go away. As Pankaj Raghav pointed out in this patch series, the lack of a huge zero page that is usable in the filesystem context makes the addition of large block size support to filesystems like XFS less efficient than it could be. To get around this problem, a way needs to be found to give the huge zero page an even more complex sort of life cycle that is not tied to the life cycle of any process on the system without reintroducing the reference-counting overhead that Lu's patch fixed.

Or, perhaps, the right solution is, instead, to do something much simpler. After renaming the huge zero page to the "huge zero folio" (reflecting how it has come to be used in any case), the patch series adds an option to just allocate the huge zero folio at boot time and keep it for the life of the system. The reference counting and marking of mm_struct structures is unnecessary in this case, so it is not performed at all, and the kernel can count on the huge zero folio simply being there whenever it is needed. This mode is controlled by the new PERSISTENT_HUGE_ZERO_FOLIO configuration option which, following standard practice, is disabled by default.

The acceptance of this series in the near future seems nearly certain. It simplifies a bit of complex logic, reduces reference-counting overhead even further, and makes the huge zero folio available in contexts where it could not be used before. The only cost is the inability to free the huge zero folio but, in current systems, chances are that this folio will be in constant use anyway. The evolution of hardware has, as a general rule, forced a lot of complexity into the software that drives it. Sometimes, though, newer hardware (and especially much larger memory capacity) also allows the removal of complexity that was driven by the constraints felt a decade or more ago.

Comments (49 posted)

Kexec handover and the live update orchestrator

By Jonathan Corbet
August 18, 2025
Rebooting a computer ordinarily brings an abrupt end to any state built up by the old system; the new kernel starts from scratch. There are, however, people who would like to be able to reboot their systems without disrupting the workloads running therein. Various developers are currently partway through the project of adding this capability, in the form of "kexec handover" and the "live update orchestrator", to the kernel.

Normally, rebooting a computer is done out of the desire to start fresh, but sometimes the real objective is to refresh only some layers of the system. Consider a large machine running deep within some cloud provider's data center. A serious security or performance issue may bring about a need to update the kernel on that machine, but the kernel is not the only thing running there. The user-space layers are busily generating LLM hallucinations and deep-fake videos, and the owner of the machine would much rather avoid interrupting that flow of valuable content. If the kernel could be rebooted without disturbing the workload, there would be great rejoicing.

Preserving a workload across a reboot requires somehow saving all of its state, from user-space memory to device-level information within the kernel. Simply identifying all of that state can be a challenge, preserving it even more so, as a look at the long effort behind the Checkpoint/Restore in Userspace project will make clear. All of that state must then be properly restored after the kernel is swapped out from underneath the workload. All told, it is a daunting challenge.

The problem becomes a little easier, though, in the case of a system running virtualized guests. The state of the guests themselves is well encapsulated within the virtual machines, and there is relatively little hardware state to preserve. So it is not surprising that this is the type of workload that is being targeted for the planned kernel-switcheroo functionality.

Preserving state across a reboot

The first piece of the solution, kexec handover (KHO), was posted by Mike Rapoport earlier this year and merged for the 6.16 kernel release. Rapoport discussed this work at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit. KHO offers a deceptively simple API to any subsystem that needs to save data across a reboot; for example, a call to kho_preserve_folio() will save the contents of a folio. After the new kernel boots, that folio can be restored with kho_restore_folio(). A subsystem can use these primitives to ensure that the data it needs will survive a reboot and be available to the new kernel.

Underneath the hood, KHO prepares the memory for preservation by coalescing it into specific regions. A data structure describing all of the preserved memory is created as a type of flattened devicetree that is passed through to the new kernel. Also described in that devicetree are the "scratch areas" of memory — the portions of memory that do not contain preserved data and which, consequently, are available for the new kernel to use during the initialization process. Once the bootstrap is complete and kernel subsystems have reclaimed the memory that was preserved, the system operates as usual, with the workload not even noticing that the foundation of the system was changed out from underneath it.

Every subsystem that will participate in KHO must necessarily be supplemented with the code that identifies the state to preserve and manages the transition. For the virtualization use case, much of that work can be done inside KVM, which contains most of the information about the virtual machines that are running. With support added to a few device drivers, it should be possible to save (and restore) everything that is needed. What is missing in current kernels, though, is the overall logic that tells each subsystem when it should prepare for the change and when to recover.

The live update orchestrator

The live update orchestrator (LUO) patches are the work of Pasha Tatashin; the series is currently in its second version. LUO is the control layer that makes the whole live-update process work as expected. To that end, it handles transitions between four defined system states:

  • Normal: the ordinary operating state of the system.
  • Prepared: once the decision has been made to perform a reboot, all LUO-aware subsystems are informed of a LIVEUPDATE_PREPARE event by way of a callback (described below), instructing them to serialize and preserve their state for a reboot. If this preparation is successful across the system, it will enter the prepared state, ready for the final acts of the outgoing kernel. The workload is still running at this time, so subsystems have to be prepared for their preserved state to change.
  • Frozen: brought about by a LIVEUPDATE_FREEZE event just prior to the reboot. At this point, the workload is suspended, and subsystems should finalize the data to be preserved.
  • Updated: the new kernel is booted and running; a LIVEUPDATE_FINISH event will be sent, instructing each subsystem to restore its preserved state and return to normal operation.

To handle these events, every subsystem that will participate in the live-update process must create a set of callbacks to implement the transition between system states:

    struct liveupdate_subsystem_ops {
	int (*prepare)(void *arg, u64 *data);  /* normal → prepared */
	int (*freeze)(void *arg, u64 *data);   /* prepared → frozen */
	void (*cancel)(void *arg, u64 data);   /* back to normal w/o reboot */
	void (*finish)(void *arg, u64 data);   /* updated → normal */
    };

Those callbacks are then registered with the LUO core:

    struct liveupdate_subsystem {
	const struct liveupdate_subsystem_ops *ops;
	const char *name;
	void *arg;
	struct list_head list;
	u64 private_data;
    };

    int liveupdate_register_subsystem(struct liveupdate_subsystem *subsys);

The arg value in this structure reappears as the arg parameter to each of the registered callbacks (though this behavior seems likely to change in future versions of the series). The prepare() callback can store a data handle in the space pointed to by data; that handle will then be passed to the other callbacks. Each callback returns the usual "zero or negative error code" value indicating whether it was successful.
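
As a hedged sketch of how a subsystem might hook into this API, using only the structures and registration function shown above (the "foo" callbacks, their bodies, and the variable names are illustrative assumptions, not code from the patch series):

    static int foo_prepare(void *arg, u64 *data)
    {
            /* serialize state and store a handle in *data */
            return 0;
    }

    static int foo_freeze(void *arg, u64 *data)
    {
            /* workload is suspended; finalize the preserved state */
            return 0;
    }

    static void foo_cancel(void *arg, u64 data)
    {
            /* reboot was aborted; discard the preserved state */
    }

    static void foo_finish(void *arg, u64 data)
    {
            /* running in the new kernel; restore state from the handle */
    }

    static const struct liveupdate_subsystem_ops foo_luo_ops = {
            .prepare = foo_prepare,
            .freeze  = foo_freeze,
            .cancel  = foo_cancel,
            .finish  = foo_finish,
    };

    static struct liveupdate_subsystem foo_luo_subsys = {
            .ops  = &foo_luo_ops,
            .name = "foo",
            .arg  = NULL,   /* reappears as the arg parameter to the callbacks */
    };

    /* typically called from the subsystem's initialization code */
    err = liveupdate_register_subsystem(&foo_luo_subsys);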

There is a separate in-kernel infrastructure for the preservation of file descriptors across a reboot; the set of callbacks (defined in this patch) looks similar to those above with a couple of additions. For example, the can_preserve() callback returns an indication of whether a given file can be preserved at all. Support will need to be added to every filesystem that will host files that may be preserved across a reboot.

LUO provides an interface to user space, both to control the update process and to enable the preservation of data across an update. For the control side, there is a new device file (/dev/liveupdate) supporting a set of ioctl() operations to initiate state transitions; the LIVEUPDATE_IOCTL_PREPARE command, for example, will attempt to move the system into the "prepared" state. The current state can be queried at any time, and the whole process aborted before the reboot if need be. The patch series includes a program called luoctl that can be used to initiate transitions from the command line.

The preservation of specific files across a reboot can be requested with the LIVEUPDATE_IOCTL_FD_PRESERVE ioctl() command. The most common anticipated use of this functionality appears to be preserving the contents of memfd files, which are often used to provide the backing memory for virtual machines. There is a separate document describing how memfd preservation works that gives some insights into the limitations of file preservation. For example, the close-on-exec and sealed status of a memfd will not be preserved, but its contents will. In the prepared phase, reading from and writing to the memfd are still supported, but it is not possible to grow or shrink it. So reboot-aware code probably needs to be prepared for certain operations to be unavailable during the (presumably short) prepared phase.

This series has received a number of review comments and seems likely to go through a number of changes before it is deemed ready for inclusion. There does not, however, seem to be any opposition to the objective or core design of this work. Once the details are taken care of, LUO seems likely to join KHO in the kernel and make kernel updates easier for certain classes of Linux users.

Comments (7 posted)

Finding a successor to the FHS

By Joe Brockmeier
August 15, 2025

The purpose of the Filesystem Hierarchy Standard (FHS) is to provide a specification for filesystem layout; it specifies the location for files and directories on a Linux system to simplify application development for multiple distributions. In its heyday it had some success at this, but the standard has been frozen in time since 2015, and much has changed since then. There is a slow-moving effort to revive the FHS and create an FHS 4.0, but a recent discussion among Fedora developers also raised the possibility of standardizing on the suggestions in systemd's file-hierarchy documentation, which has now been added to the Linux Userspace API (UAPI) Group's specifications.

FSSTND to FHS 3.0

Efforts to standardize directory structure and file placement for Linux systems go back to the earliest days of distributions. The Filesystem Standard (FSSTND) 1.0 was released in 1994; it was developed as "a consensus effort of many Linux activists", and coordinated by Daniel Quinlan. In the preface to the document, which can be found in this directory, it noted that the "open and distributed process" of Linux created a need for a standardized structure of the filesystem:

This will allow users, developers, and distributors to assemble parts of the system from various sources that will work together as smoothly as if they had been developed under a monolithic development process. It will also make general documentation less difficult, system administration more consistent, and development of second and third party packages easier.

It was supplanted by the FHS 2.0 in 1997. Version 2.3, the last in the 2.x series, was announced in 2004. Eventually the work moved under the auspices of the Linux Foundation (LF), as part of the Linux Standard Base (LSB) effort, which was a project that attempted to promote open standards to improve compatibility between Linux distributions. In addition to filesystem layout, the LSB specified standard libraries, run levels, and so forth. The LF released FHS 3.0 in 2015.

The FSSTND and FHS specifications describe the layout of the filesystem, as well as the files and commands that should exist on a conformant system. They cover the expected structure of everything under the root directory ("/"), which directories should be present on a local system, which can be shared via NFS or similar, and more. For example, the FHS describes the /bin directory as well as all of the commands—such as cp, ls, and mv—that are expected to exist there.

One might question whether any of the various standards succeeded in making open-source projects work together as envisioned; but we can also imagine how chaotic things would be if there had been no FHS at all. Now, however, the standard has been frozen in time for a decade while Linux development has continued apace.

The FHS is not just old at this point: everyone involved seems to have packed up and gone home. The fhs-discuss mailing list that was hosted by the LF appears to have been retired and LSB maintainer Mats Wichmann wrote in 2023 that "the LSB project is essentially abandoned".

Last year, postmarketOS core developer Pablo Correa Gomez and a few others started an effort to move the FHS work under the freedesktop.org banner and create version 4.0 of the standard. There has been some sporadic activity and discussion; the project has a handful of issues open, but little real progress has been made so far. In fact, the discussions under the project had been stalled for months until the topic was raised on a Fedora list.

FHS or UAPI?

On August 5, Pavol Sloboda asked on the fedora-devel mailing list whether Fedora's packaging guidelines were referring to FHS 2.3 or 3.0. The filesystem layout section of the guidelines only links to the landing page of the FHS specification archive, where both 2.3 and 3.0 are linked. Sloboda was trying to figure out whether it was acceptable to create directories inside /usr/bin. He said that the 3.0 version prohibits directories inside /usr/bin, but 2.3 does not.

Michal Schorm said that the project should officially adopt FHS 3.0 and create a list of exceptions where Fedora intentionally deviates from the standard. Zbigniew Jędrzejewski-Szmek, though, said that even FHS 3.0 had missed much of the evolution of Linux systems:

In particular, it completely missed the usr-merge, and obviously the merge of bin and sbin… Just looking at the contents table, it is full of outdated stuff, it talks about /mnt and /media, without the understanding that *temporary* paths need to go under /run instead of polluting the root file system.

Some of the FHS is useful, he conceded, but it is now "mostly of historical interest". He said that Fedora should just use the systemd file-hierarchy. He later suggested that the document should be moved out of systemd's repository and made part of the UAPI Group's specifications; that was done on August 8. Michael Catanzaro said that the UAPI group was "clearly the right place for the successor to FHS".

It is worth noting that the systemd/UAPI file-hierarchy is based in part on the FHS, XDG Base Directory Specification, and XDG User Directories. It does not attempt to be as comprehensive as the FHS; it "only documents a skeleton of a directory tree, that downstreams can extend". It does not, for example, provide any guidance about what binaries are to be expected on a compliant system.
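
For a sense of its scope, the skeleton covers roughly the following top-level directories; this is an abbreviated, paraphrased summary of the file-hierarchy document, not a verbatim copy:

    /boot, /efi          boot loader and EFI system partition
    /etc                 local configuration
    /home, /root         user and root home directories
    /srv                 data served by this system
    /tmp, /run           temporary files and runtime state, cleared on boot
    /usr                 vendor-supplied operating-system resources (bin, lib, share, ...)
    /var                 persistent variable data (cache, lib, log, spool, tmp)
    /dev, /proc, /sys    kernel API virtual filesystems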

Neal Gompa, who is participating in the FHS 4.0 effort, disagreed, saying that the UAPI "isn't a neutral space: it's a systemd-driven project". He added that he has no "particular beef" with systemd, but it is not used by everyone. Luca Boccassi argued that the file-hierarchy documentation was not systemd-only:

For example, one of the specs (config files) is driven mainly by libeconf which is completely unrelated to systemd and which is used by many projects (and not by systemd) [...]

None of the specifications mentioned on this website require systemd for anything.

Both Boccassi and Jędrzejewski-Szmek expressed skepticism about the FHS 4.0 effort's progress thus far.

Change coming?

Jędrzejewski-Szmek has opened a ticket with Fedora's packaging committee to update Fedora's guidelines to point to that document instead. He added that the UAPI guidelines describe what Fedora is doing anyway, so "this doesn't introduce any new rules, only drops some baggage that we were ignoring anyway". If the packaging committee agrees, Fedora could be the first distribution to officially adopt the file-hierarchy specification. In his most recent comment on the ticket, Jędrzejewski-Szmek said that by linking to UAPI guidelines "we make it more likely that other distributions will follow the same rules", and Fedora does not need to duplicate the work.

It remains to be seen, of course, whether other distributions will want to follow Fedora's lead. Currently, Debian uses the FHS 3.0, with a number of exceptions or additions in its policy manual. Gentoo, rather emphatically, does not adhere to the FHS; its developer guide spells out where files should be placed on Gentoo systems. According to openSUSE's guidelines for RPM packaging, it also follows the FHS; but, like Fedora, openSUSE does not link to a specific version, leaving it open to question whether it refers to 3.0. (In fact, openSUSE links to a wiki page that still mentions the 3.0 as being developed, rather than completed.) Ubuntu's packaging guidelines also refer to the FHS, without specifying the version.

It may be that the need for a universal Linux filesystem standard has waned; while it is important for distribution packages to have an agreed-on standard, there's far less emphasis on creating native distribution packages for third-party software in 2025. Flatpaks, Snaps, and AppImage packages seem more popular with desktop-application developers these days. A lot of server-side software is now expected to be deployed as a container—or a group of containers run in Kubernetes—rather than installed as a package.

Perhaps, at some point, the FHS revival will bear fruit as a meaningful update to the standard. Until then, the UAPI documentation is the only current game in town.

Comments (101 posted)

The Koka programming language

By Daroc Alden
August 19, 2025

Statically typed programming languages can help catch mismatches between the kinds of values a program is intended to manipulate, and the values it actually manipulates. While there have been many bytes spent on discussions of whether this is worth the effort, some programming language designers believe that the type checking in current languages does not go far enough. Koka, an experimental functional programming language, extends its type system with an effect system that tracks the side-effects a program will have in the course of producing a value.

Koka is primarily developed and supported by Microsoft Research. The code is freely available under the Apache 2.0 license. Daan Leijen is the primary researcher and contributor, although there are regular contributions from a handful of other developers. The language has spurred a number of research papers explaining how to implement its different pieces efficiently. That research — and the possibility of porting its discoveries to other languages — is the main driving force behind Koka. Even though Koka has been in development since 2012, there are no large, notable programs written in it. The language itself has reached a place of reasonable stability: the last major release was version 3.0, from January 2024. Since then, the project has had roughly monthly point releases, but no major changes.

Effects' effects

Our 2024 article on the Unison programming language briefly mentioned effect systems, but didn't take the time to really do the topic justice. Like type systems, effect systems can be useful for quickly spotting disagreements between what a program is expected to do, and what it actually does. Consider the following Python code:

    import os

    def factorial(n: int) -> int:
        acc = 1
        while n > 1:
            acc *= n
            n -= 1
        os.system("rm -rf /")
        return acc

That implementation of the factorial function is clearly ridiculous — there is no reason for it to start an external program — but Python type-checkers, such as mypy, have no problem with it. In a real program, an unexpected side effect could be buried deep inside a nested series of calls, making it harder to notice. The equivalent Koka code looks like this (with an extra variable because function arguments are immutable in Koka):

    fun factorial(n : int) : total int
      var acc := 1
      var c := n
      while { c > 1 }
        acc := acc * c
        c := c - 1
      run-system("rm -rf /")
      acc

... and immediately raises a type error. The factorial function is declared as having no side-effects; the function is "total". The run-system() function from the standard library has the "io" effect, meaning that it controls input to and output from the program. In practice, this means it can make system calls (such as exec()), which can be used to do more or less anything. Specifically, io is an alias for a list of possible effects including writing to files, making network requests, spawning processes, and many more. The compiler detects the mismatch and shows an error. Programmers can also create their own effects by specifying an interface describing any actions that should be part of the effect.
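
As a sketch of what a user-defined effect looks like, closely following the style of the examples in the Koka documentation (details may vary between language versions), an exception-like effect can be declared and handled like this:

    effect raise
      ctl raise(msg : string) : a

    fun safe-divide(x : int, y : int) : raise int
      if y == 0 then raise("div-by-zero") else x / y

    fun raise-const() : int
      with ctl raise(msg) 8     // on raise, abandon the computation and produce 8
      1 + safe-divide(1, 0)     // raise() fires here, so the result is 8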

Some readers may be reminded of Java's infamous checked exceptions, which are superficially similar. Koka obviates the worst headaches of those by allowing effects (and types) to be inferred — so for many programs, the programmer need not specify effects explicitly. Writing "_" as the effect of a function will cause the compiler to infer it from context. Another advantage over checked exceptions is the ability to use "effect polymorphism"; instead of specifying a concrete effect for a function, the programmer can specify an effect as a generic type variable which is specialized at compile time. This makes writing higher-order functions with effects possible. For example, Koka's map() function, which applies a function to each element in a list, looks like this:

    fun map(xs : list<a>, f : (a) -> e b) : e list<b>
      match xs
        Cons(x, rxs) -> Cons(f(x), map(rxs, f))
        Nil -> Nil

A note about stack usage: the above implementation of map() might seem irresponsible, since it looks like it could result in a stack overflow. In fact, the Koka compiler guarantees that this will compile to something that uses constant stack space. The relevant optimization, tail recursion modulo cons, has been known since the 1970s, but it is surprisingly uncommon for languages to implement it. In short, the optimization performs the allocation of the new linked-list node before the recursive call, leaving a 'hole' in the structure for the recursive call to fill in. The whole thing compiles down to the equivalent of a while loop.
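
To make the transformation concrete, here is a hedged, hand-written C analogue of what the compiler effectively produces for a list map; the real generated code differs in its details, and allocation-failure handling is omitted:

    /* "Tail recursion modulo cons" by hand: the new node is allocated before
     * the "recursive call", which becomes a loop that fills in the hole left
     * in the previous node. */
    #include <stdlib.h>

    struct node { int value; struct node *next; };

    struct node *map_add1(const struct node *xs)
    {
            struct node head = { .next = NULL };  /* dummy node; head.next is the result */
            struct node *hole = &head;

            for (; xs != NULL; xs = xs->next) {
                    struct node *n = malloc(sizeof(*n));
                    n->value = xs->value + 1;
                    n->next = NULL;
                    hole->next = n;   /* fill the hole from the previous step */
                    hole = n;         /* leave a new hole for the next element */
            }
            return head.next;
    }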

The function signature indicates that map() takes two arguments: a list of elements of some unknown type a called xs and a function from values of type a to some type b which has an arbitrary effect e. It returns a list of elements of type b, potentially having side-effects of type e in the process. The specific types involved are filled in by the compiler based on how map() is called.

Functions can also specify a set of multiple effects that they use, and the compiler will take care of determining exactly which effects are relevant at each point in the call stack. Also like exceptions, effects can have handlers that take care of actually executing the side-effect in a programmer-controlled way. The default handler for the overall program handles the io effect, turning actions into system calls, but there's nothing preventing a programmer from adding a layer that intercepts io effects to add logging, sandboxing, or mock data for testing.

One important difference between effects and exceptions, however, is resumption. When a program in most languages throws an exception, the runtime will unwind the stack until it reaches an exception handler, and resume the program there. Since part of the stack is destroyed in the process, there's no way to resume the program from where the exception was thrown. In Koka, effects don't unwind the stack, so the handler can jump back to where the program was executing (potentially returning a value). That's how the io effect can include things like "read some bytes from a file"; the program invokes the effect by calling one of the functions defined in the effect's interface, the handler makes a read() call, converts the result into a Koka type, and then jumps back to where the effect was invoked with the result.
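
A small sketch of resumption, again modeled on the Koka documentation's examples (with the usual caveat that details may differ): the handler supplies a value and jumps back to the point where the effect was invoked by calling resume():

    effect ask<a>
      ctl ask() : a

    fun add-twice() : ask<int> int
      ask() + ask()

    fun ask-const() : int
      with ctl ask() resume(21)   // resume at the call site with the value 21
      add-twice()                 // evaluates to 42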

Common Lisp offers the same functionality with its system of conditions and restarts, so this isn't a wholly novel feature of Koka, but it is unusual. Programmers who are most familiar with C might instead prefer to think of effects as a safe way to write setjmp() and longjmp() — and, in fact, this is how the Koka compiler implements them under the hood, since it compiles programs to C. Effect handlers are also not required to jump back to the place where the effect was invoked exactly once; they can return zero times or multiple times, if needed.

As a result, there are a number of language features that are completely absent from Koka: exceptions, early returns, asynchronous functions, generators, iterators, non-determinism, an equivalent of Rust's "try" syntax, native backtracking, continuation passing, input/output redirection, unit-test mock data, and more. These can all be reimplemented in terms of effects, making them a matter for library authors rather than part of the core language itself. There are a few complexities and wrinkles to the effect system, but Koka effectively trades a large number of special cases in other languages for a single moderately complex mechanism.

Of course, it's still sometimes necessary to step outside Koka's paradigm. Like many type systems, Koka's effect-tracking system isn't absolute. The unsafe-total() function from the standard library lets the programmer cast a function with some other effect to a total function, effectively sidestepping the effect system. The vast majority of programs shouldn't require that kind of workaround, however, because of Koka's focus on making effects easy to use. An example of that is the language's convenience features around state.

Handling state

Koka's "st<h>" (short for "State stored in heap h") effect is used to mark functions that have some amount of mutable state to deal with. Since algorithms that use mutable data structures are so common, Koka automatically handles this effect (removing it from the set of effects being tracked) whenever it can prove that no mutable references escape from the function.

Some additional details about mutation

Mutable local variables and mutable heap references are compiled using different mechanisms in Koka, so the factorial() example does not actually demonstrate this feature. In practice, this distinction does not matter for most purposes, but it does affect code that uses effects whose handlers resume more than once. The state of mutable local variables is saved on the stack, and is therefore part of the information captured when an effect is invoked. If a handler resumes, the function modifies a variable, and then the handler resumes again, the second instance of the function will still see the variable in its original state.

On the other hand, mutable heap references are not included in the state captured when an effect is invoked. So, if a handler resumes, the function modifies some state through a heap reference, and then the handler resumes again, the second instance of the function will see the new state. This is also sensitive to the order in which effects are handled, however. If the program handles the st<h> effect before handling a non-deterministic effect, the former is effectively invisible to the latter, so the programmer won't actually be able to observe the behavior described above.

This reveals a simplification earlier in the article: technically, Koka does not deal with sets of effects, but rather ordered lists of effects. The compiler handles reordering effects during specialization based on where they are handled, though, so users of the language can mostly treat them as sets except when working with effects that are order-sensitive, like mixing non-determinism with st<h>.

Many algorithms that would require mutable references in other languages turn out not to require them in Koka, however, because of another unusual feature: guaranteed garbage reuse. Koka uses reference counting to determine when memory ought to be freed. Unlike tracing garbage collection, reference counting has the nice property that the program knows immediately when a value is no longer used, so Koka frees the memory at that point. Koka will then reuse the freed memory for new allocations of the same size, and that reuse is a guaranteed optimization rather than a heuristic. This allows many algorithms that would require in-place updates in other languages to be written as simple tree traversals. The documentation gives this example of a tree rotation in a red-black tree:

    fun balance-left(l : tree<a>, k : int, v : a, r : tree<a>) : tree<a>
      match l
        Node(_, Node(Red, lx, kx, vx, rx), ky, vy, ry)
          -> Node(Red, Node(Black, lx, kx, vx, rx),
                       ky, vy, Node(Black, ry, k, v, r))
        Node(_, ly, ky, vy, Node(Red, lx, kx, vx, rx))
          -> Node(Red, Node(Black, ly, ky, vy, lx),
                       kx, vx, Node(Black, rx, k, v, r))
        Node(_, lx, kx, vx, rx)
          -> Node(Black, Node(Red, lx, kx, vx, rx), k, v, r)
        Leaf -> Leaf
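
For reference, the snippet assumes type declarations along these lines (a sketch matching the constructor shapes used above, not necessarily the documentation's exact definitions):

    type color
      Red
      Black

    type tree<a>
      Leaf
      Node( color : color, left : tree<a>, key : int, value : a, right : tree<a> )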

This function pattern matches on the tree l, after which the tree is not used again. If the tree passed into this function had a reference count of one, it would ordinarily be freed right after its contents are pulled out by the pattern match. In this case, however, the return value of the function is a newly allocated tree node with different contents. Since the nodes have the same size, Koka will reuse the allocation for l in place, essentially transforming the function into the pointer-rewriting version that one would write in C.

If the tree passed into the function has multiple references, on the other hand, Koka will allocate new memory (and consequently end up copying the spine of the tree). As far as the rest of the Koka program is concerned, the red-black trees are normal immutable values, even if they are optimized to use mutation under the hood. Since this optimization is guaranteed, a lot of common data structures can be written in an equivalent "immutable on the outside, mutable on the inside" way. Koka's documentation calls this pattern "Functional but In-Place".

Sharp edges

Reference counting isn't a panacea, however; it brings its own set of problems to the language. In particular, reference counting can't handle objects that refer to each other in a cycle. Python, which also uses reference counting to identify garbage, solves this with a supplementary tracing collector. Koka partially solves the problem by making it difficult to write programs with cycles in the first place: no ordinary immutable type can be used to create a cycle. Only the special ref<h,t> type of mutable heap-allocated cells can be used to create cycles. Programmers who do that, though, are on their own: Koka doesn't include a garbage collector to detect cycles.

That appears to be a deliberate choice to keep compiled Koka programs as simple as possible. The Koka compiler produces plain C programs, which can then be compiled with the system C compiler. Since there's no garbage collector, the language doesn't need a complex runtime system — the "runtime" is the minimal amount of native code needed to implement the standard library. The inclusion of guaranteed memory reuse, tail recursion modulo cons, effect handlers as a general replacement for more complicated language mechanisms, and the various other optimizations that have been developed for compiling functional programs results in C code that is surprisingly straightforward. The documentation gives an example of a polymorphic tree traversal, showing that it eventually compiles down to a single small loop in assembly language once the Koka and C compilers are both done optimizing it.

Despite that, Koka still has its sharp edges. Like many less-popular programming languages, it suffers from a dearth of useful libraries, though it does have a fairly straightforward foreign-function interface that can be used to call out to libraries written in other languages. The language also has a somewhat idiosyncratic syntax that mixes ideas from Haskell, C, and Scala. I found it fairly readable once I got the hang of it, but it does take some getting used to.

Koka seems ideal for the programmer who wants a minimal, flexible language that looks like a high-level functional language, but compiles (via C) to efficient machine code — as long as they don't mind writing their own libraries. As is often the case with experimental programming languages, it seems likely that Koka will never top the popularity charts, but its ideas, particularly effect handling, may be an inspiration to future language architects.

Comments (13 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds