
LWN.net Weekly Edition for February 5, 2026

Welcome to the LWN.net Weekly Edition for February 5, 2026

This edition contains the following feature content:

  • Sigil simplifies creating and editing EPUBs
  • Compiling Rust to readable C with Eurydice
  • Sub-schedulers for sched_ext
  • Modernizing swapping: introducing the swap table

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Sigil simplifies creating and editing EPUBs

By Joe Brockmeier
February 4, 2026

Creating an ebook in EPUB format is easy, for certain values of "easy". All one really needs is a text editor, a few command-line utilities, a working knowledge of XHTML and CSS, and an understanding of the format's structure and required boilerplate. Creating a well-formatted and attractive ebook is a bit harder, but it can be made easier with an application custom-made for the purpose. Sigil is an EPUB editor that provides the tooling authors and publishers may be looking for.

About Sigil

Sigil is basically a one-stop shop for creating and working with publications in EPUB format; it supports both the EPUB 2 and EPUB 3 standards. Users will generally not want to create new publications using EPUB 2, since it was superseded by EPUB 3 in 2011, but version 2 support may be useful when working with legacy publications or creating ebooks for people who may not have up-to-date ebook readers.

It is a multi-platform desktop application that uses the Qt 6 framework and is written primarily in C++, with a fair amount of C and Python as well. The GPLv3-licensed project got its start in 2009, when Strahinja Marković began working on it as part of his computer science course work.

Since then, Sigil has had a few periods of inactivity, one of which led Kovid Goyal to create an ebook editor for the Calibre project in 2013, when it appeared that Sigil had gone dormant for good. Work on Sigil resumed in 2014 when Kevin Hendricks and Doug Massay took over maintenance of the project, which is hosted on GitHub. The pair have continued to work on Sigil since then; the project is actively maintained and has published eight releases over the course of 2025. The application is fairly mature at this point, so most of the work consists of minor new features, bug fixes, and updates to keep up with changing dependencies.

Documentation

The user guide for Sigil is provided as an EPUB, of course. There is also a version of the user guide that is readable online; it is the same EPUB content but served by Readium, a project for displaying EPUB content in web browsers.

It is always a good idea to start with a project's documentation, but it's an especially good idea with Sigil if one doesn't have much experience tinkering with EPUBs. The guide not only provides a reference to Sigil's features, but also includes a number of short tutorials. For example, it has useful guidance on converting content from LibreOffice (ODF) or Microsoft Word (DOCX) formats into XHTML, as Sigil does not import those formats directly.

I'd recommend starting with the tutorials and then doubling back and skimming through the "Features" section, which serves as a good reference of Sigil's interface, tools, and capabilities. The user guide EPUB is also useful as a test document for working with Sigil. It provides a good example of an EPUB's structure, well-formatted CSS and XHTML, and other metadata documents.

Using Sigil

At first glance, Sigil looks like a basic text editor with syntax highlighting for code—which it is, but there's a bit more to it than that. The default layout includes Sigil's menu bar and a number of toolbars at the top of the application window. Below the toolbars, Sigil displays the "Book Browser" pane on the left-hand side, a text editor in the middle, and a preview pane on the right-hand side that shows what the final page layout would look like. That is, assuming that a reader's ebook reader renders an EPUB's styles the same way that Sigil does, which is difficult to predict.

[Sigil interface]

Users can hide or move the various toolbars and preview panes, except for the middle text-editor pane. The panes can also be popped out of the main window into a separate floating window.

It's worth noting that Sigil is not a good choice for actually writing content or collaborating on a publication before it is ready to be turned into an EPUB. Sigil is designed for assembling content into a book, or massaging content that is already in EPUB format. Its editor is not a pleasant writing environment or particularly ergonomic for editing prose; it is really only suitable for light editing and adding markup to the text.

That is not to imply that Sigil does not have any tools to assist with copy editing, though; it has a decent spell checker, and supports Perl-compatible regular expressions (PCRE) for searching and replacing text via the PCRE2 library. There is also a "Saved Searches" utility for users to create and save search-and-replace operations that are used often. For example, Sigil comes with a number of example searches for removing empty paragraphs, trailing spaces, and for converting characters like the em dash ("—") to XHTML entities ("&mdash;").

The project does provide a separate XHTML editor called PageEdit, which allows users to create content in a WYSIWYG editor. It is much more suitable for writing and laying out a chapter of a book or similar before importing it into Sigil for assembly into a final publication. Users can also use the external editor of their choice with Sigil; those who prefer to use Emacs, nano, Vim, or others can do so. PageEdit may also be worth a look for those who would like a basic WYSIWYG HTML editor for other purposes.

[PageEdit]

What Sigil is best at is assembling content into EPUB format, refining it to be more usable for the reader, and checking the results to ensure that the publication conforms to the EPUB standard (though a plugin is required for version 3). It is not terribly difficult to generate XHTML files for chapters using one's favorite text editor or word processor, for example, but creating a table of contents or index is another story. Happily, Sigil has easy-to-use tools for doing both.

Sigil can generate a table of contents from headings in a book, so each heading tag (<h1> through <h4>) becomes an entry in the table of contents. If that is not quite what is desired—perhaps including <h3> and <h4> headings would result in too many entries—users can choose to only include headings of a certain level. It's also possible to edit the table manually and add or remove entries.

[Table of contents editor]

Creating an index is a more manual process, but still much easier with Sigil's help. Users can mark any text, such as a person's name or references to specific programs, and select it to be included in the index. Alternatively, a user can include all instances of a word or text string in the index by adding it in Sigil's "Index Editor" tool. It's best to reserve that for rarer words or names, though. Once all of the text is marked up or added via the editor to be included in the publication's index, Sigil can then automatically create an index that refers back to each entry. This can be recreated as often as needed.

Plugins and automation

Sigil supports plugins, written in Python, to extend its functionality. There is an API guide (also available as an EPUB) that includes information on the expected structure of a plugin, and an example plugin.

It may not be necessary to write a plugin, however. Many plugins already exist that provide features that might be wanted, such as importing DOCX files, validating an EPUB using the W3C's EPUBCheck utility, and improving an EPUB's accessibility features. There is an index of plugins available on the MobileRead forum. The EPUBCheck plugin, in particular, is something users should add immediately after installing Sigil to identify errors and help ensure that an EPUB conforms to the standard.

If one is processing a lot of publications with Sigil, it's likely that there will be combinations of operations that are ripe for automation; naturally, Sigil has a tool for that as well. Users can create automations that consist of combinations of tool operations (e.g., deleting unused CSS styles) and plugins.

One of the features I most appreciate is checkpoints; a checkpoint is a snapshot of an EPUB at a point in time. A user can create a checkpoint, work on their EPUB, and then compare the current state to the prior checkpoints to see what has changed. Sigil will show what files have been added or deleted, and what files have been changed. Users can also inspect the diffs of changes in a file between checkpoints. Best of all, users can restore from previous checkpoints if needed—so if a file has been accidentally deleted, or a bunch of changes need to be reverted, checkpoints can be quite useful. They are, however, an all-or-nothing affair. Sigil doesn't support, for instance, just restoring one set of edits or one file from a checkpoint if multiple files have been deleted. And, as Sigil will warn, restoring a checkpoint overwrites the current status of a project—so it's a good idea to take checkpoints often and save one's work in a new EPUB before restoring from a checkpoint.

The most recent release, version 2.7.0, improved navigation in Sigil's metadata editor, updated the menu for user-created automated actions, and included a number of bug fixes. The Sigil project provides builds for Linux in the AppImage format with each release. The project also points users to Flathub for a Flatpak build, but it apparently has not confirmed ownership of the Flatpak, so it shows up as an unverified app. Most major Linux distributions provide Sigil packages as well.

The Sigil project's support community gathers on the MobileRead forum; the project developers seem fairly active there. Users can report bugs and other problems using GitHub Issues.

Generally, I've found Sigil easy to work with, if a bit clunky in places. As an example, when using the table-of-contents (TOC) editor, it is not possible to select multiple entries at once to change the heading level or remove them from the TOC entirely; it can only be done one at a time. But, a few minor gripes aside, it is a suitable tool for making and working with EPUBs.

Comments (3 posted)

Compiling Rust to readable C with Eurydice

By Daroc Alden
January 30, 2026

A few years ago, the only way to compile Rust code was using the rustc compiler with LLVM as a backend. Since then, several projects, including Mutabah's Rust Compiler (mrustc), GCC's Rust support (gccrs), rust_codegen_gcc, and Cranelift have made enormous progress on diversifying Rust's compiler implementations. The most recent such project, Eurydice, has a more ambitious goal: converting Rust code to clean C code. This is especially useful in high-assurance software, where existing verification and compliance tools expect C. Until such tools can be updated to work with Rust, Eurydice could provide a smoother transition for these projects, as well as a stepping-stone for environments that have a C compiler but no working Rust compiler. Eurydice has been used to compile some post-quantum-cryptography routines from Rust to C, for example.

Eurydice was started in 2023, and includes some code under the MIT license and some under the Apache-2.0 license. It's part of the Aeneas project, which develops several different tools for applying formal verification to Rust code. The various Aeneas projects are maintained by a group of people employed by Inria (France's national computer-science-research institution) and Microsoft, but they do accept outside contributions.

Eurydice follows the same general structure as many compilers: take a Rust program, convert it into an intermediate representation (IR), modify the IR with a series of passes, and then output it as code in a lower-level language (in this case, C). Jonathan Protzenko, the most prolific contributor to Eurydice, has a blog post where he explains the project's approach. Unlike other compilers, however, Eurydice is concerned with preserving the overall structure of the code while removing constructs that exist in Rust but not in C. For example, consider this Rust function that calculates the least common multiple of two numbers using their greatest common divisor:

    fn gcd(a: u64, b: u64) -> u64 {
        if b == 0 {
            a
        } else {
            gcd(b, a%b)
        }
    }

    fn lcm(a: u64, b: u64) -> u64 {
        (a * b) / gcd(a, b)
    }

Here's how Eurydice compiles those functions to C:

    uint64_t example_gcd(uint64_t a, uint64_t b)
    {
        uint64_t uu____0;
        if (b == 0ULL)
        {
            uu____0 = a;
        }
        else
        {
            uu____0 = example_gcd(b, a % b);
        }
        return uu____0;
    }

    uint64_t example_lcm(uint64_t a, uint64_t b)
    {
        uint64_t uu____0 = a * b;
        return uu____0 / example_gcd(a, b);
    }

Whether this C code counts as "readable" is probably a matter of individual taste. It does, however, preserve the structure of the code. Even the evaluation order of the original is preserved by adding extra temporary variables (uu____0 in example_lcm()) where necessary to define an order. (Rust guarantees that if the multiplication overflows and causes a panic, that will happen before any side effects caused by calling example_gcd(), but C only guarantees that if the multiplication is performed in a separate statement.) Compiling the same functions with rustc results in a pair of entangled loops filled with bit-twiddling operations, instead — which is appropriate for machine-code output, but much less readable.

Of course, not all Rust programs can be faithfully represented in C. For example, for loops that use an iterator instead of a range need to be compiled to while loops that call into some of Eurydice's support code to manage the state of the iterator. More importantly, C has no concept of generics, so Rust code needs to be monomorphized during conversion. This can result in several different implementations of a function that differ only by type — often, the more idiomatic C approach would be to use macros or void * arguments.

The implementation of dynamically sized types also poses certain challenges. In Rust, a structure can be defined where one of its fields does not have a fixed size — like flexible array members in C:

    struct DynamicallySized<U: ?Sized> {
        header: usize,
        my_data: U, // The compiler does not know the size of U, here
    }

But if that structure is generic, and one of the generic users of the type gives the flexibly sized field a type with a known size, the compiler can take advantage of that knowledge to elide bounds checks where appropriate.

    let foo: DynamicallySized<[u8; 4]> = ...;
    // No bounds check emitted, since the array size is known to be 4:
    let bar = foo.my_data[2];

This kind of separation, where some parts of the code may know the size of a type and some may not, is an important semantic detail to preserve in C because of how it interacts with the possibility of formal verification. If Eurydice compiled DynamicallySized to use a flexible array member everywhere, analysis of the C code might point out "missing" bounds checks that were not required in Rust. Conversely, if Eurydice added extra bounds checks, it would need to manufacture extra error paths that don't appear in the Rust source and that should be completely unused.

So, Eurydice emits two different types: a version of the dynamically sized type that has a flexible array member, and one that has a known-length array member. Converting between the two representations is a no-op at run time, but it technically violates C's strict-aliasing rule. Therefore Protzenko recommends compiling Eurydice-generated code with -fno-strict-aliasing.

Associated tooling

This approach, of compiling a more abstract language to C in a way that preserves the structure of the code, is not new. The KaRaMeL project, upon which Eurydice is based, does the same thing for the F* programming language. F* is a dependently typed functional programming language used to develop cryptographic libraries. Compiling provably correct F* programs to equivalent C lets those libraries be used in programs where performance is a concern.

Unfortunately, Eurydice doesn't currently scale much beyond small examples. Rather than implement its own parser and typechecker for Rust code, Eurydice uses another Aeneas tool — Charon — to extract the parsed and preprocessed program from rustc. When I tested Charon on a variety of Rust packages, it was routinely foiled by more recent Rust features such as const generics.

When Charon does work, however, it dumps rustc's medium-level intermediate representation (MIR) as JSON, along with any compiler flags necessary to understand the compilation. Eurydice reads this JSON representation and converts it to KaRaMeL's intermediate representation. Then it uses a series of small passes over the KaRaMeL code to eliminate some Rust-specific details, before handing things over to the same code-generation logic that KaRaMeL uses for F*.

In its current form, Eurydice works best for small, self-contained programs that avoid complex Rust features. Within that niche, however, it works well. The generated code maintains the same structure as the original Rust code, except for places where Eurydice emits extra intermediate variables or needs some glue code to implement a more complicated feature. On the other hand, small self-contained code is also the easiest to rewrite by hand, so bringing in Eurydice is probably only worthwhile if the original Rust code is going to be updated and one wants an automatic solution to keep them in sync. In any case, Eurydice is only the newest tool in a rapidly expanding collection of ways to fold, spindle, and mutilate Rust code to fit into more environments.

[ Thanks to Henri Sivonen for the topic suggestion. ]

Comments (25 posted)

Sub-schedulers for sched_ext

By Jonathan Corbet
January 29, 2026
The extensible scheduler class (sched_ext) allows the installation of a custom CPU scheduler built as a set of BPF programs. Its merging for the 6.12 kernel release moved the kernel away from the "one scheduler fits all" approach that had been taken until then; now any system can have its own scheduler optimized for its workloads. Within any given machine, though, it's still "one scheduler fits all"; only one scheduler can be loaded for the system as a whole. The sched_ext sub-scheduler patch series from Tejun Heo aims to change that situation by allowing multiple CPU schedulers to run on a single system.

Sched_ext was built around the idea that no scheduler can be optimized for every possible workload that it may encounter. The sub-scheduler work extends that idea by saying that no scheduler — even a sched_ext scheduler — can be prepared to obtain optimal performance from every workload that a given system may run. From the cover letter:

Applications often have domain-specific knowledge that generic schedulers cannot possess. Database systems understand query priorities and lock holder criticality. Virtual machine monitors can coordinate with guest schedulers and handle vCPU placement intelligently. Game engines know rendering deadlines and which threads are latency-critical.

A system that runs a single workload can also run a special-purpose scheduler optimized for that workload. But the owners of systems tend to want to keep them busy, and that means running multiple workloads on the same machine. If two workloads on the same system would benefit from different scheduling algorithms, at least one of them is going to end up with sub-optimal performance.

The solution is to allow sched_ext schedulers to be attached to control groups. Each task in the system will be governed by the scheduler attached to its control group (or to the nearest ancestor group). The kernel has long supported the CPU controller, which allows an administrator to allocate CPU resources across control groups. Interestingly, the sub-scheduler feature is not tied to the CPU controller; it is, instead, tied directly to the control-group mechanism. As a result, the CPU controller is still in charge of how much CPU time each group gets, while the sub-schedulers manage how that CPU time is used by the processes within their respective groups.

Control groups with attached schedulers can be nested up to four levels deep. Any scheduler that is to be the parent of another in the control-group hierarchy must be written with that responsibility in mind. The attachment of a sub-scheduler to a control group will only succeed if the parent controller allows it. The parent scheduler also controls when the sub-schedulers' dispatch() callbacks are invoked. This callback instructs the scheduler to choose the next tasks to run and add them to a specific CPU's dispatch queue. So, in other words, the parent scheduler controls when a given workload (represented by a control group) can run, the sub-scheduler controls how the processes that make up that workload access the CPU, and the CPU controller is in charge of how much CPU time is available for them to run.

The kernel exports a long list of kfuncs that allow sched_ext programs to operate on the scheduler and the processes running under it. Those kfuncs will need to be generalized so that, rather than operating on the scheduler, they operate on the appropriate sub-scheduler. A system that is running multiple schedulers must also take extra care to ensure that these schedulers do not interfere with each other. For example, a BPF program that is running on behalf of a given scheduler should not be able to affect — or even see — any other schedulers that might be present. So the generalization of the sched_ext kfuncs must be carried out in a way that preserves the security and robustness of the system as a whole.

To that end, many of those kfuncs have been augmented with an implicit argument that gives them access to the bpf_prog_aux structure associated with the running task; from there, they can obtain a pointer to the sub-scheduler data they should be working with. The BPF programs themselves need never specify which scheduler they are operating on, and have no ability to operate on any scheduler other than the one they are attached to. The kernel is able to ensure that they are always tied to the correct sub-scheduler.

Similarly, sched_ext programs must be prevented from operating on processes other than the ones running under the scheduler they implement. The kernel already maintains a structure (struct sched_ext_entity) in the task structure that holds the information needed to manage each task with sched_ext. The new series adds a new field to that structure (called sched) pointing to the (sub-)scheduler that is in control of that task. Any kfunc that operates on a process can use this information to be sure that the process is, indeed, under the purview of the scheduler that is trying to make the change.

Sched_ext is designed with the intent of keeping faulty schedulers from causing too much damage. When a problem is detected with a running scheduler (for example, a runnable task is not dispatched to a CPU within a reasonable time period), that scheduler is put into "bypass mode". This mode is also entered when a scheduler is being deliberately shut down. In bypass mode, the scheduler is deactivated, and all tasks running under it are placed under a simple FIFO scheduler. In current kernels, that bypass scheduler is global.

In a system with multiple schedulers, though, allowing one sub-scheduler to toss processes into a global FIFO queue could lead to interference with other sub-schedulers. So, when sub-schedulers are in use, the parent scheduler, if it exists, will inherit tasks from sub-schedulers that go into bypass mode. In the other direction, if a parent scheduler goes into bypass mode, any schedulers below it in the hierarchy will also be placed in bypass mode.

The current patch set (version 1, though there was an RFC version in September 2025 as well) is not yet a complete implementation. It covers primarily the dispatch path — where a given scheduler sends a task to a CPU to execute. There are a couple of important phases that happen before dispatch, though, that have not been addressed in this series:

  • The select_cpu() callback is invoked when a task first wakes up. The scheduler should decide what to do with the task, including selecting a CPU for it to run on (though the selection is not final at this stage).
  • The enqueue() callback will actually put the task into a dispatch queue. That queue might be a specific CPU's local dispatch queue, but it may also be some other queue maintained by the scheduler, from which the task will be put into a CPU-local queue at a later time.

These callback paths will clearly need to be worked out for a complete sub-scheduler implementation. What is there now, though, is enough to show how the whole thing is intended to work. There is a modified version of the scx_qmap scheduler that is able to operate as both a parent and a sub-scheduler; it shows, in a relatively simple form, the type of changes that are necessary to the schedulers themselves.

As noted, this is early-stage work, and it is not yet complete. One should thus not expect to see sub-scheduler support in the kernel for some time yet. It is not hard to see how this feature could be useful on systems running a variety of workloads, though, so there will be a clear motivation to push it over the finish line. At that point, the "one scheduler fits all" model will have been left far behind.

Comments (4 posted)

Modernizing swapping: introducing the swap table

By Jonathan Corbet
February 2, 2026
The kernel's swap subsystem is a complex and often unloved beast. It is also a critical component in the memory-management subsystem and has a significant impact on the performance of the system as a whole. At the 2025 Linux Storage, Filesystem, Memory-Management and BPF Summit, Kairui Song outlined a plan to simplify and optimize the kernel's swap code. A first installment of that work, written with help from Chris Li, was merged for the 6.18 release. This article will catch up with the 6.18 work, setting the stage for a future look at the changes that are yet to be merged.

In a virtual-memory system, memory shortages must be addressed by reclaiming RAM and, if necessary, writing its contents to the appropriate persistent backing store. For file-backed memory, the file itself is that backing store. Anonymous memory — the memory that holds the variables and data structures used by a process — lacks that natural backing store, though. That is where the swap subsystem comes in: it provides a place to write anonymous pages when the memory they occupy is needed for other uses. Swapping allows unused (or seldom-used) pages to be pushed out to slower storage, making the system's RAM available for data that is currently in use.

A quick swap-subsystem primer

A full description of the kernel's swap subsystem would be lengthy indeed; there is a lot of complexity, much of which has built up over time. What follows is a partial, simplified overview of how the swap subsystem looked in the 6.17 kernel, which can then be used as a base for understanding the subsequent changes.

The swap subsystem uses one or more swap files, which can be either partitions on a storage device or ordinary files within a filesystem. Inside the kernel, active swap files are described by struct swap_info_struct, but are usually referred to using a simple integer index instead. Each file is divided into page-sized slots; any given slot in the kernel's swap areas can be identified using the swp_entry_t type:

    typedef struct {
        unsigned long val;
    } swp_entry_t;

This long value is divided into two fields: the upper six bits are the index number of the swap file (which, for extra clarity, is called the "type" in the swap code), and the rest is the slot number within the file. There is a set of simple functions used to create swap entries and get the relevant information back out.

Note that the above describes the architecture-independent form of the swap entry; each architecture will also have an architecture-dependent version that is used in page-table entries. Curious readers can look at the x86_64 macros that convert between the two formats. Within the swap subsystem itself, though, the architecture-independent version of the swap entry is used.

An overly simplified description of swapping would be something like: when the memory-management subsystem decides to reclaim an anonymous page, it selects a swap slot, writes the page's contents into that slot, then stores the associated swap entry in the page-table entry (using the architecture-dependent format) with the "present" bit cleared. The next attempt to reference that page will result in a page fault; the kernel will see the swap entry, allocate a new page, read the contents from the swap file, then update the page-table entry accordingly.

The truth of the matter is that things are rather more complex than that. For example, writing a page to the swap file takes time, and the page itself cannot be reclaimed until the write is complete. So, when the reclaim decision is made, the page is put into the swap cache, which is, in many ways, the analog of the page cache used for file-backed pages. Saying that a page is in the swap cache really only means that a swap entry has been assigned; the page itself may or may not still be resident in RAM. If a fault happens on that page while the writing process is underway, that page can be quickly reactivated, despite being in the swap cache.

All of this means that the swap subsystem has to keep track of the status of every page in the swap cache, and that status involves more than just the swap slot that was assigned. To that end, in kernels prior to 6.18, the swap subsystem maintained an array called swapper_spaces that contained pointers to arrays of address_space structures. That structure is used to maintain the mapping between an address space (the bytes of a file, or the slots of a swap file) and the storage that backs up that space. It provides a set of operations that can be used to move pages between RAM and that backing store. Using struct address_space means, among other things, that much of the code that works with the page cache can also operate with the swap cache.

Another reason to use struct address_space is the XArray data structure associated with it. For a swap file, that data structure contains the current status of each slot in the file, which can be any of:

  • The slot is empty.
  • There is a page assigned to the slot, but that page is also resident in RAM; in that case, the XArray entry is a pointer to the page (more precisely, the folio containing the page) itself.
  • There is a page assigned, but it exists only in the swap file. In that case, the entry contains "shadow" information used by the memory-management system to detect pages that are quickly faulted in after being swapped out. (See this 2012 article for an overview of this mechanism).

For extra fun, there is not a single address_space structure and XArray for each swap file. Instead, the file is divided into 64MB chunks, and a separate address_space structure is created for each. This design helps to spread the management of swap entries across multiple XArrays, reducing contention and increasing scalability on larger systems where a lot of swapping is taking place. The swapper_spaces entry for a swap file, thus, points to an array of address_space structures; a 1GB swap file, for example, would be managed with an array of 16 of these structures.

There is one more complication (for the purpose of this discussion — there are many others as well) in the management of swap slots. Each swap device is also divided into a set of swap clusters, represented by struct swap_cluster_info; these clusters are usually 2MB in size. Swap clusters make the management of swap files more scalable; each CPU in the system maintains a cache of swap clusters that have been assigned to it. The associated swap entries can then be managed entirely locally to the CPU, with cross-CPU access only needed when clusters must be allocated or freed. Swap clusters reduce the amount of scanning of the global swap map needed to work with swap entries, but the appropriate XArray must still be used to obtain or modify the status of a given slot.
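
Mapping a swap slot to its cluster is, likewise, simple index arithmetic. Assuming 4KB pages, a 2MB cluster covers 512 slots; the names below are illustrative, not the kernel's:

```c
#include <assert.h>

/* Sketch: with 4KB pages, a 2MB swap cluster covers 512 slots. */
#define CLUSTER_SLOTS 512UL

static unsigned long slot_to_cluster(unsigned long slot)
{
    return slot / CLUSTER_SLOTS;        /* which cluster */
}

static unsigned long slot_in_cluster(unsigned long slot)
{
    return slot % CLUSTER_SLOTS;        /* offset inside the cluster */
}
```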

The swap table

With that background in place, it is possible to look at the changes made for 6.18. They start with the understanding that the swap-subsystem code that deals with swap entries already has access to the swap clusters those entries belong to. Keeping the status information with the clusters would allow the elimination of the XArrays, which can be replaced with simple C arrays of swap entries. The smaller granularity of the swap clusters serves to further localize the management of swap entries, which should improve scalability.

So the phase-1 patch set augments the swap_cluster_info structure; the post-6.17 version of that structure contains a new array pointer:

    atomic_long_t __rcu *table;

The new table array, which is designed to occupy exactly one page on most architectures, is allocated dynamically, reducing the swap subsystem's memory use when the swap files are not full. Each entry in the table is the same swp_entry_t value seen above, describing the status of one page in the swap cache. The swap code has been reworked to use this new organization, with many of the internal APIs needing minimal or no changes. The arrays of address_space structures covering 64MB each are gone; the XArrays are no longer needed, and the address-space operations can be provided by a single structure, called swap_space.
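
The effect of the change can be sketched with a toy version of the per-cluster table. The kernel's real structure uses an RCU-protected pointer to a dynamically allocated page of atomic_long_t entries, but the lookup reduces to a plain array index either way; all names below are illustrative:

```c
#include <assert.h>

/* Toy model of the 6.18 arrangement: each cluster owns a flat table of
 * swap entries, so a status lookup is an array index rather than an
 * XArray walk.  Types and names are illustrative. */
#define SWAP_TABLE_ENTRIES 512          /* one 2MB cluster of 4KB pages */

struct toy_cluster {
    long table[SWAP_TABLE_ENTRIES];     /* 0 == empty slot */
};

static long toy_table_get(const struct toy_cluster *ci, unsigned long off)
{
    return ci->table[off % SWAP_TABLE_ENTRIES];
}

static void toy_table_set(struct toy_cluster *ci, unsigned long off, long v)
{
    ci->table[off % SWAP_TABLE_ENTRIES] = v;
}
```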

In summary, where the kernel previously divided swap areas using two independent clustering mechanisms (the address_space structures and the swap clusters), now it only has one clustering scheme that increases the locality of many swap operations. The end result, at this stage, is "up to ~5-20% performance gain in throughput, RPS or build time for benchmark and workload tests", according to Song. This speed improvement is entirely due to the removal of the XArray lookups and the reduction in contention that comes from managing swap space in smaller chunks.

That is the state of affairs as of 6.18. As significant as this change is, it is only the beginning of the project to simplify and improve the kernel's swap code. The 6.19 kernel did not significantly advance this work, but there are two other installments under consideration, one of which is seemingly poised for the 7.0 release. Those changes will be covered in the second part of this series.

Comments (10 posted)

API changes for the futex robust list

By Jake Edge
February 4, 2026

LPC

The robust futex kernel API is a way for a user-space program to ensure that the locks it holds are properly cleaned up when it exits. But the API suffers from a number of different problems, as André Almeida described in a session in the "Gaming on Linux" microconference at the 2025 Linux Plumbers Conference in Tokyo. He had some ideas for a new API that would solve many of those problems, which he wanted to discuss with attendees; there is a difficult-to-trigger race condition that he wanted to talk about too.

"Some years ago, I made a new API for futex", Almeida said to start things off, "so why not do a new API for robust list as well?" The new futex API that he was referring to was merged for 5.16 in 2022 in the form of the futex_waitv() system call (documentation). Some further pieces of the futex2 API were released with Linux 6.7 in 2024.

[André Almeida]

The ABI for games on the SteamOS distribution, where much of the work in gaming on Linux is being done, is Windows on the x86 architecture. The games are mostly built for that ABI, but SteamOS also runs on Arm64, which leads to "a lot of interesting challenges". It adds the FEX emulator to run x86 binaries on the Arm64 processor in addition to the Proton compatibility layer that provides the Windows ABI. That has implications for various kernel areas, including futexes, memory management, and filesystems.

FEX is a just-in-time (JIT) compiler for turning x86 instructions, for both 32 and 64 bits, into Arm64 machine code. As part of that, when it finds a syscall instruction, it needs to translate that to the Arm64 system call, but that does not work well for some x86-32 system calls. The FEX project has a wiki page describing the problematic calls, one of which is set_robust_list().

set_robust_list() is used to avoid problems when a futex holder dies before releasing the lock, which would otherwise leave any other threads waiting on it blocked forever. So, when a thread takes a lock, it can add the lock to the robust list, which is a linked list maintained in user space. The thread informs the kernel of the location of the head of the list using set_robust_list(). The kernel's exit path for a thread uses that information to wake all of the threads waiting on each futex on the list; it also sets the FUTEX_OWNER_DIED bit in each futex. One other wrinkle that he mentioned is that a futex can be placed into a "pending" field on the list head while an operation (taking or releasing the lock) is underway, but before the linked list has been updated, so that the futex can still be cleaned up if a crash happens in that window.
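
Registering a robust list comes down to a single call; this simplified sketch, modeled on what C libraries do internally, registers an empty list for the calling thread. Note that the empty list's pointer initially points back at its own head:

```c
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <assert.h>

/* A minimal robust-list registration; the C library normally does this
 * on the program's behalf, so calling it again simply replaces the
 * thread's registered head. */
static struct robust_list_head head = {
    .list = { &head.list },     /* empty circular list: points at itself */
    .futex_offset = 0,          /* offset from list entry to lock word */
    .list_op_pending = NULL,    /* no lock/unlock operation in flight */
};

static long register_robust_list(void)
{
    return syscall(SYS_set_robust_list, &head, sizeof(head));
}
```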

Why?

A new API is needed for several reasons, he said. The first is that, unlike x86, Arm64 does not have both 32- and 64-bit system calls, so emulating 32-bit applications is difficult—the "compat" system calls are missing. For example, a 32-bit robust list cannot be handled by the 64-bit system call because it cannot parse the list due to the different pointer size. There is a need for the new interface to allow user space to inform the kernel whether it is a 32- or 64-bit robust list so that the kernel can parse the list correctly.

Another shortcoming of the existing interface is that only one robust list can be set for a thread, but FEX also wants to use robust futexes. If the application uses them, FEX has to choose which one gets that access. A new interface would provide a way to set multiple list heads for a thread.

There is currently a limit of 2048 entries on a robust list that will be processed by the kernel, which is meant to avoid getting trapped in an infinite loop. But that limit was never documented as part of the API, so user-space programs are unaware of it, which led to a bug report for the GNU C library. With a new API, either the limit should be documented and exposed as part of the API or it should be made limitless using countermeasures against circular lists, he said.
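
The kernel's defense can be sketched as a capped list walk; ROBUST_LIST_LIMIT below matches the kernel's value of 2048, but the list type and function are toy stand-ins:

```c
#include <stddef.h>
#include <assert.h>

/* Sketch of the kernel's defensive walk: stop after a fixed number of
 * entries so that a corrupted, circular list cannot trap the exit path
 * in an infinite loop. */
#define ROBUST_LIST_LIMIT 2048

struct node { struct node *next; };

static unsigned int walk_robust_list(const struct node *head)
{
    const struct node *cur = head;
    unsigned int handled = 0;

    while (cur && handled < ROBUST_LIST_LIMIT) {
        handled++;              /* process one held futex */
        cur = cur->next;
    }
    return handled;             /* entries past the cap are silently dropped */
}
```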

The final problem is "much more interesting" but "kind of tricky to explain"; it is a race condition that can occur when a futex is being unlocked. The normal sequence for unlocking a robust futex is as follows:

  1. The address of the futex is put into the pending slot of the robust list.
  2. The futex is removed from the robust list.
  3. The low-level unlock is done, which clears the futex and wakes any threads waiting on it.
  4. The pending slot is cleared.

Between steps three and four, though, another thread can conclude that it is the only remaining user of the futex and free the memory containing it. That thread could then allocate memory at the same location as the former futex. If the original thread, which is about to perform step four, dies at that point, the kernel will write FUTEX_OWNER_DIED into what it believes is the futex, thus corrupting some random memory. The race is difficult to reproduce, but it does happen.
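
A toy model of that sequence shows where the window lies; everything below is illustrative, with the kernel's role reduced to a comment:

```c
#include <stddef.h>
#include <assert.h>

/* Toy model of the robust-unlock sequence; not real kernel or libc code. */
struct toy_list {
    unsigned long *pending;     /* the list head's "pending" field */
    unsigned long  futex;       /* the lock word itself */
};

static void robust_unlock(struct toy_list *l)
{
    l->pending = &l->futex;     /* 1: record the futex as pending */
                                /* 2: unlink from the robust list (elided) */
    l->futex = 0;               /* 3: low-level unlock; waiters wake here.
                                 *    If the owner dies at this point, the
                                 *    kernel writes FUTEX_OWNER_DIED through
                                 *    'pending', which may by then point at
                                 *    freed and reused memory. */
    l->pending = NULL;          /* 4: clear the pending slot */
}
```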

Almeida said that he is unsure how to address this. Perhaps serializing the exit path with all of the mmap() and munmap() calls made by the thread is a possibility. Another idea might be to change the API around the pending field somehow to avoid the race. The previous day he had participated in the extensible scheduler class (sched_ext) microconference, which got him thinking that perhaps a specialized scheduler could be written to reproduce the problem reliably; that would help in the fixing process and could be turned into a test case as well.

New API

The API he proposed in the session seems to have evolved somewhat since his v6 patch set posting in November 2025 (a few weeks before LPC). It consists of two new system calls:

    set_robust_list2(struct robust_list_head *head, unsigned int index,
                     unsigned int cmd, unsigned int flags);

    get_robust_list2(int pid, void **head_ptr,
                     unsigned int index, unsigned int flags);

The index argument is used to distinguish between different lists so that libraries and applications can have their own lists. The cmd argument to set_robust_list2() can be CREATE_LIST_32 (or 64) to create a list of the appropriate "bitness" using the head pointer; in that case, the call returns an unused index that is associated with the list. A list can be overwritten using the SET_LIST_32 (or 64) command by passing the index of interest. The LIST_LIMIT command returns the number of lists supported for each task. (All of the command names will presumably have FUTEX_ROBUST_LIST_CMD_ as part of their full name.) get_robust_list2() will simply return the head of the robust list (in head_ptr) for a given pid and index.

Discussion

After that, Almeida opened the floor to questions and comments. Liam Howlett noted that the exit path for robust lists imposes a delay on the kernel's out-of-memory (OOM) handling, so the race condition could be more easily reproduced by setting the OOM-handler delay to zero and triggering an OOM kill of the task. While that may be true, glibc maintainer Carlos O'Donell said, it does not really lead to a solution to the race, which both he and Rich Felker of the musl libc project have looked at. If there is going to be a new API, it is a "perfect opportunity" to sit down and figure out a proper solution, and also to determine how existing C libraries can transition to the new interface over time.

"It gets worse", Howlett said. Tasks that are exiting can be frozen by the control-group subsystem, which means that the OOM handler has to wait potentially forever before it can clean things up. That is another piece that should be unwound as part of the process of creating the new API, he said.

O'Donell said that it made sense that users of the new API will need to be able to register the number of bits in the structure that is being shared with the kernel. He asked if sizes other than 32 or 64 bits should be considered, but Howlett pointed out that there is an unused flags argument in the proposed API, which could be used if needed.

The conversation turned back to the delay for the OOM handler, which no one seemed to fully understand. O'Donell wondered if it was an attempt to fix the race condition that Almeida is concerned about when it arose in some other context. Howlett said that he believed it was meant to hold off the OOM killer from freeing memory holding the locks before the exit-handling code could process the robust list. Sebastian Siewior said that he was not clear on why the delay was added, either, but would put it on his list to look into.

There was some further discussion of why and how the OOM-killer delay came about, but the session ran out of time. Interested readers may want to consult the YouTube video and slides from the talk. Overall, participants seemed to agree that the new API was needed, and no real complaints about its proposed form were heard, but there are obviously still some details to be worked out before it can go upstream.

[ I would like to thank our travel sponsor, the Linux Foundation, for assistance with my travel to Tokyo for Linux Plumbers Conference. ]

Comments (4 posted)

The future for Tyr

February 3, 2026

This article was contributed by Daniel Almeida

The team behind Tyr started 2025 with little to show in our quest to produce a Rust GPU driver for Arm Mali hardware, but by the end of the year, we were able to play SuperTuxKart (a 3D open-source racing game) at the Linux Plumbers Conference (LPC). Our prototype was a joint effort between Arm, Collabora, and Google; it ran well for the duration of the event, and the performance was more than adequate for players. Thankfully, we picked up steam at precisely the right moment: Dave Airlie recently announced at the Maintainers Summit that the DRM subsystem is only "about a year away" from disallowing new drivers written in C and requiring the use of Rust. Now it is time to lay out a possible roadmap for 2026 in order to upstream all of this work.

What are we trying to accomplish with Tyr?

Miguel Ojeda's talk at LPC this year summarized where Rust is being used in the Linux kernel, with drivers like the anonymous shared memory subsystem for Android (ashmem) quickly being rolled out to millions of users. Given Mali's extensive share of the phone market, supporting this segment is a natural aspiration for Tyr, followed by other embedded platforms where Mali is also present. In parallel, we must not lose track of upstream, as the objective is to evolve together with the Nova Rust GPU driver and ensure that the ecosystem will be useful for any new drivers that might come in the future. The prototype was meant to prove that a Rust driver for Arm Mali could come to fruition with acceptable performance, but now we should iterate on the code and refactor it as needed. This will allow us to learn from our mistakes and settle on a design that is appropriate for an upstream driver.

What is there, and what is not

A version of the Tyr driver was merged for the 6.18 kernel release, but it is not capable of much, as a few key Rust abstractions are missing. The downstream branch (the parts of Tyr not yet in the mainline kernel) is where we house our latest prototype; it is working well enough to run desktop environments and games, even if there are still power-consumption and GPU-recovery problems that need to be fixed. The prototype will serve the purpose of guiding our upstream efforts and let us experiment with different designs.

A kernel-mode GPU driver such as Tyr is a small component backing a much larger user-mode driver that implements a graphics API like Vulkan or OpenGL. The user-mode driver translates hardware-independent API calls into GPU-specific commands that can be used by the rasterization process. The kernel's responsibility centers around sharing hardware resources between applications, enforcing isolation and fairness, and keeping the hardware operational. This includes providing the user-mode driver with GPU memory, letting it know when submitted work finishes, and giving user space a way to describe dependency chains between jobs. Our talk (YouTube video) at LPC2025 goes over this in detail.

[SuperTuxKart running on Tyr at LPC]

Having a working prototype does not mean that it is ready for real-world use, however, and a walkthrough of what is missing reveals why. Mali GPUs are usually found in mobile devices, where power is at a premium. Conserving energy and managing the thermal characteristics of the device are paramount to the user experience, and Tyr does not have any power-management or frequency-scaling code at the moment. In fact, the Rust abstractions to support these features are not available at all.

Something else worth considering is what happens if the GPU hangs. It is imperative that the system remain working to the extent possible, or users might lose all of their work. Owing to our "prototype" state, there is no GPU-recovery code right now. These two things are hard requirements for deployability. One simply cannot deploy a driver that gobbles all of the battery in the system — making it hot and unpleasant in the process — or crashes and takes the user's work with it.

On top of that, Vulkan must be correctly implementable on top of Tyr, or we may fail to achieve drop-in compatibility with our Vulkan driver (PanVK). This requires passing the Vulkan Conformance Testing Suite when using Tyr instead of the C driver. At that point, we would be confident enough to add support for more GPU models beyond the currently supported Mali-G610. Finally, we will turn our attention to benchmarking to ensure that Tyr can match the C driver's performance while benefiting from Rust's safety guarantees. We have demonstrated running a complex game with acceptable performance, so results are good so far.

Which Rust abstractions are missing

Some required Rust infrastructure is still a work in progress. This includes Lyude Paul's work on the graphics execution manager (GEM) shmem objects, which are needed to allocate memory on systems without discrete video RAM. This is notably the case for Tyr, as the GPU is packaged in a larger system-on-chip and must share system memory. Additionally, there are still open questions, like how to share non-overlapping regions of a GPU buffer without locks, preferably encoded in the type system and checked at compile time.

On top of allocating GPU memory, modern kernel drivers must let the user-mode driver manage its own view of the GPU address space. In the DRM ecosystem, this is delegated to GPUVM, which contains the common code to manage those address spaces on hardware that offers memory-isolation capabilities similar to modern CPUs. The GPU firmware also expects control over the placement of some sections in memory, so it will not work until this capability is available. Alice Ryhl is working on the Rust abstractions for GPUVM as well as the io-pgtable abstractions that are needed to manipulate the IOMMU page tables used to enforce memory isolation. These are both based on the previous work of Asahi Lina, who pioneered the first Rust abstractions for the DRM subsystem.

Another unsolved issue is DRM device initialization. The current code requires an initializer for the driver's private data in order to return a drm::Device instance, but some drivers need the drm::Device to build the private data in the first place, which leads to an impossible-to-satisfy cycle of dependencies. This is also the case for Tyr: allocating GPU memory through the GEM shmem API requires a drm::Device, but some fields in Tyr's private data need to store GEM objects — for example, to parse and boot the firmware. Lyude Paul is working on this by introducing a drm::DeviceCtx that encodes the device state in the type system.

The situation remains the same as when the first Tyr patches were submitted: most of the roadmap is blocked on GEM shmem, GPUVM, io-pgtable, and the device-initialization issue. There is room to integrate some work by the Nova team as well: the register! macro and bounded integers. Once those items are in place, we expect to be able to boot the GPU firmware quickly and then progress unhindered until it is time to discuss job submission.

Another area needing consideration is the paths where the driver makes forward progress on completing fences, which are synchronization primitives that GPU drivers signal once jobs finish executing. These paths must be carefully annotated or the system may deadlock, and the driver must ensure that only safe locks are taken in the signaling path. Additionally, DMA fences must always signal in finite time, or someone elsewhere in the system may block forever. Allocating memory using anything other than GFP_ATOMIC must be disallowed, or the shrinker may kick in under memory pressure and wait on the very job that triggered it. All of this is covered in the documentation. We conveniently ignore this in the prototype, meaning it can randomly deadlock under memory pressure. Addressing this is straightforward: it is just a matter of carefully vetting key parts of the driver. Doing so elegantly, however, and perhaps in a way that takes advantage of Rust's type system is something that remains to be discussed.

Looking into the future

We have not touched upon what is next for Linux GPU drivers as a whole: reworking the job-submission logic in Rust. The current design assumes that drm_gpu_scheduler is used, but this has become a hindrance for some drivers in an age where GPU firmware can schedule jobs itself, and it's been plagued by hard-to-solve lifetime problems. Quite some time was spent at the X.Org Developer's Conference in 2025 discussing how to fix it.

The current consensus for Rust is to write a new component that merely ensures that the dependencies for a given job are satisfied before the job is placed into the GPU's ring buffer, at which point the firmware scheduler takes over. This seems to be where GPU hardware is going, as most vendors have switched to firmware-assisted scheduling in recent years. Since this component will not schedule jobs, it will probably be called JobQueue instead; that name correctly conveys a queue into which new work is deposited and from which it is removed once its dependencies are met and it is ready to run. Philip Stanner has been spearheading this work.

The plan is to also expose an API for C drivers using a technique I have described here in the past. This will possibly be the first Rust kernel component usable from C drivers, another milestone for Rust in the kernel, and a hallmark of seamless interoperability between C and Rust.

One way that Tyr can fit into this overall vision is by serving as a testbed for the new design. If the old drm_gpu_scheduler can be successfully replaced with the JobQueue in the prototype, that will help demonstrate its suitability for other, more complex drivers like Nova. Expect this discussion to continue for a while.

In all, Tyr has made a lot of progress this past year. Hopefully, it will continue to do so through 2026 and beyond.

Comments (13 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: openSUSE governance; Git 2.53.0; LibreOffice 26.2; Open Source Award; Quotes; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds