LWN.net Weekly Edition for January 22, 2026
Welcome to the LWN.net Weekly Edition for January 22, 2026
This edition contains the following feature content:
- A free and open-source rootkit for Linux: a look at the Singularity research project.
- Cleanup on aisle fsconfig(): ideas on improving the fsconfig() system call.
- Task-level io_uring restrictions: Jens Axboe is working on a fast-moving patch set to provide better access controls for io_uring.
- Responses to gpg.fail: a pair of security researchers have sparked interesting discussions about GPG and the complexity of OpenPGP implementations.
- Removing a pointer dereference from slab allocations: Al Viro wanders into memory management.
- An alternate path for immutable distributions: a look at the AshOS experiment.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
A free and open-source rootkit for Linux
While there are several rootkits that target Linux, they have so far not fully embraced the open-source ethos typical of Linux software. Luckily, Matheus Alves has been working to remedy this lack by creating an open-source rootkit called Singularity for Linux systems. Users who feel their computers are too secure can install the Singularity kernel module in order to allow remote code execution, disable security features, and hide files and processes from normal administrative tools. Despite its many features, Singularity is not currently known to be in use in the wild — instead, it provides security researchers with a testbed to investigate new detection and evasion techniques.
Alves is quite emphatic about the research nature of Singularity, saying that its main purpose is to help drive security research forward by demonstrating what is currently possible. He calls for anyone using the software to "be a researcher, not a criminal", and to test it only on systems where they have explicit permission to do so. If one did wish to use Singularity for nefarious purposes, however, the code is MIT-licensed and freely available — using it in that way would only be a crime, not an instance of copyright infringement.
Getting its hooks into the kernel
The whole problem of how to obtain root permissions on a system and go about installing a kernel module is out of scope for Singularity; its focus is on how to maintain an undetected presence in the kernel once things have already been compromised. In order to do this, Singularity goes to a lot of trouble to present the illusion that the system hasn't been modified at all. It uses the kernel's existing Ftrace mechanism to hook into the functions that handle many system calls and change their responses to hide any sign of its presence.
Using Ftrace offers several advantages to the rootkit; most importantly, it means that the rootkit doesn't need to change the CPU trap-handling vector for system calls, which was one of the ways that some rootkits have been identified historically. It also avoids having to patch the kernel's functions directly — kernel functions already have hooks for Ftrace, so the rootkit doesn't need to perform its own ad-hoc modifications to the kernel's machine code, which might be detected. The Ftrace mechanism can be disabled at run time, of course — so Singularity helpfully enables it automatically and blocks any attempts to turn it off.
Singularity is concerned with hiding four classes of things: its own presence, the existence of attacker-controlled processes, network communication with those processes, and the files that those processes use. Hiding its own presence is actually fairly straightforward: when the kernel module is loaded, it resets the kernel's taint marker and removes itself from the list of active kernel modules. This also means that Singularity cannot be unloaded, since it doesn't appear in the normal interfaces that are used for unloading kernel modules. It also blocks the loading of subsequent kernel modules (although they will appear to load — they'll just silently fail). Consequently, Alves recommends experimenting with Singularity in a virtual machine.
Hiding processes
Hiding processes, on the other hand, is more complicated. The mechanism that Singularity uses starts by identifying and remembering which processes are supposed to be hidden. Singularity uses a single 32-entry array of process IDs to track attacker-controlled processes; this is because a more sophisticated data structure would introduce more opportunities for the rootkit to be caught, either by adding additional memory allocations that could be noticed, or by introducing delays whenever one of its hooked functions needs to check the list of hidden process IDs.
Singularity supports two ways to add processes to the list: by sending an unused signal, or by setting a special environment variable and launching a new process. To implement the former, it hooks the kill() system call to detect an unused signal (number 59, by default), quashes the signal, adds the target process to its internal list, and gives the process root permissions in the global namespace. This means that attacker-controlled processes can be added from inside containers, and automatically escape the container using their new root privileges. To handle the environment variable, the execve() system call is hooked in a similar way.
Once a process is in the list, attempts to send signal 0 (to check whether the process exists) are also intercepted, as are other system calls that could refer to the process, such as getpgid(), sched_getaffinity(), and others. The total number of processes on the system, as reported by sysinfo(), is also decremented to keep things consistent. The process's files in /proc are hidden by Singularity's file-hiding code. That code is probably the trickiest part of the whole rootkit. The basic idea is to filter out hidden directory entries such that the filesystem appears to remain in a consistent state, but filesystem code is difficult to get right at the best of times.
Hiding files
When a program calls getdents(), the kernel fills the provided buffer with directory entries as normal. Then, Singularity's hook copies the buffer back from user memory, removes the hidden entries, puts the modified buffer back in user memory, and changes the return value of the system call to reflect the smaller number of directory entries returned. This roundabout process is needed because the kernel doesn't provide a good place for Singularity to inject a hook before the directory entries are written to user memory the first time. So, one potential way to identify the rootkit is to have another thread race with the attempt to read directory entries, trying to spot any that were removed.
Changing the number of returned directory entries alone would make the system appear to be in an inconsistent state, however. Directories in Linux filesystems are supposed to track the number of references to them; this includes the ".." references inside child directories. So, when hiding a directory, Singularity also needs to intercept calls to stat() in order to adjust the number of visible links to its parent directory.
Direct access to hidden directories, in the form of openat() and related system calls, is also made to fail. readlink() poses a special challenge because it resolves symbolic links without actually opening them; it has to be handled separately. In addition to the procfs files of hidden processes, Singularity also hides any directories matching a set of user-supplied patterns. By default, it hides things named "singularity", but the project's documentation suggests changing this in the build configuration, since otherwise detecting the rootkit becomes straightforward.
Despite this sophisticated file-hiding machinery, Singularity doesn't help against forensic examinations of a hard disk from another computer. If it isn't installed in the running kernel, it can't hide anything. Therefore, the documentation also recommends putting as many hidden files as possible onto temporary filesystems stored in RAM, so that they don't show up after the system is rebooted.
Another problem for the rootkit is files that contain traces of its presence, but that would raise eyebrows if they disappeared entirely. This includes things like the system log, but also files in procfs like kallsyms or enabled_functions that expose which kernel functions have had Ftrace probes attached. For those files, Singularity doesn't hide them at the filesystem level, but it does filter calls to read() to hide incriminating information.
Deciding which log lines are incriminating isn't a completely solved problem, though. Right now, Singularity relies on matching a set of known strings. This is another place where users will have to customize the build to avoid simple detection methods.
Hiding network activity
Even once an attacker's processes can hide themselves and their files, it is still usually desirable to communicate information back to a command-and-control server. Singularity will work to hide network connections using a specific TCP port (8081, by default), and hide packets sent to and from that port from packet captures. It supports both IPv4 and IPv6. Hiding the connections from tools like netstat uses the same filesystem-hiding code as before. Hiding things from packet captures requires hooking into the kernel's packet-receiving code.
On the other hand, this is another place where Singularity can't control the observations of uncompromised computers: if one is running a network tap on another computer, the packets to and from Singularity's hidden port will be totally visible.
The importance of compatibility
Singularity only supports x86 and x86_64, but it does support both 64-bit and 32-bit system call interfaces. This is important, because otherwise a 32-bit application running on top of a 64-bit kernel could potentially see different results, which would be suspicious. To avoid this, Singularity inserts all of the aforementioned Ftrace hooks twice, once on the 32-bit system call and once on the 64-bit system call. A generic wrapper function converts from the 32-bit calling convention to the 64-bit calling convention before forwarding to the actual implementation of the hook.
Singularity has been tested on a variety of 6.x kernels, including some versions shipped by Ubuntu, CentOS Stream, Debian, and Fedora. Since the tool primarily uses the Ftrace interface, it should be supported on most kernels — although since it interfaces with internal details of the kernel, there is always the chance that an update will break things.
The tool also comes bundled with a set of utility scripts for cleaning up evidence that it was installed in the first place. These include a script that mimics normal log-rotation behavior, except that it silently truncates the logs to hinder analysis; a script that securely shreds a source-code checkout in case the module was compiled locally; and a script that automatically configures the rootkit's module to be loaded on boot.
Overall, Singularity is remarkably sneaky. Someone who didn't know what to look for would probably have trouble identifying that anything was amiss. The rootkit's biggest tell is probably the way that it prevents Ftrace from being disabled; if one writes "0" to /proc/sys/kernel/ftrace_enabled and the content of the file remains "1", that's a pretty clear sign that something is going on.
Readers interested in fixing that limitation are welcome to submit a pull request to the project; Alves is interested in receiving bug fixes, suggestions for new evasion techniques, and reports of working detection methods. The code itself is simple and modular, so it is relatively easy to adapt Singularity for one's own purposes. Perhaps having such a vivid demonstration of what a rootkit can do will inspire new, better detection or prevention methods.
Cleanup on aisle fsconfig()
As part of the process of writing man pages for the "new" mount API, which has been available in the kernel since 2019, Aleksa Sarai encountered a number of places where the fsconfig() system call—for configuring filesystems before mounting—needs to be cleaned up. In the 2025 Linux Plumbers Conference (LPC) session that he led, Sarai wanted to discuss some of the problems he found, including at least one with security implications. The idea of the session was for him to describe the various bugs and ambiguities that he had found, but he also wanted attendees to raise other problems they had with the system call.
Christian Brauner, who helped organize the "Containers and checkpoint/restore" microconference (and LPC as well), introduced the session by referring to the "horrific design" of fsconfig()—something that Sarai immediately disclaimed ("I didn't say that"). Sarai began by noting that there are now man pages for the mount API, which may help improve the adoption of the API by filesystems; his theory is that adoption lagged due to having to read the code in order to understand the system calls. "Hopefully, this is at least a slight improvement."
The new mount API, perhaps more properly "the suite of file-descriptor-based mount facilities" as the man page calls it, breaks up the mount() system call into multiple steps to provide a more granular approach to the myriad ways that filesystems can be mounted in Linux. fsconfig() is used to set parameters and otherwise customize a filesystem context that has been created with fsopen() or obtained from an existing mounted filesystem using fspick().
The function prototype for fsconfig() is as follows, from the man page:
int fsconfig(int fd, unsigned int cmd,
const char *_Nullable key,
const void *_Nullable value, int aux);
The fd parameter is for the filesystem context to operate on, while cmd is the operation requested; key, value, and aux provide additional information based on the operation chosen.
FSCONFIG_SET_PATH
Sarai said that the FSCONFIG_SET_PATH and FSCONFIG_SET_PATH_EMPTY commands are almost completely unused by filesystems; he thought they were not used at all but, right before the session, found out that the ext4 journal_path parameter can be set that way. It is unfortunate that no other filesystem parameters can be set using those commands, because they can take a directory file descriptor, thus providing more options for specifying the path, as with openat().
In part due to helpers that are "a little bit janky", filesystems require their paths to be set as strings using FSCONFIG_SET_STRING, which is the same form as options to the mount() system call. As noted in the fsconfig() man page, the source path parameter, normally used for the block device containing the filesystem, must be set using FSCONFIG_SET_STRING, but others ostensibly could use the set-path commands.
Ideally, he thinks that most filesystems want to support all three commands for their file parameters, but none of the helpers currently support that. A single helper that handles those three, plus the related FSCONFIG_SET_FD command would be useful. He wondered, though, whether full O_PATH support was needed for the paths that are being set.
A file descriptor opened using O_PATH behaves differently than one obtained from a regular file open—and it can be done without the permissions needed to actually open the file. For that reason, Lennart Poettering thought that O_PATH support should not be added to the helper; files should be opened normally as that will be more secure, he said. There was no opposition to adding helpers as Sarai described, so he turned to his next topic.
Singletons and exclusive create
Singleton filesystems, such as debugfs and others that have the same contents no matter how and where they are mounted, are a perennial problem, he said. The FSCONFIG_CMD_CREATE_EXCL command was developed and merged by Brauner a few years ago, but it is a "very big hammer" that is largely unusable because it does not provide any extra information to the caller if it fails. It is the counterpart to the FSCONFIG_CMD_CREATE command, which is used to turn a configured filesystem context into a filesystem instance that can be mounted using fsmount().
There is a hidden surprise when using the create command: almost all of the configuration (with the exception of the read-write and read-only flags) that has been done using fsconfig() is silently ignored if the filesystem instance is already present in the kernel. So FSCONFIG_CMD_CREATE_EXCL requests that the kernel create a context without reusing an existing filesystem configuration, thus ensuring that the requested configuration is used. But if it cannot, "it just gives you an error and you can't do anything about it"; it is simply an instance of "computer says 'no' and that's basically all you can do about it".
The conversion to the new mount API has broken some singleton filesystems, because the semantics of vfs_get_super() have changed, but some developers were not aware of that. The bug was fixed for debugfs and for tracefs. In general, the semantics of superblock reuse are not clear, and the messages provided do not give enough information about what parameters were ignored.
Brauner noted that FSCONFIG_CMD_CREATE_EXCL is meant for times when the user absolutely must have a particular filesystem-configuration option; otherwise FSCONFIG_CMD_CREATE should be used. But Sarai pointed out that the existing filesystem configuration may be just fine, yet, in order to be sure, the application uses FSCONFIG_CMD_CREATE_EXCL, which would fail even if the required parameter is already set in the existing configuration. Part of the problem is that the actual set of configuration options is not completely resolved until the superblock is actually created, Brauner said; for example, filesystems are not required to resolve the path parameters supplied until superblock-creation time. There is no "generically elegant" solution for finding out whether a filesystem context really contains a configuration value of interest.
fc_log
Information about conflicts between parameters and the like can be logged using logfc() (which uses struct fc_log), but that interface suffers from some problems as well. The interface had also been discussed at LSFMM+BPF 2024, Sarai said; that discussion is described in the article linked above. For one thing, fc_log has a limit of eight entries before it overflows, "and the fun part is that there's no priority", so eight informational messages could crowd out an error or warning message.
User space can read the messages logged via a file descriptor returned from fsopen(), but ideally it would need to do so after every mount-API call and Sarai said that the util-linux tools do not do that. He wondered if increasing the size of fc_log made sense. Brauner asked if there was a way to know if any messages were dropped; Sarai did not think so. Overall, though, the fc_log interface is much better than trying to parse things out of the dmesg log, Sarai said. Brauner agreed, noting that util-linux does print out those messages in versions where it is using the new mount API, which is a big improvement.
Brauner thought that some levels of fc_log messages were also output to dmesg, though Sarai was not convinced that was true. Sarai did think that unread fc_log messages should be written to dmesg so that they do not just disappear. It might make sense to provide a way for user space to poll the fc_log messages, so that it can read them as soon as they are available, Brauner said. There is a problem that the format and wording of those messages becomes part of the kernel ABI, which may be unavoidable, but is something to keep in mind.
Sarai described an idea he had just come up with, which would allow fc_log messages to be extended with, say, some JSON that came after a NUL byte in the buffer; that would allow users to simply print the message (as they do now) or to parse it further if they need more information.
Poettering suggested following the model of kmsg, which was long unstructured but eventually added some structure, including a sequence number so that overwritten messages can be detected. Brauner said there is some structure to the fc_log messages, with prefixes for different filesystems, for example, but it is still "wildly inconsistent" among them. Even the VFS is inconsistent about what and where it logs information. Switching to a structured format might be a good idea, but it would require a new flag for fsopen() to allow user space to request structured logs.
The limit of eight messages is something that should probably be addressed, Sarai said. Poettering agreed, noting that it was an "irritatingly low" number for something of that nature. Jeff Layton speculated that David Howells (who designed the API but was not present) expected users to check for messages after each call, and that only a few messages would be generated for each. That expectation has not really been borne out, attendees seemed to agree.
FS_CONTEXT_FAILED
The final topic Sarai wanted to raise was the FS_CONTEXT_FAILED state that a filesystem context enters if there is any kind of error. At that point, the context cannot be inspected or otherwise operated upon, so if user space wants to try again with different options, it has to start all over. This comes up for the runc container tool, he said, because it has various fallbacks that it wants to try if a mount fails. In order to do that, it has to keep all of the options around, remake the context, and try again (and again, if that fails). That is not too bad, "but it's just kind of awful that it goes into this fail state, where all you have are log messages" to try to figure out what went wrong.
Brauner speculated that it was that way because the VFS cannot really know what caused the failure, so it cannot know whether the context can be changed and retried. There may be filesystems that enter into a non-recoverable state if the superblock creation fails, for example. On the other hand, there are situations where a new mount option is introduced that a filesystem may or may not implement; it is unfortunate that the option cannot just be removed and the mount retried.
It would make sense to have a way for a filesystem to indicate whether an error is non-recoverable or not, Brauner said. Part of the problem is that the API is "an unfinished project in a sense"; Howells had also proposed the fsinfo() system call, which would have allowed querying the filesystem context and more. It was rejected and has never resurfaced, though statmount() was separated out as its own system call, Brauner said. It might be interesting to consider resurrecting a "very slimmed-down version" of fsinfo().
Using a structured-message kmsg-like approach would allow filesystems or the VFS to put the required information into some newer fc_log, Poettering said. Those applications that care can pull out the information about options that were rejected (or translated to a compatible option); that way they can programmatically determine what needs to change for a retry. It is effectively the same as the JSON idea that Sarai had mentioned; there just needs to be agreement about the structure among the subsystems. It would also make it easy to simply add the messages to kmsg if they are going to be overwritten or were not read. Sarai seemed amenable to that approach.
With that, the session ran out of time. Interested readers can check out the YouTube video and slides from the talk.
[ I would like to thank our travel sponsor, the Linux Foundation, for assistance with my travel to Tokyo for Linux Plumbers Conference. ]
Task-level io_uring restrictions
The io_uring subsystem is more than an asynchronous I/O interface for Linux; it is, for all practical purposes, an independent system-call API. It has enabled high-performance applications, but it also brings challenges for code built around classic, Unix-style system calls. For example, the seccomp() sandboxing mechanism does not work with it, causing applications using seccomp() to disable io_uring outright. Io_uring maintainer Jens Axboe is seeking to improve that situation with a rapidly evolving patch series adding a new restriction mechanism to that subsystem.
The core feature of seccomp() is restricting access to system calls; an installed filter can examine each system call (along with its arguments) made by a thread and decide whether to allow the call to proceed or not. The operations provided by io_uring are analogous to system calls, so one might well want to restrict them in the same way. But seccomp() has no visibility into — and thus no way to control — operations requested via io_uring. Running a program under seccomp() and allowing it access to io_uring almost certainly gives that program a way to bypass the sandboxing entirely.
As it turns out, io_uring itself supports a mechanism that allows the placement of limits on io_uring operations; LWN covered an early version of this feature in 2020. To create an operation-restricted ring, a process fills in an array of io_uring_restriction structures:
struct io_uring_restriction {
__u16 opcode;
union {
__u8 register_op; /* IORING_RESTRICTION_REGISTER_OP */
__u8 sqe_op; /* IORING_RESTRICTION_SQE_OP */
__u8 sqe_flags; /* IORING_RESTRICTION_SQE_FLAGS_* */
};
/* Some reserved fields omitted */
};
While the term "restriction" is used throughout the API, what these structures are doing is describing the allowed operations. Each has a sub-operation code affecting what is allowed:
- IORING_RESTRICTION_REGISTER_OP allows a specific registration operation — an operation that affects the ring itself. These operations include registering files or buffers, setting the clock to use, and even imposing these restrictions, among many others.
- IORING_RESTRICTION_SQE_OP enables an operation that can be queued in the ring; these include all of the I/O and networking operations supported by io_uring. The io_uring_enter() man page has a list of available operations.
- IORING_RESTRICTION_SQE_FLAGS_ALLOWED sets the list of operation flags that are allowed to appear in io_uring operations; these flags, listed in the io_uring_enter() man page, control the sequencing of operations, use of registered buffers, and more.
- IORING_RESTRICTION_SQE_FLAGS_REQUIRED creates a set of flags that must appear in each operation. Making a flag required implicitly sets it as being allowed as well.
The array of these structures can be installed with an IORING_REGISTER_RESTRICTIONS operation, after which it will be effective on the ring. This restriction mechanism is not as capable as what seccomp() can do; it cannot look at operation arguments, for example. But it is fast enough to not interfere with the performance goals of io_uring, and is sufficient to wall off significant parts of the API.
There is, however, a significant limitation to the current restriction mechanism: restrictions can only be applied to an existing ring, and that ring must be in the disabled state at the time. It works well for an application that, for example, needs to create a ring, add restrictions, then pass it into a container. It falls short, though, for use cases that want to allow io_uring in general, but with a specified subset of operations. Axboe's work is intended to address this limitation by allowing restrictions to be applied to a task rather than to a specific ring.
Specifically, this work started by adding a new operation, IORING_REGISTER_RESTRICTIONS_TASK, that can accept the same list of io_uring_restriction structures. That list will be stored with the calling task itself, though, rather than with a specific ring, and the restrictions will be applied to all rings subsequently created by that task. The list is applied to children during a fork, so the restrictions will apply to all child processes created after they are set up. These restrictions thus govern any rings created in the future, without the controlling task having to participate in that creation.
Once the restrictions have been set, they are immutable, with a couple of exceptions. The IORING_REG_RESTRICTIONS_MASK flag allows restrictions to be tightened further by removing allowed operations and flags, or by adding new required flags. The process that initially added the restrictions retains the power to modify them or remove them entirely. That process's children, instead, will remain stuck with the restrictions that were created for them.
At least, that was the state of things as of the second RFC version of the patch set. The third version made a number of changes, starting with the removal of IORING_REG_RESTRICTIONS_MASK and any other ability to change the restrictions once they have been put into place. The bigger change, though, was the addition of more flexible filtering using, inevitably, a set of BPF programs. Interestingly, that flexibility was reduced somewhat in later versions, as will be seen.
The current BPF implementation is a bit of a proof of concept. Among other things, it currently only properly filters the IORING_OP_SOCKET operation, which is the io_uring equivalent to the socket() system call. Operations can be controlled, but registration requests are not currently included in the BPF mechanism.
There is a new registration operation, IORING_REGISTER_BPF_FILTER, which adds a new BPF program to a ring; the program is associated with a specific IORING_ operation code. It will be invoked after the initial preparation for a new operation has been done; as a result, any structures provided by user space as part of the operation will have been copied into the kernel and will be available for the program to inspect. That gives these filters an advantage over seccomp(), which generally cannot access data in user space that is passed to the kernel via pointers.
The program will also be passed context specific to the operation in question; for IORING_OP_SOCKET, that context includes the address family, socket type, and protocol provided by user space. A non-zero return value from the BPF program allows the operation to proceed; otherwise it will be blocked. There can be multiple BPF programs attached to any given operation; they will be invoked in sequence, and any one of them can block an operation. While the current patch set does not implement this behavior, Axboe has said that he intends to change the behavior to "deny by default" in the future; if BPF is in use, then an operation will be disallowed unless a BPF program explicitly allows it.
By the time the patch set reached version 5 (with the "RFC" tag removed) things had changed again in an interesting way. There are two versions of BPF in the kernel, the "extended BPF" that is normally just called "BPF" in recent times, and "classic BPF", which is the earlier, BSD-derived variant that was designed for packet filtering. Classic BPF is far less capable and lacks compiler support; there have been no new users of it added to the kernel for years. But the current version of the io_uring patches now uses classic BPF rather than extended BPF.
Axboe noted that the usability of the feature is reduced by this change: "This obviously comes with a bit of pain on the usability front, as you now need to write filters in cBPF bytecode". The change was driven by the fact that classic BPF can be used by unprivileged processes, while extended BPF requires privilege (specifically, the CAP_BPF capability). For the desired use case of sandboxing containers, accessibility without privilege is important. It is worth noting that seccomp() also still uses classic BPF, for the same reason. The hooks for extended BPF are still there, but cannot be used.
As one might surmise, this patch set seems to be evolving quickly, and may well have changed again by the time you read this. It seems clear, though, that it will soon be possible to control access to io_uring at a level that, previously, has not been possible. Just as brakes allow a car to go faster, fine-grained control may make io_uring available in contexts where, until now, it has been blocked.
Responses to gpg.fail
At the 39th Chaos Communication Congress (39C3) in December, researchers Lexi Groves ("49016") and Liam Wachter said that they had discovered a number of flaws in popular implementations of the OpenPGP email-encryption standard. They also released an accompanying web site, gpg.fail, with descriptions of the discoveries. Most of those presented were found in GNU Privacy Guard (GPG), though the pair also discussed problems in age, Minisign, Sequoia, and the OpenPGP standard (RFC 9580) itself. The discoveries have spurred some interesting discussions as well as responses from GPG and Sequoia developers.
Flaws
Out of 14 discoveries listed on the gpg.fail site, 11 affect
GPG—they range from a flaw in GPG's --not-dash-escaped option (which
would allow signature forgery) to a memory-corruption flaw that "might
be exploitable to the point of remote code execution (RCE)". Two of
the discoveries affect Minisign (one, two); both of the
vulnerabilities allow attackers to insert content into trusted-comment
(i.e. metadata attached to a signature) fields.
The researchers also described an attack on OpenPGP's Cleartext
Signature Framework, which could allow an attacker to substitute
the signed data with malicious content "while retaining a seemingly
valid cryptographic verification" when using GPG or Sequoia. It is
worth noting, as they did, that the framework already has documented
issues and the GPG project recommends against its use, though it is
still supported and the recommendation was in the form of a short
phrase in a man page.
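For readers who have not encountered it by name, the Cleartext Signature Framework is the familiar armored format in which the signed text remains human-readable between the delimiter lines; the delimiters below follow RFC 9580, while the message body and the elided signature data are placeholders of my own:

```
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

This signed text is readable without any OpenPGP tooling; the
signature below is supposed to cover it.
-----BEGIN PGP SIGNATURE-----

[base64-encoded signature packet elided]
-----END PGP SIGNATURE-----
```

The attacks discussed here exploit the gap between the text a human reads in such a message and the bytes the implementation actually verifies.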
We won't try to recap each vulnerability in detail here; the gpg.fail site already has detailed write-ups of the discoveries. In addition, Groves and Wachter's presentation (video) is informative as well as entertaining. The slides and the proofs of concept from the presentation have not yet made their way to the site; I sent a query on January 15 about the slides to the contact email provided on the site, but have not received any reply.
Reactions
Demi Marie Obenour started
a discussion of the GPG vulnerabilities on the oss-security
mailing list, which prompted some of the list's participants to
examine and weigh in on the researchers' claims. Jacob Bachmeyer was
quick to respond to many of the researchers' findings. He agreed that
the memory-corruption flaw was a serious error in GPG, but claimed
that most of the flaws reported "are definitely edge cases", and
minimized the possible real-world impact. For example, he said that a
flaw in GPG's sanitization of file paths was potentially serious, but
that it also relied on a social-engineering attack. In short, the
described attack relies on a user following an attacker's suggested
method of opening a file using GPG, which would then trigger a fake
prompt that would look like this:
$ gpg --decrypt pts.enc && gpg pts.enc
gpg: WARNING: Message contains no signatures. Continue viewing [Y/n]?
If the user responds affirmatively to the prompt, the proof of
concept designed by the researchers would overwrite a file of the
attacker's choosing. Bachmeyer was dismissive of this, though, because
he felt it would only be effective against inexpert users: "While a
naive user might use the suggested command, a more-experienced user
should immediately smell a rat." Bachmeyer also wondered about a
discovery about gpg truncating plain text in a way that would allow an
attacker to extend a signed message with arbitrary data that would
still pass signature verification. He said that if there was a bug,
then it was an out-of-bounds read.
In response, Groves acknowledged that the exploit chain would need to
abuse the naivety of a user to trigger the technical problem. Despite
that, she said that software should do its best to protect against
human error. It might not fool "a hardcore cyberpunk, but to be
honest, it'd get me".
Groves also apologized for getting the writeup of the
signature-truncation exploit "slightly wrong", but said that it had
been correctly described in the presentation. She then provided an
in-depth explanation of the bug. She said, in part, that the bug was
really a malleability
attack where an attacker could try to manipulate the output of a GPG
operation. That was practically exploitable because GPG defaults to
standard Zlib compression, which has a predictable header, and an
attack would only require guessing seven bytes from the header to set it
up. The OpenPGP standard describes protection
against malleability that GPG violates in two ways. The standard
says that an implementation should not attempt to parse or release
data to the user if it appears to be malleable. GPG, however, still
does attempt this. The standard also says that an implementation must
generate a clear error that indicates the integrity of a message is
suspect, but Groves said that it is possible for an attacker to
circumvent that.
This is bypassed by *another* described bug, where by triggering an error *before* the checksum is printed, we can change the error message from "WARNING: encrypted message has been manipulated!" to a harmless-appearing "decryption failed: invalid packet". A user looking at the plausible PGP packet stream output would not suspect that there is anything wrong [...]
This chain of exploits allows doing this by just abusing logic bugs and odd decisions in GnuPG. Several of those, especially the bypass silencing the warning that MUST be printed, are technical, logical bugs that can and should be fixed.
Bachmeyer responded that he now saw what he missed on the first
examination. "Clever, very clever. :-)"
GPG creator and maintainer Werner Koch said on December 29 that he
agreed with most of the comments in Bachmeyer's first email. Koch
pointed to a tracking bug for the reports that were filed with the
GPG project by the researchers. According to Koch, the reports were
filed "one after the other" in October.
Because there was no clear statement on when we were allowed to publish them and no further communication, most of them were set to private. I set them to public when I noticed the schedule for the talk on December 26.
Koch called the memory-corruption bug good research but said it was
"the only serious bug from their list" and that it was questionable
whether it would actually allow an RCE. That bug was
fixed in the 2.5.14
release in November, but it had not been fixed in the 2.4
branch. Koch said that there was another release of 2.4 pending, which
presumably would contain a fix for the bug. However, he added that the
end of life for 2.4 would be coming in six months, so it would be
better for users to switch to 2.5.
GPG 2.4.9 was, in fact, released on December 30—though one would be forgiven for having missed it as there was no announcement of its release. It is not mentioned in the NEWS file that the GPG site advises users to consult, nor has it been mentioned on the gnupg-announce mailing list, though the 2.5.16 release was announced the same day that 2.4.9 was published.
Koch published a blog post ("Cleartext Signatures Considered Harmful")
that recommends detached signatures instead. In one of the bug
reports, Koch said that the suggestion to remove cleartext signatures
entirely was a no-go: "there are too many systems (open source or
in-house) that rely on that format. If properly used (i.e. using
--output to get the signed text) there is no problem".
"Staggeringly complex"
Peter Gutmann said that he was concerned that two researchers "walked
up to GPG and quickly found a pile of bugs, many relating to
authentication". OpenPGP signatures are the de facto standard for
authenticating code and binaries in the non-Windows world, and GPG,
"the one with all the bugs in its authentication handling", is what is
used. The first problem, he said, is that GPG is staggeringly complex.
It is not just a sign-and-encrypt application, but one that spawns off
other programs, has many command-line options that change between
releases, and even runs services. The other problem is the OpenPGP
format itself.
To appreciate just how bad it really is, grab a copy of RFC 9580 and see how long it takes you to write down the sequence of fields (not packets, fields) that you'd expect to see in a message encrypted with RSA and signed with Ed25519 (to grab the two opposite ends of the cryptographic spectrum) as well as the cryptographic bindings between them, i.e. which bits get hashed/signed/processed, and also provide a confidence level in your work. I suspect most people won't even get to that, the answer would be "I can't figure it out".
He said that it would be better if an application that does
something critical, like authenticating downloaded binaries, does that
and nothing more. Obenour suggested it might make sense to use OpenSSH
signatures instead. Gutmann replied that PGP signatures are fine as
long as they are used in a way that "there's only one simple,
unambiguous, and minimal-attack-surface way to apply them, as well as
a means of having them work over longer time periods". He has,
apparently, had some unpleasant experiences with trying to find
unexpired keys to verify Linux packages or images.
Sequoia thoughts
Sequoia developer Neal H. Walfield published a blog post on January 12
in response to the presentation, and in particular to the Cleartext
Signature Framework attack that "the researchers claim demonstrates a
security weakness in Sequoia". He praised the impressive number and
breadth of vulnerabilities found, but said that the researchers had
used a "naive translation" of gpg commands to sq commands. Using the
standard workflows for sq would have prevented the attack from being
successful.
Despite that, he said that the researchers had found a real bug in Sequoia:
When verifying a signature using sq, the caller specifies the type of signature that should be checked. In this case, we use --cleartext. Yet, the inline signature was verified, which should only be done if the caller passed --message. This is due to a known issue in our library, which unfortunately we haven't yet had the chance to fix. Had we fixed this, this would have mitigated this attack. Nevertheless, the possibility for confusion remains, and the next step should always use the verified data and not the original file. We plan to address this issue this quarter. Thanks to the security researchers for showing us that the issue has a practical security impact.
Walfield concluded by saying, once again, that the researchers had done impressive work—but wished that they had explained that the signature-verification attack required incorrect use of Sequoia.
Overall, it would appear that the gpg.fail researchers have uncovered some real issues that need to be addressed—only some of which are easily patched. The complexity of the tools and their use will remain a barrier for secure use long after any code vulnerabilities are fixed.
Removing a pointer dereference from slab allocations
Al Viro does not often stray outside of the core virtual filesystem area; when he does, it is usually worthy of note. Recently, he wandered into memory management with this patch series to the slab allocator and some of its users. Kernel developers will often put considerable effort into small optimizations, but it is still interesting to look at just how much effort has gone toward the purpose of avoiding a single pointer dereference in some memory-allocation hot paths.
The slab cache
The kernel's slab allocator exists to provide quick allocations of fixed-sized objects. For example, the kernel uses large numbers of dentry structures to cache information about file names; on the system where this is being written, there are currently over 800,000 active dentry structures, as reported by /proc/slabinfo. Requests to allocate and free these structures are frequent, so their performance matters.
The slab allocator provides a function, kmem_cache_create(), that returns a pointer to a newly allocated and initialized kmem_cache structure. This pointer, in turn, can be used (by calling kmem_cache_alloc()) to allocate a new object of the size that this particular cache was configured for. The virtual filesystem layer, for example, can use a slab cache to allocate dentry structures. The slab allocator will maintain a cache of available structures, handing them out on request; it will also make an effort to lay them out optimally in pages of memory obtained from the page allocator. Even simple operations in the kernel may involve allocating and freeing a number of objects, so considerable effort has gone into optimizing the slab allocator over time.
While slab caches can be created and destroyed dynamically, there are a number of them that exist for the life of the system. The cache for dentry structures, for example, is created during the system bootstrap process, and the struct kmem_cache pointer for this cache is stored (as dentry_cache) in memory that is made read-only once the bootstrap is complete. Code that needs to allocate or free a dentry structure will, once compiled, contain the address of dentry_cache, which can be used to fetch the pointer to the kmem_cache structure that must be passed to the slab allocator. Most of the time, this extra dereference will be a small cost relative to the cost of allocating a new object but, according to Viro, it does have a measurable effect for heavily used caches.
Optimizing out that dereference thus has some appeal, and it should be possible. The value of the dentry_cache pointer is constant; once it has been set, it will not change for the life of the system. All that is needed is to replace the address of dentry_cache, in every place in the kernel binary where it appears, with the address of the kmem_cache structure that is stored there.
Run-time constants
The above description of how the dentry cache slab is accessed is not, as it turns out, fully accurate for current kernels. If one looks in fs/dcache.c in a 6.19-rc kernel, one will see that the slab pointer is declared as:
static struct kmem_cache *__dentry_cache __ro_after_init;
#define dentry_cache runtime_const_ptr(__dentry_cache)
The pointer to the slab cache used for dentry structures is actually stored in a variable called __dentry_cache; the unadorned dentry_cache name is created by the #define in the second line. This declaration sequence demonstrates the "run-time constant" mechanism that was added to the 6.11 kernel by Linus Torvalds. He never quite got around to documenting this new feature — surely it must be near the top of his "to-do" list at this point — so one has to reverse engineer it. In short, run-time constants do what was described above: they patch an address directly into the code, at run time, so that a dereference operation can be avoided.
To set up a pointer as a run-time constant, the first step is to declare it using the runtime_const_ptr() macro as seen above. That macro returns a value that, using #define, is bound to the name that the rest of the code uses for the pointer (dentry_cache, without underscores, in this case). There are other macros used to set the value of a run-time constant; for the dentry slab, the constant is set, in dcache_init(), using runtime_const_init():
__dentry_cache = KMEM_CACHE_USERCOPY(dentry,
                                     SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_ACCOUNT,
                                     d_shortname.string);
runtime_const_init(ptr, __dentry_cache);
The kmem_cache is allocated with KMEM_CACHE_USERCOPY(), which is a macro wrapping kmem_cache_create(); the resulting pointer is then used to set the value of the run-time constant. This initialization will cause any instruction in the kernel code that references dentry_cache to contain the kmem_cache pointer instead. So the extra dereference is eliminated; this was the motivation for the addition of the run-time constant machinery in the first place.
So it might appear that the problem is already solved, but there are a couple of ways in which this solution falls short. The first is that not all architectures support run-time constants, though it seems that the most important architectures do. The second is that this mechanism only works during the system bootstrap process; once the system is fully booted, it is no longer possible to modify the kernel text to reflect the actual value of a run-time constant. That, in turn, means that run-time constants cannot be used in loadable modules.
Static kmem_cache structures
Rather than try to fix run-time constants to address those problems, Viro decided to focus on the slab problem specifically. It is easy enough to have kernel code contain a pointer to the kmem_cache structure it needs to use, without the need for run-time code patching, if that structure is allocated statically on the caller's side. The address of that structure becomes a compile-time constant. Even loadable modules would be able to use such a feature, at least for slabs allocated and managed within the module itself.
One small obstacle that needs to be overcome is that the definition of struct kmem_cache is hidden from code outside of the slab allocator itself, and for good reasons. That will make it hard to declare those structures elsewhere in the kernel. The key to the solution is the realization that this code really only needs to allocate some memory that is large enough to hold a kmem_cache structure. So Viro's patch set introduces a new type, struct kmem_cache_opaque, that is defined in such a way that it is the same size as struct kmem_cache, but which does not reveal any of the details of that structure. There is a new macro, to_kmem_cache(), that will cast a pointer to the opaque form of the structure to the regular type expected by the slab subsystem.
With these changes, the declaration of dentry_cache becomes:
static struct kmem_cache_opaque __dentry_cache;
#define dentry_cache to_kmem_cache(&__dentry_cache)
A few other changes are needed to convert a subsystem over to a static kmem_cache structure. The usual call to kmem_cache_create() becomes, instead, a call to kmem_cache_setup() with the same parameters. (In the dentry cache, the more specialized KMEM_CACHE_USERCOPY() macro becomes KMEM_CACHE_SETUP_USERCOPY()). Otherwise, code that allocates and frees objects works without change.
Making this feature work in modules required a bit more plumbing to ensure that the cleanup of statically allocated slab caches is completed before the module that created those caches is removed. Within the slab allocator itself, the main change was to make note of preallocated kmem_cache structures so that the slab code does not try to allocate or free them itself. Statically allocated slab caches also cannot be merged with any others.
The patch series converts a fair number of caches in the core kernel
and filesystem subsystems to the static variant. There are no
benchmark results showing how much of a performance improvement
ensues. Torvalds was happy with the patch set, calling it "much better
than runtime_const for these things". Thus far, there have not been
many comments from the memory-management developers. Assuming they
have no complaints, the path for this work into the mainline looks
relatively smooth.
An alternate path for immutable distributions
LWN has had a number of articles on immutable distributions,
such as Bluefin and
Bazzite, in recent years. These distributions have taken a variety of approaches, including
using
rpm-ostree, filesystem snapshots, and
bootable container (bootc) images. But those
approaches, especially bootc images, lead to extra complexity for a user
attempting to install new software, instead of just
using the existing package manager.
AshOS (Any Snapshot Hierarchical OS) is an experimental
AGPL-3-licensed "meta-distribution" that tried a different approach,
one more in line with traditional package management. Although the
project is no longer updated, it remains usable, and can still shed
some light on a potential alternate path for users worried about
adopting bootc-based approaches.
There are a few reasons to find immutable distributions appealing. The fact that updates can be applied and rolled back atomically is probably chief among them, but they also make it easier to reproduce a corresponding installation from a small set of configuration files. Immutability means that changes to the configuration are cleanly separated out, so it's easy to see how a long-lived immutable system has been changed, and to reproduce a new system from those changes. The question, as always, is whether these benefits are worth the disk usage of A/B updates, build times of any custom images, complexity, and inconvenience. AshOS was an attempt to change the balance of that tradeoff by making it possible to add immutability onto an existing distribution, while keeping the existing distribution's package manager in control.
To do this, AshOS makes and manages snapshots of the root filesystem (including installed software and configuration files, but excluding users' home directories). These snapshots can be overlaid on top of one another, so that package-management operations can be separated out and named (similar to sysext images). For example, a user might have one snapshot for the base OS installation, another for their graphical user environment, and a third for their mail server. To try out a new piece of software or configuration change, they would create a new snapshot to work in on top of the existing ones. Within that snapshot, they could use the distribution's normal package manager to install the software. If that turned out to have horrible consequences, they could remove the most recent snapshot and return to the previous (working) configuration.
Other immutable distributions do offer similar capabilities. Fedora Silverblue-based distributions can use rpm-ostree to layer packages on top of a base image, for example. But doing so is full of sharp edges compared to just using dnf — for example, if a base image is updated to include a package that was previously layered in, removing the existing layer is difficult to do with rpm-ostree.
Trying it out
It's a charming idea. I have been using Silverblue for a bit more than a year, and found it just enough of a hassle for AshOS's promised simplification to be tempting. Unfortunately, AshOS is in the "usable but not without struggle" stage of open-source projects. The recommended installation procedure is to obtain a live ISO of one's preferred distribution, use that to download the AshOS source code and to configure disk partitions, and then run AshOS's setup script. AshOS tries to be portable across distributions, but it does still have a small amount of per-distribution code (to facilitate installation and produce smart diffs between snapshots). The currently supported distributions are Alpine, Arch, CachyOS, Debian, EndeavourOS, Fedora, Kicksecure, Proxmox, and Ubuntu — although the documentation suggests that Gentoo also works, and other distributions similar to the supported ones might work.
I downloaded Arch's 2026.01.01 live ISO image, and went through the installation process in a virtual machine. It took me a few attempts to do it right: AshOS has no particular enforcement that one has set up the partition table in a sensible way, but it also has to know exactly what each partition is for in order to set up snapshots correctly. My first time through the procedure, I ended up with an unbootable system. Eventually, I was able to match AshOS's understanding of my chosen disk layout to the partition table, and things went more smoothly.
The installer is text-based and somewhat rudimentary, but not terribly complicated. After getting through the difficult partitioning-based questions, it asks for the normal details such as username, password, and timezone. Once complete, it installs a minimal package set from the host OS (Arch, in my case) and instructs one to reboot. In the default installation configuration, it uses Btrfs on the root disk for making snapshots, but the documentation suggests that it can be made to work with other filesystems if one tries.
By default, AshOS doesn't install anything other than the bare minimum needed to log in at a text console. It does have the ability to install a few "profiles": special commands that set up common desktop environments. I made a new snapshot and then installed the GNOME profile. Rebooting produced an extremely minimal, but functional, Arch installation. From there, I experimented with adding, cloning, deleting, and updating snapshots with the provided command-line tool (ash), and found the process much more intuitive than the installation procedure.
AshOS manages snapshots in a tree structure starting from the base OS installation. Any branch of the tree can be selected at the boot menu. So, for example, it is possible to have two different desktop environments installed (for testing, perhaps) on top of a shared set of software. More commonly, one could make a copy of one's existing configuration (using AshOS's tree-manipulation commands), try out an upgrade or modification, and still be able to easily switch back. Each layer is given an ID number and an (optional) human-meaningful description, both of which show up in the boot menu and when manipulating the tree of snapshots from the command line.
The "ash branch --snap N" command can be used to branch from snapshot N; this prints out the ID of the new snapshot, M. Then "ash chroot M" will spawn a new shell within this snapshot wherein one can make changes using normal tools, such as "pacman -Sy firefox" (pacman's somewhat cryptic installation command) or "vim /etc/resolv.conf". When exiting the shell, ash saves the changes and updates the version of the snapshot in the tree. There is also an ash live-chroot command to open a shell with the currently running snapshot mounted as writable, in case one needs to mess with things without creating a separate snapshot first.
There are also some helper commands for running a command on each snapshot in an entire subtree. For example, one could run the package manager's update command on all of the snapshots below a certain point on the tree.
Configuration
AshOS's snapshot management clearly qualifies it as an immutable meta-distribution, but one weakness of a purely snapshot-based approach compared to creating custom bootable containers is configuration management: once a system is set up, how can you tell exactly what was changed, and how to recreate the same setup elsewhere?
This is where the majority of AshOS's distribution-specific code comes in. AshOS has a set of hooks for each supported package manager that lists installed packages and attributes files to them. The ash diff command uses these hooks to produce diffs between snapshots. The diffs show which packages were added and which configuration files were changed. This isn't quite the same level of reproducibility as building an operating-system image from a container description, but as long as one remembers to assign meaningful descriptions to each snapshot, it seems like a clean-enough way to track operating-system changes.
Overall, AshOS is rough. The command-line tooling, for all that it seems reliable, has slightly inconsistent syntax. The installation procedure is baroque, the documentation is a bit messy, and the last development occurred just over two years ago. The project's creator and only contributor, "i2", appears to have lost interest. Despite that, it works — and arguably does so more smoothly than traditional immutable desktops such as Fedora Silverblue. It serves as a proof of concept that it is possible to have the nice parts of immutability (atomic upgrades and rollbacks, configuration management) without giving up traditional package management or switching to a container-centric approach. Nothing that AshOS does is out of reach for someone with some scripting knowledge and a filesystem that supports snapshots, although it does a good job of packaging those operations into a single tool.
AshOS is currently only suitable for people who don't mind relying on experimental, unmaintained software. But the core principle of organizing and layering filesystem snapshots to turn a traditional Linux system into an immutable one seems entirely sound. Perhaps such an approach could prove a useful middle-ground for people wrestling with the downsides of bootable containers — or inspire the developers of those systems to improve the user-interface for installing additional packages.
Page editor: Joe Brockmeier
Inside this week's LWN.net Weekly Edition
- Briefs: Pixel exploit; telnetd exploit; OzLabs; korgalore; Firefox Nightly RPMs; Forgejo 14.0; Pandas 3.0; Wine 11.0; Quotes; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.
