
Leading items

Welcome to the LWN.net Weekly Edition for May 8, 2025

This edition contains the following feature content:

  • Debian's AWKward essential set: the difficulties of removing mawk from Debian's essential package set.
  • Custom out-of-memory killers in BPF: a proposal to let BPF programs implement OOM-killing policy.
  • Injecting speculation barriers into BPF programs: making more BPF programs safe for unprivileged users.
  • Flexible data placement: giving SSDs hints about data lifetimes.
  • Improving FUSE writeback performance: eliminating temporary pages from the FUSE writeback path.
  • Filtering fanotify events with BPF: more efficient filtering of filesystem events.
  • Hash table memory usage and a BPF interpreter bug: shrinking BPF hash-table elements, plus a possible problem in the BPF interpreter.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Debian's AWKward essential set

By Joe Brockmeier
May 7, 2025

The Debian project has the concept of essential packages, which provide the bare minimum functionality considered absolutely necessary (or "essential") for a system to function. Packages tagged as essential, and the packages they require, are always installed as part of a Debian system. However, Debian's packaging rules do not require developers to explicitly declare dependencies on that set of packages (the essential set); they can simply rely on the fact that those packages will always be present. That means that changing the essential set, as the project may wish to do occasionally, is more complicated than it should be. This came to light recently when a Debian developer asked what might be required to remove mawk to slim down the project's container images.

Simon Josefsson observed that the Fedora project's container images do not have an AWK interpreter. He wondered if it was possible to remove mawk from the default set of tools in the upcoming Debian 13 ("trixie") release to reduce the size of the Debian container image. If not, then what might be the blockers to doing so?

For the record, Debian's container images are already slightly smaller than Fedora's. The default Fedora 42 image is about 56MB, while a comparable Debian image is a bit more than 48MB. Debian's container images include the packages from a Debian minimal installation ("minbase"), which is made up of packages that are tagged as essential or required. The difference between "essential" and "required" is a bit subtle. Essential is a separate control field that, if set to "yes", indicates that a package is necessary for any Debian system. Each Debian package also has a priority which indicates whether it is required for a system to function, completely optional, or somewhere in between. A package, like mawk, may have a priority of "required" without being tagged as essential. One can see all packages tagged as essential on a Debian system by running "aptitude search ~E", and all packages with priority "required" by running "aptitude search ~prequired".

Paul Tagliamonte, who is a co-maintainer of the Debian container images along with Tianon Gravi, said that changing packages that are pre-installed in the image is not something that the team wanted to do in isolation. "The package(s) should be added or removed from the minbase set via our usual conventions" for revising the set of essential and required packages.

One obvious objection to the removal, voiced by Antonio Terceiro, would be the fact that Debian entered soft freeze for trixie on April 15—two days before Josefsson asked about removing mawk. This means that Debian maintainers should be focusing on "small, targeted fixes" rather than, say, tearing out the plumbing. Russ Allbery pointed out that removing mawk would be "a very substantial amount of work" in part because it is one of the packages that can provide awk functionality that the essential package base-files depends on. Awk was added as a dependency of the package in 1998 by Santiago Vila so that an AWK implementation would be available to other packages if needed. Removing it would hardly be a small fix, especially just to slim down a container image by about 292KB on x86_64 systems. (The installed size of the package differs depending on the architecture.)

The base-files package, which contains the basic filesystem hierarchy of a Debian system plus "important miscellaneous files", can use other packages that provide the awk dependency, such as the GNU awk (gawk) package. However, gawk weighs in at about ten times the size of mawk, at more than 2.9MB on x86_64 systems. The original-awk package would reduce the installed size somewhat, since it weighs in at only 184KB, but it is unclear whether it meets the requirements of all the packages expecting an AWK interpreter.

Let's get small

Jonathan Dowland questioned why one would target a tiny dependency like awk when there is larger game afoot. "If you are in the minimizing game, perhaps you'd rather remove perl from the essential set?" He did add that it would be a substantially harder project to remove Perl, though Gioele Barabucci said he had been working on such a project for the past three years and found it "very doable":

I have PoC VMs where perl is not installed at all. I'm very slowly polishing and upstreaming my changes: I've sent dozens (hundreds?) of patches and there now are only about 40 maintscripts where perl is directly used. Most importantly, I've already removed all uses of perl from postrm maintscripts in bookworm. I've also written shell-only replacements of Perl programs used in the transitive essential set.

Barabucci added that he may send out a more public announcement of his Perl-pruning project once Debian 14 ("forky") is open for development.

Some, like Michael Stone, wondered why one would use Debian at all if the goal is a distribution optimized for small container images. Running Alpine Linux without Perl, he said, "is already a solved problem". Then again, Chris Hofstaedtler observed, that was true for many things that Debian is used for—GNOME users could just use Fedora instead, and that would save the project the trouble of maintaining GNOME.

People like to use Debian for a lot of different reasons. Very large and very small installs are "just" usecases too. When there are enough people interested (and so on...) in it, it will happen.

Essentially explicit

The discussion did not dwell exclusively on diminishing Debian's Docker image. Some of the focus shifted to the interesting question of implicit versus explicit dependencies on packages or functionality provided by Debian's essential packages. The presence of awk in the essential set has a "weird effect that dependencies on awk are hidden", Tim Woodall said, unless the package depends on a particular implementation of awk. Josh Triplett replied that he would love to see the essential set reduced, but noted that it was more complex than a single-step process.

The first step, Triplett said, is asking whether the project should make dependencies on awk explicit "rather than having them be implicit and undocumented because awk is Essential". The second step would then be to ask if the project should reduce dependencies on awk. He later elaborated that it would be possible to make progress toward removing "awk" if each package that relied on it declared that explicitly, rather than relying on the essential package set.

Andrey Rakhmatullin asked if the project should start declaring dependencies on all essential packages explicitly. Triplett responded that he thought it'd be a good idea, but he was not trying to make the case for that across the board. "Right now, I'm trying to make the case that that's a good first step for any packages people might want to work on making optional."

G. Branden Robinson, though, endorsed the idea and offered several reasons that he felt essential dependencies should be explicit. He asserted that dependencies "should be as decoupled from the details of the set of 'Essential' packages as possible" because of the way packages enter and exit that group. The decision to add a package to the set is, in principle, decided by all Debian developers collectively. In practice, it is decided by a "vaguely defined intersection of the dpkg maintainer(s), the release managers, and installer team, I guess".

The Debian Policy Manual states that a package may not be tagged essential unless it has been discussed on the debian-devel mailing list, and a consensus has been reached in favor of adding it. The manual is, to Robinson's dismay, less prescriptive about reaching consensus before removing a package from the essential set.

That implies to me that a package can be taken _out_ of the essential set unilaterally by the package maintainer(s) of a package that's in it, but because of the status quo of being able to depend on an essential package without declaring that fact, in practice that probably wouldn't work well, and we should update the Policy Manual to require discussion of the dropping of such a "tag" as well.

Triplett said that was a bug in the policy as written, not in the way the policy is practiced. "Historical practice has definitely been to discuss such removals (extensively)." He agreed that Debian should have a well-defined process for removing things from the essential set.

Theory versus practice

Declaring essential dependencies explicitly may sound good in theory, Adrian Bunk replied, but it would be horrible in practice. For example, the libc6 package, which provides shared GNU C libraries "used by nearly all programs on the system", has a pre-installation script that calls dpkg, sed, grep, and rm. To make those dependencies explicit would require adding sh, dpkg, sed, grep, and coreutils as Pre-Depends dependencies.

Doing that, he said, would likely result in broken systems during upgrades due to cyclic dependencies. He added that it would also be a waste of time to have to declare dependencies on essential packages just because, for example, a package has a Bash-completion file with dependencies on things like awk, grep, sed, and sort. "That's dependencies on four essential packages".

Robinson said he was not yet convinced that it would be horrible. Upgrades would only encounter cyclic dependencies if the Pre-Depends were versioned in a way that caused the cycle. "In which case there likely _really is_ a breakage problem" because one package or the other needs to delay the use of a newly available feature. He waved away the concern about packagers wasting time on declaring dependencies for Bash-completion scripts. Many do not use Bash as their interactive shell, and completion only matters for interactive shells. Nevertheless, Bunk still thought that the suggestion of explicitly declaring essential dependencies would lead to "a huge amount of error-prone manual work".

Libc6 is a good example of a potential problem, Dowland said, but a library "doesn't likely reflect the experience we'd have for the vast majority of packages".

No AWKward omissions

For now, it does not appear that awk—in any of its forms—will be disappearing from the essential set for trixie. Whether it is excised during the forky cycle, or whether the project pursues some other way of slimming down Debian's container images, is up in the air.

Comments (29 posted)

Custom out-of-memory killers in BPF

By Jonathan Corbet
May 1, 2025
The out-of-memory (OOM) killer has long been a scary and controversial part of the Linux kernel. It is summoned from some dark place when the system as a whole (or, more recently, any given control group) is running so low on memory that further allocations are not possible; its job is to kill off processes until a sufficient amount of memory has been freed. Roman Gushchin has found a way to make the OOM killer even scarier: adding the ability to load custom OOM killers in BPF.

The kernel, in its default configuration, will overcommit the memory available on the system; it will allow processes to allocate more memory than can be provided (that is, more than the sum of physical memory and swap space). Applications routinely allocate more memory than they use; limiting allocations to the available memory would, as a result, cause some of that memory to be unused. Overcommitting memory avoids that waste, and it almost always works out in the end.

The rare occasion where it doesn't work out is reminiscent of the (not even remotely rare enough) situation where an airline overbooks the seats on a flight. When too many passengers actually show up, some unlucky person will lose their seat. In the Linux world, instead, some process loses its memory — and, with it, the ability to run at all. The kernel is not able to broadcast a request for volunteers to be killed, though, so the OOM killer has to apply a set of heuristics in an attempt to find the victim that will free the most memory while minimizing user anguish.

Unsurprisingly, the choices made by the OOM killer often align poorly with the decisions the human(s) using the computer would have made. The kernel provides a set of knobs to help tune the heuristics, essentially by allowing some processes to volunteer (or be volunteered) to be the first OOM-killer targets. In recent years, as well, there has been a lot of effort that has gone into user-space OOM killers; these include tools like oomd, systemd-oomd, and the Android lmkd.

User-space OOM killers have some fundamental problems, though. Since they run in user space, they will necessarily be slower than the kernel to respond to a low-memory situation, and such situations call for an urgent response. Running the user-space OOM killer may, itself, require allocating memory at the worst possible time. User-space killers may also lack useful information about the state of the system that is available within the kernel. So, while OOM killers running in user space can make use of information about the specific workload running on the system, leading to better decisions, they fall short in other ways.

Gushchin's series aims to address these problems by providing two hooks with which BPF programs can implement OOM handling, enabling OOM killers that retain the advantages of user-space control while running inside the kernel. The first of these hooks (even though it comes second in the series) is called in response to events from the pressure stall information (PSI) subsystem:

   int bpf_handle_psi_event(struct psi_trigger *t);

A program that attaches to this hook will be invoked whenever a PSI event triggers, indicating that a threshold has been exceeded and that memory pressure is causing the workload to slow down. That program can use the PSI information, along with anything else it might have at hand, to come to its own conclusion about the state of the system. Should the judgment be that the time has come to push the panic button, this program can declare an OOM event with a new kfunc:

    int bpf_out_of_memory(struct mem_cgroup *memcg, int order);

The memcg parameter, if non-NULL, limits the OOM event to the given memory control group; order describes the order of the allocation being attempted — which is not entirely applicable in this situation, since memory is not actually being allocated. This call will summon the OOM killer to deal with the problem.

It is worth noting that the kernel is only able to handle one OOM event at a time; if one control group is on fire, the kernel cannot respond to OOM situations in any others. In that case, bpf_out_of_memory() will return -EBUSY.
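
As a rough illustration of how the PSI hook and the new kfunc might fit together, here is a minimal sketch of a handler; the SEC() name, the attachment mechanism, and the overall program structure are assumptions rather than code from the patch set, and only the hook and kfunc prototypes come from the article.

    /* Minimal sketch of a PSI-triggered handler using the RFC's kfunc.
     * The SEC() name and attachment details are assumptions. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    extern int bpf_out_of_memory(struct mem_cgroup *memcg, int order) __ksym;

    char LICENSE[] SEC("license") = "GPL";

    SEC("struct_ops/handle_psi_event")      /* attachment point is a guess */
    int BPF_PROG(handle_psi_event, struct psi_trigger *t)
    {
        /* A real handler would examine the trigger and any other state
         * it tracks before concluding that the system is in trouble. */
        if (!t)
            return 0;

        /* Declare a system-wide OOM event (NULL memcg, order 0); a
         * return of -EBUSY means another OOM event is already being
         * handled. */
        return bpf_out_of_memory(NULL, 0);
    }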

The other new hook comes into play once the OOM apocalypse has struck:

    int bpf_handle_out_of_memory(struct oom_control *oc);

This function, which will be called to answer the invocation of the OOM killer, should do something to address the OOM situation, returning a non-zero value if it succeeds in freeing some memory. The kernel checks after the BPF program runs to be sure that memory was really freed; it does not just take the program's word for it. If the BPF program is unable to free memory, the normal kernel OOM killer will step in to try to get the job done properly.

Since "do something" likely involves killing processes, the series provides a handy new kfunc to do just that:

    int bpf_oom_kill_process(struct oom_control *oc, struct task_struct *task,
			     const char *message);

This function will bring about the untimely demise of the process indicated by task, updating oc to indicate whether memory was successfully freed. It is worth noting that bpf_oom_kill_process() is made available to all BPF programs of the tracing type, which normally are not in the business of killing processes. There is another new kfunc, bpf_get_root_mem_cgroup(), that can be used to get the root of the control-group hierarchy. The OOM-killer program can then traverse that hierarchy in search of the best victim.
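
For a sense of how these pieces might be used together, here is a sketch of a trivial OOM-handling program that simply kills the allocating task, in the spirit of the kernel's oom_kill_allocating_task sysctl. The SEC() name, the use of bpf_get_current_task_btf(), and the return-value conventions are assumptions; a real handler would more likely start from bpf_get_root_mem_cgroup() and walk the control-group hierarchy in search of a better victim.

    /* Sketch of a trivial OOM handler: kill the allocating task.  Only
     * the hook and kfunc prototypes come from the article; everything
     * else is an assumption. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    extern struct mem_cgroup *bpf_get_root_mem_cgroup(void) __ksym;
    extern int bpf_oom_kill_process(struct oom_control *oc,
                                    struct task_struct *task,
                                    const char *message) __ksym;

    char LICENSE[] SEC("license") = "GPL";

    SEC("struct_ops/handle_out_of_memory")  /* attachment point is a guess */
    int BPF_PROG(handle_out_of_memory, struct oom_control *oc)
    {
        struct task_struct *victim = bpf_get_current_task_btf();

        /* Assuming the kfunc returns 0 on success; return non-zero to
         * claim that memory was freed.  The kernel verifies that claim
         * and falls back to its own OOM killer if nothing was freed. */
        if (bpf_oom_kill_process(oc, victim, "bpf: killed allocating task") == 0)
            return 1;
        return 0;
    }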

Of course, the OOM-killer program could also take other actions to address the memory problem. Gushchin suggests deleting tmpfs files as one example. Or, depending on the application being run, the program could request that memory be freed in some other way that does not involve killing off processes.

This is an RFC patch set that is not intended to be applied in its current form. As of this writing, there have been few comments. Michal Hocko suggested that it might make sense to split bpf_handle_out_of_memory() so that the different cases (global, control group, or cpuset) are handled separately; he also asked whether any real OOM handlers have been implemented with this infrastructure.

It seems likely that the prospect of giving BPF programs the ability to run around and kill processes will make some people uncomfortable. Of course, the OOM killer always makes people uncomfortable. But, as long as this creature must exist, it may be best if it runs within the kernel.

Comments (17 posted)

Injecting speculation barriers into BPF programs

By Jonathan Corbet
May 5, 2025
The disclosure of the Spectre class of hardware vulnerabilities created a lot of pain for kernel developers (and many others). That pain was felt especially acutely in the BPF community. While an attacker might have to laboriously search the kernel code base for exploitable code, an attacker using BPF can simply write and load their own speculation gadgets, which is a much more efficient way of operating. The BPF community reacted by, among other things, disallowing the loading of programs that may include speculation gadgets. Luis Gerhorst would like to change that situation with this patch series, which takes a more direct approach to the problem.

While the potential to enable speculative-execution attacks may be a concern for any BPF program, the problem is especially severe for unprivileged programs — those that can be loaded by ordinary users. Most program types require privilege, but there are a couple of packet-filter program types that do not (though the unprivileged_bpf_disabled sysctl knob can disable those types too). Among the many defenses added to the BPF subsystem is this patch by Daniel Borkmann, which was merged for the 5.13 release in 2021. It causes the verifier to treat possible speculative paths (for Spectre variant 1 in particular) as real alternatives when simulating the execution of a program, even though the verifier can demonstrate that such paths will not be taken in non-speculative execution. If the program does something untoward on one of those speculative paths, it will be rejected by the verifier.

In other words, an unprivileged BPF program must behave correctly even in the presence of branch decisions that have been guessed incorrectly by the CPU. Gerhorst performed an analysis of 364 BPF programs from various projects, and found that "31% to 54% of programs" would be rejected as a result of this extra verification requirement. So a sizable subset of valid BPF programs cannot be loaded by unprivileged users out of fear of speculative attacks. That makes BPF rather less useful than it would otherwise be.

The approach chosen by Gerhorst to address this problem is relatively simple: if the verifier is unable to prove that a given speculative path will execute correctly, it injects a speculation barrier into the code at the branch that might be mispredicted. That barrier (an lfence instruction on x86_64 systems) halts all speculation until the true execution catches up with the barrier, making it impossible to mispredict the branch. The verification of the speculative path can simply be halted, since that path will not be taken even speculatively; any bad behavior in that path will thus no longer cause the program to fail verification.
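
As a schematic example (not taken from the patch set), consider the classic bounds-check pattern below; the function and variable names are invented purely to show where such a barrier would land.

    /* Schematic Spectre-v1 pattern; names are invented for illustration. */
    long lookup(const long *array, unsigned long idx, unsigned long size)
    {
        if (idx < size) {
            /* A mispredicting CPU can enter this block speculatively even
             * when idx >= size.  If the verifier cannot prove that the
             * speculative path is harmless, the patches inject a barrier
             * (lfence on x86_64) here, so the potentially out-of-bounds
             * load below can never execute speculatively. */
            return array[idx];
        }
        return 0;
    }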

The solution is conceptually simple, though implementing it adds some complexity to the verifier. It widens the set of programs that an unprivileged user is allowed to load (although Gerhorst does not say by how much). This approach has some downsides as well.

One of those is performance; speculation barriers are, since they disable speculative execution, relatively expensive. The actual cost of injecting them, according to Gerhorst, is "0% to 62%", depending on the program. This cost is only paid, though, by programs that would be rejected by the verifier without the barrier injection; users are likely to agree that slowed-down execution is, in the end, better than being unable to run the program at all.

The other potential concern is for architectures that are vulnerable to Spectre variant 1, but which do not provide a suitable barrier instruction. This patch series, as currently written, will disable the current, verifier-based checking on those architectures without actually replacing it with barrier-based protection. According to Gerhorst (as described in this patch), the only architecture that would be affected by this problem is MIPS, which does not allow unprivileged BPF at all by default. Thus, he said, this potential security regression is "deemed acceptable". The status of some of the more obscure architectures supported by the kernel, including whether they are vulnerable to Spectre variant 1 at all, is unknown, he said.

This is the second posting for this series; the first posting was in late February. So far, neither posting has received any review comments at all. That may be because the BPF community has long since decided that unprivileged BPF is a dead end. It is simply too difficult to make it possible for unprivileged users to load code to run in the kernel and have the result be secure. The Spectre vulnerabilities may have been the last straw in that regard; they opened up a whole new set of exploitation paths that the BPF developers had never needed to take into account before.

Still, unprivileged BPF does exist in current kernels, though it is not clear how many distributors enable it. If the kernel is going to support that feature, it should be supported as well as possible. Gerhorst's patches make it possible to run more programs in the unprivileged mode without opening up Spectre vulnerabilities. The goal seems worthy; the real question will be whether it justifies the complexity that the implementation necessarily adds to the verifier.

Comments (3 posted)

Flexible data placement

By Jake Edge
May 2, 2025

LSFMM+BPF

At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Kanchan Joshi and Keith Busch led a combined storage and filesystem session on data placement, which concerns how the data on a storage device is actually written. In a discussion that hearkened back to previous summits, the idea is to give hints to enterprise-class SSDs to help them make better choices on where the data should go; hinting was most recently discussed at the summit in 2023. If SSDs can group data with similar lifetimes together, it can lead to longer life for the devices, but there is a need to work out the details.

Joshi began by noting that the logical placement of data provided by the host system is not the same as the physical placement of it on the device. There is a question of where the placement decision is made; if there is a data creator and multiple layers between it and the device (e.g. filesystem, device mapper), it is the piece that is closest to the device that ultimately decides where the data goes, he said. Currently, data is generally written sequentially because there is a single append point in a single open erase block on the device.

Flexible data placement (FDP) is an NVMe SSD feature that allows writes to be tagged to indicate whether they should be grouped together or not. SSDs with FDP can have multiple append points in separate erase blocks in order to group the data based on its tag. It is not an error to write untagged data or with an invalid tag, however. It is an open question whether the applications or the layers between, like filesystems, should be deciding which tags to apply; the device itself does not care, but if the data is tagged, it "can get grouped as originally intended", Joshi said.

Busch said that SSDs generally have a lot of resources to do things in parallel, but that "without any hints, it's not going to know what the separation should be". Hints would allow multiple applications to be writing without sharing the resources. These hints will also help reduce write amplification because data with the same lifetime can be placed (and updated or erased) together.

Josef Bacik said that he knew Busch had run some experiments using Btrfs that would group data writes separately from metadata writes, but that the performance improvements were not that large. Busch agreed, noting that simply separating data from metadata was not granular enough. It would be better if the B-tree writes could be separated from the journal writes, for example.

Bacik suggested that he could tag based on the subvolume ID for a write, which might improve things. Providing a different tag for data based on grouping operations with similar write-time-to-discard (or -overwrite) characteristics will make a difference, Busch said. Bacik asked about the number of tag values available. Busch said the tag is 16 bits, but that today's hardware does not make use of all of them; the range is from eight tags up to low hundreds of tags. There is a likelihood of running out of tags depending on how they are allocated.

Should filesystem developers just start using the feature and hope that it is helping out, Bacik asked. Previous efforts of this sort lacked for any kind of feedback mechanism, Busch said, but the current protocols do provide ways to get an SSD endurance measure, which allows running experiments to see if the tagging is helping. The numbers provided measure the write amplification of a workload, he said; the number of bytes written to the device is reported along with the amount of data that was actually written to get those bytes on the media. Testing should probably be done with identical workloads on two systems, one without tagging and one with a tagging scheme.

All of the SSDs that Meta (where both Busch and Bacik work) is buying have FDP, Busch said, though it is not enabled by default. It takes a few minutes to configure it for a new drive. All of the major vendors are supporting the feature, but each does it a little differently, so the amount of improvement will vary between SSD types.

Chris Mason asked if there had been testing done with other filesystems that have a journal, such as ext4 or XFS. "There are filesystems with a really specific lifetime for a very heavily used part of the SSD and Btrfs is not one." Busch said that the filesystem testing had not been particularly extensive; the focus shifted to the applications, but he agreed that tagging journal writes for those filesystems might have a big impact.

Ted Ts'o said that there had been some testing done long ago, perhaps by Martin Petersen, that would separate database log writes or filesystem journal writes for a particular type of storage device. The results were encouraging, but the storage devices were expensive and hard to obtain, which meant that developers did not have much of a chance to experiment.

Petersen said that the FDP model is fine for use cases where applications or filesystems have tagging added based on known workloads on a particular SSD model. "But that's not really a good model for all the other use cases". The reason that earlier hinting efforts failed also exists for FDP: there are, say, eight tags that need to somehow be split up between the various filesystems that are on the device. The problem is that the tag values are a scarce resource. If they were not, things could always be tagged based on how the data should be grouped, but FDP and the earlier mechanisms are not general enough; each drive, filesystem, and application combination has to be tested individually to see what works.

Zach Brown said that he was happy to see the increased visibility that FDP devices are providing; "we are so used to devices being shitty black boxes". Busch agreed, noting that it is likely that the lack of feedback is part of what caused earlier hinting efforts to fail.

The current patches do not yet plumb the FDP feature through the rest of the system, Busch said. Instead, there is a passthrough that allows user space to write commands directly to the NVMe device, "so you have full control there". It exposes the number of tags available as max_write_streams for the disk in sysfs. "The passthrough interface is not a particularly pleasant interface to use", he said. NVMe commands have to be constructed in user space, which is not the right level of abstraction. So there is a new write-stream command for io_uring that provides a nicer way to access the feature. It is only available for direct I/O and he questions whether it even makes sense to hook it up for buffered I/O.
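
As a small illustration of the discovery side, an application might read that attribute before deciding how to divide its data among streams. The sysfs path used below is an assumption; only the max_write_streams attribute name was mentioned in the session.

    /* Read the number of write streams a device advertises.  The sysfs
     * path is an assumption; only the attribute name comes from the
     * session. */
    #include <stdio.h>

    static int max_write_streams(const char *disk)
    {
        char path[256];
        int n;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/block/%s/queue/max_write_streams", disk);
        f = fopen(path, "r");
        if (!f)
            return -1;                  /* no FDP support exposed */
        if (fscanf(f, "%d", &n) != 1)
            n = -1;
        fclose(f);
        return n;
    }

    int main(void)
    {
        printf("nvme0n1: %d write streams\n", max_write_streams("nvme0n1"));
        return 0;
    }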

Joshi said that connecting up FDP to filesystem I/O in the iomap interface is still an unsolved problem. There are plans of that sort which will need to be discussed, he said. Once a filesystem is mounted, it will own the block device and, thus, will see and can manage all of the write streams (tags). A filesystem can support application-managed write streams, with a mapping to the hardware tags based on its rules, or it can manage the streams itself directly. Those two may not be mutually exclusive, so a filesystem could choose to support a hybrid mode.

It will require a new user interface, as well as per-filesystem enablement, even just for the application-managed case, he said. There is also the question of whether there should be generic application-management code for write streams, so that each filesystem does not need to implement its own.

Stephen Bates asked whether the write streams were per-NVMe-namespace or global to the device. They are global, Busch said, but subsets of the tag values can be specifically assigned to a namespace.

Busch said that he had hoped to use the existing write-hints interface, which is currently a no-op, for FDP. Christoph Hellwig said that the filesystems should still be involved in order to map the application-provided tag values properly. That requires lots of work in the filesystems, Busch said, while the write-hints option does not; it would be better for the filesystems to be involved, but it is an easier path, with some of the benefits, to leave it to the applications.

Bacik would rather not put Btrfs, for example, in the middle as the arbiter of the tags, but filesystems may want to also use the tags. Busch said that filesystems could reserve some of the tag space for themselves if they are the arbiter, but even that may result in tag collisions with filesystems on other partitions. For that reason, Bacik would like the block layer to do the arbitration.

Petersen said that there had been experiments done along the way where filesystems would tag writes based on attributes, such as journal writes versus random writes. A scheduler was added to do the mapping from the filesystem tags to the hardware streams. "It worked. It wasn't pretty, but it had the benefit of being flexible, because you could change that scheduler to match your application workload". Now, developers could do something similar with a BPF program as a scheduler that was tuned for the workload.

One of his complaints about the application-driven hints is that the kernel already has the knowledge that it needs. The application should not have to tell the kernel that these writes are for an application and are bound for a particular file—"we know that, we're the kernel". The user interface should not be tied to the hardware; "we can, in the kernel, schedule resources, that's why we exist".

There are applications that could make use of multiple streams within a single file, though, Javier González said. Joshi agreed, saying that the filesystem and block layer cannot necessarily make the best decisions on the allocation of the write streams. Bacik said that he did not want the filesystems getting in the way of applications doing what they need to do.

Ts'o said that the small number of streams available is what makes things difficult; if there were an infinite number, for the sake of argument, the inode number could be used as the tag "and let the block layer sort it all out". Since that is not the case, something needs to allocate the streams, which may mean denying them to some requesters; every application will claim that its files are the most important, of course.

The session got a little chaotic toward the end, with multiple sub-conversations that made it hard to follow, perhaps because it had run well into the subsequent break. The discussion turned to ways to measure how well (or poorly) the system is tagging its data. The measurements that can be used are the write-amplification information from the drive along with the 99th-percentile (p99) write latencies, both of which should decrease when data with similar lifetimes is being grouped correctly, Busch said. Joshi finished the session by briefly going over some of the results from a recent paper on FDP.

Comments (12 posted)

Improving FUSE writeback performance

By Jake Edge
May 6, 2025

LSFMM+BPF

In a combined filesystem and memory-management session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Joanne Koong led a discussion on improving the writeback performance for the Filesystem in Userspace (FUSE) layer. Writeback is how data that is written to the filesystem is actually flushed to the disk; it is the process of writing dirty pages from the page cache to storage. The current FUSE implementation allocates unmovable memory, then copies the dirty data to it before initiating writeback, which is slow; Koong wanted to change that behavior. Since the session, she has posted a patch set that has been applied by FUSE maintainer Miklos Szeredi.

Koong started the session with a description of the current FUSE writeback operation. A temporary page is allocated in the unmovable memory zone for each dirty page and the data is copied to the temporary page. After that, writeback is initiated on the temporary pages and the original pages can immediately have their writeback state cleared. That extra allocation and copying work is expensive, but is needed so that the pages do not move while the writeback operation is underway.

Benchmarks have shown around 45% improvement in throughput for writes without the temporary pages, she said. Beyond that, eliminating the copy simplifies the internals of FUSE. There is currently a red-black tree tracking the temporary pages that could be eliminated. It also makes the conversion of FUSE to use large folios much cleaner.

Back in November, she sent a proposed solution that removes the temporary pages, which means that the writeback state will no longer be cleared immediately. The patch set added a new mapping flag, AS_WRITEBACK_INDETERMINATE, that filesystems can set on inode mappings to indicate that writeback may take an indeterminate amount of time to complete; FUSE sets the flag on its mappings, and the writeback machinery can use it to avoid deadlocks.
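
As a rough sketch of the mechanism, a mapping flag of this kind would presumably be set much like existing mapping flags are; the helper name and call site below are hypothetical, modeled on helpers such as mapping_set_unevictable(), and are not code from the patch set.

    /* Hypothetical sketch, modeled on existing mapping_set_*() helpers;
     * the helper name and call site are guesses. */
    static inline void mapping_set_writeback_indeterminate(struct address_space *mapping)
    {
        set_bit(AS_WRITEBACK_INDETERMINATE, &mapping->flags);
    }

    /* Called by FUSE when setting up an inode's mapping: */
    static void fuse_init_mapping(struct address_space *mapping)
    {
        mapping_set_writeback_indeterminate(mapping);
    }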

That patch set was rejected, Koong said, primarily because it would allow buggy or malicious FUSE servers to hold up migration indefinitely by never completing the writeback of some pages. That would increase memory fragmentation and thwart attempts to allocate contiguous memory. Allocating the temporary pages can also fragment memory, but those allocations are made in unmovable memory, which is less problematic to fragment than movable memory. Other parts of FUSE already have this problem, including readahead and writethrough splicing (using splice()), but "we shouldn't try to add more of this, we should try to eliminate it if we can".

Several options were discussed in the thread, but the most promising idea, providing a mechanism to cancel writeback if pages need to migrate, does not work. The problem is that pages can be spliced, she said, and the writeback cannot be canceled for those pages. Another viable possibility is to have a dedicated area in the movable zone for pages that may be unmovable for indeterminate amounts of time. That would reduce the impact of the fragmentation to only that area of memory. Alternatively, unprivileged FUSE servers that behave badly, by not completing writeback in a timely fashion or by having too many pages under writeback, could just be killed.

David Hildenbrand said that there was some discussion of disallowing splicing for unprivileged FUSE servers; "you're not trustworthy enough to let you do that". That would allow canceling writeback, but Koong was not sure that was the right path forward. What followed was some fast-moving, hard-to-follow discussion on various possibilities for avoiding the edge cases that can lead to deadlock.

Omar Sandoval asked about the feasibility of just killing the misbehaving unprivileged servers as was suggested. Koong said that it was a reasonable solution, though it may not be backward compatible because existing servers are not expecting it. But she thinks that something along those lines should already have been done as a protection mechanism.

Sandoval asked what a reasonable timeout value should be. There is a balance to be struck; "if you're a FUSE server and you've gone out to lunch for 30 minutes, I don't care about your backwards compatibility, you already broke everything". Hildenbrand said that is a difficult problem to solve; any timeout chosen will sometimes be too large or too small. Sometimes the data will be valuable enough that a long wait is acceptable, but, say, 30 seconds may already be too long to hold off an allocation.

It would be his wish to find some easy way to handle the common cases where the pages can just be migrated, which might mean prohibiting the use of splice(). He wondered what the implication of that prohibition would be. Koong said that the FUSE servers could be audited for the use of splice() and the problem could be discussed with the developers. Josef Bacik said that the kernel could just fall back to doing an internal copy when splice() is requested from an unprivileged server.

The crux of the problem seems to be the unmovable nature of the memory that is under writeback, he continued; if some new way could be found to use movable memory without doing a copy, that would be ideal. "We love splice() because it's faster, but it sounds like we need to invent a new zero-copy mechanism that uses movable memory".

The ability to mount FUSE filesystems as an unprivileged user makes them so problematic, Jeff Layton said; any random user can start a server that can grab a bunch of memory and not handle it properly. That is what the system needs to guard against; doing so with "draconian measures" like killing the server is not unreasonable. He suggested finding a way to maintain compatibility with the existing servers and to provide a zero-copy mechanism for new ones; in his mind, it is not out of the question to rewrite some of the old FUSE servers to take advantage of newer features. Koong agreed, but said that it would be Szeredi's call on what should be done in that regard; she was not clear on what his thinking is.

Comments (none posted)

Filtering fanotify events with BPF

By Daroc Alden
May 6, 2025

LSFMM+BPF

Linux systems can have large filesystems; trying to keep up with the stream of fanotify filesystem-monitoring notifications for them can be a struggle. Fanotify is one of a few mechanisms the kernel provides for monitoring filesystem accesses. Song Liu led a discussion on how to improve in-kernel filtering of fanotify events in a joint session of the filesystem and BPF tracks at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit. He wants to combine the best parts of a few different approaches to filter filesystem events efficiently.

There are two ways to monitor and restrict filesystem actions on Linux, Liu said: fanotify and Linux security modules (LSMs). They both have benefits and drawbacks. The main problem with using LSM hooks to respond to filesystem events is that LSM hooks are global — the LSM must respond to accesses for all files, even if it's only interested in a subset of files. The main problem with fanotify is that notifications are handled in user space, incurring a lot of context switches. The best of both worlds would be to have efficient mask-based filtering for relevant files (like fanotify) and fast in-kernel handling for the more complicated cases (like LSMs).

One member of the audience pointed out that LSM hooks are invoked for all filesystem operations, but fanotify can only block calls to open() and read(), so they're not really comparable. Liu agreed, but said that was a separate topic.

Liu then went into a little more detail about how BPF-LSM hooks work. Multiple BPF programs can attach to the bpf_lsm_file_open() hook. When a file is opened, the kernel will iterate through each different rule to see whether it wants to block the open(). Most BPF programs don't apply to most files, Liu said, but they are still run. Fanotify filters, on the other hand, use a bitmask to quickly eliminate events that the filter is not interested in.

There are problems with combining the approaches, though. Currently, of all the different types of BPF program, only BPF-LSM programs have storage associated with an inode object. Filesystem developers generally want to avoid bloating the inode structure too much, which means that extra storage for LSMs lives in a separate structure: inode->i_security. Like many parts of the kernel's design, this is a tradeoff; for example, it allows the use of read-copy-update (RCU) protections, but it also adds an additional dereference to access the information and it prevents other types of BPF program from accessing the data.
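
To illustrate the status quo Liu described, here is a sketch of a BPF LSM program on the file_open hook that caches a per-inode verdict in inode-local storage (which today sits behind the i_security blob). The map layout and the deny field are invented for illustration, and a user-space agent is assumed to populate the map.

    /* Sketch of a BPF LSM file_open program using inode-local storage.
     * The policy map and its contents are invented; something in user
     * space (or another hook) is assumed to fill it in. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct inode_policy {
        __u32 deny;                     /* cached verdict for this inode */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_INODE_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct inode_policy);
    } policy SEC(".maps");

    char LICENSE[] SEC("license") = "GPL";

    SEC("lsm/file_open")
    int BPF_PROG(file_open, struct file *file)
    {
        struct inode_policy *p;

        /* This program runs for every open() on the system, even though
         * most inodes have no policy attached; that is the overhead a
         * fanotify-style mask check could avoid. */
        p = bpf_inode_storage_get(&policy, file->f_inode, 0, 0);
        if (p && p->deny)
            return -1;                  /* -EPERM: deny the open */
        return 0;
    }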

To solve this, Liu wants to move the data currently stored in the i_security blob into the inode structure (conditionally, based on the value of a kernel-configuration setting). That will make the data available to more BPF program types, and therefore to his prototype solution for fanotify BPF filtering. That suggestion didn't sit well with Josef Bacik, though, who pointed out that the inode structure is embedded in other places, such as in some Btrfs structures. So changing its size has the potential to change cache-line alignment or, more importantly, change the number of inodes that can fit in a page. Liu agreed that could be a problem.

Amir Goldstein asked what kinds of data Liu expected to be stored in the inode. He replied that it depended on the use case, but that one example might be cached information about whether access to a file should be allowed.

One current problem with fanotify is that it is not "local", Liu explained. There's a long chain of functions called from fsnotify_open() to decide whether the operation will be allowed. In particular, the check for whether a superblock has any watchers is fairly cumbersome. If the fast mask-based check could be moved earlier, it might improve performance.

Goldstein thought that made sense, but pointed out that the fanotify mask is the combined mask of all watchers, so it would make sense to have LSMs use the same mask just to indicate that "someone is interested". Liu agreed. The kernel should optimize for the case when nobody is watching, he said, so that the LSM hook doesn't have to be called for every file.

Unfortunately, the simple idea of having a mask that can be checked quickly and is shared between LSMs and fanotify watches is complicated by subtree monitoring. Liu summarized the existing options for handling watches on subtrees as: add a separate mask entry to everything in the subtree individually, walk up the filesystem tree for each operation, or match the full path of the directory against a pattern. None of those options are great. The first option makes applying and removing watches potentially slow for large directories. The second option makes file accesses slower, and the last option requires a good deal of complexity to find the full path of a file.

In Liu's patch set, he took the approach of setting a fanotify mark for a whole filesystem, and then using is_subdir(), which checks whether one entry is a subdirectory of another, to filter events further in BPF. That required letting BPF programs call is_subdir(), which seems straightforward, except that it lets programs hold references to directory entries in BPF maps, which in turn prevents the filesystem from being unmounted. Bacik questioned why holding onto directory entries was necessary here, and Liu explained that he wanted to use them to quickly check whether something was in the subtree of interest.

Jeff Layton pointed out that since directories can't be hard-linked, the BPF program could hold onto an inode for the directory instead. Liu agreed that if it were possible to quickly check for membership in a subtree using an inode, that would work for his purposes. Goldstein thought that it could make sense to use fsnotify to set a mark on a directory, and let that code handle things.

Christian Brauner thought that checking whether something was in a subtree was not quite as simple as Layton and Goldstein made it sound because of the global rename lock. Goldstein asked whether is_subdir() really took a lock, to which Brauner replied that it did. He went on to explain that until recently there had been an issue where is_subdir() would retry when it detected a concurrent rename, but that apparently people had deep subdirectories that could cause is_subdir() to stall when there were only a few rename operations, so it was changed to acquire the lock. He agreed to go over the code with Goldstein later, and informed Liu that unless he was willing to deal with false positives, any solution was going to contend on the rename lock.

Goldstein suggested that the operation could ignore the lock entirely and keep calling get_parent() — theoretically, it could loop forever, but only if someone were continuously relocating the directories being traversed. Brauner didn't think this was sufficient for a Linux security module, where correctness is especially important. Liu indicated that there were several cases where security-critical code, such as the Landlock LSM, does something like this, and asked what a better solution would look like. Goldstein suggested an approach similar to audit: track renames, so that cached information can be invalidated on rename.

Several other participants indicated that they had their own suggestions, but at that point the session ran out of time, and discussion moved to the hallway. It seems clear that the general idea of Liu's proposal, allowing the creation of BPF LSMs that don't need to be invoked for irrelevant file accesses, had plenty of support. The specifics of the design, however, remain up in the air.

Comments (none posted)

Hash table memory usage and a BPF interpreter bug

By Daroc Alden
May 7, 2025

LSFMM+BPF

Anton Protopopov led a short discussion at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit about the amount of memory used by hash tables in BPF programs. He thinks that the current memory layout is inefficient, and wants to split the structure that holds table entries into two variants for different kinds of maps. When that proposal proved uncontroversial, he also took the chance to talk about a bug in BPF's call instruction.

Hash table memory use

Protopopov began with an explanation of the current structure of BPF hash tables. Each hashed bucket is a linked list of items; the table itself just stores a pointer to the first item in the list and a spinlock. Each item in the list is an htab_elem structure. This contains a hash of the key, a copy of the whole key, and two unions that are used in different ways by per-CPU and normal hash tables.

Half of the structure, 24 bytes, is only used by per-CPU BPF maps. Maps that are shared between CPUs are essentially wasting that space, Protopopov said. In a hash table with 100,000 elements, that's a total of about 2.4MB. To get some idea of how much of an impact that really had, he tried removing the per-CPU elements from htab_elem and running some benchmarks. The benchmarks reflected lower memory usage in line with his predictions, and were 7% faster as well.
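
For reference, here is a simplified sketch of the layout in question, approximating struct htab_elem from kernel/bpf/hashtab.c; the field names and exact sizes are reconstructed from memory and may not match current kernels.

    /* Simplified approximation of struct htab_elem (kernel/bpf/hashtab.c);
     * details may differ from current kernels.  The second union is the
     * roughly 24 bytes that ordinary hash maps never use. */
    struct htab_elem {
        union {                         /* bucket / freelist linkage */
            struct hlist_nulls_node hash_node;
            struct pcpu_freelist_node fnode;
        };
        union {                         /* unused by plain hash maps */
            void *ptr_to_pptr;          /* per-CPU maps: per-CPU data */
            struct bpf_lru_node lru_node;   /* LRU bookkeeping */
        };
        u32 hash;                       /* cached hash of the key */
        char key[] __aligned(8);        /* copy of the full key */
    };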

There are obviously users of the per-CPU BPF maps, though. Even without considering users' BPF programs, there are a number of uses throughout the kernel as well. So Protopopov's question to the assembled BPF developers was: how can we remove the per-CPU data from htab_elem properly?

Alexei Starovoitov asked what per-CPU data was taking 24 bytes in the structure. Protopopov pointed at the bpf_lru_node element that is used to manage information on how recently a given key was accessed. Starovoitov asked whether the information could be shuffled around in the structure without removing it, but that doesn't seem to be possible, because the other components of the structure are used by all of the different types of BPF map.

Ultimately, the best Protopopov had been able to do was to split the structure into two variants of different sizes, and allocate the appropriate one based on the type of the map. Because BPF hash tables are widely used, that resulted in a big patch, so he wanted to solicit feedback first. After some additional discussion about alignment within the structure, he agreed to write a first pass of a patch set following that approach for people to look at.

Translated call instructions

Call instructions in BPF refer to functions by number, in the order in which they're defined in a BPF program. During loading, libbpf (or another user-space loader) replaces these numbers with the actual offsets of the function within the memory allocated for the program. The verifier processes the program in this form. After the program is verified, however, there's a final step: translating BPF bytecode so that it can be executed.

In most configurations, the BPF program is compiled to native machine code by the BPF just-in-time compiler (JIT). If the JIT is disabled (or doesn't support the current architecture), however, the BPF bytecode is directly interpreted instead. This interpreter stores the relative offset of a call instruction's destination from the instruction pointer in a 16-bit field.

Architectures that don't support the BPF JIT:
  • Alpha
  • Big-endian 32-bit ARM
  • C-SKY
  • m68k
  • microMIPS
  • OpenRISC
  • 32-bit SPARC
  • Some older ARC CPUs
  • Hexagon
  • MicroBlaze
  • Nios II
  • RISC-V systems without an MMU
  • SuperH
  • User-mode Linux
  • Xtensa

So if there are two function calls with enough instructions between them to overflow this 16-bit field, Protopopov said, then the offset to the called function is incorrect. This breaks several things, including bpftool, the BPF program, and potentially the verifier's guarantees. Since the problem requires an unusual set of conditions to manifest, Protopopov wasn't exactly sure how big of a problem this was.

Starovoitov thought that Protopopov might be confusing two different variants of BPF bytecode; Starovoitov didn't think that the interpreter would run the version with the smaller 16-bit field. Protopopov was adamant that it would, however. Daniel Borkmann thought that such code was broken, and ought to be rejected — at run time, by the interpreter, since this issue is only apparent after verification.

Blaise Boscaccy asked whether Protopopov had tried using the invalid program counter to break BPF's security. Protopopov replied that he hadn't. If Protopopov is correct about the issue, and it poses a security risk, the only vulnerable kernels will be those running on architectures without support for the JIT. Borkmann thought that all such architectures were "niche". Ultimately, the discussion ended without a concrete commitment from anyone to investigate further. At the time of writing, the problem does not appear to have been discussed on the mailing list.


Comments (none posted)

Page editor: Joe Brockmeier


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds