
LWN.net Weekly Edition for May 1, 2025

Welcome to the LWN.net Weekly Edition for May 1, 2025

This edition contains the following feature content:

  • The mystery of the Mailman 2 CVEs: disputed vulnerability reports against the mostly-EOL Mailman 2.1.
  • Debian debates AI models and the DFSG: a general resolution on AI models and training data.
  • Some __nonstring__ turbulence: a new GCC 15 warning arrives at an awkward time for 6.15-rc3.
  • Cache awareness for the CPU scheduler: keeping a process's threads near their cached data.
  • Freezing filesystems for suspend: connecting filesystem freezing to system suspend and resume.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

The mystery of the Mailman 2 CVEs

By Joe Brockmeier
April 30, 2025

Many eyebrows were raised recently when three vulnerabilities were announced that allegedly impact GNU Mailman 2.1, since many folks assumed that it was no longer being supported. That's not quite the case. Even though version 3 of the GNU Mailman mailing-list manager has been available since 2015, and version 2 was declared (mostly) end of life (EOL) in 2020, there are still plenty of users and projects using version 2.1.x. There is, as it turns out, a big difference between mostly EOL and actually EOL. For example: WebPros, the company behind the cPanel server and web-site-management platform, still maintains a port of Mailman 2.1.x to Python 3 for its customers and was quick to respond to reports of vulnerabilities. However, the company and upstream Mailman project dispute that the CVEs are valid.

GNU Mailman 2

Mailman has been in development since 1998, and no doubt many LWN readers have had, or still have, subscriptions to mailing lists managed by some version of the software. The 1.0 release was announced in July 1999, with the 2.0 release following in November 2000. The project embarked on a major rewrite for the 3.0 release, which provided support for Python 3 and split Mailman into several components. The new Mailman was not, and still is not, a simple upgrade from the earlier version. It lacked feature parity with 2.1.x when it was first released in 2015, and still lacks a few features (such as topic filters) that users liked.

However, Mailman 2.x does not run on Python 3—and the project has been gently trying to nudge users away from the 2.x series for quite some time. Python 2.x was sunset on January 1, 2020, though it is currently still supported by some Linux vendors as part of long-term-support releases. Mailman core contributor Mark Sapiro said in 2017 that he was the only person still supporting 2.1.x. In 2020, Sapiro announced 2.1.30 as the last release to contain any new features. But he also said that there might be further updates with bug and security fixes, as well as internationalization updates.

The last 2.1.x release, so far, is version 2.1.39, which was announced in December 2021. It fixed two CVEs: a remote-privilege-escalation vulnerability (CVE-2021-42097), and a flaw that could allow list members or moderators to obtain a token to make administrative changes (CVE-2021-44227). According to that announcement, there is still the possibility that there will be more patch releases to address security problems. The Mailman web site also lists 2.1.39 as a current, stable version rather than as an EOL version, but its home page also has a thank-you to donors who helped a Mailman core developer attend PyCon 2015. It's just possible that the project's web-site maintenance has fallen a bit behind.

In any case, plenty of users are still on Mailman 2.x with little sign of budging, including those who use the proprietary cPanel control panel. Last year, rather than trying to force its users to migrate, cPanel announced that it would provide extended support for Mailman 2 by upgrading it to Python 3 and taking responsibility for continued maintenance. The cPanel fork of Mailman is on GitHub and is based on a port to Python 3 by Jared Mauch. That fork has received a steady trickle of small commits from WebPros developers, with nearly 30 commits in 2025 from six people.

Vulnerabilities

On April 20, two researchers—Firudin Davudzada and Aydan Musazade—from a company called Datricon published three CVEs that are supposed to affect "GNU Mailman 2.1.39, as bundled with cPanel and WHM". CVE-2025-43919 is described as a path-traversal vulnerability that would allow attackers to read arbitrary files. CVE-2025-43920 claims that unauthenticated attackers can execute arbitrary commands by using shell metacharacters in an email Subject line, if an external archiver (such as MHonArc) is used to archive Mailman emails. It does not specify which external archiving software was tested to produce this vulnerability, but blames Mailman for not sanitizing input to the external archiver. Finally, CVE-2025-43921 reports that unauthenticated attackers could create mailing lists.

Each of the CVEs has a corresponding repository on GitHub (CVE-2025-43919, CVE-2025-43920, and CVE-2025-43921) with a description of the vulnerability, exploitation scenarios, and so forth. The overviews published on GitHub by Davudzada and Musazade claim that these vulnerabilities were discovered in "Q1 2025" and reported to cPanel and the GNU Mailman project in Q1.

Alan Coopersmith forwarded the CVEs to the oss-security list on April 21. Valtteri Vuorikoski replied: "I saw these mentioned earlier and could not reproduce either on a stock 2.1.39 install." Vuorikoski wondered if the vulnerabilities might be specific to the cPanel version.

According to the cPanel support forum, the vulnerabilities do not affect its version, either. On April 28, the company posted a support article to say that it had investigated the CVEs "both internally and via third party subject-matter experts" and could not reproduce the vulnerabilities using the information provided. The article also states that there is no record that the reporters attempted to contact the company.

We have contacted the Mailman maintainers, and they do not show any records of an attempted contact from the reporters either. We have attempted to contact the reporter multiple times via their publicly listed email addresses and have received no response. We do not consider these vulnerabilities to be valid. We will be taking no further action unless new information is provided.

I reached out to Sapiro by email, and he said no one with the GNU Mailman project had been contacted "as far as we know" and described the vulnerabilities as "bogus".

CVE-2025-43919 and CVE-2025-43921 ignore the fact that the attacker would need to provide authentication which the proof of concept attacks do not do and hence do not work. Thus, there is no vulnerability.

The vulnerability described by CVE-2025-43920 "relies on a convoluted configuration with an external archiver", he said, and that attack could be carried out equally well by sending the email directly to the archiver. "There are no plans to address this in Mailman 2.1."

I also contacted Datricon about the vulnerabilities and received a reply from Musazade. She said they had emailed cPanel on February 27 before applying for the CVEs, but made no mention of contacting the Mailman maintainers. She said that cPanel's messages had arrived on non-business days, which is why they went unanswered, though that does not explain why the researchers did not follow up later. However, she said that they have now communicated with the cPanel team—presumably since cPanel published its update on April 28—and would "provide best support and advisory regarding technical explanations of the CVEs" to help them reproduce the issues.

It is not uncommon for vulnerabilities to be difficult to reproduce under differing configurations. Factors such as customized software builds (such as cPanel's Mailman variant), environmental differences, and specific operational conditions (authentication, user permissions), can all impact reproducibility. Nonetheless, reproduction difficulty does not invalidate a vulnerability, especially after independent vetting and CVE assignment by MITRE and [the National Vulnerability Database (NVD)].

MITRE, of course, does not provide independent testing or validation of vulnerability reports, and the fact that a CVE was published does not guarantee that it's valid. NIST has updated each of the NVD's CVE entries to note that multiple third parties have reported they are unable to reproduce the vulnerabilities. It's also odd that, although the reporters claim to have contacted cPanel and the upstream project, both of those parties dispute having heard from them.

No rush

It would seem that users of Mailman 2.1.39, or the cPanel fork, are not in imminent danger. The consensus from all parties—except the reporters—seems to be that the alleged vulnerabilities are not valid. Or "bogus" as Sapiro put it.

However, 2.x is largely a dead end, even if it does not have any currently known vulnerabilities. As Russ Allbery said on the oss-security list, "it's probably more realistic to view Mailman 2 as orphaned, end-of-life software" that will require a major migration.

While cPanel is, currently, providing extended maintenance for the project, there is no indication how long it will continue to do so or that a community is forming around the fork. Those remaining on 2.x should probably be plotting the migration to another mailing-list-manager platform at some point, whether that is Mailman 3, or something like Discourse that provides a discussion forum with the ability to participate via email.

Comments (4 posted)

Debian debates AI models and the DFSG

By Joe Brockmeier
April 25, 2025

The Debian project is discussing a General Resolution (GR) that would, if approved, clarify that AI models must include training data to be compliant with the Debian Free Software Guidelines (DFSG) and be distributed by Debian as free software. While GR discussions are sometimes contentious, the discussion around the proposal from Debian developer Mo Zhou has been anything but—there seems to be consensus that AI models are not DFSG-compliant if they lack training data. There are, however, some questions about the exact language, as well as about the impact the GR would have on existing packages in the Debian archive.

While many folks in the free-software community are generally skeptical about AI and would be happy to see the trend come to an end, Zhou is certainly not in the anti-AI camp. He is a Ph.D. student at Johns Hopkins University, and his academic web site states that his research interest is in computer vision and machine learning. He has created a project called DebGPT that explores using LLMs to aid in Debian development. Clearly, he sees some value in the technology, but also wants to adhere to free-software principles.

GR proposal

In February, Zhou wrote to the debian-project mailing list to say that he had created "something draft-ish" for a general resolution about applying the DFSG to AI models, which he later defined thusly:

A pre-trained "AI model" is usually stored on disk in binary formats designed for numerical arrays, as a "model checkpoint" or "state dictionary", which is essentially a collection of matrices and vectors, holding the learned information from the training data or simulator. When the user make use of such file, it is usually loaded by an inference program, which performs numerical computations to produce outputs based on the learned information in the model.

He called for help in adding reference materials and in shaping up the early draft before posting it. Zhou sent his revised proposal to the debian-vote mailing list on April 19, with a detailed explanation of his reasoning for the GR and several appendices containing background information on AI technology, previous discussions, and comments on possible implications if the proposal is passed.

Debian has taken up the topic previously (see LWN's coverage from 2018) but never settled the question. The goal now is to reach a consensus on handling AI models that are released under DFSG-compliant licenses, but do not provide training data. Zhou's proposal notes that the software that runs AI models, such as Python scripts or C++ programs, is out of scope of the proposal, since traditional software is already a well-defined case.

The actual text of the proposal, what Debian members would vote for (or against), is short and to the point:

Proposal A: "AI models released under open source license without original training data or program" are not seen as DFSG-compliant.

Francois Mazen, Timo Röhling, Matthias Urlichs, Christian Kastner, Boyuan Yang, and others have replied to support and sponsor the proposal. Resolutions are required to have five additional sponsors before they are put to discussion and eligible for a vote. Currently, if put to a vote, Debian members would have a choice between "A" or "none of the above". It is possible, according to the resolution procedure, that amendments or alternative proposals, such as "AI models are DFSG-compliant if under DFSG licenses", could be added during the discussion period.

Thorsten Glaser posted what he called a counter-proposal on April 23, and requested comments. While Zhou's proposal would simply clarify that models without training data do not meet the DFSG, Glaser goes much further. For example, he wants Debian to require that models be "trained only from legally obtained and used works" and that the data itself be under a suitable license for distribution. His proposal would also place heavy requirements on building models that would be hosted in Debian's main archive:

For a model to enter the main archive, the model training itself must *either* happen during package build (which, for models of a certain size, may need special infrastructure; the handling of this is outside of the scope of this resolution), *or* the model resulting from training must build in a sufficiently reproducible way that a separate rebuilding effort from the same source will result in the same trained model.

Finally, the current language would ask that training sources not be obtained unethically and "the ecological impact of training and using AI models be considered". What constitutes ethical or unethical acquisition of training sources is not defined. When asked by Carsten Leonhardt to summarize the difference between the proposals, Glaser replied that his was "a hard anti-AI stance (with select exceptions)". Thomas Goirand said that he would second Glaser's proposal, but he is the only one so far to endorse it.

Possible impact

Gunnar Wolf replied to sponsor the proposal and added that Debian "cannot magically extend DFSG-freeness to a binary we have no way to recreate". That does not mean, he said, that Debian is shut out entirely from participating in the LLM world. Users could always download models from other sources or the models could even be uploaded to Debian's non-free repository.

Among the potential implications listed in Appendix D of the proposal is the downside that there are almost no useful AI models that would be able to enter the main section of Debian's archive under this interpretation. The upside is that Debian does not have to immediately deal with "the technical problem of handling 10+GB models in .deb packages" or expect downstream mirrors that host the main repository to carry such large binary files.

Simon McVittie asked if anyone had an idea whether any models already in Debian's main repository match the definition. He said it was typical for proposals to provide an estimate of how many packages would be made "insta-RC-buggy"; in other words, how many packages would be subject to release-critical bugs if the GR passes? Since Debian is currently in freeze to prepare for the release of Debian 13 ("trixie"), he wanted to know whether the GR would take effect immediately or at the beginning of the cycle for the next release. The pre-release freeze is already lengthy, and he thought it would be best to avoid making it longer in order to deal with any packages affected by this GR.

Russ Allbery observed that GNU Backgammon comes with neural-network weights that do not have source code. He admitted that he did not give that much thought when he was maintaining the package because it predated the LLM craze. "I'm not even sure if the data on which it's trained (backgammon games, I think mostly against bots) is copyrightable." He was also unsure whether other old-school machine-learning applications might be lurking around, and said he had no strong opinion about what to do if there were.

Games are not the only software that may be impacted. Ansgar Burchardt said that the GR might impact other useful software. His list included the Tesseract optical-character-recognition (OCR) software, OpenCV image-recognition software, Festival text-to-speech software, and other software with weights and data of uncertain origin. Urlichs suggested that Burchardt could write a counter-proposal or a more-nuanced proposal that would take some of those packages into account. He also questioned whether the software would need to be removed—the packages could be relocated to Debian's contrib archive and models placed in non-free.

Next steps

So far, Burchardt has not offered any proposals of his own, but there is still time. Discussion will continue for at least two weeks from the initial proposal, though the Debian Project Leader could shorten the discussion period by calling for a vote sooner. The proposal already has enough seconds to proceed; if the discussion reflects the overall mood of Debian developers, the GR would be likely to pass.

If it does pass, it will be in contrast to the Open Source Initiative's controversial Open Source AI Definition (OSAID) which LWN looked at last year. The OSI requires that model weights be provided under "OSI-approved terms" (which are yet to be specified), but does not require training data to be supplied in order to meet its definition for open-source AI. That has been a sticking point for many, who feel that the OSAID devalues the OSI's Open Source Definition (OSD)—which was derived from the DFSG in the first place.

Comments (42 posted)

Some __nonstring__ turbulence

By Jonathan Corbet
April 24, 2025
New compiler releases often bring with them new warnings; those warnings are usually welcome, since they help developers find problems before they turn into nasty bugs. Adapting to new warnings can also create disruption in the development process, though, especially when an important developer upgrades to a new compiler at an unfortunate time. This is just the scenario that played out with the 6.15-rc3 kernel release and the implementation of -Wunterminated-string-initialization in GCC 15.

Consider a C declaration like:

    char foo[8] = "bar";

The array will be initialized with the given string, including the normal trailing NUL byte indicating the end of the string. Now consider this variant:

    char foo[8] = "NUL-free";

This is a legal declaration, even though the declared array now lacks the room for the NUL byte. That byte will simply be omitted, creating an unterminated string. That is often not what the developer who wrote that code wants, and it can lead to unpleasant bugs that are not discovered until some later time. The -Wunterminated-string-initialization option emits a warning for this kind of initialization, with the result that, hopefully, the problem — if there is a problem — is fixed quickly.

The kernel community has worked to make use of this warning and, hopefully, eliminate a source of bugs. There is only one little problem with the new warning, though: sometimes the no-NUL initialization is exactly what is wanted and intended. See, for example, this declaration from fs/cachefiles/key.c:

    static const char cachefiles_charmap[64] =
	"0123456789"			/* 0 - 9 */
	"abcdefghijklmnopqrstuvwxyz"	/* 10 - 35 */
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ"	/* 36 - 61 */
	"_-"				/* 62 - 63 */
	;

This char array is used as a lookup table, not as a string, so there is no need for a trailing NUL byte. GCC 15, being unaware of that usage, will emit a false-positive warning for this declaration. There are many places in the kernel with declarations like this; the ACPI code, for example, uses a lot of four-byte string arrays to handle the equally large set of four-letter ACPI acronyms.

Naturally, there is a way to suppress the warning when it does not apply by adding an attribute to the declaration indicating that the char array is not actually holding a string:

    __attribute__((__nonstring__))

Within the kernel, the macro __nonstring is used to shorten that attribute syntax. Work has been ongoing, primarily by Kees Cook, to fix all of the warnings added by GCC 15. Many patches have been circulated; quite a few of them are in linux-next. Cook has also been working with the GCC developers to improve how this annotation works and to fix a problem that the kernel project ran into. There was some time left to get this job done, though, since GCC 15 has not actually been released — or so Cook thought.

Fedora 42 has been released, though, and the Fedora developers, for better or worse, decided to include a pre-release version of GCC 15 with it as the default compiler. The Fedora project, it seems, has decided to follow a venerable Red Hat tradition with this release. Linus Torvalds, for better or worse, decided to update his development systems to Fedora 42 the day before tagging and releasing 6.15-rc3. Once he tried building the kernel with the new compiler, though, things started to go wrong, since the relevant patches were not yet in his repository. Torvalds responded with a series of changes of his own, applied directly to the mainline about two hours before the release, to fix the problems that he had encountered. They included this patch fixing warnings in the ACPI subsystem, and this one fixing several others, including the example shown above. He then tagged and pushed out 6.15-rc3 with those changes.

Unfortunately, his last-minute changes broke the build on any version of GCC prior to the GCC 15 pre-release — a problem that was likely to create a certain amount of inconvenience for any developers who were not running Fedora 42. So, shortly after the 6.15-rc3 release, Torvalds tacked on one more patch backing out the breaking change and disabling the new warning altogether.

This drew a somewhat grumpy note from Cook, who said that he had already sent patches fixing all of the problems, including the build-breaking one that Torvalds ran into. He asked Torvalds to revert the changes and use the planned fixes, adding: "It is, once again, really frustrating when you update to unreleased compiler versions". Torvalds disagreed, saying that he needed to make the changes because the kernel failed to build otherwise. He also asserted that GCC 15 was released by virtue of its presence in Fedora 42. Cook was unimpressed:

Yes, I understand that, but you didn't coordinate with anyone. You didn't search lore for the warning strings, you didn't even check -next where you've now created merge conflicts. You put insufficiently tested patches into the tree at the last minute and cut an rc release that broke for everyone using GCC <15. You mercilessly flame maintainers for much much less.

Torvalds stood his ground, though, blaming Cook for not having gotten the fixes into the mainline quickly enough.

That is where the situation stands, as of this writing. Others will undoubtedly take the time to fix the problems properly, adding the changes that were intended all along. But this course of events has created some bad feelings all around, feelings that could maybe have been avoided with a better understanding of just when a future version of GCC is expected to be able to build the kernel.

As a sort of coda, it is worth saying that Torvalds also has a fundamental disagreement with how this attribute is implemented. The __nonstring__ attribute applies to variables, not types, so it must be used in every place where a char array is used without a trailing NUL byte. He would rather annotate the type, indicating that every instance of that type holds bytes rather than a character string, and avoid the need to mark a rather large number of variable declarations. But that is not how the attribute works, so the kernel will have to include __nonstring markers for every char array that is used in that way.

Comments (111 posted)

Cache awareness for the CPU scheduler

By Jonathan Corbet
April 29, 2025
The kernel's CPU scheduler has to balance a wide range of objectives. The tasks in the system must be scheduled fairly, with latency for any given task kept within bounds. All of the CPUs in the system should be kept busy if there is enough work to do, but unneeded CPUs should be shut down to reduce power consumption. A task should also run on the CPU that is most likely to have cached the memory that task is using. This patch series from Chen Yu aims to improve how the scheduler handles cache locality for multi-threaded processes.

RAM is fast, but it is still unable to provide data at anything resembling the rate that a CPU can consume it. For this reason, systems are built with multiple layers of cache that are meant to hold frequently used data and make it available more quickly. Reading a value from cache is relatively fast; a read that goes all the way to RAM, instead, can stall a CPU for the time it takes to execute hundreds of instructions. Making effective use of cache is, thus, important for an application to perform well. Well-written applications are implemented with cache behavior in mind, but the kernel has a role to play as well.

Each layer of cache is accessible by a different number of CPUs; the closest (L1) cache may be specific to a single CPU, while the subsequent (slower, but often larger) layers of cache will be shared by a group of CPUs. The last-level cache (LLC) is farthest from the CPUs and, thus, the slowest, but it tends to be the largest and is shared by the largest number of CPUs. Moving a task from one CPU to another may move it away from the data it has built up in cache, hurting its performance. If a task is moved to another CPU in the same socket, much of its cached data may still be available in the lower-level caches; if it is moved to another NUMA node, it may have to start anew with an empty (from its point of view) cache.

Because moving a task can hurt its performance, the CPU scheduler tries to avoid doing that when it can. That objective often runs into conflict with others, such as the need to balance the load across the system to make the best use of the available CPU resources. What the scheduler currently does not do, though, is to try to identify groups of tasks that might be sharing resources and, as a result, could benefit from sharing a single cache if they were scheduled together. Spreading those tasks across the system, instead, could lead to contention as they fight to keep data in their local caches.

In March, Peter Zijlstra posted an RFC patch to explore improving this situation. It is based on the idea that, if a process has multiple threads, those threads are likely to be sharing memory and could benefit from running within the same cache domain. It adds some instrumentation to the (already large) mm_struct structure that describes an address space, including a per-CPU array that is used by the scheduler to keep track of how much time threads using that mm_struct spend on each CPU in the system. This data decays over time, so recent usage is more strongly represented than usage in the distant, forgotten past (a few tens of milliseconds ago, say).

When the time comes to wake a thread that had been waiting for some event, the scheduler goes to that per-CPU array and determines which CPU has spent the most time executing threads from the same process. If the thread of interest has been running elsewhere, it will be moved to the selected CPU, where it will be closer to the other threads and, with luck, benefit from sharing cache space with them. As Zijlstra noted at the time: "This patch is not meant to be merged, it is meant for testing and development. We need to first make it actually improve workloads".

Chen then picked up this work and made a number of improvements to it. The original code would move a task to the hot CPU even if the task was already running within the same LLC domain, and thus already sharing the largest cache with that CPU. In this case, the movement will just slow that task down without much, if any, performance benefit, so task migration is inhibited in that case.

The other problem that turned up might seem obvious from the description of how the original patch works. Threads are migrated to the hot CPU without regard to how busy that CPU already is. If the number of threads is large, that gathering may well overload the target cache domain, hurting performance overall. This problem was addressed by looking at the overall usage statistics generated by the scheduler's load-balancing algorithm to ensure that the process owning the thread in question is not overloading the target domain. Specifically, if the process is using more than 25% of the CPU time in that domain, or has more than 33% of the overall load there, then the scheduler will not move more threads there.

That has improved the situation, but this work is still in a relatively early state. For example, it can fight with the load balancer:

The aggregation of tasks will move tasks towards the preferred LLC pretty quickly during wake ups. However load balance will tend to move tasks away from the aggregated LLC. The two migrations are in the opposite directions and tend to bounce tasks between LLCs.

CPU scheduling is driven by a lot of heuristics that can often come into conflict. So a patch series adding yet another heuristic ("concentrate a process's threads in a single cache domain") is sure to bring more surprising interactions that, perhaps, need to be addressed with even more heuristics.

There is also a question that has not yet been asked here: is collecting a process's threads the best way to identify tasks that would benefit from sharing a cache? It may not be, but it has the advantage of actually being possible; detecting cache sharing in tasks without that kind of direct relationship could be difficult, if it is possible at all. In the end, the fate of this patch series will depend on whether it actually shows improvements on real workloads without causing regressions for others. That is a high bar to clear for a change like this. The kernel may well have more cache-aware scheduling in the future, but it seems like it may take a while yet before it is ready.

Comments (6 posted)

Freezing filesystems for suspend

By Jake Edge
April 24, 2025

LSFMM+BPF

Sometimes worms have a tendency to multiply once their can is opened. James Bottomley recently encountered that situation; he led a session in the filesystem track at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF) to discuss filesystem behavior with respect to suspending and resuming the system. As he noted in his topic proposal, he came at the problem because he needed a way to resynchronize the contents of efivarfs after a system resume and thought there should be an API available to use. But, as the resulting thread shows, the filesystem freeze and thaw code had never been used by the system-wide suspend and resume code. Due to a scheduling mixup, though, several of us missed Bottomley's session, including Luis Chamberlain, who has been working on hooking those two pieces up; what follows is largely from a second session that Chamberlain led, with some background information from the topic-proposal discussion and an email exchange with Bottomley.

Background

The underlying problem that Bottomley is trying to solve is that efivarfs may not reflect the correct state of the EFI variables on the system after a resume operation. That's because some other OS could have been booted on the system after the suspend and changed some of the variables. So he was looking for a hook where he could add a resync operation that would run when the system is resumed. He had a solution, though it was kind of ugly, but he thought other filesystems might benefit from having some kind of API in order to give better guarantees about filesystem consistency across a suspend-resume cycle. As his proposal notes:

Hibernate is a particularly risky operation and resume may not work leading to a full reboot and filesystem inconsistencies. In many ways, a failed resume is exactly like a system crash, for which filesystems already make specific guarantees. However, it is a crash for which they could, if they had power management hooks, be forewarned and possibly make the filesystem cleaner for eventual full restore. Things like guaranteeing that uncommitted data would be preserved even if a resume failed, which isn't something we guarantee across a crash today.

The API for freezing and thawing superblocks was believed to provide the proper hooks, but that API is not actually called by the suspend and resume code, though many were under the impression that it was. In the proposal thread, Bottomley noted that there were several attempts to freeze filesystems when suspending, but they ran aground on deadlock problems of various sorts. Jan Kara described a problem for FUSE filesystems; user space needs to still be running or writes will block, so there are some ordering issues that need to be worked out.

As part of the first session, it was agreed that it would make sense to try freezing the filesystems in reverse superblock order, so that each level closer to the storage would still be active and data could be flushed. Any user-space processes that access the filesystems after they have been frozen will effectively suspend themselves, and user space can be fully suspended once all of the filesystems are frozen. It was generally believed that this would all work, but it would obviously need to be tested, which Bottomley started working on during the summit.
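
The ordering agreed on above can be sketched in a few lines of Python. The class and function names here are purely illustrative stand-ins, not kernel interfaces, and the model assumes a simple linear stacking of filesystems:

```python
class Superblock:
    """Toy stand-in for a mounted filesystem; 'backing' points at the
    filesystem holding this one's backing store, if any."""
    def __init__(self, name, backing=None):
        self.name = name
        self.backing = backing
        self.frozen = False

    def sync(self):
        # Flushed data must land in a still-writable lower filesystem;
        # freezing in the wrong order would deadlock here.
        if self.backing is not None and self.backing.frozen:
            raise RuntimeError("deadlock: lower filesystem already frozen")

    def freeze(self):
        self.frozen = True

def freeze_for_suspend(superblocks):
    """Freeze in reverse superblock (creation) order: upper filesystems
    first, so each one can still flush into the layer beneath it."""
    order = []
    for sb in reversed(superblocks):
        sb.sync()
        sb.freeze()
        order.append(sb.name)
    return order
```

With, say, a loop-mounted XFS filesystem backed by a file on the root ext4 filesystem, the reverse order freezes the XFS instance first; freezing in creation order would instead trip the deadlock check in sync().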

Second session

Chamberlain has been working on freezing filesystems for a few years now. He said that he was working on the patches, which he had last posted back in 2023, on the plane to the summit, but did not finish. His posting was made, in part, to prepare attendees for his session at LSFMM+BPF 2023. At the time of that posting, there were some locking problems that he had worked around, which he raised at the start of this year's session. Christian Brauner said that those problems had already been fixed; there were some locking inversions between the VFS and block layers for freezing, but it has all been untangled "and should be safe from that perspective, as far as I can tell".

The rest of Chamberlain's patches are all Coccinelle-driven and fairly straightforward, he said. After applying those, testing with XFS could commence, followed by adding them to ext4 and other filesystems, one by one, with, of course, more testing. There was some discussion of the ordering of freezing the filesystems, including various possible deadlock problems, but most of the problematic cases had already been addressed or were on their way toward removal from the kernel.

Automated testing of the change will be somewhat difficult, Chamberlain said, because there is no real framework for that. Fstests does not test the resume operation, for one thing. Currently the best way is to use a laptop for the testing, he said, so that there is access to a real system suspend and resume. Ted Ts'o suggested testing a loopback filesystem using a file on one filesystem that gets mounted on another as a good way to try to ensure that the ordering of the freeze operations is being handled correctly.

Lennart Poettering said that systemd has given up calling sync() before suspending the system due to problems with complicated mount topologies, often involving NFS, that made the operation take too long to complete. The project would have investigated freezing filesystems, but that is not something that user space can do because of the deadlock possibilities. He would be happy to see the kernel take care of all of that.

Chamberlain returned to the plan to get rid of the kthread freezer (which was the topic of his session in 2023). The first step is to remove its use by filesystems, which is what his Coccinelle scripts do, but then there are other users of the kthread-freezer API that need to be addressed. The API was added for filesystems to use, but other parts of the kernel have started using it as well, so the "next challenge" will be getting rid of it throughout the kernel, he said.

There was a lot of confusion in the audience about what the kthread freezer actually is. Bottomley suggested that it was the freeze_task() call, but could not find any uses of that by filesystems. Maybe things have changed, Chamberlain said, but it is something that needs more investigation once filesystems are properly frozen during suspend.

Over the remote link, Kara raised a potential problem with the proposed mechanism: loopback block devices that are not mounted but just have a file as their storage. Those devices may have dirty data, but by the time that data is flushed, the underlying filesystem will already be frozen, since block devices are frozen after filesystems; the result is a deadlock. That use case was deemed to be sufficiently rare that it could perhaps be ignored, at least for now.

Since the session, there have been several patch postings, including one from Chamberlain and another from Brauner that incorporates Chamberlain's patches.

Comments (1 posted)

Inline socket-local storage for BPF

By Daroc Alden
April 28, 2025

LSFMM+BPF

Martin Lau gave a talk in the BPF track of the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit about a performance problem plaguing the networking subsystem, and some potential ways to fix it. He works on BPF programs that need to store socket-local data; amid other improvements to the networking and BPF subsystems, retrieving that data has become a noticeable bottleneck for his use case. His proposed fix prompted a good deal of discussion about how the data should be laid out.

One day, Lau said, Yonghong Song showed him an instruction-level profile of some kernel code from the networking subsystem. Two instructions in particular were much hotter than it seemed like they should be. In bpf_sk_storage_get() (which looks up socket-local data for a BPF program), the inline function bpf_local_storage_lookup() needs to dereference two pointers in order to retrieve the user data associated with a given socket. As it turns out, both of those pointer indirections were causing expensive cache misses.
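
The two indirections can be modeled with a rough Python sketch. The names below are invented for illustration rather than taken from the kernel's actual structures, but the shape of the lookup is the same: two pointer chases sit between the socket and the user data, each one a potential cache miss on real hardware:

```python
class StorageElem:
    def __init__(self, data):
        self.data = data            # the BPF program's per-socket value

class LocalStorage:
    def __init__(self):
        self.elems = {}             # map id -> StorageElem

class Sock:
    def __init__(self):
        self.bpf_storage = None     # may be None until first use

def sk_storage_get(sk, map_id):
    """Model of the lookup path described above."""
    storage = sk.bpf_storage        # dereference #1
    if storage is None:
        return None
    elem = storage.elems.get(map_id)
    if elem is None:
        return None
    return elem.data                # dereference #2
```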

The socket-local data storage is laid out like this because the kernel can't know how much space BPF programs will need in their maps ahead of time, and so must be able to dynamically allocate the correct amount. In practice, however, the BPF programs in use at Meta, where Lau works, do not change the layout of their per-socket data frequently. One program hasn't changed the layout at all since 2021.

[Martin Lau]

So what if that data could be stored inline in the socket structure? Specifically, Lau proposed adding a new kernel-configuration parameter for reserving space in the structure. When set to zero, the kernel would keep the current behavior. When set to some non-zero value, allocations of socket-local data from BPF programs could be taken from the reserved space until it fills up, before falling back to the existing path. For Meta, which knows how much storage its BPF programs use per socket, this would allow the kernel to be configured with the appropriate size ahead of time, completely avoiding the double-dereference and cache misses. Reorganizing the storage like this would also allow saving 64 bytes of internal overhead from the BPF map per socket.
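
As described, the proposal amounts to a bump allocator over a fixed, build-time reservation, with the existing path serving as overflow. This sketch uses invented names and an arbitrary 128-byte reservation:

```python
RESERVED_BYTES = 128                # stand-in for the proposed config option

class Sock:
    def __init__(self):
        self.inline_used = 0        # bytes of the reservation handed out
        self.fallback = {}          # stands in for the existing two-hop path

def alloc_sk_storage(sk, map_id, size):
    """Serve allocations from the inline reservation until it fills up,
    then fall back to the existing allocation path."""
    if sk.inline_used + size <= RESERVED_BYTES:
        offset = sk.inline_used
        sk.inline_used += size
        return ('inline', offset)
    sk.fallback[map_id] = bytearray(size)
    return ('fallback', map_id)
```

With the reservation set to zero, every allocation would take the fallback path, matching the current behavior.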

One problem with the scheme is that the memory would no longer be visible to per-control-group accounting. Right now, when creating a per-socket BPF map, that memory is charged to the user-space program that created the map, Lau explained. With this proposal, it would count against the kernel instead.

Alexei Starovoitov asked whether the array size really needed to be configured at build time; couldn't the socket structure end with a variable-length array? Lau agreed that it was possible, but thought that it would be complex to handle for all of the different types of socket. After a bit of thought, Starovoitov suggested that this might be a good fit for run-time constants: values that are treated as constants and hard-coded into the kernel's machine code, but that can be patched at boot time. Lau said that he didn't know that was possible, but that it seemed like it could fit.

Another developer asked what would happen if a BPF program reserved space in the inline storage, and was then reloaded by user space — would it get a new allocation (leaving the old allocation as garbage), or reuse the old allocation somehow? Lau thought that still needed to be decided, and asked for ideas.

Andrii Nakryiko wanted to know how much space Lau intended to reserve; "like, kilobytes?" Lau clarified that they needed somewhere around 100 bytes, possibly less if BPF programs could be made to share data that they all need. That allayed some of Nakryiko's concerns, but he still wondered how bad a single pointer indirection would be. What if the socket structure stored a single pointer to a dynamically sized object? Data at the end of the socket structure is unlikely to be in cache anyway, he asserted. Lau disagreed, saying that it depends on how the packet has been processed so far.

That said, one potential benefit of using non-inline storage is the ability to share data between multiple sockets, he said. For some of the BPF programs he worked with, there might only be 500 different values in a given BPF map across 500,000 sockets. If those could be combined, it could decrease cache misses and make non-inline storage less expensive. He didn't think that it would help with the current design, however, since the first dereference would still not be cached.

Song Liu wondered if it would make more sense to compress the data — if there are only 500 variants, it should only need 9 bits. Nakryiko agreed, suggesting that the current data structures could be stored in a separate map, and the socket structures could store a small index into that map. Lau said that he had tried that, and the approach ran into some problems.
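
The compression idea can be illustrated with a short deduplication sketch (in Python, with invented names, rather than kernel code): each socket keeps a small index into a shared table instead of its own copy of the value.

```python
def dedup(values):
    """Return (table, indices): one copy of each distinct value plus a
    per-socket index into that table. 500 distinct values need only a
    9-bit index, since 2**9 = 512."""
    table = []
    where = {}
    indices = []
    for v in values:
        if v not in where:
            where[v] = len(table)
            table.append(v)
        indices.append(where[v])
    return table, indices
```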

Unfortunately, the session also ran into the end of its scheduled time before a design could be pinned down.

Comments (none posted)

Better debugging information for inlined kernel functions

By Daroc Alden
April 30, 2025

LSFMM+BPF

Modern compilers perform a lot of optimizations, which can complicate debugging. Song Liu and Thierry Treyer spoke about a potential improvement to BPF Type Format (BTF) debugging information that could partially combat that problem at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit. They want to add information on selectively inlined functions to BTF in order to better support tracing tools. Treyer participated remotely.

One of the most common compiler optimizations is inlining, which embeds the code of a function directly into its caller, avoiding the overhead of a function call and potentially exposing other hidden optimization opportunities. Modern compilers don't always inline every call to a particular function that would benefit from inlining, however. The compiler uses heuristics to decide which call sites will benefit from inlining. This means that a programmer can easily end up with a situation where a function still appears in a binary's symbol table (because some calls were not inlined), but tracing that function won't show calls to it (because the hot calls were inlined, and therefore the function's symbol no longer refers to them).

[Song Liu]

That is a problem, Liu said, both because it makes debugging harder and because it motivates developers to mark important kernel functions as not inlinable, so that they can rely on being able to trace them. It is technically possible to trace selectively inlined functions by finding the inlined locations using DWARF debugging information, which some tracing tools do automatically. DWARF is a bulky, complex format, however, which makes using it for this purpose slow. Other methods, such as tracepoints and Linux security-module hooks, also work with selectively inlined functions. Liu argued that those aren't a proper replacement for normal tracing, however, since when debugging a kernel problem it is often not clear what functions one will want to trace until one is actually working on the problem.

Liu outlined two different options for how to improve this situation: just marking selectively inlined functions in the kernel's BTF, and including information about where they have been inlined. The first option has the benefit of simplicity; it would be easy to add an additional function attribute and convert tools like pahole to handle it. Selectively inlined functions are mostly a problem because of the confusion they cause, Liu said. Just being able to warn the developer about what is happening would help.

The second idea would let other tools more easily match what perf probe does: set breakpoints at every location where a function was inlined, as well as its non-inlined location. perf does this by parsing DWARF; Liu's change would let other tools use BTF for the same purpose.

Tracing an inlined function is not as simple as putting breakpoints in the right place, however. When a function is inlined, arguments and return values can disappear due to other optimizations. BTF would need a way to indicate how to transform the machine state at the call site to recover the function's arguments.

Liu and Treyer analyzed the kernel's existing debug information to figure out whether this was even possible. Out of 228,000 arguments to inlined functions, across 150,000 different call sites, the location of about 83% can be expressed as an offset from a register, usually because the argument is present on the calling function's stack or passed as a pointer. This analysis doesn't include a handful of the most commonly inlined functions, Liu warned, because those disproportionately skew the numbers.

One audience member asked whether that number was per-argument or per-function. Liu clarified that it was per-argument. The audience member then asked what percentage of functions have only arguments representable in this way. Liu showed a slide indicating that approximately half of the functions he looked at would have all their arguments available using this scheme.

With the knowledge that it was possible, Liu put together a proposal for how to encode information about inlined function arguments in BTF: each parameter would be described by a sequence of operators in a highly restricted virtual machine. Supported operations would include loading constants, dereferencing registers, applying offsets, etc. So, for example, an argument that was stored as a member of a structure pointed to by a register would be represented by something like "load register, add offset, dereference, end". Using Liu and Treyer's prototype implementation, this would add about 10MB to the kernel's BTF, although that number is before performing deduplication, which may help substantially.
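
A toy interpreter gives the flavor of such a restricted location language. The opcode names and encoding below are invented for illustration (the real BTF encoding was still under discussion at the session), with registers and memory modeled as Python dictionaries:

```python
def recover_arg(program, regs, memory):
    """Evaluate a tiny location program against a register file and a
    flat memory model, yielding the argument's value."""
    val = None
    for op, operand in program:
        if op == 'load_reg':        # start from a register's contents
            val = regs[operand]
        elif op == 'add_offset':    # apply a constant offset
            val += operand
        elif op == 'deref':         # read from the computed address
            val = memory[val]
        elif op == 'end':
            break
    return val

# "Argument stored as a member of a structure pointed to by a register":
# load register, add offset, dereference, end.
program = [('load_reg', 'rdi'), ('add_offset', 8),
           ('deref', None), ('end', None)]
```

Running that program with rdi holding a structure's address simply reads the member eight bytes into the structure, which is exactly the common case the 83% figure covers.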

A final problem is that BTF only exists for functions that were not fully inlined. Functions that are inlined at every call site are less confusing than selectively inlined functions, but they would be fairly simple to support. Adding those in would add an additional 32,000 functions to the kernel's BTF.

The audience had a number of questions about the design for encoding argument information, mostly focused around how to streamline it and remove redundant information. Alexei Starovoitov said that he thought this approach seemed "inspired by the DWARF state machine", and that they had the opportunity to do something simpler instead. Liu agreed that the proposal was a bit rough, and welcomed suggestions for how to improve it.

Starovoitov also pointed out that the encoding Liu had proposed wouldn't work properly for 16-byte values, of which the kernel uses a fair number. Andrii Nakryiko asked whether the problem was that these values were sometimes stored in a pair of registers, instead of in memory. Starovoitov agreed that was sometimes the case, but outlined a few other edge-cases that small structures could fall into. Liu agreed to look into the encoding of 16-byte values.

Comments (2 posted)

How LWN is faring in 2025

Just over six months ago, The Economist described the US economy as "the envy of the world". That headline would be unlikely to appear now. The economic boom referenced in that article feels like a distant memory, markets are falling, and uncertainty is at an all-time high. Like everybody else, LWN is affected by the current turbulence in the political and economic spheres; we expect to get through this period, but there will be some challenges.

To put it bluntly: starting around the beginning of March, we have observed a distinct drop in both new subscriptions and renewals. That timing roughly corresponds with the US administration's increasing attacks on the global system of trade and the economic downturn that has been its result. As it stands, this subscription drop does not pose an existential threat to LWN — or to the salaries of its writers — but it is a matter of concern.

We are responding by tightening our belt where we can, but otherwise working to provide the best coverage of the Linux and free-software communities, as we always have. Readers can help, of course, by subscribing if they have not already done so. Encouraging your employer to set up a group subscription is especially helpful. Subscriptions are the only thing that has kept LWN operating for all these years.

Beyond the immediate situation, there are a number of potential problems to be concerned about. For example, inflation did not stop after our price increase in 2022, with the result that subscription dollars buy significantly less than they once did. We are not considering a price increase at this time but, if the situation worsens, we may have to go there.

For better or worse, LWN is a US-based company, but a large portion of our subscription sales come from outside the country. If the backlash against US companies grows, we are unlikely to escape its effects entirely. Further attacks on global trade could make it more difficult for us to accept payments from outside the country, even when the buyer is willing. In a truly terrible world, there may be attempts to reduce US participation in (and support for) free software; the probability of that seems low, but not zero.

Those are all future worries, though. For now, we will focus on getting through the current economic storms. The good news is that LWN has been here since 1998, which is long enough to have been through more than one difficult cycle. We are still here, thanks entirely to the steady support from you, our readers. Our subscribers, especially, have our gratitude; if you have not yet subscribed to LWN, please consider doing so now.

Meanwhile, our primary focus will remain on being worthy of the support you all have given us since the beginning. It is the least we can do for all of you who have made our existence possible for the last 27 years.

Comments (106 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Debian election; Kali Linux key; OpenBSD 7.7; Firefox 138.0; GCC 15.1; Meson 1.8.0; Valgrind 3.25.0; FSF review; OSI retrospective; Mastodon; Quotes; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds