LWN.net Weekly Edition for March 19, 2026
Welcome to the LWN.net Weekly Edition for March 19, 2026
This edition contains the following feature content:
- Cindy Cohn on privacy battles old and new: a SCALE 23x keynote about the Electronic Frontier Foundation's history of protecting privacy and what's to come.
- More timing side-channels for the page cache: securing the Linux kernel is a never-ending job.
- Practical uses for a null filesystem: making life easier for init programs and more.
- Fedora ponders a "sandbox" technology lifecycle: a proposal to make the Fedora project a friendlier place for experimentation.
- A safer kmalloc() for 7.0: a new set of type-safe memory-allocation functions for the kernel.
- BPF comes to io_uring at last: after five years, Pavel Begunkov's patch set to allow running BPF programs from within io_uring is finally in.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Cindy Cohn on privacy battles old and new
Cindy Cohn is the executive director of the Electronic Frontier Foundation (EFF) and she gave the Saturday morning keynote at SCALE 23x in Pasadena about some of the work she and others have done to help protect online rights, especially digital privacy. The talk recounted some of the history of the court cases that the organization has brought over the years to try to dial back privacy invasions. One underlying theme was the role that attendees can play in protecting our rights, hearkening back to earlier efforts by the technical community.
Cohn has been the executive director for the past decade and worked for EFF for 26 years, plus a few years before that informally. She is soon to be the former executive director as she is stepping down, "because it's time to pass the torch", sometime over the northern-hemisphere summer. She was wearing a T-shirt that colleagues had made for her which said "Let's sue the government"; she said "I'm not done suing the government" even though she is leaving the leadership role. On her way out she has written a book, Privacy's Defender, which came out on March 10, two days after her talk. The keynote was her "first official stab at a 'book talk'".
She wrote the book in part to help capture the history of the early internet that wasn't just about "dudes and the companies they built", which was part of the story, of course, "but they were incredibly rich times" with "a lot of people who weren't named 'Jobs' or 'Gates'". She also wants to reclaim the word "hackers" from the "people who want to make it about something illegal", which was met with loud applause.
In my world and where I came up, being a "hacker" meant someone who hacked away at a problem until they solved it, like the way you take a small ax to a large tree. [...] So I am very intentionally calling this community "the hacking community" and if you don't like it, you can take it up with me afterwards, but my fight is not with you, it is with the people who tried to take something beautiful and make it something nasty.
She put her cards on the table early on: "I'm trying to recruit you". She works for an advocacy group and believes that "we need all the hackers in the world to help us make the world a better place".
Privacy
The book is primarily about privacy, she said; it is part memoir and part legal history, according to the publisher's description. But privacy is not what some people think it is: some kind of "cloak of invisibility" that can be used when someone is about to do something they don't want anyone to know about. It can work that way, but that is not why privacy is important.
"Privacy is important because it is a check on power", she said, "it is a way that people with less power can have some protection against people who have more power". It is a check that works on multiple levels. It starts with the personal—EFF works with domestic-violence victims who are trying to get out from under surveillance by their former partners, for example—"it is a check on power, literally in people's homes". It also checks corporate power, where the surveillance by companies is impacting people's lives, including the prices they pay and whether they can qualify for a mortgage.
Privacy is one of the ways we can regain our power against the companies, large and small, who want to control us and manipulate us and, you know, empty our bank accounts as much as they can.
Beyond that, privacy is a check on governments, which is what she has mostly worked on throughout her career. Privacy enables dissent and allows planning; all of the efforts in the US to bring more rights to more people had both public and private parts. For example, during her lifetime, gay people were able to go from being violently attacked just for talking about the idea that they should have equal rights to same-sex couples being able to legally marry; "the public parts of that work could not have happened unless there were private parts of that work".
We are seeing privacy being used right now to organize against some of the injustices occurring in the US; that needs to happen in private "if it's going to get a leg up and a chance to catch fire" so that it can be effective. Privacy "ultimately enables democracy", which is easily seen in the secret ballots in the US that shield voters against pressure from the powerful to vote in certain ways.
Freeing cryptography
"Now I want to tell you a little story about something that happened in the 1990s." It was about the Bernstein v. United States lawsuit; in that case, she helped lead the fight to "free cryptography from governmental control". It was filed in 1994, which means that the fight to free cryptography pre-dates the world wide web, she said.
Her involvement came about in an interesting way, when a hacker she knew socially (John Gilmore, she noted later) asked if she would be willing to help a math PhD student who wanted to publish some code but was told he would go to jail as an arms dealer if he did. She asked if the code "blows things up" and was told that it simply "keeps things secret", which sounded like a First Amendment violation to her; he agreed and she took the case.
But that was not the only involvement of the hacker community in the case. In a reading from her book, she set the stage for the first day in court at the San Francisco Federal Courthouse, which she called "Cypherpunk dress-up day". When she arrived at the courthouse, she was greeted by around 30 people from the hacker community, mostly 30-something, long-haired, scruffy-looking, and in suits and ties. "They all seemed to be in outfits their mothers picked out for them." It is possible that she was projecting, however, as she was 32, conscious of her appearance, and was dressed in a suit her mother had picked out.
It was something of a motley crew, but they were there to show support for her arguments when the case reached court in September 1996. Both she and the assembled hackers knew that "what happened in that courtroom would be crucial to the future of the internet". The hackers were there in part to show the judge, Marilyn Patel, that they were serious about making a change; they had followed the EFF's request to dress in their finest to make it clear that the case was important.
And it was clearly important, Cohn said, reflecting on what the internet would look like without encryption. While she does not think today's internet is "as secure or private as it needs to be", she listed lots of different ways that the internet would be worse off with no (or weak) encryption, "the way the government wanted it". For example: no secure messaging for organizing and other purposes, stolen or seized phones that would compromise the identity and communications of their owners, no way to know for sure that communication is with the expected party, no e-commerce, and so on. Ultimately, the internet could have remained a tool of academics, governments, and a few hackers, like it was in those days, but it would not have gained the worldwide reach (with consequences both good and ill) it has today.
In the 1990s, the US government treated "software with the capability of maintaining secrecy" the same way it treated surface-to-air missiles and tanks: a license was needed to be able to "export" any of them to foreign countries. Making something available on the internet was considered an export, which was not just of theoretical concern. Phil Zimmermann faced a criminal investigation due to the release of his Pretty Good Privacy (PGP) tool. Dozens of others, mostly academics, had been threatened as well.
Because the issue involved publishing, which is a free-speech right protected by the First Amendment, the lawyers decided to build the case around the legal doctrine of prior restraint. That doctrine says that requiring government permission before speaking or publishing must meet a particularly high standard or else it violates the right of free speech. In the early 1990s, it had not yet been established whether the internet would be a place of "full First-Amendment protection" and they knew that freeing up encryption and the science of cryptography, along with the ability to share code, "was going to be key to making the internet itself a place of freedom of speech".
Beyond the cypherpunks who showed up on the first day, the case was bolstered by the support of a wide variety of people and organizations: cryptographers, computer-science professors, open-source toolmakers, privacy groups, and more all wrote declarations in support. Even outside of the courtroom, she and the other lawyers were supported by hackers of various stripes who took the time to patiently explain cryptography to her in a way that she could understand what it was and did. That allowed her to translate cryptography and the internet to the judges who heard the case at various levels.
That patient explanation was empowering for her and she thinks it is a lesson that we should be applying today. "People are hungry for privacy and security and the people in this room have the knowledge to help them." In recent times she has seen much more engagement from hackers toward educating people and inviting them into the hacking community. "I think you are standing in the shoes and following in the legacy of those early hackers and I really want to commend you for it."
There were also efforts to publicize the case through T-shirts with the RSA code printed on them, for example. Companies in the computer industry gave their support, even though they are generally loath to go up against the US national-security apparatus, and the US Congress started looking into the matter as well. Eventually, the courts ruled that "code is speech", first in the district court and again in the court of appeals for the ninth circuit. "We won", she said to applause.
That particular story ends in Washington, DC in mid-2000, when she and others on the case were invited by her counterpart on the government side, Tony Coppolino, to talk about encryption regulations. She read another excerpt from the book describing a majestic conference room in some storied building in the US capital, which was a bit intimidating. But she and the others had "come to negotiate the terms of the government's surrender". Coppolino had sent her a draft of the new export regulations that dropped the requirement for pre-publication review of open-source encryption code; anyone exporting (publishing) such code just needed to send a copy or a link to the government when doing so. "It was 95% of what we wanted."
Unusual
While it was a "tremendous victory", it has needed defending over the years, like many other victories. There were efforts by the government to undermine encryption, many of which we learned about through Edward Snowden, for example. The Bernstein case was "a fun story", but it is not the way that these kinds of changes typically happen when you are up against the government, she said.
The other two stories she tells in her book represent the more usual path. One is about spying by the US National Security Agency (NSA) and the other about national-security letters; both of those are "post-9/11 spying that the government did, some of it publicly known and some of it not until much later". Those cases have a rather different trajectory, she said. A dramatic courtroom victory as in the first story is definitely outside of the norm.
The NSA spying case came about because whistleblower Mark Klein "literally knocked on the front door at the Electronic Frontier Foundation in early 2006". He brought details of how the NSA was tapping the internet backbone in various locations, including a secret room in the AT&T building in downtown San Francisco (the city where the EFF is located). It is the most "cloak and dagger" of the stories in the book, she said, due to the courage of Klein and, later, Snowden in 2013.
After a few early victories, "Congress rushed in to protect ... the phone companies" by killing the lawsuit that had been filed. The EFF was able to get a few reforms passed by Congress after the Snowden revelations, "but not nearly enough". Eventually, the US Supreme Court sided with the government when it ruled that the identity of the telephone companies participating in the mass spying was so secret that the case could not go forward—though the world already knew about NSA spying and the EFF had evidence of exactly how it worked in 2003.
The third story from her book is about cases that had a similar trajectory: an early win in the courts, and some reform in Congress, "but still not enough". She calls them "the alphabet cases because we couldn't even name our clients for six years", so they were called "case Q, case Z, and case X". The cases were an attempt to scale back a kind of subpoena, called a national-security letter, that the US government was using on telecommunications providers. Those letters were "demanding information about their customers and gagging the companies from ever telling anyone that anything had happened".
The EFF was able to get the gags lifted and to add some more procedural safeguards to the process. One of those allowed the companies to produce transparency reports where they could characterize the number and scope of such requests. Those numbers are eye-opening: "there were hundreds of thousands of these issued that implicated millions of people in the times that we were able to track".
Hackers
So the Bernstein case was "amazing", but it was an outlier; most cases are more like the other two, where any progress made is via "a thousand tiny cuts" rather than a sweeping courtroom victory. All along, though, the EFF had the support of the hacking community in various forms. Both Klein and Snowden are technical people, and hackers in her mind, though Klein would probably avoid that label were he alive today, she said. The community has also helped keep the media informed and raise public awareness of surveillance and spying so that voters can apply pressure.
Because it's opaque, it's hard for people to see it, it's hard for people to understand it. And the hacking community has played a huge role in continuing to keep attention on these issues and continuing to talk about how important they are. And that pressure did lead to congressional reforms, increased pressure from courts, and some administrative shifts that we should all be proud of even as there's more work to do.
She had a slide with a picture (seen below) of a blimp that the EFF and others had flown over an NSA data center in Utah in 2014. The data center was being built to hold all of the records that were being gathered from the NSA spying efforts. The blimp had an arrow pointing down with the message "Illegal Spying Below", which she recounted to laughter. "Our friends at Greenpeace lent us their blimp; we're not above a little stunt every now and then to draw attention to things."
She had a message for the Linux builders and users in the audience about the role they can play. As builders of the tools people use, the open-source community can help ensure that encryption is built into everything—and that it is easy to use. "My plea to the open-source community for at least 30 years now is: 'please, user interfaces'." While that may not be the fun part, "I'm here to tell you that you need to do the not-fun part too".
She suggested defaulting to privacy-preserving architectures, along with minimizing data collection and retention. Meanwhile, conducting security research and publishing the findings is important so that users have the most secure products they can. In addition, she hoped builders would push back against surveillance features being built into products they were working on for their employers.
Things feel really dark right now, she said, listing a bunch of developments that are taking us further down the "surveillance state" path. She sometimes feels like Cassandra, having warned about a future that those in power apparently could not see, but that we are now living through. For example, databases created for commercial purposes are increasingly being used by the government, which is the largest purchaser of information from data brokers, as a weapon against its targets. "Those targets are increasingly more political than legal." And on and on.
The courts have created a "national security shaped hole in the Constitution"; it has been built over many years, by administrations of both political parties in the US. That is why the magic "national security" phrase is used so frequently these days, since it is "the easy road" for the government at this point. She noted that Benjamin Franklin had said that the US Constitution created "a republic, if you can keep it". She believes we are in the "if you can keep it" part at this point; everyone needs to participate in the fight for that, and not just sit back and wait for others to do it, Cohn said.
Closing and Q&A
"We have some things to learn about the cypherpunk legacy." Beyond showing up in ill-fitting suits in 1996, they built PGP, published cryptography research, and pushed for privacy. The cypherpunks recognized that privacy needs more than just technology; it requires society and its laws to support the technology. "Just adding encryption does not equal privacy or security, there's much much more to it."
The work of the cypherpunks (and others) enabled the internet that we have today, Cohn said, "and you are the next generation". She had some ideas for how attendees could join the fight, starting with: "Show up" to represent privacy-preserving views at various levels of government, from courtrooms to homeowner associations. "Privacy is a team sport", so use the tools yourself and help others to use them too. Also, educate people, young and old, contribute to privacy-oriented open-source projects, advocate for encryption and other privacy protections at your workplace and beyond, and build the tools that the next generation will need to further the effort. As the EFF executive director, "I am almost contractually required to say 'please join the EFF'", as well, of course.
She closed by noting that it surprised a lot of people that the "crazy, wild-eyed misfits" who were outnumbered and outgunned when they took on the government in 1996 were able to prevail. That was one successful path, but Cohn does not believe it is the only one available. "I think we need to figure out new strategies and new ideas [...] and not get stuck just trying to replicate the ones from before."
SCALE organizer Ilan Rabinovitch asked the first question (after announcing that he would donate matching funds for EFF memberships made that day—an offer that many seemingly took him up on). He noted that in recent times EFF has done more with developing privacy tools and related technology, such as Let's Encrypt, and he wondered how the organization had ended up shifting somewhat from advocacy to technology.
Cohn said that early on she would call out to technical people to ask for explanations of various things; those people were quite helpful and generous with their time, but eventually the organization decided to bring someone in-house. EFF hired Seth Schoen as the first-ever staff technologist at an advocacy organization; he was followed by Peter Eckersley, who did a lot of work on Let's Encrypt before he died in 2022. "And you know what happens when you get a bunch of technologists hanging around? They want to build something."
In particular, they wanted to build things that aligned with the fights that the organization was having on the policy and legal side. Early on, even before it had a full staff, the EFF had helped build the DES cracker to show that the then-standard Data Encryption Standard (DES) was insecure due to its mandated 56-bit key size. In the end, "the reason that the EFF has a tech team is that hackers want to hack".
She mentioned Privacy Badger as another project that the organization built, to applause. It is a browser extension for third-party cookie blocking that came about because one of the EFF technologists got angered "that the techs on the browser side were basically lying" about how hard it was to build such a thing. Having people who can work both on the policy side and on the technology-building side is "kind of deep in our DNA at this point".
The next question was regarding the battle between the US government and Anthropic over two red lines that the company wanted to enforce on the use of its large language models (LLMs). Cohn said that one of those red lines, a ban on using the LLMs for mass surveillance, was of particular interest to the EFF.
It is important for companies to be willing to draw those lines and stick to them, she said; she is no real fan of the company, and it "did not draw the line where I would draw the line, but at least they drew it somewhere". She pointed out that the OpenAI position, "if it's legal, then we'll do it", is worrisome in part because the law is so malleable; every genocide and human-rights violation around the world is done "legally" (complete with air quotes). Beyond that, our privacy should not be decided by the CEO of a tech company; it should be protected at every level of government.
Another question asked about the difference between Bernstein's algorithm and the encryption that was being used all over the world at that time; why did the government allow export of some encryption schemes but try to stop Bernstein? The answer, Cohn said, was key length; "the government would grant a license if the key length was short enough that they could break it". Bernstein was making a larger point with his algorithm, which he called "Snuffle", that adapted a widely used hash function and turned it into an encryption algorithm. The hash function was used for authentication, and was unregulated by the government, but his point was that the same basic algorithm could be used for encryption, so the encryption restrictions made no sense.
The final question was from Denver Gingerich, who keynoted at SCALE 2025, about attracting staff litigators to a non-profit organization. He works for Software Freedom Conservancy (SFC), which sometimes has to bring lawsuits to try to enforce the GPL. Cohn agreed that it was a hard problem and suggested that SFC had it worse than EFF: "I offer people First-Amendment law, Fourth-Amendment law, and you offer people kind of the puzzle that are open-source licenses." She said that EFF tries to have a fun working environment, for one thing, and also has an internship program that brings in law students, but that it is a difficult problem, especially with regard to salaries.
The talk provided some interesting history for those who were too young to live through some of those times. There are more fights ongoing and surely more to come; EFF will be part of those efforts, but Cohn made it clear that there is far more that needs doing, so attendees should figure out how they can pitch in. A video of just the talk will likely appear before long, but those interested can see the talk in the livestream YouTube video.
[Thanks to LWN's travel sponsor, the Linux Foundation, for its travel funding to attend SCALE in Pasadena.]
More timing side-channels for the page cache
In 2019, researchers published a way to identify which file-backed pages were being accessed on a system using timing information from the page cache, leading to a handful of unpleasant consequences and a change to the design of the mincore() system call. Discussion at the time led to a number of ad-hoc patches to address the problem. The lack of new page-cache attacks suggested that attempts to fix things in a piecemeal fashion had succeeded. Now, however, Sudheendra Raghav Neela, Jonas Juffinger, Lukas Maar, and Daniel Gruss have found a new set of holes in the Linux kernel's page-cache-timing protections that allow the same general class of attack.
The impact
The ability to determine when pages are present in memory and when they are accessed may not sound particularly easy to exploit. There are some subtle attack vectors, such as how knowing which page of an executable is in memory can indicate which code is being executed. In turn, that allows deploying other attacks that depend on coincidences of timing with more reliability. For example, the timing information can be used to defeat address-space-layout randomization. Another possible use is detecting when a user is entering a password, by looking at when the privileged application accepting the password resumes executing in response to an event. That reveals how long it takes the user to press each key, which can be used to reconstruct the actual words typed with reasonable fidelity. [A reader pointed out that the linked paper applies specifically to written text, and that "reconstructing passwords, passphrases, and pseudorandom strings presents a very different and more difficult problem". Depending on how similar a password is to normal text, timing information may be more or less usable to narrow the search space of possible passwords.]
It is not currently recommended to change your keyboard layout every thirty days, however: the real problem is not any specific attack that can be performed with access to page-cache-timing information, but rather the fact that the page cache touches on the timing of nearly every operation on a modern computer. The page cache is shared between applications running at different privilege levels, and can be monitored and flushed by any of those applications, for longstanding performance reasons. Changing the semantics of page-cache operations would certainly cause noticeable breakage in user space.
The mechanisms
The original page-cache-timing exploits from 2019 used the mincore() system call, which allows programs to check whether a page is already present in the page cache for performance optimization. The fix at the time made mincore() return fake information for pages that are not mapped in a process's page tables, so that one application could not spy on another. That check, however, was not correctly applied to the cachestat() system call when it was added to the kernel in 2023, reopening the same set of vulnerabilities.
The recent paper also laid out a handful of other mechanisms that don't depend on specific system calls, however. The most basic is simply measuring the amount of time that it takes for a page to be read; when the page is not in the cache, it takes noticeably longer to read the page. Reading the page can be done with a number of system calls, such as read(), mmap(), or even sendfile().
The disadvantage to using timing information to detect whether a page is present in the page cache is that attempting to read the page brings it into the cache. If a malicious program loads the page of interest right before the program being attacked, the malicious program may not be able to flush the page out of the cache quickly enough to observe the subsequent access. Still, there are well-known statistical techniques for turning an unreliable timing channel into a reliable leak of information.
Flushing a page from the page cache by brute force is fairly simple: just access enough other pages that the cache fills up and evicts the least-recently accessed page. (Although the exact details can get more complicated with the kernel's multi-generational LRU.) There is a subtler way, though: the posix_fadvise() system call. It allows applications to advise the kernel on how they expect to access memory; using the POSIX_FADV_DONTNEED flag reliably makes the kernel remove the relevant page from the page cache as long as the page is not mapped anywhere else with mmap(). The system call is also usable as a timing side-channel itself: the call completes more quickly when the targeted page is not already in the cache. Calling posix_fadvise() in a loop can therefore both keep a page out of the page cache, and determine when another process faults it back in.
Even if posix_fadvise() isn't available, the preadv2() system call, when invoked with the RWF_NOWAIT flag, also provides a way to check whether a page is in the page cache. The flag makes the system call return EAGAIN when the page is not immediately available. The exact semantics of the call are up to specific filesystems, so this could theoretically still bring the page into the cache, but the researchers cite another recent paper (unfortunately paywalled) that claims this doesn't happen in practice.
These mechanisms can be combined in flexible ways. The most reliable attack technique demonstrated by the paper was to use posix_fadvise() to remove a page from the cache, and then wait for it to be faulted back in with preadv2(). If preadv2() is blocked (by seccomp, perhaps), using posix_fadvise() on its own still works, just a little less reliably. If posix_fadvise() is blocked, evicting the page from the cache the old-fashioned way still works. And if both system calls are blocked, pages can still be repeatedly evicted and loaded to obtain rough timing information.
What to do
It's tempting to try to introduce another set of targeted fixes for posix_fadvise() and preadv2(). The problem is that these system calls have existed since 2003 and 2016, respectively, and are widely used by existing user-space applications. Changing either of their semantics would certainly introduce breaking changes. In an informal writeup of the paper on his blog, Neela quoted Linus Torvalds's opinion from their private correspondence:
Yeah, while I'm very comfortable changing cachestats, I'm not so sure about POSIX_FADV_DONTNEED.
In particular, I can easily see cases where people really want to say "drop the caches" on files that they really cannot write to.
Even if kernel developers were to change the semantics of POSIX_FADV_DONTNEED, however, there are multiple other mechanisms to accomplish the same thing. The only total solution would be to partition the page cache so that privileged processes no longer share pages with unprivileged ones. There is existing research on the performance impacts of such a change, but the results are mixed: the impact depends heavily on the particular workloads making use of the cache. A less-invasive solution might be to update secure software to use mlock() to pin its executable code in the page cache. That would introduce yet another subtle detail that writers of secure software would have to be aware of and use judiciously.
Since the researchers' disclosure of the new set of page-cache-based attacks in January, cachestat() has been patched, but there has been relatively little discussion about other changes to the page cache. Without any clever ideas about how to mitigate the risks without harming backward compatibility, this may become one of those attacks that, like Spectre or Rowhammer, can be mitigated but not properly prevented.
Practical uses for a null filesystem
One of the first changes merged for the upcoming 7.0 release was nullfs, an empty filesystem that cannot actually contain any files. One might logically wonder why the kernel would need such a thing. It turns out, though, that there are places where a null filesystem can come in handy. For 7.0, nullfs will be used to make life a bit easier for init programs; future releases will likely use nullfs to increase the isolation of kernel threads from the init process.
Making root actually pivotable
The process of bootstrapping a computer involves a number of tricky steps, one of which is locating and mounting the root filesystem. That task might involve digging through text files, contacting other systems over the network, assembling RAID volumes, and more. As a result, there really needs to be a temporary root filesystem, with usable commands, before the real root filesystem can be mounted. That initial filesystem usually comes in the form of an initramfs image that is bundled with the kernel binary.
That initial filesystem is the first root filesystem, but there comes a time when it must be replaced with the real root. The kernel provides a system call, pivot_root(), for just that purpose. It will cause a new filesystem to become the root filesystem, and will cause any existing processes to move from the old root to the new. There is only one tiny little problem: pivot_root() cannot be used for this purpose for the actual root filesystem. From the man page:
The rootfs (initial ramfs) cannot be pivot_root()ed. The recommended method of changing the root filesystem in this case is to delete everything in rootfs, overmount rootfs with the new root, attach stdin/stdout/stderr to the new /dev/console, and exec the new init(1). Helper programs for this process exist; see switch_root(8).
Even with the availability of a helper, the need for this kind of workaround is seen by some as mildly inelegant. (The system call can be used in other contexts, such as setting up a new root for a container.)
The solution is to base the filesystem tree on an empty filesystem — nullfs — upon which both the temporary and permanent root filesystems can be mounted. pivot_root() can then be used to place the permanent root below the temporary one in the mount stack, allowing the latter to be unmounted. The system can then continue the bootstrap without the need for the above-described workaround. The nullfs implementation, as found in 7.0, enables this type of operation.
Isolating kernel threads
There are other possible uses for nullfs, though; consider the case of kernel threads. The first two processes created by a booting Linux kernel are init and kthreadd. The init process will serve as the ultimate ancestor for all user-space processes created thereafter; kthreadd, meanwhile, is the parent for all of the kernel threads that will be spawned over the life of the system. When the ps command shows you a process with a name like "[rcu_tasks_rude_kthread]", you'll know that it is an ill-mannered child of kthreadd.
Christian Brauner (who implemented nullfs) pointed out an interesting relationship between those two processes in this patch series cover letter. They share the same initial fs_struct — the structure that records a process's root directory, current working directory, and umask — meaning that they each have access to the other's filesystem state. When init forks, it explicitly disconnects its children from that shared structure, but kthreadd does not. As a result, every running kernel thread shares full access to init's filesystem state, and vice versa. Normally, all of those processes mutually trust each other, but the potential for mischief (and unfortunate bugs) is real.
Brauner's proposed solution is to isolate kernel threads from the initial fs_struct and, instead, to run each of them with a nullfs instance as its root filesystem. In that way, a kernel thread can no longer interfere with the init process; indeed, it has no access to the filesystem at all. This seems like a bit of worthwhile kernel hardening, given that most kernel threads have no need for filesystem access. As an added bonus, pivot_root() no longer needs to forcibly change the root filesystem for all of the running kernel threads, since they no longer need to be moved to the new root.
The init process, too, is separated from the initial fs_struct and given one of its own, just like any other fully independent process. At that point, the initial fs_struct is entirely unused — most of the time.
Separating kernel threads from the filesystem is almost always the right thing to do; as Brauner noted: "Offloading fs work to kthreads is really nasty [...] It's a broken concept." It is also unnecessary most of the time; after all, rcu_tasks_rude_kthread can happily go about its job of inconsiderately interrupting CPUs to force RCU grace periods without accessing any files. But there are other kernel threads that do, indeed, need occasional access to the filesystem. Enabling this access is why kernel threads have long retained their connection to the initial fs_struct.
For cases like that, the patch series adds a mechanism to
temporarily give a kernel thread access to the filesystem:
scoped_with_init_fs() {
/* code here can perform filesystem operations */
}
The use of the scoped guard ensures that the access is only provided within the indicated scope, with no possibility of that access being left in place accidentally.
There are roughly a dozen kernel threads that have to be patched to use this new mechanism. The production of core dumps, for example, naturally requires filesystem access. Unix-domain sockets need to be able to look up names in the filesystem, firmware loading must be able to find and open the file containing the firmware, the devtmpfs filesystem must be able to provide the filesystem content, and so on. So there are a number of holes punched into the wall separating kernel threads from the filesystem, but they are small, localized, and easy to find.
Brauner is careful to not set expectations too high for this work at this point: "Is it crazy? Yes. Is it likely broken? Yes. Does it at least boot? Yes." It is a significant change to some of the deepest code within the kernel, code that has had its current form for a long time. The existence of surprises seems almost certain. But, so far, nobody has questioned the goals or direction of this patch series. Greater isolation for kernel threads (and init) thus seems likely to show up in a future kernel release.
Fedora ponders a "sandbox" technology lifecycle
Fedora Project Leader (FPL) Jef Spaleta has issued a "modest proposal" for a technology-innovation-lifecycle process that would provide more formal structure for adopting technologies in Fedora. The idea is to spur innovation in the project without having an adverse impact on stability or the release process. Spaleta's proposal is somewhat light on details, particularly as far as specific examples of which projects would benefit; however, the reception so far is mostly positive and some think that it could make Fedora more "competitive" by being the place where open-source projects come to grow.
Spaleta said some people may have already heard about his idea, which he has been calling the "Fedora Sandbox". It would be used to test, refine, and validate "experimental features, components, output, process or services" without a commitment to integrate any of the experiments into Fedora. The technologies that would be appropriate for the sandbox are those that need to mature or prove themselves in some way. For example, a technology may need time to become stable enough to rely on, or it may be stable but need to demonstrate that there is enough community interest to sustain the technology long term.
He has outlined several objectives for the sandbox. The first is to foster innovation in Fedora by encouraging contributors to bring new ideas forward that might not be mature enough for direct inclusion in Fedora. Another is to isolate risk; by sandboxing a project, it can be worked on as an experiment that won't affect other Fedora components or negatively impact its users. He positioned the sandbox idea as a way to provide a clearer path for inclusion as an official Fedora feature, service, etc. As it stands now, there is some murky territory between a project's emergence as a brand-new idea and its readiness to be included in Fedora. Spaleta also hopes to use the process as a more formal way of gathering feedback on technologies and to promote transparency in the project.
Fedora already has a mechanism for proposing innovations in the form of its change process. The Fedora Sandbox proposal does not supplant that; projects that make it through the integration stage are then expected to be proposed as a change. So, even if an experiment makes it all the way through the sandbox, it is not guaranteed to be adopted—though it would seem highly probable at that stage. The sandbox process would not be needed for changes that are already well-covered by Fedora's existing process. So, for example, it would not be necessary to put a new release of GCC through this process in order to include it in Fedora.
The proposal
Spaleta suggested a few potential stages for a project going through this process: sandbox, curation, and integration. If an experiment is unsuccessful in moving through the incubator, then it is moved to a fourth stage: retirement.
To enter the sandbox stage, maintainers of an experiment would need to demonstrate that it might be a fit for Fedora and specify the exit criteria as well as the timeline. Applications to enter the sandbox would be reviewed either by the Fedora Engineering Steering Committee (FESCo) or a designated working group.
Maintainers would have to provide quarterly updates on the sandbox project and any packages or features that go with it would need to be clearly labeled as "Fedora Sandbox" components. The maintainers would also be responsible for active participation in some dedicated communication channel, such as a mailing list or Matrix room, to get feedback from the larger community.
When the maintainers feel that the project is ready to exit the sandbox, the team would submit a request to FESCo to decide its fate. The committee could allow the project to fully graduate and move to the integration stage, extend its time in the sandbox if the graduation criteria have not been fully met, or decide that the project should be wrapped up and retired altogether.
FESCo could also decide to send the project on to curation if the project has met its stated exit criteria, but the committee feels there are other deficiencies that need addressing or refinements that need to be made. For example, FESCo might find that a project met its initial release criteria but decide that it does not yet have sufficient community interest and engagement. In that case, it would be sent to curation and have a fixed amount of time to drum up sufficient community interest to satisfy FESCo that it should graduate. Curation is meant to have a firm exit date; 12 months is the current suggestion. FESCo would conduct a mandatory review when the time is up. Again, FESCo could decide to let the project graduate and attempt to integrate into Fedora, extend its time in curation, or retire it.
If a project makes it to the integration stage, then it would need to submit a "symbolic" change proposal according to Fedora's usual process that FESCo could approve. Projects may also be subject to maintenance reviews even after they graduate from the sandbox, and could be kicked back to the curation stage if they are found to be struggling.
Handwavy
The proposal is a bit handwavy; this is not surprising or unreasonable. There is little point in putting a great deal of detail into a proposal at this stage when the community may balk at the very idea. It is something of a rookie mistake to flesh out every detail of a proposal when it has not been established whether there is an appetite for it or not; the finer points can be worked out if the community doesn't shoot down the basic idea. It is a bit surprising, though, that Spaleta hasn't provided any concrete examples of projects or technologies to help illustrate the need for an incubator. Previous Fedora discussions may provide a clue, though.
When Fedora discussed its AI-assisted contributions policy in October 2025, the conversation exposed some unhappiness among Red Hat leadership with Fedora's role as an "innovation engine" for RHEL. There was more than a little pushback from the Fedora community on accepting AI-assisted contributions, but Red Hat and IBM are investing heavily in AI. Mike McGrath, Red Hat vice president of core platforms, complained about objections to doing experiments with AI, including letting AI determine whether a contribution is accepted:
At a bare minimum I'd like to see Fedora get in front of RHEL again with a more aggressive approach to AI. Not just in how we build Fedora but formally opening the doors to AI contributors and data scientist so that Fedora is their first stop shop. I'd hate to see a scenario where Fedora's policies make it a less attractive innovation engine than even CentOS Stream.
In another comment, McGrath said that the point he wanted to make was that it is not possible to improve a technology if it is banned before getting started: "We can't even reasonably see how Fedora would make such a system if its not allowed".
The Fedora Council report on its recent strategy summit pointed to another project that Red Hat has been trying to interest the Fedora community in testing and adopting, but with little traction so far. The Konflux project is an Apache-licensed, continuous integration and delivery (CI/CD) platform for building, testing, and releasing software artifacts—including bootc images and RPMs. Fedora already has, and many contributors are already comfortable with, the venerable Koji build system—but Red Hat has moved to or is moving to using Konflux internally for its own products. Koji is getting a bit long in the tooth, and Red Hat probably has little use for it outside of Fedora at this point. It is not entirely unreasonable that the company might want to accelerate Fedora's move to Konflux, but volunteer contributors may not see any direct benefit in learning yet another system.
The thinking within Red Hat may be that a sandbox process with more lightweight approvals would help with some of the corporate-led initiatives that have few supporters in the larger Fedora community. One thing that is not entirely clear in the sandbox proposal is where community feedback comes into the picture: Fedora's change process requires an announcement and opportunity for the larger community to provide feedback. It is unclear whether Spaleta envisions such an announcement and community engagement for experiments proposed for the incubator.
Spaleta noted that the proposal "had a genesis" in the Red Hat Enterprise Linux (RHEL) 11 planning discussion he had sat in on as FPL. Fedora's role in RHEL development, and corporate dissatisfaction with it, may well have been on the agenda. Going directly to FESCo, without the more stringent requirements of a change proposal, could allow Red Hat to conduct some experiments more quickly without significant pushback. The change proposal prior to integration would then provide the community with a chance to object even if FESCo, which is typically made up entirely or almost entirely of Red Hat employees, favors a technology. But, at that stage, it may be much harder for the community to block a change if it has been proven to be technically feasible.
Reactions
Spaleta asked for the initial discussion to focus on the high-level structure of the proposal: would it help Fedora in cultivating sustainable ideas and in weeding out unsustainable ones? If there was consensus on the bones of the proposal, then it would make sense to drill down into more of the details and nail down how everything would work.
Julia Bley wondered if the "increased bureaucracy" would turn FESCo into a bottleneck or waste people's time on crafting proposals. Spaleta said that he did not see it as an increase in bureaucracy; technology entering Fedora has to go through FESCo somehow anyway. The proposal also makes it possible for FESCo to delegate the work to a working group. His concern, "having had a lot of quiet conversations", is that there is a bias against experimenting within the Fedora project.
There were several concerns about how a sandbox process would fit with existing Fedora processes and infrastructure. FESCo member Fabio Valentini said that the proposal sounded worthwhile, but asked for details about how it would work. For instance, where would sandboxed experiments host alternative versions of packages that are already in Fedora? He also imagined that a sandboxed project might have alternative kernel packages or ship kernel modules that are not in Fedora's kernel; either of those scenarios would violate Fedora's current packaging guidelines.
Spaleta said that he fully expected projects would break "non legally binding" Fedora policy in the sandbox and curation stages: "For policy violations that survive from sandbox to curated, we have to have a reasonable path forward towards resolution in time for integration". How a project could solve those problems during the curation period would be part of the sandbox exit review discussion. Projects would seem to be expected to comply with any Fedora policies around licensing or that prohibit shipping certain technologies for legal reasons, but other policy would be, effectively, optional.
Fedora Infrastructure lead and FESCo member Kevin Fenzi also worried that it might be a burden for FESCo, but said it might be acceptable depending on how many projects were active at one time. He also asked how the infrastructure for the sandboxed projects would be handled. He did not want to throw projects to the wolves and tell teams to build things on their own dime, but he also did not want to provide fully staffed and resourced infrastructure; he suggested a middle ground that used AWS or the Fedora community's OpenShift instance, Communishift. Spaleta said that any requirements would be part of a sandbox proposal; the infrastructure team could say no if it didn't have the capacity. "I'm also hoping that if we do this correctly, the infra team is able to mentor a group through the curation stage if there are special infrastructure needs to get to fully integrated."
Fenzi had also asked if Spaleta had any examples of good candidates for the process. That was a chicken-and-egg problem, Spaleta said. "The best question that could be asked right now is what would have gone better in the past if this were a process available?" He seemed reluctant to provide any actual examples of current or past projects that needed a sandbox process.
Michel Lind felt that discussions about adopting Konflux would have been easier with the sandbox framework in place. Spaleta said that experience was reasonable to think through in the context of the sandbox proposal:
My view is, how Konflux has been introduced is effectively a very loosely defined sandbox, with a false start or two. It's on a path towards integration now because it's started to get some traction from contributors interested in using the technology. But there is less clarity on the expectations on what a sustainable konflux implementation for Fedora looks like, and as a result there is more anxiety about it than there really needs to be. A lifecycle process with clear stage exit expectations and review points may have helped get clarity around what is expected from a sustainable Konflux integration.
Justify your existence
Scott McCarty, who is a product manager for Red Hat, said that a Fedora sandbox would have been helpful when teams at the company were trying to justify resources to develop the Podman project. Fedora's adoption of Podman was seen as a "rubber stamp", and not a validation of the market interest in the project. "Real validation came from Ubuntu and Debian picking up Podman independently". He reasoned that a sandbox process might change that for future projects and make Fedora's stamp of approval carry more weight.
He also wrote a blog post that went into more detail about Podman's history and the need for Fedora to become a place for projects to incubate. The post provides more detail about the need to justify Podman's existence and resources inside Red Hat, and how the sandbox might have helped in that regard:
It's not like there was ever some big dramatic moment where an executive tried to kill Podman, because that's not really how these things play out inside a big company, or anywhere else for that matter. What you get instead is this constant, grinding pressure to justify expansion of the team every single planning cycle. You start with a handful of engineers writing code, and that's fine as a proof of concept, but to become a real project you need QE [quality engineering], you need documentation writers, you need dedicated engineering resources, you need someone thinking about the user experience, and every one of those resources has to be justified against other priorities over and over and over again.
I think it's important for people in the Fedora community to understand this dynamic, because it applies to pretty much every new project, whether it's backed by a company or not. Large companies are for-profit entities with budgets and quarterly planning cycles, and they don't fund things indefinitely out of goodwill. Open source community traction is one of the most important metrics that companies use to decide whether to keep investing, because community adoption is a leading indicator of market demand.
McCarty made the case that a sandbox would be useful to help differentiate Fedora from other Linux distributions. Distributions today "compete on stability, package count, release cadence, or desktop experience", all of which are important but are effectively commodities. What would be harder to replicate, he said, is a culture and process that nurtures projects, welcomes experimental work, and provides it with visibility. He also acknowledged the concerns about adding too much bureaucracy and stressing Fedora's infrastructure, but argued that the core idea of the sandbox is sound.
Somewhat ironically, Spaleta's proposal for a structured process to incubate technologies with time-boxed stages does not, itself, have a specific timeline for feedback or next steps. It has attracted modest interest and no strong opposition so far; presumably, at some point, he will gather up the feedback and revise the proposal for additional feedback or move it along to the Fedora Council.
A safer kmalloc() for 7.0
A pull request that touches over 8,000 files, changing over 20,000 lines of code in the process, is (fortunately) not something that happens every day. It did happen at the end of the 7.0 merge window, though, when Linus Torvalds merged an extensive set of changes by Kees Cook to the venerable kmalloc() API (and its users). As a result of that work, the kernel has a new set of type-safe memory-allocation functions, with a last-minute bonus change to make the API a little easier to use.
Classic kmalloc()
kmalloc() is a general-purpose interface to the slab allocator; its purpose is to allocate small (generally sub-page) chunks of memory for use within the kernel. The first kernel release to contain kmalloc() was 0.98.4 from November 1992, though a similar function existed under the malloc() name since the 0.11 release at the end of 1991. The 0.98.4 version had a reasonably familiar prototype:
void *kmalloc(unsigned int len, int priority);
The len parameter specifies how much memory is needed, while priority describes how the memory should be allocated; in 0.98.4 it could be one of GFP_BUFFER, GFP_ATOMIC, GFP_USER, or GFP_KERNEL. The return value, if all goes well, is a pointer to the newly allocated chunk of memory.
In current kernels, instead, that prototype is:
void *kmalloc(size_t size, gfp_t gfp);
The types of the arguments have shifted slightly, and the (now) gfp argument is an explicit bitmask, but otherwise the interface is essentially unchanged, more than three decades later.
The kmalloc() API clearly works, but it is also a 20th century C interface. Its return value is untyped, and there is nothing ensuring that the size of the allocated chunk of memory is correct. That leaves developers exposed to classic errors like this:
struct foo *ptr;
ptr = kmalloc(sizeof(ptr), GFP_KERNEL); /* Don't do this */
The code is valid C, but it will successfully allocate a block of memory that is almost certainly too small (the size of a pointer rather than that of the pointed-to type). Once developers start allocating arrays of objects, the number of opportunities for mistakes that the compiler cannot detect grows even further; allocations of objects with flexible array members are more error-prone yet. Unsurprisingly, such mistakes have been the source of numerous kernel bugs during the history of the project.
Safer memory allocation
Efforts have been made over the years to improve the safety of kmalloc(), with some success. Furthering that work, the 7.0 kernel release will include a relatively large change that is intended to make a lot of typical errors impossible. The new functions (more precisely, macros) were added by this commit; they are:
ptr = kmalloc_obj(*ptr, gfp);
ptr = kmalloc_objs(*ptr, count, gfp);
ptr = kzalloc_obj(*ptr, gfp);
ptr = kzalloc_objs(*ptr, count, gfp);
The simplest form is functionally identical to a basic kmalloc() call, but there is no need to use sizeof(), and the type of the return value from the macro will be a pointer type matching the first parameter. So if a developer were to type:
ptr = kmalloc_obj(ptr, GFP_KERNEL); /* Argument should be "*ptr" */
then the compiler will object, complaining that the return value of kmalloc_obj() does not match the type of the pointer to which that value is being assigned. This version, in other words, has a level of type safety that kmalloc() has always lacked.
Arrays of objects can be allocated with kmalloc_objs(), eliminating the need for the sort of arithmetic that has proved surprisingly difficult to get right over time. The kzalloc_ versions will zero the allocated memory before returning it.
Structures with flexible array members can be another source of allocation mistakes; developers will often get the calculation of the total structure size wrong. There is a new allocator (added in this commit) that is designed to eliminate those mistakes:
ptr = kmalloc_flex(*ptr, flex_member, count, gfp);
Here, *ptr is a structure with a flexible array member, the name of which is flex_member. The number of elements that the flexible array should be sized for is given by count. The returned object will be sized correctly to hold the requested number of elements in its flexible array. As an added bonus, if the structure is defined using __counted_by() to indicate which member holds the size of the flexible array, that field will be automatically initialized during the allocation — at least, on compilers that fully support __counted_by().
With these new allocation functions in place, the stage was set to convert much of the existing kernel code base over to their use. That was the purpose of the massive patch mentioned in the introduction, as well as a number of followup patches cleaning up the harder cases.
Implicit GFP_KERNEL
The large-scale patching was not quite done yet, though. As he pulled in Cook's changes, Torvalds observed that almost all of the allocation calls use GFP_KERNEL; that result is unsurprising, since the more restrictive allocation options are only used when they are truly necessary. He wondered if, by way of some macro magic, the gfp argument could be made optional, with a default of GFP_KERNEL when it is not supplied. About nine hours later, he reported that he had implemented and applied that change. As he pointed out, there was no better time to thrash that much code: "those lines are all being modified anyway, so any merge conflict pain is not going to be made worse if I tweak the end result a bit more".
That work resulted in another massive commit removing the unneeded GFP_KERNEL argument from kmalloc_obj() calls, and a smaller one for kmalloc_flex(). He declared victory before fixing every single call, but stated that he was happy with the results: "The code really does look better". After more than three decades, the kernel's core allocation mechanism for small objects finally looks a bit different, and is hopefully less susceptible to silly mistakes. So developers will have something to celebrate, even as they grumble about having to fix the countless merge conflicts this change has surely created.
BPF comes to io_uring at last
The kernel's asynchronous io_uring interface maintains two shared ring buffers: a submission queue for sending requests to the kernel, and a completion queue containing the results of those requests. Even with shared memory removing much of the overhead of communicating with user space, there is still some overhead whenever the kernel must switch to user space to give it the opportunity to process completions and queue up any subsequent work items. A patch set from Pavel Begunkov minimizes this overhead by letting programmers extend the io_uring event loop with a BPF program that can enqueue additional work in response to completion events. The patch set has been in development for a long time, but has finally been accepted.
To use io_uring, the programmer sets up appropriate shared buffers with io_uring_setup() and mmap() before putting a number of io_uring_sqe (submission queue entry) structures in the submission queue. The kernel can be notified of the presence of new entries to process in two ways: by setting up a dedicated kernel thread to poll the queue, or by having user space call io_uring_enter() periodically.
When user space calls io_uring_enter(), the kernel first dispatches all of the items in the submission queue. After that, it can wait for a certain number of events to complete, wait for a timeout, or return to user space immediately, depending on which flags the system call was invoked with. Over time, the interface has been extended with ways to chain a series of operations together, such that one operation can depend on the outcome of another without requiring user space to act as an intermediary. For example, io_uring can be used to read a file and then send the contents over a socket asynchronously, without copying the data back to user space or performing a context switch. With the door opened to encoding more complex sequences of operations, people naturally wanted to handle cases just slightly more complex than a simple linear chain of operations.
This is where BPF comes in. Begunkov's patch set lets users associate a BPF struct ops program with io_uring queues; when user space calls io_uring_enter() on one of those queues, the BPF program will run instead of io_uring's normal event loop. The program can use the new bpf_io_uring_submit_sqes() kfunc to instruct the kernel to process entries from the submission queue, and the bpf_io_uring_get_region() kfunc to obtain access to the submission or completion queue in order to manipulate their contents.
The program then returns IOU_LOOP_CONTINUE to indicate that io_uring should call the program again after a configurable delay (or a set number of completions), or IOU_LOOP_STOP to return to user space. The BPF program may ask the kernel to loop as many times as it wishes, so it is theoretically possible to write a program that sets up io_uring, registers a BPF program, calls io_uring_enter() and then never returns to user space at all. This effectively bypasses BPF's limit on the number of operations that can be performed in a single program execution by calling the BPF program in a loop; if an application is already structured around an asynchronous event loop, it may be tempting to put more and more functionality into the BPF component. The BPF program does have to be tolerant of both spurious wakeups and potential cancellation by the kernel; if the task is killed, for example, the kernel will stop calling the BPF program and clean up the io_uring queues as normal.
That temptation to put more of the program into BPF is one reason that kernel developers were skeptical of Begunkov's approach when it resurfaced in November 2025. BPF makes it possible to implement complex operations in kernel space — but it can hardly be said to be as easy as writing normal user-space software. Programs will probably need to communicate between their in-kernel and user-space components anyway, but Begunkov's approach would have them doing so via ad-hoc interfaces rather than the existing io_uring interface.
Begunkov sees avoiding extraneous system calls as one of several uses for his patch set. He also suggested that BPF could become a transitional path for deprecating existing io_uring APIs. There are a number of organic extensions to the io_uring API, such as IOSQE_IO_DRAIN, that could be emulated in BPF, taking that logic out of the core kernel. He also thought that, as with the extensible scheduler class, introducing BPF would allow for experimentation with smarter polling algorithms before they're introduced to the kernel.
Jens Axboe, io_uring's creator and maintainer, planned to merge Begunkov's patch set during the 7.1 merge window. Caleb Mateos thought that the changes would not be as useful without kfuncs for interacting with io_uring registered buffers — additional buffers shared between the kernel and user space that can be referenced by io_uring operations. Registered buffers can be more efficient because they only need to be faulted in and pinned once, and can then be reused by subsequent operations.

Mateos referenced a patch set from Ming Lei, first seen in November (with an updated version in January). Lei's patch set is an alternate approach to integrating io_uring and BPF, which includes kfuncs for interacting with registered buffers along with an alternate set of attachment points for hooking into the io_uring subsystem. Lei's patches would not let users completely customize the behavior of io_uring_enter(); instead, users would be able to register BPF programs that could be invoked with a new IORING_OP_BPF io_uring operation. The approach is less flexible than Begunkov's (which could be used to emulate something similar, since it allows the BPF program to inspect and modify requests before submitting them to the kernel for processing), but is probably easier to use for targeted changes. Lei's approach is arguably more natural for allowing the deprecation of existing io_uring commands, since it can be used to replace specific operations with a BPF implementation.
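Under Lei's model, user space would invoke a registered BPF program as an ordinary queue entry. The fragment below is a sketch of what that might look like; IORING_OP_BPF is named in the patch set, but the SQE field used to select the program (and its registration mechanism) are assumptions here:

```c
/* Sketch: invoking a registered BPF program as an io_uring operation
 * under Lei's proposed IORING_OP_BPF.  The opcode name is from the
 * patch set; using sqe->off to select the program is an assumption,
 * as the interface may change before merging. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_nop(sqe);            /* zero-initialize the SQE */
sqe->opcode = IORING_OP_BPF;       /* run a registered BPF program */
sqe->off    = prog_index;          /* which program (field assumed) */

io_uring_submit(&ring);            /* program runs as a queued op */
```

The contrast with Begunkov's design is visible here: the BPF program is one operation among many in the queue, rather than a replacement for the event loop that processes the queue.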
Neither patch set has seen as much discussion as might be warranted for a major change to io_uring. Mateos thought that the additional kfuncs were largely orthogonal to Begunkov's work — the kfuncs would be useful for BPF programs running in the context of io_uring, regardless of how those programs are triggered. Axboe agreed, deciding to apply Begunkov's patch set on March 17. Lei's work will have to be rebased, but Axboe seemed generally inclined to accept it as well. Either way, configurable BPF is coming to io_uring.
Page editor: Joe Brockmeier
Inside this week's LWN.net Weekly Edition
- Briefs: AppArmor vulnerabilities; snapd vulnerability; Sashiko; DPL election; Fedora Asahi 43; GIMP 3.2; Marknote 1.5; Quotes; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.
