Leading items
Welcome to the LWN.net Weekly Edition for June 6, 2024
This edition contains the following feature content:
- Rethinking the PostgreSQL CommitFest model: CommitFests are a fundamental part of how the PostgreSQL community creates its releases, but that model is (yet again) showing signs of strain.
- Debian's /tmpest in a teapot: moving Debian's /tmp over to tmpfs creates some noise.
- One more pidfdfs surprise: another example of how the user-space ABI is not always what developers think it is.
- Yet more LSFMM+BPF coverage:
- New APIs for filesystems: a discussion on new APIs needed for filesystems, particularly newer filesystems that have subvolumes and snapshots.
- Handling the NFS change attribute: file timestamps do not have the granularity needed for NFS-client-cache-invalidation purposes; the session was yet another discussion on ways to fix that problem.
- Removing GFP_NOFS: the GFP_NOFS flag should be replaced by the scoped-allocation API, but that conversion has not made all that much progress; what can be done to change that?
- Measuring and improving buffered I/O: a "pathological" test result showed buffered I/O performance being far behind that of direct I/O; the underlying problems and possible solutions were discussed.
- Standardizing the BPF ISA: The IETF BPF working group is nearly done standardizing a BPF ISA specification.
- An instruction-level BPF memory model: BPF doesn't have a memory model yet; what properties are important for whichever one it adopts?
- Comparing BPF performance between implementations: there is a benchmark suite which runs on both Windows and Linux that can be used to make comparisons.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Rethinking the PostgreSQL CommitFest model
Many years ago, the PostgreSQL project started holding regular CommitFests to help tackle the work of reviewing and committing patches in a more organized fashion. That has served the project well, but some in the project are concerned that CommitFests are no longer meeting the needs of PostgreSQL or its contributors. A lengthy discussion on the pgsql-hackers mailing list turned up a number of complaints and a few suggestions for improvement, but little consensus or momentum toward a solution.
The CommitFest concept got its start in 2008, after a planned six-month development cycle took more than a year to complete. Dave Page detailed the plan in early 2008 to the mailing list, noting that it had been discussed previously and had received general approval. It would replace the traditional feature freeze with "a series of 'commit fests' throughout the cycle".
Whenever a commit fest is in progress, the focus will shift from development to review, feedback and commit of patches. Each fest will continue until all patches in the queue have either been committed to the CVS repository, returned to the author for additional work, or rejected outright, and until that has happened, no new patches will be considered.
Since then, the project has adopted the name CommitFest for the process of prioritizing review, feedback, and committing, along with a web-based tracking tool (the CommitFest application, or CFA) written in Django with a (naturally) PostgreSQL database backend. CommitFests are held several times a year, usually in odd-numbered months (five times a year since 2018), and each is managed by a CommitFest Manager.
The concept of not considering new patches until all of the CommitFest patches have been dispensed with has not survived the test of time. Patches do pile up, and are automatically moved to the next CommitFest if not committed, rejected, or otherwise removed. The first CommitFest using the CFA was held in December 2014 through January 2015. It saw 79 submissions: 18 were committed, 21 were rejected, and 40 moved to the next CommitFest. Since that time, the possible states of a patch submission have been expanded to include "waiting on author", "ready for committer", "returned with feedback", and "withdrawn". The March 2024 CommitFest had a total of 331 submissions, with 132 committed, 7 returned with feedback, 13 withdrawn, and only 6 rejected. The majority of patches, 173, were moved to the next CommitFest. The upcoming CommitFest currently has 308 submissions.
No longer fit for purpose
On May 16, Robert Haas wrote that he believed CommitFests were no longer serving their purpose:
The original intent of CommitFests, and of commitfest.postgresql.org by extension, was to provide a place where patches could be registered to indicate that they needed to be reviewed, thus enabling patch authors and patch reviewers to find each other in a reasonably efficient way. I don't think it's working any more.
Haas complained that few patches submitted for CommitFests were actually ready for review. They were, he elaborated, in a variety of states that were unsuitable for review, including patches "we've said we don't want but the author thinks we do" and "patches that have long-unresolved difficulties which the author doesn't know how to solve or is in no hurry to solve". The chances that reviewers would find things to review that were the most in need of their attention "are pretty much nil". The CommitFest application had turned into a patch tracker, he said, and patch trackers "intrinsically tend to suck" because they are full of garbage nobody cares about and nobody wants to do the work to maintain them. "But our patch tracker sucks MORE, because it's not even intended to BE a general-purpose patch tracker."
There were three factors that have led to this state of affairs, he said. The first was that "we got tired of making people mad" by rejecting patches submitted to CommitFests even if they had no chance of being committed this time, or ever. The second problem, he said, was the addition of continuous integration (CI), which adds value for developers who register patches for a CommitFest even when they are unlikely to be committed. All of that culminated in the third problem, which is that the list "is now so long and messy that even the Commitfest manager probably [doesn't] manage to go through the whole thing thoroughly in a month".
He acknowledged that there was no easy answer, because keeping a large group of people organized is hard enough—especially when "nobody's really the boss"—but he wondered what ideas others might have to improve the situation. "I also feel like what we're doing right now can't possibly be the best that we can do".
Discussion
David G. Johnston suggested reusing the same CFA software but creating a new instance called "Collaboration" as a patch tracker, or just adding a new slot to the existing CFA called work in process (WIP). Then work could be pulled into a future CommitFest when ready. It would cause less angst, he noted, to move work to a WIP status than to drop it outright. The project would still have a problem with "too many WIP patches and not enough ability to resolve them", but putting a higher bar on getting into a CommitFest while having a place to collaborate would be a step forward.
Even though patch trackers "tend to suck", Jacob Champion said, the project should just "embrace the need for a tracker and make it more helpful". Magnus Hagander proposed a similar idea: a parking lot for inactive patches that aren't ready for a CommitFest but that "you also don't want to be dead". Tom Lane liked the idea of a "not-the-CF" application with CI runs and said that there appeared to be consensus around the parking-lot idea. However, that was not the only problem, and he urged people to keep thinking about what could be improved.
Daniel Gustafsson wanted clarification on what was broken about CommitFests. Was it the application, the process and workflow, or the assumption that the process and workflow were followed, "which may or may not be backed by evidence"? His experience as a CommitFest Manager led him to believe that the problem was that the process and workflow were not being followed, which would be hard to fix just by applying better software.
"Some of us are abusing the process a bit
", wrote
Lane. He admitted that he was guilty of sticking patches into the
CommitFest application, "because it's a near-zero-friction way to
run CI on them, and I'm too lazy to learn how to do that
otherwise
". Those patches, however, would spend little time in the
application and were not really the problem. The stuff that hangs
around seems to be the problem, he said—and that is
caused by a fear of rejecting patches, which causes people to punt them to
the next CommitFest.
Greg Stark disagreed with the idea that lack of rejection was the problem. He said he had tried to find patches with negative reviews to reject, but did not find many of that type. Instead, he found patches with generally positive reviews, but complex situations, and patches that weren't getting any reviews at all.
Joe Conway asked if the project should have a policy that no patches are automatically moved forward from one CommitFest to another, so that "the author needs to care enough to register for the next one". Champion did not like that idea. He said that it would disadvantage those who do not work on PostgreSQL as their day job; "not having time to shepherd a patch is not the same as not caring". Conway said that the word "care" was a poor choice, but he still wanted to force authors to consider whether they had time to shepherd the patch for the next round.
Peter Eisentraut wrote that he thought having to resubmit would be effective, but it would double the annoyance for contributors whose patches were not being reviewed in the first place. Tomas Vondra agreed, and said that having to resubmit patches "just to have a chance someone might take a look" would make him question whether he wanted to submit patches at all.
Eisentraut proposed a new state for patches called "Unclear". The CommitFest Manager could set any patch that was doubtful to "Unclear", so reviewers could focus on patches that definitely need review. "I think, if we consider the core mission of the commitfest app, we need to be more protective of the Needs Review state." Lane agreed that would make sense, and Gustafsson wrote that being protective of the Needs Review state was a good summary of what the project needed to focus on in improving CommitFests.
Great ideas, but...
The discussion produced many insights into why CommitFests are less effective, and a number of interesting solutions. The question is, what's next? Dmitry Dolgov observed that the thread had produced many good takes, but he was concerned about next steps. If memory serves, he said, similar discussions crop up every couple of years but "not many things have changed due to those discussions". How could the project make sure this time was different, he asked.
Lane argued that "if you take the long view" things have changed for the better. Part of the point of CommitFests, he said, was to help people learn to become good committers. Lane noted that PostgreSQL had around ten active committers through 2012, but it had crept up to about two dozen today. "My point here is not that things are great, but just that we are indeed improving, and I hope we can continue to. Let's not be defeatist about it." The fundamental problem has not changed, though: "too many patch submissions, too few committers". Lane suggested a divide-and-conquer strategy, having senior committers split up the list of patches and see what it would take to move the patches along. That might not be possible before each CommitFest, he said, but the project could try to make that happen once per year.
Haas liked the idea of a yearly review, but countered that the project had not significantly improved the CommitFests for many years. He said it was not possible, or necessary, to solve all the problems overnight, but the project needed to admit there is a problem that is "an existential threat to the project". He was surprised that the number of committers had grown substantially, but doubted that the new-patch-submitter experience had improved. "On the plus side, I think we make more of an effort not to be a jerk to newcomers than we used to."
Vondra asked when the yearly review might take place. He pointed out that the project used to do a "triage" at the FOSDEM PGDay meeting, but it was only an informal effort with a small subset of people, and it would be somewhat late in the development cycle. Lane said that he pictured a few days to a week, but the question would be finding a week when people are available. Conway proposed doing a review at PGConf.eu. Lane said that might work, but pointed out that the throughput of a group of reviewers working together in a meeting is less than that of the same reviewers working independently. However, "the consensus of a meeting is more likely to be taken seriously than a single person's opinion, senior or not", which made the idea attractive. The thread trailed off with Lane agreeing that it would be a good idea to do the work in subgroups of three or four people.
What's next?
As of this writing, nothing concrete has emerged to "fix" the process or application, and the discussion has largely dried up. There seems to be general consensus around the problems with CommitFests and some agreement on how to solve them. The question is whether anyone takes the next step toward implementing solutions.
Debian's /tmpest in a teapot
Debian had a major discussion about mounting /tmp as a RAM-based tmpfs in 2012, but inertia won out in the end. Debian systems have continued to store temporary files on disk by default. Until now. A mere 12 years later, the project will be switching to a RAM-based /tmp in the Debian 13 ("Trixie") release. Additionally, starting with Trixie, the default will be to periodically clean up temporary files in /tmp and /var/tmp automatically.
Join the party
By now, using tmpfs for the /tmp directory is a road well-traveled. Many Linux distributions have already switched to mounting /tmp as a tmpfs. Arch Linux, Fedora, openSUSE Tumbleweed, and many others have all made the switch. Red Hat Enterprise Linux (RHEL) and its clones, as well as SUSE Linux Enterprise Server (SLES) and openSUSE Leap, still default to /tmp on disk. Ubuntu, following Debian, also continues to have a disk-based /tmp rather than using tmpfs.
The knobs to control how /tmp is mounted, and the handling of temporary files, are part of systemd. (On systems where systemd is used, of course.) The upstream defaults for systemd are to mount /tmp as a tmpfs and to delete files in /tmp that have not been read or changed for ten days, as well as those in /var/tmp after 30 days. Debian Developer and systemd contributor Luca Boccassi recently decided it was time to revisit the topic of /tmp as a tmpfs, and to start deleting temporary files in /tmp and /var/tmp.
On May 6 Boccassi resurrected a discussion from Debian's bug tracker that started in 2020 and that had trailed off in July 2022 without resolution. The bug was submitted by Eric Desrochers, who complained that Debian's systemd implementation did not clean /var/tmp by default. Michael Biebl wrote that this was a deliberate choice to match Debian's pre-systemd behavior. After a long period of inactivity, Biebl suggested in October 2021 (and again in July 2022) that Desrochers raise the topic on the Debian devel mailing list. That never happened.
In reviving the topic, Boccassi declared that it was time to bring Debian's defaults in line with upstream and other Linux distributions by making /tmp a tmpfs, and cleaning up /tmp and /var/tmp on a timer. He planned to apply those changes in the next systemd upload to Debian unstable within a week, with instructions in the NEWS file on how to override the changes for any users who wanted to keep the current behavior. That, in turn, sparked quite a discussion.
The case for and against
Biebl worried about the effect of cleaning up temporary files by default. Currently, he said, Debian only cleans /tmp on boot (which is unnecessary if /tmp is a RAM-based tmpfs) and never cleans up files in /var/tmp. Boccassi answered: "Defaults are defaults, they are trivially and fully overridable where needed if needed." Biebl replied that there is value in aligning defaults across Linux distributions, but argued that Debian has "a larger scope" than a desktop distribution like Fedora. He suggested that it was necessary to "gather feedback from all affected parties to make an informed decision". Boccassi responded that it was impossible to have defaults that would make everyone happy, and that he was not looking for hypothetical arguments or philosophical discussions; he wanted facts: "what would break where, and how to fix it?"
Sam Hartman noted that ssh-agent created its socket under /tmp, but it would be better if it respected the $XDG_RUNTIME_DIR setting and created its socket under /run/user. Boccassi agreed and said that he had filed a bug for ssh-agent. Richard Lewis pointed out that tmux stores its sockets in /tmp/tmux-$UID, and deleting those files might mean users could not reattach to a tmux session that had been idle a long time. Boccassi suggested that using flock() would be the right solution to stop the deletion, and said he had filed a bug on that as well.
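Boccassi's suggestion relies on the fact that systemd-tmpfiles skips entries that hold a BSD file lock when it does its age-based cleanup. A minimal sketch of what a long-running program like tmux could do is shown below; the path is illustrative (tmux uses /tmp/tmux-$UID), and error handling is trimmed:

    /* Sketch: protect a directory under /tmp from tmpfiles aging by
       holding a BSD file lock on it for the life of the process. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <unistd.h>

    int main(void)
    {
        /* illustrative path; a real program would build /tmp/tmux-$UID */
        int fd = open("/tmp/tmux-1000", O_RDONLY | O_DIRECTORY | O_CLOEXEC);

        if (fd < 0 || flock(fd, LOCK_SH | LOCK_NB) < 0) {
            perror("lock");
            return 1;
        }
        /* the lock is held as long as the descriptor stays open, so
           systemd-tmpfiles will skip this directory when aging files */
        pause();
        return 0;
    }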
Lewis questioned whether files downloaded with apt source might be considered "old" when they were extracted under /tmp and immediately flagged to be deleted. Russ Allbery said that systemd-tmpfiles would respect the access timestamp (atime) and changed timestamp (ctime), not just the modified timestamp (mtime), "so I think this would only be a problem on file systems that didn't support those attributes". Allbery said he believed tmpfs supports all three.
Some users store information in /var/tmp for long-running jobs, because they want it preserved without being backed up, said Barak A. Pearlmutter. Boccassi dismissed that, saying that those users, "assuming they actually exist", can customize the system defaults according to their needs. Those users do exist, wrote Jonathan Dowland: "[I've] been one of them, I've worked with many of them, it's an incredibly common pattern in academic computing". Making a change to how these files are handled "should be a very carefully explored decision". He also asked for arguments in favor of the change, and for it to be held off to allow the discussion to continue.
Hartman wished that it was possible to specify that directories under /var/tmp should be deleted entirely or left alone. Boccassi said that was a reasonable enhancement request, and that it had already been filed upstream for systemd.
Allbery eventually admitted surprise at the number of people who reported using /var/tmp "for things that are clearly not temporary files in the traditional UNIX sense". Periodically deleting files in /var/tmp, he said, has been common UNIX practice for at least 30 years:
Whatever we do with /var/tmp retention, I beg people to stop using /var/tmp for data you're keeping for longer than a few days and care about losing. That's not what it's for, and you *will* be bitten by this someday, somewhere, because even with existing Debian configuration many people run tmpreaper or similar programs. If you are running a long-running task that produces data that you care about, make a directory for it to use, whether in your home directory, /opt, /srv, whatever.
Hakan Bayındır said that he did not use /var/tmp that way, but applications that other people use do; he cited a high-end scientific application that was not under his or other users' control. Allbery argued that this was "not a good design decision", but conceded that may not be helpful if a user or admin does not have control over the application's design. However, he said, it was "playing with fire" to use /var/tmp that way, "and it's going to result in someone getting their data deleted at some point, regardless of what Debian does".
Lewis said he still did not understand the rationale for the change. Was there, he asked, "perhaps something more compelling than 'other distributions and upstream already do this'?" That sounded like the original rationale, Allbery said, but he added that moving /tmp to tmpfs should make applications run faster, and reaping files under /var/tmp could help to avoid filling up a partition. Allowing the partition to fill up, he noted, could cause bounced mail, unstable services, and other problems. It may not be a problem for desktop systems, which tend to have enough disk space not to be affected, but for virtual machines that have /var/tmp contained within a small root partition, it is still a concern.
Rolling out changes
In the end, none of the arguments for maintaining Debian's status quo managed to persuade Boccassi to stay his hand. On May 28, Boccassi announced the changes that he had made and uploaded to unstable. As expected, /tmp will become a tmpfs by default, for both new and existing installations of Debian unstable. New installs will use the systemd default behavior. The openssh and tmux packages, he said, had been fixed to provide exceptions to retain their temporary files. A description has been added to the Debian installer to inform users that /tmp is a tmpfs by default, and the changes have been noted in the NEWS file. He also offered to review and merge any changes to the Debian installer that would let users customize those options during installation. Users who want /tmp to remain on disk can override the upstream default with systemctl mask tmp.mount. To stop periodic cleanups of /tmp and /var/tmp users can run touch /etc/tmpfiles.d/tmp.conf.
None of the changes, it should be noted, will affect Debian versions prior to Trixie. Users of Debian stable and oldstable will only encounter these changes on upgrade.
As noted, many distributions have already made these changes without catastrophe. Debian, and its users, should be able to adapt to the new defaults or override them if they are unsuitable for the workloads to be run. At worst, it promises to be a temporary inconvenience.
One more pidfdfs surprise
The "pidfdfs" virtual filesystem was added to the 6.9 kernel release as a way to export better information about running processes to user space. It replaced a previous implementation in a way that was, on its surface, fully compatible while adding a number of new capabilities. This transition, which was intended to be entirely invisible to existing applications, already ran into trouble in March, when a misunderstanding with SELinux caused systems with pidfdfs to fail to boot properly. That problem was quickly fixed, but it turns out that there was one more surprise in store, showing just how hard ABI compatibility can be at times.A pidfd is a file descriptor that identifies a running process. Within the kernel, it must have all of the data structures that normally go along with file descriptors so that kernel subsystems know what to do with it. The kernel has, since the 2.6.22 release in 2007, had a small helper mechanism providing anonymous inodes to back up file descriptors on virtual filesystems that do not have a real file behind them. When the pidfd abstraction was added to the 5.1 kernel, it was naturally implemented using anonymous inodes, and all worked as intended.
Eventually, though, the limitations of this implementation began to make themselves felt. In response, Christian Brauner reworked the implementation away from anonymous inodes, creating the separate pidfdfs filesystem. The new filesystem supported the use of statx() on a pidfd; that, in turn, made real inode numbers available to user space that could be used to compare two pidfds for equality. Added functionality (such as killing a process when the last pidfd referring to it is closed) became possible. The new implementation also gave security modules a say in pidfd operations; this was the source of the first set of problems but, in the longer term, will help administrators with the control of their system.
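As a sketch of the comparison that real inode numbers enable, consider a user-space program that opens two pidfds and compares them with statx(); this assumes a 6.9 or later kernel and omits error handling for brevity:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /* two pidfds referring to the same process (this one) */
        int fd1 = syscall(SYS_pidfd_open, getpid(), 0);
        int fd2 = syscall(SYS_pidfd_open, getpid(), 0);
        struct statx sx1, sx2;

        /* AT_EMPTY_PATH makes statx() operate on the descriptor itself */
        statx(fd1, "", AT_EMPTY_PATH, STATX_INO, &sx1);
        statx(fd2, "", AT_EMPTY_PATH, STATX_INO, &sx2);

        printf("same process: %s\n",
               sx1.stx_ino == sx2.stx_ino ? "yes" : "no");
        return 0;
    }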
Once the security-module problem was worked out, it seemed like the pidfdfs problems had been taken care of. But then, Jiri Slaby reported that kernels with pidfdfs broke both the util-linux test suite and the lsof utility. Brauner answered that the util-linux problem had been fixed upstream, but that lsof was a surprise. It turns out that there were two problems in need of solving, one rather more predictable than the other.
In pre-6.9 systems, "ls -l" or lstat() would show a pidfd as a symbolic link to the string "anon_inode:[pidfd]". As of 6.9, instead, the result would be "pidfd:[inode]", showing the inode number assigned to the pidfd. Since lsof was looking for the pre-6.9 version of the string, it failed to recognize or do the right thing with pidfds.
But, it turns out, there is more. When the anonymous-inode code was added, the author never bothered (or simply forgot) to set the file-type field in each inode as it was created. As a result, system calls like stat() will report the file type as zero, which is not actually a defined file type. That would cause command-line tools like stat to describe the result as a "weird file", which was objectively true. This little quirk never created trouble for any tools that actually worked with files backed by anonymous inodes, so it was never fixed.
It was noticed, though. Somebody working on lsof cleverly realized that a file type of zero was a convenient way to recognize anonymous-inode files. So lsof acquired a test for that condition, taking advantage of an ABI quirk that was never intended, much less documented. Once a stat() call on a pidfd started returning a proper file type, lsof no longer recognized the file and got hopelessly confused. Linus Torvalds was unimpressed when all of this became clear:
What a crock. That's horrible, and we apparently never noticed how broken anon_inodes were because nobody really cared. But then lsof seems to have done the *opposite* and just said (for unfathomable reasons) "this can't be a normal regular file".
That said, he also allowed that "we probably just have to live in the bed we made"; breaking lsof was a user-space regression in need of fixing.
Brauner put together a patch that was merged prior to the 6.10-rc1 release; it has not yet found its way into the 6.9 stable updates. The patch restored the older output format for pidfds and caused the file-type field to be explicitly masked to zero, restoring the previous behavior. With that fix, lsof works again, and people are mostly happy.
When he sent the patch, though, Brauner said that he "would like to try to move away from the current insanity" in the near future. He hopes that lsof will be fixed to be able to handle the newer output format, and that it will be possible to remove this compatibility hack. Torvalds seems willing to try, but he pointed out that some users (and their distributions) can be quite slow to update their user-space tools, so it may be a long time before this change is no longer needed.
In summary: Hyrum's law has shown its applicability yet again. Leaving zero in the type field was never meant to be a part of the ABI for anonymous inodes; it is just a bug, an artifact of a job that was not completely done. But, since that behavior was visible, code came to depend on it, and the bug can no longer be fixed. This episode is another hint that kernel interfaces could benefit from a higher level of scrutiny than they typically get before showing up in a released kernel.
New APIs for filesystems
A discussion of extensions to the statx() system call comes up frequently at the Linux Storage, Filesystem, Memory Management, and BPF Summit; this year's edition was no exception. Kent Overstreet led the first filesystem-only session at the summit on querying information about filesystems that have subvolumes and snapshots. While it was billed as a discussion on statx() additions, it ranged more widely over new APIs needed for modern filesystems.
Brainstorming
Overstreet began the session with the idea that it would be something of a brainstorming exercise to come up with additions to the filesystem APIs. He had some thoughts on new features, but wanted to hear what other attendees were thinking so that a list of tasks could be gathered. He said that he did not plan to do all of the work on that list himself, but he would help coordinate it.
He has started thinking about per-subvolume disk-accounting for bcachefs, which led him to the need for a way to iterate over subvolumes. He mentioned some previous discussion where Al Viro had an idea for an iterator API that would return a file descriptor for each open subvolume. "That was crazy and cool", Overstreet said; it also fits well with various openat()-style interfaces. He thinks there is a simpler approach, however.
Adding a flags parameter to opendir() would allow creating a flag for iterating over subvolumes and submounts. Subvolumes and mounts have a lot in common, he has noticed recently; user-space developers would like to have ways to work with them, which this would provide.
Extended attributes (xattrs) on files are also in need of an iterator interface of some kind, he said. Those could smoothly fit into the scheme he is proposing. The existing getdents() interface is "nice and clean", he said, so it could be used for xattrs as well.
The stx_subvol field has recently been added to statx() for subvolume identifiers. Another statx() flag will be needed to identify whether a file is in a snapshot; with that, coreutils would be able to filter out snapshots by default, so that someone working through the filesystem does not see the same files over and over again.
Steve French asked a "beginner question" about how to list the snapshots for a given mount in a generic fashion on Linux. Overstreet said that a snapshot is a type of subvolume and that "a subvolume is a fancy directory". This new opendir() interface could be used to iterate over the subvolumes and the new statx() flag could be used to check which are snapshots.
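As an illustration, querying the new field might look like the sketch below; it assumes headers new enough to define STATX_SUBVOL and stx_subvol (the snapshot flag discussed in the session does not exist yet), and it skips error handling:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct statx sx;

        if (argc < 2 || statx(AT_FDCWD, argv[1], 0, STATX_SUBVOL, &sx) < 0)
            return 1;
        if (sx.stx_mask & STATX_SUBVOL)  /* filesystem reported a subvolume ID */
            printf("%s: subvolume %llu\n", argv[1],
                   (unsigned long long)sx.stx_subvol);
        else
            printf("%s: no subvolume information\n", argv[1]);
        return 0;
    }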
All of the information that statfs() returns for a mounted filesystem should also be available for subvolumes, he said, "continuing with the theme that subvolumes and mount points actually have a lot in common". That includes things like disk-space usage and the number of inodes used.
Dave Chinner said that XFS already has a similar interface based on project IDs, where a directory entry that corresponds to a particular project can be passed to statfs() to retrieve the information specific to that project. He said that filesystems could examine the passed-in directory and decide what to report based on that, so no new system call would be needed. Overstreet was skeptical that users who type df in their home directory would expect to only get information for the subvolume it is in, rather than the whole disk, as they do now. He thought a new system call would be the right way to approach it.
French said that other operating systems have a way to simply open a version of a file from a snapshot without actually having to directly work with the entire snapshot subvolume itself. A user can simply open a file from a given snapshot identifier, which is convenient and not really possible on Linux. Overstreet acknowledged the problem, but said that he did not think a new system call was needed to support that use case. Using the new interfaces that are being discussed, user space should be able to handle that functionality, perhaps using read-only mounts of snapshots in such a way that the user does not directly have to work with them.
User-space concerns
But Lennart Poettering said: "as a user-space person, I find it a bit weird" that opendir() is seen as a good interface for this functionality. In many ways, he finds opendir() to be "a terrible API" because it gives you a filename, but then you have to open the file to get more information, which does not necessarily match up because there can be a race between the two operations. He would much prefer to get a file descriptor when enumerating things so that the state cannot change between the two.
There are some other mismatches between opendir() and subvolumes, he continued. Right now, user space expects to get filenames from readdir(), which means they do not contain the slash ("/") character, but subvolume path names do. In addition, the filename returned in the struct dirent can only be 255 characters long, which is too restrictive for subvolume names.
In the end, Poettering thinks that user-space programs do not want to get filenames, they want something that cannot change out from under them. Jeff Layton suggested using file handles instead, which Poettering agreed would be better still. Christian Brauner noted that the listmount() system call uses a new 64-bit mount ID, but there is no way to go from that mount ID to a file descriptor or handle. It would be easy to add, however.
Overstreet said that he plans to add firmlinks, which is an Apple filesystem (APFS) feature that fits in between hard links and symbolic links. It would use a file handle and filesystem universally unique ID (UUID) to identify a particular file. Amir Goldstein said that overlayfs also uses those two IDs to identify its files, so Overstreet thought that perhaps that scheme should become a standard for Linux filesystems.
There are some other missing pieces for file handles, though, he said. There is no system call to go from a file handle to a path. Goldstein said that the ability exists, but it is only reliable for directories. "That's because hard links suck", Overstreet said; Goldstein agreed that was part of it, but Jan Kara said that there are some filesystems that cannot provide that mapping.
It is getting increasingly difficult to guarantee inode-number uniqueness, Overstreet said. Most of the discussion about his proposal for a session at LSFMM+BPF revolved around the problem and its solutions; it has come up at earlier summits, as well. The basic problem is that more-recent filesystems (Btrfs, bcachefs) have lots of trouble ensuring that inode numbers are unique across all of the subvolumes/snapshots in a mounted filesystem, which confuses tools like rsync and tar.
The 64-bit inode space is simply too small to guarantee uniqueness, he said, but there are various schemes that have been used to make things work. He would rather not be "kicking cans down the road" and thinks filesystem developers need to nudge user-space developers to start using file handles for uniqueness detection "sooner, rather than later".
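A sketch of the kind of uniqueness test Overstreet has in mind, using the existing name_to_handle_at() call: two paths refer to the same file if their handles match. A real tool would also compare the filesystem UUIDs, as discussed above; error handling here is minimal:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static struct file_handle *get_handle(const char *path)
    {
        struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
        int mount_id;

        fh->handle_bytes = MAX_HANDLE_SZ;
        if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
            free(fh);       /* filesystem may not support file handles */
            return NULL;
        }
        return fh;
    }

    int main(int argc, char **argv)
    {
        struct file_handle *h1 = get_handle(argv[1]);
        struct file_handle *h2 = get_handle(argv[2]);

        if (!h1 || !h2)
            return 1;
        int same = h1->handle_type == h2->handle_type &&
                   h1->handle_bytes == h2->handle_bytes &&
                   !memcmp(h1->f_handle, h2->f_handle, h1->handle_bytes);
        printf("%s\n", same ? "same file" : "different files");
        return 0;
    }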
Inode zero
He noted a recent "kerfuffle" regarding filesystems that return all inode numbers as zero values, which broke lots of user-space tools. That will become more prevalent over time, so he wondered if it made sense to add a mount option that would intentionally report the same inode number in order to shake out those kinds of problems. Chinner suggested using a sysctl instead, which Overstreet agreed would be a better choice.
Ted Ts'o said that in order to get user space on board with a switch to using file handles, it is important to make it a cross-OS initiative. Lots of maintainers of user-space tools want to ensure that they work on macOS and the BSDs. If it can get to a point where using file handles is "supported by more forward-leaning, POSIX-like filesystems", the chances will be much better for getting enough of user space converted so that it is possible to return zeroes for inode numbers without breaking everything. It will still be a multi-year effort, which means that it is worth taking the time to try to ensure that it can be adopted more widely than just on Linux.
Overstreet asked about support for file handles in the other operating systems; Chinner said that anything that supports NFS must have some form of file-handle support. Ts'o agreed but said that the others may not export file-handle information to user space.
As part of the conversion process, a list of affected programs should be created, Ts'o said. To his "total shock", he found out the hard way that the shared-library loader needs unique inode numbers, because that is how it distinguishes different libraries. Overstreet wanted to hear that story, but it is a long one that might need a more casual setting to relate, Ts'o said.
This problem will only get worse in, say, 20 years when 64 bits is even less able to handle the number of inodes, Overstreet said. If the right tools are provided to user-space developers, they will help find and fix all of the problems. But Poettering cautioned that getting rid of the reliance on the uniqueness of inode numbers is going to be extremely difficult. It is used to ensure that the same resources are not loaded multiple times, for example, so it would be better to provide user-space APIs that directly address that problem.
There was some discussion of various ways to try to add information to the inode number to solve that problem, but there is nothing generalized for all filesystems; it is fair to say there is not any real agreement on how to do that within the filesystem community. Ts'o asked Poettering if file handles, which have more bits to work with, would solve his problems. Poettering said that "file handles are great", but it requires different privileges to query them and they are not implemented on all filesystems, so he still needs a fallback.
For example, he wondered about getting file handles for procfs files, though it was not entirely clear what the answer to that was. Beyond that, he asked if there was a limit on the size of a file handle; Overstreet said it was a string, so there was no limit. There was some mention of using a hash on the file handles to create a fixed-length quantity, but the end of the session was a bit chaotic, with multiple side discussions all going on at once.
Brauner got the last word in, pretty much, when he said that he originally had been scared of adding an option to return zero for all inode numbers. But he sees that it makes sense as a tool for educating user space that inode numbers are not unique. There is still a need to provide user space with some kind of API to determine whether two files are actually the same, but that will have to be worked out later—on the mailing list or perhaps at a future summit.
Handling the NFS change attribute
The saga of the i_version field for inodes, which tracks the occurrence of changes to the data or metadata of a file, continued in a discussion at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. Jeff Layton, who has been doing a lot of the work on changing the semantics and functioning of i_version over the years, led a session in which he updated attendees on the status of the effort since a session at last year's summit. His summary was that things are "pretty much where we started last year", but the discussion this time pointed to some possible ways forward.
Granularity
The problem is that the granularity of the timestamps used by Linux (generally 1-10ms) is not sufficient to actually record all of the changes that can happen to a file. Multiple writes, for example, could all happen within the same change time (ctime) value. This becomes a problem for NFSv2 and NFSv3 clients, which effectively use the ctime value to decide when to invalidate their cached information; two different versions of a file with the same ctime makes for a mess.
NFSv4 added a "change attribute", which is a 64-bit value that is guaranteed to change any time that the ctime would change (effectively), he said; it does not get updated when the access time (atime) changes, because the caches should not be invalidated when files are read. NFSv4.2 added the ability for the server to indicate what kind of change-attribute information it is providing, which may allow clients to make better caching decisions. For example, if it is reported as monotonically increasing, clients can ignore updates with lower values; only change attributes that are higher than the value in the cache are valid at that point.
Most Linux filesystems track the change-attribute information as the i_version of the file's inode. But different filesystems handle the attribute somewhat differently. In particular, XFS has its own attribute that does not follow the same semantics as the others—it is incremented for atime updates. So, if atime updates are turned on, the client caches are invalidated incorrectly; even the relative-atime option can cause some incorrect cache invalidations.
In the past, XFS developers have been reluctant to add space in the on-disk inode for a change attribute that works in the expected way. He spoke with Darrick Wong earlier in the day, though, and got the sense that perhaps that reluctance might be diminishing. Bcachefs still needs to implement support for the attribute, but the space for it in the inode has been reserved, Layton said.
Another problem is that, on a write, the attribute value is typically incremented (and the timestamps are updated) before copying the data to the page cache. A read that comes in between the updates and the copy will associate the wrong state of the file with the data that is read. That problem can then persist for a long time in the client—until the file is updated again.
Moving the updates after the copy still leaves a window for incorrect information on the client, but it should resolve itself quickly. Kent Overstreet asked if the race condition can truly be eliminated. Layton said that moving the updates helps, but does not get rid of the race; clients may have the new data associated with the old attribute value, but they should get the new attribute value soon and invalidate their cache.
The change attributes are not stored on disk immediately, so server crashes can lead to problems where different file states end up with the same attribute values. Amir Goldstein mentioned some patches he is working on that will use sleepable RCU to protect the write operation, so that values can be updated in memory, but will not be written to disk until the full operation has completed. Layton said that he would look to see if the patch set could be used to help with this problem.
The crash-loss problem can be remedied by using the ctime value combined with the change attribute, which means there can only be a problem if there is a crash and a clock rollback on the server, "which is all pretty unlikely". One thing that makes it hard to test these kinds of problems is that the change attribute is not accessible from user space, so he would like to expose it in some fashion.
Multi-grain
Last year, Dave Chinner had an idea for multi-grain timestamps that was implemented and, briefly, merged. It turned out that there was a problem where an operation with a fine-grained timestamp and another with a coarse-grained one could be seen as happening in the wrong order, Layton said. That breaks "some little-known tools like make and rsync", so the change was backed out. He thinks the problem could be fixed by using the fine-grained timestamp as the floor for coarse-grained updates from that point on, but he got the impression that Linus Torvalds and Christian Brauner were tired of him pushing it. It could be resurrected; Brauner pointed out that his objections were only meant to apply to the merge window that was active at the time, so Layton may pick that work back up.
Another alternative would be to use some "extra" low bits in the ctime field for a counter that could be bumped every time there is more than one operation in a single timer tick. The timestamps could be shifted appropriately when they were reported to user space and used in full as change attributes. That would require changing all filesystems, though, so that there were not different granularities of timestamps being reported on a given system, Layton said.
He then went through the order of operations for updating timestamps and i_version. There is no locking done for queries of i_version; that means that as soon as the value is updated, which is currently done before the copy to the page cache, it can be read. Normally, ctime is updated at the same time as i_version, before the write operation; for directories, though, those updates are done after the operation because there is a lock being held.
In truth, i_version is only updated if it has been queried since the last time it was changed, so the increment is often a no-op. One way to handle the race problems, then, might be to increment the value both before and after the operation; the second of those would be a no-op nearly all of the time, so the cost should be minimal. He may experiment with that some.
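In kernel terms, that experiment might look something like the sketch below (not an actual patch), using the existing inode_maybe_inc_iversion() helper, which only performs the increment if the value has been queried since the last change:

    /* bump before the data copy, so readers never see new data
       paired with a change attribute older than the pre-write value */
    inode_maybe_inc_iversion(inode, false);

    /* ... copy the data into the page cache ... */

    /* bump again afterward; if nobody queried i_version during
       the copy, this second increment is a no-op */
    inode_maybe_inc_iversion(inode, false);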
Crash resilience is something that he has not yet done sufficient research on, though it has been identified as a potential problem area, he said. He and Jan Kara had an idea for a crash counter that could be tracked by user space; nfsd has a daemon that already tracks some client information where this could be added. It is kind of a "blue sky" idea that would require quite a bit of work, but it would remove the problem of multiple file states with the same change attribute after a crash. That, in turn, would allow the kernel NFS server to report that its change attribute is monotonically increasing, which is advantageous for NFS clients.
He would like to expose the change attribute via statx() so that user-space programs, such as NFS-Ganesha, could access the value. It will be important to ensure that only filesystems that implement the change attribute with the usual semantics expose it that way, however. That would also allow a feature he has thought about for a long time: a "gated write". The idea would be to fetch the change attribute, then make some changes to the file in memory, and write the file, but only if the change attribute was the same. That would allow synchronizing writes from multiple threads on the same machine, or writes to a network filesystem from multiple machines, without file locking.
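In rough form, a gated write might look like the sketch below. The STATX_CHANGE_ATTR bit and stx_change_attr field are hypothetical, since the change attribute is not exposed via statx() today, and a real implementation would need the kernel (or NFS server) to perform the final compare-and-write atomically rather than in user space:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* hypothetical: write buf at off only if the file is unchanged
       since expected_change_attr was sampled */
    ssize_t gated_write(int fd, const void *buf, size_t len, off_t off,
                        unsigned long long expected_change_attr)
    {
        struct statx sx;

        statx(fd, "", AT_EMPTY_PATH, STATX_CHANGE_ATTR, &sx);
        if (sx.stx_change_attr != expected_change_attr)
            return -1;  /* the file changed underneath us; retry */
        /* the gate: a real API would do this check-and-write atomically */
        return pwrite(fd, buf, len, off);
    }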
When he asked Layton to lead the session, Goldstein had asked for a "roadmap" to be presented, but Layton said it was "more like a wish list". He wants to add support for the change attribute to bcachefs and to figure out what to do for XFS in that regard. He also wants to move the i_version update (and maybe timestamp updates) to after the page-cache copy, or do the double bump that he described. Finally, he wants to figure out what to do about crash resilience.
Brauner asked about an idea for shrinking inodes by changing the storage of the timestamps. Layton said that came from Torvalds, who pointed out that consecutive struct timespec64 entries for ctime, mtime, and atime leave alignment gaps. Switching to separate entries for the seconds and nanoseconds, as he did in a patch posted shortly after the summit, saves eight bytes. There are plans for how to use some of that savings, he said.
There was some final discussion on the roadmap/wish list, with Ted Ts'o noting that there are no real dependencies between the items, so they could all be worked on in parallel. Wong said that there is actually plenty of room in the XFS inode for a few more counters, but he needed clarification on when the change attribute needed to be updated. It seems like the NFSv4 semantics can be supported in a fairly straightforward fashion, so that piece of the puzzle may already be falling into place.
Removing GFP_NOFS
The GFP_NOFS flag is meant for kernel memory allocations that should not cause a call into the filesystems to reclaim memory because there are already locks held that can potentially cause a deadlock. The "scoped allocation" API is a better choice for filesystems to indicate that they are holding a lock, so GFP_NOFS has long been on the chopping block, though progress has been slow. In a filesystem-track session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, Matthew Wilcox wanted to discuss how to move kernel filesystems away from the flag with the eventual goal of removing it completely.
He began the session by saying that there are several changes that people would like with regard to the GFP flags, but that the scoped-allocation API (i.e. memalloc_nofs_save() and memalloc_nofs_restore() as mentioned in the LSFMM+BPF topic discussion) for GFP_NOFS went in long ago, while the conversion to it is far from complete. He also wanted to talk a bit about Rust, he said. There is a desire to bring in Rust code from outside the kernel for, say, a hash table, but that requires the ability to allocate memory, which means making GFP flags available. "Why the hell would we want to add GFP flags to every Rust thing that we bring into the kernel? That's crazy."
So, he asked, what interface would the filesystem developers like to have to indicate that an allocation should not recurse into the filesystem code; the existing interface is, seemingly, not the right one, since filesystems are generally not using it. Josef Bacik said that from the Btrfs perspective, there is no real advantage to switching away from GFP_NOFS because there are other GFP flags that it still needs to use. The flags are already plumbed through the Btrfs code, so it is just easier to add another GFP_NOFS use when that situation arises.
But the Rust use case is not one he had considered before; it is the first example he has seen where the save-and-restore interface makes enough of a difference that the switch is worth doing. Bacik said that he would be willing to change the code to eliminate GFP_NOFS if that was requested, but it had not really "seemed like a pressing need", so far, to him. The Rust need for the switch changes that equation in his mind.
Jan Kara said that he has been working on removing the flag from ext4 and other places, but has found it inconvenient to do in some parts of the code. Passing the cookie that gets returned from the save operation through, so that it is available to pass to the restore, is ugly and makes the conversion harder, he said. Similarly, there are locks that are taken in some paths such that recursing back into the filesystem needs to be prevented (which is what GFP_NOFS protects against). Rather than having code that manually does the save and restore while managing the cookie, marking those locks in a way that causes the system to automatically handle the allocation scope would make things easier for conversions.
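For reference, the scoped-allocation API amounts to bracketing a critical section, as in this minimal sketch; the value returned by memalloc_nofs_save() is the cookie that Kara found awkward to thread through the code:

    /* while the scope is active, allocations implicitly behave as
       GFP_NOFS even when they ask for GFP_KERNEL, so reclaim will
       not recurse into filesystem code */
    unsigned int cookie = memalloc_nofs_save();

    ptr = kmalloc(size, GFP_KERNEL);    /* treated as GFP_NOFS */

    memalloc_nofs_restore(cookie);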
Rust for Linux developer Wedson Almeida Filho said that it was painful to have to manage and think about all of the different GFP flags in the code. He wondered if there were some way to automatically set the scope by detecting the areas where one of them is needed. Ideally, that detection would happen at compile time; there could perhaps be support in Rust for that, he suggested.
Wilcox said that it depends on the specific filesystem, because they work in different ways with regard to their locking. The detection would have to know that a particular lock is taken during the reclaim path, for example. Kara suggested that lockdep might help, but Wilcox seemed skeptical, noting that lockdep cannot say that a particular lock is never taken in the reclaim path, for example. Dave Chinner agreed that it would be difficult to detect that situation since the filesystem code, its locking, and the interaction with the reclaim path are extremely complex.
There was some discussion between Chinner, Almeida, and Kent Overstreet about the difficulty of automatically detecting the proper context. There is also some interaction with lockdep, which emits false-positive warnings; that has led the XFS developers to use the __GFP_NOLOCKDEP flag (some of which is described in a patch from January). It was all rather fast-moving and technically deep.
Overstreet raised the idea of preventing kernel allocations from failing (effectively making GFP_NOFAIL more widespread). He is opposed to that effort because he thinks that it is important to ensure that the error paths are well-tested both for allocation failure and other errors. But he wondered if making that change might eliminate the need for memory pools (or "mempools").
Chinner said that there is a difference between mempools and the no-fail case: mempools provide a guarantee of forward progress that no-fail does not. In particular, when memory is being reclaimed, there is no guarantee that whatever needs the memory will actually get it. That reclaimed memory could be allocated to some other task, unlike with mempools. He thought it would be difficult to be sure that making that switch would work in all cases.
But what a no-fail policy does do, he continued, is remove the possibility of dereferencing null pointers when there is an allocation failure. Those kinds of bugs generally have security implications, so eliminating the possibility of allocation failure can remove a whole class of security-sensitive bugs.
Overstreet said that making the error paths easier to test is another approach. He plans to post some patches for that, including ways to inject errors at any memory-allocation site; those patches rely on his recently merged memory-allocation profiler. Bacik said that Btrfs has ways to inject errors to test its error paths using BPF scripts. Overstreet said that it is important to be able to target the error injection for code that is under your control; simply randomly failing memory allocations for the kernel as a whole is not viable. Bacik said that the Btrfs error paths are systematically tested using the BPF code.
The session ran out of time without coming to any conclusions on the path forward, which is unfortunate, Wilcox said. Everyone seemed interested in removing GFP_NOFS allocations, but there is no concrete proposal for how to get there; he will try to work on one. Now that he realizes there is a major push to get rid of those allocations, Bacik said that he will work with the other Btrfs developers to not add any more and to start removing the ones that are there.
Measuring and improving buffered I/O
There are two types of file I/O on Linux: buffered I/O, which goes through the page cache, and direct I/O, which goes directly to the storage device. The performance of buffered I/O was reported to be a lot worse than that of direct I/O, especially for one specific test, in Luis Chamberlain's topic proposal for a session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. The proposal resulted in a lengthy mailing-list discussion, which also came up in Paul McKenney's RCU session the next day; Chamberlain led a combined storage and filesystem session to discuss those results with an eye toward improving buffered I/O performance.
Testing goals
He began by outlining his goals with the testing, which were to measure the limits of the page cache and to find ways that page-cache performance could be improved. In order to improve the performance, it needs to be measured; in particular, there needs to be a way to avoid introducing performance regressions as part of the work. He has done a lot of testing of the page cache, but there is a need to try to distinguish between normal and pathological use cases.
Based on the tests that he has done, and a suggestion for a "normal" test case from Chris Mason, he wondered if it seemed reasonable to try to achieve throughput parity between buffered and direct I/O on a six-drive, RAID 0 configuration. Dave Chinner said "absolutely not"; he does not think it is possible to get parity between the two types of I/O in that configuration. Chamberlain suggested that the summit would be a good place to work out what the right tools, tests, and configurations would be to try to measure and improve page-cache performance (thus, buffered I/O performance).
The pathological test case that he presented in the topic proposal is the one that got the most attention. On a well-resourced system, he reported 86GB/s writes for direct I/O and only 7GB/s writes using buffered I/O. That is a huge difference; he wondered whether it was acceptable or is something that should be investigated and, perhaps, fixed.
Chamberlain said that there were some other outcomes from the thread. Matthew Wilcox reported a problem with 64-byte random reads, which resulted in a patch from Linus Torvalds; Kent Overstreet did some preliminary testing and found that it provided a 25% performance improvement. Torvalds was not interested in pushing the patch any further, but Chamberlain said he is testing it some more to see that it does not crash.
He described a few other things that came up in the thread, some of which were addressed with patches. But the results from the pathological case seem to be unexpected; what should be done about that?
These kinds of discussions are always about tradeoffs, Ted Ts'o said; you trade some amount of safety, which you may not care about depending on the workload, for improvement on a microbenchmark. There may or may not be user-space applications that actually care about the operations measured in the microbenchmark, as well. For example, he does not know of any real-world application that needs to be able to do 64-byte random reads.
It takes work to determine if these things make a difference to real-world applications, he continued, and more work to ensure that whatever changes are being considered do not break other applications and make things worse. There is a philosophical question that needs to be answered about whether it even makes sense to spend the time to investigate any given problem.
He contrasted the problems being discussed with the torn-write problem, which has a clear and obvious benefit for database performance if it gets solved; whether the same is true of providing high performance for 64-byte I/O is unclear to him. In the absence of a customer, "is it worth it?" But Wilcox said that the 64-byte-read problem came from a real (and large) Linux-using customer.
Ts'o's point is valid, Chamberlain said, but his goal in the session was not to come up with areas to address; he wanted to raise some of the questions that arose from his testing.
Not unexpected
Chinner said that the numbers from the pathological test were not unexpected. Part of the problem is that buffered I/O has a single writeback thread per filesystem, so that I/O cannot go any faster than that. The writeback thread is CPU-bound; "it is not that the page cache is slow, it is cleaning the page cache that's slow", he said. There are hacks to get around that limitation somewhat, but what needs to be looked at is some way to parallelize cleaning the page cache. That could be multiple writeback threads, writethrough (where writes go into the page cache but are also written to storage immediately), or some other mechanism; the architecture of the page cache is not scalable for the cleanup part.
Chamberlain wondered if there was general agreement with Chinner in the room. Wilcox said that he did not disagree, but that how the scaling is to be done is an interesting question. For example, a single huge file that needs writeback to be done in multiple places will be harder to scale than dealing with multiple small or medium-size files that need writeback.
Chinner said that much of the CPU time in writeback is spent scanning the page cache looking for pages that need to be written, which does not really change based on the size of the files. There are some filesystem-specific considerations as well, but a pure-overwrite workload will have higher writeback rates because the amount of scanning needed is less; at that point it runs into contention for the LRU lock. Adding more threads into that mix will not help at all and may make things worse.
One of the workloads that Chinner runs frequently simulates an untar operation: lots of files are created, 4KB is written to each, and each file is closed. XFS gets stuck at around 50K files/s (roughly 200MB/s) on a device that can normally handle 7-8GB/s; the limitation is the single writeback thread. The rate goes way up (to 600K files/s or 2.4GB/s) if he does a flush when the file is closed, which simulates a writethrough mechanism; a sketch of that workload appears below. The writeback problem for this workload is trivially parallelizable, but that is not always true. The key problem is how to get the data out of the page cache and to the disk as efficiently as possible.
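Here is that sketch; the file names and file count are illustrative, and building with -DFLUSH_ON_CLOSE adds the flush that raised Chinner's numbers:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NFILES 100000   /* file count is illustrative */

    int main(void)
    {
        char buf[4096], name[64];

        memset(buf, 'x', sizeof(buf));
        for (int i = 0; i < NFILES; i++) {
            snprintf(name, sizeof(name), "f%06d", i);
            int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);

            if (fd < 0)
                return 1;
            write(fd, buf, sizeof(buf));
    #ifdef FLUSH_ON_CLOSE
            fdatasync(fd);  /* flush before close, simulating writethrough */
    #endif
            close(fd);
        }
        return 0;
    }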
Jan Kara said that it would be difficult to add more writeback threads because there are assumptions that there is only one at various levels. He and Chinner discussed ways to do so, though it sounds like there would be quite a bit of work. Part of the reason it has not really been investigated, perhaps, is that SSDs are so fast that there is less of a push to optimize these kinds of things, Ts'o said. Getting a, say, 20% benefit on an untar-and-build workload, which already runs quickly, is not all that compelling.
There may be opportunities to simply turn off writeback on certain classes of devices, since writethrough performs so much better on high-end SSDs, Ts'o said. Chamberlain wondered if switching to writethrough would help solve the buffered I/O atomic-write problem; Chinner said that it could. With that, the session ran out of time, though there was talk of picking it back up at a BoF session later in the summit.
Standardizing the BPF ISA
While BPF may be most famous for its use in the Linux kernel, there is actually a growing effort to standardize BPF for use on other systems. These include eBPF for Windows, but also uBPF, rBPF, hBPF, bpftime, and others. Some hardware manufacturers are even considering integrating BPF directly into networking hardware. Dave Thaler led two sessions about all of the problems that cross-platform use inevitably brings and the current status of the standardization work at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit.
Thaler opened the first session (on the first day of the summit) by discussing the many platforms that are now capable of running BPF. With multiple compilers and runtimes, there are inevitable compatibility problems. He defined the goal of the ongoing IETF BPF standardization work as trying to ensure that any compiler can be used with any compliant runtime. He then went into a bit more detail about what "compliant" means in this specific context, which required first explaining a bit of background about the structure of the standardization documents.
In his later session, Thaler would go into more detail about the exact state of the first IETF draft from the working group; for the initial session, he merely stated that the working group had produced a draft instruction set architecture (ISA) specification for BPF. That draft defines the semantics of all of the BPF instructions. One wrinkle is that different implementations may not actually care about implementing every BPF instruction. For example, BPF started off with some instructions that are particular to its initial use case as a packet-filtering language; those packet-filtering instructions might not actually be useful to BPF code running in other contexts.
The draft ISA splits the defined instructions into sets of "conformance groups". A compliant runtime, then, is one that correctly implements the specified instructions for all of the conformance groups it claims to support. Splitting things up in this way helps runtimes (and compilers) communicate exactly what they support, Thaler explained.
The draft ISA splits the existing instructions into groups largely modeled after the RISC-V ISA: atomic32, atomic64, base32, base64, divmul32, divmul64, and packet. Some of these groups include other groups — for example, any implementation claiming to implement base64 must also implement base32. In fact, all of the 64-bit groups include their 32-bit counterparts. Any new instructions that get added to BPF in the future will be added to a new conformance group, so existing groups will never be modified. That means that once an implementation has become compliant, it doesn't necessarily need to stay up to date with new changes to BPF; it can continue claiming compatibility with the old instruction groups and leave things there.
Thaler also described the process that the working group has settled on for deprecations. If a group of instructions needs to be deprecated for whatever reason, they'll be added to a separate conformance group, and then new implementations can explicitly exclude that group and still be considered compliant. A compiler processing a BPF program will need to receive a set of conformance groups implemented by the target (either via compiler flags or other configuration), and take care to emit only supported instructions. The base32 group, which must be supported by all implementations, is already fairly broad, so code generation should not be much of an issue. Hopefully the end result for users will be seamless compatibility.
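As a purely hypothetical illustration of how a runtime and loader might use conformance groups (none of these structures or function names come from the draft ISA, only the group names):

    /* Hypothetical sketch of conformance-group checking. The runtime
     * advertises the groups it implements, and a loader refuses
     * programs that need anything else. */
    #include <stdbool.h>
    #include <string.h>

    static const char *runtime_groups[] = {
        "base32", "base64", "atomic32", "atomic64",
    };

    static bool runtime_supports(const char *group)
    {
        for (size_t i = 0;
             i < sizeof(runtime_groups) / sizeof(runtime_groups[0]); i++)
            if (!strcmp(runtime_groups[i], group))
                return true;
        return false;
    }

    /* A program compiled for base64 and divmul64 would be rejected by
     * this runtime, which does not claim divmul64. */
    static bool can_load(const char **needed, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (!runtime_supports(needed[i]))
                return false;
        return true;
    }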
Instructions are not the only component of BPF, however. Another area requiring standardization is the platform-specific application binary interface (psABI), which includes details such as which registers are saved across a call, which register contains the frame pointer, how large the stack is, and other details, Thaler said. This is all a lot more up in the air, because the working group has not put together a draft of a psABI specification yet. He also floated the possibility that there might end up being multiple psABIs, in which case compilers would need to either choose one to support, or allow some way to specify which one would be used for code generation.
José Marchesi objected to the idea that the frame pointer was a choice that ought to be left to the psABI — BPF uses automatic stack allocation, meaning that the runtime manages the frame pointer in BPF register r10. Thaler responded that the ISA doesn't actually say that; the unwritten psABI would need to say that. Marchesi wasn't satisfied with that explanation, since r10 is treated differently from the other registers. In particular, it is read-only. There was some additional discussion of the point, but other members of the audience didn't seem to agree with Marchesi that the behavior of r10 ought to be specified in the ISA.
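As a concrete illustration of r10's special status (a sketch assuming clang's BPF backend; the disassembly mentioned in the comments is typical output, not something shown in the session), any local variable that lands on the stack is addressed relative to r10, which the program itself never writes:

    /* A trivial BPF program with a stack-resident local. Compiled with
     * clang -target bpf -O2, the array accesses come out as r10-relative
     * instructions, e.g. "*(u32 *)(r10 - 4) = ...". The program reads
     * r10 but never writes it, which is the read-only behavior under
     * discussion. */
    int prog(void *ctx)
    {
        volatile int scratch[1];  /* volatile keeps it on the stack */

        scratch[0] = 42;          /* store via r10 - 4 */
        return scratch[0];        /* load via r10 - 4 */
    }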
At that point, Thaler moved on to addressing another point of compatibility unique to BPF: the verifier. There already exist multiple BPF verifiers, notably the one in the Linux kernel and the PREVAIL verifier. A compiler hoping to produce portable BPF code would need an actual description of what code is or is not verifiable. That is something the working group has been considering, but has not yet written any draft specifications about.
The state of the standard
In his second session, late on the last day of the summit, Thaler updated everyone on the current state of the ISA standard. He began the session by explaining what it is the working group is chartered to do: produce a set of standards and informational documents on several specific topics. How the working group does that is up to them — so they could work on these documents in parallel, but have generally been pursuing them in priority order.
Because having an ISA is foundational to being able to discuss other topics such as the psABI and requirements for the verifier, the ISA is the first document the working group has been focusing on. At the time of the session, the ISA was "almost done", with the last call for comments ending the next day. As of this writing, the ISA is on the agenda for the June 13 meeting of the Internet Engineering Steering Group (IESG).
To people not well versed in IETF minutiae, that might not provide a clear picture of what the state of the document actually is; Thaler provided a brief overview of the remaining process as it applies to the BPF ISA. At the June 13 meeting, the IESG will vote on the proposed document. If it fails to pass, any questions or comments will go back to the working group, the document will be revised, and then it will return to the IESG at a later date. If the vote passes, the document enters the RFC editor's queue. The RFC editor converts the document to the specific format for RFCs, updates any references, and assigns it a tentative RFC number. Then the authors have a final chance to review the RFC editor's changes before it is published, and the assigned number becomes final.
In parallel, the document also needs to be reviewed by the Internet Assigned Numbers Authority (IANA), because IANA will become responsible for managing the official list of conformance groups. Thaler described IANA as being made up of "process people", who are unlikely to raise any objections to the document as long as the procedure described for registering new conformance groups does not have any problems.
All parts of that process are fairly fast, except waiting on the IESG, which only meets every two weeks. So it is quite likely that the BPF ISA may be an official RFC by the end of June, he said. David Vernet, the chair of the working group, asked Thaler whether there was anything that the assembled BPF developers could do to prevent delays. Thaler said that there was not, since it was all waiting on the IETF — except for providing fast responses to feedback.
In particular, Thaler had already received some feedback during the last call for comments. Since the comment period was scheduled to end the next day, he thought that if the attendees quickly replied to these concerns, there would likely not be any other delays. He went through the feedback, most of which was minor and already incorporated. One piece of feedback prompted actual discussion, however. Eric Klein had suggested that the ISA should not define the range of registers available to BPF programs, saying that should be moved to the (not yet written) psABI instead. This suggestion was not well received.
Several audience members, including Marchesi, spoke up to say that the number of registers a CPU has should always be part of the ISA. One audience member asked how compilers are supposed to produce code for a platform without knowing how many registers there are. Marchesi and Alexei Starovoitov separately mentioned that there were some details of how registers were used, such as the use of r0 for return values, or the use of the frame pointer, that did not necessarily need to be included in the ISA, but still thought the number of valid registers was important to include. Thaler noted everyone's responses, and intends to keep the number of registers (currently eleven — ten general purpose and one read-only frame pointer) in the ISA.
Vernet then questioned why they should standardize on eleven registers — other than to match the existing behavior of BPF implementations. Another member of the audience said that was a "very good question, but not one that should affect the standard", given that this is what all existing portable BPF programs do. Marchesi suggested that the ISA could say that the instruction encoding has space for up to 16 registers, but that the exact number depends on the implementation, with a minimum of eleven. Several other people pushed to move on with eleven as-is, noting that if this were a real issue it would have come up at some point before the final 48 hours of the last call for comments on the ISA.
Once the ISA is standardized, the next step (although he was clear that this was only a rough order), will be an informational standard describing the expectations for a BPF verifier. This might include ensuring properties like not using undefined instructions, not dereferencing invalid pointers, or ensuring that programs terminate. Marchesi noted that it would be convenient for compilers if the document took the form of a numbered or named list of rules, so that compiler error messages (and internal code) can reference them by name. Starovoitov thought that those kinds of requirements belonged in a separate document; Thaler concurred, noting that a compiler expectations document was later in his list. Other upcoming tasks for the working group include standardizing the BPF Type Format (BTF), and informational documents on producing portable binaries — including documenting compiler expectations for verifiable code, and the psABI.
Thaler spoke a bit about his preferred form for the psABI work, and then moved into one last topic for the audience to help with: an ELF profile for BPF. He has a draft proposal, but he has concerns about the right way to perform the standardization. ELF is not an IETF standard — it is defined as part of the System V specifications. So, he asked, what is the right way to register BPF-specific ELF information (like the BPF CPU identifier in the ELF headers)?
The consensus among the audience was that System V was pretty much defunct, and that sending an email claiming the BPF CPU ID to the System V mailing list should be sufficient. With that, the session came to a close.
An instruction-level BPF memory model
There are few topics as arcane as memory models, so it was a pleasant surprise when the double-length session on the BPF memory model at the Linux Storage, Filesystem, Memory Management, and BPF Summit turned out to be understandable. Paul McKenney led the session, although he was clear that the work he was presenting was also due to Puranjay Mohan, who unfortunately could not attend the summit. BPF does not actually have a formalized memory model yet; instead it has relied on a history of talks like this one and a general informal understanding. Unfortunately, ignoring memory models does not make them go away, and this has already caused at least one BPF-related bug on weakly-ordered architectures. Figuring out what a formal memory model for BPF should define was the focus of McKenney's talk.
McKenney opened the session by noting that — in an example of weak ordering typical of memory-modeling work — the patches he hoped to result from the talk had actually already been sent to the mailing list. Mohan had done an impressive amount of work in the week leading up to the summit, including fixing several problems with atomic instructions in BPF on weakly-ordered architectures and adding code to herd7, a memory model simulator, in order to show the changes were correct.
After acknowledging Mohan's work, McKenney then walked the attendees through the common assumptions embedded in BPF code about memory ordering, which amount to an informal memory model. BPF has a set of atomic instructions — notably xchg and cmpxchg — that provide full ordering. All CPUs and tasks agree that the side effects of all instructions that come before an atomic instruction are visible before it, and all effects of subsequent instructions are not visible until it has been executed. McKenney noted that this was "straightforward, but really important". Other atomic instructions, such as atomic adds, are unordered, unless they specify the optional FETCH flag.
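For illustration, here is how those distinctions typically surface when writing BPF programs in C (a sketch assuming clang's BPF backend with atomics support; the builtin-to-instruction mapping in the comments is the commonly documented one, not something presented in the session):

    long counter;

    void examples(long *p, long oldv, long newv)
    {
        /* Result unused: typically compiles to an atomic add without
         * the FETCH flag, i.e. atomic but unordered. */
        __sync_fetch_and_add(&counter, 1);

        /* Result used: typically compiles with the FETCH flag set,
         * making the operation fully ordered. */
        long prev = __sync_fetch_and_add(&counter, 1);
        (void)prev;

        /* xchg and cmpxchg always provide full ordering. */
        __sync_lock_test_and_set(p, newv);          /* BPF xchg */
        __sync_val_compare_and_swap(p, oldv, newv); /* BPF cmpxchg */
    }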
The other source of memory ordering guarantees in BPF is jump instructions. Unconditional jumps don't affect ordering, but conditional branches do under some circumstances. When there is a load instruction, the result of which is used in the comparison for a branch, and then that value is stored following the branch, the store is guaranteed to occur after the load. McKenney noted that this was a pretty subtle point, and means that optimizing BPF programs requires some care. Alexei Starovoitov interjected: "Don't run BPF assembly through an optimizing compiler", which produced some chuckles.
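A sketch of that guarantee in kernel-style C may help; READ_ONCE() and WRITE_ONCE() here stand in for plain BPF loads and stores, with the macros defined inline so the fragment stands alone:

    #define READ_ONCE(x)     (*(volatile typeof(x) *)&(x))
    #define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

    void ordered(int *a, int *b)
    {
        int v = READ_ONCE(*a);   /* load whose result feeds the branch */

        if (v)                   /* conditional jump on the loaded value */
            WRITE_ONCE(*b, v);   /* this store cannot be observed before
                                  * the load completes */
    }

Note that the guarantee holds at the BPF instruction level; a C compiler is free to break such control dependencies during optimization, which is one reason for Starovoitov's quip.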
Those properties form the bare minimum of what most BPF programs currently assume about ordering. But in the absence of a formalized model, it's impossible to say whether any given bug resulting from memory operations being reordered is a bug in the BPF program, the compiler, or the BPF just-in-time compiler (JIT). McKenney has four main goals for a formal memory model: it should operate at the level of individual instructions, so that it is directly applicable to the JIT; it should be consistent with the Linux kernel memory model (LKMM); it should support low-overhead mappings to supported hardware; and it should be able to grow as BPF does.
These goals pose some challenges. In order to be efficiently mapped to different kinds of hardware, the memory model should avoid forbidding reorderings that those architectures permit — which necessarily means that it will end up being a "weaker" memory model than any of those architectures alone. The benefit is that, if the formal model is successful, the JIT should not need to emit synchronization instructions or memory barriers in most cases.
McKenney thought that some people might wonder: don't we already have the Linux kernel memory model? Why not just map BPF assembly to LKMM C? Unfortunately, that doesn't really work, he said, at least not trivially. High-level languages like C and low-level languages like assembly have different event structures that make a simple mapping impossible. Assembly also has additional constraints around individual registers or similar low-level constructs that the LKMM doesn't address. He did think starting from an existing memory model was a good idea, however, since the basics of any memory model are likely to be fairly similar.
Of the existing memory models, McKenney considered a few alternatives. X86 is much too strong, and PowerPC is not actively developed (and missing atomic instructions). Ultimately, he chose ArmV8 as a possible starting point, since it is actively maintained and full-featured. The downside is that it includes some irrelevant hardware features, and that it is still stronger than PowerPC, so some changes will be needed.
McKenney showed the attendees a few selected sections of the ArmV8 memory model specification, to give a feeling for what adapting it would be like. Then he went into a series of examples of things "that kinda hit me over the head really hard" — unintuitive consequences of the model that would need to be considered while adapting it to BPF.
Examples
The first example dealt with dependencies between loads and stores. Suppose that there are two reads (called R1 and R2), and one write (called W1). In the program, the instructions occur in the order R1, R2, W1. When executed, R1 reads from some known address (a pointer), returning a second address that R2 reads from. W1 writes to a different address. The question is: is the CPU allowed to re-order the write to occur before the reads? After describing the scenario, McKenney paused to give the experienced kernel developers in the room a chance to guess.
As it turns out, both ArmV8 and PowerPC forbid that reordering, but the LKMM doesn't. The reason has to do with aliasing — the CPU, when it is executing those instructions, can't know whether R2 and W1 will access the same location until read R1 actually resolves. So of course W1 can't occur before R1, because that might cause R2 to read the wrong value. The LKMM, however, operates on a more abstract level. If the pointers involved in this scenario are passed in as arguments to a function written in C, the LKMM is allowed to assume that they don't alias. Since they don't refer to the same location, reordering W1 before R1 doesn't cause any problems.
In this case, BPF is in a position more like a CPU than like the C abstract machine. The JIT could theoretically know whether two accesses alias, but that would seriously complicate code generation, so it makes more sense to adopt the same restriction as ArmV8 and forbid the reordering.
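A litmus-style rendering of the scenario, using the same kernel-style macros (all names are invented for illustration):

    #define READ_ONCE(x)     (*(volatile typeof(x) *)&(x))
    #define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

    void thread(int **pp, int *q)
    {
        int *p = READ_ONCE(*pp);  /* R1: load a pointer */
        int v  = READ_ONCE(*p);   /* R2: load through it */

        (void)v;
        WRITE_ONCE(*q, 1);        /* W1: until R1 resolves, the CPU cannot
                                   * know whether q aliases p, so ArmV8 and
                                   * PowerPC keep W1 ordered after R1 */
    }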
The second example was somewhat more esoteric. Suppose that you have two threads and one variable X. One thread reads X twice, in read operations R1 and R2. The other thread writes to X, with an operation W1. Suppose that R2 occurs before W1 (i.e., it doesn't see the value written). All the operations use the platform's equivalent of READ_ONCE() and WRITE_ONCE(), i.e. atomic reads and writes with weak ordering guarantees. Does R1 also necessarily occur before W1? [McKenney clarified in the comments that this is not quite the correct framing. The proper question is: does R2 being unable to read a value older than the value read by R1 have any other ordering consequences?]
After a moment of silence, one audience member pointed out that the answer can't possibly be "Yes", because otherwise McKenney wouldn't be asking, producing another round of laughter.
And indeed it does not: R2 cannot read a value older than the one read by R1, but that imposes no additional ordering consequences.
To prove it, McKenney showed the scenario running in herd7, which found a proof that the reordering is possible. The scenario did need to include a few additional complications in order to actually make the test meaningful, and McKenney spent a while walking through the requirements to actually measure the scenario so that the audience could understand what was happening in the example. "If you're serious about understanding this — which I'm not sure you should be —", McKenney advised, "you can consult the example in the book", referring to his book about parallel programming.
McKenney noted that this is an example of PowerPC being exceedingly weak — ArmV8 forbids the same reordering. The BPF developers probably prefer to go with PowerPC's version, however, so that they can avoid emitting extra memory synchronization instructions during code generation. BPF doesn't currently have an equivalent of READ_ONCE() and WRITE_ONCE(), so the example doesn't necessarily apply, but it's something to remain mindful of.
In short, even starting from the ArmV8 specification will require a certain amount of adaptation and careful thought, McKenney summarized. He then showed some of the support for BPF code in herd7. Mohan is working to extend that support, which will probably prove useful for validating any proposed formal model. The session ended with some discussion about where the memory model should live once it was written. McKenney plans to get it into the kernel's tools/standardization or documentation directories, but isn't particular about where exactly it ends up.
Comparing BPF performance between implementations
Alan Jowett returned for a second remote presentation at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit to compare the performance of different BPF runtimes. He showed the results of the MIT-licensed BPF microbenchmark suite he has been working on. The benchmark suite does not yet provide a good direct comparison between all platforms, so the results should be taken with a grain of salt. They do seem to indicate that there is some significant variation between implementations, especially for different types of BPF maps.
The benchmarks measure a few different things, including the time taken to actually execute various test programs, but also the overhead of transitioning from the kernel into the BPF VM, and the performance of calling helper functions. It is important to measure these things in a platform-neutral, repeatable way, Jowett said. The benchmark suite uses libbpf to load BPF programs, which uses "compile once — run everywhere" (CO-RE) to run the same ELF artifacts on the different supported platforms.
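A minimal sketch of what loading such a CO-RE object with libbpf looks like; the file and program names are invented, and error handling is simplified (assuming libbpf 1.0 semantics, where these calls return NULL or nonzero on failure):

    #include <bpf/libbpf.h>
    #include <stdio.h>

    int main(void)
    {
        struct bpf_object *obj = bpf_object__open_file("bench.bpf.o", NULL);

        if (!obj)
            return 1;
        /* CO-RE relocations are applied against the running kernel's
         * BTF as part of the load step. */
        if (bpf_object__load(obj)) {
            fprintf(stderr, "load failed\n");
            return 1;
        }

        struct bpf_program *prog =
            bpf_object__find_program_by_name(obj, "empty_prog");
        if (!prog || !bpf_program__attach(prog))
            return 1;

        /* ... trigger the program and collect timing results ... */
        bpf_object__close(obj);
        return 0;
    }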
There are several different kinds of BPF programs included in the benchmark suite, including an empty (no-op) program, programs that exercise various helper functions, and programs that test the performance of BPF maps, including trie and hash-table maps. Measurements are taken on multiple CPU cores in parallel, to make testing the performance of concurrent maps more meaningful.
The eBPF for Windows project uses the benchmark suite as part of its daily continuous-integration (CI) setup to track performance regressions. The CI also runs the same tests on Linux, but he said that these weren't a good comparison because of infrastructure issues — the GitHub runners the CI uses can't specify a particular Linux kernel version. He also noted that there would be some variation because Windows uses ahead-of-time (AOT) compilation of BPF, rather than just-in-time (JIT) compilation as Linux does.
Despite that, Jowett thought that there were some valuable lessons to be drawn from the benchmarks. He said that AOT compilation outperforms JIT compilation, which itself outperforms interpretation. Alexei Starovoitov challenged that assertion; he said that the JIT being tested on Windows — which was Jowett's basis for comparison, given the Linux infrastructure issues — was fairly dumb, and was not enough to make generalizations about Linux's JIT. Jowett acknowledged that, and pointed out some ways that the Windows JIT could be improved.
Jowett also showed some measurements demonstrating that longest-prefix-match (LPM) trie maps have faster updates than hash tables, and that Windows had trouble matching the performance of Linux's least-recently-used (LRU) tables. He noted that maintaining a global consensus on the age of keys in the table "is expensive".
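For reference, an LPM trie map declaration in BPF C looks roughly like this (following the kernel documentation's IPv4 example, not code from the benchmark suite):

    /* The key must begin with the prefix length, and LPM tries require
     * the BPF_F_NO_PREALLOC flag. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct ipv4_lpm_key {
        __u32 prefixlen;   /* number of significant bits in data */
        __u32 data;        /* IPv4 address */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, struct ipv4_lpm_key);
        __type(value, __u64);
        __uint(max_entries, 1024);
    } lpm_map SEC(".maps");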
With the difficulties Jowett had measuring Linux performance, however, it seems hard to say how eBPF for Windows and the Linux BPF implementation compare. Perhaps when that is resolved this work will prove a useful tool to highlight potential performance improvements.
Page editor: Jonathan Corbet