|
|
Log in / Subscribe / Register

Leading items

Welcome to the LWN.net Weekly Edition for May 7, 2026

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

LLM-driven security reports disrupt coordinated disclosure

By Joe Brockmeier
May 6, 2026

Predictions that LLM tools would cause a surge in reports of security vulnerabilities have, unquestionably, borne out. As expected, maintainers are having to wade through more security reports than ever before; in addition, LLM tools are disrupting traditional-coordinated disclosure practices as well. The method of Copy Fail's disclosure, in particular, left vendors, projects, and users scrambling. In addition, maintainers are seeing parallel discovery of the same security flaws within the embargo window. Both of these developments mean that coordinated security disclosures may become a thing of the past.

"Mining security gold"

Jeremy Stanley, a member of the vulnerability management team for the OpenStack cloud-computing project, brought this topic to the OSS Security mailing list on April 28. He said that projects he worked on "are under a seemingly unending deluge of reports from researchers using LLMs to mine for security gold in our software". That had led to him thinking about the risks that public LLM services might pose to traditional vulnerability-handling workflows, embargoes, and coordinated disclosure of security flaws. He wondered if it would be helpful to keep embargoes short to mitigate the risks of parallel discovery or disclosure to other LLM users.

I'm sorely tempted, both due to the increased volume and the risk of premature disclosure, to just assume that any vulnerability reported as a result of research using an LLM is trivially discoverable by others, and give up trying to pretend there's any point to working it under embargo. Similarly, it makes sense to me that patch development and descriptive prose shouldn't be produced with LLM assistance for any vulnerability that is being worked under an embargo.

He said that he could not be the only one thinking about this topic, and asked what others thought. Jacob Bachmeyer replied that parallel discovery was a big risk: "If an LLM can find a bug for a whitehat, it can do the same for a blackhat. [...] LLM-discovered vulnerabilities should be considered already publicly known".

Effect on embargoes

If the vulnerabilities are (arguably) known to multiple people, the question arises of whether standard practices around security embargoes still make sense. Lucas Holt said that he saw "the logic and temptation" of shortening embargoes, but dropping a zero-day exploit on a small project would not help anyone. He suggested that large projects may have multiple people looking for and discovering the same flaws, but smaller projects were unlikely to have multiple people probing them for vulnerabilities at the same time.

Stanley clarified that he was approaching the problem from the perspective of "an upstream maintainer and vulnerability coordinator of large/popular projects receiving these reports". He was seeing a flood of reports, and that trying to manage all of them in private was leading to "accidents breaking our embargoes before things are ready or distros/deployers have been given sufficient advance warning". If security reports were public immediately, he thought projects could crowdsource help from a larger portion of their community "rather than relying solely on overwhelmed vulnerability coordinators and security-focused maintainers".

Holt had suggested that people who used LLMs to find security bugs could also use their tools to create patches. Brian May warned people to be careful of that approach. "Simple patches that look good can in fact be hiding serious security issues. Thinking of the September 2006 Debian openssl issue here."

Greg Dahlman also thought that LLMs were unreliable when it comes to creating security fixes. He described the ability of current LLMs to produce correct and secure code as being within "the coin flip range"; in other words, he thought there was only a 50-50 chance that an LLM's suggested solution to fix a flaw would be adequate. Therefore, any embargo timeline needed to take into account the asymmetry between the time it takes to discover a flaw and the time needed to actually fix it.

Already happening

It turns out that Stanley's hypothesis about parallel discovery had already been proven in the wild. Clemens Lang, who works on the Red Hat Enterprise Linux (RHEL) crypto team, provided a data point to support the theory: "We're seeing duplicate reports of the same issue found by multiple independent groups that use LLMs, within the embargo period." Greg Kroah-Hartman also reported that kernel developers were "seeing duplicate reports of the same issue from different groups within the time period it takes to get a fix merged".

Willy Tarreau said that he had predicted the death of security embargoes months ago:

Embargoes now play against security, for all the time we don't act, users stay exposed to anyone having the luck to find the same problem. It's not a matter of the LLM's strength but a matter of determination by the researcher who could simply run a small model several times helping it dig further. Bigger models just find faster, but that only counts for those seeking protection, not for those trying to attack.

Copy Fail disclosure fail

The Copy Fail privilege-escalation vulnerability (CVE-2026-31431) was announced on April 29 by Xint. It is a "straight-line logic flaw", meaning that it does not require race conditions or other special circumstances, that allowed local users to trivially get root access on most Linux distributions unless they had the most up-to-date kernels.

Along with the announcement, there was a proof-of-concept (POC) Python script that allowed users to see if their systems were vulnerable. The fix had already been included with the 7.0, 6.19.12, and 6.18.22 kernels, but was not available as a backport to older stable kernels until April 30. Most Linux distributions were caught with their proverbial pants down. On April 30, with a POC showing how to exploit the vulnerability widely available, major distributions such as Debian, Red Hat Enterprise Linux, SUSE, and Ubuntu had no fix ready for their users.

It was described as "one of the worst make-me-root vulnerabilities in the kernel in recent times" by Eddie Chapman on the OSS Security list. He wondered what went wrong:

Has the embargo been broken early today? Not looking to point any fingers, those who make things happen in our communities work dam hard and deserve respect and support, especially with the extra burden of AI slop now.

Gentoo contributor Sam James said that it's up to security reporters to notify Linux distributions of kernel vulnerabilities: "unless the reporter chooses to bring it to the linux-distros [mailing list], there is no heads-up to distributions. It did not happen here."

The discussion turned to something of an ad-hoc postmortem, fingerpointing, and problem-solving session. Alexander Peslyak, who goes by "Solar Designer", said that distributions had little way of knowing about the importance of this vulnerability, as it did not stand out among all the others.

The vulnerability was reported to the kernel team on March 23, according to the timeline in Xint's announcement. A patch was committed to the mainline kernel on April 1. CVE-2026-31431 was added to the kernel's security repository on April 25, with a Common Vulnerability Scoring System (CVSS) score of 7.8 (out of a possible ten), four days before the Copy Fail announcement. Peslyak pointed out that there were 168 CVEs in the batch from that day, with scores between 7.1 and 9.8. "By the score alone, this one really does not stand out. To me, this is usual noise, with little signal in there." It did not hint at the severity or imminent threat of a POC being released. In fact, there are 21 CVEs added on April 25 that have a "9.8 CRITICAL" CVSS score.

Greg Kroah-Hartman replied that the kernel team's "constant message", for decades, has been that users must upgrade to the latest release to ensure that they have all of the fixes for currently known issues. He added that the kernel team had no knowledge of the Copy Fail announcement ahead of time "as no one is obligated to tell us that they are about to let loose a trivial exploit", and that the team was not allowed to notify others ahead of time, lest they have to tell everyone about everything. "That's the only policy by which all the legal/governmental agencies have agreed to allow us to operate in, so we are stuck with it."

On May 3, James shared a link to a comment from Brian Pak of Xint that said the company had provided a fully working exploit to the kernel security team when the vulnerability was reported. "We've since learned that such details don't automatically get forwarded downstream and that Linux kernel commit messages are typically kept minimal. That's simply how the process works." James said that the kernel team "were very much aware of the impact from the offset", and inquired if the kernel team was "honestly proud of how this went".

The CVE garbage patch

Kroah-Hartman asked exactly what James would suggest that the kernel team do better, and how it should do it. He added that the team receives bug reports of local-user privilege-escalation bugs all the time; it was not obvious that this one was special until after the fact because the submitter used it to show off their software. "That is something that normally does not happen and is outside of the control of all of us involved here." Emily Shepherd wanted to know if the POC or a description of the flaw's severity had been provided; Kroah-Hartman replied that he honestly did not remember because "that was months and hundreds, if not thousands, of reports ago". He also reminded the list how the kernel security team operates:

The job of the kernel security team is to triage a bug report, drag in the relevant maintainer/developer, get the issue fixed and merged into Linus's tree as soon as possible. Once it lands in Linus's tree, our role is over.

We do not do "announcements" of anything to anyone, so even if this was a "look how bad you can abuse the system" type of thing, we would not be telling anyone anything.

He added that he's documented this in detail, given many talks about it, and has been blogging about it as well.

On social media, Josh Bressers suggested that the blame lies with the company: "every AI vulnerability company wants to find something juicy, and have no idea how to coordinate the findings". Kroah-Hartman dismissed the idea that coordination of vulnerabilities was even possible now. Bressers agreed: "I do think you're right that the traditional disclosure model is gone forever". He did think it was "pretty obvious" that Copy Fail would be a big one; but he had no idea what to do to prevent important vulnerabilities from drowning in the "great CVE garbage patch".

Copy Fail is just one vulnerability out of thousands. At the rate that LLM tools seem to be speeding up discovery, it does seem that Kroah-Hartman has a point. Traditional disclosure notification is becoming increasingly difficult, if not outright impossible. The volume of reports, coupled with the fact that many AI-assisted reporters will be unfamiliar with how security disclosure usually works, means that relying on embargoes and coordinated fixes is going to be increasingly risky. We live in interesting times for open-source security, whether we want to or not.

Comments (28 posted)

Restartable sequences, TCMalloc, and Hyrum's Law

By Jonathan Corbet
April 30, 2026
Hyrum's Law states that any observable behavior of a system will eventually be depended upon by somebody. The kernel community is currently contending with a clear demonstration of that principle. The recent work to address some restartable-sequences performance problems in the 6.19 release maintained the documented API in all respects, but that was not enough; Google's TCMalloc library, as it turns out, violates the documented API, prevents other code from using restartable features, and breaks with 6.19. But the kernel's no-regressions rule is forcing developers to find a way to accommodate TCMalloc's behavior.

As a quick reminder: the restartable sequences feature, accessed by way of the rseq() system call, provides a mechanism for the execution of brief critical sections in user space. A shared-memory segment is used to communicate to the kernel when a critical section is active, and the kernel can redirect execution if the running thread is preempted or migrated during that critical section. There are a number of associated features, including the ability to quickly determine which CPU a thread is running on; the time-slice-extension feature merged for the 7.0 release is also tied to restartable sequences.

TCMalloc trouble

On April 22, Mathias Stearn reported two problems with TCMalloc resulting from the 6.19 improvements to restartable sequences. One of them turned out to be a simple bug in the 64-bit Arm implementation, tied to the fact that the Arm architecture does not yet fully use the generic entry code. This bug is uncontroversial and will be fixed like any other. The second problem, though, has deeper origins.

The memory shared between the kernel and user space for restartable sequences includes an instance of struct rseq; that structure contains a number of fields, one of which, in particular, is of interest for this discussion. The 32-bit cpu_id_start field contains the number of the CPU on which the thread is running. This value, which is maintained by the kernel, is explicitly defined as a read-only value for user space, and is guaranteed to always contain a valid CPU number, even if restartable sequences are not in use.

Prior to the 6.19 release, the kernel would update cpu_id_start on every return from the kernel to user space, regardless of whether the CPU number had changed or not. Storing a single integer value does not seem like an expensive operation, but looks can be deceiving; many CPUs have features that prevent the kernel from randomly changing user-space memory. Turning off that protection (and re-enabling it after the store) is expensive. Removing the redundant stores improved performance by 15% for many workloads without changing the restartable-sequences ABI in any way — or so it seemed.

The TCMalloc library makes extensive use of restartable sequences to improve performance. Notably, while it does use this feature for critical sections, it also uses it to detect scheduling interruptions outside of critical sections. The trick (to avoid more pejorative terms) used is described in detail in this document. In short, TCMalloc's internal data structures are designed to overlay the shared struct rseq so that cpu_id_start becomes the upper four bytes of an internal cache pointer. When TCMalloc stores this pointer, the result is to write zeroes into cpu_id_start; the topmost bit is set, though, to distinguish the contents of cpu_id_start from any valid CPU number. When the kernel stores into cpu_id_start, it will end up clearing that top bit, enabling TCMalloc to quickly detect the change and regenerate that pointer.

The key point is that TCMalloc needs that signal for any sort of interruption, even if the running thread did not move to a new CPU. Pre-6.19 kernels would always overwrite cpu_id_start — an undocumented but observable behavior — providing TCMalloc with that signal; as of 6.19, that overwriting only happens if the thread migrates to a new CPU. As a result, TCMalloc, which has come to depend on that undocumented behavior, ends up leaving a smoking crater in the middle of any application that is trying to use it.

A regression?

The problematic nature of this behavior has been widely known for some years; the above-linked documentation advises that, since cpu_id_start can no longer be counted on to hold the current CPU number, "this makes __rseq_abi.cpu_id_start unusable for its original purpose". In other words, no other code running within a TCMalloc-using thread can use restartable sequences and expect it to work. That is somewhat awkward, given that one other user of the feature is the GNU C Library (glibc), as of version 2.35. Back in 2022, this problem was reported in the TCMalloc issue tracker and a change of behavior was requested but, despite a fair amount of discussion, no change was ever made. As a result, code using TCMalloc must be run with an environment variable set to prevent glibc from trying to use restartable sequences.

Kernel developers have had a dim view of this behavior for a while. Unsurprisingly, that view became rather dimmer yet when users started complaining that the 6.19 release broke TCMalloc entirely. The behavior described above violates the documented restartable-sequences ABI and makes the feature unusable for anybody else. It would have been detected by the kernel's debugging facilities, but those were clearly never used with TCMalloc, since they would have caused an immediate killing of the offending thread. In the issue discussion, kernel developers had offered an extension to restartable sequences to let TCMalloc stop overwriting cpu_id_start, but that offer was not accepted. The result, as Thomas Gleixner described it, is "everyone is in a hard place and up a creek without a paddle".

Gleixner made it clear that he thought TCMalloc's difficulties should not be considered to be a kernel regression, since the documented ABI guarantees are still being upheld by the kernel and the debugging feature would have caught the problem years ago. Linus Torvalds, though, was just as clear that the only thing that matters is that a once-working program stopped working as the result of a kernel change: "This is not some kind of gray area. It clearly violates our regression rules".

This response was clearly expected by Gleixner, though he still did not like it: "Feel free to enforce it, but be aware that you thereby set a precedence that a single abuser can then rightfully own a general shared interface of the kernel forever and force everybody else to give up". Glibc developer Florian Weimer was also unhappy, pointing out that TCMalloc's use breaks the modular design of the restartable-sequences ABI. Torvalds, though, was adamant that a fix had to be found.

Now what?

Various options for fixes were discussed; Stearn had been working on a simple, low-cost option that, seemingly, has not worked out. Another option, of course, is to simply go back to always updating cpu_id_start and accepting the associated performance penalty; there are not many supporters of this approach in the kernel community. The most likely fix, as of this writing, is something based on this patch from Gleixner, which works without requiring changes to either TCMalloc or glibc, albeit with the performance penalty in some environments.

As is the pattern with many recent system calls, rseq() requires the caller to pass in both the pointer to struct rseq and the size of that structure. In this way, future extensions can be made in compatible ways by increasing that size. Gleixner proposes increasing it from the current 32 bytes to 33 bytes, which would also have the effect of forcing a 64-byte alignment of the structure. Any caller presenting a 32-byte struct rseq or failing to align the structure properly would see the pre-6.19 behavior, with unconditional updating of cpu_id_start; more recent related features, such as time-slice extension, would also be unavailable. If, instead, the caller provides a 33-byte, 64-byte-aligned rseq structure, the kernel provides the 6.19 behavior, with full performance.

The result should be a fully compatible change. Existing TCMalloc installations will use the older structure size; the overlay trick used by the library also prevents a 64-byte alignment for the structure. So TCMalloc will be given the older behavior that it depends on. Newer glibc versions query the expected structure size (using getauxval()) and will be rewarded with higher performance and full functionality, with no glibc update needed.

Older glibc versions (those prior to 2.41), though, will be stuck with the performance penalty; Weimer indicated that updating those versions would not be an easy thing to do. Mathieu Desnoyers suggested adding a flag that could be passed to rseq() as an "I'm not TCMalloc" indicator, resulting in the faster behavior. Adding that flag would be a far easier backport to older glibc versions. Gleixner, though, dismissed that idea, saying that it would lead to unnecessary complexity in the code and, in any case, would be problematic in cases where there are multiple users of restartable sequences within a single application.

This solution appears to have survived initial testing, and has been put together into a proper patch series, along with some sharp words for the people who made it necessary:

As Linus decreed the onus is on the lack of ABI compliance enforcement in the original RSEQ kernel implementation and the clever abuse is fine. That's technically correct, but in the context of the larger ecosystem a fundamentally flawed decision. Though that's a completely different discussion to have as it affects the long term sustainability of the Open Source ecosystem in general and the ability to protect it against rogue actors, which are thereby officially entitled to hold a whole ecosystem hostage and force the people who provide them their operational base to go out on a limb to make progress.

Grumpiness aside, it seems that the form of the solution to this particular demonstration of Hyrum's law has been worked out. As long as there are no other users doing strange things, it should be possible to move forward, preserving all of the improvements that have been made, without breaking existing users. The absence of other surprises seems fairly likely to be the case, since there are few users of restartable sequences. In the end, things clearly could have been worse.

That said, this episode will surely not be reassuring to developers who fear exposing features that could end up creating similar compatibility problems in the future. Many discussions about adding new interfaces have run aground on that point; see, for example, this response by David Hildenbrand to the idea of allowing BPF programs to control more aspects of memory management. It is easy to see Hyrum lurking in the corner, just waiting to ossify some behavior inadvertently exposed by the kernel.

Comments (33 posted)

Bug-monitoring expectations and Fedora GNOME packages

By Joe Brockmeier
May 4, 2026

For a number of years, users submitting bugs reports against GNOME packages in Fedora have received an auto-reply saying that the reports were not actively monitored; users were encouraged to file bugs with GNOME upstream instead. However, that practice seems to be in conflict with the Fedora Engineering Steering Committee (FESCo) policy that package maintainers "deal with reported bugs in a timely manner". On April 28, FESCo discussed the disconnect between practice and policy; so far, it has only opted to tweak the wording of the automatic response.

Many of the GNOME packages in Fedora are maintained by members of Red Hat's desktop team. Bugs filed against some of those packages, such as gnome-disk-utility, gnome-session, and nautilus, are automatically assigned to the "gnome-sig" alias in the Bugzilla bug tracker. There are 21 members in the group, but it is unclear if all users in that group are currently active.

Nearly 750 bugs are assigned to gnome-sig as of this writing. A few of those bugs, however, were opened more than a decade ago and were reassigned to the group at some point when package ownership changed—likely because a package was "orphaned" and no other packager stepped up to claim it. When a bug is assigned to gnome-sig, it receives an automatic response that reads in part:

Bug reports for this component on Red Hat Bugzilla are not actively monitored. Please consider reporting your issue directly to GNOME at https://gitlab.gnome.org/GNOME/ to improve the chances that your issue will be resolved.

In February, Carl George opened a ticket with FESCo about the automated reply. He said that it was "entirely reasonable to encourage collaboration with upstream projects", but the response suggested that the bugs were not being monitored at all, which would indicate that the package maintainers were not meeting their responsibilities under Fedora policy. "Can FESCo provide guidance on how this should be addressed?"

Too many bugs

FESCo member Kevin Fenzi replied with a link to a Fedora Workstation discussion in 2020 that led to the automatic reply. Michael Catanzaro had opened that conversation by saying he thought that most of Fedora's GNOME developers had given up on Bugzilla: "Currently, it's where bugs go to be ignored and receive no response, including serious bugs. This status quo is unfair to users." Fedora GNOME developers were only "passingly familiar" with the packages that they own: "I checked a couple other GNOME maintainers to see how many bugs are assigned to them and found: 260, 182, 420, 511, 372." He had fewer bugs, but also owned fewer packages than most of the GNOME maintainers who were maintaining "dozens of packages".

There were just too many bugs to practically deal with, Catanzaro said. Moving them upstream took too long as well: "we'd spend so much time moving bugs that we'd never get anything else done". If the bugs were filed upstream to begin with, "there's a decent chance the bug reports won't be ignored" because the bugs go "straight to the right developers". Allan Day wondered if GNOME was the only upstream with the problem of managing Fedora bugs. Neal Gompa responded that KDE also struggled, but he thought it made more sense to file bugs with both the upstream project and Fedora: "That way Fedora processes still work, and GNOME/KDE people can focus on upstream reports."

FESCo member Fabio Valentini pointed out that GNOME changes often caused problems in Fedora that did not qualify as bugs upstream. For example, GNOME may make changes to D-Bus interfaces that introduce crashes in the Pantheon desktop environment developed for elementary OS and packaged for Fedora. He had been told that GNOME's GitLab instance was the wrong place to report those problems, because "they're just 'features, not bugs'".

Owen Taylor admitted being one of the Fedora maintainers who ignored hundreds of bugs that had been assigned to him. Looking at bugs in the Fedora Bugzilla turned up "lots of interesting and intriguing stuff" that would sometimes lead to upstream bug fixes, "but getting bugzilla anywhere close to clean would have required me to be close to full-time bugzilla triager, and not get anything else done". He wanted to teach Fedora's Automatic Bug-Reporting Tool (ABRT) to report GNOME bugs upstream; that, though, could only apply to bugs being opened when application crashes were detected. He also wanted users to receive a suggestion to report bugs upstream unless their problem seemed to be related to packaging or otherwise unique to the distribution.

The discussion continued, and a solution of sorts was finally found; in June 2023, Tomas Popela said that a Bugzilla rule would be created to automatically add a comment after a bug was filed that the user should report bugs upstream unless it was related to packaging or to Fedora's release process (e.g., if a bug would block a Fedora release). The rule was implemented in November 2023.

ABRT did not learn how to report bugs upstream and is currently in the process of being decommissioned following years of neglect. Catanzaro said that Red Hat has stopped assigning developers to work on ABRT, and no one from the community has stepped up to maintain it. The graphical-user interface component for ABRT was removed in Fedora 44 and the plan is to remove it in its entirety before Fedora 45.

Community discussion

FESCo decided, during its meeting on February 17, to open a discussion on Fedora's forum to get community input about unmonitored bug reports. Valentini started the thread and then followed up with his own opinion on the topic. He agreed that a ridiculous number of bugs were being filed against some GNOME components, but the autoresponder was not helping. Once a bug was filed, he said, people feel like their part is done "and then they're asked to go do all that again somewhere else".

George replied that policy dictated maintainers deal with bugs in a timely manner; either the policy should be changed, or the response needed to change to be in alignment with Fedora policy. Gompa said that Fedora KDE maintainers also had "a very heavy bug reporting load", but had not requested special treatment. He acknowledged that the team was "quite bad at tagging bug reports to be closed", but felt that it was not a good idea to convey to users that bug reports were ignored by default:

To me, actions like this make it feel like Fedora is not providing value to the user or the upstream. Because if maintainers aren't doing that funnel work or leveraging shared infrastructure to help improve the quality of the components they maintain, I don't know what they are doing.

Fedora is currently working on adopting Forgejo as its collaborative-development platform, which will (at some point) include using Fedora Forge instead of Bugzilla for tracking bug reports. Catanzaro said that, when that happens, users would see a report template that would direct them upstream before they open a bug with the Fedora project. He suggested that Fedora needed to focus on fixing its bug tracker and bug-report tooling to ensure that bugs were reported in the right place. "I really don't think it's realistic to expect packagers to read the downstream bug reports."

Gompa argued that telling users to go upstream was a bad idea for several reasons; there is no one-to-one mapping of components between Fedora and GNOME, and users would be interacting with developers who may or may not have influence on how quickly a fix could be delivered. George replied that, whether it was realistic or not, Fedora's current policy required packagers to read and respond to bug reports.

Change the message

After the public discussion had time to wind down, FESCo revisited the topic during its meeting on April 28 (log). FESCo decided that the first sentence in the automatic response ("Bug reports for this component on Red Hat Bugzilla are not actively monitored") would be dropped. That was implemented on April 29.

Note that this does not mean that bug reports are being any more actively monitored than before; the message has changed, but FESCo has not yet given any guidance on what should be done next. It has concluded further discussion (and, perhaps, action) would be needed because, as Valentini wrote in the meeting summary, "intentionally not monitoring bugs on bugzilla is in tension (if not conflict) with guidelines around package maintainer responsibilities."

FESCo member Stephen Gallagher said that he had followed up with Catanzaro and Matthias Clasen, and they intended to join the next FESCo meeting. If the Workstation working group adopted an issue template that directed users upstream before filing a bug report, he thought, "it will be accepted as a reasonable compromise between policy and reality".

Whatever the outcome of the meeting, it seems unlikely that it will address the core problem that Fedora's GNOME package maintainers seem to have more work than they can comfortably manage. It matters little where bugs are filed if there aren't enough hands to do the work.

Comments (23 posted)

Version-controlled databases using Prolly trees

By Daroc Alden
May 1, 2026

Modern database and filesystems make pervasive use of B-trees, which are tree structures optimized for storing sorted lists of keys and values on block devices. Dolt is an Apache 2.0-licensed project that makes clever use of a variant of a B-tree to support efficient version control for an entire database. The data structure it uses could well be of interest to other projects.

The company behind Dolt, DoltHub, makes its money selling hosted versions of three Apache-licensed open-source projects: Dolt, Doltgres, and DoltLite. The projects are intended to be drop-in replacements for MySQL, PostgreSQL, and SQLite, respectively. They have separate frontends for the different SQL dialects, but the projects share a common storage backend that supports version-control operations.

In practice, that allows the projects to expose Git-like operations as custom SQL functions:

    -- Modify the database
    INSERT INTO users VALUES (...);
    -- Add the changes to a table to staging
    SELECT dolt_add('users');
    -- Commit them
    SELECT dolt_commit('-m', 'Add Joe to users');
    -- See the resulting diff
    SELECT * from dolt_diff_users('HEAD~1', 'HEAD');

At any given time, the database has a current HEAD commit, some number of changes in the staging area, and some number of changes in the working area, just like Git. That deliberate similarity is a major selling point of the projects — the company's home page says: "You already know how to use Dolt." That may be an exaggeration, but it is true that having basic knowledge of Git and SQL seems sufficient to start making use of Dolt.

What might be less clear is what one would use a version-controlled database for. Simply making commits on its own does not really add much power over the existing concept of a database transaction. The real utility of Dolt comes from the ability to restore old commits, fork historical states of the database, and merge in changes after review. For example, a programmer might want to run an analysis against an old state of the database to determine what would have been reported as of a particular date. They could make a new branch pointing at a historical commit, cherry pick a schema change that the analysis needs, and then run it.

Another potential use is to give automated tools a "real" database that they can work against, without giving them the ability to completely break the production database; any changes can be investigated separately and merged in only once it is clear that they won't break anything. Dolt is fully compatible with normal MySQL clients; the only difference is in the connection string, so the tools have no reason to think that they're not talking to the real production database.

    -- Connect to the normal db
    mysql://db-server:3306/mydb
    -- Connect to a named branch
    mysql://db-server:3306/mydb/branch-name
    -- Connect to a read-only historical commit
    mysql://db-server:3306/mydb/ia1ibijq8hq1llr7u85uivsi5lh3310p

Of the three projects, Dolt is the most mature. Doltgres recently entered beta, and DoltLite is somewhere in-between — the components adapted from SQLite have had plenty of testing in their original context, but the project is still on version 0.8.1. The company's performance testing of Dolt suggests that its speed is comparable to MySQL's, although the normal caveats about the difficulty of precise benchmarking apply.

That competitive performance is possible because Dolt only changes the storage layer of the database. The query planner, index maintenance, and so on all reuse MySQL, PostgreSQL, or SQLite's existing implementations. All of these databases use B-trees; Dolt uses a slight variation on a B-tree that allows efficient diffing and shares storage between mostly identical versions.

B-trees

B-trees have a lot in common with their simpler cousins, binary trees. In a binary tree, each node has a key, a value, and two (possibly empty) sub-trees. Keys with a lower value are stored in the left sub-tree, and keys with a higher value are stored in the right sub-tree. As long as some scheme is employed to keep the tree balanced (such as the use of red-black trees in the kernel), this results in reasonably efficient comparison-based searching: each comparison cuts out roughly half of the remaining candidates. Keys are looked up in time proportional to the base-two logarithm of the number of nodes in the tree.

B-trees take this core concept, and remove a bunch of pointer indirections. Each node in a B-tree has an array of keys, values, and pointers to sub-trees. Typically, the size of the array is chosen such that one node fits on one sector or page of the underlying storage. To find an item in a B-tree, the computer scans over the array and either finds the key directly, or finds two keys that the target key should be between, and goes on to check the corresponding sub-tree. This structure greatly reduces the number of levels of the tree and the number of pointer indirections, thus making the whole structure faster to access in practice. In a B-tree with sixteen keys per node, for example, searching for a key only involves one quarter of the number of nodes compared to a binary tree.

[A diagram of a small B-tree, created by 'CyHawk' and shared under a CC-BY-SA license]

This makes B-trees an excellent fit for databases, which have to maintain a balance between efficient searches for particular items, in-order scans over the items in a range, and access to the underlying storage. There are a handful of variations of B-trees used by different databases that provide small speedups. For example, moving all of the values to leaf nodes and storing only keys in the interior nodes of the tree allows for more keys per interior node, making the tree wider and shorter. An extension to that idea adds a "next" pointer to each leaf node so that in-order traversals can proceed along the leaves without needing to load any interior nodes. Dolt's core data structure, described below, incorporates both of those optimizations.

The problem with plain B-trees, and the reason that Dolt is interesting, is that comparing entire B-trees is expensive. The order in which items are inserted and deleted can affect how the internal nodes of the B-tree are structured, which means that the only way to compare two B-trees is by iterating through both trees in their entirety. That downside doesn't matter much to traditional databases, but it poses real problems for version-control systems.

Snapshot-based version-control systems, such as Git, can be thought of as storing many snapshots of some source tree. Each commit is logically independent, and could in theory just be stored as-is. In practice, most Git repositories do not completely change between commits, and it's more efficient to store the differences between subsequent snapshots, rather than the snapshots themselves.

Git, which works with entire directory trees, relies on being able to quickly check whether a particular sub-tree has changed in order to produce compact diffs that include only the actual changes. B-trees aren't compatible with that kind of algorithm, since the same logical list of keys and values could be represented by two different B-trees. Naively applying a diff algorithm to a B-tree could produce a lot of inefficient churn between representations of internal nodes.

The problem comes from the way that B-trees split and merge nodes as items are added and removed. When a B-tree node is full, it is split into two smaller nodes that are each half-full. When a B-tree node is not full enough (a threshold that varies between implementations, but which is typically just under half-full), its children are merged together to produce a more full node. When correctly implemented, this keeps the B-tree balanced and prevents the nodes from wasting storage space. But it also means that splitting and merging decisions depend on the order of insertion into the tree.

Prolly trees

Probabilistic B-trees (Prolly trees, for short) are Dolt's answer to this problem. They are almost identical to B-trees, except that the logic for inserting and removing items is updated such that any two trees that contain the same items are completely identical regardless of how items were inserted. The same property holds for individual sub-trees, so comparing two Prolly trees is much simplified: if a sub-tree hasn't changed between two versions, it will have the same hash, even if an item was inserted and then removed.

To accomplish this, Prolly trees need to be able to determine what size each node "should" be, based only on its contents and not on any internal state of the tree. At first, it might seem tempting to say that every node should be as full as possible, to maximize use of storage space. But that would require sometimes shuffling nodes at the same level of the tree back and forth between parents, which would destroy some of the nice properties of the B-tree and cause additional churn between versions. To keep insertion and deletion efficient, the average node should not be so full or so empty that the insertion or deletion of a single item will need to trigger a split or a merge.

The solution to this problem is to use a hash function to make the decision in a way that depends only on the data and not on how full a particular node happens to be. A hash is effectively a deterministic pseudorandom number that can be compared to a threshold to make a randomized decision, which is where the "probabilistic" part of Prolly trees comes in. To insert a new value into a node, the implementation puts it in the correct position in the array (based on the key), and then recalculates where the boundary between this node and its sibling should fall based on the hashes of the items. Each hash is compared to a chosen cutoff value (the threshold) to decide whether the node should end at that item. Since the values of the hashes are independent, so are the decisions about where to put the boundaries between nodes. Most of the time, therefore, adding an item won't result in a new boundary: its hash value will be above the threshold, and the node will still end at the same place (a preexisting item with a low hash value). Occasionally, an inserted item will have a low hash value, causing the node to split in two and keeping the overall sizes of the nodes balanced.

That picture is slightly complicated by performance considerations; in order to keep the nodes optimally sized, the threshold used is actually dynamic, based on the number of items before a given key in a node. That still means that the actual decisions about where to place the boundaries between nodes are made independently of the pre-existing structure of the tree. The locations of the boundaries depend only on the hashes of the items stored in the tree, which are the same regardless of what order items were inserted in.

This insertion logic is a little complicated, especially if the implementation is optimized for performance, but it's still reasonably compact:

    # Unoptimized Python example code
    # This function is called by some internal node to update one of its child
    # nodes; the return value is used to update the internal node.
    def insert_into_node(node, new_items):
        sibling = node.next_sibling
        new_items = sorted(new_items + node.items)
        collected = []
        nodes = []
        
        while new_items:
            next = new_items.pop(0)
            collected.append(next)
            if hash(next.key) < threshold(len(collected)) \
               or len(collected) >= max_node_length:
                nodes.append(Node(items=collected))
                collected = []

        if collected:
            # The while loop didn't end on a natural barrier
            if node.next_sibling is not None:
                new_nodes, sibling = insert_into_node(node.next_sibling, collected)
                nodes += new_nodes
            else:
                nodes.append(Node(items=collected))
            
        for i in range(1, len(nodes)):
            nodes[i - 1].next_sibling = nodes[i]
        nodes[-1].next_sibling = sibling
        
        return nodes, nodes[0]

With a suitably chosen threshold, the average insertion won't create a new node, and so only one node needs to be updated at each level of the tree. Theoretically, adding a single item could cause all of the nodes after it to be split up differently, but the chances of that are infinitesimal, because the hash values of the items are statistically independent. The hash values of all of the existing items would have to just barely miss the threshold in order to cause that kind of realignment, so it's not a practical performance problem. Since the decision of how to place items into nodes only depends on the data, the same nodes will be created every time.

There are some downsides to using a Prolly tree over a B-tree. For example, nodes need to store the hashes of sub-trees, which requires either more storage space or an additional indirection through a content-addressed block store to turn a hash into an address. Content-addressed storage is a scheme that many version-control systems use to deduplicate data: a particular diff or other object is stored in a location based on the hash of its contents. This ensures that duplicate items will reuse the same storage space. A Prolly tree implemented as part of a version-control system can reuse that existing content-addressed object storage for blocks, and refer to blocks by hash, leading to more compact internal nodes. Other systems will need their own solutions.

Another problem is that the performance of the tree is highly sensitive to how the probabilities of splitting are tuned. Using a constant probability results in the mean node being half-full, but the median node containing only one or two entries. To cope with that, Prolly trees use a dynamic threshold that increases as the node fills up, so that the size of nodes ends up following a normal distribution. Of course, some applications really do have requirements for worst-case performance rather than average-case performance, which is incompatible with the randomized layout of a Prolly tree.

Despite these deficits, Prolly trees could be a useful data structure for applications that depend on being able to compare snapshots of trees, or on being able to store trees in a content-addressable way.

History and future

Prolly trees were invented by Aaron Boodman for the now-defunct Noms database. Afterward, they were picked up by the InterPlanetary Linked Data project and by DoltHub. The storage backend used in Dolt is a forked version of Noms, although the project has seen a fair number of changes and optimizations since then. Some other implementations of Prolly trees exist as libraries, but none apply them to databases.

Dolt provides an appealing evolution on the traditional design of a database. Personally, however, I am excited to see how the ideas could apply in other programs that use B-trees, such as filesystems. Prolly trees are certainly not appropriate for all uses of B-trees, but filesystems share many of the same tradeoffs between in-order traversal, random access, and ease of snapshotting that databases do. Perhaps Prolly trees could help with space-efficient snapshots of large, frequently modified files.

Comments (20 posted)

Hardware-assisted Arm VMs for s390

By Daroc Alden
May 5, 2026

A recent patch set from Steffen Eiden and others has set the groundwork for allowing hardware-assisted emulation of Arm CPUs on s390 CPUs. Version two of the posting fixes a handful of smaller problems, but does not differ much. The patches were welcomed by the Arm maintainers, pending some discussion of how the collaboration between the architectures could be structured to prevent maintainability problems on the Arm side. When those details are resolved, the patches could pave the way for transparently running Arm-based virtual machines (VMs) on s390 hosts at native or near-native speeds.

The core of the feature is a patch that adds support for a new s390 instruction called "Start Arm Execution" (SAE). It performs a similar function to the existing "Start Interpretive Execution" instruction on s390 that is used to enter a hardware-assisted virtual machine while keeping the virtual CPU state separate from the host CPU. Both instructions take a pointer to a "control block" that describes how the virtual CPU should be set up and entered. The difference is that a SAE instruction's control block sets the instruction pointer to a block of memory containing Arm instructions and interprets them as such, rather than s390 instructions.

In theory, this allows someone with a s390 CPU new enough to support the feature to run an Arm virtual machine directly. There clearly has to be a translation from Arm machine code to s390 machine code at some point, but the CPU handles that internally. How exactly it does that is not clear from the patch set.

On the kernel side, while the idea is simple, the implementation is a bit more complex. When virtual machines make calls to their hypervisor, they do so using an architecture-specific interface. If Arm virtual machines are to run on s390 unmodified, the KVM code in an s390 kernel needs to be able to interpret Arm hypercalls. To support that, Eiden's patch set moves the interface definition and related header files to include/uapi/arch/arm64/, allowing other architectures to reference them. Amusingly, this allowed some duplicated code to be cleaned up; the patches ended up removing more lines of code than they added.

There is additional work to be done once these patches are merged, however. The s390 architecture is also gaining instructions for manipulating Arm registers, handling interrupts, and so on. Eiden hopes to add support for those over the coming months in conjunction with the rest of the s390 developers.

There was relatively little feedback on the patch set, but Marc Zyngier had a lengthy response written in consultation with Will Deacon. They were supportive of the effort, but worried that moving some of the Arm-specific code to a shared directory was messy and could interfere with the code's maintainability. They suggested that using symbolic links, relative paths, or code generation could allow the existing Arm code to remain it its accustomed location without unduly restricting s390 code.

Eiden's reply explained that they had tried using symbolic links for header files originally, but feared the approach wouldn't gain support since it was somewhat messy. Some of the code is also automatically generated, reusing the Makefile rules from the Arm code. But he agreed to prototype an alternative that makes use of symbolic links and post that for comparison. That patch set was sent on the April 28 and has not received any review as of the time of writing.

A more serious concern was how s390 would keep up with changes to the Arm virtual-machine-guest API, which is still evolving, Zyngier said. For example, there are some CPU-vulnerability mitigations that require cooperation from guests. When a new one is discovered, Arm assigns a hypercall function number for the new interface needed, and KVM implements it. The s390 code will presumably need something similar, both adding stubs for new Arm-specific mitigations and working with Arm to add any s390-specific interfaces needed. Those should be limited to mitigations, however, since the Arm KVM code "makes a point of forbidding any use of implementation specific instruction or system registers," and Zyngier expects the s390 code to do the same.

Eiden was not overly worried about this, saying that a primary goal of the project was to be able to run unmodified Arm guests, and so he and his fellow s390 developers would be treating the Arm code as the source of truth. As such, they had no plans to introduce s390-specific features or abrogate the usual process.

Zyngier and the other Arm maintainers are, understandably, not familiar with the details of s390. So, Zyngier and Deacon also asked for help with documentation, testing, and debugging, to ensure that changes to the Arm code do not have adverse effects on s390.

Finally, we feel it would be beneficial for both projects to swap prisoners and have cross-reviewers in MAINTAINERS, so that there is an s390 reviewer added to KVM/arm64, and an arm64 reviewer added to KVM/s390.

Eiden readily agreed, and suggested that cross-compilation of the kernel for s390 was a good starting point for testing. He offered access to s390 virtual machines hosted by IBM for doing native builds, as well. Eiden plans to add Arm's continuous-integration-testing branches to s390's build infrastructure, to catch any breakage early. He also thought an exchange of maintainers made sense, and added himself as an Arm KVM reviewer in the second revision of the patch series.

Eiden and Zyngier made plans to meet up with the other s390 and Arm kernel developers at the Linux Plumbers Conference Linux Storage, Filesystem, Memory Management, and BPF Summit. Hopefully an in-person meeting will suffice to figure out any remaining details, but since everyone seems happy enough with the change that is quite likely.

[ Thanks to Andi Holmes for bringing this topic to our attention. ]

Comments (31 posted)

Page editor: Joe Brockmeier
Next page: Brief items>>


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds