
Leading items

Welcome to the LWN.net Weekly Edition for March 13, 2025

This edition contains the following feature content:

  • New terms of service for PyPI: questions and concerns about the new terms of service for the Python Package Index.
  • Zig's 0.14 release inches the project toward stability: a look at what the new release of the Zig language brings.
  • The road to mainstream Matrix: Matthew Hodgson's FOSDEM talk on the history and future of the Matrix communication protocol.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (3 posted)

New terms of service for PyPI

By Jake Edge
March 12, 2025

On February 25, the Python Software Foundation (PSF), which runs the Python Package Index (PyPI), announced new terms of service (ToS) for the repository. That has led to some questions about the new ToS, and the process of coming up with them. For one thing, the previous terms of use for the service were shorter and simpler, but there are other concerns with specific wording in the new agreement.

According to the announcement, which was posted by PyPI administrator (and PSF director of infrastructure) Ee Durbin, the new terms will go into effect on March 27. Anyone continuing to use PyPI after that date is implicitly agreeing to the terms. Durbin outlined the history of the terms, going back to the initial opening for uploads in 2005; the terms have been updated twice since, in 2009 and in 2016, they said. In the past, the terms were primarily meant to protect PyPI and the PSF; now the organization has worked with its legal team to create the new terms. They are meant to be compatible with the old, "while adding as permissive a set of new terms as possible to ensure that PyPI users and the PSF are protected".

Another reason for the update is to help enable the Organization accounts feature for PyPI. That feature, which was announced almost two years ago during PyCon in 2023, has languished, in part because of staffing difficulties that likely stem from unanticipated demand. Organizations will be able to apply for special status on PyPI, with their own namespace that can house multiple projects; organizations for community projects would be free, while companies would pay a small fee, with the revenue targeted for PyPI operations and maintenance. The feature is currently in a closed beta and is expected to be rolled out more widely soon.

Questions

A few days after the ToS announcement, Franz Király posted some questions in the PSF category of the Python discussion forum. He wondered whether the PSF was moving PyPI "to a paid subscription model" and if it was "looking to abolish package neutrality on pypi". He pointed to the "Paid Services" section of the new ToS and highlighted some of the text from the "PSF May Terminate" clause:

PSF has the right to suspend or terminate your access to all or any part of the Website at any time, with or without cause, with or without notice, effective immediately. PSF reserves the right to refuse service to anyone for any reason at any time.

To him, those changes might indicate the PSF is fundamentally changing course with PyPI. Beyond that, though, he suggested that there should have been some discussion with the community ahead of making a change of this sort. If that kind of discussion did happen, he asked that the PSF leadership say so and "explain where and how the decision has been taken" with a link to any public minutes.

Durbin was quick to reply, trying to keep the thread from "spiraling from FUD to chaos over the weekend". They said that the new ToS came about "to formalize the relationship and protections for maintainers, the pypi admins, and PSF in order to move forward with paid Organization accounts for corporate users". The Organization accounts will be completely optional and only cost money for "paid corporate users"; community projects will be provided the accounts "at no cost, forever" if they want them. On the question of whether the PSF is looking to abolish package neutrality for PyPI, Durbin had a one-word answer: "No".

Antoine Pitrou noted that the termination wording applies more widely than only to Organization accounts. He wondered why it was needed to move forward on the new feature. Király agreed, saying that instead of formalizing protections for the maintainers of PyPI packages, "the paragraph in fact seems to strip maintainers (as well as users) of any and all protections".

Guillaume Lemaitre also wondered about the termination clause, contrasting it with the clause about when the PSF can remove content. The content clause specifically calls out content "that, in our sole discretion, violates any laws or PSF terms or policies" for removal. Brett Cannon suggested that the likely reason for the broad termination clause is that the lawyers for the PSF, which is a US non-profit, "require it to be this broad to protect the PSF from lawsuits".

Cannon also did not think it made any real change to what the PSF can do under the existing terms. Király strongly disagreed with that interpretation, saying that the absence of a termination clause in the old terms would actually provide some level of protection to users and developers, because it would fall back to the norms and precedents from court cases in the US. Cannon argued that both he and Király were speculating and suggested contacting the email address specified in the new ToS. Király did so on March 2; as of March 8, there had not yet been any reply to the email.

Procedures

Meanwhile, there were some procedural questions that Király and others raised in the thread, which he gathered up with the hope that the PSF board could address them. He wondered when the decision was made to switch to a new ToS, if there were minutes from the meeting where that happened or was discussed, whether the membership of the PSF was consulted, and so on. Pitrou noted that, as a PSF member, he could confirm that there was no consultation of the members on the change to the PyPI terms.

A few others had views similar to Cannon's that the new terms did not really change anything of note. Both William Woodruff and Marc-André Lemburg had that take, with the latter noting that "getting some more community feedback upfront would probably have helped" with the reception of the new terms. Paul Moore agreed as well:

My personal view is that PyPI have always been free to do pretty much what they choose - that's the nature of a free service (you pay nothing, so you can expect nothing). These new T&Cs basically set the expectations for paying customers to be the same, just using legal terms that will make sense to such customers.

[...] Getting that email out of the blue was a bit of a shock. And I'm closely involved with the packaging community, so I would have expected to have had at least some indication that this was on its way.

Frankly, I hope that at least some of the funds PyPI get from the new paid features will be invested in improving their community involvement and consultation. I don't know to what extent the PyPI admins currently expect this to be coming from the PSF, but if they do, then it's not working very well :(

Moore noted that he did have some concerns about the new paid Organization feature for PyPI, mostly around the likelihood of prioritizing features for the "paying customers" over those of the regular, community "customers", but those were not connected to the changes to the terms. In a reply to Woodruff, Király outlined some of the possible conflicts that could come from the new terms:

Terms up until now: PSF can terminate in case of wrongdoing ("policy violation") or for other clearly stated "good reason". E.g., if you upload a virus package, violate someone's trademark, typosquat, etc.

Terms from March: PSF can terminate for any reason, including what most people might consider "bad reasons". E.g., for the purpose of illustration and without implying any intent here: selling a coveted project name on pypi to the highest bidder. Or, resolving name conflicts always in favour of a US entity. Or shutting down projects from a country the US is in a trade war with.

Of those commenting in the discussion, only Pitrou seemed to share some of Király's concerns: "the fact that they are now being bluntly explicit in their 'we do what we want' policy is certainly cause for concern". The other commenters largely agreed that there was no actual change to what the PSF can and cannot do.

Response

PSF Executive Director Deb Nicholson posted answers to some of the questions raised in the thread and perhaps as a response to Király's email; she also agreed that there was no real material change in the powers, just in the wording. "It is an explicit statement of our existing authority, as we understand it and as it has been applied in practice." The actual terms were based on the GitHub Terms of Service; the "with or without cause" and "for any reason at any time" language is seen as standard for services similar to PyPI. That wording is needed to allow PyPI staff to quickly respond to new types of abuse, she said. The idea to generate revenue for PyPI by offering paid accounts, which is what the Organization feature is, came from the PSF board, with discussions on that stretching back to 2020; the ToS updates were done by PSF staff in conjunction with the foundation's lawyer.

As might be guessed, Király had some follow-up questions with regard to the termination clause. There was more back and forth between Király and others about the specifics, where Király tried to respond to most posts. That hearkened back to a somewhat similar "discussion" from last year, which led to a three-month suspension for Tim Peters. As with that one, the responses are polite and on-topic, but the volume and somewhat repetitive nature of them leads to complaints. The parallels were not lost on Peters, who had some thoughts:

I advise you not to reply to messages from those who cannot answer your specific questions. People overwhelmingly don't want "a discussion" about potentially unpleasant foundational issues, and routinely assume bad faith and ill intent on the part of those who raise them. They can't be dissuaded, and trying only makes it worse.

You'll get forthright answers from those who know, or you won't. Nobody else matters here.

Király was surprised to hear that such discussions are not welcome; he was worried that it implied a kind of "groupthink" within the Python community, though he did not believe that was the case. Moore had a level-headed response that suggested the underlying problem may be that the format of a text-based forum does not lend itself to those kinds of discussions:

I think it's more a case that a lot of people have found that discussions about such potentially unpleasant foundational issues tend to be both unproductive and emotionally draining, and simply aren't worth getting into. The topics themselves are worth discussing, but it doesn't seem that this environment (for whatever reason) is the right way of doing so. You could say that it's because this environment is "unpleasant" and "toxic", but I think that's unfair, and potentially contributes to the sort of atmosphere we'd like to avoid. Rather, it's just that people have divergent views, and a text-based online forum isn't the best place for nuance and expressing willingness to discuss while remaining strongly opposed to another person's views.

That was the final word on the subject, as one of the forum moderators, David Lord, closed the thread due to multiple posts that had been flagged and because, he said, the thread was no longer going in a "productive, on-topic direction". It is unfortunate that discussions of this nature so often meet that fate, but moderation within communities is certainly hard. There are still some unanswered questions at this point—reasonable questions, at least in the eyes of some—and it is unclear if the answers will ever come to light.

Over the decades, PyPI has gone from being the CheeseShop—a whimsical Monty Python reference—in the beginning, to becoming a much more serious repository with actual paying customers. It is, however, fundamentally a community resource and it is not surprising that some in the community feel a certain level of ownership based on their contributions. Nearly all of the work that has gone into PyPI has been done by unpaid volunteers; the vast majority of the packages stored there come from unpaid community members as well. It may well feel like something of a slap in the face to be told that they can be denied access for any reason, without any recourse. Maybe that was always true, but seeing it in black-and-white terms can be, not surprisingly, somewhat difficult.

While the PSF clearly holds title to PyPI, it would not be much of a cheese shop without those countless contributions. It behooves the organization to remember that and to try to find ways to communicate changes of this nature so that the community is not surprised by them a month before they go into effect. Better still would be to find ways to include the community in the process, so that those who helped make PyPI what it is today feel that they have a voice.

Comments (10 posted)

Zig's 0.14 release inches the project toward stability

By Daroc Alden
March 12, 2025

The Zig project has announced the release of the 0.14 version of the language, including changes from more than 250 contributors. Zig is a low-level, memory-unsafe programming language that aims to compete with C instead of depending on it. Even though the language has not yet had a stable release, there are a number of projects using it as an alternative to C with better metaprogramming. While the project's release schedule has been a bit inconsistent, with the release of version 0.14 being delayed several times, the release contains a number of new convenience features, broader architecture support, and the next steps toward removing Zig's dependency on LLVM.

More targets

Zig tracks the different architectures it supports using a tier system to describe different levels of support. This release looks as though it demotes all tier-1 targets except for x86_64 on Linux to tier-2, but that's actually an artifact of changing how the project defines its tiers. Tier-1 targets used to be the ones with working code generation, a fully implemented standard library, regular continuous-integration testing, all language features known to work, and a few other conditions. The 0.14 release adds a requirement that the compiler can generate machine code for the target without relying on LLVM for it to be considered tier-1.

Zig has been trying to drop its dependency on LLVM for some time. The first step of that process was switching to the self-hosting compiler; once that was complete, the project started focusing on adding native code generation. This release includes a mostly-working x86_64 code generator, although LLVM remains the default. If it proves stable, the Zig code generator may become the default for debug builds (where performance impacts from a less-mature optimizer are less important) in the next release. The release also includes an (experimental) incremental-compilation mode. When combined with the Zig code generator, the new mode makes debug builds quite fast.

Besides the code generation changes, Zig's list of supported targets has significantly expanded in this release, largely due to additional support in the standard library for different platforms. The release notes make this bold claim:

The full list of target-specific fixes is too long to list here, but in short, if you've ever tried to target arm/thumb, mips/mips64, powerpc/powerpc64, riscv32/riscv64, or s390x and run into toolchain problems, missing standard library support, or seemingly nonsensical crashes, then there's a good chance you'll find that things will Just Work with Zig 0.14.0.

Pages and allocation

The 0.14 release also saw a major change to the way the language handles different page sizes. It used to be that each architecture was expected to have a single, static page size. This doesn't match how some architectures (notably Apple Silicon) work — the page size can vary at run time. The new Zig release switches to having two compile-time constants: a minimum and maximum size. At run time, the program can detect what the actual page size is.

This change required rewriting Zig's general-purpose memory allocator, because it assumed that the page size was fixed. Andrew Kelley, Zig's founder, wrote about the improvements that he was able to make in rewriting the allocator. In short, the old general-purpose allocator is now Zig's debug allocator, and a new multi-threaded allocator should be Zig users' first choice.

Kelley has some benchmarks showing that the multi-threaded allocator actually outperforms the GNU C library's (glibc's) memory allocator — although, as is always the case with benchmarks, simple tests can be misleading. The multi-threaded allocator is about 10% faster when using the compiler to compile itself, but Zig's compiler has a somewhat unusual approach to memory use, so whether the gains translate to other programs remains to be seen. One potential complication is memory fragmentation; allocators can trade decreased time for increased fragmentation or vice versa. Zig programmers can experiment with the multi-threaded allocator, the debug allocator, glibc's allocator, or others in order to find what works best for their programs.

Labeled switches

The biggest change to the language itself is the addition of labeled switch statements. In Zig, continue statements require specifying the label of the loop that is being continued to. This helps avoid confusion when there are nested loops in a program. Now, switch statements may be given optional labels, so that continue statements can target them. This turns switch statements into implicit loops, since control flow can return to the start of the statement. The intended use case is writing more efficient byte-code interpreters (a topic that LWN discussed in a recent article), although the feature also makes writing some finite-state automatons more straightforward.

    state: switch (State.start) {
        .start => switch (self.buffer[self.index]) {
            '0'...'9' => {
                result.tag = .number;
                continue :state .number;
            },
            else => continue :state .invalid,
        },
        .number => { ... },
        .invalid => { ... },
    }

When the values given to the continue statements are known at compile time, the compiler produces direct jumps. When they are only known at run time, the compiler produces a separate indirect jump for each case, which helps the CPU's branch predictor learn which cases are likely to follow which other cases. The first implementation of the feature ran into problems with an LLVM optimization pass that combined the jumps into one, but the version present in the release doesn't have that problem.

Another change that makes writing Zig programs more pleasant has to do with how the language interprets enumeration literals. Zig supports a limited form of type inference in the form of "Result Location Semantics"; when the compiler knows what type the result of an expression should have, that information can be used to automatically perform certain casts, or provide a shorter syntax. For example, when the compiler knows that the result of an expression is a value of an enumeration type, the programmer can leave the name of the type out — writing .number in place of State.number, as in the example above.

The 0.14 release extends this shorthand to things that aren't enumerations. Now, when the compiler sees the expression ".name", it will look for an associated constant with that name on the inferred type of the expression; if one exists, the value of that constant will be used. Besides being a useful shorthand, the new syntax lets developers rename an enum variant without needing to update all of the callers: the old name can be kept as a constant that aliases the new one. In combination with improvements to packed structures — allowing them to be compared for equality and operated on atomically — Zig programmers have a good deal of freedom to create types that seem like simple enumerations while containing other information as well.

A practical example of this use in the standard library comes from Zig's support for different calling conventions. Zig functions can optionally be given a callconv() annotation in order to set their calling convention:

    pub fn callable_from_c() callconv(.C) void { ... }

Previously, this annotation took an enumeration type specifying one of a handful of alternatives. Now, the standard library has changed it to a tagged union, so that different calling conventions can add additional configuration options. Most of them permit setting the required stack alignment, for example. The previous enumeration values have become constants defined on the type, so the new literal syntax ensures existing code doesn't need to change.

The future

Zig's development community remains active, with 350 commits by 58 authors over the past month — a speed that's fairly typical for the project. Despite this, the 1.0 release of the language remains several years away. Kelley has already made a strong commitment to backward compatibility once Zig reaches version 1.0, and consequently has a long list of experimental features and deficiencies in the current implementation that should be solved before then.

The number of actual changes to the language itself in each release seems to be slowing down, with this release boasting many small changes, as opposed to the wide-reaching differences of previous releases. That stability seems to have afforded the project the chance to spend time improving the tooling instead, as demonstrated by the expanded number of supported architectures and improvements to the compiler's backend. Despite that improvement, there are still a large number of known problems with Zig's tooling, not least of all because of how expansive that tooling is. Zig includes a lot of tools that other languages do not — a translator for C code, a cross-platform C compiler, a build system, and (new in this release) a built-in fuzzer — and it seems likely that the development cycle for the next release, expected in about six months, will be spent improving them.

Comments (27 posted)

The road to mainstream Matrix

By Joe Brockmeier
March 11, 2025

FOSDEM

Matrix provides an open network for secure, decentralized communication. It has enjoyed some success over the last few years as an IRC replacement and real-time chat for a number of open-source projects. But adoption by a subset of open-source developers is a far cry from the mainstream adoption that Matthew Hodgson, Matrix project lead and CEO of Element (the company that created Matrix), would like to see. At FOSDEM 2025, he discussed the history of Matrix, its missteps in chasing mainstream adoption, its current status, as well as some of the wishlist features for taking Matrix into the mainstream.

Hodgson said that the mission for Matrix is to build the real-time communication layer for the open web so that "no single party owns your conversations". Matrix is an open standard for a decentralized communication protocol. Often, when people refer to Matrix, though, they may be referring as well to the server and client implementations, or the providers, such as Matrix.org, that offer servers for users. The Try Matrix page is a useful start for those who have not yet dipped their toe in the Matrix waters.

Quite a few people have taken the plunge already, though the numbers are small when compared to other systems like Slack, Discord, and Microsoft Teams, which Hodgson named as competitors to Matrix. Hodgson displayed a slide with the number of "retained users" across reporting servers. Retained users are the users who have been "actually hanging around for at least 30 days on the platform". If the user disappears, they are no longer counted. In 2019, the retained user count was in the tens of thousands. In the six years since "it's been growing fairly linearly" to reach nearly 350,000 users by early 2025.

[Matrix active users]

The dream

The rest of the talk, Hodgson said, would be about the road to making Matrix mainstream, "and this is a road with a past and a future". It started in May 2014 with the dream of building the missing communication layer for the web. At the time Hodgson worked at Amdocs with a team of "about ten folks who'd been working together about ten years doing lots of [session initiation protocol] SIP" work. The team wrote its own voice over IP (VoIP) stacks and that sort of thing. The idea was that the team would build an implementation first (called "Project Synapse" originally) and then use that implementation to prove that the protocol works. "Then, having implemented it, we would spec it, updating the One True Spec. This would keep us anchored in reality. Hopefully."

Project Synapse was publicly launched, using the Apache License version 2.0, in September 2014 at the TechCrunch Disrupt conference after an all-night hackathon. At Disrupt, the project shipped Synapse (the server) and a client called Syweb (later renamed Matrix Console). Hodgson said that it actually had a lot of features when launched. "You'd have chat, you'd have one-to-one VoIP, you'd get federation in there too. I think VoIP got added literally at the last hour." The whole idea was to prove the implementations and then create the spec. After the launch, the team then went around the world to tell people about it and convince them to use it. That included a talk at FOSDEM 2015.

Hodgson said that the team really wanted Matrix to be more than chat, and wanted it to be mainstream "not just for geeks. No offense." For that, the project needed a killer app to bring funding for development. "We figured out there were 75 different products you can build on Matrix [the protocol]". He said that one idea was to create a "visual chat application" that featured something like Apple's Animoji, but four years before Apple had shipped the concept. Other product ideas included virtual worlds, government communications tools, or dating apps. In the end, the team decided to converge on a professional collaboration application it originally code-named "Skype done right". It was named Vector on launch—as in "a vector to push Matrix out into the world". That was launched in 2016. Then Vector was renamed Riot, and ultimately became Element in July 2020.

The original sin and mainstream miss

End-to-end encryption (E2EE) work started about the same time that Vector was launched. Hodgson called the E2EE work "the original sin of Matrix" because it slowed development velocity down significantly. People who remembered Matrix before E2EE development began may have felt "oh my God, this thing is going incredibly fast". Then everything ground "not to a halt, but probably ten times slower than we'd been going before". As it turns out, decentralized encryption is rather hard.

At the same time, Matrix was getting real attention. People were excited about a new protocol that would liberate them from the silos of proprietary applications. "Entirely my fault for hyping it to the heavens and back," he said. There was lots of pressure from the community to improve the spec, because it was "very much following what we were implementing". Unfortunately, the governance for a spec process wasn't there. The team behind Matrix wound up investing a lot of time into trying to make the spec something everyone could build on rather than shooting for mainstream adoption with a killer app people could use.

And I think it's an interesting and very controversial viewpoint, which will probably get people to throw rotten fruit at me, that perhaps in retrospect we should have ruthlessly prioritized polishing the app or an app to drive mainstream adoption. Rather than, perhaps prematurely, focusing on the open ecosystem.

One thought experiment, Hodgson said, was that Matrix itself should not have been primarily positioned as a protocol. It could have been the name of an app "which was a Trojan horse, unashamedly, for a protocol". Rather than setting expectations from the outset that Matrix would be the missing communications layer of the web, "you could have just shipped a really, really good thing like, say, Skype". He noted that the Bluesky social-media service is taking this approach with its Authenticated Transfer Protocol (atproto). This has been "pretty controversial, but it seems to be working out for them" in that they are effectively smuggling a decentralized protocol under the guise of a centralized communication network.

Hodgson said that he knows the people developing Bluesky and believes that they are "genuinely aiming to decentralize things", but they are following the track that Matrix did not follow. As a result, they have a mainstream, successful, decentralized social-media app "whereas we've been on this Escher-like infinite staircase".

Most of the initial work on Matrix was funded by Amdocs until 2017 when the team set up a company called New Vector, which was a subsidiary of Amdocs. In 2018, the Matrix.org Foundation was established to ensure that Matrix would be independent from commercial vendors. In 2020 the company took the name Element, and adopted the Element naming for its products as well. Hodgson said that the company spent 70 to 80 percent of its funding on building out Matrix for many years before transferring its IP to the Matrix Foundation.

1.0 ships

From 2018 to 2021 Matrix entered what Hodgson called its "halcyon days", with many open-source communities—and government entities in France, Germany, and elsewhere—rolling out their Matrix presence. There were hints of killer apps forming, such as one of the forks of Element, Beeper. Lots of projects and products were being built on Matrix, like peer-to-peer Matrix applications, low-bandwidth versions, Gitter integration, virtual-reality demos, and other applications beyond chat. "All of this inspirational work made for some fun demos at FOSDEM, but it did steal a lot of energy from writing that killer mainstream app."

In 2022, "the wheels come off". The markets crashed post-COVID, there was no more zero-interest-rate policy, and "no more investment, really". Element is not yet profitable, and not in a position to raise more money, but it is seeing adoption in the public sector. "Governments realize that the idea your country is operationally dependent on a private US tech company like Microsoft is bananas." So Element decided to focus on government implementations to become more profitable and sustainable.

That, said Hodgson, has two big failure modes. The first is losing bids on contracts to large system integrators. A government entity would put out a tender for a Matrix deployment, which would wind up going to the integrators because they have local staff, existing contracts, and know all the right people. The integrators then pick up the open-source Synapse off GitHub and support it themselves, without any of the money being routed to the upstream development. That happens again and again, Hodgson said.

Basically, almost every deal, I think, that we had on the Element side for providing Matrix to governments ended up disappearing to somebody who basically could win it because they didn't have the cost of developing Matrix.

The second failure mode is that a government is willing to put public money toward open-source feature development, but not toward maintenance or support fees. The upstream gets paid to do features, which is good, he said. For example, all of the Matrix 2.0 work on the Element X Matrix client app for iOS and Android has been funded by government contracts. But those features may not be applicable to a mainstream client. Or the government customer may say "here's $500,000, go and implement threads or whatever." But, Hodgson said, threads, which allow users to visually branch their conversations in a Matrix room, are a maintenance nightmare and no one is paying for continued development and maintenance of that feature.

Survival time

By the end of 2023, Hodgson said that Matrix was being taken for granted as a commons. The creators of open-source implementations were cut out of paid implementations, as there was little incentive to pay for Matrix's development. "So it started to get fairly drastic at the end of 2023". At this point Element started to focus on being sustainable and switched from the Apache License 2.0 to the AGPLv3 for its contributions to Synapse and other backend Matrix projects. The research and development projects were shelved in favor of focusing purely on stability and quality, which was "arguably a good thing for the ecosystem overall". Even with those changes, though, free riding is still posing an existential threat to the whole ecosystem.

Free riding is the technical economic term for this failure mode when people take the free thing and milk it for all it's worth and don't maintain it.

Hodgson said that the simple answer is that organizations buying a commercial Matrix deployment should also mandate that the upstream project is funded by buying that upstream's products. He suggested that organizations should "only buy from people who are actually going to fund upstream maintenance" and normalize paying for open-source products as much as they pay for proprietary ones.

Licenses for things like Adobe Acrobat or Microsoft products are in the billions of Euros per year for government agencies, but then the same agencies say "we don't have the two million a year that would be needed to provide some support for the Matrix stuff". People might feel that this is just "evil Element trying to force people to buy their thing", he said, but there are other open-source implementations out there. If the upstream isn't suitable, then switch, and that would be "true public money for public code."

Some organizations, such as Germany's Center for Digital Sovereignty (ZenDiS), do get it right in Hodgson's book. It provides openDesk collaboration software and has worked with Element, Nextcloud, Open-Xchange, and others to incorporate those projects into the product. ZenDiS's ground rules, he said, were "we will just go and pay for their professionally supported product, full stop". Hodgson said that a lot of Matrix vendors, including Element, have ended up focusing on government and healthcare initiatives to survive. As a result, "you end up with enterprise features prioritized over the mainstream consumer and community-centric features". But sometimes what is good for government adoption is good for mainstream uptake as well.

Matrix 2.0

Hodgson introduced Matrix 2.0 in 2023 at FOSDEM. Previously the idea had just been to make Matrix work right, but with Matrix 2.0 the idea is to "make it not suck". That means making Matrix fast and usable—and that means a number of proposals to enhance the spec. Proposals to change or enhance the Matrix specification are offered as spec change proposals, known as MSCs, which follow a process that is likely familiar to anyone who has worked with a standards process before. Hodgson said that implementations of the MSCs for Matrix 2.0 "that prove that 2.0 could work" landed in September 2024.

One of those is MSC3861, which lays out next-generation authentication for Matrix. It adds industry-standard OpenID Connect (OIDC), QR login, two-factor authentication, and more. Currently, Matrix's OIDC implementation is merely "OIDC-ish", according to Thibault Martin, who wrote a useful blog post in 2023 to explain some of the authentication changes for Matrix 2.0. According to Hodgson, MSC3861 is implemented in the Matrix Authentication Service and many Matrix clients. He said that the plan is to turn next-generation authentication on after it passes the final comment period in the MSC process "so that we drag everybody on the Matrix.org instance, kicking and screaming, into the brave new OIDC-world future".

Simplified Sliding Sync (MSC4186) is a feature that will, Hodgson said, provide "instant login, instant launch, and instant sync" of a user's chat history and rooms. As an occasional user of Matrix, I can attest that this is a sorely needed feature. The wait for Matrix to load is painful. He said that it is close to entering the final comment period, but has an edge case that needs to be sorted out, and is already implemented in the Synapse and conduwuit servers.

Matrix real-time communication (MatrixRTC) is a set of MSCs that define native encrypted group VoIP and video calls with pluggable media engines. Hodgson said that this is already implemented in Element Call, Element Call Web, and Element X clients as well as a WhatsApp-like client called Famedly that is aimed at German healthcare providers, but that "we are still incorporating feedback" on the specification.

Finally, there is the "invisible crypto" specification, which is laid out in MSC4153 ("Exclude non-cross-signed devices") and others. The idea behind this specification is to make E2EE "invisible" to users, as it is on systems like Signal, iMessage, and others. Hodgson said that there are people who don't understand the specification and say "this is crap, they're removing all the warnings, it won't be safe anymore". That is the opposite of what invisible crypto is. "The whole idea is to make it more safe by removing the unactionable cryptic (see what I did there?) warnings" about unverified devices and message authenticity. It will require users to identify their devices at login and then it will simplify the experience of using E2EE massively.

Implementations of invisible crypto were launched with the Element X client and Element Web but "we kind of forgot to actually ship this in a community-friendly distribution". He said that there would be AGPLv3 Helm charts (a packaging format for deploying software on Kubernetes) "coming soon" to allow community users to deploy the feature.

Hodgson talked a bit about the first Matrix conference, which was held in September 2024, and noted that there were many independent, production-grade implementations of Matrix available and discussed at the conference. "This is a properly heterogeneous ecosystem" with totally separate stacks without any shared code. That demonstrated, he said, that Matrix was a true ecosystem and not controlled by Element.

The next ten years

Hodgson briefly discussed State Resolution 3 and Chaos. State Resolution 3 is an overhaul of the way that Matrix replicates state (such as room membership and permissions) between servers. The plan for State Res 3 involves the Time Agnostic Room DAG Inspection Service (TARDIS), which provides a "time-traveling debugger" for Matrix room directed-acyclic graphs (DAGs). Chaos is a fault-tolerance-testing tool for Matrix servers, in the vein of Netflix's Chaos Monkey. While Hodgson did not go into detail on these during his talk, they were the topic of another talk at FOSDEM by Kegan Dougal, "Demystifying Federation in Matrix".

Messaging Layer Security (MLS, RFC 9420) defines efficient key exchange across a set of devices. "If I had another three hours, I would talk about MLS in great detail", Hodgson said. There are two proposals to add proper MLS to Matrix, and it is an active area of research, but both proposals are blocked on funding. There is also a pull request to add the post-quantum extended Diffie-Hellman (PQXDH) protocol to Matrix, which was mentioned at the last FOSDEM. Hodgson said that "we assumed that somebody would leap out of the audience" to offer funding for that work, but a year later nothing has happened. Element is also experimenting with post-quantum group messaging but again, he said, progress is blocked on funding. There are lots of missing features needed for mainstream support on the client and server, Hodgson said. For example, there is lots of trust and safety work to be done. "We are expanding the team there and making some progress". He did not discuss other features, but his slides listed features like custom emojis, voice rooms ("for Discord folks"), and safety tooling to hide things like invite spam.

The Matrix Foundation, he said, is key to the future of Matrix. "The next ten years of Matrix will be nurtured by the foundation if it has funding". He asked the audience to join the foundation, or "even better, bully your employer to join if they like Matrix and use Matrix". There was no time left for questions, but he invited the audience to join the Matrix state of the union talk immediately after his. The video and slides for Hodgson's talk have been uploaded to the road to Matrix talk page on the FOSDEM 2025 web site.

Coda

On February 20, Martin and Robin Riley posted an update to the Matrix.org blog with a plea for funding.

The Matrix.org Foundation has gone from depending entirely on Element, the company set up by the creators of Matrix, to having half of its budget covered by its 11 funding members, which is a great success on the road to financial independence! However half of the budget being covered means half of it isn't. Or in other words: the Foundation is not yet sustainable, despite running on the strictest possible budget, and is burning through its (relatively small) reserves. And we are at the point where the end of the road is in sight.

The bottom line, according to the post, is that the foundation needs to raise $100,000 of funding by the end of March, or it will have to shut down bridges to other networks that are hosted by the foundation. This includes bridges to Slack, XMPP, and IRC.

Raising $100,000 would extend the runway by one month, but the foundation says that it needs to raise an additional $610,000 to break even. The full cost of operations is $1.2 million, but the foundation is only bringing in about $561,000 currently. The post lists a number of ways individuals or organizations might help fund the foundation.

Seeing open-source projects build their communication strategies around proprietary tools, like Discord and Slack, is discouraging. Some of us might even say short-sighted. Whatever warts Matrix may have, it is one of the few open alternatives that is viable and relatively simple for projects to adopt. One hopes that the foundation will find ways to be sustainable.

[I was unable to attend FOSDEM in person this year, but watched the video for the talk when it became available. Many thanks to the video team for their work in recording all the FOSDEM sessions and making them available.]

Comments (81 posted)

Timer IDs, CRIU, and ABI challenges

By Jonathan Corbet
March 6, 2025
The kernel project has usually been willing to make fundamental internal changes if they lead to a better kernel in the end. The project also, though, goes out of its way to avoid breaking interfaces that have been exposed to user space, even if programs come to rely on behavior that was never documented. Sometimes, those two principles come into conflict, leading to a situation where fixing problems within the kernel is either difficult or impossible. This sort of situation has been impeding performance improvements in the kernel's POSIX timers implementation for some time, but it appears that a solution has been found.

Timers and CRIU

The POSIX timers API allows a process to create its own private interval timers based on any of the clocks provided by the kernel. A process calls timer_create() to create such a timer:

    int timer_create(clockid_t clockid, struct sigevent *sevp, timer_t *id);

The id argument is a pointer to a location where the kernel can return an ID used to identify the new timer; it is of type timer_t, which maps eventually to an int. Various other system calls can use that ID to arm or disarm the timer, query its current status, or delete it entirely. The man page for timer_create() indicates that each created timer will have an ID that is unique within the creating process, but makes no other promises about the returned value.

The "unique within the process" guarantee came with the 3.10 kernel release in 2013; previously, the timer IDs were unique system-wide. To understand that change, one has to look at the Checkpoint/Restore in Userspace (CRIU) project, which has long worked on the ability to save the state of a group of processes to persistent storage, then restore that group at a future time, possibly on a different system. Reconstructing the state of a set of processes well enough that the processes themselves are not aware of having been restored in this way is a challenging task; the CRIU developers have often struggled to get all of the pieces working (and to keep them that way).

POSIX timers were one of the places where they ran into trouble. To restore a process that is using timers, CRIU must be able to recreate the timers with the same ID they had at checkpoint time, but the system-call API provides no way to request a specific timer ID. Even if such an ability existed, though, the existence of a single, system-wide ID space for timers was an insurmountable problem; CRIU might try to recreate a timer for a process, only to find that some other, unrelated process in the system already had a timer with that ID. In such cases, the restore would fail.

This problem was addressed with this patch from Pavel Emelyanov, which implemented a new hash table to store timer IDs. That table was still global, but the timer IDs kept therein took the identity of the owning process (specifically, the address of its signal_struct structure) into account, separating each process's timer IDs from all the others. At that point, the problem of ID collision when restoring a process went away.

The other problem — the lack of a way to request a specific timer ID — remained, though. To address that problem, CRIU stuck with the approach it had used before, which was based on some internal knowledge about how the kernel allocates those IDs. There is a simple, per-process counter, starting at zero, that is used for timer IDs; that counter is incremented every time a new timer is created. So a series of timer_create() calls will yield a predictable series of IDs, counting through the integer space. When CRIU must create a timer with a specific ID within a to-be-restored process, it takes advantage of this knowledge and simply runs a loop, allocating and destroying timers, until the requested ID is returned.

If a process only creates a small number of timers in its lifetime, this linear ID search will not take long. Checkpointing, though, is often used on long-running processes in order to save their state should something go wrong partway through. That kind of process, if it regularly creates and destroys timers, can end up with IDs that are spread out widely in the integer space. That, in turn, means it can take a long time to land on the needed ID at restore time.

Without a paddle

In 2023, Thomas Gleixner sent this summary in response to a timer bug report; he noted that in some cases, the allocation loop "will run for at least an hour to restore a single timer". That is not the speedy restore operation that CRIU users may have been hoping for. But the real problem at the time was that the requirement to allocate timer IDs sequentially in the kernel was getting in the way of some needed changes to the internal global hash table, which, in turn, were blocking other changes within the timer subsystem. Since that behavior could not be changed without breaking CRIU, Gleixner concluded that the kernel was "up the regression creek without a paddle".

At the time, some possible solutions were considered. Reducing the ID space from 0..INT_MAX to something smaller could speed the ID search, but it would still be an ABI break; CRIU would no longer be able to restore any process that had created timers with a larger ID. A new system call to create a timer with a given ID was another possibility but, due to how the timer API works (and the sigevent structure it accepts), the 64-bit and 32-bit versions of the system call could not be made compatible. That would require the addition of another "compat" system call, which is something the kernel developers have gone out of their way to avoid for some time. In the end, the conversation wound down with no solution being found.

In mid-February 2025, networking developer Eric Dumazet posted a patch series aimed at reducing locking contention in the kernel's timer code, citing "incidents involving gazillions of concurrent posix timers". That work elicited some testy responses from Gleixner, but there was no questioning the existence of a real problem. So Gleixner went off to create his own patch series, incorporating Dumazet's work, but then aiming to solve the other problems as well. Most of the series is focused on implementing a new hash table that lacks the performance problems found in the current kernel; benchmark results included in the cover letter show that some success was achieved on that front.

A better solution for CRIU

But then Gleixner set out to solve the CRIU problem as well. Rather than create a new system call to enable the creation of a timer with a specific ID, though, he concluded that the id argument to timer_create() could be used to provide that ID. All that is needed is a flag to tell timer_create() to use the provided value rather than generating its own ... but timer_create() has no flags argument. So, if timer_create() is to gain the ability to read a timer ID from user space, some other way needs to be found to let it know that this behavior is requested.

The answer is a pair of new prctl() operations. A call like this:

    prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON);

will cause the calling process to enter a "timer restoration mode" that causes timer_create() to read the requested timer ID from the location pointed to by the id parameter passed from user space. The special value TIMER_ANY_ID can be provided in cases where user space does not have an ID it would like to request. Another prctl() call with PR_TIMER_CREATE_RESTORE_IDS_OFF will exit the restoration mode, causing any subsequent timer_create() calls to generate an ID internally as usual.

This functionality is narrowly aimed at CRIU's needs. Normally, adding this kind of process-wide state would be an invitation for problems; some distant thread could make a timer_create() call while the restoration mode is enabled, but expecting the old behavior, and thus be unpleasantly surprised. But CRIU can use this mode at the special point where the restarted processes have been created, but are not yet allowed to resume running at the spot where they were checkpointed. At that time, CRIU is entirely in control and can manage the state properly.

Another important point is that the prctl() call will fail on an older kernel that does not support the timer restoration mode. When CRIU sees that failure, it can go back to the old, brute-force method of allocating timers. The CRIU developers will thus be able to take advantage of the new API while maintaining compatibility for users on older kernels.

One problem that will remain even after this series is merged is that the sequential-allocation behavior of timer_create(), in the absence of the new prctl() operation, is still part of the kernel's ABI. The timer developers never meant to make that promise, but they are stuck with it for as long as CRIU installations continue to depend on it. The good news is that updating CRIU will generally be necessary for users who update their kernels anyway, since that is the only way to get support for newer kernel features. So, perhaps before too long, the sequential-allocation guarantee for timer_create() can be retired — unless some other user that depends on it emerges from the woodwork.

Comments (12 posted)

Hash-based module integrity checking

By Daroc Alden
March 7, 2025

On January 20, Thomas Weißschuh shared a new patch set implementing an alternate method for checking the integrity of loadable kernel modules. This mechanism, which checks module integrity based on hashes computed at build time instead of using cryptographic signatures, could enable reproducible kernel builds in more contexts. Several distributions have already expressed interest in the patch set if Weißschuh can get it into the kernel.

Linux has supported signing loadable kernel modules since 2012, when David Howells introduced the first code for it. Since then, the implementation has not changed too much. Users can enable the CONFIG_MODULE_SIG option to turn on module signing; by default, this will simply taint the kernel if an unsigned module is loaded. Enabling CONFIG_MODULE_SIG_FORCE or the Lockdown Linux Security Module (LSM) prevents the kernel from loading unsigned modules. The public keys needed to verify signatures are baked into the kernel at build time. The build process can be configured to use an existing keypair, or to automatically generate the necessary asymmetric keys.

That automatic generation is Weißschuh's gripe with the current code. Reproducible builds are important for security, since they allow independent verification that an open-source project has been compiled without inserting extra, malicious changes. For something as foundational as the Linux kernel, it would be nice to be able to verify that a build of the kernel is unmodified. But when signing keys are needed for the build, it cannot be made reproducible without distributing the key. Currently, this puts users in a bind. They cannot have loadable modules, reproducible builds, and signature verification all turned on at the same time.

It's tempting to search for a clever cryptographic solution, but nobody has yet proposed one. If the signing keys are publicly available for use in recreating the build, malicious actors could also sign modified loadable modules with them. If they aren't publicly available, the build can't be reproduced. So Weißschuh's patch set takes a much simpler approach: instead of trying to create a signature for loadable modules, the patch set calculates cryptographic hashes for all modules built with the kernel, and embeds those hashes as a static list to verify prospective modules before they are loaded.

This has the benefit of simplicity — if a module has the same hash that it had at build time, it certainly hasn't been tampered with — but it is a bit less flexible than the signature-based approach. Specifically, Weißschuh's patch set only works for modules that are built at the same time as the kernel. The default build configuration, which generates new keys for each build, has the same limitation, but users can use their own long-term signing keys if they prefer. Out-of-tree modules, DKMS modules, and so on can't have their hashes included in the kernel, and therefore can't be verified by hash.

Nothing prevents users from enabling both Weißschuh's new code (with CONFIG_MODULE_HASHES) and the existing signature verification support, although in that case the resulting kernel isn't reproducible. This will build all the in-tree modules with hashes (but not signatures), and still allow the loading of signed (but not hashed) out-of-tree modules. Arch Linux contributor "kpcyrd" summarized the possible options.

Combining the settings for module hashing and module signing yields four possibilities:

- Neither CONFIG_MODULE_HASHES nor CONFIG_MODULE_SIG enabled: the kernel does not perform any checks (and therefore Lockdown will reject loading any modules).
- Only CONFIG_MODULE_SIG enabled: the behavior is the same as today; signed modules can be loaded.
- Only CONFIG_MODULE_HASHES enabled: modules with a known hash can be loaded.
- Both enabled: a module that has either a known hash or a valid signature can be loaded.

If CONFIG_MODULE_SIG_FORCE is enabled, CONFIG_MODULE_SIG must also be enabled. If neither CONFIG_MODULE_SIG_FORCE nor the Lockdown LSM is enabled, the kernel won't disallow modules without valid signatures or hashes, and therefore there is little point in configuring either signed or hashed modules.

For users of Arch Linux, Proxmox, and SUSE — projects that Weißschuh indicated were interested in this work based on the previous discussion — the last option is the most likely configuration for a normal kernel build. NixOS does not enable module signing by default, and is more likely to only enable CONFIG_MODULE_HASHES, but was still interested. The patch set is not yet quite ready to go into the mainline kernel, however. Petr Pavlu suggested a few tweaks for Weißschuh to make. A handful of other people participated in the discussion, although there was not too much objection to the patch set in the face of the evident interest from the different distributions. Once Weißschuh revises the patch set, it seems likely to be merged, removing one of the few remaining obstacles to deploying reproducible kernel builds.

Comments (29 posted)

Capability analysis for the kernel

By Jonathan Corbet
March 10, 2025
One of the advantages of the Rust type system is its ability to encapsulate requirements about the state of the program in the type system; often, this state includes which locks must be held to be able to carry out specific operations. C lacks the ability to express these requirements, but there would be obvious benefits if that kind of feature could be grafted onto the language. The Clang compiler has made some strides in that direction with its thread-safety analysis feature; two developers have been independently working to take advantage of that work for the kernel.

The Clang feature is based on the concept of "capabilities" that a program can be determined — at compile time — to hold (or not) at any given point. Capabilities are typically the address of a data structure; for example, the address of a specific spinlock can be designated as a capability that a program can acquire with a lock operation. Functions can be annotated to indicate that they acquire or release a capability; developers can also indicate that callers of a function must hold (or not hold) a specific capability.

Adding analysis to the kernel

Bart Van Assche posted a patch series on February 6 showing how Clang's thread-safety feature could be used with the kernel's mutex type. The core of this work can be found in this patch, which annotates the various mutex-related functions; for example, the prototypes for mutex_lock() and mutex_unlock() are modified to be:

    void mutex_lock(struct mutex *lock) ACQUIRE(*lock);
    void mutex_unlock(struct mutex *lock) RELEASE(*lock);

The first line says that a call to mutex_lock() will gain a capability in the form of the pointed-to mutex, while calling mutex_unlock() will give up that capability. The ACQUIRE() and RELEASE() macros are built on top of Clang's lower-level macros; there are quite a few other macros in the set. With that infrastructure in place, any function that requires a specific mutex to be held can be annotated accordingly; for example:

     static struct devfreq_governor *try_then_request_governor(const char *name)
	REQUIRES(devfreq_list_lock);

The compiler will then warn on any call to that function if the possession of the indicated lock (devfreq_list_lock) cannot be determined. There is also a series of macros with names like GUARDED_BY() to document that access to specific data (a structure member, for example) requires that a certain mutex be held. Those macros are not actually used in the posted series, though.

Van Assche's patch set is focused on the mutex type, and attempts to annotate usage throughout the entire kernel (though many of the annotations are NO_THREAD_SAFETY_ANALYSIS, which disables the analysis because the locking is too complicated for Clang to figure out — a situation that arises frequently). This work culminates in a massive patch touching over 800 files, which is a significant amount of code churn. The work has already found a number of locking bugs, the fixes for which are included in the series.

An alternative approach

On the same day, Marco Elver posted a patch set of his own with a slightly different approach to using the same feature; that series has since been updated, adopting the term "capability analysis" in place of "thread-safety analysis". While Van Assche used a breadth-first approach with mutexes, Elver has gone depth-first with a series that adds annotations for several locking primitives, but which is only active in subsystems that explicitly opt into it. In that way, warnings can be turned on for code where the maintainers and developers are interested in them (and will act on them), while being left off for the rest of the kernel.

The syntax of the annotations is a little different from Van Assche's approach, but the intent is clearly the same:

    void mutex_lock(struct mutex *lock) __acquires(lock);
    void mutex_unlock(struct mutex *lock) __releases(lock);

Elver's series, though, goes beyond mutexes to add annotations for spinlocks, reader-writer locks, seqlocks, single-bit spinlocks, read-copy-update, local locks, and wait/wound mutexes. In many cases, the annotations that already exist for the kernel's locking correctness validator (lockdep) have been reworked to add the needed capability declarations. There is a __guarded_by() annotation to document the lock that must be held to access a specific structure member; its use can be seen in this patch instrumenting the kfence subsystem. The capability_unsafe() marker disables checking for a block of code. Most of the new annotations, along with documentation, can be found in this patch.

The other difference found in Elver's approach is the explicit opt-in design, which allows each subsystem to enable or disable the feature. By default, any given subsystem will not be covered by this analysis; that can be changed by adding one or more lines to the subsystem's makefile:

    CAPABILITY_ANALYSIS := y
    CAPABILITY_ANALYSIS_foo := y

The first line will enable analysis for all code compiled by way of that makefile, while the second will enable it only for the compilation creating foo.o. The patch set enables the feature for a number of kernel subsystems, including debugfs, kernel fences, rhashtables, tty drivers, the TOMOYO security module, the crypto subsystem, and more.

What next?

It would appear that the community has a wealth of riches here: two competing patch sets that aim to use the same compiler feature to improve correctness checking within the kernel. Either series can increase confidence that locking is being handled correctly, and both work entirely at compile time, with no run-time overhead. The reception for this work has been quite positive, with the only open question seemingly being which series would be accepted, or whether the kernel might adopt a combination of the two.

There are no definitive answers to that question, but a clue can be found in the fact that Van Assche has been posting comments (and Reviewed-by tags) for Elver's patch set. Peter Zijlstra has also tried his hand with Elver's work in the scheduler subsystem, saying that "this is all awesome". That attempt pointed to some needed changes; it seems Zijlstra also managed to crash the Clang compiler. He later pointed out that the capability analysis works in simple cases, but for "anything remotely interesting, where you actually want the help, it falls over".

Real use in the kernel (and beyond) may well help to drive development work in the Clang community to improve this analysis feature to the point that it can be routinely used to verify locking patterns. Some of that work may need to happen before support for this kind of capability analysis can be added to the kernel. But, since it is an opt-in, compile-time feature, there may well be value to adding it relatively soon, even if it needs further work. The kernel community seems to be hungry for this kind of support.

Comments (25 posted)

Page editor: Jonathan Corbet


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds