LWN.net Weekly Edition for April 4, 2024
Welcome to the LWN.net Weekly Edition for April 4, 2024
This edition contains the following feature content:
- Free software's not-so-eXZellent adventure: how the XZ project was compromised, and where we go from here.
- How the XZ backdoor works: the technical details of the XZ backdoor.
- The race to replace Redis: now that Redis has a non-free license, there are multiple efforts to establish a free fork.
- Improving performance with SCHED_EXT and IOCost: two SCALE presentations covering scheduling and I/O bandwidth control.
- A memory model for Rust code in the kernel: which memory model should Rust code use in the kernel?
- Declarative partitioning in PostgreSQL: the PostgreSQL project is making the partitioning of databases easier.
- Radicle: peer-to-peer collaboration with Git: providing the useful features of Git forges without the centralization.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Free software's not-so-eXZellent adventure
A common theme in early-days anti-Linux FUD was that, since anybody can contribute to the code, it cannot be trusted. Over two decades later, one rarely hears that line anymore; experience has shown that free-software communities are not prone to shipping overtly hostile code. But, as the backdooring of XZ has reminded us, the embedding of malicious code is, unfortunately, not limited to the proprietary realm. Our community will be busy analyzing this incident for some time to come, but clear conclusions may be hard to come by. The technical details of the attack are fascinating in many ways. See the companion article for more about that aspect of things.
For those needing a short overview: the XZ package performs data compression; it is widely distributed and used in many different places. Somebody known as "Jia Tan" managed to obtain maintainer-level access to this project and used that access to insert a cleverly concealed backdoor that, when the XZ library was loaded into an OpenSSH server process, would provide an attacker with the ability to run arbitrary code on the affected system. This code had found its way into some testing distributions and the openSUSE Tumbleweed rolling distribution before being discovered by Andres Freund. (See this page for a detailed timeline.)
The hostile code was quickly removed and, while it is too soon to be sure, it appears that it was caught before it could be exploited. Had that discovery not happened, the malicious code could have found its way onto vast numbers of systems. The consequences of a success at that level are hard to imagine and hard to overstate.
Social engineering
Like so many important projects, XZ was for years the responsibility of a single maintainer (Lasse Collin) who was keeping the code going on his own time. That led to a slow development pace at times, and patches sent by others often languished. That is, unfortunately, not an unusual situation in our community.
In May 2022, Collin was subjected to extensive criticism in this email thread (and others) for failing to respond quickly enough to patches. That, too, again unfortunately, is not uncommon in our community. Looking back now, though, the conversation takes on an even more sinister light; the accounts used to bully the maintainer are widely thought to have been sock puppets, created for this purpose and abandoned thereafter. In retrospect, the clear intent was to pressure Collin into accepting another maintainer into the project.
During this conversation, Collin mentioned (more than once) that he had been receiving off-list help from Tan, and that Tan might be given a bigger role in the future. That, of course, did happen, with devastating effects. Tan obtained the ability to push code into the repository, and subsequently abused that power to add the backdoor over an extended period of time. As well as adding the backdoor, Tan modified the posted security policy in an attempt to contain the disclosure of any vulnerabilities found in the code, changed the build system to silently disable the Landlock security module, redirected reports from the OSS-Fuzz effort, and more.
Once the malicious code became part of an XZ release, Tan took the campaign to distributors in an attempt to get them to ship the compromised versions as quickly as possible. There was also a series of patches submitted to the kernel that named Tan as a maintainer of the in-kernel XZ code. The patches otherwise look innocuous on their face, but do seem intended to encourage users to update to a malicious version of XZ more quickly. These patches made it as far as linux-next, but never landed in the mainline.
Much has been made of the fact that, by having an overworked and uncompensated maintainer, XZ was especially vulnerable to this type of attack. That may be true, and support for maintainers is a huge problem in general, but it is not the whole story here. Even paid and unstressed maintainers are happy to welcome help from outsiders. The ability to take contributions from — and give responsibility to — people we have never met from a distant part of the world is one of the strengths of our development model, after all. An attacker who is willing to play a long game has a good chance of reaching a position of trust in many important projects.
This whole episode is likely to make it harder for maintainers to trust helpful outsiders, even those they have worked with for years. To an extent, that may be necessary, but it is also counterproductive (we want our maintainers to get more help) and sad. Our community is built on trust, and that trust has proved to be warranted almost all of the time. If we cannot trust our collaborators, we will be less productive and have less fun.
Closing the door
As might be expected, the Internet is full of ideas of how this attack could have been prevented or detected. Some are more helpful than others.
There have been numerous comments about excess complexity in our systems. They have a point, but few people add complexity for no reason at all; features go into software because somebody needs them. This is also true of patches applied by distributors, which were a part of the complex web this attack was built on. Distributors, as a general rule, would rather carry fewer patches than more, and don't patch the software they ship without a reason. So, while both complexity and downstream patching should be examined and avoided when possible, they are a part of our world that is not going away.
In the end, the specific software components that were targeted in this attack are only so relevant. Had that vector not been available, the attacker would have chosen another. The simple truth is that there are many vectors to choose from.
There is certainly no lack of technical and process questions that should be asked with regard to this attack. What are the dependencies pulled in by critical software, do they make sense, and can they be reduced? Why do projects ship tarballs with contents that are not found in their source repository, and why do distributors build from those tarballs? How can we better incorporate the ongoing reproducible-builds work to catch subtle changes? Should testing infrastructure be somehow separated from code that is built for deployment? What are the best practices around the management of binary objects belonging to a project? Why are we still using ancient build systems that almost nobody understands? How can we get better review of the sorts of code that make eyes glaze over? What is the proper embargo policy for a problem like this one?
These are all useful questions that need to be discussed in depth; hopefully, some useful conclusions will come from them. But it is important to not get hung up on the details of this specific attack; the next one is likely to look different. And that next attack may well already be underway.
On the unhelpful side, the suggestion from the OpenSSF that part of the problem was the lack of an "OpenSSF Best Practices badge" is unlikely to do much to prevent the next attack. (Note that the organization seems to have figured that out; see the Wayback Machine for the original version of the post). The action by GitHub to block access to the XZ repository — after the horse had long since left the barn, moved to a new state, and raised a family — served only to inhibit the analysis of the problem while protecting nobody.
The source
This attack was carried out in a careful and patient fashion over the course of at least two years; it is not a case of a script kiddie getting lucky. Somebody — or a group of somebodies — took the time to identify a way to compromise a large number of systems, develop a complex and stealthy exploit, and carry out an extensive social-engineering campaign to get that exploit into the XZ repository and, from there, into shipping distributions. All of this was done while making a minimum of mistakes until near the end.
It seems clear that considerable resources were dedicated to this effort. Speculating on where those resources came from is exactly that — speculation. But to speculate that this may have been a state-supported effort does not seem to be going out too far on any sort of limb. There are, undoubtedly, many agencies that would have liked to obtain this kind of access to Linux systems. We may never know whether one of them was behind this attempt.
Another thing to keep in mind is that an attack of this sophistication is unlikely to limit itself to a single compromise vector in a single package. The chances are high that other tentacles, using different identities and different approaches, exist. So, while looking at what could have been done to detect and prevent this attack at an earlier stage is a valuable exercise, it also risks distracting us from the next attack (or existing, ongoing attacks) that do not follow the same playbook.
Over the years, there have not been many attempts to insert a backdoor of this nature — at least, that we have detected — into core free-software projects. One possible explanation for that is that our software has been sufficiently porous that a suitably skilled attacker could always find an existing vulnerability to take advantage of without risking the sort of exposure that is now happening around the XZ attack. An intriguing (and possibly entirely wishful) thought is that the long and ongoing efforts to harden our systems, move to better languages, and generally handle security issues in a better way has made that avenue harder, at least some of the time, pushing some attackers into risky, multi-year backdoor attempts.
Finally
In the numerous discussions sparked by this attack, one can readily find two seemingly opposing points of view:
- The XZ episode shows the strength of the free-software community. Through our diligence and testing, we detected a sophisticated attack (probably) before it was able to achieve its objectives, analyzed it, and disabled it. (Example).
- The world was saved from a massive security disaster only by dint of incredible luck. That such an attack could get so far is a demonstration of the weakness of our community; given our longstanding sustainability problems, an attack like this was inevitable at some point. (Example).
In the end, both of those points of view can be valid at the same time. Yes, there was a massive bit of luck involved in the detection of this attack, and yes, the support for maintainers (and many other contributors) is not what it needs to be. But, to paraphrase Louis Pasteur, chance favors the prepared development community. We have skilled and curious developers who will look into anomalous behavior, and we have a software environment that facilitates that sort of investigation. We never have to accept that a given software system just behaves strangely; we have, instead, gone to great lengths to ensure that it is possible to dig in and figure out why that behavior is happening — and to fix it. That is, indeed, a major strength; that, along with luck and heroic work, is what saved us.
We must hope that we can improve our game enough that our strengths will save us the next time as well. There is a good chance that our community, made up of people who just want to get something done and most of whom are not security experts, has just been attacked by a powerful adversary with extensive capabilities and resources. We are not equipped for that kind of fight. But, then, neither is the proprietary world. Our community has muddled through a lot of challenges to get this far; we may yet muddle through this one as well.
How the XZ backdoor works
Versions 5.6.0 and 5.6.1 of the XZ compression utility and library were shipped with a backdoor that targeted OpenSSH. Andres Freund discovered the backdoor by noticing that failed SSH logins were taking a lot of CPU time while doing some micro-benchmarking, and tracking down the backdoor from there. It was introduced by XZ co-maintainer "Jia Tan" — a probable alias for person or persons unknown. The backdoor is a sophisticated attack with multiple parts, from the build system, to link time, to run time.
The community response to the attack is just as interesting as the technical aspects. For more information on that, refer to this companion article.
Build time
The backdoor consists of several distinct phases, starting when the package is being built. Gynvael Coldwind wrote an in-depth investigation of the build-time parts of the backdoor. Releases of XZ were provided via GitHub, which has since disabled the maintainers' accounts and taken the releases offline. Like many projects that use GNU Autoconf, XZ made releases that provided several versions of the source for download — an automatically generated tarball containing the source and related files in the repository, along with versions containing the generated build files. Those extra files include the configure script and makefiles for the project. Releasing versions that contain the generated files allows downstream users of the software to build without needing to install Autoconf.
In this case, however, the scripts in the maintainer-provided source tarballs were not generated by Autoconf. Instead, one of the build scripts contained the first stage of the exploit in m4/build-to-host.m4. This script is originally from the Gnulib library; it provides a macro that converts between the style of pathname used by the build environment and the run-time environment of the program. The version in these XZ releases was modified to extract the next stage of the exploit, which is contained in tests/files/bad-3-corrupt_lzma2.xz.
This file is included in the repository, ostensibly as part of XZ's test suite, though it was never used by those tests. It was committed well before the release of version 5.6.0. The file, supposedly a corrupted XZ file, is actually a valid XZ stream with some bytes swapped — for example, 0x20 is swapped with occurrences of 0x09 and vice versa. When decoded, it yields a shell script that unpacks and executes the next stage of the backdoor.
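As a purely illustrative sketch of that decoding step (this is not the attacker's code, which performed an equivalent substitution from the build scripts), the following filter swaps 0x20 and 0x09 wherever they occur, turning the "corrupted" bytes back into a valid XZ stream that could then be fed to xz -d.

```c
/* A stand-in for the decoding step described above (not the attacker's
 * code): swap every 0x20 (space) with 0x09 (tab) and vice versa, so the
 * "corrupted" test file becomes a valid XZ stream again. */
#include <stdio.h>

int main(void)
{
    int c;

    while ((c = getchar()) != EOF) {
        if (c == 0x20)
            c = 0x09;       /* space becomes tab */
        else if (c == 0x09)
            c = 0x20;       /* tab becomes space */
        putchar(c);
    }
    return 0;
}
```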
The next stage of the backdoor is located in tests/files/good-large_compressed.lzma. This is the injected.txt file attached to Freund's message. That file contains more than just the next stage of the script — it also contains additional binary data that forms the actual backdoor itself. The final script skips over the header of the file from which it was extracted, and then uses awk to decrypt the remainder of the file. Finally, that decrypted stream is decompressed using the XZ command-line program, in order to extract a pre-compiled file called liblzma_la-crc64-fast.o, which is also attached to Freund's message.
Link time
The extracted file is a 64-bit relocatable ELF library. The remainder of the build process links it into the final liblzma library which ends up being loaded into OpenSSH on some distributions. Those distributions patch OpenSSH to use systemd for daemon-readiness notifications; libsystemd in turn depends on liblzma for compressing journal files. Lennart Poettering has since posted some example code (written by Luca Boccassi) showing how to let applications use systemd readiness notifications without pulling in the entire library. When the malicious liblzma is used by a dynamically linked process, it uses the indirect function mechanism to involve itself in the linking process.
Indirect functions are a feature of the GNU C library (glibc) that permits a developer to include several versions of a function and select which version to use at dynamic linking time. Indirect functions are useful for including optimized versions of a function that rely on specific hardware features, for example. In this case, the backdoor provides its own version of the indirect function resolvers crc32_resolve() and crc64_resolve() that select versions of crc32() and crc64() to use, respectively. liblzma does not usually use indirect functions, but using faster functions to calculate checksums does sound like a plausible use of the feature. This plausible deniability is probably why the exploit itself lives in a file called liblzma_la-crc64-fast.o.
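For readers unfamiliar with the mechanism, here is a minimal, generic ifunc example; it is not the backdoor's code, and the implementation names are invented. The resolver runs while the dynamic linker is resolving symbols and returns the implementation that the public symbol should point to.

```c
/* A generic illustration of the glibc indirect-function (ifunc) mechanism;
 * not the backdoor's code, and the implementation names are invented. */
#include <stdint.h>
#include <stddef.h>

static uint64_t crc64_generic(const void *buf, size_t len, uint64_t crc)
{
    /* A portable bit-by-bit implementation would go here. */
    return crc;
}

static uint64_t crc64_clmul(const void *buf, size_t len, uint64_t crc)
{
    /* An implementation using carry-less-multiply instructions would go
     * here. */
    return crc;
}

/* The resolver runs at dynamic-link time, while relocations are still
 * being processed; that is exactly the window the backdoor abuses. */
static void *resolve_crc64(void)
{
    if (__builtin_cpu_supports("pclmul"))
        return crc64_clmul;
    return crc64_generic;
}

/* The public crc64 symbol is bound to whichever function the resolver
 * returns. */
uint64_t crc64(const void *buf, size_t len, uint64_t crc)
    __attribute__((ifunc("resolve_crc64")));
```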
When the dynamic linker finalizes the locations of those functions, it calls the backdoor's resolver functions. At this point, dynamic linking is still in progress, so many of the linker's internal data structures have not yet been made read-only. This would let the backdoor manipulate libraries that had already been loaded by overwriting entries in the procedure linkage table (PLT) or global offset table (GOT). However, liblzma is loaded fairly early in the link order of OpenSSH, which means that the OpenSSL cryptography functions that are the backdoor's ultimate target may not have been loaded yet.
To deal with that, the backdoor adds an audit hook. The dynamic linker calls all the registered audit hooks when it is resolving a symbol. The backdoor uses this to wait until it sees the RSA_public_decrypt@got.plt symbol being resolved. Despite the name, this function is actually part of handling an RSA signature (which is a decryption operation) — OpenSSH calls it while validating an RSA certificate provided by the client during a connection.
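The backdoor registers its hook by patching the dynamic linker's internal state, but glibc's documented rtld-audit interface (normally loaded through the LD_AUDIT environment variable) gives a feel for what a symbol-binding audit hook looks like. This sketch, written for this article, merely reports when RSA_public_decrypt is bound.

```c
/* A sketch of glibc's rtld-audit interface, written for this article; it
 * only reports symbol bindings, unlike the backdoor, which replaces them. */
#define _GNU_SOURCE
#include <link.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Tell the dynamic linker which audit-interface version we speak. */
unsigned int la_version(unsigned int version)
{
    return LAV_CURRENT;
}

/* Called whenever the dynamic linker binds a symbol between two objects;
 * the backdoor waits for RSA_public_decrypt to show up at this point. */
uintptr_t la_symbind64(Elf64_Sym *sym, unsigned int ndx,
                       uintptr_t *refcook, uintptr_t *defcook,
                       unsigned int *flags, const char *symname)
{
    if (strcmp(symname, "RSA_public_decrypt") == 0)
        fprintf(stderr, "binding %s to %p\n", symname,
                (void *)sym->st_value);

    return sym->st_value;   /* the address the caller will actually use */
}
```

Such an object would be built as a shared library and activated by pointing LD_AUDIT at it; the backdoor, again, achieves the same effect without any environment variable.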
Run time
Once the backdoor detects this function being linked, it replaces the function with its own version. What the altered version does is still being investigated, but at least one of its functions is to attempt to extract a command from the public-key field of the provided RSA certificate (which means that certificates that are used in this attack cannot actually be used to authenticate normally). The backdoor checks whether the command is signed by the attacker's private key and has valid formatting. If it does, then the backdoor directly runs the given command as the user running sshd, which is usually root.
Anthony Weems has put together an explanation of the run-time portion of the exploit, including a honeypot to detect attempts to use the exploit, and code to generate command payloads. Using the backdoor involves signing the command to be executed with a private key, but the attacker's private key is not available, so the backdoored server needs to be patched to use another key. This also means that detecting backdoored servers remotely is nearly impossible, since they will not react any differently to connections that don't use the attacker's private key.
Ultimately, the effect of the backdoor appears to be that a compromised SSH server which receives a connection with a hand-crafted RSA certificate for authentication can be made to run attacker-controlled code.
Anti-analysis
The design of the backdoor makes it difficult to notice without directly inspecting liblzma. For example, the choice to enable remote code execution rather than an authentication bypass means that use of the exploit does not create a login session that could be noticed by traditional administration tools. The backdoor's code also uses several techniques to make discovery more difficult. For example, the string "RSA_public_decrypt@got.plt", which is used by the audit hook, never appears in the binary of the exploit. Instead, it uses a trie to hold various strings. Serge Bazanski posted a list of strings in the malicious liblzma encoded this way.
Examining that list shows that RSA_public_decrypt is likely not the only function interfered with; several other cryptography routines are listed. It also shows various functions and strings that are used to interfere with OpenSSH's logging. This is not yet confirmed, but it seems likely that a compromised SSH server would not actually log any connection attempts that use the exploit.
The backdoor also includes many checks to ensure it is running in the expected environment — a standard precaution for modern malware that is intended to make reverse-engineering more difficult. The backdoor is only active under specific circumstances, including: running in a non-graphical environment, as root (see this comment from Freund), in a binary located at /usr/sbin/sshd, with sshd having the expected ELF header, and where none of its functions have had a breakpoint inserted by a debugger. Despite these obstacles, community efforts to reverse-engineer and explain the remainder of the backdoor's code remain underway.
The backdoor also includes code that patches the binary of sshd itself to disable seccomp() and prevent the program from creating a chroot sandbox for its children (see this comment).
In total, the code of the backdoor is 87KB, which is plenty of space for additional unpleasant surprises. Many people have put together their own summaries of the exploit, including this comprehensive FAQ by Sam James, which links to other resources.
Being safe
The exploit was caught promptly, so almost no users were affected. Debian sid, Fedora Rawhide, the Fedora 40 beta, openSUSE Tumbleweed, and Kali Linux all briefly shipped the compromised package. NixOS unstable also shipped the compromised version, but was not vulnerable because it does not patch OpenSSH to link libsystemd. Tan also included some other changes to the XZ code to make detecting and mitigating the backdoor more difficult, such as sabotaging sandboxing measures and making preemptive efforts to redirect security reports. Even though the exploit did not reach their stable versions, several distributions are nonetheless taking steps to move to a version of XZ that does not contain any commits from Tan, so users should expect to see security updates related to that soon. Readers may also wish to refer to the security notice for their distribution for more specific information.
The race to replace Redis
On March 20, Redis Ltd. announced that the Redis "in-memory data store" project would now be released under non-free, source-available licenses, starting with Redis 7.4. The news is unwelcome, but not entirely unexpected. What is unusual about this situation is the number of Redis alternatives to choose from; there are at least four options for those who wish to stay with free software, including a pre-existing fork called KeyDB and the Linux Foundation's newly-announced Valkey project. The question now is which one(s) Linux distributions, users, and providers will choose to take its place.
A short history of Redis
Redis has a complicated backstory. Salvatore Sanfilippo (also known as "antirez") started the project to use as "a different kind of database" with a realtime log-analyzer application called LLOOGG, because MySQL was not meeting his needs. Instead of creating a relational database, he designed the project as a simple dictionary database that stored key-value pairs in memory—its name is a contraction of "remote dictionary server". It has, of course, matured and accrued many more features over the years. Redis quickly became popular as part of the NoSQL movement, and Sanfilippo was hired by VMware to work on Redis development in 2010. He moved to VMware's spin-off, Pivotal, in 2013 and continued to work on the project.
Around that time, Redis was growing in popularity, with high-profile use by Twitter and Pinterest, among others, and started to appear in Linux distributions. It was packaged for Ubuntu in the 12.04 (April 2012) release, Fedora 18 (January 2013), and others. Support for Redis was added to Amazon Web Service's (AWS) ElastiCache service in September 2013, which took advantage of, and helped bolster, Redis's popularity.
In early 2013, a startup called Garantia Data started offering Redis services and positioning itself as a better alternative to "open source Redis". Garantia took a first round of funding in November 2013 and floated changing its corporate name to RedisDB. After some pushback from Sanfilippo, the company renamed itself Redis Labs, instead, in early 2014. Sanfilippo joined Redis Labs as the lead for open-source development in 2015. He remained with Redis Labs until stepping down in 2020.
In 2018, Redis Labs adopted a new license for its add-on modules that provide features on top of the core database. The company chose to use a modified version of the Apache License, Version 2.0, with an addition called the Commons Clause. This clause restricted selling the software or charging for services. The rationale given for the switch was that cloud providers were "taking advantage of the open source community" by selling services based on open-source code they didn't develop. At the time, the company pledged that Redis "is BSD and will always remain BSD".
It was not the only company to start experimenting with use-restrictive licenses. Venture-backed database companies, in particular, were starting to run toward new licenses to try to ensure they could exclusively sell services using the software. MariaDB had created the Business Source License (BSL) for a product called MaxScale in 2016, and MongoDB launched the Server Side Public License (SSPL) in late 2018. Eventually, Redis Labs settled on a dual-license scheme of SSPL and its own Redis Source Available License (RSAL) for its modules.
The company dropped "Labs" from its name in mid-2021. In announcing the name change, Redis again committed to open source, and said that the company renaming "will not affect the licensing of open source Redis, which has always been and will continue to be BSD licensed". The company also put in place a governance model that would place major decisions about Redis's "architecture, design, or philosophy" with a community "core team". One would expect that team's mandate to include the license for Redis itself.
The governance page is no longer on Redis's web site, but is available on the Internet Archive's Wayback Machine. It listed a core team of five members, three from Redis (Yossi Gotlieb, Oran Agra, and Itamar Haber) as well as Zhao Zhao from Alibaba and Madelyn Olson from AWS.
On March 20, Redis announced that "all future versions of Redis will be released with source-available licenses", specifically the SSPL and RSAL. Rowan Trollope, Redis CEO, wrote that maintaining the BSD license was now "at odds with our ability to drive Redis successfully into the future". Future versions, in this case, means Redis 7.4 and later. The announcement's FAQ says that, following the company's security policy, security patches will be backported to previous versions under the original three-clause BSD license.
Cloud versus open source
Proponents of use-restrictive licenses like the SSPL and Redis's RSAL have tried to position this solely as a battle between giant cloud providers like AWS and open source, where use restrictions are the only logical alternative and cloud providers are the only losers. In 2019, Redis Labs then-CEO Ofer Bengal was quoted as saying that there were "many different views" after Redis adopted its source-available licenses for Redis modules, but that it was necessary to compete with cloud providers:
Some people condemned that [license change]. But after the initial noise calmed down — and especially after some other companies came out with a similar concept — the community now understands that the original concept of open source has to be fixed because it isn't suitable anymore to the modern era where cloud companies use their monopoly power to adopt any successful open source project without contributing anything to it.
In the March 20 announcement, Trollope wrote that "cloud service providers will be able to deliver Redis 7.4 only after agreeing to licensing terms with Redis, the maintainers of the Redis code", but that "nothing changes for the Redis developer community who will continue to enjoy permissive licensing under the dual license".
The choice of the phrase "permissive licensing" is misleading. Redis is able to adopt a non-free license scheme for 7.4 and later versions because external developers had granted their contributions under the permissive BSD license. The terms of the SSPL and RSAL are incompatible with common usage of the term "permissive" in the open source community.
It is also hard to reconcile the claims that cloud providers do not contribute with the actual commits to the Redis repository. A quick examination of the commits since the 7.0.0 release using gitdm shows 967 commits over that time period:
Top changeset contributions by employer:

  (Unknown)    331   34.2%
  Tencent      240   24.8%
  Redis        189   19.5%
  Alibaba       65    6.7%
  Huawei        50    5.2%
  Amazon.com    50    5.2%
  Bytedance     19    2.0%
  NetEase       13    1.3%
Binbin Zhu, of Tencent, is responsible for nearly 25% of the commits to the project. Some of the contributors without a readily identifiable employer surely are Redis employees, but it's clear that the company has not been working alone. (Note that some single-digit contributors were omitted.)
Changing distribution model
So it should be apparent that code contribution is beside the point. Redis is a venture-backed company that has taken more than $350 million in funding over many rounds since 2011. The company, and its investors, seem to have calculated that they can safely move away from open source to try to capture more revenue.
They have some reason to believe this is the case, if MongoDB's results are any guide. The company went public in 2017 and moved to the SSPL a little more than a year later. Shortly afterward, major Linux distributions stopped packaging the database because it no longer met their licensing standards. But, by that time, the company had set its sights on a platform model that would encourage developers (and their employers) to use and pay for MongoDB and ancillary offerings with the as-a-service model. Distributing a source-available version of MongoDB could be seen as a loss-leader strategy to reach developers that the company wagered did not care about open source.
As Redmonk founder Stephen O'Grady wrote in 2017:
As developers began to assert control over technical selection and direction in increasing numbers, even in situations where a proprietary alternative is technically superior, the sheer accessibility of open source software gave it an enormous market advantage. Choosing between adequate option A that could be downloaded instantly and theoretically superior option B gated by a salesperson was not in fact a choice.
But open source, noted O'Grady, "is typically less convenient than service-based alternatives" and if convenience is the most important factor then open source has a problem. Especially when vendors like MongoDB have learned from proprietary vendors that "locking in customers is good for business".
Is it good for business? MongoDB has kept growing, adding customers, and brought in $1.68 billion in its last fiscal year. That's more than a 30% increase, and its Atlas database service revenue also increased more than 30%, demonstrating that a lot of companies would rather pay to use the service than try to host it themselves. Despite all that, the company is still losing money—more than $345 million in the same time period.
But, investors may be more interested in stock price than actual profit. The company's stock started around $33 a share when it went public, but is now more than $350 a share. Redis's investors would likely be happy if it can produce similar results.
Forks and alternatives
Venture-backed vendors seem to have, as O'Grady wrote last year, reached a consensus that they can move away from open source, especially if they are not "actively opposed by other commercial interests, foundations and other interested industry participants". Here, Redis may have miscalculated the industry mood.
When HashiCorp adopted BSL for its projects last year, a fork of its Terraform project appeared within days and was embraced by the Linux Foundation under the name OpenTofu. On March 28, the foundation announced that it was supporting Valkey, a direct fork of Redis 7.2.4, with Amazon Web Services (AWS), Google Cloud, Oracle, Ericsson, and Snap named as backers of the effort.
The Valkey fork got its start just a few days after the Redis license change. Olson wrote that she and "various former Redis contributors" had started working on a fork, using the original three-clause BSD license, with "placeholderkv" as a temporary name. Olson, Zhao, Viktor Söderqvist, and Ping Xie were listed as maintainers. According to Olson, this was not an AWS fork of Redis, but "just me trying to keep the continuity with the community". KeyDB was considered, but it has diverged to the point that it "is missing a lot of stuff the community is used to".
The KeyDB fork was created in 2019 for technical, rather than licensing, reasons. The project, which billed itself as "a faster drop in alternative to Redis", was created by John Sully, Eric Blenkarn, and Ben Schermel, who wanted a multithreaded version and were not able to persuade Redis maintainers to go in that direction. Sully and Schermel started a company, also called KeyDB, that offered a proprietary enterprise version. The entire codebase became fully open source under the three-clause BSD license when KeyDB was acquired by Snap in 2022.
The problem with KeyDB as a direct alternative is that it hasn't kept up with Redis since it forked. It still lacks many features found in Redis 7, and Sully indicated that there's little time for him to work on issues "not directly affecting Snap", though the project "would of course welcome outside help and we can certainly name additional maintainers if there is community interest in helping". On March 22, Sully updated another issue and said he was in discussions to "potentially" add maintainers to bring KeyDB closer to Redis 7. It's not clear yet whether Valkey will supplant KeyDB, but Snap's involvement makes that seem likely over the long term.
Drew DeVault, founder and CEO of SourceHut, has also created a fork named Redict based on Redis 7.2.4, but chose to use the LGPLv3. In his announcement post, he said that the choice of license was "a deliberate one which balances a number of concerns". DeVault wanted a license that was copyleft but "as easy as possible for users to comply" with, and one that would ease integrations with Redis-compatible modules or Lua plugins that can be used to perform operations within Redis.
He also noted that Redict will have no contributor license agreement (CLA), but contributors would be asked to verify contributions with a developer certificate of origin. Despite his connection to SourceHut, DeVault chose to host Redict on Codeberg to "provide a comfortable and familiar user experience for anyone comfortable with the GitHub-based community" of Redis.
Another kind-of contender is Microsoft's Garnet, announced on March 18. According to the announcement, it has been in development by Microsoft Research since 2021. It is a remote cache-store that can cache and manage the same types of data as Redis and is designed to be compatible with the Redis serialization protocol. Garnet is MIT-licensed, written in .NET C#, and is not meant to be a direct drop-in replacement. However, its API compatibility page claims that it can be "regarded as a close-enough starting point" that works "unmodified with many Redis clients". Many, but not all. For example, one user attempted to switch a NodeJS application to Garnet, but found that Redis's FLUSHALL command is not currently supported. Adding support for missing APIs is on the project's roadmap.
Scramble for alternatives
Once again, Linux distributions are left with a mess to clean up. Neal Gompa opened a discussion on the Fedora development list, noting the license change and the need to remove Redis from Fedora. Jonathan Wright replied that KeyDB might be a replacement; he had been "loosely working on packaging" before the license change. He later said that KeyDB would be "a step backwards and cause headaches" for those looking to replace later versions of Redis. Nevertheless, he wrote on March 23 that he had pushed builds that were ready for testing for Fedora and EPEL 8 and 9.
Shortly after the Valkey announcement, Wright wrote that he would be packaging it as soon as there is a tagged release. Wright also said that he is "anticipating valkey becoming the [de facto] replacement for redis in most places."
Gompa also raised the issue on openSUSE's Factory discussion list. Dominique Leuenberger replied with a list of 18 packages with dependencies on the redis package in Tumbleweed. The initial discussion mentioned Redict and KeyDB as possible replacements, but Valkey had not been announced yet.
Having to find a replacement to ship in place of Redis is not the only problem for community distributions. Jacob Michalskie called out several services in use by the openSUSE project that will need a Redis replacement, including the Pagure code-hosting software (created and used by Fedora as well) used for code.opensuse.org, and the Discourse forum software.
Debian contributor Guillem Jover filed a Request for Package (RFP) for KeyDB as a potential replacement for Redis. Jover said he was not sure if he was up for sole maintainership, but was happy to give a hand. In an email exchange, Jover told me that his company had migrated from Redis 6 to KeyDB and that it was a "smooth transition". According to Jover, "KeyDB might be lacking some features compared to Redis 7, but we have neither noticed we miss any or felt we were missing out on anything."
Jover said that it was too early to tell whether the newer forks would continue to be maintained, and that Redict's LGPLv3 licensing "might also be problematic for the ecosystem". In a follow-up email after the Valkey announcement, he said "I think we'll probably go further with packaging KeyDB for Debian at least, if it dies out it can always be removed or transitioned out from there."
Path forward
It is, of course, too soon to predict whether one or more of the forks will gain significant traction—but it seems likely that Valkey will be a credible alternative. The possibility of a swift fork with widespread community and industry backing should give pause to vendors who expect a smooth path after abandoning open source.
Improving performance with SCHED_EXT and IOCost
At SCALE this year Dan Schatzberg and Tejun Heo, both from Meta, gave back-to-back talks about some of the performance-engineering work that they do there. Schatzberg presented on the extensible BPF scheduler, which has been discussed extensively on the kernel mailing list. Heo presented on IOCost — a control group (cgroup) I/O controller optimized for solid-state disks (SSDs) — and the benchmark suite that is necessary to make it work well on different models of disk.
Scheduling with BPF
For many years, Linux used the Completely Fair Scheduler (CFS) to decide which tasks should run on which CPUs at any given time. In kernel version 6.6, the default scheduler changed to the Earliest Eligible Virtual Deadline First (EEVDF) scheduler. Schatzberg presented what is, in his view, a serious problem with both of these schedulers: iteratively improving the design is difficult, because trying out a new scheduler policy requires rebuilding the kernel and rebooting. He presented the BPF-based scheduler as a way to work around this problem — a proposal that has previously been the subject of much debate in the kernel community.
![Dan Schatzberg](https://static.lwn.net/images/2024/Dan_Schatzberg-largerfuzz.png)
The core idea of a scheduler is simple — "the scheduler's job is just sharing the core" — but implementing a performant scheduler can make things "very complicated very quickly". Schatzberg called out the increasing prevalence of heterogeneous computing systems as a particular source of complexity. There are several attributes of a good scheduler that people might care about, such as fairness, optimization, low overhead, and general applicability. Finding a scheduling policy that is best for a given workload therefore needs to involve a lot of performance measurements and experimentation. Once a scheduling policy has been found, though, using it currently requires maintaining an out-of-tree scheduler. Schatzberg's employer has a policy of trying to upstream as many kernel changes as possible, to reduce its maintenance burden and make use of recent kernels. Upstreaming patches for a constantly changing workload-specific scheduler would be difficult.
Therefore, Schatzberg and others have been working on an extensible scheduler (SCHED_EXT) "allowing you to implement scheduling policies [...] as BPF programs at runtime". Experimenting with a scheduling policy using SCHED_EXT involves writing a BPF program, compiling it, and loading it into the running kernel. When changing to a new scheduler, users "don't even need to stop the running workload", which makes experimentation much faster.
Schatzberg spent some time explaining the work that has gone into making this kind of experimentation safe. Since the scheduler is written in BPF, the BPF verifier ensures it cannot crash or break the running kernel. Furthermore, there is a kernel watchdog in SCHED_EXT that will automatically kick a BPF scheduler out if it has failed to schedule a task for a configurable amount of time. So even a broken BPF scheduler that refuses to schedule anything won't permanently hang a machine.
Writing a scheduler
Schatzberg then gave an explanation of how to actually write a scheduler using SCHED_EXT. The general idea is that the BPF program implements an operations structure full of callbacks that the kernel will call during scheduling. The structure that BPF sets up with these callbacks also contains various configuration options, such as the name of the scheduler and the desired watchdog timeout. Three callbacks contain the most vital parts of the scheduler:
```c
s32  (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);
void (*enqueue)(struct task_struct *p, u64 enq_flags);
void (*dispatch)(s32 cpu, struct task_struct *prev);
```
All three of these have default implementations, but "any sufficiently complicated scheduler will probably implement these three".
To actually implement these callbacks, "you need to understand dispatch queues", which are data structures that contain a list of runnable tasks. By default, each CPU has its own dispatch queue that it automatically takes tasks from, but the scheduler can create as many auxiliary dispatch queues as it needs. SCHED_EXT handles all the required locking around performing operations on these queues.
With that understanding in mind, Schatzberg showed the code to make a simple global first-in first-out scheduler. This is a naive scheduling policy, but Schatzberg noted: "This actually outperforms CFS for some of our production workflows" because getting a task onto an idle core as quickly as possible has some benefits. The example scheduler enqueues tasks by adding them to a global dispatch queue, and then dispatches tasks from that queue when a CPU runs out of work on its own queue.
```c
s32 BPF_STRUCT_OPS_SLEEPABLE(mysched_init)
{
    return scx_bpf_create_dsq(/* queue id = */ 0, -1);
}
```
mysched_init() creates the global dispatch queue that the scheduler uses.
```c
void BPF_STRUCT_OPS(mysched_enqueue, struct task_struct *p, u64 enq_flags)
{
    scx_bpf_dispatch(p, /* queue id = */ 0, SCX_SLICE_DFL, enq_flags);
}
```
mysched_enqueue() accepts a task from the kernel, and places it directly on the end of the global dispatch queue.
```c
void BPF_STRUCT_OPS(mysched_dispatch, s32 cpu, struct task_struct *prev)
{
    scx_bpf_consume(/* queue id = */ 0);
}
```
mysched_dispatch() runs when a CPU becomes idle and needs another task. It takes the task from the front of the global dispatch queue and immediately starts it running.
Schatzberg noted that the full implementation of the scheduler was slightly longer because it needs to handle tasks such as waking up a CPU that has gone to sleep when there's more work, but emphasized that a complete scheduler could be written in 150 lines, including comments.
He then gave an example of a slightly less naive scheduler that keeps a separate queue per L3 cache. He pointed out that even this simple optimization raises a lot of questions — when should cores steal work from other L3 caches? When should work be proactively load-balanced across the different queues? Schatzberg said that the real answer is that it depends on the workload, and that developers should experiment using SCHED_EXT. "The whole idea is that SCHED_EXT lets us experiment with these kinds of policies really quickly".
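What such a per-L3-cache policy might look like is sketched below, reusing the helpers from the FIFO example. This is an illustration written for this article, not code from the talk; llc_id_of_cpu() in particular is a placeholder for a real topology lookup, and the work-stealing loop embodies exactly the policy question Schatzberg raised.

```c
/* A sketch (not a complete scheduler) of a per-L3-cache policy, reusing
 * the helpers from the FIFO example above.  llc_id_of_cpu() is a
 * placeholder: a real scheduler would derive the mapping from the CPU
 * topology, perhaps via a map filled in from user space. */

#define MAX_LLCS 8

static u32 llc_id_of_cpu(s32 cpu)
{
    /* Placeholder topology lookup for the sake of the example. */
    return cpu % MAX_LLCS;
}

s32 BPF_STRUCT_OPS_SLEEPABLE(mysched_init)
{
    /* One dispatch queue per L3-cache domain. */
    for (u32 i = 0; i < MAX_LLCS; i++) {
        s32 ret = scx_bpf_create_dsq(i, -1);

        if (ret)
            return ret;
    }
    return 0;
}

void BPF_STRUCT_OPS(mysched_enqueue, struct task_struct *p, u64 enq_flags)
{
    /* Queue the task in the domain of the CPU it last ran on. */
    scx_bpf_dispatch(p, llc_id_of_cpu(scx_bpf_task_cpu(p)),
                     SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(mysched_dispatch, s32 cpu, struct task_struct *prev)
{
    /* Prefer work from the local L3 domain... */
    if (scx_bpf_consume(llc_id_of_cpu(cpu)))
        return;

    /* ...and steal from other domains when the local queue is empty.
     * When (and whether) to do this is exactly the policy question
     * raised in the talk. */
    for (u32 i = 0; i < MAX_LLCS; i++) {
        if (scx_bpf_consume(i))
            return;
    }
}
```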
He finished off the talk by giving some examples of various schedulers suited to different purposes. Those schedulers, as well as various support tools to make writing BPF-based schedulers easier, are available on GitHub. For the future, he sees the top priority as getting SCHED_EXT upstreamed into the kernel — a task that may be difficult since scheduler maintainer Peter Zijlstra dislikes the patch series. Schatzberg thought that SCHED_EXT could be valuable to users: "We want to build a community around it". He said that there were many features that could be added to make SCHED_EXT more usable, but that they were mostly BPF features, and not SCHED_EXT features per se.
IOCost
After Schatzberg's talk, Heo described his own performance work, this time in the area of cgroup I/O controllers. Cgroups are hierarchical groupings of processes that can have resource limits attached to different nodes in the hierarchy. Heo compared them to a filesystem that contains the processes running on a system. Cgroup controllers have the job of allocating resources to processes within a cgroup, and reacting if they exceed their limits. There are existing cgroup controllers for memory usage, CPU usage, process creation, block device I/O, and more.
![Tejun Heo](https://static.lwn.net/images/2024/Tejun_Heo-fuzz.png)
Heo has been working on a controller for block device I/O called IOCost. The controller is intended to distribute solid-state disk block operations to different cgroups, ensuring that applications cannot hog the disk's bandwidth.
He opened by describing why writing an I/O controller is more challenging than writing a controller for memory or CPU. Firstly, the metrics usually used to measure performance of block devices "are not great". For example, specifying the budget of an application in bytes per second or I/O requests per second can result in a limit that is simultaneously too low and too high, because the actual number of bytes per second that a solid-state disk (SSD) can service varies wildly based on several factors. "That makes single-metric based control, based on these numbers, really challenging".
More problematically, SSDs can be so fast that if the overhead from the cgroup controller is too high, it can noticeably impact the performance of the application. This isn't the case for rotating drives, where it is frequently worth sacrificing some CPU time to optimize disk reads, because the disk latency is so high. That means that existing I/O controllers designed for rotating disks often perform poorly on SSDs.
And finally, block devices are a shared resource that the entire system needs to use, which makes it easy to accidentally create priority inversions. For example, writing to a journaling filesystem requires writing to a shared journal for the whole filesystem. If the I/O controller delays a low-priority process's write to the journal, all of the high-priority processes that also need to write to the journal are stuck waiting as well. Priority inversion isn't a problem unique to block devices, but it is especially noticeable there since every process on the system contends for the same resource.
IOCost addresses these challenges in a number of ways, starting by measuring the expense of block operations using "a fairly simple linear model" that is "better than any single metric". It also doesn't try to restrict applications to a numeric limit. Heo pointed out that if you ask an application developer how many I/O operations per second or MB per second they need, "you don't get a good answer. Nobody really knows." So instead, IOCost imposes relative limits — allowing the administrator to say how much I/O each cgroup should receive proportionally.
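As a rough illustration of what a "linear model" of I/O cost can look like, the sketch below charges each request a fixed per-request cost plus a per-byte cost, with a surcharge for random access. The structure and parameter names are invented for this article and are not IOCost's actual tuning knobs.

```c
/* A simplified, illustrative linear cost model for block I/O; the
 * parameters here are made up and are not IOCost's real knobs. */
#include <stdbool.h>
#include <stdint.h>

struct io_cost_params {
    uint64_t read_base;     /* fixed cost charged per read request */
    uint64_t write_base;    /* fixed cost charged per write request */
    uint64_t read_per_kb;   /* additional cost per KiB read */
    uint64_t write_per_kb;  /* additional cost per KiB written */
    uint64_t random_extra;  /* surcharge for a non-sequential request */
};

static uint64_t request_cost(const struct io_cost_params *p,
                             bool is_write, bool is_random, uint64_t bytes)
{
    uint64_t cost = is_write ? p->write_base : p->read_base;

    cost += (bytes / 1024) * (is_write ? p->write_per_kb : p->read_per_kb);
    if (is_random)
        cost += p->random_extra;

    /* Each cgroup is then charged this abstract cost against its
     * proportional share of the device, rather than against a raw
     * bytes-per-second or IOPS limit. */
    return cost;
}
```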
To make access decisions quickly enough to not impede SSD performance, IOCost separates its logic into two parts: a planning path that runs every few milliseconds, and an execution path that runs on every request. The planning path handles all of the complex calculations, and then pushes the configuration to the execution path, so that the execution path doesn't need to do much to determine how a request should be handled. This helps keep the latency overhead of the controller low.
Finally, IOCost integrates with the filesystem and memory-management facilities in the kernel to avoid priority inversions. Throttled I/O requests that are expected to cause priority inversions are run right away, but then charged to the process that exceeded its limits. Heo summarized this approach as "do first, pay later".
IOCost works well in practice for the use-case for which it was designed. Heo showed the results of some internal tests which demonstrated that IOCost has negligible latency overhead, and perfectly maintains the configured ratio of I/O between two benchmark workloads. IOCost "is fully deployed at Meta, on almost all our machines". He summarized this section with a slide that said "IOCost works well as long as it[']s configured well".
But that statement conceals a lot of complexity, because it is not easy to configure IOCost well. The linear cost-model that IOCost uses to estimate how expensive a particular I/O request is requires tuning for each model of SSD, because they all behave differently. To solve this problem, Heo and his collaborators created resctl-bench: a scenario-based benchmarking tool that observes the end-to-end behavior of an SSD. It is part of the larger resctl-demo project, which includes resources for various different types of resource control on Linux. resctl-bench imitates a workload "similar to what we see in the production environment". Heo also said that it was a major improvement on other benchmarks because: "The bar is fairly low here. We don't have a bar, because nothing works well".
resctl-bench is not easy to set up and run, so Heo also made an installable Linux image that runs the benchmark and automatically generates a report (and configurations for IOCost) based on the result. He noted that "I didn't want to do a live demo, because it takes eight hours", but he did show a series of slides explaining how to use the tool to characterize an SSD and then submit the measurements to the database where he is collecting statistics on different SSD models. He called for interested people to run the benchmark on their own systems, because more data points can make IOCost more performant for everyone.
One audience member asked whether it was viable to have IOCost learn the parameters it needs on the fly. Heo explained that it wasn't, because of how difficult it is to get the needed feedback from an SSD under normal operation. "It would be really great if you could", he remarked. Another audience member asked whether the benchmark takes into account SSD over-provisioning — the practice of putting more storage into an SSD than its stated capacity, so that the firmware of the disk has spare sectors to replace failing sectors with. Benchmarks that only write a disk's claimed capacity often provide inaccurate results about long-term performance for over-provisioned disks. Heo affirmed that resctl-bench does take overprovisioning into account, and that it ends up writing many times the disk's claimed capacity in the course of the benchmark.
I asked about the plans for IOCost; Heo said "in our fleet, it seems to work well", and that he wanted to see more people using it. He sees IOCost as fairly complete, with the notable exception of how painful it is to configure. He hopes to improve that over time.
IOCost is already part of the kernel, and can be configured by following the documentation. The resctl-bench documentation explains how to try the benchmark on one's own devices.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Pasadena for SCALE.]
A memory model for Rust code in the kernel
The Rust programming language differs from C in many ways; those differences tend to be what users admire in the language. But those differences can also lead to an impedance mismatch when Rust code is integrated into a C-dominated system, and it can be even worse in the kernel, which is not a typical C program. Memory models are a case in point. A programming language's view of memory is sufficiently fundamental and arcane that many developers never have to learn much about it. It is hard to maintain that sort of blissful ignorance while working in the kernel, though, so a recent discussion of how to choose a memory model for kernel code in Rust is of interest.
Memory models
It is convenient to view a system's memory as a simple array of bytes that can be accessed from any CPU. The reality, though, is more complicated. Memory accesses are slow, so a lot of effort goes into minimizing them; modern systems are built with multiple levels of caching for that purpose. Without that caching, performance would slow to a crawl, severely impacting the production, delivery, and consumption of cat videos, phishing spam, and cryptocurrency scams. This prospect is seen as a bad thing.
Multi-level caching speeds computation, but it brings a problem of its own: it is no longer true that every CPU in the system sees the same memory contents. If one CPU has modified some data in a local cache, another CPU reading that data may not see the changes. Operations performed in a carefully arranged order may appear in a different order elsewhere in the system. As is the case in relativistic situations, the ordering of events depends on who is observing them. Needless to say, this kind of uncertainty also has the potential to create disorder within an operating-system kernel.
CPUs have special operations that can ensure that a given memory store is simultaneously visible across the system. These operations, though, are slow and need to be used with care. Modern CPUs provide a range of "barrier" operations that can be used to properly sequence access to data with less overhead; see this article for an overview of some of them. Use of these barriers can be somewhat complex (and architecture-specific), so a few generic interfaces have been created to simplify things (to the extent that they can be simplified). A memory model combines a specification of how barriers should be used and any interfaces that ease that use, describing how to safely access data in concurrent settings.
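As a small, article-specific illustration of the kind of problem a memory model has to solve, consider the classic message-passing pattern written with the kernel's release/acquire accessors; without them, a reader could see the flag set while still observing a stale value.

```c
/* Message passing with the kernel's ordering accessors; an illustration
 * for this article, not code from any of the patches discussed below. */

static int data;
static int ready;

/* Producer: publish data, then set the flag. */
void publish(int value)
{
    data = value;                   /* plain store */
    smp_store_release(&ready, 1);   /* orders the store to data before
                                     * the store to ready */
}

/* Consumer: only read data after seeing the flag. */
int consume(void)
{
    if (smp_load_acquire(&ready))   /* orders the load of ready before
                                     * the load of data */
        return data;
    return -1;                      /* nothing published yet */
}
```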
One of the early concerns about incorporating Rust into the kernel is that the Rust language lacked a memory model of its own. That gap has since been filled; the Rust memory model looks a lot like the C++ model. Boqun Feng, who helps maintain the kernel's memory model, thought that it would be good to formalize a model for Rust code to use in the kernel. Should the Rust model be used, the kernel's model, or some combination of the two? He posted his conclusion to the linux-kernel mailing list: Rust code should adhere to the kernel's memory model. He included an initial patch set showing what that would look like.
Using the kernel's model
The reasoning behind this conclusion is simple enough: "Because kernel developers are more familiar with LKMM and when Rust code interacts with C code, it has to use the model that C code uses". Learning one memory model is hard enough; requiring developers to learn two models to work with the kernel would not lead to good results. Even worse results are likely when Rust and C code interact, each depending on its respective memory model to ensure proper data ordering. So, as long as the kernel has its own memory model, that is what Rust code will have to use.
Kent Overstreet pointed out a disadvantage of this approach: the kernel will remain incompatible with other Rust code and will not be able to incorporate it easily. He suggested that, perhaps, the kernel's memory model could be rebuilt on top of C or C++ atomic operations, at which point supporting the Rust model would be easier. That seems unlikely to happen, though, given the strong opposition from Linus Torvalds to any such change.
One of Torvalds's arguments was that language-based memory models are insufficiently reliable for use in the kernel, and that is not a problem that can be addressed quickly. "The C++ memory model may be reliable in another decade. And then a decade after *that*, we can drop support for the pre-reliable compilers." Thus, he said, "I do not understand why people think that we wouldn't want to roll our own". Feng added that the kernel's memory model encompasses a number of use cases, such as mixed-size operations on the same variable, that are important to the kernel. Those uses are not addressed (or allowed) in the Rust model. He did suggest that, perhaps, a subset of the Rust model could be implemented on top of the kernel's operations.
Another reason for the kernel project to implement its own memory model, Torvalds said, is that kernel developers need to be deeply familiar with the architectures they are supporting anyway. There is no way to create a kernel without a lot of architecture-specific code. Given that, "having the architecture also define things like atomics is just a pretty small (and relatively straightforward) detail".
Torvalds has another reason for sticking with the kernel's memory model, though: he thinks that the C++ model is fundamentally misdesigned. A key aspect of that model is that data exposed to concurrent access is given a special atomic type, and the compiler automatically inserts the right barriers when that data is accessed. Such a model can ensure that a developer never forgets to use the proper atomic operations, which is an appealing feature. That is, however, not how the kernel's model works.
In the kernel's memory model, it is not the type of the data that determines how it must be accessed, but the context in which that access happens. A simple example is data that is protected by a lock; while the lock is held, that data is not truly shared since the lock holder has exclusive access. So there is no need for expensive atomic operations; instead, a simple barrier when the lock is released is sufficient. In other settings, where a lock is not held, atomic operations may be needed to access the same data.
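As a rough, hypothetical sketch of what that looks like in kernel-style C (the structure and function names here are invented for illustration; spin_lock(), READ_ONCE(), and WRITE_ONCE() are the usual kernel primitives):

    #include <linux/spinlock.h>
    #include <linux/compiler.h>     /* READ_ONCE() and WRITE_ONCE() */

    /* A hypothetical counter shared between a locked writer and lockless readers. */
    struct stats {
        spinlock_t lock;
        unsigned long events;       /* an ordinary integer, not an atomic type */
    };

    static void record_event(struct stats *s)
    {
        spin_lock(&s->lock);
        /*
         * With the lock held, this thread has exclusive access, so a
         * plain operation is sufficient; the unlock provides the barrier.
         * (When lockless readers exist, kernel code will often mark the
         * store with WRITE_ONCE() as well.)
         */
        s->events++;
        spin_unlock(&s->lock);
    }

    static unsigned long peek_events(struct stats *s)
    {
        /*
         * No lock is held here, so it is the context that makes this
         * access "volatile": READ_ONCE() is one of the cast-based
         * accessor macros described below, marking this particular
         * access rather than the variable itself.
         */
        return READ_ONCE(s->events);
    }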
Torvalds argued that the kernel's approach to shared data makes more sense:
In fact, I personally will argue that it is fundamentally wrong to think that the underlying data has to be volatile. A variable may be entirely stable in some cases (ie locks held), but not in others.
So it's not the *variable* (aka "object") that is 'volatile', it's the *context* that makes a particular access volatile.
That explains why the kernel has basically zero actual volatile objects, and 99% of all volatile accesses are done through accessor functions that use a cast to mark a particular access volatile.
This approach has been taken in the kernel for a long time; it is described in the volatile-considered-harmful document that was first added to the kernel for the 2.6.22 release in 2007.
The outcome of this discussion is clear enough: Rust code in the kernel will have to use the kernel's memory model for the foreseeable future. The Rust language brings with it a number of new ways of doing things, many of which have significant advantages over C. But bringing a new language into a code base that is old, large, and subject to special requirements is always going to require some compromises on the new-language side. Using the kernel's memory model may not actually be a compromise, but it is different from what other Rust code will do; it will be one of the many things Rust developers will have to learn to work in the kernel project.
Declarative partitioning in PostgreSQL
Keith Fiske gave a talk (with slides) about the state of partitioning — splitting a large table into smaller tables for performance reasons — in PostgreSQL at SCALE this year. He spoke about the existing support for partitioning, what work still needs to be done, and what place existing partitioning tools, like his own pg_partman, still have as PostgreSQL gains more built-in features.
Partitioning in a database context is when a table is split into multiple smaller tables that each are part of the same logical relation, but contain a smaller physical portion of the data. There are several reasons why someone might want to partition their database, but the most common reasons are to make it easier to manage large amounts of data and to allow databases to reclaim disk space.
[Photo: Keith Fiske]
In PostgreSQL, deleting a large number of rows does not immediately return the disk space to the filesystem. Instead, those rows are only marked as deleted — because other transactions may still need to read the data. A row can only be reused, or have the backing storage returned to the filesystem, once all the transactions using the row have finished. Even then, only completely unused blocks at the end of a table are freed, so it is quite possible to delete every row in a table except one and see no space reclaimed. On the other hand, dropping an entire table gives the space that it uses back immediately, and is much faster than deleting all the rows involved individually. Therefore, splitting data over multiple smaller tables can make dropping old data that is no longer needed faster. Fiske called this the "primary reason for partitioning in Postgres".
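A minimal sketch of the difference, using hypothetical table names:

    -- Deleting rows only marks them dead; VACUUM later makes the space
    -- reusable, but it is rarely returned to the filesystem:
    DELETE FROM readings WHERE logdate < '2023-01-01';

    -- Dropping a child table that contains only old data gives the disk
    -- space back immediately:
    DROP TABLE readings_2022;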
Splitting data across more, smaller tables has other benefits as well. Several operations take time proportional to the size of a table, such as full table scans when running an SQL query, or the VACUUM process that actually does the work of freeing unused rows for reuse. Smaller tables reduce the latency of an individual VACUUM run and, if some of those tables are no longer changing, they need little or no vacuuming at all, reducing the overall time that vacuuming the entire database takes.
The old way
Prior to PostgreSQL version 10, which was released in 2017, partitioning had to be done manually, using table inheritance, triggers, and constraints. Table inheritance is a feature of PostgreSQL that allows one table to inherit from another in a manner similar to object-oriented inheritance, extending the schema of the base table. Queries made against the base table can also return rows from the child table.
But partitioning data in this way was a lot of repetitious work. Fiske said that he "was tired of writing the same thing all the time, for the fourth or fifth customer." Eventually, PostgreSQL introduced a version of partitioning that is built into the database. Fiske clarified that "there may be some narrow use cases where you need to use the old method of partitioning", but he was eager to see any features needed to cover those use cases added to the software.
The new way
The new way to handle data partitioning in PostgreSQL is with declarative partitioning. This lets users set up partitioned tables using data-definition language (DDL) statements:
    CREATE TABLE measurement (
        city_id   int not null,
        logdate   date not null,
        peaktemp  int,
        unitsales int
    ) PARTITION BY RANGE (logdate);

    CREATE TABLE measurement_y2006m02 PARTITION OF measurement
        FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');

    CREATE TABLE measurement_y2006m03 PARTITION OF measurement
        FOR VALUES FROM ('2006-03-01') TO ('2006-04-01');

    ...
There are three kinds of partitioning now supported by PostgreSQL: range, list, and hash partitioning. Shown above is an example of range partitioning. Fiske said that ranges can be used with "anything you can define an upper or lower boundary for", such as integers, dates, or GUIDs (Globally Unique Identifiers). Each range is inclusive of its lower bound and exclusive of its upper bound, so adjacent partitions can share a boundary value without any ambiguity about where a row belongs. List-based partitioning involves enumerating which values of the partition key appear in each partition: "telling it explicitly what goes in a different table".
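List partitioning might look like the following (a hypothetical example, not one from the talk):

    CREATE TABLE sales (
        region text not null,
        amount numeric
    ) PARTITION BY LIST (region);

    CREATE TABLE sales_emea PARTITION OF sales FOR VALUES IN ('DE', 'FR', 'UK');
    CREATE TABLE sales_apac PARTITION OF sales FOR VALUES IN ('JP', 'AU');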
Hash partitioning works by giving a modulus and remainder that are used to sort rows into child tables based on the hash of selected columns. Fiske said that there were two use cases for hash partitioning: either "data growing evenly", or situations where "you don't know what your data distribution is going to be in advance". He pleaded with the audience not to use hash partitioning if they do know what the distribution of the data will be.
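Declaring a hash-partitioned table means giving each child its modulus and remainder (again a hypothetical example):

    CREATE TABLE events (
        user_id bigint not null,
        payload text
    ) PARTITION BY HASH (user_id);

    CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);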
Hash partitioning has a few drawbacks compared to other methods. For example, "in hash partitioning you cannot add or remove a child without rebuilding" the partitioned table and all its children. In addition, data in a hash-partitioned table "often becomes very unbalanced unless your data is actually random".
PostgreSQL supports redirecting inserted rows from the parent table into the appropriate child table. The feature also handles updates — as of PostgreSQL 11 in 2018, if an update changes which child table a row should go in, it is seamlessly migrated. Support for partitioned tables also extends to query optimizations inside PostgreSQL's query planner. Specifically, when the WHERE clause of a query references the column on which the table is partitioned, the query planner can completely exclude child tables that are not relevant to the query.
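With the measurement table shown earlier, for example, a query that constrains the partition key will only touch the matching child:

    -- Only measurement_y2006m02 appears in the resulting plan; the other
    -- child tables are pruned away:
    EXPLAIN SELECT * FROM measurement
        WHERE logdate >= '2006-02-01' AND logdate < '2006-03-01';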
Coming soon
Fiske thinks that the implementation of partitioning in PostgreSQL is nearly done: "There's not a lot left, really, as far as base partitioning features". He detailed two pain points that he hoped would be addressed: query performance, which is "always going to be there, forever" but is "actively being worked on", and the lack of global indexes.
Indexes are used not only to speed up queries, but also to maintain uniqueness constraints. Right now, the way indexes are designed means that an index can only cover a single table. A partitioned table can enforce uniqueness only when the indexed columns include the partition key; otherwise, the fact that the data is split across multiple child tables makes it impossible to specify a uniqueness constraint. Adding global indexes would be useful not just for partitioning, but also for creating other complex cross-table constraints. Fiske said that this would "still take a while to support", but that he doesn't "know enough about the backend" to say what the exact challenges of implementing it are.
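In practice, continuing the measurement example, that means a uniqueness constraint can only be declared when it includes the partition key:

    -- Allowed: the partition key (logdate) is part of the constraint, so
    -- each child table can enforce its share of the uniqueness check.
    ALTER TABLE measurement ADD PRIMARY KEY (city_id, logdate);

    -- Rejected: without a global index, PostgreSQL has no way to enforce
    -- uniqueness of city_id across all of the child tables.
    -- ALTER TABLE measurement ADD CONSTRAINT city_unique UNIQUE (city_id);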
There are a few other warts with partitioning. Some properties — such as whether a table uses a write-ahead log — are not inherited properly. Also, there is a bug that has yet to be tracked down, where dropping child tables when foreign keys reference them results in the entire foreign key constraint being dropped.
Despite these issues, Fiske said that partitioning is now a first-class feature in PostgreSQL that is "mostly feature complete". He advised the audience to "tune your queries first" before resorting to partitioning, but said that partitioning is a big performance boon for certain use cases.
The place of pg_partman
Fiske is well qualified to talk about PostgreSQL partitioning: he has maintained his own PostgreSQL extension to make partitioning easier for many years. pg_partman, which is provided under the PostgreSQL license, is an extension that automates many of the maintenance tasks associated with partitioning.
Using a trigger to create new child tables when required — a common approach, Fiske said — is convenient, but "for a very very busy table, making tables on demand can cause a huge backlog". Making a new child table involves taking a lock on the parent, which can cause delays when many transactions try to make the new table at the same time. pg_partman supports a configurable system for creating new child tables before they're needed, avoiding that contention.
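Setting that up looks roughly like the following; the exact arguments to create_parent() and their defaults vary between pg_partman versions, so treat this as a sketch and check the extension's documentation:

    CREATE SCHEMA partman;
    CREATE EXTENSION pg_partman SCHEMA partman;

    -- Have pg_partman manage the (already partitioned) measurement table,
    -- keeping four monthly child tables pre-created ahead of need:
    SELECT partman.create_parent(
        p_parent_table := 'public.measurement',
        p_control      := 'logdate',
        p_interval     := '1 month',
        p_premake      := 4
    );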
It also includes utilities for partitioning existing tables, for managing tables that would have their names truncated (to PostgreSQL's 63-character limit), and many other details relevant to partitioning. One interesting feature is the ability to scan a child table that has stopped being updated, and automatically create constraints based on what actual data is present in the table, which "dramatically increases performance" of queries made against large tables.
Fiske is still updating pg_partman; version 5.1 should be released by the end of March, and will support proactive child-table creation for list partitions with single-ID values. Despite this, Fiske said that one thing he wanted people to come away with was the understanding that he would much rather these features make their way into PostgreSQL itself. "Please make my extension obsolete," he requested. To that end, Fiske was happy to see a core PostgreSQL contributor in the audience. He also called the code in his extension a kludge, and said that unfortunately the code was not in a fit state to be accepted into core PostgreSQL as-is.
While there remain some small problems to work on, it seems clear that PostgreSQL's support for table partitioning is ready to see more widespread use. The project's development strategy means that additional performance and usability improvements to partitioning are likely to trickle in over time, but PostgreSQL 16, the current supported version, is already quite capable.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Pasadena for SCALE.]
Radicle: peer-to-peer collaboration with Git
Radicle is a new, peer-to-peer, MIT/Apache-licensed collaboration platform written in Rust and built on top of Git. It adds support for issues and pull requests (which Radicle calls "patches") on top of core Git, which are stored in the Git repository itself. Unlike GitHub, GitLab, and similar forges, Radicle is distributed; it doesn't rely on having everyone use the same server. Instead, Radicle instances form a network that synchronizes changes between nodes.
As a new project, Radicle is not fully featured compared to the mature and centralized forges. That said, the Radicle developers are using Radicle itself to collaborate on the software, along with a Zulip chat system. The first 1.0.0 release candidate was announced on March 26.
(Note that I am paid to help develop Radicle.)
Overview
In Radicle, each user runs their own node on each computer they use for collaboration. The node stores copies of the repositories the user is interested in, regardless of whether they're created by the user or cloned from another node. The node process runs in the background to fetch changes from peers and serve repositories to peers that want them. To the user, the node acts like a local Git server. You clone from, pull from, or push to the node and it coordinates with other nodes.
There is a web interface for browsing repositories, issues, and patches, which also allows opening and managing issues. The web interface can be opened for the local node or, on a suitably configured server, for any other node; thus you can inspect any public node to see whether it is in sync with yours.
The web interface looks a lot like the more mainstream forges, and is meant to feel instantly familiar. You can browse the code, issues, and existing patches. However, unless you run your own Radicle node and open its web interface, you can't currently make changes: you can't report issues, comment on issues, etc.
If you want to clone a repository locally, the web interface provides two ways: either using normal Git (git clone) and an HTTPS URL, just like other forges, or having your Radicle node fetch it and clone from that using the rad command-line tool. You don't need to use Radicle to get a copy of a repository from Radicle.
Creating issues and patches — and commenting on them — happens using your own Radicle node. There is a command-line interface, and a web user interface. The Radicle project is also working on a full-screen terminal user interface, like Midnight Commander but for Git, and there is integration with Visual Studio Code and IntelliJ IDEs, among others.
Motivation
The motivation for Radicle is similar to that of the overall decentralization movement. The centralized Git forges are popular for good reasons: they have put a lot of effort into being easy to use and efficient, and into providing the kinds of features their users need. However, they are also not always great for everyone. There is some truth in the joke that when GitHub is down, the world stops developing software. Git was the first popular distributed version-control system; then the popularity of GitHub made it the most centralized version-control system.
With a peer-to-peer system, if your node is down, you may have to stop working, but nobody else needs to. More importantly, you don't need permission to run a Radicle node. Your access can't be revoked. Your repositories can't be deleted from your node by others. Nobody will force you to accept an "updated" set of terms and conditions.
Radicle stores issues and patches in the Git repository itself. You can create, comment, and manage them while offline. Radicle is local-first software: network access is only required for operations that inherently require communicating with other computers, such as retrieving or sending changes. Everything else works without the network.
Radicle repositories are self-signing, a necessary feature in a distributed system. While a GitHub repository can be authenticated by location (it's on GitHub with a given name), a Radicle repository is associated with a small set of cryptographic signing keys, which allows its identity and contents to be authenticated regardless of where the repository is stored.
Radicle's origin traces back to 2017. Its development is funded by Radworks, an organization that coordinates on the Ethereum blockchain using a token called RAD. However, Radicle does not use any blockchain or cryptocurrency technology. Radicle is not the only approach to decentralized collaboration over Git. ForgeFed is a protocol built on ActivityPub to support a federated network of Git servers.
Architecture
Nodes communicate with each other using two different protocols. First, there is a gossip protocol, where nodes tell each other about the nodes and repositories they know about, and about changes to the repositories. Second, they use the Git v2 smart transfer protocol to exchange repository content.
Each node stores a copy of the repository as it exists on each other node it knows about, using the Git namespace feature with the node identifier as the name. Those copies have nearly identical content, and Git stores the underlying objects only once, so this is an efficient way to keep the data. To Git, a Radicle repository is a perfectly normal repository. Radicle uses standard Git mechanisms to add a persistent repository identity, issues, and patches, all of which are stored in special Git refs.
Every repository stored in Radicle has an identity that is the same in every Radicle node where the repository appears. If there are multiple copies of a repository stored in a node, each copy has the same identity, and can thus be identified as a version of the same repository. This identity is created by committing an "identity document" that contains metadata about the repository; an example identity document is:
{ "payload": { "xyz.radicle.project": { "defaultBranch": "master", "description": "Radicle Heartwood Protocol & Stack", "name": "heartwood" } }, "delegates": [ "did:key:z6MksFqXN3Yhqk8pTJdUGLwATkRfQvwZXPqR2qMEhbS9wzpT", "did:key:z6MktaNvN1KVFMkSRAiN4qK5yvX1zuEEaseeX5sffhzPZRZW", "did:key:z6MkireRatUThvd3qzfKht1S44wpm4FEWSSa4PRMTSQZ3voM" ], "threshold": 1 }
The document includes a description of the repository and its default branch. The delegates list contains the public Ed25519 keys of the delegates that are empowered to change the identity document or commit to the main branch. The document is stored behind the rad/id ref in the repository, and must be signed by one of the delegates' private keys. Each repository has a short ID that is calculated from a hash of the initial version of the identity document. This ID will not change as the identity document is modified, and can thus be used to identify the repository over time.
There is no way to prevent anyone from changing the identity document on their own node and signing it with their own key, bypassing Radicle. However, other nodes won't accept the change, since it is not signed by a delegate's key, making such changes pointless. Radicle also signs the branch and tag refs. The signatures are refreshed whenever a node changes anything in the repository. This means that other nodes can verify another node's repository content without inspecting the other node directly. This helps prevent forgeries and related attacks.
Issues and patches are stored using an implementation of a conflict-free replicated data type, which Radicle calls a "collaborative object", or COB. Any node can append data to a COB, and changes from different nodes can be merged without conflict. (Normal Git conflict management applies to the rest of the repository content, however: the user needs to resolve any conflicts there.) The COB mechanism in Radicle is generic, and can be used to build further collaboration tools.
Seed nodes
Any node can synchronize data with any other node, but only if they can communicate directly. With network address translation (NAT) being prevalent, this is often not possible. Radicle does not yet have "NAT punching", but relies on third-party, publicly accessible seed nodes. This is safe thanks to the repositories being self-signed: the seed node can't modify the data.
Thus, if Alice and Bob are both behind NAT networks, they can collaborate via a seed node on the Internet that they both can access. Unlike with centralized forges, anyone can set up a seed node. This works especially well for open-source projects that don't need to keep repositories hidden. If hiding is necessary, a private seed node and Radicle private repositories can be used. A private repository is one that's configured to only be shared with specific other nodes. However, Radicle is not yet a good solution for truly confidential material: the private nodes are still rather rudimentary.
Missing features and other challenges
Radicle does not yet have mature support for continuous-integration systems, although work on that is underway. There is also only rudimentary support for code review, but that is also being worked on.
Currently, Radicle has a simple identity system: each node has its own public key. A challenge for the future is to evolve this into a more versatile identity system. For example, a developer with both a laptop and a desktop system would benefit from being able to certify that both nodes are theirs. Support for groups or organizations is also missing.
Perhaps a more fundamental challenge is that interacting with a Radicle repository, even just to report issues, requires using Radicle. With a centralized forge, all you need is an account and a web browser. This may be a problem for projects that would like to use Radicle, but whose users can't be expected to use it.
Conclusion
Radicle is a new and promising decentralized approach to Git hosting. If you are curious to know more, the guides are a good place to start. We're also accepting patches and hoping to bring in new contributors, so if you know some Rust and care about developer tooling, please join us on our Zulip forum.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: XZ backdoor; Kernel 6.9-rc2; Fedora 40; SUSE KDE security; NetBSD 10.0; GCC static analysis; Redict 7.3.0; Samba 4.20.0; Quotes; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.