LWN.net Weekly Edition for July 4, 2024
Welcome to the LWN.net Weekly Edition for July 4, 2024
This edition contains the following feature content:
- Debian debate over tag2upload reaches compromise: a long discussion — even by Debian standards — on a new mechanism for developers to upload packages makes some progress.
- Python grapples with Apple App Store rejections: getting Python apps into Apple's store requires some compromises.
- PostmarketOS: Linux for phones and more: a distribution to keep older mobile devices running usefully.
- Direct-to-device networking: getting the CPU out of the loop when the data is just passing through.
- Arithmetic overflow mitigation in the kernel: ongoing work to harden the kernel against classes of bugs runs into some resistance.
- Eliminating indirect calls for security modules: a patch series to make security modules both faster and more secure.
- Mount notifications: an LSFMM+BPF discussion on how to best inform user space of mount events.
- Redox: An operating system in Rust: a microkernel system built from scratch.
- FreeDOS turns 30: three decades of work to emulate the DOS operating system.
- Mourning Daniel Bristot de Oliveira: the kernel community loses another developer.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Debian debate over tag2upload reaches compromise
Debian's proposed tag2upload service would be worthy of an article even if it weren't so contentious; tag2upload promises a streamlined way for Debian developers using Git to upload packages to the Debian Archive. But tag2upload has been in limbo for years due to disagreement and a communication breakdown between the team behind tag2upload and the ftpmasters team. It took the threat of a General Resolution (GR), weeks of discussion, and more than 1,000 emails to finally move forward.
History
Work on tag2upload began in 2019. Sean Whitton wrote that he and Ian Jackson designed, implemented, and tested the prototype over a weekend. The system, he wrote, would allow Debian Developers to upload new versions of packages by using a new script ("git debpush") to push a signed and specially formatted Git tag (example here) to Debian's GitLab instance, called Salsa:
That's right: the only thing you will have to do to cause new source and binary packages to flow out to the mirror network is sign and push a git tag.
When a developer signs and pushes the Git tag, it triggers a GitLab webhook that sends the URI of the Salsa project repository and the name of the tag to the tag2upload service. The service then verifies the signature on the tag against Debian's keyring. Assuming that the signature checks out, tag2upload produces a Debian source control file (.dsc) and a Debian changes (.changes) file, signs them, and uploads the results to the ftp-master.debian.org server queue. If all has gone well to this point, the source and binary packages are built and then sent on their way to Debian's package pools. The access to that path runs through the ftpmasters.
The ftpmaster team members are appointed ("delegated" in Debian lingo) by the Debian Project Leader (DPL). This team is responsible for deciding what is allowed into the Debian Archive and may reject uploads for a variety of reasons including an unacceptable license, errors caught by Debian's package linter, missing source, and policy errors to name only a few. The reject FAQ provides a lengthy, but non-exhaustive, list and leaves the door open to rejection for other reasons. Because the team is delegated by the DPL, its decisions cannot be overridden by Debian's Technical Committee; they can only be overridden by a GR.
Jackson shared a draft of the service architecture to the debian-devel list in July 2019. Ansgar Burchardt, a member of the ftpmaster team, noted that the service as described would bypass various permission checks on the archive side. Jackson asked which checks would be bypassed, but no response was given. In August 2019, he posted a second version of the draft in response to suggestions from the list.
Debian developer Raphaël Hertzog replied that he had reviewed the thread and said that the "point of friction" between the groups was that the ftpmasters wanted to require a signed .dsc from the maintainer that would ensure that the source package created by tag2upload was what the maintainer intended to upload. Jackson, he said, doesn't want the maintainer to have to deal with the .dsc. He also proposed a workaround that might supply that information. There was no reply to that email on the list.
Another member of the ftpmaster team, Joerg Jaspert, replied to say that he was "a bit detached right now with anything Debian" and had not read most of the threads about tag2upload. He said he was in favor of something more Git-like, but the implementation as it stood was a no-go. The final check of the maintainer's key has to be performed on the ftpmasters systems. Other systems could perform the check as well, but the final one must be performed by the Debian Archive Kit (dak) system. Jackson replied that tag2upload would include a copy of the original uploader's signed Git tag object "as soon as dak supports it". The conversation seems to have sputtered to a stop there. Whitton, who is also part of the ftpmaster team as an FTP Assistant (a member with fewer privileges than members of the team with the FTP Masters role, such as Jaspert and Burchardt), did not reply on the list about Jaspert's concerns.
In 2020 there was a conversation on debian-devel about the status of tag2upload. Whitton said that the ftpmaster team had objections to its design and "we could not overcome the disagreement". There was some back-and-forth between Whitton and Burchardt, but nothing was resolved.
"All lies"
In 2023, Jackson and Whitton gave a talk at a Mini DebConf in Cambridge to promote tag2upload. In the draft text of the talk, they wrote that tag2upload would make uploading packages more convenient, and improve Debian's source code integrity too. Debian currently has several methods for uploading source packages; what they have in common is that they all require the source package to be created on the maintainer's system and then uploaded. In the talk text, however, Jackson and Whitton argue that the majority of Debian packagers are actually using Git and treating repositories in Salsa as the primary source of truth:
Theoretically, the Debian archive is the canonical source code repository. Theoretically, the .dscs are the preferred form for modification of our software. But, this is, nowadays, all lies, at least for the vast majority of packages.
Most packagers, they said, are treating the .dsc and "the whole of the official archive" as outputs rather than the principal work. But the primary working repositories in Salsa are not official, and they do not follow a standardized format. This makes life harder for people who test the packages, and for anyone making a non-maintainer upload (NMU). "You can't automatically grab the git source for an arbitrary package, and build it," they wrote. In contrast, using tag2upload, anyone could check out the tag that represents a version of a package and work with the same source that the maintainer used.
At the time of the talk, the pair said that the service was almost ready to deploy. It had received a security review by "some Debian security experts" including Russ Allbery and Jonathan McDowell, and there was some outstanding work to address before live deployment, but the tag2upload service also needed approval from the ftpmaster team. They expressed hope that the details covered in the tag2upload talk would help make that happen. What did not happen was a direct conversation between teams to attempt to unblock the service.
Call for a General Resolution
On June 12 of this year Whitton sent an email with a description of tag2upload, its benefits, the reasons for proposing a GR, and the draft of the GR to the debian-vote mailing list. Whitton said that he was posting the draft for review, rather than proposing it for a vote immediately, "because of the relative shortness of our official discussion periods". (LWN covered proposed changes to the GR discussion period in 2021, and a GR amending the process passed in 2022.)
The reason for a GR, Whitton wrote, is that the ftpmaster team had stated a requirement that the Debian Archive Kit (dak) system "be able to completely re-perform the verification of maintainer intent done by the tag2upload service". That requirement would fatally undermine the design and user-experience of tag2upload. Whitton also wrote that Allbery and others have tried, and failed, to get an explanation from the ftpmaster team "that we could understand as a strong technical objection". His conclusion was that the team is being conservative and favoring existing processes by default, rather than having an actual technical objection to tag2upload. The service was ready to deploy without any code changes by the ftpmaster team, he said, but it needed "a suitably trusted key" similar to the ones used by Debian's buildd autobuilders.
Since the designers and proponents of tag2upload had reached an impasse, Whitton said that the GR would be needed to get the project unstuck and into production. The text of the proposed GR stated that tag2upload should be deployed "in the form designed and implemented", which was reviewed by Allbery and McDowell. The ftpmaster team's objection would be overruled according to the Debian Constitution section 4.1.3. ("Make or override any decision authorised by the powers of the Project Leader or a Delegate.")
Jackson said that he and Whitton had been "very slow to reach for the GR hammer", and said that tag2upload had been blocked since 2019. He said that they had asked for help behind the scenes, including speaking to several Debian Project Leaders (DPLs) and others, "but sadly that has not been effective".
Allbery shared his security review of tag2upload to debian-vote shortly after Whitton's draft GR, with the caveat that he was not a neutral party "in the sense that I think tag2upload is a good idea and should be deployed". However, he said that he does security reviews professionally and tried to approach it the same way he would a major work project. He encouraged other Debian members with security expertise to check his work. The entire document is worth reading, but his conclusion was that tag2upload should be adopted:
Compared to the existing upload architecture, tag2upload provides additional defenses against injection of malicious code into source packages and better traceability of source package contents, at the cost of some minor additional security risk and infrastructure complexity. I believe tag2upload has somewhat stronger security properties than the current upload mechanism but not a profound advantage. I do not believe it introduces any significant security regressions.
Allbery added that there is "some irreducible security risk" that comes with introducing new features and a new attack surface, but whether the benefit outweighed the risk "is not a decision that can be made by a security review."
Surprise
On June 15, Jaspert joined the discussion. He wrote that he was taken by surprise by Whitton's draft GR and that the last communication "in my ftpmaster inbox" was in 2019. "There had been mentionings on some mailing list somewhere, but nothing coming to us, that I can find." At that time, he said, the ftpmaster team raised several points of objection to tag2upload's design.
The primary point Jaspert raised was that the design of tag2upload could "bypass/circumvent archive upload checks and restrictions". He said that, in the five years since it was first discussed with the ftpmaster team, the concern had still not been addressed. "More like, entirely ignored." However, he said, the ftpmaster team was in favor of the service and wanted to see it running if the final verification and authorization of an upload was handled by dak without needing to trust another service.
Whitton replied that the tag2upload team has been "seeking help behind the scenes" for four years, without progress and that led to the GR. He said that Jaspert had not been very active with the ftpmaster team recently and "may be missing some things". He thanked Jaspert for how he had explained his position in the email, but said that the tag2upload team had argued against the ftpmaster team's opinion—they had not ignored it—and no one had responded. He directed Jaspert's attention to an email from Allbery, and said "I'm looking forward to your reply to Russ".
Allbery wanted to know why final verification and authorization was the ftpmaster team's red line. "Is it only that you don't want to add another system to the trusted set, or is there something more specific that you're concerned about?" He also proposed a hypothetical design for tag2upload that would add a signed hash of the Git tree, possibly in a separate file, that would then pass dak an object that has the hash of the tree and the uploader's signature. Would the ftpmaster team approve that design "modulo the normal sort of details that would need to be hashed out?"
Jaspert replied that, yes, there should be one point doing the verification and authorization, not many. He also said that tag2upload would be doing work that was delegated to the ftpmaster team, though that could be addressed by the ftpmaster team running tag2upload or adjusting delegations. But even if the ftpmaster team ran it, he said, it was still a loss given the current tag2upload design. For one thing, the current practice allowed users to verify the signature of the actual maintainer rather than having to trust the key from tag2upload or the ftpmaster team:
We want dak (and anyone else) to be able to say "Yes, [packager] $x has signed off this content". That only works, if dak (and later, the public, if they want to check too) have the signature for this in a way they can verify it. And not just a line somewhere "Sure, $service checked this for you, trust us, please".
In another reply on June 17, this one to Jackson, Jaspert said that the tag2upload proponents should have directly asked the ftpmaster team for its position instead of working behind the scenes. That, he said, would have led to an actual position from the team. He reiterated that he was interested in finding middle ground, but he had developed an eye condition that meant it hurt a lot to focus on anything. He asked for a pause on the GR "in the term of days, not months" to allow him time to return to the conversation.
Questions, objections, and clarifications
The conversation continued unabated for days in Jaspert's absence, as many still wanted to discuss the design and implications of tag2upload. Simon Richter asked whether tag2upload was "completely orthogonal to any efforts to move all packages to git-based team maintenance?" That was likely, at least in part, in reference to a conversation in April on the debian-devel mailing list. In short, current DPL Andreas Tille had opened a discussion about ending single-package maintainership and requiring Debian package source to be maintained in Salsa. As one might expect, that discussion did not lead to a consensus that all Debian packagers should adopt one true method of doing things. But the topic is clearly still fresh in the minds of some Debian contributors.
Jackson said that there were two answers to that question, one technical, one social. "The social answer is that there is absolutely no connection." The tag2upload team, he said, is not interested in trying to wrangle everyone to the same development platform. The aim is to provide tooling that can support anyone in their current development practices "insofar as we can". On a technical level, he said, tag2upload and efforts to mandate maintaining packages in Salsa are almost completely orthogonal, except that developers obviously can only use tag2upload if they are also using Git. As far as team versus individual maintainership, "tag2upload doesn't care at all about maintainership." It only cares about having a Git tag signed by a package maintainer that indicates an intent to upload.
Timo Röhling had a practical question: who would be responsible for the service once it is deployed? Jackson said that he expected to own the service together with Whitton, while the Debian System Administrators (DSA) team would manage the virtual machines.
One advantage to tag2upload, Matthias Urlichs wrote, is that the potential for discovering an attacker's modification is greater when using tag2upload. A signed tag, he noted, tracks the history of a set of files. A user could verify that an emergency NMU pushed to a package "only contains one commit on top of the maintainer's and changes only one file". Two files, if one counts the changelog. It is possible to do this with source packages, but he said it is more work, doesn't work well between upstream versions, and more: "all of which means that nobody's going to do it, much less notice said spurious change by accident."
One problem with tag2upload's design that Simon Josefsson called out is that it aggregates and amplifies the consequences of one successful compromise. While it may be easier to compromise, say, the laptop of a single packager, the reward (for an attacker) of compromising a tag2upload server would be much higher. However, he said that "adding another attack vector will not break this camel's back". The bigger problem, he said, was "how some people use their technical powers to limit what other people can do effectively."
Sprawling thread
On June 21, Allbery wrote that he was sure that many people had stopped reading because the discussion had become "a sprawling thread of nearly unreadable volume". He said that he was possibly the largest contributor to the volume, so he would "do penance and summarize the thread for everyone else".
Allbery said that it seemed that the ftpmaster team and tag2upload team had agreed that the authentication protocol for tag2upload should change. Instead of the tag2upload server performing the signature check and asking dak to trust that it had, it would include the signed Git tag and dak would redo the signature check. This would entail code changes for dak, and could be done with a proper API. He said that raised other questions about APIs between Debian project services but that should be deferred to later.
A remaining point of disagreement, he said, was that "tag2upload uploads will not contain an uploader signature over the exact files that comprise the source package". The transformation from a signed Git tree to source package might include "synthesizing or modifying files in ways that the uploader does not have in their Git tree in a hashable form". The ftpmaster team, Allbery said, wants the exact contents of every file in the source package to be covered by an uploader signature "other than the constructed *.dsc file" and possibly the *.orig.tar.gz file.
From a security standpoint, Allbery said, it makes little difference whether the source package construction happens before or after the signature "given that the uploader doesn't (and can't, realistically) check the output in either case". That seemed to be the remaining blocking issue, he said. Jackson thanked Allbery for the summary and said that he didn't spot anything he would disagree with in it.
Call for vote
On June 27, Whitton formally called for a GR to override the ftpmasters. The procedure for resolutions requires five additional sponsors, and a minimum of two weeks for discussion. Jaspert responded to the call for seconds with disappointment and asked why there was a hurry to run the GR now. "You waited 5 years without ever bringing this forward, and now it must be done right now, and can't wait more time? Oh come on." In another reply he disputed that the ftpmaster team had made a decision to reject uploads from tag2upload. He accused Whitton of taking a "my way or nothing" approach.
Allbery responded that he flatly did not believe Jaspert actually thought that ftpmaster had not made a decision:
There is literally no way that you could think, after all of this discussion, that you haven't been asked for a decision or what that decision is about.
He pointed out that Jaspert had asked for time ten days earlier. When Whitton asked if the team was still considering options there was no response. Allbery said that the process was exhausting and that even though people are reluctant to use a GR to force a decision "at some point there *has to be an end*". If not, it would just be "decision by attrition".
Finally, compromise
Whitton did not reply directly to Jaspert, but on June 28 he wrote that he had put the GR forward for seconds because he wanted to minimize the impact of the GR on the upcoming DebConf and DebCamp. However, he said that he had received feedback that he had "misjudged the extent to which discussion is still ongoing" and appeared to hint at delaying or stopping the GR.
On June 30, Jaspert sent a proposal on how to move forward with tag2upload. It would require tag2upload to send a normal Debian source package, plus two files. He suggested using a Git command to generate a list of the file names, modes, and commit IDs that would be signed by the packager's key. The exact format, he said, could be worked out during implementation. The other file would be a shallow Git clone of the repository, put into a tarball, for the tag that is to be uploaded by tag2upload. That would be generated by the tag2upload service.
Allbery thanked Jaspert strongly for putting the proposal together, and they continued to hash out details on the list, with input from others. On July 1, Whitton cited the "productive discussion we're now having" and asked that Tille use his DPL powers to extend the discussion period for the GR to give the two camps time to nail down details, with the anticipation that he would withdraw the GR once an agreement was reached. In his July 1 "Bits from the DPL" message Tille said that he would extend the discussion period to give the camps time to reach agreement and avoid a GR.
On July 2, Jackson seemed satisfied that an agreement had been reached, and said that he thought it was time to update the design of tag2upload with the agreed changes.
At this point, it seems that the parties will be able to move forward and eventually deploy tag2upload as a service for Debian's packagers. It is unfortunate that it took five years and the threat of a vote that would override the authority of a team to prod the parties into working together constructively to make what amounts to a small compromise and design change.
Python grapples with Apple App Store rejections
An upgrade from Python 3.11 to 3.12 has led to the rejection of some Python apps by Apple's app stores. That led to Eric Froemling submitting a bug report against CPython. That, in turn, led to an interesting discussion among Python developers about how far the project was willing to go to accommodate app store review processes. Developers reached a quick consensus on a solution that may arrive as soon as Python 3.13.
The problem at hand is that Apple's macOS App Store is automatically rejecting apps that contain the string "itms-services". That is the URL scheme for apps that want to ask Apple's iTunes Store to install another app. Software distributed via Apple's macOS store is sandboxed, and sandboxed apps are prohibited from using URLs with the itms-services scheme. That string is in the urllib parser in Python's standard library, though an application may never actually use the itms-services handler.
Of course, Apple did not do anything so straightforward as to explain this to Froemling. Once he filed an appeal with Apple about the rejection, Apple finally told him that parse.py and parse.pyc were the offending files. After that, he said, it was not hard to track down the problem:
Now in retrospect I'm frustrated I didn't think to run a full text search for itms-services over Python itself earlier or stumble across one of the other cases of folks hitting this.
Russell Keith-Magee started the discussion in the Python Core Development discussion forum on June 17. He wanted to know whether "acceptable to app stores" should be a design goal for CPython, or if that compliance should be a problem left to the tools that generate application bundles for app stores.
Paranoid and inscrutable
Keith-Magee noted in his initial question that Apple's review processes were the most "paranoid and inscrutable" of app-store-review processes, but that other app stores also had "validation and acceptance processes that are entirely opaque". One solution might be to obfuscate the offending string to pass review, but that might "lead to an obfuscation arms race" and there were no guarantees this would be the last time the project had to resolve app-validation problems. The other option, he said, was to consider this to be a distribution problem and leave it to tools like Briefcase, py2app, and buildozer to solve. Traditionally, they have had to patch CPython anyway, he said, because it did not support Android or iOS "out of the box". But that will change with Python 3.13 when no patching should be required for those platforms.
Alex Gaynor suggested that the project try an approach that Keith-Magee had not put forward, one inspired by Gaynor's experience with the cryptography library. That project often receives complaints that the library refuses to parse a certificate that is technically invalid but in wide use. He said that the policy was to accept pull requests that work around those issues "provided they are small, localized, and generally aren't too awful".
But, he added, these patches should only be accepted on the condition that someone complains to the third party (in this case Apple), and extracts some kind of commitment that they would do something about it. He suggested that the workaround be time-limited, to give users a decent experience "while also not letting large firms simply externalize their bizarre issues onto OSS projects".
Brandt Bucher wondered whether obfuscation was even allowed, or if it would be seen as circumventing the review process. That was a question no one seemed to have an answer to; Keith-Magee responded with an 8-Ball emoji and the phrase "ask again later." He added that Gaynor's approach sounded appealing, but it would be like screaming into the void. Apple, he said, barely has an appeals process and there is no channel available to the Python project "to raise a complaint that we could reasonably believe would result in a change of policy".
Another approach, suggested by Alyssa Coghlan, would be to use a JSON configuration file that urllib would read to set up its module-level attributes "rather than hardcoding its knowledge of all the relevant schemes". That could allow app generators to drop "itms-services" from the configuration file rather than patching urllib.py directly. Keith-Magee said that could work, but "it strikes me as a bit of overkill for an edge case" that could be handled by obfuscation or distribution-level patching.
On June 20, Keith-Magee wrote that he had thought of another approach: adding a build-time option called "--with-app-store-patch" that removes code that is known to be problematic. He said it would be enabled by default for the iOS platform, and disabled elsewhere. It could be used when building an application for macOS, if the developer intended to distribute that application via the macOS App Store. He suggested that the option could also accept a path to a file with a patch, to allow distributors to provide an updated patch if an app store changes its rules after the maintenance window for a Python release has closed.
Let's paint the bikeshed
Coghlan asked if it was now time to "paint a config option bikeshed". She said that the proposed option name was both too broad and too narrow. The "app-store" component of the name was too broad, because it could encompass any app store, not only Apple app stores. The "patch" component was too narrow, because patch specifies the method of complying with policies rather than intent. There may be other methods required to comply with app-store-compliance checks. Keith-Magee liked the suggestion about dropping "patch" from the option name, and suggested painting the bikeshed a nice shade of "--with-app-store-compliance" that would interact with platform identification to sort out what is required.
On June 25, Keith-Magee thanked participants in the discussion for their input, and pointed to a pull request that would implement the --with-app-store-compliance configuration option. In the request, he noted that it would be possible to use the option with platforms other than iOS or macOS, but there were no use cases for that at present. If all goes well, it should be available in Python 3.13.
It is frustrating that free-software projects like Python have to waste time finding ways around opaque review processes just so developers can write software for non-free platforms. However, the approach taken by Keith-Magee and other CPython developers seems to be the least-bad option that offers the best experience for Python application developers. It will almost certainly not be the last time that a project runs into this problem.
PostmarketOS: Linux for phones and more
In 2016, Oliver Smith reached a point of frustration with the short lifespan of updates for his Android phone. Taking matters into his own hands, he began developing postmarketOS, a Linux distribution for mobile phones. Eight years later, the core team and trusted contributors have grown to twenty individuals, while the latest release, v24.06, now shows support for over 250 devices. Although postmarketOS isn't usable as a day-to-day phone operating system on all of them, it can also enable repurposing devices into compact servers or kiosk machines.
On its web site, postmarketOS is described as a "real Linux distribution for phones and other mobile devices". Unlike with mainstream mobile operating systems, users have full control over postmarketOS. It gives them the freedom to tinker, to back up and restore their complete home directory, to turn their phone into a second display or other USB gadget, and to choose from multiple interfaces (what would be called "desktop environments" on non-mobile systems).
PostmarketOS is based on Alpine Linux, a lightweight Linux distribution that also serves as a popular base for Linux containers. The recently announced v24.06 release is built on Alpine Linux 3.20. All of the interfaces have received an upgrade since the v23.12 release from December 2023. In particular, KDE Plasma Mobile 6 provides a lot of new functionality, including the introduction of a new home screen that allows users to customize pages with apps, folders, and widgets.
Device support
As with any mobile Linux distribution, the primary concern of prospective users is probably device support, since it can't be expected that postmarketOS will support any arbitrary phone. The project's wiki has an automatically generated page with supported devices, grouped into three categories according to their support status: main, community, and testing.
Understanding these categories is vital to adjust expectations prior to installing postmarketOS onto a device. A newly ported device starts out in the testing category. It's only promoted to the community category once certain requirements are met. These requirements are not related to the feature set, but rather pertain to the maintenance of the port. This includes well-documented installation instructions, a (close to) mainline kernel, automatic kernel upgrades, and an active maintainer participating in the release process. There are dozens of devices in the community category, including phones, tablets, Arm-based laptops, and even some single-board computers. When considering one of these, always consult the feature matrix to know what to expect, and thoroughly read the device-specific wiki page for any potential caveats.
A port is elevated to the highest category only if it's actively maintained by at least two people and can be fully used as a phone. This includes having a functional user interface, working phone calls, and support for many other features users expect on a phone: SMS messages, mobile data, WiFi, audio, battery charging, and Bluetooth. Note that a working camera is not listed among the required features. Currently, only two phones are in the main category: the PINE64 PinePhone and the Purism Librem 5, both of which have all of the main features operational but only have a partially functional camera. LWN looked at the PinePhone back in February 2022.
Installing postmarketOS
Two primary installation methods are available for postmarketOS. The first method entails downloading a pre-built image and flashing it onto the phone. The images on the download page are largely main and community devices, along with a few devices in the testing category. This is the easiest method, albeit one lacking customization options.
The second method, available for all devices, requires building a customized image using pmbootstrap. Even for devices that have a pre-built image available, this is an interesting method, since it allows choosing the user interface, pre-installing specific packages, and performing other customizations.
Pmbootstrap works by setting up a chroot environment with Alpine Linux and installing the desired packages within it, similar to Debian's debootstrap. It requires at least 10GB of free space and runs on most Linux distributions that have Python 3, OpenSSL, and Git installed. Pmbootstrap is available in various distribution repositories, but using the package from Ubuntu 24.04 immediately failed for me, due to its 2.1 version being outdated, prompting me to install the latest version from Git.
An invocation of "pmbootstrap init" initializes the working environment after asking some questions, such as the update channel to get packages from, the device vendor, and the device name. It also allows choosing a user name, the WiFi backend (wpa_supplicant or iwd), and the user interface. For my PinePhone, I selected Phosh, which was originally developed by Purism for its Librem 5; it is a phone shell based on GNOME and Wayland. Subsequently, any additional packages to be installed can be specified as a comma-separated list, and the time zone, locale, and host name can be set. The initialization process also allows copying an SSH public key to the device to log into it later.
![Phosh on postmarketOS](https://static.lwn.net/images/2024/postmarketos-sm.png)
Following this preparation, the "pmbootstrap install --sdcard /dev/DEVICENAME" command builds the custom image for the phone and writes it to the specified microSD card. Creating an image with full-disk encryption requires appending the --fde option to the install command.
The command first prepares the chroot and creates the root filesystem for the device. The login password prompt allows choosing between only numeric characters (since a PIN code is easier to input) and alphanumeric characters for more security. Finally, pmbootstrap writes the boot and root filesystems to the microSD card. The device-specific wiki page provides more information for those devices requiring it. A "pmbootstrap shutdown" command cleans up the process, and the microSD card can then be inserted into the phone. The entire procedure went smoothly during a test installation on my PinePhone.
An alternative to writing an image to a microSD card is to generate the image with "pmbootstrap install" and then flash it to the connected device with "pmbootstrap flasher". The exact steps can be found on the device's wiki page.
Working with Phosh on postmarketOS doesn't differ that much from interacting with the same shell on any other mobile Linux distribution, such as Mobian, which we covered here in April 2023. Most of the applications are the same; installing and updating software is done with GNOME Software. It is configured for installing packages from the Alpine and postmarketOS repositories, as well as Flatpaks from Flathub. The same packages can also be installed using Alpine's command-line package manager apk or the flatpak command in the GNOME Console app.
Naturally, development of Phosh hasn't stagnated since we looked at it in Mobian last year. The new Phosh 0.39.0 in the latest postmarketOS release has added some nice features. This includes the capability to select the WiFi network in the quick settings by long-pressing the WiFi icon, and opening the on-screen keyboard using a long press on the bottom bar.
Further development
Just like its parent distribution Alpine Linux, postmarketOS uses OpenRC as its default init system. However, running GNOME or KDE on OpenRC presents certain challenges, since parts of systemd had to be reimplemented to support those desktop environments:
In order to get KDE and GNOME working at all, we use a lot of systemd polyfills on top of OpenRC. So while we are technically "not using systemd", in practice we already do use a large chunk of its components to get KDE and GNOME running, just different versions of those components.
For instance, openrc-settingsd reimplements systemd's hostnamed, localed, and timedated. Other adaptations reimplement parts of systemd-logind, systemd --user, and journald. A major issue is that some of the code chunks adapting to systemd ("polyfills"), such as those for systemd.timer and systemd-coredumpd, are unmaintained. So, the developers made the decision to begin developing postmarketOS on top of systemd, at least for GNOME and KDE. Users building their own image with pmbootstrap will retain the option to select OpenRC. Images of postmarketOS with the minimalist Sxmo interface will stay with OpenRC.
Initial systemd support targets the next release, v24.12, due in December of this year. While systemd officially only supports the GNU C Library (glibc) and postmarketOS and Alpine Linux are based on the lightweight musl libc, the developers don't consider the required changes to be a big problem, based on an LWN comment. The announcement also noted that systemd facilitates some new features in postmarketOS. For instance, socket activation enables printing from the phone without having CUPS running all the time. One caveat is that switching a phone with postmarketOS to systemd will require a reinstall.
Another change that postmarketOS is planning is a switch to PipeWire. For now, PulseAudio is still the distribution's default sound server, but the recent postmarketOS release has removed a hard-coded dependency on PulseAudio in numerous packages, allowing developers to experiment with PipeWire. In the long term, the project is also exploring the prospect of releasing an immutable version of the operating system.
Conclusion
While PostmarketOS may not be a mobile operating system suitable for all, it is certainly usable for Linux enthusiasts. The devices with the best support, the Librem 5 and the PinePhone, unfortunately aren't the greatest in terms of battery life and performance. However, Phosh is a capable mobile user interface on these devices, including functional phone calls and SMS messages.
Combined with the fact that this is a real Linux distribution that you can mold to your liking, install Linux packages on, and back up with tools such as rsync, postmarketOS undoubtedly serves a niche that includes many non-phone use cases. For instance, you could build your own image with the Wayland kiosk compositor Cage. This displays a single maximized application, which is a perfect solution to repurpose an old phone into a control panel for a home-automation system.
Direct-to-device networking
It has been nearly one year since the first version of the device memory TCP patches was posted by Mina Almasry. Now on the 14th revision, this series appears to be stabilizing. Device memory TCP is a specialized networking feature requiring a certain amount of setup, but it could provide a significant performance improvement for some data-intensive applications.
The kernel's networking stack is designed to manage data transfer between the system's memory and a network device. Much of the time, data will be copied into a kernel buffer on its way to or from user space; in some cases, there are zero-copy options that can accelerate the process. But even zero-copy operations can be wasteful when the ultimate source or sink for the data is a peripheral device. An application that is, for example, reading data from a remote system and feeding it into a device like a machine-learning accelerator may never actually look at the data it is moving.
Copying data through memory in this way can be expensive, even if the copies themselves are done with DMA operations and (mostly) do not involve the CPU. Memory bandwidth is limited; copying a high-speed data stream into and out of memory will consume much of that bandwidth, slowing down the system uselessly. If that data could be copied directly between the network device and the device that is using or generating that data, the operation would run more quickly and without the impact on the performance of the rest of the system.
Device memory TCP is intended to enable this sort of device-to-device transfer when used with suitably capable hardware. The feature is anything but transparent — developers must know exactly what they are doing to set it up and use it — but for some applications the extra effort is likely to prove worthwhile. Although some of the changelogs in the series hint at the ability to perform direct transfers of data in either direction, only the receive side (reading data from the network into a device buffer) is implemented in the current patch set.
The first step is to assemble a direct communication channel between the network device and the target device using the dma-buf mechanism. The device that is to receive the data must have an API (usually provided via ioctl()) to create the dma-buf, which will be represented by a file descriptor. A typical application is likely to create several dma-bufs, so that data reception and processing can happen in parallel, to set up a data pipeline. Almasry's patch set adds a new netlink operation to bind those dma-bufs to a network interface, thus establishing the connection between the two devices.
Some system-configuration work is required as well. High-performance network interfaces usually have multiple receive queues; the dma-bufs can be bound to one or more of those queues. For the intended data stream to work correctly, the interface should be configured so that only the traffic of interest goes to the queue(s) that have the dma-bufs bound to them, while any other traffic goes to the remaining receive queues.
The dma-buf binding will make a range of device memory available to the network interface. A new memory allocator has been added to manage that memory and make it available for incoming data when the user specifies that the data should be written directly to device memory. To perform such an operation, the application should call recvmsg() with the MSG_SOCK_DEVMEM flag. An attempt to read data that has been directed to the special receive queue(s) without that flag will fail with an EFAULT error.
If the call succeeds, the data that was read will have been placed somewhere in device memory where the application may not be able to see it. To find out what has happened, the application must look at the control messages returned by recvmsg(). A SCM_DEVMEM_DMABUF control message indicates that the data was delivered into a dma-buf, and provides the ID of the specific buffer that was used. If, for some reason, it was not possible to copy the data directly into device memory, the control message will be SCM_DEVMEM_LINEAR, and the data will have been placed in regular memory.
After the operation completes, the application owns the indicated dma-buf; it can proceed to inform the device that some new data has landed there. Once that data has been processed, the dma-buf can be handed back to the network device with the SO_DEVMEM_DONTNEED setsockopt() call. This should normally be done as quickly as possible, lest the interface run out of buffers for incoming packets and start dropping them — an outcome that would defeat the performance goals of this entire exercise.
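As a rough illustration of the receive path described above, the following user-space sketch shows how an application might drive these interfaces. The flag and control-message names (MSG_SOCK_DEVMEM, SCM_DEVMEM_DMABUF, SCM_DEVMEM_LINEAR, SO_DEVMEM_DONTNEED) come from the patch series, but the control-message payload is shown here as a simplified, hypothetical struct devmem_frag, and the real uAPI structures may well differ:

```c
/*
 * Sketch of a device-memory TCP receive loop.  The MSG_SOCK_DEVMEM flag
 * and the SCM_DEVMEM_* / SO_DEVMEM_DONTNEED names are defined by the
 * patch series' uAPI headers, not by standard <sys/socket.h>; struct
 * devmem_frag and process_on_device() are hypothetical stand-ins.
 */
#include <sys/socket.h>
#include <string.h>

struct devmem_frag {		/* hypothetical, simplified payload */
	unsigned int dmabuf_id;	/* which bound dma-buf received the data */
	unsigned int offset;	/* where in that buffer the data landed */
	unsigned int len;
	unsigned int token;	/* handed back via SO_DEVMEM_DONTNEED */
};

/* Application-specific placeholder: tell the accelerator about new data. */
extern void process_on_device(unsigned int id, unsigned int off, unsigned int len);

static void receive_into_device(int sock)
{
	char junk[1];		/* no payload is copied to user space */
	char cbuf[CMSG_SPACE(sizeof(struct devmem_frag)) * 16];
	struct iovec iov = { .iov_base = junk, .iov_len = sizeof(junk) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};

	if (recvmsg(sock, &msg, MSG_SOCK_DEVMEM) < 0)
		return;		/* EFAULT if the flag is missing, for example */

	for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm;
	     cm = CMSG_NXTHDR(&msg, cm)) {
		struct devmem_frag frag;

		memcpy(&frag, CMSG_DATA(cm), sizeof(frag));
		if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
			/* Data landed in device memory; process it, then
			 * return the buffer so the NIC can reuse it. */
			process_on_device(frag.dmabuf_id, frag.offset, frag.len);
			setsockopt(sock, SOL_SOCKET, SO_DEVMEM_DONTNEED,
				   &frag.token, sizeof(frag.token));
		} else if (cm->cmsg_type == SCM_DEVMEM_LINEAR) {
			/* Fallback: this data ended up in regular memory. */
		}
	}
}
```

The important structural points are the ones the article describes: the payload itself never passes through user-space buffers, the control messages tell the application where the data went, and the buffers must be returned promptly via SO_DEVMEM_DONTNEED.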
This documentation patch in the series gives an overview of how the device memory TCP interface works. It also documents a couple of quirks it introduces due to the fact that any packet data that is written directly to device memory is unreadable by the kernel. For example, the kernel cannot calculate or check any checksums transmitted with the data; that checking has long been offloaded to network devices anyway, so this should not be a problem. Perhaps a more significant limitation is that any sort of packet filtering that depends on looking at the payload, including filters implemented in BPF, cannot work with device memory TCP.
The patch set includes a sample application, an implementation of netcat using dma-bufs from udmabuf. The series does not include any implementation for an actual network device; Almasry maintains a repository containing an implementation for the Google gve driver. This work has evolved considerably over the last year, but it appears to be settling down and might just find its way into the mainline in the relatively near future.
Arithmetic overflow mitigation in the kernel
On May 7, Kees Cook sent a proposal to the linux-kernel mailing list, asking for the kernel developers to start working on a way to mitigate unintentional arithmetic overflow, which has been a source of many bugs. This is not the first time Cook has made a request along these lines; he sent a related patch set in January 2024. Several core developers objected to the plan for different reasons. After receiving their feedback, Cook modified his approach to tackle the problem in a series of smaller steps.
Cook referenced his slides from a talk at the 2024 Linux Security Summit North America, saying that the kernel has "averaged about 6 major integer overflow flaws a year". In his email, he was clear that he was not talking about undefined behavior: "We already demand from our compilers that all our arithmetic uses a well-defined overflow resolution strategy; overflow results in wrap-around (thanks to '-fno-strict-overflow')."
Instead, Cook was worried about cases where unintentional arithmetic overflow, even though it is well-defined, causes an incorrect value to be calculated. This is especially troublesome when it impacts bounds checks, but that isn't the only place that an incorrect value can have security implications. Unfortunately, identifying places where unintentional arithmetic overflow can occur is difficult because there are plenty of places in the kernel that use intentional arithmetic overflow. Therefore, any solution is going to have to involve some amount of human effort to determine whether overflow is expected in any given situation.
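To make the class of bug concrete, here is a minimal, hypothetical example of the pattern Cook is worried about: a bounds check defeated by wrap-around, alongside a version that detects the overflow explicitly. The check_add_overflow() helper is the one from the kernel's include/linux/overflow.h; the surrounding structure and function are invented for illustration:

```c
#include <linux/overflow.h>	/* check_add_overflow() */
#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical structure and function, for illustration only. */
struct region {
	size_t size;
};

static int check_bounds(struct region *r, size_t offset, size_t len)
{
	size_t end;

	/*
	 * Buggy pattern: if offset + len wraps around, the sum is small
	 * and the check passes even though the requested range is far
	 * outside the region.  The wrap-around is well-defined behavior,
	 * just unintended.
	 */
	if (offset + len > r->size)
		return -EINVAL;

	/*
	 * Safer pattern: make the overflow explicit.  check_add_overflow()
	 * returns true if the addition wrapped.
	 */
	if (check_add_overflow(offset, len, &end) || end > r->size)
		return -EINVAL;

	return 0;
}
```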
Cook proposed two possible solutions: adding a new sanitizer, or operator overloading. The kernel already uses several sanitizers to catch problems like invalid memory accesses and undefined behavior. Cook's first proposal was to add typedefs for values that are expected to wrap, and then begin converting the kernel over time. The sanitizer would initially issue warnings, until the kernel is converted, at which point it could be changed to emit errors. This was Cook's preferred approach. The other potential solution would use a feature that has recently been proposed (once compilers implement it) for operator overloading in C that would let the kernel customize how different types are handled.
The response
David Laight objected to the proposal on performance grounds, a topic that Cook had not addressed in his original email because, he said, it was a secondary consideration. Laight was worried that the additional checks required would both bloat the code, and make branch prediction worse. Cook responded with some performance measurements, showing that enabling the sanitizer for signed-integer overflow caused 1% run-time overhead when it was configured to issue a warning, and 0.57% when it was configured to cause a kernel panic.
Linus Torvalds was also not a fan of the proposal, saying that "unsigned arithmetic is well-defined as wrapping around, and it is so for a good reason". He continued:
I think you need to limit your wrap-around complaints, and really think about how to limit them. If you go "wrap-around is wrong" as some kind of general; rule, I'm going to ignore you, and I'm going to tell people to ignore you, and refuse any idiotic patches that are the result of such idiotic rules.
In reply, Cook said that he agreed with most of Torvalds's points — some arithmetic overflow is expected, and the sanitizer will need to be smart enough not to warn about harmless cases such as existing overflow checks. The problem, as Cook sees it, is that there is no way to tell from the source whether the developer of a piece of code intended for arithmetic overflow to occur or not. "What I need, though, is for _intent_ to be captured at the source level. Unfortunately neither compilers nor humans are omniscient". Cook also explained why he thought a type-based solution was best:
Yes, I agree, annotating the calculations individually seems like it would make things much less readable. And this matches the nearly universal feedback from developers. The solution used in other languages, with much success, is to tie wrapping intent to types. We can do the same.
Torvalds pushed back, saying that in some cases the intent is clear from context, and any tool needs to be smart enough to determine this from how a value is used: "If it's used as a size or as an array index, it might be a problem. But if it's used for masking and then a *masked* version is used for an index, it clearly is NOT a problem". One example of a place where overflow is intentionally permitted and then a mask is used to obtain the correct index is the recent generic ring buffer that Kent Overstreet posted.
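The masking pattern Torvalds described is the familiar power-of-two ring-buffer idiom; the sketch below is a generic version of it, not code from Overstreet's series:

```c
#define RING_SIZE 256u			/* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring {
	unsigned int head;		/* free-running; wraps intentionally */
	unsigned int tail;
	int items[RING_SIZE];
};

/* The counters are allowed to wrap; only the masked value indexes the array. */
static void ring_push(struct ring *r, int item)
{
	r->items[r->head++ & RING_MASK] = item;
}

static int ring_pop(struct ring *r)
{
	return r->items[r->tail++ & RING_MASK];
}

/* head - tail is correct even across the wrap, thanks to modular arithmetic. */
static unsigned int ring_used(const struct ring *r)
{
	return r->head - r->tail;
}
```

A sanitizer that flagged every wrapping increment of head and tail here would be pure noise, which is exactly Torvalds's point.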
Torvalds also objected to the idea of adding another type, on the basis of not increasing cognitive load: "We already expect a lot of kernel developers. We should not add on to that burden because of your pet project".
Cook did not think it was fair to characterize his work as a "pet project", given that there is a large amount of academic research about this exact class of flaws. He also agreed that it was reasonable to ask that the sanitizer clear a bar of not producing meaningless warnings, "but if the bar is omniscience, that's not a path forward". To make the discussion a bit more concrete, he cited a specific issue that took eight years between being first identified and correctly fixed, claiming that the sanitizer would have caught it immediately. In that case, the calculated size of a perf event could be recorded incorrectly, because the structure where the size was stored only used 16 bits to do so. Cook questioned how this kind of issue could be prevented, if not by something like his proposal.
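In simplified form, the pattern behind that perf problem looks like this hypothetical, stand-alone example: a computed size is stored into a 16-bit field, silently dropping the high bits (this is not the actual perf code):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record header with a 16-bit size field, as used by many
 * on-disk and ring-buffer formats. */
struct record_header {
	uint16_t size;
};

int main(void)
{
	size_t payload = 70000;		/* computed size, larger than 65535 */
	struct record_header hdr;

	/*
	 * The assignment silently keeps only the low 16 bits: 70000
	 * becomes 4464.  No arithmetic expression overflowed, but the
	 * recorded size no longer matches the data that follows it.
	 */
	hdr.size = payload;

	printf("stored size: %u\n", hdr.size);
	return 0;
}
```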
In response, Torvalds proposed an alternative approach: instead of initially tackling all arithmetic overflow, first focus on some small subset where problems are easy to spot and likely to be important. For example, he suggested focusing on cases that were not technically undefined behavior because of C's integer promotion rules, but that involved a signed quantity being promoted to unsigned and then used in a way that overflows. Other possible areas would be places where the result of some arithmetic expression is explicitly used as the size of an allocation, or where it involves direct pointer arithmetic. Torvalds also disputed the assertion that the sanitizer would have prevented the problem Cook cited, pointing out that it was a case of "implicit cast drops significant bits", not normal arithmetic overflow.
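A minimal illustration of the signed-to-unsigned promotion case Torvalds suggested focusing on might look like the following; it is a user-space sketch of the C conversion rules rather than kernel code:

```c
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
	int count = -1;		/* e.g. an error value that escaped a check */
	size_t elem = 8;

	/*
	 * For the multiplication, 'count' is converted to size_t, so -1
	 * becomes SIZE_MAX and the product then wraps.  Nothing here is
	 * undefined behavior, but the computed size bears no relation to
	 * the eight bytes that were presumably intended.
	 */
	size_t bytes = count * elem;

	printf("requesting %zu bytes\n", bytes);

	void *p = malloc(bytes);	/* almost certainly fails or misbehaves */
	free(p);
	return 0;
}
```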
The kernel already has various targeted checks for these things, Cook responded:
Right, and we've already been doing all of these things. They do all share the common root cause, though: unexpected wrap-around. But I hear you: don't go after the root cause, find high-signal things that are close to it.
Cook did agree to focus on implicit integer truncation as his next target for mitigations, before going after arithmetic overflow, since Torvalds was more willing to consider it.
Compiler questions
One detail of Torvalds's message prompted further comment from Peter Zijlstra. Torvalds had claimed that signed integer overflow was undefined behavior that was already being identified by the undefined-behavior sanitizer (UBSAN) that the kernel uses. "We build with -fno-strict-overflow, which implies -fwrapv, which removes the UB from signed overflow by mandating 2s complement," Zijlstra said. Justin Stitt claimed that this was not quite right, because LLVM had introduced a way for -fwrapv and UBSAN to work together. Zijlstra called the change "the exact kind of drugs we don't need", saying that since -fwrapv makes the behavior defined, it should not be caught by UBSAN.
Fangrui Song clarified that LLVM would no longer interpret -fwrapv as implying -fsanitize=signed-integer-overflow after a recent pull request. Talking about what is or is not undefined behavior is missing the point, Cook said. "This is _not_ about undefined behavior. This is about finding a way to make the intent of C authors unambiguous."
Zijlstra was still of the opinion that the behavior shouldn't be checked in UBSAN. Stitt claimed that "we should think of UBSAN as an 'Unexpected Behavior' Sanitizer", not an undefined behavior sanitizer. There are many sanitizers present in UBSAN's code that "aren't *technically* handling undefined behavior", he said. He emphasized the importance of annotating the places where arithmetic overflow is expected, since there's no other good way to identify those places.
Ted Ts'o was unhappy with the suggestion that kernel developers should undertake yet another large set of changes like this:
But the problem is when you dump a huge amount of work and pain onto kernel developers, when they haven't signed up for it, when they don't necessarily have the time to do all of the work themselves, and when their corporate overlords won't given them the headcount to handle unfunded mandates which folks who are looking for a bright new wonderful future --- don't be surprised if kernel developers push back hard.
Ts'o went on to say that the figure of merit for any proposed change along these lines must not be how many security bugs are found, but how much noise the security features create. The intent of his proposal was not to require all of the core kernel developers to spend time working on this, Cook responded. He reiterated that the change would be gradual, and that, while he would appreciate the help, he wasn't asking anyone else to work on the cleanup. He has been doing similar work in the kernel for ten years, and has never used the number of bugs found as a metric — rather, he wants to "kill entire classes of bugs" so that there are fewer bugs to find. "I guess I *do* worry about bug counts, but only in that I want them to be _zero_."
Cook did say "I hear what you (and Linus and others) have been saying about minimizing the burden on other developers". That is why he brought the proposal to the mailing list, he said, so that he could get feedback on how to address the arithmetic-overflow problem. Ts'o apologized, clarifying that he hadn't meant to criticize Cook's work, but rather to push back against Stitt's assertion that every single case where arithmetic overflow is expected should be annotated. He reiterated his position that, while finding security problems is important, sanitizers need to be careful not to produce so many false positives that their warnings are ignored. "So please consider this a plea to **really** seriously reduce the false positive rate of sanitizers whenever possible."
Having gathered feedback from other kernel developers, Cook no longer plans to go forward with a sanitizer for all arithmetic overflow — at least, not yet. Instead, he intends to take three "baby steps": finish the ongoing signed-integer overflow refactoring work, start going after cases of signed-integer truncation, and then look at unsigned-integer overflow for specific types, starting with Torvalds's suggestion of size_t. Given how long changes like this have taken in the past, this seems likely to occupy Cook's time for a while.
Eliminating indirect calls for security modules
Like many kernel subsystems, the Linux security module (LSM) subsystem makes extensive use of indirect function calls. Those calls, however, are increasingly problematic, and the pressure to remove them has been growing. The good news is that there is a patch series from KP Singh that accomplishes that goal. Its progress into the mainline has been slow — this change was first proposed by Brendan Jackman and Paul Renauld in 2020 — and this work has been caught up in some wider controversies along the way, but it should be close to being ready.
A security module provides a set of hooks, one for each operation within the kernel that it wants to control. Whenever that operation (opening a file, for example, or creating a new process) is invoked by user space, the security module's hook function will be called with information about the requested action. The hook then has the opportunity to see whether an action is allowed by the policy it is meant to enforce and, if not, block that action. The kernel can have more than one security module active at a time, each of which provides its own hook functions. Those functions are stored in a linked list; traversing that list and calling all of the hook functions is where the indirect calls come in.
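As a rough illustration of the hook mechanism (the module and function names below are invented, and the third argument to security_add_hooks() has changed form across kernel versions), a minimal security module supplying a single file-open hook might look like this:

#include <linux/lsm_hooks.h>
#include <linux/kernel.h>
#include <linux/fs.h>

/* Hypothetical policy: allow everything; -EPERM here would deny the open. */
static int example_file_open(struct file *file)
{
    return 0;
}

static struct security_hook_list example_hooks[] = {
    LSM_HOOK_INIT(file_open, example_file_open),
};

static int __init example_lsm_init(void)
{
    /*
     * Register the hooks; in current kernels the third argument is an
     * lsm_id structure rather than the bare name used in this sketch.
     */
    security_add_hooks(example_hooks, ARRAY_SIZE(example_hooks), "example");
    return 0;
}

DEFINE_LSM(example) = {
    .name = "example",
    .init = example_lsm_init,
};

Each active module contributes entries like these for the hooks it cares about; the kernel then has to call every registered function for a given hook on every relevant operation.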
Calling functions through pointers is, of course, a common C-language technique that is used heavily throughout the kernel. These indirect function calls have increasingly come under scrutiny in recent years, mostly as a result of the threat posed by the Spectre class of hardware vulnerabilities. Indirect function calls can be points where a number of speculative-execution vulnerabilities can be exploited. The gyrations required to thwart such exploits — notably retpolines — come with a heavy run-time cost.
That added cost is especially painful when it comes to the indirect function calls used by security modules. Almost anything that user space can do, if it involves the kernel, will be mediated by at least one security-module hook; if those hooks are made more expensive, the pain is felt throughout the system. The added performance hit is prohibitive on systems that are already running at full capacity, with the result that the use of security modules is not possible for some workloads. It is thus not surprising that there is interest in getting that lost performance back.
The attention to security modules increased in April with the disclosure of the branch history injection hardware vulnerability, which is, once again, exploitable in code using indirect function calls. The LSM subsystem is, arguably, an especially appealing target for such exploits because it makes so many indirect calls, its hooks are attached to almost every system call, and the LSM call is often one of the first things done on entry into the kernel.
This vulnerability forced the use of retpolines on CPUs that had,
previously, been able to get away with less-expensive mitigations provided
by the hardware. That provoked a touchy
conversation with Linus Torvalds, who questioned the direction that the
LSM subsystem had been taking for the last ten years. In the middle of
that, though, he also said
that the work to switch the LSM subsystem to using static calls "needs
to be turned to 11
"; that was the one part of Torvalds's
message that nobody disagreed with strongly.
Static calls
Static calls are widely used within the kernel in situations where an indirect call is necessary, but the target for that call is set only once (or at most rarely) in the life of the system. Their purpose is to provide the flexibility of indirect calls (albeit with an increased cost for changing the target of the call) with the improved performance and security of direct calls. The static-call infrastructure was first added for the 5.9 kernel release in 2020; it is conspicuously absent from the kernel's documentation, but there is an overview of the API in include/linux/static_call.h.
In the simplest case, kernel code will set up a static call with:
DEFINE_STATIC_CALL(name, func);
Where name is the name to be associated with the static call, and func() is the function to be invoked. The static_call() macro can then be used to call the function:
static_call(name)(args...);
This call will work like a normal function call; func() will be called with the given args, and its return value will be passed back to the caller. It is, however, a direct call, much as if func() had been called directly in the code. As a result, this call is faster than an indirect call and lacks the associated speculative-execution problems.
The value of indirect calls, of course, is that the target can be changed at run time; that is not normally the case with direct calls. Static calls use some architecture-specific trickery to get around this problem; if the target of a static call needs to be changed, that can be done with a call to:
static_call_update(name, new_func);
After this call is made, a static_call(name)() invocation will make a (direct) call to new_func() rather than func().
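Putting those pieces together, a minimal and purely illustrative use of the API might look like:

#include <linux/static_call.h>

static int default_handler(int arg)
{
    return arg;
}

static int fast_handler(int arg)
{
    return 2 * arg;
}

/* "my_handler" is the name; default_handler() is the initial target. */
DEFINE_STATIC_CALL(my_handler, default_handler);

int do_work(int arg)
{
    /* On supported architectures this compiles to a direct call. */
    return static_call(my_handler)(arg);
}

void use_fast_path(void)
{
    /* Rare, expensive operation: re-patch the call site(s). */
    static_call_update(my_handler, fast_handler);
}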
How this mechanism is implemented depends on the architecture. Some architectures (arm and x86) are able to patch the call instructions directly in line, meaning that static calls are indeed just like normal direct calls and have no additional overhead. That said, the cost of patching the code in a running kernel is high, making static calls suitable for use only in situations where the function to be called will be changed infrequently, if ever. Other architectures need to use a special trampoline for the static call; for architectures with no support at all, ordinary indirect function calls are used. There is more complexity to the API than described here; see the above-linked header file for details.
Bringing static calls to LSMs
While the LSM subsystem is, as its name would suggest, modular, it is not set up for the arbitrary loading and unloading of modules. Instead, the set of available security modules is established (through the kernel configuration) at build time, and those modules are built directly into the kernel image. The set of active security modules is then defined at boot time and never changes during the operation of the system. So the set of hook functions to be called can be worked out at boot time, and need never be altered thereafter. This seems like a situation that is well suited to static calls; that is, indeed, the approach taken by Singh's patch set.
In current kernels, as mentioned above, a linked list of hook functions is maintained for each LSM hook; the kernel iterates through that list to invoke each hook function with an indirect call. With this patch series applied, that linked list is replaced with an array of static calls; the LSM subsystem now just has to step through the array, calling each hook in turn. In theory, the conversion should be straightforward. In practice, of course, there turn out to be a few little details that get in the way.
One of those details relates to the fact that an LSM need not supply functions for every hook. In the old implementation, a missing hook would be absent from the linked list and would never be invoked, but an array works differently. It turns out that providing a hook that returns the default value can have unwanted side effects; it is not the same as leaving out the hook entirely. So each entry in the array of hook functions must be protected by a static key to avoid calls when a hook function is absent.
There are other troublesome details as well. The set of possible security modules is defined in the kernel configuration and is known at boot time. A command-line parameter is then used to control both which modules are enabled and the order in which they are invoked. The kernel must then, at boot time, set up the requisite static calls in the correct order; the number of these calls, and the order in which they must be made, cannot be known ahead of time. There is some trickiness and ugly macro code involved, but the result is an end to indirect calls for LSM hooks.
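The patches themselves are the authoritative reference, but the shape of the result can be sketched, with invented names and only two slots for a single hook, roughly as follows:

#include <linux/static_call.h>
#include <linux/jump_label.h>
#include <linux/init.h>
#include <linux/fs.h>

/* Illustrative only: two fixed slots for the file_open hook. */
static int no_op_file_open(struct file *file) { return 0; }

DEFINE_STATIC_CALL(lsm_file_open_0, no_op_file_open);
DEFINE_STATIC_CALL(lsm_file_open_1, no_op_file_open);
static DEFINE_STATIC_KEY_FALSE(lsm_file_open_0_enabled);
static DEFINE_STATIC_KEY_FALSE(lsm_file_open_1_enabled);

/* Replaces the linked-list walk: each slot is a guarded direct call. */
int security_file_open_sketch(struct file *file)
{
    int ret;

    if (static_branch_unlikely(&lsm_file_open_0_enabled)) {
        ret = static_call(lsm_file_open_0)(file);
        if (ret)
            return ret;
    }
    if (static_branch_unlikely(&lsm_file_open_1_enabled)) {
        ret = static_call(lsm_file_open_1)(file);
        if (ret)
            return ret;
    }
    return 0;
}

/* At boot, slots are filled in command-line order, then never changed. */
static void __init fill_slot(int slot, int (*hook)(struct file *))
{
    if (slot == 0) {
        static_call_update(lsm_file_open_0, hook);
        static_branch_enable(&lsm_file_open_0_enabled);
    } else {
        static_call_update(lsm_file_open_1, hook);
        static_branch_enable(&lsm_file_open_1_enabled);
    }
}

The real series generates the per-slot names and the stepping-through code with macros, which is where the "trickiness and ugly macro code" comes in.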
The result of this work is a performance improvement that averages about 3% and a system that, without all those indirect calls, is more secure overall.
This patch set has been through 13 revisions since Singh picked it up
at the beginning of 2023; it appears to have satisfied most reviewers.
Kees Cook asked for it
to be merged soon, lest Torvalds return and "make unilateral changes to
the LSM
". But LSM subsystem maintainer Paul Moore pushed
back, saying that he simply has not had the time to review the current
version of the patches. More than two months after the last discussion, it
seems that this is still a bit of a touchy subject.
Nearly three weeks later, nothing appears to have changed, so whether this work will be applied in time for 6.11 is unclear. If that doesn't happen, though, a 6.12 merge seems almost certain (unless some sort of new problem turns up). Either way, the days of indirect calls in the LSM subsystem would appear to be numbered.
Mount notifications
There are a handful of extensions to the "new" mount API that Christian Brauner wanted to discuss as part of a filesystem session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit. In the session, though, the only one that he got to was a followup to last year's discussion on mount-operation monitoring. There is a need for user-space programs to be able to follow mount operations (e.g. mount and unmount) that happen in the system, especially for tools like container managers or systemd.
He began by briefly listing the potential topics in his slides, but noted that he was doubtful that he would get far into the list—or even past the first. He chose to focus on mount-operation monitoring (or mount notifications) as it is "the most pressing and interesting issue for user space". The idea is that user-space tools can register for mount-related events, which will allow them to track the state of the mount tree.
![Christian Brauner](https://static.lwn.net/images/2024/lsfmb-brauner2-sm.png)
Brauner thinks the right way forward is to use fanotify, rather than the watch-queue-based notifications that David Howells had originally proposed. Howells clarified that his patches were meant to also provide a way for user space to query the mount topology using a new system call; the notification part was implemented on top of watch queues.
Brauner said that fanotify has "a lot of desirable properties, such as missed-event notifications when the queue overruns". He said that Josef Bacik had some experience with the problems of queue overruns for some systems running container workloads with events from up to 10,000 mounts that were propagated all over the mount tree. Brauner thinks that the overrun problem has been solved at this point, however. Programs can use listmount() with the unique 64-bit mount ID when they find out that they have missed events; that will give them the mount IDs of child mounts that they can further query.
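As an illustration of that recovery path (the structure layout and system-call number below come from the 6.8-era listmount() addition and are best treated as assumptions to be checked against current headers; there is no C-library wrapper yet), re-enumerating the children of a mount might look something like:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_listmount
#define __NR_listmount 458      /* x86_64 value; assumption, check your headers */
#endif

/* Request structure as introduced with listmount()/statmount() (assumption). */
struct mnt_id_req {
    uint32_t size;
    uint32_t spare;
    uint64_t mnt_id;            /* parent, as a unique 64-bit mount ID */
    uint64_t param;             /* last mount ID seen, for continuation */
};

/* Print the unique mount IDs of the children of "parent". */
static void list_children(uint64_t parent)
{
    struct mnt_id_req req;
    uint64_t ids[256];
    long n;

    memset(&req, 0, sizeof(req));
    req.size = sizeof(req);
    req.mnt_id = parent;

    n = syscall(__NR_listmount, &req, ids, 256, 0);
    if (n < 0) {
        perror("listmount");
        return;
    }
    for (long i = 0; i < n; i++)
        printf("child mount ID: %llu\n", (unsigned long long)ids[i]);
}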
Amir Goldstein said that the information needed could also be part of the event message when the mount notification happens, preferably as a file handle for the mount. Brauner agreed that made sense, rather than returning an O_PATH file descriptor, which is another option. That kind of file descriptor would allow opening any mount on the system, so it provides "an extremely privileged interface", while a file handle would not.
There is a need to "decide which objects we want to watch", Goldstein said; will the watches be placed on parent mounts or on subtrees? Brauner said that Howells's patches could watch an entire mount namespace or subtrees. There are use cases where you want to watch all of the mounts in a container, Brauner said, so that is where a mount-namespace watch would make sense; there are also services that only use a subtree so they will want to only get events for that part of the tree.
Goldstein agreed that it made sense to have both, but was not sure how to implement the subtree watches. Brauner said that there is a potential race condition because new mounts in the subtree do not inherit the watch, so any mounts that happen before it is established might be lost. He is not sure that is a real problem, so long as there is a way to query the state of the mount tree right after the watch is established on a new mount. Jan Kara said that a watch only gets informed about events for the immediate children of a mount, which makes implementing a recursive mount watch in user space painful.
The whole reason for adding mount notifications is to perform better than the existing practice, which involves frequently parsing /proc/self/mountinfo, so some performance numbers should be gathered, Brauner said. It should be faster "and I'm pretty sure that it is, but we should have some numbers".
Solving the mount-notification problem is something that filesystem developers "should aim to get done this year", he said. It is longstanding and "kind of a shame that we have not correctly solved it yet". Goldstein said that there have been performance problems with recursive watches in the past, but those were for directories, which have a higher volume of events than mounts; he does not see that as a real problem for mounts.
Brauner asked about the interaction between mount notifications and pivot_root(); he wondered if any watches on the old root were copied to the new as part of that operation. Howells said that because the watches are associated with the mount object, not the mount namespace directly, they would get lost when pivot_root() is called. The discussion seemed to indicate that something would need to be done to maintain the watches in that case.
The session wrapped up with a bit of discussion on implementation; Brauner said that he had wanted to get something working for a while now. Goldstein said that once an API was decided on, it would not be all that hard to implement mount notifications. Howells said that his code could be used as a starting point. Goldstein suggested that a simple API for watching child mounts of a given parent would be straightforward to develop, then additions could be made for more complicated scenarios (presumably for things like recursive watches) based on that work.
Redox: An operating system in Rust
With the Rust-for-Linux project starting to gain some ground, it is worth looking at other operating systems that use Rust in their kernels. There are many attempts to use Rust for operating system development, but Redox may be the most complete. Redox is an MIT-licensed microkernel and corresponding user space, designed around concepts taken from Plan 9. While nowhere near being usable as a replacement for Linux, it already provides a graphical user interface and the ability to run many POSIX programs.
Redox was started in 2016 by
Jeremy Soller, who remains
the project's benevolent dictator for life. Soller also works as a maintainer for
Pop!_OS. Since then, approximately 150 people have contributed to Redox. The project
summarizes its goals
as "to make a
complete, fully-functioning, general-purpose operating system with a focus on
safety, freedom, stability, correctness, and pragmatism
". The project aims
to eventually become a practical alternative to Linux or BSD, although it does
not aim for strict binary compatibility with either.
Redox has a number of different components, mostly written in Rust. The project doesn't forbid software written in other languages; it has an implementation of the C library on top of the Redox kernel called relibc. Using the library lets software written in C run on Redox. However, the core concepts of the system are sufficiently different that the main services of the operating system — the shell, user interface, and so on — mostly have to be written from scratch.
Plan 9 influences
Redox is
designed around the concept that "everything is a URL
", a
generalization of the "everything is a file
" approach of Unix. Every resource in
a Redox system is
identified by a URL that can be opened and read from or
written to in the same way that a file is in a traditional system. Redox
supports both socket-like files that don't support seeking to a position, and
files that do.
The Redox kernel is a microkernel — meaning that core components such as drivers and filesystems can run in user space — but rather than invent its own interprocess communication (IPC) mechanism as some other microkernels do, Redox reuses file operations. URLs all have a "scheme" that identifies the protocol to use to resolve the URL; Redox uses that scheme to choose how to handle requests to read or write files under that scheme. The most basic schemes are implemented in the kernel, but the rest are implemented as user-space daemons.
The kernel translates open(), read(), and write() calls targeting schemes that are implemented in user space into packets which are sent to the daemon through a normal socket. So a program might read from "http://example.com", which would result in a packet being sent to the "http" daemon, that would in turn open "dns:" and "tcp:" files to service the request, and so on. This approach turns pretty much every system service except the most basic process handling into a separate program that opens the files it needs at boot, and can then drop privileges.
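Since relibc forwards ordinary file calls, a C program could in principle drive this whole mechanism with nothing more than open() and read(); the "http" scheme below is the same illustrative example used above, not a statement about which daemons Redox actually ships:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* The "http" scheme is illustrative; the kernel routes the open()
     * to whichever user-space daemon registered that scheme. */
    int fd = open("http://example.com", O_RDONLY);
    char buf[4096];
    ssize_t n;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, n, stdout);
    close(fd);
    return 0;
}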
The schemes implemented in the kernel include the "root scheme" (referred to using an empty string for the scheme name), which allows registering new scheme handlers, and the "event" scheme, which defines an epoll-like mechanism for listening for events on multiple file descriptors at once. Unlike a Linux system, Redox does not implement support for networking or filesystems in the kernel. Instead, the root and event schemes are all that are necessary for a Redox driver to register itself as the handler for a given type of filesystem or networking hardware on startup.
There has been extensive discussion about the pros and cons of a microkernel architecture. The core advantages and disadvantages remain largely the same, decades later. However, computers have gotten faster over time; the Redox project thinks that the performance difference has become small enough that the benefits to security, robustness, modularity, and testability that a microkernel offers are now worth the tradeoff.
Trying it out
The most recent Redox release is from 2022, so in order to try out the system I checked out a recent weekly image. Redox currently supports i686 and x86_64; I ran the latter in a virtual machine (VM). Booting it brought me to the Redox desktop — a custom desktop environment called Orbital. The demo image comes with several custom applications, including a file manager, editor, and terminal. There is also software ported from Linux, including the NetSurf web browser and the FFmpeg multimedia library. Overall, the system is usable for basic tasks, including browsing the internet or editing text files. It is even theoretically possible to develop and compile Redox on Redox — although doing so involves using a slightly tweaked Rust compiler toolchain.
The instructions for running Redox in a virtual machine also give access to a text console. The system has bash, nano, git, a package manager called pkg, and other basic features. However, Redox is not yet ready for serious use — I managed to cause a kernel panic by killing a process that was taking too long, the networking subsystem doesn't handle cancellation, and relibc doesn't implement some functions like getgroups() or getrlimit().
Despite its current state, some of the developers use Redox permanently installed on real hardware. Soller has had a laptop running Redox since 2019. Hardware support is still somewhat scattered — mostly limited to those devices that the developers have access to — but that support is expanding over time.
Comparisons
Redox is hard to compare with Linux for several reasons. The choice of a microkernel architecture means that the interfaces between the kernel itself and the drivers look quite different from anything Linux is likely to implement. Additionally, Redox's unfinished state means that direct comparisons of performance or line count are unlikely to be meaningful. But there are still a few lessons to be drawn from Redox.
Linux keeps all of the most important parts of the kernel in the same Git repository. There are out-of-tree modules, but they're not supported by the core development community. In contrast, Redox sprawls across a large number of Git submodules, external packages, and other components. While there are advantages and disadvantages to both approaches, breaking Linux up into so many distinct pieces is probably not even possible, regardless of whether it is desirable. Package management is not cited as often as memory safety when justifying the use of Rust, but Redox shows that it can be a definite benefit, by allowing projects to depend on the large number of available Rust libraries.
Redox also makes strong use of Rust's type system to provide abstractions for various kernel interfaces. The Linux kernel does the same, but there is still a difference of degree between the two kernels. Using some internal interfaces in the Linux kernel can require calling specific functions in the right order, with the right locks held. These requirements are not enforced by the compiler, and not always consistent between subsystems. In contrast, Redox makes pervasive use of closures, trait objects, and other Rust language features to make internal interfaces more consistent and ensure they can be checked by the compiler.
In all, Redox uses Rust in a different way than the slow integration that the Rust-for-Linux project proposes. It is also unlikely to be serious competition for Linux anytime soon. But it does show that it is possible to create a working operating system using Rust — one that is useful to its developers, and will hopefully be useful to users as well.
FreeDOS turns 30
FreeDOS is an open-source operating system designed to be compatible with the now-defunct MS-DOS. Three decades have now passed since the FreeDOS project was first announced, and it is still alive and well with a small community of developers and users committed to running legacy DOS software, classic DOS games, and developing modern applications that extend its functionality well beyond the original MS-DOS. It may well be around in another 30 years.
C:\EDIT HISTORY.TXT
MS-DOS was the most popular operating system for IBM-compatible personal computers (PCs) in the 1980s and 1990s prior to the advent of Windows 95 in (as one might expect) 1995. Early versions of Windows ran "on top" of MS-DOS. It was not the only Disk Operating System (DOS), but that's a far more convoluted tale. In 1994, Microsoft announced that it was going to stop selling and supporting MS-DOS. The final standalone release of MS-DOS from Microsoft was 6.22. There were versions of DOS included as part of Microsoft Windows 95 and later, but they were not separate products.
Even though Microsoft had moved on from MS-DOS, many users had DOS-compatible software they wanted to run. Jim Hall, then a student at the University of Wisconsin, decided that the world needed a public domain version of DOS and announced the PD-DOS project on June 29, 1994. It was swiftly renamed to Free-DOS in July 1994, and dropped the hyphen in 1996 to become FreeDOS.
Hall wrote
that it was originally called PD-DOS because "I naively assumed
that when everyone could use it, it was 'public domain.'
" He
learned the difference quickly, and FreeDOS has been under the GNU
GPLv2 since its earliest releases. He said
in 2021 that other developers reached out shortly after the
announcement "to offer utilities they had created to replace or enhance the DOS command line, similar to my
own efforts
". They pooled their resources and first alpha for
FreeDOS was released in September 1994.
The history of FreeDOS is well-documented across its site and elsewhere, but I also took the opportunity to email Hall to ask some questions about its early development, community, and future. Hall wrote that the community made it to the 1.0 release in 2006 "by inches" because everyone was focused on getting it right:
We were making Beta releases until then. We had Beta 8, for example. And in the run-up to Beta 9, we had Beta 9 "Release Candidate 1," which was a pre-release to Beta 9 .. then Beta 9 RC2, RC3, then RC4, then RC5. Finally we had Beta 9 in September 2004! But before we were ready for "1.0," we also had Beta 9 "Service Release 1" in November 2004, and "Service Release 2" in November 2005. All of that before we were ready for "1.0" in September 2006.
1.0 and beyond
The project continues to focus on getting it right. FreeDOS 1.1 took almost six years and was released in early 2012 with initial USB support, a generic PCI IDE CD-ROM driver, updated memory drivers, and more. FreeDOS 1.2 came out in 2016, with an all-new installer and the FreeDOS Installer - My Package List Editor Software (FDIMPLES) package manager. Hall says that was a major milestone for the project:
If I remember events correctly, that was the first version after Jerome Shidel stepped in as the new distribution manager. And as part of that, Jerome completely overhauled the FreeDOS install process to simplify it. The old installer was okay, but it was definitely showing its age. It was great to have Jeremy completely update it.
The current 1.3 release was unleashed in 2022, with a version for 8086 CPUs with FAT32 filesystem support, a new command shell, and many other updates. The FreeDOS 1.3 release report provides a breakdown of all the packages available on the various FreeDOS release media. The project started releasing monthly test releases in 2022, so users can try out the latest state of FreeDOS without waiting until another official release.
The project also published a Get Started with FreeDOS collection of documentation in 2022; no doubt useful for new generations of users wishing to run classic DOS games who have never had to stare at a blinking cursor at the C:\ prompt before. The developer documentation for the project is currently offline due to a spam attack on its wiki.
Tools of the trade
When Linus Torvalds decided to start working on what would become Linux, the GNU Project had ample free-software utilities just waiting for a Unix-like kernel to come along. Hall was not so lucky. He noted that, when development started in 1994, "there really weren't any open source compilers or development tools that were appropriate for writing DOS system programs":
But in the 1990s, many FreeDOS developers already had access to one of the many proprietary compilers or assemblers. So that was our first toolset: proprietary software to make open source software.
Hall said that he started with an old copy of the Microsoft QuickC compiler to write the original FreeDOS tools. Later he turned to the Borland C compiler "which I liked a lot better". Now, however, all the preferred tools are open-source software. The project uses OpenWatcom C, an open source C compiler that works well on DOS. Its last official version is Version 1.9, but Hall said that other developers have picked up the code and created a Version 2.0 fork that he likes. The preferred assembler is NASM.
The project tries to hold firm on using open-source software to create programs that are in the FreeDOS Base group, the programs that reproduce the functionality of MS-DOS. It is "more relaxed [about] which tools you use" for programs outside the base. Hall also mentioned that he likes the Intel Architecture 16 (IA-16) version of GCC, which he has been experimenting with recently. "I also like that TK Chia (the developer) also created a libi86 library for IA-16 GCC that reproduces nonstandard DOS C programming interfaces like <conio.h> and <dos.h> functions."
FreeDOS in 2024
In his email, Hall said that the FreeDOS development community is small, "but I think it's doing well". It is not surprising, he said, that FreeDOS has a smaller development community in 2024 than something like the Linux kernel.
Development of FreeDOS is split between the FreeDOS kernel and everything else, such as user-space applications, utilities, development tools, drivers, and more. The kernel was based on DOS-C by Pasquale J. Villani, and is currently maintained by Jeremy Davis, with contributions from a number of others. Hall pointed out that the FreeDOS kernel does not need to change much, these days. "After the FreeDOS kernel reached a certain level of maturity, we were basically feature-complete with MS-DOS. After that, it was really down to 'user space' programs."
FreeDOS does have an abundance of user-space programs, not to
mention all the legacy software that is archived around the internet.
If a person was particularly determined, they could conceivably use
FreeDOS as their primary operating system. A few years ago, Ars
Technica writer Sean Gallagher tried it for a week and reported
that he was "ready to return to the comfort of a modern
operating system—any modern operating system
". I was not quite so
brave, but I did learn that using DOS is not like riding a
bike. Despite many hours tinkering with MS-DOS before starting down
the Linux path in 1996, I retain almost none of what I had learned
about tinkering with AUTOEXEC.BAT files or trying to get the
most out of the full 8MB of RAM my 486/66MHz PC had to offer.
Even Hall does not use FreeDOS as his primary operating system, but he said that he does use it on a daily basis. He wrote that he runs Linux at home, and runs FreeDOS in a virtual machine (QEMU). "If I'm not testing some FreeDOS program, I'm probably playing a DOS game or experimenting with one of my favorite classic DOS applications."
Some of Hall's favorite DOS games include Commander Keen and Jill of the Jungle. He still does work in the As-Easy-As spreadsheet application, and recommends Microsoft Word for DOS 5.5 which was released as a free download in 1999 due to Y2K compliance efforts. He reports that he has been able to import files from both programs into LibreOffice on Linux. "It works great! Your data is not 'locked in' to the older formats."
One missing piece
By the time MS-DOS 6.22 (the final MS-DOS release, not counting the versions bundled into Windows 95 and later) was released, most users were using it to run Microsoft Windows 3.1x and running graphical applications in addition to, or in place of, MS-DOS applications.
Windows 3.1x had two modes, explained in-depth here: Standard Mode and Enhanced Mode. Standard Mode, which required a 286 Intel-compatible CPU or better, allowed access to a whopping 16MB of RAM. Enhanced Mode brought support for a 32-bit Virtual Machine Manager (VMM) that enabled Windows to create virtual machines to run the Windows Operating Environment and a VM for each DOS session. In theory, it could access up to 4GB of RAM, but the practical limit was 256MB. Some versions of Windows 3.x, specifically Windows for Workgroups 3.11, only ran in Enhanced Mode.
Support for this Enhanced Mode, 30 years later, is still missing in FreeDOS. The good news is that it might be making an appearance before long. According to Hall, Davis has added some support to the FreeDOS kernel to allow it to run Windows 3.11 in Enhanced Mode and he expects a new version of the kernel "soon".
The next 30
Will FreeDOS be around in another 30 years' time? Hall said that he thinks DOS has "incredible staying power, because it's so simple to learn and figure out". But he predicted that users will not be using it for real work:
Even in 2014, there was a "real-world" need for FreeDOS. At that time, I served as the campus CIO for a university, and a faculty researcher came to us with a collection of floppy disks containing research data. And while we still had PCs that had floppy drives, none of the modern applications could read his data. So we installed FreeDOS on one of those PCs, found a copy of the DOS application that created the data, and exported all the research data to a CSV format or some other text format that the researcher could import into other software.
Looking out to 2054, though, Hall said "we won't have that same need". By then, he expected, FreeDOS would mostly be of interest to users in the 2030s and beyond who want to learn about computer history. Interested users will still be able to run old DOS applications and games, but "the reason for running those DOS applications in 2054 will be very different from today".
Mourning Daniel Bristot de Oliveira
The academic and the Linux real-time and scheduling communities mourn the premature death of Daniel Bristot de Oliveira. Daniel died at the age of 37 on Monday, June 24, 2024.
Daniel was a computer scientist with a focus on real-time systems and scheduling theory who was well recognized in the academic and the Linux kernel communities. His truly outstanding ability to apply theoretical real-time concepts to real-world problems in the industry has been instrumental in driving the success of Linux and its adoption in real-time critical application spaces. While he pursued his ideas and visions with great perseverance, he was always open to discussion, criticism, and other people's ideas. His honesty, his modesty, and his wicked humor made it a pleasure to work with him. His wide interests outside of technology and his exceptional social skills made it easy to connect with him, which resulted in many deep friendships reaching beyond the scope of work.
Daniel was creative and passionate about computer science. He earned a
joint PhD from Universidade Federal de Santa Catarina in Brazil and Scuola
Superiore Sant'Anna in Italy, with a research thesis focusing on Automata-based
Formal Analysis and Verification of the Real-Time Linux Kernel. His work
was an exemplary piece of research, combining theoretical research
arguments with a real implementation of a kernel-level mechanism. It models
the behavior of complex parts of the Linux kernel, such as the process
scheduler, with a finite-state machine and uses minimum-overhead run-time
verification to validate the coherence of the kernel's run-time behavior
and the theoretical model.
Having Daniel at the ReTiS lab has been invaluable, as witnessed by various other collaborations that naturally developed while he was a PhD student, and later when he remained a Professional Affiliate at the lab. He helped mentor undergraduate and PhD students on various issues related to the performance and optimization of software running on Linux. He was also active in collaborating with various other research groups working on system-level topics, as witnessed by his co-authored papers.
Daniel started contributing to the development of the SCHED_DEADLINE scheduling policy by fixing all sorts of issues, demonstrating from the start a deep understanding of both the technical and theoretical details of the implementation. Not long after, he stepped up to the role of co-maintainer of the project, taking on a big portion of the recent work on new features for the scheduler.
RTLA (Real-Time Linux Analysis toolset) and RV (Runtime Verification) are just two outstanding examples of Daniel's work in Linux. RTLA is a meta-tool that binds the timerlat, osnoise, and hwnoise tracers into a single, user-friendly, command-line application.
- The timerlat tracer helps find the sources of wakeup latencies affecting real-time threads. Similar to cyclictest, it uses a periodic timer to catch and measure latency spikes, but timerlat provides a greater level of detail and a more precise picture of the various contributions to latency at different levels (including interrupts, kernel, and user space).
- The osnoise tracer runs a busy-loop workload in the kernel, with preemption, soft and hard interrupts enabled. By taking note of the entry and exit point of any source of interference, it produces a fine-grained analysis of the potential sources of system noise that a polling application (e.g., HPC, DPDK) can suffer from.
- Last but not least, the hwnoise tool (based on the osnoise tracer) is essentially meant as a replacement for hwlatdetect, extending coverage to multiple scenarios (including round-robin, per-CPU, and a subset of CPUs) and, again, increasing the level of detail in the report.
RV is a lightweight yet rigorous method that complements classical exhaustive verification techniques, such as model checking and theorem proving, with a more practical approach for complex systems. Instead of relying on a fine-grained model of a system, such as a re-implementation at instruction level, RV works by analyzing the trace of the system's actual execution and comparing it against a formal specification of the system's behavior. Daniel pioneered the method of using an RV Monitor as an active safety mechanism in the kernel with the ELISA (Enabling Linux in Safety Critical Applications) community. He also generously shared how the RTLA tools could be used to isolate a workload from interference from the rest of the system in one of the first seminars the project held.
Daniel has been deeply involved in organizing several conferences over the years. Many discussions with key outcomes could not have happened without his tireless work inviting people, putting together schedules, and making sure people were constantly focusing on arguments that matter rather than digressing into pointless arguing. He used his natural talent for jokes and witty comments to make everyone relax, feel at ease, and feel welcome to contribute to the discussion. To name a few, knowing it's going to be only a partial list, he helped organize the Linux Plumbers Real-time and Scheduler micro-conferences, the Power Management and Scheduling in the Linux Kernel (OSPM) Summit, and the Real-Time Linux Summit; he has also been on the technical program committees of top conferences in real-time systems research, such as RTSS, RTAS, and ECRTS.
The Brazilian lyricist Paulo Coelho wrote: "Never. We never lose our
loved ones. They accompany us; they don't disappear from our lives. We are
merely in different rooms.
" The academic and Linux kernel communities
will always be accompanied by Daniel and by the traces he left in his work
and in our hearts. Our thoughts are with Daniel's fiancée and
family.
If you want to express your condolences, please send an email to bristot@tglx.de. It will be passed on to the ones he loved most.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: OpenSSH 9.8; FreeBSD Developer Summit; Scientific Linux 7 EOL; Universal Blue updates; GNU findutils 4.10.0; FSF board; Quotes; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.