Leading items
Welcome to the LWN.net Weekly Edition for December 12, 2024
This edition contains the following feature content:
- Debian opens a can of username worms: allowing non-ASCII usernames seems like a nice idea, but it raises a number of challenges.
- A look at CentOS Stream 10: how CentOS Stream is shaping up, and what it means for the upcoming RHEL 10 release.
- A Zephyr-based camera trap for seagrass monitoring: environmental monitoring with free software.
- Finally continuing the discussion over continue in finally: a troublesome corner case in the Python language comes to the fore — again.
- Freezing out the page reference count: a change to low-level kernel memory management that highlights where the memory-management developers are going.
- Auto-tuning the kernel: the bpftune utility can automate a number of kernel-tuning tasks.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Debian opens a can of username worms
It has long been said that naming things is one of the hard things to do in computer science. That may be so, but it pales in comparison to the challenge of handling usernames properly in applications. This is especially true when multiple applications are involved, and they are all supposed to agree on what characters are, and are not, allowed. The Debian project is facing that problem right now, as two user-creation utilities disagreed about which names are allowable. A plan is in place to sort this out before the release of Debian 13 ("trixie") sometime next year.
The useradd utility is part of the shadow-utils project, which includes programs for managing user and group accounts. The shadow-utils suite is included in Debian's passwd package. For historical reasons, and to avoid confusion with the upstream project, Debian's version of the shadow-utils sources are often referred to as "src:shadow".
Most Debian users don't work with useradd, or groupadd, directly. Instead, Debian has long supplied its own adduser (and addgroup) utilities, originally written by founder Ian Murdock. These act as simpler front ends to useradd and use Debian-supplied system defaults for creating users' home directories and configurations. It should be noted that useradd, et al., have become much more full-featured since Debian's utilities were introduced, but the project continues to maintain them nonetheless.
Little Bobby Tables
In June, Debian developer and src:shadow maintainer Chris Hofstaedtler filed a bug against the adduser package. The src:shadow package had dropped a Debian-specific patch, originally introduced in 2003 by Karl Ramm, to allow characters far beyond what was allowed by the upstream shadow-utils project. In the patch, Ramm wrote:
I can't come up with a good justification as to why characters other than ':'s and '\0's should be disallowed in group and usernames (other than '-' as the leading character). Thus, the maintenance tools don't anymore.
Hofstaedtler said that he had puzzled out some of the patch's purpose from old bug reports that had been "fixed" by the patch, and those asked for two things not allowed by the upstream shadow-utils: usernames with upper-case characters or that are purely numeric. Hofstaedtler said that upper-case names had been allowed in the upstream shadow-utils project "a long time ago", but it seemed like a bad idea to allow purely numeric usernames.
The patch enabled much more than upper-case and purely numeric names, though. With the patch dropped in version 1:4.15.2-2 of the shadow source package, one of adduser's test cases—one that creates a user named "bob;>/hacked"—had failed. Hofstaedtler wrote:
For src:shadow, I would really like to not have a divergence from upstream in this regard. I think if we have clear requirements then we (I) can submit them upstream and I would expect upstream to accept patches.
I do feel that making the case for "bob;>/hacked" would be very hard.
Hofstaedtler said that the patch had been reapplied for the time being (it was included again in version 1:4.15.2-3), but he asked if username requirements could be sorted out in time for the Debian "trixie" release. If the patch were dropped entirely, then useradd would restrict usernames to the POSIX standard, with the exception of allowing a "$" character at the end of a username.
Debian developer and adduser maintainer Marc Haber replied in late October that other tests were failing as well, and thought that "useradd upstream is being too picky here". Since adduser depends on useradd, it cannot create users that useradd would reject; he said he would like to synchronize on what would, and would not, be allowed.
As part of the research into what should be allowed in usernames, Haber took over Debian's UserAccounts wiki page, which outlines Debian's username tools and policies, and started looking into whether the project should relax its requirements around usernames.
Limits on usernames
One of the questions that bubbles up when looking at usernames is not just which characters are allowable, but also how long a username can be. The documentation for shadow-utils does not specify a length for usernames or say what encoding is being used.
However, in order to be portable between systems, the POSIX standard says that usernames should not include non-ASCII characters. The standard says that usernames should be "composed of characters from the portable filename character set". That set consists of the digits 0 through 9, upper-case and lower-case "a" through "z", the period (.), underscore (_), and hyphen (-). It also specifies that usernames should not begin with a hyphen.
It is, however, possible to assign characters outside that set with the tools at hand. But Linux distributions usually put up some guardrails in the adduser and useradd configurations to prevent administrators from creating usernames with non-ASCII characters unintentionally. These configurations can be overridden with adduser's --allow-bad-names option or useradd's --badname option.
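To make those rules concrete, here is a minimal sketch (not from the discussion) of how the POSIX portable-character rules, plus the trailing "$" exception that useradd makes, might be checked in Python; the function name and regular expression are illustrative assumptions, not code from adduser or useradd:

import re

# Portable filename character set: A-Z, a-z, 0-9, '.', '_', '-';
# a username must not start with a hyphen, and useradd additionally
# tolerates a single trailing '$' (used for machine accounts).
PORTABLE_NAME = re.compile(r'^[A-Za-z0-9._][A-Za-z0-9._-]*\$?$')

def is_portable_username(name: str) -> bool:
    return bool(PORTABLE_NAME.match(name))

print(is_portable_username("bob"))          # True
print(is_portable_username("-bob"))         # False: leading hyphen
print(is_portable_username("bob;>/hacked")) # False: shell metacharacters
print(is_portable_username("étienne"))      # False: non-ASCII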
In November, Haber posted a message on debian-devel saying that he had "opened an especially nasty can of worms" and was finding that things were more complicated than he had understood. He sought input and opinions on a number of questions: whether Debian should allow non-ASCII characters in usernames, how to do that if so, and whether it would be more appropriate to document username guidance in Debian's Policy Manual rather than its wiki. His suggestion was to allow UTF-8 for regular user accounts, but to restrict system accounts created by Debian packages to ASCII.
Richard Lewis asked if enabling UTF-8 would open the door to "some of the abuse described" in a 2021 LWN article about flaws in Unicode handling that led to security exploits. He said that it seemed to be a bad idea to make the change, even if it would be nicer for users to have the option.
Haber said that he was not sure if it would be dangerous to allow UTF-8 usernames, "since we can expect other commands to gracefully handle a byte stream, can't we?" Additionally, local administrators already can loosen restrictions to allow UTF-8 usernames, but Debian does not test for such use cases. Debian would become "more robust" if it assumed UTF-8 characters would be used in usernames. "Vulnerabilities that could be exploited by having non-ascii user names are already here and present today, just not uncovered yet."
It would be reasonable, Timo Röhling said, to mitigate possible homograph attacks by disallowing mixed alphabets "such as cyrillic and latin letters in the same name". Haber said that was not going to help if a user could directly write to /etc/passwd, and he was unwilling to implement that himself in adduser. He would accept code and test cases written by others, though.
Keyboards
Security concerns aside, there are other practical problems with supporting non-ASCII usernames. Étienne Mollier noted that he had "one weird enough" character in his first name that posed a problem if he had to log in using a keyboard layout that lacked the capability to transcribe the lower-case or upper-case 'e' acute characters ("é" or "É"). For that reason, he said, he felt better about keeping a full ASCII username and "wouldn't feel strongly if unicode support for login never happens". But it would be good if the gecos field of the passwd file had proper Unicode support to display users' real names.
Not only was it difficult to type "é" on some keyboards, it could also be encoded in multiple ways. Gioele Barabucci pointed out that it could be "e with acute", the single code point U+00E9, or "e, combined with an [acute] accent", which would be U+0065 followed by U+0301:
If a keyboard input system provides the former sequence of bytes, but the username is stored in the login infrastructure using the latter sequence of [bytes], then a naive comparison will not find the user "émollier" in the system. Unicode defines in Annex 15 a few normalization forms as a way to work around this problem. But a correct use of these normalization forms still requires coordination and standardization among all programs accessing the data.
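A short Python session (an illustration, not something from the thread) shows the mismatch Barabucci describes, and how the normalization forms from Unicode Annex 15 make the two spellings compare equal; the variable names are invented for the example:

import unicodedata

precomposed = "\u00e9mollier"     # "émollier" using U+00E9
decomposed  = "e\u0301mollier"    # "émollier" using U+0065 plus U+0301

print(precomposed == decomposed)  # False: different code points, same appearance

# Normalizing both strings to NFC (or NFD) makes the comparison succeed.
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))  # True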
Barabucci asked if POSIX or other standards provided a normalization form for UTF-8 encoded usernames. Peter Pentchev responded that POSIX said to stick to the portable filename character set to ensure portability. Haber argued that it should be up to local admins to decide whether they wanted their local user database to be portable. "I don't think that we should restrict local admins who don't need that kind of portability."
Simon McVittie recommended that Debian consider adopting systemd's user name syntax and concepts of "strict mode" and "relaxed mode". The systemd tooling adheres to a strict naming convention when creating usernames, but it has a relaxed convention for accepting usernames created by other tools. McVittie said that seemed like a good principle for Debian to follow, even if its specific rules might differ from systemd's.
Haber seemed to agree in part, but said systemd's strict mode was "even stricter than what we currently allow for system accounts", and he did not like that systemd's policies (especially with systemd-homed, which LWN covered recently) were not configurable.
This time it's personal
The discussion, perhaps not surprisingly, brought out some strong feelings about how names and usernames were represented. Especially when, as Hofstaedtler noted, usernames can be important to some users:
I see and type my username hundreds times a day, people use it to address me in written and spoken conversations with it, etc.
If it were my uid, which I see maybe once a week and don't have to remember, I wouldn't care.
Indeed, it's not uncommon in open-source communities or within organizations to use a person's username rather than their given name—so it is unsurprising that some people feel strongly that usernames should be composed of a wider range of characters than POSIX recommends. Others dislike the practice of conflating usernames with real-world names, and see little reason to go to any trouble to go beyond ASCII.
Johannes Schauer Marin Rodrigues supported allowing more than ASCII in usernames. He said it would be good for Debian to put pressure on other projects to provide Unicode support. "We cannot find these kind of bugs if we accept translating everybody's given name to the American alphabet."
Bálint Réczey, though, asked that Debian avoid opening that can of worms and imposing needless work on upstreams. "Keep what works reasonably well for decades."
A plan
Haber initially seemed bullish on allowing UTF-8 usernames in Debian "as a courtesy to those people who need non-ascii user names to write their name" and as an opportunity to find "bugs that are already here" in Debian's software. He acknowledged that it was late in the development cycle for trixie. But, since it was currently possible to create usernames with UTF-8 characters, he did not want to tighten restrictions in trixie versus Debian 12, only to revisit those restrictions for Debian 14. In a reply to Mollier he wondered what advice to give in Debian's documentation "once we have decided to officially allow UTF-8 login names".
On December 3, however, Haber said that he "finally understood" that UTF-8 support would require more than the ability to create a UTF-8-encoded username and write it to /etc/passwd. Homograph characters, such as U+00E9 (é) and U+0065 plus U+0301 (é), could be used with adduser to create two separate users with lookalike usernames:
At the least, adduser should reject creating étienne if étienne already exists - those are different user names but look the same, and if you don't cut-and-paste user names instead of typing them you're bound to hit the wrong user depending on HOW you type and what input medium you use. Not good.
Haber said that he was the only active developer working on adduser and did not have time to implement a check against lookalike usernames in time for the trixie release. Worse, he said, the Perl module that he would use (Unicode::Precis) was not packaged for Debian and had not had a release in more than five years.
The next version of adduser, Haber said, would reject UTF-8 usernames by default. They would still be allowed when using the --allow-bad-names option, but he said he wanted to deprecate that option name in favor of something that doesn't use the word "bad". The --allow-all-names option will continue to pass everything verbatim to useradd.
Mollier thanked Haber for his work on the problem, and suggested some alternatives to the bad names option. Barabucci also thanked Haber for taking the time to research the issue, to which Haber replied dryly, "I have learned many things."
Haber's current course of action for adduser seems the most prudent. There may be a day when it is more practical to expand the allowed characters for usernames, but the work required to do so right now is far greater than the benefits that users would gain in the process.
A look at CentOS Stream 10
The Red Hat Enterprise Linux (RHEL) 10 beta was released in mid-November and, if all goes according to plan, CentOS Stream 10 should be released before the end of the year. While nothing is etched in stone just yet, it is a good time for anyone using or targeting RHEL (and its clones) to start taking a look at how Stream 10, and the corresponding EPEL repository, is shaping up. This is not only important to RHEL and Stream users, but anyone deploying and supporting software on enterprise Linux (EL) derivatives like AlmaLinux, Oracle Linux, and Rocky Linux as well.
From Fedora to Stream
In the CentOS Linux days, the Red Hat folks would start with a Fedora release, beat that into shape, and then release a major version of RHEL based on it. The CentOS project would then rebuild the RHEL sources into CentOS Linux releases. CentOS simply provided a clone of RHEL and it had no role in the overall RHEL development process.
Now that order has shifted. New CentOS Stream major releases are branched from Fedora ELN, a continuous-build project made up of packages from Fedora Rawhide that is designed to emulate a RHEL release. (The ELN name, which was originally "EL Niño", is explained more here.) The RHEL releases come from CentOS Stream. During the life of a RHEL release most, though not all, updates should appear in Stream before making their way to RHEL. Embargoed security updates appear in RHEL first, and may take some time to wind their way to Stream.
Stream 10, and therefore RHEL 10, is based on ELN around the time of the Fedora 40 release that came out in April 2024 and before Rawhide was branched for Fedora 41. This means that some major changes introduced in Fedora 41, such as DNF5, are not likely to turn up in an RHEL release until 2028 at the earliest. (Work to switch Fedora ELN to DNF5 is already ongoing.) Since DNF5, and RPM 4.20, both introduce some incompatible changes that might break scripts, build pipelines, other automated tooling, or just thwart muscle memory, without bringing any dramatic improvements, it will probably not disappoint many system administrators to wait a while longer.
x86-64-v3
One major change for Stream 10 that may well be a disappointment is the move to the x86-64-v3 instruction set architecture (ISA). This means that Stream 10 will not run on x86-64 CPUs without support for Advanced Vector Extensions (AVX) and AVX2. The change is in place despite the objections raised when it was discussed on the centos-devel mailing list. Alex Iribarren, for example, said that the decision to raise the baseline to v3 caused CERN to migrate away from RHEL for its industrial-control systems.
The AVX2 instruction set was introduced in 2013, but some recent CPUs (such as Intel's Atom series) still do not have support for AVX2. This means that systems without the AVX2 extensions are stranded on Stream 9, or will need to be switched to other Stream-based distributions, such as AlmaLinux, that plan to produce x86-64-v2 builds rather than drop support for recent hardware. However, that may mean foregoing support for the EPEL repository that provides additional EL packages, unless the AlmaLinux project decides to rebuild EPEL for v2 as well.
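For administrators wondering whether existing hardware meets the new baseline, a quick (and illustrative) way to check is to look at the CPU flags in /proc/cpuinfo; this sketch only tests a handful of the x86-64-v3 features rather than the full ISA level:

# Rough check for a few x86-64-v3 features; a missing flag means the CPU
# cannot run an x86-64-v3 build such as CentOS Stream 10.
wanted = {"avx", "avx2", "bmi1", "bmi2", "fma", "movbe"}

with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            missing = wanted - flags
            print("x86-64-v3 capable" if not missing
                  else f"missing: {', '.join(sorted(missing))}")
            break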
The impetus for the change seems to be about making life easier for companies that support software on RHEL. Florian Weimer wrote about the change in January. He said that the Stream ISA special interest group (SIG) was creating rebuilds of Stream 9 that it hoped would show performance improvements for key workloads when using x86-64-v3. However, even absent evidence of improvements, he said it still made sense to switch: "This reduces maintenance cost for some ISVs [independent software vendors] because they no longer need to maintain (and test) AVX and non-AVX code paths in their manually tuned software."
The bottom line is that Stream 10 cannot be installed on quite a few systems that would otherwise have enough resources to run a modern server operating system.
Setup and software
The Anaconda installer has changed little since CentOS 8 or 9. All one needs to do is to choose their language and keyboard layout, select a target disk for installation, set up a user, and reboot after installation is finished. The default install is "Server with GUI", but users can tune the software selection to add additional package sets (such as "Debugging Tools", "Mail Server", or "Virtualization Client") or dial down the base environment to a minimal install or server without a desktop.
There are two notable changes, though. First, Anaconda no longer has inline help—its help feature has been removed entirely. Second, user setup now assumes users will be administrators. Previously, when adding a user with Anaconda during the installation, the option to make the user an administrator was unchecked. This option is now checked by default—so it's a good idea to pay attention to that when adding users during setup for a multi-user system, lest users be granted more power than intended.
Naturally, there are a number of major version bumps in this release. Stream 10 currently features the 6.12 Linux kernel, though the recently released RHEL 10 beta is still on 6.11. Assuming RHEL 10 GA releases with the 6.12 kernel, this would be the first RHEL version since RHEL 6 that uses a long-term release kernel. However, the five-year lifecycle for CentOS Stream and ten-year cycle for RHEL will mean that those versions will be supported well past the projected end-of-life for 6.12 in December 2026.
The release also includes GCC 14.2.1, Python 3.12.6, GNU C Library (glibc) 2.39, GNU Debugger 14.2, systemd v256, RPM 4.19.1.1, and DNF 4.20 as part of the BaseOS repository. Software in the BaseOS repository is the "core set" of a Stream version, and should only receive security and bug fixes during the lifetime of the release.
Much of CentOS's software is distributed via two other repositories: AppStream and CodeReady Linux Builder (CRB), which have different lifecycles depending on the software. This also means that, in some cases, different Stream releases will have the same versions of software, so it's not necessary to upgrade to a new major release just to have more modern software. For example, Stream 9 and Stream 10 both currently have Podman 5.2.3 from the AppStream repository.
Users who opt for a desktop install will get GNOME 47 with Wayland as the windowing system. There is still a GNOME Classic profile option that emulates the GNOME 2 layout, but X11 is no longer supported (other than the Xwayland packages to provide an X server for Wayland). In addition to shedding X11, some desktop software has been dropped as well.
In 2023, Matthias Clasen announced that Red Hat's Display Systems team, which had maintained the LibreOffice RPMs for RHEL and Fedora, was "pivoting away" from work on packaging desktop applications in favor of working to improve Wayland, HDR support, and more. The upshot was that RHEL would cease shipping LibreOffice "in a future RHEL version". That future is now here. Stream 10's repositories do not include LibreOffice packages, nor packages for Firefox, GIMP, Evolution, Inkscape, or Thunderbird.
The default selection of desktop software takes minimalism perhaps a bit too far. Those who use Stream (or RHEL) as a desktop will need to hope that their software of choice is packaged in EPEL, or to turn to RPMs or Flatpaks supplied by the upstream projects. This would be easier if the Flathub repository was enabled by default, but it is not. It can be added, though, with a single command:
$ flatpak remote-add --if-not-exists flathub \
      https://flathub.org/repo/flathub.flatpakrepo
Looking up that command without a browser installed, however, is a challenge. Omitting the web browser from a default desktop install, without even providing a way to install one short of fiddling with additional repositories, is likely to frustrate users. It seems fair to say that a browser is a crucial piece of software on the desktop, especially since one cannot search for documentation without a browser at hand. Users cannot access Stream's web-based administration tool, Cockpit (rebranded "Web Console" in RHEL/CentOS), without a browser, either.
Cockpit, which we covered in March, is installed by default and available on port 9090. For some time now, Cockpit has been my preferred way of accessing and managing servers for most interactive and one-time system tasks. It has not changed much since March, but one noteworthy new feature is the addition of a basic file browser application. It can be enabled within Cockpit by going to the Applications page from the Tools menu and clicking "Install". The file browser is primarily useful for uploading or downloading files to and from a server without using scp or the like. It also allows users to change permissions on files, and has a simple text editor for making quick changes to configuration files. It won't dramatically improve productivity, but it is a nice enhancement.
EPEL
EPEL has seen a few interesting changes leading up to the Stream 10 release. One of those changes is that the EPEL infrastructure for EL 10 was brought online earlier than usual. This means that packagers have had more time to work on packages for EL 10 before the Stream release than in the past. There are already more than 10,000 packages in the EPEL repository for EL 10 according to "dnf list available", which is about half of what's currently available for EL 9.
Another change is the addition of branches that target minor versions of RHEL for EPEL. While it will be some time before the benefits are evident (namely, after RHEL 10 actually has minor releases) it should address some pain points for users and developers. In the past, EPEL only targeted the current minor version of a major EL release. That meant that there might be times when a package was not installable on a new minor version until it was rebuilt. When the package was rebuilt, it was then not compatible with the old minor version. The introduction of minor branches should smooth out many of those problems. See Carl George's proposal for more information on the change.
Miscellaneous
As with any operating-system update there are a number of miscellaneous enhancements, upgrades, replacements, and removals to be aware of, and Stream 10 is no different. On the database front, Stream 10 includes MariaDB 11, MySQL 8.4, PostgreSQL 16, and drops Redis in favor of Valkey—no doubt due to the Redis license change in March.
The old ISC DHCP server, which was declared end-of-life in 2022, has been replaced with ISC's Kea DHCP in EL 10. Folks running DHCP servers may need to put in extra work when migrating to EL 10. Sendmail has been removed from the AppStream repository, though it is still currently available in EPEL.
Composefs has been enabled as a "technology preview" in RHEL for container storage and is used in image mode, a deployment method that uses bootable containers. Bootc base images are available for Stream 10 as well, but converting those into, say, usable qcow2 images is still up to users. The bootc-image-builder repository has instructions that can be followed to convert the bootc images into Stream 10 virtual-machine images. Since this is categorized as a technology preview by Red Hat, it may change dramatically before it's considered stable—or (unlikely, but possible) dropped entirely.
The RHEL 10 beta release notes provide a fairly complete summary of known changes that also affect Stream 10, including a long list of new and updated device drivers. Stream users can happily ignore the release notes related to registration and subscription management.
Getting Stream
Stream 10 has not been officially released yet, but the project has been creating builds ("composes") since June. Early builds were rough going—at one point merely changing the wallpaper in GNOME would cause the desktop to crash—but my recent testing indicates that things have stabilized nicely. The last major release, Stream 9, was introduced in December 2021, so if the project sticks to a similar schedule, it is likely that the first stable release of 10 will be out before long.
The latest builds of Stream 10 are available as ISO images, or a variety of formats for use with virtualization, for Arm64, 64-bit PowerPC, s390, and x86_64. Stream 10 container images are available via Quay.io. Currently there is a stream10 image and a stream10-minimal image.
The move to time-based releases for RHEL has made the major versions less exciting to review, but easier for administrators to handle. For example, the EL 6 release used the short-lived Upstart init system and SELinux by default. EL 7 ushered in systemd, Linux containers, and software collections (SCLs). EL 8 introduced DNF as the replacement for Yum, broke content up into BaseOS and AppStream repositories, and dropped Docker in favor of Podman, Buildah, Skopeo, and runc. The Stream 9 and, now, Stream 10 releases have contained fewer disruptive changes—and fewer major enhancements.
One might even say that it's a somewhat boring release, but then again, boring is desirable when looking at enterprise software.
A Zephyr-based camera trap for seagrass monitoring
In a session at Open Source Summit Europe (OSSEU) back in September, Alex Bucknall gave an overview of a camera "trap"—a device to capture images in a non-intrusive way—that he helped develop, which is being used to monitor seagrass. He works for the Arribada Initiative, which is a non-profit organization focused on creating open-source technology for studying wildlife and ecosystems. The camera system uses the Zephyr realtime operating system (RTOS) on an open platform that is designed to be inexpensive and usable for multiple applications.
He began with a brief mention of some of the kinds of projects that the Arribada Initiative has helped build over the years. That includes projects all over the world such as satellite transmitters mounted on the backs of tortoises, thermal-imaging traps for monitoring pangolins in Cameroon, suction tags for tracking manta rays in a non-intrusive way, and penguin colony nest cameras in Antarctica.
Seagrass
The specific project he talked about concerns seagrasses, which are the only flowering marine plants. They can form meadows on the ocean floor, storing 10% of the oceans' buried carbon though they cover only 0.1% of the seafloor. "They also produce a significant amount of oxygen as well". They are critical habitat for various marine species, especially juveniles that need to hide out from predators.
Beyond that, seagrasses stabilize the seafloor; a lot of coastal islands have seagrass meadows that slow erosion of their coastlines. But seagrass is declining, sadly; it is disappearing at a rate of around 1.5% per year and at least 29% of seagrass has disappeared over the last 50 years, he said. Sustaining seagrass is important for both climate-change mitigation and for maintaining marine biodiversity.
The project partnered with some local organizations in Bermuda to help them figure out what was going on with the seagrass in that country. The government had noticed the decline in seagrass and some research had indicated that it was caused by green sea turtles that were eating the seagrass back to its roots, Bucknall said. In order to test that, the government placed cages over juvenile seagrass meadows to exclude the turtles, but found that did not change anything. So the government put out a call to help it understand what was happening, which is how the local organizations and Arribada got involved.
What the organization came up with was an underwater time-lapse camera "that could monitor the decay of the seagrass or, at least ideally, the growth of the seagrass", along with measuring characteristics of the water, such as temperature, pH, and the amount of light reaching the seafloor. The idea was to build a number of these cameras to place on the cages to help in determining "whether we were seeing turtles eating the seagrass or whether it was the case that something else was happening", he said.
Device
So the team developed a generalized platform for time-lapse photos as well as sensing. There are a number of requirements, including the need to be field-configurable for changing parameters like the realtime clock, photo interval, and the sampling rate for the sensors. It also needed to be low power so that it could last weeks underwater in a hardened enclosure that would allow it to survive the hostile underwater environment ("sea water tries to destroy everything"). Another requirement is that everything be open source; "it's important for the world of conservation that these tools are made accessible for people to use".
Something that Arribada tries to do when it designs the hardware for projects like this is to take it as an opportunity to create a base platform "that allows us to use one piece of hardware for multiple projects". That is part of "why Zephyr is really important for us". There is only a small team at Arribada, so it tries to use off-the-shelf hardware, such as Adafruit's Feather and FeatherWings, for its projects. Standard hardware connectors are also used, such as STEMMA/Qwiic, which allows plugging multiple sensors together into an I2C bus.
Different microcontrollers are used in various projects, so another important piece was to find a camera that is supported in Zephyr on multiple microcontrollers. Many cameras require large frame buffers in main memory, but the Arducam Mega that was chosen has its own buffers and allows frames to be transferred over SPI. The camera also needed to be inexpensive so that more of the devices can be deployed in the field. At the time of the talk, ten cameras were deployed in Bermuda, he said, but the hope is to scale the project to put more in place. The overall design is flexible, allowing others to add and remove components based on the needs of their conservation projects.
Zephyr
"Why does Zephyr make sense for conservation?
" Historically,
Arribada has done a lot of projects that ended up with their own code
repositories and code bases, which meant that there "were handfuls of
on-the-shelf, never-to-be-used-again projects
". Since they have a small team,
building everything new for each project slowed things down considerably.
Zephyr allows the team to choose a variety of ready-made SoCs depending on
the needs, such as power consumption or connectivity. It has easy-to-use
APIs for power management, networking, and sensing, as well. When
designing something new, the team often splits into two parts, one for
drivers and the other for the "business logic of the application
",
Bucknall said;
the APIs make that split easier.
Every time he looks at the Zephyr repositories, there are new sensor drivers that can be used, which makes it easy to quickly integrate other sensors into the platform. "A growing library of performant sensors for data collection is really important when we have people doing all sorts of crazy things." The team also likes the tools that Zephyr has available for build- and run-time configuration, such as the various shells (for sensors, I2C, and for the system) that allow non-programmers to configure devices at run time.
To make things easier, Arribada uses a central board repository with devicetree overlays for various pieces like the camera. A project can choose a repository for a particular microcontroller based on the needs of the project, such as whether it needs WiFi, Bluetooth, or reduced power usage, which will provide a base to start building the application. That also allows Arribada to outsource writing the drivers to contractors while the team focuses on the application, especially when there are tight timing windows. The overlays make it easy to add new hardware to a system and "it is a really nice way to almost have a hardware bill of materials" that shows what components the system has and how they connect; Zephyr takes care of connecting the buses together and attaching the peripherals correctly.
The Zephyr driver model allows Arribada to develop a library of code for sensors, which is "a little bit similar to Arduino, with their comprehensive libraries of random sensors off the shelf, but in a much more standardized method", he said. For a given project, the developers can simply pick the drivers they need and add things that are unique to their application.
For example, Arribada has been working with Arducam on a Zephyr driver for the Mega camera. The organization encouraged the company to create an open-source driver so that others can use it, and not just in conservation applications. At the time of the talk, the driver was working its way into the Zephyr mainline, he said, though the apparent repository for it does not show any recent activity.
Overall, the goal is to reduce the cost of these conservation tools. Arribada's usual partners are resource-constrained university research groups and charities. Development time costs money, so reducing the complexity by abstracting the drivers using standard APIs is helpful. Less time spent dealing with drivers and hardware means that the tools to start monitoring can be deployed that much more quickly.
Deploying
Field deployment is more than just flashing the device and delivering it. There is a need for doing configuration, updates, and debugging remotely on devices that may not be easily accessible. One problem that the team faced is how to build binaries and then get them to the field. The Zephyr Docker images that are used for continuous integration (CI) testing on GitHub have been useful in that regard. The binaries can be built and tested in the cloud, then local technicians can obtain them from there in order to update the devices. That helps avoid versioning problems in the field, he said.
The Zephyr shells make it much easier for technicians in the field; they can get instant feedback on configuration changes, network connectivity, device status, and more. The Arribada team is not always on-site to do any hand-holding, so giving the local technicians tools that spare them from having to do builds in the field before deployment is important. Some update mechanisms are easier than others. The UF2 bootloader is the easiest, because people know how to drag-and-drop a file onto a USB device; using a serial terminal is another possibility, since getting a terminal running is fairly easy to teach. Other options are far more cumbersome.
Over the years, Arribada has found some "things to watch out for". For example, support from vendors can be hit or miss, though Bucknall has been impressed by the number of vendors starting to support Zephyr of late. Newer microcontrollers and other hardware may take a while to be supported, or lack features, such as power management, that these devices require. That means either waiting for the support to be added, or the team adding those features and contributing them back. "If you want an unusual feature, you will have to write it yourself", he said.
Learning Zephyr is somewhat difficult at this point, though it is getting better; "training a new team to use Zephyr is not the easiest", but once you get going, the tools make development and deployment faster. The documentation is generally good, but examples are somewhat limited, especially for unusual uses. Arribada tries to help fill these holes with documentation and examples from what it did.
Other uses
Arribada has a few different projects that are using Zephyr currently. There is a humanitarian-aid-tracking project in Ethiopia, a project for avian leg tags, and a seaweed-monitoring project that is close to deployment in the Philippines.
He closed his talk with a video from the seagrass-monitoring cameras, which can be seen in the YouTube video of his talk. The slides (without the video) are also available. The video showed a variety of fish passing through a cage on the seafloor.
As far as the seagrass problem goes, "we don't know yet"; there are a number of cameras deployed and "we are hoping to collect the data sets from them in the next couple months". They will take that data back to the Darwin Initiative that funded the work to help scientists get a handle on the problem; hopefully there will be more funding to do further deployments if needed.
[ I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Vienna for Open Source Summit Europe. ]
Finally continuing the discussion over continue in finally
In 2019, the Python community had a lengthy discussion about changing the rules (that some find counterintuitive) on using break, continue, or return statements in finally blocks. These are all ways of jumping out of a finally block, which can interrupt the handling of a raised exception. At the time, the Python developers chose not to change things, because the consensus was that the existing behavior was not a problem. Now, after a report put together by Irit Katriel, the project is once again considering changing the language.
Like several other languages, Python has a try statement that allows catching exceptions. The optional finally block of a try statement allows the programmer to write code that should always run, regardless of whether an exception occurred in the try statement. This facility is frequently used to ensure a resource is always cleaned up, even if an exception is thrown.
The Python documentation clearly describes what happens when an exception is thrown and a finally block that includes a control-flow statement executes: "If the finally clause executes a return, break or continue statement, the saved exception is discarded". But some people see this behavior as counterintuitive.
When Batuhan Taskaya originally proposed PEP 601 ("Forbid return/break/continue breaking out of finally"), forbidding control-flow statements in finally blocks, they called the behavior "not at all obvious". At the time, a handful of other Python developers agreed. Brandt Bucher shared an example of one of the kinds of potentially surprising code that is currently allowed:
>>> def f() -> bool:
...     while True:
...         try:
...             return True
...         finally:
...             break
...     return False
...
>>> f()
False
This code would normally exit from f() upon reaching the return statement — except that the finally block runs and breaks out of the enclosing loop, letting execution continue into the rest of the function. Therefore the value returned from inside the try block is lost, and the function returns False. A similar example that throws an exception from inside the try block would show that the exception is lost as well.
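As a minimal illustration of that point (this example is mine, not one from the discussion), a return in a finally block discards an in-flight exception in the same way; in current Python the ValueError simply vanishes:

>>> def g() -> bool:
...     try:
...         raise ValueError("lost")
...     finally:
...         return False
...
>>> g()
False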
Other people did not think that this behavior was a problem. Paul Moore called the motivation for the proposal weak, and asked whether there were any real-world examples of people using control-flow statements inside a finally block incorrectly. Serhiy Storchaka supplied two examples that they thought were bugs in the standard library — but Guido van Rossum indicated that only one of those was actually a bug.
Ultimately, the Python steering council voted unanimously not to adopt PEP 601, but suggested that the rule should instead be added to PEP 8 ("Style Guide for Python Code") as a matter of good coding practice.
Revisiting the topic
In November 2024, Katriel reopened the discussion by posting a link to a report that examines the 8,000 most popular Python packages. Katriel wrote a script to examine nearly 121 million lines of Python code. The script found 203 cases where there were control-flow statements in a finally block that could cause the block to be exited in a way that could potentially suppress exceptions. She examined the examples and determined that 46 were correct, 149 were clearly incorrect and 8 were difficult to classify. Notably, almost all of the correct uses appear in tests for linters that check that the linter can identify when a control-flow statement is used like this.
Katriel analyzed the incorrect cases in more detail, noting that 27 of them had actually been fixed in the development branch of the relevant library. She reported the remaining cases to the developers of affected packages. Of the 73 packages where she opened issues, only two indicated that the code was working as intended. With these facts in hand, Katriel and Alyssa Coghlan proposed PEP 765 ("Disallow return/break/continue that exit a finally block").
The new PEP would issue a SyntaxWarning for any use of a return, break, or continue statement that would exit a finally block. The PEP would essentially call for the Python interpreter to warn about any means of exiting a finally block other than letting control flow reach the end of the block normally. This would be a syntax warning so that it shows up when Python code is first parsed, not when the code is actually executed. This means that the warning would appear to library authors, but not to most library users, who usually use pre-compiled Python bytecode. These users might see the warning at installation time, but not while running their Python scripts.
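The compile-time nature of a SyntaxWarning can be seen with an existing warning in current Python (the check for "is" used with a literal), since the PEP 765 warning does not exist yet; merely compiling the source triggers it, with no execution needed. The variable names here are invented for the example:

import warnings

source = "def check(x):\n    return x is 10\n"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    compile(source, "<example>", "exec")  # parsed and compiled, never run

# A SyntaxWarning about using "is" with a literal is recorded here.
print([str(w.message) for w in caught])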
Van Rossum called the report "an excellent piece of research", and suggested that "perhaps we should do more PEPs based on such investigations". He was, however, "personally sad to see this syntactic corner case disappear". He believes that control-flow statements in finally blocks, working as they do now, are an example of how Python has many small features that can be seamlessly composed together.
Katriel responded that there are already similar exceptions to the rules in Python. Specifically, except* clauses already disallow control-flow statements.
The overall response to the proposal was fairly positive, however. Several people chimed in to express their support, and Tim Peters, recently back from his three-month suspension, went so far as to call the existing behavior "a bug magnet on the face of it".
Not everyone was as convinced by Katriel's research. Robin Becker thought that the existing behavior was obvious, and said in a later message that there was nothing special about finally blocks suppressing exceptions. Peters disagreed, quoting the Zen of Python, which he authored:
Errors should never pass silently. Unless explicitly silenced.
That led Steve Dower to propose an alternative: "why are we forbidding the construct rather than just making exceptions re-raise on exit from finally, regardless of how we leave the block?" He did later clarify that he wasn't really arguing for that alternative, but just wanted an explanation of why the PEP didn't seem to consider the possibility.
Katriel said that Dower's proposal would introduce backward-compatibility problems; code that was previously correct would raise unexpected exceptions. Dower wasn't satisfied with that explanation, saying that adding a syntax warning would also disrupt existing code. He eventually asked: "So is it better if the error is still raised? Or better if everyone who uses your module gets a warning (that breaks their own tests/users) whether that error ever occurs or not?"
Katriel thought the latter was preferable, because it is easier to debug a syntax warning that points to the exact line causing the issue, rather than a random no-longer-suppressed exception that may occur without warning. In the end, neither Katriel nor Dower was able to convince the other, but Katriel did add a section to the PEP summarizing Dower's proposed alternative and why she did not think it was a good idea. Ultimately, Dower wanted a solution that "minimises breakage", since he is going to have to justify any changes to his users, who are already "unjustifiably nervous" about updating to Python 3.12.
Daniël van Noord asked Dower what it would take for him to support the PEP. Dower replied that Python has "many millions of users and billions of lines of code" — so adding a new warning is going to impact people, no matter how rare the circumstances. He won't support a change unless that can somehow be avoided.
Katriel pointed out that the choice to issue a warning instead of an error was deliberate — existing Python code would continue to work, just possibly with a warning. That led to an extended discussion about under which circumstances the warning would be shown, whether that was too disruptive, and various related concerns. Several people said that they thought putting a control-flow statement inside a finally block should be an error, instead of a warning, despite that being more backward incompatible than Katriel's proposed change.
Ultimately, Katriel submitted the PEP to the Python Steering Council. So regardless of the arguments in favor (the fact that the feature is almost never used correctly) and the arguments against (any amount of backward incompatibility is a pain), it will lie with the council to decide whether the PEP is accepted. The council faces yearly elections — and, being so close to the end of its term, it decided to leave the decision to next year's council, the election for which ends on December 9.
Van Rossum, near the end of the discussion, said that he has mixed feelings about the proposal. "I honestly don't know what I would have done when I was BDFL [benevolent dictator for life]."
What the steering council will decide remains to be seen, but it seems clear that either choice is going to make some people unhappy.
Freezing out the page reference count
The page structure sits at the core of the kernel's memory-management subsystem (for now), and a key part of that structure is its reference count, stored in refcount. The page reference count tells the kernel how many users a given page has and when it can be freed. That count is not needed for every page in the system, though. Matthew Wilcox has recently resurrected an old patch set that expands the concept of a "frozen" page — one that lacks a meaningful reference count — to the immediate benefit of the slab allocator but in the service of a longer-term goal as well.
The kernel is in the business of managing resources that are shared between multiple users. For example, anonymous (data) pages and file-backed pages can both be mapped into the address space of one or more processes; each mapping increases the relevant page's reference count, ensuring that the page stays around as long as it is needed. Reference counts can be a relatively efficient way to manage object lifecycles, but they are not free. Frequent reference-count changes can cause cache-line bouncing, and the atomic operations needed to change a reference count are relatively expensive. So there is value in not using a reference count when the opportunity arises.
In the case of the struct page reference count, its use is so deeply ingrained within the memory-management subsystem that it is maintained for almost all pages in the system — even those for which it is not needed. One case, in particular, is the slab allocator, which allocates groups of pages, splits them into smaller objects, and hands those objects out on request. A call to kmalloc() is the most common way to get memory from the slab allocator. Since it must track the status of each of the sub-objects contained within a page, the slab allocator knows whether the page as a whole is in use or not; it does not need a separate reference count for that purpose.
Even so, pages given to the slab allocator are reference-counted. The overhead of maintaining that reference count may not seem like much, but it can add up, especially under workloads that exercise the slab allocator heavily. Given the amount of effort that goes into optimizing the kernel for even tiny gains, eliminating this potentially costly atomic operation seems like a worthwhile goal on its own.
Frozen pages
Wilcox's patch set expands on the notion of a "frozen" page — one for which the reference count has been frozen (with a value of zero) and is not in use. In current kernels, frozen pages are only used in the hugetlbfs subsystem, which maintains a reserve of huge pages for application use. In that case, frozen pages are used to avoid manipulating reference counts while assembling larger pages, but the concept is more widely useful.
In current kernels, a page's reference count is initialized with a call to set_page_refcounted(), which sets the count to one. This initialization is done deeply within the page-allocation paths; there are only three call sites in the entire kernel. The bulk of the patch set is, perhaps surprisingly, dedicated to adding many more of those call sites. In short, the set_page_refcounted() calls are pushed down the call stack into the callers of the low-level allocation functions. This change enables the existence of allocation paths that never set the reference count at all. Still, it may seem like a counterproductive change; set_page_refcounted() turns into a single assignment instruction (no atomic operations are needed for the initial reference), and the code was arguably cleaner before this change. There are reasons to do things this way, though, as we will see.
Once those calls have all been pushed down, a new allocation function (a macro, in truth) is added:
struct page *alloc_frozen_pages(gfp_t gfp_flags, unsigned int order);
Along with that, the existing free_unref_page() is renamed to free_frozen_pages(). The latter function is where the bigger savings is to be found. The normal function for freeing pages (__free_pages() or one of its callers) works by decrementing the reference count of the pages passed to it; it only actually frees the pages when the count goes to zero. free_frozen_pages() can avoid the atomic decrement-and-test operation on the reference count, since it knows that there are no other references to the page.
Once that work is done, the final step is a small patch causing the slab allocator to use alloc_frozen_pages() and free_frozen_pages() rather than alloc_pages() and __free_pages(). The unneeded atomic operation is eliminated, and the slab allocator is presumably that much faster — though no benchmark results have been included to quantify that.
The future is glorious
Performance patches normally should include benchmarks, but the performance improvement here is really more of a side effect; the real purpose driving this work is something different. Since the beginning of the folio transition some years ago, one of the end goals has been the reduction of the size of struct page. This structure is as small as developers could make it but, since one of them exists for every page in the system, page structures still end up occupying roughly 1.5% of the memory in the system. Reclaiming some of that overhead for productive use is an attractive prospect.
The long-term goal is to reduce struct page to a single, 64-bit memory descriptor that indicates how each page is being used and contains a pointer to a structure with the information needed for that usage. For example, slab pages can be described by struct slab; that structure exists in current kernels, but is carefully crafted as an overlay on top of struct page. In a future world, a single struct slab could exist for an entire folio of slab pages, reducing the memory-management overhead for those pages; a pointer to that structure would be placed into each of the relevant memory descriptors.
In current kernels, struct slab, being an overlay on struct page, contains the refcount field; that will remain true even if Wilcox's patch set is merged. But, in a future where struct slab no longer needs to overlay struct page, that reference count can be removed, shrinking struct slab. And that is, indeed, the future toward which Wilcox is looking:
This patchset is also a step towards the Glorious Future in which struct page doesn't have a refcount; the users which need a refcount will have one in their per-allocation memdesc.
Many steps toward the "Glorious Future
" have already been taken;
some of the type-specific memory descriptors (such as struct slab)
have already been merged, and others are in the works. There has been an
ongoing effort to move information out of struct page and into
struct folio where appropriate. Other projects, like the removal of the
index member from struct page, are ongoing. This
member, which is used for page-cache pages, tells the kernel the offset of
the page within the file it represents on disk; it is being shifted over to
the folio structure. That work depends, in turn, on the zswap
memory-descriptor transition, which is also in progress. There are
many other steps yet to be taken, but at the conclusion of this work,
struct page should mostly have just withered away.
So, while the removal of a single atomic operation from the slab allocator's page-freeing path is only so exciting, this patch series is rather more interesting as a piece in the larger memory-descriptor project. The core of the kernel's memory-management subsystem is going through a period of radical (and much-needed) change that will make it more efficient, flexible, and maintainable in the long term, and most users are not even noticing. The task is somewhat like exchanging the foundation underneath a skyscraper while it remains open for business; in the kernel community, though, it's just development as usual.
Auto-tuning the kernel
The Linux kernel has many tunable parameters. While there is much advice available on the internet about how to set them, few people have the time to weed through the (often contradictory) explanations and choose appropriate values. One possible way to address this is a project called bpftune, a program that uses BPF to track various metrics about a running system and adjust the sysctl knobs appropriately. The program is developed by Oracle, and is available under a GPLv2 license. Bpftune is currently mostly focused on optimizing network settings, but the authors hope that the system is flexible enough to be extended to cover other settings.
Bpftune is built around dynamically linked modular libraries called "tuners". Each one handles a different set of sysctl settings. They do this by reacting to performance data sent via a BPF ring buffer from various performance hooks that bpftune installs in the kernel. The project has eight existing tuners, but is structured to make it fairly easy to add more for a new use case. The available tuners are:
- ip_frag, which manages IP fragmentation settings,
- net_buffer, which manages non-TCP buffer sizes,
- route_table, which manages routing settings,
- tcp_buffer, which manages TCP buffer sizes,
- neigh_table, which manages the neighbor (ARP) tables,
- netns, which manages network namespaces,
- tcp_conn, which chooses TCP congestion-control algorithms for each connection, and
- sysctl, which reacts to changes to sysctl settings.
Bpftune only has official packages for Oracle Linux, although there is an Arch user repository package for it. In order to try bpftune myself, I had to compile from source — which was a mostly painless process. The one sticking point is that I had to tweak one of the headers in order to load tuners from a custom location, since my root partition is read-only. Bpftune requires a kernel of version 5.6 or newer (in order to make use of BPF ring buffers) that is built with BPF Type Format (BTF) debugging information.
Upon starting up, bpftune assured me that it had loaded successfully and was also able to support network namespaces on my system — with the exception of the route_table tuner, which failed to load. Once it was running, it almost immediately increased my net.ipv4.tcp_rmem setting by 25%. A few minutes later, it increased it by another 25%, and then gave another two bumps 15 minutes or so later. That setting controls the minimum, default, and maximum size of TCP receive buffers. Having larger buffers enables the computer to receive more data at once, and avoid dropping packets, at the cost of using more memory. Other than thinking that my TCP receive buffers were too small, bpftune was content with my configuration, and didn't make any other adjustments.
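For anyone curious about what bpftune is adjusting, the setting is just a sysctl; this small sketch (mine, not part of bpftune) reads the three tcp_rmem values from /proc and labels them:

# net.ipv4.tcp_rmem holds three integers: the minimum, default, and
# maximum TCP receive-buffer sizes, in bytes.
with open("/proc/sys/net/ipv4/tcp_rmem") as f:
    minimum, default, maximum = map(int, f.read().split())

print(f"min={minimum} default={default} max={maximum}")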
It's possible that I would have seen more dramatic results on a heavily-loaded server, instead of a relatively quiescent laptop. One tuner in particular — the TCP connection tuner — doesn't show up in the logs, however. Presumably it could be made to log when it chooses a particular TCP congestion-control algorithm, but this would lead to exceedingly noisy logs. Instead, the tuner prints a summary of how many times it made a decision when the program exits. The tuner chooses a TCP congestion-control algorithm for each remote host that the computer connects to, in order to minimize latency and maximize throughput. Alan Maguire, the project's maintainer, wrote an article describing the tuner's design.
In short, the connection tuner uses reinforcement learning to figure out which congestion-control algorithm performs the best in practice. Since network conditions change over time, it will occasionally select a different algorithm at random to see whether it performs better. According to Maguire's performance measurements, using bpftune to select the best congestion-control algorithm imposes a (slight) performance overhead in perfect network conditions, but can offer a substantial improvement when there are nonzero amounts of packet loss. Since few of us enjoy the luxury of having perfect networks, that is probably an acceptable tradeoff.
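The general idea (mostly exploit the algorithm that has worked best so far, but occasionally explore another one at random) can be sketched with a toy epsilon-greedy loop; this is an illustration of the technique, not bpftune's actual code, and the algorithm names, scores, and reward function are stand-ins:

import random

algorithms = ["cubic", "bbr", "htcp"]        # candidate congestion-control algorithms
rewards = {name: [] for name in algorithms}  # observed per-connection scores
EPSILON = 0.1                                # how often to explore at random

def choose_algorithm():
    # Explore occasionally; otherwise pick the best-performing algorithm so far.
    if random.random() < EPSILON or not any(rewards.values()):
        return random.choice(algorithms)
    return max(algorithms, key=lambda a: sum(rewards[a]) / max(len(rewards[a]), 1))

def record_result(name, score):
    rewards[name].append(score)

# Example: pretend each connection reports a throughput score afterward.
for _ in range(5):
    algo = choose_algorithm()
    record_result(algo, random.uniform(0.0, 1.0))
    print(algo)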
Since bpftune is a project aimed at simplifying the configuration of Linux systems, it does not offer many configuration options. One that it does offer is a configuration for the learning rate (that is, how often a different algorithm is tried) of modules that use reinforcement learning. The documentation suggests that the default value is usually fine, however. One can also select which tuners are used — although there is theoretically no harm in leaving them all enabled.
Bpftune does a good job of being set-and-forget. When a tuner makes a change to a configuration setting, it logs the information about what was changed and why, so administrators will have an easier time troubleshooting. The program also automatically detects when the administrator sets a configuration value, and stops trying to manage that particular piece of configuration — so administrators shouldn't find themselves fighting with the program, either. The documentation isn't clear about whether this also applies to settings configured in a system's sysctl.conf file.
Additionally, bpftune can be attached to a particular control group, in order to segregate processes that might be affected by configured settings from ones that shouldn't be. Still, any tool that occasionally changes configuration settings without human oversight is going to mess up eventually. In my brief experiment with it, this hasn't caused any problems, but it is still probably best to avoid bpftune on servers where consistency and reliability are more important than throughput.
Bpftune seems like it could be a useful tool, but right now it has fairly limited applicability. The Linux kernel has thousands of configuration options, but currently bpftune only handles a limited subset of them — the inevitable result of the fact that the project only has four contributors, and has only existed for a little more than a year. If bpftune gains popularity, and members of the Linux community contribute tuners for non-network-related settings, it could end up becoming a lot more broadly applicable.
Not everyone will be comfortable contributing to the project, however, because doing so requires agreeing to the Oracle Contributor Agreement. Among other things, this allows Oracle to sublicense contributions for its own use. The company does promise to make contributions available under an FSF or OSI approved license in addition to any of its own uses. Bpftune's architecture also makes it possible to develop and distribute compatible tuners separately from the project itself — but this adds complexity for end users.
In any case, bpftune might already be a good choice for anyone who is managing a large number of Linux servers, and who wants more performant networking. Alternatively, it is also an option for anyone who wants a simple tool to throw on their existing servers to improve performance without needing to learn the ins-and-outs of Linux networking configuration.
Page editor: Jonathan Corbet