
Leading items

Welcome to the LWN.net Weekly Edition for January 13, 2022

This edition contains the following feature content:

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Relocating Fedora's RPM database

By Jake Edge
January 12, 2022

The deadlines for various kinds of Fedora 36 change proposals have mostly passed at this point, which led to something of a flurry of postings to the distribution's devel mailing list over the last month. One of those, for a seemingly innocuous relocation of the RPM database from /var to /usr, came in right at the buzzer for system-wide changes on December 29. Holidays and vacations kept many people busy around that time, so the discussion was relatively muted until recently. Proponents have a number of reasons why they would like to see the move, but there is resistance as well, due at least in part to the longstanding "tradition" of the location for the database.

To /usr

As is normal for Fedora change proposals, program manager Ben Cotton posted the proposal to the mailing list on behalf of its owners: Chris Murphy, Michel Alexandre Salim, and Neal Gompa. It can also be found on the Fedora wiki. In a nutshell, it would move the database maintained by the RPM package manager from its current location, /var/lib/rpm, to /usr/lib/sysimage/rpm; the former would be replaced with a symbolic link to the latter. The DNF package-management tool would not be switching locations for its database, at least not yet; the RPM developers have mixed feelings about the change, but are not standing in its way.
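As a rough illustration of what the relocation looks like to a tool that goes hunting for the database, here is a minimal Python sketch. The two paths come from the proposal; the helper name find_rpmdb() and its root parameter are invented for this example. A real tool would ask RPM for its configured %_dbpath rather than probing the filesystem.

```python
import os

# Relative paths taken from the change proposal.
OLD_DBPATH = "var/lib/rpm"
NEW_DBPATH = "usr/lib/sysimage/rpm"


def find_rpmdb(root="/"):
    """Return (real_path, via_symlink) for the RPM database under root.

    Hypothetical helper: it checks the traditional location first, as an
    older tool would, and reports whether that location is only a
    symbolic link pointing somewhere else.
    """
    for rel in (OLD_DBPATH, NEW_DBPATH):
        path = os.path.join(root, rel)
        if os.path.isdir(path):  # isdir() follows symbolic links
            return os.path.realpath(path), os.path.islink(path)
    return None, False


if __name__ == "__main__":
    print(find_rpmdb())
```

On a system laid out as the proposal describes, a tool that only knows about /var/lib/rpm would still find the database through the link, with the second value reporting that it got there indirectly.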

The RPM database tracks which packages have been installed and their metadata. In addition, it tracks all of the files that get installed to the system, their locations, and which package is responsible for them. RPM uses it to clean up packages when they are removed; users can query the database, as well, to answer various questions about their packages and files.

There are several benefits for Fedora listed in the proposal, but the main driving force seems to be support for rolling back failed or undesired updates of the operating system. The RPM database describes what is in /usr for the most part, so having it stored in the same directory hierarchy means that /usr can be rolled back as a unit, without needing to change /var. The switch is also in keeping with what the Fedora variants based on rpm-ostree (CoreOS, IoT, Silverblue, and Kinoite) already do; it is also aligned with another RPM-based distribution ecosystem, the SUSE distributions (SUSE Linux Enterprise, openSUSE, and Tumbleweed).

Much of the early discussion revolved around the Filesystem Hierarchy Standard (FHS), which Fedora packaging explicitly follows—with a few exceptions. Tom Hughes pointed out that the FHS descriptions for /usr ("shareable, read-only data") and for /var ("variable data files") seem somewhat at odds with the proposal. In particular, he said, the /var description "appears to exactly describe the RPM database". Thus the changes are not FHS compliant, nor do they follow the packaging guidelines, he said.

Stephen John Smoogen agreed and suggested that the change needed to be made in the FHS first, then it could be done for the distribution. Fedora project leader Matthew Miller did not think that was likely to happen, since the most recent release (FHS 3.0) was in 2015 "and the whole thing has been effectively dead since". On the other hand, Miller was in favor of reviving the FHS effort, especially with buy-in from other distributions.

But, as Murphy (and others) said, no matter what the FHS says, /usr cannot be more than "mostly read-only" or the system can never be updated:

In practice it is read-only data, except when software is being installed or updated. The FHS is a PITA sometimes, but it's not advocating for systems that can't be updated or changed.

RPM developer Panu Matilainen agreed that the database has generally only been changed during update operations, "but it doesn't mean it will stay that way forever more". There are unimplemented features that might change that situation; beyond that, there are already situations where changes are made completely outside of /usr but that require an update to the RPM database:

There seems to be this strange underlying assumption that all packaged content lives in /usr when that's not at all the case. To install a kernel, or a config-only package (under etc), or 3rd party software putting stuff under /opt, or... you need a writable rpmdb. Ditto for 'rpm --import'.

Vitaly Zaitsev also pointed to the FHS and its "read-only" language for /usr, but as Gordon Messmer said: "If /usr really is read-only, then it probably doesn't matter where the rpmdb is, since packages can't be installed (generally)." Gompa concurred with that, and expanded on the advantages of moving the database out of /var:

The bigger problem is that having the RPM database in /var makes it much harder to correctly implement a boot-to-snapshot scheme for Fedora Linux that reasonably maintains system state properly once /var is carved out. The reason that /var *isn't* carved out by default with our Btrfs configuration is because of the RPM database. Once the RPM database is moved, it becomes possible to split /var into its own subvolume and make it trivial to configure a full boot-to-snapshot system leveraging the technologies we have available to us.

Miller suggested that the "Benefit to Fedora" section of the change proposal add more information about that "pretty compelling benefit". Murphy made that change, but did caution: "There's more hurdles to jump, just so no one thinks snapshots and rollbacks are showing up in Fedora 36."

/state

Things had mostly quieted down in the discussion after the new year, at least until David Cantrell threw a bit of a curve ball on January 9. As with others in the thread, he was uncomfortable with moving the database to /usr, "but we should move it to gain the improvements as noted in the feature proposal". In his lengthy message, he suggested some other, novel ways to look at /usr and /var:

It is generally understood that /usr contains [most of] the installed system. What I think is a bigger requirement or [expectation] now is that one can tar up /usr and transport it to another system or virtual machine or container and expect that it will _probably_ work maybe with a bit of tinkering. This is a really valuable thing to have for developers. Moving the RPM database to this tree adds a component that is unnecessary and sort of out of place.

[...] Reading comments and talking to people, the long standing understanding of /var is still "that's stuff you can rm -rf and the system will still work fine". Technically you could remove the RPM database and the system still function, but arguably would still be broken since you really want the RPM database. This use case of removing the RPM database and still having a functioning system is really only useful for data recovery scenarios. You're ultimately going to reinstall. Probably.

With that in mind, he suggested moving "the RPM database and other variable and stateful data" to a new top-level directory called /state. The FHS does not prevent the addition of new top-level directories, Cantrell said. Michael Catanzaro thought that adding a top-level directory was "a pretty big hammer" and that perhaps it was easier to just support two separate locations for the RPM database.

But rpm-ostree developer Colin Walters took exception to Cantrell's "unnecessary and sort of out of place" characterization. "Multiple independent groups who *actually work* on image based updates and/or client side snapshots all generally agree that the rpmdb should be in /usr." Murphy said that /state does not really solve the problem, it is "just rearranging the chairs". The RPM database holds information about multiple locations, so it needs to stay in sync with the rest of the system, no matter where it lives:

If /usr is to be truly portable and have e.g. 'rpm query, verify, remove, reinstall' work as expected, you need the metadata (the database) representing its state to always come along for the ride. Either the database is already in /usr, or you have to make sure /usr and /state are inseparable.

If /usr and /state are inseparable, and if rpm can also describe anything in /etc or /var or /opt, then all or part of those directories are also inseparable from /state. And thus /usr. So I think /state doesn't help.

Cantrell is not alone in feeling that /usr is a bad location, however; Matilainen said that putting the RPM database there "just *feels so wrong*". There is other data like the RPM database, he said, so having a separate location for it all makes a lot of sense:

For many practical purposes it's probably just rearranging the chairs, but a separate top-level directory describing the *system* state seems instinctively *much* more correct solution to it than stuffing it somewhere deep inside a loosely related fs.

Just FWIW, I would quit my whining about this right there if it went to a new toplevel directory instead because it just *feels* right unlike /usr.

Further pieces

Walters said that problems, such as PGP keys being added via rpm --import, need to be addressed as part of the adoption process:

The TL;DR for me is: I think everyone agrees that moving the rpmdb as it is today to /usr is not 100% a perfect fit. But it's a ~90% fit - almost all the raw data is just headers which are clearly immutable data generated elsewhere. And by making this change, we're basically saying we'll fix the remaining 10% of cases.

[...] But, I hope we can get agreement about something like having `rpm --import` write to `/etc/pki/rpm-gpg` and dropping gpg keys from the rpmdb.

There is, it seems, an effort toward regularizing package installation in various ways with an end goal that is clear to some, but perhaps not to others. Zbigniew Jędrzejewski-Szmek described where he sees things heading:

Traditionally, packages installed all kinds of files all over the place. But we're slowly and painfully moving towards the model where:
  1. packages are only allowed to install under /usr, /var, and /etc. (Or under /opt, but I'd want to move that to /usr/opt…)
  2. packages must support /var/cache being wiped at any time, and most packages support anything under /var being wiped at any time.
  3. systemd and other projects are trying to only use /etc for local admin state, and support "factory reset" by wiping /etc and /var.

Based on that, he said that it makes sense to put the database under /usr somewhere, and the exact location is only a matter of convention, "so obviously we want to follow what opensuse and others are already using". But Matilainen was concerned that there was something of a "hidden agenda" behind the change proposal. He thought the goals should be more clearly stated:

I'm not saying these are necessary bad goals at all, it's just that there's a huge disconnect between reality and the above model on which this change seems based on, and not a single mention about these goals and changes needed to get there. I mean, I totally get that you can't change everything at once, but if there's a plan this big behind something then maybe it should be brought up front, no?

Matilainen said that "nearly all packages put something in /etc", for example, which is something that is being ignored, but Walters said that the handling of /etc is something that is being improved in the image-based update world:

rpm-ostree uses ostree, which introduces /usr/etc which are the pristine default config files. /etc is 3-way merged by ostree. One of the major benefits of this that I really love is `ostree admin config-diff` - at any point we can show you machine-local changes from the default, and it's trivial to reset back to defaults without redownloading a whole RPM.

[...] There's no hidden agenda - the goal is to support image based updates as well as client side snapshots, factory reset, etc. And we're shipping today versions of Fedora that do a lot of this, and we want to continue to improve it.

The discussion is still ongoing as of this writing, which likely means that the Fedora Engineering Steering Committee (FESCo) will not decide on the proposal right away. Several members are favorably disposed to it, as can be seen in the FESCo issue entry. Matilainen has said that he is "not going to stand in front of the Change truck", even though he has reservations about the proposal. While it may "feel wrong" to do so, at least for some, it seems that the writing may already be on the wall for this particular change. Whether that larger agenda (hidden or not) is adopted will play out over the next few releases.

An outdated Python for openSUSE Leap

By Jake Edge
January 11, 2022

Enterprise distributions are famous for maintaining the same versions of software throughout their support windows, which normally run for five years or more. But many of the projects those distributions are based on have far shorter support periods; part of what the enterprise distributions sell is patching over those mismatches. But openSUSE Leap is not exactly an enterprise distribution, so some users are chafing under the restrictions that come from Leap being based on SUSE Linux Enterprise (SLE). In particular, shipping Python 3.6, which reached its end of life at the end of 2021, is seen as problematic for the upcoming Leap 15.4 release.

Background

The openSUSE distribution has undergone a few different shifts over the years. In 2005, it started out as a more open offshoot of the SUSE Linux distribution and in 2009 started accepting community contributions. Eventually, the Tumbleweed rolling distribution came along, which LWN looked at in 2016. It was paired with openSUSE Leap 42, which came after openSUSE 13.x and, somewhat confusingly, before Leap 15.x.

OpenSUSE and SLE have generally been aligned over the years. In 2020, Leap and SLE grew even closer together. The build system and repositories between the two were shared starting with Leap 15.2, which corresponded to the second "service pack" (SP) of SLE (i.e. SLE 15-SP2). In 2021, with Leap 15.3 and SLE 15-SP3, the two distributions effectively merged, such that all of the base packages were shared between the two. To a first approximation, Leap is an openSUSE-branded version of SLE, much like what CentOS used to be for Red Hat Enterprise Linux.

Python 3.6

On January 3, Michael Ströder posted about the Python version to the opensuse-factory mailing list. He noted that some Python modules are already dropping support for Python 3.6, but that Leap 15.4 will ship with that Python version as its default. "IMO this does not make sense at all." As Ben Greiner pointed out, though, the situation is not quite as bad as portrayed, since Leap 15.3 already has Python 3.9 available and there is talk of adding support for 3.10 (the most recent Python release). But, Greiner said, the "system Python" is the problem area:

The real discussion is about what Python version the system packages have to use. And because, as I understand it, all 15.X releases need to maintain the binary compatibility of earlier 15 releases, they cannot move away from Python 3.6.

But Ströder said that the other versions of Python "are mainly only usable for developing software in virtual envs". Meanwhile, having the default Python (i.e. /usr/bin/python3) be a version that is no longer supported by the upstream community causes real problems: "Now try to explain to an auditor that this security relevant system software runs on a unmaintained Python version. Yeah, have fun." But part of the promise that SUSE offers its customers is in maintaining software like Python, Simon Lees said, so those end-of-life packages are not really unsupported:

This is simple, the python 3.6 is still maintained just by a different group of people, if significant python 3.6 security issues are found they will be fixed by someone within SUSE, this is a guarantee we give to our customers

That policy is fixed for SLE 15.x, thus Leap 15.x, he said; "Maybe in future SLE versions the model will be different but I am as yet unsure." Lees suggested that one solution might be to make SLE and Leap 16 available sooner, though he realized that may not be workable. Neal Gompa agreed with that idea:

I'd really like to see SLE/Leap 16 sooner rather than later too. The tech debt in SLE 15 is piling up and a lot of stuff that I've done in Tumbleweed for that future SLE version can't be backported because they're foundational things that would make SLE 15 into a different SLE anyway.

Tumbleweed will eventually be used as the basis for the next SLE and Leap. It currently uses Python 3.8, and is in the process of switching to 3.10.

Alternatives

There is more than just Python itself that needs to be maintained, however, as Axel Braun pointed out. There are lots of other Python modules that are shipped with the distribution, many of which will start dropping support for Python 3.6—though they continue fixing bugs, security bugs in particular. Those fixes will need to be backported. Lees said that those packages may or may not get updates:

For SLE packages SUSE Employees will keep them working, for some openSUSE packages someone in the community will care and do the work and for some others probably no one will care enough to do the work and they will just stay at [their] existing versions.

That's obviously problematic from a security standpoint, as Braun noted; he wondered if it made sense to turn Leap into a faster-moving distribution. Lees replied that it would be something of a reversion to the previous openSUSE model:

With enough man effort anything is possible but currently I doubt we have the man effort to maintain a third Gnome, KDE or python stack, this lack of man power was one of the main reasons from moving from the older model to the current Leap model.

Along similar lines, Frans de Boer wondered if it made sense to eliminate Leap entirely, since it is seen as "the consumer version of the Suse SLE" but offers packages that are outdated or even beyond their end of life. It damages the openSUSE brand to have those kinds of packages, De Boer said, so a change may be in order:

That said, it might be an idea that OpenSuse is rethinking it's strategy. For example, forgo the whole Leap series, which will free resources to concentrate on Tumbleweed. Also, as a suggestion, once a year repackage Tumbleweed of a few months old into a rebranded "stable" version and provide only security updates.

Lees does not think that would actually work well, though: "If we could find the resources to do this I agree it would be a great distro and probably more useful for many people then Leap and Tumbleweed." The problem is that maintaining Leap does not actually take all that much time, since most of the updates come "for free" from SLE, but keeping up with security updates for the "stable Tumbleweed" (separate from SLE/Leap) would be a lot of work. But there are others, like Georg Pfuetzenreuter, who want Leap to continue its current course:

I, personally, like Leap as it is - the other big enterprise GNU/Linux vendor (yes, the one with the red hat) ditched the free fork of their enterprise distribution - SUSE and the openSUSE project keep Leap, a free SLE fork, (whether that's the right definition or not) alive - and I value that a lot.

[...] I understand that other users do not like being presented with outdated packages and legacy code - I understand people want a "different" Leap - but for me, coming from an enterprise background, Leap gives security and stability for the server-side open source projects I run and experiment with.

Balancing

Matěj Cepl said that balancing the differing needs of various distribution users is an endemic problem for Linux—and one that remains unsolved:

I can assure you that too-slow/too-fast problem is something we are acutely aware of, it is probably one of the biggest problems all Linux distributors are struggling with, but nobody came with really good solution. With every user who is angry with us for moving too fast and upgrading too much, there is at least one user or more who is angry with us for moving too slow and not upgrading enough. Currently, the leading solution seem to lead to heavy use of containers, but then there are already users of containers who are becoming quite aware of problems with maintaining content of those containers, and the circle just continues.

In another message, Cepl said that simply adding support for Python 3.10 is difficult: "Maintaining two stacks of Python packages so distant from each other as Python 3.6 and 3.10 gets really complicated really quickly." That led Martin Wilck to wonder if the problem could be split up:

Can't we identify those packages that _must_ be maintained for 3.6 (because they're required for core distro functionality) and maintain only those for both versions, replacing all others with the 3.10 packages?

Of course, even that would break the expectations of SLE 15 customers, as Dan Čermák pointed out. The ironclad guarantee that SUSE provides its enterprise customers with regard to stability came up multiple times in the thread. It is a choice that the company has made, Richard Brown said, in order to attract customers who want that kind of stability. It is expensive for the company to do so, but is done with the expectation of bringing in more than it costs, he continued. Ever since the full alignment of SLE and Leap for 15.3, Leap is along for that ride.

While Ströder strongly disagreed ("Outdated is not 'stable'."), there were a few different examples given of SUSE customers looking for "outdated stability". Stefan Seyfried noted that some customers are still installing SLE 12 (the predecessor to SLE 15) on new hardware; even those who install SLE 15 often choose not to use the latest service pack release. "So yes, paying customers really like old stuff :-)". Beyond that, Cepl said that there is pressure to support even older versions of Python:

And yes, we have mounting pressure from our customers to have continuation of support for Python 2.7 on SLE-12 (or even SLE-11, it is still supported for some special, read paying more, customers). Those people cannot care less about EOS [end of support] of Python 3.6.

It seems clear that Leap 15.4 will continue using Python 3.6, for good or ill. As seen in the thread, there are users who value that stability even in Leap and SUSE's customers would be beyond unhappy with a switch in SLE 15. As long as Leap and SLE are tied together, that is the way things are going to be. In addition, there does not seem to be any real appetite for splitting the two and returning to an openSUSE that is independent of the enterprise offering.

Tumbleweed is the SUSE distribution for those looking for something faster moving, but it is still somewhat conservative in its pace. That may lead some users to look elsewhere, but distributions simply cannot cover each and every use case. SLE/Leap and Tumbleweed have staked out certain models that fit with their goals and those of SUSE, which funds much of the development of them.

In the unlikely event that some distribution figures out how to please all of the people all of the time, maybe we will see all of its competitors wither away. In the meantime, users need to understand the philosophy a distribution is pursuing before choosing it. Even then, priorities shift on both sides of the equation—user and distribution—over time, which leads to conflicts like this one.

VSTATUS, with or without SIGINFO

By Jonathan Corbet
January 6, 2022

The Unix signal interface is complex and hard to work with; some developers have argued that its design is "unfixable". So when Walt Drummond proposed increasing the number of signals that Linux systems could manage, eyebrows could be observed at increased altitude across the Internet. The proposed increase seems unlikely to happen, but the underlying goal (to support a decades-old feature from other operating systems) may yet become a reality.

The kernel is able to support up to 64 different signal types, which seems like a fair number, but all 64 are taken, on some architectures at least. That makes it impossible to add new signal types to Linux. Drummond sought to address that problem by raising the limit to 1024, which would surely be enough for all time. Raising the limit requires making some subtle changes to the user-space API (putting a larger signal mask into the information passed to realtime signal handlers, for example) that have the possibility of breaking applications, which means that extra scrutiny would be required. But that, it seems, is what would be needed to be able to add more signals.
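To see why the limit behaves like a hard ceiling, it can help to think of a signal set as a fixed-width bitmask with one bit per signal number. The Python sketch below is only a model of that idea under the assumption of a 64-bit mask; it is not the kernel's code, and the names MASK_BITS, add_signal(), and has_signal() are invented for the illustration.

```python
MASK_BITS = 64  # assumed mask width, matching the 64-signal limit above


def add_signal(mask, signo):
    """Return mask with the bit for signal number signo set.

    Signal numbers start at 1, so signal N lives in bit N - 1.
    """
    if not 1 <= signo <= MASK_BITS:
        raise ValueError(f"signal {signo} does not fit in a {MASK_BITS}-bit mask")
    return mask | (1 << (signo - 1))


def has_signal(mask, signo):
    """Report whether the bit for signal number signo is set in mask."""
    return bool((mask >> (signo - 1)) & 1)
```

In this model, signal 64 occupies the top bit and there is simply nowhere to record a signal 65. Accommodating one means widening the mask, and that width is visible to user space, which is why the change would need the extra scrutiny described above.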

Developers immediately wanted to know why there was a need to add signals, and urged Drummond to find an alternative if possible. As Eric Biederman put it:

Please let's not expand the number of signals supported if there is any alternative. Signals only really make sense for supporting existing interfaces. For new applications there is almost always something better.

So why is there a need to add new signals? The answer has roots in ancient history, well before the creation of Linux. The TOPS-10 operating system was developed by Digital Equipment Corporation in the 1970s; one feature it offered was to print a line of status information (what was running, CPU time consumed, etc.) when the user typed control-T. It was a quick way for users to verify that the system was still alive and making progress on whatever task they were running. This feature found its way into VMS later on, and it still exists today in a number of BSD variants. Linux, however, does not have this feature — in that form, anyway.

Within the kernel, there are two aspects to supporting this feature, which goes by the name VSTATUS. The first is to recognize the status request in the terminal driver; this is done using the same logic that recognizes other control characters, such as control-C to send a SIGINT signal to the running process. The kernel would then respond to the keystroke in two ways:

  • The kernel will print a status line directly to the terminal with generic information about the running process and the state of the system.
  • A SIGINFO signal is sent to the process running on the terminal at the time. Applications can catch this signal to print some status information of their own; a copy application could tell the user how far the copy has progressed, for example.

The Linux kernel doesn't implement VSTATUS, so it won't do either of the above things. Adding the ability to print a status line to the terminal driver is not that hard; neither is sending a signal in response to an event. The problem is that Linux does not implement SIGINFO, so that signal would need to be added and, as noted above, there is no room to add new signals. Thus Drummond's patch set.
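For readers who have not seen the application side of this, the following Python sketch shows roughly what catching a status signal looks like. It is a toy under stated assumptions: the CopyProgress class and simulated_copy() function are made up for the example, and on Linux, which lacks SIGINFO, it falls back to SIGUSR1 purely as a stand-in. That fallback is not something the patch set proposes.

```python
import signal
import time

# SIGINFO exists on the BSDs and macOS. Linux has no such signal, so this
# sketch substitutes SIGUSR1 there (an assumption of the example only).
# Unix-only: neither signal is available on Windows.
STATUS_SIGNAL = getattr(signal, "SIGINFO", signal.SIGUSR1)


class CopyProgress:
    """Tracks how far a simulated copy has progressed."""

    def __init__(self, total):
        self.total = total
        self.done = 0
        self.reports = []

    def on_status(self, signum, frame):
        # Called when the status signal arrives: record and print one line.
        line = f"{self.done}/{self.total} bytes copied"
        self.reports.append(line)
        print(line)


def simulated_copy(total, chunk=4096, delay=0.0):
    """Pretend to copy total bytes, answering status requests on the way."""
    progress = CopyProgress(total)
    previous = signal.signal(STATUS_SIGNAL, progress.on_status)
    try:
        while progress.done < total:
            time.sleep(delay)  # stand-in for real I/O
            progress.done = min(total, progress.done + chunk)
    finally:
        signal.signal(STATUS_SIGNAL, previous)  # restore the old handler
    return progress
```

Running simulated_copy() with a nonzero delay and then sending the process the status signal (control-T on a BSD terminal, or kill -USR1 from another shell on Linux) should print one progress line per request.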

This is not the first time that VSTATUS support has been requested; the issue also came up in 2014 and again in 2019 — and undoubtedly other times as well. In 2019, Arseny Maslennikov tried to get around the signal limitation by defining SIGINFO as a synonym for SIGPWR, which was meant to (depending on who is reminiscing at the time) indicate that either a power failure was impending or power had been restored. Either way, the signal is delivered only to the init process, and it tends to be little used on current systems. Repurposing it for VSTATUS requires changing the default action for SIGPWR to "ignore" but otherwise shouldn't prove disruptive to user space.

Or so the developers would hope. Real-world uses of SIGPWR are rare, but they do exist; as Ted Ts'o pointed out, systemd can be made to respond to it, for example. Changing the default handling of SIGPWR would probably not break any user-space applications, but it is hard to know that for sure. If something does break, it could show up years later when the change makes it into an enterprise kernel; at that point, the problem would be nearly impossible to fix to everybody's satisfaction. So it is unsurprising that kernel developers are reluctant to make a change like this.

Over the years, there have also been questions about whether the feature is really needed or not. As Greg Kroah-Hartman pointed out in 2019, the kernel's "magic SysRq" feature does everything VSTATUS does (and a lot more). But magic SysRq only works on the system console; as Biederman noted in the above-linked message, a "persuasive case" could be made for the utility of this feature for users interacting with systems over SSH, for example. So a use case for VSTATUS does appear to exist.

Given that, what is to be done? Ts'o's message above included a suggestion: implement VSTATUS as far as printing the status line from the kernel, but don't bother sending the SIGINFO signal to user space. That eliminates the need to add a new signal (or repurpose an old one), which is where all the problems arise. This implementation would deprive user space of the ability to add its own status information, but the number of programs that have ever implemented SIGINFO handling is quite small; Drummond said that only sleep, dd, and ping have such support, "so it's not like there's a vast hole in the tooling or something, nor is there a large legacy software base just waiting for SIGINFO to appear". He readily agreed to leave out the SIGINFO part.

The followup patch set without SIGINFO has not been posted as of this writing. Once it arrives, there should not be a great deal of opposition to its merging into the mainline; at that point, Linux will have most of the VSTATUS functionality that is offered by the BSDs. If it turns out later on that somebody really needs the SIGINFO part, that whole problem can be reconsidered. Meanwhile, though, the kernel community is happy to kick that can further down the road.

Fixing a corner case in asymmetric CPU packing

January 7, 2022

This article was contributed by Marta Rybczyńska

Linux supports processor architectures where CPUs in the same system might have different processing capacities; for example, the Arm big.LITTLE systems combine fast, power-hungry CPUs with slower, more efficient ones. Linux has also run for years on simultaneous multithreading (SMT) architectures, where one CPU executes multiple independent execution threads and is seen as if it were multiple cores. There are architectures that mix both approaches. A recent discussion on a patch set submitted by Ricardo Neri shows that, on these systems, the scheduler might distribute tasks in an inefficient way.

Simultaneous multithreading

SMT functionality has been present in architectures like PowerPC and x86 for years. On an SMT system, a CPU can run instructions from two (or more) separate execution contexts. Each logical thread is visible as a separate CPU, so one physical CPU running two threads will be seen in Linux as two CPUs. SMT processors using the same hardware in this way are often called "siblings". Operating systems have little control over an SMT processor's decisions on how to divide its resources between execution contexts.

SMT allows better use of a processor's resources because, when one execution path is stalled (waiting for memory, for example), the physical CPU can execute instructions from other threads. However, doubling the number of threads in a processor does not normally double its processing capacity. Both threads are sharing the same resources, and the SMT mode is most efficient when the system is under low load. Two SMT threads thus have a lower capacity than two physical CPU cores.

This reduced capacity needs to be reflected in the scheduler, but the exact value of the reduction depends on the load, so the kernel needs to use a heuristic. Linux models this by reducing the CPU priority (which regulates how likely the CPU is to be chosen to run a given task) for the second (and following) CPU threads running on the same hardware.

Users can view sibling CPUs in the topology information available on their systems in /sys/devices/system/cpu/cpuX/topology/core_cpus_list (where X is the number of a CPU on the system). For example, in a 12-core system with SMT:

    $ cat /sys/devices/system/cpu/cpu0/topology/core_cpus_list
    0,6

The result means that user-visible CPUs 0 and 6 are SMT siblings, so they share the same hardware core.
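A sysfs CPU list of this kind is a comma-separated sequence of CPU numbers and ranges (for example 0,6 or 0-3,8). As a minimal sketch, the Python helpers below expand such a string and look up the siblings of one CPU. The function names parse_cpu_list and smt_siblings are made up for this illustration, and the lookup assumes a kernel that provides core_cpus_list at the path shown above.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0,6" or "0-3,8" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string lists no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_siblings(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return the CPUs sharing a core with `cpu`, or None if the file is absent."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "core_cpus_list"
    if not path.exists():
        return None
    return parse_cpu_list(path.read_text())


if __name__ == "__main__":
    print(parse_cpu_list("0,6"))  # the string from the example above
```

Calling smt_siblings(0) on the 12-core system from the example would be expected to return the same two CPUs; on other machines the result depends on the topology.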

ASYM_PACKING

Asymmetric packing (SD_ASYM_PACKING in the scheduler) is a feature originally added for the PowerPC architecture in 2010. It handles a case when the scheduler can obtain better processor performance by moving tasks to certain CPUs and leaving others idle. The busy CPUs can then move to a lower SMT mode (running fewer threads) and obtain higher overall system performance. The SMT modes documentation for PowerPC includes some examples.

The ASYM_PACKING mode has witnessed a number of reworks and is currently supported on x86 and PowerPC. The support for x86 includes a way to support cores that might have a higher frequency than others, for example using the Turbo Boost Max Technology (ITMT) 3.0 feature. A short slide set from the 2016 Linux Plumbers Conference explains the work in a little more detail.

Mixing SMT and ASYM_PACKING

Neri observed some undesired scheduling behavior on a system with three distinct CPU priorities. This system contains high-performance cores (Intel Core) with their SMT siblings, along with lower-performance cores (Intel Atom). The efficient scheduling approach in this case (for some workloads at least) is to use the high-performance cores first (but without their SMT secondary threads), then the lower-performance cores, leaving the SMT secondary threads for last. The scheduler was, instead, putting tasks on the high-performance cores and their SMT siblings, leaving the other cores idle. As a result, tasks were contending for processor resources while independent CPUs remained idle.

To understand this problem, consider the example from one of Neri's patches. Imagine a system with two physical CPUs with different priorities: 60 and 30 respectively. Both of them have SMT siblings. The kernel assigns SMT priorities using an equation in the x86-specific code:

    smt_prio = prio * smp_num_siblings / i;

where smt_prio is the effective priority, prio is the original priority of the CPU, smp_num_siblings is the number of siblings for each CPU (the value is two in Neri's case), and i is the sibling number assigned to the given physical CPU, starting from one. According to the formula, the resulting priorities are 120 for the main thread of the first CPU and 60 for its SMT sibling. For the second CPU, the main thread gets a priority of 60, and the sibling a priority of 30. In this case, the SMT sibling and the lower-performance main thread will have the same priorities.
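The arithmetic in this example can be checked with a small Python model of the formula. This is only a sketch, not the kernel code: the helper names smt_prio and thread_priorities are invented here, and // stands in for C integer division.

```python
def smt_prio(prio, num_siblings, i):
    """Model of the formula quoted above; // mirrors C integer division."""
    return prio * num_siblings // i


def thread_priorities(prio, num_siblings=2):
    """Priorities for sibling numbers 1..num_siblings of one physical CPU."""
    return [smt_prio(prio, num_siblings, i) for i in range(1, num_siblings + 1)]


fast = thread_priorities(60)  # physical CPU with priority 60
slow = thread_priorities(30)  # physical CPU with priority 30
print(fast, slow)

# The tie behind the problem: the fast CPU's SMT sibling and the
# slow CPU's main thread end up with the same effective priority.
print(fast[1] == slow[0])
```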

Neri wanted to change the scheduler to assign tasks to the main thread of the second (physical) CPU before using the SMT sibling CPUs. To that end, he proposed a modification to the formula so that it would divide by the square of the sibling number:

    smt_prio = prio * smp_num_siblings / (i*i);

In this case, the priorities will be 120 and 30 for the threads of the first CPU; then 60 and 15 for the threads of the second CPU. Tasks will thus be scheduled first on both main threads.
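To see the resulting order, the same two-CPU example can be ranked with a Python model of the modified formula. This is a sketch under the assumptions of the example (two siblings per CPU, base priorities 60 and 30); the labels and the function name smt_prio_squared are illustrative only.

```python
def smt_prio_squared(prio, num_siblings, i):
    """Model of the proposed formula: divide by the square of the sibling number."""
    return prio * num_siblings // (i * i)


# (label, base priority, sibling number) for the two-CPU example
threads = [
    ("cpu-A main", 60, 1),
    ("cpu-A sibling", 60, 2),
    ("cpu-B main", 30, 1),
    ("cpu-B sibling", 30, 2),
]

# Highest effective priority first, which is the order in which
# the CPUs would be preferred.
ranked = sorted(
    ((label, smt_prio_squared(prio, 2, i)) for label, prio, i in threads),
    key=lambda item: item[1],
    reverse=True,
)
for label, value in ranked:
    print(f"{label}: {value}")
```

Both main threads now rank ahead of both SMT siblings, which is the ordering Neri was after.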

Neri's patch set makes another change when it comes to the scheduler's load-balancing decisions, and specifically when the scheduler decides to move a task from one CPU to another to even out the load on the system. When considering whether to move a task from a source CPU to a new target CPU, the scheduler considers whether the target has SMT siblings; if not, it can receive tasks from an SMT source CPU that has at least two busy threads. If only one sibling in the source CPU is busy, tasks will be moved only if the target CPU has a higher priority than the source.
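The rule for a target CPU without SMT siblings can be written down as a small decision function. The Python below is a reading of the paragraph above, not the kernel's implementation; the function name and parameters are hypothetical, the case of a target that does have SMT siblings is deliberately left out, and the treatment of a fully idle source is an assumption.

```python
def non_smt_target_can_pull(src_busy_threads, src_prio, target_prio):
    """Decide whether a target CPU *without* SMT siblings may pull a task
    from an SMT source core, following the rule described in the text.

    Targets that have SMT siblings are not modeled here.
    """
    if src_busy_threads >= 2:
        # Two or more busy siblings are contending for one core: spread out.
        return True
    if src_busy_threads == 1:
        # A lone busy thread moves only toward a higher-priority CPU.
        return target_prio > src_prio
    # Assumption: an idle source core has nothing to move.
    return False


# A slower core without SMT may relieve a fast core running two threads...
print(non_smt_target_can_pull(src_busy_threads=2, src_prio=60, target_prio=30))
# ...but does not take the only task from that faster core.
print(non_smt_target_can_pull(src_busy_threads=1, src_prio=60, target_prio=30))
```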

Summary

Scheduling performance improves by a few percent in some cases, according to the benchmark results presented in the cover letter, though the gain is smaller in most cases. Benchmarks also show some cases with performance degradation, but Neri gave no explanation for this result.

The change has been merged for the 5.16 kernel, so owners of such systems should see a change in scheduling and, hopefully, better performance. This fix covers one scheduling corner case; there is no reason to think it was the only one. We should expect to see more adjustments in scheduling on asymmetric CPUs in the future.


Some 5.16 kernel development statistics

By Jonathan Corbet
January 10, 2022

The 5.16 kernel was released on January 9, as expected. This development cycle incorporated 14,190 changesets from 1,988 developers; it was thus quite a bit busier than its predecessor, and fairly typical for recent kernel releases in general. A new release means that the time has come to have a look at where those changes came from.

The 1,988 developers contributing to 5.16 was the second-highest number ever, with only 5.13 (with 2,062 developers) being higher. This time around, 296 developers contributed their first change to the kernel, which is at the high end of the typical range. The most active developers in this cycle were:

Most active 5.16 developers

By changesets
Michael Straube              286   2.0%
Cai Huoqing                  232   1.6%
Jakub Kicinski               200   1.4%
Christoph Hellwig            158   1.1%
Bart Van Assche              157   1.1%
Krzysztof Kozlowski          140   1.0%
Mauro Carvalho Chehab        130   0.9%
Pavel Begunkov               122   0.9%
Thomas Gleixner              117   0.8%
Alex Deucher                 112   0.8%
Matthew Wilcox               108   0.8%
Geert Uytterhoeven           103   0.7%
Jani Nikula                   94   0.7%
Ian Rogers                    91   0.6%
Arnd Bergmann                 88   0.6%
Ville Syrjälä                 86   0.6%
Mark Brown                    85   0.6%
Martin Kaiser                 85   0.6%
Colin Ian King                82   0.6%
Jens Axboe                    80   0.6%

By changed lines
Ping-Ke Shih               91116  11.4%
Zhan Liu                   34501   4.3%
Nick Terrell               28611   3.6%
Sameer Pujar               15121   1.9%
Johan Almbladh             13901   1.7%
Thomas Bogendoerfer        11591   1.4%
Michael Straube             9014   1.1%
Dmitry Baryshkov            7836   1.0%
Srinivas Kandagatla         7663   1.0%
Larry Finger                7586   0.9%
Prabhakar Kushwaha          6261   0.8%
Jakub Kicinski              5796   0.7%
Fangzhi Zuo                 5765   0.7%
Alex Deucher                5627   0.7%
Peter Zijlstra              5448   0.7%
Jani Nikula                 5287   0.7%
Simon Trimmer               5249   0.7%
Shawn Guo                   5152   0.6%
Tony Lindgren               5020   0.6%
Derek Fang                  4973   0.6%

The most prolific contributor of changesets for 5.16 was Michael Straube, who worked almost exclusively on the r8188eu wireless network adapter driver in the staging tree; that driver has now received 755 changes since being merged for the 5.15 release. Cai Huoqing contributed clean-up patches in many areas of the kernel, Jakub Kicinski made improvements throughout the networking subsystem, Christoph Hellwig continues his refactoring work in the block and filesystem layers, and Bart Van Assche reworked much of the SCSI subsystem code.

In the lines-changed column, Ping-Ke Shih came out on top with the addition of the Realtek rtw89 driver; unlike many past Realtek drivers, this one skipped the staging tree and landed directly under drivers/net. Zhan Liu contributed exactly two patches adding yet another set of amdgpu header files. Nick Terrell updated the kernel's zstd compression module, Sameer Pujar added a set of NVIDIA Tegra sound drivers, and Johan Almbladh added eBPF JIT compilers for the 32- and 64-bit MIPS architectures. It's worth noting that there were relatively few large code removals in 5.16 (the biggest was the removal of Netlogic MIPS support by Thomas Bogendoerfer), so the kernel as a whole grew by 422,000 lines.

The kernel project depends on its testers and reviewers as much as it depends on its developers. For the 5.16 cycle, the contributors with the most test and review credits were:

Test and review credits in 5.16

Tested-by
Daniel Wheeler               153  14.8%
Sandeep Penigalapati          34   3.3%
Tony Brelinski                25   2.4%
Deren Wu                      24   2.3%
Gurucharan G                  22   2.1%
Sohaib Mohamed                22   2.1%
Konrad Jankowski              20   1.9%
Alexei Starovoitov            16   1.5%
Mark Wunderlich               14   1.4%
John Garry                    13   1.3%
Christian Zigotzky            13   1.3%
Fuad Tabba                    12   1.2%
Shawn Guo                     12   1.2%
Geert Uytterhoeven            10   1.0%
Ferry Toth                    10   1.0%

Reviewed-by
Christoph Hellwig            202   3.2%
Rob Herring                  194   3.0%
Hans de Goede                119   1.9%
Pierre-Louis Bossart         104   1.6%
Stephen Boyd                 100   1.6%
David Howells                 83   1.3%
David Sterba                  80   1.2%
Jani Nikula                   77   1.2%
Christian König               74   1.2%
Andrew Lunn                   68   1.1%
Jan Kara                      60   0.9%
Kai Vehmanen                  60   0.9%
Kees Cook                     58   0.9%
Florian Fainelli              57   0.9%
Linus Walleij                 55   0.9%

Once again, Daniel Wheeler heads the list of test credits, having received 15% of all such credits during the 5.16 development cycle. That is over two patches tested per day, every day, weekends and holidays included. Wheeler appears to be doing this work as part of his employer's internal review process, as do many of the other top testers. The top reviewers, instead, tend to be active developers who also manage to get a lot of reviews done. The top two reviewers for 5.16 are the same as for 5.15; Christoph Hellwig managed to review nearly three patches and write more than two of his own for every day of the 70-day 5.16 development cycle.
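The per-day rates quoted here follow from dividing the counts in the tables by the 70-day cycle length. A short Python check, using only numbers that appear in this article (the dictionary labels are just for display):

```python
CYCLE_DAYS = 70  # length of the 5.16 development cycle, per the text

credits = {
    "Daniel Wheeler, Tested-by": 153,
    "Christoph Hellwig, Reviewed-by": 202,
    "Christoph Hellwig, changesets": 158,
}

# Average number of credits (or changesets) per calendar day
per_day = {label: count / CYCLE_DAYS for label, count in credits.items()}
for label, rate in per_day.items():
    print(f"{label}: {rate:.2f} per day")
```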

A different sort of review is associated with the task of selecting patches to apply and push into the mainline kernel. That decision may involve a thorough review in its own right, or it may rely on the review efforts of others. When maintainers accept patches, they will apply a Signed-off-by tag to those patches. By looking at signoffs by people other than the author of a patch, it is possible to get a picture of who the most active maintainers are. For 5.16 they were:

Top signoffs in 5.16

David S. Miller             1082   7.8%
Greg Kroah-Hartman          1062   7.6%
Mark Brown                   558   4.0%
Alex Deucher                 472   3.4%
Jens Axboe                   442   3.2%
Andrew Morton                400   2.9%
Martin K. Petersen           353   2.5%
Jakub Kicinski               325   2.3%
Mauro Carvalho Chehab        325   2.3%
Bjorn Andersson              305   2.2%
Paolo Bonzini                230   1.7%
Jonathan Cameron             224   1.6%
Kalle Valo                   210   1.5%
Arnaldo Carvalho de Melo     203   1.5%
Hans Verkuil                 183   1.3%
Felix Fietkau                163   1.2%
David Sterba                 162   1.2%
Alexei Starovoitov           154   1.1%
Borislav Petkov              152   1.1%
Saeed Mahameed               148   1.1%

This list of maintainers tends not to change much from one release to another; it is made up of some of the kernel project's most senior developers who have been on the job for many years.
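The idea of counting signoffs by people other than the patch author can be sketched in a few lines of Python. This is an illustration of the counting rule only, not the tooling used to produce the table above: the input shape (author name plus commit message), the function name, and the sample commits are all invented, and matching people by name alone is a simplification.

```python
from collections import Counter


def non_author_signoffs(commits):
    """Count Signed-off-by tags that name someone other than the patch author.

    `commits` is an iterable of (author, message) pairs; this input shape is
    an assumption made for the example.
    """
    counts = Counter()
    for author, message in commits:
        for line in message.splitlines():
            tag, sep, value = line.strip().partition(":")
            if sep and tag.lower() == "signed-off-by":
                signer = value.split("<")[0].strip()
                if signer != author:
                    counts[signer] += 1
    return counts


# Invented sample data, not taken from the kernel history.
sample = [
    ("Alice Dev", "net: fix thing\n\n"
                  "Signed-off-by: Alice Dev <alice@example.org>\n"
                  "Signed-off-by: Mary Maintainer <mary@example.org>"),
    ("Bob Dev", "block: tidy\n\n"
                "Signed-off-by: Bob Dev <bob@example.org>\n"
                "Signed-off-by: Mary Maintainer <mary@example.org>"),
    ("Mary Maintainer", "docs: typo\n\n"
                        "Signed-off-by: Mary Maintainer <mary@example.org>"),
]
print(non_author_signoffs(sample).most_common())
```

In the sample, the maintainer's signoff on her own patch is not counted, so only the two patches she applied for others contribute to her total.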

Work on 5.16 was supported by 251 employers that we were able to identify. The most active of those were:

Most active 5.16 employers

By changesets
Intel                       1454  10.2%
(Unknown)                   1196   8.4%
Google                       932   6.6%
(None)                       781   5.5%
Red Hat                      765   5.4%
AMD                          682   4.8%
Facebook                     641   4.5%
Linaro                       592   4.2%
NVIDIA                       463   3.3%
Huawei Technologies          422   3.0%
SUSE                         311   2.2%
Oracle                       294   2.1%
IBM                          274   1.9%
(Consultant)                 266   1.9%
Canonical                    249   1.8%
Arm                          244   1.7%
Baidu                        234   1.6%
Renesas Electronics          221   1.6%
MediaTek                     199   1.4%
Code Aurora Forum            192   1.4%

By lines changed
Realtek                    97237  12.2%
Intel                      72565   9.1%
AMD                        67076   8.4%
Facebook                   50894   6.4%
(Unknown)                  43152   5.4%
(None)                     40389   5.0%
Linaro                     39428   4.9%
NVIDIA                     38898   4.9%
Google                     35871   4.5%
Red Hat                    23312   2.9%
Marvell                    19136   2.4%
MediaTek                   15399   1.9%
Code Aurora Forum          14564   1.8%
Anyfi Networks             13901   1.7%
Renesas Electronics        12888   1.6%
SUSE                       10940   1.4%
IBM                        10808   1.4%
Huawei Technologies        10378   1.3%
Cirrus Logic               10046   1.3%
Oracle                      8728   1.1%

This table, too, tends not to change much from one release to the next. For the curious, the "unknown" category consists of nearly 400 developers, most of whom contributed one or two patches. Any one of these developers is a small contributor to this release, but together they add up to a significant portion of the total patch flow. Many of those developers will move on, having done what they came to the kernel project to do; others are just getting started and will become significant contributors over time.

In summary, 5.16 was just another typical kernel development cycle. Lots of patches from nearly 2,000 developers, all integrated into another solid (though not perfect) kernel release. The kernel project does not lack its share of problems with quality control, testing, support for maintainers, and more, but it nonetheless manages to get the work done on a predictable schedule. Work now begins on 5.17, which will be released in mid-March.


Page editor: Jonathan Corbet


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds