Leading items
Welcome to the LWN.net Weekly Edition for January 13, 2022
This edition contains the following feature content:
- Relocating Fedora's RPM database: a proposal to move a package-management database sparks a discussion on filesystem organization.
- An outdated Python for openSUSE Leap: Python 3.6 is end-of-life, but openSUSE Leap 15.4 will ship it anyway.
- VSTATUS, with or without SIGINFO: supporting a venerable console feature on Linux, finally.
- Fixing a corner case in asymmetric CPU packing: the search for optimal task placement on heterogeneous processors.
- Some 5.16 kernel development statistics: where the code in 5.16 came from.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Relocating Fedora's RPM database
The deadlines for various kinds of Fedora 36 change proposals have mostly passed at this point, which led to something of a flurry of postings to the distribution's devel mailing list over the last month. One of those, for a seemingly fairly innocuous relocation of the RPM database from /var to /usr, came in right at the buzzer for system-wide changes on December 29. There were, of course, other things going on around that time (holidays, vacations, and so forth), so the discussion was relatively muted until recently. Proponents have a number of reasons why they would like to see the move, but there is resistance, as well, that is due, at least in part, to the longstanding "tradition" of the location for the database.
To /usr
As is normal for Fedora change proposals, program manager Ben Cotton posted the proposal to the mailing list on behalf of its owners: Chris Murphy, Michel Alexandre Salim, and Neal Gompa. It can also be found on the Fedora wiki. In a nutshell, it would move the database maintained by the RPM package manager from its current location, /var/lib/rpm, to /usr/lib/sysimage/rpm; the former would be replaced with a symbolic link to the latter. The DNF package-management tool would not be switching locations for its database, at least not yet; the RPM developers have mixed feelings about the change, but are not standing in its way.
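The move-and-symlink arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it operates inside a scratch directory rather than on a real filesystem root, the helper name relocate_rpmdb and the placeholder file are hypothetical, and it does not claim to reflect how the Fedora change or RPM itself carries out the migration.

```python
import os
import shutil
import tempfile

OLD_REL = "var/lib/rpm"           # current location named in the proposal
NEW_REL = "usr/lib/sysimage/rpm"  # proposed location


def relocate_rpmdb(root):
    """Move root/var/lib/rpm to root/usr/lib/sysimage/rpm and leave a symlink.

    Hypothetical helper for illustration; it only touches paths under `root`.
    """
    old = os.path.join(root, OLD_REL)
    new = os.path.join(root, NEW_REL)
    if os.path.islink(old):
        return new  # already relocated, nothing to do
    os.makedirs(os.path.dirname(new), exist_ok=True)
    shutil.move(old, new)
    # Use a relative link target so the tree still resolves if `root` moves.
    target = os.path.relpath(new, os.path.dirname(old))
    os.symlink(target, old)
    return new


def demo():
    """Build a scratch tree, relocate the database, and report what happened."""
    with tempfile.TemporaryDirectory() as root:
        old = os.path.join(root, OLD_REL)
        os.makedirs(old)
        # Placeholder file standing in for the real database contents.
        with open(os.path.join(old, "placeholder.db"), "w") as f:
            f.write("placeholder")
        relocate_rpmdb(root)
        return (
            os.path.islink(old),
            os.readlink(old),
            os.path.exists(os.path.join(old, "placeholder.db")),
        )
```

After the move, tools that still open the old path follow the link, which is the compatibility property the proposal relies on.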
The RPM database tracks which packages have been installed and their metadata. In addition, it tracks all of the files that get installed to the system, their locations, and which package is responsible for them. RPM uses it to clean up packages when they are removed; users can query the database, as well, to answer various questions about their packages and files.
There are several benefits for Fedora listed in the proposal, but the main driving force seems to be support for rolling back failed or undesired updates of the operating system. The RPM database describes what is in /usr for the most part, so having it stored in the same directory hierarchy means that /usr can be rolled back as a unit, without needing to change /var. The switch is also in keeping with what the Fedora variants based on rpm-ostree (CoreOS, IoT, Silverblue, and Kinoite) already do; it is also aligned with another RPM-based distribution ecosystem, the SUSE distributions (SUSE Linux Enterprise, openSUSE, and Tumbleweed).
Much of the early discussion revolved around the Filesystem Hierarchy Standard (FHS), which Fedora packaging explicitly follows, with a few exceptions. Tom Hughes pointed out that the FHS descriptions for /usr ("shareable, read-only data") and for /var ("variable data files") seem somewhat at odds with the proposal. In particular, he said, the /var description "appears to exactly describe the RPM database". Thus the changes are not FHS compliant, nor do they follow the packaging guidelines, he said.
Stephen John Smoogen agreed and suggested that the change needed to be made in the FHS first, then it could be done for the distribution. Fedora project leader Matthew Miller did not think that was likely to happen, since the most recent release (FHS 3.0) was in 2015 "and the whole thing has been effectively dead since". On the other hand, Miller was in favor of reviving the FHS effort, especially with buy-in from other distributions.
But, as Murphy (and others) said, no matter what the FHS says, /usr cannot be more than "mostly read-only" or the system can never be updated:
In practice it is read-only data, except when software is being installed or updated. The FHS is a PITA sometimes, but it's not advocating for systems that can't be updated or changed.
RPM developer Panu Matilainen agreed that the database has generally only been changed during update operations, "but it doesn't mean it will stay that way forever more". There are unimplemented features that might change that situation; beyond that, there are already situations where changes are made completely outside of /usr but that require an update to the RPM database:
There seems to be this strange underlying assumption that all packaged content lives in /usr when that's not at all the case. To install a kernel, or a config-only package (under etc), or 3rd party software putting stuff under /opt, or... you need a writable rpmdb. Ditto for 'rpm --import'.
Vitaly Zaitsev also pointed to the FHS and its "read-only" language for /usr, but as Gordon Messmer said: "If /usr really is read-only, then it probably doesn't matter where the rpmdb is, since packages can't be installed (generally)." Gompa concurred with that, and expanded on the advantages of moving the database out of /var:
The bigger problem is that having the RPM database in /var makes it much harder to correctly implement a boot-to-snapshot scheme for Fedora Linux that reasonably maintains system state properly once /var is carved out. The reason that /var *isn't* carved out by default with our Btrfs configuration is because of the RPM database. Once the RPM database is moved, it becomes possible to split /var into its own subvolume and make it trivial to configure a full boot-to-snapshot system leveraging the technologies we have available to us.
Miller suggested that the "Benefit to Fedora" section of the change proposal add more information about that "pretty compelling benefit". Murphy made that change, but did caution: "There's more hurdles to jump, just so no one thinks snapshots and rollbacks are showing up in Fedora 36."
/state
Things had mostly quieted down in the discussion after the new year, at least until David Cantrell threw a bit of a curve ball on January 9. As with others in the thread, he was uncomfortable with moving the database to /usr, "but we should move it to gain the improvements as noted in the feature proposal". In his lengthy message, he suggested some other, novel ways to look at /usr and /var:
It is generally understood that /usr contains [most of] the installed system. What I think is a bigger requirement or [expectation] now is that one can tar up /usr and transport it to another system or virtual machine or container and expect that it will _probably_ work maybe with a bit of tinkering. This is a really valuable thing to have for developers. Moving the RPM database to this tree adds a component that is unnecessary and sort of out of place. [...] Reading comments and talking to people, the long standing understanding of /var is still "that's stuff you can rm -rf and the system will still work fine". Technically you could remove the RPM database and the system still function, but arguably would still be broken since you really want the RPM database. This use case of removing the RPM database and still having a functioning system is really only useful for data recovery scenarios. You're ultimately going to reinstall. Probably.
With that in mind, he suggested moving "the RPM database and other variable and stateful data" to a new top-level directory called /state. The FHS does not prevent the addition of new top-level directories, Cantrell said. Michael Catanzaro thought that adding a top-level directory was "a pretty big hammer" and that perhaps it was easier to just support two separate locations for the RPM database.
But rpm-ostree developer Colin Walters took exception to Cantrell's "unnecessary and sort of out of place" characterization. "Multiple independent groups who *actually work* on image based updates and/or client side snapshots all generally agree that the rpmdb should be in /usr." Murphy said that /state does not really solve the problem, it is "just rearranging the chairs". The RPM database holds information about multiple locations, so it needs to stay in sync with the rest of the system, no matter where it lives:
If /usr is to be truly portable and have e.g. 'rpm query, verify, remove, reinstall' work as expected, you need the metadata (the database) representing its state to always come along for the ride. Either the database is already in /usr, or you have to make sure /usr and /state are inseparable. If /usr and /state are inseparable, and if rpm can also describe anything in /etc or /var or /opt, then all or part of those directories are also inseparable from /state. And thus /usr. So I think /state doesn't help.
Cantrell is not alone in feeling that /usr is a bad location, however; Matilainen said that putting the RPM database there "just *feels so wrong*". There is other data like the RPM database, he said, so having a separate location for it all makes a lot of sense:
For many practical purposes it's probably just rearranging the chairs, but a separate top-level directory describing the *system* state seems instinctively *much* more correct solution to it than stuffing it somewhere deep inside a loosely related fs. Just FWIW, I would quit my whining about this right there if it went to a new toplevel directory instead because it just *feels* right unlike /usr.
Further pieces
Walters said that problems, such as PGP keys being added via rpm --import, need to be addressed as part of the adoption process:
The TL;DR for me is: I think everyone agrees that moving the rpmdb as it is today to /usr is not 100% a perfect fit. But it's a ~90% fit - almost all the raw data is just headers which are clearly immutable data generated elsewhere. And by making this change, we're basically saying we'll fix the remaining 10% of cases. [...] But, I hope we can get agreement about something like having `rpm --import` write to `/etc/pki/rpm-gpg` and dropping gpg keys from the rpmdb.
There is, it seems, an effort toward regularizing package installation in various ways with an end goal that is clear to some, but perhaps not to others. Zbigniew Jędrzejewski-Szmek described where he sees things heading:
Traditionally, packages installed all kinds of files all over the place. But we're slowly and painfully moving towards the model where:
- packages are only allowed to install under /usr, /var, and /etc. (Or under /opt, but I'd want to move that to /usr/opt…)
- packages must support /var/cache being wiped at any time, and most packages support anything under /var being wiped at any time.
- systemd and other projects are trying to only use /etc for local admin state, and support "factory reset" by wiping /etc and /var.
Based on that, he said that it makes sense to put the database under /usr somewhere, and the exact location is only a matter of convention, "so obviously we want to follow what opensuse and others are already using". But Matilainen was concerned that there was something of a "hidden agenda" behind the change proposal. He thought the goals should be more clearly stated:
I'm not saying these are necessary bad goals at all, it's just that there's a huge disconnect between reality and the above model on which this change seems based on, and not a single mention about these goals and changes needed to get there. I mean, I totally get that you can't change everything at once, but if there's a plan this big behind something then maybe it should be brought up front, no?
Matilainen said that "nearly all packages put something in /etc", for example, which is something that is being ignored, but Walters said that the handling of /etc is something that is being improved in the image-based update world:
rpm-ostree uses ostree, which introduces /usr/etc which are the pristine default config files. /etc is 3-way merged by ostree. One of the major benefits of this that I really love is `ostree admin config-diff` - at any point we can show you machine-local changes from the default, and it's trivial to reset back to defaults without redownloading a whole RPM. [...] There's no hidden agenda - the goal is to support image based updates as well as client side snapshots, factory reset, etc. And we're shipping today versions of Fedora that do a lot of this, and we want to continue to improve it.
The discussion is still ongoing as of this writing, which likely means that the Fedora Engineering Steering Committee (FESCo) will not decide on the proposal right away. Several members are favorably disposed to it, as can be seen in the FESCo issue entry. Matilainen has said that he is "not going to stand in front of the Change truck", even though he has reservations about the proposal. While it may "feel wrong" to do so, at least for some, it seems that the writing may already be on the wall for this particular change. Whether that larger agenda (hidden or not) is adopted will play out over the next few releases.
An outdated Python for openSUSE Leap
Enterprise distributions are famous for maintaining the same versions of software throughout their support windows, which normally run five years or more. But many of the projects those distributions are based on have far shorter support periods; part of what the enterprise distributions sell is patching over those mismatches. But openSUSE Leap is not exactly an enterprise distribution, so some users are chafing under the restrictions that come from Leap being based on SUSE Linux Enterprise (SLE). In particular, shipping Python 3.6, which reached its end of life at the end of 2021, is seen as problematic for the upcoming Leap 15.4 release.
Background
The openSUSE distribution has undergone a few different shifts over the years. In 2005, it started out as a more open offshoot of the SUSE Linux distribution and in 2009 started accepting community contributions. Eventually, the Tumbleweed rolling distribution came along, which LWN looked at in 2016. It was paired with openSUSE Leap 42, which came after openSUSE 13.x and, somewhat confusingly, before Leap 15.x.
OpenSUSE and SLE have generally been aligned over the years. In 2020, Leap and SLE grew even closer together. The build system and repositories between the two were shared starting with Leap 15.2, which corresponded to the second "service pack" (SP) of SLE (i.e. SLE 15-SP2). In 2021, with Leap 15.3 and SLE 15-SP3, the two distributions effectively merged, such that all of the base packages were shared between the two. To a first approximation, Leap is an openSUSE-branded version of SLE, much like what CentOS used to be for Red Hat Enterprise Linux.
Python 3.6
On January 3, Michael Ströder posted about the Python version to the opensuse-factory mailing list. He noted that some Python modules are already dropping support for Python 3.6, but that Leap 15.4 will ship with that Python version as its default. "IMO this does not make sense at all."
As Ben Greiner pointed out, though, the situation is not quite as bad as portrayed, since Leap 15.3 already has Python 3.9 available and there is talk of adding support for 3.10 (the most recent Python release). But, Greiner said, the "system Python" is the problem area:
The real discussion is about what Python version the system packages have to use. And because, as I understand it, all 15.X releases need to maintain the binary compatibility of earlier 15 releases, they cannot move away from Python 3.6.
But Ströder said that the other versions of Python "are mainly only usable for developing software in virtual envs". Meanwhile, having the default Python (i.e. /usr/bin/python3) be a version that is no longer supported by the upstream community causes real problems: "Now try to explain to an auditor that this security relevant system software runs on a unmaintained Python version. Yeah, have fun."
But part of the promise that SUSE offers its customers is in maintaining software like Python, Simon Lees said, so those end-of-life packages are not really unsupported:
This is simple, the python 3.6 is still maintained just by a different group of people, if significant python 3.6 security issues are found they will be fixed by someone within SUSE, this is a guarantee we give to our customers
That policy is fixed for SLE 15.x, thus Leap 15.x, he said; "Maybe in future SLE versions the model will be different but I am as yet unsure." Lees suggested that one solution might be to make SLE and Leap 16 available sooner, though he realized that may not be workable. Neal Gompa agreed with that idea:
I'd really like to see SLE/Leap 16 sooner rather than later too. The tech debt in SLE 15 is piling up and a lot of stuff that I've done in Tumbleweed for that future SLE version can't be backported because they're foundational things that would make SLE 15 into a different SLE anyway.
Tumbleweed will eventually be used as the basis for the next SLE and Leap. It currently uses Python 3.8, and is in the process of switching to 3.10.
Alternatives
There is more than just Python itself that needs to be maintained, however, as Axel Braun pointed out. There are lots of other Python modules that are shipped with the distribution, many of which will start dropping support for Python 3.6—though they continue fixing bugs, security bugs in particular. Those fixes will need to be backported. Lees said that those packages may or may not get updates:
For SLE packages SUSE Employees will keep them working, for some openSUSE packages someone in the community will care and do the work and for some others probably no one will care enough to do the work and they will just stay at [their] existing versions.
That's obviously problematic from a security standpoint, as Braun noted; he wondered if it made sense to turn Leap into a faster-moving distribution. Lees replied that it would be something of a reversion to the previous openSUSE model:
With enough man effort anything is possible but currently I doubt we have the man effort to maintain a third Gnome, KDE or python stack, this lack of man power was one of the main reasons from moving from the older model to the current Leap model.
Along similar lines, Frans de Boer wondered if it made sense to eliminate Leap entirely, since it is seen as "the consumer version of the Suse SLE" but offers packages that are outdated or even beyond their end of life. It damages the openSUSE brand to have those kinds of packages, De Boer said, so a change may be in order:
That said, it might be an idea that OpenSuse is rethinking it's strategy. For example, forgo the whole Leap series, which will free resources to concentrate on Tumbleweed. Also, as a suggestion, once a year repackage Tumbleweed of a few months old into a rebranded "stable" version and provide only security updates.
Lees does not think that would actually work well, though: "If we could find the resources to do this I agree it would be a great distro and probably more useful for many people then Leap and Tumbleweed." The problem is that maintaining Leap does not actually take all that much time, since most of the updates come "for free" from SLE, but keeping up with security updates for the "stable Tumbleweed" (separate from SLE/Leap) would be a lot of work. But there are others, like Georg Pfuetzenreuter, who want Leap to continue its current course:
I, personally, like Leap as it is - the other big enterprise GNU/Linux vendor (yes, the one with the red hat) ditched the free fork of their enterprise distribution - SUSE and the openSUSE project keep Leap, a free SLE fork, (whether that's the right definition or not) alive - and I value that a lot. [...] I understand that other users do not like being presented with outdated packages and legacy code - I understand people want a "different" Leap - but for me, coming from an enterprise background, Leap gives security and stability for the server-side open source projects I run and experiment with.
Balancing
Matěj Cepl said that balancing the differing needs of various distribution users is an endemic problem for Linux—and one that remains unsolved:
I can assure you that too-slow/too-fast problem is something we are acutely aware of, it is probably one of the biggest problems all Linux distributors are struggling with, but nobody came with really good solution. With every user who is angry with us for moving too fast and upgrading too much, there is at least one user or more who is angry with us for moving too slow and not upgrading enough. Currently, the leading solution seem to lead to heavy use of containers, but then there are already users of containers who are becoming quite aware of problems with maintaining content of those containers, and the circle just continues.
In another message, Cepl said that simply adding support for Python 3.10 is difficult: "Maintaining two stacks of Python packages so distant from each other as Python 3.6 and 3.10 gets really complicated really quickly." That led Martin Wilck to wonder if the problem could be split up:
Can't we identify those packages that _must_ be maintained for 3.6 (because they're required for core distro functionality) and maintain only those for both versions, replacing all others with the 3.10 packages?
Of course, even that would break the expectations of SLE 15 customers, as Dan Čermák pointed out. The ironclad guarantee that SUSE provides its enterprise customers with regard to stability came up multiple times in the thread. It is a choice that the company has made, Richard Brown said, in order to attract customers who want that kind of stability. It is expensive for the company to do so, but is done with the expectation of bringing in more than it costs, he continued. Ever since the full alignment of SLE and Leap for 15.3, Leap is along for that ride.
While Ströder strongly disagreed ("Outdated is not 'stable'."), there were a few different examples given of SUSE customers looking for "outdated stability". Stefan Seyfried noted that some customers are still installing SLE 12 (the predecessor to SLE 15) on new hardware; even those who install SLE 15 often choose not to use the latest service pack release. "So yes, paying customers really like old stuff :-)". Beyond that, Cepl said that there is pressure to support even older versions of Python:
And yes, we have mounting pressure from our customers to have continuation of support for Python 2.7 on SLE-12 (or even SLE-11, it is still supported for some special, read paying more, customers). Those people cannot care less about EOS [end of support] of Python 3.6.
It seems clear that Leap 15.4 will continue using Python 3.6, for good or ill. As seen in the thread, there are users who value that stability even in Leap and SUSE's customers would be beyond unhappy with a switch in SLE 15. As long as Leap and SLE are tied together, that is the way things are going to be. In addition, there does not seem to be any real appetite for splitting the two and returning to an openSUSE that is independent of the enterprise offering.
Tumbleweed is the SUSE distribution for those looking for something faster moving, but it is still somewhat conservative in its pace. That may lead some users to look elsewhere, but distributions simply cannot cover each and every use case. SLE/Leap and Tumbleweed have staked out certain models that fit with their goals and those of SUSE, which funds much of the development of them.
In the unlikely event that some distribution figures out how to please all of the people all of the time, maybe we will see all of its competitors wither away. In the meantime, users need to understand the philosophy a distribution is pursuing before choosing it. Even then, priorities shift on both sides of the equation—user and distribution—over time, which leads to conflicts like this one.
VSTATUS, with or without SIGINFO
The Unix signal interface is complex and hard to work with; some developers have argued that its design is "unfixable". So when Walt Drummond proposed increasing the number of signals that Linux systems could manage, eyebrows could be observed at increased altitude across the Internet. The proposed increase seems unlikely to happen, but the underlying goal, supporting a decades-old feature from other operating systems, may yet become a reality.
The kernel is able to support up to 64 different signal types, which seems like a fair number, but all 64 are taken, on some architectures at least. That makes it impossible to add new signal types to Linux. Drummond sought to address that problem by raising the limit to 1024, which would surely be enough for all time. Raising the limit requires making some subtle changes to the user-space API (putting a larger signal mask into the information passed to realtime signal handlers, for example) that have the possibility of breaking applications, which means that extra scrutiny would be required. But that, it seems, is what would be needed to be able to add more signals.
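To see why a fixed-size signal mask caps the number of signals, consider a toy model with one bit per signal number. The sketch below is an assumption-laden illustration in Python, not the kernel's actual data structure; the names SIGSET_BITS, sigmask_add, and sigmask_has are hypothetical.

```python
SIGSET_BITS = 64  # size of the signal mask discussed in the text


def sigmask_add(mask, signo):
    """Return `mask` with the bit for signal number `signo` (1-based) set."""
    if not 1 <= signo <= SIGSET_BITS:
        raise ValueError(
            f"signal {signo} does not fit in a {SIGSET_BITS}-bit mask"
        )
    return mask | (1 << (signo - 1))


def sigmask_has(mask, signo):
    """Report whether the bit for `signo` is set in `mask`."""
    return 1 <= signo <= SIGSET_BITS and bool(mask & (1 << (signo - 1)))
```

In this model a 65th signal has nowhere to go without widening the mask, and any structure that embeds the mask changes size along with it, which is where the user-space API concerns come from.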
Developers immediately wanted to know why there was a need to add signals, and urged Drummond to find an alternative if possible. As Eric Biederman put it:
Please let's not expand the number of signals supported if there is any alternative. Signals only really make sense for supporting existing interfaces. For new applications there is almost always something better.
So why is there a need to add new signals? The answer has roots in ancient history, well before the creation of Linux. The TOPS-10 operating system was developed by Digital Equipment Corporation in the 1970s; one feature it offered was to print a line of status information (what was running, CPU time consumed, etc.) when the user typed control-T. It was a quick way for users to verify that the system was still alive and making progress on whatever task they were running. This feature found its way into VMS later on, and it still exists today in a number of BSD variants. Linux, however, does not have this feature — in that form, anyway.
Within the kernel, there are two aspects to supporting this feature, which goes by the name VSTATUS. The first is to recognize the status request in the terminal driver; this is done using the same logic that recognizes other control characters, such as control-C to send a SIGINT signal to the running process. The kernel would then respond to the keystroke in two ways:
- The kernel will print a status line directly to the terminal with generic information about the running process and the state of the system.
- A SIGINFO signal is sent to the process running on the terminal at the time. Applications can catch this signal to print some status information of their own; a copy application could tell the user how far the copy has progressed, for example.
The Linux kernel doesn't implement VSTATUS, so it won't do either of the above things. Adding the ability to print a status line to the terminal driver is not that hard; neither is sending a signal in response to an event. The problem is that Linux does not implement SIGINFO, so that signal would need to be added and, as noted above, there is no room to add new signals. Thus Drummond's patch set.
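On systems that do provide SIGINFO, the application side of this feature is an ordinary signal handler. The following Python sketch shows the idea under some assumptions: it runs on a Unix-like system, the CopyJob class is hypothetical, and on Linux (where SIGINFO is not defined) it falls back to SIGUSR1 purely as a stand-in so that the example can run.

```python
import signal

# SIGINFO exists on the BSDs and macOS; Linux does not define it, so this
# sketch falls back to SIGUSR1 purely as a stand-in (an assumption for the demo).
STATUS_SIGNAL = getattr(signal, "SIGINFO", signal.SIGUSR1)


class CopyJob:
    """Hypothetical long-running task that can report its own progress."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.copied = 0
        self.reports = []

    def on_status_request(self, signo, frame):
        # Record the status line (rather than print it) so it is easy to inspect.
        percent = 100 * self.copied // self.total_bytes
        self.reports.append(
            f"{self.copied}/{self.total_bytes} bytes ({percent}%)"
        )


def demo():
    job = CopyJob(total_bytes=1000)
    previous = signal.signal(STATUS_SIGNAL, job.on_status_request)
    try:
        job.copied = 250
        # Stands in for the terminal driver reacting to control-T.
        signal.raise_signal(STATUS_SIGNAL)
    finally:
        signal.signal(STATUS_SIGNAL, previous)  # restore the earlier handler
    return job.reports
```

A real program would print the status line to the terminal; the kernel-generated line described in the first bullet would appear regardless of whether the application installs a handler.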
This is not the first time that VSTATUS support has been requested; the issue also came up in 2014 and again in 2019 — and undoubtedly other times as well. In 2019, Arseny Maslennikov tried to get around the signal limitation by defining SIGINFO as a synonym for SIGPWR, which was meant to (depending on who is reminiscing at the time) indicate that either a power failure was impending or power had been restored. Either way, the signal is delivered only to the init process, and it tends to be little used on current systems. Repurposing it for VSTATUS requires changing the default action for SIGPWR to "ignore" but otherwise shouldn't prove disruptive to user space.
Or so the developers would hope. Real-world uses of SIGPWR are rare, but they do exist; as Ted Ts'o pointed out, systemd can be made to respond to it, for example. Changing the default handling of SIGPWR would probably not break any user-space applications, but it is hard to know that for sure. If something does break, it could show up years later when the change makes it into an enterprise kernel; at that point, the problem would be nearly impossible to fix to everybody's satisfaction. So it is unsurprising that kernel developers are reluctant to make a change like this.
Over the years, there have also been questions about whether the feature is really needed or not. As Greg Kroah-Hartman pointed out in 2019, the kernel's "magic SysRq" feature does everything VSTATUS does (and a lot more). But magic SysRq only works on the system console; as Biederman noted in the above-linked message, a "persuasive case" could be made for the utility of this feature for users interacting with systems over SSH, for example. So a use case for VSTATUS does appear to exist.
Given that, what is to be done? Ts'o's message above included a suggestion: implement VSTATUS as far as printing the status line from the kernel, but don't bother sending the SIGINFO signal to user space. That eliminates the need to add a new signal (or repurpose an old one), which is where all the problems arise. This implementation would deprive user space of the ability to add its own status information, but the number of programs that have ever implemented SIGINFO handling is quite small; Drummond said that only sleep, dd, and ping have such support, "so it's not like there's a vast hole in the tooling or something, nor is there a large legacy software base just waiting for SIGINFO to appear". He readily agreed to leave out the SIGINFO part.
The followup patch set without SIGINFO has not been posted as of this writing. Once it arrives, there should not be a great deal of opposition to its merging into the mainline; at that point, Linux will have most of the VSTATUS functionality that is offered by the BSDs. If it turns out later on that somebody really needs the SIGINFO part, that whole problem can be reconsidered. Meanwhile, though, the kernel community is happy to kick that can further down the road.
Fixing a corner case in asymmetric CPU packing
Linux supports processor architectures where CPUs in the same system might have different processing capacities; for example, the Arm big.LITTLE systems combine fast, power-hungry CPUs with slower, more efficient ones. Linux has also run for years on simultaneous multithreading (SMT) architectures, where one CPU executes multiple independent execution threads and is seen as if it were multiple cores. There are architectures that mix both approaches. A recent discussion on a patch set submitted by Ricardo Neri shows that, on these systems, the scheduler might distribute tasks in an inefficient way.
Simultaneous multithreading
SMT functionality has been present in architectures like PowerPC and x86 for years. On an SMT system, a CPU can run instructions from two (or more) separate execution contexts. Each logical thread is visible as a separate CPU, so one physical CPU running two threads will be seen in Linux as two CPUs. SMT processors using the same hardware in this way are often called "siblings". Operating systems have little control over an SMT processor's decisions on how to divide its resources between execution contexts.
SMT allows better use of a processor's resources because, when one execution path is stalled (waiting for memory, for example), the physical CPU can execute instructions from other threads. However, doubling the number of threads in a processor does not normally double its processing capacity. Both threads are sharing the same resources, and the SMT mode is most efficient when the system is under low load. Two SMT threads thus have a lower capacity than two physical CPU cores.
This reduced capacity needs to be reflected in the scheduler, but the exact value of the reduction depends on the load, so the kernel needs to use a heuristic. Linux models this by reducing the CPU priority (which regulates how likely the CPU is to be chosen to run a given task) for the second (and following) CPU threads running on the same hardware.
Users can view sibling CPUs in the topology information available on their systems in /sys/devices/system/cpu/cpuX/topology/core_cpus_list (where X is the number of a CPU on the system). For example, in a 12-core system with SMT:
$ cat /sys/devices/system/cpu/cpu0/topology/core_cpus_list
0,6
The result means that logical CPUs 0 and 6 are SMT siblings, so they share the same hardware core.
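The same information can be read programmatically. The following is a minimal Python sketch, assuming the usual kernel CPU-list notation of comma-separated numbers and ranges (such as "0,6" or "0-3"); the function names parse_cpu_list() and smt_siblings() are hypothetical helpers invented for this illustration, not part of any existing tool.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a CPU-list string such as "0,6" or "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a trailing comma
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def smt_siblings(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return the logical CPUs sharing a core with `cpu` (itself included)."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "core_cpus_list"
    return parse_cpu_list(path.read_text())


# Parsing the output shown above, without touching sysfs:
print(parse_cpu_list("0,6"))  # [0, 6]
```

On a Linux system, smt_siblings(0) reads the same sysfs file as the cat command above; what it returns depends on the machine's topology.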
ASYM_PACKING
Asymmetric packing (SD_ASYM_PACKING in the scheduler) is a feature originally added for the PowerPC architecture in 2010. It handles the case where the scheduler can obtain better processor performance by moving tasks to certain CPUs and leaving others idle. The busy CPUs can then move to a lower SMT mode (running fewer threads) and obtain higher overall system performance. The SMT modes documentation for PowerPC includes some examples.
The ASYM_PACKING mode has been reworked a number of times and is currently supported on x86 and PowerPC. The x86 support includes a way to handle cores that can run at a higher frequency than others, for example using the Intel Turbo Boost Max Technology 3.0 (ITMT) feature. A short slide set from the 2016 Linux Plumbers Conference explains the work in a little more detail.
Mixing SMT and ASYM_PACKING
Neri observed some undesired scheduling behavior on a system with three distinct CPU priorities. This system contains high-performance cores (Intel Core) with their SMT siblings, along with lower-performance cores (Intel Atom). The efficient scheduling approach in this case (for some workloads at least) is to use the high-performance cores first (but without their SMT secondary threads), then the lower-performance cores, leaving the SMT secondary threads for last. The scheduler was, instead, putting tasks on the high-performance cores and their SMT siblings, leaving the other cores idle. As a result, tasks were contending for processor resources while independent CPUs remained idle.
To understand this problem, consider the example from one of Neri's patches. Imagine a system with two physical CPUs with different priorities: 60 and 30 respectively. Both of them have SMT siblings. The kernel assigns SMT priorities using an equation in the x86-specific code:
smt_prio = prio * smp_num_siblings / i;
where smt_prio is the effective priority, prio is the original priority of the CPU, smp_num_siblings is the number of siblings for each CPU (the value is two in Neri's case), and i is the sibling number assigned to the given physical CPU, starting from one. According to the formula, the resulting priorities are 120 for the main thread of the first CPU and 60 for its SMT sibling. For the second CPU, the main thread gets a priority of 60, and the sibling a priority of 30. The SMT sibling of the first CPU and the main thread of the second CPU thus end up with the same priority (60), so the scheduler has no reason to prefer the idle physical core over the busy core's sibling thread.
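The arithmetic can be checked with a few lines of Python. This is only an illustration of the formula quoted above, not kernel code; the function name smt_priority() and the dictionary labels are made up for the example, and Python's // operator stands in for C integer division.

```python
def smt_priority(prio, num_siblings, i):
    """Priority of the i-th (1-based) SMT thread: prio * smp_num_siblings / i."""
    return prio * num_siblings // i  # integer division, as in C for non-negative ints


# Two physical CPUs with base priorities 60 and 30, two SMT threads each.
core_prios = {"first CPU": 60, "second CPU": 30}
for name, prio in core_prios.items():
    print(name, [smt_priority(prio, 2, i) for i in (1, 2)])
# first CPU [120, 60]
# second CPU [60, 30]
```

The value 60 appears twice: once for the first CPU's sibling thread and once for the second CPU's main thread.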
Neri wanted to change the scheduler to assign tasks to the main thread of the second (physical) CPU before using the SMT sibling CPUs. To that end, he proposed a modification to the formula so that it would divide by the square of the sibling number:
smt_prio = prio * smp_num_siblings / (i*i);
In this case, the priorities will be 120 and 30 for the threads of the first CPU; then 60 and 15 for the threads of the second CPU. Tasks will thus be scheduled first on both main threads.
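A short Python sketch of the proposed formula shows the resulting order of preference. It assumes the same example of two physical CPUs with priorities 60 and 30 and two threads each; the thread labels and the function name proposed_smt_priority() are hypothetical, chosen only for this illustration.

```python
def proposed_smt_priority(prio, num_siblings, i):
    """Proposed formula: prio * smp_num_siblings / (i * i), with i starting at one."""
    return prio * num_siblings // (i * i)


threads = {}
for core, prio in (("cpu_a", 60), ("cpu_b", 30)):
    for i in (1, 2):
        threads[f"{core}/thread{i}"] = proposed_smt_priority(prio, 2, i)

# Highest priority first: both main threads come before either SMT sibling.
order = sorted(threads, key=threads.get, reverse=True)
print(order)
# ['cpu_a/thread1', 'cpu_b/thread1', 'cpu_a/thread2', 'cpu_b/thread2']
```

Dividing by the square of the sibling number makes the penalty for secondary threads steep enough that the tie between a fast core's sibling and a slower core's main thread disappears in this example.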
Neri's patch set makes another change when it comes to the scheduler's load-balancing decisions, and specifically when the scheduler decides to move a task from one CPU to another to even out the load on the system. When considering whether to move a task from a source CPU to a new target CPU, the scheduler considers whether the target has SMT siblings; if not, it can receive tasks from an SMT source CPU that has at least two busy threads. If only one sibling in the source CPU is busy, tasks will be moved only if the target CPU has a higher priority than the source.
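The rule for a target CPU without SMT siblings can be sketched as a small decision function. This is a simplified Python model of the description above, not the kernel's implementation: the function name and parameters are hypothetical, the case of a target that does have SMT siblings is deliberately left out because it is not described here, and returning False for a source with no busy threads is an assumption of this sketch.

```python
def non_smt_target_can_pull(source_busy_threads, target_prio, source_prio):
    """Decide whether a target CPU *without* SMT siblings may pull a task.

    source_busy_threads: number of busy SMT threads on the source core
    target_prio, source_prio: priorities of the target and source CPUs
    """
    if source_busy_threads >= 2:
        # Two or more busy siblings are contending for one core: pull.
        return True
    if source_busy_threads == 1:
        # A single busy thread moves only to a higher-priority target.
        return target_prio > source_prio
    # Assumption of this sketch: an idle source has nothing to pull.
    return False


print(non_smt_target_can_pull(2, target_prio=30, source_prio=60))  # True
print(non_smt_target_can_pull(1, target_prio=30, source_prio=60))  # False
print(non_smt_target_can_pull(1, target_prio=60, source_prio=30))  # True
```

The first call models the problem case from the patch set: a lower-priority core without siblings is allowed to relieve a high-priority core whose two threads are both busy.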
Summary
Scheduling performance improves by a few percent in some cases, according to the benchmark results presented in the cover letter, though the gain is smaller in most cases. The benchmarks also show some cases with performance degradation, but Neri gave no explanation for this result.
The change has been merged for the 5.16 kernel, so owners of such systems should see a change in scheduling and, hopefully, better performance. This fix covers one scheduling corner case; there is no reason to think it was the only one. We should expect to see more adjustments in scheduling on asymmetric CPUs in the future.
Some 5.16 kernel development statistics
The 5.16 kernel was released on January 9, as expected. This development cycle incorporated 14,190 changesets from 1,988 developers; it was thus quite a bit busier than its predecessor, and fairly typical for recent kernel releases in general. A new release means that the time has come to have a look at where those changes came from.
The 1,988 developers contributing to 5.16 was the second-highest number ever, with only 5.13 (with 2,062 developers) being higher. This time around, 296 developers contributed their first change to the kernel, which is at the high end of the typical range. The most active developers in this cycle were:
Most active 5.16 developers
By changesets
Michael Straube             286    2.0%
Cai Huoqing                 232    1.6%
Jakub Kicinski              200    1.4%
Christoph Hellwig           158    1.1%
Bart Van Assche             157    1.1%
Krzysztof Kozlowski         140    1.0%
Mauro Carvalho Chehab       130    0.9%
Pavel Begunkov              122    0.9%
Thomas Gleixner             117    0.8%
Alex Deucher                112    0.8%
Matthew Wilcox              108    0.8%
Geert Uytterhoeven          103    0.7%
Jani Nikula                  94    0.7%
Ian Rogers                   91    0.6%
Arnd Bergmann                88    0.6%
Ville Syrjälä                86    0.6%
Mark Brown                   85    0.6%
Martin Kaiser                85    0.6%
Colin Ian King               82    0.6%
Jens Axboe                   80    0.6%
By changed lines
Ping-Ke Shih              91116   11.4%
Zhan Liu                  34501    4.3%
Nick Terrell              28611    3.6%
Sameer Pujar              15121    1.9%
Johan Almbladh            13901    1.7%
Thomas Bogendoerfer       11591    1.4%
Michael Straube            9014    1.1%
Dmitry Baryshkov           7836    1.0%
Srinivas Kandagatla        7663    1.0%
Larry Finger               7586    0.9%
Prabhakar Kushwaha         6261    0.8%
Jakub Kicinski             5796    0.7%
Fangzhi Zuo                5765    0.7%
Alex Deucher               5627    0.7%
Peter Zijlstra             5448    0.7%
Jani Nikula                5287    0.7%
Simon Trimmer              5249    0.7%
Shawn Guo                  5152    0.6%
Tony Lindgren              5020    0.6%
Derek Fang                 4973    0.6%
The most prolific contributor of changesets for 5.16 was Michael Straube, who worked almost exclusively on the r8188eu wireless network adapter driver in the staging tree; that driver has now received 755 changes since being merged for the 5.15 release. Cai Huoqing contributed clean-up patches in many areas of the kernel, Jakub Kicinski made improvements throughout the networking subsystem, Christoph Hellwig continues his refactoring work in the block and filesystem layers, and Bart Van Assche reworked much of the SCSI subsystem code.
In the lines-changed column, Ping-Ke Shih came out on top with the addition of the Realtek rtw89 driver; unlike many past Realtek drivers, this one skipped the staging tree and landed directly under drivers/net. Zhan Liu contributed exactly two patches adding yet another set of amdgpu header files. Nick Terrell updated the kernel's zstd compression module, Sameer Pujar added a set of NVIDIA Tegra sound drivers, and Johan Almbladh added eBPF JIT compilers for the 32- and 64-bit MIPS architectures. It's worth noting that there were relatively few large code removals in 5.16 (the biggest was the removal of Netlogic MIPS support by Thomas Bogendoerfer), so the kernel as a whole grew by 422,000 lines.
The kernel project depends on its testers and reviewers as much as it depends on its developers. For the 5.16 cycle, the contributors with the most test and review credits were:
Test and review credits in 5.16
Tested-by
Daniel Wheeler              153   14.8%
Sandeep Penigalapati         34    3.3%
Tony Brelinski               25    2.4%
Deren Wu                     24    2.3%
Gurucharan G                 22    2.1%
Sohaib Mohamed               22    2.1%
Konrad Jankowski             20    1.9%
Alexei Starovoitov           16    1.5%
Mark Wunderlich              14    1.4%
John Garry                   13    1.3%
Christian Zigotzky           13    1.3%
Fuad Tabba                   12    1.2%
Shawn Guo                    12    1.2%
Geert Uytterhoeven           10    1.0%
Ferry Toth                   10    1.0%
Reviewed-by
Christoph Hellwig           202    3.2%
Rob Herring                 194    3.0%
Hans de Goede               119    1.9%
Pierre-Louis Bossart        104    1.6%
Stephen Boyd                100    1.6%
David Howells                83    1.3%
David Sterba                 80    1.2%
Jani Nikula                  77    1.2%
Christian König              74    1.2%
Andrew Lunn                  68    1.1%
Jan Kara                     60    0.9%
Kai Vehmanen                 60    0.9%
Kees Cook                    58    0.9%
Florian Fainelli             57    0.9%
Linus Walleij                55    0.9%
Once again, Daniel Wheeler heads the list of test credits, having received 15% of all such credits during the 5.16 development cycle. That is over two patches tested per day — every day, weekends and holidays included. Wheeler appears to be doing this work as part of his employer's internal review process, as do many of the other top testers. The top reviewers, instead, tend to be active developers who also manage to get a lot of reviews done. The top two reviewers for 5.16 are the same as for 5.15; Christoph Hellwig managed to review three patches and write two of his own for every day of the 70-day 5.16 development cycle.
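The per-day figures follow directly from the credit counts in the tables and the 70-day length of the cycle. A quick Python check, with dictionary labels invented for this illustration:

```python
CYCLE_DAYS = 70  # length of the 5.16 development cycle, as stated above

counts = {
    "Wheeler Tested-by": 153,
    "Hellwig Reviewed-by": 202,
    "Hellwig changesets": 158,
}

per_day = {label: count / CYCLE_DAYS for label, count in counts.items()}
for label, rate in per_day.items():
    print(f"{label}: {rate:.2f} per day")
# Wheeler Tested-by: 2.19 per day
# Hellwig Reviewed-by: 2.89 per day
# Hellwig changesets: 2.26 per day
```

The review rate rounds to three per day and the authorship rate to two, matching the figures in the text.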
A different sort of review is associated with the task of selecting patches to apply and push into the mainline kernel. That decision may involve a thorough review in its own right, or it may rely on the review efforts of others. When maintainers accept patches, they will apply a Signed-off-by tag to those patches. By looking at signoffs by people other than the author of a patch, it is possible to get a picture of who the most active maintainers are. For 5.16 they were:
Top signoffs in 5.16
David S. Miller            1082    7.8%
Greg Kroah-Hartman         1062    7.6%
Mark Brown                  558    4.0%
Alex Deucher                472    3.4%
Jens Axboe                  442    3.2%
Andrew Morton               400    2.9%
Martin K. Petersen          353    2.5%
Jakub Kicinski              325    2.3%
Mauro Carvalho Chehab       325    2.3%
Bjorn Andersson             305    2.2%
Paolo Bonzini               230    1.7%
Jonathan Cameron            224    1.6%
Kalle Valo                  210    1.5%
Arnaldo Carvalho de Melo    203    1.5%
Hans Verkuil                183    1.3%
Felix Fietkau               163    1.2%
David Sterba                162    1.2%
Alexei Starovoitov          154    1.1%
Borislav Petkov             152    1.1%
Saeed Mahameed              148    1.1%
This list of maintainers tends not to change much from one release to another; it is made up of some of the kernel project's most senior developers who have been on the job for many years.
Work on 5.16 was supported by 251 employers that we were able to identify. The most active of those were:
Most active 5.16 employers
By changesets
Intel                      1454   10.2%
(Unknown)                  1196    8.4%
                            932    6.6%
(None)                      781    5.5%
Red Hat                     765    5.4%
AMD                         682    4.8%
                            641    4.5%
Linaro                      592    4.2%
NVIDIA                      463    3.3%
Huawei Technologies         422    3.0%
SUSE                        311    2.2%
Oracle                      294    2.1%
IBM                         274    1.9%
(Consultant)                266    1.9%
Canonical                   249    1.8%
Arm                         244    1.7%
Baidu                       234    1.6%
Renesas Electronics         221    1.6%
MediaTek                    199    1.4%
Code Aurora Forum           192    1.4%
By lines changed
Realtek                   97237   12.2%
Intel                     72565    9.1%
AMD                       67076    8.4%
                          50894    6.4%
(Unknown)                 43152    5.4%
(None)                    40389    5.0%
Linaro                    39428    4.9%
NVIDIA                    38898    4.9%
                          35871    4.5%
Red Hat                   23312    2.9%
Marvell                   19136    2.4%
MediaTek                  15399    1.9%
Code Aurora Forum         14564    1.8%
Anyfi Networks            13901    1.7%
Renesas Electronics       12888    1.6%
SUSE                      10940    1.4%
IBM                       10808    1.4%
Huawei Technologies       10378    1.3%
Cirrus Logic              10046    1.3%
Oracle                     8728    1.1%
This table, too, tends not to change much from one release to the next. For the curious, the "unknown" category consists of nearly 400 developers, most of whom contributed one or two patches. Any one of these developers is a small contributor to this release, but together they add up to a significant portion of the total patch flow. Many of those developers will move on, having done what they came to the kernel project to do; others are just getting started and will become significant contributors over time.
In summary, 5.16 was just another typical kernel development cycle. Lots of patches from nearly 2,000 developers, all integrated into another solid (though not perfect) kernel release. The kernel project does not lack its share of problems with quality control, testing, support for maintainers, and more, but it nonetheless manages to get the work done on a predictable schedule. Work now begins on 5.17, which will be released in mid-March.
Page editor: Jonathan Corbet