LWN.net Weekly Edition for May 21, 2020
Welcome to the LWN.net Weekly Edition for May 21, 2020
This edition contains the following feature content:
- The PEPs of Python 3.9: an update on the final changes that found their way into the upcoming Python 3.9 release.
- The state of the AWK: it may be an old tool, but it's still interesting.
- Ongoing coverage from OSPM 2020, including:
- The weighted TEO cpuidle governor: an attempt to improve idle-time predictions.
- Testing scheduler thermal properties for avionics: an in-progress test bed to evaluate thermally-oriented scheduler changes.
- Utilization inversion and proxy execution: using load tracking for task placement can lead to some strange inversion situations; fixing them may not be entirely easy.
- The many faces of "latency nice": a complex and inconclusive session on an incompletely designed feature.
- Scheduler benchmarking with MMTests: a test suite developed for memory-management benchmarking finds a new use case.
- Evaluating vendor changes to the scheduler: mobile vendors make a lot of tweaks to the CPU scheduler; why do they do that and what is gained from it?
- Bao: a lightweight partitioning hypervisor for mixed-criticality workloads.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
The PEPs of Python 3.9
With the release of Python 3.9.0b1, the first of four planned betas for the development cycle, Python 3.9 is now feature-complete. There is still plenty to do in terms of testing and stabilization before the October final release. The release announcement lists a half-dozen Python Enhancement Proposals (PEPs) that were accepted for 3.9. We have looked at some of those PEPs along the way; there are some updates on those. It seems like a good time to fill in some of the gaps on what will be coming in Python 3.9.
String manipulations
Sometimes the simplest (seeming) things are the hardest—or at least provoke an outsized discussion. Much of that was bikeshedding over—what else?—naming, but the idea of adding functions to the standard string objects to remove prefixes and suffixes was fairly uncontroversial. Whether those affixes (a word for both prefixes and suffixes) could be specified as sequences, so more than one affix could be handled in a single call, was less clear cut; ultimately, it was removed from the proposal, awaiting someone else to push that change through the process.
Toward the end of March, Dennis Sweeney asked on the python-dev mailing list for a core developer to sponsor PEP 616 ("String methods to remove prefixes and suffixes"). He pointed to a python-ideas discussion from March 2019 about the idea. Eric V. Smith agreed to sponsor the PEP, which led Sweeney to post it and kick off the discussion. In the original version, he used cutprefix() and cutsuffix() as the names of the string object methods to be added. Four types of Python objects would get the new methods: str (Unicode strings), bytes (binary sequences), bytearray (mutable binary sequences), and collections.UserString (a wrapper around string objects). It would work as follows:
'abcdef'.cutprefix('abc')    # returns 'def'
'abcdef'.cutsuffix('ef')     # returns 'abcd'
There were plenty of suggestions in the name department. Perhaps the most widespread agreement was that few liked "cut", so "strip", "trim", and "remove" were all suggested and garnered some support. stripprefix() (and stripsuffix(), of course) seemed to run into opposition due, at least in part, to one of the rationales specified in the PEP; the existing "strip" functions are confusing so reusing that name should be avoided. The str.lstrip() and str.rstrip() methods also remove leading and trailing characters, but they are a source of confusion to programmers actually looking for the cutprefix() functionality. The *strip() calls take a string argument, but treat it as a set of characters that should be eliminated from the front or end of the string:
'abcdef'.lstrip('abc')       # returns 'def' as "expected"
'abcbadefed'.lstrip('abc')   # returns 'defed' not at all as expected
Eventually, removeprefix() and removesuffix() seemed to gain the upper hand, and Sweeney switched to those names. It probably did not hurt that Guido van Rossum supported them as well. Eric Fahlgren amusingly summed up the name fight this way:
cutprefix - Removes the specified prefix.
trimprefix - Removes the specified prefix.
stripprefix - Removes the specified prefix.
removeprefix - Removes the specified prefix. Duh. :)
Sweeney announced an update to the PEP that addressed a number of comments, but also added the requested ability to take a tuple of strings as an affix (that version can be seen in the PEP GitHub repository). But Steven D'Aprano was not so sure it made sense to do that. He pointed out that the only string operations that take a tuple are str.startswith() and str.endswith(), which do not return a string (just a boolean value). He is leery of adding a method that returns a (potentially changed) version of the string while taking a tuple because whatever rules are chosen on how to process the tuple will be the "wrong" choice for some. For example:
"extraordinary".startswith(('ex', 'extra'))since it is True whether you match left-to-right, shortest-to-largest, or even in random order. But for cutprefix, which prefix should be deleted?
As he said, the rule as proposed is that the first matching string, processing the tuple left-to-right, is used, but some might want the longest match or the last match; it all depends on the context of the use. He suggested that the feature get more "soak time" before committing to adding that behavior: "We ought to get some real-life exposure to the simple case first, before adding support for multiple prefixes/suffixes."
Ethan Furman agreed with D'Aprano. But Victor Stinner was strongly in favor of the tuple-argument idea. He wondered about the proposed behavior, however, when the empty string is passed as part of the tuple. As proposed, encountering the empty string (which effectively matches anything) when processing the tuple would simply return the original string, which leads to surprising results:
cutsuffix("Hello World", ("", " World")) # returns "Hello World" cutsuffix("Hello World", (" World", "")) # returns "Hello"
The problem is not likely to manifest so obviously; affixes will not necessarily be hard coded, so empty strings might slip into unexpected places. Stinner suggested raising ValueError if an empty string is encountered, similar to str.split().
But Sweeney decided to remove the tuple-argument feature entirely to "allow someone else with a stronger opinion about it to propose and defend a set of semantics in a different PEP". He posted the last version of the PEP on March 28.
On April 9, Sweeney opened a steering council issue requesting a review of the PEP. On April 20, Stinner accepted it on behalf of the council. It is a pretty minimal change but worth the time to try to ensure that it has the right interface (and semantics) for the long haul. We will see removeprefix() and removesuffix() in Python 3.9.
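For reference, a quick interactive sketch of how the accepted methods behave in Python 3.9; an affix that does not match leaves the string unchanged:

>>> 'abcdef'.removeprefix('abc')
'def'
>>> 'abcdef'.removesuffix('ef')
'abcd'
>>> 'abcdef'.removeprefix('xyz')   # no match: the string is returned unchanged
'abcdef'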
New parser
It should not really surprise anyone that the new parser for CPython, covered here in mid-April, has been accepted by the steering council. PEP 617 ("New PEG parser for CPython") was proposed by project founder and former benevolent dictator for life (BDFL) Guido van Rossum, along with Pablo Galindo Salgado and Lysandros Nikolaou; it is already working well and its performance is within 10% of the existing parser in terms of speed and memory use. It will also make the language specification simpler because the parser is based on a parsing expression grammar (PEG). The existing LL(1) parser for CPython suffers from a number of shortcomings and contains some hacks that the new parser will eliminate.
The change paves the way for Python to move beyond having an LL(1) grammar—though the existing language is not precisely LL(1)—down the road. That change will not come soon as the plans are to keep the existing parser available in Python 3.9 behind a command-line switch. But Python 3.10 will remove the existing parser, which could allow language changes. If those kinds of changes are made, however, alternative Python implementations (e.g. PyPy, MicroPython) may need to switch their parsers to something other than LL(1) in order to keep up with the language specification. That might give the core developers pause before making a change of that nature.
And more
We looked at PEP 615 ("Support for the IANA Time Zone Database in the Standard Library") back in early March. It would add a zoneinfo module to the standard library that would facilitate getting time-zone information from the IANA time zone database (also known as the "Olson database") to populate a time-zone object. It was looked on favorably at the time of the article and, at the end of March, Paul Ganssle asked for a decision on the PEP. He thought it might be amusing to have it accepted (assuming it was) during an interesting time window.
He recognized that it might be difficult to pull off and it certainly was not a priority. The steering council did not miss the second window by much; Barry Warsaw announced the acceptance of the PEP on April 20. Python will now have a mechanism to access the system's time-zone database for creating and handling time zones. In addition, there is a tzdata module in the Python Package Index (PyPI) that contains the IANA data for systems that lack it; it will be maintained by the Python core developers as well.
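A brief sketch of what the new module makes possible; this example assumes a system with the IANA database installed (or the tzdata package from PyPI):

from datetime import datetime
from zoneinfo import ZoneInfo   # new in Python 3.9 (PEP 615)

# Attach an IANA time zone to a datetime and convert it to another zone.
meeting = datetime(2020, 10, 31, 12, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
print(meeting)                                         # 2020-10-31 12:00:00-07:00
print(meeting.astimezone(ZoneInfo("Europe/Prague")))   # 2020-10-31 20:00:00+01:00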
PEP 593 ("Flexible function and variable annotations") adds a way to associate context-specific metadata with functions and variables. Effectively, the type-hint annotations have squeezed out other use cases that were envisioned in PEP 3107 ("Function Annotations"), which was implemented in Python 3.0 many years ago. PEP 593 creates a new mechanism for those use cases using the Annotated type hint.
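A minimal illustration (the function and metadata string here are made up): type checkers see the underlying int, while other tools can retrieve the extra context:

from typing import Annotated, get_type_hints

def set_timeout(seconds: Annotated[int, "unit: seconds"]) -> None:
    ...

# include_extras=True (new in 3.9) preserves the Annotated metadata.
print(get_type_hints(set_timeout, include_extras=True))
# {'seconds': typing.Annotated[int, 'unit: seconds'], 'return': <class 'NoneType'>}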
Another kind of cleanup comes in PEP 585 ("Type Hinting Generics In Standard Collections"). It will allow the removal of a parallel set of type aliases maintained in the typing module in order to support generic types. For example, the typing.List type will no longer be needed to support annotations like "dict[str, list[int]]" (i.e. a dictionary with string keys and values that are lists of integers).
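In practice, that means annotations can use the built-in container types directly; a small illustrative example (the function is invented for demonstration):

# Python 3.9: built-in types can be parameterized directly, no typing import needed.
def tally(scores: dict[str, list[int]]) -> dict[str, int]:
    return {name: sum(values) for name, values in scores.items()}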
The dictionary union operation for "addition" will also be part of Python 3.9. It was a bit contentious at times, but PEP 584 ("Add Union Operators To dict") was recommended for acceptance by Van Rossum in mid-February. The steering council promptly agreed and the feature was merged on February 24.
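The new operators behave as PEP 584 describes, with values from the right-hand operand winning when keys collide:

>>> d1 = {'a': 1, 'b': 2}
>>> d2 = {'b': 3, 'c': 4}
>>> d1 | d2                 # union; right-hand values win on conflicts
{'a': 1, 'b': 3, 'c': 4}
>>> d1 |= d2                # in-place update, like dict.update()
>>> d1
{'a': 1, 'b': 3, 'c': 4}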
The last PEP on the list is PEP 602 ("Annual Release Cycle for Python"). As it says on the tin, it changes the release cadence from every 18 months to once per year. The development and release cycles overlap, though, so that a full 12 months is available for feature development. Python 3.10 feature development begins when the first Python 3.9 beta has been released—which is now. Stay tuned for the next round of PEPs in the coming year.
The state of the AWK
AWK is a text-processing language with a history spanning more than 40 years. It has a POSIX standard, several conforming implementations, and is still surprisingly relevant in 2020 — both for simple text processing tasks and for wrangling "big data". The recent release of GNU Awk 5.1 seems like a good reason to survey the AWK landscape, see what GNU Awk has been up to, and look at where AWK is being used these days.
The language was created at Bell Labs in 1977. Its name comes from the initials of the original authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. A Unix tool to the core, AWK is designed to do one thing well: to filter and transform lines of text. It's commonly used to parse fields from log files, transform output from other tools, and count occurrences of words and fields. Aho summarized AWK's functionality succinctly:
AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.
AWK programs are often one-liners executed directly from the command line. For example, to calculate the average response time of GET requests from some hypothetical web server log, you might type:
$ awk '/GET/ { total += $6; n++ } END { print total/n }' server.log
0.0186667
This means: for all lines matching the regular expression /GET/, add up the response time (the sixth field or $6) and count the line; at the end, print out the arithmetic mean of the response times.
The various AWK versions
There are three main versions of AWK in use today, and all of them conform to the POSIX standard (closely enough, at least, for the vast majority of use cases). The first is classic awk, the version of AWK described by Aho, Weinberger, and Kernighan in their book The AWK Programming Language. It's sometimes called "new AWK" (nawk) or "one true AWK", and it's now hosted on GitHub. This is the version pre-installed on many BSD-based systems, including macOS (though the version that comes with macOS is out of date, and worth upgrading).
The second is GNU Awk (gawk), which is by far the most featureful and actively maintained version. Gawk is usually pre-installed on Linux systems and is often the default awk. It is easy to install on macOS using Homebrew and Windows binaries are available as well. Arnold Robbins has been the primary maintainer of gawk since 1994, and continues to shepherd the language (he has also contributed many fixes to the classic awk version). Gawk has many features not present in awk or the POSIX standard, including new functions, networking facilities, a C extension API, a profiler and debugger, and most recently, namespaces.
The third common version is mawk, written by Michael Brennan. It is the default awk on Ubuntu and Debian Linux, and is still the fastest version of AWK, with a bytecode compiler and a more memory-efficient value representation. (Gawk has also had a bytecode compiler since 4.0, so it's now much closer to mawk's speed.)
If you want to use AWK for one-liners and basic text processing, any of the above are fine variants. If you're thinking of using it for a larger script or program, Gawk's features make it the sensible choice.
There are also several other implementations of AWK with varying levels of maturity and maintenance, notably the size-optimized BusyBox version used in embedded Linux environments, a Java rewrite with runtime access to Java language features, and my own GoAWK, a POSIX-compliant version written in Go. The three main AWKs and the BusyBox version are all written in C.
Gawk changes since 4.0
It's been almost 10 years since LWN covered the release of gawk 4.0. It would be tempting to say "much has changed since 2011", but the truth is that things move relatively slowly in the AWK world. I'll describe the notable features since 4.0 here, but for more details you can read the full 4.x and 5.x changelogs. Gawk 5.1.0 came out just over a month ago on April 14.
The biggest user-facing feature is the introduction of namespaces in 5.0. Most modern languages have some concept of namespaces to make it easier to ship large projects and libraries without name clashes. Gawk 5.0 adds namespaces in a backward-compatible way, allowing developers to create libraries, such as this toy math library:
# area.awk
@namespace "area"

BEGIN {
    pi = 3.14159  # namespaced "constant"
}

function circle(radius) {
    return pi*radius*radius
}
To refer to variables or functions in the library, use the namespace::name syntax, similar to C++:
$ gawk -f area.awk -e 'BEGIN { print area::pi, area::circle(10) }'
3.14159 314.159
Robbins believes that AWK's lack of namespaces is one of the key reasons it hasn't caught on as a larger-scale programming language and that this feature in gawk 5.0 may help resolve that. The other major issue Robbins believes is holding AWK back is the lack of a good C extension interface. Gawk's dynamic extension interface was completely revamped in 4.1; it now has a defined API and allows wrapping existing C and C++ libraries so they can be easily called from AWK.
The following code snippet from the example C-code wrapper in the user manual populates an AWK array (a string-keyed hash table) with a filename and values from a stat() system call:
/* empty out the array */
clear_array(array);

/* fill in the array */
array_set(array, "name",
          make_const_string(name, strlen(name), &tmp));
array_set_numeric(array, "dev", sbuf->st_dev);
array_set_numeric(array, "ino", sbuf->st_ino);
array_set_numeric(array, "mode", sbuf->st_mode);
Another change in the 4.2 release (and continued in 5.0) was an overhauled source code pretty-printer. Gawk's pretty-printer enables its use as a standardized AWK code formatter, similar to Go's go fmt tool and Python's Black formatter. For example, to pretty-print the area.awk file from above:
$ gawk --pretty-print -f area.awk
which results in the following output:
@namespace "area" BEGIN { pi = 3.14159 # namespaced "constant" } function circle(radius) { return (pi * radius * radius) }
You may question the tool's choices: why does "BEGIN {" not have a line break before the "{" when the function does? (It turns out AWK syntax doesn't allow that.) Why two blank lines before the function and parentheses around the return expression? But at least it's consistent and may help avoid code-style debates.
Gawk allows a limited amount of runtime type inspection, and extended that with the addition of the typeof() function in 4.2. typeof() returns a string constant like "string", "number", or "array" depending on the input type. These functions are important for code that recursively walks every item of a nested array, for example (which is something that POSIX AWK can't do).
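As an illustration (this is not code from the gawk manual), a recursive walk over a gawk "array of arrays" might look like the following, using typeof() to distinguish subarrays from scalar values:

# walk.awk: recursively print a nested gawk array (requires gawk 4.2+ for typeof())
function walk(arr, prefix,    key) {
    for (key in arr) {
        if (typeof(arr[key]) == "array")
            walk(arr[key], prefix "[\"" key "\"]")
        else
            printf "%s[\"%s\"] = %s\n", prefix, key, arr[key]
    }
}

BEGIN {
    config["server"]["port"] = 8080
    config["server"]["host"] = "localhost"
    config["debug"] = 1
    walk(config, "config")
}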
With 4.2, gawk also supports regular expression constants as a first-class data type using the syntax @/foo/. Previously you could not store a regular expression constant in a variable; typeof(@/foo/) returns the string "regexp". In terms of performance, gawk 4.2 brings a significant improvement on Linux systems by using fwrite_unlocked() when it's available. As gawk is single-threaded, it can use the non-locking stdio functions, giving a 7-18% increase in raw output speed — for example gawk '{ print }' on a large file.
The GNU Awk User's Guide has always been a thorough reference, but it was substantially updated in 4.1 and again in the 5.x releases, including new examples, summary sections, and exercises, along with some major copy editing.
Last (and also least), a subtle change in 4.0 that I found amusing was the reverted handling of backslash in sub() and gsub(). Robbins writes:
The default handling of backslash in sub() and gsub() has been reverted to the behavior of 3.1. It was silly to think I could break compatibility that way, even for standards compliance.
The sub and gsub functions are core regular expression substitution functions, and even a small "fix" to the complicated handling of backslash broke people's code.
Robbins may have had a small slip in judgment with the original change, but it's obvious he takes backward compatibility seriously. Especially for a popular tool like gawk, sometimes it is better to continue breaking the specification than change how something has always worked.
Is AWK still relevant?
Asking if AWK is still relevant is a bit like asking if air is still relevant: you may not see it, but it's all around you. Many Linux administrators and DevOps engineers use it to transform data or diagnose issues via log files. A version of AWK is installed on almost all Unix-based machines. In addition to ad-hoc usage, many large open-source projects use AWK somewhere in their build or documentation tooling. To name just a few examples: the Linux kernel uses it in the x86 tooling to check and reformat objdump files, Neovim uses it to generate documentation, and FFmpeg uses it for building and testing.
AWK build scripts are surprisingly hard to kill, even when people want to: in 2018 LWN wrote about GCC contributors wanting to replace AWK with Python in the scripts that generate its option-parsing code. There was some support for this proposal at the time, but apparently no one volunteered to do the actual porting, and the AWK scripts live on.
Robbins argues in his 2018 paper for the use of AWK (specifically gawk) as a "systems programming language", in this context meaning a language for writing larger tools and programs. He outlines the reasons he thinks it has not caught on, but Kernighan is "not 100% convinced" that the lack of an extension mechanism is the main reason AWK isn't widely used for larger programs. He suggested that it might be due to the lack of built-in support for access to system calls and the like. But none of that has stopped several people from building larger tools: Robbins' own TexiWeb Jr. literate programming tool (1300 lines of AWK), Werner Stoop's d.awk tool that generates documentation from Markdown comments in source code (800 lines), and Translate Shell, a 6000-line AWK tool that provides a fairly powerful command-line interface to cloud-based translation APIs.
Several developers in the last few years have written about using AWK in their "big data" toolkit as a much simpler (and sometimes faster) tool than heavy distributed computing systems such as Spark and Hadoop. Nick Strayer wrote about using AWK and R to parse 25 terabytes of data across multiple cores. Other big data examples are the tantalizingly-titled article by Adam Drake, "Command-line Tools can be 235x Faster than your Hadoop Cluster", and Brendan O'Connor's "Don’t MAWK AWK – the fastest and most elegant big data munging language!"
Between ad-hoc text munging, build tooling, "systems programming", and big data processing — not to mention text-mode first person shooters — it seems that AWK is alive and well in 2020.
[Thanks to Arnold Robbins for reviewing a draft of this article.]
The weighted TEO cpuidle governor
Life gets complicated for the kernel when there is nothing for the system to do. The obvious response is to put the CPU into an idle state to save power, but which one? CPUs offer a wide range of sleep states with different power-usage and latency characteristics. Picking too shallow a state will waste energy, while going too deep hurts latency and can impact the performance of the system as a whole. The timer-events-oriented (TEO) cpuidle governor is a relatively new attempt to improve the kernel's choice of sleep states; at the 2020 Power Management and Scheduling in the Linux Kernel Summit, Pratik Sampat presented a variant of the TEO governor that tries to improve its choices further.
Sampat started with a bit of background. The TEO governor is based on the idea that timer events are the most likely way that a system will wake up; they also happen to be the most deterministic, since they are known before the system goes idle. But CPUs are subject to wakeups from other sources — interrupts in particular — and that complicates the situation. So the TEO governor maintains a short history of actual idle times that is used to come up with a (hopefully) better guess for what the next idle period will really be.
This history is an eight-entry circular buffer that indicates the recent pattern of non-timer wakeups. When the time comes to pick an idle state, the TEO governor looks at how many of those wakeups led to a sleep time that was less than expected; if the answer is "a majority of them", then the average observed sleep time is used to select an idle state that is shallower than would have otherwise been chosen. It works well, he said, but maybe it can be made better?
He started by testing the idea of whether more history would improve the situation. Increasing the size of the idle-times buffer to 128 did not really help, though. With a set of benchmark results, Sampat showed performance numbers that were sometimes better and sometimes worse; latency often improved, but power consumption got much worse. More history led to the selection of shallower sleep states more often, in other words.
It turns out, he said, that an average is not a good model of the distribution of sleep times, and a longer history may not reflect what is going to happen in the future. So he concluded that what is needed is to store and manage the history differently. The cpuidle governor would benefit from a way to answer a specific question: if the kernel is about to pick a given sleep state, what are the chances that the actual sleep time will better match a sleep that is one level shallower?
The weighted governor
The result was the weighted-history TEO governor, which replaces the history buffer with an NxN matrix, where N is the number of sleep states supported by the processor. The rows correspond to the sleep state the TEO governor would pick in any given situation; each column along that row indicates the probability that the corresponding state should actually be chosen. If the system in question had three sleep states ("shallow", "medium", and "deep"), the matrix would be initialized to look like this:
             Shallow   Medium   Deep
  Shallow      70%       15%     15%
  Medium       15%       70%     15%
  Deep         15%       15%     70%
In other words, the matrix is set up so that the chances of each state selection being correct are 70%, with the remaining 30% spread across all the other states. Giovanni Gherdovich asked whether this initial distribution was hard-coded; the answer was "yes for now", and that the values have been chosen from a set of experiments Sampat ran.
After each wakeup, the actual behavior is measured and the probabilities are tweaked accordingly. The actual amount of adjustment that should be performed is still unclear, he said; more experiments and testing are needed.
When it comes time to make a prediction, the governor uses a biased random-number generator to pick a state; the biasing is done so that the chances of picking any particular state are the same as the observed probability that said state is the correct one. Why do that rather than just pick the highest-probability state? Often it turns out that the probabilities are fairly close, so a subset of the available states are all about as likely to be correct. The system will self-correct when the random-number generator steers it wrong.
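As a rough illustration of the idea (a user-space sketch in Python, not the proposed kernel code), the selection amounts to a weighted random choice over the row of the matrix corresponding to the state the unweighted TEO governor would have picked:

import random

# Initial weights from the example above: rows are the TEO governor's choice,
# columns are the state that is actually selected.
weights = {
    "shallow": {"shallow": 70, "medium": 15, "deep": 15},
    "medium":  {"shallow": 15, "medium": 70, "deep": 15},
    "deep":    {"shallow": 15, "medium": 15, "deep": 70},
}

def pick_state(teo_choice):
    row = weights[teo_choice]
    states = list(row)
    return random.choices(states, weights=[row[s] for s in states])[0]

print(pick_state("medium"))   # usually "medium", occasionally a neighboring state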
Results
A number of benchmark results followed, showing variable results. With schedbench, latency was better some times and worse others, but power consumption was always less. The accuracy of the sleep-state choices was similar to the unweighted TEO governor for a small number of threads, but improved for larger numbers of threads. Rafael Wysocki, the author of the TEO governor, said that he was surprised to see TEO doing as well as it does; he deliberately chose a simple algorithm to minimize the overhead involved.
Sampat modified the ebizzy benchmark to make it do occasional sleeps, and got better results than TEO for both throughput and power consumption. The pgbench benchmark showed mixed results, with things getting worse as more clients were added. Hackbench runs saw better results with relatively short run times, and a consistent 8-10% improvement in power consumption.
At this point, some confusion about the results became evident. Sampat characterized the results as "overshooting" or "undershooting", which most people expected to refer to the sleep state chosen, but actually referred to the sleep residency time. So "overshooting" meant picking a sleep state that was too shallow — the residency time overshot the estimate. This terminology seems highly likely to change in the near future.
Wysocki observed that picking a sleep state that is too shallow is generally better than picking one that is too deep. Not sleeping deeply enough will cost some power, but sleeping too deeply can hurt the performance of the system (in both latency and throughput terms).
Sampat finished with an overview of the work that is yet to be done. The aging algorithm still needs some work; workloads change over time, and old history can lead to poor predictions going forward. He tried simply decaying the highest-probability state, but that led to large variance in the results.
Another issue is the initial weights put in the matrix; these were determined through experiments, but more testing is needed. Wysocki disagreed, though, saying that with proper aging, the initial states don't matter much. The governor will correct itself over time. But that depends on the aging working well, so that is the important part to work on. The session concluded with Wysocki saying that the work looks promising and can be discussed further on the mailing list.
Testing scheduler thermal properties for avionics
Linux is not heavily used in safety-critical systems — yet. There is an increasing level of interest in such deployments, though, and that is driving a number of initiatives to determine how Linux can be made suitable for safety-critical environments. At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), Michal Sojka shone a light on one corner of this work: testing the thermal characteristics of Linux systems with an eye toward deployment in avionics systems.
In particular, his focus is on how scheduling decisions can affect the thermal behavior of computers in avionic systems; this effort is part of the European THERMAC project. The requirements for avionic systems include doing without both fans and heavy heat sinks while getting as much performance out of each system as thermal constraints will allow. There is no room for missed deadlines in safety-critical work, so there is not much space for the usual thermal-management techniques there. But these systems also support best-effort workloads that run when time and temperatures allow; that is where it may be possible to improve the situation with clever power management.
These systems tend to use time-partitioned scheduling. Each safety-critical task runs within its own time window; any time left over within the window when that work is done can be used for best-effort workloads. The good news, Sojka said, was that the workloads on these systems are well understood; that is a distinct difference from the systems discussed in the previous session, where the kernel has to make guesses about what is going to happen next.
This work, so far, has not yet come up with any thermal-aware scheduling strategies; that is for a later stage. What is being done now is to put together the framework for evaluating such strategies so it will be possible to know which ones actually work. To that end, the project has built a testbed based on a leading-edge NXP i.MX8 board; thermal sensors and a thermal camera have been added to that. Control groups are being used to simulate the scheduling windows that will be used on a real system.
The work so far has resulted in a framework called "thermobench"; Sojka described it as "a fancy CSV file generator". It will run a series of benchmarks, capturing measurements (temperatures, CPU frequencies, CPU loads, etc.) as they go. When the runs are complete, the system can create plots of what happened. The benchmarks in the repository now include various micro-instruction tests and tests that evaluate a variety of sleep patterns.
The system can also perform model fitting in order to get a sense for the changes that happen at different time scales; some changes happen much more quickly than others, leading to a model equation with three distinct terms. The temperature at the heat sink can change within a minute, while whole-board temperature changes play out over four or five minutes. There is also an 18-minute term which, he surmised, was the response of the entire testbed. Among other things, these results tell them how long each test needs to run for.
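One plausible form for such a model (the article does not give the project's actual equation, so this is an assumed first-order exponential fit) would be:

T(t) = T_0 + A_1\left(1 - e^{-t/\tau_1}\right) + A_2\left(1 - e^{-t/\tau_2}\right) + A_3\left(1 - e^{-t/\tau_3}\right)

with \tau_1 on the order of a minute (heat sink), \tau_2 around four or five minutes (board), and \tau_3 around 18 minutes (testbed).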
In conclusion, he said, thermobench will be useful for comparing various thermal management strategies. He wondered whether others might find it useful for their areas as well. Vincent Guittot asked whether the tests included CPU-frequency scaling; Sojka answered the tests that were shown are all single-frequency tests, but multiple-frequency tests have been done as well. He said that temperature is not a linear function of CPU frequency, but did not get into details.
Rafael Wysocki said that the tests should always measure both the power consumption of the board and the temperature, since the two are somewhat independent of each other. Giovanni Gherdovich asked whether the realtime preemption patches had been tested, noting that kernels with those patches have different performance and power-usage profiles. Sojka answered that the test board is quite new and is currently not able to run a mainline kernel; he expressed interest in hearing what NXP's plans are for getting support upstream. Once that happens, he will be happy to experiment with the realtime patches.
Souvik Chakravarty pointed out that a number of factors affect power usage. For example, what is the power structure of the board? If all CPUs are on a single power rail, it will be necessary to stop them all to gain significant power (and thermal) savings. Sojka said that the processor in question has six big.LITTLE CPUs, and the project is testing on the little CPUs only. But details like the power layout are not entirely clear.
Sojka concluded by encouraging attendees to check out the thermobench code, which had been posted that very day.
Utilization inversion and proxy execution
Over the years, the kernel's CPU scheduler has become increasingly aware of how much load every task is putting on the system; this information is used to make smarter task placement decisions. Sometimes, though, this logic can go wrong, leading to a situation that Valentin Schneider describes as "utilization inversion". At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), he described the problem and some approaches that are being considered to address it.
Utilization tracking, initially conceived as per-entity load-tracking or PELT, gives the scheduler information about what the impact of running a task on a given CPU will be. It is used in the schedutil CPU-frequency governor to select a frequency appropriate to the current workload. On Arm big.LITTLE systems, where some processors are faster and more power-hungry than others, the utilization-tracking signal is also used to decide which type of CPU a task should run on. The situation is complicated a bit by utilization clamping, which allows an administrator to make a task's utilization appear larger or smaller than it really is. Clamping is used to bias the placement of specific tasks and influences CPU-frequency selection as well.
Imagine, Schneider said, a large task (one with a high utilization) and a small task, both of which are contending for the same lock. The large task may have a high minimum clamp, so it looks like an even bigger load even when it is not doing much; the small task, instead, may have a low maximum, ensuring that it always looks small. One would expect the large task to run on a big CPU at a high frequency while the small task is consigned to a small CPU at a low frequency. If the small task grabs the lock, the large task's progress suddenly depends on how quickly the small task can progress.
This situation is similar to priority inversion, though the problem is not as severe. Even so, it would be better if the small task could inherit some of the large task's resources while it holds the lock.
The kernel's realtime mutexes can handle priority inheritance now; if a high-priority task contends for a lock held by a low-priority task, the latter will have its priority boosted until it drops the lock. Priority inheritance can help, but it only affects process priority; it can force preemption, but it does not really change task placement or CPU frequency. Perhaps the kernel could gain a similar mechanism for utilization that would help for placement, at least, if not CPU frequency. Schneider expressed skepticism that such an approach could work well, though.
An alternative he has been working on is proxy execution: giving the lock-holding task the waiting task's execution parameters until it lets go of the lock. This is a work in progress, he said, that doesn't survive for more than 20 seconds on real hardware, and it has no provision for futexes (user-space locks), but it still has some interesting properties.
With proxy execution, a task that blocks on a mutex is not removed from the run queue as it would be in a mainline kernel. It can thus be picked to run by the scheduler in the usual way if it's the highest-priority task in the queue. When that happens, though, the lock-holding task inherits the blocked task's scheduling context. The blocked task is also migrated to the run queue of the lock holder, which brings its utilization information over; that will cause the CPU frequency to be increased, helping the lock holder to get its work done and release the lock.
That solves the problem reasonably well on symmetric multiprocessor systems, but it still falls short on asymmetric systems like big.LITTLE. To address such systems, Schneider would like to put the utilization-tracking information into the scheduling context, where it can be passed more directly to a lock holder. This has to be done carefully, though, or it could create priority inversions of its own; if a low-utilization task is picked to run, it could end up slowing a high-utilization task. Making a smart choice is hard, though, since the utilization signals are highly variable and hard to track in the proxy-execution code. The solution might be to ignore the utilization values and just look at the clamps.
Juri Lelli asked why this mattered, since the clamp values are already aggregated on each run queue. That works for frequency selection, Schneider answered, but it has no influence on task selection, so it doesn't help to ensure that the lock-holding task actually runs.
Then, there is the perennial problem of load balancing. Utilization signals are highly useful here, since they let the scheduler ensure that the load on each CPU is about the same. But what should be done in the proxy execution case? Currently, load-balancing decisions will use the scheduling context of the donor task (the one waiting for the lock), which could lead to interesting decisions. Since contending tasks remain on the run queue, the apparent load on the CPU increases, which can throw things off as well. Peter Zijlstra said that this isn't necessarily a big problem; one does not expect locks to be held for long periods, so things should straighten themselves out relatively quickly.
Patrick Bellasi asked whether just relying on clamp values is sufficient, or whether the load-tracking signal should be used too. Schneider responded that using the clamps really is the best that can be done; there is no choice. Utilization values simply change too quickly to be useful.
Heading toward a conclusion, Schneider said that getting proxy execution working right is his first priority; presumably rebooting after 20 seconds of uptime is getting a little tiresome. He asked whether other developers were interested in proxy execution as well. Zijlstra said that he has been trying to get it to the top of his list for a long time, but has been "failing miserably".
Qais Youssef asked how quickly this work might be done. The next Android release will not be happening for some time, so it would be nice if there were some way to fix this problem in the short term. Could the realtime mutex code help? Zijlstra responded that realtime mutexes are really for realtime processes and won't help with tasks in the completely fair scheduling class, as most Android tasks are. We will get the problem solved when we do, he said.
The session concluded with numerous developers saying that they would like to have a working proxy execution mechanism in the kernel, but nobody has found the time to work on it.
The many faces of "latency nice"
A task's "nice" value describes its priority within the completely fair scheduler; its semantics have roots in ancient Unix tradition. Last August, a "latency nice" parameter was proposed to provide similar control over a task's response-time requirements. At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), Parth Shah, Chris Hyser, and Dietmar Eggemann ran a discussion about the latency nice proposal; it seems that everybody agrees that it would be a useful feature to have, but there is a wide variety of opinions about what it should actually do.
A different kind of nice
Shah started by describing the latency nice value as a per-task attribute that behaves much like the normal nice value. It gives the scheduler a hint about what the task's latency needs are. It can be tweaked via the sched_setattr() system call, though there is some desire to switch to a control-group interface. Its values vary between -20 and 19 (as with nice), with -20 indicating a high degree of latency sensitivity and 19 indicating a complete indifference to latency. The default value is zero.
The first question he raised had to do with privilege: should an unprivileged process be able to decrease its latency nice value? Ordinary nice does not allow that, of course; processes must have the CAP_SYS_NICE capability to reduce their nice values. The advantage of establishing a similar rule for latency nice is that it might block potential denial-of-service problems, but at the cost of preventing ordinary users from taking advantage of this feature.
Whether this knob should be privileged depends on what it actually does, which had not yet been discussed. The initial effect of this feature is to control how hard the scheduler will look for an idle core to place a task on when it wakes up. This search takes time (thus increasing latency); an idle core may also have to be roused out of a sleep state, increasing latency further. Dhaval Giani pointed out a use case that Oracle cares about, where some latency-sensitive tasks will typically run for very short periods — less than the time spent searching for an idle core sometimes. That search can be avoided by setting a low latency nice value.
Giani also mentioned a use case from Facebook, which is more interested in getting longer-running tasks up to full speed quickly; Facebook still wants low latency, but is better served by finding an idle core that will be able to get a significant amount of work done quickly. IBM, meanwhile, is hoping to use this knob to influence the scheduler to avoid placing tasks on a CPU that is currently running latency-sensitive tasks. The discussion on use cases was cut off at this point, though, with a promise to revisit it later.
Returning to privilege, Qais Youssef suggested keeping the ability to reduce latency nice values as a privileged operation for now, especially given that this knob could gain new meanings in the future. Shah said that there do not appear to be any denial-of-service issues with the implementation for the current use cases.
Eggemann wondered about the range of values for this knob; there is a wish to bias latency in both directions, but it's not clear what the actual effects of a positive latency nice value would be. Patrick Bellasi suggested that the time before one task could preempt another could be scaled by the latency nice value. Vincent Guittot said that, with ordinary nice, each increment makes about a 10% difference in the amount of CPU time the process may use. With latency nice, he said, the values of -20, zero, and +19 make sense, but he couldn't say what the values in between would mean. Hyser said that, for negative values, there could be a fairly direct effect on the number of CPUs that will be searched before placing a task. Shah suggested that positive values could allow task placement anywhere in the system, even to CPUs that do not share low-level memory cache, which is something the scheduler normally tries hard to avoid.
Eggemann then expressed a sentiment that would be heard a few times in the session: latency nice is trying to control too many functionalities with a single knob. Bellasi suggested that the use cases could be hammered out during review of the patch and asked whether there were any real use cases with contradictory semantics. Giani mentioned the Oracle and Facebook cases mentioned above.
Control groups
Eggemann took over the presentation at this point to talk about what the Android developers would like to see. Android currently uses a control-group interface that includes a "prefer idle" attribute; setting that will bias CPU selection toward an idle CPU. The real effect of this setting, though, is to short out the energy-aware scheduling logic, which brings a certain amount of latency of its own. Thus, in this context, searching for an idle CPU is something that is preferable to do for latency-sensitive tasks — just the opposite of the situation described above.
His real purpose, though, was to discuss a potential control-group-based interface for latency nice. Control groups are a mechanism to organize processes and distribute resources, which is what is needed here. With the CPU controller, there are three ways in which CPU resources are controlled. The "weight" value gives a relative priority to the group, while the "max" value limits the maximum CPU time available and the "min" value ensures that a minimal amount of CPU time will be granted. Utilization clamping is also handled here.
How could the latency nice value be managed in this setting? The resource controlled would still have to be CPU cycles, he said. But the association between latency requirements and CPU cycles is not as clear as it is with the parameters described above. He is not sure what sort of semantics would be acceptable to the control-group maintainer. Bellasi suggested a clamping model, where each group would have values indicating the minimum and maximum latency nice values a task in that group could request. Guittot pointed out, though, that changes to latency nice values would have to be propagated up to the root of the control-group hierarchy. The discussion wandered around this point for a while before bogging down in just how latency nice would work.
Eggemann eventually suggested moving on, saying that perhaps the use cases should have been discussed from the outset. The control-group interface is only really important to Android, he said, so perhaps it would be better to figure out what the per-task attribute implementation would actually be doing.
Use cases at last
Hyser took over at this point to talk about use cases; he reiterated that the original purpose of the patch set was to skip the idle-CPU search for latency-sensitive tasks. This resulted in a 1% increase in a transaction-processing benchmark. Many workloads have critical processes that do not run for long but need to run immediately when the time comes. The latency nice change can make it possible for many of these workloads to avoid the need to use the realtime patches.
He put up some plots showing that latency nice does result in better latencies; the effect is more pronounced on systems with more cores.
He suggested that negative values should be interpreted as the number of cores to search; a value of -20 means search no cores at all, -19 would search one core, etc. But should this value be scaled by the number of CPUs in the system? It's still not clear how it should be interpreted. He suggested that latency nice looks a lot like a Boolean value in real-world use; either other cores are searched to place a task or not.
Giani said that the effect of changing a task's nice value is well understood; the effect of changing latency nice is rather less so. Hyser suggested that it could be seen as adjusting the size of the scheduling domain for latency-sensitive tasks. But scheduling domains are hardware dependent, making it hard to come up with a hardware-independent description of the semantics of latency nice. The -20 value, which searches zero cores, is not dependent on hardware at least, Hyser said. He concluded by saying that a value of -1 could mean that the CPU search would happen, but energy-aware scheduling would be disabled.
Giani said that latency nice appears to be trying to do a bunch of things and wondered if it makes sense to control it all with a single interface; Peter Zijlstra responded that those things do all affect latency, at least. Rafael Wysocki said that a single integer value is not enough to express everything that is needed here. Zijlstra said that the session really should have started with the use cases, then looked at tunables to suit those cases.
Shah discussed the task-packing use case. In particular, on systems with Intel's "turbo" mode, packing tasks onto a small number of cores can save enough resources to allow others to go into turbo mode. He suggested that tasks marked with a latency nice value greater than 15 could be packed this way, as long as they don't push the utilization of the target core above a threshold value. Doing so led to a 14% performance benefit on a workload he tested.
Another use case involves restricting the sleep states that a CPU can go into. The pm_qos mechanism can do that now, but it is a system-wide parameter with no per-task control, so it does not work as well as one would like on larger systems; it has no notion of where the latency-sensitive task will run. He suggested implementing a per-CPU counter indicating how many latency-sensitive tasks are present; if a CPU is running such tasks, the sleep states it could go into would be restricted.
Wysocki responded that this isn't a realistic scenario. It could become confused if the task is migrated, for example; he said that latency nice is not a good interface for this case. There is no way to map a latency nice value onto the set of permissible exit latencies for the CPU. Bundling semantics in this way is not going to work, he said. Bellasi said that such an interface would require users to determine their latency nice values through experimentation on a specific platform.
Shah persisted, though, saying that it can be beneficial to keep CPUs with latency-sensitive tasks from going idle. Scheduler benchmark runs showed a significant latency reduction with these semantics while maintaining similar power consumption. A pgbench run also showed big improvements in latency, but at a cost (sometimes large) in power consumption.
Youssef said that the interface to all of this is the sticking point. Thomas Gleixner agreed, saying that the -20..19 range "requires a crystal ball" to use properly. Zijlstra repeated his call to enumerate the use cases before getting into the interface details. Giani repeated that the interface does not look correct now, and agreed that a more comprehensive look at the use cases was needed. Things were being done backwards currently, he said.
Eggemann concluded by saying that the group needed to collect use cases and "take them all seriously". While the discussion continued to circle around these points for a while, it was, for all practical purposes, done.
[See the slides from this session [PDF] for more plots and other details.]
Scheduler benchmarking with MMTests
The MMTests benchmarking system is normally associated with its initial use case: testing memory-management changes. Increasingly, though, MMTests is not limited to memory management testing; at the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), Dario Faggioli talked about how he is using it to evaluate changes to the CPU scheduler, along with a discussion of the changes he had to make to get useful results for systems hosting virtualized guests.
Kernel benchmarking, he began, is typically done on bare metal. Developers want to know what the impact of a given kernel change might be, so they run a series of tests to measure performance in a reproducible setting. But Faggioli works in SUSE's virtualization lab, which has a more complicated set of objectives. A kernel change might have one effect on a host, but a different effect in guests running on that host. That leads to a need to run benchmarks with various combinations of baseline and changed kernels. Life gets even more interesting when you consider that benchmarks can take varying amounts of time to run between the host and a guest, or even between guests. Without some extra effort, a series of tests running simultaneously will not line up in any sort of predictable or repeatable way.
For example, consider a test for hypervisor scheduling fairness. If the scheduler is fair, guests with equal computing requirements should get equal amounts of CPU time. One way to test that is to ensure that every benchmark takes the same amount of time to run. Even in the presence of fair scheduling, though, there may be differences in run times between one virtual machine and the next. If a series of tests is being run, the VMs could end up running different tests at any given time, muddying the results. The only way to get clear and deterministic results is to ensure that the benchmark runs on all of the systems proceed in a synchronized manner.
There are, he said, a lot of testing and benchmarking suites to choose from. None of them, though, is able to perform synchronized runs in multiple virtual machines. He decided that the time had come to implement a suite that could do that, but he didn't want to start from scratch, so he based his work on MMTests.
The MMTests suite dates back to at least 2012, Faggioli said (LWN covered it in August of that year). While it was initially focused on memory-management changes, that is no longer the case. It is mostly implemented in a combination of Bash and Perl. The core suite is able to fetch, build, configure, and run a whole range of benchmarks. Multiple runs can be made, with MMTests collecting and storing both the configuration that was used and the results that were obtained. A set of tools exists to compare results between runs, create plots, and more. There is also a "monitor" functionality that can capture the output from various monitoring commands (top, vmstat, iostat, etc.) as well as from sources like ftrace and perf events. The set of benchmarks that can be run is large, consisting of most of the tools that kernel developers have found useful over the years.
The configuration file for MMTests is a Bash script containing a lot of export lines describing the tests to be run. There are commands to query system characteristics, such as the number of NUMA nodes; the results can be used to size the benchmarks appropriately. It is quite intuitive, Faggioli said — as long as you are familiar with the specific benchmarks you want to run. The run-mmtests.sh script will actually run the tests; there is a compare_mmtests.pl script to see what changed between different runs. Use graph-mmtests.sh to make pretty plots.
It is possible to try running MMTests as a regular user, he said, but that's not necessarily the best idea. The tests won't fail, but MMTests will not be able to do everything it needs to get a proper run. It may, for example, try to make changes to the CPU-frequency governor. It tries to undo such changes at the end, but it's still better to run the tests on a disposable machine if possible. MMTests will download benchmarks from the net, then run them as root, which may give some users pause. It is possible to set up a local mirror, which can be good for both performance and security.
For tests involving virtualization in particular, the run-kvm.sh script should be used; it will get results from both the host and guest(s). The script sets up and starts any virtual machines, as well as generating SSH keys to connect to those machines. The MMTests directory is copied directly into the virtual machines and the tests are run there. There are different configuration files for the host and the virtual machines; one may want to collect different data in the two settings, he said.
Synchronization, which Faggioli had to add to MMTests, is handled by passing tokens between the host and the virtual machines; the guests never talk directly to each other. The host implements a "barrier" before each benchmark run; once every virtual machine has informed the host that it is ready for the next test, they are all told to proceed to the next one. This ensures that the tests on all systems start at the same time.
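The protocol is simple in principle; here is an illustrative sketch in Python (MMTests itself is written in Bash and Perl, so this is not the actual implementation): every guest posts a "ready" token, and the host replies "go" to all of them only once everyone has checked in.

import queue
import threading

class HostBarrier:
    def __init__(self, nr_guests):
        self.ready = queue.Queue()
        self.go = [queue.Queue() for _ in range(nr_guests)]

    def host_release(self):
        for _ in self.go:        # wait for a "ready" token from every guest
            self.ready.get()
        for chan in self.go:     # then release all guests at once
            chan.put("go")

    def guest_wait(self, guest_id):
        self.ready.put("ready")
        return self.go[guest_id].get()

barrier = HostBarrier(2)
for i in range(2):
    threading.Thread(target=barrier.guest_wait, args=(i,)).start()
barrier.host_release()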
Faggioli has various patches that he had really intended to submit before the talk, but that didn't happen despite his proclaimed affinity for "conference-driven development". That should happen soon. With regard to documentation, he said, there is absolutely none. But there is a nice ASCII-art diagram in the script for virtual-machine synchronization, at least. He concluded by saying he has considered rewriting the whole thing in Go, but he was not sure if Mel Gorman, the maintainer of MMTests, would be up for such an idea. Gorman, who was present at the event, held his peace regarding this idea.
Douglas Raillard spoke up after Faggioli finished to say that Arm has a test suite that it uses; it lacks virtual-machine synchronization, though. It does some statistical testing on the results; he wondered if there were plans for adding that to MMTests. Faggioli said that he is not a statistician and wouldn't add that capability himself. Gorman said that MMTests does enough evaluation to try to guess whether a specific difference is significant; that is rather subtly marked in the output and is often missed. The fact that it is undocumented probably doesn't help. Raillard also asked about getting output in JSON format; Faggioli said there is JSON "in there somewhere" but he doesn't use it.
The session concluded at this point. See Faggioli's slides [PDF] for details, example plots, configuration files, and more.
Evaluating vendor changes to the scheduler
The kernel's CPU scheduler does its best to make the right decisions for just about any workload; over the years, it has been extended to better handle mobile-device scheduling as well. But handset vendors still end up applying their own patches to the scheduler for the kernels they ship. Shipping out-of-tree code in this way leads to a certain amount of criticism from the kernel community but, as Vincent Donnefort pointed out in his session at the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), those patches are applied for a reason. He looked at a set of vendor scheduler patches to see why they are being used.
As a testbed for these patches, Donnefort chose the Pixel 4 phone. It's a device with good upstream support, so it's easy to replace its kernel without the need for lots of other out-of-tree code. This device has three different CPU core sizes (small, medium, and large), where the small cores are small indeed. It is imperative to pick the correct CPU for any given task, or there will be a cost to pay in performance or energy use. The PCMark benchmark was used to evaluate performance, while power measurement was done directly from the phone's power rails. A 4.14 kernel was used for the tests.
The first patch tested performs CPU isolation by actively evacuating tasks to other CPUs; the intent is to idle the CPU and let it be put into a sleep state. Tasks are migrated, interrupts are directed elsewhere, and the CPU is removed from the load balancer's attention entirely; kernel threads attached to that CPU still run, though. This is, he said, a sort of lightweight form of CPU hotplug.
This patch works by looking at the load presented by all of the running tasks and calculating how much CPU power is needed. If the number of running CPUs exceeds what is needed, it will try to isolate one or more of them. This decision is made in user space.
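As a hedged sketch of that decision (the function names and the 80% headroom target are illustrative assumptions, not the vendor's actual code), the user-space logic amounts to something like:

    #define CPU_CAPACITY     1024   /* full capacity of one CPU, in kernel-style units */
    #define TARGET_UTIL_PCT    80   /* keep remaining CPUs below 80% busy (assumed figure) */

    /* How many of the online CPUs could be isolated given the current total load? */
    static int cpus_to_isolate(unsigned long total_util, int online_cpus)
    {
        unsigned long per_cpu_budget = CPU_CAPACITY * TARGET_UTIL_PCT / 100;
        int needed = (total_util + per_cpu_budget - 1) / per_cpu_budget;

        if (needed < 1)
            needed = 1;             /* always keep at least one CPU running */

        return online_cpus > needed ? online_cpus - needed : 0;
    }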
In performance testing, Donnefort found that CPU isolation reduces throughput slightly, but also gives a 4% drop in power consumption. Vincent Guittot asked why the energy model built into the kernel couldn't handle this task; Donnefort responded that he didn't try to evaluate alternative solutions to the problem. The results show, though, that there is room for improvement.
The other patches were presented as a set. They were:
- "Migration margins": this patch changes the way the kernel picks a CPU for a task on an asymmetric system. This is done by comparing the task's expected utilization with the capacity of the CPU; the mainline kernel will only place a task on a CPU if there will be 20% of its capacity left afterward. The vendor patch lowers this margin to 5%, thus increasing the chance that a given task will end up on a smaller, more energy-efficient CPU.
- There is a change to how the scheduler does task packing. The mainline tries to keep tasks contained within a single cluster (thus possibly allowing other clusters to go idle), but will try to spread out tasks across the CPUs in a cluster. The vendor patch, instead, works harder to pack tasks into a single CPU, though stopping before it would become necessary to increase the CPU's frequency.
- The mainline puts some effort into finding the most efficient CPU to run any given task on — too much time, it seems, for some vendors, who make a change to that algorithm. With this change, the kernel decides where to put a task by first looking at where it was running last time; if that CPU is idle and the task fits there, the placement logic will be shorted out and that CPU will be chosen immediately. He noted that energy-aware task placement has improved considerably since the release of the 4.14 kernel used for these tests.
- When placing a realtime task, the kernel performs a search for the CPU that is running the lowest-priority task; that will be the easiest one to preempt. The vendor patch expands this search to look at utilization and idle states as well, trying to find the CPU that is the least busy overall. The search is also biased toward finding the smallest suitable CPU.
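To make the "migration margins" item concrete, here is a sketch of the capacity-fit test involved. The mainline check follows the kernel's usual fixed-point style with values scaled by 1024; the 5% variant is an illustration of what the vendor change does, not the exact patch:

    /* Mainline-style check: the task's utilization must leave 20% headroom. */
    static inline int task_fits_mainline(unsigned long util, unsigned long capacity)
    {
        return util * 1280 < capacity * 1024;   /* util below 80% of capacity */
    }

    /* Vendor-style check (illustrative): only 5% headroom is required. */
    static inline int task_fits_vendor(unsigned long util, unsigned long capacity)
    {
        return util * 1078 < capacity * 1024;   /* util below roughly 95% of capacity */
    }

With the smaller margin, more tasks "fit" on the small cores, so the placement logic chooses them more often.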
The benchmark results for each of these patches were remarkably similar. They all tended to hurt performance by 3-5% while reducing energy use by 8-11%. What Donnefort did not do, though, was to benchmark a system with all of them applied; he cautioned against assuming that those differences would be additive with all of the patches in the system.
He concluded with the simple assertion that, even if some of these changes are controversial, they are clearly useful in this setting. He will be looking at ways of getting those changes into an acceptable form for merging upstream.
In the discussion, Qais Yousef said that some of his recent CPU-capacity changes might be able to replace some of this work. Dietmar Eggemann asked why the energy model wasn't providing CPU isolation now; it should already be pushing things aggressively toward small CPUs. Peter Zijlstra agreed that it was important to figure out why that workaround was necessary; perhaps the scheduler should look more closely at idle states in the energy-aware path. Donnefort said that CPU isolation in this form is probably not the right solution for the mainline kernel, but it does show that there is something to be gained that way.
See Donnefort's slides [PDF] for detailed results and more.
Bao: a lightweight static partitioning hypervisor
Developers of safety-critical systems tend to avoid Linux kernels for a number of fairly obvious reasons; Linux simply was not developed with that sort of use case in mind. There are increasingly compelling reasons to use Linux in such systems, though, leading to a search for the best way to do so safely. At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), José Martins described Bao, a minimal hypervisor aimed at safety-critical deployments.
The actual target, he began, is "mixed-criticality" systems in which multiple software stacks run in parallel with each other; some of those stacks are safety-critical, while others are not. For example, a system could have a user interface running on Linux alongside the safety-critical application that it controls. There is an industry trend toward consolidating systems in this way, driven by power considerations and the availability of processors with numerous CPUs.
Virtualization is naturally interesting for the developers of such systems; it minimizes the effort required to port systems and eases the integration within them. Good virtualization provides fault isolation, preventing failures in one part of the system from interfering with others. Developers want the usual things from such a system: good performance, realtime guarantees, and strong security.
Martins spent some time looking at solutions like Xen and KVM. They were not designed for this kind of use case, but they end up being used anyway. Neither is an optimal solution; they use virtualized I/O mechanisms that add a lot of overhead, and their code bases are large and hard to audit.
Instead, he said, there is a role here for a static partitioning hypervisor, which can be seen as a thin configuration layer that divides up a system's resources. Under a system like this, there is a one-to-one mapping between virtual CPUs and physical CPUs, so there is no contention for CPU time. Devices are mapped directly into the guests, avoiding any added I/O overhead. Perhaps the best-known hypervisor of this type is Jailhouse, but that didn't meet Martins's needs; it depends on a Linux "root cell" to run the whole show, its boot time is relatively long, and there is still a big code base to audit. The Xen Dom0-less project can do direct device assignment, which is nice, but it falls short in other ways.
So Martins set out to create Bao as a "type-1 bare-metal hypervisor" with a one-to-one CPU mapping. It doesn't depend on any sort of privileged virtual machine or operating system to boot. Bao provides a simple inter-VM communication mechanism based on shared memory and virtual interrupts. It depends on hardware assistance for many of its functions, including second-stage address translation, an I/O memory-management unit, and virtual interrupts. Bao can use huge pages to reduce translation lookaside buffer pressure and page-table memory use; it is also able to perform cache coloring for memory allocations to avoid low-level cache interference between machines.
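As an illustration of what guest software sees with this model, a guest might talk to a peer through something like the sketch below; the shared-memory address, the mailbox layout, and the notify_peer() helper are assumptions standing in for whatever region Bao actually maps and whatever virtual interrupt it injects:

    #include <stdint.h>

    /* Hypothetical layout of the shared-memory region mapped into both guests. */
    struct ipc_mailbox {
        volatile uint32_t ready;      /* producer sets this to 1 when data[] is valid */
        volatile uint32_t len;        /* number of valid bytes in data[] */
        volatile uint8_t  data[4088]; /* payload; sized to fill one 4KiB page */
    };

    #define IPC_SHMEM_BASE ((struct ipc_mailbox *)0x80000000UL)  /* assumed guest mapping */

    extern void notify_peer(void);    /* assumed helper that raises the peer's virtual interrupt */

    void send_to_peer(const uint8_t *buf, uint32_t len)
    {
        struct ipc_mailbox *mbox = IPC_SHMEM_BASE;

        for (uint32_t i = 0; i < len && i < sizeof(mbox->data); i++)
            mbox->data[i] = buf[i];
        mbox->len = len;
        mbox->ready = 1;              /* a real implementation would need a memory barrier here */
        notify_peer();
    }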
Bao currently targets the Armv8 architecture. There is a RISC-V port, but the virtualization specification for RISC-V is not ready, so this port is not interesting yet. It can run a number of guests, including bare-metal applications, Linux, Android, and various realtime operating systems.
Ideally, he said, Bao would just be a configuration layer that does its work and gets out of the way, but the hardware does not support this mode of operation. Interrupts, for example, have to be mediated through the hypervisor, which is unfortunate since that increases latency. The I/O memory-management unit has a limited number of stream registers, and doesn't cover all devices on some platforms. There is no partitioning mechanism for memory cache on Arm, so the hypervisor must handle isolation via cache coloring.
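Cache coloring works by partitioning physical pages according to the address bits that index the shared cache's sets, so that virtual machines assigned different colors cannot evict each other's cache lines. A hedged sketch of the idea, with made-up parameters for a hypothetical platform rather than Bao's actual configuration, looks like this:

    #include <stdint.h>

    #define PAGE_SHIFT  12            /* 4KiB pages */
    #define NUM_COLORS   8            /* assumed: set-index bits above the page offset */

    /* A page's color comes from the low set-index bits of its physical frame number. */
    static unsigned int page_color(uint64_t phys_addr)
    {
        return (phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1);
    }

    /* Give a page to a VM only if its color is in that VM's assigned color mask. */
    static int page_belongs_to_vm(uint64_t phys_addr, unsigned int vm_color_mask)
    {
        return (vm_color_mask >> page_color(phys_addr)) & 1;
    }

Because colored allocations cannot be grouped into naturally aligned huge pages, this scheme conflicts with huge-page mappings, which is where the extra overhead mentioned below comes from.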
The system is implemented in about 7,000 lines of code, and requires 50KB of memory on the target system. That is "somewhat small", he said, but he is working to get it smaller. Run-time memory requirements add up to about 250KB. Benchmark runs show that the hypervisor adds an execution-time overhead of about 2%. Turning on cache coloring increases that overhead, since that feature is incompatible with the use of huge pages. Interference tests currently show a significant amount of degradation caused by activity in other virtual machines; cache coloring helps but does not completely solve the problem.
Another issue is interrupt latency, which increases significantly due to the need for a round-trip through the hypervisor. There is a fair amount of cross-VM interference caused by interrupts as well; again, cache coloring helps, especially if it is used within the hypervisor too. He has found a way to map interrupts directly into guests, something that is made possible by the one-to-one CPU mapping. That increases the overhead for interrupts intended for the hypervisor itself, but those are relatively rare.
Current work includes adding support for trusted execution environments. The other approaches out there share the trusted code across virtual machines, which is not ideal. Arm is adding support for a "trusted hypervisor" mode but, for now, complex workarounds are required. Martins said that the "dual-world" approach used in this area is not inherently secure; a lot of code has to be added to the secure side, bringing the same old problems with it. It is better, he said, to limit the secure world to core security primitives; he is trying to do that by avoiding TrustZone completely and dedicating a virtual CPU to trusted work. This involves allowing multiple virtual CPUs to run on a single physical CPU.
Overall, he concluded, Bao has turned out to be a good fit for the intended use case.
Brief items
Security
NXNSAttack: upgrade resolvers to stop new kind of random subdomain attack
CZ.NIC staff member Petr Špaček has a blog post describing a newly disclosed DNS resolver vulnerability called NXNSAttack. It allows attackers to abuse the delegation mechanism to create a denial-of-service condition via packet amplification. "This is so-called glueless delegation, i.e. a delegation which contains only names of authoritative DNS servers (a.iana-servers.net. and b.iana-servers.net.), but does not contain their IP addresses. Obviously DNS resolver cannot send a query to “name”, so the resolver first needs to obtain IPv4 or IPv6 address of authoritative server 'a.iana-servers.net.' or 'b.iana-servers.net.' and only then it can continue resolving the original query 'example.com. A'. This glueless delegation is the basic principle of the NXNSAttack: Attacker simply sends back delegation with fake (random) server names pointing to victim DNS domain, thus forcing the resolver to generate queries towards victim DNS servers (in a futile attempt to resolve fake authoritative server names)." At this time, Ubuntu has updated its BIND package to mitigate the problem; other distributions will no doubt follow soon. More details can also be found in the paper [PDF].
A remote code execution vulnerability in qmail
Just in case anybody out there is still using qmail: a remote code execution vulnerability has just been disclosed. Its CVE number is CVE-2005-1513 because, as it turns out, the problem was reported 15 years ago but the fix was refused by the maintainer. "As a proof of concept, we developed a reliable, local and remote exploit against Debian's qmail package in its default configuration. This proof of concept requires 4GB of disk space and 8GB of memory, and allows an attacker to execute arbitrary shell commands as any user, except root (and a few system users who do not own their home directory)."
Security quotes of the week
That doesn't mean there aren't other companies that do sell private data. There are. Lots of them. Data brokers, telcos, some ISPs, and even your local DMV have been caught selling your actual data. But for some reason, everyone wants to keep insisting that Google and Facebook also sell data, when they never have, and have always only sold targeted advertising in which the data only goes in one direction, and not back to the advertiser.
Kernel development
Kernel release status
The current development kernel is 5.7-rc6, released on May 17. "That said, there's nothing particularly scary in here, and it's not like this rc6 is outrageously big or out of control. I was just hoping for less."
Stable updates: 5.6.13, 5.4.41, and 4.19.123 were released on May 14. The 5.6.14, 5.4.42, 4.19.124, 4.14.181, 4.9.224, and 4.4.224 updates followed on May 20.
Distributions
Distribution quote of the week
That’s a saying I’ve attached to Finnix for many years. A hammer is a tool, and when it does its job, you may consciously or unconsciously appreciate it for doing its job, but there are very few hammer fan clubs.
[...] Finnix was never heavily mentioned by its users in the same way a desktop like, say, Linux Mint was. Why does the tool need to be praised?
Development
Going above and beyond with Inkscape 1.0 (Libre Graphics World)
Libre Graphics World is running an extensive interview with several Inkscape developers. "I'd say we're at the point of supporting SVG as much as possible, but we've mostly given up trying to add editing features to the SVG specification. As the W3C is dominated by web browsers who don't need multi page or connectors. I dare not say much more about W3C-specific things. I know that I'm personally disappointed that Inkscape's considerable importance in the SVG creation space does not lend itself to getting the feature we intend to build into Inkscape into the actual SVG specification. This does lead to the problem that going forwards we're likely to have browser incompatibilities."
Five years of Rust
It seems that the Rust programming language has only been around for five years. "With all that's going on in the world you'd be forgiven for forgetting that as of today, it has been five years since we released 1.0 in 2015! Rust has changed a lot these past five years, so we wanted reflect back on all of our contributors' work since the stabilization of the language."
Page editor: Jake Edge
Announcements
Newsletters
Distributions and system administration
Development
- Emacs News (May 18)
- What's cooking in git.git (May 19)
- What's cooking in git.git (May 20)
- LLVM Weekly (May 18)
- OCaml Weekly News (May 19)
- Perl Weekly (May 18)
- PostgreSQL Weekly News (May 17)
- Python Weekly Newsletter (May 14)
- Racket News (May 18)
- Weekly Rakudo News (May 18)
- Ruby Weekly News (May 14)
- This Week in Rust (May 19)
- Wikimedia Tech News (May 18)
Meeting minutes
- Fedora FESCO meeting minutes (May 18)
Miscellaneous
Calls for Presentations
CFP Deadlines: May 21, 2020 to July 20, 2020
The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.
Deadline | Event Dates | Event | Location |
---|---|---|---|
May 22 | May 28-May 31 | MiniDebConf Online | |
June 5 | September 13-September 18 | The C++ Conference 2020 | Online |
June 14 | September 4-September 11 | Akademy 2020 | Virtual, Online |
June 15 | August 25-August 27 | Linux Plumbers Conference | Virtual |
June 30 | July 4-July 5 | State of the Map 2020 | Virtual |
July 1 | September 22-September 24 | Linaro Virtual Connect | online |
July 2 | September 9-September 10 | State of the Source Summit | online |
July 5 | October 28-October 29 | [Canceled] DevOpsDays Berlin 2020 | Berlin, Germany |
July 5 | August 23-August 29 | DebConf20 | online |
July 5 | September 16-September 18 | X.Org Developer's Conference 2020 | online |
July 5 | October 2-October 3 | PyGotham TV | Online |
July 13 | September 29-October 1 | ApacheCon 2020 | Online |
July 15 | October 6-October 8 | 2020 Virtual LLVM Developers' Meeting | online |
If the CFP deadline for your event does not appear here, please tell us about it.
Upcoming Events
Netdev conference 0x14 going virtual
The Netdev society has announced that the Netdev 0x14 conference, originally scheduled for March, will be a virtual event beginning August 16.
To all the folk who worked hard to contribute submissions and to the program committee that worked hard and reviewed these submissions, it is only fair that we provide a platform where these ideas will be shared in a timely fashion.
Events: May 21, 2020 to July 20, 2020
The following event listing is taken from the LWN.net Calendar.
If your event does not appear here, please tell us about it.
Security updates
Alert summary May 14, 2020 to May 20, 2020
Dist. | ID | Release | Package | Date |
---|---|---|---|---|
Debian | DSA-4686-1 | stable | apache-log4j1.2 | 2020-05-15 |
Debian | DLA-2210-1 | LTS | apt | 2020-05-15 |
Debian | DSA-4685-1 | stable | apt | 2020-05-14 |
Debian | DSA-4689-1 | stable | bind9 | 2020-05-19 |
Debian | DLA-2215-1 | LTS | clamav | 2020-05-20 |
Debian | DSA-4688-1 | stable | dpdk | 2020-05-18 |
Debian | DLA-2213-1 | LTS | exim4 | 2020-05-18 |
Debian | DSA-4687-1 | stable | exim4 | 2020-05-16 |
Debian | DLA-2176-1 | LTS | inetutils | 2020-05-14 |
Debian | DLA-2214-1 | LTS | libexif | 2020-05-18 |
Debian | DSA-4684-1 | stable | libreswan | 2020-05-13 |
Debian | DLA-2211-1 | LTS | log4net | 2020-05-15 |
Debian | DLA-2212-1 | LTS | openconnect | 2020-05-16 |
Fedora | FEDORA-2020-06c54925d3 | F30 | chromium | 2020-05-17 |
Fedora | FEDORA-2020-da49fbb17c | F31 | chromium | 2020-05-17 |
Fedora | FEDORA-2020-ae934f6790 | F30 | condor | 2020-05-17 |
Fedora | FEDORA-2020-f9a598f815 | F31 | condor | 2020-05-17 |
Fedora | FEDORA-2020-fb5af97476 | F32 | condor | 2020-05-18 |
Fedora | FEDORA-2020-885e2343ed | F31 | glpi | 2020-05-14 |
Fedora | FEDORA-2020-ee30e1109f | F32 | glpi | 2020-05-14 |
Fedora | FEDORA-2020-d109a1d1d9 | F31 | grafana | 2020-05-14 |
Fedora | FEDORA-2020-c6b0c7ebbb | F32 | grafana | 2020-05-14 |
Fedora | FEDORA-2020-a60ad9d4ec | F31 | java-1.8.0-openjdk | 2020-05-18 |
Fedora | FEDORA-2020-831ec85119 | F31 | java-1.8.0-openjdk-aarch32 | 2020-05-17 |
Fedora | FEDORA-2020-07aa58121a | F32 | java-1.8.0-openjdk-aarch32 | 2020-05-17 |
Fedora | FEDORA-2020-36298e20f7 | F31 | java-latest-openjdk | 2020-05-14 |
Fedora | FEDORA-2020-755e4213b5 | F32 | java-latest-openjdk | 2020-05-14 |
Fedora | FEDORA-2020-5a69decc0c | F30 | kernel | 2020-05-20 |
Fedora | FEDORA-2020-c6b9fff7f8 | F31 | kernel | 2020-05-20 |
Fedora | FEDORA-2020-4c69987c40 | F32 | kernel | 2020-05-15 |
Fedora | FEDORA-2020-4336d63533 | F32 | kernel | 2020-05-20 |
Fedora | FEDORA-2020-69f2f1d987 | F31 | mailman | 2020-05-14 |
Fedora | FEDORA-2020-20b748e81e | F32 | mailman | 2020-05-15 |
Fedora | FEDORA-2020-e244f22a51 | F32 | mingw-OpenEXR | 2020-05-16 |
Fedora | FEDORA-2020-e244f22a51 | F32 | mingw-ilmbase | 2020-05-16 |
Fedora | FEDORA-2020-7aba37f66a | F30 | moodle | 2020-05-20 |
Fedora | FEDORA-2020-a1b4d24680 | F31 | moodle | 2020-05-20 |
Fedora | FEDORA-2020-758e089ff7 | F32 | moodle | 2020-05-20 |
Fedora | FEDORA-2020-238bbf85d8 | F32 | oddjob | 2020-05-14 |
Fedora | FEDORA-2020-143735a624 | F32 | openconnect | 2020-05-19 |
Fedora | FEDORA-2020-8d3b359179 | F30 | perl-Mojolicious | 2020-05-19 |
Fedora | FEDORA-2020-aceb5a1d0a | F31 | perl-Mojolicious | 2020-05-19 |
Fedora | FEDORA-2020-cc7deffbf1 | F32 | perl-Mojolicious | 2020-05-19 |
Fedora | FEDORA-2020-3ea2253402 | F32 | php | 2020-05-19 |
Fedora | FEDORA-2020-6e3e0c6386 | F30 | sleuthkit | 2020-05-17 |
Fedora | FEDORA-2020-1dd340ab85 | F31 | sleuthkit | 2020-05-17 |
Fedora | FEDORA-2020-94c2f78e0c | F32 | sleuthkit | 2020-05-17 |
Fedora | FEDORA-2020-a6a921a591 | F30 | squid | 2020-05-16 |
Fedora | FEDORA-2020-848065cc4c | F31 | squid | 2020-05-16 |
Fedora | FEDORA-2020-56e809930e | F32 | squid | 2020-05-16 |
Fedora | FEDORA-2020-e67318b4b4 | F32 | transmission | 2020-05-20 |
Fedora | FEDORA-2020-c952520959 | F30 | viewvc | 2020-05-15 |
Gentoo | 202005-13 | | chromium | 2020-05-15 |
Gentoo | 202005-07 | | freerdp | 2020-05-15 |
Gentoo | 202005-10 | | libmicrodns | 2020-05-15 |
Gentoo | 202005-06 | | live | 2020-05-15 |
Gentoo | 202005-12 | | openslp | 2020-05-15 |
Gentoo | 202005-09 | | python | 2020-05-15 |
Gentoo | 202005-11 | | vlc | 2020-05-15 |
Gentoo | 202005-08 | | xen | 2020-05-15 |
Mageia | MGASA-2020-0213 | 7 | jbig2dec | 2020-05-15 |
Mageia | MGASA-2020-0215 | 7 | libreswan | 2020-05-15 |
Mageia | MGASA-2020-0211 | 7 | netkit-telnet | 2020-05-15 |
Mageia | MGASA-2020-0212 | 7 | ntp | 2020-05-15 |
Mageia | MGASA-2020-0214 | 7 | suricata | 2020-05-15 |
openSUSE | openSUSE-SU-2020:0661-1 | 15.1 | mailman | 2020-05-15 |
openSUSE | openSUSE-SU-2020:0667-1 | | nextcloud | 2020-05-17 |
openSUSE | openSUSE-SU-2020:0668-1 | | nextcloud | 2020-05-17 |
Oracle | ELSA-2020-2143 | OL8 | .NET Core | 2020-05-14 |
Oracle | ELSA-2020-1926 | OL8 | container-tools:1.0 | 2020-05-14 |
Oracle | ELSA-2020-1931 | OL8 | container-tools:2.0 | 2020-05-13 |
Oracle | ELSA-2020-1932 | OL8 | container-tools:ol8 | 2020-05-13 |
Oracle | ELSA-2020-2103 | OL6 | kernel | 2020-05-13 |
Oracle | ELSA-2020-2082 | OL7 | kernel | 2020-05-14 |
Oracle | ELSA-2020-5691 | OL7 | kernel | 2020-05-19 |
Oracle | ELSA-2020-2102 | OL8 | kernel | 2020-05-14 |
Oracle | ELSA-2020-5691 | OL8 | kernel | 2020-05-19 |
Oracle | ELSA-2020-2070 | OL8 | libreswan | 2020-05-13 |
Oracle | ELSA-2020-2041 | OL8 | squid:4 | 2020-05-13 |
Oracle | ELSA-2020-2046 | OL8 | thunderbird | 2020-05-13 |
Red Hat | RHSA-2020:2213-01 | EL7.4 | ipmitool | 2020-05-19 |
Red Hat | RHSA-2020:2214-01 | EL7.4 | kernel | 2020-05-19 |
Red Hat | RHSA-2020:2199-01 | EL8.1 | kernel | 2020-05-19 |
Red Hat | RHSA-2020:2171-01 | EL8 | kernel-rt | 2020-05-14 |
Red Hat | RHSA-2020:2203-01 | EL8.1 | kpatch-patch | 2020-05-19 |
Red Hat | RHSA-2020:2210-01 | EL7.4 | ksh | 2020-05-19 |
Red Hat | RHSA-2020:2212-01 | EL7.4 | ruby | 2020-05-19 |
Scientific Linux | SLSA-2020:2082-1 | SL7 | kernel | 2020-05-15 |
Slackware | SSA:2020-140-01 | | bind | 2020-05-19 |
Slackware | SSA:2020-140-02 | | libexif | 2020-05-19 |
Slackware | SSA:2020-139-01 | | sane | 2020-05-18 |
SUSE | SUSE-SU-2020:1272-1 | OS7 OS8 SLE12 SES5 | apache2 | 2020-05-13 |
SUSE | SUSE-SU-2020:1296-1 | SLE15 | autoyast2 | 2020-05-18 |
SUSE | SUSE-SU-2020:1334-1 | SLE15 | dpdk | 2020-05-19 |
SUSE | SUSE-SU-2020:1335-1 | SLE15 | dpdk | 2020-05-19 |
SUSE | SUSE-SU-2020:1294-1 | SLE15 | file | 2020-05-18 |
SUSE | SUSE-SU-2020:1295-1 | OS7 OS8 SLE12 SES5 | git | 2020-05-18 |
SUSE | SUSE-SU-2020:1273-1 | SES5 | grafana | 2020-05-13 |
SUSE | SUSE-SU-2020:1300-1 | SLE15 | gstreamer-plugins-base | 2020-05-18 |
SUSE | SUSE-SU-2020:1275-1 | OS8 SLE12 SES5 | kernel | 2020-05-14 |
SUSE | SUSE-SU-2020:1298-1 | SLE15 | libbsd | 2020-05-18 |
SUSE | SUSE-SU-2020:1277-1 | SLE12 | libvirt | 2020-05-14 |
SUSE | SUSE-SU-2020:1289-1 | SLE12 | libvirt | 2020-05-15 |
SUSE | SUSE-SU-2020:1297-1 | SLE15 | libvpx | 2020-05-18 |
SUSE | SUSE-SU-2020:1299-1 | SLE15 | libxml2 | 2020-05-18 |
SUSE | SUSE-SU-2020:1301-1 | OS7 OS8 SLE12 SES5 | mailman | 2020-05-18 |
SUSE | SUSE-SU-2020:1337-1 | SLE15 | openconnect | 2020-05-19 |
SUSE | SUSE-SU-2020:1292-1 | SLE12 | openexr | 2020-05-18 |
SUSE | SUSE-SU-2020:1293-1 | SLE15 | openexr | 2020-05-18 |
SUSE | SUSE-SU-2020:1285-1 | MP3.2 OS6 OS7 OS8 SLE12 SES5 | python-PyYAML | 2020-05-15 |
SUSE | SUSE-SU-2020:1339-1 | SLE15 | python | 2020-05-19 |
SUSE | SUSE-SU-2020:1274-1 | SES5 | python-paramiko | 2020-05-14 |
SUSE | SUSE-SU-2020:1338-1 | SLE15 | rpmlint | 2020-05-19 |
SUSE | SUSE-SU-2020:14369-1 | SLE11 | syslog-ng | 2020-05-14 |
Ubuntu | USN-4359-1 | 16.04 18.04 19.10 20.04 | apt | 2020-05-14 |
Ubuntu | USN-4365-1 | 16.04 18.04 19.10 20.04 | bind9 | 2020-05-19 |
Ubuntu | USN-4361-1 | 19.10 20.04 | dovecot | 2020-05-18 |
Ubuntu | USN-4362-1 | 18.04 19.10 20.04 | dpdk | 2020-05-18 |
Ubuntu | USN-4366-1 | 14.04 16.04 18.04 19.10 20.04 | exim4 | 2020-05-19 |
Ubuntu | USN-4360-3 | 12.04 14.04 | json-c | 2020-05-15 |
Ubuntu | USN-4360-1 | 12.04 14.04 16.04 18.04 19.10 20.04 | json-c | 2020-05-14 |
Ubuntu | USN-4360-2 | 16.04 18.04 19.10 20.04 | json-c | 2020-05-15 |
Ubuntu | USN-4358-1 | 12.04 14.04 16.04 18.04 19.10 20.04 | libexif | 2020-05-13 |
Ubuntu | USN-4363-1 | 16.04 18.04 | linux, linux-aws, linux-aws-hwe, linux-gcp, linux-gke-4.15, linux-hwe, linux-oem, linux-oracle, linux-snapdragon | 2020-05-18 |
Ubuntu | USN-4367-1 | 20.04 | linux, linux-aws, linux-gcp, linux-kvm, linux-oracle, linux-riscv | 2020-05-19 |
Ubuntu | USN-4364-1 | 14.04 16.04 | linux, linux-aws, linux-lts-xenial, linux-raspi2, linux-snapdragon | 2020-05-18 |
Ubuntu | USN-4368-1 | 18.04 | linux-gke-5.0, linux-oem-osp1 | 2020-05-19 |
Kernel patches of interest
Kernel releases
Architecture-specific
Build system
Core kernel
Development tools
Device drivers
Device-driver infrastructure
Documentation
Filesystems and block layer
Memory management
Networking
Security-related
Virtualization and containers
Page editor: Rebecca Sobol