LWN.net Weekly Edition for May 21, 2020
Welcome to the LWN.net Weekly Edition for May 21, 2020
This edition contains the following feature content:
- The PEPs of Python 3.9: an update on the final changes that found their way into the upcoming Python 3.9 release.
- The state of the AWK: it may be an old tool, but it's still interesting.
- Ongoing coverage from OSPM 2020, including:
- The weighted TEO cpuidle governor: an attempt to improve idle-time predictions.
- Testing scheduler thermal properties for avionics: an in-progress test bed to evaluate thermally-oriented scheduler changes.
- Utilization inversion and proxy execution: using load tracking for task placement can lead to some strange inversion situations; fixing them may not be entirely easy.
- The many faces of "latency nice": a complex and inconclusive session on an incompletely designed feature.
- Scheduler benchmarking with MMTests: a test suite developed for memory-management benchmarking finds a new use case.
- Evaluating vendor changes to the scheduler: mobile vendors make a lot of tweaks to the CPU scheduler; why do they do that and what is gained from it?
- Bao: a lightweight partitioning hypervisor for mixed-criticality workloads.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
The PEPs of Python 3.9
With the release of Python 3.9.0b1, the first of four planned betas for the development cycle, Python 3.9 is now feature-complete. There is still plenty to do in terms of testing and stabilization before the October final release. The release announcement lists a half-dozen Python Enhancement Proposals (PEPs) that were accepted for 3.9. We have looked at some of those PEPs along the way; there are some updates on those. It seems like a good time to fill in some of the gaps on what will be coming in Python 3.9.
String manipulations
Sometimes the simplest (seeming) things are the hardest—or at least provoke an outsized discussion. Much of that was bikeshedding over—what else?—naming, but the idea of adding functions to the standard string objects to remove prefixes and suffixes was fairly uncontroversial. Whether those affixes (a word for both prefixes and suffixes) could be specified as sequences, so more than one affix could be handled in a single call, was less clear cut; ultimately, it was removed from the proposal, awaiting someone else to push that change through the process.
Toward the end of March, Dennis Sweeney asked on the python-dev mailing list for a core developer to sponsor PEP 616 ("String methods to remove prefixes and suffixes"). He pointed to a python-ideas discussion from March 2019 about the idea. Eric V. Smith agreed to sponsor the PEP, which led Sweeney to post it and kick off the discussion. In the original version, he used cutprefix() and cutsuffix() as the names of the string object methods to be added. Four types of Python objects would get the new methods: str (Unicode strings), bytes (binary sequences), bytearray (mutable binary sequences), and collections.UserString (a wrapper around string objects). It would work as follows:
'abcdef'.cutprefix('abc')    # returns 'def'
'abcdef'.cutsuffix('ef')     # returns 'abcd'
There were plenty of suggestions in the name department. Perhaps the most widespread agreement was that few liked "cut", so "strip", "trim", and "remove" were all suggested and garnered some support. stripprefix() (and stripsuffix(), of course) seemed to run into opposition due, at least in part, to one of the rationales specified in the PEP; the existing "strip" functions are confusing so reusing that name should be avoided. The str.lstrip() and str.rstrip() methods also remove leading and trailing characters, but they are a source of confusion to programmers actually looking for the cutprefix() functionality. The *strip() calls take a string argument, but treat it as a set of characters that should be eliminated from the front or end of the string:
'abcdef'.lstrip('abc')       # returns 'def' as "expected"
'abcbadefed'.lstrip('abc')   # returns 'defed' not at all as expected
Eventually, removeprefix() and removesuffix() seemed to gain the upper hand, and Sweeney switched to those names. It probably did not hurt that Guido van Rossum supported them as well. Eric Fahlgren amusingly summed up the name fight this way:
cutprefix - Removes the specified prefix.
trimprefix - Removes the specified prefix.
stripprefix - Removes the specified prefix.
removeprefix - Removes the specified prefix. Duh. :)
Sweeney announced an update to the PEP that addressed a number of comments, but also added the requested ability to take a tuple of strings as an affix (that version can be seen in the PEP GitHub repository). But Steven D'Aprano was not so sure it made sense to do that. He pointed out that the only string operations that take a tuple are str.startswith() and str.endswith(), which do not return a string (just a boolean value). He is leery of adding a method that returns a (potentially changed) version of the string while taking a tuple because whatever rules are chosen on how to process the tuple will be the "wrong" choice for some. For example:
"extraordinary".startswith(('ex', 'extra'))since it is True whether you match left-to-right, shortest-to-largest, or even in random order. But for cutprefix, which prefix should be deleted?
As he said, the rule as proposed is that the first matching string, processing the tuple left-to-right, is used, but some might want the longest match or the last match; it all depends on the context of the use. He suggested that the feature get more "soak time" before committing to adding that behavior: "We ought to get some real-life exposure to the simple case first, before adding support for multiple prefixes/suffixes."
Ethan Furman agreed with D'Aprano. But Victor Stinner was strongly in favor of the tuple-argument idea. He wondered about the proposed behavior, however, when the empty string is passed as part of the tuple. As proposed, encountering the empty string (which effectively matches anything) when processing the tuple would simply return the original string, which leads to surprising results:
cutsuffix("Hello World", ("", " World")) # returns "Hello World" cutsuffix("Hello World", (" World", "")) # returns "Hello"
The problem is not likely to manifest so obviously; affixes will not necessarily be hard coded, so empty strings might slip into unexpected places. Stinner suggested raising ValueError if an empty string is encountered, similar to str.split().
But Sweeney decided to remove the tuple-argument feature entirely to "allow someone else with a stronger opinion about it to propose and defend a set of semantics in a different PEP". He posted the last version of the PEP on March 28.
On April 9, Sweeney opened a steering council issue requesting a review of the PEP. On April 20, Stinner accepted it on behalf of the council. It is a pretty minimal change but worth the time to try to ensure that it has the right interface (and semantics) for the long haul. We will see removeprefix() and removesuffix() in Python 3.9.
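For reference, a quick interactive sketch of how the accepted methods behave in Python 3.9; an affix that does not match leaves the string unchanged:

>>> 'abcdef'.removeprefix('abc')
'def'
>>> 'abcdef'.removesuffix('ef')
'abcd'
>>> 'abcdef'.removeprefix('xyz')   # no match: the string is returned unchanged
'abcdef'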
New parser
It should not really surprise anyone that the new parser for CPython, covered here in mid-April, has been accepted by the steering council. PEP 617 ("New PEG parser for CPython") was proposed by project founder and former benevolent dictator for life (BDFL) Guido van Rossum, along with Pablo Galindo Salgado and Lysandros Nikolaou; it is already working well and its performance is within 10% of the existing parser in terms of speed and memory use. It will also make the language specification simpler because the parser is based on a parsing expression grammar (PEG). The existing LL(1) parser for CPython suffers from a number of shortcomings and contains some hacks that the new parser will eliminate.
The change paves the way for Python to move beyond having an LL(1) grammar—though the existing language is not precisely LL(1)—down the road. That change will not come soon as the plans are to keep the existing parser available in Python 3.9 behind a command-line switch. But Python 3.10 will remove the existing parser, which could allow language changes. If those kinds of changes are made, however, alternative Python implementations (e.g. PyPy, MicroPython) may need to switch their parsers to something other than LL(1) in order to keep up with the language specification. That might give the core developers pause before making a change of that nature.
And more
We looked at PEP 615 ("Support for the IANA Time Zone Database in the Standard Library") back in early March. It would add a zoneinfo module to the standard library that would facilitate getting time-zone information from the IANA time zone database (also known as the "Olson database") to populate a time-zone object. It was looked on favorably at the time of the article and, at the end of March, Paul Ganssle asked for a decision on the PEP. He thought it might be amusing to have it accepted (assuming it was) during an interesting time window.
He recognized that it might be difficult to pull off and it certainly was not a priority. The steering council did not miss the second window by much; Barry Warsaw announced the acceptance of the PEP on April 20. Python will now have a mechanism to access the system's time-zone database for creating and handling time zones. In addition, there is a tzdata module in the Python Package Index (PyPI) that contains the IANA data for systems that lack it; it will be maintained by the Python core developers as well.
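A brief sketch of what the new module makes possible; this example assumes a system with the IANA database installed (or the tzdata package from PyPI):

from datetime import datetime
from zoneinfo import ZoneInfo   # new in Python 3.9 (PEP 615)

# Attach an IANA time zone to a datetime and convert it to another zone.
meeting = datetime(2020, 10, 31, 12, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
print(meeting)                                         # 2020-10-31 12:00:00-07:00
print(meeting.astimezone(ZoneInfo("Europe/Prague")))   # 2020-10-31 20:00:00+01:00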
PEP 593 ("Flexible function and variable annotations") adds a way to associate context-specific metadata with functions and variables. Effectively, the type-hint annotations have squeezed out other use cases that were envisioned in PEP 3107 ("Function Annotations"), which was implemented in Python 3.0 many years ago. PEP 593 creates a new mechanism for those use cases using the Annotated type hint.
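A minimal illustration (the function and metadata string here are made up): type checkers see the underlying int, while other tools can retrieve the extra context:

from typing import Annotated, get_type_hints

def set_timeout(seconds: Annotated[int, "unit: seconds"]) -> None:
    ...

# include_extras=True (new in 3.9) preserves the Annotated metadata.
print(get_type_hints(set_timeout, include_extras=True))
# {'seconds': typing.Annotated[int, 'unit: seconds'], 'return': <class 'NoneType'>}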
Another kind of cleanup comes in PEP 585 ("Type Hinting Generics In Standard Collections"). It will allow the removal of a parallel set of type aliases maintained in the typing module in order to support generic types. For example, the typing.List type will no longer be needed to support annotations like "dict[str, list[int]]" (i.e. a dictionary with string keys and values that are lists of integers).
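In practice, that means annotations can use the built-in container types directly; a small illustrative example (the function is invented for demonstration):

# Python 3.9: built-in types can be parameterized directly, no typing import needed.
def tally(scores: dict[str, list[int]]) -> dict[str, int]:
    return {name: sum(values) for name, values in scores.items()}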
The dictionary union operation for "addition" will also be part of Python 3.9. It was a bit contentious at times, but PEP 584 ("Add Union Operators To dict") was recommended for acceptance by Van Rossum in mid-February. The steering council promptly agreed and the feature was merged on February 24.
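The new operators behave as PEP 584 describes, with values from the right-hand operand winning when keys collide:

>>> d1 = {'a': 1, 'b': 2}
>>> d2 = {'b': 3, 'c': 4}
>>> d1 | d2                 # union; right-hand values win on conflicts
{'a': 1, 'b': 3, 'c': 4}
>>> d1 |= d2                # in-place update, like dict.update()
>>> d1
{'a': 1, 'b': 3, 'c': 4}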
The last PEP on the list is PEP 602 ("Annual Release Cycle for Python"). As it says on the tin, it changes the release cadence from every 18 months to once per year. The development and release cycles overlap, though, so that a full 12 months is available for feature development. Python 3.10 feature development begins when the first Python 3.9 beta has been released—which is now. Stay tuned for the next round of PEPs in the coming year.
The state of the AWK
AWK is a text-processing language with a history spanning more than 40 years. It has a POSIX standard, several conforming implementations, and is still surprisingly relevant in 2020 — both for simple text processing tasks and for wrangling "big data". The recent release of GNU Awk 5.1 seems like a good reason to survey the AWK landscape, see what GNU Awk has been up to, and look at where AWK is being used these days.
The language was created at Bell Labs in 1977. Its name comes from the initials of the original authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. A Unix tool to the core, AWK is designed to do one thing well: to filter and transform lines of text. It's commonly used to parse fields from log files, transform output from other tools, and count occurrences of words and fields. Aho summarized AWK's functionality succinctly:
AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.
AWK programs are often one-liners executed directly from the command line. For example, to calculate the average response time of GET requests from some hypothetical web server log, you might type:
$ awk '/GET/ { total += $6; n++ } END { print total/n }' server.log
0.0186667
This means: for all lines matching the regular expression /GET/, add up the response time (the sixth field or $6) and count the line; at the end, print out the arithmetic mean of the response times.
The various AWK versions
There are three main versions of AWK in use today, and all of them conform to the POSIX standard (closely enough, at least, for the vast majority of use cases). The first is classic awk, the version of AWK described by Aho, Weinberger, and Kernighan in their book The AWK Programming Language. It's sometimes called "new AWK" (nawk) or "one true AWK", and it's now hosted on GitHub. This is the version pre-installed on many BSD-based systems, including macOS (though the version that comes with macOS is out of date, and worth upgrading).
The second is GNU Awk (gawk), which is by far the most featureful and actively maintained version. Gawk is usually pre-installed on Linux systems and is often the default awk. It is easy to install on macOS using Homebrew and Windows binaries are available as well. Arnold Robbins has been the primary maintainer of gawk since 1994, and continues to shepherd the language (he has also contributed many fixes to the classic awk version). Gawk has many features not present in awk or the POSIX standard, including new functions, networking facilities, a C extension API, a profiler and debugger, and most recently, namespaces.
The third common version is mawk, written by Michael Brennan. It is the default awk on Ubuntu and Debian Linux, and is still the fastest version of AWK, with a bytecode compiler and a more memory-efficient value representation. (Gawk has also had a bytecode compiler since 4.0, so it's now much closer to mawk's speed.)
If you want to use AWK for one-liners and basic text processing, any of the above are fine variants. If you're thinking of using it for a larger script or program, Gawk's features make it the sensible choice.
There are also several other implementations of AWK with varying levels of maturity and maintenance, notably the size-optimized BusyBox version used in embedded Linux environments, a Java rewrite with runtime access to Java language features, and my own GoAWK, a POSIX-compliant version written in Go. The three main AWKs and the BusyBox version are all written in C.
Gawk changes since 4.0
It's been almost 10 years since LWN covered the release of gawk 4.0. It would be tempting to say "much has changed since 2011", but the truth is that things move relatively slowly in the AWK world. I'll describe the notable features since 4.0 here, but for more details you can read the full 4.x and 5.x changelogs. Gawk 5.1.0 came out just over a month ago on April 14.
The biggest user-facing feature is the introduction of namespaces in 5.0. Most modern languages have some concept of namespaces to make it easier to ship large projects and libraries without name clashes. Gawk 5.0 adds namespaces in a backward-compatible way, allowing developers to create libraries, such as this toy math library:
# area.awk
@namespace "area"

BEGIN {
    pi = 3.14159  # namespaced "constant"
}

function circle(radius) {
    return pi*radius*radius
}
To refer to variables or functions in the library, use the namespace::name syntax, similar to C++:
$ gawk -f area.awk -e 'BEGIN { print area::pi, area::circle(10) }'
3.14159 314.159
Robbins believes that AWK's lack of namespaces is one of the key reasons it hasn't caught on as a larger-scale programming language and that this feature in gawk 5.0 may help resolve that. The other major issue Robbins believes is holding AWK back is the lack of a good C extension interface. Gawk's dynamic extension interface was completely revamped in 4.1; it now has a defined API and allows wrapping existing C and C++ libraries so they can be easily called from AWK.
The following code snippet from the example C-code wrapper in the user manual populates an AWK array (a string-keyed hash table) with a filename and values from a stat() system call:
/* empty out the array */
clear_array(array);

/* fill in the array */
array_set(array, "name",
          make_const_string(name, strlen(name), &tmp));
array_set_numeric(array, "dev", sbuf->st_dev);
array_set_numeric(array, "ino", sbuf->st_ino);
array_set_numeric(array, "mode", sbuf->st_mode);
Another change in the 4.2 release (and continued in 5.0) was an overhauled source code pretty-printer. Gawk's pretty-printer enables its use as a standardized AWK code formatter, similar to Go's go fmt tool and Python's Black formatter. For example, to pretty-print the area.awk file from above:
$ gawk --pretty-print -f area.awk
which results in the following output:
@namespace "area" BEGIN { pi = 3.14159 # namespaced "constant" } function circle(radius) { return (pi * radius * radius) }
You may question the tool's choices: why does "BEGIN {" not have a line break before the "{" when the function does? (It turns out AWK syntax doesn't allow that.) Why two blank lines before the function and parentheses around the return expression? But at least it's consistent and may help avoid code-style debates.
Gawk allows a limited amount of runtime type inspection, and extended that with the addition of the typeof() function in 4.2. typeof() returns a string constant like "string", "number", or "array" depending on the input type. These functions are important for code that recursively walks every item of a nested array, for example (which is something that POSIX AWK can't do).
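As an illustration (this is not code from the gawk manual), a recursive walk over a gawk "array of arrays" might look like the following, using typeof() to distinguish subarrays from scalar values:

# walk.awk: recursively print a nested gawk array (requires gawk 4.2+ for typeof())
function walk(arr, prefix,    key) {
    for (key in arr) {
        if (typeof(arr[key]) == "array")
            walk(arr[key], prefix "[\"" key "\"]")
        else
            printf "%s[\"%s\"] = %s\n", prefix, key, arr[key]
    }
}

BEGIN {
    config["server"]["port"] = 8080
    config["server"]["host"] = "localhost"
    config["debug"] = 1
    walk(config, "config")
}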
With 4.2, gawk also supports regular expression constants as a first-class data type using the syntax @/foo/. Previously you could not store a regular expression constant in a variable; typeof(@/foo/) returns the string "regexp". In terms of performance, gawk 4.2 brings a significant improvement on Linux systems by using fwrite_unlocked() when it's available. As gawk is single-threaded, it can use the non-locking stdio functions, giving a 7-18% increase in raw output speed — for example gawk '{ print }' on a large file.
The GNU Awk User's Guide has always been a thorough reference, but it was substantially updated in 4.1 and again in the 5.x releases, including new examples, summary sections, and exercises, along with some major copy editing.
Last (and also least), a subtle change in 4.0 that I found amusing was the reverted handling of backslash in sub() and gsub(). Robbins writes:
The default handling of backslash in sub() and gsub() has been reverted to the behavior of 3.1. It was silly to think I could break compatibility that way, even for standards compliance.
The sub and gsub functions are core regular expression substitution functions, and even a small "fix" to the complicated handling of backslash broke people's code.
Robbins may have had a small slip in judgment with the original change, but it's obvious he takes backward compatibility seriously. Especially for a popular tool like gawk, sometimes it is better to continue breaking the specification than change how something has always worked.
Is AWK still relevant?
Asking if AWK is still relevant is a bit like asking if air is still relevant: you may not see it, but it's all around you. Many Linux administrators and DevOps engineers use it to transform data or diagnose issues via log files. A version of AWK is installed on almost all Unix-based machines. In addition to ad-hoc usage, many large open-source projects use AWK somewhere in their build or documentation tooling. To name just a few examples: the Linux kernel uses it in the x86 tooling to check and reformat objdump files, Neovim uses it to generate documentation, and FFmpeg uses it for building and testing.
AWK build scripts are surprisingly hard to kill, even when people want to: in 2018 LWN wrote about GCC contributors wanting to replace AWK with Python in the scripts that generate its option-parsing code. There was some support for this proposal at the time, but apparently no one volunteered to do the actual porting, and the AWK scripts live on.
Robbins argues in his 2018 paper for the use of AWK (specifically gawk) as a "systems programming language", in this context meaning a language for writing larger tools and programs. He outlines the reasons he thinks it has not caught on, but Kernighan is "not 100% convinced" that the lack of an extension mechanism is the main reason AWK isn't widely used for larger programs. He suggested that it might be due to the lack of built-in support for access to system calls and the like. But none of that has stopped several people from building larger tools: Robbins' own TexiWeb Jr. literate programming tool (1300 lines of AWK), Werner Stoop's d.awk tool that generates documentation from Markdown comments in source code (800 lines), and Translate Shell, a 6000-line AWK tool that provides a fairly powerful command-line interface to cloud-based translation APIs.
Several developers in the last few years have written about using AWK in their "big data" toolkit as a much simpler (and sometimes faster) tool than heavy distributed computing systems such as Spark and Hadoop. Nick Strayer wrote about using AWK and R to parse 25 terabytes of data across multiple cores. Other big data examples are the tantalizingly-titled article by Adam Drake, "Command-line Tools can be 235x Faster than your Hadoop Cluster", and Brendan O'Connor's "Don’t MAWK AWK – the fastest and most elegant big data munging language!"
Between ad-hoc text munging, build tooling, "systems programming", and big data processing — not to mention text-mode first person shooters — it seems that AWK is alive and well in 2020.
[Thanks to Arnold Robbins for reviewing a draft of this article.]
The weighted TEO cpuidle governor
Life gets complicated for the kernel when there is nothing for the system to do. The obvious response is to put the CPU into an idle state to save power, but which one? CPUs offer a wide range of sleep states with different power-usage and latency characteristics. Picking too shallow a state will waste energy, while going too deep hurts latency and can impact the performance of the system as a whole. The timer-events-oriented (TEO) cpuidle governor is a relatively new attempt to improve the kernel's choice of sleep states; at the 2020 Power Management and Scheduling in the Linux Kernel Summit, Pratik Sampat presented a variant of the TEO governor that tries to improve its choices further.
Sampat started with a bit of background. The TEO governor is based on the idea that timer events are the most likely way that a system will wake up; they also happen to be the most deterministic, since they are known before the system goes idle. But CPUs are subject to wakeups from other sources — interrupts in particular — and that complicates the situation. So the TEO governor maintains a short history of actual idle times that is used to come up with a (hopefully) better guess for what the next idle period will really be.
This history is an eight-entry circular buffer that indicates the recent pattern of non-timer wakeups. When the time comes to pick an idle state, the TEO governor looks at how many of those wakeups led to a sleep time that was less than expected; if the answer is "a majority of them", then the average observed sleep time is used to select an idle state that is shallower than would have otherwise been chosen. It works well, he said, but maybe it can be made better?
He started by testing the idea of whether more history would improve the situation. Increasing the size of the idle-times buffer to 128 did not really help, though. With a set of benchmark results, Sampat showed performance numbers that were sometimes better and sometimes worse; latency often improved, but power consumption got much worse. More history led to the selection of shallower sleep states more often, in other words.
It turns out, he said, that an average is not a good model of the distribution of sleep times, and a longer history may not reflect what is going to happen in the future. So he concluded that what is needed is to store and manage the history differently. The cpuidle governor would benefit from a way to answer a specific question: if the kernel is about to pick a given sleep state, what are the chances that the actual sleep time will better match a sleep that is one level shallower?
The weighted governor
The result was the weighted-history TEO governor, which replaces the history buffer with an NxN matrix, where N is the number of sleep states supported by the processor. The rows correspond to the sleep state the TEO governor would pick in any given situation; each column along that row indicates the probability that the corresponding state should actually be chosen. If the system in question had three sleep states ("shallow", "medium", and "deep"), the matrix would be initialized to look like this:
             Shallow   Medium   Deep
  Shallow      70%       15%     15%
  Medium       15%       70%     15%
  Deep         15%       15%     70%
In other words, the matrix is set up so that the chances of each state selection being correct are 70%, with the remaining 30% spread across all the other states. Giovanni Gherdovich asked whether this initial distribution was hard-coded; the answer was "yes for now", and that the values have been chosen from a set of experiments Sampat ran.
After each wakeup, the actual behavior is measured and the probabilities are tweaked accordingly. The actual amount of adjustment that should be performed is still unclear, he said; more experiments and testing are needed.
When it comes time to make a prediction, the governor uses a biased random-number generator to pick a state; the biasing is done so that the chances of picking any particular state are the same as the observed probability that said state is the correct one. Why do that rather than just pick the highest-probability state? Often it turns out that the probabilities are fairly close, so a subset of the available states are all about as likely to be correct. The system will self-correct when the random-number generator steers it wrong.
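As a rough illustration of the idea (a user-space sketch in Python, not the proposed kernel code), the selection amounts to a weighted random choice over the row of the matrix corresponding to the state the unweighted TEO governor would have picked:

import random

# Initial weights from the example above: rows are the TEO governor's choice,
# columns are the state that is actually selected.
weights = {
    "shallow": {"shallow": 70, "medium": 15, "deep": 15},
    "medium":  {"shallow": 15, "medium": 70, "deep": 15},
    "deep":    {"shallow": 15, "medium": 15, "deep": 70},
}

def pick_state(teo_choice):
    row = weights[teo_choice]
    states = list(row)
    return random.choices(states, weights=[row[s] for s in states])[0]

print(pick_state("medium"))   # usually "medium", occasionally a neighboring state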
Results
A number of benchmark results followed, showing variable results. With schedbench, latency was better some times and worse others, but power consumption was always less. The accuracy of the sleep-state choices was similar to the unweighted TEO governor for a small number of threads, but improved for larger numbers of threads. Rafael Wysocki, the author of the TEO governor, said that he was surprised to see TEO doing as well as it does; he deliberately chose a simple algorithm to minimize the overhead involved.
Sampat modified the ebizzy benchmark to make it do occasional sleeps, and got better results than TEO for both throughput and power consumption. The pgbench benchmark showed mixed results, with things getting worse as more clients were added. Hackbench runs saw better results with relatively short run times, and a consistent 8-10% improvement in power consumption.
At this point, some confusion about the results became evident. Sampat characterized the results as "overshooting" or "undershooting", which most people expected to refer to the sleep state chosen, but actually referred to the sleep residency time. So "overshooting" meant picking a sleep state that was too shallow — the residency time overshot the estimate. This terminology seems highly likely to change in the near future.
Wysocki observed that picking a sleep state that is too shallow is generally better than picking one that is too deep. Not sleeping deeply enough will cost some power, but sleeping too deeply can hurt the performance of the system (in both latency and throughput terms).
Sampat finished with an overview of the work that is yet to be done. The aging algorithm still needs some work; workloads change over time, and old history can lead to poor predictions going forward. He tried simply decaying the highest-probability state, but that led to large variance in the results.
Another issue is the initial weights put in the matrix; these were determined through experiments, but more testing is needed. Wysocki disagreed, though, saying that with proper aging, the initial states don't matter much. The governor will correct itself over time. But that depends on the aging working well, so that is the important part to work on. The session concluded with Wysocki saying that the work looks promising and can be discussed further on the mailing list.
Testing scheduler thermal properties for avionics
Linux is not heavily used in safety-critical systems — yet. There is an increasing level of interest in such deployments, though, and that is driving a number of initiatives to determine how Linux can be made suitable for safety-critical environments. At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), Michal Sojka shone a light on one corner of this work: testing the thermal characteristics of Linux systems with an eye toward deployment in avionics systems.
In particular, his focus is on how scheduling decisions can affect the thermal behavior of computers in avionic systems; this effort is part of the European THERMAC project. The requirements for avionic systems include doing without both fans and heavy heat sinks while getting as much performance out of each system as thermal constraints will allow. There is no room for missed deadlines in safety-critical work, so there is not much space for the usual thermal-management techniques there. But these systems also support best-effort workloads that run when time and temperatures allow; that is where it may be possible to improve the situation with clever power management.
These systems tend to use time-partitioned scheduling. Each safety-critical task runs within its own time window; any time left over within the window when that work is done can be used for best-effort workloads. The good news, Sojka said, was that the workloads on these systems are well understood; that is a distinct difference from the systems discussed in the previous session, where the kernel has to make guesses about what is going to happen next.
This work, so far, has not yet come up with any thermal-aware scheduling strategies; that is for a later stage. What is being done now is to put together the framework for evaluating such strategies so it will be possible to know which ones actually work. To that end, the project has built a testbed based on a leading-edge NXP i.MX8 board; thermal sensors and a thermal camera have been added to that. Control groups are being used to simulate the scheduling windows that will be used on a real system.
The work so far has resulted in a framework called "thermobench"; Sojka described it as "a fancy CSV file generator". It will run a series of benchmarks, capturing measurements (temperatures, CPU frequencies, CPU loads, etc.) as they go. When the runs are complete, the system can create plots of what happened. The benchmarks in the repository now include various micro-instruction tests and tests that evaluate a variety of sleep patterns.
The system can also perform model fitting in order to get a sense for the changes that happen at different time scales; some changes happen much more quickly than others, leading to a model equation with three distinct terms. The temperature at the heat sink can change within a minute, while whole-board temperature changes play out over four or five minutes. There is also an 18-minute term which, he surmised, was the response of the entire testbed. Among other things, these results tell them how long each test needs to run for.
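One plausible form for such a model (the article does not give the project's actual equation, so this is an assumed first-order exponential fit) would be:

T(t) = T_0 + A_1\left(1 - e^{-t/\tau_1}\right) + A_2\left(1 - e^{-t/\tau_2}\right) + A_3\left(1 - e^{-t/\tau_3}\right)

with \tau_1 on the order of a minute (heat sink), \tau_2 around four or five minutes (board), and \tau_3 around 18 minutes (testbed).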
In conclusion, he said, thermobench will be useful for comparing various thermal management strategies. He wondered whether others might find it useful for their areas as well. Vincent Guittot asked whether the tests included CPU-frequency scaling; Sojka answered the tests that were shown are all single-frequency tests, but multiple-frequency tests have been done as well. He said that temperature is not a linear function of CPU frequency, but did not get into details.
Rafael Wysocki said that the tests should always measure both the power consumption of the board and the temperature, since the two are somewhat independent of each other. Giovanni Gherdovich asked whether the realtime preemption patches had been tested, noting that kernels with those patches have different performance and power-usage profiles. Sojka answered that the test board is quite new and is currently not able to run a mainline kernel; he expressed interest in hearing what NXP's plans are for getting support upstream. Once that happens, he will be happy to experiment with the realtime patches.
Souvik Chakravarty pointed out that a number of factors affect power usage. For example, what is the power structure of the board? If all CPUs are on a single power rail, it will be necessary to stop them all to gain significant power (and thermal) savings. Sojka said that the processor in question has six big.LITTLE CPUs, and the project is testing on the little CPUs only. But details like the power layout are not entirely clear.
Sojka concluded by encouraging attendees to check out the thermobench code, which had been posted that very day.
Utilization inversion and proxy execution
Over the years, the kernel's CPU scheduler has become increasingly aware of how much load every task is putting on the system; this information is used to make smarter task placement decisions. Sometimes, though, this logic can go wrong, leading to a situation that Valentin Schneider describes as "utilization inversion". At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), he described the problem and some approaches that are being considered to address it.
Utilization tracking, initially conceived as per-entity load-tracking or PELT, gives the scheduler information about what the impact of running a task on a given CPU will be. It is used in the schedutil CPU-frequency governor to select a frequency appropriate to the current workload. On Arm big.LITTLE systems, where some processors are faster and more power-hungry than others, the utilization-tracking signal is also used to decide which type of CPU a task should run on. The situation is complicated a bit by utilization clamping, which allows an administrator to make a task's utilization appear larger or smaller than it really is. Clamping is used to bias the placement of specific tasks and influences CPU-frequency selection as well.
Imagine, Schneider said, a large task (one with a high utilization) and a small task, both of which are contending for the same lock. The large task may have a high minimum clamp, so it looks like an even bigger load even when it is not doing much; the small task, instead, may have a low maximum, ensuring that it always looks small. One would expect the large task to run on a big CPU at a high frequency while the small task is consigned to a small CPU at a low frequency. If the small task grabs the lock, the large task's progress suddenly depends on how quickly the small task can progress.
This situation is similar to priority inversion, though the problem is not as severe. Even so, it would be better if the small task could inherit some of the large task's resources while it holds the lock.
The kernel's realtime mutexes can handle priority inheritance now; if a high-priority task contends for a lock held by a low-priority task, the latter will have its priority boosted until it drops the lock. Priority inheritance can help, but it only affects process priority; it can force preemption, but it does not really change task placement or CPU frequency. Perhaps the kernel could gain a similar mechanism for utilization that would help for placement, at least, if not CPU frequency. Schneider expressed skepticism that such an approach could work well, though.
An alternative he has been working on is proxy execution: giving the lock-holding task the waiting task's execution parameters until it lets go of the lock. This is a work in progress, he said, that doesn't survive for more than 20 seconds on real hardware, and it has no provision for futexes (user-space locks), but it still has some interesting properties.
With proxy execution, a task that blocks on a mutex is not removed from the run queue as it would be in a mainline kernel. It can thus be picked to run by the scheduler in the usual way if it's the highest-priority task in the queue. When that happens, though, the lock-holding task inherits the blocked task's scheduling context. The blocked task is also migrated to the run queue of the lock holder, which brings its utilization information over; that will cause the CPU frequency to be increased, helping the lock holder to get its work done and release the lock.
That solves the problem reasonably well on symmetric multiprocessor systems, but it still falls short on asymmetric systems like big.LITTLE. To address such systems, Schneider would like to put the utilization-tracking information into the scheduling context, where it can be passed more directly to a lock holder. This has to be done carefully, though, or it could create priority inversions of its own; if a low-utilization task is picked to run, it could end up slowing a high-utilization task. Making a smart choice is hard, though, since the utilization signals are highly variable and hard to track in the proxy-execution code. The solution might be to ignore the utilization values and just look at the clamps.
Juri Lelli asked why this mattered, since the clamp values are already aggregated on each run queue. That works for frequency selection, Schneider answered, but it has no influence on task selection, so it doesn't help to ensure that the lock-holding task actually runs.
Then, there is the perennial problem of load balancing. Utilization signals are highly useful here, since they let the scheduler ensure that the load on each CPU is about the same. But what should be done in the proxy execution case? Currently, load-balancing decisions will use the scheduling context of the donor task (the one waiting for the lock), which could lead to interesting decisions. Since contending tasks remain on the run queue, the apparent load on the CPU increases, which can throw things off as well. Peter Zijlstra said that this isn't necessarily a big problem; one does not expect locks to be held for long periods, so things should straighten themselves out relatively quickly.
Patrick Bellasi asked whether just relying on clamp values is sufficient, or whether the load-tracking signal should be used too. Schneider responded that using the clamps really is the best that can be done; there is no choice. Utilization values simply change too quickly to be useful.
Heading toward a conclusion, Schneider said that getting proxy execution working right is his first priority; presumably rebooting after 20 seconds of uptime is getting a little tiresome. He asked whether other developers were interested in proxy execution as well. Zijlstra said that he has been trying to get it to the top of his list for a long time, but has been "failing miserably".
Qais Youssef asked how quickly this work might be done. The next Android release will not be happening for some time, so it would be nice if there were some way to fix this problem in the short term. Could the realtime mutex code help? Zijlstra responded that realtime mutexes are really for realtime processes and won't help with tasks in the completely fair scheduling class, as most Android tasks are. We will get the problem solved when we do, he said.
The session concluded with numerous developers saying that they would like to have a working proxy execution mechanism in the kernel, but nobody has found the time to work on it.
The many faces of "latency nice"
A task's "nice" value describes its priority within the completely fair scheduler; its semantics have roots in ancient Unix tradition. Last August, a "latency nice" parameter was proposed to provide similar control over a task's response-time requirements. At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), Parth Shah, Chris Hyser, and Dietmar Eggemann ran a discussion about the latency nice proposal; it seems that everybody agrees that it would be a useful feature to have, but there is a wide variety of opinions about what it should actually do.
A different kind of nice
Shah started by describing the latency nice value as a per-task attribute that behaves much like the normal nice value. It gives the scheduler a hint about what the task's latency needs are. It can be tweaked via the sched_setattr() system call, though there is some desire to switch to a control-group interface. Its values vary between -20 and 19 (as with nice), with -20 indicating a high degree of latency sensitivity and 19 indicating a complete indifference to latency. The default value is zero.
The first question he raised had to do with privilege: should an unprivileged process be able to decrease its latency nice value? Ordinary nice does not allow that, of course; processes must have the CAP_SYS_NICE capability to reduce their nice values. The advantage of establishing a similar rule for latency nice is that it might block potential denial-of-service problems, but at the cost of preventing ordinary users from taking advantage of this feature.
Whether this knob should be privileged depends on what it actually does, which had not yet been discussed. The initial effect of this feature is to control how hard the scheduler will look for an idle core to place a task on when it wakes up. This search takes time (thus increasing latency); an idle core may also have to be roused out of a sleep state, increasing latency further. Dhaval Giani pointed out a use case that Oracle cares about, where some latency-sensitive tasks will typically run for very short periods — less than the time spent searching for an idle core sometimes. That search can be avoided by setting a low latency nice value.
Giani also mentioned a use case from Facebook, which is more interested in getting longer-running tasks up to full speed quickly; Facebook still wants low latency, but is better served by finding an idle core that will be able to get a significant amount of work done quickly. IBM, meanwhile, is hoping to use this knob to influence the scheduler to avoid placing tasks on a CPU that is currently running latency-sensitive tasks. The discussion on use cases was cut off at this point, though, with a promise to revisit it later.
Returning to privilege, Qais Youssef suggested keeping the ability to reduce latency nice values as a privileged operation for now, especially given that this knob could gain new meanings in the future. Shah said that there do not appear to be any denial-of-service issues with the implementation for the current use cases.
Eggemann wondered about the range of values for this knob; there is a wish to bias latency in both directions, but it's not clear what the actual effects of a positive latency nice value would be. Patrick Bellasi suggested that the time before one task could preempt another could be scaled by the latency nice value. Vincent Guittot said that, with ordinary nice, each increment makes about a 10% difference in the amount of CPU time the process may use. With latency nice, he said, the values of -20, zero, and +19 make sense, but he couldn't say what the values in between would mean. Hyser said that, for negative values, there could be a fairly direct effect on the number of CPUs that will be searched before placing a task. Shah suggested that positive values could allow task placement anywhere in the system, even to CPUs that do not share low-level memory cache, which is something the scheduler normally tries hard to avoid.
Eggemann then expressed a sentiment that would be heard a few times in the session: latency nice is trying to control too many functionalities with a single knob. Bellasi suggested that the use cases could be hammered out during review of the patch and asked whether there were any real use cases with contradictory semantics. Giani mentioned the Oracle and Facebook cases mentioned above.
Control groups
Eggemann took over the presentation at this point to talk about what the Android developers would like to see. Android currently uses a control-group interface that includes a "prefer idle" attribute; setting that will bias CPU selection toward an idle CPU. The real effect of this setting, though, is to short out the energy-aware scheduling logic, which brings a certain amount of latency of its own. Thus, in this context, searching for an idle CPU is something that is preferable to do for latency-sensitive tasks — just the opposite of the situation described above.
His real purpose, though, was to discuss a potential control-group-based interface for latency nice. Control groups are a mechanism to organize processes and distribute resources, which is what is needed here. With the CPU controller, there are three ways in which CPU resources are controlled. The "weight" value gives a relative priority to the group, while the "max" value limits the maximum CPU time available and the "min" value ensures that a minimal amount of CPU time will be granted. Utilization clamping is also handled here.
How could the latency nice value be managed in this setting? The resource controlled would still have to be CPU cycles, he said. But the association between latency requirements and CPU cycles is not as clear as it is with the parameters described above. He is not sure what sort of semantics would be acceptable to the control-group maintainer. Bellasi suggested a clamping model, where each group would have values indicating the minimum and maximum latency nice values a task in that group could request. Guittot pointed out, though, that changes to latency nice values would have to be propagated up to the root of the control-group hierarchy. The discussion wandered around this point for a while before bogging down in just how latency nice would work.
Eggemann eventually suggested moving on, saying that perhaps the use cases should have been discussed from the outset. The control-group interface is only really important to Android, he said, so perhaps it would be better to figure out what the per-task attribute implementation would actually be doing.
Use cases at last
Hyser took over at this point to talk about use cases; he reiterated that the original purpose of the patch set was to skip the idle-CPU search for latency-sensitive tasks. This resulted in a 1% increase in a transaction-processing benchmark. Many workloads have critical processes that do not run for long but need to run immediately when the time comes. The latency nice change can make it possible for many of these workloads to avoid the need to use the realtime patches.
He put up some plots showing that latency nice does result in better latencies; the effect is more pronounced on systems with more cores.
He suggested that negative values should be interpreted as the number of cores to search; a value of -20 means search no cores at all, -19 would search one core, etc. But should this value be scaled by the number of CPUs in the system? It's still not clear how it should be interpreted. He suggested that latency nice looks a lot like a Boolean value in real-world use; either other cores are searched to place a task or not.
Giani said that the effect of changing a task's nice value is well understood; the effect of changing latency nice is rather less so. Hyser suggested that it could be seen as adjusting the size of the scheduling domain for latency-sensitive tasks. But scheduling domains are hardware dependent, making it hard to come up with a hardware-independent description of the semantics of latency nice. The -20 value, which searches zero cores, is not dependent on hardware at least, Hyser said. He concluded by saying that a value of -1 could mean that the CPU search would happen, but energy-aware scheduling would be disabled.
Giani said that latency nice appears to be trying to do a bunch of things and wondered if it makes sense to control it all with a single interface; Peter Zijlstra responded that those things do all affect latency, at least. Rafael Wysocki said that a single integer value is not enough to express everything that is needed here. Zijlstra said that the session really should have started with the use cases, then looked at tunables to suit those cases.
Shah discussed the task-packing use case. In particular, on systems with Intel's "turbo" mode, packing tasks onto a small number of cores can save enough resources to allow others to go into turbo mode. He suggested that tasks marked with a latency nice value greater than 15 could be packed this way, as long as they don't push the utilization of the target core above a threshold value. Doing so led to a 14% performance benefit on a workload he tested.
Another use case involves restricting the sleep states that a CPU can go into. The pm_qos mechanism can do that now, but it is a system-wide parameter with no per-task control, so it does not work as well as one would like on larger systems; it has no notion of where the latency-sensitive task will run. He suggested implementing a per-CPU counter indicating how many latency-sensitive tasks are present; if a CPU is running such tasks, the sleep states it could go into would be restricted.
Wysocki responded that this isn't a realistic scenario. It could become confused if the task is migrated, for example; he said that latency nice is not a good interface for this case. There is no way to map a latency nice value onto the set of permissible exit latencies for the CPU. Bundling semantics in this way is not going to work, he said. Bellasi said that such an interface would require users to determine their latency nice values through experimentation on a specific platform.
Shah persisted, though, saying that it can be beneficial to keep CPUs with latency-sensitive tasks from going idle. Scheduler benchmark runs showed a significant latency reduction with these semantics while maintaining similar power consumption. A pgbench run also showed big improvements in latency, but at a cost (sometimes large) in power consumption.
Youssef said that the interface to all of this is the sticking point. Thomas Gleixner agreed, saying that the -20..19 range "requires a crystal ball" to use properly. Zijlstra repeated his call to enumerate the use cases before getting into the interface details. Giani repeated that the interface does not look correct now, and agreed that a more comprehensive look at the use cases was needed. Things were being done backwards currently, he said.
Eggemann concluded by saying that the group needed to collect use cases and "take them all seriously". While the discussion continued to circle around these points for a while, it was, for all practical purposes, done.
[See the slides from this session [PDF] for more plots and other details.]
Scheduler benchmarking with MMTests
The MMTests benchmarking system is normally associated with its initial use case: testing memory-management changes. Increasingly, though, MMTests is not limited to memory management testing; at the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), Dario Faggioli talked about how he is using it to evaluate changes to the CPU scheduler, along with a discussion of the changes he had to make to get useful results for systems hosting virtualized guests.
Kernel benchmarking, he began, is typically done on bare metal. Developers want to know what the impact of a given kernel change might be, so they run a series of tests to measure performance in a reproducible setting. But Faggioli works in SUSE's virtualization lab, which has a more complicated set of objectives. A kernel change might have one effect on a host, but a different effect in guests running on that host. That leads to a need to run benchmarks with various combinations of baseline and changed kernels. Life gets even more interesting when you consider that benchmarks can take varying amounts of time to run between the host and a guest, or even between guests. Without some extra effort, a series of tests running simultaneously will not line up in any sort of predictable or repeatable way.
For example, consider a test for hypervisor scheduling fairness. If the scheduler is fair, guests with equal computing requirements should get equal amounts of CPU time. One way to test that is to ensure that every benchmark takes the same amount of time to run. Even in the presence of fair scheduling, though, there may be differences in run times between one virtual machine and the next. If a series of tests is being run, the VMs could end up running different tests at any given time, muddying the results. The only way to get clear and deterministic results is to ensure that the benchmark runs on all of the systems proceed in a synchronized manner.
There are, he said, a lot of testing and benchmarking suites to choose from. None of them, though, is able to perform synchronized runs in multiple virtual machines. He decided that the time had come to implement a suite that could do that, but he didn't want to start from scratch, so he based his work on MMTests.
The MMTests suite dates back to at least 2012, Faggioli said (LWN covered it in August of that year). While it was initially focused on memory-management changes, that is no longer the case. It is mostly implemented in a combination of Bash and Perl. The core suite is able to fetch, build, configure, and run a whole range of benchmarks. Multiple runs can be made, with MMTests collecting and storing both the configuration that was used and the results that were obtained. A set of tools exists to compare results between runs, create plots, and more. There is also a "monitor" functionality that can capture the output from various monitoring commands (top, vmstat, iostat, etc.) as well as from sources like ftrace and perf events. The set of benchmarks that can be run is large, consisting of most of the tools that kernel developers have found useful over the years.
The configuration file for MMTests is a Bash script containing a lot of export lines describing the tests to be run. There are commands to query system characteristics, such as the number of NUMA nodes; the results can be used to size the benchmarks appropriately. It is quite intuitive, Faggioli said — as long as you are familiar with the specific benchmarks you want to run. The run-mmtests.sh script will actually run the tests; there is a compare_mmtests.pl script to see what changed between different runs. Use graph-mmtests.sh to make pretty plots.
It is possible to try running MMTests as a regular user, he said, but that's not necessarily the best idea. The tests won't fail, but MMTests will not be able to do everything it needs to get a proper run. It may, for example, try to make changes to the CPU-frequency governor. It tries to undo such changes at the end, but it's still better to run the tests on a disposable machine if possible. MMTests will download benchmarks from the net, then run them as root, which may give some users pause. It is possible to set up a local mirror, which can be good for both performance and security.
For tests involving virtualization in particular, the run-kvm.sh script should be used; it will get results from both the host and guest(s). The script sets up and starts any virtual machines, as well as generating SSH keys to connect to those machines. The MMTests directory is copied directly into the virtual machines and the tests are run there. There are different configuration files for the host and the virtual machines; one may want to collect different data in the two settings, he said.
Synchronization, which Faggioli had to add to MMTests, is handled by passing tokens between the host and the virtual machines; the guests never talk directly to each other. The host implements a "barrier" before each benchmark run; once every virtual machine has informed the host that it is ready for the next test, they are all told to proceed to the next one. This ensures that the tests on all systems start at the same time.
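The protocol is simple in principle; here is an illustrative sketch in Python (MMTests itself is written in Bash and Perl, so this is not the actual implementation): every guest posts a "ready" token, and the host replies "go" to all of them only once everyone has checked in.

import queue
import threading

class HostBarrier:
    def __init__(self, nr_guests):
        self.ready = queue.Queue()
        self.go = [queue.Queue() for _ in range(nr_guests)]

    def host_release(self):
        for _ in self.go:        # wait for a "ready" token from every guest
            self.ready.get()
        for chan in self.go:     # then release all guests at once
            chan.put("go")

    def guest_wait(self, guest_id):
        self.ready.put("ready")
        return self.go[guest_id].get()

barrier = HostBarrier(2)
for i in range(2):
    threading.Thread(target=barrier.guest_wait, args=(i,)).start()
barrier.host_release()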
Faggioli has various patches that he had really intended to submit before the talk, but that didn't happen despite his proclaimed affinity for "conference-driven development". That should happen soon. With regard to documentation, he said, there is absolutely none. But there is a nice ASCII-art diagram in the script for virtual-machine synchronization, at least. He concluded by saying he has considered rewriting the whole thing in Go, but he was not sure if Mel Gorman, the maintainer of MMTests, would be up for such an idea. Gorman, who was present at the event, held his peace regarding this idea.
Douglas Raillard spoke up after Faggioli finished to say that Arm has a test suite that it uses; it lacks virtual-machine synchronization, though. It does some statistical testing on the results; he wondered if there were plans for adding that to MMTests. Faggioli said that he is not a statistician and wouldn't add that capability himself. Gorman said that MMTests does enough evaluation to try to guess whether a specific difference is significant; that is rather subtly marked in the output and is often missed. The fact that it is undocumented probably doesn't help. Raillard also asked about getting output in JSON format; Faggioli said there is JSON "in there somewhere" but he doesn't use it.
The session concluded at this point. See Faggioli's slides [PDF] for details, example plots, configuration files, and more.
Evaluating vendor changes to the scheduler
The kernel's CPU scheduler does its best to make the right decisions for just about any workload; over the years, it has been extended to better handle mobile-device scheduling as well. But handset vendors still end up applying their own patches to the scheduler for the kernels they ship. Shipping out-of-tree code in this way leads to a certain amount of criticism from the kernel community but, as Vincent Donnefort pointed out in his session at the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), those patches are applied for a reason. He looked at a set of vendor scheduler patches to see why they are being used.
As a testbed for these patches, Donnefort chose the Pixel 4 phone. It's a device with good upstream support, so it's easy to replace its kernel without the need for lots of other out-of-tree code. This device has three different CPU core sizes (small, medium, and large), where the small cores are small indeed. It is imperative to pick the correct CPU for any given task, or there will be a cost to pay in performance or energy use. The PCMark benchmark was used to evaluate performance, while power measurement was done directly from the phone's power rails. A 4.14 kernel was used for the tests.
The first patch tested performs CPU isolation by actively evacuating tasks to other CPUs; the intent is to idle the CPU and let it be put into a sleep state. Tasks are migrated, interrupts are directed elsewhere, and the CPU is removed from the load balancer's attention entirely; kernel threads attached to that CPU still run, though. This is, he said, a sort of lightweight form of CPU hotplug.
This patch works by looking at the load presented by all of the running tasks and calculating how much CPU power is needed. If the number of running CPUs exceeds what is needed, it will try to isolate one or more of them. This decision is made in user space.
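As a hedged sketch of that decision (the function names and the 80% headroom target are illustrative assumptions, not the vendor's actual code), the user-space logic amounts to something like:

    #define CPU_CAPACITY     1024   /* full capacity of one CPU, in kernel-style units */
    #define TARGET_UTIL_PCT    80   /* keep remaining CPUs below 80% busy (assumed figure) */

    /* How many of the online CPUs could be isolated given the current total load? */
    static int cpus_to_isolate(unsigned long total_util, int online_cpus)
    {
        unsigned long per_cpu_budget = CPU_CAPACITY * TARGET_UTIL_PCT / 100;
        int needed = (total_util + per_cpu_budget - 1) / per_cpu_budget;

        if (needed < 1)
            needed = 1;             /* always keep at least one CPU running */

        return online_cpus > needed ? online_cpus - needed : 0;
    }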
In performance testing, Donnefort found that CPU isolation reduces throughput slightly, but also gives a 4% drop in power consumption. Vincent Guittot asked why the energy model built into the kernel couldn't handle this task; Donnefort responded that he didn't try to evaluate alternative solutions to the problem. The results show, though, that there is room for improvement.
The other patches were presented as a set. They were:
- "Migration margins": this patch changes the way the kernel picks a CPU for a task on an asymmetric system. This is done by comparing the task's expected utilization with the capacity of the CPU; the mainline kernel will only place a task on a CPU if there will be 20% of its capacity left afterward. The vendor patch lowers this margin to 5%, thus increasing the chance that a given task will end up on a smaller, more energy-efficient CPU.
- There is a change to how the scheduler does task packing. The mainline tries to keep tasks contained within a single cluster (thus possibly allowing other clusters to go idle), but will try to spread out tasks across the CPUs in a cluster. The vendor patch, instead, works harder to pack tasks into a single CPU, though stopping before it would become necessary to increase the CPU's frequency.
- The mainline puts some effort into finding the most efficient CPU to run any given task on — too much time, it seems, for some vendors, who make a change to that algorithm. With this change, the kernel decides where to put a task by first looking at where it was running last time; if that CPU is idle and the task fits there, the placement logic will be shorted out and that CPU will be chosen immediately. He noted that energy-aware task placement has improved considerably since the release of the 4.14 kernel used for these tests.
- When placing a realtime task, the kernel performs a search for the CPU that is running the lowest-priority task; that will be the easiest one to preempt. The vendor patch expands this search to look at utilization and idle states as well, trying to find the CPU that is the least busy overall. The search is also biased toward finding the smallest suitable CPU.
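To make the "migration margins" item concrete, here is a sketch of the capacity-fit test involved. The mainline check follows the kernel's usual fixed-point style with values scaled by 1024; the 5% variant is an illustration of what the vendor change does, not the exact patch:

    /* Mainline-style check: the task's utilization must leave 20% headroom. */
    static inline int task_fits_mainline(unsigned long util, unsigned long capacity)
    {
        return util * 1280 < capacity * 1024;   /* util below 80% of capacity */
    }

    /* Vendor-style check (illustrative): only 5% headroom is required. */
    static inline int task_fits_vendor(unsigned long util, unsigned long capacity)
    {
        return util * 1078 < capacity * 1024;   /* util below roughly 95% of capacity */
    }

With the smaller margin, more tasks "fit" on the small cores, so the placement logic chooses them more often.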
The benchmark results for each of these patches were remarkably similar. They all tended to hurt performance by 3-5% while reducing energy use by 8-11%. What Donnefort did not do, though, was to benchmark a system with all of them applied; he cautioned against assuming that those differences would be additive with all of the patches in the system.
He concluded with the simple assertion that, even if some of these changes are controversial, they are clearly useful in this setting. He will be looking at ways of getting those changes into an acceptable form for merging upstream.
In the discussion, Qais Yousef said that some of his recent CPU-capacity changes might be able to replace some of this work. Dietmar Eggemann asked why the energy model wasn't providing CPU isolation now; it should already be pushing things aggressively toward small CPUs. Peter Zijlstra agreed that it was important to figure out why that workaround was necessary; perhaps the scheduler should look more closely at idle states in the energy-aware path. Donnefort said that CPU isolation in this form is probably not the right solution for the mainline kernel, but it does show that there is something to be gained that way.
See Donnefort's slides [PDF] for detailed results and more.
Bao: a lightweight static partitioning hypervisor
Developers of safety-critical systems tend to avoid Linux kernels for a number of fairly obvious reasons; Linux simply was not developed with that sort of use case in mind. There are increasingly compelling reasons to use Linux in such systems, though, leading to a search for the best way to do so safely. At the 2020 Power Management and Scheduling in the Linux Kernel summit (OSPM), José Martins described Bao, a minimal hypervisor aimed at safety-critical deployments.
The actual target, he began, is "mixed-criticality" systems in which multiple software stacks run in parallel with each other; some of those stacks are safety-critical, while others are not. For example, a system could have a user interface running on Linux alongside the safety-critical application that it controls. There is an industry trend toward consolidating systems in this way, driven by power considerations and the availability of processors with numerous CPUs.
Virtualization is naturally interesting for the developers of such systems; it minimizes the effort required to port systems and eases the integration within them. Good virtualization provides fault isolation, preventing failures in one part of the system from interfering with others. Developers want the usual things from such a system: good performance, realtime guarantees, and strong security.
Martins spent some time looking at solutions like Xen and KVM. They were not designed for this kind of use case, but they end up being used anyway. Neither is an optimal solution; they use virtualized I/O mechanisms that add a lot of overhead, and their code bases are large and hard to audit.
Instead, he said, there is a role here for a static partitioning hypervisor, which can be seen as a thin configuration layer that divides up a system's resources. Under a system like this, there is a one-to-one mapping between virtual CPUs and physical CPUs, so there is no contention for CPU time. Devices are mapped directly into the guests, avoiding any added I/O overhead. Perhaps the best-known hypervisor of this type is Jailhouse, but that didn't meet Martins's needs; it depends on a Linux "root cell" to run the whole show, its boot time is relatively long, and there is still a big code base to audit. The Xen Dom0-less project can do direct device assignment, which is nice, but it falls short in other ways.
So Martins set out to create Bao as a "type-1 bare-metal hypervisor" with a one-to-one CPU mapping. It doesn't depend on any sort of privileged virtual machine or operating system to boot. Bao provides a simple inter-VM communication mechanism based on shared memory and virtual interrupts. It depends on hardware assistance for many of its functions, including second-stage address translation, an I/O memory-management unit, and virtual interrupts. Bao can use huge pages to reduce translation lookaside buffer pressure and page-table memory use; it is also able to perform cache coloring for memory allocations to avoid low-level cache interference between machines.
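As an illustration of what guest software sees with this model, a guest might talk to a peer through something like the sketch below; the shared-memory address, the mailbox layout, and the notify_peer() helper are assumptions standing in for whatever region Bao actually maps and whatever virtual interrupt it injects:

    #include <stdint.h>

    /* Hypothetical layout of the shared-memory region mapped into both guests. */
    struct ipc_mailbox {
        volatile uint32_t ready;      /* producer sets this to 1 when data[] is valid */
        volatile uint32_t len;        /* number of valid bytes in data[] */
        volatile uint8_t  data[4088]; /* payload; sized to fill one 4KiB page */
    };

    #define IPC_SHMEM_BASE ((struct ipc_mailbox *)0x80000000UL)  /* assumed guest mapping */

    extern void notify_peer(void);    /* assumed helper that raises the peer's virtual interrupt */

    void send_to_peer(const uint8_t *buf, uint32_t len)
    {
        struct ipc_mailbox *mbox = IPC_SHMEM_BASE;

        for (uint32_t i = 0; i < len && i < sizeof(mbox->data); i++)
            mbox->data[i] = buf[i];
        mbox->len = len;
        mbox->ready = 1;              /* a real implementation would need a memory barrier here */
        notify_peer();
    }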
Bao currently targets the Armv8 architecture. There is a RISC-V port, but the virtualization specification for RISC-V is not ready, so this port is not interesting yet. It can run a number of guests, including bare-metal applications, Linux, Android, and various realtime operating systems.
Ideally, he said, Bao would just be a configuration layer that does its work and gets out of the way, but the hardware does not support this mode of operation. Interrupts, for example, have to be mediated through the hypervisor, which is unfortunate since that increases latency. The I/O memory-management unit has a limited number of stream registers, and doesn't cover all devices on some platforms. There is no partitioning mechanism for memory cache on Arm, so the hypervisor must handle isolation via cache coloring.
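Cache coloring works by partitioning physical pages according to the address bits that index the shared cache's sets, so that virtual machines assigned different colors cannot evict each other's cache lines. A hedged sketch of the idea, with made-up parameters for a hypothetical platform rather than Bao's actual configuration, looks like this:

    #include <stdint.h>

    #define PAGE_SHIFT  12            /* 4KiB pages */
    #define NUM_COLORS   8            /* assumed: set-index bits above the page offset */

    /* A page's color comes from the low set-index bits of its physical frame number. */
    static unsigned int page_color(uint64_t phys_addr)
    {
        return (phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1);
    }

    /* Give a page to a VM only if its color is in that VM's assigned color mask. */
    static int page_belongs_to_vm(uint64_t phys_addr, unsigned int vm_color_mask)
    {
        return (vm_color_mask >> page_color(phys_addr)) & 1;
    }

Because colored allocations cannot be grouped into naturally aligned huge pages, this scheme conflicts with huge-page mappings, which is where the extra overhead mentioned below comes from.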
The system is implemented in about 7,000 lines of code, and requires 50KB of memory on the target system. That is "somewhat small", he said, but he is working to get it smaller. Run-time memory requirements add up to about 250KB. Benchmark runs show that the hypervisor adds an execution-time overhead of about 2%. Turning on cache coloring increases that overhead, since that feature is incompatible with the use of huge pages. Interference tests currently show a significant amount of degradation caused by activity in other virtual machines; cache coloring helps but does not completely solve the problem.
Another issue is interrupt latency, which increases significantly due to the need for a round-trip through the hypervisor. There is a fair amount of cross-VM interference caused by interrupts as well; again, cache coloring helps, especially if it is used within the hypervisor too. He has found a way to map interrupts directly into guests, something that is made possible by the one-to-one CPU mapping. That increases the overhead for interrupts intended for the hypervisor itself, but those are relatively rare.
Current work includes adding support for trusted execution environments. The other approaches out there share the trusted code across virtual machines, which is not ideal. Arm is adding support for a "trusted hypervisor" mode but, for now, complex workarounds are required. Martins said that the "dual-world" approach used in this area is not inherently secure; a lot of code has to be added to the secure side, bringing the same old problems with it. It is better, he said, to limit the secure world to core security primitives; he is trying to do that by avoiding TrustZone completely and dedicating a virtual CPU to trusted work. This involves allowing multiple virtual CPUs to run on a single physical CPU.
Overall, he concluded, Bao has turned out to be a good fit for the intended use case.
Brief items
Security
NXNSAttack: upgrade resolvers to stop new kind of random subdomain attack
CZ.NIC staff member Petr Špaček has a blog post describing a newly disclosed DNS resolver vulnerability called NXNSAttack. It allows attackers to abuse the delegation mechanism to create a denial-of-service condition via packet amplification. "This is so-called glueless delegation, i.e. a delegation which contains only names of authoritative DNS servers (a.iana-servers.net. and b.iana-servers.net.), but does not contain their IP addresses. Obviously DNS resolver cannot send a query to “name”, so the resolver first needs to obtain IPv4 or IPv6 address of authoritative server 'a.iana-servers.net.' or 'b.iana-servers.net.' and only then it can continue resolving the original query 'example.com. A'. This glueless delegation is the basic principle of the NXNSAttack: Attacker simply sends back delegation with fake (random) server names pointing to victim DNS domain, thus forcing the resolver to generate queries towards victim DNS servers (in a futile attempt to resolve fake authoritative server names)." At this time, Ubuntu has updated its BIND package to mitigate the problem; other distributions will no doubt follow soon. More details can also be found in the paper [PDF].
A remote code execution vulnerability in qmail
Just in case anybody out there is still using qmail: a remote code execution vulnerability has just been disclosed. Its CVE number is CVE-2005-1513 because, as it turns out, the problem was reported 15 years ago but the fix was refused by the maintainer. "As a proof of concept, we developed a reliable, local and remote exploit against Debian's qmail package in its default configuration. This proof of concept requires 4GB of disk space and 8GB of memory, and allows an attacker to execute arbitrary shell commands as any user, except root (and a few system users who do not own their home directory)."
Security quotes of the week
That doesn't mean there aren't other companies that do sell private data. There are. Lots of them. Data brokers, telcos, some ISPs, and even your local DMV have been caught selling your actual data. But for some reason, everyone wants to keep insisting that Google and Facebook also sell data, when they never have, and have always only sold targeted advertising in which the data only goes in one direction, and not back to the advertiser.
Kernel development
Kernel release status
The current development kernel is 5.7-rc6, released on May 17. "That said, there's nothing particularly scary in here, and it's not like this rc6 is outrageously big or out of control. I was just hoping for less."
Stable updates: 5.6.13, 5.4.41, and 4.19.123 were released on May 14. The 5.6.14, 5.4.42, 4.19.124, 4.14.181, 4.9.224, and 4.4.224 updates followed on May 20.
Distributions
Distribution quote of the week
That’s a saying I’ve attached to Finnix for many years. A hammer is a tool, and when it does its job, you may consciously or unconsciously appreciate it for doing its job, but there are very few hammer fan clubs.
[...] Finnix was never heavily mentioned by its users in the same way a desktop like, say, Linux Mint was. Why does the tool need to be praised?
Development
Going above and beyond with Inkscape 1.0 (Libre Graphics World)
Libre Graphics World is running an extensive interview with several Inkscape developers. "I'd say we're at the point of supporting SVG as much as possible, but we've mostly given up trying to add editing features to the SVG specification. As the W3C is dominated by web browsers who don't need multi page or connectors. I dare not say much more about W3C-specific things. I know that I'm personally disappointed that Inkscape's considerable importance in the SVG creation space does not lend itself to getting the feature we intend to build into Inkscape into the actual SVG specification. This does lead to the problem that going forwards we're likely to have browser incompatibilities."
Five years of Rust
It seems that the Rust programming language has only been around for five years. "With all that's going on in the world you'd be forgiven for forgetting that as of today, it has been five years since we released 1.0 in 2015! Rust has changed a lot these past five years, so we wanted reflect back on all of our contributors' work since the stabilization of the language."
Page editor: Jake Edge
Announcements
Newsletters
Distributions and system administration
Development
- Emacs News (May 18)
- What's cooking in git.git (May 19)
- What's cooking in git.git (May 20)
- LLVM Weekly (May 18)
- OCaml Weekly News (May 19)
- Perl Weekly (May 18)
- PostgreSQL Weekly News (May 17)
- Python Weekly Newsletter (May 14)
- Racket News (May 18)
- Weekly Rakudo News (May 18)
- Ruby Weekly News (May 14)
- This Week in Rust (May 19)
- Wikimedia Tech News (May 18)
Meeting minutes
- Fedora FESCO meeting minutes (May 18)
Miscellaneous
Calls for Presentations
CFP Deadlines: May 21, 2020 to July 20, 2020
The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.
Deadline | Event Dates | Event | Location |
---|---|---|---|
May 22 | May 28-May 31 | MiniDebConf Online | |
June 5 | September 13-September 18 | The C++ Conference 2020 | Online |
June 14 | September 4-September 11 | Akademy 2020 | Virtual, Online |
June 15 | August 25-August 27 | Linux Plumbers Conference | Virtual |
June 30 | July 4-July 5 | State of the Map 2020 | Virtual |
July 1 | September 22-September 24 | Linaro Virtual Connect | online |
July 2 | September 9-September 10 | State of the Source Summit | online |
July 5 | October 28-October 29 | [Canceled] DevOpsDays Berlin 2020 | Berlin, Germany |
July 5 | August 23-August 29 | DebConf20 | online |
July 5 | September 16-September 18 | X.Org Developer's Conference 2020 | online |
July 5 | October 2-October 3 | PyGotham TV | Online |
July 13 | September 29-October 1 | ApacheCon 2020 | Online |
July 15 | October 6-October 8 | 2020 Virtual LLVM Developers' Meeting | online |
If the CFP deadline for your event does not appear here, please tell us about it.
Upcoming Events
Netdev conference 0x14 going virtual
The Netdev society has announced that the Netdev 0x14 conference, originally scheduled for March, will be a virtual event beginning August 16.
To all the folk who worked hard to contribute submissions and to the program committee that worked hard and reviewed these submissions, it is only fair that we provide a platform where these ideas will be shared in a timely fashion.
Events: May 21, 2020 to July 20, 2020
The following event listing is taken from the LWN.net Calendar.
If your event does not appear here, please tell us about it.
Security updates
Alert summary May 14, 2020 to May 20, 2020
Dist. | ID | Release | Package | Date |
---|---|---|---|---|
Debian | DSA-4686-1 | stable | apache-log4j1.2 | 2020-05-15 |
Debian | DLA-2210-1 | LTS | apt | 2020-05-15 |
Debian | DSA-4685-1 | stable | apt | 2020-05-14 |
Debian | DSA-4689-1 | stable | bind9 | 2020-05-19 |
Debian | DLA-2215-1 | LTS | clamav | 2020-05-20 |
Debian | DSA-4688-1 | stable | dpdk | 2020-05-18 |
Debian | DLA-2213-1 | LTS | exim4 | 2020-05-18 |
Debian | DSA-4687-1 | stable | exim4 | 2020-05-16 |
Debian | DLA-2176-1 | LTS | inetutils | 2020-05-14 |
Debian | DLA-2214-1 | LTS | libexif | 2020-05-18 |
Debian | DSA-4684-1 | stable | libreswan | 2020-05-13 |
Debian | DLA-2211-1 | LTS | log4net | 2020-05-15 |
Debian | DLA-2212-1 | LTS | openconnect | 2020-05-16 |
Fedora | FEDORA-2020-06c54925d3 | F30 | chromium | 2020-05-17 |
Fedora | FEDORA-2020-da49fbb17c | F31 | chromium | 2020-05-17 |
Fedora | FEDORA-2020-ae934f6790 | F30 | condor | 2020-05-17 |
Fedora | FEDORA-2020-f9a598f815 | F31 | condor | 2020-05-17 |
Fedora | FEDORA-2020-fb5af97476 | F32 | condor | 2020-05-18 |
Fedora | FEDORA-2020-885e2343ed | F31 | glpi | 2020-05-14 |
Fedora | FEDORA-2020-ee30e1109f | F32 | glpi | 2020-05-14 |
Fedora | FEDORA-2020-d109a1d1d9 | F31 | grafana | 2020-05-14 |
Fedora | FEDORA-2020-c6b0c7ebbb | F32 | grafana | 2020-05-14 |
Fedora | FEDORA-2020-a60ad9d4ec | F31 | java-1.8.0-openjdk | 2020-05-18 |
Fedora | FEDORA-2020-831ec85119 | F31 | java-1.8.0-openjdk-aarch32 | 2020-05-17 |
Fedora | FEDORA-2020-07aa58121a | F32 | java-1.8.0-openjdk-aarch32 | 2020-05-17 |
Fedora | FEDORA-2020-36298e20f7 | F31 | java-latest-openjdk | 2020-05-14 |
Fedora | FEDORA-2020-755e4213b5 | F32 | java-latest-openjdk | 2020-05-14 |
Fedora | FEDORA-2020-5a69decc0c | F30 | kernel | 2020-05-20 |
Fedora | FEDORA-2020-c6b9fff7f8 | F31 | kernel | 2020-05-20 |
Fedora | FEDORA-2020-4c69987c40 | F32 | kernel | 2020-05-15 |
Fedora | FEDORA-2020-4336d63533 | F32 | kernel | 2020-05-20 |
Fedora | FEDORA-2020-69f2f1d987 | F31 | mailman | 2020-05-14 |
Fedora | FEDORA-2020-20b748e81e | F32 | mailman | 2020-05-15 |
Fedora | FEDORA-2020-e244f22a51 | F32 | mingw-OpenEXR | 2020-05-16 |
Fedora | FEDORA-2020-e244f22a51 | F32 | mingw-ilmbase | 2020-05-16 |
Fedora | FEDORA-2020-7aba37f66a | F30 | moodle | 2020-05-20 |
Fedora | FEDORA-2020-a1b4d24680 | F31 | moodle | 2020-05-20 |
Fedora | FEDORA-2020-758e089ff7 | F32 | moodle | 2020-05-20 |
Fedora | FEDORA-2020-238bbf85d8 | F32 | oddjob | 2020-05-14 |
Fedora | FEDORA-2020-143735a624 | F32 | openconnect | 2020-05-19 |
Fedora | FEDORA-2020-8d3b359179 | F30 | perl-Mojolicious | 2020-05-19 |
Fedora | FEDORA-2020-aceb5a1d0a | F31 | perl-Mojolicious | 2020-05-19 |
Fedora | FEDORA-2020-cc7deffbf1 | F32 | perl-Mojolicious | 2020-05-19 |
Fedora | FEDORA-2020-3ea2253402 | F32 | php | 2020-05-19 |
Fedora | FEDORA-2020-6e3e0c6386 | F30 | sleuthkit | 2020-05-17 |
Fedora | FEDORA-2020-1dd340ab85 | F31 | sleuthkit | 2020-05-17 |
Fedora | FEDORA-2020-94c2f78e0c | F32 | sleuthkit | 2020-05-17 |
Fedora | FEDORA-2020-a6a921a591 | F30 | squid | 2020-05-16 |
Fedora | FEDORA-2020-848065cc4c | F31 | squid | 2020-05-16 |
Fedora | FEDORA-2020-56e809930e | F32 | squid | 2020-05-16 |
Fedora | FEDORA-2020-e67318b4b4 | F32 | transmission | 2020-05-20 |
Fedora | FEDORA-2020-c952520959 | F30 | viewvc | 2020-05-15 |
Gentoo | 202005-13 | | chromium | 2020-05-15 |
Gentoo | 202005-07 | | freerdp | 2020-05-15 |
Gentoo | 202005-10 | | libmicrodns | 2020-05-15 |
Gentoo | 202005-06 | | live | 2020-05-15 |
Gentoo | 202005-12 | | openslp | 2020-05-15 |
Gentoo | 202005-09 | | python | 2020-05-15 |
Gentoo | 202005-11 | | vlc | 2020-05-15 |
Gentoo | 202005-08 | | xen | 2020-05-15 |
Mageia | MGASA-2020-0213 | 7 | jbig2dec | 2020-05-15 |
Mageia | MGASA-2020-0215 | 7 | libreswan | 2020-05-15 |
Mageia | MGASA-2020-0211 | 7 | netkit-telnet | 2020-05-15 |
Mageia | MGASA-2020-0212 | 7 | ntp | 2020-05-15 |
Mageia | MGASA-2020-0214 | 7 | suricata | 2020-05-15 |
openSUSE | openSUSE-SU-2020:0661-1 | 15.1 | mailman | 2020-05-15 |
openSUSE | openSUSE-SU-2020:0667-1 | | nextcloud | 2020-05-17 |
openSUSE | openSUSE-SU-2020:0668-1 | | nextcloud | 2020-05-17 |
Oracle | ELSA-2020-2143 | OL8 | .NET Core | 2020-05-14 |
Oracle | ELSA-2020-1926 | OL8 | container-tools:1.0 | 2020-05-14 |
Oracle | ELSA-2020-1931 | OL8 | container-tools:2.0 | 2020-05-13 |
Oracle | ELSA-2020-1932 | OL8 | container-tools:ol8 | 2020-05-13 |
Oracle | ELSA-2020-2103 | OL6 | kernel | 2020-05-13 |
Oracle | ELSA-2020-2082 | OL7 | kernel | 2020-05-14 |
Oracle | ELSA-2020-5691 | OL7 | kernel | 2020-05-19 |
Oracle | ELSA-2020-2102 | OL8 | kernel | 2020-05-14 |
Oracle | ELSA-2020-5691 | OL8 | kernel | 2020-05-19 |
Oracle | ELSA-2020-2070 | OL8 | libreswan | 2020-05-13 |
Oracle | ELSA-2020-2041 | OL8 | squid:4 | 2020-05-13 |
Oracle | ELSA-2020-2046 | OL8 | thunderbird | 2020-05-13 |
Red Hat | RHSA-2020:2213-01 | EL7.4 | ipmitool | 2020-05-19 |
Red Hat | RHSA-2020:2214-01 | EL7.4 | kernel | 2020-05-19 |
Red Hat | RHSA-2020:2199-01 | EL8.1 | kernel | 2020-05-19 |
Red Hat | RHSA-2020:2171-01 | EL8 | kernel-rt | 2020-05-14 |
Red Hat | RHSA-2020:2203-01 | EL8.1 | kpatch-patch | 2020-05-19 |
Red Hat | RHSA-2020:2210-01 | EL7.4 | ksh | 2020-05-19 |
Red Hat | RHSA-2020:2212-01 | EL7.4 | ruby | 2020-05-19 |
Scientific Linux | SLSA-2020:2082-1 | SL7 | kernel | 2020-05-15 |
Slackware | SSA:2020-140-01 | | bind | 2020-05-19 |
Slackware | SSA:2020-140-02 | | libexif | 2020-05-19 |
Slackware | SSA:2020-139-01 | | sane | 2020-05-18 |
SUSE | SUSE-SU-2020:1272-1 | OS7 OS8 SLE12 SES5 | apache2 | 2020-05-13 |
SUSE | SUSE-SU-2020:1296-1 | SLE15 | autoyast2 | 2020-05-18 |
SUSE | SUSE-SU-2020:1334-1 | SLE15 | dpdk | 2020-05-19 |
SUSE | SUSE-SU-2020:1335-1 | SLE15 | dpdk | 2020-05-19 |
SUSE | SUSE-SU-2020:1294-1 | SLE15 | file | 2020-05-18 |
SUSE | SUSE-SU-2020:1295-1 | OS7 OS8 SLE12 SES5 | git | 2020-05-18 |
SUSE | SUSE-SU-2020:1273-1 | SES5 | grafana | 2020-05-13 |
SUSE | SUSE-SU-2020:1300-1 | SLE15 | gstreamer-plugins-base | 2020-05-18 |
SUSE | SUSE-SU-2020:1275-1 | OS8 SLE12 SES5 | kernel | 2020-05-14 |
SUSE | SUSE-SU-2020:1298-1 | SLE15 | libbsd | 2020-05-18 |
SUSE | SUSE-SU-2020:1277-1 | SLE12 | libvirt | 2020-05-14 |
SUSE | SUSE-SU-2020:1289-1 | SLE12 | libvirt | 2020-05-15 |
SUSE | SUSE-SU-2020:1297-1 | SLE15 | libvpx | 2020-05-18 |
SUSE | SUSE-SU-2020:1299-1 | SLE15 | libxml2 | 2020-05-18 |
SUSE | SUSE-SU-2020:1301-1 | OS7 OS8 SLE12 SES5 | mailman | 2020-05-18 |
SUSE | SUSE-SU-2020:1337-1 | SLE15 | openconnect | 2020-05-19 |
SUSE | SUSE-SU-2020:1292-1 | SLE12 | openexr | 2020-05-18 |
SUSE | SUSE-SU-2020:1293-1 | SLE15 | openexr | 2020-05-18 |
SUSE | SUSE-SU-2020:1285-1 | MP3.2 OS6 OS7 OS8 SLE12 SES5 | python-PyYAML | 2020-05-15 |
SUSE | SUSE-SU-2020:1339-1 | SLE15 | python | 2020-05-19 |
SUSE | SUSE-SU-2020:1274-1 | SES5 | python-paramiko | 2020-05-14 |
SUSE | SUSE-SU-2020:1338-1 | SLE15 | rpmlint | 2020-05-19 |
SUSE | SUSE-SU-2020:14369-1 | SLE11 | syslog-ng | 2020-05-14 |
Ubuntu | USN-4359-1 | 16.04 18.04 19.10 20.04 | apt | 2020-05-14 |
Ubuntu | USN-4365-1 | 16.04 18.04 19.10 20.04 | bind9 | 2020-05-19 |
Ubuntu | USN-4361-1 | 19.10 20.04 | dovecot | 2020-05-18 |
Ubuntu | USN-4362-1 | 18.04 19.10 20.04 | dpdk | 2020-05-18 |
Ubuntu | USN-4366-1 | 14.04 16.04 18.04 19.10 20.04 | exim4 | 2020-05-19 |
Ubuntu | USN-4360-3 | 12.04 14.04 | json-c | 2020-05-15 |
Ubuntu | USN-4360-1 | 12.04 14.04 16.04 18.04 19.10 20.04 | json-c | 2020-05-14 |
Ubuntu | USN-4360-2 | 16.04 18.04 19.10 20.04 | json-c | 2020-05-15 |
Ubuntu | USN-4358-1 | 12.04 14.04 16.04 18.04 19.10 20.04 | libexif | 2020-05-13 |
Ubuntu | USN-4363-1 | 16.04 18.04 | linux, linux-aws, linux-aws-hwe, linux-gcp, linux-gke-4.15, linux-hwe, linux-oem, linux-oracle, linux-snapdragon | 2020-05-18 |
Ubuntu | USN-4367-1 | 20.04 | linux, linux-aws, linux-gcp, linux-kvm, linux-oracle, linux-riscv | 2020-05-19 |
Ubuntu | USN-4364-1 | 14.04 16.04 | linux, linux-aws, linux-lts-xenial, linux-raspi2, linux-snapdragon | 2020-05-18 |
Ubuntu | USN-4368-1 | 18.04 | linux-gke-5.0, linux-oem-osp1 | 2020-05-19 |
Kernel patches of interest
Kernel releases
Architecture-specific
Build system
Core kernel
Development tools
Device drivers
Device-driver infrastructure
Documentation
Filesystems and block layer
Memory management
Networking
Security-related
Virtualization and containers
Page editor: Rebecca Sobol