Leading items
Welcome to the LWN.net Weekly Edition for March 4, 2021
This edition contains the following feature content:
- Alternative syntax for Python's lambda: should Python follow other languages with a more terse syntax for anonymous functions?
- PipeWire: The Linux audio/video bus: the history and status of a new generation of audio and video plumbing for Linux.
- Fedora and fallback DNS servers: the question of whether Fedora systems should fall back to public DNS servers if the current configuration does not work.
- 5.12 merge window, part 2: the rest of the changes merged for 5.12.
- Lockless patterns: relaxed access and partial memory barriers: the series on lockless programming continues with a first look at memory barriers.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Alternative syntax for Python's lambda
The Python lambda keyword, which can be used to create small, anonymous functions, comes from the world of functional programming, but is perhaps not the most beloved of Python features. In part, that may be because it is somewhat clunky to use, especially in comparison to the shorthand notation offered by other languages, such as JavaScript. That has led to discussions of possible changes to lambda on the Python mailing lists since mid-February.
Background
This is far from the first time lambda has featured in discussions in the Python community; the search for a more compact and, perhaps, more capable, version has generally been the focus of the previous discussions. Even the name "lambda" is somewhat obscure; it comes from the Greek letter "λ", by way of the lambda calculus formal system of mathematical logic. In Python, lambda expressions can be used anywhere a function object is needed; for example:
    >>> (lambda x: x * 7)(6)
    42
In that example, we define an anonymous function that "multiplies" its argument by seven, then call it with an argument of six. But, of course, Python has overloaded the multiplication operation, so:
    >>> (lambda x: x * 7)('ba')
    'bababababababa'
Meanwhile, lambda can be used in place of def, though it may be of dubious value to do so:
    >>> incfunc = lambda x: x + 1
    >>> incfunc(37)
    38

    # not much different from:

    >>> def incfunc(x):
    ...     return x + 1
    ...
    >>> incfunc(37)
    38
Lambdas are restricted to a single Python expression; they cannot contain statements, nor can they have type annotations. Some of the value of the feature can be seen in combination with some of the other functional-flavored parts of Python. For example:
    >>> list(filter(lambda x: x % 2 == 1, range(17)))
    [1, 3, 5, 7, 9, 11, 13, 15]
There we use the filter() function to create an iterator, then use list() to produce a list of the first eight odd numbers, with a lambda providing the test function. Over time, though, other Python features supplanted these uses for lambda expressions; the same result as above can be accomplished using a list comprehension:
    >>> [ x for x in range(17) if x % 2 == 1 ]
    [1, 3, 5, 7, 9, 11, 13, 15]
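The single-expression restriction mentioned above is also easy to demonstrate: a statement in a lambda body is rejected at compile time. A quick illustration:

```python
# A lambda body must be a single expression; statements such as
# "return" are rejected when the code is compiled.
try:
    compile("f = lambda x: return x + 1", "<example>", "exec")
except SyntaxError:
    print("SyntaxError: statements are not allowed in a lambda body")
```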
The most obvious use of lambda may be as key parameters to list.sort(), sorted(), max()/min(), and the like. That parameter can be used to extract a particular attribute or piece of each object in order to sort based on that:
    >>> sorted([ ('a', 37), ('b', 23), ('c', 73) ], key=lambda x: x[1])
    [('b', 23), ('a', 37), ('c', 73)]
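As an aside, for this particular key-extraction pattern the standard library already offers a named alternative to the lambda, operator.itemgetter (shown here only for comparison; the discussion itself was about syntax, not this idiom):

```python
from operator import itemgetter

pairs = [('a', 37), ('b', 23), ('c', 73)]

# Equivalent to key=lambda x: x[1]
print(sorted(pairs, key=itemgetter(1)))
# [('b', 23), ('a', 37), ('c', 73)]
```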
Arrow operators?
A thread about "mutable defaults" on the python-list mailing list made its way over to python-ideas when James Pic posted a suggestion for an alternate lambda syntax. His ultra-terse syntax did not strike a chord with the others participating in the thread, but it led "Random832" to suggest looking at the "->" or "=>" arrow operators used by other languages, such as C#, Java, and JavaScript.
It's worth noting that all three of these are later additions to their respective languages, and they all have earlier, more difficult, ways of writing nested functions within expressions. Their designers saw the benefit of an easy lambda syntax, why don't we?
Former Python benevolent dictator Guido van Rossum agreed:
"I'd prefer the JavaScript solution, since -> already has a different meaning in Python (the return *type*). We could use -> to simplify typing.Callable, and => to simplify lambda."

He also answered the question by suggesting that the "endless search for multi-line lambdas" may have derailed any kind of lambda-simplification effort. Van Rossum declared multi-line lambda expressions "un-Pythonic" in a blog post back in 2006, but in the thread he said that it is not too late to add some kind of simplified lambda.
Steven D'Aprano was concerned about having two separate "arrow" operators: "That will lead to constant confusion for people who can't remember which arrow operator to use." He said that the "->" symbol that is already in use for the annotation of return types could also be used to define anonymous functions. It is also a well-known idiom:
There are plenty of popular and influential languages that use the single line arrow -> such as Maple, Haskell, Julia, CoffeeScript, Erlang and Groovy, to say nothing of numerous lesser known and obscure languages.
He also posted a lengthy analysis of how both uses of the single-line arrow could coexist, though it turned out that there is a parsing ambiguity if type annotations are allowed in "arrow functions" (i.e. those defined using the single-line arrow). One can perhaps be forgiven for thinking that the example he gives is not entirely Pythonic, however:
    (values:List[float], arg:int=0 -> Type) -> expression

In case anyone is still having trouble parsing that:

    # lambda version
    lambda values, arg=0: expression

    # becomes arrow function
    (values, arg=0) -> expression

    # add parameter annotations
    (values:List[float], arg:int=0) -> expression

    # and the return type of the function body (expression)
    (values:List[float], arg:int=0 -> T) -> expression
The obscurity of the name "lambda" also came up. Brendan Barnwell lamented the choice of the name as confusing, while Ned Batchelder called it "opaque":
People who know the background of lambda can easily understand using a different word. People who don't know the background are presented with a "magic word" with no meaning. That's not good UI.
But, as D'Aprano pointed out, it is far from the only jargon that Python (and other language) programmers will need to pick up:
[It's] no more "magic" than tuple, deque, iterator, coroutine, ordinal, modulus, etc, not to mention those ordinary English words with specialised jargon meanings like float, tab, zip, thread, key, promise, trampoline, tree, hash etc.
While Paul Sokolovsky is a big fan of the lambda keyword, he does think that differentiating between the two uses of the arrow notation (the existing use for return types and a possible future use for defining short functions) is important. He thinks it would be better to use two different arrows; for defining functions, he is in favor of the double-line arrow (=>) instead.
Proof of concept
To demonstrate the idea, he came up with some proof-of-concept code that implements the double-line arrow as a lambda replacement. Here are some examples that he gave:
    >>> f = (a, b) => a + b                            # not actual Python syntax
    >>> print(f(1, 2))
    3
    >>> print(list(map((x) => x * 2, [1, 2, 3, 4])))   # nor this
    [2, 4, 6, 8]
Cade Brown did not like the double-line arrow, on general principles, but Sokolovsky reiterated his belief that the two uses of arrows should be distinct; he also thinks that using the same symbol as JavaScript has some advantages. D'Aprano, however, is not convinced that following JavaScript's notation is necessarily the right thing for Python, nor does he think there is a need to separate the two uses of arrows. As might be guessed, others disagree; it was, to a certain extent, a bikeshedding opportunity, after all.
For his part, Van Rossum was not really opposed to using the single-line arrow for function definitions if there were no technical barriers to overlapping the two uses. No one seemed to think that allowing type annotations for lambda functions, which are generally quite self-contained, was truly needed. On the other hand, both David Mertz and Ricky Teachey were opposed to adding new syntax to handle lambdas, though Teachey thought it would make more sense if it could be used for both unnamed and named functions:
But if we could expand the proposal to allow both anonymous and named functions, that would seem like a fantastic idea to me.

Anonymous function syntax:

    (x,y) -> x+y

Named function syntax:

    f(x,y) -> x+y
That was something of a step too far for Van Rossum, though. There is already a perfectly good way to create named functions, he said:
Proposals like this always baffle me. Python already has both anonymous and named functions. They are spelled with 'lambda' and 'def', respectively. What good would it do us to create an alternate spelling for 'def'? [...] I can sympathize with trying to get a replacement for lambda, because many other languages have jumped on the arrow bandwagon, and few Python first-time programmers have enough of a CS background to recognize the significance of the word lambda. But named functions? Why??
Greg Ewing hypothesized that it simply comes from the desire for brevity: "In the situations where it's appropriate to use a lambda, you want something very compact, and 'lambda' is a rather long and unwieldy thing to have stuck in there." D'Aprano added that it may come from mathematical notation for defining functions, which is not necessarily a good match with Python's def:
So there is definitely some aesthetic advantage to the arrow if you're used to maths notation, and if Python had it, I'd use it.

But it doesn't scale up to multi-statement functions, and doesn't bring any new functionality into the language, so I'm not convinced that it's worth adding as a mere synonym for def or lambda or both.
While there was some support for the idea, and Sokolovsky is particularly enthusiastic about it, so far there have been no plans mentioned for a Python Enhancement Proposal (PEP). Adopting an arrow syntax for lambda expressions may just be one of those topics that pops up occasionally in Python circles; maybe, like the recurring request for a Python "switch", it will evolve into something that gets added to the language (as with pattern matching). On the other hand, lambda may be one of those corners of the language that is not used frequently enough to be worth changing. Only time will tell.
PipeWire: The Linux audio/video bus
For more than a decade, PulseAudio has been serving the Linux desktop as its predominant audio mixing and routing daemon — and its audio API. Unfortunately, PulseAudio's internal architecture does not fit the growing sandboxed-applications use case, even though there have been attempts to amend that. PipeWire, a new daemon created (in part) out of these attempts, will replace PulseAudio in the upcoming Fedora 34 release. It is a coming transition that deserves a look.
Speaking of transitions, Fedora 8's own switch to PulseAudio in late 2007 was not a smooth one. Longtime Linux users still remember having the daemon branded as the software that would break your audio. After a bumpy start, PulseAudio emerged as the winner of the Linux sound-server struggles. It provided a native client audio API, but also supported applications that used the common audio APIs of the time — including the raw Linux ALSA sound API, which typically allows only one application to access the sound card. PulseAudio mixed the different applications' audio streams and provided a central point for audio management, fine-grained configuration, and seamless routing to Bluetooth, USB, or HDMI. It positioned itself as the Linux desktop equivalent of the user-mode audio engine for Windows Vista and the macOS CoreAudio daemon.
Cracks at PulseAudio
By 2015, PulseAudio was still enjoying its status as the de facto Linux audio daemon, but cracks were beginning to develop. The gradual shift to sandboxed desktop applications may be proving fatal to its design: with PulseAudio, an application can snoop on other applications' audio, have unmediated access to the microphone, or load server modules that can interfere with other applications. Attempts were made at fixing PulseAudio, mainly through an access-control layer and a per-client memfd-backed transport. This was all necessary but not yet sufficient for isolating clients' audio.
Around that time, David Henningson, one of the core PulseAudio developers, resigned from the project. He cited frustrations over the daemon's poor fit for the sandboxed-applications use case, and its intermixing of mechanism and policy for audio-routing decisions. At the end of his message, he wondered if the combination of these problems might be the birth pangs of a new and much-needed Linux audio daemon:
In software nothing is impossible, but to re-architecture PulseAudio to support all of these requirements in a good way (rather than to "build another layer on top" […]) would be very difficult, so my judgment is that it would be easier to write something new from scratch.
And I do think it would be possible to write something that took the best from PulseAudio, JACK, and AudioFlinger, and get something that would work well for both mobile and desktop; for pro-audio, gaming, low-power music playback, etc. […] I think we, as an open source community, could have great use for such a sound server.
PulseVideo to Pinos
Meanwhile, GStreamer co-creator Wim Taymans was asked to work on a Linux service to mediate web browsers' access to camera devices. Initially, he called the project PulseVideo. The idea behind the name was simple: similar to the way PulseAudio was created to mediate access to ALSA sound devices, PulseVideo was created to mediate and multiplex access to the Video4Linux2 camera device nodes.
A bit later, Taymans discovered a similarly-named PulseVideo prototype [video], created by William Manley, and helped in upstreaming the GStreamer features required by its code. To avoid conflicts with the PulseAudio name, and due to scope extension beyond just camera access, Taymans later renamed the project to Pinos — in a reference to his town of residence in Spain.
Pinos was built on top of GStreamer pipelines, using some of the infrastructure that was earlier refined for Manley's prototype. D-Bus with file-descriptor passing was used for interprocess communication. At the GStreamer 2015 conference, Taymans described the Pinos architecture [PDF] to attendees and gave a demo of multiple applications accessing the system camera feed in parallel.
Due to its flexible, pipeline-based, file-descriptor-passing architecture, Pinos also supported media broadcasting in the other direction: applications could "upload" a media stream by passing a memfd or dma-buf file descriptor. The media stream can then be further processed and distributed to other applications and system multimedia sinks like ALSA sound devices.
While only discussed in passing, the ability to send streams in both directions and across applications allowed Pinos to act as a generic audio/video bus — efficiently funneling media between isolated, and possibly sandboxed, user processes. The scope of Pinos (if properly extended) could thus overlap with, and possibly replace, PulseAudio. Taymans was explicitly asked that question [video, 31:35], and he answered: "Replacing PulseAudio is not an easy task; it's not on the agenda [...] but [Pinos] is very broad, so it could do more later."
As the PulseAudio deficiencies discussed in the earlier section became more problematic, "could do more later" was not a far-off target.
PipeWire
By 2016, Taymans started rethinking the foundations of Pinos, extending its scope to become the standard Linux audio/video daemon. This included the "plenty of tiny buffers" low-latency audio use case typically covered by JACK. There were two main areas that needed to be addressed.
First, the hard dependency on GStreamer elements and pipelines for the core daemon and client libraries proved problematic. GStreamer has plenty of behind-the-scenes logic to achieve its flexibility. During the processing of a GStreamer pipeline, done within the context of Pinos realtime threads, this flexibility came at the cost of implicit memory allocations, thread creation, and locking. These are all actions that are well-known to negatively affect the predictability needed for hard realtime code.
To achieve part of the GStreamer pipelines' flexibility while still satisfying hard realtime requirements, Taymans created a simpler multimedia pipeline framework and called it SPA — the Simple Plugin API [PDF]. The framework is designed to be safely executed from realtime threads (e.g. Pinos media processing threads), with a specific time budget that should never be exceeded. SPA performs no memory allocations; instead, those are the sole responsibility of the SPA framework application.
Each node has a well-defined set of states. There is a state for configuring the node's ports, formats, and buffers — done by the main (non-realtime) thread, a state for the host to allocate all the necessary buffers required by the node after its configuration, and a separate state where the actual processing is done in the realtime threads. During streaming, if any of the media pipeline nodes change state (e.g. due to an event), the realtime threads can be notified so that control is switched back to the main thread for reconfiguration.
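That three-phase life cycle can be sketched as a tiny state machine. The state names and transition rules below are illustrative only, not the actual SPA API:

```python
from enum import Enum, auto

class NodeState(Enum):
    # Illustrative states only; not the real SPA enumeration.
    CONFIGURE = auto()   # main thread: negotiate ports, formats, buffers
    ALLOCATE = auto()    # host allocates every buffer the node will need
    STREAMING = auto()   # realtime thread: process only, no allocations

# Allowed transitions; an event during streaming (e.g. a format change)
# hands control back to the main thread for reconfiguration.
ALLOWED = {
    NodeState.CONFIGURE: {NodeState.ALLOCATE},
    NodeState.ALLOCATE: {NodeState.STREAMING},
    NodeState.STREAMING: {NodeState.CONFIGURE},
}

def transition(current, target):
    """Move a node to a new state, rejecting illegal jumps."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

In this model, skipping straight from configuration to streaming is an error: the buffers must all exist before the realtime thread is allowed to run.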
Second, D-Bus was replaced as the IPC protocol. Instead, a native fully asynchronous protocol that was inspired by Wayland — without the XML serialization part — was implemented over Unix-domain sockets. Taymans wanted a protocol that is simple and hard-realtime safe.
By the time the SPA framework was integrated and a native IPC protocol was developed, the project had long outgrown its original purpose: from a D-Bus daemon for sharing camera access to a full realtime-capable audio/video bus. It was thus renamed again, to PipeWire — reflecting its new status as a prominent pipeline-based engine for multimedia sharing and processing.
Lessons learned
From the start, the PipeWire developers applied an essential set of lessons from existing audio daemons like JACK, PulseAudio, and the Chromium OS Audio Server (CRAS). Unlike PulseAudio's intentional division of the Linux audio landscape into consumer-grade versus professional realtime audio, PipeWire was designed from the start to handle both. To avoid the PulseAudio sandboxing limitations, security was baked in: a per-client permissions bitfield is attached to every PipeWire node (each of which wraps one or more SPA nodes). This security-aware design allowed easy and safe integration with Flatpak portals — the sandboxed-application permissions interface that has since been promoted to a freedesktop XDG standard.
Like CRAS and PulseAudio, but unlike JACK, PipeWire uses timer-based audio scheduling. A dynamically reconfigurable timer is used for scheduling wake-ups to fill the audio buffer instead of depending on a constant rate of sound card interrupts. Besides the power-saving benefits, this allows the audio daemon to provide dynamic latency: higher for power-saving and consumer-grade audio like music playback; low for latency-sensitive workloads like professional audio.
Similar to CRAS, but unlike PulseAudio, PipeWire is not modeled on top of audio-buffer rewinding. When timer-based audio scheduling is used with huge buffers (as in PulseAudio), support for rewriting the sound card's buffer is needed to provide a low-latency response to unpredictable events like a new audio stream or a stream's volume change. The big buffer already sent to the audio device must be revoked and a new buffer needs to be submitted. This has resulted in significant code complexity and corner cases [PDF]. Both PipeWire and CRAS limit the maximum latency/buffering to much lower values — thus eliminating the need for buffer rewinding altogether.
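The underlying trade-off is simple arithmetic (the numbers below are illustrative, not PipeWire's or PulseAudio's actual defaults): the duration of audio covered by a buffer is its size in frames divided by the sample rate, and that duration bounds both how long the daemon can sleep between wake-ups and how stale already-submitted audio can become:

```python
def buffer_latency_ms(frames, sample_rate=48000):
    """Duration of audio, in milliseconds, covered by `frames` frames."""
    return frames * 1000 / sample_rate

# A huge buffer means ~2 s of audio is already committed to the card,
# so reacting quickly to a volume change requires rewinding that buffer.
print(buffer_latency_ms(96000))   # 2000.0

# A small maximum buffer keeps worst-case staleness around 21 ms,
# low enough that rewinding is never needed.
print(buffer_latency_ms(1024))
```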
Like JACK, PipeWire chose an external-session-manager setup. Professional audio users typically build their own audio pipelines in a session-manager application like Catia or QjackCtl, then let the audio daemon execute the final result. This has the benefit of separating policy (how the media pipeline is built) from mechanism (how the audio daemon executes the pipeline). At GUADEC 2018, developers explicitly asked Taymans [video, 23:15] to let GNOME, and possibly other external daemons, take control of that part of the audio stack. Several system integrators had already run into problems because PulseAudio embeds audio-routing policy decisions deep within its internal modules code. This was also one of the pain points mentioned in Henningson's resignation e-mail.
Finally, following the trend of multiple influential system daemons created in the last decade, PipeWire makes extensive use of Linux-kernel-only APIs. This includes memfd, eventfd, timerfd, signalfd, epoll, and dma-buf — all of which make the "file-descriptor" the primary identifier for events and shared buffers in the system. PipeWire's support for importing dma-buf file descriptors was key in implementing efficient Wayland screen capture and recording. For large 4K and 8K screens, the CPU does not need to touch any of the massive GPU buffers: GNOME mutter (or similar applications) passes a dma-buf descriptor that can then be integrated into PipeWire's SPA pipelines for further processing and capturing.
Adoption
The native PipeWire API has been declared stable since the project's major 0.3 release. Existing raw ALSA applications are supported through a PipeWire ALSA plugin. JACK applications are supported through a re-implementation of the JACK client libraries and the pw-jack tool if both native and PipeWire JACK libraries are installed in parallel. PulseAudio applications are supported through a pipewire-pulse daemon that listens to PulseAudio's own socket and implements its native communication protocol. This way, containerized desktop applications that use their own copy of the native PulseAudio client libraries are still supported. WebRTC, the communication framework (and code) used by all major browsers, includes native PipeWire support for Wayland screen sharing — mediated through a Flatpak portal.
The graph below shows a PipeWire media pipeline, generated using pw-dot then slightly beautified, on an Arch Linux system. A combination of PipeWire-native and PulseAudio-native applications is shown:
On the left, both GNOME Cheese and a GStreamer pipeline instance created with gst-launch-1.0 are accessing the same camera feed in parallel. In the middle, Firefox is sharing the system screen (for a Jitsi meeting) using WebRTC and Flatpak portals. On the right, the Spotify music player (a PulseAudio app) is playing audio, which is routed to the system's default ALSA sink — with GNOME Settings (another PulseAudio app) live-monitoring the Left/Right channel status of that sink.
On the Linux distributions side of things, Fedora has been shipping the PipeWire daemon (only for Wayland screen capture) since its Fedora 27 release. Debian offers PipeWire packages, but replacing PulseAudio or JACK is "an unsupported use case." Arch Linux provides PipeWire in its central repository and officially offers extra packages for replacing both PulseAudio and JACK, if desired. Similarly, Gentoo provides extensive documentation for replacing both daemons. The upcoming Fedora 34 release will be the first Linux distribution that will have PipeWire fully replace PulseAudio by default and out of the box.
Overall, this is a critical period in the Linux multimedia scene. While open source is a story about technology, it's also a story about the people hard at work creating it. There has been notable agreement from both PulseAudio and JACK developers that PipeWire and its author are on the right track. The upcoming Fedora 34 release should provide a litmus test for PipeWire's adoption by Linux distributions moving forward.
Fedora and fallback DNS servers
One of the under-the-hood changes in the Fedora 33 release was a switch to systemd-resolved for the handling of DNS queries. This change should be invisible to most users unless they start using one of the new features provided by systemd-resolved. Recently, though, the Fedora project changed its default configuration for that service to eliminate fallback DNS servers — a change which is indeed visible to some users who have found themselves without domain-name resolution as a result.

Systemd-resolved continues the systemd tradition of replacing venerable, low-level system components. It brings a number of new features, including a D-Bus interface that provides more information than the traditional gethostbyname() family (which is still supported, of course), DNS-over-TLS, LLMNR support, split-DNS support, local caching, and more. It is not exactly new; Ubuntu switched over in the 16.10 release. Fedora thus may not have lived up to its "first" objective with regard to systemd-resolved, but it did eventually make the switch.
It is probably fair to say that most Fedora users never noticed that things had changed. Toward the end of 2020, though, Zbigniew Jędrzejewski-Szmek made a change that drew some new attention toward systemd-resolved: he disabled the use of fallback DNS servers. The fallback mechanism is intended to ensure that a system has a working domain-name resolver, even if it is misconfigured or the configured servers do not work properly. As a last resort, systemd-resolved will use the public servers run by Google and Cloudflare for lookup operations. On Fedora 33 systems, though, that fallback has been disabled as of the systemd-246.7-2 update, released at the end of 2020.
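The disputed behavior boils down to a simple server-selection policy, sketched here in Python; the addresses are the fallbacks mentioned above, but the function itself is purely illustrative, not systemd-resolved's code:

```python
# Illustrative model of the fallback policy; not systemd-resolved's code.
FALLBACK_SERVERS = ["8.8.8.8", "1.1.1.1"]  # Google, Cloudflare

def pick_servers(configured, fallback_enabled):
    """Return the DNS servers a query would use."""
    if configured:
        return configured          # working configuration: use it
    if fallback_enabled:
        return FALLBACK_SERVERS    # last resort: public resolvers
    return []                      # new Fedora default: resolution fails

# A misconfigured system with the fallback disabled has nowhere to go:
print(pick_servers([], fallback_enabled=False))  # []
print(pick_servers([], fallback_enabled=True))   # ['8.8.8.8', '1.1.1.1']
```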
Toward the end of February, Tadej Janež went to the fedora-devel mailing list to argue that this change should be reverted, saying: "On F33, this actually breaks a working vanilla cloud instance by removing the fallback DNS server list in a systemd upgrade, effectively leaving the system with no DNS servers configured". As one might expect, this was not the desired state of affairs. This post generated some discussion about the change, but it may not lead to an update to Fedora's policy.
One might wonder why a seemingly useful feature like automatic fallback was disabled. The reasoning, as described by Jędrzejewski-Szmek in this changelog, has to do with privacy and compliance with the European GDPR directive.
Janež suggested that the situation could be improved in either of a couple of ways. Rather than disabling the fallback servers everywhere, Fedora could leave them enabled for cloud images, where, it seems, broken DNS configurations are more likely to happen and there tends not to be an individual user to identify in any case. Or Fedora could pick a "reputable DNS resolver" that is known to respect privacy and use it to re-enable the fallback for everybody. Jędrzejewski-Szmek replied that the first option might be possible, but rejected the second, saying that finding a provider that is acceptable worldwide would be a challenge at best.
Beyond privacy concerns, there was another reason cited in the discussion for the removal of the DNS fallbacks: they can hide problems in DNS configurations. Without the fallbacks, a broken configuration is nearly guaranteed to come to the user's attention (though said user may be remarkably unappreciative) and will, presumably, be fixed. With the fallbacks, instead, everything appears to work and the user may never know that there is a problem. So the configuration will not be fixed, leading to a worse situation overall.
Lennart Poettering, though, described this view as "bogus and very user unfriendly". It is better, he said, to complain loudly and fall back to a working setup than to leave the system without domain-name service entirely. A lot of users do not know how to fix DNS themselves, and they won't even be able to ask for help on the net if DNS is not working for them.
Poettering also raised another issue: the privacy argument does not always make sense because using the public DNS servers may well be the more privacy-respecting option anyway.
The changelog from Jędrzejewski-Szmek acknowledged this point as well, and noted additionally that ISP-provided DNS servers may not have the user's best interest in mind either. He still concluded that they were the better option because "they are more obvious to users and fit better in the regulatory framework". In any case, nobody is proposing using Google or Cloudflare servers in preference to those provided by the local network.
What will happen with Fedora's configuration is far from clear at this point. There seems to be some real resistance to enabling the fallback servers, even though the actual privacy and legal risk would appear to be small. Most Fedora users will probably never notice, but a subset may have to learn the details of using the resolvectl command to create a working DNS configuration by hand. Once again, they may be limited in their appreciation of this state of affairs.
5.12 merge window, part 2
The 5.12 merge window closed with the release of 5.12-rc1 on February 28; this release followed the normal schedule despite the fact that Linus Torvalds had been without power for the first six days after 5.11 came out. At that point, 10,886 non-merge changesets had found their way into the mainline repository; about 2,000 of those showed up after the first-half merge-window summary was written. The pace of merging obviously slowed down, but there were still a number of interesting features to be found in those patches.
Architecture-specific
- The RISC-V architecture has gained support for non-uniform memory access (NUMA) systems. This architecture also now supports kprobes, uprobes, and per-task stack canaries.
Core kernel
- The kcmp() system call can now be configured into the system independently of the checkpoint/restore functionality.
Filesystems and block I/O
- ID mapping for mounted filesystems, which has seen several proposed implementations over the years, has been merged at last. See this merge message for more information. This functionality is currently supported by the FAT, ext4, and XFS filesystems.
- The NFS client implementation has gained support for "eager writes". When this option is enabled (at mount time), file writes are sent immediately to the server rather than sitting in the page cache. This can reduce memory pressure on the client, provide immediate notification if the filesystem fills up, and, evidently, can even improve throughput for some workloads.
- The CIFS ("SMB") filesystem has a couple of new mount options to control the caching of file (acregmax) and directory (acdirmax) metadata.
Hardware support
- Miscellaneous: Playstation DualSense gamepads and force-feedback game controllers, Nintendo 64 game controllers, Nintendo 64 data cartridges, Intel Lightning Mountain centralized DMA controllers, Compute Express Link 2.0 type-3 memory devices, Broadcom VK accelerators, Qualcomm MSM8939 and SDX55 interconnect buses, Microchip AXI PCIe host bridges, Intel LGM SSO LED controllers, and Canaan Kendryte K210 reset controllers, pin control units, and clock controllers.
- Pin control: R-Car V3U pin controllers, Allwinner H616 pin controllers, and Qualcomm SM8350 and SC8180x pin controllers.
Miscellaneous
- The user-space perf-events tools have gained a number of new features, including the ability to report on instruction latency and a new daemon mode for long-running sessions. See this merge changelog for more information.
Virtualization and containers
- Support for the ACRN hypervisor has been added.
Internal kernel changes
- The build system now can use Clang's link-time optimization (LTO) features on the Arm64 and x86 architectures. Android builds have been using LTO for a while, but now this capability is in the mainline as well. See this commit and this commit for (some) more information.
- The EXPORT_UNUSED_SYMBOL() and EXPORT_SYMBOL_GPL_FUTURE() macros have been removed, since there have been no users of them in the kernel for years.
- A new memory-debugging tool called "kfence" has been merged; it can find a number of types of errors (use-after-free, buffer overflow, etc.) and features a low-enough overhead, it seems, to be usable on production systems. See this documentation commit for more information.
- The core of the io_uring subsystem has been reworked to stop using kernel threads; instead, when work must be done in thread context, a new thread is forked from the calling process. This should result in no user-visible changes other than, it is hoped, a reduction in bugs from the removal of some problematic kernel code.
The 5.12 kernel has now entered the stabilization part of the development cycle. Unless something surprising happens, the final 5.12 release can be expected on April 18 or 25. Given that, seemingly, even record-breaking winter storms are unable to slow down the pace of kernel development, that something would have to be surprising indeed.
Lockless patterns: relaxed access and partial memory barriers
The first article in this series provided an introduction to lockless algorithms and the happens before relationship that allows us to reason about them. The next step is to look at the concept of a "data race" and the primitives that exist to prevent data races. We continue in that direction with a look at relaxed accesses, memory barriers, and how they can be used to implement the kernel's seqcount mechanism.
Memory barriers are an old acquaintance for some Linux kernel programmers. The first document vaguely resembling a specification of what one could expect from concurrent accesses to data in the kernel is, in fact, called memory-barriers.txt. That document describes many kinds of memory barriers, along with the expectations that Linux has concerning the properties of data and control dependencies. It also describes "memory-barrier pairing"; this could be seen as a cousin of release-acquire pairing, in that it also helps create cross-thread happens before edges.
This article will not go into the same excruciating detail as memory-barriers.txt. Instead, we'll look at how barriers compare with the acquire and release model and how they simplify or enable the implementation of the seqcount primitive. Nevertheless, one article will not be enough to cover even the most common memory-barrier patterns, so full memory barriers will have to wait for the next installment.
Data races, relaxed accesses, and memory barriers
The concept of a data race, as presented here, was first introduced in C++11 and, since then, has been applied to various other languages, notably C11 and Rust. These language standards are quite strict in how to approach lockless access to data structures, and they introduce specific atomic load and atomic store primitives to do so.
A data race occurs when two accesses are concurrent (i.e., not ordered by the happens before relation), at least one of them is a store, and at least one of them is not using atomic load or store primitives. Whenever a data race occurs, the result (according to C11/C++11) is undefined behavior, meaning that anything is allowed to happen. Avoiding data races does not mean that your algorithm is free of "race conditions": data races are violations of the language standards, while race conditions are bugs that stem from incorrect locking, incorrect acquire/release semantics, or both.
Data races and the consequent undefined behavior are easy to avoid, however. As long as one wants to store to a shared data location, which is probably the case, there are two ways to do so. The first is to ensure that accesses are ordered by the happens before relation, using any pair of acquire and release operations of your choice; the second is to annotate the loads and stores as atomic.
C11, C++11, and Rust all provide various memory orderings that the programmer can use for their loads and stores; the three that we are interested in are acquire (to be used with loads), release (for stores), and relaxed (for both). Acquire and release should be self-explanatory by now, and Linux provides the same concept in its smp_load_acquire() and smp_store_release() operations. Relaxed operations, instead, do not provide any cross-thread ordering; a relaxed operation does not create a happens before relationship. Instead, these operations have essentially no purpose other than to avoid data races and the undefined behavior associated with them.
In practice, Linux expects both the compiler and the CPU to allow a bit more leeway than the language standards do. In particular, the kernel expects that regular loads and stores will not trigger undefined behavior just because there is a concurrent store. However, the value that is loaded or stored in such situations is still not well defined and may well be garbage. For example, the result could include parts of an old value and parts of a new value; this means that, at the very least, dereferencing pointer values loaded from a data race is generally a bad idea.
In addition, regular loads and stores are subject to compiler optimizations, which can produce surprises of their own. Therefore, the idea of a relaxed-ordering — but guaranteed atomic — memory operation is useful in Linux too; this is what the READ_ONCE() and WRITE_ONCE() macros provide. The remainder of this series will always use READ_ONCE() and WRITE_ONCE() explicitly, which is nowadays considered good practice by Linux developers.
These macros already appeared in an example from part 1:
    thread 1                          thread 2
    -----------------------------     ------------------------
    a.x = 1;
    smp_wmb();
    WRITE_ONCE(message, &a);          datum = READ_ONCE(message);
                                      smp_rmb();
                                      if (datum != NULL)
                                          printk("%x\n", datum->x);
They are used in a similar way to smp_load_acquire() and smp_store_release(), but their first argument is the target of the assignment (an lvalue) rather than a pointer. Unless other mechanisms ensure that the result of a data race is thrown away, it is highly recommended to use READ_ONCE() and WRITE_ONCE() to load and store shared data outside a lock. Typically, relaxed atomics are used together with some other primitive or synchronization mechanism that has release and acquire semantics; that "something else" will order the relaxed writes against reads of the same memory location.
Suppose, for example, that you have multiple work items that fill certain elements of an array with ones; whoever spawned the work items will only read the array after calling flush_work(). Similar to pthread_join(), flush_work() has acquire semantics and synchronizes with the end of the work item; flush_work() guarantees that reading the array will happen after the completion of the work items, and the array can be read with regular loads. However, if multiple, concurrent work items can store into the same array element, they must use WRITE_ONCE(a[x], 1) rather than just a[x] = 1.
A more complicated case occurs when the release and acquire semantics are provided by a memory barrier. In order to explain this case we will use the practical example of seqcounts.
Seqcounts
Seqcounts are a specialized primitive that allows a consumer to detect that a data structure changed in the middle of the consumer's access. While they are only usable in special cases (small amount of data being protected, no side effects within read-side critical sections, and writes being quick and relatively rare), they have various interesting properties: in particular, readers do not starve writers and the writer can keep ownership of the cache line that holds the seqcount. Both properties make seqcounts particularly interesting when scalability is important.
Seqcounts are a single-producer, multiple-consumer primitive. In order to avoid multiple concurrent writers, they are usually combined with a spinlock or mutex, forming the familiar Linux seqlock_t type. Outside the kernel, however, you'll sometimes see seqcounts themselves referred to as seqlocks.
Seqcounts are effectively a generation count, where the generation number is odd if and only if a write is in progress. Whenever the generation number was odd at the beginning of a read-side critical section, or it changed during the read-side critical section, the reader has accessed potentially inconsistent state and must retry the read. For a seqcount to work correctly, the reader must correctly detect the beginning and the end of the write. This requires two load-acquire and two store-release operations; here is how one might write a seqcount reader and writer without any wrapper APIs:
    thread 1 (buggy writer)              thread 2 (buggy reader)
    --------------------------------     ------------------------
    WRITE_ONCE(sc, sc + 1);              do {
    smp_store_release(&data.x, 123);         old = smp_load_acquire(&sc) & ~1;
    WRITE_ONCE(data.y, 456);                 copy.y = READ_ONCE(data.y);
    smp_store_release(&sc, sc + 1);          copy.x = smp_load_acquire(&data.x);
                                         } while (READ_ONCE(sc) != old);
This code is similar to the "message passing" pattern shown in the first part of the series. There are two pairs of load-acquire and store-release operations, one set for sc and one for data.x. It is not even that hard to show why both load-acquire/store-release pairs are necessary:
- For thread 2 to exit the loop, the first read of sc must see the even value that was written by the second store to sc in thread 1. If this happens, smp_store_release() and smp_load_acquire() ensure that the stores to the data fields will be visible.
- If the store to data.x is visible when thread 2 reads it, smp_store_release() and smp_load_acquire() ensure that thread 2 will see (at least) the first generation-count update. Thus, thread 2 will either loop or, if it also sees the second update, retrieve a consistent copy of the new data as described above.
However, the code has a bug! Because the writer has no acquire operation at the top of the sequence, the write to data.y might execute before writing the odd value to sc. [Note: the article was updated on March 2nd to point out this problem]. Using load-acquire/store-release for all fields would sidestep the issue, but one wonders if this would be putting lipstick on a pig. And in fact it is possible to do much better.
The first article showed that older Linux code may use smp_wmb() followed by WRITE_ONCE() rather than smp_store_release() ; likewise, instead of smp_load_acquire(), sometimes READ_ONCE() is followed by smp_rmb(). These partial barriers create specific types of happens before relations. Specifically (but informally), smp_wmb() turns all the following relaxed stores into release operations and smp_rmb() turns all the preceding relaxed loads into acquire operations.
Let's try to apply this replacement to the data.x accesses:
    thread 1 (writer)                    thread 2 (reader)
    ------------------------------       ------------------------
    // write_seqcount_begin(&sc)         do {
    WRITE_ONCE(sc, sc + 1);                  // read_seqcount_begin(&sc)
    smp_wmb();                               old = smp_load_acquire(&sc) & ~1;
    WRITE_ONCE(data.x, 123);                 copy.y = READ_ONCE(data.y);
    WRITE_ONCE(data.y, 456);                 copy.x = READ_ONCE(data.x);
    // write_seqcount_end(&sc)               // read_seqcount_retry(&sc, old)
    smp_store_release(&sc, sc + 1);          smp_rmb();
                                         } while (READ_ONCE(sc) != old);
Leaving aside the question of how barriers work, this already has much better chances of being wrapped with an easy-to-use API. Data is accessed entirely with relaxed atomic loads and stores (though in the Linux kernel memory model non-atomic accesses would be acceptable too), and the barriers could be hidden within the seqcount primitives read_seqcount_retry() and write_seqcount_begin().
The barriers inserted above split the reads and writes into two separate groups; this ensures the safety of seqcount accesses. However, there are two complications:
- First, the barriers do not impose an order among the relaxed accesses themselves. It is possible that thread 2 sees the update to data.y and not the update to data.x. This is not a problem for seqcounts, because the check on sc forces thread 2 to retry in case it saw only some of the stores.
- Second, the barriers are weaker than load-acquire and store-release operations. A read with smp_load_acquire() happens before any loads and stores that follow it, and likewise smp_store_release() happens after not just preceding stores, but also after preceding loads. Instead, for smp_rmb(), the ordering is only guaranteed among loads, and, for smp_wmb(), only among stores. Load-store ordering however rarely matters, which is why Linux developers only used smp_rmb() and smp_wmb() for a long time.
In the case of seqcounts, load-store ordering is not a problem, because the reader does not perform any writes to shared memory in its critical section, thus there cannot be concurrent changes to shared memory between the writer's updates of the generation count. There's a little bit of handwaving in this reasoning, but it is actually sound as long as the code is kept simple and faithful to the pattern. If the reader needed to write to shared memory, it would suffice to protect those writes with a different mechanism than the seqcount.
While informal, the explanation in the previous paragraph highlights the importance of knowing the common lockless programming patterns. In short, patterns enable thinking about code at a more coarse level without losing precision. Instead of looking at each memory access individually, you can make statements like "data.x and data.y are protected by the seqcount sc" or, referring to the earlier message passing example, "a is published to other threads via message". To some extent, proficiency in lockless programming patterns means being able to make such statements and take advantage of them to understand the code.
This concludes our initial look at memory barriers. There is a lot more to this topic than has been covered so far, naturally; the next installment in this series will delve into full memory barriers, how they work, and how they are used in the kernel.
Page editor: Jonathan Corbet