
Leading items

Welcome to the LWN.net Weekly Edition for March 11, 2021

This edition contains the following feature content:

  • Python exception groups: a PEP proposes new machinery, and new syntax, for raising and handling multiple exceptions at once.
  • A vulnerability in Git: cloning a maliciously crafted repository can write files to unintended locations.
  • BPF meets io_uring: a patch set adds the ability to run BPF programs from within io_uring.
  • Lockless patterns: full memory barriers: why ordering stores against later loads needs its own primitive.
  • Linux 5.12's very bad, double ungood day: a filesystem-corrupting bug slipped into 5.12-rc1.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Python exception groups

By Jake Edge
March 10, 2021

Exceptions in Python are a mechanism used to report errors (of an exceptional variety); programs can be and are written to expect and handle certain types of exceptions using try and except. But exceptions were originally meant to report a single error event and, these days, things are a tad more complicated than that. A recent Python Enhancement Proposal (PEP) targets adding exception groups, as well as new syntax to catch and handle the groups.

Exceptions

As a quick review, Python exceptions are objects derived from the BaseException class, though user-defined exceptions should be derived from Exception. An exception can be "raised" when an error occurs (using raise), which causes the call stack to be unwound until either the exception is "caught" (in a try ... except block), or it propagates out to the user. Python defines multiple built-in exceptions with generally self-explanatory names, such as IndexError and SyntaxError. The basics of their operation can be seen below:

    >>> 1/0
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ZeroDivisionError: division by zero

    >>> try:
    ...     raise ValueError('value?')
    ... except ValueError as e:
    ...     print(e)
    ... 
    value?

In the first case, a divide-by-zero exception is generated and propagated out to the read-eval-print loop (REPL). In the second, a ValueError is instantiated and raised, then caught and displayed.

That gives the gist of it, but there are some wrinkles, of course. The except clause can refer to multiple exception types to catch them all and there can be multiple except clauses for separate handling of different exception types. An else clause can be used to do special handling when no exception is caught and a finally clause can be given for code to be executed last, regardless of whether exceptions were caught or not.
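
As a minimal, hypothetical illustration (the parse() function below is not from any of the proposals, it simply exercises the clauses described above), those pieces combine like this:

    >>> def parse(value):
    ...     try:
    ...         result = int(value)
    ...     except (TypeError, ValueError) as e:   # one clause can name several types
    ...         print('bad input:', e)
    ...     except Exception as e:                 # a separate clause for anything else
    ...         print('unexpected:', e)
    ...     else:                                  # runs only when nothing was raised
    ...         print('parsed', result)
    ...     finally:                               # always runs last
    ...         print('done')
    ... 
    >>> parse('17')
    parsed 17
    done
    >>> parse('oops')
    bad input: invalid literal for int() with base 10: 'oops'
    done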

Along the way, it was recognized that if an exception was raised while handling an exception, the first exception could get lost, which made for difficult debugging. That led to PEP 3134 ("Exception Chaining and Embedded Tracebacks"), which provided a means to chain exceptions using the __cause__ attribute on exception objects. That attribute would be set in a raise ... from statement; below is a highly contrived example:

    >>> try:
    ...     1/0
    ... except ZeroDivisionError as e:
    ...     try:
    ...         print(x)
    ...     except NameError:
    ...         raise NameError() from e
    ... 
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    ZeroDivisionError: division by zero

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "<stdin>", line 7, in <module>
    NameError

Multi-exceptions

More recently, situations have arisen where multiple unrelated exceptions need to be raised and handled together. For example, libraries that provide concurrent operations (e.g. asyncio, Trio) may return aggregated results, thus there could be multiple exceptions of various sorts. Another problem area is with operations that are retried within the library, such as socket.create_connection(), where there could be multiple failures that would be useful for the caller to be able to process.
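
The aggregation problem can be seen without any new language features; here is a minimal sketch using asyncio (the fail() coroutine is purely illustrative), where a single gather() call produces several unrelated exceptions that can only be handed back as a plain list:

    import asyncio

    async def fail(exc):
        raise exc

    async def main():
        # With return_exceptions=True, gather() collects exceptions into the
        # result list instead of propagating only the first one.
        results = await asyncio.gather(
            fail(ValueError('bad value')),
            fail(TypeError('bad type')),
            return_exceptions=True)
        print(results)    # [ValueError('bad value'), TypeError('bad type')]

    asyncio.run(main())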

Currently, though, there is no way for Python programs to process these multi-failures; Python can only handle a single exception, though it can chain exceptions that occurred earlier, as mentioned. PEP 654 ("Exception Groups and except*") has come about to try to tame this problem. It describes some other situations that lead to multiple exceptions needing to be raised "simultaneously" and proposes a potential solution. It was introduced in a message to the python-dev mailing list on February 23 by Irit Katriel. She is one of three authors of the PEP, along with Yury Selivanov and Guido van Rossum.

Trio has the trio.MultiError exception type to report a collection of errors, but it has proved difficult to use. That led Trio developer Nathaniel Smith to start working on MultiError2, but that effort "demonstrates how difficult it is to create an effective API for reporting and handling multiple errors without the language changes we are proposing", according to the PEP. The "rationale" section gives some more detail on why the authors have taken this route:

Grouping several exceptions together can be done without changes to the language, simply by creating a container exception type. Trio is an example of a library that has made use of this technique in its MultiError type. However, such an approach requires calling code to catch the container exception type, and then to inspect it to determine the types of errors that had occurred, extract the ones it wants to handle, and reraise the rest.

Changes to the language are required in order to extend support for exception groups in the style of existing exception handling mechanisms. At the very least we would like to be able to catch an exception group only if it contains an exception of a type that we choose to handle. Exceptions of other types in the same group need to be automatically reraised, otherwise it is too easy for user code to inadvertently swallow exceptions that it is not handling.

The proposed solution would add a new ExceptionGroup class (derived from BaseExceptionGroup), which can be used as follows:

    ExceptionGroup('problems', [ ValueError('illegal number'),
                                 NameError('bad name'),
                                 TypeError('no such type') ])

That would create an exception group with three separate exceptions. These groups can be nested, so arbitrarily complex relationships of exceptions can be represented. In order to process these groups, some additional syntax is being proposed. It is given the name "except*", though the star is not actually attached to except. It indicates that the except block can handle multiple exceptions:

    try:
        ...
    except *SpamError:
        ...
    except *FooError as e:
        ...
    except *(BarError, BazError) as e:
        ...

In a traditional try-except statement there is only one exception to handle, so the body of at most one except clause executes; the first one that matches the exception. With the new syntax, an except* clause can match a subgroup of the exception group that was raised, while the remaining part is matched by following except* clauses. In other words, a single exception group can cause several except* clauses to execute, but each such clause executes at most once (for all matching exceptions from the group) and each exception is either handled by exactly one clause (the first one that matches its type) or is reraised at the end.

The exception group is processed recursively, looking for matching exception types at all of the levels of the hierarchy. Those that are found are extracted and processed before the next except* is consulted. As mentioned, any that slip through the cracks are reraised. It should be noted that a given try block can either have except or except* clauses; they cannot be mixed.
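
Handling the group created above (with a nested group added) might look like the following sketch of the proposed syntax; it will only run on an interpreter that implements PEP 654:

    group = ExceptionGroup('problems',
                           [ValueError('illegal number'),
                            TypeError('no such type'),
                            ExceptionGroup('nested', [NameError('bad name')])])
    try:
        raise group
    except *(ValueError, NameError) as eg:
        # eg is itself an exception group containing only the matching
        # exceptions, with the nesting of the original group preserved
        print('value/name problems:', eg.exceptions)
    except *TypeError as eg:
        print('type problems:', eg.exceptions)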

Backward compatibility

Adding a new type and syntax is meant to help ensure that there are no backward-compatibility concerns with the proposals. Exception groups can be treated as exceptions are today for handling and display. The except statement does not change either, so nothing that runs today will break with these changes. On the other hand, though, libraries that start raising exception groups are changing their API; the PEP gives its authors' expectations in that regard:

Our premise is that exception groups and except* will be used selectively, only when they are needed. We do not expect them to become the default mechanism for exception handling. The decision to raise exception groups from a library needs to be considered carefully and regarded as an API-breaking change. We expect that this will normally be done by introducing a new API rather than modifying an existing one.

In the python-dev thread, there was a good deal of discussion of how to handle the difference between an exception like ValueError and an exception group that contains a ValueError—possibly somewhere deep inside the hierarchy of exceptions and groups. Users may expect existing code to work in a congruent fashion, even if the exception has now migrated into a group. As Smith put it:

The intuition is that things will be simplest if ExceptionGroup is kept as transparent and meaningless as possible, i.e. ExceptionGroup(ValueError) and ValueError mean exactly the same thing -- "some code inside this block raised ValueError" -- and ideally should be processed in exactly the same way. (Of course we can't quite achieve that due to backcompat issues, but the closer we can get, the better, I think?)

But, as Katriel pointed out, the two are not the same, and effectively cannot be, because the behavior of except is not being changed. That led Marco Sulla to ask why except could not simply be "upgraded" to handle exception groups. As Van Rossum pointed out, there is a problem because exceptions and exception groups are distinct kinds of objects:

Good question. Here's an example:

    try:
        . . .
    except OSError as err:
        if err.errno != ENOENT:
            raise
        . . .

If this would catch ExceptionGroup(OSError), the `err` variable would be an ExceptionGroup instance, which does not have an `errno` attribute.

The answer to Sulla's query was duly added to the PEP. But there are others who are worried about adding non-obvious behaviors with regard to the difference between the two types; as Cameron Simpson put it:

I want the code to do what I said, not have some magic which silently/invisibly intercepts ExceptionGroups which contain something buried deep in their subgroup tree.

[...] I certainly do not want ExceptionGroup([AttributeError]) conflated with AttributeError. That fills me with horror.

Paul Moore and others were concerned that the "catch-all" clause except Exception would stop functioning in this new exception-group world, but Van Rossum explained that exception groups are still derived from Exception, so unless one is using new APIs that can raise an ExceptionGroup and have started using except*, nothing will change.
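
A quick sketch of that point (again assuming an interpreter that implements the PEP): a traditional except Exception clause does still catch an exception group, but what it receives is the group object itself rather than the exceptions inside it:

    try:
        raise ExceptionGroup('problems', [ValueError('illegal number')])
    except Exception as e:
        print(type(e).__name__)    # prints "ExceptionGroup", not "ValueError"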

Many in the thread were trying to find ways to avoid needing the new except* syntax. The general idea would be to process the exceptions inside of the groups to try to match them with the existing except clauses in some defined fashion. That may run afoul of the worry about "magic" behavior; it may also be difficult to define the operation in a way that will satisfy all of the various constituencies—quite possibly why the authors avoided the problem by clearly separating the two.

In the end, it is a somewhat esoteric change, but one that solves some real problems that have come up. Conceptually, it makes a fair amount of sense, though it is, perhaps, somewhat annoying that there seems to be no way to smoothly fit it into the existing language (particularly for Python founder Van Rossum one might guess). Trio has already tried to make it fit, without being entirely successful. The final form of the feature may not have gelled yet, but something down this path is likely bound for the language before too long.

Comments (9 posted)

A vulnerability in Git

By Jake Edge
March 10, 2021

A potentially nasty vulnerability in the Git distributed revision-control system was disclosed on March 9. There are enough qualifiers in the description of the vulnerability that it may appear to be fairly narrowly focused—and it is. That may make it less worrisome, but it is not entirely clear. As with most vulnerabilities, it all depends on how the software is being used and the environment in which it is running.

The vulnerability (CVE-2021-21300) could lead to code execution on the local system when cloning from a repository crafted to exploit it. It requires that some kind of Git filter be installed. Filters are used to manipulate files in between the filesystem and the Git repository; "smudge" filters are used when pulling blobs (binary objects) out of the repository to store in the working directory, while "clean" filters can change files as they are being committed into the repository. Which of those types is needed will depend on the type of transformation being performed. Git Large File Storage (LFS) is a commonly used extension (with both smudge and clean filters), which is installed by default with Git on Windows.

Filters are able to delay the normal processing of Git operations so that long-running filtering can be completed in the background. For example, Git LFS may need to copy a large file across the network in order to satisfy a checkout operation. But the delay feature changes the normal order in which files and directories are processed by Git. That, in turn, means that information cached by the tool may no longer be valid when it is relied upon, which is exactly where the vulnerability lies.

In order to reduce the number of lstat() calls that are made, Git maintains a cache called the "lstat cache". If a path collision occurs as the files are being checked out (i.e. two distinct entries resolve to the same path on disk, for example two files whose names differ only in case being checked out on a case-insensitive filesystem), that cache could be left in an invalid state. That does not typically lead to a problem because the checkouts proceed in a known order, so the cache is not actually needed at the point where it is invalid.

However, if certain parts of the checkout are delayed by the filters, all bets are off. When the cache is consulted, the type of the files in the cached path may have changed; if that change was crafted by an attacker, unpleasantness is sure to occur. The patch fixing the vulnerability described the problem this way:

But, there are some users of the checkout machinery that do not always follow the index order. In particular: checkout-index writes the paths in the same order that they appear on the CLI (or stdin); and the delayed checkout feature -- used when a long-running filter process replies with "status=delayed" -- postpones the checkout of some entries, thus modifying the checkout order.

When we have to check out an out-of-order entry and the lstat() cache is invalid (due to a previous path collision), checkout_entry() may end up using the invalid data and [trusting] that the leading components are real directories when, in reality, they are not. In the best case scenario, where the directory was replaced by a regular file, the user will get an error: "fatal: unable to create file 'foo/bar': Not a directory". But if the directory was replaced by a symlink [symbolic link], checkout could actually end up following the symlink and writing the file at a wrong place, even outside the repository. Since delayed checkout is affected by this bug, it could be used by an attacker to write arbitrary files during the clone of a maliciously crafted repository.

Several paths to a fix were considered, including disabling the cache for unordered checkouts or sorting the file names so that they are always processed in the same order. Both of those had performance impacts and there was a concern that other code paths could someday lead to unordered processing, thus reviving the bug. Instead, the cache is simply invalidated whenever a remove-directory operation is performed.

As noted, symbolic links play a role in the ability to exploit the vulnerability. While highly useful, symbolic links have also historically been used to wreak havoc in various ways. They often feature in race condition exploits (e.g. for temporary files) and the like. Not all systems support symbolic links, though Unix-derived systems (Linux, macOS) generally do; these days, Windows administrators can also create symbolic links.

So it is a combination of several different features and situations that leads to an exploitable system—including the existence of an attacker-crafted repository that users need to be convinced to clone. Even for Windows systems, where both Git LFS and case-insensitive filesystems are the norm, exploits are seemingly not at all common—perhaps even non-existent. This has the look of a problem discovered via code inspection or testing that was subsequently reported and fixed quickly—without even time for a catchy name, logo, and web site. If any systems have been exploited, it seems most probable that the attacks were highly targeted and may not have been discovered (yet).

While Linux-native filesystems are not usually case-insensitive, they can be. Beyond that, though, Linux can make use of native filesystem formats for Windows and macOS that have such functionality. In addition, the test cases provided with the fix show another way to cause the problem: via Unicode normalization. The test case uses two different Unicode representations for "ä" (U+0061 U+0308, "\141\314\210", and U+00e4, "\303\244") to ensure that no files are written to the wrong place. So it may be less likely that Linux systems are affected by the bug, but they are not immune.
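
For the curious, the equivalence is easy to demonstrate in Python (the variable names here are arbitrary); a filesystem that normalizes file names would map both spellings to the same path:

    import unicodedata

    decomposed = "a\u0308"     # U+0061 U+0308: "a" plus combining diaeresis
    precomposed = "\u00e4"     # U+00E4: precomposed "ä"

    print(decomposed == precomposed)    # False: the strings differ
    print(unicodedata.normalize("NFC", decomposed) == precomposed)    # True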

Comments (7 posted)

BPF meets io_uring

By Jonathan Corbet
March 4, 2021
Over the last couple of years, a lot of development effort has gone into two kernel subsystems: BPF and io_uring. The BPF virtual machine allows programs from user space to be safely run within the context of the kernel, while io_uring addresses the longstanding problem of running system calls asynchronously. As the two subsystems expand, it was inevitable that the two would eventually meet; the first encounter happened in mid-February with this patch set from Pavel Begunkov adding the ability to run BPF programs from within io_uring.

The patch set itself is relatively straightforward, adding less than 300 lines of new code. It creates a new BPF program type (BPF_PROG_TYPE_IOURING) for programs that are meant to be run in the io_uring context. Any such programs must first be created with the bpf() system call, then registered with the ring in which they are intended to run using the new IORING_ATTACH_BPF command. Once that has been done, the IORING_OP_BPF operation will cause a program to be run within the ring. The final step in the patch series adds a helper function that BPF programs can use to submit new operations into the ring.

As a proof of concept, the patch series does a good job of showing how BPF programs might be run from an io_uring. This work does not, though, really enable any new capabilities in its current form, which may be part of why there have been no responses to it on the list. There is little value to running a BPF program asynchronously to submit another operation; one could simply submit that operation directly instead. As is acknowledged in the patch set, more infrastructure will be needed before this capability will become useful to users.

The obvious place where BPF can add value is making decisions based on the outcome of previous operations in the ring. Currently, these decisions must be made in user space, which involves potential delays as the relevant process is scheduled and run. Instead, when an operation completes, a BPF program might be able to decide what to do next without ever leaving the kernel. "What to do next" could include submitting more I/O operations, moving on to the next in a series of files to process, or aborting a series of commands if something unexpected happens.

Making that kind of decision requires the ability to run BPF programs in response to other events in the ring. The sequencing mechanisms built into io_uring now would suffice to run a program once a specific operation completes, but that program will not have access to much useful information about how the operation completed. Fixing that will, as Begunkov noted, require a way to pass the results of an operation into a BPF program when it runs. An alternative would be to tie programs directly to submitted operations (rather than making them separate operations, as is done in the patch set) that would simply run at completion time.

With that piece in place, and with the increasing number of system calls supported within io_uring, it will become possible to create complex, I/O-related programs that can run in kernel space for extended periods. Running BPF programs may look like an enhancement to io_uring, but it can also be seen as giving BPF the ability to perform I/O and run a wide range of system calls. It looks like a combination that people might do some surprising things with.

That said, this is not a feature that is likely to be widely used. On its own, io_uring brings a level of complexity that is only justified for workloads that will see a significant performance improvement from asynchronous processing. Adding BPF into the mix will increase the level of complexity significantly, and long sequences of operations and BPF programs could prove challenging to debug. Finally, loading io_uring BPF programs requires either the CAP_BPF or the CAP_SYS_ADMIN capability, which means "root" in most configurations. As long as the current hostility toward unprivileged BPF programs remains, that is unlikely to change; as a result, relatively few programs are likely to use this feature.

Still, the combination of these two subsystems provides an interesting look at where Linux may go in the future. Linux will (probably) never be a unikernel system, but the line between user space and the kernel does appear to be getting increasingly blurry.

Comments (30 posted)

Lockless patterns: full memory barriers

March 5, 2021

This article was contributed by Paolo Bonzini


The first two articles in this series introduced four ways to order memory accesses: load-acquire and store-release operations in the first installment, read and write memory barriers in the second. The series continues with an exploration of full memory barriers, why they are more expensive, and how they are used in the kernel.

The primitives that were introduced so far constrain the ordering of loads and stores in four different ways:

  • Load-acquire operations are ordered before subsequent loads and stores.
  • Store-release operations are ordered after earlier loads and stores.
  • Read memory barriers order earlier loads against subsequent loads.
  • Write memory barriers order earlier stores against subsequent stores.

It may be surprising that no combination of these operations orders an earlier store against a later load:

                      Ordered against later load          Ordered against later store
    Earlier load      smp_load_acquire(), smp_rmb()       smp_load_acquire(), smp_store_release()
    Earlier store     ??                                  smp_store_release(), smp_wmb()

It turns out that guaranteeing a global ordering of stores against later loads is much more complicated for the processor, and it deserves a primitive of its own. To find out why, we need to abandon even the high-level concepts that we have used so far and take a direct peek at how processors operate.

How processors really do it

The first article already mentioned that, deep down, processors communicate via a message-passing architecture, such as QPI or HyperTransport. At the assembly-language level, however, the programmer sees operations like memory loads and stores. Any acquire and release semantics associated with these memory operations are an illusion that is provided by the processor's execution pipeline, based on the program it runs and on the constraints of the architecture.

For example, on x86 processors, all memory loads count as "load-acquire", and all memory stores count as "store-release"; this behavior is required by the architecture. When compiling for these processors, it's still important to annotate acquire and release operations and/or to insert memory barriers, but these annotations are only used by the compiler to block invalid optimizations. Even in this model, the architecture does not guarantee that all processors see stores and loads in the same order. Consider this example, where a and b are initially zero:

    CPU 1                    CPU 2
    -------------------      --------------------
    store 1 into a           store 1 into b
    load b into x            load a into y

If you try interleaving the four operations by hand, you'll see that at least one store will come before the corresponding load. Therefore one would expect that at least one of x and y will be 1 after the above sequence runs to completion. Yet, even on an x86 processor, it is possible that both x and y read as zero.

How so? The reason lies in the existence of the store buffer, which lives between the CPU and its L1 cache. Stores to memory usually only change one part of a cache line; completing those stores, even just to cache, may require fetching the full cache line from memory — a slow operation. The store buffer holds the stored data while this fetch takes place, allowing the CPU to proceed with execution.

Even a CPU with a store buffer can provide load-load, load-store, and store-store ordering relatively easily:

  • On out-of-order CPUs (which can execute operations in an order different from how they appear in the code to improve performance), ordering memory operations against earlier loads can be done via speculative execution. During the time between a cache line access and the retiring of the instruction that caused the access, the CPU tracks evictions of that cache line. Retiring of instructions happens in order, so this tracking will last until all earlier loads have completed. If the cache line is evicted before the instruction is retired, speculation is canceled and the processor retries the memory operation.
  • Keeping stores ordered is even simpler; it is enough to flush the entries of the store buffer to the cache in FIFO order.

However, store-load ordering is a different story. First of all, one CPU's store might be stuck in its store buffer, and therefore it will not be visible to another CPU when that CPU loads data from L1 cache. Second, a store buffer also provides store forwarding, where the result of memory loads is taken directly from the store buffer instead of going through the cache. If either CPU 1 or CPU 2 has an entry for respectively b or a in its store buffer, the value could be forwarded to the load, which would return zero.

The only way around these problems is to flush the store buffer entirely between the store and the load. This is as expensive as it sounds (a few tens of clock cycles), but this is what the full memory barrier smp_mb() does under the hood. Here is the same pseudocode as above, fixed and rewritten into C:

    thread 1                 thread 2
    -------------------      --------------------
    WRITE_ONCE(a, 1);        WRITE_ONCE(b, 1);
    smp_mb();                smp_mb();
    x = READ_ONCE(b);        y = READ_ONCE(a);

Let's say x is zero. I will use a squiggly horizontal line from a read to a write to express that WRITE_ONCE(b, 1) overwrites the value that thread 1 had read; then the situation (which is described in detail in the kernel's memory-model documentation) is as follows:

    WRITE_ONCE(a, 1);
           |
      -----+----- smp_mb();
           |
           v
    x = READ_ONCE(b);   ∿∿∿∿∿∿∿>  WRITE_ONCE(b, 1);
                                         |
                                    -----+----- smp_mb();
                                         |
                                         v
                                  y = READ_ONCE(a);

Because these are relaxed operations, this is not enough to introduce a happens-before edge between thread 1 and thread 2. The barriers also do not have acquire or release semantics by themselves, so there are no happens-before edges across the two threads at all.

However, the barriers do provide enough information to order the operations in the two threads. In particular, continuing with the case of x=0, the full memory barrier in thread 2 guarantees that the store buffer has been flushed by the time READ_ONCE(a) executes. Could that flush have happened before thread 1's READ_ONCE(b) executed? If so, READ_ONCE(b) would certainly have seen thread 2's earlier WRITE_ONCE(b, 1) (remember, the store buffer had already been flushed), and x would be 1. That's a contradiction, therefore READ_ONCE(b) must have executed first:

    WRITE_ONCE(a, 1);              WRITE_ONCE(b, 1);
           |                              |
           |                         -----+----- smp_mb();
           |                              |
           v                              v
    x = READ_ONCE(b); -----------> y = READ_ONCE(a);
                        (if x=0)

Due to transitivity, READ_ONCE(a) can see the effect of WRITE_ONCE(a, 1) and y=1. Likewise, if thread 2 reads a as zero, thread 1's full memory barrier ensures that the READ_ONCE(a) must have executed before thread 1's READ_ONCE(b):

    WRITE_ONCE(a, 1);              WRITE_ONCE(b, 1);
           |                              |
      -----+----- smp_mb();               |
           |                              |
           v                              v
    x = READ_ONCE(b); <----------- y = READ_ONCE(a);
                        (if y=0)

meaning that y=0 implies x=1. Different executions may show one or the other possibility, but in any case x and y cannot both be zero; that would mean that READ_ONCE(a) executed before READ_ONCE(b) and vice versa, which is impossible.

The Linux kernel memory model does not call these read-read edges "happens-before", because there's no indication of which is the release operation and which is the acquire operation; nevertheless, they have the same effect of ordering memory operations across threads. Therefore we can state that high-level APIs have acquire or release semantics even if internally they use full memory barriers; users of those APIs will be able to think of their code in terms of the happens-before framework, too. The next example will show how this is done in practice.

Sleep/wake-up synchronization

The detailed description in the previous section shows that full memory barriers impose constraints in a complicated and unintuitive manner. Fortunately, the above pattern of "two threads and two flags", where each thread writes to a flag and reads from the other, is probably all that you need to know about full memory barriers.

This pattern has many important and practical applications. Suppose thread 1 wants thread 2 to act on its requests; thread 2, on the other hand, may find that it has nothing to do, in which case it wants to do something that could even take an indefinite amount of time, for example removing itself from the scheduler's run queue and going to sleep. In this case, the code will look like this:

    thread 1                                   thread 2
   -------------------                        --------------------------
    WRITE_ONCE(dont_sleep, 1);                 WRITE_ONCE(wake_me, 1);
    smp_mb();                                  smp_mb();
    if (READ_ONCE(wake_me))                    if (!READ_ONCE(dont_sleep))
      wake(thread2);                             sleep();

If thread 2 reads dont_sleep as zero, thread 1 will read wake_me as one and wake up thread 2; it's useful to think of thread 1 as having release semantics (imagine that the wake() happens as part of mutex_unlock). If thread 1 reads wake_me as zero, thread 2 will read dont_sleep as one and will not go to sleep. Usually this is the side that needs to exhibit acquire semantics.

There is a hidden assumption that thread 1's wake-up requests are never lost, even if for example wake() is called after thread 2's READ_ONCE() but before sleep(). A way to avoid this bug is for wake() and sleep() calls to take the same lock. Again, we see how lockless patterns cooperate with traditional synchronization—after all this is a slow path.

It really does work, and we can see this, for example, in Linux's prepare_to_wait() and wake_up_process() APIs. This interface was introduced in the 2.5.x development kernels, and was duly covered back then by LWN. Here is how the pattern appears after expanding some of the functions that are involved:

    thread 1                                     thread 2
    -------------------                          --------------------------
    WRITE_ONCE(condition, 1);                    prepare_to_wait(..., TASK_INTERRUPTIBLE) {
    wake_up_process(p) {                           set_current_state(TASK_INTERRUPTIBLE) {
      try_to_wake_up(p, TASK_NORMAL, 0) {            WRITE_ONCE(current->state, TASK_INTERRUPTIBLE);
        smp_mb();                                    smp_mb();
        if (READ_ONCE(p->state) & TASK_NORMAL)     }
	  ttwu_queue(p);                         }      
      }                                          if (!READ_ONCE(condition))
    }                                              schedule();

As we saw in the seqcount case, the memory barriers are hidden inside the implementation of higher-level APIs. Embedding memory barriers or load-acquire/store-release operations inside the implementation is in fact how one defines an API that has acquire and release semantics; in this case, wake_up_process() has release semantics, while set_current_state() and its caller prepare_to_wait() have acquire semantics.

The sleep condition is often checked twice to limit unnecessary wakeups, like this:

    thread 1                               thread 2
    -------------------                    --------------------------
    WRITE_ONCE(dont_sleep, 1);             if (!READ_ONCE(dont_sleep)) {
    smp_mb();                                WRITE_ONCE(wake_me, 1);
    if (READ_ONCE(wake_me))                  smp_mb();
        wake(thread2);                       if (!READ_ONCE(dont_sleep))
                                               sleep();
                                           }

In the kernel we can see this kind of check in the interaction between tcp_data_snd_check() and the call to tcp_check_space(), defined immediately above it (thread 1) and tcp_poll() (thread 2). In this case, the code does not have the luxury of relying on a higher-level abstraction, so it merits a closer look. Thread 2 wants to sleep if the socket has no room in its send buffer, therefore tcp_poll() sets the "wake me" SOCK_NOSPACE flag before checking __sk_stream_is_writeable(). The core of the lockless synchronization in tcp_poll() is:

    set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
    smp_mb__after_atomic();
    if (__sk_stream_is_writeable(sk, 1))
      mask |= EPOLLOUT | EPOLLWRNORM;

The caller will ultimately go to sleep if mask is zero. smp_mb__after_atomic() is a specialized version of smp_mb() and has the same semantics. These optimized barriers will be explained in a future article.

Thread 1, instead, must wake up thread 2 after consuming data in the send buffer. tcp_data_snd_check() first sends out packets to make room in the buffer ("do not sleep, there's room now"), then checks SOCK_NOSPACE, and finally (through the sk->sk_write_space() function pointer) reaches sk_stream_write_space(), where thread 2 is woken up. The call stack is relatively shallow, so I'll let the reader explore the code. I will however point out this comment in tcp_check_space():

    /* pairs with tcp_poll() */
    smp_mb();
    if (test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
      tcp_new_space(sk);

The reference to the "pairing" of barriers tells us that this function has either acquire or release semantics. A read or write memory barrier would tell us straight away that the function has respectively acquire and release semantics. With a full memory barrier, we need to look at the code around the barrier. In this case, we know that the function is on the "wake-up" side of the pattern and therefore has release semantics; tcp_poll() instead has acquire semantics.

Something similar happens almost everywhere a smp_mb() can be found in the kernel. For example:

  • Workqueues use this idiom to decide whether workers have more work to do. In this case, thread 1's role is taken by insert_work(), while thread 2's role is taken by wq_worker_sleeping().
  • In the futex() system call, thread 1's write is in user space, while the memory barrier and read are part of futex(FUTEX_WAKE). Thread 2's operation is entirely part of the futex(FUTEX_WAIT) (because wake_me is a flag in kernel memory); FUTEX_WAIT accepts the expected value of the futex as an argument to the system call, and uses it to decide whether to sleep. See the long comment at the head of kernel/futex.c for details on how this works.
  • Within KVM, sleep() is replaced by the act of entering the processor's guest mode and running a virtual machine. In order to kick the processor out of guest mode, kvm_vcpu_kick() sends an inter-processor interrupt to the processor. The familiar comment around memory barrier pairing can be found down the call chain in kvm_vcpu_exiting_guest_mode(), which is also where vcpu->mode is read.
  • Virtio devices use two instances of the above pattern. On one hand, the driver wants to stop processing completed requests, and waking it up consists of sending an interrupt. On the other hand, the device wants to stop processing submitted requests, and the driver can wake it up by writing to a "doorbell" memory location.

That completes the introduction of memory barriers. From here, we will look at compare-and-swap operations, how they can be combined with locks to create lock-free fast paths, and their role in the implementation of lock-free linked lists.

Comments (16 posted)

Linux 5.12's very bad, double ungood day

By Jonathan Corbet
March 8, 2021
The -rc kernels released by Linus Torvalds exist for a reason: after 10,000 or so changes flow into the kernel over a two-week merge window, there will surely be some bugs in need of squashing. The -rc kernels provide an opportunity for wider testing after all those patches have been integrated. Most of the time, -rc kernels (even the initial -rc1 releases) are surprisingly safe to run. Occasionally, though, something goes wrong, giving early testers reason to reconsider their life choices. The 5.12-rc1 kernel, as it turns out, was one of those.

On January 26, Christoph Hellwig posted a 17-patch series containing cleanups to the code dealing with the allocation of the BIO structures used to represent block-I/O requests. The final patch in that series simplified the allocation of requests for swap files in particular. The series was applied by block maintainer Jens Axboe one day later. The change was included in this pull request sent on February 17, and landed in the mainline on February 21 as part of the massive set of pulls done by Torvalds once his power was restored.

"Swapping" is how the kernel pushes anonymous pages (those which are not backed up by a file on disk — program data, in other words) out to persistent storage when memory gets tight. Linux can swap directly to a partition on a block device; that is how systems are traditionally set up. But the kernel can also swap to a file within a mounted filesystem with something close to the same performance. Swapping to a file gives administrators some additional flexibility; it is an easy way to give some relief to a system that is running short of memory without disturbing the existing workload, for example.

The problematic patch works just fine for swap activity to a partition. But, when a swap file is in use, the offset to the location of its blocks within the filesystem would be lost. That had the result of redirecting swap traffic to an essentially random location on the filesystem — a location that was not intended for swapping and may well have contained other information that somebody would have preferred to keep. In other words, the filesystem containing the swap file would be corrupted. (Interestingly, a similar bug was introduced in 2014 and lurked undetected because it only affects extremely rare configurations.)

On March 2, a fix was applied to the mainline repository. One day later, Torvalds sent a warning to the linux-kernel mailing list describing the problem and noting that: "This is what we in the industry call 'double ungood'". He removed the 5.12-rc1 tag from his repository (or, rather, renamed it to "5.12-rc1-dontuse") and generally made it clear that this release should be avoided. He also had a couple of requests for maintainers.

One had to do with the 5.12-rc1 tag; he can remove it from his repository, but that does not change anybody else's repository. So, he said, maintainers should manually remove that tag themselves just to prevent any accidental use of that release.

Beyond that, it is common for subsystem maintainers to base new branches on an -rc release, even -rc1. But any such branches pose a particular hazard for the development and testing community in general, because they will not contain the fix for this problem. The git bisect tool, which is frequently used to find the patch that caused a regression, can land on any location in the development history; branches containing the buggy commit (but not the fix) create extended ranges of history with the potential to destroy filesystems far into the future. This, also, seems "'double ungood'".

The answer is to not base development branches on 5.12-rc1, but instead to use either an older release or to wait until -rc2. That will not be entirely helpful for subsystem maintainers who have already published branches with an -rc1 base, as some certainly have. They have a choice between keeping the dangerous branch or rebasing it onto a safer branch point, possibly creating difficulties for anybody else who has built on that branch. That sort of rebasing would normally be frowned upon, but Torvalds made it clear that it is permissible — and expected — this time.

The kernel community is mostly treating this as a case of "bugs happen" and moving on. One can indeed argue that the process has worked the way it is supposed to: a catastrophic bug was caught early during the stabilization period and will never make it anywhere near a production system. That said, there may be room for a bit of introspection and thinking about how things could have been done better.

For example, one might argue that the review process could have worked better. The patch in question was only posted once, and was applied to the block repository the next day. The patch in the mainline repository contains two Reviewed-by tags and one Acked-by tag, but the mailing-list thread shows those tags all being offered for a different patch in the series. Perhaps all three reviewers offered their tags off-list but, from the public record, it's not clear how much review the problematic patch actually received.

There is no guarantee that more eyeballs would have found this bug before it was committed, but it seems that those eyeballs were not given much of a chance here. It is tempting to see "cleanup" patches as being inherently safe and needing less review, but those patches can contain subtle bugs as well.

A part of the process that could have worked more quickly was warning the community as a whole of the problem. In the message linked above, Torvalds acknowledged that he should have acted more quickly rather than waiting to be sure that the proposed fix actually made the problem go away. One can only imagine that the warning will go out more quickly the next time.

There is one other little problem here that surprisingly few people are talking about. The health of the mailing-lists hosted on vger.kernel.org has not been optimal for a while; at times, posts to the mailing lists have been delayed for a day or two or dropped altogether. Torvalds's warning was not delayed that badly, but it still took several hours to get to recipients of linux-kernel. Whether the delay in the delivery of that warning caused more testers to be bitten by this bug is not clear, but it certainly did not help.

The good news is that work is currently being done to put the vger lists on a more solid, well-supported footing. With luck, the mailing-list problems will be behind us in the relatively near future.

Even then, though, a warning sent just to linux-kernel has a real chance of disappearing into the noise. The traffic on that list, which routinely exceeds 1,000 messages per day, can drown out important messages and has caused many developers to unsubscribe completely. It may well be that we need better communication channels for this kind of urgent message.

In the end, the actual pain caused by this bug was probably relatively small; there have not been massive numbers of "it ate my filesystem!" messages going around. The process did mostly work the way it was intended to. With luck, lessons learned from this incident can help that process to work even better the next time.

Comments (70 posted)

Page editor: Jonathan Corbet


Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds