
Leading items

Welcome to the LWN.net Weekly Edition for April 9, 2020

This edition contains the following feature content:

  • VMX virtualization runs afoul of split-lock detection
  • Frequency-invariant utilization tracking for x86
  • 5.7 Merge window part 1
  • A full task-isolation mode for the kernel
  • Concurrency bugs should fear the big bad data-race detector (part 1)

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


VMX virtualization runs afoul of split-lock detection

By Jonathan Corbet
April 7, 2020
One of the many features merged for the 5.7 kernel is split-lock detection for the x86 architecture. This feature has encountered a fair amount of controversy over the course of its development, with the result that the time between its initial posting and appearance in a released kernel will end up being over two years. As it happens, there is another hurdle for split-lock detection even after its merging into the mainline; this feature threatens to create problems for a number of virtualization solutions, and it's not clear what the solution would be.

To review quickly: a "split lock" occurs when a processor instruction locks a range of memory that crosses a cache-line boundary. Implementing such locks requires locking the entire memory bus, with unpleasant effects on the performance of the system as a whole. Most architectures do not allow split locks at all, but x86 does; only recently have some x86 processors gained the ability to generate a trap when a split lock is requested.

Kernel developers are interested in enabling split-lock detection as a way of eliminating a possible denial-of-service attack vector as well as just getting rid of a performance problem that could be especially problematic for latency-sensitive workloads. In short, there is a desire for x86 to be like other architectures in this regard. The implementation of this change has evolved considerably over time; in the patch that was merged, there is a new boot-time parameter (split_lock_detect=) that can have one of three values. Setting it to off disables this feature, warn causes a warning to be issued when user-space code executes a split lock, and fatal causes a SIGBUS signal to be sent. The default value is warn.

The various discussions around split-lock detection included virtualization, which has always raised some interesting questions. A system that runs virtualized guests is a logical place to enable split-lock detection, since a guest can disrupt others with hostile locking behavior. But a host that turns on split-lock detection risks breaking guests that are unprepared for it; this problem extends to the guest operating system, which will be directly exposed to the alignment-check traps caused by split-lock detection. It may not be possible for the administrator of the host to even know whether the guest workloads are ready or not. So various kernel developers wondered what the best policy regarding virtualization should be.

It seems that some of that discussion fell by the wayside as the final patch was being prepared, leading to an unpleasant surprise. Kenneth Crudup first reported that split-lock detection caused VMware guests to crash, but the problem turns out to be a bit more widespread than that.

Intel's "virtual machine extensions" (VMX, also referred to as "VT-x") implement hardware-supported virtualization on x86 processors. A VMLAUNCH instruction places the processor in the virtualized mode, where the guest's system software can (mostly) behave as if it were running on bare hardware while remaining contained within its sandbox. It turns out that, if split-lock detection is enabled and code running within a virtual machine attempts a split lock, the processor will happily deliver an alignment-check trap to a thread running in VMX mode; what happens next depends on the hypervisor. Most hypervisors are not prepared for this to happen; they will often just forward the trap into the virtual machine, which, not being prepared for it, will likely crash. Any hypervisor using VMX is affected by this issue.

Thomas Gleixner responded to the problem with a short patch series trying to cause the right things to happen. One of the affected hypervisors is KVM; since it is a part of the kernel, the right solution is to just make KVM handle the trap properly. Gleixner included a patch causing KVM to check to see whether the machine was configured to receive an alignment-check trap and only deliver it if so. That patch is likely to be superseded by a different series written by Xiaoyao Li, but the core idea (make KVM handle the trap correctly) is uncontroversial.

The real question is what should be done the rest of the time. All of the other VMX-using hypervisors are out-of-tree, so they cannot be fixed directly. Gleixner's original patch was arguably uncharacteristic of his usual approach to such things: it disabled split-lock detection globally if a hypervisor module was loaded into the kernel. But, since modules don't come with a little label saying "this is a hypervisor", Gleixner's patch would, instead, read through each module's executable code at load time in search of a VMLAUNCH instruction. Should such an instruction exist, the module is deemed to be a hypervisor. Unless a special flag ("sld_safe") is set in the module info area, the hypervisor will be assumed to be unready for split-lock detection and the feature will be turned off.

It is not at all clear that this approach will be adopted. Among other things, it turns out that not all VMX hypervisors include VMLAUNCH instructions in their code. As Gleixner noted later in the discussion, VirtualBox doesn't directly contain any of the VMX instructions; those are loaded separately by the VirtualBox module, outside of the kernel's module-loading mechanism. "This 'design' probably comes from the original virtualbox implementation which circumvented GPL that way", Gleixner observed. Other modules use VMXON rather than VMLAUNCH.

Eventually these sorts of problems could be worked around, but there is another concern with this approach that was expressed, in typical style, by Christoph Hellwig:

This is just crazy. We have never cared about any out of tree module, why would we care here where it creates a real complexity. Just fix KVM and ignore anything else.

There is a fair amount of sympathy for this approach in kernel-development circles, but there is still a reluctance to ship something that is certain to create unexpected failures for end users even if it is not seen as a regression in the usual sense. So a couple of other ideas for how to respond to this problem have been circulating.

One of those is to continue scanning module code for instructions that indicate hypervisor functionality. But, rather than disabling split-lock detection on the system as a whole, the kernel would simply refuse to load the module. There are concerns about the run-time cost of scanning through module code, but developers like Peter Zijlstra also see an opportunity to prevent the loading of modules that engage in other sorts of unwelcome behavior, such as directly manipulating the CPU's control registers. A patch implementing such checks has subsequently been posted.

An alternative, suggested by Hellwig, is to find some other way to break the modules in question and prevent them from being loaded. Removing some exported symbols would be one way to do that. Zijlstra posted one attempt at "fixing" the problem that way; Hellwig has a complementary approach as well.

As of this writing, it's not clear which approach will be taken; the final 5.7 kernel could be released with both of them, or some yet unseen third technique. Then, just maybe, the long story of x86 split-lock detection will come to some sort of conclusion.


Frequency-invariant utilization tracking for x86

By Jonathan Corbet
April 2, 2020
The kernel provides a number of CPU-frequency governors to choose from; by most accounts, the most effective of those is "schedutil", which was merged for the 4.7 kernel in 2016. While schedutil is used on mobile devices, it still doesn't see much use on x86 desktops; the intel_pstate governor is generally seen as giving better results on those processors as a result of the secret knowledge embodied therein. A set of patches merged for 5.7, though, gives schedutil a better idea of what the true utilization of x86 processors is and, as a result, greatly improves its effectiveness.

Appropriate CPU-frequency selection is important for a couple of reasons. If a CPU's frequency is set too high, it will consume more power than needed, which is a concern regardless of whether that CPU is running in a phone or a data center. Setting the frequency too low, instead, will hurt the performance of the system; in the worst cases, it could cause the available CPU power to be inadequate for the workload and, perhaps, even increase power consumption by keeping system components awake for longer than necessary. So there are plenty of incentives to get this decision right.

One key input into any CPU-frequency algorithm is the amount of load a given CPU is experiencing. A heavily loaded processor must, naturally, be run at a higher frequency than one which is almost idle. "Load" is generally characterized as the percentage of time that a CPU is actually running the workload; a CPU that is running flat-out is 100% loaded. There is one little detail that should be taken into account, though: the current operating frequency of the CPU. A CPU may be running 100% of the time, but if it is at 50% of its maximum frequency, it is not actually 100% loaded. To deal with this, the kernel's load tracking scales the observed load by the frequency the CPU is running at; this scaled value is used to determine how loaded a CPU truly is and how its frequency should change, if at all.

At least, that's how it is done on some processors. On x86 processors, though, this frequency-invariant load tracking isn't available; that means that frequency governors like schedutil cannot make the best decisions. It is not entirely surprising that performance (as measured in both power consumption and CPU throughput) suffers.

This would seem like an obvious problem to fix. The catch is that, on contemporary Intel processors, it is not actually possible to know the operating frequency of a CPU. The operating system has some broad control over the operating power point of the CPU and can make polite suggestions as to what it should be, but details like actual running frequency are dealt with deep inside the processor package and the kernel is not supposed to worry its pretty little head about them. Without that information, it's not possible to get the true utilization of an x86 processor.

It turns out that there is a way to approximate this information, though; it was suggested by Len Brown back in 2016 but not adopted at that time. There are two model-specific registers (MSRs) on modern x86 CPUs called APERF and MPERF. Both can be thought of as a sort of time-stamp counter that increments as the CPU executes (though Intel stresses that the contents of those registers don't have any specific timekeeping semantics). MPERF increments at a constant rate proportional to the maximum frequency of the processor, while APERF increments at a variable rate proportional to the actual operating frequency. If aperf_change is the change in APERF over a given time period, and mperf_change is the change in MPERF over that same period, then the operating frequency can be approximated as:

    operating_freq = (max_freq*aperf_change)/mperf_change;

Reading those MSRs is relatively expensive, so this calculation cannot be made often, but once per clock tick (every 1-10ms) turns out to be enough.

There is one other little detail, though, in the form of Intel's "turbo mode". Old timers will be thinking of the button on the case of a PC that would let it run at a breathtaking 6MHz, but this is different. When the load in a particular package is concentrated on a small number of CPUs, and the others are idle, the busy CPUs can be run at a frequency higher than the alleged maximum. That makes it hard to know what the true utilization of a CPU is, because its capacity will vary depending on what other CPUs in the system are doing.

The patches (posted by Giovanni Gherdovich) implement the above-mentioned method to calculate the operating frequency, and use the turbo frequency attainable by four processors simultaneously as the maximum possible. The result is a reasonable measure of what the utilization of a given processor is. That lets schedutil make better decisions about what the operating frequency of each CPU should actually be.

As it happens, the algorithm used by schedutil to choose a frequency changes a bit when it knows that the utilization numbers it gets are frequency-invariant. Without invariance, schedutil will move frequencies up or down one step at a time. With invariance, it can better calculate what the frequency should be, so it can make multi-step changes. That allows it to respond more quickly to the actual load.

The end result, Gherdovich said in the patch changelog, is performance from schedutil that is "closer to, if not on-par with, the powersave governor from the intel_pstate driver/framework". To back that up, the changelog includes a long series of benchmark results; the changelog is longer than the patch itself. While the effects of the change are not all positive, the improvements (in both performance and power usage) tend to be large while the regressions happen with more focused benchmarks and are relatively small. One of the workloads that benefits the most is kernel compilation, a result that almost guarantees acceptance of the change in its own right.

The curious can read the patch changelog for the benchmark results in their full detail. For the rest of us, what really matters is that the schedutil CPU-frequency governor should perform much better on x86 machines than it has in the past. Whether that will cause distributions to switch to schedutil remains to be seen; that will depend on how well it works on real-world workloads, which often have a disturbing tendency to not behave the way the benchmarks did.


5.7 Merge window part 1

By Jonathan Corbet
April 3, 2020
As of this writing, 7,233 non-merge changesets have been pulled into the mainline repository for the 5.7 kernel development cycle — over the course of about three days. If current world conditions are slowing down kernel development, it would seem that the results are not yet apparent at this level. As usual, these changesets bring no end of fixes, improvements, and new features; read on for a summary of what the first part of the 5.7 merge window has brought in.

Architecture-specific

  • A version of the controversial split-lock detection patch has finally been merged. See this changelog for details about how this mode works. At the moment, this work breaks some VMware-based virtual machines, but that will presumably be fixed long before 5.7 is released.
  • The Arm architecture now supports hot-removal of memory.
  • Pointer authentication is now supported for kernel code (along with user space, which has been supported for some time). This work includes return-address signing in the kernel.

Core kernel

  • The io_uring subsystem now includes support for splice() and for automatic buffer selection.
  • The thermal pressure patch set has been merged; it allows the scheduler to take a processor's thermal situation into account when placing tasks.
  • The CPU scheduler's load tracking has finally gained frequency invariance — meaning that it has access to correct utilization values regardless of the CPU's current operating frequency — on the x86 architecture.
  • After a fair amount of back-and-forth, BPF and the realtime preemption patches can now coexist nicely.
  • The new BPF_MODIFY_RETURN BPF program type can be attached to a function in the kernel and modify its return value.

Filesystems and block I/O

  • The Btrfs filesystem provides a new ioctl() command (BTRFS_IOC_SNAP_DESTROY_V2) that allows the deletion of a subvolume by its ID.
  • As expected, the exFAT filesystem module has been deleted to make room for a better one that will be merged into the main kernel via a filesystem tree. This module was not deleted, though, before various developers had made a number of improvements to it that have now been discarded.

Hardware support

  • Graphics: BOE TV101WUM and AUO KD101N80 45NA 1200x1920 panels, Feixin K101 IM2BA02 panels, Parade PS8640 MIPI DSI to eDP converters, TI Keystone display subsystems, Samsung AMS452EF01 panels, Ilitek ILI9486 display panels, Toshiba TC358768 MIPI DSI bridges, TI TPD12S015 HDMI level shifters, Novatek NT35510 RGB panels, and Elida KD35T133 panels.
  • Industrial I/O: Sharp GP2AP002 proximity/ambient-light sensors, Dyna Image AL3010 ambient-light sensors, Analog Devices HMC425A GPIO gain amplifiers, Analog Devices AD5770R digital-to-analog converters, and InvenSense ICP-101xx pressure and temperature sensors.
  • Media: Sony IMX219 sensors and Allwinner DE2 rotation units.
  • Miscellaneous: Analog Devices fan controllers, Qualcomm Atheros AR934X/QCA95XX SPI controllers, MediaTek SPI NOR controllers, Monolithic MP5416 power-management ICs, Monolithic MP8869 regulators, Ingenic JZ SoC operating-system timers, IDT 82P33xxx PTP clocks, Xilinx ZynqMP AES crypto accelerators, and Allwinner sun6i/sun8i/sun9i/sun50i message boxes.
  • Networking: Qualcomm IP accelerators, Synopsys DesignWare XPCS controllers, Marvell USB to MDIO adapters, and TI K3 AM654x/J721E CPSW Ethernet adapters.
  • USB: Intel PMC multiplexers, Ingenic JZ4770 transceivers, Maxim MAX3420 USB-over-SPI controllers, Qualcomm 28nm high-speed PHYs, and Qualcomm USB super-speed PHYs [don't ask us whether "high-speed" is higher-speed than "super-speed" or not...]. The USB subsystem also has a new "raw gadget" interface that allows the emulation of USB devices from user space.
  • Staging notes: the wireless USB subsystem and "ultra wideband" module have been deleted; they have not worked for some time and nobody is working on the code. The HP100 Ethernet driver is also gone from staging. Meanwhile, the Cavium octeon USB controller and wireless interface drivers, which were deleted in 5.6, have been reinstated for 5.7.

Networking

  • The networking layer can now take advantage of hardware that offloads 802.11 encapsulation tasks.
  • The new "Bareudp" module provides generic, level-3 UDP encapsulation that can be used by a number of other tunneling protocols. See the documentation in this commit for more information.
  • Moving a device from one network namespace to another will now adjust the ownership and permissions of the relevant sysfs files accordingly.
  • The work of merging the multipath TCP patches continues, but a fully functional MPTCP implementation in the mainline is still probably a few releases away.

Security-related

  • The SELinux checkreqprot tunable, if set to 1, changes the way that memory protections are checked in security policies; that can have the effect of allowing memory to be made executable regardless of what the loaded policy says. This setting will be deprecated in 5.7, with the plan to disable it entirely in a future release; see this commit for more information.
  • The KRSI security module has been merged — via the networking tree. This module allows the attachment of BPF programs to any security hook in the kernel; its form has changed somewhat since LWN looked at it and the "KRSI" name is no longer used, but the core idea remains the same. This commit contains some documentation for this feature.
  • The kernel's Curve25519 elliptic-curve crypto implementation has been replaced with one that has been formally verified.

Internal kernel changes

  • There is now a generic interrupt-injection mechanism that can be used for debugging and fault-testing purposes.
  • The TRIM_UNUSED_KSYMS configuration option causes the removal from the kernel symbol table of all exported symbols that are not used by the kernel as it is actually configured and built. There are cases (Android in particular) where this removal is desired, but there is also a need to continue to export a number of symbols for externally supplied modules, even if they are not used by the Android kernel itself. The new UNUSED_KSYM_WHITELIST option allows the provision of a list of symbols that should be retained even if they are seemingly unused.
  • It is now possible (via the MAGIC_SYSRQ_SERIAL_SEQUENCE configuration option) to specify a string of characters that is required to enable the magic SysRq functionality on a serial port. The purpose is to keep this functionality available while avoiding spurious commands on serial ports that can generate garbage data.
  • The "unified user-space access-intended accelerator framework" implements shared virtual addresses between the CPU and peripheral devices; it is intended to allow accelerators to "access any data structure of the main cpu". It was merged via the crypto tree. See this commit for documentation.
  • The kunit unit-testing framework can now make test results available via debugfs.

The 5.7 merge window is just beginning; it can be expected to run through April 12 if the usual schedule holds. As always, LWN will catch up with the rest of the changes pulled for 5.7 once the merge window closes.


A full task-isolation mode for the kernel

April 6, 2020

This article was contributed by Marta Rybczyńska

Some applications require guaranteed access to the CPU without even brief interruptions; realtime systems and high-bandwidth networking applications with user-space drivers can fall into that category. While Linux provides some support for CPU isolation (moving everything but the critical task off of one or more CPUs) now, it is an imperfect solution that is still subject to some interruptions. Work has been continuing in the community to improve the kernel's CPU-isolation capabilities, notably with improvements in the nohz (tickless) mode, but that work is not finished yet. Recently, Alex Belits submitted a patch set (based on work by Chris Metcalf in 2015) that introduces a completely predictable environment for Linux applications — as long as they do not need any kernel services.

Nohz and task isolation

Currently, the nohz mode in Linux allows partial task isolation. It decreases the number of interrupts that the CPU receives; for example, the clock-tick interrupt is disabled for nearly all CPUs. However, nohz does not guarantee that there will be no interruptions; the running task can still be interrupted by page faults (careful design of an application can avoid those) or delayed workqueues. The advantage of this mode is that tasks can run regular code, including system calls, and any extra overhead is limited to the system-call entry and exit paths.

For some applications, the lack of absolute guarantees from nohz may cause problems. As an example, consider high-performance, user-space network drivers, which have a small number of CPU cycles in which to handle each packet; for those, an interrupt and its handling may cause a significant delay, possibly using up the entire time available. Realtime operating systems (RTOSes) can provide the needed guarantees, but they have limited hardware support; the authors of the patch feel that it is less work to develop and maintain interrupt-free applications than to support an RTOS alongside Linux, as Belits explained:

The alternative, running RTOS instead of Linux, is becoming more and more labor-consuming because modern CPUs and SoCs have very complex device/resource configuration and management procedures, and at this point for some hardware it is clearly in the realm of impractical to maintain an RTOS with hardware support on par with Linux kernel, reliable and secure at the same time.

In these times, even embedded systems often contain a number of cores, and system designers are adding more for tasks requiring predictability. Belits explained that further:

Therefore OS ability to switch a CPU core into RTOS-ish mode [...] is an important feature for modern embedded systems development. Probably more important than even real-time interrupts latency and preemption, now that people, when they don't like how their interrupts are handled, can just add CPU cores.

The kernel currently has a couple of features meant to make it possible to run applications without interruptions: nohz (described above) and CPU isolation (or "isolcpus"). The latter feature isolates one or more CPUs — making them unavailable to the scheduler and only accessible to a process via an explicitly set affinity — so that any processes running there need not compete with the rest of the workload for CPU time. These features reduce interruptions on the isolated CPUs, but do not fully eliminate them; task isolation is an attempt to finish the job by removing all interruptions. A process that enters the isolation mode will be able to run in user space with no interference from the kernel or other processes.

Configuring and activating task isolation

The authors assume that isolation is not needed in kernel space or during the task's initialization phase. A task enters the isolation mode at some point in time and stays in this mode until it leaves the isolation on its own, performs some action that causes the isolation to be broken, or receives a signal that was directed to it.

The kernel needs to be compiled with the CONFIG_TASK_ISOLATION flag and then booted with the same options as for nohz mode with CPU isolation:

    isolcpus=nohz,domain,CPULIST

where nohz disables the timer tick on the specified CPUs, domain removes the CPUs from the scheduling algorithms, and CPULIST is the list of CPUs where the isolation options are applied. Optionally, the task_isolation_debug kernel command-line option causes a stack backtrace when a task loses isolation.

When a task has finished its initialization, it can activate isolation by using the PR_SET_TASK_ISOLATION operation provided by the prctl() system call. This operation may fail for either permanent or temporary reasons. An example of a permanent error is when the task is set up on a CPU without isolation; in this case, entering isolation mode is not possible. Temporary errors are indicated by the EAGAIN error code; examples include cases where the delayed workqueues could not be stopped. In such cases, the task may retry the operation if it wants to enter isolation, as it may succeed the next time.

In the prctl() call, the developer may also configure the signal to be sent to the task when it loses isolation, using the PR_TASK_ISOLATION_SET_SIG() macro and passing it the signal to send. The call then looks like the one in the example code:

    prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE
          | PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);

Here, the process has requested the receipt of a SIGUSR1 signal rather than the default SIGKILL should it lose isolation.

Losing isolation

The task will lose isolation if it enters kernel space as the result of a system call, a page fault, an exception, or an interrupt. The (fatal by default) signal will be sent when this happens, with a couple of exceptions: a prctl() call to turn off isolation does not generate the signal, and neither do exit() and exit_group(), since those calls cause the task to exit and thus end its isolation anyway.

When the task loses isolation by any means other than the above system calls, it will receive a signal, SIGKILL by default, which terminates the task. The signal can be changed if the application prefers to catch it; that can be useful, for example, if an application wants to log information about the lost isolation before exiting, or to retry running the code without isolation guarantees.

The task can enter and exit isolation when it desires. To leave isolation without a signal it should call:

    prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);

The internals

When a process calls prctl() to enable task isolation, it is marked with the TIF_TASK_ISOLATION flag in the kernel. The main part of the job of setting up task isolation, though, is done when returning from the prctl(). When the kernel returns to user space and sees the TIF_TASK_ISOLATION flag set, it arranges for the task not to be interrupted in the future. Interrupts are disabled, and the kernel disables any events that may interrupt the isolated CPU(s). In the current patches, it disables the scheduler's clock tick and vmstat delayed work, and drains pages out of the per-CPU pagevec to avoid inter-processor interrupts (IPIs) for cache flushes. More isolation actions may be added in the future.

This isolation work is more straightforward in the current version than it was in the 2015 patch set. Since then, Linux has gained the ability to offload timer ticks from the isolated CPUs to so-called "housekeeping" CPUs — all that are not on the CPU list of the isolcpus kernel option. That removes the need to make additional requirements for dealing with pending timers on CPUs before they can be isolated.

The patch set also adds diagnostics on the non-isolated CPUs. If the kernel finds itself about to interrupt an isolated CPU, it will generate diagnostics (a warning in the kernel log by default, but a stack dump is also possible) on the interrupting CPU. Examples of such situations include sending an IPI or a TLB flush. An interrupt that is not handled by Linux, for example a hypervisor interrupt, can end up sending a reschedule IPI to an isolated CPU, causing the isolation-breaking signal to be delivered to the isolated task. With regard to that problem, Frédéric Weisbecker wondered whether support for hypervisors is even necessary, but no conclusion has been reached on this topic.

The task-isolation mode requires changes in the architecture code; the patch set includes implementations for x86, arm, and arm64. An architecture needs to define HAVE_ARCH_TASK_ISOLATION and the new TIF_TASK_ISOLATION task flag. It needs to change its interrupt and page-fault entry routines to add a call to task_isolation_interrupt() so that any isolated tasks will exit isolation. The reschedule IPI should call task_isolation_remote() for the same purpose. Finally the system-call code should invoke task_isolation_syscall() to check if the call is allowed. When exiting to user space it should call task_isolation_check_run_cleanup() to run pending cleanup and task_isolation_start() if the isolation flag is set for the current task.

Apart from the changes in the architecture-specific code, adding the isolation feature caused several changes in other kernel subsystems. For example, in the network code, flush_all_backlogs() will enqueue work only on non-isolated CPUs. The trace ring buffer behaves on isolated CPUs in a similar way to offline ones — any updates will be done when the task exits isolation. Another change in the isolation mode is that kernel jobs are scheduled on housekeeping CPUs only. This includes tasks like probing for PCIe devices. Finally, kick_all_cpus_sync() has been modified to avoid scheduling interrupts on CPUs with isolated tasks. Weisbecker did not agree with this approach and listed a number of race conditions that may happen between this function and the task entering isolation. He suggested fixing the callers instead.

Summary

The patch set has received favorable initial reviews and it seems that this feature is of interest to developers. There are still some unresolved comments to be addressed, and some patches have not yet been reviewed at all. The patch set changes some basic kernel functions in subtle ways, so there will surely be questions asked about the testing of the feature and, of course, about possible regressions. When those issues are resolved, this work will likely be included in one of the upcoming kernel releases.

Comments (22 posted)

Concurrency bugs should fear the big bad data-race detector (part 1)

April 8, 2020

(Many contributors)

This article was contributed by Marco Elver, Paul E. McKenney, Dmitry Vyukov, Andrey Konovalov, Alexander Potapenko, Kostya Serebryany, Alan Stern, Andrea Parri, Akira Yokosawa, Peter Zijlstra, Will Deacon, Daniel Lustig, Boqun Feng, Joel Fernandes, Jade Alglave, and Luc Maranget.

The first installment of the "big bad" series described how a compiler can optimize your concurrent program into oblivion, while the second installment introduced a tool to analyze small litmus tests for such problems. Those two articles can be especially helpful for training, design discussions, and checking small samples of code. Although such automated training and design tools are welcome, automated code inspection that could locate even one class of concurrency bugs would be even better. In this two-part article, we look at a tool to do that kind of analysis.

This article focuses on the Kernel Concurrency Sanitizer (KCSAN)—also covered in an earlier LWN article—which can locate data races across the entire Linux kernel. This wide scalability does not come for free: KCSAN relies on compiler instrumentation and performs its analysis at runtime, which slows down the kernel considerably. In addition, it can only report data races that actually happen or almost happen during code execution. Nevertheless, KCSAN has already pointed out numerous problems, many of which have now been fixed.

Although KCSAN follows the Linux-Kernel Memory Consistency Model (LKMM), KCSAN can be told to ignore certain classes of data races depending on the preferences of the developers and maintainers, as will be described in part 2. Such forgiveness is helpful to developers who wish to focus on data races that are not exacerbated by current compilers. Furthermore, KCSAN allows developers to specify certain types of concurrency rules that it also checks for. In this mode KCSAN acts as a thorough concurrency-aware code reviewer, thus providing a much-needed service to the kernel community.

Why care about data races?

The C language evolved independently of concurrency. Consequently, C compilers are permitted to assume that if there is nothing special about a given variable or access, the variable will change only in response to a store by the current thread. Compilers therefore can and do use a variety of optimizations involving load fusing, code reordering, and many others—described in the first installment—that can cause concurrent algorithms to malfunction.

By definition, data races occur when there are concurrent conflicting accesses from multiple threads (or tasks or CPUs), at least one of which is a plain (unmarked) C-language access; accesses conflict if they all access the same memory location and at least one performs a write. While KCSAN can enforce that strict definition, by default it treats all aligned writes up to word size, whether marked or not, as atomic, so it is only looking for unmarked reads that race with those writes.

The wide variety of optimizations used by modern compilers makes it extremely difficult to predict all possible outcomes for all compilers on all architectures. Worse yet, long experience indicates that optimizers will continue becoming increasingly aggressive. In fact, the C11 memory consistency model (described in section 5.1.2.4 of this specification [PDF]) allows maximal optimizer aggression by stating that data races invoke undefined behavior.

Quick quiz 1: Why can't the Linux kernel just use the C11 memory model?
Answer

For code that does not follow the C11 memory model, the situation is even less clear. In the case of the Linux kernel, the LKMM specifies the expected behavior of concurrent kernel code given production compilers and systems. To this end, LKMM relies on special marked operations (READ_ONCE(), WRITE_ONCE(), etc.) that tell the compiler which accesses are expected to race, thus preventing the compiler from applying harmful optimizations. For a summary please see the second installment in the series, "Calibrating your fear of big bad optimizing compilers".

Quick quiz 2: What's the difference between "data races" and "race conditions"?
Answer

However, if a data race results in unexpected system behavior then this data race is also a race-condition bug, and it is likely to also be a symptom of a bug in the system's higher-level logic. One common example of such a bug is access to lock-protected shared variables from threads that have failed to acquire the corresponding lock. Developers and maintainers who have appropriately marked their accesses to shared variables will find that KCSAN's reports point to such logic bugs. Furthermore, KCSAN allows developers some control over the warnings issued, which in turn allows those developers to focus KCSAN's automated review on the race conditions of interest.

The Kernel Concurrency Sanitizer

KCSAN is a tool that detects data races as defined by the LKMM, but with control over exactly what sorts of data races are reported. KCSAN is aware of all marked atomic operations that the LKMM defines, as well as operations not yet mentioned by the LKMM, such as atomic bitmask operations. KCSAN also extends the LKMM, for example by providing the data_race() marking, which denotes intentional data races and a possible lack of atomicity.

KCSAN is nothing more or less than a way of carrying out an extremely detailed automated concurrency-aware code inspection. In this way, KCSAN augments the "10,000 eyes" with a set of eyes that does not get tired and that has been running continuously on a syzbot instance since October 2019. To get a peek at some of the data races being found, without having to run KCSAN yourself, have a look at this dashboard.

KCSAN relies on observing that two accesses happen concurrently. Crucially, there is a desire to (a) increase the chances of observing races (especially for races that manifest rarely), and (b) be able to actually observe them. Those things can be accomplished (a) by injecting various delays, and (b) by using address watchpoints (or breakpoints).

If a memory access is deliberately stalled while a watchpoint is active for its address, and the watchpoint is observed to fire, then two accesses to the same address have just raced. This is the approach taken in DataCollider [PDF] using hardware watchpoints. KCSAN does not use hardware watchpoints, but instead relies on compiler instrumentation and "soft watchpoints".

In KCSAN, watchpoints are implemented using an efficient encoding that stores access type, size, and address in a long integer; the benefits of using "soft watchpoints" are portability and greater flexibility. KCSAN then relies on the compiler instrumenting plain memory accesses. For each instrumented plain access, KCSAN will:

  • Check if a matching watchpoint exists; if yes, and at least one access is a write, then a racing access has been encountered.
  • Periodically, if no matching watchpoint exists, set up a watchpoint and stall for a small randomized delay.
  • Also check the data value before the delay and re-check it after the delay; if the values differ, a race of unknown origin is inferred.

To detect data races where some (but not all) accesses have been marked with an annotation like READ_ONCE(), KCSAN also instruments marked accesses, but only to check if a watchpoint exists; i.e. KCSAN never sets up a watchpoint on marked accesses. So if all accesses to a variable that is accessed concurrently are properly marked, KCSAN will never trigger a watchpoint, since it never set one up, and therefore will never report the accesses.

How to use KCSAN

The best use of KCSAN depends on the maintainers, developers, and the code in question. This section covers different classes of code and how KCSAN can best help find potential concurrency bugs, then looks at ways of organizing KCSAN reports. But if you remember only one thing from this section, let it be "Do NOT respond to KCSAN reports by mindlessly adding READ_ONCE(), data_race(), and WRITE_ONCE()." The following sections (and part 2) will give a few reasons why this rule is so important.

Recent Linux installations provide everything needed to build the kernel with KCSAN, though KCSAN itself has not been merged to the mainline as yet; it is available in linux-next at this point. A compiler upgrade is required for older Linux installations with GCC prior to version 7.3.0 or with Clang prior to version 7.0.0. Given a KCSAN-capable compiler, running your kernel built with CONFIG_KCSAN=y might result in the following report:

    ==================================================================
    BUG: KCSAN: data-race in rcu_torture_reader / run_timer_softirq

    read (marked) to 0xffff9ea500543e98 of 8 bytes by task 155 on cpu 4:
     rcu_torture_reader+0x2cb/0x3b0
     kthread+0x1c3/0x1e0
     ret_from_fork+0x35/0x40

    write to 0xffff9ea500543e98 of 8 bytes by interrupt on cpu 0:
     run_timer_softirq+0x63c/0x980
     __do_softirq+0xd8/0x2cb
     irq_exit+0xc3/0xd0
     smp_apic_timer_interrupt+0xae/0x230
     apic_timer_interrupt+0xf/0x20
     delay_tsc+0x1b/0x60
     rcu_torture_fwd_prog+0x39d/0xe20
     kthread+0x1c3/0x1e0
     ret_from_fork+0x35/0x40

    Reported by Kernel Concurrency Sanitizer on:
    ...
    ==================================================================

This shows an eight-byte data race between the functions rcu_torture_reader() and run_timer_softirq(), with rcu_torture_reader() having done a marked read ("read (marked) to ...") and run_timer_softirq() having done an unmarked write ("write to ..."), which by the strict LKMM definition constitutes a data race.

However, in this case, the data race is not immediately apparent from the source code of these two functions. One approach would be to work out which data structure resides at address 0xffff9ea500543e98; another is to examine the assembly language at rcu_torture_reader+0x2cb and run_timer_softirq+0x63c. However, both of these time-honored approaches are labor-intensive and uncertain, not least due to the aggressive inlining and code-motion optimizations practiced by today's compilers.

A much nicer approach is to build your kernel with CONFIG_DEBUG_INFO=y, thus providing debug information that can be used by a variety of tools, for example, gdb (alternatives to gdb include scripts/decode_stacktrace.sh and syz-symbolize):

    (gdb) l*rcu_torture_reader+0x2cb
    0xffffffff8114a2bb is in rcu_torture_reader (./include/linux/list.h:784).
    779      * to avoid potential load-tearing.  The READ_ONCE() is paired with the
    780      * various WRITE_ONCE() in hlist helpers that are defined below.
    781      */
    782     static inline int hlist_unhashed_lockless(const struct hlist_node *h)
    783     {
    784             return !READ_ONCE(h->pprev);
    785     }
    786
    787     /**
    788      * hlist_empty - Is the specified hlist_head structure an empty hlist

The first of the conflicting accesses is the READ_ONCE() in hlist_unhashed_lockless(), which was apparently inlined into rcu_torture_reader(). This fact might have been difficult to glean from the assembly language. Given this information, the location of the other conflicting access is unsurprising:

    (gdb) l*run_timer_softirq+0x63c
    0xffffffff81169aec is in run_timer_softirq (./include/linux/list.h:931).
    926     static inline void hlist_move_list(struct hlist_head *old,
    927                                        struct hlist_head *new)
    928     {
    929             new->first = old->first;
    930             if (new->first)
    931                     new->first->pprev = &new->first;
    932             old->first = NULL;
    933     }
    934
    935     #define hlist_entry(ptr, type, member) container_of(ptr,type,member)

The unmarked write on line 931 is the conflicting access, again according to the strict LKMM definition of data race. However, a kernel built with the default KCSAN Kconfig options would not report this data race because unmarked writes are considered to be atomic. By default, and at a high level, KCSAN looks for unmarked reads that run concurrently with any sort of write to that same variable. This is less strict than LKMM, which would in addition look for unmarked writes that run concurrently with any sort of read from or write to that same variable. If you want reports according to the strict LKMM definition (as Paul McKenney does when applying KCSAN to RCU), build your kernel with the following additional Kconfig options:

    CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n
    CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n
    CONFIG_KCSAN_INTERRUPT_WATCHER=y

These options are discussed in more detail in part 2. A summary of the various options is also available in Documentation/dev-tools/kcsan.rst.

Summarizing KCSAN reports

It is not unusual for a moderate-length test to produce thousands of KCSAN reports, which does not necessarily generate enthusiasm for working on fixes. However, most of these reports will be duplicates. One way to reduce the number of duplicates is to build your kernel with KCSAN's KCSAN_REPORT_ONCE_IN_MS Kconfig parameter set to a large value. For example, building with CONFIG_KCSAN_REPORT_ONCE_IN_MS=10000 will collapse any duplicate reports that occur within a ten-second interval. This can greatly reduce the number of duplicates from a given run; for example, on a 90-minute TREE05 rcutorture scenario, using this Kconfig option value reduced the number of KCSAN reports from 6,413 to 3,050.

However, by default rcutorture runs 16 scenarios, ten of which are SMP, thus potentially producing KCSAN diagnostics. It is sometimes useful to collapse the reports from all of these scenarios and summarize them by function names, for example, using a script like the following:

    #!/bin/sh
    grep "BUG: KCSAN: " "$@" | \
            sed -e 's/^\[[^]]*] //' | \
            sort | uniq -c | sort -k1nr

Given the console logs ("$@" above) collected during an rcutorture run in which each scenario ran for 90 minutes, this script reduced 29,312 KCSAN reports to only 56 lines of output. This is a much more manageable number of reports. Of course, much detail is lost, but this detail can be recaptured by searching the console output for the KCSAN reports of interest.

By default, KCSAN operates globally, which means that only a small fraction of the reports will normally pertain to the subsystem at hand. It can be quite difficult to identify which reports pertain to which subsystem, both because of inlining and because a report against a called function may actually reflect the caller's failure to provide proper synchronization. One of the following approaches might be useful:

  • Deduplicate as discussed above, and then look at one report from each of the resulting categories. This works well, though it can miss cases where a pair of functions have data races on more than one variable.
  • Deduplicate as discussed above, but look only at reports from categories containing a function defined in the subsystem at hand. This works best when there are a large number of reports even after deduplication, but risks missing important reports due to inlining.
  • Decode the stack traces (for example, using scripts/decode_stacktrace.sh or syz-symbolize), and look at any report having a stack trace containing a function defined in the subsystem at hand. This is the most thorough approach, but can require looking into a huge number of reports.
  • If only a specific set of known functions in the subsystem at hand is of interest, KCSAN can be told at runtime to report races only in those functions, using its whitelist feature. The next section discusses this in more detail.

Interacting with KCSAN at runtime

It is possible to control KCSAN at runtime, which can help to respond to changes in workload or debugging aims by tweaking KCSAN's parameters and by controlling which data races are reported.

The file /sys/kernel/debug/kcsan provides the following interface to KCSAN:

  • Reading /sys/kernel/debug/kcsan returns various runtime statistics, such as the number of data races detected.
  • Writing "on" or "off" to /sys/kernel/debug/kcsan allows turning KCSAN on or off, respectively.
  • Writing !some_func_name to /sys/kernel/debug/kcsan adds some_func_name to the report filter list, which (by default) blacklists reporting of data races where either of the top stack frames is a function in the list.
  • Writing either blacklist or whitelist to /sys/kernel/debug/kcsan changes the report filtering behavior. For example, the blacklist feature can be used to silence frequently occurring data races; the whitelist feature can help with reproduction and testing of fixes.

The default configuration parameters are chosen to be conservative, providing overall good performance and race-detection abilities on smaller systems (desktops, workstations, virtual machines). However, large systems, such as servers with more than 64 hardware threads, may require adjustment.

The core parameters that affect KCSAN's overall performance and bug detection ability are exposed as kernel command-line arguments whose defaults can also be changed via the corresponding Kconfig options. All of these arguments and options are related to KCSAN's watchpoint handling.

  • kcsan.skip_watch (CONFIG_KCSAN_SKIP_WATCH): Number of per-CPU memory operations to skip before setting up another watchpoint. Setting up watchpoints more frequently increases the likelihood of observing races. This parameter has the most significant impact on overall system performance and race-detection ability.
  • kcsan.udelay_task (CONFIG_KCSAN_UDELAY_TASK): For tasks, the microsecond delay to stall execution after a watchpoint has been set up. Larger values increase the window in which a race may be observed.
  • kcsan.udelay_interrupt (CONFIG_KCSAN_UDELAY_INTERRUPT): For interrupts, the microsecond delay to stall execution after a watchpoint has been set up. Interrupts have tighter latency requirements, and their delay should generally be smaller than the one chosen for tasks.

On a new system, one may either set the corresponding Kconfig options or pass the parameters on the kernel command line. For example, on a large system with 64 hardware threads, we would recommend starting with kcsan.skip_watch=64000. Then, once the system has booted, the parameter can be tweaked further via /sys/module/kcsan/parameters/skip_watch.

Next up

This part has provided an overview of the basic usage of KCSAN and ideas on how to apply it to find data races. In part 2, we will look deeper into KCSAN and how it can be used on various types of code. It will also cover using KCSAN in looking at other types of problems, beyond just those governed by the LKMM. Some strategies, alternative approaches, and known limitations will be covered as well.

[Update: Part 2 is now available.]

Answers to quick quizzes

Quick quiz 1: Why can't the Linux kernel just use the C11 memory model?

Answer: In many cases, the kernel's requirements cannot easily be cast into the C11 memory model. While it might be conceivable to do so for some parts of the kernel, the engineering effort and the resulting inconsistencies make this proposition unattractive today. Note that the LKMM is still defined at the C-language level and is embedded in the variant of C that the Linux kernel uses today.

Back to quick quiz 1.

Quick quiz 2: What's the difference between "data races" and "race conditions"?

Answer: Race conditions occur if concurrently executing operations result in unexpected system behavior. On the other hand, data races are defined at the C-language level. Data races can manifest as race-condition bugs either through compiler transformations, or if the high-level logic of the code is buggy to begin with (such as failing to acquire a lock).

However, not all race conditions are data races, for example if a racing access is incorrectly marked (how KCSAN can also find these is discussed in part 2). For a moment, imagine that we marked every memory access: at this point, there are no more data races and no more KCSAN complaints. However, we may still have race conditions, many of which, such as failing to acquire the necessary locks, can cause the system to misbehave despite the lack of KCSAN complaints.

Also note that not all uses of "race condition" imply buggy behavior. Many low-level synchronization mechanisms are meant to resolve race conditions; for example, the reads in race conditions arising from unsuccessful sequence-lock reader critical sections will simply be discarded and retried. However, most definitions and uses of "race condition" imply buggy behavior; unless otherwise specified, our use of "race condition" follows this notion.

Back to quick quiz 2.

Acknowledgments

We would like to thank everyone who has given feedback, comments, or otherwise participated in the work discussed in this article. Some notable discussions and feedback resulted from patches to address data races found by KCSAN; in particular, we would like to thank Eric Dumazet and Qian Cai for addressing numerous data races and for their continued feedback, and Linus Torvalds, Ingo Molnar, and Herbert Xu for their helpful and critical feedback. We are very grateful to Blake Matheny for their support of this effort.

Comments (none posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds