Kernel development
Brief items
Kernel release status
The current development kernel is 4.8-rc4, released on August 28. "Everything looks normal, and it's been a bit quieter than rc3 too, so hopefully we're well into the "it's calming down" phase. Although with the usual timing-related fluctuation (different maintainers stagger their pulls differently), it's hard to tell a trend yet."
See this posting for a list of known regressions in the 4.8 kernel.
Stable updates: none have been released in the last week.
Kernel development news
Inside the mind of a Coccinelle programmer
Coccinelle developer Julia Lawall started her keynote at the Linux Security Summit in Toronto by saying that she didn't know much about security. But the tool she developed, and the patches it has generated, have clearly fixed various patterns of bugs in the kernel, some of which have had security implications. In the talk, Lawall related some of what she has learned by using the tool, while also introducing its capabilities to the roughly 90 attendees.
![Julia Lawall](https://static.lwn.net/images/2016/lss-lawall-sm.jpg)
The principle behind Coccinelle is to "find once, fix everywhere". It is a static analyzer to find patterns in C code. So when someone spots a problem in a large code base, Coccinelle can be used to find other examples of the same kind of bug. It can also then apply transformations to the code to create patches to fix those other occurrences.
Semantic patches
Coccinelle is user-scriptable using "semantic patches", which are based on the patch notation familiar to developers. Ideally, she said, there is not much new that someone has to learn to start using the tool. The goal from the outset was to make it accessible to C developers.
Lawall showed a classic example of where Coccinelle can be used. A commit message by Al Viro ("wmi: (!x & y) strikes again") pointed to a recurring problem with tests that look like:

```c
if (!block->flags & ACPI_WMI_METHOD)
```

The proper fix is to parenthesize the expression:

```c
if (!(block->flags & ACPI_WMI_METHOD))
```

The problem is that the logical negation (!) operator binds more tightly than the bitwise AND (&), leading to an incorrect test. Given the commit message, it is clear this problem occurs with some frequency, but it is difficult to use grep to find this kind of pattern. There are too many ! and & characters in the kernel, and the test may stretch over multiple lines, which will also defeat simple searches.
But a simple Coccinelle semantic patch will find (and fix by way of a generated patch) these kinds of problems:
```
@@
expression E;
constant C;
@@

- !E & C
+ !(E & C)
```

Lawall then showed an example of the same problem (spanning two lines) elsewhere in the kernel and how the code was changed to fix it.
History
Coccinelle began when Lawall was on sabbatical in 2004. At that time, the 2.6 kernel had recently been released, but many drivers were still targeting 2.4. She wanted to automate the porting of those drivers to 2.6, but that "turned out to be unrealistic". However, certain changes, which she called "collateral evolution", could be made automatically, such as changes to function calls like adding new parameters.
To solve the problem of collateral evolution, Coccinelle was initially developed by Lawall and three others at the University of Copenhagen between 2005 and 2007. The first patches to the kernel based on Coccinelle output were submitted in 2007; they addressed places where kmalloc() calls were followed by a memset() and replaced those with calls to kzalloc(). Two papers followed, in 2008 and 2011.
Currently, Coccinelle is developed by Lawall and three others at Inria. It is entirely implemented in OCaml, which may reduce the contributor base by a substantial amount, she said to some audience laughter. On the other hand, though, that does reduce the number of patches she has to review.
The tool has had a fairly large impact on the kernel since the first patches were posted in 2007. Over 4,500 patches in the kernel mention Coccinelle; 3,000 of those are from more than 500 developers outside of the Coccinelle project. There are also 56 semantic patches used as tests that are part of the kernel (accessible via make coccicheck). There are more Linux-related semantic patches available at coccinellery.org.
The semantic patches in the kernel are meant to catch various types of errors that can crop up. There are tests for generic C errors (such as testing for unsigned variables for values less than zero or null pointer dereferencing), generic Linux problems (double locks or using the iterator index after a loop), API-specific errors (using free() on a devm allocation), and API modernization (using kmemdup() rather than kmalloc() and memcpy()). Contributions of new semantic patches are welcome; contributors can either make the patch themselves or tell the team about the pattern for it to create the patch.
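As a rough illustration of the API-modernization case (shown as before-and-after C with invented function names, rather than as a semantic patch), this is the shape of the change that a kmemdup() rule generates:

```c
#include <linux/slab.h>
#include <linux/string.h>

/* Before: open-coded allocate-and-copy (names invented for illustration). */
static char *dup_name_old(const char *src, size_t len)
{
        char *dst = kmalloc(len, GFP_KERNEL);

        if (!dst)
                return NULL;
        memcpy(dst, src, len);
        return dst;
}

/* After: the modernization rule replaces the kmalloc()/memcpy() pair. */
static char *dup_name_new(const char *src, size_t len)
{
        return kmemdup(src, len, GFP_KERNEL);
}
```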
Applications of Coccinelle
Lawall then presented some more complex applications of Coccinelle to the kernel. The first was an effort at "devmification"—switching drivers to use the devm framework to manage their memory to avoid memory leaks. A driver's probe function often allocates memory with kzalloc() that then gets freed in the remove function. Switching to devm_kzalloc() means that the devm library will manage the memory instead.
There were three steps to the process; the first was to find the names of the probe and remove functions. For that, a regular expression could have been used, but that "seems unappealing", she said. Pointers to the functions are put into the platform_driver (and similar) structures, so their names can be extracted with a Coccinelle rule. The second step is to find the definition of the probe function and to transform the kzalloc() calls to devm_kzalloc(), while also removing the kfree() calls from any error paths. The last step removes the kfree() call from the remove function.
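A minimal before-and-after sketch of that transformation, using a hypothetical foo driver rather than any of the drivers that were actually converted:

```c
#include <linux/errno.h>
#include <linux/platform_device.h>
#include <linux/slab.h>

/* Invented driver-private structure, for illustration only. */
struct foo_priv {
        int irq;
};

/* Before: memory allocated in probe() must be freed on every error
 * path and again in the remove() function. */
static int foo_probe_old(struct platform_device *pdev)
{
        struct foo_priv *priv = kzalloc(sizeof(*priv), GFP_KERNEL);

        if (!priv)
                return -ENOMEM;
        /* ... on any later error: kfree(priv); return err; ... */
        platform_set_drvdata(pdev, priv);
        return 0;
}

/* After: devm_kzalloc() ties the allocation to the device's lifetime,
 * so the kfree() calls in the error paths and in remove() disappear. */
static int foo_probe_new(struct platform_device *pdev)
{
        struct foo_priv *priv = devm_kzalloc(&pdev->dev, sizeof(*priv),
                                             GFP_KERNEL);

        if (!priv)
                return -ENOMEM;
        platform_set_drvdata(pdev, priv);
        return 0;
}
```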
Lawall worked on this semantic patch in 2012 and ended up submitting 39 patches to the kernel to make that change. Her semantic patch finds over 170 opportunities to make the change, but they can't just be applied blindly; there is a responsibility to look at the patches that are generated, she said. The whole process takes just 30 seconds to run (using GLIMPSE indexing), so it changes the problem from finding the pattern to checking the fixes.
Another application of Coccinelle was to detect calls that can block on a page fault (e.g. copy_*_user(), get_user()) being made under a spinlock. Blocking is not allowed under a spinlock, so those calls should not be made while the lock is held. The semantic patch simply reported when it found one of the candidate functions between the spin_lock() (and variants) and the spin_unlock(). Normally, Coccinelle handles single functions at a time, but iteration can be used to scale it up to look at inter-function and inter-file patterns.
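A hypothetical instance of the pattern being reported might look like this (the function, lock, and variable names are invented):

```c
#include <linux/spinlock.h>
#include <linux/uaccess.h>

static DEFINE_SPINLOCK(foo_lock);
static int foo_value;

static int foo_set_from_user(const int __user *up)
{
        int v;

        spin_lock(&foo_lock);
        /* copy_from_user() may sleep on a page fault, which is not
         * allowed while foo_lock is held; this is what gets reported. */
        if (copy_from_user(&v, up, sizeof(v))) {
                spin_unlock(&foo_lock);
                return -EFAULT;
        }
        foo_value = v;
        spin_unlock(&foo_lock);
        return 0;
}
```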
One confirmed bug has been found with copy_from_user(). The copy_to_user() checking found a problem but there was a "FIXME" comment next to it, which might indicate the developer plans to get back to it at some point, she said. When that was met with laughter, she replied: "I understand your lack of confidence." That checking also found a false positive. She noted that she is not searching for perfection with the semantic patches, just trying to find things of interest, so she is not interested in complicating the semantic patch to eliminate all false positives. The get_user() and put_user() checking has found one occurrence of each, but she has not yet evaluated them.
She also worked on a semantic patch for "constification"—adding the const keyword to some structures in the kernel. She used a four-step process that included two manual steps. She used Coccinelle to identify structures that only contain function pointers. Those are not the only ones that need const, she said, but they are often good candidates. In addition, protecting those structures from getting overwritten is important to avoid code execution in the kernel. From the list of such structures, she will choose one manually then use Coccinelle to change all uses of the type to add const. The last step is to build the kernel using GCC to see if it all works.
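A sketch of what constification means in the code (with invented names): a structure containing only function pointers gains the const qualifier, and every use of the type has to follow suit.

```c
/* Invented ops-style structure containing only function pointers. */
struct foo_ops {
        int  (*open)(void *priv);
        void (*close)(void *priv);
};

static int  foo_open(void *priv)  { return 0; }
static void foo_close(void *priv) { }

/* Before: the table lives in writable memory and could be overwritten. */
static struct foo_ops foo_default_ops = {
        .open  = foo_open,
        .close = foo_close,
};

/* After: const moves the table to read-only data; all pointers to the
 * type (function parameters, structure members, locals) must also
 * become "const struct foo_ops *" for the kernel to build. */
static const struct foo_ops foo_default_ops_ro = {
        .open  = foo_open,
        .close = foo_close,
};
```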
She submitted 115 patches in 2015 for constification. She does not pay attention to whether they actually get merged. None have been turned down that she remembers, but some of them may not have gotten picked up. Detecting structures of all function pointers is slow—she runs it overnight—but adding const to a type is instantaneous.
Lessons learned
Lawall has learned some lessons along the way that she wanted to pass along. The most important one is to start with something simple. The semantic patches in the kernel tree are more complex and are not representative of where to start with the tool. Start with a common case that might produce some false positives; perhaps you get 100 results that lead to finding two bugs. Making the semantic patch more complex is possible, of course, but there is a tradeoff there.
For example, during the devmification process it was hard to figure out how to get rid of the right kfree() in the remove function. She chose a simple solution by just assuming that the kernel developers would use the same name for the variable to be freed that was used in the probe function. It is "completely unsafe, but it works fairly well". The worst case, though, would be false negatives—not removing a kfree() that should have been removed. So she did a test where she used Coccinelle to delete all of the kfree() calls in the remove functions, which turns out to be the same set of files as with her rule, so it "seems probable that the rule is OK" and "simple was good enough".
Semantic patches should be developed incrementally, she said. The devmification rules resulted in 171 files, which might take some time to analyze. In addition, devm frees the memory it manages in a last-in-first-out order, which might cause differences. So she rewrote her semantic patch to avoid any value-returning function calls before the kzalloc() in the probe function or any remove functions that made calls after kfree(). That reduced the search space to 51 files.
Similarly, the checking for blocking functions under a spinlock produced lots of false positives. Conditionals in the code may cause the lock status to be inaccurate in the analysis. To avoid that, she changed the test to look for places where the blocking calls appear after the lock release, then looked closely at places where that wasn't true.
The last lesson she imparted was to exploit information from other tools. The constification effort used GCC to determine whether the change would build. But, ideally, there would be ways to warn users about potentially difficult cases. In order to do that, you might want to gather information about the structures (are some instances already const, which might indicate that making them all const will work?) and about the kernel build with regard to the structure (are all instances of the structure in files that are built with the current configuration?). Those would indicate structures that might be easier or harder to change and verify, which helps prioritize which ones to tackle. That kind of information can be gathered by Python or OCaml programs to help users home in on good candidates.
In conclusion she said that the tool is useful for a number of different kinds of problems. Coccinelle "compromises on both soundness and completeness, but it is still useful in practice". Developers would be well-served by adding it to their arsenal.
[I would like to thank the Linux Foundation for travel support to attend the Linux Security Summit in Toronto.]
Atomic usage patterns in the kernel
The Linux kernel uses a large variety of "atomic" operations — operations that are indivisible as observed from anywhere within the system — to provide safe and efficient behavior in a multi-threaded environment. A recent article explained why a new suite of atomic primitives was added but, as reader "magnus" observed, that article didn't provide any context for how these, or any other atomic operations, actually get used. The new operations are hardly used at all as yet, so we can only guess how useful they might be. More mature operations are in wide use, and while cataloging every distinct use case would be unhelpfully tedious, finding a few common patterns can aid understanding. To this end, I went searching through the Linux kernel code to find out how different atomic operations are used and to look for examples that might shed light on the possible usefulness of the new operations.
Simple flags
In general, atomic operations are only needed when multiple threads might update the same datum in potentially different ways. However, one of my early exposures to the importance of atomics involved values that weren't shared, or were only updated under a spinlock. The RPC (remote procedure call) layer of the NFS server uses a number of state flags, some of which were represented at that time as unsigned char while others were single-bit fields like:
```c
unsigned int    sk_temp : 1,    /* temp socket */
```
The code that the C compiler generates to update fields like this one will read and then write a whole machine word. It is natural to use a lock to ensure that multiple read-modify-write operations do not happen in parallel but, if different fields in the same machine word are protected by different locks, these whole-word updates can easily interfere with each other. This code was originally safe because an extra spinlock was used just to protect these flags from each other. A significant rewrite of that code removed the extra spinlock and instead used set_bit() and clear_bit() to atomically update individual bits within a word.
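A minimal sketch of the resulting style, with a flags word and bit numbers invented for illustration:

```c
#include <linux/bitops.h>

/* Bit numbers within a single unsigned long of flags (invented names). */
enum {
        SK_TEMP,        /* temp socket */
        SK_BUSY,
};

static unsigned long sk_flags;

static void mark_temp(void)
{
        /* set_bit() updates exactly one bit atomically, so a concurrent
         * clear_bit(SK_BUSY, &sk_flags) on another CPU cannot be lost. */
        set_bit(SK_TEMP, &sk_flags);
}

static void unmark_temp(void)
{
        clear_bit(SK_TEMP, &sk_flags);
}
```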
So the first purpose of using atomic operations is isolation: ensuring that an update to one value doesn't disturb updates to neighboring, but unrelated, values. A write to an aligned value of the basic machine word size, typically 32 bits, can be assumed not to interfere with writes to neighboring values. Values with a smaller size, which might be updated concurrently with small neighbors, need some sort of protection, either with locking or with atomic operations.
Counters
Of the many situations where concurrent updates to a single value can be expected, counters are probably the simplest. Linux uses atomic counters to count various things including IO errors, dropped packets, CPU cycles used, and various other simple statistics. Atomic operations are not the most efficient way to collect statistics, so many are instead collected in separate per-CPU counters that are only summed when needed, as is done, for example, in the block layer. Atomics are certainly easier, though, and when the events being counted are not too frequent they are a good choice.
Counting the number of some resource that is currently in use (or currently available) is a common use of atomics, and these use cases need to pay careful attention to what happens when the relevant limit is reached. jbd2, the journaling layer for ext4, tracks the blocks committed to the next transaction (outstanding_credits) with an atomic counter and, when the commitment exceeds a limit, it simply subtracts off the last number and waits for space to become available. This means that the counter transiently exceeds the maximum, which presumably is not problematic.
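Roughly, that approach looks like this sketch (the names, limit, and wait mechanism are invented; this is not the actual jbd2 code):

```c
#include <linux/atomic.h>
#include <linux/wait.h>

#define MAX_OUTSTANDING 128

static atomic_t outstanding;    /* credits committed to the transaction */
static DECLARE_WAIT_QUEUE_HEAD(credit_wait);

static void reserve_credits(int nr)
{
        /* Add first; if that pushed the counter past the limit it
         * transiently exceeds the maximum, so subtract the request
         * back off and wait for space to become available. */
        while (atomic_add_return(nr, &outstanding) > MAX_OUTSTANDING) {
                atomic_sub(nr, &outstanding);
                wait_event(credit_wait,
                           atomic_read(&outstanding) + nr <= MAX_OUTSTANDING);
        }
}

static void release_credits(int nr)
{
        atomic_sub(nr, &outstanding);
        wake_up(&credit_wait);
}
```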
The XFS filesystem also uses an atomic counter to track space commitment in the log, though in quite a different way. In this case, the counter is really a position in the log, and when it reaches the end it must atomically wrap around to the beginning. The "check for an overflow and adjust" approach used in jbd2 won't do here, so an atomic "compare-exchange" loop is used to ensure only valid values are ever visible.
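The general shape of such a compare-exchange loop, in an invented example rather than the actual XFS code, is:

```c
#include <linux/atomic.h>

#define LOG_SIZE 1024

static atomic_t log_pos;

/* Advance a position in a circular log by nr, wrapping back to the
 * start; because the new value is computed before the atomic_cmpxchg(),
 * no out-of-range value is ever visible to other threads. */
static int advance_log_pos(int nr)
{
        int old, new;

        do {
                old = atomic_read(&log_pos);
                new = old + nr;
                if (new >= LOG_SIZE)
                        new -= LOG_SIZE;
        } while (atomic_cmpxchg(&log_pos, old, new) != old);

        return old;
}
```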
Thus a "reliable counter" is the second common use for atomic operations. It can gather statistics or monitor resource usage and sometimes impose hard limits. A common form of counter not mentioned yet is the reference counter. While these are "reliable counters", they have an important role in resource ownership, and so fit best a little later in our story.
Exclusive ownership
Exclusive ownership of something is a frequent pattern in the Linux kernel, where the "something" might be a data structure, might be some specific part of, or access to, a data structure, or might be a more abstract resource. Often spinlocks or mutexes will be used to gain exclusive ownership but, in cases where there is no desire to wait if the resource is not immediately available, an atomic operation that reports whether or not ownership was obtained is usually the preferred approach.
Possibly the simplest way to gain exclusive ownership is with test_and_set_bit_lock().
```c
if (!test_and_set_bit_lock(BIT_NUM, &bitmap)) {
        /* make use of exclusive ownership here */
        clear_bit_unlock(BIT_NUM, &bitmap);
} else
        /* try some other approach */
```
If the bit is clear and multiple threads run this code at the same time, only one will see that the bit wasn't set, and will have successfully set it. All the others will see the bit already set and will know they didn't set it and so did not gain ownership.
The _lock suffix, and the _unlock suffix in clear_bit_unlock(), are sometimes important and probably aren't used as much as they should be. test_and_set_bit_lock() and clear_bit_unlock() are variants of the unadorned test_and_set_bit() and clear_bit() functions; they should be used when resource ownership is being claimed as they bring both social and technical benefits. On the social side, they serve as useful documentation for the intent of the code. Not all test_and_set_bit() calls claim ownership; some only need the isolation properties of bit operations. Similarly, not all clear_bit() calls release ownership. Making the intention clear to the reader can be extremely valuable.
The technical value relates to the fact that the C compiler and the CPU hardware are permitted some flexibility in re-ordering memory accesses, providing that they don't change the behavior of a single-threaded program. In the absence of any "barriers" to restrain this flexibility, reads from memory that are textually between the "set bit" and the "clear bit" could be performed before the bit is set, and writes to memory could get delayed until after the bit is cleared. This re-ordering could allow one thread to see data that another thread is still manipulating, which is clearly undesired.
Without the locking suffixes, test_and_set_bit() actually provides a full two-way barrier so that no reads or writes can be moved from one side to the other. This is a stronger guarantee than needed, so test_and_set_bit_lock() can be used to provide just a one-way barrier to reads, preventing them from being performed before the bit is set. This is referred to as an "acquire" semantic due to its use in gaining ownership of something. Conversely, clear_bit() provides no barriers at all, so clear_bit_unlock() is needed for correctness. It provides a "release" semantic: a one-way barrier to write requests that ensures that all writes that took place before the clearing of the bit will be visible before the cleared bit itself is visible.
The different barrier behavior exhibited by the unadorned operations is explained by the rule that atomic operations that return a value (such as test_and_set_bit()) generally impose full memory barriers, while atomic operations that don't return a value (clear_bit(), for example) generally impose no barriers at all.
This need to be careful about memory barriers is one of the costs of using atomics rather than the safer (but slower) spinlocks. It used to be worse though. Prior to 2007 there was no clear_bit_unlock(), so explicit barriers such as smp_mb__after_clear_bit() would need to be carefully placed to avoid races. Thankfully, that delightfully named function has long been deprecated and barriers are now usually integrated with the operation they protect. Many atomic operations are available with a variety of suffixes that indicate different ordering semantics, including _acquire, _release, _relaxed (which provides no ordering guarantees at all), and _mb (which gives a full memory barrier). As a general rule, these interfaces should probably be avoided unless you really know what you are doing. Even people who do know what they are doing can have long conversations about the issues.
One pleasing example of using bit operations for exclusive access is the cmd_alloc() function in the cciss disk array driver. The driver maintains a pool of "commands" and a bitmap showing which commands are in use. It uses find_first_zero_bit() to find an available command, and then test_and_set_bit() to claim it. On failure, it just goes back to choose another command from the pool. This code could be made a little more readable by using test_and_set_bit_lock() and clear_bit_unlock(), but there is no reason to think the current code is not safe.
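The overall shape of that allocation loop, as a simplified sketch rather than the driver's actual code, is:

```c
#include <linux/bitops.h>

#define NR_CMDS 64

static unsigned long cmd_in_use[BITS_TO_LONGS(NR_CMDS)];

/* Find a clear bit, then try to claim it atomically; if another thread
 * raced in and claimed it first, go around and pick another one. */
static int alloc_cmd(void)
{
        int i;

        do {
                i = find_first_zero_bit(cmd_in_use, NR_CMDS);
                if (i == NR_CMDS)
                        return -1;      /* pool exhausted */
        } while (test_and_set_bit(i, cmd_in_use));

        return i;
}

static void free_cmd(int i)
{
        clear_bit(i, cmd_in_use);
}
```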
I chose that example to contrast it with pasemi_alloc_rx_chan(), which performs similar allocations from a pool, but with some differences: the bitmap identifies free resources, find_first_bit() is used to find one, and test_and_clear_bit() is used to claim it. There is no test_and_clear_bit_lock() or set_bit_unlock() so this code cannot self-document the use of the bits for locking, and we must hope there is no room for races around the set_bit() that releases the lock.
Counters and pointers for exclusive ownership
One of the surprises I found while exploring the kernel was the number of drivers that used an atomic_t counter purely to gain exclusive access. These drivers, such as dasd, a storage driver for IBM s390 systems, use atomic_cmpxchg() much like test_and_set_bit() so, for example:
```c
if (atomic_cmpxchg(&device->tasklet_scheduled, 0, 1) != 0)
        return;
```
will only continue to subsequent code if exclusive access was gained to whatever tasklet_scheduled protects. One reason that might justify this unusual construct is that the smallest bitmap that can be used with test_and_set_bit() and related functions is a single unsigned long integer, which is eight bytes on 64-bit architectures. In contrast, the value that atomic_cmpxchg() operates on is an atomic_t, which is only four bytes. Whether this small space saving justifies that non-standard code is a question we must leave to the individual developers to consider.
This space saving is one of the possible benefits of the new atomic_fetch*() operations. atomic_fetch_or() can be used to test and set any arbitrary bit in an atomic_t, and this is the use case for two of the three call sites currently in the kernel. When exclusive ownership is being requested, it would be better to use atomic_fetch_or_acquire() to document that intent. This currently produces identical code, but that may change.
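A sketch of atomic_fetch_or() used as a test-and-set within an atomic_t (the flag name is invented for illustration):

```c
#include <linux/atomic.h>
#include <linux/types.h>

#define FOO_OWNED 1     /* bit 0 of the state word */

static atomic_t foo_state;

/* atomic_fetch_or() returns the value that was there before the OR, so
 * if the bit was already set, some other thread owns the resource. */
static bool foo_try_claim(void)
{
        return !(atomic_fetch_or(FOO_OWNED, &foo_state) & FOO_OWNED);
}

static void foo_release(void)
{
        atomic_andnot(FOO_OWNED, &foo_state);
}
```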
A variation of exclusive ownership that requires a counter can be seen when the counter is used to identify a state in a state machine. A transition from one state to the next may require some particular action to be performed by precisely one thread. This pattern can be seen in quite a few network device drivers, though the first I came across was in the core code for Firewire devices. A Firewire device can transition from "initializing" to "running" to "gone" to "shutdown". Most of these transitions use atomic_cmpxchg() to avoid races and to detect which thread first made a particular transition. If the test:
```c
if (atomic_cmpxchg(&device->state,
                   FW_DEVICE_RUNNING,
                   FW_DEVICE_INITIALIZING) == FW_DEVICE_RUNNING) {
```
succeeds, then some extra work, which is needed when the device first starts running, can be performed.
A particularly common case that uses counters for a form of exclusive ownership is the various providers of unique serial numbers. dm_next_uevent_seq() is just one example of more than a dozen that I found before losing interest. These atomically increment a value and return that value for local use. The caller is certain to be the only caller to be given that particular value, so it can be seen as having exclusive ownership of it. Of the ten places Davidlohr Bueso found to use the newly introduced atomic_fetch_inc(), seven were for unique serial numbers where there was a desire for the first number to be zero; atomic_inc_return() naturally starts by returning one. A similar simplification could be achieved by initializing the atomic counter to -1.
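A sketch of both approaches (the counter names are invented; this is not the dm_next_uevent_seq() code):

```c
#include <linux/atomic.h>

static atomic_t seq_a;                          /* starts at zero */
static atomic_t seq_b = ATOMIC_INIT(-1);        /* starts at -1 */

/* atomic_fetch_inc() returns the pre-increment value: 0, 1, 2, ... */
static unsigned int next_seq_a(void)
{
        return atomic_fetch_inc(&seq_a);
}

/* atomic_inc_return() returns the post-increment value, so the counter
 * is initialized to -1 to make the first number handed out be zero. */
static unsigned int next_seq_b(void)
{
        return atomic_inc_return(&seq_b);
}
```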
Atomic pointer updates are sometimes used to gain exclusive ownership, though with a slightly different understanding of the ownership concept. IPv6 supports a number of sub-protocols such as TCP, UDP, ICMP, etc. Handlers for these individual protocols can register themselves using inet6_add_protocol(), which uses the cmpxchg() atomic function to install the provided handler into the table of protocol pointers. If a protocol was already registered, this fails. If not, the caller gains ownership of that particular protocol number and can continue as the registered handler.
A similar cmpxchg() to atomically replace a NULL with a pointer occurs in multiple places in the kernel such as in the tty_audit code and in the Btrfs raid56 code. Both of these install a newly allocated and initialized data structure for multiple threads to access. In the unlikely event that two threads find they need to create it at the same time, they might both prepare the structure but only one will successfully install it. The other must discard its structure as wasted effort. Here exclusive access is gained for "the right to initialize", which may seem to be a slightly contorted way to look at things, but does allow the formation of a uniform pattern.
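In rough form (with invented names, and the usual memory-ordering and lifetime details omitted), the "right to initialize" pattern looks like:

```c
#include <linux/atomic.h>
#include <linux/slab.h>

struct foo_ctx {
        int data;
};

static struct foo_ctx *global_ctx;      /* starts out NULL */

/* Several threads may race to create the shared structure, but only the
 * one whose cmpxchg() replaces the NULL wins; the losers free their
 * copy as wasted effort and use the winner's instead. */
static struct foo_ctx *foo_get_ctx(void)
{
        struct foo_ctx *new, *old;

        if (global_ctx)
                return global_ctx;

        new = kzalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
                return NULL;

        old = cmpxchg(&global_ctx, NULL, new);
        if (old) {
                kfree(new);     /* somebody else got there first */
                return old;
        }
        return new;
}
```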
Shared ownership — for another day
The obvious sequel to exclusive ownership is the shared ownership provided by reference counters. This topic has been covered before, so there is little point in more than a cursory review. However, a careful examination of one of the patterns found previously opens up a whole new collection of patterns and provides a hint at another possible use of the new atomic_fetch_*() operations. These topics will be covered in the companion article to this one.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Filesystems and block I/O
Memory management
Networking
Security-related
Page editor: Jonathan Corbet