
Atomic patterns 2: coupled atomics

September 7, 2016

This article was contributed by Neil Brown

Our recent survey of the use of atomic operations in the Linux kernel covered the use of simple flags and counters, along with various approaches to gaining exclusive access to some resource or other. On reaching the topic of shared access we took a break, in part because reference counts, which are the tool for managing shared access, have been covered before. Much of that earlier content requires no more than a brief recap, but the use of biases, then described as an anti-pattern, is worthy of further examination as it is a stepping stone toward understanding a range of other patterns for the use of atomics in the kernel.

Recap: three styles of reference counters

I previously identified three styles of reference counters used in Linux; my recent explorations have found no reason to adjust that list. The distinction between the three involves what happens when the count reaches zero.

When a "plain" reference count reaches zero, nothing particular happens beyond the obvious. Some code somewhere might occasionally check if the counter is zero and behave differently if it is, but the moment of transition from non-zero to zero has no significance. A good example is child_count used by the runtime power management code. This allows a "child" device to hold a reference on its parent to keep it active. Unless it has been configured to ignore_children, the parent will be kept active as long as any child still holds a reference.
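In outline, the plain style amounts to nothing more than this (the names here are invented for illustration):

    static atomic_t child_count = ATOMIC_INIT(0);

    static void child_get(void) { atomic_inc(&child_count); }
    static void child_put(void) { atomic_dec(&child_count); }

    static bool parent_may_sleep(void)
    {
	    /* The count is only ever inspected; the transition
	     * to zero itself triggers no action at all. */
	    return atomic_read(&child_count) == 0;
    }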

When a "kref" reference count reaches zero, some finalization operation happens on the object; typically it is freed. Code requiring that pattern should use the struct kref data type, though an atomic_t counter and atomic_dec_and_test() can be used if there is a good reason to avoid kref.
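Again in outline, for a hypothetical foo object, the kref pattern looks like this:

    struct foo {
	    struct kref ref;
	    /* ... the object's real payload ... */
    };

    static void foo_release(struct kref *kref)
    {
	    struct foo *foo = container_of(kref, struct foo, ref);

	    kfree(foo);
    }

    static void foo_put(struct foo *foo)
    {
	    /* foo_release() is called if, and only if, this was
	     * the last reference to be dropped. */
	    kref_put(&foo->ref, foo_release);
    }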

Finally, the "kcref" counter is not allowed to reach zero unless a lock is held. Code implementing this pattern can use atomic_dec_and_lock(), which takes a spinlock only if it is likely to be needed. A more general approach that can work with any sort of lock is to have a fast path that uses atomic_add_unless() to decrement the counter as long as its value is not one. If this fails, the lock can be taken and an atomic_dec_and_test() or similar can be used. hw_perf_event_destroy() in the perf code displays this quite nicely.
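The shape of that fast-path/slow-path combination, with invented names standing in for the perf-specific details, is roughly:

    static atomic_t refcount = ATOMIC_INIT(0);
    static DEFINE_MUTEX(teardown_mutex);

    static void obj_put(void)
    {
	    /* Fast path: decrement, unless this is the last reference. */
	    if (atomic_add_unless(&refcount, -1, 1))
		    return;

	    /* Slow path: the final decrement happens under the lock, so
	     * teardown cannot race with a new reference being taken
	     * under that same lock. */
	    mutex_lock(&teardown_mutex);
	    if (atomic_dec_and_test(&refcount))
		    teardown();		/* hypothetical cleanup */
	    mutex_unlock(&teardown_mutex);
    }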

Counter bias: multiple values in the one atomic

A number of reference counters in Linux (e.g. in procfs and kernfs) have a "bias" added to the value: a constant larger than the normal range of the counter, whose presence or absence can easily be detected even as the counter itself moves up or down. This allows a boolean value to be stored in the same variable as the counter. I previously described this as an anti-pattern; a proper solution would instead use a separate variable (or bit in a bitmap) to store the boolean value. When the counter and the boolean are changed independently, I stand by that assessment, but sometimes there is value in being able to control them together in a single operation.
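A minimal sketch of the idea, with an invented bias value, shows how little machinery is involved:

    #define FLAG_BIAS 0x40000000	/* far above any plausible count */

    static atomic_t count_and_flag = ATOMIC_INIT(0);

    static void flag_set(void)   { atomic_add(FLAG_BIAS, &count_and_flag); }
    static void flag_clear(void) { atomic_sub(FLAG_BIAS, &count_and_flag); }

    static bool flag_test(void)
    {
	    /* The bias dominates any legitimate counter value, so the
	     * flag can be read while the count moves underneath it. */
	    return atomic_read(&count_and_flag) >= FLAG_BIAS;
    }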

A particularly simple example is found in the function-tracing (ftrace) code for the SuperH architecture. The nmi_running counter sometimes has its most significant bit set, effectively using a bias of 2^31. This flag, which is used to provide synchronization between ftrace and non-maskable interrupt handlers, may be cleared at any time, but may only be set when the value of the counter is zero. Normally, when there is a need to synchronize the change in one value with some other value, it is simplest to hold a spinlock whenever either value is changed — but that is not necessarily the fastest way. If the two values of interest can be stored in the same machine word, then an atomic compare-exchange operation, often in a loop to handle races, can achieve the same end more efficiently.
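Both halves of that arrangement can be seen, slightly simplified, in arch/sh/kernel/ftrace.c: setting the flag uses a compare-exchange against zero, so it can only succeed while the count is zero, while clearing it uses a compare-exchange loop that preserves whatever count happens to be present:

    #define MOD_CODE_WRITE_FLAG (1 << 31)
    static atomic_t nmi_running = ATOMIC_INIT(0);

    static void set_mod_flag(void)
    {
	    /* Succeeds only if the flag and the count are both zero. */
	    while (atomic_cmpxchg(&nmi_running, 0, MOD_CODE_WRITE_FLAG) != 0)
		    cpu_relax();
    }

    static void clear_mod_flag(void)
    {
	    int old = atomic_read(&nmi_running);

	    for (;;) {
		    int new = old & ~MOD_CODE_WRITE_FLAG;

		    if (old == new)
			    break;	/* flag (now) clear: done */

		    /* Retry with the current value if we raced. */
		    old = atomic_cmpxchg(&nmi_running, old, new);
	    }
    }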

Having identified this pattern of two values being managed with a single atomic operation, we need a name for it; "coupled atomics" seems a good choice, as the interdependence between the two values could be seen as a coupling. Other examples of this pattern are easy to find. The "lockref" type that was introduced in Linux 3.12 follows exactly this pattern, storing a 32-bit spinlock and a 32-bit reference count in a single 64-bit word that, on several popular architectures, can be updated atomically. Even the 32-bit spinlock itself is sometimes a multi-part atomic, as is the case for both ticket spinlocks and MCS locks.
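The heart of lockref is its layout; slightly simplified from <linux/lockref.h>, it is just a union that lets the lock and the count be updated either separately or as one 64-bit quantity:

    struct lockref {
	    union {
		    aligned_u64 lock_count;	/* both fields as one word */
		    struct {
			    spinlock_t lock;
			    int count;
		    };
	    };
    };

The fast paths in lockref_get() and friends do a 64-bit compare-exchange on lock_count, proceeding only when the snapshot they captured shows the embedded spinlock as unlocked; if that keeps failing, the code falls back to actually taking the lock.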

The previous article mentioned two uses for the new atomic_fetch*() operations; we can now add a third. This one involves an atomic_t variable that contains a counter and a couple of flags, only this time the flags are in the least significant bits and the counter is in the higher-order bits. This atomic_t is used to implement a queued reader/writer spinlock. The flags record whether a write lock is held, or whether a writer is waiting for the lock. The counter, which is incremented by adding 256 (using the defined name _QR_BIAS), records the number of active readers. A new reader attempts to get a read lock by using an atomic operation to add _QR_BIAS, then checking whether either of the flags was set in the result. If one was, the read lock was not acquired; the failed reader subtracts the bias and tries again. Interestingly, the fast-path code uses atomic_add_return(), while the slow-path code uses the new atomic_fetch_add_acquire(). Either is quite suitable for the task, but a little more consistency would be nice.
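Stripped to its essentials, the reader fast path looks something like this (the constants are as defined in the qrwlock code; the function is a sketch):

    #define _QW_WAITING	1		/* a writer is waiting */
    #define _QW_LOCKED	0xff		/* a writer holds the lock */
    #define _QW_WMASK	0xff		/* either of the above */
    #define _QR_BIAS	(1U << 8)	/* one reader == 256 */

    static bool read_trylock_sketch(atomic_t *cnts)
    {
	    /* Optimistically count ourselves as a reader... */
	    u32 val = atomic_add_return(_QR_BIAS, cnts);

	    if (!(val & _QW_WMASK))
		    return true;	/* no writer: read lock acquired */

	    /* ...otherwise back out and fall into the slow path. */
	    atomic_sub(_QR_BIAS, cnts);
	    return false;
    }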

Another example is the helpfully named combined_event_count counter in the system suspend code. This variable stores two counters: the number of in-progress wakeup events and the total number of completed wakeup events. When the in-progress counter is decremented, the total needs to be incremented; by combining the two counters in the one atomic value, the two can be updated in a single race-free operation.
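The update is a pleasing trick; the constant names below are from drivers/base/power/wakeup.c, though the wrapper functions are invented for illustration:

    /* In-progress count in the low half, total count in the high half. */
    #define IN_PROGRESS_BITS	(sizeof(int) * 4)
    #define MAX_IN_PROGRESS	((1 << IN_PROGRESS_BITS) - 1)

    static atomic_t combined_event_count = ATOMIC_INIT(0);

    static void event_start(void)
    {
	    atomic_inc(&combined_event_count);	/* in-progress++ */
    }

    static void event_end(void)
    {
	    /* Adding (1 << 16) - 1 borrows one from the in-progress
	     * half and carries one into the total half: a decrement
	     * and an increment in a single race-free operation. */
	    atomic_add(MAX_IN_PROGRESS, &combined_event_count);
    }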

More coupled atomics, big and small

The examples so far could be seen as mid-range, each combining a counter with some other modestly sized value, typically another counter or a flag, into the one atomic value. To finish off, we will look at the two extremes in size, the largest and the smallest.

Most atomics are 32 bits in size, though 64-bit values, whether pointers manipulated with cmpxchg() or the atomic_long_t type, are not exactly uncommon. What is uncommon are 128-bit atomic types: they are limited to three architectures (arm64, x86_64, and s390) and to a small number of users, mainly the SLUB memory allocator.

SLUB owns several fields in the page description structure: a pointer to a list of free space, some counters of allocated objects, and a "frozen" flag. Sometimes it wants to access or update several of these atomically. On a 32-bit host, these values all fit inside a 64-bit value. On a 64-bit machine, they don't, so a larger operation is needed; cmpxchg_double() is available on the listed architectures to allow this. It is given two pointers to 64-bit memory locations that must be consecutive, two values for comparison, and two values for replacement. Unlike the single-word cmpxchg() that always returns the value that was fetched, cmpxchg_double() returns a success status, rather than trying to squeeze 128 bits into the return value.
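Its use in SLUB then looks roughly like this (a simplified sketch; the real code in mm/slub.c also provides the fallback paths):

    static bool update_slab_sketch(struct page *page,
		    void *freelist_old, unsigned long counters_old,
		    void *freelist_new, unsigned long counters_new)
    {
	    /* freelist and counters are adjacent 64-bit fields in
	     * struct page; both are compared, and both replaced, as
	     * a single 128-bit atomic operation.  The return value
	     * is a simple success indication. */
	    return cmpxchg_double(&page->freelist, &page->counters,
				  freelist_old, counters_old,
				  freelist_new, counters_new);
    }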

On 64-bit architectures without this 128-bit atomic option, SLUB will use a spinlock to gain the required exclusive access — effective, but not quite as fast. cmpxchg_double() seems to me to be an eloquent example of the lengths some kernel developers will go to in order to squeeze out that last drop of performance.

The other extreme in size is to combine two of the smallest possible data types into a single atomic: two bits. A simple example in the xen events_fifo code clears one bit, EVTCHN_FIFO_MASKED, but only when the other bit, EVTCHN_FIFO_BUSY, is also clear. Manipulating multiple bits at once is another place where the new atomic_fetch*() operations could be useful. They do not support any dependency between bits as we see in the xen example, but they could, for example, clear a collection of bits atomically and report which bits were cleared, by using atomic_fetch_and(). Similarly, if an atomic_t contained a counter in some of the bits, that counter could be extracted and zeroed without affecting other accesses. Whether these are actually useful I cannot say, as there are no concrete examples to refer to. But the pattern of multiple values in the one atomic_t does seem to raise more possible uses for these new operations.
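The clearing function shows how the dependency between the two bits is enforced: the compare-exchange is attempted against a value with BUSY forced clear, so it simply keeps failing, and retrying, until BUSY really is clear. Lightly annotated from drivers/xen/events/events_fifo.c:

    static void clear_masked(volatile event_word_t *word)
    {
	    event_word_t new, old, w;	/* event_word_t is a 32-bit word */

	    w = *word;

	    do {
		    /* "old" is the value last seen, with BUSY cleared;
		     * if BUSY is in fact set, the exchange fails and
		     * we go around again with the fresh value. */
		    old = w & ~(1 << EVTCHN_FIFO_BUSY);
		    new = old & ~(1 << EVTCHN_FIFO_MASKED);
		    w = sync_cmpxchg(word, old, new);
	    } while (w != old);
    }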

Both a strength and a weakness

Having found these various patterns, several of which I did not expect, the overall impression I am left with is of the tension between the strength and the weakness of C for implementing them. On the one hand, C, together with the GCC extensions for inline assembly language code, provides easy access to low-level details that make it possible to implement the various atomic accesses in the most efficient way possible. On the other hand, the lack of a rich type system means that we tend to use the one type, atomic_t, for a wide range of different use cases. Some improvements might be possible there, as we saw with the introduction of the kref type, but I'm not sure how far we could take that. I contemplate the cmpxchg_double() usage in SLUB and wonder what sort of high-level language construct would make that more transparent and easy to read, yet keep it as performant on all hardware as it currently is. It certainly would be nice if some of these patterns were more explicit in the code, rather than requiring careful analysis to find.




Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds