Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.24-rc6, released by Linus on December 20. "The regression list keeps shrinking, so we're still on track for a full 2.6.24 release in early January. Assuming we don't all overeat during the holidays and nobody gets any work done. But we all know that the holidays are really the time when we get away from the boring 'real work', and can spend 24/7 on kernel hacking instead, right?" The long-form changelog has the details.
The current -mm tree is 2.6.24-rc6-mm1. Recent changes to -mm (beyond failing to work on i386 systems) include a bunch of low-level driver model changes, some tmpfs reworking, some ext4 updates, and the beginning of the removal of the fastcall function attribute.
For older kernels: 2.4.36 was released on January 1 with a number of fixes.
For crazy people: 0.01 was released by Abdel Benamrouche on January 1.
Kernel development news
Some snags for SLUB
The SLUB allocator is a new implementation of the kernel's low-level object allocator, intended as a replacement for the long-lived slab allocator. SLUB was merged for 2.6.22 and made the default allocator for 2.6.23. The long-term plan has always been for SLUB to eventually displace the older slab allocator entirely. That may yet happen, but SLUB has run into a couple of difficulties on its way toward being the one true kernel memory allocator.

The first problem had to do with performance regressions in a few specific situations. It turns out that the hackbench benchmark, which measures scheduler performance, runs slower when the SLUB allocator is in use; in fact, SLUB can cut performance for that benchmark in half, which is enough to raise plenty of eyebrows. This result was widely reproduced; there were also reports of regressions with the proprietary TPC-C benchmark which were not easily reproduced. In both cases, SLUB developer Christoph Lameter was seen as being overly slow in getting a fix out; after all, it is only natural to expect immediate turnaround on benchmark regressions over the end-of-year holiday period.
When Christoph got back to this problem, he posted a lengthy analysis which asserted that the real scope of the problem was quite small. He concluded: "given all the boundaries for the contention I would think that it is not worth addressing." This was not the answer Linus was looking for.
About this time, the solution to this problem came along in response to a note from Pekka Enberg pointing out that, according to the profiles, an internal SLUB function called add_partial() was accounting for much of the time used. The SLUB allocator works by dividing pages into objects of the same size, with no metadata of its own within those pages. When all objects from a page have been allocated, SLUB forgets about the page altogether. But when one of those objects is freed, SLUB must note the page as a "partial" page and add it to its queue of available memory. This addition of partial pages, it seems, was happening far more often than it should.
The hackbench tool works by passing data quickly between CPUs and measuring how the scheduler responds. In the process, it forces a lot of quick allocation and free operations and that, in turn, was causing the creation of a lot of partial pages. The specific problem was that, when a partial page was created, it was added to the head of the list, meaning that the next allocation operation would allocate the single object available on that page and cause the partial page to become full again. So SLUB would forget about it. When the next free happened, the cycle would happen all over again.
Once Christoph figured this out, the fix was a simple one-liner: partial pages should be added to the tail of the list instead of the head. That would give the page time to accumulate more free objects before it was once again the source for allocations and minimize the number of additions and removals of partial pages. The results came back quickly: the hackbench regression was fixed. There have been no TPC-C results posted (the license for this benchmark suite is not friendly toward the posting of results), but it is expected that the TPC-C regression should be fixed as well.
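For illustration, a sketch of that kind of change (not the actual SLUB patch; the field names follow SLUB's per-node bookkeeping structure) might look like this:

    /*
     * Illustrative sketch only, not the actual SLUB patch: queue a
     * newly-partial page at the tail of the per-node partial list so
     * that it can accumulate more free objects before it is chosen
     * for allocations again.
     */
    static void add_partial_tail(struct kmem_cache_node *n, struct page *page)
    {
        spin_lock(&n->list_lock);
        n->nr_partial++;
        list_add_tail(&page->lru, &n->partial);  /* was list_add(), which queued at the head */
        spin_unlock(&n->list_lock);
    }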
Meanwhile, another, somewhat belated complaint about SLUB made the rounds: there is no equivalent to /proc/slabinfo for the SLUB allocator. The slabinfo file can be a highly effective tool for figuring out where kernel-allocated memory is going; it is a quick and effective view of current allocation patterns. The associated slabtop tool makes the information even more accessible. The failure of slabtop to work when SLUB is used has been an irritant for some developers for a while; it seems likely that more people will complain when SLUB finds its way into the stock distributor kernels. Linux users are generally asking for more information about how the kernel is working; removing a useful source of that information is unlikely to make them happy.
Some developers went as far as to say that the slabinfo file is part of the user-space ABI and, thus, must be preserved indefinitely. It is hard to say how such an interface could truly be cast in stone, though; it is a fairly direct view into kernel internals which will change quickly over time. So the ABI argument probably will not get too far, but the need for the ability to query kernel memory allocation patterns remains.
There are two solutions to this problem in the works. The first is Pekka Enberg's slabinfo replacement patch for SLUB, which provides enough information to make slabtop work. But the real source for this information in the future will be the rather impressive set of files found in /sys/slab. Digging through that directory by hand is not really recommended, especially given that there's a better way: the slabinfo.c file found in the kernel source (under Documentation/vm) can be compiled into a tool which provides concise and useful information about current slab usage. Eventually distributors will start shipping this tool (it should probably find a home in the util-linux collection); for now, building it from the kernel source is the way to go.
The final remaining problem here has taken a familiar form: the dreaded message from Al Viro on how the lifecycle rules for the files in /sys/slab are all wrong. It turns out that even a developer like Christoph, who can hack core memory management code and make 4096-processor systems hum, has a hard time with sysfs. As does just about everybody else who works with that code. There are patches around to rationalize sysfs; maybe they will help to avoid problems in the future. SLUB will need a quicker fix, but, if that's the final remaining problem for this code, it would seem that One True Allocator status is almost within reach.
Rationalizing scatter/gather chains
The chained scatterlist API was arguably the most disruptive addition to 2.6.24, despite being a relatively small amount of code. This API allows kernel code to chain together scatter/gather lists for DMA I/O operations, resulting in a much larger maximum size for those operations. That, in turn, leads to better performance, especially in the block I/O subsystem. The idea of scatterlist chaining is generally popular, but there have been some complaints about the current implementation. As things stand, any code wanting to work with chained scatterlists must construct the chains itself - an error-prone operation. So there is interest in making things better.

One approach to improving the situation is the sg_ring API, proposed by Rusty Russell. This patch does away with the current chaining approach; there are no more scatterlist entries which are actually chain pointers in disguise. Instead, Rusty introduces struct sg_ring:
    struct sg_ring {
        struct list_head list;
        unsigned int num, max;
        struct scatterlist sg[0];
    };
The obvious change here is that the chaining has been moved out of the scatterlist itself and made into an explicit linked list. There are also variables tracking the current and maximum sizes of the list, which help reduce explicit housekeeping elsewhere. Some versions of the patch also add an integer dma_num field to hold the number of mapped scatter/gather entries, which can differ from the number initially set up by the driver.
An sg_ring with a given number of scatterlist entries can be declared with this macro:
DECLARE_SG_RING(name, max);
A ring should then be initialized with one of:
    void sg_ring_init(struct sg_ring *ring, unsigned int max);
    void sg_ring_single(struct sg_ring *ring, const void *buf, unsigned int buflen);
The latter form is a shortcut for cases where a single-entry ring needs to be set up with a given buffer.
Constructing a multi-entry ring is a matter of allocating as many separate sg_ring entries as needed and explicitly chaining them together using the list field. There is a helper macro for stepping through all of the entries in a ring while hiding the boundaries between the individual scatterlists:
    struct sg_ring *sg;
    int i;

    sg_ring_for_each(ring, sg, i) {
        /* *sg is the scatterlist entry to operate on */
    }
Rusty has posted patches converting parts of the SCSI subsystem over to this API. As he points out, the conversion removes a fair amount of logic associated with the construction and destruction of large scatterlists.
Jens Axboe, the creator of the chained scatterlist code, has responded that the current API was aimed at minimizing the effect on drivers for 2.6.24. It is not, he says, a finished product, and things are getting better. A look at his git repository shows some API additions with a very similar goal to Rusty's work.
Jens's work retains the current chaining mechanism, but wraps a structure and some helpers around it to make it easier to work with. So, in this view of the world, drivers will work with struct sg_table:
    struct sg_table {
        struct scatterlist *sgl;     /* the list */
        unsigned int nents;          /* number of mapped entries */
        unsigned int orig_nents;     /* original size of list */
    };
An sg_table will be set up with:
int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_flags);
This function does not allocate the sg_table structure, which must be passed in as a parameter. It does, however, allocate the memory to use for the actual scatterlist arrays and deal with the process of chaining them all together. So a driver needing to construct a large scatter/gather operation can now just do a single sg_alloc_table() call, then iterate through the list of scatterlist entries in the usual way. When the operation is complete, a call to
void sg_free_table(struct sg_table *table);
will free the allocated memory.
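As an illustration (a sketch under the assumption that the interface lands as shown in Jens's tree; the pages[] array and nents count are hypothetical inputs from the driver), a large scatter/gather list could be set up like this, using the existing for_each_sg() and sg_set_page() helpers:

    /*
     * Sketch only: describe "nents" full pages in a single sg_table.
     * pages[] is a hypothetical array supplied by the caller; teardown
     * is a matter of calling sg_free_table() as described above.
     */
    static int build_sg_table(struct sg_table *table, struct page **pages,
                              unsigned int nents)
    {
        struct scatterlist *sg;
        int i, ret;

        ret = sg_alloc_table(table, nents, GFP_KERNEL);
        if (ret)
            return ret;
        for_each_sg(table->sgl, sg, table->nents, i)
            sg_set_page(sg, pages[i], PAGE_SIZE, 0);
        /* ... map the table and perform the I/O ... */
        return 0;
    }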
Sometime around the opening of the 2.6.25 merge window, a decision will have to be made on the direction of the chained scatterlist API. It may not be one of the most closely-watched kernel development events ever, but this decision will affect how high-performance I/O code is written in the future. As the author of the current chaining code, Jens probably starts with an advantage when it comes to getting his code merged. The nature of kernel development is such that nobody can ever be entirely sure, though; if a consensus builds that Rusty's approach is better, that is the way things will probably go. Stay tuned through the next merge window for the thrilling conclusion to this ongoing story.
What is RCU? Part 2: Usage
Introduction
Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU is most frequently described as a replacement for reader-writer locking, but it has also been used in a number of other ways. RCU is notable in that RCU readers do not directly synchronize with RCU updaters, which makes RCU read paths extremely fast, and also permits RCU readers to accomplish useful work even when running concurrently with RCU updaters.
This leads to the question "what exactly is RCU?", a question that this document addresses from the viewpoint of someone using it. Because RCU is most frequently used to replace some existing mechanism, we look at it primarily in terms of its relationship to such mechanisms, as follows:
- RCU is a Reader-Writer Lock Replacement
- RCU is a Restricted Reference-Counting Mechanism
- RCU is a Bulk Reference-Counting Mechanism
- RCU is a Poor Man's Garbage Collector
- RCU is a Way of Providing Existence Guarantees
- RCU is a Way of Waiting for Things to Finish
These sections are followed by conclusions and answers to the Quick Quizzes.
RCU is a Reader-Writer Lock Replacement
Perhaps the most common use of RCU within the Linux kernel is as a replacement for reader-writer locking in read-intensive situations. Nevertheless, this use of RCU was not immediately apparent to me at the outset; in fact, I chose to implement something similar to brlock before implementing a general-purpose RCU implementation back in the early 1990s. Each and every one of the uses I envisioned for the proto-brlock primitive was instead implemented using RCU. In fact, it was more than three years before the proto-brlock primitive saw its first use. Boy, did I feel foolish!
The key similarity between RCU and reader-writer locking is that both have read-side critical sections that can execute in parallel. In fact, in some cases, it is possible to mechanically substitute RCU API members for the corresponding reader-writer lock API members. But first, why bother?
Advantages of RCU include performance, deadlock immunity, and realtime latency. There are, of course, limitations to RCU, including the fact that readers and updaters run concurrently, that low-priority RCU readers can block high-priority threads waiting for a grace period to elapse, and that grace-period latencies can extend for many milliseconds. These advantages and limitations are discussed in the following sections.
Performance
The read-side performance advantages of RCU over reader-writer lock are shown on the following graph for a 16-CPU 3GHz Intel x86 system.
A more moderate view may be obtained from a CONFIG_PREEMPT kernel, though RCU still beats reader-writer locking by between one and three orders of magnitude. Note the high variability of reader-writer locking at larger numbers of CPUs; the error bars span a single standard deviation in either direction.
Of course, the low performance of reader-writer locking will become less significant as the overhead of the critical section increases, as shown in the following graph for a 16-CPU system.
Deadlock Immunity
Although RCU offers significant performance advantages for read-mostly workloads, one of the primary reasons for creating RCU in the first place was in fact its immunity to read-side deadlocks. This immunity stems from the fact that RCU read-side primitives do not block, spin, or even do backwards branches, so that their execution time is deterministic. It is therefore impossible for them to participate in a deadlock cycle.
An interesting consequence of RCU's read-side deadlock immunity is that it is possible to unconditionally upgrade an RCU reader to an RCU updater. Attempting to do such an upgrade with reader-writer locking results in deadlock. A sample code fragment that does an RCU read-to-update upgrade follows:
    rcu_read_lock();
    list_for_each_entry_rcu(p, head, list_field) {
        do_something_with(p);
        if (need_update(p)) {
            spin_lock(&my_lock);
            do_update(p);
            spin_unlock(&my_lock);
        }
    }
    rcu_read_unlock();
Note that do_update() is executed under the protection of the lock and under RCU read-side protection.
Another interesting consequence of RCU's deadlock immunity is its immunity to a large class of priority inversion problems. For example, low-priority RCU readers cannot prevent a high-priority RCU updater from acquiring the update-side lock. Similarly, a low-priority RCU updater cannot prevent high-priority RCU readers from entering an RCU read-side critical section.
Realtime Latency
Because RCU read-side primitives neither spin nor block, they offer excellent realtime latencies. In addition, as noted earlier, this means that they are immune to priority inversion involving the RCU read-side primitives and locks.
However, RCU is susceptible to more subtle priority-inversion scenarios, for example, a high-priority process blocked waiting for an RCU grace period to elapse can be blocked by low-priority RCU readers in -rt kernels. This can be solved by using RCU priority boosting.
RCU Readers and Updaters Run Concurrently
Because RCU readers never spin nor block, and because updaters are not subject to any sort of rollback or abort semantics, RCU readers and updaters must necessarily run concurrently. This means that RCU readers might access stale data, and might even see inconsistencies, either of which can render conversion from reader-writer locking to RCU non-trivial.
However, in a surprisingly large number of situations, inconsistencies and stale data are not problems. The classic example is the networking routing table. Because routing updates can take considerable time to reach a given system (seconds or even minutes), the system will have been sending packets the wrong way for quite some time when the update arrives. It is usually not a problem to continue sending packets the wrong way for a few additional milliseconds. Furthermore, because RCU updaters can make changes without waiting for RCU readers to finish, the RCU readers might well see the change more quickly than would batch-fair reader-writer-locking readers, as shown in the following figure.
Once the update is received, the rwlock writer cannot proceed until the last reader completes, and subsequent readers cannot proceed until the writer completes. However, these subsequent readers are guaranteed to see the new value, as indicated by the green background. In contrast, RCU readers and updaters do not block each other, which permits the RCU readers to see the updated values sooner. Of course, because their execution overlaps that of the RCU updater, all of the RCU readers might well see updated values, including the three readers that started before the update. Nevertheless only the RCU readers with green backgrounds are guaranteed to see the updated values, again, as indicated by the green background.
Reader-writer locking and RCU simply provide different guarantees. With reader-writer locking, any reader that begins after the writer starts executing is guaranteed to see new values, and readers that attempt to start while the writer is spinning might or might not see new values, depending on the reader/writer preference of the rwlock implementation in question. In contrast, with RCU, any reader that begins after the updater completes is guaranteed to see new values, and readers that end after the updater begins might or might not see new values, depending on timing.
The key point here is that, although reader-writer locking does indeed guarantee consistency within the confines of the computer system, there are situations where this consistency comes at the price of increased inconsistency with the outside world. In other words, reader-writer locking obtains internal consistency at the price of silently stale data with respect to the outside world.
Nevertheless, there are situations where inconsistency and stale data within the confines of the system cannot be tolerated. Fortunately, there are a number of approaches that avoid inconsistency and stale data, as discussed in the FREENIX paper on applying RCU to System V IPC [PDF] and in my dissertation [PDF]. However, an in-depth discussion of these approaches is beyond the scope of this article.
Low-Priority RCU Readers Can Block High-Priority Reclaimers
In Realtime RCU, SRCU, or QRCU, each of which is described in the final installment of this series, a preempted reader will prevent a grace period from completing, even if a high-priority task is blocked waiting for that grace period to complete. Realtime RCU can avoid this problem by substituting call_rcu() for synchronize_rcu(), or by using RCU priority boosting, which is still in experimental status as of late 2007. It might become necessary to augment SRCU and QRCU with priority boosting, but not before a clear real-world need is demonstrated.
RCU Grace Periods Extend for Many Milliseconds
With the exception of QRCU, RCU grace periods extend for multiple milliseconds. Although there are a number of techniques to render such long delays harmless, including use of the asynchronous interfaces where available (call_rcu() and call_rcu_bh()), this situation is a major reason for the rule of thumb that RCU be used in read-mostly situations.
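As a minimal sketch of the asynchronous approach (the struct foo element type and foo_lock below are invented for illustration; call_rcu() and container_of() are the real interfaces), an updater can hand reclamation off to a callback and return immediately rather than waiting out the grace period itself:

    /* Sketch: asynchronous reclamation with call_rcu(). */
    struct foo {
        struct list_head list;
        struct rcu_head rcu;
        int data;
    };
    static DEFINE_SPINLOCK(foo_lock);    /* illustrative update-side lock */

    static void foo_reclaim(struct rcu_head *head)
    {
        struct foo *p = container_of(head, struct foo, rcu);

        kfree(p);
    }

    static void foo_del(struct foo *p)
    {
        spin_lock(&foo_lock);
        list_del_rcu(&p->list);
        spin_unlock(&foo_lock);
        call_rcu(&p->rcu, foo_reclaim);  /* returns immediately; kfree() runs after a grace period */
    }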
Comparison of Reader-Writer Locking and RCU Code
In the best case, the conversion from reader-writer locking to RCU is quite simple, as shown in the following example taken from Wikipedia.
Reader-writer locking version:

    struct el {
        struct list_head list;
        long key;
        spinlock_t mutex;
        int data;
        /* Other data fields */
    };
    rwlock_t listmutex;
    struct el head;

    int search(long key, int *result)
    {
        struct list_head *lp;
        struct el *p;

        read_lock(&listmutex);
        list_for_each_entry(p, head, lp) {
            if (p->key == key) {
                *result = p->data;
                read_unlock(&listmutex);
                return 1;
            }
        }
        read_unlock(&listmutex);
        return 0;
    }

    int delete(long key)
    {
        struct el *p;

        write_lock(&listmutex);
        list_for_each_entry(p, head, lp) {
            if (p->key == key) {
                list_del(&p->list);
                write_unlock(&listmutex);
                kfree(p);
                return 1;
            }
        }
        write_unlock(&listmutex);
        return 0;
    }

RCU version:

    struct el {
        struct list_head list;
        long key;
        spinlock_t mutex;
        int data;
        /* Other data fields */
    };
    spinlock_t listmutex;
    struct el head;

    int search(long key, int *result)
    {
        struct list_head *lp;
        struct el *p;

        rcu_read_lock();
        list_for_each_entry_rcu(p, head, lp) {
            if (p->key == key) {
                *result = p->data;
                rcu_read_unlock();
                return 1;
            }
        }
        rcu_read_unlock();
        return 0;
    }

    int delete(long key)
    {
        struct el *p;

        spin_lock(&listmutex);
        list_for_each_entry(p, head, lp) {
            if (p->key == key) {
                list_del_rcu(&p->list);
                spin_unlock(&listmutex);
                synchronize_rcu();
                kfree(p);
                return 1;
            }
        }
        spin_unlock(&listmutex);
        return 0;
    }
More-elaborate cases of replacing reader-writer locking with RCU are beyond the scope of this document.
RCU is a Restricted Reference-Counting Mechanism
Because grace periods are not allowed to complete while there is an RCU read-side critical section in progress, the RCU read-side primitives may be used as a restricted reference-counting mechanism. For example, consider the following code fragment:
    rcu_read_lock();            /* acquire reference. */
    p = rcu_dereference(head);
    /* do something with p. */
    rcu_read_unlock();          /* release reference. */
The rcu_read_lock() primitive can be thought of as acquiring a reference to p, because a grace period starting after the rcu_dereference() assigns to p cannot possibly end until after we reach the matching rcu_read_unlock(). This reference-counting scheme is restricted in that we are not allowed to block in RCU read-side critical sections, nor are we permitted to hand off an RCU read-side critical section from one task to another. Regardless of these restrictions, the following code can safely delete p:
    spin_lock(&mylock);
    p = head;
    head = NULL;
    spin_unlock(&mylock);
    synchronize_rcu();    /* Wait for all references to be released. */
    kfree(p);
The assignment to head prevents any future references to p from being acquired, and the synchronize_rcu() waits for any references that had previously been acquired to be released.
Of course, RCU can also be combined with traditional reference counting, as has been discussed on LKML and as summarized in an Overview of Linux-Kernel Reference Counting [PDF].
But why bother? Again, part of the answer is performance, as shown in the following graph, again showing data taken on a 16-CPU 3GHz Intel x86 system.
However, the restrictions that go with RCU can be quite onerous. For example, in many cases, the prohibition against sleeping while in an RCU read-side critical section would defeat the entire purpose. The next section looks at ways of addressing this problem, while also reducing the complexity of traditional reference counting, in some cases.
RCU is a Bulk Reference-Counting Mechanism
As noted in the preceding section, traditional reference counters are usually associated with a specific data structure, or perhaps a specific group of data structures. However, maintaining a single global reference counter for a large variety of data structures typically results in bouncing the cache line containing the reference count. Such cache-line bouncing can severely degrade performance.
In contrast, RCU's light-weight read-side primitives permit extremely frequent read-side usage with negligible performance degradation, permitting RCU to be used as a "bulk reference-counting" mechanism with little or no performance penalty. Situations where a reference must be held by a single task across a section of code that blocks may be accommodated with Sleepable RCU (SRCU). This fails to cover the not-uncommon situation where a reference is "passed" from one task to another, for example, when a reference is acquired when starting an I/O and released in the corresponding completion interrupt handler. (In principle, this could be handled by the SRCU implementation, but in practice, it is not yet clear whether this is a good tradeoff.)
Of course, SRCU brings a restriction of its own, namely that the return value from srcu_read_lock() be passed into the corresponding srcu_read_unlock(). The jury is still out as to how much of a problem is presented by this restriction, and as to how it can best be handled.
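A minimal sketch of the SRCU pattern follows; my_srcu and the surrounding functions are illustrative, while srcu_read_lock(), srcu_read_unlock(), and synchronize_srcu() are the interfaces under discussion:

    /* Sketch of SRCU usage; my_srcu must be set up with init_srcu_struct(). */
    static struct srcu_struct my_srcu;

    static void reader(void)
    {
        int idx;

        idx = srcu_read_lock(&my_srcu);
        /* ... access SRCU-protected data; blocking is permitted here ... */
        srcu_read_unlock(&my_srcu, idx);  /* the same idx must be passed back */
    }

    static void updater(void)
    {
        /* ... remove the old data from readers' view ... */
        synchronize_srcu(&my_srcu);       /* waits only for readers of my_srcu */
        /* ... now safe to free the old data ... */
    }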
RCU is a Poor Man's Garbage Collector
A not-uncommon exclamation made by people first learning about RCU is "RCU is sort of like a garbage collector!". This exclamation has a large grain of truth, but it can also be misleading.
Perhaps the best way to think of the relationship between RCU and automatic garbage collectors (GCs) is that RCU resembles a GC in that the timing of collection is automatically determined, but that RCU differs from a GC in that: (1) the programmer must manually indicate when a given data structure is eligible to be collected, and (2) the programmer must manually mark the RCU read-side critical sections where references might legitimately be held.
Despite these differences, the resemblance does go quite deep, and has appeared in at least one theoretical analysis of RCU. Furthermore, the first RCU-like mechanism I am aware of used a garbage collector to handle the grace periods. Nevertheless, a better way of thinking of RCU is described in the following section.
RCU is a Way of Providing Existence Guarantees
Gamsa et al. [PDF] discuss existence guarantees and describe how a mechanism resembling RCU can be used to provide these existence guarantees (see section 5 on page 7). The effect is that if any RCU-protected data element is accessed within an RCU read-side critical section, that data element is guaranteed to remain in existence for the duration of that RCU read-side critical section.
Alert readers will recognize this as only a slight variation on the original "RCU is a way of waiting for things to finish" theme, which is addressed in the following section.
RCU is a Way of Waiting for Things to Finish
As noted in the first article in this series, an important component of RCU is a way of waiting for RCU readers to finish. One of RCU's great strengths is that it allows you to wait for each of thousands of different things to finish without having to explicitly track each and every one of them, and without having to worry about the performance degradation, scalability limitations, complex deadlock scenarios, and memory-leak hazards that are inherent in schemes that use explicit tracking.
In this section, we will show how synchronize_sched()'s read-side counterparts (which include anything that disables preemption, along with hardware operations and primitives that disable irqs) permit you to implement interactions with non-maskable interrupt (NMI) handlers that would be quite difficult if using locking. I called this approach "Pure RCU" in my dissertation [PDF], and it is used in a number of places in the Linux kernel.
The basic form of such "Pure RCU" designs is as follows:
- Make a change, for example, to the way that the OS reacts to an NMI.
- Wait for all pre-existing read-side critical sections to completely finish (for example, by using the synchronize_sched() primitive). The key observation here is that subsequent RCU read-side critical sections are guaranteed to see whatever change was made.
- Clean up, for example, return status indicating that the change was successfully made.
The remainder of this section presents example code adapted from the Linux kernel. In this example, the timer_stop function uses synchronize_sched() to ensure that all in-flight NMI notifications have completed before freeing the associated resources. A simplified version of this code follows:
     1 struct profile_buffer {
     2     long size;
     3     atomic_t entry[0];
     4 };
     5 static struct profile_buffer *buf = NULL;
     6
     7 void nmi_profile(unsigned long pcvalue)
     8 {
     9     struct profile_buffer *p = rcu_dereference(buf);
    10
    11     if (p == NULL)
    12         return;
    13     if (pcvalue >= p->size)
    14         return;
    15     atomic_inc(&p->entry[pcvalue]);
    16 }
    17
    18 void nmi_stop(void)
    19 {
    20     struct profile_buffer *p = buf;
    21
    22     if (p == NULL)
    23         return;
    24     rcu_assign_pointer(buf, NULL);
    25     synchronize_sched();
    26     kfree(p);
    27 }
Lines 1-4 define a profile_buffer structure, containing a size and an indefinite array of entries. Line 5 defines a pointer to a profile buffer, which is presumably initialized elsewhere to point to a dynamically allocated region of memory.
Lines 7-16 define the nmi_profile() function, which is called from within an NMI handler. As such, it cannot be preempted, nor can it be interrupted by a normal irq handler; however, it is still subject to delays due to cache misses, ECC errors, and cycle stealing by other hardware threads within the same core. Line 9 gets a local pointer to the profile buffer using the rcu_dereference() primitive to ensure memory ordering on DEC Alpha, lines 11 and 12 exit from this function if there is no profile buffer currently allocated, and lines 13 and 14 exit from this function if the pcvalue argument is out of range. Otherwise, line 15 increments the profile-buffer entry indexed by the pcvalue argument. Note that storing the size with the buffer guarantees that the range check matches the buffer, even if a large buffer is suddenly replaced by a smaller one.
Lines 18-27 define the nmi_stop() function, where the caller is responsible for mutual exclusion (for example, holding the correct lock). Line 20 fetches a pointer to the profile buffer, and lines 22 and 23 exit the function if there is no buffer. Otherwise, line 24 NULLs out the profile-buffer pointer (using the rcu_assign_pointer() primitive to maintain memory ordering on weakly ordered machines), and line 25 waits for an RCU Sched grace period to elapse, in particular, waiting for all non-preemptible regions of code, including NMI handlers, to complete. Once execution continues at line 26, we are guaranteed that any instance of nmi_profile() that obtained a pointer to the old buffer has returned. It is therefore safe to free the buffer, in this case using the kfree() primitive.
Quick Quiz 6: Suppose that the nmi_profile() function was preemptible. What would need to change to make this example work correctly?
In short, RCU makes it easy to dynamically switch among profile buffers (you just try doing this efficiently with atomic operations, or at all with locking!). However, RCU is normally used at a higher level of abstraction, as was shown in the previous sections.
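As an illustration of that ease, a buffer switch following the same rules as nmi_stop() above might look like the sketch below; nmi_resize() is hypothetical (it is not in the kernel), and counts accumulated in the old buffer during the switch are simply discarded:

    /*
     * Hypothetical sketch, not from the kernel: replace the profile
     * buffer with one of a different size.  As with nmi_stop(), the
     * caller is assumed to provide mutual exclusion among updaters.
     */
    int nmi_resize(long newsize)
    {
        struct profile_buffer *new;
        struct profile_buffer *old = buf;

        new = kzalloc(sizeof(*new) + newsize * sizeof(atomic_t), GFP_KERNEL);
        if (new == NULL)
            return -ENOMEM;
        new->size = newsize;
        rcu_assign_pointer(buf, new);  /* publish the new buffer */
        synchronize_sched();           /* wait for NMI handlers still using the old one */
        kfree(old);                    /* kfree(NULL) is a no-op if there was no old buffer */
        return 0;
    }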
Conclusions
At its core, RCU is nothing more nor less than an API that provides:
- a publish-subscribe mechanism for adding new data,
- a way of waiting for pre-existing RCU readers to finish, and
- a discipline of maintaining multiple versions to permit change without harming or unduly delaying concurrent RCU readers.
That said, it is possible to build higher-level constructs on top of RCU, including the reader-writer-locking, reference-counting, and existence-guarantee constructs listed in the earlier sections. Furthermore, I have no doubt that the Linux community will continue to find interesting new uses for RCU, as well as for any of a number of other synchronization primitives.
Acknowledgements
We are all indebted to Andy Whitcroft, Jon Walpole, and Gautham Shenoy, whose review of an early draft of this document greatly improved it. I owe thanks to the members of the Relativistic Programming project and to members of PNW TEC for many valuable discussions. I am grateful to Dan Frye for his support of this effort.
This work represents the view of the author and does not necessarily represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
Answers to Quick Quizzes
Quick Quiz 1: WTF??? How the heck do you expect me to believe that RCU has a 100-femtosecond overhead when the clock period at 3GHz is more than 300 picoseconds?
Answer: First, consider that the inner loop used to take this measurement is as follows:
    for (i = 0; i < CSCOUNT_SCALE; i++) {
        rcu_read_lock();
        rcu_read_unlock();
    }
Next, consider the effective definitions of rcu_read_lock() and rcu_read_unlock():
    #define rcu_read_lock()   do { } while (0)
    #define rcu_read_unlock() do { } while (0)
Consider also that the compiler does simple optimizations, allowing it to replace the loop with:
i = CSCOUNT_SCALE;
So the "measurement" of 100 femtoseconds is simply the fixed
overhead of the timing measurements divided by the number of
passes through the inner loop containing the calls
to rcu_read_lock()
and rcu_read_unlock()
.
And therefore, this measurement really is in error, in fact,
in error by an arbitrary number of orders of magnitude.
As you can see by the definition of rcu_read_lock()
and rcu_read_unlock()
above, the actual overhead
is precisely zero.
It certainly is not every day that a timing measurement of 100 femtoseconds turns out to be an overestimate!
Quick Quiz 2: Why do both the variability and the overhead of rwlock decrease as the critical-section overhead increases?
Answer: Because the contention on the underlying rwlock_t decreases as the critical-section overhead increases. However, the rwlock overhead will not quite drop to that on a single CPU because of cache-thrashing overhead.
Quick Quiz 3: Is there an exception to this deadlock immunity, and if so, what sequence of events could lead to deadlock?
Answer: One way to cause a deadlock cycle involving RCU read-side primitives is via the following (illegal) sequence of statements:
    idx = srcu_read_lock(&srcucb);
    synchronize_srcu(&srcucb);
    srcu_read_unlock(&srcucb, idx);
The synchronize_srcu() cannot return until all pre-existing SRCU read-side critical sections complete, but it is enclosed in an SRCU read-side critical section that cannot complete until the synchronize_srcu() returns. The result is a classic self-deadlock: you get the same effect when attempting to write-acquire a reader-writer lock while read-holding it.

Note that this self-deadlock scenario does not apply to RCU Classic, because the context switch performed by the synchronize_rcu() would act as a quiescent state for this CPU, allowing a grace period to complete. However, this is, if anything, even worse, because data used by the RCU read-side critical section might be freed as a result of the grace period completing.
In short, do not invoke synchronous RCU update-side primitives from within an RCU read-side critical section.
Quick Quiz 4: But wait! This is exactly the same code that might be used when thinking of RCU as a replacement for reader-writer locking! What gives?
Answer: This is an effect of the Law of Toy Examples: beyond a certain point, the code fragments look the same. The only difference is in how we think about the code. However, this difference can be extremely important. For but one example of the importance, consider that if we think of RCU as a restricted reference counting scheme, we would never be fooled into thinking that the updates would exclude the RCU read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement for reader-writer locking, for example, when you are replacing reader-writer locking with RCU.
Quick Quiz 5: Why the dip in refcnt overhead near 6 CPUs?
Answer: Most likely NUMA effects. However, there is substantial variance in the values measured for the refcnt line, as can be seen by the error bars. In fact, standard deviations range in excess of 10% of measured values in some cases. The dip in overhead therefore might well be a statistical aberration.
Quick Quiz 6: Suppose that the nmi_profile() function was preemptible. What would need to change to make this example work correctly?
Answer: One approach would be to use rcu_read_lock() and rcu_read_unlock() in nmi_profile(), and to replace the synchronize_sched() with synchronize_rcu(), perhaps as follows:
    struct profile_buffer {
        long size;
        atomic_t entry[0];
    };
    static struct profile_buffer *buf = NULL;

    void nmi_profile(unsigned long pcvalue)
    {
        struct profile_buffer *p;

        rcu_read_lock();
        p = rcu_dereference(buf);
        if (p == NULL) {
            rcu_read_unlock();
            return;
        }
        if (pcvalue >= p->size) {
            rcu_read_unlock();
            return;
        }
        atomic_inc(&p->entry[pcvalue]);
        rcu_read_unlock();
    }

    void nmi_stop(void)
    {
        struct profile_buffer *p = buf;

        if (p == NULL)
            return;
        rcu_assign_pointer(buf, NULL);
        synchronize_rcu();
        kfree(p);
    }