
Read-copy-update

One of the biggest challenges in kernel programming is managing concurrency. If multiple threads try to access the same resources at the same time, the result can be chaos. Users tend to have a dislike for chaos, so kernel programmers work hard to avoid uncontrolled access to shared data.

In the Linux kernel, this sort of mutual exclusion is usually done with spinlocks. By obtaining a spinlock, a process running in kernel mode can ensure that it is the only one working with the data structures protected by that lock. A variant on spinlocks, called the "reader-writer lock," allows numerous threads to access a data structure as long as they do not modify it, but provides exclusive access to processes that make changes.
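For those who have not encountered them: these primitives are used in a pattern like the following (the lock and the functions around it are, of course, made up for the example):

    #include <linux/spinlock.h>

    static rwlock_t table_lock = RW_LOCK_UNLOCKED;

    void read_the_table(void)
    {
        read_lock(&table_lock);     /* many readers may enter at once */
        /* ... examine the protected data ... */
        read_unlock(&table_lock);
    }

    void change_the_table(void)
    {
        write_lock(&table_lock);    /* excludes readers and writers */
        /* ... modify the protected data ... */
        write_unlock(&table_lock);
    }

Note that even read_lock() must write to the lock itself, with the cache consequences described below.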

Spinlocks work well in most situations, but they are not free. Taking out a lock takes time, especially on SMP machines, where the cache line containing the lock must be moved between processors. The overhead of a heavily used lock can be significant. So kernel hackers interested in scalability have long kept an eye out for alternatives; one such alternative is a technique called "read-copy-update," or RCU. A "new and shiny" RCU patch was posted by Dipankar Sarma recently (the code credits Paul McKenney, Andrea Arcangeli, Rusty Russell, Andi Kleen, and "etc."). So it seems like a good time to give RCU a look.

RCU works by requiring shared kernel data structures to be accessed via pointers. Code needing read-only access to a given data structure (a network routing table entry, say) follows a pointer and is able to work with the data with no locking at all (OK, almost none, see below). The reader case is, thus, handled in a fast and efficient manner. When the code needs to make a change to the data, however, life gets rather more complicated; the sequence of steps required is, roughly:

  • The writer allocates a new data structure and makes a copy of the structure to be changed.

  • The new structure is then modified to reflect the new state of the world.

  • The writer saves the pointer to the old version of the data, and sets the global pointer to the new structure. Kernel threads that access the data after this change will see the new version; any thread that came along before will still be working with the old copy.

  • The writer asks for a callback when the kernel knows that no code has any reference to the old version of the data.

  • When the callback happens, the old data can be freed, and the RCU cycle is complete.
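In code, that cycle might look something like the following sketch, built around a hypothetical RCU-protected routing table entry. The structure and the deferred-free function (rcu_free_later()) are invented for illustration; the actual patch's interface differs in its details.

    #include <linux/slab.h>

    struct route_entry {
        unsigned long dest;
        unsigned long gateway;
    };

    struct route_entry *current_route;   /* the RCU-protected pointer */

    void update_gateway(unsigned long new_gw)
    {
        struct route_entry *old, *new;

        new = kmalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
            return;
        old = current_route;
        *new = *old;                /* copy the old version */
        new->gateway = new_gw;      /* modify the copy */
        wmb();                      /* the copy must be globally visible
                                       before the pointer changes */
        current_route = new;        /* readers now see the new version */
        rcu_free_later(old);        /* hypothetical: free the old copy
                                       once nobody can reference it */
    }

Readers that were already working with the old version proceed undisturbed; the interesting part is the last step, which is where the quiescence machinery described below comes in.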

This technique is clearly optimized for situations where the data is read frequently and modified rarely. For frequently changed data, the overhead of the RCU cycle would likely exceed that of simply using spinlocks. The "frequent reads/infrequent writes" mode of operation is quite common in the kernel, though, so there are many places where this technique could be employed. For example, Rusty Russell's hotplug CPU patch uses RCU, on the assumption that processors do not actually come and go very often.

All of the above, however, has glossed over one interesting detail: how, exactly, does the kernel know when it is safe to release an old data structure? The RCU patch handles this with a basic assumption: kernel code will not retain pointers to RCU-protected data after it sleeps or returns to user space. Thus, it is sufficient to wait until every processor in the system has been seen to be running in user space or to be idle. The RCU patch describes such a processor as being "quiescent." Each CPU in the system has a quiescent counter, which is incremented by the scheduler whenever a quiescent state is observed.
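The bookkeeping involved need not be anything elaborate; a sketch, with invented names, might look like:

    struct rcu_percpu {
        long quiescent_count;    /* bumped at each quiescent point */
    } rcu_state[NR_CPUS];

    /*
     * Called by the scheduler on this CPU whenever a quiescent
     * state (idle, or running in user space) is observed.
     */
    void rcu_note_quiescent(int cpu)
    {
        rcu_state[cpu].quiescent_count++;
    }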

To call the RCU writer callbacks at the right time, the RCU code maintains a list of pending RCU completions on each processor. A tasklet runs periodically on any processor with pending RCU callbacks; it polls the quiescent counter for all CPUs and waits until every one of them has changed. Once that has happened, it is safe to free any old RCU data, so the list of callbacks is processed. If, by that time, a new list of pending callbacks has accumulated, the whole thing starts over again.
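The "has everybody quiesced?" test thus reduces to comparing each CPU's counter against a snapshot taken when the batch of callbacks was queued; roughly (continuing with the invented names used above):

    static long snapshot[NR_CPUS];   /* counter values at queue time */

    static int rcu_all_quiescent(void)
    {
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            if (rcu_state[cpu].quiescent_count == snapshot[cpu])
                return 0;    /* this CPU has yet to quiesce */
        return 1;            /* safe to run the callbacks */
    }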

All of this works until you throw in one other little detail: the preemptive kernel. If a process is preempted while running in kernel space, it could retain a pointer to old RCU data even though the CPU appears to be quiescent. The RCU patch provides two different ways of dealing with this problem. One is that code accessing RCU data for reading can bracket that access with calls to rcu_read_lock() and rcu_read_unlock(), which simply disable preemption in the critical section. Spinlocks, of course, do the same thing.
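The read side, in this variant, looks something like the following (the lookup function is made up, but rcu_read_lock() and rcu_read_unlock() are the patch's own):

    unsigned long lookup_gateway(void)
    {
        struct route_entry *r;
        unsigned long gw;

        rcu_read_lock();       /* simply disables preemption */
        r = current_route;     /* follow the pointer */
        gw = r->gateway;
        rcu_read_unlock();     /* preemption on again; r may no
                                  longer be dereferenced */
        return gw;
    }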

Alternatively, code can read the RCU data in an unprotected mode as always. In this case, the RCU callback code gets a little more complicated yet; it must now wait until every process that had been preempted in kernel mode either exits or is rescheduled normally. This waiting is not quite as bad as it might seem; it is handled with a couple of atomic counters. It does, however, introduce an indeterminate delay between when the new data appears and when the old version can be freed. If the memory areas involved are large, quite a bit of kernel memory could be tied up waiting for RCU callbacks; disabling preemption is a safer way to go in most cases.
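One (much simplified) form that bookkeeping could take - reduced here to a single counter, with all names invented - is:

    #include <asm/atomic.h>

    static atomic_t preempted_in_kernel = ATOMIC_INIT(0);

    void note_kernel_preemption(void)   /* scheduler: a task was just
                                           preempted in kernel mode */
    {
        atomic_inc(&preempted_in_kernel);
    }

    void note_preemption_done(void)     /* scheduler: that task has now
                                           run again or exited */
    {
        atomic_dec(&preempted_in_kernel);
    }

    int rcu_grace_period_over(void)
    {
        /* every CPU has quiesced, and no preempted task can still
           hold a reference from before the update */
        return rcu_all_quiescent() &&
               atomic_read(&preempted_in_kernel) == 0;
    }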

RCU thus involves some complexity, but it holds out a promise of better performance for certain kinds of data access patterns. Will it get into the 2.5 kernel? There is one little problem: Linus doesn't like the RCU approach. From a message posted last October:

RCU obviously has major subtle issues, ranging from memory ordering to "what is quiescent", ie the scheduling points. And "subtlety" directly translates into bugs and hard-to-debug seldom-seen problems that end up being "unexplainable".

In short, RCU seems to be a case of "hey, that's cool", but it's a solution in search of a problem so severe that it is worth it.

There are no indications that Linus believes such a problem has yet come up. Work continues on RCU patches (and other patches that use it), however, so the story is not yet finished. (For information in numbing detail about RCU - but without the preemptive kernel changes - see this page on the LSE site).



Race condition?

Posted Jul 18, 2002 16:47 UTC (Thu) by bjn (guest, #2179)

Thanks for the write-up on RCU; a good summary, as usual. However, you don't say anything about how RCU prevents two writers from trying to work at once (a race condition).

Race condition?

Posted Jul 18, 2002 17:04 UTC (Thu) by corbet (editor, #1)

In fact, there really isn't anything in the RCU patch to prevent multiple writers from working concurrently. Whoever updates the data pointer last wins. For the sorts of uses envisioned by the RCU authors (e.g. routing tables, loadable modules, etc.), this usually isn't a problem. There will not normally be concurrent writers, and especially not situations where one would depend on what the other is writing.

One can certainly see situations where a write-side race condition would be a problem, though. The solution would be to put a spinlock around the copy-update part of the cycle.
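Concretely, that would mean wrapping the copy-and-switch steps from the article's sketch in an ordinary spinlock, something like (the lock name is made up):

    static spinlock_t route_update_lock = SPIN_LOCK_UNLOCKED;

    void update_gateway_locked(unsigned long new_gw)
    {
        struct route_entry *old, *new;

        /* allocate outside the lock; kmalloc(GFP_KERNEL) can sleep */
        new = kmalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
            return;
        spin_lock(&route_update_lock);    /* one writer at a time */
        old = current_route;
        *new = *old;
        new->gateway = new_gw;
        wmb();
        current_route = new;
        spin_unlock(&route_update_lock);
        rcu_free_later(old);              /* hypothetical, as before */
    }

Readers, of course, never take this lock at all.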

Race condition?

Posted Jun 13, 2003 17:40 UTC (Fri) by ncm (guest, #165)

In other words, the (overwhelmingly) most frequent users of the data structure never need to lock it. Concurrent writers do need to lock, but since writing happens so rarely, collisions requiring one writer to wait are vanishingly rare. Where you can guarantee there can be no writer collisions (e.g. there's only ever one writer, or they're already co-operating), writers don't need to lock.

The more subtle danger is when there is a dependency relationship between the data copied and other data elsewhere in the kernel (or in user space!). You just have to be very careful never to let that happen.

