Brief items
The 3.0 kernel is out,
released on
July 21. Linus said:
As already mentioned several times, there are no special
landmark features or incompatibilities related to the version number
change, it's simply a way to drop an inconvenient numbering system in honor
of twenty years of Linux. In fact, the 3.0 merge window was calmer than
most, and apart from some excitement from RCU I'd have called it really
smooth.
Beyond the numbering scheme change, this kernel includes POSIX alarm timer support, a just-in-time compiler for BPF packet filters,
a new sendmmsg() system call, ICMP
sockets, the merging of the Xen backend driver (completing the long process
of getting Xen Dom0 support into the kernel), namespace file descriptors, and more. See the
KernelNewbies 3.0 page for
lots of details.
Stable updates: no stable updates have been released in the last
week. The 2.6.35.14 update is in the
review process as of this writing.
Comments (2 posted)
I am quite at ease not participating in netfilter/iptables anymore
while the discussion about IPv6 NAT becomes an issue again: I
always indicated "over my dead body", and now that I am no longer
in charge, nobody will have to kill me ;)
--
Harald Welte
Working on an update kernel for Fedora 15, rebasing from 2.6.38 to
3.0. As we know a bunch of userspace packages need updating to
deal with the 2.6 -> 3.x transition, we made a decision to ship
3.0, but call it 2.6.40 rather than ship a ton of updates, and risk
breaking other code that we don't ship.
I look forward to the "OMG, RED HAT FORKS LINUX" posts on slashdot.
--
OMG
Dave Jones FORKS LINUX!
Thanks to git send-email I know exactly what networking patches
every Linux vendor is backporting into their kernel.
--
David
Miller
Comments (42 posted)
Matthew Garrett
continues
his investigation into the subtleties of booting Linux with EFI. "
GPT, or the GUID Partition Table, is the EFI era's replacement for MBR partitions. It has two main advantages over MBR - firstly it can cover partitions larger than 2TB without having to increase sector size, and secondly it doesn't have the primary/logical partition horror that still makes MBR more difficult than it has any right to be. The format is pretty simple - you have a header block 1 logical block into the media (so 512 bytes on a typical USB stick), and then a pointer to a list of partitions. There's then a secondary table one block from the end of the disk, which points at another list of partitions. Both blocks have multiple CRCs that guarantee that neither the header nor the partition list have been corrupted. It turns out to be a relatively straightforward modification of isohybrid to get it to look for a secondary EFI image and construct a GPT entry pointing at it. This works surprisingly well, and media prepared this way will boot EFI machines if burned to a CD or written to a USB stick."
Comments (9 posted)
Kernel development news
By Jonathan Corbet
July 27, 2011
As of this writing, almost 5,400 non-merge changesets have been pulled into
the mainline repository for the 3.1 development cycle. It's a wide-ranging
set of changes, but many of them are cleanups - almost 600 of those changes
have the word "remove" in the title, and the total growth of the kernel is
less than 5,000 lines. A number of trees remain unpulled, though, so there
is plenty of scope for the kernel to grow yet.
User-visible changes merged for 3.1 include:
- Xen has gained a couple of new guest memory management techniques
called "self-ballooning" and "frontswap-selfshrinking." Both use
transcendent memory to try to improve memory performance and smooth
out usage spikes.
- The Xen PCI backend driver - allowing the kernel to export PCI devices
to guests - has been merged.
- The Xen balloon driver now supports memory hotplug.
- There are a number of enhancements to the IPset including a mechanism
to store network addresses and interface names together as named
pairs, adjustable timeouts for SET targets, and more.
- The BATMAN-adv protocol (covered here
in February) has gained a better roaming mechanism, improved client
announcement, and some performance improvements.
- The networking layer has a new "fanout" feature; using
setsockopt(), packets captured from an AF_PACKET
socket can be divided among multiple processes. A number of policies
describing how packets are "fanned out" are supported.
- The BPF JIT compiler now supports the
PowerPC architecture.
- The ptrace() system call has been augmented with some new
commands, starting with PTRACE_SEIZE, which is like
PTRACE_ATTACH but does not trap the traced process or change
its signal state. PTRACE_INTERRUPT will stop a traced
process without creating confusion with signals.
PTRACE_LISTEN allows the traced process to receive certain
events even though it is in a stopped state. All of these options are
considered to be under development; a special
PTRACE_SEIZE_DEVEL flag must be provided by user space to
acknowledge an understanding that things might change.
- The lseek() system call now implements SEEK_HOLE and
SEEK_DATA; these operations can be used to locate
extended blocks of zeroes within files.
- Architecture support for the OpenRISC
CPU has been added.
- A number of writeback-improvement changes have gone in, including
dynamic estimation of backing store bandwidth and a determined attempt
to make use of most of that bandwidth.
- The iwlagn driver now has WoWLAN (wakeup on wireless LAN) support.
- New drivers:
- Processors and systems:
CSR SiRFSoC PRIMA2 ARM Cortex A9 boards,
Xilinx Zynq ARM Cortex A9 boards,
Wolfson Cragganmore 6410 CPU modules, and
Marvell PXA168 GuruPlug Display (gplugD) boards.
Also, low-level support for the OLPC XO-1 laptop has finally been merged.
- Audio:
Analog Devices ADAU1701 SigmaDSP codecs,
Analog Devices ADAV801 and ADAV803 audio codecs,
ST STA32x 2.1-channel digital audio systems,
Wolfson WM8983 codecs, and
Creative CA0132 codecs.
- Block:
Brocade-1860 fabric adapters.
- Input:
Speedlink VAD Cezanne mice.
- Miscellaneous:
Cirrus Logic EP93xx M2P/M2M DMA controllers,
SMSC SCH5636 Super I/O hardware monitor chips,
AMS369FG06 AMOLED LCD controllers,
FSA9480 micro USB switches,
Microwire 93XX46 EEPROM controllers,
Qualcomm PMIC8XXX realtime clock modules,
Analog Devices AD5686R/AD5685R/AD5684R digital to analog
converters, and
Analog Devices AD7792 and AD7793 analog to digital converters.
- Network:
Low-level CAIF-over-HSI network devices,
Faraday FTGMAC100 Gigabit Ethernet adapters, and
NXP PN533 near-field communication adapters.
- USB: PLX NET2272 controllers.
Changes visible to kernel developers include:
- A general-purpose CRC8 generation library has been added.
- The networking layer has gained generic support for near-field
communication (NFC) devices. See Documentation/networking/nfc.txt for
details.
- The power management callbacks found in struct dev_pm_ops
have been augmented with a whole set of "noirq" versions. The power
domains subsystem uses these callbacks for system-wide power
transitions.
- The cleanup of the ARM tree continues, with a lot of code duplication
resolved and the removal of some unused machine types.
- The check_acl() inode operation has been replaced by
get_acl(), whose job is to simply fetch the access control
list from disk. Actual checking of ACLs is now done in the core VFS
code.
- The checkpatch.pl script has a new --ignore option to turn
off various types of messages.
It is not clear when this merge window will close; Linus is about to go on
vacation, and, as he has noted, connectivity tends to be poor when one is
under water in scuba gear. If he is unable to get everything merged
while he is traveling, the merge window may be extended a little past the
normal two weeks. Or he could decide he has pulled enough and close things
early. Stay tuned for an update next week.
Comments (1 posted)
By Jonathan Corbet
July 26, 2011
One of the problems with relying on out-of-tree kernel code is that one can
never be sure when that code might be updated for newer kernels. Keeping
up with the kernel can be painful even for maintainers of small patches;
it's much more so for those who maintain a large, invasive patch series.
It is probably safe to say that, if the realtime preemption developers do
not keep their patches current, there are very few other developers who are
in a position to take on that work. So it was certainly discouraging for
some realtime users to watch multiple kernel releases go by while the
realtime patch series remained stuck at 2.6.33.
The good news is that the roadblock has been overcome and there is now a
new realtime tree for the 3.0 kernel. Even better news is that the
realtime developers may have come up with a solution for one of the most
vexing problems keeping the realtime code out of the mainline. The only
potential down side is that this approach relies on an interesting
assumption about how per-CPU data is used; this assumption will have to be
verified with a lot of testing and, likely, a number of fixes throughout
the kernel.
Symmetric multiprocessing systems are nice in that they offer equal access
to memory from all CPUs. But taking advantage of the feature is a guaranteed
way to create a slow system. Shared data requires mutual exclusion to
avoid concurrent access; that means locking and the associated bottlenecks.
Even in the absence of lock contention, simply moving cache lines between
CPUs can wreck performance. The key to performance on SMP systems is
minimizing the sharing of data, so it is not surprising that a great deal
of scalability work in the kernel depends on the use of per-CPU data.
A per-CPU variable in the Linux kernel is actually an array with one
instance of the variable for each processor. Each processor works with its
own copy of the variable; this can be done with no locking, and with no
worries about cache line bouncing. For example, some slab allocators
maintain per-CPU lists of free objects and/or pages; these allow quick
allocation and deallocation without the need for locking to exclude any
other CPUs. Without these per-CPU lists, memory allocation would scale
poorly as the number of processors grows.
Safe access to per-CPU data requires a couple of constraints, though: the
thread working with the data cannot be preempted and it cannot be migrated
while it manipulates per-CPU variables. If the thread is preempted, the
thread that replaces it could try to work with the same variable; migration
to another CPU could cause confusion for fairly obvious reasons. To avoid
these hazards, access to per-CPU variables is normally bracketed with
calls to get_cpu_var() and put_cpu_var(); the
get_cpu_var() call, along with providing the address for the
processor's version of the variable, disables preemption. So code which
obtains a reference to a per-CPU data will not be scheduled out of the CPU
until it releases that reference. Needless to say, any such code must be
atomic.
The conflict with realtime operation should be obvious: in the realtime
world, anything that disables preemption is a possible source of unwanted
latency. Realtime developers want the highest-priority process to run at
all times; they have little patience for waiting while a low-priority
thread gets around to releasing a per-CPU variable reference. In the past,
this problem has been worked around by protecting per-CPU variables with
spinlocks. These locks keep the code preemptable, but they wreck the
scalability that per-CPU variables were created to provide and complicate the
code. It has been clear for some time that a different solution would need
to be found.
With the 3.0-rc7-rt0 announcement, Thomas
Gleixner noted that "the number of sites which need to be patched is
way too large and the resulting mess in the code is neither acceptable nor
maintainable." So he and Peter Zijlstra sat down to come up with a
better solution for per-CPU data. The solution they came up with is
surprisingly simple: whenever a process acquires a spinlock or obtains a
CPU reference with get_cpu(), the scheduler will refrain from
migrating that process to any other CPU. That process remains preemptable
- code holding spinlocks can be preempted in the realtime world - but it
will not be moved to another processor.
Disabling migration avoids one clear source of trouble: a process which is
migrated in the middle of manipulating a per-CPU variable will end up
working with the wrong CPU's instance of that variable. But what happens
if a process is preempted by another process that needs to access the same
variable? If preemption is no longer disabled, this unfortunate event
seems like a distinct possibility.
After puzzling over this problem for a bit, the path to enlightenment
became clear: just ask Thomas what they are thinking with this change. What they are
thinking, it turns out, is that any access to per-CPU data needs to be
protected by some sort of lock. If need be, the lock itself can be
per-CPU, so the locking need not reintroduce the cache line bouncing that
the per-CPU variable is intended to prevent. In many cases, that locking
is already there for other purposes.
The realtime developers are making the bet that this locking is already
there in almost every place where per-CPU data is manipulated, and that the
exceptions are mostly for data like
statistics used for debugging where an occasional error is not really a
problem. When it comes to locking, though, a gut feeling that things are
right is just not good enough; locking problems have a way of lurking
undetected for long periods of time until some real damage can be done.
Fortunately, this is a place where computers can help; the realtime tree
will probably soon acquire an extension to the locking validator that
checks for consistent locking around per-CPU data accesses.
Lockdep is very good at finding subtle locking problems which are difficult
or impossible to expose with ordinary testing. So, once this extension has
been implemented and the resulting problem reports investigated and
resolved, the assumption that all per-CPU accesses are protected by locking
will be supportable. That process will likely take some time and,
probably, a number of fixes to the mainline kernel. For example, there may
well be bugs now where per-CPU variables are manipulated in interrupt
handlers but non-interrupt code does not disable interrupts; the resulting
race will be hard to hit, but possibly devastating when it happens.
So, as has happened before, the realtime effort is likely to result in
fixes which improve things for non-realtime users as well. Some churn will
be involved, but, once it is done,
there should be a couple of significant benefits: the realtime kernel will
be more scalable on multiprocessor systems, and the realtime patches should
be that much closer to being ready for merging into the mainline.
Comments (7 posted)
July 27, 2011
This article was contributed by Paul McKenney
My goal has always been for my code to go in without so much as a
ripple. Although I don't always meet that goal, I can't recall any
recent failure quite as spectacular as RCU in v3.0. My v3.0 code
didn't just cause a few ripples, it bellyflopped. It is therefore
worthwhile to review what happened and why it happened in order to avoid
future bellyflops and trainwrecks.
This post-mortem will cover the following topics:
-
Overview of preemptible RCU read-side code
-
Steaming towards the trainwreck
- Fixes
- Current status
-
Preventing future bellyflops and trainwrecks
It will end with the obligatory
answers to the quick quizzes.
Understanding the trainwreck requires reviewing a small amount of
TREE_PREEMPT_RCU's read-side code.
First, let's look at __rcu_read_lock(), which, in preemptible
RCU, does the real work for rcu_read_lock():
1 void __rcu_read_lock(void)
2 {
3 current->rcu_read_lock_nesting++;
4 barrier();
5 }
This is quite straightforward: line 3 increments the per-task
->rcu_read_lock_nesting counter and line 4 ensures
that the compiler does not bleed code from the following RCU read-side
critical section out before the __rcu_read_lock().
In short, __rcu_read_lock() does nothing more than
to increment a nesting-level counter.
The __rcu_read_unlock() function, which, in preemptible RCU,
does the real work for rcu_read_unlock(), is only slightly more
complex:
1 void __rcu_read_unlock(void)
2 {
3 struct task_struct *t = current;
4
5 barrier();
6 --t->rcu_read_lock_nesting;
7 barrier();
8 if (t->rcu_read_lock_nesting == 0 &&
9 unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
10 rcu_read_unlock_special(t);
11 }
Line 5 prevents the compiler from bleeding code from the
RCU read-side critical section out past the __rcu_read_unlock(),
line 6 decrements the per-task nesting-level counter, so that thus
far __rcu_read_unlock() is the inverse of
__rcu_read_lock().
However, if the value of the nesting counter is now zero,
we now need to check to see if anything unusual happened
during the just-ended RCU read-side critical section,
which is the job of lines 8 and 9.
Line 7 prevents the compiler from moving this check to precede
the decrement on line 6 because otherwise something unusual might
happen just after the check but before the decrement, which would
in turn mean that __rcu_read_unlock() would fail to
clean up after that unusual something.
The "unusual somethings" are:
- The RCU read-side critical section might have blocked or
been preempted.
In this case, the per-task
->rcu_read_unlock_special
variable will have the
RCU_READ_UNLOCK_BLOCKED bit set.
- The RCU read-side critical section might have executed for
more than a jiffy or two.
In this case, the per-task
->rcu_read_unlock_special
variable will have the
RCU_READ_UNLOCK_NEED_QS bit set.
In either case, the per-task ->rcu_read_unlock_special
will be non-zero, so that __rcu_read_unlock() will invoke
rcu_read_unlock_special(), which we look at next:
1 static void rcu_read_unlock_special(struct task_struct *t)
2 {
3 int empty;
4 int empty_exp;
5 unsigned long flags;
6 struct rcu_node *rnp;
7 int special;
8
9 if (in_nmi())
10 return;
11 local_irq_save(flags);
12 special = t->rcu_read_unlock_special;
13 if (special & RCU_READ_UNLOCK_NEED_QS) {
14 rcu_preempt_qs(smp_processor_id());
15 }
16 if (in_irq()) {
17 local_irq_restore(flags);
18 return;
19 }
20 if (special & RCU_READ_UNLOCK_BLOCKED) {
21 t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
22
23 /* Clean up after blocking. */
24
25 }
26 }
Lines 9 and 10 take an early exit if we are executing in
non-maskable interrupt (NMI) context.
The reason for this early exit is that NMIs cannot be interrupted
or preempted, so there should be no rcu_read_unlock_special()
processing required.
Otherwise, line 11 disables interrupts and line 12 takes a
snapshot of the per-task ->rcu_read_unlock_special
variable.
Line 13 then checks to see if the just-ended RCU read-side
critical section ran for too long, and, if so, invokes
rcu_preempt_qs() to immediately record a quiescent state.
Recall that any point in the code that is not in an RCU
read-side critical section is potentially a quiescent state.
Therefore, since someone is waiting, report the quiescent state
immediately.
Lines 16 through 18 take an early exit if we are executing
in a hardware interrupt handler.
This is appropriate given that hardware interrupt handlers cannot
block, so it is not possible to preempt or to block within an RCU read-side
critical section running within a hardware interrupt handler.
(Of course, threaded interrupt handlers are another story altogether.)
Finally, line 20 checks to see if we blocked or were
preempted within the
just-ended RCU read-side critical section, clearing the corresponding
bit and cleaning up after blockage or preemption if so.
The exact details of the cleanup are not important (and are therefore
omitted from the code fragment above), although curious
readers are referred to kernel.org.
The important thing is what happens if this RCU read-side critical section
was the last one blocking an expedited RCU grace period or if the
just-ended RCU read-side critical section was priority-boosted.
Either situation requires that RCU interact with the scheduler, which
may require the scheduler to acquire its runqueue and priority-inheritance
locks.
Because the scheduler disables interrupts when acquiring the
runqueue and the priority-inheritance locks, an RCU read-side
critical section that lies entirely within one of these locks'
critical sections cannot be interrupted, preempted, or blocked.
Therefore, such an RCU read-side critical section should not
enter rcu_read_unlock_special(), and should thus
avoid what would otherwise be an obvious self-deadlock scenario.
Quick Quiz 1:
But what about RCU read-side critical sections that begin
before a runqueue lock is acquired and end within that lock's
critical section?
Answer
As we will see later, a number of self-deadlock scenarios can be
avoided via the
in_irq() early exit from rcu_read_unlock_special().
Keep the critical importance of this early exit
firmly in mind as we steam down the tracks towards the
RCU/scheduler/threaded-irq trainwreck.
Before we leave the station, please keep in mind that
in_irq() can return inaccurate results because
it consults the
preempt_count() bitmask, which is updated
in software.
At the start of the interrupt, there is therefore a period of
time before
preempt_count() is updated to record the
start of the interrupt, during which time the interrupt handler
has started executing, but
in_irq() returns false.
Similarly, at the end of the interrupt, there is a period of
time after
preempt_count() is updated to record
the end of the interrupt, during which time the interrupt
handler has not completed executing, but again
in_irq() returns false.
This last is most emphatically the case when the end-of-interrupt
processing kicks off softirq handling.
With that background, the sequence of commits leading to the trainwreck
is as follows:
- In March of 2009, commit
a18b83b7ef added the first known
rcu_read_unlock() to be called while holding
a runqueue lock.
Quick Quiz 2:
Suppose that an RCU read-side critical section is enclosed within
a runqueue-lock critical section.
Why couldn't that RCU read-side critical section
be the last RCU read-side critical section blocking a
TREE_PREEMPT_RCU expedited grace period?
Answer
Quick Quiz 3:
Why can't we avoid this whole mess by treating interrupt-disabled
segments of code as if they were RCU read-side critical sections?
Answer
- In December of 2010, commit
d9a3da069 added
synchronize_rcu_expedited() to
TREE_PREEMPT_RCU, which causes the last
reader blocking an expedited grace period to call
wake_up() from within rcu_read_unlock().
Of course, the wake_up() acquires the runqueue
locks.
Although this appears to open the door to an obvious deadlock
scenario where the RCU read-side critical section under the
runqueue lock is the last one blocking a preemptible-RCU
expedited grace period, this cannot happen as long as
the runqueue lock is held across the entire duration of
the RCU read-side critical section.
Continuing down the tracks toward the trainwreck:
- In June of 2010, commit
f3b577dec1 added an RCU
read-side critical section in wake_affine().
Given that I was blissfully unaware of the true nature of
in_irq(),
I raised no objection to this patch.
Quite the opposite, in fact, as can be seen by a quick
glance at this commit.
- The addition of threaded interrupt handlers meant that almost
all hardware interrupts started invoking the scheduler
in order to awaken the corresponding interrupt kthread,
which in turn increased the likelihood that
rcu_read_unlock_special() would become confused
by the return value from in_irq().
- Many more RCU read-side critical sections were added
within runqueue and priority-inheritance critical sections,
further increasing the interaction cross-section between RCU and the
scheduler.
-
RCU_BOOST introduced an incorrect cross-task write to the
per-task ->rcu_read_unlock_special variable.
This could result in this variable being corrupted, resulting
in all manner of deadlocks.
This was fixed by commit 7765be2fe.
- In addition,
RCU_BOOST introduced another
call from RCU into the scheduler in the form of a
rt_mutex_unlock().
All of these changes set the stage for a number of potential failures; one
possible sequence of events is as follows:
- An RCU read-side critical section is preempted, then resumes.
This causes the the per-task
->rcu_read_unlock_special
variable to have the RCU_READ_UNLOCK_BLOCKED bit set.
- This task remains preempted for so long that RCU priority boosting
is invoked.
- The RCU read-side critical section ends by invoking
rcu_read_unlock(), which in in turn invokes the
__rcu_read_unlock() function shown above.
- An interrupt arrives just after
__rcu_read_unlock()
reaches line 7.
- The interrupt handler runs to completion, so that
irq_exit() is invoked, and irq_exit()
decrements the irq nesting-level count to zero.
- Then
irq_exit() then invokes invoke_softirq(),
which determines that ksoftirqd must be awakened.
- The scheduler is invoked to awaken ksoftirqd, which acquires
a runqueue lock and then enters an RCU read-side critical
section.
- When the interrupt handler leaves the RCU read-side critical
section, line 9 of
__rcu_read_unlock()
will find that the per-task ->rcu_read_unlock_special
variable is non-zero, and will therefore invoke
rcu_read_unlock_special().
- Because
in_irq() returns false,
line 16 of rcu_read_unlock_special()
does not take an early exit.
Therefore, rcu_read_unlock_special() sees the
RCU_READ_UNLOCK_BLOCKED bit set in
->rcu_read_unlock_special, and also notes
that the task has been priority boosted.
It therefore invokes the scheduler to unboost itself.
- The scheduler will therefore attempt to acquire a runqueue lock.
Because this task already holds a runqueue lock, deadlock
can (and sometimes did) result.
There were a number of other failure scenarios, but this one is
a representative specimen. Needless to say, figuring all this out was a
bit of a challenge for everyone involved, as was the question of how to fix the problem.
The fixes applied to the RCU trainwreck are as follows:
-
b0d30417 (rcu: Prevent RCU callbacks from executing
before scheduler initialized), which does what its name says.
This addressed a few boot-time hangs.
-
131906b0 (rcu: decrease rcu_report_exp_rnp coupling
with scheduler), which causes RCU to drop one of its internal
locks before invoking the scheduler, thereby eliminating
one set of deadlock scenarios involving expedited
grace periods.
-
7765be2f (rcu: Fix RCU_BOOST race handling
current->rcu_read_unlock_special), which
allocates a separate task_struct field to
indicate that a task has been priority boosted.
This change meant that the ->rcu_read_unlock_special
field returned to its earlier (and correct) status of being
manipulated only by the corresponding task.
This prevented a number of scenarios where an instance of
__rcu_read_unlock() invoked from interrupt
context would incorrectly
invoke rcu_read_unlock_special(), which would
again result in deadlocks.
-
be0e1e21 (rcu: Streamline code produced by
__rcu_read_unlock()), which was an innocent bystander brought
along due to dependencies among patches.
-
10f39bb1 (rcu: protect __rcu_read_unlock() against
scheduler-using irq handlers), which rearranges
__rcu_read_unlock()'s manipulation of
->rcu_read_lock_nesting so as to prevent
interrupt-induced recursion in __rcu_read_unlock()'s
invocation of rcu_read_unlock_special(),
which in turn prevents another class of deadlock scenarios.
This commit was inspired by an earlier patch by Steven Rostedt.
-
c5d753a5 (sched: Add irq_{enter,exit}() to
scheduler_ipi() by Peter Zijlstra), which informs RCU that the scheduler is
running.
This is especially important when the IPI interrupts
dyntick-idle mode: Without this patch, RCU would simply ignore
any RCU read-side critical sections in scheduler_ipi().
-
ec433f0c
(softirq,rcu: Inform RCU of irq_exit() activity by Peter Zijlstra),
which informs RCU of scheduler activity that occurs from
hardware interrupt level, but after irq_exit()
has cleared the preempt_count() indication
that in_irq() relies on.
It is quite possible that 10f39bb1 makes this change
unnecessary, but proving that would have delayed 3.0 even more.
-
a841796f (signal: align __lock_task_sighand() irq
disabling and RCU) fixes one case where an RCU read-side critical
section is preemptible, but its rcu_read_unlock()
is invoked with interrupts disabled.
As noted earlier, there might be a broad-spectrum solution
that renders this patch unnecessary, but that solution was
not appropriate for 3.0.
So, where are we now?
The Linux 3.0 version of RCU finally seems stable, but
the following potential vulnerabilities remain:
- In
RCU_BOOST kernels, if an RCU read-side critical
section has at any time been preemptible, then it is illegal
to invoke its rcu_read_unlock() with interrupts
disabled.
There is an experimental
patch
that removes this restriction, but at the cost of lengthening
the real-time mutex acquisition code path.
Work continues to find a solution with better performance
characteristics.
- In all preemptible-RCU kernels, if an RCU read-side critical
section has at any time been preemptible, then it is illegal
to invoke its
rcu_read_unlock() while holding
a runqueue or a priority-inheritance lock.
Although there are some possible cures for this condition,
all currently known cures are worse than the disease.
Quick Quiz 5:
How could you remove the restriction on possibly-preempted RCU read-side
critical sections ending with runqueue or priority-inheritance locks held?
Answer
-
TINY_PREEMPT_RCU might well contain similar
vulnerabilities.
So, what should be done to prevent this particular bit of history
from repeating itself?
Prevention is better than cure, so what preventative measures should
be taken?
The most important preventative measure is to do a full review of the
RCU code, documenting it as I go.
In the past, I documented new RCU functionality as a matter of course,
before that functionality was accepted into the kernel.
However, over the past few years, I have gotten out of that habit.
Although some of the bugs would probably have escaped me, I would
likely have spotted a significant fraction.
In addition, the documentation might have helped others better understand RCU,
which in turn might have helped some of them to spot the bugs.
Although no one has yet reported similar bugs in
TINY_PREEMPT_RCU, that does not mean that similar bugs
do not exist.
Therefore, when inspecting the code, I need to pay special attention
to the corresponding portions of TINY_PREEMPT_RCU.
Another important preventative measure is to
question long-held assumptions.
My unquestioning faith in in_irq() was clearly misplaced.
Although in_irq() was
“good enough” for RCU for quite some time,
it suddenly was not.
In short, when you are working on something as low-level as RCU, you
shouldn't be taking things like this for granted.
Dealing with the trainwreck also exposed some shortcomings in my
test setup, which emphasizes thoroughness over fast turnaround.
Although there is no substitute for a heavy round of testing on a
number of different configurations, it would be good to be able to validate
debug patches and experimental fixes much more quickly.
I have therefore started setting up an RCU testing environment
using KVM.
This testing environment also has the great advantage of working
even when I don't have Internet access.
Additionally, use of KVM looks like it will shorten the edit-compile-debug
cycle, which is quite important when chasing bugs that I actually can
reproduce.
Finally, I need to update my test configurations.
Some of the bugs reproduce more quickly when threaded interrupt handlers are enabled,
so I need to add these to my test regime.
Another bug was specific to 32-bit kernels, which I currently
don't test, but which KVM makes it easy to test.
In fact, on my current laptop, 32-bit kernels are all that KVM is
capable of testing.
Hopefully these changes will avoid future
late-in-cycle RCU
trainwrecks.
I am grateful to
Steven Rostedt, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
Ben Greear, Julie Sullivan, and Ed Tomlinson for finding bugs, creating
patches, and lots of testing.
I owe thanks to Jim Wasko for his support of this effort.
Quick Quiz 1:
But what about RCU read-side critical sections that begin
before a runqueue lock is acquired and end within that lock's
critical section?
Answer:
That would be very bad.
The scheduler is therefore forbidden from doing this.
Back to Quick Quiz 1.
Quick Quiz 2:
Suppose that an RCU read-side critical section is enclosed within
a runqueue-lock critical section.
Why couldn't that RCU read-side critical section
be the last RCU read-side critical section blocking a
TREE_PREEMPT_RCU expedited grace period?
Answer:
No, it cannot.
To see why, note that
the TREE_PREEMPT_RCU variant of
synchronize_rcu_expedited
is implemented in two phases.
The first phase invokes synchronize_sched_expedited(),
which forces a context switch on each CPU.
The second phase waits for any RCU read-side critical sections that
were preempted in phase 1.
Because acquiring runqueue locks disables interrupts, it is not possible
to preempt an RCU read-side critical section that is totally enclosed
in a runqueue-lock critical section, and therefore
synchronize_rcu_expedited will never wait on such
an RCU read-side critical section,
which in turn means that the corresponding rcu_read_unlock()
cannot have a need to invoke the scheduler, thus avoiding the deadlock.
Of course, the last link in the above chain of logic was broken
by a later bug, but read on...
Back to Quick Quiz 2.
Quick Quiz 3:
Why can't we avoid this whole mess by treating interrupt-disabled
segments of code as if they were RCU read-side critical sections?
Answer:
For two reasons:
- The fact that interrupt-disable sections of code act as
RCU read-side critical sections is a property of the
current implementation.
Later implementations are likely to need to do quiescent-state
processing off-CPU in order to reduce OS jitter, and such
implementations will not be able to treat interrupt-disable
sections of code as RCU read-side critical sections.
This property is important to a number of users, so much so
that there is an
out-of-tree RCU implementation
that provides it (see
here and
here
for more recent versions).
Therefore, we should be prepared for mainline Linux kernel's
RCU implementation to treat interrupt-disable sections of
code as the quiescent states that they really are.
- Having multiple very different things that provide read-side
protection makes the code more difficult to maintain,
with RCU-sched being a case in point.
Back to Quick Quiz 3.
Quick Quiz 4:
Exactly what vulnerability did commit f3b577dec1 expose?
Answer:
Suppose that an RCU read-side critical section is the last one blocking
an expedited grace period, and that its __rcu_read_unlock()
is interrupted just after it decrements the nesting count to zero.
The current->rcu_read_unlock_special bitmask will
therefore be non-zero, indicating that special processing is required
(in this case, waking up the task that kicked of the expedited grace
period).
Suppose further that softirq processing is kicked off at the end of the
interrupt, and that there are so many softirqs pending that they
need to be handed off to ksoftirqd.
Therefore wake_up() is invoked, which acquires the
needed runqueue locks.
But because wake_affine() is invoked, there is
an RCU read-side critical section whose __rcu_read_unlock()
will see that current->rcu_read_unlock_special is nonzero.
At this point, in_irq() will be returning false,
so the resulting call to rcu_read_unlock_special()
won't know to take the early exit.
It will therefore invoke wake_up(), which will again
attempt to acquire the runqueue lock, resulting in deadlock.
Back to Quick Quiz 4.
Quick Quiz 5:
How could you remove the restriction on possibly-preempted RCU read-side
critical sections ending with runqueue or priority-inheritance locks held?
Answer:
Here are some possibilities:
- Enclose all
runqueue or priority-inheritance critical section
in an RCU read-side critical section.
This would mean that any
rcu_read_unlock() that
executed with one of these locks held would be inside the
enclosing RCU read-side critical section, and thus would be
guaranteed not to invoke rcu_read_unlock_special().
However, this approach would add overhead to the scheduler's
fastpaths and requires yet another odd hand-crafted
handoff at context-switch time.
- Keep some per-task state indicating that at least one
scheduler lock is held.
Then
rcu_read_unlock_special() could set another
per-task variable indicating that cleanup is required.
The scheduler could check this flag when releasing its
locks.
I hope that the maintainability challenges of this approach
are self-evident.
- Your idea here.
Back to Quick Quiz 5.
Comments (10 posted)
Patches and updates
Kernel trees
- Thomas Gleixner: 3.0-rt1 .
(July 22, 2011)
- Thomas Gleixner: 3.0-rt2 .
(July 23, 2011)
- Thomas Gleixner: 3.0-rt3 .
(July 25, 2011)
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>