Brief items
The current development kernel is 3.2-rc4,
released on December 1. Linus says that
things are quieting down, but he's worried: "
I'm waiting for the
other shoe to drop. Maybe Davem and GregKH are holding back - they've been
suspiciously quiet, and I think I can hear some evil chuckling going on
there. But maybe it's just time for my meds."
Stable updates: no stable updates have been released in the last
week. The 2.6.32.50 (27 patches)
3.0.13 (80 patches) and 3.1.5 (104 patches) updates are in the review
process as of this writing; they can be expected on or after
December 9.
Comments (none posted)
One of the failures we routinely exclude code from the kernel for
is best described as "user interface of contributor"
--
Alan Cox
You've done the right thing - if you ask, people treat your message
as low priority and don't bother replying in the hope that nothing
will happen.
However, if you send a patch to delete a platform, people treat it
as most urgent because if they care they need to speak up to
actually stop it happening. (Even if you intend to only actually
submit it in 6 months time or so - but don't mention that in the
initial posting!)
--
Russell King
Crazy security policies are almost par for the course. Crazy VFS
layer functions I get upset about.
--
Linus Torvalds
Comments (none posted)
Jim Gettys and Kathleen Nichols have published
a comprehensive, updated
description of the bufferbloat problem in ACM Queue. "
For TCP
congestion avoidance to be useful to people using that link, a TCP
connection causing congestion must react quickly to changes in demand at
the bottleneck link, but TCP's reaction time is quadratic to the amount of
overbuffering. A link that is 10 times overbuffered not only imposes 10
times the latency, but also takes 100 times as long to react to the
congestion. Your short, interactive TCP connection loses completely to any
long-lived flow that has saturated your link."
Comments (58 posted)
Kernel development news
By Jonathan Corbet
December 6, 2011
Control groups are used for a number of purposes, but the most important
role for many is the application of resource limits to groups of
processes. The memory usage controller has seen a lot of development
effort in recent years, but it still has a number of shortcomings, one of
which being that it only applies to user-space memory. Some workloads can
cause the use of large amounts of memory in the kernel, leading to overall
system memory pressure, but the memory controller is unable to place limits
on that memory usage. A first step toward fixing that problem may yet
find its way into the 3.3 kernel, but it shows how hard the job can be.
Glauber Costa's per-cgroup TCP memory pressure
controls patch starts by adding a bit of infrastructure to the memory
controller for the tracking of kernel-space memory use. A pair of new
knobs is added: memory.kmem.limit_in_bytes controls how much
kernel memory a control group is allowed to use, while
memory.independent_kmem_limit is a boolean controlling whether the
kernel limits are to be managed separately from the user-space limits. If
that boolean is false, kernel memory usage is simply added to user-space
memory usage and the kernel memory limit is ignored. There is also a new
memory.kmem.usage_in_bytes value that can be read to see what the
current memory usage is.
This infrastructure is nice, but there is one other little difficulty:
there are thousands of places in the kernel where memory is allocated and
none of them are instrumented in a way that allows those allocations to be
charged to a specific control group. An implementation of that accounting
on a kernel-wide basis
may never happen; it certainly will not happen at the outset. So
developers seeking to add this functionality need to focus on specific data
structures. Some past work has been done to apply limits to the dentry
cache, but Glauber chose a different limit: buffers used in the
implementation of the TCP network protocol.
Those buffers hold packet data as it passes through a socket; in some
situations they can become quite large, so they are a logical place to try
to apply limits. What is even better is that the networking stack already
has logic to place limits on buffer sizes when memory is tight. In current
kernels, if the system as a whole is seen to be under memory pressure, the
networking code will do its best to help. Its responses can include
decisions not to increase the size of TCP windows, to drop packets, or to
refuse to expand buffers for outgoing data.
Given that this infrastructure was in place, all that Glauber needed to do
was to enhance the stack's idea of what constitutes "memory pressure."
That means instrumenting the places that allocate and free buffer memory to
charge those allocations to the owning control group. Then, "memory
pressure" becomes a combination of the previous global value and the
current control group's status with regard to its kernel memory limit. If
that limit has been exceeded, the stack will behave (with regard to that
control group's sockets) as if the system as a
whole were under memory pressure, even if the global state is fine.
The first patches did exactly that and seemed to be on their way toward
inclusion in the 3.2 merge window. It ran into a bit of a snag when the
core networking developers took a look at it, though. Networking
maintainer David Miller rejected the patch
out of hand, complaining about the overhead it adds to the networking fast
paths even in cases where the feature is not in use. He added:
People keep asking every few releases "where the heck has my
performance gone" and it's because of creeping features like this.
This socket cgroup feature is a prime example of where that kind of
stuff comes from.
I really get irritated when people go "oh, it's just one indirect
function call" and "oh, it's just one more pointer in struct sock"
We work really hard to _remove_ elements from structures and make
them smaller, and to remove expensive operations from the fast
paths.
There are a lot of important memory-allocation sites in the kernel that can
be found in relatively hot execution paths; the networking developers may
obsess about adding overhead to those paths, but they are certainly not
alone in that concern. So a solution had to be found that did not impose
that overhead on systems where the feature is not in use.
The words "in use" are important here as well. If kernel memory usage
limits are useful, distributors will want to enable them in the kernels
they ship. But it may still be that most users will not actually make use
of that feature. So it is not sufficient to remove its overhead only in
cases where it has been configured out of the kernel. A feature like this
really needs to avoid imposing costs when it is not in use, even when it is
configured into the running kernel.
Glauber had to make some significant changes to the patch set to meet these
requirements. TCP buffer limits are now managed separately from kernel
limits as a whole; there is a new knob (kmem.tcp.limit_in_bytes)
to control that limit. All of the fast-path code is now contained within a
static branch, meaning that, when the code
is not enabled, it can be skipped over with almost no overhead at all.
That static branch is only turned on when a TCP buffer limit is established
for a non-root control group. So, as required, it should not have
significant effects in kernels where the feature is configured in but not
being used.
As of this writing, no decision to merge these patches for 3.3 has been
announced, but the number of criticisms has been steadily dropping. So it
may well get into the mainline in the next merge window. But Glauber's
experience shows how hard it will be to add more kernel memory accounting;
the requirement that its impact be imperceptible will lead to more work and
trickier code. For that reason alone, accounting for all kernel memory use
seems unlikely to ever happen. Developers will focus on the few cases
where the application of limits can bring about significant changes in
behavior and not worry about the rest.
Comments (none posted)
By Jonathan Corbet
December 7, 2011
The kernel has always used small integer numbers to represent interrupt
(IRQ) lines internally; those numbers usually correspond to the numbers of
the interrupt lines feeding into the CPU, though it need not be that way.
The kernel has also traditionally used the value zero to indicate that
there is no IRQ assignment at all - except in the places where it hasn't.
That inconsistency has, after many years, popped up to bite the ARM
architecture, with the result that there may have to be a big scramble to
fix things for the 3.3 merge window.
While the x86 architecture and core interrupt code have used zero for "no
IRQ," various architectures have made other decisions, often using a value
of -1 instead. The reason for making such a choice is
straightforward: many architectures have a valid IRQ line numbered zero.
In that situation, a few options present themselves. That
inconveniently-numbered IRQ can be hidden away where nobody actually sees
it; that is what x86 does on systems where the timer interrupt shows up as
zero. Another option is to remap that interrupt line to a different number
in software so that IRQ 0 never appears outside of the core
architecture code. Or the architecture can treat zero as a valid IRQ
number and define a symbol like NO_IRQ to a different value; a
number of architectures have done that over time, though some have switched
to zero in recent years. But the ARM architecture still uses -1 as its
NO_IRQ value.
It is worth noting that NO_IRQ has been a topic of discussion at
various times over the years. The earliest appears to have been in
response to this
patch from Matthew Wilcox, posted in 2005. At that time, Linus made
it clear that he saw zero as the only valid NO_IRQ value. His
reasoning was that code reading:
if (!dev->irq)
/* handle the no-IRQ case */
is easier to read and more efficient to execute than comparisons against
some other value. Beyond that, IRQ numbers were (and are) represented by
unsigned variables in many parts of the kernel; values like -1 are
clearly an awkward fit in such situations. Given that, he said:
So my suggested (very _strongly_ suggested) solution is for people
to just consider "irq" to be a cookie, with the magical value 0
meaning "not there" but no inherent meaning otherwise. That just
solves all the fundamentally hard problems, and has no fundamental
problems of its own.
In the intervening six years, most architectures have followed that
suggestion, but ARM has not. That has mattered little thus far, but, as
ARM's popularity grows and the amount of code shared with the rest of the
kernel grows with it, that inconsistency is leading to problems. Some
recent changes to the device tree code threaten to break ARM entirely (for
some subplatforms, anyway) if things are not changed.
How hard will things be to fix? Alan Cox offered this suggestion:
I have so little sympathy over this that you'll need a quantum
physicist to measure it.... You've had years to fix it. If I were
you I'd delete NO_IRQ from your tree, type make and get it
done. It's not even a big job to clean it out.
As it happens, though, it's not quite that easy. The core ARM code will
need to remap IRQ 0 in places where it is a valid number. Arguably
the most unpleasant problem, though, was pointed out by Dave Martin:
Unfortunately, NO_IRQ is often not spelled "NO_IRQ". It looks like
the assumption "irq < 0 === no irq" may be
quite a lot more widespread than "NO_IRQ === no irq".
Since there's no specific thing we can grep for (and simply due to
volume) finding all such instances may be quite a bit harder.
So it will be necessary to find all of the places where the code assumes
that a negative IRQ number means "no interrupt," places where -1
has been hard coded, and so on. One can argue (as ARM maintainer Russell King has) that, as long
as request_irq(0) fails and physical IRQ 0 is handled
properly, most of that code will do the right thing
without being changed. But there is still a fair amount of stuff to clean up.
Putting it into perspective, though: the work required to fix
NO_IRQ in the ARM tree is likely to be tiny relative to the
general cleanup that is happening there. A bit of churn, some drivers to
fix, and ARM should come into agreement with the rest of the kernel on the
way to indicate a missing interrupt. That, in turn, will increase
robustness and allow the sharing of more code with the rest of the kernel -
not something to get too terribly irked about.
Comments (4 posted)
December 6, 2011
This article was contributed by Paul McKenney
Some of the most obscure corners of the Linux kernel are those
employing the dark arts of atomic instructions and memory barriers.
Although things are better than they were (say) ten years ago,
atomic instructions and memory barriers are still a not-infrequent
source of confusion.
One of the difficulties with atomic instructions and especially with
memory barriers is that the documentation for the underlying machine
instructions can be a bit difficult to understand.
It is not hard to work around obtuse documentation for arithmetic
instructions: Simply test them.
However, simple testing
does not work for memory barriers and atomic instructions
because these
instructions must handle non-deterministic situations.
Furthermore, unlike arithmetic instructions, a given piece of hardware
is free to be more strict that absolutely necessary
in its interpretation of ordering requirements.
Therefore, your test results might
not apply to the next generation of that same hardware,
which in turn might result in your code breaking horribly.
Quick Quiz 1:
Yeah, right!!!
Since when has anything ever made the Linux kernel's memory barriers
and atomic instructions easier???
Answer
What we need is a tool that evaluates a code sequence
containing atomic operations and memory barriers, telling the developer
whether or not a given assertion is guaranteed never to trigger.
Given such a guarantee, the developer could then be reasonably confident
that this code sequence would operate correctly on all current and future
versions of the hardware in question.
Previous LWN articles
(here and
here), as well as Mathieu
Desnoyers's
doctoral
dissertation [PDF],
describe how to use Promela and spin
to test assertions on
code fragments that execute concurrently.
Because Promela and spin
search the code fragments' full state space,
the developer can be reasonably confident that the
code fragments will operate correctly on real hardware.
Unfortunately, most code fragments using atomic instructions
and memory barriers require detailed knowledge
of the semantics of a given CPU's memory barriers and atomic instructions.
Promela and spin
do not have such knowledge, so the developer must explicitly write
Promela/spin
code that captures this detailed knowledge.
This is not helpful if the developer in question does not possess
this detailed knowledge in the first place.
It would clearly be much better to have a tool that
already understands this detailed knowledge.
I am happy to report that such tools are now becoming available,
for example, from
Peter Sewell and Susmit Sarkar at the University of Cambridge,
Luc Maranget, Francesco Zappa Nardelli, and Pankaj Pawan at INRIA, and
Jade Alglave at Oxford University.
This group has done some excellent work over the past few years
formalizing memory models for real computer systems,
including that of ARM, Power,
some additional 64-bit members of the Power family, x86
(and yes, academics consider x86 to be a weak-memory system),
and C/C++11's new memory model.
A prototype interactive version of this tool is
available for
ARM and
Power.
This article describes this tool as follows:
-
Anatomy of a Litmus Test.
-
What Does This Litmus Test Mean?
-
Running a Litmus Test.
-
Full State-Space Search.
-
Conclusions.
Following this is the obligatory
Answers to Quick Quizzes.
The input to this tool is a “litmus test”
that consists of assembly language, register initializations, and
an assertion, for example,
as follows for the
Power interface to the model:
1 PPC SB+lwsync-RMW-lwsync+isync-simple
2 ""
3 {
4 0:r2=x; 0:r3=2; 0:r4=y; 0:r10=0 ; 0:r11=0; 0:r12=z ;
5 1:r2=y; 1:r4=x;
6 }
7 P0 | P1 ;
8 li r1,1 | li r1,1 ;
9 stw r1,0(r2) | stw r1,0(r2) ;
10 lwsync | sync ;
11 | lwz r3,0(r4) ;
12 lwarx r11,r10,r12 | ;
13 stwcx. r11,r10,r12 | ;
14 bne Fail1 | ;
15 isync | ;
16 lwz r3,0(r4) | ;
17 Fail1: | ;
18
19 exists
20 (0:r3=0 /\ 1:r3=0)
The ARM interface works exactly the same way, but with ARM
instructions substituted for the Power instructions and with
the initial “PPC” replaced by “ARM”.
You can select the ARM interface by clicking on
“Change to ARM Model” at the
web page called out above.
In the example, line 1 identifies the type of system
(“ARM” or
“PPC”) and contains the title for the model.
Line 2 provides a place for an alternative name for the test,
which you will usually want to leave blank as shown in the above example.
Comments can be inserted between lines 2 and 3 using the
OCaml (or Pascal) syntax of “(* *)”.
Lines 3-6 give initial values for all registers; each is of the
form “P:R=V”, where “P” is the process identifier,
“R” is the register identifier, and “V” is the
value.
For example, process 0's register r3 initially contains the
value 2.
If the value is a variable (“x”, “y”, or
“z” in the example) then the register is initialized
to the address of the variable.
It is also possible to initialize the contents of variables,
for example, “x=1” initializes the value of “x”
to 1.
Uninitialized variables default to the value zero, so that
in the example,
“x”, “y”, and “z”
are all initially zero.
Line 7 provides identifiers for the two processes, so that
the “0:r3=2” on line 4 could instead have been
written “P0:r3=2”.
Line 7 is required, and the identifiers must be of the form
“Pn”, where “n” is the column number, starting
from zero for the left-most column.
This may seem unnecessarily strict, but it does prevent considerable
confusion in actual use.
Quick Quiz 3:
Why does line 8 initialize the registers?
Why not instead initialize them on lines 4 and 5?
Answer
Lines 8-17 are the lines of code for each process.
A given process can have empty lines, as is the case for P0's
line 11 and P1's lines 12-17.
Labels and branches are permitted, as demonstrated by the
branch on line 14 to the label on line 17.
That said, too-free use of branches will expand the state space.
Use of loops is a particularly good way to explode your state space.
Lines 19-20 show the assertion, which in this case indicates
that we are
interested in whether P0's and P1's r3 registers can both contain
zero after both threads complete execution.
This assertion is important because there are a number of use cases
that would fail miserably if both P0 and P1 saw zero in their respective r3
registers.
This should give you enough information to construct simple litmus
tests.
Some additional documentation is
available, though
much of this additional documentation is intended for a different
research tool that runs tests on actual hardware.
Perhaps more importantly, a large number of pre-existing litmus tests
are available with the
online tool
(available via the “Select ARM Test” and
“Select POWER Test” buttons).
It is quite likely that one of these pre-existing litmus tests will
answer your Power or ARM memory-ordering question.
P0's lines 8 and 9 are equivalent to the C statement
x=1 because
line 4 defines P0's register r2 to be the
address of x.
P0's lines 12 and 13 are the mnemonics for load-linked
(“load register exclusive” in ARM parlance and
“load reserve” in Power parlance)
and store-conditional (“store register exclusive” in ARM
parlance), respectively.
When these are used together, they form an atomic instruction sequence,
roughly similar to the compare-and-swap sequences exemplified by
the x86 lock;cmpxchg instruction.
Moving to a higher level of abstraction, the sequence from lines 10-15
is equivalent to
the Linux kernel's atomic_add_return(&z, 0).
Finally, line 16 is roughly equivalent to the C statement
r3=y.
P1's lines 8 and 9 are equivalent to the C statement
y=1, line 10 is a memory barrier, equivalent to the Linux
kernel
statement smp_mb(), and line 11 is equivalent to
the C statement r3=x.
Putting all this together, the C-language equivalent to the entire
litmus test is as follows:
1 void P0(void)
2 {
3 x = 1; /* Lines 8 and 9 */
4 atomic_add_return(&z, 0); /* Lines 10-15 */
5 r3 = y; /* Line 16 */
6 }
7
8 void P1(void)
9 {
10 y = 1; /* Lines 8-9 */
11 smp_mb(); /* Line 10 */
12 r3 = x; /* Line 11 */
13 }
Quick Quiz 5:
Can both threads' r3 registers be equal to
zero after both threads complete?
Answer
Now that we have a meaningful litmus test, it is time to run it.
To run the litmus test, paste it into the textbox at
http://www.cl.cam.ac.uk/~pes20/ppcmem/, but use
a version without line numbers.
Press the "Interactive" button at the lower left-hand side of the
page, and you should then see something like this:
(Click on the images for full-size versions)
The upper portion of the screen beginning with
“Storage subsystem state” tracks an abstraction of the state
of the processor's write buffers, caches, and cache-coherence messages.
The next two portions, beginning with “Thread 0 state”
and “Thread 1 state”, track the instruction and memory-reference
progress of each of the two threads.
For more details on the information presented by this tool, refer to
the PLDI 2011 paper entitled
“Understanding Power Multiprocessors”, which may be found
here.
Because this model is intended for a super-scalar weak-memory system,
a given thread can have several different operations that might
happen at a given time.
For example, for thread 0 there are two operations, indicated by
the links (“Commit” and “Read reserve k:W z 0”),
while thread 1 has only a single “Commit” link.
Clicking on a given link allows you to manually advance the state of
the system.
In this particular case, it is most instructive to start by
clicking the links for thread 0, ignoring thread 1 completely
and also ignoring any links that appear in the
“Storage subsystem state” section preceding thread 0.
There will come a time when the “stwcx” instruction gives
you a choice between “Commit (fail)” and
“Commit (succ)”.
Choosing “Commit (succ)” results in the following state:
At this point, it is time to click the links for thread 1,
resulting in the following state:
Now, click the last link, which is in the
“Storage subsystem state” section and is labeled
“(0:) Write propagate to thread: f:W y 1 to Thread 0”,
and then click the
“(0:) Barrier propagate to thread: g:Sync to Thread 0”
that replaces it, and then the second link (labeled
“Write reaching coherence point: f:W y 1”),
and finally the
“Acknowledge sync: Sync g:Sync”.
This results in the following state:
This will result in thread 1 gaining a
“Read i:W x 0” link.
Click this and the “Commit” link that replaces it.
Then click all the remaining links in any order, resulting in
the following final state:
Quick Quiz 6:
But suppose that I didn't think to try this particular
execution sequence.
How could I have found this bug using this tool?
Answer
Quick Quiz 7:
Does the ARM Linux kernel have a similar bug?
Answer
This shows that the values loaded into both threads'
r3 registers to be zero, in violation of the
requirements set forth in
Documentation/atomic_ops.txt.
This bug is easily fixed by replacing the isync instruction
with sync, and a
patch
exists to carry out this fix.
The fact that this tool is capable of finding this sort of
bug is a testament to its usefulness.
The interactive web-based tool is fun to use and can be quite enlightening,
but it is very difficult to make sure that you have checked for every
possible error condition.
For that reason, there is an executable tool that conducts a full
state-space search, whose source code may be downloaded
here,
covered by the BSD license, with some components licensed under
LGPL with an
exception for linking.
The tool may be built by following the instructions in the README file,
which requires a recent version of
OCaml
(3.12.0 works for me, but 3.11.2 doesn't cut it).
Please note that this tool is a research prototype that is completely
unsupported.
That said, I have found it to be quite useful.
The tool is run from its build directory.
For litmus tests containing only normal memory accesses and memory
barriers, the following command suffices:
./ppcmem filename.litmus
Quick Quiz 8:
But when I built the tool, I didn't find an
armmem
program.
How am I supposed to test ARM code fragments???
Answer
where “filename.litmus” is the path to the file containing
your litmus test.
The tail end of the output of this tool when presented the litmus
test discussed earlier is as follows:
States 6
0:r3=0; 1:r3=0;
0:r3=0; 1:r3=1;
0:r3=1; 1:r3=0;
0:r3=1; 1:r3=1;
0:r3=2; 1:r3=0;
0:r3=2; 1:r3=1;
Ok
Condition exists (0:r3=0 /\ 1:r3=0)
Hash=e2240ce2072a2610c034ccd4fc964e77
Observation SB+lwsync-RMW-lwsync+isync-simple Sometimes 1
The list of states includes “0:r3=0; 1:r3=0;”, indicating
once again that the old powerpc implementation of
atomic_add_return() does not act as a full barrier.
The “Sometimes” on the last line confirms this: the
assertion triggers for some executions, but not all of the time.
As before, the fix is to replace P0's isync with sync,
which results in the following at the end of the tool's output:
States 5
0:r3=0; 1:r3=1;
0:r3=1; 1:r3=0;
0:r3=1; 1:r3=1;
0:r3=2; 1:r3=0;
0:r3=2; 1:r3=1;
No (allowed not found)
Condition exists (0:r3=0 /\ 1:r3=0)
Hash=77dd723cda9981248ea4459fcdf6097d
Observation SB+lwsync-RMW-lwsync+sync-simple Never 0 5
This output confirms the fix: “0:r3=0; 1:r3=0;” does
not appear in the list of states, and the last line calls out
“Never”.
Therefore, the model predicts that the offending execution sequence
cannot happen.
These tools promise to be of great help to people working on
low-level parallel primitives that run on ARM and on Power.
These tools do have some intrinsic limitations:
- These tools do not constitute official statements by
IBM or ARM on their respective CPU architectures.
For example, both corporations reserve the right to report
a bug at any time against any version of any of these tools.
These tools are therefore not a substitute for careful stress
testing on real hardware.
Moreover, both the tools and the model that they are based on
are under active development and might change at any time.
On the other hand, this model was developed in consultation
with the relevant hardware experts, so there is good reason
to be confident that it is a robust representation of the
architectures.
- These tools currently handle a subset of the instruction set.
This subset has been sufficient for my purposes, but your
mileage may vary.
In particular, the tool handles only word-sized accesses
(32 bits), and the words accessed must be properly aligned.
In addition, the tool does not handle some of the weaker
variants of the ARM memory-barrier instructions.
- The tools are restricted to small loop-free code fragments running
on small numbers of threads.
Larger examples result in state-space explosion, just as
with similar tools such as
Promela and spin.
- The full state-space search does not give any indication of
how each offending state was reached.
That said, once you realize that the state is in fact reachable,
it is usually not too hard to find that state using the interactive
tool.
- The tools will detect only those problems for which you code
an assertion.
This weakness is common to all formal methods, and is yet another
reason why testing remains important.
In the immortal words of Donald Knuth,
“Beware of bugs in the above code; I have only proved it
correct, not tried it.”
That said, one strength of these tools is that they are designed
to model the full range of behaviors allowed by the architectures,
including behaviors that are legal, but which current hardware
implementations do not yet inflict on unwary software developers.
Therefore,
an algorithm that is vetted by these tools likely has some additional
safety margin when running on real hardware.
Furthermore, testing on real hardware can only find bugs; such testing
is inherently incapable of proving a given usage correct.
To appreciate this, consider the table in Section 8 on page 10 of
Understanding POWER Multiprocessors [PDF]:
the researchers routinely ran in excess of 100 billion test runs
on real hardware to validate their model.
In one case, behavior that is allowed by the architecture did not occur,
despite 176 billion runs.
In contrast, the full-state-space search allows the tool to prove
code fragments correct.
It is worth repeating that formal methods and tools are no
substitute for testing.
The fact is that producing large reliable concurrent software
artifacts, the Linux kernel for example, is quite difficult.
Developers must therefore be prepared to apply every tool at
their disposal towards this goal.
The tools presented in this paper are able to locate bugs that are
quite difficult to produce (let alone track down) via testing.
On the other hand, testing can be applied to far larger bodies of
software than the tools presented in this paper are ever likely
to handle.
As always, use the right tools for the job!
Of course, it is always best to avoid the need to work at
this level by designing your parallel code to be easily partitioned
and then using higher-level primitives (such as locks, sequence counters,
atomic operations, and RCU) to get your job done more
straightforwardly.
And even if you absolutely must use low-level memory barriers and
read-modify-write instructions to get your job done, the more
conservative your use of these sharp instruments, the easier
your life is likely to be.
I am grateful to
Peter Sewell and Susmit Sarkar at the University of Cambridge,
Luc Maranget, Francesco Zappa Nardelli, and Pankaj Pawan at INRIA,
Jade Alglave at Oxford University, and a number of their colleagues
for their efforts on this research topic.
We all are indebted to Ben Herrenschmidt for fixing the bug noted in
this article.
I am thankful to great number of members of the C and C++ standards committees
for many stimulating discussions on memory models and concurrency,
including Hans Boehm, Lawrence Crowl, Peter Dimov, Clark Nelson, Raul Silvera,
Mark Batty, N.M. Maclaren, Anthony Williams, Blaine Garst, Scott Owens,
Tjark Weber, Michael Wong, Benjamin Kosnik, and Bjarne Stroustrup.
I owe thanks to
Derek Williams and Richard Grisenthwaite for their patient
explanations of the Power and ARM memory models, respectively,
and to Jim Wasko of IBM and Dave Rusling of Linaro for their support of
this effort.
Quick Quiz 1:
Yeah, right!!!
Since when has anything ever made the Linux kernel's memory barriers
and atomic instructions easier???
Answer:
There really have been some improvements over the years, including updates
to Documentation/memory-barriers.txt and
Documentation/atomic_ops.txt.
Perhaps more important, the scripts/checkpatch.pl script
complains if memory barriers are not accompanied by a comment, which
has made the code using memory barriers a bit less obscure.
But yes, more help is needed, hence this article.
Back to Quick Quiz 1.
Quick Quiz 2:
But what about x86?
Answer:
The ppcmem tool could in fact be extended to handle x86 as well as ARM
and Power, but there are no definite plans to carry out this work at
the moment.
Back to Quick Quiz 2.
Quick Quiz 3:
Why does line 8 initialize the registers?
Why not instead initialize them on lines 4 and 5?
Answer:
Either way works.
However, in general, it is better to use initialization than explicit
instructions.
The explicit instructions are used in this example to demonstrate
their use.
In addition, many of the litmus tests available on the tool's
web site
were automatically generated, which generates explicit initialization
instructions.
Back to Quick Quiz 3.
Quick Quiz 4:
But whatever happened to line 17, the one that is
the Fail: label?
Answer:
The implementation of powerpc version of atomic_add_return()
loops when the stwcx instruction fails, which it
communicates by setting non-zero status in the condition-code register,
which in turn is tested by the bne instruction.
Because actually modeling the loop would result in state-space
explosion, we instead branch to the Fail: label,
terminating the model with the initial value of 2 in thread 1's
r3 register, which will not trigger the exists
assertion.
There is some debate about whether this trick is universally
applicable, but I have not seen an example where it fails.
(Famous last words!)
Back to Quick Quiz 4.
Quick Quiz 5:
Can both threads' r3 registers be equal to
zero after both threads complete?
Answer:
Not if atomic_add_return() acts as a full barrier,
as is set out as required behavior in
Documentation/atomic_ops.txt.
Back to Quick Quiz 5.
Quick Quiz 6:
But suppose that I didn't think to try this particular
execution sequence.
How could I have found this bug using this tool?
Answer:
You can easily miss problematic execution sequences when using
the interactive tool.
Which is why the next section covers the full state-space-search tool.
Back to Quick Quiz 6.
Quick Quiz 7:
Does the ARM Linux kernel have a similar bug?
Answer:
ARM does not have this particular bug, given that it places
smp_mb() before and after the
atomic_add_return() function's assembly-language
implementation.
Finding any other bugs that ARM might have is left as an exercise
for the reader.
Back to Quick Quiz 7.
Quick Quiz 8:
But when I built the tool, I didn't find an armmem
program.
How am I supposed to test ARM code fragments???
Answer:
Just replace the litmus file's initial “PPC” with “ARM”
and put ARM assembly language into your litmus file and pass it
to the ppcmem program.
Yes, ppcmem is multilingual.
Perhaps this is because the researchers who created it have a variety of
native languages.
Or maybe because none of them live or work in the USA. ;–)
Back to Quick Quiz 8.
Quick Quiz 9:
Where can I find out more about the formal model underlying this tool?
Answer:
Currently the best source of information is the PLDI 2011 paper entitled
“Understanding Power Multiprocessors”, which may be found
here.
This paper covers memory barriers, but not atomic instruction sequences.
Additional papers that include atomic-instruction-sequence extensions
to this model are likely to appear Real Soon Now.
Back to Quick Quiz 9.
Comments (9 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>