August 3, 2011
This article was contributed by Neil Brown
Our story begins with an observation made by Jon Corbet in a
recent article about delays in the
release of Linux-3.0:
[I]f we reach a point where almost nobody can understand, review,
or fix some of our core code, we may be headed for long-term
trouble.
While undeniably true, it is not clear that this note of gloom is
entirely justified by the immediate circumstances - after all the bug
was found and fixed to everyone's satisfaction in a little over 24
hours. However it is clear that it wasn't just this event that
triggered the observation. Earlier we read:
Our once approachable and hackable kernel has, over time, become
more complex and difficult to understand.
What was once "approachable" is now "difficult to understand". If these
words were used to describe a life-partner, we might be looking to
choose between a divorce settlement or counseling, though often it is
a healthy dose of understanding that is really needed.
The human touch
The allusion to human relationships is not as far fetched as it
might seem. In his excellent book
"The Maths Gene",
Keith Devlin writes about Wim Klein,
"a famous calculating wizard who in the days before electronic
computers once held a professional position with the title
'computer'".
Apparently Klein was known to say: "Numbers are friends to me." and referring to
3844 he says:
For you it's just a three and an eight and a four and a four, but
I say, "Hi 62 squared!"
The point Devlin is leading to is summed up thus:
To put it simply, mathematicians think about mathematical objects
and the mathematical relationships between them using the same
mental facilities that the majority of people use to think about
other people.
The idea is that we can treat abstract ideas more like friends than
tools and understand them holistically instead of just functionally.
If we get to know their preferences and peculiarities we can side step
the fact that they are sometimes difficult to understand and once
again find them approachable.
One more quote from that recent LWN article will start us on the path to
understanding what it means to be friends with abstractions. In a
parenthetical comment concerning an assignment of NULL to d_inode,
Jon writes:
(It's worth noting that this behavior goes against normal RCU
practice, which calls for the structure to be preserved unmodified
until the last reference is known to be gone)
This is just the sort of thing that suggests a holistic "friendly"
relationship rather than a more distant formal one. Something
just doesn't seem right, "He doesn't usually behave this way". Being
able to make that sort of observation helps you focus in and ask
incisive questions. You should then either be able to help your
friend, or else understand them at a deeper level. We'll come back to
this particular observation later, but first I want to demonstrate the
effectiveness of this sort of friendship in a context that all
experienced C programmers should find very familiar.
"For": he's a jolly good fellow
A simple and common construct in C programs is the "for" loop. It is
extremely flexible and can be used like a "while" loop, can iterate over
a set of numbers, can walk a linked list, and can perform many more tasks.
A common form that you might expect to see quite often is:
for (i = 0; i < count; i++)
This will execute the body of the loop "count" times
assigning "i" to
each value from "0" to "count - 1" in turn. There are similar
constructs that are almost as common and familiar such as
for (i = 0; i <= final; i++)
or
for (i = 1; i <= count; i++)
which are really just minor variations on the theme. But if you see
for (i = 0; i <= count; i++)
a seasoned C programmer, familiar with the ways of the "for", will
immediately think that something is odd. Not necessarily wrong, but
certainly odd. This might be an error, but it could also be that "count" has
been misnamed and is really a "final" rather than a "count" sort of
value (an ordinal rather than a cardinal). Or maybe the loop
really has a good reason for executing "count + 1" times. At this
point the friendship has served a valuable purpose by identifying a
possible issue, but must give way to a more detailed analysis to
determine if there really is a problem.
We can make this very concrete by examining the Linux kernel source
code (v3.0) which is often a good source of interesting code fragments. A search
for the above pattern can be achieved with:
git grep 'for.*(.*= *0.*;.*<=.*c[ou]*nt.*;.*)'
which results in:
arch/ia64/kernel/cpufreq/acpi-cpufreq.c: for (i = 0; i <= data->acpi_data.state_count; i++)
drivers/gpu/drm/radeon/evergreen_cs.c: for (i = 0; i <= pkt->count; i++, idx++, reg += 4) {
drivers/gpu/drm/radeon/r100.c: for (i = 0; i <= pkt->count; i++, idx++) {
drivers/gpu/drm/radeon/r100.c: for (i = 0; i <= (pkt->count + 1); i++, idx++) {
drivers/gpu/drm/radeon/r100.c: for (j = 0; j <= count; j++) {
drivers/gpu/drm/radeon/r600.c: for (j = 0; j <= count; j++) {
drivers/gpu/drm/radeon/r600_cs.c: for (i = 0; i <= pkt->count; i++, idx++, reg += 4) {
drivers/i2c/algos/i2c-algo-pcf.c: for (i = 0; i <= count; i++) {
drivers/staging/rts_pstor/rtsx_transport.c: for (i = 0; i <= buf_cnt / (HOST_SG_TBL_BUF_LEN / 8); i++) {
fs/ufs/balloc.c: for (pos = 0; pos <= uspi->s_fpb - count; pos++) {
scripts/docproc.c: for (i=0; i <= symfilecnt; i++)
A closer examination of each of these is left as an exercise for the
reader. The chance of finding a genuine bug is very high, though it
is likely that the bugs are not very serious or they would
have already been fixed.
The success in finding a bug even when we weren't really looking for
one is evidence of the value of building strong relationships with
important abstractions as we can learn to recognize early warning
signs. Unfortunately the building of strong relationships can be
time-consuming and so is often neglected, being left to those with a
passion rather than those who simply have a task to do.
In human relationships a friendship can blossom more quickly if a
mutual friend acts to introduce two parties and start them out on a
sound footing. The same can work for abstractions. While much has
been written about RCU ("Read-Copy-Update" which - from the name
"rcu-walk" seems to be the center-point of the aforementioned bug), it
has largely been of a formal style. The documentation in
linux/Documentation/RCU/
and on
Wikipedia
is, not unlike a résumé, suitable for an introduction to a
potential employer, but not so much as an introduction to a future
friend. What follows is an attempt to present RCU together with other
related technologies in a somewhat different style aimed at building
friendships - what you might call the human-interest perspective.
A family affair
While it is wrong to judge someone based on their parents, knowing their
family background can certainly lead to a deeper understanding.
So we will start by introducing RCU's parents who are, for the purpose
of this tale, Sir Spinlock and Dame Refcount - both recently honored
for services to computer engineering.
Sir Spinlock is somewhat "old school" and a stickler for doing
everything "just right". When he protects a data structure he does it
"properly" (as he says) and ensures everything is consistent, stable
and completely up-to-date. His wife, Dame Refcount, feels that her
husband takes a rather short-term view of things and, like many
mothers, just wants to be sure that her charges are safe and can be
found when needed (comments about apron strings are generally not
appreciated).
The gentleman and his lady had another child much earlier (too early
some say) who was named "Rwlock". When write access is needed, he is
much like his father becoming a short term exclusive lock. When read
access is wanted he is more like his mother, keeping count of all the
different readers. However on the whole he seems to have received the
worst of both worlds, lacking the long term connections of his mother,
or the simple fairness available to his father. Master Rwlock doesn't
get out much these days.
In some ways Rwlock is of the same generation as his parents as none
of them really seem to understand the problem of cache line bouncing.
Sir Spinlock likes to say that "Cache lines belong in banks" and seems
to think he is being awfully funny, but it just makes the younger
generation feel awkward, smile weakly, and try to change the subject.
That younger generation are the twins - RCU and his sister Seqlock.
They both grew up in the shadow of Sir Spinlock's dominance of
multiprocessing and each reacted in their own way. They have both learned
the importance of stability but were culturally more focused on the
appearance than the reality, and technologically more inventive in how
to provide it.
RCU explains that he agrees with his dad's mantra of "Always
consistent, always stable" but finds that the "always up-to-date" idea is
over-rated. He himself finds that he is usually running late for
something or other and doesn't see that it has hurt him all that much.
RCU is a big fan of the photo-copiers and realized early that the best
way to preserve the appearance of stability was not to change the
original at all but to make a copy and work on that. So while RCU
always provides stable data, there might be something newer that he is
working on that he won't let you see just yet. RCU's real skill is
understanding use-by dates. What you get might not be perfectly
fresh, but it will never become stale.
His sister, Seqlock, takes a very different approach to the same
problem learning from her favorite piece of technology - the digital
camera. Like many of us she just takes lots and lots of photos and picks the
ones she likes. "All of them are stable", she says, "and it isn't too
hard to tell when one isn't consistent - just throw those away". You
might not think it is so easy, but then this is her special skill -
seeing the blur in the photo and taking another. She is happy to help and
tell you when a new photo is needed.
Another notable contrast between the twins is reflected in their
mother's pet names for them: "replace" and "in-place" (she sometimes
calls them "The place twins" which they find very irritating). RCU
insists on updating things by replacing them so he is happiest with
relatively small objects without too much linkage which would need to
be re-threaded. He "doesn't work on books", he says, "just
collections of pages, so don't try looking at two pages at once!".
When he does have these small objects he can replace them with
lightning speed, so others hardly even notice anything has changed.
Seqlock expects people to take a photo if they are interested so she
just updates things when ever she likes, where ever they are - just
being sure to add enough "blur" so that if she is caught "in the act"
another photo will be taken. This means she can handle much bigger
containers than her brother, and containers with lots of connections.
Even a wide-angle lens can notice the blur and take a second photo.
A chip off the old lock
When we look more closely at RCU to see how he achieves his flashy
performance we can see echoes of both his parents at work. When he was
younger he found this mildly embarrassing but since they were both
knighted he has found a new respect for his heritage.
From his father, RCU appreciates the value of a quick change when
no-one else can see and has refined this down to a single pointer
update which modern hardware guarantees to be atomic. When he is
really trying to impress he uses the "xchg()" atomic operation to
replace an old structure with a new structure without even needing any
locking for an update (though there are some costs to this and it is
mostly just a party trick - but see
net/ipv4/cipso_ipv4.c:cipso_v4_req_setattr()).
This observation about the centrality of the pointer is crucial to
understanding RCU's sleight-of-hand. It is easy to get distracted by
the "rcu_read_lock() / rcu_read_unlock()" calls and think that they
provide some sort of atomicity, but they don't. Rather they come from
his mother's side of the family as we will see shortly. As the
atomicity is constrained to a single pointer, two separate pointer
accesses - even in the one rcu_read_lock()ed section - will not be
coherent. This means it is very important not to dereference the same
pointer twice (you might get two different values) and if you need to
read two separate pointers, be very careful about assuming any
relationship between the two values - there might be one but you need
to understand the rest of the code to be sure. The most common case
of dereferencing multiple pointers in the one rcu_read_lock()ed region
is when following a linked list. Each pointer found leads directly to
the next. In this case you can be sure that each consecutive pair will be ordered
if there is ordering in the list, but you cannot be sure you will see
all entries that are in the list when you started, or that you have
seen all the entries that are in the list when you finish.
Now that we know that friends don't let RCU-friends assume coherence
outside of a single protected structure, it should be easy to see
that something looks wrong in
kernel/cgroup.c:cgroup_attach_proc().
The relevant code snippet - very well commented which helps a lot -
is:
rcu_read_lock();
if (!thread_group_leader(leader)) {
/*
* a race with de_thread from another thread's exec() may strip
* us of our leadership, making while_each_thread unsafe to use
* on this task. if this happens, there is no choice but to
* throw this task away and try again (from cgroup_procs_write);
* this is "double-double-toil-and-trouble-check locking".
*/
rcu_read_unlock();
retval = -EAGAIN;
goto out_free_group_list;
}
where:
#define thread_group_leader(p) (p == p->group_leader)
is defined elsewhere.
This code is assuming that rcu_read_lock() will provide coherence
between this access of "leader->group_leader" here and other accesses to
something in the "leader" structure later on in "while_each_thread()".
While this is not totally impossible, as a well placed
"synchronize_rcu()" between updates could lead to some coherence, it
is not the usual way RCU is used and questions need to be asked. A
discussion with the author of this code suggests that what is really
needed is a read_lock() on "tasklist_lock".
So just like with the case of the "for" loop the fact that something
"doesn't look right" isn't a guarantee that it is wrong but it is
certainly an indication that the code is worth checking.
From his mother, RCU learned the value of counting to keep track of things. While his mum
has a counter in each individual object, RCU tracks the whole system state (or at least that part
entrusted to him) with a single counter (though admittedly it is spread across all CPUs). The
details of getting this just right are
are rather tricky, but as mentioned earlier, that is his
real skill.
Sometimes he thinks this one skill is over-rated and finds a real
kinship with the artists and musicians he knows who can only find work
in copying the ancient masters or teaching unwilling children to play.
In RCU's case the skill that most people pay for is not the "quick
change, always stable" trick that he is so proud of, but the "never
stale" promise he has to make to achieve this. Among the several
hundred times that RCU is employed by the Linux kernel, a substantial
majority simply use RCU as an extra, cheap, reference count to stop
things from going stale until they are really not needed. In a number
of cases this is very explicit in a "put" function. For example
kernel/audit_tree.c has:
static inline void put_tree(struct audit_tree *tree)
{
if (atomic_dec_and_test(&tree->count))
call_rcu(&tree->head, __put_tree);
}
which makes it very clear where RCU stands - one step below his mother
(we note that she isn't complaining, only him).
Even when RCU does get to use his quick-change trick it is usually to
simply replace an old value with a new value - no real copying is
done. Like most of us, he hardly ever uses his middle name.
Some reasonably simple examples of true RCU where the old value is
copied and updated are in
fs/afs/security.c:afs_cache_permit()
and
drivers/md/linear.c:linear_add().
In both these cases an array needs to
be made bigger, so a new structure is allocated and RCU is used to
slip it into place without anyone noticing.
Others, possibly more genuine (as they aren't just for a size change),
can be found where "list_replace_rcu()" or
"hlist_replace_rcu()" are
used (about a dozen places in Linux-3.0). For example in
net/ipv4/fib_trie.c:fib_table_insert(),
RCU is used to update several fields in a "fib_alias" atomically.
Seeking Seqlock's secret
Returning to Miss Seqlock and examining more closely her relationship
with her parents we find that she is very close to her father and
depends on him to provide exclusion from other writers when writing to
a protected data structure (though in the Linux kernel she has a more
independent alter ego named "Seqcount" who leaves the write-side
exclusion to be taken on by whoever is available).
On the other hand she shows her rebellious side by not requiring a
strict bracketing of get/put calls. Sir Spinlock's calls must always
pair a spin_lock() with spin_unlock() and RCU must always follow
rcu_read_lock() by rcu_read_unlock(). Dame Refcount almost always
pair "increment" with "decrement" though sometimes when there is only
one reference left she won't bother decrementing but just throws the
object away. Seqcount - possibly inspired by her mother, only pretends
to need balance, on the read side at least. Her standard pattern is
to balance "read_seqbegin()" with "read_seqretry()"
as is shown quite eloquently in, for example,
fs/nfs/nfs4state.c:nfs4_copy_stateid():
do {
seq = read_seqbegin(&state->seqlock);
memcpy(dst, &state->stateid, sizeof(*dst));
} while (read_seqretry(&state->seqlock, seq));
We have begin, copy, and retry. But it doesn't have to be like
that. A simple variation can be seen in
fs/cifs/dir.c:build_path_from_dentry()
where three different error cases can return from the middle of the loop.
In any other locking scheme the error path would need to unlock or
release whatever was being held, but not with Miss Seqlock. When she
is in charge you can check-out any time you like.
While this example copies the entire protected structure that is not
necessary and would be prohibitive when Seqlock is protecting a large
or complex structure. Rather it is only necessary to copy the pieces
of interest. If they prove interesting you might copy some more bits
and some more, each time only checking "read_seqretry()" to make sure
the entire picture you are forming is valid (and not blurred). This
is not yet common in Linux with the new scalable dcache being the only
advanced user. This first exposure may yet gain more friends for
Seqlock, though, and we can expect to see more of her as time passes.
It's all about teamwork
We have already noted that Seqlock usually depends on her dad for help
with write-side exclusion and that RCU often helps out his mum.
The co-operation doesn't end there. The parents have long worked
together (it was in the middle of an "atomic_dec_and_lock()"
that he proposed) and the twins carry on that tradition. Some times
the family members take turns in protecting a structure and sometimes they share out
responsibility for different fields, so Sir Spinlock might protect
some fields in a structure while RCU guarantees that others won't
change except by copying the whole.
As we now turn back to look at the context of the bug that started
this discussion which was in the
new path name lookup code
we find that all four (main) members of the family are making
significant contributions and working together.
As the
earlier article on this topic
outlines, Sir Spinlock and Dame Refcount are used for the
"always works but sometimes slow" ref-walk option, while RCU and
Seqlock help out for the "often works, always fast" rcu-walk option.
This allows us to compare and contrast them in a real-life context.
Where ref-walk uses reference counts on one dentry after another to
keep each dentry active, rcu-walk uses RCU to keep an effective
reference count on all dentrys. Where ref-walk uses spinlocks to get
a stable view of each dentry, rcu-walk uses seqlock to avoid using an
unstable view.
To really understand this bug some degree of friendship with VFS is
needed too, but fortunately not much. Central to the issue is the
"d_inode" field of a dentry. It is fairly special
because it doesn't change much. The only valid changes are from
NULL to a valid inode pointer, and from a valid inode pointer
back to NULL. And that second one is special too - it is only
allowed to happen if there are no other references to the dentry.
The result of this in the old dcache (before rcu-walk) is that you
don't need any locking for read access to d_inode, only a
counted reference. If it is NULL, then you know the file doesn't
exist. If it is not NULL, then the value is the file's inode and it
won't change - the reference you have on the dentry becomes an implied
reference on the inode.
As rcu-walk largely mapped spinlocks to seqlocks and refcounts to RCU
you might assume that an RCU reference is sufficient to make accessing
d_inode safe but it isn't quite that simple. To get a new
reference on a dentry you need to be holding a spinlock (hence
dget_locked() is the function to use). So while you only need a
reference to access d_inode, you need to have previously
taken a spinlock. This doesn't map directly into RCU usage as
getting an RCU reference (with rcu_read_lock()) can never
require a spinlock. So in rcu-walk, both Seqlock and RCU are used to
protect d_inode. RCU holds a global reference so you can
safely copy the d_inode value, and then later Seqlock will tell you
if the copy you made is still valid.
It seems that there is room for some confusion on whether just a
reference, or also a lock, is need to access d_inode and we see
some of that in the patches proposed. The first
patch
proposed by Linus seems to treat d_inode like an RCU-protected field
which should not be changed (as observed in our earlier parenthetical
comment), and so removed the code that cleared it before the last
RCU-reference was dropped. While it may have fixed the problem it didn't
fully explain it as the field is meant to be Seqlock protected.
Linus's second patch
added some Seqcount handling which seemed to
acknowledge that the field was meant to be Seqlock protected, but also
moved the access to d_inode towards the end of the loop.
That loop shows off Seqlock's unbalanced nature quite nicely. It has a
read_seqcount_begin() but had no read_seqretry().
Linus added a read_seqretry() which resulted in balance but
that didn't really address the issue.
As mentioned earlier a "Seqlock protected region" is somewhat vague as
it can have several read_seqretry() calls with each being used
to validate what has just been copied. While a Seqlock protected
region is, in some sense, still active when we enter this loop, it is no
longer meant to protect d_inode. d_inode was copied
out and validated in an earlier call to _d_lookup_rcu() (from
do_lookup()). The current Seqlock protection will only be
used if this inode is a directory. So in this loop there is no
Seqlock protection for d_inode. The only reason it is still
safe to access it at the end of the loop is because we know we have a
mount point, and we know a mount point is pinned in such a way that
d_inode cannot change (as the comment explains). At the top
of the loop we don't have that certainty.
So it seems that the problem child here is Seqlock, but in fairness it
isn't really her fault. She looks like a lock, but doesn't bracket
things at all the same way that more traditional locks do and so those
familiar with the older generation can get confused about exactly what
she is protecting.
If we back-track a little and find out how the reference to
d_inode came to not be properly protected we find
commit 62a7375e5d77d654695
which moves the access to d_inode from the end of that loop
to near the top. This looks like it should be safe as a Seqlock seems
to be active, but as we now know, that can be deceiving.
The reason for that move is probably worth exploring, but not here.
This article is about friendship and friendship will rarely tell us how
to fix a problem - we usually need to visit a specialist for that.
But a friendship with data structures and locking mechanisms can help
us identify which code is worthy of a closer inspection just as we
have seen in this exploration.
In this case a friend would need to have known that d_inode
requires Seqlock protection, and to have fully understood the rather
independent approach that Seqlock takes to protected regions. Maybe
Seqlock just needs some more friends.
Invitation
It has been said that
given enough
eyeballs, all bugs are shallow.
For this to be true the eyes must also be able to focus and recognize (and
even be open). Hopefully by gaining more friends who are able to
recognize its faults, the Linux kernel can avoid the stigma of being
difficult to understand and can once more be seen as approachable. Would
you like to be friends?
Exercises:
Examine each unusual "for" loop found above and determine if
each is a bug, an interesting usage, or an uninteresting false-positive.
"struct super" uses a secondary reference count
"s_count" which provides similar functionality to that
provided by RCU. Determine if any flavor of RCU can actually be
used to achieve the same result. Outline the costs and benefits of
such a change.
Understand why
commit 62a7375e5d77d654695
moved the d_inode access as it did and so determine if
commit 59430262401bec02
has undone some good work.
Examine the usage of RCU in
jbd2_dev_to_name and list any features that "don't seem
right". A deeper analysis is not required as this code was recently removed from the kernel.
[The author would like to thank Paul McKenney for providing valuable insights
into the character of RCU.]
(
Log in to post comments)