Brief items
The 3.1 merge window is still open. See the summary below for what
has been merged in the last week.
Stable releases: The 2.6.35.14
longterm kernel was released on August 1.
The 2.6.39.4 stable kernel was released on August
3: "Please note, this is the LAST release of the 2.6.39 kernel series. All
users of 2.6.39 should be moving to 3.0 right now. This tree is now
end-of-life, please plan accordingly."
The 3.0.1 stable kernel is currently in the review process. It can be
expected to be released on or after August 5.
Comments (1 posted)
And it doesn't matter one whit if you say something like "Oh, but I
use quilt, so it's been tested there, and I just imported a very well
tested tree into -git to push it to you". Dammit, even if you use
quilt or something else to actually maintain your patch series, I KNOW
DAMN WELL THAT YOU DIDN'T TEST THAT SERIES ON TOP OF THE RECENT CRAPPY
NFS PULL!
So if you use quilt or something, then import the patch series on top
of something STABLE AND SANE. Start off with the released 3.0, that at
least doesn't have random pulls that have known compile issues. Use
that for testing, and don't send me a re-based patch-series that
clearly cannot possibly have been tested in that form, and that was
based on a kernel that had ugly problems.
--
Linus Torvalds (additional grumpiness
contained within)
So conflicts aren't "bad" per se. I want to see them, because they're
a kind of heads-up for me: while any individual conflict isn't
necessarily a problem at all, it's something that I just want to be
aware of.
So I am not complaining about - or finding it disturbing - that we had
a conflict. It's all par for the course. But I don't want
submaintainers to merge the conflicts away.
--
Linus Torvalds
So quite frankly, I think the patch in this form is totally
self-defeating. I'm not pulling something that is supposed to clean
things up, but just adds more ugliness in some other place.
--
Linus Torvalds
Comments (none posted)
Kernel development news
By Jake Edge
August 3, 2011
Around 1400 non-merge changesets have been pulled into the mainline since
last week's merge window article. That
makes for 6844 changesets since the 3.0 release at the time of this
writing. Linus Torvalds is still on vacation, and the merge numbers are a
bit down compared to previous releases, so there may be more to
come. Significant
user-visible changes include:
- The LIO iSCSI target has been
merged. This was somewhat controversial, first because the competing SCST
target implementation
was pushed aside in favor of LIO, but also because there were questions
about whether the CHAP authentication should be done in the kernel or in
user space. The version that got merged does that authentication in the
kernel over the objections of James Bottomley.
- pNFS now supports IPv6.
- eCryptfs now has support for encrypted
keys.
- md now has support for bad block management.
- tools/power/cpupower has been added with tools to monitor
power management for multiple architectures, and is eventually slated to
replace the Intel-specific tools in tools/power/x86.
- dm now supports separate metadata devices for better fault handling and
array sanity checking.
- New hardware supported includes:
- Audio/video: Cirrus Logic CS421x codecs, Xceive XC4000
silicon tuners, ATMEL Image Sensor Interface (ISI), Endpoints SE401 USB
cameras, Marvell Armada 610 integrated camera controllers, NXP
TDA18271c2 silicon tuners, TI DM644x video
processing back-end (VPBE) displays, Samsung S5P family video devices,
OmniVision OV5642 camera sensors, Micronas DRX-K DVB-C/T demodulators.
- Miscellaneous: Freescale MMA8450 accelerometers, InvenSense
MPU3050 tri-axial gyroscope sensors, Kionix KXTJ9 accelerometers, Xilinx
watchdog timers, ADP1653 LED flash controllers, Digital Devices Octopus
bridge devices, Maxim MAX1668 and compatible temperature sensors,
NTC NCP15WB473, NCP18WB473, NCP21WB473, NCP03WB473, and NCP15WL333
temperature sensors, National Semiconductor LM95245 dual temperature
sensors, National Semiconductor LM25066,
LM5064, and LM5064 power management, monitoring, control, and protection
sensors, Maxim MAX8998/LP3974 PMIC battery chargers, Maxim
MAX8997/MAX8966 PMIC battery chargers, TI TPS95612 power management
chips, TI TPS65912 power regulators, AnalogicTech AAT2870 backlights,
AnalogicTech AAT2870 power regulators, Cirrus Logic EP93xx DMA
controllers.
Changes visible to kernel developers include:
- Multiple ARM SoCs and device drivers have added device tree support.
- A watchdog timer driver core has been added.
- The SLUB slab allocator no longer requires locks on the fast path for
architectures that support cmpxchg.
- EFI non-volatile storage can now be used as a pstore backend to persistently store log
messages or other information.
One notable patch set that will not be merged this time around is the Native Linux KVM tool. Torvalds decided that he needed more convincing before
merging it:
You'll need to convince me it's really worthwhile,
considering that you *can* do kernel testing with existing
virtualization environments that are more powerful in other ways. But
you'll get to do that next merge window, I'm afraid.
I already decided to pull one controversial thing (the iscsi-target
thing), I'm not doing two this merge window ;)
The normal two-week merge window would nominally expire on August 5, but
Torvalds's vacation may alter that somewhat (in either direction). We'll pick up any significant
merges in an update next week should it be warranted. Stay tuned ...
Comments (none posted)
By Jonathan Corbet
July 25, 2011
In May, LWN
examined the "stable pages"
patch, whose intent is to be sure that pages under I/O cannot be
modified (by the kernel or user space) until the I/O completes. Block I/O is
not the only context in which this kind of problem arises, though; memory
which has been given to the network stack should also be kept stable until
the transmission is complete. Unfortunately, it is hard to know when the
network stack is truly done with a page, leaving the system open to
possible data corruption problems.
Ian Campbell described how things can go
wrong in June. Imagine a page full of data to be written to a file on
an NFS-mounted filesystem. The NFS code will put together a network I/O
operation as represented by an sk_buff structure (an "SKB") and
pass it into the network stack for transmission. Perhaps the server is
slow or the network is noisy; one way or another the acknowledgment from
the remote NFS server is slow in coming - slow enough that the network
stack decides to retransmit the request. While the data sits in the
retransmission queue (perhaps already handed to the interface driver), the
ACK arrives from the server. The network stack will tell the NFS client
that the operation has completed. The page used to contain the data to be
written could then be rewritten with some other data - even though that
retransmission is still waiting to go out. The result could be a
(re)transmission of corrupted data. This problem is especially acute for
O_DIRECT writes - where the application is waiting for the end of
the operation - but it can come up in other situations as well.
SKBs can have a destructor function, so one would think that it would be
possible to just wait until the network stack finishes with the structure
before releasing the relevant page(s). But the network stack works in
strange and mysterious ways, and the fact that it has finished with an SKB
does not imply that it is finished with the pages of data referenced by
that SKB. The "cloning" of SKBs happens often in the network stack, and
pages of data can actually move between SKBs. The networking code manages
the page reference counts directly, so there is no danger of the data pages
being put to some other use entirely by the system. But that is not helpful to
higher layers, which have no way to know when it's safe to signal the
completion of an operation.
Fixing this problem requires a significant set
of changes to the low-level SKB-handling code. Ian's patch series
starts by defining a set of helper functions for the tracking of references
to pages from SKBs. Current networking code calls get_page() and
put_page() directly; after patching, all of those calls have been
wrapped by functions like skb_frag_ref(). Quite a few changes are
required to get the networking core and in-tree drivers to use these
functions.
Once that is done, the patch series introduces the concept of a "fragment
destructor" for SKBs:
struct skb_frag_destructor {
atomic_t ref;
int (*destroy)(void *data);
void *data;
};
The low-level functions that add fragments to SKBs are modified to take an
additional destructor argument. The destructor is always optional; code
which does not need to use destructors can simply pass a null pointer
instead.
At this point, it's a relatively simple matter for the accessor functions
added earlier in the series to increment and decrement the reference count
whenever there are destructors present. When the reference count
(ref) drops to zero, the provided destroy() function will
be called. Putting the reference counter in the destructor is a useful
optimization: in the absence of destructors, the overhead of maintaining
the reference count can be skipped. Also worth noting is the fact that
multiple fragments in an SKB can share the same destructor; in this case,
the destroy() function will only be called when the networking
code has finished with all of those fragments.
One other optimization is that, in the presence of a destructor, the
network code will no longer increment and decrement the reference counts
associated with the pages in the fragments. In this situation, the calling
code is assumed to hold a reference for the duration of the operation, so
separate reference counting at that level is not needed.
The final step is to make use of this capability. The internal
kernel_sendpage() function gains an extra parameter to hold a
pointer to the destructor, should the caller want to use one. The sunrpc
code is changed to not signal completion of operations until the networking
code indicates that it is done with the associated memory. And that solves
the problem - for NFS at least; there should be no more troubles with pages
being reused while they are still under network I/O. There are other
places in the kernel which can - and presumably will - make use of this
functionality in the future; this work was originally motivated by problems
encountered in the implementation of zero-copy I/O for Xen clients. Ian
suspects that subsystems like iSCSI could also benefit from this mechanism.
The patch set seems to have been relatively well received. There will be
another posting at some point reorganizing some of the work, but there does
not appear to be a need for significant changes at this point. So this
feature seems likely to appear in the 3.2 kernel.
Comments (2 posted)
August 3, 2011
This article was contributed by Neil Brown
Our story begins with an observation made by Jon Corbet in a
recent article about delays in the
release of Linux-3.0:
[I]f we reach a point where almost nobody can understand, review,
or fix some of our core code, we may be headed for long-term
trouble.
While undeniably true, it is not clear that this note of gloom is
entirely justified by the immediate circumstances - after all the bug
was found and fixed to everyone's satisfaction in a little over 24
hours. However it is clear that it wasn't just this event that
triggered the observation. Earlier we read:
Our once approachable and hackable kernel has, over time, become
more complex and difficult to understand.
What was once "approachable" is now "difficult to understand". If these
words were used to describe a life-partner, we might be looking to
choose between a divorce settlement or counseling, though often it is
a healthy dose of understanding that is really needed.
The human touch
The allusion to human relationships is not as far fetched as it
might seem. In his excellent book
"The Maths Gene",
Keith Devlin writes about Wim Klein,
"a famous calculating wizard who in the days before electronic
computers once held a professional position with the title
'computer'".
Apparently Klein was known to say: "Numbers are friends to me." and referring to
3844 he says:
For you it's just a three and an eight and a four and a four, but
I say, "Hi 62 squared!"
The point Devlin is leading to is summed up thus:
To put it simply, mathematicians think about mathematical objects
and the mathematical relationships between them using the same
mental facilities that the majority of people use to think about
other people.
The idea is that we can treat abstract ideas more like friends than
tools and understand them holistically instead of just functionally.
If we get to know their preferences and peculiarities we can side step
the fact that they are sometimes difficult to understand and once
again find them approachable.
One more quote from that recent LWN article will start us on the path to
understanding what it means to be friends with abstractions. In a
parenthetical comment concerning an assignment of NULL to d_inode,
Jon writes:
(It's worth noting that this behavior goes against normal RCU
practice, which calls for the structure to be preserved unmodified
until the last reference is known to be gone)
This is just the sort of thing that suggests a holistic "friendly"
relationship rather than a more distant formal one. Something
just doesn't seem right, "He doesn't usually behave this way". Being
able to make that sort of observation helps you focus in and ask
incisive questions. You should then either be able to help your
friend, or else understand them at a deeper level. We'll come back to
this particular observation later, but first I want to demonstrate the
effectiveness of this sort of friendship in a context that all
experienced C programmers should find very familiar.
"For": he's a jolly good fellow
A simple and common construct in C programs is the "for" loop. It is
extremely flexible and can be used like a "while" loop, can iterate over
a set of numbers, can walk a linked list, and can perform many more tasks.
A common form that you might expect to see quite often is:
for (i = 0; i < count; i++)
This will execute the body of the loop "count" times
assigning "i" to
each value from "0" to "count - 1" in turn. There are similar
constructs that are almost as common and familiar such as
for (i = 0; i <= final; i++)
or
for (i = 1; i <= count; i++)
which are really just minor variations on the theme. But if you see
for (i = 0; i <= count; i++)
a seasoned C programmer, familiar with the ways of the "for", will
immediately think that something is odd. Not necessarily wrong, but
certainly odd. This might be an error, but it could also be that "count" has
been misnamed and is really a "final" rather than a "count" sort of
value (an ordinal rather than a cardinal). Or maybe the loop
really has a good reason for executing "count + 1" times. At this
point the friendship has served a valuable purpose by identifying a
possible issue, but must give way to a more detailed analysis to
determine if there really is a problem.
We can make this very concrete by examining the Linux kernel source
code (v3.0) which is often a good source of interesting code fragments. A search
for the above pattern can be achieved with:
git grep 'for.*(.*= *0.*;.*<=.*c[ou]*nt.*;.*)'
which results in:
arch/ia64/kernel/cpufreq/acpi-cpufreq.c: for (i = 0; i <= data->acpi_data.state_count; i++)
drivers/gpu/drm/radeon/evergreen_cs.c: for (i = 0; i <= pkt->count; i++, idx++, reg += 4) {
drivers/gpu/drm/radeon/r100.c: for (i = 0; i <= pkt->count; i++, idx++) {
drivers/gpu/drm/radeon/r100.c: for (i = 0; i <= (pkt->count + 1); i++, idx++) {
drivers/gpu/drm/radeon/r100.c: for (j = 0; j <= count; j++) {
drivers/gpu/drm/radeon/r600.c: for (j = 0; j <= count; j++) {
drivers/gpu/drm/radeon/r600_cs.c: for (i = 0; i <= pkt->count; i++, idx++, reg += 4) {
drivers/i2c/algos/i2c-algo-pcf.c: for (i = 0; i <= count; i++) {
drivers/staging/rts_pstor/rtsx_transport.c: for (i = 0; i <= buf_cnt / (HOST_SG_TBL_BUF_LEN / 8); i++) {
fs/ufs/balloc.c: for (pos = 0; pos <= uspi->s_fpb - count; pos++) {
scripts/docproc.c: for (i=0; i <= symfilecnt; i++)
A closer examination of each of these is left as an exercise for the
reader. The chance of finding a genuine bug is very high, though it
is likely that the bugs are not very serious or they would
have already been fixed.
The success in finding a bug even when we weren't really looking for
one is evidence of the value of building strong relationships with
important abstractions as we can learn to recognize early warning
signs. Unfortunately the building of strong relationships can be
time-consuming and so is often neglected, being left to those with a
passion rather than those who simply have a task to do.
In human relationships a friendship can blossom more quickly if a
mutual friend acts to introduce two parties and start them out on a
sound footing. The same can work for abstractions. While much has
been written about RCU ("Read-Copy-Update" which - from the name
"rcu-walk" seems to be the center-point of the aforementioned bug), it
has largely been of a formal style. The documentation in
linux/Documentation/RCU/
and on
Wikipedia
is, not unlike a résumé, suitable for an introduction to a
potential employer, but not so much as an introduction to a future
friend. What follows is an attempt to present RCU together with other
related technologies in a somewhat different style aimed at building
friendships - what you might call the human-interest perspective.
A family affair
While it is wrong to judge someone based on their parents, knowing their
family background can certainly lead to a deeper understanding.
So we will start by introducing RCU's parents who are, for the purpose
of this tale, Sir Spinlock and Dame Refcount - both recently honored
for services to computer engineering.
Sir Spinlock is somewhat "old school" and a stickler for doing
everything "just right". When he protects a data structure he does it
"properly" (as he says) and ensures everything is consistent, stable
and completely up-to-date. His wife, Dame Refcount, feels that her
husband takes a rather short-term view of things and, like many
mothers, just wants to be sure that her charges are safe and can be
found when needed (comments about apron strings are generally not
appreciated).
The gentleman and his lady had another child much earlier (too early
some say) who was named "Rwlock". When write access is needed, he is
much like his father becoming a short term exclusive lock. When read
access is wanted he is more like his mother, keeping count of all the
different readers. However on the whole he seems to have received the
worst of both worlds, lacking the long term connections of his mother,
or the simple fairness available to his father. Master Rwlock doesn't
get out much these days.
In some ways Rwlock is of the same generation as his parents as none
of them really seem to understand the problem of cache line bouncing.
Sir Spinlock likes to say that "Cache lines belong in banks" and seems
to think he is being awfully funny, but it just makes the younger
generation feel awkward, smile weakly, and try to change the subject.
That younger generation are the twins - RCU and his sister Seqlock.
They both grew up in the shadow of Sir Spinlock's dominance of
multiprocessing and each reacted in their own way. They have both learned
the importance of stability but were culturally more focused on the
appearance than the reality, and technologically more inventive in how
to provide it.
RCU explains that he agrees with his dad's mantra of "Always
consistent, always stable" but finds that the "always up-to-date" idea is
over-rated. He himself finds that he is usually running late for
something or other and doesn't see that it has hurt him all that much.
RCU is a big fan of the photo-copiers and realized early that the best
way to preserve the appearance of stability was not to change the
original at all but to make a copy and work on that. So while RCU
always provides stable data, there might be something newer that he is
working on that he won't let you see just yet. RCU's real skill is
understanding use-by dates. What you get might not be perfectly
fresh, but it will never become stale.
His sister, Seqlock, takes a very different approach to the same
problem learning from her favorite piece of technology - the digital
camera. Like many of us she just takes lots and lots of photos and picks the
ones she likes. "All of them are stable", she says, "and it isn't too
hard to tell when one isn't consistent - just throw those away". You
might not think it is so easy, but then this is her special skill -
seeing the blur in the photo and taking another. She is happy to help and
tell you when a new photo is needed.
Another notable contrast between the twins is reflected in their
mother's pet names for them: "replace" and "in-place" (she sometimes
calls them "The place twins" which they find very irritating). RCU
insists on updating things by replacing them so he is happiest with
relatively small objects without too much linkage which would need to
be re-threaded. He "doesn't work on books", he says, "just
collections of pages, so don't try looking at two pages at once!".
When he does have these small objects he can replace them with
lightning speed, so others hardly even notice anything has changed.
Seqlock expects people to take a photo if they are interested so she
just updates things when ever she likes, where ever they are - just
being sure to add enough "blur" so that if she is caught "in the act"
another photo will be taken. This means she can handle much bigger
containers than her brother, and containers with lots of connections.
Even a wide-angle lens can notice the blur and take a second photo.
A chip off the old lock
When we look more closely at RCU to see how he achieves his flashy
performance we can see echoes of both his parents at work. When he was
younger he found this mildly embarrassing but since they were both
knighted he has found a new respect for his heritage.
From his father, RCU appreciates the value of a quick change when
no-one else can see and has refined this down to a single pointer
update which modern hardware guarantees to be atomic. When he is
really trying to impress he uses the "xchg()" atomic operation to
replace an old structure with a new structure without even needing any
locking for an update (though there are some costs to this and it is
mostly just a party trick - but see
net/ipv4/cipso_ipv4.c:cipso_v4_req_setattr()).
This observation about the centrality of the pointer is crucial to
understanding RCU's sleight-of-hand. It is easy to get distracted by
the "rcu_read_lock() / rcu_read_unlock()" calls and think that they
provide some sort of atomicity, but they don't. Rather they come from
his mother's side of the family as we will see shortly. As the
atomicity is constrained to a single pointer, two separate pointer
accesses - even in the one rcu_read_lock()ed section - will not be
coherent. This means it is very important not to dereference the same
pointer twice (you might get two different values) and if you need to
read two separate pointers, be very careful about assuming any
relationship between the two values - there might be one but you need
to understand the rest of the code to be sure. The most common case
of dereferencing multiple pointers in the one rcu_read_lock()ed region
is when following a linked list. Each pointer found leads directly to
the next. In this case you can be sure that each consecutive pair will be ordered
if there is ordering in the list, but you cannot be sure you will see
all entries that are in the list when you started, or that you have
seen all the entries that are in the list when you finish.
Now that we know that friends don't let RCU-friends assume coherence
outside of a single protected structure, it should be easy to see
that something looks wrong in
kernel/cgroup.c:cgroup_attach_proc().
The relevant code snippet - very well commented which helps a lot -
is:
rcu_read_lock();
if (!thread_group_leader(leader)) {
/*
* a race with de_thread from another thread's exec() may strip
* us of our leadership, making while_each_thread unsafe to use
* on this task. if this happens, there is no choice but to
* throw this task away and try again (from cgroup_procs_write);
* this is "double-double-toil-and-trouble-check locking".
*/
rcu_read_unlock();
retval = -EAGAIN;
goto out_free_group_list;
}
where:
#define thread_group_leader(p) (p == p->group_leader)
is defined elsewhere.
This code is assuming that rcu_read_lock() will provide coherence
between this access of "leader->group_leader" here and other accesses to
something in the "leader" structure later on in "while_each_thread()".
While this is not totally impossible, as a well placed
"synchronize_rcu()" between updates could lead to some coherence, it
is not the usual way RCU is used and questions need to be asked. A
discussion with the author of this code suggests that what is really
needed is a read_lock() on "tasklist_lock".
So just like with the case of the "for" loop the fact that something
"doesn't look right" isn't a guarantee that it is wrong but it is
certainly an indication that the code is worth checking.
From his mother, RCU learned the value of counting to keep track of things. While his mum
has a counter in each individual object, RCU tracks the whole system state (or at least that part
entrusted to him) with a single counter (though admittedly it is spread across all CPUs). The
details of getting this just right are
are rather tricky, but as mentioned earlier, that is his
real skill.
Sometimes he thinks this one skill is over-rated and finds a real
kinship with the artists and musicians he knows who can only find work
in copying the ancient masters or teaching unwilling children to play.
In RCU's case the skill that most people pay for is not the "quick
change, always stable" trick that he is so proud of, but the "never
stale" promise he has to make to achieve this. Among the several
hundred times that RCU is employed by the Linux kernel, a substantial
majority simply use RCU as an extra, cheap, reference count to stop
things from going stale until they are really not needed. In a number
of cases this is very explicit in a "put" function. For example
kernel/audit_tree.c has:
static inline void put_tree(struct audit_tree *tree)
{
if (atomic_dec_and_test(&tree->count))
call_rcu(&tree->head, __put_tree);
}
which makes it very clear where RCU stands - one step below his mother
(we note that she isn't complaining, only him).
Even when RCU does get to use his quick-change trick it is usually to
simply replace an old value with a new value - no real copying is
done. Like most of us, he hardly ever uses his middle name.
Some reasonably simple examples of true RCU where the old value is
copied and updated are in
fs/afs/security.c:afs_cache_permit()
and
drivers/md/linear.c:linear_add().
In both these cases an array needs to
be made bigger, so a new structure is allocated and RCU is used to
slip it into place without anyone noticing.
Others, possibly more genuine (as they aren't just for a size change),
can be found where "list_replace_rcu()" or
"hlist_replace_rcu()" are
used (about a dozen places in Linux-3.0). For example in
net/ipv4/fib_trie.c:fib_table_insert(),
RCU is used to update several fields in a "fib_alias" atomically.
Seeking Seqlock's secret
Returning to Miss Seqlock and examining more closely her relationship
with her parents we find that she is very close to her father and
depends on him to provide exclusion from other writers when writing to
a protected data structure (though in the Linux kernel she has a more
independent alter ego named "Seqcount" who leaves the write-side
exclusion to be taken on by whoever is available).
On the other hand she shows her rebellious side by not requiring a
strict bracketing of get/put calls. Sir Spinlock's calls must always
pair a spin_lock() with spin_unlock() and RCU must always follow
rcu_read_lock() by rcu_read_unlock(). Dame Refcount almost always
pair "increment" with "decrement" though sometimes when there is only
one reference left she won't bother decrementing but just throws the
object away. Seqcount - possibly inspired by her mother, only pretends
to need balance, on the read side at least. Her standard pattern is
to balance "read_seqbegin()" with "read_seqretry()"
as is shown quite eloquently in, for example,
fs/nfs/nfs4state.c:nfs4_copy_stateid():
do {
seq = read_seqbegin(&state->seqlock);
memcpy(dst, &state->stateid, sizeof(*dst));
} while (read_seqretry(&state->seqlock, seq));
We have begin, copy, and retry. But it doesn't have to be like
that. A simple variation can be seen in
fs/cifs/dir.c:build_path_from_dentry()
where three different error cases can return from the middle of the loop.
In any other locking scheme the error path would need to unlock or
release whatever was being held, but not with Miss Seqlock. When she
is in charge you can check-out any time you like.
While this example copies the entire protected structure that is not
necessary and would be prohibitive when Seqlock is protecting a large
or complex structure. Rather it is only necessary to copy the pieces
of interest. If they prove interesting you might copy some more bits
and some more, each time only checking "read_seqretry()" to make sure
the entire picture you are forming is valid (and not blurred). This
is not yet common in Linux with the new scalable dcache being the only
advanced user. This first exposure may yet gain more friends for
Seqlock, though, and we can expect to see more of her as time passes.
It's all about teamwork
We have already noted that Seqlock usually depends on her dad for help
with write-side exclusion and that RCU often helps out his mum.
The co-operation doesn't end there. The parents have long worked
together (it was in the middle of an "atomic_dec_and_lock()"
that he proposed) and the twins carry on that tradition. Some times
the family members take turns in protecting a structure and sometimes they share out
responsibility for different fields, so Sir Spinlock might protect
some fields in a structure while RCU guarantees that others won't
change except by copying the whole.
As we now turn back to look at the context of the bug that started
this discussion which was in the
new path name lookup code
we find that all four (main) members of the family are making
significant contributions and working together.
As the
earlier article on this topic
outlines, Sir Spinlock and Dame Refcount are used for the
"always works but sometimes slow" ref-walk option, while RCU and
Seqlock help out for the "often works, always fast" rcu-walk option.
This allows us to compare and contrast them in a real-life context.
Where ref-walk uses reference counts on one dentry after another to
keep each dentry active, rcu-walk uses RCU to keep an effective
reference count on all dentrys. Where ref-walk uses spinlocks to get
a stable view of each dentry, rcu-walk uses seqlock to avoid using an
unstable view.
To really understand this bug some degree of friendship with VFS is
needed too, but fortunately not much. Central to the issue is the
"d_inode" field of a dentry. It is fairly special
because it doesn't change much. The only valid changes are from
NULL to a valid inode pointer, and from a valid inode pointer
back to NULL. And that second one is special too - it is only
allowed to happen if there are no other references to the dentry.
The result of this in the old dcache (before rcu-walk) is that you
don't need any locking for read access to d_inode, only a
counted reference. If it is NULL, then you know the file doesn't
exist. If it is not NULL, then the value is the file's inode and it
won't change - the reference you have on the dentry becomes an implied
reference on the inode.
As rcu-walk largely mapped spinlocks to seqlocks and refcounts to RCU
you might assume that an RCU reference is sufficient to make accessing
d_inode safe but it isn't quite that simple. To get a new
reference on a dentry you need to be holding a spinlock (hence
dget_locked() is the function to use). So while you only need a
reference to access d_inode, you need to have previously
taken a spinlock. This doesn't map directly into RCU usage as
getting an RCU reference (with rcu_read_lock()) can never
require a spinlock. So in rcu-walk, both Seqlock and RCU are used to
protect d_inode. RCU holds a global reference so you can
safely copy the d_inode value, and then later Seqlock will tell you
if the copy you made is still valid.
It seems that there is room for some confusion on whether just a
reference, or also a lock, is need to access d_inode and we see
some of that in the patches proposed. The first
patch
proposed by Linus seems to treat d_inode like an RCU-protected field
which should not be changed (as observed in our earlier parenthetical
comment), and so removed the code that cleared it before the last
RCU-reference was dropped. While it may have fixed the problem it didn't
fully explain it as the field is meant to be Seqlock protected.
Linus's second patch
added some Seqcount handling which seemed to
acknowledge that the field was meant to be Seqlock protected, but also
moved the access to d_inode towards the end of the loop.
That loop shows off Seqlock's unbalanced nature quite nicely. It has a
read_seqcount_begin() but had no read_seqretry().
Linus added a read_seqretry() which resulted in balance but
that didn't really address the issue.
As mentioned earlier a "Seqlock protected region" is somewhat vague as
it can have several read_seqretry() calls with each being used
to validate what has just been copied. While a Seqlock protected
region is, in some sense, still active when we enter this loop, it is no
longer meant to protect d_inode. d_inode was copied
out and validated in an earlier call to _d_lookup_rcu() (from
do_lookup()). The current Seqlock protection will only be
used if this inode is a directory. So in this loop there is no
Seqlock protection for d_inode. The only reason it is still
safe to access it at the end of the loop is because we know we have a
mount point, and we know a mount point is pinned in such a way that
d_inode cannot change (as the comment explains). At the top
of the loop we don't have that certainty.
So it seems that the problem child here is Seqlock, but in fairness it
isn't really her fault. She looks like a lock, but doesn't bracket
things at all the same way that more traditional locks do and so those
familiar with the older generation can get confused about exactly what
she is protecting.
If we back-track a little and find out how the reference to
d_inode came to not be properly protected we find
commit 62a7375e5d77d654695
which moves the access to d_inode from the end of that loop
to near the top. This looks like it should be safe as a Seqlock seems
to be active, but as we now know, that can be deceiving.
The reason for that move is probably worth exploring, but not here.
This article is about friendship and friendship will rarely tell us how
to fix a problem - we usually need to visit a specialist for that.
But a friendship with data structures and locking mechanisms can help
us identify which code is worthy of a closer inspection just as we
have seen in this exploration.
In this case a friend would need to have known that d_inode
requires Seqlock protection, and to have fully understood the rather
independent approach that Seqlock takes to protected regions. Maybe
Seqlock just needs some more friends.
Invitation
It has been said that
given enough
eyeballs, all bugs are shallow.
For this to be true the eyes must also be able to focus and recognize (and
even be open). Hopefully by gaining more friends who are able to
recognize its faults, the Linux kernel can avoid the stigma of being
difficult to understand and can once more be seen as approachable. Would
you like to be friends?
Exercises:
Examine each unusual "for" loop found above and determine if
each is a bug, an interesting usage, or an uninteresting false-positive.
"struct super" uses a secondary reference count
"s_count" which provides similar functionality to that
provided by RCU. Determine if any flavor of RCU can actually be
used to achieve the same result. Outline the costs and benefits of
such a change.
Understand why
commit 62a7375e5d77d654695
moved the d_inode access as it did and so determine if
commit 59430262401bec02
has undone some good work.
Examine the usage of RCU in
jbd2_dev_to_name and list any features that "don't seem
right". A deeper analysis is not required as this code was recently removed from the kernel.
[The author would like to thank Paul McKenney for providing valuable insights
into the character of RCU.]
Comments (8 posted)
Patches and updates
Kernel trees
- Thomas Gleixner: 3.0-rt4 .
(August 2, 2011)
- Thomas Gleixner: 3.0-rt6 .
(August 2, 2011)
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Architecture-specific
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>