LWN.net Logo

Kernel development

Brief items

Kernel release status

The 3.1 merge window is still open. See the summary below for what has been merged in the last week.

Stable releases: The 2.6.35.14 longterm kernel was released on August 1.

The 2.6.39.4 stable kernel was released on August 3: "Please note, this is the LAST release of the 2.6.39 kernel series. All users of 2.6.39 should be moving to 3.0 right now. This tree is now end-of-life, please plan accordingly."

The 3.0.1 stable kernel is currently in the review process. It can be expected to be released on or after August 5.

Comments (1 posted)

Quotes of the week: Merge window grumpiness edition

And it doesn't matter one whit if you say something like "Oh, but I use quilt, so it's been tested there, and I just imported a very well tested tree into -git to push it to you". Dammit, even if you use quilt or something else to actually maintain your patch series, I KNOW DAMN WELL THAT YOU DIDN'T TEST THAT SERIES ON TOP OF THE RECENT CRAPPY NFS PULL!

So if you use quilt or something, then import the patch series on top of something STABLE AND SANE. Start off with the released 3.0, that at least doesn't have random pulls that have known compile issues. Use that for testing, and don't send me a re-based patch-series that clearly cannot possibly have been tested in that form, and that was based on a kernel that had ugly problems.

-- Linus Torvalds (additional grumpiness contained within)

So conflicts aren't "bad" per se. I want to see them, because they're a kind of heads-up for me: while any individual conflict isn't necessarily a problem at all, it's something that I just want to be aware of.

So I am not complaining about - or finding it disturbing - that we had a conflict. It's all par for the course. But I don't want submaintainers to merge the conflicts away.

-- Linus Torvalds

So quite frankly, I think the patch in this form is totally self-defeating. I'm not pulling something that is supposed to clean things up, but just adds more ugliness in some other place.
-- Linus Torvalds

Comments (none posted)

Kernel development news

3.1 merge window part 2

By Jake Edge
August 3, 2011

Around 1400 non-merge changesets have been pulled into the mainline since last week's merge window article. That makes for 6844 changesets since the 3.0 release at the time of this writing. Linus Torvalds is still on vacation, and the merge numbers are a bit down compared to previous releases, so there may be more to come. Significant user-visible changes include:

  • The LIO iSCSI target has been merged. This was somewhat controversial, first because the competing SCST target implementation was pushed aside in favor of LIO, but also because there were questions about whether the CHAP authentication should be done in the kernel or in user space. The version that got merged does that authentication in the kernel over the objections of James Bottomley.

  • pNFS now supports IPv6.

  • eCryptfs now has support for encrypted keys.

  • md now has support for bad block management.

  • tools/power/cpupower has been added with tools to monitor power management for multiple architectures, and is eventually slated to replace the Intel-specific tools in tools/power/x86.

  • dm now supports separate metadata devices for better fault handling and array sanity checking.

  • New hardware supported includes:

    • Audio/video: Cirrus Logic CS421x codecs, Xceive XC4000 silicon tuners, ATMEL Image Sensor Interface (ISI), Endpoints SE401 USB cameras, Marvell Armada 610 integrated camera controllers, NXP TDA18271c2 silicon tuners, TI DM644x video processing back-end (VPBE) displays, Samsung S5P family video devices, OmniVision OV5642 camera sensors, Micronas DRX-K DVB-C/T demodulators.

    • Miscellaneous: Freescale MMA8450 accelerometers, InvenSense MPU3050 tri-axial gyroscope sensors, Kionix KXTJ9 accelerometers, Xilinx watchdog timers, ADP1653 LED flash controllers, Digital Devices Octopus bridge devices, Maxim MAX1668 and compatible temperature sensors, NTC NCP15WB473, NCP18WB473, NCP21WB473, NCP03WB473, and NCP15WL333 temperature sensors, National Semiconductor LM95245 dual temperature sensors, National Semiconductor LM25066, LM5064, and LM5064 power management, monitoring, control, and protection sensors, Maxim MAX8998/LP3974 PMIC battery chargers, Maxim MAX8997/MAX8966 PMIC battery chargers, TI TPS95612 power management chips, TI TPS65912 power regulators, AnalogicTech AAT2870 backlights, AnalogicTech AAT2870 power regulators, Cirrus Logic EP93xx DMA controllers.

Changes visible to kernel developers include:

  • Multiple ARM SoCs and device drivers have added device tree support.

  • A watchdog timer driver core has been added.

  • The SLUB slab allocator no longer requires locks on the fast path for architectures that support cmpxchg.

  • EFI non-volatile storage can now be used as a pstore backend to persistently store log messages or other information.

One notable patch set that will not be merged this time around is the Native Linux KVM tool. Torvalds decided that he needed more convincing before merging it:

You'll need to convince me it's really worthwhile, considering that you *can* do kernel testing with existing virtualization environments that are more powerful in other ways. But you'll get to do that next merge window, I'm afraid.

I already decided to pull one controversial thing (the iscsi-target thing), I'm not doing two this merge window ;)

The normal two-week merge window would nominally expire on August 5, but Torvalds's vacation may alter that somewhat (in either direction). We'll pick up any significant merges in an update next week should it be warranted. Stay tuned ...

Comments (none posted)

SKB fragment lifetime tracking

By Jonathan Corbet
July 25, 2011
In May, LWN examined the "stable pages" patch, whose intent is to be sure that pages under I/O cannot be modified (by the kernel or user space) until the I/O completes. Block I/O is not the only context in which this kind of problem arises, though; memory which has been given to the network stack should also be kept stable until the transmission is complete. Unfortunately, it is hard to know when the network stack is truly done with a page, leaving the system open to possible data corruption problems.

Ian Campbell described how things can go wrong in June. Imagine a page full of data to be written to a file on an NFS-mounted filesystem. The NFS code will put together a network I/O operation as represented by an sk_buff structure (an "SKB") and pass it into the network stack for transmission. Perhaps the server is slow or the network is noisy; one way or another the acknowledgment from the remote NFS server is slow in coming - slow enough that the network stack decides to retransmit the request. While the data sits in the retransmission queue (perhaps already handed to the interface driver), the ACK arrives from the server. The network stack will tell the NFS client that the operation has completed. The page used to contain the data to be written could then be rewritten with some other data - even though that retransmission is still waiting to go out. The result could be a (re)transmission of corrupted data. This problem is especially acute for O_DIRECT writes - where the application is waiting for the end of the operation - but it can come up in other situations as well.

SKBs can have a destructor function, so one would think that it would be possible to just wait until the network stack finishes with the structure before releasing the relevant page(s). But the network stack works in strange and mysterious ways, and the fact that it has finished with an SKB does not imply that it is finished with the pages of data referenced by that SKB. The "cloning" of SKBs happens often in the network stack, and pages of data can actually move between SKBs. The networking code manages the page reference counts directly, so there is no danger of the data pages being put to some other use entirely by the system. But that is not helpful to higher layers, which have no way to know when it's safe to signal the completion of an operation.

Fixing this problem requires a significant set of changes to the low-level SKB-handling code. Ian's patch series starts by defining a set of helper functions for the tracking of references to pages from SKBs. Current networking code calls get_page() and put_page() directly; after patching, all of those calls have been wrapped by functions like skb_frag_ref(). Quite a few changes are required to get the networking core and in-tree drivers to use these functions.

Once that is done, the patch series introduces the concept of a "fragment destructor" for SKBs:

    struct skb_frag_destructor {
	atomic_t ref;
	int (*destroy)(void *data);
	void *data;
    };

The low-level functions that add fragments to SKBs are modified to take an additional destructor argument. The destructor is always optional; code which does not need to use destructors can simply pass a null pointer instead.

At this point, it's a relatively simple matter for the accessor functions added earlier in the series to increment and decrement the reference count whenever there are destructors present. When the reference count (ref) drops to zero, the provided destroy() function will be called. Putting the reference counter in the destructor is a useful optimization: in the absence of destructors, the overhead of maintaining the reference count can be skipped. Also worth noting is the fact that multiple fragments in an SKB can share the same destructor; in this case, the destroy() function will only be called when the networking code has finished with all of those fragments.

One other optimization is that, in the presence of a destructor, the network code will no longer increment and decrement the reference counts associated with the pages in the fragments. In this situation, the calling code is assumed to hold a reference for the duration of the operation, so separate reference counting at that level is not needed.

The final step is to make use of this capability. The internal kernel_sendpage() function gains an extra parameter to hold a pointer to the destructor, should the caller want to use one. The sunrpc code is changed to not signal completion of operations until the networking code indicates that it is done with the associated memory. And that solves the problem - for NFS at least; there should be no more troubles with pages being reused while they are still under network I/O. There are other places in the kernel which can - and presumably will - make use of this functionality in the future; this work was originally motivated by problems encountered in the implementation of zero-copy I/O for Xen clients. Ian suspects that subsystems like iSCSI could also benefit from this mechanism.

The patch set seems to have been relatively well received. There will be another posting at some point reorganizing some of the work, but there does not appear to be a need for significant changes at this point. So this feature seems likely to appear in the 3.2 kernel.

Comments (2 posted)

Meet the Lockers

August 3, 2011

This article was contributed by Neil Brown

Our story begins with an observation made by Jon Corbet in a recent article about delays in the release of Linux-3.0:

[I]f we reach a point where almost nobody can understand, review, or fix some of our core code, we may be headed for long-term trouble.

While undeniably true, it is not clear that this note of gloom is entirely justified by the immediate circumstances - after all the bug was found and fixed to everyone's satisfaction in a little over 24 hours. However it is clear that it wasn't just this event that triggered the observation. Earlier we read:

Our once approachable and hackable kernel has, over time, become more complex and difficult to understand.

What was once "approachable" is now "difficult to understand". If these words were used to describe a life-partner, we might be looking to choose between a divorce settlement or counseling, though often it is a healthy dose of understanding that is really needed.

The human touch

The allusion to human relationships is not as far fetched as it might seem. In his excellent book "The Maths Gene", Keith Devlin writes about Wim Klein, "a famous calculating wizard who in the days before electronic computers once held a professional position with the title 'computer'".

Apparently Klein was known to say: "Numbers are friends to me." and referring to 3844 he says:

For you it's just a three and an eight and a four and a four, but I say, "Hi 62 squared!"

The point Devlin is leading to is summed up thus:

To put it simply, mathematicians think about mathematical objects and the mathematical relationships between them using the same mental facilities that the majority of people use to think about other people.

The idea is that we can treat abstract ideas more like friends than tools and understand them holistically instead of just functionally. If we get to know their preferences and peculiarities we can side step the fact that they are sometimes difficult to understand and once again find them approachable.

One more quote from that recent LWN article will start us on the path to understanding what it means to be friends with abstractions. In a parenthetical comment concerning an assignment of NULL to d_inode, Jon writes:

(It's worth noting that this behavior goes against normal RCU practice, which calls for the structure to be preserved unmodified until the last reference is known to be gone)

This is just the sort of thing that suggests a holistic "friendly" relationship rather than a more distant formal one. Something just doesn't seem right, "He doesn't usually behave this way". Being able to make that sort of observation helps you focus in and ask incisive questions. You should then either be able to help your friend, or else understand them at a deeper level. We'll come back to this particular observation later, but first I want to demonstrate the effectiveness of this sort of friendship in a context that all experienced C programmers should find very familiar.

"For": he's a jolly good fellow

A simple and common construct in C programs is the "for" loop. It is extremely flexible and can be used like a "while" loop, can iterate over a set of numbers, can walk a linked list, and can perform many more tasks. A common form that you might expect to see quite often is:

    for (i = 0; i < count; i++)

This will execute the body of the loop "count" times assigning "i" to each value from "0" to "count - 1" in turn. There are similar constructs that are almost as common and familiar such as

    for (i = 0; i <= final; i++)

or

    for (i = 1; i <= count; i++)

which are really just minor variations on the theme. But if you see

    for (i = 0; i <= count; i++)

a seasoned C programmer, familiar with the ways of the "for", will immediately think that something is odd. Not necessarily wrong, but certainly odd. This might be an error, but it could also be that "count" has been misnamed and is really a "final" rather than a "count" sort of value (an ordinal rather than a cardinal). Or maybe the loop really has a good reason for executing "count + 1" times. At this point the friendship has served a valuable purpose by identifying a possible issue, but must give way to a more detailed analysis to determine if there really is a problem.

We can make this very concrete by examining the Linux kernel source code (v3.0) which is often a good source of interesting code fragments. A search for the above pattern can be achieved with:

    git grep 'for.*(.*= *0.*;.*<=.*c[ou]*nt.*;.*)'

which results in:

    arch/ia64/kernel/cpufreq/acpi-cpufreq.c:	for (i = 0; i <= data->acpi_data.state_count; i++)
    drivers/gpu/drm/radeon/evergreen_cs.c:	for (i = 0; i <= pkt->count; i++, idx++, reg += 4) {
    drivers/gpu/drm/radeon/r100.c:	for (i = 0; i <= pkt->count; i++, idx++) {
    drivers/gpu/drm/radeon/r100.c:	for (i = 0; i <= (pkt->count + 1); i++, idx++) {
    drivers/gpu/drm/radeon/r100.c:	for (j = 0; j <= count; j++) {
    drivers/gpu/drm/radeon/r600.c:	for (j = 0; j <= count; j++) {
    drivers/gpu/drm/radeon/r600_cs.c:	for (i = 0; i <= pkt->count; i++, idx++, reg += 4) {
    drivers/i2c/algos/i2c-algo-pcf.c:	for (i = 0; i <= count; i++) {
    drivers/staging/rts_pstor/rtsx_transport.c:	for (i = 0; i <= buf_cnt / (HOST_SG_TBL_BUF_LEN / 8); i++) {
    fs/ufs/balloc.c:		for (pos = 0; pos <= uspi->s_fpb - count; pos++) {
    scripts/docproc.c:	for (i=0; i <= symfilecnt; i++)

A closer examination of each of these is left as an exercise for the reader. The chance of finding a genuine bug is very high, though it is likely that the bugs are not very serious or they would have already been fixed.

The success in finding a bug even when we weren't really looking for one is evidence of the value of building strong relationships with important abstractions as we can learn to recognize early warning signs. Unfortunately the building of strong relationships can be time-consuming and so is often neglected, being left to those with a passion rather than those who simply have a task to do.

In human relationships a friendship can blossom more quickly if a mutual friend acts to introduce two parties and start them out on a sound footing. The same can work for abstractions. While much has been written about RCU ("Read-Copy-Update" which - from the name "rcu-walk" seems to be the center-point of the aforementioned bug), it has largely been of a formal style. The documentation in linux/Documentation/RCU/ and on Wikipedia is, not unlike a résumé, suitable for an introduction to a potential employer, but not so much as an introduction to a future friend. What follows is an attempt to present RCU together with other related technologies in a somewhat different style aimed at building friendships - what you might call the human-interest perspective.

A family affair

While it is wrong to judge someone based on their parents, knowing their family background can certainly lead to a deeper understanding. So we will start by introducing RCU's parents who are, for the purpose of this tale, Sir Spinlock and Dame Refcount - both recently honored for services to computer engineering.

Sir Spinlock is somewhat "old school" and a stickler for doing everything "just right". When he protects a data structure he does it "properly" (as he says) and ensures everything is consistent, stable and completely up-to-date. His wife, Dame Refcount, feels that her husband takes a rather short-term view of things and, like many mothers, just wants to be sure that her charges are safe and can be found when needed (comments about apron strings are generally not appreciated).

The gentleman and his lady had another child much earlier (too early some say) who was named "Rwlock". When write access is needed, he is much like his father becoming a short term exclusive lock. When read access is wanted he is more like his mother, keeping count of all the different readers. However on the whole he seems to have received the worst of both worlds, lacking the long term connections of his mother, or the simple fairness available to his father. Master Rwlock doesn't get out much these days.

In some ways Rwlock is of the same generation as his parents as none of them really seem to understand the problem of cache line bouncing. Sir Spinlock likes to say that "Cache lines belong in banks" and seems to think he is being awfully funny, but it just makes the younger generation feel awkward, smile weakly, and try to change the subject.

That younger generation are the twins - RCU and his sister Seqlock. They both grew up in the shadow of Sir Spinlock's dominance of multiprocessing and each reacted in their own way. They have both learned the importance of stability but were culturally more focused on the appearance than the reality, and technologically more inventive in how to provide it.

RCU explains that he agrees with his dad's mantra of "Always consistent, always stable" but finds that the "always up-to-date" idea is over-rated. He himself finds that he is usually running late for something or other and doesn't see that it has hurt him all that much. RCU is a big fan of the photo-copiers and realized early that the best way to preserve the appearance of stability was not to change the original at all but to make a copy and work on that. So while RCU always provides stable data, there might be something newer that he is working on that he won't let you see just yet. RCU's real skill is understanding use-by dates. What you get might not be perfectly fresh, but it will never become stale.

His sister, Seqlock, takes a very different approach to the same problem learning from her favorite piece of technology - the digital camera. Like many of us she just takes lots and lots of photos and picks the ones she likes. "All of them are stable", she says, "and it isn't too hard to tell when one isn't consistent - just throw those away". You might not think it is so easy, but then this is her special skill - seeing the blur in the photo and taking another. She is happy to help and tell you when a new photo is needed.

Another notable contrast between the twins is reflected in their mother's pet names for them: "replace" and "in-place" (she sometimes calls them "The place twins" which they find very irritating). RCU insists on updating things by replacing them so he is happiest with relatively small objects without too much linkage which would need to be re-threaded. He "doesn't work on books", he says, "just collections of pages, so don't try looking at two pages at once!". When he does have these small objects he can replace them with lightning speed, so others hardly even notice anything has changed.

Seqlock expects people to take a photo if they are interested so she just updates things when ever she likes, where ever they are - just being sure to add enough "blur" so that if she is caught "in the act" another photo will be taken. This means she can handle much bigger containers than her brother, and containers with lots of connections. Even a wide-angle lens can notice the blur and take a second photo.

A chip off the old lock

When we look more closely at RCU to see how he achieves his flashy performance we can see echoes of both his parents at work. When he was younger he found this mildly embarrassing but since they were both knighted he has found a new respect for his heritage.

From his father, RCU appreciates the value of a quick change when no-one else can see and has refined this down to a single pointer update which modern hardware guarantees to be atomic. When he is really trying to impress he uses the "xchg()" atomic operation to replace an old structure with a new structure without even needing any locking for an update (though there are some costs to this and it is mostly just a party trick - but see net/ipv4/cipso_ipv4.c:cipso_v4_req_setattr()).

This observation about the centrality of the pointer is crucial to understanding RCU's sleight-of-hand. It is easy to get distracted by the "rcu_read_lock() / rcu_read_unlock()" calls and think that they provide some sort of atomicity, but they don't. Rather they come from his mother's side of the family as we will see shortly. As the atomicity is constrained to a single pointer, two separate pointer accesses - even in the one rcu_read_lock()ed section - will not be coherent. This means it is very important not to dereference the same pointer twice (you might get two different values) and if you need to read two separate pointers, be very careful about assuming any relationship between the two values - there might be one but you need to understand the rest of the code to be sure. The most common case of dereferencing multiple pointers in the one rcu_read_lock()ed region is when following a linked list. Each pointer found leads directly to the next. In this case you can be sure that each consecutive pair will be ordered if there is ordering in the list, but you cannot be sure you will see all entries that are in the list when you started, or that you have seen all the entries that are in the list when you finish.

Now that we know that friends don't let RCU-friends assume coherence outside of a single protected structure, it should be easy to see that something looks wrong in kernel/cgroup.c:cgroup_attach_proc(). The relevant code snippet - very well commented which helps a lot - is:

	rcu_read_lock();
	if (!thread_group_leader(leader)) {
		/*
		 * a race with de_thread from another thread's exec() may strip
		 * us of our leadership, making while_each_thread unsafe to use
		 * on this task. if this happens, there is no choice but to
		 * throw this task away and try again (from cgroup_procs_write);
		 * this is "double-double-toil-and-trouble-check locking".
		 */
		rcu_read_unlock();
		retval = -EAGAIN;
		goto out_free_group_list;
	}

where:

	#define thread_group_leader(p)	(p == p->group_leader)

is defined elsewhere.

This code is assuming that rcu_read_lock() will provide coherence between this access of "leader->group_leader" here and other accesses to something in the "leader" structure later on in "while_each_thread()". While this is not totally impossible, as a well placed "synchronize_rcu()" between updates could lead to some coherence, it is not the usual way RCU is used and questions need to be asked. A discussion with the author of this code suggests that what is really needed is a read_lock() on "tasklist_lock".

So just like with the case of the "for" loop the fact that something "doesn't look right" isn't a guarantee that it is wrong but it is certainly an indication that the code is worth checking.

From his mother, RCU learned the value of counting to keep track of things. While his mum has a counter in each individual object, RCU tracks the whole system state (or at least that part entrusted to him) with a single counter (though admittedly it is spread across all CPUs). The details of getting this just right are are rather tricky, but as mentioned earlier, that is his real skill.

Sometimes he thinks this one skill is over-rated and finds a real kinship with the artists and musicians he knows who can only find work in copying the ancient masters or teaching unwilling children to play. In RCU's case the skill that most people pay for is not the "quick change, always stable" trick that he is so proud of, but the "never stale" promise he has to make to achieve this. Among the several hundred times that RCU is employed by the Linux kernel, a substantial majority simply use RCU as an extra, cheap, reference count to stop things from going stale until they are really not needed. In a number of cases this is very explicit in a "put" function. For example kernel/audit_tree.c has:

static inline void put_tree(struct audit_tree *tree)
{
	if (atomic_dec_and_test(&tree->count))
		call_rcu(&tree->head, __put_tree);
}

which makes it very clear where RCU stands - one step below his mother (we note that she isn't complaining, only him).

Even when RCU does get to use his quick-change trick it is usually to simply replace an old value with a new value - no real copying is done. Like most of us, he hardly ever uses his middle name.

Some reasonably simple examples of true RCU where the old value is copied and updated are in fs/afs/security.c:afs_cache_permit() and drivers/md/linear.c:linear_add(). In both these cases an array needs to be made bigger, so a new structure is allocated and RCU is used to slip it into place without anyone noticing.

Others, possibly more genuine (as they aren't just for a size change), can be found where "list_replace_rcu()" or "hlist_replace_rcu()" are used (about a dozen places in Linux-3.0). For example in net/ipv4/fib_trie.c:fib_table_insert(), RCU is used to update several fields in a "fib_alias" atomically.

Seeking Seqlock's secret

Returning to Miss Seqlock and examining more closely her relationship with her parents we find that she is very close to her father and depends on him to provide exclusion from other writers when writing to a protected data structure (though in the Linux kernel she has a more independent alter ego named "Seqcount" who leaves the write-side exclusion to be taken on by whoever is available).

On the other hand she shows her rebellious side by not requiring a strict bracketing of get/put calls. Sir Spinlock's calls must always pair a spin_lock() with spin_unlock() and RCU must always follow rcu_read_lock() by rcu_read_unlock(). Dame Refcount almost always pair "increment" with "decrement" though sometimes when there is only one reference left she won't bother decrementing but just throws the object away. Seqcount - possibly inspired by her mother, only pretends to need balance, on the read side at least. Her standard pattern is to balance "read_seqbegin()" with "read_seqretry()" as is shown quite eloquently in, for example, fs/nfs/nfs4state.c:nfs4_copy_stateid():

	do {
		seq = read_seqbegin(&state->seqlock);
		memcpy(dst, &state->stateid, sizeof(*dst));
	} while (read_seqretry(&state->seqlock, seq));

We have begin, copy, and retry. But it doesn't have to be like that. A simple variation can be seen in fs/cifs/dir.c:build_path_from_dentry() where three different error cases can return from the middle of the loop. In any other locking scheme the error path would need to unlock or release whatever was being held, but not with Miss Seqlock. When she is in charge you can check-out any time you like.

While this example copies the entire protected structure that is not necessary and would be prohibitive when Seqlock is protecting a large or complex structure. Rather it is only necessary to copy the pieces of interest. If they prove interesting you might copy some more bits and some more, each time only checking "read_seqretry()" to make sure the entire picture you are forming is valid (and not blurred). This is not yet common in Linux with the new scalable dcache being the only advanced user. This first exposure may yet gain more friends for Seqlock, though, and we can expect to see more of her as time passes.

It's all about teamwork

We have already noted that Seqlock usually depends on her dad for help with write-side exclusion and that RCU often helps out his mum. The co-operation doesn't end there. The parents have long worked together (it was in the middle of an "atomic_dec_and_lock()" that he proposed) and the twins carry on that tradition. Some times the family members take turns in protecting a structure and sometimes they share out responsibility for different fields, so Sir Spinlock might protect some fields in a structure while RCU guarantees that others won't change except by copying the whole.

As we now turn back to look at the context of the bug that started this discussion which was in the new path name lookup code we find that all four (main) members of the family are making significant contributions and working together.

As the earlier article on this topic outlines, Sir Spinlock and Dame Refcount are used for the "always works but sometimes slow" ref-walk option, while RCU and Seqlock help out for the "often works, always fast" rcu-walk option. This allows us to compare and contrast them in a real-life context.

Where ref-walk uses reference counts on one dentry after another to keep each dentry active, rcu-walk uses RCU to keep an effective reference count on all dentrys. Where ref-walk uses spinlocks to get a stable view of each dentry, rcu-walk uses seqlock to avoid using an unstable view.

To really understand this bug some degree of friendship with VFS is needed too, but fortunately not much. Central to the issue is the "d_inode" field of a dentry. It is fairly special because it doesn't change much. The only valid changes are from NULL to a valid inode pointer, and from a valid inode pointer back to NULL. And that second one is special too - it is only allowed to happen if there are no other references to the dentry.

The result of this in the old dcache (before rcu-walk) is that you don't need any locking for read access to d_inode, only a counted reference. If it is NULL, then you know the file doesn't exist. If it is not NULL, then the value is the file's inode and it won't change - the reference you have on the dentry becomes an implied reference on the inode.

As rcu-walk largely mapped spinlocks to seqlocks and refcounts to RCU you might assume that an RCU reference is sufficient to make accessing d_inode safe but it isn't quite that simple. To get a new reference on a dentry you need to be holding a spinlock (hence dget_locked() is the function to use). So while you only need a reference to access d_inode, you need to have previously taken a spinlock. This doesn't map directly into RCU usage as getting an RCU reference (with rcu_read_lock()) can never require a spinlock. So in rcu-walk, both Seqlock and RCU are used to protect d_inode. RCU holds a global reference so you can safely copy the d_inode value, and then later Seqlock will tell you if the copy you made is still valid.

It seems that there is room for some confusion on whether just a reference, or also a lock, is need to access d_inode and we see some of that in the patches proposed. The first patch proposed by Linus seems to treat d_inode like an RCU-protected field which should not be changed (as observed in our earlier parenthetical comment), and so removed the code that cleared it before the last RCU-reference was dropped. While it may have fixed the problem it didn't fully explain it as the field is meant to be Seqlock protected.

Linus's second patch added some Seqcount handling which seemed to acknowledge that the field was meant to be Seqlock protected, but also moved the access to d_inode towards the end of the loop. That loop shows off Seqlock's unbalanced nature quite nicely. It has a read_seqcount_begin() but had no read_seqretry(). Linus added a read_seqretry() which resulted in balance but that didn't really address the issue.

As mentioned earlier a "Seqlock protected region" is somewhat vague as it can have several read_seqretry() calls with each being used to validate what has just been copied. While a Seqlock protected region is, in some sense, still active when we enter this loop, it is no longer meant to protect d_inode. d_inode was copied out and validated in an earlier call to _d_lookup_rcu() (from do_lookup()). The current Seqlock protection will only be used if this inode is a directory. So in this loop there is no Seqlock protection for d_inode. The only reason it is still safe to access it at the end of the loop is because we know we have a mount point, and we know a mount point is pinned in such a way that d_inode cannot change (as the comment explains). At the top of the loop we don't have that certainty.

So it seems that the problem child here is Seqlock, but in fairness it isn't really her fault. She looks like a lock, but doesn't bracket things at all the same way that more traditional locks do and so those familiar with the older generation can get confused about exactly what she is protecting.

If we back-track a little and find out how the reference to d_inode came to not be properly protected we find commit 62a7375e5d77d654695 which moves the access to d_inode from the end of that loop to near the top. This looks like it should be safe as a Seqlock seems to be active, but as we now know, that can be deceiving. The reason for that move is probably worth exploring, but not here. This article is about friendship and friendship will rarely tell us how to fix a problem - we usually need to visit a specialist for that. But a friendship with data structures and locking mechanisms can help us identify which code is worthy of a closer inspection just as we have seen in this exploration.

In this case a friend would need to have known that d_inode requires Seqlock protection, and to have fully understood the rather independent approach that Seqlock takes to protected regions. Maybe Seqlock just needs some more friends.

Invitation

It has been said that given enough eyeballs, all bugs are shallow. For this to be true the eyes must also be able to focus and recognize (and even be open). Hopefully by gaining more friends who are able to recognize its faults, the Linux kernel can avoid the stigma of being difficult to understand and can once more be seen as approachable. Would you like to be friends?

Exercises:

  1. Examine each unusual "for" loop found above and determine if each is a bug, an interesting usage, or an uninteresting false-positive.

  2. "struct super" uses a secondary reference count "s_count" which provides similar functionality to that provided by RCU. Determine if any flavor of RCU can actually be used to achieve the same result. Outline the costs and benefits of such a change.

  3. Understand why commit 62a7375e5d77d654695 moved the d_inode access as it did and so determine if commit 59430262401bec02 has undone some good work.

  4. Examine the usage of RCU in jbd2_dev_to_name and list any features that "don't seem right". A deeper analysis is not required as this code was recently removed from the kernel.

[The author would like to thank Paul McKenney for providing valuable insights into the character of RCU.]

Comments (8 posted)

Patches and updates

Kernel trees

  • Thomas Gleixner: 3.0-rt4 . (August 2, 2011)
  • Thomas Gleixner: 3.0-rt6 . (August 2, 2011)

Build system

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Janitorial

Memory management

Architecture-specific

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds