
The problem with nested sleeping primitives

By Jonathan Corbet
January 7, 2015
Waiting for events in an operating system is an activity that is fraught with hazards; without a great deal of care, it is easy to miss the event that is being waited for. The result can be an infinite wait — an outcome which tends to be unpopular with users. The kernel has long since buried the relevant code in the core kernel with the idea that, with the right API, wait-related race conditions can be avoided. Recent experience shows, though, that the situation is not always quite that simple.

Many years ago, kernel code that needed to wait for an event would execute something like this:

    while (!condition)
        sleep_on(&wait_queue);

The problem with this code is that, should the condition become true between the test in the while loop and the call to sleep_on(), the wakeup could be lost and the sleep would last forever. For this reason, sleep_on() was deprecated for a long time and no longer exists in the kernel.

The contemporary pattern looks more like this:

    DEFINE_WAIT(wait);

    while (1) {
        prepare_to_wait(&queue, &wait, state);
        if (condition)
            break;
        schedule();
    }
    finish_wait(&queue, &wait);

Here, prepare_to_wait() will enqueue the thread on the given queue and put it into the given execution state, which is usually either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. Normally, that state will cause the thread to block once it calls schedule(). If the wakeup happens first, though, the process state will be set back to TASK_RUNNING and schedule() will return immediately (or, at least, as soon as it decides this thread should run again). So, regardless of the timing of events, this code should work properly. The numerous variants of the wait_event() macro expand into a similar sequence of calls.

Signs of trouble can be found in messages like the following, which are turning up on systems running the 3.19-rc kernels:

     do not call blocking ops when !TASK_RUNNING; state=1
     	set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90

This message, the result of new checks added for 3.19, indicates that a thread is performing an action that could block while it is ostensibly already in a sleeping state. One might wonder how that can be, but it is not hard to understand in the light of the sleeping code above.

The "condition" checked in that code is often a function call; that function may perform a fair amount of processing on its own. It may need to acquire locks to properly check for the wakeup condition. That is where the trouble comes in. Should the condition-checking function call something like mutex_lock(), it will go into a new version of the going-to-sleep code, changing the task state and thereby interfering with the outer sleeping code. For this reason, nesting of sleeping primitives in this way is discouraged; the new warning was added to point the finger at code performing this kind of nesting. It turns out that such nesting happens rather more often than the scheduler developers would have liked.

So what is a developer to do if the need arises to take locks while checking the sleep condition? One solution was added in 3.19; it takes the form of a new pattern that looks like this:

    DEFINE_WAIT_FUNC(wait, woken_wake_function);

    add_wait_queue(&queue, &wait);
    while (1) {
        if (condition)
            break;
        wait_woken(&wait, state, timeout);
    }
    remove_wait_queue(&queue, &wait);

The new wait_woken() function encapsulates most of the logic needed to wait for a wakeup. At a first glance, though, it looks like it would suffer from the same problem as sleep_on(): what happens if the wakeup comes between the condition test and the wait_woken() call? The key here is in the use of a special wakeup function called woken_wake_function(). The DEFINE_WAIT_FUNC() macro at the top of the above code sequence associates this function with the wait queue entry, changing what happens when the wakeup arrives.

In particular, that change causes a special flag (WQ_FLAG_WOKEN) to be set in the flags field of the wait queue entry. If wait_woken() sees that flag, it knows that the wakeup already occurred and doesn't block. Otherwise, the wakeup has not occurred, so wait_woken() can safely call schedule() to wait.

This pattern solves the problem, but there is a catch: every place in the kernel that might be using nested sleeping primitives needs to be found and changed. There are a lot of places to look for problems and potentially fix, and the fix is not an easy, mechanical change. It would be nicer to come up with a version of wait_event() that doesn't suffer from this problem in the first place or, failing that, with something new that can be easily substituted for wait_event() calls.

Kent Overstreet thinks he has that replacement in the form of the "closure" primitive used in the bcache subsystem. Closures work in a manner similar to wait_woken() in that the wakeup state is stored internally to the relevant data structure; in this case, though, an atomic reference count is used. Interested readers can see drivers/md/bcache/closure.h and closure.c for the details. Scheduler developer Peter Zijlstra is not convinced about the closure code, but he agrees that it would be nice to have a better solution.

The form of that solution is thus unclear at this point. What does seem clear is that the current nesting of sleeping primitives needs to be fixed. So, one way or another, we are likely to see a fair amount of work going into finding and changing problematic calls over the next few development cycles. Until that work is finished, warnings from the new debugging code are likely to be a common event.




The problem with nested sleeping primitives

Posted Jan 8, 2015 20:23 UTC (Thu) by reubenhwk (guest, #75803) (3 responses)

Why this...

    while (1) {
        if (condition)
            break;
        wait_woken(&wait, state, timeout);
    }

...and not this...

    while (!condition) {
        wait_woken(&wait, state, timeout);
    }

??

The problem with nested sleeping primitives

Posted Jan 8, 2015 21:47 UTC (Thu) by mm7323 (subscriber, #87386) (1 response)

Possibly because this code example is normally in a macro so the condition is re-evaluated - hence you have to watch for the expansion of 'condition' vs precedence of the logical not.

Of course, some extra brackets can fix that:

    while (!(condition)) {
        wait_woken(&wait, state, timeout);
    }

Other than that, perhaps there was some compiler that adds instructions for the extra operator, though I'd hope any optimiser should be able to efficiently write the loop from either style.

The problem with nested sleeping primitives

Posted Jan 18, 2015 20:27 UTC (Sun) by vbabka (subscriber, #91706)

For the record, macro definitions of course include necessary brackets in their definitions, so there are no gotchas when using them.

The problem with nested sleeping primitives

Posted Jan 9, 2015 17:29 UTC (Fri) by iabervon (subscriber, #722)

Probably because it was a smaller change from the previous version with prepare_to_wait(), and therefore easier to review. Also, when the condition is a function with (temporary) side effects, it can be easier to discuss the former than the latter, just in terms of having separate lines for the "while" keyword and the condition text when proposing potential races.

If you were writing new code with a simple condition, the latter would be better, except that you'd use prepare_to_wait() in that case anyway.

Silly question time

Posted Jan 9, 2015 0:01 UTC (Fri) by Max.Hyre (subscriber, #1054) (1 response)

This from an erstwhile real-time programmer with very limited knowledge of the kernel:

Why the problem in

    if (condition)
        break;
    schedule();

?
If the condition is raised between if (condition) and schedule(), why isn't it still there to be caught the next time? Is it that some other process/thread is checking on condition, and may snag it erroneously?

If someone could enlighten me why condition wouldn't be waiting around for pickup at the next test, I'd be appreciative.

Silly question time

Posted Jan 9, 2015 2:21 UTC (Fri) by koverstreet (✭ supporter ✭, #4296)

Because code doesn't just call schedule() - it changes the task state to something other than TASK_RUNNING, so that once it calls schedule() the scheduler won't run it a second time until someone else wakes it up.

If you change that to

    if (condition)
        break;
    set_current_state(TASK_UNINTERRUPTIBLE);
    schedule();

what happens if the condition becomes true and the wakeup arrives after checking the condition, but before setting the task state? See the issue?

The solution is to set the task state to TASK_UNINTERRUPTIBLE _before_ checking the condition. Then if the wakeup comes between checking the condition and the schedule(), the task state still gets set back to TASK_RUNNING and the task won't sleep forever.

The problem with nested sleeping primitives

Posted Jan 9, 2015 18:37 UTC (Fri) by iabervon (subscriber, #722)

When I tried writing out a nested version, I didn't find any actual problems, presuming that the condition code never goes to sleep after it looks at something that could affect the outcome. The outer sleep code's wakeup could inappropriately abort the inner sleep for a cycle, but when the inner sleep is done, the outer condition code would determine that it shouldn't sleep.

On the other hand, you could be sleeping in the condition code after collecting some data that will ultimately lead to deciding not to wake up, and that data changing could lead to a missed wakeup (since the outer wakeup got discarded and you will go to sleep in the outer schedule() anyway). The only reason I can think of that this would happen is if someone wanted to compare two values, each accessed under a different lock, without nesting the locks either way, which can't be very common. On the other hand, it's impractical to determine automatically whether a particular place is safe or not.

The problem with nested sleeping primitives

Posted Sep 28, 2017 2:27 UTC (Thu) by waitqueue (guest, #118805)

Here is the current wait_woken() code (v4.13). What if the wakeup comes between "if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !is_kthread_should_stop())" and "timeout = schedule_timeout(timeout);"? In that case it seems we also sleep forever.

    long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout)
    {
        set_current_state(mode); /* A */
        /*
         * The above implies an smp_mb(), which matches with the smp_wmb() from
         * woken_wake_function() such that if we observe WQ_FLAG_WOKEN we must
         * also observe all state before the wakeup.
         */
        if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !is_kthread_should_stop())
            timeout = schedule_timeout(timeout);
        __set_current_state(TASK_RUNNING);

        /*
         * The below implies an smp_mb(), it too pairs with the smp_wmb() from
         * woken_wake_function() such that we must either observe the wait
         * condition being true _OR_ WQ_FLAG_WOKEN such that we will not miss
         * an event.
         */
        smp_store_mb(wq_entry->flags, wq_entry->flags & ~WQ_FLAG_WOKEN); /* B */

        return timeout;
    }
    EXPORT_SYMBOL(wait_woken);


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds