Driver porting: sleeping and waking up

[Posted February 24, 2003 by corbet]

This article is part of the LWN Porting Drivers to 2.6 series.

Contrary to expectations, the classic functions sleep_on() and interruptible_sleep_on() were not removed in the 2.5 series. It seems that they are still needed in a few places where (1) taking them out is quite a bit of work, and (2) they are actually used in a way that is safe. Most authors of kernel code should, however, pretend that those functions no longer exist. There are very few situations in which they can be used safely, and better alternatives exist.

`wait_event()` and friends

Most of those alternatives have been around since 2.3 or earlier. In many situations, one can use the wait_event() macros:

    DECLARE_WAIT_QUEUE_HEAD(queue);

    wait_event(queue, condition);
    int wait_event_interruptible(queue, condition);

These macros work the same as in 2.4: condition is a boolean condition which will be tested within the macro; the wait will end when the condition evaluates true.

It is worth noting that these macros have moved from <linux/sched.h> to <linux/wait.h>, which seems a more sensible place for them. There is also a new one:

    int wait_event_interruptible_timeout(queue, condition, timeout);

which will terminate the wait if the timeout expires.
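The return value distinguishes three outcomes. As documented in the 2.6 headers, the macro yields 0 when the timeout (expressed in jiffies) expires, a negative error code (-ERESTARTSYS) when a signal interrupts the wait, and the positive number of jiffies remaining when the condition became true. A minimal userspace sketch of how a caller might sort out those cases follows; the enum and classify_wait_result() are hypothetical names invented for this illustration, not kernel interfaces.

```c
/* Hypothetical helper, not part of the kernel API: it only shows how the
 * three kinds of return value from wait_event_interruptible_timeout()
 * can be told apart. */
enum wait_outcome {
    WAIT_TIMED_OUT,      /* returned 0: timeout expired, condition still false */
    WAIT_INTERRUPTED,    /* returned < 0: a signal arrived (-ERESTARTSYS) */
    WAIT_CONDITION_MET   /* returned > 0: jiffies left when condition came true */
};

enum wait_outcome classify_wait_result(long ret)
{
    if (ret == 0)
        return WAIT_TIMED_OUT;
    if (ret < 0)
        return WAIT_INTERRUPTED;
    return WAIT_CONDITION_MET;
}
```

In a driver, the argument would simply be the value returned by wait_event_interruptible_timeout() itself, and the interrupted case would normally be passed back to the caller unchanged.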

`prepare_to_wait()` and friends

In many situations, wait_event() does not provide enough flexibility - often because tricky locking is involved. The alternative in those cases has been to do a full "manual" sleep, which involves the following steps (shown here in a sort of pseudocode, of course):

    DECLARE_WAIT_QUEUE_HEAD(queue);
    DECLARE_WAITQUEUE(wait, current);

    for (;;) {
        add_wait_queue(&queue, &wait);
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
            break;
        schedule();
        remove_wait_queue(&queue, &wait);
        if (signal_pending(current))
            return -ERESTARTSYS;
    }
    /* Leaving via break: the entry is still on the queue, so remove it */
    remove_wait_queue(&queue, &wait);
    set_current_state(TASK_RUNNING);

A sleep coded in this manner is safe against missed wakeups. It is also a fair amount of error-prone boilerplate code for a very common situation. In 2.6, a set of helper functions has been added which makes this task easier. The modern equivalent of the above code would look like:

    DECLARE_WAIT_QUEUE_HEAD(queue);
    DEFINE_WAIT(wait);

    while (! condition) {
        prepare_to_wait(&queue, &wait, TASK_INTERRUPTIBLE);
        if (! condition)
            schedule();
        finish_wait(&queue, &wait);
    }

prepare_to_wait_exclusive() should be used when an exclusive wait is needed. Note that the new macro DEFINE_WAIT() is used here, rather than DECLARE_WAITQUEUE(). The former should be used when the wait queue entry is to be used with prepare_to_wait(), and should probably not be used in other situations unless you understand what it is doing (which we'll get into next).
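The reason the condition is tested again after prepare_to_wait() is the missed-wakeup race: an event that arrives after the condition test but before the task is on the queue wakes nobody. A toy, single-threaded userspace model can make the ordering visible. Everything in it (struct toy_task, toy_event(), and the two sleep functions) is hypothetical and stands in for kernel behavior only loosely; in both functions the event is deliberately fired inside the race window.

```c
#include <stdbool.h>

/* Toy model, not kernel code: one task and one wait queue slot. */
struct toy_task {
    bool queued;      /* is the task on the wait queue? */
    bool sleeping;    /* stands in for TASK_INTERRUPTIBLE vs. TASK_RUNNING */
    bool condition;   /* the event the task is waiting for */
};

/* The waker: make the condition true, then wake whoever is queued. */
void toy_event(struct toy_task *t)
{
    t->condition = true;
    if (t->queued)
        t->sleeping = false;   /* a wakeup only reaches queued tasks */
}

/* Test first, queue later. Returns true if the task ends up blocked
 * with no wakeup left to rescue it. */
bool toy_naive_sleep(struct toy_task *t)
{
    if (t->condition)
        return false;          /* nothing to wait for */
    toy_event(t);              /* race: event fires after the test ... */
    t->queued = true;          /* ... but before the task is queued */
    t->sleeping = true;
    return t->sleeping;        /* still asleep: the wakeup was missed */
}

/* Queue and set the state first, then test, as the loop above does. */
bool toy_prepared_sleep(struct toy_task *t)
{
    t->queued = true;          /* models prepare_to_wait() */
    t->sleeping = true;
    toy_event(t);              /* the same event, in the same window */
    bool blocked = !t->condition && t->sleeping;  /* would schedule() block? */
    t->queued = false;         /* models finish_wait() */
    t->sleeping = false;
    return blocked;
}
```

With a fresh task, toy_naive_sleep() reports a stuck sleeper while toy_prepared_sleep() does not, because the second test of the condition sees the event that landed in between.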

Wait queue changes

In addition to being more concise and less error prone, prepare_to_wait() can yield higher performance in situations where wakeups happen frequently. This improvement is obtained by causing the process to be removed from the wait queue immediately upon wakeup; that removal keeps the process from seeing multiple wakeups if it doesn't otherwise get around to removing itself for a bit.

The automatic wait queue removal is implemented via a change in the wait queue mechanism. Each wait queue entry now includes its own "wake function," whose job it is to handle wakeups. The default wake function (which has the surprising name default_wake_function()) behaves in the customary way: it sets the waiting task into the TASK_RUNNING state and handles scheduling issues. The DEFINE_WAIT() macro creates a wait queue entry with a different wake function, autoremove_wake_function(), which automatically takes the newly-awakened task out of the queue.

And that, of course, is how DEFINE_WAIT() differs from DECLARE_WAITQUEUE(): they set different wake functions. How the semantics of the two differ is not immediately evident from their names, but that's how it goes. (The new runtime initialization function init_wait() differs from the older init_waitqueue_entry() in exactly the same way.)
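The difference between the two wake functions can be imitated in a small userspace model. The sketch below is not the kernel's implementation: the toy_ types and functions are hypothetical, there is no locking, and "waking" a task is reduced to bumping a counter. It only shows the mechanism described above, a per-entry function pointer called on wakeup, with the autoremove variant unlinking its entry once the wakeup succeeds.

```c
/* Toy userspace model of per-entry wake functions; all names are made up. */
struct toy_wait_entry;
typedef int (*toy_wake_func_t)(struct toy_wait_entry *entry);

struct toy_wait_entry {
    toy_wake_func_t func;               /* called when the queue is woken */
    int wakeups_seen;                   /* how often this entry was woken */
    struct toy_wait_entry *prev, *next; /* circular list links */
};

struct toy_wait_queue {
    struct toy_wait_entry head;         /* list sentinel, never woken */
};

void toy_queue_init(struct toy_wait_queue *q)
{
    q->head.next = q->head.prev = &q->head;
}

/* Append an entry to the queue with the given wake function. */
void toy_add_wait(struct toy_wait_queue *q, struct toy_wait_entry *e,
                  toy_wake_func_t func)
{
    e->func = func;
    e->wakeups_seen = 0;
    e->prev = q->head.prev;
    e->next = &q->head;
    q->head.prev->next = e;
    q->head.prev = e;
}

/* Unlink an entry and leave it pointing at itself. */
void toy_remove_wait(struct toy_wait_entry *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = e;
}

/* Customary behavior: wake the task, leave the entry on the queue. */
int toy_default_wake(struct toy_wait_entry *e)
{
    e->wakeups_seen++;
    return 1;
}

/* Autoremove behavior: wake the task, then take the entry off the queue. */
int toy_autoremove_wake(struct toy_wait_entry *e)
{
    int woken = toy_default_wake(e);
    if (woken)
        toy_remove_wait(e);
    return woken;
}

/* Call every entry's wake function; next is saved first because the
 * function may unlink the entry it is called on. */
void toy_wake_up(struct toy_wait_queue *q)
{
    struct toy_wait_entry *e = q->head.next;
    while (e != &q->head) {
        struct toy_wait_entry *next = e->next;
        e->func(e);
        e = next;
    }
}
```

If one entry is queued with toy_default_wake() and another with toy_autoremove_wake(), two calls to toy_wake_up() reach the first entry twice but the second only once, which is the "multiple wakeups" saving that prepare_to_wait() is after.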

If need be, you can define your own wake function - though the need for that should be quite rare (about the only user, currently, is the support code for the epoll() system calls). The wake function has this prototype:

    typedef int (*wait_queue_func_t)(wait_queue_t *wait, 
                                     unsigned mode, int sync);

A wait queue entry can be given a different wakeup function with:

    void init_waitqueue_func_entry(wait_queue_t *queue, 
                                   wait_queue_func_t func);

One other change that most programmers won't notice: a bunch of wait queue cruft from 2.4 (two different kinds of wait queue lock, wait queue debugging) has been removed from 2.6.

Driver porting: sleeping and waking up

Posted Feb 28, 2003 13:49 UTC (Fri) by ortalo (guest, #4654)

As I understand it (possibly with several misunderstandings), these wait and wake-up functions primarily address userspace waits.
What about kernel-internal waiting? (aka: Is it reasonable to call wake_up() from an interrupt handler?)

A practical example (the one I'm concerned with): sending out graphical display lists via DMA to a modern graphics hardware accelerator.
In this context, one process sends memory buffers to the kernel and occasionally waits when too many buffers are already submitted for execution: in this case I have to use the wait functions you describe, that's okay.
Now, later on, the display lists get executed by the graphics hardware and the end of each display list generates an IRQ. The IRQ handler needs to acknowledge the interrupt, but also to submit the next display list for execution (if any; remember that userspace may submit display lists faster than the hardware executes them, and some of them may be queued inside the kernel).
It can work like this. But it is not very clean to directly submit the next display list to the hardware *from the interrupt handler*. For example, one would like to do a time-consuming checking step on the display list before submission; worse, if AGP is involved (sooner or later) one will want to mess with the AGP translation tables. Doing this from an interrupt handler does not seem very reasonable (time-consuming and possibly disruptive work).
In an ideal world (i.e., when I'm knowledgeable enough), a sort of kernel thread (one per graphics processor) would exist for submitting work to the hardware, and the interrupt handler would only signal the end of execution to that kernel thread. Would it be possible to use the wait queues you describe for synchronization between the kernel thread and the interrupt handler? How would you recommend addressing such an issue?

Rodolphe

Driver porting: sleeping and waking up

Posted Feb 28, 2003 15:04 UTC (Fri) by corbet (editor, #1)

The best option, of course, would be to check the display lists at the time they are submitted by the user space process. That way you can return an immediate error if something is wrong.

If you have other stuff that needs doing, a kernel thread could certainly do it. I would recommend a look at the workqueue interface, however, as a relatively easy way to do this sort of deferred processing. You can feed a task into a workqueue from an interrupt handler and it can execute at leisure, in process context, later on.

Driver porting: sleeping and waking up

Posted Mar 6, 2003 10:40 UTC (Thu) by driddoch (guest, #9975)

And just to illustrate how error prone the manual sleep interface is, the example has a bug: You must remove yourself from the wait queue before you return -ERESTARTSYS.

No doubt this was deliberate ;-)

Driver porting: sleeping and waking up

Posted Mar 6, 2003 13:43 UTC (Thu) by corbet (editor, #1)

Of course it was deliberate. I decided that maybe it was too subtle a way of making my point, though, so I fixed it...:)

Driver porting: wait queue and preemptible code

Posted Dec 21, 2003 15:12 UTC (Sun) by nblanc (guest, #17996)

With kernel 2.4, you can assume that once you set the process
into TASK_INTERRUPTIBLE mode, there will be no scheduling
(until you decide to do it explicitly).

But what happens with kernel 2.6 in the case below?

    for (;;)
    {
        add_wait_queue(&queue, &wait);
        <------------------------------------------ Event occurred
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition) break;
        schedule();
        remove_wait_queue(&queue, &wait);
        if (signal_pending(current)) return -ERESTARTSYS;
    }
    <------------------------------------------- Schedule
    set_current_state(TASK_RUNNING);

I think code like the above is no longer safe with kernel 2.6.
Is that right?

Does the new approach take care of this?

Nicolas

checking signal_pending

Posted Mar 12, 2004 0:35 UTC (Fri) by ianw (guest, #20143)

The code with the helper functions above omits the signal_pending() call, implying it is not needed. I may be wrong, but I can't see anything in the path of finish_wait() that checks for signals, so I believe you still need to make that signal_pending() call?

Driver porting: sleeping and waking up

Posted May 24, 2005 10:07 UTC (Tue) by andyparkins (guest, #30122)

In the example:

    while (! condition) {
        prepare_to_wait(&queue, &wait, TASK_INTERRUPTIBLE);
        if (! condition)
            schedule();
        finish_wait(&queue, &wait);
    }

Should finish_wait() not be outside the loop?
