Next-interrupt prediction
When a CPU has no work to do, it should go into a sleep state to save power. Modern processors offer a number of sleep states, though, with different characteristics. A shallow sleep is quick to get into and out of, but the power savings offered by shallow sleeps are relatively small. The deeper sleep states can reduce power consumption to nearly zero, but getting a processor back into a running state from a deep sleep state takes a long time and consumes a certain amount of power in its own right. So it only makes sense to enter a deep sleep state if the processor will remain asleep for a relatively long time.
The kernel can never really know how long a processor will be able to sleep, so it has to make its best guess. One way to do that is to look at the next scheduled timer event; that provides an upper bound on how long the processor will remain idle, but it is not the whole picture. The other thing that can wake a processor is an interrupt from a peripheral device (or an interprocessor interrupt (IPI) from another CPU). Current kernels try to take interrupts into account by looking at the length of recent idle cycles; if those cycles are reasonably regular, their length can be taken as a good guess for when the next wakeup will occur.
But, as Daniel Lezcano notes in his IRQ-based wakeup prediction patches, this approach has some shortcomings. It looks only at idle periods without taking the wakeup event into account, so it cannot separate the effects of timer events and interrupts. IPIs factor into the estimate as well, but IPIs are often generated by the scheduler, which is also trying to figure out what the next idle period might be, leading to interesting feedback loops. This approach is also unable to take into account the behavior of individual interrupt sources or to respond to their addition or removal.
Daniel has been working on this problem for a while; he presented one solution at the 2014 Linux Plumbers Conference. His approach at that time was a relatively elaborate, bucket-based system that tracked interrupts associated with each process on the system. There were some complaints at the time that interrupt behavior has more to do with devices than processes, and this work was never pushed into the mainline.
The new approach is conceptually simpler. In short, it tracks the recent interrupt behavior of each device in the system on a per-CPU basis. When the time comes to guess at the length of an idle stretch, each device's behavior is examined separately, and a guess is made regarding which device will interrupt next and when that will happen.
To gather this information, Daniel introduces a new mechanism to track interrupt timings. It is all based around a structure with a handful of functions to be called out of the interrupt-handling subsystem:
    typedef void (*irqt_handler_t)(unsigned int irq, ktime_t time, void *dev_id);

    struct irqtimings_ops {
        int (*alloc)(unsigned int irq);
        void (*free)(unsigned int irq);
        int (*setup)(unsigned int irq, struct irqaction *act);
        void (*remove)(unsigned int irq, void *dev_id);
        irqt_handler_t handler;
    };
Interestingly, there can only be one of these structures in the system, and it must be declared with the DECLARE_IRQ_TIMINGS() macro. This mechanism runs in interrupt mode, so it must do as little work as possible; that means there is no desire to add an elaborate mechanism to call multiple handlers. The creation of a single, global structure also ensures that, if the mechanism is built into the kernel (via the CONFIG_IRQ_TIMINGS parameter), there is also a consumer for the timing information. In the absence of that consumer, the global structure will not be defined, and the kernel build will fail.
The alloc() and free() operations are called when interrupt descriptors (the core data structure for managing interrupt sources) are added to or removed from the kernel. setup() and remove(), instead, are called when the first handler is set up for a given interrupt (or the last one removed). Finally, handler() is called whenever an actual interrupt happens for the given irq number; it is passed a timestamp saying when the interrupt occurred.
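As an illustration (not code from the patch set itself), a consumer might be wired up roughly along these lines, assuming that DECLARE_IRQ_TIMINGS() takes a pointer to the single ops structure; the callback names and bodies here are placeholders:

    /* Placeholder callbacks; the real consumer lives in the scheduler. */
    static int my_irqt_alloc(unsigned int irq)
    {
        /* allocate per-IRQ tracking state */
        return 0;
    }

    static void my_irqt_free(unsigned int irq)
    {
        /* release per-IRQ tracking state */
    }

    static int my_irqt_setup(unsigned int irq, struct irqaction *act)
    {
        /* the first handler was registered for this IRQ */
        return 0;
    }

    static void my_irqt_remove(unsigned int irq, void *dev_id)
    {
        /* the last handler was removed for this IRQ */
    }

    /* Runs in interrupt context, so it must be quick. */
    static void my_irqt_handler(unsigned int irq, ktime_t time, void *dev_id)
    {
        /* record the timestamp for later analysis */
    }

    static struct irqtimings_ops my_irqtimings_ops = {
        .alloc   = my_irqt_alloc,
        .free    = my_irqt_free,
        .setup   = my_irqt_setup,
        .remove  = my_irqt_remove,
        .handler = my_irqt_handler,
    };

    DECLARE_IRQ_TIMINGS(&my_irqtimings_ops);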
On the consumer side (the scheduler's idle-time estimation code), Daniel's patch sets up a data structure that looks like this:
    #define STATS_NR_VALUES 4

    struct stats {
        u64           sum;                      /* sum of values */
        u32           values[STATS_NR_VALUES];  /* array of values */
        unsigned char w_ptr;                    /* current window pointer */
    };

    struct wakeup {
        struct stats stats;
        ktime_t      timestamp;
    };
Each CPU gets its own array of wakeup structures, with one entry for each interrupt number. That structure holds the time of the last observed interrupt and the stats structure which, in turn, holds a simple circular buffer of observed interrupt timings.
When the interrupt timing handler is called, it looks up the appropriate wakeup structure. The time since the last interrupt is calculated and inserted into the circular buffer; the sum of all the interrupt timings is updated as well. If it has been more than one second since the previous interrupt, though, the accumulated information is discarded instead and the statistics collection is restarted from the beginning. Once collection has been active for a bit, the code can easily calculate the mean time between interrupts and the variance in that time as well.
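As a rough sketch of how such a window of samples can be maintained and summarized (illustrative code only, not the actual patch; the helper names are made up), the update and the derived statistics might look something like:

    /*
     * Add a new inter-interrupt interval: the oldest sample falls out of
     * the window, the running sum is adjusted, and the write pointer
     * advances.  (Illustrative sketch, not the patch itself.)
     */
    static void stats_add(struct stats *s, u32 interval)
    {
        s->sum -= s->values[s->w_ptr];
        s->values[s->w_ptr] = interval;
        s->sum += interval;
        s->w_ptr = (s->w_ptr + 1) % STATS_NR_VALUES;
    }

    /* Mean time between interrupts over the window. */
    static u32 stats_mean(const struct stats *s)
    {
        return s->sum / STATS_NR_VALUES;
    }

    /* Variance of the samples around that mean. */
    static u64 stats_variance(const struct stats *s)
    {
        u32 mean = stats_mean(s);
        u64 variance = 0;
        int i;

        for (i = 0; i < STATS_NR_VALUES; i++) {
            s64 diff = (s64)s->values[i] - mean;
            variance += diff * diff;
        }
        return variance / STATS_NR_VALUES;
    }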
In the current patch set, there is no tracking at the level of individual devices; if multiple devices share an interrupt number, they will all appear together in the statistics. That may prove to be a shortcoming on systems with large amounts of interrupt sharing, but it should also be easy to fix should that turn out to be the case. Given that interrupt sharing appears to be slowly fading away, this may not be a concern in the end.
When the time comes to make a guess for the duration of the next idle period for a given CPU, the code iterates through all of the interrupts that have been active on that CPU. For each, the mean time between interrupts and the variance are calculated. If the timing between the last two interrupts was within one standard deviation of the mean, the code concludes that interrupts on this line are predictable; the next interrupt time is then calculated by adding the mean time to the time of the last interrupt. The interrupt that is predicted to happen the soonest is used to make a guess at the expected idle time.
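Using the helpers sketched above, and assuming the intervals are kept in microseconds (one of the details still under discussion), the prediction pass might look roughly like the following; the "within one standard deviation" test is done on squared values to avoid taking a square root:

    /*
     * Pick the earliest predicted interrupt across all IRQs seen on this
     * CPU.  Illustrative sketch only; the flat per-CPU array and the
     * nr_tracked_irqs parameter are assumptions of this example.
     */
    static ktime_t predict_next_interrupt(struct wakeup *wakeups,
                                          unsigned int nr_tracked_irqs)
    {
        ktime_t earliest = KTIME_MAX;
        unsigned int irq;

        for (irq = 0; irq < nr_tracked_irqs; irq++) {
            struct wakeup *w = &wakeups[irq];
            u32 mean = stats_mean(&w->stats);
            u64 variance = stats_variance(&w->stats);
            u32 last = w->stats.values[(w->stats.w_ptr +
                                        STATS_NR_VALUES - 1) %
                                       STATS_NR_VALUES];
            s64 diff = (s64)last - mean;
            ktime_t next;

            if (!mean)
                continue;               /* no data for this line yet */
            if ((u64)(diff * diff) > variance)
                continue;               /* looks unpredictable; skip it */

            /* Predicted next interrupt: the last one plus the mean interval. */
            next = ktime_add_us(w->timestamp, mean);
            if (ktime_before(next, earliest))
                earliest = next;
        }
        return earliest;        /* KTIME_MAX if nothing looked predictable */
    }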
One could certainly try to poke holes in this mechanism. Four samples seems like a small number to be drawing conclusions from. The fact that the algorithm skips over interrupts that seem unpredictable suggests that it might overestimate the length of the coming idle period. Daniel acknowledges some of these limitations, but says: "The statistics are very trivial and could be improved later but this first step shows we have a nice overall improvement in SMP." This mechanism does not show an improvement on uniprocessor systems, though.
There has been a fair amount of discussion around these patches, mostly focused on relatively low-level implementation details (whether timestamps should be kept in microseconds or nanoseconds, for example). One more significant issue is that the simple arrays indexed by interrupt number will not work on some systems with more complex interrupt setups. Instead, Thomas Gleixner said, the code needs to use a radix tree to track interrupt sources.
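As a rough illustration of that suggestion (not code from the patches), the per-IRQ state could be keyed by interrupt number in a kernel radix tree rather than used as an index into a flat array:

    #include <linux/radix-tree.h>

    /* Illustrative only: a radix tree mapping IRQ numbers to wakeup state. */
    static RADIX_TREE(wakeup_tree, GFP_ATOMIC);

    static struct wakeup *wakeup_lookup(unsigned int irq)
    {
        return radix_tree_lookup(&wakeup_tree, irq);
    }

    static int wakeup_store(unsigned int irq, struct wakeup *w)
    {
        return radix_tree_insert(&wakeup_tree, irq, w);
    }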
There does not appear to be opposition to the underlying approach, though. So, once the details have been worked out, this work may get the green light to go into the mainline kernel. Then, perhaps, our batteries will last a little longer, which cannot be a bad thing.