LWN.net Logo

Better device power management for 3.2

By Jonathan Corbet
November 8, 2011
The Linux kernel has long had the ability to regulate the CPU's voltage and frequency for optimal behavior, where "optimal" is a function of both performance and power consumption. But a system is more than just a CPU, and there are many other components which are able to run at multiple performance levels. It is unsurprising that a proper infrastructure for managing device operating points has lagged that for the CPU, since the amount of power to be saved is usually smaller. But now that CPU power behavior is fairly well optimized, the power infrastructure is growing to encompass the rest of the system. The 3.2 kernel will have a new set of APIs intended to allow drivers to let the system find the best operating level for the devices they manage.

There are three separate pieces to the dynamic voltage and frequency scaling (DVFS) API, the first of which was actually merged for the 2.6.37 release. The "operating power points" module simply tracks the various operating levels available to a given device; the API is declared in <linux/opp.h>. Briefly, operating points are managed with:

    int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt);
    int opp_enable(struct device *dev, unsigned long freq);
    int opp_disable(struct device *dev, unsigned long freq);

Operating points are enabled by default; a driver may disable specific points to reflect temperature or performance concerns. There is a set of functions for retrieving operating points above or below a given frequency, useful for moving up or down the power/performance scale.

A driver wanting to support DVFS on a specific device would start by filling in one of these structures (declared, along with the rest of the API, in <linux/devfreq.h>):

    struct devfreq_dev_profile {
	unsigned long initial_freq;
	unsigned int polling_ms;

	int (*target)(struct device *dev, unsigned long *freq);
	int (*get_dev_status)(struct device *dev,
			      struct devfreq_dev_status *stat);
	void (*exit)(struct device *dev);
    };

Here initial_freq is, unsurprisingly, the original operating frequency of the device. Almost everything else in this structure is there to help frequency governors do their jobs. If polling_ms is non-zero, it tells the governor how often to poll the device to get its usage information; that polling will take the form of a call to get_dev_status(). That function should fill the stat structure with the relevant information:

    struct devfreq_dev_status {
	/* both since the last measure */
	unsigned long total_time;
	unsigned long busy_time;
	unsigned long current_frequency;
	void *private_data;
    };

The governor will use this information to decide whether the current operating frequency should be changed or not. Should a change be needed, the target() callback will be called to change the operating point accordingly. This function should pick a frequency at least as high as the passed in *freq, then update *freq to reflect the actual frequency chosen. The exit() callback gives the driver a chance to clean things up if the DVFS layer decides to forget about the device.

Once the devfreq_dev_profile structure is filled in, the driver registers it with:

    struct devfreq *devfreq_add_device(struct device *dev,
				       struct devfreq_dev_profile *profile,
				       const struct devfreq_governor *governor,
				       void *data);

If need be, a driver can supply its own governor to manage frequencies, but the kernel supplies a few of its own: devfreq_powersave (keeps the frequency as low as possible), devfreq_performance (keeps the frequency as high as possible), devfreq_userspace (allows control of the frequency through sysfs), and devfreq_simple_ondemand (tries to strike a balance between performance and power consumption).

The notifier mechanism built into the operating power points code can be used to automatically invoke the governor should the set of available power points change. There are a number of ways in which that change could come about; one of those is a change in expectations regarding how quickly the device can respond. For this case, 3.2 also gained an enhancement to the quality-of-service (pm_qos) code to handle per-device QOS requirements. Kernel code can express its QOS expectations for a device using these functions (all from <linux/pm_qos.h>):

    int dev_pm_qos_add_request(struct device *dev, struct dev_pm_qos_request *req,
			       s32 value);
    int dev_pm_qos_update_request(struct dev_pm_qos_request *req, s32 new_value);
    int dev_pm_qos_remove_request(struct dev_pm_qos_request *req);

The dev_pm_qos_request structure is used as a handle for managing requests, but calling code does not need to access its internals. The passed value describes the desired quality of service; the documentation is surprisingly vague on just what the units of value are. It would appear to describe the desired latency, but the desired precision is unclear.

On the driver side, the notifier interface is used:

    int dev_pm_qos_add_notifier(struct device *dev,
			    	struct notifier_block *notifier);
    int dev_pm_qos_remove_notifier(struct device *dev,
			           struct notifier_block *notifier);

When a device's quality-of-service requirements are changed, the notifier will be called with the new value. The driver can then adjust the available operating power points, disabling any that would render the device unable to meet the specified QOS requirement.

It is worth noting that none of the new code has any in-tree users as of this writing. That suggests that the interface might be more than usually volatile; once developers try to make use of this facility, they are likely to find things that can be improved. But, then, internal interfaces are always subject to change; regardless of any evolution here, the underlying capability should prove useful.


(Log in to post comments)

Better device power management for 3.2

Posted Nov 10, 2011 4:50 UTC (Thu) by nmenon (subscriber, #62037) [Link]

Nitpick minor comment - OPP was meant to stand for Operating Performance Points (Documentation/power/opp.txt)

Better device power management for 3.2

Posted Nov 17, 2011 2:54 UTC (Thu) by kevinm (guest, #69913) [Link]

You missed a perfect chance to use the headline "Who's down with OPP?"

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds