
The cpuidle subsystem

By Jonathan Corbet
April 26, 2010
Your editor recently had cause to dig around in the cpuidle subsystem. It never makes sense to let such work go to only a single purpose when it could be applied toward the creation of a kernel-page article. So, what follows is a multi-level discussion of cpuidle, what it's for, and how it works. Doing nothing, it turns out, is more complicated than one might think.

On most systems, the processor is idle much of the time. We can't always be running CPU-intensive work like kernel builds, video transcoding, weather modeling, or yum. When there is nothing left to do, the processor will go into the idle state to wait until it is needed again. Once upon a time, on many systems, the "idle state" was literally a thread running at the lowest possible priority which would execute an infinite loop until the system found something better to do. Killing the idle process was a good way to panic a VAX/VMS machine, which had no clue of how to do nothing without a task dedicated to that purpose.

Running a busy-wait loop requires power; contemporary concerns have led us to the conclusion that expending large amounts of power toward the accomplishment of nothing is rarely a good idea. So CPU designers have developed ways for the processor to go into a lower-power state when there is nothing for it to do. Typically, when put into this state, the CPU will stop clocks and power down part or all of its circuitry until the next interrupt arrives. That results in the production of far more nothing per watt than busy-waiting.

In fact, most CPUs have multiple ways of doing nothing more efficiently. These idle modes, which go by names like "C states," vary in the amount of power saved, but also in the amount of ancillary information which may be lost and the amount of time required to get back into a fully-functional mode. On your editor's laptop, there are three idle states with the following characteristics:

                              C1      C2      C3
    Exit latency (µs)          1       1      57
    Power consumption (mW)  1000     500     100

On a typical processor, C1 will just turn off the processor clock, while C2 turns off other clocks in the system and C3 will actively power down parts of the CPU. On such a system, it would make sense to spend as much time as possible in the C3 state; indeed, while this sentence is being typed, the system is in C3 about 97% of the time. One might have thought that emacs could do a better job of hogging the CPU, but even emacs is no challenge for modern processors. The C1 state is not used at all, while a small amount of time is spent in C2.

One might wonder why the system bothers with anything but C3 at all; why not insist on the most nothing for the buck? The answer, of course, is that C3 has a cost. The 57µs exit latency means that the system must commit to doing nothing for a fair while. Bringing the processor back up also consumes power in its own right, and the ancillary costs - the C3 state might cause the flushing of the L2 cache - also hurt. So it's only worth going into C3 if the power savings will be real and if the system knows that it will not have to respond to anything with less than 57µs latency. If those conditions do not hold, it makes more sense to use a different idle state. Making that decision is the cpuidle subsystem's job.

Every processor has different idle-state characteristics and different actions are required to enter and leave those states. The cpuidle code abstracts that complexity into a separate driver layer; the drivers themselves are often found in architecture-specific or ACPI code. On the other hand, the decision as to which idle state makes sense in a given situation is very much a policy issue. The cpuidle "governors" interface allows the implementation of different policies for different needs. We'll take a look at both layers.

cpuidle drivers

At the highest level, the cpuidle driver interface is quite simple. It starts by registering the driver with the subsystem:

    #include <linux/cpuidle.h>

    struct cpuidle_driver {
	char			name[CPUIDLE_NAME_LEN];
	struct module 		*owner;
    };

    int cpuidle_register_driver(struct cpuidle_driver *drv);

About all this accomplishes is making the driver name available in sysfs. The cpuidle core also will enforce the requirement that only one cpuidle driver exist in the system at any given time.
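The one-driver-at-a-time rule can be illustrated with a small userspace model. This is only a sketch: the demo_ names are hypothetical stand-ins rather than kernel symbols, and the choice of -EBUSY for a refused registration is an assumption made for the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NAME_LEN 16

/* Userspace stand-in for struct cpuidle_driver; not the kernel definition. */
struct demo_driver {
	char name[DEMO_NAME_LEN];
};

/* The one driver currently registered, or NULL if there is none. */
static struct demo_driver *demo_current_driver;

/* Mimics the single-driver rule: a second registration is refused. */
int demo_register_driver(struct demo_driver *drv)
{
	if (!drv)
		return -EINVAL;
	if (demo_current_driver)
		return -EBUSY;	/* error code assumed for this model */
	demo_current_driver = drv;
	return 0;
}

/* Unregistering anything but the current driver is a bug in this model. */
void demo_unregister_driver(struct demo_driver *drv)
{
	assert(drv == demo_current_driver);
	demo_current_driver = NULL;
}
```

In this model a second driver can only get in after the first has been unregistered.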

Once the driver exists, though, it can register a cpuidle "device" for each CPU in the system - it is possible for different processors to have completely different setups, though your editor suspects that tends not to happen in real-world systems. The first step is to describe the processor idle states which are available for use:

    struct cpuidle_state {
	char		name[CPUIDLE_NAME_LEN];
	char		desc[CPUIDLE_DESC_LEN];
	void		*driver_data;

	unsigned int	flags;
	unsigned int	exit_latency; /* in US */
	unsigned int	power_usage; /* in mW */
	unsigned int	target_residency; /* in US */

	unsigned long long	usage;
	unsigned long long	time; /* in US */

	int (*enter)	(struct cpuidle_device *dev,
			 struct cpuidle_state *state);
    };

The name and desc fields describe the state; they will show up in sysfs eventually. driver_data is there for the driver's private use. The next four fields, starting with flags, describe the characteristics of this sleep state. Possible flags values are:

  • CPUIDLE_FLAG_TIME_VALID should be set if it is possible to accurately measure the amount of time spent in this particular idle state.

  • CPUIDLE_FLAG_CHECK_BM indicates that this state is not compatible with bus-mastering DMA activity. Deep sleeps will, among other things, disable the bus cycle snooping hardware, meaning that processor-local caches may fail to be updated in response to DMA. That can lead to data corruption problems.

  • CPUIDLE_FLAG_POLL says that this state causes no latency, but also fails to save any power.

  • CPUIDLE_FLAG_SHALLOW indicates a "shallow" sleep state with low latency and minimal power savings.

  • CPUIDLE_FLAG_BALANCED is for intermediate states with some latency and moderate power savings.

  • CPUIDLE_FLAG_DEEP marks deep sleep states with high latency and high power savings.

The depth of the sleep state is also described by the remaining fields: exit_latency says how long it takes to get back to a fully functional state, power_usage is the amount of power consumed by the CPU when it is in this state, and target_residency is the minimum amount of time the processor should spend in this state to make the transition worth the effort.

The enter() function will be called when the current governor decides to put the CPU into the given state; it will be described more fully below. The number of times the state has been entered will be kept in usage, while time records the amount of time spent in this state.
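As an illustration, the three idle states from the table at the top of the article might be described as follows. This is a userspace sketch with hypothetical demo_ names; it copies only the descriptive fields, and the target_residency values are invented for the example since the article does not give them.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NAME_LEN 16

/* Userspace copy of the descriptive fields only; not the kernel struct. */
struct demo_state {
	char name[DEMO_NAME_LEN];
	unsigned int exit_latency;	/* in microseconds */
	unsigned int power_usage;	/* in mW */
	unsigned int target_residency;	/* in microseconds */
};

/*
 * Latency and power figures come from the table earlier in the article.
 * The target_residency values are invented for illustration.
 */
static const struct demo_state demo_states[] = {
	{ "C1",  1, 1000,   1 },
	{ "C2",  1,  500,  10 },
	{ "C3", 57,  100, 200 },
};

#define DEMO_STATE_COUNT (sizeof(demo_states) / sizeof(demo_states[0]))

/*
 * Sanity check for a table ordered shallowest first: each deeper state
 * should use no more power and wake up no faster than the one before it.
 */
int demo_states_ordered(const struct demo_state *s, size_t n)
{
	size_t i;

	assert(s != NULL);
	for (i = 1; i < n; i++) {
		if (s[i].power_usage > s[i - 1].power_usage ||
		    s[i].exit_latency < s[i - 1].exit_latency)
			return 0;
	}
	return 1;
}
```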

The cpuidle driver should fill in an appropriate set of states in a cpuidle_device structure for each CPU:

    struct cpuidle_device {
	unsigned int		cpu;

	int			last_residency;
	int			state_count;
	struct cpuidle_state	states[CPUIDLE_STATE_MAX];
	struct cpuidle_state	*last_state;

	void			*governor_data;
	struct cpuidle_state	*safe_state;
	/* Others omitted */
    };

The driver should set state_count to the number of valid states and cpu to the number of the CPU described by this device. The safe_state field points to the deepest sleep which is safe to enter while DMA is active elsewhere in the system. The device should be registered with:

    int cpuidle_register_device(struct cpuidle_device *dev);

The return value is, as usual, zero on success or a negative error code.
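A driver's per-CPU setup might look something like the following userspace sketch. The demo_ names and the flag value are hypothetical; the sketch also assumes that states are listed from shallowest to deepest, so that the last state without the bus-master flag is the natural choice for safe_state.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_STATE_MAX      8
#define DEMO_FLAG_CHECK_BM  0x1	/* stand-in for CPUIDLE_FLAG_CHECK_BM */

struct demo_state {
	const char *name;
	unsigned int flags;
};

struct demo_device {
	unsigned int cpu;
	int state_count;
	struct demo_state states[DEMO_STATE_MAX];
	struct demo_state *safe_state;
};

/*
 * Fill in a device for one CPU. States are assumed to be ordered from
 * shallowest to deepest, so the last one without the bus-master flag is
 * the deepest sleep that stays safe while DMA is active.
 */
void demo_setup_device(struct demo_device *dev, unsigned int cpu,
		       const struct demo_state *states, int count)
{
	int i;

	assert(dev != NULL && states != NULL);
	assert(count > 0 && count <= DEMO_STATE_MAX);

	dev->cpu = cpu;
	dev->state_count = count;
	dev->safe_state = NULL;
	for (i = 0; i < count; i++) {
		dev->states[i] = states[i];
		if (!(states[i].flags & DEMO_FLAG_CHECK_BM))
			dev->safe_state = &dev->states[i];
	}
}
```

A real driver would follow this with a call to cpuidle_register_device() and check its return value.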

The only other thing that the driver needs to do is to actually implement the state transitions. As we saw above, that is done through the enter() function associated with each state:

    int (*enter)(struct cpuidle_device *dev, struct cpuidle_state *state);

A call to enter() is a request from the current governor to put the CPU associated with dev into the given state. Note that enter() is free to choose a different state if there is a good reason to do so, but it should store the actual state used in the device's last_state field. If the requested state has the CPUIDLE_FLAG_CHECK_BM flag set, and there is bus-mastering DMA active in the system, a transition to the indicated safe_state should be made instead. The return value from enter() should be the amount of time actually spent in the sleep state, expressed in microseconds.
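The rules for enter() can be sketched in userspace as follows. Everything prefixed with demo_ is hypothetical: in particular, the bus-master check and the fixed sleep time stand in for whatever the real hardware interface provides.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FLAG_CHECK_BM 0x1	/* stand-in for CPUIDLE_FLAG_CHECK_BM */

struct demo_state {
	const char *name;
	unsigned int flags;
};

struct demo_device {
	struct demo_state *last_state;
	struct demo_state *safe_state;
};

/* Hypothetical hardware hooks; a real driver would query the platform. */
static int demo_bm_active;		/* nonzero while bus-master DMA runs */
static int demo_sleep_us = 120;		/* pretend every sleep lasts this long */

static int demo_hw_sleep(struct demo_state *state)
{
	(void)state;	/* a real driver would program the CPU here */
	return demo_sleep_us;
}

/* Model of an enter() callback: returns microseconds actually slept. */
int demo_enter(struct demo_device *dev, struct demo_state *state)
{
	assert(dev != NULL && state != NULL);

	/* Demote to the safe state if DMA makes the requested one risky. */
	if ((state->flags & DEMO_FLAG_CHECK_BM) && demo_bm_active)
		state = dev->safe_state;

	dev->last_state = state;	/* record what was really used */
	return demo_hw_sleep(state);
}
```

The important detail is that last_state reflects the state actually entered, which may differ from the one the governor asked for.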

If the driver needs to temporarily put a hold on cpuidle activity, it can call:

    void cpuidle_pause_and_lock(void);
    void cpuidle_resume_and_unlock(void);

Note that cpuidle_pause_and_lock() blocks cpuidle activity for all CPUs in the system. It also acquires a mutex which is held until cpuidle_resume_and_unlock() is called, so it should not be used for long periods of time.

Power management for a specific CPU can be controlled with:

    int cpuidle_enable_device(struct cpuidle_device *dev);
    void cpuidle_disable_device(struct cpuidle_device *dev);

These functions can only be called with cpuidle as a whole paused, so one must call cpuidle_pause_and_lock() first.
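The expected calling sequence can be modeled in userspace like this. The demo_ functions are hypothetical stand-ins, a simple flag takes the place of the real mutex, and -EPERM is an error code chosen for the model only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_device {
	int enabled;
};

static int demo_paused;	/* stands in for the cpuidle mutex being held */

void demo_pause_and_lock(void)
{
	assert(!demo_paused);	/* the model does not allow nested pauses */
	demo_paused = 1;
}

void demo_resume_and_unlock(void)
{
	assert(demo_paused);
	demo_paused = 0;
}

/* Refuses to run unless cpuidle as a whole is paused. */
int demo_enable_device(struct demo_device *dev)
{
	if (!demo_paused)
		return -EPERM;	/* error code chosen for this model only */
	dev->enabled = 1;
	return 0;
}

void demo_disable_device(struct demo_device *dev)
{
	assert(demo_paused);	/* calling this unpaused is a bug */
	dev->enabled = 0;
}

/* Typical sequence: pause, disable, change things, re-enable, resume. */
int demo_reconfigure(struct demo_device *dev)
{
	int ret;

	assert(dev != NULL);
	demo_pause_and_lock();
	demo_disable_device(dev);
	/* ... update the device's states here; keep this window short ... */
	ret = demo_enable_device(dev);
	demo_resume_and_unlock();
	return ret;
}
```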

cpuidle governors

Governors implement the policy side of cpuidle. The kernel allows the existence of multiple governors at any given time, though only one will be in control of a given CPU at any time. Governor code begins by filling in a cpuidle_governor structure:

    struct cpuidle_governor {
	char			name[CPUIDLE_NAME_LEN];
	unsigned int		rating;

	int  (*enable)		(struct cpuidle_device *dev);
	void (*disable)		(struct cpuidle_device *dev);
	int  (*select)		(struct cpuidle_device *dev);
	void (*reflect)		(struct cpuidle_device *dev);

	struct module 		*owner;
	/* ... */
    };

The name identifies the governor to user space, while rating is the governor's idea of how useful it is. By default, the kernel will use the governor with the highest rating value, but the system administrator can override that choice.
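The default choice amounts to a search for the highest rating. A minimal userspace sketch, using hypothetical demo_ names and made-up governors, might read:

```c
#include <assert.h>
#include <stddef.h>

struct demo_governor {
	const char *name;
	unsigned int rating;
};

/* Return the governor with the highest rating, or NULL for an empty list. */
const struct demo_governor *
demo_pick_governor(const struct demo_governor *govs, size_t n)
{
	const struct demo_governor *best = NULL;
	size_t i;

	assert(govs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!best || govs[i].rating > best->rating)
			best = &govs[i];
	}
	return best;
}
```

On a tie, this sketch keeps the first governor found; how the kernel itself breaks ties is not covered here.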

There are four callbacks provided by governors. The first two, enable() and disable(), are called when the governor is enabled for use or removed from use. Both functions are optional; if the governor does not need to know about these events, it need not supply these functions.

The select() function, instead, is mandatory; it is called whenever the CPU has nothing to do and wishes the governor to pick the optimal way of getting that nothing done. This function is where the governor can apply its heuristics, look at upcoming timer events, and generally try to decide how long the sleep can be expected to last and which idle state makes the most sense. The return value should be the integer index of the target state (in the dev->states array).

When making its decision, the governor should pay attention to the current latency requirements expressed by other code in the system. The mechanism for the registration of these requirements is the "pm_qos" subsystem. A number of quality-of-service requirements can be registered with this system, but the one most relevant for cpuidle governors is the CPU latency requirement. That information can be obtained with:

    #include <linux/pm_qos_params.h>

    int max_latency = pm_qos_requirement(PM_QOS_CPU_DMA_LATENCY);

On some systems, an overly-deep sleep state can wreak havoc with DMA operations (trust your editor's experience on this), so it's important to respect the latency requirements given by drivers.
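Putting those pieces together, a very simple selection policy could look like the userspace sketch below. It is not the algorithm used by any in-kernel governor; the demo_ names are hypothetical, the predicted idle time is assumed to come from elsewhere, and states are assumed to be ordered from shallowest to deepest.

```c
#include <assert.h>
#include <stddef.h>

struct demo_state {
	unsigned int exit_latency;	/* in microseconds */
	unsigned int target_residency;	/* in microseconds */
};

/*
 * Return the index of the deepest state whose exit latency fits within
 * max_latency_us and whose target residency fits within the predicted
 * idle time. States are assumed ordered shallowest first; index 0 is the
 * fallback when nothing deeper qualifies.
 */
int demo_select(const struct demo_state *states, int count,
		unsigned int predicted_idle_us, unsigned int max_latency_us)
{
	int i, best = 0;

	assert(states != NULL && count > 0);
	for (i = 1; i < count; i++) {
		if (states[i].exit_latency > max_latency_us)
			continue;	/* would violate the latency limit */
		if (states[i].target_residency > predicted_idle_us)
			continue;	/* sleep too short to pay off */
		best = i;
	}
	return best;
}
```

With the laptop's 57µs C3 exit latency, for example, any latency requirement tighter than 57µs keeps this policy out of C3 no matter how long the expected sleep is.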

Finally, the reflect() function will be called when the CPU exits the sleep state; the governor can use the resulting timing information to reach conclusions on how good its decision was.
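One way a governor might use that feedback is to count how often a sleep ended before the chosen state's target residency had elapsed. The following userspace sketch uses hypothetical demo_ names and an arbitrary "more than half" threshold; it is an illustration, not a description of any existing governor.

```c
#include <assert.h>
#include <stddef.h>

struct demo_gov_data {
	unsigned int too_short;	/* sleeps that ended before the target */
	unsigned int total;	/* sleeps observed so far */
};

/*
 * Called after wakeup with the measured residency and the target
 * residency of the state that was used, both in microseconds.
 */
void demo_reflect(struct demo_gov_data *data, int last_residency_us,
		  unsigned int target_residency_us)
{
	assert(data != NULL);
	data->total++;
	if (last_residency_us < 0 ||
	    (unsigned int)last_residency_us < target_residency_us)
		data->too_short++;
}

/* Nonzero when more than half of the observed sleeps were cut short. */
int demo_should_back_off(const struct demo_gov_data *data)
{
	assert(data != NULL);
	return data->total > 0 && data->too_short * 2 > data->total;
}
```

A governor seeing demo_should_back_off() return true could then favor shallower states in its next select() call.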

An aside: blocking deep sleep

For what it's worth, driver developers can use these pm_qos functions to specify latency requirements:

    #include <linux/pm_qos_params.h>

    int pm_qos_add_requirement(int qos, char *name, s32 value);
    int pm_qos_update_requirement(int qos, char *name, s32 new_value);
    void pm_qos_remove_requirement(int qos, char *name);

This API is not heavily used in current kernels; most of the real uses would appear to be drivers telling the system that transitions into deep sleep states would be unwelcome. Needless to say, a driver should only block deep sleep when it is strictly necessary; the latency requirement should be removed when I/O is not in progress.
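The add/update/remove pattern can be modeled in userspace as a small table of named requirements. The demo_ names are hypothetical, and the model assumes that the effective latency limit is simply the smallest value currently registered.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define DEMO_MAX_REQS 8

struct demo_req {
	const char *name;	/* NULL marks a free slot */
	int value_us;
};

static struct demo_req demo_reqs[DEMO_MAX_REQS];

static struct demo_req *demo_find(const char *name)
{
	int i;

	for (i = 0; i < DEMO_MAX_REQS; i++)
		if (demo_reqs[i].name && strcmp(demo_reqs[i].name, name) == 0)
			return &demo_reqs[i];
	return NULL;
}

int demo_add_requirement(const char *name, int value_us)
{
	int i;

	assert(name != NULL);
	for (i = 0; i < DEMO_MAX_REQS; i++) {
		if (!demo_reqs[i].name) {
			demo_reqs[i].name = name;
			demo_reqs[i].value_us = value_us;
			return 0;
		}
	}
	return -1;	/* table full */
}

int demo_update_requirement(const char *name, int new_value_us)
{
	struct demo_req *r = demo_find(name);

	if (!r)
		return -1;	/* no such requirement */
	r->value_us = new_value_us;
	return 0;
}

void demo_remove_requirement(const char *name)
{
	struct demo_req *r = demo_find(name);

	if (r)
		r->name = NULL;
}

/* The effective limit is the tightest (smallest) registered value. */
int demo_requirement(void)
{
	int i, min = INT_MAX;

	for (i = 0; i < DEMO_MAX_REQS; i++)
		if (demo_reqs[i].name && demo_reqs[i].value_us < min)
			min = demo_reqs[i].value_us;
	return min;
}
```

A driver following the advice above would add its requirement when I/O starts and remove it as soon as the I/O completes, so that deep sleep is blocked only while it has to be.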

And that describes the 2.6.34 version of the cpuidle subsystem and API. For the curious, the core and governor code can be found in drivers/cpuidle, while cpuidle drivers live in drivers/acpi/processor_idle.c and a handful of ARM subarchitecture implementations. All told, it's a testament to the complexity of doing nothing properly on contemporary systems.




How do you measure C states?

Posted Apr 26, 2010 17:51 UTC (Mon) by pr1268 (guest, #24648) [Link] (1 responses)

Just curious, how did you measure the C1/C2/C3 stats? Is there a command-line tool for this? Thanks!

How do you measure C states?

Posted Apr 26, 2010 17:56 UTC (Mon) by corbet (editor, #1) [Link]

powertop is your friend.

Power consumption

Posted Apr 26, 2010 18:37 UTC (Mon) by koch (subscriber, #55163) [Link] (7 responses)

How did you get the numbers for the power consumption of the different power-states?

Thank you for an excellent article.

Power consumption

Posted Apr 26, 2010 18:42 UTC (Mon) by corbet (editor, #1) [Link] (6 responses)

Sorry, I should have mentioned that in the article. Wander into

    /sys/devices/system/cpu/cpu0/cpuidle

and you'll find a bunch of files with that information. The time and usage numbers are there too.

Power consumption

Posted Apr 26, 2010 20:58 UTC (Mon) by pr1268 (guest, #24648) [Link] (5 responses)

Funny that cpuidle doesn't exist in that directory on my home PC (Slackware 12.2 running vanilla kernel 2.6.31.13). Am I missing something (other than the aforementioned powertop)?

By the way, this is a desktop PC, so all this discussion on my end may be moot (unless I want to save a few pennies on my electric bill). :-)

Power consumption

Posted Apr 26, 2010 21:47 UTC (Mon) by nix (subscriber, #2304) [Link] (2 responses)

Perhaps C-states are disabled in your BIOS. (Personally, I've disabled them on my suspended-when-not-in-use desktop simply because when C states are disabled, if the TSC is otherwise stable it can be used as a time source rather than the expensive HPET. On always-on servers and power-important laptops and netbooks, a bit of timekeeping expense is worth the power saving, so it's best to turn C states on.)

Power consumption

Posted Apr 27, 2010 0:45 UTC (Tue) by pr1268 (guest, #24648) [Link] (1 responses)

I've got S3 enabled in BIOS (I actually rebooted just to look), but I'm curious whether I've got all the proper kernel options set and modules compiled/installed. Sounds like a research project... :-)

Thanks for the replies.

S3 is different

Posted Apr 29, 2010 22:29 UTC (Thu) by pflugstad (subscriber, #224) [Link]

IIRC, S3 is suspend-system-to-RAM, as opposed to the C states, which are just CPU states.

Power consumption

Posted Apr 27, 2010 0:55 UTC (Tue) by arjan (subscriber, #36785) [Link]

some desktop pc's support C states... but many, especially slightly older ones (1 - 2 years old), do not.

Power consumption

Posted Apr 27, 2010 0:55 UTC (Tue) by xtifr (guest, #143) [Link]

On my system, I have /sys/devices/system/cpu/cpuidle (not in the cpu0 subdir).

The cpuidle subsystem

Posted Apr 26, 2010 20:59 UTC (Mon) by MTecknology (guest, #57596) [Link] (4 responses)

heh... This is an awesome article. It also shows you how efficient your system really is (or isn't).

I just thought I'd show what I found:
michael@panther:/sys/devices/system/cpu/cpu0/cpuidle$ cat */power
4294967295
1000
500
250
michael@panther:/sys/devices/system/cpu/cpu0/cpuidle$ cat */usage
273
306
12054
546476

I'd say for what I do that's very impressive. :)

The cpuidle subsystem

Posted Apr 26, 2010 21:42 UTC (Mon) by nix (subscriber, #2304) [Link] (3 responses)

Indeed. One of my larger servers, which spends most of its time idle, is disturbingly stating (via powertop) that it spends 77% of its time in C1, only 20% in C3, and is woken up fifty times a second by a mysterious 'load balancing tick'. I wonder what this is? Maybe compiling with NO_HZ would fix it...

(Still, at least it's running in a low P-state.)

(A completely idle PostgreSQL 8.4.3 also causes a wakeup every 1/10s, which seems a bit off.)

The cpuidle subsystem

Posted Apr 27, 2010 6:14 UTC (Tue) by koch (subscriber, #55163) [Link] (1 responses)

The kernel on my dual-core laptop is compiled with NO_HZ, but still keeps itself warm and awake with these "load balancing ticks". There seems to be a bug (with fixes) filed for Ubuntu, describing these symptoms.

The cpuidle subsystem

Posted Apr 27, 2010 9:37 UTC (Tue) by nix (subscriber, #2304) [Link]

Er, the linked bug is an ACPI bug causing temperature sensing to fail on resume from suspend on some laptops. I doubt it's related to a load-balancing tick on a big beefy quad-core Nehalem server :)

(did you paste in the wrong link?)

The cpuidle subsystem

Posted Apr 29, 2010 12:13 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

It was disappointingly difficult to get many server authors to remove such polling behaviour. The response tended to be that (a) they didn't believe it made any difference and (b) this was the only way to measure some arbitrary thing they'd decided their software needed to constantly know about (e.g. free RAM, system load average).

It appears that after a while everyone just stopped pushing :/

The cpuidle subsystem

Posted Apr 27, 2010 2:23 UTC (Tue) by mcgrof (subscriber, #25917) [Link]

Great article, if you had a *Like* button I would have just hit it :)

The cpuidle subsystem

Posted Apr 27, 2010 4:01 UTC (Tue) by svaidy (subscriber, #39260) [Link]

Good article that describes the subsystem that silently saves power and extends battery life on laptops. This subsystem is also useful on servers that support multiple low power idle states.

The cpuidle subsystem is being ported to other architectures as well, however as mentioned by our editor the complexity of the subsystem poses some challenges for a clean multi architecture implementation.

Reference:
[1] cpuidle for POWER (patch v12)
http://lkml.org/lkml/2010/4/15/190

[2] Discussions and previous patches
http://lkml.org/lkml/2009/12/2/98

The cpuidle subsystem

Posted Apr 27, 2010 18:23 UTC (Tue) by intgr (subscriber, #39733) [Link] (6 responses)

I've always wondered why PowerTop reports: "Detailed C-state information is only available on Mobile CPUs (laptops)"

Why aren't these statistics available on normal CPUs?
And why can't C-state counters be accounted in software?

The cpuidle subsystem

Posted Apr 27, 2010 20:15 UTC (Tue) by nix (subscriber, #2304) [Link] (5 responses)

This was changed (and the message reworded) two years ago in commit 70551c5171abf366b3caa6d22e12892f0da5a95e, after which powertop became capable of reading C-state information exported from sysfs by cpuidle in kernel 2.6.25+ (the data that Jon talks about in this article).

Prior to that, powertop had to read the data itself by making ACPI calls. I suppose this is less likely to work on non-laptops, even when C-states are available, because it's depending on the BIOS vendor doing the right thing (and we know how often *that* happens).

So, upgrade powertop and/or upgrade the kernel?

(You can't count C-state counters in software because when the CPU is in a C state it is not executing instructions. That's the whole point.)

The cpuidle subsystem

Posted Apr 27, 2010 20:53 UTC (Tue) by intgr (subscriber, #39733) [Link] (4 responses)

> This was changed (and the message reworded) two years ago
Indeed, I pasted that from an older installation. Up-to-date computers give me this ever-confusing message:
"< Detailed C-state information is not P-states (frequencies)"
I guess since this is Intel's software, they don't pay much attention to how it looks on AMD processors. :)

> You can't count C-state counters in software because when the CPU is in
> a C state it is not executing instructions.
But you can record timestamps before entering and after leaving the sleep state. Is that too expensive?

The cpuidle subsystem

Posted Apr 27, 2010 21:23 UTC (Tue) by nix (subscriber, #2304) [Link] (3 responses)

It's got nothing to do with Intel versus AMD. One of my Nehalem machines reports that message: it's just because C states are turned off in the BIOS.

I suspect it needs some word-wrapping: the 'available' has been overwritten by the P-state heading :)

The cpuidle subsystem

Posted Apr 28, 2010 11:53 UTC (Wed) by nye (subscriber, #51576) [Link] (2 responses)

>I suspect it needs some word-wrapping: the 'available' has been overwritten by the P-state heading :)

Since it always says that, no matter how wide the terminal in which it's run, you'd think somebody who knows that the message has been truncated would have noticed at some point and fixed it. I just assumed it was a message in some arcane code known to those deeply involved in power usage monitoring :P.

Anyway it's unfortunate that desktop CPUs don't support this - even the Atom (D510) I bought a month ago only supports C0 and C1, and I don't know if it's possible to tell how long it's spending in what state. It seems you need CPUs designed specifically for battery-powered devices if you want anything more.

The cpuidle subsystem

Posted Apr 28, 2010 12:16 UTC (Wed) by hmh (subscriber, #3838) [Link] (1 responses)

Many desktop CPUs do support it. Your BIOS might not. Your particular CPU might not (or might be buggy and the BIOS went ahead and disabled it for safety). But it is not a rare feature in desktop CPUs anymore.

The cpuidle subsystem

Posted Apr 28, 2010 14:11 UTC (Wed) by nye (subscriber, #51576) [Link]

Is there a list of supporting processors anywhere? I was hoping that a site like lesswatts.org might have that sort of thing, but I couldn't find that information anywhere short of downloading the datasheets for every individual CPU, and of course there is a near-infinite selection of CPUs on offer nowadays.

However, if even brand new all-in-one Atom systems don't support this - given that they're targeted specifically at low-power uses - it doesn't seem to be a stretch to conclude that this is something that manufacturers basically don't care about.

The cpuidle subsystem

Posted Apr 28, 2010 15:43 UTC (Wed) by gartim (guest, #10123) [Link]

Jonathan -- curious what brand/specs are your laptop? -- Gary

The cpuidle subsystem

Posted Apr 28, 2010 23:33 UTC (Wed) by jonabbey (guest, #2736) [Link]

Yah, my nehalem (core i7) Fedora 11 system doesn't report any C-states in powertop, just P-states (reductions in the clockspeed of individual cores).

The cpuidle subsystem

Posted Jul 13, 2010 6:38 UTC (Tue) by MS_ASU (guest, #68817) [Link]

very nice article. I have two questions.

What does the number in the power field for each C state signify? For C0 it's very high on my system: 4294967295 (mW). Is it the total power consumed in C0 so far? This value does not seem to change, though.

The second question is about the C-state transitions. Can a transition occur within the same state? I mean CPU goes into C3 from C3 only. Because the value for C3 usage is very high for my system. (AFAIK the usage number denotes the number of transitions to that state) So seems like the "enter" function is being called from C3 and again the output state is C3.

The cpuidle subsystem

Posted Sep 29, 2010 13:17 UTC (Wed) by rsv (guest, #70379) [Link]

Sir,
I have some questions on cpu idle

a) How does the kernel know the cpu is idle.
b) How does it know (predict) that the next activity will most likely happen after some time (say after 50 seconds or so) so that it can switch to the appropriate cpu sleep state.

Thank you,
rsv

The cpuidle subsystem

Posted Oct 8, 2012 10:42 UTC (Mon) by bandarusivakrishna (guest, #87084) [Link]

Hi all,

I have some questions about the cpuidle subsystem.

1) According to the Linux power-state concepts there are states C0 through C7, but the Linux kernel seems to use only C0 to C3. Can anyone please clarify the reason?

2) How does the processor move from one state to another? And once it has moved from one state to another, what happens to the device states in the kernel, and how are they controlled?

We are doing power management on an ARM Cortex-A9, and I am new to power management. Can anyone help me with the cpuidle subsystem? How does the flow go from the Linux kernel governors down to the hardware level?

Thanks & Regards,
Siva Krishna.


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds