
Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.34-rc5; there have been no prepatches released over the last week. The flow of changes into the mainline continues; it contains lots of fixes but also the VMware balloon driver (discussed briefly here in early April) and the ipheth driver, which facilitates USB tethering to iPhones. The 2.6.34-rc6 release can be expected soon - probably a few milliseconds after this page is published.

Stable updates: the 2.6.32.12 and 2.6.33.3 stable kernel updates were released on April 26. Both updates are massive, with well over 100 fixes in each.

Comments (3 posted)

Quotes of the week

I came to realize that if one wants his work (software) to be used globally, making it in-tree is not the goal but an important first step. Making software in-tree is technical, but affecting distributors decision should involve non-technical issues, I guess.
-- Toshiharu Harada

So, if your display switch button now just makes the letter "P" appear, say thanks to Microsoft. There's a range of ways we can fix this, although none of them are straightforward and those of you who currently use left-Windows-p as a keyboard shortcut are going to be sad. I would say that I feel your pain, but my current plan is to spend the immediate future getting drunk enough that I stop caring.
-- Matthew Garrett

One of the things that we sometimes have to tell people who are trying to navigate the maze of upstream submission is that sometimes you need to know who to ignore, and that sometimes rules are guidelines (despite pedants who will NACK based on rules like, "/proc, eeeeewwww", or "/debugfs must only strictly be for debug information").

Telling embedded developers who only want to submit their driver that they must create a whole new pseudo-filesystem just to export a single file that in older, simpler times, would have just been thrown into /proc is really not fair, and is precisely the sort of thing that may cause them to say, "f*ck it, these is one too many flaming hoops to jump through". If we throw up too many barriers, in the long run it's not actually doing Linux a service.

-- Ted Ts'o

Good heavens, what is EILSEQ? ... Why on earth are driver writers using this in the kernel??? Imagine the confusion which ensues when this error code propagates all the way back to some poor user's console. They'll be scrabbling around with language encodings not even suspecting that their hardware is busted.

People do this *a lot*. They go grubbing through errno.h and grab something which looks vaguely appropriate. But it's wrong. If your hardware is busted then return -EIO and emit a printk to tell the operator what broke.

-- Andrew Morton

Comments (none posted)

CPUS*PIDS = mess

By Jonathan Corbet
April 27, 2010
Mike Travis recently ran into a problem: if you have a system with a mere 2048 processors, there's only room for 16 processes on each CPU before the default 32K limit on process IDs is reached. Systems with lots of processors tend not to run large numbers of processes on each CPU, but 16 is still a bit tight - especially when one considers how many kernel threads run on each CPU. With 2K processors, the kernel threads alone may run the system out of process IDs; with 4K processors, the system will not even succeed in booting.

The proposed solution was a new boot-time parameter allowing the specification of a larger maximum number of process IDs. That idea did not get very far, though; there is not much interest in adding more options just to enable the system to boot. The fact that concurrency-managed workqueues should eventually solve this problem (by getting rid of large numbers of workqueue threads) hasn't helped either; it makes the kernel option look like a temporary stopgap. But the workqueue changes are of little help to people who have this problem now; some form of that work will probably be merged eventually, but it does not appear to be on a fast track.

So there will most likely be a shorter-term fix merged. Instead of a kernel parameter, though, it will probably be some sort of heuristic which looks at the number of processors and ensures that a sufficient number of process IDs is available. If the default limit is too low, it will be raised automatically.
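As an illustration of the idea (this is not the actual patch; pid_max and PID_MAX_LIMIT are real kernel symbols, but the helper and the per-CPU floor below are invented), such a heuristic might be as simple as:

    /* Sketch only: raise the PID limit when the CPU count would
     * otherwise leave too few PIDs per processor. adjust_pid_max()
     * and PIDS_PER_CPU_MIN are hypothetical. */
    #include <linux/kernel.h>
    #include <linux/threads.h>
    #include <linux/cpumask.h>

    #define PIDS_PER_CPU_MIN	1024

    static void adjust_pid_max(void)
    {
	int needed = num_possible_cpus() * PIDS_PER_CPU_MIN;

	if (pid_max < needed)
		pid_max = min(needed, (int)PID_MAX_LIMIT);
    }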

There is one remaining concern: what about ancient applications which store process IDs in signed, 16-bit integers? Apparently such applications exist. It is less clear, though, that such applications exist on 4096-processor systems. So this fear is unlikely to hold up this change. By the time the rest of us get those shiny, new, 4096-core desktop systems, hopefully, any remaining broken applications will have long since been fixed.

Comments (2 posted)

Suspend block

By Jonathan Corbet
April 28, 2010
One of the key points of contention when it comes to getting the kernel-level Android code merged has been the wakelock API. Wakelocks run counter to how some think power management should be done, so they have been hard to merge. But Android's drivers use wakelocks, so, in the absence of that API, those drivers also cannot be merged. They could be reworked to not use wakelocks, but then the mainline kernel would have a forked version of the driver code which nobody actually uses - not the best of outcomes. So coming to resolution on the wakelock issue has been a high priority for a while.

The result of work in that area can now be seen in the form of the suspend block patches recently posted by Arve Hjønnevåg. The name of the feature has been changed, as has the API, but the core point is the same: allow the system to automatically suspend itself when nothing is going on, and allow code to say "something is going on" at both the kernel and user-space levels.

The suspend block patches add a new sysfs file called /sys/power/policy; the default value found therein is "forced." When the policy is "forced," system state transitions will happen in response to explicit writes to /sys/power/state, as usual. If the policy is changed to "opportunistic," though, things are a bit different. The state written to /sys/power/state does not take effect immediately; instead, the kernel goes into that state whenever it concludes that the system is idle. The suspend blocker API can then be used to prevent the system from suspending when the need arises.
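For the curious, driving this interface from user space might look like the following minimal sketch; it assumes the two sysfs files behave exactly as described above:

    /* Minimal sketch: switch to opportunistic suspend from user
     * space. Assumes /sys/power/policy and /sys/power/state behave
     * as described in the text. */
    #include <stdio.h>

    int main(void)
    {
	FILE *f;

	f = fopen("/sys/power/policy", "w");
	if (!f)
		return 1;
	fputs("opportunistic", f);
	fclose(f);

	/* Under the opportunistic policy, this takes effect only
	 * when the system is idle and no suspend blockers are held */
	f = fopen("/sys/power/state", "w");
	if (!f)
		return 1;
	fputs("mem", f);
	fclose(f);
	return 0;
    }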

The two postings of this patch set have received a number of comments, causing various things to be fixed. More recently, though, responses have been of the "acked" variety. So one might conclude that suspend block has a reasonable chance of getting in during the 2.6.35 merge window. That, in turn, should open the doors for the merging of a lot of driver code from the Android project. With luck, the much-publicized disconnect between Android and the community may be a thing of the past - at the kernel level, at least.

Comments (2 posted)

Kernel development news

Kernel Hacker's Bookshelf: Generating Realistic Impressions for File-System Benchmarking

April 28, 2010

This article was contributed by Valerie Aurora (formerly Henson)

"File systems benchmarking is in a state of disarray." This stark and undisputed summary comes from the introduction to "Generating Realistic Impressions for File-System Benchmarking [PDF]" by Nitin Agrawal, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. This paper describes Impressions, a tool for generating realistic, reproducible file system images which can serve as the base of new file system benchmarks.

First, a little history. We, the file systems research and development community, unanimously agree that most of our current widely used file system benchmarks are deeply flawed. The Andrew benchmark, originally created around 1988, is not solely a file system benchmark and is so small that it often fits entirely in cache on modern computers. Postmark (c. 1997) creates and deletes small files in a flat directory structure without any fsync() calls; often the files are deleted so quickly that they never get written to disk. The company that created Postmark, NetApp, stopped hosting the Postmark code and tech report on their web site, forcing developers to pass around bootleg Postmark versions in a bizarre instance of benchmark samizdat. fs_mark (c. 2003) measures synchronous write workloads and is a useful microbenchmark, but is in no way a general purpose file system benchmarking tool. bonnie (c. 1990) and bonnie++ (c. 1999) tend to benchmark the disk more than the file system. In general, run any file system benchmark and you'll find a file system developer who will tell you why it is all wrong.

Why has no new general purpose file system benchmark gained widespread use and acceptance since Postmark? A new benchmark is a dangerous creature to unleash on the world: if it becomes popular enough, years of research and development can go into making systems perform better on what could, in the end, be a misleading or useless workload. "No benchmark is better than a bad benchmark," is how the thinking goes, at least in the academic file systems development community. I've seen several new benchmarks quashed over the years for minor imperfections or lack of features.

However, creating excellent new file systems benchmarks is difficult without intermediate work to build on, flawed though it may be. It's like demanding that architects go straight from grass huts to skyscrapers without building stone buildings in between because stone buildings would be an earthquake hazard. As a result, the file systems benchmarking community continues to live in grass huts.

Impressions: Building better file system images

One thing the file systems community can agree on: We need better file system images to run our benchmarks on - a solid foundation for any putative skyscrapers of the future. The most accurate and reproducible method of creating file system images is to make a copy of a representative real-world file system at the block level and write it back to the device afresh before each run of the benchmark. Obviously, this approach is prohibitively costly in time, storage, and bandwidth. Creating a tarball of the contents of the file system, and extracting it in a freshly created file system is nearly as expensive and also loses the information about the layout of the file system, an important factor in file system performance. Creating all the files at once and in directory order is a best case for the file system block allocation code and won't reflect the real-world performance of the file system when files are created and deleted over time. In all cases, it is impractical for other developers to reproduce the results using the same file system images - no one wants to download (and especially not host) several hundred gigabyte file system images.

This is where Impressions comes in. Impressions is a relatively small, simple, open-source tool (about 3500 lines of C++) that generates a file system image satisfying multiple sophisticated statistical parameters. For example, Impressions chooses file sizes using combinations of statistical functions with multiple user-configurable parameters. Impressions is deterministic: given the same set of starting parameters and random seeds, it will generate the same file system image (at the logical level - the on-disk layout may not be the same).

Impressions: The details

The directory structure of the file system needs to have realistic depth and distribution of both files and directories. Impressions begins with a directory creation phase that creates a target number of directories. The directories are distributed according to a function of the current number of subdirectories of a particular directory, based on a 2007 study of real-world file system directory structure. A caveat here is that creating the directories separately from the files will not properly exercise some important parts of file system allocation strategy. However, in many cases most of the directory structure is static, and most of the changes occur as creation and deletion of files within directories, so creation of directories first reflects an important real-world use case.

The distribution of file sizes can't be accurately modeled with any straightforward probability distribution function due to a second "bump" in the file size distribution, which in modern file systems begins around 512MB. This heavy tail of file size distribution is usually due to video files and disk images, and can't be ignored if you care about the performance of video playback. Impressions combines two probability distribution functions, a log-normal and a Pareto, with five user-configurable parameters to produce a realistic file size distribution.
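As a rough illustration of that hybrid approach (this is not the Impressions code, and every parameter value below is invented), a sampler might draw from a log-normal body and, with small probability, a Pareto tail:

    /* Sketch of a hybrid file-size distribution: log-normal body,
     * Pareto tail. All parameter values are invented, not the
     * defaults used by Impressions. */
    #include <math.h>
    #include <stdlib.h>

    static double uniform(void)	/* uniform deviate in (0, 1) */
    {
	return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    }

    static double gaussian(void)	/* Box-Muller: standard normal */
    {
	return sqrt(-2.0 * log(uniform())) * cos(2.0 * M_PI * uniform());
    }

    double sample_file_size(void)
    {
	double tail_prob = 0.001;	/* fraction of files in the tail */
	double mu = 9.0, sigma = 2.5;	/* log-normal body (bytes) */
	double xm = 512e6, alpha = 1.5;	/* Pareto tail from ~512MB */

	if (uniform() < tail_prob)
		return xm / pow(uniform(), 1.0 / alpha);  /* Pareto */
	return exp(mu + sigma * gaussian());		  /* log-normal */
    }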

Files are assigned a depth in the directory tree according to "an inverse polynomial of degree 2." Whatever that is (the code is available for the curious), Figure 2(f) in the paper shows that the resulting distribution of files by namespace depth is almost indistinguishable from that in a real-world file system. Impressions also supports user-configurable "Special" directories with an exceptionally large number of files in them, like /lib.

The authors of Impressions clearly understood the importance of realistic file data; the example use case in the paper is performance comparison of two desktop search applications, which depend heavily on the actual content of files. Filling all files with zeroes, or randomly generated bytes, or repeats of the same pieces of text would make Impressions useless for any benchmark that depends on file data, such as those testing file system level deduplication or compression. Impressions supports two modes of text file content generation, including a word popularity model suitable for evaluation of file search applications. It also creates files with proper headers for sound files, various image and video formats, HTML, and PDF.

Generation of file names is rudimentary but includes advanced support for realistic file name extensions, like .mp3. The file name itself is just a number incremented by one each time a file is created, but the extension is selected from a list of popular extensions according to percentiles observed in earlier file system studies. Popular extensions only account for about half of file names; the rest of the extensions are randomly generated.
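A sketch of that scheme, with an invented extension table and made-up percentiles, might look like:

    /* Sketch of Impressions-style name generation: a counter for the
     * name, an extension drawn by popularity. The table and its
     * cumulative percentiles are invented for illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    static const struct { const char *ext; double cum; } popular[] = {
	{ "txt", 0.12 }, { "jpg", 0.22 }, { "mp3", 0.30 },
	{ "c",   0.36 }, { "pdf", 0.42 }, { "h",   0.48 },
    };

    void make_name(char *buf, size_t len, unsigned long counter)
    {
	double u = rand() / (double)RAND_MAX;
	size_t i;

	for (i = 0; i < sizeof(popular)/sizeof(popular[0]); i++) {
		if (u < popular[i].cum) {
			snprintf(buf, len, "%lu.%s", counter,
				 popular[i].ext);
			return;
		}
	}
	/* the other roughly half: a random three-letter extension */
	snprintf(buf, len, "%lu.%c%c%c", counter,
		 'a' + rand() % 26, 'a' + rand() % 26, 'a' + rand() % 26);
    }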

One case in which file names generated this way won't be useful is in evaluating a directory entry lookup algorithm. Sequential search of a directory for a particular directory entry isn't very efficient. Instead, most modern file systems have some way to quickly map a file name to its location in a directory, usually based on a hash of the characters of the file name. This mapping function may be more or less efficient on Impressions' sequential numerical file names compared to real-world names. File name length also influences performance, since it changes the number of directory entries that fit in a block. Overall, file name generation in Impressions is good enough, but there are opportunities for improvement.

One of the most important features of Impressions is its support for deliberate fragmentation of the file system. Impressions creates fragmentation by writing part of a file, creating a new file, writing another chunk of the file, and then deleting the new file. This cycle is repeated until the requested degree of fragmentation is achieved. Note that file systems with good per-file preallocation may never fragment in this scheme unless the disk space is nearly full or no contiguous free stretches of disk space are left. In this case, fragmenting a file system to the requested degree may take a while. More efficient methods of fragmenting a file system might be necessary in the future. Impressions could also use FIBMAP/FIEMAP to query the layout of file systems in a portable manner; currently calculation of the "fragmentation score" is only supported on ext2/3.
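In outline, the described cycle might be coded as follows (a sketch of the technique, not the Impressions source; the fsync() calls are there because write-back caching and delayed allocation would otherwise defeat the interleaving):

    /* Sketch of the described fragmentation cycle: interleave chunks
     * of the target file with a short-lived "spoiler" file that
     * occupies, then frees, the adjacent disk space. */
    #include <fcntl.h>
    #include <unistd.h>

    void write_fragmented(int dirfd, int target_fd,
			  const char *buf, size_t chunks, size_t chunk_sz)
    {
	size_t i;

	for (i = 0; i < chunks; i++) {
		int spoiler;

		(void) write(target_fd, buf, chunk_sz);
		fsync(target_fd);	/* force allocation now */

		/* plug the following free space, then release it */
		spoiler = openat(dirfd, "spoiler.tmp",
				 O_CREAT | O_WRONLY | O_TRUNC, 0644);
		(void) write(spoiler, buf, chunk_sz);
		fsync(spoiler);
		close(spoiler);
		unlinkat(dirfd, "spoiler.tmp", 0);
	}
    }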

An interesting feature described in the paper but not available in the version 1 release of Impressions is support to run a specified number of rounds of the fragmentation code - sort of a fragmentation workload. This will show the difference in disk allocation strategies between file systems. For example, if one file system manages allocation well enough that it normally never exceeds 30% discontiguous blocks, and the other normally always exceeds 60% discontiguous blocks, it doesn't always make sense to compare their performance when they are both at 50% discontiguous blocks. Instead, running a set fragmentation workload would result in different "natural" fragmentation levels in both file systems, providing a more realistic baseline for performance comparison.

Impressions development

Impressions is open sourced under the GPLv3 and downloadable here. The original author, Nitin Agrawal, has graduated (now at NEC Labs) and does not currently have plans for developing Impressions further. This is a rare golden opportunity for a new maintainer to work on an influential, high-profile project. The code is, in my opinion, easy to understand and clearly written (although I've spent the last year working on e2fsprogs and the VFS, so take that with a grain of salt). Some open areas for contribution include:
  • Measure actual fragmentation using FIBMAP/FIEMAP
  • Smarter filename generation
  • Addition of hard links and symbolic links
  • Performance improvement
  • Scaling to larger file systems (> 100GB)
  • Packaging for distributions
  • More robust error checking and handling
Another possibility for future development is Lars Wirzenius's genbackupdata tool, written in Python. The goal of this tool is to generate a representative file system image for testing a backup tool. It already has some of the features of Impressions and others appear to be easy to add. Python may be a more suitable language for long-term development and maintenance of a file system creation tool.

Conclusion

Impressions is an enormous step forward in the file system benchmarking world. With a little polishing and a dedicated maintainer, it could become the de facto standard for creating file systems for benchmarking. Impressions can report the full set of parameters and random seeds it uses, which can then be used for another Impressions run to recreate the exact same logical file system (actual layout will vary some). Impressions can be used today by file system and application developers to create realistic, reproducible file system images for testing and performance evaluation.

Comments (4 posted)

Might 2.6.35 be BKL-free?

By Jonathan Corbet
April 27, 2010
The removal of the big kernel lock (BKL) has been one of the longest-running projects in kernel development history. The BKL has been a clear scalability and maintainability problem since its addition in the 1.3 development series; efforts to move away from it began in the 2.1 cycle. But the upcoming 2.6.34 kernel will still feature a big kernel lock, despite all the work that has been done to get rid of it. The good news is that 2.6.35 might just work without the BKL - at least, for a number of configurations.

Over the years, use of the BKL has been pushed down into ever lower levels of the kernel. Once a lock_kernel() call has been pushed into an individual device driver, for example, it is relatively easy to determine whether it is really necessary and, eventually, get rid of it altogether. There is, however, one significant BKL acquisition left in the core kernel: the ioctl() implementation. The kernel has supported a BKL-free unlocked_ioctl() operation for years, but there are still many drivers which depend on the older, BKL-protected version.

Clearly, fixing the ioctl() problem is a key part of the overall BKL solution. To that end, Frederic Weisbecker and Arnd Bergmann posted a patch to prepare the ground for change. This patch adds yet another ioctl() variant called locked_ioctl() to the file_operations structure. The idea was to have both ioctl() and locked_ioctl() in place for long enough to change all of the code which still requires the BKL, after which ioctl() could be removed. This new function was also made dependent on a new CONFIG_BKL configuration option.

That patch did not get very far; Linus strongly disliked both locked_ioctl() and CONFIG_BKL. So the search for alternatives began. In the end, it looks like locked_ioctl() may never happen, but the configuration option will eventually exist.

Linus's suggestion was to not bother with locked_ioctl(). Instead, every ioctl() operation should just be renamed to bkl_ioctl() in one big patch. That would allow code which depends on the BKL to be easily located with grep without adding yet another function to struct file_operations even temporarily. A patch which does this renaming has been posted; this patch may well be merged for 2.6.35.

Or perhaps not. Arnd has taken a more traditional approach with his patch which simply pushes the BKL down into every remaining ioctl() function which needs it. Once a specific ioctl() function handles BKL acquisition itself, it can be called from the core kernel as an unlocked_ioctl() function instead. When all such functions have been converted, the locked version of ioctl() can go away, and the BKL can be removed from that bit of core code. The pushdown is a bigger job than the renaming, but it accomplishes a couple of important goals.
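For any single driver, the pushdown is mechanical; the conversion looks roughly like this generic sketch, in which "foo" and its helper are hypothetical but lock_kernel() and the two ioctl() operations are the real interfaces:

    /* Sketch of the pushdown for one driver: the BKL moves from the
     * core kernel into the driver's own ioctl() implementation. */
    #include <linux/module.h>
    #include <linux/smp_lock.h>
    #include <linux/fs.h>

    /* hypothetical helper holding the driver's existing ioctl logic */
    static long foo_do_ioctl(struct file *file, unsigned int cmd,
			     unsigned long arg);

    static long foo_unlocked_ioctl(struct file *file,
				   unsigned int cmd, unsigned long arg)
    {
	long ret;

	lock_kernel();		/* formerly taken by the core kernel */
	ret = foo_do_ioctl(file, cmd, arg);
	unlock_kernel();
	return ret;
    }

    static const struct file_operations foo_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= foo_unlocked_ioctl,	/* was: .ioctl */
    };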

One of those goals is simply getting the BKL closer to the code which depends on it, facilitating its eventual removal. The other is to get that much closer to a point where the BKL can simply be configured out of the kernel altogether. That is where the CONFIG_BKL option comes in. Turning that option off will remove BKL support, causing any code which depends on it to fail to compile. That code can be annotated with its BKL dependency, again making it easier to find and fix.

On the face of it, configuring out the BKL may not seem like a hugely desirable thing to do; it takes little space, and the overhead seems small if nobody is actually using it. But there are small - but significant - savings to be had: currently the scheduler must check, at every context switch, whether the BKL must be released by the outgoing process and/or reacquired by the incoming process. Context switches happen often enough that it's worth making them as fast as possible; eliminating the BKL outright will make a small contribution toward that goal.

Making the BKL configurable will also be a motivating factor for anybody who finds that their BKL-free kernel build is blocked by one crufty old driver. Most of the remaining BKL-dependent drivers are unloved and unmaintained; many of them may be entirely unused. Those which are still being used may well be fixed once a suitably-skilled developer realizes that a small amount of work will suffice to banish the BKL from a specific system forevermore.

In the end, 2.6.35 will not be, as a whole, a BKL-free kernel. But, if this work gets in, and if some other core patches are accepted, it may just become possible to build a number of configurations without the big kernel lock. That, certainly, is an achievement worth celebrating.

Comments (9 posted)

The cpuidle subsystem

By Jonathan Corbet
April 26, 2010
Your editor recently had cause to dig around in the cpuidle subsystem. It never makes sense to let such work go to only a single purpose when it could be applied toward the creation of a kernel-page article. So, what follows is a multi-level discussion of cpuidle, what it's for, and how it works. Doing nothing, it turns out, is more complicated than one might think.

On most systems, the processor is idle much of the time. We can't always be running CPU-intensive work like kernel builds, video transcoding, weather modeling, or yum. When there is nothing left to do, the processor will go into the idle state to wait until it is needed again. Once upon a time, on many systems, the "idle state" was literally a thread running at the lowest possible priority which would execute an infinite loop until the system found something better to do. Killing the idle process was a good way to panic a VAX/VMS machine, which had no clue of how to do nothing without a task dedicated to that purpose.

Running a busy-wait loop requires power; contemporary concerns have led us to the conclusion that expending large amounts of power toward the accomplishment of nothing is rarely a good idea. So CPU designers have developed ways for the processor to go into a lower-power state when there is nothing for it to do. Typically, when put into this state, the CPU will stop clocks and power down part or all of its circuitry until the next interrupt arrives. That results in the production of far more nothing per watt than busy-waiting.

In fact, most CPUs have multiple ways of doing nothing more efficiently. These idle modes, which go by names like "C states," vary in the amount of power saved, but also in the amount of ancillary information which may be lost and the amount of time required to get back into a fully-functional mode. On your editor's laptop, there are three idle states with the following characteristics:

                              C1      C2      C3
    Exit latency (µs)          1       1      57
    Power consumption (mW)  1000     500     100

On a typical processor, C1 will just turn off the processor clock, while C2 turns off other clocks in the system and C3 will actively power down parts of the CPU. On such a system, it would make sense to spend as much time as possible in the C3 state; indeed, while this sentence is being typed, the system is in C3 about 97% of the time. One might have thought that emacs could do a better job of hogging the CPU, but even emacs is no challenge for modern processors. The C1 state is not used at all, while a small amount of time is spent in C2.

One might wonder why the system bothers with anything but C3 at all; why not insist on the most nothing for the buck? The answer, of course, is that C3 has a cost. The 57µs exit latency means that the system must commit to doing nothing for a fair while. Bringing the processor back up also consumes power in its own right, and the ancillary costs - the C3 state might cause the flushing of the L2 cache - also hurt. So it's only worth going into C3 if the power savings will be real and if the system knows that it will not have to respond to anything with less than 57µs latency. If those conditions do not hold, it makes more sense to use a different idle state. Making that decision is the cpuidle subsystem's job.

Every processor has different idle-state characteristics and different actions are required to enter and leave those states. The cpuidle code abstracts that complexity into a separate driver layer; the drivers themselves are often found in architecture-specific or ACPI code. On the other hand, the decision as to which idle state makes sense in a given situation is very much a policy issue. The cpuidle "governor" interface allows the implementation of different policies for different needs. We'll take a look at both layers.

cpuidle drivers

At the highest level, the cpuidle driver interface is quite simple. It starts by registering the driver with the subsystem:

    #include <linux/cpuidle.h>

    struct cpuidle_driver {
	char			name[CPUIDLE_NAME_LEN];
	struct module 		*owner;
    };

    int cpuidle_register_driver(struct cpuidle_driver *drv);

About all this accomplishes is making the driver name available in sysfs. The cpuidle core also will enforce the requirement that only one cpuidle driver exist in the system at any given time.
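A minimal registration sequence might thus look like the following ("acme" being, of course, a made-up driver):

    /* Minimal driver registration; the "acme" driver is made up. */
    static struct cpuidle_driver acme_idle_driver = {
	.name	= "acme_idle",
	.owner	= THIS_MODULE,
    };

    static int __init acme_idle_init(void)
    {
	return cpuidle_register_driver(&acme_idle_driver);
    }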

Once the driver exists, though, it can register a cpuidle "device" for each CPU in the system - it is possible for different processors to have completely different setups, though your editor suspects that tends not to happen in real-world systems. The first step is to describe the processor idle states which are available for use:

    struct cpuidle_state {
	char		name[CPUIDLE_NAME_LEN];
	char		desc[CPUIDLE_DESC_LEN];
	void		*driver_data;

	unsigned int	flags;
	unsigned int	exit_latency; /* in US */
	unsigned int	power_usage; /* in mW */
	unsigned int	target_residency; /* in US */

	unsigned long long	usage;
	unsigned long long	time; /* in US */

	int (*enter)	(struct cpuidle_device *dev,
			 struct cpuidle_state *state);
    };

The name and desc fields describe the state; they will show up in sysfs eventually. driver_data is there for the driver's private use. The next four fields, starting with flags, describe the characteristics of this sleep state. Possible flags values are:

  • CPUIDLE_FLAG_TIME_VALID should be set if it is possible to accurately measure the amount of time spent in this particular idle state.

  • CPUIDLE_FLAG_CHECK_BM indicates that this state is not compatible with bus-mastering DMA activity. Deep sleeps will, among other things, disable the bus cycle snooping hardware, meaning that processor-local caches may fail to be updated in response to DMA. That can lead to data corruption problems.

  • CPUIDLE_FLAG_POLL says that this state causes no latency, but also fails to save any power.

  • CPUIDLE_FLAG_SHALLOW indicates a "shallow" sleep state with low latency and minimal power savings.

  • CPUIDLE_FLAG_BALANCED is for intermediate states with some latency and moderate power savings.

  • CPUIDLE_FLAG_DEEP marks deep sleep states with high latency and high power savings.

The depth of the sleep state is also described by the remaining fields: exit_latency says how long it takes to get back to a fully functional state, power_usage is the amount of power consumed by the CPU when it is in this state, and target_residency is the minimum amount of time the processor should spend in this state to make the transition worth the effort.

The enter() function will be called when the current governor decides to put the CPU into the given state; it will be described more fully below. The number of times the state has been entered will be kept in usage, while time records the amount of time spent in this state.

The cpuidle driver should fill in an appropriate set of states in a cpuidle_device structure for each CPU:

    struct cpuidle_device {
	unsigned int		cpu;

	int			last_residency;
	int			state_count;
	struct cpuidle_state	states[CPUIDLE_STATE_MAX];
	struct cpuidle_state	*last_state;

	void			*governor_data;
	struct cpuidle_state	*safe_state;
	/* Others omitted */
    };

The driver should set state_count to the number of valid states and cpu to the number of the CPU described by this device. The safe_state field points to the deepest sleep which is safe to enter while DMA is active elsewhere in the system. The device should be registered with:

    int cpuidle_register_device(struct cpuidle_device *dev);

The return value is, as usual, zero on success or a negative error code.
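Putting the pieces together for one CPU might look like the sketch below; the latency and power numbers are borrowed from the laptop table above, while the target residencies and the enter() callbacks are hypothetical:

    /* Sketch: describe two idle states for one CPU and register the
     * device. Residency values and enter() callbacks are invented. */
    static int acme_setup_device(struct cpuidle_device *dev,
				 unsigned int cpu)
    {
	struct cpuidle_state *s = dev->states;

	dev->cpu = cpu;
	dev->state_count = 2;

	strcpy(s[0].name, "C1");
	s[0].flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_SHALLOW;
	s[0].exit_latency = 1;		/* µs */
	s[0].power_usage = 1000;	/* mW */
	s[0].target_residency = 1;	/* µs */
	s[0].enter = acme_enter_c1;	/* hypothetical */

	strcpy(s[1].name, "C3");
	s[1].flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_DEEP |
		     CPUIDLE_FLAG_CHECK_BM;
	s[1].exit_latency = 57;
	s[1].power_usage = 100;
	s[1].target_residency = 100;
	s[1].enter = acme_enter_c3;	/* hypothetical */

	dev->safe_state = &s[0];	/* safe while DMA is active */
	return cpuidle_register_device(dev);
    }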

The only other thing that the driver needs to do is to actually implement the state transitions. As we saw above, that is done through the enter() function associated with each state:

    int (*enter)(struct cpuidle_device *dev, struct cpuidle_state *state);

A call to enter() is a request from the current governor to put the CPU associated with dev into the given state. Note that enter() is free to choose a different state if there is a good reason to do so, but it should store the actual state used in the device's last_state field. If the requested state has the CPUIDLE_FLAG_CHECK_BM flag set, and there is bus-mastering DMA active in the system, a transition to the indicated safe_state should be made instead. The return value from enter() should be the amount of time actually spent in the sleep state, expressed in microseconds.
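An enter() implementation following those rules might be structured like this; acme_hw_idle() and acme_bm_active() stand in for the real, hardware-specific operations:

    /* Sketch of an enter() callback; the acme_*() functions are
     * hypothetical hardware operations. */
    static int acme_enter_c3(struct cpuidle_device *dev,
			     struct cpuidle_state *state)
    {
	ktime_t t0, t1;

	/* fall back to the safe state while bus-master DMA is active */
	if ((state->flags & CPUIDLE_FLAG_CHECK_BM) && acme_bm_active())
		state = dev->safe_state;

	t0 = ktime_get();
	acme_hw_idle(state);		/* sleep until the next interrupt */
	t1 = ktime_get();

	dev->last_state = state;	/* record what was actually used */
	return (int) ktime_to_us(ktime_sub(t1, t0));
    }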

If the driver needs to temporarily put a hold on cpuidle activity, it can call:

    void cpuidle_pause_and_lock(void);
    void cpuidle_resume_and_unlock(void);

Note that cpuidle_pause_and_lock() blocks cpuidle activity for all CPUs in the system. It also acquires a mutex which is held until cpuidle_resume_and_unlock() is called, so it should not be used for long periods of time.

Power management for a specific CPU can be controlled with:

    int cpuidle_enable_device(struct cpuidle_device *dev);
    void cpuidle_disable_device(struct cpuidle_device *dev);

These functions can only be called with cpuidle as a whole paused, so one must call cpuidle_pause_and_lock() first.

cpuidle governors

Governors implement the policy side of cpuidle. The kernel allows the existence of multiple governors at any given time, though only one will be in control of a given CPU at any time. Governor code begins by filling in a cpuidle_governor structure:

    struct cpuidle_governor {
	char			name[CPUIDLE_NAME_LEN];
	unsigned int		rating;

	int  (*enable)		(struct cpuidle_device *dev);
	void (*disable)		(struct cpuidle_device *dev);
	int  (*select)		(struct cpuidle_device *dev);
	void (*reflect)		(struct cpuidle_device *dev);

	struct module 		*owner;
	/* ... */
    };

The name identifies the governor to user space, while rating is the governor's idea of how useful it is. By default, the kernel will use the governor with the highest rating value, but the system administrator can override that choice.

There are four callbacks provided by governors. The first two, enable() and disable(), are called when the governor is enabled for use or removed from use. Both functions are optional; if the governor does not need to know about these events, it need not supply these functions.

The select() function, instead, is mandatory; it is called whenever the CPU has nothing to do and wishes the governor to pick the optimal way of getting that nothing done. This function is where the governor can apply its heuristics, look at upcoming timer events, and generally try to decide how long the sleep can be expected to last and which idle state makes the most sense. The return value should be the integer index of the target state (in the dev->states array).

When making its decision, the governor should pay attention to the current latency requirements expressed by other code in the system. The mechanism for the registration of these requirements is the "pm_qos" subsystem. A number of quality-of-service requirements can be registered with this system, but the one most relevant for cpuidle governors is the CPU latency requirement. That information can be obtained with:

    #include <linux/pm_qos_params.h>

    int max_latency = pm_qos_requirement(PM_QOS_CPU_DMA_LATENCY);

On some systems, an overly-deep sleep state can wreak havoc with DMA operations (trust your editor's experience on this), so it's important to respect the latency requirements given by drivers.
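A toy select() might simply pick the deepest state whose exit latency fits within the current requirement; the in-tree governors ("ladder" and "menu") also try to predict how long the idle period will last, which this sketch cheerfully ignores:

    /* Toy select(): choose the deepest state whose exit latency fits
     * the current pm_qos limit. Assumes states[] is ordered from
     * shallowest to deepest. */
    static int toy_select(struct cpuidle_device *dev)
    {
	int max_latency = pm_qos_requirement(PM_QOS_CPU_DMA_LATENCY);
	int i, best = 0;

	for (i = 1; i < dev->state_count; i++)
		if (dev->states[i].exit_latency <= max_latency)
			best = i;
	return best;
    }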

Finally, the reflect() function will be called when the CPU exits the sleep state; the governor can use the resulting timing information to reach conclusions on how good its decision was.

An aside: blocking deep sleep

For what it's worth, driver developers can use these pm_qos functions to specify latency requirements:

    #include <linux/pm_qos_params.h>

    int pm_qos_add_requirement(int qos, char *name, s32 value);
    int pm_qos_update_requirement(int qos, char *name, s32 new_value);
    void pm_qos_remove_requirement(int qos, char *name);

This API is not heavily used in current kernels; most of the real uses would appear to be drivers telling the system that transitions into deep sleep states would be unwelcome. Needless to say, a driver should only block deep sleep when it is strictly necessary; the latency requirement should be removed when I/O is not in progress.
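In a driver, that usage would bracket the latency-sensitive I/O, along these lines (the hooks and the 50µs figure are invented for the example):

    /* Sketch: forbid deep sleep states only while I/O is in flight. */
    #include <linux/pm_qos_params.h>

    static void acme_start_io(void)	/* hypothetical driver hook */
    {
	/* no idle state with more than 50µs exit latency, please */
	pm_qos_add_requirement(PM_QOS_CPU_DMA_LATENCY, "acme-io", 50);
    }

    static void acme_io_done(void)
    {
	pm_qos_remove_requirement(PM_QOS_CPU_DMA_LATENCY, "acme-io");
    }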

And that describes the 2.6.34 version of the cpuidle subsystem and API. For the curious, the core and governor code can be found in drivers/cpuidle, while cpuidle drivers live in drivers/acpi/processor_idle.c and a handful of ARM subarchitecture implementations. All told, it's a testament to the complexity of doing nothing properly on contemporary systems.

Comments (29 posted)

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Janitorial

Memory management

Networking

Architecture-specific

Security-related

  • Mimi Zohar: EVM (April 22, 2010)

Virtualization and containers

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds