Brief items
The current development kernel remains 2.6.34-rc5; no prepatches have been
released over the last week. The flow of changes into the mainline
continues; it consists mostly of fixes, but also includes the VMware
balloon driver (discussed briefly here in early April) and the ipheth
driver, which facilitates USB tethering to iPhones. The 2.6.34-rc6 release
can be expected soon - probably a few milliseconds after this page is
published.
Stable updates: the 2.6.32.12 and 2.6.33.3 stable kernel updates were
released on April 26. Both updates are massive, with well over 100 fixes
in each.
Quotes of the week
I came to realize that if one wants his work (software) to be used
globally, making it in-tree is not the goal but an important first
step. Making software in-tree is technical, but affecting
distributors decision should involve non-technical issues, I
guess.
-- Toshiharu Harada
So, if your display switch button now just makes the letter "P"
appear, say thanks to Microsoft. There's a range of ways we can fix
this, although none of them are straightforward and those of you
who currently use left-Windows-p as a keyboard shortcut are going
to be sad. I would say that I feel your pain, but my current plan
is to spend the immediate future getting drunk enough that I stop
caring.
-- Matthew Garrett
One of the things that we sometimes have to tell people who are
trying to navigate the maze of upstream submission is that
sometimes you need to know who to ignore, and that sometimes rules
are guidelines (despite pedants who will NACK based on rules like,
"/proc, eeeeewwww", or "/debugfs must only strictly be for debug
information").
Telling embedded developers who only want to submit their driver
that they must create a whole new pseudo-filesystem just to export
a single file that in older, simpler times, would have just been
thrown into /proc is really not fair, and is precisely the sort of
thing that may cause them to say, "f*ck it, this is one too many
flaming hoops to jump through". If we throw up too many barriers,
in the long run it's not actually doing Linux a service.
-- Ted Ts'o
Good heavens, what is EILSEQ? ... Why on earth are driver writers
using this in the kernel??? Imagine the confusion which ensues
when this error code propagates all the way back to some poor
user's console. They'll be scrabbling around with language
encodings not even suspecting that their hardware is busted.
People do this *a lot*. They go grubbing through errno.h and grab
something which looks vaguely appropriate. But it's wrong. If
your hardware is busted then return -EIO and emit a printk to tell
the operator what broke.
-- Andrew Morton
By Jonathan Corbet
April 27, 2010
Mike Travis recently
ran into a problem: if
you have a system with a mere 2048 processors, there's only room for 16
processes on each CPU before the default 32K limit on process IDs is
reached. Systems with lots of processors tend not to run large numbers of
processes on each CPU, but 16 is still a bit tight - especially when one
considers how many kernel threads run on each CPU. With 2K processors, the
kernel threads alone may run the system out of process IDs; with 4K
processors, the system will not even succeed in booting.
The proposed solution was a new boot-time parameter allowing the
specification of a larger maximum number of process IDs. That idea did not
get very far, though; there is not much interest in adding more options
just to enable the system to boot. The fact that concurrency-managed workqueues
should eventually solve this problem (by getting rid of large numbers of
workqueue threads) hasn't helped either; that makes the kernel option look
like a temporary stopgap. But the workqueue changes are of little help to
people who have this problem now; some form of that work will probably be
merged eventually, but it does not appear to be a fast process.
So there will most likely be a shorter-term fix merged. Instead of a
kernel parameter, though, it will probably be some sort of heuristic
which looks at the number of processors and ensures that a sufficient
number of process IDs is available. If the default limit is too low, it
will be raised automatically.
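A fix along those lines might look something like this sketch (the
function name and the per-CPU minimum are invented for illustration; this
is not the actual patch):

    #include <linux/init.h>
    #include <linux/cpumask.h>
    #include <linux/threads.h>

    #define PIDS_PER_CPU_MIN 1024	/* assumed value */

    /* Hypothetical boot-time helper: ensure that the PID space
     * provides at least PIDS_PER_CPU_MIN process IDs per CPU,
     * raising the default 32K limit on large systems. */
    void __init scale_pid_max(void)
    {
        int needed = num_possible_cpus() * PIDS_PER_CPU_MIN;

        if (pid_max < needed) {
            pid_max = needed;
            if (pid_max > PID_MAX_LIMIT)
                pid_max = PID_MAX_LIMIT;
        }
    }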
There is one remaining concern: what about ancient applications which store
process IDs in signed, 16-bit integers? Apparently such applications
exist. It is less clear, though, that such applications exist on
4096-processor systems. So this fear is unlikely to hold up this change.
By the time the rest of us get those shiny, new, 4096-core desktop systems,
hopefully, any remaining broken applications will have long since been
fixed.
By Jonathan Corbet
April 28, 2010
One of the key points of contention when it comes to getting the
kernel-level Android code merged has been the
wakelock API. Wakelocks run
counter to how some think power management should be done, so they have
been hard to merge. But Android's drivers use wakelocks, so, in the
absence of that API, those drivers also cannot be merged. They
could be reworked to not use wakelocks, but then the mainline kernel would
have a forked version of the driver code which nobody actually uses - not
the best of outcomes. So coming to resolution on the wakelock issue has
been a high priority for a while.
The result of work in that area can now be seen in the form of the suspend block patches recently
posted by Arve Hjønnevåg. The name of the feature has been changed,
as has the API, but the core point is the same: allow the system to
automatically suspend itself when nothing is going on, and allow code to
say "something is going on" at both the kernel and user-space levels.
The suspend block patches add a new sysfs file called
/sys/power/policy; the default value found therein is "forced."
When the policy is "forced," system state transitions will happen in
response to explicit writes to /sys/power/state, as usual. If the
policy is changed to "opportunistic," though, things are a bit different.
The state written to /sys/power/state does not take effect
immediately; instead, the kernel goes into that state whenever it concludes
that the system is idle. The suspend blocker API can then be used to
prevent the system from suspending when the need arises.
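From user space, enabling opportunistic suspend might look like this
minimal sketch (error handling omitted; the sysfs values are those
described above):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd;

        /* Switch from the default "forced" policy. */
        fd = open("/sys/power/policy", O_WRONLY);
        write(fd, "opportunistic", 13);
        close(fd);

        /* Under the opportunistic policy this write does not suspend
           the system immediately; it names the state the kernel should
           enter whenever it concludes that nothing is going on. */
        fd = open("/sys/power/state", O_WRONLY);
        write(fd, "mem", 3);
        close(fd);
        return 0;
    }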
The two postings of this patch set have received a number of comments,
causing various things to be fixed. More recently, though, responses have
been of the "acked" variety. So one might conclude that suspend block has
a reasonable chance of getting in during the 2.6.35 merge window. That, in
turn, should open the doors for the merging of a lot of driver code from
the Android project. With luck, the much-publicized disconnect between
Android and the community may be a thing of the past - at the kernel
level, at least.
Kernel development news
April 28, 2010
This article was contributed by Valerie Aurora (formerly Henson)
"
File systems benchmarking is in a state of disarray." This
stark and undisputed summary comes from the introduction to
"
Generating
Realistic Impressions for File-System Benchmarking [PDF]"
by
Nitin Agrawal, Andrea
Arpaci-Dusseau, and Remzi Arpaci-Dusseau. This paper describes
Impressions, a tool for generating realistic, reproducible file system
images which can serve as the base of new file system benchmarks.
First, a little history. We, the file systems research and
development community, unanimously agree that most of our current
widely used file system benchmarks are deeply flawed. The Andrew
benchmark, originally created around 1988, is not solely a file system
benchmark and is so small that it often fits entirely in cache on
modern computers. Postmark (c. 1997) creates and deletes small files
in a flat directory structure without any fsync() calls;
often the files are deleted so quickly that they never get written to
disk. The company that created Postmark,
NetApp, stopped
hosting the Postmark code and tech report on their web site,
forcing developers to pass around bootleg Postmark versions in a
bizarre instance of benchmark samizdat.
fs_mark
(c. 2003) measures synchronous write workloads and is a useful
microbenchmark, but is in no way a general purpose file system
benchmarking
tool. bonnie
(c. 1990) and bonnie++
(c. 1999) tend to benchmark the disk more than the file system. In
general, run any file system benchmark and you'll find a file system
developer who will tell you why it is all wrong.
Why has no new general purpose file system benchmark gained widespread
use and acceptance since Postmark? A new benchmark is a dangerous
creature to unleash on the world: if it becomes popular enough, years
of research and development can go into making systems perform better
on what could, in the end, be a misleading or useless workload. "No
benchmark is better than a bad benchmark," is how the thinking goes,
at least in the academic file systems development community. I've
seen several new benchmarks quashed over the years for minor
imperfections or lack of features.
However, creating excellent new file systems benchmarks is
difficult without intermediate work to build on, flawed though it may
be. It's like demanding that architects go straight from grass huts
to skyscrapers without building stone buildings in between because
stone buildings would be an earthquake hazard. As a result, the file
systems benchmarking community continues to live in grass huts.
Impressions: Building better file system images
One thing the file systems community can agree on: We need better file
system images to run our benchmarks on - a solid foundation for any
putative skyscrapers of the future. The most accurate and
reproducible method of creating file system images is to make a copy
of a representative real-world file system at the block level and
write it back to the device afresh before each run of the benchmark.
Obviously, this approach is prohibitively costly in time, storage, and
bandwidth. Creating a tarball of the contents of the file system, and
extracting it in a freshly created file system is nearly as expensive
and also loses the information about the layout of the file system, an
important factor in file system performance. Creating all the files
at once and in directory order is a best case for the file system
block allocation code and won't reflect the real-world performance of
the file system when files are created and deleted over time. In all
cases, it is impractical for other developers to reproduce the results
using the same file system images - no one wants to download (and
especially not host) several-hundred-gigabyte file system images.
This is where Impressions comes in.
Impressions is a relatively small, simple, open-source tool (about 3500
lines of C++) that generates a file system image satisfying multiple
sophisticated statistical parameters. For example, Impressions
chooses file sizes using combinations of statistical functions with
multiple user-configurable parameters. Impressions is deterministic:
given the same set of starting parameters and random seeds, it will
generate the same file system image (at the logical level - the
on-disk layout may not be the same).
Impressions: The details
The directory structure of the file system needs to have realistic
depth and distribution of both files and directories. Impressions
begins with a directory creation phase that creates a target number of
directories. The directories are distributed according to a function
of the current number of subdirectories of a particular directory,
based on a 2007 study of real-world file system directory structure.
A caveat here is that creating the directories separately from the
files will not properly exercise some important parts of file system
allocation strategy. However, in many cases most of the directory
structure is static, and most of the changes occur as the creation and
deletion of files within directories, so creation of directories first
reflects an important real-world use case.
The distribution of file sizes can't be accurately modeled with any
straightforward probability distribution function due to a second
"bump" in the file size distribution, which in modern file systems
begins around 512MB. This heavy tail of file size distribution is
usually due to video files and disk images, and can't be ignored if
you care about the performance of video playback. Impressions
combines two probability distribution functions, a log-normal and a
Pareto, with five user-configurable parameters to produce a realistic
file size distribution.
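As a rough illustration of the technique (this is not Impressions' actual
code, and all parameter values here are invented), a hybrid sampler
combines a log-normal body with a Pareto tail:

    #include <math.h>
    #include <stdlib.h>

    /* Log-normal deviate via the Box-Muller transform. */
    static double sample_lognormal(double mu, double sigma)
    {
        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double z = sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);

        return exp(mu + sigma * z);
    }

    /* Pareto deviate by inverting its CDF. */
    static double sample_pareto(double xmin, double alpha)
    {
        double u = (rand() + 1.0) / (RAND_MAX + 2.0);

        return xmin / pow(u, 1.0 / alpha);
    }

    /* Most files come from the log-normal body; a small fraction
       (tail_weight) come from the heavy Pareto tail that models
       large video files and disk images. */
    static double sample_file_size(double mu, double sigma, double xmin,
                                   double alpha, double tail_weight)
    {
        if ((double)rand() / RAND_MAX < tail_weight)
            return sample_pareto(xmin, alpha);
        return sample_lognormal(mu, sigma);
    }

The five parameters here correspond loosely to the five user-configurable
parameters mentioned above.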
Files are assigned a depth in the directory tree according to "an
inverse polynomial of degree 2." Whatever that is (the code is
available for the curious), Figure 2(f) in the paper shows that the resulting
distribution of files by namespace depth is almost indistinguishable
from that in a real-world file system. Impressions also supports
user-configurable "Special" directories with an exceptionally large
number of files in them, like /lib.
The authors of Impressions clearly understood the importance of
realistic file data; the example use case in the paper is performance
comparison of two desktop search applications, which depend heavily on
the actual content of files. Filling all files with zeroes, or
randomly generated bytes, or repeats of the same pieces of text would
make Impressions useless for any benchmark that depends on file data,
such as those testing file system level deduplication or compression.
Impressions
supports two modes of text file content generation, including a word
popularity model suitable for evaluation of file search applications.
It also creates files with proper headers for sound files, various
image and video formats, HTML, and PDF.
Generation of file names is rudimentary but includes advanced
support for realistic file name extensions, like .mp3.
The file name itself is just a number incremented by one each time a
file is created, but the extension is selected from a list of popular
extensions according to percentiles observed in earlier file system
studies. Popular extensions only account for about half of file
names; the rest of the extensions are randomly generated.
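A toy version of that selection process (the table entries here are made
up, not the percentages from the studies) might read:

    #include <stdlib.h>

    /* Cumulative percentages for popular extensions; anything
       past the last entry gets a randomly generated extension. */
    static const struct {
        const char *ext;
        int cumulative_pct;
    } ext_table[] = {
        { ".txt", 10 }, { ".h", 18 }, { ".c", 25 },
        { ".mp3", 38 }, { ".jpg", 50 },
    };

    static const char *pick_extension(void)
    {
        int r = rand() % 100;
        unsigned int i;

        for (i = 0; i < sizeof(ext_table) / sizeof(ext_table[0]); i++)
            if (r < ext_table[i].cumulative_pct)
                return ext_table[i].ext;
        return NULL;	/* caller generates a random extension */
    }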
One case in which file names generated this way won't be useful is in
evaluating a directory entry lookup algorithm. Sequential search of a
directory for a particular directory entry isn't very efficient.
Instead, most modern file systems have some way to quickly map a
file name to its location in a directory, usually based on a hash of
the characters of the file name. This mapping function may be more or
less efficient on Impressions' sequential numerical file names compared
to real-world names. File name length also influences
performance, since it changes the number of directory entries that fit
in a block. Overall, file name generation in Impressions is good
enough, but there are opportunities for improvement.
One of the most important features of Impressions is its support for
deliberate fragmentation of the file system. Impressions creates
fragmentation by writing part of a file, creating a new file, writing
another chunk of the file, and then deleting the new file. This cycle
is repeated until the requested degree of fragmentation is achieved.
Note that file systems with good per-file preallocation may never
fragment in this scheme unless the disk space is nearly full or no
contiguous free stretches of disk space are left. In this case,
fragmenting a file system to the requested degree may take a while.
More efficient methods of fragmenting a file system might be necessary
in the future. Impressions could also use FIBMAP/FIEMAP to query the
layout of file systems in a portable manner; currently calculation of the
"fragmentation score" is only supported on ext2/3.
An interesting feature described in the paper but not available in the
version 1 release of Impressions is support to run a specified number
of rounds of the fragmentation code - sort of a fragmentation
workload. This will show the difference in disk allocation strategies
between file systems. For example, if one file system manages
allocation well enough that it rarely exceeds 30% discontiguous blocks,
while another routinely exceeds 60%, it makes little sense to compare
their performance when both are artificially set to 50% discontiguous
blocks. Instead,
running a set fragmentation workload would result in different
"natural" fragmentation levels in both file systems, providing a more
realistic baseline for performance comparison.
Impressions development
Impressions is open source, released under the GPLv3, and downloadable
here.
The original author, Nitin Agrawal, has graduated (now at NEC Labs)
and does not currently have plans for developing Impressions further.
This is a rare golden opportunity for a new maintainer to work on an
influential, high-profile project. The code is, in my opinion, easy
to understand and clearly written (although I've spent the last year
working on e2fsprogs and the VFS, so take that with a grain of salt).
Some open areas for contribution include:
- Measure actual fragmentation using FIBMAP/FIEMAP
- Smarter filename generation
- Addition of hard links and symbolic links
- Performance improvement
- Scaling to larger file systems (> 100GB)
- Packaging for distributions
- More robust error checking and handling
Another possibility for future development is Lars
Wirzenius's
genbackupdata tool, written in Python. The goal of
this tool is to generate a representative file system image for
testing a backup tool. It already has some of the features of
Impressions and others appear to be easy to add. Python may be a more
suitable language for long-term development and maintenance of a file
system creation tool.
Conclusion
Impressions is an enormous step forward in the file system
benchmarking world. With a little polishing and a dedicated
maintainer, it could become the de facto standard for creating file
systems for benchmarking. Impressions can report the full set of
parameters and random seeds it uses, which can then be used for
another Impressions run to recreate the exact same logical file system
(actual layout will vary some). Impressions can be used today by file
system and application developers to create realistic, reproducible
file system images for testing and performance evaluation.
By Jonathan Corbet
April 27, 2010
The removal of the big kernel lock (BKL) has been one of the
longest-running projects in kernel development history. The BKL has been a
clear scalability and maintainability problem since its addition in the 1.3
development series; efforts to move away from it began in the 2.1 cycle.
But the upcoming 2.6.34 kernel will still feature a big kernel lock,
despite all the work that has been done to get rid of it. The good news is
that 2.6.35 might just work without the BKL - at least, for a number of
configurations.
Over the years, use of the BKL has been pushed down into ever lower levels
of the kernel. Once a lock_kernel() call has been pushed into an
individual device driver, for example, it is relatively easy to determine
whether it is really necessary and, eventually, get rid of it altogether.
There is, however, one significant BKL acquisition left in the core kernel:
the ioctl() implementation. The kernel has supported a BKL-free
unlocked_ioctl() operation for years, but there are still many
drivers which depend on the older, BKL-protected version.
Clearly, fixing the ioctl() problem is a key part of the overall
BKL solution. To that end, Frederic Weisbecker and Arnd Bergmann posted a patch to prepare the ground for change.
This patch adds yet another ioctl() variant called
locked_ioctl() to the file_operations structure. The
idea was to have both ioctl() and locked_ioctl() in place
for long enough to change all of the code which still requires the BKL,
after which ioctl() could be removed.
This new function was also made dependent on a new CONFIG_BKL
configuration option.
That patch did not get very far; Linus strongly
disliked both locked_ioctl() and CONFIG_BKL. So the
search for alternatives began. In the end, it looks like
locked_ioctl() may never happen, but the configuration option will
eventually exist.
Linus's suggestion was to not bother with locked_ioctl().
Instead, every ioctl() operation should just be renamed to
bkl_ioctl() in one big patch. That would allow code which depends
on the BKL to be easily located with grep without adding yet
another function to struct file_operations even temporarily. A patch which does this renaming has been
posted; this patch may well be merged for 2.6.35.
Or perhaps not. Arnd has taken a more traditional approach with his patch which simply pushes the BKL down
into every remaining ioctl() function which needs it. Once a
specific ioctl()
function handles BKL acquisition itself, it can be called from the core
kernel as an unlocked_ioctl() function instead. When all such
functions have been converted, the locked version of ioctl() can
go away, and the BKL can be removed from that bit of core code. The
pushdown is a bigger job than the renaming - a typical conversion is
sketched below - but it accomplishes a couple of important goals.
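A pushdown conversion generally takes this shape (the mydev_* names are
hypothetical):

    /* Before: the core kernel wraps this call in the BKL. */
    static int mydev_ioctl(struct inode *inode, struct file *file,
                           unsigned int cmd, unsigned long arg);

    /* After: the driver takes the BKL itself and can be wired up
       as unlocked_ioctl(), so the core need not take the lock. */
    static long mydev_unlocked_ioctl(struct file *file,
                                     unsigned int cmd, unsigned long arg)
    {
        long ret;

        lock_kernel();
        ret = mydev_do_ioctl(file, cmd, arg);	/* hypothetical helper */
        unlock_kernel();
        return ret;
    }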
One of those goals is simply getting the BKL closer to the code which
depends on it, facilitating its eventual removal. The other is to get that
much closer to a point where the BKL can simply be configured out of the
kernel altogether. That is where the CONFIG_BKL option comes in.
Turning that option off will remove BKL support, causing any code which
depends on it to fail to compile. That code can be annotated with its BKL
dependency, again making it easier to find and fix.
On the face of it, configuring out the BKL may not seem like a hugely
desirable thing to do; it takes little space, and the overhead seems small
if nobody is actually using it. But there are small - but significant -
savings to be had: currently the scheduler must check, at every context
switch, whether the BKL must be released by the outgoing process and/or
reacquired by the incoming process. Context switches happen often enough
that it's worth making them as fast as possible; eliminating the BKL
outright will make a small contribution toward that goal.
Making the BKL configurable will also be a motivating factor for anybody
who finds that their BKL-free kernel build is blocked by one crufty old
driver. Most of the remaining BKL-dependent drivers are unloved and
unmaintained; many
of them may be entirely unused. Those which are still being used may well
be fixed once a suitably-skilled developer realizes that a small amount of
work will suffice to banish the BKL from a specific system forevermore.
In the end, 2.6.35 will not be, as a whole, a BKL-free kernel. But, if
this work gets in, and if some
other core patches are accepted, it may just become possible to build a
number of configurations without the big kernel lock. That, certainly, is
an achievement worth celebrating.
By Jonathan Corbet
April 26, 2010
Your editor recently had cause to dig around in the cpuidle subsystem.
It never makes sense to let such work go to only a single purpose when it
could be applied toward the creation of a kernel-page article. So, what
follows is a multi-level discussion of cpuidle, what it's for, and how it works.
Doing nothing, it turns out, is more complicated than one might think.
On most systems, the processor is idle much of the time. We can't always
be running CPU-intensive work like kernel builds, video transcoding,
weather modeling, or
yum. When there is nothing left to do, the processor will go into the idle
state to wait until it is needed again. Once upon a time, on many systems,
the "idle state" was literally a thread running at the lowest possible
priority which would execute
an infinite loop until the system found something better to do. Killing
the idle process was a good way to panic a VAX/VMS machine, which had no
clue of how to do nothing without a task dedicated to that purpose.
Running a busy-wait loop requires power; contemporary concerns have led us
to the conclusion that expending large amounts of power toward the
accomplishment of nothing is rarely a good idea. So CPU designers have
developed ways for the processor to go into a lower-power state when
there is nothing for it to do. Typically, when put into this state, the
CPU will stop clocks and power down part or all of its circuitry until the
next interrupt
arrives. That results in the production of far more nothing per watt than
busy-waiting.
In fact, most CPUs have multiple ways of doing nothing more efficiently.
These idle modes, which go by names like "C states," vary in the
amount of power saved, but also in the amount of ancillary information
which may be lost and the amount of time required to get back into a
fully-functional mode. On your editor's laptop, there are three idle
states with the following characteristics:
    State                     C1      C2      C3
    Exit latency (µs)          1       1      57
    Power consumption (mW)  1000     500     100
On a typical processor, C1 will just turn off the processor clock, while C2
turns off other clocks in the system and C3 will actively power down parts
of the CPU.
On such a system, it would make sense to spend as much time as possible in
the C3 state; indeed, while this sentence is being typed, the system is in
C3 about 97% of the time. One might have thought that emacs could do a
better job of hogging the CPU, but even emacs is no challenge for
modern processors. The C1 state is not used at all, while a small amount
of time is spent in C2.
One might wonder why the system bothers with anything but C3 at all; why
not insist on the most nothing for the buck? The answer, of course, is
that C3 has a cost. The 57µs exit latency means that the system must
commit to doing nothing for a fair while. Bringing the processor back up
also consumes power in its own right, and the ancillary costs - the C3
state might cause the flushing of the L2 cache - also hurt. So it's only worth
going into C3 if the power savings will be real and if the system knows
that it will not have to respond to anything with less than 57µs
latency. If those conditions do not hold, it makes more sense to use a
different idle state. Making that decision is the cpuidle subsystem's job.
Every processor has different idle-state characteristics and different
actions are required to enter and leave those states. The cpuidle
code abstracts that complexity into a separate driver layer; the drivers
themselves are often found in architecture-specific or ACPI code. On the other
hand, the decision as to which idle state makes sense in a given situation
is very much a policy issue. The cpuidle "governors" interface allows the
implementation of different policies for different needs. We'll take a
look at both layers.
cpuidle drivers
At the highest level, the cpuidle driver interface is quite simple. It
starts by registering the driver with the subsystem:
    #include <linux/cpuidle.h>

    struct cpuidle_driver {
        char name[CPUIDLE_NAME_LEN];
        struct module *owner;
    };

    int cpuidle_register_driver(struct cpuidle_driver *drv);
About all this accomplishes is making the driver name available in sysfs.
The cpuidle core also will enforce the requirement that only one cpuidle
driver exist in the system at any given time.
Once the driver exists, though, it can register a cpuidle "device" for each
CPU in the system - it is possible for different processors to have
completely different setups, though your editor suspects that tends not to
happen in real-world systems. The first step is to describe the processor
idle states which are available for use:
    struct cpuidle_state {
        char name[CPUIDLE_NAME_LEN];
        char desc[CPUIDLE_DESC_LEN];
        void *driver_data;

        unsigned int flags;
        unsigned int exit_latency;     /* in US */
        unsigned int power_usage;      /* in mW */
        unsigned int target_residency; /* in US */

        unsigned long long usage;
        unsigned long long time;       /* in US */

        int (*enter) (struct cpuidle_device *dev,
                      struct cpuidle_state *state);
    };
The name and desc fields describe the state; they will
show up in sysfs eventually. driver_data is there for the
driver's private use. The next four fields, starting with flags,
describe the characteristics of this sleep state.
Possible flags values are:
- CPUIDLE_FLAG_TIME_VALID should be set if it is possible
to accurately measure the amount of time spent in this particular idle
state.
- CPUIDLE_FLAG_CHECK_BM indicates that this state is not
compatible with bus-mastering DMA activity. Deep sleeps will, among
other things, disable the bus cycle snooping hardware, meaning that
processor-local caches may fail to be updated in response to DMA.
That can lead to data corruption problems.
- CPUIDLE_FLAG_POLL says that this state causes no latency, but
also fails to save any power.
- CPUIDLE_FLAG_SHALLOW indicates a "shallow" sleep state with
low latency and minimal power savings.
- CPUIDLE_FLAG_BALANCED is for intermediate states with some
latency and moderate power savings.
- CPUIDLE_FLAG_DEEP marks deep sleep states with high latency
and high power savings.
The depth of the sleep state is also described by the remaining fields:
exit_latency says how long it takes to get back to a fully
functional state, power_usage is the amount of power consumed by
the CPU when it is in this state, and target_residency is the
minimum amount of time the processor should spend in this state to make the
transition worth the effort.
The enter() function will be called when the current governor
decides to put the CPU into the given state; it will be described
more fully below. The number of times the state has been entered will be
kept in usage, while time records the amount of time
spent in this state.
The cpuidle driver should fill in an appropriate set of states in a
cpuidle_device structure for each CPU:
    struct cpuidle_device {
        unsigned int cpu;
        int last_residency;
        int state_count;
        struct cpuidle_state states[CPUIDLE_STATE_MAX];
        struct cpuidle_state *last_state;
        void *governor_data;
        struct cpuidle_state *safe_state;
        /* Others omitted */
    };
The driver should set state_count to the number of valid states
and cpu to the number of the CPU described by this device. The
safe_state field points to the deepest sleep which is safe to
enter while DMA is active elsewhere in the system. The device
should be registered with:
int cpuidle_register_device(struct cpuidle_device *dev);
The return value is, as usual, zero on success or a negative error code.
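Putting the pieces together, a driver for the hypothetical laptop above
might describe and register its C1 state along these lines (all names and
numbers are invented for illustration):

    static struct cpuidle_device my_device;

    static int my_setup_cpu(unsigned int cpu)
    {
        struct cpuidle_state *s = &my_device.states[0];

        my_device.cpu = cpu;
        my_device.state_count = 1;

        strcpy(s->name, "C1");
        strcpy(s->desc, "CPU clock gating");
        s->flags = CPUIDLE_FLAG_TIME_VALID | CPUIDLE_FLAG_SHALLOW;
        s->exit_latency = 1;		/* µs */
        s->power_usage = 1000;		/* mW */
        s->target_residency = 10;	/* µs */
        s->enter = my_enter;		/* see the sketch below */

        my_device.safe_state = s;	/* shallow, so safe during DMA */
        return cpuidle_register_device(&my_device);
    }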
The only other thing that the driver needs to do is to actually implement
the state transitions. As we saw above, that is done through the
enter() function associated with each state:
int (*enter)(struct cpuidle_device *dev, struct cpuidle_state *state);
A call to enter() is a request from the current governor to put
the CPU associated with dev into the given state. Note
that enter() is free to choose a different state if there is a
good reason to do so, but it should store the actual state used in the
device's last_state field. If the requested state has the
CPUIDLE_FLAG_CHECK_BM flag set, and there is bus-mastering DMA
active in the system, a transition to the indicated safe_state
should be made instead. The return value from enter()
should be the amount of time actually spent in the sleep state, expressed
in microseconds.
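A skeletal enter() implementation might thus look like the following,
where hw_enter_idle() and bm_dma_is_active() are hypothetical stand-ins
for hardware-specific code:

    static int my_enter(struct cpuidle_device *dev,
                        struct cpuidle_state *state)
    {
        ktime_t t0, t1;

        /* Fall back to the safe state while bus-master DMA is active. */
        if ((state->flags & CPUIDLE_FLAG_CHECK_BM) && bm_dma_is_active())
            state = dev->safe_state;

        t0 = ktime_get();
        hw_enter_idle(state->driver_data);	/* hardware-specific */
        t1 = ktime_get();

        dev->last_state = state;
        return (int)ktime_to_us(ktime_sub(t1, t0));
    }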
If the driver needs to temporarily put a hold on cpuidle activity, it can
call:
void cpuidle_pause_and_lock(void);
void cpuidle_resume_and_unlock(void);
Note that cpuidle_pause_and_lock() blocks cpuidle activity for all
CPUs in the system. It also acquires a mutex which is held until
cpuidle_resume_and_unlock() is called, so it should not be used
for long periods of time.
Power management for a specific CPU can be controlled with:
int cpuidle_enable_device(struct cpuidle_device *dev);
void cpuidle_disable_device(struct cpuidle_device *dev);
These functions can only be called with cpuidle as a whole paused, so one
must call cpuidle_pause_and_lock() first.
cpuidle governors
Governors implement the policy side of cpuidle. The kernel allows the
existence of multiple governors at any given time, though only one will be
in control of a given CPU at any time. Governor code begins by filling in
a cpuidle_governor structure:
    struct cpuidle_governor {
        char name[CPUIDLE_NAME_LEN];
        unsigned int rating;

        int  (*enable)  (struct cpuidle_device *dev);
        void (*disable) (struct cpuidle_device *dev);

        int  (*select)  (struct cpuidle_device *dev);
        void (*reflect) (struct cpuidle_device *dev);

        struct module *owner;
        /* ... */
    };
The name identifies the governor to user space, while
rating is the governor's idea of how useful it is. By default,
the kernel will use the governor with the highest rating value, but the
system administrator can override that choice.
There are four callbacks provided by governors. The first two,
enable() and disable(), are called when the governor is
enabled for use or removed from use. Both functions are optional; if the
governor does not need to know about these events, it need not supply these
functions.
The select() function, by contrast, is mandatory; it is called
whenever the CPU has nothing to do and wishes the governor to pick the
optimal way of getting that nothing done. This function is where the
governor can apply its heuristics, look at upcoming timer events, and
generally try to decide how long the sleep can be expected to last and
which idle state makes the most sense. The return value should be the
integer index of the target state (in the dev->states array).
When making its decision, the governor should pay attention to the current
latency requirements expressed by other code in the system. The mechanism
for the registration of these requirements is the "pm_qos" subsystem. A
number of quality-of-service requirements can be registered with this
system, but the one most relevant for cpuidle governors is the CPU latency
requirement. That information can be obtained with:
#include <linux/pm_qos_params.h>
int max_latency = pm_qos_requirement(PM_QOS_CPU_DMA_LATENCY);
On some systems, an overly-deep sleep state can wreak havoc with DMA
operations (trust your editor's experience on this), so it's important to
respect the latency requirements given by drivers.
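Reduced to its essentials, a governor's state selection might read as
follows; predict_sleep_us() is a hypothetical stand-in for the governor's
heuristics, and the states array is assumed to be ordered from shallowest
to deepest:

    static int my_select(struct cpuidle_device *dev)
    {
        int max_latency = pm_qos_requirement(PM_QOS_CPU_DMA_LATENCY);
        unsigned int expected_us = predict_sleep_us();
        int i, best = 0;

        /* Pick the deepest state that respects both the system-wide
           latency requirement and the expected sleep length. */
        for (i = 1; i < dev->state_count; i++) {
            struct cpuidle_state *s = &dev->states[i];

            if (s->exit_latency > max_latency)
                continue;
            if (s->target_residency > expected_us)
                continue;
            best = i;
        }
        return best;
    }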
Finally, the reflect() function will be called when the CPU exits
the sleep state; the governor can use the resulting timing information to
reach conclusions on how good its decision was.
An aside: blocking deep sleep
For what it's worth,
driver developers can use these pm_qos functions to specify latency
requirements:
#include <linux/pm_qos_params.h>
int pm_qos_add_requirement(int qos, char *name, s32 value);
int pm_qos_update_requirement(int qos, char *name, s32 new_value);
void pm_qos_remove_requirement(int qos, char *name);
This API is not heavily used in current kernels; most of the real uses
would appear to be drivers telling the system that transitions into deep
sleep states would be unwelcome. Needless to say, a driver should only
block deep sleep when it is strictly necessary; the latency requirement
should be removed when I/O is not in progress.
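So a driver might bound wakeup latency only for the duration of an I/O
burst, along these lines ("mydev" and the 100µs value are invented):

    /* Tolerate no more than 100µs of wakeup latency during I/O. */
    pm_qos_add_requirement(PM_QOS_CPU_DMA_LATENCY, "mydev", 100);

    mydev_do_io();	/* hypothetical latency-sensitive work */

    /* Let the CPUs go back to sleeping deeply. */
    pm_qos_remove_requirement(PM_QOS_CPU_DMA_LATENCY, "mydev");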
And that describes the 2.6.34 version of the cpuidle subsystem and API.
For the curious, the core and governor code can be found in
drivers/cpuidle, while cpuidle drivers live in
drivers/acpi/processor_idle.c and a handful of ARM subarchitecture
implementations.
All told, it's a testament to the complexity of doing nothing properly on
contemporary systems.
Patches and updates
Security-related
- Mimi Zohar: EVM (April 22, 2010)