The current development kernel is 3.5-rc6, released
on July 7. "There's
mainly some btrfs and md stuff in here, with the normal driver changes, arm
updates and some networking changes. And a smattering of random stuff
(including docs etc). None of it looks very scary, it's all pretty small,
and there aren't even all that many of those small changes."
Linus also notes that the 3.6 merge window is likely to hit when a lot of
developers are on vacation, so the 3.6 kernel might contain a relatively
small set of changes.
Stable updates: 3.2.22 was released
on July 5. The 3.2.23 update is in
the review process as of this writing; it can be expected almost anytime.
One man's idiom is another man's idiocy.
— Andrew Morton
Perhaps it's a typo, and was meant to be AArgh64
— Jan Ceuleers
Dunno, the moment Apple comes out with their 64-bit iPhone6 (or
whichever version they'll go 64-bit) *everyone* will scramble to go
64-bit, for marketing reasons. Code and SoC details will be ported
to 64-bit in the usual fashion: in the crudest, fastest possible way.
There's also a technological threshold: once RAM in a typical
smartphone goes above 2GB the pain of a 32-bit kernel becomes
significant. We are only about a year away from that point.
So are you *really* convinced that the colorful ARM SoC world is
not going to go 64-bit and will all unify behind a platform, and
that we can actually force this process by not accepting
non-generic patches? Is such a platform design being enforced by
ARM, like Intel does it on the x86 side?
— Ingo Molnar
A number of the developers all went to a climbing gym one evening, and I found myself climbing with another kernel developer who worked for a different company, someone whose code I had rejected in the past for various reasons, and then eventually accepted after a number of different iterations. So I've always thought after that incident, "always try to be nice in email, you never know when the person on the other side of the email might be holding onto a rope ensuring your safety."
— Greg Kroah-Hartman
Jennifer Cloer interviews Greg Kroah-Hartman
for the Linux.com series. "I was an embedded software developer testing the device I was working on (a barcode scanner) with all different operating systems to ensure that I had gotten the USB firmware correct. Linux had very little USB support at the time, and I realized I could help out and contribute to make it work better. One thing led to another and I soon got a job doing Linux kernel development full time over 10 years ago and never looked back."
Kernel development news
The btrfs snapshot capability allows a system administrator to quickly
capture the state of a filesystem at any given time. Thanks to the
copy-on-write mechanism used by btrfs, snapshots share data with other
snapshots or the "live" system; blocks are only duplicated when they are
changed. While btrfs makes the creation and management of snapshots easy,
it currently lacks the ability to efficiently determine what the
differences are between two snapshots and save that information for future
use. Given that some other advanced filesystems (ZFS, for example) offer
that capability, btrfs can arguably be seen as falling a little short in
this particular area.
Happily, that situation appears to be about to change, as Alexander Block's btrfs send/receive patch set has been well
received by the development community. In short, with this patch set (and
the associated user space tools), btrfs can
be instructed to calculate the set of changes made between two snapshots
and serialize them to a file. That file can then be replayed elsewhere,
possibly at some future time, to regenerate one snapshot from the other.
This functionality is implemented with the new BTRFS_IOC_SEND
ioctl() command. In its simplest form, this operation accepts a
file descriptor representing a mounted volume and the subvolume ID
corresponding to the snapshot of interest; it will then find the changes
between the snapshot and the "parent" snapshot it was generated from.
There are more options, though:
- The operation can actually take a list of snapshot/subvolume IDs and
generate a combined file for all of them.
- The parent snapshot can be specified explicitly. That may be required
for older btrfs volumes that lack the needed identifying information.
It may also be useful to generate differences that skip over a set of
snapshots — differences from a grandparent, say, instead of the direct parent.
- The command also accepts an optional list of "clone sources." Those
are subvolumes that can be expected to exist on the receiving side;
when possible, data blocks will be "cloned" from those snapshots
rather than being written into the differences file. That reduces the
size of the differences and enables better data sharing on the
receiving side.
The generated file is essentially a set of instructions for converting the
parent snapshot into the one being "sent." The list of commands is
surprisingly long, including operations like create a file (or directory,
device node, FIFO, symbolic link, ...), rename or link a file, unlink a
file, set and remove extended attributes, write data, clone data blocks,
truncate a file, change ownership and permissions, set file times, and so
on. The code that generates this file is also surprisingly long, being
several thousand lines of complex, nearly uncommented functions (some of
the comments that do exist, saying things like "magic happens here," are
not entirely helpful).
Interestingly, according to the patch introduction, the custom file format was
not in the original plan. Instead, the output was meant to be in something
close to the tar file format — close enough that the tar command could be
used to extract data from it. Tar turned out not to have the needed
capabilities, though, so a new format was created. The format should be
considered to be in flux still, though, clearly, it will need to stabilize
before this feature can be considered ready for production use.
As it happens, the
playback of this file can be done almost entirely in user space, so there
is no need for a BTRFS_IOC_RECEIVE operation.
At the command level, using this feature can be as simple as:
btrfs send snapshot
This will send the given snapshot (in its entirety) to the
standard output stream. Writing the command as:
btrfs send -i oldsnap snapshot
will cause the creation of
an incremental send containing just the differences from oldsnap.
The receive command can be used to apply a file created by
btrfs send to an existing filesystem.
The primary use case for this feature (which is clearly patterned after the
ZFS send/receive functionality) is backups in various forms. A
cron job could easily send a snapshot to a remote server on a
regular basis, maintaining a mirror of a filesystem there. The send files
can simply be stored as backups; an entire volume can be sent as a full
backup, while snapshots are easily sent as incrementals. With some
additional tooling, the send/receive feature could develop into an advanced
backup capability with low-level support from the underlying filesystem.
That is for some
time in the future, though; the feature is currently experimental, and
Alexander warns potential users:
If you use it for backups, you're taking big risks and may end up
with unusable backups. Please do not only count on btrfs send/receive.
That said, there seems to be a fair amount of interest in this feature
(btrfs creator Chris Mason described it as
"just awesome"), so chances are it will be worked into
reasonable shape relatively quickly. Then btrfs will have one more useful
feature and one less reason to be concerned about comparisons with ZFS.
ARM is one of the most successful processor architectures ever created;
most of us possess several ARM cores for every x86 processor we have. ARM
is very much thought of as an embedded systems processor; it is focused on
minimal power use and the ability to be built into a variety of system-on-chip
configurations. The "small systems" image of ARM is certainly encouraged
by the fact that current ARM processors are all 32-bit. That situation is
about to change, though, with the arrival of 64-bit ARM processors. Linux
will be ready for these systems — the first set of 64-bit ARM support
patches has just been posted — but there is still some debate around a
couple of fundamental decisions.
One might well wonder whether a 64-bit ARM processor is truly needed.
64-bit computing seems a bit rich even for the fanciest handsets or
tablets, much less for the kind of embedded controllers where ARM
processors predominate. But mobile devices are beginning to push the
memory-addressing limits of 32-bit systems; even a 1GB system requires the
use of high memory in most configurations. So, even if the heavily
foreshadowed ARM server systems never materialize, there will be a need for
64-bit ARM processors just to be able to efficiently use the memory that
future mobile devices will have. "Mobile" and "embedded" no longer mean what they once did.
Naturally, Linux support is an important precondition to a successful
64-bit ARM processor introduction, so ARM has been supporting work in that
area for some time. The initial GCC
patches were posted back in May, and the first
set of kernel patches was posted by Catalin Marinas on July 6.
All this code exists despite the fact that no 64-bit ARM hardware is yet
available; it all has been developed on simulators. Once the hardware
shows up, with luck, the software will work with a minimum of tweaking.
64-bit ARM support involves the addition of thousands of lines of new code
via a 36-part patch set. There are some novel features, such as the
ability to run with a 64KB native memory page size, and a lot of important
technical decisions to be reviewed. So the kernel developers did what one
would expect: they started complaining about the name given to the
architecture. That name ("AArch64") strikes many as simultaneously
redundant (of course it is an architecture) and uninformative (what does
"A" stand for?). Many would prefer either ARMv8 (which is the actual
hardware architecture name—"AArch64" is ARMv8's 64-bit operating mode) or
Arguments in favor of the current name include the fact that it is already used
to identify the architecture in the ELF triplet used in binaries; using the
same name everywhere should help to reduce confusion. But, then, as Arnd
Bergmann noted: "If everything else
is aarch64, we should use that in the kernel
directory too, but if everyone calls it arm64 anyway, we should
probably use that name for as many things as possible."
Jon Masters added that, in classic
contrarian style, he likes the name as it is; Fedora is planning to use
"aarch64" as the name for its 64-bit ARM releases. Others, such as Ingo Molnar, argue in favor of changing the
name now when it is relatively easy to do. Catalin seems inclined to keep the current name but
says he will think about it before posting the next version of the patch
set.
wouldn't it make more sense to unify the 32-bit and 64-bit ARM
implementations from the outset? A number of other architectures (x86,
PowerPC, SPARC, and MIPS) all started with separate implementations, but
ended up merging them later on, usually with some significant pain involved. Rather
than leave that pain for future ARM developers, it has been suggested that,
perhaps, it would be better to start with a unified implementation.
There are a lot of reasons given for the separate 64-bit ARM architecture
implementation. Much of the relevant thinking can be found in this note from Arnd. The 64-bit ARM
instruction set is completely different from the 32-bit variety, to the
point that there is no possibility of writing assembly code that works on
both architectures. The system call interfaces also differ significantly,
with the 64-bit version taking a more standard approach and leaving a lot
of legacy code behind. The 64-bit implementation hopes to leave the entire
32-bit ARM "platform" concept behind as well; indeed, as Jon put it, there are hopes that it will be
possible to have a single kernel binary that runs on all 64-bit ARM systems from the
outset. In general, it is said, giving AArch64 a clean start in its own
top-level hierarchy will make it possible to leave a lot of ARM baggage
behind and will result in a better implementation overall.
Others were quick to point out that most of these arguments have been heard
in the context of other architectures. x86_64 was also meant to be a clean
start that dumped a lot of old i386 code. In the end, things have turned
out otherwise. It may be possible that things are different here; 32-bit
ARM has rather more legacy baggage than other architectures did, and the
processor differences seem to be larger. Some have said that the proper
comparison is with x86 and ia64, though one gets the sense that the
AArch64 developers don't want to be seen in the same light as ia64 in any way.
This decision will come down to what the AArch64 developers want, in the
end; it's up to them to produce a working implementation and to maintain it
into the future. If they insist that it should be a separate top-level
architecture, it is unlikely that others will block its merging for that
reason alone. Of course, it will also be up to those developers to manage
a merger of the two in the future, should that prove to be necessary. If
nothing else, life as a separate top-level architecture will allow some
experimentation without the fear of breaking older 32-bit systems; the
result could be a better unified architecture some years from now, should
things move in that direction.
Thus far, there has been little in the way of deeper technical criticism of
the AArch64 patch set. Things may stay that way. The code has already
been through a number of rounds of private review involving prominent
developers, so the worst problems should already have been found and
addressed. Few developers have the understanding of this new processor
that would be necessary to truly understand much of the code. So it may go
into the mainline kernel (perhaps as early as 3.7) without a whole lot of
substantial changes. After that, all that will be needed is actual
hardware; then things should get truly interesting.
Last week we discussed three elements that
might serve to
guide the creation of introductory technical documentation. This week
we put those elements to the test by using them to create some
introductory documentation for Linux power management. For me, this exercise
precisely answers the question "What were you looking for that you
didn't find?", as it is the documentation I would have liked to read.
This documentation is necessarily incomplete, partly because my own
experience is not yet broad enough to provide a comprehensive document,
and partly because doing so might try the patience of the present
readership. As such it stops short of delving into the details of
hibernation and completely omits any treatment of quality-of-service and
wakeup sources, all of which would have an important place in a more
complete document. Fortunately there are still sufficient topics to
showcase the presentation of structure, purpose, and examples.
Three perspectives on Linux power management
The power management infrastructure in Linux is quite complex, but
hopefully not intractably so. To get a handle on this complexity it
is helpful to view it from three different perspectives. The first
perspective highlights the different holistic states of the system
which roughly divide into "in use", "not in use", and "indefinitely
not in use", corresponding to "run time power management",
"suspend" and "hibernate". One of the distinctions between these is
the size of the power switch. The first uses lots of little power
switches at different times, while the last turns off everything all at
once (except maybe a real-time clock or similar).
The second of these states is somewhat harder to define. It covers a
range of states which are not easy to clearly differentiate. At one end
of the spectrum we have the traditional "suspend" mode of a laptop,
which is almost like hibernation but uses a little more power and is a
little quicker to get into and out of. Once the laptop has entered
suspend it really must stay there using minimal power until it is
explicitly woken, as it might have been placed in a padded case for
transport and any increase in power usage could result in over-heating
and damage. This state is often entered with help from BIOS firmware so,
to the OS, it is a bit like a single power switch which transitions from
"on" to "suspend".
At the other end of the spectrum is the way that "suspend" is used in
the Android mobile platform and similar devices. These devices are
expected to wake up spontaneously for various reasons, whether due to
an incoming phone call, a reminder alarm, or just a periodic check for
new email or software updates. Management of power and temperature is
generally better than in notebooks, so the risk of over-heating is not
present. There is normally little or no firmware and the
entire power-management transition is handled by the OS, so it is
responsible for turning off each individual device in the correct
order, and then restoring them again later.
Between these extremes of a light hibernation and a heavy snooze there
is room for other possibilities. A server might use a BIOS-based
suspend to save power after arranging for wake-ups via wake-on-LAN or
a realtime clock alarm. This can be seen as a deeper sleep than an Android
phone normally enters, but not as deep as the laptop in its padded
case. The "suspend" mode in Linux attempts to cater to all of these cases,
and that flexibility leads to some of the complexity.
The second perspective highlights the broad variety in components that
need to be managed. Some, like rotating disk drives, have a high cost
in power and time for turning off and on again, while others like an
LED have essentially no cost. Some, such as a UART, need to either be
off or sufficiently on to be able to accept full-rate data at any
moment. Others, such as USB, can enter intermediate states where they
can receive external signaling, but are free to take some time to fully wake up.
Other sources of variability include the level of independence from
other devices, the degree of involvement of user-space in management of
the device, and how power is routed - whether through the same bus that
carries commands and data, or through some separate regulator or "power
domain". These are just some of the ways that devices can vary and thus
some of the issues that Linux power management needs to be prepared for.
The final perspective highlights the different stages on the way
towards a low-power state, and on the way back to full functionality.
The key elements of the low-power transition are to move all relevant components
to a quiescent state, to record that state, then to stop powering some or
all of the components; similar elements apply on the way back up. The details of managing all the aforementioned
complexity through this simple transition mean that there are quite a
few stages, as we will shortly see.
Part of understanding the solution to managing this complexity is
understanding the balance that has been chosen between a "midlayer"
solution and a "library" solution. That is, how much responsibility
for correct behavior and sequencing is taken by the core code and
imposed on the drivers, and how much of the responsibility is left in
the hands of the drivers. Centralizing responsibility tends to be
safe but inflexible, while distributing it is risky but versatile.
Linux power management takes a middle road, so it is important to
understand where each responsibility lies.
The main imposition made by the PM core is the over-all sequencing of
suspend and resume. Allowing individual drivers to take a more active
role in this process would probably require a general dependency solver and
would undoubtedly make debugging a lot harder. In contrast, choices
that are local to a specific device, such as timeouts before power
management activates, or the use of a separate thread for performing
power management actions, are actively supported by the core without being
imposed on drivers that don't want them.
One other imposition, which will be raised again later, involves
interaction with interrupts. The PM core strongly encourages a specific
sequencing, but does provide hooks for a driver to escape it if necessary.
Understanding Linux power management also requires knowing how devices are
classified in Linux. The most obvious classification is shown by the
"subsystem" link that can be found in the sysfs entry
for the device. This points to either a "bus" or a "class" that the
device belongs to. This subsystem roughly describes the interface that
the device provides. Together with this can be a "device type" which
allows further specialization. A simple example is that members of the
class "block" - which are block devices such as disk drives - can be of
type "disk" or type "partition", reflecting the fact that both the whole
device and each individual partition are block devices, but that they do
have some specific behaviors that are quite different.
Finally, each device can have a "power domain" (or pm_domain).
This is an abstraction that is currently only used for ARM SoC modules;
it represents the fact that different collections of devices within
the SoC can be powered on or off together. Thus, the power domain may
need to know when each device changes power state so it can re-evaluate
or adjust the overall state of the domain.
These classifications are used to direct all the power management
calls that are described below. If a device has a power domain, it
gets to handle the call. If not, but the type, class, or bus declares any
PM operations, those operations get to handle the call; otherwise, the call is handled by
the driver for the device. The PM core doesn't attempt to call all of
the possible handlers for a particular device, just the first that is
found. This is an example of distribution of responsibility. The
first handler has the freedom to call more specific handlers, or to do
all the work itself, and equally has the responsibility to ensure all
required handlers are called.
For example, the power domain handler for the OMAP platform
calls the driver-specific
handler (bypassing any subsystem handlers) before or after doing any
OMAP-specific handling. The MMC bus handlers call into driver-specific
handlers which are stored in a non-standard location - presumably for
historical reasons.
With these perspectives and understandings in place, we can move on to
the details.
Runtime power management has the fewest states and so is probably the
best place to start digging into details. This is the part of Linux PM
that manages power for individual devices without taking the whole
system into a low-power state.
In this case the most interesting stage of the transition to lower
power is "move to quiescent state". Once that is done there is one
method call (runtime_suspend()) which combines "record current
state" and "remove power", and another (runtime_resume()) which
must restore power and reload any needed device state.
For runtime PM, the "move to quiescent state" transition is a cause, not
an effect - the new state isn't requested, it is simply noticed. The PM
core keeps track of the activity of each device using two counters and
an optional timer. One counter (usage_count) counts active
references to the device. These may be external references such as open
file handles, or other devices that are making use of this one, or they
may be internal references used to hold the device active for the
duration of some operation. The other counter (child_count)
counts the number of children that are active. The timer can be used to
add a delay between the counters reaching zero and the device being
considered to be idle. This is useful for devices with a high cost for
turning on or off.
This "autosuspend" timer is not widely used at present, with only nine
drivers calling pm_runtime_put_autosuspend() to start the
timer, while 14 call pm_runtime_set_autosuspend_delay() which
sets the timeout (though that can be set via sysfs). One user is the
omap_hsmmc driver for the High Speed Multi-Media Card
interface in OMAP processors. It sets a 100ms delay before declaring
a device to be truly idle, presumably due to costs in activating and
deactivating the cards.
The counter of active children can optionally be ignored when
determining whether a device is idle. Normally the parent is needed
to access the child - typically the parent is a bus sending commands
to the child - so powering down the parent while children are active
would be counterproductive. Sometimes it is useful though.
One good example is an I2C bus. I2C (inter-integrated circuit) is a
very simple 2-wire bus for signaling between integrated circuits on a
board. It doesn't carry power, only a clock signal and a
bidirectional data signal. The bus is entirely master-driven. Slaves
(which appear as children in the Linux device tree) cannot signal the
master directly at all; they simply respond to commands from the master.
As an I2C controller is very cheap to turn on
before a command is sent, and off after the response is received,
there is no need to keep it powered just because its child (which
could be a sensor that is monitoring the environment and may have a
higher turn-on cost) is left on. Consequently some I2C controllers,
such as i2c-nomadik and i2c-sh_mobile use
pm_suspend_ignore_children() to allow them to report as idle
even when they have active children.
When a device is deemed to be idle by the above criteria, its
runtime_idle() method is called. This function will normally perform
any further checks (as usb_runtime_idle() does) and possibly
call pm_runtime_suspend() to initiate the change in power state. For
a slight variation, lnw_gpio_runtime_idle() in the
gpio-langwell.c driver doesn't call pm_runtime_suspend()
directly, but rather calls pm_schedule_suspend() with a 500ms
delay. Presumably this design predates the introduction of the
autosuspend timer.
There is one class of devices that does not follow this structure for
power management, and that is the CPU. The general pattern of entering
a quiescent state, recording state information, and reducing power usage
is the same; however, the particular implementation is vastly different.
This is partly due to the uniquely central role that the CPU plays, and partly
due to the fact that a CPU typically has many more levels and styles of
power management. Runtime PM for the CPU is implemented using
the cpuidle, cpufreq, and CPU hotplug
subsystems, which will not be discussed further here; see this article for an introduction to cpuidle.
It can be helpful to view the "suspend" process as forcing all devices
into a quiescent state, and then simply allowing runtime power
management to put them all to sleep. The last to go to sleep would be
the CPU (or CPUs) under the guidance of "cpuidle". While this isn't
the way it is actually implemented, it provides a perspective which
exposes the relationship between suspend and runtime PM quite well.
There are several reasons for not implementing it this way. Possibly
the most unavoidable is that PM_RUNTIME and SUSPEND are separate
kernel config options and there is a desire to keep it that way, so
neither can rely on the other being present. There is also the fact
that a BIOS (such as ACPI) might be involved in one or the other and
may impose different handling requirements. Finally, individual
drivers might want to make different decisions based on what sort of
power management is happening, so it is generally best to tell them
what is actually happening, rather than pretending that one thing is a
form of another.
Forcing devices into a quiescent state has an important difference
from just allowing them to get there on their own - any interdependencies between
devices need to be explicitly handled. Linux PM has chosen to manage
this by having a clear sequence of steps for transitioning to low
power, and an explicit ordering of devices so that they make each step
in a well defined order.
The ordering (stored in dpm_list linked through
dev->power.entry) is normally the order in which devices
are registered, with new devices added to the end, thus being after any
devices that they depend on. However it is possible to reorder the
list using device_move() which gives a device a new parent, and
can place it directly after that parent, or at the end of the list.
For example, when an rfcomm tty-over-bluetooth device is
opened, a bluetooth connection is created and the tty device is
reparented under the relevant bluetooth device and placed at the end of
the device list.
The first stage of suspend, after some preliminaries like calling
"sync" to flush out dirty data and switching to a separate virtual
console, is to move all processes into a quiescent state. Devices
which interact closely with processes need a chance to have one last
chat before their process goes to sleep and this is achieved by
registering a "notifier" which gets called before processes are put to
sleep, and again when they are woken up.
This is variously used to:
- load firmware in case it will be needed during resume
- copy device state to swappable memory, which may be needed when
that state is enormous, as with video RAM
- avoid deadlock when interacting with sysfs
- preemptively remove devices that might be removed while the system
is suspended, so appropriate cleanup can happen
and a few other minor tasks.
Once these notifiers run, all processes are sent a special signal which
results in them being moved to the "freezer", where they are forced to
wait for system resume to happen.
Once all processes are quiescent, the next step is to instruct all
devices to also become quiescent. To do this we need to walk the
list in reverse order, putting children to sleep before their parents
- as the parent may be needed to help put the child to sleep. However
as a new child could be born at any moment (e.g. due to a device being
plugged in), and as children get added to the end of the list, we
might miss some children on the first pass. To avoid this, the PM
core makes two passes over the list. The first pass starts at the
beginning and simply asks devices to stop adding children by calling
their "prepare()" method. If children are born during this time they
can only be added after the current pointer in the list, and so will
not be missed. Once this is complete we know that no new devices will
be added, so the list is walked in the reverse order, calling the
suspend() methods.
The "suspend" method is actually three separate methods, suspend(),
suspend_late(), and suspend_noirq(), which can share among themselves
the three tasks of making the device quiescent, saving any state, and
reducing power usage. How much of which task is allocated to which
method is largely up to each driver, provided that the division works
with the calling patterns of the three methods.
Calls to these methods are made to all devices in child-before-parent
order and the sets of calls are interleaved with system-wide suspend
operations, made largely through the suspend_ops dispatch
table. The ordering is roughly:
- system wide begin()
- per-device suspend()
- system wide prepare()
- per-device suspend_late()
- system wide disable (almost) all interrupt handlers
- per-device suspend_noirq()
- system wide prepare_late()
- disable nonboot CPUs
- system wide enter()
Note that it is possible for the sequence from system wide
prepare() onwards to be repeated (after being unwound by
corresponding "resume" actions) without going all the way up to fully
awake and starting the sequence from the top. This happens if
the suspend_again() suspend operation requests it. Currently
this is only requested by the charger manager, which often needs to
wake up parts of the system to check battery charging state, without
wanting the cost of a full wakeup.
Deducing the purpose of these method calls by looking for example
usage in the code is problematic for a number of reasons.
For the system-wide operations (begin(), prepare(),
prepare_late()), there are few users and those that exist
do not make their purpose clear to an untrained observer. The
most complete user is ACPI, so possibly a full understanding of
that specification would help. Unfortunately that is beyond the
scope of this article (and of this author).
In general, ACPI recommends specific procedures for entering and
leaving system sleep states (such as suspend) and Linux PM was
modeled on that and then adjusted to meet broader needs. For
example, prepare_late() was introduced
to resolve a conflict between the needs of ACPI and the needs of
the ARM platform.
- For the per-device operations, suspend_late() was only
recently added (commit cf579dfb82550e3) and there are no users
in Linux-3.4. So any examples we find may be working around the
absence of suspend_late() and so should not be copied.
- The initial reason for producing this document was finding
code in drivers that simply wasn't working correctly and trying to
understand what "correctly" might mean. Those drivers clearly
cannot be used as good examples, and there is evidence
that other drivers aren't always doing
the right thing, so any example may be equally faulty.
Examining the documentation brings a little more useful information.
- suspend() should leave the device in a quiescent state.
- suspend_late() can often be the same routine
as runtime_suspend().
- suspend_noirq() happens after interrupts are disabled and
is useful when you must be certain that the
interrupt handler will not be called after suspend_noirq() runs.
Some interrupts, such as the
timer interrupt, are not disabled.
One observation from the code that seems to be important before we try
to paint the big picture is that, after calling the suspend()
method on a device, runtime power management is disabled for that device.
The purpose of this seems to be to stop runtime PM from racing with
system suspend PM - we really don't want two threads
trying to power off a device at once, and this is the interlock that
prevents that. It also prevents runtime PM from powering the device
back on again, so any device that might be needed in the late states
of power management needs to be left on when runtime PM is disabled.
Tying all these threads together, we arrive at the following:
The suspend() method should cause the device to stop doing
anything, and enter a state much like it would be just before runtime
PM might decide to turn it off. So it should wait for any DMA
requests to complete and ensure new ones won't start. It should
stop transmitting information and ensure that incoming information
is either ignored, or triggers a wake-from-suspend (possibly
marking the interrupt for wakeups). It should cancel any timers
and generally prepare for nothing to happen for a while.
If the device might be needed to power down other devices, such as
an I2C controller that might be needed to tell some regulator to turn off,
then the device should be activated for runtime PM purposes so that
it will still be active when runtime PM is disabled.
Part of the task of ignoring incoming information is to ensure that no
new children will be created, much as the
prepare() method does. Having new devices appear after
suspend() would be awkward.
The suspend_late() method should power off the device in
much the same way that runtime_suspend() does, and it may be
exactly the same routine as runtime_suspend().
Occasionally preparing the device to wake up may differ between the
system suspend and runtime PM cases. This would be one situation where
suspend_late() might need to be different from runtime_suspend().
The only case where suspend_late() should not be used is
where interrupts might still be delivered, but the interrupt
handler cannot tolerate the device being off. In many cases
the suspend() routine will have put the device in a state in
which it will not generate interrupts. Likely exceptions to this
are when the interrupt line is shared, or when the device supports
wake-from-suspend and so deliberately does not disable interrupts.
If the platform that the device runs on uses BIOS support to enter
suspend, then it is possible that this support will power off the
device, so suspend_late() does not need to bother. If it
doesn't, it could still be that the device gets powered off by
instructing the BIOS to effect the state change, and it may require
different power-off procedures for runtime PM and for entering
suspend. If this is the case, then suspend_late() will
quite likely be very different from runtime_suspend().
The suspend_noirq() method is an alternative to
suspend_late() but is run with interrupts disabled. It is
unlikely that any driver will provide both methods.
Having interrupts disabled means not only that an interrupt will not
occur at an awkward time, but also that using any functionality
that requires interrupts will not work. So if the driver uses an
I2C bus or similar to tell the device to turn off, and if the I2C bus uses
interrupts to indicate completion (which is normal), then either
the device must be powered off in suspend_late(), or the I2C
interrupt must be marked IRQF_NO_SUSPEND.
Paired with each of these methods is a method that is called when
returning back towards full-functionality: resume_noirq(),
resume_early() and resume(). These simply do the
reverse of what the corresponding "suspend" function did.
Structure, purpose, and examples - these seem to be the elements that
distinguish good documentation and enable the reader not just to
collect knowledge but to gain understanding. I'll leave you, dear
reader, to be the judge of whether their presence here is sufficient
to bring an understanding of power management, or indeed an
understanding of quality documentation.
I would like to thank Rafael Wysocki for valuable review of an early draft
of this article.
Page editor: Jonathan Corbet