November 24, 2010
This article was contributed by Rafael J. Wysocki
If you have been following Linux kernel development over the past few
months, it has been hard to overlook the massive thread on the Linux
Kernel Mailing List (LKML) resulting from an attempt to merge the Google
Android's suspend blockers framework into the main kernel tree. Arguably,
the presentation of the patches might have been better and the explanation
of the problems they addressed
might have been more straightforward [PDF], but in the end it appears that
merging them wouldn't be the smartest thing from the technical point of
view. Unfortunately, though, it is difficult to explain that without
diving into the technical issues behind the suspend blockers patchset, so I
wrote a paper, Technical Background of the Android Suspend
Blockers Controversy [PDF], discussing them in a detailed way, which is
summarized in this article.
Suspend blockers, or wakelocks in the original Android
terminology, are a part of a specific approach to power management, which
is based on aggressive utilization of full system suspend to save as much
energy as reasonably possible. In this approach the natural state of the
system is a sleep
state [PDF], in which energy is only used for refreshing memory and providing
power to a few devices that can generate wakeup signals. The working
state, in which the CPUs are executing instructions and the system is
generally doing some useful work, is only entered in response to a wakeup
signal from one of the selected devices. The system stays in that state
only as long as necessary to do certain work requested by the user. When
the work has been completed, the system automatically goes back to the
sleep state.
This approach can be referred to as opportunistic suspend to
emphasize the fact that it causes the system to suspend every time there is
an opportunity to do so. To implement it effectively one has to address a
number of issues, including possible race conditions between system suspend
and wakeup events (i.e. events that cause the system to wake up from sleep
states). Namely, one of the first things done during system suspend is to
freeze user space processes (except for the suspend process itself) and
after that's been completed user space cannot react to any events signaled
by the kernel. In consequence, if a wakeup event occurs exactly at the
time the suspend process is started, user space may be frozen before it
will have a chance to consume the event, which will be delivered to it only
after the system is woken up from the sleep state as a result of
another wakeup event. Unfortunately, on a cell phone the
"deferred" wakeup event may be a very important incoming call, so
the above scenario is hardly acceptable for this type of device.
Wakelocks
On Android this issue has been addressed with the help of wakelocks.
Essentially, a wakelock is an object that can be in one of two states,
active or inactive, and the system cannot be suspended if at least one
wakelock is active. Thus, if the kernel subsystem handling a wakeup
event activates a wakelock right after the event has been signaled and
deactivates it after the event has been passed to user space, the race
condition described in the previous paragraph can be avoided. Moreover, on
Android, the suspend process is started from kernel space whenever there are
no active wakelocks, which addresses the problem of deciding when to
suspend, and user space is allowed to manipulate wakelocks. Unfortunately, that requires every
user space process doing important work to use wakelocks, which creates
unusual and cumbersome issues for application developers to
deal with.
Of course, processes using wakelocks can impact the system's battery
life quite significantly, so the ability to use them has to be regarded as
a privilege that should not be given unwittingly to all applications.
Unfortunately, however, there is no general principle the system designer
can rely on to figure out what applications will be important enough to the
system user to allow them to use wakelocks by default. Therefore,
ultimately the decision is left to the user which, naturally, is only going
to really work if the user is qualified enough to make the decision.
Moreover, if the user is expected to make such a decision, they
should be informed exactly of the possible consequences of it.
The user also should be able to disallow chosen applications the use of
wakelocks at any time. On Android, though, at least up to and including
version 2.2, that simply doesn't happen.
Apart from this, some advertised features of applications don't really
work on Android because of its use of opportunistic suspend. Namely, some
applications are supposed to periodically check things on remote Internet
servers. For this purpose they need to run when there's the time to make
their checks, but they obviously aren't running when the system is in a
sleep state. Thus the periodic checks the applications are supposed to make
aren't really made at that time. In fact, they are only made when the
system is in the working state incidentally for another reason, and there
happens to be the time to make them. This most likely is not what the
users of the affected applications would have expected.
Timekeeping issues
There is one more problem with full system suspend that is related to
time measurements, although it is not limited to the opportunistic suspend
initiated from kernel space. Namely, every suspend-resume cycle,
regardless of the way it is initiated, introduces inaccuracies into the
kernel's timekeeping subsystem. Usually, when the system goes into a sleep
state, the hardware that the kernel's timekeeping subsystem relies on is powered
off, so it has to be reinitialized during a subsequent system resume. Then,
among other things, the global kernel variables representing the current
time need to be readjusted to keep track of the time spent in the sleep
state. This involves reading the current time value from a persistent
clock which typically is much less accurate than the clock sources used by
the kernel in the system's working state. So that introduces a random shift
of the kernel's representation of current time, depending on the resolution
of the persistent clock, during every suspend-resume cycle. Moreover,
kernel timers used for scheduling the future execution of work inside of
the kernel also are affected by this issue in a similar way. In
consequence, the timing of some events in a suspending and resuming system
is different from their analogous timing without a suspend-resume
cycle.
If system suspend is initiated by user space, the kernel may assume that
user space is ready for it and is somehow prepared to cope with the
consequences. For example, it may want to use settimeofday() to
set the kernel's monotonic clock using a time value taken from an NTP
server right after the subsequent system resume. On the other hand, if
system suspend is started by the kernel in an opportunistic fashion, user
space doesn't really have a chance to do anything like that.
For this reason, one may think that it's better not to suspend the
system at all and use the cpuidle framework for the entire system
power management. This approach appears to allow some systems to be put
into a low-power state
resembling a sleep state. However, it may not guarantee that the
system will be put into that state sufficiently often because of
applications using busy loops to excess and kernel timers. PM quality
of service (QoS)
requests [PDF] may also prevent cpuidle from using deep low-power
state of the CPUs. Moreover, while only a few selected devices are enabled
to signal wakeup during system suspend, the runtime power management
routines that may be used by cpuidle for suspending I/O devices
tend to enable all of them to signal wakeup. Thus the system wakes up from
low-power states entered as a result of cpuidle transitions
relatively more often than from "real" sleep states, so its
ability to save energy is limited. This basically means that
cpuidle-based system power management may not be sufficient to
save as much energy as opportunistic suspend on the same system.
The alternative implementation
Even if opportunistic suspend is not going to be used on a
given system, it generally makes sense to suspend the system sometimes, for
example when its user knows in advance that it will not need to be in the
working state in the near future. However, the problem of possible races
between the suspend process and wakeup events, addressed on Android with
the help of the wakelocks framework, affects all forms of system suspend,
not only the opportunistic one. Thus this problem should be addressed in
general and it is not really convenient to simply use the Android's
wakelocks for this purpose, because that would require all of user space to be
modified to use wakelocks. While that may be good for Android,
whose user space already is designed this way at least to some extent, it
wouldn't be very practical for other Linux-based systems, whose user space
is not aware of the wakelocks interface. This observation led to the kernel patch that introduced the
wakeup events framework, which was shipped in the 2.6.36 kernel.
This patch introduced a running counter of signaled wakeup events,
event_count, and a counter of wakeup events whose data is being
processed by the kernel at the moment, events_in_progress. Two
interfaces have been added to allow kernel subsystems to modify these
counters in a consistent way. pm_stay_awake() is meant to keep
the system from suspending, while pm_wakeup_event() ensures that
the system stays awake during the processing of a wakeup event.
In order to do that, pm_stay_awake()
increments events_in_progress and the complementary function
pm_relax() decrements it and increments event_count at
the same time. pm_wakeup_event() increments
events_in_progress and sets up a timer to decrement it and
increment event_count in the future.
The current value of event_count can be read from the new
sysfs file /sys/power/wakeup_count. In turn, writing to
it causes the current value of event_count to be stored in the
auxiliary variable saved_count, so that it can be compared with
event_count in the future. However, the write operation will only
succeed if the written number is already equal to event_count. If
that happens, another auxiliary variable events_check_enabled is
set, which tells the PM core to check whether event_count has
changed or events_in_progress is different from zero while
suspending the system.
This relatively simple mechanism allows the PM core to react to wakeup
events signaled during system suspend if it is asked to do so by user space
and if the kernel subsystems detecting wakeup events use either
pm_stay_awake() or pm_wakeup_event(). Still, its support
for collecting device statistics related to wakeup events is not comparable
to the one provided by the wakelocks framework. Moreover, it assumes that
wakeup events will always be associated with devices, or at least with
entities represented by device objects, which need not be the case in all
situations. The need to address these shortcomings led to a kernel patch introducing wakeup
source objects and adding some flexibility to the existing
framework.
Most importantly, the new patch introduces objects of type struct
wakeup_source to represent entities that can generate wakeup events.
Those objects are created automatically for devices enabled to signal
wakeup and are used internally by pm_wakeup_event(),
pm_stay_awake(), and pm_relax(). Although the
highest-level interfaces are still designed to report wakeup events
relative to devices, which is particularly convenient to device drivers and
subsystems that generally deal with device objects, the new framework makes
it possible to use wakeup source objects directly.
A
"standalone" wakeup source object is created by
wakeup_source_create() and added to the kernel's list of wakeup
sources by wakeup_source_add(). Afterward one can use three new
interfaces, __pm_wakeup_event(), __pm_stay_awake() and
__pm_relax(), to manipulate it and, when it is not necessary any
more, it may be removed from the global list of wakeup sources by calling
wakeup_source_remove(). It can then be deleted with the help of
wakeup_source_destroy(). Thus reported wakeup events need not be
associated with device objects any more. Also, at the kernel level, wakeup
source objects may be used to replace Android's wakelocks on a one-for-one
basis because the above interfaces are completely analogous to the ones
introduced by the wakelocks framework.
The infrastructure described above ought to make it easier to port
device drivers from Android to the mainline kernel. It hasn't been
designed with opportunistic suspend in mind, but in theory it may be used
for implementing a very similar power management technique. Namely, in
principle, all wakelocks in the Android kernel can be replaced with wakeup
source objects. Then, if the /sys/power/wakeup_count interface is
used correctly, the resulting kernel will be able to abort suspend in
progress in reaction to wakeup events in the same circumstances in which
the original Android kernel would do that. Yet, user space cannot access
wakeup source objects, so the part of the wakelocks framework allowing user
space to manipulate them has to be replaced with a different mechanics
implemented entirely in user space, involving a power manager process and a
suitable IPC interface for the processes that would use wakelocks on
Android.
The IPC interface in question may be implemented using three components,
a shared memory location containing a counter variable referred to as the
"suspend counter" in what follows, a mutex, and a conditional variable
associated with that mutex. Then, a process wanting to prevent the
system from suspending will acquire the mutex, increment the suspend
counter, and release the mutex. In turn, a process wanting to permit the
system to suspend will acquire the mutex and decrement the suspend counter.
If the suspend counter happens to be equal to zero at that point, the
processes waiting on the conditional variable will be unblocked. The mutex
will be released afterward.
With the above IPC interface in place the power manager process can
perform the following steps in a loop:
- Read from /sys/power/wakeup_count (this will block until the
events_in_progress kernel variable is equal to zero).
- Acquire the mutex.
- Check if the suspend counter is equal to zero. If that's not the case, block
on the conditional variable (that releases the mutex automatically) and go to step 2
when unblocked.
- Release the mutex.
- Write the value read from /sys/power/wakeup_count in step 1 back to
this file. If the write fails, go to step 1.
- Start suspend or hibernation and go to step 1 when it returns.
Of course, this design will cause the system to be suspended very
aggressively. Although it is not entirely equivalent to the Android's
opportunistic suspend, it appears to be close enough to yield the same
level of energy savings. However, it also suffers from a number of
problems affecting the Android's approach. Some of them may be addressed
by adding complexity to the power manager and the IPC interface between it
and the processes permitted to block and unblock suspend, but the others
are not really avoidable. Thus it may be better to use system suspend less
aggressively, but in combination with some other techniques described
above.
Overall, while the idea of suspending the system extremely aggressively
may be controversial, it doesn't seem reasonable to entirely dismiss
automatic suspending of it as a valid power management measure. Many
different operating systems do that and they achieve
good battery life [PDF] with the help of it. There don't seem to be
any valid reasons why Linux-based systems shouldn't do that, especially if
they are battery-powered. As far as desktop and similar (e.g. laptop or
netbook) systems are concerned, it makes sense to configure them to suspend
automatically in specific situations so long as system suspend is known to
work reliably on the given configuration of hardware. The new interfaces
and ideas presented above may be used to this end.
(
Log in to post comments)