The past two weeks have seen a
very long email thread about the future of suspend in Linux. No, not
that other type of suspend; this one is
all about what users really want: a working suspend to RAM.
It all started out with a few simple patches from Linus that implemented a
framework for debugging problems during suspend, but it quickly
spiraled out of control into rants about how badly the kernel handles
suspend issues today:
> I think you are trying to change a model that is not broken...
Bzzt. Thank you for playing.
The fact is, this thing has been broken for years. At some point,
we have to just accept the fact that it's not just "drivers".
There's something else that is broken, and I bet it's the model.
The thread then turned to how wrong everyone has been over the years about
how suspend should really work:
See? WE DO NOT DO THIS. I told people we needed to do this _years_
ago. I tried to push through the two-phase suspend. I tried to
explain why. I clearly failed, because we do _nothing_of_the_sort_
right now.
Instead, the "please suspend" thing to the devices is a
single-phase "put yourself into D3", with no support for a
separate "please save your state" call. Crap.
After arguing this last point over and over for many emails, Linus did
what anyone should do who wants to prove that their point is correct: he
wrote up a working patch that implements his proposed changes.
To fully understand the problem, let us look at the interface that the
kernel provides drivers today to handle suspend. When the kernel wants to
shut devices down (for some kind of suspend action), the whole device tree
is walked, and the suspend callback is called for each device.
For PCI devices, this callback looks like:
int (*suspend) (struct pci_dev *dev, pm_message_t state);
The pointer to the PCI device that is about to be suspended is passed to
the driver, along with the state that the kernel wants to go into. Within
this single function, the driver is responsible for doing all suspend
tasks needed for the device.
One problem with this is that if a device cannot be suspended at that
point in time, its driver has to go to great lengths to let the core
know that it should be called back again (it does this by returning
-EAGAIN to the core and hoping that it will be called back).
A bigger issue is that the driver is responsible for shutting the
device down entirely in this one function. This prevents the kernel from
easily doing things like system snapshots, and it leaves no good answer
for what to do if the driver simply does not have enough memory available
to save the device state before suspending.
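To make the problem concrete, here is a minimal userspace sketch of a single-phase callback with the same shape as the PCI prototype above. The struct layouts, the demo_suspend name, and the fail_alloc knob are all hypothetical stand-ins invented for illustration; they are not the kernel's real definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the kernel types (assumption). */
typedef struct { int event; } pm_message_t;

struct pci_dev {
    int powered;        /* 1 while the device is up and running */
    size_t state_size;  /* bytes of device state to preserve */
    void *saved_state;  /* buffer filled in by the suspend callback */
};

/* Toy knob so the out-of-memory path can be exercised (assumption). */
int fail_alloc;

void *state_alloc(size_t n)
{
    return fail_alloc ? NULL : malloc(n);
}

/* Single-phase suspend: preparation and power-down are fused in one call. */
int demo_suspend(struct pci_dev *dev, pm_message_t state)
{
    (void)state;            /* target state is ignored in this sketch */
    assert(dev->powered);   /* only a running device can be suspended */

    /* The step that can fail happens inside the same call that is
     * supposed to finish the job, so there is no earlier point at
     * which the failure could have been reported. */
    dev->saved_state = state_alloc(dev->state_size);
    if (!dev->saved_state)
        return -ENOMEM;

    /* The step that is hard to undo: the device goes to low power. */
    dev->powered = 0;
    return 0;
}
```

In this sketch the allocation failure is only discovered in the middle of the suspend itself, which is the situation the text describes.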
Another big issue is that the "class" cores should be handling most of
the suspend process, instead of the individual drivers. For example, the
network core should be shutting down the transmit queues and making stuff
go quiet for the drivers, so that they do not need to individually do this
in each and every driver. This last point is the biggest change in
Linus's model, and (in this author's opinion) the most important issue.
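As a rough illustration of that division of labor, the sketch below has a class-level function stop the transmit queue once and then hand off to the driver for the hardware-specific part. Every name here (toy_netdev, toy_net_ops, net_class_suspend, demo_nic_suspend) is hypothetical and exists only for this example; none of it is taken from the patch or from the real network core.

```c
#include <assert.h>
#include <stddef.h>

struct toy_netdev;

/* Hypothetical per-driver operations table (assumption). */
struct toy_net_ops {
    int (*suspend)(struct toy_netdev *dev);
};

/* Hypothetical, heavily simplified network device (assumption). */
struct toy_netdev {
    int tx_queue_running;
    const struct toy_net_ops *ops;
};

/* Class-level suspend: the common work is done once, here, for every
 * network device, before the individual driver is called. */
int net_class_suspend(struct toy_netdev *dev)
{
    dev->tx_queue_running = 0;          /* quiesce the transmit queue */
    if (dev->ops && dev->ops->suspend)
        return dev->ops->suspend(dev);  /* hardware-specific part */
    return 0;
}

/* Driver-level suspend: may rely on the queue already being stopped. */
int demo_nic_suspend(struct toy_netdev *dev)
{
    assert(!dev->tx_queue_running);
    return 0;
}

const struct toy_net_ops demo_nic_ops = { demo_nic_suspend };
```

The point of the design is that the queue handling lives in one place, so a driver written against this sketch never needs its own copy of it.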
So, Linus's patch changes the suspend process into a series of distinct steps:
- All devices start out on a list called dpm_active and are, as
indicated, "active" and up and running.
- A new callback is called for every device in the global device tree. This
callback is called suspend_prepare and has the same arguments
that the current suspend callback has for each individual bus
type. In this function, the devices are not allowed to disconnect
themselves from the kernel (like USB devices disconnecting themselves to
shut down), and the drivers for these devices need to do everything
necessary to be ready to suspend the device some time in the future. This
usually entails allocating any needed memory to save the device state, or
other kinds of housekeeping. Anything that might possibly fail should be
done here, and if something bad happens, the error should be reported.
Drivers can call functions that might sleep here, as interrupts are not
disabled.
- The kernel then iterates over the dpm_active list and moves each
device to the dpm_off list, calling the suspend callback for the
device's subsystem (which is new), followed by the bus suspend
callback.
- Interrupts are now disabled in the system.
- Then the kernel iterates over all of the devices on the dpm_off
list and moves them to the dpm_off_irq list, while calling a new
callback called suspend_late.
- After this is complete, the system can be suspended by putting the
CPU into whatever sleep level is desired.
To resume the system, the kernel walks the device lists in the reverse
order and does the following steps:
- The kernel iterates over the dpm_off_irq list and moves the
devices to the dpm_off list while calling a new callback called
resume_early.
- Interrupts are enabled.
- The kernel iterates over all of the devices on the dpm_off list
and moves them to the dpm_active list, while calling the
resume callback (first the bus-specific resume function,
followed by the class-specific resume).
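The suspend and resume sequences above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: list membership is tracked with an enum instead of real linked lists, the labels class_suspend, bus_suspend, bus_resume and class_resume are names made up here for the subsystem and bus callbacks, and the array is simply walked in index order because the ordering between individual devices is not something the description above specifies.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Which of the three lists a device is currently on. */
enum dpm_list { DPM_ACTIVE, DPM_OFF, DPM_OFF_IRQ };

#define NDEV 2

struct device {
    const char *name;
    enum dpm_list list;
};

struct device devs[NDEV] = { { "net0", DPM_ACTIVE }, { "disk0", DPM_ACTIVE } };
int irqs_enabled = 1;
char trace[512];    /* records the order in which callbacks run */

/* Stand-in for invoking a callback: just append it to the trace. */
void call(const char *cb, struct device *d)
{
    char buf[64];

    snprintf(buf, sizeof buf, "%s(%s) ", cb, d->name);
    strncat(trace, buf, sizeof trace - strlen(trace) - 1);
}

void system_suspend(void)
{
    int i;

    for (i = 0; i < NDEV; i++)      /* may sleep, may report errors */
        call("suspend_prepare", &devs[i]);

    for (i = 0; i < NDEV; i++) {    /* dpm_active -> dpm_off */
        call("class_suspend", &devs[i]);
        call("bus_suspend", &devs[i]);
        devs[i].list = DPM_OFF;
    }

    irqs_enabled = 0;               /* interrupts are now disabled */

    for (i = 0; i < NDEV; i++) {    /* dpm_off -> dpm_off_irq */
        assert(!irqs_enabled);
        call("suspend_late", &devs[i]);
        devs[i].list = DPM_OFF_IRQ;
    }
}

void system_resume(void)
{
    int i;

    for (i = 0; i < NDEV; i++) {    /* dpm_off_irq -> dpm_off */
        call("resume_early", &devs[i]);
        devs[i].list = DPM_OFF;
    }

    irqs_enabled = 1;               /* interrupts are enabled again */

    for (i = 0; i < NDEV; i++) {    /* dpm_off -> dpm_active */
        call("bus_resume", &devs[i]);
        call("class_resume", &devs[i]);
        devs[i].list = DPM_ACTIVE;
    }
}
```

Calling system_suspend() followed by system_resume() and then printing trace shows each phase completing for every device before the next phase starts, with the interrupt state changing only between phases.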
This new scheme allows the kernel to properly handle error conditions if
anything bad happens during the suspend process. For
example, if an error occurs during the suspend_late phase,
then only the devices on the dpm_off_irq list will be called with
the resume_early callback, so that the system is resumed in the
correct order and recovers from the error cleanly.
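A minimal sketch of that unwinding, assuming the same three lists are tracked with an enum: the late_fails field and the function names are hypothetical, and the real patch may structure this differently.

```c
#include <assert.h>
#include <errno.h>

enum dpm_list { DPM_ACTIVE, DPM_OFF, DPM_OFF_IRQ };

struct device {
    enum dpm_list list;
    int late_fails;           /* hypothetical knob: force a failure */
    int resume_early_calls;   /* counts resume_early invocations */
};

/* Stand-in for a driver's suspend_late callback. */
int suspend_late(struct device *d)
{
    return d->late_fails ? -EIO : 0;
}

/* Stand-in for a driver's resume_early callback. */
void resume_early(struct device *d)
{
    d->resume_early_calls++;
}

/* Run the late phase over devices that are all on dpm_off. On failure,
 * undo only the devices that had already reached dpm_off_irq. */
int suspend_late_all(struct device *devs, int n)
{
    int i, err = 0;

    for (i = 0; i < n; i++) {
        assert(devs[i].list == DPM_OFF);
        err = suspend_late(&devs[i]);
        if (err)
            break;                  /* this device stays on dpm_off */
        devs[i].list = DPM_OFF_IRQ;
    }
    if (!err)
        return 0;

    for (i = 0; i < n; i++) {
        if (devs[i].list != DPM_OFF_IRQ)
            continue;               /* never suspended late: skip it */
        resume_early(&devs[i]);
        devs[i].list = DPM_OFF;
    }
    return err;
}
```

With three devices where the second one fails, only the first receives resume_early, and all three end up back on dpm_off, ready for the ordinary resume path.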
Linus's patch is small, under 400 lines, and it generated some
good feedback from other kernel developers, who seem to be coming around to
this new scheme. The patch has not shown up in any public kernel trees
yet, but hopefully Linux will soon be able to handle suspend issues in a
much more robust and correct manner.