Adams: Linux's bedtime routine
How does Linux move from an awake machine to a hibernating one? How does it then manage to restore all state? These questions led me to read way too much C in trying to figure out how this particular hardware/software boundary is navigated.
Posted Sep 9, 2024 20:31 UTC (Mon)
by nyanpasu64 (guest, #135579)
[Link]
The specific crash bug I encountered was because Linux calls pm_restrict_gfp_mask() to prevent swapping to disk, *before* dpm_prepare() (the first opportunity for a GPU driver to backup VRAM to system RAM before the PCIe GPU is shut down and VRAM is lost). So if you don't have enough free system RAM to hold all VRAM, the sleep is aborted midway through (waking the system) or produces a failed memory allocation later during sleep or resume (often resulting in undefined system state, halted network or USB controllers, or worst yet a halted NVMe controller resulting in the system running around like a headless chicken unable to load data from disk or *even log data to the journal*). I'm wondering if this was a deliberate decision or an unforeseen interaction between suspend-time GFP masks and GPU drivers.
It seems Nvidia can't reliably backup VRAM either without being informed by systemd prior to the kernel initiating suspend (https://download.nvidia.com/XFree86/Linux-x86_64/560.35.0...).
Userspace snapshot API; suspend with dGPUs causes OOM