
A parallel path for GPU restore in CRIU

June 17, 2025

This article was contributed by Dong Du and Yanning Yang

The fundamental concept of checkpoint/restore is elegant: capture a process's state and resurrect it later, perhaps elsewhere. Checkpointing meticulously records a process's memory, open files, CPU state, and more into a snapshot. Restoration then reconstructs the process from this state. This established technique faces new challenges with GPU-accelerated applications, where low-latency restoration is crucial for fault tolerance, live migration, and fast startups. Recently, the restore process for AMD GPUs has been redesigned to eliminate substantial bottlenecks.

The challenges of GPU checkpoint/restore

Restoring GPU applications is complex. GPU internal states vary significantly between vendors and are often opaque to general-purpose tools like Checkpoint/Restore in Userspace (CRIU). That project is widely used for checkpointing and restoration; it bridges the differences between vendors with a flexible plugin mechanism. These plugins, which are dynamically loadable shared libraries, often vendor-supplied, provide specialized logic for GPU states. CRIU invokes plugin functions at specific junctures, known as hooks. Both AMD (for its AMDGPU driver stack) and NVIDIA have developed and contributed CRIU plugins.

The AMDGPU plugin, for example, uses the DUMP_EXT_FILE hook during checkpointing to save the state of the AMD driver and GPU video-RAM (VRAM) content. Symmetrically, the RESTORE_EXT_FILE hook (prior to the work discussed here) handled the preparation of the driver state and VRAM repopulation during restore.
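For illustration, the skeleton of such a plugin might look like the following. This is a minimal sketch based on the hook-registration macros in CRIU's criu-plugin.h, with all of the actual dump and restore logic omitted:

    /* Minimal sketch of a CRIU GPU plugin; a real plugin adds error
     * handling and the device-specific dump/restore logic. */
    #include "criu-plugin.h"    /* from the CRIU source tree */

    static int my_gpu_plugin_init(int stage)
    {
            return 0;       /* open device files, set up sockets, ... */
    }

    static void my_gpu_plugin_fini(int stage, int ret)
    {
            /* tear down whatever init created */
    }
    CR_PLUGIN_REGISTER("my_gpu_plugin", my_gpu_plugin_init, my_gpu_plugin_fini)

    /* Called at checkpoint time for each external file (such as a GPU
     * device file): save driver state and VRAM contents. */
    static int my_gpu_plugin_dump_file(int fd, int id)
    {
            return 0;
    }
    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__DUMP_EXT_FILE, my_gpu_plugin_dump_file)

    /* Called at restore time: recreate the device state saved under id
     * and return a file descriptor for it. */
    static int my_gpu_plugin_restore_file(int id)
    {
            return -1;
    }
    CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESTORE_EXT_FILE, my_gpu_plugin_restore_file)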

Understanding the original GPU restore bottleneck

The traditional CRIU restore process for a GPU application, particularly with the AMDGPU plugin, involves several sequential steps that can lead to significant delays. Initially, the main CRIU process forks a child, destined to become the target application process. This child, called the restore process, then undertakes most restoration tasks.

[A diagram of the original CRIU restore process]

This restore process first reestablishes essential states, including file descriptors, GPU-driver state (often via ioctl()), and, critically, GPU-memory content, typically using system DMA for VRAM transfers. Following this, CRIU tackles the host memory. This intricate step involves unmapping all existing memory regions within the restore process and then mapping the new memory segments from the snapshot. To execute this without disrupting itself, the restore process jumps to a "restorer blob", a small, self-contained piece of code in a safe memory region. Finally, with the core memory layout in place, the main CRIU process restores any remaining state dependent on these new mappings.

The performance bottleneck lies in the sequential execution of the most time-consuming operations: restoring GPU content and then restoring host memory. Both are handled by the single-threaded restore process. While GPU content is being restored, all other logic in the child process is blocked. Simply offloading GPU restore to a background thread within the restore process is problematic, however. When CRIU needs to restore host memory, it unmaps all old mappings, including libraries (like libdrm or libc) potentially still used by the GPU-restoration thread, leading to conflicts. This limits any parallelization benefits significantly.

[A diagram of a broken version of the CRIU restore process]

Enabling parallel GPU restore

To decouple GPU-content restoration from the host-memory setup, a recently merged CRIU patch introduces a new POST_FORKING plugin hook. The patch, contributed by Yanning Yang, Dong Du (the authors of this article), Yubin Xia, and Haibo Chen from Shanghai Jiao Tong University, follows the GPU Checkpoint/Restore made On-demand and Parallel (gCROP) idea.

This POST_FORKING hook is invoked by the main CRIU process. It's called after the restore process is forked but before the main CRIU process waits for the restore child to complete its tasks. This timing is key.

The new hook allows GPU-content restore to be offloaded from the child restore process to the main CRIU process. This frees the restore child to proceed with other tasks, notably host-memory restoration, in parallel with its parent handling the GPU-VRAM repopulation.

A pivotal challenge is granting the main CRIU process access to GPU memory regions originally managed by the target process. AMD GPUs, via the AMDGPU driver, manage memory using "buffer objects". These buffer objects, representing contiguous memory areas in VRAM, can be exported as dma-buf file descriptors. Dma-buf is a kernel framework for sharing buffers between device drivers and processes.

Once a buffer object is exported as a dma-buf, its file descriptor can be passed to another process via a Unix domain socket. The receiving process imports the file descriptor to access the memory. This mechanism underpins the new parallel restore strategy: the restore process transfers dma-buf file descriptors, along with restore commands, to the main CRIU process.
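As a concrete illustration, passing a descriptor with SCM_RIGHTS looks roughly like this; it is a generic sketch of the standard kernel mechanism, not the plugin's actual code:

    #include <string.h>
    #include <sys/socket.h>

    /* Send one dma-buf fd to the peer over a connected Unix socket;
     * returns sendmsg()'s result. */
    static int send_dmabuf_fd(int sock, int dmabuf_fd)
    {
            char dummy = 'x';
            struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
            union {         /* ensure alignment of the control message */
                    char buf[CMSG_SPACE(sizeof(int))];
                    struct cmsghdr align;
            } u;
            struct msghdr msg = { 0 };
            struct cmsghdr *cmsg;

            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_control = u.buf;
            msg.msg_controllen = sizeof(u.buf);

            cmsg = CMSG_FIRSTHDR(&msg);
            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;   /* the kernel installs a copy
                                             * of the fd in the receiver */
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &dmabuf_fd, sizeof(int));

            return sendmsg(sock, &msg, 0);
    }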

[A diagram of the new CRIU restore process]

Adapting the AMDGPU plugin for parallel restore

Implementing this delegation required changes to key AMDGPU plugin hook functions. The amdgpu_plugin_restore_file() function, called by the restore process, previously handled all driver-state and GPU-content transfer. Now, its role is to identify the necessary buffer objects, export them as dma-buf file descriptors, and send those descriptors, along with restoration metadata, to the main CRIU process. It no longer performs the data transfers itself.

The new amdgpu_plugin_post_forking() function, tied to the new hook, executes in the main CRIU process. It typically launches a background thread that receives the file descriptors and commands from the restore process. The socket used by the background thread is opened during AMDGPU plugin initialization, before the restore child is forked, which guarantees that the socket is available by the time the restore process sends file descriptors. The background thread then imports each buffer object and performs the actual GPU-content restoration using system DMA, concurrently with the child process restoring host memory.
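In rough outline, the main-process side could look like this; the helper names and the exact hook signature here are illustrative, not copied from the plugin:

    #include <pthread.h>

    /* Hypothetical helpers standing in for the plugin's real logic: */
    extern int restore_socket;                      /* opened in plugin init */
    int recv_restore_cmd(int sock, int *dmabuf_fd); /* SCM_RIGHTS receive    */
    int restore_bo_contents(int dmabuf_fd);         /* DMA copy into VRAM    */

    static pthread_t worker;

    static void *gpu_restore_worker(void *arg)
    {
            int sock = *(int *)arg, dmabuf_fd;

            /* Receive dma-buf fds plus restore commands from the restore
             * child, import each buffer object, and repopulate its VRAM. */
            while (recv_restore_cmd(sock, &dmabuf_fd) > 0)
                    restore_bo_contents(dmabuf_fd);
            return NULL;
    }

    /* Runs in the main CRIU process, just after the restore child has
     * been forked; the child restores host memory in parallel. */
    int amdgpu_plugin_post_forking(void)
    {
            return pthread_create(&worker, NULL, gpu_restore_worker,
                                  &restore_socket);
    }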

Finally, amdgpu_plugin_resume_devices_late(), also called by the main CRIU process but much later, acts as a synchronization point. By this stage, the restore child has completed its memory setup. The main CRIU process uses this hook to ensure its background GPU-content restoration is complete, possibly by notifying and waiting for the worker thread(s).
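Continuing the sketch above, and assuming the hook's usual single process-ID argument, the synchronization can be as simple as joining the worker thread:

    /* Illustrative only: wait for the background GPU-restore worker
     * started in the post-forking sketch above to finish. */
    int amdgpu_plugin_resume_devices_late(int target_pid)
    {
            return pthread_join(worker, NULL);
    }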

The enhanced GPU-restore procedure now allows for significant concurrency. After the main CRIU process forks the restore child, it prepares, via amdgpu_plugin_post_forking(), to handle GPU restoration, typically by starting a worker thread. The restore child, through amdgpu_plugin_restore_file(), sends dma-buf file descriptors and commands to that worker thread. From that point, the two most expensive operations, GPU-content restoration in the parent and host-memory restoration in the child, proceed in parallel. On our evaluation platform, parallel restore achieves a 34.3% improvement in the restore time of our test application when the data is in the page cache, and a 7.6% improvement when restoring from disk. The test scripts and implementation details are available here.

Looking ahead

The journey to optimal GPU checkpoint/restore is ongoing, however. Continued collaboration between the CRIU community and GPU vendors is vital for refining these mechanisms across diverse architectures. The quest for faster, more robust checkpoint/restore for complex, accelerated applications remains an active and important area of development.

[Acknowledgments: The parallel restore patch was significantly improved by comments and suggestions from the CRIU community: Andrei Vagin, Radostin Stoyanov, and David Yat Sin.]


Index entries for this article
GuestArticles: Du, Dong



Usecases

Posted Jun 18, 2025 21:26 UTC (Wed) by Zero_Dogg (subscriber, #31310)

This was a very interesting breakdown of an impressive piece of software. I am wondering what the main use case/motivation is for this? Could it be used to, for instance, implement the kind of “quick resume” of games that many consoles have, e.g. on the Steam Deck, or is it more to move running server processes between servers (or both)?

Usecases

Posted Jun 30, 2025 15:27 UTC (Mon) by atnot (subscriber, #124910)

I don't think there's currently any effort to enable CRIU to suspend/resume Wayland or even X11 clients, which would be a bit of a challenge. So as usual these days the primary use case of this is machine learning.

