Checkpoint/restart (mostly) in user space

By Jonathan Corbet
July 19, 2011

There are numerous use cases for a checkpoint/restart capability in the kernel, but the highest level of interest continues to come from the containers area. There is clear value in being able to save the complete state of a container to a disk file and restart that container's execution at some future time, possibly on a different machine. The kernel-based checkpoint/restart patch has been discussed here a number of times, including a report from last year's Kernel Summit and a followup published shortly thereafter. In the end, the developers of this patch do not seem to have been able to convince the kernel community that the complexity of the patch is manageable and that the feature is worth merging.

As a result, there has been relatively little news from the checkpoint/restart community in recent months. That has changed, though, with the posting of a new patch by Pavel Emelyanov. Previous patches have implemented the entire checkpoint/restart process in the kernel, with the result that the patches added a lot of seemingly fragile (though the developers dispute that assessment) code into the kernel. Pavel's approach, instead, is focused on simplicity and doing as much as possible in user space.

Pavel notes in the patch introduction that almost all of the information needed to checkpoint a simple process tree can already be found in /proc; he just needs to augment that information a bit. So his patch set adds some relevant information there:

There is a new /proc/pid/mfd directory containing information about files mapped into the process's address space. Each virtual memory area is represented by a symbolic link whose name is the area's starting virtual address and whose target is the mapped file. The bulk of this information already exists in /proc/pid/maps, but the mfd directory collects it in a useful format and makes it possible for a checkpoint program to be sure it can open the exact same file that the process has mapped.
/proc/pid/status is enhanced with a line listing all of the process's children. Again, that is information which could be obtained in other ways, but having it in one spot makes life easier.
The big change is the addition of a /proc/pid/dump file. A process reading this file will obtain the information about the process which is not otherwise available: primarily the contents of the CPU registers and its anonymous memory.
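
As a rough illustration of what is already available without the patch, the sketch below parses the text of /proc/pid/maps and builds the same start-address-to-file association that the proposed mfd directory would expose as symbolic links. The function names are hypothetical and not part of Pavel's tools. Note that maps yields only a path name, which may no longer refer to the file the process actually has mapped (for instance when it carries a "(deleted)" suffix); that is the gap the mfd links are meant to close.

```python
from typing import Dict


def file_backed_mappings(maps_text: str) -> Dict[int, str]:
    """Map each area's starting address to the path of the file backing it.

    maps_text is the content of /proc/<pid>/maps, where each line reads:
        start-end perms offset dev inode [pathname]
    """
    result: Dict[int, str] = {}
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping: no pathname column at all
        addr_range, _perms, _offset, _dev, inode, path = fields
        if inode == "0":
            continue  # pseudo-entries such as [heap] or [stack]
        start = int(addr_range.split("-")[0], 16)
        result[start] = path.rstrip()
    return result


def read_mappings(pid: str = "self") -> Dict[int, str]:
    """Convenience wrapper: read and parse the maps file of a live process."""
    with open(f"/proc/{pid}/maps") as f:
        return file_backed_mappings(f.read())
```

A checkpoint tool built this way would still have to trust that each path names the same file the process mapped, which is exactly the kind of guess a symbolic link to the mapped file would remove.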

The dump file has an interesting format: it looks like a new binary executable format to the kernel. Another patch in Pavel's series implements the necessary logic to execute a "program" represented in that format; it restores the register and memory contents, then resumes executing where the process was before it was checkpointed. This approach eliminates the need to add any sort of special system call to restart a process.

There is a need for one other bit of support, though: checkpointed processes may become very confused if they are restarted with a different process ID than they had before. Various enhancements to (or replacements for) the clone() system call have been proposed to deal with this problem in the past. Pavel's answer is a new flag to clone(), called CLONE_CHILD_USEPID, which allows the parent process to request that a specific PID be used.

With this much support, Pavel is able to create a set of tools which can checkpoint and restart simple trees of processes. There are numerous things which are not handled; the list would include network connections, SYSV IPC, security contexts, and more. Presumably, if this patch set looks like it can be merged into the mainline, support for other types of objects can be added. Whether adding that support would cause the size and complexity of the patch to grow to the point where it rivals its predecessors remains to be seen.
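
Checkpointing a tree of processes starts with finding its members. The sketch below shows one of the "other ways" of getting that information on a kernel without the proposed children line in /proc/pid/status: it reads the parent PID field from each /proc/pid/stat file and walks down from a root process. All function names are hypothetical, and the scan is not atomic, so a real tool would presumably need to stop the processes first.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_stat(stat_text: str) -> Tuple[int, int]:
    """Return (pid, ppid) from the contents of a /proc/<pid>/stat file."""
    # The command name is parenthesized and may itself contain spaces or
    # parentheses, so locate the fields after the LAST closing parenthesis.
    pid = int(stat_text[:stat_text.index("(")])
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    return pid, int(rest[1])  # rest[0] is the state, rest[1] the parent PID


def children_by_parent(stat_texts: Iterable[str]) -> Dict[int, List[int]]:
    """Group PIDs by their parent PID."""
    children: Dict[int, List[int]] = defaultdict(list)
    for text in stat_texts:
        pid, ppid = parse_stat(text)
        children[ppid].append(pid)
    return children


def process_tree(root: int, children: Dict[int, List[int]]) -> List[int]:
    """Return root and all of its descendants in depth-first order."""
    order: List[int] = []
    stack = [root]
    while stack:
        pid = stack.pop()
        order.append(pid)
        stack.extend(sorted(children.get(pid, []), reverse=True))
    return order


def read_all_stats() -> List[str]:
    """Read every /proc/<pid>/stat currently present, skipping exited tasks."""
    texts = []
    for name in os.listdir("/proc"):
        if name.isdigit():
            try:
                with open(f"/proc/{name}/stat") as f:
                    texts.append(f.read())
            except OSError:
                pass  # the process exited while we were scanning
    return texts
```

For example, `process_tree(os.getpid(), children_by_parent(read_all_stats()))` would list the calling process and its descendants. A per-process children list in /proc would replace the full scan with a direct lookup.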

Thus far, there has been little discussion of this patch set. The fact that it was posted to the containers list - not the largest or most active list in our community - will have something to do with that. The few comments which have been posted have been positive, though. If this patch is to go forward, it will need to be sent to a larger list where a wider group of developers will have the opportunity to review it. Then we'll be able to restart the whole discussion for real - and maybe actually get a solution into the kernel this time.


Checkpoint/restart (mostly) in user space

Posted Jul 21, 2011 5:58 UTC (Thu) by dlang (guest, #313) [Link] (2 responses)

this sounds like exactly the way major things like this work best in kernel development.

people initially propose a big, complex, intrusive patch. there is push back from kernel developers. time passes and people think more. a simple, minimal patch is created that implements a large portion of the desired functionality at a minimum impact.

the next steps are to see this added and let people build on it.

almost the exact same process happened with virtualization (between Xen as the big patch, and KVM as the minimal starting point).

people wanting to get major things added to the kernel should pay attention, even if you did develop a big massive patchset, once you know where you want to end up, go back and look for the minimum that can be done to get something useful, get that accepted and build on that.

not coincidentally, this looks very similar to the 'release early, release often' mantra of the bazaar development model.

Checkpoint/restart (mostly) in user space

Posted Jul 22, 2011 21:59 UTC (Fri) by mhelsley (guest, #11324) [Link] (1 responses)

"people wanting to get major things added to the kernel should pay attention, even if you did develop a big massive patchset, once you know where you want to end up, go back and look for the minimum that can be done to get something useful, get that accepted and build on that."

We did that. Multiple times. The first implementation effort was primarily in userspace using ptrace and /proc. The second was Oren's in-kernel work which started out small and grew at the request of Andrew. The third was Nathan's stripped-down revision of Oren's patch set earlier this year.

"not coincidentally, this looks very similar to the 'release early, release often' mantra of the bazaar development model."

There were plenty of small early releases. In fact, I seem to recall we were told to use containers@ because our frequent releases were annoying LKML folks. Releases that did the same thing Pavel's stuff does only a different way.

"Release early, release often" is not enough. There have to be people with the time, will, and influence to review and merge the work. Without that it doesn't matter what you push or how often you push it.

Checkpoint/restart (mostly) in user space

Posted Jul 22, 2011 23:33 UTC (Fri) by dlang (guest, #313) [Link]

the key is to make the early pieces useful enough that people will take the time to review them.

they don't have to be something that the reviewer is going to use directly, but it does need to be something that the people being asked to review will see a direct need for.

Checkpoint/restart makes observing processes easier?

Posted Jul 21, 2011 9:40 UTC (Thu) by mjw (subscriber, #16740) [Link]

Hey, this sounds exciting also for debuggers/tracers/profilers. No more tricky ptrace back and forth to collect all that info. Finally a way to access mapped shared library files from long running executables that might have been deleted on disk already. It looks like you could even "snapshot" the process easily (as if making a checkpoint and doing a restart) and then debug on a copy instead of on the original process that you might want to keep running without interference.

Swapping

Posted Jul 21, 2011 10:55 UTC (Thu) by epa (subscriber, #39769) [Link] (9 responses)

I guess this means we can now have an old school swapper where an entire process is frozen and saved to disk, then swapped back in when it's time to run. Swapping became obsolete when virtual memory and paging were invented, but perhaps in some situations there's an advantage to pushing the process out to disk in one lump rather than a page at a time? For example you could compress the process image with lzop and end up doing less I/O in total than if you had written individual 4k pages. Or, perhaps, swapping out a whole process could serve to defragment memory so some huge pages can be created. Or are there still systems without a sophisticated MMU which can't do paging but might still need to swap programs out of main memory, as when switching between apps on a phone for example?

Swapping

Posted Jul 21, 2011 12:10 UTC (Thu) by ebirdie (guest, #512) [Link] (7 responses)

Cool. Possibly a way for the OOM killer to morph into an OOM dumper? And to get a better handle on OOM situations.

Swapping

Posted Jul 22, 2011 18:38 UTC (Fri) by jeremiah (subscriber, #1221) [Link] (6 responses)

I like this idea A LOT. I like the idea that once memory pressure is down, you could conceivably have the dumped process restart. I have had some issues where routine maintenance has done something that caused a high memory service to be killed, when it should have been the maint process. It would have been great for the service to be restarted after the maintenance finished. Or even, be able to pause/dump some processes, run a maint routine, then when done un-pause the processes.

Swapping

Posted Jul 22, 2011 22:09 UTC (Fri) by mhelsley (guest, #11324) [Link] (5 responses)

The difficult part is you really want to know the amount of memory necessary to dump the process at OOM time. If you don't have the memory to start a new process, much less do the checkpoint, then you can't use this method of checkpointing to avoid OOM kills. If the amount of memory needed to do a dump is some asymptotic function of the size of the program being checkpointed, well, those are exactly the programs you're trying to dump during OOM!

Swapping

Posted Jul 23, 2011 5:55 UTC (Sat) by jeremiah (subscriber, #1221) [Link] (2 responses)

I would think you could reserve an amount large enough to start a process to spool out and checkpoint the process to disk. It might not be the most efficient thing, but it might just get the job done. I obviously haven't looked at this/ heard enough to make an informed opinion, just seemed like a cool idea though. I'm curious why you need to know the amount of memory needed though?

Swapping

Posted Jul 28, 2011 9:21 UTC (Thu) by mhelsley (guest, #11324) [Link] (1 responses)

Precisely to know how much memory to reserve as you suggested. Unless you're clever at managing it yet keeping it available during OOM that reserved memory is wasted memory. So you'd want to be careful not to waste too much.

Swapping

Posted Jul 28, 2011 17:10 UTC (Thu) by jeremiah (subscriber, #1221) [Link]

Right, guess I misread something. Obviously you have to reserve enough mem for the checkpoint process, but that seems like it would be of a fixed and predictable size, probably even fairly small. I was under the impression that they were trying to figure out how big the process that was being killed was, which as long as it fits on a disk, which seems pretty likely, you don't really care. As long as there is always more free space on the drive than there is total memory on the system, you never have an out-of-disk-space problem.

Swapping

Posted Jul 30, 2011 17:54 UTC (Sat) by oak (guest, #2786) [Link]

That's what cgroups is for. You make sure that things OOM early enough that rest of the system has enough memory to handle it gracefully.

As to kernel swapping the OOMed program back to ram from swap when you read the dump file, with cgroups setup retaining enough memory for the rest of the system (and kernel) while the OOMing container group is frozen that shouldn't be a problem either.

Swapping

Posted Jul 31, 2011 3:50 UTC (Sun) by slashdot (guest, #22014) [Link]

Honestly, a dynamically resizing swap file does the same thing, better.

You can't use any checkpoint/restart system to swap processes, because none can guarantee to perfectly restore them.

Swapping

Posted Jul 29, 2011 11:37 UTC (Fri) by obi (guest, #5784) [Link]

Maybe another feature to add to systemd, especially if systemd ends up managing the lifecycle of our user session: it would be nice to be able to do what Apple's iOS and Lion do: you start all your programs, and the system is allowed to switch them off under pressure, but the user is never made aware of it.

From what I understand, iOS and OSX apps get notified so they can dump their state themselves when necessary. This checkpointing would be even nicer, because app devs wouldn't have to change anything; it would just transparently work.

Checkpoint/restart (mostly) in user space

Posted Jul 21, 2011 17:05 UTC (Thu) by paravoid (subscriber, #32869) [Link] (4 responses)

I wonder if checkpoint/restart in combination with kexec would help security kernel upgrades with a limited downtime (worse than ksplice but better than a reboot :)

Checkpoint/restart (mostly) in user space

Posted Jul 22, 2011 22:19 UTC (Fri) by mhelsley (guest, #11324) [Link] (3 responses)

Actually it could be much better than ksplice -- ksplice is rather limited in the kinds of security fixes it can apply as I recall. To be fair, checkpoint/restart is also limited -- if a process being checkpointed has a physical device open (unlike, say, a pty) then it's difficult or practically impossible to checkpoint. So the set of processes with devices open this way is what will determine whether checkpoint/restart is a practical solution for this problem.

Checkpoint/restart (mostly) in user space

Posted Jul 23, 2011 15:06 UTC (Sat) by Lennie (subscriber, #49641) [Link] (2 responses)

ksplice is limited, but if I recall correctly it works automatically for 89% of the security patches. For the other 11%, a programmer has to write code to change a data structure in the kernel.

Checkpoint/restart (mostly) in user space

Posted Jul 28, 2011 9:16 UTC (Thu) by mhelsley (guest, #11324) [Link] (1 responses)

Sure. Though I wonder what "89% security patches" really means in practical terms. What is a "security patch" to whoever produced that number, and what portion of all kernel patches (in the sample, a single release, or overall?) are classified as such? "security patches" could be a small portion of the patches applied to the kernel in any given release so that 89% is potentially much less impressive than it sounds.

Checkpoint/restart (mostly) in user space

Posted Jul 28, 2011 19:05 UTC (Thu) by Lennie (subscriber, #49641) [Link]

I always assumed it was a stable release of a 'server distribution', a slow-moving target like Debian stable or Ubuntu LTS, which only gets security updates on the kernel (maybe some stability updates). And it is those security patches that they are talking about.

But I could be wrong of course.

Checkpoint/restart (mostly) in user space

Posted Jul 31, 2011 3:58 UTC (Sun) by slashdot (guest, #22014) [Link]

Seems good.

The best way to implement checkpoint/restart seems to me to first write something that works in userspace using the current kernel, and have it use the additional kernel information exposed by submitted patches if available.

For example the tool could initially just read /proc/pid/maps and hope there aren't any pathological issues, and then be enhanced to use /proc/pid/mfd to be actually fully correct.

Then, just submit a ton of small patches that add new generally useful query/set (and atomicity) interfaces, which coincidentally happen to be those needed for more accurate checkpoint/restart.

Checkpoint/restart (mostly) in user space

Posted Aug 11, 2011 17:39 UTC (Thu) by gene (guest, #78097) [Link]

Hi. This is definitely an interesting approach to checkpoint-restart, and I look forward to seeing how it develops. On behalf of the DMTCP team, I'd also like to mention that since 2004, a group of us have been working on transparent checkpoint-restart entirely in user space. If you'd like to try it, it's at:
DMTCP (Distributed MultiThreaded CheckPointing)
http://dmtcp.sourceforge.net
Debian (testing) package: dmtcp