Checkpoint/restart (mostly) in user space
As a result, there has been relatively little news from the checkpoint/restart community in recent months. That has changed, though, with the posting of a new patch by Pavel Emelyanov. Previous patches have implemented the entire checkpoint/restart process in the kernel, with the result that the patches added a lot of seemingly fragile (though the developers dispute that assessment) code into the kernel. Pavel's approach, instead, is focused on simplicity and doing as much as possible in user space.
Pavel notes in the patch introduction that almost all of the information needed to checkpoint a simple process tree can already be found in /proc; he just needs to augment that information a bit. So his patch set adds some relevant information there:
- There is a new /proc/pid/mfd directory containing
information about files mapped into the process's address space. Each
virtual memory area is represented by a symbolic link whose name is
the area's starting virtual
address and whose target is the mapped file. The bulk of this
information already exists in /proc/pid/maps, but the
mfd directory collects it in a useful format and makes it
possible for a checkpoint program to be sure it can open the exact
same file that the process has mapped.
- /proc/pid/status is enhanced with a line listing all of the
process's children. Again, that is information which could be
obtained in other ways, but having it in one spot makes life easier.
- The big change is the addition of a /proc/pid/dump
file. A process reading this file will obtain the information about
the process which is not otherwise available: primarily the contents
of the CPU registers and its anonymous memory.
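The second item above notes that the list of children can already be obtained in other ways. As a rough sketch of one such way on an unpatched kernel (not the method used by Pavel's tools), a checkpoint program could read the existing `PPid:` field from every /proc/pid/status file and invert it. The function names and the `proc_root` parameter are hypothetical, chosen for illustration only.

```python
import os

def parse_ppid(status_text):
    """Return the parent PID found in the text of a /proc/<pid>/status file, or None."""
    for line in status_text.splitlines():
        if line.startswith("PPid:"):
            return int(line.split()[1])
    return None

def build_child_map(proc_root="/proc"):
    """Scan proc_root and map each parent PID to a sorted list of its children."""
    children = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                ppid = parse_ppid(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if ppid is not None:
            children.setdefault(ppid, []).append(int(entry))
    for kids in children.values():
        kids.sort()
    return children

def process_tree(root_pid, children):
    """Return root_pid and all of its descendants, each parent before its children."""
    order, stack = [], [root_pid]
    while stack:
        pid = stack.pop()
        order.append(pid)
        stack.extend(reversed(children.get(pid, [])))
    return order
```

A caller would run `process_tree(pid, build_child_map())` to get the set of processes to checkpoint. The scan touches every process on the system and can race with forks and exits, which suggests why a single per-process line listing the children "makes life easier."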
There is a need for one other bit of support, though: checkpointed processes may become very confused if they are restarted with a different process ID than they had before. Various enhancements to (or replacements for) the clone() system call have been proposed to deal with this problem in the past. Pavel's answer is a new flag to clone(), called CLONE_CHILD_USEPID, which allows the parent process to request that a specific PID be used.
With this much support, Pavel is able to create a set of tools which can checkpoint and restart simple trees of processes. There are numerous things which are not handled; the list would include network connections, SYSV IPC, security contexts, and more. Presumably, if this patch set looks like it can be merged into the mainline, support for other types of objects can be added. Whether adding that support would cause the size and complexity of the patch to grow to the point where it rivals its predecessors remains to be seen.
Thus far, there has been little discussion of this patch set. The fact
that it was posted to the containers list - not the largest or most active
list in our community - will have something to do with that. The few
comments which have been posted have been positive, though. If this patch
is to go forward, it will need to be sent to a larger list where a wider
group of developers will have the opportunity to review it. Then we'll be
able to restart the whole discussion for real - and maybe actually get a
solution into the kernel this time.
Index entries for this article | |
---|---|
Kernel | Checkpointing |
Kernel | Containers |
Posted Jul 21, 2011 5:58 UTC (Thu)
by dlang (guest, #313)
[Link] (2 responses)
people initially propose a big, complex, intrusive patch. there is push back from kernel developers. time passes and people think more. a simple, minimal patch is created that implements a large portion of the desired functionality at a minimum impact.
the next steps are to see this added and let people build on it.
almost the exact same process happened with virtualization (with Xen as the big patch and KVM as the minimal starting point).
people wanting to get major things added to the kernel should pay attention: even if you did develop a big, massive patchset, once you know where you want to end up, go back and look for the minimum that can be done to get something useful, get that accepted, and build on that.
not coincidently, this looks very similar to the 'release early, release often' mantra of the bazaar development model.
Posted Jul 22, 2011 21:59 UTC (Fri)
by mhelsley (guest, #11324)
[Link] (1 responses)
We did that. Multiple times. The first implementation effort was primarily in userspace using ptrace and /proc. The second was Oren's in-kernel work which started out small and grew at the request of Andrew. The third was Nathan's stripped-down revision of Oren's patch set earlier this year.
"not coincidently, this looks very similar to the 'release early, release often' mantra of the bazaar development model."
There were plenty of small early releases. In fact, I seem to recall we were told to use containers@ because our frequent releases were annoying LKML folks. Those releases did the same thing Pavel's stuff does, only in a different way.
"Release early, release often" is not enough. There have to be people with the time, will, and influence to review and merge the work. Without that it doesn't matter what you push or how often you push it.
Posted Jul 22, 2011 23:33 UTC (Fri)
by dlang (guest, #313)
[Link]
they don't have to be something that the reviewer is going to use directly, but it does need to be something that the people being asked to review will see a direct need for.
Posted Jul 21, 2011 9:40 UTC (Thu)
by mjw (subscriber, #16740)
[Link]
Posted Jul 21, 2011 10:55 UTC (Thu)
by epa (subscriber, #39769)
[Link] (9 responses)
Posted Jul 21, 2011 12:10 UTC (Thu)
by ebirdie (guest, #512)
[Link] (7 responses)
Posted Jul 22, 2011 18:38 UTC (Fri)
by jeremiah (subscriber, #1221)
[Link] (6 responses)
Posted Jul 22, 2011 22:09 UTC (Fri)
by mhelsley (guest, #11324)
[Link] (5 responses)
Posted Jul 23, 2011 5:55 UTC (Sat)
by jeremiah (subscriber, #1221)
[Link] (2 responses)
Posted Jul 28, 2011 9:21 UTC (Thu)
by mhelsley (guest, #11324)
[Link] (1 responses)
Posted Jul 28, 2011 17:10 UTC (Thu)
by jeremiah (subscriber, #1221)
[Link]
Posted Jul 30, 2011 17:54 UTC (Sat)
by oak (guest, #2786)
[Link]
As to kernel swapping the OOMed program back to ram from swap when you read the dump file, with cgroups setup retaining enough memory for the rest of the system (and kernel) while the OOMing container group is frozen that shouldn't be a problem either.
Posted Jul 31, 2011 3:50 UTC (Sun)
by slashdot (guest, #22014)
[Link]
You can't use any checkpoint/restart system to swap processes, because none can guarantee to perfectly restore them.
Posted Jul 29, 2011 11:37 UTC (Fri)
by obi (guest, #5784)
[Link]
From what I understand, iOS and OSX apps get notified so they can dump their state themselves when necessary. This checkpointing would be even nicer, because app devs wouldn't have to change anything; it would just transparently work.
Posted Jul 21, 2011 17:05 UTC (Thu)
by paravoid (subscriber, #32869)
[Link] (4 responses)
Posted Jul 22, 2011 22:19 UTC (Fri)
by mhelsley (guest, #11324)
[Link] (3 responses)
Posted Jul 23, 2011 15:06 UTC (Sat)
by Lennie (subscriber, #49641)
[Link] (2 responses)
Posted Jul 28, 2011 9:16 UTC (Thu)
by mhelsley (guest, #11324)
[Link] (1 responses)
Posted Jul 28, 2011 19:05 UTC (Thu)
by Lennie (subscriber, #49641)
[Link]
But I could be wrong, of course.
Posted Jul 31, 2011 3:58 UTC (Sun)
by slashdot (guest, #22014)
[Link]
The best way to implement checkpoint/restart seems to me to first write something that works in userspace using the current kernel, and have it use the additional kernel information exposed by submitted patches if available.
For example, the tool could initially just read /proc/pid/maps and hope there aren't any pathological issues, and then be enhanced to use /proc/pid/mfd to be actually fully correct.
Then, just submit a ton of small patches that add new generally useful query/set (and atomicity) interfaces, which coincidentally happen to be those needed for more accurate checkpoint/restart.
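A minimal sketch of that "use the new interface if available, otherwise fall back" idea follows. It assumes the proposed mfd directory holds symbolic links named by each area's starting address, as the article describes; that the names are hexadecimal is an assumption, and the function names and `proc_root` parameter are hypothetical.

```python
import os

def mappings_from_maps(maps_text):
    """Parse /proc/<pid>/maps text into {start_address: path} for file-backed areas."""
    result = {}
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping: no pathname column
        path = fields[5]
        if not path.startswith("/"):
            continue  # pseudo-entries such as [heap] or [stack]
        # Caveat: the path is only a name. A deleted or replaced file is one
        # of the pathological cases this fallback cannot handle correctly.
        start = int(fields[0].split("-")[0], 16)
        result[start] = path
    return result

def mappings_from_mfd(mfd_dir):
    """Read the proposed mfd directory (assumed layout: hex address -> symlink)."""
    result = {}
    for name in os.listdir(mfd_dir):
        result[int(name, 16)] = os.readlink(os.path.join(mfd_dir, name))
    return result

def file_mappings(pid, proc_root="/proc"):
    """Prefer the mfd directory when the kernel provides it, else parse maps."""
    base = os.path.join(proc_root, str(pid))
    mfd_dir = os.path.join(base, "mfd")
    if os.path.isdir(mfd_dir):
        return mappings_from_mfd(mfd_dir)
    with open(os.path.join(base, "maps")) as f:
        return mappings_from_maps(f.read())
```

On a kernel without the patch, `file_mappings(os.getpid())` takes the maps path; the same call would pick up the mfd directory once it exists, with no change to the caller.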
Posted Aug 11, 2011 17:39 UTC (Thu)
by gene (guest, #78097)
[Link]
Checkpoint/restart makes observing processes easier?
Swapping
DMTCP (Distributed MultiThreaded CheckPointing)
http://dmtcp.sourceforge.net
Debian (testing) package: dmtcp