LWN.net Logo

clone_with_pids()

By Jake Edge
August 12, 2009

As part of the changes to support application checkpoint and restart in the kernel, Sukadev Bhattiprolu has proposed a new system call: clone_with_pids(). When a process that was checkpointed gets restarted, having the same process id (PID) as it had when the checkpoint was done is important to some kinds of applications. Normally, the kernel assigns an unused PID when a new task is started (via clone()), but, for checkpointed processes, that could lead to processes' PIDs changing during their lifetime, which could be an undesirable side effect. So, Bhattiprolu is looking for a way to avoid that by allowing clone() callers to specify the PID—or PIDs for processes in nested namespaces—of the child.

The actual system call is fairly straightforward. It adds an additional pid_set parameter to clone(), to contain a list of process ids; pid_set has the obvious definition:

    struct pid_set {
	   int num_pids;
	   pid_t *pids;
    };
A pointer to a pid_set is passed as the last parameter to clone_with_pids(). Each of the PIDs is used to specify which PID should be assigned at each level of namespace nesting. The patch that actually implements clone_with_pids() (as opposed to the earlier patches in the patchset that prepare the way) illustrates this with an example (slightly edited for clarity):
	pid_t pids[] = { 0, 77, 99 };
	struct pid_set pid_set;

	pid_set.num_pids = sizeof(pids) / sizeof(int);
	pid_set.pids = &pids;

	clone_with_pids(flags, stack, NULL, NULL, NULL, &pid_set);
If a target-pid is 0, the kernel continues to assign a pid for the process in that namespace. In the above example, pids[0] is 0, meaning the kernel will assign next available pid to the process in init_pid_ns. But kernel will assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either 77 or 99 are taken, the system call fails with -EBUSY.

The patchset assumes that being able to set PIDs is desirable, but Linus Torvalds was not particularly in favor of that approach when it was first discussed on linux-kernel back in March. His complaint was that there are far too many stateful attributes of processes to ever be able to handle checkpointing in the general case. His suggestion: "just teach the damn program you're checkpointing that pids will change, and admit to everybody that people who want to be checkpointed need to do work".

Others disagreed—no surprise—but it is unclear that Torvalds has changed his mind. He was also concerned about the security implications of processes being able to request PID assignments: "But it also sounds like a _wonderful_ attack vector against badly written user-land software that sends signals and has small races." That particular concern should be alleviated by the requirement that a process have the CAP_SYS_ADMIN capability (essentially root privileges) in order to use clone_with_pids().

Requiring root to handle restarts, which in practice means that root must manage the checkpoint process as well, makes checkpoint/restart less useful, overall. But there are a whole host of problems to solve before allowing users to arbitrarily checkpoint and restore from their own, quite possibly maliciously crafted, checkpoint images. Even with root handling the process, there are a number of interesting applications.

There is an additional wrinkle that Bhattiprolu notes in the patch. Currently, all of the available clone() flags are allocated. That doesn't affect clone_with_pids() directly, as the flags it needs are already present, but, when adding a system call, it is good to look to the future. To that end, there are two proposed implementations of a clone_extended() system call, which could be added instead of clone_with_pids(), that would allow for more clone() flags, while still supporting the restart case.

The first possibility is to turn the flags argument into a pointer to an array of flag entries, that would be treated like signal() sets, including operations to test, set, and clear flags a la sigsetops():

    typedef struct {
	    unsigned long flags[CLONE_FLAGS_WORDS];
    } clone_flags_t;

    int clone_extended(clone_flags_t *flags, void *child_stack, int *unused,
	    int *parent_tid, int *child_tid, struct pid_set *pid_set);
In the proposal, CLONE_FLAGS_WORDS would be set to 1 for 64-bit architectures, while on 32-bit architectures, it would be set to 2, thus doubling the number of available flags to 64. Should the number of clone flags needed grow, that could be expanded as required, though doing so in a backward-compatible manner is not really possible.

Another option is to split the flags into two parameters, keeping the current flags parameter as it is, and adding a new clone_info parameter that contains new flags along with the pid_set:

    struct clone_info {
	    int num_clone_high_words;
	    int *flags_high;
	    struct pid_set pid_set;
    }

    int clone_extended(int flags_low, void *child_stack, void *unused,
	    int *parent_tid, int *child_tid, struct clone_info *clone_info);
There are pros and cons to each approach, as Bhattiprolu points out. The first requires a copy_from_user() for the flags in all cases (though 64-bit architectures might be able to avoid that for now), while the second requires the awkward splitting of the flags, but avoids the copy_from_user() for calls that don't use the new flags or pid_sets.

It is hard to imagine that copying a bit of data from user space will measurably impact a system call that is creating a process, though, so some derivative of the first option would seem to be the better choice. It's also a bit hard to see the need for more than 64 clone() flags, but if that is truly desired, something with a path for compatibility is needed.

There has been no objection to the implementation of clone_with_pids(), but there have been few comments overall. Pavel Machek wondered about the need for setting the PID of anything but the inner-most namespace, but Serge E. Hallyn noted that nested namespaces require that ability: "we might be restarting an app using a nested pid namespace, in which case restart would specify pids for 2 (or more) of the innermost containers".

Machek also thought there should be a documentation file that described the new system call, and Bhattiprolu agreed, but is waiting to see what kind of consensus on either clone_with_pids() or clone_extended() (and which of the two interfaces for the latter) would emerge. So far, no one has commented on that particular aspect.

This is version 4 of the patchset, and the history shows that earlier comments have been addressed. It is still at the RFC stage, or, as Bhattiprolu puts it: "Its mostly an exploratory patch seeking feedback on the interface". That feedback has yet to emerge, however, and one might wonder whether Torvalds will still object to the whole approach. It would seem, though, that there are too many important applications for checkpoint and restart—including process migration and the ability to upgrade kernels underneath long-running processes—for some kind of solution not to make its way into the kernel eventually.


(Log in to post comments)

clone_with_pids()

Posted Aug 13, 2009 7:25 UTC (Thu) by hickinbottoms (subscriber, #14798) [Link]

Re the example:

pid_t pids[] = { 0, 77, 99 };
struct pid_set pid_set;

pid_set.num_pids = sizeof(pids) / sizeof(int);
pid_set.pids = &pids;

Doesn't that make the (possibly wrong) assumption that pid_t is an int in setting pid_set.num_pids? Or perhaps I'm being naive and it's been too long since I did this kind of stuff.

clone_with_pids()

Posted Aug 13, 2009 10:08 UTC (Thu) by jengelh (subscriber, #33263) [Link]

IMHO you are right. “sizeof(pids) / sizeof(pid_t)” would be the right thing, or better yet “sizeof(pids) / sizeof(*pids)”, or with the appropriate macro: “ARRAY_SIZE(pids)”.

clone_with_pids()

Posted Aug 13, 2009 15:52 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]

Just never do:
#include <stdio.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*x))

unsigned fct (unsigned array[4]) {
return ARRAY_SIZE(array);
}

int main (int argc, char **argv) {
unsigned array[4];
printf ("array size: %u\n", fct (array));
return 0;
}

clone_with_pids()

Posted Aug 13, 2009 16:14 UTC (Thu) by jengelh (subscriber, #33263) [Link]

Yes, but that is not a problem of ARRAY_SIZE or sizeof itself, but a PEBKAC. Using “fct(unsigned int *array)” instead of “fct(unsigned int array[4])”, preferably consistently, should give yourself a reminder how C works.

misuse of ARRAY_SIZE

Posted Aug 14, 2009 18:39 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

Yes, but that is not a problem of ARRAY_SIZE or sizeof itself

Sure it is. It's just a matter of perspective. That ARRAY_SIZE(array) doesn't do what the user thinks it does can just as easily be called a problem with ARRAY_SIZE (I said problem, not bug) as with the user. In a nice language, array sizes would work the way a coder who hasn't already been bitten by this a few times would expect. Degeneration of an array to a pointer, though quite traditional in C, is a problem.

clone_with_pids()

Posted Aug 13, 2009 19:52 UTC (Thu) by rwmj (subscriber, #5474) [Link]

I'm with Linus on this one. Why can't the process use a LD_PRELOAD wrapper around getpid(2), or just be taught that PIDs might change?

clone_with_pids()

Posted Aug 13, 2009 22:41 UTC (Thu) by bronson (subscriber, #4806) [Link]

> If either 77 or 99 are taken, the system call fails with -EBUSY.

I smell a DOS vector...

BSD sockets are the best example

Posted Aug 17, 2009 22:08 UTC (Mon) by elanthis (guest, #6227) [Link]

The BSD sockets API are the best example of extensible syscall interfaces. In fact, only the name resolution stuff (separate from the sockets itself) have needed to be replaced since the official inception of sockets.

Basically, instead of defining a ton of parameters, define a structure that is passed in. Either pass in a version of the structure or the structure size (which grows with additions). Then adding new data never requires breaking the syscall interface itself.

There is a performance hit for small syscalls, but then one doesn't have to do all or nothing. A syscall can include the common bits as regular parameters and just use the structure for extension data.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds