
new system call, unshare

From:  Janak Desai <janak@us.ibm.com>
To:  linux-fsdevel@vger.kernel.org
Subject:  [RFC][patch 1/2] new system call, unshare
Date:  Tue, 10 May 2005 09:08:14 -0400 (Eastern Daylight Time)


Patch Summary:
This patch implements a new system call, unshare, which lets a 
process dissociate parts of its context that were originally 
shared through the clone() system call.

The patch consists of two parts:
[1/2] Implements the system call handler function sys_unshare.
[2/2] Implements system call setup for different architectures.

For now, I am only posting part 1 to request comments on the
justification, approach and implementation.

Patch Justification:
The unshare system call is needed to implement, via PAM, 
per-security-context and/or per-user namespaces that provide 
polyinstantiated directories. Using unshare and bind mounts, a 
PAM module can create a private namespace with the appropriate 
directories (chosen by the user's security context) bind mounted 
over public directories such as /tmp, giving each user an instance 
of /tmp that is based on that user's security context. Without the 
unshare system call, namespace separation can only be achieved 
through clone, which would require porting and maintaining every 
command that establishes a user session, such as login and su.
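
As an illustration of the PAM use case described above, the following
user-space sketch detaches the caller's namespace and bind mounts a
per-context directory over a public one. It is not part of the patch.
Assumptions: the call number is available as SYS_unshare (as in later
kernel headers; no libc wrapper is assumed), the helper names
instance_path and polyinstantiate and the <base>/<context> layout are
hypothetical, and the caller has the privileges that mount requires.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <linux/sched.h>   /* CLONE_NEWNS */
#include <sys/mount.h>     /* mount(), MS_BIND */
#include <sys/syscall.h>   /* SYS_unshare (assumed to be defined) */
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical layout: the instance for a given security context lives
 * at <base>/<context>. Returns the path length, or -1 if buf is too small.
 */
static int instance_path(char *buf, size_t len,
                         const char *base, const char *context)
{
    int n;

    assert(len > 0);
    n = snprintf(buf, len, "%s/%s", base, context);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Give the calling process a private namespace, then bind mount
 * instance_dir over public_dir inside it. Returns 0 on success,
 * -1 with errno set on failure.
 */
static int polyinstantiate(const char *instance_dir, const char *public_dir)
{
    if (instance_dir == NULL || public_dir == NULL) {
        errno = EINVAL;
        return -1;
    }

    /* Stop sharing the namespace with whoever we inherited it from. */
    if (syscall(SYS_unshare, (unsigned long)CLONE_NEWNS) == -1)
        return -1;

    /* If this fails, the namespace stays private but unchanged. */
    if (mount(instance_dir, public_dir, NULL, MS_BIND, NULL) == -1)
        return -1;

    return 0;
}
```

A session module might call polyinstantiate("/tmp-inst/alice", "/tmp")
once when the session opens; both paths are made up for the example.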

Overall Approach:
The overall approach follows the clone system call and its permission
enforcement. However, instead of clone's "what do we leave shared?" 
logic, the logic here is "what do we unshare that was previously 
shared?". Unlike clone, which operates on a newly allocated and 
not-yet-schedulable task structure, unshare works on the current 
process, so additional task_lock()s are taken to avoid race 
conditions. Before unsharing any part of the context, a check is made 
that that part is actually being shared. If it is not, the system 
call returns success. If it is, the system call makes a private copy
of that context and updates the appropriate pointers in the current 
task structure to point to the new copy. If allocation or setup of 
the private copy fails, the system call restores the current task 
structure so that it continues to use the shared context.
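
The check-copy-repoint sequence described above can be modelled in
user space as follows. This is a simplified sketch, not kernel code:
struct ctx, struct task, ctx_copy, unshare_ctx and the fail_copy hook
are all hypothetical names, there is no locking, and the model leaves
the task pointing at the shared context until the copy succeeds instead
of restoring it afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a refcounted piece of process context
 * (a namespace, signal handler table or address space). */
struct ctx {
    int  count;       /* number of tasks pointing at this context */
    char data[32];    /* the shared state itself */
};

struct task {
    struct ctx *ctx;
};

/* Test hook, not in the patch: when non-zero, the copy step fails. */
static int fail_copy;

/* Allocate a private duplicate with a use count of one. */
static struct ctx *ctx_copy(const struct ctx *old)
{
    struct ctx *copy;

    if (fail_copy)
        return NULL;
    copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return NULL;
    memcpy(copy->data, old->data, sizeof(copy->data));
    copy->count = 1;
    return copy;
}

/* Returns 0 on success (including "was not shared"), -ENOMEM on failure. */
static int unshare_ctx(struct task *tsk)
{
    struct ctx *old = tsk->ctx;
    struct ctx *copy;

    assert(old->count >= 1);
    if (old->count == 1)
        return 0;               /* not shared: nothing to do */

    copy = ctx_copy(old);
    if (copy == NULL)
        return -ENOMEM;         /* task keeps using the shared context */

    tsk->ctx = copy;            /* repoint the task at its private copy */
    old->count--;               /* drop this task's use of the shared one */
    return 0;
}
```

Two tasks that start out pointing at one context with a count of 2
end up, after one of them calls unshare_ctx, with separate contexts
whose counts are both 1.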

Currently, the system call only allows "unsharing" of the namespace, 
signal handlers and virtual memory, because those three were deemed 
useful in earlier discussions on this mailing list.
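
The accepted flag combinations can be checked in isolation with a
small user-space function. The sketch below mirrors the two tests at
the top of sys_unshare in the diff; the function name
check_unshare_flags is hypothetical, and the CLONE_* constants are
taken from the kernel's exported <linux/sched.h> header.

```c
#include <errno.h>
#include <linux/sched.h>   /* CLONE_NEWNS, CLONE_VM, CLONE_SIGHAND */

/* Returns 0 if the flags would be accepted, -EINVAL otherwise. */
static long check_unshare_flags(unsigned long flags)
{
    /* At least one of the three supported flags must be present. */
    if (!(flags & (CLONE_NEWNS | CLONE_VM | CLONE_SIGHAND)))
        return -EINVAL;

    /* Shared signal handlers imply shared VM, so unsharing the
     * handlers is only accepted together with unsharing the VM. */
    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM))
        return -EINVAL;

    return 0;
}
```

As in the patch, bits outside the three supported flags are not
rejected by these checks.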

Testing:
The patch has been unit tested on a uniprocessor i386 Fedora Core 3
system.

Signed-off-by: Janak Desai


-------------------------------------------------------------------
diff -Naurp linux-2.6.11.8/kernel/fork.c linux-2.6.11.8-p1/kernel/fork.c
--- linux-2.6.11.8/kernel/fork.c	2005-04-30 01:23:45.000000000 +0000
+++ linux-2.6.11.8-p1/kernel/fork.c	2005-05-09 19:03:52.000000000 +0000
@@ -56,6 +56,17 @@ int nr_threads; 		/* The idle threads do
 
 int max_threads;		/* tunable limit on nr_threads */
 
+/*
+ * mm_copy gets called from clone or unshare system calls. When called
+ * from clone, mm_struct may be shared depending on the clone flags
+ * argument, however, when called from the unshare system call, a private
+ * copy of mm_struct is made.
+ */
+enum mm_copy_share {
+	MAY_SHARE,
+	UNSHARE,
+};
+
 DEFINE_PER_CPU(unsigned long, process_counts) = 0;
 
  __cacheline_aligned DEFINE_RWLOCK(tasklist_lock);  /* outer */
@@ -421,16 +432,26 @@ void mm_release(struct task_struct *tsk,
 	}
 }
 
-static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
+static int copy_mm(unsigned long clone_flags, struct task_struct * tsk,
+			enum mm_copy_share copy_share_action)
 {
 	struct mm_struct * mm, *oldmm;
 	int retval;
 
-	tsk->min_flt = tsk->maj_flt = 0;
-	tsk->nvcsw = tsk->nivcsw = 0;
+	/*
+	 * If the process memory is being duplicated as part of the
+	 * unshare system call, we are working with the current process
+	 * and not a newly allocated task structure, and should not
+	 * zero out fault info, context switch counts, mm and active_mm
+	 * fields.
+	 */
+	if (copy_share_action == MAY_SHARE) {
+		tsk->min_flt = tsk->maj_flt = 0;
+		tsk->nvcsw = tsk->nivcsw = 0;
 
-	tsk->mm = NULL;
-	tsk->active_mm = NULL;
+		tsk->mm = NULL;
+		tsk->active_mm = NULL;
+	}
 
 	/*
 	 * Are we cloning a kernel thread?
@@ -917,7 +938,7 @@ static task_t *copy_process(unsigned lon
 		goto bad_fork_cleanup_fs;
 	if ((retval = copy_signal(clone_flags, p)))
 		goto bad_fork_cleanup_sighand;
-	if ((retval = copy_mm(clone_flags, p)))
+	if ((retval = copy_mm(clone_flags, p, MAY_SHARE)))
 		goto bad_fork_cleanup_signal;
 	if ((retval = copy_keys(clone_flags, p)))
 		goto bad_fork_cleanup_mm;
@@ -1222,3 +1243,172 @@ void __init proc_caches_init(void)
 			sizeof(struct mm_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 }
+
+/*
+ * unshare_mm is called from the unshare system call handler function to
+ * make a private copy of the mm_struct structure. It calls copy_mm with
+ * CLONE_VM flag cleared, to ensure that a private copy of mm_struct is made,
+ * and with mm_copy_share enum set to UNSHARE, to ensure that copy_mm
+ * does not clear fault info, context switch counts, mm and active_mm
+ * fields of the task_struct.
+ */
+static int unshare_mm(unsigned long unshare_flags, struct task_struct *tsk)
+{
+	int retval = 0;
+	struct mm_struct *mm = tsk->mm;
+
+	/*
+	 * If the virtual memory is being shared, make a private
+	 * copy and disassociate the process from the shared virtual
+	 * memory.
+	 */
+	if (atomic_read(&mm->mm_users) > 1) {
+		retval = copy_mm((unshare_flags & ~CLONE_VM), tsk, UNSHARE);
+
+		/*
+		 * If copy_mm was successful, decrement the number of users
+		 * on the original, shared, mm_struct.
+		 */
+		if (!retval)
+			atomic_dec(&mm->mm_users);
+	}
+	return retval;
+}
+
+/*
+ * unshare_sighand is called from the unshare system call handler function to
+ * make a private copy of the sighand_struct structure. It calls copy_sighand
+ * with CLONE_SIGHAND cleared to ensure that a new signal handler structure
+ * is cloned from the current shared one.
+ */
+static int unshare_sighand(unsigned long unshare_flags, struct task_struct *tsk)
+{
+	int retval = 0;
+	struct sighand_struct *sighand = tsk->sighand;
+
+	/*
+	 * If the signal handlers are being shared, make a private
+	 * copy and disassociate the process from the shared signal
+	 * handlers.
+	 */
+	if (atomic_read(&sighand->count) > 1) {
+		retval = copy_sighand((unshare_flags & ~CLONE_SIGHAND), tsk);
+
+		/*
+		 * If copy_sighand was successful, decrement the use count
+		 * on the original, shared, sighand_struct.
+		 */
+		if (!retval)
+			atomic_dec(&sighand->count);
+	}
+	return retval;
+}
+
+/*
+ * unshare_namespace is called from the unshare system call handler
+ * function to make a private copy of the current shared namespace. It
+ * calls copy_namespace with CLONE_NEWNS set to ensure that a new
+ * namespace is cloned from the current namespace.
+ */
+static int unshare_namespace(struct task_struct *tsk)
+{
+	int retval = 0;
+	struct namespace *namespace = tsk->namespace;
+
+	/*
+	 * If the namespace is being shared, make a private copy
+	 * and disassociate the process from the shared namespace.
+	 */
+	if (atomic_read(&namespace->count) > 1) {
+		retval = copy_namespace(CLONE_NEWNS, tsk);
+
+		/*
+		 * If copy_namespace was successful, decrement the use count
+		 * on the original, shared, namespace struct.
+		 */
+		if (!retval)
+			atomic_dec(&namespace->count);
+	}
+	return retval;
+}
+
+/*
+ * unshare allows a process to 'unshare' part of the process
+ * context which was originally shared using clone.
+ */
+asmlinkage long sys_unshare(unsigned long unshare_flags)
+{
+	struct task_struct *tsk = current;
+	int retval = 0;
+	struct namespace *namespace;
+	struct mm_struct *mm;
+	struct sighand_struct *sighand;
+
+	if (!(unshare_flags & (CLONE_NEWNS | CLONE_VM | CLONE_SIGHAND)))
+		goto bad_unshare_invalid_val;
+
+	/*
+	 * Shared signal handlers imply shared VM, so if CLONE_SIGHAND is
+	 * set, CLONE_VM must also be set in the system call argument.
+	 */
+	if ((unshare_flags & CLONE_SIGHAND) && !(unshare_flags & CLONE_VM))
+		goto bad_unshare_invalid_val;
+
+	task_lock(tsk);
+	namespace = tsk->namespace;
+	mm = tsk->mm;
+	sighand = tsk->sighand;
+
+	if (unshare_flags & CLONE_VM) {
+		retval = unshare_mm(unshare_flags, tsk);
+		if (retval)
+			goto unshare_unlock_task;
+		else if (unshare_flags & CLONE_SIGHAND) {
+			retval = unshare_sighand(unshare_flags, tsk);
+			if (retval)
+				goto bad_unshare_cleanup_mm;
+		}
+	}
+
+	if (unshare_flags & CLONE_NEWNS) {
+		retval = unshare_namespace(tsk);
+		if (retval)
+			goto bad_unshare_cleanup_sighand;
+	}
+
+unshare_unlock_task:
+	task_unlock(tsk);
+
+unshare_out:
+	return retval;
+
+bad_unshare_cleanup_sighand:
+	/*
+	 * If signal handlers were unshared (private copy was made),
+	 * clean them up (delete the private copy) and restore
+	 * the task to point to the old, shared, value.
+	 */
+	if (unshare_flags & CLONE_SIGHAND) {
+		exit_sighand(tsk);
+		tsk->sighand = sighand;
+		atomic_inc(&sighand->count);
+	}
+
+bad_unshare_cleanup_mm:
+	/*
+	 * If mm struct was unshared (private copy was made),
+	 * clean it up (delete the private copy) and restore
+	 * the task to point to the old, shared, value.
+	 */
+	if (unshare_flags & CLONE_VM) {
+		if (tsk->mm)
+			mmput(tsk->mm);
+		tsk->mm = mm;
+		atomic_inc(&mm->mm_users);
+	}
+	goto unshare_unlock_task;
+
+bad_unshare_invalid_val:
+	retval = -EINVAL;
+	goto unshare_out;
+}



