| From: |
| "Christian Brauner (Amutable)" <brauner-AT-kernel.org> |
| To: |
| Jann Horn <jannh-AT-google.com>, Linus Torvalds <torvalds-AT-linuxfoundation.org>, Oleg Nesterov <oleg-AT-redhat.com> |
| Subject: |
| [PATCH RFC v3 0/4] exec: introduce task_exec_state for exec-time metadata |
| Date: |
| Wed, 20 May 2026 23:48:51 +0200 |
| Message-ID: |
| <20260520-work-task_exec_state-v3-0-69f895bc1385@kernel.org> |
| Cc: |
| "David Hildenbrand (Arm)" <david-AT-kernel.org>, Andrew Morton <akpm-AT-linux-foundation.org>, Qualys Security Advisory <qsa-AT-qualys.com>, Kees Cook <kees-AT-kernel.org>, Minchan Kim <minchan-AT-kernel.org>, linux-mm-AT-kvack.org, Suren Baghdasaryan <surenb-AT-google.com>, Lorenzo Stoakes <ljs-AT-kernel.org>, "Liam R. Howlett" <liam-AT-infradead.org>, Vlastimil Babka <vbabka-AT-kernel.org>, Mike Rapoport <rppt-AT-kernel.org>, Michal Hocko <mhocko-AT-suse.com>, "Christian Brauner (Amutable)" <brauner-AT-kernel.org> |
This series relocates the dumpable mode and the user_namespace
captured at execve() from mm_struct onto a new per-task
task_exec_state structure that stays attached to the task for its
full lifetime.

__ptrace_may_access() and several /proc owner / visibility checks
need to consult two pieces of state for any observable task,
including zombies that have already gone through exit_mm(): the
dumpable mode and the user namespace captured at execve(). Both
live on mm_struct today, which exit_mm() clears from the task long
before the task is reaped.

A reader that races with do_exit() observes task->mm == NULL and
either fails the check or falls back to init_user_ns, which denies
legitimate access to non-dumpable zombies that were running in a
nested user namespace.

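The failure mode described above can be sketched as a small userspace
model. This is an illustration only, not kernel code: every type and
function name below (task_model, ns_via_mm(), ns_via_exec_state(),
model_exit_mm(), init_ns_model) is hypothetical and merely mirrors the
shape of the problem.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are real kernel definitions. */
struct user_ns { const char *name; };
struct mm_model { struct user_ns *user_ns; };
struct exec_state_model { struct user_ns *user_ns; };

struct task_model {
	struct mm_model *mm;                  /* cleared when the task exits */
	struct exec_state_model *exec_state;  /* kept for the task's lifetime */
};

/* Stand-in for init_user_ns. */
struct user_ns init_ns_model = { "init" };

/* Old scheme: with no mm there is no recorded namespace, so fall back. */
struct user_ns *ns_via_mm(const struct task_model *t)
{
	return t->mm ? t->mm->user_ns : &init_ns_model;
}

/* New scheme: the exec state is always attached, zombie or not. */
struct user_ns *ns_via_exec_state(const struct task_model *t)
{
	assert(t->exec_state != NULL);
	return t->exec_state->user_ns;
}

/* Models only the "task->mm becomes NULL" effect of exit_mm(). */
void model_exit_mm(struct task_model *t)
{
	t->mm = NULL;
}
```

In this model, once model_exit_mm() has run on a task that executed in
a nested namespace, ns_via_mm() returns the init namespace while
ns_via_exec_state() still returns the nested one.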
mm_struct loses ->user_ns and the dumpability bits in ->flags.
MMF_DUMPABLE_BITS is reserved so that the MMF_DUMP_FILTER_* layout
exposed via /proc/<pid>/coredump_filter stays stable.
task->user_dumpable and its exit_mm() snapshot are removed.

task_exec_state is the privilege domain established by an execve()
[1]. Within a thread group it is shared via refcount; across thread
groups each task has its own:
- CLONE_VM siblings (thread-group members, io_uring workers)
  refcount-share the parent's exec_state.
- Non-CLONE_VM clones (fork(), vfork() without CLONE_VM)
  allocate a fresh exec_state inheriting the parent's dumpable
  mode and user_ns.
- execve() in the child allocates a fresh instance and installs
  it under task_lock + exec_update_lock via
  task_exec_state_replace().
- Credential changes (setresuid, capset, ...) and
  prctl(PR_SET_DUMPABLE) update dumpability on the current
  task's exec_state, i.e. on the thread group's shared instance.
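
The sharing rules listed above can be modelled in userspace roughly as
follows. This is a sketch under stated assumptions, not the code from
the series: the names (exec_state_model, es_alloc(), es_put(),
model_clone(), model_exec(), model_set_dumpable()) are hypothetical,
the refcount is a plain int rather than an atomic, the dumpable mode is
a bare int, and the task_lock + exec_update_lock protection is left
out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the per-task exec state. */
struct exec_state_model {
	int refs;             /* plain int here; a real refcount is atomic */
	int dumpable;         /* stand-in for the dumpable mode */
	const void *user_ns;  /* opaque stand-in for the user namespace */
};

struct task_model { struct exec_state_model *es; };

struct exec_state_model *es_alloc(int dumpable, const void *user_ns)
{
	struct exec_state_model *es = malloc(sizeof(*es));

	if (!es)
		return NULL;
	es->refs = 1;
	es->dumpable = dumpable;
	es->user_ns = user_ns;
	return es;
}

void es_put(struct exec_state_model *es)
{
	assert(es && es->refs > 0);
	if (--es->refs == 0)
		free(es);
}

/* CLONE_VM: share the parent's instance. Otherwise: fresh copy. */
bool model_clone(struct task_model *child, const struct task_model *parent,
		 bool clone_vm)
{
	if (clone_vm) {
		parent->es->refs++;
		child->es = parent->es;
		return true;
	}
	child->es = es_alloc(parent->es->dumpable, parent->es->user_ns);
	return child->es != NULL;
}

/* execve(): install a fresh instance and drop the old reference. */
bool model_exec(struct task_model *t, int dumpable, const void *user_ns)
{
	struct exec_state_model *fresh = es_alloc(dumpable, user_ns);

	if (!fresh)
		return false;
	es_put(t->es);
	t->es = fresh;
	return true;
}

/* Credential changes and PR_SET_DUMPABLE act on the shared instance. */
void model_set_dumpable(struct task_model *t, int dumpable)
{
	t->es->dumpable = dumpable;
}
```

In this model a dumpability change made by a forked child leaves the
parent untouched, while a change made by a CLONE_VM sibling is visible
to the whole group, which matches the split the list describes.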

Behavioral change:

Kernel threads that briefly use a user mm via kthread_use_mm() no
longer inherit dumpability from the borrowed mm. Kthreads are not
ptraceable (PF_KTHREAD short-circuits __ptrace_may_access), so this
is observable only via /proc surfaces that a sufficiently privileged
reader can reach.

[1] https://lore.kernel.org/r/CAHk-=wj+NgoDH3GSicJ140SV8OoDd7...

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
Changes in v3:
- Restore alloc-fresh-and-inherit semantics for non-CLONE_VM clones.
  CLONE_VM siblings still refcount-share; fork() and other
  non-CLONE_VM clones get a fresh exec_state that inherits the
  parent's dumpable mode and user_ns. The v2 "every clone
  refcount-shares" model would have let any forked process in an
  Android zygote64 subtree influence dumpability of its siblings
  via prctl(PR_SET_DUMPABLE).
- Link to v2: https://patch.msgid.link/20260520-work-task_exec_state-v2...

Changes in v2:
- Drop dup-on-fork for non-CLONE_VM clones: every clone() variant
  refcount-shares the parent's task_exec_state; only execve()
  allocates a fresh one. See "Behavioral changes" in the cover
  letter for the implications.
- Switch commit_creds() to update dumpability on the new
  task_exec_state (instead of dropping the set_dumpable() call
  entirely as in v1). Drop the explicit smp_wmb()/smp_rmb() pair;
  RCU acquire/release on the cred pointer provides the ordering.
- Link to v1: https://patch.msgid.link/20260516-work-exit_mm-v1-1-76bcc...

---
Christian Brauner (Amutable) (4):
      sched/coredump: introduce enum task_dumpable
      exec: introduce struct task_exec_state
      ptrace: add ptracer_access_allowed()
      exec_state: relocate dumpable information

 arch/arm64/kernel/mte.c          |   6 +-
 drivers/firmware/efi/efi.c       |   1 -
 fs/coredump.c                    |  22 +++-----
 fs/exec.c                        |  39 ++++++-------
 fs/pidfs.c                       |  23 +++-----
 fs/proc/base.c                   |  39 ++++++-------
 include/linux/binfmts.h          |   2 +
 include/linux/coredump.h         |   4 ++
 include/linux/mm_types.h         |   9 ++-
 include/linux/ptrace.h           |   1 +
 include/linux/sched.h            |   6 +-
 include/linux/sched/coredump.h   |  47 ++++------------
 include/linux/sched/exec_state.h |  29 ++++++++++
 init/init_task.c                 |  10 ++++
 kernel/Makefile                  |   2 +-
 kernel/cred.c                    |   3 +-
 kernel/exec_state.c              | 116 +++++++++++++++++++++++++++++++++++++++
 kernel/exit.c                    |   1 -
 kernel/fork.c                    |  32 +++++++++--
 kernel/kthread.c                 |   1 -
 kernel/ptrace.c                  |  53 ++++++++++++------
 kernel/sys.c                     |   6 +-
 mm/init-mm.c                     |   1 -
 23 files changed, 301 insertions(+), 152 deletions(-)
---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260520-work-task_exec_state-83209d8b3e53