Running code within another process's address space
In truth, alternatives to ptrace() already exist for some tasks. The cross-memory attach system calls were merged for 3.2 in 2011 as process_vm_readv() and process_vm_writev(). As their names would suggest, they allow one process to read from and write to another process's memory. Those system calls satisfy many needs, but fall short when more invasive access to another process's address space is required. Sometimes, it seems, there is no alternative to running code within the target address space.
Vagin's patch set gives a couple of examples of where this access would be useful. User-mode kernels, such as User-mode Linux and gVisor, have to be able to intercept system calls made by a sandboxed process and, possibly, run them in the address space of that process. The Checkpoint/Restore in User space project needs to reach deeply within a process to extract all of the information needed to checkpoint it. Both use cases are currently handled with ptrace() but, once again, better and faster alternatives are wanted.
The alternative proposed by Vagin is a new system call:
```c
int process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
                    siginfo_t *siginfo, sigset_t *sigmask, size_t sizemask);
```
A successful call will cause the calling process's address space to be changed to that of the process identified by pid. The cover letter notes that using a pidfd might be preferable; that would make this system call inconsistent with process_vm_readv() and process_vm_writev(), though. The values in uctx are used to load the processor registers (including the instruction pointer) before resuming execution in the new address space — an important step, since using the previous instruction pointer from the old address space is unlikely to yield satisfactory results in the new address space.
If flags is zero, process_vm_exec() will change the address space, then resume execution as indicated by uctx; that execution will continue until the process either makes a system call or receives a signal. Either way, the old address space will be restored and process_vm_exec() will return to the caller. The siginfo structure will describe the event that interrupted execution in the other address space; if it's a system call, siginfo will be made to look as if a SIGSYS signal had been received.
If, instead, flags contains PROCESS_VM_EXEC_SYSCALL, the purpose of the call is to invoke a system call within the target process's address space. In this case, uctx should contain the system call number and arguments in the appropriate registers, as would be the case for a real system call. The address space will be switched for the duration of the system call, then restored before returning to the caller.
This patch series was posted as a proof of concept with the idea of getting comments on the proposed API. Jann Horn was quick to respond that the proposed system call does not appear to fit the stated use cases well; it is too much for one and not enough for the other. For the case of running code within a different address space (as systems like User-mode Linux do), he suggested, creating a whole new process is overkill; it might be better to have a system call that allows the construction of new address spaces separately. For the checkpoint/restore case, meanwhile, there may still be a need to access resources within a process beyond its address space, though he did not say which resources those might be. Vagin responded that a relatively generic system call seemed better than a whole set of specialized ones, even if the generic alternative is not a perfect fit for all use cases.
Florian Weimer did have another resource in mind, though, one that would be useful for the GNU C library. There is a difference between how Linux implements setuid() and what POSIX requires: Linux only changes the credentials for the calling thread, while POSIX specifies that it must change the credentials for all of the threads running in a process. Currently, glibc implements the POSIX semantics on Linux by sending signals to all threads so that each can call setuid(), which is less than ideal. It would be much nicer to be able to call setuid() within the context of each thread without actually interrupting the threads. Such a feature could also be useful for implementing memory barriers, he said.
There is clearly some tension here between creating a feature that would be
useful in some contexts and trying to solve a larger and more complex
problem. In such cases, developers must pick their path carefully; trying
to do too much is a good way to ensure that nothing actually gets far
enough to land in the mainline kernel. So what will happen with
process_vm_exec() is far from clear at this point; it may
eventually find its way to acceptance, but it could change form
considerably before that happens.
| Index entries for this article | |
|---|---|
| Kernel | System calls/process_vm_exec() |
Posted Apr 16, 2021 15:58 UTC (Fri)
by rvolgers (guest, #63218)
[Link] (3 responses)

Posted Apr 16, 2021 16:46 UTC (Fri)
by josh (subscriber, #17465)
[Link] (2 responses)

Posted Apr 16, 2021 20:13 UTC (Fri)
by luto (guest, #39314)
[Link] (1 responses)

Posted Apr 17, 2021 0:47 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]

Solaris-style Doors
Posted Apr 16, 2021 17:50 UTC (Fri)
by klbrun (subscriber, #45083)
[Link] (1 responses)

Posted Apr 17, 2021 1:26 UTC (Sat)
by CChittleborough (subscriber, #60775)
[Link]
The Doors facility needs both (1) kernel support and (2) a non-trivial user-space library. Door servers need that library to manage a thread pool in ways that almost require tight integration with the threads library. This is much easier when one organization produces both the kernel and the core libraries than in the Linux community. The Linux port was only “alpha quality”, and sadly work on it seems to have stopped in 2003.
(Full disclosure: I created that Wikipedia article.)

Posted Apr 17, 2021 0:02 UTC (Sat)
by roc (subscriber, #30627)
[Link] (1 responses)

Posted Apr 22, 2021 16:50 UTC (Thu)
by avagin (subscriber, #63724)
[Link]

Some questions on Running code within another process's address space
Posted Apr 17, 2021 5:57 UTC (Sat)
by dongmk (subscriber, #131668)
[Link] (3 responses)
1. If I understand correctly, the current `process_vm_exec` can only execute code that is already in the target process's address space, which could be inconvenient for tasks such as inspecting the address space content. Or does `process_vm_exec` provide some mechanism to "bring" its own code into the switched address space? In other words, the calling process needs to inject the code into the target (perhaps with `process_vm_writev`) before it invokes `process_vm_exec` to execute the injected code.
2.
> that execution will continue until the **process** either makes a system call or receives a signal.
Is this **process** the calling process or the target process? Or both? Will the target process (and the threads in the target process) pause execution upon the system call? If so, how could the calling process resume the target's execution? If not, the calling process could miss some system calls made by the target, right?
3. I think things will be complicated when there are multiple threads in the calling process. For example, when one thread calls `process_vm_exec` to switch the address space, other threads will run in the "wrong" address space and fail.
Sorry for asking so many questions.

Some questions on Running code within another process's address space
Posted Apr 17, 2021 9:23 UTC (Sat)
by smurf (subscriber, #17840)
[Link] (2 responses)
> Is this **process** the calling process or the target process? Or both?
Umm, the target, of course. When the target [makes a syscall / gets a signal], the system call returns, as described in the text. It obviously cannot do that if the caller continues execution in any way.

Some questions on Running code within another process's address space
Posted Apr 17, 2021 10:57 UTC (Sat)
by dongmk (subscriber, #131668)
[Link] (1 responses)
It may be helpful to confirm what the caller will do during `process_vm_exec`; here are two possibilities:
1) the caller blocks until the target [makes a syscall / gets a signal];
2) the caller resumes execution by jumping to the IP specified in `uctx` in the target's address space; and the caller somehow gets back to the original context in the original address space when the target [makes a syscall / gets a signal].
For 1), emmm, there seems to be no `exec` effect since the original address space is restored when `process_vm_exec` returns; the caller simply blocks to wait for the target's syscall/signal.
For 2), a new control flow seems to have emerged after `process_vm_exec`, and the new control flow is terminated (at any time) upon the target's syscall/signal.

Some questions on Running code within another process's address space
Posted Apr 23, 2021 5:18 UTC (Fri)
by avagin (subscriber, #63724)
[Link]
I would rephrase this:
2) the caller resumes execution by jumping to the IP specified in `uctx` in the target's address space; and the caller somehow gets back to the original context in the original address space when it makes a syscall or gets a signal.
The target process is used only to grab its address space. process_vm_exec doesn't stop it and doesn't change its state (registers, signals, fpu, etc).
There are a few examples; I think they can help to understand how this works:
https://lwn.net/ml/linux-kernel/20210414055217.543246-5-a...

Posted Apr 17, 2021 9:15 UTC (Sat)
by flussence (guest, #85566)
[Link]

Posted Apr 17, 2021 15:09 UTC (Sat)
by pm215 (subscriber, #98099)
[Link]

Posted Apr 19, 2021 16:54 UTC (Mon)
by jnewsome (guest, #151740)
[Link]

Posted Apr 22, 2021 2:01 UTC (Thu)
by alkbyby (subscriber, #61687)
[Link]
One particularly nice feature of this is that it is entirely compatible with all kinds of libraries. Today, libraries are essentially unable to use signals because it is nearly impossible to arbitrate between libraries over who is using which signal number. When there are no signal numbers or any state (e.g. altstack, sigaction) involved in the first place, there does not appear to be any kind of compatibility trouble.
Another thing that would be nifty is being able to somehow specify not saving/restoring the huge SIMD state, to enable cheaper or more performance-sensitive uses of this facility (e.g. garbage collectors might choose to use it too), and to save stack. It seems inevitable that safe use of this facility requires the caller to allocate a stack for each thread's "remote call", and not having to bother with SIMD registers would save a nontrivial number of bytes.