LWN: Comments on "process_madvise(), pidfd capabilities, and the revenge of the PIDs"

process_madvise(), pidfd capabilities, and the revenge of the PIDs

nix — Mon, 03 Feb 2020 20:56:18 +0000

This is a very silly argument, though. You can easily wrap the pidfd_open/pidfd-operation/_close in a function, and as for CPU time wastage -- two transitions to kernel space will likely be utterly minor compared to the large number of transitions that almost all pidfd operations are likely to incur; and if they only incur one, then the tripling of CPU time is *still* utterly irrelevant unless they're being called on literally millions of processes -- in which case one must wonder what on earth they are doing, and whether they should be using some other mechanism based around pgids or cgroups or something so they don't have to do something so ridiculously inefficient as calling *any* one syscall millions of times on millions of foreign processes.
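A minimal sketch of the wrapper described above, assuming Linux 5.3 or later. The helper name `signal_by_pid` is hypothetical, and pidfd_send_signal(2) merely stands in for "some pidfd operation". The fallback syscall numbers are the generic Linux ones and do not apply to every architecture.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback numbers for older headers (generic Linux numbering; an
 * assumption that does not hold on every architecture). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif

/* Hypothetical helper: open a pidfd for pid, perform one operation
 * through it (here: send sig), then close it again.
 * Returns 0 on success, -1 with errno set on failure. */
static int signal_by_pid(pid_t pid, int sig)
{
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    int ret = (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);

    int saved = errno;   /* close() may overwrite errno */
    close(pidfd);
    errno = saved;
    return ret;
}
```

A caller that only has a PID would then write something like `signal_by_pid(pid, SIGTERM)` and never see the two extra syscalls.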

process_madvise(), pidfd capabilities, and the revenge of the PIDs

NYKevin — Thu, 23 Jan 2020 19:03:22 +0000

It was mostly a tongue-in-cheek suggestion, and I'm pretty sure actually doing this would be a Bad Idea. But if you really wanted to do it, you could probably use the signals API to deal with those issues. That is:

* Create a new SIGFOO value, and make all of the sigaction() et al. functions accept it.
* Whenever someone calls process_syscall(), the target process receives a SIGFOO.
* The default behavior of the SIGFOO is to call syscall(...) as if from a signal handler.
* If the signal is masked, handled, or ignored, it behaves as you would expect (syscall() is not invoked).
* (Optional) Masking, handling, or ignoring the signal requires a capability and/or UID=0, so that containers cannot veto the actions of their supervisors. Alternatively, it can't be masked, handled, or ignored at all, so that it behaves like SIGSTOP and SIGKILL.

“which”?

mirabilos — Wed, 22 Jan 2020 17:19:50 +0000

No, why?

It’s really common practice to do things this way (maybe you don’t know this as you seem to be a C++ programmer, but in the C/UNIX world, we do).

Take mmap, for example: when called with MAP_ANONYMOUS in flags, it ignores the fd argument instead of checking it.
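The mmap behaviour can be checked with a small sketch like the one below. The helper name `anon_map_works` and the 4096-byte length are illustrative assumptions. The behaviour shown is Linux's; the mmap(2) man page notes that some other implementations require fd to be -1 with MAP_ANONYMOUS, so portable code should still pass -1.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: create a small anonymous mapping while passing
 * an arbitrary value as fd. Returns 0 if the mapping could be created
 * and written to, -1 otherwise. */
static int anon_map_works(int fd)
{
    size_t len = 4096;   /* illustrative length, at most one page on common systems */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    ((char *)p)[0] = 1;  /* touch the mapping to show it is usable */
    munmap(p, len);
    return 0;
}
```

On Linux, both `anon_map_works(-1)` and a call with a descriptor number that is not open, such as `anon_map_works(12345)`, are expected to succeed, because the fd is never looked up.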

“which”?

james — Wed, 22 Jan 2020 16:59:01 +0000

Perhaps the call should return an error if the pid is non-zero and pidfd isn't -1.
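A sketch of what that check might look like, written as a plain userspace function rather than kernel code. The names `target_kind` and `pick_target` are hypothetical, and the choice of EINVAL is an assumption.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical result type: which argument names the target process. */
enum target_kind { TARGET_INVALID = -1, TARGET_PIDFD = 0, TARGET_PID = 1 };

/* pid is used only when pidfd == -1. The stricter variant suggested
 * here also rejects calls that supply both a pidfd and a non-zero pid. */
static enum target_kind pick_target(int pidfd, pid_t pid)
{
    if (pidfd == -1)
        return TARGET_PID;

    if (pid != 0) {
        errno = EINVAL;   /* both set: refuse rather than guess */
        return TARGET_INVALID;
    }
    return TARGET_PIDFD;
}
```

With this rule, `pick_target(5, 1234)` fails instead of silently ignoring the pid.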

“which”?

mirabilos — Wed, 22 Jan 2020 15:20:40 +0000

> the pidfd variable

No, there’s no extra variable, this is just a parameter.

You’d either call it with…

process_madvise(pidfd, /* ignored */ 0, …)

… or with…

process_madvise(-1, pid, …)

… so this question never comes up.

> bugs where both the pidfd and pid would be set

This is a very common interface, and the answer is trivial, as I stated in the comment above: pid is used iff pidfd == -1 (meaning if it’s not -1, pid will be ignored). This is a basic standard technique.

“which”?

NAR — Wed, 22 Jan 2020 12:37:43 +0000

That could be confusing - the pidfd variable would be used not only to store a pidfd, but also to select which interface is called. Also, this interface would invite bugs where both the pidfd and pid would be set, maybe even to different processes. Documentation could make it explicit; however, programmers have a habit of not reading the documentation and writing code based on assumptions formed by reading the API.

My idea was to use the flags argument to select between pid and pidfd - but I guess that flag argument should take the same values as madvise accepts. If only C had function overloading... But it doesn't, so maybe bite the bullet and create two functions: madvise_by_pid and madvise_by_pidfd.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

cyphar — Wed, 22 Jan 2020 06:46:52 +0000

Sure, waitid(2) obviously wasn't written with pidfds in mind and adding them later was to avoid making a new syscall -- my point was that the resulting interface is identical to the one being proposed (even the proposed constants -- P_PID and P_PIDFD -- are the same):

int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);

int process_madvise(int which, pid_t pid, void *addr, size_t length, int advice, unsigned long flag);

Where @which is equivalent to @idtype. Thus, it is arguably only as ugly as the waitid(2) interface. Also, they didn't re-use a flag argument -- waitid(2) explicitly had "type switching" from the outset (though it was intended to be used to differentiate between process groups and PIDs).
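For comparison, here is a minimal sketch of the existing waitid(2) type switch, assuming Linux 5.4 or later for P_PIDFD. The helper names are hypothetical, and the fallback definitions for P_PIDFD and SYS_pidfd_open are assumptions for older headers.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values, generic Linux numbering). */
#ifndef P_PIDFD
#define P_PIDFD 3
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/* idtype selects how id is interpreted, the role "which" would play in
 * the proposed process_madvise(). Returns the exit status, or -1. */
static int wait_exit_status(idtype_t idtype, id_t id)
{
    siginfo_t info = {0};
    if (waitid(idtype, id, &info, WEXITED) != 0)
        return -1;
    return info.si_code == CLD_EXITED ? info.si_status : -1;
}

/* Hypothetical helper: fork a child that exits immediately with code. */
static pid_t spawn_child(int code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);
    return pid;
}

/* Reap a child through a pidfd when possible, otherwise through its PID. */
static int reap_child(pid_t pid)
{
    int status = -1;
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd >= 0) {
        status = wait_exit_status(P_PIDFD, (id_t)pidfd);
        close(pidfd);
    }
    if (status < 0)   /* no pidfd support: fall back to the plain PID */
        status = wait_exit_status(P_PID, (id_t)pid);
    return status;
}
```

The same `wait_exit_status()` body serves both cases; only the idtype argument changes how the id is read.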

process_madvise(), pidfd capabilities, and the revenge of the PIDs

cyphar — Wed, 22 Jan 2020 06:41:06 +0000

That already exists with the pidfd_open(2) syscall, the concern is that it's needless overhead (and a meaningless gesture) for userspace to have to create a pidfd if they are just going to use it as though it were a PID (with all of the issues related to it). That's what Kirill Tkhai was referring to when they wrote:

> In this moment the tracer knows everything about tracee state, and pidfd brackets pidfd_open() and close() around actual action look just stupid, and this is cpu time wasting.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

roc — Wed, 22 Jan 2020 02:44:09 +0000

The current ptrace code is designed around that relationship, and I assume that any significant changes to ptrace code are going to be hard. I'd love to be wrong!

process_madvise(), pidfd capabilities, and the revenge of the PIDs

roc — Wed, 22 Jan 2020 02:42:51 +0000

There are many states that thread could be in that would be problematic. Most of the time the thread would be blocked in the kernel for some reason, in which case you'd want to queue the syscall for execution when it would next return to userspace, but that might never happen.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

KaiRo — Wed, 22 Jan 2020 01:25:32 +0000

I have no clue about kernel-level coding, but why not have a function like pidfd_from_pid(pid) that would create/take a pidfd for a given pid (with all the ambivalence a pid has), which you then hand over to new calls that only support a pidfd? From the command line that should be fine, and other code should convert to pidfds in the longer run anyhow.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Paf — Tue, 21 Jan 2020 22:52:58 +0000

Why do you think breaking the ptrace parent relationship would be so hard?

I’m not really familiar with that part of ptrace, but I have had to look at the signal handling dance and it is a *mess*. (Not incorrect in any way, just... messy)

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Paf — Tue, 21 Jan 2020 22:49:50 +0000

That’s an interesting point about the PID uncertainty for those programs, thanks. Though at the same time, the pidfd isn’t any *worse* than the questionable pid... hmm.

This might be a situation where the kernel commitment to backwards compatibility implicitly pushes a less elegant interface. (I say implicitly because there is no explicit backwards compatibility issue here, but I think the philosophy arguably applies because not having both PIDs and pidfds implicitly pushes people to the new interface.)

Hmm.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

NYKevin — Tue, 21 Jan 2020 21:17:51 +0000

Well... you *could* have something like this:

long process_syscall(int pidfd, long number, ...)

It would behave as-if process had invoked syscall(2) with the remaining arguments. Maybe you also whitelist number to syscalls that actually make sense to invoke remotely, and are unlikely to cause massive reentrancy or threading issues.
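The whitelisting step could be sketched as below. process_syscall() does not exist, so this is only an illustration of the check it might perform: the function name and the particular syscalls in the table are assumptions, not a proposal from the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Hypothetical whitelist of syscall numbers that might make sense to
 * invoke on behalf of another process. The selection is illustrative. */
static const long remote_allowed[] = { SYS_madvise, SYS_mlock, SYS_munlock };

/* Returns true if number appears in the whitelist. */
static bool remote_syscall_allowed(long number)
{
    size_t n = sizeof remote_allowed / sizeof remote_allowed[0];
    for (size_t i = 0; i < n; i++) {
        if (remote_allowed[i] == number)
            return true;
    }
    return false;
}
```

Under these assumptions a request for SYS_madvise would pass the check, while something like SYS_execve would be refused before any remote execution was attempted.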

“which”?

mirabilos — Tue, 21 Jan 2020 21:12:49 +0000

This breaks types, though.

“which” is an int, sure, but the PID is of type pid_t, while pidfds as file descriptors are of type int.

I’d rather have it…

int process_madvise(int pidfd, pid_t pid, …

… and use “pid” iff pidfd == -1 (which is the usual closed/invalid fd number).

process_madvise(), pidfd capabilities, and the revenge of the PIDs

roc — Tue, 21 Jan 2020 19:58:24 +0000

I think pidfds for ptrace() would be good, but I don't think they immediately solve the major issues with ptrace.

I would really like the ability to hand-off ptrace control to other processes by passing them a pidfd, but that would require lots more work. Something like:
* Make sure pidfd_wait or whatever can read the special ptrace status events.
* When pidfds are used, break the "ptrace parent" relationship in the kernel so *any* process with a pidfd for the tracee can ptrace() it or get the ptrace status events. (I bet this is *really* hard.)
But it would make much-requested rr features, like the ability to start debugging an in-progress rr recording without interrupting it, much more tractable.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

roc — Tue, 21 Jan 2020 19:52:10 +0000

Maybe hack pidfds into ptrace and other syscalls by passing "-pidfd" as the pid?

process_madvise(), pidfd capabilities, and the revenge of the PIDs

rvolgers — Tue, 21 Jan 2020 19:32:55 +0000

If ever there was a good use case for pidfds, it's ptrace. That is one ugly interface, with all the signal magic and pseudo-reparenting.

Of course, porting it would probably take quite a lot of effort, never mind deprecating the old interface so all the cruft can be removed, but one can dream.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

josh — Tue, 21 Jan 2020 17:35:12 +0000

waitid has a flag to avoid having to create a new version of the syscall. It already accepted flags, so adding a flag to accept a pidfd in place of the existing pid argument made some sense.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

cyphar — Tue, 21 Jan 2020 15:50:15 +0000

> In this moment the tracer knows everything about tracee state, and pidfd brackets pidfd_open() and close() around actual action look just stupid, and this is cpu time wasting.

While I do understand wanting to maintain support for PIDs in newer syscalls (after all, in some cases you only get a PID from a user or other program), I don't think that a tracer program would be written in the way described. It's far more likely that the tracer would already have a pidfd open for each process it is tracing. But then again, since ptrace (and ptrace-related syscalls) doesn't use pidfds, it would also be fair to say that the interface mismatch would make the code ugly no matter what.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

cyphar — Tue, 21 Jan 2020 15:45:33 +0000

The counter-argument is waitid(2), which has basically the exact same interface.

Additionally, doing the switching in user-space isn't all that fun. The syscall has to take pidfds, otherwise there's no point to permitting pidfds to be used with the interface (getting the pid of a pidfd is painful, but it also immediately becomes susceptible to pid recycling attacks -- the thing pidfds were meant to block). And that would annoy the people who are unhappy with requiring pidfds for new syscalls (they don't want to take up file handles, and it's likely that for their programs creating the pidfd is a meaningless gesture because they have no way of actually being sure the pid is correct).
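To illustrate why getting the pid of a pidfd is painful: one way to do it from userspace is to parse the "Pid:" line of /proc/self/fdinfo/<fd>. The sketch below assumes procfs is mounted at /proc; the helper name `pid_from_pidfd` is hypothetical, and the fallback syscall number is the generic Linux one. As noted above, the returned pid is immediately subject to recycling, so this defeats the purpose of holding a pidfd.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback for older headers (generic Linux numbering; an assumption). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/* Hypothetical helper: recover the PID behind a pidfd by reading the
 * "Pid:" line from /proc/self/fdinfo/<pidfd>.
 * Returns -1 if the file cannot be read or has no such line. */
static pid_t pid_from_pidfd(int pidfd)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/self/fdinfo/%d", pidfd);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char line[256];
    int pid = -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "Pid: %d", &pid) == 1)
            break;   /* found the line naming the process */
    }
    fclose(f);
    return (pid_t)pid;
}
```

Opening a file, scanning text and closing it again is a lot of work for one integer, and nothing stops the process from exiting between the read and the next use of the number.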

process_madvise(), pidfd capabilities, and the revenge of the PIDs

dskoll — Tue, 21 Jan 2020 14:46:06 +0000

Hah! I have never programmed on Windows and know nothing about its API, and feel a little ashamed for having rediscovered that. :)

Thanks for the info.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

Paf — Tue, 21 Jan 2020 14:28:20 +0000

I am very curious to see how this gets worked out; my instinct is that the desire for the switchable interface implemented in the calls is deeply silly. Unless there are significant performance or other (security?) implications, this desire for an alternate interface should be solved with wrappers and/or macros, not by having a switching argument baked into the basic call.

That is to say, there’s nothing silly about wanting both interfaces, but having a “what is this other argument” switch argument in the call, when not strictly necessary, seems... yuck? Perhaps opinions differ :)

process_madvise(), pidfd capabilities, and the revenge of the PIDs

rvolgers — Tue, 21 Jan 2020 13:26:26 +0000

Or, as Windows calls it, CreateRemoteThreadEx.

process_madvise(), pidfd capabilities, and the revenge of the PIDs

dskoll — Tue, 21 Jan 2020 12:29:15 +0000

Why not take this to its logical extreme?

execute_as_process(int pidfd, void (*func)());

I am of course not completely serious and realize the difficulty of implementing this as well as the security implications, but it seems we are approaching this asymptotically.