Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 5:10 UTC (Thu) by josh (subscriber, #17465)
In reply to: Attaching file descriptors to processes with CLONE_FD by Cyberax
Parent article: Attaching file descriptors to processes with CLONE_FD

> without the usual write-to-pipe-from-SIGCHLD trick

That's actually the original motivation for this patch. Thiago Macieira has a userspace library that emulates a subset of clonefd by installing a SIGCHLD handler and sending clonefd_info over a pipe. Having clone4 available allows that library to work in a race-free way, and not take over process-wide SIGCHLD handling.
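
For readers unfamiliar with the trick being replaced, here is a minimal sketch of the write-to-pipe-from-SIGCHLD pattern. It is not Thiago Macieira's library; the function name and the single-byte message are assumptions made for illustration.

```python
import os
import select
import signal


def wait_for_child_via_self_pipe(child_exit_code):
    """Fork a child and learn of its exit through a pipe written by a SIGCHLD handler."""
    read_end, write_end = os.pipe()
    os.set_blocking(write_end, False)  # the handler must never block

    def on_sigchld(signum, frame):
        # Only wake up the main loop; the real work happens outside the handler.
        os.write(write_end, b"\0")

    # This replaces the process-wide SIGCHLD disposition, which is the
    # drawback mentioned above.
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(child_exit_code)  # child: exit immediately

        # Parent: the pipe becomes readable once the handler has run.
        select.select([read_end], [], [])
        os.read(read_end, 1)
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status)
    finally:
        signal.signal(signal.SIGCHLD, previous)
        os.close(read_end)
        os.close(write_end)
```

Calling wait_for_child_via_self_pipe(7) should return the child's exit code. Note that the handler is installed for the whole process, so any other code that forks children is affected as well.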

> A small note, though: clonefd_info needs to have size and version fields for future extensions.

That doesn't necessarily help, especially for a first version: nothing forces userspace to actually *check* those fields, and if they don't, then they'll end up being immutable anyway because the kernel can't break old userspace.

So our current plan is to require applications that can handle reading other structures from the clonefd to opt-in by passing an appropriate flag to clone4. Suggestions welcome, though.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 5:22 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (20 responses)

> That doesn't necessarily help, especially for a first version: nothing forces userspace to actually *check* those fields, and if they don't, then they'll end up being immutable anyway because the kernel can't break old userspace.
Certainly. But it would make it easier for me to simply read the whole packet and transmit it to another part of my program for more detailed parsing. Having a length field would help here.

> So our current plan is to require applications that can handle reading other structures from the clonefd to opt-in by passing an appropriate flag to clone4. Suggestions welcome, though.
That as well. But simply adding a length field and a bitmask with enabled options shouldn't be too hard and it might help people.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 5:30 UTC (Thu) by josh (subscriber, #17465) [Link] (19 responses)

> Certainly. But it would make it easier for me to simply read the whole packet and transmit it to another part of my program for more detailed parsing. Having a length field would help here.

Note that clonefd behaves like a UDP socket here, in that you have to read the entire clonefd_info structure in one read() call, and any unread bits are lost. We chose that behavior because in theory, you could hand the clonefd to multiple threads or processes, all of which can call read() in parallel, and they shouldn't race or receive partial data even if they don't read a complete structure.

So, in effect, there *is* a length field: it's the return value from read(). If you understand a longer structure, read() a longer structure, and on older kernels you'll get a short read.
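
A rough sketch of that reading pattern, emulated with a datagram socketpair since it has the same one-message-per-read behavior. The two struct layouts (INFO_V1, INFO_V2) and the field names are hypothetical and are not the real clonefd_info.

```python
import socket
import struct

# Hypothetical layouts for illustration only; not the real clonefd_info.
INFO_V1 = struct.Struct("=II")   # code, status
INFO_V2 = struct.Struct("=IIQ")  # code, status, plus one later-added field


def read_info(sock):
    """Ask for the longest structure we understand, then parse what came back."""
    data = sock.recv(INFO_V2.size)
    if len(data) >= INFO_V2.size:
        code, status, extra = INFO_V2.unpack(data)
        return {"code": code, "status": status, "extra": extra}
    # Short read: the sender only knows the older, shorter structure.
    code, status = INFO_V1.unpack(data[:INFO_V1.size])
    return {"code": code, "status": status, "extra": None}


def read_from_old_kernel():
    """Stand-in for an old kernel: it sends v1, so a v2-aware reader gets a short read."""
    kernel, reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with kernel, reader:
        kernel.send(INFO_V1.pack(1, 0))
        return read_info(reader)
```

Here the length returned by the read decides which layout to parse, so no explicit size field is needed in the structure itself.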

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 6:13 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

That's interesting.

But this design seems to make it impossible to pass large data (command line arguments, for example).

On the other hand, if the corresponding /proc directory could stay alive while there are open FDs to its process, then there's little need to have more information.

Perhaps you could at least add a 'flags' field so the library code could inspect what extra fields are available?

Also, what about getting a FD of an already running process?

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 6:18 UTC (Thu) by josh (subscriber, #17465) [Link] (10 responses)

> On the other hand, if the corresponding /proc directory could stay alive while there are open FDs to its process, then there's little need to have more information.

We've actually talked about adding an API that would get you from a clonefd to the dirfd of the corresponding /proc directory, from which you could use openat to open the proc files of that process.

> Perhaps you could at least add a 'flags' field so the library code could inspect what extra fields are available?

Something like that might happen the first time clonefd_info gets extended, depending on the added fields.

> Also, what about getting a FD of an already running process?

Planned for a subsequent patch.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 6:30 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

> We've actually talked about adding an API that would get you from a clonefd to the dirfd of the corresponding /proc directory, from which you could use openat to open the proc files of that process.
Why not just let the /proc entry stick around? The kernel can't reuse a PID while it's still held by an open FD, or am I wrong?

> Something like that might happen the first time clonefd_info gets extended, depending on the added fields.
Oh well... I guess it's OK, but I'd definitely prefer to have a straightforward "features" flag as the first member of the structure. It'll also condition clients to the fact that the structure might get extended someday.

> Planned for a subsequent patch.
Thanks!

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 7:43 UTC (Thu) by josh (subscriber, #17465) [Link] (8 responses)

> Why not just let the /proc entry stick around? The kernel can't reuse a PID while it's still held by an open FD, or am I wrong?

Among other reasons, because a file descriptor refers to the same process no matter who holds it, while if you have a PID, you have to know what PID namespace it's relative to. If the process the clonefd refers to is in another PID namespace from you, and for that matter if you've passed it around via UNIX sockets, the clonefd can still hand you a dirfd that unambiguously refers to the correct /proc directory, without any PIDs or translation involved. That seems far cleaner than, for instance, using an ioctl to get the PID from the clonefd, sprintf-ing that into a filename, and opening a file in /proc.
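
To illustrate the dirfd approach, here is a small sketch that opens a proc file relative to a directory descriptor, as openat does. The call that would turn a clonefd into a dirfd was only under discussion, so the example opens /proc/self directly as a stand-in; the function names are assumptions.

```python
import os


def read_proc_file_via_dirfd(proc_dirfd, name):
    """Open `name` relative to an already-open /proc/<process> directory (openat)."""
    fd = os.open(name, os.O_RDONLY, dir_fd=proc_dirfd)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


def read_own_status():
    """Stand-in for 'get dirfd from clonefd': open our own /proc directory."""
    dirfd = os.open("/proc/self", os.O_RDONLY | os.O_DIRECTORY)
    try:
        return read_proc_file_via_dirfd(dirfd, "status")
    finally:
        os.close(dirfd)
```

No PID is formatted into a path after the directory descriptor has been obtained, which is the property argued for above.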

Ideally, it ought to be possible to operate on processes without ever using a PID.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 7:57 UTC (Thu) by iq-0 (subscriber, #36655) [Link] (7 responses)

Radical idea: Make the file descriptor actually be the proc directory?

This would mean that opening that directory directly would automatically give you an equivalent handle, and that multiple processes could wait on the exit of that process (and receive basic accounting information). And if the file handle isn't opened with 'O_PATH', then the task directory would stick around; otherwise it may be automatically cleaned up.

And why not include the more verbose "getrusage" stats instead of only the "times" ones?

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 17:31 UTC (Thu) by josh (subscriber, #17465) [Link] (1 responses)

> Radical idea: Make the file descriptor actually be the proc directory?

We did consider that during the design. We also considered making the "get file descriptor for existing process" mechanism be open("/proc/$PID/clonefd") or similar. However, that approach would require that /proc is mounted, and would raise interesting interaction issues with containers. The kernel has been moving in the opposite direction; for instance, see execveat, which was created specifically as an alternative to execve of /proc/self/fd/N because the latter requires a mounted /proc.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 3, 2015 7:04 UTC (Fri) by iq-0 (subscriber, #36655) [Link]

I'm not saying it should be mounted, but I must admit that I don't know enough about the internals of procfs to know if you could open an fd internally.

I guess it would be a security risk, because a process could this way get a procfs handle, follow that to '..' and have access to the entire procfs even if the admin chose not to mount it (chroot escaping for root processes would probably become easier this way, though that is officially not considered a security mechanism).

But I don't see how the 'open(/proc/$pid/clonefd)' would introduce a dependency on procfs. Only the feature "open a process fd for an existing process" would require that. The fd itself wouldn't necessarily have to be tied to procfs in this case (as it would in the fd for task directory).

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 17:40 UTC (Thu) by josh (subscriber, #17465) [Link] (4 responses)

> And why not include the more verbose "getrusage" stats instead of only the "times" ones?

clonefd_info closely matches the information available in the siginfo structure that you receive when you get SIGCHLD. siginfo doesn't include the rusage information either. And including the full rusage information would make the structure *much* larger. Most callers are not likely to need that information, because they'll only care that the process exited and perhaps what exit code it exited with.

I would propose, instead, adding a version of getrusage that takes a clonefd. Since the task_struct will stick around as long as the clonefd remains open, those stats remain available.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 3, 2015 7:16 UTC (Fri) by iq-0 (subscriber, #36655) [Link] (3 responses)

SIGCHLD is a relic and its interface reflects what was available at the time.

Most programs only care about exit status. But for anybody who is interested in resource usage, I think nobody wants to know only the CPU timings. Most programs make do with them, but that is more a case of what they got, not what they want.

We are talking about 144 bytes vs 32 bytes per exited process that has not been reaped. The pointers inside the kernel for basic administration are probably more expensive to keep around.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 3, 2015 7:35 UTC (Fri) by iq-0 (subscriber, #36655) [Link]

That's not to say that getrusage is the best alternative per se; perhaps BSD process accounting or a similar structure would be a better match.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 3, 2015 8:20 UTC (Fri) by josh (subscriber, #17465) [Link] (1 responses)

That's making me tempted to just drop the CPU usage numbers and only keep the status and code, with getrusage available for anything else.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 3, 2015 8:49 UTC (Fri) by iq-0 (subscriber, #36655) [Link]

Sure, accounting information could easily be envisioned as a future improvement when the need arises (and when the need arises it will be much clearer what kind of information is desired).

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 10:13 UTC (Thu) by drysdale (guest, #95971) [Link] (6 responses)

> you could hand the clonefd to multiple threads or processes, all of which can call read() in parallel

By the way, I don't think this works with v2 of the patchset (presumably because of the code with the "EOF after first read" comment and the ppos update from simple_read_from_buffer).

(I'll send an email separately)

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 10:46 UTC (Thu) by josh (subscriber, #17465) [Link] (5 responses)

You can still have multiple processes or threads read() in parallel; one of them will get a clonefd_info, and the rest will get nothing. Same as if you handed a pipe end to multiple processes and wrote a byte to it. That's the intended behavior.
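
The pipe analogy can be sketched as follows. Two sequential non-blocking reads stand in for two parallel readers, which is a simplification; the function name is hypothetical.

```python
import os


def race_two_readers():
    """Write one byte to a pipe and let two readers try to consume it."""
    read_end, write_end = os.pipe()
    os.set_blocking(read_end, False)  # a reader that finds nothing should not hang
    try:
        os.write(write_end, b"x")
        results = []
        for _ in range(2):
            try:
                results.append(os.read(read_end, 1))
            except BlockingIOError:
                # The single byte was already consumed by the other reader.
                results.append(None)
        return results
    finally:
        os.close(read_end)
        os.close(write_end)
```

Only the first reader receives the byte; the second finds nothing to read, which mirrors one reader getting the clonefd_info and the rest getting nothing.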

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 11:41 UTC (Thu) by drysdale (guest, #95971) [Link] (4 responses)

Ah, OK, I misread/misunderstood your earlier comment.

However, I wonder if it might be more helpful for each reader to get its own clonefd_info. That would be more consistent with the behaviour of other special-FDs (timerfds, eventfds) and I suspect it would also make it easier to implement pdwait4()-compatible behaviour in the future (where I believe the FreeBSD folk intend for multiple callers each to get their own copy of the returned information).

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 17:46 UTC (Thu) by josh (subscriber, #17465) [Link] (3 responses)

I don't think that's consistent with timerfd or eventfd (or signalfd for that matter); those file descriptors become readable once the desired time or event occurs, and once you read the structure, it's no longer available for other readers.

If multiple readers each want to read their own clonefd_info, I would suggest that they should each open an independent fd for the process.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 20:44 UTC (Thu) by wodny (subscriber, #73045) [Link] (2 responses)

> If multiple readers each want to read their own clonefd_info, I would suggest that they should each open an independent fd for the process.

How can one do that? If clone() has to be used to open an fd and at the same time it creates a new process?

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 20:49 UTC (Thu) by josh (subscriber, #17465) [Link]

See comments elsewhere in this thread; we're going to add a new call in the future to get an fd for an existing process (given either a PID or an existing clonefd).

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 20:58 UTC (Thu) by wodny (subscriber, #73045) [Link]

OK, I've missed the 638824 comment.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 18:34 UTC (Thu) by hmh (subscriber, #3838) [Link] (3 responses)

If you're going to [in the future] have more than one type of message (struct) going through the channel, won't that at least require a common header for every message, that has the message type (please, at least 32 bits worth of it) and total size (so that it can be skipped/ignored when unknown) ?

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 20:43 UTC (Thu) by josh (subscriber, #17465) [Link] (2 responses)

Total size is the value returned from read(); always read the size of the structure you understand, and then process the size of the structure you get back, which may be smaller.

Additional information to distinguish between structures would depend on the structures. For example, if we added a flag to obtain SIGSTOP/SIGCONT information for children (as you can currently obtain via SIGCHLD), we'd just return exactly the same clonefd_info structure, with CLD_STOPPED or CLD_CONTINUED in the code field; no need to add types or flags for that.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 21:48 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

using the length as the only version indicator is a clever hack, but it only works well as long as you never end up abandoning/deprecating any data. If you do, you end up having to pad the messages with bogus backwards compatibility data, or some rough approximation of it.

I would have thought that the problem with /proc, specifically with slabinfo data that continues to need to be faked by slab replacements, even when it doesn't actually mean anything, would have shown this to be an "anti-pattern"

yes, including version info wastes a little space from the beginning, but it means that you can actually eliminate fields in the future if you find that you should.

Attaching file descriptors to processes with CLONE_FD

Posted Apr 2, 2015 21:53 UTC (Thu) by josh (subscriber, #17465) [Link]

The length isn't the only version indicator; if we need to change something more fundamental, we can use an explicit flag for that.

And no, we can't ever eliminate a field even with a version number, unless we have backward compatibility code to return old versions to old userspace.