Attaching file descriptors to processes with CLONE_FD
Posted Apr 2, 2015 5:10 UTC (Thu) by josh (subscriber, #17465) In reply to: Attaching file descriptors to processes with CLONE_FD by Cyberax
Parent article: Attaching file descriptors to processes with CLONE_FD
That's actually the original motivation for this patch. Thiago Macieira has a userspace library that emulates a subset of clonefd by installing a SIGCHLD handler and sending clonefd_info over a pipe. Having clone4 available allows that library to work in a race-free way, and not take over process-wide SIGCHLD handling.
> A small note, though: clonefd_info needs to have size and version fields for future extensions.
That doesn't necessarily help, especially for a first version: nothing forces userspace to actually *check* those fields, and if they don't, then they'll end up being immutable anyway because the kernel can't break old userspace.
So our current plan is to require applications that can handle reading other structures from the clonefd to opt-in by passing an appropriate flag to clone4. Suggestions welcome, though.
Posted Apr 2, 2015 5:22 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (20 responses)
> So our current plan is to require applications that can handle reading other structures from the clonefd to opt-in by passing an appropriate flag to clone4. Suggestions welcome, though.
Posted Apr 2, 2015 5:30 UTC (Thu)
by josh (subscriber, #17465)
[Link] (19 responses)
Note that clonefd behaves like a UDP socket here, in that you have to read the entire clonefd_info structure in one read() call, and any unread bits are lost. We chose that behavior because in theory, you could hand the clonefd to multiple threads or processes, all of which can call read() in parallel, and they shouldn't race or receive partial data even if they don't read a complete structure.
So, in effect, there *is* a length field: it's the return value from read(). If you understand a longer structure, read() a longer structure, and on older kernels you'll get a short read.
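The "return value of read() is the length field" idea can be tried out without clone4. The Python sketch below uses a SOCK_DGRAM socketpair as a stand-in for the clonefd, on the assumption that it behaves the same way for this purpose (one whole record per read, anything unread is dropped). The two layouts `INFO_V1` and `INFO_V2` and their fields are hypothetical and are not the real clonefd_info.

```python
import socket
import struct

# Hypothetical layouts, for illustration only (not the real clonefd_info).
INFO_V1 = struct.Struct("=iiQQ")    # code, status, utime, stime
INFO_V2 = struct.Struct("=iiQQQ")   # the same, plus one imagined extra field


def read_info(fd_like, want=INFO_V2.size):
    """Read one record; the byte count returned tells us which layout we got."""
    data = fd_like.recv(want)
    if len(data) >= INFO_V2.size:
        return "v2", INFO_V2.unpack(data[:INFO_V2.size])
    if len(data) >= INFO_V1.size:
        return "v1", INFO_V1.unpack(data[:INFO_V1.size])
    raise ValueError("short record: %d bytes" % len(data))


def demo():
    """A reader that understands v2 gets a short read from an 'old kernel'."""
    kernel_side, reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        kernel_side.send(INFO_V1.pack(1, 7, 120, 30))  # old, shorter record
        return read_info(reader)                       # asks for the longer one
    finally:
        kernel_side.close()
        reader.close()


def demo_truncation():
    """A reader that asks for too little loses the rest of the record."""
    kernel_side, reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        reader.setblocking(False)
        kernel_side.send(INFO_V2.pack(1, 7, 120, 30, 99))
        first = reader.recv(INFO_V1.size)   # only room for the v1 part
        try:
            reader.recv(64)                 # is anything left over?
            leftover = True
        except BlockingIOError:
            leftover = False
        return len(first), leftover
    finally:
        kernel_side.close()
        reader.close()
```

Calling `demo()` shows the short read selecting the older layout, and `demo_truncation()` shows that the unread tail of a record is not available to a later read.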
Posted Apr 2, 2015 6:13 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (11 responses)
But this design seems to make it impossible to pass large data (command line arguments, for example).
On the other hand, if the corresponding /proc directory could stay alive while there are open FDs to its process, then there's little need to have more information.
Perhaps you could at least add a 'flags' field so the library code could inspect what extra fields are available?
Also, what about getting a FD of an already running process?
Posted Apr 2, 2015 6:18 UTC (Thu)
by josh (subscriber, #17465)
[Link] (10 responses)
We've actually talked about adding an API that would get you from a clonefd to the dirfd of the corresponding /proc directory, from which you could use openat to open the proc files of that process.
> Perhaps you could at least add a 'flags' field so the library code could inspect what extra fields are available?
Something like that might happen the first time clonefd_info gets extended, depending on the added fields.
> Also, what about getting a FD of an already running process?
Planned for a subsequent patch.
Posted Apr 2, 2015 6:30 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
> Something like that might happen the first time clonefd_info gets extended, depending on the added fields.
> Planned for a subsequent patch.
Posted Apr 2, 2015 7:43 UTC (Thu)
by josh (subscriber, #17465)
[Link] (8 responses)
Among other reasons, because file descriptors are unique system-wide, while if you have a PID, you have to know what PID namespace it's relative to. If the process the clonefd refers to is in another PID namespace from you, and for that matter if you've passed it around via UNIX domain sockets, the clonefd can still hand you a dirfd that unambiguously refers to the correct /proc directory, without any PIDs or translation involved. That seems far cleaner than, for instance, using an ioctl to get the PID from the clonefd, sprintf-ing that into a filename, and opening a file in /proc.
Ideally, it ought to be possible to operate on processes without ever using a PID.
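The dirfd-plus-openat pattern described above already works for ordinary directories, so it can be sketched today. In the Python sketch below, `open_dir` is only a hypothetical stand-in for the proposed "clonefd to /proc dirfd" call, which does not exist; `read_via_dirfd` then opens a file relative to that descriptor without ever formatting a PID into a path.

```python
import os


def open_dir(path):
    """Stand-in for the proposed clonefd-to-dirfd call (hypothetical)."""
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY)


def read_via_dirfd(dirfd, name, limit=4096):
    """Open `name` relative to an already-open directory fd and read it."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dirfd)   # openat(dirfd, name, ...)
    try:
        return os.read(fd, limit)
    finally:
        os.close(fd)
```

With /proc mounted, `read_via_dirfd(open_dir("/proc/self"), "status")` would read the caller's own status file; with a dirfd obtained from a clonefd, the same call would reach the right process regardless of PID namespace.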
Posted Apr 2, 2015 7:57 UTC (Thu)
by iq-0 (subscriber, #36655)
[Link] (7 responses)
This would mean that opening that directory directly would automatically give you an equivalent handle, and that multiple processes could wait on the exit of that process (and receive basic accounting information). And if the file handle isn't opened with 'O_PATH', then the task directory would stick around; otherwise it may be automatically cleaned up.
And why not include the more verbose "getrusage" stats rather than only the "times" ones?
Posted Apr 2, 2015 17:31 UTC (Thu)
by josh (subscriber, #17465)
[Link] (1 responses)
We did consider that during the design. We also considered making the "get file descriptor for existing process" mechanism be open("/proc/$PID/clonefd") or similar. However, that approach would require that /proc is mounted, and would raise interesting interaction issues with containers. The kernel has been moving in the opposite direction; for instance, see execveat, which was created specifically as an alternative to execve of /proc/self/fd/N because the latter requires a mounted /proc.
Posted Apr 3, 2015 7:04 UTC (Fri)
by iq-0 (subscriber, #36655)
[Link]
I guess it would be a security risk, because a process could get a procfs handle this way, follow that to '..' and have access to its entire procfs even if the admin chose not to mount it (chroot escaping for root processes would probably become easier this way, though chroot is officially not considered a security mechanism).
But I don't see how 'open(/proc/$pid/clonefd)' would introduce a dependency on procfs. Only the feature "open a process fd for an existing process" would require that. The fd itself wouldn't necessarily have to be tied to procfs in this case (as it would be with an fd for the task directory).
Posted Apr 2, 2015 17:40 UTC (Thu)
by josh (subscriber, #17465)
[Link] (4 responses)
clonefd_info closely matches the information available in the siginfo structure that you receive when you get SIGCHLD. siginfo doesn't include the rusage information either. And including the full rusage information would make the structure *much* larger. Most callers are not likely to need that information, because they'll only care that the process exited and perhaps what exit code it exited with.
I would propose, instead, adding a version of getrusage that takes a clonefd. Since the task_struct will stick around as long as the clonefd remains open, those stats remain available.
Posted Apr 3, 2015 7:16 UTC (Fri)
by iq-0 (subscriber, #36655)
[Link] (3 responses)
Most programs only care about exit status. But among those who are interested in resource usage, I think nobody wants to know only the CPU timings. Most programs make do with them, but that is more a case of what they got, not what they want.
We are talking about 144 bytes vs 32 bytes per exited process that has not been reaped. The pointers inside the kernel for basic administration are probably more expensive to keep around.
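The two figures appear to correspond to the sizes of struct rusage and struct tms on 64-bit Linux; that mapping is an assumption, since the comment does not name the structures. A small Python sketch of the arithmetic, with the LP64 layouts written out as fixed 8-byte fields so the result does not depend on the machine running it:

```python
import struct

# LP64 Linux layouts, spelled out with fixed-size 8-byte fields.
TMS = struct.Struct("=4q")         # struct tms: utime, stime, cutime, cstime
RUSAGE = struct.Struct("=4q14q")   # struct rusage: two timevals + 14 longs


def sizes():
    """Return (times-style size, getrusage-style size) in bytes."""
    return TMS.size, RUSAGE.size


def extra_bytes(unreaped):
    """Extra memory for `unreaped` exited processes if full rusage is kept."""
    return unreaped * (RUSAGE.size - TMS.size)
```

Under these assumptions the difference is 112 bytes per exited, unreaped process.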
Posted Apr 3, 2015 7:35 UTC (Fri)
by iq-0 (subscriber, #36655)
[Link]
Posted Apr 3, 2015 8:20 UTC (Fri)
by josh (subscriber, #17465)
[Link] (1 responses)
Posted Apr 3, 2015 8:49 UTC (Fri)
by iq-0 (subscriber, #36655)
[Link]
Posted Apr 2, 2015 10:13 UTC (Thu)
by drysdale (guest, #95971)
[Link] (6 responses)
By the way, I don't think this works with v2 of the patchset (presumably because of the code with the "EOF after first read" comment and the ppos update from simple_read_from_buffer).
(I'll send an email separately)
Posted Apr 2, 2015 10:46 UTC (Thu)
by josh (subscriber, #17465)
[Link] (5 responses)
Posted Apr 2, 2015 11:41 UTC (Thu)
by drysdale (guest, #95971)
[Link] (4 responses)
However, I wonder if it might be more helpful for each reader to get its own clonefd_info. That would be more consistent with the behaviour of other special-FDs (timerfds, eventfds) and I suspect it would also make it easier to implement pdwait4()-compatible behaviour in the future (where I believe the FreeBSD folk intend for multiple callers each to get their own copy of the returned information).
Posted Apr 2, 2015 17:46 UTC (Thu)
by josh (subscriber, #17465)
[Link] (3 responses)
If multiple readers each want to read their own clonefd_info, I would suggest that they should each open an independent fd for the process.
Posted Apr 2, 2015 20:44 UTC (Thu)
by wodny (subscriber, #73045)
[Link] (2 responses)
How can one do that, if clone() has to be used to open an fd and at the same time creates a new process?
Posted Apr 2, 2015 20:49 UTC (Thu)
by josh (subscriber, #17465)
[Link]
Posted Apr 2, 2015 18:34 UTC (Thu)
by hmh (subscriber, #3838)
[Link] (3 responses)
Posted Apr 2, 2015 20:43 UTC (Thu)
by josh (subscriber, #17465)
[Link] (2 responses)
Additional information to distinguish between structures would depend on the structures. For example, if we added a flag to obtain SIGSTOP/SIGCONT information for children (as you can currently obtain via SIGCHLD), we'd just return exactly the same clonefd_info structure, with CLD_STOPPED or CLD_CONTINUED in the code field; no need to add types or flags for that.
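The existing siginfo interface already works this way: one structure, with the code field saying what kind of event it describes. As a rough sketch (using waitid() rather than clonefd, which is not assumed to be available, and assuming Python 3.9 or later on Linux for the `os.CLD_*` constants), the helper names `child_event` and `describe` below are made up for illustration:

```python
import os


def child_event(exit_code):
    """Fork a child that exits at once and report how waitid() describes it."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)          # child: exit immediately
    info = os.waitid(os.P_PID, pid, os.WEXITED)
    return info.si_code, info.si_status


def describe(si_code):
    """Map the siginfo code values mentioned above to readable names."""
    names = {
        os.CLD_EXITED: "exited",
        os.CLD_STOPPED: "stopped",
        os.CLD_CONTINUED: "continued",
    }
    return names.get(si_code, "other")
```

A caller that only handles exits keeps working if stop and continue events are added later, because the structure it unpacks does not change; only the set of code values it may see grows.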
Posted Apr 2, 2015 21:48 UTC (Thu)
by dlang (guest, #313)
[Link] (1 responses)
I would have thought that the problem with /proc, specifically with slabinfo data that continues to need to be faked by slab replacements even when it doesn't actually mean anything, would have shown this to be an "anti-pattern".
Yes, including version info wastes a little space from the beginning, but it means that you can actually eliminate fields in the future if you find that you should.
Posted Apr 2, 2015 21:53 UTC (Thu)
by josh (subscriber, #17465)
[Link]
And no, we can't ever eliminate a field even with a version number, unless we have backward compatibility code to return old versions to old userspace.
Certainly. But it would make it easier for me to simply read the whole packet and transmit it to another part of my program for more detailed parsing. Having a length field would help here.
That as well. But simply adding a length field and a bitmask with enabled options shouldn't be too hard and it might help people.
Why not just let the /proc entry stick around? The kernel can't reuse a PID while it's still held by an open FD, or am I wrong?
Oh well... I guess it's OK, but I'd definitely prefer to have a straightforward "features" flag as the first member of the structure. It'll also condition clients to the fact that the structure might get extended someday.
Thanks!