New system calls: pidfd_open() and close_range()
pidfd_open()
There has been a fair amount of development activity around pidfds, which can be used to send signals to processes without worries that the target process may die and be replaced by another one using the same process ID. The 5.2 merge window saw the addition of a new CLONE_PIDFD flag to the clone() system call. If that flag is present, the kernel will return a pidfd (referring to the newly created child) to the parent by way of the ptid argument; that pidfd can then be used to send signals to the child process at some future point.
There are times, though, when it is not possible to create a process in this manner, but a management process would still like to get a pidfd for another process. Opening the target's /proc directory could work; that was once the only way to get a pidfd for a process. But the /proc approach is apparently not usable in all situations. On some systems, /proc may not exist (or be accessible) at all. For situations like this, Christian Brauner has brought back an earlier proposal for a new system call to create a pidfd:
int pidfd_open(pid_t pid, unsigned int flags);
The target process is identified with pid; the flags argument must be zero in the current proposal. The return value will be a pidfd corresponding to pid. It's worth noting that there is a possible race window here; pid could be recycled before pidfd_open() runs. That window is small in most normal usage, though, and there are ways for the caller to check and ensure that the process of interest is still running.
When pidfd_open() was proposed in the past, it would return a different flavor of pidfd than would be obtained by opening /proc; an ioctl() call was provided to convert between the two. This behavior was not particularly popular, and has been dropped; there is now just a single type of pidfd, regardless of where it has been obtained.
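As a rough illustration of how the proposed call would be used (there is no C-library wrapper at this point, so the sketch below goes through syscall() and assumes that <sys/syscall.h> defines SYS_pidfd_open and SYS_pidfd_send_signal; the verification step is left as a placeholder):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    /* Obtain a stable reference to the process; flags must be zero. */
    long pidfd = syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0) {
        perror("pidfd_open");
        return 1;
    }

    /* ... check that the pidfd really refers to the intended process
       (for example by looking at /proc/<pid>) before acting on it ... */

    /* The signal is delivered through the pidfd, so it cannot land on an
       unrelated process that has recycled the PID in the meantime. */
    if (syscall(SYS_pidfd_send_signal, (int)pidfd, SIGTERM, NULL, 0) < 0)
        perror("pidfd_send_signal");

    close((int)pidfd);
    return 0;
}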
The lack of pidfd_open() is, Brauner says, the main obstacle keeping applications like Android's low-memory killer and systemd from using pidfds for process management. Once that has been resolved, "they intend to switch to this API for process supervision/management as soon as possible". Comments on this system call have settled down to relatively small implementation details, so it seems likely to go in during the 5.3 merge window.
close_range()
One possibly surprising pidfd_open() feature is that the pidfd it creates has the O_CLOEXEC flag set automatically; that will cause the descriptor to be automatically closed should the owning process call execve() to run a new program. This is a hardening feature, intended to prevent open file descriptors from leaking into places where they were not intended to be. David Howells has recently proposed changing the new filesystem mounting API to unconditionally set that flag as well.
This change evoked a protest from Al Viro, who does not feel that changing longstanding Unix conventions is the right approach, especially since the behavior of existing calls like open() cannot possibly change in this way. He later suggested that a close_range() system call might be a better way to ensure that file descriptors are closed before calls like execve(). Brauner duly implemented this idea for consideration. The new system call would be:
int close_range(unsigned int first, unsigned int last);
A call to close_range() will close every open file descriptor from first through last, inclusive. Passing a number like MAXINT for last will work and is the expected usage much of the time. Closing descriptors in the kernel this way, rather than in a loop in user space, allows for a significant speedup; as Brauner put it, "the performance is striking", even though there are clearly ways in which the implementation could be made faster yet.
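As an illustration only: a helper that drops everything above the standard descriptors before exec'ing. It uses syscall() directly and passes a trailing flags argument of zero, matching the three-argument form found in current kernel headers rather than the two-argument prototype quoted above, so adjust it to whatever interface your kernel and C library actually expose:

#include <limits.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Close every descriptor above stderr in one call, then exec. */
static void exec_with_clean_fds(char *const argv[])
{
    if (syscall(SYS_close_range, 3U, UINT_MAX, 0U) < 0)
        perror("close_range");

    execvp(argv[0], argv);
    perror("execvp");          /* only reached if the exec fails */
    _exit(127);
}

A parent would typically call something like this in the child between fork() and handing control to the new program.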
This API is rather less settled at this point. Howells suggested something more like:
int close_from(unsigned int first);
This variant would close all open descriptors starting with first. It seems that there are use cases, though, for leaving some high-numbered file descriptors open, so this version would be less useful. Florian Weimer, instead, suggested looking at the Solaris closefrom() and fdwalk() functions for inspiration. closefrom() is equivalent to Howells's close_from(), while fdwalk() allows a process to iterate through its open file descriptors. Weimer said that if the kernel were to implement a nextfd() system call to obtain the next open file descriptor, both closefrom() and fdwalk() could be implemented in the C library.
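Purely as a sketch of Weimer's suggestion (nextfd() does not exist on Linux; its prototype here is invented for illustration), closefrom() could then be built in the C library along these lines:

#include <unistd.h>

/* Hypothetical: nextfd(fd) would return the lowest open file
   descriptor greater than fd, or -1 if there is none. */
extern int nextfd(int fd);

static void closefrom(int lowfd)
{
    for (int fd = nextfd(lowfd - 1); fd >= 0; fd = nextfd(fd))
        close(fd);
}

An fdwalk() built the same way would invoke a callback on each descriptor instead of closing it.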
The value of these functions was not clear to Brauner, though. In particular, fdwalk() appears to be mostly needed on systems that lack information on open file descriptors in /proc. In the absence of a pressing need for nextfd(), it is unlikely to be implemented, much less merged. So, unless some other proposal comes along and proves more interesting, a future close_range() implementation appears to be the most likely to find its way into a mainline kernel release.
| Index entries for this article | |
|---|---|
| Kernel | pidfd |
| Kernel | System calls |
Posted May 23, 2019 14:41 UTC (Thu)
by dskoll (subscriber, #1630)
[Link] (4 responses)
It would be handy if close_range took another argument of type fd_set * that contained a set of file descriptors you definitely want to keep open. One use case is if a parent process opens a PID file and wants to keep it open in the child / grandchild for locking purposes. If you don't need this facility, just supply NULL where the fd_set * argument goes.

Would it be possible to have close_set(fd_set* fds) instead/as well? (Or does that already exist?)
Posted May 23, 2019 15:46 UTC (Thu)
by ttuttle (subscriber, #51118)
[Link] (3 responses)
Posted May 23, 2019 17:35 UTC (Thu)
by josh (subscriber, #17465)
[Link] (2 responses)
The advantage of close_range is that the kernel knows which file descriptors the process has, so instead of userspace closing thousands of *possible* descriptors, the kernel can quickly close the handful of *actual* descriptors.
Posted May 23, 2019 23:43 UTC (Thu)
by cyphar (subscriber, #110703)
[Link] (1 responses)
Posted Jul 24, 2019 22:39 UTC (Wed)
by immibis (subscriber, #105511)
[Link]
It might be that you're out of file descriptors, so you need to close one in order to open /proc/self/fd in order to know which file descriptors you can close. But how do you know which one to close?
You could institute a rule like "always close the lowest numbered one that we aren't being asked to keep open". But that's not guaranteed to be an open file descriptor. Although FDs are usually small positive integers, they *could* be any positive integer up to 2^31. Will you loop through all descriptors up to 2^31, just so you can find one to close? (Worst case: the FD limit is 1, and the only FD that's open is 2^31-1).
You might need to close more than one FD. IIRC it's possible to open, say, 1000 FDs, and then set the resource limit much lower than 1000, and the process will have to close enough FDs to get under the limit before it can open /proc/self/fd.
There's also the possibility that /proc isn't mounted. It's not sensible that a filesystem should need to be mounted in order for a process to manage its own internal state.
Posted May 23, 2019 14:50 UTC (Thu)
by brauner (subscriber, #109349)
[Link]
Posted May 23, 2019 15:02 UTC (Thu)
by mezcalero (subscriber, #45103)
[Link] (13 responses)
I must say, for all purposes I have in the codebases I care for (systemd…) the range thing is a bit weird though, usually we want to close everything but a few arbitrary fds, and then rearrange those fds to specific positions. But for that close_range() is not particularly useful, as it requires you to sort your list of fds to keep open first and then find all ranges between them. This means behaviour of closing everything is O(n*log(n)) (for the worst case where the fds to keep open are fully distributed over the entire range), for n being the number of fds to keep open. This is only marginally better than enumerating /proc/self/fd/ and closing everything found there, which is O(m) for m being the number of fds previously open. Marginally better since usually n ≪ m.
I personally would much rather prefer a prototype like:
int close_except(const int *fds, size_t n_fds);
i.e. just specify the fds you want to keep open explicitly, regardless of order, trivially easily...
And I think not only systemd would benefit from such a close_except() call, but also everything else that invokes something with fds set up in a special way, for example popen() and friends.
Lennart
Posted May 23, 2019 16:10 UTC (Thu)
by tao (subscriber, #17563)
[Link]
Posted May 23, 2019 16:22 UTC (Thu)
by Sesse (subscriber, #53779)
[Link] (7 responses)
dup2(5, 0);
dup2(26, 1);
dup2(4, 2);
close_range(3, MAXINT);
(If you also wanted to keep e.g. 1, you would need some extra fiddling, but that goes for close_except(), too.)
Posted May 24, 2019 12:49 UTC (Fri)
by mezcalero (subscriber, #45103)
[Link] (6 responses)
Posted May 24, 2019 12:53 UTC (Fri)
by Sesse (subscriber, #53779)
[Link]
Posted May 30, 2019 8:44 UTC (Thu)
by mina86 (guest, #68442)
[Link] (4 responses)

int close_except(int *fds_to_keep, size_t n) {
    qsort(fds_to_keep, n, sizeof *fds_to_keep, int_cmp);
    int fd = 0;
    for (size_t i = 0; i < n; ++i)
        if (fds_to_keep[i] != fd) dup2(fds_to_keep[i], fd++);
    return close_range(fd, INT_MAX);
}

Handling of EBADF and EINTR left as exercise to the reader.
Posted May 30, 2019 10:31 UTC (Thu)
by Jandar (subscriber, #85683)
[Link] (3 responses)
And the important exercise is fixing the bug ;-). The fd's have to be moved back to their original value.
Closing every fd I don't know about is only a band-aid to fix buggy software (own or 3rd-party library). The correct solution is using CLOEXEC and has been available for ages.
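(For reference, a minimal illustration of the CLOEXEC approach being advocated here; the helper is invented for the example:)

#include <fcntl.h>
#include <unistd.h>

int open_private(const char *path)
{
    /* Created with the flag set: never inherited across execve(). */
    int fd = open(path, O_RDONLY | O_CLOEXEC);

    /* Equivalent after-the-fact form for descriptors created elsewhere:
       fcntl(fd, F_SETFD, FD_CLOEXEC); */
    return fd;
}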
Posted May 30, 2019 10:49 UTC (Thu)
by mina86 (guest, #68442)
[Link]
Or start referring to the file descriptors by their new numbers (though even then the function lacks a way to communicate all the remappings) to save on a bunch of syscalls.
Posted Sep 7, 2019 17:59 UTC (Sat)
by ceplm (subscriber, #41334)
[Link] (1 responses)
No, it is useful also for programs where the author doesn’t have full control. See https://bugs.python.org/issue11284. I could have switched all Python open() calls to use CLOEXEC (that’s essentially what happened in Python 3.2 as an implementation of https://www.python.org/dev/peps/pep-0446/), but it doesn’t save me from C extensions which just use open(2) with default values and create inheritable file descriptors on POSIX platforms.
And walking through /proc/self/fd/ is horribly slow (think about FreeBSD’s MAXFD=655000), and mostly not async signal safe.
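(For reference, the kind of /proc walk being criticized here looks roughly like this; the opendir()/readdir() machinery is part of why it is not async-signal-safe:)

#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

static void closefrom_proc(int lowfd)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return;

    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        int fd = atoi(ent->d_name);   /* "." and ".." parse to 0 */
        if (fd >= lowfd && fd != dirfd(dir))
            close(fd);
    }
    closedir(dir);
}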
Posted Sep 14, 2019 20:54 UTC (Sat)
by Jandar (subscriber, #85683)
[Link]
Mangling with unknown resources is never good practice but maybe as I said: a band-aid to fix buggy software. The C extension of your example is such buggy software you are applying a band-aid to.
I hope I never see close_range() in production-code. Using it is the confession one has given up on producing/using good software.
Posted May 23, 2019 20:07 UTC (Thu)
by scientes (guest, #83068)
[Link] (3 responses)
O_ANYFD
Posted May 24, 2019 6:14 UTC (Fri)
by smurf (subscriber, #17840)
[Link] (1 responses)
Well, that's O(n) over a bitmap, plus there's a CPU instruction to find the first free bit, so the cost is reasonably low no matter how large N is. You could even cache the first free FD or two. I doubt that there'd be any measurable impact of O_ANYFD, esp. compared to the cost of looking up the target of the open().
Posted May 24, 2019 22:50 UTC (Fri)
by dezgeg (subscriber, #92243)
[Link]
Posted May 25, 2019 0:33 UTC (Sat)
by wahern (subscriber, #37304)
[Link]
Posted May 23, 2019 15:06 UTC (Thu)
by magfr (subscriber, #16052)
[Link]
Posted May 23, 2019 15:10 UTC (Thu)
by tux3 (subscriber, #101245)
[Link] (1 responses)
Posted May 24, 2019 15:05 UTC (Fri)
by perennialmind (guest, #45817)
[Link]
Whether you were serious or not, close_range and close_except do share the familiar pattern "do thing foo on series bar". All these round-trip eliminating optimizations remind me of capnproto's "infinity times faster" tagline.
Posted May 23, 2019 15:13 UTC (Thu)
by magfr (subscriber, #16052)
[Link] (5 responses)
Posted May 23, 2019 15:19 UTC (Thu)
by mezcalero (subscriber, #45103)
[Link] (1 responses)
Posted May 23, 2019 15:30 UTC (Thu)
by magfr (subscriber, #16052)
[Link]

For threaded use I suppose some monstrosity like CreateProcess that allows the caller to specify the file descriptors that should be duplicated is better.
Posted May 24, 2019 7:49 UTC (Fri)
by maxfragg (subscriber, #122266)
[Link] (2 responses)
Posted May 24, 2019 12:51 UTC (Fri)
by mezcalero (subscriber, #45103)
[Link] (1 responses)
Posted May 25, 2019 1:16 UTC (Sat)
by wahern (subscriber, #37304)
[Link]
Even POSIX admits pthread_atfork was a mistake. It explains the history and why it's broken in the RATIONALE section; for FUTURE DIRECTIONS it says:

The pthread_atfork() function may be formally deprecated (for example, by shading it OB) in a future version of this standard.
Posted May 23, 2019 16:03 UTC (Thu)
by sorokin (guest, #88478)
[Link] (2 responses)
If it is the latter it must be very inefficient for the processes with elevated limit of the number of possible file descriptors.
Posted May 24, 2019 7:03 UTC (Fri)
by cyphar (subscriber, #110703)
[Link]
Posted May 25, 2019 20:00 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link]
However, walking the bitmap is for all purposes fast enough (a handful of instructions for 64 entries) that you can consider close_range to be linear in the number of file descriptors to be closed.
Posted May 23, 2019 16:38 UTC (Thu)
by ju3Ceemi (subscriber, #102464)
[Link] (9 responses)
I suppose "Closing descriptors in a loop in user space is slower" because it basically bruteforce the whole fd-space, calling close() for every fd, even if there is none
Posted May 23, 2019 23:40 UTC (Thu)
by cyphar (subscriber, #110703)
[Link] (8 responses)
Posted May 24, 2019 17:27 UTC (Fri)
by ju3Ceemi (subscriber, #102464)
[Link] (6 responses)
Posted May 25, 2019 23:06 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link] (5 responses)
Posted May 25, 2019 23:21 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (4 responses)
Posted May 25, 2019 23:31 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link] (3 responses)
Posted May 26, 2019 20:46 UTC (Sun)
by roc (subscriber, #30627)
[Link] (2 responses)
Maybe a single new form of "open" syscall that opts into a different mount namespace that contains /proc and nothing else?
Posted May 26, 2019 20:52 UTC (Sun)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
That's basically what I had in mind. I'm imagining an open_proc(2) that returns a dirfd that can be used to access standard procfs files using openat(2), with the specific filesystem returned by open_proc(2) filtered to restrict access to security-sensitive parts of procfs like /proc/pid/net.
But I got a ton of opposition on lkml when I proposed something like this, so I think that my always-available procfs idea will be in the far future. :-(
Posted May 28, 2019 16:21 UTC (Tue)
by nybble41 (subscriber, #55106)
[Link]
If the pidfd_open() syscall is added, would it be possible to use the returned pidfd with openat()? My understanding is that the result should be equivalent to calling open("/proc/PID") in a context where /proc is procfs, whether or not /proc is visibly mounted. That would offer access to at least the standard /proc/PID hierarchy regardless of mount namespaces or other factors affecting path resolution.
Posted Jul 24, 2019 22:52 UTC (Wed)
by immibis (subscriber, #105511)
[Link]
Posted May 23, 2019 17:42 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (21 responses)

I presume the idea is to open the pidfd, check you've got the right process, and then you can reliably use it. (If you haven't got the right process, then the one you were interested in isn't there anymore).
Posted May 23, 2019 17:56 UTC (Thu)
by james (subscriber, #1325)
[Link] (20 responses)
Posted May 23, 2019 19:25 UTC (Thu)
by josh (subscriber, #17465)
[Link]
Posted May 23, 2019 20:03 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (15 responses)
I've tried to find something about this but the only thing I got is the author's firm conviction that "it [pid re-use] ain't gonna happen".
Posted May 24, 2019 5:11 UTC (Fri)
by cyphar (subscriber, #110703)
[Link] (14 responses)
This is similar to opening a file with O_PATH, checking whether it is the file you expect through readlink(/proc/self/fd/$fd) and then operating on the O_PATH. Once you've got the reference, and checked it, the reference is free from TOCTOU attacks.
Posted May 24, 2019 13:06 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link] (13 responses)
A reminder for "But the time window is sooo small! It really ain't gonna happen!"-people: Processes can be suspended. Or can be blocked for extended periods of time if the system is thrashing.
Posted May 24, 2019 13:26 UTC (Fri)
by cyphar (subscriber, #110703)
[Link] (12 responses)
1. Do a pidfd_open(); *then*
2. Check that it is the process you want; and *then*
3. Operate on the pidfd which you now know references the process you want.
If the process dies after (1) then you will get -ESRCH on all operations. If (1) is the wrong process, you will detect it during (2) and can error out. Thus (3) will only ever operate on the correct process -- because you have a re-usable reference that won't be recycled you aren't subject to recycling problems. This is not possible with the original pid-based interfaces because any operations in (3) would be using a pid that might be recycled and thus the check in (2) is worthless.
Please note that all of the above is also true with the current pidfd interface which works through opening /proc/$pid and pidfd_send_signal(2) (this was merged in 5.1). pidfd_open(2) is not anything more radical than that, it just offers a way of using pidfds without the need for procfs to be mounted and usable.
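A rough sketch of that open/check/use sequence with the /proc-based flavour of pidfd mentioned above (assumptions: /proc is mounted, SYS_pidfd_send_signal is defined in the headers, and checking the comm name stands in for whatever verification a real caller would do):

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int signal_if_named(pid_t pid, const char *expected_comm, int sig)
{
    char path[64], comm[64];

    /* (1) Take a reference: open /proc/<pid> as the pidfd. */
    snprintf(path, sizeof(path), "/proc/%d", (int)pid);
    int pidfd = open(path, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
    if (pidfd < 0)
        return -1;

    /* (2) Check that it is the process we expect. */
    int commfd = openat(pidfd, "comm", O_RDONLY | O_CLOEXEC);
    ssize_t n = -1;
    if (commfd >= 0) {
        n = read(commfd, comm, sizeof(comm) - 1);
        close(commfd);
    }
    if (n <= 0) {
        close(pidfd);
        return -1;
    }
    comm[n] = '\0';
    if (strncmp(comm, expected_comm, strlen(expected_comm)) != 0) {
        close(pidfd);
        return -1;        /* some other process; do not touch it */
    }

    /* (3) Operate on the reference: the signal cannot hit a recycled PID. */
    int ret = (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
    close(pidfd);
    return ret;
}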
Posted May 24, 2019 14:44 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link] (11 responses)
Posted May 24, 2019 15:44 UTC (Fri)
by cyphar (subscriber, #110703)
[Link] (10 responses)
This depends strongly on what the application is doing. If the application is pkill(1), then you check openat(pidfd, "cmdline") and check the name of the command. If you're an init system you might check the ppid is 1, that it's in the right cgroup, that it has the right exe magiclink, that you haven't received a death signal from that service, and so on.
In addition, the benefit of pidfds is that they can be passed around or even persisted (through bind-mounts) so that if you are in a scenario where you are sure the pidfd is correct (for instance, you are the parent process and spawned the child) you can pass the pidfd to another process which can operate on the pidfd even though they could not (by themselves) get a pidfd that they were sure about. A toy usecase might be a container runtime bind-mounting the pidfd of the pid1 of containers (after spawning the pid1) and then using that pidfd after-the-fact to operate on the container's namespaces.
Posted May 24, 2019 18:31 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link] (9 responses)
Posted May 24, 2019 19:59 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
Posted May 25, 2019 15:43 UTC (Sat)
by rweikusat2 (subscriber, #117920)
[Link] (7 responses)
Various heuristics can be applied here in order to find a process which is (according to someone's opinion at least) very likely to be the process whose pid was originally acquired but this doesn't mean that it actually is this process. The only way to know that a pid refers to a certain process is to acquire it from the parent and 'somehow' ensure that the exit status won't be reaped until one is done using the pid. If this is guaranteed, jumping through the pidfd hoop is just useless. The pid could be used instead (for signalling at least).
Posted May 25, 2019 23:09 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link] (3 responses)
Posted May 26, 2019 20:34 UTC (Sun)
by rweikusat2 (subscriber, #117920)
[Link] (2 responses)
Posted May 26, 2019 20:46 UTC (Sun)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Posted May 26, 2019 21:33 UTC (Sun)
by rweikusat2 (subscriber, #117920)
[Link]
Posted May 26, 2019 3:28 UTC (Sun)
by mjg59 (subscriber, #23239)
[Link] (2 responses)
Posted May 26, 2019 20:42 UTC (Sun)
by rweikusat2 (subscriber, #117920)
[Link] (1 responses)
Posted May 26, 2019 22:46 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
pkill() has a contract - it kills processes by specified attributes. Right now it can kill a random process due to the PID-based race condition. There's still a race condition - pkill is not guaranteed to kill processes that launched concurrently with it.
The new pkill() would _always_ kill the right processes.
Posted May 23, 2019 20:11 UTC (Thu)
by droundy (subscriber, #4559)
[Link] (2 responses)
Posted May 23, 2019 20:21 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (1 responses)
Posted May 23, 2019 22:49 UTC (Thu)
by droundy (subscriber, #4559)
[Link]
Posted May 23, 2019 20:31 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (4 responses)
Posted May 23, 2019 20:57 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (3 responses)
Apart from that, there's no "wrong" and no "right" approach here. There are two choices for a default policy, namely inheritable or not inheritable, and which is better for a certain application depends on what the application is trying to do. Making the default policy a configurable per-process setting would probably make sense. Different default policies depending on whatever the person who implemented the syscall preferred for whatever reason just make the system more complicated without any benefit.
Posted May 23, 2019 23:45 UTC (Thu)
by cyphar (subscriber, #110703)
[Link] (2 responses)
Posted May 24, 2019 12:59 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link] (1 responses)
The O_CLOEXEC file descriptor creation flag doesn't belong in this discussion.
The default policy for existing system calls is mandated by POSIX, hence, changing that is not an option. Different default policies for different system calls are still a bad idea as they make the system more difficult to program for no real benefit. A default policy which could be changed on a per-process basis could be a good idea. OTOH, file descriptor creation is a fairly rare occurrence, hence, this might also not be worth the effort. But it would at least solve a problem and not create one.
That some aspects of open are different from every other system call is unsurprising. If this wasn't the case, an open syscall wouldn't exist.
Posted May 24, 2019 13:35 UTC (Fri)
by cyphar (subscriber, #110703)
[Link]
> That some aspects of open are different from every other system call is unsurprising. If this wasn't the case, an open syscall wouldn't exist.
I don't understand what you mean by this. open(2) is a syscall that must exist for reasons that should be abundantly obvious. However, the fact that open("foo", 0xFFFFFFFFFFFFFFFF, 0) works today -- despite many of those bits being undefined -- is not required for open(2) to exist. This is known as a bad practice -- no other syscall operates like this (and the kernel documentation explicitly tells you not to design syscalls or ioctls like this).
Posted May 24, 2019 1:03 UTC (Fri)
by mirabilos (subscriber, #84359)
[Link]
https://www.mirbsd.org/htman/i386/man2/closefrom.htm
I for one will be rooting for that!
Posted May 24, 2019 8:34 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted May 25, 2019 14:52 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link]
Posted May 24, 2019 11:36 UTC (Fri)
by stressinduktion (subscriber, #46452)
[Link] (2 responses)
a new flag/flags or bitset to open and a new syscall that closes all descriptors with a particular flag:
- open(..., O_FD_KEEP_OPEN_EXECVE)
- fcntl(fd, F_SETFL, ...|O_FD_KEEP_OPEN_EXECVE)
and finally close all descriptors without those flags: close_except(O_FD_KEEP_OPEN_EXECVE)
Obviously I haven't put much time into the naming of those flags. :)
One could build on top of that and implement:
close_type(FD_STDIO); // dubious semantics
close_type(FD_ALL_SOCKETS);
close_type(FD_ALL_TTYS);
close_type(FD_ALL);
Not sure if it is worth the complexity, but could be useful. I think that in most cases it is already known at open/fcntl time if a descriptor should stay open. Furthermore it might be useful during debugging if that information is also discoverable in the /proc/pid/fd hierarchy.
Posted May 25, 2019 15:31 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted May 28, 2019 1:03 UTC (Tue)
by cyphar (subscriber, #110703)
[Link]
We add flags to open(2) (and openat(2)) all the time, despite this problem. New syscalls aren't defined this way, but we're stuck with this behaviour unfortunately.
Posted May 25, 2019 23:11 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link] (3 responses)
Posted May 25, 2019 23:20 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link]
Posted May 26, 2019 0:13 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jul 30, 2019 23:30 UTC (Tue)
by immibis (subscriber, #105511)
[Link]
Posted May 26, 2019 3:27 UTC (Sun)
by sbaugh (guest, #103291)
[Link] (9 responses)
I'm in favor of setting CLOEXEC by default - that's the only safe way to do things in the presence of a shared file descriptor table (i.e. with threads). Sometimes you want to implicitly inherit a file descriptor, and you can just unset CLOEXEC in that case. But closefrom will make that impossible: it does not support implicit inheritance of file descriptors.
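(A small illustration of the "just unset CLOEXEC" escape hatch described here, with an invented helper name:)

#include <fcntl.h>

/* Deliberately let one descriptor survive execve() while everything
   else marked close-on-exec still goes away. */
static int share_with_child(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}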
Posted May 27, 2019 14:34 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (8 responses)
Posted May 28, 2019 0:04 UTC (Tue)
by sbaugh (guest, #103291)
[Link] (7 responses)
That doesn't solve the problem; it still breaks implicit inheritance of file descriptors. If my shell, for example, ran closefrom after fork but before exec, I wouldn't be able to pass down fds to programs started by the shell. The same concern applies in any fork/exec case.
Also, independently, many unfortunate programs already do call closefrom at startup. I think adding it as a syscall will make this worse, as userspace programmers hear the bad advice to call closefrom on startup.
Posted May 28, 2019 1:11 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)

A shell can just loop through FDs and close them. So what's the difference?
Posted May 28, 2019 1:36 UTC (Tue)
by sbaugh (guest, #103291)
[Link] (5 responses)
Not sure what you're meaning to ask. There is no difference in functionality. close_range or closefrom are just a more efficient version of looping through all open FDs and closing them.
I'm saying that close_range, closefrom, and looping through open FDs are all bad things that you shouldn't do. close_range makes it easier to do something you shouldn't do. So it shouldn't be added. We shouldn't add syscalls to make it faster to do things that shouldn't be done.
Posted May 28, 2019 2:50 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link] (4 responses)
Except that sometimes you *do* want to close all other fds (sometimes even all fds, before reopening /dev/null as fd 0 and duping it to 1 and 2) for security reasons.
Programmers will shoot themselves into the foot, but if the rope is efficient… *shrug*
And if you don’t like it, it’s open source and you’ll be able to remove the call in your local copy of the software.
Posted May 28, 2019 3:56 UTC (Tue)
by sbaugh (guest, #103291)
[Link] (3 responses)
Yes, the Korn shell (and mksh) are wrong here, I'm afraid. Thankfully most shells don't have that particular bad behavior.
>Except that sometimes you *do* want to close all other fds (sometimes even all fds, before reopening /dev/null as fd 0 and duping it to 1 and 2) for security reasons.
The "security reasons" argument here applies identically to environment variables. Indeed, I think it's sound to clear the environment and close all fds when you really want to harden something - in setuid environments, for example. But most programs shouldn't be clearing their environment, and most programs shouldn't be closing all fds.
>Programmers will shoot themselves into the foot, but if the rope is efficient… *shrug*
This doesn't make any sense. Again, we shouldn't add syscalls to speed up operations which should not be done in the first place.
>And if you don’t like it, it’s open source and you’ll be able to remove the call in your local copy of the software.
That has nothing to do with open source choice. If you think this operation is good, then say that.
Posted May 28, 2019 6:58 UTC (Tue)
by roc (subscriber, #30627)
[Link] (2 responses)
But many programs do want hardening --- e.g. Firefox and Chrome need to close all fds in sandboxed child processes --- and probably that trend is to be encouraged. Why shouldn't they have an efficient way to do it?
Plus, it seems to me that leaking internal file descriptors into spawned subprocesses is unhygienic, asking for trouble, and should not be the default.
Posted May 28, 2019 12:57 UTC (Tue)
by sbaugh (guest, #103291)
[Link] (1 responses)
Primarily: because they don't need an efficient way to do it. They can close file descriptors once at startup and thereafter make sure to use CLOEXEC to prevent file descriptor leakage. That works fine and doesn't encourage bad practices in the rest of userspace. But more generally...
It's obviously hard to argue against "we should do more sandboxing". But I fall back to the analogy to environment variables: If we shouldn't clear the environment before running the program, then we shouldn't close all file descriptors before running it. Browser Javascript sandboxes are totally closed off from the rest of the system - they certainly don't need environment variables, and they wouldn't benefit from file descriptors either. So it's fine to clear the environment and close all fds before running them.
Certainly I think we should move towards a world where processes can't access things they haven't somehow been given access to. But I don't think we should make every process like a browser Javascript sandbox, and be totally closed off from the system. There are many things that are implicitly inherited in Linux - the root directory, the current working directory, all the namespaces, tons of things. Most of those should be removed or reworked, because they're easy to mess up and can't easily be controlled, and most of them can't even be changed without privileges. But many parts of Unix are still useful, and shouldn't be removed. Environment variables shouldn't be removed, and the possibility of implicit inheritance of file descriptors shouldn't be removed. Otherwise we'll inevitably end up reinventing those, slowly and painfully.
>Plus, it seems to me that leaking internal file descriptors into spawned subprocesses is unhygenic, asking for trouble, and should not be the default.
I agree! Thankfully, there is already a way to prevent this: CLOEXEC. You don't need close_range to avoid leaking file descriptors into spawned subprocesses.
As I said in my original comment: CLOEXEC is the right way to stop implicit inheritance of file descriptors, and I think it should be the default. Almost all file descriptors shouldn't be implicitly inherited into children - that would be massive resource leakage. But sometimes you *do* want to implicitly inherit a file descriptor and CLOEXEC gives you that option, because you can turn it off and inherit a file descriptor without having to explicitly tell your child that you're doing it. closefrom is wrong because there's no way to turn it off when some random glue program in the middle of the tree calls it, so it completely prevents implicit inheritance of file descriptors.
Posted May 30, 2019 12:26 UTC (Thu)
by roc (subscriber, #30627)
[Link]
That would mean a bug in a third-party library that creates a non-CLOEXEC file descriptor, even for a moment, creates a security hole for the sandbox. That's not really acceptable.
> If we shouldn't clear the environment before running the program, then we shouldn't close all file descriptors before running it.
I don't buy that argument. Reading the environment can't usually affect other processes. Reading or writing leaked file descriptors can, sometimes in very subtle ways.