
New system calls: pidfd_open() and close_range()

By Jonathan Corbet
May 23, 2019
The linux-kernel mailing list has recently seen more than the usual amount of traffic proposing new system calls. LWN is endeavoring to catch up with that stream, starting with a couple of proposals for the management of file descriptors. pidfd_open() is a new way to create a "pidfd" file descriptor that refers to a process in the system, while close_range() is an efficient way to close many open descriptors with a single call.

pidfd_open()

There has been a fair amount of development activity around pidfds, which can be used to send signals to processes without worries that the target process may die and be replaced by another one using the same process ID. The 5.2 merge window saw the addition of a new CLONE_PIDFD flag to the clone() system call. If that flag is present, the kernel will return a pidfd (referring to the newly created child) to the parent by way of the ptid argument; that pidfd can then be used to send signals to the child process at some future point.
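
As a rough illustration (a minimal sketch, not code from any of the patches), a parent using the glibc clone() wrapper might obtain a pidfd at creation time as shown below; the CLONE_PIDFD definition is copied from the 5.2 kernel headers for the benefit of older toolchains, and error handling is pared down:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #ifndef CLONE_PIDFD
    #define CLONE_PIDFD 0x00001000    /* from <linux/sched.h> in 5.2 */
    #endif

    #define STACK_SIZE (1024 * 1024)

    static int child_fn(void *arg)
    {
        (void) arg;
        pause();        /* the child just waits to be signaled */
        return 0;
    }

    int main(void)
    {
        int pidfd = -1;
        char *stack = malloc(STACK_SIZE);

        /* With CLONE_PIDFD set, the kernel writes the new pidfd into the
           location passed as the parent-TID pointer. */
        pid_t pid = clone(child_fn, stack + STACK_SIZE,
                          CLONE_PIDFD | SIGCHLD, NULL, &pidfd);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        printf("child %d, pidfd %d\n", pid, pidfd);
        /* pidfd can now be handed to pidfd_send_signal(). */
        return 0;
    }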

There are times, though, when it is not possible to create a process in this manner, but a management process would still like to get a pidfd for another process. Opening the target's /proc directory could work; that was once the only way to get a pidfd for a process. But the /proc approach is apparently not usable in all situations. On some systems, /proc may not exist (or be accessible) at all. For situations like this, Christian Brauner has brought back an earlier proposal for a new system call to create a pidfd:

    int pidfd_open(pid_t pid, unsigned int flags);

The target process is identified with pid; the flags argument must be zero in the current proposal. The return value will be a pidfd corresponding to pid. It's worth noting that there is a possible race window here; pid could be recycled before pidfd_open() runs. That window is small in most normal usage, though, and there are ways for the caller to check and ensure that the process of interest is still running.

When pidfd_open() was proposed in the past, it would return a different flavor of pidfd than would be obtained by opening /proc; an ioctl() call was provided to convert between the two. This behavior was not particularly popular, and has been dropped; there is now just a single type of pidfd, regardless of where it has been obtained.
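
To make the open-then-verify pattern concrete, a signal sender might combine the proposed pidfd_open() with the existing pidfd_send_signal() (merged in 5.1) roughly as in the sketch below. The raw wrappers and the fallback syscall numbers are assumptions, since pidfd_open() had not been merged when this was written, and the verification step is necessarily application-specific:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef __NR_pidfd_send_signal
    #define __NR_pidfd_send_signal 424    /* x86-64 number from 5.1 */
    #endif
    #ifndef __NR_pidfd_open
    #define __NR_pidfd_open 434           /* the number later assigned on x86-64 */
    #endif

    static int my_pidfd_open(pid_t pid, unsigned int flags)
    {
        return syscall(__NR_pidfd_open, pid, flags);
    }

    static int my_pidfd_send_signal(int pidfd, int sig)
    {
        return syscall(__NR_pidfd_send_signal, pidfd, sig, NULL, 0);
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        pid_t pid = atoi(argv[1]);

        int pidfd = my_pidfd_open(pid, 0);
        if (pidfd < 0) {
            perror("pidfd_open");
            return 1;
        }

        /* An application-specific check that pid still names the intended
           process (comparing its command line, cgroup, and so on) belongs
           here; once it passes, signals sent through the pidfd can never
           land on a recycled pid. */
        if (my_pidfd_send_signal(pidfd, SIGTERM) < 0)
            perror("pidfd_send_signal");
        return 0;
    }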

The lack of pidfd_open() is, Brauner says, the main obstacle keeping applications like Android's low-memory killer and systemd from using pidfds for process management. Once that has been resolved, "they intend to switch to this API for process supervision/management as soon as possible". Comments on this system call have settled down to relatively small implementation details, so it seems likely to go in during the 5.3 merge window.

close_range()

One possibly surprising pidfd_open() feature is that the pidfd it creates has the O_CLOEXEC flag set automatically; that will cause the descriptor to be automatically closed should the owning process call execve() to run a new program. This is a hardening feature, intended to prevent open file descriptors from leaking into places where they were not intended to be. David Howells has recently proposed changing the new filesystem mounting API to unconditionally set that flag as well.
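
A descriptor created with O_CLOEXEC set can still be handed across execve() when that is actually wanted; the owner simply clears the flag first. A minimal sketch using the standard fcntl() file-descriptor-flag interface:

    #include <fcntl.h>

    /* Clear close-on-exec on a descriptor that should survive execve();
       other descriptor flags are preserved. */
    static int keep_across_exec(int fd)
    {
        int flags = fcntl(fd, F_GETFD);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
    }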

Howells's change evoked a protest from Al Viro, who does not feel that changing longstanding Unix conventions is the right approach, especially since the behavior of existing calls like open() cannot possibly change in this way. He later suggested that a close_range() system call might be a better way to ensure that file descriptors are closed before calls like execve(). Brauner duly implemented this idea for consideration. The new system call would be:

    int close_range(unsigned int first, unsigned int last);

A call to close_range() will close every open file descriptor from first through last, inclusive. Passing a number like MAXINT for last will work and is the expected usage much of the time. Closing descriptors in the kernel this way, rather than in a loop in user space, allows for a significant speedup; as Brauner put it, "the performance is striking", even though there are clearly ways in which the implementation could be made faster yet.
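
The expected pattern thus looks something like the sketch below. Since close_range() was only a proposal at this point, the wrapper is an assumption; it falls back to ENOSYS where the kernel headers do not define a syscall number for it:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <limits.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int close_range_sketch(unsigned int first, unsigned int last)
    {
    #ifdef __NR_close_range
        /* A zero flags value is passed in case the kernel's variant of the
           call expects one. */
        return syscall(__NR_close_range, first, last, 0);
    #else
        errno = ENOSYS;
        return -1;
    #endif
    }

    /* Typical use: in the child after fork(), keep the standard streams and
       drop everything else in a single call before exec. */
    void run_child(const char *path, char *const argv[], char *const envp[])
    {
        pid_t pid = fork();
        if (pid == 0) {
            close_range_sketch(3, UINT_MAX);
            execve(path, argv, envp);
            _exit(127);
        }
        /* the parent would waitpid() as usual */
    }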

This API is rather less settled at this point. Howells suggested something more like:

    int close_from(unsigned int first);

This variant would close all open descriptors starting with first. It seems that there are use cases, though, for leaving some high-numbered file descriptors open, so this version would be less useful. Florian Weimer, instead, suggested looking at the Solaris closefrom() and fdwalk() functions for inspiration. closefrom() is equivalent to Howells's close_from(), while fdwalk() allows a process to iterate through its open file descriptors. Weimer said that if the kernel were to implement a nextfd() system call to obtain the next open file descriptor, both closefrom() and fdwalk() could be implemented in the C library.
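
For reference, the division of labor Weimer had in mind might look like the sketch below, built on a purely hypothetical nextfd(fd) that returns the lowest open descriptor greater than or equal to fd, or -1 when none remain:

    #include <unistd.h>

    extern int nextfd(int lowfd);    /* hypothetical system call */

    /* Solaris-style closefrom(): close every descriptor >= lowfd. */
    void closefrom_sketch(int lowfd)
    {
        for (int fd = nextfd(lowfd); fd >= 0; fd = nextfd(fd + 1))
            close(fd);
    }

    /* Solaris-style fdwalk(): call func on each open descriptor, stopping
       early if it returns non-zero. */
    int fdwalk_sketch(int (*func)(void *, int), void *cd)
    {
        for (int fd = nextfd(0); fd >= 0; fd = nextfd(fd + 1)) {
            int r = func(cd, fd);
            if (r != 0)
                return r;
        }
        return 0;
    }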

The value of these functions was not clear to Brauner, though. In particular, fdwalk() appears to be mostly needed on systems that lack information on open file descriptors in /proc. In the absence of a pressing need for nextfd(), it is unlikely to be implemented, much less merged. So, unless some other proposal comes along and proves more interesting, a future close_range() implementation appears to be the most likely to find its way into a mainline kernel release.

Index entries for this article
Kernel: pidfd
Kernel: System calls



Close_range()

Posted May 23, 2019 14:41 UTC (Thu) by dskoll (subscriber, #1630) [Link] (4 responses)

It would be handy if close_range took another argument of type fd_set * that contained a set of file descriptors you definitely want to keep open. One use case is if a parent process opens a PID file and wants to keep it open in the child / grandchild for locking purposes.

If you don't need this facility, just supply NULL where the fd_set * argument goes.

Close_range()

Posted May 23, 2019 15:46 UTC (Thu) by ttuttle (subscriber, #51118) [Link] (3 responses)

Would it be possible to have close_set(fd_set* fds) instead/as well? (Or does that already exist?)

Close_range()

Posted May 23, 2019 17:35 UTC (Thu) by josh (subscriber, #17465) [Link] (2 responses)

close_set wouldn't be *substantially* faster than doing a series of close calls in userspace. It'd eliminate a bit of syscall overhead, but that alone doesn't seem worth it.

The advantage of close_range is that the kernel knows which file descriptors the process has, so instead of userspace closing thousands of *possible* descriptors, the kernel can quickly close the handful of *actual* descriptors.

Close_range()

Posted May 23, 2019 23:43 UTC (Thu) by cyphar (subscriber, #110703) [Link] (1 responses)

This comes back to the question of "do you have /proc". Userspace can know what fds it has open by looking at /proc/self/fd. This is what rpm, zypper, and quite a few other programs do now (they used to loop over all possible file descriptors but it turns out this can be incredibly slow inside containers).

Close_range()

Posted Jul 24, 2019 22:39 UTC (Wed) by immibis (subscriber, #105511) [Link]

As I recall the problem with reading /proc/self/fd is that now you have to open a FD to know which FDs you can close.

It might be that you're out of file descriptors, so you need to close one in order to open /proc/self/fd in order to know which file descriptors you can close. But how do you know which one to close?

You could institute a rule like "always close the lowest numbered one that we aren't being asked to keep open". But that's not guaranteed to be an open file descriptor. Although FDs are usually small positive integers, they *could* be any positive integer up to 2^31. Will you loop through all descriptors up to 2^31, just so you can find one to close? (Worst case: the FD limit is 1, and the only FD that's open is 2^31-1).

You might need to close more than one FD. IIRC it's possible to open, say, 1000 FDs, and then set the resource limit much lower than 1000, and the process will have to close enough FDs to get under the limit before it can open /proc/self/fd.

There's also the possibility that /proc isn't mounted. It's not sensible that a filesystem should need to be mounted in order for a process to manage its own internal state.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 14:50 UTC (Thu) by brauner (subscriber, #109349) [Link]

Just for the sake of completeness, the seccomp notification fd retrieved via SECCOMP_RET_USER_NOTIF is also O_CLOEXEC by default.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 15:02 UTC (Thu) by mezcalero (subscriber, #45103) [Link] (13 responses)

That syscall prototype with the "unsigned int" used for fds looks like the kernel side prototype. In userspace fds are generally "int". (Isn't it great that the Linux kernel side and userspace type disagree on such a fundamental type? Yay for Linux!). Hence I presume the MAXINT in the article should actually be a UINTMAX, i.e. the same as (unsigned int) -1...

I must say, for all purposes I have in the codebases I care for (systemd…) the range thing is a bit weird though, usually we want to close everything but a few arbitrary fds, and then rearrange those fds to specific positions. But for that close_range() is not particularly useful, as it requires you to sort your list of fds to keep open first and then find all ranges between them. This means behaviour of closing everything is O(n*log(n)) (for the worst case where the fds to keep open are fully distributed over the entire range), for n being the number of fds to keep open. This is only marginally better than enumerating /proc/self/fd/ and closing everything found there, which is O(m) for m being the number of fds previously open. Marginally better since usually n ≪ m.

I personally would much rather prefer a prototype like:

int close_except(const int *fds, size_t n_fds);

i.e. just specify the fds you want to keep open explicitly, regardless of order, trivially easily...

And I think not only systemd would benefit from such a close_except() call, but also everything else that invokes something with fds set up in a special way, for example popen() and friends.

Lennart

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 16:10 UTC (Thu) by tao (subscriber, #17563) [Link]

Yeah, close_except() certainly seems like a more desirable behaviour.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 16:22 UTC (Thu) by Sesse (subscriber, #53779) [Link] (7 responses)

Can't you rearrange before close instead of after? Say that you want fds 5, 26 and 4, and then close the rest:

dup2(5, 0);
dup2(26, 1);
dup2(4, 2);
close_range(3, MAXINT);

(If you also wanted to keep e.g. 1, you would need some extra fiddling, but that goes for close_except(), too.)

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 12:49 UTC (Fri) by mezcalero (subscriber, #45103) [Link] (6 responses)

Well, for stdin/stdout/stderr that'll work, but it falls apart if the fds you want to keep and maybe later rearrange are already in the range you want to rearrange them to... The socket activation protocol systemd implements (i.e. LISTEN_FDS=) supports large numbers of fds, and systemd might create them a long time before actually activating things, hence they likely are in the range we want to move fds to.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 12:53 UTC (Fri) by Sesse (subscriber, #53779) [Link]

But that's still true even with your proposed close_except()?

New system calls: pidfd_open() and close_range()

Posted May 30, 2019 8:44 UTC (Thu) by mina86 (guest, #68442) [Link] (4 responses)

int close_except(int *fds_to_keep, size_t n) {
    qsort(fds_to_keep, n, sizeof *fds_to_keep, int_cmp);
    int fd = 0;
    for (size_t i = 0; i < n; ++i)
        if (fds_to_keep[i] != fd) dup2(fds_to_keep[i], fd++);
    return close_range(fd, INT_MAX);
}
Handling of EBADF and EINTR left as exercise to the reader.

New system calls: pidfd_open() and close_range()

Posted May 30, 2019 10:31 UTC (Thu) by Jandar (subscriber, #85683) [Link] (3 responses)

> Handling of EBADF and EINTR left as exercise to the reader.

And the important exercise is fixing the bug ;-). The fd's have to be moved back to their original value.

Closing every fd I don't know about is only a band-aid to fix buggy software (own or 3rd-party library). The correct solution is using CLOEXEC and has been available for ages.

New system calls: pidfd_open() and close_range()

Posted May 30, 2019 10:49 UTC (Thu) by mina86 (guest, #68442) [Link]

> And the important exercise is fixing the bug ;-). The fd's have to be moved back to their original value.

Or start referring to the file descriptors by their new numbers (though even then the function lacks a way to communicate all the remappings) to save on a bunch of syscalls.

New system calls: pidfd_open() and close_range()

Posted Sep 7, 2019 17:59 UTC (Sat) by ceplm (subscriber, #41334) [Link] (1 responses)

> Closing every fd I don't know about is only a band-aid to fix buggy software (own or 3rd part library). The correct solution is using CLOEXEC and is available for ages.

No, it is useful also for programs where the author doesn’t have full control. See https://bugs.python.org/issue11284. I could have switched all Python open() calls to use CLOEXEC (that’s essentially what happened in Python 3.2 as an implementation of https://www.python.org/dev/peps/pep-0446/), but it doesn’t save me from C extensions which just use open(2) with default values and create inheritable file descriptors on POSIX platforms.

And walking through /proc/self/fd/ is horribly slow (think about FreeBSD’s MAXFD=655000), and mostly not async signal safe.

New system calls: pidfd_open() and close_range()

Posted Sep 14, 2019 20:54 UTC (Sat) by Jandar (subscriber, #85683) [Link]

There are libraries (I have encountered one) opening a file-descriptor which has to remain open across exec to be used by the same library in (grand)child processes.

Mangling with unknown resources is never good practice but maybe as I said: a band-aid to fix buggy software. The C extension of your example is such buggy software you are applying a band-aid to.

I hope I never see close_range() in production-code. Using it is the confession one has given up on producing/using good software.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 20:07 UTC (Thu) by scientes (guest, #83068) [Link] (3 responses)

What about having an open() flag that doesn't guarantee the lowest syscall number? That is also O(n) for no good reason.

O_ANYFD

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 6:14 UTC (Fri) by smurf (subscriber, #17840) [Link] (1 responses)

You mean the lowest FD number.

Well, that's O(n) over a bitmap, plus there's a CPU instruction to find the first free bit, so the cost is reasonably low no matter how large N is. You could even cache the first free FD or two. I doubt that there'd be any measurable impact of O_ANYFD, esp. compared to the cost of looking up the target of the open().

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 22:50 UTC (Fri) by dezgeg (subscriber, #92243) [Link]

See https://lwn.net/Articles/787473/, specifically this quote: "Jan Kara wondered if the file-descriptor table bouncing could be handled by allocating file descriptors in a way that causes separate threads to use different parts of the table. "

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 0:33 UTC (Sat) by wahern (subscriber, #37304) [Link]

FWIW, fcntl F_DUPFD (or F_DUPFD_CLOEXEC) can be used to get a higher descriptor: it returns a descriptor greater than or equal to the fcntl argument. Not ideal because you have to open and close the original descriptor, but technically sufficient to implement your suggested interface. One still has to figure out what the highest descriptor is and ensure contiguous allocation of related descriptors, but so too with O_ANYFD.

Double standards

Posted May 23, 2019 15:06 UTC (Thu) by magfr (subscriber, #16052) [Link]

It is interesting that systems where /proc ain't accessible are a problem for pidfd_open() but not for nextfd()

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 15:10 UTC (Thu) by tux3 (subscriber, #101245) [Link] (1 responses)

Clearly there should be an API that takes an eBPF callback and queries it for each fd!

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 15:05 UTC (Fri) by perennialmind (guest, #45817) [Link]

Whether you were serious or not, close_range and close_except do share the familiar pattern "do thing foo on series bar". All these round-trip eliminating optimizations remind me of capnproto's "infinity times faster" tagline.

close_range()

Posted May 23, 2019 15:13 UTC (Thu) by magfr (subscriber, #16052) [Link] (5 responses)

While a close_range or close_except certainly would be useful I think an O_CLOFORK flag to set on file descriptors would be even better as it would prevent the creation of all the fds in the child.

close_range()

Posted May 23, 2019 15:19 UTC (Thu) by mezcalero (subscriber, #45103) [Link] (1 responses)

That'd be unusable in a threaded environment, since if three threads want to use this and fork off different programs with different sets of fds, they might end up turning O_CLOFORK off and on for each other, interfering with each other.

close_range()

Posted May 23, 2019 15:30 UTC (Thu) by magfr (subscriber, #16052) [Link]

Yes, threads and fork do not mix very well.
For threaded use I suppose some monstrosity like CreateProcess that allows the caller to specify the file descriptors that should be duplicated is better.

close_range()

Posted May 24, 2019 7:49 UTC (Fri) by maxfragg (subscriber, #122266) [Link] (2 responses)

well, for most of those cases, it can be solved without the kernel using something like pthread_atfork()

close_range()

Posted May 24, 2019 12:51 UTC (Fri) by mezcalero (subscriber, #45103) [Link] (1 responses)

pthread_atfork() is probably one of those concepts that are supposed to solve a problem, but when you use it you then have two problems.

close_range()

Posted May 25, 2019 1:16 UTC (Sat) by wahern (subscriber, #37304) [Link]

Even POSIX admits pthread_atfork was a mistake. It explains the history and why it's broken in the RATIONALE section; for FUTURE DIRECTIONS it says:

The pthread_atfork() function may be formally deprecated (for example, by shading it OB) in a future version of this standard.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 16:03 UTC (Thu) by sorokin (guest, #88478) [Link] (2 responses)

What is the computational complexity of this close_range function? Is it linear in the number of open descriptors in the range, or linear in the size of the range?

If it is the latter, it must be very inefficient for processes with an elevated limit on the number of possible file descriptors.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 7:03 UTC (Fri) by cyphar (subscriber, #110703) [Link]

It would be O(#open_fds) given that files_struct has an open_fds bitmap that can be iterated over. This is what close_files() does already in the free path for put_files_struct().

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 20:00 UTC (Sat) by pbonzini (subscriber, #60935) [Link]

Technically walking a bitmap is O(size of the bitmap); from the point of view of big-O notation it doesn't matter that the constant in front of it is much smaller than the cost of actually closing those few open file descriptors.

However, walking the bitmap is for all purposes fast enough (a handful of instructions for 64 entries) that you can consider close_range to be linear in the number of file descriptors to be closed.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 16:38 UTC (Thu) by ju3Ceemi (subscriber, #102464) [Link] (9 responses)

Why not create a call to get a list of open fds?
Using it, user space could do whatever it wants: filtering, closing some or others, etc.

I suppose "Closing descriptors in a loop in user space is slower" because it basically bruteforce the whole fd-space, calling close() for every fd, even if there is none

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 23:40 UTC (Thu) by cyphar (subscriber, #110703) [Link] (8 responses)

You can already do this by looping over /proc/self/fd. This is what rpm, zypper, and quite a few other programs do.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 17:27 UTC (Fri) by ju3Ceemi (subscriber, #102464) [Link] (6 responses)

Sure, but one needs access to the procfs, which is the issue

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 23:06 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (5 responses)

No, it isn't the issue at all. We shouldn't optimize the kernel for random important kernel interfaces being disabled. Why should running without procfs be a supported configuration at all? Why are we even spending time on this bizarre usecase? We should just hardcode CONFIG_PROCFS=y.

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 23:21 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (4 responses)

Just because /proc is mounted doesn't mean that all the processes on the system have a view of the fs hierarchy that lets them see it.

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 23:31 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (3 responses)

That's true, but the right approach here is to make a useful subset of /proc visible to all processes regardless of the mount table instead of adding dedicated system calls that reproduce what /proc can already do just for those rare cases where /proc isn't available.

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 20:46 UTC (Sun) by roc (subscriber, #30627) [Link] (2 responses)

Mount namespaces and filesystem path resolution are already complicated enough without adding new "always visible" mounts.

Maybe a single new form of "open" syscall that opts into a different mount namespace that contains /proc and nothing else?

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 20:52 UTC (Sun) by quotemstr (subscriber, #45331) [Link] (1 responses)

> Maybe a single new form of "open" syscall that opts into a different mount namespace that contains /proc and nothing else?

That's basically what I had in mind. I'm imagining an open_proc(2) that returns a dirfd that can be used to access standard procfs files using openat(2), with the specific filesystem returned by open_proc(2) filtered to restrict access to security-sensitive parts of procfs like /proc/pid/net.

But I got a ton of opposition on lkml when I proposed something like this, so I think that my always-available procfs idea will be in the far future. :-(

New system calls: pidfd_open() and close_range()

Posted May 28, 2019 16:21 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> I'm imagining an open_proc(2) that returns a dirfd that can be used to access standard procfs files using openat(2) ...

If the pidfd_open() syscall is added, would it be possible to use the returned pidfd with openat()? My understanding is that the result should be equivalent to calling open("/proc/PID") in a context where /proc is procfs, whether or not /proc is visibly mounted. That would offer access to at least the standard /proc/PID hierarchy regardless of mount namespaces or other factors affecting path resolution.

New system calls: pidfd_open() and close_range()

Posted Jul 24, 2019 22:52 UTC (Wed) by immibis (subscriber, #105511) [Link]

There's also the case where you have too many file descriptors open already, and shouldn't have to open one more in order to close some.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 17:42 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (21 responses)

close_range may be useful although determining open file descriptors on Linux is easy enough. But pidfd_open is inherently broken for the exact same reason "pid files" are broken: If the pid was a reliable identifier for the process, nobody would need a pidfd instead. So, please, dump this idea.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 17:56 UTC (Thu) by james (subscriber, #1325) [Link] (20 responses)

I presume the idea is to open the pidfd, check you've got the right process, and then you can reliably use it. (If you haven't got the right process, then the one you were interested in isn't there anymore).

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 19:25 UTC (Thu) by josh (subscriber, #17465) [Link]

You could also use pidfd_open if you're the parent process but can't use CLONE_PIDFD, because then you know the PID will stay valid until you wait on it.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 20:03 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (15 responses)

How's that supposed to happen?

I've tried to find something about this but the only thing I got is the author's firm conviction that "it [pid re-use] ain't gonna happen".

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 5:11 UTC (Fri) by cyphar (subscriber, #110703) [Link] (14 responses)

Once you have the pidfd, then you will get ESRCH if the associated process dies -- even if the pid itself gets reused. This is by design and comes from the same property you get from "v1" pidfds (opening /proc/$pid).

This is similar to opening a file with O_PATH, checking whether it is the file you expect through readlink(/proc/self/fd/$fd) and then operating on the O_PATH. Once you've got the reference, and checked it, the reference is free from TOCTOU attacks.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 13:06 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (13 responses)

This doesn't help. The pid may already refer to the wrong process by the time pidfd_open is called.

A reminder for "But the time window is sooo small! It really ain't gonna happen!"-people: Processes can be suspended. Or can be blocked for extended periods of time if the system is thrashing.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 13:26 UTC (Fri) by cyphar (subscriber, #110703) [Link] (12 responses)

We are in strong agreement over the pid reuse -- that's literally the whole point of this interface. What I'm saying is that you:

1. Do a pidfd_open(); *then*
2. Check that it is the process you want; and *then*
3. Operate on the pidfd which you now know references the process you want.

If the process dies after (1) then you will get -ESRCH on all operations. If (1) is the wrong process, you will detect it during (2) and can error out. Thus (3) will only ever operate on the correct process -- because you have a re-usable reference that won't be recycled you aren't subject to recycling problems. This is not possible with the original pid-based interfaces because any operations in (3) would be using a pid that might be recycled and thus the check in (2) is worthless.

Please note that all of the above is also true with the current pidfd interface which works through opening /proc/$pid and pidfd_send_signal(2) (this was merged in 5.1). pidfd_open(2) is not anything more radical than that, it just offers a way of using pidfds without the need for procfs to be mounted and usable.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 14:44 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (11 responses)

I was asking how this "check that you got the right process" is supposed to happen.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 15:44 UTC (Fri) by cyphar (subscriber, #110703) [Link] (10 responses)

I guess I misunderstood your question (you were talking about the race condition not how the check might work).

This depends strongly on what the application is doing. If the application is pkill(1), then you check openat(pidfd, "cmdline") and check the name of the command. If you're an init system you might check the ppid is 1, that it's in the right cgroup, that it has the right exe magiclink, that you haven't received a death signal from that service, and so on.

In addition, the benefit of pidfds is that they can be passed around or even persisted (through bind-mounts) so that if you are in a scenario where you are sure the pidfd is correct (for instance, you are the parent process and spawned the child) you can pass the pidfd to another process which can operate on the pidfd even though they could not (by themselves) get a pidfd that they were sure about. A toy usecase might be a container runtime bind-mounting the pidfd of the pid1 of containers (after spawning the pid1) and then using that pidfd after-the-fact to operate on the container's namespaces.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 18:31 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (9 responses)

Short version of the answer: It can't be done. This was my impression as well.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 19:59 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

What do you mean by "it can't be done"? Cyphar provided the exact way it can be done.

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 15:43 UTC (Sat) by rweikusat2 (subscriber, #117920) [Link] (7 responses)

"It can't be done".

Various heuristics can be applied here in order to find a process which is (according to someone's opinion at least) very likely to be the process whose pid was originally acquired, but this doesn't mean that it actually is this process. The only way to know that a pid refers to a certain process is to acquire it from the parent and 'somehow' ensure that the exit status won't be reaped until one is done using the pid. If this is guaranteed, jumping through the pidfd hoop is just useless. The pid could be used instead (for signalling at least).

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 23:09 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (3 responses)

Cyphar really did describe exactly how it should be done, as Cyberax noted. Either provide a *precise timeline* showing a scenario under which Cyphar's approach fails or admit that you're wrong. Your comment makes unsubstantiated claims that are simply incorrect. Coordination with parental reaping is unnecessary.

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 20:34 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link] (2 responses)

The original statement was that it would be possible to check that a pid which was used for pidfd_open indeed referred to the process it was intended to refer to when the open was done. This is impossible unless the process doing the open is the parent or communicates with the parent. In particular, it cannot be done by looking at the content of /proc files related to the process now using the pid as there's nothing specific to a particular process in there: All /proc directories for processes running ls -l /proc/self/ will contain the same metainformation (except the timestamps, that is, but these depend on the system clock and could also repeat in sufficiently adverse circumstances).

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 20:46 UTC (Sun) by quotemstr (subscriber, #45331) [Link] (1 responses)

All you've done is repeat your initial assertion. You're wrong. Either admit that or provide a *SPECIFIC* and *CONCRETE* counter-example showing the race you're claiming exists.

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 21:33 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link]

I'm sorry but if you don't understand that, I obviously won't be able to explain it to you.

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 3:28 UTC (Sun) by mjg59 (subscriber, #23239) [Link] (2 responses)

The behaviour of pill(1) is well documented - it kills processes that have specified attributes. You take a reference to a pid and then start examining those attributes. If they all match, you send an appropriate signal to that pid. Where's the race?

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 20:42 UTC (Sun) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

It's inherent in the semantics of the command: pkill (I'm assuming this was a typo) kills some processes which are currently running and have certain attributes, namely, all it happens to find. This may include processes which were started after the pkill (and thus, very likely shouldn't have been killed by it) but it may as well not (they might be started such that pkill won't find them). Arguably, a pidfd_open in pkill would stop it from possibly killing processes it shouldn't ever have killed because they didn't match the specification. I didn't understand that.

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 22:46 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

Now you're just being silly.

pkill() has a contract - it kills processes by specified attributes. Right now it can kill a random process due to the PID-based race condition. There's still a race condition - pkill is not guaranteed to kill processes that launched concurrently with it.

The new pkill() would _always_ kill the right processes.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 20:11 UTC (Thu) by droundy (subscriber, #4559) [Link] (2 responses)

And I'll add that you can't do the same process *without* pidfds, because the process may die and be replaced between when you check that it is the one you want to signal, and when you actually signal it.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 20:21 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

If you're the parent, there's no issue here: The pid will remain associated with the correct process until its exit status has been collected.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 22:49 UTC (Thu) by droundy (subscriber, #4559) [Link]

Agreed, if you are the parent then you don't need to check that you've got the right process.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 20:31 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (4 responses)

Those longstanding unix conventions are terrible. Al is simply wrong here to fight to preserve them. O_CLOEXEC really is the right approach. This debate was settled a long time ago, when most system calls got O_CLOEXEC options. Let's not reopen it for pointless reasons.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 20:57 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

The reason for O_CLOEXEC options is that this flag needs to be set atomically if the process doing the open is multithreaded, to prevent accidental inheritance of a file descriptor because another thread initiated a fork/exec after the open returned and before an fcntl() call setting FD_CLOEXEC was made.

Apart from that, there's no "wrong" and no "right" approach here. There are two choices for a default policy, namely inheritable or not inheritable, and which is better for a certain application depends on what the application is trying to do. Making the default policy a configurable per-process setting would probably make sense. Different default policies depending on whatever the person who implemented the syscall preferred for whatever reason just make the system more complicated without any benefit.

New system calls: pidfd_open() and close_range()

Posted May 23, 2019 23:45 UTC (Thu) by cyphar (subscriber, #110703) [Link] (2 responses)

Right, but the point is that you can always disable O_CLOEXEC safely through fcntl(2) but you cannot safely enable it. So really, making O_CLOEXEC the default seems to have no real downside other than inconsistencies with existing syscalls -- but open(2) is already inconsistent with almost every other syscall by ignoring unknown flags instead of giving -EINVAL.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 12:59 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

These are three grammatically connected but unrelated statements.

The O_CLOEXEC file descriptor creation flag doesn't belong in this discussion.

The default policy for existing system calls is mandated by POSIX, hence, changing that is not an option. Different default policies for different system calls are still a bad idea as they make the system more difficult to program for no real benefit. A default policy which could be changed on a per-process basis could be a good idea. OTOH, file descriptor creation is a fairly rare occurence, hence, this might also not be worth the effort. But it would at least solve a problem and not create one.

That some aspects of open are different from every other system call is unsurprising. If this wasn't the case, an open syscall wouldn't exist.

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 13:35 UTC (Fri) by cyphar (subscriber, #110703) [Link]

I guess we have different opinions on what APIs make the most sense. I agree that it does result in inconsistencies, though I disagree this makes systems significantly harder to program.

> That some aspects of open are different from every other system call is unsurprising. If this wasn't the case, an open syscall wouldn't exist.

I don't understand what you mean by this. open(2) is a syscall that must exist for reasons that should be abundantly obvious. However, the fact that open("foo", 0xFFFFFFFFFFFFFFFF, 0) works today -- despite many of those bits being undefined -- is not required for open(2) to exist. This is known as a bad practice -- no other syscall operates like this (and the kernel documentation explicitly tells you not to design syscalls or ioctls like this).

Linux is catching up with MirBSD

Posted May 24, 2019 1:03 UTC (Fri) by mirabilos (subscriber, #84359) [Link]

Cool, close_range() will make it possible to emulate the OpenBSD-derived closefrom(2) syscall:

https://www.mirbsd.org/htman/i386/man2/closefrom.htm

I for one will be rooting for that!

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 8:34 UTC (Fri) by Sesse (subscriber, #53779) [Link] (1 responses)

How will error handling work? You're supposed to check the return value from close(), but if it fails, how do you know which fd had an issue…?

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 14:52 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

There have been articles about the return value of close(2) here before. Basically the state of the descriptor after a failed close(2) is undefined, even for EINTR, so checking the return value is only of "notify that something went wrong" value. There's nothing to actually *do* in response to an error from close(2).

New system calls: pidfd_open() and close_range()

Posted May 24, 2019 11:36 UTC (Fri) by stressinduktion (subscriber, #46452) [Link] (2 responses)

I would like to propose another idea on how to attack this problem:

a new flag/flags or bitset to open and a new syscall that closes all descriptors with a particular flag:

- open(..., O_FD_KEEP_OPEN_EXECVE)
- fcntl(fd, F_SETFL, ...|O_FD_KEEP_OPEN_EXECVE)

and finally close all descriptors without those flags: close_except(O_FD_KEEP_OPEN_EXECVE)

Obviously I haven't put much time into the naming of those flags. :)

One could build on top of that and implement:

close_type(FD_STDIO); // dubious semantics
close_type(FD_ALL_SOCKETS);
close_type(FD_ALL_TTYS);
close_type(FD_ALL);

Not sure if it is worth the complexity, but it could be useful. I think that in most cases it is already known at open/fcntl time whether a descriptor should stay open. Furthermore it might be useful during debugging if this information were also discoverable in the /proc/pid/fd hierarchy.

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 15:31 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (1 responses)

There are a lot of ABI problems here. open(2) can't take new flags (because the implementation never verified that unused flag bits were passed in as zero), POSIX specifies how things work, etc. I'm also suspicious that applications would actually list the full set of fd-kinds for closing all kinds (block, character, pidfd, memfd, etc.) except those which they care about. Or that they even know the answer to such a question.

New system calls: pidfd_open() and close_range()

Posted May 28, 2019 1:03 UTC (Tue) by cyphar (subscriber, #110703) [Link]

> open(2) can't take new flags (because the implementation never verified that unused flags were passed in unused), POSIX specifies how things work, etc.

We add flags to open(2) (and openat(2)) all the time, despite this problem. New syscalls aren't defined this way, but we're stuck with this behaviour unfortunately.

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 23:11 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (3 responses)

I'm opposed to closerange. procfs already provides the necessary functionality, and I don't think we should be putting engineering effort into duplicating features of procfs in a procfs-free environment. Instead, we should ensure that procfs is always available --- at least a subset of procfs useful for process-management tasks.

New system calls: pidfd_open() and close_range()

Posted May 25, 2019 23:20 UTC (Sat) by quotemstr (subscriber, #45331) [Link]

Granted, a batched close(2) would be useful, and not just immediately before an execve! I'd be fully in support of a close_fds(2) that accepted a *list* of FDs --- I just think that reading the list of FDs from /proc is perfectly reasonable.

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 0:13 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

Scanning /proc/self is not at all free or fast. You'll need to do a bunch of kernel-userspace transitions instead of one syscall.

New system calls: pidfd_open() and close_range()

Posted Jul 30, 2019 23:30 UTC (Tue) by immibis (subscriber, #105511) [Link]

The killer problem (to me) with procfs is that you have to open an FD in order to close any FDs. What if the process has reached the maximum number of FDs? Do you try to close arbitrary FDs in the hope of hitting one that is actually open, so that you can then open /proc/self/fd in order to know the rest of the FDs to close?

New system calls: pidfd_open() and close_range()

Posted May 26, 2019 3:27 UTC (Sun) by sbaugh (guest, #103291) [Link] (9 responses)

close_range, and its more common form closefrom, are much worse for Unix conventions than setting CLOEXEC by default. Many foolish processes run closefrom on startup, which means you can't implicitly pass down arbitrary file descriptors to them. That is, you can't pass down a file descriptor without explicitly telling the program about it. That in turn closes off a whole class of tricks. For example, passing a temporary file via file descriptor and referencing it with /proc/self/fd/n means you can pass a file without ever linking it into the filesystem. Libraries can accept file descriptors specified by environment variables and use them to extend the main program's functionality without the main program having to be aware.

I'm in favor of setting CLOEXEC by default - that's the only safe way to do things in the presence of a shared file descriptor table (i.e. with threads). Sometimes you want to implicitly inherit a file descriptor, and you can just unset CLOEXEC in that case. But closefrom will make that impossible: it does not support implicit inheritance of file descriptors.

New system calls: pidfd_open() and close_range()

Posted May 27, 2019 14:34 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (8 responses)

You’re not supposed to run closefrom() at startup… only before passing control to another process via exec*() after a fork(), for example.

New system calls: pidfd_open() and close_range()

Posted May 28, 2019 0:04 UTC (Tue) by sbaugh (guest, #103291) [Link] (7 responses)

>You’re not supposed to run closefrom() at startup… only before passing control to another process via exec*() after a fork(), for example.

That doesn't solve the problem; it still breaks implicit inheritance of file descriptors. If my shell, for example, ran closefrom after fork but before exec, I wouldn't be able to pass down fds to programs started by the shell. The same concern applies in any fork/exec case.

Also, independently, many unfortunate programs already do call closefrom at startup. I think adding it as a syscall will make this worse, as userspace programmers hear the bad advice to call closefrom on startup.

New system calls: pidfd_open() and close_range()

Posted May 28, 2019 1:11 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> That doesn't solve the problem; it still breaks implicit inheritance of file descriptors. If my shell, for example, ran closefrom after fork but before exec, I wouldn't be able to pass down fds to programs started by the shell. The same concern applies in any fork/exec case.
A shell can just loop through FDs and close them. So what's the difference?

New system calls: pidfd_open() and close_range()

Posted May 28, 2019 1:36 UTC (Tue) by sbaugh (guest, #103291) [Link] (5 responses)

>A shell can just loop through FDs and close them. So what's the difference?

Not sure what you're meaning to ask. There is no difference in functionality. close_range or closefrom are just a more efficient version of looping through all open FDs and closing them.

I'm saying that close_range, closefrom, and looping through open FDs are all bad things that you shouldn't do. close_range makes it easier to do something you shouldn't do. So it shouldn't be added. We shouldn't add syscalls to make it faster to do things that shouldn't be done.

New system calls: pidfd_open() and close_range()

Posted May 28, 2019 2:50 UTC (Tue) by mirabilos (subscriber, #84359) [Link] (4 responses)

Except it’s in the Korn Shell’s definition that only explicitly passed fds are passed to processes.

Except that sometimes you *do* want to close all other fds (sometimes even all fds, before reopening /dev/null as fd 0 and duping it to 1 and 2) for security reasons.

Programmers will shoot themselves in the foot, but if the rope is efficient… *shrug*

And if you don’t like it, it’s open source and you’ll be able to remove the call in your local copy of the software.

New system calls: pidfd_open() and close_range()

Posted May 28, 2019 3:56 UTC (Tue) by sbaugh (guest, #103291) [Link] (3 responses)

>Except it’s in the Korn Shell’s definition that only explicitly passed fds are passed to processes.

Yes, the Korn shell (and mksh) are wrong here, I'm afraid. Thankfully most shells don't have that particular bad behavior.

>Except that sometimes you *do* want to close all other fds (sometimes even all fds, before reopening /dev/null as fd 0 and duping it to 1 and 2) for security reasons.

The "security reasons" argument here applies identically to environment variables. Indeed, I think it's sound to clear the environment and close all fds when you really want to harden something - in setuid environments, for example. But most programs shouldn't be clearing their environment, and most programs shouldn't be closing all fds.

>Programmers will shoot themselves into the foot, but if the rope is efficient… *shrug*
>And if you don’t like it, it’s open source and you’ll be able to remove the call in your local copy of the software.

This doesn't make any sense. Again, we shouldn't add syscalls to speed up operations which should not be done in the first place. That has nothing to do with open source choice. If you think this operation is good, then say that.

New system calls: pidfd_open() and close_range()

Posted May 28, 2019 6:58 UTC (Tue) by roc (subscriber, #30627) [Link] (2 responses)

> Again, we shouldn't add syscalls to speed up operations which should not be done in the first place.

But many programs do want hardening --- e.g. Firefox and Chrome need to close all fds in sandboxed child processes --- and probably that trend is to be encouraged. Why shouldn't they have an efficient way to do it?

Plus, it seems to me that leaking internal file descriptors into spawned subprocesses is unhygienic, asking for trouble, and should not be the default.

New system calls: pidfd_open() and close_range()

Posted May 28, 2019 12:57 UTC (Tue) by sbaugh (guest, #103291) [Link] (1 responses)

>But many programs do want hardening --- e.g. Firefox and Chrome need to close all fds in sandboxed child processes --- and probably that trend is to be encouraged. Why shouldn't they have an efficient way to do it?

Primarily: because they don't need an efficient way to do it. They can close file descriptors once at startup and thereafter make sure to use CLOEXEC to prevent file descriptor leakage. That works fine and doesn't encourage bad practices in the rest of userspace. But more generally...

It's obviously hard to argue against "we should do more sandboxing". But I fall back to the analogy to environment variables: If we shouldn't clear the environment before running the program, then we shouldn't close all file descriptors before running it. Browser Javascript sandboxes are totally closed off from the rest of the system - they certainly don't need environment variables, and they wouldn't benefit from file descriptors either. So it's fine to clear the environment and close all fds before running them.

Certainly I think we should move towards a world where processes can't access things they haven't somehow been given access to. But I don't think we should make every process like a browser Javascript sandbox, and be totally closed off from the system. There are many things that are implicitly inherited in Linux - the root directory, the current working directory, all the namespaces, tons of things. Most of those should be removed or reworked, because they're easy to mess up and can't easily be controlled, and most of them can't even be changed without privileges. But many parts of Unix are still useful, and shouldn't be removed. Environment variables shouldn't be removed, and the possibility of implicit inheritance of file descriptors shouldn't be removed. Otherwise we'll inevitably end up reinventing those, slowly and painfully.

>Plus, it seems to me that leaking internal file descriptors into spawned subprocesses is unhygenic, asking for trouble, and should not be the default.

I agree! Thankfully, there is already a way to prevent this: CLOEXEC. You don't need close_range to avoid leaking file descriptors into spawned subprocesses.

As I said in my original comment: CLOEXEC is the right way to stop implicit inheritance of file descriptors, and I think it should be the default. Almost all file descriptors shouldn't be implicitly inherited into children - that would be massive resource leakage. But sometimes you *do* want to implicitly inherit a file descriptor and CLOEXEC gives you that option, because you can turn it off and inherit a file descriptor without having to explicitly tell your child that you're doing it. closefrom is wrong because there's no way to turn it off when some random glue program in the middle of the tree calls it, so it completely prevents implicit inheritance of file descriptors.

New system calls: pidfd_open() and close_range()

Posted May 30, 2019 12:26 UTC (Thu) by roc (subscriber, #30627) [Link]

> They can close file descriptors once at startup and thereafter make sure to use CLOEXEC to prevent file descriptor leakage. That works fine and doesn't encourage bad practices in the rest of userspace. But more generally...

That would mean a bug in a third-party library that creates a non-CLOEXEC file descriptor, even for a moment, creates a security hole for the sandbox. That's not really acceptable.

> If we shouldn't clear the environment before running the program, then we shouldn't close all file descriptors before running it.

I don't buy that argument. Reading the environment can't usually affect other processes. Reading or writing leaked file descriptors can, sometimes in very subtle ways.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds