Extending system calls

By Jake Edge
May 14, 2008

Getting interfaces right is a hard, but necessary, task, especially when that interface has to be supported "forever". Such is the case with the system call interface that the kernel presents to user space, so adding features to it must be done very carefully. Even so, when Ulrich Drepper set out to remove a hole that could lead to a race condition, he probably did not expect all the different paths that would need to be tried before closing in on an acceptable solution.

The problem stems from wanting to be able to create file descriptors with new properties—things like close-on-exec, non-blocking, or non-sequential descriptors. Those features were not considered when the system call interface was developed. After all, many of those system calls are essentially unchanged from early Unix implementations of the 1970s. The open() call is the most obvious way to request a file descriptor from the kernel, but there are plenty of others.

In fact, open() is one of the easiest to extend with new features because of its flags argument. Calls like pipe(), socket(), accept(), epoll_create() and others produce file descriptors as well, but don't have a flags argument available. Something different would have to be done to support additional features for the file descriptors resulting from those calls.

The close-on-exec functionality is especially important to close a security hole for multi-threaded programs. Currently, programs can use fcntl() to change an open file descriptor to have the close-on-exec property, but there is always a window in time between the creation of the descriptor and changing its behavior. Another thread could do an exec() call in that window, leaking a potentially sensitive file descriptor into the newly run program. Closing that window requires an in-kernel solution.

Back in June of last year, after some false starts, Linus Torvalds suggested adding an indirect() system call, as a way to pass flags to system calls that don't currently support them. The indirect() call would apply a set of flags to the invocation of an existing system call. This would allow existing calls to remain unchanged, with only new uses calling indirect(). User space programs would be unlikely to call the new function directly, instead they would call glibc functions that handled any necessary indirect() calls.

Davide Libenzi created a sys_indirect() patch in July, but Drepper saw it as "more complex than warranted". So Drepper created his own "trivial" implementation, that was described on this page in November. It was met with a less than enthusiastic response on linux-kernel for being, amongst other things, an exceedingly ugly interface.

The alternative to sys_indirect() is to create a new system call for each existing call that needed a flags argument. This was seen as messy by some, including Torvalds, leading some kernel hackers into looking for alternatives. The indirect approach also had some other potential benefits, though, because it was seen as something that could be used by syslets to allow asynchronous system calls. No decision seemed to be forthcoming, leading Drepper to ask Torvalds for one:

Will you please make a decision regarding sys_indirect? There has been no other proposal so the alternative is to add more syscalls.

To bolster his argument that sys_indirect() was the way to go, Drepper also created a patch to add some of the required system calls. He started with the socket() family, by adding socket4(), socketpair5(), and accept4()—tacking the number of arguments onto the function name a la wait3() and wait4(). Drepper's intent may not have been well served by choosing those calls as Alan Cox immediately noted that the type argument could be overloaded:

Given we will never have 2^32 socket types, and in a sense this is part of the type why not just use

        socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, ...)

that would be far far cleaner, no new syscalls on the socket side at all.

Michael Kerrisk looked over the set of system calls that generate file descriptors, categorizing them based on whether they needed a flag argument added. He observed that roughly half of the file descriptor producing calls need not change because they could either use an overloading trick like the socket calls, the glibc API already added a flags argument, or there were alternatives available to provide the same functionality along with flags.

In response, Drepper made one last attempt to push the indirect approach, saying:

Or we just add sys_indirect (which is also usable for other syscall extensions, not just the CLOEXEC stuff) and let userlevel (i.e., me) worry about adding new interfaces to libc. As you can see, for the more recent interfaces like signalfd I have already added an additional parameter so the number of interface changes would be reduced.

Even though the indirect approach has some good points, Torvalds liked the approach advocated by Cox, saying:

Ok, I have to admit that I find this very appealing. It looks much cleaner, but perhaps more importantly, it also looks both readable and easier to use for the user-space programmer.

Ultimately, developers will only use these new interfaces if they can easily test for the existence of the new code. Torvalds gives an example of how that might be done using the O_NOATIME flag to open(), which has only been available since 2.6.8. It is this testability issue that makes him believe the flags-based approach is the right one:

And that's the problem with anything that isn't flags-based. Once you do new system calls, doing the above is really quite nasty. How do you statically even test that you have a system call? Now you need to add a whole autoconf thing for it existing, and when it does exist you still need to test whether it works, and you can't even do it in the slow-path like the above (which turns the failure into a fast-path without the flag).

This new approach, with a scaled down number of new system calls rather than adding a general-purpose system call extension mechanism like sys_indirect(), is now being pursued by Drepper. In the explanatory patch at the start of the series, he lays out which of the system calls will require a new user space interface: paccept(), epoll_create2(), dup3(), pipe2(), and inotify_init1(), as well as those that do not: signalfd4(), eventfd2(), timerfd_create(), socket(), and socketpair().

Drepper has already made several iterations of patches addressing most of the concerns expressed by the kernel developers along the way. There have been some architecture specific problems, but Drepper has been knocking those down as well. If no further roadblocks appear, it would seem a likely candidate for inclusion in 2.6.27.

Index entries for this article
Kernel	indirect()
Kernel	System calls

Extending system calls

Posted May 15, 2008 8:16 UTC (Thu) by ncm (guest, #165) [Link] (1 responses)

Please forgive my obtuseness (obtusity?)... why can't open() be made to do the job of all of
pipe(), socket(), etc., with all the flags and whistles one could want, leaving the
traditional entry points as special cases?

Extending system calls

Posted May 15, 2008 10:01 UTC (Thu) by ncm (guest, #165) [Link]

Scratch that.

Extending system calls

Posted May 15, 2008 16:45 UTC (Thu) by i3839 (guest, #31386) [Link]

I'm so glad the sys_indirect approach is abandoned, it looked horrible to use (and that glibc
would provide wrapper functions is no consolidation).

Stylistic gripe

Posted May 15, 2008 22:25 UTC (Thu) by alvherre (subscriber, #18730) [Link] (1 responses)

This is very minor -- it was rather disapppointing to see developers being talked about by
last name.  I can only hazard that it's the standard in press, so Edge follows it, but Jon has
always seemed to prefer using first names instead and the result is better (IMVHO).

Stylistic gripe

Posted May 15, 2008 22:50 UTC (Thu) by nix (subscriber, #2304) [Link]

I concur. The developers don't refer to each other by last name, and are 
pretty much universally known either by first name or email address, so it 
looks really rather out of place.

Extending system calls

Posted May 17, 2008 21:12 UTC (Sat) by jimparis (guest, #38647) [Link] (3 responses)

> Currently, programs can use fcntl() to change an open file descriptor to
> have the close-on-exec property, but there is always a window in time
> between the creation of the descriptor and changing its behavior. Another
> thread could do an exec() call in that window, leaking a potentially
> sensitive file descriptor into the newly run program. Closing that window
> requires an in-kernel solution.

No it doesn't!  Simple locking between threads would easily fix the race.  See
https://bugzilla.redhat.com/show_bug.cgi?id=233481 for an example.  The problem with this
approach appears to be poor performance.

Extending system calls

Posted May 19, 2008 11:42 UTC (Mon) by liljencrantz (guest, #28458) [Link] (2 responses)

If the fd creation is part of a library, that solution may be unacceptable.

Syscalls and locking

Posted Aug 3, 2008 5:43 UTC (Sun) by smurf (subscriber, #17840) [Link]

Libraries don't directly call the system; they use the glibc entry points.

However, any possible locking scheme would require that execve() and open()/whatever cannot
run concurrently. It's not hard to construct a program that would be hurt rather severely by
that restriction.

Extending system calls

Posted Oct 10, 2008 19:41 UTC (Fri) by bluefoxicy (guest, #25366) [Link]

Why not uh. fork(), then in the child (which calls exec()) call close() to drop the file handle, THEN call exec()? You know, like normal people?