sys_indirect()

By Jonathan Corbet
November 19, 2007
Creating user-space APIs is a hard task. Even if an interface seems complete and well designed when it is created, the future can often add new requirements which the old API is hard-put to satisfy. So, for example, Unix started with the wait() system call. As applications got more complicated, it became necessary to wait for a specific process, to get more information about exiting processes, to wait in a non-blocking manner, and so on. So now, in addition to wait(), we have waitid(), waitpid(), and wait4(). Since old versions of system calls can (almost) never go away, changing needs over time tend to cause a proliferation of new calls.

Most recently, Ulrich Drepper has been asking for the ability to add flags to system calls which create file descriptors but which have no flags argument. Examples include socket() and accept(). It is possible to adjust the behavior of file descriptors created with these system calls after the fact (with fcntl()), but there will always be a period during which the file descriptors exist but the desired behavior has not been set. When that behavior is "close on exec," and a multi-threaded program is running, one thread might run a new program with exec() before another one has managed to set the "close on exec" flag. The result of this race is a leaked file descriptor which can, in turn, be a security problem. The only efficient way to close this particular race is for the kernel to create file descriptors with the desired flags set from the outset.
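The two-step pattern behind this race can be sketched in a few lines of C. This is only an illustration: the helper names socket_then_cloexec() and has_cloexec() are made up for this example, while socket(), fcntl(), F_GETFD, F_SETFD and FD_CLOEXEC are the standard interfaces. The comment in the middle marks the window in which another thread could fork and exec a new program that inherits the descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a socket, then mark it close-on-exec in a
 * separate step. Returns the descriptor, or -1 on error.
 */
int socket_then_cloexec(int domain, int type, int protocol)
{
    int fd = socket(domain, type, protocol);
    if (fd < 0)
        return -1;

    /*
     * RACE WINDOW: fd exists here without FD_CLOEXEC. If another thread
     * forks and execs a new program right now, that program inherits fd.
     */

    int flags = fcntl(fd, F_GETFD);
    if (flags < 0 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: 1 if close-on-exec is set on fd, 0 if not, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    return flags < 0 ? -1 : (flags & FD_CLOEXEC) != 0;
}
```

No amount of care in user space removes the window between the two calls; it can only be closed if the descriptor is created with the flag already set.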

Traditionally, this sort of problem would be solved through the creation of a new system call; one could, for example, add a four-argument socket4() which has the requisite flags parameter. This solution is unsatisfying, though; as has been seen, it leads to an ever-growing list of system calls. So it would be nice to find a different solution. Ulrich thinks he has done so by adding a single system call (indirect()), which works by passing additional information to existing system calls.

It should be noted that the first sys_indirect() implementation was created by Davide Libenzi back in July. Ulrich wasn't entirely happy with that code, though:

Davide's previous implementation is IMO far more complex than warranted. This code here is trivial, as you can see. I've discussed this approach with Linus last week and for a brief moment we actually agreed on something.

The prototype for the new system call looks something like this:

    int indirect(struct indirect_registers *regs,
                 void *userparams,
                 size_t paramslen,
                 int flags);

The regs structure holds the process registers normally used in system calls; the system call number and its (normal) arguments, in other words. The extra parameters to be passed to the system call live in userparams, with a length of paramslen. The flags argument is currently unused; it's there for any sort of future expansion, since extending indirect() with itself is not allowed.

The task_struct structure has been extended with a new field:

    union indirect_params indirect_params;

This union is meant to contain fields for each sort of parameter which can be added to a system call; in Ulrich's patch it looks like this:

    union indirect_params {
        struct {
            int flags;
        } file_flags;
    };

It can, thus, be used to pass a flags argument to system calls which deal in file descriptors.

When indirect() is called, it checks the requested system call number against an internal whitelist. If the specific system call has not been marked as being extensible in this way, the call fails with EINVAL. Otherwise the application-supplied parameters are copied into the current process's task_struct structure and the system call is invoked in the usual way. Once that system call completes, the indirect_params area in the task structure is zeroed.
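The sequence just described (whitelist check, copy, call, zero) can be modeled in user space. What follows is a rough sketch and not the code from Ulrich's patch: every name starting with model_ or MODEL_ is invented for the illustration, the "system calls" are stand-in functions, and rejecting an oversized paramslen with EINVAL is an assumption rather than something stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Same shape as the union shown in the article. */
union indirect_params {
    struct {
        int flags;
    } file_flags;
};

/* Hypothetical stand-in for the one task_struct field that matters here. */
struct model_task {
    union indirect_params indirect_params;
};

struct model_task model_current;   /* plays the role of the current task */

/* Hypothetical call numbers, used only by this model. */
enum { MODEL_NR_SOCKET, MODEL_NR_GETPID, MODEL_NR_COUNT };

/* Fake socket(): reports the extra flags it saw instead of creating anything. */
static long model_socket(void)
{
    return model_current.indirect_params.file_flags.flags;
}

/* Fake getpid(): a call that has not been marked as extensible. */
static long model_getpid(void)
{
    return 1234;
}

typedef long (*model_call_fn)(void);

static const model_call_fn model_table[MODEL_NR_COUNT] = {
    [MODEL_NR_SOCKET] = model_socket,
    [MODEL_NR_GETPID] = model_getpid,
};

/* The whitelist: only socket() accepts extra parameters in this model. */
static const int model_extensible[MODEL_NR_COUNT] = {
    [MODEL_NR_SOCKET] = 1,
};

long model_indirect(int nr, const void *userparams, size_t paramslen, int flags)
{
    long ret;

    (void)flags;   /* currently unused, as in the article */

    /* Step 1: refuse calls that are not on the whitelist. */
    if (nr < 0 || nr >= MODEL_NR_COUNT || !model_extensible[nr])
        return -EINVAL;

    /* Assumption: a missing or oversized parameter block is also an error. */
    if (userparams == NULL || paramslen > sizeof(model_current.indirect_params))
        return -EINVAL;

    /* Step 2: copy the caller's extra parameters into the task. */
    memcpy(&model_current.indirect_params, userparams, paramslen);

    /* Step 3: invoke the call in the usual way. */
    assert(model_table[nr] != NULL);
    ret = model_table[nr]();

    /* Step 4: zero the area so that later direct calls see the defaults. */
    memset(&model_current.indirect_params, 0,
           sizeof(model_current.indirect_params));
    return ret;
}
```

In this model, calling model_indirect() for MODEL_NR_SOCKET with file_flags.flags set to 1 returns 1, the same request for MODEL_NR_GETPID fails with -EINVAL, and the per-task area is back to zero after every call.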

The kernel provides no indication to the system call that it has been invoked via indirect(); the only difference in that case is that there might be non-zero values in indirect_params. In a sense, then, this mechanism can be seen as a way to add parameters to system calls with a default value of zero. As a consequence, it is not possible, without some additional work, to add a parameter for which passing a value of zero has a different meaning than omitting the parameter altogether.

Should a need for yet another parameter materialize in the future, the size of the indirect_params structure can be increased as needed. As long as the kernel retains the old behavior when the new parameter has a value of zero, older applications and libraries will continue to operate as they did before. The extended system call need not (and cannot) know whether the larger indirect_params structure is being used or not.
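A small sketch may make the compatibility argument concrete. It assumes, hypothetically, that the parameter block grows from one field to two; the names params_v1, params_v2, timeout_hint and load_params() are invented for this example. The receiving side zero-fills its copy and then overwrites only as many bytes as the caller supplied, so an older caller that knows only about the first field leaves the new one at zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Layout known to an older C library (hypothetical). */
struct params_v1 {
    int flags;
};

/* Layout known to a newer kernel; timeout_hint is a made-up later addition. */
struct params_v2 {
    int flags;
    int timeout_hint;
};

/*
 * Kernel-side model: zero-fill the destination, then copy only what the
 * caller supplied. Returns 0 on success, or -1 if the caller passes more
 * bytes than this side understands (an assumption of this sketch).
 */
int load_params(struct params_v2 *dst, const void *userparams, size_t paramslen)
{
    assert(dst != NULL);

    if (paramslen > sizeof(*dst))
        return -1;

    memset(dst, 0, sizeof(*dst));
    if (userparams != NULL && paramslen > 0)
        memcpy(dst, userparams, paramslen);
    return 0;
}
```

An old caller passing a struct params_v1 with paramslen set to sizeof(struct params_v1) ends up with timeout_hint equal to zero, which is exactly the case in which the kernel must retain its old behavior.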

There is a possible use for this mechanism beyond extending system calls: the syslet developers see it as a possible way of specifying asynchronous behavior. The current syslet patches are essentially an indirect wrapper layer around system calls which specifies that the call is asynchronous (and what to do with the results). Adding two separate indirect layers for system calls seems like a suboptimal solution, so there is interest in adding syslet information to indirect() instead. That is one of the intended purposes for the currently-unused flags argument.

Naturally, it would be surprising to see applications ever making calls to indirect(), well, directly. A much more likely scenario is for uses of indirect() to be buried inside the C library, which would then export a more straightforward interface to the application.

While some developers (including Linus, evidently) like this patch set, others are less enthusiastic. David Miller was blunt in his review, saying: "I think this indirect syscall stuff is the most ugly interface I've ever seen proposed for the kernel." H. Peter Anvin is also unimpressed:

I think it is a horrible kluge. It's yet another multiplexer, which we are trying desperately to avoid in the kernel. Just to make things more painful, it is a multiplexer which creates yet another ad hoc calling convention, whereas we should strive to make the kernel calling convention as uniform as possible.

So it would not be surprising if this new system call were to evolve somewhat before making its way into the mainline - it's a new and somewhat tricky API which could certainly benefit from discussion. But there are some real needs driving this work. So chances are that indirect() will eventually show up, in some form, in mainline kernels.



sys_indirect()

Posted Nov 21, 2007 9:51 UTC (Wed) by michaeljt (subscriber, #39183)

A rather different way of dealing with this problem would be to just change the system call
interface, add marking to binaries which use the new interface (perhaps even with a syscall
interface version field) and provide a mechanism for a userspace wrapper, which could be
loaded by ld.so if needed, to translate system calls for old binaries.  The mechanism could
(but need not necessarily) be something similar to ptrace.  Since almost all binaries on a
Linux system are built for that system, the mechanism would only need to be a fallback.

sys_indirect()

Posted Nov 21, 2007 11:29 UTC (Wed) by nix (subscriber, #2304)

I don't see how this differs from versioned symbols and a set of new syscalls with an extra
argument, except that it's much uglier.

sys_indirect()

Posted Nov 21, 2007 12:06 UTC (Wed) by michaeljt (subscriber, #39183)

But the point being that at some point it may be better to stop supporting old binaries
directly and move them to compatibility layers instead.  Currently I assume that binaries over
10 years old will still work in the unlikely event that everything else they need is still
available.  How long is forever though?

I suppose that the main difference from versioned symbols is that the compatibility layer
doesn't need to be inside the kernel, and doesn't even need to be installed on the machine if
it is not needed.

sys_indirect()

Posted Nov 21, 2007 16:39 UTC (Wed) by nix (subscriber, #2304)

There's no need for that, though: glibc already *contains* code which looks at the
capabilities of your machine and uses different syscall mechanisms depending on that (int80
versus vsyscall on x86/x86-64, for instance).

Should we run out of syscalls (on x86-64, we only have 2^32 minus a couple of hundred left!
such a harsh limit!) or should we decide the syscall table is too damn huge, then we could
always introduce another syscall mechanism and call it unconditionally from a new glibc.

The point being, if you can hack ld.so to preload things automatically, you can just as easily
modify libc.so to do the job itself in a much less ugly fashion :)

(of course the problem here remains the breaking of compatibility with all older glibcs: the
newer glibc doesn't need a major version bump, though. I suspect this is a step which the
kernel hackers will be very reluctant to take. It's happened on some arches, but they're all
relatively minor ones.)

sys_indirect()

Posted Nov 21, 2007 16:52 UTC (Wed) by michaeljt (subscriber, #39183)

Right - as far as I know, the main reason that the syscall interface may never be changed is
old statically linked binaries, or at least binaries and libraries which do syscalls
themselves and not through libc.  That problem could be solved by hacking ld.so (as far as I
know, even statically linked binaries are loaded by ld.so, and if not something similar could
be found) but not by hacking libc.

sys_indirect()

Posted Nov 22, 2007 10:26 UTC (Thu) by nix (subscriber, #2304)

Statically linked binaries don't have a PT_INTERP header at all, so ld.so never gets involved.

ld.so and libc are tightly tied: if one is used, the other always will be as well. (It is I
think possible to dynamically link ld.so and statically link libc.a, but this is rare and
strange and you don't do it just by specifying -static at link time. It may well have
bitrotted.)

sys_indirect()

Posted Nov 21, 2007 18:26 UTC (Wed) by drag (subscriber, #31333)

It (transition from old and busted glibc to an imaginary shiny and sexy replacement-for-glibc)
could probably be handled in a similar fashion to how the X.org developers are handling the
transition from Xlib to XCB.

First they tried to go with XCL, which was an Xlib compatibility layer for XCB. It worked for
the most part, but a lot of the more oddball features of Xlib proved too difficult to
replicate on top of XCB.
http://xcb.freedesktop.org/History/

Now they are doing the Xlib/XCB approach, which is to have Xlib-guts being slowly ripped out
as applications gradually migrate to XCB, while having the Xlib actually running on top of
XCB. (or something like that) Using this approach they can achieve 100% binary compatibility
and for programmers of applications/libraries using xlib they can start using XCB stuff right
away without a total rewrite.

Also there are probably some answers to be found in the Linux-binary emulation features present
in other operating systems like FreeBSD, AIX, and Solaris.

sys_indirect()

Posted Nov 22, 2007 10:28 UTC (Thu) by nix (subscriber, #2304)

I wasn't aware there was any intent to have apps migrate en masse to XCB (does anyone know of
anything other than Xlib that uses XCB yet?). What is more likely to be useful is to have
*widget sets* migrate, as they already implement most of what Xlib does themselves, and don't
need most of the rest except for the raw comms stuff which XCB provides.

As far as I know the XCB-based Xlib is here to stay.

sys_indirect()

Posted Nov 22, 2007 22:36 UTC (Thu) by njs (guest, #40338)

GTK+ does use XCB in one or two obscure places, but most of it is still Xlib-based.  The big
advantage of Xlib-on-XCB is that it becomes possible to mix Xlib and XCB calls in the same
program, talking on the same X connection -- so you can have a gradual transition.  Aside from
toolkits themselves, interesting programs often have *some* direct calls to Xlib in addition
to all the ones that go through the toolkit, because toolkits don't expose everything.  (And
toolkits are explicitly architected to allow for this, e.g. there is a way to get an Xlib
Display* out of a GdkDisplay*.)  For some of those apps, being able to use a sane and
more-async-able API might well be valuable, and in general it is probably smart to use XCB in
new programs.

I have no idea what the equivalent for "libc migration" would be.  When would you want to use
two different libcs from the same program?  How would the compiler even know which version of
fork() you were trying to call?

sys_indirect()

Posted Nov 23, 2007 0:22 UTC (Fri) by nix (subscriber, #2304)

I suspect that the only way to `migrate' libcs would be a big bang with 
interface back-compatibility (which is, of course, exactly how glibc 
upgrades work). Nothing else works because of the presence of critical 
global data structures whose format must be understood by everything that 
accesses them (I'm thinking mostly of malloc() and free() here, but the 
exception unwinder is another example, which is why libgcc_s.so exists at 
all as opposed to just libgcc.a).

sys_indirect()

Posted Nov 21, 2007 11:09 UTC (Wed) by liljencrantz (guest, #28458)

Are close-on-exec flags or other such race conditions really common enough that using a simple
lock around exec calls is an unacceptable solution?

sys_indirect()

Posted Nov 21, 2007 18:05 UTC (Wed) by daney (subscriber, #24551)

The problem for the accept() system call is that it normally blocks.  If you had to acquire a
lock, you would end up blocking all other threads.

sys_indirect()

Posted Nov 29, 2007 13:10 UTC (Thu) by endecotp (guest, #36428)

The problem is that any third-party library functions that you call need to know about and use
this lock.  Or, you need to hold a lock around the entire library function call.

Personally I'd be happy to just make close-on-exec the default, though no doubt other people
have code that would be broken by that.

sys_indirect()

Posted Nov 22, 2007 14:14 UTC (Thu) by i3839 (guest, #31386)

> Naturally, it would be surprising to see applications ever making calls to 
> indirect(), well, directly. A much more likely scenario is for uses of 
> indirect() to be buried inside the C library, which would then export a 
> more straightforward interface to the application.

I really hate this attitude. Especially if this argument is used to justify an ugly interface.
If the systemcall isn't supposed to be used directly by applications, it isn't worth existing.

sys_indirect()

Posted Nov 22, 2007 15:16 UTC (Thu) by tyhik (guest, #14747)

"If the systemcall isn't supposed to be used directly by applications, it isn't worth
existing."

There exist some, and rightly so. futex() comes to mind.

But indirect() is ugly indeed.

sys_indirect()

Posted Nov 22, 2007 18:28 UTC (Thu) by i3839 (guest, #31386)

The futex adds a unique and useful feature which can also be interesting for certain
apps/libraries other than glibc. The API is a bit ugly, mostly because of historical reasons
it seems.

But what they seem to want to do here is to provide a userspace wrapper in glibc for the
indirect variant of a syscall, and that's just plain silly.

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds