
More fun with file descriptors

In last week's episode, the kernel developers were considering the addition of a couple of flags to the open() system call; these flags would allow applications to select previously unavailable features like non-sequential file descriptor allocation or immediate close-on-exec behavior. The problem that comes up quickly is that open() is just one of many system calls that create file descriptors; most of the others do not have a parameter which allows an application to pass a set of accompanying flags. So it is not possible to request, for example, the non-sequential behavior when obtaining a file descriptor with socket(), pipe(), epoll_create(), timerfd(), signalfd(), accept(), and so on.

In the second version of the non-sequential file descriptor patch, Davide Libenzi attempted to address part of the problem by adding a socket2() system call with an added "flags" parameter. That was enough to frighten a number of developers; nobody really wants to see a big expansion of the system call list resulting from the addition of variations on all the file-descriptor-creating calls. Another approach, it seems, is required, but finding that approach is not entirely easy.

One possibility is to simply ignore the problem; not everybody is sold on the need for non-sequential file descriptors or immediate close-on-exec behavior. There are enough people who see a problem here to motivate some sort of solution, though. Ulrich Drepper, the glibc maintainer, has seen enough applications to conclude that the issue is real.

An alternative, suggested by Alan Cox, is to create a process state flag which controls the use of these features. So a call like:

    prctl(PR_SPARSEFD, 1);

would turn on non-sequential file descriptor allocation for all system calls made by the calling process. The problem here is that the lowest-available-descriptor behavior is a documented part of the POSIX binary interface. A process could waive that guarantee for itself, but it will always be hard to know that all libraries used by that process are safe in the absence of that behavior. One library might want to use non-sequential file descriptors, but that library cannot safely turn them on for the whole process without risking the creation of difficult bugs in obscure situations. It has been suggested that linker tricks could be used to avoid bringing in older libraries, but Ulrich feels that people would respond by simply recompiling the older libraries and the potential bugs would remain.

Linus came into the discussion with a statement that neither adding a bunch of new system calls nor the global flag were acceptable. Instead, he came up with a completely different idea: create a mechanism which allows a single system call to be invoked with a specific set of flags. His proposed interface is:

    int syscall_indirect(unsigned long flags, sigset_t sigmask,
                         int syscall, unsigned long args[6]);

The result would be a call to the given system call with the requested arguments. For the duration of the call, the given flags would be in effect, and signals in sigmask would be blocked. Even before adding any flags, this mechanism could be used to implement the series of system calls (pselect(), for example) which exists only to apply a signal mask to an earlier version of the call. Then the non-sequential file descriptor and close-on-exec behavior could be requested via the flags argument. Beyond that, flags could be added to control the handling of symbolic links, and various other things. Matt Mackall suggested that the "syslet" mechanism could be implemented as a "run this call asynchronously" flag.

This approach is not without its potential problems. There are worries that the flags bits could be quickly exhausted, once again making it hard to add options to existing system calls. Linus suggests overloading the flag bits as a way of making them last longer. That approach risks problems if application developers attempt to apply the wrong flags for a given system call - there would be no automatic way of catching such errors - but it is unlikely that applications would be calling syscall_indirect() themselves, so this risk is relatively small. It is appropriate to worry about whether any conceivable, sensible behavior modification is covered by this interface, or whether it needs a different set of parameters. And one might well wonder whether, some years from now, a large percentage of system calls will be made via syscall_indirect().

This new system call suffers from one other shortcoming as well: there is currently no working implementation. That will likely change at some point, leading to a wider discussion of the proposed interface. If it still seems like a good idea, we might just have a way of adding new behavior to old functions without an explosion in the number of system calls. Sometimes, perhaps, it really is true that problems in computer science are best solved through the addition of another level of indirection.



More fun with file descriptors

Posted Jun 14, 2007 13:43 UTC (Thu) by davecb (subscriber, #1574) [Link]

Odd, I recollect building apps which used high-numbered FDs via a well-known idiom, below. I would expect that anyone who needed to grab an FD for out-of-band use would use something like

        if (fstat(maxFd, &stat_buf) == -1) {
                /* It's not in use, so grab it. */
                if (fcntl(confFd, F_DUPFD, maxFd) != -1) {
                        /* Turns off FD_CLOEXEC as a side effect. */
                        UTIL_CLOSE(confFd);
                        confFd = maxFd;
                }
        }
        maxFd--;

Does that mean this is not as well-known in the application-design world as one would expect? It's the problem that motivated adding fcntl(F_DUPFD) to the system, after all.

--dave

More fun with file descriptors

Posted Jun 14, 2007 14:41 UTC (Thu) by nix (subscriber, #2304) [Link]

That works fine in apps, but not in libraries. If a library wants to open some persistent fd, it currently has no guarantee that the app hasn't closed that fd on it, or dup2()ed another one over the top of it. I've seen problems with syslog() caused by exactly this in the past, and even problems with the three standard fds (buggy app closes them all rather than opening /dev/null three times and wackiness ensues.)

More fun with file descriptors

Posted Jun 14, 2007 14:52 UTC (Thu) by davecb (subscriber, #1574) [Link]

Hmmn, any app writer who kills his own syslog gets exactly what
they deserve (;-))

Joking aside, the code snippet was from an LD_PRELOAD library that
I tested with approximately 2954 popular apps (on Solaris, mind
you) without getting whacked.

I suspect normal evolution will prune out the exceptions over
time: the commercial plus open-source Solaris space seems
to be pretty well clean.

--dave

More fun with file descriptors

Posted Jun 14, 2007 18:24 UTC (Thu) by vmole (guest, #111) [Link]

Hmmn, any app writer who kills his own syslog gets exactly what they deserve (;-))

Any syslog that allows the user app to kill it through normal standard procedures (e.g. closing fds for a daemon) is broken. :-)

Unfortunately, this is one of those cases where you have to know something about the underlying libc implementation to avoid screwing yourself. In particular, most of the implementations I've worked with only break if you've called openlog().

More fun with file descriptors

Posted Jun 15, 2007 19:12 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

Any syslog that allows the user app to kill it through normal standard procedures (e.g. closing fds for a daemon) is broken. :-)

Unfortunately, this is one of those cases where you have to know something about the underlying libc implementation to avoid screwing yourself.

It's not that you have to know the underlying libc implementation. Rather, you have to know libc's requirements of its environment. You don't leave a file descriptor open because you know syslog functions use it; you leave it open because the syslog facility requires you not to mess with any file descriptor you didn't create.

There are dozens of ways a library places requirements on its environment because of resources shared among all code in the process. Some of the requirements are easily accepted, such as that a caller should not write over any memory it did not allocate (which allows the library to keep memory of its own). Sometimes the requirements are onerous, but "broken" is too strong a word for a library with inconvenient requirements. "less useful" or "dangerous" are better descriptions. Signal handlers, alarms, environment variables, stack space, terminal display space, Standard Error file contents, etc. are all controversial.

More fun with file descriptors

Posted Jun 15, 2007 19:01 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

If a library wants to open some persistent fd, it currently has no guarantee that the app hasn't closed that fd on it, or dup2()ed another one over the top of it.

But that's also true of the kernel modifications being proposed. And it's similar to the risk that the app will write over memory that was malloc'ed by the library. The app and library, in a single thread, can stay out of each others' way with the F_DUPFD method if they observe an obvious protocol. That is in contrast with simple open(), in which a library call can defeat its caller's assumptions of sequentially allocated file descriptors.

What the kernel proposal has that F_DUPFD doesn't is that 1) it works even multithreaded (the F_DUPFD method requires the library temporarily to use a low FD, and another thread could see that) and 2) it allows the high fds to be higher (today, the maximum fd is quite low because of the way the kernel data structures are laid out).

More fun with file descriptors

Posted Jun 14, 2007 16:15 UTC (Thu) by zlynx (subscriber, #2285) [Link]

I believe the context of the whole file descriptor discussion involves threading and the bad performance of high file descriptors.

If your library wants to dup2 a high file descriptor, another library could be trying the same trick in another thread and screw up the whole thing if it happened at just the right point between your fstat and the dup.

The performance problems happen because of the way file descriptors are handled in-kernel.

More fun with file descriptors

Posted Jun 15, 2007 1:44 UTC (Fri) by mikov (subscriber, #33179) [Link]

Sorry if this is a stupid question, but why not do something like this:

  osflags_t old_flags = set_flags_for_current_thread( PR_SPARSEFD );
  ...
  x = socket(..);
  ...
  set_flags_for_current_thread( old_flags );

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds