In brief

By Jonathan Corbet
August 26, 2009
What is direct I/O, really? Linux, like many operating systems, supports direct I/O operations to block devices. But how, exactly, should programmers expect direct I/O to work? As a recent document posted by Ted Ts'o notes, there is no real specification for what direct I/O means:

It is not a part of POSIX, or SUS, or any other formal standards specification. The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.

Ted's document is an attempt to better specify what is really going on when a process requests a direct I/O operation. It is currently focused on the ext4 filesystem, but the hope is to forge a consensus among Linux filesystem developers so that consistent semantics can be obtained on all filesystems.

Can you thaw out TuxOnIce? TuxOnIce is the perennially out-of-tree hibernation implementation. It has a number of nice features which are not available with the mainstream version; these features have never managed to get into a form where they could be merged. TuxOnIce developer Nigel Cunningham has recently concluded that it looks like this merger is not going to happen because the relevant people are simply too busy. He says:

Given that this has been the outcome so far, I see no reason to imagine that we're going to make any serious progress any time soon.

In response, he is now actively looking for developers who would like to take on the task of getting TuxOnIce (or, at least, parts of it) into the mainline. He has put together a "todo" list for potentially interested parties.

Lazy workqueues. Kernel developers have been concerned for years that the number of kernel threads was growing beyond reason; see, for example, this article from 2007. Jens Axboe recently became concerned himself when he noticed that his system (a modest 64-processor box) had 531 kernel threads running on it. Enough, he decided, was enough.

His response was the lazy workqueue concept. As might be expected, this patch is an extension of the workqueue mechanism. A "lazy" workqueue can be created with create_lazy_workqueue(); it will be established with a single worker thread. Unlike single-threaded workqueues, though, lazy workqueues still try to preserve the concept of dedicated, per-CPU worker threads. Whenever a task is submitted to a lazy workqueue, the kernel will direct it toward the thread running on the submitting CPU; if no such thread exists, the kernel will create it. These threads will exit if they are idle for a sufficient period.
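The on-demand behavior described above can be modeled in a few lines of ordinary C. This is only a toy sketch, not the patch itself: the lazy_wq structure, the function names, NR_CPUS, IDLE_LIMIT, and the choice to keep the initial worker on CPU 0 alive forever are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <string.h>

#define NR_CPUS    4   /* assumed CPU count for this model */
#define IDLE_LIMIT 3   /* assumed idle ticks before a worker exits */

/* Hypothetical bookkeeping: which CPUs currently have a worker thread. */
struct lazy_wq {
    bool has_worker[NR_CPUS];
    int  idle[NR_CPUS];
};

/* A lazy workqueue starts with a single worker (CPU 0 is an assumption). */
void lazy_wq_init(struct lazy_wq *wq)
{
    memset(wq, 0, sizeof(*wq));
    wq->has_worker[0] = true;
}

/* Submit work from `cpu`; returns true if a worker had to be created. */
bool lazy_wq_submit(struct lazy_wq *wq, int cpu)
{
    bool created = !wq->has_worker[cpu];

    wq->has_worker[cpu] = true;
    wq->idle[cpu] = 0;
    return created;
}

/* One idle period passes; on-demand workers that stay idle long enough exit. */
void lazy_wq_tick(struct lazy_wq *wq)
{
    for (int cpu = 1; cpu < NR_CPUS; cpu++) {
        if (wq->has_worker[cpu] && ++wq->idle[cpu] >= IDLE_LIMIT)
            wq->has_worker[cpu] = false;
    }
}

/* Number of workers currently alive. */
int lazy_wq_workers(const struct lazy_wq *wq)
{
    int n = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        n += wq->has_worker[cpu];
    return n;
}
```

In this model the thread count tracks the set of CPUs that have recently submitted work, rather than the total number of CPUs in the system.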

The end result was a halving of the number of kernel threads on Jens's system. That still seems like too many threads, but it's a good step in the right direction.

Embedded x86. Thomas Gleixner started his patch series with a note that the "embedded nightmare" has finally come to the x86 architecture. The key development here is a new set of patches intended to support Intel's new "Moorestown" processor series; these patches added a bunch of code to deal with the new quirks in this processor. Rather than further clutter the x86 architecture code, Thomas decided that it was time for a major cleanup.

The result is a new, global platform_setup structure designed to tell the architecture code how to set up the current processor. It includes a set of function pointers which handle platform-specific tasks like locating BIOS ROMs, setting up interrupt handling, initializing clocks, and much more; it is a 32-part patch in all. This new structure is able to encapsulate many of the initialization-time differences between the 32-bit and 64-bit x86 architectures, the new "Moorestown" architecture, and various virtualized variants as well. It is also runtime-configurable, so a single kernel should be able to run efficiently on any of the supported systems.
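The general pattern, a global structure of function pointers with defaults that platform detection can override at run time, might look roughly like the sketch below. The field names, the platform_select() and platform_boot() helpers, and the hooks_run counter are invented for illustration and should not be read as the contents of the actual patch.

```c
/* Hypothetical, much-simplified stand-in for the structure described above. */
struct platform_setup {
    const char *name;
    void (*probe_roms)(void);    /* locate BIOS ROMs */
    void (*init_irq)(void);      /* set up interrupt handling */
    void (*init_timers)(void);   /* initialize clocks */
};

static int hooks_run;            /* counts real work done, for demonstration */

static void pc_probe_roms(void)  { hooks_run++; }
static void pc_init_irq(void)    { hooks_run++; }
static void pc_init_timers(void) { hooks_run++; }
static void platform_noop(void)  { }

/* Defaults describe a conventional PC. */
static struct platform_setup platform = {
    .name        = "pc",
    .probe_roms  = pc_probe_roms,
    .init_irq    = pc_init_irq,
    .init_timers = pc_init_timers,
};

/* Run-time configuration: a platform without a BIOS swaps in a no-op hook. */
void platform_select(int has_bios)
{
    if (!has_bios) {
        platform.name = "no-bios";
        platform.probe_roms = platform_noop;
    }
}

/* Generic code calls through the structure and never tests the platform. */
void platform_boot(void)
{
    platform.probe_roms();
    platform.init_irq();
    platform.init_timers();
}
```

The appeal of this design is that the generic boot path stays free of platform-specific conditionals; each platform only has to replace the hooks that differ from the defaults.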

O_NOSTD. Longstanding Unix practice dictates that applications are started with the standard input, output, and error I/O streams on file descriptors 0, 1, and 2, respectively. The assumption that these file descriptors will be properly set up is so strong that most developers never think to check them. So interesting things can happen if an application is run with one or more of the standard file descriptors closed.

Consider, for example, running a program with file descriptor 2 closed. The next file the program opens will be assigned that descriptor. If something then causes the program to write to (what it thinks is) the standard error stream, that output will, instead, go to the other file which had been opened, probably corrupting that file. A malicious user can easily make messes this way; when setuid programs are involved, the potential consequences are worse.
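The lowest-available-descriptor rule behind this scenario is easy to observe. The sketch below uses a hypothetical helper, reuses_lowest_fd(), and /dev/null as a stand-in for "the next file the program opens"; it assumes a single-threaded caller.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a just-closed descriptor number is handed out again by the
 * very next open(), 0 if it is not, and -1 on error.
 */
int reuses_lowest_fd(void)
{
    int a = open("/dev/null", O_WRONLY);   /* lowest free number */
    int b = open("/dev/null", O_WRONLY);   /* next lowest free number */
    int c, reused;

    if (a < 0 || b < 0) {
        if (a >= 0)
            close(a);
        if (b >= 0)
            close(b);
        return -1;
    }

    close(a);                              /* 'a' is the lowest free number again */
    c = open("/dev/null", O_WRONLY);       /* the "next file opened" */
    reused = (c == a);

    close(b);
    if (c >= 0)
        close(c);
    return reused;
}
```

Replace 'a' with descriptor 2 and the second open() with a data file, and any later write to "standard error" lands in that file.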

There are a number of ways to avoid falling into this trap. An application can, on startup, ensure that the first three file descriptors are open. Or it can check the returned file descriptor from open() calls and use dup() to change the descriptor if need be. But these options are expensive, especially considering that, almost all of the time, the standard file descriptors are set up just as they should be.
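As a rough illustration of the first option, a program could run something like the following at the top of main(). The name ensure_std_fds() and the choice of /dev/null as a filler are assumptions; the technique relies only on open() returning the lowest free descriptor.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 once descriptors 0, 1 and 2 are all open, -1 on error. */
int ensure_std_fds(void)
{
    for (;;) {
        int fd = open("/dev/null", O_RDWR);

        if (fd < 0)
            return -1;
        if (fd > 2) {
            /* 0, 1 and 2 were already in use; drop the probe descriptor. */
            close(fd);
            return 0;
        }
        /* fd filled a hole in 0..2; leave it pointing at /dev/null. */
    }
}
```

Even when nothing is wrong, this costs an open() and a close() on every startup, which is the overhead the paragraph above refers to.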

Eric Blake has proposed a new alternative in the form of the O_NOSTD flag. The semantics are simple: if this flag is provided to an open() call, the kernel will not return one of the "standard" file descriptors. If this patch goes in (and there does not seem to be any opposition to that), application developers will be able to use it to ensure that they are not getting any file descriptor surprises without additional runtime cost.
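For comparison, the proposed semantics can be approximated in user space today, at the price of extra system calls whenever a low descriptor comes back. The wrapper below is a sketch under stated assumptions: open_nostd() is a hypothetical name, it handles only opens without O_CREAT (no mode argument), and it does not preserve O_CLOEXEC on the moved descriptor.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Like open(path, flags), but never returns descriptor 0, 1 or 2. */
int open_nostd(const char *path, int flags)
{
    int fd = open(path, flags);
    int moved, saved_errno;

    if (fd < 0 || fd > 2)
        return fd;

    /* F_DUPFD duplicates onto the lowest free descriptor >= 3. */
    moved = fcntl(fd, F_DUPFD, 3);
    saved_errno = errno;
    close(fd);                 /* release the low-numbered original */
    errno = saved_errno;
    return moved;
}
```

The flag would fold this check into the kernel's descriptor allocation, so the common case (all three standard descriptors already open) pays nothing extra.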

There is a cost, of course, in the form of a non-standard flag that will not be supported on all platforms. One could almost argue that it would be better to add a specific flag for cases where a file descriptor in the [0..2] range is desired. But that would be a major ABI change to say the least; it's not an idea that would be well received.

Linux-ARM mailing lists. Russell King has announced that the ARM-related mailing lists on arm.linux.org.uk will be shut down immediately. He is, it seems, not happy about some of the criticism he has received about the operation of those lists. So the lists will be moving, though exactly where is not entirely clear. David Woodhouse has created a new set of lists on infradead; he appears to have moved the subscriber lists over as well. There is also a push to move the list traffic to vger, but the preservation of the full set of lists and their subscribers suggests that the infradead lists are the ones which will actually get used.


O_NOSTD

Posted Aug 27, 2009 0:23 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

O_NOSTD seems like an odd hack... Any longtime Unix hacker would assume that file descriptors 0, 1, and 2 refer to stdin, stdout, and stderr, respectively—I'm surprised that these aren't in the SUS or POSIX standards... How naïve am I. :-\

O_NOSTD

Posted Aug 27, 2009 1:50 UTC (Thu) by corbet (editor, #1) [Link]

That's the point: those file descriptors are defined in the standards. That's why application programmers expect them to be set up correctly. But nothing forces that to happen, and other standard-ordained behavior says that new files are always assigned the lowest available file descriptor.

O_NOSTD

Posted Aug 27, 2009 2:13 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

SUS also says that exec() of a setuid executable may assign file descriptors 0, 1, and 2 to unspecified files if they are not already assigned.

O_NOSTD is utterly pointless as it would need to be added all over the place, including libraries outside the application programmer's control.

O_NOSTD

Posted Aug 27, 2009 9:47 UTC (Thu) by epa (subscriber, #39769) [Link]

I think if the kernel developers don't have the guts to just turn on O_NOSTD
for all open() calls by default, then maybe the C library will. (With an
O_ALLOW_STD flag for those cases like the shell where you really do want to
fiddle with the standard file descriptors.)

O_NOSTD

Posted Aug 28, 2009 0:53 UTC (Fri) by willy (subscriber, #9762) [Link]

This was already fixed years ago after Chuck Lever came up with the problem originally. Try running a setuid program with a standard file descriptor closed. You'll find that libcrt0 opens them again.

O_NOSTD

Posted Aug 27, 2009 2:54 UTC (Thu) by foom (subscriber, #14868) [Link]

Indeed. O_NOSTD, like O_CLOEXEC before it, is an almost entirely pointless extension that basically only gnu coreutils and glibc can ever use. And when I think of all the functions that needed extra arguments added just to be able to pass the O_CLOEXEC flag...ugh.

IMO, instead of O_CLOEXEC, glibc should have added a function to close all but a specified list of fds. And apps or libraries could call this themselves after fork, before exec. Obviously you can write this yourself already by iterating over the files in /proc/self/fd, but I just want to write:

int keep_fds[] = {0,1,2,-1}; close_everything_but(keep_fds);

and be done with it! The list of fds to keep across a given call to exec is *always* going to be small, and known to the program calling exec. And then, it's also trivial to write the compatibility library for non-glibc platforms...

Sigh. And now this O_NOSTD...so instead of just calling a single function at program startup, I have to change every open call in every library that I use to use O_NOSTD? Yeah, right.

If they're that concerned about number-of-syscalls-per-program-start, why not just make a single syscall for "make sure I have fds 0,1, and 2 open to a fd, otherwise open /dev/null there". Yay, now I only have a single syscall on program startup. That'd make much more sense than this proposal...

O_NOSTD

Posted Aug 27, 2009 11:34 UTC (Thu) by hppnq (guest, #14462) [Link]

> And now this O_NOSTD...so instead of just calling a single function at program startup, I have to change every open call in every library that I use to use O_NOSTD? Yeah, right.

There's nothing that forces you to use O_NOSTD. If you want to manually make sure that your fd's are numbered properly -- which has to be a bit of a kludge considering the nature of the problem of having regular fd's with a special meaning -- then it should still work as before.

> If they're that concerned about number-of-syscalls-per-program-start, why not just make a single syscall for "make sure I have fds 0,1, and 2 open to a fd, otherwise open /dev/null there". Yay, now I only have a single syscall on program startup. That'd make much more sense than this proposal...

Then all we'd need is CAP_DO_WHAT_I_MEAN and O_NOSH*T. There's no fun in that.

O_NOSTD

Posted Aug 27, 2009 16:31 UTC (Thu) by daney (subscriber, #24551) [Link]

For a multi-threaded program, O_CLOEXEC is required if you want race free operation. Hardly pointless.

The O_NOSTD problem, on the other hand, can be worked around entirely in user space.

O_NOSTD

Posted Aug 27, 2009 18:09 UTC (Thu) by foom (subscriber, #14868) [Link]

> For a multi-threaded program, O_CLOEXEC is required if you want race free operation.
> Hardly pointless.

No, it's not required. After a fork, your program is no longer multi-threaded, and you can at your
leisure close all file descriptors except those necessary for program you're about to exec, with no
race condition.

Third-party libraries

Posted Aug 27, 2009 20:24 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

The problem is then that a third-party library might not be notified between a call to fork() and exec(). That's what pthread_atfork is for, but not everyone links against libpthread.

Even with pthread_atfork, however, you can race. Consider:

[library code]
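static int fd = -1;

/* atfork handler: closes whatever descriptor the library has recorded */
void mylib_atfork() {
    close(fd);
}

void mylib_dosomething() {
    /* race window: open() has returned, but fd has not been stored yet */
    fd = open(...);
    do_something_with_fd(fd);
}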

If the fork happens between the return from open() and the assignment to fd, then you race and leak the file descriptor.

The real userspace solution would be for programs to just close unknown file descriptors between fork and exec. But they don't, so O_CLOEXEC is a decent facility for defensive library programming.

Now, on the other hand, this O_NOSTD business is pure junk that will uselessly take up a valuable flag bit for all eternity.
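The difference between the atomic and the two-step ways of marking a descriptor close-on-exec can be sketched as follows. The three function names are hypothetical; the calls themselves (open() with O_CLOEXEC, fcntl() with F_SETFD and FD_CLOEXEC) are standard.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Atomic: the descriptor never exists without its close-on-exec flag. */
int open_cloexec_atomic(const char *path)
{
    return open(path, O_RDONLY | O_CLOEXEC);
}

/*
 * Two steps: if another thread calls fork() and exec() between open() and
 * fcntl(), the child inherits the descriptor across the exec.
 */
int open_cloexec_racy(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd >= 0)
        fcntl(fd, F_SETFD, FD_CLOEXEC);
    return fd;
}

/* Returns nonzero if fd is open and marked close-on-exec. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags != -1 && (flags & FD_CLOEXEC) != 0;
}
```

Both functions end up with the same flag set; the only difference is whether a window exists in which a concurrent fork() and exec() can see the descriptor without it.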

Third-party libraries

Posted Aug 27, 2009 21:33 UTC (Thu) by foom (subscriber, #14868) [Link]

> The real userspace solution would be for programs to just close unknown file descriptors between fork and exec. But they don't, so O_CLOEXEC is a decent facility for defensive library programming.

Yes, this is what I've been saying -- see previous comment regarding "close_everything_but". The bug is in the code that calls fork/exec, not the code that opens a file descriptor!

Comments like this one just show how insane this whole thing is. The *bug* there is that libuuid doesn't close fds before execing a long-lived daemon! It should not be the responsibility of everyone to open all their fds with O_CLOEXEC.

Third-party libraries

Posted Aug 27, 2009 21:51 UTC (Thu) by giraffedata (subscriber, #1954) [Link]

I don't see it. Never mind pthread_atfork() -- after all, the issue is exec, not fork. The alternative to O_CLOEXEC for a library that opens files under the covers would seem to be a prepare_for_exec() function exported by the library. The user makes sure he calls that before any exec().

As for the multithreaded program, it already has to serialize access to these file-descriptor-using functions anyway (you wouldn't want two threads opening that file at the same time), so it might as well synchronize prepare_for_exec() with those. This serialization could be done either in the library (i.e. the library is thread-safe), or outside. Oh, and here comes pthread_atfork(): it can make sure the serialization mechanism survives a fork that may precede the exec.

Third-party libraries

Posted Aug 27, 2009 22:54 UTC (Thu) by Los__D (subscriber, #15263) [Link]

Besides "prepare_for_exec()" being a horrible interface, do you really think that programmers who don't care to close unknown fds would care to call it?

Third-party libraries

Posted Aug 28, 2009 1:44 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

> Besides "prepare_for_exec()" being a horrible interface, do you really think that programmers who don't care to close unknown fds would care to call it?

That's beside the point. I was responding to a claim that there is no way to write a correct multithreaded program involving a third party library that opens files without O_CLOEXEC. And that that distinguishes O_CLOEXEC from the proposed O_NOSTD.

If on the other hand you just want to argue that O_CLOEXEC is convenient, with or without threads, then you're putting it in the same class as O_NOSTD.

Third-party libraries

Posted Aug 28, 2009 5:39 UTC (Fri) by Los__D (subscriber, #15263) [Link]

There is no other way for the library author to make sure that his library opened fds are safe.

It is hardly "convenience", but good library programming style, to make sure that internal data stays internal.

I would go so far as to argue that the naïve library user who doesn't close the fds isn't entirely an idiot for expecting not to be responsible for library data.

Third-party libraries

Posted Aug 27, 2009 23:10 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

Or you just do this (error checks omitted):
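#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static int fd = -1;
static pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hold the lock across fork() so no thread is between open() and fcntl(). */
static void mylib_beforefork(void) {
    pthread_mutex_lock(&fd_lock);
}

/* Runs in both the parent and the child once fork() has returned. */
static void mylib_afterfork(void) {
    pthread_mutex_unlock(&fd_lock);
}

void mylib_dosomething(void) {
    pthread_mutex_lock(&fd_lock);
    if(fd == -1) {
        pthread_atfork(mylib_beforefork,
                       mylib_afterfork,
                       mylib_afterfork);
        fd = open(...);
        /* Mark close-on-exec; F_SETFD expects FD_CLOEXEC, not the open() flag O_CLOEXEC. */
        fcntl(fd, F_SETFD, FD_CLOEXEC);
    }
    pthread_mutex_unlock(&fd_lock);

    do_something(fd);
}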

Third-party libraries

Posted Sep 8, 2009 9:06 UTC (Tue) by jlokier (guest, #52227) [Link]

Doesn't work if the thread uses vfork() :-)

O_NOSTD

Posted Aug 27, 2009 20:11 UTC (Thu) by kjp (subscriber, #39639) [Link]

Finally a voice of reason.

O_NOSTD is a joke. WAAAAAAAAAAAH I don't want to call some syscalls in gnu coreutils cp. So yeah, let's add more flags to open and freaking *socket* while we're at it?

WAAAAAAAAAAAAAAH I don't want to close fds in my pre exec code so I need CLOEXEC added to Everything. Jeez, even a per process flag to say no inherit by default would be better than that crap.

God, Linux is turning into Windows crapware. Please Linus...axe this crap.

O_NOSTD

Posted Sep 8, 2009 8:56 UTC (Tue) by jlokier (guest, #52227) [Link]

I proposed that per-process cloexec-by-default flag some years ago, and it was (rightly) shot down for breaking third party libraries that internally create descriptors, spawn child processes and _expect_ those processes to inherit those descriptors.

Personally I'd rather break those libraries than have descriptor leaks, security holes and more complicated APIs. After all what we're doing now pretty much requires all libraries to be changed anyway, only this way, it's silent breakage in the meantime.

Old and well-known

Posted Aug 27, 2009 14:18 UTC (Thu) by dwheeler (guest, #1216) [Link]

The issue this patch is trying to address is noted in "Secure Programming for Linux and Unix HOWTO", section 5.3 (File Descriptors): "[do] not assume that standard input (stdin), standard output (stdout), and standard error (stderr) refer to a terminal or are even open." I don't know if this is the best way to go about it, but I applaud the idea of trying to make it easier to write correct software.

In brief

Posted Aug 27, 2009 1:16 UTC (Thu) by butlerm (subscriber, #13312) [Link]

Speaking of O_DIRECT, for the lazy among us, it would be a good idea to
abstain from putting single and double quotes in URLs, because browsers like
Chrome assume that those are URL terminating characters.

In brief

Posted Aug 27, 2009 9:56 UTC (Thu) by pebolle (guest, #35204) [Link]

Firefox 3.5.x claims the link you seem to refer to is formatted like this:

<a href="http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO">http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_I...>'s_Semantics

Even with my limited knowledge of HTML it is easy to see why that link doesn't work as expected.

In brief

Posted Aug 27, 2009 9:59 UTC (Thu) by pebolle (guest, #35204) [Link]

That comment was, of course, mangled. In short it seems the closing "</a>" tag was put somewhere in the middle of the URL (presumably by the HTML generating system used to generate that page).

nostd in userspace?

Posted Aug 27, 2009 10:51 UTC (Thu) by job (guest, #670) [Link]

Couldn't the O_NOSTD checks be implemented in glibc just as well?

nostd in userspace?

Posted Aug 27, 2009 20:16 UTC (Thu) by kjp (subscriber, #39639) [Link]

Oh, but that would require 'syscalls' according to the person who wants the code added to the kernel. The new motto is if it can go in the kernel, it should. You're behind the times.

nostd in userspace?

Posted Aug 27, 2009 20:27 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

But, like, that's a performance hit and maybe another system call dude. But, like, if you pass a flag to the kernel, then oh man, that conditional branch in the kernel doesn't, you know, count. So it's really like cool and nonstandard and stuff.

nostd in userspace?

Posted Aug 27, 2009 20:35 UTC (Thu) by kjp (subscriber, #39639) [Link]

Yeah. And it's much easier, more robust, and more secure to do this:

* Change every open() or socket() line in every library that my super robust 'fooutils' program uses to use NOSTD

than:

* Change my main() code to open 0,1,2 to /dev/null ...

Just like it's SO much more error prone to read /proc/self/fd/ and close what you don't want than to use CLOEXEC *in every library you'd ever think about using or be LD_PRELOAD'd with*

nostd in userspace?

Posted Aug 27, 2009 20:40 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

I actually agree with you. Well-written programs should close file descriptors before exec. Additionally, well-written libraries should mark their file descriptors close-on-exec. That way, at least one component will be well-written and the correct behavior will result.

nostd in userspace?

Posted Aug 28, 2009 3:54 UTC (Fri) by kjp (subscriber, #39639) [Link]

What about libraries that use fopen() anywhere? There is no way for them to set CLOEXEC race free..or even pass NOSTD for that matter. The people who thought these hacks up didn't think it through. If you use open(2) you should be able to realize you don't need these hacks.

nostd in userspace?

Posted Aug 28, 2009 4:05 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

Your argument is spot-on. That's why I proposed, way back when, that O_CLOEXEC be a thread flag. I'm glad to see that people finally realize I was right.

nostd in userspace?

Posted Aug 28, 2009 4:53 UTC (Fri) by foom (subscriber, #14868) [Link]

If only linux would implement this:
http://docs.sun.com/app/docs/doc/816-5168/closefrom-3c?a=...
....*that* would be a welcome addition.

Oh look, this was already discussed. :)
http://lwn.net/Articles/292674/

Kernel threads - so what?

Posted Aug 27, 2009 20:29 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

Err, so what if there are over 500 kernel threads? What makes 500 kernel threads "too many"? If there's a problem here, it's that the system doesn't scale to having thousands of processes, not that there are "too many" processes. A process is a fundamentally useful abstraction that a kernel should support well. The solution to poor performance isn't to hack around the problem.

Kernel threads - so what?

Posted Aug 28, 2009 20:19 UTC (Fri) by oak (guest, #2786) [Link]

Threads obviously have an overhead (scheduling, task struct etc). The
less overhead you have, the better it scales.

Besides, it clutters your "ps" output. Having more kernel threads running
than the actual user-space processes for which they do stuff, sounds
wrong.

Kernel threads - so what?

Posted Aug 28, 2009 20:49 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

Computers can easily track several thousand inactive threads with very little overhead, but the sysadmin trying to examine ps or top output to figure out what's happening has a much lower threshold.

Kernel threads - so what?

Posted Sep 8, 2009 9:04 UTC (Tue) by jlokier (guest, #52227) [Link]

Obvious solution is to hide kernel threads in ps/pstree/top.

It would be nice if the kernel threads had lazy stacks too, to save a bit of memory.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds