The curious case of O_DIRECTORY|O_CREAT

By Jonathan Corbet
March 27, 2023

The open() system call offers a number of flags that modify its behavior; not all combinations of those flags make sense in a single call. It turns out, though, that the kernel has responded in a surprising way to the combination of O_CREAT and O_DIRECTORY for a long time. After a 2020 change made that response even more surprising, it seems likely that this behavior will soon be fixed, resulting in a rare user-visible semantic change to a core system call.

The O_CREAT flag requests that open() create a regular file if the named path doesn't exist (adding O_EXCL will cause the call to fail if the path does exist). O_DIRECTORY, instead, indicates that the call should only succeed if the path exists and is a directory. It is not possible to create a directory with open(); that is what mkdir() is for. So the combination of O_CREAT and O_DIRECTORY asks the kernel to create a directory (which is supposed to already exist) as a regular file, which clearly does not make sense.
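As a rough illustration of what each flag does on its own, here is a minimal sketch that exercises O_CREAT|O_EXCL and O_DIRECTORY separately. The helper names (open_errno(), demo_flags()) and the scratch location under /tmp are assumptions made for this example only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Open path with the given flags; return 0 on success, else the errno value. */
int open_errno(const char *path, int flags)
{
    int fd = open(path, flags, 0600);

    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}

/*
 * Try each flag on its own inside a scratch directory (the /tmp location is
 * an assumption of this sketch). Fills out[0..3]; returns 0, or -1 if the
 * scratch directory could not be created.
 */
int demo_flags(int out[4])
{
    char dir[] = "/tmp/oflags-XXXXXX";
    char file[64];

    assert(out != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(file, sizeof(file), "%s/file", dir);

    out[0] = open_errno(file, O_WRONLY | O_CREAT | O_EXCL); /* path absent: file is created */
    out[1] = open_errno(file, O_WRONLY | O_CREAT | O_EXCL); /* path present: EEXIST */
    out[2] = open_errno(file, O_RDONLY | O_DIRECTORY);      /* regular file: ENOTDIR */
    out[3] = open_errno(dir, O_RDONLY | O_DIRECTORY);       /* directory: succeeds */

    unlink(file);
    rmdir(dir);
    return 0;
}
```

Calling demo_flags() from a small main() and printing the four values should show 0, EEXIST, ENOTDIR, and 0 in that order.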

Since time immemorial, the kernel's response to the combination of those two flags has been to flag an error in most situations. If the path exists and is a regular file, open() fails and returns with an ENOTDIR error. If, instead, the path is an existing directory, the error is EISDIR — perhaps a bit surprising, given that O_DIRECTORY indicates that the path is expected to be a directory. If, however, the path does not exist at all, the open() call will succeed after creating a regular file with the indicated name, which is also a surprising result.

Recently, though, Pedro Falcato noticed that the behavior in the final case above had changed; the kernel will now return ENOTDIR if the path does not exist — but it also still creates a regular file. It is fair to say that this behavior is even more surprising than what happened before. Christian Brauner tracked the behavioral change down to this commit from Al Viro, which was merged for the 5.7 release.

Falcato included a patch to restore the previous behavior, which arguably makes a bit more sense than what the kernel does now and is, in any case, what the kernel did for a long time. But Brauner wondered if the right thing to do was to fix the kernel to do something more rational with that combination of flags:

So before we continue down that road should we maybe treat this as a chance to fix the old bug? Because this behavior of returning -ENOTDIR has existed ever since v5.7 now. Since that time we had three LTS releases all returning ENOTDIR even if the file was created.

Since, he said, nobody seems to have noticed the change over this time, it seems likely that nobody is actually counting on the strange semantics given to that combination of flags in the past. Linus Torvalds agreed that actually fixing the kernel's behavior seemed like a sensible path: "I think we can pretty much assume that there are no actual users of it, and we might as well clean up the semantics properly".

Falcato did some research on what other systems do in response to that combination of flags. NetBSD, it seems, will simply fail an open() call in that situation, returning EINVAL. FreeBSD, instead, will allow the call to succeed if the path exists and is a directory; otherwise it will fail. He also noted that all of the behaviors seen — Linux pre- and post-5.7, NetBSD, and FreeBSD — are allowed by POSIX: "I would not call the old Linux behavior a *bug*, just really odd semantics".

Torvalds answered that either of the BSD behaviors would make sense, while the kernel's current behavior "has no excuse". The NetBSD response is "the clearest case", he said, but FreeBSD's behavior is closer to what Linux did before the 5.7 change. Brauner favored the NetBSD behavior, and put together a patch to implement it. As part of that work, he put some effort into searching through code looking for cases that would be broken by the change in semantics; he came up nearly empty:

Time was spent finding potential users of this combination. Searching on codesearch.debian.net showed that codebases often express semantical expectations about O_DIRECTORY | O_CREAT which are completely contrary to what our code has done and currently does.
The expectation often is that this particular combination would create and open a directory. This suggests users who tried to use that combination would stumble upon the counterintuitive behavior no matter if pre-v5.7 or post v5.7 and quickly realize neither semantics give them what they want

Included in the patch are some links to places where developers had attempted this combination; see this libglnx comment for an example.

As the result of Brauner's patch, the combination of O_CREAT and O_DIRECTORY will cause an open() call to fail with EINVAL regardless of whether the given path exists or not. Chances are that nothing will break with this change, but he is asking for widespread testing to be sure of that. It would, after all, be annoying to have to revert this change if a problem report surfaces at some point in the future. The patch has not actually been applied as of this writing; given that there is a semantic change involved, it would be a bit surprising to see it land for 6.3. That said, your editor has been surprised by such things before.
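For those who want to see what a given kernel does, a small probe along the following lines may help. It is only a sketch: the names probe() and run_probe() and the /tmp scratch location are invented for this example, and the errno values seen will depend on the kernel version in use.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

struct probe_result {
    int err;           /* 0 if open() succeeded, otherwise its errno */
    int regular_after; /* 1 if a regular file sits at the path afterwards */
};

/* Call open() with O_CREAT|O_DIRECTORY and record what happened. */
struct probe_result probe(const char *path)
{
    struct probe_result r = { 0, 0 };
    struct stat st;
    int fd = open(path, O_RDONLY | O_CREAT | O_DIRECTORY, 0600);

    if (fd < 0)
        r.err = errno; /* saved before lstat() can overwrite it */
    else
        close(fd);
    if (lstat(path, &st) == 0 && S_ISREG(st.st_mode))
        r.regular_after = 1;
    return r;
}

/*
 * Probe a missing path (out[0]), a regular file (out[1]) and a directory
 * (out[2]) inside a scratch directory. Returns 0, or -1 on setup failure.
 */
int run_probe(struct probe_result out[3])
{
    static const char *const label[3] = { "missing", "file", "dir" };
    char dir[] = "/tmp/probe-XXXXXX";
    char missing[64], file[64], sub[64];
    int fd;

    assert(out != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(missing, sizeof(missing), "%s/missing", dir);
    snprintf(file, sizeof(file), "%s/file", dir);
    snprintf(sub, sizeof(sub), "%s/sub", dir);

    fd = open(file, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd >= 0)
        close(fd);
    if (fd < 0 || mkdir(sub, 0700) != 0) {
        unlink(file);
        rmdir(dir);
        return -1;
    }

    out[0] = probe(missing);
    out[1] = probe(file);
    out[2] = probe(sub);
    for (int i = 0; i < 3; i++)
        printf("%-8s errno=%d (%s), regular file afterwards: %d\n", label[i],
               out[i].err, strerror(out[i].err), out[i].regular_after);

    unlink(missing); /* may or may not exist, depending on the kernel */
    unlink(file);
    rmdir(sub);
    rmdir(dir);
    return 0;
}
```

Going by the behavior described above, a kernel with the patch applied should report EINVAL for all three cases and leave no file behind at the missing path; older kernels would be expected to show the ENOTDIR and EISDIR results, along with a stray regular file.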

This is one of those cases where the subtleties in the kernel's API policies come into play. In a real sense, this fix is an incompatible API change, and it will indeed break any program that is relying on the current behavior. But, in cases where no program does rely on a specific behavior, that behavior can indeed be changed. This fix seems unlikely to break anything, and so is permissible for the kernel developers to do. Should the assumption that nothing will break prove true, it may even be possible, someday, to make that flag combination do what developers evidently expect and create a directory. But first it is necessary to demonstrate that there are indeed no problems resulting from the removal of the current, strange semantics.


The curious case of O_DIRECTORY|O_CREAT

Posted Mar 27, 2023 22:49 UTC (Mon) by jhoblitt (subscriber, #77733) [Link] (13 responses)

Why does open() need the O_DIRECTORY flag at all? It seems like unnecessary overlap with opendir(), mkdir(), and possibly access().

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 27, 2023 23:08 UTC (Mon) by mpr22 (subscriber, #60784) [Link]

opendir(3) is implemented on top of open(2), and the man page for open(2) on my Linux system says:

>O_DIRECTORY
>
>If pathname is not a directory, cause the open to fail. This flag was added in kernel version 2.1.126, to avoid denial-of-service problems if opendir(3) is called on a FIFO or tape device.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 27, 2023 23:26 UTC (Mon) by atnot (guest, #124910) [Link] (10 responses)

In general, the reason why open() has all of these flags is because of the lack of concurrency semantics of the unix file system API. There is no way to guarantee that whatever check you have made before still applies to that file by the time you open it. The only way you can guarantee that is to move the check into the kernel, where it can use the secret transaction system built into the filesystem to provide atomicity for your open call.

Accordingly, the kernel must provide every combination of checks or requests one might want to do in its flags. This does lead to a lot of duplication, but it is also the only way to write correct and secure code without some sort of major change to the filesystem API. It turns out that sometimes, checking whether something is a directory is just one of those things.
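To make the race concrete, here is a minimal sketch contrasting a check-then-open sequence with a single open() call that lets the kernel do the check. The function names and the /tmp scratch directory are assumptions for illustration; nothing here comes from an existing library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Racy: path can be swapped for something else between lstat() and open(). */
int open_dir_racy(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode)) {
        errno = ENOTDIR;
        return -1;
    }
    /* Window: another process may replace path right here. */
    return open(path, O_RDONLY);
}

/* Atomic: the kernel does the type check as part of the open itself. */
int open_dir_atomic(const char *path)
{
    return open(path, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
}

/*
 * out[0]: 1 if the atomic variant opens a real directory
 * out[1]: errno when the atomic variant is pointed at a symlink to it
 * out[2]: 1 if the racy variant opens the real directory (no race in this demo)
 * Returns 0, or -1 on setup failure.
 */
int demo_atomic(int out[3])
{
    char dir[] = "/tmp/atomic-XXXXXX";
    char real[64], lnk[64];
    int fd;

    assert(out != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(real, sizeof(real), "%s/real", dir);
    snprintf(lnk, sizeof(lnk), "%s/lnk", dir);
    if (mkdir(real, 0700) != 0 || symlink(real, lnk) != 0) {
        unlink(lnk);
        rmdir(real);
        rmdir(dir);
        return -1;
    }

    fd = open_dir_atomic(real);
    out[0] = (fd >= 0);
    if (fd >= 0)
        close(fd);

    fd = open_dir_atomic(lnk);
    out[1] = (fd < 0) ? errno : 0;
    if (fd >= 0)
        close(fd);

    fd = open_dir_racy(real);
    out[2] = (fd >= 0);
    if (fd >= 0)
        close(fd);

    unlink(lnk);
    rmdir(real);
    rmdir(dir);
    return 0;
}
```

The symlink case should fail with ENOTDIR or ELOOP, depending on which check the kernel hits first. The point of the contrast is that only the second function closes the window between check and use.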

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 16:11 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (7 responses)

Well, technically, there is also the option of exposing an API for the secret transaction system that's built into the filesystem. Microsoft actually did that for NTFS, and shortly thereafter turned around and published documentation telling everyone not to use it because they were going to deprecate it next week,[1] although to my knowledge this deprecation never officially happened. Perhaps someone with deeper filesystem knowledge than me can explain why this is so fraught, but from my perspective as an application developer, the whole thing seems a bit silly. You have a series of functions that amount to "atomically read X and then write Y" for numerous different values of X and Y - why not just let userspace directly express the "atomically" part instead?

The obvious concern is about blocking the system, but you could still provide optimistic concurrency control, so that doesn't seem like a valid problem to me. There's also the problem of "this is too complicated and userspace might do something dumb with it" - but that hasn't stopped mmap and similarly weird syscalls from existing. And there's also "we don't know if we made the best possible transaction system, so to avoid breaking backcompat, we'll just avoid exposing it" - but SQL servers have exposed transaction systems since the beginning, and they seem to be doing just fine on the backcompat side of things. I get the sense that the real problem is "this isn't worth the engineering effort in our judgment," but nobody wants to say that out loud.

[1]: https://learn.microsoft.com/en-us/windows/win32/fileio/de...

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 19:40 UTC (Tue) by roc (subscriber, #30627) [Link] (5 responses)

You don't want userspace code execution to be part of the transaction, that seems really dangerous even with optimistic concurrency. You would want to be able to submit the complete transaction for validation and execution by the kernel. Actually these days you could probably cobble something together using io_uring and eBPF to do this.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 14:59 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (4 responses)

> You don't want userspace code execution to be part of the transaction, that seems really dangerous even with optimistic concurrency. You would want to be able to submit the complete transaction for validation and execution by the kernel.

Isn't that exactly what SQL does with COMMIT right now?

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 16:50 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

No, SQL doesn't do it ... Oracle or PostgreSQL or ScarletDME or whatever do it.

The database batches things up until userspace/SQL/DataBASIC says "okay I'm done". At present, the kernel does not have the ability to do that.

And actually, it might be quite tricky for the kernel, in general, because with a database COMMIT, the database maintains different versions of reality. Yes, I know POSIX mandates some sort of reality distortion field, but when you're trying to get VFS, and btrfs, and ext?, and whatever whatever whatever all to agree, life gets rather hard ...

Cheers,
Wol

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 18:30 UTC (Wed) by jhoblitt (subscriber, #77733) [Link] (1 responses)

> No, SQL doesn't do it ... Oracle or PostgreSQL or ScarletDME or whatever do it.

AFAIK, ANSI/ISO SQL includes transactions.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 19:49 UTC (Wed) by Wol (subscriber, #4433) [Link]

Yes it does. But ANSI/ISO SQL doesn't *DO* transactions - it's an interpreted/jit'd language. In the words of the GP, it is "user space code" that shouldn't go anywhere near doing transactions.

COMMIT merely tells the underlying layer (and it doesn't have to be SQL, ScarletDME is NoSQL and there are a heck of a lot of NoSQL databases out there that have "commit" - without it you can't really claim to be a database at all) "this series of actions is supposed to be carried out in an atomic manner".

This is basically the big problem that everything has with transactions. It's easy for COMMIT (or, in my case, more likely START TRANSACTION / END TRANSACTION) to *define* *what* *is* *SUPPOSED* *to* *be* *atomic*; it's a much bigger problem for the underlying layer to actually implement it atomically.

Cheers,
Wol

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 18:36 UTC (Wed) by jhoblitt (subscriber, #77733) [Link]

Yes, and transactions can be a useful feature in a filesystem. A few years prior to the existence of AWS S3, I wrote an object-store-like pseudo-filesystem by basically converting the kernel's ext3 headers into a SQL schema. Transactions were useful in several situations, such as atomic "file" renames where the target name was an existing "file" which needed to be unlinked.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 20:12 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Well, technically, there is also the option of exposing an API for the secret transaction system that's built into the filesystem.

Uhm, it was not secret at any point. Moreover, MS was pushing it pretty hard around the Vista time. The API was actually pretty cool; you could even do distributed transactions (using the Microsoft Distributed Transaction Coordinator) that involved the filesystem and SQL Server. We used it to do transactional file operations in our CRM running under IIS.

There were some problems with it:
1. It was slow as hell. NTFS is not a speed demon in the best of times, and additional journaling slowed it down even more.
2. Pretty much nobody used it, not even Microsoft's own Windows Update. So it got dropped from ReFS.

secret transaction system?

Posted Apr 1, 2023 5:08 UTC (Sat) by moxfyre (guest, #13847) [Link] (1 responses)

> The only way you can guarantee that is to move the check into the kernel, where it can use the secret transaction system built into the filesystem to provide atomicity for your open call.

The Linux kernel contains a "secret transaction system" for filesystem operations…? By "secret", do you just mean that there's no API exposed to userspace to manage such transactions?

secret transaction system?

Posted Apr 25, 2023 13:21 UTC (Tue) by Rudd-O (guest, #61155) [Link]

If we told ya, it wouldn't be a secret anymore, would it?

😂

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 11:03 UTC (Tue) by aaronmdjones (subscriber, #119973) [Link]

opendir(3) is not a system call. It's a C library function implemented on top of open(2) (which returns a file descriptor referring to that directory) and getdents(2) (which takes a file descriptor referring to that directory).
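A rough sketch of that layering, for illustration: open the directory with open(2) and O_DIRECTORY, then hand the descriptor to fdopendir(3) and iterate with readdir(3), which the C library in turn implements with getdents. The function names count_entries() and demo_count() and the /tmp scratch path are assumptions of this example, and the real opendir() implementation will differ in detail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Count the entries in a directory, not counting "." and ".."; -1 on error. */
int count_entries(const char *path)
{
    struct dirent *e;
    DIR *d;
    int n = 0;
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

    if (fd < 0)
        return -1; /* ENOTDIR if path is a FIFO, device, regular file, ... */
    d = fdopendir(fd);
    if (d == NULL) {
        int saved = errno;

        close(fd);
        errno = saved;
        return -1;
    }
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
            n++;
    }
    closedir(d); /* also closes fd */
    return n;
}

/*
 * out[0]: entry count for a scratch directory holding two files
 * out[1]: errno when count_entries() is pointed at a regular file
 * Returns 0, or -1 on setup failure.
 */
int demo_count(int out[2])
{
    char dir[] = "/tmp/count-XXXXXX";
    char a[64], b[64];
    int fd;

    assert(out != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof(a), "%s/a", dir);
    snprintf(b, sizeof(b), "%s/b", dir);
    fd = open(a, O_WRONLY | O_CREAT, 0600);
    if (fd >= 0)
        close(fd);
    fd = open(b, O_WRONLY | O_CREAT, 0600);
    if (fd >= 0)
        close(fd);

    out[0] = count_entries(dir);
    out[1] = (count_entries(a) < 0) ? errno : 0;

    unlink(a);
    unlink(b);
    rmdir(dir);
    return 0;
}
```

Because the type check happens inside open(), pointing count_entries() at a regular file fails immediately with ENOTDIR.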

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 2:03 UTC (Tue) by josh (subscriber, #17465) [Link] (16 responses)

> The expectation often is that this particular combination would create and open a directory.

Given the indication that the behavior of this combination can be changed/fixed, is there some strong reason *not* to make it successfully create a directory? That seems like *useful behavior*: create a directory and atomically return an fd for that directory.

Would that break some existing software? It doesn't sound like it would, given

> "I think we can pretty much assume that there are no actual users of it, and we might as well clean up the semantics properly"

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 3:53 UTC (Tue) by brauner (subscriber, #109349) [Link] (8 responses)

I already proposed that when I fixed the bug:

"(As a sidenote, posix made an interpretation change a long time ago to
at least allow for O_DIRECTORY | O_CREAT to create a directory (see [3]).

But that's a whole different can of worms and I haven't spent any
thoughts even on feasibility. And even if we should probably get through
a couple of kernels with O_DIRECTORY | O_CREAT failing with EINVAL first.)"

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 4:11 UTC (Tue) by josh (subscriber, #17465) [Link] (6 responses)

Thank you! Sorry I missed that.

But in any case, Linus pointed out open's hard-to-extend semantics (the same ones that motivated O_TMPFILE), which make it less worthwhile to attempt to make this work.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 8:40 UTC (Tue) by smcv (subscriber, #53363) [Link] (5 responses)

Perhaps openat2() (which always rejected unknown flags) could interpret O_DIRECTORY|O_CREAT as "open a directory, creating it if it doesn't exist", even if plain open() and openat() don't? Or would it be more confusing for openat2() and openat() to differ on that?

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 9:36 UTC (Tue) by brauner (subscriber, #109349) [Link] (4 responses)

openat2() can very well differ from the other variants. But if we wanted it to be the only open* syscall to open and create a directory then we should add an openat2() specific flag instead of reusing O_DIRECTORY | O_CREAT. Otherwise userspace might rightly be confused why O_DIRECTORY|O_CREAT works on openat2() but not on other open() variants. Reusing O_DIRECTORY|O_CREAT really only makes sense when we provide consistent behavior for all open syscalls imho.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 9:49 UTC (Tue) by josh (subscriber, #17465) [Link] (3 responses)

Yeah, agreed; might as well make O_CREATE_DIR.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 11:58 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (2 responses)

I like how you give "create" back its "E" but then steal "ectory" again.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 12:54 UTC (Tue) by brauner (subscriber, #109349) [Link] (1 responses)

The API giveth, and the API taketh away.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 20:44 UTC (Tue) by Villemoes (subscriber, #91911) [Link]

O_LORD_WONT_YOU_BUY_ME_PONIES

open(, O_DIRECTORY|O_CREAT) should create and open a file descriptor to a directory

Posted Apr 11, 2023 16:39 UTC (Tue) by meuh (guest, #22042) [Link]

Yes please

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 16:19 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (6 responses)

You don't, strictly speaking, need it. mkdirat(2) and openat(2) should be able to resolve any race condition that might otherwise be possible. Here's a possible sequence of events:

0. Thread A opens the parent directory. This may or may not race with something, but let's call it out of scope for now and just assume that it happens.
1. Thread A calls mkdirat and creates /path/to/foo/
2. Thread B renames it to /path/to/bar/ (or removes it or whatever)
3. Now if thread A tries to openat /path/to/foo/, it gets ENOENT, so it would know to start over and try again.

Alternatively:

2. Thread B renames it to /path/to/bar/
3. Thread B creates a new /path/to/foo/
4. Thread A successfully openats /path/to/foo/. It's a different directory than the one that it created, but from thread A's perspective, this makes no practical difference. One newly-created directory is just as good as another, right?

Alternatively alternatively:

3. Thread B creates /path/to/foo as a symlink to something else.
4. Thread A tries to openat /path/to/foo/, but it fails because of O_NOFOLLOW. A knows that something is wrong and aborts the operation.
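The sequences above can be sketched roughly as follows. This is an illustration under stated assumptions, not a vetted implementation: make_and_open_dir() is a hypothetical helper, the retry limit of 8 and the EAGAIN result are arbitrary choices, and treating EEXIST from mkdirat() as acceptable follows the "one newly-created directory is just as good as another" reasoning.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create name under parentfd and return an fd for it, or -1 with errno set. */
int make_and_open_dir(int parentfd, const char *name, mode_t mode)
{
    assert(name != NULL);
    for (int attempt = 0; attempt < 8; attempt++) {
        int fd;

        /* EEXIST is tolerated: someone else's directory will do (assumption). */
        if (mkdirat(parentfd, name, mode) != 0 && errno != EEXIST)
            return -1;
        fd = openat(parentfd, name,
                    O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
        if (fd >= 0)
            return fd;
        if (errno != ENOENT)
            return -1; /* e.g. a symlink or file is sitting at that name */
        /* ENOENT: the directory vanished between the two calls; start over. */
    }
    errno = EAGAIN; /* arbitrary choice for "gave up after too many races" */
    return -1;
}

/*
 * out[0]: 1 if "foo" was created and opened as a directory
 * out[1]: errno when the name is occupied by a symlink instead
 * Returns 0, or -1 on setup failure.
 */
int demo_mkdir_open(int out[2])
{
    char dir[] = "/tmp/mkopen-XXXXXX";
    struct stat st;
    int parentfd, fd;

    assert(out != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    parentfd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC); /* step 0 */
    if (parentfd < 0) {
        rmdir(dir);
        return -1;
    }

    fd = make_and_open_dir(parentfd, "foo", 0700);
    out[0] = (fd >= 0 && fstat(fd, &st) == 0 && S_ISDIR(st.st_mode));
    if (fd >= 0)
        close(fd);

    out[1] = 0;
    if (symlinkat("foo", parentfd, "bar") == 0) {
        fd = make_and_open_dir(parentfd, "bar", 0700);
        if (fd < 0)
            out[1] = errno;
        else
            close(fd);
    }

    unlinkat(parentfd, "bar", 0);
    unlinkat(parentfd, "foo", AT_REMOVEDIR);
    close(parentfd);
    rmdir(dir);
    return 0;
}
```

In the symlink case, mkdirat() reports EEXIST and the subsequent openat() fails with ENOTDIR or ELOOP, so the caller knows to abort.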

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 18:35 UTC (Tue) by zev (subscriber, #88455) [Link] (5 responses)

> One newly-created directory is just as good as another, right?

Not if they have different metadata (permissions or ownership, which could arise with multiple processes accessing /tmp, say). Also, is there any guarantee B even left it empty and didn't also start populating it with things that would conflict with A's plans for it?

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 0:46 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

> Not if they have different metadata (permissions or ownership, which could arise with multiple processes accessing /tmp, say).

A can fstat it after opening it, if A cares.

> Also, is there any guarantee B even left it empty and didn't also start populating it with things that would conflict with A's plans for it?

If the permissions on the original dir allow it, then B can do this anyway. If not, then see previous reply.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 2:12 UTC (Wed) by interalia (subscriber, #26615) [Link] (3 responses)

I could be wrong, but I imagine that even if A managed to create the directory atomically first, process B could create files in there before A gets control again. So there's no guarantee the newly created directory is empty by the time that A reads it. Most programs probably just don't care, of course, as long as the directory serves their purposes.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 29, 2023 5:37 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (2 responses)

It is possible for A to use unlinkat(2) to remove the directory immediately after creating it (with AT_REMOVEDIR). In principle, I imagine that you might be able to continue using other fooat syscalls on the unlinked directory until you close it, thus creating a "true" temporary directory, but I have not tried it.

That still doesn't work because you race against someone else opening the directory before you can unlink it, but it's a neat party trick (if it works).

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 30, 2023 2:20 UTC (Thu) by josh (subscriber, #17465) [Link]

Sadly, that doesn't work. Once you've unlinked a directory, attempting to create something in that directory produces ENOENT.
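A small sketch for checking this on a given system: create a directory, keep a descriptor for it, remove it with unlinkat() and AT_REMOVEDIR, then try to create a file through the descriptor. The function name create_in_unlinked_dir() and the /tmp location are assumptions of this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Try to create a file called name inside a directory that has already been
 * removed but is still held open. Returns 0 if the creation succeeded, the
 * errno value if it failed, or -1 if the setup itself failed.
 */
int create_in_unlinked_dir(const char *name)
{
    char dir[] = "/tmp/gone-XXXXXX";
    int dfd, fd, err = 0;

    assert(name != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    dfd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dfd < 0) {
        rmdir(dir);
        return -1;
    }
    /* Remove the directory while still holding a descriptor for it. */
    if (unlinkat(AT_FDCWD, dir, AT_REMOVEDIR) != 0) {
        close(dfd);
        rmdir(dir);
        return -1;
    }

    fd = openat(dfd, name, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        err = errno;
    else
        close(fd);
    close(dfd);
    return err;
}
```

On Linux, create_in_unlinked_dir("file") should return ENOENT, matching the behavior described above.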

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 30, 2023 6:50 UTC (Thu) by donald.buczek (subscriber, #112892) [Link]

Yes, anonymous directories would be nice to have. Of course, that would call for something like a (hypothetical) `dfd = mkdirat(parentfd, optional_name, mode, O_TMPDIR)` instead of a non-atomic `mkdir()`/`unlinkat()` combination. But I assume that would be far from trivial to implement, because filesystems are not prepared to handle trees of unlinked directories.

If you want a directory tree and its contents to go away by kernel cleanup when the last access is gone and have privilege, you can use a lazily unmounted mount on top of a lazily detached loop device on top of an unlinked file like this:

root@theinternet:~# fallocate -l 10G /tmp/x.dat
root@theinternet:~# losetup --find --show /tmp/x.dat
/dev/loop0
root@theinternet:~# mkfs.ext4 -q /dev/loop0
root@theinternet:~# mount /dev/loop0 /mnt
root@theinternet:~# cd /mnt
root@theinternet:/mnt# umount -l /mnt
root@theinternet:/mnt# losetup -d /dev/loop0
root@theinternet:/mnt# rm /tmp/x.dat
root@theinternet:/mnt# ls -l
total 16
drwx------ 2 root root 16384 Mar 30 08:43 lost+found
root@theinternet:/mnt# df /tmp
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 268304384 44856232 223448152 17% /
root@theinternet:/mnt# cd
root@theinternet:~# df /tmp
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 268304384 44786364 223518020 17% /

But there is no way to create this stack atomically.

Another ugliness here is that the system will unnecessarily flush modified data during (auto-)umount.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 16:02 UTC (Tue) by mb (subscriber, #50428) [Link] (2 responses)

I think it would be a good idea to add the O_DIRECTORY|O_CREAT combination to a unit test as well.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 17:37 UTC (Tue) by brauner (subscriber, #109349) [Link]

See my request for this in the thread and the discussion following that. TL;DR we'll have some as part of xfstests.

The curious case of O_DIRECTORY|O_CREAT

Posted Mar 28, 2023 18:50 UTC (Tue) by iabervon (subscriber, #722) [Link]

I'd also be interested in fuzz tests that report if a file gets created by open() without an fd for it being returned. If you pick a random combination of flags and permissions and starting state, I'm not sure what should happen, but I don't think that's ever an expected result.
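A very small version of such a test might look like the sketch below. It is exhaustive over a handful of flags rather than random, it only covers the "path does not exist" starting state, and the names count_stray_creations() and demo_fuzz() as well as the flag list are choices made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * For every subset of a few open() flags, open a path that does not exist
 * and count the cases where open() fails yet something is left behind.
 */
int count_stray_creations(const char *path)
{
    static const int flags[] = {
        O_CREAT, O_EXCL, O_DIRECTORY, O_TRUNC, O_APPEND, O_NOFOLLOW
    };
    const unsigned nflags = sizeof(flags) / sizeof(flags[0]);
    int stray = 0;

    assert(path != NULL);
    for (unsigned mask = 0; mask < (1u << nflags); mask++) {
        struct stat st;
        int oflags = O_RDWR;
        int fd;

        for (unsigned i = 0; i < nflags; i++)
            if (mask & (1u << i))
                oflags |= flags[i];

        fd = open(path, oflags, 0600);
        if (fd >= 0)
            close(fd);
        if (lstat(path, &st) == 0) {
            if (fd < 0) {
                stray++;
                printf("flags %#o: open() failed but left a file behind\n",
                       (unsigned)oflags);
            }
            unlink(path); /* reset to the "path absent" starting state */
        }
    }
    return stray;
}

/* Run the check in a scratch directory; returns the count, or -1 on error. */
int demo_fuzz(void)
{
    char dir[] = "/tmp/fuzz-XXXXXX";
    char path[64];
    int stray;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof(path), "%s/victim", dir);
    stray = count_stray_creations(path);
    if (rmdir(dir) != 0)
        return -1;
    return stray;
}
```

A nonzero result from demo_fuzz() would point at exactly the kind of surprise discussed in the article; a real fuzz test would also want to vary permissions and the starting state.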