The curious case of O_DIRECTORY|O_CREAT
The O_CREAT flag requests that open() create a regular file if the named path doesn't exist (adding O_EXCL will cause the call to fail if the path does exist). O_DIRECTORY, instead, indicates that the call should only succeed if the path exists and is a directory. It is not possible to create a directory with open(); that is what mkdir() is for. So the combination of O_CREAT and O_DIRECTORY asks the kernel to create a directory (which is supposed to already exist) as a regular file, which clearly does not make sense.
Since time immemorial, the kernel's response to the combination of those two flags has been to flag an error in most situations. If the path exists and is a regular file, open() fails and returns with an ENOTDIR error. If, instead, the path is an existing directory, the error is EISDIR — perhaps a bit surprising, given that O_DIRECTORY indicates that the path is expected to be a directory. If, however, the path does not exist at all, the open() call will succeed after creating a regular file with the indicated name, which is also a surprising result.
Recently, though, Pedro Falcato noticed that the behavior in the final case above had changed; the kernel will now return ENOTDIR if the path does not exist — but it also still creates a regular file. It is fair to say that this behavior is even more surprising than what happened before. Christian Brauner tracked the behavioral change down to this commit from Al Viro, which was merged for the 5.7 release.
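The cases described above are easy to probe from user space. What follows is a minimal sketch, not code from the discussion; the `struct probe_result` type and the `probe_creat_directory()` helper are names made up for illustration. It calls open() with both flags and then uses lstat() to check whether a regular file exists at the path afterward. The error reported depends on the kernel version, so no particular output is assumed here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical type: what one probe of open(O_CREAT|O_DIRECTORY) observed. */
struct probe_result {
    bool opened;        /* did open() return a file descriptor? */
    int  err;           /* errno from open(), or 0 on success */
    bool regular_file;  /* is there a regular file at the path afterward? */
};

/* Hypothetical helper: try the flag combination on `path` and report. */
struct probe_result probe_creat_directory(const char *path)
{
    struct probe_result res = { false, 0, false };
    struct stat st;

    assert(path != NULL);

    int fd = open(path, O_RDONLY | O_CREAT | O_DIRECTORY, 0600);
    if (fd >= 0) {
        res.opened = true;
        close(fd);
    } else {
        res.err = errno;
    }

    /* Look at what is at the path now, whether or not open() succeeded. */
    if (lstat(path, &st) == 0 && S_ISREG(st.st_mode))
        res.regular_file = true;

    return res;
}
```

Running the probe on a path that does not exist is the interesting case: a result with `opened` false and `regular_file` true corresponds to the post-5.7 behavior that Falcato reported.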
Falcato included a patch to restore the previous behavior, which arguably makes a bit more sense than what the kernel does now and is, in any case, what the kernel did for a long time. But Brauner wondered if the right thing to do was to fix the kernel to do something more rational with that combination of flags:
So before we continue down that road should we maybe treat this as a chance to fix the old bug? Because this behavior of returning -ENOTDIR has existed ever since v5.7 now. Since that time we had three LTS releases all returning ENOTDIR even if the file was created.
Since, he said, nobody seems to have noticed the change over this time, it
seems likely that nobody is actually counting on the strange semantics
given to that combination of flags in the past. Linus Torvalds agreed
that actually fixing the kernel's behavior seemed like a sensible path:
"I think we can pretty much assume that there are no actual users of it,
and we might as well clean up the semantics properly".
Falcato did
some research on what other systems do in response to that combination
of flags. NetBSD, it seems, will simply fail an open() call in
that situation, returning EINVAL. FreeBSD, instead, will
allow the call to succeed if the path exists and is a directory; otherwise
it will fail. He also noted that all of the behaviors seen — Linux pre-
and post-5.7, NetBSD, and FreeBSD — are allowed by POSIX: "I would not
call the old Linux behavior a *bug*, just really odd semantics".
Torvalds answered
that either of the BSD behaviors would make sense, while the kernel's
current behavior "has no excuse". The NetBSD response is "the
clearest case", he said, but FreeBSD's behavior is closer to what Linux
did before the 5.7 change. Brauner favored
the NetBSD behavior, and put together a
patch to implement it.
As part of that work, he put some effort into searching through code
looking for cases that would be broken by the change in semantics; he came
up nearly empty:
Time was spent finding potential users of this combination. Searching on codesearch.debian.net showed that codebases often express semantical expectations about O_DIRECTORY | O_CREAT which are completely contrary to what our code has done and currently does. The expectation often is that this particular combination would create and open a directory. This suggests users who tried to use that combination would stumble upon the counterintuitive behavior no matter if pre-v5.7 or post v5.7 and quickly realize neither semantics give them what they want.
Included in the patch are some links to places where developers had attempted this combination; see this libglnx comment for an example.
As the result of Brauner's patch, the combination of O_CREAT and O_DIRECTORY will cause an open() call to fail with EINVAL regardless of whether the given path exists or not. Chances are that nothing will break with this change, but he is asking for widespread testing to be sure of that. It would, after all, be annoying to have to revert this change if a problem report surfaces at some point in the future. The patch has not actually been applied as of this writing; given that there is a semantic change involved, it would be a bit surprising to see it land for 6.3. That said, your editor has been surprised by such things before.
This is one of those cases where the subtleties in the kernel's API
policies come into play. In a real sense, this fix is an incompatible API
change, and it will indeed break any program that is relying on the current
behavior. But, in cases where no program does rely on a specific behavior,
that behavior can indeed be changed. This fix seems unlikely to break
anything, and so is permissible for the kernel developers to do. Should
the assumption that nothing will break prove true, it may even be possible,
someday, to make that flag combination do what developers evidently expect
and create a directory. But first it is necessary to demonstrate that
there are indeed no problems resulting from the removal of the current,
strange semantics.
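For now, getting a file descriptor for a newly created directory takes two calls. The following is a minimal sketch of that pattern, assuming a hypothetical helper named `mkdir_and_open()`. It is not atomic: another process can rename or replace the directory between the two steps. O_NOFOLLOW at least makes the open fail if a symbolic link has been put in its place.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create directory `name` relative to `parent_fd`
 * (or AT_FDCWD), then open it. Returns a descriptor that the caller
 * must close(), or -1 with errno set by mkdirat() or openat().
 */
int mkdir_and_open(int parent_fd, const char *name, mode_t mode)
{
    if (mkdirat(parent_fd, name, mode) != 0)
        return -1;  /* for example EEXIST if the name is already taken */

    /* Race window: the name may be renamed or replaced at this point. */
    return openat(parent_fd, name,
                  O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
}
```

A caller that cares about the race window can fstat() the returned descriptor and compare ownership and permissions with what it expected to create.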
Posted Mar 27, 2023 22:49 UTC (Mon)
by jhoblitt (subscriber, #77733)
[Link] (13 responses)
Posted Mar 27, 2023 23:08 UTC (Mon)
by mpr22 (subscriber, #60784)
[Link]
>O_DIRECTORY
Posted Mar 27, 2023 23:26 UTC (Mon)
by atnot (guest, #124910)
[Link] (10 responses)
Accordingly, the kernel must provide every combination of checks or requests one might want to do in its flags. This does lead to a lot of duplication, but it is also the only way to write correct and secure code without some sort of major change to the filesystem API. It turns out that sometimes, checking whether something is a directory is just one of those things.
Posted Mar 28, 2023 16:11 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (7 responses)
The obvious concern is about blocking the system, but you could still provide optimistic concurrency control, so that doesn't seem like a valid problem to me. There's also the problem of "this is too complicated and userspace might do something dumb with it" - but that hasn't stopped mmap and similarly weird syscalls from existing. And there's also "we don't know if we made the best possible transaction system, so to avoid breaking backcompat, we'll just avoid exposing it" - but SQL servers have exposed transaction systems since the beginning, and they seem to be doing just fine on the backcompat side of things. I get the sense that the real problem is "this isn't worth the engineering effort in our judgment," but nobody wants to say that out loud.
[1]: https://learn.microsoft.com/en-us/windows/win32/fileio/de...
Posted Mar 28, 2023 19:40 UTC (Tue)
by roc (subscriber, #30627)
[Link] (5 responses)
Posted Mar 29, 2023 14:59 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (4 responses)
Isn't that exactly what SQL does with COMMIT right now?
Posted Mar 29, 2023 16:50 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
The database batches things up until userspace/SQL/DataBASIC says "okay I'm done". At present, the kernel does not have the ability to do that.
And actually, it might be quite tricky for the kernel, in general, because with a database COMMIT, the database maintains different versions of reality. Yes I know POSIX mandates some sort of reality distortion field, but when you're trying to get VFS, and btrfs, and ext?, and whatever whatever whatever all to agree, life gets rather hard ...
Cheers,
Posted Mar 29, 2023 18:30 UTC (Wed)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
AFAIK, ANSI/ISO SQL includes transactions.
Posted Mar 29, 2023 19:49 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
COMMIT merely tells the underlying layer (and it doesn't have to be SQL, ScarletDME is NoSQL and there are a heck of a lot of NoSQL databases out there that have "commit" - without it you can't really claim to be a database at all) "this series of actions are supposed to be carried out in an atomic manner".
This is basically the big problem that everything has with transactions. It's easy for COMMIT (or my case, more likely START TRANSACTION / END TRANSACTION) to *define* *what* *is* *SUPPOSED* *to* *be* *atomic*, it's a much bigger problem for the underlying layer to actually implement it atomically.
Cheers,
Posted Mar 29, 2023 18:36 UTC (Wed)
by jhoblitt (subscriber, #77733)
[Link]
Posted Mar 28, 2023 20:12 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Uhm, it was not secret at any point. Moreover, MS was pushing it pretty hard around the Vista time. The API was actually pretty cool, you could even do distributed transactions (using the Microsoft Distributed Transaction Coordinator) that involved the filesystem and SQL server. We used it to do transactional file operations in our CRM running under the IIS.
There were some problems with it:
Posted Apr 1, 2023 5:08 UTC (Sat)
by moxfyre (guest, #13847)
[Link] (1 responses)
The Linux kernel contains a "secret transaction system" for filesystem operations…? By "secret", do you just mean that there's no API exposed to userspace to manage such transactions?
Posted Apr 25, 2023 13:21 UTC (Tue)
by Rudd-O (guest, #61155)
[Link]
😂
Posted Mar 28, 2023 11:03 UTC (Tue)
by aaronmdjones (subscriber, #119973)
[Link]
Posted Mar 28, 2023 2:03 UTC (Tue)
by josh (subscriber, #17465)
[Link] (16 responses)
Given the indication that the behavior of this combination can be changed/fixed, is there some strong reason *not* to make it successfully create a directory? That seems like *useful behavior*: create a directory and atomically return an fd for that directory.
Would that break some existing software? It doesn't sound like it would, given
> "I think we can pretty much assume that there are no actual users of it, and we might as well clean up the semantics properly"
Posted Mar 28, 2023 3:53 UTC (Tue)
by brauner (subscriber, #109349)
[Link] (8 responses)
"(As a sidenote, posix made an interpretation change a long time ago to
But that's a whole different can of worms and I haven't spent any
Posted Mar 28, 2023 4:11 UTC (Tue)
by josh (subscriber, #17465)
[Link] (6 responses)
But in any case, Linus pointed out open's hard-to-extend semantics (the same ones that motivated O_TMPFILE), which make it less worthwhile to attempt to make this work.
Posted Mar 28, 2023 8:40 UTC (Tue)
by smcv (subscriber, #53363)
[Link] (5 responses)
Posted Mar 28, 2023 9:36 UTC (Tue)
by brauner (subscriber, #109349)
[Link] (4 responses)
Posted Mar 28, 2023 9:49 UTC (Tue)
by josh (subscriber, #17465)
[Link] (3 responses)
Posted Mar 28, 2023 11:58 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
Posted Mar 28, 2023 12:54 UTC (Tue)
by brauner (subscriber, #109349)
[Link] (1 responses)
Posted Mar 28, 2023 20:44 UTC (Tue)
by Villemoes (subscriber, #91911)
[Link]
Posted Apr 11, 2023 16:39 UTC (Tue)
by meuh (guest, #22042)
[Link]
Posted Mar 28, 2023 16:19 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (6 responses)
0. Thread A opens the parent directory. This may or may not race with something, but let's call it out of scope for now and just assume that it happens.
Alternatively:
2. Thread B renames it to /path/to/bar/
Alternatively alternatively:
3. Thread B creates /path/to/foo as a symlink to something else.
Posted Mar 28, 2023 18:35 UTC (Tue)
by zev (subscriber, #88455)
[Link] (5 responses)
Not if they have different metadata (permissions or ownership, which could arise with multiple processes accessing /tmp, say). Also, is there any guarantee B even left it empty and didn't also start populating it with things that would conflict with A's plans for it?
Posted Mar 29, 2023 0:46 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
A can fstat it after opening it, if A cares.
> Also, is there any guarantee B even left it empty and didn't also start populating it with things that would conflict with A's plans for it?
If the permissions on the original dir allow it, then B can do this anyway. If not, then see previous reply.
Posted Mar 29, 2023 2:12 UTC (Wed)
by interalia (subscriber, #26615)
[Link] (3 responses)
Posted Mar 29, 2023 5:37 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
That still doesn't work because you race against someone else opening the directory before you can unlink it, but it's a neat party trick (if it works).
Posted Mar 30, 2023 2:20 UTC (Thu)
by josh (subscriber, #17465)
[Link]
Posted Mar 30, 2023 6:50 UTC (Thu)
by donald.buczek (subscriber, #112892)
[Link]
If you want a directory tree and its contents to go away by kernel cleanup when the last access is gone and have privilege, you can use a lazily unmounted mount on top of a lazily detached loop device on top of an unlinked file like this:
root@theinternet:~# fallocate -l 10G /tmp/x.dat
But there is no way to create this stack atomically.
Another ugliness here is that the system will unnecessarily flush modified data during (auto-)umount.
Posted Mar 28, 2023 16:02 UTC (Tue)
by mb (subscriber, #50428)
[Link] (2 responses)
Posted Mar 28, 2023 17:37 UTC (Tue)
by brauner (subscriber, #109349)
[Link]
Posted Mar 28, 2023 18:50 UTC (Tue)
by iabervon (subscriber, #722)
[Link]
>
>If pathname is not a directory, cause the open to fail. This flag was added in kernel version 2.1.126, to avoid denial-of-service problems if opendir(3) is called on a FIFO or tape device.
Wol
Wol
1. It was slow as hell. NTFS is not a speed demon in the best of times, and additional journaling slowed it down even more.
2. Pretty much nobody used it, not even Microsoft's own Windows Update. So it got dropped from ReFS.
secret transaction system?
secret transaction system?
at least allow for O_DIRECTORY | O_CREAT to create a directory (see [3]).
thoughts even on feasibility. And even if we should probably get through
a couple of kernels with O_DIRECTORY | O_CREAT failing with EINVAL first.)"
open(, O_DIRECTORY|O_CREAT) should create and open a file descriptor to a directory
1. Thread A calls mkdirat and creates /path/to/foo/
2. Thread B renames it to /path/to/bar/ (or removes it or whatever)
3. Now if thread A tries to openat /path/to/foo/, it gets ENOENT, so it would know to start over and try again.
3. Thread B creates a new /path/to/foo/
4. Thread A successfully openats /path/to/foo/. It's a different directory than the one that it created, but from thread A's perspective, this makes no practical difference. One newly-created directory is just as good as another, right?
4. Thread A tries to openat /path/to/foo/, but it fails because of O_NOFOLLOW. A knows that something is wrong and aborts the operation.
root@theinternet:~# losetup --find --show /tmp/x.dat
/dev/loop0
root@theinternet:~# mkfs.ext4 -q /dev/loop0
root@theinternet:~# mount /dev/loop0 /mnt
root@theinternet:~# cd /mnt
root@theinternet:/mnt# umount -l /mnt
root@theinternet:/mnt# losetup -d /dev/loop0
root@theinternet:/mnt# rm /tmp/x.dat
root@theinternet:/mnt# ls -l
total 16
drwx------ 2 root root 16384 Mar 30 08:43 lost+found
root@theinternet:/mnt# df /tmp
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 268304384 44856232 223448152 17% /
root@theinternet:/mnt# cd
root@theinternet:~# df /tmp
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 268304384 44786364 223518020 17% /
