
Flags for fchmodat()

By Jonathan Corbet
July 27, 2023
The fchmodat() system call on Linux hides a little secret: it does not actually implement all of the functionality that the man page claims (and that POSIX calls for). As a result, C libraries must resort to a somewhat complicated workaround to provide the API that applications expect. That situation looks likely to change with the 6.6 kernel, though, as the result of this patch series posted by Alexey Gladkov.

The prototype for fchmodat() is defined as:

    int fchmodat(int fd, const char *path, mode_t mode, int flag);

Its purpose is to change the permissions of the file identified by path to the given mode. In the style of all the *at() system calls, fd can be an open file descriptor referring to a directory; if path is relative, the lookup process will start at the directory indicated by fd rather than the current working directory. The flag argument can be either zero or AT_SYMLINK_NOFOLLOW.
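As a rough illustration of the calling convention described above, the following sketch changes the mode of a file looked up relative to a directory file descriptor. The helper name chmod_in_dir() is hypothetical and exists only for this example; the flag argument is left at zero, so a trailing symbolic link would be followed.

```c
#define _GNU_SOURCE
#include <errno.h>      /* errno */
#include <fcntl.h>      /* open(), O_DIRECTORY, O_CLOEXEC */
#include <sys/stat.h>   /* fchmodat(), mode_t */
#include <unistd.h>     /* close() */

/*
 * Hypothetical helper (not part of any library): change the mode of
 * "name", looked up relative to the directory "dirpath".
 * Returns 0 on success, -1 with errno set on failure.
 */
static int chmod_in_dir(const char *dirpath, const char *name, mode_t mode)
{
    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd < 0)
        return -1;

    /* flag == 0: if "name" is a symbolic link, its target is changed. */
    int ret = fchmodat(dirfd, name, mode, 0);

    /* Preserve the errno from fchmodat() across close(). */
    int saved_errno = errno;
    close(dirfd);
    errno = saved_errno;
    return ret;
}
```

A call such as chmod_in_dir("/tmp", "example", 0600) would then behave like chmod("/tmp/example", 0600), except that the lookup starts from the already-opened directory.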

Support for fchmodat() was added to the Linux kernel for the 2.6.16 release in 2006 as part of a series from Ulrich Drepper adding a number of the *at() calls. That version of fchmodat(), though, did not include the flag argument, a situation that continues to the present. As a result, the kernel's fchmodat() implementation is not compliant with the specification, and is not what application developers will expect. That, in itself, is not entirely unusual; applications do not (usually) invoke system calls directly. Instead, they use wrappers in a low-level library, usually the C library, which do what is needed to provide the expected API. That is what happens here, but the result is not ideal.

The POSIX specification defines the behavior of the AT_SYMLINK_NOFOLLOW flag as: "If path names a symbolic link, then the mode of the symbolic link is changed". That behavior differs from the default, where the mode of the file pointed to by that link will be changed instead. There are two reasons why one might want a flag like this: to actually change permissions on a symbolic link, and, more importantly, to prevent the changing of permissions on a real file by way of a symbolic link. Attackers have been known to use symbolic links to confuse a privileged program into changing file modes that should not be changed; using this flag will prevent such an outcome.
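From the caller's side, the defensive use of the flag might look like the sketch below. The helper name and its return-value convention are assumptions made for this example. The EOPNOTSUPP check reflects how Linux C libraries typically report a symbolic link here; it is not behavior that POSIX guarantees.

```c
#define _GNU_SOURCE
#include <errno.h>      /* errno, EOPNOTSUPP */
#include <fcntl.h>      /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
#include <sys/stat.h>   /* fchmodat(), mode_t */

/*
 * Hypothetical helper: change the mode of "path" without ever following
 * a symbolic link in the final path component.
 * Returns 0 on success, 1 if the request was refused with EOPNOTSUPP
 * (typically because "path" is a symbolic link), -1 on any other error.
 */
static int chmod_refusing_symlinks(const char *path, mode_t mode)
{
    if (fchmodat(AT_FDCWD, path, mode, AT_SYMLINK_NOFOLLOW) == 0)
        return 0;

    /*
     * Assumption: EOPNOTSUPP usually means "this is a symbolic link",
     * but a library may also return it when it cannot emulate the flag
     * at all, so callers should treat it only as "nothing was changed".
     */
    if (errno == EOPNOTSUPP)
        return 1;

    return -1;
}
```

A privileged program would treat any nonzero return as a reason to stop rather than retrying without the flag.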

If one looks at the (functionally identical) fchmodat() implementations in the GNU C library and musl libc, two things jump out: implementing AT_SYMLINK_NOFOLLOW in user space is inelegant at best and, due to limitations in Linux itself, neither library is able to implement exactly what the specification says (but they are able to provide the important part).

The C-library implementations start by opening the file indicated by the fd and path arguments to fchmodat() as an O_PATH file descriptor. Such a descriptor allows metadata operations, but cannot be used to read or write the file; thus, it does not require read or write permission on the file to open. That open() call also uses the O_NOFOLLOW flag; if the path ends with a symbolic link, that will cause the link itself to be opened, rather than the file pointed to.

At this point, the C libraries do an fstatat64() call to determine what kind of file has just been opened; if the new file descriptor turns out to be a symbolic link, an EOPNOTSUPP failure status will be returned to the caller. The Linux kernel does not support changing the permission bits on a symbolic link in general (those bits have no real meaning anyway), so neither C-library implementation even tries.

If the target is not a symbolic link, the library could just issue a normal fchmodat() call with the given parameters and no flag. That, however, could open the door to a time-of-check-to-time-of-use vulnerability, where an attacker would replace the file with a symbolic link between the check and the mode change. So, instead, the library must change the mode bits on the file that it actually opened in the first step, without using the path name again. Unfortunately, the obvious way (using fchmod()) won't work, because that system call cannot operate on O_PATH file descriptors in many filesystems. So, instead, the C library generates the path for the open file descriptor under /proc/self/fd, then passes that to chmod() to effect the mode change.
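The three steps described above (open with O_PATH, check the file type, change the mode through /proc/self/fd) can be sketched roughly as follows. This is a simplified illustration, not the actual glibc or musl source: the function name is hypothetical, fstat() stands in for the fstatat64() call the libraries use, and the error handling for a missing /proc is omitted.

```c
#define _GNU_SOURCE
#include <errno.h>      /* errno, EOPNOTSUPP */
#include <fcntl.h>      /* openat(), O_PATH, O_NOFOLLOW, O_CLOEXEC */
#include <stdio.h>      /* snprintf() */
#include <sys/stat.h>   /* fstat(), chmod(), S_ISLNK() */
#include <unistd.h>     /* close() */

#ifndef O_PATH
#define O_PATH 010000000 /* assumption: generic Linux value, used only if the headers hide it */
#endif

/* Hypothetical sketch of the user-space AT_SYMLINK_NOFOLLOW emulation. */
static int chmod_nofollow_emulated(int dirfd, const char *path, mode_t mode)
{
    /* Step 1: pin the final path component without following a link. */
    int fd = openat(dirfd, path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Step 2: see what was actually opened (fstat() accepts O_PATH
     * descriptors on Linux 3.6 and later). */
    struct stat st;
    if (fstat(fd, &st) < 0) {
        int saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return -1;
    }
    if (S_ISLNK(st.st_mode)) {
        close(fd);
        errno = EOPNOTSUPP;   /* symbolic links are refused outright */
        return -1;
    }

    /* Step 3: address the pinned file through /proc instead of using
     * the original path again, which closes the race window. */
    char procpath[64];
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    int ret = chmod(procpath, mode);

    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return ret;
}
```

Because the mode change in step 3 goes through the descriptor obtained in step 1, swapping the path for a symbolic link after the check has no effect on which file is modified. The dependence on /proc is also plain to see in the sketch.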

This sequence seems unlikely to be the most efficient way to prevent the following of a symbolic link for an fchmodat() call. It also will fail to work in settings where /proc is not available. A much nicer solution would be to just implement the AT_SYMLINK_NOFOLLOW flag in the kernel, which already has the needed machinery to do so in an atomic and efficient manner.

That is what Gladkov's patch series does: it creates a new fchmodat2() system call that implements the AT_SYMLINK_NOFOLLOW flag. Once this system call is available in released kernels, the C-library implementations can use it for their implementation of fchmodat(), bypassing the current workarounds. The result should be a faster and more robust implementation. Chances are that change will happen soon; VFS maintainer Christian Brauner has applied the series and routed it into linux-next, meaning that it should be pushed during the 6.6 merge window.
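Until C libraries adopt the new call, it could be invoked directly through syscall(), along the lines of the sketch below. The function name is hypothetical, and the fallback definition of __NR_fchmodat2 (452, the number assigned on most architectures) is an assumption that only takes effect when the installed kernel headers predate the system call. On kernels without fchmodat2() the sketch falls back to the library's fchmodat().

```c
#define _GNU_SOURCE
#include <errno.h>        /* errno, ENOSYS */
#include <fcntl.h>        /* AT_SYMLINK_NOFOLLOW */
#include <sys/stat.h>     /* fchmodat(), mode_t */
#include <sys/syscall.h>  /* __NR_* system call numbers */
#include <unistd.h>       /* syscall() */

#ifndef __NR_fchmodat2
#define __NR_fchmodat2 452  /* assumption: value for most architectures */
#endif

/*
 * Hypothetical wrapper: prefer the in-kernel AT_SYMLINK_NOFOLLOW
 * support, falling back to the C library when the kernel lacks it.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int fchmodat2_or_fallback(int dirfd, const char *path, mode_t mode)
{
    long ret = syscall(__NR_fchmodat2, dirfd, path, mode,
                       AT_SYMLINK_NOFOLLOW);
    if (ret == 0 || errno != ENOSYS)
        return (int)ret;

    /* Kernel without fchmodat2(): let the library emulate the flag. */
    return fchmodat(dirfd, path, mode, AT_SYMLINK_NOFOLLOW);
}
```

With the kernel doing the work, no extra file descriptor is opened and /proc does not need to be mounted.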

Interestingly, this is not the first attempt to add an fchmodat2() implementation; there were patches posted by Rich Felker in 2020 and Greg Kurz in 2017. It is not entirely clear why the patches were not accepted at that time; it may be simply because VFS patches have occasionally tended to fall through the cracks over the years. The previous failure may be part of why Felker responded rather negatively to a suggestion from David Howells that, perhaps, it would be better to add a new set_file_attrs() system call, with a number of new features, rather than completing fchmodat(). That suggestion has not gained much support, so Gladkov's attempt appears to be the one that will actually succeed; after 17 years in the kernel, fchmodat() should finally get in-kernel AT_SYMLINK_NOFOLLOW support.




Flags for fchmodat()

Posted Jul 27, 2023 18:34 UTC (Thu) by bof (subscriber, #110741) [Link]

Hurray. So in 2-3 years, I can stop the weird extra proc mounts needed inside each rsyncd chroot, again...

Flags for fchmodat()

Posted Jul 28, 2023 0:55 UTC (Fri) by Paf (subscriber, #91811) [Link]

Reading through the earlier thread, it looks like there were some - not entirely positively presented, to be fair - issues with the 2020 submission that have been corrected here, like the lack of tests. So it might’ve been tough to get over the hump before, but there were also some issues.

Flags for fchmodat()

Posted Jul 28, 2023 3:11 UTC (Fri) by wahern (subscriber, #37304) [Link] (9 responses)

The patch originally used the name fchmodat4, but review requested:

> s/fchmodat4/fchmodat2/
>
> With very few exceptions we don't version by argument number but by revision and we should stick to one scheme:

I personally prefer the former, partly perhaps because at some point long ago I was under the impression it was the more common practice. But more recently I've gotten the impression that most developers are unfamiliar with the version by argument number (i.e. arity) pattern. Have I always been in the minority camp or did the world change around me?

Flags for fchmodat()

Posted Jul 28, 2023 4:10 UTC (Fri) by willy (subscriber, #9762) [Link] (8 responses)

Funnily, there's not much precedent either way.

Most syscalls that end in a number are 16, 32 or 64, indicating their limits (e.g. sys_time32).

There's wait/waitpid/wait3/wait4, but the arity matches the sequence number. Similarly for dup/dup2/dup3 and pipe/pipe2

The less said about sys_vm86 the better ;-)

There's signalfd4, eventfd2, epoll_create1, accept4 which look to be arity based.

But then there's renameat2 which has 5 arguments. mlock2 which takes 3. preadv2 and pwritev2 which take 5. openat2 takes 4. faccessat2 takes 4. epoll_pwait2 takes 5.

pselect6 is named for its arity, but that's because you can't normally have more than 6 arguments to a syscall. clone3 was preceded by a clone2 that we don't talk about.

I think you could make an argument either way.

Flags for fchmodat()

Posted Jul 28, 2023 8:50 UTC (Fri) by brauner (subscriber, #109349) [Link]

There was a discussion a few years back when we did openat2(), clone3() and others that simple versioning is the default.
In sheer numbers this scheme also wins iirc.

There's also the possibility that a system call like bla4() is broken and you'd want to change a system call argument type but not the actual number of arguments. Then you'd not be able to call it bla5() and bla4.2() would be rather weird. Imho, the simple versioning is just more flexible and is nowadays the de facto standard anyway.

I also had documentation for all of this but there was never enough time to actually send it, but fwiw:

https://github.com/brauner/linux/commit/5fe619ce62bae64cf...

which is part of

https://github.com/brauner/linux/commits/docs_extensible_...

and contains a lot of other info.

Flags for fchmodat()

Posted Jul 30, 2023 3:03 UTC (Sun) by mirabilos (subscriber, #84359) [Link] (6 responses)

I think dup2 was actually “dup to”, and no, numbering functions by argument number totally strikes me as unusual and odd when I come by it (e.g. in jq). It’s extremely rare at least in whatever relatively broad but mostly traditional FOSS spheres I’ve been seeing.

It’s also inflexible (what if you change one argument type in a revision, or even lose one).

Flags for fchmodat()

Posted Jul 30, 2023 3:08 UTC (Sun) by willy (subscriber, #9762) [Link] (3 responses)

You can make a similar argument that wait4 was actually "wait for"

Flags for fchmodat()

Posted Jul 30, 2023 3:20 UTC (Sun) by mirabilos (subscriber, #84359) [Link] (2 responses)

Right, fully agreed there.

I’m not too much of a fan of puns based on specific pronunciations of things, especially if they go unexplained, but it’s probably obvious to English speakers.

Flags for fchmodat()

Posted Jul 30, 2023 3:25 UTC (Sun) by willy (subscriber, #9762) [Link] (1 responses)

I'm kind of neutral on puns. If they're not needed to make sense (and dup2/wait4 are examples of that), I don't mind. The Lemmings icon of a pair of paws to mean pause is unforgivable.

Flags for fchmodat()

Posted Jul 30, 2023 9:23 UTC (Sun) by Wol (subscriber, #4433) [Link]

But puns like that are part of "the Unix way". Go back to the early versions, and Unix is absolutely full of them.

My favourite example is the mutt man page - mutts (dogs) collect mail hence the name, and "mutts don't have bugs, they have fleas ..."

Cheers,
Wol

Flags for fchmodat()

Posted Aug 3, 2023 9:00 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

dup2() is a POSIX-ism, and it appears to be sequential since POSIX also specifies dup(). I would be surprised if Linux named dup2 and then got POSIX to adopt that name (I think dup2 is probably older than that), but I'm insufficiently familiar with the history here to categorically rule it out.

https://pubs.opengroup.org/onlinepubs/9699919799/function...

Flags for fchmodat()

Posted Aug 3, 2023 18:24 UTC (Thu) by jwilk (subscriber, #63328) [Link]

dup2() was added in V7 Unix (released in 1979), so it predates both POSIX and Linux by about a decade.

Flags for fchmodat()

Posted Jul 28, 2023 10:41 UTC (Fri) by cyphar (subscriber, #110703) [Link]

> The Linux kernel does not support changing the permission bits on a symbolic link in general (those bits have no real meaning anyway), so neither C-library implementation even tries.

This is actually not quite true, at least not until Christian's patch to enforce this is merged. The restriction on symlink modes was always done on a per-filesystem basis (which led some filesystems to allow it by accident -- procfs allows this for several symlinks and magic-links). In fact, several filesystems (btrfs, xfs, and ext4) all returned -EOPNOTSUPP but still modified the inode mode.

> Unfortunately, the obvious way (using fchmod()) won't work, because that system call cannot operate on O_PATH file descriptors in many filesystems. So, instead, the C library generates the path for the open file descriptor under /proc/self/fd, then passes that to chmod() to effect the mode change.

This restriction is done on the VFS level, it's not per-filesystem (fchmod() uses fdget() rather than fdget_raw() -- and this behaviour is intentional per the description of O_PATH in open(2)).

If fchmodat2() adds support for AT_EMPTY_PATH, it would be possible to avoid even procfs nastiness when dealing with O_PATH file descriptors -- something which is necessary in plenty of cases where AT_SYMLINK_NOFOLLOW is inadequate, such as when dealing with paths you need to resolve safely with RESOLVE_IN_ROOT or other openat2() flags. I'll send a patch for this...


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds