/dev/userfaultfd
A call to userfaultfd() returns a special file descriptor attached to the current process. Among other things, this descriptor can be used (with ioctl()) to register regions of memory. When any thread in the current process encounters a page fault in a registered area, it will be blocked and an event will be sent to the userfaultfd() file descriptor. The managing thread, on reading that event, has several options for how to resolve the fault; these include copying data into a new page, creating a zero-filled page, or mapping in a page that exists elsewhere in the address space. Once the fault has been dealt with, the faulting thread will continue its execution.
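As a rough illustration of the registration step described above, the following C sketch creates a userfaultfd descriptor, performs the UFFDIO_API handshake, and registers a region for missing-page faults. The helper names (page_align_up(), uffd_register_missing()) are hypothetical and chosen only for this example; the sketch assumes a Linux system whose headers provide <linux/userfaultfd.h> and SYS_userfaultfd, and it leaves out the managing thread that would read events and resolve faults.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Round len up to a multiple of page_size, which must be a power of two. */
size_t page_align_up(size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (len + page_size - 1) & ~(page_size - 1);
}

/* Close fd while preserving the errno of the call that failed. */
static int fail_and_close(int fd)
{
    int saved = errno;
    close(fd);
    errno = saved;
    return -1;
}

/*
 * Hypothetical helper: create a userfaultfd descriptor and register
 * [addr, addr + len) so that missing-page faults are reported on it.
 * addr is assumed to be page-aligned (for example, returned by mmap()).
 * Returns the descriptor, or -1 with errno set.
 */
int uffd_register_missing(void *addr, size_t len)
{
    /* Raw syscall, so the sketch does not depend on a libc wrapper. */
    int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return -1;

    /* API handshake; must happen before any other userfaultfd ioctl(). */
    struct uffdio_api api;
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    if (ioctl(uffd, UFFDIO_API, &api) < 0)
        return fail_and_close(uffd);

    /* Register the region; the length must be a multiple of the page size. */
    struct uffdio_register reg;
    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uintptr_t)addr;
    reg.range.len = page_align_up(len, (size_t)sysconf(_SC_PAGESIZE));
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return fail_and_close(uffd);

    return uffd;
}
```

A caller would typically mmap() an anonymous region, pass it to uffd_register_missing(), and then hand the returned descriptor to a separate thread that reads fault events and resolves them with ioctl() commands such as UFFDIO_COPY or UFFDIO_ZEROPAGE.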
A thread will normally encounter a page fault while running in user space; it may have dereferenced a pointer to a not-present page, for example. But there are times when page faults can happen within the kernel. As a simple example, consider a read() call; if the buffer provided to read() is not resident in RAM, a page fault will result when the kernel tries to access it. At that point, execution will be blocked as usual, but it will be blocked in the kernel rather than in user space.
Blocking on page faults within the kernel is a normal experience when dealing with user-space memory, and everything works as it should. There is one little problem, though. If an attacker can force a page fault at a known point in the kernel — which is often not hard to do — they can use userfaultfd() to block the execution of a thread in the kernel indefinitely. That, in turn, can expand a race window that would otherwise be difficult or impossible to hit, giving the attacker a chance to change the world in potentially surprising ways while the kernel is waiting.
This abuse of userfaultfd() is not just a theoretical possibility; various exploits using userfaultfd() have been disclosed over the years. The problem was deemed serious enough that some restrictions were added in 2020. If the vm/unprivileged_userfaultfd sysctl knob is set to zero (as it is on many distributions), then one of two conditions must apply for a userfaultfd() call to succeed: either the calling process has the CAP_SYS_PTRACE capability, or it supplies the UFFD_USER_MODE_ONLY flag to the system call. In the latter case, page faults encountered while running in the kernel will not be processed via the userfaultfd() mechanism, even if they occur within a registered area.
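An unprivileged program that can live without kernel-fault handling might cope with these restrictions along the following lines. This is only a sketch: the helper name uffd_open_with_fallback() is hypothetical, the assumption that a refused call fails with EPERM is not stated above, and the fallback definition of UFFD_USER_MODE_ONLY (the value 1, as believed to be used by 5.11 and later headers) is there only so that the example builds against older headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1   /* assumed value from 5.11+ headers */
#endif

/*
 * Hypothetical helper: try for a fully functional userfaultfd first and,
 * if the kernel refuses with EPERM (assumed errno), retry with
 * UFFD_USER_MODE_ONLY so that only user-space faults are handled.
 * *user_mode_only is set to 1 if the restricted mode was needed.
 * Returns the descriptor, or -1 with errno set.
 */
int uffd_open_with_fallback(int *user_mode_only)
{
    assert(user_mode_only != NULL);
    *user_mode_only = 0;

    int flags = O_CLOEXEC | O_NONBLOCK;
    int fd = (int)syscall(SYS_userfaultfd, flags);
    if (fd < 0 && errno == EPERM) {
        /* Unprivileged and sysctl knob is zero: ask for user-mode faults only. */
        fd = (int)syscall(SYS_userfaultfd, flags | UFFD_USER_MODE_ONLY);
        if (fd >= 0)
            *user_mode_only = 1;
    }
    return fd;
}
```

The design choice here is to ask for the full functionality first, so that a process with CAP_SYS_PTRACE (or running with the sysctl knob set to one) is not needlessly restricted.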
This change was merged for 5.11 at the end of 2020. It closes off this use of userfaultfd() by attackers, but it also makes the full functionality unavailable to legitimate (but unprivileged) processes. As Rasmussen notes in this patch from the series, that problem can be worked around by giving the process in question the CAP_SYS_PTRACE capability, but that enables a number of actions that have nothing to do with userfaultfd(). Specifically, it could allow the process to read data from or inject code into any other process on the system, which may be undesirable. It would be good, instead, to be able to enable the full userfaultfd() functionality for a process without granting it wider, unneeded privileges.
Rasmussen's solution is to create a new special file called /dev/userfaultfd that gives access to this functionality without the need to call userfaultfd(). One might think that opening this file would yield a file descriptor that acts just like the descriptor from userfaultfd(), but it is not quite that simple. Instead, the only thing that can be done with a /dev/userfaultfd file descriptor is to call ioctl() with the USERFAULTFD_IOC_NEW command; that will create a userfaultfd()-style file descriptor.
A file descriptor created in this way will behave like one from userfaultfd() in every way, with one exception: the handling of kernel faults will be allowed regardless of the calling process's privilege level or the setting of the global sysctl knob. The effect, in other words, is to circumvent the 2020 patch, making full userfaultfd() features available again to all processes. The catch is that a process must be able to open /dev/userfaultfd in the first place to gain access to the feature it provides. By setting the access permissions on this file, an administrator can control who is able to open it and use userfaultfd() in this way.
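The two-step sequence might look roughly like the sketch below. Several things here are assumptions rather than facts taken from the patch series as described: that userfaultfd()-style flags are passed as the ioctl() argument, that the new descriptor remains usable after the /dev/userfaultfd descriptor is closed, and the fallback definitions of USERFAULTFD_IOC and USERFAULTFD_IOC_NEW, which are meant to mirror the mainline header values and should be checked against the installed <linux/userfaultfd.h>. The helper name uffd_new_from_dev() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Fallback for user-space headers older than the running kernel.
 * ASSUMPTION: these values mirror the mainline UAPI header; verify them
 * against your kernel's linux/userfaultfd.h before relying on them.
 */
#ifndef USERFAULTFD_IOC_NEW
#define USERFAULTFD_IOC     0xAA
#define USERFAULTFD_IOC_NEW _IO(USERFAULTFD_IOC, 0x00)
#endif

/*
 * Hypothetical helper: open the device node at dev_path (normally
 * "/dev/userfaultfd") and ask it for a new userfaultfd-style descriptor.
 * flags are userfaultfd()-style flags such as O_CLOEXEC | O_NONBLOCK
 * (passing them as the ioctl() argument is an assumption).
 * Returns the new descriptor, or -1 with errno set.
 */
int uffd_new_from_dev(const char *dev_path, int flags)
{
    assert(dev_path != NULL);

    /* Succeeds only if the file's permissions allow this process in. */
    int dev_fd = open(dev_path, O_RDWR | O_CLOEXEC);
    if (dev_fd < 0)
        return -1;

    int uffd = ioctl(dev_fd, USERFAULTFD_IOC_NEW, flags);

    /* The device descriptor is assumed not to be needed after this point. */
    int saved = errno;
    close(dev_fd);
    errno = saved;
    return uffd;
}
```

A call such as uffd_new_from_dev("/dev/userfaultfd", O_CLOEXEC | O_NONBLOCK) would then be followed by the usual UFFDIO_API handshake and region registration. If open() fails with EACCES, the administrator has not granted this user access to the file.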
In other words, /dev/userfaultfd allows an administrator to give
the ability to handle kernel faults to specific processes without the need
to grant any other privileges.
This patch series is in its third revision, and it would appear that the
review comments received so far have been addressed. Barring some sort of
surprise, this new tweak to the security policy surrounding
userfaultfd() seems likely to find its way into the kernel during
a near-future merge window.
Posted Jun 13, 2022 15:33 UTC (Mon)
by kilpatds (subscriber, #29339)
That does require that the system calls are able to be safely recalled, but that seems easier to program around than random userspace induced pauses.
Posted Jun 13, 2022 18:12 UTC (Mon)
by josh (subscriber, #17465)
Is there some explanation for why the indirection through an ioctl?
Posted Jun 13, 2022 18:21 UTC (Mon)
by corbet (editor, #1)
Posted Jun 13, 2022 18:36 UTC (Mon)
by josh (subscriber, #17465)
Posted Jun 13, 2022 21:04 UTC (Mon)
by epa (subscriber, #39769)
Posted Jun 14, 2022 15:17 UTC (Tue)
by k3ninho (subscriber, #50375)
K3n.
Posted Jun 14, 2022 16:58 UTC (Tue)
by nybble41 (subscriber, #55106)
Yes, they're tracked in 32-bit bitfields so there can never be more than 32 unique capabilities.
Posted Jun 16, 2022 4:47 UTC (Thu)
by clay.sweetser@gmail.com (guest, #155278)
Alternately, are there any similar permission mechanisms in Linux that are more suitable? This whole ioctl indirection seems to be an attempt to address, partially, what is really a larger problem.
Posted Jun 16, 2022 9:04 UTC (Thu)
by cortana (subscriber, #24596)
Posted Jun 16, 2022 11:59 UTC (Thu)
by metan (subscriber, #74107)
Posted Jun 16, 2022 14:59 UTC (Thu)
by nybble41 (subscriber, #55106)
Ah, I missed that the v2 ABI changed this to an array in 2007. I would imagine that it's still not that easy to extend it, since it appears to require a new version of the ABI to increase the array length, and any userspace tools and filesystems which work with capability sets would need to be updated as well. In a sense it's always going to be *possible* to extend the set of capabilities with ABI changes; the question is how much work is involved. Also, as long as the capability set remains a bitfield—and not, say, a sparse array—there will be significant overhead to tracking large numbers of potential capabilities, whether or not they're actually used by a given program.
Posted Jun 16, 2022 22:37 UTC (Thu)
by draco (subscriber, #1792)
They hit the old limit and extended the limit.
Posted Jun 14, 2022 18:50 UTC (Tue)
by axelrasmussen (subscriber, #140005)
This was rejected by the capability maintainers, based on the idea that the capability would have a very narrow use case, and they want to avoid adding many more capabilities (e.g. if we add a narrow one for this use case, we might equally come up with 100s of other narrow use cases).
Posted Jun 17, 2022 20:30 UTC (Fri)
by epa (subscriber, #39769)
Posted Jun 13, 2022 18:56 UTC (Mon)
by mathstuf (subscriber, #69389)
Posted Jun 14, 2022 7:47 UTC (Tue)
by pbonzini (subscriber, #60935)
Posted Jun 14, 2022 10:03 UTC (Tue)
by danpb (subscriber, #4831)
Posted Jun 13, 2022 20:39 UTC (Mon)
by axelrasmussen (subscriber, #140005)
I guess another argument is, as pointed out, the ioctl is more extensible - we could add arguments, flags, whatever, if we want to fine tune the behavior in the future. Then again, I don't have immediate plans to do that.
I'm not sure these are very strong arguments. :) If people think it would be cleaner to do it the other way, I can always send a v4...
Posted Oct 25, 2022 16:49 UTC (Tue)
by bobozi (guest, #161810)
/dev/userfaultfd
There was no explanation in the patch. I'm guessing that it's because userfaultfd() takes a flags argument that they wanted to be able to support in the /dev interface too.
Use of ioctl()
I updated my Ubuntu system to 6.1-rc1 and called ioctl(dev_userfaultfd_fd, USERFAULTFD_IOC_NEW),
but I get the error: ‘USERFAULTFD_IOC_NEW’ undeclared.