/dev/userfaultfd

By Jonathan Corbet
June 13, 2022

The userfaultfd() system call allows one thread to handle page faults for another in user space. It has a number of interesting use cases, including the live migration of virtual machines. There are also some less appealing use cases, though, most of which are appreciated by attackers trying to take control of a machine. Attempts have been made over the years to make userfaultfd() less useful as an exploit tool, but this patch set from Axel Rasmussen takes a different approach by circumventing the system call entirely.

A call to userfaultfd() returns a special file descriptor attached to the current process. Among other things, this descriptor can be used (with ioctl()) to register regions of memory. When any thread in the current process encounters a page fault in a registered area, it will be blocked and an event will be sent to the userfaultfd() file descriptor. The managing thread, on reading that event, has several options for how to resolve the fault; these include copying data into a new page, creating a zero-filled page, or mapping in a page that exists elsewhere in the address space. Once the fault has been dealt with, the faulting thread will continue its execution.
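The sequence described above can be sketched in C. This is a minimal illustration, not code from the patch set: the struct `uffd_demo` and the functions `resolve_one_fault()` and `uffd_demo_roundtrip()` are hypothetical names, and the sketch assumes a kernel and headers that provide userfaultfd(). It passes UFFD_USER_MODE_ONLY so that it has a chance of working for an unprivileged process, and it reports "unavailable" rather than failing when the system call is not permitted.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1   /* assumed value, for pre-5.11 headers */
#endif

struct uffd_demo {              /* hypothetical helper type */
    int uffd;                   /* descriptor returned by userfaultfd() */
    size_t page;                /* system page size */
    char *src;                  /* page of data the manager will supply */
};

/* Managing thread: read one fault event and resolve it by copying a page. */
static void *resolve_one_fault(void *arg)
{
    struct uffd_demo *d = arg;
    struct uffd_msg msg;

    if (read(d->uffd, &msg, sizeof(msg)) != sizeof(msg) ||
        msg.event != UFFD_EVENT_PAGEFAULT)
        return NULL;

    struct uffdio_copy copy = {
        .dst = msg.arg.pagefault.address & ~((uint64_t)d->page - 1),
        .src = (uint64_t)(uintptr_t)d->src,
        .len = d->page,
    };
    ioctl(d->uffd, UFFDIO_COPY, &copy);   /* also wakes the faulting thread */
    return NULL;
}

/* Returns 0 if the fault was resolved with the manager's data, 1 if
 * userfaultfd() is unavailable or not permitted, -1 on a data mismatch. */
int uffd_demo_roundtrip(void)
{
    struct uffd_demo d = { .page = (size_t)sysconf(_SC_PAGESIZE) };
    struct uffdio_api api = { .api = UFFD_API };
    pthread_t manager;
    int ret = 1;

    d.uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
    if (d.uffd < 0)
        return 1;

    /* Two pages: the first is registered, the second holds source data. */
    char *area = mmap(NULL, 2 * d.page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        close(d.uffd);
        return 1;
    }
    d.src = area + d.page;
    memset(d.src, 'x', d.page);

    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)area, .len = d.page },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(d.uffd, UFFDIO_API, &api) == 0 &&
        ioctl(d.uffd, UFFDIO_REGISTER, &reg) == 0 &&
        pthread_create(&manager, NULL, resolve_one_fault, &d) == 0) {
        char seen = *(volatile char *)area;   /* blocks until resolved */
        pthread_join(manager, NULL);
        ret = (seen == 'x') ? 0 : -1;
    }
    munmap(area, 2 * d.page);
    close(d.uffd);
    return ret;
}
```

The faulting thread here is the caller of `uffd_demo_roundtrip()`; it simply reads from the registered page and sleeps until the manager's UFFDIO_COPY completes. A real manager would loop over events and could use UFFDIO_ZEROPAGE instead when a zero-filled page is wanted.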

A thread will normally encounter a page fault while running in user space; it may have dereferenced a pointer to a not-present page, for example. But there are times that page faults can happen within the kernel. As a simple example, consider a read() call; if the buffer provided to read() is not resident in RAM, a page fault will result when the kernel tries to access it. At that point, execution will be blocked as usual, but it will be blocked in the kernel rather than in user space.
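A small sketch can make the read() case concrete. It maps a page that user space never touches, asks the kernel to fill it with read(), and uses mincore() to observe whether the page is resident before and after. The function name `kernel_fault_demo()` is hypothetical, and the "before" value depends on how the system reports residency for untouched anonymous memory, so treat that part as an assumption.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Stores the residency (0 or 1) of a never-touched page before and after
 * a read() into it. Returns 0 on success, -1 on error. */
int kernel_fault_demo(int *before, int *after)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char vec = 0;
    int p[2];
    int ret = -1;

    char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    if (pipe(p) != 0) {
        munmap(buf, page);
        return -1;
    }

    if (mincore(buf, page, &vec) == 0) {
        *before = vec & 1;
        /* User space never dereferences buf; the page fault happens
         * inside the kernel while read() copies data into the buffer. */
        if (write(p[1], "hi", 2) == 2 && read(p[0], buf, 2) == 2 &&
            mincore(buf, page, &vec) == 0) {
            *after = vec & 1;
            ret = 0;
        }
    }

    close(p[0]);
    close(p[1]);
    munmap(buf, page);
    return ret;
}
```

If the buffer were instead part of a region registered with userfaultfd(), the same read() would sleep in the kernel until the managing thread resolved the fault.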

Blocking on page faults within the kernel is a normal experience when dealing with user-space memory, and everything works as it should. There is one little problem, though. If an attacker can force a page fault at a known point in the kernel — which is often not hard to do — they can use userfaultfd() to block the execution of a thread in the kernel indefinitely. That, in turn, can expand a race window that would otherwise be difficult or impossible to hit, giving the attacker a chance to change the world in potentially surprising ways while the kernel is waiting.

This abuse of userfaultfd() is not just a theoretical possibility; various exploits using userfaultfd() have been disclosed over the years. The problem was deemed serious enough that some restrictions were added in 2020. If the vm/unprivileged_userfaultfd sysctl knob is set to zero (as it is on many distributions), then one of two conditions must apply for a userfaultfd() call to succeed: either the calling process has the CAP_SYS_PTRACE capability, or it supplies the UFFD_USER_MODE_ONLY flag to the system call. In the latter case, page faults encountered while running in the kernel will not be processed via the userfaultfd() mechanism, even if they occur within a registered area.
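From an application's point of view, the rule above suggests a simple fallback: ask for the full functionality first and, if the kernel refuses with EPERM, retry with UFFD_USER_MODE_ONLY. The following is a sketch under that assumption; `open_uffd()` is a hypothetical helper, and the fallback definition of the flag is only there for headers that predate 5.11.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1   /* assumed value, for pre-5.11 headers */
#endif

/* Returns a userfaultfd descriptor or -1 (with errno set). On success,
 * *user_mode_only is 1 if kernel-mode faults will NOT be handled. */
int open_uffd(int *user_mode_only)
{
    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);

    *user_mode_only = 0;
    if (fd >= 0 || errno != EPERM)
        return fd;              /* full access, or an unrelated failure */

    /* Refused: no CAP_SYS_PTRACE and the sysctl knob is zero.
     * Retry, accepting that only user-space faults are handled. */
    fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
    if (fd >= 0)
        *user_mode_only = 1;
    return fd;
}
```

A caller that depends on kernel-mode faults being handled should check `*user_mode_only` and refuse to continue when it is set, rather than discovering the restriction later through a failed system call.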

This change was merged for 5.11 at the end of 2020. It closes off this use of userfaultfd() by attackers, but it also makes the full functionality unavailable to legitimate (but unprivileged) processes. As Rasmussen notes in this patch from the series, that problem can be worked around by giving the process in question the CAP_SYS_PTRACE capability, but that enables a number of actions that have nothing to do with userfaultfd(). Specifically, it could allow the process to read data from or inject code into any other process on the system, which may be undesirable. It would be good, instead, to be able to enable the full userfaultfd() functionality for a process without granting it wider, unneeded privileges.

Rasmussen's solution is to create a new special file called /dev/userfaultfd that gives access to this functionality without the need to call userfaultfd(). One might think that opening this file would yield a file descriptor that acts just like the descriptor from userfaultfd(), but it is not quite as simple. Instead, the only thing that can be done with a /dev/userfaultfd file descriptor is to call ioctl() with the USERFAULTFD_IOC_NEW command; that will create a userfaultfd()-style file descriptor.

A file descriptor created in this way will behave like one from userfaultfd() in every way, with one exception: the handling of kernel faults will be allowed regardless of the calling process's privilege level or the setting of the global sysctl knob. The effect, in other words, is to circumvent the 2020 patch, making full userfaultfd() features available again to all processes. The catch is that a process must be able to open /dev/userfaultfd in the first place to gain access to the feature it provides. By setting the access permissions on this file, an administrator can control who is able to open it and use userfaultfd() in this way.

In other words, /dev/userfaultfd allows an administrator to give the ability to handle kernel faults to specific processes without the need to grant any other privileges. This patch series is in its third revision, and it would appear that the review comments received so far have been addressed. Barring some sort of surprise, this new tweak to the security policy surrounding userfaultfd() seems likely to find its way into the kernel during a near-future merge window.


/dev/userfaultfd

Posted Jun 13, 2022 15:33 UTC (Mon) by kilpatds (subscriber, #29339) [Link]

At a former employer whose product was based on FreeBSD, we changed the syscall layer to support things like this. Instead of blocking inside the syscall, we'd return an error code (EFAULT) like normal... in a situation like this, we'd also set a flag in a thread-context to maybe say what address it happened on. Then at the syscall boundary we'd check for those flags, and if <conditions>, react and block there. Once resumed, we'd restart the system call with the same args.

That does require that the system calls can be safely re-invoked, but that seems easier to program around than random userspace-induced pauses.

/dev/userfaultfd

Posted Jun 13, 2022 18:12 UTC (Mon) by josh (subscriber, #17465) [Link] (16 responses)

> One might think that opening this file would yield a file descriptor that acts just like the descriptor from userfaultfd(), but it is not quite as simple. Instead, the only thing that can be done with a /dev/userfaultfd file descriptor is to call ioctl() with the USERFAULTFD_IOC_NEW command; that will create a userfaultfd()-style file descriptor.

Is there some explanation for why the indirection through an ioctl?

Use of ioctl()

Posted Jun 13, 2022 18:21 UTC (Mon) by corbet (editor, #1) [Link] (11 responses)

There was no explanation in the patch. I'm guessing that it's because userfaultfd() takes a flags argument that they wanted to be able to support in the /dev interface too.

Use of ioctl()

Posted Jun 13, 2022 18:36 UTC (Mon) by josh (subscriber, #17465) [Link]

Ah, that makes sense.

Use of ioctl()

Posted Jun 13, 2022 21:04 UTC (Mon) by epa (subscriber, #39769) [Link] (9 responses)

Instead of overloading CAP_SYS_PTRACE why not make a new capability only for userfaultfd() with kernel trapping? Are capabilities a scarce resource? Or are they like system calls, essentially unlimited but avoided in favour of hacks like ioctl() for weird cultural reasons?

Use of ioctl()

Posted Jun 14, 2022 15:17 UTC (Tue) by k3ninho (subscriber, #50375) [Link]

I did wonder if the capability mechanism was a better match for sysadmins to enable access to the original syscall. Maybe I misunderstood the drive toward this approach.

K3n.

Use of ioctl()

Posted Jun 14, 2022 16:58 UTC (Tue) by nybble41 (subscriber, #55106) [Link] (5 responses)

> Are capabilities a scarce resource?

Yes, they're tracked in 32-bit bitfields so there can never be more than 32 unique capabilities.

Use of ioctl()

Posted Jun 16, 2022 4:47 UTC (Thu) by clay.sweetser@gmail.com (guest, #155278) [Link]

Huh, that seems... short-sighted. Could this be changed (expanded to, say, a larger data type, or a variable-length stay), or is it locked in place due to backwards compatibility?

Alternately, are there any similar permission mechanisms in Linux that are more suitable? This whole ioctl indirection seems to be an attempt to address, partially, what is really a larger problem.

Use of ioctl()

Posted Jun 16, 2022 9:04 UTC (Thu) by cortana (subscriber, #24596) [Link] (2 responses)

That's what I thought but I count 41 capabilities in capabilities(7)... how can this be?

Use of ioctl()

Posted Jun 16, 2022 11:59 UTC (Thu) by metan (subscriber, #74107) [Link] (1 responses)

Capabilities are actually stored as an array of structures. The structure is there since we have effective, permitted and inheritable sets of capabilities for each process. The array is there to get over the 32-bit limitation and was implemented in v2 of the ABI. Currently we are at v3 and the array size is 2, so we are limited to 64 bits and we use 41 bits of that. So technically it's possible to add new capabilities, but I do understand that it's not desirable to add a capability for a random syscall like this, since soon we would end up with thousands of capabilities and it would be a nightmare for any users of the interface.
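The array layout described in this comment is visible through the raw capget() system call. The sketch below checks whether the calling thread has a given capability in its effective set; `has_effective_cap()` is a hypothetical helper, and most programs would use libcap rather than the raw interface.

```c
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if the calling thread has cap in its effective set,
 * 0 if it does not, and -1 if capget() fails. */
int has_effective_cap(int cap)
{
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0,                       /* 0 means the calling thread */
    };
    /* v3 of the ABI uses two 32-bit words per set, giving 64 bits. */
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;

    /* CAP_TO_INDEX() picks the array element, CAP_TO_MASK() the bit. */
    return (data[CAP_TO_INDEX(cap)].effective & CAP_TO_MASK(cap)) != 0;
}
```

For example, `has_effective_cap(CAP_SYS_PTRACE)` tests bit 19 of the first array element, while any capability numbered 32 or higher lands in the second element.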

Use of ioctl()

Posted Jun 16, 2022 14:59 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

> Capabilities are actually stored as an array of structures.

Ah, I missed that the v2 ABI changed this to an array in 2007. I would imagine that it's still not that easy to extend it, since it appears to require a new version of the ABI to increase the array length, and any userspace tools and filesystems which work with capability sets would need to be updated as well. In a sense it's always going to be *possible* to extend the set of capabilities with ABI changes; the question is how much work is involved. Also, as long as the capability set remains a bitfield—and not, say, a sparse array—there will be significant overhead to tracking large numbers of potential capabilities, whether or not they're actually used by a given program.

Use of ioctl()

Posted Jun 16, 2022 22:37 UTC (Thu) by draco (subscriber, #1792) [Link]

Not anymore. There are currently 41 capabilities defined: <https://man7.org/linux/man-pages/man7/capabilities.7.html>

They hit the old limit and extended the limit.

Use of ioctl()

Posted Jun 14, 2022 18:50 UTC (Tue) by axelrasmussen (subscriber, #140005) [Link] (1 responses)

A new CAP_USERFAULTFD is what I proposed first. :) https://lore.kernel.org/lkml/686276b9-4530-2045-6bd8-170e...

This was rejected by the capability maintainers, based on the idea that the capability would have a very narrow use case, and they want to avoid adding many more capabilities (e.g. if we add a narrow one for this use case, we might equally come up with 100s of other narrow use cases).

Use of ioctl()

Posted Jun 17, 2022 20:30 UTC (Fri) by epa (subscriber, #39769) [Link]

Bah. I thought that was the whole point of capabilities, to make them as fine-grained as you need.

/dev/userfaultfd

Posted Jun 13, 2022 18:56 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (2 responses)

One thought I had is that it allows the fd to be passed across a socket (say, from systemd) instead of having to do some permission dropping dance after opening it.

/dev/userfaultfd

Posted Jun 14, 2022 7:47 UTC (Tue) by pbonzini (subscriber, #60935) [Link]

Yes, file descriptors are as close as you can get to capabilities, so this effectively does capability-based filtering of the userfaultfd system call.

/dev/userfaultfd

Posted Jun 14, 2022 10:03 UTC (Tue) by danpb (subscriber, #4831) [Link]

Yes, this is exactly what we could do for userfaultfd usage with QEMU. The privileged libvirt can open the FD on behalf of unprivileged QEMU and pass it across to QEMU either inherited across fork/exec at startup or at runtime over its UNIX monitor socket.
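Passing a descriptor over a UNIX socket, as suggested in this sub-thread, uses an SCM_RIGHTS control message. The following is a generic sketch of that mechanism and is not taken from libvirt or QEMU; `send_fd()` and `recv_fd()` are hypothetical helper names, and the descriptor being passed could be any open file, including one obtained from /dev/userfaultfd.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned for struct cmsghdr. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over a connected UNIX socket. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 0;              /* at least one byte of real data is sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    memset(&u, 0, sizeof(u));
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);

    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(). Returns the new fd or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    int fd = -1;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The receiver ends up with its own descriptor referring to the same open file, so a privileged process can open the device, create the userfaultfd descriptor, and hand it to an unprivileged one that could not have opened the device itself.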

/dev/userfaultfd

Posted Jun 13, 2022 20:39 UTC (Mon) by axelrasmussen (subscriber, #140005) [Link]

I don't have a strong opinion on whether it should be one way or the other. One reason is simply that how to get it working with an ioctl was obvious to me, whereas how to convince open() to return the right thing isn't immediately clear to me; it would require some research.

I guess another argument is, as pointed out, the ioctl is more extensible - we could add arguments, flags, whatever, if we want to fine tune the behavior in the future. Then again, I don't have immediate plans to do that.

I'm not sure these are very strong arguments. :) If people think it would be cleaner to do it the other way, I can always send a v4...

/dev/userfaultfd

Posted Oct 25, 2022 16:49 UTC (Tue) by bobozi (guest, #161810) [Link]

Is there any example of using /dev/userfaultfd and ioctl() to create a userfaultfd()-style file descriptor?
I updated my Ubuntu to 6.1-rc1, and used ioctl(dev_userfaultfd_fd, USERFAULTFD_IOC_NEW);
but I get the error: ‘USERFAULTFD_IOC_NEW’ undeclared.