A review of file descriptor memory safety in the kernel
On July 30, Al Viro sent a patch set to the linux-fsdevel mailing list with a comprehensive cover letter explaining his recent work on ensuring that the kernel's internal representation of file descriptors are used correctly in the kernel. File descriptors are ubiquitous; many system calls need to handle them. Viro's review identified a few existing bugs, and may prevent more in the future. He also had suggestions for ways to keep uses consistent throughout the kernel.
File descriptors are represented in user space as non-negative integers. In the kernel, these are actually indexes into the process's file-descriptor table, which stores entries of type struct file. Most system calls that take a file descriptor — with the exception of things such as dup2() that only touch the file-descriptor table, not the files themselves — need to refer to the associated struct file to determine how to handle the call. These structures are, like many things in the kernel, reference counted.
Therefore, a typical system call could increment the reference count for a file
object, do something with it, and then decrement the reference count again. This
would be functional, but, as Viro said in his cover letter,
"incrementing and decrementing refcount has
an overhead and for some applications it's non-negligible.
" So in practice,
the kernel only changes the reference count when it knows the descriptor table
is shared with other threads. If the table is not shared, the kernel knows that the
reference count can't change, and the file won't be freed, so there's no reason
to incur the overhead.
Unfortunately, this complicates handling the structures in question, because
it's always possible that a thread could be spawned
could end
between checking that the table is shared and the end of
the system call. Therefore, the kernel needs to store whether or not it skipped
incrementing the reference count. Some file operations also need to take a lock
on the file, so the kernel also needs to remember whether it took a lock at the
same time. Both of these pieces of information are stored alongside a pointer to
a struct file in
struct fd,
the primary subject of Viro's patch set.
Instances of struct fd are mostly created and destroyed by the special helper
functions:
fdget() and
fdput().
Viro's investigation touched on all 160 call sites throughout the kernel, including
a few places that don't go through the helpers, and caught several leaks and
use-after-free bugs.
"That kind of stuff
keeps cropping up; it would be nice to have as much as possible done
automatically.
"
Overall, Viro's report shows the kernel to be in a good state. Most of the places where struct fd is used have no problem. One of the exceptions is overlayfs, which uses the fd structure not only to indicate "non-reference-counted file", "reference counted file", or "empty", but also various errors. struct fd is not meant to store errors, and overlayfs's use gets in the way of some of Viro's follow-up work. Since that is the only code that treats the structure like that, Viro suggested that overlayfs might want to switch to a separate type that has better support for representing errors.
Almost everywhere else, the kernel acquires and releases a struct fd within a single function, which makes it easy to audit the code to be sure that the use is correct. But, manually checking so many uses is tedious; Viro ended his cover letter by exploring ways that the compiler could potentially help with that check. The kernel already uses the __cleanup attribute to manage function-local values in several places, so it would be ideal if the attribute could be used in this case as well. Unfortunately, doing so is not trivial — at least, not if one also wants good code generation.
Specifically, with the current representation of struct fd, the compilers used to build the kernel are not smart enough to tell that calling fdput() with an empty struct fd (one that does not point to a struct file) is a no-op and can just be elided. This adds a bunch of unnecessary code — mostly to error paths, but it can still be a performance hit. Using __cleanup makes this problem worse, because it will add cleanup code to branches where a human could tell it was unnecessary, but the compiler cannot. Viro said that Linus Torvalds had suggested making the struct a single pointer-sized integer, and packing the flags into the lowest bits. This representation would allow the compiler to determine when calls to fdput() can be elided with greater accuracy.
That isn't the only issue standing in the way of using __cleanup,
however. The attribute interacts badly with goto statements; in version 12 of GCC and earlier,
using a goto statement to jump past the initialization of an annotated variable still
results in the the cleanup code getting called, but it refers to the
uninitialized variable, which usually has undesirable effects. Clang does catch this
problem, "but we can neither make it mandatory (not all
targets are supported, to start with) nor bump the minimal gcc version
from 5.1 all the way to 13.
"
So, while most functions can use __cleanup, some require a more manual approach. The bulk of Viro's patch set consists of patches for different subsystems that convert uses of struct fd to use __cleanup where possible. Viro used Clang to verify the parts of the code that support building that way, and manually inspected the rest.
The review
Reviewers were largely happy with the patch set. Christian Brauner
said the patch set as a whole "looks good and pretty simple...nothing
really surprising
". As with any large patch set,
there were a few places where the patches collided with other in-progress
changes, but that was resolved without fuss.
Andrii Nakryiko and the other BPF developers had some comments about Viro's approach in the BPF code, suggesting that a small refactoring could make the necessary change less intrusive. Nakryiko ended up putting together a separate patch set to go via the bpf-next tree with the changes.
Overall, Viro's patch set is not terribly exciting or controversial — which is a good thing. While LWN often reports on the most contentious parts of the kernel-development process, it's important to remember that the vast majority of changes are like this one (although perhaps less broad in scope). Once the patch set is merged, users will be able to enjoy having fewer use-after-free vulnerabilities in the kernel, and kernel developers will be able to enjoy less complicated and error-prone code around file access.
