A farewell to set_fs()?
The original role of set_fs() was to set the x86 processor's FS segment register which, in the early days, was used to control the range of virtual addresses that could be accessed by unprivileged code. The kernel has, of course, long since stopped using x86 segments this way. In current kernels, set_fs() works by setting a global variable called addr_limit, but the intended functionality is the same: unprivileged code is only allowed to dereference addresses that are below addr_limit. The kernel's access_ok() function, used to validate user-space accesses throughout the kernel, is a simple check against addr_limit, with the rest of the protection being handled by the processor's memory-management unit.
The addr_limit variable, thus, marks the partition between user and kernel space. One might think that such a limit would be fixed, with good reasons for changing it being few and far between. As it happens, there are nearly 400 set_fs() calls in the kernel. Usually, such calls are made to allow code that is normally restricted to accessing user-space memory to operate on a range of kernel memory instead. In 0.10, for example, it was added so that the exec() system call could use the normal filesystem I/O routines to read an executable image into memory that was not yet part of the calling program's address space.
The usual pattern for use of set_fs() looks like this code snippet from the splice() system call:
old_fs = get_fs();
set_fs(get_ds());
res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
set_fs(old_fs);
This sequence temporarily raises addr_limit so that vfs_readv(), which is normally restricted to reading data into user-space memory, can read data into a kernel-space pipe buffer.
In 2010, it was discovered that, if the kernel could be made to oops between the two set_fs() calls, the second call restoring the address limit would never be made; that left kernel data open to being overwritten by user space. Hilarity, as they say, ensued in the form of CVE-2010-4258. That problem is long since fixed. In late 2016, though, an Android bug was reported for an LG touchscreen driver; there was a way to cause that driver to raise addr_limit and return to user space, once again leaving the kernel open to exploitation.
set_fs() is clearly the sort of interface that can easily create severe security bugs. It is also a tempting shortcut that tends to find its way into code of questionable quality such as out-of-tree drivers. In an attempt to harden the system against set_fs() bugs, Thomas Garnier posted a simple patch changing the system-call code so that it would check addr_limit before returning to user space. If it ever finds an incorrect value, it causes a system panic — a severe response, but probably better than allowing an exploit to occur.
Nobody disagreed with the goal of this patch, but it ran into a problem that is familiar to security developers: its impact on performance. As Ingo Molnar pointed out, the patch adds several instructions to the system-call path, which is one of the most performance-sensitive parts of the kernel. Adding overhead to system calls will slow down everything the kernel does; when one considers how many Linux machines would be executing this code on every system call, one begins to think that its carbon footprint might rival that of a small country. That is not a cost to be paid lightly.
Molnar suggested adding some sort of static analysis to the kernel build
system instead. The standard pattern of set_fs() calls should be
amenable to some sort of static analysis, he said, but Kees Cook argued that the problem was not quite so
simple and that the cost of the patch was worth paying. "Until we
can eliminate set_fs(), we need to add this
check
", he said.
As it happens, some other developers were already considering removing set_fs(), which has, arguably, hung around for far longer than it really should have. Christoph Hellwig suggested removing all calls outside of the core filesystem and architecture code; Andy Lutomirski went one step further and said they should all go. Without set_fs(), the kernel would be more secure, and the code that checks user-space memory accesses could become that much simpler.
Removing set_fs() depends on replacing those calls with a better alternative, of course. Many set_fs() calls exist to enable I/O to kernel-space memory; it should be possible to replace the bulk of those using the iov_iter interface. Hellwig has already started doing this replacement.
Another common pattern occurs in compatibility code where, for example, a structure passed to an ioctl() call from a 32-bit user-space process is converted to the 64-bit equivalent in kernel space, then passed to the regular ioctl() implementation. See do_compat_ioctl() in the media subsystem for an example. In such cases, it's just a matter of splitting that implementation into two pieces: one that fetches the argument from user space, and one that actually performs the desired action.
Other set_fs() calls will have to be dealt with in other ways.
But it would appear that this ball is now rolling with a certain amount of
momentum. Given the benefits of removing set_fs(), it would not
be surprising to see much of this work merged for 4.13, with the task
completed not long thereafter. It will be the end of a longstanding
traditional kernel-code pattern, but it's doubtful that many developers
will mourn its passing.
| Index entries for this article | |
|---|---|
| Kernel | iov_iter |
| Kernel | set_fs() |
| Security | Linux kernel |
