Saying goodbye to set_fs()

By Jonathan Corbet
September 24, 2020

The set_fs() function dates back to the earliest days of the Linux kernel; it is a key part of the machinery that keeps user-space and kernel-space memory separated from each other. It is also easy to misuse and has been the source of various security problems over the years; kernel developers have long wanted to be rid of it. They won't completely get their wish in the 5.10 kernel but, as the result of work that has been quietly progressing for several months, the end of set_fs() will be easily visible at that point.

This 2017 article describes set_fs() and its history in some detail. The short version is that set_fs() sets the location of the boundary between the user-space portion of the address space and the kernel's part. Any virtual address that is below the boundary set by the last set_fs() call on behalf of a given process is fair game for that process to access, though the memory permissions stored in the page tables still apply. Anything above that limit belongs to the kernel and is out of bounds.

Normally, that boundary should be firmly fixed in place. When the need to move it arises, the reason is usually the same: some kernel subsystem needs to invoke a function that is intended to access user-space data, but on a kernel-space address. Think, for example, of the simple task of reading the contents of a file into a memory buffer; the read() system call will do that, but it also performs all of the usual access checks, meaning that it will refuse to read into a kernel-space buffer. If a kernel subsystem must perform such a read, it first calls set_fs() to disable those checks; if all goes well, it remembers to restore the old boundary with another set_fs() call when the work is done.
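
The pattern can be illustrated with a small user-space model. What follows is a sketch only, not kernel code: the "address space" is a plain array, and every model_* name and the two limit constants are hypothetical stand-ins for the kernel's get_fs(), set_fs(), access_ok(), copy_to_user(), USER_DS, and KERNEL_DS.

```c
/* User-space model of the set_fs() pattern; this is NOT kernel code. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MEM_SIZE     0x2000u
#define USER_LIMIT   0x1000u   /* models USER_DS: only [0, 0x1000) is "user space" */
#define KERNEL_LIMIT MEM_SIZE  /* models KERNEL_DS: every address passes the check */

static unsigned char mem[MEM_SIZE];     /* the simulated address space */
static size_t addr_limit = USER_LIMIT;  /* the boundary that set_fs() moves */

size_t model_get_fs(void)
{
    return addr_limit;
}

void model_set_fs(size_t limit)
{
    assert(limit <= MEM_SIZE);
    addr_limit = limit;
}

/* Loosely like access_ok(): the whole range must lie below the limit. */
int model_access_ok(size_t addr, size_t len)
{
    return len <= addr_limit && addr <= addr_limit - len;
}

/* Loosely like copy_to_user(): 0 on success, -1 if the check fails. */
int model_copy_to_user(size_t dst, const void *src, size_t len)
{
    if (!model_access_ok(dst, len))
        return -1;
    memcpy(&mem[dst], src, len);
    return 0;
}

/* A "kernel subsystem" filling its own buffer the old way. */
int model_kernel_read(size_t kernel_dst, const void *src, size_t len)
{
    size_t old_fs = model_get_fs();
    int ret;

    model_set_fs(KERNEL_LIMIT);  /* disable the user-space check */
    ret = model_copy_to_user(kernel_dst, src, len);
    model_set_fs(old_fs);        /* restore the old boundary */
    return ret;
}
```

In this model, a copy to address 0x1800 fails under the default limit and succeeds inside model_kernel_read(). If the final model_set_fs(old_fs) call were omitted, every later access check would pass as well, which is the kind of mistake that makes the interface easy to misuse.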

Naturally, history has proved that all does not always go well. It's thus not surprising that the development community has wanted to rid itself of set_fs() for many years. It's also unsurprising that this hasn't happened, though. The kernel project does not lack for developers, but there is always a shortage of people who are willing and able to do this sort of deep infrastructural work; it tends to not feature highly in any company's marketing plan. So the task of removing set_fs() has languished for years.

Recently, though, Christoph Hellwig has stepped up to this task and the kernel-wide cleaning-up that is required to get it done.

For example, one might be surprised to find set_fs() calls in the core networking code, and even more surprised to learn that they were added in 2019, during the 5.3 development cycle. The patch in question added the ability for BPF programs to invoke the setsockopt() and getsockopt() system calls. Those calls are normally invoked from user space, so they apply the usual access checks on any parameters passed to them; calls originating from BPF programs, though, will supply buffers in kernel space. Putting in a call to set_fs() in that case allowed those calls to work without further modification.

Hellwig's plan for taking that set_fs() call back out involved the creation of a new sockptr_t type that can hold an address pointing into either kernel or user space:

    typedef struct {
        union {
            void        *kernel;
            void __user *user;
        };
        bool            is_kernel : 1;
    } sockptr_t;

Code that initializes a sockptr_t variable must specify whether the address is meant to refer to kernel or user space; a set of helper functions can then be used to copy data to and from that address without needing to worry further about where the destination buffer is — or to call set_fs(). As it turns out, setsockopt() and getsockopt() offer a lot of different options, so a long patch series was required to convert the relevant functions to sockptr_t addresses. At the end of the series, the set_fs() calls were removed. This series entered the mainline during the 5.9 merge window.
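
A rough user-space model of how such helpers might look follows. The helper names mirror those in the kernel's sockptr.h, but the bodies are simplified assumptions; model_copy_from_user() is a hypothetical stand-in that merely counts calls where the real copy_from_user() would validate the address, and model_set_option() and the demo functions are invented for illustration.

```c
/* User-space model of sockptr_t; NOT the kernel header itself. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define __user  /* an annotation for static checkers in the kernel; empty here */

typedef struct {
    union {
        void        *kernel;
        void __user *user;
    };
    bool            is_kernel : 1;
} sockptr_t;

int user_copies;  /* counts trips through the checked user-space path */

/* Hypothetical stand-in for copy_from_user(); returns bytes NOT copied. */
unsigned long model_copy_from_user(void *dst, const void __user *src, size_t n)
{
    user_copies++;        /* the real function checks the address here */
    memcpy(dst, src, n);
    return 0;
}

/* The initializer is where the caller states which space the address is in. */
sockptr_t KERNEL_SOCKPTR(void *p)
{
    return (sockptr_t){ .kernel = p, .is_kernel = true };
}

sockptr_t USER_SOCKPTR(void __user *p)
{
    return (sockptr_t){ .user = p };
}

/* Returns 0 on success; picks the copy routine from the is_kernel flag. */
int copy_from_sockptr(void *dst, sockptr_t src, size_t size)
{
    if (!src.is_kernel)
        return model_copy_from_user(dst, src.user, size) ? -1 : 0;
    memcpy(dst, src.kernel, size);
    return 0;
}

/* Invented setsockopt()-style handler: it never asks where optval lives. */
int model_set_option(int *option, sockptr_t optval, size_t optlen)
{
    int val;

    assert(option != NULL);
    if (optlen < sizeof(val))
        return -1;
    if (copy_from_sockptr(&val, optval, sizeof(val)))
        return -1;
    *option = val;
    return 0;
}

/* A caller arriving from user space. */
int demo_from_user_space(void)
{
    int option = 0, value = 7;

    model_set_option(&option, USER_SOCKPTR(&value), sizeof(value));
    return option;
}

/* A caller, such as a BPF program, whose buffer is in kernel space. */
int demo_from_bpf(void)
{
    int option = 0, value = 9;

    model_set_option(&option, KERNEL_SOCKPTR(&value), sizeof(value));
    return option;
}
```

The point of the design is that the decision about which address space is involved is made once, where the pointer is wrapped, rather than globally and temporarily by moving the boundary.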

Something that was not merged was an earlier version of this idea, which was meant to be used throughout the kernel. Hellwig proposed the creation of a "universal pointer" type (uptr) that functioned like sockptr_t; it was accompanied by a pair of new file_operations methods that would work with those pointers. Then, any kernel subsystem that might need to perform I/O on both kernel-space and user-space pointers could be converted to use these new methods rather than calling set_fs().

Linus Torvalds vetoed that idea; he objected to the addition of the new type and file_operations methods, which he saw as temporary and unnecessary workarounds for the real problem. If somebody was going to bother to convert ordinary read() and write() calls to the new read_uptr() and write_uptr(), he asked, why wouldn't they just convert to the existing read_iter() and write_iter() methods instead? Those methods already handle the different address spaces just fine (through yet another union in struct iov_iter that tracks which type of address is in use); indeed, much of the work to remove set_fs() calls in various parts of the kernel has involved switching to iov_iter. So the uptr type fell by the wayside, but the sockptr_t was able to overcome Torvalds's opposition and was merged.
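
The idea behind that union can be sketched in user space. This is a deliberately reduced model, not the kernel's struct iov_iter: it handles a single buffer rather than a vector of segments, and all of the model_* and demo_* names are hypothetical.

```c
/* Reduced user-space model of an iov_iter-style iterator; NOT kernel code. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum model_iter_type { MODEL_ITER_USER, MODEL_ITER_KERNEL };

struct model_iter {
    enum model_iter_type type;  /* which member of the union is in use */
    union {
        void *ubuf;             /* destination in "user space" */
        void *kbuf;             /* destination in "kernel space" */
    };
    size_t count;               /* bytes of space remaining */
};

int model_user_checks;  /* counts copies that took the checked path */
char demo_buf[16];      /* scratch buffer for the demo functions */

/* Loosely like copy_to_iter(): returns the number of bytes copied. */
size_t model_copy_to_iter(const void *src, size_t len, struct model_iter *it)
{
    assert(it != NULL);
    if (len > it->count)
        len = it->count;
    if (it->type == MODEL_ITER_USER) {
        model_user_checks++;    /* a real kernel validates the range here */
        memcpy(it->ubuf, src, len);
        it->ubuf = (char *)it->ubuf + len;
    } else {
        memcpy(it->kbuf, src, len);
        it->kbuf = (char *)it->kbuf + len;
    }
    it->count -= len;
    return len;
}

/* A read_iter()-style method: one implementation serves both callers. */
size_t model_read_iter(struct model_iter *to)
{
    static const char data[] = "hello";

    return model_copy_to_iter(data, sizeof(data), to);
}

/* In-kernel caller: tags its buffer as kernel memory, no set_fs() needed. */
size_t demo_kernel_read(char *buf, size_t len)
{
    struct model_iter it = { .type = MODEL_ITER_KERNEL, .kbuf = buf, .count = len };

    return model_read_iter(&it);
}

/* System-call path: the same method, with the buffer tagged as user memory. */
size_t demo_user_read(char *buf, size_t len)
{
    struct model_iter it = { .type = MODEL_ITER_USER, .ubuf = buf, .count = len };

    return model_read_iter(&it);
}
```

As with sockptr_t, the address-space information travels with the buffer description, so the method doing the I/O needs no global state to tell the two cases apart.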

Then, there is the set_fs() call that isn't actually there. In current kernels, the boundary between kernel and user space is established fairly late in the boot process (but before the init process is started). Before that happens, kernel functions that operate on user-space pointers will happily use kernel-space pointers instead; parts of the initialization code (dealing with the initial ramdisk, for example) depend on this behavior. Eliminating that implicit set_fs() call required another patch series creating a set of special helpers that is discarded once the bootstrap process is complete. This series, too, was merged for the 5.9 release.

The final step, for the x86 and PowerPC architectures at least, is this patch series removing set_fs() entirely. Getting there requires tidying up a number of loose ends. It adds iov_iter support to the /proc filesystem, for example. This patch converts kernel_read() and kernel_write() (yet another way to perform I/O on kernel-space buffers) to iov_iter, removing the set_fs() calls previously used there. The splice() implementation is changed in a way that might break existing users: it simply no longer works if the data source is a device that does not support the splice_read() method. Hellwig said that the affected users all appear to have working fallbacks in place, but that specific devices can gain splice_read() methods if the need turns out to exist.

After a few more patches to remove the last uses of set_fs() from the x86 and PowerPC architectures, support for set_fs() itself is disabled and the task is complete. These patches are currently in linux-next, and thus should be merged for the 5.10 release. Hellwig has also posted a patch set for RISC-V, and Arnd Bergmann has a patch set for Arm, but those have not yet been applied. Hellwig intends to work through the remaining architectures, removing set_fs() from each.

The patches described above are only a small portion of the effort that has gone into making it possible to finally get rid of set_fs(). The end result of all this work is the near elimination of a kernel interface that has been deemed dangerous for almost as long as it has existed — and it has been around for a long time. It is an example of a form of kernel development that tends not to create headlines, but which quietly keeps the kernel maintainable in the long term. Tasks like this often suffer from a lack of attention, but they do tend to get done in the long run, which is a good thing; even after nearly 30 years, there is a lot of cleaning up still to be done in the kernel.

Why a typedef?

Posted Sep 24, 2020 18:33 UTC (Thu) by leromarinvit (subscriber, #56850) [Link] (5 responses)

It's been a long time since I've written any kernel code, but I remember reading in the kernel coding style document (or some other list of "dos and don'ts") that one shouldn't use typedefs for structs. The reasoning being, if something is a struct, that fact should be clearly visible to all users instead of being hidden behind an opaque typedef, to avoid being bitten by e.g. a function parameter taking more than one register, the access not being atomic, and similar gotchas.

So why use one for this new type? Has the general consensus changed, or is there an advantage in this specific situation? Is the idea that is_kernel will always be known at compile time and the compiler will optimize it away, leaving just the pointer? Or am I missing something?

Why a typedef?

Posted Sep 25, 2020 0:06 UTC (Fri) by nevets (subscriber, #11875) [Link] (1 responses)

I believe anything that is a typedef of a structure ends with "_t" to let you know it's a structure.

Why a typedef?

Posted Sep 25, 2020 2:04 UTC (Fri) by nivedita76 (subscriber, #121790) [Link]

There are lots of non-structure typedefs that also end in _t.

Why a typedef?

Posted Sep 25, 2020 12:38 UTC (Fri) by richiejp (guest, #111135) [Link] (2 responses)

I guess in this case they want most code to treat it as an opaque type and not to mess with the internal struct fields.

From the docs:

Lots of people think that typedefs ``help readability``. Not so. They are
useful only for:

(a) totally opaque objects (where the typedef is actively used to **hide**
what the object is).

Example: ``pte_t`` etc. opaque objects that you can only access using
the proper accessor functions.

Why a typedef?

Posted Sep 25, 2020 15:06 UTC (Fri) by segher (subscriber, #109337) [Link]

And (b) types that are just hard to read and type, like void (*)(void)

Why a typedef?

Posted Sep 29, 2020 7:56 UTC (Tue) by k8to (guest, #15413) [Link]

I never much liked the habit of reflexively typedef-ing every struct, etc., but I do it anyway to minimize impedance mismatch with expectations.

Saying goodbye to set_fs()

Posted Nov 2, 2020 22:51 UTC (Mon) by nix (subscriber, #2304) [Link]

This one did create headlines, in the Register. Shame the article was terrible. If you believe El Reg's article, set_fs is a "defunct addressing artifact" that has "been made redundant by chipmakers", because apparently the author didn't realise that this article isn't talking about removing the %fs register, that set_fs is used on non-x86-compatible platforms, that it hasn't used %fs even on x86 for absolutely ages, and that the %fs register is in the userspace ABI and is not going away. Apparently set_fs removal also has to do with 286s, which is a neat trick given that Linux never ran on 286s and 286s don't even have %fs and %gs registers.

It's fun to follow other trade press now and then: it reminds me of how privileged we are to have something of the sheer quality of LWN.