splice() and the ghost of set_fs()

By Jonathan Corbet
May 26, 2022

The normal rule of kernel development is that the creation of user-space regressions is not allowed; a patch that breaks a previously working application must be either fixed or reverted. There are exceptions, though, including a 5.10 patch that has been turning up regressions ever since. The story that emerges here shows what can happen when the goals of stability, avoiding security problems, and code cleanup run into conflict.

The set_fs() function was added to the kernel early in its history; it was not in the initial 0.01 release, but was added before the 0.10 release in late 1991. Normally, kernel code that is intended to access user-space memory will generate an error if it attempts to access kernel space instead; this restriction prevents, for example, attempts by an attacker to access kernel memory via system calls. A call to set_fs(KERNEL_DS) can be used to lift the restriction when the need arises; a common use case for set_fs() is to be able to perform file I/O from within the kernel. Calling set_fs(USER_DS) puts the restriction back.

The problem with set_fs() is that it turns out to be easy to forget the second set_fs() call to restore the protection of kernel space, leading directly to the "total compromise" scenario that kernel developers will normally take some pains to avoid. Numerous such bugs have been fixed over the years, but it had long been clear that the real solution was to just get rid of set_fs() entirely and adopt safer ways of accessing kernel memory when needed.

Developers (and Christoph Hellwig in particular) got more serious about this objective in 2020 and made a determined push to eliminate set_fs() entirely. Much of this work went into 5.10, though the final bits of the set_fs() infrastructure were only removed in 5.18. Back in 2020, though, one question that provoked some discussion was what should be done about splice().

The splice() system call will connect an open file descriptor to a pipe, then move data between the two for as long as the data stream lasts. This movement happens entirely within the kernel, potentially eliminating the need for large numbers of system calls; in some settings, it can provide a significant performance improvement. By its nature, splice() often has to move data to and from buffers that are in kernel space; to make that possible, it used set_fs().

Hellwig duly came up with a new implementation that would keep splice() working in the absence of set_fs(), but Linus Torvalds rejected it, saying that he didn't like the "complexity and messiness" of the implementation. But he also made it clear that he didn't feel the need to guarantee that splice() would keep working at all; he felt that making splice() work by default on most file types led to a number of security issues. Later in 2020, for example, he said:

I'd rather limit splice (and kernel_read too, for that matter) as much as possible. It was a mistake originally to allow it everywhere, and it's come back to bite us.
So I'd rather have people notice these odd corner cases and get them fixed one by one than just say "anything goes".

So the patches that went into 5.10 ended up breaking splice() for any file type that did not have explicit support for the new way of doing things; the idea was that the important cases would be noticed and fixed over time. That has indeed happened; if one looks for patches committed as explicit fixes to the disabling of splice() support, one finds fixes for the AFS filesystem, the 9p filesystem, the orangefs filesystem, /proc/mountinfo, the TTY subsystem, kernfs, sendfile(), the nilfs2 filesystem, and the JFFS2 filesystem.

Most recently, Jens Axboe reported that splice() no longer worked on /dev/random or /dev/urandom; he included a patch to fix the problem as well. These patches were later reworked by random-number-generator maintainer Jason Donenfeld and were applied to the mainline during the 5.19 merge window. Along the way, Donenfeld observed that the necessary changes resulted in a performance regression of about 3% when reading from /dev/urandom. That led him to ask whether the fix was something that was needed at all; after some discussion, Axboe gave him the lecture on regressions:

If you have an application that is written using eg splice from /dev/urandom, then it should indeed be safe to expect that it will indeed continue working. If we have one core tenet in the kernel it's that you should ALWAYS be able to upgrade your kernel and not have any breakage in terms of userspace ABI. Obviously that can happen sometimes, but I think this one is exactly the poster child of breakage that should NOT happen. We took away a feature that someone depended on.

That is the sort of breakage that did indeed happen but, in this case, a change was made knowing that this kind of problem would result. Hellwig said in response to Axboe's patch set that "compared to my initial fears the fallout actually isn't that bad", but a perusal of the above list of fixes might lead one to a different conclusion.

The removal of set_fs() is, in many ways, a model for what the kernel development process can do. A fundamental piece of low-level structure that had been deeply wired into the kernel since the beginning was replaced with a much safer alternative without breaking the project's pace of a stable release every nine or ten weeks. The steady stream of regressions resulting from this change, though, is not what the project sets out to do — and it seems certain that this particular gift has not yet stopped giving.

The decision to take this path was driven by a fear of security problems, based on the past history of the splice() system call. If those fears are still justified (and they might well be; consider, for example, that splice() was a part of the "Dirty Pipe" vulnerability reported earlier this year), then refusing to make all existing splice() implementations just work without set_fs() may have prevented far worse regressions than the ones we have seen. Having to fix a filesystem is annoying; having to endure yet another security drill for a branded vulnerability with a silly name is rather more so.

There is no way of knowing whether that is how things would have gone in this case. But it is true that this type of episode makes the kernel's "no regressions" rule look a bit more like just a guideline. It does not take too many of those to create breakage to the project's reputation that is hard to splice back together.

Index entries for this article
Kernel	Development model/Regressions
Kernel	set_fs()

splice() and the ghost of set_fs()

Posted May 26, 2022 21:09 UTC (Thu) by zx2c4 (subscriber, #82519) [Link]

The ensuing performance discussion is somewhat interesting. Currently it's taking place across a few threads. My tracking of it is:

- I noticed the slow down here:
https://lore.kernel.org/lkml/Yoey+FOYO69lS5qP@zx2c4.com/
- Jens confirmed it's around 3%:
https://lore.kernel.org/lkml/0a6ed6b9-0917-0d83-5c45-70ff...
- Relatedly, I had proposed doing the same thing to /dev/zero: https://lore.kernel.org/lkml/20220520135030.166831-1-Jaso...
- Jens liked the idea, but Al pointed out the
performance issues, and later started figuring out why:
https://lore.kernel.org/lkml/Yokmu7bQpg70Bp8R@zeniv-ca.li...
- Al references resurrecting a particularly relevent older thread on
fsdevel:
https://lore.kernel.org/linux-fsdevel/Yokl+uHTVWFxoQGn@ze...
- This thread is now a many messages deep. That's where things are at now.
- Looks like Al's got some patches he's playing with in
https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs....

Hopefully those close the performance gap and then all drivers get faster.

splice() and the ghost of set_fs()

Posted May 26, 2022 22:23 UTC (Thu) by josh (subscriber, #17465) [Link] (5 responses)

splice has to keep working; does it have to keep working *fast*? Could it become a wrapper around sendfile-like semantics, and then just have specific cases where it can go faster?

splice() and the ghost of set_fs()

Posted May 26, 2022 23:55 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (4 responses)

I think that depends on the use case. When fd_in is a pipe, splice should be quite fast, because the man page says it just copies pointers to individual pages. If you use a naive sendfile-like implementation, suddenly you're making real copies. Or at least, that's what I was able to figure out from the man pages, anyway.

Side note: Why are there so many syscalls that do almost, but not quite, entirely the same thing in this space? We've also got copy_file_range(2), which seems to be the same as splice(2) but both fds must be normal files. And then there's vmsplice(2), which appears to be exactly the same as read(2)/write(2), but with an overly-complicated API, unless you pass SPLICE_F_GIFT, which looks to be the "I'm doing something ridiculous, don't judge me" flag. And I imagine there's also some io_uring equivalent to this madness, too. Why is there not a simple, all-purpose "move data from here to here and don't bother me about the details, just do whatever's fastest or most reasonable" syscall?

* splice isn't it, because splice requires one of the fds to be a pipe.
* copy_file_range isn't it, because it requires *both* of the fds to be normal files.
* sendfile isn't it, because it's missing an offset argument for the output file, and the input file must not be a socket.
* io_uring isn't it, because it's like five syscalls and a userspace buffer, not one fire-and-forget syscall.

splice() and the ghost of set_fs()

Posted May 27, 2022 13:09 UTC (Fri) by Sesse (subscriber, #53779) [Link] (3 responses)

vmsplice() is for sending the same data multiple times, I believe? E.g., pre-canned HTTP headers or small responses. vmsplice() once to get it from userspace into the kernel, then you can splice multiple times with copy.

splice() and the ghost of set_fs()

Posted May 27, 2022 17:42 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (2 responses)

That still doesn't explain why you need the silly GIFT flag. Why can't the kernel just mark the offending pages as COW, like fork(2) does? You could do that without a special flag, because it should be transparent to userspace. Indeed, you can do that even for write(2), if you really want to.

splice() and the ghost of set_fs()

Posted May 27, 2022 17:44 UTC (Fri) by Sesse (subscriber, #53779) [Link] (1 responses)

I believe the gift flag is seen as a mistake in retrospect.

FreeBSD has CoW on send(), I believe, but of course that means you need to go through a rather expensive page fault when/if the data changes.

splice() and the ghost of set_fs()

Posted May 28, 2022 2:37 UTC (Sat) by wahern (subscriber, #37304) [Link]

This was exactly Linus' original reasoning wrt vmsplice:

On Thu, 20 Apr 2006, Piet Delaney wrote:
> 
> What about marking the pages Read-Only while it's being used by the
> kernel

NO!

That's a huge mistake, and anybody that does it that way (FreeBSD) is totally incompetent.

[...]

That cost is _bigger_ than the cost of just copying the page in the first place.

Source: https://lkml.org/lkml/2006/4/20/310

See also Linus' justifications of splice and tee earlier in that thread. vmsplice is also briefly mentioned, and it's implicit from the context that much of the net value of vmsplice comes from combining with tee, as you mentioned earlier.

splice() and the ghost of set_fs()

Posted May 27, 2022 1:03 UTC (Fri) by gerdesj (subscriber, #5446) [Link]

"But it is true that this type of episode makes the kernel's "no regressions" rule look a bit more like just a guideline."

A bit like the Constitution for the UKoGBnNI - https://en.wikipedia.org/wiki/Constitution_of_the_United_... - "Unlike in most countries, no attempt has been made to codify such arrangements into a single document."

... and yet somehow we struggle along.

It might look a bit daft to conflate the Linux kernel's governance with the UK's legal system but I think it is quite instructional. One of our cherished principles is the idea that you should be able to "quietly enjoy" your property - I can't remember the exact term. I think a coal mine making a racket caused the formative judgment.

So, the no regressions thing can be seen in similar terms: Don't break people's code.

However we have to be practical and sometimes stuff has to change.

splice() and the ghost of set_fs()

Posted May 27, 2022 18:02 UTC (Fri) by SLi (subscriber, #53131) [Link] (2 responses)

Is this something that is a problem only because RAII is not used (and couldn't you _really_ get something like that working for the kernel), or do I misunderstand the original problem?

It seems like a silly limitation to me; as far as I can tell, it's 100% knowable at the time of the initial set_fs() call that you want to reset it exactly when the current function (or block) exits. It's not even something that would require producing code that is slower than the correct C code.

splice() and the ghost of set_fs()

Posted May 28, 2022 3:56 UTC (Sat) by willy (subscriber, #9762) [Link] (1 responses)

That's not the problem.

The problem is twofold. One is that while set_fs() is active, various security measures (like SMEP and SMAP) are defeated. The other is that (on some architectures and eg on a 4GB/4GB split x86-32), you may not actually be able to access userspace because accessing userspace actually accesses kernel space. On x86-64, you can tell from the high bits of the pointer whether it's userspace or kernel space, but that's not true eg on SPARC or PARISC.

splice() and the ghost of set_fs()

Posted Jun 9, 2022 17:06 UTC (Thu) by stem (guest, #83810) [Link]

> One is that while set_fs() is active, various security measures (like SMEP and SMAP) are defeated.
Are you sure?
afaik, set_fs() has nothing to do with SM*P, it was abused wrt access_ok() - copy_*_user().