splice() and the ghost of set_fs()
The set_fs() function was added to the kernel early in its history; it was not in the initial 0.01 release, but was added before the 0.10 release in late 1991. Normally, kernel code that is intended to access user-space memory will generate an error if it attempts to access kernel space instead; this restriction prevents, for example, attempts by an attacker to access kernel memory via system calls. A call to set_fs(KERNEL_DS) can be used to lift the restriction when the need arises; a common use case for set_fs() is to be able to perform file I/O from within the kernel. Calling set_fs(USER_DS) puts the restriction back.
The problem with set_fs() is that it turns out to be easy to forget the second set_fs() call to restore the protection of kernel space, leading directly to the "total compromise" scenario that kernel developers will normally take some pains to avoid. Numerous such bugs have been fixed over the years, but it had long been clear that the real solution was to just get rid of set_fs() entirely and adopt safer ways of accessing kernel memory when needed.
Developers (and Christoph Hellwig in particular) got more serious about this objective in 2020 and made a determined push to eliminate set_fs() entirely. Much of this work went into 5.10, though the final bits of the set_fs() infrastructure were only removed in 5.18. Back in 2020, though, one question that provoked some discussion was what should be done about splice().
The splice() system call will connect an open file descriptor to a pipe, then move data between the two for as long as the data stream lasts. This movement happens entirely within the kernel, potentially eliminating the need for large numbers of system calls; in some settings, it can provide a significant performance improvement. By its nature, splice() often has to move data to and from buffers that are in kernel space; to make that possible, it used set_fs().
Hellwig duly came up with a new implementation that would keep
splice() working in the absence of set_fs(), but Linus
Torvalds rejected
it, saying that he didn't like the "complexity and
messiness
" of the implementation. But he also made it clear that he
didn't feel the need to guarantee that splice() would keep working
at all; he felt that making splice() work by default on most file
types led to a number of security issues. Later in 2020, for example, he said:
I'd rather limit splice (and kernel_read too, for that matter) as much as possible. It was a mistake originally to allow it everywhere, and it's come back to bite us.So I'd rather have people notice these odd corner cases and get them fixed one by one than just say "anything goes".
So the patches that went into 5.10 ended up breaking splice() for any file type that did not have explicit support for the new way of doing things; the idea was that the important cases would be noticed and fixed over time. That has indeed happened; if one looks for patches committed as explicit fixes to the disabling of splice() support, one finds fixes for the AFS filesystem, the 9p filesystem, the orangefs filesystem, /proc/mountinfo, the TTY subsystem, kernfs, sendfile(), the nilfs2 filesystem, and the JFFS2 filesystem.
Most recently, Jens Axboe reported that splice() no longer worked on /dev/random or /dev/urandom; he included a patch to fix the problem as well. These patches were later reworked by random-number-generator maintainer Jason Donenfeld and were applied to the mainline during the 5.19 merge window. Along the way, Donenfeld observed that the necessary changes resulted in a performance regression of about 3% when reading from /dev/urandom. That led him to ask whether the fix was something that was needed at all; after some discussion, Axboe gave him the lecture on regressions:
If you have an application that is written using eg splice from /dev/urandom, then it should indeed be safe to expect that it will indeed continue working. If we have one core tenet in the kernel it's that you should ALWAYS be able to upgrade your kernel and not have any breakage in terms of userspace ABI. Obviously that can happen sometimes, but I think this one is exactly the poster child of breakage that should NOT happen. We took away a feature that someone depended on.
That is the sort of breakage that did indeed happen but, in this case, a
change was made knowing that this kind of problem would result. Hellwig
said in
response to Axboe's patch set that "compared to my initial fears the
fallout actually isn't
that bad
", but a perusal of the above list of fixes might lead one
to a different conclusion.
The removal of set_fs() is, in many ways, a model for what the kernel development process can do. A fundamental piece of low-level structure that had been deeply wired into the kernel since the beginning was replaced with a much safer alternative without breaking the project's pace of a stable release every nine or ten weeks. The steady stream of regressions resulting from this change, though, is not what the project sets out to do — and it seems certain that this particular gift has not yet stopped giving.
The decision to take this path was driven by a fear of security problems, based on the past history of the splice() system call. If those fears are still justified (and they might well be; consider, for example, that splice() was a part of the "Dirty Pipe" vulnerability reported earlier this year), then refusing to make all existing splice() implementations just work without set_fs() may have prevented far worse regressions than the ones we have seen. Having to fix a filesystem is annoying; having to endure yet another security drill for a branded vulnerability with a silly name is rather more so.
There is no way of knowing whether that is how things would have gone in
this case. But it is
true that this type of episode makes the kernel's "no regressions" rule
look a bit more like just a guideline. It does not take too many of those
to create breakage to the project's reputation that is hard to splice back together.
Index entries for this article | |
---|---|
Kernel | Development model/Regressions |
Kernel | set_fs() |
Posted May 26, 2022 21:09 UTC (Thu)
by zx2c4 (subscriber, #82519)
[Link]
- I noticed the slow down here:
Hopefully those close the performance gap and then all drivers get faster.
Posted May 26, 2022 22:23 UTC (Thu)
by josh (subscriber, #17465)
[Link] (5 responses)
Posted May 26, 2022 23:55 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (4 responses)
Side note: Why are there so many syscalls that do almost, but not quite, entirely the same thing in this space? We've also got copy_file_range(2), which seems to be the same as splice(2) but both fds must be normal files. And then there's vmsplice(2), which appears to be exactly the same as read(2)/write(2), but with an overly-complicated API, unless you pass SPLICE_F_GIFT, which looks to be the "I'm doing something ridiculous, don't judge me" flag. And I imagine there's also some io_uring equivalent to this madness, too. Why is there not a simple, all-purpose "move data from here to here and don't bother me about the details, just do whatever's fastest or most reasonable" syscall?
* splice isn't it, because splice requires one of the fds to be a pipe.
Posted May 27, 2022 13:09 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (3 responses)
Posted May 27, 2022 17:42 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Posted May 27, 2022 17:44 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (1 responses)
FreeBSD has CoW on send(), I believe, but of course that means you need to go through a rather expensive page fault when/if the data changes.
Posted May 28, 2022 2:37 UTC (Sat)
by wahern (subscriber, #37304)
[Link]
This was exactly Linus' original reasoning wrt vmsplice: Source: https://lkml.org/lkml/2006/4/20/310 See also Linus' justifications of splice and tee earlier in that thread. vmsplice is also briefly mentioned, and it's implicit from the context that much of the net value of vmsplice comes from combining with tee, as you mentioned earlier.
Posted May 27, 2022 1:03 UTC (Fri)
by gerdesj (subscriber, #5446)
[Link]
A bit like the Constitution for the UKoGBnNI - https://en.wikipedia.org/wiki/Constitution_of_the_United_... - "Unlike in most countries, no attempt has been made to codify such arrangements into a single document."
... and yet somehow we struggle along.
It might look a bit daft to conflate the Linux kernel's governance with the UK's legal system but I think it is quite instructional. One of our cherished principles is the idea that you should be able to "quietly enjoy" your property - I can't remember the exact term. I think a coal mine making a racket caused the formative judgment.
So, the no regressions thing can be seen in similar terms: Don't break people's code.
However we have to be practical and sometimes stuff has to change.
Posted May 27, 2022 18:02 UTC (Fri)
by SLi (subscriber, #53131)
[Link] (2 responses)
It seems like a silly limitation to me; as far as I can tell, it's 100% knowable at the time of the initial set_fs() call that you want to reset it exactly when the current function (or block) exits. It's not even something that would require producing code that is slower than the correct C code.
Posted May 28, 2022 3:56 UTC (Sat)
by willy (subscriber, #9762)
[Link] (1 responses)
The problem is twofold. One is that while set_fs() is active, various security measures (like SMEP and SMAP) are defeated. The other is that (on some architectures and eg on a 4GB/4GB split x86-32), you may not actually be able to access userspace because accessing userspace actually accesses kernel space. On x86-64, you can tell from the high bits of the pointer whether it's userspace or kernel space, but that's not true eg on SPARC or PARISC.
Posted Jun 9, 2022 17:06 UTC (Thu)
by stem (guest, #83810)
[Link]
splice() and the ghost of set_fs()
https://lore.kernel.org/lkml/Yoey+FOYO69lS5qP@zx2c4.com/
- Jens confirmed it's around 3%:
https://lore.kernel.org/lkml/0a6ed6b9-0917-0d83-5c45-70ff...
- Relatedly, I had proposed doing the same thing to /dev/zero: https://lore.kernel.org/lkml/20220520135030.166831-1-Jaso...
- Jens liked the idea, but Al pointed out the
performance issues, and later started figuring out why:
https://lore.kernel.org/lkml/Yokmu7bQpg70Bp8R@zeniv-ca.li...
- Al references resurrecting a particularly relevent older thread on
fsdevel:
https://lore.kernel.org/linux-fsdevel/Yokl+uHTVWFxoQGn@ze...
- This thread is now a many messages deep. That's where things are at now.
- Looks like Al's got some patches he's playing with in
https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs....
splice() and the ghost of set_fs()
splice() and the ghost of set_fs()
* copy_file_range isn't it, because it requires *both* of the fds to be normal files.
* sendfile isn't it, because it's missing an offset argument for the output file, and the input file must not be a socket.
* io_uring isn't it, because it's like five syscalls and a userspace buffer, not one fire-and-forget syscall.
splice() and the ghost of set_fs()
splice() and the ghost of set_fs()
splice() and the ghost of set_fs()
splice() and the ghost of set_fs()
On Thu, 20 Apr 2006, Piet Delaney wrote:
>
> What about marking the pages Read-Only while it's being used by the
> kernel
NO!
That's a huge mistake, and anybody that does it that way (FreeBSD) is totally incompetent.
[...]
That cost is _bigger_ than the cost of just copying the page in the first place.
splice() and the ghost of set_fs()
splice() and the ghost of set_fs()
splice() and the ghost of set_fs()
splice() and the ghost of set_fs()
Are you sure?
afaik, set_fs() has nothing to do with SM*P, it was abused wrt access_ok() - copy_*_user().