
splice() and the ghost of set_fs()


Posted May 26, 2022 22:23 UTC (Thu) by josh (subscriber, #17465)
Parent article: splice() and the ghost of set_fs()

splice has to keep working; does it have to keep working *fast*? Could it become a wrapper around sendfile-like semantics, and then just have specific cases where it can go faster?



splice() and the ghost of set_fs()

Posted May 26, 2022 23:55 UTC (Thu) by NYKevin (subscriber, #129325)

I think that depends on the use case. When fd_in is a pipe, splice should be quite fast, because the man page says it just copies pointers to individual pages. If you use a naive sendfile-like implementation, suddenly you're making real copies. At least, that's what I was able to figure out from the man pages.

Side note: Why are there so many syscalls that do almost, but not quite, entirely the same thing in this space? We've also got copy_file_range(2), which seems to be the same as splice(2) but both fds must be normal files. And then there's vmsplice(2), which appears to be exactly the same as read(2)/write(2), but with an overly-complicated API, unless you pass SPLICE_F_GIFT, which looks to be the "I'm doing something ridiculous, don't judge me" flag. And I imagine there's also some io_uring equivalent to this madness, too. Why is there not a simple, all-purpose "move data from here to here and don't bother me about the details, just do whatever's fastest or most reasonable" syscall?

* splice isn't it, because splice requires one of the fds to be a pipe.
* copy_file_range isn't it, because it requires *both* of the fds to be normal files.
* sendfile isn't it, because it's missing an offset argument for the output file, and the input file must not be a socket.
* io_uring isn't it, because it's like five syscalls and a userspace buffer, not one fire-and-forget syscall.

splice() and the ghost of set_fs()

Posted May 27, 2022 13:09 UTC (Fri) by Sesse (subscriber, #53779)

vmsplice() is for sending the same data multiple times, I believe? E.g., pre-canned HTTP headers or small responses. vmsplice() once to get it from userspace into the kernel, then you can splice multiple times with copy.

splice() and the ghost of set_fs()

Posted May 27, 2022 17:42 UTC (Fri) by NYKevin (subscriber, #129325)

That still doesn't explain why you need the silly GIFT flag. Why can't the kernel just mark the offending pages as COW, like fork(2) does? You could do that without a special flag, because it should be transparent to userspace. Indeed, you can do that even for write(2), if you really want to.

splice() and the ghost of set_fs()

Posted May 27, 2022 17:44 UTC (Fri) by Sesse (subscriber, #53779)

I believe the gift flag is seen as a mistake in retrospect.

FreeBSD has CoW on send(), I believe, but of course that means you need to go through a rather expensive page fault when/if the data changes.

splice() and the ghost of set_fs()

Posted May 28, 2022 2:37 UTC (Sat) by wahern (subscriber, #37304)

This was exactly Linus' original reasoning wrt vmsplice:

On Thu, 20 Apr 2006, Piet Delaney wrote:
> 
> What about marking the pages Read-Only while it's being used by the
> kernel

NO!

That's a huge mistake, and anybody that does it that way (FreeBSD) is totally incompetent.

[...]

That cost is _bigger_ than the cost of just copying the page in the first place.

Source: https://lkml.org/lkml/2006/4/20/310

See also Linus' justifications of splice and tee earlier in that thread. vmsplice is also briefly mentioned, and it's implicit from the context that much of the net value of vmsplice comes from combining with tee, as you mentioned earlier.

