Rethinking splice()
Posted Feb 17, 2023 22:29 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
Parent article: Rethinking splice()
splice() should have been modified to use just regular descriptors. E.g. if I want to connect a file descriptor to a network socket, I should just do it directly. This way the kernel can have special-cased handling of file-based pages and provide meaningful completion notifications. There's probably a handful of such combinations that make sense (memfd() to socket, socket to memfd() or file, etc.)
To make it even better, add a flag F_SPLICE_NO_FALLBACK that will fail the operation if there's no accelerated path available and the kernel would instead just fall back on a memcpy.
Posted Feb 18, 2023 6:40 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (20 responses)
At this point, it's even worse than setuid/seteuid/etc., and that mess led directly to the creation of setresuid(2). Why can't we (application programmers) have nice things once in a while? This is not O_PONIES. This is a simple matter of "don't make three different syscalls that do basically the same thing." Or maybe even "don't provide abstractions that leak like a sieve."
All we need is one, single syscall that:
* Is exactly equivalent to a while/read/write loop.
Posted Feb 18, 2023 6:56 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
To be clear: When I say this, I'm referring to the application-observable semantics of the operation, not to the underlying mechanism. In other words, I don't care whether the kernel actually copies the data, or moves it, or COWs it, or makes it dance the moonwalk. All I care about is the *semantics* - when this hypothetical syscall returns, it means the kernel considers the operation "committed" in some sense, and I can now proceed to close the src, or reopen it as writable and scribble on it, or truncate it, or whatever else I feel like, and the dest will not get messed up.
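For reference, the while/read/write loop that the proposed syscall is meant to match in observable behaviour can be sketched roughly as below. This is only a minimal illustration: the helper name `copy_loop` and the 64 KiB buffer size are assumptions, not anything from an existing API. It is also the kind of fallback a libc wrapper might run if the kernel declined to accelerate the transfer.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy up to len bytes from in_fd to out_fd, starting at
 * each descriptor's current position. Returns the number of bytes copied,
 * or -1 with errno set. */
static ssize_t copy_loop(int in_fd, int out_fd, size_t len)
{
    char buf[65536];
    size_t total = 0;

    while (total < len) {
        size_t want = len - total;
        if (want > sizeof(buf))
            want = sizeof(buf);

        ssize_t n = read(in_fd, buf, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break; /* end of input */

        /* write() may be partial, so loop until this chunk is flushed */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Once this returns, the data has been handed to write(), so later changes to the source cannot affect the destination; that is the "committed" property being asked for.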
Posted Feb 20, 2023 7:38 UTC (Mon)
by moxfyre (guest, #13847)
[Link] (1 responses)
I very much like your proposal, and agree that it'd be a great design… but the "in some sense" is doing some pretty heavy lifting here.
Also, what happens if another thread seeks in one of the FDs before this one completes? Presumably it shouldn't be EXACTLY equivalent to a while/read/write loop in that sense?
Aaaand now I start to get a sense of how the current kernel interfaces came to be. 😅
Posted Feb 20, 2023 8:43 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link]
It's vague because it's an implementation detail. The application should not know or care about what this entails. All it cares about is that the right data ends up in the right place.
> Also, what happens if another thread seeks in one of the FDs before this one completes? Presumably it shouldn't be EXACTLY equivalent to a while/read/write loop in that sense?
None of the three syscalls that I mentioned upthread has anything in their respective man pages about how that works. Presumably, the kernel either interleaves the I/O with no synchronization, or just does whatever it feels like.
Posted Feb 18, 2023 8:05 UTC (Sat)
by joib (subscriber, #8541)
[Link] (16 responses)
You could have two offsets. If an FD is non-seekable, ignore the offset; if the offset is -1, use the current offset?
Posted Feb 18, 2023 8:32 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (15 responses)
Posted Feb 18, 2023 10:10 UTC (Sat)
by joib (subscriber, #8541)
[Link] (14 responses)
Similar to the normal IO read/write syscalls. If we had p{read,write}v2() from the start, read(), pread(), readv() (and corresponding ones for write) could be implemented in user space as wrappers.
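As a rough sketch of that idea, the read-side calls can be layered on preadv2(), which treats an offset of -1 as "use and update the current file offset". The wrapper names `my_read`, `my_pread` and `my_readv` are hypothetical and exist only for illustration; a real libc would differ.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical wrapper names; only preadv2() itself is a real interface. */

/* pread()-like: read at an explicit offset, file position left untouched. */
static ssize_t my_pread(int fd, void *buf, size_t count, off_t offset)
{
    struct iovec iov = { .iov_base = buf, .iov_len = count };
    return preadv2(fd, &iov, 1, offset, 0);
}

/* read()-like: offset -1 means use and advance the current file position. */
static ssize_t my_read(int fd, void *buf, size_t count)
{
    struct iovec iov = { .iov_base = buf, .iov_len = count };
    return preadv2(fd, &iov, 1, -1, 0);
}

/* readv()-like: same as above, with a caller-supplied vector. */
static ssize_t my_readv(int fd, const struct iovec *iov, int iovcnt)
{
    return preadv2(fd, iov, iovcnt, -1, 0);
}
```

The trailing 0 is the flags argument, left empty here.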
And also, a flags argument just in case it's needed later. History seems to suggest every syscall eventually needs such a thing. :)
Posted Feb 27, 2023 3:14 UTC (Mon)
by ringerc (subscriber, #3071)
[Link] (13 responses)
Given the number of times we've had issues with adding flags that change important semantics, where the syscall has no way to say "IDK what you setting flag bit 7 means, hope it isn't important".
Posted Feb 27, 2023 11:53 UTC (Mon)
by paulj (subscriber, #341)
[Link] (12 responses)
Posted Feb 27, 2023 12:29 UTC (Mon)
by johill (subscriber, #25196)
[Link] (10 responses)
unsigned int flags1, flags2 = 0;
then you've basically painted yourself into the corner of not being able to use flags1 for anything of interest in the future?
Posted Feb 27, 2023 12:37 UTC (Mon)
by paulj (subscriber, #341)
[Link] (9 responses)
Posted Feb 27, 2023 14:21 UTC (Mon)
by johill (subscriber, #25196)
[Link] (8 responses)
Posted Feb 27, 2023 16:22 UTC (Mon)
by paulj (subscriber, #341)
[Link] (7 responses)
That's no different from today, where we have syscalls with flags that are not yet defined and their value (as yet) unchecked; or flags that are defined for future use but not yet implemented (and not checked), is it?
GIGO.
Posted Feb 27, 2023 16:38 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (6 responses)
That conflicts with "don't break userspace". I built a perfectly working binary on Linux 6.4, which sets flags to 0x100 - a bad value, since the only currently defined value is 0. I upgrade to Linux 7.1, and it still works, since the defined flags values don't yet interpret 0x100. When I upgrade from 7.1 to 7.2, flags 0x100 is given a meaning, and my binary breaks. As far as Linus is concerned, that's a kernel regression, and you need to revert the feature that makes sense of flags value 0x100, and find a value that userspace doesn't set.
Ultimately, this forces kernel developers to check all parameter values are set to something meaningful, and to error if any of the values are either not valid, or valid but not understood by this kernel. That way, my binary fails on Linux 6.4 as well as on 7.1, and stops failing on 7.2 - and Linus agrees that "binary used to fail with EINVAL, now works" is not a regression.
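The check being described amounts to rejecting any bit outside the set this kernel understands. A minimal userspace sketch follows; the flag names and values (`XFER_F_NONBLOCK`, `XFER_F_MORE`) are invented for illustration and do not correspond to any real syscall.

```c
#include <errno.h>

/* Hypothetical flag bits for an imaginary syscall; not a real kernel API. */
#define XFER_F_NONBLOCK  0x1u
#define XFER_F_MORE      0x2u
#define XFER_KNOWN_FLAGS (XFER_F_NONBLOCK | XFER_F_MORE)

/* Returns 0 if every set bit is understood, -EINVAL otherwise. */
static int xfer_check_flags(unsigned int flags)
{
    if (flags & ~XFER_KNOWN_FLAGS)
        return -EINVAL;
    return 0;
}
```

With this in place, a binary passing 0x100 fails from the first release onward, so giving 0x100 a meaning later cannot break anything that previously worked.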
Posted Feb 28, 2023 11:08 UTC (Tue)
by paulj (subscriber, #341)
[Link] (4 responses)
In networking, it is common to allow for values that are optional, and not per se understood by the recipient - who may just ignore them. And values that are mandatory to understand, so the recipient must give some error if not recognised.
Optional values allow a protocol to be extended with optional and wholly backwards compatible features, so that newer speakers with the feature happily co-exist with speakers without it. While 2 "newer" speakers presumably derive some benefit from both supporting the feature. I guess it's more rare in software (Linux especially) to have an application compiled with some such feature, and run it on some older kernel/library-stack that lacks it.
Another way to achieve the latter - rather than explicitly having 2 classes of flags - is to specify that unused fields "Must Be Zero". If such a bit is set it's an error. If such a bit is then repurposed in an update, and it is used by a new speaker with an old speaker, then the old speaker raises an error - the new speaker can try again without. 2 new speakers speaking to each other just happily use the new meaning of the formerly "Must Be Zero" flag.
What you're saying is Linux kernel userspace-API flags must always be of the "MBZ" kind - there is no need for optional. Fair enough. :)
Posted Feb 28, 2023 17:29 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (3 responses)
The tradeoff is different between networking and the kernel, too. A program runs for milliseconds through to months on the same kernel, and the RTT to the kernel is on the order of 1 microsecond. Doing 10 RTTs to determine what features are supported and choosing fallbacks isn't significant time compared to the runtime of the program - especially since programs that use new features and have fallbacks for older kernels are likely to be long-running programs.
In contrast, in the networking world, RTTs are higher (milliseconds, not microseconds, most of the time), and connection lifetime is shorter on the high end (connections for more than a day are unusual). Doing 10 RTTs to determine the feature set of the other end is a significant cost when you'll only have the same remote for a few hours at most, and more likely for seconds at a time.
Posted Mar 1, 2023 11:03 UTC (Wed)
by paulj (subscriber, #341)
[Link] (2 responses)
The fallback thing, the problem is the entity asking for the optional enhancement, that could otherwise be ignored, often will not implement the fallback path. So with a hard fail, the entire thing may fail. You need more logic to make the "nice to have, but optional and can be ignored" thing work reliably. The test matrix gets bigger (and bigger and bigger, with each such option).
Just having it silently ignored if communicated to an entity that doesn't know it is simpler, and can not have fallback path bugs.
Trade-offs in all directions. ;)
Posted Mar 1, 2023 11:22 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (1 responses)
And to add another layer to the tradeoffs (one that's changed over time, to boot), in today's world it's often easier to not bother with the new feature at all until you can guarantee that all the hosts your application runs on have the new kernel feature, whereas it's often hard to get all the remote endpoints of a service you depend upon upgraded to new networking features.
This will change again, but for now, that's where we sit.
Posted Mar 1, 2023 17:02 UTC (Wed)
by paulj (subscriber, #341)
[Link]
Posted Mar 1, 2023 22:12 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Feb 27, 2023 15:38 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
You basically cannot do "optional, proceed if unrecognised" sanely. If you do, you have no way of distinguishing "app passed a non-zero flags value because it never went wrong testing on older kernels" from "app passed a non-zero flags value because it knows about the new meaning of this value".
What you do need is a very clear way for the app to ask what flags values are known about - so that the app can test all the combinations it wants to use at start-up, fail early if the kernel doesn't support anything appropriate, and choose fallbacks if the kernel support is sub-optimal (e.g. older kernel).
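One way to picture that start-up probing, assuming the kernel rejects unknown flags with EINVAL: try the flag combinations in order of preference and keep the first one that is accepted. Everything named here (`probe_fn`, `pick_supported`) is hypothetical; the probe callback stands in for whatever harmless call an application would make to test a flag set.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical: an operation that rejects unknown flags with -EINVAL. */
typedef int (*probe_fn)(unsigned int flags);

/* Try candidate flag sets in order of preference. Any result other than
 * -EINVAL is treated as "the kernel understands this combination".
 * Returns the index of the first accepted candidate, or -1 if none is. */
static int pick_supported(probe_fn probe, const unsigned int *candidates, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (probe(candidates[i]) != -EINVAL)
            return (int)i;
    }
    return -1;
}
```

An application would run this once at start-up, fail early on -1, and otherwise remember which fallback level it ended up on.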
Posted Feb 18, 2023 8:08 UTC (Sat)
by Sesse (subscriber, #53779)
[Link] (7 responses)
Because the pipe is used as a kernel-side memory buffer. Don't think about it as splicing from one fd to another, think about it as reading from an fd to kernel memory (or writing from kernel memory to an fd).
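To make the "pipe as kernel-side buffer" view concrete, here is a rough Linux-only sketch of moving data between two descriptors with splice() and an intermediate pipe. The helper name `splice_via_pipe` is an assumption, error handling is kept minimal, and whether any given pair of descriptors actually avoids a copy depends on the kernel and the file types involved.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move up to len bytes from in_fd to out_fd through a
 * pipe. Returns the number of bytes moved, or -1 on error. */
static ssize_t splice_via_pipe(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    if (pipe(p) < 0)
        return -1;

    while (len > 0) {
        /* "read": fill the kernel-side buffer (the pipe) from in_fd */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)
            break; /* end of input */
        len -= (size_t)n;

        /* "write": drain the pipe into out_fd; may take several calls */
        while (n > 0) {
            ssize_t w = splice(p[0], NULL, out_fd, NULL, (size_t)n, SPLICE_F_MOVE);
            if (w <= 0) {
                total = -1;
                goto out;
            }
            n -= w;
            total += w;
        }
    }
out:
    close(p[0]);
    close(p[1]);
    return total;
}
```

Each half mirrors one side of a read()/write() pair, with the pipe descriptors playing the role of the buffer pointer.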
Posted Feb 18, 2023 8:15 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
That's the problem. I _want_ to think about it as splicing one FD to another because it simply makes no sense otherwise. A buffer should be an internal technical detail.
And for the zero-copy scenario, the "pipe as a kernel buffer" abstraction doesn't even make any sense!
Posted Feb 18, 2023 8:26 UTC (Sat)
by Sesse (subscriber, #53779)
[Link] (5 responses)
int ret = read(infd, buf, sizeof(buf));
With splice(), the buffer is referenced by ID instead of by pointer, that's all.
There's tons of problems with splice (as the article points out), and it hasn't aged well, but the abstraction isn't so weird.
Posted Feb 18, 2023 8:40 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
int ret = readthenwrite(infd, outfd, maybe_some_other_args);
and never even deal with buffers at all. Maybe under the hood the kernel deals with buffers, but as an application programmer, I would rather not have to think about them if I can avoid it.
This is a superset of splice, because splice is just a special case of it where one of the fds happens to be a pipe.
Posted Feb 18, 2023 11:55 UTC (Sat)
by Sesse (subscriber, #53779)
[Link] (2 responses)
Most of this sort of feels obsolete with io_uring dominating, though.
Posted Feb 19, 2023 8:50 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Mar 3, 2023 3:37 UTC (Fri)
by njs (subscriber, #40338)
[Link]
- splice composes with tee/vmsplice/etc. to allow more complex zero-copy IO flows
- readthenwrite could not replace splice for this purpose, because splice has *super bizarre* semantics that are different from readthenwrite, as this article notes. (In particular, readthenwrite from a file into a pipe has to make a copy -- it skips bouncing through userspace, but it has to copy data from the page buffer into the pipe buffer, instead of doing wacky things with page pointers like splice does.) But those bizarre semantics are what let compositions of splice/tee/vmsplice/etc. be zero-copy.
I don't think it really works out; this stuff really needs to be replaced with something better. But I can at least see why splice seemed like a good idea at the time.
Posted Feb 18, 2023 17:31 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
The whole reason for splice() is to magically avoid buffers and transfer the data from one location to another.
> I have no qualms doing
Except that if you actually do that, a pipe will be so much slower because of additional synchronization.
* Starts reading/writing from wherever the fd is currently positioned (if it's seekable and you want to seek, then call seek explicitly).
* Takes a size argument (which is just about the only thing all three of those syscalls have in common).
* Fails with EDOITYOURSELF if there's no optimization available and the kernel doesn't feel like emulating it. Then libc or somebody else can write a simple wrapper that does the while loop if necessary.
* Blocks at least until the last write (in a hypothetical while/read/write loop) would have returned. It might still need to be fsync'd, but there should be no "oh, if you do a write at exactly the wrong time, it'll silently clobber all of your data" case.
* Also, O_NONBLOCK and/or io_uring would be nice to have, but *now* we're getting more into the O_PONIES realm, so I would say this is a bonus goal (but still probably doable).
syscall(..., &flags1, &flags2);
write(outfd, buf, ret);