Rethinking splice()

Posted Feb 17, 2023 22:29 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
Parent article: Rethinking splice()

FWIW, I'm an application programmer, and for me splice() never made any sense. Why the heck do you need one side of the splice() to be a pipe?

splice() should have been modified to use just regular descriptors. E.g. if I want to connect a file descriptor to a network socket, I should just do it directly. This way the kernel can have special-cased handling of file-based pages and provide meaningful completion notifications. There's probably a handful of such combinations that make sense (memfd to socket, socket to memfd or file, etc.).

To make it even better, add a flag F_SPLICE_NO_FALLBACK that will fail the operation if there's no accelerated path available and the kernel would instead just fall back on a memcpy.

Rethinking splice()

Posted Feb 18, 2023 6:40 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (20 responses)

There is also copy_file_range(2). And sendfile(2). And probably one or two random others that I don't know about. Every single one has a different set of arcane restrictions on what types of file descriptors you can pass it. And now we're talking about deprecating *one* of these things, but presumably the others will be left as-is.

At this point, it's even worse than setuid/seteuid/etc., and that mess led directly to the creation of setresuid(2). Why can't we (application programmers) have nice things once in a while? This is not O_PONIES. This is a simple matter of "don't make three different syscalls that do basically the same thing." Or maybe even "don't provide abstractions that leak like a sieve."

All we need is one, single syscall that:

* Is exactly equivalent to a while/read/write loop.
* Starts reading/writing from wherever the fd is currently positioned (if it's seekable and you want to seek, then call seek explicitly).
* Takes a size argument (which is just about the only thing all three of those syscalls have in common).
* Fails with EDOITYOURSELF if there's no optimization available and the kernel doesn't feel like emulating it. Then libc or somebody else can write a simple wrapper that does the while loop if necessary.
* Blocks at least until the last write (in a hypothetical while/read/write loop) would have returned. It might still need to be fsync'd, but there should be no "oh, if you do a write at exactly the wrong time, it'll silently clobber all of your data" case.
* Also, O_NONBLOCK and/or io_uring would be nice to have, but *now* we're getting more into the O_PONIES realm, so I would say this is a bonus goal (but still probably doable).
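A rough userspace approximation of that wish list can be sketched today: try copy_file_range(2) and drop to a plain read/write loop when the kernel refuses. The names `copy_fd` and `copy_loop` are invented for this sketch, and the set of errno values treated as "do it yourself" is an assumption, not an exhaustive list.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <unistd.h>

/* Plain while/read/write loop, used as the fallback path. */
static ssize_t copy_loop(int in_fd, int out_fd, size_t len)
{
    char buf[65536];
    size_t done = 0;

    while (done < len) {
        size_t want = len - done < sizeof buf ? len - done : sizeof buf;
        ssize_t n = read(in_fd, buf, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;                       /* EOF */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Hypothetical wrapper: copy up to len bytes between the current offsets
 * of two descriptors, using the kernel's fast path when it is offered. */
ssize_t copy_fd(int in_fd, int out_fd, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL, len - done, 0);
        if (n > 0) {
            done += (size_t)n;
            continue;
        }
        if (n == 0)
            break;                       /* EOF */
        if (errno == EINTR)
            continue;
        /* Assumed "no fast path here" errors: fall back to the loop. */
        if (errno == EXDEV || errno == EINVAL ||
            errno == ENOSYS || errno == EOPNOTSUPP) {
            ssize_t rest = copy_loop(in_fd, out_fd, len - done);
            return rest < 0 ? -1 : (ssize_t)done + rest;
        }
        return -1;
    }
    return (ssize_t)done;
}
```

Because both paths work from the descriptors' current offsets, the fallback simply continues from wherever the fast path stopped.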

Rethinking splice()

Posted Feb 18, 2023 6:56 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (2 responses)

> * Blocks at least until the last write (in a hypothetical while/read/write loop) would have returned. It might still need to be fsync'd, but there should be no "oh, if you do a write at exactly the wrong time, it'll silently clobber all of your data" case.

To be clear: When I say this, I'm referring to the application-observable semantics of the operation, not to the underlying mechanism. In other words, I don't care whether the kernel actually copies the data, or moves it, or COWs it, or makes it dance the moonwalk. All I care about is the *semantics* - when this hypothetical syscall returns, it means the kernel considers the operation "committed" in some sense, and I can now proceed to close the src, or reopen it as writable and scribble on it, or truncate it, or whatever else I feel like, and the dest will not get messed up.

Rethinking splice()

Posted Feb 20, 2023 7:38 UTC (Mon) by moxfyre (guest, #13847) [Link] (1 responses)

> when this hypothetical syscall returns, it means the kernel considers the operation "committed" in some sense, and I can now proceed to close the src, or reopen it as writable and scribble on it, or truncate it, or whatever else I feel like, and the dest will not get messed up.

I very much like your proposal, and agree that it'd be a great design… but the "in some sense" is doing some pretty heavy lifting here.

Also, what happens if another thread seeks in one of the FDs before this one completes? Presumably it shouldn't be EXACTLY equivalent to a while/read/write loop in that sense?

Aaaand now I start to get a sense of how the current kernel interfaces came to be. 😅

Rethinking splice()

Posted Feb 20, 2023 8:43 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

> but the "in some sense" is doing some pretty heavy lifting here.

It's vague because it's an implementation detail. The application should not know or care about what this entails. All it cares about is that the right data ends up in the right place.

> Also, what happens if another thread seeks in one of the FDs before this one completes? Presumably it shouldn't be EXACTLY equivalent to a while/read/write loop in that sense?

None of the three syscalls that I mentioned upthread has anything in their respective man pages about how that works. Presumably, the kernel either interleaves the I/O with no synchronization, or just does whatever it feels like.

Rethinking splice()

Posted Feb 18, 2023 8:05 UTC (Sat) by joib (subscriber, #8541) [Link] (16 responses)

> Starts reading/writing from wherever the fd is currently positioned (if it's seekable and you want to seek, then call seek explicitly).

You could have two offsets. If an FD is non-seekable, ignore the offset; if the offset is -1, the current offset is used?

Rethinking splice()

Posted Feb 18, 2023 8:32 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (15 responses)

Perhaps that is worth the bother, but the only use case I can think of is "I want to have one thread copy between fd A and fd B, while another thread simultaneously reads from or writes to one or both of A and B, and also I don't want to reopen the file," and IMHO that's just crazy. But I'm sure there's someone writing some ridiculous app out there that needs it.

Rethinking splice()

Posted Feb 18, 2023 10:10 UTC (Sat) by joib (subscriber, #8541) [Link] (14 responses)

Yes. And while I'm at it, use iovecs instead of a direct pointer+len to a buffer. Then a user space library can provide easier-to-use wrappers for common cases.

Similar to the normal IO read/write syscalls. If we had p{read,write}v2() from the start, read(), pread(), readv() (and corresponding ones for write) could be implemented in user space as wrappers.
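To make the wrapper idea concrete, here is a sketch of read(), pread() and readv() written on top of preadv2(2). The `my_` names are placeholders; the sketch relies on the documented preadv2() behaviour that an offset of -1 means "use and update the current file offset".

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/types.h>
#include <sys/uio.h>

/* pread(): one buffer, explicit offset, file position left alone. */
ssize_t my_pread(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    return preadv2(fd, &iov, 1, off, 0);
}

/* read(): one buffer, offset -1 selects the current file position. */
ssize_t my_read(int fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    return preadv2(fd, &iov, 1, -1, 0);
}

/* readv(): many buffers, current file position. */
ssize_t my_readv(int fd, const struct iovec *iov, int iovcnt)
{
    return preadv2(fd, iov, iovcnt, -1, 0);
}
```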

And also, a flags argument just in case it's needed later. History seems to suggest every syscall eventually needs such a thing. :)

Rethinking splice()

Posted Feb 27, 2023 3:14 UTC (Mon) by ringerc (subscriber, #3071) [Link] (13 responses)

Not only that, but a flags argument in which the low bits are "ignore if flag unrecognised" and the high bits are "syscall should fail if the flag bits are unrecognised".

This matters given the number of times we've had issues with adding flags that change important semantics, where the syscall has no way to say "IDK what your setting flag bit 7 means, hope it isn't important".

Rethinking splice()

Posted Feb 27, 2023 11:53 UTC (Mon) by paulj (subscriber, #341) [Link] (12 responses)

No, just use 2 separate flags arguments. One for "optional, proceed if unrecognised" and the other for "mandatory, fail if unrecognised".

Rethinking splice()

Posted Feb 27, 2023 12:29 UTC (Mon) by johill (subscriber, #25196) [Link] (10 responses)

This basically doesn't work. If any of your userspace is something like

unsigned int flags1, flags2 = 0;
syscall(..., &flags1, &flags2);

then you've basically painted yourself into the corner of not being able to use flags1 for anything of interest in the future?

Rethinking splice()

Posted Feb 27, 2023 12:37 UTC (Mon) by paulj (subscriber, #341) [Link] (9 responses)

mandatory fail if unrecognised - for a set bit, obviously.

Rethinking splice()

Posted Feb 27, 2023 14:21 UTC (Mon) by johill (subscriber, #25196) [Link] (8 responses)

I meant flags1 for the "optional, proceed if unrecognised" part. Don't see how you can really do "optional, proceed if unrecognised" at all since applications might just erroneously set random bits in there (as in the example), was just trying to illustrate why not.

Rethinking splice()

Posted Feb 27, 2023 16:22 UTC (Mon) by paulj (subscriber, #341) [Link] (7 responses)

If apps are specifying flag arguments with undefined values, well... they're going to get undefined behaviour (sooner or later) - tough for them I'd say.

That's no different from today, where we have syscalls with flags that are not yet defined and their value (as yet) unchecked; or flags that are defined for future use but not yet implemented (and not checked), is it?

GIGO.

Rethinking splice()

Posted Feb 27, 2023 16:38 UTC (Mon) by farnz (subscriber, #17727) [Link] (6 responses)

That conflicts with "don't break userspace". I built a perfectly working binary on Linux 6.4, which sets flags to 0x100 - a bad value, since the only currently defined value is 0. I upgrade to Linux 7.1, and it still works, since the defined flags values don't yet interpret 0x100. When I upgrade from 7.1 to 7.2, flags 0x100 is given a meaning, and my binary breaks. As far as Linus is concerned, that's a kernel regression, and you need to revert the feature that makes sense of flags value 0x100, and find a value that userspace doesn't set.

Ultimately, this forces kernel developers to check all parameter values are set to something meaningful, and to error if any of the values are either not valid, or valid but not understood by this kernel. That way, my binary fails on Linux 6.4 as well as on 7.1, and stops failing on 7.2 - and Linus agrees that "binary used to fail with EINVAL, now works" is not a regression.
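The check being described is small. Below is a userspace model of it, not actual kernel code; `demo_call` and the `DEMO_*` flag names are invented for illustration.

```c
#include <errno.h>

/* Hypothetical flag set known to "this kernel version". */
#define DEMO_FLAG_A      0x1u
#define DEMO_FLAG_B      0x2u
#define DEMO_KNOWN_FLAGS (DEMO_FLAG_A | DEMO_FLAG_B)

/* Model of a syscall entry point: any bit outside the known set is
 * rejected up front, so it stays free to be given a meaning later. */
int demo_call(unsigned int flags)
{
    if (flags & ~DEMO_KNOWN_FLAGS)
        return -EINVAL;

    /* ... the real work would happen here ... */
    return 0;
}
```

With this in place, the binary passing 0x100 fails on every version until 0x100 is defined, which is the "used to fail, now works" case.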

Rethinking splice()

Posted Feb 28, 2023 11:08 UTC (Tue) by paulj (subscriber, #341) [Link] (4 responses)

Fair enough, that works too.

In networking, it is common to allow for values that are optional, and not per se understood by the recipient - who may just ignore them. And values that are mandatory to understand, so the recipient must give some error if not recognised.

Optional values allow a protocol to be extended with optional and wholly backwards compatible features, so that newer speakers with the feature happily co-exist with speakers without it. While 2 "newer" speakers presumably derive some benefit from both supporting the feature. I guess it's more rare in software (Linux especially) to have an application compiled with some such feature, and run it on some older kernel/library-stack that lacks it.

Another way to achieve the latter - rather than explicitly having 2 classes of flags - is to specify that unused fields "Must Be Zero". If such a bit is set it's an error. If such a bit is then repurposed in an update, and it is used by a new speaker with an old speaker, then the old speaker raises an error - the new speaker can try again without. 2 new speakers speaking to each other just happily use the new meaning of the formerly "Must Be Zero" flag.

What you're saying is Linux kernel userspace-API flags must always be of the "MBZ" kind - there is no need for optional. Fair enough. :)

Rethinking splice()

Posted Feb 28, 2023 17:29 UTC (Tue) by farnz (subscriber, #17727) [Link] (3 responses)

The tradeoff is different between networking and the kernel, too. A program runs for milliseconds through to months on the same kernel, and the RTT to the kernel is on the order of 1 microsecond. Doing 10 RTTs to determine what features are supported and choosing fallbacks isn't significant time compared to the runtime of the program - especially since programs that use new features and have fallbacks for older kernels are likely to be long-running programs.

In contrast, in the networking world, RTTs are higher (milliseconds, not microseconds, most of the time), and connection lifetime is shorter on the high end (connections for more than a day are unusual). Doing 10 RTTs to determine the feature set of the other end is a significant cost when you'll only have the same remote for a few hours at most, and more likely for seconds at a time.

Rethinking splice()

Posted Mar 1, 2023 11:03 UTC (Wed) by paulj (subscriber, #341) [Link] (2 responses)

They're still protocols for different entities to communicate and achieve something, end of the day. ;)

The fallback thing, the problem is the entity asking for the optional enhancement, that could otherwise be ignored, often will not implement the fallback path. So with a hard fail, the entire thing may fail. You need more logic to make the "nice to have, but optional and can be ignored" thing work reliably. The test matrix gets bigger (and bigger and bigger, with each such option).

Just having it silently ignored if communicated to an entity that doesn't know it is simpler, and can not have fallback path bugs.

Trade-offs in all directions. ;)

Rethinking splice()

Posted Mar 1, 2023 11:22 UTC (Wed) by farnz (subscriber, #17727) [Link] (1 responses)

And to add another layer to the tradeoffs (one that's changed over time, to boot), in today's world it's often easier to not bother with the new feature at all until you can guarantee that all the hosts your application runs on have the new kernel feature, whereas it's often hard to get all the remote endpoints of a service you depend upon upgraded to new networking features.

This will change again, but for now, that's where we sit.

Rethinking splice()

Posted Mar 1, 2023 17:02 UTC (Wed) by paulj (subscriber, #341) [Link]

Yeah, software often doesn't care about this kind of compatibility. Except when it comes to systems software and features critical for booting. Then you need to think about forward and backward compatibility - at least in the Linux world.

Rethinking splice()

Posted Mar 1, 2023 22:12 UTC (Wed) by nix (subscriber, #2304) [Link]

This mistake exists in other contexts too, even in hardware: <http://www.os2museum.com/wp/forward-compatibility-landmines/> <http://www.os2museum.com/wp/theres-more-to-the-286-xenix-...>

Rethinking splice()

Posted Feb 27, 2023 15:38 UTC (Mon) by farnz (subscriber, #17727) [Link]

You basically cannot do "optional, proceed if unrecognised" sanely. If you do, you have no way of distinguishing "app passed a non-zero flags value because it never went wrong testing on older kernels" from "app passed a non-zero flags value because it knows about the new meaning of this value".

What you do need is a very clear way for the app to ask what flags values are known about - so that the app can test all the combinations it wants to use at start-up, fail early if the kernel doesn't support anything appropriate, and choose fallbacks if the kernel support is sub-optimal (e.g. older kernel).
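One way an application might do that probing at start-up is sketched below. Everything here is hypothetical: `pick_flags`, the `probe_fn` callback, and `old_kernel_probe`, which merely stands in for a kernel that knows only flag bit 0x1 and rejects the rest with EINVAL.

```c
#include <errno.h>
#include <stddef.h>

/* A probe returns 0 if the flags value is accepted, negative errno if not. */
typedef int (*probe_fn)(unsigned int flags);

/* Stand-in for an older kernel: only bit 0x1 is understood. */
int old_kernel_probe(unsigned int flags)
{
    return (flags & ~0x1u) ? -EINVAL : 0;
}

/* Walk the candidates from most to least preferred and keep the first
 * one the kernel accepts. Returns 0 on success, -1 if none is usable. */
int pick_flags(probe_fn probe, const unsigned int *candidates, size_t n,
               unsigned int *chosen)
{
    for (size_t i = 0; i < n; i++) {
        if (probe(candidates[i]) == 0) {
            *chosen = candidates[i];
            return 0;
        }
    }
    return -1;   /* nothing appropriate: fail early, at start-up */
}
```

This only works because unknown bits are rejected; with "optional, proceed if unrecognised" flags the probe could never tell the two cases apart.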

Rethinking splice()

Posted Feb 18, 2023 8:08 UTC (Sat) by Sesse (subscriber, #53779) [Link] (7 responses)

> FWIW, I'm an application programmer, and for me splice() never made any sense. Why the heck do you need one side of the splice() to be a pipe?

Because the pipe is used as a kernel-side memory buffer. Don't think about it as splicing from one fd to another, think about it as reading from an fd to kernel memory (or writing from kernel memory to an fd).
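Read that way, a file-to-fd transfer with splice(2) is a two-step loop around a pipe. The following is only a sketch under that reading: `splice_via_pipe` is a made-up name, descriptors are assumed to be blocking, and SPLICE_F_MOVE is merely a hint to the kernel.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: move up to len bytes from in_fd to out_fd,
 * using a pipe as the kernel-side buffer. Returns bytes moved, or -1. */
ssize_t splice_via_pipe(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    if (pipe(p) < 0)
        return -1;

    while (len > 0 && total >= 0) {
        /* Step 1: the "read" half, in_fd into the pipe. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (n < 0 && errno == EINTR)
            continue;
        if (n == 0)
            break;                        /* EOF on in_fd */
        if (n < 0) {
            total = -1;
            break;
        }

        /* Step 2: the "write" half, pipe into out_fd, until drained. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)left,
                               SPLICE_F_MOVE);
            if (m < 0 && errno == EINTR)
                continue;
            if (m <= 0) {
                total = -1;
                break;
            }
            left -= m;
        }
        if (total >= 0) {
            total += n;
            len -= (size_t)n;
        }
    }

    int saved = errno;                    /* keep errno from the failing call */
    close(p[0]);
    close(p[1]);
    errno = saved;
    return total;
}
```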

Rethinking splice()

Posted Feb 18, 2023 8:15 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> Don't think about it as splicing from one fd to another

That's the problem. I _want_ to think about it as splicing one FD to another because it simply makes no sense otherwise. A buffer should be an internal technical detail.

And for the zero-copy scenario, the "pipe as a kernel buffer" abstraction doesn't even make any sense!

Rethinking splice()

Posted Feb 18, 2023 8:26 UTC (Sat) by Sesse (subscriber, #53779) [Link] (5 responses)

Why doesn't it make sense otherwise? I have no qualms doing

ssize_t ret = read(infd, buf, sizeof(buf));
if (ret > 0)
    write(outfd, buf, ret);

With splice(), the buffer is referenced by ID instead of by pointer, that's all.

There's tons of problems with splice (as the article points out), and it hasn't aged well, but the abstraction isn't so weird.

Rethinking splice()

Posted Feb 18, 2023 8:40 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (3 responses)

The argument isn't that buffers are illogical. The argument is that a more useful primitive would be one where I can just do

int ret = readthenwrite(infd, outfd, maybe_some_other_args);

and never even deal with buffers at all. Maybe under the hood the kernel deals with buffers, but as an application programmer, I would rather not have to think about them if I can avoid it.

This is a superset of splice, because splice is just a special case of it where one of the fds happens to be a pipe.

Rethinking splice()

Posted Feb 18, 2023 11:55 UTC (Sat) by Sesse (subscriber, #53779) [Link] (2 responses)

It's not really a strict superset, since splice() composes with tee() (where you can read once and send many) or vmsplice()+tee() (write once from memory, send many; think standardized HTTP headers).
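The "read once, send many" pattern might look roughly like the sketch below: tee(2) duplicates what is sitting in a pipe into a second pipe without consuming it, and splice(2) then drains each copy to its own destination. `tee_to_two` and `drain` are invented names, only a single round is handled (at most one pipe's worth of data), and the source pipe is assumed to already hold data.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Splice exactly n bytes out of a pipe into out_fd. */
static int drain(int pipe_fd, int out_fd, size_t n)
{
    while (n > 0) {
        ssize_t m = splice(pipe_fd, NULL, out_fd, NULL, n, 0);
        if (m < 0 && errno == EINTR)
            continue;
        if (m <= 0)
            return -1;
        n -= (size_t)m;
    }
    return 0;
}

/* Hypothetical helper: send the next chunk waiting in src_pipe to both
 * out1 and out2. Returns the number of bytes sent to each, or -1. */
ssize_t tee_to_two(int src_pipe, int out1, int out2, size_t len)
{
    int tmp[2];

    if (pipe(tmp) < 0)
        return -1;

    /* Duplicate without consuming: src_pipe still holds the data. */
    ssize_t n = tee(src_pipe, tmp[1], len, 0);
    if (n > 0 &&
        (drain(tmp[0], out1, (size_t)n) < 0 ||
         drain(src_pipe, out2, (size_t)n) < 0))
        n = -1;

    close(tmp[0]);
    close(tmp[1]);
    return n;
}
```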

Most of this sort of feels obsolete with io_uring dominating, though.

Rethinking splice()

Posted Feb 19, 2023 8:50 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

Those syscalls are not splice. I never claimed that readthenwrite (or whatever less-terrible name you want to call it, since readthenwrite is obviously just a placeholder) would be a superset of "splice, and also a bunch of random other syscalls that have a vaguely similar implementation to splice." My specific beef is not with the splice family of syscalls, it's with the family of "syscalls that are semantically equivalent to read and then write, but also have arbitrary restrictions that make no sense to the average application developer and have to be looked up in the man page every time you use them." See my other comment regarding copy_file_range(2) and sendfile(2).

Rethinking splice()

Posted Mar 3, 2023 3:37 UTC (Fri) by njs (subscriber, #40338) [Link]

I think the logic would be:

- splice composes with tee/vmsplice/etc. to allow more complex zero-copy IO flows

- readthenwrite could not replace splice for this purpose, because splice has *super bizarre* semantics that are different from readthenwrite, as this article notes. (In particular, readthenwrite from a file into a pipe has to make a copy -- it skips bouncing through userspace, but it has to copy data from the page buffer into the pipe buffer, instead of doing wacky things with page pointers like splice does.) But those bizarre semantics are what let compositions of splice/tee/vmsplice/etc. be zero-copy.

I don't think it really works out; this stuff really needs to be replaced with something better. But I can at least see why splice seemed like a good idea at the time.

Rethinking splice()

Posted Feb 18, 2023 17:31 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> Why doesn't it make sense otherwise?

The whole reason for splice() is to magically avoid buffers and transfer the data from one location to another.

> I have no qualms doing

Except that if you actually do that, a pipe will be so much slower because of additional synchronization.