
Rethinking splice()

By Jonathan Corbet
February 17, 2023
The splice() system call is built on an appealing idea: connect two file descriptors together so that data can be moved from one to the other without passing through user space and, preferably, without being copied in the kernel. splice() has enabled some significant performance optimizations over the years, but it has also proved difficult to work with and occasionally surprising. A recent linux-kernel discussion showed how splice() can cause trouble, to the point that some developers now wonder if adding it was a good idea.

Stefan Metzmacher is a Samba developer who would like to use splice() to implement zero-copy I/O in the Samba server. He has run into a problem, though. If a file is being sent to a remote client over the network, splice() can be used to feed the file data into a socket; the network layer will read that data directly out of the page cache without needing to make a copy in the kernel — exactly the desired result. But if the file is written before network transmission is complete, the newly written data may be sent, even though that write happened after the splice() call was made, perhaps even in the same process. That can lead to unpleasant surprises (and unhappy Samba users) when the data received at the remote end is not what is expected.

The problem here is a bit more subtle than it might seem at a first glance. To begin with, it is not possible to splice a file directly into a network socket; splice() requires that at least one of the file descriptors given to it is a pipe. So the actual sequence of operations is to splice the file into a pipe, then to connect the pipe to the socket with a second splice() call. Neither splice() call knows when the data it passes through has reached its final destination; the network layer may still be working with the file data even after both splice() calls have completed. There is no easy way to know that the data has been transmitted and that it is safe to modify the file again.
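The two-call sequence described above can be sketched in C. This is an illustrative fragment, not code from the discussion: the helper name is invented, and error handling is trimmed for brevity.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Move up to len bytes from in_fd (a file) to out_fd (e.g. a socket)
 * through an intermediate pipe; returns bytes moved, or -1 on error. */
ssize_t splice_file_to_socket(int in_fd, int out_fd, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while (len > 0) {
        /* First splice: file -> pipe (page-cache pages are shared
         * into the pipe rather than copied). */
        ssize_t n = splice(in_fd, NULL, pipefd[1], NULL, len,
                           SPLICE_F_MOVE);
        if (n <= 0)
            break;
        /* Second splice: pipe -> destination.  Note that neither
         * call tells us when the network layer is finished with
         * the underlying pages. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(pipefd[0], NULL, out_fd, NULL, left,
                               SPLICE_F_MOVE);
            if (m <= 0)
                goto out;
            left -= m;
            total += m;
        }
        len -= n;
    }
out:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```

Even after both calls return successfully, the data may still be sitting in socket buffers that reference the original pages, which is exactly the window in which a concurrent write to the file causes the surprise described below.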

In his initial email, Metzmacher asked whether it would be possible to prevent this problem by marking file-cache pages as copy-on-write when they are passed to splice(). Then, if the file were written while the transfer was underway, that transfer could continue to read from the older data while the write to the file proceeded independently. Linus Torvalds quickly rejected that idea, saying that the sharing of the buffers holding the data is "the whole point of splice". Making those pages copy-on-write would break sharing of data in general. He later added that a splice() call should be seen as a form of mmap(), with similar semantics.

He also said: "You can say 'I don't like splice()'. That's fine. I used to think splice was a really cool concept, but I kind of hate it these days. Not liking splice() makes a ton of sense." Like it or not, though, the current behavior of splice() cannot change, since that would break existing applications; even Torvalds's dislike cannot overcome that.

Samba developer Jeremy Allison suggested that the solution to Metzmacher's problem could be for Samba to only attempt zero-copy I/O when the client holds a lease on the file in question, thus ensuring that there should be no concurrent access. He later had to backtrack on that idea, though; since the Samba server cannot know when network transmission is complete, the possibility for surprises still exists even in the presence of a lease. Thus, he concluded, "splice() is unusable for Samba even in the leased file case".

Dave Chinner observed that this problem resembles those that have previously been solved in the filesystem layer. There are many cases, including RAID 5 or data compressed by the filesystem, where data to be written must be held stable for the duration of the operation; this is the whole stable-pages problem that was confronted almost twelve years ago. Perhaps a similar solution could be implemented here, he said, where attempts to write to pages currently being used in a splice() chain would simply block until the operation has completed.

Both Torvalds and Matthew Wilcox pointed out the flaw with this idea: the splice() operation can take an unbounded amount of time, so it could be used (accidentally or otherwise) to block access to a file indefinitely. That idea did not go far.

Andy Lutomirski argued that splice() is the wrong interface for what applications want to do; splice() has no way of usefully communicating status information back to the caller. Instead, he said, io_uring might be a better way to implement this functionality. It allows multiple operations to be queued efficiently and, crucially, it has the completion mechanism that can let user space know when a given buffer is no longer in use. Jens Axboe, the maintainer of io_uring, was initially unsure about this idea, but warmed to it after Lutomirski suggested that the problem could be simplified by taking the pipes out of the picture and allowing one non-pipe file descriptor to be connected directly to another. The pipes, Axboe said, "do get in the way" sometimes.

Axboe thought that a new "send file" io_uring operation could be a good solution to this problem; it could be designed from the beginning with asynchronous operation in mind and without the use of pipes. So that may be the solution that comes out of this discussion — though somebody would, naturally, actually have to implement it first.

There was some talk about whether splice() should be deprecated; Torvalds doesn't think the system call has much value:

The same way "everything is a pipeline of processes" is very much historical Unix and very useful for shell scripting, but isn't actually then normally very useful for larger problems, splice() really never lived up to that conceptual issue, and it's just really really nasty in practice.

But we're stuck with it.

There is little point in discouraging use of splice(), though, if the kernel lacks a better alternative; Torvalds expressed doubt that the io_uring approach would turn out to be better in the end. The only way to find out is probably to try it and see how well it actually works. Until that happens, splice() will be the best that the kernel has to offer, its faults notwithstanding.

Index entries for this article
Kernel: io_uring
Kernel: splice()



Rethinking splice()

Posted Feb 17, 2023 16:32 UTC (Fri) by paulj (subscriber, #341) [Link]

"since the Samba server cannot know when network transmission is complete,"

Sounds like a network fs that uses splice needs to have acknowledgements of when data is received, at the network fs protocol level.

Not familiar with SMB, but I guess it lacks SMB-level signalling of file-transfer completion.

Rethinking splice()

Posted Feb 17, 2023 16:48 UTC (Fri) by wtarreau (subscriber, #51152) [Link]

The way it works is annoying indeed, but the main reason is that, like most syscalls, it returns a single result. A direct splice between two FDs would be great for networking, but it would need to provide two reports, one per FD (blocked on read, write, end reached, error, etc.). It would also save a lot of memory: right now, when you pump from one FD to a pipe just to discover the other side is full, you've made room in the input FD for no reason. When that input is a socket, this reopens the receive window, so a device that splices between two sockets ends up serving as a giant pipe-based buffer between a server and a client.

Rethinking splice()

Posted Feb 17, 2023 22:29 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (29 responses)

FWIW, I'm an application programmer, and for me splice() never made any sense. Why the heck do you need one side of the splice() to be a pipe?

splice() should have been modified to use just regular descriptors. E.g. if I want to connect a file descriptor to a network socket, I should just do it directly. This way the kernel can have special-cased handling of file-based pages and provide meaningful completion notifications. There's probably a handful of such combinations that make sense (memfd() to socket, socket to memfd() or file, etc.)

To make it even better, add a flag F_SPLICE_NO_FALLBACK that fails the operation if there's no accelerated path available and the kernel would otherwise just fall back on a memcpy.

Rethinking splice()

Posted Feb 18, 2023 6:40 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (20 responses)

There is also copy_file_range(2). And sendfile(2). And probably one or two random others that I don't know about. Every single one has a different set of arcane restrictions on what types of file descriptors you can pass it. And now we're talking about deprecating *one* of these things, but presumably the others will be left as-is.

At this point, it's even worse than setuid/seteuid/etc., and that mess led directly to the creation of setresuid(2). Why can't we (application programmers) have nice things once in a while? This is not O_PONIES. This is a simple matter of "don't make three different syscalls that do basically the same thing." Or maybe even "don't provide abstractions that leak like a sieve."

All we need is one, single syscall that:

* Is exactly equivalent to a while/read/write loop.
* Starts reading/writing from wherever the fd is currently positioned (if it's seekable and you want to seek, then call seek explicitly).
* Takes a size argument (which is just about the only thing all three of those syscalls have in common).
* Fails with EDOITYOURSELF if there's no optimization available and the kernel doesn't feel like emulating it. Then libc or somebody else can write a simple wrapper that does the while loop if necessary.
* Blocks at least until the last write (in a hypothetical while/read/write loop) would have returned. It might still need to be fsync'd, but there should be no "oh, if you do a write at exactly the wrong time, it'll silently clobber all of your data" case.
* Also, O_NONBLOCK and/or io_uring would be nice to have, but *now* we're getting more into the O_PONIES realm, so I would say this is a bonus goal (but still probably doable).
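A minimal sketch of the libc-side wrapper the fourth point imagines, under stated assumptions: the syscall and its EDOITYOURSELF error are hypothetical, so what follows is only the fallback half, the plain read/write loop with the blocking semantics the list asks for.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical fallback path: what a libc wrapper might do when the
 * proposed syscall reports that no kernel optimization is available.
 * Semantically equivalent to a while/read/write loop, starting from
 * each fd's current position.  Returns bytes copied, or -1 on error. */
ssize_t copy_fd_fallback(int in_fd, int out_fd, size_t len)
{
    char buf[65536];
    ssize_t total = 0;

    while ((size_t)total < len) {
        size_t want = len - (size_t)total;
        if (want > sizeof(buf))
            want = sizeof(buf);

        ssize_t n = read(in_fd, buf, want);
        if (n == 0)
            break;                      /* EOF on the source */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* Keep writing until this chunk is fully accepted, so the
         * call only returns once the last write would have. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t m = write(out_fd, buf + off, (size_t)(n - off));
            if (m < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += m;
        }
        total += n;
    }
    return total;
}
```

Once this loop has returned, the source can be scribbled on or truncated without affecting the destination, which is precisely the "committed" guarantee discussed in the following comments.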

Rethinking splice()

Posted Feb 18, 2023 6:56 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (2 responses)

> * Blocks at least until the last write (in a hypothetical while/read/write loop) would have returned. It might still need to be fsync'd, but there should be no "oh, if you do a write at exactly the wrong time, it'll silently clobber all of your data" case.

To be clear: When I say this, I'm referring to the application-observable semantics of the operation, not to the underlying mechanism. In other words, I don't care whether the kernel actually copies the data, or moves it, or COWs it, or makes it dance the moonwalk. All I care about is the *semantics* - when this hypothetical syscall returns, it means the kernel considers the operation "committed" in some sense, and I can now proceed to close the src, or reopen it as writable and scribble on it, or truncate it, or whatever else I feel like, and the dest will not get messed up.

Rethinking splice()

Posted Feb 20, 2023 7:38 UTC (Mon) by moxfyre (guest, #13847) [Link] (1 responses)

> when this hypothetical syscall returns, it means the kernel considers the operation "committed" in some sense, and I can now proceed to close the src, or reopen it as writable and scribble on it, or truncate it, or whatever else I feel like, and the dest will not get messed up.

I very much like your proposal, and agree that it'd be a great design… but the "in some sense" is doing some pretty heavy lifting here.

Also, what happens if another thread seeks in one of the FDs before this one completes? Presumably it shouldn't be EXACTLY equivalent to a while/read/write loop in that sense?

Aaaand now I start to get a sense of how the current kernel interfaces came to be. 😅

Rethinking splice()

Posted Feb 20, 2023 8:43 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

> but the "in some sense" is doing some pretty heavy lifting here.

It's vague because it's an implementation detail. The application should not know or care about what this entails. All it cares about is that the right data ends up in the right place.

> Also, what happens if another thread seeks in one of the FDs before this one completes? Presumably it shouldn't be EXACTLY equivalent to a while/read/write loop in that sense?

None of the three syscalls that I mentioned upthread has anything in their respective man pages about how that works. Presumably, the kernel either interleaves the I/O with no synchronization, or just does whatever it feels like.

Rethinking splice()

Posted Feb 18, 2023 8:05 UTC (Sat) by joib (subscriber, #8541) [Link] (16 responses)

> Starts reading/writing from wherever the fd is currently positioned (if it's seekable and you want to seek, then call seek explicitly).

You could have two offsets. If a FD is non-seekable, ignore the offset, if the offset is -1 the current offset is used?

Rethinking splice()

Posted Feb 18, 2023 8:32 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (15 responses)

Perhaps that is worth the bother, but the only use case I can think of is "I want to have one thread copy between fd A and fd B, while another thread simultaneously reads from or writes to one or both of A and B, and also I don't want to reopen the file," and IMHO that's just crazy. But I'm sure there's someone writing some ridiculous app out there that needs it.

Rethinking splice()

Posted Feb 18, 2023 10:10 UTC (Sat) by joib (subscriber, #8541) [Link] (14 responses)

Yes. And while I'm at it: use iovecs instead of a direct pointer+len to a buffer. Then a user-space library can provide easier-to-use wrappers for common cases.

Similar to the normal IO read/write syscalls. If we had p{read,write}v2() from the start, read(), pread(), readv() (and corresponding ones for write) could be implemented in user space as wrappers.

And also, a flags argument just in case it's needed later. History seems to suggest every syscall eventually needs such a thing. :)

Rethinking splice()

Posted Feb 27, 2023 3:14 UTC (Mon) by ringerc (subscriber, #3071) [Link] (13 responses)

Not only that, but a flags argument in which the low bits are "ignore if flag unrecognised" and the high bits are "syscall should fail if the flag bits are unrecognised".

We've had issues a number of times when adding flags that change important semantics, because the syscall has no way to say "I don't know what your setting flag bit 7 means; hope it isn't important".

Rethinking splice()

Posted Feb 27, 2023 11:53 UTC (Mon) by paulj (subscriber, #341) [Link] (12 responses)

No, just use 2 separate flags arguments. One for "optional, proceed if unrecognised" and the other for "mandatory, fail if unrecognised".

Rethinking splice()

Posted Feb 27, 2023 12:29 UTC (Mon) by johill (subscriber, #25196) [Link] (10 responses)

This basically doesn't work. If any of your userspace is something like

unsigned int flags1, flags2 = 0;
syscall(..., &flags1, &flags2);

then you've basically painted yourself into the corner of not being able to use flags1 for anything of interest in the future?

Rethinking splice()

Posted Feb 27, 2023 12:37 UTC (Mon) by paulj (subscriber, #341) [Link] (9 responses)

mandatory fail if unrecognised - for a set bit, obviously.

Rethinking splice()

Posted Feb 27, 2023 14:21 UTC (Mon) by johill (subscriber, #25196) [Link] (8 responses)

I meant flags1 for the "optional, proceed if unrecognised" part. Don't see how you can really do "optional, proceed if unrecognised" at all since applications might just erroneously set random bits in there (as in the example), was just trying to illustrate why not.

Rethinking splice()

Posted Feb 27, 2023 16:22 UTC (Mon) by paulj (subscriber, #341) [Link] (7 responses)

If apps are specifying flag arguments with undefined values, well... they're going to get undefined behaviour (sooner or later) - tough for them I'd say.

That's no different from today, where we have syscalls with flags that are not yet defined and their value (as yet) unchecked; or flags that are defined for future use but not yet implemented (and not checked), is it?

GIGO.

Rethinking splice()

Posted Feb 27, 2023 16:38 UTC (Mon) by farnz (subscriber, #17727) [Link] (6 responses)

That conflicts with "don't break userspace". I built a perfectly working binary on Linux 6.4, which sets flags to 0x100 - a bad value, since the only currently defined value is 0. I upgrade to Linux 7.1, and it still works, since the defined flags values don't yet interpret 0x100. When I upgrade from 7.1 to 7.2, flags 0x100 is given a meaning, and my binary breaks. As far as Linus is concerned, that's a kernel regression, and you need to revert the feature that makes sense of flags value 0x100, and find a value that userspace doesn't set.

Ultimately, this forces kernel developers to check all parameter values are set to something meaningful, and to error if any of the values are either not valid, or valid but not understood by this kernel. That way, my binary fails on Linux 6.4 as well as on 7.1, and stops failing on 7.2 - and Linus agrees that "binary used to fail with EINVAL, now works" is not a regression.

Rethinking splice()

Posted Feb 28, 2023 11:08 UTC (Tue) by paulj (subscriber, #341) [Link] (4 responses)

Fair enough, that works too.

In networking, it is common to allow for values that are optional, and not per se understood by the recipient - who may just ignore them. And values that are mandatory to understand, so the recipient must give some error if not recognised.

Optional values allow a protocol to be extended with optional and wholly backwards compatible features, so that newer speakers with the feature happily co-exist with speakers without it. While 2 "newer" speakers presumably derive some benefit from both supporting the feature. I guess it's more rare in software (Linux especially) to have an application compiled with some such feature, and run it on some older kernel/library-stack that lacks it.

Another way to achieve the latter - rather than explicitly having 2 classes of flags - is to specify that unused fields "Must Be Zero". If such a bit is set it's an error. If such a bit is then repurposed in an update, and it is used by a new speaker with an old speaker, then the old speaker raises an error - the new speaker can try again without. 2 new speakers speaking to each other just happily use the new meaning of the formerly "Must Be Zero" flag.

What you're saying is Linux kernel userspace-API flags must always be of the "MBZ" kind - there is no need for optional. Fair enough. :)

Rethinking splice()

Posted Feb 28, 2023 17:29 UTC (Tue) by farnz (subscriber, #17727) [Link] (3 responses)

The tradeoff is different between networking and the kernel, too. A program runs for milliseconds through to months on the same kernel, and the RTT to the kernel is on the order of 1 microsecond. Doing 10 RTTs to determine what features are supported and choosing fallbacks isn't significant time compared to the runtime of the program, especially since programs that use new features and have fallbacks for older kernels are likely to be long-running programs.

In contrast, in the networking world, RTTs are higher (milliseconds, not microseconds, most of the time), and connection lifetime is shorter on the high end (connections lasting more than a day are unusual). Doing 10 RTTs to determine the feature set of the other end is a significant cost when you'll only have the same remote for a few hours at most, and more likely for seconds at a time.

Rethinking splice()

Posted Mar 1, 2023 11:03 UTC (Wed) by paulj (subscriber, #341) [Link] (2 responses)

They're still protocols for different entities to communicate and achieve something, end of the day. ;)

The fallback thing: the problem is that the entity asking for the optional enhancement, which could otherwise be ignored, often will not implement the fallback path. So with a hard fail, the entire thing may fail. You need more logic to make the "nice to have, but optional and can be ignored" thing work reliably. The test matrix gets bigger (and bigger and bigger, with each such option).

Just having it silently ignored if communicated to an entity that doesn't know it is simpler, and can not have fallback path bugs.

Trade-offs in all directions. ;)

Rethinking splice()

Posted Mar 1, 2023 11:22 UTC (Wed) by farnz (subscriber, #17727) [Link] (1 responses)

And to add another layer to the tradeoffs (one that's changed over time, to boot), in today's world it's often easier to not bother with the new feature at all until you can guarantee that all the hosts your application runs on have the new kernel feature, whereas it's often hard to get all the remote endpoints of a service you depend upon upgraded to new networking features.

This will change again, but for now, that's where we sit.

Rethinking splice()

Posted Mar 1, 2023 17:02 UTC (Wed) by paulj (subscriber, #341) [Link]

Yeah, software often doesn't care about this kind of compatibility, except when it comes to systems software and features critical for booting. Then you need to think about forward and backward compatibility, at least in the Linux world.

Rethinking splice()

Posted Mar 1, 2023 22:12 UTC (Wed) by nix (subscriber, #2304) [Link]

Rethinking splice()

Posted Feb 27, 2023 15:38 UTC (Mon) by farnz (subscriber, #17727) [Link]

You basically cannot do "optional, proceed if unrecognised" sanely. If you do, you have no way of distinguishing "app passed a non-zero flags value because it never went wrong testing on older kernels" from "app passed a non-zero flags value because it knows about the new meaning of this value".

What you do need is a very clear way for the app to ask what flags values are known about - so that the app can test all the combinations it wants to use at start-up, fail early if the kernel doesn't support anything appropriate, and choose fallbacks if the kernel support is sub-optimal (e.g. older kernel).
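A start-up probe along those lines can be sketched today with copy_file_range(), whose flags argument must currently be zero, so any unknown bit draws EINVAL. The helper name and the probed flag value are invented for illustration; the pattern is the point.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Probe whether the kernel understands a given flag by attempting a
 * tiny copy with it on scratch descriptors.  Returns 1 if the flag
 * is accepted, 0 if the kernel rejects it with EINVAL (meaning: fall
 * back), and -1 on any other error. */
int kernel_understands_flag(int in_fd, int out_fd, unsigned int flag)
{
    ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL, 1, flag);
    if (n >= 0)
        return 1;               /* flag accepted by this kernel */
    return errno == EINVAL ? 0 : -1;
}
```

An application would run such probes once at start-up, pick its fallbacks, and fail early if nothing suitable is available, exactly the flow described above.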

Rethinking splice()

Posted Feb 18, 2023 8:08 UTC (Sat) by Sesse (subscriber, #53779) [Link] (7 responses)

> FWIW, I'm an application programmer, and for me splice() never made any sense. Why the heck do you need one side of the splice() to be a pipe?

Because the pipe is used as a kernel-side memory buffer. Don't think about it as splicing from one fd to another, think about it as reading from an fd to kernel memory (or writing from kernel memory to an fd).

Rethinking splice()

Posted Feb 18, 2023 8:15 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> Don't think about it as splicing from one fd to another

That's the problem. I _want_ to think about it as splicing one FD to another because it simply makes no sense otherwise. A buffer should be an internal technical detail.

And for the zero-copy scenario, the "pipe as a kernel buffer" abstraction doesn't even make any sense!

Rethinking splice()

Posted Feb 18, 2023 8:26 UTC (Sat) by Sesse (subscriber, #53779) [Link] (5 responses)

Why doesn't it make sense otherwise? I have no qualms doing

int ret = read(infd, buf, sizeof(buf));
write(outfd, buf, ret);

With splice(), the buffer is referenced by ID instead of by pointer, that's all.

There's tons of problems with splice (as the article points out), and it hasn't aged well, but the abstraction isn't so weird.

Rethinking splice()

Posted Feb 18, 2023 8:40 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (3 responses)

The argument isn't that buffers are illogical. The argument is that a more useful primitive would be one where I can just do

int ret = readthenwrite(infd, outfd, maybe_some_other_args);

and never even deal with buffers at all. Maybe under the hood the kernel deals with buffers, but as an application programmer, I would rather not have to think about them if I can avoid it.

This is a superset of splice, because splice is just a special case of it where one of the fds happens to be a pipe.

Rethinking splice()

Posted Feb 18, 2023 11:55 UTC (Sat) by Sesse (subscriber, #53779) [Link] (2 responses)

It's not really a strict superset, since splice() composes with tee() (where you can read once and send many) or vmsplice()+tee() (write once from memory, send many; think standardized HTTP headers).

Most of this sort of feels obsolete with io_uring dominating, though.
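For the curious, the "read once and send many" pattern looks roughly like this. The demo function is invented for illustration; tee() duplicates the contents of one pipe into another without consuming them, so the same bytes can feed two destinations without another pass through user space.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Write buf into pipe A, tee it into pipe B, then read both pipes
 * back into the caller's buffers to show they carry the same bytes.
 * Returns 0 on success, -1 on error. */
int tee_demo(const char *buf, size_t len, char *out_a, char *out_b)
{
    int a[2], b[2];

    if (pipe(a) < 0 || pipe(b) < 0)
        return -1;
    if (write(a[1], buf, len) != (ssize_t)len)
        return -1;
    /* tee() duplicates pipe A's contents into pipe B without
     * consuming them from A. */
    if (tee(a[0], b[1], len, 0) != (ssize_t)len)
        return -1;
    if (read(a[0], out_a, len) != (ssize_t)len)
        return -1;
    if (read(b[0], out_b, len) != (ssize_t)len)
        return -1;
    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return 0;
}
```

In real use the reads would be replaced by splice() calls toward two different sockets, which is what a plain readthenwrite primitive could not express.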

Rethinking splice()

Posted Feb 19, 2023 8:50 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

Those syscalls are not splice. I never claimed that readthenwrite (or whatever less-terrible name you want to call it, since readthenwrite is obviously just a placeholder) would be a superset of "splice, and also a bunch of random other syscalls that have a vaguely similar implementation to splice." My specific beef is not with the splice family of syscalls, it's with the family of "syscalls that are semantically equivalent to read and then write, but also have arbitrary restrictions that make no sense to the average application developer and have to be looked up in the man page every time you use them." See my other comment regarding copy_file_range(2) and sendfile(2).

Rethinking splice()

Posted Mar 3, 2023 3:37 UTC (Fri) by njs (subscriber, #40338) [Link]

I think the logic would be:

- splice composes with tee/vmsplice/etc. to allow more complex zero-copy IO flows

- readthenwrite could not replace splice for this purpose, because splice has *super bizarre* semantics that are different from readthenwrite, as this article notes. (In particular, readthenwrite from a file into a pipe has to make a copy -- it skips bouncing through userspace, but it has to copy data from the page buffer into the pipe buffer, instead of doing wacky things with page pointers like splice does.) But those bizarre semantics are what let compositions of splice/tee/vmsplice/etc. be zero-copy.

I don't think it really works out; this stuff really needs to be replaced with something better. But I can at least see why splice seemed like a good idea at the time.

Rethinking splice()

Posted Feb 18, 2023 17:31 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> Why doesn't it make sense otherwise?

The whole reason for splice() is to magically avoid buffers and transfer the data from one location to another.

> I have no qualms doing

Except that if you actually do that, a pipe will be so much slower because of additional synchronization.

Rethinking splice()

Posted Feb 18, 2023 0:06 UTC (Sat) by stressinduktion (subscriber, #46452) [Link] (1 responses)

It is possible to gather the information from the socket directly. Managing this along with the splice web is probably even worse, but it may be useful in certain cases:

There is a possibility to loop progress information from the sockets back to the sockets' error queues. It was designed for handling transmit timestamp of network packets, but should also give clues about when data actually left the buffers: <https://www.kernel.org/doc/Documentation/networking/times...>, Section 2.1.1 (maybe along with bytestream timestamps, I don't remember the details anymore).

Rethinking splice()

Posted Feb 18, 2023 22:47 UTC (Sat) by willemb (subscriber, #73364) [Link]

The socket error queue is indeed what MSG_ZEROCOPY uses for this exact purpose.
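For reference, a sketch of that mechanism: enable SO_ZEROCOPY, send with MSG_ZEROCOPY, then read the completion notification from the socket's error queue. This needs Linux 4.14 or later; small sends may be copied internally by the kernel, but a notification is still queued. The helper name is invented.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60          /* in case old headers lack it */
#endif

/* Send buf with MSG_ZEROCOPY and block until the kernel reports,
 * via the error queue, that the buffer may be reused.
 * Returns 0 on success, -1 on error or timeout. */
int send_zerocopy_and_wait(int fd, const void *buf, size_t len)
{
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;
    if (send(fd, buf, len, MSG_ZEROCOPY) != (ssize_t)len)
        return -1;

    for (;;) {
        /* POLLERR is reported whenever the error queue is non-empty,
         * regardless of the requested events. */
        struct pollfd pfd = { .fd = fd, .events = 0 };
        if (poll(&pfd, 1, 5000) <= 0)
            return -1;

        char control[128];
        struct msghdr msg = {
            .msg_control = control,
            .msg_controllen = sizeof(control),
        };
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            if (errno == EAGAIN)
                continue;
            return -1;
        }
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        if (!cm)
            continue;
        struct sock_extended_err *serr =
            (struct sock_extended_err *)CMSG_DATA(cm);
        if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY &&
            serr->ee_errno == 0)
            return 0;           /* buffer is free to reuse */
    }
}
```

This is the completion signal that splice() lacks: only once the notification arrives is it safe to modify or free the buffer that was handed to the kernel.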


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds