Toward a reverse splice()

By Jonathan Corbet
May 1, 2019

The splice() system call is, at its core, a write operation; it attempts to implement zero-copy I/O by moving pages from a pipe to a file. At the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Miklos Szeredi described a nascent idea for rsplice() — a "reverse splice" system call. There were not a lot of definitive outcomes from this discussion, but one thing was clear: rsplice() needs a much better description (and some code posted) before the development community can begin to form an opinion on it.

A key aspect of splice() is that it works with up-to-date buffers of data, meaning that it moves pages already containing data obtained from some source. The reverse-splice operation would, instead, operate on empty buffers that need to be filled from somewhere. It would, in other words, be a read operation rather than a write. It could be used to fill buffers from a file and feed the result into a pipe, for example. One possible use case, Szeredi said, is user-space filesystems, which could use it to feed pages of file data into the kernel (and beyond) without copying the data. He thinks that the idea is "implementable", but was curious to hear what the other developers in the room thought of it.

Rik van Riel worried about page-lifecycle problems. Moving a page of data into the page cache (as rsplice() might do) is easy if there are no other users of the page, but what if other processes already have that page in their page tables? Szeredi responded that rsplice() can only work if the pages involved have not yet been activated, so no other references to them can exist.

Matthew Wilcox said that he knows of a use case from his time at Microsoft. It has to do with fetching pages of data from a remote server; there would be value in having an efficient way to place that data into the page cache. Doing this would require adding a recvpage() complement to the kernel's internal sendpage() method. He hoped that somebody within Microsoft would be able to speak more about this use case in the future.

Hugh Dickins, instead, recalled that Linus Torvalds has expressed regret about having introduced splice() in the first place. Torvalds thought that it was a great idea at the time, but few users have materialized since it was implemented. Adding new system calls that lack users only serves to increase the attack surface of the kernel, he said. There are few people who truly understand splice(), which has needed to be "corrected" numerous times over the years. An rsplice() call, he said, would likely have to go through the same process before it could be trusted.

From there the discussion wandered in various directions. There was some questioning of the value of zero-copy interfaces in general, but they do seem to offer benefits on systems with high-bandwidth adapters and huge pages. There was a fair amount of confusion about how rsplice() differs from splice(), perhaps driving home the point that not many people fully understand splice() in the first place. What is needed, it was agreed, is a well-defined use case for this new system call that would help to nail down what it actually does. Then, if an implementation appears shortly thereafter, it will be possible to have a more informed discussion on whether the whole thing makes sense.

Index entries for this article
Kernel	splice()
Conference	Storage, Filesystem, and Memory-Management Summit/2019

Posted May 2, 2019 0:14 UTC (Thu) by wahern (subscriber, #37304)

> Torvalds thought that it was a great idea at the time, but few users have materialized since it was implemented.

Because moving a page of data from userspace without processing that data in any manner is a rather esoteric operation, especially in a monolithic kernel where such work is generally performed by in-kernel device drivers.

sendfile covers the vast majority of userspace use cases that splice could be used for. splice is a powerful abstraction of sendfile, but the benefits aren't realizable without providing a mechanism for userspace to inject or consume the data. That's what vmsplice was intended for, but vmsplice is nearly impossible to use effectively, especially in asynchronous (epoll-type), event-oriented models. The biggest problem is memory and page management, and in particular knowing if and when you can recycle a gifted page. I vaguely remember reading on LWN about patches to add a notification facility; did they ever make it upstream?
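To illustrate the sendfile point, here is a minimal sketch of a file copy done with one system call per chunk and no pipe in between. It assumes Linux, where the destination of sendfile() may be a regular file (the more common destination is a socket); the helper name copy_with_sendfile() is hypothetical.

```python
import os
import tempfile


def copy_with_sendfile(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path using sendfile(); return bytes copied."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        while remaining > 0:
            # The kernel transfers the data itself; it never passes through
            # a user-space buffer. A short count is possible, so loop.
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, remaining)
            if sent == 0:
                break  # source shrank underneath us
            offset += sent
            remaining -= sent
        return offset


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(b"x" * 10000)
        print("bytes copied:", copy_with_sendfile(src, dst))
```

Nothing here lets user space look at or supply the data in flight, which is the limitation that splice and vmsplice were meant to lift.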

The fact that vmsplice requires one end to be a pipe doesn't help, either. Entering the kernel twice (or more) just to move a page of data is convoluted from the perspective of typical design patterns, and detracts from the performance benefits of zero-copy.

And this description from the splice man page (copied from Ubuntu 18.04) was always a huge turn-off:

SPLICE_F_MOVE
    Attempt to move pages instead of copying. This is only a hint to
    the kernel: pages may still be copied if the kernel cannot move the
    pages from the pipe, or if the pipe buffers don't refer to full
    pages. The initial implementation of this flag was buggy: therefore
    starting in Linux 2.6.21 it is a no-op (but is still permitted in a
    splice() call); in the future, a correct implementation may be
    restored.

A better vmsplice and more consistent splice semantics (remove the constraint of one end always being a pipe) could have shifted significant complexity from kernel space to userspace as originally intended, with greater flexibility and less overall complexity, but it never happened. Instead, focus shifted back to traditional AIO interfaces, which keep most of the complexity in kernel space and do a better job of hiding data copies, preserving the fiction of zero-copy.

Posted May 3, 2019 3:38 UTC (Fri) by nevets (subscriber, #11875)

Note, trace-cmd makes heavy use of the splice() system call.