Toward a reverse splice()
A key aspect of splice() is that it works with up-to-date buffers
of data, meaning that it moves pages already containing data obtained from
some source. The reverse-splice operation would, instead, operate on empty
buffers that need to be filled from somewhere. It would, in other words,
be a read operation rather than a write. It could be used to fill buffers
from a file and feed the result into a pipe, for example. One possible use
case, Szeredi said, is user-space filesystems, which could use it to feed
pages of file data into the kernel (and beyond) without copying the data.
He thinks that the idea is "implementable", but was curious to hear what
the other developers in the room thought about it.
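For contrast, below is a minimal sketch of the existing, forward splice() path as seen from user space: pages that already hold file data are moved into a pipe and then on to standard output without passing through a user-space buffer. The chunk size and the use of SPLICE_F_MOVE as a hint are arbitrary choices for illustration; the proposed rsplice() does not exist and is not shown.

    /* Forward splice(): move already-filled pages from a file into a pipe
     * and onward, without copying them through user space.  A reverse
     * splice would instead send empty buffers toward the data source to
     * be filled. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int in = open(argv[1], O_RDONLY);
        if (in < 0) {
            perror("open");
            return 1;
        }

        int pipefd[2];
        if (pipe(pipefd) < 0) {
            perror("pipe");
            return 1;
        }

        for (;;) {
            /* File -> pipe: the pipe buffers end up referencing the file's
             * page-cache pages rather than copies of them. */
            ssize_t n = splice(in, NULL, pipefd[1], NULL, 65536, SPLICE_F_MOVE);
            if (n < 0) {
                perror("splice");
                return 1;
            }
            if (n == 0)
                break;  /* end of file */

            /* Pipe -> stdout: drain what was just spliced in. */
            while (n > 0) {
                ssize_t m = splice(pipefd[0], NULL, STDOUT_FILENO, NULL,
                                   (size_t)n, SPLICE_F_MOVE);
                if (m < 0) {
                    perror("splice");
                    return 1;
                }
                n -= m;
            }
        }

        close(in);
        return 0;
    }

Redirecting standard output to a file or piping it into another program keeps both ends of the second splice() call splice-friendly.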
Rik van Riel worried about page-lifecycle problems. Moving a page of data into the page cache (as rsplice() might do) is easy if there are no other users of the page, but what if other processes already have that page in their page tables? Szeredi responded that rsplice() can only work if the pages involved have not yet been activated, so no other references to them can exist.
Matthew Wilcox said that he knows of a use case from his time at Microsoft. It has to do with fetching pages of data from a remote server; there would be value in having an efficient way to place that data into the page cache. Doing this would require adding a recvpage() complement to the kernel's internal sendpage() method. He hoped that somebody within Microsoft would be able to speak more about this use case in the future.
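For reference, the kernel-internal sendpage path is normally reached through kernel_sendpage(), which hands a protocol a page that is already filled with data. The prototype below is a purely hypothetical sketch, assuming a receive-side counterpart would mirror that shape; nothing named recvpage() exists in the kernel, and the example_ name is invented here.

    #include <linux/net.h>       /* struct socket */
    #include <linux/mm_types.h>  /* struct page */

    /* Existing interface, shown here only as a comment:
     *   int kernel_sendpage(struct socket *sock, struct page *page,
     *                       int offset, size_t size, int flags);
     * It transmits data from a page that has already been filled. */

    /* Hypothetical mirror image, for illustration only: hand the protocol
     * an empty page and let received data land in it directly, so the page
     * could later be inserted into the page cache without a copy. */
    int example_kernel_recvpage(struct socket *sock, struct page *page,
                                int offset, size_t size, int flags);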
Hugh Dickins, instead, recalled that Linus Torvalds has expressed regret about having introduced splice() in the first place. Torvalds thought that it was a great idea at the time, but few users have materialized since it was implemented. Adding new system calls that lack users only serves to increase the attack surface of the kernel, he said. There are few people who truly understand splice(), which has needed to be "corrected" numerous times over the years. An rsplice() call, he said, would likely have to go through the same process before it could be trusted.
From there the discussion wandered in various directions. There was some
questioning of the value of zero-copy interfaces in general, but they do
seem to offer benefits on systems with high-bandwidth adapters and huge
pages. There was a fair amount of confusion about how rsplice()
differs from splice(), perhaps driving home the point that not
many people fully understand splice() in the first place. What is
needed, it was agreed, is a well-defined use case for this new system call
that would help to nail down what it actually does. Then, if an
implementation appears shortly thereafter, it will be possible to have a
more informed discussion on whether the whole thing makes sense.
Index entries for this article:
Kernel: splice()
Conference: Storage, Filesystem, and Memory-Management Summit/2019
Posted May 2, 2019 0:14 UTC (Thu) by wahern (subscriber, #37304)
Because moving a page of data from userspace without processing that data in any manner is a rather esoteric operation, especially in a monolithic kernel where such work is generally performed by in-kernel device drivers.
sendfile covers the vast majority of userspace use cases that splice could be used for. splice is a powerful abstraction of sendfile, but the benefits aren't realizable without providing a mechanism for userspace to inject or consume the data. That's what vmsplice was intended for, but vmsplice is nearly impossible to use effectively, especially in asynchronous (epoll-type), event-oriented models. The biggest problem is memory and page management, and in particular knowing if and when you can recycle a gifted page. I vaguely remember reading on LWN about patches to add a notification facility; did they ever make it upstream?
The fact that vmsplice requires one end to be a pipe doesn't help, either. Entering the kernel twice (or more) just to move a page of data is convoluted from the perspective of typical design patterns, and subtracts from the performance benefits of zero-copy.
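To make the recycling problem concrete, here is a minimal sketch, assuming an arbitrary output file name and one page of payload, of the gifting pattern described above: a page-aligned buffer is handed to a pipe with vmsplice(SPLICE_F_GIFT) and pushed onward with splice(), after which nothing tells the caller when the gifted page can safely be reused.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        char *buf;

        /* The gift is only honored for whole, page-aligned pages. */
        if (posix_memalign((void **)&buf, pagesize, pagesize))
            return 1;
        memset(buf, 'x', pagesize);

        int pipefd[2];
        if (pipe(pipefd) < 0) {
            perror("pipe");
            return 1;
        }

        int out = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0) {
            perror("open");
            return 1;
        }

        /* Gift the page to the pipe; the kernel may now treat it as its own. */
        struct iovec iov = { .iov_base = buf, .iov_len = pagesize };
        if (vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT) < 0) {
            perror("vmsplice");
            return 1;
        }

        /* Move the gifted page onward; ideally it ends up attached to the
         * output file without being copied. */
        if (splice(pipefd[0], NULL, out, NULL, pagesize, SPLICE_F_MOVE) < 0) {
            perror("splice");
            return 1;
        }

        /* There is no notification that the kernel is done with buf; knowing
         * when it can be reused or freed is exactly the lifetime problem
         * described above. */
        return 0;
    }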
And this description from the splice man page (copied from Ubuntu 18.04) was always a huge turn-off:

SPLICE_F_MOVE
    Attempt to move pages instead of copying. This is only a hint to the kernel: pages may still be copied if the kernel cannot move the pages from the pipe, or if the pipe buffers don't refer to full pages. The initial implementation of this flag was buggy: therefore starting in Linux 2.6.21 it is a no-op (but is still permitted in a splice() call); in the future, a correct implementation may be restored.

A better vmsplice and more consistent splice semantics (removing the constraint of one end always being a pipe) could have shifted significant complexity from kernel space to userspace as originally intended, with greater flexibility and less overall complexity, but it never happened. Instead, focus shifted back to traditional AIO interfaces, which keep most of the complexity in kernel space and do a better job of hiding data copies, preserving the fiction of zero-copy.
Toward a reverse splice()
Posted May 3, 2019 3:38 UTC (Fri) by nevets (subscriber, #11875)