Two new system calls: splice() and sync_file_range()

[Posted April 3, 2006 by corbet]

The 2.6.17 kernel will include two new system calls which expand the capabilities available to user-space programs in some interesting ways. This article contains a look at the current form of these new interfaces.

splice()

The splice() system call has a long history. First proposed by Larry McVoy in 1998, it was seen as a way of improving I/O performance on server systems. Despite being mentioned often in the following years, no splice() implementation was ever created for the mainline Linux kernel. That situation changed just before the 2.6.17 merge window closed, when Jens Axboe's splice() patch, along with a number of modifications, was merged.

As of this writing, the splice() interface looks like this:

    long splice(int fdin, int fdout, size_t len, unsigned int flags);

A call to splice() will cause the kernel to move up to len bytes from the data source fdin to fdout. The data will move through kernel space only, with a minimum of copying. In the current implementation, at least one of the two file descriptors must refer to a pipe device. That requirement is a limitation of the current code, and it could be removed at some future time.

The flags argument modifies how the copy is done. Currently implemented flags are:

SPLICE_F_NONBLOCK: makes the splice() operations non-blocking. A call to splice() could still block, however, especially if either of the file descriptors has not been set for non-blocking I/O.
SPLICE_F_MORE: a hint to the kernel that more data will come in a subsequent splice() call.
SPLICE_F_MOVE: if the output is a file, this flag will cause the kernel to attempt to move pages directly from the input pipe buffer into the output address space, avoiding a copy operation.
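
As a rough illustration of the pipe requirement, here is a minimal sketch of a file-to-file copy that routes the data through a pipe. It assumes the six-argument splice() wrapper that glibc later shipped, in which each descriptor is paired with an optional offset pointer (NULL here), rather than the four-argument prototype shown above. The helper name splice_copy() is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from fd_in to fd_out,
 * using a pipe as the intermediate buffer. Returns the number of
 * bytes copied, or -1 on error. fd_out must not be opened O_APPEND.
 */
static ssize_t splice_copy(int fd_in, int fd_out, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    if (pipe(pipefd) < 0)
        return -1;

    while (len > 0) {
        /* Source -> pipe. The kernel moves at most one pipe's worth. */
        ssize_t n = splice(fd_in, NULL, pipefd[1], NULL, len,
                           SPLICE_F_MOVE);
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)     /* end of input */
            break;
        len -= (size_t)n;

        /* Pipe -> destination. Drain everything just queued. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(pipefd[0], NULL, fd_out, NULL,
                               (size_t)left, SPLICE_F_MOVE);
            if (m <= 0) {
                total = -1;
                goto out;
            }
            left -= m;
            total += m;
        }
    }
out:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```

A caller would open the source for reading and the destination for writing, then call splice_copy(in, out, size). SPLICE_F_MOVE is only a hint; the kernel is free to fall back to copying.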

Internally, splice() works using the pipe buffer mechanism added by Linus in early 2005 - that is why one side of the operation is required to be a pipe for now. There are two additions to the ever-growing file_operations structure for devices and filesystems which wish to support splice():

    ssize_t (*splice_write)(struct inode *pipe, struct file *out, 
                            size_t len, unsigned int flags);
    ssize_t (*splice_read)(struct file *in, struct inode *pipe, 
                           size_t len, unsigned int flags);

The new operations should move len bytes between pipe and either in or out, respecting the given flags. For filesystems, there are generic implementations of these operations which can be used; there is also a generic_splice_sendpage() which is used to enable splicing to a socket. As of this writing, there are no splice() implementations for device drivers, but there is nothing preventing such implementations in the future, for char devices at least.

Discussions on the linux-kernel mailing list suggest that the splice() interface could change before it is set in stone with the 2.6.17 release. Andrew Tridgell has requested that an offset argument be added to specify where copying should begin - either that, or a separate psplice() should be added. There is also some concern about error handling; if a splice() call returns an error, how does the application tell whether the problem is with the input or the output? Resolving those issues may require some interface changes over the next month or so.

sync_file_range()

Early in the 2.6.17 process, some changes to the posix_fadvise() system call were merged. The new, Linux-specific options were meant to give applications better control over how data written to files is flushed to the physical media. The capabilities provided are needed, but there were concerns about extending a POSIX-defined function in a Linux-specific way. So, after some discussions, Andrew Morton pulled that patch back out and replaced it with a new system call:

    long sync_file_range(int fd, loff_t offset, loff_t nbytes, int flags);

This call will synchronize a file's data to disk, starting at the given offset and proceeding for nbytes bytes (or to the end of the file if nbytes is zero). How the synchronization is done is controlled by flags:

SYNC_FILE_RANGE_WAIT_BEFORE blocks the calling process until any already in-progress writeout of pages (in the given range) completes.
SYNC_FILE_RANGE_WRITE starts writeout of any dirty pages in the given range which are not already under I/O.
SYNC_FILE_RANGE_WAIT_AFTER blocks the calling process until the newly-initiated writes complete.

An application which wants to initiate writeback of all dirty pages should provide the first two flags. Providing all three flags guarantees that those pages are actually on disk when the call returns.
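
The two flag combinations just described might be wrapped as follows. This is a sketch only, assuming the glibc sync_file_range() wrapper (declared in <fcntl.h> under _GNU_SOURCE, with an unsigned flags argument); the helper names start_writeback() and flush_range() are hypothetical. Note that the call deals with file data only, so metadata such as the file size is not covered by it.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait for any writeout already in progress on
 * the range, then start writeback of the dirty pages in it. Does not
 * wait for the newly started I/O. Returns 0 on success, -1 on error.
 */
static int start_writeback(int fd, off64_t offset, off64_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE);
}

/*
 * Hypothetical helper: as above, but also block until the writes
 * started by this call have completed. An nbytes of zero means
 * "through the end of the file".
 */
static int flush_range(int fd, off64_t offset, off64_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

An application streaming a large file could call start_writeback() on each chunk as it is written, then flush_range() once on the whole file at the end.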

The new implementation avoids distorting the posix_fadvise() system call. It also allows synchronization operations to be performed with a single call, instead of the multiple calls required by the previous attempt. In the future, it may also be possible to add other operations to the flags list; the ability to request metadata synchronization seems to be high on the list.

(Thanks to Michael Kerrisk - who agitated for this change - for providing some of the background information.)


Two new system calls: splice() and sync_file_range()

Posted Apr 3, 2006 23:40 UTC (Mon) by TwoTimeGrime (guest, #11688) [Link] (1 responses)

Could someone please explain what the effect of splice() is supposed to be? Is it just a faster way for the kernel to move data from one place to another? Is this something that programs would use or would it be something that other parts of the kernel use? I see that it has fdin and fdout as arguments. Does that mean that it uses file descriptors and can be used for copying files? I don't know anything about kernel development so I'm trying to put this feature into context.

Thanks.

Two new system calls: splice() and sync_file_range()

Posted Apr 3, 2006 23:44 UTC (Mon) by TwoTimeGrime (guest, #11688) [Link]

Nevermind. I saw in http://lwn.net/Articles/164887/ that it described how splice works at the bottom of the article.

Two new system calls: splice() and sync_file_range()

Posted Apr 4, 2006 11:26 UTC (Tue) by andersg (guest, #25522) [Link] (11 responses)

Is splice going to deprecate sendfile(2)?

Two new system calls: splice() and sync_file_range()

Posted Apr 4, 2006 14:13 UTC (Tue) by zlynx (guest, #2285) [Link] (10 responses)

I gathered from LKML that sendfile is going to become a call to splice. But sendfile will probably stay in the kernel for many versions to come for legacy support. Linus does try to keep ABI compatibility for user-space.

Two new system calls: splice() and sync_file_range()

Posted Apr 4, 2006 15:02 UTC (Tue) by axboe (subscriber, #904) [Link] (9 responses)

sys_sendfile() can never go, as it's a part of the user space ABI. However, the internal implementation can be replaced with a call to splice() instead. The splice git branch has support for fd -> fd splicing now (by using a virtual pipe), so sys_sendfile can basically just use that.

And what becomes of zero-copy?

Posted Apr 6, 2006 1:13 UTC (Thu) by xoddam (subscriber, #2322) [Link] (8 responses)

> The splice git branch has support for fd -> fd splicing now
> (by using a virtual pipe), so sys_sendfile can basically
> just use that.

But the internal pipe buffer uses dedicated pages, so there is
a minimum of one copy involved. Doesn't sendfile() do zero-copy
from a file's pages to a socket, if the socket's driver supports
it?

Also why isn't splice() just called cat()?

It is zero-copy

Posted Apr 6, 2006 5:59 UTC (Thu) by axboe (subscriber, #904) [Link] (7 responses)

It is zero-copy, why do you think the pipe pages are copied before being transmitted? Splice can even do zero-copy file copies, by moving pages from one file to another.

It is zero-copy

Posted Apr 6, 2006 6:42 UTC (Thu) by xoddam (subscriber, #2322) [Link] (6 responses)

> It is zero-copy

Sorry -- I misremembered Linus' original pipe-buffer
discussion. It has been a while:

http://lwn.net/Articles/118760/

> Splice can even do zero-copy file copies

Wow! Copying without copying! splice() must truly be a great philosopher!

It is zero-copy

Posted Apr 6, 2006 7:16 UTC (Thu) by axboe (subscriber, #904) [Link] (5 responses)

> Wow! Copying without copying! splice() must truly be a great philosopher!

Yeah it sounds crazy, but it's really true :-). You bring in the pages from the source file, then migrate them to the address space of the target file. Bingo, zero copy copies!

It is zero-copy

Posted Apr 7, 2006 22:46 UTC (Fri) by giraffedata (guest, #1954) [Link] (3 responses)

It sounds funny only because you're using "copy" two different ways in the same sentence. When we talk about copying a file, we're talking about copying from disk to disk. When we say zero-copy, we're talking about copying data from memory to memory.

A splice-based file copy does one disk-disk copy, and no memory-memory copy, as contrasted with the traditional file copy that does one disk-disk copy plus two memory-memory copies.

Since the naive observer wouldn't even expect there to be memory-memory copies involved in copying files, "zero copy file copy" shouldn't sound odd at all.

It is zero-copy

Posted Apr 8, 2006 11:14 UTC (Sat) by axboe (subscriber, #904) [Link] (2 responses)

A splice based copy does no copies. The source data is DMA'ed from hardware to the source file page(s), then those pages are moved to the destination and dirtied, so they will eventually go out to disk again with another DMA operation.

A normal copy will DMA those pages in, allocate pages in the target file address space, copy that data over, and then it'll be DMA'ed to disk again. So two DMA operations, and one full copy of all the data.

So a splice based copy will be zero memory copies, and two DMA "copies" (each DMA operation above may be a series of DMA transactions, depending on how large the file is). A normal copy contains the same number of DMA operations, but includes a memory copy.

It is zero-copy

Posted Apr 8, 2006 17:31 UTC (Sat) by giraffedata (guest, #1954) [Link] (1 responses)

> So a splice based copy will be zero memory copies, and two DMA "copies"

I presume "copies" is in quotes because you agree with me that the DMA operations are not copies in the sense we're talking about. (If they were, a "zero-copy" file read wouldn't be zero-copy, and moving data from kernel file data cache to user space would be two copies (kernel memory to register, register to user memory)).

However, the combination of the two DMAs constitutes one disk copy. A disk copy is an instance of replicating data from one place on a disk to another, just as a memory copy is an instance of replicating data from one place in memory to another. And it's worth talking about because it takes time. If you could truly do a zero-copy copy, that would be remarkable. As it stands, "zero-copy copy" is just a trick of words in which you change the definition of copy in the middle of the sentence.

> A normal copy contains the same number of DMA operations, but includes a memory copy.

Actually, the most normal file copy includes two memory copies -- from kernel file data cache of the source file to user space buffer, and from that buffer to the other file's kernel file data cache. With mmap, you can get it down to one, and with direct I/O you can get to zero.

It is zero-copy

Posted Apr 8, 2006 18:26 UTC (Sat) by axboe (subscriber, #904) [Link]

Copy is in quotes, because it's not the CPU doing the copy. Which is what is interesting, and why zero-copy just means zero CPU copies. That is where you pay the cost, at least in CPU cycles and potentially also in cache. So zero-copy definitely isn't just a play on words. It may sometimes be used in silly marketing ways, but if you are CPU bound it makes all the difference in the world that the CPU doesn't have to touch the data.

And yes, the normal copy is indeed two copies, to and from kernel/user space.

You can continue talking if you want, but don't expect a response from me.

heavy-duty philosophy

Posted Apr 10, 2006 1:03 UTC (Mon) by xoddam (subscriber, #2322) [Link]

> splice() must truly be a great philosopher!

This is an in-joke, from the text adventure game version of Douglas
Adams' Hitch-Hiker's Guide to the Galaxy:

>open door
The door explains, in a haughty tone, that the room is occupied
by a super-intelligent robot and that lesser beings (by which
it means you) are not to be admitted. "Show me some tiny example
of your intelligence," it says, "and maybe, just maybe, I might
reconsider."

>get tea
no tea: Dropped.

>get no tea
no tea: Taken.

>inventory
You have:
no tea
tea
a small key
a flowerpot
a thing your aunt gave you which you don't know what it is
a babel fish (in your ear)

>open door
The door is almost speechless with admiration. "Wow. Simultaneous
tea and no tea. My apologies. You are clearly a heavy-duty
philosopher." It opens respectfully.