tee() with your splice()?

[Posted April 11, 2006 by corbet]

The new splice() system call was covered here last week. As was predicted then, this new kernel API has continued to evolve; many of the non-fix patches going into the post-2.6.17-rc1 mainline involved changes to splice().

For starters, the prototype of the splice() system call has changed:

    long splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out,
                size_t len, unsigned int flags);

The two offset values (off_in and off_out) are new; they indicate where each file descriptor should be positioned prior to beginning the transfer of data. Note that these offsets are passed via pointers; user space can use a NULL pointer to indicate that the current offset should be used. Note also that these offsets do not work like the offsets in pread() or pwrite(): they will actually change the offset associated with the file descriptor. Providing an offset for a file descriptor associated with a pipe is an error.
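As an illustration, a call using the offset arguments might look like the sketch below. It assumes the interface as glibc exposes it today (declared in <fcntl.h> when _GNU_SOURCE is defined); file_to_pipe() is a hypothetical helper name, not something taken from the patch. One caveat: the splice(2) man page for current kernels states that a non-NULL off_in leaves the descriptor's own file offset unchanged, which differs from the behavior described above for the 2.6.17-rc code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes splice() in <fcntl.h> on glibc */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): move up to len bytes from a
 * regular file, starting at byte offset start, into the write end of a pipe.
 * Returns the number of bytes moved, or -1 on error.
 */
ssize_t file_to_pipe(int file_fd, off_t start, int pipe_wr, size_t len)
{
    loff_t off = start;      /* offsets are passed by pointer */

    /* The pipe side gets NULL: supplying an offset for a pipe is an error. */
    return splice(file_fd, &off, pipe_wr, NULL, len, 0);
}
```

Passing NULL for the file side as well would make the transfer start at the descriptor's current offset.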

Internally, the splice() code has seen a couple of interesting changes. One of them (in the regular pipe code, actually) is the creation of a new pipe_inode_info structure to represent the core machinery behind a pipe. This structure can stand apart from the normal inode structure. Many of the internal interfaces have been changed to use the new structure, including the new methods in the file_operations structure:

    ssize_t (*splice_write)(struct pipe_inode_info *pipe,
                            struct file *out, size_t len,
                            unsigned int flags);
    ssize_t (*splice_read)(struct file *in, struct pipe_inode_info *pipe,
                           size_t len, unsigned int flags);

Since there are still few implementations of these methods in the kernel, the changes are not particularly disruptive.

Next in the list is support for directly splicing two file descriptors where neither is a pipe. This functionality is not (yet) available to user space via splice(), but it is used internally to implement sendfile() with the splice() mechanism. The direct splicing is implemented using a hidden pipe_inode_info structure (i.e. a pipe); it is stored in the new splice_pipe field of the task structure, so each process can only have one such connection running at any given time.
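The kernel's hidden pipe is not visible to applications, but the same file-to-file pattern can be approximated from user space with an explicit pipe and two splice() calls. The following is only a sketch of the idea, not the kernel's sendfile() code; copy_via_pipe() is a hypothetical name, and the prototype assumed is the one glibc exposes today.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes splice() in <fcntl.h> on glibc */
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from in_fd to out_fd (neither
 * of which is a pipe) by bouncing the data through a private pipe.
 * Returns the number of bytes copied, or -1 on error.
 */
ssize_t copy_via_pipe(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    if (pipe(p) < 0)
        return -1;

    while (len > 0) {
        /* Stage 1: source -> pipe. At most one pipe's worth moves per call. */
        ssize_t in = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (in < 0)
            total = -1;
        if (in <= 0)
            break;                      /* error or end of input */
        len -= (size_t)in;

        /* Stage 2: pipe -> destination, until the pipe has been drained. */
        while (in > 0) {
            ssize_t out = splice(p[0], NULL, out_fd, NULL, (size_t)in,
                                 SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                goto done;
            }
            in -= out;
            total += out;
        }
    }
done:
    close(p[0]);
    close(p[1]);
    return total;
}
```

Because this pipe belongs to the caller rather than to the task structure, a program written this way could keep one pipe per transfer.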

One patch which has not been merged - and will likely wait until 2.6.18 at this point - is the tee() system call:

    long tee(int fdin, int fdout, size_t len, unsigned int flags);

This call requires that both file descriptors be pipes. It simply connects fdin to fdout, transferring up to len bytes between them. Unlike splice(), however, tee() does not consume the input, enabling the input data to be read normally later on by the calling process. Jens Axboe provides an example implementation of the user-space tee utility, which comes down to a couple of calls:

    len = tee(STDIN_FILENO, STDOUT_FILENO, INT_MAX, SPLICE_F_NONBLOCK);
    splice(STDIN_FILENO, NULL, out_file, NULL, len, 0);

The input data will be written both to the standard output and the file represented by out_file without ever being copied to or from user space. To be sure of copying the entire input data stream, the application must perform the above calls within a loop, of course; see the full example at the end of the tee() patch.
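Such a loop might look like the sketch below, which is written from the description above rather than copied from Jens Axboe's example. It assumes the tee() and splice() prototypes that glibc exposes today, uses blocking calls instead of SPLICE_F_NONBLOCK so that it does not spin, and tee_to_file() is a hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes tee() and splice() in <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

/*
 * Hypothetical helper: duplicate everything arriving on pipe_in to pipe_out
 * and also store it in file_fd. Both pipe_in and pipe_out must be pipes.
 * Returns 0 once the input reaches end-of-file, or -1 on error.
 */
int tee_to_file(int pipe_in, int pipe_out, int file_fd)
{
    for (;;) {
        /* Duplicate whatever is available; pipe_in is NOT consumed. */
        ssize_t len = tee(pipe_in, pipe_out, INT_MAX, 0);
        if (len < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (len == 0)
            return 0;                   /* all writers closed, pipe empty */

        /* Now consume those same bytes from pipe_in into the file. */
        while (len > 0) {
            ssize_t n = splice(pipe_in, NULL, file_fd, NULL, (size_t)len,
                               SPLICE_F_MOVE);
            if (n < 0)
                return -1;
            if (n == 0)
                break;
            len -= n;
        }
    }
}
```

A tee-like utility could call tee_to_file(STDIN_FILENO, STDOUT_FILENO, out_file), provided that both standard input and standard output are pipes.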

This call is quite new, and may well change before it makes it into the mainline. Among other things, it might get a new name, since a name as simple as tee() may already be in use in a number of applications.


tee() with your splice()?

Posted Apr 13, 2006 14:31 UTC (Thu) by RobSeace (subscriber, #4435) [Link] (3 responses)

Why not just have tee()'s non-consumption of input behavior simply be
specified by a new flag for splice(), rather than have a completely
separate syscall?

tee() with your splice()?

Posted Apr 13, 2006 15:49 UTC (Thu) by axboe (subscriber, #904) [Link] (2 responses)

Because then you want two outputs at least, otherwise you'd end up with a new pipe just containing what you consumed. Not very interesting. And it was decided that the semantics of a one -> one tee was more appropriate than a one -> two where the input is consumed.

With the current tee, you can think of it as simply a pipe dupe with memcpy()-like semantics.

tee() with your splice()?

Posted Apr 13, 2006 16:36 UTC (Thu) by RobSeace (subscriber, #4435) [Link] (1 responses)

I'm not sure I understand why it would have to be any different internally...
I understand what tee() is doing, but I was just saying why not have a new
flag to splice() specify doing exactly that, rather than add a new syscall
to do it... Then, your example code that does tee() followed by splice(),
would instead just do two separate splice() calls... But, internally,
splice() could do the same thing tee() does (when the hypothetical new flag
is set), couldn't it? Or, is there some internal implementation trickery
that I'm completely missing? (Or, is it just that it seems unclean to add
such behavior to splice()? Ie: splice() should always be expected to
consume the input, and you don't want to break that assumption, even with a
special flag?)

tee() with your splice()?

Posted Apr 25, 2006 8:24 UTC (Tue) by hozelda (guest, #19341) [Link]

I really am not sure of the answer to your question, but it made me wonder why not have Linux just have one system call "function()" and simply call it with a different flag depending on whether we want to read or write or chmod etc. function(READ, fd, buff, bytes) to read, function(WRITE, fd, ...) to write, etc.

Or if we just had to have 2, then "function" and "ioctl" (that's my vote).

If 3: "function" "ioctl" and "read." In particular, read(WRITE, fd, ...) for opening a file, read (OPEN, ...) for duplicating a file descriptor, read (IOCTL, ...) for closing a network connection, function(READ, ...) for reserving memory, and function (CHMOD, ...) for shutting the system. Everything else should be doable with ioctl(..).

Cool! We should suggest this on the mailing lists!!!

[On a serious note, splice and tee may be related; for example, splice(WITH_TEE) might just increment counters beyond 1 and add ptrs to some linked list, etc. I don't really know. Maybe. But what really concerns me is your desire to do away with system calls.]

tee() with your splice()?

Posted Apr 23, 2006 4:30 UTC (Sun) by anLWNreader (guest, #36915) [Link] (1 responses)

> it is stored in the new splice_pipe field of the task structure, so each process can only have one such connection running at any given time.

This seems like an unfortunate limitation to me... As sendfile is now using splice, it means every process can only have one sendfile going at a given time? I hope I just misunderstood something... Think about web servers with possibly hundreds of threads all wanting to do sendfile.

tee() with your splice()?

Posted May 10, 2006 17:56 UTC (Wed) by nevyn (guest, #33129) [Link]

The task struct is for tasks, not processes. The article was misleading. A thread is a task, as is a process. So each thread/process can only have one sendfile() running at once (which is basically how it is now -- at least I'm assuming they don't mean one sendfile over all open fds, which would make it worthless).