
Sharing buffers between devices

By Jonathan Corbet
August 15, 2011
CPUs may not have gotten hugely faster in recent years, but they have gained in other ways; a typical system-on-chip (SoC) device now has a number of peripherals which would qualify as reasonably powerful CPUs in their own right. More powerful devices with direct access to the memory bus can take on more demanding tasks. For example, an image frame captured from a camera device can often be passed directly to the graphics processor for display without all of the user-space processing that was once necessary. Increasingly, the CPU's job looks like that of a shop foreman whose main concern is keeping all of the other processors busy.

The foreman's job will be easier if the various devices under its control can communicate easily with each other. One useful addition in this area might be the buffer sharing patch set recently posted by Marek Szyprowski. The idea here is to make it possible for multiple kernel subsystems to share buffers under the control of user space. With this type of feature, applications could wire kernel subsystems together in problem-specific ways, then get out of the way, letting the devices involved process the data as it passes through.

There are (at least) a couple of challenges which must be dealt with to make this kind of functionality safe to export to applications. One is that the application should not be able to "create" buffers at arbitrary kernel addresses. Indeed, kernel-space addresses should not be visible to user space at all, so the kernel must provide some other way for an application to refer to a specific buffer. The other is that a shared buffer must not go away until all of its users have let go of it. A buffer may be created by a specific device driver, but it must persist, even if the device is closed, until nobody else expects it to be there.

The mechanism added in this patch set (this part in particular is credited to Tomasz Stanislawski) is relatively simple - though it will probably get more complex in the future. Kernel code wanting to make a buffer available to other parts of the kernel via user space starts by filling in one of these structures:

    struct shrbuf {
    	void (*get)(struct shrbuf *);
    	void (*put)(struct shrbuf *);
    	unsigned long dma_addr;
    	unsigned long size;
    };

One could immediately raise a number of complaints about this structure: the address should be a dma_addr_t, there's no reason not to put the kernel virtual address there, only physically-contiguous buffers are allowed, etc. It also seems like there could be value in the ability to annotate the state of the buffer (filled or empty, for example) and possibly signal another thread when that state changes. But it's worth remembering that this is an explicitly proof-of-concept patch posting and a lot of things will change. In particular, the eventual plan is to pass a scatterlist around instead of a single physical address.

The get() and put() functions are important: they manage reference counts to the buffer, which must continue to exist until that count goes to zero. Any subsystem depending on a buffer's continued existence should hold a reference to that buffer. The put() function should release the buffer when the last reference is dropped.
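
As a concrete illustration of how a driver might satisfy those rules, the sketch below embeds a shrbuf in a driver-private buffer structure and implements get() and put() with an atomic counter. This is only a guess at how a real driver would use the proof-of-concept API; the my_buffer structure, its fields, and the my_buffer_free() helper are invented for illustration.

    /* Hypothetical driver-side buffer wrapping the proof-of-concept shrbuf. */
    struct my_buffer {
        struct shrbuf sb;
        atomic_t refcount;
        dma_addr_t dma_handle;  /* physical address of the buffer's memory */
        size_t size;
        /* ... other driver-specific state ... */
    };

    static void my_buffer_get(struct shrbuf *sb)
    {
        struct my_buffer *buf = container_of(sb, struct my_buffer, sb);

        atomic_inc(&buf->refcount);
    }

    static void my_buffer_put(struct shrbuf *sb)
    {
        struct my_buffer *buf = container_of(sb, struct my_buffer, sb);

        /* Free the buffer once the last user has dropped its reference. */
        if (atomic_dec_and_test(&buf->refcount))
            my_buffer_free(buf);    /* release the DMA memory and the structure */
    }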

Once this structure exists, it can be passed to:

    int shrbuf_export(struct shrbuf *sb);

The return value (if all goes well) will be an integer file descriptor which can be handed to user space. This file descriptor embodies a reference to the buffer, which now will not be released before the file descriptor is closed. Beyond closing it, there is very little the application can do with the descriptor except hand it to another kernel subsystem; attempts to read from or write to it will fail, for example.
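
Continuing the hypothetical driver sketch from above, exporting then amounts to filling in the shrbuf and passing the descriptor from shrbuf_export() back to the application, typically as the result of a driver-specific ioctl():

    static int my_buffer_export(struct my_buffer *buf)
    {
        buf->sb.get = my_buffer_get;
        buf->sb.put = my_buffer_put;
        buf->sb.dma_addr = buf->dma_handle;
        buf->sb.size = buf->size;

        /* On success this is the file descriptor to hand back to user space. */
        return shrbuf_export(&buf->sb);
    }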

If a kernel subsystem receives a file descriptor which is purported to represent a kernel buffer, it can pass that descriptor to:

    struct shrbuf *shrbuf_import(int fd);

The return value will be the same shrbuf structure (or an ERR_PTR() error value for a file descriptor of the wrong type). A reference is taken on the structure before returning it, so the recipient should call put() at some future time to release it.
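
The importing side might then look something like the sketch below, again with invented names: the driver receives the descriptor from user space through its own ioctl(), resolves it, and points its hardware at the shared memory.

    static int my_device_attach_buffer(struct my_device *dev, int fd)
    {
        struct shrbuf *sb = shrbuf_import(fd);

        if (IS_ERR(sb))
            return PTR_ERR(sb);

        /*
         * Hold on to the buffer (and the reference taken by shrbuf_import())
         * for as long as the device may DMA to or from it; a matching
         * sb->put(sb) belongs in the detach or release path.
         */
        dev->shared_buf = sb;
        my_device_setup_dma(dev, sb->dma_addr, sb->size);
        return 0;
    }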

The patch set includes a new Video4Linux2 ioctl() command (VIDIOC_EXPBUF) enabling the exporting of buffers as file descriptors; a couple of capture drivers have been augmented to support this functionality. No examples of the other side (importing a buffer) have been posted yet.
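
From user space, the flow would presumably look something like the following; the argument structure and the consumer-side ioctl() are assumptions here, since the posting only names the VIDIOC_EXPBUF command itself:

    /* Hypothetical application-side wiring; the structure layout is a guess. */
    struct expbuf_request {
        unsigned int index;   /* which already-queued capture buffer to export */
        int fd;               /* filled in by the kernel on success */
    } req = { .index = 0 };

    if (ioctl(camera_fd, VIDIOC_EXPBUF, &req) == 0)
        ioctl(other_device_fd, MY_IMPORT_BUFFER, req.fd);   /* invented consumer ioctl */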

There has been relatively little commentary on the patch set so far, possibly because it was posted to a couple of relatively obscure mailing lists. It has the look of functionality that could be useful beyond one or two kernel subsystems, though. It would probably make sense for the next iteration, which presumably will have more of the anticipated functionality built into it, to be distributed more widely for review.


Sharing buffers between devices

Posted Aug 18, 2011 12:42 UTC (Thu) by justincormack (subscriber, #70439) [Link] (3 responses)

I thought we had agreed that the userspace representation of a kernel buffer was a pipe? As used in tee and splice etc. It would be nice to keep this consistent.

Sharing buffers between devices

Posted Aug 18, 2011 14:04 UTC (Thu) by cladisch (✭ supporter ✭, #50193) [Link] (1 responses)

Pipes represent byte streams. A plain memory block is better if you want to do random accesses, i.e., if it contains a ring buffer or packets that are reused.

Sharing buffers between devices

Posted Aug 18, 2011 14:11 UTC (Thu) by justincormack (subscriber, #70439) [Link]

No, pipes are supposed to be a general kernel memory (ring) buffer in effect now, which as it happens you can implement a Unix pipe on top of: https://lwn.net/Articles/119682/

Sharing buffers between devices

Posted Aug 18, 2011 19:24 UTC (Thu) by robclark (subscriber, #74945) [Link]

not pipe, but file descriptor

Sharing buffers between devices

Posted Aug 18, 2011 19:58 UTC (Thu) by dougg (guest, #1894) [Link] (2 responses)

A more efficient dd between storage devices could be implemented if the data did not need to be shunted in and out of the user space.

Sharing buffers between devices

Posted Aug 19, 2011 11:40 UTC (Fri) by epa (subscriber, #39769) [Link] (1 responses)

I remember being told about IBM's Micro Channel architecture (MCA) and its ability to let devices talk directly to each other across the MCA bus, without involving the CPU. I imagine this was never really used in PC-compatible systems although it might have been in certain MCA-based mainframes or RS/6000 systems. There might also be the chance for two disks attached to the same controller to copy data directly between themselves without going through the host system at all.

Sharing buffers between devices

Posted Aug 19, 2011 20:31 UTC (Fri) by giraffedata (guest, #1954) [Link]

"A more efficient dd between storage devices could be implemented if the data did not need to be shunted in and out of the user space."

The shunting in and out of user space is easy to eliminate: mmap.

That leaves you with the copying from one device's buffer to the other. For that, Linus invented the 'splice' system call about ten years ago and, as justincormack points out in another comment on this article, it was actually implemented in 2005. Splice takes two file descriptors and a length as arguments, reads that many bytes from one of the devices, and writes them to the other, by DMAing into, then out of, the same memory. https://lwn.net/Articles/119682/. I don't know what the current state of deployment is, though.
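
A minimal sketch of such a copy using the splice() interface as it exists in mainline, where one side of each call must be a pipe (so two splices move each chunk); error handling is mostly omitted:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Copy len bytes from in_fd to out_fd without a user-space bounce buffer. */
    static int copy_range(int in_fd, int out_fd, size_t len)
    {
        int pipefd[2];

        if (pipe(pipefd) < 0)
            return -1;

        while (len > 0) {
            /* Pull a chunk from the source into the pipe's kernel buffer... */
            ssize_t n = splice(in_fd, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
            if (n <= 0)
                break;
            /* ...then push the same bytes out to the destination. */
            splice(pipefd[0], NULL, out_fd, NULL, n, SPLICE_F_MOVE);
            len -= n;
        }
        close(pipefd[0]);
        close(pipefd[1]);
        return len == 0 ? 0 : -1;
    }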

The MCA thing would presumably be the next step, where the data doesn't have to stop over in system memory.

In big systems, where the storage devices are rather separate from the CPUs, this exists in the form that you can tell a device to send some of its contents directly to another device, e.g. through a fibre channel network. I guess the same thing over a PCI-class network can't be far behind.

In fact, my guess is that the bus protocol itself allows this in PCI Express and Infiniband; I don't think the CPU/main memory is particularly special in those protocols. Does somebody know?

Sharing buffers between devices

Posted Aug 19, 2011 16:21 UTC (Fri) by cavok (subscriber, #33216) [Link] (1 responses)

Is this "buffer fd" a proper fd?
May sharing buffers between devices cross also the application boundary?
In such case, what happens to the fd number?

Sharing buffers between devices

Posted Aug 19, 2011 21:20 UTC (Fri) by zlynx (guest, #2285) [Link]

There is a mechanism to send file descriptors across Unix local sockets between processes. Network servers have used this a lot in order to send a socket to another process for doing the actual work.

This is sendmsg/recvmsg with SCM_RIGHTS, I believe.
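
For completeness, the usual pattern looks roughly like this sketch (send side only, error handling omitted):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Pass fd_to_send to the process at the other end of a connected AF_UNIX socket. */
    static ssize_t send_fd(int sock, int fd_to_send)
    {
        char dummy = '*';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        char control[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = { 0 };
        struct cmsghdr *cmsg;

        memset(control, 0, sizeof(control));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

        /* The receiver gets the descriptor via recvmsg() and CMSG_DATA(). */
        return sendmsg(sock, &msg, 0);
    }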

Sharing buffers between devices

Posted Aug 25, 2011 2:36 UTC (Thu) by quanstro (guest, #77996) [Link]

The funny thing is that Plan 9 has been able to do this for over a decade with the kernel-only versions of read and write, bread and bwrite.

Sharing buffers between devices

Posted Aug 26, 2011 14:37 UTC (Fri) by slashdot (guest, #22014) [Link] (1 responses)

What's wrong with just using the virtual address space of the current process? (the good old void* p, size_t size)

If the issue is accessibility via restricted DMA, then add some way to mmap ZONE_DMA or similar memory.

If the issue is physical contiguity, add support to mmap contiguous memory.

If the issue is synchronous IO, make the APIs asynchronous.

Sharing buffers between devices

Posted Sep 10, 2011 2:53 UTC (Sat) by smowton (guest, #57076) [Link]

Building pagetables is certainly more expensive than not doing so; this patch just separates the concepts of get-handle-to-kernel-memory and pin-kernel-memory-for-direct-access, which seems entirely sensible. See also sendfile(), which is a special case of splice().

I *think* splice() should be able to accomplish the same feats as this patchset, provided both ends can provide an FD that adequately represents what we want to do with the buffer; e.g. in the case that we're piping network packets to a video device, the video driver needs to be able to provide an FD representing the target video buffer, pixel format, etc, so something like:

// fd is a socket
int fd2 = ioctl(video_control_fd, IOCTL_GET_SINK_FD, /* description of sink buffer */);
splice(fd, fd2, ...);

In the language of this patchset the network driver would provide an ioctl that yields a buffer FD, and the video driver would provide one that copies buffer data into specified video memory.

int bufferfd;
int fd = ioctl(socket_or_if_control_fd, IOCTL_GET_PACKET_BUFFER, &bufferfd);
ioctl(video_control_fd, IOCTL_COPY_BUFFER_TO_VMEM, bufferfd, /* buffer description */);

Basically a splice()-based approach would need more ioctls in order to establish something that looks like a "connection endpoint" that means what we want it to, but would mean fewer FD table operations if we want to repeatedly perform a similar operation (likely?), whilst an ioctl() + fd-per-buffer approach means lots of fd table operations (efficient locks in the table?) but fewer syscalls if we're routing buffers in a way that would effectively compel a splice operation to create a new channel per operation.

