
Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.22.1. Linus announced the release of the 2.6.22 kernel on July 8. For those just tuning in: much has happened in this development cycle, including the addition of the mac80211 (formerly "Devicescape") wireless networking stack, the eventfd system calls, some new TCP congestion control algorithms, a rewritten CFQ I/O scheduler, a new IEEE1394 ("Firewire") stack, support for the Blackfin architecture, the long-awaited IVTV TV tuner driver, and much more. See the KernelNewbies 2.6.22 page for vast amounts of detail, the long-format changelog for even more detail, or the short-form changelog for a (relatively) concise listing of patches in this release.

The 2.6.22.1 update, released on July 10, adds an SCTP security fix which had somehow managed not to get into 2.6.22.

The 2.6.23 merge window has opened, and some 500 patches have found their way into the mainline git repository (as of this writing).

For older kernels: the 2.6.20.15 and 2.6.21.6 stable updates were released on July 6; each contains a single fix for a security problem in the netfilter H323 connection tracking code.


Kernel development news

Quote of the week

We all know swap prefetch has been tested out the wazoo since Moses was a little boy, is compile-time and runtime selectable, and gives an important and quantifiable performance increase to desktop systems. Save a Redhat employee some time reinventing the wheel and just merge it. This wheel already has dope 21" rims...
-- Matthew Hawkins


The 2.6.23 merge window opens

Linus opened the 2.6.23 merge window with a bang: the first thing merged was the CFS CPU scheduler. The group scheduling feature is not available, however, since it depends on the generic process containers patch, and it would appear that containers will have to wait another cycle.

Other patches merged so far include an IDE update, the rtl8187 wireless network driver (the first driver to use the mac80211 stack), support for the Yukon EX (88e8071) network adapter chipset, Xbox 360 gamepad support, a big rework of the splice() code which replaces sendfile() and adds an internal vmsplice_to_user() feature, an LZO1X compression implementation, and the removal of a number of ancient CDROM drivers.

The 2.6.23 process has barely begun; expect a great deal of work to be merged yet. Andrew Morton's 2.6.23 merge plan is useful reading for those who would like to know what else may go in; among other things, it looks like this kernel will include fallocate(), lguest, and the on-demand readahead patches. Bear in mind that much of what goes into 2.6.23 will not get there by way of Andrew, so this is far from a complete list of what this kernel will contain.


An API for virtual I/O: virtio

Linux has an abundance of virtualization choices, each with its own way of dealing with I/O. A recent set of kernel patches, submitted to the kernel-virtualization mailing list by Rusty Russell, would allow different virtualization implementations to share drivers by using a virtual I/O interface called virtio. There have been several public iterations of the interface with the latest, draft IV, narrowing in on what appears to be an acceptable solution, at least with the virtualization folks.

There are always questions about adding yet another layer into the kernel, but the advantages for virtio are numerous. Russell outlines several in one of his posts to the kernel-virtualization list. There is some amount of urgency in devising a solution because several of the virtualization projects are either working on or reworking their virtual I/O. If an established mechanism that already provides working block and network drivers existed, those projects, as well as any newcomers, would be likely to use it.

Another key element is to try and prevent a major proliferation of kernel drivers each handling slightly different virtual block I/O. Trying to tune and maintain those drivers could become a major headache, so virtio separates the guest Linux side of the driver from the code that is specific to the hypervisor implementation. Each group of developers can maintain the code on their side of the API without changing the other, unless, of course, the virtio API itself needs to change. It is likely that some kind of virtual I/O will be adopted, as the kernel developers are likely to be unwilling to merge new drivers for each different virtualization mechanism that comes along; some commonality is required.

The basic abstraction used by virtio is a "buffer", which consists of a struct scatterlist array. The array contains "out" entries describing data destined for the underlying hypervisor driver, as well as "in" entries for that driver to store data to return to the guest driver. The order is fixed (out followed by in) and a count of each is part of the buffer description, which allows the hypervisor driver to determine what it has. This buffer abstraction encapsulates everything needed to communicate data to be written to or read from the hypervisor driver and, eventually, the underlying device. A guest driver that uses the virtio interface hands off buffers to the hypervisor driver and awaits their completion.

At its core, the virtio API is a set of functions that are provided by the hypervisor driver to be used by the guest:

    struct virtqueue_ops {
        int (*add_buf)(struct virtqueue *vq,
                       struct scatterlist sg[],
                       unsigned int out_num,
                       unsigned int in_num,
                       void *data);

        void (*sync)(struct virtqueue *vq);

        void *(*get_buf)(struct virtqueue *vq, unsigned int *len);

        int (*detach_buf)(struct virtqueue *vq, void *data);

        bool (*restart)(struct virtqueue *vq);
    };

This operations vector is initialized by the hypervisor and passed to the guest driver using a probe() function. The guest then sets up its data structures and registers with its kernel as a block or network device driver.

The basic operation uses add_buf() to register one or more buffers with the hypervisor driver. That driver is kicked via the sync() call to start processing the buffers. Each struct virtqueue has a callback associated with it which will be called when some buffers have completed. The guest then calls the get_buf() function to retrieve completed buffers. To support polling, which is used by network drivers, get_buf() can be called at any time, returning NULL if none have completed. The guest driver can disable further callbacks, at any time, by returning zero from the callback. The restart() routine is then used to re-enable them. Finally, the detach_buf() call is used during shutdown to cancel the operation indicated by the buffer and to retrieve it from the hypervisor driver.
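The flow described above can be illustrated with a small user-space model. What follows is only a sketch, not Russell's code: struct scatterlist and struct virtqueue are simplified stand-ins, every toy_* name is invented for illustration, the "hypervisor" completes all buffers the moment sync() is called, and the return value of restart() is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct scatterlist. */
struct scatterlist { void *addr; unsigned int length; };

#define TOY_SLOTS 8
enum toy_state { TOY_FREE, TOY_PENDING, TOY_DONE };
struct toy_slot { enum toy_state state; void *data; unsigned int in_len; };

/* Toy virtqueue: a fixed table of slots plus the completion callback. */
struct virtqueue {
    struct toy_slot slot[TOY_SLOTS];
    bool (*callback)(struct virtqueue *vq); /* false return disables callbacks */
    bool callbacks_on;
};

struct virtqueue_ops {
    int (*add_buf)(struct virtqueue *vq, struct scatterlist sg[],
                   unsigned int out_num, unsigned int in_num, void *data);
    void (*sync)(struct virtqueue *vq);
    void *(*get_buf)(struct virtqueue *vq, unsigned int *len);
    int (*detach_buf)(struct virtqueue *vq, void *data);
    bool (*restart)(struct virtqueue *vq);
};

static int toy_add_buf(struct virtqueue *vq, struct scatterlist sg[],
                       unsigned int out_num, unsigned int in_num, void *data)
{
    unsigned int i, in_len = 0;

    assert(data != NULL);  /* NULL from get_buf() means "nothing completed" */
    /* The "in" entries follow the "out" entries in the array. */
    for (i = out_num; i < out_num + in_num; i++)
        in_len += sg[i].length;
    for (i = 0; i < TOY_SLOTS; i++) {
        if (vq->slot[i].state == TOY_FREE) {
            vq->slot[i] = (struct toy_slot){ TOY_PENDING, data, in_len };
            return 0;
        }
    }
    return -1;  /* no room */
}

/* Kick: this toy "hypervisor" completes every pending buffer at once. */
static void toy_sync(struct virtqueue *vq)
{
    bool completed = false;
    unsigned int i;

    for (i = 0; i < TOY_SLOTS; i++) {
        if (vq->slot[i].state == TOY_PENDING) {
            vq->slot[i].state = TOY_DONE;
            completed = true;
        }
    }
    if (completed && vq->callbacks_on && vq->callback)
        vq->callbacks_on = vq->callback(vq);
}

static void *toy_get_buf(struct virtqueue *vq, unsigned int *len)
{
    unsigned int i;

    for (i = 0; i < TOY_SLOTS; i++) {
        if (vq->slot[i].state == TOY_DONE) {
            vq->slot[i].state = TOY_FREE;
            *len = vq->slot[i].in_len;
            return vq->slot[i].data;
        }
    }
    return NULL;  /* nothing has completed; safe to call when polling */
}

static int toy_detach_buf(struct virtqueue *vq, void *data)
{
    unsigned int i;

    for (i = 0; i < TOY_SLOTS; i++) {
        if (vq->slot[i].state == TOY_PENDING && vq->slot[i].data == data) {
            vq->slot[i].state = TOY_FREE;
            return 0;
        }
    }
    return -1;  /* not found, or already completed */
}

static bool toy_restart(struct virtqueue *vq)
{
    vq->callbacks_on = true;  /* re-enable callbacks; return value assumed */
    return true;
}

static const struct virtqueue_ops toy_ops = {
    .add_buf = toy_add_buf, .sync = toy_sync, .get_buf = toy_get_buf,
    .detach_buf = toy_detach_buf, .restart = toy_restart,
};
```

A guest driver in this model would call toy_ops.add_buf() with its own token as the data argument, kick the queue with toy_ops.sync(), and then collect tokens with toy_ops.get_buf() until it returns NULL.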

As part of his patches, Russell has working example block and network drivers using the virtio interface. Each uses the virtio API differently, and the requirements of each kind of device have pushed the evolution of the interface into its current form. He has also posted an example of a driver implementing virtio for his lguest hypervisor.

The block driver uses a protocol in which the buffer always has at least one out element and one in element. The first out element passes the sector and type (read or write) information to the hypervisor driver and the first in element receives the status of the request. For a write, there are additional out elements, whereas for a read, there are additional in elements. When the I/O completes, the callback is invoked and the get_buf() calls return the completed buffers.
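As a rough illustration of that layout, the sketch below fills a three-entry scatterlist for a single-segment request. The header structure, the one-byte status, and all toy_* names are hypothetical; the actual structures in Russell's patches may look quite different. Only the ordering rule (out entries first, then in entries) comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct scatterlist. */
struct scatterlist { void *addr; unsigned int length; };

/* Hypothetical request header carried in the first "out" element. */
struct toy_blk_hdr { uint64_t sector; bool is_write; };

/* The two counts that would be handed to add_buf(). */
struct toy_blk_layout { unsigned int out_num, in_num; };

/*
 * Lay out one request with a single data segment. All "out" entries
 * come first, followed by the "in" entries.
 */
static struct toy_blk_layout toy_blk_fill(struct scatterlist sg[3],
                                          struct toy_blk_hdr *hdr,
                                          void *data, unsigned int len,
                                          uint8_t *status)
{
    assert(hdr && data && status);

    sg[0] = (struct scatterlist){ hdr, sizeof(*hdr) };           /* out: header */
    if (hdr->is_write) {
        sg[1] = (struct scatterlist){ data, len };               /* out: payload */
        sg[2] = (struct scatterlist){ status, sizeof(*status) }; /* in: status */
        return (struct toy_blk_layout){ 2, 1 };
    }
    sg[1] = (struct scatterlist){ status, sizeof(*status) };     /* in: status */
    sg[2] = (struct scatterlist){ data, len };                   /* in: payload */
    return (struct toy_blk_layout){ 1, 2 };
}
```

A write thus ends up with two out entries and one in entry, while a read has one out entry and two in entries; in both cases the status byte is the first in entry.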

The network driver uses separate virtqueues for sending and receiving packets, which allows it to avoid any locking between the two. Each side only uses half of the scatterlist, out for sending and in for receiving. One of the major differences from "draft III" is combining the two types of buffers; previously there were "inbufs" and "outbufs" and the operations vector had calls for each type. By noticing that they could be combined while still supporting single direction buffers, Russell has halved the number of operations that need to be implemented.

Currently, a hypervisor that wants to provide virtio devices to its guests must arrange to call the virtblock_probe() or virtnet_probe() functions. Any device discovery must be handled by the hypervisor and the guest driver is linked to the hypervisor driver at compile time. Dynamic, mix and match, hypervisor/guest combinations are not yet available, but will be down the road; proposals are already being floated on the kernel-virtualization list.

In a blog posting, Russell describes the tension between performance and abstraction:

The danger is to come up with an abstraction so far removed from what's actually happening that performance sucks, there's more glue code than actual driver code and there are seemingly arbitrary correctness requirements. But being efficient for both network and block devices is also quite a trick.

It remains to be seen if the performance can live up to the needs of the various virtualization projects. If it does, and the interface is abstract enough to handle the kinds of virtual devices required, we should see some kind of push to get it included in the kernel sometime soon.


Video4Linux2 part 6b: Streaming I/O

The LWN.net Video4Linux2 API series.
The previous installment in this series discussed how to transfer video frames with the read() and write() system calls. Such an implementation can get the basic job done, but it is not normally the preferred method for performing video I/O. For the highest performance and the best information transfer, video drivers should support the V4L2 streaming I/O API.

With the read() and write() methods, each video frame is copied between user and kernel space as part of the I/O operation. When streaming I/O is being used, this copying does not happen; instead, the application and the driver exchange pointers to buffers. These buffers will be mapped into the application's address space, making it possible to perform zero-copy frame I/O. There are two different types of streaming I/O buffers:

  • Memory-mapped buffers (type V4L2_MEMORY_MMAP) are allocated in kernel space; the application maps them into its address space with the mmap() system call. The buffers can be large, contiguous DMA buffers, virtual buffers created with vmalloc(), or, if the hardware supports it, they can be located directly in the video device's I/O memory.

  • User-space buffers (V4L2_MEMORY_USERPTR) are allocated by the application in user space. Clearly, in this situation, no mmap() call is required, but the driver may have to work harder to support efficient I/O to user-space buffers.

Note that drivers are not required to support streaming I/O, and, if they do support streaming, they do not have to handle both buffer types. A driver which is more flexible will support more applications; in practice, it seems that most applications are written to use memory-mapped buffers. It is not possible to use both types of buffer simultaneously.

We will now delve into the numerous grungy details involved in supporting streaming I/O. Any Video4Linux2 driver writer will need to understand this API; it is worth noting, however, that there is a higher-level API which can help in the writing of streaming drivers. That layer (called video-buf) can make life easier when the underlying device can support scatter/gather I/O. The video-buf API will be discussed in a future installment.

Drivers which support streaming I/O should inform the application of that fact by setting the V4L2_CAP_STREAMING flag in their vidioc_querycap() method. Note that there is no way to describe which buffer types are supported; that comes later.

The v4l2_buffer structure

When streaming I/O is active, frames are passed between the application and the driver in the form of struct v4l2_buffer. This structure is a complicated beast which will take a while to describe. A good starting point is to note that there are three fundamental states that a buffer can be in:

  • In the driver's incoming queue. Buffers are placed in this queue by the application in the expectation that the driver will do something useful with them. For a video capture device, buffers in the incoming queue will be empty, waiting for the driver to fill them with video data. For an output device, these buffers will have frame data to be sent to the device.

  • In the driver's outgoing queue. These buffers have been processed by the driver and are waiting for the application to claim them. For capture devices, outgoing buffers will have new frame data; for output devices, these buffers are empty.

  • In neither queue. In this state, the buffer is owned by user space and will not normally be touched by the driver. This is the only time that the application should do anything with the buffer. We'll call this the "user space" state.

These states, and the operations which cause transitions between them, come together as shown in the diagram below:

[Buffer states]
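For readers who prefer code to diagrams, the three states can be written as a small transition function. This is a sketch only: the enum and function names are invented here, and the events correspond to the VIDIOC_QBUF, VIDIOC_DQBUF, and stream-off operations described later in this article, plus the driver finishing its work on a buffer.

```c
#include <assert.h>

enum buf_state { BUF_USER, BUF_INCOMING, BUF_OUTGOING };
enum buf_event { EV_QBUF, EV_DRIVER_DONE, EV_DQBUF, EV_STREAMOFF };

/*
 * Return the state a buffer moves to when the given event happens,
 * or -1 if the event is not valid in the current state.
 */
static int buf_next_state(enum buf_state s, enum buf_event e)
{
    assert(s == BUF_USER || s == BUF_INCOMING || s == BUF_OUTGOING);

    switch (e) {
    case EV_QBUF:         /* application hands the buffer to the driver */
        return s == BUF_USER ? BUF_INCOMING : -1;
    case EV_DRIVER_DONE:  /* driver has filled (or emptied) the buffer */
        return s == BUF_INCOMING ? BUF_OUTGOING : -1;
    case EV_DQBUF:        /* application claims the processed buffer */
        return s == BUF_OUTGOING ? BUF_USER : -1;
    case EV_STREAMOFF:    /* everything goes back to user space */
        return BUF_USER;
    }
    return -1;
}
```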

The actual v4l2_buffer structure looks like this:

    struct v4l2_buffer
    {
	__u32			index;
	enum v4l2_buf_type      type;
	__u32			bytesused;
	__u32			flags;
	enum v4l2_field		field;
	struct timeval		timestamp;
	struct v4l2_timecode	timecode;
	__u32			sequence;

	/* memory location */
	enum v4l2_memory        memory;
	union {
		__u32           offset;
		unsigned long   userptr;
	} m;
	__u32			length;
	__u32			input;
	__u32			reserved;
    };

The index field is a sequence number identifying the buffer; it is only used with memory-mapped buffers. Like other objects which can be enumerated in the V4L2 interface, memory-mapped buffers start with index 0 and go up sequentially from there. The type field describes the type of the buffer, usually V4L2_BUF_TYPE_VIDEO_CAPTURE or V4L2_BUF_TYPE_VIDEO_OUTPUT.

The size of the buffer is given by length, which is in bytes. The size of the image data contained within the buffer is found in bytesused; obviously bytesused <= length. For capture devices, the driver will set bytesused; for output devices the application must set this field.

field describes which field of an image is stored in the buffer; fields were discussed in part 5a of this series.

The timestamp field, for input devices, tells when the frame was captured. For output devices, the driver should not send the frame out before the time found in this field; a timestamp of zero means "as soon as possible." The driver will set timestamp to the time that the first byte of the frame was transferred to the device - or as close to that time as it can get. timecode can be used to hold a timecode value, useful for video editing applications; see this table for details on timecodes.

The driver maintains an incrementing count of frames passing through the device; it stores the current sequence number in sequence as each frame is transferred. For input devices, the application can watch this field to detect dropped frames.
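On the application side, dropped-frame detection can be as simple as comparing the sequence numbers of two consecutively dequeued buffers. The helper below is a minimal sketch with an invented name; it assumes that the second buffer really was captured after the first, and relies on unsigned arithmetic to cope with the 32-bit counter wrapping around.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of frames lost between two consecutively dequeued buffers,
 * given the sequence field of each. Unsigned subtraction gives the
 * right answer even when the counter wraps past 0xffffffff.
 */
static uint32_t frames_dropped(uint32_t prev_seq, uint32_t cur_seq)
{
    return cur_seq - prev_seq - 1;
}
```

If the previous buffer carried sequence number 10 and the current one carries 14, frames 11 through 13 never reached the application.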

memory tells whether the buffer is memory-mapped or user-space. For memory-mapped buffers, m.offset describes where the buffer is to be found. The specification describes it as "the offset of the buffer from the start of the device memory," but the truth of the matter is that it is simply a magic cookie that the application can pass to mmap() to specify which buffer is being mapped. For user-space buffers, instead, m.userptr is the user-space address of the buffer.

The input field can be used to quickly switch between inputs on a capture device - assuming the device supports quick switching between frames. The reserved field should be set to zero.

Finally, there are several flags defined:

  • V4L2_BUF_FLAG_MAPPED indicates that the buffer has been mapped into user space. It is only applicable to memory-mapped buffers.

  • V4L2_BUF_FLAG_QUEUED: the buffer is in the driver's incoming queue.

  • V4L2_BUF_FLAG_DONE: the buffer is in the driver's outgoing queue.

  • V4L2_BUF_FLAG_KEYFRAME: the buffer holds a key frame - useful in compressed streams.

  • V4L2_BUF_FLAG_PFRAME and V4L2_BUF_FLAG_BFRAME are also used with compressed streams; they indicate predicted or difference frames.

  • V4L2_BUF_FLAG_TIMECODE: the timecode field is valid.

  • V4L2_BUF_FLAG_INPUT: the input field is valid.

Buffer setup

Once a streaming application has performed its basic setup, it will turn to the task of organizing its I/O buffers. The first step is to establish a set of buffers with the VIDIOC_REQBUFS ioctl(), which is turned by V4L2 into a call to the driver's vidioc_reqbufs() method:

    int (*vidioc_reqbufs) (struct file *file, void *private_data, 
			   struct v4l2_requestbuffers *req);

Everything of interest will be in the v4l2_requestbuffers structure, which looks like this:

    struct v4l2_requestbuffers
    {
	__u32			count;
	enum v4l2_buf_type      type;
	enum v4l2_memory        memory;
	__u32			reserved[2];
    };

The type field describes the type of I/O to be done; it will usually be either V4L2_BUF_TYPE_VIDEO_CAPTURE for a video acquisition device or V4L2_BUF_TYPE_VIDEO_OUTPUT for an output device. There are other types, but they are beyond the scope of this article.

If the application wants to use memory-mapped buffers, it will set memory to V4L2_MEMORY_MMAP and count to the number of buffers it wants to use. If the driver does not support memory-mapped buffers, it should return -EINVAL. Otherwise, it should allocate the requested buffers internally and return zero. On return, the application will expect the buffers to exist, so any part of the task which could fail (memory allocation, for example) should be done at this stage.

Note that the driver is not required to allocate exactly the requested number of buffers. In many cases there is a minimum number of buffers which makes sense; if the application requests fewer than the minimum, it may actually get more buffers than it asked for. In your editor's experience, for example, the mplayer application will request two buffers, which makes it susceptible to overruns (and thus lost frames) if things slow down in user space. By enforcing a higher minimum buffer count (adjustable with a module parameter), the cafe_ccic driver is able to make the streaming I/O path a little more robust. The count field should be set to the number of buffers actually allocated before the method returns.

Setting count to zero is a way for the application to request that all existing buffers be released. In this case, the driver must stop any DMA operations before freeing the buffers or terrible things could happen. It is also not possible to free buffers if they are currently mapped into user space.

If, instead, user-space buffers are to be used, the only fields which matter are the buffer type and a value of V4L2_MEMORY_USERPTR in the memory field. The application need not specify the number of buffers that it intends to use; since the allocation will be happening in user space, the driver need not care. If the driver supports user-space buffers, it need only note that the application will be using this feature and return zero; otherwise the usual -EINVAL return is called for.

The VIDIOC_REQBUFS command is the only way for an application to discover which types of streaming I/O buffer are supported by a given driver.
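The decision logic of a vidioc_reqbufs() method can be modeled in user space as follows. This is a sketch under several assumptions: the toy_* structures are reduced stand-ins for the real V4L2 and driver structures, the minimum and maximum buffer counts are a made-up driver policy, actual memory allocation is left out, and the -EBUSY return for still-mapped buffers is one plausible choice rather than anything the article prescribes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for V4L2_MEMORY_MMAP and V4L2_MEMORY_USERPTR. */
enum toy_memory { TOY_MEMORY_MMAP = 1, TOY_MEMORY_USERPTR = 2 };

/* Reduced version of struct v4l2_requestbuffers. */
struct toy_requestbuffers {
    uint32_t count;
    enum toy_memory memory;
};

/* Hypothetical per-device state. */
struct toy_dev {
    bool supports_mmap, supports_userptr;
    uint32_t min_buffers, max_buffers;  /* driver policy */
    uint32_t nbufs;                     /* buffers currently allocated */
    uint32_t nmapped;                   /* buffers mapped into user space */
    bool using_userptr;
};

static int toy_reqbufs(struct toy_dev *dev, struct toy_requestbuffers *req)
{
    assert(dev->min_buffers <= dev->max_buffers);

    if (req->memory == TOY_MEMORY_USERPTR) {
        if (!dev->supports_userptr)
            return -EINVAL;
        dev->using_userptr = true;      /* nothing to allocate; just take note */
        return 0;
    }
    if (req->memory != TOY_MEMORY_MMAP || !dev->supports_mmap)
        return -EINVAL;

    if (req->count == 0) {              /* request to release all buffers */
        if (dev->nmapped > 0)
            return -EBUSY;              /* assumed error code */
        /* A real driver would stop DMA here before freeing anything. */
        dev->nbufs = 0;
        return 0;
    }

    /* The driver may adjust the count; report what was actually allocated. */
    if (req->count < dev->min_buffers)
        req->count = dev->min_buffers;
    if (req->count > dev->max_buffers)
        req->count = dev->max_buffers;
    /* Real allocation (and its failure path) would go here. */
    dev->nbufs = req->count;
    return 0;
}
```

An application asking for two buffers from a device with a minimum of three would see count come back as 3, which is the behavior described above for the cafe_ccic driver.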

Mapping buffers into user space

If user-space buffers are being used, the driver will not see any more buffer-related calls until the application starts putting buffers on the incoming queue. Memory-mapped buffers require more setup, though. The application will typically step through each allocated buffer and map it into its address space. The first stop is the VIDIOC_QUERYBUF command, which becomes a call to the driver's vidioc_querybuf() method:

    int (*vidioc_querybuf)(struct file *file, void *private_data, 
                           struct v4l2_buffer *buf);

On entry to this method, the only fields of buf which will be set are type (which should be checked against the type specified when the buffers were allocated) and index, which identifies the specific buffer. The driver should make sure that index makes sense and fill in the rest of the fields in buf. Typically drivers store an array of v4l2_buffer structures internally, so the core of a vidioc_querybuf() method is just a structure assignment.

The only way for an application to access memory-mapped buffers is to map them into its address space, so a vidioc_querybuf() call will typically be followed by a call to the driver's mmap() method - this method, remember, is stored in the fops field of the video_device structure associated with this device. How the driver handles mmap() will depend on just how the buffers are set up in the kernel. If the buffer can be mapped up front with remap_pfn_range() or remap_vmalloc_range(), that should be done at this time. For buffers in kernel space, pages can also be mapped individually at page-fault time by setting up a nopage() method in the usual way. A good discussion of handling mmap() can be found in Linux Device Drivers for those who need it.

When mmap() is called, the VMA structure passed in should have the address of one of your buffers in the vm_pgoff field - right-shifted by PAGE_SHIFT, of course. It should, in particular, be the offset value that your driver returned in response to a VIDIOC_QUERYBUF call. Please iterate through your list of buffers and be sure that the incoming address matches one of them; video drivers should not be a means by which hostile programs can map arbitrary regions of memory.

The offset value you provide can be almost anything, incidentally. Some drivers just return (index<<PAGE_SHIFT), meaning that the incoming vm_pgoff field should just be the buffer index. The one thing you should not do is store the actual kernel-space address of the buffer in offset; leaking kernel addresses into user space is never a good idea.
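The (index<<PAGE_SHIFT) scheme, along with the validation that should happen in mmap(), might be sketched like this. The code is a user-space model with invented names and a fixed buffer count; the page size is assumed to be 4096 bytes, and the length check is an extra precaution added here rather than something described above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12u  /* assumes 4096-byte pages */
#define TOY_NBUFS      4u   /* hypothetical number of allocated buffers */

/* The "magic cookie" returned in m.offset by the QUERYBUF handler. */
static uint32_t toy_buf_offset(unsigned int index)
{
    assert(index < TOY_NBUFS);
    return (uint32_t)index << TOY_PAGE_SHIFT;
}

/*
 * What the mmap() method would do with vma->vm_pgoff: walk the buffer
 * list looking for a matching cookie. Returns the buffer index, or -1
 * if the request does not correspond to any buffer.
 */
static int toy_find_buffer(unsigned long vm_pgoff, unsigned long map_len,
                           unsigned long buf_len)
{
    unsigned int i;

    for (i = 0; i < TOY_NBUFS; i++) {
        if ((toy_buf_offset(i) >> TOY_PAGE_SHIFT) != vm_pgoff)
            continue;
        if (map_len > buf_len)  /* added check: never map past the buffer */
            return -1;
        return (int)i;
    }
    return -1;
}
```

Since the cookie is only an index, nothing about the kernel's address space leaks to user space, and any offset which does not name a known buffer is rejected.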

When user space maps a buffer, the driver should set the V4L2_BUF_FLAG_MAPPED flag in the associated v4l2_buffer structure. It must also set up open() and close() VMA operations so that it can track the number of processes which have the buffer mapped. As long as this buffer remains mapped somewhere, it cannot be released back to the kernel. If the mapping count of one or more buffers drops to zero, the driver should also stop any in-progress I/O, as there will be no process which can make use of it.

Streaming I/O

So far we have looked at a lot of setup without the transfer of a single frame. We're getting closer, but there is one more step which must happen first. When the application obtains buffers with VIDIOC_REQBUFS, those buffers are all in the user-space state; if they are user-space buffers, they do not really even exist yet. Before the application can start streaming I/O, it must put at least one buffer into the driver's incoming queue; for an output device, of course, those buffers should also be filled with valid frame data.

To enqueue a buffer, the application will issue a VIDIOC_QBUF ioctl(), which V4L2 maps into a call to the driver's vidioc_qbuf() method:

    int (*vidioc_qbuf) (struct file *file, void *private_data, 
                        struct v4l2_buffer *buf);

For memory-mapped buffers, once again, only the type and index fields of buf are valid. The driver can just perform the obvious checks (type and index make sense, the buffer is not already on one of the driver's queues, the buffer is mapped, etc.), put the buffer on its incoming queue (setting the V4L2_BUF_FLAG_QUEUED flag), and return.

User-space buffers can be more complicated at this point, because the driver will have never seen this buffer before. When using this method, applications are allowed to pass a different address every time they enqueue a buffer, so the driver can do no setup ahead of time. If your driver is bouncing frames through a kernel-space buffer, it need only make a note of the user-space address provided by the application. If you are trying to DMA the data directly into user-space, however, life is significantly more challenging.

To ship data directly into user space, the driver must first fault in all of the pages of the buffer and lock them into place; get_user_pages() is the tool to use for this job. Note that this function can perform significant amounts of memory allocation and disk I/O - it could block for a long time. You will need to take care to ensure that important driver functions do not stall while get_user_pages(), which can block for long enough for many video frames to go by, does its thing.

Then there is the matter of telling the device to transfer image data to (or from) the user-space buffer. This buffer will not be contiguous in physical memory - it will, instead, be broken up into a large number of separate 4096-byte pages (on most architectures). Clearly, the device will have to be able to do scatter/gather DMA operations. If the device transfers full video frames at once, it will need to accept a scatterlist which holds a great many pages; a VGA-resolution image in a 16-bit format requires 150 pages. As the image size grows, so will the size of the scatterlist. The V4L2 specification says:

If required by the hardware the driver swaps memory pages within physical memory to create a continuous area of memory. This happens transparently to the application in the virtual memory subsystem of the kernel.

Your editor, however, is unwilling to recommend that driver writers attempt this kind of deep virtual memory trickery. A more promising approach could be to require user-space buffers to be located in hugetlb pages, but no drivers do that now.
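The scatterlist size mentioned above is easy to work out for other formats. The helper below is a small sketch with an invented name; it assumes a page-aligned buffer (an unaligned one can straddle one additional page).

```c
#include <assert.h>
#include <stddef.h>

/* Pages needed to hold one frame, rounding up to a whole page. */
static size_t pages_for_frame(size_t width, size_t height,
                              size_t bytes_per_pixel, size_t page_size)
{
    size_t bytes = width * height * bytes_per_pixel;

    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

For a 640x480 image at two bytes per pixel, the frame occupies 614,400 bytes; with 4096-byte pages, pages_for_frame(640, 480, 2, 4096) yields the 150 pages quoted above.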

If your device transfers images in smaller pieces (a USB camera, for example), direct DMA to user space may be easier to set up. In any case, when faced with the challenges of supporting direct I/O to user-space buffers, the driver writer should (1) be sure that it is worth the trouble, given that applications tend to expect to use memory-mapped buffers anyway, and (2) make use of the video-buf layer, which can handle some of the pain for you.

Once streaming I/O starts, the driver will grab buffers from its incoming queue, have the device perform the requested transfer, then move the buffer to the outgoing queue. The buffer flags should be adjusted accordingly when this transition happens; fields like the sequence number and time stamp should also be filled in at this time. Eventually the application will want to claim buffers in the outgoing queue, returning them to the user-space state. That is the job of VIDIOC_DQBUF, which becomes a call to:

    int (*vidioc_dqbuf) (struct file *file, void *private_data, 
                         struct v4l2_buffer *buf);

Here, the driver will remove the first buffer from the outgoing queue, storing the relevant information in *buf. Normally, if the outgoing queue is empty, this call should block until a buffer becomes available. V4L2 drivers are expected to handle non-blocking I/O, though, so if the video device has been opened with O_NONBLOCK, the driver should return -EAGAIN in the empty-queue case. Needless to say, this requirement also implies that the driver must support poll() for streaming I/O.
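The bookkeeping behind vidioc_qbuf() and vidioc_dqbuf() can be modeled with a flags array and a small FIFO. This is a user-space sketch, not driver code: the toy_* names and flag values are stand-ins, the -EINVAL returns for bad requests are an assumption, and only the non-blocking case is shown (a real driver would sleep on a wait queue when the file was not opened with O_NONBLOCK).

```c
#include <assert.h>
#include <errno.h>

#define TOY_NBUFS       4u
#define TOY_FLAG_QUEUED 0x1u  /* stand-in for V4L2_BUF_FLAG_QUEUED */
#define TOY_FLAG_DONE   0x2u  /* stand-in for V4L2_BUF_FLAG_DONE */

struct toy_queue {
    unsigned int flags[TOY_NBUFS];     /* per-buffer state flags */
    unsigned int outgoing[TOY_NBUFS];  /* FIFO of completed buffer indexes */
    unsigned int head, count;
};

/* Application puts a buffer on the incoming queue. */
static int toy_qbuf(struct toy_queue *q, unsigned int index)
{
    if (index >= TOY_NBUFS)
        return -EINVAL;
    if (q->flags[index] & (TOY_FLAG_QUEUED | TOY_FLAG_DONE))
        return -EINVAL;  /* already on one of the queues */
    q->flags[index] |= TOY_FLAG_QUEUED;
    return 0;
}

/* Driver side: the device has finished with this buffer. */
static void toy_complete(struct toy_queue *q, unsigned int index)
{
    assert(index < TOY_NBUFS && (q->flags[index] & TOY_FLAG_QUEUED));
    q->flags[index] = (q->flags[index] & ~TOY_FLAG_QUEUED) | TOY_FLAG_DONE;
    q->outgoing[(q->head + q->count) % TOY_NBUFS] = index;
    q->count++;
}

/* Non-blocking dequeue: returns a buffer index or -EAGAIN. */
static int toy_dqbuf_nonblock(struct toy_queue *q)
{
    unsigned int index;

    if (q->count == 0)
        return -EAGAIN;
    index = q->outgoing[q->head];
    q->head = (q->head + 1) % TOY_NBUFS;
    q->count--;
    q->flags[index] &= ~TOY_FLAG_DONE;  /* back in the user-space state */
    return (int)index;
}
```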

The only remaining step is to actually tell the device to start performing streaming I/O. The Video4Linux2 driver methods for this task are:

    int (*vidioc_streamon) (struct file *file, void *private_data, 
                            enum v4l2_buf_type type);
    int (*vidioc_streamoff)(struct file *file, void *private_data, 
    	                    enum v4l2_buf_type type);

The call to vidioc_streamon() should start the device after checking that type makes sense. The driver can, if need be, require that a certain number of buffers be in the incoming queue before streaming can be started.

When the application is done it should generate a call to vidioc_streamoff(), which must stop the device. The driver should also remove all buffers from both the incoming and outgoing queues, leaving them all in the user-space state. Of course, the driver must be prepared for the application to simply close the device without stopping streaming first.
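The start and stop logic can be sketched in the same user-space style. Everything here is illustrative: the names are invented, the minimum of two queued buffers is an arbitrary policy, and -EINVAL is merely one plausible error to return when too few buffers are queued.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_NBUFS      4u
#define TOY_MIN_QUEUED 2u  /* hypothetical driver policy */

enum toy_state { TOY_USER, TOY_INCOMING, TOY_OUTGOING };

struct toy_stream {
    enum toy_state state[TOY_NBUFS];
    bool streaming;
};

static int toy_streamon(struct toy_stream *s)
{
    unsigned int i, queued = 0;

    for (i = 0; i < TOY_NBUFS; i++)
        if (s->state[i] == TOY_INCOMING)
            queued++;
    if (queued < TOY_MIN_QUEUED)
        return -EINVAL;    /* assumed error code */
    s->streaming = true;   /* a real driver would start the device here */
    return 0;
}

static int toy_streamoff(struct toy_stream *s)
{
    unsigned int i;

    s->streaming = false;  /* a real driver would stop the device first */
    for (i = 0; i < TOY_NBUFS; i++)
        s->state[i] = TOY_USER;  /* empty both queues */
    return 0;
}

/* The release() path: the application may close without stopping. */
static void toy_release(struct toy_stream *s)
{
    if (s->streaming)
        toy_streamoff(s);
    assert(!s->streaming);
}
```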


Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

  • Bartlomiej Zolnierkiewicz: IDE update. (July 10, 2007)

Documentation

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jake Edge

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds