Release status
Kernel release status
The current 2.6 kernel is 2.6.22.1. Linus
announced the release of the 2.6.22 kernel on
July 8. For those just tuning in: much has happened in this
development cycle, including the addition of the mac80211 (formerly
"Devicescape") wireless networking stack, the
eventfd system calls, some new
TCP congestion control algorithms, a rewritten CFQ I/O scheduler, a new
IEEE1394 ("Firewire") stack, support for the
Blackfin architecture, the
long-awaited IVTV TV tuner driver, and much more. See
the KernelNewbies 2.6.22
page for vast amounts of detail, the
long-format
changelog for even more detail, or
the
short-form changelog for a (relatively) concise listing of patches in
this release.
The 2.6.22.1 update, released
on July 10, adds an SCTP security fix which had somehow managed not to
get into 2.6.22.
The 2.6.23 merge window has opened, and some 500 patches have found their
way into the mainline git repository (as of this writing).
For older kernels: the 2.6.20.15 and 2.6.21.6 stable updates were
released on July 6; each contains a single fix for a security problem
in the netfilter H323 connection tracking code.
Kernel development news
Quote of the week
We all know swap prefetch has been tested out the wazoo since Moses
was a little boy, is compile-time and runtime selectable, and gives
an important and quantifiable performance increase to desktop
systems. Save a Redhat employee some time reinventing the wheel
and just merge it. This wheel already has dope 21" rims...
-- Matthew Hawkins
The 2.6.23 merge window opens
Linus opened the 2.6.23 merge window with a bang: the first thing merged
was the
CFS CPU scheduler.
The
group scheduling feature
is not available, however, since it depends on the
generic process containers
patch, and it would appear that containers will have to wait another cycle.
Other patches merged so far include an IDE update, the rtl8187 wireless
network driver (the first driver to use the mac80211 stack), support for
the Yukon EX (88e8071) network adapter chipset, Xbox 360 gamepad support, a
big rework of the splice() code which replaces sendfile()
and adds an internal vmsplice_to_user() feature, an LZO1X
compression implementation,
and the removal
of a number of ancient CDROM drivers.
The 2.6.23 process has barely begun; expect a great deal of work to be
merged yet. Andrew Morton's
2.6.23 merge plan is useful reading for
those who would like to know what else may go in; among other things, it
looks like this kernel will
include fallocate(),
lguest, and the on-demand readahead patches.
Bear in
mind that much of what goes into 2.6.23 will not get there by way of
Andrew, so this is far from a complete list of what this kernel will
contain.
An API for virtual I/O: virtio
Linux has an abundance of virtualization choices, each with its own way
of dealing with I/O. A recent set of kernel patches, submitted to the
kernel-virtualization mailing list by Rusty Russell, would allow different
virtualization implementations to share drivers by using a virtual I/O
interface called virtio. There have been several public iterations
of the interface with the latest, draft IV, narrowing in on what
appears to be an acceptable solution, at least with the virtualization
folks.
There are always questions about adding yet another layer into the
kernel, but the advantages for virtio are numerous. Russell outlines
several in one of his posts
to the kernel-virtualization list. There is
some amount of urgency in devising a solution because several of the
virtualization projects are either working on or reworking their virtual
I/O. If an established mechanism that already provides working block and
network drivers existed, those projects, as well as any newcomers, would be
likely to use it.
Another key element is to try and prevent a major proliferation of
kernel drivers each handling slightly different virtual block I/O.
Trying to tune and maintain those drivers could become a
major headache, so virtio separates the guest Linux side of the driver
from the code that is specific to the hypervisor implementation. Each group
of developers can maintain the code on their side of the API without
changing the other, unless, of course, the virtio API itself needs to
change. It is likely that some kind of virtual I/O will be adopted,
as the kernel developers are likely to be unwilling to merge new
drivers for each different virtualization mechanism that comes along; some
commonality is required.
The basic abstraction used by virtio is a "buffer", which consists of a
struct scatterlist array. The array
contains "out" entries describing data destined for the underlying hypervisor
driver, as well as "in" entries for that driver to store data to return to the
guest driver. The order is fixed (out followed by in) and a count of each
is part of the buffer description, which allows the hypervisor driver to
determine what it has.
This buffer abstraction
encapsulates everything needed to communicate data to be written to or read
from the hypervisor driver and, eventually, the underlying device.
A guest driver that uses the virtio interface hands off buffers to the
hypervisor driver and awaits their completion.
At its core, the virtio API is a set of functions that are provided by the
hypervisor driver to be used by the guest:
    struct virtqueue_ops {
        int (*add_buf)(struct virtqueue *vq,
                       struct scatterlist sg[],
                       unsigned int out_num,
                       unsigned int in_num,
                       void *data);
        void (*sync)(struct virtqueue *vq);
        void *(*get_buf)(struct virtqueue *vq, unsigned int *len);
        int (*detach_buf)(struct virtqueue *vq, void *data);
        bool (*restart)(struct virtqueue *vq);
    };
This operations vector is initialized by the hypervisor and passed to the
guest driver using a
probe() function. The guest then
sets up its data structures and registers with its kernel as a block
or network device driver.
The basic operation uses add_buf() to register one or more buffers with the
hypervisor driver. That driver is kicked via the sync() call to
start processing the buffers. Each struct virtqueue has a callback
associated with it which will be called when some buffers have completed.
The guest then calls the get_buf() function to retrieve completed
buffers. To support polling, which is used by network drivers,
get_buf() can be called at any time, returning NULL if none have
completed.
The guest driver can disable further callbacks at any time by returning
zero from the callback. The restart() routine is then used to
re-enable them. Finally, the detach_buf() call is used
during shutdown to cancel the operation indicated by the buffer and to
retrieve it from the hypervisor driver.
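The add_buf()/sync()/get_buf() cycle described above can be imitated in ordinary user-space C. What follows is only a toy model, not Russell's code: the toy_vq type, its fixed depth, and a "hypervisor" that completes everything the moment it is kicked are all assumptions made for illustration, and the scatterlist is reduced to an opaque token.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING 8   /* hypothetical queue depth */

/* Hypothetical stand-in for struct virtqueue: two small FIFOs of tokens. */
struct toy_vq {
    void *pending[TOY_RING];   /* added, not yet processed */
    void *done[TOY_RING];      /* processed, waiting for get_buf() */
    unsigned int npending, ndone;
};

/* Plays the role of add_buf(): queue a buffer, identified by its token. */
int toy_add_buf(struct toy_vq *vq, void *data)
{
    assert(data != NULL);      /* NULL is reserved for "nothing completed" */
    if (vq->npending == TOY_RING)
        return -1;             /* queue full */
    vq->pending[vq->npending++] = data;
    return 0;
}

/* Plays the role of sync(): this toy "hypervisor" completes buffers at once. */
void toy_sync(struct toy_vq *vq)
{
    unsigned int moved = 0, i;

    while (moved < vq->npending && vq->ndone < TOY_RING)
        vq->done[vq->ndone++] = vq->pending[moved++];
    for (i = moved; i < vq->npending; i++)   /* keep what did not fit */
        vq->pending[i - moved] = vq->pending[i];
    vq->npending -= moved;
}

/* Plays the role of get_buf(): oldest completed token, or NULL if none. */
void *toy_get_buf(struct toy_vq *vq)
{
    void *data;
    unsigned int i;

    if (vq->ndone == 0)
        return NULL;
    data = vq->done[0];
    for (i = 1; i < vq->ndone; i++)
        vq->done[i - 1] = vq->done[i];
    vq->ndone--;
    return data;
}
```

Note that nothing comes back from toy_get_buf() until toy_sync() has been called; that ordering is what lets a guest batch several buffers before kicking the other side.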
As part of his patches, Russell has working example block and network
drivers using the virtio interface. Each uses the virtio API differently,
and the requirements of each kind of device have pushed the evolution of the
interface into its current form. He has also posted an example of a
driver implementing virtio for his lguest hypervisor.
The block driver uses a protocol in which the buffer always has at least
one out element and one in element. The first out element passes the sector
and type (read or write) information to the hypervisor driver and the first in element
receives the status of the request. For a write, there are additional out
elements, whereas for a read, there are additional in elements. When the
I/O completes, the callback is invoked and the get_buf() calls
return the completed buffers.
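As a rough illustration of that layout, the out_num and in_num values passed to add_buf() for a block request could be computed as below. The enum, structure, and function names are invented for this sketch; only the rule (one header out, one status in, data on the side matching the direction) comes from the description above.

```c
#include <assert.h>

enum toy_req_type { TOY_READ, TOY_WRITE };   /* hypothetical names */

struct toy_counts {
    unsigned int out_num;   /* entries the hypervisor driver reads */
    unsigned int in_num;    /* entries the hypervisor driver fills in */
};

/* Count scatterlist entries for a request with data_segs data segments. */
struct toy_counts toy_blk_counts(enum toy_req_type type, unsigned int data_segs)
{
    struct toy_counts c = { 1, 1 };   /* request header out, status in */

    if (type == TOY_WRITE)
        c.out_num += data_segs;       /* data travels toward the device */
    else
        c.in_num += data_segs;        /* data comes back to the guest */

    /* Every request carries exactly two bookkeeping entries. */
    assert(c.out_num + c.in_num == data_segs + 2);
    return c;
}
```

A three-segment read thus yields one out entry and four in entries, while the same write yields four out entries and one in entry.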
The network driver uses separate virtqueues for sending and receiving
packets, which
allows it to avoid any locking between the two. Each side only uses half
of the scatterlist, out for sending and in for receiving. One of the major
differences from "draft III" is combining the two types of buffers;
previously there were "inbufs" and "outbufs" and the operations vector had
calls for each type. By noticing that they could be combined while still
supporting single direction buffers, Russell has halved the number of
operations that need to be implemented.
Currently, a hypervisor that wants to provide virtio devices to its guests
must arrange to call the virtblock_probe() or
virtnet_probe() functions. Any device discovery must be handled
by the hypervisor and the guest driver is linked to the
hypervisor driver at compile time. Dynamic, mix-and-match
hypervisor/guest combinations are not yet available, but will be down the
road; proposals are already being floated on the kernel-virtualization list.
In a blog posting,
Russell describes the tension between performance and abstraction:
The danger is to come up with an abstraction so far removed from what's
actually happening that performance sucks, there's more glue code than
actual driver code and there are seemingly arbitrary correctness
requirements. But being efficient for both network and block devices is
also quite a trick.
It remains to be seen if the performance can live up to the needs of the
various virtualization projects. If it does, and the interface is abstract
enough to handle the kinds of virtual devices required, we should see some
kind of push to get it included in the kernel sometime soon.
Video4Linux2 part 6b: Streaming I/O
The
previous installment in
this series discussed how to transfer video frames with the
read()
and
write() system calls. Such an implementation can get the
basic job done, but it is not normally the preferred method for performing
video I/O. For the highest performance and the best information transfer,
video drivers should support the V4L2 streaming I/O API.
With the read() and write() methods, each video frame is
copied between user and kernel space as part of the I/O operation. When
streaming I/O is being used, this copying does not happen;
instead, the application and the driver exchange pointers to buffers.
These buffers will be mapped into the application's address space, making
it possible to perform zero-copy frame I/O. There are two
different types of streaming I/O buffers:
- Memory-mapped buffers (type V4L2_MEMORY_MMAP) are allocated
in kernel space; the application maps them into its address space with
the mmap() system call. The buffers can be large, contiguous
DMA buffers, virtual buffers created with vmalloc(), or, if
the hardware supports it, they can be located directly in the video
device's I/O memory.
- User-space buffers (V4L2_MEMORY_USERPTR) are allocated by the
application in user space. Clearly, in this situation, no
mmap() call is required, but the driver may have to work
harder to support efficient I/O to user-space buffers.
Note that drivers are not required to support streaming I/O, and, if they
do support streaming, they do not have to handle both buffer types. A
driver which is more flexible will support more applications; in practice,
it seems that most applications are written to use memory-mapped buffers.
It is not possible to use both types of buffer simultaneously.
We will now delve into the numerous grungy details involved in supporting
streaming I/O. Any Video4Linux2 driver writer will need to understand this
API; it is worth noting, however, that there is a higher-level API which
can help in the writing of streaming drivers. That layer (called
video-buf) can make life easier when the underlying device can support
scatter/gather I/O. The video-buf API will be discussed in a future
installment.
Drivers which support streaming I/O should inform the application of that
fact by setting the V4L2_CAP_STREAMING flag in their
vidioc_querycap() method. Note that there is no way to describe
which buffer types are supported; that comes later.
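The capability check itself is a single bit test. The sketch below shows both sides of it in plain user-space C; the two flag values match those in <linux/videodev2.h> but are defined locally so the example builds anywhere, and toy_capability is a hypothetical cut-down stand-in for struct v4l2_capability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as found in <linux/videodev2.h>, repeated for a standalone build. */
#define V4L2_CAP_VIDEO_CAPTURE 0x00000001u
#define V4L2_CAP_STREAMING     0x04000000u

/* Hypothetical cut-down version of struct v4l2_capability. */
struct toy_capability {
    uint32_t capabilities;
};

/* Driver side: what a streaming capture driver's querycap might report. */
void toy_querycap(struct toy_capability *cap)
{
    assert(cap != NULL);
    cap->capabilities = V4L2_CAP_VIDEO_CAPTURE | V4L2_CAP_STREAMING;
}

/* Application side: is streaming I/O worth attempting at all? */
int toy_can_stream(const struct toy_capability *cap)
{
    return (cap->capabilities & V4L2_CAP_STREAMING) != 0;
}
```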
The v4l2_buffer structure
When streaming I/O is active, frames are passed between the application and
the driver in the form of struct v4l2_buffer. This structure is a
complicated beast which will take a while to describe. A good starting
point is to note that there are three fundamental states that a buffer can
be in:
- In the driver's incoming queue. Buffers are placed in this queue by
the application in the expectation that the driver will do something
useful with them. For a video capture device, buffers in the incoming
queue will be empty, waiting for the driver to fill them with video
data. For an output device, these buffers will have frame data to be
sent to the device.
- In the driver's outgoing queue. These buffers have been processed by
the driver and are waiting for the application to claim them. For
capture devices, outgoing buffers will have new frame data; for output
devices, these buffers are empty.
- In neither queue. In this state, the buffer is owned by user space
and will not normally be touched by the driver. This is the only time
that the application should do anything with the buffer. We'll call
this the "user space" state.
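These states, and the operations which cause transitions between them, form
a cycle: the application's VIDIOC_QBUF call moves a buffer from the
user-space state into the incoming queue, the driver moves it to the
outgoing queue once the transfer completes, and VIDIOC_DQBUF hands it back
to user space.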
The actual v4l2_buffer structure looks like this:
    struct v4l2_buffer
    {
        __u32                   index;
        enum v4l2_buf_type      type;
        __u32                   bytesused;
        __u32                   flags;
        enum v4l2_field         field;
        struct timeval          timestamp;
        struct v4l2_timecode    timecode;
        __u32                   sequence;

        /* memory location */
        enum v4l2_memory        memory;
        union {
            __u32           offset;
            unsigned long   userptr;
        } m;
        __u32                   length;
        __u32                   input;
        __u32                   reserved;
    };
The index field is a sequence number identifying the buffer; it is
only used with memory-mapped buffers. Like other objects which can be
enumerated in the V4L2 interface, memory-mapped buffers start with index 0
and go up sequentially from there. The type field describes the
type of the buffer, usually V4L2_BUF_TYPE_VIDEO_CAPTURE or
V4L2_BUF_TYPE_VIDEO_OUTPUT.
The size of the buffer is given by length, which is in bytes. The
size of the image data contained within the buffer is found in
bytesused; obviously bytesused <= length.
For capture devices, the driver will set bytesused; for output
devices the application must set this field.
field describes which field of an image is stored in the buffer;
fields were discussed in part 5a of this series.
The timestamp field, for input devices, tells when the frame was
captured. For output devices, the driver should not send the frame out
before the time found in this field; a timestamp of zero means "as
soon as possible." The driver will set timestamp to the time that
the first byte of the frame was transferred to the device - or as close to
that time as it can get. timecode can be used to hold a timecode value,
useful for video editing applications; see this
table for details on timecodes.
The driver maintains an incrementing count of frames passing through the
device; it stores the current sequence number in sequence as each
frame is transferred. For input devices, the application can watch this
field to detect dropped frames.
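On the application side, spotting dropped frames is a matter of comparing the sequence values of consecutive buffers. A minimal sketch, assuming the application remembers the previous value (the function name is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Frames missed between two consecutively dequeued buffers, given the
 * sequence field of each. Unsigned arithmetic keeps the answer right
 * even when the 32-bit counter wraps around.
 */
uint32_t toy_frames_dropped(uint32_t prev_seq, uint32_t cur_seq)
{
    assert(cur_seq != prev_seq);   /* the same frame should not arrive twice */
    return (uint32_t)(cur_seq - prev_seq - 1u);
}
```

Consecutive values (4 then 5) give zero; a jump from 4 to 8 means that three frames never reached the application.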
memory tells whether the buffer is memory-mapped or user-space.
For memory-mapped buffers, m.offset describes where the buffer is
to be found. The specification describes it as "the offset of the
buffer from the start of the device memory," but the truth of the
matter is that it is simply a magic cookie that the application can pass to
mmap() to specify which buffer is being mapped. For user-space
buffers, instead, m.userptr is the user-space address of the
buffer.
The input field can be used to quickly switch between inputs on a
capture device - assuming the device supports quick switching between
frames. The reserved field should be set to zero.
Finally, there are several flags defined:
- V4L2_BUF_FLAG_MAPPED indicates that the buffer
has been mapped into user space. It is only applicable to
memory-mapped buffers.
- V4L2_BUF_FLAG_QUEUED: the buffer is in the driver's incoming
queue.
- V4L2_BUF_FLAG_DONE: the buffer is in the driver's outgoing
queue.
- V4L2_BUF_FLAG_KEYFRAME: the buffer holds a key frame - useful
in compressed streams.
- V4L2_BUF_FLAG_PFRAME and V4L2_BUF_FLAG_BFRAME are
also used with compressed streams; they indicate predicted or
difference frames.
- V4L2_BUF_FLAG_TIMECODE: the timecode field is valid.
- V4L2_BUF_FLAG_INPUT: the input field is valid.
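The QUEUED and DONE flags track the three buffer states described earlier. One way to picture the bookkeeping is the user-space toy below; it is a sketch, not code from any real driver. The flag values match <linux/videodev2.h> but are defined locally, and toy_buffer is a hypothetical stand-in holding only the flags field.

```c
#include <assert.h>
#include <stdint.h>

/* Values as found in <linux/videodev2.h>, repeated for a standalone build. */
#define V4L2_BUF_FLAG_QUEUED 0x0002u
#define V4L2_BUF_FLAG_DONE   0x0004u

struct toy_buffer {
    uint32_t flags;   /* hypothetical cut-down v4l2_buffer */
};

/* User space -> incoming queue (VIDIOC_QBUF). -1 if already on a queue. */
int toy_queue(struct toy_buffer *b)
{
    if (b->flags & (V4L2_BUF_FLAG_QUEUED | V4L2_BUF_FLAG_DONE))
        return -1;
    b->flags |= V4L2_BUF_FLAG_QUEUED;
    return 0;
}

/* Incoming -> outgoing queue: the driver has finished the transfer. */
void toy_complete(struct toy_buffer *b)
{
    assert(b->flags & V4L2_BUF_FLAG_QUEUED);
    b->flags &= ~V4L2_BUF_FLAG_QUEUED;
    b->flags |= V4L2_BUF_FLAG_DONE;
}

/* Outgoing queue -> user space (VIDIOC_DQBUF). -1 if nothing to claim. */
int toy_dequeue(struct toy_buffer *b)
{
    if (!(b->flags & V4L2_BUF_FLAG_DONE))
        return -1;
    b->flags &= ~V4L2_BUF_FLAG_DONE;
    return 0;
}
```

A buffer with neither flag set is in the user-space state, which is the only time the application should touch it.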
Buffer setup
Once a streaming application has performed its basic setup, it will turn to
the task of organizing its I/O buffers. The first step is to establish a
set of buffers with the VIDIOC_REQBUFS ioctl(), which is
turned by V4L2 into a call to the driver's vidioc_reqbufs()
method:
    int (*vidioc_reqbufs) (struct file *file, void *private_data,
                           struct v4l2_requestbuffers *req);
Everything of interest will be in the v4l2_requestbuffers
structure, which looks like this:
    struct v4l2_requestbuffers
    {
        __u32                   count;
        enum v4l2_buf_type      type;
        enum v4l2_memory        memory;
        __u32                   reserved[2];
    };
The type field describes the type of I/O to be done; it will
usually be either V4L2_BUF_TYPE_VIDEO_CAPTURE for a video
acquisition device or V4L2_BUF_TYPE_VIDEO_OUTPUT for an output
device. There are other types, but they are beyond the scope of this
article.
If the application wants to use memory-mapped buffers, it will set
memory to V4L2_MEMORY_MMAP and count to the
number of buffers it wants to use. If the driver does not support
memory-mapped buffers, it should return -EINVAL. Otherwise, it
should allocate the requested buffers internally and return zero. On
return, the application will expect the buffers to exist, so any part of
the task which could fail (memory allocation, for example) should be done
at this stage.
Note
that the driver is not required to allocate exactly the requested number of
buffers. In many cases there is a minimum number of buffers which makes
sense; if the application requests fewer than the minimum, it may actually
get more buffers than it asked for. In your editor's experience, for
example, the mplayer application will request two buffers, which
makes it susceptible to overruns (and thus lost frames) if things slow
down in user space. By enforcing a higher minimum buffer count (adjustable with a module
parameter), the cafe_ccic driver is able to make the streaming I/O path a
little more robust.
The count field should be set
to the number of buffers actually allocated before the method returns.
Setting count to zero is a way for the application to request that
all existing buffers be released. In this case, the driver must stop any
DMA operations before freeing the buffers or terrible things could happen.
It is also not possible to free buffers if they are currently mapped into
user space.
If, instead, user-space buffers are to be used, the only fields which
matter are the buffer type and a value of
V4L2_MEMORY_USERPTR in the memory field. The application
need not specify the number of buffers that it intends to use; since the
allocation will be happening in user space, the driver need not care. If
the driver supports user-space buffers, it need only note that the
application will be using this feature and return zero; otherwise the usual
-EINVAL return is called for.
The VIDIOC_REQBUFS command is the only way for an application to
discover which types of streaming I/O buffer are supported by a given
driver.
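The decisions a vidioc_reqbufs() method has to make can be sketched in user-space C. Everything named toy_ below is hypothetical, as are the minimum and maximum buffer counts; a real method would also check the type field, allocate memory, and return an error if the allocation failed. This toy driver supports memory-mapped buffers only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for enum v4l2_memory and v4l2_requestbuffers. */
enum toy_memory { TOY_MEMORY_MMAP = 1, TOY_MEMORY_USERPTR = 2 };

struct toy_requestbuffers {
    uint32_t count;
    enum toy_memory memory;
};

#define TOY_MIN_BUFFERS 3   /* assumed driver policy, not from any spec */
#define TOY_MAX_BUFFERS 8

/* Returns 0 and rewrites req->count with the number actually granted. */
int toy_reqbufs(struct toy_requestbuffers *req, uint32_t *allocated)
{
    assert(req != NULL && allocated != NULL);

    if (req->memory != TOY_MEMORY_MMAP)
        return -EINVAL;              /* user-space buffers unsupported here */

    if (req->count == 0) {           /* request to release all buffers */
        *allocated = 0;
        return 0;
    }
    if (req->count < TOY_MIN_BUFFERS)
        req->count = TOY_MIN_BUFFERS;   /* may hand back more than asked */
    if (req->count > TOY_MAX_BUFFERS)
        req->count = TOY_MAX_BUFFERS;

    *allocated = req->count;         /* a real driver allocates here */
    return 0;
}
```

An application asking for two buffers gets three back in count, which is the sort of adjustment described above for mplayer.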
Mapping buffers into user space
If user-space buffers are being used, the driver will not see any more
buffer-related calls until the application starts putting buffers on the
incoming queue. Memory-mapped buffers require more setup, though. The
application will typically step through each allocated buffer and map it
into its address space. The first stop is the VIDIOC_QUERYBUF
command, which becomes a call to the driver's vidioc_querybuf()
method:
    int (*vidioc_querybuf)(struct file *file, void *private_data,
                           struct v4l2_buffer *buf);
On entry to this method, the only fields of buf which will be set
are type (which should be checked against the type specified when
the buffers were allocated) and index, which identifies the
specific buffer. The driver should make sure that index makes
sense and fill in the rest of the fields in buf. Typically
drivers store an array of v4l2_buffer structures internally, so
the core of a vidioc_querybuf() method is just a structure
assignment.
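A sketch of that pattern, using a hypothetical four-field stand-in for struct v4l2_buffer and an invented toy_dev structure holding the driver's internal array:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BUFS 4

/* Hypothetical cut-down v4l2_buffer. */
struct toy_buffer {
    uint32_t index;
    uint32_t type;
    uint32_t length;
    uint32_t offset;
};

/* Hypothetical per-device state: the buffers set up by REQBUFS. */
struct toy_dev {
    struct toy_buffer bufs[TOY_MAX_BUFS];
    uint32_t nbufs;
    uint32_t type;      /* buffer type given when the buffers were allocated */
};

/* Validate type and index, then copy the stored description out. */
int toy_querybuf(const struct toy_dev *dev, struct toy_buffer *buf)
{
    assert(dev != NULL && buf != NULL);
    assert(dev->nbufs <= TOY_MAX_BUFS);

    if (buf->type != dev->type)
        return -EINVAL;
    if (buf->index >= dev->nbufs)
        return -EINVAL;
    *buf = dev->bufs[buf->index];    /* the structure assignment */
    return 0;
}
```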
The only way for an application to access memory-mapped buffers is to map
them into its address space, so a vidioc_querybuf() call will
typically be followed by a call to the driver's mmap() method -
this method, remember, is stored in the fops field of the
video_device structure associated with this device. How the
driver handles mmap() will depend on just how the buffers are set
up in the kernel. If the buffer can be mapped up front with
remap_pfn_range() or remap_vmalloc_range(), that should
be done at this time. For buffers in kernel space, pages can also be
mapped individually at page-fault time by setting up a nopage()
method in the usual
way. A good discussion of handling mmap() can be found in Linux Device Drivers for those who need it.
When mmap() is called, the VMA structure passed in should have the
address of one of your buffers in the vm_pgoff field -
right-shifted by PAGE_SHIFT, of course. It should, in particular,
be the offset value that your driver returned in response to a
VIDIOC_QUERYBUF call. Please iterate through your list of buffers
and be sure that the incoming address matches one of them; video drivers
should not be a means by which hostile programs can map arbitrary regions
of memory.
The offset value you provide can be almost anything,
incidentally. Some drivers just return (index<<PAGE_SHIFT),
meaning that the incoming vm_pgoff field should just be the buffer
index. The one thing you should not do is store the actual
kernel-space address of the buffer in offset; leaking kernel
addresses into user space is never a good idea.
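The index-as-cookie scheme, together with the lookup that mmap() should perform, might look like the following user-space sketch. TOY_PAGE_SHIFT stands in for the kernel's PAGE_SHIFT (12 is assumed here), and both function names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumed 4096-byte pages */

/* Cookie returned in m.offset by QUERYBUF for the given buffer. */
uint32_t toy_offset_for(uint32_t index)
{
    assert(index < (1u << (32 - TOY_PAGE_SHIFT)));   /* shift must not overflow */
    return index << TOY_PAGE_SHIFT;
}

/*
 * Called from mmap(): find the buffer matching vm_pgoff by walking the
 * buffer list. Returns the buffer index, or -1 if pgoff matches none,
 * in which case the mapping must be refused.
 */
int toy_buffer_from_pgoff(unsigned long pgoff, uint32_t nbufs)
{
    uint32_t i;

    for (i = 0; i < nbufs; i++)
        if ((unsigned long)(toy_offset_for(i) >> TOY_PAGE_SHIFT) == pgoff)
            return (int)i;
    return -1;
}
```

Because the lookup only ever succeeds for offsets the driver itself handed out, a hostile program cannot use it to reach arbitrary memory.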
When user space maps a buffer, the driver should set the
V4L2_BUF_FLAG_MAPPED flag in the associated v4l2_buffer
structure. It must also set up open() and close() VMA
operations so that it can track the number of processes which have the
buffer mapped. As long as this buffer remains mapped somewhere, it cannot
be released back to the kernel. If the mapping count of one or more
buffers drops to zero, the driver should also stop any in-progress I/O, as
there will be no process which can make use of it.
Streaming I/O
So far we have looked at a lot of setup without the transfer of a single
frame. We're getting closer, but there is one more step which must happen
first. When the application obtains buffers with VIDIOC_REQBUFS,
those buffers are all in the user-space state; if they are user-space
buffers, they do not really even exist yet. Before the application can
start streaming I/O, it must put at least one buffer into the driver's
incoming queue; for an output device, of course, those buffers should also
be filled with valid frame data.
To enqueue a buffer, the application will issue a VIDIOC_QBUF
ioctl(), which V4L2 maps into a call to the driver's
vidioc_qbuf() method:
    int (*vidioc_qbuf) (struct file *file, void *private_data,
                        struct v4l2_buffer *buf);
For memory-mapped buffers, once again, only the type and
index fields of buf are valid. The driver can just
perform the obvious checks (type and index make sense,
the buffer is not already on one of the driver's queues, the buffer is
mapped, etc.), put the buffer on its incoming queue (setting the
V4L2_BUF_FLAG_QUEUED flag), and return.
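Those "obvious checks" for the memory-mapped case can be laid out as a user-space sketch. The toy_dev structure and its incoming FIFO are invented for illustration, and returning -EINVAL for every failed check is an assumption rather than something the text above prescribes; the flag values match <linux/videodev2.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values as found in <linux/videodev2.h>, repeated for a standalone build. */
#define V4L2_BUF_FLAG_MAPPED 0x0001u
#define V4L2_BUF_FLAG_QUEUED 0x0002u
#define V4L2_BUF_FLAG_DONE   0x0004u

#define TOY_NBUFS 4

/* Hypothetical per-device state. */
struct toy_dev {
    uint32_t type;                  /* buffer type chosen at REQBUFS time */
    uint32_t flags[TOY_NBUFS];      /* per-buffer flags */
    uint32_t incoming[TOY_NBUFS];   /* FIFO of queued buffer indexes */
    unsigned int nincoming;
};

int toy_qbuf(struct toy_dev *dev, uint32_t type, uint32_t index)
{
    if (type != dev->type || index >= TOY_NBUFS)
        return -EINVAL;             /* type or index makes no sense */
    if (dev->flags[index] & (V4L2_BUF_FLAG_QUEUED | V4L2_BUF_FLAG_DONE))
        return -EINVAL;             /* already on one of the queues */
    if (!(dev->flags[index] & V4L2_BUF_FLAG_MAPPED))
        return -EINVAL;             /* nobody has mapped this buffer */

    /* Each index is queued at most once, so the FIFO cannot overflow. */
    assert(dev->nincoming < TOY_NBUFS);
    dev->incoming[dev->nincoming++] = index;
    dev->flags[index] |= V4L2_BUF_FLAG_QUEUED;
    return 0;
}
```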
User-space buffers can be more complicated at this point, because the
driver will have never seen this buffer before. When using this method,
applications are allowed to pass a different address every time they enqueue
a buffer, so the driver can do no setup ahead of time. If your driver is
bouncing frames through a kernel-space buffer, it need only make a note of
the user-space address provided by the application. If you are trying to
DMA the data directly into user space, however, life is significantly more
challenging.
To ship data directly into user space, the driver must first fault in all
of the pages of the buffer and lock them into place;
get_user_pages() is the tool to use for this job. Note that this
function can perform significant amounts of memory allocation and disk I/O
- it could block for a long time. You will need to take care to ensure
that important driver functions do not stall while
get_user_pages(), which can block for long enough for many video
frames to go by, does its thing.
Then there is the matter of telling the device to transfer image data to
(or from) the user-space buffer. This buffer will not be contiguous in
physical memory - it will, instead, be broken up into a large number of
separate 4096-byte pages (on most architectures). Clearly, the device will
have to be able to do
scatter/gather DMA operations. If the device transfers full video frames
at once, it will need to accept a scatterlist which holds a great many
pages; a VGA-resolution image in a 16-bit format requires 150 pages. As
the image size grows, so will the size of the scatterlist. The V4L2
specification says:
If required by the hardware the driver swaps memory pages within
physical memory to create a continuous area of memory. This happens
transparently to the application in the virtual memory subsystem of
the kernel.
Your editor, however, is unwilling to recommend that driver writers attempt
this kind of deep virtual memory trickery. A more promising approach could
be to require user-space buffers to be located in hugetlb pages, but no
drivers do that now.
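The 150-page figure quoted above is easy to reproduce: 640 x 480 pixels at two bytes each is 614,400 bytes, which divides into exactly 150 pages of 4096 bytes. A small helper (the name is invented) shows how the scatterlist grows with image size; it assumes a page-aligned buffer, and an unaligned one would straddle one additional page.

```c
#include <assert.h>
#include <stddef.h>

/* Pages needed to hold one frame in a page-aligned buffer, rounding up. */
size_t toy_pages_for_frame(size_t width, size_t height,
                           size_t bytes_per_pixel, size_t page_size)
{
    size_t bytes = width * height * bytes_per_pixel;

    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

With the same pixel format, a 1280x1024 frame already needs 640 entries.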
If your device transfers images in smaller pieces (a USB camera, for
example), direct DMA to user space may be easier to set up. In any case,
when faced with the challenges of supporting direct I/O to user-space
buffers, the driver writer should (1) be sure that it is worth the
trouble, given that applications tend to expect to use memory-mapped
buffers anyway, and (2) make use of the video-buf layer, which can
handle some of the pain for you.
Once streaming I/O starts, the driver will grab buffers from its incoming
queue, have the device perform the requested transfer, then move the buffer
to the outgoing queue. The buffer flags should be adjusted accordingly
when this transition happens; fields like the sequence number and time stamp
should also
be filled in at this time. Eventually the application will want to claim
buffers in the outgoing queue, returning them to the user-space state.
That is the job of VIDIOC_DQBUF, which becomes a call to:
    int (*vidioc_dqbuf) (struct file *file, void *private_data,
                         struct v4l2_buffer *buf);
Here, the driver will remove the first buffer from the outgoing queue,
storing the relevant information in *buf. Normally, if the
outgoing queue is empty, this call should block until a buffer becomes
available. V4L2 drivers are expected to handle non-blocking I/O, though, so if the
video device has been opened with O_NONBLOCK, the driver should
return -EAGAIN in the empty-queue case. Needless to say, this
requirement also implies that the driver must support poll() for
streaming I/O.
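The empty-queue decision can be sketched as follows. Since a user-space toy cannot sleep the way a driver would, the blocking case is represented by a made-up TOY_WOULD_SLEEP return value, meaning "a real driver would wait here and then retry"; the queue structure is likewise invented.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_QLEN 4
#define TOY_WOULD_SLEEP 1   /* sketch-only: a real driver would block here */

/* Hypothetical outgoing queue: a ring of buffer indexes. */
struct toy_outq {
    unsigned int index[TOY_QLEN];
    unsigned int head, count;
};

/*
 * Hand the first outgoing buffer to the caller. "nonblock" reflects
 * whether the device was opened with O_NONBLOCK.
 */
int toy_dqbuf(struct toy_outq *q, int nonblock, unsigned int *index)
{
    assert(q != NULL && index != NULL);

    if (q->count == 0)
        return nonblock ? -EAGAIN : TOY_WOULD_SLEEP;

    *index = q->index[q->head];
    q->head = (q->head + 1) % TOY_QLEN;
    q->count--;
    return 0;
}
```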
The only remaining step is to actually tell the device to start performing
streaming I/O. The Video4Linux2 driver methods for this task are:
    int (*vidioc_streamon) (struct file *file, void *private_data,
                            enum v4l2_buf_type type);
    int (*vidioc_streamoff)(struct file *file, void *private_data,
                            enum v4l2_buf_type type);
The call to vidioc_streamon() should start the device after
checking that type makes sense. The driver can, if need be,
require that a certain number of buffers be in the incoming queue before
streaming can be started.
When the application is done it should generate a call to
vidioc_streamoff(), which must stop the device. The driver should
also remove all buffers from both the incoming and outgoing queues, leaving
them all in the user-space state. Of course, the driver must be prepared
for the application to simply close the device without stopping streaming
first.
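Put together, the start/stop logic might be organized like the user-space sketch below. The toy_dev structure tracks only queue lengths, the one-buffer minimum and the -EINVAL returns are assumptions, and toy_release() stands in for the driver's release() method to show the close-without-streamoff case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MIN_QUEUED 1   /* assumed policy: buffers needed before starting */

/* Hypothetical per-device state; queues are reduced to their lengths. */
struct toy_dev {
    uint32_t type;
    unsigned int incoming, outgoing;
    int streaming;
};

int toy_streamon(struct toy_dev *dev, uint32_t type)
{
    if (type != dev->type)
        return -EINVAL;
    if (dev->incoming < TOY_MIN_QUEUED)
        return -EINVAL;            /* nothing for the device to work with */
    dev->streaming = 1;            /* a real driver starts the hardware here */
    return 0;
}

int toy_streamoff(struct toy_dev *dev, uint32_t type)
{
    if (type != dev->type)
        return -EINVAL;
    dev->streaming = 0;            /* stop the device first */
    dev->incoming = 0;             /* then empty both queues, leaving */
    dev->outgoing = 0;             /* every buffer in the user-space state */
    return 0;
}

/* The application may simply close the device; clean up the same way. */
void toy_release(struct toy_dev *dev)
{
    assert(dev != NULL);
    if (dev->streaming)
        (void)toy_streamoff(dev, dev->type);
}
```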
Patches and updates
Device drivers
- Bartlomiej Zolnierkiewicz: IDE update.
(July 9, 2007)
Page editor: Jake Edge