The current 2.6 prepatch is 2.6.21-rc5, released on March 25. It
contains a number of fixes, including a set for timer-related regressions.
Says Linus: "Those timer changes ended up much more painful than
anybody wished for, but big thanks to Thomas Gleixner for being on it like
a weasel on a dead rat, and the regression list has kept shrinking." See
the announcement for the details.
The current -mm tree is 2.6.21-rc5-mm3,
released on March 30.
The current stable 2.6 kernel is 2.6.20.4, released on March 23.
For older kernels: 2.6.16.45 was released with
several fixes and some USB work on March 31.
In the 2.4 world, 2.4.34.2
was released on March 24; it contains only two changes. 2.4.35-pre2 is also out with a
rather larger set of fixes.
Kernel development news
I find that the key to understanding kernel code is to understand the data
structures and the relationships between them. Once you have that in your
head, the code tends to just fall out. Hence there is good maintainability
payoff in putting work into documenting the struct, its fields, the
relationship between this struct and other structs, and any and all locking
rules.
<wonders wtf "ticks" does>
-- Andrew Morton
The 2.6.21 kernel release is getting closer, so it makes sense to review
the internal API changes which have been made in this development cycle.
As always, this information will eventually find its way to the LWN 2.6 kernel API changes page.
- Sysfs now supports the concept of "shadow directories" - multiple
versions of a directory with the same name. This feature is to be
used with container applications, allowing each namespace to have
resources (network interfaces, for example) with the same name. To
that end, two new functions have been added:
int sysfs_make_shadowed_dir(struct kobject *kobj,
void *(*follow_link)(struct dentry *,
struct nameidata *));
struct dentry *sysfs_create_shadow_dir(struct kobject *kobj);
sysfs_make_shadowed_dir() takes the existing directory for a
kobject and makes it shadowed - capable of having multiple
instantiations. The follow_link() method must be able to
pick out the right version for any given situation. A call to
sysfs_create_shadow_dir() will create a new instantiation for a
directory which has been made shadowed.
- Quite a few kobject functions - kobject_init(),
subsystem_register(), subsystem_unregister(), and
subsys_create_file() - now return harmlessly if passed a NULL pointer.
- Many kernel subsystems which once used class_device
structures have been changed to use struct device instead;
this work is toward a long-term goal of getting rid of the class tree
and having a single device tree in sysfs.
- There is a new function:
int device_schedule_callback(struct device *dev,
void (*func)(struct device *))
This function will arrange for func() to be called at some
future time in process context. It's meant to enable device
attributes to unregister themselves, but one can imagine other
applications as well; a usage sketch appears after this list.
- The ALSA system on chip ("ASoC") layer provides extensive support for
the implementation of sound drivers on embedded systems; see the
documentation files packaged with the kernel for details.
- Significant changes have been made to the crypto support interface.
- The device resource
management patches, making a lot of driver code
easier to write, have been merged.
- The DMA memory zone (ZONE_DMA) is now optional and may not be
present in all kernels.
- The local_t type has been made consistent across
architectures and has gained some documentation.
- The nopfn() address space operation can now return
NOPFN_REFAULT to indicate that the faulting instruction
should be re-executed.
- A new function, vm_insert_pfn(), enables the insertion of a
new page into a process's address space by page-frame number.
- A new driver API for general-purpose I/O signals has been added.
- The sysctl code has been heavily reworked, leading to a number of
internal API changes.
- The clockevents and dynamic
tick patches have been merged. Most code will not require
changes, but kernel developers should be aware of code which depends
on a regular periodic timer tick, which may no longer be present.
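As an illustration of device_schedule_callback(), mentioned above, here is a
minimal sketch of a sysfs attribute that removes itself; the
dev_attr_self_destruct attribute and the surrounding scenario are invented
for the example:

    static ssize_t self_destruct_store(struct device *dev,
                                       struct device_attribute *attr,
                                       const char *buf, size_t count);
    static DEVICE_ATTR(self_destruct, S_IWUSR, NULL, self_destruct_store);

    static void remove_self(struct device *dev)
    {
        /* Runs later, in process context, where sysfs removal is safe */
        device_remove_file(dev, &dev_attr_self_destruct);
    }

    static ssize_t self_destruct_store(struct device *dev,
                                       struct device_attribute *attr,
                                       const char *buf, size_t count)
    {
        /* A store() method cannot remove its own attribute directly;
         * defer the removal to process context via the new helper */
        device_schedule_callback(dev, remove_self);
        return count;
    }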
This article is a continuation of the irregular LWN series on writing video
drivers for Linux. The introductory article describes the
series and contains pointers to the previous articles. In the last episode, we
looked at how the Video4Linux2 API describes video formats: image sizes and
the representation of pixels within them. This article will complete the
discussion by describing the process of coming to an agreement with an
application on an actual video format supported by the hardware.
As we saw in the previous article, there are many ways of representing
image data in memory. There is probably no video device on the market
which can handle all of the formats understood by the Video4Linux
interface. Drivers are not expected to support formats not understood by
the underlying hardware; in fact, performing format conversions within the
kernel is explicitly frowned upon. So the driver must make it possible for
the application to select a format which works with the hardware.
The first step is to simply allow the application to query the supported
formats. The VIDIOC_ENUM_FMT ioctl() is provided for the
purpose; within the driver this command turns into a call to this callback
(if a video capture device is being queried):
int (*vidioc_enum_fmt_cap)(struct file *file, void *private_data,
struct v4l2_fmtdesc *f);
This callback will ask a video capture device to describe one of its
formats. The application will pass in a v4l2_fmtdesc structure:
    struct v4l2_fmtdesc
    {
        __u32              index;
        enum v4l2_buf_type type;
        __u32              flags;
        __u8               description[32];
        __u32              pixelformat;
        __u32              reserved[4];
    };
The application will set the index and type fields.
index is a simple integer used to identify a format; like the
other indexes used by V4L2, this one starts at zero and increases to the
maximum number of formats supported. An application can enumerate all of
the supported formats by incrementing the index value until the driver
returns EINVAL. The type field describes the data stream
type; it will be V4L2_BUF_TYPE_VIDEO_CAPTURE for a video capture
(camera or tuner) device.
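Seen from user space, that enumeration loop might look like the following
minimal sketch (the device node name and the omission of error handling are
simplifications of mine):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    int main(void)
    {
        struct v4l2_fmtdesc fmt;
        int fd = open("/dev/video0", O_RDONLY);

        memset(&fmt, 0, sizeof(fmt));
        fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        /* Increment index until the driver returns EINVAL */
        while (ioctl(fd, VIDIOC_ENUM_FMT, &fmt) == 0) {
            printf("format %u: %s\n", fmt.index, fmt.description);
            fmt.index++;
        }
        return 0;
    }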
If the index corresponds to a supported format, the driver should
fill in the rest of the structure. The pixelformat field should
be the fourcc code describing the video representation and
description a short textual description of the format. The only
defined value for the flags field is
V4L2_FMT_FLAG_COMPRESSED, which indicates a compressed video format.
The above callback is for video capture devices; it will only be called
when type is V4L2_BUF_TYPE_VIDEO_CAPTURE. The
VIDIOC_ENUM_FMT call will be split out into different callbacks
depending on the type field:
    /* V4L2_BUF_TYPE_VIDEO_OUTPUT */
    int (*vidioc_enum_fmt_video_output)(file, private_data, f);
    /* V4L2_BUF_TYPE_VIDEO_OVERLAY */
    int (*vidioc_enum_fmt_overlay)(file, private_data, f);
    /* V4L2_BUF_TYPE_VBI_CAPTURE */
    int (*vidioc_enum_fmt_vbi)(file, private_data, f);
    /* V4L2_BUF_TYPE_SLICED_VBI_CAPTURE */
    int (*vidioc_enum_fmt_vbi_capture)(file, private_data, f);
    /* V4L2_BUF_TYPE_VBI_OUTPUT */
    /* V4L2_BUF_TYPE_SLICED_VBI_OUTPUT */
    int (*vidioc_enum_fmt_vbi_output)(file, private_data, f);
    /* V4L2_BUF_TYPE_VIDEO_PRIVATE */
    int (*vidioc_enum_fmt_type_private)(file, private_data, f);
The argument types are the same for all of these calls.
It's worth noting that drivers can support special buffer types with codes
starting with V4L2_BUF_TYPE_PRIVATE, but that would clearly
require a special understanding on the application side.
For the purposes of this article, we will focus on video capture and output
devices; the other types of video devices will be examined in future
articles.
The application can find out how the hardware is currently configured with
the VIDIOC_G_FMT call. The argument passed in this case is a
v4l2_format structure:

    struct v4l2_format
    {
        enum v4l2_buf_type type;
        union
        {
            struct v4l2_pix_format        pix;
            struct v4l2_window            win;
            struct v4l2_vbi_format        vbi;
            struct v4l2_sliced_vbi_format sliced;
            __u8                          raw_data[200];
        } fmt;
    };
Once again, type describes the buffer type; the V4L2 layer will
split this call into one of several driver callbacks depending on that
type. For video capture devices, the callback is:
int (*vidioc_g_fmt_cap)(struct file *file, void *private_data,
struct v4l2_format *f);
For video capture (and output) devices, the pix field of the union
is of interest. This is the v4l2_pix_format structure seen in the
previous installment; the driver should fill in that structure with the
current hardware settings and return. This call should not normally fail
unless something is seriously wrong with the hardware.
The other callbacks are:
    int (*vidioc_g_fmt_overlay)(file, private_data, f);
    int (*vidioc_g_fmt_video_output)(file, private_data, f);
    int (*vidioc_g_fmt_vbi)(file, private_data, f);
    int (*vidioc_g_fmt_vbi_output)(file, private_data, f);
    int (*vidioc_g_fmt_vbi_capture)(file, private_data, f);
    int (*vidioc_g_fmt_type_private)(file, private_data, f);

The vidioc_g_fmt_video_output() callback uses the same
pix field in the same way as capture interfaces do.
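For a capture device, the vidioc_g_fmt_cap() implementation can often just
copy the driver's stored configuration into the structure; here is a minimal
sketch, in which the mydev structure and its pix field are invented for
illustration:

    static int my_g_fmt_cap(struct file *file, void *private,
                            struct v4l2_format *f)
    {
        struct mydev *dev = (struct mydev *) private;

        /* Report the current hardware settings; this should not fail */
        f->fmt.pix = dev->pix;   /* a stored struct v4l2_pix_format */
        return 0;
    }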
Most applications will eventually want to configure the hardware to provide
a format which works for their purpose. There are two interfaces provided
for changing video formats. The first of these is the
VIDIOC_TRY_FMT call, which, within a V4L2 driver, turns into one
of these callbacks:
int (*vidioc_try_fmt_cap)(struct file *file, void *private_data,
struct v4l2_format *f);
int (*vidioc_try_fmt_video_output)(struct file *file, void *private_data,
struct v4l2_format *f);
/* And so on for the other buffer types */
To handle this call,
the driver should look at the requested video format and decide whether
that format can be supported by the hardware or not. If the application
has requested something impossible, the driver should return
-EINVAL. So, for example, a fourcc code describing an unsupported
format or a request for interlaced video on a progressive-only device would
fail. On the other hand, the driver can adjust size fields to match an
image size supported by the hardware; normal practice is to adjust sizes
downward if need be. So a driver for a device which only handles
VGA-resolution images would change the width and height
parameters accordingly and return success. The v4l2_format
structure will be copied back to user space after the call; the driver
should update the structure to reflect any changed parameters so the
application can see what it is really getting.
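As an example of the downward-adjustment convention, a "try" handler for a
hypothetical VGA-limited device might look like the following sketch; the
single supported pixel format is an assumption made for the example:

    static int my_try_fmt_cap(struct file *file, void *private,
                              struct v4l2_format *f)
    {
        struct v4l2_pix_format *pix = &f->fmt.pix;

        /* Impossible requests fail outright */
        if (pix->pixelformat != V4L2_PIX_FMT_YUYV)
            return -EINVAL;
        /* Sizes, instead, are adjusted downward to fit the hardware */
        if (pix->width > 640)
            pix->width = 640;
        if (pix->height > 480)
            pix->height = 480;
        pix->bytesperline = pix->width * 2; /* YUYV: two bytes per pixel */
        pix->sizeimage = pix->bytesperline * pix->height;
        return 0;
    }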
The VIDIOC_TRY_FMT handlers are optional for drivers, but omitting
this functionality is not recommended. If provided, this function is
callable at any time, even if the device is currently operating. It should
not make any changes to the actual hardware operating parameters; it
is just a way for the application to find out what is possible.
When the application wants to change the hardware's format for real, it
does a VIDIOC_S_FMT call, which arrives at the driver in one of
these forms:

    int (*vidioc_s_fmt_cap)(struct file *file, void *private_data,
                            struct v4l2_format *f);
    int (*vidioc_s_fmt_video_output)(struct file *file, void *private_data,
                                     struct v4l2_format *f);
Unlike VIDIOC_TRY_FMT, this call cannot be made at arbitrary
times. If the hardware is currently operating, or if it has streaming
buffers allocated (a topic for yet another future installment), changing
the format could lead to no end of mayhem. Consider what happens, for
example, if the new format is larger than the buffers which are currently
in use. So the driver should always ensure that the hardware is idle and
fail the request (with -EBUSY) if not.
A format change should be atomic - it should change all of the parameters
to match the request or none of them. Once again, image size parameters
can be adjusted by the driver if need be. The usual form of these
callbacks is something like this:
    int my_s_fmt_cap(struct file *file, void *private,
                     struct v4l2_format *f)
    {
        struct mydev *dev = (struct mydev *) private;
        int ret;

        /* Let the "try" handler validate and adjust the request */
        ret = my_try_fmt_cap(file, private, f);
        if (ret != 0)
            return ret;
        /* The format is known good; program it into the hardware */
        return tweak_hardware(dev, &f->fmt.pix);
    }
Using the VIDIOC_TRY_FMT handler avoids duplication of code and
gets rid of any excuse for not implementing that handler in the first
place. If the "try" function succeeds, the resulting format is known to
work and can be programmed directly into the hardware.
There are a number of other calls which influence how video I/O is done.
Future articles will look at some of them. Support for setting formats is
enough to enable applications to start transferring images, however, and
that, in the end, is the purpose of all this structure. So the next
article, hopefully to come after a shorter delay than happened this time
around, will get into support for reading and writing video data.
In this article, we will describe several aspects of the architecture of
DragonFly BSD's virtual kernel infrastructure, which allows the kernel to
be run as a user-space process. Its design and implementation is
largely the work of the project's lead developer, Matthew Dillon, who first
announced his intention of modifying the kernel to run in userspace on
September 2nd, 2006. The first stable DragonFly BSD version to
feature virtual kernel (vkernel) support was DragonFly 1.8, released on
January 30th, 2007.
The motivation for this work (as can be found in the initial mail linked
to above) was finding an elegant solution to one immediate and one long term
issue in pursuing the project's main goal of Single System Image clustering
over the Internet. First, as any person who is familiar with distributed
algorithms will attest, implementing cache coherency without hardware support is
a complex task. It would not be made any easier by enduring a 2-3 minute delay
in the edit-compile-run cycle while each machine goes through the boot
sequence. As a nice side effect, userspace programming errors are unlikely to
bring the machine down and one has the benefit of working with superior
debugging tools (and can more easily develop new ones).
The second, long term, issue that virtual kernels are intended to
address is finding a way to securely and
efficiently dedicate system resources to a cluster that operates over the
(hostile) Internet. Because a kernel is a more or less standalone
environment, it should be possible to completely isolate the process a
virtual kernel runs in from the rest of the system. While the
problem of process isolation is far from solved, there exist a number of
promising approaches. One option, for example, would be to use systrace
(refer to [Provos03]) to mask-out all but the few (and hopefully
carefully audited) system calls that the vkernel requires after initialization
has taken place. This setup would allow for a significantly higher degree of
protection for the host system in the event that the virtualized environment was
compromised. Moreover, the host kernel already has well-tested facilities for
arbitrating resources, although these facilities are not necessarily sufficient
or dependable; the CPU scheduler is not infallible and mechanisms for allocating
disk I/O bandwidth will need to be implemented or expanded. In any case,
leveraging preexisting mechanisms reduces the burden on the project's
development team, which can't be all bad.
Getting the kernel to build as a regular, userspace, ELF executable
required tidying up large portions of the source tree. In this section we
will focus on the two large sets of changes that took place as part of
this cleanup. The second set might seem superficial and hardly worthy of
mention as such, but in explaining the reasoning that led to it, we shall
discuss an important decision that was made in the implementation of the
virtual kernel.
The first set of changes was separating machine-dependent code into
platform- and CPU-specific parts. The real and virtual kernels can be
considered to run on two different platforms; the first runs (as must
reluctantly be admitted) only on 32-bit PC-style hardware, while the
second runs on a DragonFly kernel. Regardless of the differences
between the two platforms, both kernels expect the same processor
architecture. After the separation, the cpu/i386
directory of the kernel tree is left with hand-optimized assembly
versions of certain kernel routines, headers relevant only to x86 CPUs
and code that deals with object relocation and debug information. The
real kernel's platform directory (platform/pc32) is
familiar with things like programmable interrupt controllers, power
management and the PC BIOS (which the vkernel doesn't need), while
the virtual kernel's platform/vkernel directory is
happily using the system calls that the real kernel can't have. Of
course this does not imply that there is absolutely no code duplication,
but fixing that is not a pressing problem.
The massive second set of changes involved primarily renaming quite
a few kernel symbols so that there are no clashes with the libc ones
(e.g. *printf(), qsort(), errno, etc.) and using kdev_t for the POSIX dev_t
type in the kernel. As should be plain, this was a prerequisite for
having the virtual kernel link with the standard C library. Given that
the kernel is self-hosted (this means that, since it cannot generally
rely on support software after it has been loaded, the kernel includes
its own helper routines), one can question the decision of pulling in all
of libc instead of simply adding the (few) system calls that the vkernel
actually uses. A controversial choice at the time, it prevailed because
it was deemed that it would allow future vkernel code to leverage the
extended functionality provided by libc. Particularly, thread-awareness in the
system C library should accommodate the (medium term) plan to mimic
multi-processor operation by the use of one vkernel thread for each hypothetical
CPU. It is safe to say that if the plan materializes, linking against libc
will prove to be a most profitable tradeoff.
The Virtual Kernel
In this section, we will study the architecture of the virtual kernel and
the design choices made in its development, focusing on its differences from a
kernel running on actual hardware. In the process, we'll need to describe the
changes made in the real (host) kernel code, specifically in order to support a
DragonFly kernel running as a user process.
Address Space Model
The first design choice made in the development of the vkernel is that the
whole virtualized environment is executing as part of the same real-kernel
process. This imposes well defined limits on the amount of real-kernel
resources that may be consumed by it and makes containment straightforward.
Processes running under the vkernel are not in direct competition with host
processes for CPU time, and most of the bookkeeping that is expected
from a kernel during the lifetime of a process is handled by the virtual
kernel. The alternative, running each vkernel process in the context of a
real kernel process, imposes an extra burden on the host kernel and requires
additional mechanisms for effective isolation of vkernel processes from the
host system.
That said, the real kernel still has to deal with some amount of VM work and
reserve some memory space that is proportional to the number of processes
running under the vkernel. This statement will be made clear after we examine
the new system calls for the manipulation of vmspace objects.
In the kernel, the main purpose of a vmspace object is to describe the
address space of one or more processes. Each process normally has one vmspace,
but a vmspace may be shared by several processes. An address space is logically
partitioned into sets of pages, so that all pages in a set are backed by the
same VM object (and are linearly mapped on it) and have the same protection
bits. All such sets are represented as vm_map_entry structures. VM map entries
are linked together both by a tree and a linked list so that lookups,
additions, deletions and merges can be performed efficiently (with low time
complexity). Control information and pointers to these data structures are
encapsulated in the vm_map object that is contained in every vmspace (see
the simplified sketch below).
A VM object (vm_object) is an interface to a data store
and can be of various types (default, swap, vnode, ...) depending on where it
gets its pages from. The existence of shadow objects somewhat complicates
matters, but for our purposes this simplified model should be sufficient. For
more information you're urged to have a look at the source and refer to
[McKusick04] and [Dillon00].
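The following heavily simplified sketch shows how these structures fit
together; field names are abbreviated and most state has been omitted, so
treat it as an illustration rather than the actual DragonFly definitions:

    /* Illustrative only; the real definitions live under sys/vm/ */
    struct vm_map_entry {
        vm_offset_t       start, end;  /* range of addresses covered */
        struct vm_object *object;      /* backing store for these pages */
        vm_prot_t         protection;  /* protection bits for the range */
        /* entries are linked into both a list and a lookup tree */
    };

    struct vm_map {                    /* per-address-space map */
        struct vm_map_entry header;    /* head of the entry list */
        int nentries;                  /* how many entries follow */
    };

    struct vmspace {                   /* shared by one or more processes */
        struct vm_map vm_map;          /* the map described above */
    };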
In the first stages of the development of vkernel, a number of system
calls were added to the kernel that allow a process to associate itself with
more than one vmspace. The creation of a vmspace is accomplished by
vmspace_create(). The new vmspace is uniquely identified by an arbitrary value
supplied as an argument. Similarly, the vmspace_destroy() call deletes the
vmspace identified by the value of its only parameter. It is expected that only
a virtual kernel running as a user process will need access to alternate
address spaces. Also, it should be made clear that while a process can have
many vmspaces associated with it, only one vmspace is active at any given time.
The active vmspace is the one operated on by mmap(), munmap(), and the
other memory-related system calls.
The virtual kernel creates a vmspace for each of its processes and it
destroys the associated vmspace when a vproc is terminated, but this behavior
is not compulsory. Since, just like in the real kernel, all information about a
process and its address space is stored in kernel memory, the vmspace
can be disposed of and reinstantiated at
will; its existence is only necessary while the vproc is running. One can
imagine the vkernel destroying the vproc vmspaces in response to a low memory
situation in the host system.
When it decides that it needs to run a certain process, the vkernel issues
a vmspace_ctl() system call with an argument of
VMSPACE_CTL_RUN as the command
(currently there are no other commands available), specifying the desired
vmspace to activate. Naturally, it also needs to supply the necessary context
(values of general purpose registers, instruction/stack pointers, descriptors)
in which execution will resume. The original vmspace is special; if, while
running on an alternate address space, a condition occurs which requires kernel
intervention (for example, a floating point operation throws an exception or a
system call is made), the host kernel automatically switches back to the
previous vmspace handing over the execution context at the time the exceptional
condition caused entry into the kernel and leaving it to the vkernel to resolve
matters. Signals by other host processes are likewise delivered after switching
back to the vkernel vmspace.
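A rough sketch of how a vkernel might drive a vproc with these calls
follows; the argument lists are abbreviated (the real system calls take
additional parameters), so the signatures and the helper shown are
assumptions based on the description above:

    /* Simplified vkernel scheduling of one vproc; details elided */
    void run_vproc(struct vproc *vp)
    {
        /* create an alternate vmspace, identified by an arbitrary
         * value -- here, the vproc pointer itself */
        vmspace_create(vp /* , ... */);

        for (;;) {
            /* resume the vproc; this returns when a system call,
             * fault or signal requires the vkernel's attention */
            vmspace_ctl(vp, VMSPACE_CTL_RUN /* , register context ... */);
            if (handle_exit_condition(vp))  /* hypothetical helper */
                break;
        }
        /* the vmspace can be disposed of once the vproc exits */
        vmspace_destroy(vp /* , ... */);
    }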
Support for creating and managing alternate vmspaces is also
available to vkernel processes. This requires special care so that all the
relevant code sections can operate in a recursive manner. The result is that
vkernels can be nested, that is, one can have a vkernel running as a process
under a second vkernel running as a process under a third vkernel and so
on. Naturally, the overhead incurred for each level of recursion does not
make this an attractive setup performance-wise, but it is a neat feature
nevertheless.
The previous paragraphs have described the background of vkernel
development and have given a high-level overview of how the vkernel fits in with
the abstractions provided by the real kernel. We are now ready to dive into the
most interesting parts of the code, where we will get acquainted with a new
type of page table and discuss the details of FPU virtualization and vproc
<-> vkernel communication. But this discussion needs an article of its own,
therefore it will have to wait for a future week.
[McKusick04] The Design and Implementation of the FreeBSD Operating
System, Kirk McKusick and George Neville-Neil.
[Dillon00] Design elements of the FreeBSD VM system, Matthew Dillon.
[Lemon00] Kqueue: A generic and scalable event notification facility,
Jonathan Lemon.
[AST06] Operating Systems Design and Implementation, Andrew Tanenbaum
and Albert Woodhull.
[Provos03] Improving Host Security with System Call Policies, Niels Provos.
[Stevens99] UNIX Network Programming, Volume 1: Sockets and XTI,
Richard Stevens.
There are of course other alternatives, the most obvious one being
having one process for the virtual kernel and another for contained processes,
which is mostly equivalent to the choice made in DragonFly.
A process running under a virtual kernel will also be referred to as a
"vproc", to distinguish it from host kernel processes.
The small matter of the actual data belonging to the vproc is not an issue,
but you will have to wait until we get to the RAM file in the next
subsection to see why.