LWN Weekly Edition
Kernel development

Kernel release status

The current 2.6 prepatch is 2.6.21-rc5, released on March 25. It contains a number of fixes, including a set for timer-related regressions. Says Linus: "Those timer changes ended up much more painful than anybody wished for, but big thanks to Thomas Gleixner for being on it like a weasel on a dead rat, and the regression list has kept shrinking." See the long-format changelog for the details.

The current -mm tree is 2.6.21-rc5-mm3, released on March 30 (see below).

The current stable 2.6 kernel is 2.6.20.4, released on March 23.

For older kernels: 2.6.16.46 was released with several fixes and some USB work on March 31 (see below). In the 2.4 world, 2.4.34.2 was released on March 24; it only contains two changes. 2.4.35-pre2 is also out with a rather larger set of fixes.
Kernel development news

Quote of the week
I find that the key to understanding kernel code is to understand the data
structures and the relationships between them. Once you have that in your
head, the code tends to just fall out. Hence there is good maintainability
payoff in putting work into documenting the struct, its fields, the
relationship between this struct and other structs, and any and all locking
requirements.
-- Andrew Morton
<wonders wtf "ticks" does>
A summary of 2.6.21 API changes

The 2.6.21 kernel release is getting closer, so it makes sense to review the internal API changes which have been made in this development cycle. As always, this information will eventually find its way to the LWN 2.6 kernel API changes page.
Video4Linux2 part 5b: format negotiation
As we saw in the previous article, there are many ways of representing image data in memory. There is probably no video device on the market which can handle all of the formats understood by the Video4Linux interface. Drivers are not expected to support formats not understood by the underlying hardware; in fact, performing format conversions within the kernel is explicitly frowned upon. So the driver must make it possible for the application to select a format which works with the hardware.

The first step is to simply allow the application to query the supported formats. The VIDIOC_ENUM_FMT ioctl() is provided for the purpose; within the driver this command turns into a call to this callback (if a video capture device is being queried):
    int (*vidioc_enum_fmt_cap)(struct file *file, void *private_data,
                               struct v4l2_fmtdesc *f);
This callback will ask a video capture device to describe one of its formats. The application will pass in a v4l2_fmtdesc structure:
    struct v4l2_fmtdesc
    {
        __u32 index;
        enum v4l2_buf_type type;
        __u32 flags;
        __u8 description[32];
        __u32 pixelformat;
        __u32 reserved[4];
    };
The application will set the index and type fields. index is a simple integer used to identify a format; like the other indexes used by V4L2, this one starts at zero and increases to the maximum number of formats supported. An application can enumerate all of the supported formats by incrementing the index value until the driver returns EINVAL. The type field describes the data stream type; it will be V4L2_BUF_TYPE_VIDEO_CAPTURE for a video capture (camera or tuner) device.

If the index corresponds to a supported format, the driver should fill in the rest of the structure. The pixelformat field should be the fourcc code describing the video representation and description a short textual description of the format. The only defined value for the flags field is V4L2_FMT_FLAG_COMPRESSED, which indicates a compressed video format.

The above callback is for video capture devices; it will only be called when type is V4L2_BUF_TYPE_VIDEO_CAPTURE. The VIDIOC_ENUM_FMT call will be split out into different callbacks depending on the type field:
    /* V4L2_BUF_TYPE_VIDEO_OUTPUT */
    int (*vidioc_enum_fmt_video_output)(file, private_data, f);
    /* V4L2_BUF_TYPE_VIDEO_OVERLAY */
    int (*vidioc_enum_fmt_overlay)(file, private_data, f);
    /* V4L2_BUF_TYPE_VBI_CAPTURE */
    int (*vidioc_enum_fmt_vbi)(file, private_data, f);
    /* V4L2_BUF_TYPE_SLICED_VBI_CAPTURE */
    int (*vidioc_enum_fmt_vbi_capture)(file, private_data, f);
    /* V4L2_BUF_TYPE_VBI_OUTPUT */
    /* V4L2_BUF_TYPE_SLICED_VBI_OUTPUT */
    int (*vidioc_enum_fmt_vbi_output)(file, private_data, f);
    /* V4L2_BUF_TYPE_PRIVATE */
    int (*vidioc_enum_fmt_type_private)(file, private_data, f);
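To make the enumeration protocol concrete, here is a minimal user-space sketch of what a capture driver's enumeration handler might look like. It is not taken from any real driver: the demo_ structures and constants are simplified stand-ins for the kernel definitions (so that the code builds outside the kernel), and the two-entry format table is purely hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the kernel definitions, so the sketch builds in user space. */
#define DEMO_FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))
#define DEMO_FMT_FLAG_COMPRESSED 0x0001

enum demo_buf_type { DEMO_BUF_TYPE_VIDEO_CAPTURE = 1 };

struct demo_fmtdesc {
	uint32_t index;
	enum demo_buf_type type;
	uint32_t flags;
	uint8_t description[32];
	uint32_t pixelformat;
	uint32_t reserved[4];
};

/* Hypothetical list of formats the imaginary hardware can produce. */
static const struct {
	const char *name;
	uint32_t fourcc;
	uint32_t flags;
} demo_formats[] = {
	{ "YUV 4:2:2 (YUYV)", DEMO_FOURCC('Y', 'U', 'Y', 'V'), 0 },
	{ "Motion JPEG", DEMO_FOURCC('M', 'J', 'P', 'G'),
	  DEMO_FMT_FLAG_COMPRESSED },
};
#define DEMO_NFORMATS (sizeof(demo_formats) / sizeof(demo_formats[0]))

/* Shaped like vidioc_enum_fmt_cap(); file and private are unused here. */
static int demo_enum_fmt_cap(void *file, void *private,
			     struct demo_fmtdesc *f)
{
	(void)file;
	(void)private;
	assert(f != NULL);

	if (f->index >= DEMO_NFORMATS)
		return -EINVAL;	/* tells the application the list has ended */

	f->flags = demo_formats[f->index].flags;
	f->pixelformat = demo_formats[f->index].fourcc;
	strncpy((char *)f->description, demo_formats[f->index].name,
		sizeof(f->description) - 1);
	f->description[sizeof(f->description) - 1] = '\0';
	return 0;
}

/* Application-side view: bump index until the driver says -EINVAL. */
static unsigned int demo_count_formats(void)
{
	struct demo_fmtdesc f;
	unsigned int n = 0;

	memset(&f, 0, sizeof(f));
	f.type = DEMO_BUF_TYPE_VIDEO_CAPTURE;
	while (demo_enum_fmt_cap(NULL, NULL, &f) == 0) {
		n++;
		f.index = n;
	}
	return n;
}
```

A real driver would receive struct v4l2_fmtdesc through the V4L2 layer rather than being called directly; the point here is only the index check and the fields the driver is expected to fill in.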
The argument types are the same for all of the VIDIOC_ENUM_FMT callbacks. It's worth noting that drivers can support special buffer types with codes starting with V4L2_BUF_TYPE_PRIVATE, but that would clearly require a special understanding on the application side. For the purposes of this article, we will focus on video capture and output devices; the other types of video devices will be examined in future installments.

The application can find out how the hardware is currently configured with the VIDIOC_G_FMT call. The argument passed in this case is a v4l2_format structure:
    struct v4l2_format
    {
        enum v4l2_buf_type type;
        union
        {
            struct v4l2_pix_format pix;
            struct v4l2_window win;
            struct v4l2_vbi_format vbi;
            struct v4l2_sliced_vbi_format sliced;
            __u8 raw_data[200];
        } fmt;
    };
Once again, type describes the buffer type; the V4L2 layer will split this call into one of several driver callbacks depending on that type. For video capture devices, the callback is:
    int (*vidioc_g_fmt_cap)(struct file *file, void *private_data,
                            struct v4l2_format *f);
For video capture (and output) devices, the pix field of the union is of interest. This is the v4l2_pix_format structure seen in the previous installment; the driver should fill in that structure with the current hardware settings and return. This call should not normally fail unless something is seriously wrong with the hardware. The other callbacks are:
    int (*vidioc_g_fmt_overlay)(file, private_data, f);
    int (*vidioc_g_fmt_video_output)(file, private_data, f);
    int (*vidioc_g_fmt_vbi)(file, private_data, f);
    int (*vidioc_g_fmt_vbi_output)(file, private_data, f);
    int (*vidioc_g_fmt_vbi_capture)(file, private_data, f);
    int (*vidioc_g_fmt_type_private)(file, private_data, f);

The vidioc_g_fmt_video_output() callback uses the same pix field in the same way as capture interfaces do.

Most applications will eventually want to configure the hardware to provide a format which works for their purpose. There are two interfaces provided for changing video formats. The first of these is the VIDIOC_TRY_FMT call, which, within a V4L2 driver, turns into one of these callbacks:
    int (*vidioc_try_fmt_cap)(struct file *file, void *private_data,
                              struct v4l2_format *f);
    int (*vidioc_try_fmt_video_output)(struct file *file, void *private_data,
                                       struct v4l2_format *f);
    /* And so on for the other buffer types */
To handle this call, the driver should look at the requested video format and decide whether that format can be supported by the hardware or not. If the application has requested something impossible, the driver should return -EINVAL. So, for example, a fourcc code describing an unsupported format or a request for interlaced video on a progressive-only device would fail. On the other hand, the driver can adjust size fields to match an image size supported by the hardware; normal practice is to adjust sizes downward if need be. So a driver for a device which only handles VGA-resolution images would change the width and height parameters accordingly and return success. The v4l2_format structure will be copied back to user space after the call; the driver should update the structure to reflect any changed parameters so the application can see what it is really getting.

The VIDIOC_TRY_FMT handlers are optional for drivers, but omitting this functionality is not recommended. If provided, this function is callable at any time, even if the device is currently operating. It should not make any changes to the actual hardware operating parameters; it is just a way for the application to find out what is possible.

When the application wants to change the hardware's format for real, it does a VIDIOC_S_FMT call, which arrives at the driver in this form:
    int (*vidioc_s_fmt_cap)(struct file *file, void *private_data,
                            struct v4l2_format *f);
    int (*vidioc_s_fmt_video_output)(struct file *file, void *private_data,
                                     struct v4l2_format *f);
Unlike VIDIOC_TRY_FMT, this call cannot be made at arbitrary times. If the hardware is currently operating, or if it has streaming buffers allocated (a topic for yet another future installment), changing the format could lead to no end of mayhem. Consider what happens, for example, if the new format is larger than the buffers which are currently in use. So the driver should always ensure that the hardware is idle and fail the request (with -EBUSY) if not.

A format change should be atomic: it should change all of the parameters to match the request or none of them. Once again, image size parameters can be adjusted by the driver if need be. The usual form of these callbacks is something like this:
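    int my_s_fmt_cap(struct file *file, void *private,
                     struct v4l2_format *f)
    {
        struct mydev *dev = (struct mydev *) private;
        int ret;

        /* Refuse to change formats while the hardware is in use. */
        if (hardware_busy(dev))
            return -EBUSY;
        /* Validate the request, adjusting it if need be. */
        ret = my_try_fmt_cap(file, private, f);
        if (ret != 0)
            return ret;
        /* The format is known to work; program it into the hardware. */
        return tweak_hardware(dev, &f->fmt.pix);
    }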
Using the VIDIOC_TRY_FMT handler avoids duplication of code and gets rid of any excuse for not implementing that handler in the first place. If the "try" function succeeds, the resulting format is known to work and can be programmed directly into the hardware.

There are a number of other calls which influence how video I/O is done. Future articles will look at some of them. Support for setting formats is enough to enable applications to start transferring images, however, and that is the purpose of all this structure in the end. So the next article, hopefully to come after a shorter delay than happened this time around, will get into support for reading and writing video data.
A peek at the DragonFly Virtual Kernel (part 1)

In this article, we will describe several aspects of the architecture of DragonFly BSD's virtual kernel infrastructure, which allows the kernel to be run as a user-space process. Its design and implementation is largely the work of the project's lead developer, Matthew Dillon, who first announced his intention of modifying the kernel to run in userspace on September 2nd 2006. The first stable DragonFly BSD version to feature virtual kernel (vkernel) support was DragonFly 1.8, released on January 30th 2007.

The motivation for this work (as can be found in the initial mail linked to above) was finding an elegant solution to one immediate and one long term issue in pursuing the project's main goal of Single System Image clustering over the Internet. First, as any person who is familiar with distributed algorithms will attest, implementing cache coherency without hardware support is a complex task. It would not be made any easier by enduring a 2-3 minute delay in the edit-compile-run cycle while each machine goes through the boot sequence. As a nice side effect, userspace programming errors are unlikely to bring the machine down and one has the benefit of working with superior debugging tools (and can more easily develop new ones).

The second, long term, issue that virtual kernels are intended to address is finding a way to securely and efficiently dedicate system resources to a cluster that operates over the (hostile) Internet. Because a kernel is a more or less standalone environment, it should be possible to completely isolate the process a virtual kernel runs in from the rest of the system. While the problem of process isolation is far from solved, there exist a number of promising approaches. One option, for example, would be to use systrace (refer to [Provos03]) to mask out all but the few (and hopefully carefully audited) system calls that the vkernel requires after initialization has taken place.
This setup would allow for a significantly higher degree of protection for the host system in the event that the virtualized environment was compromised. Moreover, the host kernel already has well-tested facilities for arbitrating resources, although these facilities are not necessarily sufficient or dependable; the CPU scheduler is not infallible and mechanisms for allocating disk I/O bandwidth will need to be implemented or expanded. In any case, leveraging preexisting mechanisms reduces the burden on the project's development team, which can't be all bad.

Preparatory work

Getting the kernel to build as a regular, userspace, ELF executable required tidying up large portions of the source tree. In this section we will focus on the two large sets of changes that took place as part of this cleanup. The second set might seem superficial and hardly worthy of mention as such, but in explaining the reason that led to it, we shall discuss an important decision that was made in the implementation of the virtual kernel.

The first set of changes was separating machine dependent code into platform- and CPU-specific parts. The real and virtual kernels can be considered to run on two different platforms; the first is (only, as must reluctantly be admitted) running on 32-bit PC-style hardware, while the second is running on a DragonFly kernel. Regardless of the differences between the two platforms, both kernels expect the same processor architecture. After the separation, the cpu/i386 directory of the kernel tree is left with hand-optimized assembly versions of certain kernel routines, headers relevant only to x86 CPUs and code that deals with object relocation and debug information.
The real kernel's platform directory (platform/pc32) is familiar with things like programmable interrupt controllers, power management and the PC BIOS (that the vkernel doesn't need), while the virtual kernel's platform/vkernel directory is happily using the system calls that the real kernel can't have. Of course this does not imply that there is absolutely no code duplication, but fixing that is not a pressing problem.

The massive second set of changes involved primarily renaming quite a few kernel symbols so that there are no clashes with the libc ones (e.g. *printf(), qsort, errno etc.) and using kdev_t for the POSIX dev_t type in the kernel. As should be plain, this was a prerequisite for having the virtual kernel link with the standard C library. Given that the kernel is self-hosted (this means that, since it cannot generally rely on support software after it has been loaded, the kernel includes its own helper routines), one can question the decision of pulling in all of libc instead of simply adding the (few) system calls that the vkernel actually uses. A controversial choice at the time, it prevailed because it was deemed that it would allow future vkernel code to leverage the extended functionality provided by libc. Particularly, thread-awareness in the system C library should accommodate the (medium term) plan to mimic multi-processor operation by the use of one vkernel thread for each hypothetical CPU. It is safe to say that if the plan materializes, linking against libc will prove to be a most profitable tradeoff.

The Virtual Kernel

In this section, we will study the architecture of the virtual kernel and the design choices made in its development, focusing on its differences from a kernel running on actual hardware. In the process, we'll need to describe the changes made in the real (host) kernel code, specifically in order to support a DragonFly kernel running as a user process.
Address Space Model

The first design choice made in the development of the vkernel is that the whole virtualized environment is executing as part of the same real-kernel process. This imposes well defined limits on the amount of real-kernel resources that may be consumed by it and makes containment straightforward. Processes running under the vkernel are not in direct competition with host processes for CPU time, and most parts of the bookkeeping that is expected from a kernel during the lifetime of a process are handled by the virtual kernel. The alternative[1], running each vkernel process[2] in the context of a real kernel process, imposes extra burden on the host kernel and requires additional mechanisms for effective isolation of vkernel processes from the host system. That said, the real kernel still has to deal with some amount of VM work and reserve some memory space that is proportional to the number of processes running under the vkernel. This statement will be made clear after we examine the new system calls for the manipulation of vmspace objects.

In the kernel, the main purpose of a vmspace object is to describe the address space of one or more processes. Each process normally has one vmspace, but a vmspace may be shared by several processes. An address space is logically partitioned into sets of pages, so that all pages in a set are backed by the same VM object (and are linearly mapped on it) and have the same protection bits. All such sets are represented as vm_map_entry structures. VM map entries are linked together both by a tree and a linked list so that lookups, additions, deletions and merges can be performed efficiently (with low time complexity). Control information and pointers to these data structures are encapsulated in the vm_map object that is contained in every vmspace (see the diagram below).
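The relationships between these structures can also be sketched in code. The following is a deliberately simplified user-space model, not the DragonFly definitions: all toy_ names and fields are assumptions made for illustration, the VM object is left opaque, and only the address-ordered linked list is modeled (the real kernel also keeps a tree for faster lookups).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_vm_object;			/* backing store; opaque here */

/* One run of pages sharing a backing object and protection bits. */
struct toy_vm_map_entry {
	uintptr_t start;		/* first address covered */
	uintptr_t end;			/* one past the last address */
	uint64_t offset;		/* where 'start' falls in the object */
	int prot;			/* protection bits for the whole run */
	struct toy_vm_object *object;
	struct toy_vm_map_entry *next;	/* address-ordered list */
};

struct toy_vm_map {
	struct toy_vm_map_entry *head;
	unsigned int nentries;
};

struct toy_vmspace {
	struct toy_vm_map map;		/* every vmspace contains a map */
	int refcnt;			/* processes sharing this vmspace */
};

/* Insert an entry, keeping the list sorted; reject overlapping ranges. */
static int toy_map_insert(struct toy_vm_map *map, struct toy_vm_map_entry *e)
{
	struct toy_vm_map_entry **pp = &map->head;

	assert(e->start < e->end);
	while (*pp != NULL && (*pp)->end <= e->start)
		pp = &(*pp)->next;
	if (*pp != NULL && (*pp)->start < e->end)
		return -1;		/* overlaps an existing entry */
	e->next = *pp;
	*pp = e;
	map->nentries++;
	return 0;
}

/* Find the entry covering addr, or NULL if the address is unmapped. */
static struct toy_vm_map_entry *toy_map_lookup(const struct toy_vm_map *map,
					       uintptr_t addr)
{
	struct toy_vm_map_entry *e;

	for (e = map->head; e != NULL; e = e->next) {
		if (addr < e->start)
			break;		/* sorted list: nothing further matches */
		if (addr < e->end)
			return e;
	}
	return NULL;
}

/* Linear mapping: translate an address to an offset in the object. */
static uint64_t toy_entry_offset(const struct toy_vm_map_entry *e,
				 uintptr_t addr)
{
	return e->offset + (addr - e->start);
}
```

A list-only lookup is linear in the number of entries, which is exactly the cost that the additional tree avoids.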
A VM object (vm_object) is an interface to a data store and can be of various types (default, swap, vnode, ...) depending on where it gets its pages from. The existence of shadow objects somewhat complicates matters, but for our purposes this simplified model should be sufficient. For more information you're urged to have a look at the source and refer to [McKusick04] and [Dillon00].

In the first stages of the development of vkernel, a number of system calls were added to the kernel that allow a process to associate itself with more than one vmspace. The creation of a vmspace is accomplished by vmspace_create(). The new vmspace is uniquely identified by an arbitrary value supplied as an argument. Similarly, the vmspace_destroy() call deletes the vmspace identified by the value of its only parameter. It is expected that only a virtual kernel running as a user process will need access to alternate address spaces. Also, it should be made clear that while a process can have many vmspaces associated with it, only one vmspace is active at any given time. The active vmspace is the one operated on by mmap()/munmap()/madvise()/etc.

The virtual kernel creates a vmspace for each of its processes and it destroys the associated vmspace when a vproc is terminated, but this behavior is not compulsory. Since, just like in the real kernel, all information about a process and its address space is stored in kernel memory[3], the vmspace can be disposed of and reinstantiated at will; its existence is only necessary while the vproc is running. One can imagine the vkernel destroying the vproc vmspaces in response to a low memory situation in the host system.

When it decides that it needs to run a certain process, the vkernel issues a vmspace_ctl() system call with an argument of VMSPACE_CTL_RUN as the command (currently there are no other commands available), specifying the desired vmspace to activate.
Naturally, it also needs to supply the necessary context (values of general purpose registers, instruction/stack pointers, descriptors) in which execution will resume. The original vmspace is special; if, while running on an alternate address space, a condition occurs which requires kernel intervention (for example, a floating point operation throws an exception or a system call is made), the host kernel automatically switches back to the previous vmspace, handing over the execution context at the time the exceptional condition caused entry into the kernel and leaving it to the vkernel to resolve matters. Signals sent by other host processes are likewise delivered after switching back to the vkernel vmspace.

Support for creating and managing alternate vmspaces is also available to vkernel processes. This requires special care so that all the relevant code sections can operate in a recursive manner. The result is that vkernels can be nested, that is, one can have a vkernel running as a process under a second vkernel running as a process under a third vkernel and so on. Naturally, the overhead incurred for each level of recursion does not make this an attractive setup performance-wise, but it is a neat feature nonetheless.

The previous paragraphs have described the background of vkernel development and have given a high-level overview of how the vkernel fits in with the abstractions provided by the real kernel. We are now ready to dive into the most interesting parts of the code, where we will get acquainted with a new type of page table and discuss the details of FPU virtualization and vproc <-> vkernel communication. But this discussion needs an article of its own, therefore it will have to wait for a future week.
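As a recap of the vmspace life cycle described in this article, here is a toy user-space model of it. It is emphatically not the DragonFly system call interface: the toy_ functions, their signatures, the error codes and the callback used to stand in for guest execution are all assumptions chosen for illustration. What it does capture is that vmspaces are identified by an arbitrary value, that only one is active at a time, and that control always returns to the original vmspace when the guest needs the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_VMSPACES 8

enum { TOY_TRAP_NONE, TOY_TRAP_SYSCALL, TOY_TRAP_FPU };

/* Hypothetical stand-in for the saved execution context of a vproc. */
struct toy_context {
	unsigned long pc;	/* instruction pointer */
	unsigned long sp;	/* stack pointer */
	int trap;		/* why control came back to the vkernel */
};

struct toy_host {
	void *ids[TOY_MAX_VMSPACES];	/* NULL marks a free slot */
	void *active;			/* NULL means the original vmspace */
};

static int toy_find(const struct toy_host *h, const void *id)
{
	int i;

	for (i = 0; i < TOY_MAX_VMSPACES; i++)
		if (h->ids[i] == id)
			return i;
	return -1;
}

/* Register a new vmspace under an arbitrary caller-chosen identifier. */
static int toy_vmspace_create(struct toy_host *h, void *id)
{
	int slot;

	if (id == NULL)
		return -EINVAL;
	if (toy_find(h, id) >= 0)
		return -EEXIST;
	slot = toy_find(h, NULL);
	if (slot < 0)
		return -ENOMEM;
	h->ids[slot] = id;
	return 0;
}

static int toy_vmspace_destroy(struct toy_host *h, void *id)
{
	int slot = (id != NULL) ? toy_find(h, id) : -1;

	if (slot < 0)
		return -ENOENT;
	h->ids[slot] = NULL;
	return 0;
}

/* Models the "run" command: switch in, run until a trap, switch back. */
static int toy_vmspace_run(struct toy_host *h, void *id,
			   struct toy_context *ctx,
			   int (*guest)(struct toy_context *))
{
	if (id == NULL || toy_find(h, id) < 0)
		return -ENOENT;
	h->active = id;
	ctx->trap = guest(ctx);	/* guest runs until it needs the kernel */
	h->active = NULL;	/* host returns to the original vmspace */
	return 0;
}

/* Sample guest: executes a two-byte instruction that makes a system call. */
static int toy_guest_syscall(struct toy_context *ctx)
{
	ctx->pc += 2;
	return TOY_TRAP_SYSCALL;
}
```

In this model the caller plays the role of the vkernel: after toy_vmspace_run() returns, it inspects ctx->trap and the updated context to decide how to service the vproc.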
Bibliography

[McKusick04] The Design and Implementation of the FreeBSD Operating System, Kirk McKusick and George Neville-Neil.
[Dillon00] Design elements of the FreeBSD VM system, Matthew Dillon.
[Lemon00] Kqueue: A generic and scalable event notification facility, Jonathan Lemon.
[AST06] Operating Systems Design and Implementation, Andrew Tanenbaum and Albert Woodhull.
[Provos03] Improving Host Security with System Call Policies, Niels Provos.
[Stevens99] UNIX Network Programming, Volume 1: Sockets and XTI, Richard Stevens.
Notes
Page editor: Jonathan Corbet
Copyright © 2007, Eklektix, Inc.