Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel is 2.6.18.1, released on October 16. It contains a rather long list of fixes for problems which have been encountered in 2.6.18.

The stable team has also released 2.6.17.14 with a smaller set of fixes. This will probably be the final 2.6.17.x release.

Adrian Bunk has released 2.6.16.30-rc1 with several new fixes.

The current 2.6 prepatch is 2.6.19-rc2, released by Linus on October 13. There's a bunch of fixes here, but also the big interrupt handler prototype change and the initial merge of the developmental ext4 filesystem with a few enhancements. See the long-format changelog for the details.

Around 250 post-rc2 patches - almost all fixes - have gone into the mainline git repository as of this writing.

The current -mm tree is 2.6.19-rc2-mm1. Recent changes to -mm include generic backlight device support, some changes to how per-CPU data works on i386, and a FUSE update. There is also a new round_jiffies() function which rounds a time value up to the next whole second. The idea is to cause recurring timers to go off at the same time, reducing the number of timer interrupts needed.
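The rounding idea can be sketched in a few lines of user-space C. This is an illustration only, not the code from -mm: the tick rate SKETCH_HZ and the function name round_up_to_second() are assumptions made for the example, and the real round_jiffies() may differ in its details.

```c
/* User-space sketch; the kernel's round_jiffies() may differ in detail. */
#define SKETCH_HZ 250UL   /* assumed tick rate: 250 timer ticks per second */

/* Round a tick count up to the next multiple of SKETCH_HZ (a whole second). */
static unsigned long round_up_to_second(unsigned long ticks)
{
    unsigned long rem = ticks % SKETCH_HZ;

    if (rem == 0)
        return ticks;                  /* already on a second boundary */
    return ticks + (SKETCH_HZ - rem);  /* move forward to the boundary */
}
```

With this sketch, two timers which would have expired at ticks 1001 and 1249 both end up at tick 1250, so a single timer interrupt can service both.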

Kernel development news

Quote of the week

Wow, who'd have thought that loading 6 megabytes of unauditable code into your kernel and X server might be a bad idea? It's almost like code running as root was some sort of potential security issue, or something.

-- Matthew Garrett

Return values, warnings, and error situations

The function pci_set_mwi() enables the "memory write and invalidate" (MWI) mode on the PCI bus. If the device on the other end can work with MWI, a small optimization results. The MWI mode might not be enabled, however, even if a device driver requests it; the bus hardware itself might not support it. A failure to set MWI is not generally a problem; things just go a bit slower than they would have otherwise. The calling driver might still want to know if the call succeeded, however, so Matthew Wilcox recently fixed the function to return -EINVAL if the attempt fails.

It turns out that this is one of the many patches which have recently sabotaged Andrew Morton's heavily abused Vaio laptop. Some code was checking the result of pci_set_mwi(); once that function actually returned the result of the operation, the calling code failed on an error path. But, as noted above, a failure to set MWI is almost never a fatal problem. So, in response to this series of events, Alan Cox asserted:

The underlying bug is that someone marked pci_set_mwi must-check, that's wrong for most of the drivers that use it. If you remove the must check annotation from it then the problem and a thousand other spurious warnings go away.

One suspects Alan is also behind code like the following, from drivers/ata/pata_cs5530.c:

    compiler_warning_pointless_fix = pci_set_mwi(cs5530_0);

The __must_check annotation makes use of the gcc warn_unused_result attribute; it first found its way into the mainline in 2.6.8. If a function is marked __must_check, the compiler will issue a strong warning whenever the function is called and its return code is unused.
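A user-space sketch of the mechanism might look like the following. The macro is defined locally here so that the example builds outside the kernel; copy_name() and use_name() are hypothetical helpers invented for the illustration.

```c
#include <stddef.h>
#include <string.h>

/* Defined locally for this sketch; the kernel provides its own definition. */
#ifndef __must_check
#define __must_check __attribute__((warn_unused_result))
#endif

/* Hypothetical helper: copy src into dst; 0 on success, -1 if it won't fit. */
static __must_check int copy_name(char *dst, size_t dst_len, const char *src)
{
    size_t n = strlen(src);

    if (n >= dst_len)
        return -1;
    memcpy(dst, src, n + 1);   /* include the terminating NUL */
    return 0;
}

/* Returns the length of the copied name, or -1 if the copy failed. */
static int use_name(const char *src)
{
    char buf[8];

    /*
     * A bare "copy_name(buf, sizeof(buf), src);" here would draw the
     * gcc warning; the result has to be consumed in some way.
     */
    if (copy_name(buf, sizeof(buf), src) != 0)
        return -1;
    return (int)strlen(buf);
}
```

Assigning the result to a throwaway variable, as in the pata_cs5530.c line above, silences the warning without actually handling the error.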

The use of __must_check is another step in the long path toward automatic detection of potential bugs. It is intended for functions whose return value really does require checking - copy_from_user() is a good example. If that function fails, and the calling code does not notice, it will proceed using essentially random data. Similar issues come up in user space; witness the recent vulnerabilities resulting from privileged applications which fail to check the result of a setuid() call. In some cases, there clearly is no excuse for not looking at the return value, and __must_check is a good way to find incorrect function usage before it creates real problems.

In current kernels, however, the list of __must_check functions has grown rather long: it includes most of the sysfs, PCI, kobject, and driver core APIs. In some cases, as with pci_set_mwi(), it now includes functions whose return values are often of no interest to the calling code. The result, in this case, is snide workarounds in the code, added warning noise, and an actual bug where code which need not fail does so in response to an error return code.

Still, according to Andrew Morton, it is a mistake to ignore an error return from a function like pci_set_mwi():

You, the driver author _do not know_ what pci_set_mwi() does at present, on all platforms, nor do you know what it does in the future. For you the driver author to make assumptions about what's happening inside pci_set_mwi() is a layering violation. Maybe the bridge got hot-unplugged. Maybe the attempt to set MWI caused some synchronous PCI error. For example, take a look at the various implementations of pci_ops.read() around the place - various of them can fail for various reasons.

This discussion led, eventually, to what might be the real issue: how should in-kernel APIs be designed to properly return status information? A suggestion which has been made is that pci_set_mwi() should return zero or one, depending on whether MWI is a possible operating mode. Only if something goes drastically wrong on the PCI bus should a negative error code be returned. No such patch has yet been merged, but that seems like the way this particular issue is likely to be resolved.
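The suggested convention can be sketched as follows. Nothing here is kernel code: struct sketch_bus, sketch_set_mwi(), and sketch_probe() are hypothetical stand-ins, written only to show how a caller would treat the three kinds of return value.

```c
#include <errno.h>

/* Hypothetical stand-in for a bus; not a kernel structure. */
struct sketch_bus {
    int present;        /* 0 if the bus has gone away */
    int supports_mwi;   /* nonzero if MWI can be enabled */
};

/*
 * Returns 1 if MWI is enabled, 0 if it is simply unavailable (not an
 * error), or a negative error code if something went badly wrong.
 */
static int sketch_set_mwi(const struct sketch_bus *bus)
{
    if (!bus->present)
        return -ENODEV;
    return bus->supports_mwi ? 1 : 0;
}

/* The caller fails only on negative values; 0 and 1 are both fine. */
static int sketch_probe(const struct sketch_bus *bus, int *using_mwi)
{
    int ret = sketch_set_mwi(bus);

    if (ret < 0)
        return ret;
    *using_mwi = ret;
    return 0;
}
```

Under this scheme a negative value always means a real failure, so a caller never has reason to ignore one.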

The larger discussion of how errors should be handled may just be beginning, however. There are a number of de-facto conventions for kernel APIs which have evolved over time, but no overall policy on error handling. So Andrew would like to talk about guidelines on how different kinds of errors should be handled. In particular, he suggests a rule that a negative error code should never be ignored in any situation. Cases where this kind of result is not relevant (pci_set_mwi() being an example) are an indication of an API in need of a redesign.

Over time, it would not be surprising to see kernel interfaces shift so that more error conditions are handled further down the call chain and error codes are no longer returned for non-error situations. There is also likely to be a continued effort to cut down on the warning noise, which, at times, threatens to drown out the real errors. With luck, all of this work will lead to safer interfaces and a more robust kernel in the future.

The death and possible rebirth of sysctl()

The sysctl() system call has had a rough life. It began as an idea imported from BSD; it allows a user-space process to tweak various kernel parameters using a set of integer indexes. People quickly discovered, however, that a text and filesystem-based interface (as seen under /proc/sys) is much easier to deal with. The /proc/sys hierarchy can be adjusted from the shell and manipulated by scripts - and nobody has to worry about sysctl numbers. So there are very few users of sysctl(), which has been considered deprecated for a long time. Recent kernels have issued warnings when sysctl() is called.
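To show how little is needed on the /proc/sys side, here is a minimal user-space sketch which reads one parameter as text. The helper name read_sysctl_file() is made up for this example; only standard C library calls are used.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a /proc/sys-style file into buf.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
static int read_sysctl_file(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (len < 2)
        return -1;               /* no room for any data */
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}
```

A caller could pass a path such as /proc/sys/kernel/ostype; no sysctl numbers are involved, and a shell script can do the same thing with cat.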

The 2.6.19-rc kernels take things one step further: for most configurations, sysctl() disappears altogether. In a strange sort of turnaround, only configurations with the "embedded" option set can enable sysctl() at all. This is all in accordance with the feature removal schedule, which calls for sysctl() to go away in January, 2007.

But sysctl() is part of the user-space API, which is never supposed to be broken for any reason. The removal of this function would appear to be a violation of the oft-repeated promise to keep this interface stable. So some developers have started to complain about the API change. There have been calls to back it out again, and to restore sysctl() to normal configurations. As Alan Cox put it: "We added it, we supported it, we get to keep it. We just stick notes in the docs saying 'please use /proc instead'."

Patches which restore sysctl() are circulating, though none have been merged. There appears to be some disagreement over whether removing sysctl() would truly break user-space applications or not. There are some uses of it in older C libraries, but, apparently, those libraries do the right thing when the attempt to use sysctl() fails, and applications operate normally. Linus has asked for an example of an application which truly breaks in the absence of sysctl(); none have been posted as of this writing. Interfaces which are not actually used on real systems are fair game for removal, so, unless somebody comes up with a real-world problem soon, sysctl() will likely continue on its path out of the kernel.

Video4Linux2 part 2: registration and open()

The LWN.net Video4Linux2 API series.

This is the second article in the LWN series on writing drivers for the Video4Linux2 kernel interface; those who have not yet seen the introductory article may wish to start there. This installment will look at the overall structure of a Video4Linux driver and the device registration process.

Before starting, it is worth noting that there are two resources which will prove invaluable for anybody working with video drivers:

The V4L2 API Specification. This document covers the API from the user-space point of view, but, to a great extent, V4L2 drivers implement that API directly. So most of the structures are the same, and the semantics of the V4L2 calls are clearly laid out. Print a copy (consider cutting out the Free Documentation License text to save trees) and keep it somewhere within easy reach.
The "vivi" driver found in the kernel source as drivers/media/video/vivi.c. It is a virtual driver, in that it generates test patterns and does not actually interface to any hardware. As such, it serves as a relatively clear illustration of how V4L2 drivers should be written.

To start, every V4L2 driver must include the requisite header file:

    #include <linux/videodev2.h>

Much of the needed information is there. When digging through the headers as a driver author, however, you'll also want to have a look at include/media/v4l2-dev.h, which defines many of the structures you'll be working with.

A video driver will probably have sections which deal with the PCI or USB bus (for example); we'll not spend much time on that part of the driver here. There is often an internal i2c interface, which will be examined later on in this article series. Then, there is the interface to the V4L2 subsystem. That interface is built around struct video_device, which represents a V4L2 device. Covering everything that goes into this structure will be the topic of several articles; here we'll just have an overview.

The name field of struct video_device is a name for the type of device; it will appear in kernel log messages and in sysfs. The name usually matches the name of the driver.

There are two fields to describe what type of device is being represented. The first (type) looks like a holdover from the Video4Linux1 API; it can have one of four values:

VFL_TYPE_GRABBER indicates a frame grabber device - including cameras, tuners, and such.
VFL_TYPE_VBI is for devices which pull information transmitted during the video blanking interval.
VFL_TYPE_RADIO for radio devices.
VFL_TYPE_VTX for videotext devices.

If your device can perform more than one of the above functions, a separate V4L2 device should be registered for each of the supported functions. In V4L2, however, any of the registered devices can be called upon to function in any of the supported modes. What it comes down to is that, for V4L2, there is really only need for a single device, but compatibility with the older Video4Linux API requires that individual devices be registered for each function.

The second field, called type2, is a bitmask describing the device's capabilities in more detail. It can contain any of the following values:

VID_TYPE_CAPTURE: the device can capture video data.
VID_TYPE_TUNER: it can tune to different frequencies.
VID_TYPE_TELETEXT: it can grab teletext data.
VID_TYPE_OVERLAY: it can overlay video data directly into the frame buffer.
VID_TYPE_CHROMAKEY: a special form of overlay capability where the video data is only displayed where the underlying frame buffer contains pixels of a specific color.
VID_TYPE_CLIPPING: it can clip overlay data.
VID_TYPE_FRAMERAM: it uses memory located in the frame buffer device.
VID_TYPE_SCALES: it can scale video data.
VID_TYPE_MONOCHROME: it is a monochrome-only device.
VID_TYPE_SUBCAPTURE: it can capture sub-areas of the image.
VID_TYPE_MPEG_DECODER: it can decode MPEG streams.
VID_TYPE_MPEG_ENCODER: it can encode MPEG streams.
VID_TYPE_MJPEG_DECODER: it can decode MJPEG streams.
VID_TYPE_MJPEG_ENCODER: it can encode MJPEG streams.

Another field initialized by all V4L2 drivers is minor, which is the desired minor number for the device. Usually this field will be set to -1, which causes the Video4Linux subsystem to allocate a minor number at registration time.
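The fields described so far can be pulled together in a small user-space mock-up. This is not the kernel's struct video_device: struct mock_video_device contains only the four fields discussed above, the numeric values given to the constants are placeholders chosen for the sketch (a real driver gets them from the kernel headers), and the "mockcam" name is invented.

```c
#include <string.h>

/* Placeholder values; a real driver takes these from the kernel headers. */
enum { VFL_TYPE_GRABBER, VFL_TYPE_VBI, VFL_TYPE_RADIO, VFL_TYPE_VTX };
#define VID_TYPE_CAPTURE  (1 << 0)
#define VID_TYPE_TUNER    (1 << 1)
#define VID_TYPE_SCALES   (1 << 2)

/* Mock of the handful of fields described above; not the kernel's definition. */
struct mock_video_device {
    char name[32];   /* shows up in log messages and sysfs */
    int type;        /* one of the VFL_TYPE_ values */
    int type2;       /* bitmask of VID_TYPE_ capabilities */
    int minor;       /* desired minor number, or -1 for "any" */
};

/* Fill in the structure for an imaginary camera which can capture and scale. */
static void mock_camera_setup(struct mock_video_device *vfd)
{
    memset(vfd, 0, sizeof(*vfd));
    strncpy(vfd->name, "mockcam", sizeof(vfd->name) - 1);
    vfd->type = VFL_TYPE_GRABBER;
    vfd->type2 = VID_TYPE_CAPTURE | VID_TYPE_SCALES;
    vfd->minor = -1;   /* let the subsystem pick a minor number */
}
```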

There are also three distinct sets of function pointers found within struct video_device. The first, consisting of a single function, is the release() method. If a device lacks a release() function, the kernel will complain (your editor was amused to note that it refers offending programmers to an LWN article). The release() function is important: for various reasons, references to a video_device structure can remain long after the last video application has closed its file descriptor. Those references can remain after the device has been unregistered. For this reason, it is not safe to free the structure until the release() method has been called. So, often, this function consists of a simple kfree() call.
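The reason for deferring the free can be seen in a user-space mock-up of the reference counting involved. Everything here is hypothetical: struct mock_device, mock_get(), and mock_put() stand in for the kernel's own reference-counting machinery, and free() plays the part of kfree().

```c
#include <stdlib.h>

struct mock_device {
    int refs;                                   /* outstanding references */
    void (*release)(struct mock_device *dev);   /* called when refs hits 0 */
};

static int mock_freed;   /* counts release() calls, for illustration only */

/* The release() method: often nothing more than freeing the structure. */
static void mock_release(struct mock_device *dev)
{
    free(dev);
    mock_freed++;
}

static struct mock_device *mock_alloc(void)
{
    struct mock_device *dev = calloc(1, sizeof(*dev));

    if (!dev)
        return NULL;
    dev->refs = 1;                 /* the driver's own reference */
    dev->release = mock_release;
    return dev;
}

static void mock_get(struct mock_device *dev)
{
    dev->refs++;
}

/* Drop a reference; the structure goes away only with the last one. */
static void mock_put(struct mock_device *dev)
{
    if (--dev->refs == 0)
        dev->release(dev);
}
```

If the driver drops its reference while some other part of the system still holds one, the structure survives until that last reference is dropped; freeing it any earlier would leave a dangling pointer behind.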

The video_device structure contains within it a file_operations structure with the usual function pointers. Video drivers will always need open() and release() operations; note that this release() is called whenever the device is closed, not when it can be freed as with the other function with the same name described above. There will often be a read() or write() method, depending on whether the device performs input or output; note, however, that for streaming video devices, there are other ways of transferring data. Most devices which handle streaming video data will need to implement poll() and mmap(). And every V4L2 device needs an ioctl() method - but they can use video_ioctl2(), which is provided by the V4L2 subsystem.

The third set of methods, stored in the video_device structure itself, makes up the core of the V4L2 API. There are several dozen of them, handling various device configuration operations, streaming I/O, and more.

Finally, a useful field to know from the beginning is debug. Setting it to either (or both - it's a bitmask) of V4L2_DEBUG_IOCTL and V4L2_DEBUG_IOCTL_ARG will yield a fair amount of debugging output which can help a befuddled programmer figure out why a driver and an application are failing to understand each other.

Video device registration

Once the video_device structure has been set up, it should be registered with:

    int video_register_device(struct video_device *vfd, int type, int nr);

Here, vfd is the device structure, type is the same value found in its type field, and nr is, again, the desired minor number (or -1 for dynamic allocation). The return value should be zero; a negative error code indicates that something went badly wrong. As always, one should be aware that the device's methods can be called immediately once the device is registered; do not call video_register_device() until everything is ready to go.
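The calling convention (a specific minor number or -1, with zero or a negative error code coming back) can be imitated in user space. The mock below is not the V4L2 implementation; struct mock_vdev, the two-entry table, and the particular error codes chosen are all assumptions made for the sake of a small example.

```c
#include <errno.h>
#include <stddef.h>

struct mock_vdev {
    int minor;        /* filled in at registration time */
    int registered;
};

#define MOCK_MAX_MINOR 2   /* deliberately tiny table */
static struct mock_vdev *mock_table[MOCK_MAX_MINOR];

/* nr == -1 asks for any free slot; returns 0 or a negative error code. */
static int mock_register_device(struct mock_vdev *vfd, int nr)
{
    int i;

    if (nr < -1 || nr >= MOCK_MAX_MINOR)
        return -EINVAL;
    if (nr >= 0) {
        if (mock_table[nr])
            return -EBUSY;       /* requested minor already taken */
        i = nr;
    } else {
        for (i = 0; i < MOCK_MAX_MINOR; i++)
            if (!mock_table[i])
                break;
        if (i == MOCK_MAX_MINOR)
            return -ENFILE;      /* no free minor numbers left */
    }
    mock_table[i] = vfd;
    vfd->minor = i;
    vfd->registered = 1;
    return 0;
}

static void mock_unregister_device(struct mock_vdev *vfd)
{
    if (vfd->registered) {
        mock_table[vfd->minor] = NULL;
        vfd->registered = 0;
    }
}
```

In a real driver the registration call would come at the very end of initialization, with any earlier setup undone if a negative value comes back.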

A device can be unregistered with:

    void video_unregister_device(struct video_device *vfd);

The implementation of these methods is the subject of the rest of this series, starting with open() and release() below.

open() and release()

Every V4L2 device will need an open() method, which will have the usual prototype:

    int (*open)(struct inode *inode, struct file *filp);

The first thing an open() method will normally do is to locate an internal device corresponding to the given inode; this is done by keying on the minor number stored in inode. A certain amount of initialization can be performed; this can also be a good time to power up the hardware if it has a power-down option.

The V4L2 specification defines some conventions which are relevant here. One is that, by design, all V4L2 devices can have multiple open file descriptors at any given time. The purpose here is to allow one application to display (or generate) video data while another one, perhaps, tweaks control values. So, while certain V4L2 operations (actually reading and writing video data, in particular) can be made exclusive to a single file descriptor, the device as a whole should support multiple open descriptors.

Another convention worth mentioning is that the open() method should not, in general, make changes to the operating parameters currently set in the hardware. It should be possible to run a command-line program which configures a camera according to a certain set of desires (resolution, video format, etc.), then run an entirely separate application to, for example, capture a frame from the camera. This mode would not work if the camera's settings were reset in the middle, so a V4L2 driver should endeavor to keep existing settings until an application explicitly resets them.

The release() method performs any needed cleanup. Since video devices can have multiple open file descriptors, release() will need to decrement a counter and check before doing anything radical. If the just-closed file descriptor was being used to transfer data, it may be necessary to shut down the DMA engine and perform other cleanups.
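The bookkeeping described above might be sketched like this in user space. The struct mock_cam type and both functions are hypothetical; a real driver would work with struct inode and struct file, would need locking around the counter, and would talk to actual hardware.

```c
struct mock_cam {
    int users;      /* number of open file descriptors */
    int powered;    /* 1 while the (imaginary) hardware is powered up */
    int streaming;  /* 1 while the (imaginary) DMA engine is running */
    int width;      /* an operating parameter which open() must not reset */
};

/* First opener powers up the hardware; the settings are left alone. */
static int mock_open(struct mock_cam *cam)
{
    if (cam->users++ == 0)
        cam->powered = 1;
    return 0;
}

/* Called for each close; was_streaming says whether this fd owned the DMA. */
static void mock_release_fd(struct mock_cam *cam, int was_streaming)
{
    if (was_streaming)
        cam->streaming = 0;   /* shut down the transfer this fd started */
    if (--cam->users == 0)
        cam->powered = 0;     /* last user gone: safe to power down */
}
```

Note that neither function touches width: a setting made through one file descriptor stays in place for whichever application opens the device next.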

The next installment in this series will start into the long process of querying device capabilities and configuring operating modes. Stay tuned.

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.19-rc2

Andrew Morton 2.6.19-rc2-mm1

Greg KH Linux 2.6.18.1

Ingo Molnar 2.6.18-rt6

Greg KH Linux 2.6.17.14

Adrian Bunk Linux 2.6.16.30-rc1

Core kernel code

Paul Jackson Cpuset: explicit dynamic sched domain control flags

Development tools

Jean-Marc Saffroy kdump2gdb 0.2

Junio C Hamano GIT 1.4.2.4

Michael Reed The Linux Test project ltp-20061017 Released

Device drivers

Li Yu usb/hid: The HID Simple Driver Patches 0.4.0 (all-in-one)

Greg KH USB fixes and drivers for 2.6.18-rc2

Burman Yan HP mobile data protection system driver take 2

Tejun Heo implement ata_link, take 3

Robert Hancock sata_nv ADMA/NCQ support for nForce4

Jonathan Corbet Marvell 88alp01 "cafe" camera driver

Matthew Wilcox Block on access to temporarily unavailable pci device

Alan Cox tty: preparatory structures for termios revamp

Alan Cox tty: switch to ktermios and new framework

Documentation

Michael Kerrisk man-pages-2.41 is released

Filesystems and block I/O

Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Stackfs: generic stackable filesystem helper functions

Networking

Eric Barton PATCH zero-copy send completion callback

Jiri Benc d80211: pull request

Security-related

Dawid Ciezarkiewicz Ethernet Cheap Cryptography

Noriaki TAKAMIYA Changeset of Camellia cipher algorithm.

Virtualization and containers

Ian Pratt Xen 3.0.3 released!

Cedric Le Goater pid namespace and namespace cleanups

Miscellaneous

Neil Brown ANNOUNCE: mdadm 2.5.4 - A tool for managing Soft RAID under Linux

Douglas Gilbert sg3_utils-1.22 available

Douglas Gilbert sdparm 1.00

Page editor: Jonathan Corbet