
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.19-rc1, released on October 4, several milliseconds after last week's Kernel Page was published. For a summary of changes, see this article and this one from the last two weeks. Highlights include the parallel ATA driver set, labeled networking for IPsec and CIPSO security, a few new architectures, lots of new drivers, the GFS2 cluster filesystem, eCryptfs, and large numbers of internal changes.

The long-format changelog has the details - but, since we're talking about almost 5000 patches from over 600 contributors, it's best to have a lot of time on one's hands. The short-form changelog is somewhat more compact, but still lengthy.

At this point in the process, patches going into the mainline repository are supposed to be confined to fixes. Many of them are, but Linus has merged a few other significant changes, including, as predicted, the interrupt handler prototype change, which has caused changes throughout the tree. There is a new epoll_pwait() system call which takes an additional signal mask parameter, and the venerable (but long-unused) <linux/config.h> include file has been removed at last.
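For the curious, here is a minimal user-space sketch of how the new call might be used, assuming a C library which provides a wrapper for it. The helper name wait_readable_blocking_sigint() is invented for illustration. The final argument to epoll_pwait() is installed as the signal mask for the duration of the wait only.

```c
#include <assert.h>
#include <signal.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * While the wait is in progress the signal mask is replaced by one that
 * blocks only SIGINT; the original mask is restored on return.
 * Returns the number of ready descriptors (0 on timeout), or -1 on error.
 */
int wait_readable_blocking_sigint(int fd, int timeout_ms)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    struct epoll_event ready;
    sigset_t mask;
    int epfd, n;

    assert(fd >= 0);

    epfd = epoll_create(1);
    if (epfd < 0)
        return -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);

    /* Same as epoll_wait(), plus the temporary signal mask. */
    n = epoll_pwait(epfd, &ready, 1, timeout_ms, &mask);
    close(epfd);
    return n;
}
```

Passing NULL as the mask makes the call behave like plain epoll_wait().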

Also merged is the developmental ext4 filesystem, which brings a number of enhancements, among them support for extents and 48-bit block numbers. See the ext4 documentation file if you are interested in playing with ext4 (and have good backups).

The current -mm tree is 2.6.19-rc1-mm1. Recent changes to -mm include the addition of ext4 (which promptly moved on into the mainline), continued work on the swap token mechanism, a generic log2() implementation, and the dynamic tick patch.


Kernel development news

Quote of the week

Maintaining drivers out of tree is shameless autoflagellation at the best of times. We really don't care -- if we didn't make life hard for them in this way they'd only go and stick pins under their fingernails to make up for the lack of pain. If you think about it like that, we're probably doing them a favour -- at least this way they're _safe_.

-- David Woodhouse


Faulting out populate(), nopfn(), and nopage()

The nopfn() VMA operation was added for 2.6.19-rc1; see this article from last month for information on this method. It turns out, though, that nopfn() might just be one of the shortest-lived kernel API extensions in some time; Nick Piggin has posted a series of patches which will bring significant changes to how page faults are handled at the lowest levels.

The 2.6.19-rc1 vm_operations_struct structure defines three methods which handle low-level paging:

    struct page *(*nopage)(struct vm_area_struct *area,
                           unsigned long address, int *type);
    unsigned long (*nopfn)(struct vm_area_struct *area,
                           unsigned long address);
    int (*populate)(struct vm_area_struct *area, unsigned long address,
                    unsigned long len, pgprot_t prot,
                    unsigned long pgoff, int nonblock);

Ordinarily, page faults are handled by nopfn() (if it exists) or nopage(). Those functions are supposed to take the given address and associate it with a page in physical memory. For virtual memory areas (VMAs) which are backed by files, the virtual filesystem layer reacts to a nopage() call by allocating a page of memory, reading the appropriate contents from backing store, then passing the page back to the kernel for insertion into the page tables. Device drivers which implement nopage() typically just translate the address into an appropriate pointer for an in-memory buffer being mapped into user space.

Both nopfn() and nopage() assume that the mapping between virtual memory addresses and the offset within the VMA is linear - that is why only the address is provided as a parameter. The kernel, however, also supports nonlinear mappings, where an application can turn a VMA into a complex window into different parts of the backing file. The nopfn() and nopage() methods cannot handle these mappings, since they do not have the required information. Instead, any backing store which supports nonlinear mappings must provide a populate() method, which has parameters for both the virtual memory address and the associated offset (pgoff) into the backing store device.
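The difference can be sketched in a small user-space model. Nothing here is kernel code: struct vma_model is a cut-down stand-in for struct vm_area_struct, the 4KB page size is an assumption, and page_map is a hypothetical per-page table standing in for the offsets which the kernel records for each page of a nonlinear mapping.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumption: 4KB pages */

/* Simplified stand-in for the kernel's struct vm_area_struct. */
struct vma_model {
    unsigned long vm_start;   /* first address of the mapping */
    unsigned long vm_end;     /* first address past the mapping */
    unsigned long vm_pgoff;   /* file offset of vm_start, in pages */
};

/* Linear mapping: the file offset follows from the address alone. */
unsigned long linear_pgoff(const struct vma_model *vma, unsigned long address)
{
    assert(address >= vma->vm_start && address < vma->vm_end);
    return ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}

/*
 * Nonlinear mapping: each page carries its own file offset, so the
 * offset must be looked up rather than computed. page_map is a
 * hypothetical table with one entry per page in the VMA.
 */
unsigned long nonlinear_pgoff(const struct vma_model *vma,
                              const unsigned long *page_map,
                              unsigned long address)
{
    assert(address >= vma->vm_start && address < vma->vm_end);
    return page_map[(address - vma->vm_start) >> PAGE_SHIFT];
}
```

A method which receives only the address can perform the first calculation, but has no way to perform the second; hence the explicit pgoff parameter to populate().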

Enter Nick, who was working on a tricky race condition found within one of the most notoriously tricky parts of the kernel: the code which handles file truncation. In some conditions, a page which was being removed as a result of a truncate() call could be simultaneously faulted in via nopage(), leading to memory management confusion. While rethinking the locking rules for these operations, Nick decided that there should be a better way. The result was a new VMA operation called fault():

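    struct fault_data {
        struct vm_area_struct *vma;
        unsigned long address;    /* faulting user-space address */
        pgoff_t pgoff;            /* corresponding offset into the backing store */
        unsigned int flags;       /* write access, nonlinear mapping */

        int type;                 /* set by fault(): fault type or error code */
    };

    struct page *(*fault)(struct vm_area_struct *vma,
                          struct fault_data *fdata);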

This method is intended to replace all of nopfn(), nopage(), and populate(). When a page fault happens, the kernel fills in the fault_data structure with the needed information: the user-space address associated with the fault, the corresponding offset pgoff, and a couple of flags which indicate whether the fault happened on a write access and whether a nonlinear mapping is involved.

The fault() function should locate a page which can satisfy a request for the offset pgoff; it won't normally need address at all. The function can then either return the associated struct page, or set the page table entry directly (with something like vm_insert_page()) and return NULL. Either way, the type field should be set to the type of fault (major or minor). If the fault cannot be handled, the appropriate error code should be put into type instead.
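As a rough illustration, a driver exporting an in-memory buffer might implement fault() along the following lines. This is a user-space model, not kernel code: struct page and struct vm_area_struct are mocked, the two VM_FAULT_ values are stand-ins for this model, and demo_buffer and demo_fault() are hypothetical names. The patch is still under discussion, so the final interface may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long pgoff_t;

/* Mock types so the sketch builds outside the kernel. */
struct page { int id; };
struct vm_area_struct { void *vm_private_data; };

/* Stand-in values for this model. */
#define VM_FAULT_SIGBUS 0x01
#define VM_FAULT_MINOR  0x02

struct fault_data {
    struct vm_area_struct *vma;
    unsigned long address;
    pgoff_t pgoff;
    unsigned int flags;
    int type;
};

/* Hypothetical driver state: an array of pages exported to user space. */
struct demo_buffer {
    struct page *pages;
    pgoff_t npages;
};

/*
 * Find the page for fdata->pgoff; the address is never consulted.
 * A real handler would also need to take a reference on the page,
 * which this model omits.
 */
struct page *demo_fault(struct vm_area_struct *vma, struct fault_data *fdata)
{
    struct demo_buffer *buf = vma->vm_private_data;

    assert(fdata->vma == vma);
    if (fdata->pgoff >= buf->npages) {
        fdata->type = VM_FAULT_SIGBUS;   /* error code goes into type */
        return NULL;
    }
    fdata->type = VM_FAULT_MINOR;        /* no I/O was needed */
    return &buf->pages[fdata->pgoff];
}
```

Since the handler works from pgoff alone, the same function serves both linear and nonlinear mappings.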

Nick's patch gets rid of the nopfn() and populate() methods immediately. There is currently only one user of nopfn(), and the older populate() API has never been widely used outside of the mainline kernel. The install_page() function is also destined for a near-term demise. The nopage() method, instead, is widely used by device drivers, inside and outside of the mainline. So it has been marked as deprecated and scheduled for removal one year from now, in October, 2007. There have been suggestions that nopage() should go sooner (after six months, say), but no definitive decision.

Details like that aside, there appears to be broad support for this change. These patches would probably be a bit too new for 2.6.19, even if the merge window were still open, so 2.6.20 is the earliest likely date for them to appear in the mainline. But, at that point, driver and out-of-tree filesystem maintainers will have some updating to do.


Sleepable RCU

Classic RCU requires that read-side critical sections obey the same rules obeyed by the critical sections of pure spinlocks: blocking or sleeping of any sort is strictly prohibited. This has frequently been an obstacle to the use of RCU, and I have received numerous requests for a "sleepable RCU" (SRCU) that permits arbitrary sleeping (or blocking) within RCU read-side critical sections. I had previously rejected all such requests as unworkable, since arbitrary sleeping in RCU read-side critical sections could indefinitely extend grace periods, which in turn could result in arbitrarily large amounts of memory awaiting the end of a grace period, which finally would result in system hangs due to memory exhaustion. After all, any concurrency-control primitive that could result in system hangs -- even when used correctly -- does not deserve to exist.

However, the realtime kernels that require spinlock critical sections be preemptible [3] also require that RCU read-side critical sections be preemptible [2]. Preemptible critical sections in turn require that lock-acquisition primitives block in order to avoid deadlock, which in turn means that both RCU's and spinlocks' critical sections must be able to block awaiting a lock. However, these two forms of sleeping have the special property that priority boosting and priority inheritance may be used to awaken the sleeping tasks in short order.

Nevertheless, use of RCU in realtime kernels was the first crack in the tablets of stone on which were inscribed "RCU read-side critical sections can never sleep". That said, indefinite sleeping, such as blocking waiting for an incoming TCP connection, is strictly verboten even in realtime kernels.

Quick Quiz 1: Why is sleeping prohibited within Classic RCU read-side critical sections?

Quick Quiz 2: Why not permit sleeping in Classic RCU read-side critical sections by eliminating context switch as a quiescent state, leaving user-mode execution and idle loop as the remaining quiescent states?

(The rest of this lengthy, technical article, including the answers to the quick quiz questions, can be found in the full story.)


The Video4Linux2 API: an introduction

Your editor has recently had the opportunity to write a Linux driver for a camera device - the camera which will be packaged with the One Laptop Per Child system, in particular. This driver works with the internal kernel API designed for such purposes: the Video4Linux2 API. In the process of writing this code, your editor made the shocking discovery that, in fact, this API is not particularly well documented - though the user-space side is, instead, quite well documented indeed. In an attempt to remedy the situation somewhat, LWN will, over the coming months, publish a series of articles describing how to write drivers for the V4L2 interface.

V4L2 has a long history - the first gleam came into Bill Dirks's eye back around August of 1998. Development proceeded for years, and the V4L2 API was finally merged into the mainline in November, 2002, when 2.5.46 was released. To this day, however, quite a few Linux drivers do not support the newer API; the conversion process is an ongoing task. Meanwhile, the V4L2 API continues to evolve, with some major changes being made in 2.6.18. Applications which work with V4L2 remain relatively scarce.

V4L2 is designed to support a wide variety of devices, only some of which are truly "video" in nature:

  • The video capture interface grabs video data from a tuner or camera device. For many, video capture will be the primary application for V4L2. Since your editor's experience is strongest in this area, this series will tend to emphasize the capture API, but there is more to V4L2 than that.

  • The video output interface allows applications to drive peripherals which can provide video images - perhaps in the form of a television signal - outside of the computer.

  • A variant of the capture interface can be found in the video overlay interface, whose job is to facilitate the direct display of video data from a capture device. Video data moves directly from the capture device to the display, without passing through the system's CPU.

  • The VBI interfaces provide access to data transmitted during the video blanking interval. There are two of them, the "raw" and "sliced" interfaces, which differ in the amount of processing of the VBI data performed in hardware.

  • The radio interface provides access to audio streams from AM and FM tuner devices.

Other types of devices are possible. The V4L2 API has some stubs for "codec" and "effect" devices, both of which perform transformations on video data streams. Those areas have not yet been completely specified, however, much less implemented. There are also the "teletext" and "radio data system" interfaces currently implemented in the older V4L1 API; those have not been moved to V4L2 and there do not appear to be any immediate plans to do so.

Video devices differ from many others in the vast number of ways in which they can be configured. As a result, much of a V4L2 driver implements code which enables applications to discover a given device's capabilities and to configure that device to operate in the desired manner. The V4L2 API defines several dozen callbacks for the configuration of parameters like tuner frequencies, windowing and cropping, frame rates, video compression, image parameters (brightness, contrast, ...), video standards, video formats, etc. Much of this series will be devoted to looking at how this configuration process happens.

Then, there is the small task of actually performing I/O at video rates in an efficient manner. The V4L2 API defines three different ways of moving video data between user space and the peripheral, some of which can be on the complex side. Separate articles will look at video I/O and the video-buf layer which has been provided to handle common tasks.

Subsequent articles will appear every few weeks, and will be added to the list below:


Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds