Kernel development [LWN.net]

Kernel release status

There is no 2.6 prepatch outstanding as of this writing. The 2.6.22 merge window has opened, and about 2,000 changesets have been merged so far (see below).

The current -mm tree is 2.6.21-rc7-mm2. Few new features have gone into -mm recently; the focus has been on bug fixes.

The current stable 2.6 kernel is 2.6.21, released on April 25. For those just tuning in, 2.6.21 includes clockevents and the dynamic tick patch, the VMI virtualization interface, a number of KVM improvements, the ALSA system on chip layer, and much more. See the KernelNewbies 2.6.21 summary for vast amounts of detail.

The 2.6.21.1 update added a couple of fixes for security issues in the networking code.

For older kernels: the current 2.6.20 release is 2.6.20.11, released on May 1. The 2.6.20.11 release contains a few dozen important fixes; the previous updates contained fixes for networking-related security problems.

2.6.16.50-rc1 was released on May 1 with several fixes, a couple of which have CVE numbers attached.


Quotes of the week

So -mm is still very useful just because *Andrew* tests it, and finds all kinds of issues with it, but I literally suspect that Andrew himself is personally a big part of that, which is kind of wasteful - we should be able to spread out the pain more. Andrew is also too damn polite when something goes wrong.

-- Linus Torvalds

The overall stability in recent -mm's was not sufficiently high and we ran out of time to find all the bugs. I shouldn't have merged all those patches last week - they contained an exceptional amount of garbage. This all means that more bugs than usual will probably leak into mainline, and we'll have to fix them there.

-- Andrew Morton


Job opening: kernel bug manager

In the middle of the discussion on the handling of kernel bugs, Andrew Morton let it slip that the long-rumored, Google-funded kernel bug manager position is now open. It has apparently proved hard to fill: "Unfortunately the recruiting has been a bit tricky - this is not a typical job and it's a funny mixture of bureaucracy/politics/social engineering and programming. People who are skilled in both areas, are, ah, uncommon." If you are such a person, this could be a great opportunity to build kernel skills while working directly with Andrew - and help the kernel process as well.


Merged (and to be merged) for 2.6.22

The 2.6.22 merge window has opened, with almost 2,000 changesets merged as of this writing. The merge process appears to have slowed somewhat; it may be that the level of traffic on linux-kernel is so high (even by linux-kernel standards) that nobody has time to deal with actual patches. Be that as it may, user-visible changes merged so far include:

Lots of networking changes, including improvements to the forward receive timeout recovery (RFC4138) implementation, a YeAH-TCP congestion control [PDF] implementation, a TCP Illinois congestion control implementation, and a new RxRPC secure socket layer (along with support for using RxRPC in the AFS filesystem). Also, the old, IPv4-only connection tracking code has been removed as per the feature removal schedule.
The cfg80211 patches - a new, netlink-based interface for configuring wireless interfaces - have been merged. At the same time, the netlink version of the "wireless extensions" interface has been removed.
The OCFS2 filesystem now has sparse file support.
The UBI patch, which performs flash-aware partitioning and volume management, has been merged.
New drivers for USB webcams based on zr364xx chipsets, AT26Fxxx dataflash devices, CM-X270-based NAND flash memory, Freescale SOC USB controllers, and Marvell Libertas 802.11 adaptors (used in the OLPC system).
It's also worth noting that the IVTV video driver, long out of the mainline, has finally been merged. "It took three core maintainers, over four years of work, eight new i2c modules, eleven new V4L2 ioctls, three new DVB video ioctls, a Sliced VBI API, a new MPEG encoder API, an enhanced DVB video MPEG decoding API, major YUV/OSD contributions from Ian and John, web/wiki/svn/trac support from Axel Thimm, (hardware) support from Hauppauge, support and assistance from the v4l-dvb people and the many, many users of ivtv to finally make it possible to merge this driver into the kernel."
A new "sony-laptop" layer which replaces sonypi and provides better Sony support. The old "ibm_acpi" module has been renamed "thinkpad-acpi," and it features improved support for those laptops.
The CFQ I/O scheduler has been reworked. Taking inspiration from the CFS CPU scheduler, it now uses a red-black tree to sort pending requests by expected execution time and track them.

Changes visible to kernel developers include:

The eth_type_trans() function now sets the skb->dev field, consistent with how similar functions for other link types operate. As a result, many Ethernet drivers have been changed to remove the (now) redundant assignment.
The header fields in the sk_buff structure have been renamed and are no longer unions. Networking code and drivers can now just use skb->transport_header, skb->network_header, and skb->mac_header. There are new functions for finding specific headers within packets: tcp_hdr(), udp_hdr(), ipip_hdr(), and ipipv6_hdr().
Also in the networking area: the packet scheduler has been reworked to use ktime values rather than jiffies.

Those who are curious about what else might get into 2.6.22 can have a look at Andrew Morton's 2.6.22 merge plans document. Interestingly, Lguest, the signalfd work, and the SLUB allocator are all planned for merging, but all have become less certain since:

There have been some complaints that Lguest has not been sufficiently reviewed. Since this development is independent and will not bother those who do not use it, the concerns are less likely to delay its inclusion.
Signalfd has a new competitor in the form of the pollfs patch. Pollfs takes a different approach to many of the same problems and throws in polling for futex operations as well. It is far from clear that pollfs is better (some of the early reviews have been on the unfavorable side), but the process of figuring out whether that is true could delay signalfd past the closing of the merge window.
The SLUB allocator has also been subject to concerns that it has not been sufficiently tested for such a fundamental change. Additionally, there seems to be a difference of goals between Andrew Morton (who would like to see SLUB eventually replace the current slab allocator) and SLUB developer Christoph Lameter, who had seen the two coexisting indefinitely. Chances are these issues will get worked out and SLUB will go in as scheduled.

There are a few things of interest which are not on Andrew's list. The reiser4 filesystem seems certain to sit out (at least) another cycle, despite a resurgence in interest in getting it ready for inclusion. Xen is not mentioned, but it seems that, behind the scenes, it is being worked on. So Xen could actually show up before the merge window closes. There will be no major scheduler rework in 2.6.22; it's too soon for any of those patches to go in. The anti-fragmentation patches look likely to wait a little longer; Andrew worries that they still haven't seen enough review and benchmarking despite many iterations over a few years. The integrity management patches are considered to be unready and will not be merged.

Beyond that, there will doubtless be surprises over the next week or so; stay tuned.


UIO: user-space drivers

The concept of supporting user-space drivers has appeared on this page a few times before. It's back; this time there is a version of the patch (now called "UIO") which is being proposed for inclusion into 2.6.22. The interface has changed somewhat, so another look is called for.

Like the previous version, UIO does not completely eliminate the need for kernel-space code. A small module is required to set up the device, perhaps interface to the PCI bus, and register an interrupt handler. The last function (interrupt handling) is particularly important; much can be done in user space, but there needs to be an in-kernel interrupt handler which knows how to tell the device to stop crying for attention.

The kernel module includes <linux/uio_driver.h>. If it's a driver for a PCI device, it should register itself as a PCI driver in the usual way. When it comes time to connect a device (perhaps in the PCI probe() function), the driver fills in a uio_info structure:

    struct uio_info {
        char                *name;
        char                *version;
        struct uio_mem      mem[MAX_UIO_MAPS];
        long                irq;
        unsigned long       irq_flags;
        void                *priv;
        irqreturn_t (*handler)(int irq, struct uio_info *dev_info);
        int (*mmap)(struct uio_info *info, struct vm_area_struct *vma);
        int (*open)(struct uio_info *info, struct inode *inode);
        int (*release)(struct uio_info *info, struct inode *inode);
        /* Internal stuff omitted */
    };

Here, name is the name of the device and version is the driver version (which will show up in sysfs). The number of the interrupt used by the device (if any) goes into irq, with irq_flags being the flags which will be passed to request_irq(). The function which handles interrupts is handler(). This handler should acknowledge the interrupt; it usually does not need to do anything else. The mmap(), open(), and release() functions are called from the equivalent file_operations members.

The mem array describes any memory areas which can be mapped into user space. The uio_mem structure looks like:

    struct uio_mem {
        unsigned long addr;
        unsigned long size;
        int memtype;
        void __iomem *internal_addr;
        /* ... */
    };

For each mappable area, addr is the relevant address, and size is the size of the area. If it's an I/O memory area, internal_addr is the address returned by ioremap(). The memtype field describes what the area really is:

UIO_MEM_PHYS indicates that addr is a physical address, generally for an I/O memory area.
UIO_MEM_LOGICAL is memory in the kernel logical address space, such as that returned by kmalloc().
UIO_MEM_VIRTUAL is memory in the kernel virtual address space - the space used by vmalloc_user() and friends.

Once the structure is filled in, the driver stub passes it to:

    int uio_register_device(struct device *parent, struct uio_info *info);

The parent pointer tells the kernel which "real" device is associated with the UIO device; if the driver is for a PCI device, parent will be a pointer to the dev field of the pci_dev structure (&pci_dev->dev).
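Filling in the structure can be sketched as follows. This is a user-space mock-up rather than kernel code: the type definitions are stand-ins for what <linux/uio_driver.h> provides, the value of MAX_UIO_MAPS and the numeric memtype values are assumptions, and demo_handler() and demo_fill_info() are hypothetical names. A real stub would follow this step with a call to uio_register_device().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for kernel definitions so the sketch builds in user space. */
#define MAX_UIO_MAPS 5   /* assumed value, for illustration only */
enum { UIO_MEM_PHYS = 1, UIO_MEM_LOGICAL, UIO_MEM_VIRTUAL }; /* placeholder values */
typedef enum { IRQ_NONE, IRQ_HANDLED } irqreturn_t;

struct uio_mem {
    unsigned long addr;
    unsigned long size;
    int memtype;
    void *internal_addr;             /* void __iomem * in the kernel */
};

struct uio_info {
    char *name;
    char *version;
    struct uio_mem mem[MAX_UIO_MAPS];
    long irq;
    unsigned long irq_flags;
    void *priv;
    irqreturn_t (*handler)(int irq, struct uio_info *dev_info);
};

/* Hypothetical handler: a real one would poke a device register here to
 * acknowledge the interrupt before returning. */
static irqreturn_t demo_handler(int irq, struct uio_info *dev_info)
{
    (void)irq;
    (void)dev_info;
    return IRQ_HANDLED;
}

/* Describe a device with a single I/O memory region and one interrupt. */
void demo_fill_info(struct uio_info *info, unsigned long bar_addr,
                    unsigned long bar_size, long irq)
{
    assert(info != NULL);
    memset(info, 0, sizeof(*info));      /* unused mem[] slots stay empty */
    info->name = "demo_uio";
    info->version = "0.1";
    info->mem[0].addr = bar_addr;        /* physical address of the region */
    info->mem[0].size = bar_size;
    info->mem[0].memtype = UIO_MEM_PHYS;
    info->mem[0].internal_addr = NULL;   /* ioremap() result in a real driver */
    info->irq = irq;
    info->irq_flags = 0;                 /* would be passed to request_irq() */
    info->handler = demo_handler;
}
```

Zeroing the structure first means that any mem[] entries the driver does not describe are left with a size of zero.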

There is not much more to the kernel-space UIO API. When a device goes away, the driver should call:

    void uio_unregister_device(struct uio_info *info);

The final function of note is:

    void uio_event_notify(struct uio_info *info);

Its purpose is to notify the UIO core that an event (typically an interrupt) has occurred. The stub driver need not call uio_event_notify() for real interrupts, but it can be used to simulate interrupts in other situations.

On the user space side, the first UIO-handled device will show up as /dev/uio0 (assuming a normal udev setup). The user-space driver will open the device. Reading the device returns an int value which is the event count (number of interrupts) seen by the device; if no interrupts have come in since the last read, the operation will block until an interrupt happens (though non-blocking operation is supported in the usual way as well). The file descriptor can be passed to poll().
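The read-and-poll behavior described above might be used along these lines. This is a minimal sketch which assumes the descriptor has already been obtained by opening /dev/uio0; the helper name uio_wait_event() and its return-value convention are invented for illustration and are not part of the UIO patch.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for the next event on an open UIO descriptor,
 * then read the event count. Returns 0 and stores the count on success,
 * 1 on timeout, and -1 on error (errno describes the failure).
 */
int uio_wait_event(int fd, int timeout_ms, int *count)
{
    assert(count != NULL);

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready < 0)
        return -1;
    if (ready == 0)
        return 1;               /* no interrupt arrived in time */

    /* The device hands back a single int: the event count. */
    ssize_t n = read(fd, count, sizeof(*count));
    if (n < 0)
        return -1;
    if (n != (ssize_t)sizeof(*count)) {
        errno = EIO;            /* short read: not a valid event count */
        return -1;
    }
    return 0;
}
```

A user-space driver would typically call this in a loop, doing its device work through the mapped memory areas each time an event is reported.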

The memory areas described by the kernel-space driver can be mapped into user space with the mmap() call. The interface is just a little strange: the offset value passed to mmap() should be N times the page size for the Nth memory area, counting from zero. So, on a system with 4096-byte pages, the first memory area will be found with an offset of zero, the second at 4096, the third at 8192, etc. Once that is figured out, though, everything is pretty straightforward.
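The offset rule can be wrapped up in a pair of small helpers. This is a sketch only: uio_map_offset() and uio_map_area() are hypothetical names, and the size of the area is assumed to be known to the caller by some other means.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Offset which selects memory area `index` (counting from zero). */
off_t uio_map_offset(unsigned int index, long page_size)
{
    assert(page_size > 0);
    return (off_t)index * (off_t)page_size;
}

/*
 * Map memory area `index` of the UIO device open on `fd`.
 * Returns MAP_FAILED on error, as mmap() does.
 */
void *uio_map_area(int fd, unsigned int index, size_t size)
{
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size <= 0)
        return MAP_FAILED;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                fd, uio_map_offset(index, page_size));
}
```

Note that the offset is only a selector; it says nothing about where the area sits in the device's address space.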

There are some limitations, of course. UIO drivers are char drivers; there is no provision for creating user-space block or network drivers at this time. It is not possible to set up DMA operations from user space. But, for drivers which can be implemented with I/O memory access and simple interrupt handlers, the necessary pieces are in place. The patch set includes an example driver to show how it all works. According to Thomas Gleixner, the original, fully in-kernel version of the driver had to implement 68 different ioctl() commands and was over 5,000 lines long. The associated user-space code was over 3,000 lines as well. The new driver eliminates all of that, with a total of 156 lines of kernel code and just under 3,000 lines in user space.

Andrew Morton has expressed some reservations about the patch:

I'm a bit uncertain about the whole UIO idea, really. I have this vague feeling that we'd prefer to encourage people to move device drivers into GPL'ed kernel rather than encouraging them to do closed-source userspace implementations which will probably end up being slower, less reliable and unavailable on various architectures, distros, etc

The authors respond that it's not really about doing proprietary drivers, though some of that will undoubtedly go on. There are a number of people, especially in the embedded space, who want to do user-space drivers, for prototyping purposes if nothing else. The UIO framework gives them a relatively safe and standard way to write these drivers, which is seen as being better than having them each create their own kernel hooks. The patch has not been merged as of this writing, but, unless stronger objections arise, its chances of getting into 2.6.22 are reasonably good.


Large block size support

On its face, it doesn't seem like Christoph Lameter's large block size support patch would be that controversial. This patch set equips the page cache to hold blocks which are larger than the system's page size by storing them in higher-order, compound pages. That, in turn, enables filesystems to work with larger blocks. The patch should make operations on large files more efficient and improve the kernel's support for some types of hardware. The patch might eventually get merged, but not before more discussion has happened.

The problem is that this patch is not without its difficulties. It adds a certain amount of complexity to the core virtual memory subsystem to implement what is, in all reality, a feature which has been rejected before: larger pages. The patch currently ducks the most difficult part of the problem - handling faults on larger pages, needed to make mmap() work - meaning that more complexity can be expected in the future. Larger blocks in the page cache means more demand for higher-order pages, which are already in short supply on many systems; that, in turn, means that the anti-fragmentation patches would almost certainly be needed as well. Use of larger pages in the page cache can also lead to more internal fragmentation and less efficient memory use.
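The arithmetic behind the higher-order page demand and the internal fragmentation concern can be illustrated with a short sketch. The function names here are made up for the example and do not come from the patch; the calculation simply assumes that a block is stored in the smallest power-of-two run of pages which will hold it.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= block_size. */
unsigned int block_order(size_t block_size, size_t page_size)
{
    assert(page_size > 0 && block_size > 0);

    unsigned int order = 0;
    while ((page_size << order) < block_size)
        order++;
    return order;
}

/* Bytes left unused in the last block when `used` bytes hold real data. */
size_t block_slack(size_t used, size_t block_size)
{
    assert(block_size > 0);

    size_t rem = used % block_size;
    return rem == 0 ? 0 : block_size - rem;
}
```

With 4096-byte pages, a 64KB block needs an order-4 allocation (sixteen contiguous pages), and a one-byte file cached in such a block leaves 65,535 bytes idle, compared with 4,095 bytes when the block size equals the page size.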

For all these reasons, Andrew Morton has been expressing some reservations:

And make no mistake: the latter disadvantage is huge. Because if we do the PAGE_CACHE_SIZE hack (sorry, but it _is_), we have to do it *for ever*. Maintaining and enhancing core MM and VFS becomes harder and more costly and slower and more buggy *for ever*. The ramp for people to become competent on core MM becomes longer. Our developer pool becomes smaller, and proportionally less skilled.

Andrew is not necessarily opposed to the patch; he is more concerned that it not be merged until it has been carefully compared with the alternatives. He suggests keeping the page cache entry size unchanged, but trying to allocate entries in higher-order groups. That would result in larger blocks being stored contiguously in memory without the memory subsystem changes. Filesystems could use those larger blocks, and hardware could treat them as single units in scatter/gather lists for DMA, leading to more efficient operations.

Another possibility which has been raised is raising the maximum size of hardware scatter/gather lists or allowing them to be chained. Drivers could then set up larger I/O operations, improving efficiency without requiring the other changes.

Still, there is support for Christoph's patch. It would make support of larger blocks relatively straightforward for the lower layers, perhaps enabling the removal of some real hacks found in some drivers and filesystems now. The patch would also allow ext3 filesystems with larger block sizes - sometimes created on ia64 systems, which use larger pages - to be mounted on other architectures. Christoph Hellwig likes the idea that a higher-order page cache could force a solution to the longstanding problem of physical memory fragmentation. To many, it seems like a straightforward and necessary solution to a longstanding problem.

So the large block size idea is unlikely to just go away. It may be a while, though, before its proponents can do enough homework and benchmarking to fully address the worries which have been expressed. Fundamental changes are often the ones which take the longest to get into the kernel, so there is little that is surprising here. Just don't ask for a prediction of the final outcome.


Patches and updates

Linus Torvalds: Linux 2.6.21
Greg KH: Linux 2.6.21.1
Ingo Molnar: v2.6.21-rt1
Andrew Morton: 2.6.21-rc7-mm2
Greg KH: Linux 2.6.20.11
Greg KH: Linux 2.6.20.10
Greg KH: Linux 2.6.20.9
Greg KH: Linux 2.6.20.8
Adrian Bunk: Linux 2.6.16.50-rc1
Bill Irwin: i386 stack handling updates
Greg Ungerer: linux-2.6.21-uc0 (MMU-less updates)
Andi Kleen: Please pull x86 updates for .22
Ulrich Drepper: v4: merged utimensat implementation
Ingo Molnar: CFS scheduler, -v7
Ingo Molnar: CFS scheduler, -v8
Amit K. Arora: fallocate system call
Peter Williams: PlugSched-6.5.1 for 2.6.21
Davi Arnaut: pollfs: filesystem abstraction for pollable objects
Dave Jones: checkpatch, a patch checking script.
Josef Sipek: Guilt v0.24
Junio C Hamano: GIT 1.5.1.3
Josh Triplett: Sparse 0.3 released
Greg KH: USB patches for 2.6.21
Greg KH: Driver core patches for 2.6.21
Greg KH: UIO patches for 2.6.21
Dave Airlie: DRM patches for 2.6.22-rc1
Jeff Garzik: What's in netdev-2.6.git? (and, netdev rebased)
Len Brown: ACPI patches for 2.6.22
Jiri Kosina: HID and USB HID updates for 2.6.22 merge window
Stefan Richter: ieee1394 updates post 2.6.21
Wim Van Sebroeck: Watchdog patches for v2.6.22-rc1
James Bottomley: SCSI updates for 2.6.21
David Miller: Final ESP driver rewrite
Paul Sokolovsky: SoC base drivers
Kristian Hogsberg: New firewire stack
Jean Delvare: i2c updates for 2.6.22
Pierre Ossman: MMC updates
Grant Likely: Add support for Xilinx SystemACE CompactFlash interface.
Sam Revitch: USBCAM driver abstraction library for webcams
Sam Revitch: Ricoh R5U870 webcam driver
Michael Kerrisk: man-pages-2.45 and man-pages-2.46 are released
Ulrich Drepper: utimensat implementation
Ulrich Drepper: v3: utimensat implementation
Steve French: UID/GID override on CIFS mounts to Samba and proposed new mount parameter to disable Unix Extensions on the client
Miklos Szeredi: mount ownership and unprivileged mount syscall (v5)
Tejun Heo: sysfs: sysfs rework, take #2
Theodore Ts'o: 2.6.21-ext4-1
Steven Whitehouse: GFS Patches for the current merge window [0/34]
Dan Williams: [PATCH 00/16] raid acceleration and asynchronous offload api for 2.6.22
Trond Myklebust: NFS client updates for 2.6.22...
Nitin Gupta: Announce: Compressed Cache for 2.6.21
Giridhar Pemmasani: Allow __vmalloc with GFP_ATOMIC
Latchesar Ionkov: 9p: create separate 9p client interface
Roberto De Ioris: UidBind LSM 0.3
James Morris: SELinux patches for 2.6.22
Herbert Xu: Crypto Update for 2.6.22
Jeremy Fitzhardinge: Xen/paravirt_ops kernel available for testing
Jeremy Fitzhardinge: xen: Xen implementation for paravirt_ops
Jeremy Fitzhardinge: x86: Add a sched_clock paravirt_op
Avi Kivity: KVM updates for 2.6.22
menage@google.com: Containers (V9): Generic Process Containers
Pavel Emelianov: Virtual ethernet device (tunnel)
Rusty Russell: Lguest for 2.6.21
Andrew Morton: 2.6.22 -mm merge plans
Kay Sievers: udev 110 release
