Kernel development
Brief items
Kernel release status
There is no 2.6 prepatch outstanding as of this writing. The 2.6.22 merge window has opened, and about 2,000 changesets have been merged so far (see below).
The current -mm tree is 2.6.21-rc7-mm2. Few new features have gone into -mm recently; the focus has been on bug fixes.
The current stable 2.6 kernel is 2.6.21, released on April 25. For those just tuning in, 2.6.21 includes clockevents and the dynamic tick patch, the VMI virtualization interface, a number of KVM improvements, the ALSA system on chip layer, and much more. See the KernelNewbies 2.6.21 summary for vast amounts of detail.
The 2.6.21.1 update added a couple of fixes for security issues in the networking code.
For older kernels: the current 2.6.20 release is 2.6.20.11, released on May 1 (2.6.20.8, 2.6.20.9, and 2.6.20.10 preceded it). The 2.6.20.11 release contains a few dozen important fixes; the previous updates contained fixes for networking-related security problems.
2.6.16.50-rc1 was released on May 1 with several fixes, a couple of which have CVE numbers attached.
Kernel development news
Quotes of the week
Job opening: kernel bug manager
In the middle of the discussion on the handling of kernel bugs, Andrew Morton let it slip that the long-rumored, Google-funded kernel bug manager position is now open. It has apparently proved hard to fill: "Unfortunately the recruiting has been a bit tricky - this is not a typical job and it's a funny mixture of bureaucracy/politics/social engineering and programming. People who are skilled in both areas, are, ah, uncommon." If you are such a person, this could be a great opportunity to build kernel skills while working directly with Andrew - and to help the kernel process as well.
Merged (and to be merged) for 2.6.22
The 2.6.22 merge window has opened, with almost 2,000 changesets merged as of this writing. The merge process appears to have slowed somewhat; it may be that the level of traffic on linux-kernel is so high (even by linux-kernel standards) that nobody has time to deal with actual patches. Be that as it may, user-visible changes merged so far include:
- Lots of networking changes, including improvements to the forward
retransmission timeout recovery (F-RTO, RFC 4138) implementation, a
YeAH-TCP congestion control implementation, a TCP Illinois congestion
control implementation, and a new RxRPC secure socket layer (along with
support for using RxRPC in the AFS filesystem). Also, the old, IPv4-only
connection tracking code has been removed as per the feature removal
schedule.
- The cfg80211 patches - a new, netlink-based interface for configuring
wireless interfaces - have been merged. At the same time, the netlink
version of the "wireless extensions" interface has been removed.
- The OCFS2 filesystem now has sparse file support.
- The UBI
patch, which performs flash-aware partitioning and volume management,
has been merged.
- New drivers for USB webcams based on zr364xx chipsets, AT26Fxxx
dataflash devices, CM-X270-based NAND flash memory, Freescale SOC USB
controllers, and Marvell Libertas 802.11 adaptors (used in the OLPC
system).
It's also worth noting that the IVTV video driver, long out of the mainline, has finally been merged. "It took three core maintainers, over four years of work, eight new i2c modules, eleven new V4L2 ioctls, three new DVB video ioctls, a Sliced VBI API, a new MPEG encoder API, an enhanced DVB video MPEG decoding API, major YUV/OSD contributions from Ian and John, web/wiki/svn/trac support from Axel Thimm, (hardware) support from Hauppauge, support and assistance from the v4l-dvb people and the many, many users of ivtv to finally make it possible to merge this driver into the kernel."
- A new "sony-laptop" layer which replaces sonypi and provides better
Sony support. The old "ibm_acpi" module has been renamed
"thinkpad-acpi," and it features improved support for those laptops.
- The CFQ I/O scheduler has been reworked. Taking inspiration from the CFS CPU scheduler, it now tracks pending requests in a red-black tree, sorted by expected execution time.
Changes visible to kernel developers include:
- The eth_type_trans() function now sets the
skb->dev field, consistent with how similar functions for
other link types operate. As a result, many Ethernet drivers have
been changed to remove the (now) redundant assignment.
- The header fields in the sk_buff structure have been renamed
and are no longer unions. Networking code and drivers can now just
use skb->transport_header,
skb->network_header, and skb->mac_header.
There are new functions for finding specific headers within packets:
tcp_hdr(), udp_hdr(), ipip_hdr(), and
ipipv6_hdr().
- Also in the networking area: the packet scheduler has been reworked to use ktime values rather than jiffies.
Those who are curious about what else might get into 2.6.22 can have a look at Andrew Morton's 2.6.22 merge plans document. Interestingly, Lguest, the signalfd work, and the SLUB allocator are all planned for merging, but all have become less certain since:
- There have been some complaints that Lguest has not been sufficiently
reviewed. Since this development is independent and will not bother
those who do not use it, the concerns are less likely to delay its
inclusion.
- Signalfd has a new competitor in the form of the pollfs patch. Pollfs
takes a different approach to many of the same problems and throws in
polling for futex operations as well. It is far from clear that
pollfs is better (some of the early reviews have been on the
unfavorable side), but the process of figuring out whether that is
true could delay signalfd past the closing of the merge window.
- The SLUB allocator has also been subject to concerns that it has not been sufficiently tested for such a fundamental change. Additionally, there seems to be a difference of goals between Andrew Morton (who would like to see SLUB eventually replace the current slab allocator) and SLUB developer Christoph Lameter (who had envisioned the two coexisting indefinitely). Chances are these issues will get worked out and SLUB will go in as scheduled.
There are a few things of interest which are not on Andrew's list. The reiser4 filesystem seems certain to sit out (at least) another cycle, despite a resurgence in interest in getting it ready for inclusion. Xen is not mentioned, but it seems that, behind the scenes, it is being worked on. So Xen could actually show up before the merge window closes. There will be no major scheduler rework in 2.6.22; it's too soon for any of those patches to go in. The anti-fragmentation patches look likely to wait a little longer; Andrew worries that they still haven't seen enough review and benchmarking despite many iterations over a few years. The integrity management patches are considered to be unready and will not be merged.
Beyond that, there will doubtless be surprises over the next week or so; stay tuned.
UIO: user-space drivers
The concept of supporting user-space drivers has appeared on this page a few times before. It's back; this time there is a version of the patch (now called "UIO") which is being proposed for inclusion into 2.6.22. The interface has changed somewhat, so another look is called for.
Like the previous version, UIO does not completely eliminate the need for kernel-space code. A small module is required to set up the device, perhaps interface to the PCI bus, and register an interrupt handler. The last function (interrupt handling) is particularly important; much can be done in user space, but there needs to be an in-kernel interrupt handler which knows how to tell the device to stop crying for attention.
The kernel module includes <linux/uio_driver.h>. If it's a driver for a PCI device, it should register itself as a PCI driver in the usual way. When it comes time to connect a device (perhaps in the PCI probe() function), the driver fills in a uio_info structure:
    struct uio_info {
        char *name;
        char *version;
        struct uio_mem mem[MAX_UIO_MAPS];
        long irq;
        unsigned long irq_flags;
        void *priv;
        irqreturn_t (*handler)(int irq, struct uio_info *dev_info);
        int (*mmap)(struct uio_info *info, struct vm_area_struct *vma);
        int (*open)(struct uio_info *info, struct inode *inode);
        int (*release)(struct uio_info *info, struct inode *inode);
        /* Internal stuff omitted */
    };
Here, name is the name of the device and version is the driver version (which will show up in sysfs). The number of the interrupt used by the device (if any) goes into irq, with irq_flags being the flags which will be passed to request_irq(). The function which handles interrupts is handler(). This handler should acknowledge the interrupt; it usually does not need to do anything else. The mmap(), open(), and release() functions are called from the equivalent file_operations members.
The mem array describes any memory areas which can be mapped into user space. The uio_mem structure looks like:
    struct uio_mem {
        unsigned long addr;
        unsigned long size;
        int memtype;
        void __iomem *internal_addr;
        /* ... */
    };
For each mappable area, addr is the relevant address, and size is the size of the area. If it's an I/O memory area, internal_addr is the address returned by ioremap(). The memtype field describes what the area really is:
- UIO_MEM_PHYS indicates that addr is a physical
address, generally for an I/O memory area.
- UIO_MEM_LOGICAL is memory in the kernel logical address
space, such as that returned by kmalloc().
- UIO_MEM_VIRTUAL is memory in the kernel virtual address space - the space used by vmalloc_user() and friends.
Once the structure is filled in, the driver stub passes it to:
int uio_register_device(struct device *parent, struct uio_info *info);
The parent pointer tells the kernel which "real" device is associated with the UIO device; if the driver is for a PCI device, parent will be &pci_dev->dev, the address of the device structure embedded in the pci_dev structure.
There is not much more to the kernel-space UIO API. When a device goes away, the driver should call:
void uio_unregister_device(struct uio_info *info);
The final function of note is:
void uio_event_notify(struct uio_info *info);
Its purpose is to notify the UIO core that an event (typically an interrupt) has occurred. The stub driver need not call uio_event_notify() for real interrupts, but it can be used to simulate interrupts in other situations.
On the user space side, the first UIO-handled device will show up as /dev/uio0 (assuming a normal udev setup). The user-space driver will open the device. Reading the device returns an int value which is the event count (number of interrupts) seen by the device; if no interrupts have come in since the last read, the operation will block until an interrupt happens (though non-blocking operation is supported in the usual way as well). The file descriptor can be passed to poll().
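As a rough illustration of the user-space side, the sketch below waits for an event with poll() and then reads the count. The uio_wait_event() name and its return convention are invented for this example and are not part of the UIO patch; it also assumes that each read returns exactly one int, as described above.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms milliseconds for an event
 * on an already-open UIO file descriptor, then read the event count.
 * Returns 1 if a count was read, 0 on timeout, -1 on error.
 * Assumption: the device hands back one int per read().
 */
int uio_wait_event(int fd, int timeout_ms, int *count)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;
    ssize_t n;

    assert(count != NULL);

    /* Retry if a signal interrupts the wait. */
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready <= 0)
        return ready;   /* 0: timed out, -1: poll() failed */

    n = read(fd, count, sizeof(*count));
    if (n != (ssize_t)sizeof(*count))
        return -1;      /* short read or read error */
    return 1;
}
```

A user-space driver would presumably open /dev/uio0 once and call something like this from its main loop, comparing successive counts to detect missed interrupts.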
The memory areas described by the kernel-space driver can be mapped into user space with the mmap() call. The interface is just a little strange: the offset value passed to mmap() should be N times the page size for the Nth memory area. So, on a system with 4096-byte pages, the first memory area will be found with an offset of zero, the second at 4096, the third at 8192, etc. Once that is figured out, though, everything is pretty straightforward.
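The offset rule can be captured in a few lines. This is only a sketch: uio_area_offset() and uio_map_area() are hypothetical names, areas are counted from zero to match the offsets given above, and the caller is assumed to know the size of the area it wants.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Offset to pass to mmap() for memory area n (counting from zero). */
off_t uio_area_offset(unsigned int n, long page_size)
{
    assert(page_size > 0);
    return (off_t)n * page_size;
}

/*
 * Hypothetical helper: map memory area n of an open UIO device.
 * Returns MAP_FAILED on error, just as mmap() does.
 */
void *uio_map_area(int fd, unsigned int n, size_t size)
{
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size <= 0)
        return MAP_FAILED;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
                uio_area_offset(n, page_size));
}
```

Querying the page size at run time, rather than hard-coding 4096, keeps the same code working on architectures with larger pages.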
There are some limitations, of course. UIO drivers are char drivers; there is no provision for creating user-space block or network drivers at this time. It is not possible to set up DMA operations from user space. But, for drivers which can be implemented with I/O memory access and simple interrupt handlers, the necessary pieces are in place. The patch set includes an example driver to show how it all works. According to Thomas Gleixner, the original, fully in-kernel version of the driver had to implement 68 different ioctl() commands and was over 5,000 lines long. The associated user-space code was over 3,000 lines as well. The new driver eliminates all of that, with a total of 156 lines of kernel code and just under 3,000 lines in user space.
Andrew Morton has expressed some reservations about the patch:
The authors respond that it's not really about doing proprietary drivers, though some of that will undoubtedly go on. There are a number of people, especially in the embedded space, who want to do user-space drivers, for prototyping purposes if nothing else. The UIO framework gives them a relatively safe and standard way to write these drivers, which is seen as being better than having them each create their own kernel hooks. The patch has not been merged as of this writing, but, unless stronger objections arise, its chances of getting into 2.6.22 are reasonably good.
Large block size support
On its face, it doesn't seem like Christoph Lameter's large block size support patch would be that controversial. This patch set equips the page cache to hold blocks which are larger than the system's page size by storing them in higher-order, compound pages. That, in turn, enables filesystems to work with larger blocks. The patch should make operations on large files more efficient and improve the kernel's support for some types of hardware. The patch might eventually get merged, but not before more discussion has happened.
The problem is that this patch is not without its difficulties. It adds a certain amount of complexity to the core virtual memory subsystem to implement what is, in all reality, a feature which has been rejected before: larger pages. The patch currently ducks the most difficult part of the problem - handling faults on larger pages, needed to make mmap() work - meaning that more complexity can be expected in the future. Larger blocks in the page cache mean more demand for higher-order pages, which are already in short supply on many systems; that, in turn, means that the anti-fragmentation patches would almost certainly be needed as well. Use of larger pages in the page cache can also lead to more internal fragmentation and less efficient memory use.
For all these reasons, Andrew Morton has been expressing some reservations:
Andrew is not necessarily opposed to the patch; he is more concerned that it not be merged until it has been carefully compared with the alternatives. He suggests keeping the page cache entry size unchanged, but trying to allocate entries in higher-order groups. That would result in larger blocks being stored contiguously in memory without the memory subsystem changes. Filesystems could use those larger blocks, and hardware could treat them as single units in scatter/gather lists for DMA, leading to more efficient operations.
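The scatter/gather benefit can be illustrated with a toy calculation. The function below is purely hypothetical and is not taken from any of the patches under discussion: it counts how many scatter/gather entries a buffer would need if physically adjacent pages are merged into a single entry. It ignores the per-entry size limits which real hardware imposes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the scatter/gather entries needed for a buffer whose pages have
 * the given page frame numbers, merging each run of physically adjacent
 * pages into one entry.
 */
size_t count_sg_entries(const unsigned long *pfns, size_t n)
{
    size_t entries = 0;

    assert(pfns != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* A new entry starts wherever the previous page is not adjacent. */
        if (i == 0 || pfns[i] != pfns[i - 1] + 1)
            entries++;
    }
    return entries;
}
```

Four pages scattered around memory need four entries, while four pages allocated as one higher-order group need only one; that difference is what would let a driver describe a larger I/O operation within the same list size.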
Another possibility which has been raised is raising the maximum size of hardware scatter/gather lists or allowing them to be chained. Drivers could then set up larger I/O operations, improving efficiency without requiring the other changes.
Still, there is support for Christoph's patch. It would make support of larger blocks relatively straightforward for the lower layers, perhaps enabling the removal of some real hacks found in some drivers and filesystems now. The patch would also allow ext3 filesystems with larger block sizes - sometimes created on ia64 systems, which use larger pages - to be mounted on other architectures. Christoph Hellwig likes the idea that a higher-order page cache could force a solution to the longstanding problem of physical memory fragmentation. To many, it seems like a straightforward and necessary solution to a longstanding problem.
So the large block size idea is unlikely to just go away. It may be a while, though, before its proponents can do enough homework and benchmarking to fully address the worries which have been expressed. Fundamental changes are often the ones which take the longest to get into the kernel, so there is little that is surprising here. Just don't ask for a prediction of the final outcome.
Page editor: Jonathan Corbet