Kernel development [LWN.net]

Kernel release status

The current 2.6 development kernel is 2.6.28-rc2, released by Linus on October 26. It adds a mere 22 changesets to 2.6.28-rc1, which came out on the 23rd. This kernel is now known as the "Killer Bat of Doom."

As of this writing, almost 200 changesets have been merged into the mainline since 2.6.28-rc2. They are mostly fixes, but there is also a driver for Elantech (EeePC) touchpads, support for MIPS-based NXP Semiconductors STB220 development boards, and a number of large ftrace changes.

The current stable 2.6 kernel is 2.6.27.4, released with a number of important fixes on October 25. Previously, 2.6.25.19, 2.6.26.7, and 2.6.27.3 were released on October 22. There will probably only be one more stable update for the 2.6.25 and 2.6.26 kernels, so users who are dependent on those updates may want to start thinking about moving to 2.6.27.


Quotes of the week

I look at Linux VT's and their kernel complexity with a mixture of awe and stupefaction that so much effort has gone in that direction....

-- Jim Gettys

I actually think it's a bit of an insult if people think of Motorola's EZX or MAGX (and now Android) phones as "Linux phones". Because all the freedoms of Linux (writing native applications against native Linux APIs that Linux developers know and love, being able to do Linux [kernel] development) are stripped.

In the end, to what good is Linux in those devices? Definitely not to any benefit of the user. It's to the benefit of the handset maker, who can skip a pretty expensive Windows Mobile licensing fee. Oh and, yes, they get better memory management than on Symbian ;)

That's the brave new world. It makes me sick.

-- Harald Welte

The actual problem is that if the kernel grows by 12k every time a developer says "what's the big deal?" the kernel will become very large indeed.

-- Matt Mackall

So it had sat in the mainline kernel for 4 years. During those years _nobody_ had ever tried to compile it. Nonetheless, there had been patches affecting it - including such exciting stuff as removal of trailing whitespaces, which had certainly greatly improved the damn thing.

-- Al Viro


Interview videos from the Kernel Summit

The Linux Foundation has produced a whole pile of video interviews with kernel developers from this year's Kernel Summit. Short 5-10 minute interviews with 15 different kernel developers are available. You can watch interviews with Linus Torvalds, Ted Ts'o, Greg Kroah-Hartman, and many others including LWN Executive Editor Jonathan Corbet. Videos are available in both Ogg and Flash formats.


Closing out the 2.6.28 merge window

By Jonathan Corbet
October 27, 2008

About 1000 changesets were merged after the previous summary was posted here. Many of those came from architecture-specific trees. Other changes merged this time around include:

There are new drivers for Mellanox ConnectX 10GbE network adapters, PowerPC PPC40x and PPC44x GPIO controllers, Panasonic "Let's Note" laptop special keys, Sharp SL-6000 backlight and LCD devices, Dialog Semiconductor DA9030/DA9034 backlight devices, Tabletkiosk Sahara Touch-iT backlight devices, and Toshiba TX4939 SoC ATA controllers.
One more not-ready-for-prime-time driver was merged via the staging tree; this one supports Redrapids Pocket Change cardbus devices. The staging tree also brought an extensive set of fixes to the drivers added earlier in the merge window.
The kernel has gained support for ultra-wideband protocol stacks. UWB can be used for normal networking, but the immediate application is wireless USB, which will be supported in 2.6.28.
The ACPI docking station code has gained support for bay and battery hotplug events.
The IA64 architecture now supports Xen. Also added to IA64 is support for DMA remapping devices (IOMMUs).
Support for kdump has been added to the PowerPC architecture.
The 9P (Plan9) filesystem now has RDMA support.

Changes visible to kernel developers include:

There is a new core_param() macro:
```
    core_param(name, var, type, perm);
```
Its purpose is to define "core" parameters and let them be represented in /sys/module/kernel/parameters.
It is now possible to create a workqueue running at realtime priority with:
```
    struct workqueue_struct *create_rt_workqueue(const char *name);
```

The block driver API has changed considerably, with the inode and file parameters being removed from most block device operations. The new API looks like this:

    struct block_device_operations {
        int (*open) (struct block_device *bdev, fmode_t mode);
        int (*release) (struct gendisk *gd, fmode_t mode);
        int (*locked_ioctl) (struct block_device *bdev, fmode_t mode,
                             unsigned cmd, unsigned long arg);
        int (*ioctl) (struct block_device *bdev, fmode_t mode,
                      unsigned cmd, unsigned long arg);
        int (*compat_ioctl) (struct block_device *bdev, fmode_t mode,
                             unsigned cmd, unsigned long arg);
        int (*direct_access) (struct block_device *bdev, sector_t sector,
                              void **kaddr, unsigned long *pfn);
        int (*media_changed) (struct gendisk *gd);
        int (*revalidate_disk) (struct gendisk *gd);
        int (*getgeo) (struct block_device *bdev, struct hd_geometry *geo);
        struct module *owner;
    };

The new prototypes do away with the file and inode structure pointers which were passed in previous kernels. Note that the ioctl() method is now called without the big kernel lock; code needing BKL protection must explicitly define a locked_ioctl() function instead.
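The split between ioctl() and locked_ioctl() can be pictured with a small userspace model. This is only a sketch: every name beginning with demo_ is hypothetical, the "lock" is a plain counter standing in for the BKL, and the dispatch order (try ioctl() first, then fall back to locked_ioctl() under the lock) is an assumption made for illustration rather than a copy of the kernel's code.

```c
#include <errno.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
struct demo_bdev { int id; };

struct demo_bdev_ops {
    int (*ioctl)(struct demo_bdev *bdev, unsigned mode,
                 unsigned cmd, unsigned long arg);
    int (*locked_ioctl)(struct demo_bdev *bdev, unsigned mode,
                        unsigned cmd, unsigned long arg);
};

/* Stand-in for the big kernel lock: nonzero while "held". */
int demo_bkl_depth;

void demo_lock_kernel(void)   { demo_bkl_depth++; }
void demo_unlock_kernel(void) { demo_bkl_depth--; }

/*
 * Assumed dispatch order: a driver providing ioctl() is called with no
 * lock held; a driver providing only locked_ioctl() is called with the
 * stand-in lock held; a driver providing neither gets -ENOTTY.
 */
int demo_driver_ioctl(const struct demo_bdev_ops *ops, struct demo_bdev *bdev,
                      unsigned mode, unsigned cmd, unsigned long arg)
{
    int ret;

    if (ops->ioctl)
        return ops->ioctl(bdev, mode, cmd, arg);

    if (ops->locked_ioctl) {
        demo_lock_kernel();
        ret = ops->locked_ioctl(bdev, mode, cmd, arg);
        demo_unlock_kernel();
        return ret;
    }

    return -ENOTTY;
}

/* Sample method: returns 1 if the stand-in lock is held when it runs. */
int demo_report_lock(struct demo_bdev *bdev, unsigned mode,
                     unsigned cmd, unsigned long arg)
{
    (void)bdev; (void)mode; (void)cmd; (void)arg;
    return demo_bkl_depth > 0;
}
```

Installing demo_report_lock() as ioctl shows it running unlocked, while installing it as locked_ioctl shows it running with the lock held. That is the choice a driver author now makes explicitly.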

The range timer API has been merged; callers can now specify a time period in which they would like the timeout to be delivered. The kernel can then take advantage of the range to coalesce wakeups and keep the processor idle for longer periods.
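To see why a range helps, consider a toy userspace model in which each timer carries an earliest ("soft") and latest ("hard") acceptable expiry. The sketch below counts how many wakeups are needed to serve a set of such timers. The structure, the function names, and the greedy strategy are all assumptions chosen for illustration; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical timer with an acceptable expiry window [soft, hard]. */
struct demo_range_timer {
    long soft;  /* earliest acceptable expiry */
    long hard;  /* latest acceptable expiry */
};

static int demo_cmp_by_hard(const void *a, const void *b)
{
    const struct demo_range_timer *x = a;
    const struct demo_range_timer *y = b;

    return (x->hard > y->hard) - (x->hard < y->hard);
}

/*
 * Sorts the timers in place by hard deadline, then greedily schedules a
 * wakeup at the earliest pending hard deadline. Every later timer whose
 * window already contains that instant is served by the same wakeup.
 * Returns the number of wakeups needed.
 */
size_t demo_count_wakeups(struct demo_range_timer *t, size_t n)
{
    size_t wakeups = 0;
    long last_wakeup = 0;
    size_t i;

    for (i = 0; i < n; i++)
        assert(t[i].soft <= t[i].hard);

    qsort(t, n, sizeof(*t), demo_cmp_by_hard);

    for (i = 0; i < n; i++) {
        if (wakeups > 0 && t[i].soft <= last_wakeup)
            continue;           /* covered by the previous wakeup */
        last_wakeup = t[i].hard;
        wakeups++;
    }
    return wakeups;
}
```

Three timers with exact expiries of 100, 120, and 140 need three wakeups in this model; if their windows are widened to [100,150], [120,160], and [140,200], a single wakeup at 150 serves all of them.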

This time around, linux-next maintainer Stephen Rothwell has put together a list of linux-next patches which did not get into 2.6.28. Perhaps the biggest omission was the credentials work, which seemed poised to go in this time around. Other changes which failed to get merged include the message catalog code (which looks like it will need a change of approach) and TOMOYO Linux (which seems to be caught up in the same old "new security module with pathname-based rules" swamp).

Now the stabilization period starts. Linus, perhaps, was trying to set the tone for this development cycle when he released a much smaller and earlier 2.6.28-rc2 than would have normally been expected. By way of comparison: 2.6.25-rc2 had 359 patches applied since 2.6.25-rc1. For 2.6.26-rc2, 446 changesets were merged, and, for 2.6.27-rc2, the count was 780. For 2.6.28-rc2, instead, a total of 22 changes went in. Says Linus:

And hey, maybe we can even _continue_ the nice model of "just small fixes after -rc1". I know, it sounds insane, but it's a real pleasure to do an -rc2 with just a handful of fixes for real problems that real people see. What a concept!

Should this pattern hold, it may well be that 2.6.28 will stabilize more quickly and successfully than its predecessors. It will, in any case, be interesting to watch.


Tracking tbench troubles

By Jonathan Corbet
October 29, 2008

Kernel developers tend to have a mixed view of benchmarks. A benchmarking tool can do an effective job of quantifying specific aspects of system performance. But benchmarks are not real workloads; optimizing for a benchmark can often distort a system in ways which are detrimental to real applications. Since kernel hackers do not always see benchmark optimization as their top priority, they can sometimes assign a lower priority to benchmark regressions as well. But, sometimes, benchmark problems indicate a real problem in the kernel.

The tbench benchmark is meant to measure networking performance; it consists of a collection of processes quickly making lots of small requests of a server process. Since the requests are small, there is not much time spent actually moving data; it's all a matter of shifting small packets around - and scheduling between the processes. Back in August, Christoph Lameter reported that tbench performance in the mainline kernel had been declining for some time. His system was able to move 3208 MB/sec with a 2.6.22 kernel, but only 2571 MB/sec with a 2.6.27-rc kernel. Each of the releases in between showed a decline from the one which came before, with 2.6.25 showing an especially big hit. Others were able to reproduce the results, and they engaged in various rounds of speculation on where the problem might be, but it seems that, initially, nobody actually dug into the system to see what was going on.
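The shape of such a workload (tiny requests, with each side blocking until the other answers) can be sketched in a few lines of userspace C. This is not tbench: it assumes a single client, a forked echo "server", and a Unix-domain socketpair rather than TCP, and all of the demo_ names are hypothetical. It does show why nearly every round trip forces a context switch.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read exactly len bytes; returns 0 on success, -1 on EOF or error. */
static int demo_read_full(int fd, char *buf, size_t len)
{
    size_t got = 0;

    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n <= 0)
            return -1;
        got += (size_t)n;
    }
    return 0;
}

/* Write exactly len bytes; returns 0 on success, -1 on error. */
static int demo_write_full(int fd, const char *buf, size_t len)
{
    size_t put = 0;

    while (put < len) {
        ssize_t n = write(fd, buf + put, len - put);
        if (n <= 0)
            return -1;
        put += (size_t)n;
    }
    return 0;
}

/*
 * Sends `rounds` small messages to a forked echo process and waits for
 * each reply before sending the next. Returns the number of completed
 * round trips, or -1 if the setup fails.
 */
long demo_pingpong(long rounds, size_t msg_len)
{
    int sv[2];
    char buf[256];
    long done = 0;
    pid_t pid;

    if (msg_len == 0 || msg_len > sizeof(buf))
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: echo each message back until the client closes. */
        close(sv[0]);
        while (demo_read_full(sv[1], buf, msg_len) == 0 &&
               demo_write_full(sv[1], buf, msg_len) == 0)
            ;
        _exit(0);
    }

    /* Parent: the client side of the ping-pong. */
    close(sv[1]);
    memset(buf, 'x', msg_len);
    while (done < rounds) {
        if (demo_write_full(sv[0], buf, msg_len) < 0 ||
            demo_read_full(sv[0], buf, msg_len) < 0)
            break;
        done++;
    }
    close(sv[0]);           /* the child sees EOF and exits */
    waitpid(pid, NULL, 0);
    return done;
}
```

Timing a call such as demo_pingpong(100000, 64) on two kernels would give a crude figure dominated by wakeup and context-switch cost rather than by data movement, which is the regime tbench exercises.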

At linux.conf.au 2007, Andi Kleen gave a talk describing various types of kernel hackers. One of those was the "Russian mathematician" who, he suspected, was often a room full of talented developers operating under a single name. Evgeniy Polyakov can only have reinforced that view when, in early October, he tracked down the biggest offending commit through a process which, he says, involved "just [a] couple of hundreds of compilations." In the process, he put together a plot of tbench performance which, he says, is suitable for scaring children. Through a massive amount of work, he was able to point the finger at a scheduler patch - not something in the networking stack at all.

In particular, Evgeniy found that the patch adding high-resolution preemption ticks was the problem. The idea behind this patch was to make time slices more accurate by scheduling preemption at just the right time. It makes sense; once the regular clock tick has been eliminated, there is no reason not to arrange for preemption to happen when the scheduling algorithm says it should. Unfortunately, it seems that this change also adds sufficient overhead to slow down tbench performance considerably; when Evgeniy backed it out, his performance went from 373 MB/sec to 455 MB/sec. That would seem to be a pretty clear indication that something is amiss with high-resolution preemption ticks.

At this point, the public discussion went quiet, though it appears that a number of developers were working on it off-list. David Miller eventually tracked down the worst of the trouble to the wakeup code, something he was rather vocally unhappy about having had to do. Eventually a patch was merged (for 2.6.28-rc2) disabling the high-resolution preemption tick feature. Since the discussion was private, it's not quite clear why this change took as long as it did. But there are a couple of plausible reasons. One is that this particular feature is disabled by default anyway, so most users will not encounter the performance problem it creates.

But there is also the question of weighing the benchmark result against the effects on other, "real" workloads. Ingo Molnar said:

But it's a difficult call with no silver bullets. On one hand we have folks putting more and more stuff into the context-switching hotpath on the (mostly valid) point that the scheduler is a slowpath compared to most other things. On the other hand we've got folks doing high-context-switch ratio benchmarks and complaining about the overhead whenever something goes in that improves the quality of scheduling of a workload that does not context-switch as massively as tbench. It's a difficult balance and we cannot satisfy both camps.

So, by this view, performance on scheduler-intensive benchmarks must be weighed against the wider value of other scheduler enhancements. David Miller has a different view of the situation, though:

If we now think it's ok that picking which task to run is more expensive than writing 64 bytes over a TCP socket and then blocking on a read, I'd like to stop using Linux. :-) That's "real work" and if the scheduler is more expensive than "real work" we lose.

In David's view, scheduler performance has been getting consistently worse since the switch to the completely fair scheduler in 2.6.23. He would like to see some energy put into recovering some of the performance of the pre-CFS scheduler; in particular, he thinks that Ingo and company should work to fix (what he sees as) a regression that they caused.

For the time being, the worst performance regression has been "fixed" by disabling the high-resolution preemption tick feature; Ingo says that the feature will not come back until it can be supported without slowing things down. But the scheduler seems to have gotten slower in a number of other ways as well. Your editor will make a prediction here: now that the issue has been called out in such clear terms, somebody will find the time to fix these problems to the point that the CFS scheduler will be faster than the O(1) scheduler which preceded it.

Beyond that, there are suggestions that the scheduler cannot take the blame for all of the observed regressions in tbench results. So developers will have to look at the rest of the system to figure out what's going on. The good news is that this is a clear challenge with an objective way to measure success. Once a problem reaches that level of clarity, it's usually just a matter of some hacking.


Squashfs submitted for the mainline

By Jake Edge
October 29, 2008

The Squashfs compressed filesystem is used in everything from Live CDs to embedded devices. Many or most distributions ship it in such situations, but squashfs has been maintained outside of the mainline kernel for years. That appears to be changing as it was recently submitted for inclusion in the mainline by Phillip Lougher. The reaction has been generally favorable, with Andrew Morton requesting that Lougher move it forward: "Please prepare a tree for linux-next inclusion and unless serious problems are pointed out I'd suggest shooting for a 2.6.29 merge." So it seems like a good time to take a look at some of the features and capabilities of Squashfs.

The basic idea behind Squashfs is to generate a compressed image of a filesystem or directory hierarchy that can be mounted as a read-only filesystem. This can be done to archive a set of directories or to store them on a smaller capacity device than would normally be required. The latter is used by both Live CDs and embedded devices to squeeze more into less.

It has been nearly four years since Squashfs was last submitted to linux-kernel. Since that time, it has been almost completely rewritten based on comments from that attempt. In addition, it has gone through two filesystem layout revisions in part to allow for 64-bit sizes for files and filesystems. Another major change is to make the filesystem little-endian, so that it can be read on any architecture, regardless of endian-ness.
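A fixed little-endian layout means that readers decode fields byte by byte instead of casting on-disk data to native integers. The helpers below are a generic sketch of that technique; the demo_ names are hypothetical and are not taken from the Squashfs source.

```c
#include <stdint.h>

/* Decode a 32-bit little-endian value from four raw bytes. */
uint32_t demo_get_le32(const unsigned char *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Decode a 64-bit little-endian value, e.g. a file size beyond 4GB. */
uint64_t demo_get_le64(const unsigned char *p)
{
    return (uint64_t)demo_get_le32(p) |
           ((uint64_t)demo_get_le32(p + 4) << 32);
}
```

Because the result depends only on the byte order in the image, the same code yields the same value on big-endian and little-endian hosts.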

The mksquashfs utility is used to create the image, which can then be mounted either via loopback (from a file) or from a regular block device. One of the features added since the original attempt to mainline Squashfs—to address complaints made at that time—is the ability to export a Squashfs filesystem via NFS.

Squashfs uses gzip compression on filesystem data and metadata, achieving sizes roughly one-third that of an ext3 filesystem with the same data. The performance is quite good as well, even when compared with the simpler cramfs—a compressed read-only filesystem already available with the kernel. According to Lougher, these performance numbers were gathered a number of years ago, with older versions of the code; newer numbers should be even better.

Previously, some kernel developers were resistant to adding another compressed filesystem to the kernel, so Lougher outlines a number of reasons that Squashfs is superior to cramfs. Certainly support for larger files and filesystems is compelling, but the fact that cramfs is orphaned and unmaintained will likely also play a role. In addition, Squashfs supports many more "normal" Linux filesystem features like real inode numbers, hard links, and exportability.

Morton had a laundry list of overall suggestions for making Squashfs better in the email referenced above, but documentation is certainly one of the areas that is somewhat lacking. In particular, Squashfs maintains its own cache, which puzzles Morton:

Why not just decompress these blocks into pagecache and let the VFS handle the caching??

The real bug here is that this rather obvious question wasn't answered anywhere in the patch submission (afaict). How to fix that?

Methinks we need a squashfs.txt which covers these things.

One of the reasons that Squashfs doesn't use the page cache is that it allows for multiple block sizes, from 4K up to 1M, with a default of 128K. Better compression ratios can be achieved with a larger block size, but that doesn't work well with the page cache as Jörn Engel notes: "One of the problems seems to be that your blocksize can exceed page size and there really isn't any infrastructure to deal with such cases yet."
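The mismatch is easy to quantify. The sketch below assumes 4K pages and, as a further assumption not stated above, that block sizes are powers of two; the demo_ names are hypothetical.

```c
#include <stddef.h>

#define DEMO_MIN_BLOCK     (4UL * 1024)     /* 4K lower bound */
#define DEMO_MAX_BLOCK     (1024UL * 1024)  /* 1M upper bound */
#define DEMO_DEFAULT_BLOCK (128UL * 1024)   /* 128K default */

/* Returns 1 if size lies in the 4K..1M range and is a power of two
 * (the power-of-two requirement is an assumption of this sketch). */
int demo_block_size_ok(unsigned long size)
{
    if (size < DEMO_MIN_BLOCK || size > DEMO_MAX_BLOCK)
        return 0;
    return (size & (size - 1)) == 0;
}

/* Number of pages that one decompressed block spans (rounded up). */
unsigned long demo_pages_per_block(unsigned long block_size,
                                   unsigned long page_size)
{
    return (block_size + page_size - 1) / page_size;
}
```

With 4K pages, a default 128K block decompresses into 32 pages and a 1M block into 256, so a single read of one compressed block produces data for many page-cache pages at once.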

Lougher has moved the code into a git repository, presumably in preparation to get it into linux-next. He notes that the CE Linux Forum has been instrumental in providing funding over the last four months to allow him to work on getting Squashfs into the mainline. With the additional testing that will come from being included in linux-next, it seems quite possible we could see Squashfs in 2.6.29.


Patches and updates

Linus Torvalds Linux 2.6.28-rc2
Andrew Morton 2.6.28-rc2-mm1
Linus Torvalds Linux 2.6.28-rc1
Greg KH Linux 2.6.27.4
Greg KH Linux 2.6.27.3
Greg KH Linux 2.6.26.7
Greg KH Linux 2.6.25.19
David Daney Add Cavium OCTEON processor support (v2).
Keith Packard Adding kmap_atomic_prot_pfn (was: [git pull] drm patches for 2.6.27-rc1)
Joern Engel B+Tree library
Ulrich Drepper reintroduce accept4
Rusty Russell work_on_cpu: helper for doing task on a CPU.
Mike Travis cpumask: Replace cpumask_t with struct cpumask
Manfred Spraul rcu-state
Dave Hansen Filesystem-based checkpoint
Lai Jiangshan new probes manager
Johannes Berg Timer sync lock checking
Mathieu Desnoyers LTTng 0.44 and LTTV 0.11.3
Steven Rostedt ftrace: function oprofiler
Steven Rostedt trace: profile likely and unlikely annotations
Tom Zanussi relay revamp v8
Andy Whitcroft checkpatch: update to versoin 0.25
Arnaldo Carvalho de Melo blktrace: conversion to tracepoints
Keith Packard Add io-mapping functions to dynamically map large device apertures
Jesse Barnes GTT mapping support for GEM
Ira Snyder net: add PCINet driver
John Linn [powerpc] GPIO: Adding new Xilinx driver
Trent Piepho OpenFirmware GPIO LED driver
Arjan van de Ven RFC: [PATCH] resource: ensure MMIO exclusivity for device drivers
Kuninori Morimoto Add ov772x driver
Michał Mirosław RFC: Driver for CB710/720 memory card reader (MMC part) - v3
Al Viro vfs patches
Al Viro bdev API series
Evgeniy Polyakov POHMELFS: the new release. Extended attributes.
Jan Kara 64-bit quotas and preparations for OCFS2 quotas
ngupta@google.com Priorities in Anticipatory I/O scheduler
npiggin@suse.de writeback data integrity and other fixes (take 3)
Theodore Ts'o ext4: Add support for non-native signed/unsigned htree hash algorithms
Phillip Lougher Squashfs: compressed read-only filesystem
Arjan van de Ven getting rid of __cpuinit
Marcin Slusarz Rename DECLARE_MUTEX to DEFINE_SEMAPHORE
Harvey Harrison printk: add %pM format specifier for MAC addresses
Jari Ruusu Announce loop-AES-v3.2d file/swap crypto package
Eric Paris SECURITY: new capable_noaudit interface
Eric W. Biederman netns: Coexist with the sysfs limitations
Bharata B Rao Add hierarchical accounting to cpu accounting controller
Serge E. Hallyn User namespaces: set of cleanups (v2)
Rafael J. Wysocki 2.6.28-rc1-git1: Reported regressions from 2.6.27
Rafael J. Wysocki 2.6.28-rc1-git1: Reported regressions 2.6.26 -> 2.6.27
Frank Ch. Eigler notes for linux plumbers conference talk on systemtap
Stephen Rothwell linux-next: left over things in linux-next after 2.6.28-c1
Pablo Neira Ayuso conntrack-tools 0.9.8 released
Jozsef Kadlecsik ipset-2.4.3 released
Hans de Goede libv4l release: 0.5.3
Jackson Yee Testing Requested: Python Bindings for Video4linux2
Daniel Lezcano New version 0.4.0 of the Linux Container
