Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.28-rc2, released by Linus on October 26. It adds a mere 22 changesets to 2.6.28-rc1, which came out on the 23rd. This kernel is now known as the "Killer Bat of Doom."
As of this writing, almost 200 changesets have been merged into the mainline since 2.6.28-rc2. They are mostly fixes, but there is also a driver for Elantech (EeePC) touchpads, support for MIPS-based NXP Semiconductors STB220 development boards, and a number of large ftrace changes.
The current stable 2.6 kernel is 2.6.27.4, released with a number of important fixes on October 25. Previously, 2.6.25.19, 2.6.26.7, and 2.6.27.3 were released on October 22. There will probably only be one more stable update for the 2.6.25 and 2.6.26 kernels, so users who are dependent on those updates may want to start thinking about moving to 2.6.27.
Kernel development news
Quotes of the week
In the end, to what good is Linux in those devices? Definitely not to any benefit of the user. It's to the benefit of the handset maker, who can skip a pretty expensive Windows Mobile licensing fee. Oh and, yes, they get better memory management than on Symbian ;)
That's the brave new world. It makes me sick.
Interview videos from the Kernel Summit
The Linux Foundation has produced a whole pile of video interviews with kernel developers from this year's Kernel Summit. Short 5-10 minute interviews with 15 different kernel developers are available. You can watch interviews with Linus Torvalds, Ted Ts'o, Greg Kroah-Hartman, and many others including LWN Executive Editor Jonathan Corbet. Videos are available in both Ogg and Flash formats.
Closing out the 2.6.28 merge window
About 1000 changesets were merged after the previous summary was posted here. Many of those came from architecture-specific trees. Other changes merged this time around include:
- There are new drivers for
Mellanox ConnectX 10GbE network adapters,
PowerPC PPC40x and PPC44x GPIO controllers,
Panasonic "Let's Note" laptop special keys,
Sharp SL-6000 backlight and LCD devices,
Dialog Semiconductor DA9030/DA9034 backlight devices,
Tabletkiosk Sahara Touch-iT backlight devices, and
Toshiba TX4939 SoC ATA controllers.
- One more not-ready-for-prime-time driver was merged via the staging tree; this one supports Redrapids Pocket Change cardbus devices. The staging tree also brought an extensive set of fixes to the drivers added earlier in the merge window.
- The kernel has gained support for ultra-wideband protocol stacks. UWB can be used for normal networking, but the immediate application is wireless USB, which will be supported in 2.6.28.
- The ACPI docking station code has gained support for bay and battery hotplug events.
- The IA64 architecture now supports Xen. Also added to IA64 is support for DMA remapping devices (IOMMUs).
- Support for kdump has been added to the PowerPC architecture.
- The 9P (Plan9) filesystem now has RDMA support.
Changes visible to kernel developers include:
- There is a new core_param() macro:
core_param(name, var, type, perm);
Its purpose is to define "core" parameters and let them be represented in /sys/module/kernel/parameters.
- It is now possible to create a workqueue running at realtime priority with:
struct workqueue_struct *create_rt_workqueue(const char *name);
- The block driver API has changed considerably, with the inode and file parameters being removed from most block device operations. The new API looks like this:
    struct block_device_operations {
        int (*open) (struct block_device *bdev, fmode_t mode);
        int (*release) (struct gendisk *gd, fmode_t mode);
        int (*locked_ioctl) (struct block_device *bdev, fmode_t mode,
                             unsigned cmd, unsigned long arg);
        int (*ioctl) (struct block_device *bdev, fmode_t mode,
                      unsigned cmd, unsigned long arg);
        int (*compat_ioctl) (struct block_device *bdev, fmode_t mode,
                             unsigned cmd, unsigned long arg);
        int (*direct_access) (struct block_device *bdev, sector_t sector,
                              void **kaddr, unsigned long *pfn);
        int (*media_changed) (struct gendisk *gd);
        int (*revalidate_disk) (struct gendisk *gd);
        int (*getgeo)(struct block_device *bdev, struct hd_geometry *geo);
        struct module *owner;
    };
The new prototypes do away with the file and inode structure pointers which were passed in previous kernels. Note that the ioctl() method is now called without the big kernel lock; code needing BKL protection must explicitly define a locked_ioctl() function instead.
- The range timer API has been merged; callers can now specify a time period in which they would like the timeout to be delivered. The kernel can then take advantage of the range to coalesce wakeups and keep the processor idle for longer periods.
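The coalescing idea behind range timers can be illustrated with a small userspace model. This is a sketch only, not the kernel's timer code: the range_timer structure and the count_wakeups() function are hypothetical names, and the model assumes that the processor wakes at the latest acceptable time of the first pending timer and then runs every timer whose window already covers that instant.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a range timer: it may fire anywhere in [soft, hard]. */
struct range_timer {
    long soft;  /* earliest acceptable expiry */
    long hard;  /* latest acceptable expiry */
};

/* qsort() comparison: order timers by their hard deadline. */
static int by_hard_deadline(const void *a, const void *b)
{
    const struct range_timer *x = a;
    const struct range_timer *y = b;

    return (x->hard > y->hard) - (x->hard < y->hard);
}

/*
 * Count the wakeups needed to service all timers. The array is sorted in
 * place by hard deadline. A new wakeup is scheduled at a timer's hard
 * deadline only when the previous wakeup falls before its soft expiry;
 * otherwise the timer rides along with the previous wakeup.
 */
size_t count_wakeups(struct range_timer *timers, size_t n)
{
    size_t wakeups = 0;
    long last_wakeup = 0;
    size_t i;

    if (n == 0)
        return 0;

    qsort(timers, n, sizeof(*timers), by_hard_deadline);
    for (i = 0; i < n; i++) {
        if (wakeups == 0 || timers[i].soft > last_wakeup) {
            last_wakeup = timers[i].hard;
            wakeups++;
        }
    }
    return wakeups;
}
```

With windows of [0,100], [50,120], [90,150], [200,300], and [250,260], this model needs two wakeups (at 100 and 260) rather than five; with no slack at all (soft equal to hard), every distinct expiry time costs its own wakeup.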
This time around, linux-next maintainer Stephen Rothwell has put together a list of linux-next patches which did not get into 2.6.28. Perhaps the biggest omission was the credentials work, which seemed poised to go in this time around. Other changes which failed to get merged include the message catalog code (which looks like it will need a change of approach) and TOMOYO Linux (which seems to be caught up in the same old "new security module with pathname-based rules" swamp).
Now the stabilization period starts. Linus, perhaps, was trying to set the tone for this development cycle when he released a much smaller and earlier 2.6.28-rc2 than would have normally been expected. By way of comparison: 2.6.25-rc2 had 359 patches applied since 2.6.25-rc1. For 2.6.26-rc2, 446 changesets were merged, and, for 2.6.27-rc2, the count was 780. For 2.6.28-rc2, instead, a total of 22 changes went in. Says Linus:
Should this pattern hold, it may well be that 2.6.28 will stabilize more quickly and successfully than its predecessors. It will, in any case, be interesting to watch.
Tracking tbench troubles
Kernel developers tend to have a mixed view of benchmarks. A benchmarking tool can do an effective job of quantifying specific aspects of system performance. But benchmarks are not real workloads; optimizing for a benchmark can often distort a system in ways which are detrimental to real applications. Since kernel hackers do not always see benchmark optimization as their top priority, they can sometimes assign a lower priority to benchmark regressions as well. But, sometimes, benchmark problems indicate a real problem in the kernel.
The tbench benchmark is meant to measure networking performance; it consists of a collection of processes quickly making lots of small requests from a server process. Since the requests are small, there is not much time spent actually moving data; it's all a matter of shifting small packets around - and scheduling between the processes. Back in August, Christoph Lameter reported that tbench performance in the mainline kernel had been declining for some time. His system was able to move 3208 MB/sec with a 2.6.22 kernel, but only 2571 MB/sec with a 2.6.27-rc kernel. Each of the releases in between showed a decline from the one which came before, with 2.6.25 showing an especially big hit. Others were able to reproduce the results, and they engaged in various rounds of speculation on where the problem might be, but it seems that, initially, nobody actually dug into the system to see what was going on.
At linux.conf.au 2007, Andi Kleen gave a talk describing various types of kernel hackers. One of those was the "Russian mathematician" who, he suspected, was often a room full of talented developers operating under a single name. Evgeniy Polyakov can only have reinforced that view when, in early October, he tracked down the biggest offending commit through a process which, he says, involved "just [a] couple of hundreds of compilations." In the process, he put together a plot of tbench performance which, he says, is suitable for scaring children. Through a massive amount of work, he was able to point the finger at a scheduler patch - not something in the networking stack at all.
In particular, Evgeniy found that the patch adding high-resolution preemption ticks was the problem. The idea behind this patch was to make time slices more accurate by scheduling preemption at just the right time. It makes sense; once the regular clock tick has been eliminated, there is no reason not to arrange for preemption to happen when the scheduling algorithm says it should. Unfortunately, it seems that this change also adds sufficient overhead to slow down tbench performance considerably; when Evgeniy backed it out, his performance went from 373 MB/sec to 455 MB/sec. That would seem to be a pretty clear indication that something is amiss with high-resolution preemption ticks.
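To put the throughput figures quoted so far on a common scale, the relative changes can be worked out with a trivial helper. The percent_change() function below is purely illustrative and is not part of tbench or of any kernel tool.

```c
/*
 * Relative change from `before` to `after`, expressed in percent.
 * A negative result means throughput dropped.
 */
double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

By this measure, Christoph's numbers (3208 MB/sec down to 2571 MB/sec) amount to a drop of just under 20%, while Evgeniy's revert (373 MB/sec up to 455 MB/sec) recovered about 22% on his system. The two sets of numbers come from different machines, so only the percentages are comparable, not the absolute values.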
At this point, the public discussion went quiet, though it appears that a number of developers were working on it off-list. David Miller eventually tracked down the worst of the trouble to the wakeup code, something he was rather vocally unhappy about having had to do. Eventually a patch was merged (for 2.6.28-rc2) disabling the high-resolution preemption tick feature. Since the discussion is private, it's not quite clear why this change took as long as it did. But there are a couple of plausible reasons. One is that this particular feature is disabled by default anyway, so most users will not encounter the performance problem it creates.
But there is also the question of weighing the benchmark result against the effects on other, "real" workloads. Ingo Molnar said:
So, by this view, performance on scheduler-intensive benchmarks must be weighed against the wider value of other scheduler enhancements. David Miller has a different view of the situation, though:
In David's view, scheduler performance has been getting consistently worse since the switch to the completely fair scheduler in 2.6.23. He would like to see some energy put into recovering some of the performance of the pre-CFS scheduler; in particular, he thinks that Ingo and company should work to fix (what he sees as) a regression that they caused.
For the time being, the worst performance regression has been "fixed" by disabling the high-resolution preemption tick feature; Ingo says that the feature will not come back until it can be supported without slowing things down. But the scheduler seems to have gotten slower in a number of other ways as well. Your editor will make a prediction here: now that the issue has been called out in such clear terms, somebody will find the time to fix these problems to the point that the CFS scheduler will be faster than the O(1) scheduler which preceded it.
Beyond that, there are suggestions that the scheduler cannot take the blame for all of the observed regressions in tbench results. So developers will have to look at the rest of the system to figure out what's going on. The good news is that this is a clear challenge with an objective way to measure success. Once a problem reaches that level of clarity, it's usually just a matter of some hacking.
Squashfs submitted for the mainline
The Squashfs compressed filesystem is used in everything from Live CDs to embedded devices. Many or most distributions ship it in such situations, but Squashfs has been maintained outside of the mainline kernel for years. That appears to be changing as it was recently submitted for inclusion in the mainline by Phillip Lougher. The reaction has been generally favorable, with Andrew Morton requesting that Lougher move it forward: "Please prepare a tree for linux-next inclusion and unless serious problems are pointed out I'd suggest shooting for a 2.6.29 merge." So it seems like a good time to take a look at some of the features and capabilities of Squashfs.
The basic idea behind Squashfs is to generate a compressed image of a filesystem or directory hierarchy that can be mounted as a read-only filesystem. This can be done to archive a set of directories or to store them on a smaller capacity device than would normally be required. The latter is used by both Live CDs and embedded devices to squeeze more into less.
It has been nearly four years since Squashfs was last submitted to linux-kernel. Since that time, it has been almost completely rewritten based on comments from that attempt. In addition, it has gone through two filesystem layout revisions in part to allow for 64-bit sizes for files and filesystems. Another major change is to make the filesystem little-endian, so that it can be read on any architecture, regardless of endian-ness.
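The value of a fixed on-disk byte order can be shown with a short sketch of endian-neutral decoding. The le32_decode() and le64_decode() helpers are hypothetical names written for this illustration; they are not taken from the Squashfs patches, and in-kernel code would more typically rely on existing helpers such as le32_to_cpu().

```c
#include <stdint.h>

/*
 * Assemble a 32-bit value from four little-endian bytes. Because the value
 * is built with shifts rather than by casting the buffer, the result is the
 * same on big-endian and little-endian hosts.
 */
uint32_t le32_decode(const unsigned char *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* The same idea for 64-bit fields, such as large file or filesystem sizes. */
uint64_t le64_decode(const unsigned char *p)
{
    return (uint64_t)le32_decode(p) |
           ((uint64_t)le32_decode(p + 4) << 32);
}
```

Given the bytes 78 56 34 12 as they would sit on disk, le32_decode() returns 0x12345678 regardless of the byte order of the machine doing the reading.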
The mksquashfs utility is used to create the image, which can then be mounted either via loopback (from a file) or from a regular block device. One of the features added since the original attempt to mainline Squashfs—to address complaints made at that time—is the ability to export a Squashfs filesystem via NFS.
Squashfs uses gzip compression on filesystem data and metadata, achieving sizes roughly one-third that of an ext3 filesystem with the same data. The performance is quite good as well, even when compared with the simpler cramfs—a compressed read-only filesystem already available with the kernel. According to Lougher, these performance numbers were gathered a number of years ago, with older versions of the code; newer numbers should be even better.
Previously, some kernel developers were resistant to adding another compressed filesystem to the kernel, so Lougher outlines a number of reasons that Squashfs is superior to cramfs. Certainly support for larger files and filesystems is compelling, but the fact that cramfs is orphaned and unmaintained will likely also play a role. In addition, Squashfs supports many more "normal" Linux filesystem features like real inode numbers, hard links, and exportability.
Morton had a laundry list of overall suggestions for making Squashfs better in the email referenced above, but documentation is certainly one of the areas that is somewhat lacking. In particular, Squashfs maintains its own cache, which puzzles Morton:
The real bug here is that this rather obvious question wasn't answered anywhere in the patch submission (afaict). How to fix that?
Methinks we need a squashfs.txt which covers these things.
One of the reasons that Squashfs doesn't use the page cache is that it allows for multiple block sizes, from 4K up to 1M, with a default of 128K. Better compression ratios can be achieved with a larger block size, but that doesn't work well with the page cache, as Jörn Engel notes: "One of the problems seems to be that your blocksize can exceed page size and there really isn't any infrastructure to deal with such cases yet."
Lougher has moved the code into a git repository, presumably in preparation to get it into linux-next. He notes that the CE Linux Forum has been instrumental in providing funding over the last four months to allow him to work on getting Squashfs into the mainline. With the additional testing that will come from being included in linux-next, it seems quite possible we could see Squashfs in 2.6.29.
Page editor: Jonathan Corbet