Brief items
The current 2.6 development kernel is 2.6.28-rc2, released by Linus on
October 26. It adds a mere 22 changesets to 2.6.28-rc1, which came out on
the 23rd. This kernel is now known as the "Killer Bat of Doom."
As of this writing, almost 200 changesets have been merged into the
mainline since 2.6.28-rc2. They are mostly fixes, but there is also a
driver for Elantech (EeePC) touchpads, support for MIPS-based NXP
Semiconductors STB220 development boards, and a number of large ftrace
changes.
The current stable 2.6 kernel is 2.6.27.4, released with a number of
important fixes on October 25. Previously, 2.6.25.19, 2.6.26.7, and 2.6.27.3 were released on October 22.
There will probably only be one more stable update for the 2.6.25 and
2.6.26 kernels, so users who are dependent on those updates may want to
start thinking about moving to 2.6.27.
Kernel development news
I look at Linux VT's and their kernel complexity with a mixture of
awe and stupefaction that so much effort has gone in that
direction....
--
Jim Gettys
I actually think it's a bit of an insult if people think of
Motorola's EZX or MAGX (and now Android) phones as "Linux
phones". Because all the freedoms of Linux (writing native
applications against native Linux APIs that Linux developers know
and love, being able to do Linux [kernel] development) are
stripped.
In the end, to what good is Linux in those devices? Definitely not
to any benefit of the user. It's to the benefit of the handset
maker, who can skip a pretty expensive Windows Mobile licensing
fee. Oh and, yes, they get better memory management than on Symbian
;)
That's the brave new world. It makes me sick.
--
Harald Welte
The actual problem is that if the kernel grows by 12k every time a
developer says "what's the big deal?" the kernel will become very
large indeed.
--
Matt Mackall
So it had sat in the mainline kernel for 4 years. During those
years _nobody_ had ever tried to compile it. Nonetheless, there
had been patches affecting it - including such exciting stuff as
removal of trailing whitespaces, which had certainly greatly
improved the damn thing.
--
Al Viro
The Linux Foundation has produced a whole pile of
video interviews with kernel developers from this year's Kernel Summit. Short 5-10 minute interviews with 15 different kernel developers are available. You can watch interviews with Linus Torvalds, Ted Ts'o, Greg Kroah-Hartman, and many others including LWN Executive Editor Jonathan Corbet. Videos are available in both Ogg and Flash formats.
By Jonathan Corbet
October 27, 2008
About 1000 changesets were merged after the previous summary was posted
here. Many of those came from architecture-specific trees. Other changes
merged this time around include:
- There are new drivers for
Mellanox ConnectX 10GbE network adapters,
PowerPC PPC40x and PPC44x GPIO controllers,
Panasonic "Let's Note" laptop special keys,
Sharp SL-6000 backlight and LCD devices,
Dialog Semiconductor DA9030/DA9034 backlight devices,
Tabletkiosk Sahara Touch-iT backlight devices, and
Toshiba TX4939 SoC ATA controllers.
- One more not-ready-for-prime-time driver was merged via the staging
tree; this one supports Redrapids Pocket Change cardbus devices. The
staging tree also brought an extensive set of fixes to the drivers
added earlier in the merge window.
- The kernel has gained support for ultra-wideband
protocol stacks. UWB can be used for normal networking, but the
immediate application is wireless USB, which will be
supported in 2.6.28.
- The ACPI docking station code has gained support for bay and battery
hotplug events.
- The IA64 architecture now supports Xen. Also added to IA64 is support
for DMA remapping devices (IOMMUs).
- Support for kdump has
been added to the PowerPC architecture.
- The 9P (Plan9) filesystem now has RDMA support.
Changes visible to kernel developers include:
- There is a new core_param() macro:
    core_param(name, var, type, perm);
Its purpose is to define "core" parameters and let them be
represented in /sys/module/kernel/parameters.
- It is now possible to create a workqueue running at realtime priority
with:
    struct workqueue_struct *create_rt_workqueue(const char *name);
- The block driver API has changed considerably, with the inode
and file parameters being removed from most block device
operations. The new API looks like this:
    struct block_device_operations {
        int (*open) (struct block_device *bdev, fmode_t mode);
        int (*release) (struct gendisk *gd, fmode_t mode);
        int (*locked_ioctl) (struct block_device *bdev, fmode_t mode,
                             unsigned cmd, unsigned long arg);
        int (*ioctl) (struct block_device *bdev, fmode_t mode,
                      unsigned cmd, unsigned long arg);
        int (*compat_ioctl) (struct block_device *bdev, fmode_t mode,
                             unsigned cmd, unsigned long arg);
        int (*direct_access) (struct block_device *bdev, sector_t sector,
                              void **kaddr, unsigned long *pfn);
        int (*media_changed) (struct gendisk *gd);
        int (*revalidate_disk) (struct gendisk *gd);
        int (*getgeo)(struct block_device *bdev, struct hd_geometry *geo);
        struct module *owner;
    };
The new prototypes do away with the file and inode
structure pointers which were passed in previous kernels.
Note that the ioctl() method is now called without the big
kernel lock; code needing BKL protection must explicitly define a
locked_ioctl() function instead.
- The range timer API
has been merged; callers can now specify a time period in which they
would like the timeout to be delivered. The kernel can then take
advantage of the range to coalesce wakeups and keep the processor idle
for longer periods.
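To see why an expiry range helps, consider a minimal userspace sketch; it is not the kernel's timer code, and the range_timer structure and count_wakeups() function are hypothetical names invented for this illustration. Each timer may fire at any time between its soft and hard expiry, and the model assumes that a wakeup scheduled at the earliest pending hard expiry also services every timer whose soft expiry has already passed.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical userspace model of a range timer: the timeout may be
 * delivered at any time in [soft, hard].
 */
struct range_timer {
	long soft;	/* earliest acceptable expiry */
	long hard;	/* latest acceptable expiry */
};

/* qsort() comparator: order timers by hard expiry, earliest first. */
static int cmp_hard(const void *a, const void *b)
{
	const struct range_timer *x = a, *y = b;

	return (x->hard > y->hard) - (x->hard < y->hard);
}

/*
 * Count the wakeups needed to service all n timers. A wakeup is placed
 * at the hard expiry of the earliest timer not yet serviced; every
 * timer whose soft expiry is at or before that moment rides along.
 * The array is sorted in place.
 */
size_t count_wakeups(struct range_timer *t, size_t n)
{
	size_t i, wakeups = 0;
	long last_wakeup = 0;
	int have_wakeup = 0;

	for (i = 0; i < n; i++)
		assert(t[i].soft <= t[i].hard);

	qsort(t, n, sizeof(*t), cmp_hard);

	for (i = 0; i < n; i++) {
		/* Already covered by the previous wakeup? */
		if (have_wakeup && t[i].soft <= last_wakeup)
			continue;
		last_wakeup = t[i].hard;
		have_wakeup = 1;
		wakeups++;
	}
	return wakeups;
}
```

In this model, three timers with exact expiries at 10, 12, and 15 need three wakeups; widening them to the ranges [10,15], [12,16], and [15,20] lets a single wakeup at time 15 serve all three.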
This time around, linux-next maintainer Stephen Rothwell has put together
a list of linux-next patches
which did not get into 2.6.28. Perhaps the biggest omission was the credentials work, which seemed
poised to go in this time around. Other changes which failed to get merged
include the message catalog
code (which looks like it will need a change of approach) and TOMOYO Linux (which seems to be caught
up in the same old "new security module with pathname-based rules" swamp).
Now the stabilization period starts. Linus, perhaps, was trying to set the
tone for this development cycle when he released a much smaller and earlier
2.6.28-rc2 than would have
normally been expected. By way of comparison: 2.6.25-rc2 had 359 patches
applied since 2.6.25-rc1. For 2.6.26-rc2, 446 changesets were merged, and,
for 2.6.27-rc2, the count was 780. For 2.6.28-rc2, instead, a total of 22
changes went in. Says Linus:
And hey, maybe we can even _continue_ the nice model of "just small
fixes after -rc1". I know, it sounds insane, but it's a real
pleasure to do an -rc2 with just a handful of fixes for real
problems that real people see. What a concept!
Should this pattern hold, it may well be that 2.6.28 will stabilize more
quickly and successfully than its predecessors. It will, in any case, be
interesting to watch.
By Jonathan Corbet
October 29, 2008
Kernel developers tend to have a mixed view of benchmarks.
A benchmarking tool can do an effective job of quantifying specific aspects
of system performance. But benchmarks are not real workloads; optimizing
for a benchmark can often distort a system in ways which are detrimental to
real applications. Since kernel hackers do not always see benchmark optimization
as their top priority, they can sometimes assign a lower priority to
benchmark regressions as well. But, sometimes, benchmark problems indicate
a real problem in the kernel.
The tbench benchmark is meant to measure networking performance; it
consists of a collection of processes quickly making lots of small requests
from a server process. Since the requests are small, there is not much
time spent actually moving data; it's all a matter of shifting small
packets around - and scheduling between the processes. Back in August,
Christoph Lameter reported
that tbench performance in the mainline kernel had been declining for some
time. His system was able to move 3208 MB/sec with a 2.6.22 kernel,
but only 2571 MB/sec with a 2.6.27-rc kernel. Each of the releases in
between showed a decline from the one which came before, with 2.6.25
showing an especially big hit. Others were able to reproduce the results,
and they engaged in various rounds of speculation on where the problem
might be, but it seems that, initially, nobody actually dug into the
system to see what was going on.
At linux.conf.au 2007, Andi Kleen gave a talk describing various types of
kernel hackers. One of those was the "Russian mathematician" who, he
suspected, was often a room full of talented developers operating under a
single name. Evgeniy Polyakov can only have reinforced that view when, in
early October, he tracked down the biggest
offending commit through a process which, he says, involved "just [a]
couple of hundreds of compilations." In the process, he put together a plot of tbench performance
which, he says, is suitable for scaring children. Through a massive amount
of work, he was able to point the finger at a scheduler patch - not
something in the networking stack at all.
In particular, Evgeniy found that the patch adding high-resolution
preemption ticks was the problem. The idea behind this patch was to make
time slices more accurate by scheduling preemption at just the right time.
It makes sense; once the regular clock tick has been eliminated, there is
no reason not to arrange for preemption to happen when the scheduling
algorithm says it should. Unfortunately, it seems that this change also
adds sufficient overhead to slow down tbench performance considerably; when
Evgeniy backed it out, his performance went from 373 MB/sec to
455 MB/sec. That would seem to be a pretty clear indication that
something is amiss with high-resolution preemption ticks.
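For a sense of scale, the figures quoted above work out to roughly a 20% drop between 2.6.22 and 2.6.27-rc on Christoph's system, and roughly a 22% gain on Evgeniy's system from backing out the patch. The helper below shows the arithmetic; percent_change() is a hypothetical name used only for this illustration and is not part of tbench.

```c
#include <assert.h>

/*
 * Relative change from "before" to "after", in percent.
 * Negative values indicate a regression in throughput.
 */
double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

Calling percent_change(3208, 2571) gives a value just under -20, while percent_change(373, 455) gives a value just under 22.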
At this point, the public discussion went quiet, though it appears that a number
of developers were working on it off-list. David Miller eventually tracked
down the worst of the trouble to the wakeup code, something he was rather vocally unhappy about having had to
do. Eventually a patch was merged (for 2.6.28-rc2) disabling the
high-resolution preemption tick feature. Since that discussion was private,
it's not quite clear why this change took as long as it did. But there are a
couple of plausible reasons. One is that this particular feature is
disabled by default anyway, so most users will not encounter the
performance problem it creates.
But there is also the question of weighing the benchmark result against the
effects on other, "real" workloads. Ingo Molnar said:
But it's a difficult call with no silver bullets. On one hand we
have folks putting more and more stuff into the context-switching
hotpath on the (mostly valid) point that the scheduler is a
slowpath compared to most other things. On the other hand we've got
folks doing high-context-switch ratio benchmarks and complaining
about the overhead whenever something goes in that improves the
quality of scheduling of a workload that does not context-switch as
massively as tbench. It's a difficult balance and we cannot satisfy
both camps.
So, by this view, performance on scheduler-intensive benchmarks must be
weighed against the wider value of other scheduler enhancements. David
Miller has a different view of the
situation, though:
If we now think it's ok that picking which task to run is more
expensive than writing 64 bytes over a TCP socket and then blocking
on a read, I'd like to stop using Linux. :-) That's "real work" and
if the scheduler is more expensive than "real work" we lose.
In David's view, scheduler performance has been getting consistently worse
since the switch to the completely fair scheduler in 2.6.23. He would like
to see some energy put into recovering some of the performance of the
pre-CFS scheduler; in particular, he thinks that Ingo and company should
work to fix (what he sees as) a regression that they caused.
For the time being, the worst performance regression has been "fixed" by
disabling the high-resolution preemption tick feature; Ingo says that the
feature will not come back until it can be supported without slowing
things down. But the scheduler seems to have gotten slower in a number of
other ways as well. Your editor will make a prediction here: now that the
issue has been called out in such clear terms, somebody will find the time
to fix these problems to the point that the CFS scheduler will be faster
than the O(1) scheduler which preceded it.
Beyond that, there are suggestions that the
scheduler cannot take the blame for all of the observed regressions in
tbench results. So developers will have to look at the rest of the system
to figure out what's going on. The good news is that this is a clear
challenge with an
objective way to measure success. Once a problem reaches that level of
clarity, it's usually just a matter of some hacking.
By Jake Edge
October 29, 2008
The Squashfs compressed filesystem is used in everything from Live CDs to
embedded devices. Many distributions, perhaps most, ship it for those uses,
but Squashfs has been maintained outside of the mainline kernel for years.
That appears to be changing, as it was recently submitted for inclusion in
the mainline by Phillip Lougher. The reaction has been generally favorable,
with Andrew Morton requesting that Lougher move it forward:
"Please prepare a tree for linux-next
inclusion and unless serious problems are pointed out I'd suggest
shooting for a 2.6.29 merge."
So it seems like a good time to take a look at some of the
features and capabilities of Squashfs.
The basic idea behind Squashfs is to generate a compressed image of a
filesystem or directory hierarchy that can be mounted as a read-only
filesystem. This can be done to archive a set of directories or to store
them on a smaller capacity device than would normally be required. The
latter is used by both Live CDs and embedded devices to squeeze more into
less.
It has been nearly four years since Squashfs was last submitted to linux-kernel.
Since that time, it has been almost completely rewritten based on
comments from that attempt. In addition, it has gone through two filesystem
layout revisions in part to allow for 64-bit sizes for files and
filesystems. Another major change is to make the filesystem little-endian,
so that it can be read on any architecture, regardless of endian-ness.
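The usual way to read a fixed little-endian layout portably is to assemble each value from individual bytes rather than casting the buffer to a native integer. The sketch below illustrates the idea in userspace C; read_le32() is a hypothetical helper and not Squashfs code (kernel code would normally use the le32_to_cpu() family instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a 32-bit little-endian value from a byte buffer. Because the
 * value is built from shifts on individual bytes, the result is the
 * same on big-endian and little-endian hosts.
 */
uint32_t read_le32(const unsigned char *p)
{
	assert(p != NULL);
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}
```

Given the on-disk bytes 0x78 0x56 0x34 0x12, this function returns 0x12345678 on any host.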
The mksquashfs utility is used to create the image, which can then
be mounted either via loopback (from a file) or from a regular block device.
One of the features added since the original attempt to mainline
Squashfs—to address complaints made at that time—is the ability
to export a Squashfs filesystem via NFS.
Squashfs uses gzip compression on filesystem data and metadata, achieving
sizes roughly one-third that of an ext3 filesystem with the same data. The
performance
is quite good as well, even when compared with the simpler cramfs—a
compressed read-only filesystem already available with the kernel.
According to Lougher, these performance numbers were gathered a number of
years ago, with older versions of the code; newer numbers should be even
better.
Previously, some kernel developers were resistant to adding another
compressed filesystem to the kernel, so Lougher outlines a number of
reasons that Squashfs is superior to cramfs. Certainly support for larger
files and filesystems is compelling, but the fact that cramfs is orphaned
and unmaintained will likely also play a role. In addition, Squashfs
supports many more "normal" Linux filesystem features like real inode
numbers, hard links, and exportability.
Morton had a laundry list of suggestions for making Squashfs better
in the email referenced above; documentation is one of the areas he found
lacking. In particular, Squashfs maintains its own
cache, which puzzles Morton:
Why not just decompress these blocks into pagecache
and let the VFS handle the caching??
The real bug here is that this rather obvious question wasn't
answered anywhere in the patch submission (afaict). How to fix that?
Methinks we need a squashfs.txt which covers these things.
One of the reasons that Squashfs doesn't use the page cache is that it
allows for multiple block sizes, from 4K up to 1M, with a default of 128K.
Better compression ratios can be achieved with a larger block size, but that
doesn't work well with the page cache as Jörn Engel
notes: "One of the problems seems to
be that your blocksize
can exceed page size and there really isn't any infrastructure to deal
with such cases yet."
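The mismatch is easy to quantify. Assuming a 4096-byte page (common, but architecture-dependent), the sketch below computes how many pages a single Squashfs block spans. The function and constant names are hypothetical, and the limits simply restate the 4K-to-1M range given above.

```c
#include <assert.h>

/* Block size limits as described in the text (hypothetical names). */
#define SKETCH_MIN_BLOCK	(4UL * 1024)
#define SKETCH_MAX_BLOCK	(1024UL * 1024)

/*
 * Return the number of pages needed to hold one decompressed block,
 * rounding up, or 0 if block_size is outside the supported range.
 */
unsigned long pages_per_block(unsigned long block_size,
			      unsigned long page_size)
{
	assert(page_size != 0);
	if (block_size < SKETCH_MIN_BLOCK || block_size > SKETCH_MAX_BLOCK)
		return 0;
	return (block_size + page_size - 1) / page_size;
}
```

With 4096-byte pages, the default 128K block covers 32 pages and a 1M block covers 256, so decompressing one block produces data for many pages at once.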
Lougher has moved the code into a git
repository, presumably in preparation to get it into linux-next. He
notes that the CE Linux Forum has
been instrumental in providing funding over the last four months to allow
him to work on getting Squashfs into the mainline. With the additional
testing that will come from being included in linux-next, it seems quite
possible we could see Squashfs in 2.6.29.
Patches and updates
Core kernel code
- Manfred Spraul: rcu-state.
(October 28, 2008)
Page editor: Jonathan Corbet