Brief items
The current 2.6 development kernel is 2.6.28-rc5,
released on November 15.
It contains the usual pile of fixes; see
the
long-format changelog for the details.
The current stable 2.6 kernel is 2.6.27.6, released on November 13.
It includes a fair number of fixes, one of which has a CVE number
attached. As of this writing, 46 patches are under review for inclusion in 2.6.27.7, which will likely be released soon.
Comments (none posted)
Kernel development news
That GLOBAL_EXTERN thing should be held on the ground whilst
farm animals poop on its head, but my attempts to remove it have thus
far fallen on deaf inboxes.
--
Andrew Morton
Your patch is still adding bells and whistles to a useless turd. In
fact this patch is worse. Without this patch the turd can be
disabled and left out, with your patch everyone now has to compile
in said turd pile.
--
Alan Cox joining the scatological mood
Comments (none posted)
The Linux Foundation has posted
a set of photos from
the 2008 Kernel Summit. If these pictures are to be believed, the
Summit involved a lot of time spent consuming alcoholic beverages. But it
was a more serious event than that, honest.
Comments (4 posted)
Arjan van de Ven reports that
kerneloops.org has recorded oops #100,000, just shy of its first birthday. The site gathers the output of kernel oops messages, which are the crash signatures from the kernel. The intent is to find out which are the most common in order to find and fix the underlying bugs. "
Other than the top 2 items, which have patches, we've done a pretty good job of fixing
the high occurrence bugs (excluding the binary drivers which we obviously cannot fix)". Click below for his full report.
Full Story (comments: 20)
By Jake Edge
November 19, 2008
PCI Express (PCIe) is not normally thought of as a way to connect
computers; it is, instead, a bus for attaching peripherals. But there are
advantages to using it as an interconnect. Kernel hacker Arnd Bergmann gave a
presentation at the recent UKUUG Linux 2008
conference on work he has been doing on using PCIe for IBM. He
outlined the current state of Linux support as well as some plans for the
future.
The availability of PCIe endpoints for much of the hardware in use today is
one major advantage. By using PCIe, instead of other interconnects such as
InfiniBand, the same
throughput can be achieved with lower latency and
power consumption. Bergmann noted that avoiding a separate
InfiniBand chip saves 10-30 watts per node, which adds up rather quickly on a
30,000-node supercomputer: somewhere between 300 and 900 kilowatts.
There are some downsides to PCIe as well. There is no security model, for
example, so a root process on one machine can crash other connected machines.
There is also a single point of failure: if the PCIe root port goes
down, it takes the network with it or, as Bergmann put it: "if
anything goes wrong, the whole system goes down". PCIe lacks a
standard high-level interface for Linux and there is no generic code shared
between the various drivers—at least so far.
As an example of a system that uses PCIe, Bergmann described the
"Roadrunner" supercomputer that is currently the fastest in existence. It
is a cluster of hybrid nodes, called "Triblades", each of which has one
Opteron blade along
with two Cell blades. The nodes are connected with
InfiniBand, but PCIe is used to communicate between the processors within
each node by using the Opteron root port and PCIe endpoints on the Cells.
There is other hardware that uses PCIe in this way, including the Fixstars
GigaAccel 180 accelerator board and an embedded PowerPC 440/460
system-on-a-chip (SoC) board, both of which use the same Axon PCIe device.
Bergmann also talked about PCIe switches and non-transparent bridges that
perform the same
kinds of functions as networking switches and bridges. Bridges are called
"non-transparent" because they have I/O remapping tables—sometimes
IOMMUs—that can be addressed by the two root ports that are connected via
the bridge. These bridges may also have DMA engines to facilitate data transfer
without host processor control.
Bergmann then moved on to the software side of things, looking at the
drivers available—and planned—to support connection via PCIe.
The first driver was written by Mercury Computers in 2006 for a Cell
accelerator board and is now "abandonware". It has many deficiencies and
would take a lot of work to get it into shape for the mainline.
Another choice is the driver used in the Roadrunner Triblade and the
GigaAccel device, which is vaguely modeled on InfiniBand. It has an
interface that uses custom ioctl() commands that implement just
eight operations, as opposed to hundreds for InfiniBand. It is
"enormous for a Linux device driver", weighing in at 13,000
lines of code.
The Triblade driver is not as portable as it could be, as it is very
specific to the Opteron and Cell architectures. On the Cell side, it is
implemented as an Open Firmware driver, but the Opteron side is a PCIe
driver. There is a lot of virtual ethernet code mixed in as well.
Overall, it is not seen as the best way forward to support these kinds of
devices in Linux.
Another approach was taken by a group of students sponsored by IBM who
developed a virtual ethernet prototype to talk to an IBM BladeCenter from a
workstation by way of a non-transparent bridge. Each side could access
memory on the other by using ioremap() on one side and
dma_map_single() on the other. By implementing a virtio driver,
they did not have to write an ethernet driver, as the virtio abstraction
provided that functionality. The driver was a bit slow, as it didn't use
DMA, but it is a start down the road that Bergmann thinks should be taken.
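To get a feel for the mechanism, here is a minimal sketch (in no way the
students' actual code) of how the two sides of a non-transparent bridge
might see each other's memory; the BAR number, buffer size, and function
names are made up for illustration:

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>
    #include <linux/slab.h>
    #include <linux/io.h>

    #define NTB_WINDOW_BAR  2       /* hypothetical BAR exposing the remote side */
    #define NTB_BUF_SIZE    4096

    static void __iomem *remote_window;    /* root-port side: mapped remote memory */
    static void *local_buf;                /* endpoint side: buffer exposed over PCIe */
    static dma_addr_t local_buf_dma;

    /* Root-port side: map the bridge window with ioremap(). */
    static int map_remote_side(struct pci_dev *pdev)
    {
            resource_size_t start = pci_resource_start(pdev, NTB_WINDOW_BAR);

            remote_window = ioremap(start, NTB_BUF_SIZE);
            return remote_window ? 0 : -ENOMEM;
    }

    /* Endpoint side: hand a local buffer to the device with dma_map_single(). */
    static int map_local_side(struct device *dev)
    {
            local_buf = kmalloc(NTB_BUF_SIZE, GFP_KERNEL);
            if (!local_buf)
                    return -ENOMEM;

            local_buf_dma = dma_map_single(dev, local_buf, NTB_BUF_SIZE,
                                           DMA_BIDIRECTIONAL);
            if (dma_mapping_error(dev, local_buf_dma)) {
                    kfree(local_buf);
                    return -ENOMEM;
            }
            /* local_buf_dma would then be programmed into the bridge's
               translation table so the other side can reach this buffer. */
            return 0;
    }

Once both mappings exist, either side can read and write the shared buffer
directly, which is what the virtio layer described below builds upon.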
He went on to describe a "conceptual driver" for PCIe endpoints that is
based on the students' work but adds on things like DMA as well as
additional virtio drivers. Adding a virtio block device would allow
embedded devices to use hard disks over PCIe or, by implementing a Plan 9
filesystem (9pfs) virtio driver, individual files could be used directly
over PCIe. All of this depends on using the virtio abstraction.
Virtio is seen as a useful layer in the driver because it is a standard
abstraction for "doing something when you aren't limited by
hardware". Networking, block device, and filesystem "hosts" are all
implemented atop virtio drivers, which makes them available fairly easily.
One problem area, though, is the runtime configuration piece. The problem
there is "not in coming up with something that works, but something that
will also work in the future".
Replacing the ioctl() interface with the InfiniBand verbs (ibverb)
interface is planned. The ibverb interface may not be the best choice in
an abstract sense, but it exists and supports OpenMPI, so the new driver
should implement it as well.
Two types of virtqueue implementations are envisioned: one based on
memory-mapped I/O (MMIO) and the other based on DMA. The MMIO version
would be the most basic virtqueue implementation, with a local read of a
remote write. Read access on PCIe is much slower than write because a read
must flush all writes and then wait for the data to arrive. Data and
signaling information would have separate areas so that ordering guarantees
could be relaxed on the data area for better performance, while strict
ordering would be enforced for the signaling area.
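A sketch of what the MMIO path might look like, assuming a mapped window
with made-up offsets for the data and signaling regions:

    #include <linux/io.h>

    #define VQ_DATA_OFFSET       0x0000   /* relaxed-ordering payload region */
    #define VQ_DOORBELL_OFFSET   0x1000   /* strictly ordered signaling region */

    static void mmio_vq_send(void __iomem *window, const void *buf,
                             size_t len, u32 seq)
    {
            /* Push the payload across PCIe; ordering within this region
               can be relaxed for throughput. */
            memcpy_toio(window + VQ_DATA_OFFSET, buf, len);

            /* Make sure all data writes are visible before signaling. */
            wmb();

            /* The remote side does a local read of this remote write
               (by polling or on an interrupt) to learn that data arrived. */
            iowrite32(seq, window + VQ_DOORBELL_OFFSET);
    }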
The DMA engine virtqueue implementation would be highly hardware-specific
to incorporate performance and other limitations of the underlying engine.
In some cases, for example, it is not worth setting up a DMA for transfers
of less than 2K, so copying via MMIO should be used instead. DMA would be
used for transferring payload data, but signaling would still be handled
via MMIO. Bergmann noted that the kernel DMA abstraction may not provide
all that is needed, so enhancements to that interface may be required as
well.
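Building on the mmio_vq_send() sketch above, the size cutoff might look
something like this; pcie_vq_dma_xfer() is a hypothetical wrapper around
whatever DMA engine the hardware provides:

    #define DMA_CUTOFF      2048    /* below this, DMA setup costs more than it saves */

    /* hypothetical: program the DMA engine, ring the doorbell on completion */
    int pcie_vq_dma_xfer(dma_addr_t src, dma_addr_t dst, size_t len, u32 seq);

    static int vq_transfer(void __iomem *window, dma_addr_t remote,
                           const void *buf, dma_addr_t buf_dma,
                           size_t len, u32 seq)
    {
            if (len < DMA_CUTOFF) {
                    mmio_vq_send(window, buf, len, seq);    /* small: just copy */
                    return 0;
            }

            /* Large transfer: let the DMA engine move the payload; the
               completion signal still goes through the MMIO doorbell. */
            return pcie_vq_dma_xfer(buf_dma, remote, len, seq);
    }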
Bergmann did not provide any kind of time frame in which this work might
make its way into the kernel as it is a work in progress. There is much
still to be done, but his presentation laid out a roadmap of where he
thinks it is headed.
In a post-talk email exchange, Bergmann points to his triblade-2.6.27
branch for those interested in looking at the current state of affairs, while noting that it "is only mildly related to what I think
we should be
doing". He also mentioned a patch by Ira Snyder that
implements virtual ethernet over PCI, which "is more
likely to go into the kernel in the near future". Bergmann
and Snyder have agreed to join forces down the road to add more
functionality along the lines that were outlined in the talk.
Comments (5 posted)
By Jonathan Corbet
November 19, 2008
LWN has previously
covered
concerns over slowly deteriorating performance by current Linux systems on
the network- and scheduler-heavy tbench benchmark. Tbench runs have been
getting worse since roughly 2.6.22. At the end of the last episode,
attention had been directed toward the CFS scheduler as the presumptive
culprit. That article concluded with the suggestion that, now that
attention had been focused on the scheduler's role in the tbench
performance regression, fixes would be relatively quick in coming. One
month later, it
would appear that those fixes have indeed come, and that developers looking
for better tbench results will need to cast their gaze beyond the
scheduler.
The discussion resumed after a routine weekly posting of the post-2.6.26
regression list; one entry in that list is
the tbench performance issue. Ingo Molnar responded to that posting with a pointer to an
extensive set of benchmark runs done by Mike Galbraith. The conclusion
Ingo draws from all those runs is that the CFS scheduler is now faster than
the old O(1) scheduler, and that "all scheduler components of this
regression have been eliminated." Beyond that:
In fact his numbers show that scheduler speedups since 2.6.22 have
offset and hidden most other sources of tbench
regression. (i.e. the scheduler portion got 5% faster, hence it was
able to offset a slowdown of 5% in other areas of the kernel that
tbench triggers)
This improvement is not something that just happened; it is the result of a
focused effort on the part of the scheduler developers. Quite a few
changes have been merged; they all seem like small tweaks, but, together,
they add up to substantial improvements in scheduler performance.
One
change fixes a spot where the scheduler code disabled interrupts
needlessly. Some others (here
and here)
adjust the scheduler's "wakeup buddy" mechanism, a feature which ties
processes together in the scheduler's view. As an example, consider a
process which wakes up a second process, then runs out of its allocated
time on the CPU. The wakeup buddy system will cause the scheduler to bias
its selection mechanism to favor the just-awakened process, on the theory that
said process will be consuming cache-warm data created by the waking
process. By allowing cooperating processes like this to run slightly ahead
of what a strictly fair scheduling algorithm would provide, the scheduler
gets better performance out of the system as a whole.
The recent changes add a "backward buddy" concept. If there is no recently-waked
process to switch to, the scheduler will, instead, bias the selection
toward the process which was preempted to enable the outgoing process to
run. Chances are relatively good that the preempted process might
(1) be cooperating with the outgoing process or (2) have some
data still in cache - or both. So running that process next is likely to
yield better performance overall.
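In rough terms, the pick-next logic with both buddies looks like the sketch
below. It is modeled loosely on kernel/sched_fair.c but is not a literal
copy; the helper functions are invented names standing in for the
scheduler's fairness check and red-black-tree lookup.

    struct sched_entity;

    struct cfs_rq_sketch {
            struct sched_entity *next;      /* wakeup buddy: just-awakened task */
            struct sched_entity *last;      /* backward buddy: preempted task */
    };

    /* nonzero if running 'buddy' instead of the fair choice would be too unfair */
    int exceeds_fairness_limit(struct sched_entity *buddy, struct sched_entity *fair);
    /* leftmost entity in the runqueue's tree: the strictly fair choice */
    struct sched_entity *leftmost_entity(struct cfs_rq_sketch *rq);

    static struct sched_entity *pick_next(struct cfs_rq_sketch *rq)
    {
            struct sched_entity *se = leftmost_entity(rq);

            /* Prefer the task this one just woke: it will use cache-warm data. */
            if (rq->next && !exceeds_fairness_limit(rq->next, se))
                    return rq->next;

            /* Otherwise fall back to the task that was preempted earlier. */
            if (rq->last && !exceeds_fairness_limit(rq->last, se))
                    return rq->last;

            return se;      /* the strictly fair choice */
    }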
A number of other small changes have been merged, to the point that the
scheduler developers think that the tbench regressions are no longer their
problem. Networking maintainer David Miller has disagreed with this assessment, though,
claiming that performance problems still exist in the scheduler. Ingo responded
in a couple of ways, starting with the posting of some profiling results which show very little
scheduler overhead. Interestingly, it turns out that the networking
developers get different results from their profiling runs than the
scheduler developers do. And that, in turn, is a result of the different
hardware that they are using for their work. Ingo has a bleeding-edge
Intel processor to play with; the networking folks have processors which
are not quite so new. David Miller tends to run on SPARC processors, which
may be adding unique problems of their own.
The other thing Ingo did was, for all practical purposes, to profile the
entire kernel code path involved in a tbench run, then to disassemble
the executable and examine the profile results on a per-instruction basis.
The postings that resulted (example) point
out a number of potential problem spots, most of which are in the
networking code. Some of those have already been fixed, while others are
being disputed. It is, in the end, a large amount of raw data which is
likely to inspire discussion for a while.
To an outsider, this whole affair can have the look of an ongoing
finger-pointing exercise. And, perhaps, that's what it is. But it's
highly-technical finger-pointing which has increased the understanding of
how the kernel responds to a specific type of stress while also
demonstrating the limits of some of our measurement tools and the
performance differences exhibited by various types of hardware. The end
result will be a faster, more tightly-tuned kernel - and better tbench
numbers too.
Comments (11 posted)
By Jake Edge
November 19, 2008
Arnd Bergmann pulled double duty at the recent UKUUG Linux 2008
conference by giving a talk on each day of the event. His talk on
Saturday, entitled "Porting Linux to a new architecture, the right way",
looked at various problems with recent architecture ports along with a
project he has been working on to simplify that process. By creating a
generic template for architectures, some of the mistakes of the past can be
avoided.
This is one of Bergmann's pet projects, that "I like to do for fun,
when I am hacking on the kernel, but not for IBM". The project and
talk were inspired by a few new architectures that were merged—or
were submitted for merging—in the
last few years. In particular, the Blackfin and MicroBlaze architectures
were inspiring, with the latter architecture still not merged, perhaps due
to Bergmann's comments. He is hoping to help that situation get better.
The biggest problem with architecture ports tends to be code duplication
because people start by copying all of the files from an existing
architecture. In addition, "most people who don't know what they are
doing copy from x86, which in my opinion is a big mistake".
According to Bergmann, architecture porters seem to "first copy the
header files and then change the whitespace", which makes it
difficult to immediately spot duplicated code.
He points to termbits.h as an example of an include file that is
duplicated in multiple architectures unnecessarily as the code is the same
in most cases. He also notes there is "incorrect code
duplication", pointing to new architectures that implement the
sys_ipc() system call, resulting in "brand new architectures
supporting a broken interface for x86 UNIX from the 80s". That call
is a de-multiplexer for System V IPC calls that has the
comment—dutifully duplicated into other architectures—"This is
really
horribly ugly".
Then there are problems with "code duplication by clueless
people" which
includes a sembuf.h implementation that puts the padding in the
wrong place because of 64 vs. 32-bit confusion. In addition, because
code is duplicated in multiple
locations, bug fixes that are made for one architecture don't propagate to
all the places that need the fix. As an example he noted a bug fix made by
Sparc maintainer David Miller in the x86 tree that didn't make it into the
Sparc tree. Finally, there are ABIs that are being needlessly propagated
in new architecture ports: system calls that are now implemented in terms
of newer calls are still present in new ports, even though the compatibility
could be handled entirely in libc.
The "obvious" solution is to create a generic architecture implementation
that can be
used as a starting point for new ports. Bergmann has been working on that,
resulting in a 3000-line patch that "should make it very easy for
people to port to new architectures". To start with, it defines a
canonical ABI that is a list of all of the system calls that need to be
implemented for a new architecture. It puts all of the required include
files into the asm-generic directory that new ports can just
include—or copy if they need to modify them.
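In practice that means most of a new port's asm/ headers can shrink to
one-line wrappers. As a sketch (with "newarch" standing in for the new
architecture's name), the port's termbits.h becomes simply:

    /* arch/newarch/include/asm/termbits.h */
    #include <asm-generic/termbits.h>

Only the headers that genuinely differ from the generic version need to be
copied and edited.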
Unfortunately, things are not quite that simple, of course; there are a number
of problem areas. There are "lots of things you simply cannot do in
a generic way". Most of these things are fairly hardware-specific
areas like MMU support, atomics, interrupts, task switching, byte order,
signal contexts, hardware probing and the like.
Bergmann decided to go ahead by defining away some of these problems in
his example architecture. So there is no SMP or MMU support, with the
asm-generic/atomic.h and asm-generic/mmu_context.h
include files modified accordingly. Many of the
architecture-specific functions have been stubbed out in
arch/example/kernel/dummy.c so that he can compile the template
architecture.
The example architecture uses an Open Firmware device tree to
describe the hardware that is available at boot time. Open Firmware
"is a bit like what you have with the new Intel EFI firmware, but
it's a lot nicer". A flattened device tree data structure is passed
to the kernel at boot time by the bootloader, so Bergmann will be able to
move on to the next step: making it boot.
As one might guess, there is still more work to be done.
There are eight header files that are needed from the
asm-example directory, but Bergmann hopes to reduce that some. He
notes that there are other architecture-specific areas that need work. For
example,
every single architecture has its own implementation of TCP
checksums in assembly language, which may not be optimal.
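For reference, the generic C fallback for such a checksum is short; the
sketch below shows the core of an Internet-checksum fold (simplified to
assume an even-length, aligned buffer), roughly what a lib/checksum.c-style
implementation provides when no assembly version exists.

    #include <linux/types.h>

    static __u16 csum_fold_sketch(const __u16 *buf, int len_in_words)
    {
            __u32 sum = 0;
            int i;

            for (i = 0; i < len_in_words; i++)
                    sum += buf[i];

            /* fold the carries back into 16 bits (ones'-complement add) */
            while (sum >> 16)
                    sum = (sum & 0xffff) + (sum >> 16);

            return (__u16)~sum;
    }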
Bergmann pointed attendees at the ukuug2008 branch of his kernel.org
playground git tree (git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git)
to see the current state of his example architecture. It looks to be a
nice addition to the kernel that will likely result in better architecture
ports down the road.
Comments (3 posted)