The current development kernel is 2.6.36-rc7, released
on October 6. "This
should be the last -rc, I'm not seeing any reason to keep delaying a real
release. There was still more changes to drivers/gpu/drm than I really
would have hoped for, but they all look harmless and good. Famous last
words." The short-form changelog is in the announcement; kernel.org
Stable updates: 18.104.22.168, containing a single
fix for a typo in the Xen code, was released on October 1. As of this
writing, there are no stable updates in the review process.
As a general rule, if a reviewer's comment doesn't result in a code
change then it should result in a changelog fix or a code comment.
Because if the code wasn't clear enough to the reviewer then it
won't be clear enough to later readers.
-- Andrew Morton
AMD's reference BIOS code had a bug that could result in the
firmware failing to reenable the iommu on resume. It transpires
that this causes certain less than desirable behaviour when it
comes to PCI accesses, to whit them ending up somewhere near
Bristol when the more desirable outcome was Edinburgh. Sadness
ensues, perhaps along with filesystem corruption. Let's make sure
that it gets turned back on, and that we restore its configuration
so decisions it makes bear some resemblance to those made by
reasonable people rather than crack-addled lemurs who spent all
your DMA on Thunderbird.
-- Matthew Garrett
The PowerPC architecture is normally thought of as a big-endian domain -
the most significant byte of multi-byte values comes first. Big-endian is
consistent with a number of other architectures, but the fact that one
obscure architecture - x86 - is little-endian means that the world as a
whole tends toward the little-endian persuasion. As it happens, at least
some PowerPC processors can optionally be run in a little-endian mode. Ian
Munsie has posted a patch set
which enables Linux to take advantage of that feature and run little-endian
on suitably-equipped PowerPC processors.
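
The difference between the two byte orders can be seen in a short sketch. It is not taken from the patch set; the value 0x12345678 is an arbitrary example, and Python's struct module is used only to make the in-memory layout visible.

```python
import struct

value = 0x12345678  # arbitrary 32-bit example value

# ">" asks for big-endian layout: most significant byte first.
big = struct.pack(">I", value)
# "<" asks for little-endian layout: least significant byte first.
little = struct.pack("<I", value)

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Interpreting big-endian bytes as little-endian yields a different number,
# which is the kind of mismatch that code sharing data with a device of the
# other byte order has to guard against.
misread = struct.unpack("<I", big)[0]
print(hex(misread))  # 0x78563412
```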
The first question that came to the mind of a few reviewers was: "why?"
PowerPC runs fine as a big-endian architecture, and there has been little
clamor for little-endian support. Besides, endianness seems to be one of
those things that users can feel strongly about; to at least some PowerPC
users, little-endian apparently feels cheap, wrong, and PCish.
The answer, as expressed by Ben
Herrenschmidt, appears to be graphics hardware. A number of GPUs,
especially those aimed at embedded applications, only work in the
little-endian mode. Carefully-written device drivers can work around that
sort of limitation without too much trouble, but user-space code - which
often ends up talking to graphics hardware - is another story. Fixing all
of that code is not a task that anybody wants to take on. As a result,
PowerPC processors will not be considered for situations where
little-endian support is needed. Running the processor in little-endian
mode will nicely overcome that obstacle.
That said, it will take a little while before this support is generally
available. The kernel patches apparently look good, but there are
toolchain changes required which are not, yet, generally available. Until
that little issue is resolved, PowerPC will remain a club for big-endian
Kernel development news
The Trusted Platform Module (TPM) present on many of today's systems can be
used in various ways, from making completely locked-down systems that
cannot be changed by users to protecting sensitive systems from various
kinds of attacks. While the TPM-using integrity measurement architecture
(IMA), which can
measure and attest to the integrity of a running Linux system, has
been part of the kernel for some time now, the related extended verification module
(EVM) has not made it into the mainline. One of the concerns raised about
EVM was that it obtained a cryptographic key from user space that is then used
as a key for integrity verification—largely nullifying the
integrity guarantees that EVM is
supposed to provide.
A set of
patches that were recently posted for comments to the linux-security-module
mailing list would add two new key types to the kernel that would allow
user space to provide the key without being able to see the actual
We last looked in on
EVM back in June when it seemed like it might make it into 2.6.36.
That didn't happen, nor has EVM been incorporated into linux-next, so its path
into the mainline is a bit unclear at this point. EVM calculates HMAC (hash-based message authentication
code) values for on-disk files, uses the EVM key and TPM to sign the
values, and stores
them in extended attributes (xattrs) in the security namespace.
If the EVM key is subverted, all bets are off in terms of the integrity of
While they are targeted
for use by EVM, Mimi Zohar's patches to add
trusted and encrypted key types could also
be used for other purposes such as handling the keys for filesystem encryption.
The basic idea is that these keys would be generated by the kernel, and would
never be touched by user space in an unencrypted form. Encrypted "blobs"
would be provided to user space by the kernel and would contain the key
material. User space could store the keys, for example, but the blobs would
be opaque to anything outside of the kernel. The patches come with two new
flavors of these in-kernel keys: trusted and encrypted.
Trusted keys are generated by the TPM and then encrypted using the TPM's
storage root key (SRK), which is a 2048-bit RSA key (this is known as
sealing the key in TPM terminology). Furthermore, trusted keys can also be sealed to a
particular set of TPM platform configuration register (PCR) values so that the
keys cannot be unsealed unless the PCR values match. The PCR
contains an integrity measurement of the system BIOS, bootloader, and
operating system, so tying keys to PCR values means that the trusted keys
cannot be accessed except from those systems for which they were specifically
authorized. Any change to the underlying code will result in undecryptable
Since the PCR values change based on the kernel and initramfs used,
trusted keys can be updated to use different PCRs, once they have been
added to a keyring (so that the existing PCR values have been verified).
There can also be
multiple versions of a single trusted key, each of which is sealed to
different PCR values. This can be used to support booting multiple kernels
that use the same key. While the underlying, unencrypted key data will not need
to change for
different kernels, the user-space blob will change because of the
PCR values, which will require some kind of key management in user space.
Encrypted keys, on the other hand, do not rely on the TPM, and use the
kernel's AES encryption
instead, which is faster than the TPM's public key encryption. Keys are
generated as random numbers of the requested length from the kernel's
random pool and, when they are exported as
user-space blobs, they are encrypted using a master key. That master key
can either be the new trusted key type or the user key type that already
exists in the
kernel. Obviously, if the master key is not a trusted key, it needs to be
handled securely, as it provides security for any other encrypted keys.
The user-space blobs contain an HMAC that the kernel can use to verify
the integrity of a key. The keyctl utility (or keyctl()
call) can be used to generate keys, add
them to a kernel keyring, as well as to extract a key blob
from the kernel. The patch set introduction gives some examples of using
keyctl to manipulate both trusted and encrypted keys.
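
The role the HMAC plays in a blob can be illustrated with a small user-space sketch. This is not the kernel's blob format: the layout (payload followed by a 32-byte tag), the choice of HMAC-SHA256, and the function names seal() and verify() are all assumptions made for illustration.

```python
import hashlib
import hmac
import os

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def seal(payload: bytes, mac_key: bytes) -> bytes:
    """Append an HMAC tag to the payload (hypothetical blob layout)."""
    tag = hmac.new(mac_key, payload, hashlib.sha256).digest()
    return payload + tag


def verify(blob: bytes, mac_key: bytes) -> bytes:
    """Return the payload if the tag matches; raise ValueError otherwise."""
    payload, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(mac_key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("blob failed integrity check")
    return payload


# Stand-in for a key that, in the real design, never leaves the kernel.
mac_key = os.urandom(32)
blob = seal(b"already-encrypted key material", mac_key)
print(verify(blob, mac_key) == b"already-encrypted key material")  # True
```

A blob that has been modified while stored in user space fails the check, so whoever holds the MAC key can refuse to load it.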
A recent proposal for a kernel
crypto API was not particularly well-received, in part because it was
not integrated with the existing kernel keyring API, but Zohar's proposal
doesn't suffer from that problem. Both have the idea of wrapping keys into
opaque blobs before handing them off to user space, but the crypto API went
much further, adding lots of ways to actually use the keys from user
space for encryption and decryption.
While the trusted and encrypted key types would be useful to kernel services
(like EVM or filesystem encryption), they aren't very useful to
applications that want to do cryptography without exposing key data to user
space. The keys could potentially be used by hardware cryptographic
accelerators, or possibly be wired into the existing kernel crypto
services, but they won't provide all of the different algorithms envisioned
by the kernel crypto API.
The existing IMA code only solves part of the integrity problem, leaving
the detection of offline attacks against disk files (e.g. by mounting the
disk under another OS) to EVM.
If EVM is to
eventually be added to the kernel to complete the integrity verification
puzzle, then trusted keys or something similar will be
needed. So far, the patches have
attracted few comments or complaints, but they were posted to various
Linux security mailing lists, and have not yet run the linux-kernel gauntlet.
It has long been accepted by kernel developers that the user-space ABI
cannot be broken in most situations. But what happens if the current ABI
is a mistake, or if blocking changes risks stopping kernel development
altogether? Both of those possibilities have been raised in recent
The capi driver provides a control interface for ISDN adapters -
some of which, apparently, are still in use somewhere out there. If the
devices.txt file is to be believed, the control device for CAPI
applications should be /dev/capi20, while the first actual
application shows up as /dev/capi20.00. That is not what the
applications apparently want to see, though, so Marc-Andre Dahlhaus posted a patch moving the application devices under
their own directory. In other words, the first CAPI application would show
up as /dev/capi/0. The patch also modified the
devices.txt file to match the new naming.
Alan Cox rejected the patch, saying:
devices.txt is the specification, and its ABI.
It is fixed and the kernel behaviour is to follow it. Those who
didn't follow it, or who didn't propose a change back when it was
specified in the first place have only themselves to blame.
It isn't changing, and the ISDN code should follow the spec.
Maintaining the ABI is normally the right thing, but there are a couple of
problems with the reasoning here. First is that, apparently, few (if any)
distributions follow the rules described in devices.txt; the real
ABI, in practice, may be different. Second: the kernel doesn't follow
devices.txt either: current practice is to create
/dev/capi as the control device, and /dev/capi0 as the
first application device. The capifs virtual filesystem covered over some
of this, but capifs is on its way out of the kernel.
In the short term, the fix appears to
redefine the current behavior as a typo, tweaking things just enough that
udev is able to create the right file names. The devices.txt file
will not be touched for now. If regressions turn up, though, it may become
necessary to support alternative names for these devices well into the
future.
Jean Pihet recently posted a set of tracepoint
changes for power-related events. The patch added some new
tracepoints, added information to others, and added some documentation as
well. Even more recently, Thomas Renninger came forward with a different set of power tracepoint changes,
meant to clean things up and make the tracepoints more applicable to ARM
systems. In both cases, Arjan van de Ven opposed the patches, claiming that they are an
The ABI in question does have users - tools like powertop and pytimechart
in particular. It seems that Intel also has "internal tools" which would
be affected by this change. As Arjan put
it: "the thing with ABIs is that you don't know how many users
you have." When things are expressed this way, it looks like a
standard case of a user-space ABI which must be preserved, but not all
developers see it that way.
Peter Zijlstra argues that tools using
tracepoints need to be more flexible:
These tools should be smart enough to look up the tracepoint name,
fail it its not available, read the tracepoint format, again fail
if not compatible.
I really object to treating tracepoints as ABI and being tied to any
implementation details due to that.
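
What such a defensive tool might do can be sketched as follows. The sketch assumes the text layout of the per-event "format" files exported by the tracing infrastructure; the event name example_event, its fields, and the helper names are hypothetical and not taken from powertop or any other real tool.

```python
import re

# Hypothetical contents of a tracepoint "format" file, for illustration only.
SAMPLE_FORMAT = (
    "name: example_event\n"
    "ID: 123\n"
    "format:\n"
    "\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;\n"
    "\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;\n"
    "\n"
    "\tfield:u64 state;\toffset:8;\tsize:8;\tsigned:0;\n"
    "\tfield:u64 cpu_id;\toffset:16;\tsize:8;\tsigned:0;\n"
)

FIELD_RE = re.compile(r"field:([^;]+);\s*offset:(\d+);\s*size:(\d+);")


def parse_format(text):
    """Map each field name to its (offset, size) as declared in the format."""
    fields = {}
    for decl, offset, size in FIELD_RE.findall(text):
        # The name is the last token of the C declaration, minus any array suffix.
        name = decl.split()[-1].split("[")[0]
        fields[name] = (int(offset), int(size))
    return fields


def check_compatible(text, required):
    """Fail loudly if a field the tool needs is absent or has changed size."""
    fields = parse_format(text)
    problems = [name for name, size in required.items()
                if name not in fields or fields[name][1] != size]
    if problems:
        raise RuntimeError("incompatible tracepoint fields: %s" % ", ".join(problems))
    return fields


fields = check_compatible(SAMPLE_FORMAT, {"state": 8, "cpu_id": 8})
print(fields["cpu_id"])  # (16, 8)
```

Because the tool takes offsets from the format description instead of hard-coding them, a field that moves within the record does not break it; only a field that disappears or changes size does.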
Steven Rostedt worries about the effects of
a tracepoint ABI on kernel development:
Once we start saying that a tracepoint is a fixed abi, we just
stopped innovation of the kernel. Tracepoints are too intrusive to
guarantee their stability. Tools that need to get information from
a tracepoint should either be bound to a given kernel, or have a
easy way to update the tool (config file or script) that can cope
with a change.
The issue of ABI status for tracepoints has come up in the past, but it has
never really been resolved. In other situations, Linus has said that any
kernel interface which is taken up by applications becomes part of the ABI
whether that status was intended or not. From this point of view, it is
not a matter of "saying" that there is an ABI here or not; applications are
using the tracepoints, so the damage has already been done. Given that
user-space developers are being pushed to use tracepoints in various
situations, it makes sense to offer those developers a stable interface.
On the other hand, it is very much true that these tracepoints hook deeply
into the kernel. If they truly cannot be changed, then either
(1) changes in the kernel itself will be severely restricted, or
(2) we will start to accumulate backward-compatibility tracepoints
which are increasingly unrelated to anything that the kernel is actually
doing. Neither of these outcomes is conducive to the rapid evolution of
the kernel in the coming years.
If nothing else, if tracepoints are deemed to be part of the user-space
ABI, there will be strong resistance to the addition of any more of them to
large parts of the kernel.
Some alternatives have been discussed; the old idea of marking specific
tracepoints as being stable came back again. Frank Eigler suggested the creation of a compatibility
module which could attach to tracepoints which have been changed, remapping
the trace data into the older format for user space. There has also been
talk of creating a mapping layer in user space. But none of these ideas
have actually been put into the mainline kernel.
This issue is clearly not going to go away; it can only get worse as more
application developers start to make use of the tracepoints which are being
added to the kernel. It seems like an obvious topic to discuss at the 2010
Kernel Summit, scheduled for the beginning of November. What the outcome
of that discussion might be is hard to predict, but, with luck, it will at
least provide some sort of clarity on this issue.
Over the last few years, it has become clear that one of the most pressing
scalability problems faced by Linux is being driven by solid-state storage
devices (SSDs). The rapid increase in performance offered by these devices
cannot help but reveal any bottlenecks in the Linux filesystem and block
layers. What has been less clear, at times, is what we are going to do
about this problem. In his LinuxCon Japan talk, block maintainer Jens
Axboe described some of the work that has been done to improve block layer
scalability and offered a view of where things might go in the future.
While workloads will vary, Jens says, most I/O patterns are dominated by
random I/O and relatively small requests. Thus, getting the best results
requires being able to perform a large number of I/O operations per second
(IOPS). With a high-end rotating drive (running at 15,000 RPM), the
maximum rate possible is about 500 IOPS. Most real-world drives, of
course, will have significantly slower performance and lower I/O rates.
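
The talk summary does not say how the 500 IOPS figure was derived. One plausible back-of-the-envelope reading, sketched here as an assumption, is that it is the limit set by average rotational latency alone, with seek and transfer time ignored.

```python
def rotational_iops_limit(rpm):
    """Upper bound on random IOPS if each request waits half a rotation.

    Assumes one request is serviced at a time and ignores seek time,
    so real drives will come in below this figure.
    """
    revolutions_per_second = rpm / 60.0
    avg_rotational_latency = 0.5 / revolutions_per_second  # seconds
    return 1.0 / avg_rotational_latency


print(round(rotational_iops_limit(15000)))  # 500
print(round(rotational_iops_limit(7200)))   # 240
```

At 15,000 RPM a rotation takes 4ms, so the average wait for the right sector is 2ms, which allows at most 500 such waits per second.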
SSDs, by eliminating seeks and rotational delays, change everything; we
have gone from hundreds of IOPS to hundreds of thousands of IOPS in a very
short period of time. A number of people have said that the massive
increase in IOPS means that the block layer will have to become more like
the networking layer, where every bit of per-packet overhead has been
squeezed out over time. But, as Jens points out, time is not in great
abundance. Networking technology went from 10Mb/s in the 1980s to 10Gb/s
now, the better part of 30 years later. SSDs have forced a similar jump
(three orders of magnitude) in a much shorter period of time - and every
indication suggests that devices with IOPS rates in the millions are not
that far away. The result, says Jens, is "a big problem."
This problem pops up in a number of places, but it usually comes down to
contention for shared resources. Locking overhead which is tolerable at
500 IOPS is crippling at 500,000. There are problems with contention
at the hardware level too; vendors of storage controllers have been caught
by surprise by SSDs and are having to scramble to get their performance up
to the required levels. The growth of multicore systems naturally makes
things worse; such systems can create contention problems throughout the
kernel, and the block layer is no exception. So much of the necessary work
comes down to avoiding contention.
Before that, though, some work had to be done just to get the block layer
to recognize that it is dealing with an SSD and react accordingly.
Traditionally, the block layer has been driven by the need to avoid head
seeks; the use of quite a bit of CPU time could be justified if it managed
to avoid a single seek. SSDs - at least the good ones - care a lot less
about seeks, so expending a bunch of CPU time to avoid them no longer makes
sense. There are various ways of detecting SSDs in the hardware, but they
don't always work, especially with the lower-quality devices. So the block
layer exports a flag under
which can be used to override the system's notion of what kind of storage
device it is dealing with.
Improving performance with SSDs can be a challenging task. There is no
single big bottleneck which is causing performance problems; instead, there
are numerous small things to fix. Each fix yields a bit of progress, but
it mostly serves to highlight the next problem. Additionally, performance
testing is hard; results are often not reproducible and can be perturbed by
small changes. This is especially true on larger systems with more CPUs.
management can also get in the way of the generation of consistent results.
One of the first things to address on an SSD was queue plugging. On a
rotating disk, the first I/O operation to show up in the request queue will
cause the queue to be "plugged," meaning that no operations will actually
be dispatched to the hardware. The idea behind plugging is that, by
allowing a little time for additional I/O requests to arrive, the block
layer will be able to merge adjacent requests (reducing the operation
count) and sort them into an optimal order, increasing performance.
Performance on SSDs tends not to benefit from this treatment, though there
is still a little value to merging requests. Dropping (or, at least,
reducing) plugging not only
eliminates a needless delay; it also reduces the need to take the queue
lock in the process.
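
The merging and sorting that plugging makes possible can be shown with a toy model. This is not the kernel's elevator code; requests are reduced to hypothetical (start_sector, sector_count) pairs, and only exactly adjacent requests are merged.

```python
def merge_and_sort(requests):
    """Sort queued requests by start sector and merge adjacent ones.

    requests: iterable of (start_sector, sector_count) tuples.
    """
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # This request begins where the previous one ends: extend it.
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)
        else:
            merged.append((start, count))
    return merged


# Four requests that arrived while the queue was plugged.
pending = [(108, 8), (0, 8), (100, 8), (8, 8)]
print(merge_and_sort(pending))  # [(0, 16), (100, 16)]
```

Four operations become two, issued in ascending sector order. On a rotating disk that saves head movement; on an SSD only the reduction in operation count has much value.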
Then, there is the issue of request timeouts. Like most I/O code, the
block layer needs to notice when an I/O request is never completed by the
device. That detection is done with timeouts. The old implementation
involved a separate timeout for each outstanding request, but that clearly
does not scale when the number of such requests can be huge. The answer
was to go to a per-queue timer, reducing the number of running timers
Block I/O operations, due to their inherently unpredictable execution
times, have traditionally contributed entropy to the kernel's random number
pool. There is a problem, though: the necessary call to
add_timer_randomness() has to acquire a global lock, causing
unpleasant systemwide contention. Some work was done to batch these calls
and accumulate randomness on a per-CPU basis, but, even when batching 4K
operations at a time, the performance cost was significant. On top of it
all, it's not really clear that using an SSD as an entropy source makes a
lot of sense. SSDs lack mechanical parts moving around, so their
completion times are much more predictable. Still, for the moment, SSDs
contribute to the entropy pool by default; administrators who would
like to change that behavior can do so by changing the
queue/add_random sysfs variable.
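
Reading or changing that setting from a script might look like the sketch below. It assumes the usual sysfs layout of /sys/block/<device>/queue/; the helper names are made up for this example, and the sysfs_root parameter exists only so that the code can be exercised against a scratch directory.

```python
from pathlib import Path


def queue_attr_path(device, attr, sysfs_root="/sys/block"):
    """Build the path to a request-queue attribute for a block device."""
    return Path(sysfs_root) / device / "queue" / attr


def read_queue_attr(device, attr, sysfs_root="/sys/block"):
    """Return the attribute's current value as a stripped string."""
    return queue_attr_path(device, attr, sysfs_root).read_text().strip()


def write_queue_attr(device, attr, value, sysfs_root="/sys/block"):
    """Write a new value; on a real system this normally requires root."""
    queue_attr_path(device, attr, sysfs_root).write_text("%s\n" % value)
```

On a live system, write_queue_attr("sda", "add_random", 0) would stop the device named sda (a placeholder) from feeding the entropy pool, and read_queue_attr("sda", "add_random") would report the current setting.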
There are other locking issues to be dealt with. Over time, the block
layer has gone from being protected by the big kernel lock to a block-level
lock, then to a per-disk lock, but lock contention is still a problem. The
I/O scheduler adds contention of its own, especially if it is performing
disk-level accounting. Interestingly, contention for the locks themselves
is usually the problem; it's not that the locks are being held for too long.
The big problem is the cache-line bouncing caused by moving the lock
between processors. So the traditional technique of dropping and
reacquiring locks to reduce lock contention does not help here - indeed, it
makes things worse. What's needed is to avoid taking the lock altogether.
Block requests enter the system via __make_request(), which is
responsible for getting a request (represented by a BIO structure) onto the
queue. Two lock acquisitions are required to do this job - three if the
CFQ I/O scheduler is in use. Those two acquisitions are the result of a
lock split done to reduce contention in the past; that split, when the
system is handling requests at SSD speeds, makes things worse. Eliminating
it led to a roughly 3% increase in IOPS with a reduction in CPU time on a
32-core system. It is, Jens says, a "quick hack," but it demonstrates the
kind of changes that need to be made.
The next step for this patch is to drop the I/O request allocation batching
- a mechanism added to increase throughput on rotating drives by allowing
the simultaneous submission of multiple requests. Jens also plans to drop
the allocation accounting code, which tracks the number of requests in
flight at any given time. Counting outstanding I/O operations requires
global counters and the associated contention, but it can be done without
most of the time. Some accounting will still be done at the request queue
level to ensure that some control is maintained over the number of
outstanding requests. Beyond that, there is some per-request accounting
which can be cleaned up and, Jens thinks, request completion can be made
completely lockless. He hopes that this work will be ready for merging
Another important technique for reducing contention is keeping processing
on the same CPU as often as possible. In particular, there are a number of
costs which are incurred if the CPU which handles the submission of a specific I/O request is
not the CPU which handles that request's completion. Locks are bounced
between CPUs in an unpleasant way, and the slab allocator tends not to
respond well when memory allocated on one processor is freed elsewhere in
the system. In the networking layer, this problem has been addressed with
techniques like receive packet
steering, but, unlike some networking hardware, block I/O controllers
are not able to direct specific I/O completion interrupts to specific
CPUs. So a different solution was required.
That solution took the form of smp_call_function(), which performs
fast cross-CPU calls. Using smp_call_function(), the block I/O
completion code can direct the completion of specific requests to the CPU
where those requests were initially submitted. The result is a relatively
easy performance improvement. A dedicated administrator who is willing to
tweak the system manually can do better, but
that takes a lot of work and the solution tends to be fragile. This
code - which was merged back in 2.6.27 and made the default in 2.6.32 -
is an easier way that takes away a fair amount of the pain of cross-CPU
contention. Jens noted with pride that the block layer was not chasing the networking code
with regard to completion steering - the block code had it first.
On the other hand, the blk-iopoll interrupt mitigation
code was not just inspired by the networking layer - some of the code was
"shamelessly stolen" from there. The blk-iopoll code turns off completion
interrupts when I/O traffic is high and uses polling to pick up completed
events instead. On a test system, this code reduced 20,000
interrupts/second to about 1,000. Jens says that the results are less
conclusive on real-world systems, though.
An approach which "has more merit" is "context plugging," a rework of the
queue plugging code. Currently, queue plugging is done implicitly on I/O
submission, with an explicit unplug required at a later time. That has
been the source of a lot of bugs; forgetting to unplug queues is a common
mistake to make. The plan is to make plugging and unplugging fully
implicit, but give I/O
submitters a way to inform the block layer that more requests are coming
soon. It makes the code more clear and robust; it also gets rid of a lot
of expensive per-queue state which must be maintained. There are still
some problems to be solved, but the code works, is "tasty on many levels,"
and yields a net reduction of some 600 lines of code. Expect a merge in
2.6.38 or 2.6.39.
Finally, there is the "weird territory" of a multiqueue block layer - an
idea which, once again, came from the networking layer. The creation of
multiple I/O queues for a given device will allow multiple processors to
handle I/O requests simultaneously with less contention. It's currently
hard to do, though, because block I/O controllers do not (yet) have
multiqueue support. That problem will be fixed eventually, but there will
be some other challenges to overcome: I/O barriers will become
significantly more complicated, as will per-device accounting. All told,
it will require some major changes to the block layer and a special I/O
scheduler. Jens offered no guidance as to when we might see this code
The conclusion which comes from this talk is that the Linux block layer is
facing some significant challenges driven by hardware changes. These
challenges are being addressed, though, and the code is moving in the
necessary direction. By the time most of us can afford a system with one
of those massive, 1 MIOPS arrays on it, Linux should be able to use it
to its potential.
Page editor: Jonathan Corbet