Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.36-rc7, released on October 6. "This should be the last -rc, I'm not seeing any reason to keep delaying a real release. There was still more changes to drivers/gpu/drm than I really would have hoped for, but they all look harmless and good. Famous last words." The short-form changelog is in the announcement; kernel.org has the full changelog.

Stable updates: 2.6.32.24, containing a single fix for a typo in the Xen code, was released on October 1. As of this writing, there are no stable updates in the review process.

Quotes of the week

As a general rule, if a reviewer's comment doesn't result in a code change then it should result in a changelog fix or a code comment. Because if the code wasn't clear enough to the reviewer then it won't be clear enough to later readers.
-- Andrew Morton

AMD's reference BIOS code had a bug that could result in the firmware failing to reenable the iommu on resume. It transpires that this causes certain less than desirable behaviour when it comes to PCI accesses, to whit them ending up somewhere near Bristol when the more desirable outcome was Edinburgh. Sadness ensues, perhaps along with filesystem corruption. Let's make sure that it gets turned back on, and that we restore its configuration so decisions it makes bear some resemblance to those made by reasonable people rather than crack-addled lemurs who spent all your DMA on Thunderbird.
-- Matthew Garrett

Little-endian PowerPC

By Jonathan Corbet
October 6, 2010
The PowerPC architecture is normally thought of as a big-endian domain - the most significant byte of multi-byte values comes first. Big-endian is consistent with a number of other architectures, but the fact that one obscure architecture - x86 - is little-endian means that the world as a whole tends toward the little-endian persuasion. As it happens, at least some PowerPC processors can optionally be run in a little-endian mode. Ian Munsie has posted a patch set which enables Linux to take advantage of that feature and run little-endian on suitably-equipped PowerPC processors.
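
For the curious, the difference is easy to demonstrate; here is a quick C sketch (not from the patch set, purely an illustration) that reports the byte order of the machine it runs on:

    #include <stdio.h>

    int main(void)
    {
        unsigned int word = 0x11223344;
        unsigned char *bytes = (unsigned char *)&word;

        /* A big-endian machine stores the most significant byte first,
         * so bytes[0] is 0x11; a little-endian machine stores 0x44. */
        printf("%s-endian: first byte is 0x%02x\n",
               bytes[0] == 0x11 ? "big" : "little", bytes[0]);
        return 0;
    }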

The first question that came to the mind of a few reviewers was: "why?" PowerPC runs fine as a big-endian architecture, and there has been little clamor for little-endian support. Besides, endianness seems to be one of those things that users can feel strongly about; to at least some PowerPC users, little-endian apparently feels cheap, wrong, and PCish.

The answer, as expressed by Ben Herrenschmidt, appears to be graphics hardware. A number of GPUs, especially those aimed at embedded applications, only work in little-endian mode. Carefully written device drivers can work around that sort of limitation without too much trouble, but user-space code - which often ends up talking to graphics hardware - is another story. Fixing all of that code is not a task that anybody wants to take on. As a result, PowerPC processors will simply not be considered for situations where little-endian support is needed. Running the processor in little-endian mode nicely overcomes that obstacle.

That said, it will take a little while before this support is widely usable. The kernel patches apparently look good, but the required toolchain changes are not yet generally available. Until that little issue is resolved, PowerPC will remain a club for big-endian users only.

Kernel development news

Trusted and encrypted keys

By Jake Edge
October 6, 2010

The Trusted Platform Module (TPM) present on many of today's systems can be used in various ways, from making completely locked-down systems that cannot be changed by users to protecting sensitive systems from various kinds of attacks. While the TPM-using integrity measurement architecture (IMA), which can measure and attest to the integrity of a running Linux system, has been part of the kernel for some time now, the related extended verification module (EVM) has not made it into the mainline. One of the concerns raised about EVM was that it obtained its cryptographic key from user space, then used that key for integrity verification - largely nullifying the integrity guarantees that EVM is supposed to provide. A set of patches recently posted for comments to the linux-security-module mailing list would add two new key types to the kernel, allowing user space to provide and store the key without ever being able to see the actual key data.

We last looked in on EVM back in June, when it seemed like it might make it into 2.6.36. That didn't happen, nor has EVM been incorporated into linux-next, so its path into the mainline is a bit unclear at this point. EVM calculates HMAC (hash-based message authentication code) values for on-disk files, uses the EVM key and TPM to sign the values, and stores them in extended attributes (xattrs) in the security namespace. If the EVM key is subverted, all bets are off in terms of the integrity of the system. While targeted at EVM, Mimi Zohar's patches to add trusted and encrypted key types could also be used for other purposes, such as handling the keys for filesystem encryption.

The basic idea is that these keys would be generated by the kernel, and would never be touched by user space in an unencrypted form. Encrypted "blobs" would be provided to user space by the kernel and would contain the key material. User space could store the keys, for example, but the blobs would be completely opaque to anything outside of the kernel. The patches come with two new flavors of these in-kernel keys: trusted and encrypted.

Trusted keys are generated by the TPM and then encrypted using the TPM's storage root key (SRK), which is a 2048-bit RSA key (this is known as "sealing" the key in TPM terminology). Furthermore, trusted keys can also be sealed to a particular set of TPM platform configuration register (PCR) values, so that the keys cannot be unsealed unless the PCR values match. The PCRs contain integrity measurements of the system BIOS, bootloader, and operating system, so tying keys to PCR values means that the trusted keys cannot be accessed except from those systems for which they were specifically authorized. Any change to the underlying code will result in undecryptable keys.

Since the PCR values change with the kernel and initramfs used, trusted keys can be resealed to different PCR values once they have been added to a keyring (which verifies the existing PCR values). There can also be multiple versions of a single trusted key, each sealed to different PCR values; this can be used to support booting multiple kernels with the same key. While the underlying, unencrypted key data will not change between kernels, the user-space blob will change because of the different PCR values, which will require some kind of key management in user space.

Encrypted keys, on the other hand, do not rely on the TPM; they use the kernel's AES implementation instead, which is faster than the TPM's public-key encryption. Keys are generated as random numbers of the requested length from the kernel's random pool and, when they are exported as user-space blobs, they are encrypted using a master key. That master key can either be of the new trusted key type or of the user key type that already exists in the kernel. Obviously, if the master key is not a trusted key, it needs to be handled securely, as it protects every encrypted key beneath it.

The user-space blobs contain an HMAC that the kernel can use to verify the integrity of a key. The keyctl utility (or keyctl() system call) can be used to generate keys, add them to a kernel keyring, as well as to extract a key blob from the kernel. The patch set introduction gives some examples of using keyctl to manipulate both trusted and encrypted keys.
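
As an illustration of how the interface might be driven from a program, here is a sketch using the libkeyutils wrappers around those system calls; the "trusted" type name and the "new 32" payload syntax follow the examples in the patch introduction, so treat those details as provisional:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <keyutils.h>    /* link with -lkeyutils */

    int main(void)
    {
        /* Ask the kernel to generate a 32-byte trusted key named "kmk". */
        key_serial_t key = add_key("trusted", "kmk", "new 32",
                                   strlen("new 32"), KEY_SPEC_USER_KEYRING);
        if (key < 0) {
            perror("add_key");
            return 1;
        }

        /* Reading the key back yields the TPM-sealed blob - never the
         * unencrypted key material. */
        void *blob;
        long len = keyctl_read_alloc(key, &blob);
        if (len < 0) {
            perror("keyctl_read_alloc");
            return 1;
        }
        printf("sealed blob is %ld bytes\n", len);
        free(blob);
        return 0;
    }

An encrypted key bound to that master key would be created in much the same way; the patch introduction shows payloads along the lines of "new trusted:kmk 32" for the encrypted type.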

A recent proposal for a kernel crypto API was not particularly well-received, in part because it was not integrated with the existing kernel keyring API, but Zohar's proposal doesn't suffer from that problem. Both have the idea of wrapping keys into opaque blobs before handing them off to user space, but the crypto API went much further, adding lots of ways to actually use the keys from user space for encryption and decryption.

While the trusted and encrypted key types would be useful to kernel services (like EVM or filesystem encryption), they aren't very useful to applications that want to do cryptography without exposing key data to user space. The keys could potentially be used by hardware cryptographic accelerators, or possibly be wired into the existing kernel crypto services, but they won't provide all of the different algorithms envisioned by the kernel crypto API.

The existing IMA code only solves part of the integrity problem, leaving the detection of offline attacks against disk files (e.g. by mounting the disk under another OS) to EVM. If EVM is eventually to be added to the kernel to complete the integrity verification puzzle, then trusted keys or something similar will be needed. So far, the patches have attracted few comments or complaints, but they were only posted to various Linux security mailing lists and have not yet run the linux-kernel gauntlet.

Two ABI troubles

By Jonathan Corbet
October 5, 2010
It has long been accepted by kernel developers that the user-space ABI cannot be broken in most situations. But what happens if the current ABI is a mistake, or if blocking changes risks stopping kernel development altogether? Both of those possibilities have been raised in recent discussions.

The capi driver provides a control interface for ISDN adapters - some of which, apparently, are still in use somewhere out there. If the devices.txt file is to be believed, the control device for CAPI applications should be /dev/capi20, while the first actual application shows up as /dev/capi20.00. That is not what the applications apparently want to see, though, so Marc-Andre Dahlhaus posted a patch moving the application devices under their own directory. In other words, the first CAPI application would show up as /dev/capi/0. The patch also modified the devices.txt file to match the new naming.

Alan Cox rejected the patch, saying:

devices.txt is the specification, and its ABI.

It is fixed and the kernel behaviour is to follow it. Those who didn't follow it, or who didn't propose a change back when it was specified in the first place have only themselves to blame. It isn't changing, and the ISDN code should follow the spec.

Maintaining the ABI is normally the right thing to do, but there are a couple of problems with the reasoning here. The first is that, apparently, few (if any) distributions follow the rules described in devices.txt; the real ABI, in practice, may be different. The second is that the kernel doesn't follow devices.txt either: current practice is to create /dev/capi as the control device, and /dev/capi0 as the first application device. The capifs virtual filesystem covered over some of this, but capifs is on its way out of the kernel.

In the short term, the fix appears to be to redefine the current behavior as a typo, tweaking things just enough that udev is able to create the right file names. The devices.txt file will not be touched for now. If regressions turn up, though, it may become necessary to support alternative names for these devices well into the future.

Tracepoints, again

Jean Pihet recently posted a set of tracepoint changes for power-related events. The patch added some new tracepoints, added information to others, and added some documentation as well. Even more recently, Thomas Renninger came forward with a different set of power tracepoint changes, meant to clean things up and make the tracepoints more applicable to ARM systems. In both cases, Arjan van de Ven opposed the patches, claiming that they are an ABI break.

The ABI in question does have users - tools like powertop and pytimechart in particular. It seems that Intel also has "internal tools" which would be affected by this change. As Arjan put it: "the thing with ABIs is that you don't know how many users you have." When things are expressed this way, it looks like a standard case of a user-space ABI which must be preserved, but not all developers see it that way.

Peter Zijlstra argues that tools using tracepoints need to be more flexible:

These tools should be smart enough to look up the tracepoint name, fail it its not available, read the tracepoint format, again fail if not compatible. I really object to treating tracepoints as ABI and being tied to any implementation details due to that.
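
In concrete terms, a tool following that advice might probe for the tracepoint and its format before relying on either; here is a minimal sketch (the power_start event name and the debugfs mount point are assumptions about the running system):

    #include <stdio.h>

    int main(void)
    {
        /* The format file describes the event's record layout; a real
         * tool would parse the "field:" lines and bail out if they are
         * incompatible with what it expects. */
        const char *path =
            "/sys/kernel/debug/tracing/events/power/power_start/format";
        char line[256];
        FILE *f = fopen(path, "r");

        if (!f) {
            fprintf(stderr, "tracepoint not available, falling back\n");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }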

Steven Rostedt worries about the effects of a tracepoint ABI on kernel development:

Once we start saying that a tracepoint is a fixed abi, we just stopped innovation of the kernel. Tracepoints are too intrusive to guarantee their stability. Tools that need to get information from a tracepoint should either be bound to a given kernel, or have a easy way to update the tool (config file or script) that can cope with a change.

The issue of ABI status for tracepoints has come up in the past, but it has never really been resolved. In other situations, Linus has said that any kernel interface which is taken up by applications becomes part of the ABI whether that status was intended or not. From this point of view, it is not a matter of "saying" that there is an ABI here or not; applications are using the tracepoints, so the damage has already been done. Given that user-space developers are being pushed to use tracepoints in various situations, it makes sense to offer those developers a stable interface.

On the other hand, it is very much true that these tracepoints hook deeply into the kernel. If they truly cannot be changed, then either (1) changes in the kernel itself will be severely restricted, or (2) we will start to accumulate backward-compatibility tracepoints which are increasingly unrelated to anything that the kernel is actually doing. Neither of these outcomes is conducive to the rapid evolution of the kernel in the coming years.

If nothing else, should tracepoints be deemed part of the user-space ABI, there will be strong resistance to the addition of any more of them to large parts of the kernel.

Some alternatives have been discussed; the old idea of marking specific tracepoints as being stable came back again. Frank Eigler suggested the creation of a compatibility module which could attach to tracepoints which have been changed, remapping the trace data into the older format for user space. There has also been talk of creating a mapping layer in user space. But none of these ideas have actually been put into the mainline kernel.

This issue is clearly not going to go away; it can only get worse as more application developers start to make use of the tracepoints which are being added to the kernel. It seems like an obvious topic to discuss at the 2010 Kernel Summit, scheduled for the beginning of November. What the outcome of that discussion might be is hard to predict, but, with luck, it will at least provide some sort of clarity on this issue.

Solid-state storage devices and the block layer

By Jonathan Corbet
October 4, 2010
Over the last few years, it has become clear that one of the most pressing scalability problems faced by Linux is being driven by solid-state storage devices (SSDs). The rapid increase in performance offered by these devices cannot help but reveal any bottlenecks in the Linux filesystem and block layers. What has been less clear, at times, is what we are going to do about this problem. In his LinuxCon Japan talk, block maintainer Jens Axboe described some of the work that has been done to improve block layer scalability and offered a view of where things might go in the future.

While workloads will vary, Jens says, most I/O patterns are dominated by random I/O and relatively small requests. Thus, getting the best results requires being able to perform a large number of I/O operations per second (IOPS). With a high-end rotating drive (running at 15,000 RPM), the maximum rate possible is about 500 IOPS: at 250 revolutions per second, a random operation waits an average of half a revolution - about 2ms - for the right sector to come around, which by itself caps the drive at roughly 500 operations per second. Most real-world drives, of course, will have significantly slower performance and lower I/O rates.

SSDs, by eliminating seeks and rotational delays, change everything; we have gone from hundreds of IOPS to hundreds of thousands of IOPS in a very short period of time. A number of people have said that the massive increase in IOPS means that the block layer will have to become more like the networking layer, where every bit of per-packet overhead has been squeezed out over time. But, as Jens points out, time is not in great abundance. Networking technology went from 10Mb/s in the 1980s to 10Gb/s now, the better part of 30 years later. SSDs have forced a similar jump (three orders of magnitude) in a much shorter period of time - and every indication suggests that devices with IOPS rates in the millions are not that far away. The result, says Jens, is "a big problem."

This problem pops up in a number of places, but it usually comes down to contention for shared resources. Locking overhead which is tolerable at 500 IOPS is crippling at 500,000. There are problems with contention at the hardware level too; vendors of storage controllers have been caught by surprise by SSDs and are having to scramble to get their performance up to the required levels. The growth of multicore systems naturally makes things worse; such systems can create contention problems throughout the kernel, and the block layer is no exception. So much of the necessary work comes down to avoiding contention.

Before that, though, some work had to be done just to get the block layer to recognize that it is dealing with an SSD and react accordingly. Traditionally, the block layer has been driven by the need to avoid head seeks; the use of quite a bit of CPU time could be justified if it managed to avoid a single seek. SSDs - at least the good ones - care a lot less about seeks, so expending a bunch of CPU time to avoid them no longer makes sense. There are various ways of detecting SSDs in the hardware, but they don't always work, especially with the lower-quality devices. So the block layer exports a flag under

    /sys/block/<device>/queue/rotational

which can be used to override the system's notion of what kind of storage device it is dealing with; writing 0 to that file, for example, tells the block layer that the device is non-rotational regardless of what the hardware reports.

Improving performance with SSDs can be a challenging task. There is no single big bottleneck which is causing performance problems; instead, there are numerous small things to fix. Each fix yields a bit of progress, but it mostly serves to highlight the next problem. Additionally, performance testing is hard; results are often not reproducible and can be perturbed by small changes. This is especially true on larger systems with more CPUs. Power management can also get in the way of the generation of consistent results.

One of the first things to address on an SSD was queue plugging. On a rotating disk, the first I/O operation to show up in the request queue will cause the queue to be "plugged," meaning that no operations will be dispatched to the hardware for a short while. The idea behind plugging is that, by allowing a little time for additional I/O requests to arrive, the block layer will be able to merge adjacent requests (reducing the operation count) and sort them into an optimal order, increasing performance. Performance on SSDs tends not to benefit from this treatment, though there is still a little value in merging requests. Dropping (or, at least, reducing) plugging not only eliminates a needless delay; it also reduces the need to take the queue lock in the process.

Then, there is the issue of request timeouts. Like most I/O code, the block layer needs to notice when an I/O request is never completed by the device. That detection is done with timeouts. The old implementation involved a separate timeout for each outstanding request, but that clearly does not scale when the number of such requests can be huge. The answer was to go to a per-queue timer, reducing the number of running timers considerably.
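
A much-simplified sketch of the per-queue scheme (the structures and the RQ_TIMEOUT and handle_timed_out_request() names here are illustrative, not the actual block-layer code): each request records its deadline on a list, and a single timer per queue is kept armed for the earliest one. Locking is omitted; the real code must, of course, protect the list.

    /* Illustrative only: one timer per queue rather than one per request. */
    #define RQ_TIMEOUT (30 * HZ)    /* arbitrary example timeout */

    struct request {
        struct list_head timeout_list;
        unsigned long    deadline;
    };

    struct queue {
        struct list_head  timeout_list;  /* requests awaiting completion */
        struct timer_list timeout;       /* the single per-queue timer */
    };

    static void queue_add_timeout(struct queue *q, struct request *rq)
    {
        rq->deadline = jiffies + RQ_TIMEOUT;
        list_add_tail(&rq->timeout_list, &q->timeout_list);

        /* Only (re)arm the timer if this request expires sooner than
         * whatever it is already set for. */
        if (!timer_pending(&q->timeout) ||
            time_before(rq->deadline, q->timeout.expires))
            mod_timer(&q->timeout, rq->deadline);
    }

    static void queue_timeout_fn(unsigned long data)
    {
        struct queue *q = (struct queue *)data;
        struct request *rq, *tmp;
        unsigned long next = 0;

        list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list) {
            if (time_after_eq(jiffies, rq->deadline)) {
                list_del(&rq->timeout_list);
                handle_timed_out_request(rq);   /* expired: error path */
            } else if (!next || time_before(rq->deadline, next)) {
                next = rq->deadline;
            }
        }
        if (next)
            mod_timer(&q->timeout, next);       /* re-arm for the next one */
    }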

Block I/O operations, due to their inherently unpredictable execution times, have traditionally contributed entropy to the kernel's random number pool. There is a problem, though: the necessary call to add_timer_randomness() has to acquire a global lock, causing unpleasant systemwide contention. Some work was done to batch these calls and accumulate randomness on a per-CPU basis but, even when batching 4K operations at a time, the performance cost was significant. On top of it all, it's not really clear that using an SSD as an entropy source makes a lot of sense: SSDs have no mechanical parts moving around, so their completion times are much more predictable. Still, for the moment, SSDs contribute to the entropy pool by default; administrators who would like to change that behavior can do so by writing a zero to the queue/add_random sysfs variable.

There are other locking issues to be dealt with. Over time, the block layer has gone from being protected by the big kernel lock to a block-level lock, then to a per-disk lock, but lock contention is still a problem. The I/O scheduler adds contention of its own, especially if it is performing disk-level accounting. Interestingly, contention for the locks themselves is not usually the problem; it's not that the locks are being held for too long. The big problem is the cache-line bouncing caused by moving the lock between processors. So the traditional technique of dropping and reacquiring locks to reduce lock contention does not help here - indeed, it makes things worse. What's needed is to avoid taking the lock altogether.

Block requests enter the system via __make_request(), which is responsible for getting a request (represented by a BIO structure) onto the queue. Two lock acquisitions are required to do this job - three if the CFQ I/O scheduler is in use. Those two acquisitions are the result of a lock split done to reduce contention in the past; that split, when the system is handling requests at SSD speeds, makes things worse. Eliminating it led to a roughly 3% increase in IOPS with a reduction in CPU time on a 32-core system. It is, Jens says, a "quick hack," but it demonstrates the kind of changes that need to be made.

The next step for this patch is to drop the I/O request allocation batching - a mechanism added to increase throughput on rotating drives by allowing the simultaneous submission of multiple requests. Jens also plans to drop the allocation accounting code, which tracks the number of requests in flight at any given time. Counting outstanding I/O operations requires global counters and the associated contention, but it can be dispensed with most of the time. Some accounting will still be done at the request queue level to ensure that some control is maintained over the number of outstanding requests. Beyond that, there is some per-request accounting which can be cleaned up and, Jens thinks, request completion can be made completely lockless. He hopes that this work will be ready for merging into 2.6.38.

Another important technique for reducing contention is keeping processing on the same CPU as often as possible. In particular, there are a number of costs which are incurred if the CPU which handles the submission of a specific I/O request is not the CPU which handles that request's completion. Locks are bounced between CPUs in an unpleasant way, and the slab allocator tends not to respond well when memory allocated on one processor is freed elsewhere in the system. In the networking layer, this problem has been addressed with techniques like receive packet steering, but, unlike some networking hardware, block I/O controllers are not able to direct specific I/O completion interrupts to specific CPUs. So a different solution was required.

That solution took the form of smp_call_function(), which performs fast cross-CPU calls. Using smp_call_function(), the block I/O completion code can direct the completion of specific requests to the CPU where those requests were initially submitted. The result is a relatively easy performance improvement. A dedicated administrator who is willing to tweak the system manually can do better, but that takes a lot of work and the solution tends to be fragile. This code - which was merged back in 2.6.27 and made the default in 2.6.32 - is an easier way that takes away a fair amount of the pain of cross-CPU contention. Jens noted with pride that the block layer was not chasing the networking code with regard to completion steering - the block code had it first.
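
In rough outline (the submit_cpu field and the helper names are hypothetical; the real implementation lives in the block completion code and is more careful about execution contexts), the steering looks like this:

    /* Illustrative sketch: finish a request on the CPU that submitted it. */
    static void local_complete(void *data)
    {
        struct request *rq = data;

        __finish_request(rq);       /* hypothetical local completion */
    }

    static void steer_completion(struct request *rq)
    {
        int cpu = get_cpu();

        if (cpu == rq->submit_cpu)
            local_complete(rq);     /* already on the right CPU */
        else
            smp_call_function_single(rq->submit_cpu, local_complete,
                                     rq, 0 /* don't wait */);
        put_cpu();
    }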

On the other hand, the blk-iopoll interrupt mitigation code was not just inspired by the networking layer - some of the code was "shamelessly stolen" from there. The blk-iopoll code turns off completion interrupts when I/O traffic is high and uses polling to pick up completed events instead. On a test system, this code reduced 20,000 interrupts/second to about 1,000. Jens says that the results are less conclusive on real-world systems, though.
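
In rough outline, a driver's use of blk-iopoll looks a lot like a network driver's use of NAPI; in this sketch the my_*() functions and structure are hypothetical driver internals:

    /* Poll function: consume up to "budget" completions from softirq
     * context; when the queue runs dry, fall back to interrupts. */
    static int my_iopoll(struct blk_iopoll *iop, int budget)
    {
        int done = my_consume_completions(budget);

        if (done < budget) {
            blk_iopoll_complete(iop);   /* polling done for now */
            my_enable_completion_irq();
        }
        return done;
    }

    /* Interrupt handler: mask further completion interrupts and hand
     * the work over to the polling machinery. */
    static irqreturn_t my_irq_handler(int irq, void *data)
    {
        struct my_device *dev = data;

        my_disable_completion_irq();
        blk_iopoll_sched(&dev->iopoll);
        return IRQ_HANDLED;
    }

The iopoll instance would have been set up beforehand with blk_iopoll_init(), passing a weight that limits how much work a single poll pass may do.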

An approach which "has more merit" is "context plugging," a rework of the queue plugging code. Currently, queue plugging happens implicitly on I/O submission, with an explicit unplug required at a later time. That has been the source of a lot of bugs; forgetting to unplug a queue is a common mistake. The plan is to make plugging and unplugging fully implicit, but to give I/O submitters a way to inform the block layer that more requests are coming soon, as in the sketch below. That makes the code clearer and more robust; it also gets rid of a lot of expensive per-queue state which must otherwise be maintained. There are still some problems to be solved, but the code works, is "tasty on many levels," and yields a net reduction of some 600 lines of code. Expect a merge in 2.6.38 or 2.6.39.
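
From a submitter's point of view, such an interface might look something like the following sketch (the names are illustrative, not merged code):

    /* The submitter brackets a batch of requests; plugging and
     * unplugging then happen implicitly inside the block layer. */
    static void submit_batch(void)
    {
        struct blk_plug plug;

        blk_start_plug(&plug);      /* hint: more requests are on the way */
        submit_batch_of_reads();    /* hypothetical submissions */
        submit_batch_of_writes();
        blk_finish_plug(&plug);     /* batch complete */
    }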

Finally, there is the "weird territory" of a multiqueue block layer - an idea which, once again, came from the networking layer. The creation of multiple I/O queues for a given device will allow multiple processors to handle I/O requests simultaneously with less contention. It's currently hard to do, though, because block I/O controllers do not (yet) have multiqueue support. That problem will be fixed eventually, but there will be some other challenges to overcome: I/O barriers will become significantly more complicated, as will per-device accounting. All told, it will require some major changes to the block layer and a special I/O scheduler. Jens offered no guidance as to when we might see this code merged.

The conclusion which comes from this talk is that the Linux block layer is facing some significant challenges driven by hardware changes. These challenges are being addressed, though, and the code is moving in the necessary direction. By the time most of us can afford a system with one of those massive, million-IOPS arrays on it, Linux should be able to use it to its potential.

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds