Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel is 2.6.16.15, released on May 9. It adds four security patches, all of which apply to the SCTP code. Previously, 2.6.16.14 was released on May 5 with a patch for an smbfs problem which could enable a process to escape a chroot environment.

The current 2.6 prepatch remains 2.6.17-rc3. A few hundred patches have been merged into the mainline git repository since -rc3 was released; they are mostly fixes, but there is also a set of splice() improvements and the ability to add attribute groups to class_device entries at registration time.

There have been no -mm releases over the last week.

Kernel development news

Quote of the week

Actually, I think the system is working quite well. We've got a quick route for getting bug fixes and security fixes to users, and a shorter devel cycle helping distro folks get more regular drops from upstream. This particular patch [2.6.16.14] applies all the way back to the beginning of git time (over a year ago), and I'm sure earlier. So it's hard to conclude it's a byproduct of the release cycles.
-- Chris Wright

Multi-protection VMAs

The virtual memory area (VMA) structure (struct vm_area_struct) is one of the core building blocks of the Linux virtual memory code. Each VMA describes a piece of a process's address space; that piece is a (usually contiguous) series of pages from a single backing store (a file or, for anonymous memory, swap space) with a uniform set of access permissions. Each VMA maintains information on the address space covered, pointers to the backing store, permission information, a set of function pointers for operations on that VMA, and other housekeeping information.

Before the 2.6 kernel was released, all VMAs mapped a range of address space onto a contiguous range of pages in the backing store. Things got a bit more complicated with the addition of the remap_file_pages() system call, which allows applications to rearrange the mapping of memory pages to backing store pages within a VMA. That system call includes a parameter for setting the permissions of the remapped pages, but that parameter is currently ignored. For now, it is still true that all pages within a VMA carry the same page permissions. If an application tries to break that rule - by calling mprotect() on a subset of the pages within a VMA, for example - the VMA will be split into multiple VMAs, each of which imposes uniform permissions on its (reduced) part of the address space.
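
The splitting behavior can be observed from user space. What follows is a minimal sketch, assuming a Linux system where /proc/self/maps is readable; the helper names count_vmas_overlapping() and demo_vma_split() are invented for this illustration. It maps three anonymous pages, makes only the middle one read-only with mprotect(), and counts the VMAs covering the region before and after.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the VMAs of this process that overlap [start, end) by parsing
 * /proc/self/maps (Linux-specific). Returns -1 if the file cannot be read. */
int count_vmas_overlapping(unsigned long start, unsigned long end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[1024];
    int n = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        unsigned long lo, hi;

        /* Each line starts with "lo-hi" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &lo, &hi) == 2 && lo < end && hi > start)
            n++;
    }
    fclose(f);
    return n;
}

/* Map three anonymous pages, then make only the middle page read-only.
 * The VMA counts seen before and after are stored in *before and *after.
 * Returns 0 on success, -1 on failure. */
int demo_vma_split(int *before, int *after)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 3 * (size_t)page;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    unsigned long start;

    if (p == MAP_FAILED)
        return -1;
    start = (unsigned long)p;
    *before = count_vmas_overlapping(start, start + len);

    if (mprotect(p + page, (size_t)page, PROT_READ) != 0) {
        munmap(p, len);
        return -1;
    }
    *after = count_vmas_overlapping(start, start + len);
    munmap(p, len);
    return 0;
}
```

The count is of overlapping VMAs rather than exactly-contained ones because the kernel may merge a fresh anonymous mapping with a compatible neighbor; either way, one VMA covers the region before the mprotect() call and three cover it afterward.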

This behavior might just change, however. Paolo Giarrusso has recently dusted off an old patch (developed with Ingo Molnar) which allows remap_file_pages() to change page permissions as well. In theory, this change should be relatively straightforward. The page tables already hold the permissions for each page, so there is no need for any additional data structures to track the per-page permissions. The tricky part comes in when the page is swapped out. At that point, the kernel must take care to keep the permission information in the page table entry. A new VM_MANYPROTS VMA flag tells the kernel to use those saved permissions (instead of the permissions stored in the VMA itself) when the page is faulted back in.

To change page permissions, an application must pass the new MAP_CHGPROT flag to remap_file_pages(). Interestingly, the current patch does not support creating or operating on VM_MANYPROTS areas with mprotect(); there is, apparently, a disagreement over just what the semantics should be in that case.

The motivation behind this change is to improve performance for User-mode Linux. The UML code creates vast numbers (tens of thousands) of single-page mappings to simulate its own virtual memory environment. Each of those mappings creates a VMA. As the kernel works with all of those VMAs, memory-oriented operations slow down significantly. The memory overhead is also significant - each VMA requires at least 88 bytes of memory, 200 bytes on your editor's x86-64 system. Eliminating all of those VMAs can make UML much more efficient; Ingo Molnar reports that UML performance improves noticeably with the patch in place.

Ordinary Linux users could also benefit from this patch, however. Ulrich Drepper explained how the C library uses VMAs currently; it turns out that linking to a single shared library can create up to five separate VMAs. An application which brings in a large number of libraries - as many desktop applications do - can end up creating hundreds of VMAs for shared library mappings. That leads to many VMAs being created on the system; just how many can be seen by looking at the vm_area_struct line in /proc/slabinfo. Your editor's system currently has over 13,000 VMAs active, using about 2.5MB of memory.

Of the five VMAs potentially created by glibc for each shared library mapping, four are mappings into the same file with different permissions. The ability to have multiple permissions settings within a single VMA has the potential to collapse those four VMAs into one, leaving a single file mapping and an anonymous memory segment for each library. The result would be significantly reduced memory usage and faster kernel performance. Those benefits are likely to motivate the inclusion of this patch, sooner or later.

On the safety of Linux random numbers

Random number generation is an important operating system function. The generation of networking sequence numbers, cryptographic session keys, and public keys all depend on the creation of numbers which are sufficiently random that they cannot be guessed by an attacker. Weak random numbers can lead to session hijacking, disclosed secrets, forged identities, and predictable umber hulks. Any system which is serious about security has to be serious about creating good random numbers.

Doing that, however, can be a challenge for computers. As a general rule, designers of computers like to make hardware which does the same thing every time. Randomness is not normally a desirable feature in computer operation; for most systems, it is restricted to emacs responding to mistaken keystrokes. So, while there is no shortage of algorithms which can produce a random-seeming sequence of numbers, those numbers are not truly random. Restart the algorithm with the same initial conditions, and the same sequence of numbers will result.
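
The repeatability of algorithmic generators is easy to demonstrate. The sketch below uses a small linear congruential generator; it is an illustration chosen for brevity, not the algorithm the kernel uses, and the names (struct lcg, lcg_seed(), lcg_next()) are made up for this example.

```c
#include <stdint.h>

/* State of a tiny linear congruential generator (LCG). */
struct lcg {
    uint32_t state;
};

void lcg_seed(struct lcg *g, uint32_t seed)
{
    g->state = seed;
}

/* Advance the generator and return the new state. The multiplier and
 * increment are a commonly quoted LCG pair; arithmetic wraps modulo 2^32. */
uint32_t lcg_next(struct lcg *g)
{
    g->state = g->state * 1664525u + 1013904223u;
    return g->state;
}

/* Return 1 if two generators started from the same seed agree on their
 * first n outputs, 0 otherwise. */
int same_seed_same_sequence(uint32_t seed, int n)
{
    struct lcg a, b;
    int i;

    lcg_seed(&a, seed);
    lcg_seed(&b, seed);
    for (i = 0; i < n; i++)
        if (lcg_next(&a) != lcg_next(&b))
            return 0;
    return 1;
}
```

Anybody who learns the seed (or, for a generator this simple, a single output) can reproduce everything that follows, which is why a purely algorithmic generator needs unpredictable input from somewhere else.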

Linux implements a purely algorithmic random number generator, accessible as /dev/urandom. Its results are good enough for most purposes, but there are times when true randomness is needed. To that end, the kernel attempts to harvest randomness (called "entropy") from its environment. The timing between the keystrokes as your editor types this article, for example, exhibits some randomness. The same is true of, for example, the timing of disk interrupts. The lower bits of the system time stamp counter can also provide a bit of entropy. The kernel collects this entropy into a special pool of bits, and uses this entropy pool when true random numbers (obtained from /dev/random) are required. The amount of accumulated entropy is also tracked; if there is insufficient entropy in the pool to satisfy a random number request, the requesting process will block until the needed entropy arrives.
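
From user space, both devices are read like ordinary files. The following is a minimal sketch, assuming a Linux system; the function names are invented here, and the entropy estimate is read from /proc/sys/kernel/random/entropy_avail, a file not mentioned above.

```c
#include <stdio.h>

/* Fill buf with len bytes read from the device node at path, for example
 * "/dev/urandom". Returns 0 on success, -1 on any failure. */
int read_random_bytes(const char *path, unsigned char *buf, size_t len)
{
    FILE *f = fopen(path, "rb");
    size_t got;

    if (!f)
        return -1;
    setvbuf(f, NULL, _IONBF, 0);  /* do not pull extra bytes into a stdio buffer */
    got = fread(buf, 1, len, f);
    fclose(f);
    return got == len ? 0 : -1;
}

/* Return the kernel's current entropy estimate in bits, or -1 if the
 * estimate cannot be read. */
int entropy_estimate_bits(void)
{
    FILE *f = fopen("/proc/sys/kernel/random/entropy_avail", "r");
    int bits = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &bits) != 1)
        bits = -1;
    fclose(f);
    return bits;
}
```

Passing "/dev/random" instead of "/dev/urandom" selects the entropy-backed device; with the blocking behavior described above, that read may stall until enough entropy has been collected.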

One of the most common ways of putting entropy into the pool is to register interrupt handlers with the SA_SAMPLE_RANDOM flag. That flag tells the kernel that the indicated interrupt will arrive at random times, so its timing can be used to generate entropy. This interface has been in place for many years, but Matt Mackall has recently decided that it is not the best way to go. So he has posted a series of patches removing SA_SAMPLE_RANDOM from a large number of request_irq() calls. Most of the changes are not controversial. For example, a number of disk drivers set SA_SAMPLE_RANDOM, but also use the block-specific add_disk_randomness() function. Removing SA_SAMPLE_RANDOM in those cases eliminates a source of redundant "entropy." But Matt rekindled an old debate when one of his patches removed SA_SAMPLE_RANDOM from a set of network drivers.

The issue with network drivers is this: network interrupts are created by incoming and outgoing packets. If an attacker gets access to the network segment used by a target system, that attacker can observe the timing of packets entering and leaving that system. The attacker can also influence that timing by generating packets and sending them to the target in a carefully-timed manner. Over the years, a number of people have worried that a well-connected attacker might be able to guess the contents of the entropy pool and predict future random numbers.

Others argue that nobody has shown a scenario where the ability to observe and generate packet timings could actually lead to the compromise of the entropy pool. The actual timing of packets hitting a given system can only be reliably observed by another system on the same network segment. But network segments are almost never shared anymore; most systems tend to be plugged into switches, and a switch will hide packets and change their timing. In addition, anybody who is in a position to get onto a target system's network segment is quite likely to be able to obtain physical access to the target itself. At that point, the installation of a keystroke logger or hostile kernel patch seems easier than trying to guess where the entropy pool will go.

If we assume a particularly determined and masochistic attacker, however, then we can start to think about the other challenges this person will have to face. One is guessing the contents of the entropy pool at a given time. Such a guess will have to be made by observing the random numbers generated by the system, which can be done by looking at sequence numbers and keys emitted by that system. Then the attacker will have to find a way to reverse the algorithm (SHA-1) which is used to generate a given random number from the pool. That reversal will generate a large set of possible pool values which could all hash to the same value, so the attacker must be prepared to work with many simultaneous possibilities.
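
The "many simultaneous possibilities" problem is a simple counting argument: when the pool holds more bits than the hash emits, many pool states must share each output. The toy below makes that concrete with deliberately tiny, assumed sizes (a 16-bit "pool" and an 8-bit output) and an invented toy_hash() function; it is not SHA-1 and says nothing about how hard reversing SHA-1 actually is.

```c
#include <stdint.h>

/* Toy stand-in for the kernel's hash: folds a 16-bit "pool" into 8 bits.
 * This is NOT SHA-1; it only illustrates a many-to-one mapping. */
uint8_t toy_hash(uint16_t pool)
{
    uint32_t x = (uint32_t)pool * 2654435761u;  /* multiplicative scramble */

    return (uint8_t)(x >> 24);
}

/* Count how many of the 65536 possible pool values hash to `output`. */
unsigned count_candidates(uint8_t output)
{
    unsigned n = 0;
    uint32_t pool;

    for (pool = 0; pool <= 0xFFFFu; pool++)
        if (toy_hash((uint16_t)pool) == output)
            n++;
    return n;
}

/* Sum the candidate counts over every possible output value. */
unsigned long total_candidates(void)
{
    unsigned long total = 0;
    unsigned out;

    for (out = 0; out <= 0xFFu; out++)
        total += count_candidates((uint8_t)out);
    return total;
}
```

Whatever the hash function, the 65536 pool values are spread over 256 outputs, for an average of 256 candidates per observed output. With a real pool and a 160-bit hash the same arithmetic applies, only with far larger numbers.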

Once the pool has been guessed, it is time to predict its future value, as determined by the incoming entropy. The problem here is that the timing of packets on the wire does not exactly match the timing of interrupts within the kernel. There are delays within the network card, delays in DMAing a packet into main memory (which can be influenced by other memory traffic being generated in the system), variable interrupt handling times caused by critical sections which mask interrupts, cache misses, etc. Then there is the occasional mixing of bits from the time stamp counter, the value of which is not available to the attacker. All told, it is a fair stretch to go from an observation of traffic on the network to any sort of guess as to what the random number generator will produce next.

Meanwhile, many systems running as network servers have access to relatively few sources of entropy. If interrupt timings from network interfaces are made unavailable, those systems could run out of entropy altogether. Given that need, and given that most developers seem unworried about the potential weaknesses, the use of network timings is unlikely to go away anytime soon. What might happen, however, is the addition of some sort of runtime configuration option. Truly paranoid administrators could then disallow entropy from network interfaces. Those who are merely worried could, instead, use those timings, but reduce the amount of entropy which is credited to a network interface timing value. And most of the rest of us will probably leave things the way they are now.
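
A runtime option of that sort might look something like the following sketch. Everything here is hypothetical: the policy names, the struct toy_pool type, the two-bit credit per sample, and the trivial mixing step are all assumptions made for illustration, and none of it is taken from the kernel or from any posted patch.

```c
#include <stdint.h>

enum source { SRC_DISK, SRC_NETWORK };

/* Hypothetical administrator-selectable policy for network timings. */
enum net_policy { NET_FULL, NET_REDUCED, NET_NONE };

struct toy_pool {
    uint32_t mix;            /* toy mixing state; not cryptographic */
    unsigned entropy_bits;   /* running entropy estimate */
    enum net_policy policy;
};

#define FULL_CREDIT_BITS 2   /* assumed credit per timing sample */

/* Feed one interrupt timing sample into the pool, crediting entropy
 * according to the sample's source and the configured policy. */
void pool_add_timing(struct toy_pool *p, enum source src, uint32_t timestamp)
{
    unsigned credit = FULL_CREDIT_BITS;

    if (src == SRC_NETWORK) {
        if (p->policy == NET_NONE)
            return;                          /* paranoid: ignore the sample */
        if (p->policy == NET_REDUCED)
            credit = FULL_CREDIT_BITS / 2;   /* worried: use it, credit less */
    }
    /* Rotate the state and fold in the timestamp. */
    p->mix = ((p->mix << 5) | (p->mix >> 27)) ^ timestamp;
    p->entropy_bits += credit;
}
```

In the reduced setting the timing data still perturbs the pool; only the estimate of how much entropy it contributed shrinks, so blocking readers wait longer without the pool losing any input.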

[See also: this paper by Z. Gutterman, B. Pinkas and T. Reinman [PDF] on potential weaknesses in the Linux random number generator (thanks to Neil Harris).]

The Xen patches

The Xen hypervisor has been the source of large amounts of hype for some time now. The Xen paravirtualization scheme allows the running of guest operating systems, but the guest kernel must be ported explicitly to the "architecture" supported by the hypervisor. Paravirtualization provides strong isolation of virtual machines and can be quite fast, but it cannot run unmodified operating systems on its virtual machines. Many had expected support for Xen to be merged into the mainline by now, but that has not happened. In fact, it is only recently that the Xen patches have even been posted for developer review. A new set of Xen patches was posted on May 9, however, giving some insights into how Xen will affect the kernel.

The patches in the 35-part set fall into two broad categories. The first of those creates a new architecture (a subarchitecture of i386) and a port of the Linux kernel for that architecture. This is the code which is built into the modified kernel which can run as a Xen guest. Some of the more significant changes include:

  • Allowing for more interrupt vectors. Xen uses pseudo-interrupts for various types of communications with guests, so there needs to be room for more interrupt handlers.

  • An events mechanism has been built on top of the interrupt management code so that the hypervisor can pass information into guest systems. The virtual machines can also use event channels to communicate with each other.

  • Much of the i386 initialization code is split out so that subarchitectures can override it. Since a Xen-hosted kernel is not booting on cold hardware, and it will not use a number of hardware features, it will have to initialize itself differently than the host system does.

  • A version of the dynamic tick patch is used to keep idle virtual machines from wasting time servicing timer interrupts. There is also a separate timekeeping implementation which allows guest systems to perform their own timekeeping without having to involve the hypervisor.

  • A whole range of virtual devices has been provided. These include a console, virtual network interfaces, and virtual block devices.

Then, there are a couple of changes to the core (host) kernel:

  • A new set of synchronous bit operations, with names like synch_set_bit(). These operations differ from the regular bit operations in that they are always atomic. The regular bit operations will, when built for a uniprocessor system, use less-expensive, non-atomic operations. But that will not work well if a uniprocessor Xen guest runs on an SMP host.

  • The function apply_to_page_range() will call a given function for every page table entry in a given range. This patch seems worth merging ahead of the rest of Xen; currently, code iterating through PTEs duplicates a complicated set of functions for walking through the page table structure.
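
The appeal of apply_to_page_range() is that the table walk lives in one place while callers supply only the per-entry work. The user-space toy below sketches that callback pattern with an invented two-level table; the toy_ names, the table sizes, and the "writable" bit are all assumptions for illustration and do not reflect the real kernel data structures or the patch itself.

```c
#include <stddef.h>

#define ENTRIES_PER_TABLE 4
#define TOY_PAGE_SIZE     4096UL
#define TOY_PTE_WRITABLE  0x1UL

typedef unsigned long toy_pte;

struct toy_pte_table {
    toy_pte pte[ENTRIES_PER_TABLE];
};

struct toy_pgd {
    struct toy_pte_table *table[ENTRIES_PER_TABLE];
};

/* Callback invoked once per page table entry in the requested range. */
typedef int (*toy_pte_fn)(toy_pte *pte, unsigned long addr, void *data);

/* Walk the two-level toy table and call fn for every entry covering
 * [addr, addr + size). Stops at the first nonzero return from fn.
 * Returns -1 if the range touches an unpopulated part of the table. */
int toy_apply_to_page_range(struct toy_pgd *pgd, unsigned long addr,
                            unsigned long size, toy_pte_fn fn, void *data)
{
    unsigned long end = addr + size;

    for (; addr < end; addr += TOY_PAGE_SIZE) {
        unsigned long idx = addr / TOY_PAGE_SIZE;
        unsigned long top = idx / ENTRIES_PER_TABLE;
        unsigned long low = idx % ENTRIES_PER_TABLE;
        int err;

        if (top >= ENTRIES_PER_TABLE || !pgd->table[top])
            return -1;
        err = fn(&pgd->table[top]->pte[low], addr, data);
        if (err)
            return err;
    }
    return 0;
}

/* Example callback: clear the toy "writable" bit and count the visit. */
int clear_writable(toy_pte *pte, unsigned long addr, void *data)
{
    (void)addr;
    *pte &= ~TOY_PTE_WRITABLE;
    ++*(int *)data;
    return 0;
}
```

A caller that wants to write-protect four pages passes clear_writable() and a counter; it never needs to know how many levels the table has or where one lower-level table ends and the next begins.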

There has been a fair amount of comment on the patches, but few objections of great substance. Instead, the Xen developers look to have a long list of nits to address. The most fundamental complaints, perhaps, concern the network driver, which includes its own, built-in ARP implementation. The Xen developers defend this code as being necessary for fast migration of Xen guests. If the ARP code were moved to a more appropriate place - user space, for example - a migration which happens in milliseconds could turn into a one-second (or longer) affair, and that is not a cost the Xen folks want to pay. The addition of files to /proc is also unpopular, but that code was already on the list of things to fix.

When Xen might actually merge is still unclear. There is work to be done still, and it is a large body of code for the developers to work through. But that date is getting closer, now that there is code to discuss.

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds