Brief items
The current stable 2.6 kernel is 2.6.16.15,
released on May 9. It adds
four security patches, all of which apply to the SCTP code. Previously,
2.6.16.14 was released on
May 5 with a patch for an smbfs problem which could enable a process
to escape a chroot environment.
The current 2.6 prepatch remains 2.6.17-rc3. A few hundred patches
have been merged into the mainline git repository since -rc3 was released;
they are mostly fixes, but there is also a set of splice()
improvements and the ability to add attribute groups to class_device
entries at registration time.
There have been no -mm releases over the last week.
Kernel development news
Actually, I think the system is working quite well. We've got a
quick route for getting bug fixes and security fixes to users, and
a shorter devel cycle helping distro folks get more regular drops
from upstream. This particular patch [2.6.16.14] applies all the
way back to the beginning of git time (over a year ago), and I'm
sure earlier. So it's hard to conclude it's a byproduct of the
release cycles.
--
Chris Wright
The virtual memory area (VMA) structure (struct vm_area_struct) is
one of the core building blocks of the Linux virtual memory code. Each VMA
describes a piece of a process's address space; that piece is a (usually
contiguous) series of pages from a single backing store (a file or, for
anonymous memory, swap space) with a uniform set of access permissions. Each
VMA maintains information on the address space covered, pointers to the
backing store, permission information, a set of function pointers for
operations on that VMA, and other housekeeping information.
Before the 2.6 kernel was released, all VMAs mapped a range of address
space onto a contiguous range of pages in the backing store. Things got a
bit more complicated with the addition of the remap_file_pages() system
call, which allows applications to rearrange the mapping of memory
pages to backing store pages within a VMA. That system call includes a
parameter for setting the permissions of the remapped pages, but that
parameter is currently ignored. For now, it is still true that all pages
within a VMA carry the same page permissions. If an application tries to
break that rule - by calling mprotect() on a subset of the pages
within a VMA, for example - the VMA will be split into multiple VMAs, each
of which imposes uniform permissions on its (reduced) part of the address
space.
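The splitting rule can be illustrated with a small user-space model. This is only a sketch, not kernel code: the Vma tuple, the mprotect_model() function, and the 4096-byte page size are all made up for the illustration, and the model ignores the merging of adjacent, compatible VMAs which a real kernel would also perform.

```python
from typing import List, NamedTuple

PAGE = 4096  # assumed page size for this illustration


class Vma(NamedTuple):
    start: int   # first byte covered (page-aligned)
    end: int     # one past the last byte covered
    prot: str    # uniform permissions for the whole area, e.g. "rw-"


def mprotect_model(vmas: List[Vma], start: int, end: int, prot: str) -> List[Vma]:
    """Return a new VMA list after changing [start, end) to `prot`.

    A VMA which only partly overlaps the range is split, so that every
    resulting VMA still carries a single set of permissions.
    """
    result = []
    for vma in vmas:
        lo, hi = max(vma.start, start), min(vma.end, end)
        if lo >= hi or vma.prot == prot:
            result.append(vma)                            # not affected
            continue
        if vma.start < lo:
            result.append(Vma(vma.start, lo, vma.prot))   # head keeps old prot
        result.append(Vma(lo, hi, prot))                  # changed middle part
        if hi < vma.end:
            result.append(Vma(hi, vma.end, vma.prot))     # tail keeps old prot
    return result


# One eight-page read-write area; make its third page read-only.
area = [Vma(0x10000, 0x10000 + 8 * PAGE, "rw-")]
after = mprotect_model(area, 0x10000 + 2 * PAGE, 0x10000 + 3 * PAGE, "r--")
for v in after:
    print(f"{v.start:#x}-{v.end:#x} {v.prot}")
```

Changing one page in the middle of the area turns one VMA into three, which is how a process making many small permission changes can end up with a very long VMA list.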
This behavior might just change, however. Paolo Giarrusso has recently dusted off an old patch
(developed with Ingo Molnar) which allows remap_file_pages() to
change page permissions as well. In theory, this change should be
relatively straightforward. The page tables already hold the permissions
for each page, so there is no need for any additional data structures to
track the per-page permissions. The tricky part comes in when the page is
swapped out. At that point, the kernel must take care to keep the
permission information in the page table entry. A new
VM_MANYPROTS VMA flag tells the kernel to use those saved
permissions (instead of the permissions stored in the VMA itself) when the
page is faulted back in.
To change page permissions, an application must pass the new
MAP_CHGPROT flag to remap_file_pages().
Interestingly, the current patch does not support creating or operating on
VM_MANYPROTS areas with mprotect(); there is, apparently,
a disagreement over just what the semantics should be in that case.
The motivation behind this change is to improve performance for User-mode
Linux. The UML code creates vast numbers (tens of thousands) of
single-page mappings to simulate its own virtual memory environment. Each
of those mappings creates a VMA. As the kernel works with all of those
VMAs, memory-oriented operations slow down significantly. The memory
overhead is also significant - each VMA requires at least 88 bytes of
memory (200 bytes on your editor's x86-64 system). Eliminating all of those
VMAs can make UML much more efficient; Ingo Molnar reports that
UML performance improves noticeably with the patch in place.
Ordinary Linux users could also benefit from this patch, however. Ulrich
Drepper explained how the C library uses
VMAs currently; it turns out that linking to a single shared library can
create up to five
separate VMAs. An application which brings in a large number of libraries
- as many desktop applications do - can end up creating hundreds of VMAs
for shared library mappings. That leads to many VMAs being created on the
system; just how many can be seen by looking at the vm_area_struct
line in /proc/slabinfo. Your editor's system currently has over
13,000 VMAs active, using about 2.5MB of memory.
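For those wanting to do the arithmetic on their own systems, here is a minimal sketch of a parser for that line. It assumes the column layout used by version 2.x of the slabinfo format (name, active objects, total objects, object size, ...); the vma_slab_usage() helper and the SAMPLE text are invented for the example, with numbers chosen only to resemble the figures quoted above.

```python
def vma_slab_usage(slabinfo_text: str):
    """Return (active_objects, object_size, total_bytes) for vm_area_struct.

    Returns None if the cache is not listed. The byte count covers the
    objects themselves and ignores any slab bookkeeping overhead.
    """
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "vm_area_struct":
            active, objsize = int(fields[1]), int(fields[3])
            return active, objsize, active * objsize
    return None


# Hypothetical sample, not taken from a real system.
SAMPLE = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
vm_area_struct     13000  13300    200   20    1 : tunables  120   60    8 : slabdata    665    665      0
"""

active, objsize, total = vma_slab_usage(SAMPLE)
print(f"{active} VMAs x {objsize} bytes = {total / 2**20:.2f} MiB")
```

On a live system the text would come from reading /proc/slabinfo, which may require root privileges.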
Of the five VMAs potentially created by glibc for each shared library
mapping, four are mappings into the same file with different permissions.
The ability to have multiple permissions settings within a single VMA has
the potential to collapse those four VMAs into one, leaving a single file
mapping and an anonymous memory segment for each library. The result would
be significantly reduced memory usage and faster kernel performance. Those
benefits are likely to motivate the inclusion of this patch, sooner or
later.
Random number generation is an important operating system function. The
generation of networking sequence numbers, cryptographic session keys, and
public keys all depend on the creation of numbers which are sufficiently
random that they cannot be guessed by an attacker. Weak random numbers can
lead to session hijacking, disclosed secrets, forged identities, and
predictable umber hulks. Any system which is serious about security has to
be serious about creating good random numbers.
Doing that, however, can be a challenge for computers. As a general rule,
designers of computers like to make hardware which does the same thing
every time. Randomness is not normally a desirable feature in computer
operation; for most systems, it is restricted to emacs responding to
mistaken keystrokes. So, while there is no shortage of algorithms which
can produce a random-seeming sequence of numbers, those numbers are not
truly random. Restart the algorithm with the same initial conditions, and
the same sequence of numbers will result.
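That determinism is easy to demonstrate. The sketch below uses the generator from Python's standard library (a Mersenne Twister, which has nothing to do with the kernel's generator); the sequence() helper and the seed value are arbitrary choices made for the example.

```python
import random


def sequence(seed: int, count: int = 5):
    """Return `count` pseudo-random byte values from a generator seeded with `seed`."""
    rng = random.Random(seed)          # private generator, fully determined by seed
    return [rng.randrange(256) for _ in range(count)]


first = sequence(2006)
second = sequence(2006)                # "restart" with the same initial conditions
print(first)
print(second)
print("identical:", first == second)
```

An attacker who learns (or guesses) the seed can thus reproduce every number the generator will ever emit.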
Linux implements a purely algorithmic random number generator, accessible
as /dev/urandom. Its results are good enough for most purposes,
but there are times when true randomness is needed. To that end, the
kernel attempts to harvest randomness (called "entropy") from its
environment. The timing between the keystrokes as your editor types this
article, for example, exhibits some randomness. The same is true of, for
example, the timing of disk interrupts. The lower bits of the system time
stamp counter can also provide a bit of entropy. The kernel collects this
entropy into a special pool of bits, and uses this entropy pool when true
random numbers (obtained from /dev/random) are required. The
amount of accumulated entropy is also tracked; if there is insufficient
entropy in the pool to satisfy a random number request, the requesting
process will block until the needed entropy arrives.
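The accounting side of this scheme can be sketched in a few lines. What follows is a toy model, not the kernel's algorithm: the EntropyPoolModel class and its method names are invented, samples are mixed in with a simple SHA-1 chain purely for brevity, and a read which the real /dev/random would block on simply returns None.

```python
import hashlib


class EntropyPoolModel:
    """Toy entropy pool which tracks an estimate of its entropy in bits."""

    def __init__(self):
        self._state = b"\x00" * 20
        self.entropy_bits = 0

    def add_sample(self, sample: bytes, credit_bits: int) -> None:
        """Mix a sample (an interrupt timing, say) in and credit its entropy."""
        self._state = hashlib.sha1(self._state + sample).digest()
        self.entropy_bits += credit_bits

    def read(self, nbytes: int):
        """Return nbytes of output, or None where a real reader would block."""
        needed = nbytes * 8
        if needed > self.entropy_bits:
            return None                       # insufficient entropy: would block
        self.entropy_bits -= needed           # debit what is handed out
        out, counter = b"", 0
        while len(out) < nbytes:
            out += hashlib.sha1(self._state + counter.to_bytes(4, "big")).digest()
            counter += 1
        # Change the pool state so that later reads produce different output.
        self._state = hashlib.sha1(self._state + b"feedback").digest()
        return out[:nbytes]


pool = EntropyPoolModel()
print(pool.read(4))                           # None: nothing collected yet
pool.add_sample(b"keystroke at t=1234567", credit_bits=16)
pool.add_sample(b"disk irq at t=1239001", credit_bits=16)
print(pool.read(4) is not None, pool.entropy_bits)
```

The credit_bits parameter is where the interesting policy lives: how many bits of entropy a given timing sample is really worth is a matter of estimation.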
One of the most common ways of putting entropy into the pool is to register
interrupt handlers with the SA_SAMPLE_RANDOM flag. That flag
tells the kernel that the indicated interrupt will arrive at random times,
so its timing can be used to generate entropy. This interface has been in
place for many years, but Matt Mackall has recently decided that it is not
the best way to go. So he has posted a series
of patches removing SA_SAMPLE_RANDOM from a large number of
request_irq() calls.
Most of the changes are not controversial. For example, a number of disk
drivers set SA_SAMPLE_RANDOM, but also use the block-specific
add_disk_randomness() function. Removing
SA_SAMPLE_RANDOM in those cases eliminates a source of redundant
"entropy." But Matt rekindled an old debate
when one of his patches removed SA_SAMPLE_RANDOM from a set of
network drivers.
The issue with network drivers is this: network interrupts are created by
incoming and outgoing packets. If an attacker gets access to the network
segment used by a target system, that attacker can observe the timing of
packets entering and leaving that system. The attacker can also influence
that timing by generating packets and sending them to the target in a
carefully-timed manner. Over the years, a number of people have worried
that a well-connected attacker might be able to guess the contents of the
entropy pool and predict future random numbers.
Others argue that nobody has shown a scenario where the ability to observe
and generate packet timings could actually lead to the compromise of the
entropy pool. The actual timing of packets hitting a given system can only
be reliably observed by another system on the same network segment. But
network segments are almost never shared anymore; most systems tend to be
plugged into switches, and a switch will hide packets and change their
timing. In addition, anybody who is in a position to get onto a target
system's network segment is quite likely to be able to obtain physical
access to the target itself. At that point, the installation of a
keystroke logger or hostile kernel patch seems easier than trying to guess
where the entropy pool will go.
If we assume a particularly determined and masochistic attacker, however,
then we can start to think about the other challenges this person will have
to face. One is guessing the contents of the entropy pool at a given
time. Such a guess will have to be made by observing the random numbers
generated by the system, which can be done by looking at sequence numbers
and keys emitted by that system. Then the attacker will have to find a way
to reverse the algorithm (SHA-1) which is used to generate a given random
number from the pool. That reversal will generate a large set of possible
pool values which could all hash to the same value, so the attacker must be
prepared to work with many simultaneous possibilities.
Once the pool has been guessed, it is time to predict its future value, as
determined by the incoming entropy. The problem here is that the timing of
packets on the wire does not exactly match the timing of interrupts within
the kernel. There are delays within the network card, delays in DMAing a
packet into main memory (which can be influenced by other memory traffic
being generated in the system), variable interrupt handling times caused by
critical sections which mask interrupts, cache misses, etc. Then there is
the occasional mixing of bits from the time stamp counter, the value of
which is not available to the attacker. All told, it is a fair stretch to
go from an observation of traffic on the network to any sort of guess as to
what the random number generator will produce next.
Meanwhile, many systems running as network servers have access to
relatively few sources of entropy. If interrupt timings from network
interfaces are made unavailable, those systems could run out of entropy
altogether. Given that need, and given that most developers seem unworried
about the potential weaknesses, the use of network timings is unlikely to
go away anytime soon. What might happen, however, is the addition of some
sort of runtime configuration option. Truly paranoid administrators could
then disallow entropy from network interfaces. Those who are merely
worried could, instead, use those timings, but reduce the amount of entropy
which is credited to a network interface timing value. And most of the
rest of us will probably leave things the way they are now.
[See also: this paper by
Z. Gutterman, B. Pinkas and T. Reinman [PDF] on potential weaknesses in
the Linux random number generator (thanks to Neil Harris).]
The Xen hypervisor has been the source of large amounts of hype for some
time now. The Xen paravirtualization scheme allows the running of guest
operating systems, but the guest kernel must be ported explicitly to the
"architecture" supported by the hypervisor. Paravirtualization provides
strong isolation of virtual machines and can be quite fast, but it cannot
run unmodified operating systems on its virtual machines. Many had
expected support for Xen to be merged into the mainline by now, but that
has not happened. In fact, it is only recently that the Xen patches have
even been posted for developer review. A
new set of Xen patches was
posted on May 9, however, giving some insights into how Xen will
affect the kernel.
The patches in the 35-part set fall into two broad categories. The first
of those creates a new architecture (a subarchitecture of i386) and a port
of the Linux kernel for that architecture. This is the code which is built
into the modified kernel which can run as a Xen guest. Some of the more
significant changes include:
- Allowing for more interrupt vectors. Xen uses pseudo-interrupts for
various types of communications with guests, so there needs to be room
for more interrupt handlers.
- An events mechanism has been built on top of the interrupt management
code so that the hypervisor can pass information into guest systems.
The virtual machines can also use event channels to communicate with
each other.
- Much of the i386 initialization code is split out so that
subarchitectures can override it. Since a Xen-hosted kernel is not
booting on cold hardware, and it will not use a number of hardware
features, it will have to initialize itself differently than the host
system does.
- A version of the dynamic
tick patch is used to keep idle virtual machines from wasting time
servicing timer interrupts. There is also a separate timekeeping
implementation which allows guest systems to perform their own
timekeeping without having to involve the hypervisor.
- A whole range of virtual devices has been provided. These include a
console, virtual network interfaces, and virtual block devices.
Then, there are a couple of changes to the core (host) kernel:
- A new set of synchronous bit operations, with names like
synch_set_bit(). These operations differ from the regular
bit operations in that they are always atomic. The regular bit
operations will, when built for a uniprocessor system, use
less-expensive, non-atomic operations. But that will not work well if
a uniprocessor Xen guest runs on an SMP host.
- The function apply_to_page_range() will call a given function
for every page table entry in a given range. This patch seems worth
merging ahead of the rest of Xen; currently, code iterating through
PTEs duplicates a complicated set of functions for walking through the
page table structure.
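The idea behind apply_to_page_range() - hide the page table walk behind a callback - can be shown with a user-space toy. Nothing here is the kernel's interface: apply_to_range(), its callback signature, and the dictionary-based two-level "page table" are assumptions made for the illustration, and the choice to create missing second-level tables on demand belongs to this sketch.

```python
PAGE_SHIFT = 12          # 4096-byte pages
PTES_PER_TABLE = 1024    # entries per table, as with two-level i386 paging


def apply_to_range(page_dir: dict, addr: int, size: int, fn) -> None:
    """Call fn(table, index, address) for every page slot in [addr, addr + size).

    `page_dir` maps a directory index to a second-level table (also a dict).
    """
    if size <= 0:
        return
    page = addr >> PAGE_SHIFT
    last = (addr + size - 1) >> PAGE_SHIFT
    while page <= last:
        dir_index, pte_index = divmod(page, PTES_PER_TABLE)
        table = page_dir.setdefault(dir_index, {})   # allocate table if missing
        fn(table, pte_index, page << PAGE_SHIFT)
        page += 1


visited = []


def mark_readonly(table: dict, index: int, address: int) -> None:
    """Example callback: set one entry and remember which address it covers."""
    table[index] = "r--"
    visited.append(address)


page_dir = {}
# Three pages which straddle the boundary between two second-level tables.
apply_to_range(page_dir, 0x3FF000, 3 * 4096, mark_readonly)
print([hex(a) for a in visited])
print(page_dir)
```

The caller supplies only the per-entry operation; the bookkeeping needed to cross from one table to the next lives in one place rather than being repeated in every walker.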
There has been a fair amount of comment on the patches, but few objections
of great substance. Instead, the Xen developers look to have a long list
of nits to address. The most fundamental complaints, perhaps, concern the
network driver, which includes its own, built-in ARP implementation. The
Xen developers defend this code as being necessary for fast migration of
Xen guests. If the ARP code were moved to a more appropriate place - user
space, for example - a migration which happens in milliseconds could turn
into a one-second (or longer) affair, and that is not a cost the Xen folks
want to pay. The addition of files to /proc is also unpopular,
but that code was already on the list of things to fix.
When Xen might actually be merged is still unclear. There is still work to be
done, and it is a large body of code for the developers to work through.
But that date is getting closer, now that there is code to discuss.
Page editor: Jonathan Corbet