Brief items
The current 2.6 prepatch is 2.6.9-rc4, which was released by Linus on
October 10. Says Linus:
Ok, trying to make ready for the real 2.6.9 in a week or so, so
please give this a beating, and if you have pending patches, please
hold on to them for a bit longer, until after the 2.6.9 release. It
would be good to have a 2.6.9 that doesn't need a dot-release
immediately ;)
Changes in this set include a number of architecture updates, an ACPI
update, Linus's kernel management style document, some networking tweaks,
and lots of fixes. See the long-format changelog for the details.
Linus's BitKeeper repository contains a handful of serious fixes; it looks
like very few patches will be accepted until 2.6.9 comes out.
The current prepatch from Andrew Morton is 2.6.9-rc4-mm1.
Recent changes to -mm include the removal of lockmeter (it was interfering
with some of the latency work), a buddy allocator rework, a number of
reiserfs error handling improvements, and various architecture updates.
The current 2.4 prepatch is 2.4.28-pre4, released by Marcelo on October 8.
The number of new patches is small; they include some networking tweaks, a
serial ATA update, and various fixes.
Kernel development news
I don't know what exactly you will receive from Linus and Alan, but here's
a reply from me (and I do have code in quite a few places in the tree):
Sod Off.
If you need it in writing and notarized, that could be arranged.
-- Al Viro, not tempted by Jeff Merkey's offer.
Using Linux systems for realtime tasks has long been an area of interest.
In the last couple of weeks, a number of projects working to implement
realtime response have posted their work. This article looks at the
patches posted recently to get a sense for where the realtime projects are
headed.
The realtime LSM
A relatively simple contribution is the realtime security module by Torben Hohn and
Jack O'Quin. This module does not actually add any new realtime features
to the kernel; instead, it uses the LSM hooks to let users belonging to a
specific group use more of the system's resources. In particular, it
adds the CAP_SYS_NICE, CAP_IPC_LOCK, and
CAP_SYS_RESOURCE capabilities to the selected group. These
capabilities allow the affected processes to raise their priority, lock
memory into RAM, and generally to exceed resource limits. Granting
capabilities in this way goes somewhat beyond the usual "restrictive hooks
only" practice for security modules, but there have not been any complaints
on that score.
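The effect of the module can be pictured as OR-ing three capability bits into the effective set of any process whose group matches. What follows is a user-space sketch of that idea, not the module's code: struct toy_cred, RT_CAPS, and toy_capable() are names invented for the illustration, and only the three capability numbers are taken from <linux/capability.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define CAP_IPC_LOCK      14
#define CAP_SYS_NICE      23
#define CAP_SYS_RESOURCE  24

#define CAP_TO_MASK(cap)  ((uint32_t)1 << (cap))

/* Hypothetical, simplified stand-in for a process's credentials. */
struct toy_cred {
    unsigned int gid;         /* group the process belongs to */
    uint32_t cap_effective;   /* capabilities it already holds */
};

/* The capabilities handed to the realtime group, per the text above. */
static const uint32_t RT_CAPS =
    CAP_TO_MASK(CAP_SYS_NICE) |
    CAP_TO_MASK(CAP_IPC_LOCK) |
    CAP_TO_MASK(CAP_SYS_RESOURCE);

/* Returns 1 if the process may use 'cap', 0 otherwise. */
int toy_capable(const struct toy_cred *cred, unsigned int rt_gid, int cap)
{
    assert(cap >= 0 && cap < 32);
    uint32_t mask = cred->cap_effective;
    if (cred->gid == rt_gid)
        mask |= RT_CAPS;      /* group members gain the realtime set */
    return (mask & CAP_TO_MASK(cap)) != 0;
}
```

A process in the chosen group passes the check for CAP_SYS_NICE even with an empty capability set of its own, while a process outside the group is judged on its own capabilities alone.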
MontaVista's patch
The event which really stirred up the discussion, however, was the posting
of the realtime kernel patch set by
MontaVista's Sven-Thorsten Dietrich. This highly intrusive patch attempts
to minimize system response latency by taking the preemptible kernel
approach to its limit. In comparison, the current preemption approach,
which most distributors consider too risky to use, is a half
measure at best.
MontaVista's patch begins by adopting the "IRQ threads" patch posted by
Ingo Molnar. This patch moves the running of most interrupt handlers into a
separate kernel thread which competes with the others for processor time.
Once that is done, interrupt handlers become preemptible and are far less
likely to stall the system for long periods of time.
The biggest source of latency in the kernel then becomes critical sections
protected by spinlocks. So why not make those sections preemptible as
well? To that end, the PMutex
patch has been adapted to the 2.6 kernel. This patch implements
blocking mutexes, similar to the existing kernel semaphores. The PMutex
version, however, has a simple priority inheritance mechanism; processes
holding a mutex can have their priority bumped up temporarily so that they
get their work done and release the mutex as quickly as possible. Among
other things, this approach helps to minimize priority inversion problems.
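Priority inheritance is easier to see in a small model. The sketch below is a single-threaded toy and makes no claim about how PMutex is written: struct toy_task, struct toy_pmutex, and both functions are invented names, larger numbers are assumed to mean higher priority, and there is no wait queue or nesting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy task: a larger number means a higher priority here. */
struct toy_task {
    int base_prio;   /* priority the task was given */
    int prio;        /* effective priority, possibly boosted */
};

struct toy_pmutex {
    struct toy_task *owner;   /* NULL when the mutex is free */
};

/* Returns 1 if the lock was taken, 0 if the caller would have to block. */
int toy_pmutex_lock(struct toy_pmutex *m, struct toy_task *t)
{
    assert(m != NULL && t != NULL);
    if (m->owner == NULL) {
        m->owner = t;
        return 1;
    }
    /* Contended: lend the waiter's priority to the owner if it is higher. */
    if (t->prio > m->owner->prio)
        m->owner->prio = t->prio;
    return 0;
}

void toy_pmutex_unlock(struct toy_pmutex *m)
{
    assert(m != NULL && m->owner != NULL);
    m->owner->prio = m->owner->base_prio;   /* drop any inherited boost */
    m->owner = NULL;
}
```

When a high-priority task finds the mutex held by a low-priority one, the holder runs at the waiter's priority until it unlocks, so a medium-priority task cannot keep it off the processor in the meantime. That is the priority inversion case the text refers to.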
The biggest change is the replacement of most spinlocks in the system with the
new mutexes; the patch uses a set of preprocessor macros to turn
spinlock_t, and the operations on spinlocks, into their mutex
equivalents. In one step, most critical sections become preemptible and no
longer are part of the latency problem. As an added bonus, the moving of
interrupt handlers to their own thread means that interrupt handlers can no
longer deadlock with non-interrupt code when contending for the same lock;
that means that it is no longer necessary to disable interrupts when taking
a lock which might also be used by an interrupt handler.
There are, of course, a few nagging little problems to deal with. Some
code in the system really shouldn't be preempted while holding a
lock. In particular, code which might be in the middle of programming
hardware registers, the page table handling code, and the scheduler itself
need to be allowed to do their job in peace. It is hard, after all, to
imagine a scenario where preempting the scheduler will lead to good
things. So a number of places in the kernel cannot be switched from
spinlocks to the new mutexes.
The realtime patch attempts to handle these cases by creating a new
_spinlock_t type, which is just the old spinlock_t under
a newer, uglier name. The spinlock primitives have been renamed in the
same way (e.g. _spin_lock()). Code which truly needs an old-style
spinlock is then hacked up to use the new names, and it functions as
before. The exception is a set of files where the developers were able to include
<linux/spin_undefs.h>, which restores the old functionality
under the old names. The header file rightly describes this technique as
"a dirty, dirty hack." But it does make the patch smaller.
Needless to say, the task of sifting through every lock in the kernel to
figure out which ones cannot be changed to mutexes is a long and
error-prone process. In fact, the job is nowhere near complete, and the
MontaVista patch is, by its authors' admission, marginally stable on
uniprocessor systems, unstable on SMP systems, and unrunnable on
hyperthreaded systems. But you have to start somewhere.
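The renaming scheme described in this section can be illustrated with a toy, single-threaded model. Only the names spinlock_t, _spinlock_t, spin_lock(), and _spin_lock() come from the text; struct toy_mutex, the lock bodies, and the two example locks are assumptions made for the sketch, and the real patch's macros are certainly more involved.

```c
#include <assert.h>

/* Old-style spinning lock, kept under the renamed type (toy version). */
typedef struct { int locked; } _spinlock_t;
void _spin_lock(_spinlock_t *l)   { assert(!l->locked); l->locked = 1; }
void _spin_unlock(_spinlock_t *l) { assert(l->locked);  l->locked = 0; }

/* Hypothetical sleeping mutex standing in for the PMutex type. */
struct toy_mutex { int held; };
void toy_mutex_lock(struct toy_mutex *m)   { assert(!m->held); m->held = 1; }
void toy_mutex_unlock(struct toy_mutex *m) { assert(m->held);  m->held = 0; }

/* The substitution layer: existing callers are redirected to mutexes. */
#define spinlock_t      struct toy_mutex
#define spin_lock(l)    toy_mutex_lock(l)
#define spin_unlock(l)  toy_mutex_unlock(l)

/* Ordinary code, unchanged in source, now gets a preemptible mutex. */
spinlock_t driver_lock;
/* Code which must never be preempted is edited to use the raw names. */
_spinlock_t runqueue_lock;

/* Takes both locks and reports whether both were held at once. */
int toy_demo(void)
{
    spin_lock(&driver_lock);       /* expands to toy_mutex_lock() */
    _spin_lock(&runqueue_lock);    /* still a true spinlock */
    int both = driver_lock.held && runqueue_lock.locked;
    _spin_unlock(&runqueue_lock);
    spin_unlock(&driver_lock);
    return both;
}
```

The cost of this design is visible even in the toy: every lock which must stay a spinlock requires an edit at its declaration and at each call site.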
Ingo's fully preemptible kernel
Ingo Molnar liked that start, but had some issues with it. So he went off
for two days and created a better version,
which has been folded into his "voluntary preemption" series of patches.
Ingo takes the same basic approach used by the MontaVista patch, but with
some changes:
- The PMutex patch is not used; instead, Ingo uses the existing
kernel semaphore implementation. His argument is that semaphores work
on all architectures, while PMutexes currently only work on x86. It
would be better to hack priority inheritance into the existing
semaphores, and thus make it available to all of the current semaphore
users as well as those converted over from spinlocks. Ingo's patch
does not currently implement priority inheritance, however.
- Through some preprocessor trickery, Ingo was able to avoid changing
all of the spinlock calls. Preserving "old style" spinlock behavior
is simply a matter of changing the type of the lock to
raw_spinlock_t and, perhaps, changing the initialization of
the lock. The actual spin_lock() and related calls do the
right thing with either a "raw" spinlock or a new semaphore-based
mutex. Think of it as a sort of poor man's polymorphic lock type.
- Ingo found a much larger set of core locks which must use the true
spinlock type. This was done partly through a set of checks built
into the kernel which complain when the wrong type of lock is being
used. With Ingo's patch, some 90 spinlocks remain in the kernel (in
comparison, MontaVista preserved about 30 of them).
Even so, thanks to the reworked locking primitives, Ingo's patch is
much smaller than the MontaVista patch.
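The "poor man's polymorphic lock type" from the second item above can be approximated in standard C. This sketch uses C11's _Generic as a stand-in; it is not a description of the mechanism in Ingo's patch, and the toy lock bodies and the two counters are invented so that the dispatch can be observed.

```c
#include <assert.h>

typedef struct { int locked; } raw_spinlock_t;   /* true spinlock (toy) */
typedef struct { int held;   } spinlock_t;       /* semaphore-based mutex (toy) */

int raw_calls, mutex_calls;   /* counters, only to make the dispatch visible */

void toy_raw_lock(raw_spinlock_t *l)
{
    assert(!l->locked);
    l->locked = 1;
    raw_calls++;
}

void toy_mutex_lock(spinlock_t *l)
{
    assert(!l->held);
    l->held = 1;
    mutex_calls++;
}

/* One spin_lock() name; the lock's declared type picks the implementation. */
#define spin_lock(l) _Generic((l),            \
        raw_spinlock_t *: toy_raw_lock,       \
        spinlock_t *:     toy_mutex_lock)(l)
```

With this arrangement, keeping a lock as a real spinlock means changing only its declaration to raw_spinlock_t; the spin_lock() calls on it stay as they are, which is why this approach touches far fewer lines than renaming every call.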
Ingo would like to reduce the number of remaining spinlocks, but he warns
that a number of "core infrastructure" changes will be required first. In
particular, code using read-copy-update must continue to use spinlocks for
now; allowing code which holds a reference to an RCU-protected structure to
be preempted would break one of the core RCU assumptions. MontaVista has
apparently taken a stab at the RCU issue, but does not yet have a patch
which they are ready to circulate.
Ingo continues to post patches at a furious rate; things are
evolving quickly on this front.
RTAI/Fusion
Meanwhile, the real realtime people point out that none of this work
provides deterministic, quantifiable latencies. It does help to reduce
latency, but it cannot provide guarantees. A "realtime" system without
latency guarantees may be suitable for a number of tasks, but it still
isn't up to the challenge of running a nuclear power plant, an airliner's
flight management system, or an extra-fast IRC spambot. If it absolutely,
positively must respond within a few microseconds, you need a real realtime
system.
There are two longstanding Linux projects which are intended to provide
this sort of deterministic response: RTLinux and RTAI. There is the obligatory
bad blood between the two, complicated by a software patent held by the
RTLinux camp.
The RTLinux approach (and the subject of the patent) is to put the hardware
under the control of a small, hard realtime system, and to run the whole of
Linux as a single, low-priority task under the realtime system. Access to
the realtime mode is obtained by writing a kernel module which uses a
highly restricted set of primitives. Channels have been provided for
communicating between the realtime module and the normal Linux user space.
Since the realtime side of the system controls the hardware and gets first
claim on its resources, it is possible to guarantee a maximum response
time.
RTAI initially used that approach, but has since shifted to running under
the Adeos kernel. Adeos
is essentially a "hypervisor" system which runs both Linux and a
real-time system as subsidiary tasks, and allows the two to communicate.
It allows a pecking order to be established between the secondary operating
systems so that the realtime component can respond first to hardware
events. This approach is said to be more flexible and also to avoid the
RTLinux patent.
Working with RTAI still requires writing kernel-mode code to handle the
hard realtime part of the task.
In response to the current discussion, Philippe Gerum surfaced with an introduction to the RTAI/Fusion project.
This project, which is "a branch" of the RTAI effort, is looking for a
middle ground between the low-latency efforts and the full RTAI mode of
operation; its goal is to allow code to be written for the Linux user
space, with access to regular Linux facilities, but still being able to
provide deterministic, bounded response times. To this end, RTAI/Fusion
provides two operating modes for realtime tasks:
- The "hardened" mode offers strict latency guarantees, but programs
must restrict themselves to the services provided by RTAI. A subset
of Linux system calls is available as RTAI services, but most of them
are not.
- When a task invokes a system call which cannot be implemented in the
hardened mode, it is shifted over to the secondary ("shielded")
scheduling mode. This mode is similar to the realtime modes
implemented by MontaVista and Ingo Molnar; all Linux services are
available, but the maximum latency may be higher. The RTAI/Fusion
shielded mode defers most interrupt processing while the realtime task
is running, which is said to improve latency somewhat.
Processes may move between the two modes at will.
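The two modes amount to a small state machine, sketched below. Everything in it is hypothetical: the type and function names, and especially the rule in toy_rtai_handles(), which stands in for the real and unspecified list of calls that RTAI can service itself.

```c
#include <assert.h>
#include <stddef.h>

enum toy_mode { TOY_HARDENED, TOY_SHIELDED };

struct toy_rt_task {
    enum toy_mode mode;
    unsigned mode_switches;   /* how often the task changed mode */
};

/* Invented rule for the example: calls below this number are RTAI services. */
#define TOY_NR_RTAI_CALLS 8

int toy_rtai_handles(int call_nr)
{
    return call_nr >= 0 && call_nr < TOY_NR_RTAI_CALLS;
}

/* Called on every system call made by a realtime task. */
void toy_syscall_entry(struct toy_rt_task *t, int call_nr)
{
    assert(t != NULL);
    if (t->mode == TOY_HARDENED && !toy_rtai_handles(call_nr)) {
        /* Fall back: all Linux services, but looser latency bounds. */
        t->mode = TOY_SHIELDED;
        t->mode_switches++;
    }
}

/* The task asks to return to strict latency guarantees. */
void toy_harden(struct toy_rt_task *t)
{
    assert(t != NULL);
    if (t->mode != TOY_HARDENED) {
        t->mode = TOY_HARDENED;
        t->mode_switches++;
    }
}
```

The point of the model is that the switch to the shielded mode is driven by what the task asks for, not by a separate build or a separate kind of process.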
The end result is a blurring of the line between regular Linux processes
and the hard realtime variety. Developers can select the mode which best
suits their needs while running under the same system, and they can use
different modes for different phases of a program's execution. RTAI/Fusion
might yet succeed in the task of combining a general-purpose operating
system with hard realtime operation.
In conclusion...
Whether any of the work described here will make it into the mainline
kernel is another question. The preemptible kernel patch, which was far
less ambitious, has still not been accepted by many developers. Removing
most spinlocks and making the kernel fully preemptible will certainly be an
even harder sell. It is an intrusive change which could take some time to
stabilize fully. If a fully-preemptible, closer-to-realtime kernel does
pass muster with the kernel developers, it may well be the sort of
development that finally forces the creation of a 2.7 branch.
Another challenge will be building a consensus around the idea that the
mainline kernel should even try to be suitable for hard realtime tasks.
The kernel developers are, as a rule, opposed to changes which benefit a
tiny minority of users, but which impose costs on all users. Merging
intrusive patches for the sake of realtime response looks like that sort of
change to many. Before mainline Linux can truly claim to be a realtime
system, the relevant patches will have to prove themselves to be highly
stable and without penalty for "regular" users.
Most Linux users probably have a sufficiently interesting life that they
spend little time imagining how page tables are represented in the kernel.
Many of those who do ponder on that issue may think in terms of a
linear array which maps virtual addresses onto their corresponding physical
addresses. This view of page tables is enough to understand the basic
function that they perform, but the real situation is more complicated than
that.
A single array large enough to hold the page table entries for a single
process would be huge. On a typical x86 system, a page table entry
requires 32 bits, so 1024 of them (covering 4MB of virtual address space)
can be stored in one page. If the virtual address space is 3GB (as it is
on many x86 systems), 768 pages would be required to hold all of the page
table entries. Allocating that much contiguous memory (for each process)
would be impossible, even if that sort of memory overhead were tolerable.
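The arithmetic in this paragraph can be checked mechanically. The helper names below are invented for the sketch; the 4096-byte page and the 4-byte page table entry are the x86 values the text assumes.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* x86 page size */
#define PTE_SIZE_BYTES  4u      /* one 32-bit page table entry */

/* Number of page table entries which fit in one page. */
uint32_t ptes_per_page(void)
{
    return PAGE_SIZE_BYTES / PTE_SIZE_BYTES;
}

/* Bytes of virtual address space mapped by one page full of entries. */
uint64_t bytes_mapped_per_pt_page(void)
{
    return (uint64_t)ptes_per_page() * PAGE_SIZE_BYTES;
}

/* Pages of entries needed by a flat table covering va_bytes of space. */
uint64_t flat_table_pages(uint64_t va_bytes)
{
    assert(va_bytes % bytes_mapped_per_pt_page() == 0);
    return va_bytes / bytes_mapped_per_pt_page();
}
```

Calling flat_table_pages() with 3GB gives the 768 pages mentioned above, which is 3MB of physically contiguous memory for every process.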
The fact is that most processes only use a small portion of the total
virtual address space - but the parts they use are widely scattered over
that space. Program text lives down near the bottom, heap memory and
dynamic libraries are distributed throughout the middle, and the stack is
put up at the very top. So the real page table structure must handle a
sparse, widely distributed set of virtual addresses without wasting
excessive amounts of memory or requiring large, physically-contiguous
arrays.
To that end, modern processors which use page tables use a hierarchical
tree structure. This structure allows the table to be broken up into
individual pages, and the subtrees corresponding to unused parts of the
address space can be absent. The Linux kernel works with a three-level
structure: a page global directory (PGD) at the top, page middle
directories (PMDs) in the middle, and the page tables themselves at the
bottom.
On an x86 system running in the PAE mode (only needed when more than 4GB of
memory is installed), all three levels of page tables are present. The
page global directory (PGD) contains only four entries, each corresponding
to 1GB of virtual address space; the PGD is indexed using the top two bits
of the virtual address. Each PGD entry points to a page middle directory
(PMD), which holds 512 entries indexed by bits 21-29 of the virtual
address. The PMD entry (if it is not empty) points to an actual page
table. Using bits 12-20 of the virtual address to index into that page
table yields the actual physical address of the page, assuming that page is
currently resident in RAM.
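The bit ranges given for PAE mode translate directly into shifts and masks. The structure and function names here are made up for the illustration; the bit positions are the ones listed in the paragraph above.

```c
#include <assert.h>
#include <stdint.h>

struct pae_index {
    unsigned pgd;      /* index into the 4-entry PGD */
    unsigned pmd;      /* index into a 512-entry PMD */
    unsigned pte;      /* index into a 512-entry page table */
    unsigned offset;   /* byte offset within the 4KB page */
};

struct pae_index pae_split(uint32_t va)
{
    struct pae_index ix;
    ix.pgd    = (va >> 30) & 0x3;     /* bits 30-31 */
    ix.pmd    = (va >> 21) & 0x1ff;   /* bits 21-29 */
    ix.pte    = (va >> 12) & 0x1ff;   /* bits 12-20 */
    ix.offset = va & 0xfff;           /* bits 0-11  */

    /* The four fields together account for every bit of the address. */
    assert((((uint32_t)ix.pgd << 30) | ((uint32_t)ix.pmd << 21) |
            ((uint32_t)ix.pte << 12) | ix.offset) == va);
    return ix;
}
```

For example, the address 0xC0101234 falls in the last of the four 1GB regions (PGD index 3), in the first PMD entry of that region, at page table index 257.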
The current 2.6 kernel implements a three-level page table for all
architectures. As it turns out, the bulk of x86 systems will not be
running in the PAE mode; on those systems, the hardware only supports two
levels of page tables. The PGD holds 1024 entries (bits
22-31), each of which points to a 1024-entry page table (bits 12-21). For
the benefit of the rest of the kernel, the page table access functions are
set up to emulate the existence of a single-entry PMD, so these systems
still appear to use a three-level page table.
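The same decomposition for the two-level case shows what a single-entry PMD means in practice: the middle level consumes no address bits. As before, the names are invented for the sketch, and the PTRS_PER_* constants simply restate the entry counts from the text.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_PGD 1024   /* bits 22-31 */
#define PTRS_PER_PMD 1      /* emulated level: exactly one entry */
#define PTRS_PER_PTE 1024   /* bits 12-21 */

struct nonpae_index {
    unsigned pgd, pmd, pte, offset;
};

struct nonpae_index nonpae_split(uint32_t va)
{
    struct nonpae_index ix;
    ix.pgd    = (va >> 22) & (PTRS_PER_PGD - 1);
    ix.pmd    = 0;   /* the single-entry PMD uses no bits of the address */
    ix.pte    = (va >> 12) & (PTRS_PER_PTE - 1);
    ix.offset = va & 0xfff;
    assert(ix.pmd < PTRS_PER_PMD);
    return ix;
}
```

Generic code can still loop over PTRS_PER_PMD entries at the middle level; on these systems that loop runs once, and a compiler can remove it entirely.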
The three-level design is wired deeply into the kernel. Any code which
must manually map a virtual address into its physical counterpart must do
something like this (error handling and other details omitted):
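    pgd = pgd_offset(mm, address);        /* mm: the process's struct mm_struct */
    pmd = pmd_offset(pgd, address);       /* middle directory entry */
    pte = *pte_offset_map(pmd, address);  /* the page table entry itself */
    page = pte_page(pte);                 /* struct page for the mapped frame */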
Similarly, any kernel function which affects a range of virtual addresses
must implement a depth-first traversal of the relevant portion of the
three-level tree. Most of these traversals of the page table tree have
been isolated behind functions, but it is still surprising how many places
are coded around the three-level assumption. But it all works fine, since
the architecture-specific code makes it look like all systems have
three-level page tables.
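The lookup sequence, and the sparse tree it walks, can be modeled in user space. This is a toy using the PAE geometry (4, 512, and 512 entries), not kernel code: the toy_* names are invented, a frame number of zero is taken to mean "not present", and the model relies on calloc() producing NULL pointers, which holds on mainstream platforms. Nothing is ever freed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PTRS_PER_PGD 4
#define PTRS_PER_PMD 512
#define PTRS_PER_PTE 512

struct toy_pt  { uint64_t pte[PTRS_PER_PTE]; };         /* 0: not present */
struct toy_pmd { struct toy_pt *pt[PTRS_PER_PMD]; };    /* NULL: no table */
struct toy_pgd { struct toy_pmd *pmd[PTRS_PER_PGD]; };  /* NULL: no PMD   */

/* Map the page containing va to frame pfn, allocating levels on demand. */
int toy_map(struct toy_pgd *pgd, uint32_t va, uint64_t pfn)
{
    assert(pgd != NULL && pfn != 0);
    struct toy_pmd **pmd = &pgd->pmd[va >> 30];
    if (*pmd == NULL && (*pmd = calloc(1, sizeof **pmd)) == NULL)
        return -1;
    struct toy_pt **pt = &(*pmd)->pt[(va >> 21) & 0x1ff];
    if (*pt == NULL && (*pt = calloc(1, sizeof **pt)) == NULL)
        return -1;
    (*pt)->pte[(va >> 12) & 0x1ff] = pfn;
    return 0;
}

/* Walk the three levels; returns the frame, or 0 if any level is absent. */
uint64_t toy_lookup(const struct toy_pgd *pgd, uint32_t va)
{
    assert(pgd != NULL);
    const struct toy_pmd *pmd = pgd->pmd[va >> 30];
    if (pmd == NULL)
        return 0;
    const struct toy_pt *pt = pmd->pt[(va >> 21) & 0x1ff];
    if (pt == NULL)
        return 0;
    return pt->pte[(va >> 12) & 0x1ff];
}
```

After a single toy_map() call only one PMD and one page table exist; lookups anywhere else in the address space stop at the first NULL pointer. Every function shaped like toy_lookup() has the number of levels built into it, which is the kind of assumption the text describes as wired deeply into the kernel.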
The only problem is that some hardware actually supports four-level
tables. The example which is driving the current changes is x86-64. The
current x86-64 port emulates a three-level architecture by using a single,
shared, top-level directory ("PML4") and fitting (most of) the virtual
address space in a three-level tree pointed to by a single PML4 entry. It
all works, but it limits Linux processes to a mere 512GB of virtual address
space. Such limits are irksome to the kernel developers when the hardware
can do more, and, besides, somebody is likely to release a web browser
or office suite which runs into that limit in the near future.
The solution is to shift the kernel over to using four-level page tables
everywhere, with the fourth level emulated (and optimized out of existence)
on architectures which do not support it. Andi Kleen has posted a four-level page tables patch which
implements this change. With Andi's patch, the x86-64 architecture
implements a 512-entry PML4 directory, 512-entry PGD, 512-entry PMD, and
512-entry PTE. After various deductions, that is sufficient to implement a
128TB address space, which should last for a little while.
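The sizes quoted here follow from four 512-entry levels, each consuming nine bits of the address above the 12-bit page offset. The names in this sketch are invented; the bit layout is the standard x86-64 one, and reading the 128TB figure as half of the four-level span is an inference from the numbers, not something the patch description states.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define LEVEL_BITS  9    /* 512 entries per level */

struct x8664_index {
    unsigned pml4, pgd, pmd, pte, offset;
};

struct x8664_index x8664_split(uint64_t va)
{
    struct x8664_index ix;
    ix.pml4   = (unsigned)((va >> 39) & 0x1ff);   /* bits 39-47 */
    ix.pgd    = (unsigned)((va >> 30) & 0x1ff);   /* bits 30-38 */
    ix.pmd    = (unsigned)((va >> 21) & 0x1ff);   /* bits 21-29 */
    ix.pte    = (unsigned)((va >> 12) & 0x1ff);   /* bits 12-20 */
    ix.offset = (unsigned)(va & 0xfff);           /* bits 0-11  */
    return ix;
}

/* Bytes of address space reachable through 'levels' 512-entry tables. */
uint64_t span_bytes(unsigned levels)
{
    assert(levels >= 1 && levels <= 4);
    return (uint64_t)1 << (PAGE_SHIFT + LEVEL_BITS * levels);
}
```

Three levels span 512GB, which is the limit imposed by hanging everything off a single PML4 entry. Four levels span 256TB, and half of that is the 128TB quoted above.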
The actual patch works as one might expect; code which currently handles
three-level page tables is extended to deal with the fourth level. There
is a default PML4 implementation which can be included by architectures
which do not have four-level tables; that should make porting most
architectures to the new scheme relatively easy. That work is likely to
happen in the near future, after which Andi has stated his intention to get
the four-level patch merged into the -mm tree. Andrew Morton has already
said (at the kernel summit) that he would consider merging such a patch.
Your Linux system may be running with four-level page tables in the near
future.
Greg Kroah-Hartman recently expressed some concerns about the InfiniBand
specification. It seems that, if you are not a member of the InfiniBand
Trade Association, a copy of the specification will cost $9500 - and it
requires signing a license which reads:
Upon receipt by IBTA of payment for a single copy license to the
Specification, you are entitled to possess one physical copy of the
Specification in the form provided to you by IBTA, and to make
internal, noncommercial use of the Specification within your
organization.
Such language raises the obvious question: how can anybody write or
distribute a free InfiniBand implementation after having signed that sort
of license? Things get worse when one looks at the IBTA
membership agreement (PDF):
When the member or its Affiliates makes a Contribution or when the
Steering Committee adopts and approves for release a Specification,
the Member and its Affiliates hereby agree to grant to other
members and their affiliates under reasonable terms and
conditions that are demonstrably free of any unfair discrimination,
a nonexclusive, nontransferable, worldwide license under its
Necessary Claims to allow such Members to make, have made, use,
import, offer to sell, lease, and sell and otherwise distribute
Compliant Portions ....
The Member and its Affiliates retain the independent right to grant
or withhold a nonexclusive license or sublicense of patents
containing Necessary Claims to non-Members on such terms as the
Member may determine.
(Emphasis added). The InfiniBand standard, in other words, is allowed to
contain patented technology; only IBTA members must be given the
opportunity to license that technology, and only under "reasonable
terms and conditions." If said "reasonable terms and conditions" included
the right to distribute code under a free license, one would assume those
who wrote the agreement would have seen fit to say so.
The end result is that InfiniBand looks like a closed, proprietary
standard, and not something which can be supported in free software. Greg
asked, flat out:
So, OpenIB group, how to you plan to address this issue? Do you
all have a position as to how you think your code base can be
accepted into the main kernel tree given these recent events?
In response, there have been some "we don't think it's a problem"
mumblings, but nothing that looks like a real answer to this question.
Until this all gets straightened out, anybody considering using InfiniBand
with free software may well want to think about alternatives.
Page editor: Jonathan Corbet