Brief items
The current stable 2.6 kernel is 2.6.16.18, released on May 22 with a
single fix for a remote denial of service problem in the netfilter SNMP NAT
code. 2.6.16.17 was released on May 20 with a rather larger set of fixes.
The current 2.6 prepatch remains 2.6.17-rc4. Fixes continue to
accumulate in the mainline git repository, however, and it looks like the
-rc5 release could happen sometime soon.
The current -mm tree is 2.6.17-rc4-mm3. Recent changes to -mm include the
big serial ATA patch set, an S/390 hypervisor filesystem, the Secmark packet
filtering
code, a new set of page migration patches, a new framework for hardware
random number generator support, the file_operations read/write
consolidation patch (since dropped until some problems are fixed), and the
UTS namespace patches (see below). The next -mm release will also include
the genirq patch set (see below).
Kernel development news
Guys, a kernel developer who cannot understand that user space is
important should just drop their pretentions of being a kernel
developer, and go play with some toy system like Hurd
instead. There you can say "user space doesn't matter".
-- Linus Torvalds
Greg Kroah-Hartman has decided that it's time to put an end to people
sneering that Linux lacks a proper device driver development kit. So, he
has created the first Linux DDK. It includes a fresh 2.6.16.18 kernel, a
full copy of LDD3, and copies of all the in-tree kernel documentation. A
CD image can be downloaded from kernel.org.
James Morris's secmark patches have been circulating for a few weeks now.
Secmark is a new mechanism for filtering network packets through SELinux.
Your editor had pondered writing an article about secmark, but that turns
out to be unnecessary; James did it first.
    The idea is to separate labeling and enforcement. Specifically:
    use iptables to select and label packets, then use SELinux to
    enforce security policy using these packet labels. This utilizes
    the expressiveness of iptables rulesets, as well as the flexibility
    of any its many matches and targets, and powerful components such
    as connection tracking. At the same time, enforcement of security
    policy remains the responsibility of the SELinux AVC, and access
    control rules can be meaningfully analyzed as part of overall
    SELinux policy analysis.
Read the full article for a detailed description of what secmark does and
how to use it.
Serge Hallyn recently posted a new version of the UTS namespaces patch.
This code, a small part of
the "lightweight virtualization" or "containers" concept, allows various
bits of system naming information (the stuff which can be seen with
uname, essentially) to differ between sets of processes on the
same system. It may not seem like a big thing, but, as a piece of
container technology which has received the approval of several projects
working in this area, it gives a hint of how the larger problem might be
solved.
Andrew Morton responded with a note praising
the way the work has been done, but asking a fundamental question:
    Generally, I think that the whole approach of virtualising the OS
    so it can run multiple independent instances of userspace is a good
    one. It's an extension and a strengthening of things which Linux
    is already doing and it pushes further along a path we've been
    taking for many years. If done right, it's even possible that each
    of these featurettes could improve the kernel in its own right -
    better layering, separation, etc. [...]
    All of which begs the question "now what?".
The worry is that the kernel developers could merge a large amount of
non-trivial code, make a number of internal kernel interfaces more
complicated, and still not have an end result that is useful to the
containers community. The fact that the developers working in this area
were able to agree on a patch for UTS namespaces is encouraging, but it is
not a guarantee that consensus will be reached on the more complicated
changes. The possibility of an intractable disagreement derailing the
whole process partway through is a real one.
On the other hand, keeping all of the container code out of the kernel
until it is reasonably complete has its own costs. Some of the container
changes look to be relatively large and intrusive. Maintaining them all
out of the tree would not be a great deal of fun. Neither would merging
the whole mess at some future point when enough developers can agree that
they are "done."
There are a number of features needed by the projects concerned with
virtualization and containers. They include:
- The UTS namespace patch mentioned above.
- PID virtualization, isolating the groups of processes on the system
from each other, and allowing process IDs to be reused between containers.
- Namespaces for SYSV interprocess communication primitives (semaphores,
shared memory, and message queues).
- Time virtualization, so that each container can have its own idea of
what time it is.
- Virtualization of user and group ID values.
- Network namespaces, intended to give each container a specific set of
network interfaces to which it has access. When used in conjunction
with IP aliases, this feature can set up a separate IP address for
each container and keep containers from accessing each other's
traffic.
The ability to virtualize the view of the filesystem through namespaces is
also required, but Linux has had that capability for some years now. Some
of the more advanced container capabilities - live checkpointing and
process migration, for example - will require yet another set of deep
kernel hooks.
Most container concepts need most of the items from the list above to be
able to provide useful isolation. So, somehow, a path must be found to get
those features into the kernel without running into a blocking disagreement
partway through - assuming that container support is considered desirable
in general, of course.
Andrey Savochkin came up with a proposal
which could be a good step forward: implement the network namespaces
feature first. It is one of the most complex features, and it must be
implemented in a way which doesn't upset the highly refined sensibilities
of the networking subsystem developers. Some fairly tricky side problems -
such as virtualizing access to /proc and sysfs - will have to be
solved in the process. All told, it may be the hardest part of the
problem, and it may be the place where an extended disagreement is most
likely to show up.
Often, developers like to take on the easier parts of a problem first,
then apply any lessons learned to the harder parts. In this case, however,
starting with the hardest part may make some sense. If no universally
acceptable solution can be found, the idea of generalized container support
in the kernel can be dropped before too much other code has been merged.
If, instead, the developers involved are able to implement something which
pleases (or, at least, does not mortally offend) everybody, they should be
able to get over any other roadblocks which may show up later on. In that
case, the various pieces of the puzzle could be merged with confidence as
they become ready.
The Linux kernel has a generic layer for the handling of hardware
interrupts, hidden behind a standard API. There's only one problem: not
all architectures use this layer. In particular, ARM is a holdout. It
seems that interrupt handling in the ARM world is a complicated,
subarchitecture-specific business which does not fit into the current
"generic" code at all, so ARM sticks with its own code - even though there
is a fair amount of overlap with code found in the generic subsystem. But,
even for the architectures which are able to use it, the current IRQ
subsystem has shortcomings which are becoming increasingly apparent.
An attempt to change the situation can be seen in the genirq patch set by Thomas
Gleixner and Ingo Molnar. These patches attempt to take lessons learned
about optimal interrupt handling on all architectures, mix in the quirks
found in the fifty (yes, fifty) ARM subarchitectures, and create a new IRQ subsystem
which is truly generic, and more powerful as well. It is a big patch set
which reworks a great deal of crucially important low-level code. Expect
some interesting discussion before any eventual mainline merge.
After some cleanup work, the patch gets serious with the creation of a new
irq_chip structure. This structure is based on the old
hw_interrupt_type structure, but it includes a rather longer list
of low-level operations. The things which the kernel can now ask a
specific interrupt controller to do include:
- startup(): enable the interrupt and generally get the
controller ready to handle it.
- shutdown(): completely shut down the interrupt.
- enable(): enable the interrupt.
- disable(): disable the interrupt.
- ack(): inform the controller that the CPU has begun
processing the interrupt.
- end(): inform the controller that interrupt processing is
done.
- mask(): mask a specific interrupt, blocking its delivery.
- mask_ack(): a combination of mask() and
ack() which can be optimized on some platforms.
- unmask(): unmask an interrupt.
- set_affinity(): bind an interrupt to a specific CPU.
- retrigger(): re-create and re-deliver an interrupt.
- set_type(): set the flow type (described below) of the
interrupt.
- set_wake(): enable or disable wake-on-interrupt behavior.
Many of these methods existed previously, but the mask(),
mask_ack(), unmask(), set_type(), and
set_wake() functions are new. With this set of functions, kernel
code can manage interrupt controller chips in a fine-grained manner.
Moving up a level, the existing irq_desc structure, which holds
all of the kernel's information about any specific interrupt, now has a
pointer to an associated irq_chip structure. It also has a new
method, handle_irq(), pointing to the function which actually
handles this interrupt. That, perhaps, is the most fundamental change from
the existing system, which uses a single handler function
(__do_IRQ()) for all interrupts. It is a recognition of the fact
that not all interrupts are equal, so there is little to gain by trying to
deal with them all in a single, big function.
The biggest difference between interrupts is what is called the "flow
type" - a combination of how the interrupt is signaled and how the system
processes it. The genirq patches define these flow types:
- Level-triggered interrupts are active as long as the device asserts
its IRQ line. These interrupts must be masked while being processed,
and can only be unmasked after the device has stopped asserting the
interrupt.
- Edge-triggered interrupts are signaled by a change in the interrupt
line - from low voltage to high, from high to low, or both. These
interrupts do not necessarily have to be masked while being processed,
but, if they are not masked, more interrupts can arrive before the
first has been handled. So the kernel must track "pending"
interrupts, and the interrupt handler must loop until all interrupts
have been dealt with.
- "Simple" interrupts do not require any special control, and can be
processed directly.
- Per-CPU interrupts are bound to a single CPU. They are much like
simple interrupts, but even simpler: since the handler will only run
on one CPU, there is no need for locking.
The current IRQ code attempts to handle all of the above cases in a single,
large routine. The new code, instead, creates a number of flow-specific
handler functions, then sets the appropriate one as the
handle_irq() method in the interrupt descriptor. The result is
code which can be optimized for specific needs, and shorter code paths in
the interrupt system as a whole. If a particular hardware platform has
quirks which are not addressed by the current handlers, creating a new one
is a relatively straightforward task.
At the kernel API level, the changes are relatively small; changes to
drivers are not generally required. There are a few new capabilities,
however. One is that there are some new flags which can be passed to
request_irq():
- SA_TRIGGER_LOW and SA_TRIGGER_HIGH: treat the
interrupt source as being level-triggered, with interrupts happening
at the low or high level, respectively.
- SA_TRIGGER_FALLING and SA_TRIGGER_RISING: treat the
interrupt as being edge-triggered, on the falling or rising edge,
respectively.
This addition to the API actually happened in 2.6.16, but only the ARM
architecture had any support for it at all. With the genirq patches, all
architectures support these flags, and the appropriate flow handler will be
selected internally. When interrupts are shared, however, all users must
agree on how the triggering will be handled.
It is also possible to change the flow type of an IRQ directly with:
    int set_irq_type(unsigned int irq, unsigned int type);
Here, type should be one of IRQ_TYPE_EDGE_RISING,
IRQ_TYPE_EDGE_FALLING, IRQ_TYPE_EDGE_BOTH,
IRQ_TYPE_LEVEL_HIGH, IRQ_TYPE_LEVEL_LOW,
IRQ_TYPE_SIMPLE, or IRQ_TYPE_PERCPU. Calling this
function has the same effect as specifying the trigger type with
request_irq(), but it offers a wider range of possibilities. It
also does not check for compatibility with any other users of a shared
interrupt, so a certain potential for confusion exists.
Some devices can generate interrupts which should wake up the system from a
suspended state. Wake-on-LAN behavior in network adaptors is one example;
allowing the keyboard to wake the system is another. Kernel code can
enable or disable this behavior in the interrupt controller with:
    int set_irq_wake(unsigned int irq, unsigned int on);
An error code will be returned if the chip-level controller does not
implement this operation.
There has been a relatively small amount of discussion so far; the biggest
objection seems to be a claim that the separate flow handlers are an
unnecessarily complex addition. The decision on whether genirq is merged
very likely depends on whether the ARM maintainers are willing to drop their
architecture-specific IRQ implementation and move to the new, generic
version. Without that, the genirq code, which contains a lot of work aimed
specifically at ARM's needs, will not truly be a generic solution. In the
meantime, genirq has found its way into the -mm tree.
The kernel has long used "tainting" as a way of noting that something has
happened which may affect the stability of the system. Should a kernel
oops occur, the resulting kernel trace includes information on the kernel's
taint status. This information can then be used by developers to ask hard
questions about what was really going on. The taint flag was originally
added to flag the use of binary-only kernel modules, but its use has grown
since then. Events which will taint a current kernel include the forced
removal of a module, loading a module without proper (or matching) version
information, or running an SMP kernel with processors not designed for
SMP operation. Machine check exceptions and certain kinds of memory
management errors will also result in a tainted kernel.
A recent patch by Ted Ts'o expands the taint concept in an interesting way.
It adds a new file (/proc/sys/kernel/tainted); should user space write to
that file, the kernel will be marked tainted with the new "U" flag. The
idea, says Ted, is to flag "when userspace is potentially doing something
naughty that might compromise the kernel." It took a few more questions
before the real truth of the matter came out:
    The problem is that the Real-Time Specification for Java (RTSJ)
    **requires** that the JVM provide class functions which provide
    direct access to physical memory; all physical memory. In fact,
    the RTSJ compliance test explicitly checks for this; it requires
    that you give the compliance test the address of a few hundred megs
    of physical memory for the test. The absolutely hilarious bit
    about all of this is that the same customer who wants RTSJ
    compliance because of federal procurement regulations is also
    interested in using SELinux.
The idea of using SELinux on a system where Java code is free to mess
around with physical memory does involve a fair amount of cognitive
dissonance. But The Customer Is Always Right, so Ted is making this work.
Not entirely willingly, though:
    In fact, I was so unhappy about being forced by the RTSJ
    specification to do this insane thing that I wanted to make sure
    that if it were ever used, it would set a TAINT flag to warn people
    that just about anything unsane could have happened, and the
    system's stability was at the mercy of the competence of Java
    application programmers.
Nobody has stepped forward to say that the kernel should not be tainted in
such a situation. Instead, one might almost be able to merge a patch
causing the kernel to emit scary horror-movie sounds as well.
There appears to be general agreement that this patch makes sense;
certainly there are plenty of situations where user-space actions might
affect the stability of the system. There was one request for a log
message to be stored with the user-space taint flag so that the reason for
its presence would be more clear later on. A concern was also raised that
some distributions were using the "U" flag for other reasons (to
flag the presence of "unsupported" modules), though it is not clear that
this is actually happening. Collisions over the use of taint flags could
indeed create confusion, so Dave Jones has suggested that any taint flags
used in out-of-tree code should at least be documented with a comment in
the mainline kernel. Whether any such flags exist remains to be seen,
however.
Page editor: Jonathan Corbet