Brief items
The current 2.6 prepatch remains 2.6.24-rc3. Fixes continue to flow
into the mainline git repository at a relatively high rate; 2.6.24-rc4 must
be due sometime in the very near future.
The current -mm tree is 2.6.24-rc3-mm2. Recent changes
to -mm include the new timerfd API (see below), a number of driver core
changes, a per-process capability bounding set feature, and an updated
version of the SMACK security module.
The current stable 2.6 kernel is 2.6.23.9, released on November 26.
There are a couple dozen or so important fixes in this update.
For older kernels: 2.6.22.14 was released on
November 21.
Kernel development news
The Linux kernel requires that any needed documentation accompany
all changes requiring said documentation -- part of the source-code
patch must apply to the Documentation/ directory.
-- Donnie Berkholz engages in some wishful thinking
By Jonathan Corbet
November 27, 2007
The kernel's loadable module mechanism does not give modules access to all
parts of the kernel. Instead, any kernel symbol which is intended to be
usable by loadable modules must be explicitly exported to them via one of
the variants of the EXPORT_SYMBOL() macro. The idea behind this
restriction is to place limits on the reach of modules and to provide a
relatively well-defined module API. In practice, there have been few
limits placed on the exporting of symbols, with the result that many
thousands of symbols are available to modules. Loadable modules can access
many of the obviously useful symbols (printk(), say, or kmalloc()), but
they can also get at generic symbols like edd, tpm_pm_suspend(),
vr41xx_set_irq_trigger(), or flexcop_dump_reg().
There are reasons for the concern over excessive symbol exports felt by
some developers. Wrongly exported symbols can lead module authors to use
incorrect interfaces; for example, the exporting of sys_open() is
an active inducement for developers to open files directly inside the
kernel, which is almost never a good idea. But such symbols, once
exported, can prove hard to
unexport. While the official line says that the internal kernel API
can change at any time, the truth of the matter is that at least some
developers are reluctant to break external modules when that can be
avoided.
A more timely example would be init_level4_pgt, a low-level symbol
exported only by the x86_64 architecture. The current -mm tree removes
that export, breaking the proprietary NVIDIA module in the process. Andrew
Morton describes this removal as "our
clever way of reducing the tester base so we don't get so many bug
reports." While many developers make a show of not caring about
binary-only modules, there is still a good chance that this particular
export removal (of a symbol which should not really be available globally)
may not make it into the mainline as a result of this breakage.
The end result of all this is that there has long been interest in somehow
cleaning up the modular API, though few people have put much time toward
that end. Occasionally somebody
has remarked upon one piece of low-hanging fruit: symbols which are
exported only to make it possible to modularize other bits of mainline
kernel code. One example is a whole set of TCP stack symbols (things like
__tcp_put_md5sig_pool()) which have exactly one user: the IPv6
module. Restricting these special-purpose exports has the potential to
significantly narrow the modular API without making it harder to modularize
the mainline.
Andi Kleen's module symbol namespace patch is meant to enable just this sort of narrowing of the
API. With this patch, symbols can be exported into specific "namespaces"
which are only available to modules appearing on an associated
whitelist. In a sense, the term "namespace" is a poor fit here; there is
still a single, global namespace within which all exported symbols must be
unique. These "namespaces" are more like special exclusion zones
containing symbols which are not globally accessible. They
work like GPL-only exports, which also restrict the availability of symbols
to a subset of modules.
To create a restricted export, an ordinary EXPORT_SYMBOL()
declaration is changed to:
EXPORT_SYMBOL_NS(namespace, symbol);
Where namespace is the name of a restricted symbol namespace. So,
going back to the TCP example, Andi's patch contains a number of changes
like:
-EXPORT_SYMBOL(__tcp_put_md5sig_pool);
+EXPORT_SYMBOL_NS(tcp, __tcp_put_md5sig_pool);
Note that there is no _GPL version; any symbol which is exported
into a specific namespace is automatically treated as GPL-only.
The other part of the equation is to enable access to a namespace. That is
done with:
MODULE_NAMESPACE_ALLOW(namespace, module);
Such a declaration (which must appear in a module exporting symbols into
the namespace) says that the given module can access
symbols in that namespace. Andi's patch creates three namespaces
(tcp, tcpcong for congestion control modules, and
udp), removing about 30 symbols from the global namespace.
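The effect of the two macros can be pictured with a small user-space model. This is only a sketch, not code from Andi's patch: the table layout, the may_resolve() helper, and the example_oot module name are all invented for illustration, and the pairing of the tcp namespace with the ipv6 module is an assumption drawn from the TCP example above.

```c
/* User-space model of namespace-restricted symbol lookup; not kernel code. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct export {
    const char *symbol;
    const char *ns;          /* NULL means an ordinary, global export */
};

struct ns_allow {
    const char *ns;
    const char *module;      /* module allowed to use symbols in ns */
};

/* Hypothetical tables standing in for whatever the macros would record. */
const struct export exports[] = {
    { "printk", NULL },                     /* EXPORT_SYMBOL(printk) */
    { "__tcp_put_md5sig_pool", "tcp" },     /* EXPORT_SYMBOL_NS(tcp, ...) */
};

const struct ns_allow allowed[] = {
    { "tcp", "ipv6" },                      /* MODULE_NAMESPACE_ALLOW(tcp, ipv6) */
};

/* Return 1 if `module` may link against `symbol`, 0 otherwise. */
int may_resolve(const char *module, const char *symbol)
{
    size_t i, j;

    assert(module != NULL && symbol != NULL);
    for (i = 0; i < sizeof(exports) / sizeof(exports[0]); i++) {
        if (strcmp(exports[i].symbol, symbol) != 0)
            continue;
        if (exports[i].ns == NULL)
            return 1;                       /* global export: open to all */
        for (j = 0; j < sizeof(allowed) / sizeof(allowed[0]); j++) {
            if (strcmp(allowed[j].ns, exports[i].ns) == 0 &&
                strcmp(allowed[j].module, module) == 0)
                return 1;                   /* module is on the whitelist */
        }
        return 0;                           /* restricted and not whitelisted */
    }
    return 0;                               /* symbol is not exported at all */
}
```

In this model, may_resolve("ipv6", "__tcp_put_md5sig_pool") succeeds while the same lookup from example_oot fails; both modules can still reach printk. The symbol names remain unique across the whole table, which matches the observation that these are exclusion zones rather than true namespaces.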
A number of developers welcomed this patch, seeing it as a step forward in
the rationalization of the loadable module API. It is seen as a way to
prevent out-of-tree modules from using symbols which they should not be
using. It also reduces the number of interfaces which must be kept stable
in situations (enterprise kernels, for example) where changes are not
allowed. And, finally, the symbol namespaces offer the ability to organize
exports somewhat and document who the intended users are.
There is a bit of dissent, though. In particular, Rusty Russell fears that
the patch adds unneeded complexity and threatens to make life harder for
out-of-tree developers for little (if any) gain. Says Rusty:
For example, you put all the udp functions in the "udp" namespace.
But what have we gained? What has become easier to maintain? All
those function start with "udp_": are people having trouble telling
what they're for?
If you really want to reduce "public interfaces" then it's much simpler to
mark explicitly what out-of-tree modules can use.
Herbert Xu has similar concerns:
These symbols are exported because they're needed by protocols. If
they weren't available to everyone then it would be difficult to
start writing new protocols....
So based on the network code at least I'm kind of starting to agree
with Rusty now: if a symbol is needed by more than one in-tree
module chances are we want it to be exported for all.
While these voices seem to be in the minority, they still carry quite a bit
of weight. So your editor is unwilling to make any sort of guess as to
whether this patch will be merged, or in what form. The desire to clean up
the modular API is unlikely to go away, though, so, sooner or later,
something is likely to happen.
By Jonathan Corbet
November 27, 2007
Using uninitialized memory can lead to some seriously annoying bugs. If
you are lucky, the kernel will crash with the telltale slab poisoning
pattern (0x5a5a5a5a or similar) in the traceback. Other times,
though, something more subtly wrong happens, forcing a long hunt for the
stupid mistake. Wouldn't it be nicer if the kernel could simply detect
references to uninitialized memory and scream loudly at the time?
The kmemcheck patch recently
posted by Vegard Nossum offers just that functionality, though, perhaps, in
a somewhat heavy-handed manner. A kernel with kmemcheck enabled is
unlikely to be suitable for production use, but it should, indeed, do a
good job at finding code using memory which has not yet been set to a
useful value.
Kmemcheck is a relatively simple patch; the approach used is, essentially,
this:
- Every memory allocation is trapped at the page-allocator level. For
each allocation, the requested order is increased by one, doubling the
size of the allocation. The additional ("shadow") pages are initialized to zero
and kept hidden.
- The allocated memory is returned to the caller, but with the "present"
bit cleared in the page tables. As a result, every attempt to access
that memory will cause a page fault.
- Once the fault happens, kmemcheck (through some ugly,
architecture-specific code) determines the exact address and size of
the attempted access. If the access is a write, the corresponding
bytes in the shadow page are set to 0xff and the operation is
allowed to complete.
- For read accesses, the corresponding shadow page bytes are tested; if
any of them are zero, the code concludes that the read is trying to
access uninitialized data. A stack traceback is printed to enable the
developer to find the location where this access is happening.
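The shadow-byte bookkeeping in the steps above can be sketched in user space. This is a toy model under stated assumptions, not Vegard's code: there are no page faults here, so the hypothetical tracked_write() and tracked_read() helpers stand in for the write-fault and read-fault paths, and a fixed 64-byte region stands in for an allocation and its shadow pages.

```c
/* Toy model of kmemcheck-style shadow tracking; not kernel code. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REGION_SIZE 64

struct tracked_region {
    unsigned char data[REGION_SIZE];
    unsigned char shadow[REGION_SIZE];   /* 0x00: never written, 0xff: written */
};

/* A fresh allocation: the shadow starts out zeroed. */
void region_init(struct tracked_region *r)
{
    memset(r->shadow, 0, sizeof(r->shadow));
}

/* Stand-in for the write-fault path: mark the bytes, then do the write. */
void tracked_write(struct tracked_region *r, size_t off,
                   const void *src, size_t len)
{
    assert(off <= REGION_SIZE && len <= REGION_SIZE - off);
    memset(r->shadow + off, 0xff, len);
    memcpy(r->data + off, src, len);
}

/*
 * Stand-in for the read-fault path. Returns 0 on success, or -1 if any
 * byte in the range was never written. The tool described above prints
 * a stack traceback at this point; this model just refuses the read.
 */
int tracked_read(const struct tracked_region *r, size_t off,
                 void *dst, size_t len)
{
    size_t i;

    assert(off <= REGION_SIZE && len <= REGION_SIZE - off);
    for (i = 0; i < len; i++) {
        if (r->shadow[off + i] == 0)
            return -1;                   /* uninitialized byte in range */
    }
    memcpy(dst, r->data + off, len);
    return 0;
}
```

Tracking at byte granularity is what lets a partially initialized structure be caught: a read that straddles written and unwritten bytes fails even though its first bytes are valid.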
As should be evident, running with kmemcheck enabled will have certain
performance impacts. Taking a page fault on every access to slab memory
just cannot be fast. Doubling the size of every allocation will impose
costs of its own, including the cache effects of simply working with twice
as much memory. But that is a cost which can be paid when the kernel is
being run in a debugging mode.
Vegard has posted some sample
output which shows how the system responds to reads from uninitialized
memory. If this output is to be believed, access to unset memory is not an
especially uncommon occurrence in current kernels. If some of the references
flagged here, once tracked down, turn out to be real bugs, the kmemcheck
patch will have earned its keep, even if it never finds its way into the
mainline.
By Jonathan Corbet
November 28, 2007
Last week's discussion of the proposed indirect() system call ended with
some complaints from developers on the ugliness of the interface. Since
then there has been some talk about system call interfaces in general, but
not a whole lot of ideas for how indirect() could be done better.
The leading alternative would be that pushed by H. Peter Anvin: rather than
use indirect() to extend a system call, simply make a new system
call with the desired additional parameters. Then, usually, the old
implementation can be replaced with a simple stub which calls the new
version with the default values for the new parameters. It is a simple
approach which easily maintains binary compatibility with very little
runtime cost. Since there is no particular shortage of system call
numbers, this is a process which could go on for a long time.
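The stub pattern described here is simple enough to show in a few lines. The sketch below is purely illustrative: ex_open(), ex_open2(), EX_NONBLOCK, and struct ex_result are invented names, not real system calls, and ordinary C functions stand in for system call entry points.

```c
/* Illustration of "new call plus compatibility stub"; all names are made up. */
#include <assert.h>

#define EX_NONBLOCK 0x1   /* hypothetical flag known only to the new call */

struct ex_result {
    int fd;
    int nonblocking;
};

/* New entry point: does the same job as the old call, plus a flags argument. */
struct ex_result ex_open2(int id, int flags)
{
    struct ex_result r;

    assert(id >= 0);
    r.fd = id;
    r.nonblocking = (flags & EX_NONBLOCK) != 0;
    return r;
}

/* Old entry point, kept for binary compatibility: a stub passing defaults. */
struct ex_result ex_open(int id)
{
    return ex_open2(id, 0);
}
```

Existing callers of ex_open() see no change in behavior, since zero is the default for the new parameter; only callers that want the new behavior need to know about ex_open2().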
The management of increasing numbers of system calls does impose a cost,
though; each one of those system calls is a user-space API which cannot
ever be broken. The indirect() approach, instead, does not add
more system calls. As long as the addition of parameters (with default
values of zero) is done with care, avoiding API problems should be
relatively easy to do.
There are also limits on how many parameters can be easily passed to system
calls; on most systems, that limit is around six. Any system call requiring
more arguments must already do uncomfortable things with indirect blocks.
Creating new system calls with additional parameters will create more cases
where this sort of indirect parameter handling is required. So the
approach used by indirect() will find itself being used, in some
form, anyway.
The key argument, though, still appears to be the syslet/threadlet
mechanism. The ability to make any system call asynchronous has a lot of
appeal, but doing so requires some additional information - a place to
store the result of the call, if nothing else. Asynchronous system calls,
in Linux, are, for all practical purposes, a type of indirect call. The
proposed indirect() interface looks like it should be able to
accommodate asynchronous calls nicely - though the precise API has not,
yet, been nailed down.
As a result of all this, chances are that some form of indirect()
will find its way into the mainline - though there is still time for
somebody to come up with a better idea.
Meanwhile, the last time timerfd() was discussed here, it had been
disabled in the 2.6.23 kernel as a result of complaints about its
interface. Since then, little has happened with timerfd(), with
the result that it will almost certainly not be present in 2.6.24 either.
Some work has been done with this system call, though, and a new API proposal has been
posted. This version has three system calls, the first of which is
timerfd_create():
int timerfd_create(int clockid, int flags);
The clockid argument tells the system which clock should be used:
CLOCK_MONOTONIC or CLOCK_REALTIME. The flags
argument is a recent addition; it is currently unused and must be zero. It
was added on the assumption that somebody, somewhere, will always want some
sort of behavior modification and one might as well avoid the need for an
indirect version while it's easy. The return value from
timerfd_create() is a file descriptor which can be passed to
read() or any of the poll() variants. But, first, the
timer should probably be programmed with:
int timerfd_settime(int fd,
                    int flags,
                    const struct itimerspec *timer,
                    struct itimerspec *old_timer);
Here, fd is a file descriptor obtained from
timerfd_create(),
flags contains TFD_TIMER_ABSTIME if the timer is being
set to an absolute time, and timer is the expiration time for the
timer. If old_timer is not NULL, the location pointed to
will be set to the previous value of the timer.
It is also possible to query the value of the timer with:
int timerfd_gettime(int fd, struct itimerspec *timer);
The value returned in *timer will be the current setting of the
timer associated with fd.
There have not been many comments on this version of the API, so
something very similar to it will probably be merged. It would normally be
considered to be too late to put a change like this into 2.6.24, but the 2.6.24-rc3-mm2 patch log says
"Probably 2.6.24?". So one never knows. If this change is not merged
soon, it will almost certainly
become available for 2.6.25.
Finally, the hijack() system call continues to be developed on
relatively quiet kernel subsystem lists. This call (described here in October)
behaves much like clone() in that it creates a new process.
Unlike clone(), however, hijack() causes the new process
to share resources with a specified third process rather than with the
parent. Its main reason for existence is to make it easy to enter
different namespaces.
The hijack() interface remains almost unchanged:
int hijack(unsigned long clone_flags, int which, int id);
The specified id value is interpreted according to which,
which now has three possible values:
- HIJACK_PID says that id is a process ID; the
newly-created process will share resources (including namespaces) with
the indicated process.
- HIJACK_CG says that id is an open file descriptor
for the tasks file in a target control group. In this case,
the kernel will find a process within that control group and use it as
the source for resources and namespaces.
- HIJACK_NS is the newest option; as with HIJACK_CG, id
is an open file descriptor indicating a control group. In this case,
though, only the control group itself and any associated namespaces
will be inherited by the new process. This version is intended for
use when entry into an empty control group (where there are no
processes to inherit from) is desired.
This new system call still has not seen any exposure on linux-kernel; it
may well not survive its first experience there in its current form. If
nothing else, a name change (to something which is more descriptive of the
real function and, preferably, which does not put users onto intelligence
agency watch lists) may well be called for. But a full container
implementation on Linux will clearly need some sort of
enter_container() system call at some point.
Page editor: Jonathan Corbet