The current 2.6 prepatch is 2.6.23-rc8, released
on September 24.
It contains a relatively small number of fixes, and Linus is confident that
the final release is getting close. "Of course, me feeling happy is
usually immediately followed by some nasty person finding new problems, but
I'll just ignore that and enjoy the feeling anyway, however fleeting it may
be."
As of this writing, about 50 post-rc8 patches have gone into the mainline.
The current -mm tree is 2.6.23-rc8-mm1. Recent changes
to -mm include some ext4 enhancements, support for read-only bind mounts,
some kdump improvements, and a rework of the NFS export code.
The current stable 2.6 kernel is 184.108.40.206, released on September 26.
It contains a few dozen fixes for problems throughout the kernel.
220.127.116.11, released on September 24,
contains a single security fix for a privilege escalation
vulnerability in the sound subsystem. 18.104.22.168, released on the 21st,
is also a single-fix update; this one addresses an x86_64-only privilege
escalation problem. There is a larger 2.6.22 update in the works which
should be released shortly.
For older kernels: 22.214.171.124, released on
September 23, fixes the x86_64 vulnerability and one other bug. The
2.6.16 series returned on September 25 with 126.96.36.199-rc1, which contains a
fair number of fixes. 188.8.131.52 (September 23)
also has the x86_64 fix and a couple of others.
Kernel development news
Currently, sysfs files which want to kill themselves should ask
someone else (workqueue) to kill it, which is so inhumane. This
patchset updates sysfs file implementation such that sysfs files
can commit suicide peacefully.
-- Tejun Heo
Allowing users to turn off security is generally better than
assuming they will read the manual and turn it on.
-- Alan Cox
The developers of the MadWifi project
have announced their intention to move away from their current Atheros
driver (which contains a binary-only component) and, instead, work on the
development of the free ath5k driver. "To underline our decision and commitment to ath5k we now declare MadWifi
'legacy.'. In the long run ath5k will replace the MadWifi driver. For the
time being MadWifi will still be supported, bugs will get fixed and HAL
updates will be applied where possible. But it becomes unlikely that we'll
see new features or go through major changes on that codebase."
An announcement of the revival of
linux-tiny, a set of patches aimed at reducing the footprint of the
kernel, mainly for the embedded world, has led to a number of linux-kernel
threads. The conversations range from the proper place for linux-tiny to
reside to the removal of the enormous number of printk() strings
in the kernel. They provide an interesting glimpse into the kernel
development process.
The linux-tiny project was started
by Matt Mackall in December 2003 with the aim to "collect patches that
reduce kernel disk and memory footprint as well as tools for working
on small systems." LWN covered
the announcement at the time and tried out the patches more than
a year ago. Many of the linux-tiny features have found their way into the
mainline, but quite a few still remain outside.
The Consumer Electronics Linux Forum (CELF) is behind the effort to revive
the project, with Tim Bird, architecture group chair, announcing the plan,
including a new maintainer, Michael Opdenacker. The first step has been
mostly completed, bringing the patches forward from the 2.6.14 kernel to
2.6.22. A status
page has been established to track the progress of updating the
patches, but it is clear that moving them into the mainline, rather than
maintaining them as patches, is a big motivation behind the revival.
Andrew Morton immediately volunteered to manage the linux-tiny patches in an answer to the revival message:
Seriously, putting this stuff into some private patch collection should
be a complete last resort - you should only do this with patches which
you (and the rest of us) agree have no hope of ever getting into mainline.
Reactions were quite favorable, with the maintainer, Opdenacker, responding:
Andrew, you're completely right... The patches should all aim at being
included into mainline or die.
I'm finishing a sequence of crazy weeks and I will have time to send you
patches one by one next week, starting with the easiest ones.
The full patchset will live in a separate repository while the individual
patches are being worked on for inclusion, but it is clear that no one
wants to maintain an out-of-tree patchset for a long time. The cost of
ensuring that the patches do not bitrot is large, and their inclusion in
the mainline will get them into the hands of more developers.
From there, more detailed discussion of how to structure the patches - and
tiny features in general - ensued. A separate discussion also came about regarding
printk() and the large amounts of memory it consumes with all of
its static strings. printk() has long been seen as an area that
could be improved to reduce the memory footprint of the kernel.
All sorts of kernel messages are printed to logfiles or the console via
printk(); there is something on the order of 60,000 calls
in 2.6. There can be a severity level associated with a specific call, which
provides a primitive syslog-style categorization of the messages.
Unfortunately, in the mainline, those calls are either present, with all the
associated memory for the strings, or completely absent, compiled out via a config
option. It is rather difficult to diagnose problems without at least some
printk() information, but keeping all of the data in can increase
the size of the kernel by 5-10%.
Rob Landley started things off
with a way to make it possible to only compile in messages based on their
severity level. An embedded developer could remove KERN_NOTICE,
KERN_DEBUG and similar low severity messages while keeping the
more critical messages:
[...] the compiler's dead code eliminator zaps the printks you don't
care about so they
don't bloat the kernel image. But this doesn't _completely_ eliminate
printks, so you can still get the panic() calls and such. You tweak precisely
how much bloat you want, using the granularity information that's already
there in the source code.
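The idea of letting the compiler discard low-priority messages can be sketched in user space. This is only an illustration, not Landley's patch: the names log_at(), emit(), and MAX_COMPILED_LEVEL are hypothetical, and the sketch assumes the severity is passed as a separate integer argument rather than in whatever form the kernel uses.

```c
#include <stdarg.h>
#include <stdio.h>

/* Severity levels; lower numbers are more urgent (syslog-style scale). */
enum { LVL_EMERG = 0, LVL_ERR = 3, LVL_NOTICE = 5, LVL_DEBUG = 7 };

/* Hypothetical build-time knob: messages above this level are compiled out. */
#ifndef MAX_COMPILED_LEVEL
#define MAX_COMPILED_LEVEL LVL_ERR
#endif

/* Counts messages that actually reached the output routine. */
static int messages_emitted;

/* Stand-in for the real output routine (printk() in the kernel). */
static int emit(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vfprintf(stderr, fmt, ap);
    va_end(ap);
    messages_emitted++;
    return n;
}

/*
 * Both sides of the comparison are compile-time constants, so an
 * optimizing compiler can drop the call, and with it the format
 * string, whenever the level is above the configured threshold.
 */
#define log_at(level, ...)                      \
    do {                                        \
        if ((level) <= MAX_COMPILED_LEVEL)      \
            emit(__VA_ARGS__);                  \
    } while (0)
```

With the default threshold, log_at(LVL_ERR, "disk %d failed\n", 3) produces output while log_at(LVL_DEBUG, "probing\n") does nothing; building with -DMAX_COMPILED_LEVEL=7 would keep every message.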
Landley's suggestion has a drawback in that it would require a flag
day for printk() or the creation of a new function that implemented
his suggestion with relevant changes trickling into the kernel over time.
In the meantime, small-system developers would still be looking for ways to get
the messages they want, while removing the others from the code. There was
also discussion of using separate calls for each severity level, where
pr_info(), or some similar name, would produce messages with that
level. The preprocessor could then be used to remove those that a developer
is not interested in.
The discussion led Vegard Nossum to put together an RFC for a
new kernel-message logging API.
He starts with requirements that the API be backwards-compatible with the
existing printk() usage, with the output format being extensible
at either compile or run time. The RFC also tries to handle the case of
multiple printk() calls to emit what is essentially a single
message, but it seems like an over-engineered solution to what should be
a fairly straightforward problem.
Another contender, one that is already part of the linux-tiny patchset, is
This allows developers to selectively choose source code files for which
printk() will be enabled, removing it from the rest of the code and
resulting kernel image. While not allowing fine-grained selection of
messages based on severity, it does put more control into the hands of
developers.
It is too early to say which, if any, printk() changes are coming down
the pike. There does seem to be a lot of interest in helping small systems
reduce their kernel footprint without sacrificing all diagnostic messages.
printk() is claimed to be one of the lowest hanging fruit for
significant kernel size reduction, which would seem to make it a likely
candidate for change.
The timerfd() system call was added in the 2.6.22 kernel. The
core idea behind timerfd() - allowing a process to associate a
file descriptor with timer events - is not controversial, but the
implementation of this idea did, belatedly, raise a few eyebrows. In
particular, Michael Kerrisk pointed out that timerfd() is
inconsistent with (and less powerful than) the existing timer-related
system calls, and, besides, the 2.6.22 version did not even work as
advertised. After a
fair amount of discussion, it became clear that the issues with this system
call would not be worked out in the 2.6.23 time frame. So the 2.6.23-rc7
prepatch disabled timerfd() altogether in an attempt to prevent
application developers from using an API which is going to change.
Prompted by all of this, Davide Libenzi (the creator of the original
timerfd() system call) has posted a proposal for a revised
timerfd() API. The single system call has turned into three
different calls with a few new features.
Under the new API, an application wanting to create a file descriptor for
timer events would make a call to:
int timerfd_create(int clockid);
Where clockid describes which clock should be used; it will be
either CLOCK_MONOTONIC or CLOCK_REALTIME. The return
value will, if all goes well, be the requested file descriptor.
A timer event can be requested with:
int timerfd_settime(int fd, int flags, const struct itimerspec *timer,
struct itimerspec *previous);
Here, fd is a file descriptor obtained from
timerfd_create(), and timer gives the desired expiration
time (and re-arming interval value, if desired). This time is normally a
relative time, but if the caller sets the
TFD_TIMER_ABSTIME bit in flags, it will be interpreted as
an absolute time instead. If previous is not NULL, the
pointed-to structure will be filled with the previous value of the timer.
This ability to obtain the previous value is one of the features which was
lacking in the original timerfd() implementation.
That implementation also had no way for an application to simply ask what
the current value of the timer was. The new API provides a function for
querying a timer non-destructively:
int timerfd_gettime(int fd, struct itimerspec *timer);
This system call will store the current expiration time (if any) associated
with fd into timer.
The read() interface is essentially unchanged. A process which
reads on a timer file descriptor will block if the timer has not yet
expired. It will then read a 64-bit integer value indicating how many
times the timer has expired since it was last read. A timer file
descriptor can be passed to poll(), allowing timers to be handled
in an application's main event loop.
Responses to the new API proposal have been muted at best; hopefully this
silence means that developers are happy with the new system calls. The
alternative is that this iteration of timerfd() will not be
reviewed any more extensively than its predecessor was. As things stand,
the new set of system calls looks likely to be merged for 2.6.24.
Every Linux process carries with it a set of credentials which describe its
privileges within the system. Credentials include the user ID, group
membership, capabilities, security context, and more. These credentials
are currently stored in the task_struct
structure associated with each
process; an operation which changes credentials does so by operating
directly on the task_struct
structure. This approach has worked for many
years, but it occasionally shows its age.
In particular, the current scheme makes life hard for kernel code which
needs to adopt a different set of credentials for a limited time. In an
attempt to remedy that situation,
David Howells has posted a
patch which significantly changes the handling of process credentials.
The result is a more complex system, but also a system which is more
flexible, and, with luck, more secure.
The core idea behind this patch is that all process credentials (attributes
which describe how a process can operate on other objects) should be pulled
out of the task structure into a separate structure of their own. The
result is struct cred, which holds the effective filesystem user
and group IDs, the list of group memberships, the effective capabilities,
the process keyrings, a generic pointer for security modules, and some
housekeeping information. The result is quite a bit of code churn as every
access to the old credential information is changed to look into the new
cred structure instead.
That churn is complicated by the fact that quite a bit of the credential
information has not really moved to the cred structure;
instead it is mirrored there. One of the fundamental rules for how
struct cred works is that the structure can only be changed by the
process it describes. So anything in the structure which can be changed by
somebody else - capabilities and keyrings, for example - remain in the
task_struct structure and are copied into the cred structure as
needed. "As needed," for all practical purposes, means anytime those
credentials are to be checked. So most system calls get decorated with
this extra bit of code:
result = update_current_cred();
if (result < 0)
    return result;
The next rule says that the cred structure can never be altered
once it has been attached to a task. Instead, a read-copy-update technique
must be used, wherein the cred structure is copied, the new copy
is changed, then the pointer from the task_struct structure is set to the
new structure. The old one, which is reference counted, persists while it
is in use and is eventually disposed of via RCU.
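The copy, modify, and replace sequence can be modeled in a few lines of user-space C. This is a toy, not the code from the patch: the toy_cred and toy_task structures and their helpers are invented for illustration, the reference count is not atomic, and the old structure is freed immediately where the kernel would defer the free via RCU.

```c
#include <stdlib.h>

/* Toy stand-in for struct cred; the fields are illustrative only. */
struct toy_cred {
    int refcount;
    unsigned int fsuid;
    unsigned int fsgid;
};

/* Toy stand-in for the task structure; it owns one reference. */
struct toy_task {
    struct toy_cred *cred;
};

/* Take a reference so the structure outlives any later replacement. */
static struct toy_cred *toy_get(struct toy_cred *c)
{
    c->refcount++;
    return c;
}

/* Drop a reference, freeing the structure when the last one goes away. */
static void toy_put(struct toy_cred *c)
{
    if (--c->refcount == 0)
        free(c);    /* the kernel would defer this through RCU */
}

/* Change the fsuid without ever modifying the attached structure. */
static int toy_set_fsuid(struct toy_task *task, unsigned int fsuid)
{
    struct toy_cred *old = task->cred;
    struct toy_cred *copy = malloc(sizeof(*copy));

    if (!copy)
        return -1;
    *copy = *old;           /* 1. copy the current credentials */
    copy->refcount = 1;     /*    the task will own the only reference */
    copy->fsuid = fsuid;    /* 2. modify the private copy */
    task->cred = copy;      /* 3. switch the task to the new structure */
    toy_put(old);           /* 4. drop the task's reference to the old one */
    return 0;
}
```

Anybody still holding a reference obtained from toy_get() continues to see the old, unchanged values until it calls toy_put(), which is the property the immutability rule is meant to provide.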
There is a whole set of utility functions for dealing with credentials, a
few of which are:
struct cred *get_current_cred();
void put_cred(struct cred *cred);
A call to get_current_cred() takes a reference to the current
process's cred structure and returns a pointer to that structure.
put_cred() releases a reference.
A change to a credentials structure usually involves a set of calls to:
struct cred *dup_cred(const struct cred *cred);
void set_current_cred(struct cred *cred);
The current credentials can be copied with dup_cred(); the
duplicate, once modified, can be made current with
set_current_cred(). A set of new hooks has been added to allow
security modules to participate in the duplication and setting of
credentials.
So far, this infrastructure may seem like a bunch of extra work with the
gain yet to be explained. The direction that David is going with this
change can be seen with this new function:
struct cred *get_kernel_cred(const char *service,
struct task_struct *daemon);
The purpose of this function is to create a new credentials structure with
the requisite privileges for the given service. The
daemon pointer indicates a current process which should be used as
the source for the new credentials - essentially, the new cred
structure will enable its holder to act as if it were the daemon
process. The current security module gets a chance to change how those
credentials are set up; in fact, the interpretation of the "service" string
is only done in security modules. In the absence of a security module,
get_kernel_cred() will just duplicate the credentials held by
the daemon process.
This capability is used in a new version of David's venerable FS-Cache
(formerly cachefs) patch
set. FS-Cache implements a local cache for network-based filesystems; the
locally-stored cache will, naturally, have all of the same security concerns as
the remote filesystem. There is a daemon which does a certain amount of
the cache management work, but other accesses to the cache are performed by
FS-Cache code running in the context of a process which is working with
files on the remote filesystem. Using the above function, the FS-Cache
code is able to empower any process to work with the privileges of the
daemon process for just as long as is needed to get the filesystem work
done.
The end result is that security policies can be carried further into the
kernel than before. In the FS-Cache case, kernel code doing caching work
always operates under the effective capabilities of the cache management
daemon. So any protections, SELinux policies, etc. which apply to the
daemon will also apply when FS-Cache work is being done in a different
context. This should result in a more secure system overall.
The credential work is still in a relatively early state with a fair amount
of work yet to be done. It will be quite a big patch by the time the
required changes are made throughout the kernel. So this is not a
2.6.24 candidate. The work is progressing, though, so it will likely be
knocking on the mainline door at some point.
Page editor: Jonathan Corbet