Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.23-rc8, released on September 24. It contains a relatively small number of fixes, and Linus is confident that the final release is getting close. "Of course, me feeling happy is usually immediately followed by some nasty person finding new problems, but I'll just ignore that and enjoy the feeling anyway, however fleeting it may be."
As of this writing, about 50 post-rc8 patches have gone into the mainline repository.
The current -mm tree is 2.6.23-rc8-mm1. Recent changes to -mm include some ext4 enhancements, support for read-only bind mounts, some kdump improvements, and a rework of the NFS export code.
The current stable 2.6 kernel is 2.6.22.9, released on September 26. It contains a few dozen fixes for problems throughout the kernel. 2.6.22.8, released on September 24, contains a single security fix for a privilege escalation vulnerability in the sound subsystem. 2.6.22.7, released on the 21st, is also a single-fix update; this one addresses an x86_64-only privilege escalation problem. There is a larger 2.6.22 update in the works which should be released shortly.
For older kernels: 2.6.20.20, released on September 23, fixes the x86_64 vulnerability and one other bug. The 2.6.16 series returned on September 25 with 2.6.16.54-rc1, which contains a fair number of fixes. 2.4.35.3 (September 23) also has the x86_64 fix and a couple of others.
Kernel development news
Quotes of the week
MadWifi developers move to ath5k
The developers for the MadWifi project have announced their intention to move away from their current Atheros driver (which contains a binary-only component) and, instead, work on the development of the free ath5k driver. "To underline our decision and commitment to ath5k we now declare MadWifi 'legacy.'. In the long run ath5k will replace the MadWifi driver. For the time being MadWifi will still be supported, bugs will get fixed and HAL updates will be applied where possible. But it becomes unlikely that we'll see new features or go through major changes on that codebase."
Reviving linux-tiny
An announcement of the revival of linux-tiny, a set of patches aimed at reducing the footprint of the kernel, mainly for the embedded world, has led to a number of linux-kernel threads. The conversations range from the proper place for linux-tiny to reside to the removal of the enormous number of printk() strings in the kernel. They provide an interesting glimpse into the kernel development process.
The linux-tiny project was started by Matt Mackall in December 2003 with the aim to "collect patches that reduce kernel disk and memory footprint as well as tools for working on small systems." LWN covered the announcement at the time and tried out the patches more than a year ago. Many of the linux-tiny features have found their way into the mainline, but quite a few still remain outside.
The Consumer Electronics Linux Forum (CELF) is behind the effort to revive the project, with Tim Bird, architecture group chair, announcing the plan, including a new maintainer, Michael Opdenacker. The first step has been mostly completed, bringing the patches forward from the 2.6.14 kernel to 2.6.22. A status page has been established to track the progress of updating the patches, but it is clear that moving them into the mainline, rather than maintaining them as patches, is a big motivation behind the revival.
Andrew Morton immediately volunteered to manage the linux-tiny patches in an answer to the revival message:
Reactions were quite favorable, with the new maintainer, Opdenacker, responding:
I'm finishing a sequence of crazy weeks and I will have time to send you patches one by one next week, starting with the easiest ones.
The full patchset will live in a separate repository as the individual patches are being worked on for inclusion, but it is clear that no one wants to continuously maintain an out-of-tree patchset for a long time. The cost of ensuring that the patches do not bitrot is large and their inclusion in the mainline will get them in the hands of more developers.
From there, more detailed discussion of how to structure the patches - and tiny features in general - ensued. A separate discussion also came about regarding printk() and the large amounts of memory it consumes with all of its static strings. printk() has long been seen as an area that could be improved to reduce the memory footprint of the kernel.
All sorts of kernel messages are printed to logfiles or the console via printk(); there are something on the order of 60,000 calls in 2.6. There can be a severity level associated with a specific call, which provides a primitive syslog-style categorization of the messages. Unfortunately, in the mainline, those calls are either all present, with all the associated memory for the strings, or completely absent, compiled out via a config option. It is rather difficult to diagnose problems without at least some printk() information, but keeping all of the data in can increase the size of the kernel by 5-10%.
Rob Landley started things off with a way to make it possible to only compile in messages based on their severity level. An embedded developer could remove KERN_NOTICE, KERN_DEBUG and similar low severity messages while keeping the more critical messages:
Landley's suggestion has a drawback in that it would require a flag day for printk() or the creation of a new function that implemented his suggestion with relevant changes trickling into the kernel over time. In the meantime, small-system developers would still be looking for ways to get the messages they want, while removing the others from the code. There was also discussion of using separate calls for each severity level, where pr_info(), or some similar name, would produce messages with that level. The preprocessor could then be used to remove those that a developer is not interested in.
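The per-level approach can be illustrated with a minimal user-space sketch. Everything here is an assumption made for illustration: the MAX_COMPILED_LEVEL knob, the log_msg() helper, and the pr_*() macro definitions are stand-ins and are not taken from any posted patch. Each macro either expands to a real call or to nothing, so the format strings of the disabled levels never reach the compiler at all.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Severity numbers follow the syslog convention: lower is more urgent. */
#define LVL_ERR    3
#define LVL_NOTICE 5
#define LVL_INFO   6
#define LVL_DEBUG  7

/* Hypothetical build-time knob: less urgent messages are compiled out. */
#ifndef MAX_COMPILED_LEVEL
#define MAX_COMPILED_LEVEL LVL_NOTICE
#endif

static int emitted;	/* counts messages that actually reached the log */

/* Stand-in for printk(): prefix the level, then format to stderr. */
static void log_msg(int level, const char *fmt, ...)
{
	va_list ap;

	assert(fmt != NULL);
	fprintf(stderr, "<%d>", level);
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	emitted++;
}

/* One macro per level; the preprocessor drops the disabled ones. */
#if LVL_ERR <= MAX_COMPILED_LEVEL
#define pr_err(...) log_msg(LVL_ERR, __VA_ARGS__)
#else
#define pr_err(...) ((void)0)
#endif

#if LVL_NOTICE <= MAX_COMPILED_LEVEL
#define pr_notice(...) log_msg(LVL_NOTICE, __VA_ARGS__)
#else
#define pr_notice(...) ((void)0)
#endif

#if LVL_INFO <= MAX_COMPILED_LEVEL
#define pr_info(...) log_msg(LVL_INFO, __VA_ARGS__)
#else
#define pr_info(...) ((void)0)
#endif

#if LVL_DEBUG <= MAX_COMPILED_LEVEL
#define pr_debug(...) log_msg(LVL_DEBUG, __VA_ARGS__)
#else
#define pr_debug(...) ((void)0)
#endif

/* Issue one message per level and report how many were compiled in. */
static int demo_messages(void)
{
	emitted = 0;
	pr_err("disk controller timeout\n");
	pr_notice("link is up\n");
	pr_info("probing device\n");
	pr_debug("register dump follows\n");
	return emitted;
}
```

With the default setting used here, only the error and notice messages survive; building with a different MAX_COMPILED_LEVEL changes which strings end up in the binary without touching any call site.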
The discussion led Vegard Nossum to put together an RFC for a new kernel-message logging API. He starts with requirements that the API be backwards-compatible with the existing printk() usage, with the output format being extensible at either compile or run time. The RFC also tries to handle the case of multiple printk() calls to emit what is essentially a single message, but it seems like an over-engineered solution to what should be a fairly straightforward problem.
Another contender, one that is already part of the linux-tiny patchset, is Tim Bird's DoPrintk patch. This allows developers to selectively choose source code files for which printk() will be enabled, removing it from the rest of the code and resulting kernel image. While not allowing fine-grained selection of messages based on severity, it does put more control into the hands of developers.
It is too early to say which, if any, printk() changes are coming down the pike. There does seem to be a lot of interest in helping small systems reduce their kernel footprint without sacrificing all diagnostic messages. printk() is claimed to be one of the lowest hanging fruit for significant kernel size reduction, which would seem to make it a likely candidate for change.
The new timerfd() API
The timerfd() system call was added in the 2.6.22 kernel. The core idea behind timerfd() - allowing a process to associate a file descriptor with timer events - is not controversial, but the implementation of this idea did, belatedly, raise a few eyebrows. In particular, Michael Kerrisk pointed out that timerfd() was inconsistent with (and less powerful than) the existing timer-related system calls, and, besides, the 2.6.22 version did not even work as advertised. After a fair amount of discussion, it became clear that the issues with this system call would not be worked out in the 2.6.23 time frame. So the 2.6.23-rc7 prepatch disabled timerfd() altogether in an attempt to prevent application developers from using an API which is going to change.
Prompted by all of this, Davide Libenzi (the creator of the original timerfd() system call) has posted a proposal for a revised timerfd() API. The single system call has turned into three different calls with a few new features.
Under the new API, an application wanting to create a file descriptor for timer events would make a call to:
int timerfd_create(int clockid);
Where clockid describes which clock should be used; it will be either CLOCK_MONOTONIC or CLOCK_REALTIME. The return value will, if all goes well, be the requested file descriptor.
A timer event can be requested with:
int timerfd_settime(int fd, int flags, const struct itimerspec *timer,
                    struct itimerspec *previous);
Here, fd is a file descriptor obtained from timerfd_create(), and timer gives the desired expiration time (and re-arming interval value, if desired). This time is normally a relative time, but if the caller sets the TFD_TIMER_ABSTIME bit in flags, it will be interpreted as an absolute time instead. If previous is not NULL, the pointed-to structure will be filled with the previous value of the timer. This ability to obtain the previous value is one of the features which was lacking in the original timerfd() implementation.
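As a rough illustration of relative and absolute arming, the sketch below uses the interface as it is exposed by glibc's <sys/timerfd.h> on current Linux systems. That is an assumption on our part: the shipped timerfd_create() takes an extra flags argument which is not part of the proposal described here, and the helper names rearm_relative() and arm_absolute() are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/*
 * Re-arm fd as a one-shot timer expiring sec seconds from now.  If
 * previous is not NULL it receives the setting that was replaced.
 */
static int rearm_relative(int fd, time_t sec, struct itimerspec *previous)
{
	struct itimerspec its = {
		.it_interval = { .tv_sec = 0 },	/* no re-arming interval */
		.it_value = { .tv_sec = sec },
	};

	assert(sec > 0);	/* a zero it_value would disarm the timer */
	return timerfd_settime(fd, 0, &its, previous);
}

/*
 * Arm fd to expire at an absolute point on the monotonic clock.  The
 * descriptor is assumed to have been created with CLOCK_MONOTONIC.
 */
static int arm_absolute(int fd, time_t sec_from_now)
{
	struct itimerspec its = { .it_interval = { .tv_sec = 0 } };

	if (clock_gettime(CLOCK_MONOTONIC, &its.it_value) < 0)
		return -1;
	its.it_value.tv_sec += sec_from_now;
	return timerfd_settime(fd, TFD_TIMER_ABSTIME, &its, NULL);
}
```

In the shipped implementation, the previous setting is reported as the time remaining until expiration rather than as a point in time, so a timer which was never armed comes back as all zeroes.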
That implementation also had no way for an application to simply ask what the current value of the timer was. The new API provides a function for querying a timer non-destructively:
int timerfd_gettime(int fd, struct itimerspec *timer);
This system call will store the current expiration time (if any) associated with fd into timer.
The read() interface is essentially unchanged. A process which reads on a timer file descriptor will block if the timer has not yet expired. It will then read a 64-bit integer value indicating how many times the timer has expired since it was last read. A timer file descriptor can be passed to poll(), allowing timers to be handled in an application's main event loop.
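The query, poll(), and read() steps might fit together as in the following sketch. It again assumes the glibc <sys/timerfd.h> interface found on current systems (whose timerfd_create() takes a flags argument not present in the proposal), and the three helper functions are hypothetical names chosen for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Create a monotonic timer firing every ms milliseconds; -1 on error. */
static int make_periodic_timer(long ms)
{
	struct itimerspec its;
	int fd;

	assert(ms > 0);		/* a zero value would leave it disarmed */
	fd = timerfd_create(CLOCK_MONOTONIC, 0);
	if (fd < 0)
		return -1;
	its.it_value.tv_sec = ms / 1000;
	its.it_value.tv_nsec = (ms % 1000) * 1000000L;
	its.it_interval = its.it_value;	/* repeat with the same period */
	if (timerfd_settime(fd, 0, &its, NULL) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Milliseconds until the next expiration; consumes nothing. -1 on error. */
static long remaining_ms(int fd)
{
	struct itimerspec cur;

	if (timerfd_gettime(fd, &cur) < 0)
		return -1;
	return (long)cur.it_value.tv_sec * 1000 + cur.it_value.tv_nsec / 1000000;
}

/*
 * One event-loop step: wait up to timeout_ms for the timer to become
 * readable, then fetch the expiration count.  Returns 0 on timeout or
 * error.
 */
static uint64_t poll_timer(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	uint64_t count = 0;

	if (poll(&pfd, 1, timeout_ms) <= 0)
		return 0;
	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

A real event loop would put the timer descriptor in the same pollfd array as its sockets and pipes; the single-descriptor poll() here just keeps the example short.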
Responses to the new API proposal have been muted at best; hopefully this silence means that developers are happy with the new system calls. The alternative is that this iteration of timerfd() will not be reviewed any more extensively than its predecessor was. As things stand, the new set of system calls looks likely to be merged for 2.6.24.
Credential records
Every Linux process carries with it a set of credentials which describe its privileges within the system. Credentials include the user ID, group membership, capabilities, security context, and more. These credentials are currently stored in the task_struct structure associated with each process; an operation which changes credentials does so by operating directly on the task_struct structure. This approach has worked for many years, but it occasionally shows its age.
In particular, the current scheme makes life hard for kernel code which needs to adopt a different set of credentials for a limited time. In an attempt to remedy that situation, David Howells has posted a patch which significantly changes the handling of process credentials. The result is a more complex system, but also a system which is more flexible, and, with luck, more secure.
The core idea behind this patch is that all process credentials (attributes which describe how a process can operate on other objects) should be pulled out of the task structure into a separate structure of their own. The result is struct cred, which holds the effective filesystem user and group IDs, the list of group memberships, the effective capabilities, the process keyrings, a generic pointer for security modules, and some housekeeping information. The result is quite a bit of code churn as every access to the old credential information is changed to look into the new cred structure instead.
That churn is complicated by the fact that quite a bit of the credential information has not really moved to the cred structure; instead it is mirrored there. One of the fundamental rules for how struct cred works is that the structure can only be changed by the process it describes. So anything in the structure which can be changed by somebody else - capabilities and keyrings, for example - remains in the task_struct structure and is copied into the cred structure as needed. "As needed," for all practical purposes, means anytime those credentials are to be checked. So most system calls get decorated with this extra bit of code:
result = update_current_cred();
if (result < 0)
	return result;
The next rule says that the cred structure can never be altered once it has been attached to a task. Instead, a read-copy-update technique must be used, wherein the cred structure is copied, the new copy is changed, then the pointer from the task_struct structure is set to the new structure. The old one, which is reference counted, persists while it is in use and is eventually disposed of via RCU.
There is a whole set of utility functions for dealing with credentials, a few of which are:
struct cred *get_current_cred();
void put_cred(struct cred *cred);
A call to get_current_cred() takes a reference to the current process's cred structure and returns a pointer to that structure. put_cred() releases a reference.
A change to a credentials structure usually involves a set of calls to:
struct cred *dup_cred(const struct cred *cred);
void set_current_cred(struct cred *cred);
The current credentials can be copied with dup_cred(); the duplicate, once modified, can be made current with set_current_cred(). A set of new hooks has been added to allow security modules to participate in the duplication and setting of credentials.
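The copy, modify, and replace cycle can be modeled in user space with a toy structure. This is only a sketch of the pattern: the toy_*() functions and the two-field struct cred below are hypothetical stand-ins for the calls described above, the reference count is a plain integer rather than an atomic one, and the old structure is freed immediately where the kernel would defer the release through RCU.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for struct cred; the proposed structure holds far more. */
struct cred {
	int refcount;
	unsigned int fsuid;
	unsigned int fsgid;
};

/* Toy stand-in for task_struct: just the pointer to the credentials. */
struct task {
	struct cred *cred;
};

static struct cred *toy_new_cred(unsigned int fsuid, unsigned int fsgid)
{
	struct cred *c = malloc(sizeof(*c));

	if (c) {
		c->refcount = 1;
		c->fsuid = fsuid;
		c->fsgid = fsgid;
	}
	return c;
}

/* Take a reference so the structure outlives a later replacement. */
static struct cred *toy_get_cred(struct cred *c)
{
	assert(c->refcount > 0);
	c->refcount++;
	return c;
}

/* Drop a reference; the last one frees the structure. */
static void toy_put_cred(struct cred *c)
{
	assert(c->refcount > 0);
	if (--c->refcount == 0)
		free(c);
}

/* Private copy of old, holding a single reference owned by the caller. */
static struct cred *toy_dup_cred(const struct cred *old)
{
	return toy_new_cred(old->fsuid, old->fsgid);
}

/* Install new as the task's credentials; the task drops the old set. */
static void toy_set_cred(struct task *t, struct cred *new)
{
	struct cred *old = t->cred;

	t->cred = new;
	toy_put_cred(old);
}

/* Change the fsuid without ever writing to an attached structure. */
static int toy_set_fsuid(struct task *t, unsigned int fsuid)
{
	struct cred *new = toy_dup_cred(t->cred);

	if (!new)
		return -1;
	new->fsuid = fsuid;	/* only the unattached copy is modified */
	toy_set_cred(t, new);
	return 0;
}
```

Anybody who took a reference before the change continues to see the old values, which is the property that makes an attached structure safe to read without locking.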
So far, this infrastructure may seem like a bunch of extra work with the gain yet to be explained. The direction that David is going with this change can be seen with this new function:
struct cred *get_kernel_cred(const char *service,
                             struct task_struct *daemon);
The purpose of this function is to create a new credentials structure with the requisite privileges for the given service. The daemon pointer indicates a current process which should be used as the source for the new credentials - essentially, the new cred structure will enable its holder to act as if it were the daemon process. The current security module gets a chance to change how those credentials are set up; in fact, the interpretation of the "service" string is only done in security modules. In the absence of a security module, get_kernel_cred() will just duplicate the credentials held by daemon.
This capability is used in a new version of David's venerable FS-Cache (formerly cachefs) patch set. FS-Cache implements a local cache for network-based filesystems; the locally-stored cache will, naturally, have all of the same security concerns as the remote filesystem. There is a daemon which does a certain amount of the cache management work, but other accesses to the cache are performed by FS-Cache code running in the context of a process which is working with files on the remote filesystem. Using the above function, the FS-Cache code is able to empower any process to work with the privileges of the daemon process for just as long as is needed to get the filesystem work done.
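The "act as the daemon for a while" pattern might look something like the user-space model below. None of this is taken from the FS-Cache patches: run_as(), touch_cache(), and the reduced struct cred are hypothetical, reference counting and security-module hooks are left out entirely, and the permission check is a deliberately trivial owner comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Toy credentials; never written once a task points at them. */
struct cred {
	unsigned int fsuid;
	unsigned int fsgid;
};

/* Toy task: only the credential pointer matters for this sketch. */
struct task {
	const struct cred *cred;
};

/* An operation which is checked against the task's acting credentials. */
typedef int (*cache_op)(const struct task *self, void *arg);

/*
 * Run op() with daemon_cred as the acting credentials of self, putting
 * the original credentials back whether or not the operation succeeds.
 */
static int run_as(struct task *self, const struct cred *daemon_cred,
		  cache_op op, void *arg)
{
	const struct cred *saved = self->cred;
	int ret;

	assert(self != NULL && daemon_cred != NULL && op != NULL);
	self->cred = daemon_cred;	/* borrow the daemon's privileges */
	ret = op(self, arg);
	self->cred = saved;		/* and give them back afterward */
	return ret;
}

/* Example operation: allowed only when the acting fsuid owns the cache. */
static int touch_cache(const struct task *self, void *arg)
{
	unsigned int cache_owner = *(unsigned int *)arg;

	return self->cred->fsuid == cache_owner ? 0 : -1;
}
```

Because the borrowed structure is never modified, it can be shared between the daemon and any number of temporary users; only the pointer in each task changes.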
The end result is that security policies can be carried further into the kernel than before. In the FS-Cache case, kernel code doing caching work always operates under the effective capabilities of the cache management daemon. So any protections, SELinux policies, etc. which apply to the daemon will also apply when FS-Cache work is being done in a different context. This should result in a more secure system overall.
The credential work is still in a relatively early state with a fair amount of work yet to be done. It will be quite a big patch by the time the required changes are made throughout the kernel. So this is not a 2.6.24 candidate. The work is progressing, though, so it will likely be knocking on the mainline door at some point.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
