Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.5, which was announced by Linus on April 3. Changes since -rc3 include another ALSA update, some architecture updates, and various fixes.

Linus's BitKeeper repository has no new patches; he is off the net for the week. In its place, Andrew Morton has put together a "merge candidate" tree, the current release of which is 2.6.5-mc2. This tree contains the laptop mode patches, a set of ReiserFS updates, IPv6 support for SELinux, the lightweight auditing framework (see below), the POSIX message queues patch, the fcntl() file_operations method (covered here last month), some virtual memory improvements, non-exec stack support, various architecture updates, and lots of fixes - 207 patches in all.

The current -mm tree is 2.6.5-mm2; recent additions to -mm include some software suspend fixes, an autofs4 update, and more fixes. The 4G/4G virtual memory patch has been dropped for now; it was suspected of causing some problems, and it gets in the way of the other virtual memory work being done.

The current 2.4 prepatch is 2.4.26-rc2, which was released by Marcelo on April 5. This patch adds a relatively small number of fixes, including some IDE updates and an XFS update.

Kernel development news

A new device naming scheme

A recent posting on linux-kernel announced the creation of a new mailing list, hosted at OSDL, for the discussion of device naming schemes. The Linux Standard Base does not currently specify device names, but its maintainers would like to change that. To that end, they are seeking input on how devices should be named on Linux systems.

The discussion, so far, has centered around a proposal (available in PDF format) from SUSE. Its purpose is to create a set of persistent device names which will remain valid even in a hotpluggable world where the hardware configuration can change at any time. To that end, the proposal creates a version of /dev which is radically different from anything seen on current Linux systems.

All of the current device names found in /dev are relegated to the category of "compatibility names." They will still exist, but the proposal suggests that they should be maintained by udev, rather than being a static part of the system. The new names, instead, will all be found in subdirectories under /dev. Disks will be in /dev/disk (with a "k"), and the obvious things will be found in other directories, such as /dev/printer, /dev/cdrom (these, evidently, are not "disks"), or /dev/modem.

The proposal calls for another level of subdirectories before you find any actual device names. Each of the /dev subdirectories would be further divided into by-path, which names each device by how it is connected to the system; by-serial, which uses the device's model name and serial number; by-uuid, which uses a device's "universal unique identifier"; and by-label, which uses a device's filesystem label. Thus, a system's root partition might have all of the following names:

  • /dev/disk/by-path/ide-0.0-part1
  • /dev/disk/by-serial/ata-ST340810A-53-5BIN-part1
  • /dev/disk/by-label/label-ROOT
  • /dev/disk/by-uuid/uuid-0bee1954-b245-4df1-b2af-785fecd75b8f
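
As a rough illustration of how the four schemes relate to one another, the following Python sketch composes the names in the list above from a description of a single partition. The Partition class, its field names, and the composition rules are assumptions made for this example, inferred only from the sample names; the proposal itself defines the real rules.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical description of one disk partition (not from the proposal)."""
    bus: str        # how the disk is attached, e.g. "ide"
    position: str   # position on that bus, e.g. "0.0"
    transport: str  # prefix used in the by-serial name, e.g. "ata"
    model: str      # model and serial string, copied from the example
    number: int     # partition number
    label: str      # filesystem label
    uuid: str       # filesystem UUID


def persistent_names(part: Partition, kind: str = "disk") -> dict:
    """Return one name per scheme, mirroring the pattern of the examples."""
    base = f"/dev/{kind}"
    suffix = f"part{part.number}"
    return {
        "by-path": f"{base}/by-path/{part.bus}-{part.position}-{suffix}",
        "by-serial": f"{base}/by-serial/{part.transport}-{part.model}-{suffix}",
        "by-label": f"{base}/by-label/label-{part.label}",
        "by-uuid": f"{base}/by-uuid/uuid-{part.uuid}",
    }


root = Partition("ide", "0.0", "ata", "ST340810A-53-5BIN", 1, "ROOT",
                 "0bee1954-b245-4df1-b2af-785fecd75b8f")
for scheme, name in persistent_names(root).items():
    print(f"{scheme:10} {name}")
```

Only the by-path entry depends on where the disk is plugged in; in this sketch, moving the drive to another position changes that one name and leaves the other three alone.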

The use of multiple names for the same device does not sit well with everybody; fears have been expressed that it could confuse users and applications which perform user-space locking by device name. The by-path names were received critically; since the path can change on a modern system, those names will never be persistent. There were also complaints about by-label and by-uuid; those names are meant to allow Linux systems to find and mount disks regardless of their position in the device hierarchy, but the mount utility already implements that functionality.

While there have been complaints about the SUSE proposal, there have not, thus far, been a lot of alternatives put forward. Something, however, is clearly going to have to change. A Fedora Core 2 Test 2 system has almost 19,000 entries under /dev; this mass of names can only get larger and increasingly unmaintainable. And it fails to address the dynamic nature of devices in modern systems. Device naming looks to be an interesting issue for some time to come.

Capabilities in 2.6

The kernel capability mechanism gives (relatively) fine-grained control over what actions any given process can perform. The various capabilities include the ability to override file permissions, send signals to other processes, bind to low-numbered ports, and many other tasks. There have been visions over the years of exporting capabilities to user space and eliminating the "all-powerful superuser" concept, but none of those visions have been implemented in any widely-distributed way.

One of the capabilities is called CAP_IPC_LOCK; it gives a process the ability to lock a region of virtual memory into physical RAM. This capability needs to be controlled; otherwise a rogue process could lock up all of physical memory and effectively shut down the system. There are, however, legitimate reasons for giving this capability to normal users. Programs which handle encryption (such as gpg) would like to lock in some of their memory so that passphrases and clear text do not get written out to swap. Systems like Oracle need the capability to lock in their shared segments (since they do their own paging, essentially) and to be able to allocate large page "hugetlb" segments.

To this end, Andrea Arcangeli posted a patch which allows the system administrator to disable CAP_IPC_LOCK checking via a sysctl variable. With those checks disabled, any non-privileged process can lock pages into memory or allocate large-page shared memory segments. Andrea asked for the patch to be incorporated into the 2.6 mainline.

The patch inspired some thinking on how best to make certain capabilities available to users. There has been a patch in circulation for a while which simply opens up memory locking to everybody, but which puts a resource limit on the number of pages which can be locked. The default limit is a single page, which works for gpg but which does not easily threaten the system as a whole. With a suitably adjusted limit, this patch should work for Oracle as well - but it does not address the large-page shared memory issue.
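
The difference between a pure capability check and the resource-limit approach can be sketched in a few lines of Python. This is a toy model, not kernel code: the Process class, the may_lock() function, and the raised limit are invented for illustration, and the rule that a privileged process bypasses the limit is an assumption. Only the one-page default comes from the description above.

```python
from dataclasses import dataclass

DEFAULT_LOCK_LIMIT = 1  # pages; the default described for the rlimit patch


@dataclass
class Process:
    """Toy stand-in for a task; every field here is hypothetical."""
    has_cap_ipc_lock: bool = False
    locked_pages: int = 0
    lock_limit: int = DEFAULT_LOCK_LIMIT


def may_lock(proc: Process, pages: int, use_rlimit: bool) -> bool:
    """Decide whether proc may lock `pages` more pages into RAM."""
    if proc.has_cap_ipc_lock:
        return True          # assumption: the capability bypasses any limit
    if not use_rlimit:
        return False         # capability-only policy: unprivileged means no
    # rlimit policy: anybody may lock, up to the per-process limit
    return proc.locked_pages + pages <= proc.lock_limit


gpg = Process()                        # unprivileged, default limit
database = Process(lock_limit=262144)  # hypothetical administrator-raised limit

print(may_lock(gpg, 1, use_rlimit=False))
print(may_lock(gpg, 1, use_rlimit=True))
print(may_lock(gpg, 2, use_rlimit=True))
print(may_lock(database, 100000, use_rlimit=True))
```

The model covers page locking only; as noted above, a limit of this kind says nothing about who may create large-page shared memory segments.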

William Lee Irwin put together a different patch which allows the administrator to turn off checks for any capability via a set of sysctl variables. It differs from Andrea's patch in its generality, but also by virtue of using the security module framework rather than direct changes to the kernel core. Some people seemed to like this patch better, though there was some nervousness about its overall security which led William to add a strong comment and a lockdown capability to the patch.
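
The general idea of switching off individual capability checks can be modeled with a bitmask. The sketch below is not William's patch: the CapabilityPolicy class and its methods are hypothetical, and treating the "lockdown" as a one-way switch that freezes the settings is this example's interpretation. The capability numbers are those defined in the kernel's capability header.

```python
# Capability numbers as defined in <linux/capability.h>.
CAP_DAC_OVERRIDE = 1
CAP_KILL = 5
CAP_NET_BIND_SERVICE = 10
CAP_IPC_LOCK = 14


class CapabilityPolicy:
    """Toy model of administrator-controlled capability checks (hypothetical)."""

    def __init__(self) -> None:
        self.disabled = 0    # bitmask of checks the administrator turned off
        self.locked = False  # once set, the mask can no longer change

    def disable_check(self, cap: int) -> None:
        """Turn off the check for one capability, unless locked down."""
        if self.locked:
            raise PermissionError("policy is locked down")
        self.disabled |= 1 << cap

    def lock_down(self) -> None:
        """Freeze the current settings (assumed to be irreversible)."""
        self.locked = True

    def capable(self, effective: int, cap: int) -> bool:
        """True if a task with the `effective` mask passes the check for cap."""
        return bool((effective | self.disabled) & (1 << cap))


policy = CapabilityPolicy()
policy.disable_check(CAP_IPC_LOCK)  # let everybody lock memory
policy.lock_down()

ordinary_user = 0                   # no effective capabilities at all
print(policy.capable(ordinary_user, CAP_IPC_LOCK))
print(policy.capable(ordinary_user, CAP_KILL))
```

A disabled check behaves as if every process held that capability, which is why a mechanism of this sort invites the nervousness described above.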

Given that the whole idea behind capabilities was to be able to give specific capabilities to individual users, however, some developers wondered why the current system couldn't be used. To this end, Andrew Morton looked into hacking login to enable it to give capabilities to users. He was not impressed with what he found once he started trying to work with kernel capabilities:

It turns out that the whole "drop capabilities and then run something" thing does not work in either 2.4 or 2.6. And hasn't done since forever. What we have in there is no more useful than suser()...

I must say that I'm fairly disappointed that we developed and merged all that fancy security stuff but nobody ever bothered to fix up the existing simple capability code. Particularly as, apparently, the new security stuff STILL cannot solve the extremely simple Oracle-wants-CAP_IPC_LOCK requirement.

It was pointed out that SELinux can, in fact, solve this problem. But that will be little comfort to those who are not yet ready to adopt SELinux for their production systems.

The problem may originate from the fact that the visions of fully capability-driven systems involve assigning capabilities to all executables and having a process's capabilities tweaked every time a new program is run. That part of the system has never been merged into the mainline, partly because nobody has ever really figured out how to deal with system administration when every file has another 32 permission bits added onto it. The end result, in any case, is that the capability subsystem has never worked quite as it should. Given that Andrew is the gatekeeper, chances are good that some sort of fix for that problem will get into the kernel before any sort of more complicated solution to the problem of giving capabilities to users.

The lightweight auditing framework

One of the patches in Andrew Morton's "merge candidate" tree is the lightweight auditing framework. This patch, written by Rik Faith, is intended to be a way for the kernel to get various types of audit information out to user space without slowing things down, especially when auditing is not being used. The framework is meant to serve as a complement to SELinux; it is already being shipped as a part of the Fedora Core 2 Test 2 kernel.

There are two kernel-side components to the audit code. The first is a generic mechanism for creating audit records and communicating with user space. All of that communication is performed via netlink sockets; there are no new system calls added as part of the audit framework. Essentially, a user-space process creates a NETLINK_AUDIT socket, writes audit_request structures to it, and reads back audit_reply structures in return.

The generic part of the audit mechanism can control whether auditing is enabled at all, perform rate limiting of messages, and handle a few other tasks. On the kernel side, it provides a printk()-like mechanism for sending messages to user space. This code also implements a user-specified policy on what happens if memory is not available for auditing; truly paranoid administrators can request that the kernel panic in such situations.

The audit patch includes some SELinux tweaks to make it use the audit functions rather than printk() when it has something to log.

The audit logging code expects an audit daemon to be running to accept messages via the netlink socket. Code for an example daemon is available in Rik's Red Hat web area. Should there be no daemon running, log messages are simply passed to printk() instead.
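
The message path described in the last few paragraphs (a printk()-like entry point, rate limiting, delivery to a daemon, and a printk() fallback) can be modeled in user-space Python. Everything in this sketch is hypothetical: the AuditLog class, the one-second window, and the choice to rate-limit before choosing a destination are assumptions, and a plain callable stands in for the netlink socket.

```python
import time


class AuditLog:
    """Toy model of the audit message path (all names are hypothetical)."""

    def __init__(self, rate_limit, daemon=None, clock=time.monotonic):
        self.rate_limit = rate_limit  # max messages per second; 0 = unlimited
        self.daemon = daemon          # callable standing in for the netlink socket
        self.clock = clock
        self.window_start = clock()
        self.sent_in_window = 0
        self.dropped = 0
        self.printk_buffer = []       # stands in for the kernel log

    def _allowed(self):
        """Simple fixed-window rate limiter."""
        now = self.clock()
        if now - self.window_start >= 1.0:
            self.window_start, self.sent_in_window = now, 0
        if self.rate_limit and self.sent_in_window >= self.rate_limit:
            return False
        self.sent_in_window += 1
        return True

    def log(self, fmt, *args):
        """printk()-like entry point: rate-limit, format, then deliver."""
        if not self._allowed():
            self.dropped += 1
            return
        message = fmt % args
        if self.daemon is not None:
            self.daemon(message)                # an audit daemon is listening
        else:
            self.printk_buffer.append(message)  # no daemon: fall back to printk


received = []
audit = AuditLog(rate_limit=2, daemon=received.append, clock=lambda: 0.0)
for number in range(3):
    audit.log("syscall=%d", number)
print(received, audit.dropped)

fallback = AuditLog(rate_limit=0)
fallback.log("no daemon, uid=%d", 500)
print(fallback.printk_buffer)
```

The out-of-memory policy is left out of the model; it would be one more branch at the point where the message is formatted.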

In addition to the generic support code, the audit patch includes a mechanism for auditing system calls. One gets the sense that this was the real purpose for the patch. System call auditing is off by default, but a suitably privileged user-space process can turn it on and load a whole set of rules describing what should be logged. Rules can test on various attributes of the calling process, including its process ID, user and group ID (both "real" and "effective"), etc. Rules can also be set to fire on accesses to particular devices or files. Finally, there are also tests on specific system call arguments, whether the call succeeds, or for a specific return value.
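
A rule set of this kind can be pictured as a list of filters applied to each system call. The Python sketch below is a model only: the SyscallEvent and Rule classes and their field names are invented here, the set of fields is a small subset of what the text lists, and it does not reflect the kernel's actual rule format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyscallEvent:
    """What a rule can look at; field names are invented for this sketch."""
    syscall: str
    pid: int
    uid: int
    euid: int
    success: bool
    exit_code: int


@dataclass
class Rule:
    """A rule fires when every field that is not None matches the event."""
    syscall: Optional[str] = None
    pid: Optional[int] = None
    uid: Optional[int] = None
    euid: Optional[int] = None
    success: Optional[bool] = None
    exit_code: Optional[int] = None

    def matches(self, event: SyscallEvent) -> bool:
        for name in ("syscall", "pid", "uid", "euid", "success", "exit_code"):
            wanted = getattr(self, name)
            if wanted is not None and wanted != getattr(event, name):
                return False
        return True


def should_log(rules, event, enabled=False):
    """Syscall auditing is off by default; when on, any matching rule logs."""
    return enabled and any(rule.matches(event) for rule in rules)


rules = [
    Rule(syscall="open", success=False),  # every failed open()
    Rule(syscall="chmod", euid=0),        # every chmod() run as root
]
failed_open = SyscallEvent("open", pid=1234, uid=500, euid=500,
                           success=False, exit_code=-13)
good_open = SyscallEvent("open", pid=1234, uid=500, euid=500,
                         success=True, exit_code=3)

print(should_log(rules, failed_open, enabled=True))
print(should_log(rules, good_open, enabled=True))
print(should_log(rules, failed_open))
```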

Included with the audit daemon is an auditctl utility which can be used for setting and tweaking rules.

The audit mechanism will give system administrators a new tool for looking at what is going on between user space and the kernel. With the addition of some user-space utilities, it could become a powerful facility for tracking down system problems and security issues - or for any number of big-brotherish applications. Expect to see it in 2.6.6.

Patches and updates

Kernel trees

  • Andrea Arcangeli: 2.6.5-aa1. (April 4, 2004)
  • Andrea Arcangeli: 2.6.5-aa2. (April 5, 2004)
  • Andrea Arcangeli: 2.6.5-aa3. (April 5, 2004)


Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds