Kernel development
Brief items
Kernel release status
The current stable 2.6 kernel is 2.6.16.11, released on April 24. It is a single-patch release containing a fix for a CIFS filesystem vulnerability. 2.6.16.10, also released on the 24th, contained a larger set of important fixes.

The current 2.6 prepatch remains 2.6.17-rc2; there have been no -rc releases over the last week. Patches are accumulating in the mainline git repository, however; they are mostly fixes, but there is also trusted platform module (TPM) 1.2 support, multiple page size support for the PA-RISC architecture, and the vmsplice() system call (see below).
There have been no -mm tree releases over the last week.
Kernel development news
The splice() weekly news
Jens Axboe sent around a note on the status of splice(). He notes that the splice() and tee() interfaces - on both the user and kernel side - should be stable now, with no further changes anticipated. The sendfile() system call has been reworked to use the splice() machinery, though that process will not be complete until after the 2.6.18 kernel cycle opens.

While splice() might be stable, things are still happening. In particular, Jens has added yet another system call:
    long vmsplice(int fd, void *buffer, size_t len, unsigned int flags);
While the regular splice() call will connect a pipe to a file, this call, instead, is designed to feed user-space memory directly into a pipe. So the memory range of len bytes starting at buffer will be pushed into the pipe represented by fd. The flags argument is not currently used.
Using vmsplice(), an application which generates data in a memory buffer can send that data on to its eventual destination in a zero-copy manner. With a suitably-sized buffer, the application can do easy double-buffering; half of the buffer can be under I/O with vmsplice() while the other half is being filled. If the buffer is big enough, the application need only call vmsplice() each time half of the buffer has been filled, and the rest will simply work with no need for multiple threads or complicated synchronization mechanisms.
Getting the buffer size right is important, however. If the buffer is at least twice as large as the maximum number of pages that the kernel will load into a pipe at any given time, a successful vmsplice() of half of the buffer can be safely interpreted by the application as meaning that the other half of the buffer is no longer under I/O. Since half of the buffer will completely fill the space available within a kernel pipe, that half can only be inserted when all other data has been consumed out of the pipe - in simple situations, anyway. So, after vmsplice() succeeds, the application can safely refill the second half with new data. If the application gets confused, however, it could find itself overwriting data which has not yet been consumed by the kernel.
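As a rough illustration, here is a minimal sketch of the double-buffering pattern described above, written against the vmsplice() prototype shown earlier. The generate_data() helper, the buffer size, and the availability of vmsplice() as an ordinary C function are all assumptions made for the example; error and short-write handling are omitted.

    #include <stddef.h>

    /* Prototype as shown above; a real program would need a C
     * library or syscall(2) wrapper to get at the system call. */
    extern long vmsplice(int fd, void *buffer, size_t len,
                         unsigned int flags);

    /* Hypothetical generator which fills len bytes at buf. */
    extern void generate_data(char *buf, size_t len);

    /* Each half should match the maximum number of pages the kernel
     * will load into the pipe at once; 16 4KB pages is purely a
     * guess here - see the F_GETPSZ discussion below for how an
     * application would obtain the real number. */
    #define HALF (16 * 4096)

    static char buffer[2 * HALF];

    static int produce(int pipe_fd)
    {
        int half = 0;

        for (;;) {
            char *p = buffer + half * HALF;

            generate_data(p, HALF);
            /* In the simple single-pipe case, this call can only
             * succeed after the previously spliced half has been
             * drained, so that half is now safe to refill. */
            if (vmsplice(pipe_fd, p, HALF, 0) < 0)
                return -1;
            half = !half;
        }
    }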
Jens's patch adds a couple of fcntl() operations intended to help in this regard. The F_GETPSZ operation will return the maximum number of pages which can be inserted into a pipe buffer, which is also the maximum number of pages which can be under I/O from a vmsplice() operation. There is also F_SETPSZ for changing the maximum size, though that operation just returns EINVAL for now. Linus, however, worries that this information is not enough to know that a given page is no longer under I/O. In situations where there are other buffers in the kernel - perhaps just another pipe in series - the kernel could still have references to a page even after that page has been consumed out of the original pipe. Networking adds some challenges of its own: if a page has been vmsplice()ed to a TCP socket, it will not be reusable until the remote host has acknowledged the receipt of the data contained within that page. That acknowledgment will arrive long after the page has been consumed out of the pipe buffer.
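In concrete terms, the buffer-sizing query described above might look like the following sketch. F_GETPSZ comes from Jens's patch rather than from any standard header, so the numeric value used below is a placeholder, not the patch's actual one.

    #include <fcntl.h>
    #include <unistd.h>

    /* Placeholder definition for illustration; real code would get
     * this from the patched kernel headers. */
    #ifndef F_GETPSZ
    #define F_GETPSZ 12
    #endif

    /* Size the user-space buffer at twice the pipe's page limit,
     * per the rule described above. */
    static size_t safe_buffer_size(int pipe_fd)
    {
        long max_pages = fcntl(pipe_fd, F_GETPSZ);

        if (max_pages <= 0)
            return 0;  /* unpatched kernel - caller must guess */
        return 2 * (size_t)max_pages * getpagesize();
    }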
What this all means is that the vmsplice() interface probably needs a bit more work. In particular, there may need to be yet another system call which would allow an application to know that the kernel is done with a specific page. The current vmsplice() implementation is also unable to connect an incoming pipe to user-space memory. Making the read side work is a rather more complicated affair, and may not happen in the near future.
OpenVZ's live checkpointing
The OpenVZ project is a GPL-licensed subset of SWSoft's proprietary Virtuozzo offering. With OpenVZ, a Linux system can implement multiple "virtual environments", each of which appears, to the processes running within it, to be a separate, standalone system. Virtual environments can have their own IP addresses and be subjected to specific resource limits. They are, in other words, an implementation of the container concept, one of several for Linux. In recent times the various virtualization and container projects have shown a higher level of interest in getting at least some of their code merged into the mainline kernel, and OpenVZ is no exception. So the OpenVZ developers have been maintaining a higher profile on the kernel mailing lists.

The latest news from OpenVZ is this announcement of a new release with a major feature addition: live checkpointing and migration of virtual environments. An environment (being a container full of Linux processes) can be checkpointed to a file, allowing it to be restarted at some later time. But it is also possible to checkpoint a running virtual environment and move it to another system, with no interruption in service. This feature, clearly meant to be competitive with Xen's live migration capabilities, enables run-time load balancing across systems.
The OpenVZ patch, weighing in at 2.2MB, is not for the faint of heart; it makes the price to be paid for these features quite clear. Much of what is contained within the patch has been discussed here before; for example, it contains the PID virtualization patches, and every bit of code within the kernel must be aware of whether it is working with "real" or "virtual" process IDs. A number of other kernel interfaces must be changed to support OpenVZ's virtualization features; among other things, many device drivers and filesystems require tweaks.
As might be expected, the checkpointing code is on the long and complicated side. The checkpoint process starts by putting the target process(es) on hold, in a manner similar to what the software suspend code does. Then it comes down to a long series of routines which serialize and write out every data structure and bit of memory associated with a virtual environment. The obvious things are saved: process memory, open files, etc. But the code must also save the full state of each TCP socket (including the backlog of sk_buff structures waiting to be processed), connection tracking information, signal handling status, SYSV IPC information, file descriptors obtained via Unix-domain sockets, asynchronous I/O operations, memory mappings, filesystem namespaces, data in tmpfs files, tty settings, file locks, epoll() file descriptors, accounting information, and more.
For each of the objects to be saved, an in-file version of the kernel data structure must be created. Each dump routine then serializes one or more data structures into the proper format for writing to the checkpoint file. It all apparently works, but it has the look of a highly brittle system - almost any change to the kernel's data structures seems guaranteed to break the checkpoint and restore code. Even if the checkpoint and restore code were merged into the mainline, getting kernel developers to understand (and care about) that code would be a challenge. Keeping it working must be an ongoing hassle, whether or not the code is in the mainline tree.
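To make the concern concrete, the dump routines follow a pattern along these lines. All of the names in this sketch are invented for illustration - the real OpenVZ code is far larger and more careful - but the structural coupling is the point:

    #include <linux/fs.h>
    #include <linux/types.h>

    /* Invented context type carrying the output method. */
    struct cpt_context {
            int (*write)(struct cpt_context *ctx, void *buf,
                         size_t len);
    };

    /* An on-disk "image" mirroring an in-kernel structure; every
     * interesting field of the kernel structure needs a counterpart
     * here, and must track future kernel changes. */
    struct cpt_file_image {
            __u64 pos;      /* f_pos: current file position */
            __u32 flags;    /* f_flags: open flags */
            __u32 mode;     /* f_mode: access mode */
            /* ... the real image records much more ... */
    };

    static int dump_one_file(struct file *file, struct cpt_context *ctx)
    {
            struct cpt_file_image img = {
                    .pos   = file->f_pos,
                    .flags = file->f_flags,
                    .mode  = file->f_mode,
            };

            /* Any field later added to struct file must also be
             * added here (and to the matching restore routine), or
             * state is silently lost - the brittleness described
             * above. */
            return ctx->write(ctx, &img, sizeof(img));
    }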
None of the above should be interpreted to say that OpenVZ's features are not worth the cost. Virtual environments, checkpointing, and live migration are powerful and useful features. But the virtualization of everything within the kernel will lead to a higher level of internal complexity and higher maintenance costs. The process of deciding where that line gets drawn - which features are merged, and which are not - will be interesting to watch.
The AppArmor debate begins
Novell announced the release of the AppArmor security module last January. Then everything went quiet; in particular, no attempt was made to get the AppArmor code merged into the mainline kernel. The silence was broken last week, however, as a result of the discussion on the possible removal of the Linux security module (LSM) API. The submission of the AppArmor code has had the desired short-term effect: the discussion has moved away from removal of the LSM interface and toward the merits of AppArmor. The AppArmor developers may not see that shift as a blessing at the moment, however.

As expected, AppArmor has taken a fair amount of criticism. The largest complaint is the fact that AppArmor uses pathnames for its security policies. Using AppArmor, a system administrator can provide a list of files accessible by a given application; anything not on the list becomes inaccessible. Other things - such as capabilities - are also configurable, but there is no controversy over that aspect of the system. It is the use of pathnames which raises red flags.
The sticking point is that a given file name is not the file itself. So, while /etc/shadow might identify the shadow password file, that name is not the shadow password file. If an attacker is able to create another name for that file (through the use of links or namespaces, perhaps), that other name could become a way for the attacker to access the shadow password file. So, even if AppArmor forbids access to /etc/shadow for a given application, that application might still have access to other pathnames which could be made to refer to the same file.
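A schematic illustration of that aliasing concern follows, with an invented profile and invented paths; making the link requires a same-filesystem target and sufficient privilege, and the two steps would normally happen in different processes.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Step 1 (attacker, outside the confined application):
         * give the protected file a name the profile allows. */
        if (link("/etc/shadow", "/var/data/innocent") != 0) {
            perror("link");
            return 1;
        }

        /* Step 2 (inside the confined application): the profile
         * permits "/var/data/*" but denies "/etc/shadow" - yet
         * this open reaches exactly the same inode. */
        int fd = open("/var/data/innocent", O_RDONLY);
        printf("open via allowed name: %s\n",
               fd >= 0 ? "allowed" : "denied");
        if (fd >= 0)
            close(fd);
        return 0;
    }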
AppArmor thus differs from the SELinux approach, which attaches labels to objects and enforces access control rules based on the labels. With SELinux, the shadow password file has the same label (and, thus, the same access rules) regardless of the name by which it is accessed. So SELinux lacks a possible failure mode (rule bypass through different file names) that exists in AppArmor. Of course, as any SELinux administrator knows, maintaining file labels in a consistent and correct state poses challenges of its own.
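The contrast can be expressed in conceptual C; neither function below reflects actual SELinux or AppArmor code, and every type and helper is an invented stand-in.

    /* Invented stand-in types for the comparison. */
    struct object  { int label; };
    struct subject { int label; };

    extern int policy_allows(int subject_label, int object_label,
                             int requested);
    extern int profile_allows(struct subject *s, const char *pathname,
                              int requested);

    /* Label-based (SELinux-style): the decision hangs off the
     * object itself, so every name for it gets the same answer. */
    static int label_check(struct subject *s, struct object *o,
                           int requested)
    {
        return policy_allows(s->label, o->label, requested);
    }

    /* Pathname-based (AppArmor-style): the decision hangs off the
     * name used, so two names for one object can get different
     * answers. */
    static int path_check(struct subject *s, const char *pathname,
                          int requested)
    {
        return profile_allows(s, pathname, requested);
    }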
The other problem with the AppArmor approach is that the LSM API is not well suited to pathname-based security policies. As a result, AppArmor must often go through a fair amount of (potentially expensive) pain to obtain the names corresponding to files. The impedance mismatch between AppArmor and LSM is not generally seen as a reason to keep AppArmor out of the kernel, but it has led to suggestions that the AppArmor developers should either extend LSM for pathname-based policies or just add their own patches and drop LSM altogether. If AppArmor gets past the other objections, some work will almost certainly have to be done in this area.
At this point, how any decision will be made on merging AppArmor is far from clear. It has not escaped notice that some of the strongest criticism of AppArmor is coming from the SELinux camp; SELinux developer Stephen Smalley has defended that criticism.
The proponents of AppArmor claim that the approach is sound. Unlike SELinux, AppArmor does not attempt to be the ultimate security solution for all situations. Instead, it simply puts a lid on applications which might be compromised by an attacker. AppArmor raises the bar by limiting what a broken application might do; it does not attempt to regulate the interactions between every application and every object in the system. This approach is, it is claimed, enough to significantly raise the security of a system while maintaining an administrative interface which is accessible to mere mortals. And, for AppArmor's goals, a pathname-based access control mechanism is said to be good enough. It will probably be some time before we see whether the kernel development community agrees with that claim.
(See also: this detailed criticism of pathname-based access control by Joshua Brindle).
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet