Brief items
The current stable 2.6 kernel is 2.6.16.11,
released on April 24. It
is a single-patch release containing a fix for a CIFS filesystem
vulnerability. 2.6.16.10, also released on the 24th, contained a larger
set of important fixes.
The current 2.6 prepatch remains 2.6.17-rc2; there have been no -rc
releases over the last week. Patches are accumulating in the mainline git
repository, however; they are mostly fixes, but there is also Trusted
Platform Module (TPM) 1.2 support, multiple page size support for the
PA-RISC architecture, and the vmsplice() system call (see below).
There have been no -mm tree releases over the last week.
Kernel development news
Jens Axboe sent around
a note on the status of
splice(). He notes that the
splice() and
tee() interfaces - on both the user and
kernel side - should be stable now, with no further changes anticipated.
The
sendfile() system call has been reworked to use the
splice() machinery, though that process will not be complete until
after the 2.6.18 kernel cycle opens.
While splice() might be stable, things are still happening. In
particular, Jens has added yet
another system call:
long vmsplice(int fd, void *buffer, size_t len, unsigned int flags);
While the regular splice() call will connect a pipe to a file,
this call, instead, is designed to feed user-space memory directly into a
pipe. So the memory range of len bytes starting at
buffer will be pushed into the pipe represented by fd. The
flags argument is not currently used.
Using vmsplice(), an application which generates data in a memory
buffer can send that data on to its eventual destination in a zero-copy
manner. With a suitably-sized buffer, the application can do easy
double-buffering; half of the buffer can be under I/O with
vmsplice() while the other half is being filled. If the buffer is
big enough, the application need only call vmsplice() each time
half of the buffer has been filled, and the rest will simply work with no
need for multiple threads or complicated synchronization mechanisms.
Getting the buffer size right is important, however. If the buffer is at least
twice as large as the maximum number of pages that the kernel will load
into a pipe at any given time, a successful vmsplice() of half of
the buffer can be safely interpreted by the application as meaning that the
other half of the buffer is no longer under I/O. Since half of the
buffer will completely fill the space available within a kernel pipe, that
half can only be inserted when all other data has been consumed out of the
pipe - in simple situations, anyway. So, after vmsplice()
succeeds, the application
can safely refill the second half with new data. If the application gets
confused, however, it could find itself overwriting data which has not yet
been consumed by the kernel.
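As an illustration, here is a minimal sketch of that double-buffering
pattern, using the vmsplice() prototype as posted. The 16-page half size
and the fill_half() helper are assumptions made for the example; a real
application would size the buffer from the pipe's capacity (see the
F_GETPSZ operation described just below).

    /*
     * Sketch only: vmsplice() had no C library wrapper at this
     * point, so it is declared by hand here.  fill_half() stands in
     * for whatever generates the application's data.
     */
    #include <stdlib.h>
    #include <unistd.h>

    #define HALF_SIZE (16 * 4096)   /* assumption: one half fills the pipe */

    extern long vmsplice(int fd, void *buffer, size_t len,
                         unsigned int flags);
    extern void fill_half(char *half, size_t len);   /* hypothetical */

    static char buffer[2 * HALF_SIZE];

    void produce(int pipe_fd)
    {
        int half = 0;

        for (;;) {
            char *p = buffer + half * HALF_SIZE;

            fill_half(p, HALF_SIZE);
            /*
             * If one half fills the pipe completely, a successful
             * vmsplice() of this half means the other half can no
             * longer be under I/O, so it is safe to refill next.
             */
            if (vmsplice(pipe_fd, p, HALF_SIZE, 0) < 0)
                exit(1);
            half ^= 1;
        }
    }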
Jens's patch adds a couple of fcntl() operations intended to help
in this regard. The F_GETPSZ operation will return the maximum
number of pages which can be inserted into a pipe buffer, which is also the
maximum number of pages which can be under I/O from a vmsplice()
operation. There is also F_SETPSZ for changing the maximum size,
though that operation just returns EINVAL for now. Linus,
however, worries that this information is
not enough to know that a given page is no longer under I/O. In situations
where there are other buffers in the kernel - perhaps just another pipe in
series -
the kernel could still have references to a page even after that page has
been consumed out of the original pipe. Networking adds some challenges of
its own: if a page has been vmsplice()ed to a TCP socket, it will
not be reusable until the remote host has acknowledged the receipt of the
data contained within that page. That acknowledgment will arrive long
after the page has been consumed out of the pipe buffer.
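For the simple, single-pipe case, the intended use of F_GETPSZ would
look something like the sketch below. Note that F_GETPSZ exists only in
Jens's patch, so the constant here is a placeholder rather than the real
value, and, as just described, the result tells the application nothing
about references held elsewhere in the kernel.

    /*
     * Sketch: derive the safe "half" size for the double buffer in
     * the simple, single-pipe case.
     */
    #include <fcntl.h>
    #include <unistd.h>

    #ifndef F_GETPSZ
    #define F_GETPSZ 1031   /* placeholder, not the real value */
    #endif

    size_t safe_half_size(int pipe_fd)
    {
        long max_pages = fcntl(pipe_fd, F_GETPSZ);

        if (max_pages < 0)
            return 0;   /* kernel without the patch */
        /* One half of the buffer may be this large; the whole
         * buffer should be at least twice this size. */
        return (size_t)max_pages * (size_t)getpagesize();
    }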
What this all means is that the vmsplice() interface probably
needs a bit more work. In particular, there may need to be yet another
system call which will allow an application to know that the kernel is done
with a specific page. The current vmsplice() implementation is
also unable to connect an incoming pipe to user-space memory. Making the
read side work is a rather more complicated affair, and may not happen
anytime in the near future.
The
OpenVZ project is a GPL-licensed
subset of SWSoft's proprietary Virtuozzo offering. With OpenVZ, a Linux
system can implement multiple "virtual environments", each of which
appears, to the processes running within it, to be a separate, standalone
system. Virtual environments can have their own IP addresses and be
subjected to specific resource limits. They are, in other words, an
implementation of the container concept, one of several for Linux. In
recent times the various virtualization and container projects have shown a
higher level of interest in getting at least some of their code merged into
the mainline kernel, and OpenVZ is no exception. So the OpenVZ developers
have been maintaining a higher profile on the kernel mailing lists.
The latest news from OpenVZ is this announcement of a new
release with a major feature addition: live checkpointing and migration of
virtual environments. An environment (being a container full of Linux
processes) can be checkpointed to a file,
allowing it to be restarted at some later time. But it is also possible to
checkpoint a running virtual environment and move it to another system,
with no interruption in service. This feature, clearly meant to be
competitive with Xen's live migration capabilities, enables run-time load
balancing across systems.
The OpenVZ patch, weighing in at 2.2MB, is not for the faint of heart; it
makes the price to be paid for these features quite clear. Much of what is
contained within the patch has been discussed here before; for example, it
contains the PID virtualization
patches, and every bit of code within the kernel must be aware of
whether it is working with "real" or "virtual" process IDs. A number of
other kernel interfaces must be changed to support OpenVZ's virtualization
features; among other things, many device drivers and filesystems require
tweaks.
As might be expected, the checkpointing code is on the long and complicated
side. The checkpoint process starts by putting the target process(es) on
hold, in a manner similar to what the software suspend code does. Then it
comes down to a long series of routines which serialize and write out
every data structure and bit of memory associated with a virtual
environment. The obvious things are saved: process memory, open files,
etc. But the code must also save the full state of each TCP socket
(including the backlog of sk_buff structures waiting to be
processed), connection tracking information, signal handling status, SYSV
IPC information, file descriptors obtained via Unix-domain sockets,
asynchronous I/O operations, memory mappings, filesystem namespaces, data
in tmpfs files, tty settings, file locks, epoll() file
descriptors, accounting information, and more.
For each of the objects to be saved, an in-file version of the kernel data
structure must be created. Each dump routine then serializes one or more
data structures into the proper format for writing to the checkpoint file.
It all apparently works, but it has the look of a highly brittle system -
almost any change to the kernel's data structures seems guaranteed to break
the checkpoint and restore code. Even if the checkpoint and restore code
were merged into the mainline, getting kernel developers to understand (and
care about) that code would be a challenge. Keeping it working must be an
ongoing hassle, whether or not the code is in the mainline tree.
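A toy illustration makes the fragility concrete. The code below is
user-space C with invented names - nothing here is taken from the OpenVZ
patch itself - but it shows the general shape of a dump routine: copy a
kernel structure, field by field, into a fixed on-file image.

    #include <stdio.h>

    /* Invented stand-in for the kernel-side state of an open file. */
    struct file_state {
        int       fd;
        int       flags;
        long long pos;
    };

    /* Fixed on-file layout; the checkpoint file is a sequence of
     * records like this, one record type per kind of kernel object. */
    struct cpt_file_image {
        unsigned int cpt_fd;
        unsigned int cpt_flags;
        long long    cpt_pos;
    };

    static int dump_one_file(FILE *dump, const struct file_state *f)
    {
        struct cpt_file_image img = {
            .cpt_fd    = (unsigned int)f->fd,
            .cpt_flags = (unsigned int)f->flags,
            .cpt_pos   = f->pos,
        };

        /* If the kernel structure gains, loses, or changes a field,
         * this serializer (and its restore counterpart) must follow. */
        return fwrite(&img, sizeof(img), 1, dump) == 1 ? 0 : -1;
    }

Multiply that pattern by every object type listed above and the
maintenance problem becomes clear.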
None of the above should be interpreted to say that OpenVZ's features are
not worth the cost. Virtual environments, checkpointing, and live
migration are powerful and useful features. But the virtualization of
everything within the kernel will lead to a higher level of internal
complexity and higher maintenance costs. The decision process which draws
the line determining which features are merged and which are not will be
interesting to watch.
Novell
announced the release
of the AppArmor security module last January. Then everything went quiet;
in particular, no attempt was made to get the AppArmor code merged into the
mainline kernel. The silence was broken last week, however, as a result of
the discussion on the possible
removal of the Linux security module (LSM) API. The submission of the
AppArmor code has had the desired short-term effect: the discussion has
moved away from removal of the LSM interface and toward the merits of
AppArmor. The AppArmor developers may not see that shift as a blessing at
the moment, however.
As expected, AppArmor has taken a fair amount of criticism. The largest
complaint is the fact that AppArmor uses pathnames for its security
policies. Using AppArmor, a system administrator can provide a list of
files accessible by a given application; anything not on the list becomes
inaccessible. Other things - such as capabilities - are also configurable,
but there is no controversy over that aspect of the system. It is the use
of pathnames which raises the red flags.
The sticking point is that a given file name is not the file itself. So,
while /etc/shadow might identify the shadow password file, that
name is not the shadow password file. If an attacker is able to
create another name for that file (through the use of links or namespaces,
perhaps), that other name could become a way for the attacker to access the
shadow password file. So, even if AppArmor forbids access to
/etc/shadow for a given application, that application might still
have access to other pathnames which could be made to refer to the same
file.
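A toy program makes the problem concrete. The alias path here is
hypothetical, link() only works within a single filesystem, and creating
the link requires write access to the target directory; but the point
stands: a hard link gives the same inode a second name, and a policy
keyed to the first name says nothing about the second.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Give the shadow file a second name (hypothetical path). */
        if (link("/etc/shadow", "/etc/alias") != 0) {
            perror("link");
            return 1;
        }
        /* A rule denying "/etc/shadow" does not deny "/etc/alias",
         * yet both names now resolve to the same file. */
        puts("both names reach the same inode");
        return 0;
    }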
AppArmor thus differs from the SELinux approach, which attaches labels to
objects and enforces access control rules based on the labels. With
SELinux, the shadow password file has the same label (and, thus, the same
access rules) regardless of the name by which it is accessed. So SELinux
lacks a possible failure mode (rule bypass through different file names)
that exists in AppArmor. Of course, as any
SELinux administrator knows, maintaining file labels in a consistent and
correct state poses challenges of its own.
The other problem with the AppArmor approach is that the LSM API is not
well suited to pathname-based security policies. As a result, AppArmor
must often go through a fair amount of (potentially expensive) pain to
obtain the names corresponding to files. The impedance mismatch between
AppArmor and LSM is not generally seen as a reason to keep AppArmor out of
the kernel, but it has led to suggestions that the AppArmor developers
should either extend LSM for pathname-based policies or just add their own
patches and drop LSM altogether. If AppArmor gets past the other
objections, some work will almost certainly have to be done in this area.
At this point, how any decision will be made on merging AppArmor is far
from clear. It has not escaped notice that some of the strongest criticism
of AppArmor is coming from the SELinux camp; SELinux developer Stephen
Smalley has defended that criticism this
way:
We're not threatened by alternatives. We're concerned about a
technically unsound approach. The arguments being raised against
pathname-based access control are about the soundness of that
technical approach, not whether there should be any alternatives to
SELinux.
The proponents of AppArmor claim that the approach is sound. Unlike
SELinux, AppArmor does not attempt to be the ultimate security solution for
all situations. Instead, it simply puts a lid on applications which might
be compromised by an attacker. AppArmor raises the bar by limiting what a
broken application might do; it does not attempt to regulate the
interactions between every application and every object in the system.
This approach is, it is claimed, enough to significantly raise the security
of a system while maintaining an administrative interface which is
accessible to mere mortals. And, for AppArmor's goals, a pathname-based
access control mechanism is said to be good enough. It will probably be
some time before we see whether the kernel development community agrees
with that claim.
(See also: this
detailed criticism of pathname-based access control by Joshua Brindle).