Brief items
The current development kernel remains 2.6.35-rc3. Linus has
returned from his vacation, though, and has resumed merging changes into
the mainline.
Stable updates: none have been released since 2.6.32.15 (June 1) and
2.6.33.5 (May 26).
File locking on Linux is just broken. The broken semantics of POSIX
locking show that the designers of this API apparently never have
tried to actually use it in real software. It smells a lot like an
interface that kernel people thought makes sense but in reality
doesn't when you try to use it from userspace.
-- Lennart Poettering
Yeah, yeah, maybe you're waiting for flower power and free
sex. Good for you. But if you are, don't ask the Linux kernel to
wait with you. Ok?
-- Linus Torvalds (see below)
Now that I've had a look at the whole series, I'll make an overall
comment: I suspect that the locking is sufficiently complex that we
can count the number of people that will be able to debug it on one
hand. This patch set didn't just fall off the locking cliff, it
fell into a bottomless pit...
-- Dave Chinner
This is a disaster. I can't see for the life of me why we haven't
had 100,000 bug reports.
-- Joel Becker (OCFS2 users might want to be careful with 2.6.35-rc for now)
By Jonathan Corbet
June 30, 2010
POSIX has long defined variants on the stat() system call, which returns
information about files in the filesystem. There are a couple of
limitations associated with stat() which have been seen as a problem for a
while: it can only return the information defined in the standard, and it
returns all of that information, regardless of whether the caller needs
it. David Howells has attempted to address both problems with a new set of
system calls:
ssize_t xstat(int dfd, const char *filename, unsigned atflag,
struct xstat *buffer, size_t buflen);
ssize_t fxstat(int fd, struct xstat *buffer, size_t buflen);
The struct xstat structure resembles struct stat, but
with some differences. It includes fields for file metadata like the
creation time, the inode "generation number," and the "data version number"
for filesystems which support this information, and it has a version number
to provide for changes in the system call API in the future.
What also has been added, though, is a query_flags field where the
caller specifies which fields are actually desired; if all that is needed
is the file size or the link count, for example, the caller can say so. The
kernel may return other information as well, but it does not have to go out
of its way to ensure that it's accurate. There can be a real performance
benefit to this behavior, especially for network-mounted filesystems where
getting an updated value may require a conversation with the server. There
is also a provision for adding "extra results" for types of metadata which
are not currently envisioned.
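The query-flags idea can be illustrated with a small user-space mock. This is only a sketch under stated assumptions: the flag names, the sketch_xstat and sketch_inode structures, and the result_flags field are all hypothetical and are not taken from the patch. It shows a provider that skips the expensive "conversation with the server" unless the caller asked for a field that needs it, while still copying whatever cached values it has.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flag names; the actual patch may spell these differently. */
#define SK_WANT_SIZE   0x1u
#define SK_WANT_NLINK  0x2u
#define SK_WANT_BTIME  0x4u   /* creation time */

/* Hypothetical stand-in for struct xstat. */
struct sketch_xstat {
    uint32_t query_flags;   /* in: fields the caller actually needs */
    uint32_t result_flags;  /* out (assumed): fields known to be current */
    uint64_t size;
    uint32_t nlink;
    int64_t  btime;
};

/* Stand-in for a network filesystem's cached inode. */
struct sketch_inode {
    uint64_t cached_size;
    uint32_t cached_nlink;
    int64_t  btime;              /* never changes, so always valid */
    int      server_round_trips; /* counts simulated revalidations */
};

/* Pretend to ask the server for fresh size and link-count values. */
static void sketch_revalidate(struct sketch_inode *ino)
{
    ino->server_round_trips++;
}

static void sketch_fill(struct sketch_inode *ino, struct sketch_xstat *st)
{
    const uint32_t costly_fields = SK_WANT_SIZE | SK_WANT_NLINK;

    st->result_flags = 0;
    if (st->query_flags & costly_fields) {
        /* Only pay for the round trip when a costly field is wanted. */
        sketch_revalidate(ino);
        st->result_flags |= costly_fields;
    }

    /* Cached values are returned either way; they may be stale. */
    st->size  = ino->cached_size;
    st->nlink = ino->cached_nlink;
    st->btime = ino->btime;
    st->result_flags |= SK_WANT_BTIME;
}
```

A caller that sets only SK_WANT_BTIME in this mock never triggers a revalidation; one that sets SK_WANT_SIZE triggers exactly one.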
The addition of this sort of stat() variant has been discussed for
years, so something is likely to be merged. Chances are good, though, that
the API will change somewhat before the patch is finalized. There were
objections to the use of a version number in the xstat structure; the
argument is that the overhead of supporting another system call, should one
become necessary, will be less than that of dealing with multiple versions.
There were also complaints about the use of that structure as both an input
and an output parameter, so the input portion (the query flags) may just
become a direct system call parameter instead.
Update: there is already a new version
of the patch available with some changes to the system call API.
Kernel development news
By Jake Edge
June 30, 2010
The integrity measurement architecture (IMA) has been a part of Linux for
roughly a year now (it was merged for 2.6.30), and it can be used
to attest to the integrity of a running Linux system. But IMA can be
subverted by "offline" attacks, where file data or metadata is changed out
from under IMA. Mimi Zohar has proposed the extended
verification module (EVM) patch set as a means to
protect against these offline attacks.
In its default configuration, IMA calculates hash values for executables,
files which are mmap()ed for execution, and files open for reading
by root. That list of hashes is consulted each time those files are
accessed anew, so that unexpected changes can be detected. In addition,
IMA can be used with the trusted platform module (TPM) hardware, which is
present in many systems, to sign a collection of these hash values in such
a way that a remote system can verify that only "trusted" code is running
(remote attestation).
But an attacker could modify the contents of the disk by accessing it
under another kernel or operating system. That could potentially be
detected by the remote attestation, but cannot be detected by the system
itself. EVM sets out to change that.
One of the additions that comes with the EVM patch set is the integrity
appraisal extension, which maintains the file's integrity measurement (hash
value) as an extended attribute (xattr) of a file. The security.ima xattr
is used to store the hash, which gets compared to the calculated value each
time the file is opened.
EVM itself just calculates a hash over the extended attributes in the
security namespace (e.g. security.ima, security.selinux, and
security.SMACK64), uses the TPM to sign it, and stores it as the
security.evm attribute on the file. Currently, the key to be used with the
TPM signature gets loaded onto the root keyring by readevmkey, which just
prompts for a password at the console. Because an attacker doesn't have
the key, an offline attack cannot correctly modify the EVM xattr when it
changes file data. Securing the key is important, so future work will
entail using TPM sealed keys and encrypted symmetric keys so that the
plaintext EVM key will never be visible to user space.
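The reason a keyed digest over the security xattrs defeats offline tampering can be shown with a toy model. The sketch below is an assumption-laden illustration, not the EVM code: it substitutes a seeded FNV-1a hash (which is not cryptographically secure) for the real TPM-backed signature, and the sketch_xattr structure and function names are hypothetical. The point is only that any change to a covered xattr, or any attempt to recompute the digest without the key, yields a mismatch.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of one security.* xattr. */
struct sketch_xattr {
    const char *name;
    const char *value;
};

/* FNV-1a, used here only as a simple, NON-cryptographic stand-in. */
static uint64_t sketch_fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;      /* 64-bit FNV prime */
    }
    return h;
}

/* Mix the secret key first, then every xattr name and value. */
static uint64_t sketch_evm_digest(uint64_t key,
                                  const struct sketch_xattr *xa, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;  /* 64-bit FNV offset basis */

    h = sketch_fnv1a(h, &key, sizeof(key));
    for (size_t i = 0; i < n; i++) {
        /* Include the terminating NUL so field boundaries are unambiguous. */
        h = sketch_fnv1a(h, xa[i].name, strlen(xa[i].name) + 1);
        h = sketch_fnv1a(h, xa[i].value, strlen(xa[i].value) + 1);
    }
    return h;
}

/* Returns nonzero when the stored digest matches the current xattrs. */
static int sketch_evm_verify(uint64_t key, const struct sketch_xattr *xa,
                             size_t n, uint64_t stored)
{
    return sketch_evm_digest(key, xa, n) == stored;
}
```

In this model, the value returned by sketch_evm_digest() plays the role of security.evm; an attacker who edits security.ima offline cannot produce a matching value without the key.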
With all of that in place, a system administrator can be sure that the code
running on the system is the same as that which was measured. Presumably,
the initial measurement is done from a known good state. After that, any
offline attack would need to either modify a file's contents, which would
cause the IMA comparison to fail, or modify its security xattrs, which
would cause the EVM comparison to fail.
These patches have been bouncing around in various forms for five years or
more; we first looked at EVM
in 2005. The EVM patch describes some of the changes that EVM has
undergone along the way: "EVM has gone
through a number of iterations, initially as an LSM module, subsequently
as a LIM [Linux integrity
module] integrity provider, and now, when co-located with a security_
hook, embedded directly in the security_ hook, similar to IMA."
That evolution reflects both changes suggested in the review process and
a realization that, since Linux security modules (LSMs) don't stack, it
would be impossible to have both EVM and SELinux, say, in one kernel. That
led to adding IMA, and now EVM, as calls out from the appropriate security
hooks or VFS code.
For EVM, the hooks affected are security_inode_setxattr(),
security_inode_post_setxattr(), and
security_inode_removexattr(), each of which embeds a call to the
appropriate evm_* function. The evm_inode_setxattr()
function protects the security.evm xattr from modification unless
the CAP_MAC_ADMIN capability is held. The other two calls update
the EVM hash associated with a file when xattrs are changed.
The patches aren't too intrusive outside of the security subsystem, though
they do touch some other areas. Two new generic VFS calls
(vfs_getxattr_alloc() and vfs_xattr_cmp()) were added to
simplify xattr handling. Because various additional file attributes beyond
the security xattrs (inode number, uid, mode, and so on) are used in the
EVM hash, changes to those need to cause a recalculation, which
necessitated changes to fs/attr.c. And so on.
There are few comments on this iteration of the EVM patches. The idea has
been through several rounds of review over the years and the patches have
picked up an ACK from Serge E. Hallyn. EVM closes the offline attack hole
in the protection that IMA provides and would thus seem to make a good
addition to the mainline kernel. For those who want to try it out now,
there are instructions available on the Linux integrity subsystem web page.
Unless major complaints appear, one would think that EVM might well be a
candidate for 2.6.36.
By Jonathan Corbet
June 29, 2010
The SLUB allocator first made its appearance in April, 2007. It went into
the mainline shortly
thereafter. This allocator was intended to provide better performance
while being much more memory efficient than the existing slab allocator.
One of the key mechanisms for improving memory use was to get rid of the
extensive object queues maintained by slab; with enough processors, those
queues can grow to the point that they occupy a significant percentage of
total memory even when there is nothing in them. SLUB works well in many
workloads, but it has been plagued by regressions on certain benchmarks.
So SLUB has never achieved its goal of displacing slab altogether, and
developers have talked occasionally about getting rid of it.
But SLUB does better than slab on other benchmarks, and its code is widely
held to be more readable than slab - though that is widely held to
be faint praise. So, over the years, attempts have been made to improve
the SLUB allocator's performance. The latest such attempt is SLUB+Queuing which, according to
its developer Christoph Lameter, beats slab on the all-important
"hackbench" benchmark.
There are a couple of significant changes in the SLUB+Q patch set which are
intended to improve the performance of SLUB. At the top of the
list is the restoration of queues to the allocator. SLUB+Q does not use the
elaborate queues found in slab, though; there is, instead, a single per-CPU
queue containing pointers to free objects belonging to the cache.
Allocation operations are now simple, at least when the queue is not empty:
the last object in the queue is handed out, and the length of the queue is
decreased by one. Freeing into a non-empty queue is similar. So the fast
path, in both cases, should be fast indeed.
If a given CPU's queue goes empty, the SLUB+Q allocator must fall back to
allocating objects out of pages, perhaps allocating more pages in the
process. That, of course, is quite a bit slower. In an attempt to
minimize the cost of this slow path, SLUB+Q will go ahead and pre-fill the
queue, up to the "batch size" (half of the queue's total length, by
default) with free objects. So, in a
situation where many more objects are being allocated than freed, the fast
allocation path will continue to be used most of the time.
If the queue overflows, instead, the allocator must push objects back into
the pages they came from. Once again, the behavior chosen is to prune the
queue back to a half-full state; the allocator will not push back all
objects in the queue unless the kernel has indicated that it is under
serious memory pressure. The default size of the queue is dependent on the
object size, but it (along with the batch size) can be changed via a sysfs
parameter.
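The queue behavior described above can be modeled in a few lines of user-space C. This is a sketch, not the SLUB+Q code: the names, the fixed queue length of 8, and the simple array standing in for "objects in pages" are all assumptions made for illustration. It shows the fast paths (pop from or push to the queue), the refill to the batch size when the queue runs empty, and the prune back to half-full when it overflows.

```c
#include <stddef.h>

#define SKETCH_QUEUE_LEN     8
#define SKETCH_BATCH         (SKETCH_QUEUE_LEN / 2)  /* half the queue */
#define SKETCH_POOL_OBJECTS  32

/* One "per-CPU" queue plus a toy backing store standing in for pages. */
struct sketch_cache {
    unsigned char storage[SKETCH_POOL_OBJECTS][16];
    void *page_free[SKETCH_POOL_OBJECTS];  /* free objects held "in pages" */
    size_t nfree;
    void *queue[SKETCH_QUEUE_LEN];         /* pointers to free objects */
    size_t queued;
    unsigned slow_allocs, slow_frees;      /* slow-path counters */
};

static void sketch_cache_init(struct sketch_cache *c)
{
    for (size_t i = 0; i < SKETCH_POOL_OBJECTS; i++)
        c->page_free[i] = c->storage[i];
    c->nfree = SKETCH_POOL_OBJECTS;
    c->queued = 0;
    c->slow_allocs = c->slow_frees = 0;
}

static void *sketch_alloc(struct sketch_cache *c)
{
    if (c->queued == 0) {
        /* Slow path: pre-fill the queue up to the batch size. */
        c->slow_allocs++;
        while (c->queued < SKETCH_BATCH && c->nfree > 0)
            c->queue[c->queued++] = c->page_free[--c->nfree];
        if (c->queued == 0)
            return NULL;            /* backing store exhausted */
    }
    /* Fast path: hand out the last object and shrink the queue by one. */
    return c->queue[--c->queued];
}

static void sketch_free(struct sketch_cache *c, void *obj)
{
    if (c->queued == SKETCH_QUEUE_LEN) {
        /* Slow path: prune the queue back to a half-full state. */
        c->slow_frees++;
        while (c->queued > SKETCH_BATCH)
            c->page_free[c->nfree++] = c->queue[--c->queued];
    }
    /* Fast path: append the freed object to the queue. */
    c->queue[c->queued++] = obj;
}
```

With these sizes, twelve back-to-back allocations from a fresh cache take the slow path three times, and freeing all twelve afterward takes the slow path once.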
The other significant change has to do with how free objects are handled
when they are not stored in one of the per-CPU queues. In current mainline
kernels, SLUB maintains a list of pages which contain some free objects.
Note that it does not keep pages which are fully allocated (those can be
simply forgotten about until at least one object contained therein is
freed); it also does not keep pages which are fully free (those are handed
back to the page allocator). The partial pages contain one or more free
objects which are organized into a linked list, as is vaguely shown in the
diagram to the right. There is a certain aesthetic value to doing things
this way; it uses the free memory itself to keep track of free objects,
minimizing the amount of overhead needed for object management.
Unfortunately, there is also a cost to storing list pointers in the freed
objects. Chances are good that, by the time the kernel gets around to
freeing an object, it will not have been used for a bit; it may well be
cache-cold on the freeing CPU. Objects which are on the free list are even
more likely to be cache-cold. Putting list pointers into that object will
bring it into the CPU cache, incurring a cache miss and, possibly,
displacing something which was more useful. The result is a measurable
performance hit.
Thus, over time, it has become clear
that memory management is more efficient if it can avoid touching the
objects which are being managed.
The SLUB+Q patches achieve this goal by using a bitmap to track which objects in
a given page are free. If the number of objects which can fit into a page
is small enough, this bitmap can be stored in the page structure
in the system memory map; otherwise it is placed at the end of the page
itself. Now the management of free objects just requires tweaking bits in
this (small) bitmap; the objects themselves are not changed by the allocator.
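A minimal sketch of bitmap-based tracking follows, assuming a page that holds 32 fixed-size objects so that the bitmap fits in a single 32-bit word. The structure and function names are hypothetical, and the real patches keep the bitmap in the page structure or at the end of the page rather than in a wrapper like this one. What the sketch demonstrates is that freeing an object only flips a bit; the object's own memory is never written.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_OBJS_PER_PAGE 32   /* assumed: bitmap fits one 32-bit word */
#define SKETCH_OBJ_SIZE      64

struct sketch_page {
    uint32_t free_map;  /* bit i set means object i is free */
    unsigned char objects[SKETCH_OBJS_PER_PAGE][SKETCH_OBJ_SIZE];
};

static void sketch_page_init(struct sketch_page *pg)
{
    pg->free_map = UINT32_MAX;      /* every object starts out free */
}

/* Return the lowest-numbered free object, or NULL if the page is full. */
static void *sketch_page_alloc(struct sketch_page *pg)
{
    for (unsigned i = 0; i < SKETCH_OBJS_PER_PAGE; i++) {
        uint32_t bit = UINT32_C(1) << i;

        if (pg->free_map & bit) {
            pg->free_map &= ~bit;
            return pg->objects[i];
        }
    }
    return NULL;
}

/* Mark an object free without reading or writing the object itself. */
static void sketch_page_free(struct sketch_page *pg, void *obj)
{
    size_t idx = (size_t)((unsigned char *)obj - &pg->objects[0][0])
                 / SKETCH_OBJ_SIZE;

    pg->free_map |= UINT32_C(1) << idx;
}
```

Contrast this with a linked free list, where the free operation would have to store a next pointer inside the (possibly cache-cold) object.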
The hackbench benchmark works by creating groups of processes, then quickly
passing messages between them using sockets. SLUB has tended to perform
worse on this benchmark than slab, sometimes significantly so. With the
new patches, Christoph has posted benchmark results showing performance
which is significantly better than what slab achieves. If these results
hold, SLUB+Q will have overcome one of the primary problems seen by SLUB.
It should be noted, though, that performance on a single benchmark is not
especially indicative of the performance of a memory allocator in general;
SLUB already beat slab on a number of other tests. Memory management
performance is a subtle and tricky area to work in. So a lot more testing
will be required before it will be possible to say that SLUB+Q has truly
addressed SLUB's difficulties without introducing regressions of its own.
But the initial indications look good.
By Jonathan Corbet
June 29, 2010
The notion that one should be liberal in what one accepts while being
conservative in what one sends is often expressed in the networking field,
but it shows up in a number of other areas as well. Often, though, it can
make more sense to be conservative on the accepting side; the condition of
many web pages would have been far better had early browsers not been so
forgiving of bad HTML. The tradeoff between being accepting and insisting
on correctness recently came up in a discussion of a proposed API change
for the futex() system call; "conservative" appears to be the
winning approach in this case.
The futex() system call provides fast locking operations to user
space. Callers will normally block until a lock becomes available, but
they can also provide a struct timespec value specifying the
maximum amount of time to wait:
struct timespec {
    time_t tv_sec;   /* seconds */
    long   tv_nsec;  /* nanoseconds */
};
The interpretation of the timeout value is a little strange. For a
FUTEX_WAIT command, the timeout is relative to the current time;
for any other command, it is either ignored or treated as an absolute
time. In particular, operations like FUTEX_WAIT_BITSET and
FUTEX_LOCK_PI use absolute timeouts.
Oleg Nesterov recently came to the kernel
mailing list with an interesting glibc problem. If the tv_sec
portion of the timeout is negative, the kernel will fail the
futex() call with an EINVAL error. The POSIX thread code
is not prepared for that to happen and shows its anger by going into an
infinite loop - behavior which is not normally appreciated by user-space
programmers. The glibc developers have concluded that this behavior is a
kernel bug; to them, a negative absolute time value indicates a time before the
epoch. Since the epoch is, for all practical purposes, the beginning of
time, the response to a pre-epochal time should be ETIMEDOUT,
which the library is prepared to deal with.
This position was not well received. Thomas Gleixner responded that times before the epoch cannot be
programmed into the system clock and, thus, are not accepted by any Linux system
call which deals with absolute times. Since some system calls cannot
possibly accept such values, Thomas says, none should: "I'm strictly
against having different definitions of sanity for different
syscalls."
Linus, too, opposes accepting negative
times, but for slightly different reasons:
A positive time_t value is well-defined. In contrast, a negative
tv_sec value is inherently suspect. Traditionally, you couldn't
even know if time_t was a signed quantity to begin with! And on
32-bit machines, a negative time_t is quite often the result of
overflow (no, you don't have to get to 2038 to see it - you can
get overflows from simply doing large relative timeouts etc).
In other words, a negative time value is an indication that something,
somewhere has gone wrong. In such situations, rejecting the value may well
be the best thing to do.
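Both halves of that argument can be sketched in user-space C. The validity check below mirrors the behavior described in this article (a negative tv_sec is rejected with EINVAL); the additional tv_nsec range check and all function names are assumptions for illustration, not a copy of the kernel's code. The second helper emulates a 32-bit time_t to show how adding a large relative timeout to the current time can wrap to a negative value well before 2038.

```c
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/*
 * Conservative check: reject anything that cannot be a sane absolute
 * time. Returns 0 if acceptable, -EINVAL otherwise.
 */
static int sketch_check_timeout(const struct timespec *ts)
{
    if (ts->tv_sec < 0)
        return -EINVAL;             /* "before the epoch": refuse it */
    if (ts->tv_nsec < 0 || ts->tv_nsec >= SKETCH_NSEC_PER_SEC)
        return -EINVAL;             /* assumed extra sanity check */
    return 0;
}

/*
 * Emulate "now + relative" with a 32-bit signed time_t. The sum is done
 * in unsigned arithmetic to avoid undefined behavior; converting back to
 * int32_t is implementation-defined, and wraps on mainstream compilers.
 */
static int32_t sketch_add_seconds_32(int32_t now, int32_t relative)
{
    return (int32_t)((uint32_t)now + (uint32_t)relative);
}
```

For example, starting from 1277856000 (midnight UTC on June 30, 2010) and adding a relative timeout of one billion seconds exceeds the 32-bit signed range, so the emulated sum comes out negative on such compilers; passing it through sketch_check_timeout() would then yield -EINVAL rather than a silent, nonsensical wait.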
That leaves the glibc developers in the position of having to fix their
code to deal with this (previously) unexpected return value. The good
news, such as it is, is that they'll be working on that code anyway. It
seems that the same function will also loop if it gets EFAULT back
from futex(), and that is clearly a user-space bug.
Patches and updates
Security-related
- Mimi Zohar: EVM (June 24, 2010)
Page editor: Jonathan Corbet