
Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.35-rc3. Linus has returned from his vacation, though, and has resumed merging changes into the mainline.

There have been no stable updates since 2.6.32.15 (June 1) and 2.6.33.5 (May 26).


Quotes of the week

File locking on Linux is just broken. The broken semantics of POSIX locking show that the designers of this API apparently never have tried to actually use it in real software. It smells a lot like an interface that kernel people thought makes sense but in reality doesn't when you try to use it from userspace.
-- Lennart Poettering

Yeah, yeah, maybe you're waiting for flower power and free sex. Good for you. But if you are, don't ask the Linux kernel to wait with you. Ok?
-- Linus Torvalds (see below)

Now that I've had a look at the whole series, I'll make an overall comment: I suspect that the locking is sufficiently complex that we can count the number of people that will be able to debug it on one hand. This patch set didn't just fall off the locking cliff, it fell into a bottomless pit...
-- Dave Chinner

This is a disaster. I can't see for the life of me why we haven't had 100,000 bug reports.
-- Joel Becker (OCFS2 users might want to be careful with 2.6.35-rc for now)


xstat() and fxstat()

By Jonathan Corbet
June 30, 2010
POSIX has long defined variants on the stat() system call, which returns information about files in the filesystem. stat() has two limitations which have been seen as a problem for a while: it can only return the information defined in the standard, and it returns all of that information, regardless of whether the caller needs it. David Howells has attempted to address both problems with a new set of system calls:

    ssize_t xstat(int dfd, const char *filename, unsigned atflag,
	          struct xstat *buffer, size_t buflen);

    ssize_t fxstat(int fd, struct xstat *buffer, size_t buflen);

The struct xstat structure resembles struct stat, but with some differences. It includes fields for file metadata like the creation time, the inode "generation number," and the "data version number" for filesystems which support this information, and it has a version number to provide for changes in the system call API in the future.

What also has been added, though, is a query_flags field where the caller specifies which fields are actually desired; if all that is needed is the file size or the link count, for example, the caller can say so. The kernel may return other information as well, but it does not have to go out of its way to ensure that it's accurate. There can be a real performance benefit to this behavior, especially for network-mounted filesystems where getting an updated value may require a conversation with the server. There is also a provision for adding "extra results" for types of metadata which are not currently envisioned.

The addition of this sort of stat() variant has been discussed for years, so something is likely to be merged. Chances are good, though, that the API will change somewhat before the patch is finalized. There were objections to the use of a version number in the xstat structure; the overhead of supporting another system call, should one become necessary, will be less than that of dealing with multiple versions. There were also complaints about the use of that structure as both an input and an output parameter, so the input portion (the query flags) may just become a direct system call parameter instead.

Update: there is already a new version of the patch available with some changes to the system call API.


Kernel development news

The return of EVM

By Jake Edge
June 30, 2010

The integrity measurement architecture (IMA) has been a part of Linux for roughly a year now—it was merged for 2.6.30—and it can be used to attest to the integrity of a running Linux system. But IMA can be subverted by "offline" attacks, where file data or metadata is changed out from under IMA. Mimi Zohar has proposed the extended verification module (EVM) patch set as a means to protect against these offline attacks.

In its default configuration, IMA calculates hash values for executables, files which are mmap()ed for execution, and files open for reading by root. That list of hashes is consulted each time those files are accessed anew, so that unexpected changes can be detected. In addition, IMA can be used with the trusted platform module (TPM) hardware, which is present in many systems, to sign a collection of these hash values in such a way that a remote system can verify that only "trusted" code is running (remote attestation).

But an attacker could modify the contents of the disk by accessing it under another kernel or operating system. That could potentially be detected by the remote attestation, but cannot be detected by the system itself. EVM sets out to change that.

One of the additions that comes with the EVM patch set is the integrity appraisal extension, which maintains the file's integrity measurement (hash value) as an extended attribute (xattr) of a file. The security.ima xattr is used to store the hash, which gets compared to the calculated value each time the file is opened.

EVM itself just calculates a hash over the extended attributes in the security namespace (e.g. security.ima, security.selinux, and security.SMACK64), uses the TPM to sign it, and stores it as the security.evm attribute on the file. Currently, the key to be used with the TPM signature gets loaded onto the root keyring by readevmkey, which just prompts for a password at the console. Because an attacker doesn't have the key, an offline attack cannot correctly modify the EVM xattr when it changes file data. Securing the key is important, so future work will entail using TPM sealed keys and encrypted symmetric keys so that the plaintext EVM key will never be visible to user space.

With all of that in place, a system administrator can be sure that the code running on the system is the same as that which was measured. Presumably, the initial measurement is done from a known good state. After that, any offline attack would need to either modify a file's contents, which would cause the IMA comparison to fail, or modify its security xattrs, which would cause the EVM comparison to fail.

These patches have been bouncing around in various forms for five years or more; we first looked at EVM in 2005. The EVM patch describes some of the changes that EVM has undergone along the way: "EVM has gone through a number of iterations, initially as an LSM module, subsequently as a LIM [Linux integrity module] integrity provider, and now, when co-located with a security_ hook, embedded directly in the security_ hook, similar to IMA." That evolution reflects both changes suggested in the review process as well as a realization that, since Linux security modules (LSMs) don't stack, it would be impossible to have both EVM and SELinux, say, in one kernel. That led to adding IMA, and now EVM, as calls out from the appropriate security hooks or VFS code.

For EVM, the hooks affected are security_inode_setxattr(), security_inode_post_setxattr(), and security_inode_removexattr(), each of which embeds a call to the appropriate evm_* function. The evm_inode_setxattr() function protects the security.evm xattr from modification unless the CAP_MAC_ADMIN capability is held. The other two calls update the EVM hash associated with a file when xattrs are changed.

The patches aren't too intrusive outside of the security subsystem, though they do touch some other areas. Two new generic VFS calls (vfs_getxattr_alloc() and vfs_xattr_cmp()) were added to simplify xattr handling. Because various additional file attributes beyond the security xattrs (the inode number, uid, mode, and so on) figure into the EVM hash, changes to those attributes must trigger a recalculation, which necessitated changes to fs/attr.c. And so on.

There are few comments on this iteration of the EVM patches. The idea has been through several rounds of review over the years and the patches have picked up an ACK from Serge E. Hallyn. EVM closes the offline attack hole in the protection that IMA provides and would thus seem to make a good addition to the mainline kernel. For those who want to try it out now, there are instructions available on the Linux integrity subsystem web page. Unless major complaints appear, one would think that EVM might well be a candidate for 2.6.36.


Slab allocator of the week: SLUB+Queuing

By Jonathan Corbet
June 29, 2010
The SLUB allocator first made its appearance in April, 2007. It went into the mainline shortly thereafter. This allocator was intended to provide better performance while being much more memory efficient than the existing slab allocator. One of the key mechanisms for improving memory use was to get rid of the extensive object queues maintained by slab; with enough processors, those queues can grow to the point that they occupy a significant percentage of total memory even when there is nothing in them. SLUB works well in many workloads, but it has been plagued by regressions on certain benchmarks. So SLUB has never achieved its goal of displacing slab altogether, and developers have talked occasionally about getting rid of it.

But SLUB does better than slab on other benchmarks, and its code is widely held to be more readable than slab's - though that may be faint praise. So, over the years, attempts have been made to improve the SLUB allocator's performance. The latest such attempt is SLUB+Queuing which, according to its developer Christoph Lameter, beats slab on the all-important "hackbench" benchmark.

There are a couple of significant changes in the SLUB+Q patch set which are intended to improve the performance of SLUB. At the top of the list is the restoration of queues to the allocator. SLUB+Q does not use the elaborate queues found in slab, though; there is, instead, a single per-CPU queue containing pointers to free objects belonging to the cache. Allocation operations are now simple, at least when the queue is not empty: the last object in the queue is handed out, and the length of the queue is decreased by one. Freeing into a non-empty queue is similar. So the fast path, in both cases, should be fast indeed.

If a given CPU's queue goes empty, the SLUB+Q allocator must fall back to allocating objects out of pages, perhaps allocating more pages in the process. That, of course, is quite a bit slower. In an attempt to minimize the cost of this slow path, SLUB+Q will go ahead and pre-fill the queue, up to the "batch size" (half of the queue's total length, by default) with free objects. So, in a situation where many more objects are being allocated than freed, the fast allocation path will continue to be used most of the time.

If the queue overflows, instead, the allocator must push objects back into the pages they came from. Once again, the behavior chosen is to prune the queue back to a half-full state; the allocator will not push back all objects in the queue unless the kernel has indicated that it is under serious memory pressure. The default size of the queue is dependent on the object size, but it (along with the batch size) can be changed via a sysfs parameter.

[Diagram: SLUB free list]

The other significant change has to do with how free objects are handled when they are not stored in one of the per-CPU queues. In current mainline kernels, SLUB maintains a list of pages which contain some free objects. Note that it does not keep pages which are fully allocated (those can be simply forgotten about until at least one object contained therein is freed); it also does not keep pages which are fully free (those are handed back to the page allocator). The partial pages contain one or more free objects which are organized into a linked list, as is vaguely shown in the diagram above. There is a certain aesthetic value to doing things this way; it uses the free memory itself to keep track of free objects, minimizing the amount of overhead needed for object management.

Unfortunately, there is also a cost to storing list pointers in the freed objects. Chances are good that, by the time the kernel gets around to freeing an object, it will not have been used for a bit; it may well be cache-cold on the freeing CPU. Objects which are on the free list are even more likely to be cache-cold. Putting list pointers into that object will bring it into the CPU cache, incurring a cache miss and, possibly, displacing something which was more useful. The result is a measurable performance hit.

Thus, over time, it has become clear that memory management is more efficient if it can avoid touching the objects which are being managed. The SLUB+Q patches achieve this goal by using a bitmap to track which objects in a given page are free. If the number of objects which can fit into a page is small enough, this bitmap can be stored in the page structure in the system memory map; otherwise it is placed at the end of the page itself. Now the management of free objects just requires tweaking bits in this (small) bitmap; the objects themselves are not changed by the allocator.

The hackbench benchmark works by creating groups of processes, then quickly passing messages between them using sockets. SLUB has tended to perform worse on this benchmark than slab, sometimes significantly so. With the new patches, Christoph has posted benchmark results showing performance which is significantly better than what slab achieves. If these results hold, SLUB+Q will have overcome one of the primary problems seen by SLUB.

It should be noted, though, that performance on a single benchmark is not especially indicative of the performance of a memory allocator in general; SLUB already beat slab on a number of other tests. Memory management performance is a subtle and tricky area to work in. So a lot more testing will be required before it will be possible to say that SLUB+Q has truly addressed SLUB's difficulties without introducing regressions of its own. But the initial indications look good.


What makes a valid timespec?

By Jonathan Corbet
June 29, 2010
The notion that one should be liberal in what one accepts while being conservative in what one sends is often expressed in the networking field, but it shows up in a number of other areas as well. Often, though, it can make more sense to be conservative on the accepting side; the condition of many web pages would have been far better had early browsers not been so forgiving of bad HTML. The tradeoff between being accepting and insisting on correctness recently came up in a discussion of a proposed API change for the futex() system call; "conservative" appears to be the winning approach in this case.

The futex() system call provides fast locking operations to user space. Callers will normally block until a lock becomes available, but they can also provide a struct timespec value specifying the maximum amount of time to wait:

    struct timespec {
	long		tv_sec;			/* seconds */
	long		tv_nsec;		/* nanoseconds */
    };

The interpretation of the timeout value is a little strange. For a FUTEX_WAIT command, the timeout is relative to the current time; for any other command, it is either ignored or treated as an absolute time. In particular, the operations like FUTEX_WAIT_BITSET and FUTEX_LOCK_PI use absolute timeouts.

Oleg Nesterov recently came to the kernel mailing list with an interesting glibc problem. If the tv_sec portion of the timeout is negative, the kernel will fail the futex() call with an EINVAL error. The POSIX thread code is not prepared for that to happen and shows its anger by going into an infinite loop - behavior which is not normally appreciated by user-space programmers. The glibc developers have concluded that this behavior is a kernel bug; to them, a negative absolute time value indicates a time before the epoch. Since the epoch is, for all practical purposes, the beginning of time, the response to a pre-epochal time should be ETIMEDOUT, which the library is prepared to deal with.

This position was not well received. Thomas Gleixner responded that times before the epoch cannot be programmed into the system clock and, thus, are not accepted by any Linux system call which deals with absolute times. Since some system calls cannot possibly accept such values, Thomas says, none should: "I'm strictly against having different definitions of sanity for different syscalls."

Linus, too, opposes accepting negative times, but for slightly different reasons:

A positive time_t value is well-defined. In contrast, a negative tv_sec value is inherently suspect. Traditionally, you couldn't even know if time_t was a signed quantity to begin with! And on 32-bit machines, a negative time_t is quite often the result of overflow (no, you don't have to get to 2038 to see it - you can get overflows from simply doing large relative timeouts etc).

In other words, a negative time value is an indication that something, somewhere has gone wrong. In such situations, rejecting the value may well be the best thing to do.

That leaves the glibc developers in the position of having to fix their code to deal with this (previously) unexpected return value. The good news, such as it is, is that they'll be working on that code anyway. It seems that the same function will also loop if it gets EFAULT back from futex(), and that is clearly a user-space bug.


Patches and updates

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Jan Engelhardt: xt2 table core

Security-related

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds