Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.30-rc6, released on May 15. "Things definitely are calming down, with just about 300 commits in the last week. And most of them are pretty small too, although the powerpc updates brought some defconfig changes that look largish." The long-format changelog has all the details.

The current stable kernel is 2.6.29.4, released on May 19. It contains a fair number of fixes, some with apparent security implications, and some user-space API improvements for labeled networking. 2.6.27.24, released at the same time, contains a smaller (but still significant) list of fixes.

Kernel development news

Quotes of the week

The Titanic sank because there was too much water onboard after all. How it got in there (Iceberg, torpedoed by the CIA (oh it did not yet exist) in retaliation for British misbehavior) would be good to know. But the message should not suggest increasing the size of the Titanic because it cannot hold all the incoming water.
-- Christoph Lameter (thanks to Florian Mickler)

Even it if is, in fact, correct, it's such an egregious violation of good style, that your good programmer's card is going to lose a big coupon and have a hole punched in it.
-- Pete Zaitcev

In brief

By Jonathan Corbet
May 20, 2009

get_random_int() for address space layout randomization was examined on last week's Security page, but, since then, some additional ideas have been discussed. Performance was the main reason to stick with the partial MD4 hash that is in the current code, but other possibilities are being considered. Eric Biederman noted that using a cipher such as AES might produce high-quality randomness without the performance penalty, but Matt Mackall didn't agree: "It's also unclear that encrypting small blocks with software AES is actually a performance win relative to SHA1, once you look at the key scheduling overhead and the cache footprint of its s-boxes."

Willy Tarreau pointed out that the hash functions used to produce the random numbers actually generate far more data than is currently used. Storing the output of half_md4_transform() (or that of a different implementation using the much stronger SHA1 hash) and returning 4-byte chunks from the full hash value (128 or 160 bits) on each call would perform much better, since the hash would only need to be recomputed once the stored output had been used up. Linus Torvalds, however, is concerned that giving out more of the hash value could lead to easier attacks: "I personally suspect that it would be _much_ easier to attack the hash if we actually gave out the whole 16 bytes (over several iteration), when compared to what we do now (only give out a small part and then re-hash)." No patches emerged from the discussion, but no one is completely happy with the current implementation, so that could change at some point.
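
The buffering idea can be illustrated with a small user-space sketch. This is not kernel code and not a posted patch: the rand_pool structure, pool_init(), and pool_next_u32() are names invented for this example, and toy_hash() is a non-cryptographic stand-in that merely plays the role of half_md4_transform() or SHA1.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_WORDS 4    /* a 128-bit hash value is four 32-bit words */

/* Hypothetical pool: cached hash output plus a cursor into it. */
struct rand_pool {
    uint32_t words[POOL_WORDS];
    size_t next;            /* index of the next unused word */
    uint32_t rehash_count;  /* number of times the hash has been run */
};

/* Stand-in mixing function; NOT cryptographically secure. */
static void toy_hash(uint32_t state[POOL_WORDS])
{
    for (size_t i = 0; i < POOL_WORDS; i++) {
        uint32_t x = state[i] ^ state[(i + 1) % POOL_WORDS];

        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        state[i] = x + 0x9e3779b9u;
    }
}

static void pool_init(struct rand_pool *p, uint32_t seed)
{
    for (size_t i = 0; i < POOL_WORDS; i++)
        p->words[i] = seed + (uint32_t)i;
    p->next = POOL_WORDS;   /* empty: the first call must run the hash */
    p->rehash_count = 0;
}

/* Hand out one 4-byte chunk, re-hashing only when the cache is used up. */
static uint32_t pool_next_u32(struct rand_pool *p)
{
    if (p->next == POOL_WORDS) {
        toy_hash(p->words);
        p->rehash_count++;
        p->next = 0;
    }
    return p->words[p->next++];
}
```

Four consecutive calls cost a single hash operation here, which is where the performance gain would come from. In this toy version the four returned words are also the entire input to the next hash round, which is the sort of exposure that worries Linus.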

OOM by any other name. The out-of-memory (OOM) killer has often been featured on this page. But Christoph Lameter raised a previously unseen complaint about OOM in a recent discussion:

While we are at it: Could we get rid of the name "Out of Memory" and stop printing texts to that effect? What we call an OOM is a failure to perform memory reclaim or we are running out of reserves due to not being able to run reclaim. Mostly this is due to OS internal issues having nothing to do actual amounts of memory available.

Christoph says that users are tempted to add more memory in response to OOM situations, while the real fix is often to be found elsewhere - fixing the application which is locking down the bulk of RAM, for example. There was some discussion of changing the OOM messages to read something like "Unable to satisfy memory allocation request and not making progress reclaiming from other sources," but nothing has been merged at this point.

Tracepoints leave a trace: one of the key design criteria for kernel tracepoints is that their impact must be minimal when they are not being used. So Jason Baron was surprised to find out that, with the current tracepoint implementation, arguments to tracepoints are being evaluated before the determination of whether the tracepoint is active. In response, he prepared a patch to prevent the evaluation of arguments for inactive tracepoints. The patch works, but at a cost: it requires a change to the internal tracepoint API.

There is resistance to the API change, mostly because the people involved seem to think that the new version is uglier and that the problem is really a bug in GCC. It may prove hard to avoid, though; even if GCC is fixed soon, older versions will be out there for a long time. Minimizing overhead is seen as more important than API beauty, so, unless somebody comes up with a clever workaround, there may be no avoiding an unwanted tracepoint change.
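
The ordering problem can be shown in plain user-space C. The sketch below does not use the kernel's tracepoint macros; trace_unguarded(), TRACE_GUARDED(), expensive_arg(), and probe() are all hypothetical names chosen for illustration. The argument here has a visible side effect, so C semantics require it to be evaluated before a function call; for side-effect-free arguments to an inline function, whether the work gets skipped is up to the compiler.

```c
#include <stdbool.h>

static bool tracepoint_active = false;
static int arg_evaluations = 0;
static int probe_calls = 0;

/* Stands in for a costly argument computation. */
static int expensive_arg(void)
{
    arg_evaluations++;
    return 42;
}

/* Stands in for the probe attached to an active tracepoint. */
static void probe(int value)
{
    (void)value;
    probe_calls++;
}

/*
 * Function-style tracepoint: the caller evaluates the argument before
 * the "is it active?" test inside the function can run.
 */
static void trace_unguarded(int value)
{
    if (tracepoint_active)
        probe(value);
}

/*
 * Macro-style tracepoint: the test comes first, so the argument
 * expression is only evaluated when the tracepoint is active.
 */
#define TRACE_GUARDED(expr)             \
    do {                                \
        if (tracepoint_active)          \
            probe(expr);                \
    } while (0)
```

With the tracepoint inactive, trace_unguarded(expensive_arg()) still runs expensive_arg(), while TRACE_GUARDED(expensive_arg()) does not. Moving the check outside of the call is what forces a different-looking interface in a sketch like this one.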

Union mounts: there is a new set of union mount patches out there for evaluation, along with the necessary user-space tools. Valerie Aurora has put together a howto page for those who would like to experiment with this code. See this LWN article (also by Valerie) for information on the implementation of union mounts.

GlusterFS 2.0 released

The GlusterFS 2.0 release is available. GlusterFS is an interesting cluster filesystem which runs mostly in user space and which claims to scale into the petabyte range. See the feature list for an overview. "As your volume size grows beyond 32TBs, fsck (filesystem check) downtime becomes a huge problem. GlusterFS has no fsck. It heals itself transparently with very little impact on performance." The license is GPLv3.

This week's reflink() API

By Jonathan Corbet
May 19, 2009
The proposed reflink() system call creates an interesting cross between a hard link and a file copy. The end result of a successful reflink() call is a new, distinct file - with its own inode - which shares data blocks with the original file. A copy-on-write policy is used, so the two files remain distinct; if one is modified, the changes will not be visible in the other. This call has a number of uses, including fast snapshotting and as a sort of optimized copy operation. But, as was described in the previous article on reflink(), there is some disagreement over how file ownership and security-related metadata should be handled.

It comes down to the different use cases for this system call. In the "snapshot" case, security information must be preserved; that, in turn, means that reflink() can only be used by the owner of the file (or by a process with sufficient capabilities to get around ownership restrictions). On the other hand, those wanting to use reflink() as a fast file copy would rather see security information treated like it would be with a file copy; the user creating the reflink must have read access to the original file and ends up owning the new one.

For a while, it seemed like the reflink-as-copy use case was simply going to be left out in the cold. But then Joel Becker, the author of the reflink() patches, proposed a compromise. If the process calling reflink() had ownership or suitable privilege, the snapshot semantics would prevail. Otherwise, read access would be required and a new set of security attributes would be applied. The idea was to try to automatically do the right thing in all situations.

In the end, though, this approach didn't fly either. From Andy Lutomirski's objection:

There are plenty of syscalls that require some privilege and fail if the caller doesn't have it. But I can think of only one syscall that does *something different* depending on who called it: setuid.

Please search the web and marvel at the disasters caused by setuid's magical caller-dependent behavior (the sendmail bug is probably the most famous). This proposal for reflink is just asking for bugs where an attacker gets some otherwise privileged program to call reflink but to somehow lack the privileges (CAP_CHOWN, selinux rights, or whatever) to copy security attributes, thus exposing a link with the wrong permissions.

Others agreed that automagically changing behavior depending on caller privilege was not the best way to go. So Joel went back to the drawing board yet another time. On May 15, he came back with a new proposal. The reflink() API would now look like:

    int reflink(const char *oldpath, const char *newpath, int preserve);

The new preserve parameter would be a set of flags allowing the caller to specify which bits of security-oriented information are to be preserved. Anticipated values are:

  • REFLINK_ATTR_OWNER: keep the ownership of the file the same. The caller must either be the owner or have the CAP_CHOWN capability.

  • REFLINK_ATTR_SECURITY: preserves the SELinux/SMACK/TOMOYO Linux security state. This flag is only valid if REFLINK_ATTR_OWNER is also provided. In the absence of REFLINK_ATTR_SECURITY, the new link gets a brand-new security state, as if it were any other new file.

  • REFLINK_ATTR_MODE: the discretionary access control permissions bits remain the same; requires ownership or CAP_FOWNER.

  • REFLINK_ATTR_ACL: all access control lists are preserved. This only works if REFLINK_ATTR_MODE is specified.

The API would also provide REFLINK_ATTR_NONE and REFLINK_ATTR_ALL, with the obvious semantics. Importantly, if the caller lacks the requisite credentials to preserve the requested information, the call will simply fail. There will be no magically-changing semantics depending on the caller's capabilities.
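
The rules above might be checked along the following lines. This is only a sketch of the described semantics, not code from Joel's patch: the numeric flag values, the caller structure, reflink_check(), and the choice of EINVAL and EPERM as error codes are all assumptions made for the example.

```c
#include <errno.h>
#include <stdbool.h>

/* Flag names come from the proposal; the bit values are hypothetical. */
#define REFLINK_ATTR_NONE     0x0
#define REFLINK_ATTR_OWNER    0x1
#define REFLINK_ATTR_SECURITY 0x2
#define REFLINK_ATTR_MODE     0x4
#define REFLINK_ATTR_ACL      0x8
#define REFLINK_ATTR_ALL      (REFLINK_ATTR_OWNER | REFLINK_ATTR_SECURITY | \
                               REFLINK_ATTR_MODE | REFLINK_ATTR_ACL)

/* Hypothetical summary of the calling process's credentials. */
struct caller {
    bool is_owner;
    bool cap_chown;
    bool cap_fowner;
};

/* Returns 0 if the request may proceed, or a negative error code. */
static int reflink_check(int preserve, const struct caller *c)
{
    if (preserve & ~REFLINK_ATTR_ALL)
        return -EINVAL;     /* unknown flag bits */

    /* SECURITY is only valid together with OWNER. */
    if ((preserve & REFLINK_ATTR_SECURITY) && !(preserve & REFLINK_ATTR_OWNER))
        return -EINVAL;

    /* ACL is only valid together with MODE. */
    if ((preserve & REFLINK_ATTR_ACL) && !(preserve & REFLINK_ATTR_MODE))
        return -EINVAL;

    /* Preserving ownership requires being the owner or CAP_CHOWN. */
    if ((preserve & REFLINK_ATTR_OWNER) && !(c->is_owner || c->cap_chown))
        return -EPERM;

    /* Preserving mode bits requires ownership or CAP_FOWNER. */
    if ((preserve & REFLINK_ATTR_MODE) && !(c->is_owner || c->cap_fowner))
        return -EPERM;

    return 0;
}
```

The point to notice is that an unprivileged caller asking for REFLINK_ATTR_ALL gets an error rather than a silently downgraded link; the same caller asking for REFLINK_ATTR_NONE succeeds.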

Joel also proposes some new flags to the ln command:

  • -r requests that a reflink be made.
  • -P says that the reflink() call should use REFLINK_ATTR_ALL.
  • -p (lower case) is like -P, except that it will retry with REFLINK_ATTR_NONE if the first call fails.
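
The difference between -P and -p could be expressed roughly as shown here. Since reflink() does not exist in any released kernel, the call is passed in as a function pointer; the reflink_fn type, the ln_reflink_P() and ln_reflink_p() helpers, the kernel-style negative-errno return convention, and the numeric flag values are all assumptions of this sketch. fake_reflink_unprivileged() is a stand-in which behaves as the proposal says an unprivileged call should.

```c
#include <errno.h>

/* Flag names come from the proposal; the values are hypothetical. */
#define REFLINK_ATTR_NONE 0x0
#define REFLINK_ATTR_ALL  0xf

typedef int (*reflink_fn)(const char *oldpath, const char *newpath,
                          int preserve);

/* ln -r -P: preserve everything, or fail. */
static int ln_reflink_P(reflink_fn do_reflink, const char *oldpath,
                        const char *newpath)
{
    return do_reflink(oldpath, newpath, REFLINK_ATTR_ALL);
}

/* ln -r -p: try to preserve everything, then retry preserving nothing. */
static int ln_reflink_p(reflink_fn do_reflink, const char *oldpath,
                        const char *newpath)
{
    int ret = do_reflink(oldpath, newpath, REFLINK_ATTR_ALL);

    if (ret != 0)
        ret = do_reflink(oldpath, newpath, REFLINK_ATTR_NONE);
    return ret;
}

/* Stand-in for a caller who may not preserve any attributes. */
static int fake_reflink_unprivileged(const char *oldpath,
                                     const char *newpath, int preserve)
{
    (void)oldpath;
    (void)newpath;
    return preserve == REFLINK_ATTR_NONE ? 0 : -EPERM;
}
```

Any fallback policy thus lives in user space, where it is visible and explicitly requested, rather than in the system call itself.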

There was some question as to whether all the flags are necessary; perhaps all that is really needed is "preserve all" or "preserve none." But Joel feels that one might as well add the flexibility, given that the argument is being added to the API anyway, and there doesn't seem to be much strong sentiment to the contrary. All told, the reflink() API would appear to be stabilizing toward something that everybody can agree on. It's probably late for 2.6.31, but this new system call could conceivably be ready for the 2.6.32 development cycle.

Block layer request queue API changes

By Jonathan Corbet
May 18, 2009
Prior to the 2.6 kernel series, the Linux block layer was somewhat simplistic and inflexible; it showed a lot of history from the early days of the Linux kernel. With the 2.5 development series came a complete rewrite; there have, of course, been a great many changes since then as well. But there are still bits of history to be found in the Linux block API. If Tejun Heo has his way, some of that history will be gone in the near future.

The standard way for a block driver to gain access to the next I/O request in the queue is with a call to:

    struct request *elv_next_request(struct request_queue *queue);

This function returns the request which is, in the I/O scheduler's opinion, the best one to execute next. An interesting feature of elv_next_request() is that it leaves the request on the queue; two calls to elv_next_request() in quick succession will return pointers to the same request. A block driver can explicitly remove the request from the queue with a call to blkdev_dequeue_request(), but that step is not necessary. If a request remains at the head of the queue when the block driver indicates that it has been completed, the block layer will dequeue the request at that time.

Leaving the request on the queue is a throwback to the very early days, when requests were handled one at a time - often a single sector at a time. By hiding the queuing details, the block layer made life easier for simple block drivers. But this apparent simplicity comes at a cost: it complicates the block API and makes it impossible for the block layer to know when processing of a request has begun. So it's not possible to do reliable request timing when drivers work on requests which remain on the queue.

This feature is also increasingly useless. Any contemporary driver worth its salt will process multiple requests at once; that, in turn, requires that the driver dequeue requests and keep track of them itself. So few of the drivers that people actually care about use the process-on-queue model. Given that, Tejun has come to the conclusion that processing on-queue requests is an idea whose time has passed. He has posted a lengthy patch series to make it go away.

The bulk of the patches are concerned with converting all drivers over to the "dequeue the request first" mode of operation. Typically that's just a matter of adding a blkdev_dequeue_request() call in the right place. A few places (the IDE subsystem, for example) are a bit more complex, but, for the most part, the changes are straightforward.

Once that has been done, the patch series culminates with a set of API changes. There is no more elv_next_request(); instead, a driver wanting to look at a request without dequeueing it will call:

    struct request *blk_peek_request(struct request_queue *queue);

Following that, a request can be dequeued with a call to blk_start_request(), which replaces blkdev_dequeue_request():

    void blk_start_request(struct request *req);

In addition to removing the request from the queue, blk_start_request() will start a timer for the request, allowing the block layer to respond eventually if completion is never signaled. Most of the time, though, drivers will just call:

    struct request *blk_fetch_request(struct request_queue *q);

which is a combination of blk_peek_request() and blk_start_request().

There is one other, under-the-hood change which goes along with the above: any attempt to complete a request which remains on the request queue will oops the system. One can think of this as a very clear message that on-queue processing is no longer considered to be the right thing to do in the Linux kernel. That, in turn, is part of the motivation for the API changes, which, for the most part, are just name changes: Tejun wants to be sure that maintainers of out-of-tree block drivers will notice that something has changed and respond accordingly.
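
A toy user-space model may help to make the relationship between these calls concrete. None of this is the kernel's implementation: the toy_ names, the fixed-size ring, and the boolean "timer" are inventions for the example, toy_start_request() takes a queue argument that the real blk_start_request() does not, and an assert() stands in for the oops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH 8

struct toy_request {
    int sector;
    bool on_queue;       /* still linked into the queue? */
    bool timer_running;  /* models the per-request timeout timer */
};

struct toy_queue {
    struct toy_request *slots[QUEUE_DEPTH];
    size_t head;
    size_t count;
};

static bool toy_enqueue(struct toy_queue *q, struct toy_request *req)
{
    if (q->count == QUEUE_DEPTH)
        return false;
    q->slots[(q->head + q->count) % QUEUE_DEPTH] = req;
    q->count++;
    req->on_queue = true;
    req->timer_running = false;
    return true;
}

/* Like blk_peek_request(): look at the head, but leave it queued. */
static struct toy_request *toy_peek_request(struct toy_queue *q)
{
    return q->count ? q->slots[q->head] : NULL;
}

/* Like blk_start_request(): dequeue the head and start its timer. */
static void toy_start_request(struct toy_queue *q, struct toy_request *req)
{
    assert(q->count > 0 && q->slots[q->head] == req);
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    req->on_queue = false;
    req->timer_running = true;
}

/* Like blk_fetch_request(): peek and start in a single step. */
static struct toy_request *toy_fetch_request(struct toy_queue *q)
{
    struct toy_request *req = toy_peek_request(q);

    if (req)
        toy_start_request(q, req);
    return req;
}

/* Completing a request which is still on the queue is a fatal error. */
static void toy_end_request(struct toy_request *req)
{
    assert(!req->on_queue);
    req->timer_running = false;
}
```

Two successive toy_peek_request() calls return the same request, as elv_next_request() always did; only toy_start_request() or toy_fetch_request() hands ownership to the driver, and that is also the moment at which timing can reliably begin.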

These patches have been through a couple of rounds of review. Nothing is ever certain, but it's entirely possible that this set of changes could go in for the 2.6.31 kernel.

Being nicer to executable pages

By Jonathan Corbet
May 19, 2009
In an ideal world, our computers would have enough memory to run all of the applications we need. In the real world, our systems are loaded with contemporary desktop environments, office suites, and more. So, even with the large amounts of memory being shipped on modern systems, there still never quite seems to be enough. Memory gets paged out to make room for new demands, and performance suffers. Some help may be on the way in the form of a new patch by Wu Fengguang which has the potential to make things better, should it ever be merged.

The kernel maintains two least-recently-used (LRU) lists for pages owned by user processes. One of these lists holds pages which are backed up by files - they are the page cache; the other list holds anonymous pages which are backed up by the swap device, assuming one exists. When the kernel needs to free up memory, it will do its best to push out pages which are backed up by files first. Those pages are much more likely to be unmodified, and I/O to them tends to be faster. So, with luck, a system which evicts file-backed pages first will perform better.

It may be possible to do things better, though. Certain kinds of activities - copying a large file, for example - can quickly fill memory with file-backed pages. As the kernel works to recover those pages, it stands a good chance of pushing out other file-backed pages which are likely to be more useful. In particular, pages containing executable code are relatively likely to be wanted in the near future. If the kernel pages out the C library, for example, chances are good that running processes will cause it to be paged back in quickly. The loss of needed executable pages is part of why operations involving large amounts of file data can make the system seem sluggish for a while afterward.

Wu's patch tries to improve the situation through a fairly simple change: when the page reclaim scanning code hits a file-backed, executable page which has the "referenced" bit set, it simply clears the bit and moves on. So executable pages get an extra trip through the LRU list; that will happen repeatedly for as long as somebody is making use of the page. If all goes well, pages running useful code will stay in RAM, while those holding less useful file data will get pushed out first. It should lead to a more responsive system.
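
Reduced to its essentials, the heuristic looks something like the following. This is a deliberately simplified model, not Wu's patch: the toy_page structure and scan_page() are invented for the example, and the real reclaim code weighs a number of other factors before deciding a page's fate.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down view of a page on the file LRU list. */
struct toy_page {
    bool file_backed;
    bool executable;
    bool referenced;
};

enum scan_result { PAGE_KEEP, PAGE_RECLAIM };

/*
 * One visit from the reclaim scanner. A referenced, file-backed,
 * executable page loses its referenced bit but earns another trip
 * through the list; in this simplified model everything else is
 * treated as a reclaim candidate.
 */
static enum scan_result scan_page(struct toy_page *page)
{
    if (page->file_backed && page->executable && page->referenced) {
        page->referenced = false;
        return PAGE_KEEP;
    }
    return PAGE_RECLAIM;
}
```

A page of the C library which is touched between scans will have its referenced bit set again and will survive each pass; once nobody uses it any more, the next scan finds the bit clear and the page becomes reclaimable like any other.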

The code seems to be in a relatively finished state at this point. So one might well ask whether it will be merged in the near future. That is never a straightforward question with memory management code, though. This patch may well make it into the mainline, but it will have to get over some hurdles in the process. The first of those hurdles is a simple question from Andrew Morton:

Now. How do we know that this patch improves Linux?

Claims like "it feels more responsive" are notoriously hard to quantify. But, without some sort of reasonably objective way to see what benefit is offered by this patch, the kernel developers are going to be reluctant to make changes to low-level memory management heuristics. The fear of regressions is always there as well; nobody wants to learn about some large database workload which gets slower after a patch like this goes in. In summary: knowing whether this kind of patch really makes the situation better is not as easy as one might wish.

The second problem is that this change would make it possible for a sneaky application to keep its data around by mapping its files with the "executable" bit set. The answer to this objection is easier: an application which seeks unfair advantage by playing games can already do so. Since anonymous pages receive preferential treatment already, the sneaky application could obtain a similar effect on current kernels by allocating memory and reading in the full file contents. Sites which are truly worried about this sort of abuse can (1) use the memory controller to put a lid on memory use, and/or (2) use SELinux to prevent applications from mapping file-backed pages with execute permission enabled.

Finally, Alan Cox has wondered whether this kind of heuristic-tweaking is the right approach in the first place:

I still think the focus is on the wrong thing. We shouldn't be trying to micro-optimise page replacement guesswork - we should be macro-optimising the resulting I/O performance. My disks each do 50MBytes/second and even with the Gnome developers finest creations that ought to be enough if the rest of the system was working properly.

Alan is referring to some apparent performance problems with the memory management and block I/O subsystems which crept in a few years ago. Some of these issues have been addressed for 2.6.30, but others remain unidentified and unresolved so far.

Wu's patch will not change that, of course. But it may still make life a little better for desktop Linux users. It is sufficiently simple and well contained that, in the absence of clear performance regressions for other workloads, it will probably find its way into the mainline sooner or later.

Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds