Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.10-rc2, announced by Linus on November 14. Patches merged since -rc1 include fixes for the ELF loader security problems, Anubis block cipher support, an ALSA update, a number of networking updates, kprobes support for the x86-64 architecture, a frame buffer device update, a set of user-mode Linux patches, an NTFS update, version 2.0 of the USB gadget serial driver, some kernel build tweaks (the preferred name for kernel makefiles is now Kbuild), the ext3 block reservation and online resizing patches, sysfs backing store, locking behavior annotations for the "sparse" utility, a reworking of spin lock initialization, the un-exporting of add_timer_on(), sys_lseek(), and a number of other kernel functions, an x86 signal delivery optimization, an IDE update, I/O space write barrier support, a frame buffer driver update, more scheduler tweaks, some big kernel lock preemption patches, a large number of architecture updates, and lots of fixes. See the long-format changelog (600KB) for the details.

Linus has noted that now would be a good time to calm down and stick to bug fixes until 2.6.10 comes out. His BitKeeper repository shows that he is sticking to that; it contains mostly fixes. There is also a memory technology device (and JFFS2) update, a frame buffer device update, some user-mode Linux patches, some page allocator tuning, and a few architecture updates.

The current prepatch from Andrew Morton is 2.6.10-rc2-mm1. Recent changes to -mm include some kmap_atomic() changes (see below), the ability to disable a subset of "magic sysrq" features, some SELinux scalability work, enhanced I/O and memory usage accounting data collection, and an updated reiser4 filesystem.

The 2.4.28 kernel has been released; Marcelo announced its availability on November 17. The biggest change since 2.4.27, for many people, will be the serial ATA and networking improvements, but many other fixes have gone in as well.

Kernel development news

Quote of the week

Well, yes, your base appetites have led you to the name "pud", where my refined intellect led me to "phd", with h for higher ;)

-- Hugh Dickins on the philosophy of page table naming.

On not getting burned by kmap_atomic()

"High memory" on a Linux system is, by definition, memory which is not normally mapped into the kernel's virtual address space. It is a mechanism which enables 32-bit architectures to make use of more physical memory than would otherwise be possible. When the kernel needs to directly manipulate the contents of a high-memory page, it must explicitly create a virtual address for it. The traditional functions for creating and removing those addresses are:

    void *kmap(struct page *page);
    void kunmap(struct page *page);

These functions work as intended, but they can be expensive to use. The virtual address space they use is limited, and shared across all processors. As a result, each kmap() and kunmap() invocation requires a global TLB flush. Often, however, high memory does not need to be mapped for long periods of time, and does not need to be shared across processors. To improve performance in such situations, the notion of an "atomic kmap" was added:

    void *kmap_atomic(struct page *page, enum km_type type);
    void kunmap_atomic(void *address, enum km_type type);

Atomic kmaps use a very small set of predefined virtual "slots," which are not shared across processors. The type argument specifies which slot is to be used, with the callers taking responsibility for not stepping on each other's toes. Slots are dedicated to specific purposes - two for code called in user context, two for interrupt handlers, two for page table management, etc. In practice, it all works out; conflicts over atomic kmap slots don't happen.
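The slot idea can be illustrated with a small user-space model. This is only a sketch, not the kernel's implementation: the model_ function names, the explicit cpu argument, the NR_CPUS value, and the slot names in the enum are all assumptions made for the example. It shows one private table of slots per processor, with a busy slot treated as a caller conflict.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 2   /* hypothetical CPU count for the model */

/* Hypothetical slot names; the kernel's enum km_type has its own list. */
enum km_type { KM_USER0, KM_USER1, KM_IRQ0, KM_IRQ1, KM_TYPE_NR };

struct page { char data[64]; };  /* stand-in for the kernel's struct page */

/* One private table of slots per CPU: nothing here is shared. */
static struct page *slots[NR_CPUS][KM_TYPE_NR];

/* Model of an atomic map: claim the (cpu, type) slot for this page. */
void *model_kmap_atomic(int cpu, struct page *page, enum km_type type)
{
    assert(slots[cpu][type] == NULL);  /* a busy slot means a caller conflict */
    slots[cpu][type] = page;
    return page->data;
}

/* Model of an atomic unmap: release the slot that holds this address. */
void model_kunmap_atomic(int cpu, void *address, enum km_type type)
{
    assert(slots[cpu][type] != NULL);
    assert((void *)slots[cpu][type]->data == address);
    slots[cpu][type] = NULL;
}

/* Helper for inspection: is the given slot currently in use? */
int slot_busy(int cpu, enum km_type type)
{
    return slots[cpu][type] != NULL;
}
```

In this model, two processors can each hold a KM_USER0 mapping at the same time without interfering, while a second mapping into the same slot on the same processor trips the assertion.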

Another problem has come up, however, and that has led to a small change in the prototypes of the atomic kmap functions in the -mm kernel. The regular kmap functions have a symmetrical interface in that both take a struct page * argument. kunmap_atomic(), instead, takes a void * argument - the kernel virtual address to be unmapped. It is a common mistake, however, to pass in the associated struct page pointer instead. Since the argument type is void *, the compiler does not complain, and the discovery of the problem does not come until (possibly much) later.

The solution is straightforward: redefine the function as follows:

    char *kmap_atomic(struct page *page, enum km_type type);
    void kunmap_atomic(char *address, enum km_type type);

With this change, the compiler will issue a warning whenever somebody tries to pass a struct page pointer to kunmap_atomic().
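A user-space sketch of the effect follows. The demo_ functions are hypothetical stubs that merely borrow the new prototypes; they do no real mapping, and struct page here is a stand-in. The point is the calling pattern: the value returned by the map call, not the page pointer, is what goes back to the unmap call.

```c
#include <assert.h>
#include <stddef.h>

struct page { char data[64]; };  /* stand-in, not the kernel's struct page */
enum km_type { KM_USER0 };       /* one slot is enough for the demonstration */

static char *last_unmapped;      /* records what the stub was handed */

/* Stub with the new-style prototype: hands back a char * "mapping". */
char *demo_kmap_atomic(struct page *page, enum km_type type)
{
    (void)type;
    return page->data;
}

/* Stub with the new-style prototype: the parameter is char *, not void *. */
void demo_kunmap_atomic(char *address, enum km_type type)
{
    (void)type;
    last_unmapped = address;
}

/* Correct usage: unmap the address that the map call returned. */
char *demo_touch_page(struct page *page)
{
    char *vaddr = demo_kmap_atomic(page, KM_USER0);

    vaddr[0] = 'x';
    demo_kunmap_atomic(vaddr, KM_USER0);

    /*
     * The mistake described above would read:
     *     demo_kunmap_atomic(page, KM_USER0);
     * With a char * parameter, the compiler diagnoses the incompatible
     * pointer type; with a void * parameter it would pass silently.
     */
    return last_unmapped;
}
```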

The patch has generated a surprising number of follow-on fixes, mostly to suppress warnings caused by the change. Many kunmap_atomic() calls now explicitly cast the address argument to the char * type. In the end, though, the result should be one more potential mistake which can be caught before it burns somebody - as long as programmers don't "fix" warnings by casting struct page pointers.

Trustees Linux

Linux currently offers a wealth of projects which are working to extend the classic Unix permissions mechanism with more flexible schemes. One recent entry is an LSM port of Trustees Linux, which has been done by Andrew Ruder. Trustees Linux starts with the idea that access control lists are overly complicated and inefficient; achieving the desired goals can require hanging ACLs on thousands of files, and keeping all of those ACLs in sync can be a challenge.

The Trustees approach, instead, is to create a separate, central database which contains filesystem permissions. This database can assign a "trustee" to a directory; this trustee provides access permissions which apply to the directory and, by default, everything below that directory. A single rule can, thus, cover a large part of the filesystem hierarchy.
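To make the "one rule covers a subtree" idea concrete, here is a minimal sketch of such a lookup. It is not the Trustees code: the structure, the permission bits, the function names, and the policy that the deepest matching directory wins are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical permission bits for this sketch. */
#define T_READ   0x1u
#define T_WRITE  0x2u
#define T_BROWSE 0x4u

/* Hypothetical entry in the central permissions database. */
struct trustee_rule {
    const char *dir;    /* directory the trustee is attached to */
    const char *user;   /* user the rule applies to */
    unsigned    perms;  /* permissions granted at and below dir */
};

/* True when path is dir itself or something below it. */
int path_is_under(const char *path, const char *dir)
{
    size_t n = strlen(dir);

    if (n > 0 && dir[n - 1] == '/')
        n--;                        /* treat "/var/" like "/var", "/" like "" */
    if (strncmp(path, dir, n) != 0)
        return 0;
    return path[n] == '\0' || path[n] == '/';
}

/* Assumed policy: the rule on the deepest covering directory wins. */
unsigned lookup_perms(const struct trustee_rule *rules, size_t nrules,
                      const char *user, const char *path)
{
    size_t best_len = 0;
    unsigned perms = 0;
    int found = 0;

    for (size_t i = 0; i < nrules; i++) {
        size_t len = strlen(rules[i].dir);

        if (strcmp(rules[i].user, user) != 0)
            continue;
        if (!path_is_under(path, rules[i].dir))
            continue;
        if (!found || len > best_len) {
            found = 1;
            best_len = len;
            perms = rules[i].perms;
        }
    }
    return perms;   /* 0 when no trustee covers the path */
}
```

With a single rule attached to /var/log, a lookup for any file beneath that directory returns the rule's permissions, while a sibling such as /var/logs is not covered.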

The trustee rules cover the usual sorts of permissions: who can search for, read, and write files in a given subtree. The format is somewhat terse. One of the rules provided in the examples enables user "zavadsky" to wander around in (and under) /var/log and read files there.

Mr. Ruder's port is centered around the Linux security module inode_permission() hook; that code examines the trustees which apply to a given inode and decides whether the requested access is to be allowed or not.

It's all pretty straightforward, but there is an interesting twist to how Trustees works with file permissions: the module gives the CAP_DAC_OVERRIDE capability to every process, allowing them to override the existing Unix file permissions. The Trustees module will, in turn, apply those permissions itself much of the time, but it is possible to write rules which override them. In this sense, Trustees functions as an authoritative module, which is not how LSM modules are supposed to work. If Trustees Linux is ever proposed for merging into the mainline, that little feature could come back to haunt it.

Stopping unwanted OOM killer experiences

There has, in recent times, been a small increase in the number of complaints from users who have seen processes killed by the kernel in response to an out-of-memory (OOM) situation. The only problem is that the system should not have been quite that hard up for memory at the time. Even if the user is doing something which requires completely irrational amounts of memory ("yum update", say), it seems like the system should have been able to muddle along without killing low-priority processes, like the ssh server. These unwanted OOM killer experiences have driven a few developers to take a closer look at what was going on.

Marcelo Tosatti has been working on the problem for a bit; he put together a patch which tries to avoid invocations of the OOM killer if things might get better soon. The idea is that, while a full scan of a memory zone may have failed to turn up any free pages, it may have kicked I/O into motion that will, very soon, make some pages free. So the OOM killer is kept in its cage until the no-memory situation has persisted for a few seconds. Marcelo reported that this patch improved things significantly for his test cases.
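The delay logic can be sketched in a few lines. This is a user-space illustration rather than Marcelo's patch: the structure, the function name, and the five-second grace period are assumptions (the description above says only "a few seconds").

```c
#include <assert.h>

#define OOM_GRACE_SECONDS 5   /* hypothetical value for "a few seconds" */

/* Hypothetical bookkeeping for one memory zone. */
struct oom_state {
    int  failing;        /* nonzero while scans keep coming up empty */
    long first_failure;  /* time, in seconds, of the first empty scan */
};

/*
 * Report the result of one reclaim scan at time 'now'.
 * Returns 1 only when the shortage has persisted for the whole
 * grace period, so that pending I/O has had a chance to free pages.
 */
int oom_should_kill(struct oom_state *st, int scan_found_pages, long now)
{
    if (scan_found_pages) {
        st->failing = 0;          /* things got better: reset the clock */
        return 0;
    }
    if (!st->failing) {
        st->failing = 1;          /* first empty scan: start the clock */
        st->first_failure = now;
        return 0;
    }
    return now - st->first_failure >= OOM_GRACE_SECONDS;
}
```

Any successful scan resets the clock, so only an uninterrupted run of failures lasting the full grace period lets the OOM killer out.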

It turns out, though, that the real problem was elsewhere; the token-based thrashing control patch appears to be the real culprit. This patch, remember, tries to reduce system thrashing in memory-constrained situations by exempting one process at a time from the page reclaim mechanism. That process will, in theory, make use of its sheltered time to make some real progress before the token moves on and its pages are, once again, subject to eviction. The token-based mechanism has been shown to truly improve the situation when memory is tight.

Until it gets too tight, as it turns out. A process which needs a page, but which does not hold the token, may find that all of the (otherwise) reclaimable pages belong to the process currently holding the token. The unlucky process thus finds no pages to grab, and pushes the big red OOM button. The system is not truly out of memory, however; it has simply been told that all the good pages are temporarily off limits.

Rik van Riel put his finger on the problem, and Andrew Morton put together a simple patch to fix it. Essentially, the VM subsystem will now ignore the swap token when finding reclaimable pages gets too hard. During normal operation, the token-based mechanism holds sway, but it can be set aside as a preferable alternative to killing random processes in the system. The patch appears to have solved the problems without taking away the benefits of the token-based approach.
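The behavior described above, honoring the token first and setting it aside only when nothing else can be reclaimed, can be modeled as follows. This is a simplified sketch, not the actual patch: the demo_page structure and both function names are hypothetical, and real page reclaim is far more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-simplified view of a page for this sketch. */
struct demo_page {
    int owner;        /* id of the process the page belongs to */
    int reclaimable;  /* nonzero if the page could be evicted */
};

/* Return the index of the first page that may be reclaimed, or -1. */
int find_victim(const struct demo_page *pages, size_t n,
                int token_holder, int honor_token)
{
    for (size_t i = 0; i < n; i++) {
        if (!pages[i].reclaimable)
            continue;
        if (honor_token && pages[i].owner == token_holder)
            continue;   /* the token holder's pages are sheltered */
        return (int)i;
    }
    return -1;
}

/* Normal pass honors the token; a failed pass retries without it. */
int reclaim_one(const struct demo_page *pages, size_t n, int token_holder)
{
    int victim = find_victim(pages, n, token_holder, 1);

    if (victim < 0)
        victim = find_victim(pages, n, token_holder, 0);
    return victim;   /* -1 only when nothing at all is reclaimable */
}
```

When every reclaimable page belongs to the token holder, the first pass finds nothing and the second pass takes one of those pages anyway, rather than reporting an out-of-memory condition.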

Marcelo acknowledged that this was the right fix, grumbled that he had wasted a bunch of time, and promised "Next time I should be looking into the easy stuff before trying miraculous solutions." It was his work, however, which shone a light on the problem in the first place, and led to its eventual solution.

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds