
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.26-rc2, released on May 12. This release is dominated by fixes, but also turns the big kernel lock (back) into a spinlock; see the article below for more information.

The current -mm tree is 2.6.26-rc2-mm1. Recent changes to -mm include a rebasing on top of the linux-next tree, a memory initialization debugging framework, the removal of the v850 port, a new set of system calls with additional flags arguments (see below), a WARN() macro which takes printk()-style arguments, ext4 online defragmentation, ext4 FIEMAP support, and the OMFS filesystem.

The current stable 2.6 kernel is 2.6.25.3, released on May 9. This release came out earlier than planned due to the inclusion of a couple of security fixes.

As of this writing, the 2.6.25.4 update - containing a few dozen fixes - is in the review process.


Kernel development news

Quotes of the week

According to my quick & dirty git-log analysis, at the current pace of [big kernel lock] removal we'd have to wait more than 10 years to remove most BKL critical sections from the kernel and to get acceptable latencies again.
-- Ingo Molnar doesn't want to wait that long

Even if you're totally right, with Nick's mmu notifiers, Rusty's mmu notifiers, my original mmu notifiers, Christoph's first version of my mmu notifiers, with my new mmu notifiers, with christoph EMM version of my new mmu notifiers, with my latest mmu notifiers, and all people making suggestions and testing the code and needing the code badly, and further patches waiting inclusion during 2.6.27 in this area, it must be obvious for everyone, that there's zero chance this code won't evolve over time to perfection, but we can't wait it to be perfect before start using it or we're screwed.
-- Andrea Arcangeli - also impatient

19:29 * dilinger sends his patch to lkml for a proper flaming
19:32 <Quozl> dilinger: it's called "patch hardening"? ;-} the heat of the flames makes the patch stronger.
-- Andres 'dilinger' Salomon


The big kernel lock strikes again

By Jonathan Corbet
May 13, 2008
When Alan Cox first made Linux work on multiprocessor systems, he added a primitive known as the big kernel lock (or BKL). This lock, originally, ensured that only one processor could be running kernel code at any given time. Over the years, the role of the BKL has diminished as increasingly fine-grained locking - along with lock-free algorithms - has been implemented throughout the kernel. Getting rid of the BKL entirely has been on the list of things to do for some time, but progress in that direction has been slow in recent years. A recent performance regression tied to the BKL might give some new urgency to that task, though; it also shows how subtle algorithmic changes can make a big difference.

The AIM benchmark attempts to measure system throughput by running a large number of tasks (perhaps thousands of them), each of which is exercising some part of the kernel. Yanmin Zhang reported that his AIM results got about 40% worse under the 2.6.26-rc1 kernel. He took the trouble to bisect the problem; the guilty patch turned out to be the generic semaphores code. Reverting that patch made the performance regression go away - at the cost of restoring over 7,000 lines of old, unlamented code. The thought of bringing back the previous semaphore implementation was enough to inspire a few people to look more deeply at the problem.

It did not take too long to narrow the focus to the BKL, which was converted to a semaphore a few years ago. That part of the process was easy - there aren't a whole lot of other semaphores left in the kernel, especially in performance-critical places. But the BKL stubbornly remains in a number of core places, including the fcntl() system call, a number of ioctl() implementations, the TTY code, and open() for char devices. That's enough for a badly-performing BKL to create larger problems, especially when running VFS-heavy benchmarks with a lot of contention.

Ingo Molnar tracked down the problem in the new semaphore code. In short: the new semaphore code is too fair for its own good. When a semaphore is released, and there is another thread waiting for it, the semaphore is handed over to the new thread (which is then made runnable) at that time. This approach ensures that threads obtain the semaphore in something close to the order in which they asked for it.

The problem is that fairness can be expensive. The thread waiting for the semaphore may be on another processor, its cache could be cold, and it might be at a low enough priority that it will not even begin running for some time. Meanwhile, another thread may request the semaphore, but it will get put at the end of the queue behind the new owner, which may not be running yet. The result is a certain amount of dead time where no running thread holds the semaphore. And, in fact, Yanmin's experience with the AIM benchmark showed this: his system was running idle almost 50% of the time.

The solution is to bring in a technique from the older semaphore code: lock stealing. If a thread tries to acquire a semaphore, and that semaphore is available, that thread gets it regardless of whether a different thread is patiently waiting in the queue. Or, in other words, the thread at the head of the queue only gets the semaphore once it starts running and actually claims it; if it's too slow, somebody else might get there first. In human interactions, this sort of behavior is considered impolite (in some cultures, at least), though it is far from unknown. In a multiprocessor computer, though, it makes the difference between acceptable and unacceptable performance - even a thread which gets its lock stolen will benefit in the long run.
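
Roughly, the difference between the two behaviors can be sketched in user space. The code below is a minimal illustration only - the structure, the function names, and the pthread-based implementation are invented for this example and are not the kernel's semaphore code. In the "stealing" version, up() simply makes the semaphore available again and pokes a waiter; whichever thread reaches the count first wins. A strictly fair version would, instead, leave the count alone and hand ownership directly to the thread at the head of the queue, so nobody else could take the lock before that thread actually runs.

    /*
     * User-space sketch (not kernel code) of a semaphore that allows
     * lock stealing.  All names here are invented for illustration.
     */
    #include <pthread.h>

    struct sketch_sem {
        pthread_mutex_t lock;
        pthread_cond_t  wait;
        int             count;    /* available units */
        int             waiters;  /* threads sleeping in down() */
    };

    /* Whoever finds count > 0 first takes the semaphore, even if
     * other threads were queued ahead of it. */
    static void down_stealing(struct sketch_sem *sem)
    {
        pthread_mutex_lock(&sem->lock);
        while (sem->count == 0) {
            sem->waiters++;
            pthread_cond_wait(&sem->wait, &sem->lock);
            sem->waiters--;
            /* Re-check on wakeup: a late arrival may have stolen the
             * count while this thread was still being scheduled. */
        }
        sem->count--;
        pthread_mutex_unlock(&sem->lock);
    }

    static void up_stealing(struct sketch_sem *sem)
    {
        pthread_mutex_lock(&sem->lock);
        sem->count++;                 /* lock is up for grabs again */
        if (sem->waiters)
            pthread_cond_signal(&sem->wait);
        pthread_mutex_unlock(&sem->lock);
    }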

Interestingly, the patch which implements this change was merged into the mainline, then reverted before 2.6.26-rc2 came out. The initial reason for the revert was that the patch broke semaphores in other situations; for some usage patterns, the semaphore code could fail to wake a thread when the semaphore became available. This bug could certainly have been fixed, but it appears that things will not go that way - there is a bit more going on here.

What is happening instead is that Linus has committed a patch which simply turns the BKL into a spinlock. By shorting out the semaphore code entirely, this patch fixes the AIM regression while leaving the slow (but fair) semaphore code in place. This change also makes the BKL non-preemptible, which will not be entirely good news for those who are concerned with latency issues - especially the real time tree.
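
In rough terms - and leaving out details like preemption and the scheduler's handling of sleeping tasks that the real implementation must deal with - a spinlock-based BKL looks something like the simplified sketch below. It relies on the per-task lock_depth counter to preserve the BKL's recursive semantics; this is an illustration of the idea, not the merged code itself.

    /* Simplified sketch of a spinlock-based BKL; the real code also
     * handles preemption and dropping the lock across sleeps. */
    #include <linux/spinlock.h>
    #include <linux/sched.h>

    static DEFINE_SPINLOCK(kernel_flag);

    void lock_kernel(void)
    {
        int depth = current->lock_depth + 1;

        /* The BKL is recursive: only the outermost lock_kernel()
         * call by a given task actually takes the spinlock. */
        if (likely(!depth))
            spin_lock(&kernel_flag);
        current->lock_depth = depth;
    }

    void unlock_kernel(void)
    {
        BUG_ON(current->lock_depth < 0);
        if (likely(--current->lock_depth < 0))
            spin_unlock(&kernel_flag);
    }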

The reasoning behind this course of action would appear to be this: both semaphores and the BKL are old, deprecated mechanisms which are slated for minimization (semaphores) or outright removal (BKL) in the near future. Given that, it is not worth adding more complexity back into the semaphore code, which was dramatically simplified for 2.6.26. And, it seems, Linus is happy with a sub-optimal BKL:

Quite frankly, maybe we _need_ to have a bad BKL for those to ever get fixed. As it was, people worked on trying to make the BKL behave better, and it was a failure. Rather than spend the effort on trying to make it work better (at a horrible cost), why not just say "Hell no - if you have issues with it, you need to work with people to get rid of the BKL rather than cluge around it".

So the end result of all this may be a reinvigoration of the effort to remove the big kernel lock from the kernel. It still is not something which is likely to happen over the next few kernel releases: there is a lot of code which can subtly depend on BKL semantics, and there is no way to be sure that it is safe without auditing it in detail. And that is not a small job. Alan Cox has been reworking the TTY code for some time, but he has some ground to cover yet - and the TTY code is only part of the problem. So the BKL will probably be with us for a while yet.


Getting a handle on caching

By Jonathan Corbet
May 14, 2008
Memory management changes (for the x86 architecture) have caused surprises for a few kernel developers. As these issues have been worked out, it has become clear that not everybody understands how memory caching works on contemporary systems. In an attempt to bring some clarity, Arjan van de Ven wrote up some notes and sent them to your editor, who has now worked them into this article. Thanks to Arjan for putting this information together - all the useful stuff found below came from him.

As readers of What every programmer should know about memory will have learned, the caching mechanisms used by contemporary processors are crucial to the performance of the system. Memory is slow; without caching, systems will run much slower. There are situations where caching is detrimental, though, so the hardware must provide mechanisms which allow control over how specific ranges of memory are cached. With 2.6.26, Linux is (rather belatedly) starting to catch up with the current state of the art on x86 hardware; that, in turn, is bringing some changes to how caching is managed.

It is good to start with a definition of the terms being used. If a piece of memory is cachable, that means:

  • The processor is allowed to read that memory into its cache at any time. It may choose to do so regardless of whether the currently-executing program is interested in reading that memory. Reads of cachable memory can happen in response to speculative execution, explicit prefetching, or a number of other reasons. The CPU can then hold the contents of this memory in its cache for an arbitrary period of time, subject only to an explicit request to release the cache line from elsewhere in the system.

  • The CPU is allowed to write the contents of its cache back to memory at any time, again regardless of what any running program might choose to do. Memory which has never been changed by the program might be rewritten, or writes done by a program may be held in the cache for an arbitrary period of time. The CPU need not have read an entire cache line before writing that line back.

What this all means is that, if the processor sees a memory range as cachable, it must be possible to (almost) entirely disconnect the operations on the underlying device from what the program thinks it is doing. Cachable memory must always be readable without side effects. Writes have to be idempotent (writing the same value to the same location several times has the same effect as writing it once), ordering-independent, and size-independent. There must be no side effects from writing back a value which was read from the same location. In practice, this means that what sits behind a cachable address range must be normal memory - though there are some other cases.

If, instead, an address range is uncachable, every read and write operation generated by software will go directly to the underlying device, bypassing the CPU's caches. The one exception is with writes to I/O memory on a PCI bus; in this case, the PCI hardware is allowed to buffer and combine write operations. Writes are not reordered with reads, though, which is why a read from I/O memory is often used in drivers for PCI devices as a sort of write barrier.
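
In a driver, that read-as-a-barrier idiom typically looks something like the following sketch; the register offsets and the start_device() function are invented for illustration:

    #include <linux/io.h>

    /* Hypothetical device registers, for illustration only. */
    #define REG_CONTROL 0x00
    #define REG_STATUS  0x04

    static void start_device(void __iomem *regs)
    {
        /* This write may be posted (buffered) by the PCI bridge... */
        writel(0x1, regs + REG_CONTROL);

        /* ...but a read from the same device cannot pass it, so the
         * write must have reached the hardware by the time readl()
         * returns. */
        (void)readl(regs + REG_STATUS);
    }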

A variant form of uncached access is write combining. For read operations, write-combined memory is the same as uncachable memory. The hardware is, however, allowed to buffer consecutive write operations and execute them as a smaller series of larger I/O operations. The main user of this mode is video memory, which often sees sequential writes and which offers significant performance improvements when those writes are combined.

The important thing is to use the right cache mode for each memory range. Failure to make ordinary memory cachable can lead to terrible performance. Enabling caching on I/O memory can cause strange hardware behavior and corrupted data - and is probably implicated in global warming. So the CPU and the hardware behind a given address must agree on caching.

Traditionally, caching has been controlled with a CPU feature called "memory type range registers," or MTRRs. Each processor has a finite set of MTRRs, each of which controls a range of the physical address space. The BIOS sets up at least some of the MTRRs before booting the operating system; some others may be available for tweaking later on. But MTRRs are somewhat inflexible, subject to the BIOS not being buggy, and are limited in number.

In more recent times, CPU vendors have added a concept known as "page attribute tables," or PAT. PAT, essentially, is a set of bits stored in the page table entries which control how the CPU does caching for each page. The PAT bits are more flexible and, since they live in the page table entries, they are difficult to run out of. They are also completely under the control of the operating system instead of the BIOS. The only problem is that Linux doesn't support PAT on the x86 architecture, despite the fact that the hardware has had this capability for some years.

The lack of PAT support is due to a few things, not the least of which has been problematic support on the hardware side. Processors have stabilized over the years, though, to the point that it is possible to create a reasonable whitelist of CPU families known to actually work with PAT. There have also been challenges on the kernel side; when multiple page table entries refer to the same physical page (a common occurrence), all of the page table entries must use the same caching mode. Even a brief window with inconsistent caching can be enough to bring down the system. But the code on the kernel side has finally been worked into shape; as a result, PAT support was merged for the 2.6.26 kernel. Your editor is typing this on a PAT-enabled system with no ill effects - so far.

On most systems, the BIOS will set MTRRs so that regular memory is cachable and I/O memory is not. The processor can then complicate the situation with the PAT bits. In general, when there is a conflict between the MTRR and PAT settings, the setting with the lower level of caching prevails. The one exception appears to be when one says "uncachable" and the other enables write combining; in that case, write combining will be used. So the CPU, through the management of the PAT bits, can make a couple of effective changes:

  • Uncached memory can have write combining turned on. As noted above, this mode is most useful for video memory.

  • Normal memory can be made uncached. This mode can also be useful for video memory; in this case, though, the memory involved is normal RAM which is also accessed by the video card.

Linux device drivers must map I/O memory before accessing it; the function which performs this task is ioremap(). Traditionally, ioremap() made no specific changes to the cachability of the remapped range; it just took whatever the BIOS had set up. In practice, that meant that I/O memory would be uncachable, which is almost always what the driver writer wanted. There is a separate ioremap_nocache() variant for cases where the author wants to be explicit, but use of that interface has always been rare.

In 2.6.26, ioremap() was changed to map the memory uncached at all times. That created a couple of surprises in cases where, as it happens, the memory range involved had been cachable before and that was what the code needed. As of 2.6.26, such code will break until the call is changed to use the new ioremap_cache() interface instead. There is also an ioremap_wc() function for cases where a write-combined mapping is needed.
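
A driver which actually wants one of the other modes must now say so explicitly. As a rough sketch (the BAR layout and the map_regions() helper are hypothetical), a driver might map a frame buffer write-combined and a region of device RAM cached:

    #include <linux/io.h>
    #include <linux/pci.h>
    #include <linux/errno.h>

    /* Hypothetical example: BAR 0 is a frame buffer, BAR 1 is memory
     * on the device which should be cached. */
    static int map_regions(struct pci_dev *pdev, void __iomem **fb,
                           void __iomem **ram)
    {
        *fb = ioremap_wc(pci_resource_start(pdev, 0),
                         pci_resource_len(pdev, 0));
        if (!*fb)
            return -ENOMEM;

        *ram = ioremap_cache(pci_resource_start(pdev, 1),
                             pci_resource_len(pdev, 1));
        if (!*ram) {
            iounmap(*fb);
            return -ENOMEM;
        }
        return 0;
    }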

It is also possible to manipulate the PAT entries for an address range explicitly:

    int set_memory_uc(unsigned long addr, int numpages);
    int set_memory_wc(unsigned long addr, int numpages);
    int set_memory_wb(unsigned long addr, int numpages);

These functions will set the given pages to uncachable, write-combining, or writeback (cachable), respectively. Needless to say, anybody using these functions should have a firm grasp of exactly what they are doing or unpleasant results are certain.
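
As a rough example of how these might be used (the buffer and both helper functions here are hypothetical), a driver could flip one page of normal RAM to write-combining and restore it before freeing:

    #include <linux/gfp.h>
    #include <asm/cacheflush.h>   /* set_memory_*() on x86 */

    static void *setup_wc_buffer(void)
    {
        /* One page of ordinary RAM, hypothetically shared with a
         * device that benefits from write combining. */
        void *buf = (void *)__get_free_page(GFP_KERNEL);

        if (!buf)
            return NULL;

        if (set_memory_wc((unsigned long)buf, 1)) {
            free_page((unsigned long)buf);
            return NULL;
        }
        return buf;
    }

    static void teardown_wc_buffer(void *buf)
    {
        /* Restore normal (writeback) caching before giving the page
         * back to the allocator. */
        set_memory_wb((unsigned long)buf, 1);
        free_page((unsigned long)buf);
    }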


Extending system calls

By Jake Edge
May 14, 2008

Getting interfaces right is a hard, but necessary, task, especially when that interface has to be supported "forever". Such is the case with the system call interface that the kernel presents to user space, so adding features to it must be done very carefully. Even so, when Ulrich Drepper set out to remove a hole that could lead to a race condition, he probably did not expect all the different paths that would need to be tried before closing in on an acceptable solution.

The problem stems from wanting to be able to create file descriptors with new properties—things like close-on-exec, non-blocking, or non-sequential descriptors. Those features were not considered when the system call interface was developed. After all, many of those system calls are essentially unchanged from early Unix implementations of the 1970s. The open() call is the most obvious way to request a file descriptor from the kernel, but there are plenty of others.

In fact, open() is one of the easiest to extend with new features because of its flags argument. Calls like pipe(), socket(), accept(), epoll_create() and others produce file descriptors as well, but don't have a flags argument available. Something different would have to be done to support additional features for the file descriptors resulting from those calls.

The close-on-exec functionality is especially important to close a security hole for multi-threaded programs. Currently, programs can use fcntl() to change an open file descriptor to have the close-on-exec property, but there is always a window in time between the creation of the descriptor and changing its behavior. Another thread could do an exec() call in that window, leaking a potentially sensitive file descriptor into the newly run program. Closing that window requires an in-kernel solution.
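
The racy user-space pattern looks something like this minimal sketch (open_config() is invented for the example); for open() itself the window can be closed with the O_CLOEXEC flag, but most other descriptor-creating calls have no place to put such a flag:

    #include <fcntl.h>
    #include <unistd.h>

    static int open_config(const char *path)
    {
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;

        /* Window: if another thread calls fork() and exec() right
         * here, the new program inherits fd before it is marked
         * close-on-exec below. */
        fcntl(fd, F_SETFD, FD_CLOEXEC);
        return fd;
    }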

Back in June of last year, after some false starts, Linus Torvalds suggested adding an indirect() system call, as a way to pass flags to system calls that don't currently support them. The indirect() call would apply a set of flags to the invocation of an existing system call. This would allow existing calls to remain unchanged, with only new uses calling indirect(). User space programs would be unlikely to call the new function directly, instead they would call glibc functions that handled any necessary indirect() calls.

Davide Libenzi created a sys_indirect() patch in July, but Drepper saw it as "more complex than warranted". So Drepper created his own "trivial" implementation, that was described on this page in November. It was met with a less than enthusiastic response on linux-kernel for being, amongst other things, an exceedingly ugly interface.

The alternative to sys_indirect() is to create a new system call for each existing call that needs a flags argument. This was seen as messy by some, including Torvalds, leading some kernel hackers to look for alternatives. The indirect approach also had some other potential benefits, though, because it was seen as something that could be used by syslets to allow asynchronous system calls. No decision seemed to be forthcoming, leading Drepper to ask Torvalds for one:

Will you please make a decision regarding sys_indirect? There has been no other proposal so the alternative is to add more syscalls.

To bolster his argument that sys_indirect() was the way to go, Drepper also created a patch to add some of the required system calls. He started with the socket() family by adding socket4(), socketpair5(), and accept4()—tacking the number of arguments onto the function name a la wait3() and wait4(). Drepper's intent may not have been well served by choosing those calls, as Alan Cox immediately noted that the type argument could be overloaded:

Given we will never have 2^32 socket types, and in a sense this is part of the type why not just use
        socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, ...)
that would be far far cleaner, no new syscalls on the socket side at all.

Michael Kerrisk looked over the set of system calls that generate file descriptors, categorizing them based on whether they needed a flag argument added. He observed that roughly half of the file-descriptor-producing calls need not change: they could use an overloading trick like the socket calls, the glibc API had already added a flags argument, or alternatives were available that provide the same functionality along with flags.

In response, Drepper made one last attempt to push the indirect approach, saying:

Or we just add sys_indirect (which is also usable for other syscall extensions, not just the CLOEXEC stuff) and let userlevel (i.e., me) worry about adding new interfaces to libc. As you can see, for the more recent interfaces like signalfd I have already added an additional parameter so the number of interface changes would be reduced.

Even though the indirect approach has some good points, Torvalds liked the approach advocated by Cox, saying:

Ok, I have to admit that I find this very appealing. It looks much cleaner, but perhaps more importantly, it also looks both readable and easier to use for the user-space programmer.

Ultimately, developers will only use these new interfaces if they can easily test for the existence of the new code. Torvalds gives an example of how that might be done using the O_NOATIME flag to open(), which has only been available since 2.6.8. It is this testability issue that makes him believe the flags-based approach is the right one:

And that's the problem with anything that isn't flags-based. Once you do new system calls, doing the above is really quite nasty. How do you statically even test that you have a system call? Now you need to add a whole autoconf thing for it existing, and when it does exist you still need to test whether it works, and you can't even do it in the slow-path like the above (which turns the failure into a fast-path without the flag).
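
With a flags-based interface, that test can live entirely in the failure path. A minimal sketch, assuming the proposed SOCK_CLOEXEC flag and an older kernel that rejects unknown type bits with EINVAL, might look like:

    #include <sys/socket.h>
    #include <errno.h>
    #include <fcntl.h>

    static int cloexec_socket(int domain, int type, int protocol)
    {
        int fd = socket(domain, type | SOCK_CLOEXEC, protocol);

        if (fd >= 0 || errno != EINVAL)
            return fd;

        /* Older kernel: fall back to the racy two-step version. */
        fd = socket(domain, type, protocol);
        if (fd >= 0)
            fcntl(fd, F_SETFD, FD_CLOEXEC);
        return fd;
    }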

This new approach, with a scaled down number of new system calls rather than adding a general-purpose system call extension mechanism like sys_indirect(), is now being pursued by Drepper. In the explanatory patch at the start of the series, he lays out which of the system calls will require a new user space interface: paccept(), epoll_create2(), dup3(), pipe2(), and inotify_init1(), as well as those that do not: signalfd4(), eventfd2(), timerfd_create(), socket(), and socketpair().
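
For the calls which do get a new entry point, the result is straightforward to use; a sketch of the proposed pipe2() interface (assuming the glibc wrapper Drepper describes, and a pipe()-plus-flags signature) would be:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Create both ends of a pipe close-on-exec atomically, with no
     * fcntl() window. */
    static int make_private_pipe(int fds[2])
    {
        return pipe2(fds, O_CLOEXEC);
    }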

Drepper has already made several iterations of patches addressing most of the concerns expressed by the kernel developers along the way. There have been some architecture specific problems, but Drepper has been knocking those down as well. If no further roadblocks appear, it would seem a likely candidate for inclusion in 2.6.27.


Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.26-rc2
Andrew Morton: 2.6.26-rc2-mm1
Greg Kroah-Hartman: Linux 2.6.25.3
Oliver Pinter: v2.6.22.24-op1

