Brief items
The current 2.6 prepatch is 2.6.26-rc2,
released on May 12. This
release is dominated by fixes, but also turns the big kernel lock (back)
into a spinlock; see the article below for more information.
The current -mm tree is 2.6.26-rc2-mm1. Recent changes
to -mm include a rebasing on top of the linux-next tree, a memory
initialization debugging framework, the removal of the v850 port, a new set
of system calls with additional flags arguments (see below), a
WARN() macro which takes printk()-style arguments, ext4
online defragmentation, ext4 FIEMAP support, and the OMFS filesystem.
The current stable 2.6 kernel is 2.6.25.3, released on May 9. This
release came out earlier than planned due to the inclusion of a couple of
security fixes.
As of this writing, the 2.6.25.4 update - containing a few dozen fixes - is
in the review process.
Comments (none posted)
Kernel development news
According to my quick & dirty git-log analysis, at the current pace
of [big kernel lock] removal we'd have to wait more than 10 years to remove most
BKL critical sections from the kernel and to get acceptable
latencies again.
--
Ingo Molnar doesn't want
to wait that long
Even if you're totally right, with Nick's mmu notifiers, Rusty's
mmu notifiers, my original mmu notifiers, Christoph's first version
of my mmu notifiers, with my new mmu notifiers, with christoph EMM
version of my new mmu notifiers, with my latest mmu notifiers, and
all people making suggestions and testing the code and needing the
code badly, and further patches waiting inclusion during 2.6.27 in
this area, it must be obvious for everyone, that there's zero
chance this code won't evolve over time to perfection, but we can't
wait it to be perfect before start using it or we're screwed.
Andrea Arcangeli - also impatient
| 19:29 | * | dilinger sends his patch to lkml for a
proper flaming |
| 19:32 | | Quozl> dilinger: it's called "patch
hardening"? ;-} the heat
of the flames makes the patch stronger. |
--
Andres 'dilinger' Salomon
Comments (1 posted)
By Jonathan Corbet
May 13, 2008
When Alan Cox first made Linux work on multiprocessor systems, he added a
primitive known as the big kernel lock (or BKL). This lock, originally,
ensured that only one processor could be running kernel code at any given
time. Over the years, the role of the BKL has diminished as increasingly
fine-grained locking - along with lock-free algorithms - have been
implemented throughout the kernel. Getting rid of the BKL entirely has
been on the list of things to do for some time, but progress in that
direction has been slow in recent years. A recent performance regression
tied to the BKL might give some new urgency to that task, though; it also
shows how subtle algorithmic changes can make a big difference.
The AIM benchmark attempts to measure system throughput by running a large
number of tasks (perhaps thousands of them), each of which is exercising
some part of the kernel. Yanmin Zhang reported that his AIM results got about 40%
worse under the 2.6.26-rc1 kernel. He took the trouble to bisect the
problem; the guilty patch turned out to be the generic semaphores code.
Reverting that patch made the performance regression go away - at the cost
of restoring over 7,000 lines of old, unlamented code. The thought of
bringing back the previous semaphore implementation was enough to inspire a
few people to look more deeply at the problem.
It did not take too long to narrow the focus to the BKL, which was converted to a semaphore a few
years ago. That part of the process was easy - there aren't a whole lot of
other semaphores left in the kernel, especially in performance-critical
places. But the BKL stubbornly remains in a number of core places,
including the fcntl() system call, a number of ioctl()
implementations, the TTY code, and open() for char devices.
That's enough for a badly-performing BKL to create larger problems,
especially when running VFS-heavy benchmarks with a lot of contention.
Ingo Molnar tracked down the problem in the
new semaphore code. In short: the new semaphore code is too fair for its
own good. When a semaphore is released, and there is another thread
waiting for it, the semaphore is handed over to the new thread (which is
then made runnable) at that time. This approach ensures that threads
obtain the semaphore in something close to the order in which they asked
for it.
The problem is that fairness can be expensive. The thread waiting for the
semaphore may be on another processor, its cache could be cold, and it
might be at a low enough priority that it will not even begin running for
some time. Meanwhile, another thread may request the semaphore, but
it will get put at the end of the queue behind the new owner, which may not
be running yet. The result is a certain amount of dead time where no
running thread holds the semaphore. And, in fact, Yanmin's
experience with the AIM benchmark showed this: his system was running idle
almost 50% of the time.
The solution is to bring in a technique from the older semaphore code: lock
stealing. If a thread tries to acquire a semaphore, and that semaphore is
available, that thread gets it regardless of whether a different thread is
patiently waiting in the queue. Or, in other words, the thread at the head
of the queue only gets the semaphore once it starts running and actually
claims it; if it's too slow, somebody else might get there first. In human
interactions, this sort of behavior is considered impolite (in some
cultures, at least), though it is far from unknown. In a multiprocessor
computer, though, it makes the difference between acceptable and
unacceptable performance - even a thread which gets its lock stolen will
benefit in the long run.
Interestingly, the patch which implements this change was merged into the
mainline, then reverted before 2.6.26-rc2 came out. The initial reason for the
revert was that the patch broke semaphores in other situations; for some
usage patterns, the semaphore code could fail to wake a thread when the
semaphore became available. This bug could certainly have been fixed, but
it appears that things will not go that way - there is a bit more going on
here.
What is happening instead is that Linus has committed a patch which simply
turns the BKL into a spinlock. By shorting out the semaphore code
entirely, this patch fixes the AIM regression while leaving the slow (but
fair) semaphore code in place. This change also makes the BKL
non-preemptible, which will not be entirely good news for those who are
concerned with latency issues - especially the real time tree.
The reasoning behind this course of action would appear to be this: both
semaphores and the BKL are old, deprecated mechanisms which are slated for
minimization (semaphores) or outright removal (BKL) in the near future.
Given that, it is not worth adding more complexity back into the semaphore
code, which was dramatically simplified for 2.6.26. And, it seems, Linus
is happy with a sub-optimal BKL:
Quite frankly, maybe we _need_ to have a bad BKL for those to ever
get fixed. As it was, people worked on trying to make the BKL
behave better, and it was a failure. Rather than spend the effort
on trying to make it work better (at a horrible cost), why not just
say "Hell no - if you have issues with it, you need to work with
people to get rid of the BKL rather than cluge around it".
So the end result of all this may be a reinvigoration of the effort to
remove the big kernel lock from the kernel. It still is not something
which is likely to happen over the next few kernel releases: there is a lot
of code which can subtly depend on BKL semantics, and there is no way to be
sure that it is safe without auditing it in detail. And that is not a
small job. Alan Cox has been reworking the TTY code for some time, but he
has some ground to cover yet - and the TTY code is only part of the
problem. So the BKL will probably be with us for a while yet.
Comments (5 posted)
By Jonathan Corbet
May 14, 2008
Memory management changes (for the x86 architecture) have caused surprises
for a few kernel developers. As these issues have been worked out, it has
become clear that not everybody understands how memory caching works on
contemporary systems. In an attempt to bring some clarity, Arjan van de
Ven wrote up some notes and sent them to your editor, who has now worked
them into this article. Thanks to Arjan for putting this information
together - all the useful stuff found below came from him.
As readers of What every
programmer should know about memory will have learned, the caching
mechanisms used by contemporary processors are crucial to the performance
of the system. Memory is slow; without caching, systems will run
much slower. There are situations where caching is detrimental,
though, so the hardware must provide mechanisms which allow for control
over caching with specific ranges of memory. With 2.6.26, Linux is
(rather belatedly) starting to catch up with the current state of the art
on x86 hardware; that, in turn, is bringing some changes to how caching is
managed.
It is good to start with a definition of the terms being used. If a piece
of memory is cachable, that means:
- The processor is allowed to read that memory into its cache at
any time. It may choose to do so regardless of whether the
currently-executing program is interested in reading that memory.
Reads of cachable memory can happen in response to speculative
execution, explicit prefetching, or a number of other reasons. The
CPU can then hold the contents of this memory in its cache for an
arbitrary period of time, subject only to an explicit request to
release the cache line from elsewhere in the system.
- The CPU is allowed to write the contents of its cache back to memory
at any time, again regardless of what any running program might choose
to do. Memory which has never been changed by the program might be
rewritten, or writes done by a program may be held in the cache for an
arbitrary period of time. The CPU need not have read an entire cache
line before writing that line back.
What this all means is that, if the processor sees a memory range as
cachable, it must be possible to (almost) entirely disconnect the
operations on the underlying device from what the program thinks it is
doing. Cachable memory must always be readable without side effects.
Writes have to be idempotent (writing the same value to the same location
several times has the same effect as writing it once),
ordering-independent, and size-independent. There must be no side effects
from writing back a value which was read from the same location. In
practice, this means that what sits behind a cachable address range must be
normal memory - though there are some other cases.
If, instead, an address range is uncachable, every read and write
operation generated by software will go directly to the underlying device,
bypassing the CPU's caches. The one exception is with writes to I/O memory
on a PCI bus; in this case, the PCI hardware is allowed to buffer and
combine write operations. Writes are not reordered with reads, though,
which is why a read from I/O memory is often used in drivers for PCI
devices as a sort of write barrier.
A variant form of uncached access is write combining. For read
operations, write-combined memory is the same as uncachable memory. The
hardware is, however, allowed to buffer consecutive write operations and
execute them as a smaller series of larger I/O operations. The main user
of this mode is video memory, which often sees sequential writes and which
offers significant performance improvements when those writes are
combined.
The important thing is to use the right cache mode for each memory range.
Failure to make ordinary memory cachable can lead to terrible performance.
Enabling caching on I/O memory can cause strange hardware behavior,
corrupted data, and is probably implicated in global warming. So the CPU
and the hardware behind a given address must agree on caching.
Traditionally, caching has been controlled with a CPU feature called
"memory type range registers," or MTRRs. Each processor has a finite set
of MTRRs, each of which controls a range of the physical address space.
The BIOS sets up at least some of the MTRRs before booting the operating
system; some others may be available for tweaking later on. But MTRRs are
somewhat inflexible, subject to the BIOS not being buggy, and are limited
in number.
In more recent times, CPU vendors have added a concept known as "page
attribute tables," or PAT. PAT, essentially, is a set of bits stored in
the page table entries which control how the CPU does caching for each
page. The PAT bits are more flexible and, since they live in the page
table entries, they are difficult to run out of. They are also completely
under the control of the operating system instead of the BIOS. The only
problem is that Linux doesn't support PAT on the x86 architecture, despite
the fact that the hardware has had this capability for some years.
The lack of PAT support is due to a few things, not the least of which has
been problematic support on the hardware side. Processors have stabilized
over the years, though, to the point that it is possible to create a
reasonable whitelist of CPU families known to actually work with PAT.
There have also been challenges on the kernel side; when multiple page
table entries refer to the same physical page (a common occurrence), all of
the page table entries must use the same caching mode. Even a brief window
with inconsistent caching can be enough to bring down the system. But the
code on the kernel side has finally been worked into shape; as a
result, PAT support was merged for the 2.6.26 kernel. Your editor is
typing this on a PAT-enabled system with no ill effects - so far.
On most systems, the BIOS will set MTRRs so that regular memory is cachable
and I/O memory is not. The processor can then complicate the situation
with the PAT bits. In general, when there is a conflict between the MTRR
and PAT settings, the setting with the lower level of caching prevails.
The one exception appears to be when one says "uncachable" and the other
enables write combining; in that case, write combining will be used. So
the CPU, through the management of the PAT bits, can make a couple of
effective changes:
- Uncached memory can have write combining turned on. As noted above,
this mode is most useful for video memory.
- Normal memory can be made uncached. This mode can also be useful for
video memory; in this case, though, the memory involved is normal RAM
which is also accessed by the video card.
Linux device drivers must map I/O memory before accessing it; the function
which performs this task is ioremap(). Traditionally,
ioremap() made no specific changes to the cachability of the
remapped range; it just took whatever the BIOS had set up. In practice,
that meant that I/O memory would be uncachable, which is almost always what
the driver writer wanted. There is a separate ioremap_nocache()
variant for cases where the author wants to be explicit, but use of that
interface has always been rare.
In 2.6.26, ioremap() was changed to map the memory uncached at all
times. That created a couple of surprises in cases where, as it happens,
the memory range involved had been cachable before and that was what the
code needed. As of 2.6.26, such code will break until the call is changed
to use the new ioremap_cache() interface instead. There is also a
ioremap_wc() function for cases where a write-combined mapping is
needed.
It is also possible to manipulate the PAT entries for an address range
explicitly:
int set_memory_uc(unsigned long addr, int numpages);
int set_memory_wc(unsigned long addr, int numpages);
int set_memory_wb(unsigned long addr, int numpages);
These functions will set the given pages to uncachable, write-combining, or
writeback (cachable), respectively. Needless to say, anybody using these
functions should have a firm grasp of exactly what they are doing or
unpleasant results are certain.
Comments (12 posted)
By Jake Edge
May 14, 2008
Getting interfaces right is a hard, but necessary, task, especially when
that interface
has to be supported "forever". Such is the case with the system call
interface that
the kernel presents to user space, so adding features to it must be done
very carefully. Even so, when Ulrich Drepper set out to remove a hole that could lead to a
race condition, he probably did not expect all the different paths that
would need to be tried before closing in on an acceptable solution.
The problem stems from wanting to be able to create file descriptors with
new properties—things like close-on-exec, non-blocking, or
non-sequential descriptors. Those features were not considered when the
system call interface was developed. After all, many of those system calls
are essentially unchanged from early Unix implementations of the 1970s.
The open() call is the most obvious way to request a file descriptor
from the kernel, but there are plenty of others.
In fact, open() is one of the easiest to extend with new
features because of its flags argument. Calls like
pipe(), socket(), accept(),
epoll_create() and others produce file descriptors as well, but don't
have a flags argument available. Something different would have
to be done to support additional features for the file descriptors
resulting from those calls.
The close-on-exec functionality is especially important to close a security
hole for multi-threaded programs. Currently, programs can use
fcntl() to change an open file descriptor to have the close-on-exec property,
but there is always a window in time between the creation of the descriptor and
changing its behavior. Another thread could do an exec() call in
that window, leaking a potentially sensitive file descriptor into the newly
run program. Closing that window requires an in-kernel solution.
Back in June of last year, after some false starts, Linus Torvalds suggested adding an
indirect() system call, as a way to pass flags to system calls
that don't currently support them. The indirect() call would
apply a set of flags to the invocation of an existing system call. This
would allow existing calls to remain unchanged, with only new uses calling
indirect(). User space programs would be unlikely to call the new
function directly, instead they would call glibc functions that
handled any necessary
indirect() calls.
Davide Libenzi created a sys_indirect() patch in July, but Drepper
saw it as "more complex than warranted". So Drepper created his own "trivial"
implementation, that was described on this page in
November. It was met with a less than enthusiastic response on
linux-kernel for being, amongst other things, an exceedingly ugly
interface.
The alternative to sys_indirect() is to create a new system call
for each existing call that needed a flags argument. This was seen as
messy by some, including Torvalds, leading some kernel hackers into looking for alternatives. The indirect approach
also had some other potential benefits, though, because it was seen as something that
could be used by syslets to
allow asynchronous system calls. No decision seemed to be forthcoming,
leading Drepper to ask Torvalds for one:
Will you please make a decision regarding sys_indirect? There has been
no other proposal so the alternative is to add more syscalls.
To bolster his argument that sys_indirect() was the way to go,
Drepper also created a patch to add some of
the required system calls. He started with the socket()
family, by adding socket4(), socketpair5(), and
accept4()—tacking the number of arguments onto the function
name a la wait3() and wait4(). Drepper's intent may not
have been well served by choosing those calls as Alan Cox immediately noted that the type argument could be
overloaded:
Given we will never have 2^32 socket types, and in a sense this is part
of the type why not just use
socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, ...)
that would be far far cleaner, no new syscalls on the socket side at all.
Michael Kerrisk looked over the set of system calls that generate file
descriptors, categorizing them based on whether they needed a
flag argument added. He observed that roughly half of the file
descriptor producing calls need not change because they could either use an
overloading trick like the socket calls, the glibc API already added a
flags argument, or there were alternatives available to provide the same
functionality along with flags.
In response, Drepper made one last attempt to push the indirect approach,
saying:
Or we just add sys_indirect (which is also usable for other syscall
extensions, not just the CLOEXEC stuff) and let userlevel (i.e., me)
worry about adding new interfaces to libc. As you can see, for the more
recent interfaces like signalfd I have already added an additional
parameter so the number of interface changes would be reduced.
Even though the indirect approach has some good points, Torvalds liked the
approach advocated by Cox, saying:
Ok, I have to admit that I find this very appealing. It looks much
cleaner, but perhaps more importantly, it also looks both readable and
easier to use for the user-space programmer.
Ultimately, developers will only use these new interfaces if they can
easily test for the existence of the new code. Torvalds gives an example
of how that might be done using the O_NOATIME flag to
open(), which has only been available since 2.6.8. It is this
testability issue that makes him believe the flags-based approach is the
right one:
And that's the problem with anything that isn't flags-based. Once you do
new system calls, doing the above is really quite nasty. How do you
statically even test that you have a system call? Now you need to add a
whole autoconf thing for it existing, and when it does exist you still
need to test whether it works, and you can't even do it in the slow-path
like the above (which turns the failure into a fast-path without the
flag).
This new approach, with a scaled down number of new system calls rather
than adding a general-purpose system call extension mechanism like
sys_indirect(), is now being pursued by Drepper. In the explanatory patch at the start of the series,
he lays out which
of the system calls will require a new user space interface: paccept(), epoll_create2(),
dup3(),
pipe2(), and
inotify_init1(), as well as those that do not:
signalfd4(),
eventfd2(),
timerfd_create(),
socket(), and
socketpair().
Drepper has already made
several iterations of patches addressing most of the concerns expressed by
the kernel developers along the way. There have been some architecture
specific problems, but Drepper has been knocking those down as well. If no
further roadblocks appear, it would seem a likely candidate for inclusion
in 2.6.27.
Comments (9 posted)
Patches and updates
Kernel trees
Core kernel code
Device drivers
Filesystems and block I/O
Memory management
Architecture-specific
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>