The current development kernel is 2.5.31, released
by Linus on August 10. It
includes an ISDN update, more driverfs work, a JFS update, a lot of
ethernet driver updates, a number of ARM, Alpha, and SPARC64 updates, and
more. This tree also includes the "User-mode Linux preparation" patches,
which make various changes to core code needed by UML - but UML itself has
not yet been merged. The long-format changelog
is available for people wanting the details.
Linus's BitKeeper tree - which will become 2.5.32 - currently contains
Andrew Morton's controversial "printk from userspace" patch (to support
boot-time message logging), the pthreads-support patches from Ingo Molnar
(see below), more device model/driverfs work, a new realtime clock driver,
some USB updates, and the usual pile of fixes.
The latest 2.5 kernel status summary from
Guillaume Boissiere is dated August 14.
The current stable kernel is still 2.4.19. Marcelo released 2.4.20-pre2 on August 12; it
includes a big S/390 update, a ReiserFS update, a number of small VM
tweaks, some new netfilter modules, the "block I/O from high memory" patch,
a set of NFS updates, and a very long list of other fixes and updates.
The current prepatch from Alan Cox is 2.4.20-pre2-ac2; the main item of interest in
this patch is the merging of LVM2, the new Linux volume manager.
Kernel development news
The Linux kernel has long been criticized for its thread support. This
criticism is surprising to some, since the Linux clone()
call provides a great deal of flexibility in the creation of threads that
share resources with their parent process. But clone() is not
enough to allow Linux to fully support the POSIX thread (pthreads) standard
with good performance - especially for applications which create thousands
of threads.
And such applications do exist. A lot of kernel hackers dismiss highly
threaded applications as being poorly written - having more threads than
processors
on the system is almost always a loss from a performance point of view, and
truly robust thread programming is difficult. But Linux must support what
users want to do, or they will use a different system. This week has seen
the culmination of quite a bit of work aimed at improving the kernel's
basic thread support.
The push to improve thread support began some months ago with Rusty
Russell's "Futex" (fast user-space mutex) patch. Futexes allow the
implementation of pthread mutexes and condition variables in a fast manner
that only requires a system call when there is contention. This patch was
merged in 2.5.7 and has been refined since then.
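For readers who want to see why a futex-based lock only enters the kernel
under contention, here is a minimal sketch in C. It is not Rusty Russell's
patch or the glibc implementation; it follows a well-known three-state lock
design (0 unlocked, 1 locked, 2 locked with possible waiters) and calls
futex() directly through syscall(). The names sketch_mutex, sketch_lock(),
sketch_unlock(), and run_demo() are invented for illustration, and the futex
interface shown is the one found in current kernels, which may differ from
what was merged in 2.5.7.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical lock type: 0 = unlocked, 1 = locked, 2 = locked, maybe waiters */
typedef struct { _Atomic uint32_t state; } sketch_mutex;

/* Thin wrapper around the raw futex system call (no glibc wrapper exists). */
static long futex_call(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_lock(sketch_mutex *m)
{
    uint32_t c = 0;

    /* Fast path: uncontended lock taken with one atomic op, no system call. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;

    /* Slow path: mark the lock contended, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        futex_call(&m->state, FUTEX_WAIT, 2);  /* sleeps only if state is still 2 */
        c = atomic_exchange(&m->state, 2);
    }
}

static void sketch_unlock(sketch_mutex *m)
{
    /* Fast path: old value 1 means nobody was waiting, so no system call. */
    if (atomic_fetch_sub(&m->state, 1) != 1) {
        atomic_store(&m->state, 0);
        futex_call(&m->state, FUTEX_WAKE, 1);  /* wake one sleeper */
    }
}

/* Demonstration: several threads bump a shared counter under the lock. */
static sketch_mutex demo_lock;
static long demo_counter;

static void *demo_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sketch_lock(&demo_lock);
        demo_counter++;
        sketch_unlock(&demo_lock);
    }
    return NULL;
}

static long run_demo(int nthreads)
{
    pthread_t tids[16];
    int rc;

    assert(nthreads > 0 && nthreads <= 16);
    demo_counter = 0;
    for (int i = 0; i < nthreads; i++) {
        rc = pthread_create(&tids[i], NULL, demo_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < nthreads; i++) {
        rc = pthread_join(tids[i], NULL);
        assert(rc == 0);
    }
    (void)rc;
    return demo_counter;
}
```

Calling run_demo(4) should leave the counter at exactly four times the
per-thread iteration count if the lock is doing its job. The point of the
three states is that unlock can tell, without asking the kernel, whether
anybody might be asleep.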
More recently, Ingo Molnar has been working on thread support issues. His
first thread-local storage (TLS) patch was
posted on July 25; it was merged in 2.5.29 and is still being hacked
upon. The purpose of TLS, of course, is to give each thread access to a
region of memory which is not shared with all other threads. Ingo's
patch, which is implemented only for the x86 architecture, supports TLS
with the following changes:
- Doing thread-local storage right on the x86 requires using the segment
mechanism. The patch sets aside a few entries in the processor's
global descriptor table (GDT) to implement the TLS segments. The
most recent patch as of this writing (tls-2.5.31-D9) creates three segments: one
for glibc (and, thus, pthreads), one for Wine, and one unassigned.
- A new set_thread_area() system call allows library code to
set up thread-local storage using one of the TLS segments.
- At every context switch, the kernel copies the new process's TLS
entries into the appropriate part of the GDT.
With these changes, each thread can have its own, transparent, local
storage area. There was just one last complication: the x86 GDT was global
and shared on SMP systems. So Ingo had to create a separate GDT for each
processor, with the interesting result that context switches got a little
faster.
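What thread-local storage looks like from the application side can be
sketched in a few lines of C. This example does not call set_thread_area()
itself; it uses the C11 _Thread_local storage class and leaves the segment
setup to the toolchain and thread library, which is an assumption about how
a reader would normally consume the feature. The names tls_value,
tls_worker(), and tls_demo() are made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One copy of this variable exists per thread. */
static _Thread_local int tls_value;

/* Ordinary globals, shared by all threads, used to report what the worker saw. */
static int worker_initial;
static int worker_distinct;

static void *tls_worker(void *arg)
{
    /* A new thread starts with a fresh, zero-initialized copy. */
    worker_initial = tls_value;

    /* The worker's copy lives at a different address than the creator's. */
    worker_distinct = (arg != (void *)&tls_value);

    /* This store touches only the worker's copy. */
    tls_value = 7;
    return NULL;
}

/* Returns the creating thread's value after the worker has come and gone. */
static int tls_demo(void)
{
    pthread_t t;
    int rc;

    tls_value = 42;
    rc = pthread_create(&t, NULL, tls_worker, &tls_value);
    assert(rc == 0);
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    (void)rc;

    return tls_value;   /* unchanged by the worker's store */
}
```

The creating thread still sees 42 after the worker has written 7, since
each thread was working with its own copy all along.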
Next problem: what if you want to create lots of threads in a quick and
safe manner? The classic Unix fork() system call has a problem in
that the newly-created child process could exit before the process ID is
ever returned to the parent; if the parent loses this race, it can be left
in a position where it no longer knows what is going on with its children.
This problem can be worked around, but the workaround involves more system
calls, which slow down thread creation.
Ingo's solution comes in the form of a couple
of new flags to the clone() system call. The pthread library can
throw in CLONE_SETTID, which causes the process ID of the new
thread to be written back to a variable in the parent's address space
before the new thread begins running. There is also a
CLONE_SETTLS flag which causes the equivalent of a
set_thread_area() call to happen as well. The result is a robust
way of creating new threads with a single system call.
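The write-back behavior can be tried on a current system with a short C
sketch. Two assumptions need stating up front: the flag named CLONE_SETTID
in this patch is not available under that name in today's headers, so the
example uses CLONE_PARENT_SETTID, which appears to be its descendant; and
the example goes through the glibc clone() wrapper rather than the raw
system call. The names child_fn() and settid_demo() are invented here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* The child exits immediately: the worst case for the race described above. */
static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 1 if the kernel had stored the child's ID by the time clone() returned. */
static int settid_demo(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    pid_t tid_slot = 0;     /* the kernel writes the new thread's ID here */
    pid_t pid, reaped;
    int ok;

    assert(stack != NULL);

    /* The stack grows down on x86, so clone() is handed the top of the block. */
    pid = clone(child_fn, stack + stack_size,
                CLONE_VM | CLONE_PARENT_SETTID | SIGCHLD,
                NULL, &tid_slot);
    assert(pid > 0);

    /* No extra system call was needed to learn the ID, even if the
       child has already exited by now. */
    ok = (tid_slot == pid);

    reaped = waitpid(pid, NULL, 0);
    assert(reaped == pid);
    (void)reaped;

    free(stack);
    return ok;
}
```

settid_demo() should return 1 regardless of how quickly the child exits,
because the kernel fills in the slot before the parent gets control back.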
Finally, the pthreads code has a couple of issues to deal with when threads
die. The stack used by the thread must be deallocated - and the dying
thread cannot do that itself. With enough system calls, pthreads handles
that now, but thread exit should really be a lightweight event, and a
system call-heavy solution defeats that purpose.
Much of the overhead can be eliminated if the thread library can be told
about thread exit without the usual SIGCHLD signal - signals are
expensive. The new pthreads code can do that with the futex mechanism -
almost. It is still difficult to know, without a signal, when the thread
has truly finished using its stack, so that said stack can be freed. If
the stack gets freed before the thread is done with it, the result is a big
mess and a new interest on the developer's part in Windows threading
packages; this outcome needs to be avoided.
Ingo's first attempt to solve
this problem was through the addition of an exit_free() system
call, which would simply write a special value in the parent's address
space to indicate that the stack could be freed. Linus, however, called
this patch "too ugly to live." After some
discussion, the solution that emerged was to
add another clone() flag: CLONE_RELEASE_VM. If a thread
is created with that flag, a word is set aside at the top of the thread's
stack. When the thread releases its current virtual memory - by exiting,
or by execing another program - that word is written with a flag
value. The parent can see that value and know that the stack can be
freed.
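The "wait for a word to change, then free the stack" pattern can be sketched
in C against a current kernel. CLONE_RELEASE_VM as described here is not
something today's headers offer, so this example substitutes
CLONE_CHILD_CLEARTID, a related mechanism in which the kernel zeroes a
caller-chosen word and issues a futex wakeup when the child lets go of the
shared address space. Treat that substitution as an assumption, not as the
patch discussed above. The names tid_word, exiting_fn(), and cleartid_demo()
are made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set to the child's ID at creation; zeroed by the kernel at release time. */
static pid_t tid_word;

static int exiting_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 1 once the kernel has signalled that the child's stack is unused. */
static int cleartid_demo(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    pid_t pid, seen, reaped;

    assert(stack != NULL);

    /* CLONE_VM matters: the word is only cleared when the address space
       is shared with somebody who could look at it. */
    pid = clone(exiting_fn, stack + stack_size,
                CLONE_VM | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD,
                NULL, &tid_word, NULL, &tid_word);
    assert(pid > 0);

    /* Sleep in the kernel until the word drops to zero; no signal involved.
       FUTEX_WAIT returns at once if the word no longer holds 'seen'. */
    while ((seen = __atomic_load_n(&tid_word, __ATOMIC_ACQUIRE)) != 0)
        syscall(SYS_futex, &tid_word, FUTEX_WAIT, seen, NULL, NULL, 0);

    /* The child is done with its memory, so the stack can go. */
    free(stack);

    /* Collect the exit status so no zombie is left behind. */
    reaped = waitpid(pid, NULL, 0);
    assert(reaped == pid);
    (void)reaped;

    return __atomic_load_n(&tid_word, __ATOMIC_ACQUIRE) == 0;
}
```

The parent never polls and never takes a signal; it simply sleeps on the
word until the kernel clears it.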
Finally, Ingo has posted yet another patch
implementing the CLONE_DETACHED flag. If a thread is created with
that flag, no signal is sent to the parent process when the thread exits.
This solution is faster than having the parent simply ignore
SIGCHLD, and also does not require the parent to do without
notification for all of its children.
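The idea of silencing exit notification for one child, rather than for all
of them, can be tried in C on a current kernel. Current kernel headers, as
far as we can tell, mark CLONE_DETACHED as unused, so this sketch does not
use it; instead it relies on the exit-signal field in the low bits of the
clone() flags, where a value of zero means the parent gets no signal. That
is a substitute technique, chosen as an assumption about the closest
available analog. The names on_sigchld(), quiet_fn(), and count_sigchld()
are invented.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t sigchld_seen;

static void on_sigchld(int sig)
{
    (void)sig;
    sigchld_seen++;
}

static int quiet_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Creates one child with the given exit signal (0 = none), reaps it, and
   returns how many SIGCHLD deliveries the parent observed for it. */
static int count_sigchld(int exit_signal)
{
    const size_t stack_size = 64 * 1024;
    struct sigaction sa, old;
    char *stack;
    pid_t pid, reaped;
    int rc, seen;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sa.sa_flags = SA_RESTART;       /* keep waitpid() from failing with EINTR */
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGCHLD, &sa, &old);
    assert(rc == 0);
    sigchld_seen = 0;

    stack = malloc(stack_size);
    assert(stack != NULL);

    /* The low byte of the flags is the signal sent to the parent on exit. */
    pid = clone(quiet_fn, stack + stack_size, CLONE_VM | exit_signal, NULL);
    assert(pid > 0);

    /* __WALL lets waitpid() collect children whatever their exit signal. */
    reaped = waitpid(pid, NULL, __WALL);
    assert(reaped == pid);
    free(stack);

    seen = sigchld_seen;
    rc = sigaction(SIGCHLD, &old, NULL);
    assert(rc == 0);
    (void)rc;
    (void)reaped;
    return seen;
}
```

count_sigchld(0) creates a child whose exit goes unannounced, while
count_sigchld(SIGCHLD) behaves in the traditional way; both can be used
from the same parent, which is the per-child property the patch is after.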
The other half of all this work, of course, is a new pthreads library that
actually uses all of these new features. The code is in progress and will
be part of a future glibc release. Then, maybe, people will stop
complaining about thread support in Linux.
Linux VM hackers are engaged in ongoing discussions on both large page
support (covered last week) and improving the
performance of the new reverse mapping mechanism. That conversation slowed
down, however, when Alan Cox pointed out that
a number of the techniques being discussed are covered by SGI patents. In
fact, a closer look
by Daniel Phillips shows
that a number of existing Linux technologies, including reverse mapping in
general and the buddy allocator, are covered by these patents. This is a
problem, he said, that we can't ignore.
That was Linus's cue to jump in with his
policy on software patents and kernel code:
I do not look up any patents on _principle_, because (a) it's a
horrible waste of time and (b) I don't want to know.
The fact is, technical people are better off not looking at
patents. If you don't know what they cover and where they are, you
won't be knowingly infringing on them. If somebody sues you, you
change the algorithm or you just hire a hit-man to whack the stupid
git.
Linus followed up with a note that the above
"may not be legally tenable advice." But he sticks by his point that,
anymore, it's impossible to write an interesting program without running
into somebody's patent. Rather than worry about it, it's better to just
proceed and deal with any problems as they emerge.
This is probably the only rational approach; otherwise kernel hackers would
go nuts trying to find and avoid all of the applicable patents. It's
probably only a matter of time, though, until one of these patents bites
the kernel in a big way - at least in the U.S. Those are the times we live
in.
The integration of an NFS version 4 implementation into the Linux kernel
got one step closer this week when Kendrick Smith announced
the availability of a set of patches
for 2.5.31. These patches are not for casual users quite yet - there are
38 of them, they only implement a small part of the NFSv4 protocol, and a
fair amount of work is needed to get it all going. The purpose of this set
of patches is to get a conversation started toward the merging of NFSv4
into the kernel. Once the minimal code is in, the rest of the protocol
(which works in a 2.4 version of the patch) can be ported forward and
merged.
Patches and updates
- Marc-Christian Petersen: WOLK v3.5 FINAL, Codename 'Fin' alias 'Birthday Release'. "Also I am a kind of happy that this is the last release of the
'Working Overloaded Linux Kernel', because I don't have the time that WOLK
needs for further good development."
(August 14, 2002)
Core kernel code
- john stultz: tsc-disable_B9. "This patch enables a workaround for multi-node NUMA
systems that are experiencing gettimeofday returning "old" time values."
(August 9, 2002)
- Erich Focht: ACPI_NUMA for SRAT/SLIT table parsing. "The attached patch implements the parsing of the ACPI SRAT (Static
Resource Affinity Table) and SLIT (System Locality Information Table)
which are meanwhile the standard for providing NUMA information on
IA64 platforms and started to spread on IA32, too."
(August 12, 2002)
Filesystems and block I/O
- H. Peter Anvin: klibc development release. "klibc is a tiny C library subset intended to be integrated into the
kernel source tree and being used for initramfs stuff."
(August 9, 2002)
Page editor: Jonathan Corbet