Making Linux safe for pthreads
[Posted August 14, 2002 by corbet]
The Linux kernel has long been criticized for its thread support. This
criticism is surprising to some, since the Linux clone() system call
provides a great deal of flexibility in the creation of threads that
share resources with their parent process. But clone() is not
enough to allow Linux to fully support the POSIX thread (pthreads) standard
with good performance - especially for applications which create thousands
of threads.
And such applications do exist. A lot of kernel hackers dismiss highly
threaded applications as being poorly written - having more threads than
processors
on the system is almost always a loss from a performance point of view, and
truly robust thread programming is difficult. But Linux must support what
users want to do, or they will use a different system. This week has seen
the culmination of quite a bit of work aimed at improving the kernel's
basic thread support.
The push to improve thread support began some months ago with Rusty
Russell's "Futex" (fast user-space mutex) patch. Futexes allow the
implementation of pthread mutexes and condition variables in a fast manner
that only requires a system call when there is contention. This patch was
merged in 2.5.7 and has been refined since then.
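The idea of taking the lock entirely in user space when nobody else wants it can be sketched in a few lines of C. This is only an illustration, written against the futex() interface found in later mainline kernels (FUTEX_WAIT and FUTEX_WAKE from <linux/futex.h>), not the 2.5.7 patch itself. All of the sketch_* names are hypothetical, and the three-state encoding is one common way of building a mutex on a futex; it should not be read as what glibc does.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical mutex for illustration. State values:
 * 0 = unlocked, 1 = locked, 2 = locked and somebody may be waiting.
 * Assumes atomic_int is a plain 32-bit word, as futex() requires. */
typedef struct { atomic_int state; } sketch_mutex;

/* Counts trips into the kernel, purely so the fast path can be observed. */
static atomic_long sketch_futex_calls;

static long sketch_futex(atomic_int *addr, int op, int val)
{
    atomic_fetch_add(&sketch_futex_calls, 1);
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_lock(sketch_mutex *m)
{
    int c = 0;

    /* Fast path: uncontended 0 -> 1 transition, no system call. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;

    /* Slow path: mark the lock contended, then sleep in the kernel
     * until the holder wakes us and we manage to grab the lock. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        sketch_futex(&m->state, FUTEX_WAIT, 2);
        c = atomic_exchange(&m->state, 2);
    }
}

static void sketch_unlock(sketch_mutex *m)
{
    /* Only enter the kernel if a waiter may exist (state was 2). */
    if (atomic_exchange(&m->state, 0) == 2)
        sketch_futex(&m->state, FUTEX_WAKE, 1);
}
```

A thread that locks and unlocks without competition never calls sketch_futex() at all; the kernel is involved only when the state word says another thread may be waiting.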
More recently, Ingo Molnar has been working on thread support issues. His
first thread-local storage (TLS) patch was
posted on July 25; it was merged in 2.5.29 and is still being hacked
upon. The purpose of TLS, of course, is to give each thread access to a
region of memory which is not shared with any other thread. Ingo's
patch, which is implemented only for the x86 architecture, supports TLS
with the following changes:
- Doing thread-local storage right on the x86 requires using the segment
mechanism. The patch sets aside a few entries in the processor's
global descriptor table (GDT) to implement the TLS segments. The
most recent patch as of this writing (tls-2.5.31-D9) creates three segments: one
for glibc (and, thus, pthreads), one for Wine, and one unassigned.
- A new set_thread_area() system call allows library code to
set up thread-local storage using one of the TLS segments.
- At every context switch, the kernel copies the new process's TLS
entries into the appropriate part of the GDT.
With these changes, each thread can have its own, transparent, local
storage area. There was just one last complication: the x86 GDT was global
and shared on SMP systems. So Ingo had to create a separate GDT for each
processor, with the interesting result that context switches got a little
faster.
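The bookkeeping described above can be modeled in ordinary user-space C. What follows is a toy model and not kernel code: every sketch_* name is invented for the illustration, the GDT slot index and table size are assumptions, and descriptors are treated as opaque 8-byte values. It shows only the shape of the scheme: each thread carries its own three TLS descriptors, and a context switch copies them into the GDT of the processor the thread is about to run on.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_TLS_ENTRIES 3   /* glibc, Wine, one unassigned (per the text) */
#define SKETCH_TLS_FIRST   6   /* assumed GDT index of the first TLS slot */
#define SKETCH_GDT_SIZE    32  /* assumed table size for this model */
#define SKETCH_NR_CPUS     2   /* assumed processor count for this model */

typedef uint64_t sketch_desc;  /* an x86 segment descriptor is 8 bytes */

/* Per-thread state: the descriptors this thread has asked for. */
struct sketch_thread {
    sketch_desc tls[SKETCH_TLS_ENTRIES];
};

/* One GDT per processor, so threads running on different CPUs at the
 * same time can each see their own TLS segments. */
static sketch_desc sketch_gdt[SKETCH_NR_CPUS][SKETCH_GDT_SIZE];

/* Stand-in for set_thread_area(): record a descriptor for a thread. */
static void sketch_set_thread_area(struct sketch_thread *t, int slot,
                                   sketch_desc d)
{
    t->tls[slot] = d;
}

/* Stand-in for the context-switch step: install the incoming thread's
 * TLS descriptors into this CPU's GDT. */
static void sketch_switch_to(int cpu, const struct sketch_thread *next)
{
    memcpy(&sketch_gdt[cpu][SKETCH_TLS_FIRST], next->tls, sizeof next->tls);
}
```

With a single shared table, two threads running at once on different processors would overwrite each other's slots; the per-CPU array in the model is what avoids that.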
Next problem: what if you want to create lots of threads in a quick and
safe manner? The classic Unix fork() system call has a problem in
that the newly-created child process could exit before the process ID is
ever returned to the parent; if the parent loses this race, it can be left
in a position where it no longer knows what is going on with its children.
This problem can be worked around, but the workaround involves more system
calls, which slow down thread creation.
Ingo's solution comes in the form of a couple
of new flags to the clone() system call. The pthread library can
throw in CLONE_SETTID, which causes the process ID of the new
thread to be written back to a variable in the parent's address space
before the new thread begins running. There is also a
CLONE_SETTLS flag which causes the equivalent of a
set_thread_area() call to happen as well. The result is a robust
way of creating new threads with a single system call.
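A rough sketch of what this looks like from user space follows. It is written against the interface documented in clone(2) for later mainline kernels, where the flag described here as CLONE_SETTID appears under the name CLONE_PARENT_SETTID; the glibc clone() wrapper is used in place of a raw system call. The sketch_* names and the stack size are arbitrary choices for the example, and this is not how the pthread library itself is written.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define SKETCH_STACK_SIZE (64 * 1024)  /* arbitrary size for the example */

/* The child exits at once: the worst case for the race described above. */
static int sketch_child(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 1 if the ID written by the kernel matches clone()'s return
 * value, 0 if it does not, and -1 if the child could not be created. */
static int sketch_spawn(void)
{
    pid_t tid_slot = 0;
    char *stack = malloc(SKETCH_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows down on x86, so pass the top of the allocation.
     * CLONE_PARENT_SETTID asks the kernel to store the new ID in
     * tid_slot before the child gets a chance to run. */
    pid_t ret = clone(sketch_child, stack + SKETCH_STACK_SIZE,
                      CLONE_VM | CLONE_PARENT_SETTID | SIGCHLD,
                      NULL, &tid_slot);
    if (ret == -1) {
        free(stack);
        return -1;
    }

    int ok = (tid_slot == ret);
    waitpid(ret, NULL, 0);   /* reap the child before freeing its stack */
    free(stack);
    return ok;
}
```

Because the kernel fills in tid_slot as part of the clone() call, the parent's records are already correct even if the child has exited by the time clone() returns.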
Finally, the pthreads code has a couple of issues to deal with when threads
die. The stack used by the thread must be deallocated - and the dying
thread cannot do that itself, since it is still running on that stack. With enough system calls, pthreads handles
that now, but thread exit should really be a lightweight event, and a
system call-heavy solution defeats that purpose.
Much of the overhead can be eliminated if the thread library can be told
about thread exit without the usual SIGCHLD signal - signals are
expensive. The new pthreads code can do that with the futex mechanism -
almost. It is still difficult to know, without a signal, when the thread
has truly finished using its stack, so that said stack can be freed. If
the stack gets freed before the thread is done with it, the result is a big
mess and a new interest on the developer's part in Windows threading
packages; this outcome needs to be avoided.
Ingo's first attempt to solve
this problem was through the addition of an exit_free() system
call, which would simply write a special value in the parent's address
space to indicate that the stack could be freed. Linus, however, called
this patch "too ugly to live." After some
discussion, the solution that emerged was to
add another clone() flag: CLONE_RELEASE_VM. If a thread
is created with that flag, a word is set aside at the top of the thread's
stack. When the thread releases its current virtual memory - by exiting,
or by execing another program - that word is written with a flag
value. The parent can see that value and know that the stack can be
freed.
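CLONE_RELEASE_VM as described here does not appear in the headers of later kernels, so the following sketch substitutes CLONE_CHILD_CLEARTID, a flag documented in clone(2) that behaves along similar lines: when the child gives up its memory, the kernel zeroes a word chosen by the creator and does a futex wakeup on it. Treat it as an illustration of the "watch a word, then free the stack" pattern under that assumption, not as the patch discussed above. The sketch_* names and the stack size are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SKETCH_STACK_SIZE (64 * 1024)  /* arbitrary size for the example */

static int sketch_worker(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 0 once the child's stack has been freed, -1 on failure. */
static int sketch_run_and_reclaim(void)
{
    pid_t tid_word = 0;
    char *stack = malloc(SKETCH_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* CLONE_PARENT_SETTID makes tid_word nonzero before the child runs;
     * CLONE_CHILD_CLEARTID makes the kernel zero it (and wake any futex
     * waiter) when the child releases the shared address space. */
    pid_t child = clone(sketch_worker, stack + SKETCH_STACK_SIZE,
                        CLONE_VM | CLONE_PARENT_SETTID |
                        CLONE_CHILD_CLEARTID | SIGCHLD,
                        NULL, &tid_word, NULL, &tid_word);
    if (child == -1) {
        free(stack);
        return -1;
    }

    /* Sleep until the word reads zero. FUTEX_WAIT returns immediately
     * if the word no longer holds the value we saw, so no wakeup is lost. */
    for (;;) {
        pid_t seen = __atomic_load_n(&tid_word, __ATOMIC_ACQUIRE);
        if (seen == 0)
            break;
        syscall(SYS_futex, &tid_word, FUTEX_WAIT, seen, NULL, NULL, 0);
    }

    free(stack);              /* the child is done with its stack */
    waitpid(child, NULL, 0);  /* reap the exited child */
    return 0;
}
```

The parent never needs a signal to learn that the stack is free; it waits on the word itself and goes into the kernel only when the child is still running.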
Finally, Ingo has posted yet another patch
implementing the CLONE_DETACHED flag. If a thread is created with
that flag, no signal is sent to the parent process when the thread exits.
This solution is faster than having the parent simply ignore
SIGCHLD, and also does not require the parent to do without
notification for all of its children.
The other half of all this work, of course, is a new pthreads library that
actually uses all of these new features. The code is in progress and will
be part of a future glibc release. Then, maybe, people will stop
complaining about thread support in Linux.