Readers of the LWN Kernel Page have been aware of the intensive effort to
improve threading performance on Linux - at least from the kernel point of
view. Now, with the announcement
of version 0.1
of the Native POSIX Thread Library (NPTL), the user-space side of this
project has come into view. This article will take a look at the
technical and performance aspects of NPTL; the next will wander briefly
into the political issues.
Threads, of course, are processes (or something that looks like processes)
that share an address space and various other resources. Multi-threaded
applications can be tricky to write (they end up presenting the same sorts
of problems with race conditions that operating system programmers have to
deal with), but they can be a good solution to a number of programming
challenges. Your web browser, for example, likely keeps one thread around
to respond quickly to user events (mouse clicks and such) while another
thread downloads a web page and yet another one renders it onto the
screen. Java programs also tend to be highly threaded. Some applications can
create many thousands of threads; obviously, such applications can only be
reasonably run on systems with top-quality thread support.
Threading can be implemented entirely in user space, in kernel space, or a
combination of both. User-space threads are traditionally lighter weight,
since they do not require system calls and do not run in independent
processes. They can be tricky to make work in all situations, however, and
a pure user-space implementation cannot make good use of multiprocessor
systems, since all threads run within a single process. So most operating
systems provide some degree of kernel support for threading. Linux has
long had this support via the flexible clone() system call, which
allows a great deal of control over which resources are to be shared with
the new thread, and which are to be private.
Pure kernel-based threads are often perceived as being slow, however, since
the kernel scheduler must be invoked to switch between threads. So
conventional wisdom has often said that the best way to get good thread
performance is with the "M:N model." M:N is a hybrid approach, where M
user space threads run on each of N kernel threads. The multiple kernel
threads allow the application to use all processors on the system, while
keeping the performance benefits of doing (most) switching between
user-space threads. Many people have said that the key to fixing the (not
great) performance of Linux threads is adopting the M:N approach.
So it is interesting to note that NPTL has, instead, stayed with the 1:1
pure kernel thread model. NPTL authors Ulrich Drepper and Ingo Molnar took
a close look at the problem, and came to the conclusion that 1:1 was, in
the end, the more promising approach. Their reasoning can be found in the
NPTL white paper (available in PDF format);
the main points are:
- The kernel problems which slow down thread performance can be
eliminated; that has been the focus of Ingo Molnar's work.
- An M:N threading implementation requires two schedulers: the usual
kernel scheduler, and a user-space implementation. Getting the two to
work together for best performance is difficult - especially if you do
not want to impact performance for the system as a whole. Rather than
duplicate the scheduling function, the NPTL implementers felt it was
better to use the (highly optimized) kernel scheduler exclusively.
- Signal handling is the bane of many threading implementations, and
M:N implementations have an even harder time of it. The 1:1 model
leaves signal handling in the kernel.
- User-space threading implementations have to go to great lengths to
ensure that one thread, when it performs a blocking operation, does
not block all threads running under that process; this can be a
complex task. Kernel-based threads naturally schedule (and block)
independently of each other.
- Finally, the 1:1 implementation is generally simpler, since user space
need not duplicate functionality already found in the kernel.
Of course, all of that means little if the 1:1 model is unable to perform
up to expectations. The benchmarking process has just begun, but the
initial signs are encouraging. Ingo ran one
test where he started up and ran 100,000 concurrent threads - in less
than two seconds. This test would have taken about 15 minutes before
the threading improvements went into the kernel.
Ulrich Drepper has posted some other
benchmarks which mostly measure thread creation and shutdown time; some
of his results can be seen in the chart to the right.
Such a test should naturally favor the M:N model, since user-space thread
creation and destruction can be performed without any system calls. And,
in fact, the M:N Next Generation
POSIX Threading (NGPT) implementation beat standard Linux threads by
at least a factor of two in these tests. The NPTL library, however, beat
NGPT by about a factor of four. So the initial indications are that NPTL
can deliver the goods. And this is only the 0.1 release.