By Jonathan Corbet
October 7, 2009
A "thread pool" is a common group of processes which can be called on to
perform work at some future time. The kernel does not lack for thread pool
implementations; indeed, there are more choices than one might like. Options
include
workqueues, the
slow work mechanism, and
asynchronous function calls -
not to mention various private thread pool implementations found elsewhere
in the kernel. It has long been thought that having just one thread pool
mechanism would be better, but nobody, so far, has managed to put together
a single implementation that everybody likes.
Of the mechanisms listed above, the most commonly used by far is
workqueues. A workqueue makes it easy for code to set aside work to be
done in process context at a future time, but workqueues are not without
their problems. There is a shared workqueue that all can use, but one
long-running task can create indefinite delays for others, so few
developers take advantage of it. Instead, the kernel has filled with
subsystem-specific workqueues, each of which contributes to the surfeit of
kernel threads running on contemporary systems. Workqueue threads contend
with each other for the CPU, causing more context switches than are really
necessary. It's discouragingly easy to create deadlocks with workqueues
when one task depends on work done by another. All told, workqueues -
despite a couple of major rewrites already - are in need of a bit of a face
lift.
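For reference, the interface being complained about looks roughly like the sketch below. The names are invented for illustration, and error handling and module boilerplate are trimmed; work queued this way runs later, in process context.

    #include <linux/module.h>
    #include <linux/workqueue.h>

    /* The work function runs later, in process context, so it may sleep. */
    static void example_work_fn(struct work_struct *work)
    {
        pr_info("deferred work running\n");
    }

    static DECLARE_WORK(example_work, example_work_fn);

    static int __init example_init(void)
    {
        /*
         * Hand the item to the shared, kernel-wide workqueue.  A subsystem
         * wanting its own threads would instead call create_workqueue()
         * and pass the result to queue_work().
         */
        schedule_work(&example_work);
        return 0;
    }
    module_init(example_init);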
Tejun Heo has provided that face lift in the form of his concurrency managed workqueues
patch. This 19-part series massively reworks the workqueue code,
addressing the shortcomings of the current workqueue subsystem. This
effort is clearly aimed at replacing the other thread pool implementations
in the kernel too, though that work is left for a later date.
Current workqueues have dedicated threads associated with them - a single
thread in some cases, one thread per CPU in others. The new workqueues do
away with that; there are no threads dedicated to any specific workqueue.
Instead, each CPU in the system has a pool of threads which is shared by
all workqueues. When a work item is enqueued, it will be passed to one of
those shared threads at the right time (as deemed by the workqueue code). One
interesting implication of this change is that tasks submitted to the same
workqueue on the same CPU may now execute concurrently - something which
does not happen with current workqueues.
One of the key features of the new code is its ability to manage
concurrency in general. In principle, every task submitted to a workqueue
could be allowed to run concurrently. Actually doing things that way would yield
poor results, though; those tasks would simply contend with each other,
causing more context switches, worse cache behavior, and generally worse
performance. What's really needed is a way to run exactly one workqueue
task at a time (avoiding contention) but to switch immediately to another
if that task blocks for any reason (avoiding processor idle time). Doing
this job correctly requires that the workqueue manager become a sort of
special-purpose scheduler.
As it happens, that's just how Tejun has implemented it. The workqueue
patch adds a new scheduler class which behaves very much like the normal
fair scheduler class. The workqueue class adds a couple of hooks which
call back into the workqueue code whenever a task running under that class
transitions between the blocked and runnable states. When the first
workqueue task is submitted, a thread running under the workqueue scheduler
class is created to execute it. As long as that task continues to run,
other tasks will wait. But as soon as the running task blocks on some
resource, the scheduler will notify the workqueue code and another thread
will be created to run the next task. The workqueue manager will create as
many threads as needed (up to a limit) to keep the CPU busy, but it tries
to only have one task actually running at any given time.
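The scheme can be pictured with a sketch along these lines; every name here is hypothetical, and this is an illustration of the idea rather than code from Tejun's patch:

    #include <linux/list.h>

    /* Hypothetical per-CPU state for the shared worker pool. */
    struct cpu_pool {
        int nr_running;                 /* workers currently running */
        struct list_head pending;       /* work items awaiting a worker */
    };

    static void wake_or_create_worker(struct cpu_pool *pool);  /* hypothetical */

    /* Conceptually called from the scheduler hook when a worker blocks. */
    static void worker_went_to_sleep(struct cpu_pool *pool)
    {
        /*
         * The CPU would otherwise sit idle while work is queued, so
         * bring in another worker to keep it busy.
         */
        if (--pool->nr_running == 0 && !list_empty(&pool->pending))
            wake_or_create_worker(pool);
    }

    /* Conceptually called when a blocked worker becomes runnable again. */
    static void worker_woke_up(struct cpu_pool *pool)
    {
        /* No new worker is started; the goal is one running task at a time. */
        pool->nr_running++;
    }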
Also new with Tejun's patch is the concept of "rescuer" threads. In a
tightly resource-constrained system, it may become impossible to create new
worker threads. But any existing threads may be waiting for the results of
other tasks which have not yet been executed. In that situation,
everything will stop cold. To deal with this problem, some special
"rescuer" threads are kept around. If attempts to create new workers fail
for a period of time, the rescuers will be summoned to execute tasks and,
hopefully, clear the logjam.
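Continuing the hypothetical sketch above (with an added rescuer task and a first_failure timestamp in the pool), the fallback might look something like this; again, this is illustrative rather than the patch's actual logic:

    #include <linux/kthread.h>
    #include <linux/jiffies.h>
    #include <linux/err.h>

    static int worker_fn(void *data);   /* hypothetical worker main loop */

    static void add_worker(struct cpu_pool *pool)
    {
        struct task_struct *t;

        t = kthread_create(worker_fn, pool, "workqueue-worker");
        if (!IS_ERR(t)) {
            wake_up_process(t);
            return;
        }

        /*
         * Thread creation can fail under memory pressure.  If it keeps
         * failing past a timeout, hand the backlog to the pre-allocated
         * rescuer thread so that progress can still be made.
         */
        if (time_after(jiffies, pool->first_failure + CREATE_TIMEOUT))
            wake_up_process(pool->rescuer);
    }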
The handling of CPU hotplugging is interesting. If a CPU is being
taken offline, the system needs to move all work off that CPU as quickly as
possible. To that end, the workqueue manager responds to a hot-unplug
notification by creating a special "trustee" manager on a CPU which is
sticking around. That trustee takes over responsibility for the workqueue
running on the doomed CPU, executing tasks until they are all gone and the
workqueue can be shut down. Meanwhile, the CPU can go offline without
waiting for the workqueue to drain.
These patches were generally welcomed, but there were some concerns
expressed. The
biggest complaint related to the special-purpose scheduling
class. The hooks were described as (1) not really scheduler-related,
and (2) potentially interesting beyond the workqueue code. For
example, Linus suggested that this kind of hook could be used
to implement the big kernel lock semantics, releasing the lock when a
process sleeps and reacquiring it on wakeup. The scheduler class will
probably go away in the next version of the patch; what remains to be seen
is what will replace it.
One idea which was suggested was to use the preemption notifier hooks which
are already in the kernel. These notifiers would have to become mandatory,
and some new callbacks would be required. Another possibility would be to
give in to
the inevitable future when perf events take over
the entire kernel. Event tracepoints are designed to provide callbacks at
specific points in the kernel; some already exist for most of the
interesting scheduler events. Using them in this context would mostly be a
matter of streamlining the perf events mechanism to handle this task
efficiently.
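For reference, the preemption notifiers already in the kernel (currently used only by KVM, and available only when CONFIG_PREEMPT_NOTIFIERS is set) are used along these lines; as noted, they would have to become unconditional and gain new callbacks to serve the workqueue code:

    #include <linux/preempt.h>
    #include <linux/sched.h>

    static void my_sched_in(struct preempt_notifier *notifier, int cpu)
    {
        /* The registered task is about to run on this CPU. */
    }

    static void my_sched_out(struct preempt_notifier *notifier,
                             struct task_struct *next)
    {
        /* The registered task is being scheduled out in favor of "next". */
    }

    static struct preempt_ops my_preempt_ops = {
        .sched_in  = my_sched_in,
        .sched_out = my_sched_out,
    };

    static struct preempt_notifier my_notifier;

    /* Registration must be done by the task that wants to be watched. */
    static void watch_current_task(void)
    {
        preempt_notifier_init(&my_notifier, &my_preempt_ops);
        preempt_notifier_register(&my_notifier);
    }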
Andrew Morton was concerned that the new
code would take away the ability for a specific workqueue user to modify
its worker tasks - changing their priority, say, or having them run under a
different UID. It turns out that, so far, only a couple of workqueues have
been modified in this way. The workqueue used by stop_machine()
puts its worker threads into the realtime scheduling class, allowing them
to monopolize the processors when needed; Tejun simply replaced that
workqueue with a set of dedicated kernel threads. The ACPI code had bound
a workqueue thread to CPU 0 because some operations corrupt the system
if run anywhere else; that case is easily handled with the existing
schedule_work_on() function. So it seems that, for now at least,
there is no need for non-default worker threads.
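The ACPI case, for example, can be handled with something like the following sketch (names invented for illustration):

    #include <linux/workqueue.h>

    /* Work which, on the affected systems, must run on CPU 0. */
    static void acpi_like_work_fn(struct work_struct *work)
    {
        /* ... perform the CPU-0-only operation ... */
    }

    static DECLARE_WORK(acpi_like_work, acpi_like_work_fn);

    static void queue_on_cpu0(void)
    {
        /* Queue on the shared workqueue, but force execution on CPU 0. */
        schedule_work_on(0, &acpi_like_work);
    }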
One remaining issue is that some subsystems use single-threaded workqueues
as a sort of synchronization mechanism; they expect tasks to complete in
the same order they were submitted. Global thread pools change that
behavior; Tejun has not yet said how he will solve that problem.
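As an illustration of the usage at issue, code which relies on that ordering typically looks something like this today (hypothetical names, error handling omitted):

    #include <linux/workqueue.h>

    static void step1_fn(struct work_struct *work) { /* must run first */ }
    static void step2_fn(struct work_struct *work) { /* must run second */ }

    static DECLARE_WORK(step1, step1_fn);
    static DECLARE_WORK(step2, step2_fn);

    static struct workqueue_struct *ordered_wq;

    static void submit_steps(void)
    {
        /*
         * A single-threaded workqueue has exactly one worker, so step1 is
         * guaranteed to finish before step2 begins.
         */
        ordered_wq = create_singlethread_workqueue("ordered");
        queue_work(ordered_wq, &step1);
        queue_work(ordered_wq, &step2);
    }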
It almost certainly will be solved, along with the other concerns. David
Howells, the creator of the slow work subsystem, thinks that the new workqueues could be a good
replacement. In summary, this change looks likely to be accepted, perhaps
as early as 2.6.33. Then we might finally have a single thread pool
implementation in the kernel.