|| ||Ingo Molnar <mingo-AT-elte.hu> |
|| ||Stijn Devriendt <highguy-AT-gmail.com>,
Linus Torvalds <torvalds-AT-linux-foundation.org>,
Mike Galbraith <efault-AT-gmx.de>,
Peter Zijlstra <a.p.zijlstra-AT-chello.nl>,
Andrea Arcangeli <andrea |
|| ||[RFC] observe and act upon workload parallelism:
PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked
|| ||Mon, 16 Nov 2009 09:35:21 +0100|
|| ||peterz-AT-infradead.org, linux-kernel-AT-vger.kernel.org|
|| ||Article, Thread
(Cc:-ed more interested parties)
* Stijn Devriendt <firstname.lastname@example.org> wrote:
> Hi Ingo, Peter, all,
> The attached patch is a prototype for a new system call which allows
> threads to wait for other threads being blocked.
> Its main use is to allow threading libraries to resume executing more
> CPU-bound work when one of its threads is blocked while not having to
> over-allocating threads in a normal situation.
> Benefit over asynchronous I/O is that a threadpool thread that
> performs asynchronous I/O might not have work enough in one item to
> keep the CPU busy during the whole asynchronous operation and that not
> all operations are async capable. Giving control back to the library
> through a thread waiting for the blocked one allows new workitems to
> be executed as long as the former is blocked.
> Code performing this wait could look like:
> pid_t parent = ...;
> while (waitpid(parent, NULL, WNOHANG) != 0)
> if (sched_wait_block(parent, NULL) == 0)
> // do work, possibly notify threadpool manager
> // to start another thread blocked on this one
> // first
> Any feedback on the concept is much appreciated.
That is a ... rather interesting idea IMO.
Regarding the API and your patch, i think we can and should do something
different and more capable - while still keeping your basic idea:
Lets turn it all around on its head and add the capability to user-space
to observe the 'parallelism' of a workload (not limit it to the blocking
of a single task) and allow the poll()ing of that quantity - without
It should not be limited to a single task, and it should work with
existing syscall APIs - i.e. be fd based.
Incidentally we already have a syscall and a kernel subsystem that is
best suited to deal with such types of issues: perf events. I think we
can create a new, special performance event type that observes
task/workload (or CPU) parallelism:
With a 'parallelism_threshold' attribute. (which is '1' for a single
task. See below.)
And then we can use poll() in the thread manager task to observe PIDs,
workloads or full CPUs. The poll() implementation of perf events is fast
( Note: there's no need to actually _capture_ the events into the
ring-buffer - this is done by not mmap()-ing the fd. I.e. we'll just
have a pure poll() wakeup overhead and no tracing overhead. )
The semantics are basically that we are observing task
schedule/unschedule events and keep a count and a threshold - and can
poll() on that. perf_event_attr can be used to inject a 'minimum
parallelism' threshold value (and maybe a 'max parallelism' value as
Events are emitted (and poll() returns) if the observed workload gets
'blocked' according to the parallelism threshold - i.e. if the number of
runnable tasks drops below the threshold.
This fits very nicely into the existing perf events API and we wouldnt
have to add a new syscall.
Usage is very simple and straightforward, and can happen on various
levels of 'observation detail':
- the new fd can be attached to a specific PID (like your syscall).
perf_event_attr::threshold == 1 means we get the semantics of your
sched_wait_block() system call. Note that poll() wont have to do a
PID lookup (as it is already attached) so it will be much faster than
- the new fd can be attached to a hieararchy of tasks and observe _all_
of the parallelism there. This has the advantage of not having to
track each thread in a pool of threads. (this is done via inherited
events, see include/linux/perf_event.h:perf_event_attr::inherit) In
this case a parallelism threshold value larger than 1 makes sense
too, to allow the workload to spread to a number of CPUs. On a 4-CPU
system if we set threshold==4 it means that we'll return from poll()
if the number of runnable tasks drops below 4.
- the new fd can be attached to a CPU - observing parallelism of a full
CPU without having to track all workloads. In this case threshold==1
means that we'll return from poll() if the last task on that CPU
schedules out - i.e. if the CPU becomes idle.
This would make a very powerful task queueing framework. It basically
allows a 'lazy' user-space scheduler, which only activates if the kernel
scheduler has run out of work.
What do you think?
to post comments)