sched: power aware scheduling

From:  Alex Shi <>
Subject:  [patch v6 0/21] sched: power aware scheduling
Date:  Sat, 30 Mar 2013 22:34:47 +0800
Message-ID:  <>

This patch set implements and completes the rough power aware scheduling
proposal.

The code is also available in this git tree: power-scheduling

The patch set defines a new policy, 'powersaving', that tries to pack tasks
at each sched group level. It can save a lot of power when the number of
tasks in the system is no more than the number of logical CPUs (LCPUs).

As mentioned in the power aware scheduling proposal, power aware
scheduling rests on 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched groups reduce cpu power consumption

The first assumption makes the performance policy take over scheduling
when any group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.
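
The difference between packing and spreading can be illustrated with a toy
model. This is only a sketch and not the kernel code: the function names,
the equal-sized groups, and the rule "fall back to spreading once there are
more tasks than LCPUs" are simplifying assumptions made for illustration.

```python
from typing import List


def place_tasks(n_tasks: int, n_groups: int, cpus_per_group: int,
                policy: str) -> List[int]:
    """Return how many tasks land in each sched group (toy model only)."""
    total_cpus = n_groups * cpus_per_group
    counts = [0] * n_groups

    # Assumed fallback: with more tasks than logical CPUs there is nothing
    # left to pack, so behave like the performance policy and spread.
    if policy == "performance" or n_tasks > total_cpus:
        for task in range(n_tasks):
            counts[task % n_groups] += 1
        return counts

    # powersaving: fill one group before waking up the next one.
    remaining = n_tasks
    for group in range(n_groups):
        counts[group] = min(remaining, cpus_per_group)
        remaining -= counts[group]
    return counts


def active_groups(counts: List[int]) -> int:
    """Number of groups that have at least one task."""
    return sum(1 for c in counts if c > 0)


if __name__ == "__main__":
    # 2 groups of 16 LCPUs each, 8 busy tasks.
    for policy in ("powersaving", "performance"):
        counts = place_tasks(8, 2, 16, policy)
        print(policy, counts, "active groups:", active_groups(counts))
```

With 8 tasks on 2 groups of 16 LCPUs, the powersaving branch keeps one
group completely idle, while the performance branch keeps both groups
active.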

Compared to the removed power balance, this power balance has the
following advantages:
1, simpler sys interface
	only 2 sysfs interfaces VS 2 interfaces for each LCPU
2, covers all cpu topologies
	effective on all domain levels VS only works on SMT/MC domains
3, less task migration
	mutually exclusive perf/power LB VS balancing power on top of balanced performance
4, considers system load thrashing
	yes VS no
5, considers transitory tasks
	yes VS no

BTW, like sched numa, Power aware scheduling is also a kind of cpu
locality oriented scheduling.

Thanks for the comments/suggestions from PeterZ, Linus Torvalds, Andrew Morton,
Ingo, Len Brown, Arjan, Borislav Petkov, PJT, Namhyung Kim, Mike
Galbraith, Greg, Preeti, Morten Rasmussen, Rafael etc.

Since the patch set packs tasks into fewer groups very well, I just show
some performance/power testing data here:
$ for ((i = 0; i < x; i++)); do while true; do :; done & done

On my SNB laptop with 4 cores * HT (the data is average Watts):
         powersaving     performance
x = 8    72.9482         72.6702
x = 4    61.2737         66.7649
x = 2    44.8491         59.0679
x = 1    43.225          43.0638

On an SNB EP machine with 2 sockets * 8 cores * HT (average Watts):
         powersaving     performance
x = 32   393.062         395.134
x = 16   277.438         376.152
x = 8    209.33          272.398
x = 4    199             238.309
x = 2    175.245         210.739
x = 1    174.264         173.603
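
To make the tables easier to compare, the relative saving of powersaving
over performance can be computed per row. The sketch below simply redoes
that arithmetic on the SNB EP numbers above; the helper name
saving_percent is made up for this example.

```python
# Average Watts from the SNB EP table: x -> (powersaving, performance)
snb_ep = {
    32: (393.062, 395.134),
    16: (277.438, 376.152),
    8: (209.33, 272.398),
    4: (199.0, 238.309),
    2: (175.245, 210.739),
    1: (174.264, 173.603),
}


def saving_percent(powersaving: float, performance: float) -> float:
    """Power saved by powersaving, as a percentage of the performance value."""
    return (performance - powersaving) / performance * 100.0


savings = {x: saving_percent(ps, pf) for x, (ps, pf) in snb_ep.items()}

if __name__ == "__main__":
    for x, pct in savings.items():
        print(f"x = {x:<3} saving = {pct:6.1f}%")
```

The saving peaks at roughly 26% for x = 16, where the tasks fit into one
socket, and is within measurement noise (below 1% either way) for x = 32
and x = 1.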

A benchmark whose number of tasks keeps fluctuating, 'make -j <x> vmlinux',
on my SNB EP machine with 2 sockets * 8 cores * HT:
         powersaving              performance
x = 2    189.416 /228 23          193.355 /209 24
x = 4    215.728 /132 35          219.69 /122 37
x = 8    244.31 /75 54            252.709 /68 58
x = 16   299.915 /43 77           259.127 /58 66
x = 32   341.221 /35 83           323.418 /38 81

How to read the data, e.g. 189.416 /228 23:
	189.416: average Watts during compilation
	228: seconds (compile time)
	23:  scaled performance/watt = 1000000 / seconds / watts
The kbuild performance is better with 16/32 threads under the powersaving
policy. That is because lazy power balance reduces context switches and
the CPU has more chances to boost under powersaving balance.
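
The scaled performance/watt column can be recomputed from the other two
values. The sketch below assumes the result is truncated to an integer,
which is what reproduces the numbers in the table; the function name is
made up for this example.

```python
def scaled_perf_per_watt(watts: float, seconds: float) -> int:
    """1000000 / seconds / watts, truncated to an integer (assumed)."""
    return int(1_000_000 / seconds / watts)


# (average Watts, compile seconds) from the kbuild table.
kbuild = {
    "powersaving": {2: (189.416, 228), 4: (215.728, 132), 8: (244.31, 75),
                    16: (299.915, 43), 32: (341.221, 35)},
    "performance": {2: (193.355, 209), 4: (219.69, 122), 8: (252.709, 68),
                    16: (259.127, 58), 32: (323.418, 38)},
}

if __name__ == "__main__":
    for policy, rows in kbuild.items():
        for x, (watts, seconds) in rows.items():
            print(policy, f"x = {x}", scaled_perf_per_watt(watts, seconds))
```

For example, 1000000 / 228 / 189.416 is about 23.2, which gives the 23
shown for powersaving at x = 2.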

Some performance testing results:

Tested benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
hackbench, fileio-cfq of sysbench, dbench, aiostress, and multi-threaded
loopback netperf, on my core2, nhm, wsm and snb platforms.

A, no clear performance change found with the 'performance' policy.
B, specjbb2005 drops 5~7% with the powersaving policy, with either
   openjdk or jrockit.
C, hackbench drops 40% with the powersaving policy on snb 4-socket platforms.
Others show no clear change.

V6 change:
a, remove 'balance' policy.
b, consider RT task effect in balancing
c, use avg_idle as burst wakeup indicator
d, balance on task utilization in fork/exec/wakeup.
e, no power balancing on SMT domain.

V5 change:
a, change sched_policy to sched_balance_policy
b, split fork/exec/wake power balancing into 3 patches and refresh
commit logs
c, other minor cleanups

V4 change:
a, fix a few bugs and clean up code according to comments from Morten
Rasmussen, Mike Galbraith and Namhyung Kim. Thanks!
b, take Morten Rasmussen's suggestion to use different criteria for
different policies in transitory task packing.
c, shorter latency in power aware scheduling.

V3 change:
a, use nr_running and utilisation in periodic power balancing.
b, try packing small exec/wake tasks on a running cpu rather than an idle cpu.

V2 change:
a, add lazy power scheduling to deal with kbuild-like benchmarks.

--
Thanks
Alex

[patch v6 01/21] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[patch v6 02/21] sched: set initial value of runnable avg for new
[patch v6 03/21] sched: only count runnable avg on cfs_rq's
[patch v6 04/21] sched: add sched balance policies in kernel
[patch v6 05/21] sched: add sysfs interface for sched_balance_policy
[patch v6 06/21] sched: log the cpu utilization at rq
[patch v6 07/21] sched: add new sg/sd_lb_stats fields for incoming
[patch v6 08/21] sched: move sg/sd_lb_stats struct ahead
[patch v6 09/21] sched: scale_rt_power rename and meaning change
[patch v6 10/21] sched: get rq potential maximum utilization
[patch v6 11/21] sched: detect wakeup burst with rq->avg_idle
[patch v6 12/21] sched: add power aware scheduling in fork/exec/wake
[patch v6 13/21] sched: using avg_idle to detect bursty wakeup
[patch v6 14/21] sched: packing transitory tasks in wakeup power
[patch v6 15/21] sched: add power/performance balance allow flag
[patch v6 16/21] sched: pull all tasks from source group
[patch v6 17/21] sched: no balance for prefer_sibling in power
[patch v6 18/21] sched: add new members of sd_lb_stats
[patch v6 19/21] sched: power aware load balance
[patch v6 20/21] sched: lazy power balance
[patch v6 21/21] sched: don't do power balance on share cpu power